Thanks, the root cause is clear now.
It seems that your configuration (most probably the apache check) uses the pattern based process check and the system check is most probably defined behinf the apache in monitrc. When you do restart of such service, monit waits for it to start and refreshes the process list to see whether it started. The process list refresh also refreshes the cpu usage statistics - this happens every 5 milliseconds until the process starts or the action times out.
The CPU usage reported by monit (for example system 50%) is thus true but the value comes from very short timeframe (cpu usage from last 5 milliseconds) instead of full cycle (for example cpu usage from last 5 seconds) . If the system check will be defined first (in front of apache check), this won't happen, as it will take the initial values (from the cycle start) before the apache action occurred.
=> it is bug limited to specific configuration: 1.) "check process myproc matching …" is used 2.) "check system" is defined after the myproc check 3.) the myproc service is restarted
Workaround: move the "check system" ahead of other services in your monit configuration file
We'll fix the problem.
Thanks for help with the testing and data :)
Regards, Martin
On Dec 8, 2011, at 2:03 PM, Lawrence, Wayne wrote: Hi Martin,
did as you instructed here is the output.
From my untrained eye there is some serious miscalculation in the 4th CPUDEBUG statement not a clue how it arrives at that figure.
CPUDEBUG: used_system_memory_sysdep: time=1323349102: cpu_user=293199 (-1.00%), cpu_nice=547, cpu_syst=194433 (-1.00%), cpu_idle=58991209, cpu_wait=31605 (-1.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59512589 -- old_cpu_user=0, old_cpu_syst=0, old_cpu_wait=0, old_cpu_total=0
CPUDEBUG: check_system: time=1323349102: systeminfo.total_cpu_user_percent=-1.00%, systeminfo.total_cpu_syst_percent=-1.00%, systeminfo.total_cpu_wait_percent=-1.00% CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293227 (0.70%), cpu_nice=547, cpu_syst=194469 (0.90%), cpu_idle=58995131, cpu_wait=31606 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59516576 -- old_cpu_user=293199, old_cpu_syst=194433, old_cpu_wait=31605, old_cpu_total=59512589
CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293227 (-214748364.80%), cpu_nice=547, cpu_syst=194469 (-214748364.80%), cpu_idle=58995131, cpu_wait=31606 (-214748364.80%), cpu_irq=153, cpu_softirq=1990, cpu_total=59516576 -- old_cpu_user=293227, old_cpu_syst=194469, old_cpu_wait=31606, old_cpu_total=59516576
CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293229 (0.00%), cpu_nice=547, cpu_syst=194473 (100.00%), cpu_idle=58995132, cpu_wait=31606 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59516583 -- old_cpu_user=293229, old_cpu_syst=194472, old_cpu_wait=31606, old_cpu_total=59516582
CPUDEBUG: check_system: time=1323349142: systeminfo.total_cpu_user_percent=0.00%, systeminfo.total_cpu_syst_percent=100.00%, systeminfo.total_cpu_wait_percent=0.00% CPUDEBUG: used_system_memory_sysdep: time=1323349202: cpu_user=293307 (0.90%), cpu_nice=547, cpu_syst=194542 (0.90%), cpu_idle=59001021, cpu_wait=31610 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59522623 -- old_cpu_user=293252, old_cpu_syst=194483, old_cpu_wait=31606, old_cpu_total=59516616
CPUDEBUG: check_system: time=1323349202: systeminfo.total_cpu_user_percent=0.90%, systeminfo.total_cpu_syst_percent=0.90%, systeminfo.total_cpu_wait_percent=0.00%
Regards
Wayne
On 8 December 2011 12:50, Martin Pala <address@hidden> wrote:
Hi,
thanks for update. I have prepared the debug version, which logs the values computed based on /proc/stat right when they are ready and once again before the values are checked, so we can see whether the values were read+computed correctly and whether no memory corruption occurred before they were compared by the validation engine => there are two "CPUDEBUG" log entries per cycle.
To compile:
tar -xzf monit-5.3.1p2.tar.gz
cd monit-5.3.1p2
./configure
make
Then stop existing monit instance and run new monit binary:
./monit -vI 2>&1 | grep CPUDEBUG
after you'll replicate the problem, terminate monit with ^C and send the whole CPUDEBUG output since monit start
Regards,,
Martin
On Dec 8, 2011, at 11:39 AM, Lawrence, Wayne wrote:
Hi Martin just as a side note here i disabled the cpu ssystem test and tried again and it seems that the issue is present with all the cpu monitoring/
I used the restarting of httpd as i knew it would trigger and alert and these were the results.
Date: Thu, 08 Dec 2011 10:27:59
Action: alert
Host: <hostname removed>
Description: cpu user usage of 100.0% matches resource limit [cpu user usage>70.0%]
I ran vmstat 1 10 at the same time as you can see its the 4th line.
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 0 0 0 739220 142536 973532 0 0 4 7 10 6 0 0 99 0 0
0 0 0 739088 142536 973532 0 0 0 0 114 160 0 1 99 0 0 3 0 0 739088 142536 973536 0 0 0 0 126 169 1 2 97 0 0 0 0 0 737336 142536 973544 0 0 0 168 721 796 35 14 50 1 0
1 0 0 736964 142536 973544 0 0 0 0 109 160 1 1 98 0 0
and just to make it a little simpler i ran sar 1 10 as well as it is more human readable.
10:27:55 CPU %user %nice %system %iowait %steal %idle 10:27:56 all 1.01 0.00 1.01 0.00 0.00 97.98 10:27:57 all 0.00 0.00 1.00 0.00 0.00 99.00
10:27:58 all 3.96 0.00 3.96 0.00 0.00 92.08 10:27:59 all 32.00 0.00 12.00 1.00 0.00 55.00
Something struck me as odd while testing this yesterdays results reporting 50% system usage from 15.84% actual means the reported usage is 3.2 times the actual. todays reported user usage of 100% is 3.2 times the actual 32%. so it seems just need to work out why it is multiplying the results.
Regards
Wayne
On 7 December 2011 11:43, Lawrence, Wayne <address@hidden> wrote:
Hi Martin,
I downloaded the source from the Monit website and compiled it on the server.
I have started monit in verbose mode and this is the relevant information it outputs when the event occurs.
cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
------------------------------------------------------------------------------- ../tools/bin/monit() [0x41a533] ../tools/bin/monit(LogError+0x9f) [0x41ad2f]
../tools/bin/monit(Event_post+0x328) [0x417ba8] ..t/tools/bin/monit() [0x428071] ../tools/bin/monit(check_system+0x2b) [0x4285bb] ../tools/bin/monit(validate+0x226) [0x42ad16] ../tools/bin/monit() [0x41422d]
../tools/bin/monit(main+0x511) [0x4149e1] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3592c1ecdd] ../tools/bin/monit() [0x40b179] -------------------------------------------------------------------------------
Unfortunately remote access is not an option but I will happily run a debug version to try and track down this problem as I really would like to use Monit for my current build.
Regards
Wayne
Thanks for data.
The /proc/stat format is this:
cpu <user> <nice> <system> <idle> <wait> <irq> <softirq>
The values count the cpu cycles, so if we subtract the corresponding values from your output, we get this:
user nice system idle wait irq softirq | total
09:57:35 1 0 1 99 0 0 0 | 101
09:57:36 1 0 0 98 0 0 0 | 99
09:57:37 25 0 16 59 1 0 0 | 101
09:57:38 1 0 2 98 0 0 0 | 101
=> at 09:57:37 the cpu usage was:
user = 24.75%
system = 15.84%
wait = 0.99%
This corresponds to the previous vmstat output. Monit counts the cpu usage the same way as above and doesn't modify these values => your monit really reports strange cpu usage (reported 50% vs. real ~ 16%).
What's the origin of your monit binary? Did you compile it from original source code or some 3rd party source code distibution? (such as RHEL or Fedora repository). Or do you use the pre-compiled binaries from www.mmonit.com? Or some 3rd party binary, patches or source code from other site?
Please can you try to run monit in verbose mode and provide full output?:
1.) stop monit
2.) run monit in foreground with verbose mode enabled:
./monit -vI
3.) after the problem happens, stop monit with "^C" and send output
I can also prepare debug version which will dump the cpu usage related informations or if you can provide remote access to the system, i can troubleshoot the problem remotely.
Regards,
Martin
On Dec 7, 2011, at 11:07 AM, Lawrence, Wayne wrote:
Hi Martin,
this is the output of the commands you requested.
1.) uname -m
x86_64
ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, not stripped
I ran the command you supplied to get the cup usage directly as well while restarting the httpd service as i know this will generate an alert.
Date: Wed, 07 Dec 2011 09:57:37
Action: exec
Host: <hostname removed>
Description: cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
Wed Dec 7 09:57:34 GMT 2011 cpu 207060 501 103542 49452254 25303 83 1569 0 0 Wed Dec 7 09:57:35 GMT 2011 cpu 207061 501 103543 49452353 25303 83 1569 0 0 Wed Dec 7 09:57:36 GMT 2011 cpu 207062 501 103543 49452451 25303 83 1569 0 0
Wed Dec 7 09:57:37 GMT 2011 cpu 207087 501 103559 49452510 25304 83 1569 0 0 Wed Dec 7 09:57:38 GMT 2011 cpu 207088 501 103561 49452608 25304 83 1569 0 0 Wed Dec 7 09:57:40 GMT 2011
If my understanding of /proc/stat is coreect this still doesnt make any sense but i may be wrong.
Regards
Wayne
On 7 December 2011 09:37, Martin Pala <address@hidden> wrote:
Please can you check that your monit binary matches the system architecture? (i.e. for example 64-bit monit binary on 64-bit system - not 32-bit monit on 64-bit system)
To verify provide please the output of following commands: 1.) uname -m
2.) file `which monit`
Monit takes the statistics from the /proc/stat kernel interface. You can collect the statistics manually like this - for example to fetch the state in 1 second intervals (30 samples):
$ for ((i=0; i<30; i++)); do date; grep "cpu " /proc/stat; sleep 1; done
Note: monit takes the first /proc/stat line ("cpu") which contains the overall cpu usage in the system (summary of all cpus). The /proc/stat also contains per-cpu statistics if you want to collect all the statistics, replace the "grep 'cpu '" simply with "cat".
Regards,
Martin
On Dec 7, 2011, at 10:04 AM, Lawrence, Wayne wrote:
Hi Martin,
I have tried various methods to dientify the cause of this and took your advice and used vmstat. I simply restarted the httpd process from the monit web interface while the comand was running and got the following warning.
Description: cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
But vmstat doesnt show that level of usage at the point of alert. As you can see there is some usage in the 3rd line of the output when i restarted the httpd service but it doesnt seem enough to trigger an alert.
vmstat 1 10 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 0 0 0 859596 114684 856908 0 0 4 6 81 77 0 0 99 0 0
0 0 0 859448 114684 856916 0 0 0 0 100 94 1 0 99 0 0 0 0 0 898352 114692 815600 0 0 0 168 555 605 23 15 61 1 0
Not sure if there are any other tests i can run to narrow this down a bit further as it still isn't making sense.
Regards
Wayne
On 7 December 2011 08:27, Martin Pala <address@hidden> wrote:
Hi Lawrence,
the test which triggers the alert is "system" cpu => it's the time the system spend in kernel mode. The cpu usage could be triggered by some background kernel task, to verify the monit report matches the system cpu usage, you should use either "vmstat" or "top" instead of "ps".
Best regards,
Martin
On Dec 6, 2011, at 1:19 PM, Lawrence, Wayne wrote:
Hi Igor,
the operating system is RHEL6 and monit version is 5.3.1
this is what i have in my config
if cpu usage (user) > 70% then alert if cpu usage (system) > 30% then alert if cpu usage (wait) > 20% then alert
this is one of the errors
Description: cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
this is what i get in /var/log/messages
Dec 6 12:01:29 <hostname-removed> monit[864]: <hostname-removed> cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
Dec 6 12:02:29 <hostname-removed> monit[864]: <hostname-removed><hostname-removed>' cpu system usage check succeeded [current cpu system usage=0.9%]
this is the output of ps --no-headers -A -o "%cpu sz ucomm" | sort -k1nr | head -20
12:01:29 up 4 days, 20:24, 2 users, load average: 0.04, 0.01, 0.00 total used free shared buffers cached Mem: 2055108 1092176 962932 0 53156 811864
-/+ buffers/cache: 227156 1827952 Swap: 4128760 0 4128760 1.2 44308 perl 0.0 0 aio/0 0.0 0 async/mgr 0.0 0 ata/0 0.0 0 ata_aux 0.0 0 bdi-default 0.0 0 cpuset
0.0 0 crypto/0 0.0 0 events/0 0.0 0 ext4-dio-unwrit 0.0 0 flush-253:0 0.0 0 jbd2/dm-0-8 0.0 0 kacpi_hotplug 0.0 0 kacpi_notify 0.0 0 kacpid 0.0 0 kauditd
0.0 0 kblockd/0 0.0 0 kdmflush 0.0 0 khelper 0.0 0 khubd
Have to say i am at a total loss as there is no way the usage figures are accurate.
If there is any other info i can supply that will be useful please let me know.
Regards
Wayne
On 6 December 2011 12:03, Igor Homyakov <address@hidden> wrote:
Hi Lawrence,
Could you be a little bit more specific ? Please provide information about you operation system, monit version on which the problem
occurred and so on.
Regards Igor Homyakov
On Tue, Dec 6, 2011 at 15:35, Lawrence, Wayne < address@hidden> wrote: > Hi, > > I have a few CPU usage checks in my monitrc but it seems monit is
> misreporting the usage. > > I have run several tests and it seems that monit is multiplying the actual > usage by 10. > > I ran a process with top running in another shell and CPU usage for the user
> was never above 10% yet monit informed me that there was 100% cpu usage. > > I have tried various configurations including the one that came with the > default config for system cpu monitoring and all seem to demonstrate the
> same issue. > > Any advice welcomed on this > > Regards > > Wayne Lawrence
-- To unsubscribe: https://lists.nongnu.org/mailman/listinfo/monit-general
-- To unsubscribe: https://lists.nongnu.org/mailman/listinfo/monit-general
-- To unsubscribe: https://lists.nongnu.org/mailman/listinfo/monit-general
|