Thanks for the swift response, and that's great to hear. Based on
your experience, do you think this is likely to take weeks or months
before being available?
On 22/06/2015 20:13, Martin Pala wrote:
the refactoring of the test scheduler mentioned in
the manual with fix for program execution already begun.
I'd like to query the rationale for a behaviour I'm
experiencing in Monit. I'm testing with the following configuration:
# Test config start
set daemon 10
check program MyProgram with path "/bin/dash -c 'echo OK!;
every "06 * * * *"
if status != 0 then alert
# Test config end
As expected, Monit runs the dash test program at 6 minutes
past the hour. The dash script finishes immediately.
However, Monit doesn't pick up, report or alert on the
exit code in a timely manner. Until the next time Monit is
scheduled to run the test script, the dash script remains
as a zombie. But that is an hour later, which is a long
time to wait to be alerted to the script failing.
If the 'every' schedule were "06 0 * * *" then it would
seem one should expect to wait 24 hours before being
alerted to the script failing!
I realise the Monit manual explains:
"The asynchronous nature of the program check [...] comes
with a side-effect: when the program has finished
executing and is waiting for Monit to collect the result,
it becomes a so-called "zombie" process [...] the zombie
process is removed from the system as soon as Monit
collects the exit status. This means that every "check
program" will be associated with either a running process
or a temporary zombie. This unwanted zombie side-effect
will be removed in a later release of Monit."
That may be so, but why doesn't Monit reap the child
and collect the exit code at the *next poll cycle after
the child exits* (i.e. within 10 seconds of the test
script finishing given the 'set daemon 10' line in the
test config above) rather than when the program is next
scheduled to be run? Maybe I'm missing something, but the
current behaviour seems to undermine the entire purpose of
providing alerts on program failure (when used in
conjunction with cron-style scheduling). That is the
behaviour I'd like to query the rationale for.
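To make the behaviour I'm asking about concrete, here is a rough sketch in Python of what I mean by reaping at the next poll cycle. This is not Monit's code and I'm not claiming it reflects how Monit is implemented; the function name run_and_reap and its parameters are made up for illustration, and the short poll_interval simply stands in for the 'set daemon 10' cycle. The only point is that a non-blocking status check on every cycle would collect the exit code (and clear the zombie) shortly after the child exits, independently of the 'every' schedule.

```python
import subprocess
import time


def run_and_reap(argv, poll_interval=0.1, max_cycles=50):
    """Start argv once, then check for its exit status on every poll cycle.

    Hypothetical helper for illustration only. Returns (cycle, status),
    where cycle is the poll cycle on which the exit status was collected.
    """
    child = subprocess.Popen(argv)
    for cycle in range(1, max_cycles + 1):
        time.sleep(poll_interval)   # stands in for the daemon poll interval
        status = child.poll()       # non-blocking: returns None while still running,
                                    # otherwise reaps the child and returns its status
        if status is not None:
            return cycle, status
    # Give up: terminate the child and reap it so no zombie is left behind.
    child.kill()
    child.wait()
    raise TimeoutError("child did not exit within max_cycles")


if __name__ == "__main__":
    cycle, status = run_and_reap(["/bin/sh", "-c", "exit 3"])
    print(f"exit status {status} collected on poll cycle {cycle}")
    if status != 0:
        print("alert: program failed")
```

With something along these lines, the delay before an alert would be bounded by the poll interval rather than by the cron-style schedule.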
Thanks in advance.