Re: Zombie processes and exit code retrieval

On 22 Jun 2015, at 22:42, Struan Bartlett <address@hidden> wrote:

Thanks for the swift response, and that's great to hear. Based on your experience, do you think this is likely to take weeks or months before being available?

On 22/06/2015 20:13, Martin Pala wrote:

Hi,

the refactoring of the test scheduler mentioned in the manual, with a fix for program execution, has already begun.

Regards,

Martin

On 22 Jun 2015, at 20:05, Struan Bartlett <address@hidden> wrote:

Hi

I'd like to query the rationale for a behaviour I've been experiencing in Monit. I'm testing with the following config:

# Test config start
set daemon 10

check program MyProgram with path "/bin/dash -c 'echo OK!; exit 1'"
every "06 * * * *"
if status != 0 then alert
# Test config end

As expected, Monit runs the dash test program at 6 minutes past the hour. The dash script finishes immediately. However, Monit doesn't pick up, report or alert on the exit code in a timely manner. The dash script remains a zombie until the next time Monit is scheduled to run the test script. But that is an hour later, which is a long time to wait to be alerted to the script failing.

If the 'every' schedule were "06 0 * * *" then it would seem one should expect to wait 24 hours before being alerted to the script failing!

I realise the Monit manual explains:

"The asynchronous nature of the program check [...] comes with a side-effect: when the program has finished executing and is waiting for Monit to collect the result, it becomes a so-called "zombie" process [...] the zombie process is removed from the system as soon as Monit collects the exit status. This means that every "check program" will be associated with either a running process or a temporary zombie. This unwanted zombie side-effect will be removed in a later release of Monit."

That may be so, but why doesn't Monit reap the child and collect the exit code at the *next poll cycle after the child exits* (i.e. within 10 seconds of the test script finishing, given the 'set daemon 10' line in the test config above) rather than when the program is next scheduled to run? Maybe I'm missing something, but the current behaviour seems to undermine the entire purpose of providing alerts on program failure (when used in conjunction with cron-style scheduling). That is the behaviour I'd like to query the rationale for.
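
For illustration only, here is a minimal sketch of the behaviour being asked about: a poll loop that tries a non-blocking reap of the child on every cycle, so the exit code is collected shortly after the child exits instead of at the next scheduled run. It is written in Python, not taken from Monit's source, and it makes no claim about how Monit's scheduler is or will be implemented. The names start_check, try_reap and poll_until_reaped are hypothetical, and POLL_INTERVAL is shortened from the 10 seconds of 'set daemon 10' so the demo finishes quickly.

```python
import os
import time

# Stands in for 'set daemon 10'; shortened so the demo finishes quickly.
POLL_INTERVAL = 0.1


def start_check(exit_code):
    """Fork a child that exits at once, like the dash test program."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: terminate immediately, no cleanup
    return pid               # parent: remember the child's pid


def try_reap(pid):
    """Non-blocking reap. Return the exit code, or None if still running."""
    reaped, status = os.waitpid(pid, os.WNOHANG)
    if reaped == 0:
        return None          # child has not exited yet, nothing to collect
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)  # killed by a signal: negative signal number


def poll_until_reaped(pid, max_cycles=50):
    """Check the child once per poll cycle, not once per scheduled run."""
    for cycle in range(1, max_cycles + 1):
        time.sleep(POLL_INTERVAL)
        code = try_reap(pid)
        if code is not None:
            return cycle, code   # zombie is gone, exit code is known
    return None, None            # gave up: child still running


if __name__ == "__main__":
    child = start_check(1)       # mirrors 'exit 1' in the test config
    cycle, code = poll_until_reaped(child)
    print("reaped in poll cycle", cycle, "with exit status", code)
    if code != 0:
        print("status != 0, an alert could be raised now")
```

With this shape, the time between the child exiting and its status being known is bounded by the poll interval rather than by the cron-style schedule.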

Thanks in advance.

Kind regards

Struan

--
To unsubscribe:
https://lists.nongnu.org/mailman/listinfo/monit-general

From:	Martin Pala
Subject:	Re: Zombie processes and exit code retrieval
Date:	Tue, 23 Jun 2015 11:16:09 +0200