Re: Monit triggering restart storm

Hi,

If the start/stop methods are the same ("/bin/systemctl [start|stop] myservice"), then the solution is to make every 'check program' and 'check file' depend on the parent 'check process'.

If the dependent checks need to restart the parent process, they should do so via "... exec /usr/bin/monit restart myprocess". Unfortunately, going through exec and the monit CLI is necessary, as there is currently no direct start/stop/restart action that can pass the action to another check by name.

If the parent process fails (for example, the process is not running or a port check failed), the dependent checks will be aware of the parent restart and won't trigger another restart.

Example:

--8<--

check process myprocess matching "foobar"
    start program = "/bin/systemctl start myservice"
    stop program = "/bin/systemctl stop myservice"
    if does not exist for 5 cycles then start
    if failed port XXXX for 6 times within 8 cycles then restart
    if failed port YYYY for 6 times within 8 cycles then restart
    if failed port ZZZZ for 6 times within 8 cycles then restart

check program myprocess_collector with path "/usr/bin/collect_report_from_myprocess.sh"
    if status != 0 for 5 times within 10 cycles then exec "/usr/bin/monit restart myprocess"
    depends on myprocess

....

# The content test belongs to 'check file' (as in the original question);
# the path here repeats the collector script, so point it at the real log file.
check file myprocess_log with path "/usr/bin/collect_report_from_myprocess.sh"
    if content = "BIG ERROR" then exec "/usr/bin/monit restart myprocess"
    depends on myprocess

--8<--

Best regards,

Martin

On 9 Nov 2017, at 12:07, Guillaume François <address@hidden> wrote:

Hi,

I have a bunch of Monit rules that perform checks on a service:

- One check process rule (existence and port checks):
    does not exist for 5 cycles then start
    failed port XXXX for 6 times within 8 cycles then restart
    failed port YYYY for 6 times within 8 cycles then restart
    failed port ZZZZ for 6 times within 8 cycles then restart
- Three check program rules with custom checks:
    if status != 0 for 5 times within 10 cycles then restart
    if status != 0 for 5 times within 10 cycles then restart
    if status != 0 for 5 times within 10 cycles then restart
- One rule to check log content:
    check file + if content = "BIG ERROR" then restart
- The start/stop rules are:

start program = "/bin/systemctl start myservice"
stop program = "/bin/systemctl stop myservice"

There is no dependency at the Monit level, but the checks are all part of the same groups.

The problem is that, due to multiple issues, I got a "restart" storm:

- some port check failed -> restart issued
- that led to an error in a custom script -> restart issued
- reading the log content lags behind -> restart issued
- MyService or the systemd configuration/features are not well designed, so I got an "already bind exception" as systemd tried to start several instances at the same time 🤔

So the port check failed again, systemd killed the wrong instance, MyService was blocked, restart again, etc.

I had to shut down Monit to prevent further actions (I could also have used "monit -g group unmonitor"), kill every instance of my service, start it correctly, then reactivate Monit.

Questions:
- Is there a native way to prevent Monit from issuing the same start/stop commands within a defined time frame?
- Could Monit's dependency feature between checks help? I don't see how it would.
- Any other hints/proposals (aside from increasing the values of "for N times within T cycles" to delay the restart)?
Remark: maybe exploring the systemd features StartLimitIntervalSec & StartLimitBurst could help.
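
For illustration, a minimal sketch of such a systemd start limit, assuming a drop-in file for myservice. The file name and the values are hypothetical, not taken from a real setup:

--8<--
# /etc/systemd/system/myservice.service.d/start-limit.conf  (hypothetical drop-in name)
[Unit]
# Illustrative values: allow at most 3 start attempts within 600 seconds
StartLimitIntervalSec=600
StartLimitBurst=3
--8<--

After a "systemctl daemon-reload", systemd should refuse further start requests once the burst is exceeded within the interval, and "systemctl reset-failed myservice" clears the counter. On older systemd releases the interval option was, as far as I know, spelled StartLimitInterval= and placed in [Service], so the installed version is worth checking first.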

Best Regards.

From:	address@hidden
Subject:	Re: Monit triggering restart storm
Date:	Thu, 9 Nov 2017 13:14:48 +0100