
Re: Monit triggering restart storm


From: address@hidden
Subject: Re: Monit triggering restart storm
Date: Thu, 9 Nov 2017 13:14:48 +0100

Hi,

If the start/stop methods are the same for every check ("/bin/systemctl [start|stop] myservice"), then the solution should be to make all the 'check program' and 'check file' services depend on the parent 'check process' service.

If the dependent checks need to restart the parent process, they should do so via "... exec /usr/bin/monit restart myprocess". (Unfortunately, the Monit CLI has to be invoked through exec, as there is currently no direct start/stop/restart action that can be passed to another check by name.)

If the parent process fails (for example, the process is not running or a port check fails), the dependent checks will be aware of the parent restart and won't trigger another one.


Example:

--8<--
check process myprocess matching "foobar"
    start program = "/bin/systemctl start myservice"
    stop program = "/bin/systemctl stop myservice"
    if does not exist for 5 cycles then start
    if failed port XXXX for 6 times within 8 cycles then restart
    if failed port YYYY for 6 times within 8 cycles then restart
    if failed port ZZZZ for 6 times within 8 cycles then restart

check program myprocess_collector with path "/usr/bin/collect_report_from_myprocess.sh"
    if status != 0 for 5 times within 10 cycles then exec "/usr/bin/monit restart myprocess"
    depends on myprocess

....

# content tests need a 'check file' service; the path below is a placeholder for the real log file
check file myprocess_log with path "/path/to/myprocess.log"
    if content = "BIG ERROR" then exec "/usr/bin/monit restart myprocess"
    depends on myprocess
--8<--
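
A restart limit on the parent check might additionally bound a storm if one starts anyway. A minimal sketch, assuming Monit's restart-count test ("if N restarts within M cycles then unmonitor"); the thresholds 3 and 15 are illustrative and not taken from the original setup:

--8<--
check process myprocess matching "foobar"
    start program = "/bin/systemctl start myservice"
    stop program = "/bin/systemctl stop myservice"
    if does not exist for 5 cycles then start
    # illustrative thresholds: stop monitoring after 3 restarts within 15 cycles
    if 3 restarts within 15 cycles then unmonitor
--8<--

Once the limit is hit, Monit stops monitoring myprocess until it is re-enabled with "monit monitor myprocess", which leaves time to fix the service by hand first.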


Best regards,
Martin



On 9 Nov 2017, at 12:07, Guillaume François <address@hidden> wrote:

Hi,

I have a bunch of Monit rules that perform checks on a service:
  1. One check process rule (existence and port checks)
    1. does not exist for 5 cycles then start
    2. failed port XXXX for 6 times within 8 cycles then restart
    3. failed port YYYY for 6 times within 8 cycles then restart
    4. failed port ZZZZ for 6 times within 8 cycles then restart
  2. Three check program rules with custom checks
    1. if status != 0 for 5 times within 10 cycles then restart
    2. if status != 0 for 5 times within 10 cycles then restart
    3. if status != 0 for 5 times within 10 cycles then restart
  3. One rule to check the log content
    1. check file + if content = "BIG ERROR" then restart
The start/stop rules are:

start program = "/bin/systemctl start myservice"
stop program = "/bin/systemctl stop myservice"

There are no dependencies at the Monit level, but the checks all belong to the same set of groups.

The problem is that, due to multiple issues, I got a "restart" storm:
  1. some port check failed -> restart issued
  2. this led to an error in a custom script -> restart issued
  3. the log content check lagged behind -> restart issued
Either MyService or its systemd configuration is not well designed, so I got an "already bind exception" as systemd tried to start several instances at the same time 🤔

So the port check failed again, systemd killed the wrong instance, MyService was blocked, another restart was issued, and so on.

I had to shut down Monit to prevent further actions (I could also have used "monit -g group unmonitor"), kill every instance of my service, start it correctly, then reactivate Monit.


Questions:
  • Is there a native way to prevent Monit from issuing the same start/stop commands within a defined time frame?
  • Could Monit's dependency feature between checks help? I don't see how it would.
  • Any other hints or proposals (aside from increasing the values of "for N times within T cycles" to delay the restart)?
Remark: maybe exploring the systemd options StartLimitIntervalSec & StartLimitBurst could help.


Best Regards.
--
To unsubscribe:
https://lists.nongnu.org/mailman/listinfo/monit-general

