monit-general

Re: monit doesn't run stop action


From: Marc Rossi
Subject: Re: monit doesn't run stop action
Date: Tue, 5 Mar 2019 15:11:42 -0600

Yeah, I was looking through the code and saw the call that checks whether the process is running before issuing stop (ProcessTree_findProcess), so that was the only thought I had as well.

check process foo matching /usr/local/bin/foo.py
      start program = "/bin/bash -l -c 'nohup /usr/local/bin/foo.py &'" as uid "nobody"
      stop program = "/usr/bin/pkill -u nobody -f /usr/local/bin/foo.py" as uid "nobody"
      if uptime > 11 hours then alert
      if uptime > 12 hours then exec "/usr/bin/pkill -u nobody -f -9 /usr/local/bin/foo.py" as uid "nobody"
      if 2 restarts within 3 cycles then timeout
      group apps
      depends foo.py

check process bar matching ^/usr/local/bin/bar
      start program = "/bin/bash -lc 'HOME=/home/someuser nohup /usr/local/bin/bar.sh > /tmp/bar-startup.out 2>&1 &'"
      stop program = "/bin/bash -c '/usr/bin/pkill -f ^/usr/local/bin/bar; sleep 1; /usr/bin/pkill -f ^/usr/local/bin/bar'"
      onreboot nostart
      if uptime > 12 hours then exec "/usr/bin/pkill -9 -f ^/usr/local/bin/bar"
      group apps
      mode passive

Here are the logs from yesterday and today for "bar":

[CST Mar  1 15:15:01] info     : 'bar' stop action done
[CST Mar  4 07:02:01] info     : 'bar' start on user request
[CST Mar  4 07:02:01] info     : 'bar' start action done
[CST Mar  4 07:02:01] error    : 'bar' uptime test failed for /usr/local/bin/bar-- current uptime is 259177 seconds
<we get the above since it failed to shut down on 3/1>
[CST Mar  4 07:02:01] info     : 'bar' exec: '/usr/bin/pkill -9 -f /usr/local/bin/bar'
[CST Mar  4 07:02:21] error    : 'bar' process is not running
<above line repeats every 20 seconds until we manually start it via monit>
[CST Mar  4 07:51:11] info     : 'bar' start: '/bin/bash -lc HOME=/home/someuser nohup /usr/local/bin/bar.sh > /tmp/bar-startup.out 2>&1 &'
[CST Mar  4 07:51:11] info     : 'bar' start action done
[CST Mar  4 07:51:11] info     : 'bar' process is running with pid 4897
[CST Mar  4 07:51:11] info     : 'bar' uptime test succeeded [current uptime = 1 seconds]
[CST Mar  4 15:15:01] info     : 'bar' stop on user request
[CST Mar  4 15:15:01] info     : 'bar' stop action done
<the same thing repeats itself the following morning, below>
[CST Mar  5 07:02:01] info     : 'bar' start on user request
[CST Mar  5 07:02:01] info     : 'bar' start action done
[CST Mar  5 07:02:01] error    : 'bar' uptime test failed for /usr/local/bin/bar-- current uptime is 83451 seconds
[CST Mar  5 07:02:01] info     : 'bar' exec: '/usr/bin/pkill -9 -f /usr/local/bin/bar'

Thanks again for looking. Worst case, I'll just build a debug version of monit with some extra logging to see what is going on.



On Tue, Mar 5, 2019 at 2:40 PM address@hidden <address@hidden> wrote:
Hi,

Can you please add the configuration of the "foo" and "bar" services?

There are for example these possible reasons:

1.) the "bar" service is a process and monit detected that the process is not running; in that case monit takes a fast path and skips the stop program, since there is nothing to stop

2.) there was a problem when "check program" was used in combination with the "every" statement; this was fixed in monit 5.25.3: https://bitbucket.org/tildeslash/monit/issues/759
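
The fast path in point 1 would also explain a log that shows 'stop action done' with no preceding 'stop:' line. A minimal Python sketch of that behavior follows. It is only a model of what is described above, not monit's actual code; the function name `stop_service`, its parameters, and any command lines passed to it are hypothetical.

```python
import re
from typing import Callable, Iterable, List


def stop_service(
    name: str,
    pattern: str,
    stop_program: str,
    running_cmdlines: Iterable[str],
    run: Callable[[str], None],
    log: List[str],
) -> bool:
    """Model of a stop request with a 'process not running' fast path.

    Returns True if the stop program was executed, False if it was skipped.
    """
    log.append(f"'{name}' stop on user request")

    # Assumption: the service counts as running only if some command line
    # in the process list matches the "matching" pattern.
    is_running = any(re.search(pattern, cmd) for cmd in running_cmdlines)

    if is_running:
        # Normal path: the stop program is logged and executed.
        log.append(f"'{name}' stop: '{stop_program}'")
        run(stop_program)

    # The final line is logged on both paths, so a skipped stop
    # leaves no 'stop:' line behind.
    log.append(f"'{name}' stop action done")
    return is_running
```

Calling it with a process list in which nothing matches the pattern produces only the 'stop on user request' and 'stop action done' lines, which is the same shape as the "bar" excerpt.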

Best regards,
Martin


On 5 Mar 2019, at 16:24, Marc Rossi <address@hidden> wrote:

I'm looking through the source right now, but figured I'd throw this out to the list to see if it's something obvious I'm doing wrong.

I'm a long-time monit user, but on a few of our apps we have recently been having problems with the shutdown action possibly not running.

For the app that DOES shut down properly, the logs show the following:

[CST Mar  4 17:00:02] info     : 'foo' stop on user request
[CST Mar  4 17:00:02] info     : Monit daemon with PID 17733 awakened
[CST Mar  4 17:00:02] info     : Awakened by User defined signal 1
[CST Mar  4 17:00:02] info     : 'foo' stop: '/usr/bin/pkill -u nobody -f /usr/local/bin/foo.py'
[CST Mar  4 17:00:02] info     : 'foo' stop action done

For the app that is not stopping properly, the logs show the following:

[CST Mar  4 15:15:01] info     : 'bar' stop on user request
[CST Mar  4 15:15:01] info     : Monit daemon with PID 17733 awakened
[CST Mar  4 15:15:01] info     : Awakened by User defined signal 1
[CST Mar  4 15:15:01] info     : 'bar' stop action done

This could be a red herring, but where is the 'stop:' line in the second log excerpt? The shutdown commands do differ between foo and bar, but I would still expect to see the stop command listed.

TIA
Marc

--
To unsubscribe:
https://lists.nongnu.org/mailman/listinfo/monit-general
