monit-general

Re: checking ntpd


From: Mike Schmidt
Subject: Re: checking ntpd
Date: Sun, 17 Apr 2011 11:03:09 -0400

Thank you. I wasn't aware of the ntp3 protocol type; I guess I failed to read the manual with enough attention. I will change my config files as you suggest.

Looking at one of the sites where ntp is unmonitored, I noticed a few things [this is monit 5.2.4] (btw, monit should log its version in the log at startup):

1) The error log contains no information beyond the fact that it failed to start:

/var/log/messages.1:Apr 13 17:23:19 le-courrier-23 monit[2778]: Starting monit daemon with http interface at [le23.xxxx.com:2812]
/var/log/messages.1:Apr 13 17:23:19 le-courrier-23 monit[2778]: Monit start delay set -- pause for 60s
/var/log/messages.1:Apr 13 17:24:20 le-courrier-23 monit[2780]: Starting monit HTTP server at [le23.xxxx.com:2812]
/var/log/messages.1:Apr 13 17:24:20 le-courrier-23 monit[2780]: monit HTTP server started
/var/log/messages.1:Apr 13 17:24:20 le-courrier-23 monit[2780]: 'system' Monit started
/var/log/messages.1:Apr 13 17:24:57 le-courrier-23 monit[2780]: Cannot open a connection to the mailserver 'mailman.xxxx.com:25' -- Connection timed out
/var/log/messages.1:Apr 13 17:24:57 le-courrier-23 monit[2780]: No mail servers are available
/var/log/messages.1:Apr 13 17:24:57 le-courrier-23 monit[2780]: Aborting event
/var/log/messages.1:Apr 13 17:24:57 le-courrier-23 monit[2780]: M/Monit heartbeat started
/var/log/messages.1:Apr 13 17:24:57 le-courrier-23 monit[2780]: 'date-time' process is not running
/var/log/messages.1:Apr 13 17:25:41 le-courrier-23 monit[2780]: Cannot open a connection to the mailserver 'mailman.xxxx.com:25' -- Connection timed out
/var/log/messages.1:Apr 13 17:25:41 le-courrier-23 monit[2780]: No mail servers are available
/var/log/messages.1:Apr 13 17:25:41 le-courrier-23 monit[2780]: Aborting event
/var/log/messages.1:Apr 13 17:25:41 le-courrier-23 monit[2780]: 'date-time' trying to restart
/var/log/messages.1:Apr 13 17:25:41 le-courrier-23 monit[2780]: 'date-time' start: /sbin/service
/var/log/messages.1:Apr 13 17:30:43 le-courrier-23 monit[2780]: 'display0' process is not running
/var/log/messages.1:Apr 13 17:31:22 le-courrier-23 monit[2780]: Cannot open a connection to the mailserver 'mailman.xxxx.com:25' -- Connection timed out
/var/log/messages.1:Apr 13 17:31:22 le-courrier-23 monit[2780]: No mail servers are available
/var/log/messages.1:Apr 13 17:31:22 le-courrier-23 monit[2780]: Aborting event
/var/log/messages.1:Apr 13 17:31:22 le-courrier-23 monit[2780]: 'date-time' service restarted 2 times within 3 cycles(s) - unmonitor
/var/log/messages.1:Apr 13 17:31:58 le-courrier-23 monit[2780]: Cannot open a connection to the mailserver 'mailman.xxxx.com:25' -- Connection timed out
/var/log/messages.1:Apr 13 17:31:58 le-courrier-23 monit[2780]: No mail servers are available
/var/log/messages.1:Apr 13 17:31:58 le-courrier-23 monit[2780]: Aborting event

2) This happened on April 13, but the system reboots every night, and the service remains unmonitored. Does monit store information in a cache someplace so that unmonitoring can survive reboots? I would have thought that rebooting starts everything over again.

3) On Red Hat systems, ntpd does not stop when it is off by > 1000ms; it just keeps running, but will take days to catch up if it's more than a second off.

4) On Red Hat systems, if /etc/ntp/step-tickers contains an ntp server, the service script automatically calls ntpdate before starting ntp, so on bootup we're fine.

5) It looks like monit does not play well with reboots. I am sending another message to the user list about this.

However, in this particular case, I noticed that for once, ntp was not set to start automatically via chkconfig. But shouldn't monit have started it after the reboot?

On Sun, Apr 17, 2011 at 5:49 AM, Martin Pala <address@hidden> wrote:
Hi,

please can you send the monit log? It will show the reason why ntpd was restarted - whether the process died or the protocol test failed.

The repeated restarts could be caused by ntpd's behavior when the time difference is large (which may happen if the system was booted and the time was not set). If ntpd is started and the time difference is bigger than 1000s, ntpd usually exits. If monit is set to restart it, ntpd will be started again, but will also exit again. In that case it is necessary to step the time first, for example using ntpdate.
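To make the threshold reasoning concrete, here is a minimal sketch in Python. It is not ntpd's code. It uses the standard NTP offset estimate from the four request/reply timestamps and compares it with the 1000s limit mentioned above. The function names and the constant name are hypothetical.

```python
# Illustrative only: names are hypothetical, not taken from ntpd or monit.
PANIC_THRESHOLD_S = 1000.0  # the 1000s limit mentioned above


def clock_offset(t1: float, t2: float, t3: float, t4: float) -> float:
    """Estimate the local clock offset in seconds.

    t1: client transmit time, t2: server receive time,
    t3: server transmit time, t4: client receive time.
    """
    return ((t2 - t1) + (t3 - t4)) / 2.0


def needs_step(offset_s: float, threshold_s: float = PANIC_THRESHOLD_S) -> bool:
    """Return True if the offset is too large to leave to the daemon."""
    return abs(offset_s) > threshold_s
```

If `needs_step` returns True, the clock would have to be stepped (for example with ntpdate) before the daemon can be expected to stay up.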

It will be better to modify the configuration this way:

--8<--
check process date-time with pidfile /var/run/ntpd.pid
        start program = "/bin/bash -c '/usr/sbin/ntpdate -s pool.ntp.org && /sbin/service ntpd start'"
        stop  program = "/sbin/service ntpd stop"
        if failed host 127.0.0.1 port 123 type udp protocol ntp3 for 2 times within 3 cycles then restart
        if 2 restarts within 3 cycles then timeout

check host ntp_peer with address pool.ntp.org
        if failed port 123 type udp protocol ntp3 for 2 times within 3 cycles then alert
--8<--

=> the start program is modified to set the time using ntpdate before ntpd is started.

The "protocol ntp3" is added. This is highly recommended, especially for UDP tests, because UDP is connectionless. It speeds up the test because monit knows what the server should return. A generic UDP test (without a protocol specification) is tricky, as the only way to check that the packet arrived at the destination is that no network error was indicated by ICMP.
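As a rough illustration of what a protocol-aware UDP check looks like, here is a minimal Python sketch. It is not monit's implementation. It sends an NTP version 3 client request and accepts the reply only if it looks like a server answer. The function names and the exact acceptance criteria (server mode, stratum 1 to 15) are assumptions made for this example.

```python
import socket

NTP_PACKET_SIZE = 48  # basic NTP header, no extension fields


def build_ntp_request(version: int = 3) -> bytes:
    """Build a minimal client request: LI=0, VN=version, Mode=3 (client)."""
    first_byte = (0 << 6) | (version << 3) | 3
    return bytes([first_byte]) + bytes(NTP_PACKET_SIZE - 1)


def looks_like_ntp_reply(data: bytes) -> bool:
    """Return True if data is plausibly a usable server reply.

    Assumed criteria for this sketch: mode 4 (server) and stratum 1..15.
    """
    if len(data) < NTP_PACKET_SIZE:
        return False
    mode = data[0] & 0x07
    stratum = data[1]
    return mode == 4 and 1 <= stratum <= 15


def check_ntp(host: str = "127.0.0.1", port: int = 123, timeout: float = 5.0) -> bool:
    """Send one request and report whether a valid-looking reply came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_ntp_request(), (host, port))
            data, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts, ICMP errors, and resolution failures
            return False
    return looks_like_ntp_reply(data)
```

A generic UDP test has nothing like `looks_like_ntp_reply` to rely on, so it can only infer success from the absence of an error.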

Regards,
Martin


On Apr 16, 2011, at 8:21 PM, Mike Schmidt wrote:

Hi,

I have about 50 systems running monit and reporting to an M/Monit server. The config files for all of them are the same, although the versions of Linux are not necessarily so. I am seeing a number of inconsistencies between the different systems. Many of these have problems with ntpd:

check process date-time with pidfile /var/run/ntpd.pid
        start program = "/sbin/service ntpd start"
        stop  program = "/sbin/service ntpd stop"
#       if failed host pool.ntp.org port 123 type udp for 2 times within 3 cycles then alert
        if 2 restarts within 3 cycles then timeout

These systems are rebooted every night.

Most of the systems are ok. However, a number of them, across all versions of Linux, keep thinking ntpd is not running and restarting it, sometimes to the point of unmonitoring it (even though it's still running when I log on to the system in question to check). Looking at the events, I see that monit has restarted ntpd once in a while, like 3 or 4 times arbitrarily. Before I installed monit, ntpd never stopped on its own to my knowledge. So monit is doing the stop/restart.

Any ideas on what can be causing this? Why would monit think it's stopped when it's not? The pid file contains the correct pid.
--
Mike SCHMIDT
CTO 
Intello Technologies Inc.
address@hidden
Canada: 1-888-404-6261 x320
USA: 1-888-404-6268 x320
www.intello.com


--
To unsubscribe:
http://lists.nongnu.org/mailman/listinfo/monit-general




