failing more than once before alert

From: Ben Hartshorne
Subject: failing more than once before alert
Date: Thu, 28 Jul 2005 14:30:05 -0700
User-agent: Mutt/1.5.9i

Hi, all,

I have been getting an incredible number of false positive pages
recently.  I have to believe that it's something having to do with my
application, but most of the pages I get correct themselves one cycle
later.  I put in a test to hit google on port 80, and even that paged me
once in the middle of the night.

This pissed me off enough to do something about it.  Reading through the
list archives, I found this post:
It gave me a nice idea (and I followed his example) but I really didn't
like the fact that after a single failure, the service requires human
intervention to restart monitoring (since the timeout function disables
monitoring for that service).  

So I started making code changes.  Unfortunately, I didn't do it the
*right* way, because it's been way too long since I played with flex
etc.  Instead, I took advantage of the "if x restarts in y cycles then
timeout," but eviscerated the ACTION_TIMEOUT functionality.  It no
longer actually times out, it just alerts.  

What I really wanted was "if x restarts in y cycles then alert," but I
couldn't figure the right way to do it.

The timeout functionality was designed to start counting at the first
failure and, if a service actually times out, to stop monitoring it.  As
a result, the counter manipulation didn't work so well once timeouts
could be triggered and recovered from often.  

The end result:  I have a rule like:
set alert address@hidden {timeout}
check host RadixTest with address
        start program = "/bin/true"
        stop program = "/bin/true"
        if 2 restarts within 3 cycles then timeout
        if failed url
           and content == "default=linuxprep"
           then restart

and I only get paged if it fails twice within three attempts:
fail pass fail == page
fail pass pass == no page
fail fail pass == page
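
For anyone who wants to play with the "2 failures within 3 cycles" rule
outside of monit, here is a minimal Python sketch of it as a sliding
window.  This is not the code from the patch (which manipulates monit's
restart counter in C); the names page_decisions and would_page are made
up for illustration, and the sketch ignores the one-cycle delay described
later in this message.

```python
from collections import deque


def page_decisions(results, failures=2, cycles=3):
    """For each check result (True = pass, False = fail), report whether
    at least `failures` of the last `cycles` results were failures.

    Hypothetical helper for illustration; not part of monit.
    """
    window = deque(maxlen=cycles)  # keeps only the most recent results
    decisions = []
    for ok in results:
        window.append(ok)
        decisions.append(window.count(False) >= failures)
    return decisions


def would_page(results, failures=2, cycles=3):
    """True if the sequence would trigger a page at any point."""
    return any(page_decisions(results, failures, cycles))


FAIL, PASS = False, True
print(would_page([FAIL, PASS, FAIL]))  # fail pass fail
print(would_page([FAIL, PASS, PASS]))  # fail pass pass
print(would_page([FAIL, FAIL, PASS]))  # fail fail pass
```

The three calls correspond to the three cases listed above.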

I also made it decrement the pass-counter slowly, so that 
fail pass fail pass fail == failure-page, recovery-page, failure-page
i.e. if it's recently failed, be more paranoid.

One annoyance is that the check_timeout function runs before the
service test instead of afterwards, so I actually get paged at the
beginning of the cycle following the failure condition.  I'm checking
every 60 seconds, so I can deal with that.  A correct solution wouldn't
exhibit this problem...  ;)

Another less-than-desirable trait: IMHO, the right way to do this
kind of thing is to use the leaky-bucket algorithm (as many network
protocols do), in which failures add up quickly but subside slowly.  You
would have to specify a rate at which the failure counter drops, in
addition to the thresholds.
For example: 5 failures within 10 attempts, decreasing the failure
counter by 1 for every 5 successes.

This allows a certain amount of flakiness, but alerts you quickly on a
hard failure, and alerts you if it gets too flaky.
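
To make the leaky-bucket idea concrete, here is a rough Python sketch.
It is only an illustration of the scheme described above, not anything
monit provides: the class name LeakyBucket and its parameters are
invented, the "within 10 attempts" window is not modelled separately
(the leak rate takes its place), and capping the counter at the
threshold is my own assumption so that recovery after a long outage
stays bounded.

```python
class LeakyBucket:
    """Failure counter that fills quickly and drains slowly.

    Hypothetical sketch: each failure adds 1 to the level; every
    `successes_per_leak` successes remove 1.  An alert is active while
    the level is at or above `threshold`.
    """

    def __init__(self, threshold=5, successes_per_leak=5):
        self.threshold = threshold
        self.successes_per_leak = successes_per_leak
        self.level = 0                  # current failure count
        self.successes_since_leak = 0   # successes not yet "spent"

    def record(self, ok):
        """Record one check result; return True if alerting."""
        if ok:
            self.successes_since_leak += 1
            if self.successes_since_leak >= self.successes_per_leak:
                self.successes_since_leak = 0
                self.level = max(0, self.level - 1)
        else:
            # Assumption: cap at the threshold so the bucket cannot
            # overfill during a long outage.
            self.level = min(self.threshold, self.level + 1)
        return self.level >= self.threshold


# A hard failure: five failures in a row.
bucket = LeakyBucket()
print([bucket.record(False) for _ in range(5)])
```

With these settings a hard failure alerts on the fifth failed check, an
occasional failure (say one in ten) drains away before it accumulates,
and a service that fails one check in three eventually fills the bucket.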

Anyway...   In case any of you are interested, I have attached a patch
with the modifications I made (against the head of the CVS tree).


P.S.  Is this the right list, or should I have posted this to
monit-general?  That list seems much higher volume; all I see go by here
are announcements of newly checked-in files...

Ben Hartshorne
email: address@hidden

Attachment: monit.patch
Description: Text document

Attachment: signature.asc
Description: Digital signature