monit-general


From: Christian Hopp
Subject: NFS is going down, et al [was: pidfiles aka. Re: [CVS] unix socket support added]
Date: Mon, 5 Aug 2002 23:19:49 +0200 (CEST)

On 5 Aug 2002, Jan-Henrik Haukeland wrote:

> I meant "subsidiary" in the sense that if the vote got through in
> spite of my -1 I would give Martin's proposal my +1. I'm sorry if I
> wasn't clear on this.

Sorry, I accidentally overlooked a -1 which came earlier.  In that case I
wouldn't have started the work. cp patch ~/experience

> > Okay, the impact of the patch to the code would be five lines to p.y
> > *
>
> I know, it's a small patch.

That was just to remove any "ohhh too much overhead" arguments. (-:

(...)

> But please, I'm not a total religious zealot,

Let's take it as a start for alt.religion.monit!

> and if valid requests come for this I'm absolutely willing to
> reconsider.

If necessary I can recover it from ~/experience. (-:  I can definitely
live with it.  For me the topic is closed... and for you?  Let's tackle
some more important stuff!

> > Before I spoke of the possible NFS problems that could come up, when
> > the connection breaks at the time monit accesses it.  You proposed
> > to use a "timeout" construction via alarm().  But that won't IMHO
> > work. Monit won't be able to evaluate any signal at that time.
>
> Are you sure about this? I cannot actually test it since I do not have
> access to a system using NFS. But theoretically I would think that
> this would just be some sort of a blocking affair and that alarm()
> would shake monit out of it. But of course I may be wrong.

Let me cite "man mount" on this:

(...)

Mount options for nfs

(...)

hard  The program accessing a file on a NFS mounted file system
       will hang when the  server crashes. The process cannot be
       interrupted or killed unless you also specify intr.  When the
       NFS server is back online the program will continue undisturbed
       from where it was. This is probably what you want.

Unfortunately we have a very unreliable network right now.  And whenever
it breaks down (at least twice a day) my _local_ xemacs hangs
uninterruptibly for as long as it's gone, just because I have an open
file handle to my home dir.  That's a good time to heat some water.  When
it's back, life goes on as if nothing had happened.

> > The only possible way IMHO would be to fork away the actual checker
> > and evaluate its exit value. If you wait() longer than the timeout
> > value put aside this service until a wait() is successfully
> > answered and warn the specific recipient.  Or do you think that this
> > is most unlikely to happen???
>
> I don't know. But let's see, this is only relevant for a checksum check
> right?

Of course.

> Since we assume system pidfiles are on a local disk (I hope)
> and if a start/stop program fails because NFS died it will not be a
> problem since these child processes are autonomous anyway and not
> expected to report back anything.

What about all those diskless netbooters?  They even have NFS swap.  You
can see really wild stuff sometimes.  But not in serious applications.

> Assuming this, monit will then only be suspended iff, your assumption
> is correct and monit has started reading a file for checksum testing
> while NFS died. (If monit was about to open a file for creating MD5
> and NFS was down the fopen call will just fail). It takes under a
> second to create a MD5 sum for a 2 Mb file. So you should be pretty
> unlucky for monit suspension to occur.

It takes 1 second... 5 minute cycle time... maybe 5 services, all
binaries on NFS... 1/60 chance and bang.  I know the calculations are
a bit vague.  But... you never know.
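
Spelled out, the rough estimate above looks like this.  A minimal sketch
in Python, assuming the figures named in the text (1 second per checksum,
5 services, a 5 minute cycle, two outages a day) and assuming that each
outage begins at a uniformly random moment, independently of the others.
The variable names are illustrative only.

```python
from fractions import Fraction

# Figures taken from the discussion; everything else is an assumption.
checksum_seconds = 1      # time to checksum one binary
services_on_nfs = 5       # binaries checked per cycle
cycle_seconds = 5 * 60    # one monit cycle
outages_per_day = 2       # "at least twice a day"

# Fraction of each cycle during which monit is reading from NFS.
exposed = Fraction(checksum_seconds * services_on_nfs, cycle_seconds)
print(exposed)  # 1/60

# Chance that at least one of the day's outages starts inside that window,
# assuming the outages are independent and uniformly distributed in time.
hit_per_day = 1 - (1 - exposed) ** outages_per_day
print(float(hit_per_day))
```

So a single outage catches monit mid-read about once in sixty times, and
with a flaky network the daily odds add up quickly.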

> But of course "Things that will never happen, always happen".

There was a guy called Murphy and his law. (-:

> What do you think? Is this something we can live with (assuming alarm
> won't work). At least, maybe we should document that it's a really bad
> idea to save pidfiles in a NFS mounted directory.

Saving pidfiles on NFS is DANGEROUS.  Having servers on NFS is
definitely anything but WISE.

> BTW, your suggestion sounds plausible and like the only possible
> workaround if alarm() does not work.

The thing is... monit has to run... even if the services monit is
checking are going berserk.  I think that's what a monitor should do.
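
The fork-and-wait idea discussed above can be sketched roughly as
follows.  This is only an illustration in Python, not monit code (monit
is written in C): run_check() and its return values are hypothetical
names, and the parent polls with waitpid(WNOHANG) instead of blocking in
wait().  Note that a child stuck on a hard NFS mount cannot be killed, so
on timeout the sketch just hands back the pid so the caller can put the
service aside and try to reap the child again later.

```python
import os
import signal
import time

def run_check(check, timeout, poll=0.05):
    """Run check() in a forked child and wait at most `timeout` seconds.

    Returns (state, pid) where state is "ok", "failed" or "timeout".
    On "timeout" the child is still alive and has NOT been reaped.
    """
    pid = os.fork()
    if pid == 0:
        # Child: do the possibly blocking work, report via exit status.
        try:
            ok = bool(check())
        except Exception:
            ok = False
        os._exit(0 if ok else 1)

    # Parent: never blocks, so it stays responsive whatever the child does.
    deadline = time.monotonic() + timeout
    while True:
        reaped, status = os.waitpid(pid, os.WNOHANG)
        if reaped == pid:
            passed = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
            return ("ok" if passed else "failed"), pid
        if time.monotonic() >= deadline:
            return "timeout", pid
        time.sleep(poll)

if __name__ == "__main__":
    state, pid = run_check(lambda: time.sleep(10), timeout=0.5)
    print(state)
    if state == "timeout":
        # In this demo the child is merely sleeping, so it can be killed.
        os.kill(pid, signal.SIGKILL)
        os.waitpid(pid, 0)
```

The price is one fork per check, but the monitor itself keeps running
while the checker hangs.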

If we are having "what if" discussions, here are some other things to
think about.

* Monit checks a server which is defunct, aka. a zombie.  Is it in
  "good health" or not?  Pidfile and pid do match.  I don't know what
  its ports do (do they still connect or not?).

* A start/stop script returns with an error.  Should monit still try to
  (re)start/stop the process?
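
For the zombie question in the first point, one way to tell a defunct
process from a healthy one is to look at its state letter.  The sketch
below is Linux-specific and only an assumption about how such a test
could look: it reads /proc/<pid>/stat, and process_state() and
is_zombie() are hypothetical helper names.  Other systems would need a
different source for the process state.

```python
import os
import time

def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            content = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The command name sits in parentheses and may itself contain spaces
    # or ")", so split on the LAST ")"; the state is the next field.
    return content.rsplit(")", 1)[1].split()[0]

def is_zombie(pid):
    """True if the pid exists but the process is defunct (state "Z")."""
    return process_state(pid) == "Z"

if __name__ == "__main__":
    child = os.fork()
    if child == 0:
        os._exit(0)          # child exits at once, parent does not reap yet
    time.sleep(0.2)          # give the child time to exit
    print(process_state(child))
    os.waitpid(child, 0)     # reaping removes the zombie
    print(process_state(child))
```

A pidfile test alone would pass here, since the pid still exists until
the parent reaps it.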


Good night,

Christian

-- 
Christian Hopp                                email: address@hidden
Institut für Elektrische Informationstechnik             fon: +49-5323-72-2113
Technische Universität Clausthal                         fax: +49-5323-72-3197
  pgpkey: https://www.iei.tu-clausthal.de/pgp-keys/chopp.key.asc  (2001-11-22)



