monit-general
Re: New user with several major monit problems


From: Martin Pala
Subject: Re: New user with several major monit problems
Date: Mon, 12 Sep 2005 18:13:02 +0200
User-agent: Mozilla Thunderbird 1.0.6 (Windows/20050716)

The following line in the stop branch of the /etc/ha.d/resource.d/Filesystem script can kill all processes which are accessing the mountpoint:

  $FUSER -mk $MOUNTPOINT

From your output it seems that your filesystem is held open by many processes, and all of them are candidates for the fuser kill:

> inertia:~# /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/nfstest
> reiserfs stop
> /mnt/nfstest:            1rce     2rc     3rc     4rc    21rc    46rc
> 47rc    48rc    49rc   185rc   205rc   312rc  1867rc  7041rce  7045rce
> 7056rce  7071rce  7086rce  7102rce  7108rce  7118rce  7209rce  7216rce
> 7253rce  7256rce  7311rce  7314rce  7324rce  7327re 16681rc 16688rc
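
To see in advance which processes such a call would hit, the fuser listing can be reduced to bare PIDs and checked for init. This is only a minimal sketch: fuser_pids and would_kill_init are hypothetical helper names, not part of heartbeat or monit, and they assume input in the format shown above (or the plain PID list that fuser writes to stdout).

```shell
#!/bin/sh
# Hypothetical helper: turn a line of `fuser -m` output into one PID per line.
fuser_pids() {
    # Drop an optional "mountpoint:" prefix, split on whitespace, then strip
    # the access-type letters (r = root dir, c = cwd, e = executable, ...)
    # that follow each PID, and discard empty lines.
    sed -e 's/^[^:]*://' | tr -s ' \t' '\n' | sed -e 's/[a-zA-Z]*$//' -e '/^$/d'
}

# Hypothetical helper: succeed if the listing on stdin includes init (PID 1).
would_kill_init() {
    fuser_pids | grep -qx 1
}
```

For example, `$FUSER -m $MOUNTPOINT 2>/dev/null | would_kill_init && echo "init is in the list"` could be run by hand before the stop script is allowed to call `$FUSER -mk`.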


Martin


Jonathan Wheeler wrote:
Martin Pala wrote:


Jonathan Wheeler wrote:


Martin Pala wrote:



Jonathan Wheeler wrote:



Most annoyingly, for my cluster monit -g node1 stop all (as taken
directly from your documentation) kills the *entire* server (see
problem 1)



One more thing - the described node shutdown sounds to me like a
watchdog-driven shutdown - do you use heartbeat's watchdog capability
or some other external check which is able to panic the node?


No I don't, nothing fancy at all yet :)

Any thoughts on how I might troubleshoot this further? Syslog itself is
killed, so I don't have any information in the logs at all. The local
console is also booted out, so even sitting in front of the server
doesn't help.


I think it is either watchdog or some stonith method (power off/cycle
the machine). You can try for example 'lsof | grep watchdog' to see
whether the watchdog device is opened.
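
A slightly narrower form of that check is sketched below. It assumes the usual lsof column layout, with the file name in the last column, and a device path beginning with /dev/watchdog; watchdog_holders is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# Sketch: print "PID COMMAND" for every lsof row whose NAME column
# (the last field) is a watchdog device such as /dev/watchdog.
# Reads lsof output on stdin; the header row does not match and is skipped.
watchdog_holders() {
    awk '$NF ~ /^\/dev\/watchdog/ { print $2, $1 }'
}
```

Run as `lsof 2>/dev/null | watchdog_holders`; no output means nothing visible to lsof has the device open.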

If you can supply your heartbeat, monit and script configuration as
Hauk described, then it will be much easier to find the problem.

Martin


Hi Team,

As promised I've done some more digging into this. After rebooting my
two test servers, I was unable to replicate the problem by simply
running monit -g node1 stop all. So I went back to my HA/monit
configuration again to see what would happen.

I was then in a position where monit -g node1 stop all would kick me out
of my ssh sessions to the machine, and according to the monit http
interface it was restarting all services (one of which is sshd, it should
be noted), regardless of group.

Then I realised, monit stop drbdfs, or monit stop heartbeat would kick
me out.

I commented out and stopped heartbeat at this point.

Syslogd was defined in my monitrc file, so I commented it out, reloaded
monit, and ran monit -g node1 stop all. I was booted out, and
reconnected to find this time syslog hadn't restarted itself. So, it
would appear that monit has been (re??)starting syslog (and sshd) for me
after all processes are killed. With syslog stopping, it's very hard to
tell exactly what is happening, and of course it shouldn't be stopping
in the first place.
My assumption is that monit is only surviving because it is
running/respawning directly from init, and according to monit's uptime
numbers it too is being restarted.

It then clicked that 'monit stop drbdfs' killing the system was probably
a very important clue, and when I ran its stop command manually, I was
also kicked out!
YAY, so with HA and monit removed I'm still able to replicate the problem:

inertia:~# /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/nfstest
reiserfs stop
/mnt/nfstest: 1rce 2rc 3rc 4rc 21rc 46rc 47rc 48rc 49rc 185rc 205rc 312rc 1867rc 7041rce 7045rce 7056rce 7071rce 7086rce 7102rce 7108rce 7118rce 7209rce 7216rce 7253rce 7256rce 7311rce 7314rce 7324rce 7327re 16681rc 16688rc
kill 7256: No such process
Connection to inertia closed by remote host.
Connection to inertia closed.

I realise that I've now more or less ruled out monit as the cause of
this, but I wonder if you'd be so kind as to cast your eyes over this
script and let me know if you see anything out of place, as I then
rebooted, ran the scripts manually, and WASN'T kicked out, as below.

These scripts, and indeed HA, were working for the most part before I
added monit to the equation, and the fact that this script worked this
reboot around adds further murkiness. I do realise that it is still
perhaps a little hasty to therefore conclude that monit is at fault, but
any assistance you can provide would be greatly appreciated.

inertia:/mnt# /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/nfstest
reiserfs stop
inertia:/mnt# ps ax
  PID TTY      STAT   TIME COMMAND
    1 ?        S      0:00 init [2]
    2 ?        SN     0:00 [ksoftirqd/0]
    3 ?        S<     0:00 [events/0]
    4 ?        S<     0:00 [khelper]
   21 ?        S<     0:00 [kblockd/0]
   46 ?        S      0:00 [pdflush]
   47 ?        S      0:00 [pdflush]
   49 ?        S<     0:00 [aio/0]
   48 ?        S      0:00 [kswapd0]
  185 ?        S      0:00 [kseriod]
  205 ?        S<     0:00 [ata/0]
  312 ?        S<     0:00 [reiserfs/0]
 1894 ?        S      0:00 [drbd0_worker]
 1907 ?        S      0:00 [drbd0_receiver]
 1917 ?        S      0:00 [drbd0_asender]
 3555 tty1     Ss+    0:00 -bash
 3596 tty2     Ss+    0:00 -bash
 3608 tty3     Ss+    0:00 /sbin/getty 38400 tty3
 3629 tty4     Ss+    0:00 /sbin/getty 38400 tty4
 3649 tty5     Ss+    0:00 /sbin/getty 38400 tty5
 3661 tty6     Ss+    0:00 /sbin/getty 38400 tty6
 3772 ?        Ss     0:00 /usr/sbin/monit -Ic /etc/monit/monitrc
 3807 ?        Ss     0:00 /usr/sbin/sshd
 3814 ?        Ss     0:00 /usr/sbin/exim4 -bd -q30m
 3868 ?        Ss     0:00 sshd: address@hidden/1
 3871 pts/1    Ss+    0:00 -bash
 3956 ?        Ss     0:00 sshd: address@hidden/0
 3959 pts/0    Ss     0:00 -bash
 4033 pts/0    R+     0:00 ps ax



Filesystem script:
inertia:~# cat /etc/ha.d/resource.d/Filesystem | grep -v \#
unset LC_ALL; export LC_ALL
unset LANGUAGE; export LANGUAGE

prefix=/usr
exec_prefix=/usr
. /etc/ha.d/shellfuncs

MODPROBE=/sbin/modprobe
FSCK=/sbin/fsck
FUSER=/bin/fuser
MOUNT=/bin/mount
UMOUNT=/bin/umount
BLOCKDEV=/sbin/blockdev

check_util () {
    if [ ! -x "$1" ] ; then
        ha_log "ERROR: setup problem: Couldn't find utility $1"
        exit 1
    fi
}

usage() {

cat <<-EOT;
        usage: $0 <device> <directory> <fstype> [<options>] {start|stop|status}

        <device>    : name of block device for the filesystem. e.g. /dev/sda1, /dev/md0
                      OR -LFileSystemLabel OR -Uuuid or an NFS specification
        <directory> : the mount point for the filesystem
        <fstype>    : name of the filesystem type. e.g. ext2
        <options>   : options to be given as -o options to mount.

        $Id: Filesystem.in,v 1.10 2003/07/03 02:14:14 alan Exp $
        EOT
}

flushbufs() {
  if
    [ "$BLOCKDEV" != "" -a -x "$BLOCKDEV" ]
  then
    case $1 in
      -*|[^/]*:/*)      ;;
      *)                $BLOCKDEV --flushbufs $1;;
    esac
  fi
}
DEVICE=$1
MOUNTPOINT=$2
FSTYPE=$3

case $DEVICE in
        ;;
        ;;
  *)    if [ ! -b "$DEVICE" ] ; then
          ha_log "ERROR: Couldn't find device $DEVICE. Expected /dev/??? to exist"
          usage
          exit 1
        fi;;
esac

if [ ! -d "$MOUNTPOINT" ] ; then
        ha_log "ERROR: Couldn't find directory  $MOUNTPOINT to use as a mount point"
        usage
        exit 1
fi

check_util $MODPROBE
check_util $FSCK
check_util $FUSER
check_util $MOUNT
check_util $UMOUNT

  4)    operation=$4; options="";;
  5)    operation=$5; options="-o $4";;
  *)    usage; exit 1;;
esac
case "$operation" in

start)

        $MOUNT | cut -d' ' -f3 | grep -e "^$MOUNTPOINT$" >/dev/null
        if [ $? -ne 1 ] ; then
            ha_log "ERROR: Filesystem $MOUNTPOINT is already mounted!"
            exit 1;
        fi

        $MODPROBE scsi_hostadapter >/dev/null 2>&1

        $MODPROBE $FSTYPE >/dev/null 2>&1
        grep -e "$FSTYPE"'$' /proc/filesystems >/dev/null
        if [ $? != 0  ] ; then
                ha_log "ERROR: Couldn't find filesystem $FSTYPE in /proc/filesystems"
                usage
                exit 1
        fi


        if
          case $FSTYPE in
            ext3|reiserfs|xfs|jfs|vfat|fat|nfs) false;;
            *)                          true;;
          esac
        then
          ha_log "info: Starting filesystem check on $DEVICE"
          $FSCK -t $FSTYPE -a $DEVICE
          if
            [ $? -ge 4 ]
          then
            ha_log "ERROR: Couldn't sucessfully fsck filesystem for $DEVICE"
            exit 1
          fi
        fi

        flushbufs $DEVICE

        if
          $MOUNT -t $FSTYPE $options $DEVICE $MOUNTPOINT
        then
          : Mount worked!
        else
          ha_log "ERROR: Couldn't mount filesystem $DEVICE on $MOUNTPOINT"
          exit 1
        fi

;;

stop)

        if
          $MOUNT | grep -e " on $MOUNTPOINT " >/dev/null
        then
                $FUSER -mk $MOUNTPOINT


                DEV=`$MOUNT | grep "on $MOUNTPOINT " | cut -d' ' -f1`
                $UMOUNT $MOUNTPOINT
                if [ $? -ne 0 ] ; then
                        ha_log "ERROR: Couldn't unmount $MOUNTPOINT"
                        exit 1
                fi
                flushbufs $DEV
        else
                ha_log "WARNING: Filesystem $MOUNTPOINT not mounted?"
        fi

;;

status)

        $MOUNT | grep -e "on $MOUNTPOINT " >/dev/null
        if [ $? = 0 ] ; then
                echo "$MOUNTPOINT is mounted (running)"
        else
                echo "$MOUNTPOINT is unmounted (stopped)"
        fi
;;


*)
    echo "This script should be run with a fourth argument of 'start', 'stop', or 'status'"
    usage
    exit 1
;;

esac

exit 0;


My monitrc:

set daemon  60
set logfile syslog facility log_daemon
set mailserver localhost port 25, willow.griffous.net
set mail-format { from: address@hidden }
set alert address@hidden
set httpd port 2812 and
    allow 10.0.10.6
    allow 192.168.1.133
check process sshd with pidfile /var/run/sshd.pid
start program "/etc/init.d/ssh start"
stop program "/etc/init.d/ssh stop"
if failed port 22 protocol ssh then restart
   if 5 restarts within 5 cycles then timeout
group system


check process exim4 with pidfile /var/run/exim4/exim.pid
start program "/etc/init.d/exim4 start"
stop program "/etc/init.d/exim4 stop"
   if failed port 25 protocol smtp then restart
   if 5 restarts within 5 cycles then timeout
group system

check device drbd path /proc/drbd
start program = "/etc/ha.d/resource.d/drbddisk r0 start"
stop program = "/etc/ha.d/resource.d/drbddisk r0 stop"
mode manual
group node1

check directory drbdfs path /mnt/nfstest/nfs
start program = "/etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/nfstest reiserfs start"
stop program = "/etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/nfstest reiserfs stop"
mode manual
depends drbd
group node1


check process nfsd with pidfile /var/run/nfsd.pid
start program = "/etc/init.d/nfs-kernel-server start"
stop program = "/etc/init.d/nfs-kernel-server stop"
mode manual
depends on drbdfs
group node1

inertia:/mnt# monit -V
This is monit version 4.5
Copyright (C) 2000-2005 by the monit project group. All Rights Reserved.


Thanks,
Jonathan


--
To unsubscribe:
http://lists.nongnu.org/mailman/listinfo/monit-general