Re: [Gluster-devel] Weird lock-ups


From: gordan
Subject: Re: [Gluster-devel] Weird lock-ups
Date: Mon, 27 Oct 2008 20:02:51 +0000 (GMT)
User-agent: Alpine 2.00 (LRH 1167 2008-08-23)



On Mon, 27 Oct 2008, address@hidden wrote:



On Mon, 27 Oct 2008, Gordan Bobic wrote:

 Krishna Srinivas wrote:
> On Tue, Oct 21, 2008 at 5:54 PM, Gordan Bobic <address@hidden> wrote:
> > I'm starting to see lock-ups when using a single-file client/server
> > setup.
> >
> > machine1 (x86):
> > =================================
> >   volume home2
> >          type protocol/client
> >          option transport-type tcp/client
> >          option remote-host 192.168.3.1
> >          option remote-subvolume home2
> >   end-volume
> >
> >   volume home-store
> >          type storage/posix
> >          option directory /gluster/home
> >   end-volume
> >
> >   volume home1
> >          type features/posix-locks
> >          subvolumes home-store
> >   end-volume
> >
> >   volume server
> >          type protocol/server
> >          option transport-type tcp/server
> >          subvolumes home1
> >          option auth.ip.home1.allow 127.0.0.1,192.168.*
> >   end-volume
> >
> >   volume home
> >          type cluster/afr
> >          subvolumes home1 home2
> >          option read-subvolume home1
> >   end-volume
> >
> > machine2 (x86-64):
> > =================================
> >   volume home1
> >          type protocol/client
> >          option transport-type tcp/client
> >          option remote-host 192.168.0.1
> >          option remote-subvolume home1
> >   end-volume
> >
> >   volume home-store
> >          type storage/posix
> >          option directory /gluster/home
> >   end-volume
> >
> >   volume home2
> >          type features/posix-locks
> >          subvolumes home-store
> >   end-volume
> >
> >   volume server
> >          type protocol/server
> >          option transport-type tcp/server
> >          subvolumes home2
> >          option auth.ip.home2.allow 127.0.0.1,192.168.*
> >   end-volume
> >
> >   volume home
> >          type cluster/afr
> >          subvolumes home1 home2
> >          option read-subvolume home2
> >   end-volume
> >
> > ==================
> >
> > Do those configs look sane?
> >
> > When one machine is running on its own, it's fine. Other client-only
> > machines can connect to it without any problems. However, as soon as the
> > second client/server comes up, typically the first ls access on the
> > directory will lock the whole thing up solid.
> >
> > Interestingly, on the x86 machine, the glusterfs process can always be
> > killed. Not so on the x86-64 machine (the 2nd machine that comes up).
> > kill -9 doesn't kill it. The only way to clear the lock-up is to reboot.
> >
> > Using the 1.3.12 release compiled into an RPM on both machines
> > (CentOS 5.2).
> >
> > One thing worthy of note is that machine2 is nfsrooted / network booted.
> > It has local disks in it, and a local dmraid volume is mounted under
> > /gluster on it (machine1 has a disk-backed root).
> >
> > So, on machine1:
> >   / is local disk
> > on machine2:
> >   / is NFS
> >   /gluster is local disk
> >   /gluster/home is exported in the volume spec for AFR.
> >
> > If /gluster/home is newly created, it tends to get a little further, but
> > still locks up pretty quickly. If I try to execute find /home once it is
> > mounted, it will eventually hang, and the only thing of note I could see
> > in the logs is that it said "active lock found" at the point where it
>
> Do you see this error on server1 or server2? Any other clues in the logs?

 Access to the FS locks up on both server1 and server2.

 I have split up the setup to separate client and server on server2
 (x86-64), and have tried to get it to sync up just the file placeholders
 (find . at the root of the glusterfs mounted tree), and this, too, causes a
 lock-up. I have managed to kill the glusterfsd process, but only after
 killing the glusterfs process first.
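
 Concretely, the split bring-up on server2 amounts to something like the
 following; the spec file names and the /home mount point are just my
 shorthand here, and -f is the 1.3.x spec-file option:

    # server half: exports home2 (posix-locks on top of /gluster/home)
    glusterfsd -f /etc/glusterfs/server2-server.vol

    # client half: AFR of home1 (on server1) and the local home2,
    # mounted on /home
    glusterfs -f /etc/glusterfs/server2-client.vol /home

    # walk the tree so AFR creates the file placeholders
    cd /home && find . > /dev/null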

 This ends up in the logs on server2, in the glusterfs (client) log:
 2008-10-27 18:44:31 C [client-protocol.c:212:call_bail] home2: bailing
 transport
 2008-10-27 18:44:31 E [client-protocol.c:4834:client_protocol_cleanup]
 home2: forced unwinding frame type(1) op(36) address@hidden
 2008-10-27 18:44:31 E [client-protocol.c:4215:client_setdents_cbk] home2:
 no proper reply from server, returning ENOTCONN
 2008-10-27 18:44:31 E [afr_self_heal.c:155:afr_lds_setdents_cbk] mirror:
 op_ret=-1 op_errno=107
 2008-10-27 18:44:31 E [client-protocol.c:4834:client_protocol_cleanup]
 home2: forced unwinding frame type(1) op(34) address@hidden
 2008-10-27 18:44:31 E [client-protocol.c:4430:client_lookup_cbk] home2: no
 proper reply from server, returning ENOTCONN
 2008-10-27 18:44:31 E [fuse-bridge.c:468:fuse_entry_cbk] glusterfs-fuse:
 19915: (34) /gordan/bin => -1 (5)
 2008-10-27 18:45:51 C [client-protocol.c:212:call_bail] home2: bailing
 transport
 2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup]
 home2: forced unwinding frame type(1) op(0) address@hidden
 2008-10-27 18:45:51 E [client-protocol.c:2688:client_stat_cbk] home2: no
 proper reply from server, returning ENOTCONN
 2008-10-27 18:45:51 E [afr.c:3298:afr_stat_cbk] mirror: (child=home2)
 op_ret=-1 op_errno=107
 2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup]
 home2: forced unwinding frame type(1) op(34) address@hidden
 2008-10-27 18:45:51 E [client-protocol.c:4430:client_lookup_cbk] home2: no
 proper reply from server, returning ENOTCONN
 2008-10-27 18:45:51 E [client-protocol.c:325:client_protocol_xfer] home2:
 transport_submit failed
 2008-10-27 18:45:51 E [client-protocol.c:325:client_protocol_xfer] home2:
 transport_submit failed
 2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup]
 home2: forced unwinding frame type(1) op(34) address@hidden
 2008-10-27 18:45:51 E [client-protocol.c:4430:client_lookup_cbk] home2: no
 proper reply from server, returning ENOTCONN
 2008-10-27 18:46:23 E [protocol.c:271:gf_block_unserialize_transport]
 home1: EOF from peer (192.168.0.1:6996)
 2008-10-27 18:46:23 E [client-protocol.c:4834:client_protocol_cleanup]
 home1: forced unwinding frame type(2) op(5) address@hidden
 1230
 2008-10-27 18:46:23 E [client-protocol.c:4246:client_lock_cbk] home1: no
 proper reply from server, returning ENOTCONN

 I think this was generated in the logs only after the server2 client was
 forcefully killed, not when the lock-up occurred, though.

 If I merge the client and server config into a single volume definition on
 server2, the lock-up happens as soon as the FS is mounted. If
 server2-server gets brought up first, then server1-combined, then
 server2-client, it seems to last a bit longer.

 I'm wondering now if it fails on a particular file/file type (e.g. a
 socket).

 But whatever is causing it, it is completely reproducible. I haven't been
 able to keep it running under these circumstances for long enough to
 finish loading X with the home directory mounted over glusterfs with both
 servers running.

Update - the problem seems to be somehow linked to running the client on server2. If I start up server2-server and server1 client+server, I can execute a complete "find ." on the gluster mounted volume (from server1, obviously, since server2 doesn't have a client running), and instigate a full resync by the usual "find /home -type f -exec head -c1 {} \; > /dev/null". This all works, and all files end up on server2.
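
To spell out the sequence that does work (spec file names again just my shorthand):

   # 1. server2: server only, nothing mounted
   glusterfsd -f /etc/glusterfs/server2-server.vol

   # 2. server1: combined client+server spec, /home mounted
   glusterfs -f /etc/glusterfs/server1.vol /home

   # 3. on server1: create the placeholders, then read the first byte
   #    of every file to trigger the AFR self-heal onto server2
   find /home > /dev/null
   find /home -type f -exec head -c1 {} \; > /dev/null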

But doing this with the client up and running on server2 makes the whole process lock up. Sometimes I can only get the first "ls -la" on the base of the mounted tree before everything subsequent locks up and ends up waiting until I "killall glusterfs" on server2. At this point glusterfsd (server) on server2 is unkillable until glusterfs (client) is killed first.
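
In other words, the only way to un-wedge server2 at that point is to kill things in this order:

   killall glusterfs     # the FUSE client has to go first
   killall glusterfsd    # the server only dies once the client is gone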

I have just completed a full rescan of the underlying file system on server1 just in case that might have gone wrong, and it passed without any issues.

So, something in the server2 (x86-64) client part causes a lock-up somewhere in the process. :-(

I just noticed this sort of thing occasionally ends up logged on server1, but not for all files:

2008-10-27 19:51:13 E [afr.c:1164:afr_selfheal_unlock_cbk] home: (path=/path/file child=home1) op_ret=-1 op_errno=2
2008-10-27 19:51:13 E [afr.c:2203:afr_open] home: self heal failed, returning EIO
2008-10-27 19:51:13 E [fuse-bridge.c:715:fuse_fd_cbk] glusterfs-fuse: 92274: (12) /path/file => -1 (5)

However, cat-ing the file (again, on server1), either through glusterfs or the underlying file system, works fine and returns no error.
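
That is, with /path/file standing in for the real path from the log above, both of these come back clean on server1:

   cat /home/path/file           # through the glusterfs mount
   cat /gluster/home/path/file   # straight off the backend export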

Gordan



