Re: [Gluster-devel] write-behind bug with ftruncate

On Sun, Jul 17, 2011 at 2:29 PM, Emmanuel Dreyfus <address@hidden> wrote:

However, the reordering does not occur in FUSE, and it seems I was wrong
about write-behind, and that removing it just made the bug disappear by
chance.

As I now understand it, the problem is that fuse_setattr_cbk() requests an
ftruncate() after the SETATTR. Here is what I get in the logs:

Do you mean fuse_setattr_cbk is triggering an ftruncate() when it was not supposed to? I reviewed the code again just now and cannot find it doing anything of the sort.

fuse_write() size = 4096, offset = 39981056
fuse_setattr() fsi->valid = 0x78 => truncate_needed, size = 39987632
fuse_write() size = 20480, offset = 39985152
(...)
client3_1_writev() size = 4096, offset = 39981056
fuse_setattr_cbk() call fuse_do_truncate, offset = 39987632
client3_1_writev() size = 2480, offset = 39985152
(...)
client3_1_ftruncate() offset = 39987632

The above sequence of events looks proper for the given fsi->valid (0x78). See the explanation below.

Why does it decide to set truncate_needed? fsi->valid = 0x78 means this is
set: FATTR_ATIME | FATTR_MTIME | FATTR_FH | FATTR_SIZE

Exactly, FATTR_SIZE getting set means a truncate or ftruncate (depending on FATTR_FH being set) needs to be done.

Here is the offending code:

#define FATTR_MASK (FATTR_SIZE \
                    | FATTR_UID | FATTR_GID \
                    | FATTR_ATIME | FATTR_MTIME \
                    | FATTR_MODE)
(...)
if ((fsi->valid & (FATTR_MASK)) != FATTR_SIZE) {
        if (fsi->valid & FATTR_SIZE) {
                state->size = fsi->size;
                state->truncate_needed = _gf_true;
        }

The sin is therefore to set FATTR_ATIME | FATTR_MTIME, while glusterfs
assumes this is an ftruncate() call because only FATTR_SIZE is set. Am I
correct?

The current behavior is the right behavior. FATTR_SIZE being set indicates that a truncate is necessary. FATTR_ATIME|FATTR_MTIME being set indicates that a utimes() is necessary, FATTR_UID|FATTR_GID being set indicates that a chown is necessary (FATTR_MODE likewise indicates a chmod), and FATTR_FH being set indicates that the fXXXX() variant of the above calls is to be used. Multiple flags can be set at the same time, i.e., FATTR_ATIME|FATTR_MTIME|FATTR_SIZE can all be set in the same fuse_setattr() call, and the filesystem (glusterfs) needs to perform all the required actions accordingly.

The problem I see here is that the write calls are arriving before the setattr has completed (i.e., before send_fuse_obj() is called for the _entire_ setattr operation). This naturally leads the writes and the truncate to race within the filesystem, as they are issued concurrently.

Filesystems guarantee (if at all) that two syscalls complete in a particular sequence only if the second syscall was issued after the first one returned. In this situation, based on the sequence you pasted above, the setattr and the write seem to have been issued concurrently, because the setattr() processing is not complete until you see fuse_truncate_cbk in the logs. Any other write() in the meantime is subject to the race, and the filesystem need not guarantee any particular order of completion.

A possible cause of this problem could be that VOP_SETATTR in NetBSD only 'sets' the vnode attributes in memory and returns from the syscall, so that fuse_setattr eventually reaches gluster _after_ the sys_ftruncate() syscall has returned to the application. Is this a possibility? That would explain multiple writes and the setattr/truncate executing concurrently. (Or the application is just poorly written, without understanding the concurrency expectations of syscalls, and does not see this behavior on on-disk filesystems because they don't have such upcall/scheduling issues.)

Avati

From:	Anand Avati
Subject:	Re: [Gluster-devel] write-behind bug with ftruncate
Date:	Sun, 17 Jul 2011 16:58:29 +0530