From: Raghavendra Talur
Subject: Re: [Qemu-devel] [Qemu-block] [PATCH for-2.6 2/2] block/gluster: prevent data loss after i/o error
Date: Mon, 9 May 2016 12:08:31 +0530

On Mon, Apr 11, 2016 at 9:56 AM, Raghavendra Gowdappa <address@hidden>
wrote:

>
> > +Raghavendra G who implemented this option in write-behind, to this
> > upstream patch review discussion
>
> Thanks Pranith. Kritika did inform us about the discussion. We were
> working on finding solutions to the problems raised (and it was a long
> festive weekend in Bangalore).
>
> Sorry for top-posting. I am trying to address two issues raised in this
> mail thread:
> 1. No way to identify whether setting an option succeeded in gfapi.
> 2. Why is retaining the cache after an fsync failure not the default
> behavior?
>
> 1. No way to identify whether setting an option succeeded in gfapi:
>
> I've added Poornima and Raghavendra Talur, who work on gfapi, to assist
> with this.
>

There is currently no such feature in gfapi. We could think of two
possible solutions:

a. Have the qemu-glusterfs driver require a version of glusterfs-api.rpm
that is known to have this write-behind option. In that case, if you call
glfs_set_xlator_option before glfs_init with the right key and value,
there is no way the volume set gets rejected. Note that this is an
installation-time dependency; we don't have libgfapi library versioning.
This solution is good enough if the logic in qemu is:

ret = glfs_set_xlator_option (fs, "*-write-behind",
                              "resync-failed-syncs-after-fsync", "on");
if (ret) {
    /* exit because we don't have the required environment */
}
/* proceed with work */
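
For context, a minimal sketch of how solution (a) could look on the qemu
side, assuming the xlator is matched as "*-write-behind" and the option key
is "resync-failed-syncs-after-fsync" (the names discussed in this thread);
the helper name, volume, and host parameters below are just placeholders:

/* Sketch only, not the actual qemu code: set the write-behind option
 * before glfs_init() and fail early if that is not possible. */
#include <glusterfs/api/glfs.h>

static glfs_t *open_volume_with_wb_option (const char *volume, const char *host)
{
    glfs_t *fs = glfs_new (volume);
    if (!fs)
        return NULL;

    if (glfs_set_volfile_server (fs, "tcp", host, 24007) < 0)
        goto fail;

    /* must be called before glfs_init() so the option is applied when the
     * graph is set up */
    if (glfs_set_xlator_option (fs, "*-write-behind",
                                "resync-failed-syncs-after-fsync", "on") < 0)
        goto fail;    /* exit: required environment is not available */

    if (glfs_init (fs) < 0)
        goto fail;

    return fs;

fail:
    glfs_fini (fs);
    return NULL;
}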


b. The second solution is to implement a case in write-behind's getxattr
FOP to handle such a request and reply with the current setting. Look at
dht_get_real_filename for a similar feature. You would have to implement
logic similar to the following:

ret = glfs_getxattr (fs, "/",
        "glusterfs.write-behind-resync-failed-syncs-after-fsync-status",
        &value, size);

if ((ret == -1 && errno == ENODATA) || value == 0) {
    /* write-behind in the stack does not understand this option,
     * OR we have write-behind with the required option set to off;
     * work with the assumption that there are no retries */
} else {
    /* we do have the required option */
}
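
To make the check above concrete, here is a small self-contained helper
along the same lines. Note that the virtual xattr
"glusterfs.write-behind-resync-failed-syncs-after-fsync-status" is only
proposed in this thread and is not implemented yet, so this is purely
illustrative:

#include <errno.h>
#include <string.h>
#include <glusterfs/api/glfs.h>

/* Illustrative only: returns 1 if failed syncs will be retried after a
 * failed fsync, 0 otherwise (option off, or write-behind does not
 * understand the proposed virtual xattr). */
static int wb_retries_enabled (glfs_t *fs)
{
    char value[16] = "";
    ssize_t ret = glfs_getxattr (fs, "/",
            "glusterfs.write-behind-resync-failed-syncs-after-fsync-status",
            value, sizeof (value));

    if (ret == -1 && errno == ENODATA)
        return 0;    /* write-behind in the stack does not understand this option */

    return ret > 0 && strncmp (value, "on", 2) == 0;
}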

One negative aspect of both the solutions above is the loosely coupled
nature of the logic. If the behavior ever changes in the lower layer (gfapi
or write-behind in this case), the upper layer (qemu) will have to
a. carry a dependency that defines both the minimum and maximum version of
the required package, and
b. interpret the return values with more if-else conditions to maintain
backward compatibility.

We are thinking of other ways too, but the above are the solutions that
currently come to mind.

Thanks,
Raghavendra Talur



> 2. Why is retaining the cache after an fsync failure not the default
> behavior?
>
> It was mostly to avoid breaking two synchronized applications which rely
> on fsync failures to retry. Details of the discussion can be found below.
> The other reason was that we could not figure out POSIX's take on the
> state of an earlier write after an fsync failure (other filesystems, xfs
> and ext4, had non-uniform behavior). The question, more specifically, was
> "is it correct for a write issued before a failed fsync to succeed on the
> backend storage (with persistence happening after the fsync failure)?".
> I've also added Vijay Bellur, who was involved in the earlier discussion,
> to the cc list.
>
> Following is the discussion we had earlier with Kevin, Jeff Cody and
> others on the same topic. I am quoting it verbatim below.
>
> <quote>
>
> > > > > > I would actually argue that this setting would be right for
> > > > > > everyone, not just qemu. Can you think of a case where keeping
> > > > > > the data cached after a failed fsync would actively hurt any
> > > > > > application? I can only think of cases where data is
> > > > > > unnecessarily lost if data is dropped.
> > > > > >
> > > > >
> > > > > I worry about use cases with concurrent writers to the same file
> > > > > and how different applications would handle fsync() failures with
> > > > > our new behavior.
> > > >
> > > > Any specific scenario you're worried about?
> > > >
> > > > > Keeping the known old behavior as the default will ensure that we
> > > > > do not break anything once this is out. qemu/virt users with
> > > > > gluster are accustomed to setting the virt group and hence no
> > > > > additional knobs would need to be altered by them.
> > > >
> > > > Not changing anything is a safe way to avoid regressions. But it's
> > > > also a safe way to leave bugs unfixed. I would say that the current
> > > > behaviour is at least borderline buggy and very hard for
> > > > applications to handle correctly.
> > >
> > > I tend to agree with Kevin on this. If we have an error handling
> > > strategy that is posix-compliant, I don't think there is a need to add
> > > one more option. Most of the time, options tend to be left at their
> > > default values, which is equivalent to not providing an option at all.
> > > However, before doing that, it's better we think through whether it
> > > can affect any existing deployments adversely (even when they are not
> > > posix-compliant).
> > >
> >
> > One pattern that I can think of -
> >
> > Applications that operate on the same file from different clients through
> > some locking or other co-ordination would have this pattern:
> >
> > lock (file), write (file), fsync (file), unlock (file);
> >
> > Now if the first fsync() fails, based on application logic the offset
> > used for the failed write + fsync could be re-utilized by a
> > co-ordinating application on another node to write out legitimate data.
> > When control returns back to the application that received a failure,
> > the subsequent write + fsync can cause data to be overwritten at the
> > old offset along with new data being written at the new offset.
>
> Yeah. I agree. Co-ordinated applications on different mounts will have
> issues if they are working on the assumption that after fsync no older
> writes will hit the backend. Given that there seems to be a fair bit of
> confusion about the status of retries of older writes after an fsync
> failure, we can expect some regressions. So, to summarize:
>
> 1. Keep the behaviour in patchset 11 of [1] as the default. Both fsync and
> flush act as barriers and, after a failed fsync/flush, will make sure that
> either
>    a. older writes are synced to the backend, or
>    b. old writes are failed and never retried.
>
> 2. Modify the behaviour of patchset 11 of [1] to keep failed syncs after a
> failed fsync and retry them till a flush. After a flush, no retries of
> failed syncs will be attempted. This behaviour will be introduced as an
> option.
>
> 3. Transient and non-transient errors will be treated similarly and failed
> syncs will be retried alike.
>
> Does everyone agree on the above points? If yes, I'll modify [1]
> accordingly.
>
> [1] http://review.gluster.org/#/c/12594/
>
> </quote>
>
>

