


From: Alberto Garcia
Subject: Re: [Qemu-devel] [PATCH v2 1/1] quorum: Change vote rules for 64 bits hash
Date: Mon, 22 Feb 2016 14:31:30 +0100
User-agent: Notmuch/0.13.2 (http://notmuchmail.org) Emacs/23.2.1 (i486-pc-linux-gnu)

On Sat 20 Feb 2016 03:28:03 PM CET, Max Reitz <address@hidden> wrote:

>> That said, I'm not very convinced of the current logics of the Quorum
>> flush code either, so it's not even a problem with your patch... it
>> seems to me that the code should follow the same logics as in the
>> read/write case: if the number of correct flushes >= threshold then
>> return 0, else select the most common error code.
>
> I'm not convinced of the logic either, which is why I waited for you
> to respond to this patch. :-)
>
> Intuitively, I'd expect Quorum to return an error if flushing failed
> for any of the children, because, well, flushing failed. I somehow
> feel like flushing is different from a read or write operation and
> therefore ignoring the threshold would be fine here. However, maybe my
> intuition is just off.

The way I see it is that if we have, say, 5 drives with a threshold of 3
and flushing fails in one of them, Quorum should report the error
(probably with QUORUM_REPORT_BAD, or maybe a new event) but succeed,
because we still have at least 3 images that are (in principle) fine. I
don't see why the guest should see an error in that case.

This is what I found from the original discussion when the patch was
submitted:

https://lists.gnu.org/archive/html/qemu-devel/2014-02/msg00119.html
https://lists.gnu.org/archive/html/qemu-devel/2014-02/msg00377.html

It seems that there was no discussion about why the threshold was not
used. The current code has the problem that I mentioned earlier: in the
example with 5 drives, if 2 succeed and 3 fail with different error
codes then Quorum will return 0, which feels wrong.
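
To make that failure mode concrete, here is a minimal standalone sketch.
It assumes the vote is a plain plurality over every child's return value,
with 0 counted like any other value. plurality_vote() is a hypothetical
helper written for this illustration, not the Quorum voting code:

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical model, not quorum_get_vote_winner(): return the value
 * that occurs most often in results[]. On a tie, the value that
 * appears first in the array wins.
 */
static int plurality_vote(const int *results, int n)
{
    int best, best_count = 0;

    assert(n > 0);
    best = results[0];

    for (int i = 0; i < n; i++) {
        int count = 0;

        /* Count how many children returned the same value as child i */
        for (int j = 0; j < n; j++) {
            if (results[j] == results[i]) {
                count++;
            }
        }
        if (count > best_count) {
            best_count = count;
            best = results[i];
        }
    }
    return best;
}
```

With the results { 0, -EIO, 0, -ENOSPC, -ENOMEM } the value 0 gets two
votes and each error code gets one, so this model returns 0 even though
only 2 of the 5 flushes succeeded.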

I think the correct solution would be something like the code in V10 but
counting the number of correct flushes and using them to decide whether
to report an error or not. Something like this:

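    /* Flush every child; count successes, vote on error codes only */
    for (i = 0; i < s->total; i++) {
        result = bdrv_co_flush(s->bs[i]);
        if (result) {
            result_value.l = result;
            quorum_count_vote(&error_votes, &result_value, i);
        } else {
            correct++;
        }
    }

    if (correct >= s->threshold) {
        /* Enough flushes succeeded; don't leak the last child's error */
        result = 0;
    } else {
        winner = quorum_get_vote_winner(&error_votes);
        result = winner->value.l;
    }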

> Anyway, regardless of that, if we do take the threshold into account,
> we should not use the exact error value for voting but just whether an
> error occurred or not. If all but one children fail to flush (all for
> different reasons), I find it totally wrong to return success. We
> should then just return -EIO or something.

Exactly. I think -EIO should be fine. I'm not sure why it was proposed
to vote on the return code in this case.
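
As a rough standalone model of that variant (not a patch): count the
successful flushes and return -EIO whenever fewer than the threshold
succeeded, without looking at the individual error codes. The function
name flush_result() and its array-based interface are assumptions made
for this illustration only:

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical model of the flush decision. results[] holds one return
 * value per child: 0 on success, a negative errno on failure.
 * Returns 0 if at least `threshold` flushes succeeded, -EIO otherwise.
 */
static int flush_result(const int *results, int n, int threshold)
{
    int correct = 0;

    assert(n > 0 && threshold > 0 && threshold <= n);

    for (int i = 0; i < n; i++) {
        if (results[i] == 0) {
            correct++;
        }
    }

    /* The specific error codes are deliberately ignored here */
    return correct >= threshold ? 0 : -EIO;
}
```

With 5 children and a threshold of 3, one failed flush still gives 0,
while 2 successes and 3 failures (whatever the error codes) give -EIO.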

It doesn't seem that there was any discussion other than this question
from Eric:

https://lists.gnu.org/archive/html/qemu-devel/2012-10/msg04034.html

Berto


