From: Paolo Bonzini
Subject: Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
Date: Wed, 30 May 2012 17:06:25 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1

On 30/05/2012 14:34, Geert Jansen wrote:
> 
> On 05/29/2012 02:52 PM, Paolo Bonzini wrote:
> 
>>> Does the drive-mirror coroutine send the writes to the target in the
>>> same order as they are sent to the source? I assume so.
>>
>> No, it doesn't.  It's asynchronous; for continuous replication, the
>> target knows that it has a consistent view whenever it sees a flush on
>> the NBD stream.  Flushing the target is needed anyway before writing the
>> dirty bitmap, so the target might as well exploit these flushes to get
>> information about the state of the source.
>>
>> The target _must_ flush to disk when it receives a flush command, no
>> matter how close together the flushes are.  It _may_ choose to snapshot
>> the disk, for example establishing one new snapshot every 5 seconds.
> 
> Interesting.  So it works quite differently than I had assumed.  Some
> follow-up questions, I hope you don't mind...
> 
>  * I assume a flush roughly corresponds to an fsync() in the guest OS?

Yes (or a metadata flush from the guest OS filesystem, since our guest
models do not support attaching the FUA bit to single writes).

>  * Writes will not be re-ordered over a flush boundary, right?

More or less.  This for example is a valid ordering:

    write sector 0
                             write 0 returns
    flush
    write sector 1
                             write 1 returns
                             flush returns

However, writes that had already returned before the flush was issued
will not be re-ordered past the flush boundary.
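
To make the rule concrete, here is a toy C example (not QEMU code; every
name in it is invented for illustration) that replays the timeline above:
a flush only has to cover writes that had already returned when it was
issued, so the later write to sector 1 falls outside its scope.

#include <stdbool.h>
#include <stdio.h>

#define MAX_REQS 16

static struct { int sector; bool completed; } reqs[MAX_REQS];
static int nreqs;

static int submit_write(int sector)        /* guest issues a write */
{
    reqs[nreqs].sector = sector;
    reqs[nreqs].completed = false;
    return nreqs++;
}

static void complete_write(int id)         /* "write N returns" */
{
    reqs[id].completed = true;
}

static void issue_flush(void)              /* guest issues a flush */
{
    /* The flush only has to make already-completed writes durable;
     * anything submitted later (or still in flight) is not covered
     * and may land on either side of the flush. */
    for (int i = 0; i < nreqs; i++) {
        if (reqs[i].completed) {
            printf("flush covers sector %d\n", reqs[i].sector);
        }
    }
}

int main(void)
{
    int w0 = submit_write(0);   /* write sector 0                      */
    complete_write(w0);         /* write 0 returns                     */
    issue_flush();              /* flush issued: covers only sector 0  */
    int w1 = submit_write(1);   /* write sector 1, issued while the    */
    complete_write(w1);         /* flush is still pending: not covered */
    return 0;
}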

>> A synchronous implementation is not forbidden by the spec (by design),
>> but at the moment it's a bit more complex to implement because, as you
>> mention, it requires buffering the I/O data on the host.
> 
> So if I understand correctly, you'd only be keeping a list of
> (len, offset) tuples without any data, and drive-mirror then reads the
> data from the disk image?  If that is the case, how do you handle a
> flush?  Does a flush need to wait for drive-mirror to drain the entire
> outgoing queue to the target before it can complete?  If not, how do you
> prevent writes that happen after a flush from overwriting data that still
> has to be sent to the target, in case the target hasn't reached the flush
> point yet?

The key is that:

1) you only flush the target when you have a consistent image of the
source on the destination, and the replication server only creates a
snapshot when it receives a flush.  Thus, the server does not create a
consistent snapshot unless the client was able to keep pace with the guest.

2) target flushes do not have to coincide with a source flush.  Writes
after the last source flush _can_ be inconsistent between the source and
the destination!  What matters is that all writes up to the last source
flush are consistent.
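
As a rough illustration of (1) and (2), here is a minimal sketch of what
the replication-server side could look like.  It is not qemu-nbd or any
existing server; handle_request(), apply_write(), sync_to_disk() and
take_snapshot() are hypothetical helpers, and the NBD framing is reduced
to a plain request struct.

/* Hypothetical replication-target sketch; not qemu-nbd. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

enum req_type { REQ_WRITE, REQ_FLUSH };

struct request {
    enum req_type type;
    uint64_t offset, len;
    const void *data;
};

/* Stub helpers; a real server would touch the replica image. */
static void apply_write(uint64_t off, uint64_t len, const void *data)
{
    (void)data;
    printf("apply write: offset=%" PRIu64 " len=%" PRIu64 "\n", off, len);
}
static void sync_to_disk(void)  { printf("fdatasync the replica\n"); }
static void take_snapshot(void) { printf("snapshot the replica\n"); }

static void handle_request(const struct request *req)
{
    switch (req->type) {
    case REQ_WRITE:
        /* Writes between two flushes may arrive in any order. */
        apply_write(req->offset, req->len, req->data);
        break;
    case REQ_FLUSH:
        /* A flush on the NBD stream means the mirroring client has
         * reproduced a consistent image of the source: it MUST be made
         * durable, and it MAY additionally be snapshotted, e.g. at most
         * one new snapshot every 5 seconds. */
        sync_to_disk();
        take_snapshot();
        break;
    }
}

int main(void)
{
    struct request w = { REQ_WRITE, 512, 512, "XXXX" };
    struct request f = { REQ_FLUSH, 0, 0, 0 };
    handle_request(&w);
    handle_request(&f);
    return 0;
}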

Say the guest starts with "AAAA BBBB CCCC" on disk (4 characters = 1
sector), and then the following happens:

    guest           disk           dirty count        mirroring
 -------------------------------------------------------------------
                                       0
    write 1 = XXXX                     1
    FLUSH
                    write 1 = XXXX
                    dirty bitmap: sector 1 dirty
    write 2 = YYYY                     2
                                       1           copy sector 1 = XXXX
                                       0           copy sector 2 = YYYY
                                                   FLUSH
                    dirty bitmap: all clean
    write 0 = ZZZZ
                    write 0 = ZZZZ

and then a power loss happens on the source.

The persistent dirty bitmap now says "all clean" even though the source
is "ZZZZ XXXX CCCC" and the destination is "AAAA XXXX YYYY".  However,
this is not a problem, because both are consistent with the last flush.
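
Here is a small self-contained sketch of the mirroring loop implied by
the example, with the same three sectors.  It is not the actual
drive-mirror code; the dirty bitmap handling and the helpers are
simplified, invented stand-ins.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NUM_SECTORS  3
#define SECTOR_SIZE  4            /* "4 characters = 1 sector" as above */

static char source[NUM_SECTORS][SECTOR_SIZE + 1] = { "AAAA", "BBBB", "CCCC" };
static char target[NUM_SECTORS][SECTOR_SIZE + 1] = { "AAAA", "BBBB", "CCCC" };

static bool dirty[NUM_SECTORS];   /* in-memory dirty bitmap */
static int dirty_count;

/* Guest write path: mark the sector dirty when the write is issued. */
static void guest_write(int sector, const char *data)
{
    memcpy(source[sector], data, SECTOR_SIZE);
    if (!dirty[sector]) {
        dirty[sector] = true;
        dirty_count++;
    }
}

/* One pass of the mirroring coroutine. */
static void mirror_iteration(void)
{
    for (int s = 0; s < NUM_SECTORS; s++) {
        if (dirty[s]) {
            dirty[s] = false;
            dirty_count--;
            /* Data is read at copy time, so it may already include
             * writes issued after the last guest flush. */
            memcpy(target[s], source[s], SECTOR_SIZE);
        }
    }

    if (dirty_count == 0) {
        /* The target is now a consistent copy: flush it first, and
         * only then persist the dirty bitmap as "all clean". */
        printf("flush target, then write out clean dirty bitmap\n");
    }
}

int main(void)
{
    guest_write(1, "XXXX");       /* write 1 = XXXX, then guest FLUSH  */
    guest_write(2, "YYYY");       /* write 2 = YYYY (after the flush)  */
    mirror_iteration();           /* copies XXXX and YYYY, then FLUSH  */
    guest_write(0, "ZZZZ");       /* write 0 = ZZZZ, not yet copied    */
    printf("target: %s %s %s\n", target[0], target[1], target[2]);
    return 0;
}

Running it leaves the target at "AAAA XXXX YYYY", i.e. exactly the
destination contents in the power-loss scenario above.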

I attach a Promela model of the algorithm I'm going to implement.  It's
not exactly the one I posted upthread; I successfully ran this one
through a model checker, so it works. :)  (I tested 3 sectors / 1
write, i.e. the case above, and 2 sectors / 3 writes.  That should be
enough, given that the checker goes exhaustively through the entire
state space.)

I have another model with two concurrent writers, but it is quite messy
and I don't think it adds much.

It shouldn't be hard to follow; the only tricky thing is that multiple
branches of an "if" or "do" can be true at the same time, and if so,
all of them will be explored by the model checker.  An "else" branch is
only taken if all the other guards are false.

>> Yes, this is already all correct.
> 
> OK, I think I was confused by your description of "drive-mirror" in the
> wiki.  It says that it starts mirroring, but it also copies the source
> to the target before doing that.  It is clear from the description of
> the "sync" option, though.

Yes, the "sync" option simply fills in the dirty bitmap before starting
the actual loop.
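
For illustration only, a toy version of that pre-fill step (same
hypothetical bitmap as in the sketch above; the real option does more
than this):

#include <stdbool.h>
#include <stdio.h>

#define NUM_SECTORS 3

static bool dirty[NUM_SECTORS];
static int dirty_count;

/* With sync enabled every sector starts out dirty, so the first passes
 * of the mirroring loop copy the whole image before steady-state
 * mirroring takes over; without it, only sectors written from now on
 * are copied. */
static void start_mirror(bool sync_all)
{
    for (int s = 0; s < NUM_SECTORS; s++) {
        dirty[s] = sync_all;
    }
    dirty_count = sync_all ? NUM_SECTORS : 0;
}

int main(void)
{
    start_mirror(true);
    printf("%d sectors to copy before steady state\n", dirty_count);
    return 0;
}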

Paolo

Attachment: flush.pm
Description: Text document

