Re: [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuou

From:	Wen Congyang
Subject:	Re: [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing
Date:	Tue, 6 Jan 2015 09:28:20 +0800
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0

On 01/05/2015 06:44 PM, Dr. David Alan Gilbert wrote:
> * Paolo Bonzini (address@hidden) wrote:
>>
>>
>> On 26/12/2014 04:31, Yang Hongyang wrote:
>>> Please feel free to comment.
>>> We want as many comments and as much feedback as possible, please; thanks in advance.
>>
>> Hi Yang,
>>
>> I think it's possible to build COLO block replication from many basic
>> blocks that are already in QEMU.  The only new piece would be the disk
>> buffer on the secondary.
>>
>>          virtio-blk       ||
>>              ^            ||                            .----------
>>              |            ||                            | Secondary
>>         1 Quorum          ||                            '----------
>>          /      \         ||
>>         /        \        ||
>>    Primary      2 NBD  ------->  2 NBD
>>      disk       client    ||     server                  virtio-blk
>>                           ||        ^                         ^
>> --------.                 ||        |                         |
>> Primary |                 ||  Secondary disk <--------- COLO buffer 3
>> --------'                 ||                   backing
>>
> 
> I think the other thing about this structure is that it provides
> a way of doing an initial synchronisation of the secondary's disk at
> the start of COLO operation by using the NBD server (which I think is
> similar to the way the newer migration does it?)
> 
>> 1) The disk on the primary is represented by a block device with two
>> children, providing replication between a primary disk and the host that
>> runs the secondary VM.  The read pattern patches for quorum
>> (http://lists.gnu.org/archive/html/qemu-devel/2014-08/msg02381.html) can
>> be used/extended to make the primary always read from the local disk
>> instead of going through NBD.
>>
>> 2) The secondary disk receives writes from the primary VM through QEMU's
>> embedded NBD server (speculative write-through).
>>
>> 3) The disk on the secondary is represented by a custom block device
>> ("COLO buffer").  The disk buffer's backing image is the secondary disk,
>> and the disk buffer uses bdrv_add_before_write_notifier to implement
>> copy-on-write, similar to block/backup.c.
>>
>> 4) Checkpointing can use new bdrv_prepare_checkpoint and
>> bdrv_do_checkpoint members in BlockDriver to discard the COLO buffer,
>> similar to your patches (you did not explain why you do checkpointing in
>> two steps).  Failover instead is done with bdrv_commit or can even be
>> done without stopping the secondary (live commit, block/commit.c).
>>
>>
>> The missing parts are:
>>
>> 1) NBD server on the backing image of the COLO buffer.  This means the
>> backing image needs its own BlockBackend.  Apart from this, no new
>> infrastructure is needed to receive writes on the secondary.
>>
>> 2) Read pattern support for quorum needs to be extended for the needs of
>> the COLO primary.  It may be simpler or faster to write a simple
>> "replication" driver that writes to N children but always reads from the
>> first.  But in any case initial tests can be done with the quorum
>> driver, even without read pattern support.  Again, all the network
>> infrastructure to replicate writes already exists in QEMU.
>>
>> 3) Of course the disk buffer itself.
> 
> I think there's also:
>   a) How does the secondary become a primary - e.g. after
>      the original primary dies and you need to bring it back into
>      resilience; the block structure has to morph into the primary
>      with the quorum etc

What about this:

         virtio-blk       ||                                       virtio-blk
             ^            ||                .----------                ^
             |            ||                | Secondary                |
        1 Quorum          ||                '----------            1 Quorum
         /      \         ||                                       /      \
        /        \        ||                                      /        \
   Primary      2 NBD  ------->  2 NBD                           /          \
     disk       client    ||     server                         /            \
                          ||        ^                          /              \
--------.                 ||        |                         /                \
Primary |                 ||  Secondary disk <--------- COLO buffer 3          4 NBD
--------'                 ||                   backing                         client

The NBD client on the secondary will start working when the secondary becomes
the primary.
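
As a rough illustration of the diagram above, here is a toy Python model (not
QEMU code) of a node that writes to every attached child but always reads from
the first one, with the remote child attached only at failover. The names
`Disk`, `ReplicatingDisk` and `attach_remote` are made up for this sketch; in
QEMU the equivalent pieces would be the quorum driver and an NBD client.

```python
class Disk:
    """A trivially simple block store: sector number -> bytes."""

    def __init__(self):
        self.sectors = {}

    def write(self, sector, data):
        self.sectors[sector] = data

    def read(self, sector):
        # Unwritten sectors read back as a zero byte.
        return self.sectors.get(sector, b"\0")


class ReplicatingDisk:
    """Writes go to all children; reads come from the first (local) child."""

    def __init__(self, local):
        self.children = [local]

    def attach_remote(self, remote):
        # Hypothetical hook: called when this side becomes the primary and
        # its NBD client (4 in the diagram) starts forwarding writes.
        self.children.append(remote)

    def write(self, sector, data):
        for child in self.children:
            child.write(sector, data)

    def read(self, sector):
        return self.children[0].read(sector)


# Before failover the node has only its local child.
local, new_secondary = Disk(), Disk()
node = ReplicatingDisk(local)
node.write(0, b"a")
assert new_secondary.read(0) == b"\0"

# After failover the remote child is attached and receives new writes.
node.attach_remote(new_secondary)
node.write(1, b"b")
assert new_secondary.read(1) == b"b"
assert node.read(1) == b"b"
```

In this model, sector 0 was written before the remote child was attached and
is therefore missing on the new secondary, so the disks would still need to be
synced before replication resumes.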

> 
>   b) There's some sequencing needed somewhere to ensure that at a
>     checkpoint boundary, the secondary restarts its buffer at the
>     right point after all the writes from the previous checkpoint have
>     been received and before any writes coming from after the checkpoint.
>     Similarly at failover to make sure there aren't any left over blocks
>     still going through the nbd server.

The NBD client sends each write request to the NBD server, and the NBD server
returns an ACK to the NBD client. We will wait for all ACKs when we stop the VM.
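
A minimal sketch of that bookkeeping, as a toy Python model rather than QEMU
code: every write sent to the NBD server is recorded until its ACK arrives,
and a checkpoint is allowed only once the VM is stopped and nothing is still
in flight. The names `InflightWrites` and `can_checkpoint` are hypothetical,
and the sketch assumes ACKs may arrive in any order.

```python
class InflightWrites:
    """Tracks write requests that were sent but not yet acknowledged."""

    def __init__(self):
        self.pending = set()
        self.next_id = 0

    def send(self):
        # Record a new request and return its id (a stand-in for the
        # handle that pairs an NBD request with its reply).
        req_id = self.next_id
        self.next_id += 1
        self.pending.add(req_id)
        return req_id

    def ack(self, req_id):
        # The server's reply arrived, so this write is no longer in flight.
        self.pending.discard(req_id)

    def drained(self):
        # True only when every sent write has been acknowledged.
        return not self.pending


def can_checkpoint(vm_stopped, inflight):
    # Safe once the VM issues no new writes and no earlier write is
    # still on its way to the secondary.
    return vm_stopped and inflight.drained()


writes = InflightWrites()
a, b = writes.send(), writes.send()
writes.ack(a)
assert not can_checkpoint(True, writes)  # b is still in flight
writes.ack(b)
assert can_checkpoint(True, writes)
```

The same check would apply at failover, to make sure no leftover blocks are
still going through the NBD server.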

> 
>   c) Someone always has to have a valid disk after a power failure;
>     I guess the worst case is that the primary goes first, the secondary
>     starts replaying its buffer to disk but then dies part way through
>     the replay.

COLO will use migration to do the first checkpoint, so we can use disk migration
to sync the disk first, and then start disk replication.

Thanks
Wen Congyang

> 
> Dave
>     
>> Paolo
>>
>>> Thanks,
>>> Yang.
>>>
>>> Wen Congyang (1):
>>>   PoC: Block replication for COLO
>>>
>>> Yang Hongyang (1):
>>>   Block: Block replication design for COLO
>>>
>>>  block.c                   |  48 +++++++
>>>  block/blkcolo.c           | 338 ++++++++++++++++++++++++++++++++++++++++++++++
>>>  docs/blkcolo.txt          |  85 ++++++++++++
>>>  include/block/block.h     |   6 +
>>>  include/block/block_int.h |  21 +++
>>>  5 files changed, 498 insertions(+)
>>>  create mode 100644 block/blkcolo.c
>>>  create mode 100644 docs/blkcolo.txt
>>>
> --
> Dr. David Alan Gilbert / address@hidden / Manchester, UK
> 
> .
>
