
Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description


From: Hongyang Yang
Subject: Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
Date: Thu, 12 Feb 2015 17:36:07 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0

Hi Fam,

On 02/12/2015 04:44 PM, Fam Zheng wrote:
On Thu, 02/12 15:40, Wen Congyang wrote:
On 02/12/2015 03:21 PM, Fam Zheng wrote:
Hi Congyang,

On Thu, 02/12 11:07, Wen Congyang wrote:
+== Workflow ==
+The following is the image of block replication workflow:
+
+        +----------------------+            +------------------------+
+        |Primary Write Requests|            |Secondary Write Requests|
+        +----------------------+            +------------------------+
+                  |                                       |
+                  |                                      (4)
+                  |                                       V
+                  |                              /-------------\
+                  |      Copy and Forward        |             |
+                  |---------(1)----------+       | Disk Buffer |
+                  |                      |       |             |
+                  |                     (3)      \-------------/
+                  |                 speculative      ^
+                  |                write through    (2)
+                  |                      |           |
+                  V                      V           |
+           +--------------+           +----------------+
+           | Primary Disk |           | Secondary Disk |
+           +--------------+           +----------------+
+
+    1) Primary write requests will be copied and forwarded to Secondary
+       QEMU.
+    2) Before Primary write requests are written to Secondary disk, the
+       original sector content will be read from Secondary disk and
+       buffered in the Disk buffer, but it will not overwrite the existing
+       sector content in the Disk buffer.

I'm a little confused by the tenses ("will be" versus "are") and terms. I am
reading them as "s/will be/are/g"

Why do you need this buffer?

We only sync the disk at each checkpoint. Until the next checkpoint, the
secondary VM writes to the buffer.


If both primary and secondary write to the same sector, what is saved in the
buffer?

The primary content will be written to the secondary disk, and the secondary
content is saved in the buffer.

I wonder if, alternatively, this is possible with an imaginary "writable backing
image" feature, as described below.

When we have a normal backing chain,

                {virtio-blk dev 'foo'}
                          |
                          |
                          |
     [base] <- [mid] <- (foo)

where [base] and [mid] are read-only and (foo) is writable. When we add an
overlay on top of an existing image,

                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
                          |                              |
                          |                              |
                          |                              |
     [base] <- [mid] <- (foo)  <---------------------- (bar)

It's important to make sure that writes to 'foo' don't break data for 'bar'.
We can utilize an automatic hidden drive-backup target:

                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
                          |                                                          |
                          |                                                          |
                          v                                                          v

     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)

                          v                              ^
                          v                              ^
                          v                              ^
                          v                              ^
                          >>>> drive-backup sync=none >>>>

So when the guest writes to 'foo', the old data is moved to (hidden target), which
remains unchanged from (bar)'s PoV.

The drive in the middle is called hidden because QEMU creates it automatically;
the naming is arbitrary.

It is interesting because it is a more generalized case of image fleecing,
where the (hidden target) is exposed via the NBD server for data-scanning
(read-only) purposes.
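
As a concrete illustration (the file names and the drive id 'foo' are
placeholders, and the hidden image is created by hand here rather than
automatically by QEMU), the fleecing-style setup can be expressed roughly as:

    # Create the hidden target as an empty qcow2 whose backing file is
    # foo's image, so unallocated clusters still read foo's old data.
    qemu-img create -f qcow2 -b foo.qcow2 hidden.qcow2

    # Start copy-before-write from the live drive into the hidden target.
    { "execute": "drive-backup",
      "arguments": { "device": "foo",
                     "target": "hidden.qcow2",
                     "sync": "none",
                     "mode": "existing",
                     "format": "qcow2" } }

The hidden target can then be exported read-only, e.g. over the built-in NBD
server, while 'foo' keeps taking writes.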

More interestingly, with the above facility it is also possible to create a
guest-visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
cheaply. Or call it a shadow copy if you will.

Back to the COLO case, the configuration will be very similar:


                       {primary wr}                                                {secondary vm}
                             |                                                            |
                             |                                                            |
                             |                                                            |
                             v                                                            v

    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)

                             v                              ^
                             v                              ^
                             v                              ^
                             v                              ^
                             >>>> drive-backup sync=none >>>>

The workflow analogue is:

+    1) Primary write requests will be copied and forwarded to Secondary
+       QEMU.

Primary write requests are forwarded to secondary QEMU as well.

+    2) Before Primary write requests are written to Secondary disk, the
+       original sector content will be read from Secondary disk and
+       buffered in the Disk buffer, but it will not overwrite the existing
+       sector content in the Disk buffer.

Before Primary write requests are written to (nbd target), aka the Secondary
disk, the original sector content is read from it and copied to (hidden buf
disk) by drive-backup. It obviously will not overwrite the data in (active
disk).

+    3) Primary write requests will be written to Secondary disk.

Primary write requests are written to (nbd target).

+    4) Secondary write requests will be buffered in the Disk buffer and it
+       will overwrite the existing sector content in the buffer.

Secondary write requests are written to (active disk) as usual.

Finally, when a checkpoint arrives, if you want to sync with the primary, just
drop the data in (hidden buf disk) and (active disk); when failover happens, if
you want to promote the secondary VM, you can commit (active disk) to (nbd
target) and drop the data in (hidden buf disk).
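
As a rough QMP illustration of the failover half (the drive id 'secondary-disk'
and the file name 'nbd-target.qcow2' are placeholders; dropping the (hidden buf
disk) data beforehand is assumed rather than shown):

    # Merge the secondary VM's writes from (active disk) down into (nbd target).
    { "execute": "block-commit",
      "arguments": { "device": "secondary-disk",
                     "base": "nbd-target.qcow2" } }

    # Active-layer commit runs as a mirror job; pivot once it reports ready.
    { "execute": "block-job-complete",
      "arguments": { "device": "secondary-disk" } }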

If I understand correctly, you split the Disk Buffer into a hidden buf disk +
an active disk. So what we need to implement is only a buf disk (used as the
hidden buf disk and the active disk as mentioned); apart from that, we can use
existing mechanisms like backing files and drive-backup?
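
For example, the backing relationships above could be prepared with plain
qemu-img (file names are placeholders; in your proposal the buf disk would be
created by QEMU automatically):

    # (nbd target) <- (hidden buf disk) <- (active disk)
    qemu-img create -f qcow2 -b nbd-target.qcow2 hidden-buf.qcow2
    qemu-img create -f qcow2 -b hidden-buf.qcow2 active-disk.qcow2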


Fam


--
Thanks,
Yang.


