[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [RFC PATCH v2] Replication agent module

From: Ori Mamluk
Subject: Re: [Qemu-devel] [RFC PATCH v2] Replication agent module
Date: Wed, 04 Apr 2012 16:23:07 +0300
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko/20120327 Thunderbird/11.0.1

Hi Stephan,
Thanks again for the thorough and detailed review.
Some answers embedded below, generally I took all the points you mentioned.
Main changes were:
- Use qemu-thread
* WRT to this - I didn't find any recv() function there. Am I missing anything? Should I call the native recv?
        Same for send, even though there's a send_all - should I use it?
- Use qemu_socket
- Use QTAILQ for volumes list

I'll shortly send patch V3 with the changes.

I also attached a PDF with a design sketch following your request to describe the main flows.

On 02/04/2012 14:22, Stefan Hajnoczi wrote:
Feedback on specific points below.  The main thing to think about is how
to integrate with QEMU's event loop.  You have used threads in places
but are also using qemu_set_fd_handler().  Most of QEMU's functions
(including the block layer) cannot be executed from arbitrary threads
because they are not thread-safe.  So it probably makes sense to run all
repagent network code in fd handler functions as part of the QEMU event
Right. Actually I moved exclusively to set_fd_handler. The thread remains only for re-opening disconnected sockets. The systems needs to recover from a rephub disconnection - i.e. a rephub which goes down and up again.
No Qemu function is called from that thread context.

By default the build and use of this module is disabled. To activate it
you need to use a flag in ./configure and a commandline switch to qemu.
If the code is portable and doesn't have specific external library
dependencies then it's good to enable it by default.  Distros and users
can decide to disable it but I think building everything by default
makes sense - it reduces bitrot because people will be building your
code on different host platforms all the time.
I see your point. I'll change it future.
Currently as long as the module is not mature I feel better to leave it under a build flag. I think I also have platform dependent sockets code that I still need to solve.
Missing features:
* Dirty bitmap at the protected side
        The protected volume side needs to persistently keep track of
'pending' IOs.
        I want to add a bitmap (can hook like another 'filter' with the
repagent), that will synchronously track IOs. And allow getting a list of
such IOs.
        Without such a bitmap a failure of any component (rephub or agent)
will require reading the whole protected volume.
To double-check that I understand the point of this:

If the rephub is unavailable for a period of time and then comes up
again the rephub will be able to request only the regions of the
replicated volume that have been modified.

If the host or QEMU crashes does the dirty bitmap come into play?
In general, yes - it depends on how persistent we'll implement the bitmap.
An acceptable bitmap for us can reside in host memory (i.e. not survive a host crash). It should definitely survive a Qemu crash, and preferably also a rephub crash. I prefer to leave the details to next stages, and just remain aware of this 'creature'.
BTW it also deals with network disconnections/outages, not just crashes.
The bitmap allows us to avoid reading the entire protected disk after each outage event.

* Capture IOs at the recovery side
        During fail-over or fail-over test, the repagent at the recovery
side needs to capture all IOs (reads and writes) done by the fail-over VM
        and answer them by passing them synchronously to the VRA.
Does fail-over mean that some management stack functionality will spawn
a new "recovery" VM if a VM disappears?  The recovery VM will have
access to the replicated volumes and will itself run repagent.

Can you explain this step-by-step?  What is a "VRA"?
VRA should be spelled Rephub. VRA is a VM that runs the Rephub in the Zerto current product.
Fail over indeed means what you wrote.
I attached a high level design document that explains these flows.
* I wrote a stand-alone Rephub for actively reading the protected volume
and copying it - for test purposes.
Cool, would be great to see a rephub implementation.
It's still very basic and only for testing. Once I establish the repagent design I'll add it.
Should it go into the same repository?
@@ -685,6 +693,12 @@ int bdrv_open(BlockDriverState *bs, const char
*filename, int flags,
      int ret;
      char tmp_filename[PATH_MAX];

+    if (use_repagent&&  bs->device_name[0] != '\0'&&  strcmp(
+            bs->device_name, "pflash0") != 0) {
+        BlockDriver *bdrv_repagent = bdrv_find_format("repagent");
+        drv = bdrv_repagent;
+    }
      if (flags&  BDRV_O_SNAPSHOT) {
          BlockDriverState *bs1;
          int64_t total_size;
You could add a -drive option.  See blockdev.c:drive_init().

   -drive ...,repagent=on|off  (default "off")
It's not really a per-volume option.
As a way to deliver the option to the block layer I can do the same as the snapshot - and add the option to all drives when the repagent global option is enabled ( if (snapshot) qemu_opts_foreach(qemu_find_opts("drive"), drive_enable_snapshot, NULL, 0); )

There still needs to be a way to configure the rephub network connection
details.  Is it ever useful to connect one volume to rephub A and
another volume to rephub B, then the connection details would have to be
I don't see it as a per-drive option. I think a single rephub is enough.

@@ -1842,6 +1866,18 @@ static int coroutine_fn
bdrv_co_do_writev(BlockDriverState *bs,
          ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);

+    if (bs->device_name[0] != '\0') {
+        /* We split the IO only at the highest stack driver layer.
+           Currently we know that by checking device_name - only
+           highest level (closest to the guest) has that name.
+           */
+/*           repagent_handle_protected_write(bs, sector_num,
+                nb_sectors, qiov, ret);
The top-level BlockDriverState requirement can be handled if you do
something like this in blockdev.c:drive_init():

   if (use_repagent) {

Only the top-level BlockDriverState (bs) would be tracked.
I see. Sounds good.
There's still the problem of additional drivers being added above - e.g. live snapshot, but as Kevin advised I'll worry about that later on.
+       the content of a protected volume.
+       Also used to read/write to a recovery volume - the replica of a
protected volume.
It sounds like the write operation will be used to populate a recovery
+        RepCmdDataStartProtect *pcmd_dat
+static void repagent_client_read(void *opaque)
+    printf("repagent_client_read\n");
+    int bytes_read =
Did I miss where the socket is set to nonblocking?  This function should
not block if used as an fd handler in QEMU's main loop.
You didn't miss. Good point. You're referencing the risk of the read from the socket getting blocked, right?
+    if (g_cl            printf("Connected (%d)...\n", c++);
+            usleep(1 * 1000 * 1000);
+        }
This thread isn't actually doing anything!  I think this function could
be integrated into the QEMU event loop instead of having a thread.
Right. The thread is just a heartbeat responsible for reconnecting the socket.
Any periodic event would suffice.
Can you give me a pointer in the code that registers a similar event?
uint32_tzeof(RepCmd); Hint that bytes

+    }
+    if (bytecount == 0) {
+        printf("Disconnected\n");
+        return 0;
What happens if we get disconnected, looks like this thread will
continue running with nothing to do?
The g_client_state.is_connected flag will reset and the thread will busy-try to connect again

+            assert(cmd_state->curCmd.magic2 == REPCMD_MAGIC2);
Is there a more robust error handling strategy than aborting the QEMU
I'll leave a todo there - during development it's probably a good way to handle it.
+            cmd_state->pReadBuf = (uint8_t *)&cmd_state->curCmd;
+            cmd_state->bytesGotten = 0;
+            cmd_state->bytesToGet = sizeof(RepCmd);
+            cmd_state->isGotHeader = 0;
+        }
+    }
Since we're running in a dedicated thread sequential code would be
simpler than this state-machine approach with isGotHeader and isGotData?
Yes. I started sequentially, but current code uses set_fd_handler

Attachment: qemu repagent design.pdf
Description: Adobe PDF document

reply via email to

[Prev in Thread] Current Thread [Next in Thread]