Re: [Qemu-devel] [RFC PATCH] replication agent module


From: Ori Mamluk
Subject: Re: [Qemu-devel] [RFC PATCH] replication agent module
Date: Tue, 07 Feb 2012 16:18:06 +0200
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0) Gecko/20111222 Thunderbird/9.0.1

On 07/02/2012 15:50, Stefan Hajnoczi wrote:

First let me say that I'm not fully used to inline replies yet, so I initially missed part of your mail.
On Tue, Feb 7, 2012 at 1:34 PM, Kevin Wolf<address@hidden>  wrote:
Am 07.02.2012 11:29, schrieb Ori Mamluk:
Repagent is a new module that allows an external replication system to
replicate a volume of a Qemu VM.
I recently joked with Kevin that QEMU is on its way to reimplementing
the Linux block and device-mapper layers.  Now we have drbd, thanks!
:P

Except for image files, the way to do this on a Linux host would be
using drbd block devices.  We still haven't figured out a nice way to
make image files full-fledged Linux block devices, so we're
reimplementing all the block code in QEMU userspace.

This RFC patch adds the repagent client module to Qemu.



Documentation of the module role and API is in the patch at
replication/qemu-repagent.txt



The main motivation behind the module is to allow replication of VMs in
a virtualization environment like RhevM.

To achieve this we need basic replication support in Qemu.



This is the first submission of this module, which was written as a
Proof Of Concept, and used successfully for replicating and recovering a
Qemu VM.
I'll mostly ignore the code for now and just comment on the design.

One thing to consider for the next version of the RFC would be to split
this in a series smaller patches. This one has become quite large, which
makes it hard to review (and yes, please use git send-email).

Points and open issues:

*             The module interfaces with the Qemu storage stack at the
generic block.c layer. Is this the right place to intercept/inject I/Os?
There are two ways to intercept I/O requests. The first one is what you
chose, just add some code to bdrv_co_do_writev, and I think it's
reasonable to do this.

The other one would be to add a special block driver for a replication:
protocol that writes to two different places (the real block driver for
the image, and the network connection). Generally this feels even a bit
more elegant, but it brings new problems with it: For example, when you
create an external snapshot, you need to pay attention not to lose the
replication because the protocol is somewhere in the middle of a backing
file chain.

*             The patch performs I/O reads from a new thread (a TCP
listener thread). See repaget_read_vol in repagent.c. It is not
protected by any lock – is this OK?
No, definitely not. Block layer code expects that it holds
qemu_global_mutex.

I'm not sure if a thread is the right solution. You should probably use
something that resembles other asynchronous code in qemu, i.e. either
callback or coroutine based.
There is a flow control problem here which is interesting.  If the
rephub is slower than the writer or unavailable, then eventually we
either need to stop replicating writes or we need to throttle the
guest writes.  I haven't read through the whole patch yet but the flow
control solution is very closely tied to how you use
threads/coroutines and how you use network sockets.
In general, the replication is naturally less important than the main (production) volume.
This means the solution aims never to throttle guest writes.
At the current stage, both I/Os need to complete before completion is reported to the guest, but the volume write is a real write to storage while the rephub write may involve only a copy to memory. Later on we can stop waiting for the replicated I/O altogether by adding a bitmap - but that is for a later stage.


+             * Read a protected volume - allows the Rephub to read a
protected volume, so that the hub can synchronize the contents of a
protected volume.
We were discussing using NBD as the protocol for any data that is
transferred from/to the replication hub, so that we can use the existing
NBD client and server code that qemu has. It seems you came to the
conclusion to use a different protocol? What are the reasons?

The other message types could possibly be implemented as QMP commands. I
guess we might need to attach multiple QMP monitors for this to work
(one for libvirt, one for the rephub). I'm not sure if there is a
fundamental problem with this or if it just needs to be done.
Agreed.  You can already query block devices using QMP 'query-block'.
By adding in-process NBD server support you could then launch an NBD
server for each volume which you wish to replicate.  However, in this
case it sounds almost like you want the reverse - you could provide an
NBD server on the rephub and QEMU would mirror writes to it (the NBD
client code is already in QEMU).
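For reference, a QMP session using the existing query-block command might look roughly like this (field list abbreviated; the separate monitor connection for the rephub is the part still under discussion):

```
-> { "execute": "qmp_capabilities" }
<- { "return": {} }
-> { "execute": "query-block" }
<- { "return": [ { "device": "ide0-hd0", "removable": false,
                   "locked": false,
                   "inserted": { "file": "disk.img", "ro": false,
                                 "drv": "qcow2" } } ] }
```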

There is also interest from other external software (like libvirt) to
be able to read volumes while the VM is running.

BTW, do you poll the volumes or how do you handle hotplug?  Does
anything special need to be done when a volume is unplugged?
We assume hotplug is handled top-down - via the management system, not from within the VM. In general, we don't protect 'all volumes' of a VM: the management system (either RhevM or the Rephub, depending on the design) specifically instructs us to start protecting a volume.
Stefan



