Re: [Qemu-devel] live block copy/stream/snapshot discussion

From: Stefan Hajnoczi
Subject: Re: [Qemu-devel] live block copy/stream/snapshot discussion
Date: Mon, 11 Jul 2011 13:54:32 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

On Tue, Jul 05, 2011 at 05:17:49PM +0300, Dor Laor wrote:
> Anthony advised to clone 
> http://wiki.qemu.org/index.php?title=Features/LiveBlockMigrationFuture
> to the list in order to encourage discussion, so here it is:
> ------------------------------------------------------------------------
>  qemu is expected to support these features (some already implemented):
> = Live features =
> == Live block copy ==
>    Ability to copy 1+ virtual disk from the source backing file/block
>    device to a new target that is accessible by the host. The copy
>    supposed to be executed while the VM runs in a transparent way.
> == Live snapshots and live snapshot merge ==
>    Live snapshot is already incorporated (by Jes) in qemu (still need
>    virt-agent work to freeze the guest FS).
>    Live snapshot merge is required in order of reducing the overhead
>    caused by the additional snapshots (sometimes over raw device).
>    We'll use live copy to do the live merge

This line seems outdated.  Kevin and Marcelo have suggested a separate
live commit operation that does not use the unified block copy/image
streaming mechanism.

> = Solutions =
> == Non shared storage ==
>    Either use iscsi (target and initiator) or NBD or proprietary qemu
>    solution. iScsi in theory is the best but there is a problem of
>    dealing with COW images - iScsi cannot report the COW level and
>    detect un-allocated blocks. This might force us to use
>    proprietary solution.
>    An interesting option (by Orit Wasserman) was to use iScsi for
>    exporting the images externally to qemu level and qemu will access
>    as if they were a local device. This can work well w/o almost any
>    effort. What do we do with chains of COW files? We create up to N
>    such iscsi connections for every COW file in the chain.

If there is a discovery mechanism to locate LUNs then it would be
possible to use this approach.

However, using iSCSI but placing all the copy-on-write intelligence into
the QEMU initiator is overkill since we need to support SAN/NAS
appliances that provide snapshots, copy-on-write, and thin provisioning
anyway.  If you look at what other hypervisors are doing, they are
trying to offload as much storage processing onto the appliance as
possible.

We probably want the appliance to do those operations for us, so
implementing them in the initiator for some cases is duplicating that
code and making the system more complex.

The real problem is that we're lacking a library interface to manage
volumes, including snapshots.  I don't think that QEMU needs to drive
this interface.  It should be libvirt (which deals with storage pools
and volumes today already).

Once we do have an interface defined, I think it makes less sense to
implement all of this in QEMU when this storage management
functionality really belongs in NAS/SAN appliances and software targets.

> == Live block migration ==
>    Use the streaming approach + regular live migration + iscsi:
>    Execute regular live migration and at the end of it, start streaming.
>    If there is no shared storage, use the external iscsi and behave as
>    if the image is local. At the end of the streaming operation there
>    will be a new local base image.
> == Block mirror layer ==
>    Was invented in order to duplicate write IOs for the source and
>    destination images. It prevents the potential race when both qemu
>    and the management crash at the end of the block copy stage and it
>    is unknown whether management should pick the source or the
>    destination
> == Streaming ==
>    No need for mirror since only the destination changes and is
>    writable.
> == Block copy background task ==
>    Can be shared between block copy and streaming
> == Live snapshot ==
>    It can be seen as a (local) stream that preserve the current COW
>    chain
> = Use cases =
>  1. Basic streaming, single base master image on source storage, need
>     to be instantiated on destination storage
>      The base image is a single level COW format (file or lvm).
>      The base is RO and only new destination is RW. base' is empty at
>      the beginning. The base image content is being copied in the
>      background to base'. At the end of the operation, base' is a
>      standalone image w/o depending on the base image.
>      a. Case of a shared storage streaming guest boot
>      Before:           src storage: base             dst storage: none
>      After             src storage: base             dst storage: base'
>      b. Case of no shared storage streaming guest boot
>         Every thing is the same, we use external iscsi target on the
>         src host and external iscsi initiator on the destination host.
>         Qemu boots from the destination by using the iscsi access. This
>         is transparent to qemu (expect cmd syntax change ). Once the
>         streaming is over, we can live drop the usage of iscsi and open
>         the image directly (some sort of null live copy)
>      c. Live block migration (using streaming) w/ shared storage.
>         Exactly like 1.a. First create the destination image, then we
>         run live migration there w/o data in the new image. Now we
>         stream like the boot scenario.
>      d. Live block migration (using streaming) w/o shared storage.
>         Like 1.b. + 1.c.
>      *** There is complexity to handle multiple block device belonging
>      to the same VM. Management will need to track each stream finish
>      event and manage failures accordingly.

This is tangential but recently I've been thinking about the two roles
that libvirt plays:
1. Hypervisor-neutral management API
2. KVM high-level functionality (image lifecycle, CPU affinity,
   networking configuration)

Libvirt is accumulating KVM-specific high-level functionality that
really should be in a KVM API or qemud.  Otherwise libvirt will become
lopsided with a significant part of the codebase doing KVM-specific
things simply because there was no other place to put this
functionality.

Image lifecycle is one area where we could help.  qemu-img is good but
doesn't meet the needs of libvirt, which reimplements a bunch of image
management functionality.

When we talk about managing backing files or external dirty bitmaps I
fear this imbalance will get worse.  libvirt will be *doing* a lot of
the work instead of *delegating* what needs to be done to virtualization
software (like VMware APIs).

I feel we're missing something between qemu (scope: single guest
instance) and libvirt (scope: host-wide hypervisor-neutral management
API).

>  2. Basic streaming of raw files/devices

s/Basic streaming of raw files/Basic streaming to raw files/

I think this makes it clearer that the issue is keeping track of
streamed blocks in the destination file.

>     Here we have an issue - what happens if there is a failure in the
>     middle? Regular COW can sustain a failure since the intermediate
>     base' contains information dirty bit block information. Such a
>     base' intermediate raw image will be broken. We cannot revert back
>     to the original base and start over because new writes were written
>     only to the base'.
>     Approaches:
>     a. Don't support that
>     b. Use intermediate COW image and then live copy it into raw (waste
>        time, IO, space). One can easily add new COW over the source and
>        continue from there.
>     c. Use external metadata of dirty-block-bitmap even for raw
>     Suggestion: at this stage, do either recommendation #a or #b

I think #a is fine as a starting point.  #c can be added as a feature
later and should not require major changes.
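
As a rough illustration of why #c is mostly bookkeeping, here is a minimal
sketch of an external "which blocks are already in the destination" bitmap
for a raw target.  Nothing here is QEMU code: StreamTracker, guest_write,
guest_read, stream_step and the tiny BLOCK_SIZE are hypothetical names chosen
for the example, and the images are plain in-memory byte arrays.

```python
BLOCK_SIZE = 4  # tiny block size so the example is easy to follow


class StreamTracker:
    """Hypothetical external bitmap: one flag per block of the raw destination."""

    def __init__(self, num_blocks):
        # 0 = block still lives only in the base image
        # 1 = block is valid in the destination
        self.present = bytearray(num_blocks)

    def mark(self, block):
        self.present[block] = 1

    def is_present(self, block):
        return self.present[block] == 1

    def pending(self):
        """Blocks that still have to be streamed from the base image."""
        return [i for i, flag in enumerate(self.present) if not flag]


def _span(block):
    """Byte range covered by a block."""
    return slice(block * BLOCK_SIZE, (block + 1) * BLOCK_SIZE)


def guest_write(dst, tracker, block, data):
    """Guest writes land only in the destination and mark the block as present."""
    assert len(data) == BLOCK_SIZE
    dst[_span(block)] = data
    tracker.mark(block)


def guest_read(base, dst, tracker, block):
    """Read from the destination if the block is there, otherwise from the base."""
    source = dst if tracker.is_present(block) else base
    return bytes(source[_span(block)])


def stream_step(base, dst, tracker, max_blocks):
    """Copy up to max_blocks pending blocks from base to dst; return how many."""
    copied = 0
    for block in tracker.pending()[:max_blocks]:
        dst[_span(block)] = base[_span(block)]
        tracker.mark(block)
        copied += 1
    return copied


base = bytearray(b"AAAABBBBCCCCDDDD")
dst = bytearray(len(base))
tracker = StreamTracker(len(base) // BLOCK_SIZE)

guest_write(dst, tracker, 2, b"xxxx")          # new data exists only in dst
stream_step(base, dst, tracker, max_blocks=2)  # streams blocks 0 and 1
# If streaming is interrupted here, the bitmap says block 3 is still pending,
# so the copy can resume without touching the guest's write to block 2.
stream_step(base, dst, tracker, max_blocks=2)  # streams block 3
```

A real implementation would also have to persist the bitmap and only set a
bit once the corresponding data is safely on disk; the sketch leaves that out.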

>  3. Basic live copy, single base master image on source storage, need
>     to be copied to the destination storage
>     The base image is a single level COW format or a raw file/device.
>     The base image content is being copied in the background to base'.
>     At the end of the operation, base' is a standalone image w/o
>     depending on the base image. In this case we only take into account
>     a running VM, no need to do that for boot stage.
>     So it is either VM running locally and about to change its storage
>     or a VM live migration. The plan is to use the mirror driver
>     approach. Both src/dst are writable.

I think this is outdated.  I believe Marcelo stated that the mirror
driver is not needed and that streaming can be used for live block
migration (pre-copy).  So there is no difference between this and basic
streaming.

> == Exceptions ==
>  1. Hot unplug of the relevant disk
>     Prevent that. (or cancel the operation)
>  1. Live migration in the middle of non migration action from above
>     Shall we allow it? It can work but at the end of live migration we
>     need to reopen the images (NFS mainly), it might add un-needed
>     complexity.
>     We better prevent that.

I think having state to lock devices is a good idea.  It helps prevent human
errors.  The only thing to watch out for is that the guest should not be
able to lock devices - otherwise the guest can prevent the
administrator's actions.  Never trust the guest :).
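
To make the locking idea concrete, here is a small sketch under my own
assumptions: DeviceLocks, DeviceBusy and the from_guest flag are hypothetical
and do not correspond to any existing QEMU or libvirt interface.  The point is
only that lock requests are accepted from the management side, never from the
guest, and that hot unplug is refused while an operation holds the device.

```python
class DeviceBusy(Exception):
    """Raised when a device is held by an ongoing block operation."""


class DeviceLocks:
    """Hypothetical per-VM table of devices locked by management operations."""

    def __init__(self):
        self._owner = {}  # device id -> name of the operation holding it

    def lock(self, device, operation, from_guest=False):
        # Never trust the guest: only the management side may take a lock.
        if from_guest:
            raise PermissionError("guest may not lock devices")
        if device in self._owner:
            raise DeviceBusy(f"{device} is in use by {self._owner[device]}")
        self._owner[device] = operation

    def unlock(self, device, from_guest=False):
        if from_guest:
            raise PermissionError("guest may not unlock devices")
        self._owner.pop(device, None)

    def check_unplug(self, device):
        """Refuse hot unplug while a block operation holds the device."""
        if device in self._owner:
            raise DeviceBusy(f"{device} is in use by {self._owner[device]}")


locks = DeviceLocks()
locks.lock("virtio0", "stream")      # management starts streaming
try:
    locks.check_unplug("virtio0")    # unplug is refused during the operation
except DeviceBusy as err:
    print("unplug refused:", err)
locks.unlock("virtio0")              # operation finished
locks.check_unplug("virtio0")        # unplug is allowed again
```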

