From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
Date: Tue, 18 Feb 2014 12:45:50 +0000
User-agent: Mutt/1.5.21 (2010-09-15)

* address@hidden (address@hidden) wrote:
> From: "Michael R. Hines" <address@hidden>
> 
> Wiki: http://wiki.qemu.org/Features/MicroCheckpointing
> Github: address@hidden:hinesmr/qemu.git, 'mc' branch
> 
> NOTE: This is a direct copy of the QEMU wiki page for the convenience
> of the review process. Since this series is very much in flux, instead of
> maintaining two copies of the documentation in two different formats, this
> documentation will be properly formatted in the future when the review
> process has completed.

It seems to be picking up some truncations as well.

> +The Micro-Checkpointing Process
> +Basic Algorithm
> +Micro-Checkpoints (MC) work against the existing live migration path in 
> QEMU, and can effectively be understood as a "live migration that never 
> ends". As such, iteration rounds happen at the granularity of 10s of 
> milliseconds and perform the following steps:
> +
> +1. After N milliseconds, stop the VM.
> +2. Generate a MC by invoking the live migration software path to identify 
> and copy dirty memory into a local staging area inside QEMU.
> +3. Resume the VM immediately so that it can make forward progress.
> +4. Transmit the checkpoint to the destination.
> +5. Repeat.
> +Upon failure, load the contents of the last MC at the destination back into 
> memory and run the VM normally.
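A minimal sketch of the loop described above, in Python. The `vm` and `dest` handles and their methods (`pause`, `copy_dirty_pages`, `resume`, `transmit`) are hypothetical stand-ins, not QEMU APIs, and the interval value is illustrative:

```python
import time

CHECKPOINT_INTERVAL_MS = 50  # the "N milliseconds" above; value is illustrative


def mc_round(vm, dest):
    """One iteration of the micro-checkpointing (MC) loop sketched above.
    `vm` and `dest` are hypothetical handles, not QEMU APIs."""
    vm.pause()                        # 1. stop the VM
    staging = vm.copy_dirty_pages()   # 2. generate the MC into a local staging area
    vm.resume()                       # 3. resume the VM immediately
    dest.transmit(staging)            # 4. transmit the checkpoint to the destination


def mc_protect(vm, dest):
    """The 'live migration that never ends': repeat rounds every N ms."""
    while True:                       # 5. repeat
        time.sleep(CHECKPOINT_INTERVAL_MS / 1000.0)
        mc_round(vm, dest)
```

The key property is that the VM is only paused for steps 1-3; transmission overlaps with normal execution.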

Later you talk about the memory allocation and how you grow the memory as needed
to fit the checkpoint, have you tried going the other way and triggering the
checkpoints sooner if they're taking too much memory?

> +1. MC over TCP/IP: Once the socket connection breaks, we assume
> failure. Failure is detected very early in the loss of the latest MC, not
> only because a very large number of bytes is typically being sequenced in
> a TCP stream, but perhaps also because of the timeout on acknowledgement
> of the receipt of a commit message by the destination.
> +
> +2. MC over RDMA: Since Infiniband does not provide any underlying
> timeout mechanisms, this implementation enhances QEMU's RDMA migration
> protocol to include a simple keep-alive. Upon the loss of multiple
> keep-alive messages, the sender is deemed to have failed.
> +
> +In both cases, whether due to a failed TCP socket connection or a lost 
> RDMA keep-alive, either the sender or the receiver can be deemed to have 
> failed.
> +
> +If the sender is deemed to have failed, the destination takes over 
> immediately using the contents of the last checkpoint.
> +
> +If the destination is deemed to be lost, we perform the same action
> as in a live migration: resume the sender normally and wait for management
> software to make a policy decision about whether or not to re-protect
> the VM, which may involve a third party to identify a new destination
> host to use as a backup for the VM.
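The keep-alive rule for the RDMA case can be sketched as follows; the interval and threshold are illustrative assumptions, since the patch does not specify them:

```python
import time

KEEPALIVE_INTERVAL_S = 1.0  # illustrative; the patch does not specify values
MAX_MISSED = 3              # peer deemed failed after this many missed keep-alives


def peer_failed(last_keepalive_ts, now=None):
    """Loss-of-multiple-keep-alives rule sketched above: declare the peer
    dead once several consecutive keep-alive intervals pass with no
    message received. Timestamps are monotonic-clock seconds."""
    if now is None:
        now = time.monotonic()
    return (now - last_keepalive_ts) > MAX_MISSED * KEEPALIVE_INTERVAL_S
```

Each side would refresh `last_keepalive_ts` on every keep-alive received and poll `peer_failed()` periodically.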

In this world what is making the decision about whether the sender/destination
should win - how do you avoid a split brain situation where both
VMs are running but the only thing that failed is the comms between them?
Is there any guarantee that you'll have received knowledge of the comms
failure before you pull the plug out and enable the corked packets to be
sent on the sender side?

<snip>

> +RDMA is used for two different reasons:
> +
> +1. Checkpoint generation (RDMA-based memcpy):
> +2. Checkpoint transmission
> +Checkpoint generation must be done while the VM is paused. In the
> worst case, the size of the checkpoint can be equal to the total amount
> of memory in use by the VM. In order to resume VM execution as
> fast as possible, the checkpoint is consistently copied into a local
> staging area before transmission. A standard memcpy() of potentially
> such a large amount of memory not only gets no use out of the CPU cache
> but also potentially clogs up the CPU pipeline, which would otherwise
> be available to other neighbor VMs on the same physical node that could
> be scheduled for execution. To minimize the effect on neighbor VMs, we
> use RDMA to perform a "local" memcpy(), bypassing the host processor. On
> more recent processors, a 'beefy' enough memory bus architecture can
> move memory just as fast as (sometimes faster than) a pure-software
> CPU-only optimized memcpy() from libc. However, on older computers, this
> feature only gives you the benefit of lower CPU utilization at the expense of

Isn't there a generic kernel DMA ABI for doing this type of thing? (I
think there was at one point; people have suggested things like using
graphics cards to do it, but I don't know if it ever happened.)
The other question is, do you always need to copy - what about something
like COWing the pages?
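The copy-on-write idea raised above can be illustrated with fork(): the kernel snapshots the whole address space COW, so a child process can serialize a frozen view of guest memory while the parent resumes immediately. This is only a sketch of the general technique, not anything the patch implements; `serialize` is a hypothetical callback, and real code would not block in waitpid():

```python
import os


def cow_checkpoint(serialize):
    """Sketch of a COW-based checkpoint: fork() gives the child a
    copy-on-write snapshot of memory to write out, while the parent
    (the VM) keeps running and dirtying its own pages."""
    pid = os.fork()
    if pid == 0:
        serialize()        # child sees a frozen COW view of memory
        os._exit(0)
    os.waitpid(pid, 0)     # real code would not block here; shown for clarity
```

Whether this beats an explicit copy depends on the page-fault cost the parent pays on every write to a shared page after the fork.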

Dave
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK


