From: Anthony Liguori
Subject: [Qemu-devel] Re: Live migration protocol, device features, ABIs and other beasts
Date: Mon, 23 Nov 2009 09:32:48 -0600
User-agent: Thunderbird 2.0.0.23 (X11/20090825)

Gleb Natapov wrote:
> On Mon, Nov 23, 2009 at 09:05:58AM -0600, Anthony Liguori wrote:
>> Gleb Natapov wrote:
>>> Then I don't see why Juan claims what he claims.
>> Live migration is unidirectional.  As long as qemu can send out all
>> of the data without the stream closing, it will "succeed" on the
>> source.  While this may sound like a bug, it's an impossible problem
>> to solve as it's dealing with reliable communication between two
>> unreliable nodes (i.e. the Two Generals' Problem).  This is why the
>> source qemu does not exit after a successful live migration.  It
> As far as I remember the Two Generals' Problem talks about an unreliable
> channel, not unreliable nodes.

That's just semantics. The problem is that one general does not know whether the other general received the message. Even if there were a reliable channel between the two generals, if one of them can die with no indication, you still have the same problem: the first general doesn't know for sure whether the second received the message.

> Why not have the destination send an ACK/NACK
> to the source when it knows whether the migration succeeded or failed?

1) Source sends migration traffic
2) Destination receives it, sends Ack
3) Destination needs to wait to receive Ack from Source before starting guest to ensure that guest does not start twice
4) Source receives Ack from Destination, sends Ack
5) Source kills guest
6) Destination receives Ack from Source, starts guest

If Destination dies in between 5 and 6, the VM disappears.
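
As a rough illustration of the six steps above (a toy model, not qemu code; the function name and the "holders" bookkeeping are made up for this sketch), the following tracks which node still has a usable copy of the guest when the destination silently dies just before a given step:

```python
def run_handshake(dest_dies_before=None):
    """Walk steps 1-6 of the Ack/Ack exchange.

    The destination dies silently just before step `dest_dies_before`
    (None means nothing fails). Returns the set of nodes that still
    hold a usable copy of the guest at the end.
    """
    holders = {"source"}   # the source starts out holding the (stopped) guest
    dest_alive = True
    dest_acked = False     # has the destination's Ack gone out (step 2)?

    for step in range(1, 7):
        if step == dest_dies_before:
            dest_alive = False
            holders.discard("destination")

        if step == 2 and dest_alive:
            # Destination has received the state and sends its Ack.
            holders.add("destination")
            dest_acked = True
        elif step == 5 and dest_acked:
            # Source saw the Ack in step 4 and kills its copy. It has no
            # way of knowing whether the destination is still alive.
            holders.discard("source")
        # Steps 1, 3, 4 and 6 only move messages or start the guest;
        # they do not change who holds a copy.

    return holders


if __name__ == "__main__":
    print(run_handshake())                    # normal case
    print(run_handshake(dest_dies_before=6))  # the failure described above
```

In this model the normal run leaves only the destination holding the guest, while a destination death just before step 6 leaves nobody holding it, which is the "VM disappears" case.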

> If the source
> gets a NACK it continues, if it gets an ACK it exits, otherwise it stays in
> the paused state. Yes, there are worst-case scenarios where this will not work,
> but it will not be worse than what we have now.

It introduces a round trip in a path that's extremely sensitive to latency. Waiting for those acks == guest down time. Since it doesn't make things fundamentally reliable, why bother?

A management tool isn't in the downtime path, so it can look at both ends at its leisure to determine whether something went wrong.

>> merely stays in the stopped state.  The idea is that a third party
>> management tool can be the "reliable third party" that can make the
>> final determination about whether the migration has succeeded and
>> take actions on the source and destination nodes appropriately.
>>
>> In this precise case, if post_load() fails, it may or may not cause
>> the source to fail the migration depending on how large the TCP
>> window sizes are, how much data is in flight, and how much state is
>> left to process.
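
The buffering effect can be seen in a small sketch. This is not qemu code: a local socket pair stands in for the TCP connection and the function name is made up. The sender's write returns as soon as the kernel has buffered the data, so a receiver that fails without processing anything goes unnoticed when the remaining state is small:

```python
import socket


def one_way_send(payload: bytes) -> bool:
    """Send `payload` to a peer that closes without reading any of it.

    Returns True if the sender saw no error, i.e. it would consider
    the transfer a success.
    """
    src, dst = socket.socketpair()
    try:
        # sendall() returns once the kernel has accepted the bytes,
        # not once the peer has read or applied them.
        src.sendall(payload)
        sender_saw_success = True
    except OSError:
        sender_saw_success = False
    finally:
        # The receiver "fails" here: it goes away with the data unread.
        dst.close()
        src.close()
    return sender_saw_success


if __name__ == "__main__":
    # A small payload fits in the socket buffers, so the sender "succeeds".
    print(one_way_send(b"x" * 4096))
```

With a payload much larger than the socket buffers the write would have to wait on the receiver instead, which is the situation where the source has a chance of noticing the failure.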

> If post_load() fails it should inform management about the failure, and
> management will restart the source. Is this how it works now?

It informs management on the destination node, and management can then take appropriate action by sending "cont" to the source. This minimizes down time in the common case (successful migration).
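
A minimal sketch of that management-side decision, under some assumptions: the function name and the status strings are hypothetical, and sending "quit" to the source after a successful load is my guess at what "appropriate action" would be; only "cont" on failure comes from the description above.

```python
def arbitrate(source_status: str, dest_load_ok: bool) -> dict:
    """Decide which monitor command the management tool sends to each node.

    source_status: hypothetical label, "stopped" once the source has
                   sent all of its state, anything else otherwise.
    dest_load_ok:  whether the destination reported a successful load.
    Returns a mapping from node name to monitor command.
    """
    if source_status != "stopped":
        # Migration still in flight (or source already resumed): do nothing.
        return {}
    if dest_load_ok:
        # Assumption: the source copy is no longer needed, so tear it down.
        return {"source": "quit"}
    # Destination failed to load: resume the guest where it was.
    return {"source": "cont"}


if __name__ == "__main__":
    print(arbitrate("stopped", dest_load_ok=False))
```

None of this sits in the down time path: the guest on the destination is already running by the time the tool decides what to do with the source.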

Regards,

Anthony Liguori



