Re: [Qemu-devel] [PATCH v7 01/42] Start documenting how postcopy works.
From: Paolo Bonzini
Subject: Re: [Qemu-devel] [PATCH v7 01/42] Start documenting how postcopy works.
Date: Thu, 18 Jun 2015 10:28:35 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0
On 18/06/2015 09:50, Li, Liang Z wrote:
> Do you have any idea or plan to deal with the failure happened during
> the postcopy phase?
>
> Lost the guest is too frightening for a cloud provider, we have a
> discussion with Alibaba, they said that they can't use the postcopy
> feature unless there is a mechanism to find the guest back.
There's no solution to this problem, except for rollback to a previous
snapshot.
To give an idea, one intended use case for postcopy is evacuating a
datacenter within 30 minutes of a tsunami alert. That's not a
case where you care much about losing guests to network failures.
Why is there no solution? Let's look at one of the best surveys on
migration,
http://courses.cs.vt.edu/~cs5204/fall05-kafura/Papers/Migration/ProcessMigration.pdf
(warning, 59 pages!):
    [3.2] If only part of the task state is transferred to another node,
    the task can start executing sooner, and the initial migration costs
    are lower.

    [3.4] Fault resilience can be improved in several ways. The impact of
    failures during migration can be reduced by maintaining process state
    on both the source and destination sites until the destination site
    instance is successfully promoted to a regular process and the source
    node is informed about this.

    [3.5] Migration algorithms should avoid linear dependencies on the
    amount of state to be transferred. For example, the eager data
    transfer strategy has costs proportional to the address space size
"Pre"copy means "start copying *before* promoting the destination to be
the primary host" and it has such a linear dependency on the amount of
state to be transferred. "Post"copy means "delay some copying to *after*
promoting the destination to be the primary host".
So we have:
                          Precopy    Postcopy
3.2 Performance           - (1)      - (2)
3.4 Fault resilience      +          -
3.5 Scalability           -          +

(1) smaller impact, longer freeze time
(2) larger impact, extremely short freeze time
Postcopy can also limit the length of the non-resilient phase, by
starting with a precopy phase and only switching to postcopy after some
time. Then you have:
                          Precopy    Hybrid     Postcopy
3.2 Performance           - (1)      + (3)      - (2)
3.4 Fault resilience      +          -          --
3.5 Scalability           -          +          +

(3) intermediate impact, extremely short freeze time
but there is still going to be a phase where migration is not resilient
to network faults.
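
To make the tradeoff concrete, here is a toy model of the hybrid scheme.
It is only a sketch under simplifying assumptions, not how QEMU actually
schedules migration: the function name, its parameters, the fixed dirty
rate, and the "hot set" cap on re-dirtied memory are all made up for
illustration. Precopy runs in rounds until either the remainder fits in
the allowed downtime or the time budget runs out; in the latter case the
model switches to postcopy, and the data still outstanding at that point
determines how long the migration is exposed to network faults.

```python
def simulate_hybrid(ram_mb, hot_mb, dirty_mb_s, bandwidth_mb_s,
                    precopy_budget_s, max_downtime_s=0.3):
    """Toy model of precopy followed by an optional switch to postcopy.

    All parameters are hypothetical knobs for this sketch:
      ram_mb           guest RAM size
      hot_mb           size of the memory region the guest keeps rewriting
      dirty_mb_s       rate at which the guest dirties memory
      bandwidth_mb_s   migration link bandwidth
      precopy_budget_s time allowed for precopy before switching
      max_downtime_s   freeze time accepted for a pure precopy finish

    Returns (mode, precopy_seconds, non_resilient_seconds).
    """
    remaining = float(ram_mb)  # data that still has to reach the destination
    elapsed = 0.0              # time spent in precopy so far

    while True:
        # Precopy has converged once the remainder can be sent while the
        # guest is paused for no longer than the accepted downtime.
        if remaining / bandwidth_mb_s <= max_downtime_s:
            return ("precopy", elapsed, 0.0)

        round_s = remaining / bandwidth_mb_s
        if elapsed + round_s > precopy_budget_s:
            break  # next round does not fit in the budget: go to postcopy

        elapsed += round_s
        # Memory dirtied during this round must be sent again; it cannot
        # exceed the region the guest is actually writing to.
        remaining = min(hot_mb, dirty_mb_s * round_s)

    # After the switch the guest runs on the destination, so the source
    # copy is no longer dirtied and the remainder is transferred once.
    # Until that finishes, a network failure loses the guest.
    return ("postcopy", elapsed, remaining / bandwidth_mb_s)


if __name__ == "__main__":
    # A guest that dirties memory faster than the link can carry it.
    for budget in (0, 10):
        print(budget, simulate_hybrid(8192, 2048, 1200, 1000, budget))
```

In this model a budget of 0 is pure postcopy, and the non-resilient
window is the time to send all of RAM. With a 10 second budget the first
precopy round moves the cold memory while both sides still hold the full
state, so only the hot set is left for the postcopy phase. The window
shrinks but never reaches zero for a guest that does not converge.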
Cloud operators can use a combination of precopy and postcopy. For
example, I would not use postcopy for mass migration when doing
host updates, but it can be used as a last resort before a scheduled
downtime.
For example, say you're doing a rolling update and you want it to complete
by next Sunday. 90% of the guests are shut down by the customers or can
be migrated successfully with precopy. The others do not converge and
their SLA does not let you throttle them to complete precopy migration.
You then tell your customers that either they shut down and restart their
instances before Saturday 8:00 PM, or they might be shut down forcibly.
Then for customers who haven't rebooted you can do postcopy, since you
have already alerted them that something might go wrong. So even
though postcopy would not be a first choice, it can still help cloud
operators.
Paolo