Re: [Qemu-devel] [PATCH v2 00/41] postcopy live migration


From: Dor Laor
Subject: Re: [Qemu-devel] [PATCH v2 00/41] postcopy live migration
Date: Tue, 05 Jun 2012 14:23:18 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1

On 06/04/2012 04:38 PM, Isaku Yamahata wrote:
On Mon, Jun 04, 2012 at 08:37:04PM +0800, Anthony Liguori wrote:
On 06/04/2012 05:57 PM, Isaku Yamahata wrote:
After a long time, we have v2. This is the qemu part;
the linux kernel part is sent separately.

Changes v1 -> v2:
- split up patches for review
- buffered file refactored
- many bug fixes
    Especially, PV drivers can now work with postcopy
- optimization/heuristic

Patches
1 - 30: refactoring existing code and preparation
31 - 37: implement postcopy itself (essential part)
38 - 41: some optimization/heuristic for postcopy

Intro
=====
This patch series implements postcopy live migration. [1]
As discussed at KVM Forum 2011, a dedicated character device is used for
distributed shared memory between the migration source and destination.
Now we can discuss/benchmark/compare it with precopy. I believe there is
much room for improvement.

[1] http://wiki.qemu.org/Features/PostCopyLiveMigration
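
To make the mechanism concrete, here is a rough, illustrative-only sketch of
the destination-side fault-service loop. The device path, the request record
layout and request_page_from_source() are assumptions for illustration, not
the real umem driver interface.

/*
 * Illustrative sketch only -- NOT the actual umem API.  Guest RAM is
 * mmap()ed from the character device; when a vCPU touches a page that has
 * not arrived yet, the driver queues its offset on the device fd, and a
 * userspace thread reads that offset and asks the source for the page.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

struct fault_req {              /* hypothetical record queued by the driver */
    uint64_t pgoff;             /* faulting page offset within guest RAM    */
};

/* Hypothetical stand-in: ask the migration source to send this page now. */
static void request_page_from_source(uint64_t pgoff)
{
    printf("requesting page %llu from source\n", (unsigned long long)pgoff);
}

static void serve_faults(const char *dev_path, size_t ram_bytes)
{
    int fd = open(dev_path, O_RDWR);           /* e.g. "/dev/umem" (assumed) */
    if (fd < 0) { perror("open"); exit(1); }

    /* Guest RAM backed by the device: touching a missing page faults and
     * makes the driver queue a request on the fd. */
    void *ram = mmap(NULL, ram_bytes, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (ram == MAP_FAILED) { perror("mmap"); exit(1); }

    struct fault_req req;
    while (read(fd, &req, sizeof(req)) == sizeof(req)) {
        /* The incoming-migration thread later writes the page contents and
         * marks the page present so the faulting vCPU can continue. */
        request_page_from_source(req.pgoff);
    }

    munmap(ram, ram_bytes);
    close(fd);
}

int main(void)
{
    serve_faults("/dev/umem", (size_t)4096 << 20);  /* 4 GiB guest, made up */
    return 0;
}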


Usage
=====
You need to load the umem character device on the host before starting migration.
Postcopy can be used with the tcg and kvm accelerators. The implementation depends
only on the linux umem character device, and the driver-dependent code is split
into its own file.
I tested only the host page size == guest page size case, but the implementation
allows the host page size != guest page size case.

The following options are added with this patch series.
- incoming part
    command line options
    -postcopy [-postcopy-flags <flags>]
    where <flags> changes the behavior for benchmarking/debugging.
    Currently the following flags are available:
    0: default
    1: enable touching page request

    example:
    qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm

- outgoing part
    options for migrate command
    migrate [-p [-n] [-m]] URI [<prefault forward> [<prefault backward>]]
    -p: use postcopy migration
    -n: disable transferring pages in the background; this is for benchmark/debugging
    -m: move background transfer of postcopy mode
    <prefault forward>: the number of forward pages sent along with an on-demand page
    <prefault backward>: the number of backward pages sent along with an on-demand
                         page
    (see the prefault sketch after the examples below)

    example:
    migrate -p -n tcp:<dest ip address>:4444
    migrate -p -n -m tcp:<dest ip address>:4444 32 0
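
For clarity, here is an illustrative-only sketch of how I read the prefault
window semantics; prefault_range() and the clamping are my assumptions, not
code from this series.

/* Illustration only: when the destination faults on page `pgoff`, the source
 * also sends `fwd` pages after it and `back` pages before it, clamped to the
 * bounds of guest RAM. */
#include <stdint.h>
#include <stdio.h>

static void prefault_range(uint64_t pgoff, uint64_t fwd, uint64_t back,
                           uint64_t nr_pages, uint64_t *first, uint64_t *last)
{
    *first = pgoff > back ? pgoff - back : 0;
    *last  = pgoff + fwd < nr_pages ? pgoff + fwd : nr_pages - 1;
}

int main(void)
{
    uint64_t first, last;

    /* e.g. "migrate -p -n -m tcp:<dest>:4444 32 0": 32 forward, 0 backward */
    prefault_range(1000, 32, 0, (uint64_t)1 << 20, &first, &last);
    printf("on a fault at page 1000, send pages %llu..%llu\n",
           (unsigned long long)first, (unsigned long long)last);
    return 0;
}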


TODO
====
- benchmark/evaluation, especially how async page faults affect the result.

I don't mean to beat on a dead horse, but I really don't understand the
point of postcopy migration other than the fact that it's possible.  It's
a lot of code and a new ABI in an area where we already have too much
difficulty maintaining our ABI.

Without a compelling real world case with supporting benchmarks for why
we need postcopy and cannot improve precopy, I'm against merging this.

Some new results are available at
https://events.linuxfoundation.org/images/stories/pdf/lcjp2012_yamahata_postcopy.pdf


It does show dramatic improvement over precopy. As stated in the document, async page faults may help many kinds of workloads and turn postcopy into a viable alternative to today's code.

In addition, this sort of 'demand paging' approach on the destination can help us with other usages. For example, we can use this implementation to live-snapshot VMs with RAM (postcopy live migration into a file that leaves the source active) and to live-resume VMs from a file without reading the entire RAM from disk.

I didn't go over the API for the live migration part, but IIUC the only change needed to the live migration 'protocol' is w.r.t. guest pages, and we need that change regardless, once we merge the page-ordering optimization.

Cheers,
Dor

Precopy assumes that the network bandwidth is wide enough and that
the number of dirty pages converges. But that doesn't always hold true.

- planned migration
   the predictability of the total migration time is important

- dynamic consolidation
   In cloud use cases, the resources of a physical machine are usually
   overcommitted.
   When a physical machine becomes overloaded, some VMs are moved to another
   physical host to balance the load.
   Precopy can't move VMs promptly, and compression makes things worse.

- inter data center migration
   With L2-over-L3 technology, it has become common to create a virtual
   data center which actually spans multiple physical data centers.
   It is useful to migrate VMs across physical data centers for disaster recovery.
   The network bandwidth between DCs is narrower than in the LAN case, so the
   precopy assumption wouldn't hold.

- bandwidth limited by QoS
   In case the network bandwidth is limited by QoS, the precopy
   assumption doesn't hold.
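
As a back-of-the-envelope illustration of the convergence point above (my own
model and made-up numbers, not measurements from this series): each precopy
round retransmits the pages dirtied during the previous round, so the data
left to send shrinks by a factor of dirty_rate/bandwidth per round and never
converges once the dirty rate reaches the bandwidth.

/* Toy convergence model, for illustration only. */
#include <stdio.h>

int main(void)
{
    double ram_mb      = 4096.0;   /* guest RAM to transfer (MB)      */
    double bw_mbps     = 100.0;    /* migration bandwidth (MB/s)      */
    double dirty_mbps  = 150.0;    /* guest dirtying rate (MB/s)      */
    double max_down_ms = 30.0;     /* acceptable downtime (ms)        */

    if (dirty_mbps >= bw_mbps) {
        printf("never converges: dirty rate %.0f MB/s >= bandwidth %.0f MB/s\n",
               dirty_mbps, bw_mbps);
        return 0;
    }

    double remaining = ram_mb;     /* data still to send this round   */
    int rounds = 0;
    while (remaining / bw_mbps * 1000.0 > max_down_ms) {
        remaining *= dirty_mbps / bw_mbps;  /* dirtied while sending it */
        rounds++;
    }
    printf("final round fits in %.0f ms downtime after %d rounds\n",
           max_down_ms, rounds);
    return 0;
}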


thanks,



