
From: Wei Yang
Subject: Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages
Date: Wed, 23 Mar 2016 17:46:43 +0800
User-agent: Mutt/1.5.17 (2007-11-01)

On Wed, Mar 23, 2016 at 07:18:57AM +0000, Li, Liang Z wrote:
>> Hi, Liang
>> This is a very clear documentation of your work; I appreciate it a lot.
>> Below are some of my personal opinions and questions.
>Thanks for your comments!
>> On Tue, Mar 22, 2016 at 03:43:49PM +0800, Liang Li wrote:
>> >I have sent the RFC version patch set for live migration optimization
>> >by skipping processing the free pages in the ram bulk stage and
>> >received a lot of comments. The related threads can be found at:
>> >
>> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00715.html
>> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00714.html
>> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00717.html
>> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00716.html
>> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00718.html
>> >
>> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00719.html
>> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00720.html
>> >https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00721.html
>> >
>> Actually there are two threads, a QEMU thread and a kernel thread. It would
>> be clearer for the audience if you just listed the first mail of each of
>> these two threads.
>Indeed, my original version had this kind of information, but I removed it.
>> >To make things easier, I wrote this doc about the possible designs
>> >and my choices. Comments are welcome!
>> >
>> >Content
>> >=======
>> >1. Background
>> >2. Why not use virtio-balloon
>> >3. Virtio interface
>> >4. Constructing free page bitmap
>> >5. Tighten free page bitmap
>> >6. Handling page cache in the guest
>> >7. APIs for live migration
>> >8. Pseudo code
>> >
>> >Details
>> >=======
>> >1. Background
>> >As we know, in the ram bulk stage of live migration, the current QEMU
>> >live migration implementation marks all of the guest's RAM pages as
>> >dirty; all these pages are first checked for being zero pages, and the
>> >page content is sent to the destination depending on the checking
>> >result. That process consumes quite a lot of CPU cycles and network
>> >bandwidth.
>> >
>> >>From guest's point of view, there are some pages currently not used by
>> I see in your original RFC patch and your RFC doc, this line starts with a
>> character '>'. Not sure this one has a special purpose?
>No special purpose. Maybe it's caused by the email client. I didn't find the
>character in the original doc.


You could take a look at this link; there is a '>' before 'From'.

>> >the guest; the guest doesn't care about the content of these pages. Free
>> >pages are exactly this kind of page, not used by the guest. We can make
>> >use of this fact and skip processing the free pages in the ram bulk
>> >stage; it saves a lot of CPU cycles and reduces the network traffic
>> >while obviously speeding up the live migration process.
>> >
>> >Usually, only the guest has the information about its free pages. But
>> >it's possible to let the guest tell QEMU its free page information by
>> >some mechanism, e.g. through the virtio interface. Once QEMU gets the
>> >free page information, it can skip processing these free pages in the
>> >ram bulk stage by clearing the corresponding bits of the migration bitmap.
>> >
>> >2. Why not use virtio-balloon
>> >Actually, virtio-balloon can do a similar thing by inflating the
>> >balloon before live migration, but its performance is not good. For an
>> >8GB idle guest that has just booted, it takes about 5.7 seconds to
>> >inflate the balloon to 7GB, but it takes only 25ms to get a valid free
>> >page bitmap from the guest. There are several reasons for the bad
>> >performance of virtio-balloon:
>> >a. allocating pages (5%, 304ms)
>> >b. sending PFNs to host (71%, 4194ms)
>> >c. address translation and madvise() operation (24%, 1423ms)
>> >Debugging shows the time spent on these operations is listed in the
>> >brackets above. By changing VIRTIO_BALLOON_ARRAY_PFNS_MAX to a larger
>> >value, such as 16384, the time spent on sending the PFNs can be
>> >reduced to about 400ms, but that's still too long.
>> >
>> >Obviously, the virtio-balloon mechanism has a bigger performance
>> >impact on the guest than the approach we are trying to implement.
>> >
>> >3. Virtio interface
>> >There are three different ways of using the virtio interface to
>> >send the free page information.
>> >a. Extend the current virtio device
>> >The virtio spec has already defined some virtio devices, and we can
>> >extend one of these devices to transport the free page information.
>> >It requires modifying the virtio spec.
>> >
>> >b. Implement a new virtio device
>> >Implementing a brand new virtio device to exchange information
>> >between host and guest is another choice. It requires modifying the
>> >virtio spec too.
>> >
>> >c. Make use of virtio-serial (Amit's suggestion, my choice)
>> >It's possible to make use of virtio-serial for communication between
>> >host and guest; the benefit of this solution is that there is no need
>> >to modify the virtio spec.
>> >
>> >4. Constructing free page bitmap
>> >To minimize the space for saving free page information, it’s better to
>> >use a bitmap to describe the free pages. There are two ways to
>> >construct the free page bitmap.
>> >
>> >a. Construct the free page bitmap on demand (My choice)
>> >The guest can allocate memory for the free page bitmap only when it
>> >receives the request from QEMU, and set the free page bitmap by
>> >traversing the free page list. The advantage of this way is that it's
>> >quite simple and easy to implement. The disadvantage is that the
>> >traversing operation may take quite a long time when there are a lot
>> >of free pages (about 20ms for 7GB of free pages).
>> >
>> >b. Update the free page bitmap when allocating/freeing pages
>> >Another choice is to allocate the memory for the free page bitmap when
>> >the guest boots, and then update the free page bitmap when
>> >allocating/freeing pages. It needs more modifications to the memory
>> >management code in the guest. The advantage of this way is that the
>> >guest can respond to QEMU's request for a free page bitmap very
>> >quickly, no matter how many free pages there are in the guest. Would
>> >the kernel guys like this?
>> >
>> >5. Tighten the free page bitmap
>> >Finally, the free page bitmap should be combined with the
>> >ramlist.dirty_memory to filter out the free pages. We should make sure
>> In exec.c, the variable name is ram_list. If we use the same name in the
>> code and the doc, it may be easier for the audience to understand.
>Yes, thanks!
>> >that bit N in the free page bitmap and bit N in the
>> >ramlist.dirty_memory correspond to the same guest page.
>> >On some archs, like x86, there are 'holes' in the physical memory
>> >address space, which means there are no actual physical RAM pages
>> >corresponding to some PFNs. So some arch-specific information is
>> >needed to construct a proper free page bitmap.
>> >
>> >migration dirty page bitmap:
>> >    ---------------------
>> >    |a|b|c|d|e|f|g|h|i|j|
>> >    ---------------------
>> >loose free page bitmap:
>> >    -----------------------------
>> >    |a|b|c|d|e|f| | | | |g|h|i|j|
>> >    -----------------------------
>> >tight free page bitmap:
>> >    ---------------------
>> >    |a|b|c|d|e|f|g|h|i|j|
>> >    ---------------------
>> >
>> >There are two places where the free page bitmap could be tightened:
>> >a. In the guest
>> >Constructing the free page bitmap in the guest requires adding
>> >arch-related code in the guest for building a tight bitmap. The
>> >advantage of this way is that less memory is needed to store the free
>> >page bitmap.
>> >b. In QEMU (My choice)
>> >Constructing the free page bitmap in QEMU is more flexible: we can get
>> >a loose free page bitmap which contains the holes, and then filter out
>> >the holes in QEMU. The advantage of this way is that we can keep the
>> >kernel code as simple as possible; the disadvantage is that more
>> >memory is needed to save the loose free page bitmap. Because this is
>> >mainly a QEMU feature, if possible, doing all the related things in
>> >QEMU is better.
>> >
>> >6. Handling page cache in the guest
>> >The memory used for the page cache in the guest changes depending on
>> >the workload; if the guest runs some block-IO-intensive workload, there will
>> Would this improvement still benefit a lot when the guest has only a few
>> free pages?
>Yes, the improvement is very obvious.

Good to know this.

>> In your performance data, I think Case 2 mimics this kind of case, though
>> the memory-consuming task is stopped before migration. If it continued,
>> would we still perform better than before?
>Actually, my RFC patch didn't consider the page cache; Roman raised this
>issue, so I added this part in this doc.
>Case 2 didn't mimic this kind of scenario: the workload is a memory-consuming
>workload, not a block-IO-intensive workload, so there is not much page cache
>in this case.
>If the workload in case 2 continues, as long as it does not write to all the
>memory it allocates, we can still get benefits.

It sounds like I have little knowledge of the page cache and its relationship
to free pages and I/O-intensive workloads.

Here is some personal understanding; I would appreciate it if you could
correct me if I am wrong:

      |Page     |Page     |Free Page|Page     |

A Free Page is a page in the free_list, and PageCache is some page cached in
the CPU's cache line?

When a memory-consuming task runs, it leads to few Free Pages in the whole
system. What's the consequence when an I/O-intensive workload runs? I guess
it still leads to few Free Pages. And will there be some problem syncing the
PageCache?

>> I am wondering whether it is possible to have a configurable threshold for
>> deciding when to utilize the free page bitmap optimization?
>Could you elaborate your idea? How does it work?

Let's go back to Case 2. We run a memory-consuming task which leads to few
Free Pages in the whole system, which means that, from QEMU's perspective,
little of the dirty_memory is filtered out by the Free Page list. My original
question was whether your solution still benefits in this scenario. As you
mentioned, it works fine, so maybe this threshold is not necessary.

My original idea was that in QEMU we could calculate the percentage of Free
Pages in the whole system. If only a small percentage of pages are free, then
we don't need to bother using this method.

Have a nice day~

>> --
>> Richard Yang
>> Help you, Help me

Richard Yang
Help you, Help me
