
From: Chegu Vinod
Subject: Re: [Qemu-devel] [PATCH 05/10] migration: Fix the migrate auto converge process
Date: Tue, 11 Mar 2014 15:56:12 -0700
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0

On 3/11/2014 1:48 PM, Juan Quintela wrote:
<address@hidden> wrote:
From: ChenLiang <address@hidden>

It is inaccurate and complex to use the transfer speed of the
migration thread to determine whether the migration converges.
The dirty pages may be compressed by XBZRLE or ZERO_PAGE. The
counter updated at dirty bitmap sync will keep increasing if the
migration can't converge.
"It is inexact and complex to use the migration transfer speed to
dectermine weather the convergence of migration."

@@ -530,21 +523,11 @@ static void migration_bitmap_sync(void)
      /* more than 1 second = 1000 millisecons */
      if (end_time > start_time + 1000) {
          if (migrate_auto_converge()) {
-            /* The following detection logic can be refined later. For now:
-               Check to see if the dirtied bytes is 50% more than the approx.
-               amount of bytes that just got transferred since the last time we
-               were in this routine. If that happens >N times (for now N==4)
-               we turn on the throttle down logic */
-            bytes_xfer_now = ram_bytes_transferred();
-            if (s->dirty_pages_rate &&
-               (num_dirty_pages_period * TARGET_PAGE_SIZE >
-                   (bytes_xfer_now - bytes_xfer_prev)/2) &&
-               (dirty_rate_high_cnt++ > 4)) {
-                    trace_migration_throttle();
-                    mig_throttle_on = true;
-                    dirty_rate_high_cnt = 0;
-             }
-             bytes_xfer_prev = bytes_xfer_now;
+            if (get_bitmap_sync_cnt() > 15) {
+                /* It indicates that migration can't converge when the
+                 * counter is larger than fifteen. Enable the feature
+                 * of auto converge */
Comment is not needed, it says exactly what the code does.

But why 15?  It is not that I think that the older code is better or
worse than yours.  Just that we move from one magic number to another
(that is even bigger).

Shouldn't it be easier to just change mig_sleep_cpu()

to do something like:


static void mig_sleep_cpu(void *opq)
{
     qemu_mutex_unlock_iothread();
     g_usleep(2 * get_bitmap_sync_cnt() * 1000);
     qemu_mutex_lock_iothread();
}

This would get the 30ms on the 15th iteration.  I am open to changing
that formula to anything different, but what I want is to change this to
something where less convergence -> more throttling.

< 'already got some feedback earlier on this and had this task in the list of things
    to work on... :)   >

Having the throttling start with some pre-defined "degree" and then have that "degree" gradually increase... either

a) automatically, as shown in Juan's example above (a sketch follows this list) (or)

b) via some TBD user level interface...

...is one way to help with ensuring convergence for all cases.
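A minimal sketch of (a), assuming a hypothetical throttle_level counter
bumped on every non-converging bitmap sync and reset once convergence
resumes (all names and constants here are illustrative, not from the
patch):

/* Hypothetical sketch of a gradually increasing throttle "degree".
 * throttle_level would be bumped in migration_bitmap_sync() whenever
 * convergence is not happening, and reset once it is. */
static int throttle_level;

static void mig_throttle_cpu(void *opq)
{
    /* cap the sleep so the workload keeps making some progress */
    int sleep_ms = throttle_level < 30 ? 10 * throttle_level : 300;

    qemu_mutex_unlock_iothread();
    g_usleep(sleep_ms * 1000);
    qemu_mutex_lock_iothread();
}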

The issue of continuing to increase this "degree" of throttling is an obvious area of concern for the workload (that is still trying to run in the VM). Would it be better to force the live migration to switch from the iterative pre-copy phase to the "downtime" phase if it fails to converge even after throttling it for a couple of iterations? Doing so could result in a longer actual downtime. Hope to try this and see... but if anyone has inputs (other than doing post-copy etc.) pl. do share.
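A rough sketch of that fallback idea, with entirely hypothetical names
and threshold (not part of this series), just to make the trade-off
concrete:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: give up on iterative pre-copy once throttling
 * has failed to shrink the dirty set for a few iterations, accepting
 * a longer downtime in exchange for guaranteed completion. */
#define MAX_THROTTLED_ITERS 4   /* illustrative value */

static bool should_force_downtime(int throttled_iters,
                                  uint64_t pending_bytes,
                                  uint64_t downtime_budget_bytes)
{
    return throttled_iters > MAX_THROTTLED_ITERS &&
           pending_bytes > downtime_budget_bytes;
}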



BTW, are you testing this with any workload to see that it improves?

Yes. Please do share some data.



+                mig_throttle_on = true;
+            }
Vinod, what do you think?
As is noted in the current code... the "logic" to detect the lack of convergence needs to be refined. If there is a better way to detect the same (one that also covers these other cases like XBZRLE etc.) then I am all for it. I do agree with Juan about the choice of magic numbers (i.e. one may not be better than the other).

BTW, on a related note...

I haven't used XBZRLE in the recent past (after having tried it in the early days). Does it now perform well with larger-sized VMs running real-world workloads? I assume that is where you found there was still a need for forcing convergence?

Pl. do consider sharing some results about the type of workload and also the size of the VMs etc. that you have tried with XBZRLE.

Do you have a workload to test this?

Hmm... One can test this with memory intensive Java warehouse type of workloads (besides using synthetic workloads).
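For a synthetic one, something as simple as a loop that keeps rewriting
a large buffer inside the guest is usually enough to outrun the transfer
rate (buffer size and pacing below are arbitrary):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Minimal synthetic dirtier: rewrites a large working set so the guest
 * produces dirty pages faster than migration can send them. Run inside
 * the VM while migrating. */
int main(void)
{
    size_t len = 1024UL * 1024 * 1024;  /* 1 GiB working set */
    char *buf = malloc(len);
    if (!buf) {
        return 1;
    }
    for (;;) {
        memset(buf, 0x5a, len);         /* dirty every page */
        usleep(1000);                   /* brief pause between sweeps */
    }
}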

Vinod

Thanks, Juan.