Re: [Qemu-devel] Live migration without bdrv_drain_all()
From: Felipe Franciosi
Subject: Re: [Qemu-devel] Live migration without bdrv_drain_all()
Date: Wed, 28 Sep 2016 10:00:21 +0000
> On 28 Sep 2016, at 10:03, Juan Quintela <address@hidden> wrote:
>
> "Dr. David Alan Gilbert" <address@hidden> wrote:
>> * Stefan Hajnoczi (address@hidden) wrote:
>>> On Mon, Aug 29, 2016 at 06:56:42PM +0000, Felipe Franciosi wrote:
>>>> Heya!
>>>>
>>>>> On 29 Aug 2016, at 08:06, Stefan Hajnoczi <address@hidden> wrote:
>>>>>
>>>>> At KVM Forum an interesting idea was proposed to avoid
>>>>> bdrv_drain_all() during live migration. Mike Cui and Felipe Franciosi
>>>>> mentioned running at queue depth 1. It needs more thought to make it
>>>>> workable but I want to capture it here for discussion and to archive
>>>>> it.
>>>>>
>>>>> bdrv_drain_all() is synchronous and can cause VM downtime if I/O
>>>>> requests hang. We should find a better way of quiescing I/O that is
>>>>> not synchronous. Up until now I thought we should simply add a
>>>>> timeout to bdrv_drain_all() so it can at least fail (and live
>>>>> migration would fail) if I/O is stuck instead of hanging the VM. But
>>>>> the following approach is also interesting...
>>>>>
>>>>> During the iteration phase of live migration we could limit the queue
>>>>> depth so points with no I/O requests in-flight are identified. At
>>>>> these points the migration algorithm has the opportunity to move to
>>>>> the next phase without requiring bdrv_drain_all() since no requests
>>>>> are pending.
>>>>
>>>> I actually think that this "io quiesced state" is highly unlikely
>>>> to _just_ happen on a busy guest. The main idea behind running at
>>>> QD1 is to naturally throttle the guest and make it easier to
>>>> "force quiesce" the VQs.
>>>>
>>>> In other words, if the guest is busy and we run at QD1, I would
>>>> expect the rings to be quite full of pending (ie. unprocessed)
>>>> requests. At the same time, I would expect that a call to
>>>> bdrv_drain_all() (as part of do_vm_stop()) should complete much
>>>> quicker.
>>>>
>>>> Nevertheless, you mentioned that this is still problematic as that
>>>> single outstanding IO could block, leaving the VM paused for
>>>> longer.
>>>>
>>>> My suggestion is therefore that we leave the vCPUs running, but
>>>> stop picking up requests from the VQs. Provided nothing blocks,
>>>> you should reach the "io quiesced state" fairly quickly. If you
>>>> don't, then the VM is at least still running (despite seeing no
>>>> progress on its VQs).
>>>>
>>>> Thoughts on that?
>>>
>>> If the guest experiences a hung disk it may enter error recovery. QEMU
>>> should avoid this so the guest doesn't remount file systems read-only.
>>>
>>> This can be solved by only quiescing the disk for, say, 30 seconds at a
>>> time. If we don't reach a point where live migration can proceed during
>>> those 30 seconds then the disk will service requests again temporarily
>>> to avoid upsetting the guest.
>>>
>>> I wonder if Juan or David have any thoughts from the live migration
>>> perspective?
>>
>> Throttling IO to reduce the time in the final drain makes sense
>> to me, however:
>> a) It doesn't solve the problem if the IO device dies at just the wrong
>> time,
>> so you can still get that hang in bdrv_drain_all
>>
>> b) Completely stopping guest IO sounds too drastic to me unless you can
>> time it to be just at the point before the end of migration; that feels
>> tricky to get right unless you can somehow tie it to an estimate of
>> remaining dirty RAM (that never works that well).
>>
>> c) Something like a 30 second pause still feels too long; if that was
>> a big hairy database workload it would effectively be 30 seconds
>> of downtime.
>>
>> Dave
>
> I think something like the proposed thing could work.
>
> We can put queue depth = 1 or some such when we know we are near
> completion of the migration. What we need then is a way to call the
> equivalent of:
>
> bdrv_drain_all(), but returning EAGAIN or EBUSY if it is a bad moment. In
> that case, we just do another round over the whole memory, or retry in X
> seconds. Anything works for us; we just need a way to ask for the
> operation without it blocking.
>
> Notice that migration is the equivalent of:
>
> while (true) {
>     write_some_dirty_pages();
>     if (dirty_pages < threshold) {
>         break;
>     }
> }
> bdrv_drain_all();
> write_rest_of_dirty_pages();
>
> (Lots and lots of details omitted)
>
> What we really want is to issue the call of bdrv_drain_all() equivalent
> inside the while, so, if there is any problem, we just do another cycle,
> no problem.
>
> Later, Juan.
Hi,

Actually, the way I perceive the problem is that QEMU is doing a vm_stop()
*after* the "break;" in the pseudocode above (but *before* the drain). That
means the VM could be stopped for a long time while you're doing
bdrv_drain_all().
I don't see a magic solution for this. All we can do is try to find a way of
doing this that improves the VM experience during the migration.

It's easy to argue that it's better to see your storage performance go down
for a short period of time than to see your CPUs not running for a long
period of time. After all, there's a reason "CPU downtime" is an actual
hypervisor metric.
What I'd propose is a simple improvement like this:
while (true) {
    write_some_dirty_pages();
    if (dirty_pages < threshold_very_low) {
        break;
    } else if (dirty_pages < threshold_low) {
        bdrv_stop_picking_new_reqs();
    } else if (dirty_pages < threshold_med) {
        bdrv_run_at_qd1();
    }
}
vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
bdrv_drain_all();
write_rest_of_dirty_pages();
The idea is simple:
* When we're somewhere near, we pick only one request at a time.
* When we're really close, we stop picking up new requests. That still allows
the block drivers to complete whatever is outstanding.
* When we're really really close, we can break. At this point, we're very
likely drained already.
Knowing that most OSes use 30s by default as a "this request is not completing
anymore" kind of timeout, we can even improve the above to resume the block
drivers (or abort the migration) if the time between reaching "threshold_low"
and "threshold_very_low" exceeds, say, 15s. That can be combined with actually
waiting for everything to complete before stopping the CPUs. A more complete
version would look like this:
while (true) {
    write_some_dirty_pages();
    if (dirty_pages < threshold_very_low) {
        if (bdrv_all_is_drained()) {
            break;
        } else if (bdrv_is_stopped() && (now() - ts_bdrv_stopped > 15s)) {
            bdrv_run_at_qd1();
            // or abort the migration and resume normally,
            // perhaps after a few retries
        }
    }
    if (dirty_pages < threshold_low) {
        bdrv_stop_picking_new_reqs();
        ts_bdrv_stopped = now();
    } else if (dirty_pages < threshold_med) {
        bdrv_run_at_qd1();
    }
}
vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
bdrv_drain_all();
write_rest_of_dirty_pages();
Note that this version (somewhat) copes with (dirty_pages < threshold_very_low)
being reached before we actually observed (dirty_pages < threshold_low).
There's still a race where requests are fired after bdrv_all_is_drained() and
before vm_stop_force_state(). But that can be easily addressed.
Thoughts?
Thanks,
Felipe