qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in differ


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext
Date: Tue, 19 May 2020 18:48:42 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.8.0

19.05.2020 18:29, Kevin Wolf wrote:
Am 19.05.2020 um 17:05 hat Denis Plotnikov geschrieben:
On 19.05.2020 17:18, Kevin Wolf wrote:
Am 19.05.2020 um 15:54 hat Denis Plotnikov geschrieben:

On 19.05.2020 15:32, Vladimir Sementsov-Ogievskiy wrote:
14.05.2020 17:26, Kevin Wolf wrote:
Am 14.05.2020 um 15:21 hat Thomas Lamprecht geschrieben:
On 5/12/20 4:43 PM, Kevin Wolf wrote:
Stefan (Reiter), after looking a bit closer at this, I think
there is no
bug in QEMU, but the bug is in your coroutine code that calls block
layer functions without moving into the right AioContext first. I've
written this series anyway as it potentially makes the life of callers
easier and would probably make your buggy code correct.
However, it doesn't feel right to commit something like
patch 2 without
having a user for it. Is there a reason why you can't upstream your
async snapshot code?
I mean I understand what you mean, but it would make the
interface IMO so
much easier to use, if one wants to explicit schedule it
beforehand they
can still do. But that would open the way for two styles doing
things, not
sure if this would seen as bad. The assert about from patch 3/3
would be
already really helping a lot, though.
I think patches 1 and 3 are good to be committed either way if people
think they are useful. They make sense without the async snapshot code.

My concern with the interface in patch 2 is both that it could give
people a false sense of security and that it would be tempting to write
inefficient code.

Usually, you won't have just a single call into the block layer for a
given block node, but you'll perform multiple operations. Switching to
the target context once rather than switching back and forth in every
operation is obviously more efficient.

But chances are that even if one of these function is bdrv_flush(),
which now works correctly from a different thread, you might need
another function that doesn't implement the same magic. So you always
need to be aware which functions support cross-context calls which
ones don't.

I feel we'd see a few bugs related to this.

Regarding upstreaming, there was some historical attempt to upstream it
from Dietmar, but in the time frame of ~ 8 to 10 years ago or so.
I'm not quite sure why it didn't went through then, I see if I can get
some time searching the mailing list archive.

We'd be naturally open and glad to upstream it, what it effectively
allow us to do is to not block the VM to much during snapshoting it
live.
Yes, there is no doubt that this is useful functionality. There has been
talk about this every now and then, but I don't think we ever got to a
point where it actually could be implemented.

Vladimir, I seem to remember you (or someone else from your team?) were
interested in async snapshots as well a while ago?
Den is working on this (add him to CC)
Yes, I was working on that.

What I've done can be found here:
https://github.com/denis-plotnikov/qemu/commits/bgs_uffd

The idea was to save a snapshot (state+ram) asynchronously to a separate
(raw) file using the existing infrastructure.
The goal of that was to reduce the VM downtime on snapshot.

We decided to postpone this work until "userfaultfd write protected mode"
wasn't in the linux mainstream.
Now, userfaultfd-wp is merged to linux. So we have plans to continue this
work.

According to the saving the "internal" snapshot to qcow2 I still have a
question. May be this is the right place and time to ask.

If I remember correctly, in qcow2 the snapshot is stored at the end of
the address space of the current block-placement-table.
Yes. Basically the way snapshots with VM state work is that you write
the VM state to some offset after the end of the virtual disk, when the
VM state is completely written you snapshot the current state (including
both content of the virtual disk and VM state) and finally discard the
VM state again in the active L1 table.

We switch to the new block-placement-table after the snapshot storing
is complete. In case of sync snapshot, we should switch the table
before the snapshot is written, another words, we should be able to
preallocate the the space for the snapshot and keep a link to the
space until snapshot writing is completed.
I don't see a fundamental difference between sync and async in this
respect. Why can't you write the VM state to the current L1 table first
like we usually do?

I'm not quite sure I understand the point.
Let's see all the picture of async snapshot: our goal is to minimize a VM
downtime during the snapshot.
When we do async snapshot we save vmstate except RAM when a VM is stopped
using the current L1 table (further initial L1 table). Then, we want the VM
start running
and write RAM content. At this time all RAM is write-protected.
We unprotect each RAM page once it has been written.

Oh, I see, you're basically doing something like postcopy migration. I
was assuming it was more like regular live migration, except that you
would overwrite updated RAM blocks instead of appending them.

I can see your requirement then.

All those RAM pages should go to the snapshot using the initial L1 table.
Since the VM is running, it may want to write new disk blocks,
so we need to use a NEW L1 table to provide this ability. (Am I correct so
far?)
Thus, if I understand correctly, we need to use two L1 tables: the initial
one to store RAM pages
to the vmstate and the new one to allow disk writings.

May be I can't see a better way to achieve that. Please, correct me if I'm
wrong.

I guess I could imagine a different, though probably not better way: We
could internally have a separate low-level operation that moves the VM
state from the active layer to an already existing disk snapshot. Then
you would snapshot the disk and start writing the VM to the active
layer, and when the VM state write has completed you move it into the
snapshot.

The other options is doing what you suggested. There is nothing in the
qcow2 on-disk format that would prevent this, but we would have to
extend the qcow2 driver to allow I/O to inactive L1 tables. This sounds
like a non-trivial amount of code changes, though it would potentially
enable more use cases we never implemented ((read-only) access to
internal snapshots as block nodes, so you could e.g. use block jobs to
export a snapshot).

Or export a snapshot through NBD.

Still, I have one more idea, probably we already discussed it?
Honestly, I don't like the fact that we store vmstate into guest-data space. 
After EOF, invisible, but still..

Maybe, it would be good to make a qcow2 extension for storing vmstate 
separately? So snapshot metadata will include two more fields: vmstate_offset 
and vmstate_length (hmm, actually we already have the second one), which will 
be allocated as normal qcow2 metadata? Or we can add one-two levels of layered 
allocation if needed, but keep it separate from L1/L2 tables for guest clusters.


--
Best regards,
Vladimir



reply via email to

[Prev in Thread] Current Thread [Next in Thread]