qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] KVM Forum block no[td]es


From: Denis V . Lunev
Subject: Re: [Qemu-devel] KVM Forum block no[td]es
Date: Fri, 16 Nov 2018 07:47:57 +0000

On 11/12/18 1:25 AM, Max Reitz wrote:
> This is what I’ve taken from two or three BoF-like get-togethers on
> blocky things.  Amendments are more than welcome, of course.
>
>
>
> Permission system
> =================
>
> GRAPH_MOD
> ---------
>
> We need some way for the commit job to prevent graph changes on its
> chain while it is running.  Our current blocker doesn’t do the job,
> however.  What to do?
>
> - We have no idea how to make a *permission* work.  Maybe the biggest
>   problem is that it just doesn’t work as a permission, because the
>   commit job doesn’t own the BdrvChildren that would need to be
>   blocked (namely the @backing BdrvChild).
>
> - A property of BdrvChild that can be set by a non-parent seems more
>   feasible, e.g. a counter where changing the child is possible only
>   if the counter is 0.  This also actually makes sense in what it
>   means.
>   (We never quite knew what “taking the GRAPH_PERMISSION” or
>   “unsharing the GRPAH_MOD permission” was supposed to mean.  Figuring
>   that out always took like half an our in any face-to-face meeting,
>   and then we decided it was pretty much useless for any case we had
>   at hand.)
>
>
> Reopen
> ------
>
> How should permissions be handled while the reopen is under way?
> Maybe we should take the union of @perm before and after, and the
> intersection of @shared before and after?
>
> - Taking permissions is a transaction that can fail.  Reopen, too, is
>   a transaction, and we want to go from the intermediate to the final
>   permissions in reopen’s commit part, so that transition is not
>   allowed to fail.
>   Since with the above model we would only relax things during that
>   transition (relinquishing bits from @perm and adding bits to
>   @shared), this transition should in theory be possible without any
>   failure.  However, in practice things are different, as permission
>   changes with file-posix nodes imply lock changes on the filesystem
>   -- which may always fail.  Arguably failures from changing the
>   file-posix locks can be ignored, because that just means that the
>   file claims more permissions to be taken and less to be shared than
>   is actually the case.  Which means you may not be able to open the
>   file in some other application, while you should be, but that’s the
>   benign kind of error.  You won’t be able to access data in a way
>   you shouldn’t be able to.
>   - Note that we have this issue already, so in general dropping
>     permissions sometimes aborts because code assumes that dropping
>     permissions is always safe and can never result in an error.  It
>     seems best to ignore such protocol layer errors in the generic
>     block layer rather than handling this in every protocol driver
>     itself.
>     (The block layer should discard errors from dropping permissions
>     on the protocol layer.)
>
> - Is it possible that changing an option may require taking an
>   intermediate permission that is required neither before nor after
>   the reopen process?
>   Changing a child link comes to mind (like changing a child from one
>   BDS to another, where the visible data changes, which would mean we
>   may want to e.g. unshare CONSISTENT_READ during the reopen).
>   However:
>   1. It is unfeasible to unshare that for all child changes.
>      Effectively everything requires CONSISTENT_READ, and for good
>      reason.
>   2. Why would a user even change a BDS to something of a different
>      content?
>   3. Anything that currently allows you to change a child node assumes
>      that the user always changes it to something of the same content
>      (some take extra care to verify this, like mirror, which makes
>      sure that @replaces and the target are connected, and there are
>      only filter nodes in between).
>   Always using the same enforcing model as mirror does (no. 3 above)
>   does not really work, though, because one use case is to copy a
>   backing file offline to some different storage and then replace the
>   files via QMP.  To qemu, both files are completely unrelated.
>
>
> Block jobs, including blockdev-copy
> ===================================
>
> Example for use of the fleecing filter:
> - The real target is on slow storage.  Put an overlay on fast storage
>   on top of it.  Then use that overlay as the target of the fleecing
>   filter (and commit the data later or on the side), so that the
>   backup job does not slow down the guest.
>
> For a unified copy job, having a backup/fleecing filter is not a
> problem on the way.  One thing we definitely have to and can do is to
> copy common functionality into a shared file so that the different
> jobs can at least share that.
>
> COR/Stream:
> - There should be a way to discard ranges that have been copied into
>   the overlay from the backing files to save space
> - Also, the COR filter should integrated with the stream job (at some
>   point, as always)
>
> Hole punching with active commit:
> - Putting data into the target and punching holes in the overlays to
>   make it visible on the active disk may be reasonable for some, but
>   not for others -- it should be an option.  You want this if saving
>   space is important, but you may not want this if speed is more
>   important (depends on your backing chain length and other factors
>   then, but that’s your choice).
>
> - Another thing: If we don’t need to punch any holes because the
>   intermediate layers aren’t allocated anyway, we don’t need to write
>   the data into the active disk either.  This can probably be done
>   indiscriminately, because the check for this does not concern the
>   protocol layer but only qemu-controlled metadata, so it should be
>   deterministically fast (want_zero=false).
>
>
> qcow2
> =====
>
> Recovering corrupt images:
> - Salvaging qemu-img convert would help (one that doesn’t abort
>   everything on encountering a single I/O error)
> - We may want to add an in-sync L1 table copy to recover from the
>   worst kinds of corruptions.  Checksumming would be a good idea
>   (then), too.
>   - Should we update the checksum every time?  If it’s just the sum of
>     all L1 entry values, why not, doing the update is trivial then and
>     does not involve looking at any but the entries modified.
>
> Online check:
> - This would need to be a block job
> - The check function would probably need to be a proper coroutine
>   (that does not just lock everything)
> - Would be very complicated if you wanted it to work on R/W images.
>   It’s probably the best to focus on making this work for read-only
>   images, because you can always just put a temporary snapshot over
>   the image for the time of the test and then commit it down after the
>   check is done.
>
>
> Bitmaps
> =======
>
> (Got this section from sneaking into a BoF I wasn’t invited to.  Oh
> well.  Won’t hurt to include them here.)
>
> Currently, when dirty bitmaps are loaded, all IN_USE bitmaps are just
> not loaded at all and completely ignored.  That isn’t correct, though,
> they should either still be loaded (and automatically treated and
> written back as fully dirty), or at least qemu-img check should
> “repair” them (i.e. fully dirtying them).
>
> Sometimes qemu (running in a mode as bare as possible) is better than
> using qemu-img convert, for instance.  It gives you more control
> (through QMP; you get e.g. better progress reporting), you get all of
> the mirror optimizations (we do have optimizations for convert, too,
> but whether it’s any good to write the same (or different?)
> optimizations twice is another question), and you get a common
> interface for everything (online and offline).
> Note that besides a bare qemu we’ve also always wanted to convert as
> many qemu-img operations into frontends for block jobs as possible.
> We have only done this for commit, however, even though convert looked
> like basically the ideal target.  It was just too hard with too little
> apparent gain, like always (and convert supports additional features
> like concatenation which we don’t have in the runtime block layer
> yet).
>
> Someone (not that someone™, but actually some specific someone) is
> about to make qemu-img info display the list of persistent bitmaps.
> Potential reviewers should be aware of the fact that this should be
> done bye adding that information to ImageInfoSpecificQCow2.
/me

Den

reply via email to

[Prev in Thread] Current Thread [Next in Thread]