[Qemu-devel] KVM Forum block no[td]es
From: Max Reitz
Subject: [Qemu-devel] KVM Forum block no[td]es
Date: Sun, 11 Nov 2018 23:25:00 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.3.0
This is what I’ve taken from two or three BoF-like get-togethers on
blocky things. Amendments are more than welcome, of course.
Permission system
=================
GRAPH_MOD
---------
We need some way for the commit job to prevent graph changes on its
chain while it is running. Our current blocker doesn’t do the job,
however. What to do?
- We have no idea how to make a *permission* work. Maybe the biggest
problem is that it just doesn’t work as a permission, because the
commit job doesn’t own the BdrvChildren that would need to be
blocked (namely the @backing BdrvChild).
- A property of BdrvChild that can be set by a non-parent seems more
feasible, e.g. a counter where changing the child is possible only
if the counter is 0. This also actually makes sense in what it
means.
(We never quite knew what “taking the GRAPH_MOD permission” or
“unsharing the GRAPH_MOD permission” was supposed to mean. Figuring
that out always took about half an hour in any face-to-face meeting,
and then we decided it was pretty much useless for any case we had
at hand.)
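A minimal sketch of that counter idea (hypothetical types and names;
the real BdrvChild carries much more state):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical sketch of a per-BdrvChild "blocked" counter: a
 * non-parent (e.g. the commit job) bumps the counter to forbid
 * changing which node the child link points to. */
typedef struct Child {
    const char *node_name;  /* stands in for the BDS the link points to */
    int block_count;        /* > 0: the link must not be changed */
} Child;

static void child_block_change(Child *c)   { c->block_count++; }
static void child_unblock_change(Child *c) { c->block_count--; }

/* Returns true on success, false if the link is currently blocked. */
static bool child_set_node(Child *c, const char *new_node)
{
    if (c->block_count > 0) {
        return false;
    }
    c->node_name = new_node;
    return true;
}
```

Unlike a permission, this needs no ownership of the BdrvChild: anyone
with a reason to freeze the link can take a reference on the counter.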
Reopen
------
How should permissions be handled while the reopen is under way?
Maybe we should take the union of @perm before and after, and the
intersection of @shared before and after?
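That union/intersection rule is easy to state in code; a hypothetical
sketch (the bit values are made up for illustration, not the real
BLK_PERM_* constants):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the proposed rule: while reopen is under way, hold the
 * union of the old and new @perm and share only the intersection of
 * the old and new @shared. */
enum {
    PERM_CONSISTENT_READ = 1 << 0,
    PERM_WRITE           = 1 << 1,
    PERM_RESIZE          = 1 << 2,
};

static void intermediate_perms(uint64_t perm_old, uint64_t shared_old,
                               uint64_t perm_new, uint64_t shared_new,
                               uint64_t *perm_mid, uint64_t *shared_mid)
{
    *perm_mid   = perm_old | perm_new;     /* everything either side needs */
    *shared_mid = shared_old & shared_new; /* only what both sides allow */
}
```

The commit part of reopen then goes from the intermediate to the new
permissions, which only drops @perm bits and adds @shared bits -- the
reason the transition should, in theory, be unable to fail.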
- Taking permissions is a transaction that can fail. Reopen, too, is
a transaction, and we want to go from the intermediate to the final
permissions in reopen’s commit part, so that transition is not
allowed to fail.
Since with the above model we would only relax things during that
transition (relinquishing bits from @perm and adding bits to
@shared), this transition should in theory be possible without any
failure. However, in practice things are different, as permission
changes with file-posix nodes imply lock changes on the filesystem
-- which may always fail. Arguably failures from changing the
file-posix locks can be ignored, because that just means that the
file claims more permissions to be taken and less to be shared than
is actually the case. Which means you may not be able to open the
file in some other application, while you should be, but that’s the
benign kind of error. You won’t be able to access data in a way
you shouldn’t be able to.
- Note that we have this issue already: in general, dropping
permissions sometimes aborts, because code assumes that dropping
permissions is always safe and can never result in an error. It
seems best to ignore such protocol layer errors in the generic
block layer rather than handling them in every protocol driver
itself.
(The block layer should discard errors from dropping permissions
on the protocol layer.)
- Is it possible that changing an option may require taking an
intermediate permission that is required neither before nor after
the reopen process?
Changing a child link comes to mind (like changing a child from one
BDS to another, where the visible data changes, which would mean we
may want to e.g. unshare CONSISTENT_READ during the reopen).
However:
1. It is infeasible to unshare that for all child changes.
Effectively everything requires CONSISTENT_READ, and for good
reason.
2. Why would a user even replace a BDS with one of different
content?
3. Anything that currently allows you to change a child node assumes
that the user always changes it to something of the same content
(some take extra care to verify this, like mirror, which makes
sure that @replaces and the target are connected, and there are
only filter nodes in between).
Always using the same enforcing model as mirror does (no. 3 above)
does not really work, though, because one use case is to copy a
backing file offline to some different storage and then replace the
files via QMP. To qemu, both files are completely unrelated.
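The verification in item 3 amounts to walking only through filter
nodes; a simplified sketch with made-up types (the real check in
mirror operates on the block graph):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch: @replaces is only accepted if it is connected to the target
 * through filter nodes only, i.e. every node on the path leaves the
 * visible data unchanged. */
typedef struct Node {
    bool is_filter;
    struct Node *child;  /* single filtered child; NULL at the leaf */
} Node;

static bool connected_through_filters(Node *from, Node *to)
{
    while (from != NULL) {
        if (from == to) {
            return true;
        }
        if (!from->is_filter) {
            return false;  /* a non-filter node changes the content */
        }
        from = from->child;
    }
    return false;
}
```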
Block jobs, including blockdev-copy
===================================
Example for use of the fleecing filter:
- The real target is on slow storage. Put an overlay on fast storage
on top of it. Then use that overlay as the target of the fleecing
filter (and commit the data later or on the side), so that the
backup job does not slow down the guest.
For a unified copy job, having a backup/fleecing filter along the
way is not a problem. One thing we definitely have to and can do is
pull common functionality into a shared file so that the different
jobs can at least share that.
COR/Stream:
- There should be a way to discard ranges that have been copied into
the overlay from the backing files, to save space.
- Also, the COR filter should be integrated with the stream job (at
some point, as always).
Hole punching with active commit:
- Putting data into the target and punching holes in the overlays to
make it visible on the active disk may be reasonable for some, but
not for others -- it should be an option. You want this if saving
space is important, but you may not want this if speed is more
important (depends on your backing chain length and other factors
then, but that’s your choice).
- Another thing: If we don’t need to punch any holes because the
intermediate layers aren’t allocated anyway, we don’t need to write
the data into the active disk either. This can probably be done
indiscriminately, because the check for this does not concern the
protocol layer but only qemu-controlled metadata, so it should be
deterministically fast (want_zero=false).
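The “nothing allocated above the base” check could look roughly like
this (made-up data structures; real code would go through the
block-status machinery with want_zero=false):

```c
#include <assert.h>
#include <stdbool.h>

enum { CHAIN_LEN = 3, CLUSTERS = 4 };

/* alloc[i][c]: cluster c is allocated in intermediate layer i.
 * If no intermediate overlay shadows the base for this cluster, the
 * data is already visible through the chain: nothing to copy into the
 * active disk, no hole to punch. */
static bool range_needs_copy(bool alloc[][CLUSTERS], int layers, int cluster)
{
    for (int i = 0; i < layers; i++) {
        if (alloc[i][cluster]) {
            return true;   /* an overlay shadows the base: copy + punch */
        }
    }
    return false;          /* nothing allocated above the base: skip */
}
```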
qcow2
=====
Recovering corrupt images:
- A salvaging mode for qemu-img convert would help (one that doesn’t
abort everything on encountering a single I/O error)
- We may want to add an in-sync L1 table copy to recover from the
worst kinds of corruptions. Checksumming would be a good idea
(then), too.
- Should we update the checksum every time? If it’s just the sum of
all L1 entry values, why not: the update is trivial then and only
involves the entries that were modified.
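A sum-based checksum does indeed make single-entry updates O(1); a
minimal sketch (made-up function names, not qemu code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Full checksum: plain sum of all L1 entries (wraps on overflow,
 * which is fine as long as update and recomputation wrap alike). */
static uint64_t l1_checksum(const uint64_t *l1, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += l1[i];
    }
    return sum;
}

/* Incremental update: only the old and new value of the modified
 * entry are needed, no rescan of the table. */
static uint64_t l1_checksum_update(uint64_t sum, uint64_t old_entry,
                                   uint64_t new_entry)
{
    return sum - old_entry + new_entry;
}
```

A plain sum is weak as a corruption detector (it misses swapped
entries, for instance), which is part of why “should we?” is a fair
question.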
Online check:
- This would need to be a block job
- The check function would probably need to be a proper coroutine
(that does not just lock everything)
- Would be very complicated if you wanted it to work on R/W images.
It’s probably best to focus on making this work for read-only
images, because you can always just put a temporary snapshot over
the image for the time of the test and then commit it down after the
check is done.
Bitmaps
=======
(Got this section from sneaking into a BoF I wasn’t invited to. Oh
well. Won’t hurt to include them here.)
Currently, when dirty bitmaps are loaded, all IN_USE bitmaps are
simply not loaded and are completely ignored. That isn’t correct:
they should either still be loaded (and automatically treated and
written back as fully dirty), or qemu-img check should at least
“repair” them (i.e. fully dirty them).
Sometimes a qemu run in as bare a mode as possible is better than
using qemu-img convert, for instance: it gives you more control
(through QMP; you get e.g. better progress reporting), you get all of
the mirror optimizations (we do have optimizations for convert, too,
but whether it’s any good to write the same (or different?)
optimizations twice is another question), and you get a common
interface for everything (online and offline).
Note that besides a bare qemu we’ve also always wanted to convert as
many qemu-img operations into frontends for block jobs as possible.
We have only done this for commit, however, even though convert looked
like basically the ideal target. It was just too hard with too little
apparent gain, like always (and convert supports additional features
like concatenation which we don’t have in the runtime block layer
yet).
Someone (not that someone™, but actually some specific someone) is
about to make qemu-img info display the list of persistent bitmaps.
Potential reviewers should be aware that this should be done by
adding that information to ImageInfoSpecificQCow2.
Transactionable bitmap primitives (e.g. copying a bitmap) would be
nice so you can use them when creating a snapshot. Then it’d be up to
the management layer to make use of them:
- Do you want to continue using the very same bitmap? Copy it then
(or move it, depending on what exactly you want to do and what
primitives there are)
- Do you want to start with a new bitmap? Then just create a new one
along with the overlay.
Misc topics
===========
SEEK_HOLE/SEEK_DATA:
- According to Denis, the bugs left in SEEK_HOLE and fiemap are the
same now, but the former is slow when seeking over large ranges
(because we just want to know whether a certain portion is allocated
or not, but SEEK_HOLE/DATA actively seeks to the next hole/data
region and queries all metadata on that path, regardless of whether
we even care anymore)
- Whether the bugs are the same depends on the version of Linux,
however, and there is no clear way to determine for qemu whether
fiemap is usable or not
- Making it a configure option would leave it to the user or
distribution, who should know for sure
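For reference, the interface in question (Linux-specific; a minimal
probe, not qemu code -- error handling trimmed):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* SEEK_DATA finds the next data region at or after the given offset,
 * SEEK_HOLE the next hole; every file has an implicit hole at EOF.
 * Returns 0 on success, -1 if the filesystem rejects the operation. */
static int first_data_and_hole(int fd, off_t *data_off, off_t *hole_off)
{
    *data_off = lseek(fd, 0, SEEK_DATA);
    *hole_off = lseek(fd, 0, SEEK_HOLE);
    return (*data_off < 0 || *hole_off < 0) ? -1 : 0;
}
```

Each such lseek() may walk metadata all the way to the next
hole/data boundary, which is exactly the cost complained about above.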
Multiqueue with multiple iothreads:
- Kevin says Paolo says he’s working on it. But there are some
prerequisites left, the main one apparently being that there is one
aio_poll() left that polls from the wrong context. With that gone,
we can also probably drop AIO context altogether.
Some things we want from a cache block driver:
- An optional maximum resident memory size; in this case, the driver
needs to be backed by another node it uses for swapping
- Should support taking a bitmap from the cached node, from which it
would then preload all dirty clusters