Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}


From: Eric Blake
Subject: Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
Date: Wed, 5 Feb 2020 12:39:48 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.4.1

On 2/5/20 11:22 AM, Max Reitz wrote:


And thus callers which just want the trivially obtainable
BDRV_ZERO_TRUNCATE info have to wait for the BDRV_ZERO_OPEN inquiry,
even though they don’t care about that flag.

True, but only to a minor extent; and the documentation mentions that
the BDRV_ZERO_OPEN calculation MUST NOT be as expensive as a blind
block_status loop.

So it must be less expensive than an arbitrarily complex loop.  I think
a single SEEK_DATA/HOLE call was something like O(n) on tmpfs?

If I recall, the tmpfs bug was that lseek(SEEK_DATA/SEEK_HOLE) was O(n), where n grew with the requested offset and the number of extents prior to that offset. The probe at offset 0 is O(1) (because there are no prior extents), whether it reaches the end of the file (the entire image is a hole) or hits data beforehand. Only probes at later offsets pay the speed penalty, which is how an O(n) loop over all extents turned into O(n^2) total traversal time, since each later lookup is itself O(n).
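For illustration, the O(1) probe at offset 0 boils down to a single lseek() (untested sketch, Linux-specific; file_is_all_hole() is a made-up helper, not anything in the tree):

/* Sketch: returns 1 if the file is one big hole (reads as all zeroes),
 * 0 if any data extent is present, -1 on error or if SEEK_DATA is
 * unsupported.  A single lseek() at offset 0 is O(1) on tmpfs because
 * there are no earlier extents to walk. */
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/stat.h>
#include <errno.h>

static int file_is_all_hole(int fd)
{
    struct stat st;
    off_t data;

    if (fstat(fd, &st) < 0) {
        return -1;
    }
    data = lseek(fd, 0, SEEK_DATA);
    if (data < 0) {
        /* ENXIO: no data at or after offset 0, i.e. entirely a hole */
        return errno == ENXIO ? 1 : -1;
    }
    return data >= st.st_size ? 1 : 0;
}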


What I’m trying to say is that this is not a good limit and can mean
anything.

I do think this limit definition makes sense for callers that want to
know about ZERO_OPEN.  But I don’t know why we would have to let other
callers wait, too.

Keeping separate functions may still be the right approach for v2, although I'd still like to name things better ('has_zero_init' vs. 'has_zero_init_truncate' did not work well for me). And if I'm renaming things, then I'm touching just as much code whether I rename and keep separate functions or rename and consolidate into one.


Meanwhile, callers tend to only care about
bdrv_known_zeroes() right after opening an image or right before
resizing (not repeatedly during runtime);

Hm, yes.  I was thinking of parallels, but that only checks once in
parallels_open(), so it’s OK.

and you also argued elsewhere
in this thread that it may be worth having the block layer cache
BDRV_ZERO_OPEN and update the cache on any write,

I didn’t say the block layer, but it if makes sense.

at which point, the
expense in the driver callback really is a one-time call during
bdrv_co_open().

It definitely doesn’t make sense to me to do that call unconditionally
in bdrv_co_open().

Okay, you have a point there - while 'qemu-img convert' cares, not all clients of bdrv_co_open() are worried about whether the existing contents are zero; so unconditionally priming a cache during bdrv_co_open() is not as wise as deferring the work until the information will actually be useful. On the other hand, if it is something that clients only consult when first opening an image, caching the data doesn't buy much either.
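(For concreteness, the caching idea from earlier in the thread would amount to something like this untested sketch; the struct and helper names are made up here, nothing like this exists in the block layer yet. The flag is computed lazily on first query and simply forgotten on the first write.)

#include <stdbool.h>

/* Hypothetical lazily-computed, write-invalidated cache. */
typedef struct BDRVZeroCache {
    bool valid;      /* computed since open / since the last write? */
    bool known_zero; /* cached answer: image currently reads as all zeroes */
} BDRVZeroCache;

static bool zero_cache_query(BDRVZeroCache *c,
                             bool (*probe)(void *), void *opaque)
{
    if (!c->valid) {
        c->known_zero = probe(opaque);   /* O(1) driver probe, done once */
        c->valid = true;
    }
    return c->known_zero;
}

static void zero_cache_on_write(BDRVZeroCache *c)
{
    /* Any write may introduce non-zero data; the cheapest correct answer
     * is to drop the cached result rather than track what was written. */
    c->valid = false;
    c->known_zero = false;
}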

So, we know that bdrv_has_zero_init() is only viable on a just-created image, and bdrv_has_zero_init_truncate() is only viable if you are about to resize an image using bdrv_co_truncate(PREALLOC_MODE_OFF).

Hmm - thinking aloud: our ultimate goal is that we want to make it easier for algorithms that can be sped up IF the image is currently known to be all zero. Maybe what this means is that we really want to be tweaking bdrv_make_zero() to do all the work, something along the lines of:

- if the image is known to already be all zeroes using an O(1) probe (this includes if the image was freshly created and creation sees all zeroes, or if a block_status at offset 0 shows a hole for the entire image, or if an NBD extension advertises all zero at connection time...), return success

- if the image has a FAST truncate, and resizing reads zeroes, we can truncate to size 0 and back to the desired size, then return success; determining if truncate is fast should be similar to how BDRV_REQ_NO_FALLBACK determines whether write zeroes is fast

- if the image supports BDRV_REQ_NO_FALLBACK with write zeroes, we can request a write zeroes over the whole image, which will either succeed (the image is now quickly zero) or fail (writing zeroes as we go is the best we can do)

- if the image could only report that it is all zeroes at the cost of O(n) work, such as a loop over block_status (or even O(n^2) with the tmpfs lseek bug), it's easier to report failure than to worry about making the image read all zeroes
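In pseudo-C, the decision cascade I have in mind would be shaped roughly like this (untested; the ZeroProbe struct and its fields are invented for illustration, only BDRV_REQ_NO_FALLBACK and bdrv_make_zero() exist today):

#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of the probes described above. */
typedef struct ZeroProbe {
    bool known_zero_o1;       /* O(1) evidence the image already reads as zeroes */
    bool fast_zero_truncate;  /* truncate is fast and truncated areas read as zero */
    bool fast_write_zeroes;   /* a BDRV_REQ_NO_FALLBACK write-zeroes would be fast */
} ZeroProbe;

/* Returns 0 if one of the cheap paths applies (the caller would then take
 * that action), -ENOTSUP if writing zeroes as we go is the best we can do. */
static int make_zero_decision(const ZeroProbe *p)
{
    if (p->known_zero_o1) {
        return 0;   /* already all zeroes, nothing to do */
    }
    if (p->fast_zero_truncate) {
        return 0;   /* truncate to size 0 and back to the desired size */
    }
    if (p->fast_write_zeroes) {
        return 0;   /* one whole-image write-zeroes request */
    }
    /* Only an O(n) block_status loop (or worse, the tmpfs lseek bug)
     * could answer the question; report failure instead. */
    return -ENOTSUP;
}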

qemu-img would then only ever need to consult --target-is-zero and bdrv_make_zero(), and not worry about any other function calls; the block layer would take care of coordinating whatever call sequences make the most sense, and would report success or failure at getting the image into an all-zero state if it was not already there.



And in that case, whether the one-time expense is done
via a single function call or via three driver callbacks, the amount of
work is the same; but the driver callback interface is easier if there
is only one callback (similar to how bdrv_unallocated_blocks_are_zero()
calls bdrv_get_info() only for bdi.unallocated_blocks_are_zero, even
though BlockDriverInfo tracks much more than that boolean).

In fact, it may be worth consolidating known zeroes support into
BlockDriverInfo.

I’m very skeptical of that.  BDI already has the problem that it doesn’t
know which of the information the caller actually wants and that it is
sometimes used in a quasi-hot path.

Maybe that means it is indeed time to incorporate it into BDI, but the
caller should have a way of specifying what parts of BDI it actually
needs and then drivers can skip anything that isn’t trivially obtainable
that the caller doesn’t need.

I'm reminded of the recent kernel addition of statx(): the traditional stat()/fstat() interfaces have no idea which bits of information you care about, so you get everything, but with statx() you can request only what you plan to use, which may indeed result in speedups.
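Something like this, where the caller asks for nothing but the size (Linux 4.11+/glibc 2.28+; 'disk.img' is just a placeholder path):

#define _GNU_SOURCE
#include <fcntl.h>      /* AT_FDCWD */
#include <sys/stat.h>   /* statx(), struct statx, STATX_SIZE */
#include <stdio.h>

int main(void)
{
    struct statx stx;

    /* Request only the size; the kernel may skip work for fields we did
     * not ask for, unlike stat()/fstat() which always fill everything. */
    if (statx(AT_FDCWD, "disk.img", 0, STATX_SIZE, &stx) == 0 &&
        (stx.stx_mask & STATX_SIZE)) {
        printf("size: %llu\n", (unsigned long long)stx.stx_size);
    }
    return 0;
}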


Those are still viable options, but before I repaint the bikeshed along
those lines, I'd at least like a review of whether the overall idea of
having a notion of 'reads-all-zeroes' is indeed useful enough,
regardless of how we implement it as one vs. three driver callbacks.

I’m as hesitant as ever to give a review that this notion is useful,
because I haven’t seen a practical example yet where the problem isn’t
the fact that NBD doesn’t have 64-bit write_zeroes support.

Even if the NBD protocol gains 64-bit write_zeroes, not all NBD servers will implement that extension. The reads-all-zeroes notion will still speed up operations when talking to older servers that lack 64-bit write zeroes, even if a newer qemu itself is never such a server.


So far, it looks to me like this notion is only really useful for cases
where we expect a management layer on top of qemu anyway.  And then I’m
not sure that this new feature works reliably enough for such a
management layer.

(I’m not saying it isn’t useful.  Again, intuitively it does seem
useful.  Intuition can be enough to merge a sufficiently simple series
that doesn’t increase code complexity too much.  But I’m still asking
for actual practical examples, because that would make a better
argument, of course.)

I'm hoping when I post my NBD patches that I can also provide some benchmark timing numbers to make the case a bit more concrete.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



