qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: backing chain & block status & filters


From: Vladimir Sementsov-Ogievskiy
Subject: Re: backing chain & block status & filters
Date: Thu, 7 May 2020 22:34:48 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.2.1

07.05.2020 15:58, Max Reitz wrote:
On 28.04.20 16:51, Vladimir Sementsov-Ogievskiy wrote:
28.04.2020 14:08, Max Reitz wrote:
On 28.04.20 10:55, Vladimir Sementsov-Ogievskiy wrote:
Hi!

I wanted to resend my "[PATCH 0/4] fix & merge block_status_above and
is_allocated_above", and returned to all the inconsistencies about
block-status. I keep in mind Max's series about child-access functions,
and Andrey's work about using COR filter in block-stream, which depends
on Max's series (because, without them COR fitler with file child breaks
backing chains).. And, it seems that it's better to discuss some
questions before resending.

First, problems about block-status:

1. We consider ALLOCATED = ZERO | DATA, and documented as follows:

     * BDRV_BLOCK_DATA: allocation for data at offset is tied to this
layer
     * BDRV_BLOCK_ZERO: offset reads as zero
     * BDRV_BLOCK_OFFSET_VALID: an associated offset exists for accessing
raw data
     * BDRV_BLOCK_ALLOCATED: the content of the block is determined by
this
     *                       layer rather than any backing, set by block
layer

This actually means, that we should always have BDRV_BLOCK_ALLOCATED for
formats which doesn't support backing. So, all such format drivers must
return ZERO or DATA (or both?), yes?. Seems file-posix does so, but, for
example, iscsi - doesn't.

Hm.  I could imagine that there are formats that have non-zero holes
(e.g. 0xff or just garbage).  It would be a bit wrong for them to return
ZERO or DATA then.

But OTOH we don’t care about such cases, do we?  We need to know whether
ranges are zero, data, or unallocated.  If they aren’t zero, we only
care about whether reading from it will return data from this layer or
not.

So I suppose that anything that doesn’t support backing files (or
filtered children) should always return ZERO and/or DATA.

2. ZERO. The meaning differs a bit for generic block_status and for
drivers.. I think, we at least should document it like this:

BDRV_BLOCK_DATA: allocation for data at offset is tied to this layer
BDRV_BLOCK_ZERO: if driver return ZERO, than the region is allocated at
this layer and read as ZERO. If generic block_status returns ZERO, it
only mean that it reads as zero, but the region may be allocated on
underlying level.

Hm.  What does that mean?

One of the problems is that “allocated” has two meanings:
(1) reading data returns data defined at this backing layer,
(2) actually allocated, i.e. takes up space on the file represented by
this BDS.

As far as I understand, we actually don’t care about (2) in the context
of block_status, but just about (1).

So if a layer returns ZERO, it is by definition (1)-allocated.  (It
isn’t necessarily (2)-allocated.)

3. bdi.unallocated_blocks_are_zero

I think it's very bad, that we have formats, that supports backing, but
doesn't report bdi.unallocated_blocks_are_zero as true. It means that
UNALLOCATED region reads as zero if we have short backing file, and not
as zero if we remove this short backing file.

What do you mean by “remove this short backing file”?  Because generally
one can’t just drop a backing file.

So maybe a case like block-stream?  Wouldn’t that be a bug in
block-stream them, i.e. shouldn’t it stream zeros after the end of the
backing file?

I can live with it but
this is weird logic. These bad drivers are qcow (not qcow2), parallels
and vmdk. I hope, they actually just forget to set
unallocated_blocks_are_zero to true.

qcow definitely sounds like it.

Next. But what about drivers which doesn't support backing? As we
considered above, they should always return ZERO or DATA, as everything
is allocated in this backing-chain level (last level, of course).. So
again unallocated_blocks_are_zero is meaningless. So, I think, that
driver which doesn't support backings, should be fixed to return always
ZERO or DATA, than we don't need this unallocated_blocks_are_zero at
all.

Agreed.

3.

The second 3.? :)

Short backing files in allocated_above: we must consider space after
EOF as ALLOCATED, if short backing file is inside requested
backing-chain part, as it produced exactly because of this short file
(and we never go to backing).

Sounds correct.

(current realization of allocated_above is
buggy, see my outdated series "[PATCH 0/4] fix & merge
block_status_above and is_allocated_above")

4. Long ago we've discussed problems about BDRV_BLOCK_RAW, when we have
a backing chain of non-backing child.. I just remember that we didn't
reach the consensus.

Possible? :)

5. Filters.. OK we have two functions for them:
bdrv_co_block_status_from_file and bdrv_co_block_status_from_backing. I
think both are wrong:

bdrv_co_block_status_from_file leads to problem [4], when we can report
UNALLOCATED, which refers not to the current backing chain, but to sub
backing chain of file child, which is inconsistent with
block_status_above and is_allocated_above iteration.

bdrv_co_block_status_from_backing is also is not consistent with
block_status_above iteration.. At least at leads to querying the same
node twice.

Well, yes.

So, about filters and backing chains. Keeping (OK, just, trying to keep)
all these things in mind, I think that it's better to keep backing
chains exactly *backing* chains, so that "backing" child is the only
"not own" child of the node. So, its close to current behavior and we
don't need child-access functions. Then how filters should work:

Filter with backing child, should always return UNALLOCATED (i.e. no
DATA, no ZERO), it is honest: everything is on the other level of
backing chain.

I disagree, because filters with or without backing children should work
exactly the same way and just not appear in the backing chain.

They work the same way in my variant too. The only difference is that if
you use file-child-based filters, you can't do stream/commit around
them.

Which is a bug that’s been know for a long time, and the primary cause
for the “Deal with filters” series.

I just think, that binding the concept to the "backing" link of
the node is simpler and more generic. In blockdev era, when all nodes
will be named and libvirt will take care of all nodes including filters,
we will not need any filter-skipping magic, libvirt will directly point
to exact nodes it means. And we can deprecate "file" children of
existing filters,

One thing to note is that all user-creatable filter drivers currently
use “file”, so this would change them all.

The idea of just using backing for filters just always seems to me like
trying to take the easy way out.  It seems to me like many things around
filters are broken because we weren’t careful enough when introducing
them (mostly a combination of “let’s see whether it just works” and “it
seems to mostly work”), and the “Deal with filters” series attempts to
rectify that.  As evidenced by the reviews that required even more
preliminary work (like the still on-list and under-discussion
BdrvChildRole series), there is really a ton of design that should be
improved.

That makes me very hesitant to just switch filters over to backing,
because I feel that might alleviate some of the most pressing symptoms,
while not addressing the underlying issues.  And without symptoms that
hurt, nobody might want to fix the issues.  Basically, it feels like
more of the same “Let’s try whether it mostly works” and “it seems to
mostly work”.

(Also, my naïve feeling is that if just treating backing and file the
same for filter drivers, the deal with filters series would be ten
patches long and v1 would have been uncontroversial.)


Hmm. OK, I don't know. Probably it's not a big difference: use CAF vs
move filters to backing child. And anyway we may move them later.
Probably it's simpler to fix all the issues, if move the filters to
backing child first, but we already have you series and rebasing it
on another concept is an extra work, not worth doing.

--
Best regards,
Vladimir



reply via email to

[Prev in Thread] Current Thread [Next in Thread]