Re: [Qemu-block] [Qemu-devel] [PATCH v4 02/11] block: Filtered children access functions


From: Max Reitz
Subject: Re: [Qemu-block] [Qemu-devel] [PATCH v4 02/11] block: Filtered children access functions
Date: Thu, 23 May 2019 17:56:29 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1

On 23.05.19 17:08, Vladimir Sementsov-Ogievskiy wrote:
> 23.05.2019 17:49, Max Reitz wrote:
>> On 17.05.19 13:50, Vladimir Sementsov-Ogievskiy wrote:
>>> 07.05.2019 18:13, Max Reitz wrote:
>>>> On 07.05.19 15:30, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 10.04.2019 23:20, Max Reitz wrote:
>>>>>> What bs->file and bs->backing mean depends on the node.  For filter
>>>>>> nodes, both signify a node that will eventually receive all R/W
>>>>>> accesses.  For format nodes, bs->file contains metadata and data, and
>>>>>> bs->backing will not receive writes -- instead, writes are COWed to
>>>>>> bs->file.  Usually.
>>>>>>
>>>>>> In any case, it is not trivial to guess what a child means exactly with
>>>>>> our currently limited form of expression.  It is better to introduce
>>>>>> some functions that actually guarantee a meaning:
>>>>>>
>>>>>> - bdrv_filtered_cow_child() will return the child that receives requests
>>>>>>      filtered through COW.  That is, reads may or may not be forwarded
>>>>>>      (depending on the overlay's allocation status), but writes never go
>>>>>>      to this child.
>>>>>>
>>>>>> - bdrv_filtered_rw_child() will return the child that receives requests
>>>>>>      filtered through some very plain process.  Reads and writes issued
>>>>>>      to the parent will go to the child as well (although timing, etc.
>>>>>>      may be modified).
>>>>>>
>>>>>> - All drivers but quorum (but quorum is pretty opaque to the general
>>>>>>      block layer anyway) always only have one of these children: All read
>>>>>>      requests must be served from the filtered_rw_child (if it exists),
>>>>>>      so if there was a filtered_cow_child in addition, it would not
>>>>>>      receive any requests at all.
>>>>>>      (The closest here is mirror, where all requests are passed on to the
>>>>>>      source, but with write-blocking, write requests are "COWed" to the
>>>>>>      target.  But that just means that the target is a special child that
>>>>>>      cannot be introspected by the generic block layer functions, and
>>>>>>      that source is a filtered_rw_child.)
>>>>>>      Therefore, we can also add bdrv_filtered_child() which returns that
>>>>>>      one child (or NULL, if there is no filtered child).
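A rough sketch of how such accessors might look, based purely on the
description above -- the bodies are assumed for illustration (reusing the
existing BlockDriver.is_filter flag), not taken from the actual diff:

    /* Types as in include/block/block_int.h */

    BdrvChild *bdrv_filtered_cow_child(BlockDriverState *bs)
    {
        /* Only non-filter (format/overlay) nodes have a COW child */
        if (!bs || !bs->drv || bs->drv->is_filter) {
            return NULL;
        }
        return bs->backing;
    }

    BdrvChild *bdrv_filtered_rw_child(BlockDriverState *bs)
    {
        /* A filter has exactly one filtered child; it may sit in either
         * the backing or the file slot */
        if (!bs || !bs->drv || !bs->drv->is_filter) {
            return NULL;
        }
        return bs->backing ?: bs->file;
    }

    BdrvChild *bdrv_filtered_child(BlockDriverState *bs)
    {
        /* The single filtered child, whichever kind it is (or NULL) */
        return bdrv_filtered_rw_child(bs) ?: bdrv_filtered_cow_child(bs);
    }
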
>>>>>>
>>>>>> Also, many places in the current block layer should be skipping filters
>>>>>> (all filters or just the ones added implicitly, it depends) when going
>>>>>> through a block node chain.  They do not do that currently, but this
>>>>>> patch makes them do so.
>>>>>>
>>>>>> One example for this is qemu-img map, which should skip filters and only
>>>>>> look at the COW elements in the graph.  The change to iotest 204's
>>>>>> reference output shows how using blkdebug on top of a COW node used to
>>>>>> make qemu-img map disregard the rest of the backing chain, but with this
>>>>>> patch, the allocation in the base image is reported correctly.
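The chain walk this implies could look roughly like the following -- a
hypothetical helper, not part of the diff shown here, assuming a
bdrv_filtered_rw_child() as sketched above:

    static BlockDriverState *skip_filters(BlockDriverState *bs)
    {
        /* Descend through filter nodes until a data-bearing node is found */
        while (bs && bs->drv && bs->drv->is_filter) {
            BdrvChild *c = bdrv_filtered_rw_child(bs);
            if (!c) {
                break;
            }
            bs = c->bs;
        }
        return bs;
    }

qemu-img map could then start its COW-chain walk from skip_filters(bs)
instead of bs itself, which is what makes the 204 output change.
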
>>>>>>
>>>>>> Furthermore, a note should be made that sometimes we do want to access
>>>>>> bs->backing directly.  This is whenever the operation in question is not
>>>>>> about accessing the COW child, but the "backing" child, be it COW or
>>>>>> not.  This is the case in functions such as bdrv_open_backing_file() or
>>>>>> whenever we have to deal with the special behavior of @backing as a
>>>>>> blockdev option, which is that it does not default to null like all
>>>>>> other child references do.
>>>>>>
>>>>>> Finally, the query functions (query-block and query-named-block-nodes)
>>>>>> are modified to return any filtered child under "backing", not just
>>>>>> bs->backing or COW children.  This is so that filters do not interrupt
>>>>>> the reported backing chain.  This changes the output of iotest 184, as
>>>>>> the throttled node now appears as a backing child.
>>>>>>
>>>>>> Signed-off-by: Max Reitz <address@hidden>
>>>>>> ---
>>>>>>     qapi/block-core.json           |   4 +
>>>>>>     include/block/block.h          |   1 +
>>>>>>     include/block/block_int.h      |  40 +++++--
>>>>>>     block.c                        | 210 +++++++++++++++++++++++++++------
>>>>>>     block/backup.c                 |   8 +-
>>>>>>     block/block-backend.c          |  16 ++-
>>>>>>     block/commit.c                 |  33 +++---
>>>>>>     block/io.c                     |  45 ++++---
>>>>>>     block/mirror.c                 |  21 ++--
>>>>>>     block/qapi.c                   |  30 +++--
>>>>>>     block/stream.c                 |  13 +-
>>>>>>     blockdev.c                     |  88 +++++++++++---
>>>>>>     migration/block-dirty-bitmap.c |   4 +-
>>>>>>     nbd/server.c                   |   6 +-
>>>>>>     qemu-img.c                     |  29 ++---
>>>>>>     tests/qemu-iotests/184.out     |   7 +-
>>>>>>     tests/qemu-iotests/204.out     |   1 +
>>>>>>     17 files changed, 411 insertions(+), 145 deletions(-)
>>>>>>
>>>>>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>>>>>> index 7ccbfff9d0..dbd9286e4a 100644
>>>>>> --- a/qapi/block-core.json
>>>>>> +++ b/qapi/block-core.json
>>>>>> @@ -2502,6 +2502,10 @@
>>>>>>     # On successful completion the image file is updated to drop the backing file
>>>>>>     # and the BLOCK_JOB_COMPLETED event is emitted.
>>>>>>     #
>>>>>> +# In case @device is a filter node, block-stream modifies the first non-filter
>>>>>> +# overlay node below it to point to base's backing node (or NULL if @base was
>>>>>> +# not specified) instead of modifying @device itself.
>>>>>> +#
>>>>>
>>>>> Is it necessary?  Why can't we keep it as is, modifying exactly the
>>>>> device node?  Maybe the user wants to use a filter in the stream
>>>>> process, throttling for example.
>>>>
>>>> That wouldn't make any sense.  Say you have this configuration:
>>>>
>>>> throttle -> top -> base
>>>>
>>>> Now you stream from base to throttle.  The data goes from base through
>>>> throttle to top.  You propose to then make throttle point to base:
>>>>
>>>> throttle -> base
>>>>
>>>> This will discard all the data in top.
>>>>
>>>> Filters don’t store any data.  You need to keep the topmost data-storing
>>>> image, i.e. the first non-filter overlay.
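In code, the retargeting described here might look roughly like this (a
sketch assuming a skip_filters()-style helper as above; device_bs,
base_backing_bs and local_err are hypothetical names, and this is not the
actual stream completion code):

    /* device_bs is the filter node the user named; "top" in the example is
     * the first non-filter overlay below it, and that is the node whose
     * backing link has to be rewritten. */
    BlockDriverState *overlay = skip_filters(device_bs);
    /* base_backing_bs is base's backing node, or NULL if @base was not given */
    bdrv_set_backing_hd(overlay, base_backing_bs, &local_err);
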
>>>
>>> Ah, yes, good reason.
>>>
>>>>
>>>>>>     # @job-id: identifier for the newly-created block job. If
>>>>>>     #          omitted, the device name will be used. (Since 2.7)
>>>>>>     #
>>>>
>>>> [...]
>>>>
>>>>>> @@ -2345,7 +2347,7 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
>>>>>>         bool update_inherits_from = bdrv_chain_contains(bs, backing_hd) &&
>>>>>>             bdrv_inherits_from_recursive(backing_hd, bs);
>>>>>>     
>>>>>> -    if (bdrv_is_backing_chain_frozen(bs, backing_bs(bs), errp)) {
>>>>>> +    if (bdrv_is_backing_chain_frozen(bs, child_bs(bs->backing), errp)) {
>>>>>
>>>>> If we support file filters in a frozen backing chain, could it go through
>>>>> the file child here?  Hmm, only in the case when we are going to set the
>>>>> backing hd of a file filter..  Hmm, could a filter have both file and
>>>>> backing children?
>>>>
>>>> No.  A filter passes through data from its children, so it can only have
>>>> a single child, or it is quorum.
>>>>
>>>> The file/backing combination is reserved for COW overlays.  file is
>>>> where the current layer’s data is, backing is the filtered child.
>>>
>>> My backup-top has two children, backing and target.  So I think we can
>>> state that a filter should not have both file and backing children, but
>>> may have any other special children it wants, invisible to the generic
>>> backing-child/file-child logic.
>>
>> Ah, yes, sorry, that’s what I meant.  A filter can have only a single
>> filtered child, but other than that, they’re free to have whatever.
>>
>> [...]
>>
>>>>> Here we don't want to check the chain, we exactly want to check the
>>>>> backing link, so it should be something like
>>>>>
>>>>> if (bs->backing && bs->backing->frozen) {
>>>>>       error_setg(errp, "backing child exists and is frozen!");
>>>>>       return;
>>>>> }
>>>>>
>>>>>
>>>>> Hmm, on the other hand, if we have a frozen backing chain going through
>>>>> the file child, we must not add a backing child to the node with the
>>>>> file child, as it will change the backing chain (which by default goes
>>>>> through backing)..
>>>>>
>>>>> Anyway, we don't need to check the whole backing chain, as we may find
>>>>> some other frozen backing subchain far away from bs..  So we possibly
>>>>> want to check
>>>>>
>>>>> if (bdrv_filtered_child(bs) && bdrv_filtered_child(bs)->frozen) {
>>>>>      ERROR
>>>>> }
>>>>>
>>>>>
>>>>> ....
>>>>>
>>>>> Also, we'll need to check for a frozen file child when we want to
>>>>> replace it.
>>>>
>>>> I don’t quite understand.  It sounds to me like you’re saying we don’t
>>>> need to check the whole chain here but just the immediate child.  But
>>>> isn’t that true regardless of this series?
>>>
>>> If we restrict adding a backing child to a filter that has a file child,
>>> everything becomes simpler and seems to be correct.
>>
>> OK. :-)
>>
>>> Should we add a check for a frozen file child to bdrv_replace_child()?
>>
>> Argh.  You mean move it from bdrv_set_backing_hd()?  That actually makes
>> a lot of sense to me.  The problem is that bdrv_replace_child()
>> currently cannot return an error, which may be a problem for
>> bdrv_detach_child().  Hm.  But that’s effectively only called from
>> functions where the child is unref’d, and you have to know that your own
>> child is not frozen before you unref it.  So I guess we should be good
>> to pass an &error_abort there.
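A sketch of what that could look like -- assumed shape only, not from this
patch (the real bdrv_replace_child() currently takes no Error parameter):

    static void bdrv_replace_child(BdrvChild *child, BlockDriverState *new_bs,
                                   Error **errp)
    {
        /* Refuse to touch a frozen link; callers that know their child
         * cannot be frozen (e.g. the bdrv_detach_child() paths) would
         * pass &error_abort. */
        if (child->frozen) {
            error_setg(errp, "Cannot replace the frozen child %s",
                       child->name);
            return;
        }

        /* ... existing replacement logic ... */
    }
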
>>
>> [...]
>>
>>>>>> @@ -2208,7 +2218,7 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
>>>>>>         bool first = true;
>>>>>>     
>>>>>>         assert(bs != base);
>>>>>> -    for (p = bs; p != base; p = backing_bs(p)) {
>>>>>> +    for (p = bs; p != base; p = bdrv_filtered_bs(p)) {
>>>>>>             ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
>>>>>>                                        file);
>>>>>
>>>>> Interesting that for filters that use bdrv_co_block_status_from_backing
>>>>> and bdrv_co_block_status_from_file we will end up calling
>>>>> .bdrv_co_block_status of the underlying real node two or more times..
>>>>> It's not wrong, but obviously not optimal.
>>>>
>>>> Hm.  If @p is a filter, we could skip straight to *file.  Would that work?
>>>
>>> No, as the file child may not be in the backing chain:
>>>
>>> filter  [A]
>>>      |
>>>      v
>>> qcow2 -> file  [B]
>>>      |
>>>      v
>>> qcow2
>>>
>>> So, we shouldn't redirect the whole loop to file..
>>
>> But qcow2 is not a filter.  I meant skipping to *file only if the
>> current node is a filter.  And I don’t mean bs->file, I mean *file --
>> like, what bdrv_co_block_status() returns.
> 
> Me too. But as I understand, if we call bdrv_block_status on filter [A],
> resulting *file returned by bdrv_co_block_status() will point to file [B]
> due to recursion in bdrv_co_block_status.

Crap, you’re right.  Hm.  I guess I’ll just ignore this performance
problem for now, then.  The best thing I can think of would be to turn
@want_zero into a flag mask and then add a flag for “Please do not
recurse through filters, just give me BDRV_BLOCK_RAW”.
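As a sketch, that flag-mask idea could look something like the following --
the flag names are made up here, nothing like them exists in the patch:

    /* Hypothetical flags replacing the bool want_zero parameter */
    #define BDRV_STATUS_WANT_ZERO          (1 << 0)
    #define BDRV_STATUS_NO_FILTER_RECURSE  (1 << 1)

    /* A filter's status would then be reported as BDRV_BLOCK_RAW without
     * recursing into the filtered child: */
    ret = bdrv_co_block_status(p,
                               BDRV_STATUS_WANT_ZERO |
                               BDRV_STATUS_NO_FILTER_RECURSE,
                               offset, bytes, pnum, map, file);
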

>> You say in your other mail that filters can have their own implementation
>> of .bdrv_co_block_status(), but I don’t think that makes sense, actually.
>> They should always pass the status of their filtered child.
>>
>> blkdebug is the only filter I know that has its own implementation, and
>> the only thing it does besides passing the data through is add an
>> alignment assertion.  If it simplifies everything else, I’m very much
>> willing to break that.
> 
> Agreed, an assertion is a bad reason not to implement some clean generic
> logic.

OK.
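
For reference, the generic pass-through that a filter can use instead of its
own implementation looks roughly like this (reproduced from memory of
bdrv_co_block_status_from_backing() in block/io.c; check the tree for the
exact code):

    int coroutine_fn bdrv_co_block_status_from_backing(BlockDriverState *bs,
                                                       bool want_zero,
                                                       int64_t offset,
                                                       int64_t bytes,
                                                       int64_t *pnum,
                                                       int64_t *map,
                                                       BlockDriverState **file)
    {
        assert(bs->backing && bs->backing->bs);

        /* Report everything as "raw" data at the same offset in the backing
         * child; the caller recurses into *file for the real status. */
        *pnum = bytes;
        *map = offset;
        *file = bs->backing->bs;
        return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
    }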

Max

Attachment: signature.asc
Description: OpenPGP digital signature

