qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-block] [Qemu-devel] [PATCH v4 02/11] block: Filtered children


From: Max Reitz
Subject: Re: [Qemu-block] [Qemu-devel] [PATCH v4 02/11] block: Filtered children access functions
Date: Thu, 23 May 2019 19:27:58 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1

On 17.05.19 16:50, Vladimir Sementsov-Ogievskiy wrote:
> 10.04.2019 23:20, Max Reitz wrote:
>> What bs->file and bs->backing mean depends on the node.  For filter
>> nodes, both signify a node that will eventually receive all R/W
>> accesses.  For format nodes, bs->file contains metadata and data, and
>> bs->backing will not receive writes -- instead, writes are COWed to
>> bs->file.  Usually.
>>
>> In any case, it is not trivial to guess what a child means exactly with
>> our currently limited form of expression.  It is better to introduce
>> some functions that actually guarantee a meaning:
>>
>> - bdrv_filtered_cow_child() will return the child that receives requests
>>    filtered through COW.  That is, reads may or may not be forwarded
>>    (depending on the overlay's allocation status), but writes never go to
>>    this child.
>>
>> - bdrv_filtered_rw_child() will return the child that receives requests
>>    filtered through some very plain process.  Reads and writes issued to
>>    the parent will go to the child as well (although timing, etc. may be
>>    modified).
>>
>> - All drivers but quorum (but quorum is pretty opaque to the general
>>    block layer anyway) always only have one of these children: All read
>>    requests must be served from the filtered_rw_child (if it exists), so
>>    if there was a filtered_cow_child in addition, it would not receive
>>    any requests at all.
>>    (The closest here is mirror, where all requests are passed on to the
>>    source, but with write-blocking, write requests are "COWed" to the
>>    target.  But that just means that the target is a special child that
>>    cannot be introspected by the generic block layer functions, and that
>>    source is a filtered_rw_child.)
>>    Therefore, we can also add bdrv_filtered_child() which returns that
>>    one child (or NULL, if there is no filtered child).
>>
>> Also, many places in the current block layer should be skipping filters
>> (all filters or just the ones added implicitly, it depends) when going
>> through a block node chain.  They do not do that currently, but this
>> patch makes them.
>>
>> One example for this is qemu-img map, which should skip filters and only
>> look at the COW elements in the graph.  The change to iotest 204's
>> reference output shows how using blkdebug on top of a COW node used to
>> make qemu-img map disregard the rest of the backing chain, but with this
>> patch, the allocation in the base image is reported correctly.
>>
>> Furthermore, a note should be made that sometimes we do want to access
>> bs->backing directly.  This is whenever the operation in question is not
>> about accessing the COW child, but the "backing" child, be it COW or
>> not.  This is the case in functions such as bdrv_open_backing_file() or
>> whenever we have to deal with the special behavior of @backing as a
>> blockdev option, which is that it does not default to null like all
>> other child references do.
>>
>> Finally, the query functions (query-block and query-named-block-nodes)
>> are modified to return any filtered child under "backing", not just
>> bs->backing or COW children.  This is so that filters do not interrupt
>> the reported backing chain.  This changes the output of iotest 184, as
>> the throttled node now appears as a backing child.
>>
>> Signed-off-by: Max Reitz <address@hidden>
>> ---
> 
> [..]
> 
>> --- a/block/mirror.c
>> +++ b/block/mirror.c

[...]

>> @@ -1650,7 +1651,9 @@ static void mirror_start_job(const char *job_id, 
>> BlockDriverState *bs,
>>        * any jobs in them must be blocked */
>>       if (target_is_backing) {
>>           BlockDriverState *iter;
>> -        for (iter = backing_bs(bs); iter != target; iter = 
>> backing_bs(iter)) {
>> +        for (iter = bdrv_filtered_bs(bs); iter != target;
> 
> should it be filtered_target too?

Hmm...  The comment says that all nodes that disappear must be blocked.
 I don’t even know by heart which nodes I let disappear. :-/

I suppose we should start at the first explicit node, filter or not...?

>> +             iter = bdrv_filtered_bs(iter))
>> +        {
>>               /* XXX BLK_PERM_WRITE needs to be allowed so we don't block
>>                * ourselves at s->base (if writes are blocked for a node, 
>> they are
>>                * also blocked for its backing file). The other options would 
>> be a

[...]

>> @@ -1707,14 +1710,14 @@ void mirror_start(const char *job_id, 
>> BlockDriverState *bs,
>>                     MirrorCopyMode copy_mode, Error **errp)
>>   {
>>       bool is_none_mode;
>> -    BlockDriverState *base;
>> +    BlockDriverState *base = NULL;
> 
> dead assignment

Now I wonder why I even have that.  Probably an artifact from some
intermediate point.

>>   
>>       if (mode == MIRROR_SYNC_MODE_INCREMENTAL) {
>>           error_setg(errp, "Sync mode 'incremental' not supported");
>>           return;
>>       }
>>       is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
>> -    base = mode == MIRROR_SYNC_MODE_TOP ? backing_bs(bs) : NULL;
>> +    base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : 
>> NULL;
>>       mirror_start_job(job_id, bs, creation_flags, target, replaces,
>>                        speed, granularity, buf_size, backing_mode,
>>                        on_source_error, on_target_error, unmap, NULL, NULL,
>> diff --git a/block/qapi.c b/block/qapi.c
>> index 110d05dc57..478c6f5e0d 100644
>> --- a/block/qapi.c
>> +++ b/block/qapi.c

[...]

>> @@ -535,9 +538,10 @@ static BlockStats 
>> *bdrv_query_bds_stats(BlockDriverState *bs,
>>           s->parent = bdrv_query_bds_stats(bs->file->bs, blk_level);
>>       }
>>   
>> -    if (blk_level && bs->backing) {
>> +    cow_bs = bdrv_filtered_cow_bs(bs);
> 
> So, if we at blk_level and top bs is explicit filter, you don't want to show 
> it's
> child?

I do.  It’s in s->parent.  I thought it makes sense to change the
existing bs->file vs. bs->backing to storage vs. COW.

> Hmm, at least, we can't show it if it is file-child, as qapi filed already 
> called
> backing. So, if we can't show for file-child-based filters, it may be better 
> to not
> show filter children here at all.
> 
>> +    if (blk_level && cow_bs) {
>>           s->has_backing = true;
>> -        s->backing = bdrv_query_bds_stats(bs->backing->bs, blk_level);
>> +        s->backing = bdrv_query_bds_stats(cow_bs, blk_level);
>>       }
>>   
>>       return s;
>> diff --git a/block/stream.c b/block/stream.c
>> index bfaebb861a..23d5c890e0 100644
>> --- a/block/stream.c
>> +++ b/block/stream.c
>> @@ -65,6 +65,7 @@ static int stream_prepare(Job *job)
>>       StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
>>       BlockJob *bjob = &s->common;
>>       BlockDriverState *bs = blk_bs(bjob->blk);
>> +    BlockDriverState *unfiltered = bdrv_skip_rw_filters(bs);
> 
> Aha, I'd call it filtered, but unfiltered is correct too, it's amazing

Haha :-)

I think it’s all rather insane than amazing, but, well, insanity never
ceases to amaze, does it.

>>       BlockDriverState *base = s->base;
>>       Error *local_err = NULL;
>>       int ret = 0;
>> @@ -72,7 +73,7 @@ static int stream_prepare(Job *job)

[...]

>> @@ -121,7 +122,7 @@ static int coroutine_fn stream_run(Job *job, Error 
>> **errp)
>>       int64_t n = 0; /* bytes */
>>       void *buf;
>>   
>> -    if (!bs->backing) {
>> +    if (!bdrv_filtered_child(bs)) {
>>           goto out;
>>       }
> 
> this condition checks that there is nothing to stream, so, I thing it's 
> better to check
> if (!bdrv_backing_chain_next(bs)) {
>    goto out;
> }

Ah, sure.

>> @@ -162,7 +163,7 @@ static int coroutine_fn stream_run(Job *job, Error 
>> **errp)
>>           } else if (ret >= 0) {
>>               /* Copy if allocated in the intermediate images.  Limit to the
>>                * known-unallocated area [offset, offset+n*BDRV_SECTOR_SIZE). 
>>  */
>> -            ret = bdrv_is_allocated_above(backing_bs(bs), base,
>> +            ret = bdrv_is_allocated_above(bdrv_filtered_bs(bs), base,
>>                                             offset, n, &n);
> 
> Hmm, if we trying to support bs to be filter, and actually operate on 
> first-non-filter,
> as you write in qapi spec, this is wrong. Again it should be
> bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs))..

Would bdrv_backing_chain_next() fulfill the same purpose?  It can’t be
allocated in a filter node, after all.

> Or, may be better, we at stream start should calculate reald top bs to 
> operate on, and
> forget about all filters above.. i.e., do bs = bdrv_skip_rw_filters(bs) at 
> the very
> beginning, when creating a job.

Sounds reasonable.  We can ignore all the filters on top of the
(un)filtered top anyway.

>>               /* Finish early if end of backing file has been reached */
>> @@ -268,7 +269,9 @@ void stream_start(const char *job_id, BlockDriverState 
>> *bs,
>>        * disappear from the chain after this operation. The streaming job 
>> reads
>>        * every block only once, assuming that it doesn't change, so block 
>> writes
>>        * and resizes. */
>> -    for (iter = backing_bs(bs); iter && iter != base; iter = 
>> backing_bs(iter)) {
>> +    for (iter = bdrv_filtered_bs(bs); iter && iter != base;
>> +         iter = bdrv_filtered_bs(iter))
>> +    {
>>           block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
>>                              BLK_PERM_CONSISTENT_READ | 
>> BLK_PERM_WRITE_UNCHANGED,
>>                              &error_abort);
>> diff --git a/blockdev.c b/blockdev.c
>> index 4775a07d93..bb71b8368d 100644
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -1094,7 +1094,7 @@ void hmp_commit(Monitor *mon, const QDict *qdict)
>>               return;
>>           }
>>   
>> -        bs = blk_bs(blk);
>> +        bs = bdrv_skip_implicit_filters(blk_bs(blk));
>>           aio_context = bdrv_get_aio_context(bs);
>>           aio_context_acquire(aio_context);
>>   
>> @@ -1663,7 +1663,7 @@ static void external_snapshot_prepare(BlkActionState 
>> *common,
>>           goto out;
>>       }
>>   
>> -    if (state->new_bs->backing != NULL) {
>> +    if (bdrv_filtered_cow_child(state->new_bs)) {
> 
> Do we allow to create filter snapshot? We should either restrict it 
> explicitly or
> check bdrv_filtered_child here.. And we can't allow file-based-filters 
> anyway..

Hm, yes, we should probably check both (separately to give better error
messages).

In theory it might be possible to allow filters on top, but there isn’t
really any point.  If someone wants to add filters on top of the
snapshot, they should use reopen.

> [skipped up to the end of blockdev.c, I'm tired o_O]

I can very much relate. :-)

Your review definitely is much appreciated.

>> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
>> index d1bb863cb6..f99f753fba 100644
>> --- a/migration/block-dirty-bitmap.c
>> +++ b/migration/block-dirty-bitmap.c
>> @@ -285,9 +285,7 @@ static int init_dirty_bitmap_migration(void)
>>           const char *drive_name = bdrv_get_device_or_node_name(bs);
>>   
>>           /* skip automatically inserted nodes */
>> -        while (bs && bs->drv && bs->implicit) {
>> -            bs = backing_bs(bs);
>> -        }
>> +        bs = bdrv_skip_implicit_filters(bs);
> 
> this intersects with Jonh's patch
> [PATCH v2] migration/dirty-bitmaps: change bitmap enumeration method
> https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg03340.html

Well.  I’m not really considerate of other patches with this series.
Rebasing is always such a pain that I just write it for the current
master.  I won’t incorporate unmerged series because doing so may cause
me to have to rebase more than once.

And I can’t get this series merged soon enough because it’s just wrong
that I (and you) have to the one(s) thinking about how to treat filters
everywhere.  It should be the people that introduce the code.

>>           for (bitmap = bdrv_dirty_bitmap_next(bs, NULL); bitmap;
>>                bitmap = bdrv_dirty_bitmap_next(bs, bitmap))
>> diff --git a/nbd/server.c b/nbd/server.c
>> index e21bd501dc..e41ae89dbe 100644
>> --- a/nbd/server.c
>> +++ b/nbd/server.c
>> @@ -1506,13 +1506,13 @@ NBDExport *nbd_export_new(BlockDriverState *bs, 
>> uint64_t dev_offset,
>>       if (bitmap) {
>>           BdrvDirtyBitmap *bm = NULL;
>>   
>> -        while (true) {
>> +        while (bs) {
>>               bm = bdrv_find_dirty_bitmap(bs, bitmap);
>> -            if (bm != NULL || bs->backing == NULL) {
>> +            if (bm != NULL) {
>>                   break;
>>               }
>>   
>> -            bs = bs->backing->bs;
>> +            bs = bdrv_filtered_bs(bs);
>>           }
> 
> Check in documentation: "@bitmap: Also export the dirty bitmap reachable from 
> @device".
> 
> "Reachable" is not bad, but we may want to clarify that extended backing 
> chain is meant

Hm...  Isn’t that just a problem with the current documentation?

I think this change in code better fits what I’d guess from “reachable”
than what it currently means.

>>           if (bm == NULL) {
>> diff --git a/qemu-img.c b/qemu-img.c
>> index aa6f81f1ea..bcfbb743fc 100644
>> --- a/qemu-img.c
>> +++ b/qemu-img.c

[...]

>> @@ -2434,7 +2433,8 @@ static int img_convert(int argc, char **argv)
>>            * s.target_backing_sectors has to be negative, which it will
>>            * be automatically).  The backing file length is used only
>>            * for optimizations, so such a case is not fatal. */
>> -        s.target_backing_sectors = bdrv_nb_sectors(out_bs->backing->bs);
>> +        s.target_backing_sectors =
>> +            bdrv_nb_sectors(bdrv_filtered_cow_bs(out_bs));
> 
> can't out_bs be filter itself?

why would you do that

More serious, well, perhaps, in theory.  In practice I really cannot
imagine why it would be.

> 
>>       } else {
>>           s.target_backing_sectors = -1;
>>       }
>> @@ -2797,6 +2797,7 @@ static int get_block_status(BlockDriverState *bs, 
>> int64_t offset,
>>   
>>       depth = 0;
>>       for (;;) {
>> +        bs = bdrv_skip_rw_filters(bs);
> 
> Why? Filters may have own implementation of block_status, why to skip it?
> 
> Or, thay cannot? Really, may be disallow filters have block_status, we may 
> solve
> inefficient block_status_above we talked about before.

As said in the other subthread, I think ignoring filters here is fine.

Max

>>           ret = bdrv_block_status(bs, offset, bytes, &bytes, &map, &file);
>>           if (ret < 0) {
>>               return ret;

Attachment: signature.asc
Description: OpenPGP digital signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]