qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-block] [Qemu-devel] [PATCH v4 02/11] block: Filtered children


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [Qemu-block] [Qemu-devel] [PATCH v4 02/11] block: Filtered children access functions
Date: Fri, 17 May 2019 14:50:56 +0000

10.04.2019 23:20, Max Reitz wrote:
> What bs->file and bs->backing mean depends on the node.  For filter
> nodes, both signify a node that will eventually receive all R/W
> accesses.  For format nodes, bs->file contains metadata and data, and
> bs->backing will not receive writes -- instead, writes are COWed to
> bs->file.  Usually.
> 
> In any case, it is not trivial to guess what a child means exactly with
> our currently limited form of expression.  It is better to introduce
> some functions that actually guarantee a meaning:
> 
> - bdrv_filtered_cow_child() will return the child that receives requests
>    filtered through COW.  That is, reads may or may not be forwarded
>    (depending on the overlay's allocation status), but writes never go to
>    this child.
> 
> - bdrv_filtered_rw_child() will return the child that receives requests
>    filtered through some very plain process.  Reads and writes issued to
>    the parent will go to the child as well (although timing, etc. may be
>    modified).
> 
> - All drivers but quorum (but quorum is pretty opaque to the general
>    block layer anyway) always only have one of these children: All read
>    requests must be served from the filtered_rw_child (if it exists), so
>    if there was a filtered_cow_child in addition, it would not receive
>    any requests at all.
>    (The closest here is mirror, where all requests are passed on to the
>    source, but with write-blocking, write requests are "COWed" to the
>    target.  But that just means that the target is a special child that
>    cannot be introspected by the generic block layer functions, and that
>    source is a filtered_rw_child.)
>    Therefore, we can also add bdrv_filtered_child() which returns that
>    one child (or NULL, if there is no filtered child).
> 
> Also, many places in the current block layer should be skipping filters
> (all filters or just the ones added implicitly, it depends) when going
> through a block node chain.  They do not do that currently, but this
> patch makes them.
> 
> One example for this is qemu-img map, which should skip filters and only
> look at the COW elements in the graph.  The change to iotest 204's
> reference output shows how using blkdebug on top of a COW node used to
> make qemu-img map disregard the rest of the backing chain, but with this
> patch, the allocation in the base image is reported correctly.
> 
> Furthermore, a note should be made that sometimes we do want to access
> bs->backing directly.  This is whenever the operation in question is not
> about accessing the COW child, but the "backing" child, be it COW or
> not.  This is the case in functions such as bdrv_open_backing_file() or
> whenever we have to deal with the special behavior of @backing as a
> blockdev option, which is that it does not default to null like all
> other child references do.
> 
> Finally, the query functions (query-block and query-named-block-nodes)
> are modified to return any filtered child under "backing", not just
> bs->backing or COW children.  This is so that filters do not interrupt
> the reported backing chain.  This changes the output of iotest 184, as
> the throttled node now appears as a backing child.
> 
> Signed-off-by: Max Reitz <address@hidden>
> ---

[..]

> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -660,8 +660,9 @@ static int mirror_exit_common(Job *job)
>                               &error_abort);
>       if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
>           BlockDriverState *backing = s->is_none_mode ? src : s->base;
> -        if (backing_bs(target_bs) != backing) {
> -            bdrv_set_backing_hd(target_bs, backing, &local_err);
> +        if (bdrv_backing_chain_next(target_bs) != backing) {
> +            bdrv_set_backing_hd(bdrv_skip_rw_filters(target_bs), backing,


here you support filters above target_bs ...

> +                                &local_err);
>               if (local_err) {
>                   error_report_err(local_err);
>                   ret = -EPERM;
> @@ -711,7 +712,7 @@ static int mirror_exit_common(Job *job)
>       block_job_remove_all_bdrv(bjob);
>       bdrv_child_try_set_perm(mirror_top_bs->backing, 0, BLK_PERM_ALL,
>                               &error_abort);
> -    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), 
> &error_abort);
> +    bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, 
> &error_abort);
>   
>       /* We just changed the BDS the job BB refers to (with either or both of 
> the
>        * bdrv_replace_node() calls), so switch the BB back so the cleanup does
> @@ -903,7 +904,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
>       } else {
>           s->target_cluster_size = BDRV_SECTOR_SIZE;
>       }
> -    if (backing_filename[0] && !target_bs->backing &&
> +    if (backing_filename[0] && !bdrv_filtered_cow_child(target_bs) &&

And here we need it too

[continuing from here]

>           s->granularity < s->target_cluster_size) {
>           s->buf_size = MAX(s->buf_size, s->target_cluster_size);
>           s->cow_bitmap = bitmap_new(length);
> @@ -1083,7 +1084,7 @@ static void mirror_complete(Job *job, Error **errp)
>       if (s->backing_mode == MIRROR_OPEN_BACKING_CHAIN) {
>           int ret;
>   
> -        assert(!target->backing);
> +        assert(!bdrv_filtered_cow_child(target));

hmm and here...

Possibly, we should add a kind of s->filtered_target to use in all such cases.

>           ret = bdrv_open_backing_file(target, NULL, "backing", errp);
>           if (ret < 0) {
>               return;
> @@ -1650,7 +1651,9 @@ static void mirror_start_job(const char *job_id, 
> BlockDriverState *bs,
>        * any jobs in them must be blocked */
>       if (target_is_backing) {
>           BlockDriverState *iter;
> -        for (iter = backing_bs(bs); iter != target; iter = backing_bs(iter)) 
> {
> +        for (iter = bdrv_filtered_bs(bs); iter != target;

should it be filtered_target too?

> +             iter = bdrv_filtered_bs(iter))
> +        {
>               /* XXX BLK_PERM_WRITE needs to be allowed so we don't block
>                * ourselves at s->base (if writes are blocked for a node, they 
> are
>                * also blocked for its backing file). The other options would 
> be a
> @@ -1691,7 +1694,7 @@ fail:
>   
>       bdrv_child_try_set_perm(mirror_top_bs->backing, 0, BLK_PERM_ALL,
>                               &error_abort);
> -    bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), 
> &error_abort);
> +    bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, 
> &error_abort);
>   
>       bdrv_unref(mirror_top_bs);
>   }
> @@ -1707,14 +1710,14 @@ void mirror_start(const char *job_id, 
> BlockDriverState *bs,
>                     MirrorCopyMode copy_mode, Error **errp)
>   {
>       bool is_none_mode;
> -    BlockDriverState *base;
> +    BlockDriverState *base = NULL;

dead assignment

>   
>       if (mode == MIRROR_SYNC_MODE_INCREMENTAL) {
>           error_setg(errp, "Sync mode 'incremental' not supported");
>           return;
>       }
>       is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
> -    base = mode == MIRROR_SYNC_MODE_TOP ? backing_bs(bs) : NULL;
> +    base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL;
>       mirror_start_job(job_id, bs, creation_flags, target, replaces,
>                        speed, granularity, buf_size, backing_mode,
>                        on_source_error, on_target_error, unmap, NULL, NULL,
> diff --git a/block/qapi.c b/block/qapi.c
> index 110d05dc57..478c6f5e0d 100644
> --- a/block/qapi.c
> +++ b/block/qapi.c
> @@ -149,9 +149,13 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend 
> *blk,
>               return NULL;
>           }
>   
> -        if (bs0->drv && bs0->backing) {
> +        if (bs0->drv && bdrv_filtered_child(bs0)) {
> +            /*
> +             * Put any filtered child here (for backwards compatibility to 
> when
> +             * we put bs0->backing here, which might be any filtered child).
> +             */
>               info->backing_file_depth++;
> -            bs0 = bs0->backing->bs;
> +            bs0 = bdrv_filtered_bs(bs0);
>               (*p_image_info)->has_backing_image = true;
>               p_image_info = &((*p_image_info)->backing_image);
>           } else {
> @@ -160,9 +164,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
>   
>           /* Skip automatically inserted nodes that the user isn't aware of 
> for
>            * query-block (blk != NULL), but not for query-named-block-nodes */
> -        while (blk && bs0->drv && bs0->implicit) {
> -            bs0 = backing_bs(bs0);
> -            assert(bs0);
> +        if (blk) {
> +            bs0 = bdrv_skip_implicit_filters(bs0);
>           }
>       }
>   
> @@ -347,9 +350,9 @@ static void bdrv_query_info(BlockBackend *blk, BlockInfo 
> **p_info,
>       BlockDriverState *bs = blk_bs(blk);
>       char *qdev;
>   
> -    /* Skip automatically inserted nodes that the user isn't aware of */
> -    while (bs && bs->drv && bs->implicit) {
> -        bs = backing_bs(bs);
> +    if (bs) {
> +        /* Skip automatically inserted nodes that the user isn't aware of */
> +        bs = bdrv_skip_implicit_filters(bs);
>       }
>   
>       info->device = g_strdup(blk_name(blk));
> @@ -506,6 +509,7 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
> BlockBackend *blk)
>   static BlockStats *bdrv_query_bds_stats(BlockDriverState *bs,
>                                           bool blk_level)
>   {
> +    BlockDriverState *cow_bs;
>       BlockStats *s = NULL;
>   
>       s = g_malloc0(sizeof(*s));
> @@ -518,9 +522,8 @@ static BlockStats *bdrv_query_bds_stats(BlockDriverState 
> *bs,
>       /* Skip automatically inserted nodes that the user isn't aware of in
>        * a BlockBackend-level command. Stay at the exact node for a node-level
>        * command. */
> -    while (blk_level && bs->drv && bs->implicit) {
> -        bs = backing_bs(bs);
> -        assert(bs);
> +    if (blk_level) {
> +        bs = bdrv_skip_implicit_filters(bs);
>       }
>   
>       if (bdrv_get_node_name(bs)[0]) {
> @@ -535,9 +538,10 @@ static BlockStats *bdrv_query_bds_stats(BlockDriverState 
> *bs,
>           s->parent = bdrv_query_bds_stats(bs->file->bs, blk_level);
>       }
>   
> -    if (blk_level && bs->backing) {
> +    cow_bs = bdrv_filtered_cow_bs(bs);

So, if we at blk_level and top bs is explicit filter, you don't want to show 
it's
child?

Hmm, at least, we can't show it if it is file-child, as qapi filed already 
called
backing. So, if we can't show for file-child-based filters, it may be better to 
not
show filter children here at all.

> +    if (blk_level && cow_bs) {
>           s->has_backing = true;
> -        s->backing = bdrv_query_bds_stats(bs->backing->bs, blk_level);
> +        s->backing = bdrv_query_bds_stats(cow_bs, blk_level);
>       }
>   
>       return s;
> diff --git a/block/stream.c b/block/stream.c
> index bfaebb861a..23d5c890e0 100644
> --- a/block/stream.c
> +++ b/block/stream.c
> @@ -65,6 +65,7 @@ static int stream_prepare(Job *job)
>       StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
>       BlockJob *bjob = &s->common;
>       BlockDriverState *bs = blk_bs(bjob->blk);
> +    BlockDriverState *unfiltered = bdrv_skip_rw_filters(bs);

Aha, I'd call it filtered, but unfiltered is correct too, it's amazing

>       BlockDriverState *base = s->base;
>       Error *local_err = NULL;
>       int ret = 0;
> @@ -72,7 +73,7 @@ static int stream_prepare(Job *job)
>       bdrv_unfreeze_backing_chain(bs, base);
>       s->chain_frozen = false;
>   
> -    if (bs->backing) {
> +    if (bdrv_filtered_cow_child(unfiltered)) {
>           const char *base_id = NULL, *base_fmt = NULL;
>           if (base) {
>               base_id = s->backing_file_str;
> @@ -80,7 +81,7 @@ static int stream_prepare(Job *job)
>                   base_fmt = base->drv->format_name;
>               }
>           }
> -        ret = bdrv_change_backing_file(bs, base_id, base_fmt);
> +        ret = bdrv_change_backing_file(unfiltered, base_id, base_fmt);
>           bdrv_set_backing_hd(bs, base, &local_err);
>           if (local_err) {
>               error_report_err(local_err);
> @@ -121,7 +122,7 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
>       int64_t n = 0; /* bytes */
>       void *buf;
>   
> -    if (!bs->backing) {
> +    if (!bdrv_filtered_child(bs)) {
>           goto out;
>       }

this condition checks that there is nothing to stream, so, I thing it's better 
to check
if (!bdrv_backing_chain_next(bs)) {
   goto out;
}

>   
> @@ -162,7 +163,7 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
>           } else if (ret >= 0) {
>               /* Copy if allocated in the intermediate images.  Limit to the
>                * known-unallocated area [offset, offset+n*BDRV_SECTOR_SIZE).  
> */
> -            ret = bdrv_is_allocated_above(backing_bs(bs), base,
> +            ret = bdrv_is_allocated_above(bdrv_filtered_bs(bs), base,
>                                             offset, n, &n);

Hmm, if we trying to support bs to be filter, and actually operate on 
first-non-filter,
as you write in qapi spec, this is wrong. Again it should be
bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs))..

Or, may be better, we at stream start should calculate reald top bs to operate 
on, and
forget about all filters above.. i.e., do bs = bdrv_skip_rw_filters(bs) at the 
very
beginning, when creating a job.

>   
>               /* Finish early if end of backing file has been reached */
> @@ -268,7 +269,9 @@ void stream_start(const char *job_id, BlockDriverState 
> *bs,
>        * disappear from the chain after this operation. The streaming job 
> reads
>        * every block only once, assuming that it doesn't change, so block 
> writes
>        * and resizes. */
> -    for (iter = backing_bs(bs); iter && iter != base; iter = 
> backing_bs(iter)) {
> +    for (iter = bdrv_filtered_bs(bs); iter && iter != base;
> +         iter = bdrv_filtered_bs(iter))
> +    {
>           block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
>                              BLK_PERM_CONSISTENT_READ | 
> BLK_PERM_WRITE_UNCHANGED,
>                              &error_abort);
> diff --git a/blockdev.c b/blockdev.c
> index 4775a07d93..bb71b8368d 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -1094,7 +1094,7 @@ void hmp_commit(Monitor *mon, const QDict *qdict)
>               return;
>           }
>   
> -        bs = blk_bs(blk);
> +        bs = bdrv_skip_implicit_filters(blk_bs(blk));
>           aio_context = bdrv_get_aio_context(bs);
>           aio_context_acquire(aio_context);
>   
> @@ -1663,7 +1663,7 @@ static void external_snapshot_prepare(BlkActionState 
> *common,
>           goto out;
>       }
>   
> -    if (state->new_bs->backing != NULL) {
> +    if (bdrv_filtered_cow_child(state->new_bs)) {

Do we allow to create filter snapshot? We should either restrict it explicitly 
or
check bdrv_filtered_child here.. And we can't allow file-based-filters anyway..

[skipped up to the end of blockdev.c, I'm tired o_O]

[..]

> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> index d1bb863cb6..f99f753fba 100644
> --- a/migration/block-dirty-bitmap.c
> +++ b/migration/block-dirty-bitmap.c
> @@ -285,9 +285,7 @@ static int init_dirty_bitmap_migration(void)
>           const char *drive_name = bdrv_get_device_or_node_name(bs);
>   
>           /* skip automatically inserted nodes */
> -        while (bs && bs->drv && bs->implicit) {
> -            bs = backing_bs(bs);
> -        }
> +        bs = bdrv_skip_implicit_filters(bs);

this intersects with Jonh's patch
[PATCH v2] migration/dirty-bitmaps: change bitmap enumeration method
https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg03340.html

>   
>           for (bitmap = bdrv_dirty_bitmap_next(bs, NULL); bitmap;
>                bitmap = bdrv_dirty_bitmap_next(bs, bitmap))
> diff --git a/nbd/server.c b/nbd/server.c
> index e21bd501dc..e41ae89dbe 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -1506,13 +1506,13 @@ NBDExport *nbd_export_new(BlockDriverState *bs, 
> uint64_t dev_offset,
>       if (bitmap) {
>           BdrvDirtyBitmap *bm = NULL;
>   
> -        while (true) {
> +        while (bs) {
>               bm = bdrv_find_dirty_bitmap(bs, bitmap);
> -            if (bm != NULL || bs->backing == NULL) {
> +            if (bm != NULL) {
>                   break;
>               }
>   
> -            bs = bs->backing->bs;
> +            bs = bdrv_filtered_bs(bs);
>           }

Check in documentation: "@bitmap: Also export the dirty bitmap reachable from 
@device".

"Reachable" is not bad, but we may want to clarify that extended backing chain 
is meant

>   
>           if (bm == NULL) {
> diff --git a/qemu-img.c b/qemu-img.c
> index aa6f81f1ea..bcfbb743fc 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -982,7 +982,7 @@ static int img_commit(int argc, char **argv)
>       if (!blk) {
>           return 1;
>       }
> -    bs = blk_bs(blk);
> +    bs = bdrv_skip_implicit_filters(blk_bs(blk));

hope there should not be any, but for consistancy we may skip them

>   
>       qemu_progress_init(progress, 1.f);
>       qemu_progress_print(0.f, 100);
> @@ -999,7 +999,7 @@ static int img_commit(int argc, char **argv)
>           /* This is different from QMP, which by default uses the deepest 
> file in
>            * the backing chain (i.e., the very base); however, the traditional
>            * behavior of qemu-img commit is using the immediate backing file. 
> */
> -        base_bs = backing_bs(bs);
> +        base_bs = bdrv_filtered_cow_bs(bs);
>           if (!base_bs) {
>               error_setg(&local_err, "Image does not have a backing file");
>               goto done;
> @@ -1616,19 +1616,18 @@ static int convert_iteration_sectors(ImgConvertState 
> *s, int64_t sector_num)
>   
>       if (s->sector_next_status <= sector_num) {
>           int64_t count = n * BDRV_SECTOR_SIZE;
> +        BlockDriverState *src_bs = blk_bs(s->src[src_cur]);
> +        BlockDriverState *base;
>   
>           if (s->target_has_backing) {
> -
> -            ret = bdrv_block_status(blk_bs(s->src[src_cur]),
> -                                    (sector_num - src_cur_offset) *
> -                                    BDRV_SECTOR_SIZE,
> -                                    count, &count, NULL, NULL);
> +            base = bdrv_backing_chain_next(src_bs);
>           } else {
> -            ret = bdrv_block_status_above(blk_bs(s->src[src_cur]), NULL,
> -                                          (sector_num - src_cur_offset) *
> -                                          BDRV_SECTOR_SIZE,
> -                                          count, &count, NULL, NULL);
> +            base = NULL;
>           }
> +        ret = bdrv_block_status_above(src_bs, base,
> +                                      (sector_num - src_cur_offset) *
> +                                      BDRV_SECTOR_SIZE,
> +                                      count, &count, NULL, NULL);
>           if (ret < 0) {
>               error_report("error while reading block status of sector %" 
> PRId64
>                            ": %s", sector_num, strerror(-ret));
> @@ -2434,7 +2433,8 @@ static int img_convert(int argc, char **argv)
>            * s.target_backing_sectors has to be negative, which it will
>            * be automatically).  The backing file length is used only
>            * for optimizations, so such a case is not fatal. */
> -        s.target_backing_sectors = bdrv_nb_sectors(out_bs->backing->bs);
> +        s.target_backing_sectors =
> +            bdrv_nb_sectors(bdrv_filtered_cow_bs(out_bs));

can't out_bs be filter itself?

>       } else {
>           s.target_backing_sectors = -1;
>       }
> @@ -2797,6 +2797,7 @@ static int get_block_status(BlockDriverState *bs, 
> int64_t offset,
>   
>       depth = 0;
>       for (;;) {
> +        bs = bdrv_skip_rw_filters(bs);

Why? Filters may have own implementation of block_status, why to skip it?

Or, thay cannot? Really, may be disallow filters have block_status, we may solve
inefficient block_status_above we talked about before.

>           ret = bdrv_block_status(bs, offset, bytes, &bytes, &map, &file);
>           if (ret < 0) {
>               return ret;
> @@ -2805,7 +2806,7 @@ static int get_block_status(BlockDriverState *bs, 
> int64_t offset,
>           if (ret & (BDRV_BLOCK_ZERO|BDRV_BLOCK_DATA)) {
>               break;
>           }
> -        bs = backing_bs(bs);
> +        bs = bdrv_filtered_cow_bs(bs);
>           if (bs == NULL) {
>               ret = 0;
>               break;
> @@ -2944,7 +2945,7 @@ static int img_map(int argc, char **argv)
>       if (!blk) {
>           return 1;
>       }
> -    bs = blk_bs(blk);
> +    bs = bdrv_skip_implicit_filters(blk_bs(blk));
>   
>       if (output_format == OFORMAT_HUMAN) {
>           printf("%-16s%-16s%-16s%s\n", "Offset", "Length", "Mapped to", 
> "File");
> diff --git a/tests/qemu-iotests/184.out b/tests/qemu-iotests/184.out
> index 3deb3cfb94..1d61f7e224 100644
> --- a/tests/qemu-iotests/184.out
> +++ b/tests/qemu-iotests/184.out
> @@ -27,6 +27,11 @@ Testing:
>               "iops_rd": 0,
>               "detect_zeroes": "off",
>               "image": {
> +                "backing-image": {
> +                    "virtual-size": 1073741824,
> +                    "filename": "null-co://",
> +                    "format": "null-co"
> +                },
>                   "virtual-size": 1073741824,
>                   "filename": "json:{\"throttle-group\": \"group0\", 
> \"driver\": \"throttle\", \"file\": {\"driver\": \"null-co\"}}",
>                   "format": "throttle"
> @@ -34,7 +39,7 @@ Testing:
>               "iops_wr": 0,
>               "ro": false,
>               "node-name": "throttle0",
> -            "backing_file_depth": 0,
> +            "backing_file_depth": 1,
>               "drv": "throttle",
>               "iops": 0,
>               "bps_wr": 0,
> diff --git a/tests/qemu-iotests/204.out b/tests/qemu-iotests/204.out
> index f3a10fbe90..684774d763 100644
> --- a/tests/qemu-iotests/204.out
> +++ b/tests/qemu-iotests/204.out
> @@ -59,5 +59,6 @@ Offset          Length          File
>   0x900000        0x2400000       TEST_DIR/t.IMGFMT
>   0x3c00000       0x1100000       TEST_DIR/t.IMGFMT
>   0x6a00000       0x400000        TEST_DIR/t.IMGFMT
> +0x6e00000       0x1200000       TEST_DIR/t.IMGFMT.base
>   No errors were found on the image.
>   *** done
> 


-- 
Best regards,
Vladimir

reply via email to

[Prev in Thread] Current Thread [Next in Thread]