[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-block] [Qemu-devel] [PATCH v4 02/11] block: Filtered children
From: |
Vladimir Sementsov-Ogievskiy |
Subject: |
Re: [Qemu-block] [Qemu-devel] [PATCH v4 02/11] block: Filtered children access functions |
Date: |
Fri, 17 May 2019 14:50:56 +0000 |
10.04.2019 23:20, Max Reitz wrote:
> What bs->file and bs->backing mean depends on the node. For filter
> nodes, both signify a node that will eventually receive all R/W
> accesses. For format nodes, bs->file contains metadata and data, and
> bs->backing will not receive writes -- instead, writes are COWed to
> bs->file. Usually.
>
> In any case, it is not trivial to guess what a child means exactly with
> our currently limited form of expression. It is better to introduce
> some functions that actually guarantee a meaning:
>
> - bdrv_filtered_cow_child() will return the child that receives requests
> filtered through COW. That is, reads may or may not be forwarded
> (depending on the overlay's allocation status), but writes never go to
> this child.
>
> - bdrv_filtered_rw_child() will return the child that receives requests
> filtered through some very plain process. Reads and writes issued to
> the parent will go to the child as well (although timing, etc. may be
> modified).
>
> - All drivers but quorum (but quorum is pretty opaque to the general
> block layer anyway) always only have one of these children: All read
> requests must be served from the filtered_rw_child (if it exists), so
> if there was a filtered_cow_child in addition, it would not receive
> any requests at all.
> (The closest here is mirror, where all requests are passed on to the
> source, but with write-blocking, write requests are "COWed" to the
> target. But that just means that the target is a special child that
> cannot be introspected by the generic block layer functions, and that
> source is a filtered_rw_child.)
> Therefore, we can also add bdrv_filtered_child() which returns that
> one child (or NULL, if there is no filtered child).
>
> Also, many places in the current block layer should be skipping filters
> (all filters or just the ones added implicitly, it depends) when going
> through a block node chain. They do not do that currently, but this
> patch makes them.
>
> One example for this is qemu-img map, which should skip filters and only
> look at the COW elements in the graph. The change to iotest 204's
> reference output shows how using blkdebug on top of a COW node used to
> make qemu-img map disregard the rest of the backing chain, but with this
> patch, the allocation in the base image is reported correctly.
>
> Furthermore, a note should be made that sometimes we do want to access
> bs->backing directly. This is whenever the operation in question is not
> about accessing the COW child, but the "backing" child, be it COW or
> not. This is the case in functions such as bdrv_open_backing_file() or
> whenever we have to deal with the special behavior of @backing as a
> blockdev option, which is that it does not default to null like all
> other child references do.
>
> Finally, the query functions (query-block and query-named-block-nodes)
> are modified to return any filtered child under "backing", not just
> bs->backing or COW children. This is so that filters do not interrupt
> the reported backing chain. This changes the output of iotest 184, as
> the throttled node now appears as a backing child.
>
> Signed-off-by: Max Reitz <address@hidden>
> ---
[..]
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -660,8 +660,9 @@ static int mirror_exit_common(Job *job)
> &error_abort);
> if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
> BlockDriverState *backing = s->is_none_mode ? src : s->base;
> - if (backing_bs(target_bs) != backing) {
> - bdrv_set_backing_hd(target_bs, backing, &local_err);
> + if (bdrv_backing_chain_next(target_bs) != backing) {
> + bdrv_set_backing_hd(bdrv_skip_rw_filters(target_bs), backing,
here you support filters above target_bs ...
> + &local_err);
> if (local_err) {
> error_report_err(local_err);
> ret = -EPERM;
> @@ -711,7 +712,7 @@ static int mirror_exit_common(Job *job)
> block_job_remove_all_bdrv(bjob);
> bdrv_child_try_set_perm(mirror_top_bs->backing, 0, BLK_PERM_ALL,
> &error_abort);
> - bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs),
> &error_abort);
> + bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs,
> &error_abort);
>
> /* We just changed the BDS the job BB refers to (with either or both of
> the
> * bdrv_replace_node() calls), so switch the BB back so the cleanup does
> @@ -903,7 +904,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
> } else {
> s->target_cluster_size = BDRV_SECTOR_SIZE;
> }
> - if (backing_filename[0] && !target_bs->backing &&
> + if (backing_filename[0] && !bdrv_filtered_cow_child(target_bs) &&
And here we need it too
[continuing from here]
> s->granularity < s->target_cluster_size) {
> s->buf_size = MAX(s->buf_size, s->target_cluster_size);
> s->cow_bitmap = bitmap_new(length);
> @@ -1083,7 +1084,7 @@ static void mirror_complete(Job *job, Error **errp)
> if (s->backing_mode == MIRROR_OPEN_BACKING_CHAIN) {
> int ret;
>
> - assert(!target->backing);
> + assert(!bdrv_filtered_cow_child(target));
hmm and here...
Possibly, we should add a kind of s->filtered_target to use in all such cases.
> ret = bdrv_open_backing_file(target, NULL, "backing", errp);
> if (ret < 0) {
> return;
> @@ -1650,7 +1651,9 @@ static void mirror_start_job(const char *job_id,
> BlockDriverState *bs,
> * any jobs in them must be blocked */
> if (target_is_backing) {
> BlockDriverState *iter;
> - for (iter = backing_bs(bs); iter != target; iter = backing_bs(iter))
> {
> + for (iter = bdrv_filtered_bs(bs); iter != target;
should it be filtered_target too?
> + iter = bdrv_filtered_bs(iter))
> + {
> /* XXX BLK_PERM_WRITE needs to be allowed so we don't block
> * ourselves at s->base (if writes are blocked for a node, they
> are
> * also blocked for its backing file). The other options would
> be a
> @@ -1691,7 +1694,7 @@ fail:
>
> bdrv_child_try_set_perm(mirror_top_bs->backing, 0, BLK_PERM_ALL,
> &error_abort);
> - bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs),
> &error_abort);
> + bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs,
> &error_abort);
>
> bdrv_unref(mirror_top_bs);
> }
> @@ -1707,14 +1710,14 @@ void mirror_start(const char *job_id,
> BlockDriverState *bs,
> MirrorCopyMode copy_mode, Error **errp)
> {
> bool is_none_mode;
> - BlockDriverState *base;
> + BlockDriverState *base = NULL;
dead assignment
>
> if (mode == MIRROR_SYNC_MODE_INCREMENTAL) {
> error_setg(errp, "Sync mode 'incremental' not supported");
> return;
> }
> is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
> - base = mode == MIRROR_SYNC_MODE_TOP ? backing_bs(bs) : NULL;
> + base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL;
> mirror_start_job(job_id, bs, creation_flags, target, replaces,
> speed, granularity, buf_size, backing_mode,
> on_source_error, on_target_error, unmap, NULL, NULL,
> diff --git a/block/qapi.c b/block/qapi.c
> index 110d05dc57..478c6f5e0d 100644
> --- a/block/qapi.c
> +++ b/block/qapi.c
> @@ -149,9 +149,13 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend
> *blk,
> return NULL;
> }
>
> - if (bs0->drv && bs0->backing) {
> + if (bs0->drv && bdrv_filtered_child(bs0)) {
> + /*
> + * Put any filtered child here (for backwards compatibility to
> when
> + * we put bs0->backing here, which might be any filtered child).
> + */
> info->backing_file_depth++;
> - bs0 = bs0->backing->bs;
> + bs0 = bdrv_filtered_bs(bs0);
> (*p_image_info)->has_backing_image = true;
> p_image_info = &((*p_image_info)->backing_image);
> } else {
> @@ -160,9 +164,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
>
> /* Skip automatically inserted nodes that the user isn't aware of
> for
> * query-block (blk != NULL), but not for query-named-block-nodes */
> - while (blk && bs0->drv && bs0->implicit) {
> - bs0 = backing_bs(bs0);
> - assert(bs0);
> + if (blk) {
> + bs0 = bdrv_skip_implicit_filters(bs0);
> }
> }
>
> @@ -347,9 +350,9 @@ static void bdrv_query_info(BlockBackend *blk, BlockInfo
> **p_info,
> BlockDriverState *bs = blk_bs(blk);
> char *qdev;
>
> - /* Skip automatically inserted nodes that the user isn't aware of */
> - while (bs && bs->drv && bs->implicit) {
> - bs = backing_bs(bs);
> + if (bs) {
> + /* Skip automatically inserted nodes that the user isn't aware of */
> + bs = bdrv_skip_implicit_filters(bs);
> }
>
> info->device = g_strdup(blk_name(blk));
> @@ -506,6 +509,7 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds,
> BlockBackend *blk)
> static BlockStats *bdrv_query_bds_stats(BlockDriverState *bs,
> bool blk_level)
> {
> + BlockDriverState *cow_bs;
> BlockStats *s = NULL;
>
> s = g_malloc0(sizeof(*s));
> @@ -518,9 +522,8 @@ static BlockStats *bdrv_query_bds_stats(BlockDriverState
> *bs,
> /* Skip automatically inserted nodes that the user isn't aware of in
> * a BlockBackend-level command. Stay at the exact node for a node-level
> * command. */
> - while (blk_level && bs->drv && bs->implicit) {
> - bs = backing_bs(bs);
> - assert(bs);
> + if (blk_level) {
> + bs = bdrv_skip_implicit_filters(bs);
> }
>
> if (bdrv_get_node_name(bs)[0]) {
> @@ -535,9 +538,10 @@ static BlockStats *bdrv_query_bds_stats(BlockDriverState
> *bs,
> s->parent = bdrv_query_bds_stats(bs->file->bs, blk_level);
> }
>
> - if (blk_level && bs->backing) {
> + cow_bs = bdrv_filtered_cow_bs(bs);
So, if we at blk_level and top bs is explicit filter, you don't want to show
it's
child?
Hmm, at least, we can't show it if it is file-child, as qapi filed already
called
backing. So, if we can't show for file-child-based filters, it may be better to
not
show filter children here at all.
> + if (blk_level && cow_bs) {
> s->has_backing = true;
> - s->backing = bdrv_query_bds_stats(bs->backing->bs, blk_level);
> + s->backing = bdrv_query_bds_stats(cow_bs, blk_level);
> }
>
> return s;
> diff --git a/block/stream.c b/block/stream.c
> index bfaebb861a..23d5c890e0 100644
> --- a/block/stream.c
> +++ b/block/stream.c
> @@ -65,6 +65,7 @@ static int stream_prepare(Job *job)
> StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
> BlockJob *bjob = &s->common;
> BlockDriverState *bs = blk_bs(bjob->blk);
> + BlockDriverState *unfiltered = bdrv_skip_rw_filters(bs);
Aha, I'd call it filtered, but unfiltered is correct too, it's amazing
> BlockDriverState *base = s->base;
> Error *local_err = NULL;
> int ret = 0;
> @@ -72,7 +73,7 @@ static int stream_prepare(Job *job)
> bdrv_unfreeze_backing_chain(bs, base);
> s->chain_frozen = false;
>
> - if (bs->backing) {
> + if (bdrv_filtered_cow_child(unfiltered)) {
> const char *base_id = NULL, *base_fmt = NULL;
> if (base) {
> base_id = s->backing_file_str;
> @@ -80,7 +81,7 @@ static int stream_prepare(Job *job)
> base_fmt = base->drv->format_name;
> }
> }
> - ret = bdrv_change_backing_file(bs, base_id, base_fmt);
> + ret = bdrv_change_backing_file(unfiltered, base_id, base_fmt);
> bdrv_set_backing_hd(bs, base, &local_err);
> if (local_err) {
> error_report_err(local_err);
> @@ -121,7 +122,7 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
> int64_t n = 0; /* bytes */
> void *buf;
>
> - if (!bs->backing) {
> + if (!bdrv_filtered_child(bs)) {
> goto out;
> }
this condition checks that there is nothing to stream, so, I thing it's better
to check
if (!bdrv_backing_chain_next(bs)) {
goto out;
}
>
> @@ -162,7 +163,7 @@ static int coroutine_fn stream_run(Job *job, Error **errp)
> } else if (ret >= 0) {
> /* Copy if allocated in the intermediate images. Limit to the
> * known-unallocated area [offset, offset+n*BDRV_SECTOR_SIZE).
> */
> - ret = bdrv_is_allocated_above(backing_bs(bs), base,
> + ret = bdrv_is_allocated_above(bdrv_filtered_bs(bs), base,
> offset, n, &n);
Hmm, if we trying to support bs to be filter, and actually operate on
first-non-filter,
as you write in qapi spec, this is wrong. Again it should be
bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs))..
Or, may be better, we at stream start should calculate reald top bs to operate
on, and
forget about all filters above.. i.e., do bs = bdrv_skip_rw_filters(bs) at the
very
beginning, when creating a job.
>
> /* Finish early if end of backing file has been reached */
> @@ -268,7 +269,9 @@ void stream_start(const char *job_id, BlockDriverState
> *bs,
> * disappear from the chain after this operation. The streaming job
> reads
> * every block only once, assuming that it doesn't change, so block
> writes
> * and resizes. */
> - for (iter = backing_bs(bs); iter && iter != base; iter =
> backing_bs(iter)) {
> + for (iter = bdrv_filtered_bs(bs); iter && iter != base;
> + iter = bdrv_filtered_bs(iter))
> + {
> block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
> BLK_PERM_CONSISTENT_READ |
> BLK_PERM_WRITE_UNCHANGED,
> &error_abort);
> diff --git a/blockdev.c b/blockdev.c
> index 4775a07d93..bb71b8368d 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -1094,7 +1094,7 @@ void hmp_commit(Monitor *mon, const QDict *qdict)
> return;
> }
>
> - bs = blk_bs(blk);
> + bs = bdrv_skip_implicit_filters(blk_bs(blk));
> aio_context = bdrv_get_aio_context(bs);
> aio_context_acquire(aio_context);
>
> @@ -1663,7 +1663,7 @@ static void external_snapshot_prepare(BlkActionState
> *common,
> goto out;
> }
>
> - if (state->new_bs->backing != NULL) {
> + if (bdrv_filtered_cow_child(state->new_bs)) {
Do we allow to create filter snapshot? We should either restrict it explicitly
or
check bdrv_filtered_child here.. And we can't allow file-based-filters anyway..
[skipped up to the end of blockdev.c, I'm tired o_O]
[..]
> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> index d1bb863cb6..f99f753fba 100644
> --- a/migration/block-dirty-bitmap.c
> +++ b/migration/block-dirty-bitmap.c
> @@ -285,9 +285,7 @@ static int init_dirty_bitmap_migration(void)
> const char *drive_name = bdrv_get_device_or_node_name(bs);
>
> /* skip automatically inserted nodes */
> - while (bs && bs->drv && bs->implicit) {
> - bs = backing_bs(bs);
> - }
> + bs = bdrv_skip_implicit_filters(bs);
this intersects with Jonh's patch
[PATCH v2] migration/dirty-bitmaps: change bitmap enumeration method
https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg03340.html
>
> for (bitmap = bdrv_dirty_bitmap_next(bs, NULL); bitmap;
> bitmap = bdrv_dirty_bitmap_next(bs, bitmap))
> diff --git a/nbd/server.c b/nbd/server.c
> index e21bd501dc..e41ae89dbe 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -1506,13 +1506,13 @@ NBDExport *nbd_export_new(BlockDriverState *bs,
> uint64_t dev_offset,
> if (bitmap) {
> BdrvDirtyBitmap *bm = NULL;
>
> - while (true) {
> + while (bs) {
> bm = bdrv_find_dirty_bitmap(bs, bitmap);
> - if (bm != NULL || bs->backing == NULL) {
> + if (bm != NULL) {
> break;
> }
>
> - bs = bs->backing->bs;
> + bs = bdrv_filtered_bs(bs);
> }
Check in documentation: "@bitmap: Also export the dirty bitmap reachable from
@device".
"Reachable" is not bad, but we may want to clarify that extended backing chain
is meant
>
> if (bm == NULL) {
> diff --git a/qemu-img.c b/qemu-img.c
> index aa6f81f1ea..bcfbb743fc 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -982,7 +982,7 @@ static int img_commit(int argc, char **argv)
> if (!blk) {
> return 1;
> }
> - bs = blk_bs(blk);
> + bs = bdrv_skip_implicit_filters(blk_bs(blk));
hope there should not be any, but for consistancy we may skip them
>
> qemu_progress_init(progress, 1.f);
> qemu_progress_print(0.f, 100);
> @@ -999,7 +999,7 @@ static int img_commit(int argc, char **argv)
> /* This is different from QMP, which by default uses the deepest
> file in
> * the backing chain (i.e., the very base); however, the traditional
> * behavior of qemu-img commit is using the immediate backing file.
> */
> - base_bs = backing_bs(bs);
> + base_bs = bdrv_filtered_cow_bs(bs);
> if (!base_bs) {
> error_setg(&local_err, "Image does not have a backing file");
> goto done;
> @@ -1616,19 +1616,18 @@ static int convert_iteration_sectors(ImgConvertState
> *s, int64_t sector_num)
>
> if (s->sector_next_status <= sector_num) {
> int64_t count = n * BDRV_SECTOR_SIZE;
> + BlockDriverState *src_bs = blk_bs(s->src[src_cur]);
> + BlockDriverState *base;
>
> if (s->target_has_backing) {
> -
> - ret = bdrv_block_status(blk_bs(s->src[src_cur]),
> - (sector_num - src_cur_offset) *
> - BDRV_SECTOR_SIZE,
> - count, &count, NULL, NULL);
> + base = bdrv_backing_chain_next(src_bs);
> } else {
> - ret = bdrv_block_status_above(blk_bs(s->src[src_cur]), NULL,
> - (sector_num - src_cur_offset) *
> - BDRV_SECTOR_SIZE,
> - count, &count, NULL, NULL);
> + base = NULL;
> }
> + ret = bdrv_block_status_above(src_bs, base,
> + (sector_num - src_cur_offset) *
> + BDRV_SECTOR_SIZE,
> + count, &count, NULL, NULL);
> if (ret < 0) {
> error_report("error while reading block status of sector %"
> PRId64
> ": %s", sector_num, strerror(-ret));
> @@ -2434,7 +2433,8 @@ static int img_convert(int argc, char **argv)
> * s.target_backing_sectors has to be negative, which it will
> * be automatically). The backing file length is used only
> * for optimizations, so such a case is not fatal. */
> - s.target_backing_sectors = bdrv_nb_sectors(out_bs->backing->bs);
> + s.target_backing_sectors =
> + bdrv_nb_sectors(bdrv_filtered_cow_bs(out_bs));
can't out_bs be filter itself?
> } else {
> s.target_backing_sectors = -1;
> }
> @@ -2797,6 +2797,7 @@ static int get_block_status(BlockDriverState *bs,
> int64_t offset,
>
> depth = 0;
> for (;;) {
> + bs = bdrv_skip_rw_filters(bs);
Why? Filters may have own implementation of block_status, why to skip it?
Or, thay cannot? Really, may be disallow filters have block_status, we may solve
inefficient block_status_above we talked about before.
> ret = bdrv_block_status(bs, offset, bytes, &bytes, &map, &file);
> if (ret < 0) {
> return ret;
> @@ -2805,7 +2806,7 @@ static int get_block_status(BlockDriverState *bs,
> int64_t offset,
> if (ret & (BDRV_BLOCK_ZERO|BDRV_BLOCK_DATA)) {
> break;
> }
> - bs = backing_bs(bs);
> + bs = bdrv_filtered_cow_bs(bs);
> if (bs == NULL) {
> ret = 0;
> break;
> @@ -2944,7 +2945,7 @@ static int img_map(int argc, char **argv)
> if (!blk) {
> return 1;
> }
> - bs = blk_bs(blk);
> + bs = bdrv_skip_implicit_filters(blk_bs(blk));
>
> if (output_format == OFORMAT_HUMAN) {
> printf("%-16s%-16s%-16s%s\n", "Offset", "Length", "Mapped to",
> "File");
> diff --git a/tests/qemu-iotests/184.out b/tests/qemu-iotests/184.out
> index 3deb3cfb94..1d61f7e224 100644
> --- a/tests/qemu-iotests/184.out
> +++ b/tests/qemu-iotests/184.out
> @@ -27,6 +27,11 @@ Testing:
> "iops_rd": 0,
> "detect_zeroes": "off",
> "image": {
> + "backing-image": {
> + "virtual-size": 1073741824,
> + "filename": "null-co://",
> + "format": "null-co"
> + },
> "virtual-size": 1073741824,
> "filename": "json:{\"throttle-group\": \"group0\",
> \"driver\": \"throttle\", \"file\": {\"driver\": \"null-co\"}}",
> "format": "throttle"
> @@ -34,7 +39,7 @@ Testing:
> "iops_wr": 0,
> "ro": false,
> "node-name": "throttle0",
> - "backing_file_depth": 0,
> + "backing_file_depth": 1,
> "drv": "throttle",
> "iops": 0,
> "bps_wr": 0,
> diff --git a/tests/qemu-iotests/204.out b/tests/qemu-iotests/204.out
> index f3a10fbe90..684774d763 100644
> --- a/tests/qemu-iotests/204.out
> +++ b/tests/qemu-iotests/204.out
> @@ -59,5 +59,6 @@ Offset Length File
> 0x900000 0x2400000 TEST_DIR/t.IMGFMT
> 0x3c00000 0x1100000 TEST_DIR/t.IMGFMT
> 0x6a00000 0x400000 TEST_DIR/t.IMGFMT
> +0x6e00000 0x1200000 TEST_DIR/t.IMGFMT.base
> No errors were found on the image.
> *** done
>
--
Best regards,
Vladimir
Re: [Qemu-block] [Qemu-devel] [PATCH v4 02/11] block: Filtered children access functions,
Vladimir Sementsov-Ogievskiy <=
Re: [Qemu-block] [Qemu-devel] [PATCH v4 02/11] block: Filtered children access functions, Max Reitz, 2019/05/31