qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[PATCH 2/5] block: Fix BDRV_BLOCK_RAW status to honor alignment


From: Eric Blake
Subject: [PATCH 2/5] block: Fix BDRV_BLOCK_RAW status to honor alignment
Date: Thu, 18 Feb 2021 14:15:25 -0600

Previous patches mentioned how the blkdebug filter driver demonstrates
a bug present in our NBD server (for example, commit b0245d64); the
bug is also present with the raw format driver when probing
occurs. Basically, if we specify a particular alignment > 1, but defer
the actual block status to the underlying file, and the underlying
file has a smaller alignment, then the use of BDRV_BLOCK_RAW to defer
to the underlying file can end up with status split at an alignment
unacceptable to our layer.  Many callers don't care, but NBD does - it
is a violation of the NBD protocol to send status or read results
split on an unaligned boundary (in 737d3f5244, we taught our 4.0
client to be tolerant of such violations because the problem was even
more pronounced with qemu 3.1 as server; but we still need to fix our
server for the sake of other stricter clients).

This patch lays the groundwork - it adjusts bdrv_block_status to
ensure that any time one layer defers to another via BDRV_BLOCK_RAW,
the deferral is either truncated down to an aligned boundary, or
multiple sub-aligned requests are coalesced into a single
representative answer (using an implicit hole beyond EOF as
needed). Iotest 241 exposes the effect (when format probing occurred,
we don't want block status to subdivide the initial sector, and thus
not any other sector either). Similarly, iotest 221 is a good
candidate to amend to specifically track the effects; a change to a
hole at EOF is not visible if an upper layer does not want alignment
smaller than 512. However, this patch alone is not a complete fix - it
is still possible to have an active layer with large alignment
constraints backed by another layer with smaller constraints; so the
next patch will complete the task.

In particular, the next patch will introduce some mutual recursion
(bdrv_co_common_block_status_above will call this new function rather
than directly into bdrv_co_block_status), so some conditions added
here (such as a NULL pointer for map or handling a length-0 request)
are not reachable until that point.

There is one interesting corner case: prior to this patch, ALLOCATED
was never returned without either DATA or ZERO also set. But now, if
we have two subregions where the first reports status 0 (not
allocated), and the second reports ZERO|ALLOCATED but not DATA
(preallocated, read sees zero but underlying file has indeterminate
contents), then we can end up with a result of just
ALLOCATED. However, looking at callers of bdrv_is_allocated does not
find any problem with this new return value. What's more, this
situation doesn't really happen until the next patch adds support for
getting aligned status from backing files (as the use of aligned
status in this patch tends to be limited to just the protocol child of
a format driver, yet protocol drivers tend to report fully allocated,
and only format drivers have unallocated clusters that defer to a
backing file in the first place).

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/io.c                 | 142 +++++++++++++++++++++++++++++++++++--
 tests/qemu-iotests/221     |  13 ++++
 tests/qemu-iotests/221.out |   9 +++
 tests/qemu-iotests/241.out |   3 +-
 4 files changed, 161 insertions(+), 6 deletions(-)

diff --git a/block/io.c b/block/io.c
index ca2dca30070e..4bca775c96b4 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2325,6 +2325,132 @@ int bdrv_flush_all(void)
     return result;
 }

+static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
+                                             bool want_zero,
+                                             int64_t offset, int64_t bytes,
+                                             int64_t *pnum, int64_t *map,
+                                             BlockDriverState **file);
+
+/*
+ * Returns an aligned allocation status of the specified disk region.
+ *
+ * Wrapper around bdrv_co_block_status() which requires the initial
+ * @offset and @count to be aligned to @align (must be power of 2),
+ * and guarantees the resulting @pnum will also be aligned; this may
+ * require piecing together multiple sub-aligned queries into an
+ * appropriate coalesced answer, as follows:
+ *
+ * - BDRV_BLOCK_DATA is set if the flag is set for at least one subregion
+ * - BDRV_BLOCK_ZERO is set only if the flag is set for all subregions
+ * - BDRV_BLOCK_OFFSET_VALID is set only if all subregions are contiguous
+ *   from the same file (@map and @file are then from the first subregion)
+ * - BDRV_BLOCK_ALLOCATED is set if the flag is set for at least one subregion
+ * - BDRV_BLOCK_EOF is set if the last subregion queried set it (any
+ *   remaining bytes to reach alignment are treated as an implicit hole)
+ * - BDRV_BLOCK_RAW is never set
+ */
+static int coroutine_fn bdrv_co_block_status_aligned(BlockDriverState *bs,
+                                                     uint32_t align,
+                                                     bool want_zero,
+                                                     int64_t offset,
+                                                     int64_t bytes,
+                                                     int64_t *pnum,
+                                                     int64_t *map,
+                                                     BlockDriverState **file)
+{
+    int ret;
+
+    assert(is_power_of_2(align) && QEMU_IS_ALIGNED(offset | bytes, align));
+    ret = bdrv_co_block_status(bs, want_zero, offset, bytes, pnum, map, file);
+    if (ret < 0) {
+        return ret;
+    }
+    /* 0-length return only possible for 0-length query or beyond EOF */
+    if (!*pnum) {
+        assert(!bytes || ret & BDRV_BLOCK_EOF);
+        return ret;
+    }
+    assert(!(ret & BDRV_BLOCK_RAW));
+
+    /*
+     * If initial query ended at EOF, round up to align: the post-EOF
+     * tail is an implicit hole, but our rules say we can treat that
+     * like the initial subregion.
+     */
+    if (ret & BDRV_BLOCK_EOF) {
+        *pnum = QEMU_ALIGN_UP(*pnum, align);
+        assert(*pnum <= bytes);
+        return ret;
+    }
+
+    /*
+     * If result is unaligned but not at EOF, it's easier to return
+     * the aligned subset and then compute the coalesced version over
+     * just align bytes.
+     */
+    if (*pnum >= align) {
+        *pnum = QEMU_ALIGN_DOWN(*pnum, align);
+        return ret;
+    }
+
+    /*
+     * If we got here, we have to merge status for multiple
+     * subregions. We can't detect if offsets are contiguous unless
+     * map and file are non-NULL.
+     */
+    if (!map || !file) {
+        ret &= ~BDRV_BLOCK_OFFSET_VALID;
+    }
+    while (*pnum < align) {
+        int ret2;
+        int64_t pnum2;
+        int64_t map2;
+        BlockDriverState *file2;
+
+        ret2 = bdrv_co_block_status(bs, want_zero, offset + *pnum,
+                                    align - *pnum, &pnum2, &map2, &file2);
+        if (ret2 < 0) {
+            return ret2;
+        }
+        assert(!(ret2 & BDRV_BLOCK_RAW));
+        /*
+         * A 0-length answer here is a bug - we should not be querying
+         * beyond EOF. Our rules allow any further bytes in implicit
+         * hole past EOF to have same treatment as the subregion just
+         * before EOF.
+         */
+        assert(pnum2 && pnum2 <= align - *pnum);
+        if (ret2 & BDRV_BLOCK_EOF) {
+            ret |= BDRV_BLOCK_EOF;
+            pnum2 = align - *pnum;
+        }
+
+        /* Now merge the status */
+        if (ret2 & BDRV_BLOCK_DATA) {
+            ret |= BDRV_BLOCK_DATA;
+        }
+        if (!(ret2 & BDRV_BLOCK_ZERO)) {
+            ret &= ~BDRV_BLOCK_ZERO;
+        }
+        if ((ret & BDRV_BLOCK_OFFSET_VALID) &&
+            (!(ret2 & BDRV_BLOCK_OFFSET_VALID) ||
+             *map + *pnum != map2 || *file != file2)) {
+            ret &= ~BDRV_BLOCK_OFFSET_VALID;
+            if (map) {
+                *map = 0;
+            }
+            if (file) {
+                *file = NULL;
+            }
+        }
+        if (ret2 & BDRV_BLOCK_ALLOCATED) {
+            ret |= BDRV_BLOCK_ALLOCATED;
+        }
+        *pnum += pnum2;
+    }
+    return ret;
+}
+
 /*
  * Returns the allocation status of the specified sectors.
  * Drivers not implementing the functionality are assumed to not support
@@ -2438,7 +2564,17 @@ static int coroutine_fn 
bdrv_co_block_status(BlockDriverState *bs,
      */
     assert(*pnum && QEMU_IS_ALIGNED(*pnum, align) &&
            align > offset - aligned_offset);
-    if (ret & BDRV_BLOCK_RECURSE) {
+    if (ret & BDRV_BLOCK_RAW) {
+        assert(local_file);
+        ret = bdrv_co_block_status_aligned(local_file, align, want_zero,
+                                           local_map, *pnum, pnum, &local_map,
+                                           &local_file);
+        if (ret < 0) {
+            goto out;
+        }
+        assert(!(ret & BDRV_BLOCK_RAW));
+        ret |= BDRV_BLOCK_RAW;
+    } else if (ret & BDRV_BLOCK_RECURSE) {
         assert(ret & BDRV_BLOCK_DATA);
         assert(ret & BDRV_BLOCK_OFFSET_VALID);
         assert(!(ret & BDRV_BLOCK_ZERO));
@@ -2453,9 +2589,7 @@ static int coroutine_fn 
bdrv_co_block_status(BlockDriverState *bs,
     }

     if (ret & BDRV_BLOCK_RAW) {
-        assert(ret & BDRV_BLOCK_OFFSET_VALID && local_file);
-        ret = bdrv_co_block_status(local_file, want_zero, local_map,
-                                   *pnum, pnum, &local_map, &local_file);
+        ret &= ~BDRV_BLOCK_RAW;
         goto out;
     }

diff --git a/tests/qemu-iotests/221 b/tests/qemu-iotests/221
index c463fd4b113e..6a15e0160b24 100755
--- a/tests/qemu-iotests/221
+++ b/tests/qemu-iotests/221
@@ -46,6 +46,12 @@ echo
 echo "=== Check mapping of unaligned raw image ==="
 echo

+# Note that when we enable format probing by omitting -f, the raw
+# layer forces 512-byte alignment and the bytes past EOF take on the
+# same status as the rest of the sector; otherwise, we can see the
+# implicit hole visible past EOF thanks to the block layer rounding
+# sizes up.
+
 _make_test_img 65537 # qemu-img create rounds size up

 # file-posix allocates the first block of any images when it is created;
@@ -55,15 +61,22 @@ _make_test_img 65537 # qemu-img create rounds size up
 $QEMU_IO -c 'discard 0 65537' "$TEST_IMG" | _filter_qemu_io

 $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+$QEMU_IMG map -f $IMGFMT --output=json "$TEST_IMG" | _filter_qemu_img_map
+echo

 truncate --size=65537 "$TEST_IMG" # so we resize it and check again
 $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+$QEMU_IMG map -f $IMGFMT --output=json "$TEST_IMG" | _filter_qemu_img_map
+echo

 $QEMU_IO -c 'w 65536 1' "$TEST_IMG" | _filter_qemu_io # writing also rounds up
 $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+$QEMU_IMG map -f $IMGFMT --output=json "$TEST_IMG" | _filter_qemu_img_map
+echo

 truncate --size=65537 "$TEST_IMG" # so we resize it and check again
 $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+$QEMU_IMG map -f $IMGFMT --output=json "$TEST_IMG" | _filter_qemu_img_map

 # success, all done
 echo '*** done'
diff --git a/tests/qemu-iotests/221.out b/tests/qemu-iotests/221.out
index 93846c7dabb6..d22b5e00d4f8 100644
--- a/tests/qemu-iotests/221.out
+++ b/tests/qemu-iotests/221.out
@@ -7,11 +7,20 @@ discard 65537/65537 bytes at offset 0
 64.001 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 [{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, 
"offset": OFFSET}]
 [{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, 
"offset": OFFSET}]
+
+[{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, 
"offset": OFFSET}]
+[{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, 
"offset": OFFSET}]
+
 wrote 1/1 bytes at offset 65536
 1 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 [{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, 
"offset": OFFSET},
+{ "start": 65536, "length": 512, "depth": 0, "zero": false, "data": true, 
"offset": OFFSET}]
+[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, 
"offset": OFFSET},
 { "start": 65536, "length": 1, "depth": 0, "zero": false, "data": true, 
"offset": OFFSET},
 { "start": 65537, "length": 511, "depth": 0, "zero": true, "data": false, 
"offset": OFFSET}]
+
+[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, 
"offset": OFFSET},
+{ "start": 65536, "length": 512, "depth": 0, "zero": false, "data": true, 
"offset": OFFSET}]
 [{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, 
"offset": OFFSET},
 { "start": 65536, "length": 1, "depth": 0, "zero": false, "data": true, 
"offset": OFFSET},
 { "start": 65537, "length": 511, "depth": 0, "zero": true, "data": false, 
"offset": OFFSET}]
diff --git a/tests/qemu-iotests/241.out b/tests/qemu-iotests/241.out
index 67aaeed34f50..56d3796cf3ac 100644
--- a/tests/qemu-iotests/241.out
+++ b/tests/qemu-iotests/241.out
@@ -22,8 +22,7 @@ WARNING: Image format was not specified for 'TEST_DIR/t.raw' 
and probing guessed

   size:  1024
   min block: 1
-[{ "start": 0, "length": 1000, "depth": 0, "zero": false, "data": true, 
"offset": OFFSET},
-{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": true, 
"offset": OFFSET}]
+[{ "start": 0, "length": 1024, "depth": 0, "zero": false, "data": true, 
"offset": OFFSET}]
 1 KiB (0x400) bytes     allocated at offset 0 bytes (0x0)

 === Encrypted qcow2 file backed by unaligned raw image ===
-- 
2.30.1




reply via email to

[Prev in Thread] Current Thread [Next in Thread]