[Qemu-devel] [PULL 42/54] block: Perform copy-on-read in loop
From: Kevin Wolf
Subject: [Qemu-devel] [PULL 42/54] block: Perform copy-on-read in loop
Date: Fri, 6 Oct 2017 17:54:10 +0200
From: Eric Blake <address@hidden>
Improve our braindead copy-on-read implementation. Pre-patch,
we have multiple issues:
- we create a bounce buffer and perform a write for the entire
request, even if the active image already has 99% of the
clusters occupied, and really only needs to copy-on-read the
remaining 1% of the clusters
- our bounce buffer is as large as the read request, and can
needlessly exhaust our memory by using double the memory of
the request size (the original request plus our bounce buffer),
rather than a capped maximum overhead beyond the original
- if a driver has a max_transfer limit, we are bypassing the
normal code in bdrv_aligned_preadv() that fragments to that
limit, and instead attempt to read the entire buffer from the
driver in one go, which some drivers may assert on
- a client can issue a request of nearly 2G such that
rounding the request out to cluster boundaries results in a
byte count larger than 2G. While this cannot exceed 32 bits,
it DOES have some follow-on problems:
-- the call to bdrv_driver_pread() can assert for exceeding
BDRV_REQUEST_MAX_BYTES, if the driver is old and lacks
.bdrv_co_preadv
-- if the buffer is all zeroes, the subsequent call to
bdrv_co_do_pwrite_zeroes is a no-op due to a negative size,
which means we did not actually copy on read
Fix all of these issues by breaking up the action into a loop,
where each iteration is capped to sane limits. Also, querying
the allocation status allows us to optimize: when data is
already present in the active layer, we don't need to bounce.
Note that the code has a telling comment that copy-on-read
should probably be a filter driver rather than a bolt-on hack
in io.c; but that remains a task for another day.
CC: address@hidden
Signed-off-by: Eric Blake <address@hidden>
Reviewed-by: Kevin Wolf <address@hidden>
Reviewed-by: Stefan Hajnoczi <address@hidden>
Signed-off-by: Kevin Wolf <address@hidden>
---
block/io.c | 120 +++++++++++++++++++++++++++++++++++++++++--------------------
1 file changed, 82 insertions(+), 38 deletions(-)
diff --git a/block/io.c b/block/io.c
index a5598ed869..8e419070b5 100644
--- a/block/io.c
+++ b/block/io.c
@@ -34,6 +34,9 @@
#define NOT_DONE 0x7fffffff /* used while emulated sync operation in progress */
+/* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
+#define MAX_BOUNCE_BUFFER (32768 << BDRV_SECTOR_BITS)
+
static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
int64_t offset, int bytes, BdrvRequestFlags flags);
@@ -945,11 +948,14 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
BlockDriver *drv = bs->drv;
struct iovec iov;
- QEMUIOVector bounce_qiov;
+ QEMUIOVector local_qiov;
int64_t cluster_offset;
unsigned int cluster_bytes;
size_t skip_bytes;
int ret;
+ int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
+ BDRV_REQUEST_MAX_BYTES);
+ unsigned int progress = 0;
/* FIXME We cannot require callers to have write permissions when all they
* are doing is a read request. If we did things right, write permissions
@@ -961,53 +967,95 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
// assert(child->perm & (BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE));
/* Cover entire cluster so no additional backing file I/O is required when
- * allocating cluster in the image file.
+ * allocating cluster in the image file. Note that this value may exceed
+ * BDRV_REQUEST_MAX_BYTES (even when the original read did not), which
+ * is one reason we loop rather than doing it all at once.
*/
bdrv_round_to_clusters(bs, offset, bytes, &cluster_offset, &cluster_bytes);
+ skip_bytes = offset - cluster_offset;
trace_bdrv_co_do_copy_on_readv(bs, offset, bytes,
cluster_offset, cluster_bytes);
- iov.iov_len = cluster_bytes;
- iov.iov_base = bounce_buffer = qemu_try_blockalign(bs, iov.iov_len);
+ bounce_buffer = qemu_try_blockalign(bs,
+ MIN(MIN(max_transfer, cluster_bytes),
+ MAX_BOUNCE_BUFFER));
if (bounce_buffer == NULL) {
ret = -ENOMEM;
goto err;
}
- qemu_iovec_init_external(&bounce_qiov, &iov, 1);
+ while (cluster_bytes) {
+ int64_t pnum;
- ret = bdrv_driver_preadv(bs, cluster_offset, cluster_bytes,
- &bounce_qiov, 0);
- if (ret < 0) {
- goto err;
- }
+ ret = bdrv_is_allocated(bs, cluster_offset,
+ MIN(cluster_bytes, max_transfer), &pnum);
+ if (ret < 0) {
+ /* Safe to treat errors in querying allocation as if
+ * unallocated; we'll probably fail again soon on the
+ * read, but at least that will set a decent errno.
+ */
+ pnum = MIN(cluster_bytes, max_transfer);
+ }
- bdrv_debug_event(bs, BLKDBG_COR_WRITE);
- if (drv->bdrv_co_pwrite_zeroes &&
- buffer_is_zero(bounce_buffer, iov.iov_len)) {
- /* FIXME: Should we (perhaps conditionally) be setting
- * BDRV_REQ_MAY_UNMAP, if it will allow for a sparser copy
- * that still correctly reads as zero? */
- ret = bdrv_co_do_pwrite_zeroes(bs, cluster_offset, cluster_bytes, 0);
- } else {
- /* This does not change the data on the disk, it is not necessary
- * to flush even in cache=writethrough mode.
- */
- ret = bdrv_driver_pwritev(bs, cluster_offset, cluster_bytes,
- &bounce_qiov, 0);
- }
+ assert(skip_bytes < pnum);
- if (ret < 0) {
- /* It might be okay to ignore write errors for guest requests. If this
- * is a deliberate copy-on-read then we don't want to ignore the error.
- * Simply report it in all cases.
- */
- goto err;
- }
+ if (ret <= 0) {
+ /* Must copy-on-read; use the bounce buffer */
+ iov.iov_base = bounce_buffer;
+ iov.iov_len = pnum = MIN(pnum, MAX_BOUNCE_BUFFER);
+ qemu_iovec_init_external(&local_qiov, &iov, 1);
- skip_bytes = offset - cluster_offset;
- qemu_iovec_from_buf(qiov, 0, bounce_buffer + skip_bytes, bytes);
+ ret = bdrv_driver_preadv(bs, cluster_offset, pnum,
+ &local_qiov, 0);
+ if (ret < 0) {
+ goto err;
+ }
+
+ bdrv_debug_event(bs, BLKDBG_COR_WRITE);
+ if (drv->bdrv_co_pwrite_zeroes &&
+ buffer_is_zero(bounce_buffer, pnum)) {
+ /* FIXME: Should we (perhaps conditionally) be setting
+ * BDRV_REQ_MAY_UNMAP, if it will allow for a sparser copy
+ * that still correctly reads as zero? */
+ ret = bdrv_co_do_pwrite_zeroes(bs, cluster_offset, pnum, 0);
+ } else {
+ /* This does not change the data on the disk, it is not
+ * necessary to flush even in cache=writethrough mode.
+ */
+ ret = bdrv_driver_pwritev(bs, cluster_offset, pnum,
+ &local_qiov, 0);
+ }
+
+ if (ret < 0) {
+ /* It might be okay to ignore write errors for guest
+ * requests. If this is a deliberate copy-on-read
+ * then we don't want to ignore the error. Simply
+ * report it in all cases.
+ */
+ goto err;
+ }
+
+ qemu_iovec_from_buf(qiov, progress, bounce_buffer + skip_bytes,
+ pnum - skip_bytes);
+ } else {
+ /* Read directly into the destination */
+ qemu_iovec_init(&local_qiov, qiov->niov);
+ qemu_iovec_concat(&local_qiov, qiov, progress, pnum - skip_bytes);
+ ret = bdrv_driver_preadv(bs, offset + progress, local_qiov.size,
+ &local_qiov, 0);
+ qemu_iovec_destroy(&local_qiov);
+ if (ret < 0) {
+ goto err;
+ }
+ }
+
+ cluster_offset += pnum;
+ cluster_bytes -= pnum;
+ progress += pnum - skip_bytes;
+ skip_bytes = 0;
+ }
+ ret = 0;
err:
qemu_vfree(bounce_buffer);
@@ -1213,9 +1261,6 @@ int coroutine_fn bdrv_co_readv(BdrvChild *child, int64_t sector_num,
return bdrv_co_do_readv(child, sector_num, nb_sectors, qiov, 0);
}
-/* Maximum buffer for write zeroes fallback, in bytes */
-#define MAX_WRITE_ZEROES_BOUNCE_BUFFER (32768 << BDRV_SECTOR_BITS)
-
static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
int64_t offset, int bytes, BdrvRequestFlags flags)
{
@@ -1230,8 +1275,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
bs->bl.request_alignment);
- int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
- MAX_WRITE_ZEROES_BOUNCE_BUFFER);
+ int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer, MAX_BOUNCE_BUFFER);
assert(alignment % bs->bl.request_alignment == 0);
head = offset % alignment;
--
2.13.6