[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX

From: Hanna Czenczek
Subject: Re: [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX
Date: Thu, 16 Mar 2023 11:10:02 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.7.1

On 15.03.23 19:48, Stefan Hajnoczi wrote:
On Wed, Mar 15, 2023 at 01:13:29PM +0100, Hanna Czenczek wrote:
When processing vectored guest requests that are not aligned to the
storage request alignment, we pad them by adding head and/or tail
buffers for a read-modify-write cycle.

The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
with this padding, the vector can exceed that limit.  As of
4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
limit, instead returning an error to the guest.

To the guest, this appears as a random I/O error.  We should not return
an I/O error to the guest when it issued a perfectly valid request.

Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
longer than IOV_MAX, which generally seems to work (because the guest
assumes a smaller alignment than we really have, file-posix's
raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
so emulate the request, so that the IOV_MAX does not matter).  However,
that does not seem exactly great.

I see two ways to fix this problem:
1. We split such long requests into two requests.
2. We join some elements of the vector into new buffers to make it

I am wary of (1), because it seems like it may have unintended side

(2) on the other hand seems relatively simple to implement, with
hopefully few side effects, so this patch does that.
Looks like a reasonable solution. I think the code is correct and I
posted ideas for making it easier to understand.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
  block/io.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++---
  util/iov.c |   4 --
  2 files changed, 133 insertions(+), 10 deletions(-)

diff --git a/block/io.c b/block/io.c
index 8974d46941..ee226d23d6 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1435,6 +1435,12 @@ out:
   * @merge_reads is true for small requests,
   * if @buf_len == @head + bytes + @tail. In this case it is possible that both
   * head and tail exist but @buf_len == align and @tail_buf == @buf.
+ *
+ * @write is true for write requests, false for read requests.
+ *
+ * If padding makes the vector too long (exceeding IOV_MAX), then we need to
+ * merge existing vector elements into a single one.  @collapse_buf acts as the
+ * bounce buffer in such cases.
  typedef struct BdrvRequestPadding {
      uint8_t *buf;
@@ -1443,11 +1449,17 @@ typedef struct BdrvRequestPadding {
      size_t head;
      size_t tail;
      bool merge_reads;
+    bool write;
      QEMUIOVector local_qiov;
+    uint8_t *collapse_buf;
+    size_t collapse_len;
+    QEMUIOVector collapsed_qiov;
  } BdrvRequestPadding;
static bool bdrv_init_padding(BlockDriverState *bs,
                                int64_t offset, int64_t bytes,
+                              bool write,
                                BdrvRequestPadding *pad)
      int64_t align = bs->bl.request_alignment;
@@ -1479,9 +1491,101 @@ static bool bdrv_init_padding(BlockDriverState *bs,
          pad->tail_buf = pad->buf + pad->buf_len - align;
+ pad->write = write;
      return true;
+ * If padding has made the IOV (`pad->local_qiov`) too long (more than IOV_MAX
+ * elements), collapse some elements into a single one so that it adheres to 
+ * IOV_MAX limit again.
+ *
+ * If collapsing, `pad->collapse_buf` will be used as a bounce buffer of length
+ * `pad->collapse_len`.  `pad->collapsed_qiov` will contain the previous 
+ * (before collapsing), so that bdrv_padding_destroy() can copy the bounce
+ * buffer content back for read requests.
The distinction between "collapse" and "collapsed" is subtle. I didn't
guess it right, I thought collapsed_qiov is a QEMUIOVector for

Please choose a name for collapsed_qiov that makes this clearer. Maybe
pre_collapse_qiov (i.e. the local_qiov iovecs that were replaced by

Yes, sounds good!

+ *
+ * Note that we will not touch the padding head or tail entries here.  We 
+ * move them to a bounce buffer, because for RMWs, both head and tail expect to
+ * be in an aligned buffer with scratch space after (head) or before (tail) to
+ * perform the read into (because the whole buffer must be aligned, but head's
+ * and tail's lengths naturally cannot be aligned, because they provide padding
+ * for unaligned requests).  A collapsed bounce buffer for multiple IOV 
+ * cannot provide such scratch space.
As someone who hasn't looked at this code for a while, I don't
understand this paragraph. Can you expand on why RMW is problematic
here? If not, don't worry, it's hard to explain iov juggling.

The read in the RMW cycle is done using bdrv_aligned_preadv(), so head and tail must be fully aligned; their buffers’ addresses and lengths both must be aligned.  However, head and tail themselves can’t have an aligned length, because they’re filling up something that’s unaligned to be aligned.

Therefore, there must be some scratch space in those buffers that overlaps with adjacent elements in the I/O vector.  The bdrv_aligned_preadv() calls directly read into the padding buffer, so they will not overwrite anything in the I/O vector.  Instead, in the I/O vector, the references to the padding buffer are shortened to match the pure lengths of head and tail, so that when we finally do the actual write, the scratch space is hidden.

So merging head or tail requires special consideration of this scratch space.  It probably isn’t impossible (I should fix the comment on that), but it’s just more complicated than to merge internal elements.  Also, if you merge head or tail, you need to adjust some existing fields in BdrvRequestPadding (free `buf`, adjust `buf_len`, adjust `tail_buf`).  And qemu_iovec_{to,from}_buf() would need to skip the head/tail.

I’ve begun by trying to merge into head/tail, but faced problem after problem and finally just decided to give up on that, finding that just merging internal buffers is simpler.

+ *
+ * Therefore, this function collapses the first IOV elements after the
+ * (potential) head element.
+ */
+static void bdrv_padding_collapse(BdrvRequestPadding *pad, BlockDriverState 
+    int surplus_count, collapse_count;
+    struct iovec *collapse_iovs;
+    QEMUIOVector collapse_qiov;
+    size_t move_count;
+    surplus_count = pad->local_qiov.niov - IOV_MAX;
+    /* Not exceeding the limit?  Nothing to collapse. */
+    if (surplus_count <= 0) {
+        return;
+    }
+    /*
+     * Only head and tail can have lead to the number of entries exceeding
+     * IOV_MAX, so we can exceed it by the head and tail at most
+     */
+    assert(surplus_count <= !!pad->head + !!pad->tail);
+    /*
+     * We merge (collapse) `surplus_count` entries into the first entry that is
+     * not padding, i.e. we merge `surplus_count + 1` entries into entry 0 if
+     * there is no head, or entry 1 if there is one.
+     */
+    collapse_count = surplus_count + 1;
+    collapse_iovs = &pad->local_qiov.iov[pad->head ? 1 : 0];
+    /* There must be no previously collapsed buffers in `pad` */
+    assert(pad->collapse_len == 0);
+    for (int i = 0; i < collapse_count; i++) {
+        pad->collapse_len += collapse_iovs[i].iov_len;
+    }
+    pad->collapse_buf = qemu_blockalign(bs, pad->collapse_len);
+    /* Save the to-be-collapsed IOV elements in collapsed_qiov */
+    qemu_iovec_init_external(&collapse_qiov, collapse_iovs, collapse_count);
+    qemu_iovec_init_slice(&pad->collapsed_qiov,
Having collapse_qiov and collapsed_qiov in the same function is
confusing. IIUC collapse_qiov and collapsed_qiov have the same buffers
the same, except that the last iovec in collapse_qiov has its original
length from local_qiov, whereas collapsed_qiov may shrink the last

Maybe just naming collapse_qiov "qiov" or "tmp_qiov" would be less
confusing because it avoids making collapse_buf/collapse_len vs
collapsed_qiov more confusing.


+                          &collapse_qiov, 0, pad->collapse_len);
+    if (pad->write) {
+        /* For writes: Copy all to-be-collapsed data into collapse_buf */
+        qemu_iovec_to_buf(&pad->collapsed_qiov, 0,
+                          pad->collapse_buf, pad->collapse_len);
+    }
+    /* Create the collapsed entry in pad->local_qiov */
+    collapse_iovs[0] = (struct iovec){
+        .iov_base = pad->collapse_buf,
+        .iov_len = pad->collapse_len,
+    };
+    /*
+     * To finalize collapsing, we must shift the rest of pad->local_qiov left 
+     * `surplus_count`, i.e. we must move all elements after `collapse_iovs` to
+     * immediately after the collapse target.
+     *
+     * Therefore, the memmove() target is `collapse_iovs[1]` and the source is
+     * `collapse_iovs[collapse_count]`.  The number of elements to move is the
+     * number of elements remaining in `pad->local_qiov` after and including
+     * `collapse_iovs[collapse_count]`.
+     */
+    move_count = &pad->local_qiov.iov[pad->local_qiov.niov] -
+        &collapse_iovs[collapse_count];
+    memmove(&collapse_iovs[1],
+            &collapse_iovs[collapse_count],
+            move_count * sizeof(pad->local_qiov.iov[0]));
+    pad->local_qiov.niov -= surplus_count;
  static int coroutine_fn GRAPH_RDLOCK
  bdrv_padding_rmw_read(BdrvChild *child, BdrvTrackedRequest *req,
                        BdrvRequestPadding *pad, bool zero_middle)
@@ -1549,6 +1653,18 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
+    if (pad->collapse_buf) {
+        if (!pad->write) {
+            /*
+             * If padding required elements in the vector to be collapsed into 
+             * bounce buffer, copy the bounce buffer content back
+             */
+            qemu_iovec_from_buf(&pad->collapsed_qiov, 0,
+                                pad->collapse_buf, pad->collapse_len);
+        }
+        qemu_vfree(pad->collapse_buf);
+        qemu_iovec_destroy(&pad->collapsed_qiov);
+    }
This is safe because qemu_iovec_init_slice() took copies of local_qiov
iovecs instead of references, but the code requires less thinking if
collapsed_qiov is destroyed before local_qiov. This change would be nice
if you respin.


      memset(pad, 0, sizeof(*pad));
@@ -1559,6 +1675,8 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
   * read of padding, bdrv_padding_rmw_read() should be called separately if
   * needed.
+ * @write is true for write requests, false for read requests.
+ *
   * Request parameters (@qiov, &qiov_offset, &offset, &bytes) are in-out:
   *  - on function start they represent original request
   *  - on failure or when padding is not needed they are unchanged
@@ -1567,6 +1685,7 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
  static int bdrv_pad_request(BlockDriverState *bs,
                              QEMUIOVector **qiov, size_t *qiov_offset,
                              int64_t *offset, int64_t *bytes,
+                            bool write,
                              BdrvRequestPadding *pad, bool *padded,
                              BdrvRequestFlags *flags)
@@ -1574,7 +1693,7 @@ static int bdrv_pad_request(BlockDriverState *bs,
bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, &error_abort); - if (!bdrv_init_padding(bs, *offset, *bytes, pad)) {
+    if (!bdrv_init_padding(bs, *offset, *bytes, write, pad)) {
          if (padded) {
              *padded = false;
@@ -1589,6 +1708,14 @@ static int bdrv_pad_request(BlockDriverState *bs,
          return ret;
+    /*
+     * If the IOV is too long after padding, merge (collapse) some entries to
+     * make it not exceed IOV_MAX
+     */
+    bdrv_padding_collapse(pad, bs);
+    assert(pad->local_qiov.niov <= IOV_MAX);
      *bytes += pad->head + pad->tail;
      *offset -= pad->head;
      *qiov = &pad->local_qiov;
@@ -1653,8 +1780,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
          flags |= BDRV_REQ_COPY_ON_READ;
- ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
-                           NULL, &flags);
+    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
+                           &pad, NULL, &flags);
      if (ret < 0) {
          goto fail;
@@ -1996,7 +2123,7 @@ bdrv_co_do_zero_pwritev(BdrvChild *child, int64_t offset, 
int64_t bytes,
      /* This flag doesn't make sense for padding or zero writes */
      flags &= ~BDRV_REQ_REGISTERED_BUF;
- padding = bdrv_init_padding(bs, offset, bytes, &pad);
+    padding = bdrv_init_padding(bs, offset, bytes, true, &pad);
      if (padding) {
          assert(!(flags & BDRV_REQ_NO_WAIT));
          bdrv_make_request_serialising(req, align);
@@ -2112,8 +2239,8 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
           * bdrv_co_do_zero_pwritev() does aligning by itself, so, we do
           * alignment only if there is no ZERO flag.
-        ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
-                               &padded, &flags);
+        ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, true,
+                               &pad, &padded, &flags);
          if (ret < 0) {
              return ret;
diff --git a/util/iov.c b/util/iov.c
index b4be580022..4d0d381949 100644
--- a/util/iov.c
+++ b/util/iov.c
@@ -444,10 +444,6 @@ int qemu_iovec_init_extended(
total_niov = !!head_len + mid_niov + !!tail_len;
-    if (total_niov > IOV_MAX) {
-        return -EINVAL;
-    }
Perhaps an assertion is good here, just to make sure it's detected if a
new caller is added some day. qemu_iovec_init_extended() is a public
API, so something unrelated to block layer padding might use it.

The problem is that I’m not removing this because it’s become unnecessary, but because I need this function to happily create I/O vectors exceeding IOV_MAX.  After this patch, it will create I/O vectors with up to 1026 elements, which are only shrunk afterwards.

What I can do is add a comment that this function can create I/O vectors exceeding 1024 elements, and callers must ensure to somehow deal with this, either by knowing that it can’t happen in their case (and asserting that), or by shrinking/splitting the resulting vector somehow.


reply via email to

[Prev in Thread] Current Thread [Next in Thread]