qemu-block

Re: [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX

From:	Hanna Czenczek
Subject:	Re: [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX
Date:	Fri, 17 Mar 2023 09:05:18 +0100
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.7.1

On 16.03.23 18:44, Vladimir Sementsov-Ogievskiy wrote:

On 15.03.23 15:13, Hanna Czenczek wrote:
When processing vectored guest requests that are not aligned to the
storage request alignment, we pad them by adding head and/or tail
buffers for a read-modify-write cycle.

The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
with this padding, the vector can exceed that limit.  As of
4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
limit, instead returning an error to the guest.

To the guest, this appears as a random I/O error.  We should not return
an I/O error to the guest when it issued a perfectly valid request.

Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
longer than IOV_MAX, which generally seems to work (because the guest
assumes a smaller alignment than we really have, file-posix's
raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
so emulate the request, so that the IOV_MAX does not matter). However,
that does not seem exactly great.

I see two ways to fix this problem:
1. We split such long requests into two requests.
2. We join some elements of the vector into new buffers to make it
    shorter.

I am wary of (1), because it seems like it may have unintended side
effects.

(2) on the other hand seems relatively simple to implement, with
hopefully few side effects, so this patch does that.
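
To make the arithmetic concrete, here is a minimal sketch of how padding pushes a vector over the limit. The helper name `padded_surplus()` and the macro `SKETCH_IOV_MAX` are made up for illustration and are not part of the patch; the value 1024 is the one quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: mirrors the IOV_MAX value (1024) quoted in the commit message. */
#define SKETCH_IOV_MAX 1024

/*
 * Hypothetical helper: number of entries by which a padded vector would
 * exceed the limit. A non-zero head and a non-zero tail each add one entry.
 */
static int padded_surplus(int guest_niov, size_t head, size_t tail)
{
    int padded_niov = guest_niov + (head ? 1 : 0) + (tail ? 1 : 0);
    int surplus = padded_niov - SKETCH_IOV_MAX;

    assert(guest_niov >= 0);
    return surplus > 0 ? surplus : 0;
}
```

A guest vector that is already 1024 entries long and needs both head and tail padding is therefore two entries too long, which is the most the padding alone can add.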

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
  block/io.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++---
  util/iov.c |   4 --
  2 files changed, 133 insertions(+), 10 deletions(-)

diff --git a/block/io.c b/block/io.c
index 8974d46941..ee226d23d6 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1435,6 +1435,12 @@ out:
   * @merge_reads is true for small requests,
   * if @buf_len == @head + bytes + @tail. In this case it is possible that both
   * head and tail exist but @buf_len == align and @tail_buf == @buf.
+ *
+ * @write is true for write requests, false for read requests.
+ *
+ * If padding makes the vector too long (exceeding IOV_MAX), then we need to
+ * merge existing vector elements into a single one. @collapse_buf acts as the
+ * bounce buffer in such cases.
   */
  typedef struct BdrvRequestPadding {
      uint8_t *buf;
@@ -1443,11 +1449,17 @@ typedef struct BdrvRequestPadding {
      size_t head;
      size_t tail;
      bool merge_reads;
+    bool write;
      QEMUIOVector local_qiov;
+
+    uint8_t *collapse_buf;
+    size_t collapse_len;
+    QEMUIOVector collapsed_qiov;
  } BdrvRequestPadding;

  static bool bdrv_init_padding(BlockDriverState *bs,
                                int64_t offset, int64_t bytes,
+                              bool write,
                                BdrvRequestPadding *pad)
  {
      int64_t align = bs->bl.request_alignment;
@@ -1479,9 +1491,101 @@ static bool bdrv_init_padding(BlockDriverState *bs,
          pad->tail_buf = pad->buf + pad->buf_len - align;
      }

+    pad->write = write;
+
      return true;
  }

+/*
+ * If padding has made the IOV (`pad->local_qiov`) too long (more than IOV_MAX
+ * elements), collapse some elements into a single one so that it adheres to the
+ * IOV_MAX limit again.
+ *
+ * If collapsing, `pad->collapse_buf` will be used as a bounce buffer of length
+ * `pad->collapse_len`. `pad->collapsed_qiov` will contain the previous entries
+ * (before collapsing), so that bdrv_padding_destroy() can copy the bounce
+ * buffer content back for read requests.
+ *
+ * Note that we will not touch the padding head or tail entries here. We cannot
+ * move them to a bounce buffer, because for RMWs, both head and tail expect to
+ * be in an aligned buffer with scratch space after (head) or before (tail) to
+ * perform the read into (because the whole buffer must be aligned, but head's
+ * and tail's lengths naturally cannot be aligned, because they provide padding
+ * for unaligned requests). A collapsed bounce buffer for multiple IOV elements
+ * cannot provide such scratch space.
+ *
+ * Therefore, this function collapses the first IOV elements after the
+ * (potential) head element.
+ */
+static void bdrv_padding_collapse(BdrvRequestPadding *pad, BlockDriverState *bs)
+{
+    int surplus_count, collapse_count;
+    struct iovec *collapse_iovs;
+    QEMUIOVector collapse_qiov;
+    size_t move_count;
+
+    surplus_count = pad->local_qiov.niov - IOV_MAX;
+    /* Not exceeding the limit?  Nothing to collapse. */
+    if (surplus_count <= 0) {
+        return;
+    }
+
+    /*
+     * Only head and tail can have led to the number of entries exceeding
+     * IOV_MAX, so we can exceed it by the head and tail at most
+     */
+    assert(surplus_count <= !!pad->head + !!pad->tail);
+
+    /*
+     * We merge (collapse) `surplus_count` entries into the first entry that is
+     * not padding, i.e. we merge `surplus_count + 1` entries into entry 0 if
+     * there is no head, or entry 1 if there is one.
+     */
+    collapse_count = surplus_count + 1;
+    collapse_iovs = &pad->local_qiov.iov[pad->head ? 1 : 0];
+
+    /* There must be no previously collapsed buffers in `pad` */
+    assert(pad->collapse_len == 0);
+    for (int i = 0; i < collapse_count; i++) {
+        pad->collapse_len += collapse_iovs[i].iov_len;
+    }
+    pad->collapse_buf = qemu_blockalign(bs, pad->collapse_len);
+
+    /* Save the to-be-collapsed IOV elements in collapsed_qiov */
+    qemu_iovec_init_external(&collapse_qiov, collapse_iovs, collapse_count);
+    qemu_iovec_init_slice(&pad->collapsed_qiov,
+                          &collapse_qiov, 0, pad->collapse_len);
+    if (pad->write) {
+        /* For writes: Copy all to-be-collapsed data into collapse_buf */
+        qemu_iovec_to_buf(&pad->collapsed_qiov, 0,
+                          pad->collapse_buf, pad->collapse_len);
+    }
+
+    /* Create the collapsed entry in pad->local_qiov */
+    collapse_iovs[0] = (struct iovec){
+        .iov_base = pad->collapse_buf,
+        .iov_len = pad->collapse_len,
+    };
+
+    /*
+     * To finalize collapsing, we must shift the rest of pad->local_qiov left by
+     * `surplus_count`, i.e. we must move all elements after `collapse_iovs` to
+     * immediately after the collapse target.
+     *
+     * Therefore, the memmove() target is `collapse_iovs[1]` and the source is
+     * `collapse_iovs[collapse_count]`. The number of elements to move is the
+     * number of elements remaining in `pad->local_qiov` after and including
+     * `collapse_iovs[collapse_count]`.
+     */
+    move_count = &pad->local_qiov.iov[pad->local_qiov.niov] -
+        &collapse_iovs[collapse_count];
+    memmove(&collapse_iovs[1],
+            &collapse_iovs[collapse_count],
+            move_count * sizeof(pad->local_qiov.iov[0]));
+
+    pad->local_qiov.niov -= surplus_count;
+}
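
For readers who want to try the collapse-and-shift step outside of QEMU, the following is a rough standalone sketch on a plain `struct iovec` array. The function `iov_collapse()` and its parameters are hypothetical and not part of the patch. It uses malloc() where the patch uses qemu_blockalign(), and it does not keep the original entries around for copying data back after a read, which the patch does via `pad->collapsed_qiov`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>   /* struct iovec */

/*
 * Hypothetical standalone analogue of the collapse step: merge
 * `surplus + 1` entries starting at index `first` into one bounce buffer,
 * then shift the remaining entries left by `surplus`.
 *
 * Returns the bounce buffer (the caller frees it), or NULL if the
 * allocation fails. *niov is updated in place.
 */
static void *iov_collapse(struct iovec *iov, int *niov, int first,
                          int surplus, bool is_write)
{
    int collapse_count = surplus + 1;
    size_t collapse_len = 0;
    size_t move_count;
    unsigned char *buf;

    assert(surplus > 0 && first >= 0);
    assert(first + collapse_count <= *niov);

    for (int i = 0; i < collapse_count; i++) {
        collapse_len += iov[first + i].iov_len;
    }

    /* Avoid a zero-sized allocation so that NULL always means failure */
    buf = malloc(collapse_len ? collapse_len : 1);
    if (!buf) {
        return NULL;
    }

    if (is_write) {
        /* For writes, gather the data to be written into the bounce buffer */
        size_t off = 0;
        for (int i = 0; i < collapse_count; i++) {
            memcpy(buf + off, iov[first + i].iov_base, iov[first + i].iov_len);
            off += iov[first + i].iov_len;
        }
    }

    /* The first collapsed entry now describes the whole bounce buffer */
    iov[first].iov_base = buf;
    iov[first].iov_len = collapse_len;

    /* Move everything after the collapsed range to right behind `first` */
    move_count = (size_t)(*niov - (first + collapse_count));
    memmove(&iov[first + 1], &iov[first + collapse_count],
            move_count * sizeof(iov[0]));

    *niov -= surplus;
    return buf;
}
```

With a head entry present, `first` would be 1, matching `pad->head ? 1 : 0` in the patch, so the head and tail entries keep their own buffers.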

What I don't like is that qemu_iovec_init_extended() is really complex, and it is used only here [I mean bdrv_pad_request()] (qemu_iovec_init_slice() uses only a small subset of qemu_iovec_init_extended()'s possibilities). And in the end, we use qemu_iovec_init_extended() only to rewrite the resulting qiov by hand, using direct access to the iov[] array and memmove(). I think such direct manipulations are better done in util/iov.c. And anyway, this all shows that qemu_iovec_init_extended(), despite being complex, doesn't meet our needs.

Hmm. *Improving* qemu_iovec_init_external() by allowing it to allocate an additional bounce buffer and do the collapsing doesn't look good.

Maybe instead, do the logic of qemu_iovec_init_extended() together with bdrv_padding_collapse() in bdrv_pad_request() directly, using the simpler qemu_iovec_* API?

Something like:

1. prepare bounce_buffer if we want to collapse
2. allocate local_qiov of calculated size
3. compile the local_qiov:

  - if head: qemu_iovec_add(local_qiov, head)
  - if collapse_buf: qemu_iovec_add(local_qiov, collapse_buf)
  - qemu_iovec_concat(local_qiov, remaining part of qiov)
  - if tail: qemu_iovec_add(local_qiov, tail)
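
The steps above could look roughly like this when written against a plain `struct iovec` array instead of the qemu_iovec_* API. This is only a sketch of the ordering (head, collapsed bounce buffer, remaining guest entries, tail); `build_padded_iov()` and its parameters are hypothetical names, and the caller is assumed to have prepared the bounce buffer already (step 1).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>   /* struct iovec */

/*
 * Hypothetical sketch: compile the padded vector from its parts instead of
 * patching it up afterwards. `head`, `collapsed` and `tail` may be NULL if
 * the respective part does not exist. `rest` holds the guest entries that
 * were not collapsed.
 *
 * Returns the number of entries written to `out`, or -1 if they would not
 * fit into `out_max` entries.
 */
static int build_padded_iov(struct iovec *out, int out_max,
                            const struct iovec *head,
                            const struct iovec *collapsed,
                            const struct iovec *rest, int rest_niov,
                            const struct iovec *tail)
{
    int needed, n = 0;

    if (rest_niov < 0) {
        return -1;
    }

    needed = (head ? 1 : 0) + (collapsed ? 1 : 0) + rest_niov + (tail ? 1 : 0);
    if (needed > out_max) {
        return -1;
    }

    if (head) {
        out[n++] = *head;
    }
    if (collapsed) {
        out[n++] = *collapsed;
    }
    if (rest_niov > 0) {
        memcpy(&out[n], rest, (size_t)rest_niov * sizeof(rest[0]));
        n += rest_niov;
    }
    if (tail) {
        out[n++] = *tail;
    }

    assert(n == needed);
    return n;
}
```

Building the vector in one pass like this avoids the memmove() on an already initialized vector, which is the direct manipulation objected to above.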


Sure, I’ll give it a try!

Hanna
