[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX
From: |
Eric Blake |
Subject: |
Re: [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX |
Date: |
Wed, 15 Mar 2023 13:25:01 -0500 |
User-agent: |
NeoMutt/20220429 |
On Wed, Mar 15, 2023 at 01:13:29PM +0100, Hanna Czenczek wrote:
> When processing vectored guest requests that are not aligned to the
> storage request alignment, we pad them by adding head and/or tail
> buffers for a read-modify-write cycle.
>
> The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
> with this padding, the vector can exceed that limit. As of
> 4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
> qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
> limit, instead returning an error to the guest.
>
> To the guest, this appears as a random I/O error. We should not return
> an I/O error to the guest when it issued a perfectly valid request.
>
> Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
> longer than IOV_MAX, which generally seems to work (because the guest
> assumes a smaller alignment than we really have, file-posix's
> raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
> so emulate the request, so that the IOV_MAX does not matter). However,
> that does not seem exactly great.
>
> I see two ways to fix this problem:
> 1. We split such long requests into two requests.
> 2. We join some elements of the vector into new buffers to make it
> shorter.
>
> I am wary of (1), because it seems like it may have unintended side
> effects.
>
> (2) on the other hand seems relatively simple to implement, with
> hopefully few side effects, so this patch does that.
Agreed that approach 2 is more conservative.
>
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/io.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> util/iov.c | 4 --
> 2 files changed, 133 insertions(+), 10 deletions(-)
>
> +/*
> + * If padding has made the IOV (`pad->local_qiov`) too long (more than
> IOV_MAX
> + * elements), collapse some elements into a single one so that it adheres to
> the
> + * IOV_MAX limit again.
> + *
> + * If collapsing, `pad->collapse_buf` will be used as a bounce buffer of
> length
> + * `pad->collapse_len`. `pad->collapsed_qiov` will contain the previous
> entries
> + * (before collapsing), so that bdrv_padding_destroy() can copy the bounce
> + * buffer content back for read requests.
> + *
> + * Note that we will not touch the padding head or tail entries here. We
> cannot
> + * move them to a bounce buffer, because for RMWs, both head and tail expect
> to
> + * be in an aligned buffer with scratch space after (head) or before (tail)
> to
> + * perform the read into (because the whole buffer must be aligned, but
> head's
> + * and tail's lengths naturally cannot be aligned, because they provide
> padding
> + * for unaligned requests). A collapsed bounce buffer for multiple IOV
> elements
> + * cannot provide such scratch space.
> + *
> + * Therefore, this function collapses the first IOV elements after the
> + * (potential) head element.
It looks like you blindly pick the first one or two non-padding iovs
at the front of the array. Would it be any wiser (in terms of less
memmove() action or even a smaller bounce buffer) to pick iovs at the
end of the array, and/or a sequential search for the smallest
neighboring iovs? Or is that a micro-optimization that costs more
than it saves?
Would it be any easier to swap the order of padding vs. collapsing?
That is, we already know the user is giving us a long list of iovs; if
it is 1024 elements long, and we can detect that padding will be
needed, should we collapse before padding instead of padding, finding
that we now have 1026, and memmove'ing back into 1024?
But logic-wise, your patch looks correct to me.
Reviewed-by: Eric Blake <eblake@redhat.com>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org