Re: [PATCH v2 1/4] softmmu: Support concurrent bounce buffers

qemu-devel

From:	Mattias Nissler
Subject:	Re: [PATCH v2 1/4] softmmu: Support concurrent bounce buffers
Date:	Thu, 24 Aug 2023 08:58:30 +0200
On Wed, Aug 23, 2023 at 10:54 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Aug 23, 2023 at 10:08:08PM +0200, Mattias Nissler wrote:
> > Peter, thanks for taking a look and providing feedback!
> >
> > On Wed, Aug 23, 2023 at 7:35 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Wed, Aug 23, 2023 at 02:29:02AM -0700, Mattias Nissler wrote:
> > > > When DMA memory can't be directly accessed, as is the case when
> > > > running the device model in a separate process without shareable DMA
> > > > file descriptors, bounce buffering is used.
> > > >
> > > > It is not uncommon for device models to request mapping of several DMA
> > > > regions at the same time. Examples include:
> > > >  * net devices, e.g. when transmitting a packet that is split across
> > > >    several TX descriptors (observed with igb)
> > > >  * USB host controllers, when handling a packet with multiple data TRBs
> > > >    (observed with xhci)
> > > >
> > > > Previously, qemu only provided a single bounce buffer and would fail DMA
> > > > map requests while the buffer was already in use. In turn, this would
> > > > cause DMA failures that ultimately manifest as hardware errors from the
> > > > guest perspective.
> > > >
> > > > This change allocates DMA bounce buffers dynamically instead of
> > > > supporting only a single buffer. Thus, multiple DMA mappings work
> > > > correctly also when RAM can't be mmap()-ed.
> > > >
> > > > The total bounce buffer allocation size is limited by a new command line
> > > > parameter. The default is 4096 bytes to match the previous maximum
> > > > buffer size. It is expected that suitable limits will vary quite a bit
> > > > in practice depending on device models and workloads.
> > > >
> > > > Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
> > > > ---
> > > >  include/sysemu/sysemu.h |  2 +
> > > >  qemu-options.hx         | 27 +++++++++++++
> > > >  softmmu/globals.c       |  1 +
> > > >  softmmu/physmem.c       | 84 +++++++++++++++++++++++------------------
> > > >  softmmu/vl.c            |  6 +++
> > > >  5 files changed, 83 insertions(+), 37 deletions(-)
> > > >
> > > > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> > > > index 25be2a692e..c5dc93cb53 100644
> > > > --- a/include/sysemu/sysemu.h
> > > > +++ b/include/sysemu/sysemu.h
> > > > @@ -61,6 +61,8 @@ extern int nb_option_roms;
> > > >  extern const char *prom_envs[MAX_PROM_ENVS];
> > > >  extern unsigned int nb_prom_envs;
> > > >
> > > > +extern uint64_t max_bounce_buffer_size;
> > > > +
> > > >  /* serial ports */
> > > >
> > > >  /* Return the Chardev for serial port i, or NULL if none */
> > > > diff --git a/qemu-options.hx b/qemu-options.hx
> > > > index 29b98c3d4c..6071794237 100644
> > > > --- a/qemu-options.hx
> > > > +++ b/qemu-options.hx
> > > > @@ -4959,6 +4959,33 @@ SRST
> > > >  ERST
> > > >  #endif
> > > >
> > > > +DEF("max-bounce-buffer-size", HAS_ARG,
> > > > +    QEMU_OPTION_max_bounce_buffer_size,
> > > > +    "-max-bounce-buffer-size size\n"
> > > > > +    "                DMA bounce buffer size limit in bytes (default=4096)\n",
> > > > +    QEMU_ARCH_ALL)
> > > > +SRST
> > > > +``-max-bounce-buffer-size size``
> > > > +    Set the limit in bytes for DMA bounce buffer allocations.
> > > > +
> > > > > +    DMA bounce buffers are used when device models request memory-mapped access
> > > > > +    to memory regions that can't be directly mapped by the qemu process, so the
> > > > > +    memory must be read or written to a temporary local buffer for the device
> > > > > +    model to work with. This is the case e.g. for I/O memory regions, and when
> > > > > +    running in multi-process mode without shared access to memory.
> > > > +
> > > > > +    Whether bounce buffering is necessary depends heavily on the device model
> > > > > +    implementation. Some devices use explicit DMA read and write operations
> > > > > +    which do not require bounce buffers. Some devices, notably storage, will
> > > > > +    retry a failed DMA map request after bounce buffer space becomes available
> > > > > +    again. Most other devices will bail when encountering map request failures,
> > > > > +    which will typically appear to the guest as a hardware error.
> > > > +
> > > > +    Suitable bounce buffer size values depend on the workload and guest
> > > > > +    configuration. A few kilobytes up to a few megabytes are common sizes
> > > > > +    encountered in practice.
> > >
> > > Does it mean that the default 4K size can still easily fail with some
> > > device setup?
> >
> > Yes. The thing is that the respective device setup is pretty exotic,
> > at least the only setup I'm aware of is multi-process with direct RAM
> > access via shared file descriptors from the device process disabled
> > (which hurts performance, so few people will run this setup). In
> > theory, DMA to an I/O region of some sort would also run into the
> > issue even in single process mode, but I'm not aware of such a
> > situation. Looking at it from a historic perspective, note that the
> > single-bounce-buffer restriction has been present for a decade, and
> > thus the issue has been present for years (since multi-process is a
> > thing), without it hurting anyone enough to get fixed. But don't get
> > me wrong - I don't want to downplay anything and very much would like
> > to see this fixed, but I want to be honest and put things into the
> > right perspective.
> >
> > >
> > > IIUC the whole point of limit here is to make sure the allocation is still
> > > bounded, while 4K itself is not a hard limit. Making it bigger would be,
> > > IMHO, nice if it should work with known configs which used to be broken.
> >
> > I'd be in favor of bumping the default. Networking should be fine with
> > 64KB, but I've observed a default Linux + xhci + usb_storage setup
> > use up to 1MB of DMA buffers, so we'd probably need to raise it
> > considerably. Would 4MB still be acceptable? That wouldn't allow a
> > single nefarious VM to stage a memory denial of service attack, but
> > what if you're running many VMs?
>
> Could wait and see whether there are any more comments from others.
> Personally 4MB looks fine, as that's not a constant consumption per-vm, but
> a worst case limit (probably only when there is an attacker).

OK, I'll take a note to change to 4MB for the next version of this
series then, barring any objections from others.
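
For readers skimming the thread, the byte-budget accounting that this limit
applies to can be sketched roughly as below. This is a standalone
illustration using C11 atomics rather than QEMU's qatomic_* helpers; the
names budget_limit, budget_used, budget_reserve and budget_release are
hypothetical stand-ins, not identifiers from the patch. The clamp of
`excess` to `want` is an addition of this sketch: it covers the case where
concurrent reservations have transiently pushed the counter further past
the limit than this request alone accounts for.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for max_bounce_buffer_size / bounce_buffer_size. */
static size_t budget_limit = 4096;
static _Atomic size_t budget_used;

/*
 * Try to reserve `want` bytes. Returns the number of bytes granted,
 * which may be smaller than `want` (possibly 0) near the limit.
 */
static size_t budget_reserve(size_t want)
{
    /* Optimistically charge the full request first. */
    size_t total = atomic_fetch_add(&budget_used, want) + want;

    if (total > budget_limit) {
        /* Hand back the part of this request that overshoots the limit. */
        size_t excess = total - budget_limit;
        if (excess > want) {
            excess = want;  /* never return more than we charged */
        }
        atomic_fetch_sub(&budget_used, excess);
        want -= excess;
    }
    return want;
}

/* Return previously granted bytes to the budget. */
static void budget_release(size_t granted)
{
    size_t before = atomic_fetch_sub(&budget_used, granted);
    assert(before >= granted);  /* catch unbalanced releases */
}
```

A caller would treat a return value of 0 as "no bounce buffer space right
now", and a short grant as a truncated mapping, mirroring how `*plen` is
reduced in address_space_map().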

Note to self: Raising the limit will likely also accommodate the ide
test's DMA buffer consumption, which means we'd be losing test
coverage for the map client callback code path. I probably need to
adjust the test or make a new one to make up for that.
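
As a reminder of what that code path does, here is a deliberately
simplified, single-threaded sketch of the "register a callback, get woken
when space frees up" pattern. It is not the QEMU implementation (which uses
bottom halves, a QLIST and a mutex); Waiter, waiter_register,
waiters_notify and count_retry are hypothetical names for illustration
only.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 8

typedef void (*RetryFn)(void *opaque);

typedef struct {
    RetryFn fn;
    void *opaque;
} Waiter;

static Waiter waiters[MAX_WAITERS];
static size_t n_waiters;

/* Called by a device model after a map attempt failed. */
static int waiter_register(RetryFn fn, void *opaque)
{
    assert(fn != NULL);
    if (n_waiters == MAX_WAITERS) {
        return -1;
    }
    waiters[n_waiters].fn = fn;
    waiters[n_waiters].opaque = opaque;
    n_waiters++;
    return 0;
}

/* Called when bounce buffer space is released: wake each waiter once. */
static void waiters_notify(void)
{
    Waiter pending[MAX_WAITERS];
    size_t n = n_waiters;

    for (size_t i = 0; i < n; i++) {
        pending[i] = waiters[i];
    }
    /* One-shot: a callback that fails again has to re-register. */
    n_waiters = 0;
    for (size_t i = 0; i < n; i++) {
        pending[i].fn(pending[i].opaque);
    }
}

/* Example callback: counts how often a retry was triggered. */
static void count_retry(void *opaque)
{
    ++*(int *)opaque;
}
```

A test that wants to exercise this path has to make the first map attempt
fail, which is exactly what a larger default limit makes harder to arrange.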

>
> Multiple VMs can indeed make it worse, but it means the attacker will
> need to attack all the VMs successfully, and the sum will be 4MB /
> mem_size_average_vm in percentage, irrespective of the number of VMs; for
> ~4GB average VM size it's 0.1% memory, and even less if the VMs are
> larger - maybe not something extremely scary even if it happened.
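
The estimate above can be checked with a few lines of arithmetic. The
helper name worst_case_percent is made up for this sketch, and it assumes
every VM has the same limit and the same memory size, which is a
simplification.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Worst-case share of guest memory pinned by bounce buffers, in percent,
 * if every one of n_vms VMs exhausts its limit at the same time.
 */
static double worst_case_percent(unsigned n_vms, uint64_t limit_bytes,
                                 uint64_t avg_vm_bytes)
{
    assert(n_vms > 0 && avg_vm_bytes > 0);
    double pinned = (double)n_vms * (double)limit_bytes;
    double total = (double)n_vms * (double)avg_vm_bytes;
    /* n_vms cancels out, so the ratio does not depend on the VM count. */
    return 100.0 * pinned / total;
}
```

With a 4 MiB limit and 4 GiB VMs, worst_case_percent(1, 4ULL << 20,
4ULL << 30) comes out just under 0.1, and the result is the same for 100
VMs as for one.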
>
> >
> > >
> > > > +ERST
> > > > +
> > > >  DEFHEADING()
> > > >
> > > >  DEFHEADING(Generic object creation:)
> > > > diff --git a/softmmu/globals.c b/softmmu/globals.c
> > > > index e83b5428d1..d3cc010717 100644
> > > > --- a/softmmu/globals.c
> > > > +++ b/softmmu/globals.c
> > > > @@ -54,6 +54,7 @@ const char *prom_envs[MAX_PROM_ENVS];
> > > >  uint8_t *boot_splash_filedata;
> > > >  int only_migratable; /* turn it off unless user states otherwise */
> > > >  int icount_align_option;
> > > > +uint64_t max_bounce_buffer_size = 4096;
> > > >
> > > > >  /* The bytes in qemu_uuid are in the order specified by RFC4122, _not_ in the
> > > > >   * little-endian "wire format" described in the SMBIOS 2.6 specification.
> > > > diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> > > > index 3df73542e1..9f0fec0c8e 100644
> > > > --- a/softmmu/physmem.c
> > > > +++ b/softmmu/physmem.c
> > > > @@ -50,6 +50,7 @@
> > > >  #include "sysemu/dma.h"
> > > >  #include "sysemu/hostmem.h"
> > > >  #include "sysemu/hw_accel.h"
> > > > +#include "sysemu/sysemu.h"
> > > >  #include "sysemu/xen-mapcache.h"
> > > >  #include "trace/trace-root.h"
> > > >
> > > > @@ -2904,13 +2905,12 @@ void cpu_flush_icache_range(hwaddr start, 
> > > > hwaddr len)
> > > >
> > > >  typedef struct {
> > > >      MemoryRegion *mr;
> > > > -    void *buffer;
> > > >      hwaddr addr;
> > > > -    hwaddr len;
> > > > -    bool in_use;
> > > > +    size_t len;
> > > > +    uint8_t buffer[];
> > > >  } BounceBuffer;
> > > >
> > > > -static BounceBuffer bounce;
> > > > +static size_t bounce_buffer_size;
> > > >
> > > >  typedef struct MapClient {
> > > >      QEMUBH *bh;
> > > > @@ -2945,9 +2945,9 @@ void cpu_register_map_client(QEMUBH *bh)
> > > >      qemu_mutex_lock(&map_client_list_lock);
> > > >      client->bh = bh;
> > > >      QLIST_INSERT_HEAD(&map_client_list, client, link);
> > > > -    /* Write map_client_list before reading in_use.  */
> > > > +    /* Write map_client_list before reading bounce_buffer_size.  */
> > > >      smp_mb();
> > > > -    if (!qatomic_read(&bounce.in_use)) {
> > > > +    if (qatomic_read(&bounce_buffer_size) < max_bounce_buffer_size) {
> > > >          cpu_notify_map_clients_locked();
> > > >      }
> > > >      qemu_mutex_unlock(&map_client_list_lock);
> > > > @@ -3076,31 +3076,35 @@ void *address_space_map(AddressSpace *as,
> > > >      RCU_READ_LOCK_GUARD();
> > > >      fv = address_space_to_flatview(as);
> > > >      mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
> > > > +    memory_region_ref(mr);
> > > >
> > > >      if (!memory_access_is_direct(mr, is_write)) {
> > > > -        if (qatomic_xchg(&bounce.in_use, true)) {
> > > > +        size_t size = qatomic_add_fetch(&bounce_buffer_size, l);
> > > > +        if (size > max_bounce_buffer_size) {
> > > > +            size_t excess = size - max_bounce_buffer_size;
> > > > +            l -= excess;
> > > > +            qatomic_sub(&bounce_buffer_size, excess);
> > > > +        }
> > > > +
> > > > +        if (l == 0) {
> > > >              *plen = 0;
> > > >              return NULL;
> > > >          }
> > > > -        /* Avoid unbounded allocations */
> > > > -        l = MIN(l, TARGET_PAGE_SIZE);
> > > > -        bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
> > > > -        bounce.addr = addr;
> > > > -        bounce.len = l;
> > > >
> > > > -        memory_region_ref(mr);
> > > > -        bounce.mr = mr;
> > > > +        BounceBuffer *bounce = g_malloc(l + sizeof(BounceBuffer));
> > >
> > > Maybe g_malloc0() would be better?
> >
> > Good point, will change.
> >
> > >
> > > I just checked that we had target page aligned allocations since the 1st
> > > day (commit 6d16c2f88f2a).  I didn't find any clue showing why it was done
> > > like that, but I do worry about whether any existing caller may be
> > > implicitly relying on an address that is target page aligned.  But maybe
> > > not a major issue; I didn't see anything rely on that yet.
> >
> > I did go through the same exercise when noticing the page alignment
> > and arrived at the same conclusion as you. That makes it two people
> > thinking it's OK, so I feel like we should take the risk here, in
> > particular given that we know this code path is already broken as is.
>
> It'll be more important to see if any one person thinks it's not okay in
> this case, though. :)
>
> If we decide to take the risk, we should merge a patch like this as
> early as possible in the release cycle.

Happy to do so.

However, let me know if you'd rather retain page alignment.
It'll complicate things though, as it'll likely lead to storing the
metadata separate from the buffer allocation, probably using a hash
table / tree / array (as you suggest below) to keep track of all
allocated buffers and locate their metadata by address.
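
For comparison, the layout the patch currently uses keeps the metadata in a
header directly in front of the payload, so unmap can find it with pointer
arithmetic alone. A minimal sketch of that idea, with a magic number check
added as a best-effort guard; Bounce, BOUNCE_MAGIC and the bounce_* helpers
are made-up names, the magic value is arbitrary, and plain calloc()/free()
stand in for the glib allocators:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BOUNCE_MAGIC 0xb4017ceb4ffe12edULL  /* arbitrary value for this sketch */

typedef struct {
    uint64_t magic;    /* best-effort guard against stray pointers */
    size_t len;        /* payload size in bytes */
    uint8_t buffer[];  /* payload handed out to the caller */
} Bounce;

/* Allocate header and zeroed payload in one block; return the payload. */
static void *bounce_alloc(size_t len)
{
    Bounce *b = calloc(1, sizeof(Bounce) + len);
    if (b == NULL) {
        return NULL;
    }
    b->magic = BOUNCE_MAGIC;
    b->len = len;
    return b->buffer;
}

/* Step back from the payload pointer to the enclosing header. */
static Bounce *bounce_from_payload(void *payload)
{
    Bounce *b = (Bounce *)((char *)payload - offsetof(Bounce, buffer));
    assert(b->magic == BOUNCE_MAGIC);  /* not a pointer we handed out? */
    return b;
}

/* Free the buffer and report how many payload bytes it held. */
static size_t bounce_free(void *payload)
{
    Bounce *b = bounce_from_payload(payload);
    size_t len = b->len;
    b->magic = 0;  /* poison, so a double unmap is more likely to trip */
    free(b);
    return len;
}
```

Because the payload starts right after the header, it cannot be page
aligned unless the header is padded out to a full page, which is why
keeping the old alignment would push the metadata into a separate lookup
structure.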

>
> >
> > >
> > > > +        bounce->mr = mr;
> > > > +        bounce->addr = addr;
> > > > +        bounce->len = l;
> > > > +
> > > >          if (!is_write) {
> > > >              flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED,
> > > > -                               bounce.buffer, l);
> > > > +                          bounce->buffer, l);
> > > >          }
> > > >
> > > >          *plen = l;
> > > > -        return bounce.buffer;
> > > > +        return bounce->buffer;
> > > >      }
> > > >
> > > > -
> > > > -    memory_region_ref(mr);
> > > >      *plen = flatview_extend_translation(fv, addr, len, mr, xlat,
> > > >                                          l, is_write, attrs);
> > > >      fuzz_dma_read_cb(addr, *plen, mr);
> > > > @@ -3114,31 +3118,37 @@ void *address_space_map(AddressSpace *as,
> > > >  void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
> > > >                           bool is_write, hwaddr access_len)
> > > >  {
> > > > -    if (buffer != bounce.buffer) {
> > > > -        MemoryRegion *mr;
> > > > -        ram_addr_t addr1;
> > > > +    MemoryRegion *mr;
> > > > +    ram_addr_t addr1;
> > > > +
> > > > +    mr = memory_region_from_host(buffer, &addr1);
> > > > +    if (mr == NULL) {
> > > > +        /*
> > > > > +         * Must be a bounce buffer (unless the caller passed a pointer which
> > > > > +         * wasn't returned by address_space_map, which is illegal).
> > >
> > > Is it possible to still have some kind of sanity check to make sure it's a
> > > > bounce buffer passed in, just in case of a caller bug?  Or, the failure can
> > > > be weird..
> >
> > I was contemplating putting a magic number as the first struct member
> > as a best effort to detect invalid pointers and corruptions, but
> > wasn't sure it's warranted. Since you ask, I'll make that change.
>
> That'll be good, thanks.
>
> I was thinking maybe we can also maintain all the mapped buffers just like
> before, either in a tree, or a sorted array; the array can be even easier
> and static if the limit applied here will be "maximum number of bounce
> buffer mapped" rather than "maximum bytes of bounce buffer mapped", but
> this whole idea may already be over-complicated just to worry about leaked buffers?
> The magic number sounds good enough.
>
> >
> > >
> > > > +         */
> > > > > +        BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer);
> > > >
> > > > -        mr = memory_region_from_host(buffer, &addr1);
> > > > -        assert(mr != NULL);
> > > >          if (is_write) {
> > > > -            invalidate_and_set_dirty(mr, addr1, access_len);
> > > > -        }
> > > > -        if (xen_enabled()) {
> > > > -            xen_invalidate_map_cache_entry(buffer);
> > > > > +            address_space_write(as, bounce->addr, MEMTXATTRS_UNSPECIFIED,
> > > > +                                bounce->buffer, access_len);
> > > >          }
> > > > -        memory_region_unref(mr);
> > > > +
> > > > +        memory_region_unref(bounce->mr);
> > > > +        qatomic_sub(&bounce_buffer_size, bounce->len);
> > > > +        /* Write bounce_buffer_size before reading map_client_list. */
> > > > +        smp_mb();
> > > > +        cpu_notify_map_clients();
> > > > +        g_free(bounce);
> > > >          return;
> > > >      }
> > > > +
> > > > +    if (xen_enabled()) {
> > > > +        xen_invalidate_map_cache_entry(buffer);
> > > > +    }
> > > >      if (is_write) {
> > > > -        address_space_write(as, bounce.addr, MEMTXATTRS_UNSPECIFIED,
> > > > -                            bounce.buffer, access_len);
> > > > -    }
> > > > -    qemu_vfree(bounce.buffer);
> > > > -    bounce.buffer = NULL;
> > > > -    memory_region_unref(bounce.mr);
> > > > -    /* Clear in_use before reading map_client_list.  */
> > > > -    qatomic_set_mb(&bounce.in_use, false);
> > > > -    cpu_notify_map_clients();
> > > > +        invalidate_and_set_dirty(mr, addr1, access_len);
> > > > +    }
> > > >  }
> > > >
> > > >  void *cpu_physical_memory_map(hwaddr addr,
> > > > diff --git a/softmmu/vl.c b/softmmu/vl.c
> > > > index b0b96f67fa..dbe52f5ea1 100644
> > > > --- a/softmmu/vl.c
> > > > +++ b/softmmu/vl.c
> > > > @@ -3469,6 +3469,12 @@ void qemu_init(int argc, char **argv)
> > > >                  exit(1);
> > > >  #endif
> > > >                  break;
> > > > +            case QEMU_OPTION_max_bounce_buffer_size:
> > > > > +                if (qemu_strtosz(optarg, NULL, &max_bounce_buffer_size) < 0) {
> > > > > +                    error_report("invalid -max-bounce-buffer-size value");
> > > > +                    exit(1);
> > > > +                }
> > > > +                break;
> > >
> > > PS: I had a vague memory that we do not recommend adding more qemu cmdline
> > > options, but I don't know enough about the plan to say anything real.
> >
> > I am aware of that, and I'm really not happy with the command line
> > option myself. Consider the command line flag a straw man I put in to
> > see whether any reviewers have better ideas :)
> >
> > More seriously, I actually did look around to see whether I can add
> > the parameter to one of the existing option groupings somewhere, but
> > neither do I have a suitable QOM object that I can attach the
> > parameter to, nor did I find any global option group that fits: this
> > is not really memory configuration, and it's not really CPU
> > configuration, it's more related to shared device model
> > infrastructure... If you have a good idea for a home for this, I'm all
> > ears.
>
> No good & simple suggestion here, sorry.  We can keep the option there
> until someone jumps in, then the better alternative could also come along.
>
> After all I expect if we can choose a sensible enough default value, this
> new option shouldn't be used by anyone for real.
>
> Thanks,
>
> --
> Peter Xu
>