[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH 2/2] memory: Don't use memcpy for ram_device reg
From: |
Paolo Bonzini |
Subject: |
Re: [Qemu-devel] [PATCH 2/2] memory: Don't use memcpy for ram_device regions |
Date: |
Tue, 25 Oct 2016 13:32:43 +0200 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 |
On 25/10/2016 01:00, Alex Williamson wrote:
> With a vfio assigned device we lay down a base MemoryRegion registered
> as an IO region, giving us read & write accessors. If the region
> supports mmap, we lay down a higher priority sub-region MemoryRegion
> on top of the base layer initialized as a RAM device pointer to the
> mmap. Finally, if we have any quirks for the device (ie. address
> ranges that need additional virtualization support), we put another IO
> sub-region on top of the mmap MemoryRegion. When this is flattened,
> we now potentially have sub-page mmap MemoryRegions exposed which
> cannot be directly mapped through KVM.
>
> This is as expected, but a subtle detail of this is that we end up
> with two different access mechanisms through QEMU. If we disable the
> mmap MemoryRegion, we make use of the IO MemoryRegion and service
> accesses using pread and pwrite to the vfio device file descriptor.
> If the mmap MemoryRegion is enabled and results in one of these
> sub-page gaps, QEMU handles the access as RAM, using memcpy to the
> mmap. Using either pread/pwrite or the mmap directly should be
> correct, but using memcpy causes us problems. I expect that not only
> does memcpy not necessarily honor the original width and alignment in
> performing a copy, but it potentially also uses processor instructions
> not intended for MMIO spaces. It turns out that this has been a
> problem for Realtek NIC assignment, which has such a quirk that
> creates a sub-page mmap MemoryRegion access.
>
> To resolve this, we disable memory_access_is_direct() for ram_device
> regions since QEMU assumes that it can use memcpy for those regions.
> Instead we access through MemoryRegionOps, which replaces the memcpy
> with simple de-references of standard sizes to the host memory. This
> also allows us to eliminate the ram_device bool from the MemoryRegion
> structure since we can simply test the ops pointer.
>
> With this patch we attempt to provide unrestricted access to the RAM
> device, allowing byte through qword access as well as unaligned
> access. The assumption here is that accesses initiated by the VM are
> driven by a device specific driver, which knows the device
> capabilities. If unaligned accesses are not supported by the device,
> we don't want them to work in a VM by performing multiple aligned
> accesses to compose the unaligned access. A down-side of this
> philosophy is that the xp command from the monitor attempts to use
> the largest available access weidth, unaware of the underlying
> device. Using memcpy had this same restriction, but at least now an
> operator can dump individual registers, even if blocks of device
> memory may result in access widths beyond the capabilities of a
> given device (RTL NICs only support up to dword).
>
> Reported-by: Thorsten Kohfeldt <address@hidden>
> Signed-off-by: Alex Williamson <address@hidden>
> ---
> include/exec/memory.h | 7 +++--
> memory.c | 70
> ++++++++++++++++++++++++++++++++++++++++++++++++-
> trace-events | 2 +
> 3 files changed, 74 insertions(+), 5 deletions(-)
>
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index a75b8c3..2d4a287 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -209,7 +209,6 @@ struct MemoryRegion {
> void (*destructor)(MemoryRegion *mr);
> uint64_t align;
> bool terminates;
> - bool ram_device;
> bool enabled;
> bool warning_printed; /* For reservations */
> uint8_t vga_logging_count;
> @@ -1480,9 +1479,11 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t
> addr);
> static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
> {
> if (is_write) {
> - return memory_region_is_ram(mr) && !mr->readonly;
> + return memory_region_is_ram(mr) &&
> + !mr->readonly && !memory_region_is_ram_device(mr);
> } else {
> - return memory_region_is_ram(mr) || memory_region_is_romd(mr);
> + return (memory_region_is_ram(mr) &&
> !memory_region_is_ram_device(mr)) ||
> + memory_region_is_romd(mr);
Oops, I hadn't thought of this. This is a bit slower, especially so if
you have to compare the ops pointer. It's probably best to keep the
mr->ram_device flag, so that later we can place all five relevant flags
(ram, readonly, ram_device, rom_device, romd_mode) into the same word
and play games with bits, like the following:
// write case
mr->flags == MR_RAM
// read cases
(mr->flags & ~MR_READONLY) == MR_RAM ||
(mr->flags & ~MR_READONLY) == (MR_ROM_DEVICE | MR_ROMD_MODE)
(the latter in turn can be optimized to a range check if the flags are
arranged cleverly).
The patch is okay with mr->ram_device left in place, feel free to merge
it through your own VFIO tree.
Paolo
> }
> }
>
> diff --git a/memory.c b/memory.c
> index 7ffcff1..d07f785 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1128,6 +1128,71 @@ const MemoryRegionOps unassigned_mem_ops = {
> .endianness = DEVICE_NATIVE_ENDIAN,
> };
>
> +static uint64_t memory_region_ram_device_read(void *opaque,
> + hwaddr addr, unsigned size)
> +{
> + MemoryRegion *mr = opaque;
> + uint64_t data = (uint64_t)~0;
> +
> + switch (size) {
> + case 1:
> + data = *(uint8_t *)(mr->ram_block->host + addr);
> + break;
> + case 2:
> + data = *(uint16_t *)(mr->ram_block->host + addr);
> + break;
> + case 4:
> + data = *(uint32_t *)(mr->ram_block->host + addr);
> + break;
> + case 8:
> + data = *(uint64_t *)(mr->ram_block->host + addr);
> + break;
> + }
> +
> + trace_memory_region_ram_device_read(get_cpu_index(), mr, addr, data,
> size);
> +
> + return data;
> +}
> +
> +static void memory_region_ram_device_write(void *opaque, hwaddr addr,
> + uint64_t data, unsigned size)
> +{
> + MemoryRegion *mr = opaque;
> +
> + trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data,
> size);
> +
> + switch (size) {
> + case 1:
> + *(uint8_t *)(mr->ram_block->host + addr) = (uint8_t)data;
> + break;
> + case 2:
> + *(uint16_t *)(mr->ram_block->host + addr) = (uint16_t)data;
> + break;
> + case 4:
> + *(uint32_t *)(mr->ram_block->host + addr) = (uint32_t)data;
> + break;
> + case 8:
> + *(uint64_t *)(mr->ram_block->host + addr) = data;
> + break;
> + }
> +}
> +
> +static const MemoryRegionOps ram_device_mem_ops = {
> + .read = memory_region_ram_device_read,
> + .write = memory_region_ram_device_write,
> + .endianness = DEVICE_NATIVE_ENDIAN,
> + .valid = {
> + .min_access_size = 1,
> + .max_access_size = 8,
> + .unaligned = true,
> + },
> + .impl = {
> + .min_access_size = 1,
> + .max_access_size = 8,
> + .unaligned = true,
> + },
> +};
> +
> bool memory_region_access_valid(MemoryRegion *mr,
> hwaddr addr,
> unsigned size,
> @@ -1362,7 +1427,8 @@ void memory_region_init_ram_device_ptr(MemoryRegion *mr,
> void *ptr)
> {
> memory_region_init_ram_ptr(mr, owner, name, size, ptr);
> - mr->ram_device = true;
> + mr->ops = &ram_device_mem_ops;
> + mr->opaque = mr;
> }
>
> void memory_region_init_alias(MemoryRegion *mr,
> @@ -1498,7 +1564,7 @@ const char *memory_region_name(const MemoryRegion *mr)
>
> bool memory_region_is_ram_device(MemoryRegion *mr)
> {
> - return mr->ram_device;
> + return mr->ops == &ram_device_mem_ops;
> }
>
> uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
> diff --git a/trace-events b/trace-events
> index 8ecded5..f74e1d3 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -121,6 +121,8 @@ memory_region_subpage_read(int cpu_index, void *mr,
> uint64_t offset, uint64_t va
> memory_region_subpage_write(int cpu_index, void *mr, uint64_t offset,
> uint64_t value, unsigned size) "cpu %d mr %p offset %#"PRIx64" value
> %#"PRIx64" size %u"
> memory_region_tb_read(int cpu_index, uint64_t addr, uint64_t value, unsigned
> size) "cpu %d addr %#"PRIx64" value %#"PRIx64" size %u"
> memory_region_tb_write(int cpu_index, uint64_t addr, uint64_t value,
> unsigned size) "cpu %d addr %#"PRIx64" value %#"PRIx64" size %u"
> +memory_region_ram_device_read(int cpu_index, void *mr, uint64_t addr,
> uint64_t value, unsigned size) "cpu %d mr %p addr %#"PRIx64" value %#"PRIx64"
> size %u"
> +memory_region_ram_device_write(int cpu_index, void *mr, uint64_t addr,
> uint64_t value, unsigned size) "cpu %d mr %p addr %#"PRIx64" value %#"PRIx64"
> size %u"
>
> ### Guest events, keep at bottom
>
>