qemu-devel

Re: [PATCH] util/mmap: optimize qemu_ram_mmap() alignment


From: David Hildenbrand
Subject: Re: [PATCH] util/mmap: optimize qemu_ram_mmap() alignment
Date: Tue, 11 Apr 2023 20:31:22 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.9.1

On 11.04.23 16:39, Steven Sistare wrote:
On 4/11/2023 3:57 AM, David Hildenbrand wrote:
On 10.04.23 17:46, Steve Sistare wrote:
Guest RAM created with memory-backend-memfd is aligned to a
QEMU_VMALLOC_ALIGN=2M boundary, and memory-backend-memfd does not support
the "align" parameter to change the default.  This is sub-optimal on
aarch64 kernels with a 64 KB page size and 512 MB huge page size, as the
range will not be backed by huge pages.  Moreover, any shared allocation
using qemu_ram_mmap() will be sub-optimal on such a system if the align
parameter is less than 512 MB.

The kernel is smart enough to return a hugely aligned pointer for MAP_SHARED
mappings when /sys/kernel/mm/transparent_hugepage/shmem_enabled allows it.
However, the QEMU function qemu_ram_mmap() calls mmap() twice to perform its
own alignment:

      guardptr = mmap(0, total, PROT_NONE, flags, ...
      flags |= shared ? MAP_SHARED : MAP_PRIVATE;
      ptr = mmap(guardptr + offset, size, prot, flags | map_sync_flags, ...

On the first call, flags has MAP_PRIVATE, hence the kernel does not apply
its shared memory policy, and returns a non-huge-aligned guardptr.

To fix, for shared mappings, pass MAP_SHARED to both mmap calls.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
---
   util/mmap-alloc.c | 5 +++--
   1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 5ed7d29..37a0d1e 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -121,7 +121,7 @@ static bool map_noreserve_effective(int fd, uint32_t qemu_map_flags)
  * Reserve a new memory region of the requested size to be used for mapping
  * from the given fd (if any).
  */
-static void *mmap_reserve(size_t size, int fd)
+static void *mmap_reserve(size_t size, int fd, int final_flags)
 {
     int flags = MAP_PRIVATE;
@@ -144,6 +144,7 @@ static void *mmap_reserve(size_t size, int fd)
 #else
     fd = -1;
     flags |= MAP_ANONYMOUS;
+    flags |= final_flags & MAP_SHARED;

Setting both MAP_PRIVATE and MAP_SHARED sure sounds very wrong.

Yes, thanks.  I introduced that mistake when I ported the fix from an earlier
QEMU that did not have mmap_reserve.  Should be:

     fd = -1;
     flags = MAP_ANONYMOUS;
     flags |= final_flags & (MAP_SHARED | MAP_PRIVATE);


Better, but I still don't like bringing in Linux kernel MAP_SHARED logic that happens to do better in some cases right now ... because this is supposed to work on POSIX and ideally optimize on various systems and configurations.


The traditional way to handle that is via QEMU_VMALLOC_ALIGN.

I think you'd actually want to turn QEMU_VMALLOC_ALIGN into a function that can special-case the way hw/virtio/virtio-mem.c:virtio_mem_default_thp_size() does for "qemu_real_host_page_size() == 64 * KiB".

If you set QEMU_VMALLOC_ALIGN properly, you'll also have any DIMMs get that alignment in guest physical address space.

If we increase QEMU_VMALLOC_ALIGN, then all allocations will be 512 MB aligned.  If we make many small allocations, we could conceivably run out of VA space.  Further, the processor may use low order

Running out of VA space? I'm not so sure. We are talking about the qemu_ram_mmap() function here ... (well, and file_ram_alloc() which only uses QEMU_VMALLOC_ALIGN on s390x).

This is all about guest RAM, although the name suggests otherwise.


address bits in internal hashes, and now offset X in every allocated range will have the same value for the low 29 bits, possibly causing more collisions and reducing performance.  Further, the huge alignment is not even needed if huge pages for shmem have been disabled in sysfs.

I'm not convinced this is worth optimizing, or even special-casing just for arm64 when we're only talking about mapping guest RAM.

Besides, what about ordinary anon THP? See below.


We could avoid that by adding logic to also consider allocation size when choosing alignment, and checking sysfs tunables.  Or, we can just let the kernel do so by telling it the truth about memory flavor when calling mmap.

It's probably best to not trust the kernel to do the right alignment thing because we learned that it's not easy to get such optimizations into the kernel:

https://lkml.kernel.org/r/20220809142457.4751229f@imladris.surriel.com

got reverted again, which would have done the right thing for anon THP.


I'd suggest that we either hard-code it for arm64 as well, by adjusting QEMU_VMALLOC_ALIGN (if we don't care about giving it a better name and making it a function).

#elif defined(__linux__) && defined(__aarch64__)
# define QEMU_VMALLOC_ALIGN \
    ((qemu_real_host_page_size() == 64 * KiB) ? (512 * MiB) : (2 * MiB))
#elif ...

Alternatively, we could rewrite into a proper function that caches the result and tries looking up the actual host THP size using "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size" under Linux before falling back to the known defaults.

Sure, we could optimize in the caller (allocation size smaller than alignment?), but that requires good justification.

--
Thanks,

David / dhildenb



