From: David Hildenbrand
Subject: Re: [PATCH v7 09/15] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux
Date: Tue, 4 May 2021 13:28:05 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1

On 04.05.21 13:14, Daniel P. Berrangé wrote:
On Tue, May 04, 2021 at 01:04:17PM +0200, David Hildenbrand wrote:
On 04.05.21 12:32, Daniel P. Berrangé wrote:
On Tue, May 04, 2021 at 12:21:25PM +0200, David Hildenbrand wrote:
On 04.05.21 12:09, Daniel P. Berrangé wrote:
On Wed, Apr 28, 2021 at 03:37:48PM +0200, David Hildenbrand wrote:
Let's support RAM_NORESERVE via MAP_NORESERVE on Linux. The flag has no
effect on most shared mappings - except for hugetlbfs and anonymous memory.

Linux man page:
     "MAP_NORESERVE: Do not reserve swap space for this mapping. When swap
     space is reserved, one has the guarantee that it is possible to modify
     the mapping. When swap space is not reserved one might get SIGSEGV
     upon a write if no physical memory is available. See also the discussion
     of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before
     2.6, this flag had effect only for private writable mappings."

Note that the "guarantee" part is wrong with memory overcommit in Linux.

Also, in Linux hugetlbfs is treated differently - we configure reservation
of huge pages from the pool, not reservation of swap space (huge pages
cannot be swapped).

The rough behavior is [1]:
a) !Hugetlbfs:

     1) Without MAP_NORESERVE *or* with memory overcommit under Linux
        disabled ("/proc/sys/vm/overcommit_memory == 2"), the following
        accounting/reservation happens:
         For a file backed map
          SHARED or READ-only - 0 cost (the file is the map not swap)
          PRIVATE WRITABLE - size of mapping per instance

         For an anonymous or /dev/zero map
          SHARED   - size of mapping
          PRIVATE READ-only - 0 cost (but of little use)
          PRIVATE WRITABLE - size of mapping per instance

     2) With MAP_NORESERVE, no accounting/reservation happens.

b) Hugetlbfs:

     1) Without MAP_NORESERVE, huge pages are reserved.

     2) With MAP_NORESERVE, no huge pages are reserved.

Note: With "/proc/sys/vm/overcommit_memory == 0", we were already able
to configure it for !hugetlbfs globally; this toggle now allows
configuring it more fine-grained, not for the whole system.
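As a minimal illustration (a sketch, not part of this patch), the difference boils down to the mmap() flags; MAP_HUGETLB stands in here for a hugetlbfs-backed mapping:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
        const size_t size = 1ULL * 1024 * 1024 * 1024; /* 1 GiB */

        /* Accounted: commit/swap space is reserved up front, so (modulo the
         * overcommit settings discussed above) later writes won't fail for
         * lack of commit space. */
        void *reserved = mmap(NULL, size, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Typically not accounted (except with overcommit_memory == 2):
         * mmap() succeeds even when memory is scarce, but a later write can
         * SIGSEGV/OOM if no memory is available. */
        void *noreserve = mmap(NULL, size, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                               -1, 0);

        /* hugetlb: MAP_NORESERVE skips reserving huge pages from the pool. */
        void *huge = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                          MAP_NORESERVE, -1, 0);

        if (reserved == MAP_FAILED || noreserve == MAP_FAILED ||
            huge == MAP_FAILED) {
            perror("mmap");
            return EXIT_FAILURE;
        }
        return EXIT_SUCCESS;
    }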

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM.

Can you explain this use case in more real world terms, as I'm not
understanding what a mgmt app would actually do with this in
practice ?

Let's consider huge pages for simplicity. Assume you have 128 free huge
pages in your hypervisor that you want to dynamically assign to VMs.

Further assume you have two VMs running. A workflow could look like

1. Assign all huge pages to VM 0
2. Reassign 64 huge pages to VM 1
3. Reassign another 32 huge pages to VM 1
4. Reassign 16 huge pages to VM 0
5. ...

Basically what we're used to doing with "ordinary" memory.

What does this look like in terms of the memory backend configuration
when you boot VM 0 and VM 1 ?

Are you saying that we boot both VMs with

     -object memory-backend-memfd,size=128G,hugetlb=yes,hugetlbsize=1G,reserve=off

and then we have another property set on 'virtio-mem' to tell it
how much/little of that 128 G, to actually give to the guest ?
How do we change that at runtime ?

Roughly, yes. We only special-case memory backends managed by virtio-mem 
devices.

An advanced example for a single VM could look like this:

sudo build/qemu-system-x86_64 \
        ... \
        -m 4G,maxmem=64G \
        -smp sockets=2,cores=2 \
        -object memory-backend-memfd,id=bmem0,size=2G,hugetlb=yes,hugetlbsize=2M \
        -numa node,nodeid=0,cpus=0-1,memdev=bmem0 \
        -object memory-backend-memfd,id=bmem1,size=2G,hugetlb=yes,hugetlbsize=2M \
        -numa node,nodeid=1,cpus=2-3,memdev=bmem1 \
        ... \
        -object memory-backend-memfd,id=mem0,size=30G,hugetlb=yes,hugetlbsize=2M,reserve=off \
        -device virtio-mem-pci,id=vmem0,memdev=mem0,node=0,requested-size=0G \
        -object memory-backend-memfd,id=mem1,size=30G,hugetlb=yes,hugetlbsize=2M,reserve=off \
        -device virtio-mem-pci,id=vmem1,memdev=mem1,node=1,requested-size=0G \
        ... \

We can request a size change by adjusting the "requested-size" property (e.g., via qom-set) and observe the current size by reading the "size" property (e.g., via qom-get). Think of it as an advanced, device-local memory balloon mixed with the concept of memory hotplug.
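For example (a rough sketch, assuming the vmem0 device id from the command line above; the same works via QMP), growing the device and then checking how much the guest has actually plugged so far:

    (qemu) qom-set vmem0 requested-size 16G
    (qemu) qom-get vmem0 size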

Ok, so in this example, the initial 4 GB of RAM has the normal reserve=on behaviour, so if there are insufficient hugepages we'll see the startup failure IIUC.

Yes, except in some NUMA configurations: huge page reservation isn't NUMA-aware, so even with a reservation there are cases where we can still run out of applicable free huge pages. Usually we end up preallocating all memory in the memory backend just to be on the safe side.
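For illustration, that preallocation is just the existing "prealloc" backend property, e.g. something like:

    -object memory-backend-memfd,id=bmem0,size=2G,hugetlb=yes,hugetlbsize=2M,prealloc=yes \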


What happens when we set qom-set requested-size=10GB at runtime, but there
are only 8 GB of hugepages left available ?

This is one of the user errors that will be tackled next by dynamic preallocation (and/or reservation) inside virtio-mem.

Once the guest actually touches more than 8 GiB, we run out of free huge pages; if huge page overcommit isn't enabled (or huge page overcommit fails the allocation, which can happen easily), we'd essentially crash the VM.

It's pretty much like messing up memory overcommit with "ordinary" memory and getting your VM killed by the OOM killer.

The solution is fairly easy: preallocate huge pages when resizing the virtio-mem device (making new huge pages available to the VM in this case).

In the simplest case this can be done using fallocate(). If you're interested in the dirty details where it's not that easy, take a look at my MADV_POPULATE_READ/MADV_POPULATE_WRITE kernel series [1]. Marek is working on handling virtio-mem devices via an iothread, so we can easily do the preallocation "concurrently" while the VM is running, avoiding holding the BQL for a long time.

[1] https://lkml.kernel.org/r/20210419135443.12822-1-david@redhat.com
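For the simple fallocate() case, a rough sketch (not QEMU's actual implementation; fd, offset and size are hypothetical placeholders for the backend's hugetlb memfd and the newly plugged range):

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    /* Preallocate backing huge pages for [offset, offset + size) of a
     * hugetlb-backed memfd, so later guest accesses cannot run out of
     * free huge pages. */
    static int preallocate_range(int fd, off_t offset, off_t size)
    {
        /* mode 0: allocate (zeroed) backing pages now; fails if the
         * hugetlb pool cannot supply enough free huge pages. */
        if (fallocate(fd, 0, offset, size) != 0) {
            fprintf(stderr, "preallocation failed: %s\n", strerror(errno));
            return -errno;
        }
        return 0;
    }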

--
Thanks,

David / dhildenb



