On 04.05.21 12:32, Daniel P. Berrangé wrote:
On Tue, May 04, 2021 at 12:21:25PM +0200, David Hildenbrand wrote:
On 04.05.21 12:09, Daniel P. Berrangé wrote:
On Wed, Apr 28, 2021 at 03:37:48PM +0200, David Hildenbrand wrote:
Let's support RAM_NORESERVE via MAP_NORESERVE on Linux. The flag has no
effect on most shared mappings - except for hugetlbfs and anonymous memory.
Linux man page:
"MAP_NORESERVE: Do not reserve swap space for this mapping. When swap
space is reserved, one has the guarantee that it is possible to modify
the mapping. When swap space is not reserved one might get SIGSEGV
upon a write if no physical memory is available. See also the discussion
of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before
2.6, this flag had effect only for private writable mappings."
Note that the "guarantee" part is wrong with memory overcommit in Linux.
Also, in Linux hugetlbfs is treated differently - we configure reservation
of huge pages from the pool, not reservation of swap space (huge pages
cannot be swapped).
The rough behavior is [1]:
a) !Hugetlbfs:
1) Without MAP_NORESERVE *or* with memory overcommit under Linux
disabled ("/proc/sys/vm/overcommit_memory == 2"), the following
accounting/reservation happens:
For a file backed map
SHARED or READ-only - 0 cost (the file is the map not swap)
PRIVATE WRITABLE - size of mapping per instance
For an anonymous or /dev/zero map
SHARED - size of mapping
PRIVATE READ-only - 0 cost (but of little use)
PRIVATE WRITABLE - size of mapping per instance
2) With MAP_NORESERVE, no accounting/reservation happens.
b) Hugetlbfs:
1) Without MAP_NORESERVE, huge pages are reserved.
2) With MAP_NORESERVE, no huge pages are reserved.
Note: With "/proc/sys/vm/overcommit_memory == 0", we were already able
to configure it for !hugetlbfs globally; this toggle now allows
configuring it at a finer granularity, rather than for the whole system.
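To make the !hugetlbfs case above concrete, here is a minimal sketch of
what requesting MAP_NORESERVE looks like at the mmap() level. This is not
the code from the patch; the helper name map_anon() is made up for
illustration, and only private anonymous memory is covered.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not part of QEMU): map `size` bytes of private,
 * writable anonymous memory. If `noreserve` is non-zero, MAP_NORESERVE is
 * passed so the kernel is asked not to account/reserve for the mapping
 * up front. Returns NULL on failure.
 */
void *map_anon(size_t size, int noreserve)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    void *p;

    assert(size > 0);
    if (noreserve) {
        flags |= MAP_NORESERVE;
    }
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

A caller would map a large, sparse area with map_anon(size, 1), touch only
the pages it needs, and release it with munmap(). As described above, with
"/proc/sys/vm/overcommit_memory == 2" the flag makes no difference for
such a mapping.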
The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM.
Can you explain this use case in more real-world terms? I don't
understand what a mgmt app would actually do with this in
practice.
Let's consider huge pages for simplicity. Assume you have 128 free huge
pages in your hypervisor that you want to dynamically assign to VMs.
Further assume you have two VMs running. A workflow could look like:
1. Assign all huge pages to VM 0
2. Reassign 64 huge pages to VM 1
3. Reassign another 32 huge pages to VM 1
4. Reassign 16 huge pages to VM 0
5. ...
Basically what we're used to doing with "ordinary" memory.
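The arithmetic of that workflow can be sketched as simple bookkeeping.
This is purely illustrative: struct hp_pool and hp_reassign() are
hypothetical names, and a real mgmt app would resize the VMs' virtio-mem
devices rather than update counters.

```c
#include <assert.h>

#define NUM_VMS 2

/* Hypothetical bookkeeping: huge pages currently assigned to each VM. */
struct hp_pool {
    unsigned long assigned[NUM_VMS];
};

/*
 * Move `count` huge pages from VM `from` to VM `to`.
 * Returns 0 on success, -1 if `from` does not hold that many pages.
 */
int hp_reassign(struct hp_pool *pool, int from, int to, unsigned long count)
{
    assert(from >= 0 && from < NUM_VMS);
    assert(to >= 0 && to < NUM_VMS);

    if (pool->assigned[from] < count) {
        return -1;
    }
    pool->assigned[from] -= count;
    pool->assigned[to] += count;
    return 0;
}
```

Starting with all 128 huge pages on VM 0 (step 1) and applying steps 2-4,
VM 0 ends up with 48 huge pages and VM 1 with 80; the total stays at 128.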
What does this look like in terms of the memory backend configuration
when you boot VM 0 and VM 1?
Are you saying that we boot both VMs with
-object hostmem-memfd,size=128G,hugetlb=yes,hugetlbsize=1G,reserve=off
and then we have another property set on 'virtio-mem' to tell it
how much/little of that 128G to actually give to the guest?
How do we change that at runtime?
Roughly, yes. We only special-case memory backends managed by virtio-mem
devices.
An advanced example for a single VM could look like this:
sudo build/qemu-system-x86_64 \
... \
-m 4G,maxmem=64G \
-smp sockets=2,cores=2 \
-object hostmem-memfd,id=bmem0,size=2G,hugetlb=yes,hugetlbsize=2M \
-numa node,nodeid=0,cpus=0-1,memdev=bmem0 \
-object hostmem-memfd,id=bmem1,size=2G,hugetlb=yes,hugetlbsize=2M \
-numa node,nodeid=1,cpus=2-3,memdev=bmem1 \
... \
-object hostmem-memfd,id=mem0,size=30G,hugetlb=yes,hugetlbsize=2M,reserve=off \
-device virtio-mem-pci,id=vmem0,memdev=mem0,node=0,requested-size=0G \
-object hostmem-memfd,id=mem1,size=30G,hugetlb=yes,hugetlbsize=2M,reserve=off \
-device virtio-mem-pci,id=vmem1,memdev=mem1,node=1,requested-size=0G \
... \
We can request a size change by adjusting the "requested-size" property
(e.g., via qom-set) and observe the current size by reading the "size"
property (e.g., via qom-get). Think of it as an advanced device-local
memory balloon mixed with the concept of memory hotplug.
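For a mgmt app talking QMP, the resize request boils down to a qom-set
command. Below is a rough sketch that only formats the JSON; the function
name is made up, sending it over the QMP socket is left out, and the path
is assumed not to need JSON escaping.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a QMP "qom-set" command that sets the
 * "requested-size" property (in bytes) of the object at `qom_path`.
 * Returns the number of characters written (excluding the trailing NUL),
 * or -1 if the buffer is too small. `qom_path` is not JSON-escaped.
 */
int format_requested_size_cmd(char *buf, size_t len, const char *qom_path,
                              uint64_t bytes)
{
    int n;

    assert(buf != NULL && qom_path != NULL);
    n = snprintf(buf, len,
                 "{\"execute\": \"qom-set\", \"arguments\": {"
                 "\"path\": \"%s\", \"property\": \"requested-size\", "
                 "\"value\": %" PRIu64 "}}",
                 qom_path, bytes);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

With the command line above, asking vmem0 to grow to 1G would mean calling
it with the path "/machine/peripheral/vmem0" and a value of 1073741824.
The guest may not be able to satisfy the request immediately, so the app
would read back "size" via qom-get to see how far the resize has gotten.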