Re: [Qemu-devel] [RFC PATCH qemu 0/4] memory: Reduce memory use
From: Alexey Kardashevskiy
Subject: Re: [Qemu-devel] [RFC PATCH qemu 0/4] memory: Reduce memory use
Date: Fri, 8 Sep 2017 14:04:15 +1000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.1
On 08/09/17 12:08, Alexey Kardashevskiy wrote:
> On 08/09/17 00:54, Dr. David Alan Gilbert wrote:
>> * Alexey Kardashevskiy (address@hidden) wrote:
>>> On 07/09/17 19:51, Dr. David Alan Gilbert wrote:
>>>> * Alexey Kardashevskiy (address@hidden) wrote:
>>>>> This was inspired by https://bugzilla.redhat.com/show_bug.cgi?id=1481593
>>>>>
>>>>> What happens here is that every virtio block device creates 2 address
>>>>> spaces - for modern config space (called "virtio-pci-cfg-as") and
>>>>> for busmaster (common pci thing, called after the device name,
>>>>> in my case "virtio-blk-pci").
>>>>>
>>>>> Each address_space_init() updates topology for every address space.
>>>>> Every topology update (address_space_update_topology()) creates a new
>>>>> dispatch tree - AddressSpaceDispatch with nodes (1KB) and
>>>>> sections (48KB) and destroys the old one.
>>>>>
>>>>> However the dispatch destructor is postponed via RCU, which does not
>>>>> get a chance to execute until the machine is initialized; until then,
>>>>> memory is not returned to the pool, and the amount of memory in flight
>>>>> grows as n^2.
>>>>>
>>>>> These patches try to address the memory use and boot time
>>>>> issues, but tbh only the first one provides a visible outcome.
>>>>
>>>> Do you have a feel for how much memory is saved?
>>>
>>>
>>> The 1/4 saves ~33GB (~44GB -> ~11GB) for a 2GB guest and 400 virtio-pci
>>> devices. These GB figures are the peak values (not that it matters to the
>>> OOM killer); memory gets released in one go when RCU kicks in, it just
>>> happens too late.
>>
>> Nice saving! Still, why is it using 11GB?
>
> Yet to be discovered :) Not clear at the moment.
>
>
>> What's it like for more sane configurations, say 2-3 virtio devices - is
>> there anything noticable or is it just the huge setups?
>>
>> Dave
>>
>>
>>> The 3/4 saves less, I'd say 50KB per VCPU (more if you count peaks, but
>>> not by much). Strangely, I do not see a difference in valgrind output when
>>> I run a guest with 1024 or just 8 CPUs; probably "massif" is not the right
>>> tool to catch this.
>
> I did some more tests.
>
> v2.10:
> 1024 CPUs, no virtio: 0:47 490.8MB 38/34
> 1 CPU, 500 virtio-block: 5:03 59.69GB 2354438/3
>
> 1/4 applied:
> 1024 CPUs, no virtio: 0:49 490.8MB 38/34
> 1 CPU, 500 virtio-block: 1:57 17.74GB 2186/3
>
> 3/4 applied:
> 1024 CPUs, no virtio: 0:53 491.1MB 20/17
> 1 CPU, 500 virtio-block: 2:01 17.7GB 2167/0
>
>
> Time is what it takes to start QEMU with -S and then Q-Ax.
>
> Memory amount is peak use from valgrind massif.
>
> Last 2 numbers - "38/34" for example - 38 is the number of g_new(FlatView,
> 1), 34 is the number of g_free(view); the numbers are printed at
> https://git.qemu.org/?p=qemu.git;a=blob;f=vl.c;h=8e247cc2a239ae8fb3d3cdf6d4ee78fd723d1053;hb=1ab5eb4efb91a3d4569b0df6e824cc08ab4bd8ec#l4666
> before RCU kicks in.
>
> 500 virtio-block + bridges use around 1100 address spaces.
Numbers are quite wrong for the [3/4] patch as
qemu_run_machine_init_done_notifiers() does the same dance all over again.
I can bring peak memory use down to 3.75GB (yay!) for "1 CPU, 500
virtio-block" with the fix below. It belongs in [1/4], and we might want to
just extend that begin+commit to cover both device creation and notifier
initialization.
My guess is that the general problem is that memory_region_transaction_begin()
and memory_region_transaction_commit() update all address spaces (or flat
views after [3/4] is applied) while they do not really need to - they could
receive a hint (an AS? a flatview?) and update only the respective bits. I
just do not see an easy way to provide such a hint other than storing an AS
pointer in an MR.
In my test, I have 500 devices and 2 layers of PCI bridges. For every PCI
device probed by SLOF (our boot firmware), _all_ flatviews are rebuilt 8
times (at least; sometimes more, like 20, when the next bridge gets
activated, dunno):
- config space writes at 0x20 0x24 0x28 0x2c 0x1c - we end up in
pci_bridge_update_mappings();
- config space writes at 0x4 for a device and a bridge - pci_update_mappings();
- memory_region_set_enabled() on d->bus_master_enable_region to enable the
bridge.
Each of these operations re-renders all 1000-ish flatviews and dispatch
trees, making device probing really, really slow. But at least memory is not
leaking or being held by RCU :) And I cannot wrap these config writes into
memory_region_transaction_begin/commit as every config space access comes in
independently from SLOF.
diff --git a/vl.c b/vl.c
index 89fb58c1de..4317ef01b4 100644
--- a/vl.c
+++ b/vl.c
@@ -4754,8 +4754,13 @@ int main(int argc, char **argv, char **envp)
     /* TODO: once all bus devices are qdevified, this should be done
      * when bus is created by qdev.c */
     qemu_register_reset(qbus_reset_all_fn, sysbus_get_default());
+
+    memory_region_transaction_begin();
+
     qemu_run_machine_init_done_notifiers();
 
+    memory_region_transaction_commit();
+
     if (rom_check_and_register_reset() != 0) {
         error_report("rom check and register reset failed");
         exit(1);
>
>
>
>
>
>>>
>>>>
>>>> Dave
>>>>
>>>>> There are still things to polish and double check the use of RCU,
>>>>> I'd like to get any feedback before proceeding - is this going
>>>>> the right way or way too ugly?
>>>>>
>>>>>
>>>>> This is based on sha1
>>>>> 1ab5eb4efb Peter Maydell "Update version for v2.10.0 release".
>>>>>
>>>>> Please comment. Thanks.
>>>>>
>>>>>
>>>>>
>>>>> Alexey Kardashevskiy (4):
>>>>> memory: Postpone flatview and dispatch tree building till all devices
>>>>> are added
>>>>> memory: Prepare for shared flat views
>>>>> memory: Share flat views and dispatch trees between address spaces
>>>>> memory: Add flat views to HMP "info mtree"
>>>>>
>>>>> include/exec/memory-internal.h | 6 +-
>>>>> include/exec/memory.h | 93 +++++++++----
>>>>> exec.c | 242 +++++++++++++++++++--------------
>>>>> hw/alpha/typhoon.c | 2 +-
>>>>> hw/dma/rc4030.c | 4 +-
>>>>> hw/i386/amd_iommu.c | 2 +-
>>>>> hw/i386/intel_iommu.c | 9 +-
>>>>> hw/intc/openpic_kvm.c | 2 +-
>>>>> hw/pci-host/apb.c | 2 +-
>>>>> hw/pci/pci.c | 3 +-
>>>>> hw/ppc/spapr_iommu.c | 4 +-
>>>>> hw/s390x/s390-pci-bus.c | 2 +-
>>>>> hw/vfio/common.c | 6 +-
>>>>> hw/virtio/vhost.c | 6 +-
>>>>> memory.c | 299 +++++++++++++++++++++++++++--------------
>>>>> monitor.c | 3 +-
>>>>> vl.c | 4 +
>>>>> hmp-commands-info.hx | 7 +-
>>>>> 18 files changed, 448 insertions(+), 248 deletions(-)
>>>>>
>>>>> --
>>>>> 2.11.0
>>>>>
>>>>>
>>>> --
>>>> Dr. David Alan Gilbert / address@hidden / Manchester, UK
>>>>
>>>
>>>
>>> --
>>> Alexey
>> --
>> Dr. David Alan Gilbert / address@hidden / Manchester, UK
>>
>
>
--
Alexey