Re: [Qemu-devel] [RFC v3 06/15] hw/arm/virt: Allocate device_memory
From: Auger Eric
Subject: Re: [Qemu-devel] [RFC v3 06/15] hw/arm/virt: Allocate device_memory
Date: Thu, 12 Jul 2018 16:53:01 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0
Hi Drew,
On 07/12/2018 04:45 PM, Andrew Jones wrote:
> On Thu, Jul 12, 2018 at 04:22:05PM +0200, Auger Eric wrote:
>> Hi Igor,
>>
>> On 07/11/2018 03:17 PM, Igor Mammedov wrote:
>>> On Thu, 5 Jul 2018 16:27:05 +0200
>>> Auger Eric <address@hidden> wrote:
>>>
>>>> Hi Shameer,
>>>>
>>>> On 07/05/2018 03:19 PM, Shameerali Kolothum Thodi wrote:
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Auger Eric [mailto:address@hidden
>>>>>> Sent: 05 July 2018 13:18
>>>>>> To: David Hildenbrand <address@hidden>; address@hidden;
>>>>>> address@hidden; address@hidden; address@hidden;
>>>>>> Shameerali Kolothum Thodi <address@hidden>;
>>>>>> address@hidden
>>>>>> Cc: address@hidden; address@hidden; address@hidden;
>>>>>> address@hidden; address@hidden
>>>>>> Subject: Re: [Qemu-devel] [RFC v3 06/15] hw/arm/virt: Allocate
>>>>>> device_memory
>>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> On 07/05/2018 02:09 PM, David Hildenbrand wrote:
>>>>>>> On 05.07.2018 14:00, Auger Eric wrote:
>>>>>>>> Hi David,
>>>>>>>>
>>>>>>>> On 07/05/2018 01:54 PM, David Hildenbrand wrote:
>>>>>>>>> On 05.07.2018 13:42, Auger Eric wrote:
>>>>>>>>>> Hi David,
>>>>>>>>>>
>>>>>>>>>> On 07/04/2018 02:05 PM, David Hildenbrand wrote:
>>>>>>>>>>> On 03.07.2018 21:27, Auger Eric wrote:
>>>>>>>>>>>> Hi David,
>>>>>>>>>>>> On 07/03/2018 08:25 PM, David Hildenbrand wrote:
>>>>>>>>>>>>> On 03.07.2018 09:19, Eric Auger wrote:
>>>>>>>>>>>>>> We define a new hotpluggable RAM region (aka. device memory).
>>>>>>>>>>>>>> Its base is 2TB GPA. This obviously requires 42b IPA support
>>>>>>>>>>>>>> in KVM/ARM, FW and guest kernel. At the moment the device
>>>>>>>>>>>>>> memory region is max 2TB.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe a stupid question, but why exactly does it have to start at 2TB
>>>>>>>>>>>>> (and not e.g. at 1TB)?
>>>>>>>>>>>> not a stupid question. See tentative answer below.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is largely inspired by device memory initialization in
>>>>>>>>>>>>>> pc machine code.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Eric Auger <address@hidden>
>>>>>>>>>>>>>> Signed-off-by: Kwangwoo Lee <address@hidden>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>> hw/arm/virt.c | 104 ++++++++++++++++++++++++++++++++++++--------------
>>>>>>>>>>>>>> include/hw/arm/arm.h | 2 +
>>>>>>>>>>>>>> include/hw/arm/virt.h | 1 +
>>>>>>>>>>>>>> 3 files changed, 79 insertions(+), 28 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
>>>>>>>>>>>>>> index 5a4d0bf..6fefb78 100644
>>>>>>>>>>>>>> --- a/hw/arm/virt.c
>>>>>>>>>>>>>> +++ b/hw/arm/virt.c
>>>>>>>>>>>>>> @@ -59,6 +59,7 @@
>>>>>>>>>>>>>> #include "qapi/visitor.h"
>>>>>>>>>>>>>> #include "standard-headers/linux/input.h"
>>>>>>>>>>>>>> #include "hw/arm/smmuv3.h"
>>>>>>>>>>>>>> +#include "hw/acpi/acpi.h"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #define DEFINE_VIRT_MACHINE_LATEST(major, minor, latest) \
>>>>>>>>>>>>>> static void virt_##major##_##minor##_class_init(ObjectClass *oc, \
>>>>>>>>>>>>>> @@ -94,34 +95,25 @@
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #define PLATFORM_BUS_NUM_IRQS 64
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -/* RAM limit in GB. Since VIRT_MEM starts at the 1GB mark, this means
>>>>>>>>>>>>>> - * RAM can go up to the 256GB mark, leaving 256GB of the physical
>>>>>>>>>>>>>> - * address space unallocated and free for future use between 256G and 512G.
>>>>>>>>>>>>>> - * If we need to provide more RAM to VMs in the future then we need to:
>>>>>>>>>>>>>> - * * allocate a second bank of RAM starting at 2TB and working up
>>>>>>>>>>>> I acknowledge this comment was the main justification. Now if you look at
>>>>>>>>>>>>
>>>>>>>>>>>> Principles of ARM Memory Maps
>>>>>>>>>>>> http://infocenter.arm.com/help/topic/com.arm.doc.den0001c/DEN0001C_principles_of_arm_memory_maps.pdf
>>>>>>>>>>>>
>>>>>>>>>>>> chapter 2.3, you will find that when adding PA bits, you always leave
>>>>>>>>>>>> space for reserved space and mapped IO.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the pointer!
>>>>>>>>>>>
>>>>>>>>>>> So ... we can fit
>>>>>>>>>>>
>>>>>>>>>>> a) 2GB at 2GB
>>>>>>>>>>> b) 32GB at 32GB
>>>>>>>>>>> c) 512GB at 512GB
>>>>>>>>>>> d) 8TB at 8TB
>>>>>>>>>>> e) 128TB at 128TB
>>>>>>>>>>>
>>>>>>>>>>> (this is a nice rule of thumb if I understand it correctly :) )
>>>>>>>>>>>
>>>>>>>>>>> We should strive for device memory (maxram_size - ram_size) to fit
>>>>>>>>>>> exactly into one of these slots (otherwise things get nasty).
>>>>>>>>>>>
>>>>>>>>>>> Depending on the ram_size, we might have simpler setups and can support
>>>>>>>>>>> more configurations, no?
>>>>>>>>>>>
>>>>>>>>>>> E.g. ram_size <= 34GB, device_memory <= 512GB
>>>>>>>>>>> -> move ram into a) and b)
>>>>>>>>>>> -> move device memory into c)
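David's "size at size" rule of thumb (a RAM bank of capacity S sits at base S, per the DEN0001C principles quoted earlier) can be sketched as a small helper. This is an illustration only; the function and macro names are invented, not QEMU code:

```c
#include <stdint.h>
#include <stddef.h>

#define GIB (1ULL << 30)
#define TIB (1ULL << 40)

/* The five slots listed in the mail: 2GB@2GB, 32GB@32GB, 512GB@512GB,
 * 8TB@8TB, 128TB@128TB.  In this scheme a slot's capacity equals its base. */
static const uint64_t slot_base[] = {
    2 * GIB, 32 * GIB, 512 * GIB, 8 * TIB, 128 * TIB,
};

/* Return the base of the smallest slot that can hold @size bytes,
 * or 0 if even the 128TB slot is too small. */
uint64_t pick_slot_base(uint64_t size)
{
    for (size_t i = 0; i < sizeof(slot_base) / sizeof(slot_base[0]); i++) {
        if (size <= slot_base[i]) {
            return slot_base[i];
        }
    }
    return 0;
}
```

So a 100GB device_memory region would land in the 512GB-at-512GB slot, matching David's example of moving device memory into c).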
>>>>>>>>>>
>>>>>>>>>> The issue is machvirt doesn't comply with that document.
>>>>>>>>>> At the moment we have
>>>>>>>>>> 0 -> 1GB MMIO
>>>>>>>>>> 1GB -> 256GB RAM
>>>>>>>>>> 256GB -> 512GB is theoretically reserved for IO but most is free.
>>>>>>>>>> 512GB -> 1T is reserved for ECAM MMIO range. This is the top of our
>>>>>>>>>> existing 40b GPA space.
>>>>>>>>>>
>>>>>>>>>> We don't want to change this address map due to legacy reasons.
>>>>>>>>>>
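The current machvirt layout Eric lists above can be summarized as a lookup table. This is a toy rendering for illustration; the real map is the memmap array in hw/arm/virt.c, and the enum names here are invented:

```c
#include <stdint.h>

typedef enum { VIRT_MMIO, VIRT_RAM, VIRT_FREE_IO, VIRT_ECAM, VIRT_OUT } Area;

/* Classify a GPA according to the legacy machvirt map quoted above. */
Area virt_area(uint64_t gpa)
{
    const uint64_t GIB = 1ULL << 30;

    if (gpa < 1 * GIB)    return VIRT_MMIO;    /* 0     .. 1GB   : MMIO       */
    if (gpa < 256 * GIB)  return VIRT_RAM;     /* 1GB   .. 256GB : RAM        */
    if (gpa < 512 * GIB)  return VIRT_FREE_IO; /* 256GB .. 512GB : IO, mostly free */
    if (gpa < 1024 * GIB) return VIRT_ECAM;    /* 512GB .. 1TB   : ECAM MMIO  */
    return VIRT_OUT;                           /* beyond the 40b GPA space    */
}
```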
>>>>>>>>>
>>>>>>>>> Thanks, good to know!
>>>>>>>>>
>>>>>>>>>> Another question! do you know if it would be possible to have
>>>>>>>>>> device_memory region split into several discontinuous segments?
>>>>>>>>>
>>>>>>>>> It can be implemented for sure, but I would try to avoid that, as it
>>>>>>>>> makes certain configurations impossible (and very end user
>>>>>>>>> unfriendly).
>>>>>>>>>
>>>>>>>>> E.g. (numbers completely made up, but it should show what I mean)
>>>>>>>>>
>>>>>>>>> -m 20G,maxmem=120G:
>>>>>>>>> -> Try to add a DIMM with 100G -> error.
>>>>>>>>> -> But we can add e.g. two DIMMs with 40G and 60G.
>>>>>>>>>
>>>>>>>>> This exposes internal details to the end user. And the end user has no
>>>>>>>>> idea what is going on.
>>>>>>>>>
>>>>>>>>> So I think we should try our best to keep that area consecutive.
>>>>>>>>
>>>>>>>> Actually I didn't sufficiently detail the context. I would like
>>>>>>>> 1) 1 segment to be exposed to the end-user through slot|maxmem stuff
>>>>>>>> (what this series targets) and
>>>>>>>> 2) another segment used to instantiate PC-DIMM for internal needs as
>>>>>>>> replacement of part of the 1GB -> 256GB static RAM. This was the
>>>>>>>> purpose
>>>>>>>> of Shameer's original series
>>>>>>>
>>>>>>> I am not sure if PC-DIMMs are exactly what you want for internal
>>>>>>> purposes.
>>>>>>>
>>>>>>>>
>>>>>>>> [1] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions
>>>>>>>> http://patchwork.ozlabs.org/cover/914694/
>>>>>>>> This approach is not yet validated though.
>>>>>>>>
>>>>>>>> The rationale is sometimes you must have "holes" in RAM as some GPAs
>>>>>>>> match reserved IOVAs for assigned devices.
>>>>>>>
>>>>>>> So if I understand it correctly, all you want is some memory region that
>>>>>>> a) contains only initially defined memory
>>>>>>> b) can have some holes in it
>>>>>>>
>>>>>>> This is exactly what x86 already does (pc_memory_init): Simply construct
>>>>>>> your own memory region leaving holes in it.
>>>>>>>
>>>>>>>
>>>>>>> memory_region_init_alias(ram_below_4g, NULL, "ram-below-4g", ram,
>>>>>>>                          0, pcms->below_4g_mem_size);
>>>>>>> memory_region_add_subregion(system_memory, 0, ram_below_4g);
>>>>>>> ...
>>>>>>> if (pcms->above_4g_mem_size > 0)
>>>>>>>     memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g", ram,
>>>>>>>     ...
>>>>>>> memory_region_add_subregion(system_memory, 0x100000000ULL,
>>>>>>> ...
>>>>>>>
>>>>>>> They "indicate" these different GPA areas using the e820 map to the
>>>>>>> guest.
>>>>>>>
>>>>>>> Would that also work for you?
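The alias trick quoted above boils down to exposing one backing RAM block through two GPA windows with a hole between them, the way pc_memory_init() splits RAM below/above 4G. A toy model of the resulting nonlinear GPA-to-backing-offset translation (not QEMU code; all names and the example split are invented):

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t gpa_base;    /* where the alias is mapped in guest PA space */
    uint64_t backing_off; /* offset into the single backing block */
    uint64_t size;
} AliasWindow;

/* Translate a GPA to an offset in the backing block; returns false if
 * the GPA falls in the hole (or outside both windows). */
bool gpa_to_backing(const AliasWindow *w, int n, uint64_t gpa, uint64_t *off)
{
    for (int i = 0; i < n; i++) {
        if (gpa >= w[i].gpa_base && gpa - w[i].gpa_base < w[i].size) {
            *off = w[i].backing_off + (gpa - w[i].gpa_base);
            return true;
        }
    }
    return false;
}

/* Example layout: 3GB mapped at 0, the rest at 4GB, hole at 3GB..4GB.
 * Returns (uint64_t)-1 for GPAs inside the hole. */
uint64_t example_backing_off(uint64_t gpa)
{
    static const AliasWindow w[2] = {
        { 0,          0,          3ULL << 30 },
        { 4ULL << 30, 3ULL << 30, 1ULL << 30 },
    };
    uint64_t off;

    return gpa_to_backing(w, 2, gpa, &off) ? off : (uint64_t)-1;
}
```

Note the contrast with a pc-dimm: there the translation is a single (addr + offset), whereas here every lookup has to scan the window list.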
>>>>>>
>>>>>> I would tentatively say yes. Effectively I am not sure that, if we were
>>>>>> to actually put holes in the 1GB-256GB RAM segment, PC-DIMM would be the
>>>>>> natural choice. Also the reserved IOVA issue impacts the device_memory
>>>>>> region area, I think. I am skeptical that we can put holes in the
>>>>>> static RAM and device_memory regions like that.
>>> Could we just use a single device_memory region for both initial and
>>> hotpluggable RAM if we make the base RAM address dynamic?
>> This assumes the FW supports a dynamic RAM base. If I understand
>> correctly this is not the case.
>
> It's not currently the case, but I've added prototyping this near the top
> of my TODO. So stay tuned.
ok
>
>> Also there is the problem of migration. How would you migrate between
>> guests whose RAM is not laid out at the same place?
>
> I'm not sure what you mean here. Boot a guest with a new memory map,
> probably by explicitly asking for it with a new machine property,
> which means a new virt machine version. Then migrate at will to any
> host that supports that machine type.
My concern was rather about holes in the memory map matching reserved
regions.
>
>> I understood hotplug memory relied on a specific device_memory
>> region. So do you mean we would have 2 contiguous regions?
>
> I think Igor wants one contiguous region for RAM, where additional
> space can be reserved for hotplugging.
This is not compliant with the 2012 ARM white paper, although I don't
really know whether that document truly is a reference (I did not get any
reply).
Thanks
Eric
>
>>> In this case RAM could start wherever there is a free space for maxmem
>>> (if there is free space in lowmem, put device_memory there, otherwise put
>>> it somewhere in high mem) and we won't care if there is IOVA or not.
>>>
>>> *I don't have a clue about iommus, so here goes a stupid question*
>>> I agree with Peter that the whole IOVA thing looks broken, when the
>>> host layout dictates the guest's one.
>>> Shouldn't there be an iommu somewhere that would remap the host map into
>>> a guest-specific one? (so the guest would model the board we need and be
>>> migratable, instead of mimicking host hw)
>> The issue is related to IOVAs programmed by the guest into the host's
>> assigned devices. DMA requests issued by the assigned devices using
>> those IOVAs are supposed to reach guest RAM. But due to the host
>> topology they won't (host PCI host bridge windows or MSI reserved
>> regions). Adding a vIOMMU on the guest side effectively allows using
>> IOVAs != GPAs, but the guest is exposed to that change. Adding this
>> extra translation stage incurs a huge performance penalty. And
>> eventually the IOVA allocator used in the guest should theoretically be
>> aware of the host reserved regions as well.
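For illustration, the constraint described here (guest RAM GPAs, used directly as IOVAs by an assigned device, must avoid the host's reserved regions) reduces to an interval-overlap check. The function names and example region values below are made up, not from the series:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint64_t base, size; } Region;

/* Half-open interval overlap: [a.base, a.base+a.size) vs b. */
static bool regions_overlap(Region a, Region b)
{
    return a.base < b.base + b.size && b.base < a.base + a.size;
}

/* Does a candidate guest RAM region avoid every host reserved IOVA region
 * (PCI host bridge windows, MSI doorbells, ...)? */
bool ram_clear_of_reserved(Region ram, const Region *rsv, int n)
{
    for (int i = 0; i < n; i++) {
        if (regions_overlap(ram, rsv[i])) {
            return false;
        }
    }
    return true;
}

/* Example host: a 1MB MSI doorbell at 8MB and a bridge window at 3GB
 * (made-up values). */
bool example_ram_ok(uint64_t base, uint64_t size)
{
    static const Region rsv[] = {
        { 0x00800000, 0x00100000 },
        { 0xc0000000, 0x40000000 },
    };
    Region ram = { base, size };

    return ram_clear_of_reserved(ram, rsv, 2);
}
```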
>>
>>>
>>>
>>>>> The first approach [1] we had to address the holes in memory was using
>>>>> the memory alias way mentioned above. And based on Drew's review, the
>>>>> pc-dimm way of handling was introduced. I think the main argument was
>>>>> that it would be useful when we eventually support hotplug.
>>>>
>>>> That's my understanding too.
>>> not only hotplug,
>>>
>>> a RAM memory region that's split by aliases is difficult to handle,
>>> as it creates a nonlinear GPA<->HVA mapping instead of the
>>> 1:1 mapping of pc-dimm,
>>> so if one needs to build an HVA<->GPA map for a given MemoryRegion,
>>> in the case of aliases one would have to get the list of MemorySections
>>> that belong to it and build the map from that, vs (addr + offset) in
>>> the case of a simple 1:1 mapping.
>>>
>>> It also complicates machine-specific SRAT/e820 code due to holes
>>> (grep 'the memory map is a bit tricky').
>>>
>>>>> But since that is added anyway as part of this series, I am not sure
>>>>> we have any other benefit in modeling it as pc-dimm. Maybe I am
>>>>> missing something here.
>>>>
>>>> I tentatively agree with you. I was trying to understand if the
>>>> device_memory region was fitting the original needs too but I think
>>>> standard alias approach is more adapted to hole creation.
>>> Aliases are an easy way to start with, but as compat knobs grow
>>> (based on PC experience, grep 'Calculate ram split')
>>> it's quite a pain to maintain a manual, implicit alias layout
>>> without breaking it by accident.
>>> We probably won't be able to get rid of aliases on PC for legacy
>>> reasons, but why introduce the same pain to the virt board?
>>>
>>> Well, a magical conversion from -m X to 2..y memory regions (aliases or
>>> not) isn't going to be easy in either case, especially if one takes
>>> into account "-numa memdev|mem".
>>> I'd rather use a single pc-dimm approach for both initial and
>>> hotpluggable RAM, and then use the device_memory framework to enumerate
>>> RAM wherever needed (ACPI/DT) in a uniform way.
>> We have 2 problems:
>> - supporting more RAM. This can be achieved by adding a single memory
>> region based on DIMMs
>> - managing IOVA reserved regions. I don't think we have a consensus on
>> the solution at the moment. What about migration between 2 guests
>> having a different memory topology?
>
> With a dynamic RAM base (pretty easy to do in QEMU, but requires FW change
> - now on my TODO), then one only needs to pick a contiguous region within
> the guest physical address limits that has the requested size and does not
> overlap any host reserved regions (I think). I'm still not sure what the
> migration concern is.
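Drew's suggestion can be sketched as a tiny first-fit search: with a dynamic RAM base, pick the lowest aligned base below the PA limit whose region avoids the host reserved regions. All names and example values are hypothetical, not from the series; the reserved list is assumed sorted by base and non-overlapping:

```c
#include <stdint.h>

typedef struct { uint64_t base, size; } Region;

/* First-fit: return a base for a @size-byte RAM region below @pa_limit
 * that dodges every reserved region, or 0 if none fits.  @align must be
 * a power of two. */
uint64_t find_ram_base(uint64_t size, uint64_t align,
                       uint64_t pa_limit, const Region *rsv, int n)
{
    uint64_t base = align;                            /* never place RAM at 0 */

    for (int i = 0; i < n; i++) {
        if (base + size <= rsv[i].base) {
            break;                          /* fits before this reserved region */
        }
        if (base < rsv[i].base + rsv[i].size) {
            base = rsv[i].base + rsv[i].size;         /* skip past it */
            base = (base + align - 1) & ~(align - 1); /* re-align */
        }
    }
    return (base + size <= pa_limit) ? base : 0;
}

/* Example: 1GB..2GB and 4GB..8GB reserved, 16GB PA limit, 1GB alignment. */
uint64_t example_base(uint64_t size)
{
    static const Region rsv[] = {
        { 1ULL << 30, 1ULL << 30 },
        { 4ULL << 30, 4ULL << 30 },
    };

    return find_ram_base(size, 1ULL << 30, 16ULL << 30, rsv, 2);
}
```

With this example layout a 1GB region lands at 2GB (the gap between the reserved ranges), a 4GB region is pushed past the second reserved range to 8GB, and a 20GB region cannot fit under the 16GB limit at all.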
>
> Thanks,
> drew
>
>>
>> Thanks
>>
>> Eric
>>>
>>>
>>>> Thanks
>>>>
>>>> Eric
>>>>>
>>>>> Thanks,
>>>>> Shameer
>>>>>
>>>>> [1]. https://lists.gnu.org/archive/html/qemu-arm/2018-04/msg00243.html
>>>>>
>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Eric
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Eric
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Eric
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>
>>>
>>