
Re: [Qemu-devel] [RFC v3 06/15] hw/arm/virt: Allocate device_memory


From: Auger Eric
Subject: Re: [Qemu-devel] [RFC v3 06/15] hw/arm/virt: Allocate device_memory
Date: Thu, 12 Jul 2018 16:53:01 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0

Hi Drew,

On 07/12/2018 04:45 PM, Andrew Jones wrote:
> On Thu, Jul 12, 2018 at 04:22:05PM +0200, Auger Eric wrote:
>> Hi Igor,
>>
>> On 07/11/2018 03:17 PM, Igor Mammedov wrote:
>>> On Thu, 5 Jul 2018 16:27:05 +0200
>>> Auger Eric <address@hidden> wrote:
>>>
>>>> Hi Shameer,
>>>>
>>>> On 07/05/2018 03:19 PM, Shameerali Kolothum Thodi wrote:
>>>>>   
>>>>>> -----Original Message-----
>>>>>> From: Auger Eric [mailto:address@hidden
>>>>>> Sent: 05 July 2018 13:18
>>>>>> To: David Hildenbrand <address@hidden>; address@hidden;
>>>>>> address@hidden; address@hidden; address@hidden;
>>>>>> Shameerali Kolothum Thodi <address@hidden>;
>>>>>> address@hidden
>>>>>> Cc: address@hidden; address@hidden; address@hidden;
>>>>>> address@hidden; address@hidden
>>>>>> Subject: Re: [Qemu-devel] [RFC v3 06/15] hw/arm/virt: Allocate
>>>>>> device_memory
>>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> On 07/05/2018 02:09 PM, David Hildenbrand wrote:  
>>>>>>> On 05.07.2018 14:00, Auger Eric wrote:  
>>>>>>>> Hi David,
>>>>>>>>
>>>>>>>> On 07/05/2018 01:54 PM, David Hildenbrand wrote:  
>>>>>>>>> On 05.07.2018 13:42, Auger Eric wrote:  
>>>>>>>>>> Hi David,
>>>>>>>>>>
>>>>>>>>>> On 07/04/2018 02:05 PM, David Hildenbrand wrote:  
>>>>>>>>>>> On 03.07.2018 21:27, Auger Eric wrote:  
>>>>>>>>>>>> Hi David,
>>>>>>>>>>>> On 07/03/2018 08:25 PM, David Hildenbrand wrote:  
>>>>>>>>>>>>> On 03.07.2018 09:19, Eric Auger wrote:  
>>>>>>>>>>>>>> We define a new hotpluggable RAM region (aka. device memory).
>>>>>>>>>>>>>> Its base is 2TB GPA. This obviously requires 42b IPA support
>>>>>>>>>>>>>> in KVM/ARM, FW and guest kernel. At the moment the device
>>>>>>>>>>>>>> memory region is max 2TB.  
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe a stupid question, but why exactly does it have to start at 
>>>>>>>>>>>>> 2TB
>>>>>>>>>>>>> (and not e.g. at 1TB)?  
>>>>>>>>>>>> not a stupid question. See tentative answer below.  
>>>>>>>>>>>>>  
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is largely inspired by the device memory initialization in
>>>>>>>>>>>>>> the pc machine code.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Eric Auger <address@hidden>
>>>>>>>>>>>>>> Signed-off-by: Kwangwoo Lee <address@hidden>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>  hw/arm/virt.c         | 104 ++++++++++++++++++++++++++++++++++++--------------
>>>>>>>>>>>>>>  include/hw/arm/arm.h  |   2 +
>>>>>>>>>>>>>>  include/hw/arm/virt.h |   1 +
>>>>>>>>>>>>>>  3 files changed, 79 insertions(+), 28 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
>>>>>>>>>>>>>> index 5a4d0bf..6fefb78 100644
>>>>>>>>>>>>>> --- a/hw/arm/virt.c
>>>>>>>>>>>>>> +++ b/hw/arm/virt.c
>>>>>>>>>>>>>> @@ -59,6 +59,7 @@
>>>>>>>>>>>>>>  #include "qapi/visitor.h"
>>>>>>>>>>>>>>  #include "standard-headers/linux/input.h"
>>>>>>>>>>>>>>  #include "hw/arm/smmuv3.h"
>>>>>>>>>>>>>> +#include "hw/acpi/acpi.h"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  #define DEFINE_VIRT_MACHINE_LATEST(major, minor, latest) \
>>>>>>>>>>>>>>      static void virt_##major##_##minor##_class_init(ObjectClass *oc, \
>>>>>>>>>>>>>> @@ -94,34 +95,25 @@
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  #define PLATFORM_BUS_NUM_IRQS 64
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -/* RAM limit in GB. Since VIRT_MEM starts at the 1GB mark, this means
>>>>>>>>>>>>>> - * RAM can go up to the 256GB mark, leaving 256GB of the physical
>>>>>>>>>>>>>> - * address space unallocated and free for future use between 256G and 512G.
>>>>>>>>>>>>>> - * If we need to provide more RAM to VMs in the future then we need to:
>>>>>>>>>>>>>> - *  * allocate a second bank of RAM starting at 2TB and working up
>>>>>>>>>>>> I acknowledge this comment was the main justification. Now if you
>>>>>>>>>>>> look at
>>>>>>>>>>>>
>>>>>>>>>>>> Principles of ARM Memory Maps
>>>>>>>>>>>> http://infocenter.arm.com/help/topic/com.arm.doc.den0001c/DEN0001C_principles_of_arm_memory_maps.pdf
>>>>>>>>>>>> chapter 2.3 you will find that when adding PA bits, you always 
>>>>>>>>>>>> leave
>>>>>>>>>>>> space for reserved space and mapped IO.  
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the pointer!
>>>>>>>>>>>
>>>>>>>>>>> So ... we can fit
>>>>>>>>>>>
>>>>>>>>>>> a) 2GB at 2GB
>>>>>>>>>>> b) 32GB at 32GB
>>>>>>>>>>> c) 512GB at 512GB
>>>>>>>>>>> d) 8TB at 8TB
>>>>>>>>>>> e) 128TB at 128TB
>>>>>>>>>>>
>>>>>>>>>>> (this is a nice rule of thumb if I understand it correctly :) )
>>>>>>>>>>>
>>>>>>>>>>> We should strive for device memory (maxram_size - ram_size) to fit
>>>>>>>>>>> exactly into one of these slots (otherwise things get nasty).
>>>>>>>>>>>
>>>>>>>>>>> Depending on the ram_size, we might have simpler setups and can
>>>>>>>>>>> support more configurations, no?
>>>>>>>>>>>
>>>>>>>>>>> E.g. ram_size <= 34GB, device_memory <= 512GB  
>>>>>>>>>>> -> move ram into a) and b)
>>>>>>>>>>> -> move device memory into c)  
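A minimal C sketch of that rule of thumb, purely illustrative (the slot list just transcribes a) through e) above; none of this is QEMU code and the names are made up):

#include <stdint.h>
#include <stdio.h>

#define GiB (1ULL << 30)
#define TiB (1ULL << 40)

/* Slot bases transcribed from a) through e): a bank of size S starts at
 * address S. */
static const uint64_t slots[] = {
    2 * GiB, 32 * GiB, 512 * GiB, 8 * TiB, 128 * TiB,
};

/* Greedily spread 'size' bytes of RAM across the slots, lowest first,
 * printing the base and size of each bank that ends up being used. */
static void place_ram(uint64_t size)
{
    for (size_t i = 0; i < sizeof(slots) / sizeof(slots[0]) && size; i++) {
        uint64_t chunk = size < slots[i] ? size : slots[i];
        printf("bank at 0x%llx, size 0x%llx\n",
               (unsigned long long)slots[i], (unsigned long long)chunk);
        size -= chunk;
    }
}

int main(void)
{
    place_ram(34 * GiB); /* fits exactly into the 2GB and 32GB slots */
    return 0;
}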
>>>>>>>>>>
>>>>>>>>>> The issue is machvirt doesn't comply with that document.
>>>>>>>>>> At the moment we have
>>>>>>>>>> 0 -> 1GB MMIO
>>>>>>>>>> 1GB -> 256GB RAM
>>>>>>>>>> 256GB -> 512GB is theoretically reserved for IO but most is free.
>>>>>>>>>> 512GB -> 1T is reserved for ECAM MMIO range. This is the top of our
>>>>>>>>>> existing 40b GPA space.
>>>>>>>>>>
>>>>>>>>>> We don't want to change this address map due to legacy reasons.
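That layout, transcribed as a table for reference (illustrative only; the values simply restate the list above, this is not the actual memory map array from hw/arm/virt.c):

#include <stdint.h>
#include <stdio.h>

#define GiB (1ULL << 30)

typedef struct {
    uint64_t base;
    uint64_t size;
    const char *use;
} LayoutEntry;

/* Values transcribed from the list above; not the real virt.c memmap. */
static const LayoutEntry legacy_virt_layout[] = {
    { 0,         1 * GiB,   "MMIO" },
    { 1 * GiB,   255 * GiB, "RAM" },
    { 256 * GiB, 256 * GiB, "theoretically reserved for IO, mostly free" },
    { 512 * GiB, 512 * GiB, "ECAM MMIO range, top of the 40b GPA space" },
};

int main(void)
{
    for (size_t i = 0; i < sizeof(legacy_virt_layout) / sizeof(legacy_virt_layout[0]); i++) {
        const LayoutEntry *e = &legacy_virt_layout[i];
        printf("0x%010llx - 0x%010llx : %s\n",
               (unsigned long long)e->base,
               (unsigned long long)(e->base + e->size - 1), e->use);
    }
    return 0;
}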
>>>>>>>>>>  
>>>>>>>>>
>>>>>>>>> Thanks, good to know!
>>>>>>>>>  
>>>>>>>>>> Another question: do you know if it would be possible to have the
>>>>>>>>>> device_memory region split into several discontinuous segments?  
>>>>>>>>>
>>>>>>>>> It can be implemented for sure, but I would try to avoid that, as it
>>>>>>>>> makes certain configurations impossible (and very end user 
>>>>>>>>> unfriendly).
>>>>>>>>>
>>>>>>>>> E.g. (numbers completely made up, but it should show what I mean)
>>>>>>>>>
>>>>>>>>> -m 20G,maxmem=120G:  
>>>>>>>>> -> Try to add a DIMM with 100G -> error.
>>>>>>>>> -> But we can add e.g. two DIMMs with 40G and 60G.  
>>>>>>>>>
>>>>>>>>> This exposes internal details to the end user. And the end user has no
>>>>>>>>> idea what is going on.
>>>>>>>>>
>>>>>>>>> So I think we should try our best to keep that area consecutive.  
>>>>>>>>
>>>>>>>> Actually I didn't sufficiently detail the context. I would like
>>>>>>>> 1) one segment to be exposed to the end-user through the slot|maxmem
>>>>>>>> stuff (what this series targets) and
>>>>>>>> 2) another segment used to instantiate PC-DIMMs for internal needs, as a
>>>>>>>> replacement for part of the 1GB -> 256GB static RAM. This was the purpose
>>>>>>>> of Shameer's original series  
>>>>>>>
>>>>>>> I am not sure if PC-DIMMs are exactly what you want for internal 
>>>>>>> purposes.
>>>>>>>  
>>>>>>>>
>>>>>>>> [1] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions
>>>>>>>> http://patchwork.ozlabs.org/cover/914694/
>>>>>>>> This approach is not yet validated though.
>>>>>>>>
>>>>>>>> The rationale is that sometimes you must have "holes" in RAM, as some
>>>>>>>> GPAs match reserved IOVAs for assigned devices.  
>>>>>>>
>>>>>>> So if I understand it correctly, all you want is some memory region that
>>>>>>> a) contains only initially defined memory
>>>>>>> b) can have some holes in it
>>>>>>>
>>>>>>> This is exactly what x86 already does (pc_memory_init): Simply construct
>>>>>>> your own memory region leaving holes in it.
>>>>>>>
>>>>>>>
>>>>>>> memory_region_init_alias(ram_below_4g, NULL, "ram-below-4g", ram,
>>>>>>>                          0, pcms->below_4g_mem_size);
>>>>>>> memory_region_add_subregion(system_memory, 0, ram_below_4g);
>>>>>>> ...
>>>>>>> if (pcms->above_4g_mem_size > 0)
>>>>>>>     memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g", ram,
>>>>>>>     ...
>>>>>>>     memory_region_add_subregion(system_memory, 0x100000000ULL,
>>>>>>>     ...
>>>>>>>
>>>>>>> They "indicate" these different GPA areas to the guest using the e820
>>>>>>> map.
>>>>>>>
>>>>>>> Would that also work for you?  
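For reference, a fuller, self-contained sketch of that alias technique with virt-style names (an assumption-laden illustration: the helper, the hole parameters and the region names are invented here, and this is not code from the series):

/* Sketch of the alias approach quoted above, adapted to an arm/virt-style
 * RAM layout with one hole. Assumes it sits in hw/arm/virt.c, where
 * "exec/memory.h", "hw/boards.h" and glib are already pulled in. */
static void create_ram_with_hole(MemoryRegion *sysmem,
                                 hwaddr ram_base, uint64_t ram_size,
                                 hwaddr hole_start, uint64_t hole_size)
{
    MemoryRegion *ram = g_new(MemoryRegion, 1);
    MemoryRegion *below = g_new(MemoryRegion, 1);
    MemoryRegion *above = g_new(MemoryRegion, 1);
    uint64_t below_size = hole_start - ram_base;

    /* One backing block for all of the initial RAM. */
    memory_region_allocate_system_memory(ram, NULL, "mach-virt.ram", ram_size);

    /* Alias the part below the hole at its usual GPA. */
    memory_region_init_alias(below, NULL, "virt.ram-below-hole", ram,
                             0, below_size);
    memory_region_add_subregion(sysmem, ram_base, below);

    /* Alias the remainder above the hole, i.e. skip [hole_start, +hole_size). */
    memory_region_init_alias(above, NULL, "virt.ram-above-hole", ram,
                             below_size, ram_size - below_size);
    memory_region_add_subregion(sysmem, hole_start + hole_size, above);

    /* The DT/ACPI memory nodes then have to describe the two banks, much
     * like the e820 map does on x86. */
}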
>>>>>>
>>>>>> I would tentatively say yes. However, I am not sure that, if we were
>>>>>> to actually put holes in the 1GB-256GB RAM segment, PC-DIMM would be the
>>>>>> natural choice. Also, the reserved IOVA issue impacts the device_memory
>>>>>> region area, I think. I am skeptical that we can put holes in the
>>>>>> static RAM and device_memory regions like that.
>>> Could we just use a single device_memory region for both initial and
>>> hotpluggable RAM if we make the base RAM address dynamic?
>> This assumes the FW supports a dynamic RAM base. If I understand correctly,
>> this is not the case.
> 
> It's not currently the case, but I've added prototyping this near the top
> of my TODO. So stay tuned.
ok
> 
>> Also there is the problem of migration. How
>> would you migrate between guests whose RAM is not laid out at the same
>> place?
> 
> I'm not sure what you mean here. Boot a guest with a new memory map,
> probably by explicitly asking for it with a new machine property,
> which means a new virt machine version. Then migrate at will to any
> host that supports that machine type.
My concern was rather about holes in the memory map matching reserved
regions.
> 
>> I understood hotplug memory relied on a specific device_memory
>> region. So do you mean we would have 2 contiguous regions?
> 
> I think Igor wants one contiguous region for RAM, where additional
> space can be reserved for hotplugging.
This is not compliant with the 2012 ARM white paper, although I don't really
know whether that document truly is a reference (I did not get any reply).

Thanks

Eric
> 
>>> In this case RAM could start wherever there is free space for maxmem
>>> (if there is free space in lowmem, put device_memory there, otherwise put
>>> it somewhere in high mem) and we won't care whether there are reserved
>>> IOVAs there or not.
>>>
>>> *I don't have a clue about iommus, so here goes a stupid question*
>>> I agree with Peter that the whole IOVA thing looks broken when the host
>>> layout dictates the guest's one.
>>> Shouldn't there be an IOMMU somewhere that would remap the host map into a
>>> guest-specific one? (so the guest would model the board we need and be
>>> migratable, instead of mimicking the host hw)
>> The issue relates to the IOVAs programmed by the guest into the
>> host-assigned devices. DMA requests issued by the assigned devices using
>> those IOVAs are supposed to reach guest RAM. But due to the host
>> topology they won't (host PCI host bridge windows or MSI reserved
>> regions). Adding a vIOMMU on the guest side effectively allows using
>> IOVAs != GPAs, but the guest is exposed to that change, and the extra
>> translation stage adds a huge performance penalty. And eventually the
>> IOVA allocator used in the guest should theoretically be aware of the
>> host reserved regions as well.
>>
>>>
>>>
>>>>> The first approach [1] we had to address the holes in memory was using
>>>>> the memory alias way mentioned above. And based on Drew's review, the
>>>>> pc-dimm way of handling it was introduced. I think the main argument was
>>>>> that it would be useful when we eventually support hotplug.  
>>>>
>>>> That's my understanding too.
>>> not only hotplug:
>>>
>>>   a RAM memory region that's split by aliases is difficult to handle,
>>>   as it creates a nonlinear GPA<->HVA mapping instead of the
>>>   1:1 mapping of a pc-dimm, so if one needs to build an HVA<->GPA map
>>>   for a given MemoryRegion, in the alias case one would have to get the
>>>   list of MemorySections that belong to it and build the map from that,
>>>   vs (addr + offset) in the simple 1:1 case.
>>>
>>>   it also complicates machine specific SRAT/e820 code due to holes
>>>   /grep 'the memory map is a bit tricky'/
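A minimal sketch of the two lookup styles being contrasted (illustrative only, not QEMU code; the struct and function names are made up):

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t gpa_base;   /* where the dimm is mapped in guest physical space */
    uint64_t size;
    uint8_t *hva_base;   /* host virtual address of the backing memory */
} DimmMapping;

/* 1:1 pc-dimm case: a simple (addr + offset) computation. */
static uint8_t *dimm_gpa_to_hva(const DimmMapping *d, uint64_t gpa)
{
    if (gpa < d->gpa_base || gpa >= d->gpa_base + d->size) {
        return NULL;
    }
    return d->hva_base + (gpa - d->gpa_base);
}

/* Alias case: the region may be scattered, so a lookup has to walk a list
 * of sections (built e.g. from MemorySections) instead of using one offset. */
typedef struct {
    uint64_t gpa_base;
    uint64_t size;
    uint64_t offset_in_block; /* offset back into the single RAM block */
} AliasSection;

static uint8_t *alias_gpa_to_hva(const AliasSection *secs, size_t n,
                                 uint8_t *block_hva, uint64_t gpa)
{
    for (size_t i = 0; i < n; i++) {
        if (gpa >= secs[i].gpa_base && gpa < secs[i].gpa_base + secs[i].size) {
            return block_hva + secs[i].offset_in_block +
                   (gpa - secs[i].gpa_base);
        }
    }
    return NULL;
}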
>>>
>>>>> But since that is added anyway as part of this series, I am not sure we
>>>>> have any other benefit in modeling it as pc-dimm. Maybe I am missing
>>>>> something here.  
>>>>
>>>> I tentatively agree with you. I was trying to understand whether the
>>>> device_memory region also fit the original needs, but I think the
>>>> standard alias approach is better suited to hole creation.
>>> Aliases are an easy way to start with, but as compat knobs grow
>>> (based on PC experience, grep 'Calculate ram split')
>>> it's quite a pain to maintain a manual, implicit alias layout
>>> without breaking it by accident.
>>> We probably won't be able to get rid of aliases on PC for legacy
>>> reasons, but why introduce the same pain to the virt board?
>>>
>>> Well, a magical conversion from -m X to 2..y memory regions (aliases or not)
>>> isn't going to be easy in either case, especially if one takes into
>>> account "-numa memdev|mem".
>>> I'd rather use a single pc-dimm approach for both /initial and hotpluggable
>>> RAM/ and then use the device_memory framework to enumerate RAM wherever
>>> needed (ACPI/DT) in a uniform way.
>> We have 2 problems:
>> - supporting more RAM. This can be achieved by adding a single memory
>> region based on DIMMs
>> - managing IOVA reserved regions. I don't think we have a consensus on the
>> solution at the moment. What about migration between 2 guests having a
>> different memory topology?
> 
> With a dynamic RAM base (pretty easy to do in QEMU, but requiring a FW change
> - now on my TODO), one only needs to pick a contiguous region within
> the guest physical address limits that has the requested size and does not
> overlap any host reserved regions (I think). I'm still not sure what the
> migration concern is.
> 
> Thanks,
> drew
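A rough sketch of such a base-picking step (purely illustrative; the types, names and alignment handling are assumptions, not QEMU code or code from this series):

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t start;
    uint64_t size;
} ReservedRegion;

/* Returns the first base >= min_base, aligned to 'align' (a power of two),
 * where [base, base + size) fits below 'limit' and overlaps none of the
 * reserved regions, or 0 on failure. */
static uint64_t find_ram_base(uint64_t min_base, uint64_t size, uint64_t align,
                              uint64_t limit, const ReservedRegion *rsv,
                              size_t nr_rsv)
{
    uint64_t base = (min_base + align - 1) & ~(align - 1);

    while (base + size <= limit) {
        int clash = 0;
        for (size_t i = 0; i < nr_rsv; i++) {
            uint64_t r_end = rsv[i].start + rsv[i].size;
            if (base < r_end && rsv[i].start < base + size) {
                /* Overlap: restart just past this reserved region. */
                base = (r_end + align - 1) & ~(align - 1);
                clash = 1;
                break;
            }
        }
        if (!clash) {
            return base;
        }
    }
    return 0;
}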
> 
>>
>> Thanks
>>
>> Eric
>>>
>>>
>>>> Thanks
>>>>
>>>> Eric
>>>>>
>>>>> Thanks,
>>>>> Shameer
>>>>>
>>>>> [1]. https://lists.gnu.org/archive/html/qemu-arm/2018-04/msg00243.html
>>>>>
>>>>>   
>>>>>> Thanks!
>>>>>>
>>>>>> Eric  
>>>>>>>  
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Eric
>>>>>>>>  
>>>>>>>>>  
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Eric  
>>>>>>>>>
>>>>>>>>>  
>>>>>>>
>>>>>>>  
>>>
>>>
>>


