From: Andrew Jones
Subject: Re: [Qemu-arm] [Qemu-devel] [RFC v3 06/15] hw/arm/virt: Allocate device_memory
Date: Thu, 12 Jul 2018 17:15:14 +0200
User-agent: NeoMutt/20180622

On Thu, Jul 12, 2018 at 04:53:01PM +0200, Auger Eric wrote:
> Hi Drew,
> 
> On 07/12/2018 04:45 PM, Andrew Jones wrote:
> > On Thu, Jul 12, 2018 at 04:22:05PM +0200, Auger Eric wrote:
> >> Hi Igor,
> >>
> >> On 07/11/2018 03:17 PM, Igor Mammedov wrote:
> >>> On Thu, 5 Jul 2018 16:27:05 +0200
> >>> Auger Eric <address@hidden> wrote:
> >>>
> >>>> Hi Shameer,
> >>>>
> >>>> On 07/05/2018 03:19 PM, Shameerali Kolothum Thodi wrote:
> >>>>>   
> >>>>>> -----Original Message-----
> >>>>>> From: Auger Eric [mailto:address@hidden
> >>>>>> Sent: 05 July 2018 13:18
> >>>>>> To: David Hildenbrand <address@hidden>; address@hidden;
> >>>>>> address@hidden; address@hidden; address@hidden;
> >>>>>> Shameerali Kolothum Thodi <address@hidden>;
> >>>>>> address@hidden
> >>>>>> Cc: address@hidden; address@hidden; address@hidden;
> >>>>>> address@hidden; address@hidden
> >>>>>> Subject: Re: [Qemu-devel] [RFC v3 06/15] hw/arm/virt: Allocate
> >>>>>> device_memory
> >>>>>>
> >>>>>> Hi David,
> >>>>>>
> >>>>>> On 07/05/2018 02:09 PM, David Hildenbrand wrote:  
> >>>>>>> On 05.07.2018 14:00, Auger Eric wrote:  
> >>>>>>>> Hi David,
> >>>>>>>>
> >>>>>>>> On 07/05/2018 01:54 PM, David Hildenbrand wrote:  
> >>>>>>>>> On 05.07.2018 13:42, Auger Eric wrote:  
> >>>>>>>>>> Hi David,
> >>>>>>>>>>
> >>>>>>>>>> On 07/04/2018 02:05 PM, David Hildenbrand wrote:  
> >>>>>>>>>>> On 03.07.2018 21:27, Auger Eric wrote:  
> >>>>>>>>>>>> Hi David,
> >>>>>>>>>>>> On 07/03/2018 08:25 PM, David Hildenbrand wrote:  
> >>>>>>>>>>>>> On 03.07.2018 09:19, Eric Auger wrote:  
> >>>>>>>>>>>>>> We define a new hotpluggable RAM region (aka. device memory).
> >>>>>>>>>>>>>> Its base is 2TB GPA. This obviously requires 42b IPA support
> >>>>>>>>>>>>>> in KVM/ARM, FW and guest kernel. At the moment the device
> >>>>>>>>>>>>>> memory region is max 2TB.  
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Maybe a stupid question, but why exactly does it have to start 
> >>>>>>>>>>>>> at 2TB
> >>>>>>>>>>>>> (and not e.g. at 1TB)?  
> >>>>>>>>>>>> not a stupid question. See tentative answer below.  
> >>>>>>>>>>>>>  
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This is largely inspired by device memory initialization in
> >>>>>>>>>>>>>> pc machine code.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Signed-off-by: Eric Auger <address@hidden>
> >>>>>>>>>>>>>> Signed-off-by: Kwangwoo Lee <address@hidden>
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>  hw/arm/virt.c         | 104 ++++++++++++++++++++++++++++++++++++--------------
> >>>>>>>>>>>>>>  include/hw/arm/arm.h  |   2 +
> >>>>>>>>>>>>>>  include/hw/arm/virt.h |   1 +
> >>>>>>>>>>>>>>  3 files changed, 79 insertions(+), 28 deletions(-)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> >>>>>>>>>>>>>> index 5a4d0bf..6fefb78 100644
> >>>>>>>>>>>>>> --- a/hw/arm/virt.c
> >>>>>>>>>>>>>> +++ b/hw/arm/virt.c
> >>>>>>>>>>>>>> @@ -59,6 +59,7 @@
> >>>>>>>>>>>>>>  #include "qapi/visitor.h"
> >>>>>>>>>>>>>>  #include "standard-headers/linux/input.h"
> >>>>>>>>>>>>>>  #include "hw/arm/smmuv3.h"
> >>>>>>>>>>>>>> +#include "hw/acpi/acpi.h"
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>  #define DEFINE_VIRT_MACHINE_LATEST(major, minor, latest) \
> >>>>>>>>>>>>>>      static void virt_##major##_##minor##_class_init(ObjectClass *oc, \
> >>>>>>>>>>>>>> @@ -94,34 +95,25 @@
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>  #define PLATFORM_BUS_NUM_IRQS 64
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -/* RAM limit in GB. Since VIRT_MEM starts at the 1GB mark, this means
> >>>>>>>>>>>>>> - * RAM can go up to the 256GB mark, leaving 256GB of the physical
> >>>>>>>>>>>>>> - * address space unallocated and free for future use between 256G and 512G.
> >>>>>>>>>>>>>> - * If we need to provide more RAM to VMs in the future then we need to:
> >>>>>>>>>>>>>> - *  * allocate a second bank of RAM starting at 2TB and working up
> >>>>>>>>>>>> I acknowledge this comment was the main justification. Now if you look at
> >>>>>>>>>>>>
> >>>>>>>>>>>> Principles of ARM Memory Maps
> >>>>>>>>>>>> http://infocenter.arm.com/help/topic/com.arm.doc.den0001c/DEN0001C_principles_of_arm_memory_maps.pdf
> >>>>>>>>>>>> chapter 2.3 you will find that when adding PA bits, you always 
> >>>>>>>>>>>> leave
> >>>>>>>>>>>> space for reserved space and mapped IO.  
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for the pointer!
> >>>>>>>>>>>
> >>>>>>>>>>> So ... we can fit
> >>>>>>>>>>>
> >>>>>>>>>>> a) 2GB at 2GB
> >>>>>>>>>>> b) 32GB at 32GB
> >>>>>>>>>>> c) 512GB at 512GB
> >>>>>>>>>>> d) 8TB at 8TB
> >>>>>>>>>>> e) 128TB at 128TB
> >>>>>>>>>>>
> >>>>>>>>>>> (this is a nice rule of thumb if I understand it correctly :) )
> >>>>>>>>>>>
> >>>>>>>>>>> We should strive for device memory (maxram_size - ram_size) to fit
> >>>>>>>>>>> exactly into one of these slots (otherwise things get nasty).
> >>>>>>>>>>>
> >>>>>>>>>>> Depending on the ram_size, we might have simpler setups and can support
> >>>>>>>>>>> more configurations, no?
> >>>>>>>>>>>
> >>>>>>>>>>> E.g. ram_size <= 34GB, device_memory <= 512GB  
> >>>>>>>>>>> -> move ram into a) and b)
> >>>>>>>>>>> -> move device memory into c)  
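
As a rough illustration of the rule of thumb above (a region of size S sits
naturally at base S, per the ARM memory map principles document), here is a
small stand-alone sketch; the slot table and the helper name are made up for
illustration and are not part of the series:

    /* Illustrative sketch only: pick the "natural" base for a region,
     * following the rule of thumb that a region of size <= S is placed
     * at base S (2GB@2GB, 32GB@32GB, 512GB@512GB, 8TB@8TB, 128TB@128TB).
     */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t natural_base(uint64_t size)
    {
        static const uint64_t slots[] = {
            2ULL << 30, 32ULL << 30, 512ULL << 30, 8ULL << 40, 128ULL << 40,
        };
        for (size_t i = 0; i < sizeof(slots) / sizeof(slots[0]); i++) {
            if (size <= slots[i]) {
                return slots[i];
            }
        }
        return 0; /* does not fit any slot */
    }

    int main(void)
    {
        /* e.g. 512GB of device memory -> base 512GB, as in case c) above */
        printf("0x%llx\n", (unsigned long long)natural_base(512ULL << 30));
        return 0;
    }
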
> >>>>>>>>>>
> >>>>>>>>>> The issue is machvirt doesn't comply with that document.
> >>>>>>>>>> At the moment we have
> >>>>>>>>>> 0 -> 1GB MMIO
> >>>>>>>>>> 1GB -> 256GB RAM
> >>>>>>>>>> 256GB -> 512GB is theoretically reserved for IO but most is free.
> >>>>>>>>>> 512GB -> 1T is reserved for ECAM MMIO range. This is the top of our
> >>>>>>>>>> existing 40b GPA space.
> >>>>>>>>>>
> >>>>>>>>>> We don't want to change this address map due to legacy reasons.
> >>>>>>>>>>  
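
For reference, the fixed layout described above roughly corresponds to the
static memory map table in hw/arm/virt.c of that era; the following is a
condensed sketch built from the numbers in this thread, not the literal
source (struct and entry names are simplified):

    /* Condensed sketch of the virt machine's fixed memory map as described
     * above; not the literal hw/arm/virt.c table.
     */
    #include <stdint.h>

    typedef struct MemMapEntrySketch {
        uint64_t base;
        uint64_t size;
    } MemMapEntrySketch;

    static const MemMapEntrySketch virt_memmap_sketch[] = {
        {            0,   1ULL << 30 }, /* 0     - 1GB:   MMIO (flash, GIC, UART, PCIe, ...) */
        {   1ULL << 30, 255ULL << 30 }, /* 1GB   - 256GB: RAM                                */
        { 256ULL << 30, 256ULL << 30 }, /* 256GB - 512GB: reserved for IO, mostly free       */
        { 512ULL << 30, 512ULL << 30 }, /* 512GB - 1TB:   ECAM/MMIO, top of the 40b GPA space */
    };
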
> >>>>>>>>>
> >>>>>>>>> Thanks, good to know!
> >>>>>>>>>  
> >>>>>>>>>> Another question! do you know if it would be possible to have
> >>>>>>>>>> device_memory region split into several discontinuous segments?  
> >>>>>>>>>
> >>>>>>>>> It can be implemented for sure, but I would try to avoid that, as it
> >>>>>>>>> makes certain configurations impossible (and very end user 
> >>>>>>>>> unfriendly).
> >>>>>>>>>
> >>>>>>>>> E.g. (numbers completely made up, but it should show what I mean)
> >>>>>>>>>
> >>>>>>>>> -m 20G,maxmem=120G:  
> >>>>>>>>> -> Try to add a DIMM with 100G -> error.
> >>>>>>>>> -> But we can add e.g. two DIMMs with 40G and 60G.  
> >>>>>>>>>
> >>>>>>>>> This exposes internal details to the end user. And the end user has 
> >>>>>>>>> no
> >>>>>>>>> idea what is going on.
> >>>>>>>>>
> >>>>>>>>> So I think we should try our best to keep that area consecutive.  
> >>>>>>>>
> >>>>>>>> Actually I didn't sufficiently detail the context. I would like
> >>>>>>>> 1) 1 segment to be exposed to the end-user through slot|maxmem stuff
> >>>>>>>> (what this series targets) and
> >>>>>>>> 2) another segment used to instantiate PC-DIMM for internal needs as
> >>>>>>>> replacement of part of the 1GB -> 256GB static RAM. This was the 
> >>>>>>>> purpose
> >>>>>>>> of Shameer's original series  
> >>>>>>>
> >>>>>>> I am not sure if PC-DIMMs are exactly what you want for internal 
> >>>>>>> purposes.
> >>>>>>>  
> >>>>>>>>
> >>>>>>>> [1] [RFC v2 0/6] hw/arm: Add support for non-contiguous iova regions
> >>>>>>>> http://patchwork.ozlabs.org/cover/914694/
> >>>>>>>> This approach is not yet validated though.
> >>>>>>>>
> >>>>>>>> The rationale is sometimes you must have "holes" in RAM as some GPAs
> >>>>>>>> match reserved IOVAs for assigned devices.  
> >>>>>>>
> >>>>>>> So if I understand it correctly, all you want is some memory region 
> >>>>>>> that
> >>>>>>> a) contains only initially defined memory
> >>>>>>> b) can have some holes in it
> >>>>>>>
> >>>>>>> This is exactly what x86 already does (pc_memory_init): Simply 
> >>>>>>> construct
> >>>>>>> your own memory region leaving holes in it.
> >>>>>>>
> >>>>>>>
> >>>>>>> memory_region_init_alias(ram_below_4g, NULL, "ram-below-4g", ram,
> >>>>>>>                          0, pcms->below_4g_mem_size);
> >>>>>>> memory_region_add_subregion(system_memory, 0, ram_below_4g);
> >>>>>>> ...
> >>>>>>> if (pcms->above_4g_mem_size > 0)
> >>>>>>>     memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g", ram,
> >>>>>>>     ...
> >>>>>>>     memory_region_add_subregion(system_memory, 0x100000000ULL,
> >>>>>>>     ...
> >>>>>>>
> >>>>>>> They "indicate" these different GPA areas using the e820 map to the 
> >>>>>>> guest.
> >>>>>>>
> >>>>>>> Would that also work for you?  
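
A minimal sketch of what the same alias trick could look like on the virt
side, assuming QEMU's memory API (exec/memory.h); the function name and the
base/hole parameters are made up, and the guest would still need the
resulting layout described to it via DT (the e820 equivalent):

    /* Sketch only: carve a hole out of the initial RAM by mapping two
     * aliases of the same backing region around it. Assumes QEMU's memory
     * API; names and layout parameters are illustrative.
     */
    static void virt_map_ram_with_hole(MemoryRegion *sysmem, MemoryRegion *ram,
                                       hwaddr ram_base, uint64_t ram_size,
                                       hwaddr hole_start, uint64_t hole_size)
    {
        MemoryRegion *below = g_new(MemoryRegion, 1);
        MemoryRegion *above = g_new(MemoryRegion, 1);
        uint64_t below_size = hole_start - ram_base;

        /* [ram_base, hole_start) -> first part of the backing RAM */
        memory_region_init_alias(below, NULL, "ram-below-hole", ram,
                                 0, below_size);
        memory_region_add_subregion(sysmem, ram_base, below);

        /* [hole_start + hole_size, ...) -> remainder of the backing RAM */
        memory_region_init_alias(above, NULL, "ram-above-hole", ram,
                                 below_size, ram_size - below_size);
        memory_region_add_subregion(sysmem, hole_start + hole_size, above);
    }
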
> >>>>>>
> >>>>>> I would tentatively say yes. In fact I am not sure that, if we were
> >>>>>> to actually put holes in the 1GB-256GB RAM segment, PC-DIMM would be the
> >>>>>> natural choice. Also the reserved IOVA issue impacts the device_memory
> >>>>>> region area, I think. I am skeptical that we can put holes in
> >>>>>> static RAM and device_memory regions like that.
> >>> Could we just use a single device_memory region for both 
> >>> initial+hotpluggable
> >>> RAM if we make base RAM address dynamic?
> >> This assumes the FW supports a dynamic RAM base. If I understand correctly,
> >> this is not the case.
> > 
> > It's not currently the case, but I've added prototyping this near the top
> > of my TODO, so stay tuned.
> ok
> > 
> >> Also there is the problem of migration. How would you migrate between
> >> guests whose RAM is not laid out at the same place?
> > 
> > I'm not sure what you mean here. Boot a guest with a new memory map,
> > probably by explicitly asking for it with a new machine property,
> > which means a new virt machine version. Then migrate at will to any
> > host that supports that machine type.
> My concern rather was about holes in the memory map matching reserved
> regions.

Oh, I see. I don't think the problem of host reserved memory regions messing
up the guest can ever be solved for the migration case where the destination
host's reserved regions are not strictly an equal set or a subset of the
source host's. So there's nothing we can do but force upper management layers
to maintain migration candidate lists in these environments. A pre-check
could also be added, allowing migration to error out early when an overlap
is detected.
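
As a rough sketch of such a pre-check (the structures and the way the
destination's reserved regions would be obtained are assumptions, not
existing QEMU or libvirt API):

    /* Hypothetical pre-check: refuse migration early if any guest RAM range
     * overlaps a reserved region reported by the destination host.
     */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct AddrRange {
        uint64_t start;
        uint64_t size;
    } AddrRange;

    static bool ranges_overlap(const AddrRange *a, const AddrRange *b)
    {
        return a->start < b->start + b->size && b->start < a->start + a->size;
    }

    /* true if the guest RAM layout is usable on the destination host */
    static bool precheck_guest_ram(const AddrRange *guest_ram, size_t nr_ram,
                                   const AddrRange *dst_resv, size_t nr_resv)
    {
        for (size_t i = 0; i < nr_ram; i++) {
            for (size_t j = 0; j < nr_resv; j++) {
                if (ranges_overlap(&guest_ram[i], &dst_resv[j])) {
                    return false; /* error out before starting migration */
                }
            }
        }
        return true;
    }
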

> > 
> >> I understood hotplug memory relied on a specific device_memory
> >> region. So do you mean we would have 2 contiguous regions?
> > 
> > I think Igor wants one contiguous region for RAM, where additional
> > space can be reserved for hotplugging.
> This is not compliant with the 2012 ARM white paper, although I don't really
> know if that document truly is a reference (I did not get any reply).
> 
> Thanks
> 
> Eric
> > 
> >>> In this case RAM could start wherever there is free space for maxmem
> >>> (if there is free space in lowmem, put device_memory there, otherwise put
> >>> it somewhere in high mem) and we won't care whether there are reserved
> >>> IOVAs there or not.
> >>>
> >>> *I don't have a clue about iommus, so here goes a stupid question*
> >>> I agree with Peter that the whole IOVA thing looks broken when the host
> >>> layout dictates the guest's one.
> >>> Shouldn't there be an iommu somewhere that would remap the host map into
> >>> a guest-specific one? (so the guest would model the board we need and be
> >>> migratable instead of mimicking host hw)
> >> The issue is related to the IOVAs programmed by the guest into the
> >> assigned host devices. DMA requests issued by the assigned devices using
> >> those IOVAs are supposed to reach the guest RAM. But due to the host
> >> topology they won't (host PCI host bridge windows or MSI reserved
> >> regions). Adding a vIOMMU on the guest side effectively allows using
> >> IOVAs != GPAs, but the guest is exposed to that change, and this extra
> >> translation stage adds a huge performance penalty. Eventually the IOVA
> >> allocator used in the guest should theoretically be aware of the host
> >> reserved regions as well.
> >>
> >>>
> >>>
> >>>>> The first approach[1] we had to address the holes in memory was using
> >>>>> the memory alias way mentioned above.  And based on Drew's review, the
> >>>>> pc-dimm way of handling was introduced. I think the main argument was 
> >>>>> that
> >>>>> it will be useful when we eventually support hotplug.  
> >>>>
> >>>> That's my understanding too.
> >>> not only hotplug,
> >>>
> >>>   a RAM memory region that's split by aliases is difficult to handle,
> >>>   as it creates a nonlinear GPA<->HVA mapping instead of the
> >>>   1:1 mapping of a pc-dimm,
> >>>   so if one needs to build an HVA<->GPA map for a given MemoryRegion,
> >>>   in the alias case one would have to get the list of MemorySections
> >>>   that belong to it and build the map from that, vs (addr + offset) in
> >>>   the simple 1:1 case.
> >>>
> >>>   complicated machine-specific SRAT/e820 code due to holes
> >>>   /grep 'the memory map is a bit tricky'/
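
To make the 1:1 vs alias point concrete, here is a toy comparison (not QEMU
code; the types and names are made up): with a single pc-dimm-backed region
the translation is one addition, while an alias-split region forces a walk
over the sections that compose it:

    /* Toy illustration only.
     * 1:1 case: one region, one host mapping -> hva = hva_base + (gpa - gpa_base).
     * Alias case: several sections of one backing region -> walk a table.
     */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct Section {          /* one contiguous GPA<->HVA piece */
        uint64_t gpa_base, hva_base, size;
    } Section;

    /* pc-dimm-like region: a single section, simple arithmetic */
    static uint64_t gpa_to_hva_linear(const Section *s, uint64_t gpa)
    {
        return s->hva_base + (gpa - s->gpa_base);
    }

    /* alias-split region: must search the list of sections */
    static uint64_t gpa_to_hva_aliased(const Section *secs, size_t n,
                                       uint64_t gpa)
    {
        for (size_t i = 0; i < n; i++) {
            if (gpa >= secs[i].gpa_base &&
                gpa < secs[i].gpa_base + secs[i].size) {
                return secs[i].hva_base + (gpa - secs[i].gpa_base);
            }
        }
        return 0; /* not mapped */
    }
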
> >>>
> >>>>  But since that is added
> >>>>> anyway as part of this series, I am not sure we have any other benefit 
> >>>>> in
> >>>>> modeling it as pc-dimm. Maybe I am missing something here.
> >>>>
> >>>> I tentatively agree with you. I was trying to understand whether the
> >>>> device_memory region would fit the original needs too, but I think the
> >>>> standard alias approach is better suited to hole creation.
> >>> Aliases are an easy way to start with, but as compat knobs grow
> >>> (based on PC experience, grep 'Calculate ram split')
> >>> it's quite a pain to maintain a manual, implicit alias layout
> >>> without breaking it by accident.
> >>> We probably won't be able to get rid of aliases on PC for legacy
> >>> reasons, but why introduce the same pain to the virt board?
> >>>
> >>> Well, a magical conversion from -m X to 2..y memory regions (aliases or
> >>> not) isn't going to be easy in either case, especially if one takes into
> >>> account "-numa memdev|mem".
> >>> I'd rather use a single pc-dimm approach for both /initial and
> >>> hotpluggable RAM/
> >>> and then use the device_memory framework to enumerate RAM wherever needed
> >>> (ACPI/DT) in a uniform way.
> >> We have two problems:
> >> - supporting more RAM. This can be achieved by adding a single memory
> >> region based on DIMMs.
> >> - managing IOVA reserved regions. I don't think we have a consensus on the
> >> solution at the moment. What about migration between two guests having
> >> different memory topologies?
> > 
> > With a dynamic RAM base (pretty easy to do in QEMU, but it requires a FW
> > change - now on my TODO), one only needs to pick a contiguous region within
> > the guest physical address limits that has the requested size and does not
> > overlap any host reserved regions (I think). I'm still not sure what the
> > migration concern is.
> > 
> > Thanks,
> > drew
> > 
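
A sketch of the kind of base selection described above; the guest PA limit,
the reserved-region list and the alignment are placeholders, not values from
the series:

    /* Sketch: find the lowest aligned base in [floor, pa_limit) where a
     * region of the requested size does not overlap any host reserved
     * region. Inputs are placeholders; align must be a power of two.
     */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct ResvRegion {
        uint64_t start;
        uint64_t size;
    } ResvRegion;

    static uint64_t find_ram_base(uint64_t floor, uint64_t pa_limit,
                                  uint64_t size, uint64_t align,
                                  const ResvRegion *resv, size_t nr_resv)
    {
        uint64_t base = (floor + align - 1) & ~(align - 1);

        while (base + size <= pa_limit) {
            bool clash = false;
            for (size_t i = 0; i < nr_resv; i++) {
                if (base < resv[i].start + resv[i].size &&
                    resv[i].start < base + size) {
                    /* bump past the conflicting region and re-align */
                    base = (resv[i].start + resv[i].size + align - 1)
                           & ~(align - 1);
                    clash = true;
                    break;
                }
            }
            if (!clash) {
                return base; /* found a usable, contiguous slot */
            }
        }
        return 0; /* no room below the PA limit */
    }
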
> >>
> >> Thanks
> >>
> >> Eric
> >>>
> >>>
> >>>> Thanks
> >>>>
> >>>> Eric
> >>>>>
> >>>>> Thanks,
> >>>>> Shameer
> >>>>>
> >>>>> [1]. https://lists.gnu.org/archive/html/qemu-arm/2018-04/msg00243.html
> >>>>>
> >>>>>   
> >>>>>> Thanks!
> >>>>>>
> >>>>>> Eric  
> >>>>>>>  
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>>> Eric
> >>>>>>>>  
> >>>>>>>>>  
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>>
> >>>>>>>>>> Eric  
> >>>>>>>>>
> >>>>>>>>>  
> >>>>>>>
> >>>>>>>  
> >>>
> >>>
> >>
> 


