From: Alexey Kardashevskiy
Subject: Re: [Qemu-ppc] [PATCH qemu 0/3] spapr_pci, vfio: NVIDIA V100 + P9 passthrough
Date: Mon, 11 Feb 2019 18:46:32 +1100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.5.0


On 11/02/2019 17:07, Alex Williamson wrote:
> On Mon, 11 Feb 2019 14:49:49 +1100
> Alexey Kardashevskiy <address@hidden> wrote:
> 
>> On 08/02/2019 16:28, David Gibson wrote:
>>> On Thu, Feb 07, 2019 at 08:26:20PM -0700, Alex Williamson wrote:  
>>>> On Fri, 8 Feb 2019 13:29:37 +1100
>>>> Alexey Kardashevskiy <address@hidden> wrote:
>>>>  
>>>>> On 08/02/2019 02:18, Alex Williamson wrote:  
>>>>>> On Thu, 7 Feb 2019 15:43:18 +1100
>>>>>> Alexey Kardashevskiy <address@hidden> wrote:
>>>>>>     
>>>>>>> On 07/02/2019 04:22, Daniel Henrique Barboza wrote:    
>>>>>>>> Based on this series, I've sent a Libvirt patch to allow a QEMU process
>>>>>>>> to inherit IPC_LOCK when using VFIO passthrough with the Tesla V100
>>>>>>>> GPU:
>>>>>>>>
>>>>>>>> https://www.redhat.com/archives/libvir-list/2019-February/msg00219.html
>>>>>>>>
>>>>>>>>
>>>>>>>> In that thread, Alex raised concerns about allowing QEMU to freely lock
>>>>>>>> all the memory it wants. Is this an issue to be considered in the review
>>>>>>>> of this series here?
>>>>>>>>
>>>>>>>> Reading the patches, especially patch 3/3, it seems to me that QEMU is
>>>>>>>> going to lock the KVM memory to populate the NUMA node with memory of
>>>>>>>> the GPU itself, so at first glance there is no risk of taking over the
>>>>>>>> host RAM. Am I missing something?
>>>>>>>
>>>>>>>
>>>>>>> The GPU memory belongs to the device and is not visible to the host as
>>>>>>> memory blocks, nor is it covered by page structs; to the host it is more
>>>>>>> like MMIO, which is passed through to the guest without that locked
>>>>>>> memory accounting. I'd expect libvirt to keep working as usual, except that:
>>>>>>>
>>>>>>> when libvirt calculates the amount of memory needed for TCE tables
>>>>>>> (which is guestRAM/64k*8), it now needs to use the end of the last GPU
>>>>>>> RAM window as the guest RAM size. For example, in QEMU HMP "info mtree -f":
>>>>>>>
>>>>>>> FlatView #2
>>>>>>>  AS "memory", root: system
>>>>>>>  AS "cpu-memory-0", root: system
>>>>>>>  Root memory region: system
>>>>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>>>>   0000010000000000-0000011fffffffff (prio 0, ram): nvlink2-mr
>>>>>>>
>>>>>>> So previously the DMA window would cover 0x7fffffff+1; now it has to
>>>>>>> cover 0x11fffffffff+1.
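
(To make that arithmetic concrete: a minimal sketch, not libvirt code,
plugging the flat-view addresses above into the guestRAM/64k*8 formula.)

  #include <inttypes.h>
  #include <stdio.h>

  /* TCE table size = (window size / 64KiB IOMMU page) * 8-byte TCE */
  static uint64_t tce_table_bytes(uint64_t window_size)
  {
      return (window_size >> 16) * 8;
  }

  int main(void)
  {
      uint64_t ram_end = 0x7fffffffULL + 1;    /* end of ppc_spapr.ram */
      uint64_t gpu_end = 0x11fffffffffULL + 1; /* end of nvlink2-mr    */

      /* prints 256 KiB: the window covering guest RAM only */
      printf("RAM-only window: %" PRIu64 " KiB\n",
             tce_table_bytes(ram_end) >> 10);
      /* prints 144 MiB: the window reaching the end of the GPU RAM */
      printf("GPU-covering window: %" PRIu64 " MiB\n",
             tce_table_bytes(gpu_end) >> 20);
      return 0;
  }

That is roughly 144MB of TCEs once the window must span the GPU RAM,
versus 256KB for guest RAM alone.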
>>>>>>
>>>>>> This looks like a chicken-and-egg problem: you're saying libvirt needs
>>>>>> to query the mtree to understand the extent of the GPU layout, but we
>>>>>> need to specify the locked memory limits in order for QEMU to start.
>>>>>> Is libvirt supposed to start the VM with unlimited locked memory and
>>>>>> fix it at some indeterminate point in the future?  Run a dummy VM with
>>>>>> unlimited locked memory in order to determine the limits for the real
>>>>>> VM?  Neither of these sounds practical.  Thanks,
>>>>>
>>>>>
>>>>> QEMU maps GPU RAM at known locations (which depend only on the vPHB's
>>>>> index, or can be set explicitly), and libvirt knows how many GPUs are
>>>>> passed through, so it is quite easy to calculate the required amount
>>>>> of memory.
>>>>>
>>>>> Here is the window start calculation:
>>>>> https://github.com/aik/qemu/commit/7073cad3ae7708d657e01672bcf53092808b54fb#diff-662409c2a5a150fe231d07ea8384b920R3812
>>>>>
>>>>> We do not know the exact GPU RAM window size until QEMU reads it from
>>>>> VFIO/nvlink2, but we know that all existing hardware has a window of
>>>>> 128GB (the adapters I have access to have only 16/32GB on board).
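
(The placement rule is simple enough for libvirt to reproduce; a hedged
sketch, where the constant names are illustrative rather than QEMU's own
and the values are inferred from the "info mtree -f" output above.)

  #include <stdint.h>

  #define NV2RAM64_WIN_BASE 0x10000000000ULL /* 1 TiB, where nvlink2-mr starts */
  #define NV2RAM64_WIN_SIZE (128ULL << 30)   /* one 128GiB window per vPHB     */

  /* Each vPHB's GPU RAM window starts at a fixed, index-derived address,
   * so libvirt can predict it without running a dummy VM first. */
  static uint64_t nv2_win_start(uint32_t vphb_index)
  {
      return NV2RAM64_WIN_BASE + (uint64_t)vphb_index * NV2RAM64_WIN_SIZE;
  }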
>>>>
>>>> So you're asking that libvirt add 128GB per GPU with magic nvlink
>>>> properties, which may be 8x what's actually necessary, and how does
>>>> libvirt determine which GPUs to apply this to?  Does libvirt need to
>>>> sort through device tree properties for this?  Thanks,
>>>
>>> Hm.  If the GPU memory is really separate from main RAM, which it
>>> sounds like, I don't think it makes sense to account it against the
>>> same locked memory limit as regular RAM.  
>>
>>
>> This is accounting for the TCE table that covers the GPU RAM, not for
>> the GPU RAM itself.
>>
>> So I am asking libvirt to add 128GB/64k*8 = 16MB to locked_vm. It
>> already does so for the guest RAM.
> 
> Why do host internal data structures count against the user's locked
> memory limit?  We don't include IOMMU page tables or type1 accounting
> structures on other archs.  Thanks,


Because pseries guests create DMA windows dynamically, and userspace can
pass multiple devices to a guest, placing each on its own vPHB. Each vPHB
will most likely create an additional 64-bit DMA window backed by an IOMMU
table, so it is userspace that triggers these allocations. We account the
guest RAM once, as it is shared among the vPHBs, but the IOMMU tables are
per vPHB and each must be accounted separately.
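
Putting numbers on that: a rough sketch under the assumptions above (a
128GB hardware window per GPU and one 64-bit DMA window per vPHB; not
libvirt's actual accounting code).

  #include <stdint.h>

  /* Extra locked_vm needed for the per-vPHB 64-bit DMA windows.  Guest
   * RAM is accounted once elsewhere; each GPU's vPHB adds the TCE table
   * for its own (at most 128GiB) window. */
  static uint64_t locked_vm_increment(unsigned int ngpus)
  {
      const uint64_t gpu_window = 128ULL << 30;           /* 128 GiB  */
      const uint64_t tce_table = (gpu_window >> 16) * 8;  /* = 16 MiB */
      return (uint64_t)ngpus * tce_table;
  }

That is, 16MB per passed-through GPU on top of what libvirt already adds
for the guest RAM itself.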


-- 
Alexey


