From: Auger Eric
Subject: Re: [Qemu-ppc] [PATCH qemu v7 2/4] vfio/pci: Relax DMA map errors for MMIO regions
Date: Tue, 13 Mar 2018 18:13:41 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0
Hi Alex,
On 13/03/18 17:56, Alex Williamson wrote:
> [Cc +Eric]
>
> On Tue, 13 Mar 2018 15:53:19 +1100
> Alexey Kardashevskiy <address@hidden> wrote:
>
>> On 7/3/18 1:17 pm, Alexey Kardashevskiy wrote:
>>> On 26/02/18 19:36, Alexey Kardashevskiy wrote:
>>>> On 19/02/18 13:46, Alexey Kardashevskiy wrote:
>>>>> On 16/02/18 16:28, David Gibson wrote:
>>>>>> On Wed, Feb 14, 2018 at 08:55:41AM -0700, Alex Williamson wrote:
>>>>>>> On Wed, 14 Feb 2018 19:09:16 +1100
>>>>>>> Alexey Kardashevskiy <address@hidden> wrote:
>>>>>>>
>>>>>>>> On 14/02/18 12:33, David Gibson wrote:
>>>>>>>>> On Tue, Feb 13, 2018 at 07:20:56PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>
>>>>>>>>>> On 13/02/18 16:41, David Gibson wrote:
>>>>>>>>>>> On Tue, Feb 13, 2018 at 04:36:30PM +1100, David Gibson wrote:
>>>>>>>>>>>> On Tue, Feb 13, 2018 at 12:15:52PM +1100, Alexey Kardashevskiy
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> On 13/02/18 03:06, Alex Williamson wrote:
>>>>>>>>>>>>>> On Mon, 12 Feb 2018 18:05:54 +1100
>>>>>>>>>>>>>> Alexey Kardashevskiy <address@hidden> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 12/02/18 16:19, David Gibson wrote:
>>>>>>>>>>>>>>>> On Fri, Feb 09, 2018 at 06:55:01PM +1100, Alexey Kardashevskiy
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> At the moment, if vfio_memory_listener is registered in the
>>>>>>>>>>>>>>>>> system memory address space, it maps/unmaps every RAM memory
>>>>>>>>>>>>>>>>> region for DMA. It expects system-page-size-aligned memory
>>>>>>>>>>>>>>>>> sections, so vfio_dma_map would not fail, and so far this
>>>>>>>>>>>>>>>>> has been the case. A mapping failure would be fatal. A side
>>>>>>>>>>>>>>>>> effect of this behavior is that some MMIO pages would be
>>>>>>>>>>>>>>>>> silently left unmapped.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> However, we are going to change MSIX BAR handling, so we
>>>>>>>>>>>>>>>>> will end up having non-aligned sections in
>>>>>>>>>>>>>>>>> vfio_memory_listener (more details are in the next patch)
>>>>>>>>>>>>>>>>> and vfio_dma_map will exit QEMU.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In order to avoid fatal failures on what previously was not
>>>>>>>>>>>>>>>>> a failure and was just silently ignored, this checks the
>>>>>>>>>>>>>>>>> section alignment against the smallest supported IOMMU page
>>>>>>>>>>>>>>>>> size and prints an error if it is not aligned; it also
>>>>>>>>>>>>>>>>> prints an error if vfio_dma_map failed despite the page size
>>>>>>>>>>>>>>>>> check. Neither error is fatal; only MMIO RAM regions are
>>>>>>>>>>>>>>>>> checked (aka "RAM device" regions).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If the number of errors printed is overwhelming, the MSIX
>>>>>>>>>>>>>>>>> relocation could be used to avoid excessive error output.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This is unlikely to cause any behavioral change.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <address@hidden>
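>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (An illustrative sketch of the added check, assuming QEMU's
>>>>>>>>>>>>>>>>> ctz64()/int128_get64() helpers and the host window's
>>>>>>>>>>>>>>>>> iova_pgsizes bitmap; not necessarily the patch's exact code:)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>     /* in vfio_listener_region_add() */
>>>>>>>>>>>>>>>>>     if (memory_region_is_ram_device(section->mr)) {
>>>>>>>>>>>>>>>>>         /* mask of the smallest IOMMU page size this window supports */
>>>>>>>>>>>>>>>>>         hwaddr pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>         if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
>>>>>>>>>>>>>>>>>             /* not fatal: report the unaligned MMIO section and skip it */
>>>>>>>>>>>>>>>>>             error_report("Region 0x%"HWADDR_PRIx" is not aligned to"
>>>>>>>>>>>>>>>>>                          " 0x%"HWADDR_PRIx" and cannot be mapped for DMA",
>>>>>>>>>>>>>>>>>                          iova, pgmask + 1);
>>>>>>>>>>>>>>>>>             return;
>>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>>     }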
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There are some relatively superficial problems noted below.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But more fundamentally, this feels like it's extending an
>>>>>>>>>>>>>>>> existing hack past the point of usefulness.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The explicit check for is_ram_device() here has always
>>>>>>>>>>>>>>>> bothered me - it's not like a real bus bridge magically knows
>>>>>>>>>>>>>>>> whether a target address maps to RAM or not.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What I think is really going on is that even for systems
>>>>>>>>>>>>>>>> without an IOMMU, it's not really true to say that the PCI
>>>>>>>>>>>>>>>> address space maps directly onto address_space_memory.
>>>>>>>>>>>>>>>> Instead, there's a large, but much less than 2^64 sized,
>>>>>>>>>>>>>>>> "upstream window" at address 0 on the PCI bus, which is
>>>>>>>>>>>>>>>> identity mapped to the system bus. Details will vary with the
>>>>>>>>>>>>>>>> system, but in practice we expect nothing but RAM to be in
>>>>>>>>>>>>>>>> that window. Addresses not within that window won't be mapped
>>>>>>>>>>>>>>>> to the system bus but will just be broadcast on the PCI bus
>>>>>>>>>>>>>>>> and might be picked up as a p2p transaction.
>>>>>>>>>>>>>>> Currently this p2p works only via the IOMMU; direct p2p is
>>>>>>>>>>>>>>> not possible, as the guest would need to know physical MMIO
>>>>>>>>>>>>>>> addresses to make p2p work, and it does not.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /me points to the Direct Translated P2P section of the ACS
>>>>>>>>>>>>>> spec, though it's as prone to spoofing by the device as ATS.
>>>>>>>>>>>>>> In any case, p2p reflected from the IOMMU is still p2p and
>>>>>>>>>>>>>> offloads the CPU, even if bandwidth suffers vs bare metal
>>>>>>>>>>>>>> depending on whether the data doubles back over any links.
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sure, I was just saying that p2p via the IOMMU won't be as
>>>>>>>>>>>>> simple as a broadcast on the PCI bus; the IOMMU needs to be
>>>>>>>>>>>>> programmed in advance to make this work, and currently that
>>>>>>>>>>>>> broadcast won't work for the passed-through devices.
>>>>>>>>>>>>
>>>>>>>>>>>> Well, sure, p2p in a guest with passthrough devices clearly
>>>>>>>>>>>> needs to be translated through the IOMMU (and p2p from a
>>>>>>>>>>>> passthrough to an emulated device is essentially impossible).
>>>>>>>>>>>>
>>>>>>>>>>>> But.. what does that have to do with this code? This is the
>>>>>>>>>>>> memory area watcher, looking for memory regions being mapped
>>>>>>>>>>>> directly into the PCI space. NOT IOMMU regions, since those are
>>>>>>>>>>>> handled separately by wiring up the IOMMU notifier. This will
>>>>>>>>>>>> only trigger if RAM-like, non-RAM regions are put into PCI space
>>>>>>>>>>>> *not* behind an IOMMU.
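>>>>>>>>>>>>
>>>>>>>>>>>> (Roughly - a from-memory sketch, not the exact code in
>>>>>>>>>>>> hw/vfio/common.c; giommu stands in for the per-region
>>>>>>>>>>>> bookkeeping:)
>>>>>>>>>>>>
>>>>>>>>>>>>     static void vfio_listener_region_add(MemoryListener *listener,
>>>>>>>>>>>>                                          MemoryRegionSection *section)
>>>>>>>>>>>>     {
>>>>>>>>>>>>         if (memory_region_is_iommu(section->mr)) {
>>>>>>>>>>>>             /* guest IOMMU region: handled via the IOMMU notifier */
>>>>>>>>>>>>             memory_region_register_iommu_notifier(section->mr, &giommu->n);
>>>>>>>>>>>>             return;
>>>>>>>>>>>>         }
>>>>>>>>>>>>         /* otherwise a RAM-like region mapped straight into PCI space */
>>>>>>>>>>>>         ...
>>>>>>>>>>>>     }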
>>>>>>>>>>>
>>>>>>>>>>> Duh, sorry, realised I was mixing up host and guest IOMMU. I
>>>>>>>>>>> guess the point here is that this will map RAM-like devices into
>>>>>>>>>>> the host IOMMU when there is no guest IOMMU, allowing p2p
>>>>>>>>>>> transactions between passthrough devices (though not from
>>>>>>>>>>> passthrough to emulated devices).
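>>>>>>>>>>>
>>>>>>>>>>> (In vfio terms that's just an ordinary DMA map of the mmap'ed
>>>>>>>>>>> BAR - a sketch, with vaddr/iova/size/container_fd named purely
>>>>>>>>>>> for illustration:)
>>>>>>>>>>>
>>>>>>>>>>>     struct vfio_iommu_type1_dma_map map = {
>>>>>>>>>>>         .argsz = sizeof(map),
>>>>>>>>>>>         .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
>>>>>>>>>>>         .vaddr = (uintptr_t)vaddr, /* QEMU's mmap of the other device's BAR */
>>>>>>>>>>>         .iova  = iova,             /* where the guest sees that BAR */
>>>>>>>>>>>         .size  = size,
>>>>>>>>>>>     };
>>>>>>>>>>>     ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);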
>>>>>>>>>>
>>>>>>>>>> Correct.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The conditions still seem kind of awkward to me, but I guess it
>>>>>>>>>>> makes sense.
>>>>>>>>>>
>>>>>>>>>> Is it time to split this listener into a RAM listener and a PCI
>>>>>>>>>> bus listener?
>>>>>>>>>
>>>>>>>>> I'm not really sure what you mean by that.
>>>>>>>>>
>>>>>>>>>> On x86 it listens on the "memory" AS; on spapr, on the
>>>>>>>>>> "address@hidden" AS. This will just create more confusion over
>>>>>>>>>> time...
>>>>>>>>>
>>>>>>>>> Hm, it's still logically the same AS in each case - the PCI address
>>>>>>>>> space. It's just that in the x86 case that happens to be the same as
>>>>>>>>> the memory AS (at least when there isn't a guest side IOMMU).
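>>>>>>>>>
>>>>>>>>> (That's what QEMU's pci_device_iommu_address_space() expresses -
>>>>>>>>> roughly, from memory:)
>>>>>>>>>
>>>>>>>>>     AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>>>>>>>>     {
>>>>>>>>>         ...
>>>>>>>>>         /* no vIOMMU in the path: PCI DMA sees plain system memory */
>>>>>>>>>         return &address_space_memory;
>>>>>>>>>     }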
>>>>>>>>
>>>>>>>> Hm. Ok.
>>>>>>>>
>>>>>>>>> I do wonder if that's really right even for x86 without IOMMU. The
>>>>>>>>> PCI address space is identity mapped to the "main" memory address
>>>>>>>>> space there, but I'm not sure it's mapped th *all* of the main memory
>>>>>>>>>
>>>>>>>>
>>>>>>>> What should have been in the place of "th" above? :)
>>>>>>>>
>>>>>>>>> address space, probably just certain ranges of it. That's what I was
>>>>>>>>> suggesting earlier in the thread.
>>>>>>>>
>>>>>>>> afaict vfio is trying to mmap every RAM MR == KVM memory slot, no
>>>>>>>> ranges or anything like that. I am trying to figure out what is not
>>>>>>>> correct with the patch. Thanks,
>>>>>>>
>>>>>>> It is possible for x86 systems to have translation applied to MMIO vs
>>>>>>> RAM such that the processor view and device view of MMIO are different,
>>>>>>> putting them into separate address spaces, but this is not typical and
>>>>>>> not available on the class of chipsets that QEMU emulates for PC.
>>>>>>> Therefore I think it's correct that MMIO and RAM all live in one big
>>>>>>> happy, flat address space as they do now (assuming the guest IOMMU is
>>>>>>> not present or disabled). Thanks,
>>>>>>
>>>>>> Ah.. I think the thing I was missing is that on PC (at least with
>>>>>> typical chipsets) *all* MMIO essentially comes from PCI space. Even
>>>>>> the legacy devices are essentially ISA which is treated as being on a
>>>>>> bridge under the PCI space. On non-x86 there are often at least a
>>>>>> handful of MMIO devices that aren't within the PCI space - say, the
>>>>>> PCI host bridge itself at least. x86 avoids that by using the
>>>>>> separate IO space, which is also a bit weird in that it's
>>>>>> simultaneously direct attached to the cpu (and so doesn't need bridge
>>>>>> configuration), but also identified with the legacy IO space treated as
>>>>>> being under two bridges (host->PCI, PCI->ISA) for other purposes.
>>>>>>
>>>>>> Maybe it's just me, but I find it makes more sense if I think of
>>>>>> it as the two ASes being equal because on PC the system address map
>>>>>> (address_space_memory) is made equal to the PCI address map, rather
>>>>>> than the PCI address map being made equal to the system one.
>>>>>
>>>>> It makes more sense to me too; we just do not exploit/expose this on
>>>>> SPAPR - the PCI address space only has 2xIOMMU and 1xMSI windows and
>>>>> that's it, no MMIO that is mapped only to the system AS (adding those
>>>>> to the PCI AS is a little tricky, as mentioned elsewhere).
>>>>>
>>>>> Anyway, I still wonder - what needs to be addressed in the 2/4 patch
>>>>> in order to proceed?
>>>>
>>>> Ping?
>>>>
>>>
>>>
>>> Ping?
>>>
>>>
>>
>>
>> Could anyone please enlighten me as to what needs to be done with this
>> patchset to proceed, or confirm that it is ok but not going anywhere
>> because everybody is busy with other things? :) Thanks,
>
> Actually making sure it compiles would have been nice:
>
> qemu.git/hw/vfio/common.c: In function ‘vfio_listener_region_add’:
> qemu.git/hw/vfio/common.c:550:40: error: invalid operands to binary & (have ‘Int128 {aka struct Int128}’ and ‘hwaddr {aka long long unsigned int}’)
>         if ((iova & pgmask) || (llsize & pgmask)) {
>                                        ^
> qemu.git/hw/vfio/common.c: In function ‘vfio_listener_region_del’:
> qemu.git/hw/vfio/common.c:669:50: error: invalid operands to binary & (have ‘Int128 {aka struct Int128}’ and ‘hwaddr {aka long long unsigned int}’)
>         try_unmap = !((iova & pgmask) || (llsize & pgmask));
>                                                  ^
> Clearly llsize needs to be wrapped in int128_get64() here, as sketched
> below. Should I presume that testing of this patch is equally lacking?
> Eric, have you done any testing of this? I think this was fixing a
> mapping issue you reported. Thanks,
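>
> A minimal sketch of that fix (untested):
>
>     /* hw/vfio/common.c:550, vfio_listener_region_add() */
>     if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
>
>     /* hw/vfio/common.c:669, vfio_listener_region_del() */
>     try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));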
Not with that version, unfortunately. I don't have access to the HW on
which I had the issue anymore. Maybe I could by the beginning of next week?
Thanks
Eric
>
> Alex
>