qemu-ppc

Re: [Qemu-ppc] [PATCH v5 07/11] vfio: Add guest side IOMMU support


From: Alex Williamson
Subject: Re: [Qemu-ppc] [PATCH v5 07/11] vfio: Add guest side IOMMU support
Date: Mon, 31 Mar 2014 13:54:10 -0600

On Fri, 2014-03-28 at 15:49 +1100, Alexey Kardashevskiy wrote:
> On 03/22/2014 01:17 AM, Alex Williamson wrote:
> > On Fri, 2014-03-21 at 18:59 +1100, Alexey Kardashevskiy wrote:
> >> On 03/20/2014 06:57 AM, Alex Williamson wrote:
> >>> On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
> >>>> From: David Gibson <address@hidden>
> >>>>
> >>>> This patch uses the new IOMMU notifiers to allow VFIO pass through devices
> >>>> to work with guest side IOMMUs, as long as the host-side VFIO iommu has
> >>>> sufficient capability and granularity to match the guest side. This works
> >>>> by tracking all map and unmap operations on the guest IOMMU using the
> >>>> notifiers, and mirroring them into VFIO.
> >>>>
> >>>> There are a number of FIXMEs, and the scheme involves rather more notifier
> >>>> structures than I'd like, but it should make for a reasonable proof of
> >>>> concept.
> >>>>
> >>>> Signed-off-by: David Gibson <address@hidden>
> >>>> Signed-off-by: Alexey Kardashevskiy <address@hidden>
> >>>>
> >>>> ---
> >>>> Changes:
> >>>> v4:
> >>>> * fixed list objects naming
> >>>> * vfio_listener_region_add() reworked to call memory_region_ref() from one
> >>>> place only, it is also easier to review the changes
> >>>> * fixed the boundary check not to fail on sections of 2^64 bytes,
> >>>> the "vfio: Fix debug output for int128 values" patch is required;
> >>>> this obsoletes the "[PATCH v3 0/3] vfio: fixes for better support
> >>>> for 128 bit memory section sizes" patch proposal
> >>>> ---
> >>>>  hw/misc/vfio.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> >>>>  1 file changed, 120 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> >>>> index 038010b..4f6f5da 100644
> >>>> --- a/hw/misc/vfio.c
> >>>> +++ b/hw/misc/vfio.c
> >>>> @@ -159,10 +159,18 @@ typedef struct VFIOContainer {
> >>>>          };
> >>>>          void (*release)(struct VFIOContainer *);
> >>>>      } iommu_data;
> >>>> +    QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> >>>>      QLIST_HEAD(, VFIOGroup) group_list;
> >>>>      QLIST_ENTRY(VFIOContainer) next;
> >>>>  } VFIOContainer;
> >>>>  
> >>>> +typedef struct VFIOGuestIOMMU {
> >>>> +    VFIOContainer *container;
> >>>> +    MemoryRegion *iommu;
> >>>> +    Notifier n;
> >>>> +    QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
> >>>> +} VFIOGuestIOMMU;
> >>>> +
> >>>>  /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
> >>>>  typedef struct VFIOMSIXInfo {
> >>>>      uint8_t table_bar;
> >>>> @@ -2241,8 +2249,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> >>>>  
> >>>>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>>>  {
> >>>> -    return !memory_region_is_ram(section->mr) ||
> >>>> -           /*
> >>>> +    return (!memory_region_is_ram(section->mr) &&
> >>>> +            !memory_region_is_iommu(section->mr)) ||
> >>>> +        /*
> >>>
> >>> White space damage
> >>>
> >>>>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
> >>>>              * addresses in the upper part of the 64-bit address space.  These
> >>>>              * are never accessed by the CPU and beyond the address width of
> >>>> @@ -2251,6 +2260,61 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>>>             section->offset_within_address_space & (1ULL << 63);
> >>>>  }
> >>>>  
> >>>> +static void vfio_iommu_map_notify(Notifier *n, void *data)
> >>>> +{
> >>>> +    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> >>>> +    VFIOContainer *container = giommu->container;
> >>>> +    IOMMUTLBEntry *iotlb = data;
> >>>> +    MemoryRegion *mr;
> >>>> +    hwaddr xlat;
> >>>> +    hwaddr len = iotlb->addr_mask + 1;
> >>>> +    void *vaddr;
> >>>> +    int ret;
> >>>> +
> >>>> +    DPRINTF("iommu map @ %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
> >>>> +            iotlb->iova, iotlb->iova + iotlb->addr_mask);
> >>>> +
> >>>> +    /*
> >>>> +     * The IOMMU TLB entry we have just covers translation through
> >>>> +     * this IOMMU to its immediate target.  We need to translate
> >>>> +     * it the rest of the way through to memory.
> >>>> +     */
> >>>> +    mr = address_space_translate(&address_space_memory,
> >>>> +                                 iotlb->translated_addr,
> >>>> +                                 &xlat, &len, iotlb->perm & IOMMU_WO);
> >>>
> >>> Write-only?  Is this supposed to be read-write to mask just 2 bits?
> >>
> >>
> >> The last parameter of address_space_translate() is bool is_write, so I do
> >> not really understand the problem here.
> > 
> > Oops, my bad, I didn't look at what address_space_translate() used that
> > for.  Ok.
> > 
> >>>> +    if (!memory_region_is_ram(mr)) {
> >>>> +        DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n",
> >>>> +                xlat);
> >>>> +        return;
> >>>> +    }
> >>>> +    if (len & iotlb->addr_mask) {
> >>>> +        DPRINTF("iommu has granularity incompatible with target AS\n");
> >>>
> >>> Is this possible?  Assuming len is initially a power-of-2, would the
> >>> translate function change it?  Maybe worth a comment to explain.
> >>
> >>
> >> Oh. address_space_translate() actually changes @len to min(len,
> >> TARGET_PAGE_SIZE), and TARGET_PAGE_SIZE is hardcoded to 4K. So far that was
> >> fine, but lately I have been implementing a huge DMA window (plus one
> >> sPAPRTCETable and one VFIOGuestIOMMU object) which currently operates with
> >> 16MB pages (it can do 64K pages too), and now this "granularity
> >> incompatible" message is happening.
> >>
> >> I have disabled that check for now, but I need to think of a better fix...
> >>
> >> Adding Paolo to cc; maybe he can pick up the context and give a good piece
> >> of advice :)
> >>
> >>
> >>
> >>>
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    vaddr = memory_region_get_ram_ptr(mr) + xlat;
> >>>
> >>> This lookup isn't free and the unmap path doesn't need it, maybe move
> >>> the variable and lookup into the first branch below?
> >>>
> >>>> +
> >>>> +    if (iotlb->perm != IOMMU_NONE) {
> >>>> +        ret = vfio_dma_map(container, iotlb->iova,
> >>>> +                           iotlb->addr_mask + 1, vaddr,
> >>>> +                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
> >>>> +        if (ret) {
> >>>> +            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> >>>> +                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
> >>>> +                         container, iotlb->iova,
> >>>> +                         iotlb->addr_mask + 1, vaddr, ret);
> >>>> +        }
> >>>> +    } else {
> >>>> +        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
> >>>> +        if (ret) {
> >>>> +            error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> >>>> +                         "0x%"HWADDR_PRIx") = %d (%m)",
> >>>> +                         container, iotlb->iova,
> >>>> +                         iotlb->addr_mask + 1, ret);
> >>>> +        }
> >>>> +    }
> >>>> +}
> >>>> +
> >>>>  static void vfio_listener_region_add(MemoryListener *listener,
> >>>>                                       MemoryRegionSection *section)
> >>>>  {
> >>>> @@ -2261,8 +2325,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>>>      void *vaddr;
> >>>>      int ret;
> >>>>  
> >>>> -    assert(!memory_region_is_iommu(section->mr));
> >>>> -
> >>>>      if (vfio_listener_skipped_section(section)) {
> >>>>          DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
> >>>>                  section->offset_within_address_space,
> >>>> @@ -2286,15 +2348,47 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>>>          return;
> >>>>      }
> >>>>  
> >>>> +    memory_region_ref(section->mr);
> >>>> +
> >>>> +    if (memory_region_is_iommu(section->mr)) {
> >>>> +        VFIOGuestIOMMU *giommu;
> >>>> +
> >>>> +        DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
> >>>> +                iova, int128_get64(int128_sub(llend, int128_one())));
> >>>> +        /*
> >>>> +         * FIXME: We should do some checking to see if the
> >>>> +         * capabilities of the host VFIO IOMMU are adequate to model
> >>>> +         * the guest IOMMU
> >>>> +         *
> >>>> +         * FIXME: This assumes that the guest IOMMU is empty of
> >>>> +         * mappings at this point - we should either enforce this, or
> >>>> +         * loop through existing mappings to map them into VFIO.
> >>>> +         *
> >>>> +         * FIXME: For VFIO iommu types which have KVM acceleration to
> >>>> +         * avoid bouncing all map/unmaps through qemu this way, this
> >>>> +         * would be the right place to wire that up (tell the KVM
> >>>> +         * device emulation the VFIO iommu handles to use).
> >>>> +         */
> >>>
> >>> That's a lot of FIXMEs...  The second one in particular looks like it
> >>> needs to expand a bit on why this is likely a valid assumption.  The
> >>> last one is more of a TODO than a FIXME.
> >>>
> >>>> +        giommu = g_malloc0(sizeof(*giommu));
> >>>> +        giommu->iommu = section->mr;
> >>>> +        giommu->container = container;
> >>>> +        giommu->n.notify = vfio_iommu_map_notify;
> >>>> +        QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> >>>> +        memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> >>>> +
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    /* Here we assume that memory_region_is_ram(section->mr)==true */
> >>>> +
> >>>>      end = int128_get64(llend);
> >>>>      vaddr = memory_region_get_ram_ptr(section->mr) +
> >>>>              section->offset_within_region +
> >>>>              (iova - section->offset_within_address_space);
> >>>>  
> >>>> -    DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
> >>>> +    DPRINTF("region_add [ram] %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
> >>>>              iova, end - 1, vaddr);
> >>>>  
> >>>> -    memory_region_ref(section->mr);
> >>>>      ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
> >>>>      if (ret) {
> >>>>          error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> >>>> @@ -2338,6 +2432,26 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>>>          return;
> >>>>      }
> >>>>  
> >>>> +    if (memory_region_is_iommu(section->mr)) {
> >>>> +        VFIOGuestIOMMU *giommu;
> >>>> +
> >>>> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> >>>> +            if (giommu->iommu == section->mr) {
> >>>> +                memory_region_unregister_iommu_notifier(&giommu->n);
> >>>> +                QLIST_REMOVE(giommu, giommu_next);
> >>>> +                g_free(giommu);
> >>>> +                break;
> >>>> +            }
> >>>> +        }
> >>>> +
> >>>> +        /*
> >>>> +         * FIXME: We assume the one big unmap below is adequate to
> >>>> +         * remove any individual page mappings in the IOMMU which
> >>>> +         * might have been copied into VFIO.  That may not be true for
> >>>> +         * all IOMMU types
> >>>> +         */
> >>>
> >>> We assume this because the IOVA that gets unmapped is the same
> >>> regardless of whether a guest IOMMU is present?
> >>
> >>
> >> What exactly is meant by "guest IOMMU is present"? While working on the second
> >> DMA window, I am now really confused about the terminology :(
> > 
> > The confusion for me is that region_add initializes the giommu, and all
> > the DMA mapping through VFIO is done in the notifier for the giommu.
> > It's therefore asymmetric that region_add doesn't vfio_dma_map anything,
> > but region_del does vfio_dma_unmap, which is the basis of my question.
> > I thought this comment was trying to address why that is, but apparently
> > it's something else entirely, so it would be nice to understand why this
> > path doesn't just return, and to decode a bit more clearly what the FIXME
> > is trying to say.  Thanks,
> 
> I do not mind extending that comment, but I need some assistance.
> 
> My understanding is that:
> ===
> region_del is called on memory region removal, so this particular window is
> not going to be used anymore, and this is the only place where such cleanup
> could be done.
> ===
> 
> There is asymmetry here, but it does not look terrible.

The asymmetry may well be fine if properly documented, and region_del is
a fine place to tear down and unmap a guest IOMMU that was instantiated
via region_add.  The problem is validating and clarifying the assumption
that the FIXME is making for future support.  The life cycle as I
understand it is:

region_add
 - create giommu and register notifier

notifier
 - adds and removes giommu mappings

region_del
 - unregister notifier, cleanup, and free giommu

As part of that cleanup we need to make sure there are no outstanding
mappings.  We do that by unmapping the entire MemoryRegionSection.  So
what exactly are the assumptions that let us do this?  One seems to be
that the IOMMU will unmap any sub-mappings fully overlapped by a larger
range, which is what David identified.  Another seems to be that there's
no address translation of the IOVA and thus the range is the same
regardless of whether a guest IOMMU is present, which is what I tried to
identify but David nak'd.
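
To make that concrete, here is a rough sketch of the guest IOMMU teardown in
vfio_listener_region_del() as this patch has it, with the first assumption
spelled out as a comment (variable names as in the patch; the trailing unmap
is the existing common path, shown only for context):

    if (memory_region_is_iommu(section->mr)) {
        VFIOGuestIOMMU *giommu;

        /* Stop mirroring guest IOMMU map/unmap events into VFIO */
        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
            if (giommu->iommu == section->mr) {
                memory_region_unregister_iommu_notifier(&giommu->n);
                QLIST_REMOVE(giommu, giommu_next);
                g_free(giommu);
                break;
            }
        }

        /*
         * Assumption: the single vfio_dma_unmap() below, covering the
         * whole section, is enough to remove every per-page mapping the
         * notifier mirrored into VFIO, i.e. the host IOMMU backend unmaps
         * sub-mappings that are fully overlapped by a larger range.
         */
    }

    /* Existing common path: one big unmap of the whole section */
    ret = vfio_dma_unmap(container, iova, end - iova);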

Personally I think the FIXME here and the one where we create the giommu
and assume it has no mappings show some immaturity in the QEMU IOMMU
API.  I'd expect it to behave more like a MemoryListener, replaying all
mappings on register and removing them all on unregister.
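
Just to illustrate what I mean by replaying on register (purely a sketch, not
existing API: the helper name is made up, and the walk via the region's
translate callback at a fixed page_size granule is an assumption):

/*
 * Hypothetical replay helper: walk the guest IOMMU region and feed every
 * currently valid translation to a freshly registered notifier, so that
 * registering behaves like a MemoryListener starting from a complete view
 * of the address space.
 */
static void iommu_replay_to_notifier(MemoryRegion *mr, Notifier *n,
                                     hwaddr page_size)
{
    hwaddr addr;

    for (addr = 0; addr < memory_region_size(mr); addr += page_size) {
        IOMMUTLBEntry iotlb = mr->iommu_ops->translate(mr, addr, false);

        if (iotlb.perm != IOMMU_NONE) {
            /* Mirror the pre-existing mapping, e.g. into VFIO */
            n->notify(n, &iotlb);
        }
    }
}

Thanks,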

Alex
