qemu-ppc
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-ppc] [PATCH qemu v3 4/6] spapr_iommu: Do not replay mappings f


From: David Gibson
Subject: Re: [Qemu-ppc] [PATCH qemu v3 4/6] spapr_iommu: Do not replay mappings from just created DMA window
Date: Tue, 5 Mar 2019 14:28:46 +1100
User-agent: Mutt/1.11.3 (2019-02-01)

On Thu, Feb 28, 2019 at 04:37:25PM +1100, Alexey Kardashevskiy wrote:
> On 28/02/2019 14:49, David Gibson wrote:
> > On Thu, Feb 28, 2019 at 10:59:56AM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 28/02/2019 01:33, Greg Kurz wrote:
> >>> On Wed, 27 Feb 2019 19:51:47 +1100
> >>> Alexey Kardashevskiy <address@hidden> wrote:
> >>>
> >>>> On sPAPR vfio_listener_region_add() is called in 2 situations:
> >>>> 1. a new listener is registered from vfio_connect_container();
> >>>> 2. a new IOMMU Memory Region is added from 
> >>>> rtas_ibm_create_pe_dma_window().
> >>>>
> >>>> In both cases vfio_listener_region_add() calls
> >>>> memory_region_iommu_replay() to notify newly registered IOMMU notifiers
> >>>> about existing mappings which is totally desirable for case 1.
> >>>>
> >>>> However for case 2 it is nothing but noop as the window has just been
> >>>> created and has no valid mappings so replaying those does not do 
> >>>> anything.
> >>>> It is barely noticeable with usual guests but if the window happens to be
> >>>> really big, such no-op replay might take minutes and trigger RCU stall
> >>>> warnings in the guest.
> >>>>
> >>>> For example, a upcoming GPU RAM memory region mapped at 64TiB (right
> >>>> after SPAPR_PCI_LIMIT) causes a 64bit DMA window to be at least 128TiB
> >>>> which is (128<<40)/0x10000=2.147.483.648 TCEs to replay.
> >>>>
> >>>> This mitigates the problem by adding an "skipping_replay" flag to
> >>>> sPAPRTCETable and defining sPAPR own IOMMU MR replay() hook which does
> >>>> exactly the same thing as the generic one except it returns early if
> >>>> @skipping_replay==true.
> >>>>
> >>>> When "ibm,create-pe-dma-window" is complete, the guest will map only
> >>>> required regions of the huge DMA window.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <address@hidden>
> >>>> ---
> >>>>  include/hw/ppc/spapr.h  |  1 +
> >>>>  hw/ppc/spapr_iommu.c    | 31 +++++++++++++++++++++++++++++++
> >>>>  hw/ppc/spapr_rtas_ddw.c |  7 +++++++
> >>>>  3 files changed, 39 insertions(+)
> >>>>
> >>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >>>> index 86b0488..358bb38 100644
> >>>> --- a/include/hw/ppc/spapr.h
> >>>> +++ b/include/hw/ppc/spapr.h
> >>>> @@ -727,6 +727,7 @@ struct sPAPRTCETable {
> >>>>      uint64_t *mig_table;
> >>>>      bool bypass;
> >>>>      bool need_vfio;
> >>>> +    bool skipping_replay;
> >>>>      int fd;
> >>>>      MemoryRegion root;
> >>>>      IOMMUMemoryRegion iommu;
> >>>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >>>> index 37e98f9..8f23179 100644
> >>>> --- a/hw/ppc/spapr_iommu.c
> >>>> +++ b/hw/ppc/spapr_iommu.c
> >>>> @@ -141,6 +141,36 @@ static IOMMUTLBEntry 
> >>>> spapr_tce_translate_iommu(IOMMUMemoryRegion *iommu,
> >>>>      return ret;
> >>>>  }
> >>>>  
> >>>> +static void spapr_tce_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier 
> >>>> *n)
> >>>> +{
> >>>> +    MemoryRegion *mr = MEMORY_REGION(iommu_mr);
> >>>> +    IOMMUMemoryRegionClass *imrc = 
> >>>> IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
> >>>> +    hwaddr addr, granularity;
> >>>> +    IOMMUTLBEntry iotlb;
> >>>> +    sPAPRTCETable *tcet = container_of(iommu_mr, sPAPRTCETable, iommu);
> >>>> +
> >>>> +    if (tcet->skipping_replay) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    granularity = memory_region_iommu_get_min_page_size(iommu_mr);
> >>>> +
> >>>> +    for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
> >>>> +        iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, 
> >>>> n->iommu_idx);
> >>>> +        if (iotlb.perm != IOMMU_NONE) {
> >>>> +            n->notify(n, &iotlb);
> >>>> +        }
> >>>> +
> >>>> +        /*
> >>>> +         * if (2^64 - MR size) < granularity, it's possible to get an
> >>>> +         * infinite loop here.  This should catch such a wraparound.
> >>>> +         */
> >>>> +        if ((addr + granularity) < addr) {
> >>>> +            break;
> >>>> +        }
> >>>> +    }
> >>>> +}
> >>>
> >>> It is a bit unfortunate to duplicate all that code. What about making
> >>> a memory_region_iommu_replay_generic() helper out of it and call it
> >>> from spapr_tce_replay() and memory_region_iommu_replay() ?
> >>
> >>
> >> I really do not want to mess with generic code to solve our local sPAPR
> >> problem, especially when there is a way not to do so.
> > 
> > Well, the thing is, I think we're actually the only user of the
> > current generic replay - everything else has more efficient structure
> > aware replay hooks AFAIK.  Which makes this hack even hackier.
> 
> If that so, then we are better off removing that loop from
> memory_region_iommu_replay() at all rather than keeping it generic.

Well.. maybe.  In theory this logic should work, albeit slowly, for
any IOMMU implementation, it's just that everyone else has more
efficient implementations at the moment.

> >> And as a next step, I was thinking of removing (i.e. making this replay
> >> a no-op) from QEMU later and do replay in KVM instead when an IOMMU
> >> group is attaching to KVM as this is the only case when we need replay
> >> and KVM has a lot better idea what TCEs are actually valid and can skip
> >> most of them. This is a bit bigger thing as it requires a KVM capability
> >> "KVM replays mappings" but when we get it, spapr_tce_replay() will
> >> become no-op.
> > 
> > That's a good idea longer term.
> > 
> >>> Apart from that, LGTM.
> >>
> >> Well. It is a hack, I just do not have taste to tell how nasty it is
> >> :)
> > 
> > As an interim step until the kernel change, I think we can do a bit
> > better than this.  First, as Greg suggests we should have the
> > "generic" replay be a helper and have the spapr one call that with a
> > little in the way of extra checking.
> > 
> > Second, rather than having an explicit "skip_replay" flag, what we
> > really want here is to have the replay be a fast no-op if there are no
> > existing mappings rather than a slow no-op.  So instead I think we
> > should have a flag which records if any mappings have been made in the
> > region yet, initialized to false.
> > The new replay would do nothing if
> > it's still false.
> 
> If QEMU controlled the mappings - sure. But it does not - KVM does it
> instead via that fast path. So QEMU does not know if there are mappings
> until it reads all TCEs from mmap'ed KVM TCE table which will fault in
> all these pages.

Oops, I forgot that; good point.

> We could implement some tricks such are allow reading (or ioctl) from
> that KVM TCE fd and it could tell what is mapped and what is not in a
> very condensed format (for example a bit per every 256MB of the guest
> address space)  ooooor  implement different behavior for mapping with RW
> or readonly - the latter would fail if there is no backing page
> allocated yet - and then QEMU could skip these regions when replaying.

Urgh, yeah, this is trickier than I initially thought.  About the
cleanest approach I can see is to delay allocation of the IOMMU table
until the first H_PUT_TCE, and only then actually bind the table to
KVM and allow for H_PUT_TCE acceleration.

Seems pretty awkward, though.  Ok, your hack seems the best way
forward in the short to medium term, just make sure there are some
comments there explaining why the hack is valuable.


-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]