[Qemu-devel] [PULL 33/62] spapr_iommu: Do not replay mappings from just created DMA window

qemu-devel

From:	David Gibson
Subject:	[Qemu-devel] [PULL 33/62] spapr_iommu: Do not replay mappings from just created DMA window
Date:	Tue, 12 Mar 2019 19:54:33 +1100

From: Alexey Kardashevskiy <address@hidden>

On sPAPR, vfio_listener_region_add() is called in two situations:
1. a new listener is registered from vfio_connect_container();
2. a new IOMMU Memory Region is added from rtas_ibm_create_pe_dma_window().

In both cases vfio_listener_region_add() calls
memory_region_iommu_replay() to notify newly registered IOMMU notifiers
about existing mappings, which is exactly what case 1 needs.

However, for case 2 it is nothing but a no-op: the window has just been
created and has no valid mappings, so replaying them does nothing.
This is barely noticeable with usual guests, but if the window happens to
be really big, such a no-op replay might take minutes and trigger RCU
stall warnings in the guest.

For example, an upcoming GPU RAM memory region mapped at 64TiB (right
after SPAPR_PCI_LIMIT) causes a 64bit DMA window to be at least 128TiB,
which is (128<<40)/0x10000=2,147,483,648 TCEs to replay.

This mitigates the problem by adding a "skipping_replay" flag to
sPAPRTCETable and defining sPAPR's own IOMMU MR replay() hook, which
does exactly the same thing as the generic one except that it returns
early if @skipping_replay==true.

Another way of fixing this would be to delay the replay until the very
first H_PUT_TCE, but this does not work if the in-kernel H_PUT_TCE
handler is enabled (a likely case).

When "ibm,create-pe-dma-window" is complete, the guest will map only
the required regions of the huge DMA window.

Signed-off-by: Alexey Kardashevskiy <address@hidden>
Message-Id: <address@hidden>
Signed-off-by: David Gibson <address@hidden>
---
 hw/ppc/spapr_iommu.c    | 31 +++++++++++++++++++++++++++++++
 hw/ppc/spapr_rtas_ddw.c | 10 ++++++++++
 include/hw/ppc/spapr.h  |  1 +
 3 files changed, 42 insertions(+)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 37e98f9321..8f231799b2 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -141,6 +141,36 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(IOMMUMemoryRegion *iommu,
     return ret;
 }
 
+static void spapr_tce_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
+{
+    MemoryRegion *mr = MEMORY_REGION(iommu_mr);
+    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
+    hwaddr addr, granularity;
+    IOMMUTLBEntry iotlb;
+    sPAPRTCETable *tcet = container_of(iommu_mr, sPAPRTCETable, iommu);
+
+    if (tcet->skipping_replay) {
+        return;
+    }
+
+    granularity = memory_region_iommu_get_min_page_size(iommu_mr);
+
+    for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
+        iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
+        if (iotlb.perm != IOMMU_NONE) {
+            n->notify(n, &iotlb);
+        }
+
+        /*
+         * if (2^64 - MR size) < granularity, it's possible to get an
+         * infinite loop here.  This should catch such a wraparound.
+         */
+        if ((addr + granularity) < addr) {
+            break;
+        }
+    }
+}
+
 static int spapr_tce_table_pre_save(void *opaque)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
@@ -659,6 +689,7 @@ static void spapr_iommu_memory_region_class_init(ObjectClass *klass, void *data)
     IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(klass);
 
     imrc->translate = spapr_tce_translate_iommu;
+    imrc->replay = spapr_tce_replay;
     imrc->get_min_page_size = spapr_tce_get_min_page_size;
     imrc->notify_flag_changed = spapr_tce_notify_flag_changed;
     imrc->get_attr = spapr_tce_get_attr;
diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
index cb8a410359..cc9d1f5c1c 100644
--- a/hw/ppc/spapr_rtas_ddw.c
+++ b/hw/ppc/spapr_rtas_ddw.c
@@ -171,8 +171,18 @@ static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
     }
 
     win_addr = (windows == 0) ? sphb->dma_win_addr : sphb->dma64_win_addr;
+    /*
+     * We have just created a window, so we know for a fact that it is empty.
+     * Use a hack to avoid iterating over the table, as it is quite possible
+     * to have billions of TCEs, all empty.
+     * Note that we cannot delay this to the first H_PUT_TCE as this hcall is
+     * most likely to be handled in KVM, so QEMU just does not know if it
+     * happened.
+     */
+    tcet->skipping_replay = true;
     spapr_tce_table_enable(tcet, page_shift, win_addr,
                            1ULL << (window_shift - page_shift));
+    tcet->skipping_replay = false;
     if (!tcet->nb_table) {
         goto hw_error_exit;
     }
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 1311ebe28e..f117a7ce6e 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -723,6 +723,7 @@ struct sPAPRTCETable {
     uint64_t *mig_table;
     bool bypass;
     bool need_vfio;
+    bool skipping_replay;
     int fd;
     MemoryRegion root;
     IOMMUMemoryRegion iommu;
-- 
2.20.1
