Re: [Qemu-ppc] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug

From: Alexey Kardashevskiy
Subject: Re: [Qemu-ppc] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug
Date: Mon, 13 Jul 2015 17:06:55 +1000
User-agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.0.1

On 07/13/2015 12:41 AM, Michael Roth wrote:
Quoting Alexey Kardashevskiy (2015-07-11 23:59:45)
On 07/11/2015 07:33 AM, Michael Roth wrote:
Quoting Alexey Kardashevskiy (2015-07-05 21:11:06)
The sPAPR IOMMU manages two copies of a TCE table:
1) a guest view of the table - this is what emulated devices use and
this is where H_GET_TCE reads from;
2) a hardware TCE table - only present if there is at least one vfio-pci
device on a PHB; it is updated via a memory listener on a PHB address
space which forwards map/unmap requests to the vfio-pci IOMMU host driver.

At the moment the presence of vfio-pci devices on a bus affects the way
the guest view table is allocated. If there is no vfio-pci device on a PHB
and the host kernel supports KVM acceleration of H_PUT_TCE, the table
is allocated in KVM. However, if there is a vfio-pci device and we do not
yet support KVM acceleration for these, the table has to be allocated
by userspace.

When a vfio-pci device is hotplugged and there were no vfio-pci devices
before, the guest view table could have been allocated by KVM, which
means that H_PUT_TCE is handled by the host kernel; and since we
do not support vfio-pci in KVM, the hardware table will not be updated.

This reallocates the guest view table in QEMU if the first vfio-pci
device has just been plugged. spapr_tce_realloc_userspace() handles this.

This replays all the mappings to make sure that the tables are in sync.
This will not have a visible effect, though, as for a new device
the guest kernel will allocate-and-map new addresses, so
existing mappings from emulated devices will not be used by vfio-pci.

This adds calls to spapr_phb_dma_capabilities_update() in the PCI hotplug path.

Signed-off-by: Alexey Kardashevskiy <address@hidden>
* removed unnecessary memory_region_del_subregion() and
memory_region_add_subregion() as
"vfio: Unregister IOMMU notifiers when container is destroyed" removes
notifiers in a more correct way

* spapr_phb_hotplug_dma_sync() enumerates TCE tables explicitly rather than
via object_child_foreach()
* spapr_phb_hotplug_dma_sync() does memory_region_del_subregion() +
memory_region_add_subregion() as otherwise vfio_listener_region_del() is not
called and we end up with vfio_iommu_map_notify registered twice (comments 
if we do hotplug+hotunplug+hotplug of the same device.
* moved spapr_phb_hotplug_dma_sync() on the unplug event to RCU as, before calling
spapr_phb_hotplug_dma_sync(), we need VFIO to release the container; otherwise
spapr_phb_dma_capabilities_update() will decide that the PHB still has VFIO devices.
The actual VFIO PCI device release happens from RCU, and since we add ours later,
it gets executed later and we are good.
   hw/ppc/spapr_iommu.c        | 51 
   hw/ppc/spapr_pci.c          | 47 +++++++++++++++++++++++++++++++++++++++++
   include/hw/pci-host/spapr.h |  1 +
   include/hw/ppc/spapr.h      |  2 ++
   trace-events                |  2 ++
   5 files changed, 100 insertions(+), 3 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 45c00d8..2d99c3b 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -78,12 +78,13 @@ static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
                                          uint32_t nb_table,
                                          uint32_t page_shift,
                                          int *fd,
-                                       bool vfio_accel)
+                                       bool vfio_accel,
+                                       bool force_userspace)
       uint64_t *table = NULL;
       uint64_t window_size = (uint64_t)nb_table << page_shift;

-    if (kvm_enabled() && !(window_size >> 32)) {
+    if (kvm_enabled() && !force_userspace && !(window_size >> 32)) {
           table = kvmppc_create_spapr_tce(liobn, window_size, fd, vfio_accel);

@@ -222,7 +223,8 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, 
bool vfio_accel)
-                                        vfio_accel);
+                                        vfio_accel,
+                                        false);

                              (uint64_t)tcet->nb_table << tcet->page_shift);
@@ -495,6 +497,49 @@ int spapr_dma_dt(void *fdt, int node_off, const char 
       return 0;

+static int spapr_tce_do_replay(sPAPRTCETable *tcet, uint64_t *table)
+    target_ulong ioba = tcet->bus_offset, pgsz = (1ULL << tcet->page_shift);
+    long i, ret = 0;
+    for (i = 0; i < tcet->nb_table; ++i, ioba += pgsz) {
+        ret = put_tce_emu(tcet, ioba, table[i]);
+        if (ret) {
+            break;
+        }
+    }
+    return ret;
+int spapr_tce_replay(sPAPRTCETable *tcet)
+    return spapr_tce_do_replay(tcet, tcet->table);
+int spapr_tce_realloc_userspace(sPAPRTCETable *tcet, bool replay)
+    int ret = 0, oldfd;
+    uint64_t *oldtable;
+    oldtable = tcet->table;
+    oldfd = tcet->fd;
+    tcet->table = spapr_tce_alloc_table(tcet->liobn,
+                                        tcet->nb_table,
+                                        tcet->page_shift,
+                                        &tcet->fd,
+                                        false,
+                                        true); /* force_userspace */
+    if (replay) {
+        ret = spapr_tce_do_replay(tcet, oldtable);
+    }
+    spapr_tce_free_table(oldtable, oldfd, tcet->nb_table);
+    return ret;
   int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
                         sPAPRTCETable *tcet)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 76c988f..d1fa157 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -827,6 +827,43 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
       return 0;

+static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
+    int ret = 0, i;
+    bool had_vfio = sphb->has_vfio;
+    sPAPRTCETable *tcet;
+    spapr_phb_dma_capabilities_update(sphb);

So, in the unplug case, we update caps, but has_vfio = false so we don't do
anything else below.


Does that mean our KVM-accelerated TCE table won't get restored until reboot?
Would it make sense to re-enable it here?

No, it should be re-enabled, as the DMA config is completely reset during the
machine reset by "[PATCH qemu v10 08/14] spapr_pci: Do complete reset of
DMA config when resetting PHB"

We don't get a PHB-level reset for PCI hotplug though, so it wouldn't
get re-enabled till guest system reset.

Yes, this is what I said :)

I'm not sure how big a deal that
is performance-wise, but it seems a little unexpected.


+    if (!had_vfio && sphb->has_vfio) {
+        for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
+            tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
+            if (!tcet || !tcet->enabled) {
+                continue;
+            }
+            if (tcet->fd >= 0) {
+                /*
+                 * We got first vfio-pci device on accelerated table.
+                 * VFIO acceleration is not possible.
+                 * Reallocate table in userspace and replay mappings.
+                 */
+                ret = spapr_tce_realloc_userspace(tcet, true);
+                trace_spapr_pci_dma_realloc_update(tcet->liobn, ret);
+            } else {
+                /* There was no acceleration, so just replay mappings. */
+                ret = spapr_tce_replay(tcet);
+                trace_spapr_pci_dma_update(tcet->liobn, ret);
+            }
+            if (ret) {
+                break;
+            }
+        }
+        return ret;
+    }
+    return 0;
   /* Macros to operate with address in OF binding to PCI */
   #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
   #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
@@ -1106,6 +1143,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector 
               error_setg(errp, "Failed to create pci child device tree node");
               goto out;
+        spapr_phb_hotplug_dma_sync(phb);

       drck->attach(drc, DEVICE(pdev),
@@ -1116,6 +1154,12 @@ out:

+static void spapr_phb_remove_sync_dma(struct rcu_head *head)
+    sPAPRPHBState *sphb = container_of(head, sPAPRPHBState, rcu);
+    spapr_phb_hotplug_dma_sync(sphb);
   static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
       /* some version guests do not wait for completion of a device
@@ -1130,6 +1174,9 @@ static void spapr_phb_remove_pci_device_cb(DeviceState 
*dev, void *opaque)
+    /* Actual VFIO device release happens from RCU so postpone DMA update */
+    call_rcu1(&((sPAPRPHBState *)opaque)->rcu, spapr_phb_remove_sync_dma);

Hmm... can't think of any reason this wouldn't work, but would be nice
if there was something a bit more straightforward...

When the device is actually finalized, it does:

The problem is with "when". I looked in gdb: this vfio_instance_finalize()
is called from an RCU handler, because the last reference is dropped when
some memory region is removed and that removal is postponed to RCU.

If object_unparent(OBJECT(dev)) did call vfio_put_group() on the same
stack, I would not need this call_rcu1.

Right, object_unparent() has no guarantee of immediately finalizing it,
but you *do* have the guarantee that
vfio_instance_finalize()->vfio_put_group() will only be called once the
device is actually finalized, regardless of whether or not it's kicked
off by the RCU thread. So it seems more straightforward to hook into
that rather than needing to employ internal knowledge of

Right... I'll try adding a hook and see what review I'll receive.

    static void vfio_instance_finalize(Object *obj)
        PCIDevice *pci_dev = PCI_DEVICE(obj);
        VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pci_dev);
        VFIOGroup *group = vdev->vbasedev.group;



When all the groups are removed from a VFIO container, there's a
call to container->iommu_data.release(container). This is the
event we really care about, not so much the fact that a device
got released.

Right now all it does is remove the memory listener, but maybe it
makes sense to allow an additional callback/opaque to register for
the event. Not sure what the best way to do that is, though...

In this context I rather care about the container's fd being closed, so
VFIO_IOMMU_SPAPR_TCE_GET_INFO would fail in my dma-sync and this way I know
that there are no more VFIO devices.

VFIO container getting closed also corresponds to the last group being
removed though. Even if it didn't, I think VFIO_IOMMU_SPAPR_TCE_GET_INFO
would fail unless at least one iommu group was attached to the
container? So knowing when the first/last group is removed seems to
be the real main event.

Right. This is still VFIO knowledge which the PHB does not have access to.
We will need these new hooks.

And, kind of a separate topic, but if we could do something
similar for the initial group attach,  we could drop *all* the
plug/unplug hooks, and the hooks themselves could drop all
the !had_vfio / has_vfio logic/probing, since that would then
be clear from the context.

Drop all hooks? HotplugHandlerClass hooks? Can you do that? :) Aren't they
what HMP calls on "device_add"?

I mean all the places we call into code that ends up doing:
   /* do something special if has_vfio changed */

We currently have one in PCI plug, PCI unplug, and PHB reset. If we
hooked into vfio_put_group(), we could drop each of those hooks. PHB
reset would still have the special case of restoring default 32-bit
window config, but it wouldn't need to care about has_vfio status
anymore, all that code could be handled by
vfio_put_group/vfio_get_group callbacks.

I wouldn't hold up the series for it, but I think it would greatly
simplify tracking has_vfio changes.

It is now 2 series and I can try what you suggest.

But I do think spapr_phb_hotplug_dma_sync() should have some logic
squashed in for re-enabling TCE acceleration on
(had_vfio && !has_vfio).

