Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier

From:	Si-Wei Liu
Subject:	Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
Date:	Mon, 13 Feb 2023 14:36:33 -0800
User-agent:	Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.6.1



On 2/13/2023 1:47 AM, Eugenio Perez Martin wrote:

On Sat, Feb 4, 2023 at 3:04 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:



On 2/2/2023 7:28 AM, Eugenio Perez Martin wrote:

On Thu, Feb 2, 2023 at 2:53 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:


On 1/12/2023 9:24 AM, Eugenio Pérez wrote:

This allows net to restart the device backend to configure SVQ on it.

Ideally, these changes should not be net specific. However, the vdpa net
backend is the one with enough knowledge to configure everything because
of some reasons:
* Queues might need to be shadowed or not depending on its kind (control
     vs data).
* Queues need to share the same map translations (iova tree).

Because of that it is cleaner to restart the whole net backend and
configure again as expected, similar to how vhost-kernel moves between
userspace and passthrough.

If more kinds of devices need dynamic switching to SVQ we can create a
callback struct like VhostOps and move most of the code there.
VhostOps cannot be reused since all vdpa backend share them, and to
personalize just for networking would be too heavy.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
    net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
    1 file changed, 84 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 5d7ad6e4d7..f38532b1df 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -26,6 +26,8 @@
    #include <err.h>
    #include "standard-headers/linux/virtio_net.h"
    #include "monitor/monitor.h"
+#include "migration/migration.h"
+#include "migration/misc.h"
    #include "migration/blocker.h"
    #include "hw/virtio/vhost.h"

@@ -33,6 +35,7 @@
    typedef struct VhostVDPAState {
        NetClientState nc;
        struct vhost_vdpa vhost_vdpa;
+    Notifier migration_state;
        Error *migration_blocker;
        VHostNetState *vhost_net;

@@ -243,10 +246,86 @@ static VhostVDPAState 
*vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
        return DO_UPCAST(VhostVDPAState, nc, nc0);
    }

+static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
+{
+    struct vhost_vdpa *v = &s->vhost_vdpa;
+    VirtIONet *n;
+    VirtIODevice *vdev;
+    int data_queue_pairs, cvq, r;
+    NetClientState *peer;
+
+    /* We are only called on the first data vqs and only if x-svq is not set */
+    if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
+        return;
+    }
+
+    vdev = v->dev->vdev;
+    n = VIRTIO_NET(vdev);
+    if (!n->vhost_started) {
+        return;
+    }
+
+    if (enable) {
+        ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
+    }
+    data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
+    cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
+                                  n->max_ncs - n->max_queue_pairs : 0;
+    vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
+
+    peer = s->nc.peer;
+    for (int i = 0; i < data_queue_pairs + cvq; i++) {
+        VhostVDPAState *vdpa_state;
+        NetClientState *nc;
+
+        if (i < data_queue_pairs) {
+            nc = qemu_get_peer(peer, i);
+        } else {
+            nc = qemu_get_peer(peer, n->max_queue_pairs);
+        }
+
+        vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
+        vdpa_state->vhost_vdpa.shadow_data = enable;
+
+        if (i < data_queue_pairs) {
+            /* Do not override CVQ shadow_vqs_enabled */
+            vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
+        }
+    }
+
+    r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);

As the first revision, this method (vhost_net_stop followed by
vhost_net_start) should be fine for software vhost-vdpa backend for e.g.
vp_vdpa and vdpa_sim_net. However, I would like to get your attention
that this method implies substantial blackout time for mode switching on
real hardware - get a full cycle of device reset of getting memory
mappings torn down, unpin & repin same set of pages, and set up new
mapping would take very significant amount of time, especially for a
large VM. Maybe we can do:

Right, I think this is something that deserves optimization in the future.

Note that we must replace the mappings anyway, with all passthrough
queues stopped.

Yes, unmap and remap is needed indeed. I haven't checked, does shadow vq
keep mapping to the same GPA where passthrough data virtqueues were
associated with across switch (so that the mode switch is transparent to
the guest)?

I don't get this question, SVQ switching is already transparent to the guest.

Never mind, you seem to have answered the question in the reply here andbelow. I was thinking of possibility to do incremental in-place updatefor a given IOVA range with one single call (for the on-chip IOMMUcase), instead of separate unmap() and map() calls. Things like.set_map_replace(vdpa, asid, iova_start, size, iotlb_new_maps) as I evermentioned.

For platform IOMMU the mapping and remapping cost is
inevitable, though I wonder for the on-chip IOMMU case could it take
some fast path to just replace IOVA in place without destroying and
setting up all mapping entries, if the same GPA is going to be used for
the data rings (copy Eli for his input).

   This is because SVQ vrings are not in the guest space.
The pin can be skipped though, I think that's a low hand fruit here.

Yes, that's right. For a large VM pining overhead usually overweighs the
mapping cost. It would be a great amount of time saving if pin can be
skipped.

That is doable using dma_map/unmap apis instead of set_map (or
comparing in set_map) and allocation GPA translations in advance.

Is there a way for a driver to use both dma_map()/unmap() and set_map()APIs at the same time? Seems not possible for the moment. And batchingis currently unsupported on dma_map()/unmap().

Not sure how mapping could be decoupled from pinning as the current uAPI(VHOST_IOTLB_UPDATE and VHOST_IOTLB_INVALIDATE) have both, i.e. it's noteasy to tear them apart. If we agree pinning is not needed, perhaps wecould add a new uAPI to just remap the IOVA ranges for data VQaddresses, and get around any code path involving page pinning. Underthe hood at the driver API level, in case of general platform IOMMU,iommu_unmap() and iommu_map() can be used; in case of on-chip IOMMU,vdpa kernel would just call the new driver API .set_map_replace() toupdate the relevant IOVA mappings in place, without having to rebuildthe entire iova tree.

If any, we can track guest's IOVA and add SVQ vrings in a hole. If
guest's IOVA layout changes, we can translate it then to a new
location. That way we only need one map operation in the worst case.
I'm omitting the lookup time here, but I still should be worth it.

But as you mention I think it is not worth complicating this series
and we can think about it on top.

Yes, agreed. I'll just let you aware of the need of this optimization
for real hardware device.

   We can start building it on top of
your suggestions for sure.

1) replace reset with the RESUME feature that was just added to the
vhost-vdpa ioctls in kernel

We cannot change vring addresses just with a SUSPEND / RESUME.

I wonder if we can make SUSPEND (via some flag or new backend feature is
fine) accept updating internal state like the vring addresses, while
defer applying it to the device until RESUME? That way we don't lose a
lot of other states that otherwise would need to re-instantiate at large
with _F_RING_RESET or device reset.

If that helps, that can be done for sure.

As another idea, we could do the reverse and allow _F_RING_RESET to
not to forget the parameters unless they're explicitly overriden.

Hmmm, this might need spec extension as that's not the currentexpectation for _F_RING_RESET so far as I understand. Once ring isreset, all parameters associated with the ring are forgotten.

I think I prefer your idea in  SUSPEND / RESUME cycle, but just wanted
to put that possibility on the table if that makes more sense.

Yea may be via a new per-vq suspend feature: _F_RING_STOP.

We could do it with the VIRTIO_F_RING_RESET feature though. Would it
be advantageous to the device?

2) add new vdpa ioctls to allow iova range rebound to new virtual
address for QEMU's shadow vq or back to device's vq

Actually, if the device supports ASID we can allocate ASID 1 to that
purpose. At this moment only CVQ vrings and control buffers are there
when the device is passthrough.

Yep, we can get SVQ mapping pre-cooked in another ASID before dismantle
the mapping for the passthrough VQ. This will help the general IOMMU case.

But this doesn't solve the problem if we need to send all SVQ
translation to the device on-chip IOMMU, doesn't it? We must clear all
of it and send the new one to the device anyway.

3) use a light-weighted sequence of suspend+rebind+resume to switch mode
on the fly instead of getting through the whole reset+restart cycle

I think this is the same as 1, isn't it?

I mean do all three together: 1,2 in kernel and 3 in QEMU.

Ok I missed that in my first read, thanks!

But I feel 2 should be easier to do in qemu.

I don't really know how this helps in the general IOMMU case, I'm
assuming the IOMMU does not support PASID or similar tricks. Is that
because of the vhost_iotlb population or is there anything else I'm
missing?

A new uAPI (more precisely, iotlb message) is needed to get around ofpage pinning at least. Or if not specifically tied to onchip IOMMU, wecan make it two separate uAPIs for UNMAP and MAP, respectively.

I suspect the same idea could even be used to address high live
migration downtime seen on hardware vdpa device. What do you think?

I think this is a great start for sure! Some questions:
a) Is the time on reprogramming on-chip IOMMU comparable to program
regular IOMMU?

I would think this largely depends on the hardware implementation of
on-chip IOMMU, the performance characteristics of which is very device
specific. Some times driver software implementation and API for on-chip
MMU also matters. Which would require vendor specific work to optimize
based on the specific use case.

Got it.

   If it is the case it should be easier to find vdpa
devices with support for _F_RESET soon.
b) Not to merge on master, but it is possible to add an artificial
delay on vdpa_sim that simulates the properties of the delay of IOMMU?
In that line, have you observed if it is linear with the size of the
memory, with the number of maps, other factors..?

As I said this is very device specific and hard to quantify, but I agree
it's a good idea to simulate the delay and measure the effect. For the
on-chip MMU device I'm looking, large proportion of the time was spent
on software side in allocating a range of memory for hosting mapping
entries (don't know how to quantify this part but the allocation time is
not a constant nor linear to the size of memory), walking all iotlb
entries passed down from vdpa layer and building corresponding memory
key objects for a range of pages. For each iotlb entry the time to build
memory mapping looks grow linearly with the size of memory. Not sure if
there's room to improve, I'll let the owner to clarify.

So I think all of these are great ideas.

If we state the pin & unpin huts latency in the switching I think the
easiest way to start is:
* To start with qemu and send all the map / unmap in a batch

By map / unmap, you are referring to the uAPIs (VHOST_IOTLB_UPDATE andVHOST_IOTLB_INVALIDATE), not the driver level .dma_map/unmap() kernelAPIs, right? yes it's always good to commit all map / unmap transactionsat once in a batch.

* Avoid the pin / unpin in the kernel using a smarter algorithm for
that, not unpinning regions that it is going to pin again.

This seems to change the uAPI behavior underneath. Maybe cleaner to getit done with new uAPI.


Regards,
-Siwei


What do you think?

Thanks!

Thanks,
-Siwei

Thanks!

Thanks,
-Siwei

+    if (unlikely(r < 0)) {
+        error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
+    }
+}
+
+static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
+{
+    MigrationState *migration = data;
+    VhostVDPAState *s = container_of(notifier, VhostVDPAState,
+                                     migration_state);
+
+    switch (migration->state) {
+    case MIGRATION_STATUS_SETUP:
+        vhost_vdpa_net_log_global_enable(s, true);
+        return;
+
+    case MIGRATION_STATUS_CANCELLING:
+    case MIGRATION_STATUS_CANCELLED:
+    case MIGRATION_STATUS_FAILED:
+        vhost_vdpa_net_log_global_enable(s, false);
+        return;
+    };
+}
+
    static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
    {
        struct vhost_vdpa *v = &s->vhost_vdpa;

+    if (v->feature_log) {
+        add_migration_state_change_notifier(&s->migration_state);
+    }
+
        if (v->shadow_vqs_enabled) {
            v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
                                               v->iova_range.last);
@@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc)

        assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);

+    if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) {
+        remove_migration_state_change_notifier(&s->migration_state);
+    }
+
        dev = s->vhost_vdpa.dev;
        if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
            g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
@@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState 
*peer,
        s->vhost_vdpa.device_fd = vdpa_device_fd;
        s->vhost_vdpa.index = queue_pair_index;
        s->always_svq = svq;
+    s->migration_state.notify = vdpa_net_migration_state_notifier;
        s->vhost_vdpa.shadow_vqs_enabled = svq;
        s->vhost_vdpa.iova_range = iova_range;
        s->vhost_vdpa.shadow_data = svq;

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier, Si-Wei Liu, 2023/02/01
- Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier, Eugenio Perez Martin, 2023/02/02
  - Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier, Si-Wei Liu, 2023/02/03
    - Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier, Eugenio Perez Martin, 2023/02/13
    - Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier, Si-Wei Liu <=
    - Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier, Eugenio Perez Martin, 2023/02/14
- RE: [RFC v2 11/13] vdpa: add vdpa net migration state notifier, Eli Cohen, 2023/02/12

Prev by Date: Re: [PULL 00/10] xenpvh machine
Next by Date: Re: [PATCH 1/2] log: Add separate debug option for logging invalid memory accesses
Previous by thread: Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
Next by thread: Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
Index(es):
- Date
- Thread