Re: [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration

From: Eugenio Perez Martin
Subject: Re: [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration
Date: Tue, 5 Dec 2023 15:23:13 +0100
On Fri, Nov 3, 2023 at 9:19 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 11/2/2023 5:37 AM, Eugenio Perez Martin wrote:
> > On Thu, Nov 2, 2023 at 11:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> >>> Current memory operations like pinning may take a lot of time at the
> >>> destination. Currently they are done after the source of the migration is
> >>> stopped, and before the workload is resumed at the destination. This is a
> >>> period where neither traffic can flow nor the VM workload can continue
> >>> (downtime).
> >>>
> >>> We can do better, as we know the memory layout of the guest RAM at the
> >>> destination from the moment the migration starts. Moving that operation
> >>> allows QEMU to communicate the maps to the kernel while the workload is
> >>> still running on the source, so Linux can start mapping them. Ideally, all
> >>> of the IOMMU is configured, but even if the vDPA parent driver uses an
> >>> on-chip IOMMU and .set_map we are still saving all the pinning time.
> >> I get what you want to say, though I'm not sure how pinning is relevant to
> >> an on-chip IOMMU and .set_map here; essentially, pinning is required for
> >> all parent vdpa drivers that perform DMA and hence don't want VM pages to
> >> move around.
> > Basically highlighting that the work done under .set_map is not only
> > pinning, but that pinning is a significant fraction of it. It can be
> > reworded or deleted for sure.
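
For context on what that map work consists of, here is a minimal sketch of the
per-region VHOST_IOTLB_UPDATE message QEMU writes to the vhost-vdpa device fd
(illustrative only, not the code from this series; error handling omitted). The
parent driver pins the pages on this path, and programs the on-chip IOMMU if it
uses .set_map:

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <linux/vhost.h>
#include <linux/vhost_types.h>

/*
 * Sketch of the per-region map operation the series wants to issue at
 * .load_setup time: one VHOST_IOTLB_UPDATE message per guest memory
 * section, written to the vhost-vdpa device fd.
 */
static int vdpa_map_region(int device_fd, uint64_t iova, uint64_t size,
                           void *host_addr)
{
    struct vhost_msg_v2 msg;

    memset(&msg, 0, sizeof(msg));
    msg.type = VHOST_IOTLB_MSG_V2;
    msg.iotlb.iova = iova;
    msg.iotlb.size = size;
    msg.iotlb.uaddr = (uint64_t)(uintptr_t)host_addr;
    msg.iotlb.perm = VHOST_ACCESS_RW;
    msg.iotlb.type = VHOST_IOTLB_UPDATE;

    if (write(device_fd, &msg, sizeof(msg)) != sizeof(msg)) {
        return -1; /* errno set by write(2) */
    }
    return 0;
}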
> >
> >>>
> >>> Note that further device setup at the end of the migration may alter the
> >>> guest memory layout. But, same as the previous point, many operations are
> >>> still done incrementally, like memory pinning, so we're saving time anyway.
> >>>
> >>> The first bunch of patches just reorganizes the code, so memory related
> >>> operation parameters are shared between all vhost_vdpa devices. This is
> >>> because the destination does not know which vhost_vdpa struct will have the
> >>> registered listener member, so it is easier to place them in a shared
> >>> struct than to keep them in the vhost_vdpa struct. Future versions may
> >>> squash or omit these patches.
> >> It looks like this VhostVDPAShared facility (patches 1-13) is also what I
> >> need in my SVQ descriptor group series [*], for which I've built a similar
> >> construct. If possible, please try to merge this in ASAP; I'll rework my
> >> series on top of that.
> >>
> >> [*]
> >> https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
> >>
> > I can send it individually, for sure.
> >
> > MST, Jason, can this first part be merged? It doesn't add a lot by
> > itself but it helps pave the way for future changes.
> If it cannot, it doesn't matter. I can pick it from here and get my
> series posted with your patches 1-13 applied upfront. This should work,
> I think?
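
As a rough sketch of what such a shared struct could look like (field names are
inferred from the patch subjects listed further below, not necessarily the ones
used in the series, and it relies on QEMU-internal types):

/*
 * Illustrative sketch only: state that all vhost_vdpa instances of the
 * same device share, guessed from the subjects of patches 1-13 below.
 * Assumes QEMU-internal types (MemoryListener, VhostIOVATree, QLIST_HEAD)
 * and <linux/vhost_types.h> for struct vhost_vdpa_iova_range.
 */
typedef struct VhostVDPAShared {
    int device_fd;                     /* /dev/vhost-vdpa-N file descriptor */
    struct vhost_vdpa_iova_range iova_range;
    VhostIOVATree *iova_tree;          /* IOVA -> HVA translations for SVQ */
    MemoryListener listener;           /* single memory listener per device */
    bool iotlb_batch_begin_sent;
    uint64_t backend_cap;
    bool shadow_data;
    QLIST_HEAD(, vdpa_iommu) iommu_list;
} VhostVDPAShared;

struct vhost_vdpa {
    VhostVDPAShared *shared;           /* one per device, not per vhost_vdpa */
    /* ... per-vhost-device state stays here ... */
};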
>
> >>>
> >>> Only tested with vdpa_sim. I'm sending this before a full benchmark, as
> >>> some work like [1] can be based on it, and Si-Wei agreed to benchmark this
> >>> series drawing on his experience.
> >> I haven't done the full benchmark compared to pre-mapping at the
> >> destination yet, though one observation is that the destination QEMU can
> >> easily get stuck for a very long time while in the middle of pinning pages.
> >> During this period, any client issuing a read-only QMP query or executing
> >> an HMP info command gets frozen indefinitely (subject to how much memory is
> >> being pinned). Is it possible to unblock those QMP requests or HMP commands
> >> (at least the read-only ones) while in migration? Yield from the load_setup
> >> coroutine and spawn another thread?
> >>
> > Ok, I wasn't aware of that.
> >
> > I think we cannot yield in a coroutine and wait for an ioctl.
> I was wondering if we need a separate coroutine outside the general
> migration path to support this special code without overloading
> load_setup or its callers. For instance, unblocking the source from
> sending guest RAM while allowing the destination to pin pages in
> parallel should be possible.
>
Hi Si-Wei,

I'm working on this; I think I'll be able to send a new version soon.

Just a question: when the mapping is done in vhost_vdpa_dev_start, as the
current upstream master does, are you able to interact with QMP?

Thanks!
> Regardless, a separate thread is needed to carry out all the heavy
> lifting, i.e. the ioctl(2) or write(2) syscalls to map and pin pages.
>
>
> > One
> > option that came to my mind is to effectively use another thread, and
> > wait on a POSIX barrier (or the glib / QEMU equivalent) before finishing
> > the migration.
> Yes, a separate thread is needed anyway.
>
> > I'm not sure if there are more points where we can
> > check the barrier and tell the migration to continue or stop, though.
> I think there are; e.g., what if the dma_map fails? There must be a
> checkpoint for that.
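
To make the thread-plus-checkpoint idea concrete, a minimal plain-pthread
sketch (names are made up, a join is used instead of a barrier for simplicity,
and this is not QEMU's thread or migration API): start the blocking map work
from a load_setup-style hook, then check its result before the device is
started, so a dma_map failure can still cancel the migration:

#include <pthread.h>
#include <stdio.h>

/* Hypothetical context; in QEMU this would live in the vdpa net state. */
struct map_job {
    pthread_t thread;
    int error;              /* first dma_map failure, 0 on success */
    int (*map_all)(void *); /* does the blocking write()/ioctl() calls */
    void *opaque;
};

static void *map_worker(void *arg)
{
    struct map_job *job = arg;

    /* Heavy lifting happens here, off the main/migration thread. */
    job->error = job->map_all(job->opaque);
    return NULL;
}

/* Called from a .load_setup-like hook: start mapping, do not block. */
static int map_job_start(struct map_job *job)
{
    job->error = 0;
    return pthread_create(&job->thread, NULL, map_worker, job);
}

/*
 * Called right before the device is started at the end of migration:
 * wait for the worker and propagate any dma_map failure so the
 * migration can be cancelled instead of starting a half-mapped device.
 */
static int map_job_finish(struct map_job *job)
{
    pthread_join(job->thread, NULL);
    if (job->error) {
        fprintf(stderr, "early map failed: %d\n", job->error);
    }
    return job->error;
}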
>
> >
> > Another option is to effectively start doing these ioctls in an
> > asynchronous way, io_uring-cmd style, but I'd like to achieve this
> > first.
> Yes, io_uring or any async API could be another option, though this
> needs a new uAPI and additional kernel facilities to support it. Anyway,
> it's up to you to decide. :)
>
> Regards,
> -Siwei
> >> Having said that, I'm not sure if .load_setup is a good fit for what we
> >> want to do. Looking at all the current users of .load_setup, either the
> >> job can be done instantly or the task is time-bound without trapping into
> >> the kernel for too long. Maybe pinning is too special a use case here...
> >>
> >> -Siwei
> >>>
> >>> Future directions on top of this series may include:
> >>>
> >>> * Iterative migration of virtio-net devices, as it may reduce downtime
> >>>   per [1]. vhost-vdpa net can apply the configuration through CVQ in the
> >>>   destination while the source is still migrating.
> >>> * Move more things ahead of migration time, like DRIVER_OK.
> >>> * Check that the devices at the destination are valid, and cancel the
> >>>   migration in case they are not.
> >>>
> >>> [1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
> >>>
> >>> Eugenio Pérez (18):
> >>>   vdpa: add VhostVDPAShared
> >>>   vdpa: move iova tree to the shared struct
> >>>   vdpa: move iova_range to vhost_vdpa_shared
> >>>   vdpa: move shadow_data to vhost_vdpa_shared
> >>>   vdpa: use vdpa shared for tracing
> >>>   vdpa: move file descriptor to vhost_vdpa_shared
> >>>   vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
> >>>   vdpa: move backend_cap to vhost_vdpa_shared
> >>>   vdpa: remove msg type of vhost_vdpa
> >>>   vdpa: move iommu_list to vhost_vdpa_shared
> >>>   vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
> >>>   vdpa: use dev_shared in vdpa_iommu
> >>>   vdpa: move memory listener to vhost_vdpa_shared
> >>>   vdpa: do not set virtio status bits if unneeded
> >>>   vdpa: add vhost_vdpa_load_setup
> >>>   vdpa: add vhost_vdpa_net_load_setup NetClient callback
> >>>   vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
> >>>   virtio_net: register incremental migration handlers
> >>>
> >>>  include/hw/virtio/vhost-vdpa.h |  43 +++++---
> >>>  include/net/net.h              |   4 +
> >>>  hw/net/virtio-net.c            |  23 +++++
> >>>  hw/virtio/vdpa-dev.c           |   7 +-
> >>>  hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
> >>>  net/vhost-vdpa.c               | 127 ++++++++++++-----------
> >>>  hw/virtio/trace-events         |  14 +--
> >>>  7 files changed, 239 insertions(+), 162 deletions(-)
> >>>
>