From: Eugenio Perez Martin
Subject: Re: [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration
Date: Tue, 5 Dec 2023 15:23:13 +0100

On Fri, Nov 3, 2023 at 9:19 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 11/2/2023 5:37 AM, Eugenio Perez Martin wrote:
> > On Thu, Nov 2, 2023 at 11:13 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
> >>> Current memory operations like pinning may take a lot of time at the
> >>> destination.  Currently they are done after the source of the migration is
> >>> stopped, and before the workload is resumed at the destination.  This is a
> >>> period where neither traffic can flow nor the VM workload can continue
> >>> (downtime).
> >>>
> >>> We can do better, as we know the memory layout of the guest RAM at the
> >>> destination from the moment the migration starts.  Moving that operation
> >>> allows QEMU to communicate the maps to the kernel while the workload is
> >>> still running in the source, so Linux can start mapping them.  Ideally, the
> >>> whole IOMMU is configured, but even if the vDPA parent driver uses an
> >>> on-chip IOMMU and .set_map we're still saving all the pinning time.
> >> I get what you want to say, though I'm not sure how pinning is relevant to
> >> on-chip IOMMU and .set_map here; essentially, pinning is required for all
> >> parent vDPA drivers that perform DMA and hence don't want VM pages to move
> >> around.
> > I was basically highlighting that the work done under .set_map is not only
> > pinning, but that pinning is a significant fraction of it. It can be
> > reworded or deleted for sure.
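
For context on what "communicating the maps" means mechanically: it boils down
to sending VHOST_IOTLB_UPDATE messages to the vhost-vdpa device fd, roughly as
in the sketch below (error handling trimmed; fd, iova, size and uaddr are
placeholders). The kernel pins the backing pages while handling each message,
and that is the expensive part we want off the downtime path.

/* Minimal sketch of how one map is communicated to the kernel, roughly what
 * QEMU's vhost_vdpa_dma_map() does today: build a VHOST_IOTLB_UPDATE message
 * and write it to the vhost-vdpa device fd. */
#include <linux/vhost.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

static int vdpa_map_one(int fd, uint64_t iova, uint64_t size,
                        void *uaddr, bool readonly)
{
    struct vhost_msg_v2 msg;

    memset(&msg, 0, sizeof(msg));
    msg.type = VHOST_IOTLB_MSG_V2;
    msg.iotlb.iova  = iova;
    msg.iotlb.size  = size;
    msg.iotlb.uaddr = (uint64_t)(uintptr_t)uaddr;
    msg.iotlb.perm  = readonly ? VHOST_ACCESS_RO : VHOST_ACCESS_RW;
    msg.iotlb.type  = VHOST_IOTLB_UPDATE;

    /* The kernel typically pins the pages behind [uaddr, uaddr + size)
     * while processing this message, which is what takes so long. */
    return write(fd, &msg, sizeof(msg)) == sizeof(msg) ? 0 : -1;
}
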
> >
> >>>
> >>> Note that further device setup at the end of the migration may alter the
> >>> guest memory layout.  But, as with the previous point, many operations are
> >>> still done incrementally, like memory pinning, so we're saving time anyway.
> >>>
> >>> The first bunch of patches just reorganizes the code so that memory-related
> >>> operation parameters are shared between all vhost_vdpa devices.  This is
> >>> because the destination does not know which vhost_vdpa struct will have the
> >>> registered listener member, so it is easier to place them in a shared
> >>> struct than to keep them in the vhost_vdpa struct.  Future versions may
> >>> squash or omit these patches.
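
For reviewers skimming the thread, this is roughly the shape the shared struct
takes after patches 1-13; field names follow the patch subjects and may differ
in the final version:

/*
 * Rough shape of the shared state after patches 1-13: everything that is
 * per-device (rather than per-vhost_vdpa instance) moves here, so a single
 * memory listener can be registered regardless of which vhost_vdpa ends up
 * owning it.
 */
#include "qemu/osdep.h"
#include "qemu/queue.h"
#include "exec/memory.h"                 /* MemoryListener */
#include "hw/virtio/vhost-iova-tree.h"   /* VhostIOVATree */
#include <linux/vhost.h>                 /* struct vhost_vdpa_iova_range */

typedef struct VhostVDPAShared {
    int device_fd;                        /* /dev/vhost-vdpa-N, one per device */
    MemoryListener listener;              /* single listener for the device */
    struct vhost_vdpa_iova_range iova_range;
    VhostIOVATree *iova_tree;             /* IOVA -> HVA translations for SVQ */
    QLIST_HEAD(, vdpa_iommu) iommu_list;
    uint64_t backend_cap;
    bool iotlb_batch_begin_sent;
    bool shadow_data;                     /* data virtqueues go through SVQ */
} VhostVDPAShared;
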
> >> It looks like this VhostVDPAShared facility (patches 1-13) is also what I
> >> need in my SVQ descriptor group series [*], for which I've built a similar
> >> construct. If possible, please try to merge this part ASAP; I'll rework my
> >> series on top of it.
> >>
> >> [*]
> >> https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
> >>
> > I can send it individually, for sure.
> >
> > MST, Jason, can this first part be merged? It doesn't add a lot by
> > itself but it helps pave the way for future changes.
> If it cannot, it doesn't matter. I can pick it from here and get my
> series posted with your patches 1-13 applied upfront. This should work,
> I think?
>
> >>>
> >>> Only tested with vdpa_sim. I'm sending this before a full benchmark, as
> >>> some work like [1] can be based on it, and Si-Wei agreed to benchmark this
> >>> series given his experience in this area.
> >> I haven't done the full benchmark against pre-mapping at the destination
> >> yet, though one observation is that the destination QEMU very easily gets
> >> stuck for a long time in the middle of pinning pages. During this period,
> >> any client issuing a read-only QMP query or executing an HMP info command
> >> is frozen indefinitely (for a duration that depends on how much memory is
> >> being pinned). Is it possible to unblock those QMP requests or HMP commands
> >> (at least the read-only ones) while in migration? Yield from the load_setup
> >> coroutine and spawn another thread?
> >>
> > Ok, I wasn't aware of that.
> >
> > I think we cannot yield in a coroutine and wait for an ioctl, since the
> > ioctl blocks the whole thread rather than just the coroutine.
> I was wondering if we need a separate coroutine outside the general
> migration path to support this special case without overloading load_setup
> or its callers. For instance, unblocking the source so it keeps sending
> guest RAM while allowing the destination to pin pages in parallel should be
> possible.
>

Hi Si-Wei,

I'm working on this; I think I'll be able to send a new version soon.
Just a question: when the mapping is done in vhost_vdpa_dev_start, as
current upstream master does it, are you able to interact with QMP?

Thanks!

> Regardless, a separate thread is needed to carry out all the heavy
> lifting, i.e. ioctl(2) or write(2) syscalls to map&pin pages.
>
>
> > One
> > option that came to my mind is to effectively use another thread, and
> > use a POSIX barrier (or equivalent on glib / QEMU) before finishing
> > the migration.
> Yes, a separate thread is needed anyway.
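
Right, something like the sketch below is what I have in mind, with QEMU's
event/thread helpers standing in for the POSIX barrier; the hook points and
names are made up, and the real version still has to propagate the worker's
errors into the migration state:

/* Sketch only: kick the expensive map/pin work to a worker thread at
 * .load_setup time, and wait for it at a later checkpoint (before the device
 * is started / the migration completes). */
#include "qemu/osdep.h"
#include "qemu/thread.h"

typedef struct MapJob {
    QemuThread thread;
    QemuEvent done;
    int ret;                      /* result of the dma_map calls */
} MapJob;

static void *map_worker(void *opaque)
{
    MapJob *job = opaque;

    /* ... issue the VHOST_IOTLB_UPDATE writes / ioctls here; this is the
     * part that pins guest memory and can take seconds ... */
    job->ret = 0;

    qemu_event_set(&job->done);   /* the "barrier" the migration waits on */
    return NULL;
}

static void map_job_start(MapJob *job)
{
    qemu_event_init(&job->done, false);
    qemu_thread_create(&job->thread, "vdpa-map", map_worker, job,
                       QEMU_THREAD_JOINABLE);
}

/* Called from the later checkpoint: block until the maps are in place and
 * propagate any failure so the migration can be cancelled. */
static int map_job_finish(MapJob *job)
{
    qemu_event_wait(&job->done);
    qemu_thread_join(&job->thread);
    qemu_event_destroy(&job->done);
    return job->ret;
}
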
>
> >   I'm not sure if there are more points where we can
> > check the barrier and tell the migration to continue or stop though.
> I think there are; for example, what if the dma_map fails? There must be a
> checkpoint for that.
>
> >
> > Another option is to effectively start doing these ioctls in an
> > asynchronous way, io_uring-cmd style, but I'd like to achieve this
> > first.
> Yes, io_uring or any async API could be another option, though this needs a
> new uAPI and additional kernel facilities to support it. Anyway, it's up to
> you to decide. :)
>
> Regards,
> -Siwei
> >> Having said that, I'm not sure if .load_setup is a good fit for what we
> >> want to do. Looking at all current users of .load_setup, either the job can
> >> be done instantly or the task is time-bound without trapping into the
> >> kernel for too long. Maybe pinning is too special a use case here...
> >>
> >> -Siwei
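
To make it concrete, the registration side I'm thinking of (patch 18) is a
SaveVMHandlers with only .load_setup set, registered from virtio-net, roughly
as below; function and section names are illustrative and may not match the
series exactly:

/* Sketch of the registration: the destination gets a hook that runs when the
 * incoming migration stream is being set up, long before the device starts. */
#include "qemu/osdep.h"
#include "migration/register.h"

static int virtio_net_load_setup(QEMUFile *f, void *opaque)
{
    /* Start mapping/pinning guest memory for the vhost-vdpa backend here,
     * e.g. through the NetClient load_setup callback (patch 16), while the
     * source is still running and sending RAM. */
    return 0;
}

static SaveVMHandlers savevm_virtio_net_handlers = {
    .load_setup = virtio_net_load_setup,
};

static void virtio_net_register_migration(void *opaque)
{
    register_savevm_live("virtio-net-early", 0, 1,
                         &savevm_virtio_net_handlers, opaque);
}
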
> >>>
> >>> Future directions on top of this series may include:
> >>>
> >>> * Iterative migration of virtio-net devices, as it may reduce downtime
> >>>   per [1].  vhost-vdpa net can apply the configuration through CVQ in the
> >>>   destination while the source is still migrating.
> >>>
> >>> * Move more things ahead of migration time, like DRIVER_OK.
> >>>
> >>> * Check that the devices at the destination are valid, and cancel the
> >>>   migration in case they are not.
> >>>
> >>> [1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566f61@nvidia.com/T/
> >>>
> >>> Eugenio Pérez (18):
> >>>     vdpa: add VhostVDPAShared
> >>>     vdpa: move iova tree to the shared struct
> >>>     vdpa: move iova_range to vhost_vdpa_shared
> >>>     vdpa: move shadow_data to vhost_vdpa_shared
> >>>     vdpa: use vdpa shared for tracing
> >>>     vdpa: move file descriptor to vhost_vdpa_shared
> >>>     vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
> >>>     vdpa: move backend_cap to vhost_vdpa_shared
> >>>     vdpa: remove msg type of vhost_vdpa
> >>>     vdpa: move iommu_list to vhost_vdpa_shared
> >>>     vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
> >>>     vdpa: use dev_shared in vdpa_iommu
> >>>     vdpa: move memory listener to vhost_vdpa_shared
> >>>     vdpa: do not set virtio status bits if unneeded
> >>>     vdpa: add vhost_vdpa_load_setup
> >>>     vdpa: add vhost_vdpa_net_load_setup NetClient callback
> >>>     vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
> >>>     virtio_net: register incremental migration handlers
> >>>
> >>>    include/hw/virtio/vhost-vdpa.h |  43 +++++---
> >>>    include/net/net.h              |   4 +
> >>>    hw/net/virtio-net.c            |  23 +++++
> >>>    hw/virtio/vdpa-dev.c           |   7 +-
> >>>    hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
> >>>    net/vhost-vdpa.c               | 127 ++++++++++++-----------
> >>>    hw/virtio/trace-events         |  14 +--
> >>>
> >>>    7 files changed, 239 insertions(+), 162 deletions(-)
>



