Re: [Qemu-devel] host side todo list for virtio rdma
From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] host side todo list for virtio rdma
Date: Wed, 19 Jul 2017 11:55:50 +0100
User-agent: Mutt/1.8.3 (2017-05-23)
* Michael S. Tsirkin (address@hidden) wrote:
> Here are some thoughts on bits that are still missing to get a working
> virtio-rdma, with some suggestions. These are very preliminary but I
> feel I kept these in my head (and discussed offline) for too long. All
> of the below is just my personal humble opinion.
>
> Feature Requirements:
>
> The basic requirement is to be able to do RDMA to/from
> VM memory, with support for VM migration and/or memory
> overcommit and/or autonuma and/or THP.
> Why are migration/overcommit/autonuma required?
> Without these, you can do RDMA with device passthrough,
> with likely better performance.
Is this solution usable on a system without host-RDMA hardware?
i.e. just to run RDMA between two VMs on the same host
without using something like SoftROCE on the host?
> Feature Non-requirements:
>
> It's not a requirement to support RDMA without VM exits,
> e.g. like with device passthrough. While avoiding exits improves
> performance, it would be handy for more than just RDMA,
> so there seems no reason to require it for RDMA when we
> do not have it for e.g. networking.
>
> Assumptions:
>
> It's OK to assume specific hardware capabilities at least initially.
>
> High level architecture:
>
> Follows the same lines as most other virtio devices:
>
> +-----------------------------------+
> |                                   |
> |           guest kernel            |
> |                 ^                 |
> +-----------------|-----------------+
> |                 v                 |
> |     host kernel (kvm, vhost)      |
> |                 ^                 |
> +-----------------|-----------------+
> |                 v                 |
> | host userspace (QEMU, vhost-user) |
> +-----------------------------------+
>
> Each request is forwarded by the host kernel to QEMU,
> which executes it using the ibverbs library.
Should that be 'forwarded by the guest kernel'?
Is there guest userspace involved here as well? Most
RDMA NICs seem to have a userspace component.
> Most of this should be implementable host-side using existing
> software. However, several issues remain and would need
> infrastructure changes, as outlined below.
>
> Host-side todo list for virtio-rdma support:
>
> - Memory registration for guest userspace.
>
> The register memory region verb accepts a single virtual address,
> which supplies both the on-wire key for access and the
> range of memory to access. The guest kernel turns this into a
> list of pages (e.g. via get_user_pages); when forwarded to the host
> this becomes an s/g list of virtual addresses in the QEMU address space.
>
> Suggestion: add a new verb, along the lines of ibv_register_physical,
> which splits these two parameters, accepting the on-wire VA key
> and separately a list of userspace virtual address/size pairs.
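For illustration, a rough sketch of what such a verb could look like.
ibv_register_physical and the range struct below are hypothetical,
invented here to match the description; only ibv_pd/ibv_mr and the
access flags are existing libibverbs types:

#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Hypothetical verb: the on-wire VA/key is supplied separately from
 * the memory to be mapped, which is a scatter/gather list of QEMU
 * virtual address ranges rather than one contiguous range as in
 * ibv_reg_mr(). */
struct ibv_phys_range {
    void   *addr;    /* host (QEMU) virtual address */
    size_t  length;  /* length of this range in bytes */
};

struct ibv_mr *ibv_register_physical(struct ibv_pd *pd,
                                     uint64_t wire_va, /* VA/key seen on the wire */
                                     struct ibv_phys_range *ranges,
                                     int num_ranges,
                                     int access);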
>
> - Memory registration for guest kernels.
>
> Another ability used by some in-kernel users is registering all of memory.
> Ranges not actually present are never accessed - this is OK as
> kernel users are trusted. Memory hotplug changes which ranges
> are present.
>
> Suggestion: add some throw-away memory and map all
> non-present ranges there. Add ibv_reregister_physical_mr or similar
> API to update mappings on guest memory hotplug/unplug.
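Continuing the hypothetical sketch above, this is the reregistration
call QEMU could issue on a guest memory hotplug/unplug event; the name
comes from the suggestion, the signature is invented:

/* Hypothetical: update an existing MR's mappings in place, pointing
 * now-absent ranges at shared throw-away memory, so the on-wire key
 * stays valid across guest memory hotplug/unplug. */
int ibv_reregister_physical_mr(struct ibv_mr *mr,
                               struct ibv_phys_range *ranges,
                               int num_ranges);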
>
> - Memory overcommit/autonuma/THP.
>
> This includes techniques such as swap, KSM, COW, and page migration.
> All these rely on the ability to move pages around without
> breaking hardware access.
>
> Suggestion: for hardware that supports it,
> enabling on-demand paging for all registered memory seems
> to address the issue more or less transparently to guests.
> This isn't supported by all hardware but might be
> at least a reasonable first step.
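For reference, this is roughly what ODP registration looks like with
today's libibverbs (these are real calls and flags, error handling kept
minimal); whether it is sufficient here is exactly the open question:

#include <stddef.h>
#include <infiniband/verbs.h>

/* Register 'buf' with on-demand paging so the HCA faults pages in
 * instead of requiring them to be pinned; returns NULL if the
 * device lacks ODP support. */
static struct ibv_mr *reg_odp_mr(struct ibv_context *ctx,
                                 struct ibv_pd *pd,
                                 void *buf, size_t len)
{
    struct ibv_device_attr_ex attr;

    if (ibv_query_device_ex(ctx, NULL, &attr) ||
        !(attr.odp_caps.general_caps & IBV_ODP_SUPPORT))
        return NULL; /* no ODP: caller must fall back to pinning */

    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_ON_DEMAND);
}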
>
> - Migration: memory tracking.
>
> Migration requires detecting hardware access to pages
> either on write (pre-copy) or any access (post-copy).
> Post-copy just requires ODP support to work properly
> with userfaultfd.
Can you explain what ODP support is?
> Pre-copy would require a write-tracking API along
> the lines of the one exposed by KVM or vhost.
> Each tracked page would be write-protected, causing faults
> on hardware access; on a hardware write a fault is generated
> and recorded, and the page is made writeable again.
Can you write-protect like that from the RDMA hardware?
I'd be surprised if the hardware was happy with that.
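For comparison, the KVM API alluded to above looks like this from
userspace (KVM_GET_DIRTY_LOG is real); a per-MR verb along these lines
is what pre-copy would need from the RDMA stack, assuming the hardware
can fault on writes at all:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Fetch the dirty-page bitmap for one memory slot: pages written
 * since the previous call are marked, and KVM write-protects them
 * again so the next call sees fresh writes. */
static int get_dirty_bitmap(int vm_fd, int slot, void *bitmap)
{
    struct kvm_dirty_log log;

    memset(&log, 0, sizeof(log));
    log.slot = slot;
    log.dirty_bitmap = bitmap;
    return ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
}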
> - Migration: moving QP numbers.
>
> QP numbers are exposed on the wire and so must move together
> with the VM.
>
> Suggestion: allow specifying QP number when creating a QP.
> To avoid conflicts between multiple users, an initial version can
> limit the library to a single user per device. Multiple VMs can
> simply attach to distinct VFs.
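A hypothetical sketch of that suggestion; today ibv_create_qp() hands
back a kernel-chosen qp_num, so both the struct and the function below
are invented for illustration:

#include <stdint.h>
#include <infiniband/verbs.h>

/* Hypothetical: let the caller pin the QP number at creation time,
 * so a migrated VM can recreate its QPs with the numbers already
 * exposed on the wire. */
struct ibv_qp_init_attr_qpn {
    struct ibv_qp_init_attr base; /* existing creation attributes */
    uint32_t                qpn;  /* requested on-wire QP number */
};

struct ibv_qp *ibv_create_qp_with_qpn(struct ibv_pd *pd,
                                      struct ibv_qp_init_attr_qpn *attr);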
>
> - Migration: moving QP state.
>
> When migrating the VM, a QP has to be torn down
> on source and created on destination.
> We have to migrate e.g. the current PSN - but what
> should happen when a new packet arrives on the source
> after the QP has been torn down?
>
> Suggestion 1: move the QP to a special "suspended" state and ignore
> packets, or cause the sender to retransmit with e.g. an
> out-of-resources error. The retransmit counter might need to be
> adjusted compared to what the guest requested, to account
> for the extra retransmissions needed.
> Is there a good existing QP state that does this?
>
> Suggestion 2: forward packets to the destination somehow.
> This might overload the fabric, as packets cross e.g.
> the PCI bus multiple times.
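On the question of an existing state: the closest one today seems to be
SQD (send queue drained), reachable via ibv_modify_qp() as sketched
below, but it only quiesces our own sends; a state that also ignores or
NAKs incoming packets, as in suggestion 1, would be new:

#include <string.h>
#include <infiniband/verbs.h>

/* Drain the send queue before tearing the QP down on the source.
 * IBV_QPS_SQD is an existing QP state; the hypothetical "suspended"
 * state would additionally have to deal with incoming packets. */
static int drain_qp(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.qp_state = IBV_QPS_SQD;
    return ibv_modify_qp(qp, &attr, IBV_QP_STATE);
}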
>
> - Migration: network update
>
> RoCE v1 and InfiniBand seem to tie connections to
> hardware-specific GIDs which cannot be moved by software.
>
> Suggestion: limit migration to RoCE v2 initially.
>
> - Migration: packet loss recovery.
>
> As a RoCE address moves across the network, the network has
> to be updated, which takes time; meanwhile packet loss seems
> hard to avoid.
>
> Suggestion: limit initial support to hardware that is
> able to recover from occasional packet drops, with
> some slowdown.
>
> - Migration: suspend/resume API?
> It might be easier to pack up the state of all resources
> (all QP numbers, the state of all QPs, etc.)
> in a single memory buffer, migrate it, then unpack it on the destination.
>
> This removes the need for two separate APIs: one for the
> suspended state and one for specifying the QPN on creation.
>
> This creates a serialization format that will have to
> be maintained in a compatible way - it is not clear that
> the maintenance overhead is worth the potential
> simplification, if any.
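A sketch of what such a pack/unpack pair could look like; both calls
and the blob layout are hypothetical, and the compatibility concern
above applies to whatever format the buffer ends up using:

#include <stddef.h>
#include <infiniband/verbs.h>

/* Hypothetical: serialize all migratable state (QP numbers, QP
 * states, PSNs, MR keys, ...) into one opaque versioned blob on the
 * source, and recreate everything from that blob on the destination. */
int ibv_suspend_and_pack(struct ibv_context *ctx,
                         void *buf, size_t *len /* in: capacity, out: used */);
int ibv_unpack_and_resume(struct ibv_context *ctx,
                          const void *buf, size_t len);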
>
>
> That's it - I hope this helps, feel free to discuss, preferably copying
> virtio-dev (subscription required for now, people are looking into
> fixing this, sorry about that).
Dave
> Thanks!
>
> --
> MST
>
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK