[Qemu-devel] [PATCH V3 3/5] docs: add pvrdma device documentation.

From: Marcel Apfelbaum
Subject: [Qemu-devel] [PATCH V3 3/5] docs: add pvrdma device documentation.
Date: Wed, 3 Jan 2018 12:29:09 +0200

Signed-off-by: Marcel Apfelbaum <address@hidden>
Signed-off-by: Yuval Shaia <address@hidden>
Reviewed-by: Shamir Rabinovitch <address@hidden>
 docs/pvrdma.txt | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 145 insertions(+)
 create mode 100644 docs/pvrdma.txt

diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
new file mode 100644
index 0000000000..74c5cf2495
--- /dev/null
+++ b/docs/pvrdma.txt
@@ -0,0 +1,145 @@
+Paravirtualized RDMA Device (PVRDMA)
+1. Description
+PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
+It works with its Linux Kernel driver AS IS, no need for any special guest
+While it complies with the VMware device, it can also communicate with bare
+metal RDMA-enabled machines and does not require an RDMA HCA in the host; it
+can work with Soft-RoCE (rxe).
+It does not require the whole guest RAM to be pinned, which allows memory
+over-commit. Migration is not implemented yet, but it will be possible
+with some HW assistance.
+A project presentation accompanies this document:
+2. Setup
+2.1 Guest setup
+Fedora 27+ kernels work out of the box; older distributions
+require updating the kernel to 4.14 to include the pvrdma driver.
+However, the libpvrdma library needed by User Level Software is still
+not available as part of the distributions, so the rdma-core library
+needs to be compiled and optionally installed.
+Please follow the instructions at:
+  https://github.com/linux-rdma/rdma-core.git
+2.2 Host Setup
+The pvrdma backend is an ibdevice interface that can be exposed
+either by a Soft-RoCE (rxe) device on machines with no RDMA device,
+or by an HCA SRIOV function (VF/PF).
+Note that ibdevice interfaces can't be shared between pvrdma devices;
+each one requires a separate instance (rxe or SRIOV VF).
+2.2.1 Soft-RoCE backend (rxe)
+A stable version of rxe is required; Fedora 27+ or a Linux
+Kernel 4.14+ is preferred.
+The rdma_rxe module is part of the Linux Kernel but not loaded by default.
+Install the User Level library (librxe) following the instructions from:
+Associate an ETH interface with rxe by running:
+   rxe_cfg add eth0
+An rxe0 ibdevice interface will be created and can be used as the pvrdma backend.
+2.2.2 RDMA device Virtual Function backend
+Nothing special is required; the pvrdma device can work not only with
+Ethernet links, but also with InfiniBand links.
+All that is needed is an ibdevice with an active port; for Mellanox cards
+it will be something like mlx5_6, which can be used as the backend.
+2.2.3 QEMU setup
+Configure QEMU with the --enable-rdma flag, after installing
+the required RDMA libraries.
+3. Usage
+Currently the device works only with memory backed RAM,
+which must be marked as "shared":
+   -m 1G \
+   -object memory-backend-ram,id=mb1,size=1G,share \
+   -numa node,memdev=mb1 \
+The pvrdma device is composed of two functions:
+ - Function 0 is a vmxnet Ethernet Device which is redundant in the Guest
+   but is required to pass the ibdevice GID using its MAC.
+   Examples:
+     For an rxe backend using the eth0 interface, it will use eth0's MAC:
+       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
+     For an SRIOV VF, we take the Ethernet Interface exposed by it:
+       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
+ - Function 1 is the actual device:
+       -device 
+   where the ibdevice can be rxe or RDMA VF (e.g. mlx5_4)
+ Note: Pay special attention that the GID at backend-gid-idx matches vmxnet's 
+ The rules of conversion are part of the RoCE spec, but since manual conversion
+ is not required, spotting problems is not hard:
+    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
+             MAC: 7c:fe:90:cb:74:3a
+    Note the difference between the first byte of the MAC and the GID.
+4. Implementation details
+The device acts like a proxy between the Guest Driver and the host
+ibdevice interface.
+On configuration path:
+ - For every hardware resource request (PD/QP/CQ/...) the pvrdma will request
+   a resource from the backend interface, maintaining a 1-1 mapping
+   between the guest and host.
+On data path:
+ - Every post_send/receive received from the guest will be converted into
+   a post_send/receive for the backend. The buffers' data will not be touched
+   or copied, resulting in near bare-metal performance for large enough buffers.
+ - Completions from the backend interface will result in completions for
+   the pvrdma device.
+5. Limitations
+- The device obviously is limited by the Guest Linux Driver features 
+  of the VMware device API.
+- Memory registration mechanism requires mremap for every page in the buffer in order
+  to map it to a contiguous virtual address range. Since this is not the data 
+  it should not matter much.
+- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is 
+  so it can't work with huge pages. The limitation will be addressed in the 
+  however QEMU allocates Guest RAM with MADV_HUGEPAGE so if there are enough 
+  pages available, QEMU will use them.
+- As previously stated, migration is not supported yet, however with some 
+  support can be done.
+6. Performance
+By design the pvrdma device exits on each post-send/receive, so for small 
+the performance is affected; however for medium buffers it will become close to
+bare metal and from 1MB buffers and up it reaches bare metal performance.
+(tested with 2 VMs, the pvrdma devices connected to 2 VFs of the same device)
+All the above assumes no memory registration is done on data path.
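
As a quick way to check the GID/MAC pairing shown in the example of section 3, the
sketch below derives a link-local GID from a MAC address. It assumes the GID at
the chosen index is the default one built from the MAC with the modified EUI-64
rule (flip the universal/local bit of the first byte, insert ff:fe in the middle,
prepend fe80::/64). The function name is only illustrative and is not part of
QEMU or rdma-core.

```python
def mac_to_link_local_gid(mac: str) -> str:
    """Return the link-local GID expected for a MAC (modified EUI-64).

    Assumption: the backend GID is the default MAC-derived one; other GID
    table entries (e.g. IP based ones) will not match this output.
    """
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6 or any(not 0 <= o <= 0xFF for o in octets):
        raise ValueError(f"not a valid MAC address: {mac!r}")

    # Flip the universal/local bit of the first byte and insert ff:fe
    # between the third and fourth bytes of the MAC.
    interface_id = [octets[0] ^ 0x02, octets[1], octets[2],
                    0xFF, 0xFE,
                    octets[3], octets[4], octets[5]]

    # Link-local prefix fe80::/64 followed by the 64-bit interface id.
    raw = [0xFE, 0x80, 0, 0, 0, 0, 0, 0] + interface_id

    # Print as eight fully expanded 16-bit groups, like the example above.
    return ":".join(f"{raw[i]:02x}{raw[i + 1]:02x}" for i in range(0, 16, 2))


if __name__ == "__main__":
    print(mac_to_link_local_gid("7c:fe:90:cb:74:3a"))
```

Comparing the result with the GID reported by the host for the backend ibdevice
shows whether the MAC given to the vmxnet3 function is the right one.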
