qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [DRAFT RFC] ioeventfd/ioregionfd support in vfio-user


From: Stefan Hajnoczi
Subject: Re: [DRAFT RFC] ioeventfd/ioregionfd support in vfio-user
Date: Tue, 26 Jan 2021 11:01:04 +0000

On Thu, Jan 21, 2021 at 04:09:09PM +0000, John Levon wrote:
> 
> Hi, please take a look. For context, this addition is against the current
> https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst
> specification.
> 
> kvm@ readers: Stefan suggested this may be of interest from a VFIO point of
> view, in case there is any potential cross-over in defining how to wire up 
> these
> fds.

After talking to Alex Williamson about including these APIs in the
kernel VFIO ioctls it became clear to me that there is an architectural
difference between classic vfio-pci devices and vfio-user devices:

The current model with kernel VFIO is that the VMM sets up ioeventfds
using ioctl(VFIO_DEVICE_IOEVENTFD). The VMM has device-specific
knowledge that allows it to add ioeventfds for specific hardware
registers.

In vfio-user the VMM has no knowledge of the specific device. Instead
the device tells the VMM how to configure ioeventfds and ioregionfds.
There are a few reasons for this:

1. It is not practical to add knowledge of every vfio-user device
   implementation into the VMM. A cleaner solution is for the device to
   advertise its desired configuration to the VMM instead.

2. The device implementation may prefer a specific ioeventfd or
   ioregionfd configuration to fit its multi-threading architecture. A
   multi-threaded (multi-queue) device implementation may want
   ioeventfds and ioregionfds to be configured so that the right thread
   receives certain MMIO/PIO accesses.

3. The device implementation may want to specify ioregionfd's user_data
   value. This is the application-specific value that is passed along
   with an MMIO/PIO access.

I think mdev fits in somewhere between these two models. I can imagine
complex mdev devices wanting to control the ioeventfd and ioregionfd
configuration just like vfio-user. But in simple cases it doesn't
matter.

I suggest including John's ioeventfd/ioregionfd work into the kernel
VFIO interface so mdev devices can take advantage of it too.

> Note that this is a new message instead of adding a new region capability to
> VFIO_USER_DEVICE_GET_REGION_INFO: with a new capability, there's no way for 
> the
> server to know if ioeventfd/ioregionfd is actually used/requested by the 
> client
> (the client would just have to close those fds if it didn't want to use FDs). 
> An
> explicit new call is much clearer for this.
> 
> The ioregionfd feature itself is at proposal stage, so there's a good chance 
> of
> further changes depending on that.
> 
> I also have these pending issues so far:
> 
> 1) I'm not familiar with CCW bus, so don't know if
> KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY flag makes sense or is supportable in
> vfio-user context
> 
> 2) if ioregionfd subsumes all ioeventfd use cases, we can remove the 
> distinction
> from this API, and the client caller can translate to ioregionfd or ioeventfd 
> as
> needed
> 
> 3) is it neccessary for the client to indicate support (e.g. lacking 
> ioregionfd
> support) ?
> 
> 4) (ioregionfd issue) region_id is 4 bytes, which seems a little awkward from
> the server side: this has to encode both the region ID as well as the offset 
> of
> the sub-region within that region. Can this be 64 bits wide?

The user_data field in Elena's ioregionfd patches is 64 bits. Does this
solve the problem?

> 
> thanks
> john
> 
> (NB: I edited the diff so the new sections are more readable.)
> 
> diff --git a/docs/vfio-user.rst b/docs/vfio-user.rst
> index e3adc7a..e7274a2 100644
> --- a/docs/vfio-user.rst
> +++ b/docs/vfio-user.rst
> @@ -161,6 +161,17 @@ in the region info reply of a device-specific region. 
> Such regions are reflected
>  in ``struct vfio_device_info.num_regions``. Thus, for PCI devices this value 
> can
>  be equal to, or higher than, VFIO_PCI_NUM_REGIONS.
>  
> +Region I/O via file descriptors
> +-------------------------------
> +
> +For unmapped regions, region I/O from the client is done via
> +VFIO_USER_REGION_READ/WRITE.  As an optimization, ioeventfds or ioregionfds 
> may
> +be configured for sub-regions of some regions. A client may request 
> information
> +on these sub-regions via VFIO_USER_DEVICE_GET_REGION_IO_FDS; by configuring 
> the
> +returned file descriptors as ioeventfds or ioregionfds, the server can be
> +directly notified of I/O (for example, by KVM) without taking a trip through 
> the
> +client.
> +
>  Interrupts
>  ^^^^^^^^^^
>  The client uses VFIO_USER_DEVICE_GET_IRQ_INFO messages to query the server 
> for
> @@ -293,37 +304,39 @@ Commands
>  The following table lists the VFIO message command IDs, and whether the
>  message command is sent from the client or the server.
>  
> -+----------------------------------+---------+-------------------+
> -| Name                             | Command | Request Direction |
> -+==================================+=========+===================+
> -| VFIO_USER_VERSION                | 1       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DMA_MAP                | 2       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DMA_UNMAP              | 3       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DEVICE_GET_INFO        | 4       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DEVICE_GET_REGION_INFO | 5       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DEVICE_GET_IRQ_INFO    | 6       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DEVICE_SET_IRQS        | 7       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_REGION_READ            | 8       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_REGION_WRITE           | 9       | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DMA_READ               | 10      | server -> client  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DMA_WRITE              | 11      | server -> client  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_VM_INTERRUPT           | 12      | server -> client  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DEVICE_RESET           | 13      | client -> server  |
> -+----------------------------------+---------+-------------------+
> -| VFIO_USER_DIRTY_PAGES            | 14      | client -> server  |
> -+----------------------------------+---------+-------------------+
> ++------------------------------------+---------+-------------------+
> +| Name                               | Command | Request Direction |
> ++====================================+=========+===================+
> +| VFIO_USER_VERSION                  | 1       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_MAP                  | 2       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_UNMAP                | 3       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_INFO          | 4       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_REGION_INFO   | 5       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_REGION_IO_FDS | 6       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_GET_IRQ_INFO      | 7       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_SET_IRQS          | 8       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_REGION_READ              | 9       | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_REGION_WRITE             | 10      | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_READ                 | 11      | server -> client  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DMA_WRITE                | 12      | server -> client  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_VM_INTERRUPT             | 13      | server -> client  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DEVICE_RESET             | 14      | client -> server  |
> ++------------------------------------+---------+-------------------+
> +| VFIO_USER_DIRTY_PAGES              | 15      | client -> server  |
> ++------------------------------------+---------+-------------------+
>  
>  
>  .. Note:: Some VFIO defines cannot be reused since their values are
> @@ -1130,6 +1143,161 @@ client must write data on the same order and 
> transction size as read.
>  If an error occurs then the server must fail the read or write operation. It 
> is
>  an implementation detail of the client how to handle errors.
>  
> VFIO_USER_DEVICE_GET_REGION_IO_FDS
> ----------------------------------
> 
> Message format
> ^^^^^^^^^^^^^^
> 
> +--------------+------------------------+
> | Name         | Value                  |
> +==============+========================+
> | Message ID   | <ID>                   |
> +--------------+------------------------+
> | Command      | 6                      |
> +--------------+------------------------+
> | Message size | 32 + subregion info    |
> +--------------+------------------------+
> | Flags        | Reply bit set in reply |
> +--------------+------------------------+
> | Error        | 0/errno                |
> +--------------+------------------------+
> | Region info  | Region IO FD info      |
> +--------------+------------------------+
> 
> Clients can access regions via VFIO_USER_REGION_READ/WRITE or, if provided, by
> mmap()ing a file descriptor provided by the server.
> 
> VFIO_USER_DEVICE_GET_REGION_IO_FDS provides an alternative access mechanism 
> via
> file descriptors. This is an optional feature intended for performance
> improvements where an underlying sub-system (such as KVM) supports 
> communication
> across such file descriptors to the vfio-user server, without needing to
> round-trip through the client.
> 
> The server returns an array describing sub-regions of the given region along
> with the server specifies a set of sub-regions and the requested file 
> descriptor
> notification mechanism to use for that sub-region.  Each sub-region in the
> response message may choose to use a different method, as defined below.  The
> two mechanisms supported in this specification are ioeventfds and ioregionfds.
> 
> A client should then hook up the returned file descriptors with the 
> notification
> method requested.
> 
> Region IO FD info format
> ^^^^^^^^^^^^^^^^^^^^^^^^
> 
> +------------+--------+------+
> | Name       | Offset | Size |
> +============+========+======+
> | argsz      | 16     | 4    |
> +------------+--------+------+
> | flags      | 20     | 4    |
> +------------+--------+------+
> | index      | 24     | 4    |
> +------------+--------+------+
> | count      | 28     | 4    |
> +------------+--------+------+
> | subregions | 32     | ...  |
> +------------+--------+------+
> 
> * *argsz* is the size of the region IO FD info structure plus the
>   total size of the subregion array. Thus, each array entry "i" is at offset
>     i * ((argsz - 32) / count)
> * *flags* must be zero
> * *index* is the index of memory region being queried
> * *count* is the number of sub-regions in the array
> * *subregions* is the array of Sub-Region IO FD info structures
> 
> The client must set ``flags`` to zero and specify the region being queried in
> the ``index``.
> 
> The client sets the ``argsz`` field to indicate the maximum size of the 
> response
> that the server can send, which must be at least the size of the response 
> header
> plus space for the subregion array. If the full response size exceeds 
> ``argsz``,
> then the server must respond only with the response header and the Region IO 
> FD
> info structure, setting in ``argsz`` the buffer size required to store the 
> full
> response. In this case, no file descriptors are passed back.  The client then
> retries the operation with a larger receive buffer.
> 
> The reply message will additionally include at least one file descriptor in 
> the
> ancillary data. Note that more than one subregion may share the same file
> descriptor.
> 
> Each sub-region given in the response has one of two possible structures,
> depending whether *type* is `VFIO_USER_IO_FD_TYPE_IOEVENTFD` or
> `VFIO_USER_IO_FD_TYPE_IOREGIONFD`:
> 
> Sub-Region IO FD info format (ioeventfd)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> +-----------+--------+------+
> | Name      | Offset | Size |
> +===========+========+======+
> | offset    | 0      | 8    |
> +-----------+--------+------+
> | size      | 8      | 8    |
> +-----------+--------+------+
> | fd_index  | 16     | 4    |
> +-----------+--------+------+
> | type      | 20     | 4    |
> +-----------+--------+------+
> | flags     | 24     | 4    |
> +-----------+--------+------+
> | padding   | 28     | 4    |
> +-----------+--------+------+
> | datamatch | 32     | 8    |
> +-----------+--------+------+
> 
> * *offset* is the offset of the start of the sub-region within the region
>   requested ("physical address offset" for the region)
> * *size* is the length of the sub-region. This may be zero if the access size 
> is
>   not relevant, which may allow for optimizations
> * *fd_index* is the index in the ancillary data of the FD to use for ioeventfd
>   notification; it may be shared.
> * *type* is `VFIO_USER_IO_FD_TYPE_IOEVENTFD`
> * *flags* is any of:
>   * `KVM_IOEVENTFD_FLAG_DATAMATCH`
>   * `KVM_IOEVENTFD_FLAG_PIO`
>   * `KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY` (FIXME: makes sense?)
> * *datamatch* is the datamatch value if needed
> 
> See https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt 4.59
> KVM_IOEVENTFD for further context on the ioeventfd-specific fields.
> 
> Sub-Region IO FD info format (ioregionfd)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> +-----------+--------+------+
> | Name      | Offset | Size |
> +===========+========+======+
> | offset    | 0      | 8    |
> +-----------+--------+------+
> | size      | 8      | 8    |
> +-----------+--------+------+
> | fd_index  | 16     | 4    |
> +-----------+--------+------+
> | type      | 20     | 4    |
> +-----------+--------+------+
> | flags     | 24     | 4    |
> +-----------+--------+------+
> | region_id | 28     | 4    |
> +-----------+--------+------+
> 
> * *offset* is the offset of the start of the sub-region within the region
>   requested ("physical address offset" for the region)
> * *size* is the length of the sub-region. FIXME: may allow zero?
> * *fd_index* is the index in the ancillary data of the FD to use for 
> ioregionfd
>   messages; it may be shared
> * *type* is `VFIO_USER_IO_FD_TYPE_IOREGIONFD`
> * *flags* is any of:
>   * `KVM_IOREGIONFD_FLAG_PIO`
>   * `KVM_IOREGIONFD_FLAG_POSTED_WRITES`
> * *region_id* is an opaque value passed back to the server via a message on 
> the
>   file descriptor
> 
> See https://www.spinics.net/lists/kvm/msg208139.html (FIXME) for further 
> context
> on the ioregionfd-specific fields.
> 
>  VFIO_USER_DEVICE_GET_IRQ_INFO
>  -----------------------------
>  
> @@ -1141,7 +1309,7 @@ Message format
>  +==============+========================+
>  | Message ID   | <ID>                   |
>  +--------------+------------------------+
> -| Command      | 6                      |
> +| Command      | 7                      |
>  +--------------+------------------------+
>  | Message size | 32                     |
>  +--------------+------------------------+
> @@ -1212,7 +1380,7 @@ Message format
>  +==============+========================+
>  | Message ID   | <ID>                   |
>  +--------------+------------------------+
> -| Command      | 7                      |
> +| Command      | 8                      |
>  +--------------+------------------------+
>  | Message size | 36 + any data          |
>  +--------------+------------------------+
> @@ -1370,7 +1538,7 @@ Message format
>  +==============+========================+
>  | Message ID   | <ID>                   |
>  +--------------+------------------------+
> -| Command      | 8                      |
> +| Command      | 9                      |
>  +--------------+------------------------+
>  | Message size | 32 + data size         |
>  +--------------+------------------------+
> @@ -1397,7 +1565,7 @@ Message format
>  +==============+========================+
>  | Message ID   | <ID>                   |
>  +--------------+------------------------+
> -| Command      | 9                      |
> +| Command      | 10                     |
>  +--------------+------------------------+
>  | Message size | 32 + data size         |
>  +--------------+------------------------+
> @@ -1424,7 +1592,7 @@ Message format
>  +==============+========================+
>  | Message ID   | <ID>                   |
>  +--------------+------------------------+
> -| Command      | 10                     |
> +| Command      | 11                     |
>  +--------------+------------------------+
>  | Message size | 28 + data size         |
>  +--------------+------------------------+
> @@ -1451,7 +1619,7 @@ Message format
>  +==============+========================+
>  | Message ID   | <ID>                   |
>  +--------------+------------------------+
> -| Command      | 11                     |
> +| Command      | 12                     |
>  +--------------+------------------------+
>  | Message size | 28 + data size         |
>  +--------------+------------------------+
> @@ -1478,7 +1646,7 @@ Message format
>  +================+========================+
>  | Message ID     | <ID>                   |
>  +----------------+------------------------+
> -| Command        | 12                     |
> +| Command        | 13                     |
>  +----------------+------------------------+
>  | Message size   | 20                     |
>  +----------------+------------------------+
> @@ -1515,7 +1683,7 @@ Message format
>  +==============+========================+
>  | Message ID   | <ID>                   |
>  +--------------+------------------------+
> -| Command      | 13                     |
> +| Command      | 14                     |
>  +--------------+------------------------+
>  | Message size | 16                     |
>  +--------------+------------------------+
> @@ -1537,7 +1705,7 @@ Message format
>  +====================+========================+
>  | Message ID         | <ID>                   |
>  +--------------------+------------------------+
> -| Command            | 14                     |
> +| Command            | 15                     |
>  +--------------------+------------------------+
>  | Message size       | 16                     |
>  +--------------------+------------------------+
> 

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]