qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: [PATCH] introduce VFIO-over-socket protocol specificaion


From: Thanos Makatos
Subject: RE: [PATCH] introduce VFIO-over-socket protocol specificaion
Date: Wed, 22 Jul 2020 11:42:26 +0000

> -----Original Message-----
> From: Stefan Hajnoczi <stefanha@redhat.com>
> Sent: 17 July 2020 13:18
> To: Thanos Makatos <thanos.makatos@nutanix.com>
> Cc: qemu-devel@nongnu.org; alex.williamson@redhat.com;
> benjamin.walker@intel.com; elena.ufimtseva@oracle.com;
> jag.raman@oracle.com; Swapnil Ingle <swapnil.ingle@nutanix.com>;
> james.r.harris@intel.com; konrad.wilk@oracle.com; Raphael Norwitz
> <raphael.norwitz@nutanix.com>; marcandre.lureau@redhat.com;
> Kanth.Ghatraju@oracle.com; Felipe Franciosi <felipe@nutanix.com>;
> tina.zhang@intel.com; changpeng.liu@intel.com; dgilbert@redhat.com;
> tomassetti.andrea@gmail.com; yuvalkashtan@gmail.com;
> ismael@linux.com; John G Johnson <john.g.johnson@oracle.com>
> Subject: Re: [PATCH] introduce VFIO-over-socket protocol specificaion
> 
> On Thu, Jul 16, 2020 at 08:31:43AM -0700, Thanos Makatos wrote:
> > This patch introduces the VFIO-over-socket protocol specification, which
> > is designed to allow devices to be emulated outside QEMU, in a separate
> > process. VFIO-over-socket reuses the existing VFIO defines, structs and
> > concepts.
> >
> > It has been earlier discussed as an RFC in:
> > "RFC: use VFIO over a UNIX domain socket to implement device offloading"
> >
> > Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> > Signed-off-by: Thanos Makatos <thanos.makatos@nutanix.com>
> > ---
> >  docs/devel/vfio-over-socket.rst | 1135
> +++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 1135 insertions(+), 0 deletions(-)
> >  create mode 100644 docs/devel/vfio-over-socket.rst
> 
> This is exciting! The spec is clear enough that I feel I could start
> writing a client/server. There is enough functionality here to implement
> real-world devices. Can you share links to client/server
> implementations?

Strictly speaking there's no client/server implementation yet as we're waiting
until the protocol is finalized.

The closest server implementation we have is an NVMe controller implemented in
SPDK with MUSER:
(https://github.com/tmakatos/spdk/tree/rfc-vfio-over-socket and
https://github.com/tmakatos/muser/tree/vfio-over-socket).

The closest client implementation we have is libvfio in the VFIO-over-socket
PoC.

Neither of these implementation use the new protocol, but they're similar in
spirit.

John is working on the VFIO changes, the only thing left to do is the DMA/IOMMU
changes we made in the last review round, they'll be on their GitHub site soon.


> 
> launching, and controlling device emulation processes. That doesn't need
> It would be useful to introduce a standard way of enumerating,
> to be part of this specification document though. In vhost-user there
> are conventions for command-line parameters, process lifecycle, etc that
> make it easier for management tools to run device processes (the
> "Backend program conventions" section in vhost-user.rst).

Sure, we'll come up with something similar based on those.

> 
> > diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over-
> socket.rst
> > new file mode 100644
> > index 0000000..723b944
> > --- /dev/null
> > +++ b/docs/devel/vfio-over-socket.rst
> > @@ -0,0 +1,1135 @@
> > +***************************************
> > +VFIO-over-socket Protocol Specification
> > +***************************************
> > +
> > +Version 0.1
> 
> Please include a reference to the section below explaining how
> versioning works.

I'm not sure I understand, do you mean we should add something like the
following (right below "Version 0.1"):

"Refer to section 1.2.3 on how versioning works."

?

> 
> Also, are there rules about noting versions when updating the spec? For
> example:
> 
>   When making a change to this specification, the protocol version
>   number must be included:
> 
>     The `foo` field contains ... Added in version 1.3.

OK, we'll add the rule as per your recommendation.

> 
> > +
> > +Introduction
> > +============
> > +VFIO-over-socket, also known as vfio-user, is a protocol that allows a
> device
> 
> vfio-user is shorted. Now is the best time to start consistently using
> "vfio-user" as the name for this protocol. Want to drop the name
> VFIO-over-socket?

"vfio-user" it is.

> 
> > +to be virtualized in a separate process outside of QEMU. VFIO-over-
> socket
> 
> Is there anything QEMU-specific about this protocol?
> 
> I think the scope of this protocol is more general and it could be
> described as:
> 
>   allows device emulation in a separate process outside of a Virtual
>   Machine Monitor (VMM).
> 
> (Or "emulator" instead of VMM, if you prefer.)
> 
> > +devices consist of a generic VFIO device type, living inside QEMU, which
> we
> 
> s/QEMU/the VMM/

Correct, QEMU is simply our main use case but the protocol is agnostic to that.
We'll drop QEMU and rephrase based on your suggestion.

> 
> > +call the client, and the core device implementation, living outside QEMU,
> which
> > +we call the server. VFIO-over-socket can be the main transport
> mechanism for
> > +multi-process QEMU, however it can be used by other applications
> offering
> > +device virtualization. Explaining the advantages of a
> > +disaggregated/multi-process QEMU, and device virtualization outside
> QEMU in
> > +general, is beyond the scope of this document.
> > +
> > +This document focuses on specifying the VFIO-over-socket protocol. VFIO
> has
> > +been chosen for the following reasons:
> 
> It's a little subtle here that "VFIO" means the Linux VFIO ioctl
> interface. Expanding it a bit makes the rest of the text clearer:
> 
>   The `Linux VFIO ioctl
>   interface
>   <https://www.kernel.org/doc/html/latest/driver-api/vfio.html>`_ has
>   been chosen as the base for this protocol for the following reasons:

OK.

> 
> > +
> > +1) It is a mature and stable API, backed by an extensively used
> framework.
> > +2) The existing VFIO client implementation (qemu/hw/vfio/) can be
> largely
> > +   reused.
> > +
> > +In a proof of concept implementation it has been demonstrated that
> using VFIO
> > +over a UNIX domain socket is a viable option. VFIO-over-socket is
> designed with
> > +QEMU in mind, however it could be used by other client applications. The
> > +VFIO-over-socket protocol does not require that QEMU's VFIO client
> > +implementation is used in QEMU. None of the VFIO kernel modules are
> required
> > +for supporting the protocol, neither in the client nor the server, only the
> > +source header files are used.
> 
> In the long run only the last sentence in this paragraph needs to be in
> the specification.

OK.

> 
> The proof of concept and QEMU-specific info is good to know for the
> discussion right now, but I don't think this needs to be in the
> specification.

OK.

> 
> > +
> > +The main idea is to allow a virtual device to function in a separate 
> > process
> in
> > +the same host over a UNIX domain socket. A UNIX domain socket
> (AF_UNIX) is
> > +chosen because we can trivially send file descriptors over it, which in 
> > turn
> > +allows:
> > +
> > +* Sharing of guest memory for DMA with the virtual device process.
> 
> Should the spec start consistently using "server" instead of various
> terms like "virtual device process"?

Indeed, that would be better.

> 
> > +* Sharing of virtual device memory with the guest for fast MMIO.
> 
> The VIRTIO spec restricts itself to the terms "device" and "driver" so
> that the scope isn't too specific to virtualization of software devices.
> I think the same might be good here because who knows how
> VFIO-over-socket will be used in the future. Maybe it will be used
> outside the context of traditional guests. For example, it's a great
> interface for writing device emulation tests and fuzzers!
> 
> I suggest consistently using "client" (for the guest) and "server" (for
> the device emulation process). That way the spec isn't focussed on a
> specific use case.

OK.

> 
> > +* Efficient sharing of eventfd's for triggering interrupts.
> > +
> > +However, other socket types could be used which allows the virtual
> device
> > +process to run in a separate guest in the same host (AF_VSOCK) or
> remotely
> > +(AF_INET). Theoretically the underlying transport doesn't necessarily have
> to
> > +be a socket, however we don't examine such alternatives. In this
> document we
> > +focus on using a UNIX domain socket and introduce basic support for the
> other
> > +two types of sockets without considering performance implications.
> 
> This is a good place to be explicit about the protocol requirements:
> 
> Is file-descriptor passing necessary?
> 
> Is shared-memory necessary?
> 
> Is there always an in-band message-passing fallback for any operation
> that can be optimized via fd passing?
> 
> By stating something about this in the spec it's easier to ensure that
> the protocol continues to follow these design parameters in the future
> when other people make modifications.

Passing of file descriptors isn't necessary. We explain that later in
"Guest Memory Configuration" and we'll also state it here for clarity.

> 
> > +This document does not yet describe any internal details of the server-
> side
> > +implementation, however QEMU's VFIO client implementation will have
> to be
> > +adapted according to this protocol in order to support VFIO-over-socket
> virtual
> > +devices.
> 
> This paragraph about the QEMU implementation status can be dropped from
> the specification.

OK.

> 
> > +
> > +VFIO
> > +====
> > +VFIO is a framework that allows a physical device to be securely passed
> through
> > +to a user space process; the kernel does not drive the device at all.
> > +Typically, the user space process is a VM and the device is passed through
> to
> > +it in order to achieve high performance. VFIO provides an API and the
> required
> > +functionality in the kernel. QEMU has adopted VFIO to allow a guest
> virtual
> > +machine to directly access physical devices, instead of emulating them in
> > +software
> 
> s/software/software./

OK.

> 
> > +
> > +VFIO-over-socket reuses the core VFIO concepts defined in its API, but
> > +implements them as messages to be sent over a UNIX-domain socket. It
> does not
> 
> Can "UNIX-domain" be dropped here? If the protocol design is intended to
> support other transports then it helps to avoid mentioning a specific
> transport even if only AF_UNIX will be implemented in the beginning.

OK.

> 
> > +change the kernel-based VFIO in any way, in fact none of the VFIO kernel
> > +modules need to be loaded to use VFIO-over-socket. It is also possible for
> QEMU
> > +to concurrently use the current kernel-based VFIO for one guest device,
> and use
> > +VFIO-over-socket for another device in the same guest.
> > +
> > +VFIO Device Model
> > +-----------------
> > +A device under VFIO presents a standard VFIO model to the user process.
> Many
> 
> I'm not sure what the meaning of the first sentence is.
> 
> s/standard VFIO model/standard interface/ ?

OK.

> 
> > +of the VFIO operations in the existing kernel model use the ioctl() system
> > +call, and references to the existing model are called the ioctl()
> > +implementation in this document.
> > +
> > +The following sections describe the set of messages that implement the
> VFIO
> > +device model over a UNIX domain socket. In many cases, the messages
> are direct
> > +translations of data structures used in the ioctl() implementation.
> Messages
> > +derived from ioctl()s will have a name derived from the ioctl() command
> name.
> > +E.g., the VFIO_GET_INFO ioctl() command becomes a
> VFIO_USER_GET_INFO message.
> > +The purpose for this reuse is to share as much code as feasible with the
> > +ioctl() implementation.
> > +
> > +Client and Server
> > +^^^^^^^^^^^^^^^^^
> > +The socket connects two processes together: a client process and a server
> > +process. In the context of this document, the client process is the process
> > +emulating a guest virtual machine, such as QEMU. The server process is a
> > +process that provides device emulation.
> > +
> > +Connection Initiation
> > +^^^^^^^^^^^^^^^^^^^^^
> > +After the client connects to the server, the initial server message is
> > +VFIO_USER_VERSION to propose a protocol version and set of capabilities
> to
> > +apply to the session. The client replies with a compatible version and set
> of
> > +capabilities it will support, or closes the connection if it cannot 
> > support the
> > +advertised version.
> > +
> > +Guest Memory Configuration
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +The client uses VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP
> messages to inform
> > +the server of the valid guest DMA ranges that the server can access on
> behalf
> > +of a device. Guest memory may be accessed by the server via
> VFIO_USER_DMA_READ
> > +and VFIO_USER_DMA_WRITE messages over the socket.
> 
> "Guest" is a virtualization term. "DMA Memory Configuration" and "DMA
> memory may be accessed ..." would be more general.

OK.

> 
> > +
> > +An optimization for server access to guest memory is for the client to
> provide
> > +file descriptors the server can mmap() to directly access guest memory.
> Note
> > +that mmap() privileges cannot be revoked by the client, therefore file
> > +descriptors should only be exported in environments where the client
> trusts the
> > +server not to corrupt guest memory.
> > +
> > +Device Information
> > +^^^^^^^^^^^^^^^^^^
> > +The client uses a VFIO_USER_DEVICE_GET_INFO message to query the
> server for
> > +information about the device. This information includes:
> > +
> > +* The device type and capabilities,
> > +* the number of memory regions, and
> 
> VFIO calls them "device regions", which is clearer:
> 1. It cannot be confused with "DMA Memory Regions"
> 2. It allows for I/O space in additional to Memory space
> 
> I suggest using "device regions" consistently.

OK.

> 
> > +* the device presents to the guest the number of interrupt types the
> device
> > +  supports.
> > +
> > +Region Information
> > +^^^^^^^^^^^^^^^^^^
> > +The client uses VFIO_USER_DEVICE_GET_REGION_INFO messages to
> query the server
> > +for information about the device's memory regions. This information
> describes:
> > +
> > +* Read and write permissions, whether it can be memory mapped, and
> whether it
> > +  supports additional capabilities.
> > +* Region index, size, and offset.
> > +
> > +When a region can be mapped by the client, the server provides a file
> > +descriptor which the client can mmap(). The server is responsible for
> polling
> > +for client updates to memory mapped regions.
> > +
> > +Region Capabilities
> > +"""""""""""""""""""
> > +Some regions have additional capabilities that cannot be described
> adequately
> > +by the region info data structure. These capabilities are returned in the
> > +region info reply in a list similar to PCI capabilities in a PCI device's
> > +configuration space.
> > +
> > +Sparse Regions
> > +""""""""""""""
> > +A region can be memory-mappable in whole or in part. When only a
> subset of a
> > +region can be mapped by the client, a
> VFIO_REGION_INFO_CAP_SPARSE_MMAP
> > +capability is included in the region info reply. This capability describes
> > +which portions can be mapped by the client.
> > +
> > +For example, in a virtual NVMe controller, sparse regions can be used so
> that
> > +accesses to the NVMe registers (found in the beginning of BAR0) are
> trapped (an
> > +infrequent an event), while allowing direct access to the doorbells (an
> > +extremely frequent event as every I/O submission requires a write to
> BAR0),
> > +found right after the NVMe registers in BAR0.
> > +
> > +Interrupts
> > +^^^^^^^^^^
> > +The client uses VFIO_USER_DEVICE_GET_IRQ_INFO messages to query
> the server for
> > +the device's interrupt types. The interrupt types are specific to the bus
> the
> > +device is attached to, and the client is expected to know the capabilities 
> > of
> > +each interrupt type. The server can signal an interrupt either with
> > +VFIO_USER_VM_INTERRUPT messages over the socket, or can directly
> inject
> > +interrupts into the guest via an event file descriptor. The client 
> > configures
> > +how the server signals an interrupt with VFIO_USER_SET_IRQS messages.
> > +
> > +Device Read and Write
> > +^^^^^^^^^^^^^^^^^^^^^
> > +When the guest executes load or store operations to device memory, the
> client
> > +forwards these operations to the server with VFIO_USER_REGION_READ
> or
> > +VFIO_USER_REGION_WRITE messages. The server will reply with data
> from the
> > +device on read operations or an acknowledgement on write operations.
> > +
> > +DMA
> > +^^^
> > +When a device performs DMA accesses to guest memory, the server will
> forward
> > +them to the client with VFIO_USER_DMA_READ and
> VFIO_USER_DMA_WRITE messages.
> > +These messages can only be used to access guest memory the client has
> > +configured into the server.
> > +
> > +Protocol Specification
> > +======================
> > +To distinguish from the base VFIO symbols, all VFIO-over-socket symbols
> are
> > +prefixed with vfio_user or VFIO_USER. In revision 0.1, all data is in the
> > +little-endian format, although this may be relaxed in future revision in
> cases
> > +where the client and server are both big-endian. The messages are
> formatted
> > +for seamless reuse of the native VFIO structs. A server can serve:
> > +
> > +1) multiple clients, and/or
> > +2) multiple virtual devices, belonging to one or more clients.
> > +
> > +Therefore each message requires a header that uniquely identifies the
> virtual
> > +device. It is a server-side implementation detail whether a single server
> > +handles multiple virtual devices from the same or multiple guests.
> > +
> > +Socket
> > +------
> > +A single UNIX domain socket is assumed to be used for each device. The
> location
> > +of the socket is implementation-specific. Multiplexing clients, devices, 
> > and
> > +servers over the same socket is not supported in this version of the
> protocol,
> > +but a device ID field exists in the message header so that a future support
> can
> > +be added without a major version change.
> 
> There is a version negotiation mechanism for making protocol changes.
> I'm not sure there is any advantage to including the multiplexing header
> now? I suggest keeping it simple and dropping the unused header.

OK.

> 
> > +
> > +Authentication
> > +--------------
> > +For AF_UNIX, we rely on OS mandatory access controls on the socket
> files,
> > +therefore it is up to the management layer to set up the socket as
> required.
> > +Socket types than span guests or hosts will require a proper
> authentication
> > +mechanism. Defining that mechanism is deferred to a future version of
> the
> > +protocol.
> > +
> > +Request Concurrency
> > +-------------------
> > +There can be multiple outstanding requests per virtual device, e.g. a
> > +frame buffer where the guest does multiple stores to the virtual device.
> The
> > +server can execute and reorder non-conflicting requests in parallel,
> depending
> > +on the device semantics.
> 
> The word "request" has not been used up to this point in the spec. What
> does "request" mean here?
> 
> I'm not sure if this section is talking about vfio-user protocol message
> exchanges (e.g. a client->server message followed by a server->client
> message) or something else?

Indeed, this is confusing, we use the term "command" earlier in the
"VFIO Device Model" section, we'll replace the term "request"
with "command" throughout the specification.

> 
> What about the ordering semantics at the vfio-user protocol level? For
> example, if a client sends multiple VFIO_USER_DMA_MAP/UNMAP
> messages
> then the order matters. What are the rules? I wonder if maybe
> concurrency is a special case and really only applies to a subset of
> protocol commands?

All commands are executed in the order they were sent, regardless of whether a
reply is needed.

> 
> I'm not sure how a client would exploit parallelism in this protocol.
> Can you give an example of a case where there would be multiple commands
> pending on a single device?

For instance, a client can issue the following operations back to back without
waiting for the first two to complete:
1. map a DMA region 
2. trigger some device-specific operation that results in data being read into
   that DMA region, and
3. unmap the DMA region

> 
> > +
> > +Socket Disconnection Behavior
> > +-----------------------------
> > +The server and the client can disconnect from each other, either
> intentionally
> > +or unexpectedly. Both the client and the server need to know how to
> handle such
> > +events.
> > +
> > +Server Disconnection
> > +^^^^^^^^^^^^^^^^^^^^
> > +A server disconnecting from the client may indicate that:
> > +
> > +1) A virtual device has been restarted, either intentionally (e.g. because 
> > of
> a
> > +device update) or unintentionally (e.g. because of a crash). In any case,
> the
> > +virtual device will come back so the client should not do anything (e.g.
> simply
> > +reconnect and retry failed operations).
> > +
> > +2) A virtual device has been shut down with no intention to be restarted.
> > +
> > +It is impossible for the client to know whether or not a failure is
> > +intermittent or innocuous and should be retried, therefore the client
> should
> > +attempt to reconnect to the socket. Since an intentional server restart
> (e.g.
> > +due to an upgrade) might take some time, a reasonable timeout should
> be used.
> > +In cases where the disconnection is expected (e.g. the guest shutting
> down), no
> > +new requests will be sent anyway so this situation doesn't pose a
> problem. The
> > +control stack will clean up accordingly.
> > +
> > +Parametrizing this behaviour by having the virtual device advertise a
> 
> s/Parametrizing/Parameterizing/

OK.

> 
> > +reasonable reconnect is deferred to a future version of the protocol.
> 
> No mention is made of recovering state or how disconnect maps to VFIO
> device types (PCI, etc.). Does a disconnect mean that the device has
> been reset?

Regarding recovering state, I believe that since all the building blocks are
there and the client is pretty much the master in the vfio-user model, it's up
to the client how to deal with exceptional situations. For recovery to work, the
client will have to reconfigure IRQs, DMAs, etc., and the server will have to
persistently store device state.

Regarding how disconnect maps to VFIO device types, it depends on whether or not
the client/server can recover from it and continue operating. If this doesn't
happen (e.g. the server hasn't restarted, the client doesn't support
recovering), then the device cannot continue being operational, which I suppose
is an implementation detail of the client.

Do we need something more specific at this stage?

> 
> > +
> > +Client Disconnection
> > +^^^^^^^^^^^^^^^^^^^^
> > +The client disconnecting from the server primarily means that the QEMU
> process
> > +has exited. Currently this means that the guest is shut down so the device
> is
> > +no longer needed therefore the server can automatically exit. However,
> there
> > +can be cases where a client disconnect should not result in a server exit:
> > +
> > +1) A single server serving multiple clients.
> > +2) A multi-process QEMU upgrading itself step by step, which isn't yet
> > +   implemented.
> > +
> > +Therefore in order for the protocol to be forward compatible the server
> should
> > +take no action when the client disconnects. If anything happens to the
> client
> > +process the control stack will know about it and can clean up resources
> > +accordingly.
> > +
> > +Request Retry and Response Timeout
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +QEMU's VFIO retries certain operations if they fail. While this makes sense
> for
> > +real HW, we don't know for sure whether it makes sense for virtual
> devices. A
> > +failed request is a request that has been successfully sent and has been
> > +responded to with an error code. Failure to send the request in the first
> place
> > +(e.g. because the socket is disconnected) is a different type of error
> examined
> > +earlier in the disconnect section.
> > +
> > +Defining a retry and timeout scheme if deferred to a future version of the
> > +protocol.
> > +
> > +Commands
> > +--------
> > +The following table lists the VFIO message command IDs, and whether
> the
> > +message request is sent from the client or the server.
> > +
> > ++----------------------------------+---------+-------------------+
> > +| Name                             | Command | Request Direction |
> >
> ++==================================+=========+===========
> ========+
> > +| VFIO_USER_VERSION                | 1       | server ???????? client   |
> 
> Unicode issues? I see "????????" instead of the characters that were
> supposed to be here.

My bad, I selected UTF-8 when I ran git send-email, it's supposed to be "->".

> 
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_DMA_MAP                | 2       | client ???????? server   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_DMA_UNMAP              | 3       | client ???????? server   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_GET_INFO        | 4       | client ???????? server   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_GET_REGION_INFO | 5       | client ???????? server
> |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_GET_IRQ_INFO    | 6       | client ???????? server   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_SET_IRQS        | 7       | client ???????? server   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_REGION_READ            | 8       | client ???????? server   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_REGION_WRITE           | 9       | client ???????? server   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_DMA_READ               | 10      | server ???????? client   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_DMA_READ               | 11      | server ???????? client   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_USER_VM_INTERRUPT           | 12      | server ???????? client   |
> > ++----------------------------------+---------+-------------------+
> > +| VFIO_DEVICE_RESET                | 13      | client ???????? server   |
> > ++----------------------------------+---------+-------------------+
> 
> Why is this command named VFIO_DEVICE_RESET and not
> VFIO_USER_DEVICE_RESET?

That's a mistake, it should be VFIO_USER_DEVICE_RESET.

> 
> > +
> > +Header
> > +------
> > +All messages are preceded by a 16 byte header that contains basic
> information
> 
> I suggest dropping "16 byte" here because the table already shows this
> and it is easy to forget to update the text when changing the table.

OK.

> 
> > +about the message. The header is followed by message-specific data
> described
> > +in the sections below.
> > +
> > ++----------------+--------+-------------+
> > +| Name           | Offset | Size        |
> > ++================+========+=============+
> > +| Device ID      | 0      | 2           |
> > ++----------------+--------+-------------+
> > +| Message ID     | 2      | 2           |
> > ++----------------+--------+-------------+
> > +| Command        | 4      | 4           |
> > ++----------------+--------+-------------+
> > +| Message size   | 8      | 4           |
> > ++----------------+--------+-------------+
> > +| Flags          | 12     | 4           |
> > ++----------------+--------+-------------+
> > +|                | +-----+------------+ |
> > +|                | | Bit | Definition | |
> > +|                | +=====+============+ |
> > +|                | | 0   | Reply      | |
> > +|                | +-----+------------+ |
> > +|                | | 1   | No_reply   | |
> > +|                | +-----+------------+ |
> > ++----------------+--------+-------------+
> > +| <message data> | 16     | variable    |
> > ++----------------+--------+-------------+
> > +
> > +* Device ID identifies the destination device of the message. This field is
> > +  reserved when the server only supports one device per socket.
> 
> Clearer:
> s/is reserved/must be 0/

OK.

> 
> > +* Message ID identifies the message, and is used in the message
> acknowledgement.
> 
> Are Message IDs global or per-device?

We're dropping the device ID field in the header therefore it's global.

> 
> > +* Command specifies the command to be executed, listed in the
> Command Table.
> > +* Message size contains the size of the entire message, including the
> header.
> > +* Flags contains attributes of the message:
> > +
> > +  * The reply bit differentiates request messages from reply messages. A
> reply
> > +    message acknowledges a previous request with the same message ID.
> > +  * No_reply indicates that no reply is needed for this request. This is
> > +    commonly used when multiple requests are sent, and only the last
> needs
> > +    acknowledgement.
> 
> Is_Reply and Dont_Reply are more self-explanatory to me, but this is
> probably subjective.
> 
> > +
> > +VFIO_USER_VERSION
> > +-----------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Device ID    | 0                      |
> > ++--------------+------------------------+
> 
> The Device ID is always 0 and this message is sent just once regardless
> of the number of devices? Please describe this explicitly so it's clear
> that this message is per-session and not per-device.

We're dropping the device ID from the header.

> 
> There could also be a special NO_DEVICE (0xff) value but I don't think
> it matters.
> 
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | 1                      |
> > ++--------------+------------------------+
> > +| Message size | 16 + version length    |
> > ++--------------+------------------------+
> > +| Flags        | Reply bit set in reply |
> > ++--------------+------------------------+
> > +| Version      | JSON byte array        |
> > ++--------------+------------------------+
> > +
> > +This is the initial message sent by the server after the socket connection 
> > is
> > +established. The version is in JSON format, and the following objects must
> be
> > +included:
> > +
> > ++--------------+--------+---------------------------------------------------+
> > +| Name         | Type   | Description                                      
> >  |
> >
> ++==============+========+================================
> ===================+
> > +| version      | object | {???????major????????: <number>,
> ???????minor????????: <number>}            |
> 
> More Unicode issues.
> 
> > +|              |        | Version supported by the sender, e.g.
> ???????0.1????????.      |
> > ++--------------+--------+---------------------------------------------------+
> > +| type         | string | Fixed to ???????vfio-user????????.               
> >               |
> > ++--------------+--------+---------------------------------------------------+
> > +| capabilities | array  | Reserved. Can be omitted for v0.1, otherwise must
> |
> > +|              |        | be empty.                                        
> >  |
> > ++--------------+--------+---------------------------------------------------+
> 
> s/array/array of strings/ ?

No, it's array were each element can be any object in JSON.

> 
> > +
> > +Versioning and Feature Support
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +Upon accepting a connection, the server must send a
> VFIO_USER_VERSION message
> > +proposing a protocol version and a set of capabilities. The client compares
> > +these with the versions and capabilities it supports and sends a
> > +VFIO_USER_VERSION reply according to the following rules.
> > +
> > +* The major version in the reply must be the same as proposed. If the
> client
> > +  does not support the proposed major, it closes the connection.
> > +* The minor version in the reply must be equal to or less than the minor
> > +  version proposed.
> > +* The capability list must be a subset of those proposed. If the client
> > +  requires a capability the server did not include, it closes the 
> > connection.
> > +* If type is not ???????vfio-user????????, the client closes the
> connection.
> 
> Is there a rationale for this field? In order to get to the point where
> this JSON is parsed we must already implement the vfio-user protocol
> (e.g. parse the header).

It was suggested to include it as a sanity check, we'll drop it.

> 
> > +
> > +The protocol major version will only change when incompatible protocol
> changes
> > +are made, such as changing the message format. The minor version may
> change
> > +when compatible changes are made, such as adding new messages or
> capabilities,
> > +Both the client and server must support all minor versions less than the
> > +maximum minor version it supports. E.g., an implementation that
> supports
> > +version 1.3 must also support 1.0 through 1.2.
> > +
> > +VFIO_USER_DMA_MAP
> > +-----------------
> > +
> > +VFIO_USER_DMA_UNMAP
> > +-------------------
> > +
> > +Message Format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Device ID    | 0                      |
> > ++--------------+------------------------+
> 
> Is this message per-session instead of per-device?
> 
> This has implications on IOMMU semantics. Two different devices sharing
> a connection have access to the same DMA memory?

Good point, it is per-session since we're dropping the device ID from the
header. I suppose this means that if in a future version we want to support
devices sharing the same connection the header will have to be extended to
include the device ID.

> 
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | MAP=2, UNMAP=3         |
> > ++--------------+------------------------+
> > +| Message size | 16 + table size        |
> > ++--------------+------------------------+
> > +| Flags        | Reply bit set in reply |
> > ++--------------+------------------------+
> > +| Table        | array of table entries |
> > ++--------------+------------------------+
> > +
> > +This message is sent by the client to the server to inform it of the guest
> > +memory regions the device can access. It must be sent before the device
> can
> > +perform any DMA to the guest. It is normally sent directly after the
> version
> > +handshake is completed, but may also occur when memory is added or
> subtracted
> > +in the guest.
> > +
> > +The table is an array of the following structure. This structure is 32 
> > bytes
> > +in size, so the message size will be 16 + (# of table entries * 32). If a
> > +region being added can be directly mapped by the server, an array of file
> > +descriptors will be sent as part of the message meta-data. Each region
> entry
> > +will have a corresponding file descriptor. On AF_UNIX sockets, the file
> > +descriptors will be passed as SCM_RIGHTS type ancillary data.
> > +
> > +Table entry format
> > +^^^^^^^^^^^^^^^^^^
> > +
> > ++-------------+--------+-------------+
> > +| Name        | Offset | Size        |
> > ++=============+========+=============+
> > +| Address     | 0      | 8           |
> > ++-------------+--------+-------------+
> > +| Size        | 8      | 8           |
> > ++-------------+--------+-------------+
> > +| Offset      | 16     | 8           |
> > ++-------------+--------+-------------+
> > +| Protections | 24     | 4           |
> > ++-------------+--------+-------------+
> > +| Flags       | 28     | 4           |
> > ++-------------+--------+-------------+
> > +|             | +-----+------------+ |
> > +|             | | Bit | Definition | |
> > +|             | +=====+============+ |
> > +|             | | 0   | Mappable   | |
> > +|             | +-----+------------+ |
> > ++-------------+--------+-------------+
> > +
> > +* Address is the base DMA address of the region.
> > +* Size is the size of the region.
> > +* Offset is the file offset of the region with respect to the associated 
> > file
> > +  descriptor.
> > +* Protections are the region's protection attributes as encoded in
> > +  ``<sys/mman.h>``.
> > +* Flags contain the following region attributes:
> > +
> > +  * Mappable indicate the region can be mapped via the mmap() system
> call using
> > +    the file descriptor provided in the message meta-data.
> 
> What are the semantics of map/unmap with respect to mapping over
> existing regions, unmapping a subset of an existing region, etc?

The VFIO kernel driver returns EEXIST if mapping over an existing area
and allows unmapping a subset. I'd think we should follow their example.

> 
> What are the concerns for MAP/UNMAP while the device is operational and
> may be performing DMA? Should this command be combined into a single
> VFIO_USER_SET_DMA_REGIONS command with Flags Bit 1 indicating
> Add/Delete
> so that a single message can atomically add and delete DMA regions?

Regarding receiving a DMA unmap request while there is an ongoing DMA
transaction, once the server ACK's the DMA unmap then it must not touch that DMA
region. It is an implementation detail whether the server waits for the DMA to
finish or kills the DMA operation, which might not be possible at all (e.g part
of that region has been submitted for I/O to a physical device).

> 
> What is the format of the reply to this message?

It's just the header explained in the "Header" section, no additional data are
sent by the server, it simply ACK's the command.

> 
> > +
> > +VFIO_USER_DEVICE_GET_INFO
> > +-------------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+----------------------------+
> > +| Name         | Value                      |
> > ++==============+============================+
> > +| Device ID    | <ID>                       |
> > ++--------------+----------------------------+
> > +| Message ID   | <ID>                       |
> > ++--------------+----------------------------+
> > +| Command      | 4                          |
> > ++--------------+----------------------------+
> > +| Message size | 16 in request, 32 in reply |
> > ++--------------+----------------------------+
> > +| Flags        | Reply bit set in reply     |
> > ++--------------+----------------------------+
> > +| Device info  | VFIO device info           |
> > ++--------------+----------------------------+
> > +
> > +This message is sent by the client to the server to query for basic
> information
> > +about the device. Only the message header is needed in the request
> message.
> > +The VFIO device info structure is defined in ``<sys/vfio.h>`` (``struct
> > +vfio_device_info``).
> > +
> > +VFIO device info format
> > +^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > ++-------------+--------+--------------------------+
> > +| Name        | Offset | Size                     |
> > ++=============+========+==========================+
> > +| argsz       | 16     | 4                        |
> > ++-------------+--------+--------------------------+
> > +| flags       | 20     | 4                        |
> > ++-------------+--------+--------------------------+
> > +|             | +-----+-------------------------+ |
> > +|             | | Bit | Definition              | |
> > +|             | +=====+=========================+ |
> > +|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
> > +|             | +-----+-------------------------+ |
> > +|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
> > +|             | +-----+-------------------------+ |
> > ++-------------+--------+--------------------------+
> > +| num_regions | 24     | 4                        |
> > ++-------------+--------+--------------------------+
> > +| num_irqs    | 28     | 4                        |
> > ++-------------+--------+--------------------------+
> > +
> > +* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
> > +  implementation.
> > +* flags contains the following device attributes.
> > +
> > +  * VFIO_DEVICE_FLAGS_RESET indicates the device supports the
> > +    VFIO_USER_DEVICE_RESET message.
> 
> Why is VFIO_USER_DEVICE_RESET support optional?

Because it is optional in VFIO, too.

> 
> > +  * VFIO_DEVICE_FLAGS_PCI indicates the device is a PCI device.
> > +
> > +* num_regions is the number of memory regions the device exposes.
> > +* num_irqs is the number of distinct interrupt types the device supports.
> > +
> > +This version of the protocol only supports PCI devices. Additional devices
> may
> > +be supported in future versions.
> > +
> > +VFIO_USER_DEVICE_GET_REGION_INFO
> > +--------------------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------+
> > +| Name         | Value            |
> > ++==============+==================+
> > +| Device ID    | <ID>             |
> > ++--------------+------------------+
> > +| Message ID   | <ID>             |
> > ++--------------+------------------+
> > +| Command      | 5                |
> > ++--------------+------------------+
> > +| Message size | 48 + any caps    |
> > ++--------------+------------------+
> 
> The client does not know how much space capabilities need when sending
> the request. Should the client send a 48-byte request and expect the
> server to reply with a 48+ byte response?

Correct.

> 
> > +DMA Read/Write Data
> > +^^^^^^^^^^^^^^^^^^^
> > +
> > ++---------+--------+----------+
> > +| Name    | Offset | Size     |
> > ++=========+========+==========+
> > +| Address | 16     | 8        |
> > ++---------+--------+----------+
> > +| Count   | 24     | 4        |
> > ++---------+--------+----------+
> > +| Data    | 28     | variable |
> > ++---------+--------+----------+
> > +
> > +* Address is the area of guest memory being accessed. This address must
> have
> > +  been exported to the server with a VFIO_USER_DMA_MAP message.
> > +* Count is the size of the data to be transferred.
> > +* Data is the data to be read or written.
> > +
> > +Address and count can also be accessed as ``struct iovec`` from
> ``<sys/uio.h>``.
> 
> This seems to be incorrect since the count field is 4 bytes but struct
> iovec::iov_len is size_t (64-bit on 64-bit hosts).

Indeed, I forgot about padding. We can remove the reference to struct iovec or
make count 8 bytes?

> 
> > +
> > +VFIO_USER_REGION_READ
> > +---------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Device ID    | <ID>                   |
> > ++--------------+------------------------+
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | 8                      |
> > ++--------------+------------------------+
> > +| Message size | 32 + data size         |
> > ++--------------+------------------------+
> > +| Flags Reply  | bit set in reply       |
> > ++--------------+------------------------+
> > +| Read info    | REGION read/write data |
> > ++--------------+------------------------+
> > +
> > +This request is sent from the client to the server to read from device
> memory.
> > +In the request messages, there will be no data, and the count field will be
> the
> > +amount of data to be read. The reply will include the data read, and its
> count
> > +field will be the amount of data read.
> > +
> > +VFIO_USER_REGION_WRITE
> > +----------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Device ID    | <ID>                   |
> > ++--------------+------------------------+
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | 9                      |
> > ++--------------+------------------------+
> > +| Message size | 32 + data size         |
> > ++--------------+------------------------+
> > +| Flags        | Reply bit set in reply |
> > ++--------------+------------------------+
> > +| Write info   | REGION read write data |
> > ++--------------+------------------------+
> > +
> > +This request is sent from the client to the server to write to device
> memory.
> > +The request message will contain the data to be written, and its count
> field
> > +will contain the amount of write data. The count field in the reply will be
> > +zero.
> > +
> > +VFIO_USER_DMA_READ
> > +------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+---------------------+
> > +| Name         | Value               |
> > ++==============+=====================+
> > +| Device ID    | <ID>                |
> > ++--------------+---------------------+
> > +| Message ID   | <ID>                |
> > ++--------------+---------------------+
> > +| Command      | 10                  |
> > ++--------------+---------------------+
> > +| Message size | 28 + data size      |
> > ++--------------+---------------------+
> > +| Flags Reply  | bit set in reply    |
> > ++--------------+---------------------+
> > +| DMA info     | DMA read/write data |
> > ++--------------+---------------------+
> > +
> > +This request is sent from the server to the client to read from guest
> memory.
> > +In the request messages, there will be no data, and the count field will be
> the
> > +amount of data to be read. The reply will include the data read, and its
> count
> > +field will be the amount of data read.
> > +
> > +VFIO_USER_DMA_WRITE
> > +-------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Device ID    | <ID>                   |
> > ++--------------+------------------------+
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | 11                     |
> > ++--------------+------------------------+
> > +| Message size | 28 + data size         |
> > ++--------------+------------------------+
> > +| Flags        | Reply bit set in reply |
> > ++--------------+------------------------+
> > +| DMA info     | DMA read/write data    |
> > ++--------------+------------------------+
> > +
> > +This request is sent from the server to the client to write to guest
> memory.
> > +The request message will contain the data to be written, and its count
> field
> > +will contain the amount of write data. The count field in the reply will be
> > +zero.
> > +
> > +VFIO_USER_VM_INTERRUPT
> > +----------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++----------------+------------------------+
> > +| Name           | Value                  |
> > ++================+========================+
> > +| Device ID      | <ID>                   |
> > ++----------------+------------------------+
> > +| Message ID     | <ID>                   |
> > ++----------------+------------------------+
> > +| Command        | 12                     |
> > ++----------------+------------------------+
> > +| Message size   | 24                     |
> > ++----------------+------------------------+
> > +| Flags          | Reply bit set in reply |
> > ++----------------+------------------------+
> > +| Interrupt info | <interrupt>            |
> > ++----------------+------------------------+
> > +
> > +This request is sent from the server to the client to signal the device has
> > +raised an interrupt.
> > +
> > +Interrupt info format
> > +^^^^^^^^^^^^^^^^^^^^^
> > +
> > ++----------+--------+------+
> > +| Name     | Offset | Size |
> > ++==========+========+======+
> > +| Index    | 16     | 4    |
> > ++----------+--------+------+
> > +| Subindex | 20     | 4    |
> > ++----------+--------+------+
> > +
> > +* Index is the interrupt index; it is the same value used in
> VFIO_USER_SET_IRQS.
> > +* Subindex is relative to the index, e.g., the vector number used in PCI
> MSI/X
> > +  type interrupts.
> > +
> > +VFIO_USER_DEVICE_RESET
> > +----------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Device ID    | <ID>                   |
> > ++--------------+------------------------+
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | 13                     |
> > ++--------------+------------------------+
> > +| Message size | 16                     |
> > ++--------------+------------------------+
> > +| Flags        | Reply bit set in reply |
> > ++--------------+------------------------+
> > +
> > +This request is sent from the client to the server to reset the device.
> 
> How are errors treated in any of these commands? For example, if memory
> addresses are out-of-bounds?

We'll add an error flag in the header and an error field to store a UNIX errno.
It will be unused in the command.


John and Thanos



reply via email to

[Prev in Thread] Current Thread [Next in Thread]