qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] introduce VFIO-over-socket protocol specificaion


From: Stefan Hajnoczi
Subject: Re: [PATCH] introduce VFIO-over-socket protocol specificaion
Date: Wed, 29 Jul 2020 13:48:03 +0100

On Wed, Jul 22, 2020 at 11:42:26AM +0000, Thanos Makatos wrote:
> > > diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over-
> > socket.rst
> > > new file mode 100644
> > > index 0000000..723b944
> > > --- /dev/null
> > > +++ b/docs/devel/vfio-over-socket.rst
> > > @@ -0,0 +1,1135 @@
> > > +***************************************
> > > +VFIO-over-socket Protocol Specification
> > > +***************************************
> > > +
> > > +Version 0.1
> > 
> > Please include a reference to the section below explaining how
> > versioning works.
> 
> I'm not sure I understand, do you mean we should add something like the
> following (right below "Version 0.1"):
> 
> "Refer to section 1.2.3 on how versioning works."
> 
> ?

Yes, coming across this version number the reader has no idea about its
meaning and how the protocol is versioned. The spec then moves on to
other things. It would be helpful to reference the section that explains
how versioning works so that the reader knows where to go to understand
the meaning of the number.

> > What about the ordering semantics at the vfio-user protocol level? For
> > example, if a client sends multiple VFIO_USER_DMA_MAP/UNMAP
> > messages
> > then the order matters. What are the rules? I wonder if maybe
> > concurrency is a special case and really only applies to a subset of
> > protocol commands?
> 
> All commands are executed in the order they were sent, regardless of whether a
> reply is needed.
> 
> > 
> > I'm not sure how a client would exploit parallelism in this protocol.
> > Can you give an example of a case where there would be multiple commands
> > pending on a single device?
> 
> For instance, a client can issue the following operations back to back without
> waiting for the first two to complete:
> 1. map a DMA region 
> 2. trigger some device-specific operation that results in data being read into
>    that DMA region, and
> 3. unmap the DMA region

That is pipelining, but I don't see the ability to "reorder
non-conflicting requests in parallel". That example has no parallelism.

It's unclear to me what the spec means by concurrency and parallelism.

If the intention is just to allow pipelining then request identifiers
aren't really necessary. The client can keep track of which request will
complete next. So I'm wondering if there is some parallelism somewhere
that I'm missing...


> > 
> > > +
> > > +Socket Disconnection Behavior
> > > +-----------------------------
> > > +The server and the client can disconnect from each other, either
> > intentionally
> > > +or unexpectedly. Both the client and the server need to know how to
> > handle such
> > > +events.
> > > +
> > > +Server Disconnection
> > > +^^^^^^^^^^^^^^^^^^^^
> > > +A server disconnecting from the client may indicate that:
> > > +
> > > +1) A virtual device has been restarted, either intentionally (e.g. 
> > > because of
> > a
> > > +device update) or unintentionally (e.g. because of a crash). In any case,
> > the
> > > +virtual device will come back so the client should not do anything (e.g.
> > simply
> > > +reconnect and retry failed operations).
> > > +
> > > +2) A virtual device has been shut down with no intention to be restarted.
> > > +
> > > +It is impossible for the client to know whether or not a failure is
> > > +intermittent or innocuous and should be retried, therefore the client
> > should
> > > +attempt to reconnect to the socket. Since an intentional server restart
> > (e.g.
> > > +due to an upgrade) might take some time, a reasonable timeout should
> > be used.
> > > +In cases where the disconnection is expected (e.g. the guest shutting
> > down), no
> > > +new requests will be sent anyway so this situation doesn't pose a
> > problem. The
> > > +control stack will clean up accordingly.
> > > +
> > > +Parametrizing this behaviour by having the virtual device advertise a
> > 
> > s/Parametrizing/Parameterizing/
> 
> OK.
> 
> > 
> > > +reasonable reconnect is deferred to a future version of the protocol.
> > 
> > No mention is made of recovering state or how disconnect maps to VFIO
> > device types (PCI, etc.). Does a disconnect mean that the device has
> > been reset?
> 
> Regarding recovering state, I believe that since all the building blocks are
> there and the client is pretty much the master in the vfio-user model, it's up
> to the client how to deal with exceptional situations. For recovery to work, 
> the
> client will have to reconfigure IRQs, DMAs, etc., and the server will have to
> persistently store device state.
> 
> Regarding how disconnect maps to VFIO device types, it depends on whether or 
> not
> the client/server can recover from it and continue operating. If this doesn't
> happen (e.g. the server hasn't restarted, the client doesn't support
> recovering), then the device cannot continue being operational, which I 
> suppose
> is an implementation detail of the client.
> 
> Do we need something more specific at this stage?

This sounds like implementation-defined behavior. If the spec does not
cover this area then it might be hard to standardize it later without
breaking existing implementations.

Do you know how kernel VFIO behaves when the application is terminated
unexpectedly?

My intuition is that the device should be reset or at least fenced to
discard all DMA immediately. With that as the default behavior the
protocol could let implementations use protocol extensions for other
recovery behavior.

> > > +|              |        | Version supported by the sender, e.g.
> > ???????0.1????????.      |
> > > ++--------------+--------+---------------------------------------------------+
> > > +| type         | string | Fixed to ???????vfio-user????????.             
> > >                 |
> > > ++--------------+--------+---------------------------------------------------+
> > > +| capabilities | array  | Reserved. Can be omitted for v0.1, otherwise 
> > > must
> > |
> > > +|              |        | be empty.                                      
> > >    |
> > > ++--------------+--------+---------------------------------------------------+
> > 
> > s/array/array of strings/ ?
> 
> No, it's array were each element can be any object in JSON.

Okay. Maybe there is a way to be explicit by saying "untyped array" or
adding "The type of the elements is not defined in this version of the
specification" to the description.

> 
> > 
> > > +
> > > +Versioning and Feature Support
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +Upon accepting a connection, the server must send a
> > VFIO_USER_VERSION message
> > > +proposing a protocol version and a set of capabilities. The client 
> > > compares
> > > +these with the versions and capabilities it supports and sends a
> > > +VFIO_USER_VERSION reply according to the following rules.
> > > +
> > > +* The major version in the reply must be the same as proposed. If the
> > client
> > > +  does not support the proposed major, it closes the connection.
> > > +* The minor version in the reply must be equal to or less than the minor
> > > +  version proposed.
> > > +* The capability list must be a subset of those proposed. If the client
> > > +  requires a capability the server did not include, it closes the 
> > > connection.
> > > +* If type is not ???????vfio-user????????, the client closes the
> > connection.
> > 
> > Is there a rationale for this field? In order to get to the point where
> > this JSON is parsed we must already implement the vfio-user protocol
> > (e.g. parse the header).
> 
> It was suggested to include it as a sanity check, we'll drop it.

It's fine to keep it. Many protocols have a banner or identifier early
on that requires some parsing. In this case it could be: this field
makes it easy for humans decoding a protocol conversation to identify
the protocol as vfio-user and also serves as a sanity check.

> > What are the semantics of map/unmap with respect to mapping over
> > existing regions, unmapping a subset of an existing region, etc?
> 
> The VFIO kernel driver returns EEXIST if mapping over an existing area
> and allows unmapping a subset. I'd think we should follow their example.

Sounds good.

> > 
> > What are the concerns for MAP/UNMAP while the device is operational and
> > may be performing DMA? Should this command be combined into a single
> > VFIO_USER_SET_DMA_REGIONS command with Flags Bit 1 indicating
> > Add/Delete
> > so that a single message can atomically add and delete DMA regions?
> 
> Regarding receiving a DMA unmap request while there is an ongoing DMA
> transaction, once the server ACK's the DMA unmap then it must not touch that 
> DMA
> region. It is an implementation detail whether the server waits for the DMA to
> finish or kills the DMA operation, which might not be possible at all (e.g 
> part
> of that region has been submitted for I/O to a physical device).
> 
> > 
> > What is the format of the reply to this message?
> 
> It's just the header explained in the "Header" section, no additional data are
> sent by the server, it simply ACK's the command.

Okay, this was a little unclear to me when reading the spec. Maybe it
would help to have a general explanation at the beginning stating that
the message format is used by both the request and the reply unless
stated otherwise.

> 
> > 
> > > +
> > > +VFIO_USER_DEVICE_GET_INFO
> > > +-------------------------
> > > +
> > > +Message format
> > > +^^^^^^^^^^^^^^
> > > +
> > > ++--------------+----------------------------+
> > > +| Name         | Value                      |
> > > ++==============+============================+
> > > +| Device ID    | <ID>                       |
> > > ++--------------+----------------------------+
> > > +| Message ID   | <ID>                       |
> > > ++--------------+----------------------------+
> > > +| Command      | 4                          |
> > > ++--------------+----------------------------+
> > > +| Message size | 16 in request, 32 in reply |
> > > ++--------------+----------------------------+
> > > +| Flags        | Reply bit set in reply     |
> > > ++--------------+----------------------------+
> > > +| Device info  | VFIO device info           |
> > > ++--------------+----------------------------+
> > > +
> > > +This message is sent by the client to the server to query for basic
> > information
> > > +about the device. Only the message header is needed in the request
> > message.
> > > +The VFIO device info structure is defined in ``<sys/vfio.h>`` (``struct
> > > +vfio_device_info``).
> > > +
> > > +VFIO device info format
> > > +^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++-------------+--------+--------------------------+
> > > +| Name        | Offset | Size                     |
> > > ++=============+========+==========================+
> > > +| argsz       | 16     | 4                        |
> > > ++-------------+--------+--------------------------+
> > > +| flags       | 20     | 4                        |
> > > ++-------------+--------+--------------------------+
> > > +|             | +-----+-------------------------+ |
> > > +|             | | Bit | Definition              | |
> > > +|             | +=====+=========================+ |
> > > +|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
> > > +|             | +-----+-------------------------+ |
> > > +|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
> > > +|             | +-----+-------------------------+ |
> > > ++-------------+--------+--------------------------+
> > > +| num_regions | 24     | 4                        |
> > > ++-------------+--------+--------------------------+
> > > +| num_irqs    | 28     | 4                        |
> > > ++-------------+--------+--------------------------+
> > > +
> > > +* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
> > > +  implementation.
> > > +* flags contains the following device attributes.
> > > +
> > > +  * VFIO_DEVICE_FLAGS_RESET indicates the device supports the
> > > +    VFIO_USER_DEVICE_RESET message.
> > 
> > Why is VFIO_USER_DEVICE_RESET support optional?
> 
> Because it is optional in VFIO, too.

I see :D

> > > +DMA Read/Write Data
> > > +^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++---------+--------+----------+
> > > +| Name    | Offset | Size     |
> > > ++=========+========+==========+
> > > +| Address | 16     | 8        |
> > > ++---------+--------+----------+
> > > +| Count   | 24     | 4        |
> > > ++---------+--------+----------+
> > > +| Data    | 28     | variable |
> > > ++---------+--------+----------+
> > > +
> > > +* Address is the area of guest memory being accessed. This address must
> > have
> > > +  been exported to the server with a VFIO_USER_DMA_MAP message.
> > > +* Count is the size of the data to be transferred.
> > > +* Data is the data to be read or written.
> > > +
> > > +Address and count can also be accessed as ``struct iovec`` from
> > ``<sys/uio.h>``.
> > 
> > This seems to be incorrect since the count field is 4 bytes but struct
> > iovec::iov_len is size_t (64-bit on 64-bit hosts).
> 
> Indeed, I forgot about padding. We can remove the reference to struct iovec or
> make count 8 bytes?

I suggest removing the reference to struct iovec.

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]