qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] introduce VFIO-over-socket protocol specificaion


From: John G Johnson
Subject: Re: [PATCH] introduce VFIO-over-socket protocol specificaion
Date: Wed, 29 Jul 2020 21:48:54 -0700

        Thanos is on vacation.  My comments embedded.

                                        JJ



> On Jul 29, 2020, at 5:48 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> 
> On Wed, Jul 22, 2020 at 11:42:26AM +0000, Thanos Makatos wrote:
>>>> diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over-
>>> socket.rst
>>>> new file mode 100644
>>>> index 0000000..723b944
>>>> --- /dev/null
>>>> +++ b/docs/devel/vfio-over-socket.rst
>>>> @@ -0,0 +1,1135 @@
>>>> +***************************************
>>>> +VFIO-over-socket Protocol Specification
>>>> +***************************************
>>>> +
>>>> +Version 0.1
>>> 
>>> Please include a reference to the section below explaining how
>>> versioning works.
>> 
>> I'm not sure I understand, do you mean we should add something like the
>> following (right below "Version 0.1"):
>> 
>> "Refer to section 1.2.3 on how versioning works."
>> 
>> ?
> 
> Yes, coming across this version number the reader has no idea about its
> meaning and how the protocol is versioned. The spec then moves on to
> other things. It would be helpful to reference the section that explains
> how versioning works so that the reader knows where to go to understand
> the meaning of the number.
> 

        OK, we’ll add a forward reference


>>> What about the ordering semantics at the vfio-user protocol level? For
>>> example, if a client sends multiple VFIO_USER_DMA_MAP/UNMAP
>>> messages
>>> then the order matters. What are the rules? I wonder if maybe
>>> concurrency is a special case and really only applies to a subset of
>>> protocol commands?
>> 
>> All commands are executed in the order they were sent, regardless of whether 
>> a
>> reply is needed.
>> 
>>> 
>>> I'm not sure how a client would exploit parallelism in this protocol.
>>> Can you give an example of a case where there would be multiple commands
>>> pending on a single device?
>> 
>> For instance, a client can issue the following operations back to back 
>> without
>> waiting for the first two to complete:
>> 1. map a DMA region 
>> 2. trigger some device-specific operation that results in data being read 
>> into
>>   that DMA region, and
>> 3. unmap the DMA region
> 
> That is pipelining, but I don't see the ability to "reorder
> non-conflicting requests in parallel". That example has no parallelism.
> 
> It's unclear to me what the spec means by concurrency and parallelism.
> 
> If the intention is just to allow pipelining then request identifiers
> aren't really necessary. The client can keep track of which request will
> complete next. So I'm wondering if there is some parallelism somewhere
> that I'm missing…
> 

        The reason we have message IDs is so the client knows which request
is being acknowledged if it has more than one non-ack'd request outstanding.
Requests are executed in-order; the only time parallelism can happen is if
multiple client threads send requests in parallel.  A single thread can
pipeline requests, but those requests are not executed out of order by the
server.

        I’ll try to re-word it to be clearer.


> 
>>> 
>>>> +
>>>> +Socket Disconnection Behavior
>>>> +-----------------------------
>>>> +The server and the client can disconnect from each other, either
>>> intentionally
>>>> +or unexpectedly. Both the client and the server need to know how to
>>> handle such
>>>> +events.
>>>> +
>>>> +Server Disconnection
>>>> +^^^^^^^^^^^^^^^^^^^^
>>>> +A server disconnecting from the client may indicate that:
>>>> +
>>>> +1) A virtual device has been restarted, either intentionally (e.g. 
>>>> because of
>>> a
>>>> +device update) or unintentionally (e.g. because of a crash). In any case,
>>> the
>>>> +virtual device will come back so the client should not do anything (e.g.
>>> simply
>>>> +reconnect and retry failed operations).
>>>> +
>>>> +2) A virtual device has been shut down with no intention to be restarted.
>>>> +
>>>> +It is impossible for the client to know whether or not a failure is
>>>> +intermittent or innocuous and should be retried, therefore the client
>>> should
>>>> +attempt to reconnect to the socket. Since an intentional server restart
>>> (e.g.
>>>> +due to an upgrade) might take some time, a reasonable timeout should
>>> be used.
>>>> +In cases where the disconnection is expected (e.g. the guest shutting
>>> down), no
>>>> +new requests will be sent anyway so this situation doesn't pose a
>>> problem. The
>>>> +control stack will clean up accordingly.
>>>> +
>>>> +Parametrizing this behaviour by having the virtual device advertise a
>>> 
>>> s/Parametrizing/Parameterizing/
>> 
>> OK.
>> 
>>> 
>>>> +reasonable reconnect is deferred to a future version of the protocol.
>>> 
>>> No mention is made of recovering state or how disconnect maps to VFIO
>>> device types (PCI, etc.). Does a disconnect mean that the device has
>>> been reset?
>> 
>> Regarding recovering state, I believe that since all the building blocks are
>> there and the client is pretty much the master in the vfio-user model, it's 
>> up
>> to the client how to deal with exceptional situations. For recovery to work, 
>> the
>> client will have to reconfigure IRQs, DMAs, etc., and the server will have to
>> persistently store device state.
>> 
>> Regarding how disconnect maps to VFIO device types, it depends on whether or 
>> not
>> the client/server can recover from it and continue operating. If this doesn't
>> happen (e.g. the server hasn't restarted, the client doesn't support
>> recovering), then the device cannot continue being operational, which I 
>> suppose
>> is an implementation detail of the client.
>> 
>> Do we need something more specific at this stage?
> 
> This sounds like implementation-defined behavior. If the spec does not
> cover this area then it might be hard to standardize it later without
> breaking existing implementations.
> 
> Do you know how kernel VFIO behaves when the application is terminated
> unexpectedly?
> 
> My intuition is that the device should be reset or at least fenced to
> discard all DMA immediately. With that as the default behavior the
> protocol could let implementations use protocol extensions for other
> recovery behavior.
> 

        I’m not sure we’ll be able to improve on device reset, since
the client does not know what operations were completed before the
server died, and some region reads and writes may not be idempotent.  

        I’d like to give Thanos a chance to reply before we commit to
a change.


>>>> +|              |        | Version supported by the sender, e.g.
>>> ???????0.1????????.      |
>>>> ++--------------+--------+---------------------------------------------------+
>>>> +| type         | string | Fixed to ???????vfio-user????????.              
>>>>                |
>>>> ++--------------+--------+---------------------------------------------------+
>>>> +| capabilities | array  | Reserved. Can be omitted for v0.1, otherwise 
>>>> must
>>> |
>>>> +|              |        | be empty.                                       
>>>>   |
>>>> ++--------------+--------+---------------------------------------------------+
>>> 
>>> s/array/array of strings/ ?
>> 
>> No, it's array were each element can be any object in JSON.
> 
> Okay. Maybe there is a way to be explicit by saying "untyped array" or
> adding "The type of the elements is not defined in this version of the
> specification" to the description.
> 
>> 
>>> 
>>>> +
>>>> +Versioning and Feature Support
>>>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>> +Upon accepting a connection, the server must send a
>>> VFIO_USER_VERSION message
>>>> +proposing a protocol version and a set of capabilities. The client 
>>>> compares
>>>> +these with the versions and capabilities it supports and sends a
>>>> +VFIO_USER_VERSION reply according to the following rules.
>>>> +
>>>> +* The major version in the reply must be the same as proposed. If the
>>> client
>>>> +  does not support the proposed major, it closes the connection.
>>>> +* The minor version in the reply must be equal to or less than the minor
>>>> +  version proposed.
>>>> +* The capability list must be a subset of those proposed. If the client
>>>> +  requires a capability the server did not include, it closes the 
>>>> connection.
>>>> +* If type is not ???????vfio-user????????, the client closes the
>>> connection.
>>> 
>>> Is there a rationale for this field? In order to get to the point where
>>> this JSON is parsed we must already implement the vfio-user protocol
>>> (e.g. parse the header).
>> 
>> It was suggested to include it as a sanity check, we'll drop it.
> 
> It's fine to keep it. Many protocols have a banner or identifier early
> on that requires some parsing. In this case it could be: this field
> makes it easy for humans decoding a protocol conversation to identify
> the protocol as vfio-user and also serves as a sanity check.
> 

        I’d like to re-think the versioning a bit to avoid some of
the indeterminism.  I think it might be better to have the fields that
must exist to bootstrap the protocol (major & minor) be enumerated in
the version packet body.  Then, for each major/minor we can specify the
objects and their types that may exist in the JSON array.

        This list would initially only have the “type” string.



>>> What are the semantics of map/unmap with respect to mapping over
>>> existing regions, unmapping a subset of an existing region, etc?
>> 
>> The VFIO kernel driver returns EEXIST if mapping over an existing area
>> and allows unmapping a subset. I'd think we should follow their example.
> 
> Sounds good.
> 
>>> 
>>> What are the concerns for MAP/UNMAP while the device is operational and
>>> may be performing DMA? Should this command be combined into a single
>>> VFIO_USER_SET_DMA_REGIONS command with Flags Bit 1 indicating
>>> Add/Delete
>>> so that a single message can atomically add and delete DMA regions?
>> 
>> Regarding receiving a DMA unmap request while there is an ongoing DMA
>> transaction, once the server ACK's the DMA unmap then it must not touch that 
>> DMA
>> region. It is an implementation detail whether the server waits for the DMA 
>> to
>> finish or kills the DMA operation, which might not be possible at all (e.g 
>> part
>> of that region has been submitted for I/O to a physical device).
>> 
>>> 
>>> What is the format of the reply to this message?
>> 
>> It's just the header explained in the "Header" section, no additional data 
>> are
>> sent by the server, it simply ACK's the command.
> 
> Okay, this was a little unclear to me when reading the spec. Maybe it
> would help to have a general explanation at the beginning stating that
> the message format is used by both the request and the reply unless
> stated otherwise.
> 

        OK.

>> 
>>> 
>>>> +
>>>> +VFIO_USER_DEVICE_GET_INFO
>>>> +-------------------------
>>>> +
>>>> +Message format
>>>> +^^^^^^^^^^^^^^
>>>> +
>>>> ++--------------+----------------------------+
>>>> +| Name         | Value                      |
>>>> ++==============+============================+
>>>> +| Device ID    | <ID>                       |
>>>> ++--------------+----------------------------+
>>>> +| Message ID   | <ID>                       |
>>>> ++--------------+----------------------------+
>>>> +| Command      | 4                          |
>>>> ++--------------+----------------------------+
>>>> +| Message size | 16 in request, 32 in reply |
>>>> ++--------------+----------------------------+
>>>> +| Flags        | Reply bit set in reply     |
>>>> ++--------------+----------------------------+
>>>> +| Device info  | VFIO device info           |
>>>> ++--------------+----------------------------+
>>>> +
>>>> +This message is sent by the client to the server to query for basic
>>> information
>>>> +about the device. Only the message header is needed in the request
>>> message.
>>>> +The VFIO device info structure is defined in ``<sys/vfio.h>`` (``struct
>>>> +vfio_device_info``).
>>>> +
>>>> +VFIO device info format
>>>> +^^^^^^^^^^^^^^^^^^^^^^^
>>>> +
>>>> ++-------------+--------+--------------------------+
>>>> +| Name        | Offset | Size                     |
>>>> ++=============+========+==========================+
>>>> +| argsz       | 16     | 4                        |
>>>> ++-------------+--------+--------------------------+
>>>> +| flags       | 20     | 4                        |
>>>> ++-------------+--------+--------------------------+
>>>> +|             | +-----+-------------------------+ |
>>>> +|             | | Bit | Definition              | |
>>>> +|             | +=====+=========================+ |
>>>> +|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
>>>> +|             | +-----+-------------------------+ |
>>>> +|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
>>>> +|             | +-----+-------------------------+ |
>>>> ++-------------+--------+--------------------------+
>>>> +| num_regions | 24     | 4                        |
>>>> ++-------------+--------+--------------------------+
>>>> +| num_irqs    | 28     | 4                        |
>>>> ++-------------+--------+--------------------------+
>>>> +
>>>> +* argz is reserved in vfio-user, it is only used in the ioctl() VFIO
>>>> +  implementation.
>>>> +* flags contains the following device attributes.
>>>> +
>>>> +  * VFIO_DEVICE_FLAGS_RESET indicates the device supports the
>>>> +    VFIO_USER_DEVICE_RESET message.
>>> 
>>> Why is VFIO_USER_DEVICE_RESET support optional?
>> 
>> Because it is optional in VFIO, too.
> 
> I see :D
> 
>>>> +DMA Read/Write Data
>>>> +^^^^^^^^^^^^^^^^^^^
>>>> +
>>>> ++---------+--------+----------+
>>>> +| Name    | Offset | Size     |
>>>> ++=========+========+==========+
>>>> +| Address | 16     | 8        |
>>>> ++---------+--------+----------+
>>>> +| Count   | 24     | 4        |
>>>> ++---------+--------+----------+
>>>> +| Data    | 28     | variable |
>>>> ++---------+--------+----------+
>>>> +
>>>> +* Address is the area of guest memory being accessed. This address must
>>> have
>>>> +  been exported to the server with a VFIO_USER_DMA_MAP message.
>>>> +* Count is the size of the data to be transferred.
>>>> +* Data is the data to be read or written.
>>>> +
>>>> +Address and count can also be accessed as ``struct iovec`` from
>>> ``<sys/uio.h>``.
>>> 
>>> This seems to be incorrect since the count field is 4 bytes but struct
>>> iovec::iov_len is size_t (64-bit on 64-bit hosts).
>> 
>> Indeed, I forgot about padding. We can remove the reference to struct iovec 
>> or
>> make count 8 bytes?
> 
> I suggest removing the reference to struct iovec.

        OK






reply via email to

[Prev in Thread] Current Thread [Next in Thread]