From: Michael S. Tsirkin
Subject: Re: [Qemu-devel] [PATCH v6 0/8] Vhost and vhost-net support for userspace based backends
Date: Mon, 27 Jan 2014 18:49:08 +0200

On Mon, Jan 27, 2014 at 05:37:02PM +0100, Antonios Motakis wrote:
> Hello again,
> 
> 
> On Wed, Jan 15, 2014 at 3:49 PM, Michael S. Tsirkin <address@hidden> wrote:
> >
> > On Wed, Jan 15, 2014 at 01:50:47PM +0100, Antonios Motakis wrote:
> > >
> > >
> > >
> > > > On Wed, Jan 15, 2014 at 10:07 AM, Michael S. Tsirkin <address@hidden> wrote:
> > >
> > >     On Tue, Jan 14, 2014 at 07:13:43PM +0100, Antonios Motakis wrote:
> > >     >
> > >     >
> > >     >
> > >     > On Tue, Jan 14, 2014 at 12:33 PM, Michael S. Tsirkin <address@hidden> wrote:
> > >     >
> > >     >     On Mon, Jan 13, 2014 at 03:25:11PM +0100, Antonios Motakis wrote:
> > >     >     > In this patch series we would like to introduce our approach for
> > >     >     > putting a virtio-net backend in an external userspace process. Our
> > >     >     > eventual target is to run the network backend in the Snabbswitch
> > >     >     > ethernet switch, while receiving traffic from a guest inside QEMU/KVM
> > >     >     > which runs an unmodified virtio-net implementation.
> > >     >     >
> > >     >     > For this, we are working on extending vhost to allow equivalent
> > >     >     > functionality for userspace. Vhost already passes control of the data
> > >     >     > plane of virtio-net to the host kernel; we want to realize a similar
> > >     >     > model, but for userspace.
> > >     >     >
> > >     >     > In this patch series the concept of a vhost-backend is introduced.
> > >     >     >
> > >     >     > We define two vhost backend types - vhost-kernel and vhost-user. The
> > >     >     > former is the interface to the current kernel module implementation.
> > >     >     > Its control plane is ioctl based. The data plane is the kernel directly
> > >     >     > accessing the QEMU-allocated guest memory.
> > >     >     >
> > >     >     > In the new vhost-user backend, the control plane is based on
> > >     >     > communication between QEMU and another userspace process using a unix
> > >     >     > domain socket. This allows a virtio backend for a guest running in QEMU
> > >     >     > to be implemented inside the other userspace process.
> > >     >     >
> > >     >     > We change -mem-path to QemuOpts and add prealloc, share and unlink as
> > >     >     > properties to it. HugeTLBFS requirements of -mem-path are relaxed, so
> > >     >     > any valid path can be used now. The new properties allow more fine
> > >     >     > grained control over the guest RAM backing store.
> > >     >     >
> > >     >     > The data path is realized by directly accessing the vrings and the
> > >     >     > buffer data off the guest's memory.
> > >     >     >
> > >     >     > The current user of vhost-user is only vhost-net. We add a new netdev
> > >     >     > backend that is intended to initialize vhost-net with a vhost-user
> > >     >     > backend.
> > >     >
> > >     >     Some meta comments.
> > >     >
> > >     >     Something that makes this patch harder to review is how it's
> > >     >     split up. Generally IMHO it's not a good idea to repeatedly
> > >     >     edit same part of file adding stuff in patch after patch,
> > >     >     it's only making things harder to read if you add stubs, then
> > >     >     fill them up.
> > >     >     (we do this sometimes when we are changing existing code, but
> > >     >     it is generally not needed when adding new code)
> > >     >
> > >     >     Instead, split it like this:
> > >     >
> > >     >     1. general refactoring, split out linux specific and generic
> > >     >        parts and add the ops indirection
> > >     >     2. add new files for vhost-user with complete implementation.
> > >     >        without command line to support it, there will be no way to
> > >     >        use it, but should build fine.
> > >     >     3. tie it all up with option parsing
> > >     >
> > >     >
> > >     >     Generic vhost and vhost net files should be kept separate.
> > >     >     Don't let vhost net stuff seep back into generic files,
> > >     >     we have vhost-scsi too.
> > >     >     I would also prefer that userspace vhost has its own files.
> > >     >
> > >     >
> > >     > Ok, we'll take this into account.
> > >     >
> > >     >
> > >     >
> > >     >     We need a small test server qemu can talk to, to verify things
> > >     >     actually work.
> > >     >
> > >     >
> > >     > We have implemented such a test app:
> > >     > https://github.com/virtualopensystems/vapp
> > >     >
> > >     > We use it for testing, and also as a reference implementation. A client
> > >     > is also included.
> > >     >
> > >
> > >     Sounds good. Can we include this in qemu and tie
> > >     it into the qtest framework?
> > >     From a brief look, it merely needs to be tweaked for portability,
> > >     unless
> > >
> > >     >
> > >     >     Already commented on: reuse the chardev syntax and preferably code.
> > >     >     We already support a bunch of options there for
> > >     >     domain sockets that will be useful here, they should
> > >     >     work here as well.
> > >     >
> > >     >
> > >     > We adapted the syntax for this to be consistent with chardev. For the
> > >     > options we didn't use, it is not obvious to us how they should be
> > >     > used; a lot of the chardev options just don't apply to us.
> > >     >
> > >
> > >     Well server option should work at least.
> > >     nowait can work too?
> > >
> > >     Also, if reconnect is useful it should be for chardevs too, so if we
> > >     don't share code, need to code it in two places to stay consistent.
> > >
> > >     Overall sharing some code might be better ...
> > >
> > >
> > >
> > > > What you have in mind is to use the functions chardev uses from
> > > > qemu-sockets.c right? Chardev itself doesn't look to have anything else
> > > > that can be shared.
> >
> > Yes.
> >
> > > > The problem with reconnect is that it is implemented at the protocol
> > > > level; we are not just transparently reconnecting the socket. So the same
> > > > approach would most likely not apply for chardev.
> >
> > Chardev mostly just could use transparent reconnect.
> > vhost-user could use that and get a callback to reconfigure
> > everything after reconnect.
> >
> > Once you write up the protocol in some text file we can
> > discuss this in more detail.
> > For example I wonder how would feature negotiation work
> > with reconnect: new connection could be from another
> > application that does not support same features, but
> > virtio assumes that device features never change.
> >
> 
> I attach the text document that we will include in the next version of
> the series, which describes the vhost-user protocol.
> 
> The protocol is based on and very close to the vhost kernel protocol.
> Of note is the VHOST_USER_ECHO message, which is the only one that
> doesn't have an equivalent ioctl in the kernel version of vhost; this
> is the message that is being used to detect that the remote party is
> not on the socket anymore. At that point QEMU will close the session
> and try to initiate a new one on the same socket.

What if e.g. features change in between?
Everything just goes south, doesn't it?
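
To make the concern concrete, here is a minimal sketch of the kind of check a
master could run after a reconnect. It is not taken from the patches; the
function name is hypothetical, and it assumes the master remembers the feature
bits it previously enabled with SET_FEATURES.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the series: after a reconnect the master
 * would query GET_FEATURES again and compare the result with the bits it
 * already enabled. Every previously enabled bit must still be offered,
 * because the guest assumes device features never change.
 */
static bool vu_features_still_valid(uint64_t acked, uint64_t offered_now)
{
    return (acked & ~offered_now) == 0;
}
```

If the check fails, transparently continuing would break the guest; what the
master should do in that case is exactly the open question.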

Is this detection and reconnect a must for your project?

I think it would be simpler to
        - generalize char unix socket handling code and reuse for vhost-user
        - as a separate step, add live detection and reconnect abilities
          to the generic code

> > >
> > >
> > >
> > >     >     In particular you shouldn't require filesystem access by qemu,
> > >     >     passing fd for domain socket should work.
> > >     >
> > >     >
> > >     > We can add an option to pass an fd for the domain socket if needed.
> > >     > However as far as we understand, chardev doesn't do that either (at
> > >     > least from looking at the man page). Maybe we misunderstand what you
> > >     > mean.
> > >
> > >     Sorry. I got confused with e.g. tap which has this. This might be
> > >     useful but does not have to block this patch.
> > >
> > >     >
> > >     >
> > >     >     > Example usage:
> > >     >     >
> > >     >     > qemu -m 1024 -mem-path /hugetlbfs,prealloc=on,share=on \
> > >     >     >      -netdev type=vhost-user,id=net0,path=/path/to/sock,poll_time=2500 \
> > >     >     >      -device virtio-net-pci,netdev=net0
> > >     >
> > >     >     It's not clear which parts of -mem-path are required for vhost-user.
> > >     >     It should be documented somewhere, made clear in -help
> > >     >     and should fail gracefully if misconfigured.
> > >     >
> > >     >
> > >     >
> > >     > Ok.
> > >     >
> > >     >
> > >     >
> > >     >     >
> > >     >     > Changes from v5:
> > >     >     >  - Split -mem-path unlink option to a separate patch
> > >     >     >  - Fds are passed only in the ancillary data
> > >     >     >  - Stricter message size checks on receive/send
> > >     >     >  - Netdev vhost-user now includes path and poll_time options
> > >     >     >  - The connection probing interval is configurable
> > >     >     >
> > >     >     > Changes from v4:
> > >     >     >  - Use error_report for errors
> > >     >     >  - VhostUserMsg has new field `size` indicating the following payload
> > >     >     >    length. Field `flags` now has version and reply bits. The structure
> > >     >     >    is packed.
> > >     >     >  - Send data is of variable length (`size` field in message)
> > >     >     >  - Receive in 2 steps, header and payload
> > >     >     >  - Add new message type VHOST_USER_ECHO, to check connection status
> > >     >     >
> > >     >     > Changes from v3:
> > >     >     >  - Convert -mem-path to QemuOpts with prealloc, share and unlink
> > >     >     >    properties
> > >     >     >  - Set 1 sec timeout when read/write to the unix domain socket
> > >     >     >  - Fix file descriptor leak
> > >     >     >
> > >     >     > Changes from v2:
> > >     >     >  - Reconnect when the backend disappears
> > >     >     >
> > >     >     > Changes from v1:
> > >     >     >  - Implementation of vhost-user netdev backend
> > >     >     >  - Code improvements
> > >     >     >
> > >     >     > Antonios Motakis (8):
> > >     >     >   Convert -mem-path to QemuOpts and add prealloc and share properties
> > >     >     >   New -mem-path option - unlink.
> > >     >     >   Decouple vhost from kernel interface
> > >     >     >   Add vhost-user skeleton
> > >     >     >   Add domain socket communication for vhost-user backend
> > >     >     >   Add vhost-user calls implementation
> > >     >     >   Add new vhost-user netdev backend
> > >     >     >   Add vhost-user reconnection
> > >     >     >
> > >     >     >  exec.c                            |  57 +++-
> > >     >     >  hmp-commands.hx                   |   4 +-
> > >     >     >  hw/net/vhost_net.c                | 144 +++++++---
> > >     >     >  hw/net/virtio-net.c               |  42 ++-
> > >     >     >  hw/scsi/vhost-scsi.c              |  13 +-
> > >     >     >  hw/virtio/Makefile.objs           |   2 +-
> > >     >     >  hw/virtio/vhost-backend.c         | 556 ++++++++++++++++++++++++++++++++++++++
> > >     >     >  hw/virtio/vhost.c                 |  46 ++--
> > >     >     >  include/exec/cpu-all.h            |   3 -
> > >     >     >  include/hw/virtio/vhost-backend.h |  40 +++
> > >     >     >  include/hw/virtio/vhost.h         |   4 +-
> > >     >     >  include/net/vhost-user.h          |  17 ++
> > >     >     >  include/net/vhost_net.h           |  15 +-
> > >     >     >  net/Makefile.objs                 |   2 +-
> > >     >     >  net/clients.h                     |   3 +
> > >     >     >  net/hub.c                         |   1 +
> > >     >     >  net/net.c                         |   2 +
> > >     >     >  net/tap.c                         |  16 +-
> > >     >     >  net/vhost-user.c                  | 177 ++++++++++++
> > >     >     >  qapi-schema.json                  |  21 +-
> > >     >     >  qemu-options.hx                   |  24 +-
> > >     >     >  vl.c                              |  41 ++-
> > >     >     >  22 files changed, 1106 insertions(+), 124 deletions(-)
> > >     >     >  create mode 100644 hw/virtio/vhost-backend.c
> > >     >     >  create mode 100644 include/hw/virtio/vhost-backend.h
> > >     >     >  create mode 100644 include/net/vhost-user.h
> > >     >     >  create mode 100644 net/vhost-user.c
> > >     >     >
> > >     >     > --
> > >     >     > 1.8.3.2
> > >     >     >
> > >     >
> > >     >
> > >
> > >

> Vhost-user Protocol
> ===================
> 
> This protocol is aiming to complement the ioctl interface used to control the
> vhost implementation in the Linux kernel. It implements the control plane
> needed to establish virtqueue sharing with a user space process on the same
> host. It uses communication over a Unix domain socket to share file descriptors
> in the ancillary data of the message.
> 
> The protocol defines 2 sides of the communication, master and slave. Master is
> the application that shares its virtqueues, in our case QEMU. Slave is the
> consumer of the virtqueues.
> 
> In the current implementation QEMU is the Master, and the Slave is intended to
> be a software ethernet switch running in user space, such as Snabbswitch.
> 
> Master and slave can be either a client (i.e. connecting) or server (listening)
> in the socket communication.
> 
> Message Specification
> ---------------------
> 
> Note that all numbers are in the machine native byte order. A vhost-user
> message consists of 3 header fields and a payload:
> 
> ------------------------------------
> | request | flags | size | payload |
> ------------------------------------
> 
>  * Request: 32-bit type of the request
>  * Flags: 32-bit bit field:
>    - Lower 2 bits are the version (currently 0x01)
>    - Bit 2 is the reply flag - needs to be sent on each reply from the slave
>  * Size - 32-bit size of the payload
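
As an illustration of the flags layout above, a minimal sketch follows. The
macro and function names are hypothetical (the draft only gives the bit
positions); only the version value 0x01, the 2-bit version field and the reply
bit at position 2 come from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; only the bit positions come from the draft. */
#define VU_FLAG_VERSION_MASK 0x3u        /* lower 2 bits: protocol version */
#define VU_FLAG_REPLY        (0x1u << 2) /* bit 2: set on replies from the slave */
#define VU_VERSION           0x1u        /* current version per the draft */

/* Build the flags word for a request (is_reply = false) or a reply. */
static inline uint32_t vu_flags_make(bool is_reply)
{
    uint32_t flags = VU_VERSION & VU_FLAG_VERSION_MASK;
    if (is_reply) {
        flags |= VU_FLAG_REPLY;
    }
    return flags;
}

/* Extract the version from a received flags word. */
static inline uint32_t vu_flags_version(uint32_t flags)
{
    return flags & VU_FLAG_VERSION_MASK;
}

/* Check whether a received message is marked as a reply. */
static inline bool vu_flags_is_reply(uint32_t flags)
{
    return (flags & VU_FLAG_REPLY) != 0;
}
```

A receiver would typically reject messages whose version does not match
before looking at the payload.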
> 
> 
> Depending on the request type, payload can be:
> 
>  * A single 64-bit integer
>    -------
>    | u64 |
>    -------
> 
>    u64: a 64-bit unsigned integer
> 
>  * A vring state description
>    ---------------
>    | index | num |
>    ---------------
> 
>    Index: a 32-bit index
>    Num: a 32-bit number
> 
>  * A vring address description
>    --------------------------------------------------------------
>    | index | flags | size | descriptor | used | available | log |
>    --------------------------------------------------------------
> 
>    Index: a 32-bit vring index
>    Flags: a 32-bit vring flags
>    Descriptor: a 64-bit user address of the vring descriptor table
>    Used: a 64-bit user address of the vring used ring
>    Available: a 64-bit user address of the vring available ring
>    Log: a 64-bit guest address for logging
> 
>  * Memory regions description
>    ---------------------------------------------------
>    | num regions | padding | region0 | ... | region7 |
>    ---------------------------------------------------
> 
>    Num regions: a 32-bit number of regions
>    Padding: 32-bit
> 
>    A region is:
>    ---------------------------------------
>    | guest address | size | user address |
>    ---------------------------------------
>  
>    Guest address: a 64-bit guest address of the region
>    Size: a 64-bit size
>    User address: a 64-bit user address
> 
> 
> In QEMU the vhost-user message is implemented with the following struct:
> 
> typedef struct VhostUserMsg {
>     VhostUserRequest request;
>     uint32_t flags;
>     uint32_t size;
>     union {
>         uint64_t u64;
>         struct vhost_vring_state state;
>         struct vhost_vring_addr addr;
>         VhostUserMemory memory;
>     };
> } QEMU_PACKED VhostUserMsg;
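
Since the header carries the payload size, a receiver can read a message in two
steps: the fixed 12-byte header first, then exactly `size` bytes of payload. A
minimal sketch of that, under assumptions: the names `vu_header`, `read_full`
and `vu_recv` are hypothetical, plain read() is used, and ancillary data (file
descriptors) is ignored here.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Header as described in the draft: three 32-bit fields, native byte order. */
struct vu_header {
    uint32_t request;
    uint32_t flags;
    uint32_t size;    /* number of payload bytes that follow */
};

/* Read exactly len bytes; returns 0 on success, -1 on error or EOF. */
static int read_full(int fd, void *buf, size_t len)
{
    uint8_t *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n == 0) {
            return -1;            /* peer closed the connection */
        }
        if (n < 0) {
            if (errno == EINTR) {
                continue;         /* interrupted, retry */
            }
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Step 1: header. Step 2: payload, after checking the announced size. */
static int vu_recv(int fd, struct vu_header *hdr, void *payload, size_t cap)
{
    if (read_full(fd, hdr, sizeof(*hdr)) < 0) {
        return -1;
    }
    if (hdr->size > cap) {
        return -1;                /* refuse payloads larger than our buffer */
    }
    if (hdr->size > 0 && read_full(fd, payload, hdr->size) < 0) {
        return -1;
    }
    return 0;
}
```

The size check before the second read is what keeps a misbehaving peer from
overrunning the receive buffer.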
> 
> Communication
> -------------
> 
> The protocol for vhost-user is based on the existing implementation of vhost
> for the Linux Kernel. Most messages that can be sent via the Unix domain socket
> implementing vhost-user have an equivalent ioctl to the kernel implementation.
> 
> The communication consists of master sending message requests and slave sending
> message replies. Most of the requests don't require replies. Here is a list of
> the ones that do:
> 
>  * VHOST_USER_ECHO
>  * VHOST_USER_GET_FEATURES
>  * VHOST_USER_GET_VRING_BASE
> 
> There are several messages that the master sends with file descriptors passed
> in the ancillary data:
> 
>  * VHOST_USER_SET_MEM_TABLE
>  * VHOST_USER_SET_LOG_FD
>  * VHOST_USER_SET_VRING_KICK
>  * VHOST_USER_SET_VRING_CALL
>  * VHOST_USER_SET_VRING_ERR
> 
> If Master is unable to send the full message or receives a wrong reply it will
> close the connection. An optional reconnection mechanism can be implemented.
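
For readers unfamiliar with passing file descriptors in ancillary data, here is
a minimal sketch using the standard sendmsg()/recvmsg() calls with SCM_RIGHTS.
It carries a single descriptor alongside an arbitrary byte buffer; the function
names are hypothetical and this is not the code from the series.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control buffer sized for exactly one file descriptor. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;   /* forces suitable alignment */
};

/* Send len bytes plus one fd as SCM_RIGHTS; returns 0 on success. */
static int send_with_fd(int sock, const void *data, size_t len, int fd)
{
    struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == (ssize_t)len ? 0 : -1;
}

/* Receive up to len bytes; *fd_out is the passed fd, or -1 if none came. */
static int recv_with_fd(int sock, void *data, size_t len, int *fd_out)
{
    struct iovec iov = { .iov_base = data, .iov_len = len };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    ssize_t n;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    *fd_out = -1;
    n = recvmsg(sock, &msg, 0);
    if (n < 0) {
        return -1;
    }
    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
            memcpy(fd_out, CMSG_DATA(cmsg), sizeof(int));
        }
    }
    return (int)n;   /* number of data bytes received */
}
```

The received descriptor is a new number in the receiving process that refers
to the same open file, which is what lets the slave use the master's eventfds
and memory-backing files.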
> 
> Message types
> -------------
> 
>  * VHOST_USER_ECHO
> 
>       Id: 1
>       Equivalent ioctl: N/A
>       Master payload: N/A
> 
>       ECHO request that is used to periodically probe the connection. When
>       received by the slave, it is expected that it will send back an ECHO
>       packet with the REPLY flag set.
> 
>  * VHOST_USER_GET_FEATURES
> 
>       Id: 2
>       Equivalent ioctl: VHOST_GET_FEATURES
>       Master payload: N/A
>       Slave payload: u64
> 
>       Get from the underlying vhost implementation the features bitmask.
> 
>  * VHOST_USER_SET_FEATURES
> 
>       Id: 3
>       Equivalent ioctl: VHOST_SET_FEATURES
>       Master payload: u64
> 
>       Enable features in the underlying vhost implementation using a bitmask.
> 
>  * VHOST_USER_SET_OWNER
> 
>       Id: 4
>       Equivalent ioctl: VHOST_SET_OWNER
>       Master payload: N/A
> 
>       Issued when a new connection is established. It sets the current Master
>       as an owner of the session. This can be used on the Slave as a 
>       "session start" flag.
> 
>  * VHOST_USER_RESET_OWNER
> 
>       Id: 5
>       Equivalent ioctl: VHOST_RESET_OWNER
>       Master payload: N/A
> 
>       Issued when a connection is about to be closed. The Master will no
>       longer own this connection (and will usually close it).
> 
>  * VHOST_USER_SET_MEM_TABLE
> 
>       Id: 6
>       Equivalent ioctl: VHOST_SET_MEM_TABLE
>       Master payload: memory regions description
> 
>       Sets the memory map regions on the slave so it can translate the vring
>       addresses. In the ancillary data there is an array of file descriptors
>       for each memory mapped region. The size and ordering of the fds matches
>       the number and ordering of memory regions.
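
To show how a slave might use the memory table, here is a minimal sketch of
address translation. It rests on assumptions that the draft does not state:
that the slave has mmap()ed each region's fd so that the mapping starts at the
beginning of the region, and that vring addresses arrive as master user
addresses. The `local_base` field and the function name are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define VU_MAX_REGIONS 8   /* the draft layout shows region0..region7 */

struct vu_region {
    uint64_t guest_addr;   /* guest address of the region */
    uint64_t size;         /* size of the region in bytes */
    uint64_t user_addr;    /* address in the master's address space */
    void *local_base;      /* hypothetical: where the slave mapped the fd */
};

/*
 * Translate a master user address into a pointer valid in the slave.
 * Returns NULL if the address falls outside every known region.
 */
static void *vu_map_user_addr(const struct vu_region *regions,
                              uint32_t num, uint64_t uaddr)
{
    for (uint32_t i = 0; i < num && i < VU_MAX_REGIONS; i++) {
        const struct vu_region *r = &regions[i];
        if (uaddr >= r->user_addr && uaddr - r->user_addr < r->size) {
            return (uint8_t *)r->local_base + (uaddr - r->user_addr);
        }
    }
    return NULL;
}
```

The same lookup keyed on `guest_addr` instead of `user_addr` would serve for
the buffer addresses found inside descriptors.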
> 
>  * VHOST_USER_SET_LOG_BASE
> 
>       Id: 7
>       Equivalent ioctl: VHOST_SET_LOG_BASE
>       Master payload: u64
> 
>       Sets the logging base address.
> 
>  * VHOST_USER_SET_LOG_FD
> 
>       Id: 8
>       Equivalent ioctl: VHOST_SET_LOG_FD
>       Master payload: N/A
> 
>       Sets the logging file descriptor, which is passed as ancillary data.
> 
>  * VHOST_USER_SET_VRING_NUM
> 
>       Id: 9
>       Equivalent ioctl: VHOST_SET_VRING_NUM
>       Master payload: vring state description
> 
>       Sets the number of descriptors in the vring.
> 
>  * VHOST_USER_SET_VRING_ADDR
> 
>       Id: 10
>       Equivalent ioctl: VHOST_SET_VRING_ADDR
>       Master payload: vring address description
>       Slave payload: N/A
> 
>       Sets the addresses of the different aspects of the vring.
> 
>  * VHOST_USER_SET_VRING_BASE
> 
>       Id: 11
>       Equivalent ioctl: VHOST_SET_VRING_BASE
>       Master payload: vring state description
> 
>       Sets the base address where the available descriptors are.
> 
>  * VHOST_USER_GET_VRING_BASE
> 
>       Id: 12
>       Equivalent ioctl: VHOST_GET_VRING_BASE
>       Master payload: vring state description
>       Slave payload: vring state description
> 
>       Get the vring base address.
> 
>  * VHOST_USER_SET_VRING_KICK
> 
>       Id: 13
>       Equivalent ioctl: VHOST_SET_VRING_KICK
>       Master payload: N/A
> 
>       Set the event file descriptor for adding buffers to the vring. It
>       is passed in the ancillary data.   
> 
>  * VHOST_USER_SET_VRING_CALL
> 
>       Id: 14
>       Equivalent ioctl: VHOST_SET_VRING_CALL
>       Master payload: N/A
> 
>       Set the event file descriptor to signal when buffers are used. It
>       is passed in the ancillary data.
> 
>  * VHOST_USER_SET_VRING_ERR
> 
>       Id: 15
>       Equivalent ioctl: VHOST_SET_VRING_ERR
>       Master payload: N/A
> 
>       Set the event file descriptor to signal when an error occurs. It
>       is passed in the ancillary data.



