Re: [RFC v4 PATCH 00/49] Initial support of multi-process qemu

qemu-devel


From:	Jag Raman
Subject:	Re: [RFC v4 PATCH 00/49] Initial support of multi-process qemu - status update
Date:	Thu, 19 Dec 2019 11:40:05 -0500
User-agent:	Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101 Thunderbird/60.7.1



On 12/19/2019 7:33 AM, Felipe Franciosi wrote:

Hello,

(I've added Jim and Ben from the SPDK team to the thread.)

On Dec 19, 2019, at 11:55 AM, Stefan Hajnoczi <address@hidden> wrote:

On Tue, Dec 17, 2019 at 10:57:17PM +0000, Felipe Franciosi wrote:

On Dec 17, 2019, at 5:33 PM, Stefan Hajnoczi <address@hidden> wrote:
On Mon, Dec 16, 2019 at 07:57:32PM +0000, Felipe Franciosi wrote:

On 16 Dec 2019, at 20:47, Elena Ufimtseva <address@hidden> wrote:
On Fri, Dec 13, 2019 at 10:41:16AM +0000, Stefan Hajnoczi wrote:

Questions I've seen when discussing muser with people have been:

1. Can unprivileged containers create muser devices?  If not, this is a
  blocker for use cases that want to avoid root privileges entirely.


Yes you can. Muser device creation follows the same process as general
mdev device creation (ie. you write to a sysfs path). That creates an
entry in /dev/vfio and the control plane can further drop privileges
there (set selinux contexts, &c.)


In this case there is still a privileged step during setup.  What about
completely unprivileged scenarios like a regular user without root or a
rootless container?


Oh, I see what you are saying. I suppose we need to investigate
adjusting the privileges of the sysfs path correctly beforehand to
allow devices to be created by non-root users. The credentials used on
creation should be reflected on the vfio endpoint (ie. /dev/vfio/<group>).

I need to look into that and get back to you.


As a prerequisite to using the "vfio-pci" device in QEMU, the user
assigns the PCI device on the host bus to the VFIO kernel driver by
writing to "/sys/bus/pci/drivers/vfio-pci/new_id" and
"/sys/bus/pci/drivers/vfio-pci/bind"

I believe a privileged control plane is required to perform these
prerequisite steps. Therefore, I wonder how rootless containers or
unprivileged users currently go about using a VFIO device with QEMU/KVM.
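
For illustration, the two sysfs writes mentioned above can be sketched as follows. This is a minimal sketch, not a reference procedure: the function names, the example PCI address and the vendor/device IDs are hypothetical, and unbinding the device from whatever driver currently owns it is left out. By default it only prints what it would write, since the real writes normally require root, which is the point being raised here.

```python
# Sketch of the privileged setup step for vfio-pci (assumed helper names).
VFIO_PCI_DRIVER = "/sys/bus/pci/drivers/vfio-pci"


def vfio_bind_plan(bdf, vendor_id, device_id):
    """Return the ordered (path, value) sysfs writes without performing them."""
    return [
        # new_id takes "vendor device" as hex, telling vfio-pci to claim this ID.
        (VFIO_PCI_DRIVER + "/new_id", f"{vendor_id:04x} {device_id:04x}"),
        # bind takes the PCI address (domain:bus:device.function) of the device.
        (VFIO_PCI_DRIVER + "/bind", bdf),
    ]


def apply_plan(plan, dry_run=True):
    """Print the writes (dry run) or perform them; performing needs privileges."""
    lines = []
    for path, value in plan:
        if dry_run:
            lines.append(f"echo '{value}' > {path}")
        else:
            with open(path, "w") as f:
                f.write(value)
    return lines


if __name__ == "__main__":
    # Hypothetical device: address and IDs are placeholders, not from the thread.
    for line in apply_plan(vfio_bind_plan("0000:01:00.0", 0x8086, 0x10FB)):
        print(line)
```

Both target files are normally writable by root only, so some privileged component has to run the non-dry-run path (or adjust permissions) before an unprivileged QEMU can open the resulting group.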

Thanks!
--
Jag

2. Does muser need to be in the kernel (e.g. slower to develop/ship,
  security reasons)?  A similar library could be implemented in
  userspace along the lines of the vhost-user protocol.  Although VMMs
  would then need to use a new libmuser-client library instead of
  reusing their VFIO code to access the device.


Doing it in userspace was the flow we proposed at last year's KVM
Forum (Edinburgh), but it got turned down. That's why we pursued the
kernel approach, which turned out to have some advantages:
- No changes needed to Qemu
- No Qemu needed at all for userspace drivers
- Device emulation process restart is trivial
  (it therefore makes device code upgrades much easier)

Having said that, nothing stops us from enhancing libmuser to talk
directly to Qemu (for the Qemu case). I envision at least two ways of
doing that:
- Hooking up libmuser with Qemu directly (eg. over a unix socket)
- Hooking Qemu with CUSE and implementing the muser.ko interface

For the latter, libmuser would talk to a character device just like it
talks to the vfio character device. We "just" need to implement that
backend in Qemu. :)


What about:
* libmuser's API stays mostly unchanged but the library speaks a
   VFIO-over-UNIX domain sockets protocol instead of talking to
   mdev/vfio in the host kernel.


As I said above, there are advantages to the kernel model. The key one
is transparent device emulation restarts. Today, muser.ko keeps the
"device memory" internally in a prefix tree. Upon restart, a new
device emulator can recover state (eg. from a state file in /dev/shm
or similar) and remap the same memory that is already configured to
the guest via Qemu. We have a pending work item for muser.ko to also
keep the eventfds so we can recover those, too. Another advantage is
working with any userspace driver and not requiring a VMM at all.

If done entirely in userspace, the device emulator needs to allocate
the device memory somewhere that remains accessible (eg. tmpfs), with
the difference that now we may be talking about non-trivial amounts of
memory. Also, that may not be the kind of content you want lingering
around the filesystem (for the same reasons Qemu unlinks memory files
from /dev/hugepages after mmap'ing it).

That's why I'd prefer to rephrase what you said to "in addition"
instead of "instead".

* VMMs can implement this protocol directly for POSIX-portable and
   unprivileged operation.
* A CUSE VFIO adapter simulates /dev/vfio so that VFIO-only VMMs can
   still take advantage of libmuser devices.


I'm happy with that.
We need to think the credential aspect through to ensure nodes can
be created in the right places with the right privileges.


Assuming this is feasible, would you lose any important
features/advantages of the muser.ko approach?  I don't know enough about
VFIO to identify any blocker or obvious performance problems.


That's what I elaborated above: muser.ko can keep
various metadata (and other resources) about the device in the kernel
and grant it back to userspace as needed. There are ways around it,
but it requires some orchestration with tmpfs and the VMM (only so
much can be kept in tmpfs; the eventfds need to be retransmitted from
the machine emulator on request).

Restarting is a critical aspect of this. One key use case for the
project is to be able to emulate various devices from one process (for
polling). That must be able to restart for upgrades or recovery.


Regarding recovery, it seems straightforward to keep state in a tmpfs
file that can be reopened when the device is restarted.  I don't think
kernel code is necessary?
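
A rough sketch of that userspace recovery idea, assuming a small fixed-size state file: the emulator maps the file shared, and a restarted instance reopens the same path and picks up what the previous one left. The layout (a magic tag plus a generation counter), the sizes and the function names are all made up for illustration; a real emulator would keep device registers and queue state here, and the path would live on tmpfs (eg. under /dev/shm).

```python
import mmap
import os
import struct

STATE_SIZE = 4096                 # assumed size of the state region
HEADER = struct.Struct("<4sI")    # hypothetical layout: magic tag, generation
MAGIC = b"DEVS"


def open_state(path):
    """Open (or create) the state file and map it shared and writable."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, STATE_SIZE)      # new files read back as zeroes
        return mmap.mmap(fd, STATE_SIZE)  # the mapping keeps its own fd copy
    finally:
        os.close(fd)


def start_device(path):
    """Simulate an emulator start: recover the previous state or initialise it."""
    state = open_state(path)
    magic, generation = HEADER.unpack_from(state, 0)
    if magic != MAGIC:
        generation = 0                    # fresh file, no previous instance
    generation += 1
    HEADER.pack_into(state, 0, MAGIC, generation)
    state.flush()
    return state, generation
```

Calling start_device() twice on the same path yields generation 1 and then 2, which stands in for a restarted emulator finding its predecessor's state. This covers only the state file; eventfds and the guest memory mappings are not recoverable this way.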


It adds a dependency, but isn't a show stopper. If we can work through
permission issues, making sure the VMM can reconnect and retransmit
eventfds and other state, then it should be ok.
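
The "reconnect and retransmit" step could look roughly like the sketch below, which passes a file descriptor over a UNIX domain socket with SCM_RIGHTS. It is only an illustration under stated assumptions: a pipe stands in for the eventfd so the example runs on older Python versions, the "irq0" label and the helper names are hypothetical, and no actual wire protocol is implied.

```python
import array
import os
import socket


def send_fds_msg(sock, msg, fds):
    """Send msg plus file descriptors as SCM_RIGHTS ancillary data."""
    anc = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", fds))]
    return sock.sendmsg([msg], anc)


def recv_fds_msg(sock, msglen, maxfds):
    """Receive a message and any file descriptors that came with it."""
    fds = array.array("i")
    msg, ancdata, _flags, _addr = sock.recvmsg(
        msglen, socket.CMSG_LEN(maxfds * fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore a truncated trailing descriptor, if any.
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    return msg, list(fds)


def demo():
    """VMM side retransmits a notification fd to a restarted emulator."""
    vmm, emulator = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()       # stand-in for an eventfd
    try:
        send_fds_msg(vmm, b"irq0", [write_end])
        label, received = recv_fds_msg(emulator, 16, 1)
        os.write(received[0], b"\x01")    # emulator signals through its copy
        signalled = os.read(read_end, 1)  # VMM side observes the signal
        for fd in received:
            os.close(fd)
        return label, signalled
    finally:
        os.close(read_end)
        os.close(write_end)
        vmm.close()
        emulator.close()
```

The receiver gets a new descriptor number referring to the same underlying object, so the restarted process can signal the VMM without either side recreating the notifier.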

To be clear: I'm very happy to have a userspace-only option for this,
I just don't want to ditch the kernel module (yet, anyway). :)

F.


Stefan
