qemu-devel
Re: VFIO Migration


From: Stefan Hajnoczi
Subject: Re: VFIO Migration
Date: Wed, 4 Nov 2020 07:16:13 +0000

On Wed, Nov 04, 2020 at 11:32:34AM +0800, Jason Wang wrote:
> 
> On 2020/11/3 8:15 PM, Stefan Hajnoczi wrote:
> > On Tue, Nov 03, 2020 at 04:46:53PM +0800, Jason Wang wrote:
> > > On 2020/11/2 7:11 PM, Stefan Hajnoczi wrote:
> > > > There is discussion about VFIO migration in the "Re: Out-of-Process
> > > > Device Emulation session at KVM Forum 2020" thread. The current status
> > > > is that Kirti proposed a VFIO device region type for saving and loading
> > > > device state. There is currently no guidance on migrating between
> > > > different device versions or device implementations from different
> > > > vendors. This is known to be non-trivial and raised discussion about
> > > > whether it should really be handled by VFIO or centralized in QEMU.
> > > > 
> > > > Below is a document that describes how to ensure migration compatibility
> > > > in VFIO. It does not require changes to the VFIO migration interface. It
> > > > can be used for both VFIO/mdev kernel devices and vfio-user devices.
> > > > 
> > > > The idea is that the device state blob is opaque to the VMM but the same
> > > > level of migration compatibility that exists today is still available.
> > > 
> > > So if we can't mandate this, or there's no way to validate it, vendors are
> > > still free to implement their own protocols, which could lead to a heavy
> > > maintenance burden.
> > Yes, the device state representation is their responsibility. We can't
> > do that for them since they define the hardware interface and internal
> > state.
> > 
> > As Michael and Paolo have mentioned in the other thread, we can provide
> > guidelines and standardize common aspects.
> > 
> > > > Migration can fail if loading the device state is not possible. It
> > > > should fail early with a clear error message. It must not appear to
> > > > complete but leave the device inoperable due to a migration problem.
> > > 
> > > For VFIO-user, how does management know that a VM can be migrated from src
> > > to dst? For the kernel, we have sysfs.
> > vfio-user devices will normally be instantiated in one of two ways:
> > 
> > 1. Launching a device backend and passing command-line parameters:
> > 
> >       $ my-nic --socket-path /tmp/my-nic-vfio-user.sock \
> >                --model https://vendor-a.com/my-nic \
> >                --rss on
> > 
> >     Here "model" is the device model URL. The program could support
> >     multiple device models.
> > 
> >     The "rss" device configuration parameter enables Receive Side Scaling
> >     (RSS) as an example of a configuration parameter.
> > 
> > 2. Creating a device using an RPC interface:
> > 
> >       (qemu) device-add my-nic,rss=on
> > 
> > If the device instantiation succeeds then it is safe to live migrate.
> > The device is exposing the desired hardware interface and expecting the
> > right device state representation.
> 
> 
> Does this mean there will still be a "my-nic" stub in qemu? (I thought it
> should be a generic one like device-add "vfio-user-pci")

No, sorry for the confusing example. I was thinking of
qemu-storage-daemon or multi-process QEMU where devices could be
configured over a QMP/HMP monitor. The device happens to be implemented
in the QEMU codebase but the VMM doesn't need a stub device.

A D-Bus or gRPC example would have been clearer because it's not
associated with a VMM.

> > 
> > > > The rest of this document describes how these requirements can be met.
> > > > 
> > > > Device Models
> > > > -------------
> > > > Devices have a *hardware interface* consisting of hardware registers,
> > > > interrupts, and so on.
> > > > 
> > > > The hardware interface together with the device state representation
> > > > is called a *device model*. Device models can be assigned URIs such as
> > > > https://qemu.org/devices/e1000e to uniquely identify them.
> > > 
> > > It looks worse than "pci://vendor_id.device_id.subvendor_id.subdevice_id".
> > > "e1000e" covers a lot of different 8257x implementations that have subtle
> > > but easily overlooked differences.
> > If you wish to reflect those differences in the device model URI then
> > you can use:
> > 
> >    https://qemu.org/devices/pci/<vendor-id>/<device-id>/<subvendor-id>/<subdevice-id>
> > 
> > Another option is to use device configuration parameters to express
> > differences.
> > 
> > The important thing is that this device model URI has one owner. No one
> > else will use qemu.org. There can be many different e1000e device model
> > URIs, if necessary (with slightly different hardware interfaces and/or
> > device state representations). This avoids collisions.
> > 
> > > And is it possible to have a list of URIs here?
> > A device implementation (mdev driver, vfio-user device backend, etc) may
> > support multiple device model URIs.
> > 
> > A device instance has an immutable device model URI and list of
> > configuration parameters. In other words, once the device is created its
> > ABI is fixed for the lifetime of the device. A new device instance can
> > be configured by powering off the machine, hotplug, etc.
> > 
> > > > Multiple implementations of a device model may exist. They are
> > > > interchangeable if they follow the same hardware interface and device
> > > > state representation.
> > > > 
> > > > Multiple implementations of the same hardware interface may exist with
> > > > different device state representations, in which case the device models
> > > > are not interchangeable and must be assigned different URIs.
> > > > 
> > > > Migration is only possible when the same device model is supported by
> > > > the *source* and the *destination* devices.
> > > > 
> > > > Device Configuration
> > > > --------------------
> > > > Device models may have parameters that affect the hardware interface or
> > > > device state representation. For example, a network card may have a
> > > > configurable address filtering table size parameter called
> > > > ``rx-filter-size``. A device state saved with ``rx-filter-size=32``
> > > > cannot be safely loaded into a device with ``rx-filter-size=0``, because
> > > > changing the size from 32 to 0 may disrupt device operation.
> > > 
> > > Do we allow migration from "rx-filter-size=16" to "rx-filter-size=32"
> > > (I guess not?) And should we extend the concept to "device capability"
> > > instead of just state representation? E.g. src has CAP_X=on,CAP_Y=off but
> > > dst has CAP_X=on,CAP_Y=on, so we disallow the migration from src to dst.
> > A device instance's configuration parameters are immutable.
> > rx-filter-size=16 cannot be migrated to rx-filter-size=32.
> 
> 
> But then it looks to me like we can't migrate back. Or do you mean it is
> required to have the ability to change the max rx-filter-size?

We can migrate a device with rx-filter-size=16 from old -> new if the
new device implementation supports rx-filter-size=16. We can migrate
back to the old device implementation because it supports
rx-filter-size=16.

If you want to change the configuration parameters then a new device must
be instantiated during poweroff or hotplug. This is how
rx-filter-size=16 can be changed to rx-filter-size=32, but it must be
done explicitly (configuration parameters don't change across
migration).
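
As a rough illustration of the rule above (not part of the VFIO migration
interface or of any existing tool), the check a management layer would make
can be sketched in Python. The names DeviceImplementation, DeviceInstance, and
can_migrate are hypothetical, and the "supported" table is an assumed way of
describing which parameter values an implementation accepts:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DeviceInstance:
    """A created device: its model URI and configuration never change."""
    model: str
    config: tuple  # sorted (name, value) pairs, immutable by construction


@dataclass
class DeviceImplementation:
    """Hypothetical description of what one device implementation supports."""
    # model URI -> {parameter name -> set of supported values}
    supported: dict = field(default_factory=dict)

    def can_instantiate(self, model, config):
        params = self.supported.get(model)
        if params is None:
            return False  # unknown device model URI
        # Every requested parameter value must be supported exactly.
        return all(value in params.get(name, set())
                   for name, value in config.items())

    def instantiate(self, model, config):
        if not self.can_instantiate(model, config):
            raise ValueError(f"unsupported model or configuration: {model} {config}")
        return DeviceInstance(model, tuple(sorted(config.items())))


def can_migrate(instance, destination):
    """The destination must accept the exact same model and configuration."""
    return destination.can_instantiate(instance.model, dict(instance.config))


MODEL = "https://vendor-a.com/my-nic"  # example URI, as used earlier in the thread
old = DeviceImplementation({MODEL: {"rx-filter-size": {16}}})
new = DeviceImplementation({MODEL: {"rx-filter-size": {16, 32}}})

dev16 = old.instantiate(MODEL, {"rx-filter-size": 16})
dev32 = new.instantiate(MODEL, {"rx-filter-size": 32})
```

Here can_migrate(dev16, new) and can_migrate(dev16, old) are both true, so
old -> new -> old works for rx-filter-size=16. can_migrate(dev32, old) is
false: a device created with rx-filter-size=32 cannot move to the old
implementation, and getting from 16 to 32 requires creating a new instance.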

> > Yes, configuration parameters can describe capabilities. I think of
> > capabilities as something that affects the guest-visible hardware
> > interface (e.g. the RSS feature bit is enabled) that is mentioned in the
> > text, but it would be clearer to mention them explicitly.
> > 
> > > > A list of configuration parameters is called the *device configuration*.
> > > > Migration is expected to succeed when the same device model and
> > > > configuration that was used for saving the device state is used again to
> > > > load it.
> > > > 
> > > > Note that not all parameters used to instantiate a device need to be
> > > > part of the device configuration. For example, assigning a network card
> > > > to a specific physical port is not part of the device configuration
> > > > since it is not part of the device's hardware interface or the device
> > > > state representation.
> > > 
> > > Yes, but the task needs to be done by management somehow. So do you
> > > expect a vendor-specific provisioning API here?
> > There seems to be no consensus on this yet. It's the question of how to
> > manage the lifecycle of VFIO, mdev, vhost-user, and vfio-user devices.
> > There are attempts to standardize in some of these areas.
> > 
> > For mdev drivers we can standardize the sysfs interface so management
> > tools can query source devices and instantiate destination devices
> > without device-specific code.
> 
> 
> Even for mdev, there should be some class defined for sysfs which could be
> a standard way to configure an NVMe or virtio device.

Discussion on the mdev sysfs interface has started in the sub-thread
with Alex Williamson.

> > The problem with subsection semantics is that they break rollback. Once
> > the old device state has been loaded by the new device implementation,
> > saving the device state produces the new device state representation.
> > The old device implementation can no longer load it :(.
> 
> 
> Only when a subsection is needed.

Good point. Most rollback migrations still work; only the ones that
introduce new subsections fail.
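
A minimal sketch of why that is, assuming a dictionary-based device state
format invented for this example (it is not QEMU's actual vmstate encoding,
and the function and field names are hypothetical):

```python
def save_state(regs, new_feature_in_use):
    """New implementation: emit the optional subsection only when needed."""
    state = {"regs": list(regs)}
    if new_feature_in_use:
        state["subsections"] = {"new-feature": {"enabled": True}}
    return state


def old_load_state(state):
    """Old implementation: knows no subsections, so it rejects any it sees."""
    unknown = sorted(state.get("subsections", {}))
    if unknown:
        # Fail early and clearly instead of silently dropping state.
        raise ValueError(f"unknown subsections: {unknown}")
    return state["regs"]
```

Rolling back with save_state(regs, False) loads fine in old_load_state().
Once the guest has started using the new feature, save_state(regs, True)
includes the subsection and old_load_state() raises an error.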

> > Manual intervention is necessary to tell the new device implementation to
> > save in the old representation.
> 
> 
> If we don't support subsections, could we end up with a deadlock? For example,
> we do a migration because we want to upgrade the kernel, but if we don't
> upgrade the kernel, we can't do live migration.

Can you explain in more detail?

I think the approach described in this document works, except it
requires manual intervention to change device configuration parameters
whereas subsections are automatically applied by the new QEMU.
