qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: VFIO Migration


From: Stefan Hajnoczi
Subject: Re: VFIO Migration
Date: Tue, 3 Nov 2020 18:09:38 +0000

On Tue, Nov 03, 2020 at 10:13:05AM -0700, Alex Williamson wrote:
> On Tue, 3 Nov 2020 11:03:24 +0000
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > On Mon, Nov 02, 2020 at 12:38:23PM -0700, Alex Williamson wrote:
> > > 
> > > Cc+ Intel folks as this really bumps into the migration compatibility
> > > discussion[1][2][3]
> > > 
> > > On Mon, 2 Nov 2020 11:11:53 +0000
> > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > > > VFIO Implementation
> > > > -------------------
> > > > The following applies both to kernel VFIO/mdev drivers and vfio-user 
> > > > device
> > > > backends.
> > > > 
> > > > Devices are instantiated based on a version and/or configuration 
> > > > parameters:
> > > > * ``version=1`` - use the device configuration aliased by version 1
> > > > * ``version=2,rx-filter-size=64`` - use version 1 and override 
> > > > ``rx-filter-size``
> > > > * ``rx-filter-size=0`` - directly set configuration parameters without 
> > > > using a version
> > > > 
> > > > Device creation fails if the version and/or configuration parameters 
> > > > are not
> > > > supported.
> > > > 
> > > > There must be a mechanism to query the "latest" configuration for a 
> > > > device
> > > > model. It may simply report the ``version=5`` where 5 is the latest 
> > > > version but
> > > > it could also report all configuration parameters instead of using a 
> > > > version
> > > > alias.  
> > > 
> > > When we talk about "instantiating" a device here, are we referring to
> > > managing the device on the host or within QEMU via something like
> > > vfio_realize()?  We create an instance of an mdev on the host via an
> > > mdev type using operations on the host sysfs.  That mdev type doesn't
> > > really seem to map to your idea if a device model represented by a URI.
> > > How are supported URIs exposed and specified when the device is
> > > instantiated?
> > > 
> > > Same for device configuration, we might have per device attributes in
> > > host sysfs defining the configuration of a given mdev device, are these
> > > the device configuration values?  It seems like you're referring to
> > > something much more QEMU centric, but vfio-pci in QEMU handles all
> > > devices the same, aside from quirks.
> > > 
> > > Likewise, I don't know where versions would be exposed in the current
> > > vfio interface.  
> > 
> > "Instantiating" means writing to the mdev "create" sysfs attr. I am not
> > very familiar with mdev so this could be totally wrong, but I'll try to
> > define a mapping:
> > 
> > 1. The mdev driver sets up struct
> >    mdev_parent_opts->supported_type_groups as follows:
> > 
> >   /* Device model URI */
> >   static ssize_t model_show(struct kobject *kobj,
> >                             struct device *dev,
> >                             char *buf)
> >   {
> >       return sprintf(buf, "https://vendor-a.com/my-nic\n";);
> >   }
> >   static MDEV_TYPE_ATTR_RO(model);
> > 
> >   /* Receive Side Scaling (RSS) */
> >   static ssize_t rss_show(struct kobject *kobj,
> >                           struct dev *dev,
> >                       char *buf)
> >   {
> >       return sprintf(buf, "%d\n", ...->rss);
> >   }
> >   static ssize_t rss_store(struct kobject *kobj,
> >                            struct attribute *attr,
> >                        const char *page,
> >                        size_t count)
> >   {
> >       char *p = (char *) page;
> >       unsigned long val = simple_strtoul(p, &p, 10);
> > 
> >       ...->rss = !!val;
> >       return count;
> >   }
> >   static MDEV_TYPE_ATTR_RW(rss);
> > 
> >   /* Device version */
> >   static ssize_t version_show(struct kobject *kobj,
> >                               struct dev *dev,
> >                           char *buf)
> >   {
> >       return sprintf(buf, "%u\n", ...->version);
> >   }
> >   static ssize_t version_store(struct kobject *kobj,
> >                                struct attribute *attr,
> >                            const char *page,
> >                            size_t count)
> >   {
> >       char *p = (char *) page;
> >       unsigned long val = simple_strtoul(p, &p, 10);
> > 
> >       /* Set device configuration parameters to their defaults */
> >       switch (version) {
> >       case 1:
> >           ...->rss = false;
> >       ...->version = 1;
> >       break;
> > 
> >       case 2:
> >           ...->rss = true;
> >       ...->version = 2;
> >       break;
> > 
> >       default:
> >           return -ENOTSUPP;
> >       }
> > 
> >       return count;
> >   }
> >   static MDEV_TYPE_ATTR_RW(rss);
> > 
> >   static struct attribute *mdev_type_my_nic_attrs[] = {
> >       &mdev_type_attr_model.attr,
> >       &mdev_type_attr_rss.attr,
> >       &mdev_type_attr_version.attr,
> >       NULL,
> >   };
> > 
> >   static struct attribute_group mdev_type_group_my_nic = {
> >       .name  = "my-nic", /* shorthand name */
> >       .attrs = mdev_type_my_nic_attrs,
> >   };
> > 
> >   struct attribute_group *supported_type_groups[] = {
> >       &mdev_type_group_my_nic,
> >       NULL,
> >   };
> > 
> > 2. The userspace tooling enumerates supported device models by reading
> >    the "model" sysfs attr from each supported type attr group.
> 
> 
> So a given mdev type can only support a single model, model just gives
> us some independence from the vendor driver association of the mdev
> type?  I wonder how "model" is really different from the "name"
> attribute on an mdev type other than being more formalized.

Two reasons, neither of them critical:
1. Short names are more human-friendly than full device model URIs.
2. I was concerned about escaping characters if the type name is used as
   a sysfs directory name. "https://qemu.org/devices/e1000"; cannot be a
   path component, it needs to be escaped somehow to avoid the
   backslashes.

> > 3. Userspace picks the device model it wishes to instantiate and sets
> >    the "version" sysfs attr and other device configuration parameters as
> >    desired.
> > 
> > 4. Userspace instantiates the device by writing to the mdev "create" sysfs
> >    attr. If instantiation succeeds then migrating a device state saved
> >    by the same device model with the same configuration parameters is
> >    possible.
> 
> 
> These are not feasible semantics, multiple tasks may be instantiating
> devices simultaneously, we can't lock a sub-hierarchy of sysfs while
> one process configures features to their liking.  It seems more like
> these attributes would be read-only to advertise support, but be
> applied as part of the write to create, ie. we'd append device specific
> attributes after the uuid string, similar to how we've previously
> discussed supporting device aggregation.

Thanks for explaining and suggesting a solution.

> > 
> > Maybe a cleaner way to structure this is to include the version as part
> > of the supported type group. So "my-nic" becomes "my-nic-1", "my-nic-2",
> > etc. There would still be a "version" sysfs attr but it would be
> > read-only. Device configuration parameters would only be present if they
> > were actually available in that version. For example, "my-nic-1" would
> > not expose an "rss" sysfs attr because it was introduced in "my-nic-2".
> > I see pros and cons to both the approach I outlined above and this
> > alternative, maybe someone more familiar with mdev has a preference?
> 
> 
> How exactly is this different from an mdev type?  An mdev type is
> supposed to define a software compatible device configuration.  If
> version 2 is not compatible with version 1, we'd expect a vendor to
> define a new type.  In practice vendors are often defining new types to
> indicate the scale of a device, for example with vGPUs types may
> different in the amount of graphics memory per instance.  The topic of
> "aggregation" came about as a generic way to describe this within a
> single type, when for example we might have an untenable number of
> scaling increments to describe each as a separate type.  Unfortunately
> "aggregation" is also too generic, "aggregation of what" needs to be
> more clearly defined.

Yes, this is a 1:1 mapping of device versions to mdev types. Based on
QEMU's device models I would estimate that devices may have over 100
versions but less than 1000. Does that sound reasonable?

If having so many mdev types is problematic we can use the multiplexing
approach that we discussed above with a "version" sysfs attr.

> > > There's also a desire to support the vfio migration interface on
> > > non-mdev vfio devices.  We don't know yet if those will be separate,
> > > device specific vfio bus drivers or be integrated into existing
> > > vfio-pci, but the host device is likely instantiated by binding to a
> > > driver, so again I don't really understand where you're proposing this
> > > negotiation occurs.  Will management tools be required to create a
> > > device on-demand to fulfill a migration request or can we manipulate an
> > > existing device into that desired.  Some management layers embrace the
> > > idea of device pools rather than dynamic creation.  Thanks,  
> > 
> > The concept of device instantiation is natural for mdev and vfio-user,
> > but not essential.
> > 
> > When dealing with physical devices (even PCI SR-IOV), we don't need to
> > instantiate them explicitly. Device instances can already exist. As long
> > as we know their device model URI and configuration parameters we can
> > ensure migration compatibility.
> > 
> > For example, imagine a physical PCI NIC accompanied by a non-mdev VFIO
> > migration driver. The device model URI and configuration parameter
> > information can be distributed alongside the VFIO migration driver. It
> > could be available via modinfo(8), as a separate metadata file, via a
> > vendor-specific tool, etc.
> 
> 
> I think we want instances of objects to expose their device and
> configuration through sysfs, we don't want to require userspace to use
> different methods for different flavors of devices, nor should it be
> required for someone to remember how a device was instantiated.

Sounds good. The sysfs layout can be the same as for mdev types.

> > Management tools need to match the device model/configuration from the
> > source device against the destination device. If the destination is
> > capable of supporting the source's device model/configuration then
> > migration can proceed safely.
> > 
> > Let's look at the case where we are migration from an older version of a
> > device to a newer version. On the source we have:
> > 
> >   model = https://vendor-a.com/my-nic
> > 
> > On the destination we have:
> > 
> >   model = https://vendor-a.com/my-nic
> >   rss = on
> > 
> > The two devices are incompatible because the destination exposes the RSS
> > feature that is not present on the source. The RSS feature involves
> > guest-visible hardware interface changes and a change to the device
> > state representation. It is not safe to migrate!
> > 
> > In this case an extra configuration step is necessary so that the
> > destination device can accept the device state from the source. The
> > management tool invokes a vendor-specific tool to put the device into
> > the right configuration:
> > 
> >   # vendor-tool set-migration-config --device 0000:00:04.0 \
> >                                      --model https://vendor-a.com/my-nic
> > 
> > (This tool only succeeds when the device is bound to VFIO but not yet
> > opened.)
> > 
> > The tool invokes ioctls on the vendor-specific VFIO driver that does two
> > things:
> > 1. Tells the device to present the old hardware interface without RSS
> > 2. Uses the old device state representation without RSS support
> > 
> > Does this approach fit?
> 
> 
> Should we not require that any sort of configuration like this occurs
> through sysfs?  We must be able to create an instance with a specific
> configuration without using vendor specific tools, therefore in the
> worse case we should be able to remove and recreate an instance as we
> desire without invoking vendor specific tools.

Yes, sysfs sounds good.

One thing that is a little ugly is that the version=1 device above does
not have the "rss" configuration parameter and we have no value to set
on the destination device, which is newer and supports the "rss"
configuration parameter. It may be necessary to implement a special case
for disabling configuration parameters like writing an empty string or
magic string to the "rss" sysfs attr.

Or maybe there would be a standard sysfs attr called "config_params"
where userspace writes a space-separated list of configuration parameter
names that are active. Then userspace could omit "rss" from the
"config_params" sysfs attr to disable it.

Stefan

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]