Re: VFIO Migration

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: VFIO Migration

From:	Alex Williamson
Subject:	Re: VFIO Migration
Date:	Tue, 3 Nov 2020 10:13:05 -0700

On Tue, 3 Nov 2020 11:03:24 +0000
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Mon, Nov 02, 2020 at 12:38:23PM -0700, Alex Williamson wrote:
> > 
> > Cc+ Intel folks as this really bumps into the migration compatibility
> > discussion[1][2][3]
> > 
> > On Mon, 2 Nov 2020 11:11:53 +0000
> > Stefan Hajnoczi <stefanha@redhat.com> wrote:

> > > VFIO Implementation
> > > -------------------
> > > The following applies both to kernel VFIO/mdev drivers and vfio-user 
> > > device
> > > backends.
> > > 
> > > Devices are instantiated based on a version and/or configuration 
> > > parameters:
> > > * ``version=1`` - use the device configuration aliased by version 1
> > > * ``version=2,rx-filter-size=64`` - use version 1 and override 
> > > ``rx-filter-size``
> > > * ``rx-filter-size=0`` - directly set configuration parameters without 
> > > using a version
> > > 
> > > Device creation fails if the version and/or configuration parameters are 
> > > not
> > > supported.
> > > 
> > > There must be a mechanism to query the "latest" configuration for a device
> > > model. It may simply report the ``version=5`` where 5 is the latest 
> > > version but
> > > it could also report all configuration parameters instead of using a 
> > > version
> > > alias.  
> > 
> > When we talk about "instantiating" a device here, are we referring to
> > managing the device on the host or within QEMU via something like
> > vfio_realize()?  We create an instance of an mdev on the host via an
> > mdev type using operations on the host sysfs.  That mdev type doesn't
> > really seem to map to your idea if a device model represented by a URI.
> > How are supported URIs exposed and specified when the device is
> > instantiated?
> > 
> > Same for device configuration, we might have per device attributes in
> > host sysfs defining the configuration of a given mdev device, are these
> > the device configuration values?  It seems like you're referring to
> > something much more QEMU centric, but vfio-pci in QEMU handles all
> > devices the same, aside from quirks.
> > 
> > Likewise, I don't know where versions would be exposed in the current
> > vfio interface.  
> 
> "Instantiating" means writing to the mdev "create" sysfs attr. I am not
> very familiar with mdev so this could be totally wrong, but I'll try to
> define a mapping:
> 
> 1. The mdev driver sets up struct
>    mdev_parent_opts->supported_type_groups as follows:
> 
>   /* Device model URI */
>   static ssize_t model_show(struct kobject *kobj,
>                             struct device *dev,
>                             char *buf)
>   {
>       return sprintf(buf, "https://vendor-a.com/my-nic\n";);
>   }
>   static MDEV_TYPE_ATTR_RO(model);
> 
>   /* Receive Side Scaling (RSS) */
>   static ssize_t rss_show(struct kobject *kobj,
>                           struct dev *dev,
>                         char *buf)
>   {
>       return sprintf(buf, "%d\n", ...->rss);
>   }
>   static ssize_t rss_store(struct kobject *kobj,
>                            struct attribute *attr,
>                          const char *page,
>                          size_t count)
>   {
>       char *p = (char *) page;
>       unsigned long val = simple_strtoul(p, &p, 10);
> 
>       ...->rss = !!val;
>       return count;
>   }
>   static MDEV_TYPE_ATTR_RW(rss);
> 
>   /* Device version */
>   static ssize_t version_show(struct kobject *kobj,
>                               struct dev *dev,
>                             char *buf)
>   {
>       return sprintf(buf, "%u\n", ...->version);
>   }
>   static ssize_t version_store(struct kobject *kobj,
>                                struct attribute *attr,
>                              const char *page,
>                              size_t count)
>   {
>       char *p = (char *) page;
>       unsigned long val = simple_strtoul(p, &p, 10);
> 
>       /* Set device configuration parameters to their defaults */
>       switch (version) {
>       case 1:
>           ...->rss = false;
>         ...->version = 1;
>         break;
> 
>       case 2:
>           ...->rss = true;
>         ...->version = 2;
>         break;
> 
>       default:
>           return -ENOTSUPP;
>       }
> 
>       return count;
>   }
>   static MDEV_TYPE_ATTR_RW(rss);
> 
>   static struct attribute *mdev_type_my_nic_attrs[] = {
>       &mdev_type_attr_model.attr,
>       &mdev_type_attr_rss.attr,
>       &mdev_type_attr_version.attr,
>       NULL,
>   };
> 
>   static struct attribute_group mdev_type_group_my_nic = {
>       .name  = "my-nic", /* shorthand name */
>       .attrs = mdev_type_my_nic_attrs,
>   };
> 
>   struct attribute_group *supported_type_groups[] = {
>       &mdev_type_group_my_nic,
>       NULL,
>   };
> 
> 2. The userspace tooling enumerates supported device models by reading
>    the "model" sysfs attr from each supported type attr group.


So a given mdev type can only support a single model, model just gives
us some independence from the vendor driver association of the mdev
type?  I wonder how "model" is really different from the "name"
attribute on an mdev type other than being more formalized.

 
> 3. Userspace picks the device model it wishes to instantiate and sets
>    the "version" sysfs attr and other device configuration parameters as
>    desired.
> 
> 4. Userspace instantiates the device by writing to the mdev "create" sysfs
>    attr. If instantiation succeeds then migrating a device state saved
>    by the same device model with the same configuration parameters is
>    possible.


These are not feasible semantics, multiple tasks may be instantiating
devices simultaneously, we can't lock a sub-hierarchy of sysfs while
one process configures features to their liking.  It seems more like
these attributes would be read-only to advertise support, but be
applied as part of the write to create, ie. we'd append device specific
attributes after the uuid string, similar to how we've previously
discussed supporting device aggregation.

> 
> Maybe a cleaner way to structure this is to include the version as part
> of the supported type group. So "my-nic" becomes "my-nic-1", "my-nic-2",
> etc. There would still be a "version" sysfs attr but it would be
> read-only. Device configuration parameters would only be present if they
> were actually available in that version. For example, "my-nic-1" would
> not expose an "rss" sysfs attr because it was introduced in "my-nic-2".
> I see pros and cons to both the approach I outlined above and this
> alternative, maybe someone more familiar with mdev has a preference?


How exactly is this different from an mdev type?  An mdev type is
supposed to define a software compatible device configuration.  If
version 2 is not compatible with version 1, we'd expect a vendor to
define a new type.  In practice vendors are often defining new types to
indicate the scale of a device, for example with vGPUs types may
different in the amount of graphics memory per instance.  The topic of
"aggregation" came about as a generic way to describe this within a
single type, when for example we might have an untenable number of
scaling increments to describe each as a separate type.  Unfortunately
"aggregation" is also too generic, "aggregation of what" needs to be
more clearly defined.

 
> > There's also a desire to support the vfio migration interface on
> > non-mdev vfio devices.  We don't know yet if those will be separate,
> > device specific vfio bus drivers or be integrated into existing
> > vfio-pci, but the host device is likely instantiated by binding to a
> > driver, so again I don't really understand where you're proposing this
> > negotiation occurs.  Will management tools be required to create a
> > device on-demand to fulfill a migration request or can we manipulate an
> > existing device into that desired.  Some management layers embrace the
> > idea of device pools rather than dynamic creation.  Thanks,  
> 
> The concept of device instantiation is natural for mdev and vfio-user,
> but not essential.
> 
> When dealing with physical devices (even PCI SR-IOV), we don't need to
> instantiate them explicitly. Device instances can already exist. As long
> as we know their device model URI and configuration parameters we can
> ensure migration compatibility.
> 
> For example, imagine a physical PCI NIC accompanied by a non-mdev VFIO
> migration driver. The device model URI and configuration parameter
> information can be distributed alongside the VFIO migration driver. It
> could be available via modinfo(8), as a separate metadata file, via a
> vendor-specific tool, etc.


I think we want instances of objects to expose their device and
configuration through sysfs, we don't want to require userspace to use
different methods for different flavors of devices, nor should it be
required for someone to remember how a device was instantiated.

 
> Management tools need to match the device model/configuration from the
> source device against the destination device. If the destination is
> capable of supporting the source's device model/configuration then
> migration can proceed safely.
> 
> Let's look at the case where we are migration from an older version of a
> device to a newer version. On the source we have:
> 
>   model = https://vendor-a.com/my-nic
> 
> On the destination we have:
> 
>   model = https://vendor-a.com/my-nic
>   rss = on
> 
> The two devices are incompatible because the destination exposes the RSS
> feature that is not present on the source. The RSS feature involves
> guest-visible hardware interface changes and a change to the device
> state representation. It is not safe to migrate!
> 
> In this case an extra configuration step is necessary so that the
> destination device can accept the device state from the source. The
> management tool invokes a vendor-specific tool to put the device into
> the right configuration:
> 
>   # vendor-tool set-migration-config --device 0000:00:04.0 \
>                                      --model https://vendor-a.com/my-nic
> 
> (This tool only succeeds when the device is bound to VFIO but not yet
> opened.)
> 
> The tool invokes ioctls on the vendor-specific VFIO driver that does two
> things:
> 1. Tells the device to present the old hardware interface without RSS
> 2. Uses the old device state representation without RSS support
> 
> Does this approach fit?


Should we not require that any sort of configuration like this occurs
through sysfs?  We must be able to create an instance with a specific
configuration without using vendor specific tools, therefore in the
worse case we should be able to remove and recreate an instance as we
desire without invoking vendor specific tools.  Thanks,

Alex

[Prev in Thread]

Current Thread

[Next in Thread]

VFIO Migration, Stefan Hajnoczi, 2020/11/02
- Re: VFIO Migration, Cornelia Huck, 2020/11/02
  - Re: VFIO Migration, Stefan Hajnoczi, 2020/11/02
    - Re: VFIO Migration, Gerd Hoffmann, 2020/11/04
    - Re: VFIO Migration, Stefan Hajnoczi, 2020/11/04
    - Re: VFIO Migration, Gerd Hoffmann, 2020/11/05
    - Re: VFIO Migration, Stefan Hajnoczi, 2020/11/05
- Re: VFIO Migration, Alex Williamson, 2020/11/02
  - Re: VFIO Migration, Stefan Hajnoczi, 2020/11/03
    - Re: VFIO Migration, Alex Williamson <=
    - Re: VFIO Migration, Stefan Hajnoczi, 2020/11/03
    - Re: VFIO Migration, Yan Zhao, 2020/11/05
- Re: VFIO Migration, Jason Wang, 2020/11/03
  - Re: VFIO Migration, Stefan Hajnoczi, 2020/11/03
    - Re: VFIO Migration, Jason Wang, 2020/11/03
    - Re: VFIO Migration, Stefan Hajnoczi, 2020/11/04
- Re: VFIO Migration, Daniel P . Berrangé, 2020/11/03
  - Re: VFIO Migration, Stefan Hajnoczi, 2020/11/03
    - Re: VFIO Migration, Daniel P . Berrangé, 2020/11/03
    - Re: VFIO Migration, Stefan Hajnoczi, 2020/11/03

Prev by Date: Re: [PATCH 07/14] target/arm: return the stage 2 index for stage 1
Next by Date: Re: [PATCH-for-5.2 v2 00/25] block/nvme: Fix Aarch64 or big-endian hosts
Previous by thread: Re: VFIO Migration
Next by thread: Re: VFIO Migration
Index(es):
- Date
- Thread