qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration


From: Zhao Yan
Subject: Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration
Date: Fri, 29 Mar 2019 19:10:50 -0400
User-agent: Mutt/1.9.4 (2018-02-28)

On Fri, Mar 29, 2019 at 10:26:39PM +0800, Alex Williamson wrote:
> On Thu, 28 Mar 2019 22:47:04 -0400
> Zhao Yan <address@hidden> wrote:
> 
> > On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:
> > > On Thu, 28 Mar 2019 10:21:38 +0100
> > > Erik Skultety <address@hidden> wrote:
> > >   
> > > > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:  
> > > > > hi Alex and Dave,
> > > > > Thanks for your replies.
> > > > > Please see my comments inline.
> > > > >
> > > > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:    
> > > > > > On Wed, 27 Mar 2019 20:18:54 +0000
> > > > > > "Dr. David Alan Gilbert" <address@hidden> wrote:
> > > > > >    
> > > > > > > * Zhao Yan (address@hidden) wrote:    
> > > > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > > >   
> > > > > > > > > > > > >   b) How do we detect if we're migrating from/to the 
> > > > > > > > > > > > > wrong device or
> > > > > > > > > > > > > version of device?  Or say to a device with older 
> > > > > > > > > > > > > firmware or perhaps
> > > > > > > > > > > > > a device that has less device memory ?    
> > > > > > > > > > > > Actually it's still an open for VFIO migration. Need to 
> > > > > > > > > > > > think about
> > > > > > > > > > > > whether it's better to check that in libvirt or qemu 
> > > > > > > > > > > > (like a device magic
> > > > > > > > > > > > along with verion ?).    
> > > > > > > > > >
> > > > > > > > > > We must keep the hardware generation is the same with one 
> > > > > > > > > > POD of public cloud
> > > > > > > > > > providers. But we still think about the live migration 
> > > > > > > > > > between from the the lower
> > > > > > > > > > generation of hardware migrated to the higher generation.   
> > > > > > > > > >  
> > > > > > > > >
> > > > > > > > > Agreed, lower->higher is the one direction that might make 
> > > > > > > > > sense to
> > > > > > > > > support.
> > > > > > > > >
> > > > > > > > > But regardless of that, I think we need to make sure that 
> > > > > > > > > incompatible
> > > > > > > > > devices/versions fail directly instead of failing in a 
> > > > > > > > > subtle, hard to
> > > > > > > > > debug way. Might be useful to do some initial sanity checks 
> > > > > > > > > in libvirt
> > > > > > > > > as well.
> > > > > > > > >
> > > > > > > > > How easy is it to obtain that information in a form that can 
> > > > > > > > > be
> > > > > > > > > consumed by higher layers? Can we find out the device type at 
> > > > > > > > > least?
> > > > > > > > > What about some kind of revision?    
> > > > > > > > hi Alex and Cornelia
> > > > > > > > for device compatibility, do you think it's a good idea to use 
> > > > > > > > "version"
> > > > > > > > and "device version" fields?
> > > > > > > >
> > > > > > > > version field: identify live migration interface's version. it 
> > > > > > > > can have a
> > > > > > > > sort of backward compatibility, like target machine's version 
> > > > > > > > >= source
> > > > > > > > machine's version. something like that.    
> > > > > >
> > > > > > Don't we essentially already have this via the device specific 
> > > > > > region?
> > > > > > The struct vfio_info_cap_header includes id and version fields, so 
> > > > > > we
> > > > > > can declare a migration id and increment the version for any
> > > > > > incompatible changes to the protocol.    
> > > > > yes, good idea!
> > > > > so, what about declaring below new cap?
> > > > >     #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > > >     struct vfio_region_info_cap_migration {
> > > > >         struct vfio_info_cap_header header;
> > > > >         __u32 device_version_len;
> > > > >         __u8  device_version[];
> > > > >     };  
> > > 
> > > I'm not sure why we need a new region for everything, it seems this
> > > could fit within the protocol of a single region.  This could simply be
> > > a new action to retrieve the version where the protocol would report
> > > the number of bytes available, just like the migration stream itself.  
> > so, to get version of VFIO live migration device state interface (simply
> > call it migration interface?),
> > a new cap looks like this:
> > #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > it contains struct vfio_info_cap_header only.
> > when get region info of the migration region, we query this cap and get
> > migration interface's version. right?
> > 
> > or just directly use VFIO_REGION_INFO_CAP_TYPE is fine?
> 
> Again, why a new region.  I'm imagining we have one region and this is
> just asking for a slightly different thing from it.  But TBH, I'm not
> sure we need it at all vs the sysfs interface.
> 
> > > > > > > > device_version field consists two parts:
> > > > > > > > 1. vendor id : it takes 32 bits. e.g. 0x8086.    
> > > > > >
> > > > > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > > > > suggest we use a bit to flag it as such so we can reserve that 
> > > > > > portion
> > > > > > of the 32bit address space.  See for example:
> > > > > >
> > > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> > > > > >
> > > > > > For vendor specific regions.    
> > > > > Yes, use PCI vendor ID.
> > > > > you are right, we need to use highest bit 
> > > > > (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> > > > > to identify it's a PCI ID.
> > > > > Thanks for pointing it out.
> > > > > But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK used 
> > > > > for?
> > > > > why it's 0xffff? I searched QEMU and kernel code and did not find 
> > > > > anywhere
> > > > > uses it.  
> > > 
> > > PCI vendor IDs are 16bits, it's just indicating that when the
> > > PCI_VENDOR_TYPE bit is set the valid data is the lower 16bits.  
> > 
> > thanks:)
> > 
> > > > > > > > 2. vendor proprietary string: it can be any string that a 
> > > > > > > > vendor driver
> > > > > > > > thinks can identify a source device. e.g. pciid + mdev type.
> > > > > > > > "vendor id" is to avoid overlap of "vendor proprietary string".
> > > > > > > >
> > > > > > > >
> > > > > > > > struct vfio_device_state_ctl {
> > > > > > > >      __u32 version;            /* ro */
> > > > > > > >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            /* 
> > > > > > > > ro */
> > > > > > > >      struct {
> > > > > > > >         __u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > > > > > > >         ...
> > > > > > > >      }data;
> > > > > > > >      ...
> > > > > > > >  };    
> > > > > >
> > > > > > We have a buffer area where we can read and write data from the 
> > > > > > vendor
> > > > > > driver, why would we have this IS_COMPATIBLE protocol to write the
> > > > > > device version string but use a static fixed length version string 
> > > > > > in
> > > > > > the control header to read it?  IOW, let's use GET_VERSION,
> > > > > > CHECK_VERSION actions that make use of the buffer area and allow 
> > > > > > vendor
> > > > > > specific version information length.    
> > > > > you are right, such static fixed length version string is bad :)
> > > > > To get device version, do you think which approach below is better?
> > > > > 1. use GET_VERSION action, and read from region buffer
> > > > > 2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION
> > > > >
> > > > > seems approach 1 is better, and cap VFIO_REGION_INFO_CAP_MIGRATION is 
> > > > > only
> > > > > for checking migration interface's version?  
> > > 
> > > I think 1 provides the most flexibility to the vendor driver.  
> > 
> > Got it.
> > For VFIO live migration, compared to reuse device state region (which takes
> > GET_BUFFER/SET_BUFFER actions),
> > could we create a new region for GET_VERSION & CHECK_VERSION ?
> 
> Why?
> 
> > > > > > > > Then, an action IS_COMPATIBLE is added to check device 
> > > > > > > > compatibility.
> > > > > > > >
> > > > > > > > The flow to figure out whether a source device is migratable to 
> > > > > > > > target device
> > > > > > > > is like that:
> > > > > > > > 1. in source side's .save_setup, save source device's 
> > > > > > > > device_version string
> > > > > > > > 2. in target side's .load_state, load source device's device 
> > > > > > > > version string
> > > > > > > > and write it to data region, and call IS_COMPATIBLE action to 
> > > > > > > > ask vendor driver
> > > > > > > > to check whether the source device is compatible to it.
> > > > > > > >
> > > > > > > > The advantage of adding an IS_COMPATIBLE action is that, vendor 
> > > > > > > > driver can
> > > > > > > > maintain a compatibility table and decide whether source device 
> > > > > > > > is compatible
> > > > > > > > to target device according to its proprietary table.
> > > > > > > > In device_version string, vendor driver only has to describe 
> > > > > > > > the source
> > > > > > > > device as elaborately as possible and resorts to vendor driver 
> > > > > > > > in target side
> > > > > > > > to figure out whether they are compatible.    
> > > > > >
> > > > > > I agree, it's too complicated and restrictive to try to create an
> > > > > > interface for the user to determine compatibility, let the driver
> > > > > > declare it compatible or not.    
> > > > > :)
> > > > >    
> > > > > > > It would also be good if the 'IS_COMPATIBLE' was somehow callable
> > > > > > > externally - so we could be able to answer a question like 'can we
> > > > > > > migrate this VM to this host' - from the management layer before 
> > > > > > > it
> > > > > > > actually starts the migration.    
> > > > >
> > > > > so qemu needs to expose two qmp/sysfs interfaces: GET_VERSION and 
> > > > > CHECK_VERSION.
> > > > > GET_VERSION returns a vm's device's version string.
> > > > > CHECK_VERSION's input is device version string and return
> > > > > compatible/non-compatible.
> > > > > Do you think it's good?  
> > > 
> > > That's the idea, but note that QEMU can only provide the QMP interface,
> > > the sysfs interface would of course be provided as more of a direct
> > > path from the vendor driver or mdev kernel layer.
> > >   
> > > > > > I think we'd need to mirror this capability in sysfs to support 
> > > > > > that,
> > > > > > or create a qmp interface through QEMU that the device owner could 
> > > > > > make
> > > > > > the request on behalf of the management layer.  Getting access to 
> > > > > > the
> > > > > > vfio device requires an iommu context that's already in use by the
> > > > > > device owner, we have no intention of supporting a model that allows
> > > > > > independent tasks access to a device.  Thanks,
> > > > > > Alex
> > > > > >    
> > > > > do you think two sysfs nodes under a device node is ok?
> > > > > e.g.
> > > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version
> > > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version
> > > > >     
> > > 
> > > I'd think it might live more in the mdev_support_types area, wouldn't
> > > we ideally like to know if a device is compatible even before it's
> > > created?  For example maybe:
> > > 
> > > /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/version
> > > 
> > > Where reading the sysfs attribute returns the version string and
> > > writing a string into the attribute return an errno for incompatibility.  
> > yes, knowing if a device is compatible before it's created is good.
> > but do you think check whether a device is compatible after it's created is
> > also required? For live migration, user usually only queries this 
> > information
> > when it's really required, i.e. when a device has been created.
> > maybe we can add this version get/check at both places?
> 
> Why does an instantiated device suddenly not follow the version and
> compatibility rules of an uninstantiated device?  IOW, if the version
> and compatibility check are on the mdev type, why can't we trace back
> from the device to the mdev type and make use of that same interface?
> Seems the only question is whether we require an interface through the
> vfio API directly or if sysfs is sufficient.
ok. got it.

> > > > Why do you need both sysfs and QMP at the same time? I can see it gives 
> > > > us some
> > > > flexibility, but is there something more to that?
> > > >
> > > > Normally, I'd prefer a QMP interface from libvirt's perspective (with an
> > > > appropriate capability that libvirt can check for QEMU support) because 
> > > > I imagine large nodes having a
> > > > bunch of GPUs with different revisions which might not be backwards 
> > > > compatible.
> > > > Libvirt might query the version string on source and check it on dest 
> > > > via the
> > > > QMP in a way that QEMU, talking to the driver, would return either a 
> > > > list or a
> > > > single physical device to which we can migrate, because neither QEMU nor
> > > > libvirt know that, only the driver does, so that's an important 
> > > > information
> > > > rather than looping through all the devices and trying to find one that 
> > > > is
> > > > compatible. However, you might have a hard time making all the necessary
> > > > changes in QMP introspectable, a new command would be fine, but if you 
> > > > also
> > > > wanted to extend say vfio-pci options, IIRC that would not appear in 
> > > > the QAPI
> > > > schema and libvirt would not be able to detect support for it.
> > > > 
> > > > On the other hand, the presence of a QMP interface IMO doesn't help 
> > > > mgmt apps
> > > > much, as it still carries the burden of being able to check this only 
> > > > at the
> > > > time of migration, which e.g. OpenStack would like to know long before 
> > > > that.
> > > > 
> > > > So, having sysfs attributes would work for both libvirt (even though 
> > > > libvirt
> > > > would benefit from a QMP much more) and OpenStack. OpenStack would IMO 
> > > > then
> > > > have to figure out how to create the mappings between compatible 
> > > > devices across
> > > > several nodes which are non-uniform.  
> > > 
> > > Yep, vfio encompasses more than just QEMU, so a sysfs interface has more
> > > utility than a QMP interface.  For instance we couldn't predetermine if
> > > an mdev type on a host is compatible if we need to first create the
> > > device and launch a QEMU instance on it to get access to QMP.  So maybe
> > > the question is whether we should bother with any sort of VFIO API to
> > > do this comparison, perhaps only a sysfs interface is sufficient for a
> > > complete solution.  The downside of not having a version API in the
> > > user interface might be that QEMU on its own can only try a migration
> > > and see if it fails, it wouldn't have the ability to test expected
> > > compatibility without access to sysfs.  And maybe that's fine.  Thanks,
> > >   
> > So QEMU vfio uses sysfs to check device compatiblity in migration's 
> > save_setup
> > phase?
> 
> The migration stream between source and target device are the ultimate
> test of compatibility, the vendor driver should never rely on userspace
> validating compatibility of the migration.  At the point it could do so, the
> migration has already begun, so we're only testing how quickly we can
> fail the migration.  The management layer setting up the migration can
> test via sysfs for compatibility and the migration stream itself needs
> to be self validating, so what value is added for QEMU to perform a
> version compatibility test?  Thanks,
oh, do you mean vendor driver should embed source device's version in migration
stream, which is opaque to qemu?
otherwise, I can't think of a quick way for vendor driver to determine whether
source device is an incompatible device.  



Thanks
Yan



reply via email to

[Prev in Thread] Current Thread [Next in Thread]