[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH v3 1/5] VFIO KABI for migration interface
From: |
Alex Williamson |
Subject: |
Re: [Qemu-devel] [PATCH v3 1/5] VFIO KABI for migration interface |
Date: |
Thu, 7 Mar 2019 12:45:18 -0700 |
On Wed, 27 Feb 2019 01:35:01 +0530
Kirti Wankhede <address@hidden> wrote:
> Alex,
>
> On 2/22/2019 3:53 AM, Alex Williamson wrote:
> > On Wed, 20 Feb 2019 02:53:16 +0530
> > Kirti Wankhede <address@hidden> wrote:
> >
> >> - Defined MIGRATION region type and sub-type.
> >> - Used 2 bits to define VFIO device states.
> >> Bit 0 => 0/1 => _STOPPED/_RUNNING
> >> Bit 1 => 0/1 => _RESUMING/_SAVING
> >> Combination of these bits defines VFIO device's state during migration
> >> _RUNNING => Normal VFIO device running state.
> >> _STOPPED => VFIO device stopped.
> >> _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
> >> start
> >> saving state of device i.e. pre-copy state
> >> _SAVING | _STOPPED => vCPUs are stoppped, VFIO device should be
> >> stopped, and
> >> save device state,i.e. stop-n-copy state
> >> _RESUMING => VFIO device resuming state.
> >
> > Shouldn't we have a non-_RESUMING/_SAVING run state? If these are
> > indicating directly flow, maybe we need two bits:
> >
> > 00b - None, normal runtime
> > 01b - Saving
> > 10b - Resuming
> > 11b - Invalid/reserved (maybe a Failed state indicator)
> >
>
> There has to be 2 more states:
> _SAVING | _RUNNING => SAVING while device is RUNNING - pre-copy phase
> _SAVING | _STOPPED => SAVING while device is STOPPED - stop-and-copy phase
Sorry, this was unclear, a separate third bit would need to be the stop
bit, the above was only focusing on the data direction, because...
> So the 2 bits used in this patch are:
> 00b - _RESUMING
This is actually:
_RESUMING | _STOPPED
> 01b - _RUNNING - Normal Running
And this is really:
_RESUMING | _RUNNING
We're just selectively ignoring _RESUMING when we choose to. So my
question is really more like:
|0|00|
^ ^^
| ||
| |+-- Saving
| +--- Resuming
+----- Stopped
Where Saving|Resuming both set is either invalid or a failure indicator.
> 10b - _SAVING | _STOPPED - stop-and-copy phase
> 11b - _SAVING | _RUNNING - pre-copy phase
>
>
> >> - Defined vfio_device_migration_info structure which will be placed at 0th
> >> offset of migration region to get/set VFIO device related information.
> >> Defined members of structure and usage on read/write access:
> >> * device_state: (write only)
> >> To convey VFIO device state to be transitioned to.
> >
> > Seems trivial and potentially useful to support read here, we have 30
> > (or maybe 29) bits yet to define.
> >
>
> Ok, those bits can be used later.
>
> >> * pending bytes: (read only)
> >> To get pending bytes yet to be migrated for VFIO device
> >> * data_offset: (read/write)
> >> To get or set data offset in migration from where data exist
> >> during _SAVING and _RESUMING state
> >
> > What's the use case for writing this?
> >
>
> During resuming, user space application (QEMU) writes data to migration
> region. At that time QEMU writes data_offset from where data is written,
> so that vendor driver can read data from that offset. If data section is
> mmapped, data_offset is start of mmapped region where as if data section
> is trapped, data_offset is sizeof(struct vfio_device_migration_info) +
> 1, just after vfio_device_migration_info structure.
It doesn't make sense to me that this proposal allows the user to write
wherever they want, the vendor driver defines whether they support mmap
for a given operation and the user can choose to make use of mmap or
not. Therefore I'd suggest that the vendor driver expose a read-only
data_offset that matches a sparse mmap capability entry should the
driver support mmap. The use should always read or write data from the
vendor defined data_offset. Otherwise we have scenarios like the other
patch I commented on where the user implicitly decides that the first
mmap region large enough is the one to be used for a given operation,
removing the vendor driver's ability to decide whether it wants to
support mmap for that operation.
> >> * data_size: (write only)
> >> To convey size of data copied in migration region during _RESUMING
> >> state
> >
> > How to know how much is available for read?
>
> pending_bytes tells how much is still to be read.
But pending_bytes can be bigger than the region, right? Does the data
area necessarily extend to the end of the region?
> >> * start_pfn, page_size, total_pfns: (write only)
> >> To get bitmap of dirty pages from vendor driver from given
> >> start address for total_pfns.
> >
> > What would happen if a user wrote in 1MB for page size? Is the vendor
> > driver expected to support arbitrary page sizes? Are we only trying to
> > convey the page size and would that page size ever be other than
> > getpagesize()?
> >
> >> * copied_pfns: (read only)
> >> To get number of pfns bitmap copied in migration region.
> >> Vendor driver should copy the bitmap with bits set only for
> >> pages to be marked dirty in migration region. Vendor driver
> >> should return 0 if there are 0 pages dirty in requested
> >> range.
> >
> > This is useful, but I wonder if it's really a required feature for the
> > vendor driver. For instance, with mdev IOMMU support we could wrap an
> > arbitrary PCI device as mdev, but we don't necessarily have dirty page
> > tracking. Would a device need to report -1 here if it wanted to
> > indicate any page could be dirty if we only know how to collect the
> > state of the device itself for migration (ie. force the device to be
> > stopped first).
> >
>
> Does that mean if returned -1 here, mark all pages in the section as dirty?
Yes
> >> Migration region looks like:
> >> ------------------------------------------------------------------
> >> |vfio_device_migration_info| data section |
> >> | | /////////////////////////////// |
> >> ------------------------------------------------------------------
> > ^ what's this?
> >
> >> ^ ^ ^
> >> offset 0-trapped part data.offset data.size
> >
> > Isn't data.size above really (data.offset + data.size)? '.' vs '_'
> > inconsistency vs above.
> >
>
> Yes, it should be '_'. I'll correct that.
>
>
> >> Data section is always followed by vfio_device_migration_info
> >> structure in the region, so data.offset will always be none-0.
> >
> > This seems exactly backwards from the diagram, data section follows
> > vfio_device_migration_info. Also, "non-zero".
> >
> >> Offset from where data is copied is decided by kernel driver, data
> >
> > But data_offset is listed as read-write.
> >
>
> data_offset is read during _SAVING state so QEMU knows from where to
> read data
> data_offset is written during _RESUMING state so that QEMU can convey
> offset in migration region to vendor driver from where data should be
> considered as input data.
Please let's drop the idea that the user can write data to an arbitrary
offset within the region. Does it serve any value?
> >> section can be trapped or mapped depending on how kernel driver
> >> defines data section. If mmapped, then data.offset should be page
> >> aligned, where as initial section which contain
> >> vfio_device_migration_info structure might not end at offset which
> >> is page aligned.
> >>
> >> Signed-off-by: Kirti Wankhede <address@hidden>
> >> Reviewed-by: Neo Jia <address@hidden>
> >> ---
> >> linux-headers/linux/vfio.h | 65
> >> ++++++++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 65 insertions(+)
> >>
> >> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >> index 12a7b1dc53c8..1b12a9b95e00 100644
> >> --- a/linux-headers/linux/vfio.h
> >> +++ b/linux-headers/linux/vfio.h
> >> @@ -368,6 +368,71 @@ struct vfio_region_gfx_edid {
> >> */
> >> #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
> >>
> >> +/* Migration region type and sub-type */
> >> +#define VFIO_REGION_TYPE_MIGRATION (2)
> >> +#define VFIO_REGION_SUBTYPE_MIGRATION (1)
> >> +
> >> +/**
> >> + * Structure vfio_device_migration_info is placed at 0th offset of
> >> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related
> >> migration
> >> + * information. Field accesses from this structure are only supported at
> >> their
> >> + * native width and alignment, otherwise should return error.
> >> + *
> >> + * device_state: (write only)
> >> + * To indicate vendor driver the state VFIO device should be
> >> transitioned
> >> + * to. If device state transition fails, write to this field return
> >> error.
> >> + * It consists of 2 bits.
> >> + * - If bit 0 set, indicates _RUNNING state. When its reset, that
> >> indicates
> >> + * _STOPPED state. When device is changed to _STOPPED, driver
> >> should stop
> >> + * device before write returns.
> >> + * - If bit 1 set, indicates _SAVING state. When its reset, that
> >> indicates
> >> + * _RESUMING state.
> >> + *
> >> + * pending bytes: (read only)
> >> + * Read pending bytes yet to be migrated from vendor driver
> >
> > Is this essentially a volatile value, changing based on data previously
> > copied and device activity?
>
> Yes.
>
> >
> >> + *
> >> + * data_offset: (read/write)
> >> + * User application should read data_offset in migration region from
> >> where
> >> + * user application should read data during _SAVING state.
> >> + * User application would write data_offset in migration region from
> >> where
> >> + * user application is had written data during _RESUMING state.
> >
> > Why wouldn't data_offset simply be fixed? Do we really want to support
> > the user writing a arbitrary offsets in the migration region? Each
> > chunk can simply start at the kernel provided data_offset.
> >
>
> data_offset differs based on region is trapped on mapped. In case when
> region is mmaped, QEMU writes data to mapped region and then write
> data_offset field, indicating data is present in mmaped buffer. Read
> below for more detailed steps.
>
> >> + *
> >> + * data_size: (write only)
> >> + * User application should write size of data copied in migration
> >> region
> >> + * during _RESUMING state.
> >
> > How much does the user read on _SAVING then?
> >
>
> pending_bytes tells bytes that should be read on _SAVING.
So pending_bytes is always less than or equal to the size of the
region, thus pending_bytes cannot be used to gauge the _total_ pending
device state?
> >> + *
> >> + * start_pfn: (write only)
> >> + * Start address pfn to get bitmap of dirty pages from vendor driver
> >> duing
> >> + * _SAVING state.
> >> + *
> >> + * page_size: (write only)
> >> + * User application should write the page_size of pfn.
> >> + *
> >> + * total_pfns: (write only)
> >> + * Total pfn count from start_pfn for which dirty bitmap is
> >> requested.
> >
> > So effectively the area that begins at data_offset is dual purpose
> > (triple purpose vs Yan's, config, memory, and dirty bitmap) but the
> > protocol isn't entirely evident to me. I think we need to write it out
> > as Yan provided. If I'm in the _SAVING state, do I simply read from
> > data_offset to min(data_size, region.size - data_offset)? If that area
> > is mmap'd, how does the user indicate to the kernel to prepare the data
> > or that X bytes were acquired? In the _RESUMING state, do I write from
> > data_offset to min(data_size, region.size - data_offset) and indicate
> > the write using data_size?
> >
>
> Let me list down the steps for each state, hope that helps to answer all
> above questions.
>
> In _SAVING|_RUNNING device state:
> - read pending_bytes
> - read data_offset - indicates kernel driver to write data to staging
> buffer which is mmapped.
> - if data section is trapped, pread() from data_offset to
> min(pending_bytes, remaining_region)
> - if data section is mmaped, read mmaped buffer of size
> min(pending_bytes, remaining_region)
Ok, pending_bytes can describe more than the region currently holds.
It still seems worthwhile to have a data_size so that we can extend
this interface. For instance what if a driver wanted to trap data
accesses but mmap a dirty bitmap. This interface doesn't allow for
that since they're both necessarily covered by the same sparse mmap
offset.
> - Write data packet to file stream as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> VFIO_MIG_FLAG_END_OF_STATE }
>
>
> In _SAVING|_STOPPED device state:
> a. read config space of device and save to migration file stream. This
> doesn't need to be from vendor driver. Any other special config state
> from driver can be saved as data in following iteration.
> b. read pending_bytes - indicates kernel driver to write data to staging
> buffer which is mmapped.
> c. if data section is trapped, pread() from data_offset to
> min(pending_bytes, remaining_region)
> d. if data section is mmaped, read mmaped buffer of size
> min(pending_bytes, remaining_region)
> e. Write data packet as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> f. iterate through steps b to e until (pending_bytes > 0)
What indicates to the vendor driver to deduct from pending_bytes and
advance the data? It seems like it's assumed that a read of
pending_bytes indicates this, but I don't like that implicit
interaction, a user could be interrupted and read pending_bytes again
to see if there's still data and introduce data loss. IMO, there
should be an explicit "Ok, I read # bytes, advance the data stream"
type operation.
> g. Write {VFIO_MIG_FLAG_END_OF_STATE}
>
>
> Dirty page tracking (.log_sync) is part of RAM copying state, where
> vendor driver provides the bitmap of pages which are dirtied by vendor
> driver through migration region and as part of RAM copy, those pages
> gets copied to file stream.
>
>
> In _RESUMING device state:
> - load config state.
> - For data packet, till VFIO_MIG_FLAG_END_OF_STATE is not reached
> - read data_size from packet, read buffer of data_size
> if region is mmaped, write data of data_size to mmaped region.
> - write data_offset and data_size.
> In case of mmapped region, write to data_size indicates kernel
> driver that data is written in staging buffer.
> - if region is trapped, pwrite() data of data_size from data_offset.
I still think data_offset should be read_only and this should use the
same operation above to indicate how many bytes were written rather
than read. Thanks,
Alex
> >> + *
> >> + * copied_pfns: (read only)
> >> + * pfn count for which dirty bitmap is copied to migration region.
> >> + * Vendor driver should copy the bitmap with bits set only for pages
> >> to be
> >> + * marked dirty in migration region.
> >> + * Vendor driver should return 0 if there are 0 pages dirty in
> >> requested
> >> + * range.
> >> + */
> >> +
> >> +struct vfio_device_migration_info {
> >> + __u32 device_state; /* VFIO device state */
> >> +#define VFIO_DEVICE_STATE_RUNNING (1 << 0)
> >> +#define VFIO_DEVICE_STATE_SAVING (1 << 1)
> >> + __u32 reserved;
> >> + __u64 pending_bytes;
> >> + __u64 data_offset;
> >> + __u64 data_size;
> >> + __u64 start_pfn;
> >> + __u64 page_size;
> >> + __u64 total_pfns;
> >> + __u64 copied_pfns;
> >> +} __attribute__((packed));
> >> +
> >
> > As you mentioned, and with Yan's version, we still need to figure out
> > something with compatibility and versioning. Thanks,
> >
>
> Yes, we still need to figure out compatibility and versioning.
>
> Thanks,
> Kirti
>
> > Alex
> >
> >> /*
> >> * The MSIX mappable capability informs that MSIX data of a BAR can be
> >> mmapped
> >> * which allows direct access to non-MSIX registers which happened to be
> >> within
> >