Re: [Qemu-devel] [PATCH 1/5] VFIO KABI for migration interface
From: Tian, Kevin
Date: Mon, 26 Nov 2018 07:14:17 +0000
> From: Kirti Wankhede [mailto:address@hidden]
> Sent: Friday, November 23, 2018 4:02 AM
>
[...]
> >
> > I looked at the explanations in this patch, but still didn't get the
> > intention, e.g.:
> >
> > + * - VFIO_DEVICE_STATE_MIGRATION_SETUP:
> > + * Transition VFIO device in migration setup state. This is used to prepare
> > + * VFIO device for migration while application or VM and vCPUs are still in
> > + * running state.
> >
> > what preparation is actually required? any example?
>
> Each vendor driver can have different requirements as to how to prepare
> for migration. For example, this phase can be used to allocate a buffer
> that can be mapped to the MIGRATION region's data part, and to allocate a
> staging buffer. The driver might also need to spawn a thread that starts
> collecting the data that needs to be sent during the pre-copy phase.
>
> >
> > + * - VFIO_DEVICE_STATE_MIGRATION_PRECOPY:
> > + * When VFIO user space application or VM is active and vCPUs are running,
> > + * transition VFIO device in pre-copy state.
> >
> > why does the device driver need to know this stage? in precopy phase, the VM
> > is still running. Just dirty page tracking is in progress. the dirty bitmap
> > could be retrieved through its own action interface.
> >
>
> Not all mdev devices are alike. The pre-copy phase is not just about dirty
> page tracking. Devices that have on-device memory could transfer
> data from that memory during the pre-copy phase. For example, an NVIDIA GPU
> has its own FB, so it needs to start sending FB data during the pre-copy
> phase and then, during the stop-and-copy phase, send the FB data that was
> marked dirty after it was copied in the pre-copy phase. That helps to reduce
> total downtime.
Yes, it makes sense; otherwise copying the whole big FB at stop time is time
consuming. Curious: does Qemu already support pre-copy of device state
today, or is this series the first example of that?
>
> > you have code to demonstrate how those states are transitioned in Qemu,
> > but you didn't show evidence why those states are necessary on the device
> > side, which leads to the puzzle whether the definition is overkill and
> > limiting.
> >
>
> I'm trying to keep these interfaces generic for VFIO and mdev devices.
> It's difficult to define what a vendor driver should do for each state;
> each vendor driver has its own requirements. Vendor drivers should
> decide whether to take any action on a state transition or not.
>
> > the flow in my mind is like below:
> >
> > 1. an interface to turn on/off dirty page tracking on VFIO device:
> > * vendor driver can do whatever required to enable device specific
> > dirty page tracking mechanism here
> > * device state is not changed here. still in running state
> >
> > 2. an interface to get dirty page bitmap
> >
>
> I don't think there should be an on/off interface for dirty page tracking.
> If there is a write access on dirty_pfns.start_addr and dirty_pfns.total
> and device_state >= VFIO_DEVICE_STATE_MIGRATION_SETUP &&
> device_state <= VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY, then dirty page
> tracking has started, so return the dirty page bitmap in the data part of
> the migration region.
Dirty page tracking might be useful for other purposes, e.g. if people just
want to draw the memory access pattern of a given VM. Binding dirty tracking
to the migration flow is limiting...
>
>
> > 3. an interface to start/stop device activity
> > * the effect of stop is to stop and drain in-flight device activities and
> > make device state ready for dump-out. vendor driver can do specific
> > preparation here
>
> VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY is to stop the device, but as I
> mentioned above, some vendor drivers might have to do preparation before
> the pre-copy phase starts.
>
> > * the effect of start is to check validity of device state and then resume
> > device activities. again, vendor driver can do specific cleanup/preparation
> > here
> >
>
> That is VFIO_DEVICE_STATE_MIGRATION_RESUME.
>
> The VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED and
> VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED states are defined to clean up
> everything that was allocated, mmapped, or started (e.g. threads) during the
> setup phase. This cleanup can be moved to the transition to the _RUNNING
> state, so if everyone agrees, these states can be removed.
>
>
> > 4. an interface to save/restore device state
> > * should happen when device is stopped
> > * of course there is still an open question of how to check state
> > compatibility, as Alex pointed out earlier
> >
>
> I hope above explains why other states are required.
>
Yes, the above makes the whole picture much clearer. Thanks a lot!
Accordingly, I'm wondering whether the state definition below could be
more general and extensible:
_STATE_NONE, indicates initial state
_STATE_RUNNING, indicates normal state
_STATE_STOPPED, indicates that device activities are fully stopped
_STATE_IN_TRACKING, indicates that device state can be read/written by user
space. This state can be ORed with RUNNING or STOPPED.
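To make the encoding concrete, here is a minimal sketch in C of how such ORable states might be represented and sanity-checked. The DEV_STATE_* names, the bit values, and dev_state_is_valid() are hypothetical and not part of the posted patch; the sketch assumes RUNNING and STOPPED are mutually exclusive and that IN_TRACKING is only meaningful together with one of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding of the proposed states. Names and bit values are
 * illustrative only and do not come from the posted patch. */
#define DEV_STATE_NONE        0x0u  /* initial state */
#define DEV_STATE_RUNNING     0x1u  /* normal operation */
#define DEV_STATE_STOPPED     0x2u  /* device activities fully stopped */
#define DEV_STATE_IN_TRACKING 0x4u  /* device state is r/w by user space */

#define DEV_STATE_BASE_MASK (DEV_STATE_RUNNING | DEV_STATE_STOPPED)
#define DEV_STATE_ALL_MASK  (DEV_STATE_BASE_MASK | DEV_STATE_IN_TRACKING)

/* Return true if 'state' is one of the combinations described above. */
static bool dev_state_is_valid(uint32_t state)
{
    uint32_t base = state & DEV_STATE_BASE_MASK;

    if (state & ~DEV_STATE_ALL_MASK)
        return false;   /* unknown bits set */
    if (base == DEV_STATE_BASE_MASK)
        return false;   /* RUNNING and STOPPED at the same time */
    if ((state & DEV_STATE_IN_TRACKING) && base == 0)
        return false;   /* IN_TRACKING must be ORed with a base state */
    return true;
}
```

With this encoding, (RUNNING | IN_TRACKING) and (STOPPED | IN_TRACKING) are accepted, while IN_TRACKING on its own or RUNNING | STOPPED is rejected.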
Live migration could then be implemented with the flow below:
(at src side)
1. RUNNING -> (RUNNING | IN_TRACKING)
   * this switch does vendor-specific preparation to make device
     state accessible to user space (as covered by MIGRATION_SETUP)
   * vendor driver may let iterative reads return incremental changes
     since the last read (as covered by MIGRATION_PRECOPY). *open*: do we
     need an explicit flag to indicate such capability?
   * dirty page bitmap is also made available upon this change
2. (RUNNING | IN_TRACKING) -> (STOPPED | IN_TRACKING)
   * device is stopped, thus device state is finalized
   * user space can read full device state, as defined for
     MIGRATION_STOPNCOPY
3. (STOPPED | IN_TRACKING) -> (STOPPED)
   * device state tracking and dirty page tracking are cancelled.
     Cleanup is done for resources set up in step 1, similar to
     MIGRATION_SAVE_COMPLETED
4. STOPPED -> NONE, when device is reset later
(at dest side)
1. NONE -> (STOPPED | IN_TRACKING)
   * prepare device state region so user space can write
   * maps to MIGRATION_RESUME
   * open: do we need both NONE and STOPPED, or just STOPPED?
2. (STOPPED | IN_TRACKING) -> STOPPED
   * clean up resources allocated in step 1
   * maps to MIGRATION_RESUME_COMPLETED
3. STOPPED -> RUNNING
   * resume the device activities
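The two flows above can be sketched as a small table of allowed transitions. This is only an illustration: the S_* names, the bit values, and transition_allowed() are hypothetical, and the table lists just the steps spelled out for the source and destination sides, nothing more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit encoding of the proposed states (illustrative only). */
enum { S_NONE = 0x0, S_RUNNING = 0x1, S_STOPPED = 0x2, S_IN_TRACKING = 0x4 };

struct transition {
    uint32_t from;
    uint32_t to;
};

/* Transitions taken from the source-side and destination-side flows. */
static const struct transition allowed[] = {
    /* source side */
    { S_RUNNING,                 S_RUNNING | S_IN_TRACKING }, /* step 1 */
    { S_RUNNING | S_IN_TRACKING, S_STOPPED | S_IN_TRACKING }, /* step 2 */
    { S_STOPPED | S_IN_TRACKING, S_STOPPED },                 /* step 3 */
    { S_STOPPED,                 S_NONE },                    /* step 4 */
    /* destination side */
    { S_NONE,                    S_STOPPED | S_IN_TRACKING }, /* step 1 */
    /* dest step 2 is the same edge as source step 3 */
    { S_STOPPED,                 S_RUNNING },                 /* step 3 */
};

/* Return true if the flow description lists 'from' -> 'to' as a step. */
static bool transition_allowed(uint32_t from, uint32_t to)
{
    for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
        if (allowed[i].from == from && allowed[i].to == to)
            return true;
    }
    return false;
}
```

A table keeps the policy in one place, so adding a new edge later is a one-line change rather than another state in the enum.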
Compared to the original definition, I think all important steps are covered:
+enum {
+ VFIO_DEVICE_STATE_NONE,
+ VFIO_DEVICE_STATE_RUNNING,
+ VFIO_DEVICE_STATE_MIGRATION_SETUP,
+ VFIO_DEVICE_STATE_MIGRATION_PRECOPY,
+ VFIO_DEVICE_STATE_MIGRATION_STOPNCOPY,
+ VFIO_DEVICE_STATE_MIGRATION_SAVE_COMPLETED,
+ VFIO_DEVICE_STATE_MIGRATION_RESUME,
+ VFIO_DEVICE_STATE_MIGRATION_RESUME_COMPLETED,
+ VFIO_DEVICE_STATE_MIGRATION_FAILED,
+ VFIO_DEVICE_STATE_MIGRATION_CANCELLED,
+};
FAILED is not a device state. It should be indicated in the return value of
the set-state action.
CANCELLED can be achieved at any time by clearing the IN_TRACKING state.
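As a rough sketch of these two points, a set-state call could report failure through its return value and leave the device in its previous state, while cancel is simply a set-state with IN_TRACKING cleared. Everything here is hypothetical: struct dev, the fail_next test hook, dev_set_state(), and dev_cancel_migration() are made up for illustration and are not an existing VFIO interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical bit encoding of the proposed states (illustrative only). */
#define ST_RUNNING     0x1u
#define ST_STOPPED     0x2u
#define ST_IN_TRACKING 0x4u

/* Toy device model; 'fail_next' simulates a vendor driver error. */
struct dev {
    uint32_t state;
    int fail_next;
};

/* Returns 0 on success or a negative errno value on failure.
 * On failure the device keeps its previous state, so no FAILED
 * state is needed. */
static int dev_set_state(struct dev *d, uint32_t new_state)
{
    if (d->fail_next) {
        d->fail_next = 0;
        return -EIO;
    }
    d->state = new_state;
    return 0;
}

/* Cancelling is just clearing IN_TRACKING from the current state. */
static int dev_cancel_migration(struct dev *d)
{
    return dev_set_state(d, d->state & ~ST_IN_TRACKING);
}
```

For example, a failed attempt to go from (RUNNING | IN_TRACKING) to (STOPPED | IN_TRACKING) returns -EIO and leaves the device running; a subsequent cancel brings it back to plain RUNNING.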
With this new definition, the above states can also be used selectively for
other purposes, e.g.:
1. user space can do RUNNING -> STOPPED -> RUNNING for any control reason,
   w/o touching device state at all.
2. if someone wants to draw the memory access pattern of a VM, it could
   be done by RUNNING -> (RUNNING | IN_TRACKING) -> RUNNING, reading the
   dirty bitmap while IN_TRACKING is active. Device state is made ready but
   not accessed here; hopefully that is not a big burden.
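For the second use case, the user-space side could be as simple as sampling the dirty bitmap periodically and counting the set bits. The helper below is a hypothetical sketch; it assumes one bit per page with the least significant bit of each byte covering the lowest page, which is an assumption and not something the patch defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count dirty pages in a bitmap covering 'npages' pages.
 * Assumed layout: one bit per page, LSB of each byte first. */
static size_t count_dirty_pages(const uint8_t *bitmap, size_t npages)
{
    size_t dirty = 0;

    for (size_t pfn = 0; pfn < npages; pfn++) {
        if (bitmap[pfn / 8] & (1u << (pfn % 8)))
            dirty++;
    }
    return dirty;
}
```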
Thoughts?
Thanks
Kevin