Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support
From: Alex Williamson
Subject: Re: [PATCH v2 03/20] vfio/migration: Add VFIO migration pre-copy support
Date: Wed, 22 Feb 2023 13:58:11 -0700
On Wed, 22 Feb 2023 19:48:58 +0200
Avihai Horon <avihaih@nvidia.com> wrote:
> Pre-copy support allows the VFIO device data to be transferred while the
> VM is running. This helps to accommodate VFIO devices that have a large
> amount of data that needs to be transferred, and it can reduce migration
> downtime.
>
> Pre-copy support is optional in VFIO migration protocol v2.
> Implement pre-copy of VFIO migration protocol v2 and use it for devices
> that support it. Full description of it can be found here [1].
>
> [1]
> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
> docs/devel/vfio-migration.rst | 35 +++++--
> include/hw/vfio/vfio-common.h | 3 +
> hw/vfio/common.c | 6 +-
> hw/vfio/migration.c | 175 ++++++++++++++++++++++++++++++++--
> hw/vfio/trace-events | 4 +-
> 5 files changed, 201 insertions(+), 22 deletions(-)
>
> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> index c214c73e28..ba80b9150d 100644
> --- a/docs/devel/vfio-migration.rst
> +++ b/docs/devel/vfio-migration.rst
> @@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
> destination host. This document details how saving and restoring of VFIO
> devices is done in QEMU.
>
> -Migration of VFIO devices currently consists of a single stop-and-copy phase.
> -During the stop-and-copy phase the guest is stopped and the entire VFIO device
> -data is transferred to the destination.
> -
> -The pre-copy phase of migration is currently not supported for VFIO devices.
> -Support for VFIO pre-copy will be added later on.
> +Migration of VFIO devices consists of two phases: the optional pre-copy phase,
> +and the stop-and-copy phase. The pre-copy phase is iterative and allows to
> +accommodate VFIO devices that have a large amount of data that needs to be
> +transferred. The iterative pre-copy phase of migration allows for the guest to
> +continue whilst the VFIO device state is transferred to the destination, this
> +helps to reduce the total downtime of the VM. VFIO devices can choose to skip
> +the pre-copy phase of migration by not reporting the VFIO_MIGRATION_PRE_COPY
> +flag in VFIO_DEVICE_FEATURE_MIGRATION ioctl.
Or alternatively, for the last sentence:

  VFIO devices opt in to pre-copy support by reporting the
  VFIO_MIGRATION_PRE_COPY flag in the VFIO_DEVICE_FEATURE_MIGRATION
  ioctl.
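
To make the opt-in wording concrete, here is a minimal sketch of the check it implies. The SKETCH_* names and the helper are hypothetical and exist only for illustration; real code would use the flag definitions from <linux/vfio.h>, and the bit positions below are assumptions rather than something this patch defines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical local stand-ins for the migration capability bits. Real code
 * would take these from <linux/vfio.h>; the bit positions are assumptions.
 */
#define SKETCH_MIGRATION_STOP_COPY (UINT64_C(1) << 0)
#define SKETCH_MIGRATION_PRE_COPY  (UINT64_C(1) << 2)

/*
 * Returns true when the flags word reported by the migration feature query
 * advertises pre-copy, i.e. the device has opted in.
 */
static bool sketch_device_supports_precopy(uint64_t mig_flags)
{
    return (mig_flags & SKETCH_MIGRATION_PRE_COPY) != 0;
}
```

A device that reports only the stop-copy bit simply fails this check and is migrated with the stop-and-copy phase alone.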
> Note that currently VFIO migration is supported only for a single device. This
> is due to VFIO migration's lack of P2P support. However, P2P support is planned
> @@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows:
> * A ``load_setup`` function that sets the VFIO device on the destination in
> _RESUMING state.
>
> +* A ``state_pending_estimate`` function that reports an estimate of the
> + remaining pre-copy data that the vendor driver has yet to save for the VFIO
> + device.
> +
> * A ``state_pending_exact`` function that reads pending_bytes from the vendor
> driver, which indicates the amount of data that the vendor driver has yet to
> save for the VFIO device.
>
> +* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
> + active only when the VFIO device is in pre-copy states.
> +
> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
> + vendor driver during iterative pre-copy phase.
> +
> * A ``save_state`` function to save the device config space if it is present.
>
> * A ``save_live_complete_precopy`` function that sets the VFIO device in
> @@ -95,8 +107,10 @@ Flow of state changes during Live migration
> ===========================================
>
> Below is the flow of state change during live migration.
> -The values in the brackets represent the VM state, the migration state, and
> +The values in the parentheses represent the VM state, the migration state, and
> the VFIO device state, respectively.
> +The text in the square brackets represents the flow if the VFIO device supports
> +pre-copy.
>
> Live migration save path
> ------------------------
> @@ -108,11 +122,12 @@ Live migration save path
> |
> migrate_init spawns migration_thread
> Migration thread then calls each device's .save_setup()
> - (RUNNING, _SETUP, _RUNNING)
> + (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
> |
> - (RUNNING, _ACTIVE, _RUNNING)
> - If device is active, get pending_bytes by .state_pending_exact()
> + (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
> + If device is active, get pending_bytes by .state_pending_{estimate,exact}()
> If total pending_bytes >= threshold_size, call .save_live_iterate()
> + [Data of VFIO device for pre-copy phase is copied]
> Iterate till total pending bytes converge and are less than threshold
> |
> On migration completion, vCPU stops and calls .save_live_complete_precopy for
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 87524c64a4..ee55d442b4 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -66,6 +66,9 @@ typedef struct VFIOMigration {
> int data_fd;
> void *data_buffer;
> size_t data_buffer_size;
> + uint64_t precopy_init_size;
> + uint64_t precopy_dirty_size;
Should these be size_t?
> + uint64_t mig_flags;
> } VFIOMigration;
>
> typedef struct VFIOAddressSpace {
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index bab83c0e55..6f5afe9f5a 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -409,7 +409,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
> }
>
> if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
> - migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
> + (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> + migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
> return false;
> }
> }
> @@ -438,7 +439,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
> return false;
> }
>
> - if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
> + if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> + migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
> continue;
> } else {
> return false;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 94a4df73d0..307983d57d 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -67,6 +67,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
> return "STOP_COPY";
> case VFIO_DEVICE_STATE_RESUMING:
> return "RESUMING";
> + case VFIO_DEVICE_STATE_PRE_COPY:
> + return "PRE_COPY";
> default:
> return "UNKNOWN STATE";
> }
> @@ -240,6 +242,23 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
> return 0;
> }
>
> +static int vfio_query_precopy_size(VFIOMigration *migration,
> + uint64_t *init_size, uint64_t *dirty_size)
size_t? Seems like a concern throughout.
> +{
> + struct vfio_precopy_info precopy = {
> + .argsz = sizeof(precopy),
> + };
> +
> + if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
> + return -errno;
> + }
> +
> + *init_size = precopy.initial_bytes;
> + *dirty_size = precopy.dirty_bytes;
> +
> + return 0;
> +}
> +
> /* Returns the size of saved data on success and -errno on error */
> static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
> {
> @@ -248,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
> data_size = read(migration->data_fd, migration->data_buffer,
> migration->data_buffer_size);
> if (data_size < 0) {
> + /* Pre-copy emptied all the device state for now */
> + if (errno == ENOMSG) {
> + return 0;
> + }
> +
> return -errno;
> }
> if (data_size == 0) {
> @@ -264,6 +288,31 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
> return qemu_file_get_error(f) ?: data_size;
> }
>
> +static void vfio_update_estimated_pending_data(VFIOMigration *migration,
> + uint64_t data_size)
> +{
> + if (!data_size) {
> + /*
> + * Pre-copy emptied all the device state for now, update estimated sizes
> + * accordingly.
> + */
> + migration->precopy_init_size = 0;
> + migration->precopy_dirty_size = 0;
> +
> + return;
> + }
> +
> + if (migration->precopy_init_size) {
> + uint64_t init_size = MIN(migration->precopy_init_size, data_size);
> +
> + migration->precopy_init_size -= init_size;
> + data_size -= init_size;
> + }
> +
> + migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
> + data_size);
> +}
> +
> /* ---------------------------------------------------------------------- */
>
> static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -284,6 +333,35 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
> return -ENOMEM;
> }
>
> + if (migration->mig_flags & VFIO_MIGRATION_PRE_COPY) {
> + uint64_t init_size = 0, dirty_size = 0;
> + int ret;
> +
> + switch (migration->device_state) {
> + case VFIO_DEVICE_STATE_RUNNING:
> + ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
> + VFIO_DEVICE_STATE_RUNNING);
> + if (ret) {
> + return ret;
> + }
> +
> + vfio_query_precopy_size(migration, &init_size, &dirty_size);
> + migration->precopy_init_size = init_size;
> + migration->precopy_dirty_size = dirty_size;
Seems like we could do away with {init,dirty}_size, initialize
migration->precopy_{init,dirty}_size before the switch, pass them
directly to vfio_query_precopy_size() and remove all but the break from
the case below. But then that also suggests we could redefine
vfio_query_precopy_size() to

  static int vfio_update_precopy_info(VFIOMigration *migration)

which sets the fields directly since this is the only way it's used.
> +
> + break;
> + case VFIO_DEVICE_STATE_STOP:
> + /* vfio_save_complete_precopy() will go to STOP_COPY */
> +
> + migration->precopy_init_size = 0;
> + migration->precopy_dirty_size = 0;
> +
> + break;
> + default:
> + return -EINVAL;
> + }
> + }
> +
> trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
>
> qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> @@ -302,23 +380,44 @@ static void vfio_save_cleanup(void *opaque)
> trace_vfio_save_cleanup(vbasedev->name);
> }
>
> +static void vfio_state_pending_estimate(void *opaque, uint64_t threshold_size,
> + uint64_t *must_precopy,
> + uint64_t *can_postcopy)
> +{
> + VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> +
> + if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
> + return;
> + }
> +
> + /*
> + * Initial size should be transferred during pre-copy phase so stop-copy
> + * phase will not be slowed down. Report threshold_size to force another
> + * pre-copy iteration.
> + */
> + *must_precopy += migration->precopy_init_size ?
> + threshold_size :
> + migration->precopy_dirty_size;
This sure feels like we're feeding false data back to the iterator to
spoof it into running another iteration, when the vfio migration
protocol only recommends that initial_bytes reaches zero before
proceeding to stop-copy; it's not a requirement. What benefit is
actually observed from this? Why is this required for initial pre-copy
support? It seems devious.
> +
> + trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
> + *can_postcopy,
> + migration->precopy_init_size,
> + migration->precopy_dirty_size);
> +}
> +
> /*
> * Migration size of VFIO devices can be as little as a few KBs or as big as
> * many GBs. This value should be big enough to cover the worst case.
> */
> #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>
> -/*
> - * Only exact function is implemented and not estimate function. The reason is
> - * that during pre-copy phase of migration the estimate function is called
> - * repeatedly while pending RAM size is over the threshold, thus migration
> - * can't converge and querying the VFIO device pending data size is useless.
> - */
> static void vfio_state_pending_exact(void *opaque, uint64_t threshold_size,
> uint64_t *must_precopy,
> uint64_t *can_postcopy)
> {
> VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>
> /*
> @@ -328,8 +427,57 @@ static void vfio_state_pending_exact(void *opaque, uint64_t threshold_size,
> vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
> *must_precopy += stop_copy_size;
>
> + if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
> + uint64_t init_size = 0, dirty_size = 0;
> +
> + vfio_query_precopy_size(migration, &init_size, &dirty_size);
> + migration->precopy_init_size = init_size;
> + migration->precopy_dirty_size = dirty_size;
This is the only other caller of vfio_query_precopy_size(), following
the same pattern that could be simplified if the function filled the
migration fields itself.
> +
> + /*
> + * Initial size should be transferred during pre-copy phase so
> + * stop-copy phase will not be slowed down. Report threshold_size
> + * to force another pre-copy iteration.
> + */
> + *must_precopy += migration->precopy_init_size ?
> + threshold_size :
> + migration->precopy_dirty_size;
> + }
Just as sketchy as above. Thanks,
Alex
> +
> trace_vfio_state_pending_exact(vbasedev->name, *must_precopy,
> *can_postcopy,
> - stop_copy_size);
> + stop_copy_size, migration->precopy_init_size,
> + migration->precopy_dirty_size);
> +}
> +
> +static bool vfio_is_active_iterate(void *opaque)
> +{
> + VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> +
> + return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> + VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> + ssize_t data_size;
> +
> + data_size = vfio_save_block(f, migration);
> + if (data_size < 0) {
> + return data_size;
> + }
> + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> + vfio_update_estimated_pending_data(migration, data_size);
> +
> + trace_vfio_save_iterate(vbasedev->name);
> +
> + /*
> + * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
> + * Return 1 so following handlers will not be potentially blocked.
> + */
> + return 1;
> }
>
> static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> @@ -338,7 +486,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> ssize_t data_size;
> int ret;
>
> - /* We reach here with device state STOP only */
> + /* We reach here with device state STOP or STOP_COPY only */
> ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
> VFIO_DEVICE_STATE_STOP);
> if (ret) {
> @@ -457,7 +605,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> static const SaveVMHandlers savevm_vfio_handlers = {
> .save_setup = vfio_save_setup,
> .save_cleanup = vfio_save_cleanup,
> + .state_pending_estimate = vfio_state_pending_estimate,
> .state_pending_exact = vfio_state_pending_exact,
> + .is_active_iterate = vfio_is_active_iterate,
> + .save_live_iterate = vfio_save_iterate,
> .save_live_complete_precopy = vfio_save_complete_precopy,
> .save_state = vfio_save_state,
> .load_setup = vfio_load_setup,
> @@ -470,13 +621,18 @@ static const SaveVMHandlers savevm_vfio_handlers = {
> static void vfio_vmstate_change(void *opaque, bool running, RunState state)
> {
> VFIODevice *vbasedev = opaque;
> + VFIOMigration *migration = vbasedev->migration;
> enum vfio_device_mig_state new_state;
> int ret;
>
> if (running) {
> new_state = VFIO_DEVICE_STATE_RUNNING;
> } else {
> - new_state = VFIO_DEVICE_STATE_STOP;
> + new_state =
> + (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
> + (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
> + VFIO_DEVICE_STATE_STOP_COPY :
> + VFIO_DEVICE_STATE_STOP;
> }
>
> /*
> @@ -590,6 +746,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
> migration->vbasedev = vbasedev;
> migration->device_state = VFIO_DEVICE_STATE_RUNNING;
> migration->data_fd = -1;
> + migration->mig_flags = mig_flags;
>
> oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj)));
> if (oid) {
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 669d9fe07c..51613e02e6 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -161,6 +161,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
> vfio_save_cleanup(const char *name) " (%s)"
> vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
> vfio_save_device_config_state(const char *name) " (%s)"
> +vfio_save_iterate(const char *name) " (%s)"
> vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
> -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
> +vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
> +vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
> vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"