On 07/08/2023 18:53, Cédric Le Goater wrote:
[ Adding Juan and Peter for their awareness ]
On 8/2/23 10:14, Avihai Horon wrote:
Changing the device state from STOP_COPY to STOP can take time as the
device may need to free resources and do other operations as part of the
transition. Currently, this is done in vfio_save_complete_precopy() and
therefore it is counted in the migration downtime.
To avoid this, change the device state from STOP_COPY to STOP in
vfio_save_cleanup(), which is called after migration has completed and
thus is not part of migration downtime.
What bothers me is that this looks like a device-specific optimization
True, currently it helps mlx5, but this change is based on the assumption that,
in general, VFIO devices are likely to free resources when transitioning from
STOP_COPY to STOP.
So I think this is a good change to have in any case.
and we are losing the error part.
I don't think we lose the error part.
AFAIU, the crucial part is transitioning to STOP_COPY and sending the final
data.
If that's done successfully, then migration is successful.
The STOP_COPY->STOP transition is done as part of the cleanup flow, after the
migration is completed -- i.e., failure in it does not affect the success of
migration.
Furthermore, if there is an error in the STOP_COPY->STOP transition, then it's
reported by vfio_migration_set_state().
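
To illustrate the flow I have in mind, here is a toy model in plain C. It is
only a sketch and not the actual QEMU code: every toy_* name, the struct and
the failure-injection flag are made up for illustration, and toy_set_state()
merely stands in for vfio_migration_set_state(). It assumes the cleanup
handler has no way to return an error to the migration core.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for the device states discussed in this thread. */
enum toy_state { TOY_RUNNING, TOY_STOP_COPY, TOY_STOP };

struct toy_device {
    enum toy_state state;
    bool fail_stop_transition; /* hypothetical knob: make STOP_COPY->STOP fail */
    int errors_reported;       /* counts errors reported by toy_set_state() */
};

/*
 * Hypothetical stand-in for vfio_migration_set_state(): on failure it
 * reports the error itself and leaves the device state unchanged.
 */
static int toy_set_state(struct toy_device *dev, enum toy_state new_state)
{
    assert(dev != NULL);

    if (new_state == TOY_STOP && dev->fail_stop_transition) {
        fprintf(stderr, "toy: failed to change device state to STOP\n");
        dev->errors_reported++;
        return -1;
    }

    dev->state = new_state;
    return 0;
}

/*
 * Downtime path: move to STOP_COPY and send the final data. The device is
 * left in STOP_COPY, so the possibly slow transition to STOP is not counted
 * in the downtime.
 */
static int toy_save_complete_precopy(struct toy_device *dev)
{
    int ret = toy_set_state(dev, TOY_STOP_COPY);

    if (ret) {
        return ret; /* a failure here does fail the migration */
    }

    /* ... the final device data would be sent here ... */
    return 0;
}

/*
 * Cleanup path: runs after migration has completed. It returns void, so a
 * failed STOP_COPY->STOP transition cannot change the migration result,
 * but the error is still reported by toy_set_state().
 */
static void toy_save_cleanup(struct toy_device *dev)
{
    if (dev->state == TOY_STOP_COPY) {
        (void)toy_set_state(dev, TOY_STOP);
    }
}
```

In this model, a failure injected into the STOP_COPY->STOP transition leaves
toy_save_complete_precopy() returning 0 and still bumps errors_reported, which
is the behavior I describe above.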