Re: an issue for device hot-unplug
From: Igor Mammedov
Subject: Re: an issue for device hot-unplug
Date: Tue, 4 Apr 2023 14:17:19 +0200
On Mon, 3 Apr 2023 15:24:43 +0200
Yu Zhang <yu.zhang@ionos.com> wrote:
> Dear Laurent,
>
> recently we ran into an issue with the following error:
>
> command '{ "execute": "device_del", "arguments": { "id": "virtio-diskX" }
> }' for VM "id" failed ({ "return": {"class": "GenericError", "desc":
> "Device virtio-diskX is already in the process of unplug"} }).
>
> The issue is reproducible. With a few seconds delay before hot-unplug,
> hot-unplug just works fine.
>
> After some digging, we found that commit 9323f892b39 may cause the
> issue.
> ------------------
> failover: fix unplug pending detection
>
> Failover needs to detect the end of the PCI unplug to start migration
> after the VFIO card has been unplugged.
>
> To do that, a flag is set in pcie_cap_slot_unplug_request_cb() and reset in
> pcie_unplug_device().
>
> But since
> 17858a169508 ("hw/acpi/ich9: Set ACPI PCI hot-plug as default on Q35")
> we have switched to ACPI unplug and these functions are not called anymore
> and the flag not set. So failover migration is not able to detect if card
> is really unplugged and acts as it's done as soon as it's started. So it
> doesn't wait the end of the unplug to start the migration. We don't see any
> problem when we test that because ACPI unplug is faster than PCIe native
> hotplug and when the migration really starts the unplug operation is
> already done.
>
> See c000a9bd06ea ("pci: mark device having guest unplug request pending")
> a99c4da9fc2a ("pci: mark devices partially unplugged")
>
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> Reviewed-by: Ani Sinha <ani@anisinha.ca>
> Message-Id: <20211118133225.324937-4-lvivier@redhat.com>
> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ------------------
> The purpose is for detecting the end of the PCI device hot-unplug. However,
Unplug is an asynchronous process, and issuing multiple unplug requests
while waiting for a 'not found' error is hardly a sane way to detect that
the device has been unplugged.
Instead of swamping the guest with unplug requests (which lead to hw
interrupts), you should wait for the DEVICE_DELETED QMP event.
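To illustrate the idea, here is a minimal Python sketch of "send device_del
once, then wait for DEVICE_DELETED". It is only a sketch under stated
assumptions: the function names are made up for this example, the QMP
greeting and qmp_capabilities negotiation are assumed to have been done
already, and the monitor is assumed to emit one JSON object per line (the
non-pretty default). A real client would also want a timeout.

```python
import json


def wait_for_device_deleted(reader, device_id):
    """Read QMP messages until DEVICE_DELETED for device_id shows up.

    reader is any iterable of text lines, e.g. the read side of a QMP
    socket wrapped with sock.makefile("r"). Returns the event dict, or
    None if the stream ends first. (Hypothetical helper, not a QEMU API.)
    """
    for line in reader:
        line = line.strip()
        if not line:
            continue
        msg = json.loads(line)
        # An error reply to our single device_del request.
        if "error" in msg:
            raise RuntimeError(msg["error"].get("desc", "QMP error"))
        # DEVICE_DELETED may also arrive for other (or unnamed) devices,
        # so match on the device id and skip everything else.
        if (msg.get("event") == "DEVICE_DELETED"
                and msg.get("data", {}).get("device") == device_id):
            return msg
    return None


def unplug_and_wait(reader, writer, device_id):
    """Issue device_del exactly once, then block on the completion event."""
    cmd = {"execute": "device_del", "arguments": {"id": device_id}}
    writer.write(json.dumps(cmd) + "\n")
    writer.flush()
    return wait_for_device_deleted(reader, device_id)
```

The point is that device_del is sent only once; the '{"return": {}}' reply
merely acknowledges the request, and only the event says the device is gone.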
> we find the error confusing. How is it possible that a disk "is already in
> the process of unplug" during the first hot-unplug attempt? So far as I
> know, the issue was also encountered by libvirt, but they simply ignored it:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1878659
>
> Hence, a question is: should we have the line below in
> acpi_pcihp_device_unplug_request_cb()?
>
> pdev->qdev.pending_deleted_event = true;
Comment 15 in the above BZ describes how we could get rid of this line,
but also see comment 17.
(In a nutshell, you get the error because the device hasn't been removed yet.)
>
> It would be great if you as the author could give us a few hints.
>
> Thank you very much for your reply!
>
> Sincerely,
>
> Yu Zhang @ Compute Platform IONOS
> 03.04.2023