From: Chen Fan
Subject: Re: [Qemu-devel] [patch v5 11/12] vfio: device may stuck in D3 when doing aer recovery
Date: Thu, 31 Mar 2016 14:55:07 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0

On 03/25/2016 10:22 AM, Alex Williamson wrote:
On Fri, 25 Mar 2016 09:38:09 +0800
Chen Fan <address@hidden> wrote:

On 03/25/2016 06:54 AM, Alex Williamson wrote:
On Wed, 23 Mar 2016 18:12:06 +0800
Cao jin <address@hidden> wrote:
From: Chen Fan <address@hidden>

When an AER error occurs on a physical device, the device may not be
back in D0 for a short time afterwards. If we recover the device too quickly,
it may get stuck in D3 when we force its power state to D0, so we
may need to wait a short time before injecting the error into the guest.

Signed-off-by: Chen Fan <address@hidden>
   hw/vfio/pci.c | 3 +++
   1 file changed, 3 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 25fc095..5216e7f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2658,6 +2658,9 @@ static void vfio_err_notifier_handler(void *opaque)
           msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
+        /* wait a bit to ensure aer device is ready */
+        usleep(2 * 1000);
Where does this number come from?  Why would the device be in D3?  I
don't understand this at all.
Hi Alex,

      When I tested the code in my environment, I found that when I used
the aer-inject module to inject a fake AER error into a device on the host,
QEMU would intermittently print the message "vfio: Unable to power on
device, stuck in D3". When I used gdb to debug vfio_pci_pre_reset, the
problem did not appear, so I suspected a timing race and used usleep()
to wait 2 ms (double the reset time of 1 ms) so that the device is ready.
The root cause still needs to be investigated further.
Yes, it sounds like you need to investigate this further, the delay is
arbitrary and perhaps suggests a race that needs to be fixed
correctly.  Thanks,
Hi Alex,

After investigating the problem, I found that it appears only when the injected error is fatal. In the AER do_recovery path, the host calls reset_link on the root port, which invokes pci_reset_bridge_secondary_bus from aer_root_reset and so resets the bridge and all devices under it. When QEMU receives the AER notification, it propagates the error to the guest, and the guest performs the same recovery. If the guest's `reset_link`, which ends up calling vfio_pci_pre_reset, runs while the host bridge reset is still in progress, the device will probably get stuck in D3.

So I think that after QEMU receives the AER notification, we should wait long enough to ensure the bridge reset has completed. I tested with sleeps of 10 ms or less, and the message "vfio: Unable to power on device, stuck in D3" still appeared, so I think we should sleep 100 ms to make sure the delay is sufficient. I have tested that code 100+ times by injecting AER errors, and the issue no longer appears.
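
For comparison with a fixed 100 ms sleep, here is a rough sketch of a bounded wait that polls the power state and gives up after a timeout. It is not code from the patch or from QEMU: wait_for_d0, read_pmcsr_fn, sleep_ms and fake_read_pmcsr are hypothetical names, and the register read is abstracted behind a callback so the sketch builds on its own. The only hardware fact it relies on is that bits 1:0 of the PCI power management control/status register hold the D-state (0 = D0, 3 = D3hot).

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* D-state field of the PCI power management control/status register. */
#define PM_CTRL_STATE_MASK 0x0003u  /* bits 1:0 */
#define PM_STATE_D0        0x0000u
#define PM_STATE_D3HOT     0x0003u

/* Hypothetical callback: returns the device's current PMCSR value. */
typedef uint16_t (*read_pmcsr_fn)(void *opaque);

static void sleep_ms(unsigned ms)
{
    struct timespec ts = {
        .tv_sec = ms / 1000,
        .tv_nsec = (long)(ms % 1000) * 1000000L,
    };
    nanosleep(&ts, NULL);
}

/*
 * Poll every step_ms until the device reports D0 or timeout_ms has
 * been spent sleeping. Returns true if D0 was seen, false on timeout.
 */
static bool wait_for_d0(read_pmcsr_fn read_pmcsr, void *opaque,
                        unsigned timeout_ms, unsigned step_ms)
{
    unsigned waited = 0;

    assert(read_pmcsr != NULL);
    if (step_ms == 0) {
        step_ms = 1;  /* avoid spinning forever without progress */
    }

    for (;;) {
        uint16_t pmcsr = read_pmcsr(opaque);

        if ((pmcsr & PM_CTRL_STATE_MASK) == PM_STATE_D0) {
            return true;
        }
        if (waited >= timeout_ms) {
            return false;
        }
        sleep_ms(step_ms);
        waited += step_ms;
    }
}

/*
 * Simulated device for illustration only: reports D3hot for the next
 * *remaining reads, then D0.
 */
static uint16_t fake_read_pmcsr(void *opaque)
{
    int *remaining = opaque;

    if (*remaining > 0) {
        (*remaining)--;
        return PM_STATE_D3HOT;
    }
    return PM_STATE_D0;
}
```

Calling wait_for_d0(fake_read_pmcsr, &n, 100, 1) with n = 3 returns true after three 1 ms sleeps, while the same call with a 5 ms timeout and n = 1000 returns false. Whether polling the power state is actually enough while the host bridge reset is still in progress is an open question here; the sketch only shows the shape of a bounded wait, and the 100 ms upper bound is the value tested above.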
