From: Cao jin
Subject: [Qemu-devel] [PATCH 0/3] vfio-pci: support recovery of AER non fatal error
Date: Mon, 27 Feb 2017 15:30:24 +0800
This is a nearly new design of the feature, so the version is re-numbered from 0.
About the test:
The hardware problem (instability) still occurs as before. The test server is
at site A in another country, and my contact in that country is at site B, so
it is not convenient to get help (plugging in a cable or checking the hardware).
So my NIC (which has 2 functions) still has only func1 connected to the gateway.
If anyone else who has the hardware could test the patches, that would be a
great help.
Basically, there are two symptoms of the unstable hardware:
1. Start the vm; the hardware emits a fatal error by itself before I do
anything, which causes the vm to stop.
2. Start the vm, assign an IP to func1, then ping the gateway; it shows
"Destination Host Unreachable" after dozens or hundreds of successful
pings, and guest dmesg shows nothing abnormal. I think this symptom is
*strong evidence* of unstable hardware; I speculate that the cable has
a problem.
On the other hand, I also saw a perfect result 2 times in my numerous tests,
with just func1 assigned while func0 had no user. It could ping for several
hours (more than 15000 pings) without any problem; during that period, I
injected non fatal errors to func0 & func1, and error recovery was very good.
So, most of the time, I must do the test quickly before the hardware goes
crazy, until I get what I expected.
Test:
scenario 1: assign func1 to vm while func0 has no user.
scenario 2: assign both functions to 1 vm, with the same topology as host.
scenario 3: assign both functions to 1 vm, under different bus.
scenario 4: assign each function to a separate vm.
The steps are: assign an IP to func1, ping the gateway, inject a non fatal
error to both functions, and see if func1 can still ping after recovery.
Although we don't have a cable for func0, in a test like scenario 4,
injecting to func0 doesn't affect func1's recovery, so I think it shows
that one function's recovery doesn't affect the other's.
Extra info FYI:
1. During the test, some debug lines were added in vfio_err_notifier_handler
to read the uncor status register in this function when a fatal error
occurred; it shows all F's every time.
2. Based on the v10 patch & the corresponding kernel part, modified as per
the comments: revert the eventfd handling (don't signal uncor status), and
guest link reset will induce the host link reset. The test result shows:
non fatal error recovery is good; fatal error recovery has the same result
as what Alex found before (guest kernel crash), because the guest device
driver's error_detected() accesses the MMIO registers and gets all F's.
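The all F's reads in the two items above can be told apart from genuine
status bits with a check like the one below. This is only a minimal sketch,
not code from the patches: classify_uncor(), its enum and the self-check
helper are hypothetical names, and it assumes the caller has already read the
AER Uncorrectable Error Status and Uncorrectable Error Severity registers by
some other means. It relies on two things from the PCIe AER definition: a set
severity bit marks the matching error as fatal, and a read of all ones is what
a device returns when it is no longer reachable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification of one uncorrectable-status read. */
enum aer_uncor_class {
    AER_NONE,            /* no uncorrectable error bit is set */
    AER_NONFATAL,        /* only errors configured as non fatal are set */
    AER_FATAL,           /* at least one set error is configured as fatal */
    AER_DEV_UNREACHABLE  /* read returned all F's, device not accessible */
};

/*
 * status:   value read from the Uncorrectable Error Status register
 * severity: value read from the Uncorrectable Error Severity register
 *           (bit set = that error is fatal, bit clear = non fatal)
 */
static enum aer_uncor_class classify_uncor(uint32_t status, uint32_t severity)
{
    if (status == UINT32_MAX) {
        /* All F's: the read itself failed, so the bits mean nothing. */
        return AER_DEV_UNREACHABLE;
    }
    if (status & severity) {
        return AER_FATAL;
    }
    if (status & ~severity) {
        return AER_NONFATAL;
    }
    return AER_NONE;
}

/* Usage example with made-up register values (assumptions, not test logs). */
static void classify_uncor_selfcheck(void)
{
    assert(classify_uncor(0xFFFFFFFFu, 0x00000000u) == AER_DEV_UNREACHABLE);
    assert(classify_uncor(0x00004000u, 0x00000000u) == AER_NONFATAL);
    assert(classify_uncor(0x00000010u, 0x00000010u) == AER_FATAL);
}
```

Checking for all F's first matters because such a value also has every
severity-selected bit set and would otherwise be reported as a plain fatal
error.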
Cao jin (3):
pcie aer: verify if AER functionality is available
vfio pci: new function to init AER capability
vfio-pci: process non fatal error of AER
hw/pci/pcie_aer.c | 28 +++++++
hw/vfio/pci.c | 180 +++++++++++++++++++++++++++++++++++++++++++--
hw/vfio/pci.h | 3 +
linux-headers/linux/vfio.h | 1 +
4 files changed, 207 insertions(+), 5 deletions(-)
--
1.8.3.1