qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Qemu-devel] [PATCH v10 00/10] vfio-pci: pass the aer error to guest


From: Cao jin
Subject: [Qemu-devel] [PATCH v10 00/10] vfio-pci: pass the aer error to guest
Date: Sun, 27 Nov 2016 19:32:23 +0800

After such a long time silence, finally we have a really workable version.
It has quite a few difference from previous, so does kernel driver parts.
I will try to describe the big changes(and reason) as following. Also still
have some unsure points need comments, and left some debug lines in the patch,
so it is also a RFC version.

FYI:
1. v9 triggered a bug of igb driver which causes guest kernel oops, actually
   it doesn't have any relationship with aer patchset, we just trigger it.
   Now oops issue has been fixed: http://patchwork.ozlabs.org/patch/692171/

v10 changelog:
1. Many comments & commit message rephrase, and minor changes,
   not list them here.

2. Rebase on Eric Auger's "Convert VFIO-PCI to realize" patchset.

3. Introduce a new patch 3 to support configurable AER cap version.

4. Drop vfio_pci_host_match_slot() in patch 6, because host_match
   implies slot match

5. Change what eventfd signals(check vfio driver patch). Before, vfio-pci
   driver add 1 to the counter, which doesn't have meaning, just for
   notification. Now, vfio-pci's error detected handler read the uncorrectable
   error status and signal it to qemu.
   Why? When testing previous version(host aer driver still reset link), found
   that there is quite possibility that register reading returns the invalid
   value 0xFFFFFFFF, which results in all(2 in my environment) qemu's vfio-pci
   function send AER to root port while I only inject error into one function.
   This is unreasonable, and I think it is the result of reading during device
   reset. Previous patch does considered to find the real function who has
   error happened, but won't work under the situation I found. So move the
   register reading as early as possible would be the nature thoughts, and it
   is moved into vfio-pci driver. Although now reset_link is disabled in aer
   driver, get the uncor error status as early as possible still make sense,
   for example: if user space driver does device reset once receiving the
   error notification, and then read register.

6. Drop the ugly patch which does "wait 3s to do guest reset". There are
   several consideration in it.

   1st, previous patch use a flag in function 0 to indicate that the functions
   under the virtual bus got error notification, then do bus reset rashly, as
   the response to guest reset_link. This is unreasonable. In physical world,
   set bridge's secondary bus reset would send hot-reset TLP to all functions
   below, trigger every device's reset separately. Emulated device should
   behave the same, means using DeviceClass->reset without any logical change.

   2nd, in vfio-pci driver, we use pci_cfg_access_trylock() to prevent config
   access. When signal aer to guest, guest's aer driver will do config space
   access[*], which results in qemu process being scheduled out. Unlock the
   config space in vfio-pci resume handler, then there is no need to wait
   anything.
   [*] find_source_device->find_device_iter->is_error_source

7. For description integrality, I put the changelog of kernel part here.
   The *key issue* of whole functionality is on the reset_link in host
   error recovery.
   On one hand, device-specific driver in guest often will do register
   read/write during its error recovery, but reset_link in host will make the
   result of read/write unpredictable, the guest kernel oops issue is one of
   the results.
   On the other hand, link reset is one big step in whole error recovery. error
   recovery is very device specific thing, so, vfio-pci as the meta driver,
   won't know how to recover naturally. So it is user space driver, or device
   specific driver in guest, who should take the whole responsibility to do the
   error recovery, include link reset.

Unsure points on vfio-pci driver:
1. Previous patch disabled irq notification in its error detected handler, but
   without enabling it again in resume handler, which is wrong. I am not sure
   whether we need to disable it, so I just removed the code.

2. Two new fields: aer_recovering & aer_error_completion of struct
   vfio_pci_device, and the related operation on them, I am not sure we still
   need them, for now, I think maybe we can drop them.

Test:
I test it with intel 82576 NIC, which has 2 functions, function 1 has cable
connected to gateway, function 0 has no link.

1. Linux: guest works well with this patchset. Robustness is good.
   Steps:
   * start the VM with 2 assigned functions
   * assign ip to function 1, add route info, and ping the gateway
   * on host, injecting error to the assigned devices in any combination:
     inject fatal/nonfatal error to func 0/func 1.  Or, repeat injecting
     certain error to certain function quickly. The result meets the expectation

2. Windows: only test on Win7, but recovery don't work, when injecting error,
   Win7 will show the blue screen, and then reboot. I think it is the bug of
   Windows driver, and not this patch's scope to fix it.
   On previous version, Win7 shows blue screen, then reboot, then blue again,
   reboot again, repeatedly, won't stop.


Chen Fan (9):
  vfio: extract vfio_get_hot_reset_info as a single function
  vfio: squeeze out vfio_pci_do_hot_reset to support bus reset
  vfio: new function to init aer cap for vfio device
  vfio: refine function vfio_pci_host_match
  vfio: add check host bus reset is support or not
  pci: introduce function validation check during hotplug
  vfio: check aer functionality for hotplugged device
  vfio-pci: pass the aer error to guest
  vfio: add 'aer' property to expose aercap

Dou Liyang (1):
  pcie_aer: support configurable AER capa version

 hw/net/e1000e.c                    |   3 +-
 hw/pci-bridge/ioh3420.c            |   2 +-
 hw/pci-bridge/xio3130_downstream.c |   2 +-
 hw/pci-bridge/xio3130_upstream.c   |   2 +-
 hw/pci/pci.c                       |  29 ++
 hw/pci/pcie_aer.c                  |   5 +-
 hw/vfio/pci.c                      | 543 ++++++++++++++++++++++++++++++++-----
 hw/vfio/pci.h                      |   4 +
 include/hw/pci/pci.h               |   1 +
 include/hw/pci/pcie_aer.h          |   3 +-
 10 files changed, 516 insertions(+), 78 deletions(-)

-- 
1.8.3.1






reply via email to

[Prev in Thread] Current Thread [Next in Thread]