qemu-devel

Re: [Qemu-devel] vfio-pci issues with multiple devices on the same root


From: Peter Lieven
Subject: Re: [Qemu-devel] vfio-pci issues with multiple devices on the same root port
Date: Mon, 15 Dec 2014 16:22:44 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0

On 15.12.2014 16:11, Alex Williamson wrote:
On Sat, 2014-12-13 at 21:43 +0100, Peter Lieven wrote:
On 12.12.2014 at 23:21, Alex Williamson wrote:
On Fri, 2014-12-12 at 22:38 +0100, Peter Lieven wrote:
Hi,

we have a Cisco UCS infrastructure with fnic Fibre Channel adapters
that we expose to guests. The UCS infrastructure allows creating virtual
HBAs that can be exposed to a host, so it's possible to have quite a lot
of them.

We ran into a strange issue when we started having more than one vServer
with a Fibre Channel adapter passed through with vfio-pci.

When a hypervisor shuts it down, the kernel sees the following error:

  pcieport 0000:00:07.0: AER: Uncorrected (Non-Fatal) error received: id=0038
  pcieport 0000:00:07.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0038(Receiver ID)
  pcieport 0000:00:07.0:   device [8086:340e] error status/mask=00200000/00100000
  pcieport 0000:00:07.0:    [21] Unknown Error Bit (First)
  pcieport 0000:00:07.0: broadcast error_detected message
  pcieport 0000:00:07.0: AER: Device recovery failed

Bit 21 seems to be ACS Violation, and 0000:00:07.0 is the PCIe root port
on that system.
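
The status/mask pair in the log can be decoded by hand. The following is a minimal sketch, assuming the bit layout of the AER Uncorrectable Error Status register as defined by the PCIe specification (the table is abridged, and the names `UNCOR_BITS` and `decode_uncor` are made up for this illustration):

```python
# Abridged bit table for the AER Uncorrectable Error Status/Mask/Severity
# registers (all three share the same bit layout).
UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_uncor(value):
    """Return the names of all bits set in a 32-bit AER uncorrectable value."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(UNCOR_BITS.get(bit, "bit %d (not in table)" % bit))
    return names


if __name__ == "__main__":
    # Values taken from the "error status/mask=00200000/00100000" log line.
    print("status:", decode_uncor(0x00200000))
    print("mask:  ", decode_uncor(0x00100000))
```

Under that layout the status value has only bit 21 set, while the mask value has only bit 20 (Unsupported Request) set, so the ACS Violation is currently unmasked.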

This wouldn't be a big problem, although I would like to find out what
causes the ACS Violation.

The real problem is that all other vfio-pci cards on that root port get 
notified of this error and the connected vServers are suspended
with RUN_STATE_INTERNAL_ERROR.

Any ideas for working around this, other than hacking qemu to not register
an error handler or modifying vfio_err_notifier_handler to not suspend the
vServer?

You could set bit 21 in the AER uncorrected error mask register to avoid
the root port signaling the error.  Is bit 21 already clear in the
severity register to make this non-fatal?
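
As a rough illustration of that suggestion, the sketch below computes the mask value with bit 21 added and checks a severity value for the same bit. It assumes the usual AER register semantics (a set mask bit suppresses reporting, a set severity bit marks the error as fatal); the helper names are hypothetical, and the severity value of the root port is not given in this thread, so the one used here is only an example input:

```python
ACS_VIOLATION_BIT = 21  # bit position discussed in this thread


def with_bit_set(reg, bit):
    """Return the 32-bit register value with the given bit set."""
    return (reg | (1 << bit)) & 0xFFFFFFFF


def is_fatal(severity, bit):
    """A set bit in the severity register marks that error as fatal."""
    return bool(severity & (1 << bit))


if __name__ == "__main__":
    current_mask = 0x00100000  # mask value from the log above
    new_mask = with_bit_set(current_mask, ACS_VIOLATION_BIT)
    print("new uncorrectable mask: 0x%08x" % new_mask)

    example_severity = 0x00000000  # assumed example, not read from hardware
    print("bit 21 fatal:", is_fatal(example_severity, ACS_VIOLATION_BIT))
```

The resulting value would still have to be written to the Uncorrectable Error Mask register of the root port's AER capability by some other means; the sketch only does the arithmetic.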

Is it correct that all children of a root port are notified? Should qemu
distinguish between fatal and non-fatal errors when suspending a vServer?

Yes, each child is notified.  QEMU only gets an eventfd signal, which is
supposed to occur only for fatal errors.  I don't quite understand why
this apparently non-fatal error is getting through.  The kernel-side
VFIO code is where filtering of fatal vs non-fatal should occur.

Had a look at vfio-pci.c from master. I can't see where there is any
filtering of fatal vs. non-fatal.

I'm under the impression that fatal vs non-fatal would be determined
somewhere in the PCI layers and the driver would only be notified for
uncorrected/fatal.  Are we missing that filtering?  Thanks,

As far as I understand, vfio_pci_aer_err_detected in
drivers/vfio/pci/vfio_pci.c is called to recover from potentially
recoverable errors, and the driver decides whether the error was
recoverable through its return code.
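
To make the filtering question concrete, here is a toy model of an error_detected-style callback. It is not the vfio code: it assumes, following the kernel's PCI error recovery documentation, that non-fatal errors are reported with a "normal" channel state and fatal ones with a "frozen" state, and the filtering itself is a hypothetical behavior. All names are invented for the illustration:

```python
from enum import Enum


class ChannelState(Enum):
    NORMAL = "normal"              # non-fatal error, I/O still possible
    FROZEN = "frozen"              # fatal error, I/O blocked
    PERM_FAILURE = "perm_failure"  # device is considered lost


class Result(Enum):
    CAN_RECOVER = "can_recover"
    DISCONNECT = "disconnect"


def err_detected(state, notify):
    """Hypothetical handler: signal user space only for fatal states."""
    if state is ChannelState.PERM_FAILURE:
        notify()
        return Result.DISCONNECT
    if state is ChannelState.FROZEN:
        notify()
    # Non-fatal errors fall through without waking up user space.
    return Result.CAN_RECOVER


if __name__ == "__main__":
    signals = []
    for state in ChannelState:
        result = err_detected(state, lambda: signals.append(state))
        print(state.name, "->", result.name)
    print("states that signalled:", [s.name for s in signals])
```

With a filter like this, the non-fatal case would return without signalling the eventfd, so the guests sharing the root port would keep running.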

Peter



