Re: [Qemu-devel] vfio-pci issues with multiple devices on the same root
From: Peter Lieven
Subject: Re: [Qemu-devel] vfio-pci issues with multiple devices on the same root port
Date: Sat, 13 Dec 2014 21:36:45 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0

On 12.12.2014 at 23:21, Alex Williamson wrote:
> On Fri, 2014-12-12 at 22:38 +0100, Peter Lieven wrote:
>> Hi,
>>
>> we have a Cisco UCS infrastructure where we have fnic Fibre Channel adapters
>> that we expose to guests. The UCS
>> infrastructure allows creating virtual HBAs that can be exposed to a host, so
>> it's possible to have quite a lot of them.
>>
>> We ran into a strange issue when we started having more than one vServer
>> with a Fibre Channel adapter passed
>> through with vfio-pci.
>>
>> When the hypervisor shuts down such a vServer, the kernel logs the following error:
>>
>> pcieport 0000:00:07.0: AER: Uncorrected (Non-Fatal) error received: id=0038
>> pcieport 0000:00:07.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0038(Receiver ID)
>> pcieport 0000:00:07.0: device [8086:340e] error status/mask=00200000/00100000
>> pcieport 0000:00:07.0: [21] Unknown Error Bit (First)
>> pcieport 0000:00:07.0: broadcast error_detected message
>> pcieport 0000:00:07.0: AER: Device recovery failed
>>
>> Bit 21 seems to be ACS Violation, and 0000:00:07.0 is the PCIe root port on
>> that system.
>>
>> This wouldn't be a big problem, although I would like to find out what
>> causes the ACS Violation.
>>
>> The real problem is that all other vfio-pci cards on that root port get
>> notified of this error and the connected vServers are suspended
>> with RUN_STATE_INTERNAL_ERROR.
>>
>> Any ideas to work around this other than hacking qemu to not register an
>> error handler or modifying vfio_err_notifier_handler
>> to not suspend the vServer?
> You could set bit 21 in the AER uncorrected error mask register to avoid
> the root port signaling the error. Is bit 21 already clear in the
> severity register to make this non-fatal?
Can you give me a hint on where to find those registers and how to modify them?
At least the syslog output says non-fatal.
I am not that familiar with PCI internals; I am more of a block layer guy.
>
>> Is it correct that all children of a root port are notified? Should qemu
>> distinguish between fatal and non-fatal errors when
>> suspending a vServer?
> Yes, each child is notified. QEMU only gets an eventfd signal, which is
> supposed to occur only for fatal errors. I don't quite understand why
> this apparently non-fatal error is getting through. The kernel-side
> VFIO code is where filtering of fatal vs non-fatal should occur.
I will look at that. Have there been any bug fixes since 3.13?
Peter
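
For reference, a minimal sketch of the bit arithmetic behind the register suggestion quoted above. It assumes the standard PCIe AER extended capability layout (Uncorrectable Error Status at offset 0x04, Mask at 0x08, Severity at 0x0C from the start of the capability) and uses the status/mask values from the log. The helper names are made up for illustration, and it only computes values; it does not touch any hardware. Writing the resulting mask to the root port would need a tool such as setpci from pciutils.

```python
# Offsets relative to the start of the AER extended capability (PCIe spec).
AER_UNCOR_STATUS = 0x04
AER_UNCOR_MASK = 0x08
AER_UNCOR_SEVERITY = 0x0C

# Bit positions shared by the uncorrectable status, mask and severity registers.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode(value):
    """Return the names of all bits set in an uncorrectable-error register."""
    return [AER_UNCOR_BITS.get(bit, f"bit {bit}")
            for bit in range(32) if (value >> bit) & 1]


def with_bit_masked(mask, bit):
    """Return the mask value with the given error bit additionally masked."""
    return mask | (1 << bit)


def is_fatal(severity, bit):
    """A set bit in the severity register marks that error as fatal."""
    return bool((severity >> bit) & 1)


# Values taken from the "error status/mask=00200000/00100000" log line.
status, mask = 0x00200000, 0x00100000
print(decode(status))                        # which error was reported
print(decode(mask))                          # which errors are already masked
print(f"{with_bit_masked(mask, 21):08x}")    # mask value that also hides bit 21
```

With the logged values, the status decodes to ACS Violation only, the existing mask covers Unsupported Request, and the mask with bit 21 added is 00300000.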