[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] iommu emulation
From: |
Peter Xu |
Subject: |
Re: [Qemu-devel] iommu emulation |
Date: |
Wed, 15 Feb 2017 11:34:52 +0800 |
User-agent: |
Mutt/1.5.24 (2015-08-30) |
On Wed, Feb 15, 2017 at 10:52:43AM +0800, Peter Xu wrote:
[...]
> > >
> > > Then, I *think* above assertion you encountered would fail only if
> > > prev == 0 here, but I still don't quite sure why was that happening.
> > > Btw, could you paste me your "lspci -vvv -s 00:03.0" result in your L1
> > > guest?
> > >
> >
> > Sure. This is from my L1 guest.
>
> Hmm... I think I found the problem...
>
> >
> > address@hidden:~# lspci -vvv -s 00:03.0
> > 00:03.0 Network controller: Mellanox Technologies MT27500 Family
> > [ConnectX-3]
> > Subsystem: Mellanox Technologies Device 0050
> > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > Stepping- SERR+ FastB2B- DisINTx+
> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
> > <MAbort- >SERR- <PERR- INTx-
> > Latency: 0, Cache Line Size: 64 bytes
> > Interrupt: pin A routed to IRQ 23
> > Region 0: Memory at fe900000 (64-bit, non-prefetchable) [size=1M]
> > Region 2: Memory at fe000000 (64-bit, prefetchable) [size=8M]
> > Expansion ROM at fea00000 [disabled] [size=1M]
> > Capabilities: [40] Power Management version 3
> > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > Capabilities: [48] Vital Product Data
> > Product Name: CX354A - ConnectX-3 QSFP
> > Read-only fields:
> > [PN] Part number: MCX354A-FCBT
> > [EC] Engineering changes: A4
> > [SN] Serial number: MT1346X00791
> > [V0] Vendor specific: PCIe Gen3 x8
> > [RV] Reserved: checksum good, 0 byte(s) reserved
> > Read/write fields:
> > [V1] Vendor specific: N/A
> > [YA] Asset tag: N/A
> > [RW] Read-write area: 105 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 253 byte(s) free
> > [RW] Read-write area: 252 byte(s) free
> > End
> > Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
> > Vector table: BAR=0 offset=0007c000
> > PBA: BAR=0 offset=0007d000
> > Capabilities: [60] Express (v2) Root Complex Integrated Endpoint, MSI 00
> > DevCap: MaxPayload 256 bytes, PhantFunc 0
> > ExtTag- RBE+
> > DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
> > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> > MaxPayload 256 bytes, MaxReadReq 4096 bytes
> > DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> > DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not
> > Supported
> > DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
> > Capabilities: [100 v0] #00
>
> Here we have the head of ecap capability as cap_id==0, then when we
> boot the l2 guest with the same device, we'll first copy this
> cap_id==0 cap, then when adding the 2nd ecap, we'll probably encounter
> problem since pcie_find_capability_list() will thought there is no cap
> at all (cap_id==0 is skipped).
>
> Do you want to try this "hacky patch" to see whether it works for you?
>
> ------8<-------
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 332f41d..bacd302 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1925,11 +1925,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>
> }
>
> - /* Cleanup chain head ID if necessary */
> - if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> - pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> - }
> -
> g_free(config);
> return;
> }
> ------>8-------
>
> I don't think it's a good solution (it just used 0xffff instead of 0x0
> for the masked cap_id, then l2 guest would like to co-op with it), but
> it should workaround this temporarily. I'll try to think of a better
> one later and post when proper.
>
> (Alex, please leave comment if you have any better suggestion before
> mine :)
Alex, do you like something like below to fix above issue that Jintack
has encountered?
(note: this code is not for compile, only trying show what I mean...)
------8<-------
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 332f41d..4dca631 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
*/
config = g_memdup(pdev->config, vdev->config_size);
- /*
- * Extended capabilities are chained with each pointing to the next, so we
- * can drop anything other than the head of the chain simply by modifying
- * the previous next pointer. For the head of the chain, we can modify the
- * capability ID to something that cannot match a valid capability. ID
- * 0 is reserved for this since absence of capabilities is indicated by
- * 0 for the ID, version, AND next pointer. However, pcie_add_capability()
- * uses ID 0 as reserved for list management and will incorrectly match and
- * assert if we attempt to pre-load the head of the chain with this ID.
- * Use ID 0xFFFF temporarily since it is also seems to be reserved in
- * part for identifying absence of capabilities in a root complex register
- * block. If the ID still exists after adding capabilities, switch back to
- * zero. We'll mark this entire first dword as emulated for this purpose.
- */
- pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
- PCI_EXT_CAP(0xFFFF, 0, 0));
- pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
- pci_set_long(vdev->emulated_config_bits + PCI_CONFIG_SPACE_SIZE, ~0);
-
for (next = PCI_CONFIG_SPACE_SIZE; next;
next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
header = pci_get_long(config + next);
@@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
switch (cap_id) {
case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */
case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function virtualization */
+ /* keep this ecap header (4 bytes), but mask cap_id to 0xffff */
+ ...
trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, next);
break;
default:
@@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
}
- /* Cleanup chain head ID if necessary */
- if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
- pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
- }
-
g_free(config);
return;
}
----->8-----
Since after all we need the assumption that 0xffff is reserved for
cap_id. Then, we can just remove the "first 0xffff then 0x0" hack,
which is imho error-prone and hacky.
Thanks,
-- peterx
- Re: [Qemu-devel] iommu emulation, Peter Xu, 2017/02/08
- Re: [Qemu-devel] iommu emulation, Jintack Lim, 2017/02/09
- Re: [Qemu-devel] iommu emulation, Peter Xu, 2017/02/14
- Re: [Qemu-devel] iommu emulation, Jintack Lim, 2017/02/14
- Re: [Qemu-devel] iommu emulation, Peter Xu, 2017/02/14
- Re: [Qemu-devel] iommu emulation,
Peter Xu <=
- Re: [Qemu-devel] iommu emulation, Alex Williamson, 2017/02/15
- Re: [Qemu-devel] iommu emulation, Peter Xu, 2017/02/15
- Re: [Qemu-devel] iommu emulation, Alex Williamson, 2017/02/15
- Re: [Qemu-devel] iommu emulation, Jintack Lim, 2017/02/21
- Re: [Qemu-devel] iommu emulation, Jintack Lim, 2017/02/23
- Re: [Qemu-devel] iommu emulation, Jintack Lim, 2017/02/15
- Re: [Qemu-devel] iommu emulation, Alex Williamson, 2017/02/15
- Re: [Qemu-devel] iommu emulation, Jintack Lim, 2017/02/15
- Re: [Qemu-devel] iommu emulation, Alex Williamson, 2017/02/15