qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining devic


From: Alex Williamson
Subject: Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Date: Wed, 27 Mar 2019 11:01:15 -0600

On Wed, 27 Mar 2019 11:32:55 -0400
"Michael S. Tsirkin" <address@hidden> wrote:

> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
> > Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> > distinguish by devfn & bus between devices in a conventional PCI
> > topology and therefore we cannot assign them separate AddressSpaces.
> > By taking this requester ID aliasing into account, QEMU better matches
> > the bare metal behavior and restrictions, and enables shared
> > AddressSpace configurations that are otherwise not possible with
> > guest IOMMU support.
> > 
> > For the latter case, given any example where an IOMMU group on the
> > host includes multiple devices:
> > 
> >   $ ls  /sys/kernel/iommu_groups/1/devices/
> >   0000:00:01.0  0000:01:00.0  0000:01:00.1
> > 
> > If we incorporate a vIOMMU into the VM configuration, we're restricted
> > that we can only assign one of the endpoints to the guest because a
> > second endpoint will attempt to use a different AddressSpace.  VFIO
> > only supports IOMMU group level granularity at the container level,
> > preventing this second endpoint from being assigned:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> >   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> > 
> > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> > 0000:01:00.1: group 1 used in multiple address spaces
> > 
> > However, when QEMU incorporates proper aliasing, we can make use of a
> > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> > provides the downstream devices with the same AddressSpace, ex:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> >   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1  
> 
> 
> This will break extended address space access though, won't it?
> things like attempts to take PCI Express badnwidth into
> consideration by drivers for E2E credit math will also
> not work ...

Correct.  As I explain in my reply to Peter, we're forcing the IOMMU to
use a common address space for these devices, and the only hack we have
to do that is to introduce a conventional PCI bus.  PCIe-to-PCI bridges
are required to respond with an unsupported request to extended config
space on the downstream devices.  This is part of the loss of utility I
mention below.  Perhaps we can add a couple more layers of hacks if
those sorts of things are important.  PCI-X supports extended config
space, so (without digging through the spec) I assume a PCIe-to-PCIX
bridge could pass extended config space.  Additionally, if we had the
same bridge in reverse mode, for something like this:

[RC]-[PCIe-to-PCIX]-[PCIX-to-PCIe]-[PCIe endpoint]

Then we would have access to link and extended config space while
maintaining the single address space.  Hack^3

> > While the utility of this hack may be limited, this AddressSpace
> > aliasing is the correct behavior for QEMU to emulate bare metal.  
> 
> 
> This one's a valid point.
> 
> > Signed-off-by: Alex Williamson <address@hidden>
> > ---
> >  hw/pci/pci.c |   33 +++++++++++++++++++++++++++++++--
> >  1 file changed, 31 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index 35451c1e9987..38467e676f1f 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -2594,12 +2594,41 @@ AddressSpace 
> > *pci_device_iommu_address_space(PCIDevice *dev)
> >  {
> >      PCIBus *bus = pci_get_bus(dev);
> >      PCIBus *iommu_bus = bus;
> > +    uint8_t devfn = dev->devfn;
> >  
> >      while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> > -        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
> > +        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> > +
> > +        /*
> > +         * Determine which requester ID alias should be used for the device
> > +         * based on the PCI topology.  There are no requester IDs on 
> > convetional
> > +         * PCI buses, therefore we push the alias up to the parent on each 
> > non-
> > +         * express bus.  Which alias we use depends on whether this is a 
> > legacy
> > +         * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the 
> > PCIe-to-
> > +         * PCI bridge spec.  Note that we cannot use pci_requester_id() 
> > here
> > +         * because the resulting BDF depends on the secondary bridge 
> > register
> > +         * programming.  
> 
> Could you clarify this part a bit pls?

We use the same algorithm here as we do to come up with
requester_id_cache on the PCIDevice (modulo stopping at the iommu_fn
bus), but pci_req_id_cache_extract() is only useful once the bridge
secondary bus number register is programmed.  Until then, the bus part
of the BDF is always zero, which means we can't even lookup the PCIBus
from the bus number.  Since we're setting up the IOMMU AddressSpace
during device realize, we cannot presume any bus number programming.
NB. None of the IOMMUs actually seem to care about the PCIBus at this
point, the pointer is simply used as an opaque token for a hash for the
bus, but that also means there's really no good substitute for it
either.

> >  We also cannot lookup the PCIBus from the bus number
> > +         * at this point for the iommu_fn.  Also, requester_id_cache is the
> > +         * alias to the root bus, which is usually, but not necessarily 
> > always
> > +         * where we'll find our iommu_fn.
> > +         */
> > +        if (!pci_bus_is_express(iommu_bus)) {
> > +            PCIDevice *parent = iommu_bus->parent_dev;
> > +
> > +            if (pci_is_express(parent) &&
> > +                pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> > +                devfn = PCI_DEVFN(0, 0);  
> 
> Why 0,0?

See also pci_req_id_cache_get(), this is defined in the PCIe-to-PCI
bridge spec.  Bridges following this spec use a fixed requester ID for
all downstream devices in order to differentiate requests from the
bridge itself vs downstream devices.  pci_req_id_cache_extract() is
where this gets applied for the PCIReqIDCache version.  Thanks,

Alex

> > +                bus = iommu_bus;
> > +            } else {
> > +                devfn = parent->devfn;
> > +                bus = parent_bus;
> > +            }
> > +        }
> > +
> > +        iommu_bus = parent_bus;
> >      }
> >      if (iommu_bus && iommu_bus->iommu_fn) {
> > -        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, 
> > dev->devfn);
> > +        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
> >      }
> >      return &address_space_memory;
> >  }  




reply via email to

[Prev in Thread] Current Thread [Next in Thread]