qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC] hw/arm/virt: Provide DT binding generation for PCI eXpander Br


From: Jonathan Cameron
Subject: Re: [RFC] hw/arm/virt: Provide DT binding generation for PCI eXpander Bridges
Date: Mon, 24 Apr 2023 22:56:26 +0100

On Mon, 24 Apr 2023 16:45:48 +0100
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Mon, 24 Apr 2023 at 16:41, Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > On Mon, 24 Apr 2023 10:28:30 +0100
> > Peter Maydell <peter.maydell@linaro.org> wrote:  
> > > So, not knowing anything about CXL, my naive question is:
> > > this is PCI, why can't we handle it the way we handle everything
> > > else PCI, i.e. present the PCI controller in the DTB and it's
> > > the guest kernel's job to probe, enumerate, etc whatever it wants ?  
> >
> > Absolutely the kernel will still do the enumeration.  But there's a problem
> > - it won't always work, unless there is 'enough room'.
> >
> > If the aim is to do it in a similar fashion to how PCI Expander Bridges 
> > (PXB)
> > are handled today with ACPI (we could look at doing something different of 
> > course)
> >
> > We have one set of memory windows for assigning PCI bars into etc. In the 
> > model
> > used for PXB, that set of resources is shared between different host bridges
> > including the main one (each one has separate DT description).
> >
> > So for virt, VIRT_PCIE_MMIO, VIRT_PCIE_IO, VIRT_PCIE_ECAM, 
> > VIRT_HIGH_PCIE_ECAM
> > and VIRT_HIGH_PCIE_MMIO are split up between the host bridges.
> > Each host bridge has it's own DT description.
> >
> > For ACPI, how to split up available memory windows up depends on the 
> > requirements
> > of the PCIe devices below the host bridges.  For ACPI we 'know' the answer
> > as to what is required at the point of creating the description because EDK2
> > did the work for us.  For DT we currently don't.  What to do about that is 
> > the
> > question this RFC tries to address.
> >
> > In theory we could change the kernel to support enumerating PXB instances, 
> > but
> > that's a very different model from today where they are just 'standard'
> > host bridges, using the generic bindings for ACPI (and after this patch for 
> > DT).  
> 
> On the other hand, having QEMU enumerate PCI devices is *also* a
> very different model from today, where we assume that the guest
> code is the one that is going to deal with enumerating PCI devices.
> To my mind one of the major advantages of PCI is exactly that it
> is entirely probeable and discoverable, so that there is no need
> for the dtb to include a lot of information that the kernel can
> find out for itself by directly asking the hardware...

I absolutely agree that QEMU enumerating PCI seem silly level of complexity
to introduce. So easy route is to just use the bus numbers to partition
resources. We have those available without any complexity. It's not the
same as how it's done with ACPI, but then the alternatives are either
(though maybe they are closer).  Note current proposed algorithm may be
too simplistic (perhaps some alignment forcing should adjust the division
of the resources to start on round number addresses)

More complex route will be to push the whole problem to the kernel.
I doubt we'll get any agreement to add this to the generic host
bridge DT bindings given it's just to support QEMU specific host bridges
which the kernel has previously had no knowledge of beyond as generic
PCIe Host bridges. So we'd need QEMU to detect the presence of PXB
instances then use a new DT compatible for the main host bridge
and kick off a new type of host bridge discovery.
Probably also need to modify EDK2 to recognize this, which will give
us compatibility problems with existing ACPI usage (EDK2 is reading
the DT bindings so kick off it's PCIe enumeration so this gets fiddly)
May also have problems with the kernel having to do late discovery of
host bridges (not tried it but sounds 'interesting'.)

Anyhow, I promised some illustrative examples.

For reference, here's one of my test configurations
(slightly pathological as it exercises some of the heuristics in enumeration)

-+-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
 |           +-01.0-[01]--
 |           +-02.0  Red Hat, Inc. Virtio block device
 |           +-03.0  Red Hat, Inc. Virtio network device
 |           +-04.0  Red Hat, Inc. QEMU PCIe Expander bridge
 |           \-05.0  Red Hat, Inc. QEMU PCIe Expander bridge
 +-[0000:0c]---00.0-[0d-0f]--+-00.0-[0e-0f]----00.0-[0f]----00.0  Red Hat, Inc. 
Virtio RNG
 |                           \-00.1  Red Hat, Inc. Virtio RNG
 \-[0000:30]---00.0-[31]----00.0  Red Hat, Inc. Virtio RNG

It's somewhat easier to see the current ACPI results vs what this
patch proposed for DT by looking at /proc/iomem than looking at the
firmware descriptions themselves. /proc/iomem includes
the firmware described host bridge windows and the PCI buses and devices
enumerated below them (EDK2 enumerated for ACPI, Linux kernel for DT)
There are some other subtle differences of how the memory is mapped, but
hopefully you can see the relationship between what happens with ACPI
(requiring enumeration) and the proposal in this RFC for DT which doesn't.

Relevant chunks of /proc/iomem for ACPI for 32 bit region (64 bit one similar.)
//FW described main HB
10000000-103fffff : PCI Bus 0000:00
  //Everything indented discovered by enumeration of PCI bus.
  10000000-101fffff : PCI Bus 0000:01
  10200000-1023ffff : 0000:00:03.0
  10240000-10240fff : 0000:00:01.0
  10241000-10241fff : 0000:00:02.0
  10242000-10242fff : 0000:00:03.0
//FW described. 1st PXB on bus 0xc
//Key is start is defined as previous enumerated device end + some heuristic 
driven padding.
10400000-10700fff : PCI Bus 0000:0c
  10400000-105fffff : PCI Bus 0000:0d
    10400000-10400fff : PCI Bus 0000:0e
      10400000-10400fff : PCI Bus 0000:0f
        10400000-10400fff : 0000:0f:00.0
    10401000-10401fff : 0000:0d:00.1
  10600000-10600fff : 0000:0c:00.0
//FW described spare bit assigned to main HB
10701000-107fffff : PCI Bus 0000:00
//FW described. 2nd PXB on bus 0x30
//Key is start is defined as previous enumerated device end + some heuristic 
driven padding.
10800000-10a00fff : PCI Bus 0000:30
  10800000-108fffff : PCI Bus 0000:31
    10800000-10800fff : 0000:31:00.0
  10900000-10900fff : 0000:30:00.0
10a01000-3efeffff : PCI Bus 0000:00

DT version of equivalent from this patch.  Mem split up based only bus numbers.
//FW described.
10000000-1233f3ff : pcie@10000000
  //Everything indented is enumerated by the linux kernel
  10000000-101fffff : PCI Bus 0000:01
  10200000-1023ffff : 0000:00:03.0
  10240000-10240fff : 0000:00:01.0
  10241000-10241fff : 0000:00:02.0
  10242000-10242fff : 0000:00:03.0
//lots of space here

//FW described 1st PXB - note way above end of previous enumerated region.
1233f400-18cfcfff : pcie@1233f400
  12340000-12340fff : 0001:0c:00.0
  12400000-126fffff : PCI Bus 0001:0d
    12400000-125fffff : PCI Bus 0001:0e
      12400000-125fffff : PCI Bus 0001:0f
        12400000-12400fff : 0001:0f:00.0
  12600000-12600fff : 0001:0d:00.1
//lots of space here..

//FW described 2nd PXB - note way above en of previous enumerated region.
18cfd000-3efeffff : pcie@18cfd000
  18cfd000-18cfdfff : 0002:30:00.0
  18d00000-18efffff : PCI Bus 0002:31
    18d00000-18d00fff : 0002:31:00.0
//lots of space here.


Note the main bridge CRS and hence /proc/iomem entries have entries after
the PXBs because QEMU provides the left over bits
of the window, even though there is no easy way to use them. For the DT
scheme they end up in the space for which ever PXB has the highest bus
number.

In this case we have masses of room in the DT scheme as all the devices
have small bars. I have test cases that fail to have enough room
but those are even worse to read and require hacks to the PCI test device
driver.

I'm not great at reading DT binary from sysfs so some dt version of this
ripped from /proc/iomem above rather than directly

//Only first entry of each _CRS type is actually useful.
//In DT node: pcie@10000000
Device (PCI0)
{
    Name (_HID, "PNP0A08" /* PCI Express Bus */)  // _HID: Hardware ID
    Name (_CID, "PNP0A03" /* PCI Bus */)  // _CID: Compatible ID
    Name (_SEG, Zero)  // _SEG: PCI Segment
    Name (_BBN, Zero)  // _BBN: BIOS Bus Number
...
    Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
    {
//Bus numbers 0 to B :
// DT bus-range = [0, b]
        WordBusNumber (ResourceProducer, MinFixed, MaxFixed, PosDecode,
            0x0000,             // Granularity
            0x0000,             // Range Minimum
            0x000B,             // Range Maximum
            0x0000,             // Translation Offset
            0x000C,             // Length
            ,, )
//First part to cover devices enumeratd + space from heuristics (bit of padding)
// DT ranges first entry 10000000-1233f3ff : pcie@10000000
        DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, 
NonCacheable, ReadWrite,
            0x00000000,         // Granularity
            0x10000000,         // Range Minimum
            0x103FFFFF,         // Range Maximum
            0x00000000,         // Translation Offset
            0x00400000,         // Length
            ,, , AddressRangeMemory, TypeStatic)
//A hole from enumeration of the PXBs.  Doesn't exist in DT form...
        DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, 
NonCacheable, ReadWrite,
            0x00000000,         // Granularity
            0x10701000,         // Range Minimum
            0x107FFFFF,         // Range Maximum
            0x00000000,         // Translation Offset
            0x000FF000,         // Length
            ,, , AddressRangeMemory, TypeStatic)
//Left over above PXBs. Doesn't exist in DT form where we assign 'the rest' to
//the PXB with the highest bus number.
        DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, 
NonCacheable, ReadWrite,
            0x00000000,         // Granularity
            0x10A01000,         // Range Minimum
            0x3EFEFFFF,         // Range Maximum
            0x00000000,         // Translation Offset
            0x2E5EF000,         // Length
            ,, , AddressRangeMemory, TypeStatic)
//First part to cover devices enumerated (+ space from heuristics)
//I left whole range for IO on DT version.  Don't really care about this being 
available
//for modern PCIe devices below PXBs anyway.
        DWordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode, EntireRange,
            0x00000000,         // Granularity
            0x00000000,         // Range Minimum
            0x00000FFF,         // Range Maximum
            0x3EFF0000,         // Translation Offset
            0x00001000,         // Length
            ,, , TypeStatic, DenseTranslation)
//Left over of original window - no used by anything
        DWordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode, EntireRange,
            0x00000000,         // Granularity
            0x00003000,         // Range Minimum
            0x0000FFFF,         // Range Maximum
            0x3EFF0000,         // Translation Offset
            0x0000D000,         // Length
            ,, , TypeStatic, DenseTranslation)
//First part to cover the devices enumerated.
// DT proportional version: 8000000000-85ffffffff : pcie@10000000
// Note that in this example all the devices have small BARs so with enumeration
// they don't need as much space as I gave with DT version.
        QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, 
NonCacheable, ReadWrite,
            0x0000000000000000, // Granularity
            0x0000008000000000, // Range Minimum
            0x00000080000FFFFF, // Range Maximum
            0x0000000000000000, // Translation Offset
            0x0000000000100000, // Length
            ,, , AddressRangeMemory, TypeStatic)
//Left over of original window - not used by anything.
        QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, 
NonCacheable, ReadWrite,
            0x0000000000000000, // Granularity
            0x0000008000400000, // Range Minimum
            0x000000FFFFFFFFFF, // Range Maximum
            0x0000000000000000, // Translation Offset
            0x0000007FFFC00000, // Length
            ,, , AddressRangeMemory, TypeStatic)
    })
...

Device (PC0C)
{
    Name (_HID, "PNP0A08" /* PCI Express Bus */)  // _HID: Hardware ID
    Name (_CID, "PNP0A03" /* PCI Bus */)  // _CID: Compatible ID
    Name (_BBN, 0x0C)  // _BBN: BIOS Bus Number
    Name (_UID, 0x0C)  // _UID: Unique ID
... 
    Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
    {
// IO space big enough for enumerated devices on first PXB
// For DT I gave no IO space to PXBs for now.
        DWordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode, EntireRange,
            0x00000000,         // Granularity
            0x00001000,         // Range Minimum
            0x00001FFF,         // Range Maximum
            0x3EFF0000,         // Translation Offset
            0x00001000,         // Length
            ,, , TypeStatic, DenseTranslation)
//Lower mem region for enumerated devices on first PXB
//DT : 1233f400-18cfcfff : pcie@1233f400
        DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, 
NonCacheable, ReadWrite,
            0x00000000,         // Granularity
            0x10400000,         // Range Minimum
            0x10700FFF,         // Range Maximum
            0x00000000,         // Translation Offset
            0x00301000,         // Length
            ,, , AddressRangeMemory, TypeStatic)
//High mem region for enumerated devices on first PXB
//DT: 8600000000-97ffffffff : pcie@1233f400
        QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, 
NonCacheable, ReadWrite,
            0x0000000000000000, // Granularity
            0x0000008000100000, // Range Minimum
            0x00000080002FFFFF, // Range Maximum
            0x0000000000000000, // Translation Offset
            0x0000000000200000, // Length
            ,, , AddressRangeMemory, TypeStatic)
//PXB covers bus numbers 0xc to 0xf
        WordBusNumber (ResourceProducer, MinFixed, MaxFixed, PosDecode,
            0x0000,             // Granularity
            0x000C,             // Range Minimum
            0x000F,             // Range Maximum
            0x0000,             // Translation Offset
            0x0004,             // Length
            ,, )
    })
Device (PC30)
{
    Name (_HID, "PNP0A08" /* PCI Express Bus */)  // _HID: Hardware ID
    Name (_CID, "PNP0A03" /* PCI Bus */)  // _CID: Compatible ID
    Name (_BBN, 0x30)  // _BBN: BIOS Bus Number
    Name (_UID, 0x30)  // _UID: Unique ID
    Name (_STR, Unicode ("pxb Device"))  // _STR: Description String
    Name (_CCA, One)  // _CCA: Cache Coherency Attribute
    Name (_PRT, Package (0x80)  // _PRT: PCI Routing Table
    Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
    {
//IO space for 2nd PXB - punched out of the main bus space.
//DT: Again no IO space.
        DWordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode, EntireRange,
            0x00000000,         // Granularity
            0x00002000,         // Range Minimum
            0x00002FFF,         // Range Maximum
            0x3EFF0000,         // Translation Offset
            0x00001000,         // Length
            ,, , TypeStatic, DenseTranslation)
//low mem space for 2nd PXB punched out of main host bridge space.
// DT: 18cfd000-3efeffff : pcie@18cfd000
        DWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, 
NonCacheable, ReadWrite,
            0x00000000,         // Granularity
            0x10800000,         // Range Minimum
            0x10A00FFF,         // Range Maximum
            0x00000000,         // Translation Offset
            0x00201000,         // Length
            ,, , AddressRangeMemory, TypeStatic)#
//high mem space for 2nd PXB punched out main host bridge space.
// DT: 9800000000-ffffffffff : pcie@18cfd000
        QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, 
NonCacheable, ReadWrite,
            0x0000000000000000, // Granularity
            0x0000008000300000, // Range Minimum
            0x00000080003FFFFF, // Range Maximum
            0x0000000000000000, // Translation Offset
            0x0000000000100000, // Length
            ,, , AddressRangeMemory, TypeStatic)
// Just 2 bus numbers (directly connected device)
        WordBusNumber (ResourceProducer, MinFixed, MaxFixed, PosDecode,
            0x0000,             // Granularity
            0x0030,             // Range Minimum
            0x0031,             // Range Maximum
            0x0000,             // Translation Offset
            0x0002,             // Length
            ,, )
        })

Maybe this mess will help illustrate the issues and approaches.

I personally think the rough approach of this patch is most sensible.
Don't enumerate the bus at all for DT case, just do partitioning based
on bus numbers.  In corner cases that will mean the kernel enumeration
fails when things would have worked with current ACPI / EDK2 dance
to allow enumeration before the ACPI tables the kernel uses are written.

Jonathan

> 
> thanks
> -- PMM




reply via email to

[Prev in Thread] Current Thread [Next in Thread]