From: Alex Williamson
Subject: Re: [Qemu-devel] [for-4.2 PATCH 2/2] hw/i386: AMD-Vi IVRS DMA alias support
Date: Mon, 29 Jul 2019 13:04:41 -0600

On Mon, 29 Jul 2019 16:26:46 +0800
Peter Xu <address@hidden> wrote:

> On Fri, Jul 26, 2019 at 06:55:53PM -0600, Alex Williamson wrote:
> > When we account for DMA aliases in the PCI address space, we can no
> > longer use a single IVHD entry in the IVRS covering all devices.  We
> > instead need to walk the PCI bus and create alias ranges when we find
> > a conventional bus.  These alias ranges cannot overlap with a "Select
> > All" range (as currently implemented), so we also need to enumerate
> > each device with IVHD entries.
> > 
> > Importantly, the IVHD entries used here include a Device ID, which is
> > simply the PCI BDF (Bus/Device/Function).  The guest firmware is
> > responsible for programming bus numbers, so the final revision of this
> > table depends on the update mechanism (acpi_build_update) to be called
> > after guest PCI enumeration.  
> 
> Ouch... so the ACPI build procedure runs after the guest PCI code!
> Could I ask how you found this? :) It certainly seems much easier
> this way...

I believe this is what MST was referring to with the MCFG update:
acpi_build() is called from both acpi_setup() and acpi_build_update(),
and the latter is set up in numerous places to be called via a
mechanism that re-writes the tables on certain guest actions.  For
instance, acpi_add_rom_blob() passes this function as a callback,
which becomes a select_cb in fw_cfg, such that (aiui) the tables are
updated before the guest reads them.  I added some fprintfs in the PCI
config write path to confirm that the update callback occurs after PCI
enumeration for both SeaBIOS and OVMF.  Since we seem to have other
dependencies on this ordering elsewhere, I don't think the IVRS update
is unique in this regard.
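
To spell out the flow, here's a rough sketch of the update path,
paraphrased from hw/i386/acpi-build.c (approximate, not verbatim;
QEMU headers and types assumed):

static void acpi_build_update(void *opaque)
{
    AcpiBuildState *build_state = opaque;
    AcpiBuildTables tables;

    /* Rebuild at most once per boot; fw_cfg calls this via select_cb
     * when the guest selects the table blob, i.e. after firmware has
     * programmed bus numbers but before it reads the tables. */
    if (!build_state || build_state->patched) {
        return;
    }
    build_state->patched = 1;

    acpi_build_tables_init(&tables);
    acpi_build(&tables, MACHINE(qdev_get_machine())); /* re-walks PCI */
    acpi_ram_update(build_state->table_mr, tables.table_data);
    acpi_build_tables_cleanup(&tables, true);
}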
 
> This looks very nice to me already, though I still have a few
> questions; please see below.
> 
> [...]
> 
> > +    if (object_dynamic_cast(OBJECT(dev), TYPE_PCI_BRIDGE)) {
> > +        PCIBus *sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(dev));
> > +        uint8_t sec = pci_bus_num(sec_bus);
> > +        uint8_t sub = dev->config[PCI_SUBORDINATE_BUS];
> > +
> > +        if (pci_bus_is_express(sec_bus)) {
> > +            /*
> > +             * Walk the bus if there are subordinates, otherwise use a range
> > +             * to cover an entire leaf bus.  We could potentially also use a
> > +             * range for traversed buses, but we'd need to take care not to
> > +             * create both Select and Range entries covering the same device.
> > +             * This is easier and potentially more compact.
> > +             *
> > +             * An example bare metal system seems to use Select entries for
> > +             * root ports without a slot (ie. built-ins) and Range entries
> > +             * when there is a slot.  The same system also only hard-codes
> > +             * the alias range for an onboard PCIe-to-PCI bridge, apparently
> > +             * making no effort to support nested bridges.  We attempt to
> > +             * be more thorough here.
> > +             */
> > +            if (sec == sub) { /* leaf bus */
> > +                /* "Start of Range" IVHD entry, type 0x3 */
> > +                entry = PCI_BUILD_BDF(sec, PCI_DEVFN(0, 0)) << 8 | 0x3;
> > +                build_append_int_noprefix(table_data, entry, 4);
> > +                /* "End of Range" IVHD entry, type 0x4 */
> > +                entry = PCI_BUILD_BDF(sub, PCI_DEVFN(31, 7)) << 8 | 0x4;
> > +                build_append_int_noprefix(table_data, entry, 4);
> > +            } else {
> > +                pci_for_each_device(sec_bus, sec, insert_ivhd, table_data);
> > +            }
> > +        } else {
> > +            /*
> > +             * If the secondary bus is conventional, then we need to create an
> > +             * Alias range for everything downstream.  The range covers the
> > +             * first devfn on the secondary bus to the last devfn on the
> > +             * subordinate bus.  The alias target depends on legacy versus
> > +             * express bridges, just as in pci_device_iommu_address_space().
> > +             * DeviceIDa vs DeviceIDb as per the AMD IOMMU spec.
> > +             */
> > +            uint16_t dev_id_a, dev_id_b;
> > +
> > +            dev_id_a = PCI_BUILD_BDF(sec, PCI_DEVFN(0, 0));
> > +
> > +            if (pci_is_express(dev) &&
> > +                pcie_cap_get_type(dev) == PCI_EXP_TYPE_PCI_BRIDGE) {
> > +                dev_id_b = dev_id_a;
> > +            } else {
> > +                dev_id_b = PCI_BUILD_BDF(pci_bus_num(bus), dev->devfn);
> > +            }
> > +
> > +            /* "Alias Start of Range" IVHD entry, type 0x43, 8 bytes */
> > +            build_append_int_noprefix(table_data, dev_id_a << 8 | 0x43, 4);
> > +            build_append_int_noprefix(table_data, dev_id_b << 8 | 0x0, 4);
> > +
> > +            /* "End of Range" IVHD entry, type 0x4 */
> > +            entry = PCI_BUILD_BDF(sub, PCI_DEVFN(31, 7)) << 8 | 0x4;
> > +            build_append_int_noprefix(table_data, entry, 4);
> > +        }  
> 
> We've implemented similar logic multiple times:
> 
>   - When we want to do DMA (pci_requester_id)
>   - When we want to fetch the DMA address space (the previous patch)
>   - When we fill in the AMD ACPI table (this patch)
> 
> Do you think we can generalize them somehow?  I'm wondering whether
> we could directly fetch the RID in the 2nd/3rd use cases using
> pci_requester_id() (which already exists) and simply use that?

For this patch, I think we could use pci_requester_id() for dev_id_b
above, but we still need to walk the buses and devices, so it really
only simplifies the 'if (pci_is_express...' code block above, and
barely at that since we're at the point in the topology where such a
decision is made.  For the previous patch, pci_requester_id() returns a
BDF and that code is executed before bus numbers are programmed.  We
might still make use of requester_id_cache, but a different interface
would be necessary.  Also note how pci_req_id_cache_get() assumes we're
looking for the requester ID as seen from the root bus while
pci_device_iommu_address_space() is from the bus hosting iommu_fn.
While these are generally the same in practice, it's not required.  I'd
therefore tend to leave that to some future consolidation.  I can
update the comment to include that justification in the previous patch.
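
For reference, the rule both patches apply is the one below.  This is
a hypothetical helper written just to restate it, not code from the
series; it mirrors the dev_id_a/dev_id_b logic quoted above and what
pci_device_iommu_address_space() does when walking up from a
conventional bus:

/* Alias (requester ID) that devices behind a conventional bus take
 * on, as seen upstream of the given bridge (sketch only). */
static uint16_t bridge_alias_devid(PCIDevice *bridge, PCIBus *sec_bus)
{
    if (pci_is_express(bridge) &&
        pcie_cap_get_type(bridge) == PCI_EXP_TYPE_PCI_BRIDGE) {
        /* PCIe-to-PCI bridge: requester ID is secondary bus, devfn 00.0 */
        return PCI_BUILD_BDF(pci_bus_num(sec_bus), PCI_DEVFN(0, 0));
    }
    /* Legacy PCI bridge: transactions take the bridge's own BDF */
    return PCI_BUILD_BDF(pci_bus_num(pci_get_bus(bridge)), bridge->devfn);
}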

> > +    /*
> > +     * A PCI bus walk, for each PCI host bridge, is necessary to create a
> > +     * complete set of IVHD entries.  Do this into a separate blob so that we
> > +     * can calculate the total IVRS table length here and then append the new
> > +     * blob further below.  Fall back to an entry covering all devices, which
> > +     * is sufficient when no aliases are present.
> > +     */
> > +    object_child_foreach_recursive(object_get_root(),
> > +                                   ivrs_host_bridges, ivhd_blob);
> > +
> > +    if (!ivhd_blob->len) {
> > +        /*
> > +         *   Type 1 device entry reporting all devices
> > +         *   These are 4-byte device entries; refer to the spec,
> > +         *   Table 95: IVHD Device Entry Type Codes (4-byte)
> > +         */
> > +        build_append_int_noprefix(ivhd_blob, 0x0000001, 4);
> > +    }  
> 
> Is there a real use case for ivhd_blob->len==0?

It was mostly paranoia, but I believe it's really only an Intel
convention that the PCI host bridge appears as a device on the bus.  It
seems possible that we could have a host bridge with no devices, in
which case we'd insert this select-all entry to make the IVRS valid.
Perhaps in combination with AMD exposing their IOMMU as a device on the
PCI bus this is not really an issue, but it's a trivial safety net.
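
For completeness, the 4-byte IVHD device entry encoding used
throughout the patch is simply the type in the low byte with the
DeviceID (BDF) above it.  A hypothetical helper, just to illustrate
the layout per the AMD IOMMU spec (Table 95); not part of the series:

/* 4-byte IVHD device entry: byte 0 = type, bytes 1-2 = DeviceID (BDF),
 * byte 3 = DTE setting (zero here).  Type 1 with DeviceID 0 means
 * "select all", hence the 0x0000001 fallback above. */
static uint32_t ivhd_entry(uint8_t type, uint16_t devid)
{
    return ((uint32_t)devid << 8) | type;
}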
Thanks,

Alex


