[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-ppc] [PATCH qemu] spapr_pci: Add numa node id
From: |
Bharata B Rao |
Subject: |
Re: [Qemu-ppc] [PATCH qemu] spapr_pci: Add numa node id |
Date: |
Mon, 19 Sep 2016 09:09:05 +0530 |
User-agent: |
Mutt/1.6.1 (2016-04-27) |
On Wed, Sep 14, 2016 at 07:39:10PM +1000, Alexey Kardashevskiy wrote:
> On 14/09/16 09:29, Michael Roth wrote:
> > Quoting Alexey Kardashevskiy (2016-07-27 03:03:38)
> >> This adds a numa id property to a PHB to allow linking passed PCI device
> >> to CPU/memory. It is up to the management stack to do CPU/memory pinning
> >> to the node with the actual PCI device.
> >
> > It looks like x86 relies on PCIBus->numa_node() method (via
> > pci_bus_numa_node()) to encode similar PCIBus affinities
> > into ACPI tables, and currently exposes it via
> > -device pci-[-express]-expander-bus,numa_node=X.
>
>
>
> Well, until we allow DMA windows per PCI bus (not per PHB as it is now),
> this won't make much sense for us (unless I am missing something here).
>
>
> > Maybe we should implement it using this existing
> > PCIBus->numa_node() interface?
> >
> > We'd still have to expose numa_node as a spapr-pci-host-bridge
> > device option though. Not sure if there's a more common way
> > to expose it that might be easier for libvirt to discover. As it
> > stands we'd need to add spapr-pci-host-bridge to a libvirt
> > whitelist that currently only covers pci-expander-bus.
> >
> > Cc'ing Shiva who was looking into the libvirt side.
> >
> > One comment below:
> >
> >>
> >> Signed-off-by: Alexey Kardashevskiy <address@hidden>
> >> ---
> >> hw/ppc/spapr_pci.c | 13 +++++++++++++
> >> include/hw/pci-host/spapr.h | 2 ++
> >> 2 files changed, 15 insertions(+)
> >>
> >> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >> index 949c44f..af5394a 100644
> >> --- a/hw/ppc/spapr_pci.c
> >> +++ b/hw/ppc/spapr_pci.c
> >> @@ -47,6 +47,7 @@
> >> #include "sysemu/device_tree.h"
> >> #include "sysemu/kvm.h"
> >> #include "sysemu/hostmem.h"
> >> +#include "sysemu/numa.h"
> >>
> >> #include "hw/vfio/vfio.h"
> >>
> >> @@ -1544,6 +1545,7 @@ static Property spapr_phb_properties[] = {
> >> DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
> >> DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
> >> (1ULL << 12) | (1ULL << 16)),
> >> + DEFINE_PROP_UINT32("node", sPAPRPHBState, numa_node, -1),
> >> DEFINE_PROP_END_OF_LIST(),
> >> };
> >>
> >> @@ -1805,6 +1807,11 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >> cpu_to_be32(1),
> >> cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
> >> };
> >> + uint32_t associativity[] = {cpu_to_be32(0x4),
> >> + cpu_to_be32(0x0),
> >> + cpu_to_be32(0x0),
> >> + cpu_to_be32(0x0),
> >> + cpu_to_be32(phb->numa_node)};
> >> sPAPRTCETable *tcet;
> >> PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
> >> sPAPRFDT s_fdt;
> >> @@ -1837,6 +1844,12 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
> >> &ddw_extensions, sizeof(ddw_extensions)));
> >> }
> >>
> >> + /* Advertise NUMA via ibm,associativity */
> >> + if (nb_numa_nodes > 1) {
> >> + _FDT(fdt_setprop(fdt, bus_off, "ibm,associativity", associativity,
> >> + sizeof(associativity)));
> >> + }
> >
> > LoPAPR 15.3 seems to suggest that ibm,associativity-reference-points is
> > required alongside ibm,associativity for each DT node it appears in,
> > and since we hardcode "Form 1" affinity it should be done similarly as
> > the entry we place in the top-level DT node.
>
>
> Hm, okay, I'll add it. There is a question to Bharata - why do we have 4s
> in spapr_create_fdt_skel()'s refpoints? Just a random pick?
I remember basing it on what I saw in an LPAR>
>
> vec5[5]==0x80 means we are doing "Form1" and these 4s are far/near distances?
This comment from
https://github.com/open-power/skiboot/blob/master/core/affinity.c should
explain things:
/*
*
* We currently construct our associativity properties as such:
*
* - For "chip" devices (bridges, memory, ...), 4 entries:
*
* - CCM node ID
* - HW card ID
* - HW module ID
* - Chip ID
*
* The information is constructed based on the chip ID which (unlike
* pHyp) is our HW chip ID (aka "XSCOM" chip ID). We use it to retrieve
* the other properties from the corresponding chip/xscom node in the
* device-tree. If those properties are absent, 0 is used.
*
* - For "core" devices, we add a 5th entry:
*
* - Core ID
*
* Here too, we do not use the "cooked" HW processor ID from HDAT but
* instead use the real HW core ID which is basically the interrupt
* server number of thread 0 on that core.
*
*
* The ibm,associativity-reference-points property is currently set to
* 4,4 indicating that the chip ID is our only reference point. This
* should be extended to encompass the node IDs eventually.
*/
Regards,
Bharata.