[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
From: |
David Gibson |
Subject: |
Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes |
Date: |
Thu, 1 Apr 2021 13:59:16 +1100 |
On Wed, Mar 31, 2021 at 05:18:45PM +0200, Cédric Le Goater wrote:
> On 3/31/21 2:57 AM, David Gibson wrote:
> > On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
> >>
> >>
> >> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
> >>> On 3/29/21 6:20 AM, David Gibson wrote:
> >>>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
> >>>>> On 3/25/21 3:10 AM, David Gibson wrote:
> >>>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> On 3/22/21 10:03 PM, David Gibson wrote:
> >>>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza
> >>>>>>>> wrote:
> >>>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
> >>>>>>>>> topology_core_cpumask") cause a regression in the pseries machine
> >>>>>>>>> when
> >>>>>>>>> defining certain SMP topologies [1]. The reasoning behind the
> >>>>>>>>> change is
> >>>>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
> >>>>>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles
> >>>>>>>>> with
> >>>>>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask
> >>>>>>>>> because, as
> >>>>>>>>> far as the kernel understanding of SMP topologies goes, both masks
> >>>>>>>>> are
> >>>>>>>>> equivalent.
> >>>>>>>>>
> >>>>>>>>> Further discussions in the kernel mailing list [2] shown that the
> >>>>>>>>> powerpc kernel always considered that the number of sockets were
> >>>>>>>>> equal
> >>>>>>>>> to the number of NUMA nodes. The claim is that it doesn't make
> >>>>>>>>> sense,
> >>>>>>>>> for Power hardware at least, 2+ sockets being in the same NUMA
> >>>>>>>>> node. The
> >>>>>>>>> immediate conclusion is that all SMP topologies the pseries machine
> >>>>>>>>> were
> >>>>>>>>> supplying to the kernel, with more than one socket in the same NUMA
> >>>>>>>>> node
> >>>>>>>>> as in [1], happened to be correctly represented in the kernel by
> >>>>>>>>> accident during all these years.
> >>>>>>>>>
> >>>>>>>>> There's a case to be made for virtual topologies being detached from
> >>>>>>>>> hardware constraints, allowing maximum flexibility to users. At the
> >>>>>>>>> same
> >>>>>>>>> time, this freedom can't result in unrealistic hardware
> >>>>>>>>> representations
> >>>>>>>>> being emulated. If the real hardware and the pseries kernel don't
> >>>>>>>>> support multiple chips/sockets in the same NUMA node, neither
> >>>>>>>>> should we.
> >>>>>>>>>
> >>>>>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the
> >>>>>>>>> pseries machine. qtest changes were made to adapt to this new
> >>>>>>>>> condition.
> >>>>>>>>
> >>>>>>>> Oof. I really don't like this idea. It means a bunch of fiddly work
> >>>>>>>> for users to match these up, for no real gain. I'm also concerned
> >>>>>>>> that this will require follow on changes in libvirt to not make this
> >>>>>>>> a
> >>>>>>>> really cryptic and irritating point of failure.
> >>>>>>>
> >>>>>>> Haven't though about required Libvirt changes, although I can say
> >>>>>>> that there
> >>>>>>> will be some amount to be mande and it will probably annoy existing
> >>>>>>> users
> >>>>>>> (everyone that has a multiple socket per NUMA node topology).
> >>>>>>>
> >>>>>>> There is not much we can do from the QEMU layer aside from what I've
> >>>>>>> proposed
> >>>>>>> here. The other alternative is to keep interacting with the kernel
> >>>>>>> folks to
> >>>>>>> see if there is a way to keep our use case untouched.
> >>>>>>
> >>>>>> Right. Well.. not necessarily untouched, but I'm hoping for more
> >>>>>> replies from Cédric to my objections and mpe's. Even with sockets
> >>>>>> being a kinda meaningless concept in PAPR, I don't think tying it to
> >>>>>> NUMA nodes makes sense.
> >>>>>
> >>>>> I did a couple of replies in different email threads but maybe not
> >>>>> to all. I felt it was going nowhere :/ Couple of thoughts,
> >>>>
> >>>> I think I saw some of those, but maybe not all.
> >>>>
> >>>>> Shouldn't we get rid of the socket concept, die also, under pseries
> >>>>> since they don't exist under PAPR ? We only have numa nodes, cores,
> >>>>> threads AFAICT.
> >>>>
> >>>> Theoretically, yes. I'm not sure it's really practical, though, since
> >>>> AFAICT, both qemu and the kernel have the notion of sockets (though
> >>>> not dies) built into generic code.
> >>>
> >>> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
> >>> and PPC Linux only has a NUMA node id, on pseries and powernv.
> >>>
> >>>> It does mean that one possible approach here - maybe the best one - is
> >>>> to simply declare that sockets are meaningless under, so we simply
> >>>> don't expect what the guest kernel reports to match what's given to
> >>>> qemu.
> >>>>
> >>>> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
> >>>> but it is likely to surprise and confuse people.
> >>>>
> >>>>> Should we diverged from PAPR and add extra DT properties "qemu,..." ?
> >>>>> There are a couple of places where Linux checks for the underlying
> >>>>> hypervisor already.
> >>>>>
> >>>>>>> This also means that
> >>>>>>> 'ibm,chip-id' will probably remain in use since it's the only place
> >>>>>>> where
> >>>>>>> we inform cores per socket information to the kernel.
> >>>>>>
> >>>>>> Well.. unless we can find some other sensible way to convey that
> >>>>>> information. I haven't given up hope for that yet.
> >>>>>
> >>>>> Well, we could start by fixing the value in QEMU. It is broken
> >>>>> today.
> >>>>
> >>>> Fixing what value, exactly?
> >>>
> >>> The value of the "ibm,chip-id" since we are keeping the property under
> >>> QEMU.
> >>
> >> David, I believe this has to do with the discussing we had last Friday.
> >>
> >> I mentioned that the ibm,chip-id property is being calculated in a way that
> >> promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes,
> >> e.g.:
> >>
> >> -smp 4,cores=4,maxcpus=8,threads=1 \
> >> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
> >> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> >>
> >>
> >> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
> >> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
> >> ibm,pft-size = <0x00 0x19>;
> >> ibm,chip-id = <0x00>;
> >
> >> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
> >> different NUMA node than 0-1. This would mean that the same socket
> >> would belong to different NUMA nodes at the same time.
> >
> > Right... and I'm still not seeing why that's a problem. AFAICT that's
> > a possible, if unexpected, situation under real hardware - though
> > maybe not for POWER9 specifically.
> The ibm,chip-id property does not exist under PAPR. PAPR only has
> NUMA nodes, no sockets nor chips.
>
> And the property value is simply broken under QEMU. Try this :
>
> -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G
> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object
> memory-backend-ram,id=ram-node1,size=2G -numa
> node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>
> # dmesg | grep numa
> [ 0.013106] numa: Node 0 CPUs: 0-1
> [ 0.013136] numa: Node 1 CPUs: 2-3
>
> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
> ibm,chip-id = <0x01>;
> ibm,chip-id = <0x02>;
> ibm,chip-id = <0x00>;
> ibm,chip-id = <0x03>;
Not seeing the problem here. You've specified 8 (possible) cpus, 1
core-per-socket, therefore there are 4 sockets, hence 4 chip-ids.
Again, I don't see where this assumption that the chip-id is somehow
related to the NUMA topology is coming from.
> >> I believe this is what Cedric wants to be addressed. Given that the
> >> property is called after the OPAL property ibm,chip-id, the kernel
> >> expects that the property will have the same semantics as in OPAL.>
> > Even on powernv, I'm not clear why chip-id is tied into the NUMA
> > configuration, rather than getting all the NUMA info from
> > associativity properties.
>
> It is the case.
What exactly is the case?
> The associativity properties are built from chip-id in OPAL though.
Ok, so? Why do we care how OPAL builds the associativity properties
once we *have* the associativity properties its built?
> The chip-id property is only used in low level PowerNV drivers, VAS,
> XSCOM, LPC, etc.
Which AFAIK, really *do* need to know which chip they're on, not the
NUMA toplogy.
> It's also badly used in the common part of the XIVE driver, what I am
> trying to fix to introduce an IPI per node on all platforms.
See comments on other thread.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
signature.asc
Description: PGP signature
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, (continued)
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Cédric Le Goater, 2021/03/29
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Daniel Henrique Barboza, 2021/03/29
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Igor Mammedov, 2021/03/29
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, David Gibson, 2021/03/30
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Michael Ellerman, 2021/03/31
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Cédric Le Goater, 2021/03/31
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, David Gibson, 2021/03/31
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Cédric Le Goater, 2021/03/31
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Daniel Henrique Barboza, 2021/03/31
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Cédric Le Goater, 2021/03/31
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes,
David Gibson <=
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Igor Mammedov, 2021/03/29
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Daniel Henrique Barboza, 2021/03/30
[PATCH 2/2] spapr.c: remove 'ibm,chip-id' from DT, Daniel Henrique Barboza, 2021/03/19