[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
From: |
Igor Mammedov |
Subject: |
Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes |
Date: |
Tue, 30 Mar 2021 01:55:40 +0200 |
On Mon, 29 Mar 2021 15:32:37 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
> > On 3/29/21 6:20 AM, David Gibson wrote:
> >> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
> >>> On 3/25/21 3:10 AM, David Gibson wrote:
> >>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>> On 3/22/21 10:03 PM, David Gibson wrote:
> >>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza
> >>>>>> wrote:
> >>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
> >>>>>>> topology_core_cpumask") cause a regression in the pseries machine when
> >>>>>>> defining certain SMP topologies [1]. The reasoning behind the change
> >>>>>>> is
> >>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
> >>>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles
> >>>>>>> with
> >>>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because,
> >>>>>>> as
> >>>>>>> far as the kernel understanding of SMP topologies goes, both masks are
> >>>>>>> equivalent.
> >>>>>>>
> >>>>>>> Further discussions in the kernel mailing list [2] shown that the
> >>>>>>> powerpc kernel always considered that the number of sockets were equal
> >>>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense,
> >>>>>>> for Power hardware at least, 2+ sockets being in the same NUMA node.
> >>>>>>> The
> >>>>>>> immediate conclusion is that all SMP topologies the pseries machine
> >>>>>>> were
> >>>>>>> supplying to the kernel, with more than one socket in the same NUMA
> >>>>>>> node
> >>>>>>> as in [1], happened to be correctly represented in the kernel by
> >>>>>>> accident during all these years.
> >>>>>>>
> >>>>>>> There's a case to be made for virtual topologies being detached from
> >>>>>>> hardware constraints, allowing maximum flexibility to users. At the
> >>>>>>> same
> >>>>>>> time, this freedom can't result in unrealistic hardware
> >>>>>>> representations
> >>>>>>> being emulated. If the real hardware and the pseries kernel don't
> >>>>>>> support multiple chips/sockets in the same NUMA node, neither should
> >>>>>>> we.
> >>>>>>>
> >>>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the
> >>>>>>> pseries machine. qtest changes were made to adapt to this new
> >>>>>>> condition.
> >>>>>>
> >>>>>> Oof. I really don't like this idea. It means a bunch of fiddly work
> >>>>>> for users to match these up, for no real gain. I'm also concerned
> >>>>>> that this will require follow on changes in libvirt to not make this a
> >>>>>> really cryptic and irritating point of failure.
> >>>>>
> >>>>> Haven't though about required Libvirt changes, although I can say that
> >>>>> there
> >>>>> will be some amount to be mande and it will probably annoy existing
> >>>>> users
> >>>>> (everyone that has a multiple socket per NUMA node topology).
> >>>>>
> >>>>> There is not much we can do from the QEMU layer aside from what I've
> >>>>> proposed
> >>>>> here. The other alternative is to keep interacting with the kernel
> >>>>> folks to
> >>>>> see if there is a way to keep our use case untouched.
> >>>>
> >>>> Right. Well.. not necessarily untouched, but I'm hoping for more
> >>>> replies from Cédric to my objections and mpe's. Even with sockets
> >>>> being a kinda meaningless concept in PAPR, I don't think tying it to
> >>>> NUMA nodes makes sense.
> >>>
> >>> I did a couple of replies in different email threads but maybe not
> >>> to all. I felt it was going nowhere :/ Couple of thoughts,
> >>
> >> I think I saw some of those, but maybe not all.
> >>
> >>> Shouldn't we get rid of the socket concept, die also, under pseries
> >>> since they don't exist under PAPR ? We only have numa nodes, cores,
> >>> threads AFAICT.
> >>
> >> Theoretically, yes. I'm not sure it's really practical, though, since
> >> AFAICT, both qemu and the kernel have the notion of sockets (though
> >> not dies) built into generic code.
> >
> > Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
> > and PPC Linux only has a NUMA node id, on pseries and powernv.
> >
> >> It does mean that one possible approach here - maybe the best one - is
> >> to simply declare that sockets are meaningless under, so we simply
> >> don't expect what the guest kernel reports to match what's given to
> >> qemu.
> >>
> >> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
> >> but it is likely to surprise and confuse people.
> >>
> >>> Should we diverged from PAPR and add extra DT properties "qemu,..." ?
> >>> There are a couple of places where Linux checks for the underlying
> >>> hypervisor already.
> >>>
> >>>>> This also means that
> >>>>> 'ibm,chip-id' will probably remain in use since it's the only place
> >>>>> where
> >>>>> we inform cores per socket information to the kernel.
> >>>>
> >>>> Well.. unless we can find some other sensible way to convey that
> >>>> information. I haven't given up hope for that yet.
> >>>
> >>> Well, we could start by fixing the value in QEMU. It is broken
> >>> today.
> >>
> >> Fixing what value, exactly?
> >
> > The value of the "ibm,chip-id" since we are keeping the property under
> > QEMU.
>
> David, I believe this has to do with the discussing we had last Friday.
>
> I mentioned that the ibm,chip-id property is being calculated in a way that
> promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes,
> e.g.:
>
> -smp 4,cores=4,maxcpus=8,threads=1 \
> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
just heads up, 'cpus=' is going away (deprecation patch got lost during 6.0
but I plan on pushing it during 6.1 devel window),
you can use '-numa cpus,node-id=a,core-id=b' instead
>
>
> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
> ibm,pft-size = <0x00 0x19>;
> ibm,chip-id = <0x00>;
> --
> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>;
> ibm,pft-size = <0x00 0x19>;
> ibm,chip-id = <0x00>;
> --
> ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>;
> ibm,pft-size = <0x00 0x19>;
> ibm,chip-id = <0x00>;
> --
> ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
> ibm,pft-size = <0x00 0x19>;
> ibm,chip-id = <0x00>;
>
> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a different
> NUMA node than 0-1. This would mean that the same socket would belong to
> different NUMA nodes at the same time.
>
> I believe this is what Cedric wants to be addressed. Given that the property
> is
> called after the OPAL property ibm,chip-id, the kernel expects that the
> property
> will have the same semantics as in OPAL.
>
>
>
> Thanks,
>
>
> DHB
>
>
>
>
> >
> >>> This is all coming from some work we did last year to evaluate our HW
> >>> (mostly for XIVE) on 2s, 4s, 16s systems on baremetal, KVM and PowerVM.
> >>> We saw some real problems because Linux did not have a clear view of the
> >>> topology. See the figures here :
> >>>
> >>> http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210303174857.1760393-9-clg@kaod.org/
> >>>
> >>> The node id is a key parameter for system resource management, memory
> >>> allocation, interrupt affinity, etc. Linux scales much better if used
> >>> correctly.
> >>
> >> Well, sure. And we have all the ibm,associativity stuff to convey the
> >> node ids to the guest (which has its own problems, but not that are
> >> relevant here). What's throwing me is why getting node IDs correct
> >> has anything to do with socket numbers.
> >
> >
>
- [PATCH 0/2] pseries: SMP sockets must match NUMA nodes, Daniel Henrique Barboza, 2021/03/19
- [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Daniel Henrique Barboza, 2021/03/19
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, David Gibson, 2021/03/22
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Daniel Henrique Barboza, 2021/03/23
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, David Gibson, 2021/03/24
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Cédric Le Goater, 2021/03/25
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Daniel Henrique Barboza, 2021/03/25
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, David Gibson, 2021/03/29
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Cédric Le Goater, 2021/03/29
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Daniel Henrique Barboza, 2021/03/29
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes,
Igor Mammedov <=
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, David Gibson, 2021/03/30
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Michael Ellerman, 2021/03/31
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Cédric Le Goater, 2021/03/31
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, David Gibson, 2021/03/31
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Cédric Le Goater, 2021/03/31
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Daniel Henrique Barboza, 2021/03/31
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Cédric Le Goater, 2021/03/31
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, David Gibson, 2021/03/31
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Igor Mammedov, 2021/03/29
- Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes, Daniel Henrique Barboza, 2021/03/30