qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes


From: Daniel Henrique Barboza
Subject: Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
Date: Wed, 31 Mar 2021 14:29:12 -0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1


On 3/31/21 12:18 PM, Cédric Le Goater wrote:
On 3/31/21 2:57 AM, David Gibson wrote:
On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:


On 3/29/21 12:32 PM, Cédric Le Goater wrote:
On 3/29/21 6:20 AM, David Gibson wrote:
On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
On 3/25/21 3:10 AM, David Gibson wrote:
On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:


On 3/22/21 10:03 PM, David Gibson wrote:
On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
Kernel commit 4bce545903fa ("powerpc/topology: Update
topology_core_cpumask") cause a regression in the pseries machine when
defining certain SMP topologies [1]. The reasoning behind the change is
explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with
large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as
far as the kernel understanding of SMP topologies goes, both masks are
equivalent.

Further discussions in the kernel mailing list [2] shown that the
powerpc kernel always considered that the number of sockets were equal
to the number of NUMA nodes. The claim is that it doesn't make sense,
for Power hardware at least, 2+ sockets being in the same NUMA node. The
immediate conclusion is that all SMP topologies the pseries machine were
supplying to the kernel, with more than one socket in the same NUMA node
as in [1], happened to be correctly represented in the kernel by
accident during all these years.

There's a case to be made for virtual topologies being detached from
hardware constraints, allowing maximum flexibility to users. At the same
time, this freedom can't result in unrealistic hardware representations
being emulated. If the real hardware and the pseries kernel don't
support multiple chips/sockets in the same NUMA node, neither should we.

Starting in 6.0.0, all sockets must match an unique NUMA node in the
pseries machine. qtest changes were made to adapt to this new
condition.

Oof.  I really don't like this idea.  It means a bunch of fiddly work
for users to match these up, for no real gain.  I'm also concerned
that this will require follow on changes in libvirt to not make this a
really cryptic and irritating point of failure.

Haven't though about required Libvirt changes, although I can say that there
will be some amount to be mande and it will probably annoy existing users
(everyone that has a multiple socket per NUMA node topology).

There is not much we can do from the QEMU layer aside from what I've proposed
here. The other alternative is to keep interacting with the kernel folks to
see if there is a way to keep our use case untouched.

Right.  Well.. not necessarily untouched, but I'm hoping for more
replies from Cédric to my objections and mpe's.  Even with sockets
being a kinda meaningless concept in PAPR, I don't think tying it to
NUMA nodes makes sense.

I did a couple of replies in different email threads but maybe not
to all. I felt it was going nowhere :/ Couple of thoughts,

I think I saw some of those, but maybe not all.

Shouldn't we get rid of the socket concept, die also, under pseries
since they don't exist under PAPR ? We only have numa nodes, cores,
threads AFAICT.

Theoretically, yes.  I'm not sure it's really practical, though, since
AFAICT, both qemu and the kernel have the notion of sockets (though
not dies) built into generic code.

Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
and PPC Linux only has a NUMA node id, on pseries and powernv.

It does mean that one possible approach here - maybe the best one - is
to simply declare that sockets are meaningless under, so we simply
don't expect what the guest kernel reports to match what's given to
qemu.

It'd be nice to avoid that if we can: in a sense it's just cosmetic,
but it is likely to surprise and confuse people.

Should we diverged from PAPR and add extra DT properties "qemu,..." ?
There are a couple of places where Linux checks for the underlying
hypervisor already.

This also means that
'ibm,chip-id' will probably remain in use since it's the only place where
we inform cores per socket information to the kernel.

Well.. unless we can find some other sensible way to convey that
information.  I haven't given up hope for that yet.

Well, we could start by fixing the value in QEMU. It is broken
today.

Fixing what value, exactly?

The value of the "ibm,chip-id" since we are keeping the property under
QEMU.

David, I believe this has to do with the discussing we had last Friday.

I mentioned that the ibm,chip-id property is being calculated in a way that
promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes,
e.g.:

-smp 4,cores=4,maxcpus=8,threads=1 \
-numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
-numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1


$ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
                        ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
                        ibm,pft-size = <0x00 0x19>;
                        ibm,chip-id = <0x00>;
--
                       ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>;
                       ibm,pft-size = <0x00 0x19>;
                       ibm,chip-id = <0x00>;
--
                       ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>;
                       ibm,pft-size = <0x00 0x19>;
                       ibm,chip-id = <0x00>;
--
                       ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
                       ibm,pft-size = <0x00 0x19>;
                       ibm,chip-id = <0x00>;

We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
different NUMA node than 0-1. This would mean that the same socket
would belong to different NUMA nodes at the same time.

Right... and I'm still not seeing why that's a problem.  AFAICT that's
a possible, if unexpected, situation under real hardware - though
maybe not for POWER9 specifically.
The ibm,chip-id property does not exist under PAPR. PAPR only has
NUMA nodes, no sockets nor chips.

And the property value is simply broken under QEMU. Try this  :

    -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G 
-numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object 
memory-backend-ram,id=ram-node1,size=2G -numa 
node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1

# dmesg | grep numa
[    0.013106] numa: Node 0 CPUs: 0-1
[    0.013136] numa: Node 1 CPUs: 2-3

# dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
                ibm,chip-id = <0x01>;
                ibm,chip-id = <0x02>;
                ibm,chip-id = <0x00>;
                ibm,chip-id = <0x03>;

These values are not wrong. When you do:

-smp 4,cores=1,maxcpus=8 (....)

You didn't fill threads and sockets. QEMU default is to prioritize sockets
to fill the missing information, up to the maxcpus value. This means that what
you did is equivalent to:

-smp 4,threads=1,cores=1,sockets=8,maxcpus=8 (....)


It's a 1 thread/core, 1 core/socket with 8 sockets config. Each possible CPU
will sit in its own core, having its own ibm,chip-id. So:

"-numa node,nodeid=0,cpus=0-1"

is in fact allocating sockets 0 and 1 to NUMA node 0.


Thanks,



DHB


I believe this is what Cedric wants to be addressed. Given that the
property is called after the OPAL property ibm,chip-id, the kernel
expects that the property will have the same semantics as in OPAL.>
Even on powernv, I'm not clear why chip-id is tied into the NUMA
configuration, rather than getting all the NUMA info from
associativity properties.

It is the case.

The associativity properties are built from chip-id in OPAL though.

The chip-id property is only used in low level PowerNV drivers, VAS,
XSCOM, LPC, etc.

It's also badly used in the common part of the XIVE driver, what I am
trying to fix to introduce an IPI per node on all platforms.

C.




reply via email to

[Prev in Thread] Current Thread [Next in Thread]