qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes


From: Cédric Le Goater
Subject: Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
Date: Wed, 31 Mar 2021 19:40:40 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.0

On 3/31/21 7:29 PM, Daniel Henrique Barboza wrote:
> 
> On 3/31/21 12:18 PM, Cédric Le Goater wrote:
>> On 3/31/21 2:57 AM, David Gibson wrote:
>>> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
>>>>
>>>>
>>>> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
>>>>> On 3/29/21 6:20 AM, David Gibson wrote:
>>>>>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
>>>>>>> On 3/25/21 3:10 AM, David Gibson wrote:
>>>>>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza 
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/22/21 10:03 PM, David Gibson wrote:
>>>>>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza 
>>>>>>>>>> wrote:
>>>>>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>>>>>>>>>>> topology_core_cpumask") cause a regression in the pseries machine 
>>>>>>>>>>> when
>>>>>>>>>>> defining certain SMP topologies [1]. The reasoning behind the 
>>>>>>>>>>> change is
>>>>>>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
>>>>>>>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles 
>>>>>>>>>>> with
>>>>>>>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask 
>>>>>>>>>>> because, as
>>>>>>>>>>> far as the kernel understanding of SMP topologies goes, both masks 
>>>>>>>>>>> are
>>>>>>>>>>> equivalent.
>>>>>>>>>>>
>>>>>>>>>>> Further discussions in the kernel mailing list [2] shown that the
>>>>>>>>>>> powerpc kernel always considered that the number of sockets were 
>>>>>>>>>>> equal
>>>>>>>>>>> to the number of NUMA nodes. The claim is that it doesn't make 
>>>>>>>>>>> sense,
>>>>>>>>>>> for Power hardware at least, 2+ sockets being in the same NUMA 
>>>>>>>>>>> node. The
>>>>>>>>>>> immediate conclusion is that all SMP topologies the pseries machine 
>>>>>>>>>>> were
>>>>>>>>>>> supplying to the kernel, with more than one socket in the same NUMA 
>>>>>>>>>>> node
>>>>>>>>>>> as in [1], happened to be correctly represented in the kernel by
>>>>>>>>>>> accident during all these years.
>>>>>>>>>>>
>>>>>>>>>>> There's a case to be made for virtual topologies being detached from
>>>>>>>>>>> hardware constraints, allowing maximum flexibility to users. At the 
>>>>>>>>>>> same
>>>>>>>>>>> time, this freedom can't result in unrealistic hardware 
>>>>>>>>>>> representations
>>>>>>>>>>> being emulated. If the real hardware and the pseries kernel don't
>>>>>>>>>>> support multiple chips/sockets in the same NUMA node, neither 
>>>>>>>>>>> should we.
>>>>>>>>>>>
>>>>>>>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the
>>>>>>>>>>> pseries machine. qtest changes were made to adapt to this new
>>>>>>>>>>> condition.
>>>>>>>>>>
>>>>>>>>>> Oof.  I really don't like this idea.  It means a bunch of fiddly work
>>>>>>>>>> for users to match these up, for no real gain.  I'm also concerned
>>>>>>>>>> that this will require follow on changes in libvirt to not make this 
>>>>>>>>>> a
>>>>>>>>>> really cryptic and irritating point of failure.
>>>>>>>>>
>>>>>>>>> Haven't though about required Libvirt changes, although I can say 
>>>>>>>>> that there
>>>>>>>>> will be some amount to be mande and it will probably annoy existing 
>>>>>>>>> users
>>>>>>>>> (everyone that has a multiple socket per NUMA node topology).
>>>>>>>>>
>>>>>>>>> There is not much we can do from the QEMU layer aside from what I've 
>>>>>>>>> proposed
>>>>>>>>> here. The other alternative is to keep interacting with the kernel 
>>>>>>>>> folks to
>>>>>>>>> see if there is a way to keep our use case untouched.
>>>>>>>>
>>>>>>>> Right.  Well.. not necessarily untouched, but I'm hoping for more
>>>>>>>> replies from Cédric to my objections and mpe's.  Even with sockets
>>>>>>>> being a kinda meaningless concept in PAPR, I don't think tying it to
>>>>>>>> NUMA nodes makes sense.
>>>>>>>
>>>>>>> I did a couple of replies in different email threads but maybe not
>>>>>>> to all. I felt it was going nowhere :/ Couple of thoughts,
>>>>>>
>>>>>> I think I saw some of those, but maybe not all.
>>>>>>
>>>>>>> Shouldn't we get rid of the socket concept, die also, under pseries
>>>>>>> since they don't exist under PAPR ? We only have numa nodes, cores,
>>>>>>> threads AFAICT.
>>>>>>
>>>>>> Theoretically, yes.  I'm not sure it's really practical, though, since
>>>>>> AFAICT, both qemu and the kernel have the notion of sockets (though
>>>>>> not dies) built into generic code.
>>>>>
>>>>> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
>>>>> and PPC Linux only has a NUMA node id, on pseries and powernv.
>>>>>
>>>>>> It does mean that one possible approach here - maybe the best one - is
>>>>>> to simply declare that sockets are meaningless under, so we simply
>>>>>> don't expect what the guest kernel reports to match what's given to
>>>>>> qemu.
>>>>>>
>>>>>> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
>>>>>> but it is likely to surprise and confuse people.
>>>>>>
>>>>>>> Should we diverged from PAPR and add extra DT properties "qemu,..." ?
>>>>>>> There are a couple of places where Linux checks for the underlying
>>>>>>> hypervisor already.
>>>>>>>
>>>>>>>>> This also means that
>>>>>>>>> 'ibm,chip-id' will probably remain in use since it's the only place 
>>>>>>>>> where
>>>>>>>>> we inform cores per socket information to the kernel.
>>>>>>>>
>>>>>>>> Well.. unless we can find some other sensible way to convey that
>>>>>>>> information.  I haven't given up hope for that yet.
>>>>>>>
>>>>>>> Well, we could start by fixing the value in QEMU. It is broken
>>>>>>> today.
>>>>>>
>>>>>> Fixing what value, exactly?
>>>>>
>>>>> The value of the "ibm,chip-id" since we are keeping the property under
>>>>> QEMU.
>>>>
>>>> David, I believe this has to do with the discussing we had last Friday.
>>>>
>>>> I mentioned that the ibm,chip-id property is being calculated in a way that
>>>> promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes,
>>>> e.g.:
>>>>
>>>> -smp 4,cores=4,maxcpus=8,threads=1 \
>>>> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
>>>> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>>>
>>>>
>>>> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
>>>>             ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
>>>>             ibm,pft-size = <0x00 0x19>;
>>>>             ibm,chip-id = <0x00>;
>>>> -- 
>>>>                        ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>;
>>>>                        ibm,pft-size = <0x00 0x19>;
>>>>                        ibm,chip-id = <0x00>;
>>>> -- 
>>>>                        ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>;
>>>>                        ibm,pft-size = <0x00 0x19>;
>>>>                        ibm,chip-id = <0x00>;
>>>> -- 
>>>>                        ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
>>>>                        ibm,pft-size = <0x00 0x19>;
>>>>                        ibm,chip-id = <0x00>;
>>>
>>>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
>>>> different NUMA node than 0-1. This would mean that the same socket
>>>> would belong to different NUMA nodes at the same time.
>>>
>>> Right... and I'm still not seeing why that's a problem.  AFAICT that's
>>> a possible, if unexpected, situation under real hardware - though
>>> maybe not for POWER9 specifically.
>> The ibm,chip-id property does not exist under PAPR. PAPR only has
>> NUMA nodes, no sockets nor chips.
>>
>> And the property value is simply broken under QEMU. Try this  :
>>
>>     -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G 
>> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object 
>> memory-backend-ram,id=ram-node1,size=2G -numa 
>> node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>
>> # dmesg | grep numa
>> [    0.013106] numa: Node 0 CPUs: 0-1
>> [    0.013136] numa: Node 1 CPUs: 2-3
>>
>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>>         ibm,chip-id = <0x01>;
>>         ibm,chip-id = <0x02>;
>>         ibm,chip-id = <0x00>;
>>         ibm,chip-id = <0x03>;
> 
> These values are not wrong. When you do:
> 
> -smp 4,cores=1,maxcpus=8 (....)
> 
> You didn't fill threads and sockets. QEMU default is to prioritize sockets
> to fill the missing information, up to the maxcpus value. This means that what
> you did is equivalent to:
> 
> -smp 4,threads=1,cores=1,sockets=8,maxcpus=8 (....)
> 
> 
> It's a 1 thread/core, 1 core/socket with 8 sockets config. Each possible CPU
> will sit in its own core, having its own ibm,chip-id. So:
> 
> "-numa node,nodeid=0,cpus=0-1"
> 
> is in fact allocating sockets 0 and 1 to NUMA node 0.

ok. 
Thanks,

C.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]