From: Babu Moger
Subject: RE: [PATCH v2 1/3] hw/i386: Initialize topo_ids from CpuInstanceProperties
Date: Wed, 29 Jul 2020 16:22:32 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0

Igor,
Sorry, a few more questions.

> -----Original Message-----
> From: Igor Mammedov <imammedo@redhat.com>
> Sent: Wednesday, July 29, 2020 9:12 AM
> To: Moger, Babu <Babu.Moger@amd.com>
> Cc: pbonzini@redhat.com; rth@twiddle.net; qemu-devel@nongnu.org;
> ehabkost@redhat.com
> Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids from
> CpuInstanceProperties
> 
> On Mon, 27 Jul 2020 18:59:42 -0500
> Babu Moger <babu.moger@amd.com> wrote:
> 
> > > -----Original Message-----
> > > From: Igor Mammedov <imammedo@redhat.com>
> > > Sent: Monday, July 27, 2020 12:14 PM
> > > To: Moger, Babu <Babu.Moger@amd.com>
> > > Cc: qemu-devel@nongnu.org; pbonzini@redhat.com;
> ehabkost@redhat.com;
> > > rth@twiddle.net
> > > Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids from
> > > CpuInstanceProperties
> > >
> > > On Mon, 27 Jul 2020 10:49:08 -0500
> > > Babu Moger <babu.moger@amd.com> wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Igor Mammedov <imammedo@redhat.com>
> > > > > Sent: Friday, July 24, 2020 12:05 PM
> > > > > To: Moger, Babu <Babu.Moger@amd.com>
> > > > > Cc: qemu-devel@nongnu.org; pbonzini@redhat.com;
> > > ehabkost@redhat.com;
> > > > > rth@twiddle.net
> > > > > Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids from
> > > > > CpuInstanceProperties
> > > > >
> > > > > On Mon, 13 Jul 2020 14:30:29 -0500 Babu Moger
> > > > > <babu.moger@amd.com> wrote:
> > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Igor Mammedov <imammedo@redhat.com>
> > > > > > > Sent: Monday, July 13, 2020 12:32 PM
> > > > > > > To: Moger, Babu <Babu.Moger@amd.com>
> > > > > > > Cc: pbonzini@redhat.com; rth@twiddle.net;
> > > > > > > ehabkost@redhat.com;
> > > > > > > qemu-devel@nongnu.org
> > > > > > > Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids
> > > > > > > from CpuInstanceProperties
> > > > > > >
> > > > > > > On Mon, 13 Jul 2020 11:43:33 -0500 Babu Moger
> > > > > > > <babu.moger@amd.com> wrote:
> > > > > > >
> > > > > > > > On 7/13/20 11:17 AM, Igor Mammedov wrote:
> > > > > > > > > On Mon, 13 Jul 2020 10:02:22 -0500 Babu Moger
> > > > > > > > > <babu.moger@amd.com> wrote:
> > > > > > > > >
> > > > > > > > >>> -----Original Message-----
> > > > > > > > >>> From: Igor Mammedov <imammedo@redhat.com>
> > > > > > > > >>> Sent: Monday, July 13, 2020 4:08 AM
> > > > > > > > >>> To: Moger, Babu <Babu.Moger@amd.com>
> > > > > > > > >>> Cc: pbonzini@redhat.com; rth@twiddle.net;
> > > > > > > > >>> ehabkost@redhat.com;
> > > > > > > > >>> qemu-devel@nongnu.org
> > > > > > > > >>> Subject: Re: [PATCH v2 1/3] hw/i386: Initialize
> > > > > > > > >>> topo_ids from CpuInstanceProperties
> > > > > > > > > [...]
> > > > > > > > >>>> +
> > > > > > > > >>>> +/*
> > > > > > > > >>>> + * Initialize topo_ids from CpuInstanceProperties
> > > > > > > > >>>> + * node_id in CpuInstanceProperties (or in CPU device) is a sequential
> > > > > > > > >>>> + * number, but while building the topology we need to separate it for
> > > > > > > > >>>> + * each socket (mod nodes_per_pkg).
> > > > > > > > >>> could you clarify a bit more on why this is necessary?
> > > > > > > > >>
> > > > > > > > >> If you have two sockets and 4 numa nodes, node_id in
> > > > > > > > >> CpuInstanceProperties will be numbered sequentially as 0, 1, 2, 3.
> > > > > > > > >> But in EPYC topology, it will be 0, 1, 0, 1 (basically
> > > > > > > > >> node_id mod the number of nodes per socket).
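
To spell out the renumbering just described, a tiny standalone C sketch
(assuming 2 sockets and 4 numa nodes, so nodes_per_pkg = 2; illustrative,
not QEMU code):

    #include <stdio.h>

    int main(void)
    {
        unsigned nodes_per_pkg = 2;   /* 4 numa nodes across 2 sockets */

        /* sequential node_id 0,1,2,3 -> per-socket node id 0,1,0,1 */
        for (unsigned node_id = 0; node_id < 4; node_id++) {
            printf("numa node %u -> in-socket node %u\n",
                   node_id, node_id % nodes_per_pkg);
        }
        return 0;
    }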
> > > > > > > > >
> > > > > > > > > I'm confused, let's suppose we have 2 EPYC sockets with
> > > > > > > > > 2 nodes per socket, so the APIC id would be composed like:
> > > > > > > > >
> > > > > > > > >  1st socket
> > > > > > > > >    pkg_id(0) | node_id(0)
> > > > > > > > >    pkg_id(0) | node_id(1)
> > > > > > > > >
> > > > > > > > >  2nd socket
> > > > > > > > >    pkg_id(1) | node_id(0)
> > > > > > > > >    pkg_id(1) | node_id(1)
> > > > > > > > >
> > > > > > > > > if that's the case, then EPYC's node_id here doesn't
> > > > > > > > > look like a NUMA node in the sense it's usually used
> > > > > > > > > (above config would have 4 different memory controllers
> > > > > > > > > => 4 conventional NUMA nodes).
> > > > > > > >
> > > > > > > > The EPYC model uses a combination of socket id and node id
> > > > > > > > to identify the numa nodes. So, it internally uses all the
> > > > > > > > information.
> > > > > > >
> > > > > > > well with above values, EPYC's node_id doesn't look like
> > > > > > > it's specifying a machine numa node, but rather a node index
> > > > > > > within a single socket. In which case, it doesn't make much
> > > > > > > sense calling it NUMA node_id; it's rather some index within
> > > > > > > a socket. (it starts looking like the terminology is all
> > > > > > > mixed up)
> > > > > > >
> > > > > > > If you have access to a multi-socket EPYC machine, can you
> > > > > > > dump and post here its apic ids, pls?
> > > > > >
> > > > > > Here is the output from my EPYC machine with 2 sockets and
> > > > > > 8 nodes in total (SMT disabled). The cpus 0-31 are in socket 0
> > > > > > and cpus 32-63 in socket 1.
> > > > > >
> > > > > > # lscpu
> > > > > > Architecture:        x86_64
> > > > > > CPU op-mode(s):      32-bit, 64-bit
> > > > > > Byte Order:          Little Endian
> > > > > > CPU(s):              64
> > > > > > On-line CPU(s) list: 0-63
> > > > > > Thread(s) per core:  1
> > > > > > Core(s) per socket:  32
> > > > > > Socket(s):           2
> > > > > > NUMA node(s):        8
> > > > > > Vendor ID:           AuthenticAMD
> > > > > > CPU family:          23
> > > > > > Model:               1
> > > > > > Model name:          AMD Eng Sample: 1S1901A4VIHF5_30/19_N
> > > > > > Stepping:            2
> > > > > > CPU MHz:             2379.233
> > > > > > CPU max MHz:         1900.0000
> > > > > > CPU min MHz:         1200.0000
> > > > > > BogoMIPS:            3792.81
> > > > > > Virtualization:      AMD-V
> > > > > > L1d cache:           32K
> > > > > > L1i cache:           64K
> > > > > > L2 cache:            512K
> > > > > > L3 cache:            8192K
> > > > > > NUMA node0 CPU(s):   0-7
> > > > > > NUMA node1 CPU(s):   8-15
> > > > > > NUMA node2 CPU(s):   16-23
> > > > > > NUMA node3 CPU(s):   24-31
> > > > > > NUMA node4 CPU(s):   32-39
> > > > > > NUMA node5 CPU(s):   40-47
> > > > > > NUMA node6 CPU(s):   48-55
> > > > > > NUMA node7 CPU(s):   56-63
> > > > > >
> > > > > > Here is the output of #cpuid  -l 0x8000001e  -r
> > > > >
> > > > >
> > > > > (1)
> > > > > > You may want to refer to
> > > > > > https://www.amd.com/system/files/TechDocs/54945_3.03_ppr_ZP_B2_pub.zip
> > > > > > (section 2.1.12.2.1.3 ApicId Enumeration Requirements).
> > > > > > Note that this is a general guideline. We tried to generalize
> > > > > > in qemu as much as possible. It is a bit complex.
> > > > >
> > > > >
> > > > >
> > > > > > CPU 0:
> > > > > >    0x8000001e 0x00: eax=0x00000000 ebx=0x00000100 ecx=0x00000300 edx=0x00000000
> > > > > [...]
> > > > > > CPU 63:
> > > > > >    0x8000001e 0x00: eax=0x0000007e ebx=0x0000011f ecx=0x00000307 edx=0x00000000
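
For reference, a small standalone sketch that decodes the two leaf-0x8000001e
entries above; the field layout (EBX[7:0] = CoreId, ECX[7:0] = NodeId,
ECX[10:8] = NodesPerProcessor - 1) is an assumption taken from the PPR
referenced below, not from QEMU:

    #include <stdio.h>

    static void decode_8000001e(unsigned eax, unsigned ebx, unsigned ecx)
    {
        printf("apic=%u core=%u node=%u nodes/socket=%u\n",
               eax,                      /* EAX: extended APIC ID       */
               ebx & 0xff,               /* EBX[7:0]: core id           */
               ecx & 0xff,               /* ECX[7:0]: node id           */
               ((ecx >> 8) & 0x7) + 1);  /* ECX[10:8]: nodes per socket */
    }

    int main(void)
    {
        decode_8000001e(0x00000000, 0x00000100, 0x00000300); /* CPU 0  */
        decode_8000001e(0x0000007e, 0x0000011f, 0x00000307); /* CPU 63 */
        /* CPU 63 decodes to apic=126 core=31 node=7 with 4 nodes per
         * socket, matching the 2-socket/8-node lscpu view above. */
        return 0;
    }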
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > I wonder if the linux guest actually uses the node_id
> > > > > > > > > encoded in the apic id for configuring/checking numa
> > > > > > > > > structures, or if it just uses whatever the ACPI SRAT
> > > > > > > > > table provided.
> > > > > > > > >
> > > > > > > > >>>> + */
> > > > > > > > >>>> +static inline void x86_init_topo_ids(X86CPUTopoInfo *topo_info,
> > > > > > > > >>>> +                                     CpuInstanceProperties props,
> > > > > > > > >>>> +                                     X86CPUTopoIDs *topo_ids)
> > > > > > > > >>>> +{
> > > > > > > > >>>> +    topo_ids->smt_id = props.has_thread_id ? props.thread_id : 0;
> > > > > > > > >>>> +    topo_ids->core_id = props.has_core_id ? props.core_id : 0;
> > > > > > > > >>>> +    topo_ids->die_id = props.has_die_id ? props.die_id : 0;
> > > > > > > > >>>> +    topo_ids->node_id = props.has_node_id ?
> > > > > > > > >>>> +                        props.node_id % MAX(topo_info->nodes_per_pkg, 1) : 0;
> > > > >
> > > > > It looks like I was wrong pushing a system-wide NUMA node-id into
> > > > > the APIC ID (the chosen naming is a bit confusing); per [1]
> > > > > mentioned above, EPYC's node-id is:
> > > > >
> > > > > * ApicId[6]   = Socket ID.
> > > > > * ApicId[5:4] = Node ID.
> > > > > * ApicId[3]   = Logical CCX L3 complex ID
> > > > > * ApicId[2:0] = (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}
> > > > >
> > > > > which can hold only the values 0-3, and is defined as:
> > > > >
> > > > > "A node, is an integrated circuit device that includes one to 8
> > > > > cores (one or two Core Complexes)."
> > > > >
> > > > > The spec also mentions it indirectly as die-id if one looks at
> > > > > CPUID_Fn8000001E_EBX [Core Identifiers] (Core::X86::Cpuid::CoreId) ...
> > > > >   CoreId = ({2'b0, DieId[1:0], LogicalComplexId[0], LogicalThreadId[2:0]} >> SMT)
> > > > >
> > > > > and in
> > > > > (2)
> > > > > CPUID_Fn8000001E_ECX [Node Identifiers] (Core::X86::Cpuid::NodeId) ...
> > > > >   {5'b00000,1'b[SOCKET_ID],2'b[DIE_ID]}
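
As an illustration of the ApicId packing quoted above, a standalone C sketch
(bit positions copied from the PPR excerpt; the example values are made up,
and this is not QEMU code):

    #include <stdio.h>

    /* Compose an EPYC-style APIC ID from the fields listed in [1]. */
    static unsigned epyc_apic_id(unsigned socket_id, unsigned node_id,
                                 unsigned ccx_id, unsigned core_id,
                                 unsigned thread_id, int smt)
    {
        unsigned low = smt ? (((core_id & 0x3) << 1) | (thread_id & 0x1))
                           : (core_id & 0x3);          /* ApicId[2:0] */
        return ((socket_id & 0x1) << 6) |              /* ApicId[6]   */
               ((node_id   & 0x3) << 4) |              /* ApicId[5:4] */
               ((ccx_id    & 0x1) << 3) |              /* ApicId[3]   */
               low;
    }

    int main(void)
    {
        /* socket 1, node 3, ccx 1, core 2, no SMT -> prints 0x7a */
        printf("0x%x\n", epyc_apic_id(1, 3, 1, 2, 0, 0));
        return 0;
    }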
> > > > >
> > > > > The question is why we did not reuse topo_ids->die_id instead of
> > > > > adding the confusing topo_ids->node_id in the first place?
> > > >
> > > > Initially, I thought about it. But Intel uses die_id differently
> > > > than AMD. So, I did not want to complicate things.
> > > > If we take that route then we need to re-arrange the numa code, as
> > > > we need the numa information to calculate the die id. So, I did not
> > > > want to mix things up.
> > > >
> > > > >
> > > > > Also, looking at the APIC ID and SRAT table provided here,
> > > > > CPUID_Fn8000001E_ECX corresponds to the NUMA node id (i.e. what
> > > > > -numa in QEMU is used for) and the Node ID embedded into
> > > > > ApicId[5:4] is basically the die-id.
> > > > >
> > > > > The difference between the die-id implemented in QEMU and EPYC's
> > > > > die id (topo_ids->node_id) is that the former doesn't require a
> > > > > numa config (maybe it should, but that ship's already sailed) and
> > > > > gets the number of dies from the '-smp dies=X' CLI option, while
> > > > > for EPYC we calculate topo_ids->node_id implicitly from the number
> > > > > of numa nodes and sockets, which implicitly requires that the
> > > > > machine 'must' be configured with -numa options.
> > > > >
> > > > > Maybe we should drop this implicit calculation along with
> > > > > topo_ids->node_id and reuse '-smp dies=X', plus extra checks for
> > > > > EPYC to ask for -numa if there is more than 1 die. And if we need
> > > > > to be really strict, add checks for the limit of dies/cores within
> > > > > a socket/die according to spec[2], so that the encoded APIC ID and
> > > > > CPUID_8000001E match the spec.
> > > >
> > > > There will be complications when the user configures both die_id
> > > > and numa_id. It will complicate things further. I will have to
> > > > look closely at the code to see if it is feasible.
> > >
> > > it's worth a try.
> > > Conceptually, die_id in intel/amd is the same. Most likely intel
> > > has a dedicated memory controller on each die, so it still should
> > > form a NUMA node. But that aspect was probably ignored while
> > > implementing it in QEMU, so the burden of configuring QEMU right is
> > > on the user's shoulders (there are no checks whatsoever that a cpu
> > > belonging to a specific die is on the right NUMA node).
> >
> > So you are suggesting to use die_id to build the topology for EPYC.
> > Also initialize die_id based on the numa information. Re-arrange the
> > numa code to make sure we have all the information before we build the
> > topology. And then remove the node_id inside X86CPUTopoIDs. Is that
> > the plan?
> reusing die_id might simplify the logic, and at the very least we won't
> have 2 very similar fields to deal with. With die_id it should be
> conditional on EPYC.

Not convinced that using die_id will solve the problem here. But I am
going to investigate it a little bit.
> Regardless of using die_id, we should
> 
> (1) error out if the topology will require more than 1 numa node and no
> numa config was provided.

We already have a node_id check in numa_cpu_pre_plug. Do you want me to
bring this check into pc_cpu_pre_plug?
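
For concreteness, a standalone sketch of the early error-out asked for in
(1); the parameters stand in for state QEMU would really derive from
-smp/-numa, so this is not the real QEMU plumbing:

    #include <stdio.h>
    #include <stdlib.h>

    /* Refuse to build a multi-node EPYC topology with no numa config. */
    static void check_numa_required(unsigned nodes_needed,
                                    unsigned numa_nodes_configured)
    {
        if (nodes_needed > 1 && numa_nodes_configured == 0) {
            fprintf(stderr,
                    "topology needs %u nodes: please provide -numa options\n",
                    nodes_needed);
            exit(EXIT_FAILURE);
        }
    }

    int main(void)
    {
        check_numa_required(1, 0);   /* fine: 1 node, no -numa needed */
        check_numa_required(4, 0);   /* errors out, as suggested above */
        return 0;
    }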

> (2) for 1 NUMA node, use autonuma to create 1 node implicitly; that
> requires converting the static MachineClass::auto_enable_numa into an
> instance-specific value, i.e. moving it into MachineState, so that we can
> change it at runtime depending on the CPU type.

Isn't that already taken care of in numa_complete_configuration when
num_nodes = 0? Where does this change go, if it is required at all?
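
For what it's worth, a standalone analogy of the conversion described in
(2); all names here are illustrative stand-ins, not the real QEMU
declarations:

    #include <stdbool.h>
    #include <stdio.h>

    /* Class-level default vs. per-instance runtime value. */
    struct machine_class { bool auto_enable_numa; };
    struct machine_state { bool auto_enable_numa; };

    static void machine_init(struct machine_state *ms,
                             const struct machine_class *mc)
    {
        /* seed the instance value from the class default ... */
        ms->auto_enable_numa = mc->auto_enable_numa;
    }

    int main(void)
    {
        struct machine_class mc = { .auto_enable_numa = false };
        struct machine_state ms;

        machine_init(&ms, &mc);
        /* ... so it can still be flipped at runtime, e.g. for EPYC */
        ms.auto_enable_numa = true;
        printf("auto_enable_numa=%d\n", ms.auto_enable_numa);
        return 0;
    }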

> (3) use the NUMA id from CPU::node-id for CPUID_8000001E and have checks
>     that will ensure that the used value fits in the CPUID leaf.

Node_id is already part of apic_id. We can easily extract it from apic_id.
I have already sent the patch to simplify CPUID_8000001E. I will make it
part of this series.
https://lore.kernel.org/qemu-devel/159164753686.20543.4158548114186964547.stgit@naples-babu.amd.com/
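
As a sketch of that extraction (the mask and shift are assumptions taken
from the ApicId layout in the PPR excerpt earlier in the thread, where
ApicId[5:4] holds the node id for this EPYC configuration):

    #include <stdio.h>

    static unsigned node_id_from_apic_id(unsigned apic_id)
    {
        return (apic_id >> 4) & 0x3;   /* ApicId[5:4] = Node ID */
    }

    int main(void)
    {
        printf("node=%u\n", node_id_from_apic_id(0x3e)); /* -> node 3 */
        return 0;
    }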

> 
> 
> 
> 
> > > What AMD has implemented on top of that in CPU hw is to expose the
> > > NUMA node id in CPUID_8000001E. I don't know why it was done, as
> > > usually it's the ACPI tables that describe relations between nodes,
> > > so for the OS this info is almost useless (I'd guess it's faster to
> > > use CPUID instead of fetching a per-cpu variable, but that's pretty
> > > much it from the OS point of view)
> > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > > > > > >>>> +    topo_ids->pkg_id = props.has_socket_id ? props.socket_id : 0;
> > > > > > > > >>>> +}
> > > > > > > > >>>>  /*
> > > > > > > > >>>>   * Make APIC ID for the CPU 'cpu_index'
> > > > > > > > >>>>   *
> > > > > > > > >>>>
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > >
> >
> >



