qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 1/2] hw/386: Fix uninitialized memory with -device and CPU ho


From: Babu Moger
Subject: Re: [PATCH 1/2] hw/386: Fix uninitialized memory with -device and CPU hotplug
Date: Thu, 25 Jun 2020 17:55:20 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.8.0


On 6/25/20 1:32 PM, Igor Mammedov wrote:
> On Thu, 25 Jun 2020 11:41:25 -0500
> Babu Moger <babu.moger@amd.com> wrote:
> 
>> Igor,
>>
>>> -----Original Message-----
>>> From: Igor Mammedov <imammedo@redhat.com>
>>> Sent: Thursday, June 25, 2020 10:19 AM
>>> To: Moger, Babu <Babu.Moger@amd.com>
>>> Cc: ehabkost@redhat.com; mst@redhat.com; qemu-devel@nongnu.org;
>>> pbonzini@redhat.com; rth@twiddle.net
>>> Subject: Re: [PATCH 1/2] hw/386: Fix uninitialized memory with -device and 
>>> CPU
>>> hotplug
>>>
>>> On Wed, 24 Jun 2020 12:35:59 -0500
>>> Babu Moger <babu.moger@amd.com> wrote:
>>>   
>>>>> -----Original Message-----
>>>>> From: Igor Mammedov <imammedo@redhat.com>
>>>>> Sent: Wednesday, June 24, 2020 8:48 AM
>>>>> To: Moger, Babu <Babu.Moger@amd.com>
>>>>> Cc: ehabkost@redhat.com; mst@redhat.com; qemu-devel@nongnu.org;
>>>>> pbonzini@redhat.com; rth@twiddle.net
>>>>> Subject: Re: [PATCH 1/2] hw/386: Fix uninitialized memory with -device 
>>>>> and  
>>> CPU  
>>>>> hotplug
>>>>>
>>>>> On Tue, 16 Jun 2020 12:18:56 -0500
>>>>> Babu Moger <babu.moger@amd.com> wrote:
>>>>>  
>>>>>>> -----Original Message-----
>>>>>>> From: Igor Mammedov <imammedo@redhat.com>
>>>>>>> Sent: Tuesday, June 16, 2020 5:59 AM
>>>>>>> To: Moger, Babu <Babu.Moger@amd.com>
>>>>>>> Cc: pbonzini@redhat.com; rth@twiddle.net; ehabkost@redhat.com;
>>>>>>> mst@redhat.com; marcel.apfelbaum@gmail.com; qemu-  
>>> devel@nongnu.org  
>>>>>>> Subject: Re: [PATCH 1/2] hw/386: Fix uninitialized memory with -device  
>>> and  
>>>>> CPU  
>>>>>>> hotplug
>>>>>>>
>>>>>>> On Mon, 08 Jun 2020 15:18:50 -0500
>>>>>>> Babu Moger <babu.moger@amd.com> wrote:
>>>>>>>  
>>>>>>>> Noticed the following command failure while testing CPU hotplug.
>>>>>>>>
>>>>>>>> $ qemu-system-x86_64 -machine q35,accel=kvm -smp 1,maxcpus=2,
>>>>>>>>   cores=1, threads=1,sockets=2 -cpu EPYC -device EPYC-x86_64-
>>>>>>>>   cpu,core-id=0,socket-id=1,thread-id=0
>>>>>>>>
>>>>>>>>   qemu-system-x86_64: -device EPYC-x86_64-cpu,core-id=0,socket-  
>>> id=1,  
>>>>>>>>   thread-id=0: Invalid CPU [socket: 21855, die: 0, core: 0, thread: 0]
>>>>>>>>   with APIC ID 21855, valid index range 0:1
>>>>>>>>
>>>>>>>> This happens because APIC ID is calculated using uninitialized memory.
>>>>>>>> This is happening after the addition of new field node_id in  
>>>>> X86CPUTopoIDs  
>>>>>>>> structure. The node_id field is uninitialized while calling
>>>>>>>> apicid_from_topo_ids. The problem is discussed in the thread below.
>>>>>>>>  
>>>>>>>  
>>>>>  
>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.ker  
>>>>>>> nel.org%2Fqemu-
>>>>>>>  
>>>>>  
>>> devel%2F20200602171838.GG577771%40habkost.net%2F&amp;data=02%7C01  
>>>>>>>  
>>>>>  
>>> %7Cbabu.moger%40amd.com%7C02200d75fd8b48d1955608d811e44f5b%7C3d  
>>>>>>>  
>>>>>  
>>> d8961fe4884e608e11a82d994e183d%7C0%7C0%7C637279019564311233&amp  
>>>>>>>  
>>>>>  
>>> ;sdata=ry3QO0Z5dxLPoRxkYVkOsVm3nl%2BxfCGv8be%2BMHdoUPY%3D&amp;r  
>>>>>>> eserved=0  
>>>>>>>>
>>>>>>>> Fix the problem by initializing the node_id properly.
>>>>>>>>
>>>>>>>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>>>>>>>> ---
>>>>>>>>  hw/i386/pc.c               |    2 ++
>>>>>>>>  include/hw/i386/topology.h |   11 +++++++++++
>>>>>>>>  2 files changed, 13 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>>>>>>> index 2128f3d6fe..974cc30891 100644
>>>>>>>> --- a/hw/i386/pc.c
>>>>>>>> +++ b/hw/i386/pc.c
>>>>>>>> @@ -1585,6 +1585,8 @@ static void pc_cpu_pre_plug(HotplugHandler  
>>>>>>> *hotplug_dev,  
>>>>>>>>          topo_ids.die_id = cpu->die_id;
>>>>>>>>          topo_ids.core_id = cpu->core_id;
>>>>>>>>          topo_ids.smt_id = cpu->thread_id;
>>>>>>>> +        topo_ids.node_id = cpu_x86_use_epyc_apic_id_encoding(ms-  
>>>>>> cpu_type)  
>>>>>>> ?  
>>>>>>>> +                           x86_node_id_for_epyc(&topo_info, 
>>>>>>>> &topo_ids) : 0;  
>>>>>>>
>>>>>>> I'd rather not calculate some default value here,
>>>>>>> this is the branch where we check user provided topology info and error 
>>>>>>>  
>>> out  
>>>>>>> asking
>>>>>>> to provide missing bits.  
>>>>>> Noticed that cpu->node_id is initialized to  
>>> 0xFF(NUMA_NODE_UNASSIGNED).  
>>>>>> We can initialize cpu->node_id to default node like how we do it in
>>>>>> x86_get_default_cpu_node_id.  We can use it to initialize  
>>> topo_ids.node_id.  
>>>>>> This is consistent with other fields core_id, die_id etc..
>>>>>>  
>>>>>>>
>>>>>>> I also wonder if we should force user to specify numa nodes on CLI if  
>>> EPYC  
>>>>> cpu is  
>>>>>>> used.
>>>>>>> (i.e. I'm assuming that EPYC always requires numa)  
>>>>>>
>>>>>> That is not true. Without numa all the cpus will be configured under one
>>>>>> default numa node 0. Like we do it using x86_get_default_cpu_node_id.  
>>>>>
>>>>> get_default_cpu_node_id() which is making things up, is going to be  
>>> removed  
>>>>> eventually in favor of asking user to provide numa mapping explicitly on 
>>>>> CLI.  
>>>>
>>>> That will be good going forward.
>>>>  
>>>>>
>>>>> now if it's always only node 0, why do we need to calculate it then,
>>>>> why not just assing 0 directly?
>>>>>
>>>>> what if we have several sockets, would all vCPUs still have node-id = 0?  
>>>>
>>>> If there are several nodes then socket id becomes node id.  
>>> I wonder if node id == socket id then why bother with node_id at all,
>>> probably node id is there to allow for design where several sockets are on
>>> the same node
>>>
>>>   
>>>>> Another question is if we have CPUs with only 1 numa node set on all of  
>>> them,  
>>>>> does it require all other machinery like ACPI SRAT table defined or it's 
>>>>> just
>>>>> internal CPU impl. detail?  
>>>>
>>>> I am not very sure about it. To me it looks like it is just internal cpu
>>>> implementation detail.  
>>> I'd think it might confuse guest OS, when it decodes more than 1 numa node
>>> for APIC ID/CPUID but then there are no such nodes described in ACPI.
>>> While it might work for caches, it would miss any relation of memory mapping
>>> to nodes or get it wrong if one doesn't match another.
>>>
>>>   
>>>> I think we have two options here.
>>>>
>>>> 1. Update the patch to initialize the node_id the way it is done
>>>> get_default_cpu_node_id.  
>>>
>>> if it were only one node for every CPU (incl. multisocket), I'd go with 
>>> enabling
>>> autonuma assigning all CPUs to default 0 node-id, since there is no 
>>> ambiguity
>>> where
>>> CPUs and RAM are mapped to.
>>> Is it possible to use node-id=0 for all EPYC CPUs even in multisocket 
>>> config?
>>> (it seems spec allows only one node per socket, but doesn't say that node 
>>> ids
>>> must
>>> be different.)
>>> If not, then making up node-id is not an option.
>>>   
>>>> 2. Ask the user to pass the node_id while hot plugging the device. This is
>>>> a clean solution. Will require some data structure changes.  
>>>
>>> Here is my brain dump of current very non obvious flow:
>>>
>>>   1. x86_possible_cpu_arch_ids()
>>>          ms->possible_cpus->cpus[i].props.* = x86_topo_ids_from_idx()
>>>
>>>   2. possible numa_cpu_set()
>>>          ms->possible_cpus->cpus[i].props.node_id = user input|0 -in 
>>> autonuma
>>> case
>>>
>>>   3. x86_cpus_init()
>>>          // generate apic_id AND makeup node_id embedded into it
>>>          ms->possible_cpus->cpus[i].arch_id = 
>>> x86_cpu_apic_id_from_index(x86ms,
>>> i);  
>>>                         -> x86_apicid_from_cpu_idx_epyc() ->  
>>> x86_topo_ids_from_idx_epyc()
>>>                                                                  same as 
>>> x86_topo_ids_from_idx() + node_id
>>>                      or  
>>>                         -> x86_apicid_from_cpu_idx() -> 
>>> x86_topo_ids_from_idx()  
>>>
>>>   4. pc_cpu_pre_plug()
>>>          // basically topo ids module node-id is not set or user provided
>>>          cpu_slot = pc_find_cpu_slot(MACHINE(pcms), cpu->apic_id, &idx);
>>>
>>>   5.
>>>          // do it again with diff that in EPYC case it my have different 
>>> node-id than
>>> cpu_slot
>>>          x86ms->topo_ids_from_apicid(cpu->apic_id, &topo_info, &topo_ids);
>>>
>>>          //i.e. user input of node-id is ignored
>>>          set socket-id/core-id/... (but not node-id) from topo_info
>>>
>>>          numa_cpu_pre_plug(cpu_slot)
>>>                            ^^^^^^
>>>               if (node_id == CPU_UNSET_NUMA_NODE_ID) {
>>>                    if (slot->props.has_node_id)
>>>                        object_property_set_int(... slot->props.node_id, 
>>> "node-id",...);
>>>               // this applies to hotplugged without node-id and to initial 
>>> CPUs (-smp X)
>>>               // so we may end up with "node-id" being set to user defined 
>>> value
>>>               // or left unset (no numa enabled)
>>>               // while APIC ID will have some node-id encoded in it.
>>>
>>>
>>> that's quite a mess, maybe we should unify both
>>> amd make x86_apicid_from_cpu_idx_epyc()/x86_apicid_from_cpu_idx() use
>>> ms->possible_cpus->cpus[i].props instead of x86_topo_ids_from_idx()
>>> i.e
>>>
>>>        x86_apicid_from_cpu_idx_epyc() {
>>>            topoids = x86_apicid_from_cpu_idx() {
>>>                           return ms->possible_cpus->cpus[i].props
>>>                       }
>>>            if (ms->possible_cpus->cpus[i].props.has_node_id)
>>>                topoids.node_id = ms->possible_cpus->cpus[i].props.node_id
>>>            else
>>>                error_fatal("EPYC requires use of -numa to define topology 
>>> if using
>>> more than 1 socket")
>>>        }
>>>
>>> that way QEMU makes up only node[0] by enabling autonuma or whatever
>>> user privided explicitly is encoded into APIC ID and it will be always 
>>> consistent
>>> with cpu
>>> *-id properties in possible_cpus and SRAT table QEMU generates.
>>>
>>> as cleanup we can get rid of back and forth conversion [5] and use cpu_slot 
>>> to
>>> set
>>> the same ids.
>>>
>>> Also maybe we should have a check that node-id is the same within socket in
>>> case of EPYC
>>> if it's guarantied that EPYC won't support multiple nodes per socket.
>>>
>>> hope it makes at least some sense.  
>>
>> To make things clear, in case of autonuma we don't have to worry about
>> node_id. We just have to set it topo_ids.node_id to 0 in pc_cpu_pre_plug,
>> Everything will work as expected. This will solve our current problem of
>> uninitialized variable.
> 
> I'm proposing to enable autonuma, which in its turn will assign all CPUs to
> node-id=0 in possible_cpus. And once this information is in possible_cpus,
> numa_cpu_pre_plug() should take care of setting correct node-id on CPU for
> the case of initial CPUs (node_id == CPU_UNSET_NUMA_NODE_ID), and in case
> of hotplug numa_cpu_pre_plug() will complaing if user suppled nonsense on
> with device_add.
> 
>> Problem here is, when user has configured the numa, then setting the
>> topo_ids.node_id to 0 might not work because it might create duplicate
>> apicids and device_add will be rejected.  As per the comments in
> I don't see how such duplicate could be made, even if all CPUs have 0
> node-id, there are pkg_id/core_id/thread_id wich are encoded in APIC ID,
> where pkg_id is unique across machine (at least in QEMU), so I don't see
> how duplicate is possible.
> 
>> numa_cpu_pre_plug, this is already broken. Look at the comments below.
>> Looks like node_id cannot be passed down.
>> ============================================
>> if (node_id == CPU_UNSET_NUMA_NODE_ID) {
>>         /* due to bug in libvirt, it doesn't pass node-id from props on
>>          * device_add as expected, so we have to fix it up here */
>>         if (slot->props.has_node_id) {
>>             object_property_set_int(OBJECT(dev), slot->props.node_id,
>>                                     "node-id", errp);
>>         }
>           else if (epyc)
>              error("incomplete EPYC topology use -numa 
> cpu,node-id=some-id,socket-id=%d  to configure numa node for socket",
>                     cpu_socket_id)
> 
>>     } else if (node_id != slot->props.node_id) {
>> ============================================
>>
>> I was trying to solve this problem setting the node_id correctly for EPYC
>> at least.  If you think, this is not important we can ignore it (by
>> setting topo_ids.node_id to 0) and move forward.  I don't see the need for
>> changing other topology specific code as we have already made very generic.
> 
> node-id - can be passed down (problem was that libvird didn't do it back then
> for -device/device_add, hence above hack).
> 
> But that's not a problem, the problem is that x86_apicid_from_cpu_idx_epyc() 
> makes
> up node-id on its own, which is not big deal in case of autonuma since they 
> happen
> to match and there aren't any ambiguity, but with more numa nodes, numa 
> config should
> be defined by user explicitly and current code may end up with incoherent 
> config,
> where some parts of QEMU think CPU has one node-id while APIC ID is encoded 
> with another. 
> 
> So after some pondering on a subject, to make sure it will look correct from 
> all angles,
> we need to:
>  
>  1: use single source for topo ids, i.e. pull user provided node-id from 
> possible_cpus
>     for both cpu.node-id property and for APIC ID. Hence my suggestion to 
> change
>     x86_apicid_from_cpu_idx_epyc() as described above.
> 
>  2: verify that user provided id's make sense in EPYC case. (pre_plug)

Basically, you are saying to setup the props.has_node_id and props.node_id
while buidling the topology. Make sure to use the numa information if the
user has provided else build the topology as per EPYC topology model.
Right? I will have to think thru this little bit. Will try to send the
draft patch tomarrow.


> 
>>>> Let me know if you see any other option.
>>>>  
>>>>>
>>>>>  
>>>>>>>  
>>>>>>>>          cpu->apic_id = x86ms->apicid_from_topo_ids(&topo_info,  
>>>>> &topo_ids);  
>>>>>>>>      }
>>>>>>>>
>>>>>>>> diff --git a/include/hw/i386/topology.h b/include/hw/i386/topology.h
>>>>>>>> index 07239f95f4..ee4deb84c4 100644
>>>>>>>> --- a/include/hw/i386/topology.h
>>>>>>>> +++ b/include/hw/i386/topology.h
>>>>>>>> @@ -140,6 +140,17 @@ static inline unsigned  
>>>>>>> apicid_pkg_offset_epyc(X86CPUTopoInfo *topo_info)  
>>>>>>>>             apicid_node_width_epyc(topo_info);
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +static inline unsigned x86_node_id_for_epyc(X86CPUTopoInfo  
>>>>> *topo_info,  
>>>>>>>> +                                            const X86CPUTopoIDs 
>>>>>>>> *topo_ids)
>>>>>>>> +{
>>>>>>>> +    unsigned nr_nodes = MAX(topo_info->nodes_per_pkg, 1);
>>>>>>>> +    unsigned cores_per_node = DIV_ROUND_UP((topo_info-  
>>>> dies_per_pkg  
>>>>> *  
>>>>>>>> +                                            topo_info->cores_per_die *
>>>>>>>> +                                            
>>>>>>>> topo_info->threads_per_core),
>>>>>>>> +                                            nr_nodes);
>>>>>>>> +
>>>>>>>> +    return (topo_ids->core_id / cores_per_node) % nr_nodes;  
>>>>>>> what if nr_nodes == 0?
>>>>>>>  
>>>>>>>> +}
>>>>>>>>  /*
>>>>>>>>   * Make APIC ID for the CPU based on Pkg_ID, Core_ID, SMT_ID
>>>>>>>>   *
>>>>>>>>
>>>>>>>>  
>>>>>>
>>>>>>  
>>>>  
>>
> 



reply via email to

[Prev in Thread] Current Thread [Next in Thread]