qemu-stable
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-stable] [Qemu-devel] [PATCH] pc: acpi: revert back to 1 SRAT e


From: Yu Zhang
Subject: Re: [Qemu-stable] [Qemu-devel] [PATCH] pc: acpi: revert back to 1 SRAT entry for hotpluggable area
Date: Fri, 24 Aug 2018 15:54:58 +0800
User-agent: NeoMutt/20180622-66-b94505

On Thu, Aug 23, 2018 at 02:34:31PM +0200, Igor Mammedov wrote:
> On Thu, 23 Aug 2018 17:01:33 +0800
> Yu Zhang <address@hidden> wrote:
> 
> > On 8/23/2018 2:01 AM, Eduardo Habkost wrote:
> > > On Wed, Aug 22, 2018 at 03:05:36PM +0200, Igor Mammedov wrote:  
> > >> On Wed, 22 Aug 2018 12:06:26 +0200
> > >> Laszlo Ersek <address@hidden> wrote:
> > >>  
> > >>> On 08/22/18 11:46, Igor Mammedov wrote:  
> > >>>> Commit
> > >>>>    10efd7e108 "pc: acpi: fix memory hotplug regression by reducing 
> > >>>> stub SRAT entry size"
> > >>>> attemped to fix hotplug regression introduced by
> > >>>>    848a1cc1e "hw/acpi-build: build SRAT memory affinity structures for 
> > >>>> DIMM devices"
> > >>>>
> > >>>> fixed issue for Windows/3.0+ linux kernels, however it regressed 2.6 
> > >>>> based
> > >>>> kernels (RHEL6) to the point where guest might crash at boot.
> > >>>> Reason is that 2.6 kernel discards SRAT table due too small last entry
> > >>>> which down the road leads to crashes. Hack I've tried in 10efd7e108 is 
> > >>>> also
> > >>>> not ACPI spec compliant according to which whole possible RAM should be
> > >>>> described in SRAT. Revert 10efd7e108 to fix regression for 2.6 based 
> > >>>> kernels.
> > >>>>
> > >>>> With 10efd7e108 reverted, I've also tried splitting SRAT table 
> > >>>> statically
> > >>>> in different ways %/node and %/slot but Windows still fails to online
> > >>>> 2nd pc-dimm hot-plugged into node 0 (as described in 10efd7e108) and
> > >>>> sometimes even coldplugged pc-dimms where affected with static SRAT
> > >>>> partitioning.
> > >>>> The only known so far way where Windows stays happy is when we have 1
> > >>>> SRAT entry in the last node covering all hotplug area.
> > >>>>
> > >>>> Revert 848a1cc1e until we come up with a way to avoid regression
> > >>>> on Windows with hotplug area split in several entries.
> > >>>> Tested this with 2.6/3.0 based kernels (RHEL6/7) and 
> > >>>> WS20[08/12/12R2/16]).
> > >>>>
> > >>>> Signed-off-by: Igor Mammedov <address@hidden>
> > >>>> ---
> > >>>> CC: address@hidden
> > >>>> CC: address@hidden
> > >>>> CC: address@hidden
> > >>>> CC: address@hidden
> > >>>> CC: address@hidden
> > >>>> ---
> > >>>>   hw/i386/acpi-build.c | 73 
> > >>>> +++++++++-------------------------------------------
> > >>>>   1 file changed, 12 insertions(+), 61 deletions(-)
> > >>>>
> > >>>> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> > >>>> index e1ee8ae..1599caa 100644
> > >>>> --- a/hw/i386/acpi-build.c
> > >>>> +++ b/hw/i386/acpi-build.c
> > >>>> @@ -2251,64 +2251,6 @@ build_tpm2(GArray *table_data, BIOSLinker 
> > >>>> *linker, GArray *tcpalog)
> > >>>>   #define HOLE_640K_START  (640 * KiB)
> > >>>>   #define HOLE_640K_END   (1 * MiB)
> > >>>>   
> > >>>> -static void build_srat_hotpluggable_memory(GArray *table_data, 
> > >>>> uint64_t base,
> > >>>> -                                           uint64_t len, int 
> > >>>> default_node)
> > >>>> -{
> > >>>> -    MemoryDeviceInfoList *info_list = qmp_memory_device_list();
> > >>>> -    MemoryDeviceInfoList *info;
> > >>>> -    MemoryDeviceInfo *mi;
> > >>>> -    PCDIMMDeviceInfo *di;
> > >>>> -    uint64_t end = base + len, cur, size;
> > >>>> -    bool is_nvdimm;
> > >>>> -    AcpiSratMemoryAffinity *numamem;
> > >>>> -    MemoryAffinityFlags flags;
> > >>>> -
> > >>>> -    for (cur = base, info = info_list;
> > >>>> -         cur < end;
> > >>>> -         cur += size, info = info->next) {
> > >>>> -        numamem = acpi_data_push(table_data, sizeof *numamem);
> > >>>> -
> > >>>> -        if (!info) {
> > >>>> -            /*
> > >>>> -             * Entry is required for Windows to enable memory hotplug 
> > >>>> in OS
> > >>>> -             * and for Linux to enable SWIOTLB when booted with less 
> > >>>> than
> > >>>> -             * 4G of RAM. Windows works better if the entry sets 
> > >>>> proximity
> > >>>> -             * to the highest NUMA node in the machine at the end of 
> > >>>> the
> > >>>> -             * reserved space.
> > >>>> -             * Memory devices may override proximity set by this 
> > >>>> entry,
> > >>>> -             * providing _PXM method if necessary.
> > >>>> -             */
> > >>>> -            build_srat_memory(numamem, end - 1, 1, default_node,
> > >>>> -                              MEM_AFFINITY_HOTPLUGGABLE | 
> > >>>> MEM_AFFINITY_ENABLED);
> > >>>> -            break;
> > >>>> -        }
> > >>>> -
> > >>>> -        mi = info->value;
> > >>>> -        is_nvdimm = (mi->type == MEMORY_DEVICE_INFO_KIND_NVDIMM);
> > >>>> -        di = !is_nvdimm ? mi->u.dimm.data : mi->u.nvdimm.data;
> > >>>> -
> > >>>> -        if (cur < di->addr) {
> > >>>> -            build_srat_memory(numamem, cur, di->addr - cur, 
> > >>>> default_node,
> > >>>> -                              MEM_AFFINITY_HOTPLUGGABLE | 
> > >>>> MEM_AFFINITY_ENABLED);
> > >>>> -            numamem = acpi_data_push(table_data, sizeof *numamem);
> > >>>> -        }
> > >>>> -
> > >>>> -        size = di->size;
> > >>>> -
> > >>>> -        flags = MEM_AFFINITY_ENABLED;
> > >>>> -        if (di->hotpluggable) {
> > >>>> -            flags |= MEM_AFFINITY_HOTPLUGGABLE;
> > >>>> -        }
> > >>>> -        if (is_nvdimm) {
> > >>>> -            flags |= MEM_AFFINITY_NON_VOLATILE;
> > >>>> -        }
> > >>>> -
> > >>>> -        build_srat_memory(numamem, di->addr, size, di->node, flags);
> > >>>> -    }
> > >>>> -
> > >>>> -    qapi_free_MemoryDeviceInfoList(info_list);
> > >>>> -}
> > >>>> -
> > >>>>   static void
> > >>>>   build_srat(GArray *table_data, BIOSLinker *linker, MachineState 
> > >>>> *machine)
> > >>>>   {
> > >>>> @@ -2414,10 +2356,19 @@ build_srat(GArray *table_data, BIOSLinker 
> > >>>> *linker, MachineState *machine)
> > >>>>           build_srat_memory(numamem, 0, 0, 0, MEM_AFFINITY_NOFLAGS);
> > >>>>       }
> > >>>>   
> > >>>> +    /*
> > >>>> +     * Entry is required for Windows to enable memory hotplug in OS
> > >>>> +     * and for Linux to enable SWIOTLB when booted with less than
> > >>>> +     * 4G of RAM. Windows works better if the entry sets proximity
> > >>>> +     * to the highest NUMA node in the machine.
> > >>>> +     * Memory devices may override proximity set by this entry,
> > >>>> +     * providing _PXM method if necessary.
> > >>>> +     */
> > >>>>       if (hotplugabble_address_space_size) {
> > >>>> -        build_srat_hotpluggable_memory(table_data, 
> > >>>> machine->device_memory->base,
> > >>>> -                                       
> > >>>> hotplugabble_address_space_size,
> > >>>> -                                       pcms->numa_nodes - 1);
> > >>>> +        numamem = acpi_data_push(table_data, sizeof *numamem);
> > >>>> +        build_srat_memory(numamem, machine->device_memory->base,
> > >>>> +                          hotplugabble_address_space_size, 
> > >>>> pcms->numa_nodes - 1,
> > >>>> +                          MEM_AFFINITY_HOTPLUGGABLE | 
> > >>>> MEM_AFFINITY_ENABLED);
> > >>>>       }
> > >>>>   
> > >>>>       build_header(linker, table_data,
> > >>>>      
> > >>> So this reverts both 10efd7e108 (which only moved the regression around)
> > >>> and the earlier 848a1cc1e (which had introduced the regression in the
> > >>> first place).
> > >>>
> > >>> If I understand correctly, the issue that 848a1cc1e was meant to solve
> > >>> was that "-device nvdimm,node=..." could be passed by the user such that
> > >>> it lead to "proximity domain mismatch". (What was the higher-level goal
> > >>> with that BTW?)
> > >>>
> > >>> However, that mismatch could as well be avoided by simply not passing
> > >>> such "node=..." properties. Is that right?
> > >>>
> > >>> If so, should we perhaps add another patch (temporarily), beyond this
> > >>> revert, that identifies the misconfig in question, and prints a warning
> > >>> about it?  
> > >> not exactly, before the patch node=... influenced only _PXM which is 
> > >> supposed
> > >> to be evaluated after SRAT table and override whatever was in static 
> > >> table.
> > >>
> > >> The patch, as I understood it, was about ACPI spec compliance where 
> > >> nvdimm SPA
> > >> in NFIT table should have a corresponding entry in SRAT table with
> > >> MEM_AFFINITY_NON_VOLATILE flag set.
> > >> Whether it solves any real issues aren't known to me and typically for
> > >> Intel contributed patches, author's email doesn't exists anymore so ...
> > >> And even if it fixes some new nvdimm issue, I'd rather have it broken
> > >> than regress memory hotplug that worked fine so far and wait for another
> > >> nvdimm patch that won't break existing features.
> > >>  
> > >>> Anyway the revert seems justified to me.  
> > >> I've killed quite enough time trying to find out a way to keep
> > >> everyone happy, but alas it's time to give up and let interested
> > >> party to deal with regressions while introducing new stuff for
> > >> nvdimm, hence this revert.  
> > > If the original author isn't available, I agree the best option
> > > is to revert it to avoid the regression by now.  
> > 
> > Thanks Edurado, for informing us, and sorry for the delayed reply.
> > I'm not the author of the original patch(848a1cc1e), and is just trying 
> > to understand
> > this part now...
> > 
> > And my understanding is that the patch was cooked to make sure data in SRAT
> > align with the ones in NFIT.
> > 
> > > Reviewed-by: Eduardo Habkost <address@hidden>
> > >
> > > However, have you considered keeping adding separate entries for
> > > NVDIMM devices only (so we follow the spec), but add a single
> > > (numa_nodes-1, MEM_AFFINITY_HOTPLUGGABLE|MEM_AFFINITY_ENABLED)
> > > entry to the rest?  
> > 
> > And I think that is what Igor did in patch 10efd7e108.
> > 
> > But I feel puzzled why windows has such requirement. And I am also 
> > curious, is
> > this a windows specific issue?
> It's is a Windows issue, but QEMU doesn't differentiate between guests
> OSes so in general it should work for all of them or at least do not break
> existing features.
> 

Yes, agreed.

> 
> > Reverting 848a1cc1e will lead the VM not able to figure out the correct 
> > NUMA info,
> > e.g. when we are trying to assign different regions from NVDIMMs 
> > residing on different
> > NUMA node.
> that's probably non issue as QEMU's impl. supports only 1:1 mapping,
> i.e. SPA for whole nvdimm goes to 1 node and SPA record already
> contains node ID it belongs to.
> 
> The reason why I've acked 848a1cc1e is that it's correct per ACPI spec,
> but unfortunately QE found a test case where it regresses memory hotplug,
> hence a revert. We should find out a way to be spec compliant and
> not to break existing features.
> 

I agree it's not easy to match all the requirements. So let's revert this
first. :)

B.R.
Yu

> Perhaps you'd be able to figure out how to accomplish both goals,
> I'd be happy to help you out if you'd need some assistance.
> 
> 
> > 
> > B.R.
> > Yu
> > 
> > > This way both use cases should still work as expected.  If
> > > Windows break if using NVDIMM + Memory Hotplug at the same time,
> > > maybe that's just a Windows bug we can't work around.
> > >  
> 
> 



reply via email to

[Prev in Thread] Current Thread [Next in Thread]