qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [RFC PATCH v3] numa: enable sparse node numbering


From: Eduardo Habkost
Subject: Re: [Qemu-devel] [RFC PATCH v3] numa: enable sparse node numbering
Date: Thu, 26 Jun 2014 15:39:32 -0300
User-agent: Mutt/1.5.23 (2014-03-12)

On Thu, Jun 26, 2014 at 10:58:07AM -0700, Nishanth Aravamudan wrote:
> On 26.06.2014 [17:09:25 +0800], Hu Tao wrote:
> > On Wed, Jun 25, 2014 at 09:23:17PM +0300, Michael S. Tsirkin wrote:
> > > On Wed, Jun 25, 2014 at 01:52:56PM -0300, Eduardo Habkost wrote:
> > > > On Wed, Jun 25, 2014 at 09:13:59AM -0700, Nishanth Aravamudan wrote:
> > > > > On 25.06.2014 [13:21:34 +0200], Igor Mammedov wrote:
> > > > > > On Tue, 24 Jun 2014 10:40:38 -0700
> > > > > > Nishanth Aravamudan <address@hidden> wrote:
> > > > > <snip>
> > > > > > > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> > > > > > > index 277230d..b90bf66 100644
> > > > > > > --- a/include/sysemu/sysemu.h
> > > > > > > +++ b/include/sysemu/sysemu.h
> > > > > > > @@ -145,11 +145,13 @@ extern int mem_prealloc;
> > > > > > >   */
> > > > > > >  #define MAX_CPUMASK_BITS 255
> > > > > > >  
> > > > > > > -extern int nb_numa_nodes;
> > > > > > > +extern int nb_numa_nodes; /* Number of NUMA nodes */
> > > > > > > +extern int max_numa_node; /* Highest specified NUMA node ID */
> > > > > > >  typedef struct node_info {
> > > > > > >      uint64_t node_mem;
> > > > > > >      DECLARE_BITMAP(node_cpu, MAX_CPUMASK_BITS);
> > > > > > >      struct HostMemoryBackend *node_memdev;
> > > > > > > +    bool present;
> > > > > > How about dropping 'present' and replacing array with a list
> > > > > > of only present nodes?
> > > > > 
> > > > > If that would be preferred, I can move to that. I assume a simple
> > > > > linked-list is fine. Does qemu provide any infrastructure for defining
> > > > > lists? I'll look through the source but any pointers would be helpful.
> > > > > 
> > > > > Generally speaking, sparse NUMA nodes aren't that common and when they
> > > > > exist, the gaps aren't large. But it does seem to make sense if we 
> > > > > have
> > > > > sparse IDs at all, we might as well move to a list.
> > > > > 
> > > > > In any case, moving to the list means we'd have a nodeid as part of 
> > > > > the
> > > > > structure instead.
> > > > > 
> > > > > > That way it will be one more step closer to converting numa
> > > > > > infrastructure to a set of QOM objects.
> > > > > 
> > > > > Sounds like a good idea to me. I'll respin the patch soon.
> > > > 
> > > > Having a list makes sense, the only difference is that keeping a sparse
> > > > array sorted is much easier than making a sorted list (because the ACPI
> > > > tables are nodeid-ordered). That's why I suggested keeping the array
> > > > initially.
> > > > 
> > > > Adding a "present" field to the array is a trivial and easy-to-review
> > > > change. Changing NodeInfo to use linked lists is a more complex change
> > > > that I wouldn't want to include after soft freeze.
> > > > 
> > > > In other words:
> > > >  * Having a list is better than a sparse array; but:
> > > >  * Having a small sparse array with the "present" field is better
> > > >    than broken sparse nodeid support (IMO).
> > > 
> > > I agree here. This patchset is still RFC but if that's
> > > the only issue I might apply it.
> > 
> > I don't think it is ready for 2.1 since it lacks arch-specific changes.
> > The patch itself only makes qemu be able to bring up guest with sparse
> > numa nodes, without arch-specific changes, guest sees only one node(at
> > least on x86).
> 
> Well, it does no harm or foul to have it, on it's own. In that, nothing
> is worse or better than it was. Although I would argue the qemu core is
> better than it was :) I have sent the powerpc enabling patch, but even
> with that, we need Alexey's series of 6 patches (which do fix a
> regression) to fully support the sparse numbering. Maybe that means I've
> convinced myself it's best to wait until after 2.1.
> 
> I honestly am not sure how to "fix" the x86 code to support the sparse
> numbering -- the APIC code seems to rely on it being continuous. From
> some googling, it seems like the most common reason for non-continuous
> numbering on x86 is faulty hardware. I will keep fiddling if no one else
> steps up, but I'm not an x86 NUMA expert :)

There are two ACPI table building methods: the fw_cfg-based (where
SeaBIOS builds the SRAT table), and the acpi-build.c code.

The fw_cfg-based method assumes contigous nodeids: it builds a table
with nb_numa_nodes entries, carrying only the memory size of each node.

The acpi-build.c code could easily use the "present" flag, and doesn't
seem to require contiguous nodeids. The ACPI specs don't seem to have
any requirement about contiguous proximity domain IDs, but I am not sure
about how guest OSes would react.

acpi-build.c has some confusing code to make sure the table has at least
nb_numa_nodes+2 entries (adding disabled entries at the end of the table
if necessary), but I don't understand why. Maybe it is just for
compatibility with SeaBIOS versions that calculated the SRAT table size
based on nb_numa_nodes beforehand.

But we need to this carefully, as changing the behavior again later
would require compatibility code, and would make things worse.  I don't
even know if guest OSes would properly handle non-contiguous node IDs.

In either case, QEMU is currently broken when nodeids are not contiguous
or nodeids are repeated in the command-line. So I suggest we do this on
QEMU 2.1:

 * Add the "present" field (required to allow us to detect
   command-lines with missing nodeids);
 * Reject the command-line if a nodeid is specified twice;
 * Make the PC initialization code reject the configuration if some
   nodeid is not present. This is a bug fix, not a regression, as
   non-contiguous nodeids were never supported.

This way, we are not introducing new behavior to keep compatibility in
the future, but we are fixing the bug where QEMU doesn't complain about
non-contiguous nodeids. And after 2.1, we can make PC support
non-contiguous nodeids if we agree it is useful.

-- 
Eduardo



reply via email to

[Prev in Thread] Current Thread [Next in Thread]