
From: Alexander Graf
Subject: Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
Date: Wed, 23 Nov 2011 19:34:37 +0100
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.14) Gecko/20110221 SUSE/3.1.8 Thunderbird/3.1.8

On 11/23/2011 04:03 PM, Andrea Arcangeli wrote:
Hi!

On Mon, Nov 21, 2011 at 07:51:21PM -0600, Anthony Liguori wrote:
Fundamentally, the entity that should be deciding what memory should be present
and where it should be located is the kernel.  I'm fundamentally opposed to trying
to make QEMU override the scheduler/mm by using cpu or memory pinning in QEMU.

From what I can tell about ms_mbind(), it just uses process knowledge to bind
specific areas of memory to a memsched group and lets the kernel decide what to
do with that knowledge.  This is exactly the type of interface that QEMU should
be using.

QEMU should tell the kernel enough information such that the kernel can make
good decisions.  QEMU should not be the one making the decisions.
True, QEMU won't have to decide where the memory and vcpus should be
located (but hey, it wouldn't need to decide that even with cpusets: you
can use relative mbind with cpusets, and the admin or a cpuset job
scheduler could decide), but it's still QEMU making the decision of which
memory and which vcpu threads to pass to ms_mbind/ms_tbind. Think about
how you're going to create the input to those syscalls...

If it weren't qemu deciding that, qemu wouldn't be required to scan
the whole host physical numa (cpu/memory) topology in order to create
the "input" arguments of ms_mbind/ms_tbind. And when you migrate the
VM to another host, the whole vtopology may become counter-productive,
because the kernel isn't automatically detecting the numa affinity
between threads, and the guest vtopology will stick to whatever
_physical_ numa topology was seen on the first node where the VM was
created.
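
(For illustration only, here is a minimal sketch, using libnuma, of the
host-side topology scan qemu would have to do just to build that input;
the ms_mbind/ms_tbind calls themselves appear only in comments because
their final signatures are still RFC material.)

#include <numa.h>      /* libnuma v2: build with -lnuma */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "host reports no NUMA support\n");
        return 1;
    }

    int max_node = numa_max_node();
    int max_cpu  = numa_num_configured_cpus();

    for (int node = 0; node <= max_node; node++) {
        long long free_bytes;
        long long size = numa_node_size64(node, &free_bytes);

        /* which host cpus belong to this host node */
        struct bitmask *cpus = numa_allocate_cpumask();
        numa_node_to_cpus(node, cpus);

        printf("host node %d: %lld bytes (%lld free), cpus:",
               node, size, free_bytes);
        for (int cpu = 0; cpu < max_cpu; cpu++) {
            if (numa_bitmask_isbitset(cpus, cpu))
                printf(" %d", cpu);
        }
        printf("\n");

        /* ...and only after this scan could qemu decide which guest RAM
         * ranges and which vcpu thread tids to hand to ms_mbind()/
         * ms_tbind() for this node -- exactly the decision I'd prefer
         * qemu not to have to make. */

        numa_free_cpumask(cpus);
    }
    return 0;
}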

I doubt that the assumption that all cloud nodes will have the same
physical numa topology is reasonable.

Furthermore, to get the same benefits that qemu gets on the host by using
ms_mbind/ms_tbind, every single guest application would have to be
modified to scan the guest vtopology and call ms_mbind/ms_tbind too (or
use hard bindings, which is what we are trying to avoid).

I think it's unreasonable to expect all applications to use
ms_mbind/ms_tbind in the guest; at best guest apps will use cpusets or
wrappers, and few apps will be modified to use sys_ms_tbind/mbind.

You can always have the supercomputer case with just one optimized app
and a single VM spanning the whole host, but in that scenario hard
bindings would work perfectly too.

In my view the trouble with the numa hard bindings is not the fact
that they're hard and that qemu also has to decide the location (in fact
it doesn't need to decide the location if you use cpusets and relative
mbinds). The bigger problem is that either the admin or the app
developer has to explicitly scan the numa physical topology (both cpus
and memory) and tell the kernel how much memory to bind to each
thread. ms_mbind/ms_tbind only partially solve that problem. They're
similar to mbind with MPOL_F_RELATIVE_NODES and cpusets, except that you
don't need an admin or a cpuset job scheduler (or a perl script) to
redistribute the hardware resources.
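
(To make the comparison concrete, a relative mbind looks roughly like the
sketch below; error handling is trimmed, and it assumes the task has
already been placed into a cpuset by the admin or the job scheduler, so
"node 0" means the first node of whatever mems that cpuset allows.)

#define _GNU_SOURCE
#include <numaif.h>      /* mbind(), MPOL_BIND; link with -lnuma */
#include <sys/mman.h>
#include <stdio.h>

#ifndef MPOL_F_RELATIVE_NODES
#define MPOL_F_RELATIVE_NODES (1 << 14)   /* from linux/mempolicy.h */
#endif

int main(void)
{
    size_t len = 1UL << 30;               /* say, 1G of guest RAM */
    void *ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ram == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* bind this region to the *first* node allowed by our cpuset,
     * whichever physical node that happens to be right now */
    unsigned long nodemask = 1UL;         /* relative node 0 */
    if (mbind(ram, len, MPOL_BIND | MPOL_F_RELATIVE_NODES,
              &nodemask, sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");
        return 1;
    }
    return 0;
}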

Well yeah, of course the guest needs to see some topology. I don't see why we'd have to actually scan the host for this though. All we need to tell the kernel is "this memory region is close to that thread".

So if you define "-numa node,mem=1G,cpus=0" then QEMU should be able to tell the kernel that this GB of RAM actually is close to that vCPU thread.

Of course the admin still needs to decide how to split up memory. That's the deal with emulating real hardware: you get the interfaces hardware gets :). However, if you follow a reasonable default strategy, such as splitting your RAM into equal NUMA chunks across guest vCPUs, you're probably close enough to optimal usage models. Or at least you could have a close enough approximation of how this mapping could work for the _guest_ regardless of the host, and when you migrate it somewhere else it should also work reasonably well.
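
For what it's worth, here's a rough sketch of what that default split
amounts to; the ms_mbind()/ms_tbind() consumers (and the guest_ram_ptr /
vcpu tid bookkeeping) are hypothetical, so they only show up as comments:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t ram_size   = 4ULL << 30;  /* -m 4096 */
    int      smp_cpus   = 4;           /* -smp 4 */
    int      numa_nodes = 2;           /* two "-numa node,..." options */

    uint64_t chunk = ram_size / numa_nodes;
    int cpus_per_node = smp_cpus / numa_nodes;

    for (int node = 0; node < numa_nodes; node++) {
        uint64_t start = node * chunk;
        int first_cpu = node * cpus_per_node;
        int last_cpu  = first_cpu + cpus_per_node - 1;

        printf("guest node %d: RAM [0x%llx, 0x%llx) close to vCPUs %d-%d\n",
               node, (unsigned long long)start,
               (unsigned long long)(start + chunk), first_cpu, last_cpu);

        /* here qemu would, e.g.:
         *   ms_mbind(node, guest_ram_ptr + start, chunk);
         *   ms_tbind(node, vcpu_thread_tid[cpu]);  for each cpu in node
         * and let the kernel place both wherever it likes. */
    }
    return 0;
}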


Now dealing with bindings isn't a big deal for qemu; in fact this API is
pretty much ideal for qemu. But it won't make life substantially easier
compared to hard bindings: the management code that is now done with a
perl script will simply have to be moved into the kernel. It looks like
an incremental improvement over relative mbind+cpusets, but I'm unsure
it's the best we could aim for and what we really need in virt,
considering we deal with VM migration too.

The real long term design to me is not to add more syscalls, but
initially to handle the case of a process/VM spanning not more than one
node in both thread count and amount of memory. That's not too hard, and
in fact I already have benchmarks for the scheduler showing it to work
pretty well (it creates too strict an affinity, but that can be relaxed
to be more useful). Then later add some mechanism (the simplest is page
faults at low frequency) to create a
guest_vcpu_thread<->host_memory affinity and have a paravirtualized
interface that tells the guest scheduler to group CPUs.

If the guest scheduler runs free and is allowed to move threads around
randomly, without any paravirtualized interface controlling CPU
thread migration in the guest scheduler, the thread<->memory affinity
on the host will be hopeless. But a paravirtualized interface that makes
a guest thread stick to vcpu0/1/2/3 and not wander onto vcpu4/5/6/7
will allow the host to create a more meaningful guest_thread<->physical_ram
affinity through KVM page faults. And then this will also work
with VM migration, and without having to create a vtopology in the guest.
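
(That paravirt channel doesn't exist yet, so purely as a guest-side
illustration of what "stick to vcpu0-3" would mean, here is the same
grouping expressed with plain hard affinity via sched_setaffinity() --
which is of course exactly the hard-binding hammer the interface is
meant to soften into a scheduler hint:)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* keep the calling guest thread on vcpu0..vcpu(nr_vcpus-1) so its memory
 * accesses stay measurable against a single host node */
static int stick_to_vcpu_group(int first_vcpu, int nr_vcpus)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int v = first_vcpu; v < first_vcpu + nr_vcpus; v++)
        CPU_SET(v, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 == this thread */
}

int main(void)
{
    if (stick_to_vcpu_group(0, 4) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("thread now confined to vcpu0-3\n");
    return 0;
}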

So you want to basically dynamically create NUMA topologies from the runtime behavior of the guest? What if it changes over time?

And for apps running in the guest, no paravirt will be needed of course.

The reason paravirt would be needed for qemu-kvm with fully automatic
thread<->memory affinity is that the vcpu threads are magic. What runs
in a vcpu thread are guest threads, and those can be moved by the
guest CPU scheduler from vcpu0 to vcpu7. If that happens and we have 4
physical cpus per physical node, any affinity we measure on the
host will be meaningless. Normal threads using NPTL won't behave like
that. Maybe some other thread library with a "scheduler" inside could
make a thread behave like a vcpu thread (one real thread with several
threads inside it), but those existed mostly to simulate multiple
threads in a single thread, so they don't matter. And in this
respect sys_tbind also requires the tid to have a meaningful memory
affinity. sys_tbind/mbind gets away with it by creating a vtopology in
the guest, so the guest scheduler would then follow the vtopology (but
the vtopology breaks across VM migration, and for it to really be
followed well with sys_mbind/tbind, all apps would have to be modified).

Grouping guest threads so they stick to some vcpus sounds immensely
simpler than changing the whole guest vtopology at runtime, which would
involve changing the memory layout too.

NOTE: the paravirt cpu grouping interface would also handle the case
of 3 guests of 2.5G on an 8G host (4G per node). One of the three
guests will have memory spanning two nodes (the first two guests leave
only 1.5G free on each node, so the third must take 1.5G from one node
and 1G from the other), and the guest vtopology created by
sys_mbind/tbind can't handle that, while paravirt cpu grouping and
automatic thread<->memory affinity on the host will handle it, just as
it will handle VM migration across hosts with different physical
topologies. The problem is that to create a thread<->memory affinity
we'll have to issue some page faults in KVM in the background. How
harmful that is I don't know at this point. So the fully automatic
thread<->memory affinity is a bit of a vapourware concept at this point
(process<->memory affinity seems to work already, though).

But Peter's migration code was already driven by page faults (not
included in the patch he posted), and the other existing patch,
migrate-on-fault, also depended on page faults. So I am optimistic we
could have thread<->memory affinity working too in the longer term. The
plan would be to run the page faults at low frequency, and only if we
can't fit a process into one node (in terms of both number of threads
and amount of memory). If the process fits in one node, we wouldn't even
need any page faults, and the information in the pagetables will be
enough to make a good decision. The downside is that the
thread<->memory affinity is significantly more difficult to implement,
and that's why I'm focusing initially on the simpler case of considering
only the process<->memory affinity. That's fairly easy.

So for the time being this incremental improvement may be justified; it
moves the logic from a perl script into the kernel. But I'm just
skeptical it provides a big advantage compared to the numa bindings we
already have in the kernel, especially if in the long term we can get
rid of a vtopology completely.

I actually like the idea of just telling the kernel how close memory will be to a thread. Sure, you can handle this basically by shoving your scheduler into user space, but isn't managing processes what a kernel is supposed to do in the first place?

You can always argue for a microkernel, but having a scheduler in user space (perl script) and another one in the kernel doesn't sound very appealing to me. If you want to go full-on user space, sure, I can see why :).

Either way, your approach sounds like it's very much in the concept phase, while this is something that can actually be tested and benchmarked against today. So yes, I want the interim solution - just in case your plan doesn't work out :). Oh, and then there's the non-PV guests too...


Alex



