Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
Thu, 01 Dec 2011 18:40:31 +0100
On Wed, 2011-11-23 at 16:03 +0100, Andrea Arcangeli wrote:
> On Mon, Nov 21, 2011 at 07:51:21PM -0600, Anthony Liguori wrote:
> > Fundamentally, the entity that should be deciding what memory should be
> > present and where it should be located is the kernel. I'm fundamentally
> > opposed to trying to make QEMU override the scheduler/mm by using cpu or
> > memory pinning in QEMU.
> >
> > From what I can tell about ms_mbind(), it just uses process knowledge to
> > bind specific areas of memory to a memsched group and lets the kernel
> > decide what to do with that knowledge. This is exactly the type of
> > interface that QEMU should be using.
> >
> > QEMU should tell the kernel enough information such that the kernel can
> > make good decisions. QEMU should not be the one making the decisions.
> True, QEMU won't have to decide where the memory and vcpus should be
> located (but hey, it wouldn't need to decide that even if you used
> cpusets; you can use relative mbind with cpusets, and the admin or a
> cpuset job scheduler could decide), but it's still QEMU making the
> decision of what memory and which vcpu threads to ms_mbind/ms_tbind.
> Think how you're going to create the input of those syscalls.
> If it weren't qemu deciding that, qemu wouldn't be required to scan
> the whole host physical numa (cpu/memory) topology in order to create
> the "input" arguments of "ms_mbind/ms_tbind".
That's a plain falsehood: you don't need to scan the host physical
topology in order to create useful ms_[mt]bind arguments. You can use
physical topology to optimize for particular hardware, but it's not a
strict requirement.
> And when you migrate the VM to another host, the whole vtopology may be
> counter-productive, because the kernel isn't automatically detecting the
> numa affinity between threads, and the guest vtopology will stick to
> whatever numa _physical_ topology was seen on the first node where the
> VM was started.
This doesn't make any sense at all.
> I doubt that the assumption that all cloud nodes will have the same
> physical numa topology is reasonable.
So what? If you want to be very careful you can make sure your vnodes are
small enough that they fit on any physical node in your cloud (god I
f*king hate that word).
If you're slightly less careful, things will still work; you might get
less max parallelism, but typically (from what I understood) these VM
hosting thingies are overloaded, so you never get your max cpu anyway.
Thing is, whatever you set up will always work. It might not be optimal,
but the one guarantee, that [threads,vrange] stay on the same node, will
be kept true no matter where you run it.
Also, migration between non-identical hosts is always 'tricky'. You're
always stuck with some minimally supported subset or average-case thing.
Really, why do you think NUMA would be any different?
> Furthermore to get the same benefits that qemu gets on host by using
> ms_mbind/ms_tbind, every single guest application should be modified
> to scan the guest vtopology and call ms_mbind/ms_tbind too (or use the
> hard bindings which is what we try to avoid).
No! ms_[tm]bind() is just part of the solution; the other part is what to
do for simple programs. Like I wrote in my email earlier, and like we
talked about in Prague, for normal simple proglets we simply pick a numa
node and stick to it, except we could actually migrate the whole thing if
needed. Basically you give each task its own single vnode and assign all
its threads to it.
Only big programs that need to span multiple nodes need to be modified
to get best advantage of numa. But that has always been true.
> In my view the trouble of the numa hard bindings is not the fact
> they're hard and qemu has to also decide the location (in fact it
> doesn't need to decide the location if you use cpusets and relative
> mbinds). The bigger problem is the fact either the admin or the app
> developer has to explicitly scan the numa physical topology (both cpus
> and memory) and tell the kernel how much memory to bind to each
> thread. ms_mbind/ms_tbind only partially solve that problem. They're
> similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you
> don't need an admin or a cpuset-job-scheduler (or a perl script) to
> redistribute the hardware resources.
You're full of crap Andrea.
Yes you need some clue as to your actual topology, but that's life, you
can't get SMP for free either, you need to have some clue.
Just like with regular SMP, where you need to be aware of data sharing,
NUMA just makes it worse. If your app decomposes well enough to create a
vnode per thread, that's excellent; if you want to scale your app to fit
your machine, that's fine too. Heck, every multi-threaded app out there
worth using already queries machine topology one way or another; it's
not a big deal.
But cpusets and relative_nodes don't work: you still get your memory
splattered all over whatever nodes you allow, and the scheduler will
still move your task around based purely on cpu load. 0-win.
Not needing a (userspace) job-scheduler is a win, because that avoids
having everybody talk to this job-scheduler, and there's multiple
job-schedulers out there, two can't properly co-exist, etc. Also, the
kernel is the right place to do this.
[ this btw is true for all muddle-ware solutions, try and fit two
applications together that are written against different but similar
purpose muddle-wares and shit will come apart quickly ]
> Now dealing with bindings isn't a big deal for qemu; in fact this API is
> pretty much ideal for qemu, but it won't make life substantially easier
> compared to hard bindings. Simply, the management code that is now done
> with a perl script will have to be moved into the kernel. It looks like
> an incremental improvement compared to relative mbind+cpuset, but I'm
> unsure if it's the best we could aim for and what we really need in
> virt, considering we deal with VM migration too.
No, virt is crap; it needs to die, it's horrid, and any solution aimed
squarely at virt only is shit and not worth considering, that simple.
If you want to help solve the NUMA issue, forget about virt and solve it
for the non-virt case.
> The real long term design to me is not to add more syscalls, but to
> initially handle the case of a process/VM spanning not more than one
> node in thread count and amount of memory. That's not too hard, and in
> fact I have benchmarks for the scheduler already showing it to work
> pretty well (it's creating a too-strict affinity, but it can be relaxed
> to be more useful). Then later add some mechanism (simplest is the page
> fault at low frequency) to create a guest_vcpu_thread<->host_memory
> affinity, and have a paravirtualized interface that tells the guest
> scheduler to group CPUs.
I bet you believe a compiler can solve all parallelization/concurrency
problems for you as well. Happy pipe dreaming for you. While you're at
it, I've heard this transactional-memory crap will solve all our locking
problems.
Concurrency is hard; applications need to know wtf they're doing if they
want to gain any efficiency by it.
> If the guest scheduler runs free and is allowed to move threads
> randomly, without any paravirtualized interface that controls CPU
> thread migration in the guest scheduler, the thread<->memory affinity
> on the host will be hopeless. But a paravirtualized interface to make a
> guest thread stick to vcpu0/1/2/3 and not go onto vcpu4/5/6/7 will
> allow creating a more meaningful guest_thread<->physical_ram affinity
> on the host through KVM page faults. And then this will also work with
> VM migration, and without having to create a vtopology in the guest.
As a maintainer of the scheduler I can, with a fair degree of certainty,
say you'll never get such paravirt scheduler hooks.
Also, as much as I dislike the whole virt stuff, the whole premise of
virt is to 'emulate' real hardware. Real hardware does NUMA, therefore
it's not weird to also do vNUMA. And yes, NUMA sucks eggs, and in fact
not all hardware platforms expose it; have a look at s390 for example:
they stuck huge caches in and pretend it doesn't exist. But for those
that do expose it, there's a performance gain to playing by its rules.
Furthermore, I've been told there is great interest in running !paravirt
kernels, so much so in fact that hardware emulation seems more important
than paravirt solutions.
Also, I really don't see how trying to establish thread:page relations
is in any way virt related, why couldn't you do this in a host kernel?
From what I gather what you propose is to periodically unmap all user
memory (or map it !r !w !x, which is effectively the same) and take the
fault. This fault will establish a thread:page relation. One can just
use that or involve some history as well. Once you have this thread:page
relation set you want to group them on the same node.
There's various problems with that. Firstly, of course, the overhead:
storing this thread:page relation set requires quite a lot of memory.
Secondly, I'm not quite sure I see how that works for threads that share
a working set. Suppose you have 4 threads and 2 working sets; how do you
make sure to keep the 2 groups together? I don't think that's evident
from the simple thread:page relation data [*]. Thirdly, I immensely
dislike all these background scanner things; they make it very hard to
account time to those who actually use it.
[ * I can only make that work if you're willing to store something like
O(nr_pages * nr_threads) amount of data to correlate stuff, and that's
not even counting the time needed to process it and make something
useful out of it. ]
> sys_tbind/mbind gets away with it by creating a vtopology in
> the guest, so the guest scheduler would then follow the vtopology (but
> vtopology breaks across VM migration and to really be followed well
> with sys_mbind/tbind it'd require all apps to be modified).
Again, vtopology doesn't break with VM migration. It's perfectly possible
to create a vnode with 8 threads on hardware with only 2 cpus per node.
Your threads all get to share those 2 cpus, so it's not ideal, but it
works.
> grouping guest threads to stick into some vcpu sounds immensely
> simpler than changing the whole guest vtopology at runtime that would
> involve changing memory layout too.
vtopology is stable for the entire duration of the guest. Some weird
people at IBM think it's doable to change the topology at runtime, but
I'd argue they're wrong.
> NOTE: the paravirt cpu grouping interface would also handle the case of
> 3 guests of 2.5G on an 8G host (4G per node). One of the three guests
> will have memory spanning over two nodes, and the guest vtopology
> created by sys_mbind/tbind can't handle it.
You could of course have created 3 guests with 2 vnodes of 1.25G each.
You can always do stupid things; sys_[mt]bind doesn't pretend to provide
brains.
<snip more stuff>