From: Andrea Arcangeli
Subject: Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
Date: Tue, 3 Jan 2012 15:25:41 +0100

On Mon, Jan 02, 2012 at 06:55:18PM +0100, Paolo Bonzini wrote:
> On 01/02/2012 06:05 PM, Andrea Arcangeli wrote:
> > On Thu, Dec 29, 2011 at 06:01:45PM +0200, Avi Kivity wrote:
> >> On 12/29/2011 06:00 PM, Avi Kivity wrote:
> >>> The NFS client has exactly the same issue, if you mount it with the intr
> >>> option.  In fact you could use the NFS client as a trivial umem/cuse
> >>> prototype.
> >>
> >> Actually, NFS can return SIGBUS, it doesn't care about restarting daemons.
> >
> > During KVMForum I suggested to a few people that it could be done
> > entirely in userland with PROT_NONE.
> 
> Or MAP_NORESERVE.

MAP_NORESERVE has no effect with the default
/proc/sys/vm/overcommit_memory == 0, and in general makes no difference
until you run out of memory: it only switches overcommit accounting on
and off, so it's mostly a noop here.
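
To make the PROT_NONE-in-userland idea above concrete, here is a
minimal sketch of it; the names and the pagein stub are purely
illustrative, not anything that exists today. Guest RAM is reserved
PROT_NONE, the first access to each page raises SIGSEGV, and the
handler fills the page and flips it to read-write. Note that every
per-page mprotect splits the big reserved vma, which is exactly the
"run out of VMAs" problem quoted below.

/* Illustrative only: PROT_NONE postcopy entirely in userland.
 * fetch_page_from_source() stands in for the async network pagein. */
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define GUEST_RAM_SIZE  (1UL << 30)     /* 1G, just for the example */

static char *guest_ram;
static long page_size;

static void fetch_page_from_source(char *page)
{
        /* placeholder: the real thing would read the page contents
         * from the migration socket */
        memset(page, 0, page_size);
}

static void fault_handler(int sig, siginfo_t *si, void *uc)
{
        char *page = (char *)((unsigned long)si->si_addr &
                              ~(page_size - 1));

        /* make just this one page accessible: this splits the vma */
        mprotect(page, page_size, PROT_READ | PROT_WRITE);
        fetch_page_from_source(page);
}

int main(void)
{
        struct sigaction sa = {
                .sa_sigaction = fault_handler,
                .sa_flags = SA_SIGINFO,
        };

        page_size = sysconf(_SC_PAGESIZE);
        sigaction(SIGSEGV, &sa, NULL);

        guest_ram = mmap(NULL, GUEST_RAM_SIZE, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                         -1, 0);

        guest_ram[0] = 1;       /* faults, gets filled by the handler */
        return 0;
}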

> Anything you do that is CUSE-based should be doable in a separate QEMU 
> thread (rather than a different process that talks to CUSE).  If a 
> userspace CUSE-based solution could be done with acceptable performance, 
> the same thing would have the same or better performance if done 
> entirely within QEMU.

It should be doable somehow within qemu, and the source node could
handle one connection per vcpu thread for the async network pageins.
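
As a rough illustration of what that could look like on the source
side, here's a sketch with one thread per destination vcpu connection;
the wire format (a guest-physical offset per request, one page per
reply) and all the names are invented for this sketch, not part of any
existing interface.

/* Illustrative source-side pagein server: one thread per connection. */
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

extern char *guest_ram;         /* source copy of the guest memory */
extern long page_size;

static void *serve_pageins(void *arg)
{
        int fd = (int)(intptr_t)arg;    /* one socket per vcpu thread */
        uint64_t offset;

        while (read(fd, &offset, sizeof(offset)) == sizeof(offset)) {
                /* real code must validate the offset against the
                 * guest RAM size before using it */
                if (write(fd, guest_ram + offset, page_size) != page_size)
                        break;
        }
        close(fd);
        return NULL;
}

static void spawn_pagein_thread(int fd)
{
        pthread_t t;

        pthread_create(&t, NULL, serve_pageins, (void *)(intptr_t)fd);
        pthread_detach(t);
}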
 
> > So the problem is if we do it in
> > userland with the current functionality you'll run out of VMAs and
> > slow down performance too much.
> >
> > But all you need is the ability to map single pages in the address
> > space.
> 
> Would this also let you set different pgprots for different pages in the 
> same VMA?  It would be useful for write barriers in garbage collectors 
> (such as boehm-gc).  These do not have _that_ many VMAs, because every 
> GC cycle could merge all of them back to a single VMA with PROT_READ 
> permissions; however, they still put some strain on the VM subsystem.

Changing permissions sounds trickier, as more code may make
assumptions about the vma before checking the pte.

Adding a magic unmapped pte entry sounds fairly safe because there is
already a precedent: the migration pte used by migrate, which makes
page faults halt and wait. So I guess we could reuse the code that
already exists for the migration entry, except we'd fire a signal and
return to userland instead of waiting. The signal should be delivered
before the page fault triggers again. Of course if the signal handler
returns without doing anything it'll loop at 100% cpu load, but that's
ok. Maybe it's also possible to tweak the permissions, but that needs
a lot more thought. Specifically, for anon pages marking them readonly
sounds possible if they are supposed to behave like regular COWs (not
segfaulting or anything), since you can already have a mixture of
readonly and read-write ptes (not to mention readonly KSM pages), but
for any other case it's non-trivial. Last but not least, the API here
would be like a vma-less mremap, moving a page from one address to
another without modifying the vmas, while the permission tweak sounds
more like an mprotect, so I'm unsure whether one interface could do
both or whether the latter should be an optimization to consider
independently.
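
To show how userland might consume such an interface, here is a
hypothetical sketch of the destination-side fault path; the SIGFAULT
name, the signal choice and the remap_anon_page() syscall do not
exist, they only mirror the "fire a signal plus vma-less mremap" idea
above. The handler receives the missing page into a private bounce
buffer and then moves it into the faulting address without touching
any vma.

/* Hypothetical only: none of these kernel interfaces exist. */
#include <signal.h>
#include <sys/mman.h>

#define SIGFAULT SIGRTMIN      /* placeholder fault-notification signal */

extern long page_size;
extern void fetch_page_from_source(char *page);    /* network pagein */
extern int remap_anon_page(void *dst, void *src);  /* imaginary syscall:
                                                      moves one page,
                                                      no vma changes */

static void postcopy_fault(int sig, siginfo_t *si, void *uc)
{
        char *dst = (char *)((unsigned long)si->si_addr &
                             ~(page_size - 1));
        /* a real implementation would reuse a per-thread bounce page
         * instead of mmap'ing one per fault */
        char *bounce = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        fetch_page_from_source(bounce);
        /* plug the page into the faulting address; the vcpu retries
         * the access once the handler returns */
        remap_anon_page(dst, bounce);
        munmap(bounce, page_size);
}

int postcopy_register_handler(void)
{
        struct sigaction sa = {
                .sa_sigaction = postcopy_fault,
                .sa_flags = SA_SIGINFO,
        };

        return sigaction(SIGFAULT, &sa, NULL);
}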

In theory I suspect we could also teach mremap to do a
non-vma-mangling mremap when the pages being moved aren't shared, so
that we can adjust the page->index of the pages instead of creating
new vmas at the dst address with an adjusted vma->vm_pgoff, but I
suspect a syscall that only works on top of fault-unmapped areas is
simpler and safer. mremap semantics require nuking the dst region
before the move starts. If we taught mremap how to handle the
fault-unmapped areas, we would only need to add one new syscall,
prepare_fault_area (or whatever name you choose).
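
For completeness, the setup half of that hypothetical 2-syscall API
could then be as small as this; prepare_fault_area() is a made-up name
matching the paragraph above, and postcopy_register_handler() is the
registration helper from the previous sketch. qemu on the destination
would arm the whole guest RAM range once, before letting the vcpus run.

/* Hypothetical only: prepare_fault_area() does not exist. */
#include <signal.h>
#include <stddef.h>

extern char *guest_ram;
extern size_t guest_ram_size;
extern int postcopy_register_handler(void);
/* arm a range so the first touch of each missing page raises 'sig'
 * instead of allocating a zero page */
extern int prepare_fault_area(void *start, size_t len, int sig);

int postcopy_incoming_start(void)
{
        if (postcopy_register_handler() < 0)
                return -1;
        /* SIGRTMIN is the same placeholder signal as in the sketch above */
        return prepare_fault_area(guest_ram, guest_ram_size, SIGRTMIN);
}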

The locking for a vma-less mremap still sounds tricky, but I doubt you
can avoid that locking complexity by using the chardevice, as long as
the chardevice-backed memory still allows THP, migration and swap and
you want the move to be atomic and zerocopy. And I think zerocopy
would be better, especially if the network card is fast and all vcpus
are faulting into unmapped pages simultaneously, triggering a heavy
amount of copying on all physical cpus.

I don't mean that the current device driver doing a copy_user won't
work or is a bad idea; it's more self-contained and maybe easier to
merge upstream. I'm just presenting another, more VM-integrated
option: zerocopy with just 2 syscalls.

vmas must not be involved in the mremap for reliability, or too much
memory could get pinned in vmas even if we temporarily lift
/proc/sys/vm/max_map_count for the process (splitting a 64G guest page
by page would mean on the order of 16 million vmas, far beyond the
default max_map_count of 65530). Plus, delivering a dedicated signal
(not sigsegv or sigbus) should be more reliable in case the migration
crashes for real.


