Re: [bug-libsigsegv] [Qemu-devel] [PATCH 00/21] RFC: userfaultfd v3
From: Eric Blake
Subject: Re: [bug-libsigsegv] [Qemu-devel] [PATCH 00/21] RFC: userfaultfd v3
Date: Fri, 06 Mar 2015 08:29:35 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0

[adding libsigsegv project]
On 03/05/2015 10:17 AM, Andrea Arcangeli wrote:
> Hello everyone,
>
> This is a RFC for the userfaultfd syscall API v3 that addresses the
> feedback received for the previous v2 submit.
>
> The main change from v2 is that MADV_USERFAULT/NOUSERFAULT
> disappeared (they're replaced by the UFFDIO_REGISTER/UNREGISTER
> ioctls). In short, userfaults are now only possible through the
> userfaultfd. The remap_anon_pages syscall also disappeared, replaced
> by the UFFDIO_REMAP ioctl, which is in turn mostly obsoleted by the
> newer UFFDIO_COPY and UFFDIO_ZEROPAGE ioctls, which are more efficient
> because they never have to flush the TLB. The suggestion to copy the
> data instead of moving it, in order to resolve the userfault, was
> immediately agreed upon.
>
> The latest code can also be cloned here:
>
> git clone --reference linux -b userfault git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
>
>
> Userfaults allow on-demand paging to be implemented from userland,
> and more generally they allow userland to take control of various
> types of page faults more efficiently.
>
> For example, userfaults allow a proper and more optimal implementation
> of the PROT_NONE+SIGSEGV trick.
Which is what GNU libsigsegv currently uses. Anyone interested in
adding code to libsigsegv to take advantage of this proposed new kernel
interface?
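
For anyone not familiar with the trick, here is a minimal sketch in C,
assuming Linux and a POSIX environment. The names region, on_segv and
demo_lazy_region are made up for illustration and are not libsigsegv
API. The region is mapped PROT_NONE, the first touch of each page raises
SIGSEGV, and the handler "pages in" the data by making the page
accessible and filling it before the faulting access is retried.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* All names below are hypothetical; none of them are libsigsegv API. */
static unsigned char *region;              /* start of the PROT_NONE mapping */
static size_t region_len;                  /* mapping length in bytes */
static size_t page_size;                   /* cached so the handler need not query it */
static volatile sig_atomic_t fault_count;  /* faults resolved by the handler */

/* SIGSEGV handler: make the faulting page accessible and fill it. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)region;
    unsigned char *page;

    (void)sig;
    (void)ctx;
    if (addr < base || addr >= base + region_len)
        _exit(1);                          /* a genuine crash, not our fault */
    page = (unsigned char *)(addr & ~((uintptr_t)page_size - 1));
    /* mprotect() is not formally async-signal-safe; the trick relies on it anyway. */
    if (mprotect(page, page_size, PROT_READ | PROT_WRITE) != 0)
        _exit(2);
    memset(page, 0x5a, page_size);         /* stand-in for fetching the real data */
    fault_count++;
}

/* Touch npages lazily populated pages; return the number of faults, or -1. */
static int demo_lazy_region(size_t npages)
{
    struct sigaction sa, old;
    int ok = 1, faults;
    size_t i;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    region_len = npages * page_size;
    region = mmap(NULL, region_len, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(region, region_len);
        return -1;
    }

    fault_count = 0;
    for (i = 0; i < npages; i++) {
        /* The first read of each page faults; the retried read sees the fill. */
        volatile unsigned char *p = region + i * page_size;
        if (*p != 0x5a)
            ok = 0;
    }
    faults = (int)fault_count;

    sigaction(SIGSEGV, &old, NULL);        /* restore the previous handler */
    munmap(region, region_len);
    return ok ? faults : -1;
}
```

Calling demo_lazy_region(4) from main() should take one fault per page.
Every fault costs a signal delivery plus an mprotect() call, and the
handler is process-wide state, which is the overhead a userfaultfd-based
implementation would presumably avoid.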
>
> There has been interest from multiple users for different use cases:
>
> 1) KVM postcopy live migration (one form of cloud memory
> externalization). KVM postcopy live migration is the primary driver
> of this work:
>
> http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/
> http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html
>
> 2) KVM postcopy live snapshotting (which allows memory usage to be
> limited/throttled, unlike fork, and avoids the fork overhead in
> the first place).
>
> The syscall API already anticipates wrprotect fault tracking and
> it's generic enough to allow its later implementation in a
> backwards compatible fashion.
>
> 3) KVM userfaults on shared memory. The UFFDIO_COPY lowlevel method
> should be extended to work also on tmpfs and then the
> uffdio_register.ioctls will notify userland that UFFDIO_COPY is
> available even when the registered virtual memory range is tmpfs
> backed.
>
> 4) An alternate mechanism to notify web browsers or apps on embedded
> devices that volatile pages have been reclaimed. This basically
> avoids the need to run a syscall before the app can access the
> virtual regions marked volatile with the CPU. This also requires
> point 3) to be fulfilled, as volatile pages happily apply to tmpfs.
>
> 5) postcopy live migration of binaries inside linux containers.
>
> Even though no real use case has requested it yet, the new API also
> allows distributed shared memory to be implemented in a way that
> readonly shared mappings can exist simultaneously on different hosts
> and become exclusive at the first wrprotect fault.
>
> The UFFDIO_REMAP method is still present in the patchset but it's
> provided primarily to remove (not add) memory from the userfault
> range. The addition of the UFFDIO_REMAP method is intentionally kept
> at the end of the patchset. The postcopy live migration qemu code will
> only use UFFDIO_COPY and UFFDIO_ZEROPAGE. UFFDIO_REMAP isn't intended
> to be merged upstream in the short term, and it can be dropped later
> if there's an agreement it's a bad idea to keep it around in the
> patchset.
>
> David ran some KVM postcopy live migration benchmarks on an 8-way CPU
> system and he measured that using UFFDIO_COPY instead of UFFDIO_REMAP
> resulted in roughly a 20% reduction in latency, which is good. The
> standard deviation of the latency measurement decreased
> significantly as well (because the number of CPUs that required IPI
> delivery was variable, while the copy always takes roughly the same
> time). A bigger improvement can be expected if measured on a larger
> host with more CPUs.
>
> All UFFDIO_COPY/ZEROPAGE/REMAP methods already support CRIU postcopy
> live migration and the UFFD can be passed to a manager process through
> unix domain sockets to satisfy point 5).
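
Passing a descriptor to a manager process over a unix domain socket is
the standard SCM_RIGHTS mechanism. Below is a minimal sketch in C,
assuming Linux. The helper names send_fd, recv_fd and demo_pass_fd are
hypothetical, and a pipe stands in for the userfaultfd descriptor, since
creating a real one needs a kernel carrying this patchset.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send descriptor fd over the connected unix socket sock; 0 on success. */
static int send_fd(int sock, int fd)
{
    char byte = 'F';                       /* at least one data byte must travel */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    struct msghdr msg;
    struct cmsghdr *c;

    memset(&msg, 0, sizeof msg);
    memset(&u, 0, sizeof u);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;             /* payload is an array of descriptors */
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock; returns the new fd, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    struct msghdr msg;
    struct cmsghdr *c;
    int fd;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET ||
        c->cmsg_type != SCM_RIGHTS || c->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(c), sizeof fd);
    return fd;
}

/* Pass a pipe's read end across a socketpair and read through the copy. */
static int demo_pass_fd(void)
{
    int sv[2], pfd[2], got = -1, ok = -1;
    char buf[2] = { 0, 0 };

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (pipe(pfd) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (send_fd(sv[0], pfd[0]) == 0 && (got = recv_fd(sv[1])) >= 0) {
        /* Data written to the pipe is readable through the received fd. */
        if (write(pfd[1], "ok", 2) == 2 && read(got, buf, 2) == 2 &&
            memcmp(buf, "ok", 2) == 0)
            ok = 0;
        close(got);
    }
    close(pfd[0]);
    close(pfd[1]);
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

In a real setup the sender and receiver would be different processes
and the socket would be a named or inherited AF_UNIX socket; the demo
keeps both ends in one process only so that it fits in a few lines.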
>
> I look forward to discussing this further next week at the LSF/MM
> summit. If you're attending the summit, see you soon!
>
> Comments welcome, thanks,
> Andrea
>
> Credits: partially funded by the Orbit EU project.
>
> PS. There is one TODO detail worth mentioning for completeness that
> affects usage 2) and UFFDIO_REMAP if used to remove memory from the
> userfault range: handle_userfault() is only effective if
> FAULT_FLAG_ALLOW_RETRY is set... but that is only set at the first
> attempted page fault. If by accident some thread was already faulting
> in the range and the first page fault attempt returned VM_FAULT_RETRY
> and UFFDIO_REMAP or UFFDIO_WP jumps in to arm the userfault just
> before the second attempt starts, a SIGBUS would be raised by the page
> fault. Stopping all thread access to the userfault ranges during
> UFFDIO_REMAP/WP, while possible, isn't optimal. Currently (excluding
> real filebacked mappings and handle_userfault() itself which is
> clearly no problem) only tmpfs or a swapin can return
> VM_FAULT_RETRY. To close this SIGBUS window for all usages, the
> simplest solution would be that if FAULT_FLAG_TRIED is set
> VM_FAULT_RETRY can still be returned (but only by handle_userfault
> that has a legitimate reason for insisting a second time in a row with
> VM_FAULT_RETRY). That would require some change to the FAULT_FLAG
> semantics. Again userland could cope with this detail but it'd be
> inefficient to solve it in userland. This would be a fully backwards
> compatible change and it's only strictly required by the wrprotect
> tracking mode, so it's no problem to solve this later. Because of its
> inherently racy nature, nobody could possibly depend on a racy SIGBUS
> being raised now, when it won't be raised anymore later.
>
> Andrea Arcangeli (21):
>   userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key
>   userfaultfd: linux/Documentation/vm/userfaultfd.txt
>   userfaultfd: uAPI
>   userfaultfd: linux/userfaultfd_k.h
>   userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct
>   userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP
>   userfaultfd: call handle_userfault() for userfaultfd_missing() faults
>   userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx
>   userfaultfd: prevent khugepaged to merge if userfaultfd is armed
>   userfaultfd: add new syscall to provide memory externalization
>   userfaultfd: buildsystem activation
>   userfaultfd: activate syscall
>   userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI
>   userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation
>   userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE
>   userfaultfd: remap_pages: rmap preparation
>   userfaultfd: remap_pages: swp_entry_swapcount() preparation
>   userfaultfd: UFFDIO_REMAP uABI
>   userfaultfd: remap_pages: UFFDIO_REMAP preparation
>   userfaultfd: UFFDIO_REMAP
>   userfaultfd: add userfaultfd_wp mm helpers
>
> Documentation/ioctl/ioctl-number.txt | 1 +
> Documentation/vm/userfaultfd.txt | 97 +++
> arch/powerpc/include/asm/systbl.h | 1 +
> arch/powerpc/include/asm/unistd.h | 2 +-
> arch/powerpc/include/uapi/asm/unistd.h | 1 +
> arch/x86/syscalls/syscall_32.tbl | 1 +
> arch/x86/syscalls/syscall_64.tbl | 1 +
> fs/Makefile | 1 +
> fs/userfaultfd.c | 1128 ++++++++++++++++++++++++++++++++
> include/linux/mm.h | 4 +-
> include/linux/mm_types.h | 11 +
> include/linux/swap.h | 6 +
> include/linux/syscalls.h | 1 +
> include/linux/userfaultfd_k.h | 112 ++++
> include/linux/wait.h | 5 +-
> include/uapi/linux/userfaultfd.h | 150 +++++
> init/Kconfig | 11 +
> kernel/fork.c | 3 +-
> kernel/sched/wait.c | 7 +-
> kernel/sys_ni.c | 1 +
> mm/Makefile | 1 +
> mm/huge_memory.c | 217 +++++-
> mm/madvise.c | 3 +-
> mm/memory.c | 16 +
> mm/mempolicy.c | 4 +-
> mm/mlock.c | 3 +-
> mm/mmap.c | 39 +-
> mm/mprotect.c | 3 +-
> mm/rmap.c | 9 +
> mm/swapfile.c | 13 +
> mm/userfaultfd.c | 793 ++++++++++++++++++++++
> net/sunrpc/sched.c | 2 +-
> 32 files changed, 2593 insertions(+), 54 deletions(-)
> create mode 100644 Documentation/vm/userfaultfd.txt
> create mode 100644 fs/userfaultfd.c
> create mode 100644 include/linux/userfaultfd_k.h
> create mode 100644 include/uapi/linux/userfaultfd.h
> create mode 100644 mm/userfaultfd.c
>
>
>
>
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org