[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options
From: |
Alex Bennée |
Subject: |
Re: [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options |
Date: |
Tue, 16 Jun 2020 17:08:16 +0100 |
User-agent: |
mu4e 1.5.3; emacs 28.0.50 |
Josh Kunz <jkz@google.com> writes:
> The `clone` system call can be used to create new processes that share
> attributes with their parents, such as virtual memory, file
> system location, file descriptor tables, etc. These can be useful to a
> variety of guest programs.
>
> Before this patch, QEMU had support for a limited set of these attributes.
> Basically the ones needed for threads, and the options used by fork.
> This change adds support for all flag combinations involving CLONE_VM.
> In theory, almost all clone options could be supported, but invocations
> not using CLONE_VM are likely to run afoul of linux-user's inherently
> multi-threaded design.
>
> To add this support, this patch updates the `qemu_clone` helper. An
> overview of the mechanism used to support general `clone` options with
> CLONE_VM is described below.
>
> This patch also enables by-default the `clone` unit-tests in
> tests/tcg/multiarch/linux-test.c, and adds an additional test for duplicate
> exit signals, based on a bug found during development.
Which by the way fail on some targets:
TEST linux-test on alpha
/home/alex/lsrc/qemu.git/tests/tcg/multiarch/linux-test.c:709: child did not
receive PDEATHSIG on parent death
make[2]: *** [../Makefile.target:153: run-linux-test] Error 1
make[1]: *** [/home/alex/lsrc/qemu.git/tests/tcg/Makefile.qemu:76:
run-guest-tests] Error 2
make: *** [/home/alex/lsrc/qemu.git/tests/Makefile.include:851:
run-tcg-tests-alpha-linux-user] Error 2
Have you managed a clean check-tcg with docker enabled so all the guest
architectures get tested?
>
> !! Overview
>
> Adding support for CLONE_VM is tricky. The parent and guest process will
> share an address space (similar to threads), so the emulator must
> coordinate between the parent and the child. Currently, QEMU relies
> heavily on Thread Local Storage (TLS) as part of this coordination
> strategy. For threads, this works fine, because libc manages the
> thread-local data region used for TLS, when we create new threads using
> `pthread_create`. Ideally we could use the same mechanism for
> "process-local storage" needed to allow the parent/child processes to
> emulate in tandem. Unfortunately TLS is tightly integrated into libc.
> The only way to create TLS data regions is via the `pthread_create` API
> which also spawns a new thread (rather than a new processes, which is
> what we want). Worse still, TLS itself is a complicated arch-specific
> feature that is tightly integrated into the rest of libc and the dynamic
> linker. Re-implementing TLS support for QEMU would likely require a
> special dynamic linker / libc. Alternatively, the popular libcs could be
> extended, to allow for users to create TLS regions without creating
> threads. Even if major libcs decide to add this support, QEMU will still
> need a temporary work around until those libcs are widely deployed. It's
> also unclear if libcs will be interested in supporting this case, since
> TLS image creation is generally deeply integrated with thread setup.
>
> In this patch, I've employed an alternative approach: spawning a thread
> an "stealing" its TLS image for use in the child process. This approach
> leaves a dangling thread while the TLS image is in use, but by design
> that thread will not become schedulable until after the TLS data is no
> longer in-use by the child (as described in a moment). Therefore, it
> should cause relatively minimal overhead. When considered in the larger
> context, this seems like a reasonable tradeoff.
*sharp intake of breath*
OK so the solution to the complexity of handling threads is to add more
threads? cool cool cool....
>
> A major complication of this approach knowing when it is safe to clean up
> the stack, and TLS image, used by a child process. When a child is
> created with `CLONE_VM` its stack, and TLS data, need to remain valid
> until that child has either exited, or successfully called `execve` (on
> `execve` the child is given a new VMM by the kernel). One approach would
> be to use `waitid(WNOWAIT)` (the `WNOWAIT` allows the guest to reap the
> child). The problem is that the `wait` family of calls only waits for
> termination. The pattern of `clone() ... execve()` for long running
> child processes is pretty common. If we waited for child processes to
> exit, it's likely we would end up using substantially more memory, and
> keep the suspended TLS thread around much longer than necessary.
> Instead, in this patch, I've used an "trampoline" process. The real
> parent first clones a trampoline, the trampoline then clones the
> ultimate child using the `CLONE_VFORK` option. `CLONE_VFORK` suspends
> the trampoline process until the child has exited, or called `execve`.
> Once the trampoline is re-scheduled, we know it is safe to clean up
> after the child. This creates one more suspended process, but typically,
> the trampoline only exists for a short period of time.
>
> !! CLONE_VM setup, step by step
>
> 1. First, the suspended thread whose TLS we will use is created using
> `pthread_create`. The thread fetches and returns it's "TLS pointer"
> (an arch-specific value given to the kernel) to the parent. It then
> blocks on a lock to prevent its TLS data from being cleaned up.
> Ultimately the lock will be unlocked by the trampoline once the child
> exits.
> 2. Once the TLS thread has fetched the TLS pointer, it notifies the real
> parent thread, which calls `clone()` to create the trampoline
> process. For ease of implementation, the TLS image is set for the
> trampoline process during this step. This allows the trampoline to
> use functions that require TLS if needed (e.g., printf). TLS location
> is inherited when a new child is spawned, so this TLS data will
> automatically be inherited by the child.
> 3. Once the trampoline has been spawned, it registers itself as a
> "hidden" process with the signal subsystem. This prevents the exit
> signal from the trampoline from ever being forwarded to the guest.
> This is needed due to the way that Linux sets the exit signal for the
> ultimate child when `CLONE_PARENT` is set. See the source for
> details.
> 4. Once setup is complete, the trampoline spawns the final child with
> the original clone flags, plus `CLONE_PARENT`, so the child is
> correctly parented to the kernel task on which the guest invoked
> `clone`. Without this, kernel features like PDEATHSIG, and
> subreapers, would not work properly. As previously discussed, the
> trampoline also supplies `CLONE_VFORK` so that it is suspended until
> the child can be cleaned up.
> 5. Once the child is spawned, it signals the original parent thread that
> it is running. At this point, the trampoline process is suspended
> (due to CLONE_VFORK).
> 6. Finally, the call to `qemu_clone` in the parent is finished, the
> child begins executing the given callback function in the new child
> process.
>
> !! Cleaning up
>
> Clean up itself is a multi-step process. Once the child exits, or is
> killed by a signal (cleanup is the same in both cases), the trampoline
> process becomes schedulable. When the trampoline is scheduled, it frees
> the child stack, and unblocks the suspended TLS thread. This cleans up
> the child resources, but not the stack used by the trampoline itself. It
> is possible for a process to clean up its own stack, but it is tricky,
> and architecture-specific. Instead we leverage the TLS manager thread to
> clean up the trampoline stack. When the trampoline is cloned (in step 2
> above), we additionally set the `CHILD_SETTID` and `CHILD_CLEARTID`
> flags. The target location for the SET/CLEAR TID is set to a special field
> known by the TLS manager. Then, when the TLS manager thread is unsuspended,
> it performs an additional `FUTEX_WAIT` on this location. That blocks the
> TLS manager thread until the trampoline has fully exited, then the TLS
> manager thread frees the trampoline process's stack, before exiting
> itself.
>
> !! Shortcomings of this patch
>
> * It's complicated.
> * It doesn't support any clone options when CLONE_VM is omitted.
> * It doesn't properly clean up the CPU queue when the child process
> terminates, or calls execve().
> * RCU unregistration is done in the trampoline process (in clone.c), but
> registration happens in syscall.c This should be made more explicit.
> * The TLS image, and trampoline stack are not cleaned up if the parent
> calls `execve` or `exit_group` before the child does. This is because
> those cleanup tasks are handled by the TLS manager thread. The TLS
> manager thread is in the same thread group as the parent, so it will
> be terminated if the parent exits or calls `execve`.
>
> !! Alternatives considered
>
> * Non-standard libc extension to allow creating TLS images independent
> of threads. This would allow us to just `clone` the child directly
> instead of this complicated maneuver. Though we probably would still
> need the cleanup logic. For libcs, TLS image allocation is tightly
> connected to thread stack allocation, which is also arch-specific. I
> do not have enough experience with libc development to know if
> maintainers of any popular libcs would be open to supporting such an
> API. Additionally, since it will probably take years before a libc
> fix would be widely deployed, we need an interim solution anyways.
We could consider a custom lib stub that intercepts calls to the guests
original libc and replaces it with a QEMU aware one?
> * Non-standard, Linux-only, libc extension to allow us to specify the
> CLONE_* flags used by `pthread_create`. The processes we are creating
> are basically threads in a different thread group. If we could alter
> the flags used, this whole processes could become a `pthread_create.`
> The problem with this approach is that I don't know what requirements
> pthreads has on threads to ensure they function properly. I suspect
> that pthreads relies on CHILD_CLEARTID+FUTEX_WAKE to cleanup detached
> thread state. Since we don't control the child exit reason (Linux only
> handles CHILD_CLEARTID on normal, non-signal process termination), we
> probably can't use this same tracking mechanism.
> * Other mechanisms for detecting child exit so cleanup can happen
> besides CLONE_VFORK:
> * waitid(WNOWAIT): This can only detect exit, not execve.
> * file descriptors with close on exec set: This cannot detect children
> cloned with CLONE_FILES.
> * System V semaphore adjustments: Cannot detect children cloned with
> CLONE_SYSVSEM.
> * CLONE_CHILD_CLEARTID + FUTEX_WAIT: Cannot detect abnormally
> terminated children.
> * Doing the child clone directly in the TLS manager thread: This saves the
> need for the trampoline process, but it causes the child process to be
> parented to the wrong kernel task (the TLS thread instead of the Main
> thread) breaking things like PDEATHSIG.
Have you considered a daemon which could co-ordinate between the
multiple processes that are sharing some state?
> Signed-off-by: Josh Kunz <jkz@google.com>
> ---
> linux-user/clone.c | 415 ++++++++++++++++++++++++++++++-
> linux-user/qemu.h | 17 ++
> linux-user/signal.c | 49 ++++
> linux-user/syscall.c | 69 +++--
> tests/tcg/multiarch/linux-test.c | 67 ++++-
> 5 files changed, 592 insertions(+), 25 deletions(-)
>
> diff --git a/linux-user/clone.c b/linux-user/clone.c
> index f02ae8c464..3f7344cf9e 100644
> --- a/linux-user/clone.c
> +++ b/linux-user/clone.c
> @@ -12,6 +12,12 @@
> #include <stdbool.h>
> #include <assert.h>
>
> +/* arch-specifc includes needed to fetch the TLS base offset. */
> +#if defined(__x86_64__)
> +#include <asm/prctl.h>
> +#include <sys/prctl.h>
> +#endif
> +
> static const unsigned long NEW_STACK_SIZE = 0x40000UL;
>
> /*
> @@ -62,6 +68,397 @@ static void completion_finish(struct completion *c)
> pthread_mutex_unlock(&c->mu);
> }
>
> +struct tls_manager {
> + void *tls_ptr;
> + /* fetched is completed once tls_ptr has been set by the thread. */
> + struct completion fetched;
> + /*
> + * spawned is completed by the user once the managed_tid
> + * has been spawned.
> + */
> + struct completion spawned;
> + /*
> + * TID of the child whose memory is cleaned up upon death. This memory
> + * location is used as part of a futex op, and is cleared by the kernel
> + * since we specify CHILD_CLEARTID.
> + */
> + int managed_tid;
> + /*
> + * The value to be `free`'d up once the janitor is ready to clean up the
> + * TLS section, and the managed tid has exited.
> + */
> + void *cleanup;
> +};
> +
> +/*
> + * tls_ptr fetches the TLS "pointer" for the current thread. This pointer
> + * should be whatever platform-specific address is used to represent the TLS
> + * base address.
> + */
> +static void *tls_ptr()
This and a number of other prototypes need void args to stop the
compiler complaining about missing prototypes.
> +{
> + void *ptr;
> +#if defined(__x86_64__)
> + /*
> + * On x86_64, the TLS base is stored in the `fs` segment register, we can
> + * fetch it with `ARCH_GET_FS`:
> + */
> + (void)syscall(SYS_arch_prctl, ARCH_GET_FS, (unsigned long) &ptr);
> +#else
> + ptr = NULL;
> +#endif
> + return ptr;
> +}
> +
> +/*
> + * clone_vm_supported returns true if clone_vm() is supported on this
> + * platform.
> + */
> +static bool clone_vm_supported()
> +{
> +#if defined(__x86_64__)
> + return true;
> +#else
> + return false;
> +#endif
> +}
<snip>
--
Alex Bennée
- [PATCH 0/5] linux-user: Support extended clone(CLONE_VM), Josh Kunz, 2020/06/11
- [PATCH 5/5] linux-user: Add PDEATHSIG test for clone process hierarchy., Josh Kunz, 2020/06/11
- Re: [PATCH 0/5] linux-user: Support extended clone(CLONE_VM), no-reply, 2020/06/11
- Re: [PATCH 0/5] linux-user: Support extended clone(CLONE_VM), Alex Bennée, 2020/06/13
- Re: [PATCH 0/5] linux-user: Support extended clone(CLONE_VM), Alex Bennée, 2020/06/16