[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: question on vhost, limiting kernel threads and NPROC
From: |
Christian Brauner |
Subject: |
Re: question on vhost, limiting kernel threads and NPROC |
Date: |
Tue, 14 Sep 2021 18:04:07 +0200 |
On Mon, Sep 13, 2021 at 05:32:32PM -0400, Michael S. Tsirkin wrote:
> On Mon, Sep 13, 2021 at 12:04:04PM -0500, Mike Christie wrote:
> > I just realized I forgot to cc the virt list so adding now.
> >
> > Christian see the very bottom for a different fork patch.
> >
> > On 7/12/21 7:05 AM, Stefan Hajnoczi wrote:
> > > On Fri, Jul 09, 2021 at 11:25:37AM -0500, Mike Christie wrote:
> > >> Hi,
> > >>
> > >> The goal of this email is to try and figure how we want to track/limit
> > >> the
> > >> number of kernel threads created by vhost devices.
> > >>
> > >> Background:
> > >> -----------
> > >> For vhost-scsi, we've hit a issue where the single vhost worker thread
> > >> can't
> > >> handle all IO the being sent from multiple queues. IOPs is stuck at
> > >> around
> > >> 500K. To fix this, we did this patchset:
> > >>
> > >> https://lore.kernel.org/linux-scsi/20210525180600.6349-1-michael.christie@oracle.com/
> > >>
> > >> which allows userspace to create N threads and map them to a dev's
> > >> virtqueues.
> > >> With this we can get around 1.4M IOPs.
> > >>
> > >> Problem:
> > >> --------
> > >> While those patches were being reviewed, a concern about tracking all
> > >> these
> > >> new possible threads was raised here:
> > >>
> > >> https://lore.kernel.org/linux-scsi/YL45CfpHyzSEcAJv@stefanha-x1.localdomain/
> > >>
> > >> To save you some time, the question is what does other kernel code using
> > >> the
> > >> kthread API do to track the number of kernel threads created on behalf of
> > >> a userspace thread. The answer is they don't do anything so we will have
> > >> to
> > >> add that code.
> > >>
> > >> I started to do that here:
> > >>
> > >> https://lkml.org/lkml/2021/6/23/1233
> > >>
> > >> where those patches would charge/check the vhost device owner's
> > >> RLIMIT_NPROC
> > >> value. But, the question of if we really want to do this has come up
> > >> which is
> > >> why I'm bugging lists like libvirt now.
> > >>
> > >> Question/Solution:
> > >> ------------------
> > >> I'm bugging everyone so we can figure out:
> > >>
> > >> If we need to specifically track the number of kernel threads being made
> > >> for the vhost kernel use case by the RLIMIT_NPROC limit?
> > >>
> > >> Or, is it ok to limit the number of devices with the RLIMIT_NOFILE limit.
> > >> Then each device has a limit on the number of threads it can create.
> > >
> > > Do we want to add an interface where an unprivileged userspace process
> > > can create large numbers of kthreads? The number is indirectly bounded
> > > by RLIMIT_NOFILE * num_virtqueues, but there is no practical way to
> > > use that rlimit since num_virtqueues various across vhost devices and
> > > RLIMIT_NOFILE might need to have a specific value to control file
> > > descriptors.
> > >
> > > io_uring worker threads are limited by RLIMIT_NPROC. I think it makes
> > > sense in vhost too where the device instance is owned by a specific
> > > userspace process and can be accounted against that process' rlimit.
> > >
> > > I don't have a specific use case other than that I think vhost should be
> > > safe and well-behaved.
> > >
> >
> > Sorry for the late reply. I finally got to go on PTO and used like 2
> > years worth in one super long vacation :)
> >
> > I still don't have a RLIMIT_NPROC use case and it wasn't not clear to
> > me if that has to be handled before merging. However, I might have got
> > lucky and found a bug where the fix will handle your request too.
> >
> > It looks like cgroup v2 is supposed to work, but for vhost threads
> > it doesn't because the kernel functions we use just support v1. If
> > we change the vhost layer to create threads like how io_uring does
> > then we get the RLIMIT_NPROC checks and also cgroup v2 support.
> >
> > Christian, If you didn't like this patch
> >
> > https://lkml.org/lkml/2021/6/23/1233
> >
> > then I'm not sure how much you will like what is needed to support the
> > above. Here is a patch which includes what we would need from the fork
> > related code. On one hand, it's nicer because it fits into the PF FLAG
> > code like you requested. But, I have to add a no_files arg. See below:
> >
> >
> > ----------------------------------------------
> >
> >
> > >From 351d476e8db0a78b9bdf22d77dd1abe66c0eac40 Mon Sep 17 00:00:00 2001
> > From: Mike Christie <michael.christie@oracle.com>
> > Date: Mon, 13 Sep 2021 11:20:20 -0500
> > Subject: [PATCH] fork: allow cloning of userspace procs from kernel
> >
> > Userspace apps/processes like Qemu call into the vhost layer to create
> > worker threads which execute IO on behalf of VMs. If users set RIMIT
> > or cgroup limits or setup v2 cgroups or namespaces, the worker thread
> > is not accounted for or even setup correctly. The reason is that vhost
> > uses the kthread api which inherits those attributes/values from the
> > kthreadd thread. This patch allows kernel modules to work like the
> > io_uring code which can call kernel_clone from the userspace thread's
> > context and directly inherit its attributes like cgroups from and will
> > check limits like RLIMIT_NPROC against that userspace thread.
> >
> > Note: this patch combines 2 changes that should be separate patches. I'm
> > including both in one patch to just make it easier to get an idea of what
> > needs to be done. If we are ok with this then I'll break it up into a
> > proper patchset.
> >
> > This patch does the following:
> >
> > 1. Separates the PF_IO_WORKER flag behavior that controls signals and exit
> > cleanup into a new flag PF_USER_WORKER, so the vhost layer can use it
> > without the PF_IO_WORKER scheduling/IO behavior.
> >
> > 2. It adds a new no_files kernel_clone_args field. This is needed by vhost
> > because tools like qemu/libvirt do not always do a close() on the vhost
> > device. For some devices they just rely on the process exit reaping/cleanup
> > code to do a close() on all open FDs. However, if the vhost worker threads
> > have the device open (CLONE_FILES not set) or have a refcount on the
> > files_struct (CLONE_FILES set) then we can leak or possibly crash.
> >
> > leak - qemu just exits and expects the put done by the process exit
> > code will be the last put on the fd. But becuase the worker thread has a
> > ref to the fd or to the process's files_struct then it will never get the
> > last put and so the vhost device's release function will never be called.
> >
> > crash - if we add signal handling to the worker threads then it can
> > happen where the worker thread might get the signal and exit before
> > qemu has called the vhost cleanup releated ioctls and we can end up
> > crashing referencing what should be a valid device still.
> > ---
> > arch/x86/kernel/process.c | 4 ++--
> > include/linux/sched.h | 1 +
> > include/linux/sched/task.h | 5 ++++-
> > init/main.c | 4 ++--
> > kernel/fork.c | 24 +++++++++++++++++++-----
> > kernel/kthread.c | 3 ++-
> > kernel/signal.c | 4 ++--
> > kernel/umh.c | 5 +++--
> > 8 files changed, 35 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index 1d9463e3096b..1c5d516fb508 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -178,9 +178,9 @@ int copy_thread(unsigned long clone_flags, unsigned
> > long sp, unsigned long arg,
> > task_user_gs(p) = get_user_gs(current_pt_regs());
> > #endif
> >
> > - if (unlikely(p->flags & PF_IO_WORKER)) {
> > + if (unlikely(p->flags & PF_USER_WORKER)) {
> > /*
> > - * An IO thread is a user space thread, but it doesn't
> > + * A user worker thread is a user space thread, but it doesn't
> > * return to ret_after_fork().
> > *
> > * In order to indicate that to tools like gdb,
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index ec8d07d88641..0c9b3f62d85f 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1577,6 +1577,7 @@ extern struct pid *cad_pid;
> > #define PF_VCPU 0x00000001 /* I'm a virtual CPU */
> > #define PF_IDLE 0x00000002 /* I am an IDLE thread
> > */
> > #define PF_EXITING 0x00000004 /* Getting shut down */
> > +#define PF_USER_WORKER 0x00000008 /* Userspace kernel
> > thread */
> > #define PF_IO_WORKER 0x00000010 /* Task is an IO worker
> > */
> > #define PF_WQ_WORKER 0x00000020 /* I'm a workqueue
> > worker */
> > #define PF_FORKNOEXEC 0x00000040 /* Forked but didn't
> > exec */
> > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> > index ef02be869cf2..2a8f9b8c3868 100644
> > --- a/include/linux/sched/task.h
> > +++ b/include/linux/sched/task.h
> > @@ -32,6 +32,8 @@ struct kernel_clone_args {
> > size_t set_tid_size;
> > int cgroup;
> > int io_thread;
> > + int no_files;
> > + int user_worker;
> > struct cgroup *cgrp;
> > struct css_set *cset;
> > };
> > @@ -86,7 +88,8 @@ extern pid_t kernel_clone(struct kernel_clone_args
> > *kargs);
> > struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int
> > node);
> > struct task_struct *fork_idle(int);
> > struct mm_struct *copy_init_mm(void);
> > -extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long
> > flags);
> > +extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long
> > flags,
> > + int no_files, int user_worker);
> > extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);
> > int kernel_wait(pid_t pid, int *stat);
> >
> > diff --git a/init/main.c b/init/main.c
> > index f5b8246e8aa1..18f3b126df93 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -676,7 +676,7 @@ noinline void __ref rest_init(void)
> > * the init task will end up wanting to create kthreads, which, if
> > * we schedule it before we create kthreadd, will OOPS.
> > */
> > - pid = kernel_thread(kernel_init, NULL, CLONE_FS);
> > + pid = kernel_thread(kernel_init, NULL, CLONE_FS, 0, 0);
> > /*
> > * Pin init on the boot CPU. Task migration is not properly working
> > * until sched_init_smp() has been run. It will set the allowed
> > @@ -689,7 +689,7 @@ noinline void __ref rest_init(void)
> > rcu_read_unlock();
> >
> > numa_default_policy();
> > - pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
> > + pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES, 0, 0);
> > rcu_read_lock();
> > kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
> > rcu_read_unlock();
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index bc94b2cc5995..9528940d83d7 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1458,7 +1458,8 @@ static int copy_fs(unsigned long clone_flags, struct
> > task_struct *tsk)
> > return 0;
> > }
> >
> > -static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
> > +static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
> > + int no_files)
> > {
> > struct files_struct *oldf, *newf;
> > int error = 0;
> > @@ -1470,6 +1471,11 @@ static int copy_files(unsigned long clone_flags,
> > struct task_struct *tsk)
> > if (!oldf)
> > goto out;
> >
> > + if (no_files) {
> > + tsk->files = NULL;
> > + goto out;
> > + }
> > +
> > if (clone_flags & CLONE_FILES) {
> > atomic_inc(&oldf->count);
> > goto out;
> > @@ -1954,11 +1960,14 @@ static __latent_entropy struct task_struct
> > *copy_process(
> > goto fork_out;
> > if (args->io_thread) {
> > /*
> > - * Mark us an IO worker, and block any signal that isn't
> > - * fatal or STOP
> > + * Mark us an IO worker.
> > */
> > p->flags |= PF_IO_WORKER;
> > + }
> > +
> > + if (args->user_worker) {
> > siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> > + p->flags |= PF_USER_WORKER;
> > }
> >
> > /*
> > @@ -2104,7 +2113,7 @@ static __latent_entropy struct task_struct
> > *copy_process(
> > retval = copy_semundo(clone_flags, p);
> > if (retval)
> > goto bad_fork_cleanup_security;
> > - retval = copy_files(clone_flags, p);
> > + retval = copy_files(clone_flags, p, args->no_files);
> > if (retval)
> > goto bad_fork_cleanup_semundo;
> > retval = copy_fs(clone_flags, p);
> > @@ -2452,6 +2461,7 @@ struct task_struct *create_io_thread(int (*fn)(void
> > *), void *arg, int node)
> > .stack = (unsigned long)fn,
> > .stack_size = (unsigned long)arg,
> > .io_thread = 1,
> > + .user_worker = 1,
> > };
> >
> > return copy_process(NULL, 0, node, &args);
> > @@ -2548,7 +2558,8 @@ pid_t kernel_clone(struct kernel_clone_args *args)
> > /*
> > * Create a kernel thread.
> > */
> > -pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
> > +pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags,
> > + int no_files, int user_worker)
> > {
> > struct kernel_clone_args args = {
> > .flags = ((lower_32_bits(flags) | CLONE_VM |
> > @@ -2556,10 +2567,13 @@ pid_t kernel_thread(int (*fn)(void *), void *arg,
> > unsigned long flags)
> > .exit_signal = (lower_32_bits(flags) & CSIGNAL),
> > .stack = (unsigned long)fn,
> > .stack_size = (unsigned long)arg,
> > + .no_files = no_files,
> > + .user_worker = user_worker,
> > };
> >
> > return kernel_clone(&args);
> > }
> > +EXPORT_SYMBOL_GPL(kernel_thread);
> >
> > #ifdef __ARCH_WANT_SYS_FORK
> > SYSCALL_DEFINE0(fork)
> > diff --git a/kernel/kthread.c b/kernel/kthread.c
> > index 5b37a8567168..724c7ec63307 100644
> > --- a/kernel/kthread.c
> > +++ b/kernel/kthread.c
> > @@ -339,7 +339,8 @@ static void create_kthread(struct kthread_create_info
> > *create)
> > current->pref_node_fork = create->node;
> > #endif
> > /* We want our own signal handler (we take no signals by default). */
> > - pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
> > + pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD,
> > + 0, 0);
> > if (pid < 0) {
> > /* If user was SIGKILLed, I release the structure. */
> > struct completion *done = xchg(&create->done, NULL);
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index a3229add4455..3f901067b266 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -2795,11 +2795,11 @@ bool get_signal(struct ksignal *ksig)
> > }
> >
> > /*
> > - * PF_IO_WORKER threads will catch and exit on fatal signals
> > + * PF_USER_WORKER threads will catch and exit on fatal signals
> > * themselves. They have cleanup that must be performed, so
> > * we cannot call do_exit() on their behalf.
> > */
> > - if (current->flags & PF_IO_WORKER)
> > + if (current->flags & PF_USER_WORKER)
> > goto out;
> >
> > /*
> > diff --git a/kernel/umh.c b/kernel/umh.c
> > index 36c123360ab8..a6b7b733bd99 100644
> > --- a/kernel/umh.c
> > +++ b/kernel/umh.c
> > @@ -132,7 +132,8 @@ static void call_usermodehelper_exec_sync(struct
> > subprocess_info *sub_info)
> >
> > /* If SIGCLD is ignored do_wait won't populate the status. */
> > kernel_sigaction(SIGCHLD, SIG_DFL);
> > - pid = kernel_thread(call_usermodehelper_exec_async, sub_info, SIGCHLD);
> > + pid = kernel_thread(call_usermodehelper_exec_async, sub_info, SIGCHLD,
> > + 0, 0);
> > if (pid < 0)
> > sub_info->retval = pid;
> > else
> > @@ -172,7 +173,7 @@ static void call_usermodehelper_exec_work(struct
> > work_struct *work)
> > * that always ignores SIGCHLD to ensure auto-reaping.
> > */
> > pid = kernel_thread(call_usermodehelper_exec_async, sub_info,
> > - CLONE_PARENT | SIGCHLD);
> > + CLONE_PARENT | SIGCHLD, 0, 0);
> > if (pid < 0) {
> > sub_info->retval = pid;
> > umh_complete(sub_info);
>
>
> Looks quite reasonable to me. You do of course want to post
> it and CC the proper people so it gets review. And add the
> vhost changes of course.
Looks reasonable to me. Cc Jens Axboe too on this please.
Christian