If kthread_is_per_cpu runs concurrently with free_kthread_struct the kthread_struct that was just freed may be read from.
This bug was introduced by commit 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads"). When kthread_struct started to be allocated for all tasks that have PF_KTHREAD set. This in turn required the kthread_struct to be freed in kernel_execve and violated the assumption that kthread_struct will have the same lifetime as the task.
Looking a bit deeper this only applies to callers of kernel_execve which is just the init process and the user mode helper processes. These processes really don't want to be kernel threads but are for historical reasons. Mostly that copy_thread does not know how to take a kernel mode function to the process with for processes without PF_KTHREAD or PF_IO_WORKER set.
Solve this by not allocating kthread_struct for the init process and the user mode helper processes.
This is done by adding a kthread member to struct kernel_clone_args. Setting kthread in fork_idle and kernel_thread. Adding user_mode_thread that works like kernel_thread except it does not set kthread. In fork only allocating the kthread_struct if .kthread is set.
I have looked at kernel/kthread.c and since commit 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") there have been no assumptions added that to_kthread or __to_kthread will not return NULL.
There are a few callers of to_kthread or __to_kthread that assume a non-NULL struct kthread pointer will be returned. These functions are kthread_data(), kthread_parmme(), kthread_exit(), kthread(), kthread_park(), kthread_unpark(), kthread_stop(). All of those functions can reasonably expected to be called when it is know that a task is a kthread so that assumption seems reasonable.
Cc: stable@vger.kernel.org Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") Reported-by: Максим Кутявин maximkabox13@gmail.com Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com --- fs/exec.c | 6 ++++-- include/linux/sched/task.h | 2 ++ init/main.c | 2 +- kernel/fork.c | 22 ++++++++++++++++++++-- kernel/umh.c | 6 +++--- 5 files changed, 30 insertions(+), 8 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c index e3e55d5e0be1..75eb6e0ee7b2 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1308,8 +1308,6 @@ int begin_new_exec(struct linux_binprm * bprm) if (retval) goto out_unlock;
- if (me->flags & PF_KTHREAD) - free_kthread_struct(me); me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD | PF_NOFREEZE | PF_NO_SETAFFINITY); flush_thread(); @@ -1955,6 +1953,10 @@ int kernel_execve(const char *kernel_filename, int fd = AT_FDCWD; int retval;
+ if (WARN_ON_ONCE((current->flags & PF_KTHREAD) && + (current->worker_private))) + return -EINVAL; + filename = getname_kernel(kernel_filename); if (IS_ERR(filename)) return PTR_ERR(filename); diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 719c9a6cac8d..4492266935dd 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -32,6 +32,7 @@ struct kernel_clone_args { size_t set_tid_size; int cgroup; int io_thread; + int kthread; struct cgroup *cgrp; struct css_set *cset; }; @@ -89,6 +90,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node); struct task_struct *fork_idle(int); struct mm_struct *copy_init_mm(void); extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags); +extern pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags); extern long kernel_wait4(pid_t, int __user *, int, struct rusage *); int kernel_wait(pid_t pid, int *stat);
diff --git a/init/main.c b/init/main.c index 98182c3c2c4b..39baac0211c6 100644 --- a/init/main.c +++ b/init/main.c @@ -688,7 +688,7 @@ noinline void __ref rest_init(void) * the init task will end up wanting to create kthreads, which, if * we schedule it before we create kthreadd, will OOPS. */ - pid = kernel_thread(kernel_init, NULL, CLONE_FS); + pid = user_mode_thread(kernel_init, NULL, CLONE_FS); /* * Pin init on the boot CPU. Task migration is not properly working * until sched_init_smp() has been run. It will set the allowed diff --git a/kernel/fork.c b/kernel/fork.c index 9796897560ab..27c5203750b4 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2157,7 +2157,7 @@ static __latent_entropy struct task_struct *copy_process( p->io_context = NULL; audit_set_context(p, NULL); cgroup_fork(p); - if (p->flags & PF_KTHREAD) { + if (args->kthread) { if (!set_kthread_struct(p)) goto bad_fork_cleanup_delayacct; } @@ -2548,7 +2548,8 @@ struct task_struct * __init fork_idle(int cpu) { struct task_struct *task; struct kernel_clone_args args = { - .flags = CLONE_VM, + .flags = CLONE_VM, + .kthread = 1, };
task = copy_process(&init_struct_pid, 0, cpu_to_node(cpu), &args); @@ -2679,6 +2680,23 @@ pid_t kernel_clone(struct kernel_clone_args *args) * Create a kernel thread. */ pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags) +{ + struct kernel_clone_args args = { + .flags = ((lower_32_bits(flags) | CLONE_VM | + CLONE_UNTRACED) & ~CSIGNAL), + .exit_signal = (lower_32_bits(flags) & CSIGNAL), + .stack = (unsigned long)fn, + .stack_size = (unsigned long)arg, + .kthread = 1, + }; + + return kernel_clone(&args); +} + +/* + * Create a user mode thread. + */ +pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags) { struct kernel_clone_args args = { .flags = ((lower_32_bits(flags) | CLONE_VM | diff --git a/kernel/umh.c b/kernel/umh.c index 36c123360ab8..b989736e8707 100644 --- a/kernel/umh.c +++ b/kernel/umh.c @@ -132,7 +132,7 @@ static void call_usermodehelper_exec_sync(struct subprocess_info *sub_info)
/* If SIGCLD is ignored do_wait won't populate the status. */ kernel_sigaction(SIGCHLD, SIG_DFL); - pid = kernel_thread(call_usermodehelper_exec_async, sub_info, SIGCHLD); + pid = user_mode_thread(call_usermodehelper_exec_async, sub_info, SIGCHLD); if (pid < 0) sub_info->retval = pid; else @@ -171,8 +171,8 @@ static void call_usermodehelper_exec_work(struct work_struct *work) * want to pollute current->children, and we need a parent * that always ignores SIGCHLD to ensure auto-reaping. */ - pid = kernel_thread(call_usermodehelper_exec_async, sub_info, - CLONE_PARENT | SIGCHLD); + pid = user_mode_thread(call_usermodehelper_exec_async, sub_info, + CLONE_PARENT | SIGCHLD); if (pid < 0) { sub_info->retval = pid; umh_complete(sub_info);
On Fri, May 06 2022 at 09:15, Eric W. Biederman wrote:
* the init task will end up wanting to create kthreads, which, if * we schedule it before we create kthreadd, will OOPS. */
- pid = kernel_thread(kernel_init, NULL, CLONE_FS);
- pid = user_mode_thread(kernel_init, NULL, CLONE_FS);
So init does not have PF_KTHREAD set anymore, which causes this to go sideways with a NULL pointer dereference in get_mm_counter() on next:
get_mm_counter include/linux/mm.h:1996 [inline] get_mm_rss include/linux/mm.h:2049 [inline] task_nr_scan_windows.isra.0+0x23/0x120 kernel/sched/fair.c:1123 task_scan_min kernel/sched/fair.c:1144 [inline] task_scan_start+0x6c/0x400 kernel/sched/fair.c:1150 task_tick_numa kernel/sched/fair.c:2944 [inline] task_tick_fair+0xaeb/0xef0 kernel/sched/fair.c:11186 scheduler_tick+0x20a/0x5e0 kernel/sched/core.c:5380
https://lore.kernel.org/lkml/0000000000008a9fbb05dea76400@google.com
because the fence in task_tick_numa():
if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work) return;
is not longer sufficient. It needs also to bail if !curr->mm.
I'm worried that there are more of these issues lurking. Haven't looked yet.
Thanks,
tglx
Thomas Gleixner tglx@linutronix.de writes:
On Fri, May 06 2022 at 09:15, Eric W. Biederman wrote:
* the init task will end up wanting to create kthreads, which, if * we schedule it before we create kthreadd, will OOPS. */
- pid = kernel_thread(kernel_init, NULL, CLONE_FS);
- pid = user_mode_thread(kernel_init, NULL, CLONE_FS);
So init does not have PF_KTHREAD set anymore, which causes this to go sideways with a NULL pointer dereference in get_mm_counter() on next:
Well not after the change above, but in a later patch yes.
Patch 1/7 really gets us back to the previous status quo, where I introduced the breakage.
get_mm_counter include/linux/mm.h:1996 [inline] get_mm_rss include/linux/mm.h:2049 [inline] task_nr_scan_windows.isra.0+0x23/0x120 kernel/sched/fair.c:1123 task_scan_min kernel/sched/fair.c:1144 [inline] task_scan_start+0x6c/0x400 kernel/sched/fair.c:1150 task_tick_numa kernel/sched/fair.c:2944 [inline] task_tick_fair+0xaeb/0xef0 kernel/sched/fair.c:11186 scheduler_tick+0x20a/0x5e0 kernel/sched/core.c:5380
https://lore.kernel.org/lkml/0000000000008a9fbb05dea76400@google.com
because the fence in task_tick_numa():
if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work) return;
is not longer sufficient. It needs also to bail if !curr->mm.
Agreed. I proposed a patch to do just that a little while ago.
I'm worried that there are more of these issues lurking. Haven't looked yet.
I looked earlier and I missed this one. I am going to look again today, along with applying the obvious fix to task_tick_numa.
I don't think there are many but when the code has evolved into a shape that is not easy to understand things occasionally slip through when the abstractions are made clear to understand. The reason to rework the code and make it clear is that once the code has evolved to a point of many subtle issues making any change is brittle.
Eric
"Eric W. Biederman" ebiederm@xmission.com writes:
Thomas Gleixner tglx@linutronix.de writes:
I'm worried that there are more of these issues lurking. Haven't looked yet.
I looked earlier and I missed this one. I am going to look again today, along with applying the obvious fix to task_tick_numa.
So I have just looked again and I don't see anything. There are a couple of subtle issues on x86. Especially with saving floating point but as I read the code copy_thread initializes things properly so that code that doesn't touch floating point registers doesn't need to do anything.
The important thing for lurking issues is that even if I missed something practically all of the uses of PF_KTHREAD are on x86 or generic code so they should be flushed out quickly.
Eric
linux-stable-mirror@lists.linaro.org