This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread is running.
I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but strace is no longer able to respond, since it is blocked in mm_access.
The deadlock always happens when strace needs to access the tracee's process mmap while another thread in the tracee starts to execve a child process, which cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
strace D 0 30614 30584 0x00000000
Call Trace:
 __schedule+0x3ce/0x6e0
 schedule+0x5c/0xd0
 schedule_preempt_disabled+0x15/0x20
 __mutex_lock.isra.13+0x1ec/0x520
 __mutex_lock_killable_slowpath+0x13/0x20
 mutex_lock_killable+0x28/0x30
 mm_access+0x27/0xa0
 process_vm_rw_core.isra.3+0xff/0x550
 process_vm_rw+0xdd/0xf0
 __x64_sys_process_vm_readv+0x31/0x40
 do_syscall_64+0x64/0x220
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

expect D 0 31933 30876 0x80004003
Call Trace:
 __schedule+0x3ce/0x6e0
 schedule+0x5c/0xd0
 flush_old_exec+0xc4/0x770
 load_elf_binary+0x35a/0x16c0
 search_binary_handler+0x97/0x1d0
 __do_execve_file.isra.40+0x5d4/0x8a0
 __x64_sys_execve+0x49/0x60
 do_syscall_64+0x64/0x220
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
The proposed solution is to have a second mutex that is used in mm_access, so it is allowed to continue while the dying threads are not yet terminated.
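The idea of the second mutex can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct sig_model` and the helper names are hypothetical, pthread mutexes stand in for the kernel mutexes, and `pthread_mutex_trylock()` is used so that the demonstration itself cannot hang. It shows that a reader waiting on the long-held guard mutex would block while exec waits for its sibling threads, whereas a reader using the short-lived change mutex would not.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space model of the two per-process mutexes. */
struct sig_model {
	pthread_mutex_t cred_guard_mutex;  /* held across the whole execve */
	pthread_mutex_t cred_change_mutex; /* held only while creds and mm change */
};

static void sig_model_init(struct sig_model *s)
{
	int rc = pthread_mutex_init(&s->cred_guard_mutex, NULL);

	assert(rc == 0);
	rc = pthread_mutex_init(&s->cred_change_mutex, NULL);
	assert(rc == 0);
	(void)rc;
}

/* Models execve waiting for sibling threads: only the guard mutex is held. */
static void exec_wait_for_threads(struct sig_model *s)
{
	pthread_mutex_lock(&s->cred_guard_mutex);
}

/* Models execve finishing or failing: the guard mutex is dropped. */
static void exec_done(struct sig_model *s)
{
	pthread_mutex_unlock(&s->cred_guard_mutex);
}

/*
 * Models mm_access(). Returns true if the caller would have to wait.
 * "patched" selects the mutex the proposed patch would use.
 */
static bool mm_access_would_block(struct sig_model *s, bool patched)
{
	pthread_mutex_t *m = patched ? &s->cred_change_mutex
				     : &s->cred_guard_mutex;
	int rc = pthread_mutex_trylock(m);

	if (rc == EBUSY)
		return true;
	assert(rc == 0);
	pthread_mutex_unlock(m);
	return false;
}
```

A caller would initialise the model, call `exec_wait_for_threads()`, and then compare `mm_access_would_block(&s, false)` with `mm_access_would_block(&s, true)`. The model deliberately ignores everything else the real locks protect.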
I also took the opportunity to improve the documentation of prepare_creds, which is obviously out of sync.
Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 Documentation/security/credentials.rst    | 18 ++++++------
 fs/exec.c                                 |  9 ++++++
 include/linux/binfmts.h                   |  6 +++-
 include/linux/sched/signal.h              |  1 +
 init/init_task.c                          |  1 +
 kernel/cred.c                             |  2 +-
 kernel/fork.c                             |  5 ++--
 mm/process_vm_access.c                    |  2 +-
 tools/testing/selftests/ptrace/Makefile   |  4 +--
 tools/testing/selftests/ptrace/vmaccess.c | 46 +++++++++++++++++++++++++++++++
 10 files changed, 79 insertions(+), 15 deletions(-)
 create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
v2: adds a test case which passes when this patch is applied.
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..c98e0a8 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,9 +437,13 @@ new set of credentials by calling::
 
 	struct cred *prepare_creds(void);
 
-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful. It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->cred_guard_mutex
+is acquired before this function gets called, and the mutex
+current->signal->cred_change_mutex is acquired later, while the credentials
+and the process mmap are actually changed.
 
 The mutex prevents ``ptrace()`` from altering the ptrace state of a process
 while security checks on credentials construction and changing is taking place
@@ -466,9 +470,8 @@ by calling::
 
 This will alter various aspects of the credentials and the process, giving the
 LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.
 
 This function is guaranteed to return 0, so that it can be tail-called at the
 end of such functions as ``sys_setresuid()``.
@@ -486,8 +489,7 @@ invoked::
 
 	void abort_creds(struct cred *new);
 
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
 
 A typical credentials alteration function would look something like this::
 
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..a6884e4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,6 +1266,12 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	retval = mutex_lock_killable(&current->signal->cred_change_mutex);
+	if (retval)
+		goto out;
+
+	bprm->called_flush_old_exec = 1;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1420,6 +1426,8 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (bprm->called_flush_old_exec)
+			mutex_unlock(&current->signal->cred_change_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1469,6 +1477,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->cred_change_mutex);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..2e1318b 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,11 @@ struct linux_binprm {
 		 * exec has happened. Used to sanitize execution environment
 		 * and to set AT_SECURE auxv for glibc.
 		 */
-		secureexec:1;
+		secureexec:1,
+		/*
+		 * Set by flush_old_exec, when the cred_change_mutex is taken.
+		 */
+		called_flush_old_exec:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..37eeabe 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -225,6 +225,7 @@ struct signal_struct {
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace) */
+	struct mutex cred_change_mutex;	/* guard against credentials change */
 } __randomize_layout;
 
 /*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..6cd9a0f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@
 	.multiprocess = HLIST_HEAD_INIT,
 	.rlim = INIT_RLIMITS,
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+	.cred_change_mutex = __MUTEX_INITIALIZER(init_signals.cred_change_mutex),
 #ifdef CONFIG_POSIX_TIMERS
 	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
 	.cputimer = {
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..e4c78de 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -676,7 +676,7 @@ void __init cred_init(void)
  *
  * Returns the new credentials or NULL if out of memory.
  *
- * Does not take, and does not return holding current->cred_replace_mutex.
+ * Does not take, and does not return holding ->cred_guard_mutex.
  */
 struct cred *prepare_kernel_cred(struct task_struct *daemon)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index 0808095..0395154 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 	struct mm_struct *mm;
 	int err;
 
-	err = mutex_lock_killable(&task->signal->cred_guard_mutex);
+	err = mutex_lock_killable(&task->signal->cred_change_mutex);
 	if (err)
 		return ERR_PTR(err);
 
@@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 		mmput(mm);
 		mm = ERR_PTR(-EACCES);
 	}
-	mutex_unlock(&task->signal->cred_guard_mutex);
+	mutex_unlock(&task->signal->cred_change_mutex);
 
 	return mm;
 }
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->cred_change_mutex);
 
 	return 0;
 }
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
 	if (!mm || IS_ERR(mm)) {
 		rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		/*
-		 * Explicitly map EACCES to EPERM as EPERM is a more a
+		 * Explicitly map EACCES to EPERM as EPERM is a more
 		 * appropriate error code for process_vw_readv/writev
 		 */
 		if (rc == -EACCES)
diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index c0b7f89..2f1f532 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -iquote../../../../include/uapi -Wall
+CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
 
-TEST_GEN_PROGS := get_syscall_info peeksiginfo
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
 
 include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..ef08c9f
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,46 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de>
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+	ptrace(PTRACE_TRACEME, 0, 0, 0);
+	return NULL;
+}
+
+TEST(vmaccess)
+{
+	int f, pid = fork();
+	char mm[64];
+
+	if (!pid) {
+		pthread_t pt;
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	sprintf(mm, "/proc/%d/mem", pid);
+	f = open(mm, O_RDONLY);
+	ASSERT_LE(0, f);
+	close(f);
+	/* this is not fixed! ptrace(PTRACE_ATTACH, pid, 0,0); */
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(0, f);
+}
+
+TEST_HARNESS_MAIN
Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread is running.
> I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but strace is no longer able to respond, since it is blocked in mm_access.
> The deadlock always happens when strace needs to access the tracee's process mmap while another thread in the tracee starts to execve a child process, which cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
I think your patch works, but I don't think another mutex is necessary to solve your case. Possibly it is justified, but I hesitate to introduce yet another concept in the code.
Having read elsewhere in the thread that this does not solve the problem Oleg has mentioned, I am really hesitant to add more complexity to the situation.
For your case there is a straightforward and local workaround.
When the current task is ptracing the target task, don't bother with cred_guard_mutex and ptrace_may_access in mm_access, as those tests have already passed. Instead just confirm the ptrace status, i.e. the permission check in ptrace_access_vm.
I think something like this is all we need.
diff --git a/kernel/fork.c b/kernel/fork.c
index cee89229606a..b0ab98c84589 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1224,6 +1224,16 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 	struct mm_struct *mm;
 	int err;
 
+	if (task->ptrace && (current == task->parent)) {
+		mm = get_task_mm(task);
+		if ((get_dumpable(mm) != SUID_DUMP_USER) &&
+		    !ptracer_capable(task, mm->user_ns)) {
+			mmput(mm);
+			mm = ERR_PTR(-EACCESS);
+		}
+		return mm;
+	}
+
 	err = mutex_lock_killable(&task->signal->cred_guard_mutex);
 	if (err)
 		return ERR_PTR(err);
Does this solve your test case?
The patch above lacks the appropriate locking for the ptrace attached check (tasklist_lock, I think). But it is enough to illustrate the idea, and it is probably a check we want in any event, so that if the tracer starts dropping privileges, process_vm_readv and process_vm_writev will still be usable by the tracer.
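For readers less familiar with the tracer side, the call that ends up in mm_access() in the strace backtrace is process_vm_readv(2). The following is a minimal sketch of such a read, not strace's actual code: `read_tracee()` and `read_tracee_selftest()` are hypothetical helper names, and the self-test simply reads one of the caller's own variables back, which may be refused in some sandboxes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Copy len bytes from address remote_addr in process pid into buf.
 * Returns the number of bytes copied, or -1 with errno set.
 */
static ssize_t read_tracee(pid_t pid, const void *remote_addr,
			   void *buf, size_t len)
{
	struct iovec local = { .iov_base = buf, .iov_len = len };
	struct iovec remote = { .iov_base = (void *)remote_addr,
				.iov_len = len };

	assert(buf != NULL);
	return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}

/*
 * Reads one of our own variables back through the same interface.
 * Returns 1 on a matching read, 0 if the call was refused or is
 * unsupported, and -1 if the data did not match.
 */
static int read_tracee_selftest(void)
{
	volatile int value = 77;
	int copy = 0;
	ssize_t n = read_tracee(getpid(), (const void *)&value,
				&copy, sizeof(copy));

	if (n != (ssize_t)sizeof(copy))
		return 0;
	return copy == 77 ? 1 : -1;
}
```

In the scenario discussed here the tracer would pass the tracee's pid rather than its own, and it is that call which sleeps while the tracee's exec holds the mutex.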
Eric
On 3/2/20 7:38 AM, Eric W. Biederman wrote:
> Does this solve your test case?
I tried this with s/EACCESS/EACCES/.
The test case in this patch is not fixed, but strace does not freeze, at least with my setup where it previously froze repeatably. That is obviously because it bypasses the cred_guard_mutex. But all other processes that access this file still freeze, and cannot be interrupted except with kill -9.
However, that smells like a denial of service: this simple test case, which can be executed by guest, creates a /proc/$pid/mem that freezes any process, even root, when it looks at it. I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
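To make the concern concrete, this is roughly what an innocent reader of such a file does. It is a sketch only: `read_proc_mem()` and `read_proc_mem_selftest()` are hypothetical names, and the self-test reads from /proc/self/mem, which may be unavailable in restricted environments. In the test case from this patch it is the open() call that is expected to hang on an affected kernel, before any data is read.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read len bytes at address addr of process pid via /proc/<pid>/mem.
 * Returns the number of bytes read, or -1 on failure.
 */
static ssize_t read_proc_mem(pid_t pid, const void *addr,
			     void *buf, size_t len)
{
	char path[64];
	ssize_t n;
	int fd;

	assert(buf != NULL);
	snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
	fd = open(path, O_RDONLY);	/* access to the target mm is checked here */
	if (fd < 0)
		return -1;
	n = pread(fd, buf, len, (off_t)(uintptr_t)addr);
	close(fd);
	return n;
}

/*
 * Reads one of our own variables back. Returns 1 on a matching read,
 * 0 if /proc is unavailable or the read was refused, -1 on a mismatch.
 */
static int read_proc_mem_selftest(void)
{
	volatile int value = 1234;
	int copy = 0;
	ssize_t n = read_proc_mem(getpid(), (const void *)&value,
				  &copy, sizeof(copy));

	if (n != (ssize_t)sizeof(copy))
		return 0;
	return copy == 1234 ? 1 : -1;
}
```

Nothing in this reader is unusual, which is the point of the objection: any tool that opens the file for a process stuck in exec would be affected.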
Bernd.
Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> I tried this with s/EACCESS/EACCES/.
> The test case in this patch is not fixed, but strace does not freeze, at least with my setup where it previously froze repeatably.
Thanks, that is what I was aiming at.
So we have one method we can pursue to fix this in practice.
> That is obviously because it bypasses the cred_guard_mutex. But all other processes that access this file still freeze, and cannot be interrupted except with kill -9.
> However, that smells like a denial of service: this simple test case, which can be executed by guest, creates a /proc/$pid/mem that freezes any process, even root, when it looks at it. I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
Yes. The test case in your patch is a variant of the original problem.
I have been staring at this trying to understand the fundamentals of the original deeper problem.
The current scope of cred_guard_mutex in exec is because being ptraced causes suid exec to act differently. So we need to know early if we are ptraced.
If that case did not exist we could reduce the scope of the cred_guard_mutex in exec to where your patch puts the cred_change_mutex.
I am starting to think reworking how we deal with ptrace and exec is the way to solve this problem.
Eric
On 3/2/20 4:57 PM, Eric W. Biederman wrote:
> The current scope of cred_guard_mutex in exec is because being ptraced causes suid exec to act differently. So we need to know early if we are ptraced.
It has a second use: it prevents two threads from entering execve at the same time, which would probably result in disaster.
Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
> It has a second use: it prevents two threads from entering execve at the same time, which would probably result in disaster.
Exec can fail with an error code up until de_thread. de_thread causes exec to fail with the error code -EAGAIN for the second thread to get into de_thread.
So no. The cred_guard_mutex is not needed for that case at all.
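The behaviour described here, where the first thread to reach de_thread proceeds and a second one gets -EAGAIN, can be sketched in user space as follows. This is an illustrative model, not kernel code: `struct group_model`, its fields and `de_thread_model()` are hypothetical names, and a pthread mutex stands in for whatever lock protects the real state.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model of the per-thread-group state consulted on exec. */
struct group_model {
	pthread_mutex_t lock;	/* protects exec_owner */
	const void *exec_owner;	/* thread currently tearing the group down */
};

#define GROUP_MODEL_INIT { PTHREAD_MUTEX_INITIALIZER, NULL }

/*
 * Returns 0 if "self" becomes the thread that carries out the exec,
 * or -EAGAIN if another thread got there first.
 */
static int de_thread_model(struct group_model *g, const void *self)
{
	int ret = 0;

	assert(g != NULL && self != NULL);
	pthread_mutex_lock(&g->lock);
	if (g->exec_owner != NULL)
		ret = -EAGAIN;	/* a group-wide action is already in progress */
	else
		g->exec_owner = self;
	pthread_mutex_unlock(&g->lock);
	return ret;
}
```

Under this model, mutual exclusion between two exec-ing threads comes from the ownership check alone, which is the argument made above for not needing cred_guard_mutex for that purpose.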
>> If that case did not exist we could reduce the scope of the cred_guard_mutex in exec to where your patch puts the cred_change_mutex.
>> I am starting to think reworking how we deal with ptrace and exec is the way to solve this problem.
I am 99% convinced that the fix is to move cred_guard_mutex down.
Then right after we take cred_guard_mutex do:
	if (ptraced) {
		use_original_creds();
	}
And call it a day.
The details suck but I am 99% certain that would solve everyone's problems, and not be too bad to audit either.
Eric
On Mon, Mar 2, 2020 at 5:19 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> I am 99% convinced that the fix is to move cred_guard_mutex down.
"move cred_guard_mutex down" as in "take it once we've already set up the new process, past the point of no return"?
> Then right after we take cred_guard_mutex do:
>	if (ptraced) {
>		use_original_creds();
>	}
> And call it a day.
> The details suck but I am 99% certain that would solve everyone's problems, and not be too bad to audit either.
Ah, hmm, that sounds like it'll work fine at least when no LSMs are involved.
SELinux normally doesn't do the execution-degrading thing, it just blocks the execution completely - see their selinux_bprm_set_creds() hook. So I think they'd still need to set some state on the task that says "we're currently in the middle of an execution where the target task will run in context X", and then check against that in the ptrace_may_access hook. Or I suppose they could just kill the task near the end of execve, although that'd be kinda ugly.
On 3/2/20 5:43 PM, Jann Horn wrote:
On Mon, Mar 2, 2020 at 5:19 PM Eric W. Biederman ebiederm@xmission.com wrote:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/2/20 4:57 PM, Eric W. Biederman wrote:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
I tried this with s/EACCESS/EACCES/.
The test case in this patch is not fixed, but strace does not freeze, at least with my setup where it did freeze repeatably.
Thanks, That is what I was aiming at.
So we have one method we can pursue to fix this in practice.
That is obviously because it bypasses the cred_guard_mutex. But all other processes that access this file still freeze, and cannot be interrupted except with kill -9.
However, that smells like a denial of service: this simple test case, which can be executed by a guest, creates a /proc/$pid/mem that freezes any process, even root, when it looks at it. I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
SELinux normally doesn't do the execution-degrading thing, it just blocks the execution completely - see their selinux_bprm_set_creds() hook. So I think they'd still need to set some state on the task that says "we're currently in the middle of an execution where the target task will run in context X", and then check against that in the ptrace_may_access hook. Or I suppose they could just kill the task near the end of execve, although that'd be kinda ugly.
We have current->in_execve for that, right? I think when the cred_guard_mutex is taken only in the critical section, then PTRACE_ATTACH could take the guard_mutex, and look at current->in_execve, and just return -EAGAIN in that case, right, everybody happy :)
Bernd.
On Mon, Mar 2, 2020 at 6:01 PM Bernd Edlinger bernd.edlinger@hotmail.de wrote:
We have current->in_execve for that, right? I think when the cred_guard_mutex is taken only in the critical section, then PTRACE_ATTACH could take the guard_mutex, and look at current->in_execve, and just return -EAGAIN in that case, right, everybody happy :)
It's probably going to mean that things like strace will just randomly fail to attach to processes if they happen to be in the middle of execve... but I guess that works?
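To make the tracer-side consequence concrete: if PTRACE_ATTACH can fail with EAGAIN while the target is in execve, a tracer would need a retry loop along the following lines. This is a hedged sketch, not anything strace actually does; `attach_fn`, `attach_with_retry` and `fake_attach` are hypothetical names, and the attach call is passed in as a function pointer so the loop can be exercised with a stub.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical wrapper type: returns 0 on success, -1 with errno set on failure. */
typedef long (*attach_fn)(pid_t pid);

/*
 * Retry an attach attempt while it reports EAGAIN (assumed here to mean
 * "target is in the middle of execve"). Any other error is returned at once.
 * A real tracer would sleep or back off between attempts; omitted for brevity.
 */
int attach_with_retry(attach_fn attach, pid_t pid, int max_tries)
{
	assert(attach != NULL && max_tries > 0);

	for (int i = 0; i < max_tries; i++) {
		errno = 0;
		if (attach(pid) == 0)
			return 0;
		if (errno != EAGAIN)
			return -errno;
	}
	return -EAGAIN;
}

/* Stub standing in for the kernel: reports EAGAIN twice, then succeeds. */
int fake_attach_calls;

long fake_attach(pid_t pid)
{
	(void)pid;
	if (++fake_attach_calls <= 2) {
		errno = EAGAIN;
		return -1;
	}
	return 0;
}
```

A real tracer would pass a small wrapper around `ptrace(PTRACE_ATTACH, pid, 0L, 0L)` instead of the stub. The retry bound matters: without it, a target stuck in execve would stall the tracer again, just in userspace this time.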
On 3/2/20 5:17 PM, Eric W. Biederman wrote:
Exec can fail with an error code up until de_thread. de_thread causes exec to fail with the error code -EAGAIN for the second thread to get into de_thread.
So no. The cred_guard_mutex is not needed for that case at all.
Okay, but that will reset current->in_execve, right?
Bernd Edlinger bernd.edlinger@hotmail.de writes:
Okay, but that will reset current->in_execve, right?
Absolutely.
The error handling kicks in and exec_binprm fails with a negative return code. Then __do_execve_file cleans up and clears current->in_execve.
Eric
On 3/2/20 10:49 PM, Eric W. Biederman wrote:
The error handling kicks in and exec_binprm fails with a negative return code. Then __do_execve_file cleans up and clears current->in_execve.
Yes of course. I was under the wrong impression that that value is a kind of global, but it is thread-local.
So I think I need a new boolean; see v3 of the patch, and soon v4 (with just one comment fixed).
I'm currently executing the strace v5.5 testsuite, and every test has passed so far. I'll also look at the gdb testsuite before I send the next version.
Thanks Bernd.
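The distinction between a per-thread flag and a per-thread-group flag can be sketched in plain C. This is only a model under stated assumptions: `model_task` and `model_signal` are hypothetical stand-ins for task_struct and signal_struct, keeping just the two booleans under discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for signal_struct: shared by all threads of a group. */
struct model_signal {
	bool cred_locked_for_ptrace;
};

/* Hypothetical stand-in for task_struct: one per thread. */
struct model_task {
	bool in_execve;
	struct model_signal *signal;
};

/* The thread calling execve marks both its own flag and the group flag. */
void model_begin_exec(struct model_task *t)
{
	assert(t != NULL && t->signal != NULL);
	t->in_execve = true;
	t->signal->cred_locked_for_ptrace = true;
}

/* What an attacher would see if it looked only at the target thread. */
bool model_check_task_flag(const struct model_task *target)
{
	return target->in_execve;
}

/* What an attacher would see if it looked at the shared group state. */
bool model_check_group_flag(const struct model_task *target)
{
	return target->signal->cred_locked_for_ptrace;
}
```

If thread A of a group starts an exec and a tracer attaches to sibling thread B, the per-thread check on B reports nothing, while the group-wide check does. That appears to be the reason a new boolean next to cred_guard_mutex is preferred over reusing in_execve.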
This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread is running.
I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but strace is no longer able to respond, since it is blocked in vm_access.
The deadlock always happens when strace needs to access the tracee's process mmap while another thread in the tracee starts to execve a child process, but that cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
strace D 0 30614 30584 0x00000000
Call Trace:
 __schedule+0x3ce/0x6e0
 schedule+0x5c/0xd0
 schedule_preempt_disabled+0x15/0x20
 __mutex_lock.isra.13+0x1ec/0x520
 __mutex_lock_killable_slowpath+0x13/0x20
 mutex_lock_killable+0x28/0x30
 mm_access+0x27/0xa0
 process_vm_rw_core.isra.3+0xff/0x550
 process_vm_rw+0xdd/0xf0
 __x64_sys_process_vm_readv+0x31/0x40
 do_syscall_64+0x64/0x220
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
expect D 0 31933 30876 0x80004003
Call Trace:
 __schedule+0x3ce/0x6e0
 schedule+0x5c/0xd0
 flush_old_exec+0xc4/0x770
 load_elf_binary+0x35a/0x16c0
 search_binary_handler+0x97/0x1d0
 __do_execve_file.isra.40+0x5d4/0x8a0
 __x64_sys_execve+0x49/0x60
 do_syscall_64+0x64/0x220
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
The proposed solution is to take the cred_guard_mutex only in a critical section at the beginning, and at the end of the execve function, and let PTRACE_ATTACH fail with EAGAIN while execve is not complete, but other functions like vm_access are allowed to complete normally.
I also took the opportunity to improve the documentation of prepare_creds, which is obviously out of sync.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
---
 Documentation/security/credentials.rst    | 19 +++++----
 fs/exec.c                                 | 28 +++++++++++--
 include/linux/binfmts.h                   |  6 ++-
 include/linux/sched/signal.h              |  1 +
 init/init_task.c                          |  1 +
 kernel/cred.c                             |  2 +-
 kernel/fork.c                             |  1 +
 kernel/ptrace.c                           |  4 ++
 mm/process_vm_access.c                    |  2 +-
 tools/testing/selftests/ptrace/Makefile   |  4 +-
 tools/testing/selftests/ptrace/vmaccess.c | 66 +++++++++++++++++++++++++++++++
 11 files changed, 117 insertions(+), 17 deletions(-)
 create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
v2: adds a test case which passes when this patch is applied.
v3: fixes the issue without introducing a new mutex.
v4: fixes one comment and a formatting issue found by checkpatch.pl in the test case.
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..61d6704 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,9 +437,14 @@ new set of credentials by calling::

 	struct cred *prepare_creds(void);

-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful. It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->cred_guard_mutex
+is acquired before this function gets called, and released after setting
+current->signal->cred_locked_for_ptrace. The same mutex is acquired later,
+while the credentials and the process mmap are actually changed, and
+current->signal->cred_locked_for_ptrace is reset again.

 The mutex prevents ``ptrace()`` from altering the ptrace state of a process
 while security checks on credentials construction and changing is taking place
@@ -466,9 +471,8 @@ by calling::

 This will alter various aspects of the credentials and the process, giving the
 LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.

 This function is guaranteed to return 0, so that it can be tail-called at the
 end of such functions as ``sys_setresuid()``.
@@ -486,8 +490,7 @@ invoked::

 	void abort_creds(struct cred *new);

-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.

 A typical credentials alteration function would look something like this::

diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..e466301 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,6 +1266,12 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;

+	retval = mutex_lock_killable(&current->signal->cred_guard_mutex);
+	if (retval)
+		goto out;
+
+	bprm->called_flush_old_exec = 1;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1398,28 +1404,41 @@ void finalize_exec(struct linux_binprm *bprm)
 EXPORT_SYMBOL(finalize_exec);

 /*
- * Prepare credentials and lock ->cred_guard_mutex.
+ * Prepare credentials and set ->cred_locked_for_ptrace.
  * install_exec_creds() commits the new creds and drops the lock.
  * Or, if exec fails before, free_bprm() should release ->cred and
  * and unlock.
  */
 static int prepare_bprm_creds(struct linux_binprm *bprm)
 {
+	int ret;
+
 	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
 		return -ERESTARTNOINTR;

+	ret = -EAGAIN;
+	if (unlikely(current->signal->cred_locked_for_ptrace))
+		goto out;
+
+	ret = -ENOMEM;
 	bprm->cred = prepare_exec_creds();
-	if (likely(bprm->cred))
-		return 0;
+	if (likely(bprm->cred)) {
+		current->signal->cred_locked_for_ptrace = true;
+		ret = 0;
+	}

+out:
 	mutex_unlock(&current->signal->cred_guard_mutex);
-	return -ENOMEM;
+	return ret;
 }

 static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (!bprm->called_flush_old_exec)
+			mutex_lock(&current->signal->cred_guard_mutex);
+		current->signal->cred_locked_for_ptrace = false;
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1469,6 +1488,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	current->signal->cred_locked_for_ptrace = false;
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..2930253 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,11 @@ struct linux_binprm {
 		 * exec has happened. Used to sanitize execution environment
 		 * and to set AT_SECURE auxv for glibc.
 		 */
-		secureexec:1;
+		secureexec:1,
+		/*
+		 * Set by flush_old_exec, when the cred_guard_mutex is taken.
+		 */
+		called_flush_old_exec:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..073a2b7 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -225,6 +225,7 @@ struct signal_struct {
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace) */
+	bool cred_locked_for_ptrace;	/* set while in execve */
 } __randomize_layout;

 /*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..ecefff28 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@
 	.multiprocess = HLIST_HEAD_INIT,
 	.rlim = INIT_RLIMITS,
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+	.cred_locked_for_ptrace = false,
 #ifdef CONFIG_POSIX_TIMERS
 	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
 	.cputimer = {
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..e4c78de 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -676,7 +676,7 @@ void __init cred_init(void)
  *
  * Returns the new credentials or NULL if out of memory.
  *
- * Does not take, and does not return holding current->cred_replace_mutex.
+ * Does not take, and does not return holding ->cred_guard_mutex.
  */
 struct cred *prepare_kernel_cred(struct task_struct *daemon)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index 0808095..a2b2ec8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;

 	mutex_init(&sig->cred_guard_mutex);
+	sig->cred_locked_for_ptrace = false;

 	return 0;
 }
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179..abf09ba 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -395,6 +395,10 @@ static int ptrace_attach(struct task_struct *task, long request,
 	if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
 		goto out;

+	retval = -EAGAIN;
+	if (task->signal->cred_locked_for_ptrace)
+		goto unlock_creds;
+
 	task_lock(task);
 	retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
 	task_unlock(task);
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
 	if (!mm || IS_ERR(mm)) {
 		rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		/*
-		 * Explicitly map EACCES to EPERM as EPERM is a more a
+		 * Explicitly map EACCES to EPERM as EPERM is a more
 		 * appropriate error code for process_vw_readv/writev
 		 */
 		if (rc == -EACCES)
diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index c0b7f89..2f1f532 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -iquote../../../../include/uapi -Wall
+CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall

-TEST_GEN_PROGS := get_syscall_info peeksiginfo
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess

 include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..6d8a048
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger bernd.edlinger@hotmail.de
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+	ptrace(PTRACE_TRACEME, 0, 0L, 0L);
+	return NULL;
+}
+
+TEST(vmaccess)
+{
+	int f, pid = fork();
+	char mm[64];
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	sprintf(mm, "/proc/%d/mem", pid);
+	f = open(mm, O_RDONLY);
+	ASSERT_LE(0, f);
+	close(f);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(0, f);
+}
+
+TEST(attach)
+{
+	int f, pid = fork();
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	ASSERT_EQ(EAGAIN, errno);
+	ASSERT_EQ(f, -1);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(0, f);
+}
+
+TEST_HARNESS_MAIN
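For readers who want the locking protocol of this version in isolation, the following is a simplified userspace model of it. It is a sketch under assumptions, not the kernel code: a pthread mutex stands in for cred_guard_mutex, the `model_*` names are hypothetical, and the second critical section (flush_old_exec through install_exec_creds) is collapsed into one short function.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of signal_struct. */
struct model_signal {
	pthread_mutex_t cred_guard_mutex;
	bool cred_locked_for_ptrace; /* set while an exec is in flight */
};

/* Start of exec: short critical section that only sets the flag. */
int model_prepare_bprm_creds(struct model_signal *sig)
{
	int ret = -EAGAIN; /* a second exec in the same group is refused */

	assert(sig != NULL);
	pthread_mutex_lock(&sig->cred_guard_mutex);
	if (!sig->cred_locked_for_ptrace) {
		sig->cred_locked_for_ptrace = true;
		ret = 0;
	}
	pthread_mutex_unlock(&sig->cred_guard_mutex);
	return ret;
}

/* Attach: refused with -EAGAIN while the flag is set. */
int model_ptrace_attach(struct model_signal *sig)
{
	int ret = -EAGAIN;

	pthread_mutex_lock(&sig->cred_guard_mutex);
	if (!sig->cred_locked_for_ptrace)
		ret = 0;
	pthread_mutex_unlock(&sig->cred_guard_mutex);
	return ret;
}

/* Memory access: takes the mutex briefly and never waits for exec to end. */
int model_mm_access(struct model_signal *sig)
{
	pthread_mutex_lock(&sig->cred_guard_mutex);
	pthread_mutex_unlock(&sig->cred_guard_mutex);
	return 0;
}

/* End of exec (success or failure): clear the flag under the mutex. */
void model_finish_exec(struct model_signal *sig)
{
	pthread_mutex_lock(&sig->cred_guard_mutex);
	sig->cred_locked_for_ptrace = false;
	pthread_mutex_unlock(&sig->cred_guard_mutex);
}
```

The point of the model is that the mutex is never held across the long middle part of exec, so `model_mm_access()` cannot block behind a stalled de_thread; only attach and a concurrent exec are turned away while the flag is set.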
On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
The proposed solution is to take the cred_guard_mutex only in a critical section at the beginning, and at the end of the execve function, and let PTRACE_ATTACH fail with EAGAIN while execve is not complete, but other functions like vm_access are allowed to complete normally.
Sorry to be a bummer, but I don't think this will work. A few more things during the exec process depend on cred_guard_mutex being held.
If I'm reading this patch correctly, this changes the lifetime of the cred_guard_mutex lock to be:
- during prepare_bprm_creds()
- from flush_old_exec() through install_exec_creds()
Before, cred_guard_mutex was held from prepare_bprm_creds() through install_exec_creds().
That means, for example, that check_unsafe_exec()'s documented invariant is violated:
/*
 * determine how safe it is to execute the proposed program
 * - the caller must hold ->cred_guard_mutex to protect against
 *   PTRACE_ATTACH or seccomp thread-sync
 */
static void check_unsafe_exec(struct linux_binprm *bprm)
...
which is looking at no_new_privs as well as other details, and making decisions about the bprm state from the current state.
I think it also means that the potentially multiple invocations of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and binfmt_misc.c) would be changing bprm->cred details (uid, gid) without a lock (another place where current's no_new_privs is evaluated).
Related, it also means that cred_guard_mutex is unheld for every invocation of search_binary_handler() (which can loop via the previously mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid() currently.)
For seccomp, the expectations about existing thread states risks races too. There are two locks held for TSYNC:
- current->sighand->siglock is held to keep new threads from appearing/disappearing, which would destroy filter refcounting and lead to memory corruption.
- cred_guard_mutex is held to keep no_new_privs in sync with filters to avoid no_new_privs and filter confusion during exec, which could lead to exploitable setuid conditions (see below).
Just racing a malicious thread during TSYNC is not a very strong example (a malicious thread could do lots of fun things to "current" before it ever got near calling TSYNC), but I think there is the risk of mismatched/confused states that we don't want to allow. One is a particularly bad state that could lead to privilege escalations (in the form of the old "sendmail doesn't check setuid" flaw; if a setuid process has a filter attached that silently fails a priv-dropping setuid call and continues execution with elevated privs, it can be tricked into doing bad things on behalf of the unprivileged parent, which was the primary goal of the original use of cred_guard_mutex with TSYNC[1]):
thread A clones thread B
thread B starts setuid exec
thread A sets no_new_privs
thread A calls seccomp with TSYNC
thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
thread B passes check_unsafe_exec() with no_new_privs unset
thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
thread A still in seccomp_sync_threads() sets no_new_privs on thread B
thread B finishes exec, now running with elevated privs, a filter chosen by thread A, _and_ nnp set (which doesn't matter)
With the original locking, thread B will fail check_unsafe_exec() because filter and nnp state are changed together, with "atomicity" protected by the cred_guard_mutex.
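The two orderings can be replayed deterministically in a small model. This is only an illustration of the state transitions listed above, with hypothetical `model_*` names; it has no real threads or locks, and "serialized" simply means both TSYNC steps complete before exec inspects the thread, which is what holding cred_guard_mutex across both sides is meant to ensure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-thread state relevant to the race. */
struct model_thread {
	bool nnp;        /* no_new_privs */
	bool filter;     /* seccomp filter installed by TSYNC */
	bool privileged; /* gained privileges through a setuid exec */
};

/* Exec-side decision: a setuid exec gains privileges only if nnp is unset. */
void model_setuid_exec(struct model_thread *t)
{
	assert(t != NULL);
	if (!t->nnp)
		t->privileged = true;
}

/* The two separate steps TSYNC applies to a sibling thread. */
void model_sync_filter(struct model_thread *t) { t->filter = true; }
void model_sync_nnp(struct model_thread *t) { t->nnp = true; }

/* Interleaving from the list: exec's check lands between the sync steps. */
struct model_thread model_racy_order(void)
{
	struct model_thread b = { false, false, false };

	model_sync_filter(&b);
	model_setuid_exec(&b);
	model_sync_nnp(&b);
	return b;
}

/* Both sync steps complete before exec looks at the thread. */
struct model_thread model_serialized_order(void)
{
	struct model_thread b = { false, false, false };

	model_sync_filter(&b);
	model_sync_nnp(&b);
	model_setuid_exec(&b);
	return b;
}
```

In the racy order thread B ends up privileged with a filter it did not choose; in the serialized order the filter and nnp arrive together and the exec gains nothing.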
And this is just the bad state I _can_ see. I'm worried there are more...
All this said, I do see a small similarity here to the work I did to stabilize stack rlimits (there was an ongoing problem with making multiple decisions for the bprm based on current's state -- but current's state was mutable during exec). For this, I saved rlim_stack to bprm and ignored current's copy until exec ended and then stored bprm's copy into current. If the only problem anyone can see here is the handling of no_new_privs, we might be able to solve that similarly, at least disentangling tsync/nnp from cred_guard_mutex.
-Kees
[1] https://lore.kernel.org/lkml/20140625142121.GD7892@redhat.com/
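The rlim_stack approach mentioned above is a snapshot pattern: copy the mutable value into the bprm once and have every later decision read the copy. A minimal sketch of that pattern applied to no_new_privs follows. The names are hypothetical, nothing like this exists in the patch under review, and the sketch only shows that repeated checks stay consistent with each other; it does not by itself settle what should happen to a filter installed mid-exec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the task and the in-flight exec state. */
struct model_task {
	bool no_new_privs; /* may be changed by another thread during exec */
};

struct model_bprm {
	bool nnp; /* snapshot taken once, when exec starts */
};

/* Take the snapshot at the start of exec. */
void model_bprm_init(struct model_bprm *bprm, const struct model_task *cur)
{
	assert(bprm != NULL && cur != NULL);
	bprm->nnp = cur->no_new_privs;
}

/* Later decisions read the snapshot, never the live task state. */
bool model_exec_is_nnp(const struct model_bprm *bprm)
{
	return bprm->nnp;
}

bool model_fill_uid_may_gain(const struct model_bprm *bprm)
{
	return !bprm->nnp;
}
```

Because both helpers look only at `bprm->nnp`, a change to the task's flag between the two calls cannot make them disagree, which is the property the stack-rlimit work relied on.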
On 3/3/20 3:26 AM, Kees Cook wrote:
On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
The proposed solution is to take the cred_guard_mutex only in a critical section at the beginning, and at the end of the execve function, and let PTRACE_ATTACH fail with EAGAIN while execve is not complete, but other functions like vm_access are allowed to complete normally.
Sorry to be a bummer, but I don't think this will work. A few more things during the exec process depend on cred_guard_mutex being held.
If I'm reading this patch correctly, this changes the lifetime of the cred_guard_mutex lock to be:
- during prepare_bprm_creds()
- from flush_old_exec() through install_exec_creds()
Before, cred_guard_mutex was held from prepare_bprm_creds() through install_exec_creds().
That means, for example, that check_unsafe_exec()'s documented invariant is violated:
/*
 * determine how safe it is to execute the proposed program
 * - the caller must hold ->cred_guard_mutex to protect against
 *   PTRACE_ATTACH or seccomp thread-sync
 */
Oh, right, I haven't understood that hint...
static void check_unsafe_exec(struct linux_binprm *bprm) ...
which is looking at no_new_privs as well as other details, and making decisions about the bprm state from the current state.
I think it also means that the potentially multiple invocations of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and binfmt_misc.c) would be changing bprm->cred details (uid, gid) without a lock (another place where current's no_new_privs is evaluated).
So no_new_privs can change from 0->1, but should not when execve is running.
As long as the calling thread is in execve it won't do this, and the only other place where it may be set for other threads is seccomp_sync_threads, but that can easily be avoided, see below.
Related, it also means that cred_guard_mutex is unheld for every invocation of search_binary_handler() (which can loop via the previously mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid() currently.)
For seccomp, the expectations about existing thread states risks races too. There are two locks held for TSYNC:
- current->sighand->siglock is held to keep new threads from appearing/disappearing, which would destroy filter refcounting and lead to memory corruption.
I don't understand what you mean here. How can this lead to memory corruption?
- cred_guard_mutex is held to keep no_new_privs in sync with filters to avoid no_new_privs and filter confusion during exec, which could lead to exploitable setuid conditions (see below).
Just racing a malicious thread during TSYNC is not a very strong example (a malicious thread could do lots of fun things to "current" before it ever got near calling TSYNC), but I think there is the risk of mismatched/confused states that we don't want to allow. One is a particularly bad state that could lead to privilege escalations (in the form of the old "sendmail doesn't check setuid" flaw; if a setuid process has a filter attached that silently fails a priv-dropping setuid call and continues execution with elevated privs, it can be tricked into doing bad things on behalf of the unprivileged parent, which was the primary goal of the original use of cred_guard_mutex with TSYNC[1]):
thread A clones thread B
thread B starts setuid exec
thread A sets no_new_privs
thread A calls seccomp with TSYNC
thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
thread B passes check_unsafe_exec() with no_new_privs unset
thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
thread A still in seccomp_sync_threads() sets no_new_privs on thread B
thread B finishes exec, now running with elevated privs, a filter chosen by thread A, _and_ nnp set (which doesn't matter)
With the original locking, thread B will fail check_unsafe_exec() because filter and nnp state are changed together, with "atomicity" protected by the cred_guard_mutex.
Ah, good point, thanks!
This can be fixed by checking current->signal->cred_locked_for_ptrace while the cred_guard_mutex is locked, like this for instance:
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b6ea3dc..377abf0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
 	BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
 	assert_spin_locked(&current->sighand->siglock);
 
+	if (current->signal->cred_locked_for_ptrace)
+		return -EAGAIN;
+
 	/* Validate all threads being eligible for synchronization. */
 	caller = current;
 	for_each_thread(caller, thread) {
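The effect of that check on the race described above can be modelled in userspace. This is a sketch under assumptions, not kernel code: `struct model_group` and the `model_*` helpers are hypothetical, and a pthread mutex stands in for cred_guard_mutex. TSYNC either refuses while a sibling is in exec, or changes no_new_privs and the filter together.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model of one thread group. */
struct model_group {
	pthread_mutex_t cred_guard_mutex;
	bool cred_locked_for_ptrace;	/* a sibling is in the middle of exec */
	bool no_new_privs;
	int filter_id;			/* 0 means no filter installed */
};

#define MODEL_GROUP_INIT { PTHREAD_MUTEX_INITIALIZER, false, false, 0 }

/* Set or clear the exec-in-progress flag, always under the mutex. */
void model_set_exec_in_progress(struct model_group *g, bool in_progress)
{
	pthread_mutex_lock(&g->cred_guard_mutex);
	g->cred_locked_for_ptrace = in_progress;
	pthread_mutex_unlock(&g->cred_guard_mutex);
}

/*
 * TSYNC: refuse with -EAGAIN while an exec is in flight, otherwise
 * change nnp and the filter together under the mutex.
 */
int model_tsync(struct model_group *g, int filter_id)
{
	int ret = 0;

	pthread_mutex_lock(&g->cred_guard_mutex);
	if (g->cred_locked_for_ptrace) {
		ret = -EAGAIN;
	} else {
		g->no_new_privs = true;
		g->filter_id = filter_id;
	}
	pthread_mutex_unlock(&g->cred_guard_mutex);
	return ret;
}
```

With the flag set, thread A's TSYNC fails before it touches thread B, so thread B cannot end up with a filter chosen by thread A after having gained privileges.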
And this is just the bad state I _can_ see. I'm worried there are more...
All this said, I do see a small similarity here to the work I did to stabilize stack rlimits (there was an ongoing problem with making multiple decisions for the bprm based on current's state -- but current's state was mutable during exec). For this, I saved rlim_stack to bprm and ignored current's copy until exec ended and then stored bprm's copy into current. If the only problem anyone can see here is the handling of no_new_privs, we might be able to solve that similarly, at least disentangling tsync/nnp from cred_guard_mutex.
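The snapshot idea from the rlim_stack work can be sketched roughly as follows. Everything here is a hypothetical, heavily reduced model: `struct model_task` and `struct model_bprm` are not the kernel structures, and snapshotting no_new_privs is only the extension suggested above, not something that exists in the code under review.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-ins for task_struct and linux_binprm. */
struct model_task {
	unsigned long rlim_stack;
	bool no_new_privs;
};

struct model_bprm {
	unsigned long rlim_stack;	/* snapshot taken when the exec starts */
	bool no_new_privs;		/* assumed extension of the same idea */
};

/* Take one snapshot; later decisions read bprm, not the live task. */
void model_bprm_init(struct model_bprm *bprm, const struct model_task *task)
{
	bprm->rlim_stack = task->rlim_stack;
	bprm->no_new_privs = task->no_new_privs;
}

/* Every privilege decision during the exec uses the same snapshot. */
bool model_may_gain_privs(const struct model_bprm *bprm)
{
	return !bprm->no_new_privs;
}

/* When the exec finishes, the saved stack limit is stored back. */
void model_exec_commit(struct model_task *task, const struct model_bprm *bprm)
{
	task->rlim_stack = bprm->rlim_stack;
}
```

The point of the pattern is that repeated checks during one exec cannot disagree with each other, even if the task's own state changes in between. Whether that is sufficient for no_new_privs is exactly the open question in this discussion.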
I still think that is solvable with using cred_locked_for_ptrace and simply make the tsync fail if it would otherwise be blocked.
Thanks Bernd.
On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
On 3/3/20 3:26 AM, Kees Cook wrote:
On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
[...]
If I'm reading this patch correctly, this changes the lifetime of the cred_guard_mutex lock to be:
- during prepare_bprm_creds()
- from flush_old_exec() through install_exec_creds()
Before, cred_guard_mutex was held from prepare_bprm_creds() through install_exec_creds().
BTW, I think the effect of this change (i.e. my paragraph above) should be distinctly called out in the commit log if this solution moves forward.
That means, for example, that check_unsafe_exec()'s documented invariant is violated:
/*
 * determine how safe it is to execute the proposed program
 * - the caller must hold ->cred_guard_mutex to protect against
 *   PTRACE_ATTACH or seccomp thread-sync
 */
Oh, right, I haven't understood that hint...
I know no_new_privs is checked there, but I haven't studied the PTRACE_ATTACH part of that comment. If that is handled with the new check, this comment should be updated.
I think it also means that the potentially multiple invocations of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and binfmt_misc.c) would be changing bprm->cred details (uid, gid) without a lock (another place where current's no_new_privs is evaluated).
So no_new_privs can change from 0->1, but should not when execve is running.
As long as the calling thread is in execve it won't do this, and the only other place where it may be set for other threads is seccomp_sync_threads, but that can easily be avoided, see below.
Yeah, everything was fine until I had to go complicate things with TSYNC. ;) The real goal is making sure an exec cannot gain privs while later gaining a seccomp filter from an unpriv process. The no_new_privs flag was used to control this, but it required that the filter not get applied during exec.
Related, it also means that cred_guard_mutex is unheld for every invocation of search_binary_handler() (which can loop via the previously mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid() currently.)
For seccomp, the expectations about existing thread states risks races too. There are two locks held for TSYNC:
- current->sighand->siglock is held to keep new threads from appearing/disappearing, which would destroy filter refcounting and lead to memory corruption.
I don't understand what you mean here. How can this lead to memory corruption?
Mainly this is a matter of how seccomp manages its filter hierarchy (since the filters are shared through process ancestry), so if a thread appears in the middle of TSYNC it may be racing another TSYNC and break ancestry, leading to bad reference counting on process death, etc. (Though, yes, with refcount_t now, things should never corrupt, just waste memory.)
- cred_guard_mutex is held to keep no_new_privs in sync with filters to avoid no_new_privs and filter confusion during exec, which could lead to exploitable setuid conditions (see below).
Just racing a malicious thread during TSYNC is not a very strong example (a malicious thread could do lots of fun things to "current" before it ever got near calling TSYNC), but I think there is the risk of mismatched/confused states that we don't want to allow. One is a particularly bad state that could lead to privilege escalations (in the form of the old "sendmail doesn't check setuid" flaw; if a setuid process has a filter attached that silently fails a priv-dropping setuid call and continues execution with elevated privs, it can be tricked into doing bad things on behalf of the unprivileged parent, which was the primary goal of the original use of cred_guard_mutex with TSYNC[1]):
thread A clones thread B
thread B starts setuid exec
thread A sets no_new_privs
thread A calls seccomp with TSYNC
thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
thread B passes check_unsafe_exec() with no_new_privs unset
thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
thread A still in seccomp_sync_threads() sets no_new_privs on thread B
thread B finishes exec, now running with elevated privs, a filter chosen by thread A, _and_ nnp set (which doesn't matter)
With the original locking, thread B will fail check_unsafe_exec() because filter and nnp state are changed together, with "atomicity" protected by the cred_guard_mutex.
Ah, good point, thanks!
This can be fixed by checking current->signal->cred_locked_for_ptrace while the cred_guard_mutex is locked, like this for instance:
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b6ea3dc..377abf0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
 	BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
 	assert_spin_locked(&current->sighand->siglock);
 
+	if (current->signal->cred_locked_for_ptrace)
+		return -EAGAIN;
Hmm. I guess something like that could work. TSYNC expects to be able to report _which_ thread wrecked the call, though... I wonder if in_execve could be used to figure out the offending thread. Hm, nope, that would be outside of lock too (and all users are "current" right now, so the lock wasn't needed before).
 	/* Validate all threads being eligible for synchronization. */
 	caller = current;
 	for_each_thread(caller, thread) {
And this is just the bad state I _can_ see. I'm worried there are more...
All this said, I do see a small similarity here to the work I did to stabilize stack rlimits (there was an ongoing problem with making multiple decisions for the bprm based on current's state -- but current's state was mutable during exec). For this, I saved rlim_stack to bprm and ignored current's copy until exec ended and then stored bprm's copy into current. If the only problem anyone can see here is the handling of no_new_privs, we might be able to solve that similarly, at least disentangling tsync/nnp from cred_guard_mutex.
I still think that is solvable with using cred_locked_for_ptrace and simply make the tsync fail if it would otherwise be blocked.
I wonder if we can find a better name than "cred_locked_for_ptrace"? Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
And the comment on bool cred_locked_for_ptrace should mention that access is only allowed under cred_guard_mutex lock.
- sig->cred_locked_for_ptrace = false;
This is redundant to the zalloc -- I think you can drop it (unless someone wants to keep it for clarity?)
Also, I think cred_locked_for_ptrace needs checking deeper, in __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make calls to ptrace_may_access() holding cred_guard_mutex, expecting that to be sufficient to see a stable version of the thread...
(I remain very nervous about weakening cred_guard_mutex without addressing the many many users...)
On 3/3/20 6:29 AM, Kees Cook wrote:
On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
On 3/3/20 3:26 AM, Kees Cook wrote:
On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
[...]
If I'm reading this patch correctly, this changes the lifetime of the cred_guard_mutex lock to be:
- during prepare_bprm_creds()
- from flush_old_exec() through install_exec_creds()
Before, cred_guard_mutex was held from prepare_bprm_creds() through install_exec_creds().
BTW, I think the effect of this change (i.e. my paragraph above) should be distinctly called out in the commit log if this solution moves forward.
Okay, will do.
That means, for example, that check_unsafe_exec()'s documented invariant is violated:
/*
 * determine how safe it is to execute the proposed program
 * - the caller must hold ->cred_guard_mutex to protect against
 *   PTRACE_ATTACH or seccomp thread-sync
 */
Oh, right, I haven't understood that hint...
I know no_new_privs is checked there, but I haven't studied the PTRACE_ATTACH part of that comment. If that is handled with the new check, this comment should be updated.
Okay, I'll change that comment to:
/*
 * determine how safe it is to execute the proposed program
 * - the caller must have set ->cred_locked_in_execve to protect against
 *   PTRACE_ATTACH or seccomp thread-sync
 */
I think it also means that the potentially multiple invocations of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and binfmt_misc.c) would be changing bprm->cred details (uid, gid) without a lock (another place where current's no_new_privs is evaluated).
So no_new_privs can change from 0->1, but should not when execve is running.
As long as the calling thread is in execve it won't do this, and the only other place where it may be set for other threads is seccomp_sync_threads, but that can easily be avoided, see below.
Yeah, everything was fine until I had to go complicate things with TSYNC. ;) The real goal is making sure an exec cannot gain privs while later gaining a seccomp filter from an unpriv process. The no_new_privs flag was used to control this, but it required that the filter not get applied during exec.
Related, it also means that cred_guard_mutex is unheld for every invocation of search_binary_handler() (which can loop via the previously mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid() currently.)
For seccomp, the expectations about existing thread states risks races too. There are two locks held for TSYNC:
- current->sighand->siglock is held to keep new threads from appearing/disappearing, which would destroy filter refcounting and lead to memory corruption.
I don't understand what you mean here. How can this lead to memory corruption?
Mainly this is a matter of how seccomp manages its filter hierarchy (since the filters are shared through process ancestry), so if a thread appears in the middle of TSYNC it may be racing another TSYNC and break ancestry, leading to bad reference counting on process death, etc. (Though, yes, with refcount_t now, things should never corrupt, just waste memory.)
I assume for now, that the current->sighand->siglock held while iterating all threads is sufficient here.
- cred_guard_mutex is held to keep no_new_privs in sync with filters to avoid no_new_privs and filter confusion during exec, which could lead to exploitable setuid conditions (see below).
Just racing a malicious thread during TSYNC is not a very strong example (a malicious thread could do lots of fun things to "current" before it ever got near calling TSYNC), but I think there is the risk of mismatched/confused states that we don't want to allow. One is a particularly bad state that could lead to privilege escalations (in the form of the old "sendmail doesn't check setuid" flaw; if a setuid process has a filter attached that silently fails a priv-dropping setuid call and continues execution with elevated privs, it can be tricked into doing bad things on behalf of the unprivileged parent, which was the primary goal of the original use of cred_guard_mutex with TSYNC[1]):
thread A clones thread B
thread B starts setuid exec
thread A sets no_new_privs
thread A calls seccomp with TSYNC
thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
thread B passes check_unsafe_exec() with no_new_privs unset
thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
thread A still in seccomp_sync_threads() sets no_new_privs on thread B
thread B finishes exec, now running with elevated privs, a filter chosen by thread A, _and_ nnp set (which doesn't matter)
With the original locking, thread B will fail check_unsafe_exec() because filter and nnp state are changed together, with "atomicity" protected by the cred_guard_mutex.
Ah, good point, thanks!
This can be fixed by checking current->signal->cred_locked_for_ptrace while the cred_guard_mutex is locked, like this for instance:
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b6ea3dc..377abf0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
 	BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
 	assert_spin_locked(&current->sighand->siglock);
 
+	if (current->signal->cred_locked_for_ptrace)
+		return -EAGAIN;
Hmm. I guess something like that could work. TSYNC expects to be able to report _which_ thread wrecked the call, though... I wonder if in_execve could be used to figure out the offending thread. Hm, nope, that would be outside of lock too (and all users are "current" right now, so the lock wasn't needed before).
I could move that in_execve = 1 to prepare_bprm_creds, if it really matters, but the caller will die quickly and cannot do anything with that information when another thread executes execve, right?
 	/* Validate all threads being eligible for synchronization. */
 	caller = current;
 	for_each_thread(caller, thread) {
And this is just the bad state I _can_ see. I'm worried there are more...
All this said, I do see a small similarity here to the work I did to stabilize stack rlimits (there was an ongoing problem with making multiple decisions for the bprm based on current's state -- but current's state was mutable during exec). For this, I saved rlim_stack to bprm and ignored current's copy until exec ended and then stored bprm's copy into current. If the only problem anyone can see here is the handling of no_new_privs, we might be able to solve that similarly, at least disentangling tsync/nnp from cred_guard_mutex.
I still think that is solvable with using cred_locked_for_ptrace and simply make the tsync fail if it would otherwise be blocked.
I wonder if we can find a better name than "cred_locked_for_ptrace"? Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
Yeah, I'd go with "cred_locked_in_execve".
And the comment on bool cred_locked_for_ptrace should mention that access is only allowed under cred_guard_mutex lock.
okay.
- sig->cred_locked_for_ptrace = false;
This is redundant to the zalloc -- I think you can drop it (unless someone wants to keep it for clarity?)
I'll remove that here and in init/init_task.c
Also, I think cred_locked_for_ptrace needs checking deeper, in __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make calls to ptrace_may_access() holding cred_guard_mutex, expecting that to be sufficient to see a stable version of the thread...
No, these need to be addressed individually, but most users just want to know if the current credentials are sufficient at this moment, but will not change the credentials, as ptrace and TSYNC do.
BTW: Not all users have cred_guard_mutex, see mm/migrate.c, mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc. So adding an access to cred_locked_for_execve in ptrace_may_access is probably not an option.
However, one nice added value by this change is this:
#include <pthread.h>
#include <signal.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/types.h>

void *thread(void *arg)
{
	ptrace(PTRACE_TRACEME, 0,0,0);
	return NULL;
}

int main(void)
{
	int pid = fork();

	if (!pid) {
		pthread_t pt;
		pthread_create(&pt, NULL, thread, NULL);
		pthread_join(pt, NULL);
		execlp("echo", "echo", "passed", NULL);
	}

	sleep(1000);
	ptrace(PTRACE_ATTACH, pid, 0,0);
	kill(pid, SIGCONT);
	return 0;
}
cat /proc/3812/stack
[<0>] flush_old_exec+0xbf/0x760
[<0>] load_elf_binary+0x35a/0x16c0
[<0>] search_binary_handler+0x97/0x1d0
[<0>] __do_execve_file.isra.40+0x624/0x920
[<0>] __x64_sys_execve+0x49/0x60
[<0>] do_syscall_64+0x64/0x220
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
(I remain very nervous about weakening cred_guard_mutex without addressing the many many users...)
They need to be looked at closely, that's pretty clear. Most fall in the class, that just the current credentials need to stay stable for a certain time.
Bernd.
On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote:
On 3/3/20 6:29 AM, Kees Cook wrote:
On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
On 3/3/20 3:26 AM, Kees Cook wrote:
On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
[...]
If I'm reading this patch correctly, this changes the lifetime of the cred_guard_mutex lock to be:
- during prepare_bprm_creds()
- from flush_old_exec() through install_exec_creds()
Before, cred_guard_mutex was held from prepare_bprm_creds() through install_exec_creds().
BTW, I think the effect of this change (i.e. my paragraph above) should be distinctly called out in the commit log if this solution moves forward.
Okay, will do.
That means, for example, that check_unsafe_exec()'s documented invariant is violated:
/*
 * determine how safe it is to execute the proposed program
 * - the caller must hold ->cred_guard_mutex to protect against
 *   PTRACE_ATTACH or seccomp thread-sync
 */
Oh, right, I haven't understood that hint...
I know no_new_privs is checked there, but I haven't studied the PTRACE_ATTACH part of that comment. If that is handled with the new check, this comment should be updated.
Okay, I'll change that comment to:
/*
 * determine how safe it is to execute the proposed program
 * - the caller must have set ->cred_locked_in_execve to protect against
 *   PTRACE_ATTACH or seccomp thread-sync
 */
I think it also means that the potentially multiple invocations of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and binfmt_misc.c) would be changing bprm->cred details (uid, gid) without a lock (another place where current's no_new_privs is evaluated).
So no_new_privs can change from 0->1, but should not when execve is running.
As long as the calling thread is in execve it won't do this, and the only other place where it may be set for other threads is seccomp_sync_threads, but that can easily be avoided, see below.
Yeah, everything was fine until I had to go complicate things with TSYNC. ;) The real goal is making sure an exec cannot gain privs while later gaining a seccomp filter from an unpriv process. The no_new_privs flag was used to control this, but it required that the filter not get applied during exec.
Related, it also means that cred_guard_mutex is unheld for every invocation of search_binary_handler() (which can loop via the previously mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid() currently.)
For seccomp, the expectations about existing thread states risks races too. There are two locks held for TSYNC:
- current->sighand->siglock is held to keep new threads from appearing/disappearing, which would destroy filter refcounting and lead to memory corruption.
I don't understand what you mean here. How can this lead to memory corruption?
Mainly this is a matter of how seccomp manages its filter hierarchy (since the filters are shared through process ancestry), so if a thread appears in the middle of TSYNC it may be racing another TSYNC and break ancestry, leading to bad reference counting on process death, etc. (Though, yes, with refcount_t now, things should never corrupt, just waste memory.)
I assume for now, that the current->sighand->siglock held while iterating all threads is sufficient here.
- cred_guard_mutex is held to keep no_new_privs in sync with filters to avoid no_new_privs and filter confusion during exec, which could lead to exploitable setuid conditions (see below).
Just racing a malicious thread during TSYNC is not a very strong example (a malicious thread could do lots of fun things to "current" before it ever got near calling TSYNC), but I think there is the risk of mismatched/confused states that we don't want to allow. One is a particularly bad state that could lead to privilege escalations (in the form of the old "sendmail doesn't check setuid" flaw; if a setuid process has a filter attached that silently fails a priv-dropping setuid call and continues execution with elevated privs, it can be tricked into doing bad things on behalf of the unprivileged parent, which was the primary goal of the original use of cred_guard_mutex with TSYNC[1]):
thread A clones thread B
thread B starts setuid exec
thread A sets no_new_privs
thread A calls seccomp with TSYNC
thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
thread B passes check_unsafe_exec() with no_new_privs unset
thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
thread A still in seccomp_sync_threads() sets no_new_privs on thread B
thread B finishes exec, now running with elevated privs, a filter chosen by thread A, _and_ nnp set (which doesn't matter)
With the original locking, thread B will fail check_unsafe_exec() because filter and nnp state are changed together, with "atomicity" protected by the cred_guard_mutex.
Ah, good point, thanks!
This can be fixed by checking current->signal->cred_locked_for_ptrace while the cred_guard_mutex is locked, like this for instance:
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b6ea3dc..377abf0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
 	BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
 	assert_spin_locked(&current->sighand->siglock);
 
+	if (current->signal->cred_locked_for_ptrace)
+		return -EAGAIN;
Hmm. I guess something like that could work. TSYNC expects to be able to report _which_ thread wrecked the call, though... I wonder if in_execve could be used to figure out the offending thread. Hm, nope, that would be outside of lock too (and all users are "current" right now, so the lock wasn't needed before).
I could move that in_execve = 1 to prepare_bprm_creds, if it really matters, but the caller will die quickly and cannot do anything with that information when another thread executes execve, right?
 	/* Validate all threads being eligible for synchronization. */
 	caller = current;
 	for_each_thread(caller, thread) {
And this is just the bad state I _can_ see. I'm worried there are more...
All this said, I do see a small similarity here to the work I did to stabilize stack rlimits (there was an ongoing problem with making multiple decisions for the bprm based on current's state -- but current's state was mutable during exec). For this, I saved rlim_stack to bprm and ignored current's copy until exec ended and then stored bprm's copy into current. If the only problem anyone can see here is the handling of no_new_privs, we might be able to solve that similarly, at least disentangling tsync/nnp from cred_guard_mutex.
I still think that is solvable with using cred_locked_for_ptrace and simply make the tsync fail if it would otherwise be blocked.
I wonder if we can find a better name than "cred_locked_for_ptrace"? Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
Yeah, I'd go with "cred_locked_in_execve".
And the comment on bool cred_locked_for_ptrace should mention that access is only allowed under cred_guard_mutex lock.
okay.
- sig->cred_locked_for_ptrace = false;
This is redundant to the zalloc -- I think you can drop it (unless someone wants to keep it for clarity?)
I'll remove that here and in init/init_task.c
Also, I think cred_locked_for_ptrace needs checking deeper, in __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make calls to ptrace_may_access() holding cred_guard_mutex, expecting that to be sufficient to see a stable version of the thread...
No, these need to be addressed individually, but most users just want to know if the current credentials are sufficient at this moment, but will not change the credentials, as ptrace and TSYNC do.
BTW: Not all users have cred_guard_mutex, see mm/migrate.c, mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc. So adding an access to cred_locked_for_execve in ptrace_may_access is probably not an option.
However, one nice added value by this change is this:
#include <pthread.h>
#include <signal.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/types.h>

void *thread(void *arg)
{
	ptrace(PTRACE_TRACEME, 0,0,0);
	return NULL;
}

int main(void)
{
	int pid = fork();

	if (!pid) {
		pthread_t pt;
		pthread_create(&pt, NULL, thread, NULL);
		pthread_join(pt, NULL);
		execlp("echo", "echo", "passed", NULL);
	}

	sleep(1000);
	ptrace(PTRACE_ATTACH, pid, 0,0);
	kill(pid, SIGCONT);
	return 0;
}
cat /proc/3812/stack
[<0>] flush_old_exec+0xbf/0x760
[<0>] load_elf_binary+0x35a/0x16c0
[<0>] search_binary_handler+0x97/0x1d0
[<0>] __do_execve_file.isra.40+0x624/0x920
[<0>] __x64_sys_execve+0x49/0x60
[<0>] do_syscall_64+0x64/0x220
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
(I remain very nervous about weakening cred_guard_mutex without addressing the many many users...)
They need to be looked at closely, that's pretty clear. Most fall in the class, that just the current credentials need to stay stable for a certain time.
I remain rather set on wanting some very basic tests with this change. Imho, looking through tools/testing/selftests again we don't have nearly enough for these codepaths; not to say none. Basically, if someone wants to make a change affecting the current problem we should really have at least a single simple test/reproducer that can be run without digging through lore. And hopefully over time we'll have more tests.
Christian
On Tue, Mar 03, 2020 at 09:34:26AM +0100, Christian Brauner wrote:
On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote:
On 3/3/20 6:29 AM, Kees Cook wrote:
On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
On 3/3/20 3:26 AM, Kees Cook wrote:
On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
[...]
If I'm reading this patch correctly, this changes the lifetime of the cred_guard_mutex lock to be:
- during prepare_bprm_creds()
- from flush_old_exec() through install_exec_creds()
Before, cred_guard_mutex was held from prepare_bprm_creds() through install_exec_creds().
BTW, I think the effect of this change (i.e. my paragraph above) should be distinctly called out in the commit log if this solution moves forward.
Okay, will do.
That means, for example, that check_unsafe_exec()'s documented invariant is violated:
/*
 * determine how safe it is to execute the proposed program
 * - the caller must hold ->cred_guard_mutex to protect against
 *   PTRACE_ATTACH or seccomp thread-sync
 */
Oh, right, I haven't understood that hint...
I know no_new_privs is checked there, but I haven't studied the PTRACE_ATTACH part of that comment. If that is handled with the new check, this comment should be updated.
Okay, I'll change that comment to:
/*
 * determine how safe it is to execute the proposed program
 * - the caller must have set ->cred_locked_in_execve to protect against
 *   PTRACE_ATTACH or seccomp thread-sync
 */
I think it also means that the potentially multiple invocations of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and binfmt_misc.c) would be changing bprm->cred details (uid, gid) without a lock (another place where current's no_new_privs is evaluated).
So no_new_privs can change from 0->1, but should not when execve is running.
As long as the calling thread is in execve it won't do this, and the only other place where it may be set for other threads is seccomp_sync_threads, but that can easily be avoided, see below.
Yeah, everything was fine until I had to go complicate things with TSYNC. ;) The real goal is making sure an exec cannot gain privs while later gaining a seccomp filter from an unpriv process. The no_new_privs flag was used to control this, but it required that the filter not get applied during exec.
Related, it also means that cred_guard_mutex is unheld for every invocation of search_binary_handler() (which can loop via the previously mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid() currently.)
For seccomp, the expectations about existing thread states risks races too. There are two locks held for TSYNC:
- current->sighand->siglock is held to keep new threads from appearing/disappearing, which would destroy filter refcounting and lead to memory corruption.
I don't understand what you mean here. How can this lead to memory corruption?
Mainly this is a matter of how seccomp manages its filter hierarchy (since the filters are shared through process ancestry), so if a thread appears in the middle of TSYNC it may be racing another TSYNC and break ancestry, leading to bad reference counting on process death, etc. (Though, yes, with refcount_t now, things should never corrupt, just waste memory.)
I assume for now, that the current->sighand->siglock held while iterating all threads is sufficient here.
- cred_guard_mutex is held to keep no_new_privs in sync with filters to avoid no_new_privs and filter confusion during exec, which could lead to exploitable setuid conditions (see below).
Just racing a malicious thread during TSYNC is not a very strong example (a malicious thread could do lots of fun things to "current" before it ever got near calling TSYNC), but I think there is the risk of mismatched/confused states that we don't want to allow. One is a particularly bad state that could lead to privilege escalations (in the form of the old "sendmail doesn't check setuid" flaw; if a setuid process has a filter attached that silently fails a priv-dropping setuid call and continues execution with elevated privs, it can be tricked into doing bad things on behalf of the unprivileged parent, which was the primary goal of the original use of cred_guard_mutex with TSYNC[1]):
thread A clones thread B
thread B starts setuid exec
thread A sets no_new_privs
thread A calls seccomp with TSYNC
thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
thread B passes check_unsafe_exec() with no_new_privs unset
thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
thread A still in seccomp_sync_threads() sets no_new_privs on thread B
thread B finishes exec, now running with elevated privs, a filter chosen by thread A, _and_ nnp set (which doesn't matter)
With the original locking, thread B will fail check_unsafe_exec() because filter and nnp state are changed together, with "atomicity" protected by the cred_guard_mutex.
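To make the "atomicity" point concrete, here is a minimal userspace sketch, not kernel code. It assumes a pthread mutex standing in for cred_guard_mutex; struct proc_model, struct exec_view and all model_* functions are hypothetical names invented for this illustration. It only shows that when the filter and nnp are changed inside one critical section, a reader taking the same lock sees either both or neither. The real exec path holds the lock across far more work than this single read.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the per-process state discussed above. */
struct proc_model {
	pthread_mutex_t guard;	/* plays the role of cred_guard_mutex */
	bool no_new_privs;
	bool has_filter;
};

/* What the exec side observes in one locked read. */
struct exec_view {
	bool no_new_privs;
	bool has_filter;
};

void proc_model_init(struct proc_model *p)
{
	pthread_mutex_init(&p->guard, NULL);
	p->no_new_privs = false;
	p->has_filter = false;
}

/* TSYNC side: filter and nnp change inside one critical section. */
void model_tsync(struct proc_model *p)
{
	pthread_mutex_lock(&p->guard);
	p->has_filter = true;
	p->no_new_privs = true;
	pthread_mutex_unlock(&p->guard);
}

/* exec side: reads both fields under the same lock. */
struct exec_view model_exec_check(struct proc_model *p)
{
	struct exec_view v;

	pthread_mutex_lock(&p->guard);
	v.no_new_privs = p->no_new_privs;
	v.has_filter = p->has_filter;
	pthread_mutex_unlock(&p->guard);
	return v;
}

/* A view is consistent when a filter never appears without nnp. */
bool model_view_consistent(struct exec_view v)
{
	return !v.has_filter || v.no_new_privs;
}

/* In this model a setuid exec may gain privileges only without nnp. */
bool model_may_gain_privs(struct exec_view v)
{
	return !v.no_new_privs;
}
```

The bad interleaving listed above corresponds to an exec_view with has_filter set and no_new_privs clear, which this model cannot produce as long as both sides take the same lock.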
Ah, good point, thanks!
This can be fixed by checking current->signal->cred_locked_for_ptrace while the cred_guard_mutex is locked, like this for instance:
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b6ea3dc..377abf0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
 	BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
 	assert_spin_locked(&current->sighand->siglock);
 
+	if (current->signal->cred_locked_for_ptrace)
+		return -EAGAIN;
Hmm. I guess something like that could work. TSYNC expects to be able to report _which_ thread wrecked the call, though... I wonder if in_execve could be used to figure out the offending thread. Hm, nope, that would be outside of lock too (and all users are "current" right now, so the lock wasn't needed before).
I could move that in_execve = 1 to prepare_bprm_creds, if it really matters, but the caller will die quickly and cannot do anything with that information when another thread executes execve, right?
 	/* Validate all threads being eligible for synchronization. */
 	caller = current;
 	for_each_thread(caller, thread) {
And this is just the bad state I _can_ see. I'm worried there are more...
All this said, I do see a small similarity here to the work I did to stabilize stack rlimits (there was an ongoing problem with making multiple decisions for the bprm based on current's state -- but current's state was mutable during exec). For this, I saved rlim_stack to bprm and ignored current's copy until exec ended and then stored bprm's copy into current. If the only problem anyone can see here is the handling of no_new_privs, we might be able to solve that similarly, at least disentangling tsync/nnp from cred_guard_mutex.
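The snapshot idea described here can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct task_model, struct bprm_model and the model_* helpers are hypothetical names, and the real kernel code deals with struct rlimit and much more state. The point is just the shape of the pattern: copy the mutable value at exec start, consult only the copy while exec runs, and store the copy back at the end.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the names only echo the discussion above. */
struct task_model {
	unsigned long rlim_stack;	/* live value, may change during exec */
};

struct bprm_model {
	unsigned long rlim_stack;	/* private copy taken at exec start */
};

/* Exec start: take the snapshot once. */
void model_exec_begin(struct bprm_model *bprm, const struct task_model *cur)
{
	bprm->rlim_stack = cur->rlim_stack;
}

/* Every decision during exec consults the snapshot, never the live task. */
bool model_stack_fits(const struct bprm_model *bprm, unsigned long wanted)
{
	return wanted <= bprm->rlim_stack;
}

/* Exec end: the snapshot becomes the task's value again. */
void model_exec_end(const struct bprm_model *bprm, struct task_model *cur)
{
	cur->rlim_stack = bprm->rlim_stack;
}
```

A change to the live value between model_exec_begin() and model_exec_end() does not influence any decision taken in between, which is the property that would be wanted for no_new_privs as well.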
I still think that is solvable by using cred_locked_for_ptrace and simply making the tsync fail if it would otherwise be blocked.
I wonder if we can find a better name than "cred_locked_for_ptrace"? Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
Yeah, I'd go with "cred_locked_in_execve".
And the comment on bool cred_locked_for_ptrace should mention that access is only allowed under cred_guard_mutex lock.
okay.
- sig->cred_locked_for_ptrace = false;
This is redundant to the zalloc -- I think you can drop it (unless someone wants to keep it for clarity?)
I'll remove that here and in init/init_task.c
Also, I think cred_locked_for_ptrace needs checking deeper, in __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make calls to ptrace_may_access() holding cred_guard_mutex, expecting that to be sufficient to see a stable version of the thread...
No, these need to be addressed individually. Most users just want to know whether the current credentials are sufficient at this moment, and will not change the credentials, as ptrace and TSYNC do.
BTW: Not all users hold cred_guard_mutex, see mm/migrate.c, mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc. So adding a check of cred_locked_in_execve in ptrace_may_access is probably not an option.
However, one nice added value by this change is this:
#include <pthread.h>
#include <signal.h>
#include <unistd.h>
#include <sys/ptrace.h>

void *thread(void *arg)
{
	ptrace(PTRACE_TRACEME, 0,0,0);
	return NULL;
}

int main(void)
{
	int pid = fork();

	if (!pid) {
		pthread_t pt;
		pthread_create(&pt, NULL, thread, NULL);
		pthread_join(pt, NULL);
		execlp("echo", "echo", "passed", NULL);
	}

	sleep(1000);
	ptrace(PTRACE_ATTACH, pid, 0,0);
	kill(pid, SIGCONT);
	return 0;
}
cat /proc/3812/stack
[<0>] flush_old_exec+0xbf/0x760
[<0>] load_elf_binary+0x35a/0x16c0
[<0>] search_binary_handler+0x97/0x1d0
[<0>] __do_execve_file.isra.40+0x624/0x920
[<0>] __x64_sys_execve+0x49/0x60
[<0>] do_syscall_64+0x64/0x220
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
(I remain very nervous about weakening cred_guard_mutex without addressing the many many users...)
They need to be looked at closely, that's pretty clear. Most fall into the class where the current credentials just need to stay stable for a certain time.
I remain rather set on wanting some very basic tests with this change. Imho, looking through tools/testing/selftests again we don't have nearly enough for these codepaths; not to say none. Basically, if someone wants to make a change affecting the current problem we should really have at least a single simple test/reproducer that can be run without digging through lore. And hopefully over time we'll have more tests.
Which you added in v4. Which is great! (I should've mentioned this in my first mail.) Christian
On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote:
On 3/3/20 6:29 AM, Kees Cook wrote:
On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
On 3/3/20 3:26 AM, Kees Cook wrote:
On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
[...]
If I'm reading this patch correctly, this changes the lifetime of the cred_guard_mutex lock to be:
- during prepare_bprm_creds()
- from flush_old_exec() through install_exec_creds()
Before, cred_guard_mutex was held from prepare_bprm_creds() through install_exec_creds().
BTW, I think the effect of this change (i.e. my paragraph above) should be distinctly called out in the commit log if this solution moves forward.
Okay, will do.
That means, for example, that check_unsafe_exec()'s documented invariant is violated:
/*
 * determine how safe it is to execute the proposed program
 * - the caller must hold ->cred_guard_mutex to protect against
 *   PTRACE_ATTACH or seccomp thread-sync
 */
Oh, right, I haven't understood that hint...
I know no_new_privs is checked there, but I haven't studied the PTRACE_ATTACH part of that comment. If that is handled with the new check, this comment should be updated.
Okay, I'll change that comment to:
/*
 * determine how safe it is to execute the proposed program
 * - the caller must have set ->cred_locked_in_execve to protect against
 *   PTRACE_ATTACH or seccomp thread-sync
 */
BTW: Not all users hold cred_guard_mutex, see mm/migrate.c, mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc. So adding a check of cred_locked_in_execve in ptrace_may_access is probably not an option.
That could be solved by e.g. adding ptrace_may_access_{no}exec() taking cred_guard_mutex.
On Mon, Mar 02, 2020 at 06:26:47PM -0800, Kees Cook wrote:
On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread are running.
I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but the strace is no longer able to respond, since it is blocked in vm_access.
The deadlock is always happening when strace needs to access the tracees process mmap, while another thread in the tracee starts to execve a child process, but that cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
strace D 0 30614 30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

expect D 0 31933 30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The proposed solution is to take the cred_guard_mutex only in a critical section at the beginning, and at the end of the execve function, and let PTRACE_ATTACH fail with EAGAIN while execve is not complete, but other functions like vm_access are allowed to complete normally.
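The scheme described in the paragraph above can be modelled in userspace roughly as follows. This is a hedged sketch, not the patch itself: it assumes a pthread mutex in place of cred_guard_mutex, and struct signal_model and every model_* function are hypothetical names made up for this illustration. It shows the mutex being held only briefly at the start and at the end of exec, with a flag covering the time in between, so that an attach attempt reports -EAGAIN instead of sleeping, while a short-lived user such as the mm_access path can still take and drop the mutex.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of the relevant part of struct signal_struct. */
struct signal_model {
	pthread_mutex_t cred_guard;	/* stands in for cred_guard_mutex */
	bool locked_in_execve;		/* only touched while cred_guard is held */
};

void model_init(struct signal_model *s)
{
	pthread_mutex_init(&s->cred_guard, NULL);
	s->locked_in_execve = false;
}

/* Start of exec: a short critical section that only sets the flag. */
int model_prepare_bprm_creds(struct signal_model *s)
{
	int ret = -EAGAIN;

	pthread_mutex_lock(&s->cred_guard);
	if (!s->locked_in_execve) {
		s->locked_in_execve = true;
		ret = 0;
	}
	pthread_mutex_unlock(&s->cred_guard);
	return ret;
}

/* End of exec: retake the mutex, clear the flag, drop the mutex. */
void model_install_exec_creds(struct signal_model *s)
{
	pthread_mutex_lock(&s->cred_guard);
	s->locked_in_execve = false;
	pthread_mutex_unlock(&s->cred_guard);
}

/* Attach does not wait for exec to finish; it fails with -EAGAIN. */
int model_ptrace_attach(struct signal_model *s)
{
	int ret = 0;

	pthread_mutex_lock(&s->cred_guard);
	if (s->locked_in_execve)
		ret = -EAGAIN;
	pthread_mutex_unlock(&s->cred_guard);
	return ret;
}

/* Other users only need the mutex briefly and ignore the flag. */
int model_mm_access(struct signal_model *s)
{
	pthread_mutex_lock(&s->cred_guard);
	pthread_mutex_unlock(&s->cred_guard);
	return 0;
}
```

In this model, calling model_ptrace_attach() between model_prepare_bprm_creds() and model_install_exec_creds() returns -EAGAIN, while model_mm_access() returns 0 at any time. The model leaves out the killable and interruptible lock variants entirely.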
Sorry to be a bummer, but I don't think this will work. A few more things during the exec process depend on cred_guard_mutex being held.
If I'm reading this patch correctly, this changes the lifetime of the cred_guard_mutex lock to be:
- during prepare_bprm_creds()
- from flush_old_exec() through install_exec_creds()
Before, cred_guard_mutex was held from prepare_bprm_creds() through install_exec_creds().
That means, for example, that check_unsafe_exec()'s documented invariant is violated:
/*
 * determine how safe it is to execute the proposed program
 * - the caller must hold ->cred_guard_mutex to protect against
 *   PTRACE_ATTACH or seccomp thread-sync
 */
static void check_unsafe_exec(struct linux_binprm *bprm)
...
which is looking at no_new_privs as well as other details, and making decisions about the bprm state from the current state.
I think it also means that the potentially multiple invocations of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and binfmt_misc.c) would be changing bprm->cred details (uid, gid) without a lock (another place where current's no_new_privs is evaluated).
Related, it also means that cred_guard_mutex is unheld for every invocation of search_binary_handler() (which can loop via the previously mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid() currently.)
So one issue I see with having to reacquire the cred_guard_mutex might be that this would allow tasks holding the cred_guard_mutex to block a killed exec'ing task from exiting, right?
On 3/3/20 9:58 AM, Christian Brauner wrote:
On Mon, Mar 02, 2020 at 06:26:47PM -0800, Kees Cook wrote:
On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
So one issue I see with having to reacquire the cred_guard_mutex might be that this would allow tasks holding the cred_guard_mutex to block a killed exec'ing task from exiting, right?
Yes, maybe, but I think it will not be worse than it is now. Since the second acquisition of the mutex is done with mutex_lock_killable, at least kill -9 should get it terminated.
Bernd.
On 3/3/20 11:34 AM, Bernd Edlinger wrote:
On 3/3/20 9:58 AM, Christian Brauner wrote:
So one issue I see with having to reacquire the cred_guard_mutex might be that this would allow tasks holding the cred_guard_mutex to block a killed exec'ing task from exiting, right?
Yes maybe, but I think it will not be worse than it is now. Since the second time the mutex is acquired it is done with mutex_lock_killable, so at least kill -9 should get it terminated.
static void free_bprm(struct linux_binprm *bprm)
{
	free_arg_pages(bprm);
	if (bprm->cred) {
		if (!bprm->called_flush_old_exec)
			mutex_lock(&current->signal->cred_guard_mutex);
		current->signal->cred_locked_for_ptrace = false;
		mutex_unlock(&current->signal->cred_guard_mutex);
Hmm, cough... actually, when mutex_lock_killable fails in flush_old_exec due to kill -9, free_bprm locks the same mutex, this time unkillably. I had better use mutex_lock_killable here as well, and if that fails, I can leave cred_locked_for_ptrace set. It shouldn't matter, since this is a fatal signal anyway, right?
Bernd.
On Tue, Mar 03, 2020 at 11:23:31AM +0000, Bernd Edlinger wrote:
Hmm, cough... actually, when mutex_lock_killable fails in flush_old_exec due to kill -9, free_bprm locks the same mutex, this time unkillably. I had better use mutex_lock_killable here as well, and if that fails, I can leave cred_locked_for_ptrace set. It shouldn't matter, since this is a fatal signal anyway, right?
I think so, yes.
This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread is running.
I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but strace is no longer able to respond, since it is blocked in vm_access.
The deadlock always happens when strace needs to access the tracee's process mmap while another thread in the tracee starts to execve a child process, but that cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
strace D 0 30614 30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

expect D 0 31933 30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The proposed solution is to take the cred_guard_mutex only in a critical section at the beginning, and at the end of the execve function, and let PTRACE_ATTACH fail with EAGAIN while execve is not complete, but other functions like vm_access are allowed to complete normally.
This changes the lifetime of the cred_guard_mutex lock to be:
- during prepare_bprm_creds()
- from flush_old_exec() through install_exec_creds()
Before, cred_guard_mutex was held from prepare_bprm_creds() through install_exec_creds().
I also took the opportunity to improve the documentation of prepare_creds, which is obviously out of sync.
Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
---
 Documentation/security/credentials.rst    | 19 +++++----
 fs/exec.c                                 | 41 ++++++++++++++++---
 include/linux/binfmts.h                   |  6 ++-
 include/linux/sched/signal.h              |  2 +
 kernel/cred.c                             |  2 +-
 kernel/ptrace.c                           |  4 ++
 kernel/seccomp.c                          |  3 ++
 mm/process_vm_access.c                    |  2 +-
 tools/testing/selftests/ptrace/Makefile   |  4 +-
 tools/testing/selftests/ptrace/vmaccess.c | 66 +++++++++++++++++++++++++++++++
 10 files changed, 130 insertions(+), 19 deletions(-)
 create mode 100644 tools/testing/selftests/ptrace/vmaccess.c

v2: adds a test case which passes when this patch is applied.
v3: fixes the issue without introducing a new mutex.
v4: fixes one comment and a formatting issue found by checkpatch.pl in the test case.
v5: addresses review comments.
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..0988798 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,9 +437,14 @@ new set of credentials by calling::
 
 	struct cred *prepare_creds(void);
 
-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful. It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->cred_guard_mutex
+is acquired before this function gets called, and released after setting
+current->signal->cred_locked_in_execve. The same mutex is acquired later,
+while the credentials and the process mmap are actually changed, and
+current->signal->cred_locked_in_execve is reset again.
 
 The mutex prevents ``ptrace()`` from altering the ptrace state of a process
 while security checks on credentials construction and changing is taking place
@@ -466,9 +471,8 @@ by calling::
 
 This will alter various aspects of the credentials and the process, giving the
 LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.
 
 This function is guaranteed to return 0, so that it can be tail-called at the
 end of such functions as ``sys_setresuid()``.
@@ -486,8 +490,7 @@ invoked::
 
 	void abort_creds(struct cred *new);
 
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
 
 A typical credentials alteration function would look something like this::
 
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..5fc744e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,6 +1266,12 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	retval = mutex_lock_killable(&current->signal->cred_guard_mutex);
+	if (retval)
+		goto out;
+
+	bprm->called_flush_old_exec = 1;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1398,29 +1404,51 @@ void finalize_exec(struct linux_binprm *bprm)
 EXPORT_SYMBOL(finalize_exec);
 
 /*
- * Prepare credentials and lock ->cred_guard_mutex.
+ * Prepare credentials and set ->cred_locked_in_execve.
  * install_exec_creds() commits the new creds and drops the lock.
  * Or, if exec fails before, free_bprm() should release ->cred and
  * and unlock.
  */
 static int prepare_bprm_creds(struct linux_binprm *bprm)
 {
+	int ret;
+
 	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
 		return -ERESTARTNOINTR;
 
+	ret = -EAGAIN;
+	if (unlikely(current->signal->cred_locked_in_execve))
+		goto out;
+
+	ret = -ENOMEM;
 	bprm->cred = prepare_exec_creds();
-	if (likely(bprm->cred))
-		return 0;
+	if (likely(bprm->cred)) {
+		current->signal->cred_locked_in_execve = true;
+		ret = 0;
+	}
 
+out:
 	mutex_unlock(&current->signal->cred_guard_mutex);
-	return -ENOMEM;
+	return ret;
 }
 
 static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
-		mutex_unlock(&current->signal->cred_guard_mutex);
+		/*
+		 * If flush_old_exec did not acquire the cred_guard_mutex,
+		 * try again here, but if that fails, just leave
+		 * cred_locked_in_execve alone, since this means there
+		 * must be a fatal signal pending.
+		 * We don't want to prevent this task to be killed, just
+		 * because it is stuck in the middle of execve.
+		 */
+		if (bprm->called_flush_old_exec ||
+		    !mutex_lock_killable(&current->signal->cred_guard_mutex)) {
+			current->signal->cred_locked_in_execve = false;
+			mutex_unlock(&current->signal->cred_guard_mutex);
+		}
 		abort_creds(bprm->cred);
 	}
 	if (bprm->file) {
@@ -1469,13 +1497,14 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	current->signal->cred_locked_in_execve = false;
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
 
 /*
  * determine how safe it is to execute the proposed program
- * - the caller must hold ->cred_guard_mutex to protect against
+ * - the caller must have set ->cred_locked_in_execve to protect against
  *   PTRACE_ATTACH or seccomp thread-sync
  */
 static void check_unsafe_exec(struct linux_binprm *bprm)
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..2930253 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,11 @@ struct linux_binprm {
 		 * exec has happened. Used to sanitize execution environment
 		 * and to set AT_SECURE auxv for glibc.
 		 */
-		secureexec:1;
+		secureexec:1,
+		/*
+		 * Set by flush_old_exec, when the cred_guard_mutex is taken.
+		 */
+		called_flush_old_exec:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..8f8e358 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -225,6 +225,8 @@ struct signal_struct {
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
 					 * (notably. ptrace) */
+	bool cred_locked_in_execve;	/* set while in execve, only valid when
+					 * cred_guard_mutex is held */
 } __randomize_layout;
 
 /*
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..e4c78de 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -676,7 +676,7 @@ void __init cred_init(void)
  *
  * Returns the new credentials or NULL if out of memory.
  *
- * Does not take, and does not return holding current->cred_replace_mutex.
+ * Does not take, and does not return holding ->cred_guard_mutex.
  */
 struct cred *prepare_kernel_cred(struct task_struct *daemon)
 {
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179..0f82bab 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -395,6 +395,10 @@ static int ptrace_attach(struct task_struct *task, long request,
 	if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
 		goto out;
 
+	retval = -EAGAIN;
+	if (task->signal->cred_locked_in_execve)
+		goto unlock_creds;
+
 	task_lock(task);
 	retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
 	task_unlock(task);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b6ea3dc..3efa3e5 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
 	BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
 	assert_spin_locked(&current->sighand->siglock);
 
+	if (current->signal->cred_locked_in_execve)
+		return -EAGAIN;
+
 	/* Validate all threads being eligible for synchronization. */
 	caller = current;
 	for_each_thread(caller, thread) {
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
 	if (!mm || IS_ERR(mm)) {
 		rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		/*
-		 * Explicitly map EACCES to EPERM as EPERM is a more a
+		 * Explicitly map EACCES to EPERM as EPERM is a more
 		 * appropriate error code for process_vw_readv/writev
 		 */
 		if (rc == -EACCES)
diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index c0b7f89..2f1f532 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -iquote../../../../include/uapi -Wall
+CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
 
-TEST_GEN_PROGS := get_syscall_info peeksiginfo
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
 
 include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..6d8a048
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger <bernd.edlinger@hotmail.de>
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+	ptrace(PTRACE_TRACEME, 0, 0L, 0L);
+	return NULL;
+}
+
+TEST(vmaccess)
+{
+	int f, pid = fork();
+	char mm[64];
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	sprintf(mm, "/proc/%d/mem", pid);
+	f = open(mm, O_RDONLY);
+	ASSERT_LE(0, f);
+	close(f);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(0, f);
+}
+
+TEST(attach)
+{
+	int f, pid = fork();
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	ASSERT_EQ(EAGAIN, errno);
+	ASSERT_EQ(f, -1);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(0, f);
+}
+
+TEST_HARNESS_MAIN
Bernd Edlinger bernd.edlinger@hotmail.de writes:
This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread is running.
I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but strace is no longer able to respond, since it is blocked in vm_access.
The deadlock always happens when strace needs to access the tracee's process mmap while another thread in the tracee starts to execve a child process, but that cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
A couple of things.
Why do we think it is safe to change the behavior exposed to userspace? Not the deadlock but all of the times the current code would not deadlock?
Especially given that this is a small window it might be hard for people to track down and report so we need a strong argument that this won't break existing userspace before we just change things.
Usually surveying all of the users of a system call that we can find and checking to see if they might be affected by the change in behavior is difficult enough that we usually opt for not being lazy and preserving the behavior.
This patch is up to two changes in behavior now, that could potentially affect a whole array of programs. Adding linux-api so that this change in behavior can be documented if/when this change goes through.
If you can split the documentation and test fixes out into separate patches that would help reviewing this code, or please make it explicit that you are changing documentation about behavior that is changing with this patch.
Eric
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..6d8a048
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger bernd.edlinger@hotmail.de
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+	ptrace(PTRACE_TRACEME, 0, 0L, 0L);
+	return NULL;
+}
+
+TEST(vmaccess)
+{
+	int f, pid = fork();
+	char mm[64];
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	sprintf(mm, "/proc/%d/mem", pid);
+	f = open(mm, O_RDONLY);
+	ASSERT_LE(0, f);
+	close(f);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(0, f);
+}
+
+TEST(attach)
+{
+	int f, pid = fork();
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
To be meaningful this code needs to learn to loop when ptrace returns -EAGAIN.
Because that is pretty much what any self respecting user space process will do.
At which point I am not certain we can say that the behavior has sufficiently improved not to be a deadlock.
+	ASSERT_EQ(EAGAIN, errno);
+	ASSERT_EQ(f, -1);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(0, f);
+}
+
+TEST_HARNESS_MAIN
Eric
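For reference, the vmaccess selftest in the patch only opens /proc/$pid/mem; a tracer would go on to read from it. The following is a minimal userspace sketch of that read path, not code from the patch. The helper name read_task_mem() is hypothetical, and the sketch assumes the caller is allowed to access the target's memory (for example because it is reading its own process or is the tracer).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read len bytes at address addr from the address
 * space of process pid through /proc/<pid>/mem.
 * Returns the number of bytes read, or -1 with errno set.
 */
ssize_t read_task_mem(pid_t pid, const void *addr, void *buf, size_t len)
{
    char path[64];
    ssize_t n;
    int fd, saved_errno;

    snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);

    /* This open() is the step the vmaccess selftest exercises. */
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The file offset in /proc/<pid>/mem is the virtual address. */
    n = pread(fd, buf, len, (off_t)(uintptr_t)addr);

    /* Preserve the errno from pread() across close(). */
    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return n;
}
```

Reading the caller's own memory with read_task_mem(getpid(), ...) is enough to try the helper without setting up a tracee.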
On 3/3/20 4:18 PM, Eric W. Biederman wrote:
A couple of things.
Why do we think it is safe to change the behavior exposed to userspace? Not the deadlock but all of the times the current code would not deadlock?
Especially given that this is a small window it might be hard for people to track down and report so we need a strong argument that this won't break existing userspace before we just change things.
Hmm, I tend to agree.
Usually surveying all of the users of a system call that we can find and checking to see if they might be affected by the change in behavior is difficult enough that we usually opt for not being lazy and preserving the behavior.
This patch is up to two changes in behavior now, that could potentially affect a whole array of programs. Adding linux-api so that this change in behavior can be documented if/when this change goes through.
One is PTRACE_ATTACH possibly returning EAGAIN, yes.
We could try to restrict that behavior change to the case where some thread is ptraced when execve starts; that can't be too complicated.
But the other is only SYS_seccomp returning EAGAIN, when a different thread of the current process is calling execve at the same time.
I would consider it completely impossible to have any user-visible effect, since de_thread is just terminating all threads, including the thread where the -EAGAIN was returned, so we will never know what happened.
If you can split the documentation and test fixes out into separate patches that would help reviewing this code, or please make it explicit that you are changing documentation about behavior that is changing with this patch.
I am not sure if I have touched the right user documentation.
I only saw a document referring to a non-existent "current->cred_replace_mutex". I haven't dug through the git history, but that must be prehistoric IMHO. It appears to me to be developer documentation, but it's nevertheless worth keeping up to date when the code changes.
So where would I add the possibility for PTRACE_ATTACH to return -EAGAIN ?
Bernd.
+	f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
To be meaningful this code needs to learn to loop when ptrace returns -EAGAIN.
Because that is pretty much what any self respecting user space process will do.
At which point I am not certain we can say that the behavior has sufficiently improved not to be a deadlock.
In this special dead-duck test it won't work, but it would still be a lot more transparent what is going on, since previously you had two zombie processes and no way to even output debug messages, which all self-respecting user space processes should also do.
So yes, I can at least give a good example and re-try it several times together with wait4 which a tracer is expected to do.
Bernd.
+	ASSERT_EQ(EAGAIN, errno);
+	ASSERT_EQ(f, -1);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(0, f);
+}
+
+TEST_HARNESS_MAIN
Eric
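As an illustration of the retry behavior discussed above (looping when PTRACE_ATTACH returns -EAGAIN, and handling wait events in between), here is a minimal sketch. It is not the updated selftest. All names (attach_with_retry, ptrace_attach_op, reap_op, fake_attach_op) are hypothetical, the 10 ms pause and the retry bound are arbitrary choices, and the fake operation exists only so the loop can be exercised without a real tracee.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

/* Hypothetical operation type: 0 on success, -1 with errno set on failure. */
typedef int (*attach_fn)(void *ctx);

/*
 * Retry attach() while it fails with EAGAIN, at most max_tries times.
 * Between attempts, call reap() (if given) so pending wait events are
 * consumed, then pause briefly. The bound keeps a tracee that never
 * leaves execve from turning the deadlock into an endless loop.
 */
int attach_with_retry(attach_fn attach, void (*reap)(void *ctx),
                      void *ctx, int max_tries)
{
    struct timespec pause = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };

    for (int i = 0; i < max_tries; i++) {
        if (attach(ctx) == 0)
            return 0;
        if (errno != EAGAIN)
            return -1;          /* a different error: give up at once */
        if (reap)
            reap(ctx);
        nanosleep(&pause, NULL);
    }
    errno = EAGAIN;             /* still busy after max_tries attempts */
    return -1;
}

/* Real operation: ctx points to the pid_t of the process to attach to. */
int ptrace_attach_op(void *ctx)
{
    pid_t pid = *(pid_t *)ctx;

    return ptrace(PTRACE_ATTACH, pid, 0L, 0L) == -1 ? -1 : 0;
}

/* Consume any wait events that are already pending, without blocking. */
void reap_op(void *ctx)
{
    int status;

    (void)ctx;
    while (waitpid(-1, &status, WNOHANG) > 0)
        ;
}

/* Simulated operation: fails with EAGAIN for the first busy_calls calls. */
struct fake_tracee {
    int busy_calls;
    int attempts;
};

int fake_attach_op(void *ctx)
{
    struct fake_tracee *t = ctx;

    t->attempts++;
    if (t->busy_calls > 0) {
        t->busy_calls--;
        errno = EAGAIN;
        return -1;
    }
    return 0;
}
```

A tracer would call attach_with_retry(ptrace_attach_op, reap_op, &pid, n) for some bound n it considers reasonable. Whether such a loop makes progress still depends on the wait events being handled, which is the concern raised in the review.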
On Tue, Mar 03, 2020 at 04:48:01PM +0000, Bernd Edlinger wrote:
One is PTRACE_ATTACH possibly returning EAGAIN, yes.
We could try to restrict that behavior change to the case where some thread is ptraced when execve starts; that can't be too complicated.
But the other is only SYS_seccomp returning EAGAIN, when a different thread of the current process is calling execve at the same time.
I would consider it completely impossible to have any user-visible effect, since de_thread is just terminating all threads, including the thread where the -EAGAIN was returned, so we will never know what happened.
I think if we risk a user-space facing change we should try the simple thing first before making the fix more convoluted? But it's a tough call...
If you can split the documentation and test fixes out into separate patches that would help reviewing this code, or please make it explicit that you are changing documentation about behavior that is changing with this patch.
I am not sure if I have touched the right user documentation.
I only saw a document referring to a non-existent "current->cred_replace_mutex". I haven't dug through the git history, but that must be prehistoric IMHO. It appears to me to be developer documentation, but it's nevertheless worth keeping up to date when the code changes.
So where would I add the possibility for PTRACE_ATTACH to return -EAGAIN ?
Since that would be a potentially user-visible change it would make the most sense to add it to man ptrace(2) if/when we land this change.
For developers, placing a comment in kernel/ptrace.c:ptrace_attach() would make the most sense? We already have something about exec protection in there.
Christian
On Tue, Mar 03, 2020 at 06:01:11PM +0100, Christian Brauner wrote:
I think if we risk a user-space facing change we should try the simple thing first before making the fix more convoluted? But it's a tough call...
Actually, to get a _rough_ estimate of the possible impact I would recommend you run the criu test suite (and possibly the strace test suite) on a kernel with and without your fix. That's what I tend to do when I touch code I fear will have an impact on APIs that touch the core kernel very deeply. Criu's test suite makes heavy use of ptrace and usually runs into a bunch of interesting (exec) races too, and does have tests for handling zombie processes, etc.
Should be relatively simple: create a VM, install the criu build dependencies, and then run:
    git clone criu; cd criu; make; cd test; ./zdtm.py run -a --keep-going
If your system doesn't support SELinux properly, you need to disable it when running the tests, and you also need to make sure that you're using python3 or change the shebang in zdtm.py to python3.
Just a recommendation.
Christian
Bernd Edlinger bernd.edlinger@hotmail.de writes:
In this special dead-duck test it won't work, but it would still be a lot more transparent what is going on, since previously you had two zombie processes and no way to even output debug messages, which all self-respecting user space processes should also do.
Agreed it is more transparent. So if you are going to deadlock it is better.
My previous proposal (which I admit is more work to implement) would actually allow succeeding in this case and so it would not be subject to a deadlock (even via -EAGAIN) at this point.
So yes, I can at least give a good example and re-try it several times together with wait4 which a tracer is expected to do.
Thank you,
Eric
On 3/3/20 9:08 PM, Eric W. Biederman wrote:
Agreed it is more transparent. So if you are going to deadlock it is better.
My previous proposal (which I admit is more work to implement) would actually allow succeeding in this case and so it would not be subject to a deadlock (even via -EAGAIN) at this point.
So yes, I can at least give a good example and re-try it several times together with wait4 which a tracer is expected to do.
Thank you,
Eric
Okay, I think it can be done with minimal API changes, but it needs two mutexes, one that guards the execve, and one that guards only the credentials.
If no traced sibling thread exists, the mutexes are used this way:
    lock(exec_guard_mutex)
    cred_locked_in_execve = true;
    de_thread()
    lock(cred_guard_mutex)
    unlock(cred_guard_mutex)
    cred_locked_in_execve = false;
    unlock(exec_guard_mutex)
so effectively no API change at all.
If a traced sibling thread exists, the mutexes are used differently:
    lock(exec_guard_mutex)
    cred_locked_in_execve = true;
    unlock(exec_guard_mutex)
    de_thread()
    lock(cred_guard_mutex)
    unlock(cred_guard_mutex)
    lock(exec_guard_mutex)
    cred_locked_in_execve = false;
    unlock(exec_guard_mutex)
Only the case that would deadlock anyway changes.
Bernd.
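To make the two lock orderings above easier to compare, here is a small single-threaded userspace model. It is only a sketch of the proposal, not kernel code. struct model_mutex is a toy stand-in for a kernel mutex, the de_thread step is a caller-supplied hook, and the model assumes that ptrace_attach would take exec_guard_mutex before checking cred_locked_in_execve, which the message above does not spell out. -EBUSY in the model stands for "the tracer would sleep on the mutex".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy single-threaded stand-in for a kernel mutex: only tracks "held". */
struct model_mutex {
    bool held;
};

static bool model_trylock(struct model_mutex *m)
{
    if (m->held)
        return false;
    m->held = true;
    return true;
}

static void model_lock(struct model_mutex *m)
{
    bool ok = model_trylock(m);

    assert(ok);         /* the model never blocks, so the lock must be free */
    (void)ok;
}

static void model_unlock(struct model_mutex *m)
{
    assert(m->held);
    m->held = false;
}

struct signal_model {
    struct model_mutex exec_guard_mutex;
    struct model_mutex cred_guard_mutex;
    bool cred_locked_in_execve;
};

/* Assumed attach path: 0 on success, -EAGAIN during execve, -EBUSY if it would sleep. */
int model_ptrace_attach(struct signal_model *sig)
{
    int ret = 0;

    if (!model_trylock(&sig->exec_guard_mutex))
        return -EBUSY;
    if (sig->cred_locked_in_execve)
        ret = -EAGAIN;
    model_unlock(&sig->exec_guard_mutex);
    return ret;
}

typedef void (*de_thread_hook)(struct signal_model *sig, void *arg);

/* Follows the two sequences above; hook() runs where de_thread() would. */
void model_execve(struct signal_model *sig, bool traced_sibling,
                  de_thread_hook hook, void *arg)
{
    model_lock(&sig->exec_guard_mutex);
    sig->cred_locked_in_execve = true;
    if (traced_sibling)
        model_unlock(&sig->exec_guard_mutex);

    hook(sig, arg);     /* de_thread() */

    model_lock(&sig->cred_guard_mutex);
    model_unlock(&sig->cred_guard_mutex);

    if (traced_sibling)
        model_lock(&sig->exec_guard_mutex);
    sig->cred_locked_in_execve = false;
    model_unlock(&sig->exec_guard_mutex);
}

/* Example hook: record what an attach attempt during de_thread() would see. */
void probe_attach(struct signal_model *sig, void *arg)
{
    *(int *)arg = model_ptrace_attach(sig);
}
```

In this model, an attach attempted during de_thread would have to sleep when no traced sibling exists (the existing behavior), and gets -EAGAIN when a traced sibling exists, which is the only case whose behavior changes.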
Bernd Edlinger bernd.edlinger@hotmail.de writes:
Okay, I think it can be done with minimal API changes, but it needs two mutexes, one that guards the execve, and one that guards only the credentials.
If no traced sibling thread exists, the mutexes are used this way:
    lock(exec_guard_mutex)
    cred_locked_in_execve = true;
    de_thread()
    lock(cred_guard_mutex)
    unlock(cred_guard_mutex)
    cred_locked_in_execve = false;
    unlock(exec_guard_mutex)
so effectively no API change at all.
If a traced sibling thread exists, the mutexes are used differently:
    lock(exec_guard_mutex)
    cred_locked_in_execve = true;
    unlock(exec_guard_mutex)
    de_thread()
    lock(cred_guard_mutex)
    unlock(cred_guard_mutex)
    lock(exec_guard_mutex)
    cred_locked_in_execve = false;
    unlock(exec_guard_mutex)
Only the case that would deadlock anyway changes.
Let me propose a slight alternative that I think sets us up for long term success.
Leave cred_guard_mutex as is, but declare it undesirable. The cred_guard_mutex as designed really is something we should get rid of. As it is, it can sleep over several different userspace accesses. The copying of the exec arguments is technically as prone to deadlock as the ptrace case.
Add a new mutex with a better name, perhaps "exec_change_mutex", that is used to guard the changes that exec makes to a process.
Then we gradually shift all the cred_guard_mutex users over to the new mutex, AKA one patch per user of cred_guard_mutex. At each patch that shifts things over we will have the opportunity to review the code to see that there are no funny dependencies that were missed.
I will sign up for working on the no_new_privs and ptrace_attach cases as I think I can make those happen. Especially no_new_privs.
Getting the easier cases will resolve your issues and put things on a better footing.
Eric
On 3/4/20 5:33 PM, Eric W. Biederman wrote:
Let me propose a slight alternative that I think sets us up for long term success.
Leave cred_guard_mutex as is, but declare it undesirable. The cred_guard_mutex as designed really is something we should get rid of. As it is, it can sleep over several different userspace accesses. The copying of the exec arguments is technically as prone to deadlock as the ptrace case.
Add a new mutex with a better name, perhaps "exec_change_mutex", that is used to guard the changes that exec makes to a process.
Then we gradually shift all the cred_guard_mutex users over to the new mutex, AKA one patch per user of cred_guard_mutex. At each patch that shifts things over we will have the opportunity to review the code to see that there are no funny dependencies that were missed.
I will sign up for working on the no_new_privs and ptrace_attach cases as I think I can make those happen. Especially no_new_privs.
Getting the easier cases will resolve your issues and put things on a better footing.
Eric
Okay, however I think we will need two mutexes in the long term.
So currently I have reduced the cred_guard_mutex to protect just the loading of the executable code in the process vm, since that is what works for vm_access (one of the test cases), and added another mutex that protects the whole execve function, which is needed for ptrace (and seccomp). But I have only a test case for ptrace.
If I understand that right, I should not recycle cred_guard_mutex but leave it as is, and create two additional mutexes which will take over step by step.
Sounds reasonable, indeed.
I will send an update (v6) with what I have right now, but just for information, so you can see how my minimal-API-change approach works.
Bernd.
This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread is running.
I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but strace is no longer able to respond, since it is blocked in vm_access.
The deadlock always happens when strace needs to access the tracee's process mmap while another thread in the tracee starts to execve a child process; the execve cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
strace          D    0 30614  30584 0x00000000
Call Trace:
 __schedule+0x3ce/0x6e0
 schedule+0x5c/0xd0
 schedule_preempt_disabled+0x15/0x20
 __mutex_lock.isra.13+0x1ec/0x520
 __mutex_lock_killable_slowpath+0x13/0x20
 mutex_lock_killable+0x28/0x30
 mm_access+0x27/0xa0
 process_vm_rw_core.isra.3+0xff/0x550
 process_vm_rw+0xdd/0xf0
 __x64_sys_process_vm_readv+0x31/0x40
 do_syscall_64+0x64/0x220
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

expect          D    0 31933  30876 0x80004003
Call Trace:
 __schedule+0x3ce/0x6e0
 schedule+0x5c/0xd0
 flush_old_exec+0xc4/0x770
 load_elf_binary+0x35a/0x16c0
 search_binary_handler+0x97/0x1d0
 __do_execve_file.isra.40+0x5d4/0x8a0
 __x64_sys_execve+0x49/0x60
 do_syscall_64+0x64/0x220
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
The proposed solution is to detect whether a traced sibling thread exists and, in that case, to make PTRACE_ATTACH fail with -EAGAIN instead of deadlocking. Other functions like vm_access are still allowed to complete normally.
This changes the lifetime of the cred_guard_mutex lock to be from flush_old_exec() through install_exec_creds(). Before, cred_guard_mutex was held from prepare_bprm_creds() through install_exec_creds().
Additionally, a new mutex exec_guard_mutex is introduced that is used for PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
---
 Documentation/security/credentials.rst    | 29 ++++++++---
 fs/exec.c                                 | 58 ++++++++++++++++++---
 include/linux/binfmts.h                   | 15 +++++-
 include/linux/sched/signal.h              | 10 ++--
 init/init_task.c                          |  1 +
 kernel/cred.c                             |  4 +-
 kernel/fork.c                             |  1 +
 kernel/ptrace.c                           | 20 ++++++--
 kernel/seccomp.c                          | 15 +++---
 mm/process_vm_access.c                    |  2 +-
 tools/testing/selftests/ptrace/Makefile   |  4 +-
 tools/testing/selftests/ptrace/vmaccess.c | 85 +++++++++++++++++++++++++++++++
 12 files changed, 210 insertions(+), 34 deletions(-)
 create mode 100644 tools/testing/selftests/ptrace/vmaccess.c

v2: adds a test case which passes when this patch is applied.
v3: fixes the issue without introducing a new mutex.
v4: fixes one comment and a formatting issue found by checkpatch.pl in the test case.
v5: addresses review comments.
v6: minimal API changes, using a second mutex, improved test case.
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst index 282e79f..b08899f 100644 --- a/Documentation/security/credentials.rst +++ b/Documentation/security/credentials.rst @@ -437,15 +437,30 @@ new set of credentials by calling::
struct cred *prepare_creds(void);
-this locks current->cred_replace_mutex and then allocates and constructs a -duplicate of the current process's credentials, returning with the mutex still -held if successful. It returns NULL if not successful (out of memory). +this allocates and constructs a duplicate of the current process's credentials. +It returns NULL if not successful (out of memory). + +If called from __do_execve_file, the mutex current->signal->exec_guard_mutex +is acquired before this function gets called, and usually released after +the new process mmap and credentials are installed. However if one of the +sibling threads are being traced when the execve is invoked, there is no +guarantee how long it takes to terminate all sibling threads, and therefore +the variable current->signal->cred_locked_in_execve is set, and the +exec_guard_mutex is released immediately. Functions that may have effect +on the credentials of a different thread need to lock the exec_guard_mutex +and additionally check the cred_locked_in_execve status, and fail with +-EAGAIN if that variable is set.
The mutex prevents ``ptrace()`` from altering the ptrace state of a process while security checks on credentials construction and changing is taking place as the ptrace state may alter the outcome, particularly in the case of ``execve()``.
+The mutex current->signal->cred_guard_mutex is acquired when only a single thread +is remaining, and the credentials and the process mmap are actually changed. +Functions that only need to access to a consistent state of the credentials +and the process mmap do only need to aquire this mutex. + The new credentials set should be altered appropriately, and any security checks and hooks done. Both the current and the proposed sets of credentials are available for this purpose as current_cred() will return the current set @@ -466,9 +481,8 @@ by calling::
This will alter various aspects of the credentials and the process, giving the LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to -actually commit the new credentials to ``current->cred``, it will release -``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it -will notify the scheduler and others of the changes. +actually commit the new credentials to ``current->cred``, and it will notify +the scheduler and others of the changes.
This function is guaranteed to return 0, so that it can be tail-called at the end of such functions as ``sys_setresuid()``. @@ -486,8 +500,7 @@ invoked::
void abort_creds(struct cred *new);
-This releases the lock on ``current->cred_replace_mutex`` that -``prepare_creds()`` got and then releases the new credentials. +This releases the new credentials.
A typical credentials alteration function would look something like this:: diff --git a/fs/exec.c b/fs/exec.c index 74d88da..8a23804 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1258,6 +1258,11 @@ int flush_old_exec(struct linux_binprm * bprm) { int retval;
+ if (bprm->detected_unsafe_exec) { + mutex_unlock(&current->signal->exec_guard_mutex); + bprm->holding_exec_guard_mutex = 0; + } + /* * Make sure we have a private signal table and that * we are unassociated from the previous thread group. @@ -1266,6 +1271,12 @@ int flush_old_exec(struct linux_binprm * bprm) if (retval) goto out;
+ retval = mutex_lock_killable(&current->signal->cred_guard_mutex); + if (retval) + goto out; + + bprm->holding_cred_guard_mutex = 1; + /* * Must be called _before_ exec_mmap() as bprm->mm is * not visibile until then. This also enables the update @@ -1398,29 +1409,56 @@ void finalize_exec(struct linux_binprm *bprm) EXPORT_SYMBOL(finalize_exec);
/* - * Prepare credentials and lock ->cred_guard_mutex. + * Prepare credentials and set ->cred_locked_in_execve. * install_exec_creds() commits the new creds and drops the lock. * Or, if exec fails before, free_bprm() should release ->cred and * and unlock. */ static int prepare_bprm_creds(struct linux_binprm *bprm) { - if (mutex_lock_interruptible(&current->signal->cred_guard_mutex)) + int ret; + struct task_struct *t; + + if (mutex_lock_interruptible(&current->signal->exec_guard_mutex)) return -ERESTARTNOINTR;
+ bprm->holding_exec_guard_mutex = 1; + + ret = -EAGAIN; + if (unlikely(current->signal->cred_locked_in_execve)) + goto out; + bprm->cred = prepare_exec_creds(); - if (likely(bprm->cred)) - return 0; + ret = -ENOMEM; + if (unlikely(bprm->cred == NULL)) + goto out;
- mutex_unlock(&current->signal->cred_guard_mutex); - return -ENOMEM; + current->signal->cred_locked_in_execve = true; + + spin_lock_irq(&current->sighand->siglock); + t = current; + while_each_thread(current, t) { + if (t->ptrace) + bprm->detected_unsafe_exec = 1; + } + spin_unlock_irq(&current->sighand->siglock); + return 0; + +out: + mutex_unlock(&current->signal->exec_guard_mutex); + return ret; }
static void free_bprm(struct linux_binprm *bprm) { free_arg_pages(bprm); if (bprm->cred) { - mutex_unlock(&current->signal->cred_guard_mutex); + if (bprm->holding_cred_guard_mutex) + mutex_unlock(&current->signal->cred_guard_mutex); + if (!bprm->holding_exec_guard_mutex) + mutex_lock(&current->signal->exec_guard_mutex); + current->signal->cred_locked_in_execve = false; + mutex_unlock(&current->signal->exec_guard_mutex); abort_creds(bprm->cred); } if (bprm->file) { @@ -1470,12 +1508,16 @@ void install_exec_creds(struct linux_binprm *bprm) */ security_bprm_committed_creds(bprm); mutex_unlock(&current->signal->cred_guard_mutex); + if (bprm->detected_unsafe_exec) + mutex_lock(&current->signal->exec_guard_mutex); + current->signal->cred_locked_in_execve = false; + mutex_unlock(&current->signal->exec_guard_mutex); } EXPORT_SYMBOL(install_exec_creds);
/* * determine how safe it is to execute the proposed program - * - the caller must hold ->cred_guard_mutex to protect against + * - the caller must have set ->cred_locked_in_execve to protect against * PTRACE_ATTACH or seccomp thread-sync */ static void check_unsafe_exec(struct linux_binprm *bprm) diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index b40fc63..238e280 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -44,7 +44,20 @@ struct linux_binprm { * exec has happened. Used to sanitize execution environment * and to set AT_SECURE auxv for glibc. */ - secureexec:1; + secureexec:1, + /* + * Set by prepare_bprm_creds, if a sibling thread is being + * traced and the exec_guard_mutex is therefore not taken. + */ + detected_unsafe_exec:1, + /* + * Set when the cred_guard_mutex is taken. + */ + holding_cred_guard_mutex:1, + /* + * Set when the exec_guard_mutex is taken. + */ + holding_exec_guard_mutex:1; #ifdef __alpha__ unsigned int taso:1; #endif diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 8805025..4484aa3 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -222,9 +222,13 @@ struct signal_struct { struct mm_struct *oom_mm; /* recorded mm when the thread group got * killed by the oom killer */
- struct mutex cred_guard_mutex; /* guard against foreign influences on - * credential calculations - * (notably. ptrace) */ + struct mutex cred_guard_mutex; /* guard against changing credentials */ + struct mutex exec_guard_mutex; /* guard against foreign influences on + * execve (notably. ptrace) + */ + bool cred_locked_in_execve; /* set while in execve, only valid when + * exec_guard_mutex is held + */ } __randomize_layout;
/* diff --git a/init/init_task.c b/init/init_task.c index 9e5cbe5..6cf602a 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -26,6 +26,7 @@ .multiprocess = HLIST_HEAD_INIT, .rlim = INIT_RLIMITS, .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex), + .exec_guard_mutex = __MUTEX_INITIALIZER(init_signals.exec_guard_mutex), #ifdef CONFIG_POSIX_TIMERS .posix_timers = LIST_HEAD_INIT(init_signals.posix_timers), .cputimer = { diff --git a/kernel/cred.c b/kernel/cred.c index 809a985..620cd50 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -295,7 +295,7 @@ struct cred *prepare_creds(void)
/* * Prepare credentials for current to perform an execve() - * - The caller must hold ->cred_guard_mutex + * - The caller must hold ->exec_guard_mutex */ struct cred *prepare_exec_creds(void) { @@ -676,7 +676,7 @@ void __init cred_init(void) * * Returns the new credentials or NULL if out of memory. * - * Does not take, and does not return holding current->cred_replace_mutex. + * Does not take, and does not return holding ->cred_guard_mutex. */ struct cred *prepare_kernel_cred(struct task_struct *daemon) { diff --git a/kernel/fork.c b/kernel/fork.c index 0808095..0c21baa 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) sig->oom_score_adj_min = current->signal->oom_score_adj_min;
mutex_init(&sig->cred_guard_mutex); + mutex_init(&sig->exec_guard_mutex);
return 0; } diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 43d6179..1af8ff4 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -392,9 +392,13 @@ static int ptrace_attach(struct task_struct *task, long request, * under ptrace. */ retval = -ERESTARTNOINTR; - if (mutex_lock_interruptible(&task->signal->cred_guard_mutex)) + if (mutex_lock_interruptible(&task->signal->exec_guard_mutex)) goto out;
+ retval = -EAGAIN; + if (task->signal->cred_locked_in_execve) + goto unlock_creds; + task_lock(task); retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS); task_unlock(task); @@ -447,7 +451,7 @@ static int ptrace_attach(struct task_struct *task, long request, unlock_tasklist: write_unlock_irq(&tasklist_lock); unlock_creds: - mutex_unlock(&task->signal->cred_guard_mutex); + mutex_unlock(&task->signal->exec_guard_mutex); out: if (!retval) { /* @@ -472,10 +476,18 @@ static int ptrace_attach(struct task_struct *task, long request, */ static int ptrace_traceme(void) { - int ret = -EPERM; + int ret; + + if (mutex_lock_interruptible(&current->signal->exec_guard_mutex)) + return -ERESTARTNOINTR; + + ret = -EAGAIN; + if (current->signal->cred_locked_in_execve) + goto unlock_creds;
write_lock_irq(&tasklist_lock); /* Are we already being traced? */ + ret = -EPERM; if (!current->ptrace) { ret = security_ptrace_traceme(current->parent); /* @@ -490,6 +502,8 @@ static int ptrace_traceme(void) } write_unlock_irq(&tasklist_lock);
+unlock_creds: + mutex_unlock(&current->signal->exec_guard_mutex); return ret; }
diff --git a/kernel/seccomp.c b/kernel/seccomp.c index b6ea3dc..7ec66b1 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -329,7 +329,7 @@ static int is_ancestor(struct seccomp_filter *parent, /** * seccomp_can_sync_threads: checks if all threads can be synchronized * - * Expects sighand and cred_guard_mutex locks to be held. + * Expects sighand and exec_guard_mutex locks to be held. * * Returns 0 on success, -ve on error, or the pid of a thread which was * either not in the correct seccomp mode or did not have an ancestral @@ -339,9 +339,12 @@ static inline pid_t seccomp_can_sync_threads(void) { struct task_struct *thread, *caller;
- BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex)); + BUG_ON(!mutex_is_locked(&current->signal->exec_guard_mutex)); assert_spin_locked(&current->sighand->siglock);
+ if (current->signal->cred_locked_in_execve) + return -EAGAIN; + /* Validate all threads being eligible for synchronization. */ caller = current; for_each_thread(caller, thread) { @@ -371,7 +374,7 @@ static inline pid_t seccomp_can_sync_threads(void) /** * seccomp_sync_threads: sets all threads to use current's filter * - * Expects sighand and cred_guard_mutex locks to be held, and for + * Expects sighand and exec_guard_mutex locks to be held, and for * seccomp_can_sync_threads() to have returned success already * without dropping the locks. * @@ -380,7 +383,7 @@ static inline void seccomp_sync_threads(unsigned long flags) { struct task_struct *thread, *caller;
- BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex)); + BUG_ON(!mutex_is_locked(&current->signal->exec_guard_mutex)); assert_spin_locked(&current->sighand->siglock);
/* Synchronize all threads. */ @@ -1319,7 +1322,7 @@ static long seccomp_set_mode_filter(unsigned int flags, * while another thread is in the middle of calling exec. */ if (flags & SECCOMP_FILTER_FLAG_TSYNC && - mutex_lock_killable(&current->signal->cred_guard_mutex)) + mutex_lock_killable(&current->signal->exec_guard_mutex)) goto out_put_fd;
spin_lock_irq(&current->sighand->siglock); @@ -1337,7 +1340,7 @@ static long seccomp_set_mode_filter(unsigned int flags, out: spin_unlock_irq(&current->sighand->siglock); if (flags & SECCOMP_FILTER_FLAG_TSYNC) - mutex_unlock(&current->signal->cred_guard_mutex); + mutex_unlock(&current->signal->exec_guard_mutex); out_put_fd: if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) { if (ret) { diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c index 357aa7b..b3e6eb5 100644 --- a/mm/process_vm_access.c +++ b/mm/process_vm_access.c @@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter, if (!mm || IS_ERR(mm)) { rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH; /* - * Explicitly map EACCES to EPERM as EPERM is a more a + * Explicitly map EACCES to EPERM as EPERM is a more * appropriate error code for process_vw_readv/writev */ if (rc == -EACCES) diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile index c0b7f89..2f1f532 100644 --- a/tools/testing/selftests/ptrace/Makefile +++ b/tools/testing/selftests/ptrace/Makefile @@ -1,6 +1,6 @@ # SPDX-License-Identifier: GPL-2.0-only -CFLAGS += -iquote../../../../include/uapi -Wall +CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
-TEST_GEN_PROGS := get_syscall_info peeksiginfo +TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
include ../lib.mk diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c new file mode 100644 index 0000000..fdca30b --- /dev/null +++ b/tools/testing/selftests/ptrace/vmaccess.c @@ -0,0 +1,85 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * Copyright (c) 2020 Bernd Edlinger bernd.edlinger@hotmail.de + * All rights reserved. + * + * Check whether /proc/$pid/mem can be accessed without causing deadlocks + * when de_thread is blocked with ->cred_guard_mutex held. + */ + +#include "../kselftest_harness.h" +#include <stdio.h> +#include <fcntl.h> +#include <pthread.h> +#include <signal.h> +#include <unistd.h> +#include <sys/ptrace.h> + +static void *thread(void *arg) +{ + ptrace(PTRACE_TRACEME, 0, 0L, 0L); + return NULL; +} + +TEST(vmaccess) +{ + int f, pid = fork(); + char mm[64]; + + if (!pid) { + pthread_t pt; + + pthread_create(&pt, NULL, thread, NULL); + pthread_join(pt, NULL); + execlp("true", "true", NULL); + } + + sleep(1); + sprintf(mm, "/proc/%d/mem", pid); + f = open(mm, O_RDONLY); + ASSERT_GE(f, 0); + close(f); + f = kill(pid, SIGCONT); + ASSERT_EQ(f, 0); +} + +TEST(attach) +{ + int s, k, pid = fork(); + + if (!pid) { + pthread_t pt; + + pthread_create(&pt, NULL, thread, NULL); + pthread_join(pt, NULL); + execlp("sleep", "sleep", "2", NULL); + } + + sleep(1); + k = ptrace(PTRACE_ATTACH, pid, 0L, 0L); + ASSERT_EQ(errno, EAGAIN); + ASSERT_EQ(k, -1); + k = waitpid(-1, &s, WNOHANG); + ASSERT_NE(k, 0); + ASSERT_NE(k, pid); + ASSERT_EQ(WIFEXITED(s), 1); + ASSERT_EQ(WEXITSTATUS(s), 0); + sleep(1); + k = ptrace(PTRACE_ATTACH, pid, 0L, 0L); + ASSERT_EQ(k, 0); + k = waitpid(-1, &s, 0); + ASSERT_EQ(k, pid); + ASSERT_EQ(WIFSTOPPED(s), 1); + ASSERT_EQ(WSTOPSIG(s), SIGSTOP); + k = ptrace(PTRACE_DETACH, pid, 0L, 0L); + ASSERT_EQ(k, 0); + k = waitpid(-1, &s, 0); + ASSERT_EQ(k, pid); + ASSERT_EQ(WIFEXITED(s), 1); + ASSERT_EQ(WEXITSTATUS(s), 0); + k = waitpid(-1, NULL, 0); + ASSERT_EQ(k, -1); + ASSERT_EQ(errno, ECHILD); +} + 
+TEST_HARNESS_MAIN
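With this patch an attach attempt can fail with -EAGAIN while an execve is in flight, so a caller presumably has to retry, as the "attach" test above does by hand. The following is a minimal sketch of such a retry loop, not part of the patch: `retry_on_eagain` and `fake_attach` are hypothetical helpers, and the operation is passed as a callback so the logic can be exercised without a real tracee.

```c
#include <assert.h>
#include <errno.h>

/*
 * Call op(arg) until it succeeds, fails with an error other than EAGAIN,
 * or max_tries attempts have been made. Returns the last return value of
 * op (or -1 if max_tries <= 0); errno is left as op set it.
 */
static int retry_on_eagain(int (*op)(void *), void *arg, int max_tries)
{
	int ret = -1;

	for (int i = 0; i < max_tries; i++) {
		errno = 0;
		ret = op(arg);
		if (ret != -1 || errno != EAGAIN)
			break;
		/*
		 * A real tracer would reap pending wait events here, for
		 * example with waitpid(), before trying again.
		 */
	}
	return ret;
}

/* Hypothetical stand-in: fails with EAGAIN until *remaining reaches 0. */
static int fake_attach(void *arg)
{
	int *remaining = arg;

	if (*remaining > 0) {
		(*remaining)--;
		errno = EAGAIN;
		return -1;
	}
	return 0;
}
```

In a real tracer the callback would wrap `ptrace(PTRACE_ATTACH, pid, 0L, 0L)`. As the test case suggests, the exiting sibling has to be waited for between attempts, otherwise the execve never gets past de_thread().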
On 3/4/20 10:56 PM, Bernd Edlinger wrote:
Okay, I think there is consensus that the next steps are as follows:
- post the Documentation/security/credentials.rst changes as an independent patch.
- post an infrastructure patch which only introduces two new mutexes, one exec_guard_mutex, and one the "cred_change_mutex" (I am unhappy with that name, because credentials can change without the cred_guard_mutex; this appears more to guarantee that the credentials of the process and the process memory map are consistent, so I think I need to think of a better name first...). This keeps cred_guard_mutex as is, just deprecates it, and adds a note that it will go away.
- post one patch that fixes the mm_access code path
- post one patch that fixes the PTRACE_ATTACH code path
- post one patch that introduces the new test cases
Thanks Bernd.
Bernd, everyone
This is how I think the infrastructure change should look that makes way for fixing this issue.
- Correct the point of no return.
- Add a new mutex to replace cred_guard_mutex
Then I think it is just going through the existing users of cred_guard_mutex and fixing them to use the new one.
There really aren't that many users of cred_guard_mutex, so we should be able to get through the easy ones fairly quickly. For anything that isn't easy, we can wait until we have a good fix.
The users of cred_guard_mutex that I saw were:
    fs/proc/base.c:
        proc_pid_attr_write
        do_io_accounting
        proc_pid_stack
        proc_pid_syscall
        proc_pid_personality

    perf_event_open
    mm_access
    kcmp
    pidfd_fget
    seccomp_set_mode_filter
Bernd, does this make sense to you?
I think we can fix the seccomp/no_new_privs issue with some careful refactoring. We can probably do the same for ptrace but that appears to need a little lsm bug fixing.
My goal here is to allow us to fix the uncontroversial easy bits, while still allowing the difficult tricky bits to be fixed.
Eric W. Biederman (2):
      exec: Properly mark the point of no return
      exec: Add a exec_update_mutex to replace cred_guard_mutex

 fs/exec.c                    | 11 ++++++++---
 include/linux/binfmts.h      |  7 ++++++-
 include/linux/sched/signal.h |  9 ++++++++-
 kernel/fork.c                |  1 +
 4 files changed, 23 insertions(+), 5 deletions(-)
Eric
Add a flag bprm->unrecoverable to mark when execution has gotten to the point where it is impossible to return to userspace with the calling process unchanged.
While technically this state starts as soon as de_thread starts killing threads, the only return path at that point is if there is a fatal signal pending. I have chosen instead to set unrecoverable when the killing stops, and there are possibilities of failures other than fatal signals. In particular it is possible for the allocation of a new sighand structure to fail.
Setting unrecoverable at this point has the benefit that other actions can be taken after the other threads are all dead, and the unrecoverable flag can double as a flag that those actions have been taken.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
---
 fs/exec.c               | 7 ++++---
 include/linux/binfmts.h | 7 ++++++-
 2 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c index db17be51b112..c243f9660d46 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1061,7 +1061,7 @@ static int exec_mmap(struct mm_struct *mm) * disturbing other processes. (Other processes might share the signal * table via the CLONE_SIGHAND option to clone().) */ -static int de_thread(struct task_struct *tsk) +static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk) { struct signal_struct *sig = tsk->signal; struct sighand_struct *oldsighand = tsk->sighand; @@ -1182,6 +1182,7 @@ static int de_thread(struct task_struct *tsk) release_task(leader); }
+ bprm->unrecoverable = true; sig->group_exit_task = NULL; sig->notify_count = 0;
@@ -1266,7 +1267,7 @@ int flush_old_exec(struct linux_binprm * bprm) * Make sure we have a private signal table and that * we are unassociated from the previous thread group. */ - retval = de_thread(current); + retval = de_thread(bprm, current); if (retval) goto out;
@@ -1664,7 +1665,7 @@ int search_binary_handler(struct linux_binprm *bprm)
read_lock(&binfmt_lock); put_binfmt(fmt); - if (retval < 0 && !bprm->mm) { + if (retval < 0 && bprm->unrecoverable) { /* we got to flush_old_exec() and failed after it */ read_unlock(&binfmt_lock); force_sigsegv(SIGSEGV); diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index b40fc633f3be..12263115ce7a 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -44,7 +44,12 @@ struct linux_binprm { * exec has happened. Used to sanitize execution environment * and to set AT_SECURE auxv for glibc. */ - secureexec:1; + secureexec:1, + /* + * Set when changes have been made that prevent returning + * to userspace. + */ + unrecoverable:1; #ifdef __alpha__ unsigned int taso:1; #endif
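The effect of the flag on the failure path can be modeled in a few lines of userspace C. This is only an illustrative sketch: `struct exec_model`, `enum exec_outcome`, `model_exec` and the two failure-injection parameters are hypothetical, and the real decision is the `retval < 0 && bprm->unrecoverable` check in search_binary_handler() shown in the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the one field this patch adds to linux_binprm. */
struct exec_model {
	bool unrecoverable;
};

enum exec_outcome {
	EXEC_OK,		/* exec succeeded */
	EXEC_FAILED_RETURN,	/* error can be returned to the caller */
	EXEC_FAILED_KILL	/* caller is already changed, must be killed */
};

/* fail_before / fail_after inject a failure before or after the flag is set. */
static enum exec_outcome model_exec(struct exec_model *m,
				    bool fail_before, bool fail_after)
{
	int retval = 0;

	m->unrecoverable = false;

	/* Failures up to here leave the calling process unchanged. */
	if (fail_before)
		retval = -ENOMEM;

	if (!retval) {
		/* The sibling threads are gone: point of no return. */
		m->unrecoverable = true;
		if (fail_after)
			retval = -ENOMEM;
	}

	if (retval < 0)
		return m->unrecoverable ? EXEC_FAILED_KILL : EXEC_FAILED_RETURN;
	return EXEC_OK;
}
```

The same flag can then double as a marker that the post-de_thread actions have run, which is the second benefit the commit message mentions.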
On 3/5/20 10:15 PM, Eric W. Biederman wrote:
diff --git a/fs/exec.c b/fs/exec.c index db17be51b112..c243f9660d46 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1061,7 +1061,7 @@ static int exec_mmap(struct mm_struct *mm)
 * disturbing other processes. (Other processes might share the signal
 * table via the CLONE_SIGHAND option to clone().)
*/ -static int de_thread(struct task_struct *tsk) +static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk) { struct signal_struct *sig = tsk->signal; struct sighand_struct *oldsighand = tsk->sighand; @@ -1182,6 +1182,7 @@ static int de_thread(struct task_struct *tsk) release_task(leader); }
+ bprm->unrecoverable = true; sig->group_exit_task = NULL; sig->notify_count = 0;
@@ -1266,7 +1267,7 @@ int flush_old_exec(struct linux_binprm * bprm) * Make sure we have a private signal table and that * we are unassociated from the previous thread group. */
- retval = de_thread(current);
+ retval = de_thread(bprm, current);
can we get rid of passing current as parameter here?
Thanks Bernd.
if (retval) goto out; @@ -1664,7 +1665,7 @@ int search_binary_handler(struct linux_binprm *bprm) read_lock(&binfmt_lock); put_binfmt(fmt);
- if (retval < 0 && !bprm->mm) {
+ if (retval < 0 && bprm->unrecoverable) { /* we got to flush_old_exec() and failed after it */ read_unlock(&binfmt_lock); force_sigsegv(SIGSEGV);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index b40fc633f3be..12263115ce7a 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -44,7 +44,12 @@ struct linux_binprm { * exec has happened. Used to sanitize execution environment * and to set AT_SECURE auxv for glibc. */
- secureexec:1;
+ secureexec:1,
+ /*
+ * Set when changes have been made that prevent returning
+ * to userspace.
+ */
+ unrecoverable:1;
#ifdef __alpha__ unsigned int taso:1; #endif
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/5/20 10:15 PM, Eric W. Biederman wrote:
@@ -1266,7 +1267,7 @@ int flush_old_exec(struct linux_binprm * bprm) * Make sure we have a private signal table and that * we are unassociated from the previous thread group. */
- retval = de_thread(current);
+ retval = de_thread(bprm, current);
can we get rid of passing current as parameter here?
With a separate patch. It makes the patch less clear if I make that change in this one.
Eric
On 3/5/20 10:15 PM, Eric W. Biederman wrote:
diff --git a/fs/exec.c b/fs/exec.c index db17be51b112..c243f9660d46 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1061,7 +1061,7 @@ static int exec_mmap(struct mm_struct *mm)
 * disturbing other processes. (Other processes might share the signal
 * table via the CLONE_SIGHAND option to clone().)
*/ -static int de_thread(struct task_struct *tsk) +static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk) { struct signal_struct *sig = tsk->signal; struct sighand_struct *oldsighand = tsk->sighand; @@ -1182,6 +1182,7 @@ static int de_thread(struct task_struct *tsk) release_task(leader); }
+ bprm->unrecoverable = true; sig->group_exit_task = NULL; sig->notify_count = 0;
ah, sorry,

	if (thread_group_empty(tsk))
		goto no_thread_group;

will skip this:

	sig->group_exit_task = NULL;
	sig->notify_count = 0;

no_thread_group:
	/* we have changed execution domain */
	tsk->exit_signal = SIGCHLD;

so I think the bprm->unrecoverable = true; should be here?
Bernd.
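The concern raised above can be reproduced in isolation. In this small sketch (hypothetical function names, with a plain bool standing in for bprm->unrecoverable), an assignment placed before the label is skipped in the single-threaded case, while an assignment placed after the label runs on both paths.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag set before the label: skipped when the thread group is empty. */
static bool flag_before_label(bool thread_group_empty)
{
	bool unrecoverable = false;

	if (thread_group_empty)
		goto no_thread_group;

	/* ... sibling threads would be killed here ... */
	unrecoverable = true;

no_thread_group:
	return unrecoverable;
}

/* Flag set after the label: reached on both paths. */
static bool flag_after_label(bool thread_group_empty)
{
	bool unrecoverable = false;

	if (thread_group_empty)
		goto no_thread_group;

	/* ... sibling threads would be killed here ... */

no_thread_group:
	unrecoverable = true;
	return unrecoverable;
}
```

With the first placement, a single-threaded exec that fails later would not be treated as past the point of no return, which appears to be the case the comment is pointing at.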
@@ -1266,7 +1267,7 @@ int flush_old_exec(struct linux_binprm * bprm) * Make sure we have a private signal table and that * we are unassociated from the previous thread group. */
- retval = de_thread(current);
+ retval = de_thread(bprm, current); if (retval) goto out;
@@ -1664,7 +1665,7 @@ int search_binary_handler(struct linux_binprm *bprm) read_lock(&binfmt_lock); put_binfmt(fmt);
if (retval < 0 && !bprm->mm) {
if (retval < 0 && bprm->unrecoverable) { /* we got to flush_old_exec() and failed after it */ read_unlock(&binfmt_lock); force_sigsegv(SIGSEGV);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index b40fc633f3be..12263115ce7a 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -44,7 +44,12 @@ struct linux_binprm { * exec has happened. Used to sanitize execution environment * and to set AT_SECURE auxv for glibc. */
secureexec:1;
secureexec:1,
/*
* Set when changes have been made that prevent returning
* to userspace.
*/
unrecoverable:1;
#ifdef __alpha__ unsigned int taso:1; #endif
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/5/20 10:15 PM, Eric W. Biederman wrote:
Add a flag binfmt->unrecoverable to mark when execution has gotten to the point where it is impossible to return to userspace with the calling process unchanged.
While techinically this state starts as soon as de_thread starts killing threads, the only return path at that point is if there is a fatal signal pending. I have choosen instead to set unrecoverable when the killing stops, and there are possibilities of failures other than fatal signals. In particular it is possible for the allocation of a new sighand structure to fail.
Setting unrecoverable at this point has the benefit that other actions can be taken after the other threads are all dead, and the unrecoverable flag can double as a flag that those actions have been taken.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 7 ++++--- include/linux/binfmts.h | 7 ++++++- 2 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..c243f9660d46 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1061,7 +1061,7 @@ static int exec_mmap(struct mm_struct *mm)
  * disturbing other processes.  (Other processes might share the signal
  * table via the CLONE_SIGHAND option to clone().)
  */
-static int de_thread(struct task_struct *tsk)
+static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk)
 {
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
@@ -1182,6 +1182,7 @@ static int de_thread(struct task_struct *tsk)
 		release_task(leader);
 	}

+	bprm->unrecoverable = true;
 	sig->group_exit_task = NULL;
 	sig->notify_count = 0;
ah, sorry, if (thread_group_empty(tsk)) goto no_thread_group; will skip this:
sig->group_exit_task = NULL; sig->notify_count = 0;
no_thread_group: /* we have changed execution domain */ tsk->exit_signal = SIGCHLD;
so I think the bprm->unrecoverable = true; should be here?
Absolutely. Thank you very much.
This is why I try and keep things to one clear simple thing per patch so silly thinkos like that can be caught.
Eric
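The role of the unrecoverable flag discussed above can be illustrated with a small userspace sketch. It is not kernel code: struct exec_ctx, sketch_de_thread() and sketch_exec() are hypothetical stand-ins, and only the final check mirrors the condition the quoted patch adds to search_binary_handler().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct linux_binprm; only the flag matters here. */
struct exec_ctx {
	bool unrecoverable;	/* set once the old process state is gone */
};

enum exec_outcome { EXEC_OK, EXEC_RETURN_ERROR, EXEC_KILL_TASK };

/*
 * Hypothetical stand-in for de_thread(): fail_before and fail_after inject
 * an error before or after the point of no return.
 */
static int sketch_de_thread(struct exec_ctx *ctx, bool fail_before,
			    bool fail_after)
{
	assert(ctx != NULL);

	if (fail_before)
		return -EAGAIN;		/* nothing has been changed yet */

	ctx->unrecoverable = true;	/* the other threads are dead from here on */

	if (fail_after)
		return -ENOMEM;		/* e.g. a late allocation failure */
	return 0;
}

/* Decide what a failure means, based on whether the flag was set. */
static enum exec_outcome sketch_exec(bool fail_before, bool fail_after)
{
	struct exec_ctx ctx = { .unrecoverable = false };
	int retval = sketch_de_thread(&ctx, fail_before, fail_after);

	if (retval < 0 && ctx.unrecoverable)
		return EXEC_KILL_TASK;	/* cannot return to the old image */
	if (retval < 0)
		return EXEC_RETURN_ERROR;
	return EXEC_OK;
}
```

An early failure is reported to the caller as an ordinary error, while a failure after the flag is set can only end with the task being killed.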
On 3/6/20 6:09 AM, Eric W. Biederman wrote:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/5/20 10:15 PM, Eric W. Biederman wrote:
Add a flag binfmt->unrecoverable to mark when execution has gotten to the point where it is impossible to return to userspace with the calling process unchanged.
While techinically this state starts as soon as de_thread starts
typo: s/techinically/technically/
killing threads, the only return path at that point is if there is a fatal signal pending. I have choosen instead to set unrecoverable
I'm not good at english, is this chosen ?
Bernd.
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/6/20 6:09 AM, Eric W. Biederman wrote:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/5/20 10:15 PM, Eric W. Biederman wrote:
Add a flag binfmt->unrecoverable to mark when execution has gotten to the point where it is impossible to return to userspace with the calling process unchanged.
While techinically this state starts as soon as de_thread starts
typo: s/techinically/technically/
killing threads, the only return path at that point is if there is a fatal signal pending. I have choosen instead to set unrecoverable
I'm not good at english, is this chosen ?
Yes. Definitely worth fixing.
Eric
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
Any of those can result in deadlock, as the cred_guard_mutex is held over a possibly indefinite wait for userspace.
Add exec_update_mutex, which is held only while exec updates the process with the new contents of exec. Code that must not be confused by exec changing the mm and the cred in ways that cannot happen during ordinary execution of a process can take this mutex.
The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03M...
Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
---
 fs/exec.c                    | 4 ++++
 include/linux/sched/signal.h | 9 ++++++++-
 kernel/fork.c                | 1 +
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index c243f9660d46..ad7b518f906d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1182,6 +1182,7 @@ static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk)
 		release_task(leader);
 	}

+	mutex_lock(&current->signal->exec_update_mutex);
 	bprm->unrecoverable = true;
 	sig->group_exit_task = NULL;
 	sig->notify_count = 0;
@@ -1425,6 +1426,8 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (bprm->unrecoverable)
+			mutex_unlock(&current->signal->exec_update_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1474,6 +1477,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->exec_update_mutex);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 88050259c466..a29df79540ce 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -224,7 +224,14 @@ struct signal_struct {

 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
-					 * (notably. ptrace) */
+					 * (notably. ptrace)
+					 * Deprecated do not use in new code.
+					 * Use exec_update_mutex instead.
+					 */
+	struct mutex exec_update_mutex;	/* Held while task_struct is being
+					 * updated during exec, and may have
+					 * inconsistent permissions.
+					 */
 } __randomize_layout;

 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 60a1295f4384..12896a6ecee6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;

 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->exec_update_mutex);

 	return 0;
 }
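The locking idea in this patch can be sketched in userspace with pthreads. The sketch is only an analogy: struct sketch_signal and sketch_reader_can_proceed() are hypothetical names, and the trylock stands in for a reader such as mm_access that needs only the narrowly scoped mutex.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the two mutexes in signal_struct. */
struct sketch_signal {
	pthread_mutex_t cred_guard_mutex;	/* held for the whole exec */
	pthread_mutex_t exec_update_mutex;	/* held only while state is swapped */
};

static struct sketch_signal sketch_sig = {
	.cred_guard_mutex = PTHREAD_MUTEX_INITIALIZER,
	.exec_update_mutex = PTHREAD_MUTEX_INITIALIZER,
};

/*
 * Reader that needs only the narrow mutex. It reports whether it could
 * make progress right now instead of blocking.
 */
static bool sketch_reader_can_proceed(struct sketch_signal *sig)
{
	int err = pthread_mutex_trylock(&sig->exec_update_mutex);

	if (err == EBUSY)
		return false;	/* exec is in the middle of its update */
	assert(err == 0);
	pthread_mutex_unlock(&sig->exec_update_mutex);
	return true;
}
```

While only the broad mutex is held (the long phase in which exec may wait on userspace), the reader still proceeds; it is held off only during the short window in which the narrow mutex is taken.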
On 3/5/20 10:16 PM, Eric W. Biederman wrote:
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
Any of those can result in deadlock, as the cred_guard_mutex is held over a possibly indefinite wait for userspace.
Add exec_update_mutex, which is held only while exec updates the process with the new contents of exec. Code that must not be confused by exec changing the mm and the cred in ways that cannot happen during ordinary execution of a process can take this mutex.
The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03M... Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/ Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/ Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.") Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2") Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 4 ++++ include/linux/sched/signal.h | 9 ++++++++- kernel/fork.c | 1 + 3 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/fs/exec.c b/fs/exec.c
index c243f9660d46..ad7b518f906d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1182,6 +1182,7 @@ static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk)
 		release_task(leader);
 	}

+	mutex_lock(&current->signal->exec_update_mutex);
 	bprm->unrecoverable = true;
 	sig->group_exit_task = NULL;
 	sig->notify_count = 0;
@@ -1425,6 +1426,8 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (bprm->unrecoverable)
+			mutex_unlock(&current->signal->exec_update_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1474,6 +1477,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->exec_update_mutex);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 88050259c466..a29df79540ce 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -224,7 +224,14 @@ struct signal_struct {

 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
-					 * (notably. ptrace) */
+					 * (notably. ptrace)
+					 * Deprecated do not use in new code.
+					 * Use exec_update_mutex instead.
+					 */
+	struct mutex exec_update_mutex;	/* Held while task_struct is being
+					 * updated during exec, and may have
+					 * inconsistent permissions.
+					 */
 } __randomize_layout;

 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 60a1295f4384..12896a6ecee6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;

 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->exec_update_mutex);

 	return 0;
 }
Don't you need to add something like this to init/init_task.c ?
.exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
Bernd.
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/5/20 10:16 PM, Eric W. Biederman wrote:
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
Any of those can result in deadlock, as the cred_guard_mutex is held over a possibly indefinite wait for userspace.
Add exec_update_mutex, which is held only while exec updates the process with the new contents of exec. Code that must not be confused by exec changing the mm and the cred in ways that cannot happen during ordinary execution of a process can take this mutex.
The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03M... Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/ Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/ Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.") Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2") Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 4 ++++ include/linux/sched/signal.h | 9 ++++++++- kernel/fork.c | 1 + 3 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/fs/exec.c b/fs/exec.c
index c243f9660d46..ad7b518f906d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1182,6 +1182,7 @@ static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk)
 		release_task(leader);
 	}

+	mutex_lock(&current->signal->exec_update_mutex);
 	bprm->unrecoverable = true;
 	sig->group_exit_task = NULL;
 	sig->notify_count = 0;
@@ -1425,6 +1426,8 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (bprm->unrecoverable)
+			mutex_unlock(&current->signal->exec_update_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1474,6 +1477,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->exec_update_mutex);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 88050259c466..a29df79540ce 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -224,7 +224,14 @@ struct signal_struct {

 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
-					 * (notably. ptrace) */
+					 * (notably. ptrace)
+					 * Deprecated do not use in new code.
+					 * Use exec_update_mutex instead.
+					 */
+	struct mutex exec_update_mutex;	/* Held while task_struct is being
+					 * updated during exec, and may have
+					 * inconsistent permissions.
+					 */
 } __randomize_layout;

 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 60a1295f4384..12896a6ecee6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;

 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->exec_update_mutex);

 	return 0;
 }
Don't you need to add something like this to init/init_task.c ?
.exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
Yes. I overlooked that. Thank you.
Eric
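The point about init/init_task.c is that a statically defined object never passes through the runtime constructor, so a newly added mutex needs a static initializer there as well. A userspace sketch of the same situation, with hypothetical names (struct demo_sig, demo_init_signals, demo_copy_signal) and pthreads standing in for kernel mutexes:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct demo_sig {
	pthread_mutex_t cred_guard_mutex;
	pthread_mutex_t exec_update_mutex;
};

/*
 * Statically defined first instance, analogous to init_signals: it is never
 * passed to demo_copy_signal(), so every mutex needs a static initializer.
 */
static struct demo_sig demo_init_signals = {
	.cred_guard_mutex = PTHREAD_MUTEX_INITIALIZER,
	.exec_update_mutex = PTHREAD_MUTEX_INITIALIZER,
};

/* Runtime constructor, analogous to the mutex_init() calls in copy_signal(). */
static struct demo_sig *demo_copy_signal(void)
{
	struct demo_sig *sig = malloc(sizeof(*sig));

	if (!sig)
		return NULL;
	pthread_mutex_init(&sig->cred_guard_mutex, NULL);
	pthread_mutex_init(&sig->exec_update_mutex, NULL);
	return sig;
}

/* Release an instance created by demo_copy_signal(). */
static void demo_free_signal(struct demo_sig *sig)
{
	pthread_mutex_destroy(&sig->cred_guard_mutex);
	pthread_mutex_destroy(&sig->exec_update_mutex);
	free(sig);
}
```

Both the static instance and a dynamically created one end up with a usable exec_update_mutex only because each initialization path names the new member explicitly.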
Am 06.03.20 um 06:17 schrieb Eric W. Biederman:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/5/20 10:16 PM, Eric W. Biederman wrote:
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
Any of those can result in deadlock, as the cred_guard_mutex is held over a possibly indefinite wait for userspace.
Add exec_update_mutex, which is held only while exec updates the process with the new contents of exec. Code that must not be confused by exec changing the mm and the cred in ways that cannot happen during ordinary execution of a process can take this mutex.
The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03M... Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/ Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/ Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.") Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2") Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 4 ++++ include/linux/sched/signal.h | 9 ++++++++- kernel/fork.c | 1 + 3 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/fs/exec.c b/fs/exec.c
index c243f9660d46..ad7b518f906d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1182,6 +1182,7 @@ static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk)
 		release_task(leader);
 	}

+	mutex_lock(&current->signal->exec_update_mutex);
And by the way, could you make this mutex_lock_killable?
Bernd.
Bernd Edlinger bernd.edlinger@hotmail.de writes:
Am 06.03.20 um 06:17 schrieb Eric W. Biederman:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/5/20 10:16 PM, Eric W. Biederman wrote:
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
Any of those can result in deadlock, as the cred_guard_mutex is held over a possibly indefinite wait for userspace.
Add exec_update_mutex, which is held only while exec updates the process with the new contents of exec. Code that must not be confused by exec changing the mm and the cred in ways that cannot happen during ordinary execution of a process can take this mutex.
The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03M... Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/ Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/ Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.") Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2") Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 4 ++++ include/linux/sched/signal.h | 9 ++++++++- kernel/fork.c | 1 + 3 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/fs/exec.c b/fs/exec.c
index c243f9660d46..ad7b518f906d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1182,6 +1182,7 @@ static int de_thread(struct linux_binprm *bprm, struct task_struct *tsk)
 		release_task(leader);
 	}

+	mutex_lock(&current->signal->exec_update_mutex);
And by the way, could you make this mutex_lock_killable?
For some reason when I first read this suggestion I thought making this mutex_lock_killable would cause me to rework the logic of when I set unrecoverable and when I unlock the mutex. I blame a tired brain. If a process has received a fatal signal none of that matters.
So yes I will do that just to make things robust in case we miss something that would still make it possible to deadlock with the new mutex.
I am a little worried that the new mutex might still cover a little too much. But past a certain point we are not going to be able to make this code perfect in the first change. The best we can do is to be careful and avoid regressions. Whatever slips through we can fix when we spot the problem.
Eric
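What a killable lock buys can be shown with a rough userspace analogy. This is a sketch only: sketch_fatal_pending is a hypothetical flag standing in for a pending fatal signal, and the polling loop is not how the kernel implements mutex_lock_killable; the sketch reproduces just the contract of returning 0 on success or -EINTR when the wait is abandoned.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag standing in for "a fatal signal is pending". */
static atomic_bool sketch_fatal_pending;

/*
 * Try to take the lock, but give up with -EINTR once the fatal flag is set,
 * so a doomed task never waits forever on a mutex that cannot be released.
 */
static int sketch_lock_killable(pthread_mutex_t *lock)
{
	for (;;) {
		int err = pthread_mutex_trylock(lock);

		if (err == 0)
			return 0;	/* lock acquired */
		assert(err == EBUSY);
		if (atomic_load(&sketch_fatal_pending))
			return -EINTR;	/* caller must back out without the lock */
		sched_yield();		/* let the holder run, then retry */
	}
}
```

A caller has to check the return value and unwind without touching the protected state when -EINTR comes back.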
On 3/6/20 6:17 AM, Eric W. Biederman wrote:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/5/20 10:16 PM, Eric W. Biederman wrote:
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
I am all for this patch, and the direction it is heading, Eric.
I just wanted to add a note that I think it is possible that exec_mm_release can also invoke put_user(0, tsk->clear_child_tid), under the new exec_update_mutex, since vm_access increments the mm->mm_users, under the cred_update_mutex, but releases the mutex, and the caller can hold the reference for a while and then exec_mmap is not releasing the last reference.
Bernd.
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/6/20 6:17 AM, Eric W. Biederman wrote:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/5/20 10:16 PM, Eric W. Biederman wrote:
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
I am all for this patch, and the direction it is heading, Eric.
I just wanted to add a note that I think it is possible that exec_mm_release can also invoke put_user(0, tsk->clear_child_tid), under the new exec_update_mutex, since vm_access increments the mm->mm_users, under the cred_update_mutex, but releases the mutex, and the caller can hold the reference for a while and then exec_mmap is not releasing the last reference.
Good catch. I really appreciate your close look at the details.
I am wondering if process_vm_readv and process_vm_writev could be safely changed to use mmgrab and mmdrop, instead of mmget and mmput.
That would resolve the potential issue you have pointed out. I just haven't figured out if it is safe. The mm code has been seriously refactored since I knew how it all worked.
Eric
ebiederm@xmission.com (Eric W. Biederman) writes:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/6/20 6:17 AM, Eric W. Biederman wrote:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/5/20 10:16 PM, Eric W. Biederman wrote:
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
I am all for this patch, and the direction it is heading, Eric.
I just wanted to add a note that I think it is possible that exec_mm_release can also invoke put_user(0, tsk->clear_child_tid), under the new exec_update_mutex, since vm_access increments the mm->mm_users, under the cred_update_mutex, but releases the mutex, and the caller can hold the reference for a while and then exec_mmap is not releasing the last reference.
Good catch. I really appreciate your close look at the details.
I am wondering if process_vm_readv and process_vm_writev could be safely changed to use mmgrab and mmdrop, instead of mmget and mmput.
That would resolve the potential issue you have pointed out. I just haven't figured out if it is safe. The mm code has been seriously refactored since I knew how it all worked.
Nope, mmget can not be replaced by mmgrab.
It might be possible to do something creative like store a cred in place of the userns on the mm and use that for mm_access permission checks. Still we are talking a pretty narrow window, and a case that no one has figured out how to trigger yet. So I will leave that corner case as something for future improvements.
Eric
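The difference between the two reference counts mentioned here can be modelled in a few lines. This is a toy model with hypothetical names (struct sketch_mm and the sketch_* helpers), not the kernel implementation: mm_users pins the address space, mm_count pins only the structure, and all users together hold a single mm_count reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the two counters on an mm. */
struct sketch_mm {
	int mm_users;		/* users of the address space (mmget/mmput) */
	int mm_count;		/* references to the struct (mmgrab/mmdrop) */
	bool mappings_alive;	/* address space still usable */
	bool struct_alive;	/* structure not yet freed */
};

static void sketch_mm_init(struct sketch_mm *mm)
{
	mm->mm_users = 1;
	mm->mm_count = 1;	/* held on behalf of all mm_users */
	mm->mappings_alive = true;
	mm->struct_alive = true;
}

static void sketch_mmgrab(struct sketch_mm *mm) { mm->mm_count++; }
static void sketch_mmget(struct sketch_mm *mm) { mm->mm_users++; }

static void sketch_mmdrop(struct sketch_mm *mm)
{
	if (--mm->mm_count == 0)
		mm->struct_alive = false;	/* struct would be freed here */
}

static void sketch_mmput(struct sketch_mm *mm)
{
	if (--mm->mm_users == 0) {
		mm->mappings_alive = false;	/* address space torn down */
		sketch_mmdrop(mm);		/* drop the users' shared ref */
	}
}
```

In this model a holder that only did the grab keeps the structure around but loses the mappings as soon as the last user goes away, which is why a caller that must read or write the target's memory needs the stronger reference.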
ebiederm@xmission.com (Eric W. Biederman) writes:
ebiederm@xmission.com (Eric W. Biederman) writes:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/6/20 6:17 AM, Eric W. Biederman wrote:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/5/20 10:16 PM, Eric W. Biederman wrote:
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
I am all for this patch, and the direction it is heading, Eric.
I just wanted to add a note that I think it is possible that exec_mm_release can also invoke put_user(0, tsk->clear_child_tid), under the new exec_update_mutex, since vm_access increments the mm->mm_users, under the cred_update_mutex, but releases the mutex, and the caller can hold the reference for a while and then exec_mmap is not releasing the last reference.
Good catch. I really appreciate your close look at the details.
I am wondering if process_vm_readv and process_vm_writev could be safely changed to use mmgrab and mmdrop, instead of mmget and mmput.
That would resolve the potential issue you have pointed out. I just haven't figured out if it is safe. The mm code has been seriously refactored since I knew how it all worked.
Nope, mmget can not be replaced by mmgrab.
It might be possible to do something creative like store a cred in place of the userns on the mm and use that for mm_access permission checks. Still we are talking a pretty narrow window, and a case that no one has figured out how to trigger yet. So I will leave that corner case as something for future improvements.
My brain is restless and keeps looking at it.
The worst case is processes created with CLONE_VM|CLONE_CHILD_CLEARTID but not CLONE_THREAD. For those, that put_user will occur every time in exec_mmap.
The only solution that I can see is to move taking the new mutex after exec_mm_release, which may be feasible given how closely exec_mmap follows de_thread.
I am going to sleep on that and perhaps I will be able to see how to move taking the mutex lower.
It would be very nice not to have a known issue going into this set of changes.
Eric
It was pointed out that de_thread may return -ENOMEM after it has already terminated threads. At that point, returning an error from execve is no longer an option, except when a fatal signal is being delivered.
Allocate the memory for the signal table earlier, and make sure that -ENOMEM is returned before the unrecoverable actions are started.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de --- Eric, what do you think, might this be helpful to move the "point of no return" lower, and simplify your patch?
fs/exec.c | 31 +++++++++++++++++++++++-------- 1 file changed, 23 insertions(+), 8 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..a0328dc 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1057,16 +1057,26 @@ static int exec_mmap(struct mm_struct *mm)
  * disturbing other processes.  (Other processes might share the signal
  * table via the CLONE_SIGHAND option to clone().)
  */
-static int de_thread(struct task_struct *tsk)
+static int de_thread(void)
 {
+	struct task_struct *tsk = current;
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
 	spinlock_t *lock = &oldsighand->siglock;
+	struct sighand_struct *newsighand = NULL;

 	if (thread_group_empty(tsk))
 		goto no_thread_group;

 	/*
+	 * This is the last time for an out of memory error.
+	 * After this point only fatal signals are are okay.
+	 */
+	newsighand = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
+	if (!newsighand)
+		return -ENOMEM;
+
+	/*
 	 * Kill all other threads in the thread group.
 	 */
 	spin_lock_irq(lock);
@@ -1076,7 +1086,7 @@ static int de_thread(struct task_struct *tsk)
 		 * return so that the signal is processed.
 		 */
 		spin_unlock_irq(lock);
-		return -EAGAIN;
+		goto err_free;
 	}

 	sig->group_exit_task = tsk;
@@ -1191,14 +1201,16 @@ static int de_thread(struct task_struct *tsk)
 #endif

 	if (refcount_read(&oldsighand->count) != 1) {
-		struct sighand_struct *newsighand;
 		/*
 		 * This ->sighand is shared with the CLONE_SIGHAND
 		 * but not CLONE_THREAD task, switch to the new one.
 		 */
-		newsighand = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
-		if (!newsighand)
-			return -ENOMEM;
+		if (!newsighand) {
+			newsighand = kmem_cache_alloc(sighand_cachep,
+						      GFP_KERNEL);
+			if (!newsighand)
+				return -ENOMEM;
+		}

 		refcount_set(&newsighand->count, 1);
 		memcpy(newsighand->action, oldsighand->action,
@@ -1211,7 +1223,8 @@ static int de_thread(struct task_struct *tsk)
 		write_unlock_irq(&tasklist_lock);

 		__cleanup_sighand(oldsighand);
-	}
+	} else if (newsighand)
+		kmem_cache_free(sighand_cachep, newsighand);

 	BUG_ON(!thread_group_leader(tsk));
 	return 0;
@@ -1222,6 +1235,8 @@ static int de_thread(struct task_struct *tsk)
 	sig->group_exit_task = NULL;
 	sig->notify_count = 0;
 	read_unlock(&tasklist_lock);
+err_free:
+	kmem_cache_free(sighand_cachep, newsighand);
 	return -EAGAIN;
 }

@@ -1262,7 +1277,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 	 * Make sure we have a private signal table and that
 	 * we are unassociated from the previous thread group.
 	 */
-	retval = de_thread(current);
+	retval = de_thread();
 	if (retval)
 		goto out;
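The pattern this patch proposes (allocate everything that can fail before the first irreversible step, and free the spare if it turns out not to be needed) can be sketched in userspace as follows. All names are hypothetical, and the injected allocation failure exists only so that the error path can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical hook so an allocation failure can be injected. */
static bool sketch_fail_alloc;

static void *sketch_alloc(size_t size)
{
	return sketch_fail_alloc ? NULL : malloc(size);
}

static bool sketch_threads_killed;	/* records the irreversible step */
static void *sketch_installed_table;	/* stands in for the new sighand */

/* Allocate first, then cross the point of no return, then use or free. */
static int sketch_prealloc_de_thread(bool table_is_shared)
{
	void *newtable = sketch_alloc(64);

	if (!newtable)
		return -ENOMEM;		/* still safe: nothing changed yet */

	sketch_threads_killed = true;	/* point of no return */

	if (table_is_shared)
		sketch_installed_table = newtable;	/* switch to private copy */
	else
		free(newtable);		/* spare was not needed */
	return 0;
}
```

The cost of the pattern is an allocation that is usually thrown away, since the shared-table case is the uncommon one.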
Bernd Edlinger bernd.edlinger@hotmail.de writes:
It was pointed out that de_thread may return -ENOMEM after it has already terminated threads. At that point, returning an error from execve is no longer an option, except when a fatal signal is being delivered.
Allocate the memory for the signal table earlier, and make sure that -ENOMEM is returned before the unrecoverable actions are started.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
Eric, what do you think, might this be helpful to move the "point of no return" lower, and simplify your patch?
Good thinking but no. In this case it is possible to move the entire allocation lower. As well as the posix timer cleanup. That code is actually much clearer outside of de_thread. I will post a patch in that direction in a moment.
It is something of a bad idea to move the new sighand allocation sooner because in practice it does not happen. It only exists to support the CLONE_VM | CLONE_SIGHAND without CLONE_SIGNAL case which is not used by the modern posix thread libraries.
There are just enough old executables floating out there that I don't think we can remove the CLONE_SIGHAND case in general but I keep dreaming about it. We get a lot of complexity in the code to support something that no one really does anymore.
Eric
fs/exec.c | 31 +++++++++++++++++++++++-------- 1 file changed, 23 insertions(+), 8 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..a0328dc 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1057,16 +1057,26 @@ static int exec_mmap(struct mm_struct *mm)
  * disturbing other processes.  (Other processes might share the signal
  * table via the CLONE_SIGHAND option to clone().)
  */
-static int de_thread(struct task_struct *tsk)
+static int de_thread(void)
 {
+	struct task_struct *tsk = current;
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
 	spinlock_t *lock = &oldsighand->siglock;
+	struct sighand_struct *newsighand = NULL;

 	if (thread_group_empty(tsk))
 		goto no_thread_group;

 	/*
+	 * This is the last time for an out of memory error.
+	 * After this point only fatal signals are are okay.
+	 */
+	newsighand = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
+	if (!newsighand)
+		return -ENOMEM;
+
+	/*
 	 * Kill all other threads in the thread group.
 	 */
 	spin_lock_irq(lock);
@@ -1076,7 +1086,7 @@ static int de_thread(struct task_struct *tsk)
 		 * return so that the signal is processed.
 		 */
 		spin_unlock_irq(lock);
-		return -EAGAIN;
+		goto err_free;
 	}

 	sig->group_exit_task = tsk;
@@ -1191,14 +1201,16 @@ static int de_thread(struct task_struct *tsk)
 #endif

 	if (refcount_read(&oldsighand->count) != 1) {
-		struct sighand_struct *newsighand;
 		/*
 		 * This ->sighand is shared with the CLONE_SIGHAND
 		 * but not CLONE_THREAD task, switch to the new one.
 		 */
-		newsighand = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
-		if (!newsighand)
-			return -ENOMEM;
+		if (!newsighand) {
+			newsighand = kmem_cache_alloc(sighand_cachep,
+						      GFP_KERNEL);
+			if (!newsighand)
+				return -ENOMEM;
+		}

 		refcount_set(&newsighand->count, 1);
 		memcpy(newsighand->action, oldsighand->action,
@@ -1211,7 +1223,8 @@ static int de_thread(struct task_struct *tsk)
 		write_unlock_irq(&tasklist_lock);

 		__cleanup_sighand(oldsighand);
-	}
+	} else if (newsighand)
+		kmem_cache_free(sighand_cachep, newsighand);

 	BUG_ON(!thread_group_leader(tsk));
 	return 0;
@@ -1222,6 +1235,8 @@ static int de_thread(struct task_struct *tsk)
 	sig->group_exit_task = NULL;
 	sig->notify_count = 0;
 	read_unlock(&tasklist_lock);
+err_free:
+	kmem_cache_free(sighand_cachep, newsighand);
 	return -EAGAIN;
 }

@@ -1262,7 +1277,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 	 * Make sure we have a private signal table and that
 	 * we are unassociated from the previous thread group.
 	 */
-	retval = de_thread(current);
+	retval = de_thread();
 	if (retval)
 		goto out;
On 3/5/20 10:14 PM, Eric W. Biederman wrote:
Bernd, everyone
This is how I think the infrastructure change should look that makes way for fixing this issue.
- Correct the point of no return.
- Add a new mutex to replace cred_guard_mutex
Then I think it is just going through the existing users of cred_guard_mutex and fixing them to use the new one.
There really aren't that many users of cred_guard_mutex so we should be able to get through the easy ones fairly quickly. And anything that isn't easy we can wait until we have a good fix.
The users of cred_guard_mutex that I saw were: fs/proc/base.c: proc_pid_attr_write do_io_accounting proc_pid_stack proc_pid_syscall proc_pid_personality perf_event_open mm_access kcmp pidfd_fget seccomp_set_mode_filter
Bernd does this make sense to you?
I think we can fix the seccomp/no_new_privs issue with some careful refactoring. We can probably do the same for ptrace but that appears to need a little lsm bug fixing.
Yes, for most functions the proposed "exec_update_mutex" is fine, but ptrace_attach, seccomp_set_mode_filter and proc_pid_attr_write need to be blocked for the whole exec duration, so they need a second "mutex" with deadlock detection as in my previous patch, if I see that right.
Unfortunately only one of the two test cases can be fixed without the second mutex; of course, mm_access is what causes the practical problem.
Currently, for the unlimited user space delay, I have only the case of a ptraced sibling thread on my radar: de_thread waits for the parent to call wait in this case, and that can literally take forever. But I know that PTRACE_CONT may also be needed after a PTRACE_EVENT_EXIT.
Can you explain what else in user space can go wrong to cause an unlimited delay in the execve?
Bernd.
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/5/20 10:14 PM, Eric W. Biederman wrote:
Bernd, everyone
This is how I think the infrastructure change should look that makes way for fixing this issue.
- Correct the point of no return.
- Add a new mutex to replace cred_guard_mutex
Then I think it is just going through the existing users of cred_guard_mutex and fixing them to use the new one.
There really aren't that many users of cred_guard_mutex so we should be able to get through the easy ones fairly quickly. And anything that isn't easy we can wait until we have a good fix.
The users of cred_guard_mutex that I saw were: fs/proc/base.c: proc_pid_attr_write do_io_accounting proc_pid_stack proc_pid_syscall proc_pid_personality perf_event_open mm_access kcmp pidfd_fget seccomp_set_mode_filter
Bernd does this make sense to you?
I think we can fix the seccomp/no_new_privs issue with some careful refactoring. We can probably do the same for ptrace but that appears to need a little lsm bug fixing.
Yes, for most functions the proposed "exec_update_mutex" is fine, but ptrace_attach, seccomp_set_mode_filter and proc_pid_attr_write need to be blocked for the whole exec duration, so they need a second "mutex" with deadlock detection as in my previous patch, if I see that right.
So far I am leaving "cred_guard_mutex" as that second "mutex". My sense is that when all we have left are the hard cases we can take those cases out in detail, examine them and see what really can be done.
Unfortunately only one of the two test cases can be fixed without the second mutex; of course, mm_access is what causes the practical problem.
Fixing the practical problems is foremost on my agenda. That, and clearing away enough of the noise that we can really focus on the hard problems when we begin to address them.
That way I am hoping we can really solve some of these issues and make them go away.
Currently, for the unlimited user space delay, I have only the case of a ptraced sibling thread on my radar: de_thread waits for the parent to call wait in this case, and that can literally take forever. But I know that PTRACE_CONT may also be needed after a PTRACE_EVENT_EXIT.
Can you explain what else in user space can go wrong to cause an unlimited delay in the execve?
Triggering a page fault. Depending on the backing store or possibly with the use of userfaultfd that page fault can be delayed indefinitely and pretty much be as bad as the ptrace case.
Eric
Bernd, everyone
This is how I think the infrastructure change should look that makes way for fixing this issue.
- Clean up and reorder the code so that code that can potentially wait indefinitely for userspace comes at the beginning of flush_old_exec.
- Add a new mutex and take it after we have passed any potential indefinite waits for userspace.
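As a rough illustration of the ordering described in the two points above, here is a small userspace model. It is not kernel code: `struct exec_model` and every function name in it are hypothetical, invented only for this sketch. The "mutex" is reduced to a flag so the sketch can count how often an unbounded wait happens while the lock is held.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct exec_model {
	bool update_lock_held;	/* stands in for the new mutex */
	int waits_under_lock;	/* unbounded waits performed while locked */
};

/*
 * Models a step that may block on userspace for an unbounded time,
 * e.g. waiting for a tracer to reap dying sibling threads.
 */
static void wait_for_userspace(struct exec_model *m)
{
	if (m->update_lock_held)
		m->waits_under_lock++;
}

static void lock_update(struct exec_model *m)
{
	m->update_lock_held = true;
}

static void unlock_update(struct exec_model *m)
{
	m->update_lock_held = false;
}

/* Ordering described above: all unbounded waits first, then the lock. */
static int exec_waits_first(struct exec_model *m)
{
	wait_for_userspace(m);	/* de_thread()-like phase */
	lock_update(m);
	/* a short, bounded update would happen here */
	unlock_update(m);
	return m->waits_under_lock;
}

/* Ordering to avoid: the lock is held across the unbounded wait. */
static int exec_lock_first(struct exec_model *m)
{
	lock_update(m);
	wait_for_userspace(m);
	unlock_update(m);
	return m->waits_under_lock;
}
```

In this model a reader that needs the lock (the role mm_access plays in the reported deadlock) can only get stuck behind an unbounded wait in the second ordering, which is the situation the series tries to remove.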
Then I think it is just going through the existing users of cred_guard_mutex and fixing them to use the new one.
There really aren't that many users of cred_guard_mutex so we should be able to get through the easy ones fairly quickly. And anything that isn't easy we can wait until we have a good fix.
The users of cred_guard_mutex that I saw were:
    fs/proc/base.c:
        proc_pid_attr_write
        do_io_accounting
        proc_pid_stack
        proc_pid_syscall
        proc_pid_personality
    perf_event_open
    mm_access
    kcmp
    pidfd_fget
    seccomp_set_mode_filter
Bernd I think I have addressed the issues you pointed out in v1. Please let me know if you see anything else.
Eric W. Biederman (5):
      exec: Only compute current once in flush_old_exec
      exec: Factor unshare_sighand out of de_thread and call it separately
      exec: Move cleanup of posix timers on exec out of de_thread
      exec: Move exec_mmap right after de_thread in flush_old_exec
      exec: Add a exec_update_mutex to replace cred_guard_mutex

 fs/exec.c                    | 65 ++++++++++++++++++++++++++++++--------------
 include/linux/sched/signal.h |  9 +++++-
 init/init_task.c             |  1 +
 kernel/fork.c                |  1 +
 4 files changed, 54 insertions(+), 22 deletions(-)
Make it clear that current only needs to be computed once in flush_old_exec. This may have some efficiency improvements and it makes the code easier to change.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
---
 fs/exec.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..c3f34791f2f0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1260,13 +1260,14 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
  */
 int flush_old_exec(struct linux_binprm * bprm)
 {
+	struct task_struct *me = current;
 	int retval;
 
 	/*
 	 * Make sure we have a private signal table and that
 	 * we are unassociated from the previous thread group.
 	 */
-	retval = de_thread(current);
+	retval = de_thread(me);
 	if (retval)
 		goto out;
 
@@ -1294,10 +1295,10 @@ int flush_old_exec(struct linux_binprm * bprm)
 	bprm->mm = NULL;
 
 	set_fs(USER_DS);
-	current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
+	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
 					PF_NOFREEZE | PF_NO_SETAFFINITY);
 	flush_thread();
-	current->personality &= ~bprm->per_clear;
+	me->personality &= ~bprm->per_clear;
 
 	/*
 	 * We have to apply CLOEXEC before we change whether the process is
@@ -1305,7 +1306,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 	 * trying to access the should-be-closed file descriptors of a process
 	 * undergoing exec(2).
 	 */
-	do_close_on_exec(current->files);
+	do_close_on_exec(me->files);
 	return 0;
 
 out:
On 3/8/20 10:35 PM, Eric W. Biederman wrote:
Make it clear that current only needs to be computed once in flush_old_exec. This may have some efficiency improvements and it makes the code easier to change.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..c3f34791f2f0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1260,13 +1260,14 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
  */
 int flush_old_exec(struct linux_binprm * bprm)
 {
+	struct task_struct *me = current;
 	int retval;
 
 	/*
 	 * Make sure we have a private signal table and that
 	 * we are unassociated from the previous thread group.
 	 */
-	retval = de_thread(current);
+	retval = de_thread(me);
 	if (retval)
 		goto out;
 
@@ -1294,10 +1295,10 @@ int flush_old_exec(struct linux_binprm * bprm)
 	bprm->mm = NULL;
 
 	set_fs(USER_DS);
-	current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
+	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
 					PF_NOFREEZE | PF_NO_SETAFFINITY);
I wonder if this line should be aligned with the previous?
Bernd.
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/8/20 10:35 PM, Eric W. Biederman wrote:
Make it clear that current only needs to be computed once in flush_old_exec. This may have some efficiency improvements and it makes the code easier to change.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..c3f34791f2f0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1260,13 +1260,14 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
  */
 int flush_old_exec(struct linux_binprm * bprm)
 {
+	struct task_struct *me = current;
 	int retval;
 
 	/*
 	 * Make sure we have a private signal table and that
 	 * we are unassociated from the previous thread group.
 	 */
-	retval = de_thread(current);
+	retval = de_thread(me);
 	if (retval)
 		goto out;
 
@@ -1294,10 +1295,10 @@ int flush_old_exec(struct linux_binprm * bprm)
 	bprm->mm = NULL;
 
 	set_fs(USER_DS);
-	current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
+	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
 					PF_NOFREEZE | PF_NO_SETAFFINITY);
I wonder if this line should be aligned with the previous?
In this case I don't think so. The style used for the second line is to indent with tabs as far to the right as possible. I haven't changed that.
Further, mixing a change in indentation style with just a variable rename would make the patch confusing to read, because two things would have to be verified at the same time.
So while I see why you ask I think this bit needs to stay as is.
Eric
On 3/9/20 6:34 PM, Eric W. Biederman wrote:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/8/20 10:35 PM, Eric W. Biederman wrote:
Make it clear that current only needs to be computed once in flush_old_exec. This may have some efficiency improvements and it makes the code easier to change.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..c3f34791f2f0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1260,13 +1260,14 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
  */
 int flush_old_exec(struct linux_binprm * bprm)
 {
+	struct task_struct *me = current;
 	int retval;
 
 	/*
 	 * Make sure we have a private signal table and that
 	 * we are unassociated from the previous thread group.
 	 */
-	retval = de_thread(current);
+	retval = de_thread(me);
 	if (retval)
 		goto out;
 
@@ -1294,10 +1295,10 @@ int flush_old_exec(struct linux_binprm * bprm)
 	bprm->mm = NULL;
 
 	set_fs(USER_DS);
-	current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
+	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
 					PF_NOFREEZE | PF_NO_SETAFFINITY);
I wonder if this line should be aligned with the previous?
In this case I don't think so. The style used for second line is indent with tabs as much as possible to the right. I haven't changed that.
Further mixing a change in indentation style with just a variable rename will make the patch confusing to read because two things have to be verified at the same time.
So while I see why you ask I think this bit needs to stay as is.
Ah, okay, I see. Thanks for explaining this rule, I was not aware of it, but I am still new here :)
Thanks Bernd.
On 3/9/20 6:56 PM, Bernd Edlinger wrote:
On 3/9/20 6:34 PM, Eric W. Biederman wrote:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/8/20 10:35 PM, Eric W. Biederman wrote:
Make it clear that current only needs to be computed once in flush_old_exec. This may have some efficiency improvements and it makes the code easier to change.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..c3f34791f2f0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1260,13 +1260,14 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
  */
 int flush_old_exec(struct linux_binprm * bprm)
 {
+	struct task_struct *me = current;
 	int retval;
 
 	/*
 	 * Make sure we have a private signal table and that
 	 * we are unassociated from the previous thread group.
 	 */
-	retval = de_thread(current);
+	retval = de_thread(me);
 	if (retval)
 		goto out;
 
@@ -1294,10 +1295,10 @@ int flush_old_exec(struct linux_binprm * bprm)
 	bprm->mm = NULL;
 
 	set_fs(USER_DS);
-	current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
+	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
 					PF_NOFREEZE | PF_NO_SETAFFINITY);
I wonder if this line should be aligned with the previous?
In this case I don't think so. The style used for second line is indent with tabs as much as possible to the right. I haven't changed that.
Further mixing a change in indentation style with just a variable rename will make the patch confusing to read because two things have to be verified at the same time.
So while I see why you ask I think this bit needs to stay as is.
Ah, okay, I see. Thanks for explaining this rule, I was not aware of it, but I am still new here :)
Reviewed-by: Bernd Edlinger bernd.edlinger@hotmail.de
Bernd.
On Sun, Mar 08, 2020 at 04:35:26PM -0500, Eric W. Biederman wrote:
Make it clear that current only needs to be computed once in flush_old_exec. This may have some efficiency improvements and it makes the code easier to change.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
Modulo my suggestion of adding more comments (it could even be kernel-doc!) that explicitly state that "me" should always be "current", yup, looks good:
Reviewed-by: Kees Cook keescook@chromium.org
-Kees
On Sun, Mar 08, 2020 at 04:35:26PM -0500, Eric W. Biederman wrote:
Make it clear that current only needs to be computed once in flush_old_exec. This may have some efficiency improvements and it makes the code easier to change.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
Acked-by: Christian Brauner christian.brauner@ubuntu.com
This makes the code clearer and makes it easier to implement a mutex that is not taken over any locations that may block indefinitely waiting for userspace.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
---
 fs/exec.c | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index c3f34791f2f0..ff74b9a74d34 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1194,6 +1194,23 @@ static int de_thread(struct task_struct *tsk)
 	flush_itimer_signals();
 #endif
 
+	BUG_ON(!thread_group_leader(tsk));
+	return 0;
+
+killed:
+	/* protects against exit_notify() and __exit_signal() */
+	read_lock(&tasklist_lock);
+	sig->group_exit_task = NULL;
+	sig->notify_count = 0;
+	read_unlock(&tasklist_lock);
+	return -EAGAIN;
+}
+
+
+static int unshare_sighand(struct task_struct *me)
+{
+	struct sighand_struct *oldsighand = me->sighand;
+
 	if (refcount_read(&oldsighand->count) != 1) {
 		struct sighand_struct *newsighand;
 		/*
@@ -1210,23 +1227,13 @@ static int de_thread(struct task_struct *tsk)
 
 		write_lock_irq(&tasklist_lock);
 		spin_lock(&oldsighand->siglock);
-		rcu_assign_pointer(tsk->sighand, newsighand);
+		rcu_assign_pointer(me->sighand, newsighand);
 		spin_unlock(&oldsighand->siglock);
 		write_unlock_irq(&tasklist_lock);
 
 		__cleanup_sighand(oldsighand);
 	}
-
-	BUG_ON(!thread_group_leader(tsk));
 	return 0;
-
-killed:
-	/* protects against exit_notify() and __exit_signal() */
-	read_lock(&tasklist_lock);
-	sig->group_exit_task = NULL;
-	sig->notify_count = 0;
-	read_unlock(&tasklist_lock);
-	return -EAGAIN;
 }
 
 char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk)
@@ -1264,13 +1271,19 @@ int flush_old_exec(struct linux_binprm * bprm)
 	int retval;
 
 	/*
-	 * Make sure we have a private signal table and that
-	 * we are unassociated from the previous thread group.
+	 * Make this the only thread in the thread group.
 	 */
 	retval = de_thread(me);
 	if (retval)
 		goto out;
 
+	/*
+	 * Make the signal table private.
+	 */
+	retval = unshare_sighand(me);
+	if (retval)
+		goto out;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
On 3/8/20 10:36 PM, Eric W. Biederman wrote:
This makes the code clearer and makes it easier to implement a mutex that is not taken over any locations that may block indefinitely waiting for userspace.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
Reviewed-by: Bernd Edlinger bernd.edlinger@hotmail.de
Bernd.
On Sun, Mar 08, 2020 at 04:36:17PM -0500, Eric W. Biederman wrote:
This makes the code clearer and makes it easier to implement a mutex that is not taken over any locations that may block indefinitely waiting for userspace.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
 fs/exec.c | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index c3f34791f2f0..ff74b9a74d34 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1194,6 +1194,23 @@ static int de_thread(struct task_struct *tsk)
 	flush_itimer_signals();
 #endif

Semi-related (existing behavior): in de_thread(), what keeps the thread group from changing? i.e.:

	if (thread_group_empty(tsk))
		goto no_thread_group;

	/*
	 * Kill all other threads in the thread group.
	 */
	spin_lock_irq(lock);
	... kill other threads under lock ...

Why is the thread_group_empty() test not under lock?

+	BUG_ON(!thread_group_leader(tsk));
+	return 0;
+
+killed:
+	/* protects against exit_notify() and __exit_signal() */
I wonder if include/linux/sched/task.h's definition of tasklist_lock should explicitly gain a note about group_exit_task and notify_count, or, alternatively, signal.h's section on these fields should gain a comment? tasklist_lock is unmentioned in signal.h... :(
+	read_lock(&tasklist_lock);
+	sig->group_exit_task = NULL;
+	sig->notify_count = 0;
+	read_unlock(&tasklist_lock);
+	return -EAGAIN;
+}
+
+
+static int unshare_sighand(struct task_struct *me)
+{
+	struct sighand_struct *oldsighand = me->sighand;
+
 	if (refcount_read(&oldsighand->count) != 1) {
 		struct sighand_struct *newsighand;
 		/*
@@ -1210,23 +1227,13 @@ static int de_thread(struct task_struct *tsk)
 
 		write_lock_irq(&tasklist_lock);
 		spin_lock(&oldsighand->siglock);
-		rcu_assign_pointer(tsk->sighand, newsighand);
+		rcu_assign_pointer(me->sighand, newsighand);
 		spin_unlock(&oldsighand->siglock);
 		write_unlock_irq(&tasklist_lock);
 
 		__cleanup_sighand(oldsighand);
 	}
-
-	BUG_ON(!thread_group_leader(tsk));
 	return 0;
-
-killed:
-	/* protects against exit_notify() and __exit_signal() */
-	read_lock(&tasklist_lock);
-	sig->group_exit_task = NULL;
-	sig->notify_count = 0;
-	read_unlock(&tasklist_lock);
-	return -EAGAIN;
 }
 
 char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk)
@@ -1264,13 +1271,19 @@ int flush_old_exec(struct linux_binprm * bprm)
 	int retval;
 
 	/*
-	 * Make sure we have a private signal table and that
-	 * we are unassociated from the previous thread group.
+	 * Make this the only thread in the thread group.
 	 */
 	retval = de_thread(me);
 	if (retval)
 		goto out;
 
+	/*
+	 * Make the signal table private.
+	 */
+	retval = unshare_sighand(me);
+	if (retval)
+		goto out;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
-- 2.25.0
Otherwise, yes, sensible separation.
Reviewed-by: Kees Cook keescook@chromium.org
On 3/10/20 9:29 PM, Kees Cook wrote:
On Sun, Mar 08, 2020 at 04:36:17PM -0500, Eric W. Biederman wrote:
This makes the code clearer and makes it easier to implement a mutex that is not taken over any locations that may block indefinitely waiting for userspace.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 39 ++++++++++++++++++++++++++------------- 1 file changed, 26 insertions(+), 13 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c index c3f34791f2f0..ff74b9a74d34 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1194,6 +1194,23 @@ static int de_thread(struct task_struct *tsk) flush_itimer_signals(); #endif
Semi-related (existing behavior): in de_thread(), what keeps the thread group from changing? i.e.:
	if (thread_group_empty(tsk))
		goto no_thread_group;

	/*
	 * Kill all other threads in the thread group.
	 */
	spin_lock_irq(lock);
	... kill other threads under lock ...

Why is the thread_group_empty() test not under lock?

A new thread cannot be created when only one thread is executing, right?
On Tue, Mar 10, 2020 at 09:34:03PM +0100, Bernd Edlinger wrote:
On 3/10/20 9:29 PM, Kees Cook wrote:
On Sun, Mar 08, 2020 at 04:36:17PM -0500, Eric W. Biederman wrote:
This makes the code clearer and makes it easier to implement a mutex that is not taken over any locations that may block indefinitely waiting for userspace.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 39 ++++++++++++++++++++++++++------------- 1 file changed, 26 insertions(+), 13 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c index c3f34791f2f0..ff74b9a74d34 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1194,6 +1194,23 @@ static int de_thread(struct task_struct *tsk) flush_itimer_signals(); #endif
Semi-related (existing behavior): in de_thread(), what keeps the thread group from changing? i.e.:
	if (thread_group_empty(tsk))
		goto no_thread_group;

	/*
	 * Kill all other threads in the thread group.
	 */
	spin_lock_irq(lock);
	... kill other threads under lock ...

Why is the thread_group_empty() test not under lock?

A new thread cannot be created when only one thread is executing, right?
*face palm* Yes, of course. :) I'm thinking too hard.
On Sun, Mar 08, 2020 at 04:36:17PM -0500, Eric W. Biederman wrote:
This makes the code clearer and makes it easier to implement a mutex that is not taken over any locations that may block indefinitely waiting for userspace.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
 fs/exec.c | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index c3f34791f2f0..ff74b9a74d34 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1194,6 +1194,23 @@ static int de_thread(struct task_struct *tsk)
 	flush_itimer_signals();
 #endif
 
+	BUG_ON(!thread_group_leader(tsk));
+	return 0;
+
+killed:
+	/* protects against exit_notify() and __exit_signal() */
+	read_lock(&tasklist_lock);
+	sig->group_exit_task = NULL;
+	sig->notify_count = 0;
+	read_unlock(&tasklist_lock);
+	return -EAGAIN;
+}
+
+
+static int unshare_sighand(struct task_struct *me)
+{
+	struct sighand_struct *oldsighand = me->sighand;
+
 	if (refcount_read(&oldsighand->count) != 1) {
 		struct sighand_struct *newsighand;
 		/*
@@ -1210,23 +1227,13 @@ static int de_thread(struct task_struct *tsk)
 
 		write_lock_irq(&tasklist_lock);
 		spin_lock(&oldsighand->siglock);
-		rcu_assign_pointer(tsk->sighand, newsighand);
+		rcu_assign_pointer(me->sighand, newsighand);
 		spin_unlock(&oldsighand->siglock);
 		write_unlock_irq(&tasklist_lock);
 
 		__cleanup_sighand(oldsighand);
 	}
This is fine for now, but we share an awful lot of code with copy_sighand(). We should earmark this to look into consolidating the core operations into a common helper called from both copy_sighand() and unshare_sighand(), maybe even dumbing it down to one helper. But not needed for now.
Otherwise: Acked-by: Christian Brauner christian.brauner@ubuntu.com
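To make the consolidation idea above concrete, here is a minimal userspace sketch of what a shared core might look like. It is only a model under stated assumptions: `struct model_sighand`, `model_dup_sighand()` and `model_unshare_sighand()` are hypothetical names, the action table is a plain int array, and there is no locking or RCU, all of which the real functions need.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_NSIG 64

/* Hypothetical stand-in for struct sighand_struct. */
struct model_sighand {
	int count;			/* reference count */
	int action[MODEL_NSIG];		/* stands in for the handler table */
};

/* Common core: allocate a private table and copy the handlers into it. */
static struct model_sighand *model_dup_sighand(const struct model_sighand *old)
{
	struct model_sighand *copy = malloc(sizeof(*copy));

	if (!copy)
		return NULL;
	copy->count = 1;
	memcpy(copy->action, old->action, sizeof(copy->action));
	return copy;
}

/* exec-side caller: only duplicate when the table is actually shared. */
static int model_unshare_sighand(struct model_sighand **sighand)
{
	struct model_sighand *old = *sighand;
	struct model_sighand *copy;

	if (old->count == 1)
		return 0;
	copy = model_dup_sighand(old);
	if (!copy)
		return -1;	/* the quoted kernel code returns -ENOMEM */
	old->count--;		/* drop our reference to the shared table */
	*sighand = copy;
	return 0;
}
```

A fork-side caller in this model would call `model_dup_sighand()` directly, which is the sense in which both paths could share one helper.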
These functions have very little to do with de_thread, so move them out of de_thread and into flush_old_exec proper so it can be more clearly seen what flush_old_exec is doing.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
---
 fs/exec.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index ff74b9a74d34..215d86f77b63 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
 	/* we have changed execution domain */
 	tsk->exit_signal = SIGCHLD;
 
-#ifdef CONFIG_POSIX_TIMERS
-	exit_itimers(sig);
-	flush_itimer_signals();
-#endif
-
 	BUG_ON(!thread_group_leader(tsk));
 	return 0;
 
@@ -1277,6 +1272,11 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+#ifdef CONFIG_POSIX_TIMERS
+	exit_itimers(me->signal);
+	flush_itimer_signals();
+#endif
+
 	/*
 	 * Make the signal table private.
 	 */
On 3/8/20 10:36 PM, Eric W. Biederman wrote:
These functions have very little to do with de_thread, so move them out of de_thread and into flush_old_exec proper so it can be more clearly seen what flush_old_exec is doing.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
Reviewed-by: Bernd Edlinger bernd.edlinger@hotmail.de
Bernd.
On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
These functions have very little to do with de_thread, so move them out of de_thread and into flush_old_exec proper so it can be more clearly seen what flush_old_exec is doing.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c index ff74b9a74d34..215d86f77b63 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
While you're cleaning up de_thread(), wouldn't it be good to also take the opportunity to remove the task argument from de_thread()? It's only ever used with current. Could be done in one of your patches or as a separate patch.
diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..ee108707e4b0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1061,8 +1061,9 @@ static int exec_mmap(struct mm_struct *mm)
  * disturbing other processes. (Other processes might share the signal
  * table via the CLONE_SIGHAND option to clone().)
  */
-static int de_thread(struct task_struct *tsk)
+static int de_thread(void)
 {
+	struct task_struct *tsk = current;
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
 	spinlock_t *lock = &oldsighand->siglock;
@@ -1266,7 +1267,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 	 * Make sure we have a private signal table and that
 	 * we are unassociated from the previous thread group.
 	 */
-	retval = de_thread(current);
+	retval = de_thread();
 	if (retval)
 		goto out;
Christian Brauner christian.brauner@ubuntu.com writes:
On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
These functions have very little to do with de_thread, so move them out of de_thread and into flush_old_exec proper so it can be more clearly seen what flush_old_exec is doing.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c index ff74b9a74d34..215d86f77b63 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
While you're cleaning up de_thread() wouldn't it be good to also take the opportunity and remove the task argument from de_thread(). It's only ever used with current. Could be done in one of your patches or as a separate patch.
How does that affect the code generation?
My sense is that computing current once in flush_old_exec is much better than computing it in each function flush_old_exec calls. I remember that computing current used to be not expensive, but noticeable.
For clarity I can see renaming tsk to me. So that it is clear we are talking about the current process, and not some arbitrary process.
And for clarity my goal here is not to clean up de_thread. Though I don't mind that result. My goal is to get the extra work out of de_thread so we can do process tear down cleanups that are safe according to the ordinary process rules, before taking a mutex that protects exec mucking with all of the state in exec.
Eric
diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..ee108707e4b0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1061,8 +1061,9 @@ static int exec_mmap(struct mm_struct *mm)
  * disturbing other processes. (Other processes might share the signal
  * table via the CLONE_SIGHAND option to clone().)
  */
-static int de_thread(struct task_struct *tsk)
+static int de_thread(void)
 {
+	struct task_struct *tsk = current;
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
 	spinlock_t *lock = &oldsighand->siglock;
@@ -1266,7 +1267,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 	 * Make sure we have a private signal table and that
 	 * we are unassociated from the previous thread group.
 	 */
-	retval = de_thread(current);
+	retval = de_thread();
 	if (retval)
 		goto out;
On Mon, Mar 09, 2020 at 03:06:46PM -0500, Eric W. Biederman wrote:
Christian Brauner christian.brauner@ubuntu.com writes:
On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
These functions have very little to do with de_thread, so move them out of de_thread and into flush_old_exec proper so it can be more clearly seen what flush_old_exec is doing.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c index ff74b9a74d34..215d86f77b63 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
While you're cleaning up de_thread() wouldn't it be good to also take the opportunity and remove the task argument from de_thread(). It's only ever used with current. Could be done in one of your patches or as a separate patch.
How does that affect the code generation?
The same way renaming "tsk" to "me" does.
My sense is that computing current once in flush_old_exec is much better than computing it in each function flush_old_exec calls. I remember that computing current used to be not expensive, but noticeable.
For clarity I can see renaming tsk to me. So that it is clear we are talking about the current process, and not some arbitrary process.
For clarity since de_thread() uses "tsk" giving the impression that any task can be dethreaded while it's only ever used with current. It's just a suggestion since you're doing the rename tsk->me anyway it would fit with the series. You do whatever you want though. (I just remember that the same request was made once to changes I did: Don't pass current as arg when it's the only task passed.)
Christian Brauner christian.brauner@ubuntu.com writes:
On Mon, Mar 09, 2020 at 03:06:46PM -0500, Eric W. Biederman wrote:
Christian Brauner christian.brauner@ubuntu.com writes:
On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
These functions have very little to do with de_thread, so move them out of de_thread and into flush_old_exec proper so it can be more clearly seen what flush_old_exec is doing.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c index ff74b9a74d34..215d86f77b63 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
While you're cleaning up de_thread() wouldn't it be good to also take the opportunity and remove the task argument from de_thread(). It's only ever used with current. Could be done in one of your patches or as a separate patch.
How does that affect the code generation?
The same way renaming "tsk" to "me" does.
My sense is that computing current once in flush_old_exec is much better than computing it in each function flush_old_exec calls. I remember that computing current used to be not expensive, but noticeable.
For clarity I can see renaming tsk to me. So that it is clear we are talking about the current process, and not some arbitrary process.
For clarity since de_thread() uses "tsk" giving the impression that any task can be dethreaded while it's only ever used with current. It's just a suggestion since you're doing the rename tsk->me anyway it would fit with the series. You do whatever you want though. (I just remember that the same request was made once to changes I did: Don't pass current as arg when it's the only task passed.)
That's fair.
And I completely agree that we should at least rename tsk to me. Just for clarity.
My apologies if I am a little short. My little son has been an extra handful lately.
Eric
On Mon, Mar 09, 2020 at 03:48:55PM -0500, Eric W. Biederman wrote:
Christian Brauner christian.brauner@ubuntu.com writes:
On Mon, Mar 09, 2020 at 03:06:46PM -0500, Eric W. Biederman wrote:
Christian Brauner christian.brauner@ubuntu.com writes:
On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
These functions have very little to do with de_thread, so move them out of de_thread and into flush_old_exec proper so it can be more clearly seen what flush_old_exec is doing.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c index ff74b9a74d34..215d86f77b63 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
While you're cleaning up de_thread() wouldn't it be good to also take the opportunity and remove the task argument from de_thread(). It's only ever used with current. Could be done in one of your patches or as a separate patch.
How does that affect the code generation?
The same way renaming "tsk" to "me" does.
My sense is that computing current once in flush_old_exec is much better than computing it in each function flush_old_exec calls. I remember that computing current used to be not expensive, but noticeable.
For clarity I can see renaming tsk to me. So that it is clear we are talking about the current process, and not some arbitrary process.
For clarity since de_thread() uses "tsk" giving the impression that any task can be dethreaded while it's only ever used with current. It's just a suggestion since you're doing the rename tsk->me anyway it would fit with the series. You do whatever you want though. (I just remember that the same request was made once to changes I did: Don't pass current as arg when it's the only task passed.)
That's fair.
And I completely agree that we should at least rename tsk to me. Just for clarity.
My apologies if I am a little short. My little son has been an extra handful lately.
No worries, stress is a thing most of us know too well.
Christian
During exec some file descriptors are closed and the files struct is unshared. But all of that can happen at other times and it has the same protections during exec as at ordinary times. So stop taking the cred_guard_mutex as it is useless.
Furthermore the cred_guard_mutex is a bad idea because it is deadlock prone, as it is held in several places while waiting possibly indefinitely for userspace to do something.
Cc: Sargun Dhillon sargun@sargun.me Cc: Christian Brauner christian.brauner@ubuntu.com Cc: Arnd Bergmann arnd@arndb.de Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall") Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com --- kernel/pid.c | 6 ------ 1 file changed, 6 deletions(-)
Christian if you don't have any objections I will take this one through my tree.
I tried to figure out why this code path takes the cred_guard_mutex and the archive on lore.kernel.org was not helpful in finding that part of the conversation.
diff --git a/kernel/pid.c b/kernel/pid.c
index 60820e72634c..53646d5616d2 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -577,17 +577,11 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd)
 	struct file *file;
 	int ret;
 
-	ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
-	if (ret)
-		return ERR_PTR(ret);
-
 	if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
 		file = fget_task(task, fd);
 	else
 		file = ERR_PTR(-EPERM);
 
-	mutex_unlock(&task->signal->cred_guard_mutex);
-
 	return file ?: ERR_PTR(-EBADF);
 }
On Tue, Mar 10, 2020 at 01:52:05PM -0500, Eric W. Biederman wrote:
During exec some file descriptors are closed and the files struct is unshared. But all of that can happen at other times and it has the same protections during exec as at ordinary times. So stop taking the cred_guard_mutex as it is useless.
Furthermore the cred_guard_mutex is a bad idea because it is deadlock prone, as it is held in several places while waiting possibly indefinitely for userspace to do something.
Cc: Sargun Dhillon sargun@sargun.me Cc: Christian Brauner christian.brauner@ubuntu.com Cc: Arnd Bergmann arnd@arndb.de Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall") Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
kernel/pid.c | 6 ------ 1 file changed, 6 deletions(-)
Christian if you don't have any objections I will take this one through my tree.
Sure. Acked-by: Christian Brauner christian.brauner@ubuntu.com
I tried to figure out why this code path takes the cred_guard_mutex and the archive on lore.kernel.org was not helpful in finding that part of the conversation.
Let me think a little harder and hopefully get back to you with a sensible explanation.
On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman ebiederm@xmission.com wrote:
During exec some file descriptors are closed and the files struct is unshared. But all of that can happen at other times and it has the same protections during exec as at ordinary times. So stop taking the cred_guard_mutex as it is useless.
Furthermore the cred_guard_mutex is a bad idea because it is deadlock prone, as it is held in several places while waiting possibly indefinitely for userspace to do something.
Please don't. Just use the new exec_update_mutex like everywhere else.
Cc: Sargun Dhillon sargun@sargun.me Cc: Christian Brauner christian.brauner@ubuntu.com Cc: Arnd Bergmann arnd@arndb.de Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall") Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
kernel/pid.c | 6 ------ 1 file changed, 6 deletions(-)
Christian if you don't have any objections I will take this one through my tree.
I tried to figure out why this code path takes the cred_guard_mutex and the archive on lore.kernel.org was not helpful in finding that part of the conversation.
That was my suggestion.
diff --git a/kernel/pid.c b/kernel/pid.c
index 60820e72634c..53646d5616d2 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -577,17 +577,11 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd)
 	struct file *file;
 	int ret;
 
-	ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
-	if (ret)
-		return ERR_PTR(ret);
-
 	if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
 		file = fget_task(task, fd);
 	else
 		file = ERR_PTR(-EPERM);
 
-	mutex_unlock(&task->signal->cred_guard_mutex);
-
 	return file ?: ERR_PTR(-EBADF);
 }
If you make this change, then if this races with execution of a setuid program that afterwards e.g. opens a unix domain socket, an attacker will be able to steal that socket and inject messages into communication with things like DBus. procfs currently has the same race, and that still needs to be fixed, but at least procfs doesn't let you open things like sockets because they don't have a working ->open handler, and it enforces the normal permission check for opening files.
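The window described above can be made concrete with a small userspace model. This is only a sketch under loud assumptions: `struct model_task`, `model_may_access()`, `model_exec_setuid()` and `model_fget()` are hypothetical stand-ins, not kernel interfaces, and the "lock" is a plain flag that makes the racing exec back off. It shows only the ordering problem: without a lock held across check and fetch, an exec that lands between the two hands out a descriptor the check never approved.

```c
#include <stdbool.h>

/* Hypothetical userspace model, not kernel code. */
enum model_file { MODEL_NONE, MODEL_PLAIN_FILE, MODEL_PRIV_SOCKET };

struct model_task {
	bool setuid_image;      /* true once a setuid binary has been exec'ed */
	enum model_file fd3;    /* what descriptor 3 currently refers to */
	bool exec_locked;       /* stands in for a mutex that exec must take */
};

/* Stand-in for the permission check: an unprivileged caller may only
 * inspect a task that is not running a setuid image. */
bool model_may_access(const struct model_task *t)
{
	return !t->setuid_image;
}

/* Stand-in for "exec a setuid program that then opens a socket".
 * Returns false when exec would have to wait for the lock. */
bool model_exec_setuid(struct model_task *t)
{
	if (t->exec_locked)
		return false;
	t->setuid_image = true;
	t->fd3 = MODEL_PRIV_SOCKET;
	return true;
}

/* Check, let the racing exec run in the window if it can, then fetch. */
enum model_file model_fget(struct model_task *t, bool take_lock)
{
	enum model_file file = MODEL_NONE;

	if (take_lock)
		t->exec_locked = true;

	if (model_may_access(t)) {
		model_exec_setuid(t);   /* the race window */
		file = t->fd3;
	}

	if (take_lock)
		t->exec_locked = false;
	return file;
}
```

Calling `model_fget(&task, false)` on a task whose fd 3 is `MODEL_PLAIN_FILE` returns `MODEL_PRIV_SOCKET`, while `model_fget(&task, true)` returns `MODEL_PLAIN_FILE` because the modelled exec cannot run inside the window.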
Jann Horn jannh@google.com writes:
On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman ebiederm@xmission.com wrote:
During exec some file descriptors are closed and the files struct is unshared. But all of that can happen at other times and it has the same protections during exec as at ordinary times. So stop taking the cred_guard_mutex as it is useless.
Furthermore the cred_guard_mutex is a bad idea because it is deadlock prone, as it is held in several places while waiting possibly indefinitely for userspace to do something.
Please don't. Just use the new exec_update_mutex like everywhere else.
Cc: Sargun Dhillon sargun@sargun.me Cc: Christian Brauner christian.brauner@ubuntu.com Cc: Arnd Bergmann arnd@arndb.de Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall") Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
kernel/pid.c | 6 ------ 1 file changed, 6 deletions(-)
Christian if you don't have any objections I will take this one through my tree.
I tried to figure out why this code path takes the cred_guard_mutex and the archive on lore.kernel.org was not helpful in finding that part of the conversation.
That was my suggestion.
diff --git a/kernel/pid.c b/kernel/pid.c
index 60820e72634c..53646d5616d2 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -577,17 +577,11 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd)
 	struct file *file;
 	int ret;
 
-	ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
-	if (ret)
-		return ERR_PTR(ret);
-
 	if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
 		file = fget_task(task, fd);
 	else
 		file = ERR_PTR(-EPERM);
 
-	mutex_unlock(&task->signal->cred_guard_mutex);
-
 	return file ?: ERR_PTR(-EBADF);
 }
If you make this change, then if this races with execution of a setuid program that afterwards e.g. opens a unix domain socket, an attacker will be able to steal that socket and inject messages into communication with things like DBus. procfs currently has the same race, and that still needs to be fixed, but at least procfs doesn't let you open things like sockets because they don't have a working ->open handler, and it enforces the normal permission check for opening files.
It isn't only exec that can change credentials. Do we need a lock for changing credentials?
Wouldn't it be sufficient to simply test ptrace_may_access after we get a copy of the file?
If we need a lock around credential change let's design and build that. Having a mismatch between what a lock is designed to do, and what people use it for can only result in other bugs as people get confused.
Eric
On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman ebiederm@xmission.com wrote:
Jann Horn jannh@google.com writes:
On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman ebiederm@xmission.com wrote:
During exec some file descriptors are closed and the files struct is unshared. But all of that can happen at other times and it has the same protections during exec as at ordinary times. So stop taking the cred_guard_mutex as it is useless.
Furthermore the cred_guard_mutex is a bad idea because it is deadlock prone, as it is held in several places while waiting possibly indefinitely for userspace to do something.
Please don't. Just use the new exec_update_mutex like everywhere else.
Cc: Sargun Dhillon sargun@sargun.me Cc: Christian Brauner christian.brauner@ubuntu.com Cc: Arnd Bergmann arnd@arndb.de Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall") Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
kernel/pid.c | 6 ------ 1 file changed, 6 deletions(-)
Christian if you don't have any objections I will take this one through my tree.
I tried to figure out why this code path takes the cred_guard_mutex and the archive on lore.kernel.org was not helpful in finding that part of the conversation.
That was my suggestion.
diff --git a/kernel/pid.c b/kernel/pid.c
index 60820e72634c..53646d5616d2 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -577,17 +577,11 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd)
 	struct file *file;
 	int ret;
 
-	ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
-	if (ret)
-		return ERR_PTR(ret);
-
 	if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
 		file = fget_task(task, fd);
 	else
 		file = ERR_PTR(-EPERM);
 
-	mutex_unlock(&task->signal->cred_guard_mutex);
-
 	return file ?: ERR_PTR(-EBADF);
 }
If you make this change, then if this races with execution of a setuid program that afterwards e.g. opens a unix domain socket, an attacker will be able to steal that socket and inject messages into communication with things like DBus. procfs currently has the same race, and that still needs to be fixed, but at least procfs doesn't let you open things like sockets because they don't have a working ->open handler, and it enforces the normal permission check for opening files.
It isn't only exec that can change credentials. Do we need a lock for changing credentials?
Hmm, I guess so? Normally, a task that's changing credentials becomes nondumpable at the same time (and there are explicit memory barriers in commit_creds() and __ptrace_may_access() to enforce the ordering for this); so you normally don't see tasks becoming ptrace-accessible via anything other than execve(). But I guess if someone opens a root-only file, closes it, drops privileges, and then explicitly does prctl(PR_SET_DUMPABLE, 1), we should probably protect that, too.
Wouldn't it be sufficient to simply test ptrace_may_access after we get a copy of the file?
There are also setuid helpers that can, after having done privileged stuff, drop privileges and call execve(); after that, ptrace_may_access() succeeds again. In particular, polkit has a helper that does this.
If we need a lock around credential change let's design and build that. Having a mismatch between what a lock is designed to do, and what people use it for can only result in other bugs as people get confused.
Hmm... what benefits do we get from making it a separate lock? I guess it would allow us to make it a per-task lock instead of a signal_struct-wide one? That might be helpful...
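For readers unfamiliar with the dumpable flag mentioned above, here is a minimal Linux-only userspace sketch of the `prctl(PR_SET_DUMPABLE, ...)` call. It does not reproduce any race; it only shows that a process can clear the flag and later set it back to 1 on its own, which is the step in the "opens a root-only file, closes it, drops privileges" scenario. The helper name `set_and_get_dumpable()` is an assumption made for illustration.

```c
#include <sys/prctl.h>

/*
 * Set the calling process's dumpable flag to `value` (0 or 1) and read
 * it back. Returns the value reported by PR_GET_DUMPABLE, or -1 if
 * either prctl() call fails.
 */
int set_and_get_dumpable(int value)
{
	if (prctl(PR_SET_DUMPABLE, value, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}
```

In an ordinary unprivileged process, `set_and_get_dumpable(0)` followed by `set_and_get_dumpable(1)` should report 0 and then 1.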
On Tue, Mar 10, 2020 at 9:00 PM Jann Horn jannh@google.com wrote:
On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman ebiederm@xmission.com wrote:
Jann Horn jannh@google.com writes:
On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman ebiederm@xmission.com wrote:
During exec some file descriptors are closed and the files struct is unshared. But all of that can happen at other times and it has the same protections during exec as at ordinary times. So stop taking the cred_guard_mutex as it is useless.
Furthermore the cred_guard_mutex is a bad idea because it is deadlock prone, as it is held in several places while waiting possibly indefinitely for userspace to do something.
[...]
If you make this change, then if this races with execution of a setuid program that afterwards e.g. opens a unix domain socket, an attacker will be able to steal that socket and inject messages into communication with things like DBus. procfs currently has the same race, and that still needs to be fixed, but at least procfs doesn't let you open things like sockets because they don't have a working ->open handler, and it enforces the normal permission check for opening files.
It isn't only exec that can change credentials. Do we need a lock for changing credentials?
[...]
If we need a lock around credential change let's design and build that. Having a mismatch between what a lock is designed to do, and what people use it for can only result in other bugs as people get confused.
Hmm... what benefits do we get from making it a separate lock? I guess it would allow us to make it a per-task lock instead of a signal_struct-wide one? That might be helpful...
But actually, isn't the core purpose of the cred_guard_mutex to guard against concurrent credential changes anyway? That's what almost everyone uses it for, and it's in the name...
On 3/10/20 9:10 PM, Jann Horn wrote:
On Tue, Mar 10, 2020 at 9:00 PM Jann Horn jannh@google.com wrote:
On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman ebiederm@xmission.com wrote:
Jann Horn jannh@google.com writes:
On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman ebiederm@xmission.com wrote:
During exec some file descriptors are closed and the files struct is unshared. But all of that can happen at other times and it has the same protections during exec as at ordinary times. So stop taking the cred_guard_mutex as it is useless.
Furthermore the cred_guard_mutex is a bad idea because it is deadlock prone, as it is held in several places while waiting possibly indefinitely for userspace to do something.
[...]
If you make this change, then if this races with execution of a setuid program that afterwards e.g. opens a unix domain socket, an attacker will be able to steal that socket and inject messages into communication with things like DBus. procfs currently has the same race, and that still needs to be fixed, but at least procfs doesn't let you open things like sockets because they don't have a working ->open handler, and it enforces the normal permission check for opening files.
It isn't only exec that can change credentials. Do we need a lock for changing credentials?
[...]
If we need a lock around credential change let's design and build that. Having a mismatch between what a lock is designed to do, and what people use it for can only result in other bugs as people get confused.
Hmm... what benefits do we get from making it a separate lock? I guess it would allow us to make it a per-task lock instead of a signal_struct-wide one? That might be helpful...
But actually, isn't the core purpose of the cred_guard_mutex to guard against concurrent credential changes anyway? That's what almost everyone uses it for, and it's in the name...
The main raison d'être of exec_update_mutex is to get a consistent view of task->mm and task credentials.
The reason why you want the cred_guard_mutex is that some action is changing the resulting credentials that the execve is about to install, and that is the data flow in the opposite direction.
Bernd.
On 3/10/20 9:22 PM, Bernd Edlinger wrote:
On 3/10/20 9:10 PM, Jann Horn wrote:
On Tue, Mar 10, 2020 at 9:00 PM Jann Horn jannh@google.com wrote:
On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman ebiederm@xmission.com wrote:
Jann Horn jannh@google.com writes:
On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman ebiederm@xmission.com wrote:
During exec some file descriptors are closed and the files struct is unshared. But all of that can happen at other times and it has the same protections during exec as at ordinary times. So stop taking the cred_guard_mutex as it is useless.
Furthermore the cred_guard_mutex is a bad idea because it is deadlock prone, as it is held in several places while waiting possibly indefinitely for userspace to do something.
[...]
If you make this change, then if this races with execution of a setuid program that afterwards e.g. opens a unix domain socket, an attacker will be able to steal that socket and inject messages into communication with things like DBus. procfs currently has the same race, and that still needs to be fixed, but at least procfs doesn't let you open things like sockets because they don't have a working ->open handler, and it enforces the normal permission check for opening files.
It isn't only exec that can change credentials. Do we need a lock for changing credentials?
[...]
If we need a lock around credential change let's design and build that. Having a mismatch between what a lock is designed to do, and what people use it for can only result in other bugs as people get confused.
Hmm... what benefits do we get from making it a separate lock? I guess it would allow us to make it a per-task lock instead of a signal_struct-wide one? That might be helpful...
But actually, isn't the core purpose of the cred_guard_mutex to guard against concurrent credential changes anyway? That's what almost everyone uses it for, and it's in the name...
The main raison d'être of exec_update_mutex is to get a consistent view of task->mm and task credentials.
The reason why you want the cred_guard_mutex is that some action is changing the resulting credentials that the execve is about to install, and that is the data flow in the opposite direction.
So in other words, you need the exec_update_mutex when you access another thread's credentials and possibly the mmap at the same time.
You need the cred_guard_mutex when you *change* the credentials of another thread. (Where you cannot be sure that the other thread just started to execve something)
You need no mutex at all when you are just accessing or even changing the credentials of the current thread. (If another thread is doing execve, your task will be killed, and whether or not the credentials were changed does not matter any more)
Bernd.
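The split between the two locks can be sketched as a toy userspace model. Everything here is an assumption made for illustration: `struct two_lock_model` and the `tl_*` helpers are hypothetical, the locks are C11 `atomic_flag` try-locks rather than kernel mutexes, and the "generation" counters merely stand in for task->mm and the credentials. The point is only that a reader which takes the short-held lock is not stuck while the long-held one is owned by an exec still waiting for something.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: names mirror the thread, behaviour is simplified. */
struct two_lock_model {
	atomic_flag cred_guard;   /* held for the whole exec, including waits */
	atomic_flag exec_update;  /* held only while mm and creds are swapped */
	int mm_generation;        /* stands in for task->mm */
	int cred_generation;      /* stands in for the task credentials */
};

bool tl_trylock(atomic_flag *lock) { return !atomic_flag_test_and_set(lock); }
void tl_unlock(atomic_flag *lock) { atomic_flag_clear(lock); }

void tl_init(struct two_lock_model *m)
{
	atomic_flag_clear(&m->cred_guard);
	atomic_flag_clear(&m->exec_update);
	m->mm_generation = 0;
	m->cred_generation = 0;
}

/* Exec starts: take the long-held lock, then (conceptually) wait. */
bool tl_exec_begin(struct two_lock_model *m)
{
	return tl_trylock(&m->cred_guard);
}

/* Exec finishes: swap mm and creds together under the short lock. */
bool tl_exec_finish(struct two_lock_model *m)
{
	if (!tl_trylock(&m->exec_update))
		return false;
	m->mm_generation++;
	m->cred_generation++;
	tl_unlock(&m->exec_update);
	tl_unlock(&m->cred_guard);
	return true;
}

/* Reader: returns false where a real mutex_lock() would have blocked. */
bool tl_snapshot(struct two_lock_model *m, bool use_cred_guard,
		 int *mm, int *cred)
{
	atomic_flag *lock = use_cred_guard ? &m->cred_guard : &m->exec_update;

	if (!tl_trylock(lock))
		return false;
	*mm = m->mm_generation;
	*cred = m->cred_generation;
	tl_unlock(lock);
	return true;
}
```

After `tl_exec_begin()`, a `tl_snapshot()` that insists on `cred_guard` fails (it would block in the real kernel), while one using `exec_update` succeeds and sees mm and creds from the same generation.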
On Wed, Mar 11, 2020 at 7:12 AM Bernd Edlinger bernd.edlinger@hotmail.de wrote:
On 3/10/20 9:22 PM, Bernd Edlinger wrote:
On 3/10/20 9:10 PM, Jann Horn wrote:
On Tue, Mar 10, 2020 at 9:00 PM Jann Horn jannh@google.com wrote:
On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman ebiederm@xmission.com wrote:
Jann Horn jannh@google.com writes:
On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman ebiederm@xmission.com wrote:
During exec some file descriptors are closed and the files struct is unshared. But all of that can happen at other times and it has the same protections during exec as at ordinary times. So stop taking the cred_guard_mutex as it is useless.
Furthermore the cred_guard_mutex is a bad idea because it is deadlock prone, as it is held in several places while waiting possibly indefinitely for userspace to do something.
[...]
If you make this change, then if this races with execution of a setuid program that afterwards e.g. opens a unix domain socket, an attacker will be able to steal that socket and inject messages into communication with things like DBus. procfs currently has the same race, and that still needs to be fixed, but at least procfs doesn't let you open things like sockets because they don't have a working ->open handler, and it enforces the normal permission check for opening files.
It isn't only exec that can change credentials. Do we need a lock for changing credentials?
[...]
If we need a lock around credential change let's design and build that. Having a mismatch between what a lock is designed to do, and what people use it for can only result in other bugs as people get confused.
Hmm... what benefits do we get from making it a separate lock? I guess it would allow us to make it a per-task lock instead of a signal_struct-wide one? That might be helpful...
But actually, isn't the core purpose of the cred_guard_mutex to guard against concurrent credential changes anyway? That's what almost everyone uses it for, and it's in the name...
The main raison d'être of exec_update_mutex is to get a consistent view of task->mm and task credentials.
The reason why you want the cred_guard_mutex is that some action is changing the resulting credentials that the execve is about to install, and that is the data flow in the opposite direction.
So in other words, you need the exec_update_mutex when you access another thread's credentials and possibly the mmap at the same time.
Or the file descriptor table, or register state, ...
You need no mutex at all when you are just accessing or even changing the credentials of the current thread. (If another thread is doing execve, your task will be killed, and whether or not the credentials were changed does not matter any more)
Only if the only access checks you care about are those related to mm access.
Jann Horn jannh@google.com writes:
On Tue, Mar 10, 2020 at 9:00 PM Jann Horn jannh@google.com wrote:
On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman ebiederm@xmission.com wrote:
Jann Horn jannh@google.com writes:
On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman ebiederm@xmission.com wrote:
During exec some file descriptors are closed and the files struct is unshared. But all of that can happen at other times and it has the same protections during exec as at ordinary times. So stop taking the cred_guard_mutex as it is useless.
Furthermore the cred_guard_mutex is a bad idea because it is deadlock prone, as it is held in several places while waiting possibly indefinitely for userspace to do something.
[...]
If you make this change, then if this races with execution of a setuid program that afterwards e.g. opens a unix domain socket, an attacker will be able to steal that socket and inject messages into communication with things like DBus. procfs currently has the same race, and that still needs to be fixed, but at least procfs doesn't let you open things like sockets because they don't have a working ->open handler, and it enforces the normal permission check for opening files.
It isn't only exec that can change credentials. Do we need a lock for changing credentials?
[...]
If we need a lock around credential change let's design and build that. Having a mismatch between what a lock is designed to do, and what people use it for can only result in other bugs as people get confused.
Hmm... what benefits do we get from making it a separate lock? I guess it would allow us to make it a per-task lock instead of a signal_struct-wide one? That might be helpful...
But actually, isn't the core purpose of the cred_guard_mutex to guard against concurrent credential changes anyway? That's what almost everyone uses it for, and it's in the name...
Having been through all of the users: nope.
Maybe someone tried to repurpose it for that. I haven't traced through when it was renamed from cred_exec_mutex to cred_guard_mutex.
The original purpose was to make exec and ptrace deadlock. But it was seen as being there to allow safely calculating the new credentials before the point of no return, because whether or not a process is ptraced affects the new credential calculations. Unfortunately offering that guarantee fundamentally leads to deadlock.
So ptrace_attach and seccomp use the cred_guard_mutex to guarantee a deadlock.
The common use is to take cred_guard_mutex to guard the window when credentials and process details are out of sync in exec. But there is at least do_io_accounting that seems to have the same justification for holding it as __pidfd_fget.
With effort I suspect we can replace exec_change_mutex with task_lock. When we are guaranteed to be single threaded placing exec_change_mutex in signal_struct doesn't really help us (except maybe in some races?).
The deep problem is no one really understands cred_guard_mutex so it is a mess. Code with poorly defined semantics is always wrong somewhere for someone. Which is part of why I am attacking this and having the conversations to make certain I understand what is going on.
I see your point about commit_creds making a process undumpable. So in practice it really is only exec that changes creds in a way that ptrace_may_access will allow the process to be inspected.
So I guess for now the practical non-regressing course is to change everything to my exec_change_mutex, removing the deadlock. Then we figure out how to cleanly deal with the races inspecting a process with changing credentials has.
Eric
On Tue, Mar 10, 2020 at 03:57:35PM -0500, Eric W. Biederman wrote:
Jann Horn jannh@google.com writes:
On Tue, Mar 10, 2020 at 9:00 PM Jann Horn jannh@google.com wrote:
On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman ebiederm@xmission.com wrote:
Jann Horn jannh@google.com writes:
On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman ebiederm@xmission.com wrote:
During exec some file descriptors are closed and the files struct is unshared. But all of that can happen at other times and it has the same protections during exec as at ordinary times. So stop taking the cred_guard_mutex as it is useless.
Furthermore the cred_guard_mutex is a bad idea because it is deadlock prone, as it is held in several places while waiting possibly indefinitely for userspace to do something.
[...]
If you make this change, then if this races with execution of a setuid program that afterwards e.g. opens a unix domain socket, an attacker will be able to steal that socket and inject messages into communication with things like DBus. procfs currently has the same race, and that still needs to be fixed, but at least procfs doesn't let you open things like sockets because they don't have a working ->open handler, and it enforces the normal permission check for opening files.
It isn't only exec that can change credentials. Do we need a lock for changing credentials?
[...]
If we need a lock around credential change let's design and build that. Having a mismatch between what a lock is designed to do, and what people use it for can only result in other bugs as people get confused.
Hmm... what benefits do we get from making it a separate lock? I guess it would allow us to make it a per-task lock instead of a signal_struct-wide one? That might be helpful...
But actually, isn't the core purpose of the cred_guard_mutex to guard against concurrent credential changes anyway? That's what almost everyone uses it for, and it's in the name...
Having been through all of the users nope.
Maybe someone tried to repurpose it for that. I haven't traced through the history to when it was renamed from cred_exec_mutex to cred_guard_mutex.
The original purpose was to make exec and ptrace deadlock. But it was seen as being there to allow safely calculating the new credentials before the point of no return, because whether or not a process is ptraced affects the new credential calculations. Unfortunately offering that guarantee fundamentally leads to deadlock.
So ptrace_attach and seccomp use the cred_guard_mutex to guarantee a deadlock.
The common use is to take cred_guard_mutex to guard the window when credentials and process details are out of sync in exec. But there is at least do_io_accounting that seems to have the same justification for holding __pidfd_fget.
With effort I suspect we can replace exec_change_mutex with task_lock. When we are guaranteed to be single threaded placing exec_change_mutex in signal_struct doesn't really help us (except maybe in some races?).
The deep problem is no one really understands cred_guard_mutex so it is a mess. Code with poorly defined semantics is always wrong somewhere
This is a good point. When discussing patches sensitive to credential changes, cred_guard_mutex was always introduced as having the purpose of guarding against concurrent credential changes. And I'm pretty sure that that's how most people have been using it for quite a long time. I mean, it's at least the case for seccomp and proc and probably quite a few more. So the problem seems to me to be that it has clear _intended_ semantics that run into issues in all sorts of cases. So if cred_guard_mutex is not that, then we seem to need to provide something that serves its intended purpose.
On Tue, Mar 10, 2020 at 03:57:35PM -0500, Eric W. Biederman wrote:
So ptrace_attach and seccomp use the cred_guard_mutex to guarantee a deadlock.
Well, that's the result, but seccomp uses it because it wants to be certain that credentials and no_new_privs are changed together "atomically".
This changes __pidfd_fget to use the new exec_update_mutex instead of cred_guard_mutex.
This should be safe, as the credentials do not change before exec_update_mutex is locked. Therefore whatever file access is possible while holding the cred_guard_mutex here is also possible with the exec_update_mutex.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
---
 kernel/pid.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
This replaces Eric's "[PATCH] pidfd: Stop taking cred_guard_mutex"
diff --git a/kernel/pid.c b/kernel/pid.c
index 0f4ecb5..04821f4 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -584,7 +584,7 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd)
 	struct file *file;
 	int ret;
 
-	ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
+	ret = mutex_lock_killable(&task->signal->exec_update_mutex);
 	if (ret)
 		return ERR_PTR(ret);
 
@@ -593,7 +593,7 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd)
 	else
 		file = ERR_PTR(-EPERM);
 
-	mutex_unlock(&task->signal->cred_guard_mutex);
+	mutex_unlock(&task->signal->exec_update_mutex);
 
 	return file ?: ERR_PTR(-EBADF);
 }
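For readers following along, the control flow that these two hunks touch can be modelled in plain userspace C. This is only a sketch under stated assumptions: struct model_task, model_pidfd_fget() and the integer "file tokens" are hypothetical, the mutex is reduced to a flag, the permission check that sits between the two hunks is reduced to a boolean, and an already-held lock is reported as -EINTR (the way a killable lock fails when it is interrupted) rather than sleeping.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the parts of a task that the hunks touch. */
struct model_task {
	bool exec_update_locked;	/* toy flag, not a real mutex */
	bool may_access;		/* outcome of the permission check */
	int files[4];			/* 0 means "slot is empty" */
};

/* Returns a positive file token, or a negative errno-style code. */
static int model_pidfd_fget(struct model_task *task, unsigned int fd)
{
	int file;

	/* Model of a killable lock that failed: the real code would sleep. */
	if (task->exec_update_locked)
		return -EINTR;
	task->exec_update_locked = true;

	/* Permission check and lookup both happen under the lock. */
	if (task->may_access)
		file = fd < 4 ? task->files[fd] : 0;
	else
		file = -EPERM;

	assert(task->exec_update_locked);
	task->exec_update_locked = false;

	/* Same shape as "file ?: ERR_PTR(-EBADF)" in the patch. */
	return file ? file : -EBADF;
}
```

The property this is meant to show is only that the permission check and the lookup sit inside one locked region, so swapping which mutex guards that region does not change the error paths.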
On Mon, Mar 09, 2020 at 03:48:55PM -0500, Eric W. Biederman wrote:
And I completely agree that we should at least rename tsk to me. Just for clarity.
I think it wouldn't hurt to add comments to spell it out explicitly in each of the tsk->me functions, something like:
/*
 * The "me" task_struct argument here must only ever refer to "current",
 * but it gets passed in to avoid re-calculating "current" in each helper.
 */
I've found that the exec code in its entirety would be better off with more comments. :) Usually that's the bulk of what I find myself adding when I make changes in this area. ;)
-Kees
On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
These functions have very little to do with de_thread. Move them out of de_thread and into flush_old_exec proper so it can be more clearly seen what flush_old_exec is doing.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index ff74b9a74d34..215d86f77b63 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
 	/* we have changed execution domain */
 	tsk->exit_signal = SIGCHLD;
 
-#ifdef CONFIG_POSIX_TIMERS
-	exit_itimers(sig);
-	flush_itimer_signals();
-#endif
-
 	BUG_ON(!thread_group_leader(tsk));
 	return 0;
 
@@ -1277,6 +1272,11 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+#ifdef CONFIG_POSIX_TIMERS
+	exit_itimers(me->signal);
+	flush_itimer_signals();
+#endif
I twitch at seeing #ifdefs in .c instead of hidden in the .h declarations of these two functions, but as this is a copy/paste, I'll live. ;)
Reviewed-by: Kees Cook keescook@chromium.org
-Kees
 	/*
 	 * Make the signal table private.
 	 */
-- 
2.25.0
On Sun, Mar 8, 2020 at 10:39 PM Eric W. Biederman ebiederm@xmission.com wrote:
These functions have very little to do with de_thread. Move them out of de_thread and into flush_old_exec proper so it can be more clearly seen what flush_old_exec is doing.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index ff74b9a74d34..215d86f77b63 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
 	/* we have changed execution domain */
 	tsk->exit_signal = SIGCHLD;
 
-#ifdef CONFIG_POSIX_TIMERS
-	exit_itimers(sig);
-	flush_itimer_signals();
-#endif
-
 	BUG_ON(!thread_group_leader(tsk));
 	return 0;
 
@@ -1277,6 +1272,11 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+#ifdef CONFIG_POSIX_TIMERS
+	exit_itimers(me->signal);
+	flush_itimer_signals();
+#endif
nit: exit_itimers() has a comment referring to de_thread, that should probably be updated
Jann Horn jannh@google.com writes:
On Sun, Mar 8, 2020 at 10:39 PM Eric W. Biederman ebiederm@xmission.com wrote:
These functions have very little to do with de_thread. Move them out of de_thread and into flush_old_exec proper so it can be more clearly seen what flush_old_exec is doing.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index ff74b9a74d34..215d86f77b63 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
 	/* we have changed execution domain */
 	tsk->exit_signal = SIGCHLD;
 
-#ifdef CONFIG_POSIX_TIMERS
-	exit_itimers(sig);
-	flush_itimer_signals();
-#endif
-
 	BUG_ON(!thread_group_leader(tsk));
 	return 0;
 
@@ -1277,6 +1272,11 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+#ifdef CONFIG_POSIX_TIMERS
+	exit_itimers(me->signal);
+	flush_itimer_signals();
+#endif
nit: exit_itimers() has a comment referring to de_thread, that should probably be updated
Good point.
Eric
On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
These functions have very little to do with de_thread. Move them out of de_thread and into flush_old_exec proper so it can be more clearly seen what flush_old_exec is doing.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
Acked-by: Christian Brauner christian.brauner@ubuntu.com
I have read through the code in exec_mmap and I do not see anything that depends on sighand or the sighand lock, or on signals in anyway so this should be safe.
This rearrangement of code has two siginficant benefits. It makes the determination of passing the point of no return by testing bprm->mm accurate. All failures prior to that point in flush_old_exec are either truly recoverable or they are fatal.
Futher this consolidates all of the possible indefinite waits for userspace together at the top of flush_old_exec. The possible wait for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault to be resolved in clear_child_tid, and the possible wait for a page fault in exit_robust_list.
This consolidation allows the creation of a mutex to replace cred_guard_mutex that is not held of possible indefinite userspace waits. Which will allow removing deadlock scenarios from the kernel.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
---
 fs/exec.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 215d86f77b63..d820a7272a76 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1272,18 +1272,6 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
-#ifdef CONFIG_POSIX_TIMERS
-	exit_itimers(me->signal);
-	flush_itimer_signals();
-#endif
-
-	/*
-	 * Make the signal table private.
-	 */
-	retval = unshare_sighand(me);
-	if (retval)
-		goto out;
-
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1307,6 +1295,18 @@ int flush_old_exec(struct linux_binprm * bprm)
 	 */
 	bprm->mm = NULL;
 
+#ifdef CONFIG_POSIX_TIMERS
+	exit_itimers(me->signal);
+	flush_itimer_signals();
+#endif
+
+	/*
+	 * Make the signal table private.
+	 */
+	retval = unshare_sighand(me);
+	if (retval)
+		goto out;
+
 	set_fs(USER_DS);
 	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
 					PF_NOFREEZE | PF_NO_SETAFFINITY);
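The "point of no return" argument in the description above can be illustrated with a small userspace model. It is a sketch only: struct model_bprm, enum model_fail and both functions are hypothetical, and the error codes are picked for illustration rather than taken from this patch. The idea it tries to capture is that, once the steps are ordered this way, a caller can tell whether a failure is still recoverable just by looking at whether the mm pointer has been cleared.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the one field of linux_binprm that matters here. */
struct model_bprm {
	void *mm;	/* non-NULL until the new mm has been installed */
};

/* Which step of the model should fail (chosen by the caller). */
enum model_fail {
	FAIL_NONE,
	FAIL_DE_THREAD,
	FAIL_EXEC_MMAP,
	FAIL_UNSHARE_SIGHAND,
};

static int model_flush_old_exec(struct model_bprm *bprm, enum model_fail fail)
{
	assert(bprm != NULL);

	if (fail == FAIL_DE_THREAD)
		return -EAGAIN;		/* old image untouched: recoverable */
	if (fail == FAIL_EXEC_MMAP)
		return -EINTR;		/* old mm still in place: recoverable */

	bprm->mm = NULL;		/* new mm installed: no way back */

	if (fail == FAIL_UNSHARE_SIGHAND)
		return -ENOMEM;		/* fails after the point of no return */
	return 0;
}

/* Caller-side test: has the old process image already been replaced? */
static int model_past_point_of_no_return(const struct model_bprm *bprm)
{
	return bprm->mm == NULL;
}
```

In this model every failure before the pointer is cleared leaves the old image intact, and every failure after it has to be treated as fatal, which is the two-way split the commit message describes.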
On 3/8/20 10:38 PM, Eric W. Biederman wrote:
I have read through the code in exec_mmap and I do not see anything that depends on sighand or the sighand lock, or on signals in anyway so this should be safe.
This rearrangement of code has two siginficant benefits. It makes
^ typo: significant
the determination of passing the point of no return by testing bprm->mm accurate. All failures prior to that point in flush_old_exec are either truly recoverable or they are fatal.
Futher this consolidates all of the possible indefinite waits for
^ typo: Further
userspace together at the top of flush_old_exec. The possible wait for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault to be resolved in clear_child_tid, and the possible wait for a page fault in exit_robust_list.
This consolidation allows the creation of a mutex to replace cred_guard_mutex that is not held of possible indefinite userspace
can you also reword this "held of" thing here as well?
Thanks Bernd.
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/8/20 10:38 PM, Eric W. Biederman wrote:
This consolidation allows the creation of a mutex to replace cred_guard_mutex that is not held of possible indefinite userspace
can you also reword this "held of" thing here as well?
Done:
exec: Move exec_mmap right after de_thread in flush_old_exec
I have read through the code in exec_mmap and I do not see anything that depends on sighand or the sighand lock, or on signals in anyway so this should be safe.
This rearrangement of code has two siginficant benefits. It makes the determination of passing the point of no return by testing bprm->mm accurate. All failures prior to that point in flush_old_exec are either truly recoverable or they are fatal.
Futher this consolidates all of the possible indefinite waits for userspace together at the top of flush_old_exec. The possible wait for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault to be resolved in clear_child_tid, and the possible wait for a page fault in exit_robust_list.
This consolidation allows the creation of a mutex to replace cred_guard_mutex that is not held over possible indefinite userspace waits. Which will allow removing deadlock scenarios from the kernel.
Reviewed-by: Bernd Edlinger bernd.edlinger@hotmail.de Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
Eric
On 3/9/20 8:45 PM, Eric W. Biederman wrote:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/8/20 10:38 PM, Eric W. Biederman wrote:
This consolidation allows the creation of a mutex to replace cred_guard_mutex that is not held of possible indefinite userspace
can you also reword this "held of" thing here as well?
Done:
exec: Move exec_mmap right after de_thread in flush_old_exec
I have read through the code in exec_mmap and I do not see anything that depends on sighand or the sighand lock, or on signals in anyway so this should be safe. This rearrangement of code has two siginficant benefits. It makes
watch out: sig_i_nificant
the determination of passing the point of no return by testing bprm->mm accurate. All failures prior to that point in flush_old_exec are either truly recoverable or they are fatal.
Futher this consolidates all of the possible indefinite waits for
Add some r to "Futher", please?
userspace together at the top of flush_old_exec. The possible wait for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault to be resolved in clear_child_tid, and the possible wait for a page fault in exit_robust_list.
This consolidation allows the creation of a mutex to replace cred_guard_mutex that is not held over possible indefinite userspace waits. Which will allow removing deadlock scenarios from the kernel. Reviewed-by: Bernd Edlinger bernd.edlinger@hotmail.de Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
Eric
Ok. I think this has it sorted:
exec: Move exec_mmap right after de_thread in flush_old_exec
I have read through the code in exec_mmap and I do not see anything that depends on sighand or the sighand lock, or on signals in anyway so this should be safe.
This rearrangement of code has two significant benefits. It makes the determination of passing the point of no return by testing bprm->mm accurate. All failures prior to that point in flush_old_exec are either truly recoverable or they are fatal.
Further this consolidates all of the possible indefinite waits for userspace together at the top of flush_old_exec. The possible wait for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault to be resolved in clear_child_tid, and the possible wait for a page fault in exit_robust_list.
This consolidation allows the creation of a mutex to replace cred_guard_mutex that is not held over possible indefinite userspace waits. Which will allow removing deadlock scenarios from the kernel.
Reviewed-by: Bernd Edlinger bernd.edlinger@hotmail.de Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
I don't think I usually have this many typos. Sigh.
Eric
On 3/9/20 8:58 PM, Eric W. Biederman wrote:
Ok. I think this has it sorted:
exec: Move exec_mmap right after de_thread in flush_old_exec
I have read through the code in exec_mmap and I do not see anything that depends on sighand or the sighand lock, or on signals in anyway so this should be safe. This rearrangement of code has two significant benefits. It makes the determination of passing the point of no return by testing bprm->mm accurate. All failures prior to that point in flush_old_exec are either truly recoverable or they are fatal. Further this consolidates all of the possible indefinite waits for userspace together at the top of flush_old_exec. The possible wait for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault to be resolved in clear_child_tid, and the possible wait for a page fault in exit_robust_list. This consolidation allows the creation of a mutex to replace cred_guard_mutex that is not held over possible indefinite userspace waits. Which will allow removing deadlock scenarios from the kernel. Reviewed-by: Bernd Edlinger bernd.edlinger@hotmail.de Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
I don't think I usually have this many typos. Sigh.
OK.
never mind, Bernd.
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/9/20 8:58 PM, Eric W. Biederman wrote:
Ok. I think this has it sorted:
exec: Move exec_mmap right after de_thread in flush_old_exec
I have read through the code in exec_mmap and I do not see anything that depends on sighand or the sighand lock, or on signals in anyway so this should be safe. This rearrangement of code has two significant benefits. It makes the determination of passing the point of no return by testing bprm->mm accurate. All failures prior to that point in flush_old_exec are either truly recoverable or they are fatal. Further this consolidates all of the possible indefinite waits for userspace together at the top of flush_old_exec. The possible wait for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault to be resolved in clear_child_tid, and the possible wait for a page fault in exit_robust_list. This consolidation allows the creation of a mutex to replace cred_guard_mutex that is not held over possible indefinite userspace waits. Which will allow removing deadlock scenarios from the kernel. Reviewed-by: Bernd Edlinger bernd.edlinger@hotmail.de Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
I don't think I usually have this many typos. Sigh.
OK.
never mind,
No no. I really appreciate all of the scrutiny. Frequently the issues that will produce typos or poor patch descriptions are also the issues that will produce sloppy patches as well. I was just frustrated with myself.
Eric
On Sun, Mar 08, 2020 at 04:38:00PM -0500, Eric W. Biederman wrote:
I have read through the code in exec_mmap and I do not see anything that depends on sighand or the sighand lock, or on signals in anyway so this should be safe.
This rearrangement of code has two siginficant benefits. It makes the determination of passing the point of no return by testing bprm->mm accurate. All failures prior to that point in flush_old_exec are either truly recoverable or they are fatal.
Agreed. Though I see a use of "current", which maybe you want to parameterize to a "me" argument in acct_arg_size(). (Though looking at the callers, perhaps there is no benefit?)
Futher this consolidates all of the possible indefinite waits for userspace together at the top of flush_old_exec. The possible wait for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault to be resolved in clear_child_tid, and the possible wait for a page fault in exit_robust_list.
This consolidation allows the creation of a mutex to replace cred_guard_mutex that is not held of possible indefinite userspace waits. Which will allow removing deadlock scenarios from the kernel.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 215d86f77b63..d820a7272a76 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1272,18 +1272,6 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
-#ifdef CONFIG_POSIX_TIMERS
-	exit_itimers(me->signal);
-	flush_itimer_signals();
-#endif
I think this comment:
/*
 * This is called by do_exit or de_thread, only when there are no more
 * references to the shared signal_struct.
 */
void exit_itimers(struct signal_struct *sig)
Refers to there being other threads, yes? Not that the signal table is private yet?
-	/*
-	 * Make the signal table private.
-	 */
-	retval = unshare_sighand(me);
-	if (retval)
-		goto out;
-
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1307,6 +1295,18 @@ int flush_old_exec(struct linux_binprm * bprm)
 	 */
 	bprm->mm = NULL;
 
+#ifdef CONFIG_POSIX_TIMERS
+	exit_itimers(me->signal);
+	flush_itimer_signals();
+#endif
I've mostly convinced myself that there are no "side-effects" from having these timers expire as the mm is going away. I think some kind of comment of that intent should be explicitly stated here above the timer work.
Beyond that:
Reviewed-by: Kees Cook keescook@chromium.org
-Kees
+
+	/*
+	 * Make the signal table private.
+	 */
+	retval = unshare_sighand(me);
+	if (retval)
+		goto out;
+
 	set_fs(USER_DS);
 	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
 					PF_NOFREEZE | PF_NO_SETAFFINITY);
-- 2.25.0
Kees Cook keescook@chromium.org writes:
On Sun, Mar 08, 2020 at 04:38:00PM -0500, Eric W. Biederman wrote:
I have read through the code in exec_mmap and I do not see anything that depends on sighand or the sighand lock, or on signals in anyway so this should be safe.
This rearrangement of code has two siginficant benefits. It makes the determination of passing the point of no return by testing bprm->mm accurate. All failures prior to that point in flush_old_exec are either truly recoverable or they are fatal.
Agreed. Though I see a use of "current", which maybe you want to parameterize to a "me" argument in acct_arg_size(). (Though looking at the callers, perhaps there is no benefit?)
My testing suggests there is a small benefit on x86.
The code is just "#define current get_current()" and get_current() resolves into a read of "%gs:current_task".
Looking at the generated code, I find that gcc can sometimes optimize the read away when the reads are close together in the source code. But gcc does not always manage to optimize the extra read of "%gs:current_task" away.
So I think things are much much better than they used to be, code generation wise. But it still helps to cache current in a local variable.
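A minimal userspace sketch of the "look it up once and pass it down as me" convention follows. All names here are hypothetical (model_get_current() merely counts how often it is called, it is not the kernel's get_current()), and the flag mask and signal number are arbitrary values.

```c
#include <assert.h>

/* Hypothetical stand-in for task_struct. */
struct model_task {
	unsigned long flags;
	int exit_signal;
};

static struct model_task model_the_task;
static unsigned int model_lookups;	/* number of "current" lookups so far */

/* Stand-in for get_current(); counts calls so the saving is visible. */
static struct model_task *model_get_current(void)
{
	model_lookups++;
	return &model_the_task;
}

/* "me" must only ever refer to the current task. */
static void model_clear_flags(struct model_task *me, unsigned long mask)
{
	assert(me == &model_the_task);
	me->flags &= ~mask;
}

static void model_set_exit_signal(struct model_task *me, int sig)
{
	assert(me == &model_the_task);
	me->exit_signal = sig;
}

/* Look the task up once, then hand the cached pointer to each helper. */
static unsigned int model_flush(void)
{
	struct model_task *me = model_get_current();

	model_clear_flags(me, 0x3UL);
	model_set_exit_signal(me, 17);
	return model_lookups;
}
```

With the pointer cached in a local, the two helpers add no further lookups; had each helper called model_get_current() itself, the counter would reach three instead of one.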
Futher this consolidates all of the possible indefinite waits for userspace together at the top of flush_old_exec. The possible wait for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault to be resolved in clear_child_tid, and the possible wait for a page fault in exit_robust_list.
This consolidation allows the creation of a mutex to replace cred_guard_mutex that is not held of possible indefinite userspace waits. Which will allow removing deadlock scenarios from the kernel.
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
fs/exec.c | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 215d86f77b63..d820a7272a76 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1272,18 +1272,6 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
-#ifdef CONFIG_POSIX_TIMERS
-	exit_itimers(me->signal);
-	flush_itimer_signals();
-#endif
I think this comment:
/*
 * This is called by do_exit or de_thread, only when there are no more
 * references to the shared signal_struct.
 */
void exit_itimers(struct signal_struct *sig)
Refers to there being other threads, yes? Not that the signal table is private yet?
The signal table is in sighand_struct.
So yes that refers to the other threads being gone.
-	/*
-	 * Make the signal table private.
-	 */
-	retval = unshare_sighand(me);
-	if (retval)
-		goto out;
-
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1307,6 +1295,18 @@ int flush_old_exec(struct linux_binprm * bprm)
 	 */
 	bprm->mm = NULL;
 
+#ifdef CONFIG_POSIX_TIMERS
+	exit_itimers(me->signal);
+	flush_itimer_signals();
+#endif
I've mostly convinced myself that there are no "side-effects" from having these timers expire as the mm is going away. I think some kind of comment of that intent should be explicitly stated here above the timer work.
The timers can at most generate signals. And we are not handling signals in the middle of exec.
So the only possible interaction would be to set a timeout and then try exec, and have the timer kill the caller.
Maybe we get a killable signal from a scenario like that and maybe this changes the time before the timer expires into the dangerous zone. But that is all I can think of.
We have to return to the edge of userspace before any signals are delivered.
Beyond that:
Reviewed-by: Kees Cook keescook@chromium.org
-Kees
+
+	/*
+	 * Make the signal table private.
+	 */
+	retval = unshare_sighand(me);
+	if (retval)
+		goto out;
+
 	set_fs(USER_DS);
 	me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
 					PF_NOFREEZE | PF_NO_SETAFFINITY);
-- 2.25.0
Eric
On Sun, Mar 08, 2020 at 04:38:00PM -0500, Eric W. Biederman wrote:
Futher this consolidates all of the possible indefinite waits for userspace together at the top of flush_old_exec. The possible wait for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault to be resolved in clear_child_tid, and the possible wait for a page fault in exit_robust_list.
I forgot to mention, just as a point of clarity, there are lots of other page faults possible, but they're _before_ flush_old_exec() (i.e. all the copy_strings() calls). Is it worth clarifying this to "before or at the top of flush_old_exec()" or do you mean something else? (And as always: perhaps expand flush_old_exec()'s comment to describe the newly intended state.)
Kees Cook keescook@chromium.org writes:
On Sun, Mar 08, 2020 at 04:38:00PM -0500, Eric W. Biederman wrote:
Futher this consolidates all of the possible indefinite waits for userspace together at the top of flush_old_exec. The possible wait for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault to be resolved in clear_child_tid, and the possible wait for a page fault in exit_robust_list.
I forgot to mention, just as a point of clarity, there are lots of other page faults possible, but they're _before_ flush_old_exec() (i.e. all the copy_strings() calls). Is it worth clarifying this to "before or at the top of flush_old_exec()" or do you mean something else? (And as always: perhaps expand flush_old_exec()'s comment to describe the newly intended state.)
Yes. Before or at the start of flush_old_exec where the mutex is taken. That is the point. I will see if I can come up with an appropriate comment.
Eric
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
Any of those can result in deadlock, as the cred_guard_mutex is held over a possible indefinite userspace waits for userspace.
Add exec_update_mutex that is only held over exec updating process with the new contents of exec, so that code that needs not to be confused by exec changing the mm and the cred in ways that can not happen during ordinary execution of a process.
The plan is to switch the users of cred_guard_mutex to exec_udpate_mutex one by one. This lets us move forward while still being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03M... Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/ Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/ Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.") Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2") Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com --- fs/exec.c | 9 +++++++++ include/linux/sched/signal.h | 9 ++++++++- init/init_task.c | 1 + kernel/fork.c | 1 + 4 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/fs/exec.c b/fs/exec.c index d820a7272a76..ffeebb1f167b 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm) { struct task_struct *tsk; struct mm_struct *old_mm, *active_mm; + int ret;
 	/* Notify parent that we're no longer interested in the old VM */
 	tsk = current;
@@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
 			return -EINTR;
 		}
 	}
+
+	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
+	if (ret)
+		return ret;
+
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
 	membarrier_exec_mmap(mm);
@@ -1438,6 +1444,8 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (!bprm->mm)
+			mutex_unlock(&current->signal->exec_update_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1487,6 +1495,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->exec_update_mutex);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 88050259c466..a29df79540ce 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -224,7 +224,14 @@ struct signal_struct {
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
-					 * (notably. ptrace) */
+					 * (notably. ptrace)
+					 * Deprecated do not use in new code.
+					 * Use exec_update_mutex instead.
+					 */
+	struct mutex exec_update_mutex;	/* Held while task_struct is being
+					 * updated during exec, and may have
+					 * inconsistent permissions.
+					 */
 } __randomize_layout;
 
 /*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5eab7b..bd403ed3e418 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@ static struct signal_struct init_signals = {
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+	.exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
 #ifdef CONFIG_POSIX_TIMERS
 	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
 	.cputimer	= {
diff --git a/kernel/fork.c b/kernel/fork.c
index 60a1295f4384..12896a6ecee6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->exec_update_mutex);
 
 	return 0;
 }
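The lock and unlock order in the hunks above can be modelled in userspace. This is only a sketch under stated assumptions: pthread mutexes stand in for kernel mutexes, and `struct signal_model`, `exec_begin()`, `exec_update()`, `exec_finish()` and `exec_abort()` are hypothetical names, not kernel functions. The `updated` flag reflects my reading of the `!bprm->mm` test in free_bprm(), namely that it is true only once exec_mmap() has taken the new mutex.

```c
#include <pthread.h>

/* Hypothetical userspace stand-in for the two mutexes in signal_struct. */
struct signal_model {
	pthread_mutex_t cred_guard_mutex;	/* held for the whole exec */
	pthread_mutex_t exec_update_mutex;	/* held only while mm and creds change */
};

static struct signal_model sig = {
	.cred_guard_mutex = PTHREAD_MUTEX_INITIALIZER,
	.exec_update_mutex = PTHREAD_MUTEX_INITIALIZER,
};

/* Returns 1 if the mutex is currently held, 0 if it is free. */
static int is_locked(pthread_mutex_t *m)
{
	if (pthread_mutex_trylock(m) != 0)
		return 1;
	pthread_mutex_unlock(m);
	return 0;
}

/* Start of exec: only the outer mutex is taken. */
static int exec_begin(struct signal_model *s)
{
	return pthread_mutex_lock(&s->cred_guard_mutex);
}

/* Models the new lock taken in exec_mmap() before the mm is switched. */
static int exec_update(struct signal_model *s)
{
	return pthread_mutex_lock(&s->exec_update_mutex);
}

/* Models install_exec_creds(): drop the inner mutex first, then the outer. */
static void exec_finish(struct signal_model *s)
{
	pthread_mutex_unlock(&s->exec_update_mutex);
	pthread_mutex_unlock(&s->cred_guard_mutex);
}

/* Models free_bprm() on a failed exec: drop only what was actually taken. */
static void exec_abort(struct signal_model *s, int updated)
{
	if (updated)
		pthread_mutex_unlock(&s->exec_update_mutex);
	pthread_mutex_unlock(&s->cred_guard_mutex);
}
```

Whichever path is taken, both mutexes should be free again afterwards; an exec that fails before the update step must not touch the inner mutex at all.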
On 3/8/20 10:38 PM, Eric W. Biederman wrote:
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
... is held while waiting for the trace parent to handle PTRACE_EVENT_EXIT or something?
I wonder if we also should mention that it is held while waiting for the trace parent to receive the exit code with "wait"?
threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
Any of those can result in deadlock, as the cred_guard_mutex is held over a possible indefinite userspace waits for userspace.
Add exec_update_mutex that is only held over exec updating process
Add ?
with the new contents of exec, so that code that needs not to be confused by exec changing the mm and the cred in ways that can not happen during ordinary execution of a process.
The plan is to switch the users of cred_guard_mutex to exec_udpate_mutex one by one. This lets us move forward while still
s/udpate/update/
Bernd.
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/8/20 10:38 PM, Eric W. Biederman wrote:
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
^ over
... is held while waiting for the trace parent to handle PTRACE_EVENT_EXIT or something?
Yes. Let me see if I can phrase that better.
I wonder if we also should mention that it is held while waiting for the trace parent to receive the exit code with "wait"?
I don't think we have to spell out the details of how it all works, unless that makes things clearer. Kernel developers can be expected to figure out how the kernel works. The critical thing is that it is an indefinite wait for userspace to take action.
But I will look.
threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
Any of those can result in deadlock, as the cred_guard_mutex is held over a possible indefinite userspace waits for userspace.
Add exec_update_mutex that is only held over exec updating process
Add ?
Yes. That is what the change does: add exec_update_mutex.
with the new contents of exec, so that code that needs not to be confused by exec changing the mm and the cred in ways that can not happen during ordinary execution of a process.
The plan is to switch the users of cred_guard_mutex to exec_udpate_mutex one by one. This lets us move forward while still
s/udpate/update/
Yes. Very much so.
Eric
On 3/9/20 6:40 PM, Eric W. Biederman wrote:
Yes. That is what the change does: add exec_update_mutex.
I just kind of missed the "subject" in this sentence, like "This patch adds an exec_update_mutex that is ..." but English is a foreign language for me, so it may be okay as is.
Bernd.
Bernd Edlinger bernd.edlinger@hotmail.de writes:
I just kind of missed the "subject" in this sentence, like "This patch adds an exec_update_mutex that is ..." but english is a foreign language for me, so may be okay as is.
English has a lot of options. I think this is a stylistic difference.
Instead of being an observer and describing what the change does: "This patch adds exec_update_mutex ..."
I was being there in the moment and saying/commanding what is happening: "Add exec_update_mutex ..."
Using the more immediate form ends up with more concise and clearer sentences.
Every one of my writing teachers in school emphasized that point and I see how it works when I write things. But writing is hard and I still tend toward long rambling sentences with many qualifiers that confuse and detract from the point rather than make it clear what is happening.
Eric
And reading through it all now I can see your confusion. That description of my changes was not well done. Reworking it now.
Eric
My rewritten change description reads as follows:
exec: Add a exec_update_mutex to replace cred_guard_mutex
The cred_guard_mutex is problematic as it is held over possibly indefinite waits for userspace. The possible indefinite waits for userspace that I have identified are:
- The cred_guard_mutex is held in PTRACE_EVENT_EXIT waiting for the tracer.
- The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
- The cred_guard_mutex is held over "get_user(futex_offset, ...)" in exit_robust_list.
- The cred_guard_mutex is held over copy_strings.
The functions get_user and put_user can trigger a page fault which can potentially wait indefinitely in the case of userfaultfd or if userspace implements part of the page fault path.
In any of those cases the userspace process that the kernel is waiting for might userspace might make a different system call that winds up taking the cred_guard_mutex and result in deadlock.
Holding a mutex over any of those possibly indefinite waits for userspace does not appear necessary. Add exec_update_mutex that will just cover updating the process during exec where the permissions and the objects pointed to by the task struct may be out of sync.
The plan is to switch the users of cred_guard_mutex to exec_udpate_mutex one by one. This lets us move forward while still being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03M...
Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
Does that sound better?
Eric
On 3/9/20 7:36 PM, Eric W. Biederman wrote:
My rewritten change description reads as follows:
exec: Add a exec_update_mutex to replace cred_guard_mutex
is this "an" exec_update_mutex?
The cred_guard_mutex is problematic as it is held over possibly indefinite waits for userspace. The possilbe indefinite waits for userspace that I have identified are: The cred_guard_mutex is held in PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The cred_guard_mutex is held over "get_user(futex_offset, ...") in exit_robust_list. The cred_guard_mutex held over copy_strings. The functions get_user and put_user can trigger a page fault which can potentially wait indefinitely in the case of userfaultfd or if userspace implements part of the page fault path. In any of those cases the userspace process that the kernel is waiting for might userspace might make a different system call that winds up
^ remove the duplicated "userspace might"
taking the cred_guard_mutex and result in deadlock.
Holding a mutex over any of those possibly indefinite waits for userspace does not appear necessary. Add exec_update_mutex that will just cover updating the process during exec where the permissions and the objects pointed to by the task struct may be out of sync. The plan is to switch the users of cred_guard_mutex to exec_udpate_mutex one by one. This lets us move forward while still
^-- typo: update
being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03M... Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/ Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/ Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.") Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2") Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
Does that sound better?
almost done.
Eric
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/9/20 7:36 PM, Eric W. Biederman wrote:
Does that sound better?
almost done.
I think this text is finally clean.
exec: Add exec_update_mutex to replace cred_guard_mutex
The cred_guard_mutex is problematic as it is held over possibly indefinite waits for userspace. The possible indefinite waits for userspace that I have identified are:
- The cred_guard_mutex is held in PTRACE_EVENT_EXIT waiting for the tracer.
- The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
- The cred_guard_mutex is held over "get_user(futex_offset, ...)" in exit_robust_list.
- The cred_guard_mutex is held over copy_strings.
The functions get_user and put_user can trigger a page fault which can potentially wait indefinitely in the case of userfaultfd or if userspace implements part of the page fault path.
In any of those cases the userspace process that the kernel is waiting for might make a different system call that winds up taking the cred_guard_mutex and result in deadlock.
Holding a mutex over any of those possibly indefinite waits for userspace does not appear necessary. Add exec_update_mutex that will just cover updating the process during exec where the permissions and the objects pointed to by the task struct may be out of sync.
The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03M...
Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
Bernd, do you want to give me your Reviewed-by for this part of the series?
After that do you think you can write the obvious patch for mm_access?
I will apply these changes to my tree and push them into linux-next.
Eric
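The deadlock described in the change description can be modelled from the tracer's point of view. What follows is a single-threaded userspace sketch, not kernel code: pthread mutexes stand in for the kernel mutexes, and `exec_enter_wait()`, `exec_leave_wait()` and `tracer_can_proceed()` are hypothetical names. It only shows which mutex a helper such as mm_access() could take while exec is parked waiting for userspace.

```c
#include <pthread.h>

/* Hypothetical stand-ins for the two per-process mutexes. */
static pthread_mutex_t cred_guard = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t exec_update = PTHREAD_MUTEX_INITIALIZER;

/*
 * Model of a tracer-side helper: returns 1 if it could take the mutex
 * right now, 0 if it would have to sleep waiting for it.
 */
static int tracer_can_proceed(pthread_mutex_t *m)
{
	if (pthread_mutex_trylock(m) != 0)
		return 0;	/* would block, so the tracer could never answer the tracee */
	pthread_mutex_unlock(m);
	return 1;
}

/* Model of exec parked in a possibly indefinite wait for userspace. */
static void exec_enter_wait(void)
{
	/* Only the outer mutex is held across the wait. */
	pthread_mutex_lock(&cred_guard);
}

/* The wait is over (for example, the tracer handled the exit events). */
static void exec_leave_wait(void)
{
	pthread_mutex_unlock(&cred_guard);
}
```

In this model a tracer that needs cred_guard is stuck for as long as exec waits on it, while a tracer that only needs exec_update can continue and eventually let exec finish.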
On 3/9/20 8:02 PM, Eric W. Biederman wrote:
Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03M...
Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
I checked the URLs; they all work. Just one last question: are these git references? I can't find them in my linux git tree (cloned from Linus' git).
Sorry for being pedantic.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Bernd do you want to give me your Reviewed-by for this part of the series?
Sure, and for the other parts as well, of course.
Reviewed-by: Bernd Edlinger bernd.edlinger@hotmail.de
After that do you think you can write the obvious patch for mm_access?
Yes, I can do that. I also found some typos in comments and will fix them in extra patches as well.
I wonder if it is okay for the test case to include the ptrace_attach test, although that is not yet passing?
Thanks Bernd.
Bernd Edlinger bernd.edlinger@hotmail.de writes:
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
I checked the URLs; they all work. Just one last question: are these git references? I can't find them in my linux git tree (cloned from Linus' git).
Sorry for being pedantic.
You have to track down tglx's historical git tree from when everything was in bitkeeper.
But yes they are git references and yes they work. Just that part of the history is not in linux.git.
I wonder if it is okay for the test case to include the ptrace_attach test, although that is not yet passing?
It is an existing kernel bug that it doesn't pass.
My sense is that if you include it as a separate patch, then if it is a problem for someone we can identify it easily via bisect and do whatever is appropriate.
Eric
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/9/20 8:02 PM, Eric W. Biederman wrote:
Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03M...
Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
I checked the URLs; they all work. Just one last question: are these git references? I can't find them in my linux git tree (cloned from Linus' git).
I will add this tag to help people figure out what is going on.
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Eric
This is a follow-up to Eric's patch series to fix the deadlocks observed when ptracing multi-threaded applications that call execve.
This fixes the simple and most important case where the cred_guard_mutex causes strace to deadlock.
This also adds a test case (which is only partially fixed so far; the rest of the fixes will follow soon).
Two trivial comment fixes are also included.
Bernd Edlinger (4):
  exec: Fix a deadlock in ptrace
  selftests/ptrace: add test cases for dead-locks
  mm: docs: Fix a comment in process_vm_rw_core
  kernel: doc: remove outdated comment in prepare_kernel_cred

 kernel/cred.c                             |  2 -
 kernel/fork.c                             |  4 +-
 mm/process_vm_access.c                    |  2 +-
 tools/testing/selftests/ptrace/Makefile   |  4 +-
 tools/testing/selftests/ptrace/vmaccess.c | 86 +++++++++++++++++++++++++++++++
 5 files changed, 91 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
Bernd Edlinger bernd.edlinger@hotmail.de writes:
This is a follow up on Eric's patch series to fix the deadlocks observed with ptracing when execve in multi-threaded applications.
Applied.
Thank you, Eric
This continues the execve anti-deadlock patch and addresses all of the (mostly) simple cases, where the new exec_update_mutex can be used instead of the cred_guard_mutex.
Note: these patches are independent of each other, so if one of them turns out to be controversial, that does not affect the others.
Bernd Edlinger (4):
  kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
  proc: Use new infrastructure to fix deadlocks in execve
  proc: io_accounting: Use new infrastructure to fix deadlocks in execve
  perf: Use new infrastructure to fix deadlocks in execve

 fs/proc/base.c       | 10 +++++-----
 kernel/events/core.c | 12 ++++++------
 kernel/kcmp.c        |  8 ++++----
 3 files changed, 15 insertions(+), 15 deletions(-)
This changes kcmp_epoll_target to use the new exec_update_mutex instead of cred_guard_mutex.
This should be safe, as the credentials are only used for reading, and furthermore ->mm and ->sighand are updated on execve, but only under the new exec_update_mutex.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
---
 kernel/kcmp.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/kcmp.c b/kernel/kcmp.c
index a0e3d7a..b3ff928 100644
--- a/kernel/kcmp.c
+++ b/kernel/kcmp.c
@@ -173,8 +173,8 @@ static int kcmp_epoll_target(struct task_struct *task1,
 	/*
 	 * One should have enough rights to inspect task details.
 	 */
-	ret = kcmp_lock(&task1->signal->cred_guard_mutex,
-			&task2->signal->cred_guard_mutex);
+	ret = kcmp_lock(&task1->signal->exec_update_mutex,
+			&task2->signal->exec_update_mutex);
 	if (ret)
 		goto err;
 	if (!ptrace_may_access(task1, PTRACE_MODE_READ_REALCREDS) ||
@@ -229,8 +229,8 @@ static int kcmp_epoll_target(struct task_struct *task1,
 	}
 
 err_unlock:
-	kcmp_unlock(&task1->signal->cred_guard_mutex,
-		    &task2->signal->cred_guard_mutex);
+	kcmp_unlock(&task1->signal->exec_update_mutex,
+		    &task2->signal->exec_update_mutex);
 err:
 	put_task_struct(task1);
 	put_task_struct(task2);
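The patch does not show kcmp_lock() itself. As far as I know it takes the two mutexes in a fixed order so that two callers passing the same pair in opposite order cannot deadlock against each other. The following is a userspace sketch of that general technique with pthread mutexes; `lock_pair()` and `unlock_pair()` are hypothetical names, and this is not the kernel implementation.

```c
#include <pthread.h>
#include <stdint.h>

/*
 * Lock two mutexes in address order. Returns 0 on success or a pthread
 * error code; on failure neither mutex is left held.
 */
static int lock_pair(pthread_mutex_t *m1, pthread_mutex_t *m2)
{
	int err;

	/* Always take the higher address first, whatever the argument order. */
	if ((uintptr_t)m2 > (uintptr_t)m1) {
		pthread_mutex_t *tmp = m1;

		m1 = m2;
		m2 = tmp;
	}

	err = pthread_mutex_lock(m1);
	if (!err && m1 != m2) {
		err = pthread_mutex_lock(m2);
		if (err)
			pthread_mutex_unlock(m1);
	}
	return err;
}

/* Release both mutexes; the same mutex passed twice is released once. */
static void unlock_pair(pthread_mutex_t *m1, pthread_mutex_t *m2)
{
	if (m2 != m1)
		pthread_mutex_unlock(m2);
	pthread_mutex_unlock(m1);
}
```

Handling the case where both arguments are the same mutex matters here, because both tasks passed to kcmp may belong to the same process and so share one signal_struct.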
Bernd Edlinger bernd.edlinger@hotmail.de writes:
This changes kcmp_epoll_target to use the new exec_update_mutex instead of cred_guard_mutex.
This should be safe, as the credentials are only used for reading, and furthermore ->mm and ->sighand are updated on execve, but only under the new exec_update_mutex.
Can you add a comment that the exec_update_mutex is not needed for KCMP_FILE? Both sets of credentials during exec are valid for accessing the files, so exec_update_mutex does not matter.
I don't think exec_update_mutex is needed for KCMP_SYSVSEM or KCMP_EPOLL_TFD either, as I don't think exec changes either one of those.
Eric
On 3/10/20 8:01 PM, Eric W. Biederman wrote:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
This changes kcmp_epoll_target to use the new exec_update_mutex instead of cred_guard_mutex.
This should be safe, as the credentials are only used for reading, and furthermore ->mm and ->sighand are updated on execve, but only under the new exec_update_mutex.
Can you add a comment that the exec_update_mutex is not needed for KCMP_FILE? As both sets of credentials during exec are valid for accessing the files so exec_update_mutex does not matter.
Some files are closed by do_close_on_exec, so in theory this allows you to examine, with either credential, files that were open for the old user but are closed for the new user.
It is not a race condition, but it may be a security concern.
I don't think exec_update_mutex is needed for KCMP_SYSVSEM or KCMP_EPOLL_TFD either. As I don't think exec changes either one of those.
KCMP_EPOLL_TFD also accesses file pointers, so the same is possible there.
It might be that KCMP_SYSVSEM is a missed optimization, but I may have overlooked something. I'd rather err on the safe side.
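For context on the close-on-exec point: the descriptors that exec closes are the ones carrying the FD_CLOEXEC flag. A minimal userspace sketch of setting and reading back that flag with fcntl() follows; `mark_cloexec()` is a hypothetical helper name, and the sketch says nothing about how the kernel side is implemented.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Mark fd close-on-exec and report whether the flag is now set.
 * Returns 1 if set, 0 if not set, -1 on error (for example a bad fd).
 */
static int mark_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0)
		return -1;

	/* Read the flags back rather than trusting the call above. */
	flags = fcntl(fd, F_GETFD);
	if (flags < 0)
		return -1;
	return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

A descriptor marked this way stays usable in the calling process until exec succeeds, which is the window the concern above is about.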
This changes lock_trace to use the new exec_update_mutex instead of cred_guard_mutex.
This fixes possible deadlocks when the tracer is accessing /proc/$pid/stack, for instance.
This should be safe, as the credentials are only used for reading, and task->mm is updated on execve under the new exec_update_mutex.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
---
 fs/proc/base.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea950..4fdfe4f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -403,11 +403,11 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
 
 static int lock_trace(struct task_struct *task)
 {
-	int err = mutex_lock_killable(&task->signal->cred_guard_mutex);
+	int err = mutex_lock_killable(&task->signal->exec_update_mutex);
 	if (err)
 		return err;
 	if (!ptrace_may_access(task, PTRACE_MODE_ATTACH_FSCREDS)) {
-		mutex_unlock(&task->signal->cred_guard_mutex);
+		mutex_unlock(&task->signal->exec_update_mutex);
 		return -EPERM;
 	}
 	return 0;
@@ -415,7 +415,7 @@ static int lock_trace(struct task_struct *task)
 
 static void unlock_trace(struct task_struct *task)
 {
-	mutex_unlock(&task->signal->cred_guard_mutex);
+	mutex_unlock(&task->signal->exec_update_mutex);
 }
 
 #ifdef CONFIG_STACKTRACE
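The shape of lock_trace() and unlock_trace() is: take the mutex, do the permission check under it, and never return an error with the mutex still held. A userspace sketch of that pattern follows, assuming a pthread mutex in place of the kernel mutex and a plain flag in place of ptrace_may_access(); `struct task_model`, `lock_trace_model()` and `unlock_trace_model()` are hypothetical names.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model of a task; not the kernel's task_struct. */
struct task_model {
	pthread_mutex_t exec_update_mutex;
	bool may_access;	/* stands in for the ptrace_may_access() result */
};

/*
 * Returns 0 with the mutex held, or a negative errno with the mutex
 * released, so callers only unlock after a successful return.
 */
static int lock_trace_model(struct task_model *t)
{
	int err = pthread_mutex_lock(&t->exec_update_mutex);

	if (err)
		return -err;
	if (!t->may_access) {
		/* Permission is checked under the lock; drop it on failure. */
		pthread_mutex_unlock(&t->exec_update_mutex);
		return -EPERM;
	}
	return 0;
}

static void unlock_trace_model(struct task_model *t)
{
	pthread_mutex_unlock(&t->exec_update_mutex);
}
```

Doing the check while holding the mutex is the point of the pattern: in the kernel it keeps exec from changing the mm and credentials between the check and the access that follows.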
On Tue, Mar 10, 2020 at 06:45:32PM +0100, Bernd Edlinger wrote:
This changes lock_trace to use the new exec_update_mutex instead of cred_guard_mutex.
This fixes possible deadlocks when the trace is accessing /proc/$pid/stack for instance.
This should be safe, as the credentials are only used for reading, and task->mm is updated on execve under the new exec_update_mutex.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
Reviewed-by: Kees Cook keescook@chromium.org
-Kees
On Tue, Mar 10, 2020 at 06:45:32PM +0100, Bernd Edlinger wrote:
This changes lock_trace to use the new exec_update_mutex instead of cred_guard_mutex.
This fixes possible deadlocks when the trace is accessing /proc/$pid/stack for instance.
This should be safe, as the credentials are only used for reading, and task->mm is updated on execve under the new exec_update_mutex.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
I have the same question here as in 3/4. I should probably rescind my Reviewed-by until I'm convinced about the security-safety of this -- why is this not a race against cred changes?
-Kees
On 3/11/20 8:10 PM, Kees Cook wrote:
On Tue, Mar 10, 2020 at 06:45:32PM +0100, Bernd Edlinger wrote:
This changes lock_trace to use the new exec_update_mutex instead of cred_guard_mutex.
This fixes possible deadlocks when the trace is accessing /proc/$pid/stack for instance.
This should be safe, as the credentials are only used for reading, and task->mm is updated on execve under the new exec_update_mutex.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
I have the same question here as in 3/4. I should probably rescind my Reviewed-by until I'm convinced about the security-safety of this -- why is this not a race against cred changes?
The credentials of a thread that is currently executing execve are already set in the bprm structure; however, the credentials in the task structure are not yet changed, and the process memory map stays stable as well, until the exec_update_mutex is acquired.
What these functions do is access the call stack of the process before the new executable is actually started.
There would immediately be a severe security problem if we did not use any mutex, as the check would then be made with the old credentials, but the stack trace could reveal secret function calls made by a setuid program when it starts up.
Bernd.
-Kees
This changes do_io_accounting to use the new exec_update_mutex instead of cred_guard_mutex.
This fixes possible deadlocks when the tracer is accessing /proc/$pid/io, for instance.
This should be safe, as the credentials are only used for reading.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
---
 fs/proc/base.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 4fdfe4f..529d0c6 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2770,7 +2770,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
 	unsigned long flags;
 	int result;
 
-	result = mutex_lock_killable(&task->signal->cred_guard_mutex);
+	result = mutex_lock_killable(&task->signal->exec_update_mutex);
 	if (result)
 		return result;
 
@@ -2806,7 +2806,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
 	result = 0;
 
 out_unlock:
-	mutex_unlock(&task->signal->cred_guard_mutex);
+	mutex_unlock(&task->signal->exec_update_mutex);
 	return result;
 }
 
Bernd Edlinger bernd.edlinger@hotmail.de writes:
This changes do_io_accounting to use the new exec_update_mutex instead of cred_guard_mutex.
This fixes possible deadlocks when the trace is accessing /proc/$pid/io for instance.
This should be safe, as the credentials are only used for reading.
This is an improvement.
We probably want to do this just as an incremental step in making things better. But perhaps I am blind: I do not find the reason for guarding this with the cred_guard_mutex at all persuasive.
I think moving the ptrace_may_access check down to after the unlock_task_sighand would be just as effective at addressing the concerns raised in the original commit. I think the task_lock provides all of the barrier we need to make it safe to move the ptrace_may_access checks.
The reason I say this is that I don't see exec changing ->ioac, just performing some I/O which would update the io accounting statistics.
Can anyone see if I am wrong?
Eric
commit 293eb1e7772b25a93647c798c7b89bf26c2da2e0 Author: Vasiliy Kulikov segoon@openwall.com Date: Tue Jul 26 16:08:38 2011 -0700
proc: fix a race in do_io_accounting()
If an inode's mode permits opening /proc/PID/io and the resulting file descriptor is kept across execve() of a setuid or similar binary, the ptrace_may_access() check tries to prevent using this fd against the task with escalated privileges.
Unfortunately, there is a race in the check against execve(). If execve() is processed after the ptrace check, but before the actual io information gathering, io statistics will be gathered from the privileged process. At least in theory this might lead to gathering sensible information (like ssh/ftp password length) that wouldn't be available otherwise.
Holding task->signal->cred_guard_mutex while gathering the io information should protect against the race.
The order of locking is similar to the one inside of ptrace_attach(): first goes cred_guard_mutex, then lock_task_sighand().
Signed-off-by: Vasiliy Kulikov segoon@openwall.com Cc: Al Viro viro@zeniv.linux.org.uk Cc: stable@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
On 3/10/20 8:06 PM, Eric W. Biederman wrote:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
This changes do_io_accounting to use the new exec_update_mutex instead of cred_guard_mutex.
This fixes possible deadlocks when the trace is accessing /proc/$pid/io for instance.
This should be safe, as the credentials are only used for reading.
This is an improvement.
We probably want to do this just as an incremental step in making things better. But perhaps I am blind: I do not find the reason for guarding this with the cred_guard_mutex at all persuasive.
I think moving the ptrace_may_access check down to after the unlock_task_sighand would be just as effective at addressing the concerns raised in the original commit. I think the task_lock provides all of the barrier we need to make it safe to move the ptrace_may_access checks.
The reason I say this is that I don't see exec changing ->ioac, just performing some I/O which would update the io accounting statistics.
Maybe the suid executable does io while starting up, or maybe not. What the program does immediately at startup is a secret that we want to keep, but evil Eve wants to find out, and she is using /proc/alice/io to do that.
It is a bit constructed, but it seems like a security concern. When we hold the exec_update_mutex while collecting the data, we cannot see any io of the new process when the new credentials don't allow that.
Bernd.
Can anyone see if I am wrong?
Eric
commit 293eb1e7772b25a93647c798c7b89bf26c2da2e0 Author: Vasiliy Kulikov segoon@openwall.com Date: Tue Jul 26 16:08:38 2011 -0700
proc: fix a race in do_io_accounting()
If an inode's mode permits opening /proc/PID/io and the resulting file descriptor is kept across execve() of a setuid or similar binary, the ptrace_may_access() check tries to prevent using this fd against the task with escalated privileges.
Unfortunately, there is a race in the check against execve(). If execve() is processed after the ptrace check, but before the actual io information gathering, io statistics will be gathered from the privileged process. At least in theory this might lead to gathering sensible information (like ssh/ftp password length) that wouldn't be available otherwise.
Holding task->signal->cred_guard_mutex while gathering the io information should protect against the race.
The order of locking is similar to the one inside of ptrace_attach(): first goes cred_guard_mutex, then lock_task_sighand().
Signed-off-by: Vasiliy Kulikov segoon@openwall.com Cc: Al Viro viro@zeniv.linux.org.uk Cc: stable@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/10/20 8:06 PM, Eric W. Biederman wrote:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
This changes do_io_accounting to use the new exec_update_mutex instead of cred_guard_mutex.
This fixes possible deadlocks when the trace is accessing /proc/$pid/io for instance.
This should be safe, as the credentials are only used for reading.
This is an improvement.
We probably want to do this just as an incremental step in making things better. But perhaps I am blind: I do not find the reason for guarding this with the cred_guard_mutex at all persuasive.
I think moving the ptrace_may_access check down to after the unlock_task_sighand would be just as effective at addressing the concerns raised in the original commit. I think the task_lock provides all of the barrier we need to make it safe to move the ptrace_may_access checks.
The reason I say this is that I don't see exec changing ->ioac, just performing some I/O which would update the io accounting statistics.
Maybe the suid executable does io while starting up, or maybe not. What the program does immediately at startup is a secret that we want to keep, but evil Eve wants to find out, and she is using /proc/alice/io to do that.
It is a bit constructed, but it seems like a security concern. When we hold the exec_update_mutex while collecting the data, we cannot see any io of the new process when the new credentials don't allow that.
Jann Horn has convinced me we should just convert these to the exec_update_mutex today, because while not 100% correct in theory, the only really interesting case is exec. So the code does something interesting and worthwhile, and is mostly correct. The last thing I want to do is to cause an unnecessary regression.
Eric
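The argument above, that the permission check and the data gathering must happen under the same mutex that exec takes when it switches credentials, can be illustrated with a minimal userspace sketch. This is only a model under stated assumptions: struct task_model, may_access(), read_io(), exec_setuid() and do_io() are hypothetical names, a pthread mutex stands in for the kernel's exec_update_mutex, and the privileged flag stands in for a credential change that ptrace_may_access() would reject. It is not the kernel implementation.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical userspace stand-in for the task state discussed above. */
struct task_model {
	pthread_mutex_t exec_update_mutex; /* models signal->exec_update_mutex */
	int privileged;                    /* models creds the reader may not inspect */
	long read_bytes;                   /* models the ->ioac counters */
};

/* Models ptrace_may_access(): the reader may only look at unprivileged tasks. */
static int may_access(const struct task_model *t)
{
	return !t->privileged;
}

/* Models do_io_accounting(): the check and the read share one critical section. */
static int read_io(struct task_model *t, long *out)
{
	int ret = 0;

	pthread_mutex_lock(&t->exec_update_mutex);
	if (!may_access(t))
		ret = -EACCES;
	else
		*out = t->read_bytes;
	pthread_mutex_unlock(&t->exec_update_mutex);
	return ret;
}

/* Models exec of a setuid binary: the creds change under the same mutex. */
static void exec_setuid(struct task_model *t)
{
	pthread_mutex_lock(&t->exec_update_mutex);
	t->privileged = 1;
	pthread_mutex_unlock(&t->exec_update_mutex);
}

/* Models I/O done by the new program once exec has finished.
 * The lock is taken here only to keep the model free of data races. */
static void do_io(struct task_model *t, long bytes)
{
	pthread_mutex_lock(&t->exec_update_mutex);
	t->read_bytes += bytes;
	pthread_mutex_unlock(&t->exec_update_mutex);
}
```

In this model a reader either runs entirely before exec_setuid() and sees the old counters, or entirely after it and is refused, so the counters of the privileged program are never returned on the strength of a check made against the old state.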
On Tue, Mar 10, 2020 at 06:45:47PM +0100, Bernd Edlinger wrote:
This changes do_io_accounting to use the new exec_update_mutex instead of cred_guard_mutex.
This fixes possible deadlocks when the trace is accessing /proc/$pid/io for instance.
This should be safe, as the credentials are only used for reading.
I'd like to see the rationale described better here for why it should be safe. I'm still not seeing why this is safe here, as we might check ptrace_may_access() with one cred and then iterate io accounting with a different credential...
What am I missing?
-Kees
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
On 3/11/20 8:08 PM, Kees Cook wrote:
On Tue, Mar 10, 2020 at 06:45:47PM +0100, Bernd Edlinger wrote:
This changes do_io_accounting to use the new exec_update_mutex instead of cred_guard_mutex.
This fixes possible deadlocks when the trace is accessing /proc/$pid/io for instance.
This should be safe, as the credentials are only used for reading.
I'd like to see the rationale described better here for why it should be safe. I'm still not seeing why this is safe here, as we might check ptrace_may_access() with one cred and then iterate io accounting with a different credential...
What am I missing?
The same applies here: even if execve has already started, the credentials are not actually changed until execve has acquired the exec_update_mutex.
The data flow is from task->cred => do_io_accounting. If the data flow were from do_io_accounting => task's no new privs, you would see an entirely different patch.
I am open to suggestions on how to improve the description, or even add a comment from time to time :)
Thanks Bernd.
-Kees
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
Kees Cook keescook@chromium.org writes:
On Tue, Mar 10, 2020 at 06:45:47PM +0100, Bernd Edlinger wrote:
This changes do_io_accounting to use the new exec_update_mutex instead of cred_guard_mutex.
This fixes possible deadlocks when the trace is accessing /proc/$pid/io for instance.
This should be safe, as the credentials are only used for reading.
I'd like to see the rationale described better here for why it should be safe. I'm still not seeing why this is safe here, as we might check ptrace_may_access() with one cred and then iterate io accounting with a different credential...
What am I missing?
The rationale for non-regression is that exec_update_mutex covers all of the same tsk->cred changes as cred_guard_mutex. Therefore we are not any worse off, and we avoid the deadlock.
As for safety, Jann's argument that the only interesting credential change is in exec applies. All other credential changes that have any effect on permission checks make the new cred non-dumpable (exceptions apply, see the code).
So I think this is a non-regressing change. A safe change.
I don't think either version of this code is fully correct.
Eric
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
This changes perf_event_set_clock to use the new exec_update_mutex instead of cred_guard_mutex.
This should be safe, as the credentials are only used for reading.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
---
 kernel/events/core.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2173c23..c37f6eb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1248,7 +1248,7 @@ static void put_ctx(struct perf_event_context *ctx)
  * function.
  *
  * Lock order:
- *    cred_guard_mutex
+ *    exec_update_mutex
  *	task_struct::perf_event_mutex
  *	  perf_event_context::mutex
  *	    perf_event::child_mutex;
@@ -11254,14 +11254,14 @@ static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id)
 	}
 
 	if (task) {
-		err = mutex_lock_interruptible(&task->signal->cred_guard_mutex);
+		err = mutex_lock_interruptible(&task->signal->exec_update_mutex);
 		if (err)
 			goto err_task;
 
 		/*
 		 * Reuse ptrace permission checks for now.
 		 *
-		 * We must hold cred_guard_mutex across this and any potential
+		 * We must hold exec_update_mutex across this and any potential
 		 * perf_install_in_context() call for this new event to
 		 * serialize against exec() altering our credentials (and the
 		 * perf_event_exit_task() that could imply).
@@ -11550,7 +11550,7 @@ static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id)
 	mutex_unlock(&ctx->mutex);
 
 	if (task) {
-		mutex_unlock(&task->signal->cred_guard_mutex);
+		mutex_unlock(&task->signal->exec_update_mutex);
 		put_task_struct(task);
 	}
 
@@ -11586,7 +11586,7 @@ static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id)
 	free_event(event);
 err_cred:
 	if (task)
-		mutex_unlock(&task->signal->cred_guard_mutex);
+		mutex_unlock(&task->signal->exec_update_mutex);
 err_task:
 	if (task)
 		put_task_struct(task);
@@ -11891,7 +11891,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
 /*
  * When a child task exits, feed back event values to parent events.
  *
- * Can be called with cred_guard_mutex held when called from
+ * Can be called with exec_update_mutex held when called from
  * install_exec_creds().
  */
 void perf_event_exit_task(struct task_struct *child)
This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread is running.
I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but strace is no longer able to respond, since it is blocked in mm_access.
The deadlock always happens when strace needs to access the tracee's process mmap while another thread in the tracee starts to execve a child process, but that cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
strace D 0 30614 30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

expect D 0 31933 30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This changes mm_access to use the new exec_update_mutex instead of cred_guard_mutex.
This patch is based on the following patch by Eric W. Biederman: "[PATCH 0/5] Infrastructure to allow fixing exec deadlocks" Link: https://lore.kernel.org/lkml/87v9ne5y4y.fsf_-_@x220.int.ebiederm.org/
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
---
 kernel/fork.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index c12595a..5720ff3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 	struct mm_struct *mm;
 	int err;
 
-	err = mutex_lock_killable(&task->signal->cred_guard_mutex);
+	err = mutex_lock_killable(&task->signal->exec_update_mutex);
 	if (err)
 		return ERR_PTR(err);
 
@@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 		mmput(mm);
 		mm = ERR_PTR(-EACCES);
 	}
-	mutex_unlock(&task->signal->cred_guard_mutex);
+	mutex_unlock(&task->signal->exec_update_mutex);
 
 	return mm;
 }
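Why switching mm_access to the second mutex removes the deadlock can be sketched with a small userspace model. The assumptions are stated loudly: struct signal_model and the functions exec_begin(), exec_commit(), exec_end(), mm_access_old() and mm_access_new() are hypothetical names, pthread mutexes stand in for the kernel mutexes, and trylock is used only so the sketch reports "would block" instead of actually hanging. It models the locking order described in this thread, not the real fs/exec.c code.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical model of the two mutexes in signal_struct. */
struct signal_model {
	pthread_mutex_t cred_guard_mutex;
	pthread_mutex_t exec_update_mutex;
};

#define SIGNAL_MODEL_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER }

/* exec starts: cred_guard_mutex is taken early and stays held while the
 * exec'ing thread waits for its sibling threads to exit. */
static void exec_begin(struct signal_model *sig)
{
	pthread_mutex_lock(&sig->cred_guard_mutex);
}

/* Only the short switch to the new mm and creds is covered by
 * exec_update_mutex. */
static void exec_commit(struct signal_model *sig)
{
	pthread_mutex_lock(&sig->exec_update_mutex);
	/* the new mm and credentials would be installed here */
	pthread_mutex_unlock(&sig->exec_update_mutex);
}

/* exec is done: cred_guard_mutex is released. */
static void exec_end(struct signal_model *sig)
{
	pthread_mutex_unlock(&sig->cred_guard_mutex);
}

/* Returns 0 if a reader could take the mutex, EBUSY if it would block. */
static int reader_try(pthread_mutex_t *m)
{
	int rc = pthread_mutex_trylock(m);

	if (rc == 0)
		pthread_mutex_unlock(m);
	return rc;
}

/* Models mm_access() before the patch. */
static int mm_access_old(struct signal_model *sig)
{
	return reader_try(&sig->cred_guard_mutex);
}

/* Models mm_access() after the patch. */
static int mm_access_new(struct signal_model *sig)
{
	return reader_try(&sig->exec_update_mutex);
}
```

Between exec_begin() and exec_commit(), which corresponds to the window where the exec'ing thread waits for the tracer to reap the dying threads, mm_access_old() reports EBUSY while mm_access_new() succeeds, so the tracer can finish its memory access and go back to handling ptrace events.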
Bernd Edlinger bernd.edlinger@hotmail.de writes:
This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread are running.
I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but the strace is no longer able to respond, since it is blocked in vm_access.
The deadlock is always happening when strace needs to access the tracees process mmap, while another thread in the tracee starts to execve a child process, but that cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
Overall this looks good. Mind if I change the subject to: "exec: Fix a deadlock in strace" ?
Eric
strace D 0 30614 30584 0x00000000 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 schedule_preempt_disabled+0x15/0x20 __mutex_lock.isra.13+0x1ec/0x520 __mutex_lock_killable_slowpath+0x13/0x20 mutex_lock_killable+0x28/0x30 mm_access+0x27/0xa0 process_vm_rw_core.isra.3+0xff/0x550 process_vm_rw+0xdd/0xf0 __x64_sys_process_vm_readv+0x31/0x40 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9
expect D 0 31933 30876 0x80004003 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 flush_old_exec+0xc4/0x770 load_elf_binary+0x35a/0x16c0 search_binary_handler+0x97/0x1d0 __do_execve_file.isra.40+0x5d4/0x8a0 __x64_sys_execve+0x49/0x60 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9
This changes mm_access to use the new exec_update_mutex instead of cred_guard_mutex.
This patch is based on the following patch by Eric W. Biederman: "[PATCH 0/5] Infrastructure to allow fixing exec deadlocks" Link: https://lore.kernel.org/lkml/87v9ne5y4y.fsf_-_@x220.int.ebiederm.org/
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
On 3/10/20 4:13 PM, Eric W. Biederman wrote:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread are running.
I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but the strace is no longer able to respond, since it is blocked in vm_access.
The deadlock is always happening when strace needs to access the tracees process mmap, while another thread in the tracee starts to execve a child process, but that cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
Overall this looks good. Mind if I change the subject to: "exec: Fix a deadlock in strace" ?
Sure, go ahead.
Thanks Bernd.
Eric
strace D 0 30614 30584 0x00000000 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 schedule_preempt_disabled+0x15/0x20 __mutex_lock.isra.13+0x1ec/0x520 __mutex_lock_killable_slowpath+0x13/0x20 mutex_lock_killable+0x28/0x30 mm_access+0x27/0xa0 process_vm_rw_core.isra.3+0xff/0x550 process_vm_rw+0xdd/0xf0 __x64_sys_process_vm_readv+0x31/0x40 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9
expect D 0 31933 30876 0x80004003 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 flush_old_exec+0xc4/0x770 load_elf_binary+0x35a/0x16c0 search_binary_handler+0x97/0x1d0 __do_execve_file.isra.40+0x5d4/0x8a0 __x64_sys_execve+0x49/0x60 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9
This changes mm_access to use the new exec_update_mutex instead of cred_guard_mutex.
This patch is based on the following patch by Eric W. Biederman: "[PATCH 0/5] Infrastructure to allow fixing exec deadlocks" Link: https://lore.kernel.org/lkml/87v9ne5y4y.fsf_-_@x220.int.ebiederm.org/
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
On Tue, Mar 10, 2020 at 02:43:41PM +0100, Bernd Edlinger wrote:
This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread are running.
I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but the strace is no longer able to respond, since it is blocked in vm_access.
The deadlock is always happening when strace needs to access the tracees process mmap, while another thread in the tracee starts to execve a child process, but that cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
strace D 0 30614 30584 0x00000000 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 schedule_preempt_disabled+0x15/0x20 __mutex_lock.isra.13+0x1ec/0x520 __mutex_lock_killable_slowpath+0x13/0x20 mutex_lock_killable+0x28/0x30 mm_access+0x27/0xa0 process_vm_rw_core.isra.3+0xff/0x550 process_vm_rw+0xdd/0xf0 __x64_sys_process_vm_readv+0x31/0x40 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9
expect D 0 31933 30876 0x80004003 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 flush_old_exec+0xc4/0x770 load_elf_binary+0x35a/0x16c0 search_binary_handler+0x97/0x1d0 __do_execve_file.isra.40+0x5d4/0x8a0 __x64_sys_execve+0x49/0x60 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9
This changes mm_access to use the new exec_update_mutex instead of cred_guard_mutex.
This patch is based on the following patch by Eric W. Biederman: "[PATCH 0/5] Infrastructure to allow fixing exec deadlocks" Link: https://lore.kernel.org/lkml/87v9ne5y4y.fsf_-_@x220.int.ebiederm.org/
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
Cool, yes, on top of the new infrastructure this looks correct to me -- the new mutex wraps mm changes and mm_access() is looking at *drum roll* the mm! :)
Reviewed-by: Kees Cook keescook@chromium.org
-Kees
This adds test cases for ptrace deadlocks.
Additionally fixes a compile problem in get_syscall_info.c, observed with gcc-4.8.4:
get_syscall_info.c: In function 'get_syscall_info':
get_syscall_info.c:93:3: error: 'for' loop initial declarations are only allowed in C99 mode
   for (unsigned int i = 0; i < ARRAY_SIZE(args); ++i) {
   ^
get_syscall_info.c:93:3: note: use option -std=c99 or -std=gnu99 to compile your code
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
---
 tools/testing/selftests/ptrace/Makefile   |  4 +-
 tools/testing/selftests/ptrace/vmaccess.c | 86 +++++++++++++++++++++++++++++++
 2 files changed, 88 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/ptrace/vmaccess.c

diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index c0b7f89..2f1f532 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -iquote../../../../include/uapi -Wall
+CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
 
-TEST_GEN_PROGS := get_syscall_info peeksiginfo
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
 
 include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..4db327b
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,86 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger bernd.edlinger@hotmail.de
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+	ptrace(PTRACE_TRACEME, 0, 0L, 0L);
+	return NULL;
+}
+
+TEST(vmaccess)
+{
+	int f, pid = fork();
+	char mm[64];
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("true", "true", NULL);
+	}
+
+	sleep(1);
+	sprintf(mm, "/proc/%d/mem", pid);
+	f = open(mm, O_RDONLY);
+	ASSERT_GE(f, 0);
+	close(f);
+	f = kill(pid, SIGCONT);
+	ASSERT_EQ(f, 0);
+}
+
+TEST(attach)
+{
+	int s, k, pid = fork();
+
+	if (!pid) {
+		pthread_t pt;
+
+		pthread_create(&pt, NULL, thread, NULL);
+		pthread_join(pt, NULL);
+		execlp("sleep", "sleep", "2", NULL);
+	}
+
+	sleep(1);
+	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	ASSERT_EQ(errno, EAGAIN);
+	ASSERT_EQ(k, -1);
+	k = waitpid(-1, &s, WNOHANG);
+	ASSERT_NE(k, -1);
+	ASSERT_NE(k, 0);
+	ASSERT_NE(k, pid);
+	ASSERT_EQ(WIFEXITED(s), 1);
+	ASSERT_EQ(WEXITSTATUS(s), 0);
+	sleep(1);
+	k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+	ASSERT_EQ(k, 0);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFSTOPPED(s), 1);
+	ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
+	k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
+	ASSERT_EQ(k, 0);
+	k = waitpid(-1, &s, 0);
+	ASSERT_EQ(k, pid);
+	ASSERT_EQ(WIFEXITED(s), 1);
+	ASSERT_EQ(WEXITSTATUS(s), 0);
+	k = waitpid(-1, NULL, 0);
+	ASSERT_EQ(k, -1);
+	ASSERT_EQ(errno, ECHILD);
+}
+
+TEST_HARNESS_MAIN
On Tue, Mar 10, 2020 at 02:44:01PM +0100, Bernd Edlinger wrote:
This adds test cases for ptrace deadlocks.
Additionally fixes a compile problem in get_syscall_info.c, observed with gcc-4.8.4:
get_syscall_info.c: In function 'get_syscall_info': get_syscall_info.c:93:3: error: 'for' loop initial declarations are only allowed in C99 mode for (unsigned int i = 0; i < ARRAY_SIZE(args); ++i) { ^ get_syscall_info.c:93:3: note: use option -std=c99 or -std=gnu99 to compile your code
*discomfort noises* (see below)
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
tools/testing/selftests/ptrace/Makefile | 4 +- tools/testing/selftests/ptrace/vmaccess.c | 86 +++++++++++++++++++++++++++++++ 2 files changed, 88 insertions(+), 2 deletions(-) create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile index c0b7f89..2f1f532 100644 --- a/tools/testing/selftests/ptrace/Makefile +++ b/tools/testing/selftests/ptrace/Makefile @@ -1,6 +1,6 @@ # SPDX-License-Identifier: GPL-2.0-only -CFLAGS += -iquote../../../../include/uapi -Wall +CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
This isn't the common solution in the kernel (the variable declaration would just be lifted out of the loop), but as it's selftest code, which does lots of special things ... I *guess* this is okay.
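For reference, the alternative mentioned here (lifting the declaration out of the loop) might look roughly like the sketch below. The function name sum_args and the array contents are made up for illustration and are not taken from get_syscall_info.c; only the shape of the loop matters.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical helper, not part of get_syscall_info.c. */
unsigned long sum_args(void)
{
	unsigned long args[6] = { 1, 2, 3, 4, 5, 6 };
	unsigned long sum = 0;
	unsigned int i;	/* declared at block scope instead of in the for statement */

	/* No declaration inside for (...), so gcc's older default C mode accepts it. */
	for (i = 0; i < ARRAY_SIZE(args); ++i)
		sum += args[i];

	return sum;
}
```

Written this way the loop compiles without -std=c99 or -std=gnu99, which is why it is the more usual fix in kernel code.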
-TEST_GEN_PROGS := get_syscall_info peeksiginfo +TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
I love having this deadlock test added to the selftests.
I think I need to make an improvement to the test harness, though, as the failure mode right now just blows up after the 30 second timeout and leaves this deadlocked:
$ ./vmaccess
[==========] Running 2 tests from 1 test cases.
[ RUN      ] global.vmaccess
Alarm clock
$ ps
  PID TTY          TIME CMD
 2605 pts/0    00:00:00 bash
23360 pts/0    00:00:00 vmaccess
23361 pts/0    00:00:00 vmaccess
23363 pts/0    00:00:00 ps
But that's mostly unrelated to this code.
Reviewed-by: Kees Cook keescook@chromium.org
-Kees
On Tue, Mar 10, 2020 at 02:44:01PM +0100, Bernd Edlinger wrote:
This adds test cases for ptrace deadlocks.
Additionally fixes a compile problem in get_syscall_info.c, observed with gcc-4.8.4:
get_syscall_info.c: In function 'get_syscall_info': get_syscall_info.c:93:3: error: 'for' loop initial declarations are only allowed in C99 mode for (unsigned int i = 0; i < ARRAY_SIZE(args); ++i) { ^ get_syscall_info.c:93:3: note: use option -std=c99 or -std=gnu99 to compile your code
[...]
@@ -1,6 +1,6 @@ # SPDX-License-Identifier: GPL-2.0-only -CFLAGS += -iquote../../../../include/uapi -Wall +CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
Wouldn't it be better to choose -std=gnu99 over -std=c99?
This removes a duplicate "a" in the comment in process_vm_rw_core.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
---
 mm/process_vm_access.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
 	if (!mm || IS_ERR(mm)) {
 		rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		/*
-		 * Explicitly map EACCES to EPERM as EPERM is a more a
+		 * Explicitly map EACCES to EPERM as EPERM is a more
 		 * appropriate error code for process_vw_readv/writev
 		 */
 		if (rc == -EACCES)
On Tue, Mar 10, 2020 at 02:44:10PM +0100, Bernd Edlinger wrote:
This removes a duplicate "a" in the comment in process_vm_rw_core.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
Reviewed-by: Kees Cook keescook@chromium.org
-Kees
This removes an outdated comment in prepare_kernel_cred.
There is no "cred_replace_mutex" any more, so the comment must go away.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
---
 kernel/cred.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..71a7926 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -675,8 +675,6 @@ void __init cred_init(void)
  * The caller may change these controls afterwards if desired.
  *
  * Returns the new credentials or NULL if out of memory.
- *
- * Does not take, and does not return holding current->cred_replace_mutex.
  */
 struct cred *prepare_kernel_cred(struct task_struct *daemon)
 {
On Tue, Mar 10, 2020 at 02:44:18PM +0100, Bernd Edlinger wrote:
This removes an outdated comment in prepare_kernel_cred.
There is no "cred_replace_mutex" any more, so the comment must go away.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
Reviewed-by: Kees Cook keescook@chromium.org
-Kees
On Mon, Mar 09, 2020 at 02:02:37PM -0500, Eric W. Biederman wrote:
Bernd Edlinger bernd.edlinger@hotmail.de writes:
On 3/9/20 7:36 PM, Eric W. Biederman wrote:
Does that sound better?
almost done.
I think this text is finally clean.
exec: Add exec_update_mutex to replace cred_guard_mutex
The cred_guard_mutex is problematic as it is held over possibly indefinite waits for userspace. The possilbe indefinite waits for
-------------------------------------------^^^^^^^^ possible?
"Dmitry V. Levin" ldv@altlinux.org writes:
The cred_guard_mutex is problematic as it is held over possibly indefinite waits for userspace. The possilbe indefinite waits for
-------------------------------------------^^^^^^^^ possible?
Yes. Thank you. Fixed.
Eric
On Mon, Mar 09, 2020 at 02:02:37PM -0500, Eric W. Biederman wrote:
exec: Add exec_update_mutex to replace cred_guard_mutex
The cred_guard_mutex is problematic as it is held over possibly indefinite waits for userspace. The possible indefinite waits for userspace that I have identified are:
- The cred_guard_mutex is held in PTRACE_EVENT_EXIT waiting for the tracer.
- The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
- The cred_guard_mutex is held over "get_user(futex_offset, ...)" in exit_robust_list.
- The cred_guard_mutex is held over copy_strings.
I suspect you're not trying to make a comprehensive list here, but do you want to mention seccomp too (since it's yet another weird case).
[...] Holding a mutex over any of those possibly indefinite waits for userspace does not appear necessary. Add exec_update_mutex that will just cover updating the process during exec where the permissions and the objects pointed to by the task struct may be out of sync.
Should the specific resources be pointed out here? creds, mm, ... ?
But otherwise, yup, looks sane:
Reviewed-by: Kees Cook keescook@chromium.org
Kees Cook keescook@chromium.org writes:
On Mon, Mar 09, 2020 at 02:02:37PM -0500, Eric W. Biederman wrote:
exec: Add exec_update_mutex to replace cred_guard_mutex
The cred_guard_mutex is problematic as it is held over possibly indefinite waits for userspace. The possible indefinite waits for userspace that I have identified are:
- The cred_guard_mutex is held in PTRACE_EVENT_EXIT waiting for the tracer.
- The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
- The cred_guard_mutex is held over "get_user(futex_offset, ...)" in exit_robust_list.
- The cred_guard_mutex is held over copy_strings.
I suspect you're not trying to make a comprehensive list here, but do you want to mention seccomp too (since it's yet another weird case).
I was calling out all of the places I have found so far where cred_guard_mutex is held over waiting for userspace to maybe do something. Those places are what cause our deadlocks.
[...] Holding a mutex over any of those possibly indefinite waits for userspace does not appear necessary. Add exec_update_mutex that will just cover updating the process during exec where the permissions and the objects pointed to by the task struct may be out of sync.
Should the specific resources be pointed out here? creds, mm, ... ?
But otherwise, yup, looks sane:
Probably not. The design is if exec changes it we will hold the cred_guard_mutex over it, so things are semi-atomic.
Reviewed-by: Kees Cook keescook@chromium.org
Eric
On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman ebiederm@xmission.com wrote:
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
Any of those can result in deadlock, as the cred_guard_mutex is held over possibly indefinite waits for userspace.
Add exec_update_mutex, which is held only while exec updates the process with the new contents of exec. It is meant for code that must not be confused by exec changing the mm and the cred in ways that cannot happen during ordinary execution of a process.
The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions.
[...]
@@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm) return -EINTR; } }
ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
if (ret)
return ret;
We're already holding the old mmap_sem, and now nest the exec_update_mutex inside it; but then while still holding the exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(), which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think at least lockdep will be unhappy, and I'm not sure whether it's an actual problem or not.
Jann Horn jannh@google.com writes:
We're already holding the old mmap_sem, and now nest the exec_update_mutex inside it; but then while still holding the exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(), which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think at least lockdep will be unhappy, and I'm not sure whether it's an actual problem or not.
Good point. I should double check the lock ordering here with mmap_sem. It doesn't look like mmput takes mmap_sem, but still there might be a lock inversion of some kind here. At least as far as lockdep is concerned and we don't want anything like that.
Eric
On Tue, Mar 10, 2020 at 10:33 PM Eric W. Biederman ebiederm@xmission.com wrote:
Good point. I should double check the lock ordering here with mmap_sem. It doesn't look like mmput takes mmap_sem
You sure about that? mmput() -> __mmput() -> ksm_exit() -> __ksm_exit() -> down_write(&mm->mmap_sem)
Or also: mmput() -> __mmput() -> khugepaged_exit() -> __khugepaged_exit() -> down_write(&mm->mmap_sem)
Or is there a reason why those paths can't happen?
Jann Horn jannh@google.com writes:
You sure about that? mmput() -> __mmput() -> ksm_exit() -> __ksm_exit() -> down_write(&mm->mmap_sem)
Or also: mmput() -> __mmput() -> khugepaged_exit() -> __khugepaged_exit() -> down_write(&mm->mmap_sem)
Or is there a reason why those paths can't happen?
Clearly I didn't look far enough.
I will adjust this so that exec_update_mutex is taken before mmap_sem. Anything else is just asking for trouble.
Eric
On 3/11/20 1:15 AM, Eric W. Biederman wrote:
Clearly I didn't look far enough.
I will adjust this so that exec_update_mutex is taken before mmap_sem. Anything else is just asking for trouble.
Note that vm_access does also mmput under the exec_update_mutex. So I don't see a huge problem here. But maybe I missed something.
Bernd.
Bernd Edlinger bernd.edlinger@hotmail.de writes:
Note that vm_access does also mmput under the exec_update_mutex. So I don't see a huge problem here. But maybe I missed something.
The issue is that, to prevent deadlock, locks must always be taken in the same order.
Taking mmap_sem then exec_update_mutex at the start of the function, then taking exec_update_mutex then mmap_sem in mmput, takes the two locks in two different orders. That means that in the right set of circumstances:

thread1:                        thread2:
obtain mmap_sem                 obtain exec_update_mutex
wait for exec_update_mutex      wait for mmap_sem

Which guarantees that neither thread will make progress.
The fix is easy: I just need to take exec_update_mutex a few lines earlier.
Eric
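The ordering rule described above can be sketched in userspace with pthreads. This is only an analogy and not kernel code: outer_lock and inner_lock are hypothetical stand-ins for exec_update_mutex and mmap_sem, and run_ordered_workers is an invented helper. Both threads take the locks in the same order, so neither can end up holding one lock while waiting forever for the other.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins: outer_lock ~ exec_update_mutex, inner_lock ~ mmap_sem. */
static pthread_mutex_t outer_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t inner_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;

static void *worker(void *arg)
{
	long iterations = *(long *)arg;
	long i;

	for (i = 0; i < iterations; i++) {
		/* Every thread uses the same order: outer first, then inner. */
		pthread_mutex_lock(&outer_lock);
		pthread_mutex_lock(&inner_lock);
		shared_counter++;
		pthread_mutex_unlock(&inner_lock);
		pthread_mutex_unlock(&outer_lock);
	}
	return NULL;
}

/* Runs two workers and returns the final counter, or -1 if a thread cannot be created. */
long run_ordered_workers(long iterations)
{
	pthread_t a, b;

	shared_counter = 0;
	if (pthread_create(&a, NULL, worker, &iterations))
		return -1;
	if (pthread_create(&b, NULL, worker, &iterations)) {
		pthread_join(a, NULL);
		return -1;
	}
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return shared_counter;
}
```

If one worker took inner_lock before outer_lock instead, the two-column interleaving shown above would become possible and the program could hang.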
On Sun, 2020-03-08 at 16:38 -0500, Eric W. Biederman wrote:
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
Any of those can result in deadlock, as the cred_guard_mutex is held over possibly indefinite waits for userspace.
Add exec_update_mutex, which is held only while exec updates the process with the new contents of exec. It is meant for code that must not be confused by exec changing the mm and the cred in ways that cannot happen during ordinary execution of a process.
The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03M...
Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
This patch will trigger a warning during boot,
[ 19.707214][ T1] pci 0035:01:00.0: enabling device (0545 -> 0547)
[ 19.707287][ T1] EEH: Capable adapter found: recovery enabled.
[ 19.732541][ T1] cpuidle-powernv: Default stop: psscr = 0x0000000000000330,mask=0x00000000003003ff
[ 19.732567][ T1] cpuidle-powernv: Deepest stop: psscr = 0x0000000000300375,mask=0x00000000003003ff
[ 19.732598][ T1] cpuidle-powernv: First stop level that may lose SPRs = 0x4
[ 19.732617][ T1] cpuidle-powernv: First stop level that may lose timebase = 0x10
[ 19.769784][ T1] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[ 19.769810][ T1] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[ 19.789344][ T718]
[ 19.789367][ T718] =====================================
[ 19.789379][ T718] WARNING: bad unlock balance detected!
[ 19.789393][ T718] 5.6.0-rc5-next-20200311+ #4 Not tainted
[ 19.789414][ T718] -------------------------------------
[ 19.789426][ T718] kworker/u257:0/718 is trying to release lock (&sig->exec_update_mutex) at:
[ 19.789459][ T718] [<c0000000004c6770>] free_bprm+0xe0/0xf0
[ 19.789481][ T718] but there are no more locks to release!
[ 19.789502][ T718]
[ 19.789502][ T718] other info that might help us debug this:
[ 19.789537][ T718] 1 lock held by kworker/u257:0/718:
[ 19.789558][ T718] #0: c000001fa8842808 (&sig->cred_guard_mutex){+.+.}, at: __do_execve_file.isra.33+0x1b0/0xda0
[ 19.789611][ T718]
[ 19.789611][ T718] stack backtrace:
[ 19.789645][ T718] CPU: 8 PID: 718 Comm: kworker/u257:0 Not tainted 5.6.0-rc5-next-20200311+ #4
[ 19.789681][ T718] Call Trace:
[ 19.789703][ T718] [c000000dad8cfa70] [c000000000979b40] dump_stack+0xf4/0x164 (unreliable)
[ 19.789742][ T718] [c000000dad8cfac0] [c0000000001c1d78] print_unlock_imbalance_bug+0x118/0x140
[ 19.789780][ T718] [c000000dad8cfb40] [c0000000001ceaa0] lock_release+0x270/0x520
[ 19.789817][ T718] [c000000dad8cfbf0] [c0000000009a2898] __mutex_unlock_slowpath+0x68/0x400
[ 19.789854][ T718] [c000000dad8cfcc0] [c0000000004c6770] free_bprm+0xe0/0xf0
[ 19.789900][ T718] [c000000dad8cfcf0] [c0000000004c845c] __do_execve_file.isra.33+0x44c/0xda0
__do_execve_file at fs/exec.c:1904
[ 19.789938][ T718] [c000000dad8cfde0] [c0000000001391d8] call_usermodehelper_exec_async+0x218/0x250
[ 19.789977][ T718] [c000000dad8cfe20] [c00000000000b748] ret_from_kernel_thread+0x5c/0x74
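As a loose userspace analogy to this "bad unlock balance" report (lockdep itself is kernel-only), an error-checking pthread mutex also refuses an unlock from a thread that never took the lock. The helper name unlock_without_lock is invented for this sketch, which says nothing about how lockdep is implemented.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: unlock an error-checking mutex that was never locked. */
int unlock_without_lock(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t lock;
	int ret;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&lock, &attr);
	pthread_mutexattr_destroy(&attr);

	/* The calling thread does not own the mutex, so POSIX requires an error here. */
	ret = pthread_mutex_unlock(&lock);

	pthread_mutex_destroy(&lock);
	return ret;
}
```

The trace above has the same shape: free_bprm() releases exec_update_mutex on a path where only cred_guard_mutex is reported as held.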
 fs/exec.c                    | 9 +++++++++
 include/linux/sched/signal.h | 9 ++++++++-
 init/init_task.c             | 1 +
 kernel/fork.c                | 1 +
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index d820a7272a76..ffeebb1f167b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
 	struct mm_struct *old_mm, *active_mm;
+	int ret;
 
 	/* Notify parent that we're no longer interested in the old VM */
 	tsk = current;
@@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
 			return -EINTR;
 		}
 	}
+
+	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
+	if (ret)
+		return ret;
+
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
 	membarrier_exec_mmap(mm);
@@ -1438,6 +1444,8 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (!bprm->mm)
+			mutex_unlock(&current->signal->exec_update_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1487,6 +1495,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->exec_update_mutex);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 88050259c466..a29df79540ce 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -224,7 +224,14 @@ struct signal_struct {
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
-					 * (notably. ptrace) */
+					 * (notably. ptrace)
+					 * Deprecated do not use in new code.
+					 * Use exec_update_mutex instead.
+					 */
+	struct mutex exec_update_mutex;	/* Held while task_struct is being
+					 * updated during exec, and may have
+					 * inconsistent permissions.
+					 */
 } __randomize_layout;
 
 /*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5eab7b..bd403ed3e418 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@ static struct signal_struct init_signals = {
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+	.exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
 #ifdef CONFIG_POSIX_TIMERS
 	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
 	.cputimer	= {
diff --git a/kernel/fork.c b/kernel/fork.c
index 60a1295f4384..12896a6ecee6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->exec_update_mutex);
 
 	return 0;
 }
On 09.03.2020 00:38, Eric W. Biederman wrote:
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
Any of those can result in deadlock, as the cred_guard_mutex is held over a possibly indefinite wait for userspace.
Add exec_update_mutex, which is held only while exec updates the process with the new contents of exec, for use by code that must not be confused by exec changing the mm and the cred in ways that cannot happen during ordinary execution of a process.
The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03M...
Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
---
 fs/exec.c                    | 9 +++++++++
 include/linux/sched/signal.h | 9 ++++++++-
 init/init_task.c             | 1 +
 kernel/fork.c                | 1 +
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index d820a7272a76..ffeebb1f167b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
 	struct mm_struct *old_mm, *active_mm;
+	int ret;
 
 	/* Notify parent that we're no longer interested in the old VM */
 	tsk = current;
@@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
 			return -EINTR;
 		}
 	}
+
+	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
+	if (ret)
+		return ret;
You missed old_mm->mmap_sem unlock. See here:
diff --git a/fs/exec.c b/fs/exec.c
index 47582cd97f86..d557bac3e862 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1063,8 +1063,11 @@ static int exec_mmap(struct mm_struct *mm)
 	}
 
 	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
-	if (ret)
+	if (ret) {
+		if (old_mm)
+			up_read(&old_mm->mmap_sem);
 		return ret;
+	}
 
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
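To make the missing unlock concrete, here is a minimal userspace sketch of the error path in the diff above. It is not kernel code: a pthread_rwlock_t and a pthread_mutex_t stand in for old_mm->mmap_sem and exec_update_mutex, and the names exec_mmap_model, mmap_sem_is_free and the fail_second parameter are hypothetical, introduced only for illustration. Unlike the real exec_mmap(), the model drops both locks on success.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical userspace stand-ins for old_mm->mmap_sem and exec_update_mutex. */
static pthread_rwlock_t mmap_sem_model = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t exec_update_model = PTHREAD_MUTEX_INITIALIZER;

/*
 * Models the tail of exec_mmap() as it stood in the patch under review:
 * the read lock is already held when the second lock is attempted.
 */
static int exec_mmap_model(int have_old_mm, int fail_second)
{
	int ret;

	if (have_old_mm && pthread_rwlock_rdlock(&mmap_sem_model) != 0)
		return -EAGAIN;

	/* fail_second simulates mutex_lock_killable() being interrupted. */
	ret = fail_second ? -EINTR : -pthread_mutex_lock(&exec_update_model);
	if (ret) {
		/* The fix under discussion: drop the read lock before returning. */
		if (have_old_mm)
			pthread_rwlock_unlock(&mmap_sem_model);
		return ret;
	}

	/* ... the mm switch would happen here ... */

	/* Simplification: the model releases both locks on success. */
	pthread_mutex_unlock(&exec_update_model);
	if (have_old_mm)
		pthread_rwlock_unlock(&mmap_sem_model);
	return 0;
}

/* Returns 1 if nobody holds the rwlock, i.e. a writer can take it at once. */
static int mmap_sem_is_free(void)
{
	if (pthread_rwlock_trywrlock(&mmap_sem_model) != 0)
		return 0;
	pthread_rwlock_unlock(&mmap_sem_model);
	return 1;
}
```

After exec_mmap_model(1, 1) fails with -EINTR, mmap_sem_is_free() reports the read lock as released; without the inner pthread_rwlock_unlock() call the lock would remain held after the early return.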
Kirill Tkhai ktkhai@virtuozzo.com writes:
You missed old_mm->mmap_sem unlock. See here:
Duh. Thank you.
I actually need to switch the lock ordering here, and I haven't yet because my son was sick yesterday.
Something like this.
diff --git a/fs/exec.c b/fs/exec.c
index 96f89401b4d1..03d50c27ec01 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1020,9 +1020,14 @@ static int exec_mmap(struct mm_struct *mm)
 	tsk = current;
 	old_mm = current->mm;
 	exec_mm_release(tsk, old_mm);
+	if (old_mm)
+		sync_mm_rss(old_mm);
+
+	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
+	if (ret)
+		return ret;
 
 	if (old_mm) {
-		sync_mm_rss(old_mm);
 		/*
 		 * Make sure that if there is a core dump in progress
 		 * for the old mm, we get out and die instead of going
@@ -1032,14 +1037,11 @@ static int exec_mmap(struct mm_struct *mm)
 		down_read(&old_mm->mmap_sem);
 		if (unlikely(old_mm->core_state)) {
 			up_read(&old_mm->mmap_sem);
+			mutex_unlock(&tsk->signal->exec_update_mutex);
 			return -EINTR;
 		}
 	}
 
-	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
-	if (ret)
-		return ret;
-
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
 	membarrier_exec_mmap(mm);
Eric
On 12.03.2020 15:24, Eric W. Biederman wrote:
Duh. Thank you.
I actually need to switch the lock ordering here, and I haven't yet because my son was sick yesterday.
There is some fundamental problem with your patch, since the below fires in 100% of cases on current linux-next:
[ 22.838717] kernel BUG at fs/exec.c:1474!
diff --git a/fs/exec.c b/fs/exec.c
index 47582cd97f86..0f77f8c94905 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1470,8 +1470,10 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
-		if (!bprm->mm)
+		if (!bprm->mm) {
+			BUG_ON(!mutex_is_locked(&current->signal->exec_update_mutex));
 			mutex_unlock(&current->signal->exec_update_mutex);
+		}
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1521,6 +1523,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	BUG_ON(!mutex_is_locked(&current->signal->exec_update_mutex));
 	mutex_unlock(&current->signal->exec_update_mutex);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
---------------------------------------------------------------------------------------------
The first time, the mutex is unlocked in:
exec_binprm()->search_binary_handler()->.load_binary->install_exec_creds()
Then exec_binprm()->search_binary_handler()->.load_binary->flush_old_exec() clears mm:
bprm->mm = NULL;
The second time, the mutex is unlocked in free_bprm():
if (bprm->cred) { if (!bprm->mm) mutex_unlock(&current->signal->exec_update_mutex);
In my opinion we should not rely on side indicators like bprm->mm. It would be better to introduce struct linux_binprm::exec_update_mutex_is_locked, so that the next person dealing with this after you won't waste much time digging into it. Also, if someone decides to move the place where bprm->mm is set to NULL, that person will run into a hell of dependencies between unrelated components like your newly introduced mutex.
So, I'm strongly for *struct linux_binprm::exec_update_mutex_is_locked*, since this improves modularity.
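The double unlock and the proposed explicit flag can be modelled in userspace. The sketch below is an illustration under stated assumptions, not the kernel implementation: struct bprm_model, its exec_update_locked field (named after the exec_update_mutex_is_locked suggestion above) and all helper functions are hypothetical. It uses a PTHREAD_MUTEX_ERRORCHECK mutex so that an unbalanced unlock returns an error instead of being undefined behaviour, which plays roughly the role that BUG_ON(!mutex_is_locked(...)) plays in the debugging patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of linux_binprm carrying an explicit "locked" flag. */
struct bprm_model {
	pthread_mutex_t exec_update;	/* stands in for exec_update_mutex */
	bool exec_update_locked;	/* true only while this exec holds it */
};

static int bprm_model_init(struct bprm_model *b)
{
	pthread_mutexattr_t attr;
	int ret = pthread_mutexattr_init(&attr);

	if (ret)
		return ret;
	/* ERRORCHECK: unlocking a mutex that is not held returns an error. */
	ret = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (!ret)
		ret = pthread_mutex_init(&b->exec_update, &attr);
	pthread_mutexattr_destroy(&attr);
	b->exec_update_locked = false;
	return ret;
}

static int bprm_model_lock(struct bprm_model *b)
{
	int ret = pthread_mutex_lock(&b->exec_update);

	if (!ret)
		b->exec_update_locked = true;
	return ret;
}

/* Unlocks only if this exec attempt still holds the mutex; safe to repeat. */
static int bprm_model_unlock(struct bprm_model *b)
{
	if (!b->exec_update_locked)
		return 0;
	b->exec_update_locked = false;
	return pthread_mutex_unlock(&b->exec_update);
}

/* Two unconditional unlocks; returns the result of the second one. */
static int raw_double_unlock(void)
{
	struct bprm_model b;
	int ret;

	if (bprm_model_init(&b) || bprm_model_lock(&b))
		return -1;
	assert(b.exec_update_locked);
	pthread_mutex_unlock(&b.exec_update);		/* as in install_exec_creds() */
	ret = pthread_mutex_unlock(&b.exec_update);	/* as in free_bprm() */
	pthread_mutex_destroy(&b.exec_update);
	return ret;
}

/* The same sequence routed through the flag; the second call is a no-op. */
static int flagged_double_unlock(void)
{
	struct bprm_model b;
	int ret;

	if (bprm_model_init(&b) || bprm_model_lock(&b))
		return -1;
	if (bprm_model_unlock(&b))
		return -1;
	ret = bprm_model_unlock(&b);
	pthread_mutex_destroy(&b.exec_update);
	return ret;
}
```

The point of the flag is that the unlock decision depends only on state owned by the lock itself, so moving the place where bprm->mm is cleared cannot change it.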
Kirill Tkhai ktkhai@virtuozzo.com writes:
There is some fundamental problem with your patch, since the below fires in 100% of cases on current linux-next:
Thank you.
I have just backed this out of linux-next for now because it is clearly flawed.
You make some good points about the recursion. I will go back to the drawing board and see what I can work out.
So, I'm strongly for *struct linux_binprm::exec_update_mutex_is_locked*, since this improves modularity.
Am I wrong or is that also a problem with cred_guard_mutex?
Eric
On 12.03.2020 17:38, Eric W. Biederman wrote:
Am I wrong or is that also a problem with cred_guard_mutex?
No, there is no problem.
cred_guard_mutex is locked together with the bprm->cred = prepare_exec_creds() assignment.
cred_guard_mutex is unlocked together with clearing bprm->cred to NULL (see install_exec_creds()). After that, free_bprm() skips the unlock when bprm->cred is NULL.
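The pairing rule described here (lock taken with the pointer assignment, lock dropped with the pointer clearing) can be sketched as a small userspace model. Everything in it is hypothetical and only illustrates the invariant "cred is non-NULL exactly while the mutex is held"; it is not how the kernel functions are implemented.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for signal->cred_guard_mutex. */
static pthread_mutex_t cred_guard_model = PTHREAD_MUTEX_INITIALIZER;

struct cred_model { int uid; };
struct bprm_model { struct cred_model *cred; };

/* Models prepare_bprm_creds(): lock and assignment happen together. */
static int prepare_creds_model(struct bprm_model *b, struct cred_model *c)
{
	int ret = pthread_mutex_lock(&cred_guard_model);

	if (ret)
		return -ret;
	b->cred = c;
	return 0;
}

/* Models install_exec_creds(): clearing and unlock happen together. */
static void install_creds_model(struct bprm_model *b)
{
	assert(b->cred != NULL);
	b->cred = NULL;
	pthread_mutex_unlock(&cred_guard_model);
}

/* Models free_bprm(): a NULL cred means the mutex was already dropped. */
static void free_bprm_model(struct bprm_model *b)
{
	if (b->cred) {
		pthread_mutex_unlock(&cred_guard_model);
		b->cred = NULL;
	}
}

/* Returns 1 if the mutex is currently free. */
static int cred_guard_is_free(void)
{
	if (pthread_mutex_trylock(&cred_guard_model) != 0)
		return 0;
	pthread_mutex_unlock(&cred_guard_model);
	return 1;
}

/*
 * Runs one exec attempt. With reach_install set, the success path calls
 * install and then free; otherwise only free runs, as on an error path.
 * Returns 1 if the mutex ends up released.
 */
static int run_path(int reach_install)
{
	struct cred_model c = { .uid = 0 };
	struct bprm_model b = { .cred = NULL };

	if (prepare_creds_model(&b, &c))
		return -1;
	if (reach_install)
		install_creds_model(&b);
	free_bprm_model(&b);
	return cred_guard_is_free();
}
```

Both paths release the mutex exactly once, because the indicator that free_bprm_model() checks is changed in the same place as the lock state.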
On 3/12/20 3:38 PM, Eric W. Biederman wrote:
Kirill Tkhai ktkhai@virtuozzo.com writes:
On 12.03.2020 15:24, Eric W. Biederman wrote:
I actually need to switch the lock ordering here, and I haven't yet because my son was sick yesterday.
All the best wishes to you and your son. I hope he will get well soon.
And sorry for missing the issue in the review. The reason turns out to be that bprm_mm_init is called after prepare_bprm_creds, but there are error paths between the two where free_bprm is called with cred != NULL and mm == NULL, but with the mutex not locked.
I figured out a possible fix for the problem that was pointed out:
From ceb6f65b52b3a7f0280f4f20509a1564a439edf6 Mon Sep 17 00:00:00 2001
From: Bernd Edlinger bernd.edlinger@hotmail.de
Date: Wed, 11 Mar 2020 15:31:07 +0100
Subject: [PATCH] Fix issues with exec_update_mutex

Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
---
 fs/exec.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index ffeebb1..cde4937 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1021,8 +1021,14 @@ static int exec_mmap(struct mm_struct *mm)
 	old_mm = current->mm;
 	exec_mm_release(tsk, old_mm);
 
-	if (old_mm) {
+	if (old_mm)
 		sync_mm_rss(old_mm);
+
+	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
+	if (ret)
+		return ret;
+
+	if (old_mm) {
 		/*
 		 * Make sure that if there is a core dump in progress
 		 * for the old mm, we get out and die instead of going
@@ -1032,14 +1038,11 @@ static int exec_mmap(struct mm_struct *mm)
 		down_read(&old_mm->mmap_sem);
 		if (unlikely(old_mm->core_state)) {
 			up_read(&old_mm->mmap_sem);
+			mutex_unlock(&tsk->signal->exec_update_mutex);
 			return -EINTR;
 		}
 	}
 
-	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
-	if (ret)
-		return ret;
-
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
 	membarrier_exec_mmap(mm);
@@ -1444,8 +1447,6 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
-		if (!bprm->mm)
-			mutex_unlock(&current->signal->exec_update_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1846,6 +1847,8 @@ static int __do_execve_file(int fd, struct filename *filename,
 	would_dump(bprm, bprm->file);
 
 	retval = exec_binprm(bprm);
+	if (bprm->cred && !bprm->mm)
+		mutex_unlock(&current->signal->exec_update_mutex);
 	if (retval < 0)
 		goto out;
On 13.03.2020 04:05, Bernd Edlinger wrote:
I figured out a possible fix for the problem that was pointed out:
Although this should fix the problem, it looks like a broken puzzle.
We can't use bprm->cred as an indicator of whether the mutex was locked or not. We can check bprm->cred with regard to cred_guard_mutex because there is a strong rule: "cred_guard_mutex becomes locked together with the bprm->cred assignment (see prepare_bprm_creds()), and it becomes unlocked together with bprm->cred zeroing". Pay attention to the modularity of all this: there are no dependencies on anything else.
With regard to the newly introduced exec_update_mutex, both your fix and the original patch look like obfuscation. The mutex becomes tightly glued to the unrelated bprm->cred and bprm->mm, which will cause problems for future modification and maintenance of all the entities involved. If someone wants to move some functions relative to each other, it will be painful, and that person will have to walk the same path of dependencies and bugs that Eric stepped on in the original patch.
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
Any of those can result in deadlock, as the cred_guard_mutex is held over a possibly indefinite wait for userspace.
Add exec_update_mutex, which is held only while exec updates the process with the new contents of exec, for use by code that must not be confused by exec changing the mm and the cred in ways that cannot happen during ordinary execution of a process.
The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03M...
Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
---
 fs/exec.c                    | 17 ++++++++++++++---
 include/linux/binfmts.h      |  8 +++++++-
 include/linux/sched/signal.h |  9 ++++++++-
 init/init_task.c             |  1 +
 kernel/fork.c                |  1 +
 5 files changed, 31 insertions(+), 5 deletions(-)

v3: this update fixes lock-order and adds an explicit data member in linux_binprm

diff --git a/fs/exec.c b/fs/exec.c
index d820a72..11974a1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1014,12 +1014,17 @@ static int exec_mmap(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
 	struct mm_struct *old_mm, *active_mm;
+	int ret;
 
 	/* Notify parent that we're no longer interested in the old VM */
 	tsk = current;
 	old_mm = current->mm;
 	exec_mm_release(tsk, old_mm);
 
+	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
+	if (ret)
+		return ret;
+
 	if (old_mm) {
 		sync_mm_rss(old_mm);
 		/*
@@ -1031,9 +1036,11 @@ static int exec_mmap(struct mm_struct *mm)
 		down_read(&old_mm->mmap_sem);
 		if (unlikely(old_mm->core_state)) {
 			up_read(&old_mm->mmap_sem);
+			mutex_unlock(&tsk->signal->exec_update_mutex);
 			return -EINTR;
 		}
 	}
+
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
 	membarrier_exec_mmap(mm);
@@ -1288,11 +1295,12 @@ int flush_old_exec(struct linux_binprm * bprm)
 		goto out;
 
 	/*
-	 * After clearing bprm->mm (to mark that current is using the
-	 * prepared mm now), we have nothing left of the original
+	 * After setting bprm->called_exec_mmap (to mark that current is
+	 * using the prepared mm now), we have nothing left of the original
 	 * process. If anything from here on returns an error, the check
 	 * in search_binary_handler() will SEGV current.
 	 */
+	bprm->called_exec_mmap = 1;
 	bprm->mm = NULL;
 
 #ifdef CONFIG_POSIX_TIMERS
@@ -1438,6 +1446,8 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (bprm->called_exec_mmap)
+			mutex_unlock(&current->signal->exec_update_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1487,6 +1497,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->exec_update_mutex);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
@@ -1678,7 +1689,7 @@ int search_binary_handler(struct linux_binprm *bprm)
 
 		read_lock(&binfmt_lock);
 		put_binfmt(fmt);
-		if (retval < 0 && !bprm->mm) {
+		if (retval < 0 && bprm->called_exec_mmap) {
 			/* we got to flush_old_exec() and failed after it */
 			read_unlock(&binfmt_lock);
 			force_sigsegv(SIGSEGV);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..a345d9f 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,13 @@ struct linux_binprm {
 		 * exec has happened. Used to sanitize execution environment
 		 * and to set AT_SECURE auxv for glibc.
 		 */
-		secureexec:1;
+		secureexec:1,
+		/*
+		 * Set by flush_old_exec, when exec_mmap has been called.
+		 * This is past the point of no return, when the
+		 * exec_update_mutex has been taken.
+		 */
+		called_exec_mmap:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..a29df79 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -224,7 +224,14 @@ struct signal_struct {
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
-					 * (notably. ptrace) */
+					 * (notably. ptrace)
+					 * Deprecated do not use in new code.
+					 * Use exec_update_mutex instead.
+					 */
+	struct mutex exec_update_mutex;	/* Held while task_struct is being
+					 * updated during exec, and may have
+					 * inconsistent permissions.
+					 */
 } __randomize_layout;
 
 /*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..bd403ed 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+	.exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
 #ifdef CONFIG_POSIX_TIMERS
 	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
 	.cputimer	= {
diff --git a/kernel/fork.c b/kernel/fork.c
index 8642530..036b692 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->exec_update_mutex);
 
 	return 0;
 }
On 14.03.2020 12:11, Bernd Edlinger wrote:
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
Any of those can result in deadlock, as the cred_guard_mutex is held over a possibly indefinite wait for userspace.
Add exec_update_mutex, which is held only while exec updates the process with the new contents of exec, for use by code that must not be confused by exec changing the mm and the cred in ways that cannot happen during ordinary execution of a process.
The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03M...
Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
 fs/exec.c                    | 17 ++++++++++++++---
 include/linux/binfmts.h      |  8 +++++++-
 include/linux/sched/signal.h |  9 ++++++++-
 init/init_task.c             |  1 +
 kernel/fork.c                |  1 +
 5 files changed, 31 insertions(+), 5 deletions(-)
v3: this update fixes lock-order and adds an explicit data member in linux_binprm
diff --git a/fs/exec.c b/fs/exec.c
index d820a72..11974a1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1014,12 +1014,17 @@ static int exec_mmap(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
 	struct mm_struct *old_mm, *active_mm;
+	int ret;
 
 	/* Notify parent that we're no longer interested in the old VM */
 	tsk = current;
 	old_mm = current->mm;
 	exec_mm_release(tsk, old_mm);
 
+	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
+	if (ret)
+		return ret;
+
 	if (old_mm) {
 		sync_mm_rss(old_mm);
 		/*
@@ -1031,9 +1036,11 @@ static int exec_mmap(struct mm_struct *mm)
 		down_read(&old_mm->mmap_sem);
 		if (unlikely(old_mm->core_state)) {
 			up_read(&old_mm->mmap_sem);
+			mutex_unlock(&tsk->signal->exec_update_mutex);
 			return -EINTR;
 		}
 	}
+
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
 	membarrier_exec_mmap(mm);
@@ -1288,11 +1295,12 @@ int flush_old_exec(struct linux_binprm * bprm)
 		goto out;
 
 	/*
-	 * After clearing bprm->mm (to mark that current is using the
-	 * prepared mm now), we have nothing left of the original
+	 * After setting bprm->called_exec_mmap (to mark that current is
+	 * using the prepared mm now), we have nothing left of the original
 	 * process. If anything from here on returns an error, the check
 	 * in search_binary_handler() will SEGV current.
 	 */
+	bprm->called_exec_mmap = 1;
The two lines below form an inseparable pair:
exec_mmap(bprm->mm);
bprm->called_exec_mmap = 1;
Why not move the assignment into exec_mmap(), so that nobody can insert anything between them?
 	bprm->mm = NULL;
 
 #ifdef CONFIG_POSIX_TIMERS
@@ -1438,6 +1446,8 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (bprm->called_exec_mmap)
+			mutex_unlock(&current->signal->exec_update_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1487,6 +1497,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->exec_update_mutex);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
@@ -1678,7 +1689,7 @@ int search_binary_handler(struct linux_binprm *bprm)
 
 		read_lock(&binfmt_lock);
 		put_binfmt(fmt);
-		if (retval < 0 && !bprm->mm) {
+		if (retval < 0 && bprm->called_exec_mmap) {
 			/* we got to flush_old_exec() and failed after it */
 			read_unlock(&binfmt_lock);
 			force_sigsegv(SIGSEGV);
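The concern raised in this reply, that taking the lock and recording that fact must stay together, can be sketched in plain C. This is only a model under stated assumptions: `bprm_model`, `exec_mmap_model` and `free_bprm_model` are hypothetical names, and a boolean stands in for exec_update_mutex. It shows one way the flag and the lock state could be kept in step inside a single function, so that cleanup can key on the flag alone.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the two pieces of state under discussion. */
struct bprm_model {
	bool called_exec_mmap;	/* mirrors bprm->called_exec_mmap */
	bool update_mutex_held;	/* stands in for exec_update_mutex */
};

/*
 * Takes the lock and records that fact in one place.
 * Returns 0 with the lock held and the flag set, or -EINTR with
 * the lock released and the flag left clear.
 */
static int exec_mmap_model(struct bprm_model *bprm, bool core_dumping)
{
	bprm->update_mutex_held = true;		/* mutex_lock_killable() */

	if (core_dumping) {
		/* Error path: drop the lock before returning. */
		bprm->update_mutex_held = false;
		return -EINTR;
	}

	/* Set while the lock is still held; nothing can come in between. */
	bprm->called_exec_mmap = true;
	return 0;
}

/* Cleanup keyed on the flag, in the spirit of the free_bprm() hunk. */
static void free_bprm_model(struct bprm_model *bprm)
{
	if (bprm->called_exec_mmap) {
		assert(bprm->update_mutex_held);
		bprm->update_mutex_held = false;
	}
}
```

Because the flag is written only on the success path, `free_bprm_model()` never unlocks a lock that was not taken, and never leaves one held after a successful call.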
On 3/17/20 9:56 AM, Kirill Tkhai wrote:
On 14.03.2020 12:11, Bernd Edlinger wrote:
The two lines below form an inseparable pair:
exec_mmap(bprm->mm);
bprm->called_exec_mmap = 1;
Why not move the assignment into exec_mmap(), so that nobody can insert anything between them?
Hmm, could be done, but then I would probably need a different name than "called_exec_mmap".
How about adding a nice function comment to exec_mmap that calls out the changed behaviour that the exec_update_mutex is taken unless the function fails?
Bernd.
On 18.03.2020 00:53, Bernd Edlinger wrote:
On 3/17/20 9:56 AM, Kirill Tkhai wrote:
On 14.03.2020 12:11, Bernd Edlinger wrote:
Hmm, could be done, but then I would probably need a different name than "called_exec_mmap".
How about adding a nice function comment to exec_mmap that calls out the changed behaviour that the exec_update_mutex is taken unless the function fails?
I'm not sure I understand correctly.
Could you post this as a small patch hunk (on top of anything you want)?
Bernd.
On 3/18/20 1:22 PM, Kirill Tkhai wrote:
On 18.03.2020 00:53, Bernd Edlinger wrote:
On 3/17/20 9:56 AM, Kirill Tkhai wrote:
On 14.03.2020 12:11, Bernd Edlinger wrote:
Hmm, could be done, but then I would probably need a different name than "called_exec_mmap".
How about adding a nice function comment to exec_mmap that calls out the changed behaviour that the exec_update_mutex is taken unless the function fails?
I'm not sure I understand correctly.
Could you post this as a small patch hunk (on top of anything you want)?
I was thinking of something like that:
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1010,6 +1010,11 @@ ssize_t read_code(struct file *file, unsigned long addr,
 }
 EXPORT_SYMBOL(read_code);
 
+/*
+ * Maps the mm_struct mm into the current task struct.
+ * On success, this function returns with the mutex
+ * exec_update_mutex locked.
+ */
 static int exec_mmap(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
Bernd.
On 18.03.2020 23:06, Bernd Edlinger wrote:
On 3/18/20 1:22 PM, Kirill Tkhai wrote:
On 18.03.2020 00:53, Bernd Edlinger wrote:
On 3/17/20 9:56 AM, Kirill Tkhai wrote:
On 14.03.2020 12:11, Bernd Edlinger wrote:
I was thinking of something like that:
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1010,6 +1010,11 @@ ssize_t read_code(struct file *file, unsigned long addr,
 }
 EXPORT_SYMBOL(read_code);
 
+/*
+ * Maps the mm_struct mm into the current task struct.
+ * On success, this function returns with the mutex
+ * exec_update_mutex locked.
+ */
Looks OK to me.
 static int exec_mmap(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
Bernd.
On 3/19/20 8:13 AM, Kirill Tkhai wrote:
On 18.03.2020 23:06, Bernd Edlinger wrote:
I was thinking of something like that:
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1010,6 +1010,11 @@ ssize_t read_code(struct file *file, unsigned long addr,
 }
 EXPORT_SYMBOL(read_code);

+/*
+ * Maps the mm_struct mm into the current task struct.
+ * On success, this function returns with the mutex
+ * exec_update_mutex locked.
+ */
Looks OK to me.
Cool, yeah, then I will post an updated patch in a moment.
Thanks Bernd.
The cred_guard_mutex is problematic. The cred_guard_mutex is held over the userspace accesses as the arguments from userspace are read. The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other threads are killed. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm().
Any of those can result in deadlock, as the cred_guard_mutex is held over possibly indefinite waits for userspace.
Add exec_update_mutex, which is held only while exec updates the process with the new contents of exec, for use by code that must not be confused by exec changing the mm and the cred in ways that cannot happen during ordinary execution of a process.
The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions.
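The split between a long-held and a short-held lock can be modelled in user space. The following is a minimal sketch, assuming POSIX mutexes stand in for the kernel mutexes; `struct exec_locks` and `model_mm_access` are hypothetical names used only for illustration, and `pthread_mutex_trylock` replaces the blocking `mutex_lock_killable` of the kernel code purely so that the outcome can be observed from a single thread:

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical user-space model of the two locks described above. */
struct exec_locks {
	pthread_mutex_t cred_guard;	/* held across the whole exec, including waits */
	pthread_mutex_t exec_update;	/* held only while mm and creds are replaced */
};

/*
 * Model of a reader that only needs a consistent view of mm and creds.
 * It takes exec_update and never touches cred_guard.
 * Returns 0 if it could proceed, or EBUSY if the update is in progress.
 */
static int model_mm_access(struct exec_locks *l)
{
	int err = pthread_mutex_trylock(&l->exec_update);

	if (err)
		return err;
	/* ... a consistent state could be read here ... */
	pthread_mutex_unlock(&l->exec_update);
	return 0;
}
```

In this model a reader is turned away only during the short update window, not during the possibly long period in which the exec side holds `cred_guard` while waiting for other threads.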
Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03M...
Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Signed-off-by: "Eric W. Biederman" ebiederm@xmission.com
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
---
 fs/exec.c                    | 22 +++++++++++++++++++---
 include/linux/binfmts.h      |  8 +++++++-
 include/linux/sched/signal.h |  9 ++++++++-
 init/init_task.c             |  1 +
 kernel/fork.c                |  1 +
 5 files changed, 36 insertions(+), 5 deletions(-)

v3: this update fixes lock-order and adds an explicit data member in linux_binprm
v4: add a function comment to exec_mmap

diff --git a/fs/exec.c b/fs/exec.c
index d820a72..0e46ec5 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1010,16 +1010,26 @@ ssize_t read_code(struct file *file, unsigned long addr, loff_t pos, size_t len)
 }
 EXPORT_SYMBOL(read_code);

+/*
+ * Maps the mm_struct mm into the current task struct.
+ * On success, this function returns with the mutex
+ * exec_update_mutex locked.
+ */
 static int exec_mmap(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
 	struct mm_struct *old_mm, *active_mm;
+	int ret;

 	/* Notify parent that we're no longer interested in the old VM */
 	tsk = current;
 	old_mm = current->mm;
 	exec_mm_release(tsk, old_mm);

+	ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
+	if (ret)
+		return ret;
+
 	if (old_mm) {
 		sync_mm_rss(old_mm);
 		/*
@@ -1031,9 +1041,11 @@ static int exec_mmap(struct mm_struct *mm)
 		down_read(&old_mm->mmap_sem);
 		if (unlikely(old_mm->core_state)) {
 			up_read(&old_mm->mmap_sem);
+			mutex_unlock(&tsk->signal->exec_update_mutex);
 			return -EINTR;
 		}
 	}
+
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
 	membarrier_exec_mmap(mm);
@@ -1288,11 +1300,12 @@ int flush_old_exec(struct linux_binprm * bprm)
 		goto out;

 	/*
-	 * After clearing bprm->mm (to mark that current is using the
-	 * prepared mm now), we have nothing left of the original
+	 * After setting bprm->called_exec_mmap (to mark that current is
+	 * using the prepared mm now), we have nothing left of the original
 	 * process. If anything from here on returns an error, the check
 	 * in search_binary_handler() will SEGV current.
 	 */
+	bprm->called_exec_mmap = 1;
 	bprm->mm = NULL;

 #ifdef CONFIG_POSIX_TIMERS
@@ -1438,6 +1451,8 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
+		if (bprm->called_exec_mmap)
+			mutex_unlock(&current->signal->exec_update_mutex);
 		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
@@ -1487,6 +1502,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
+	mutex_unlock(&current->signal->exec_update_mutex);
 	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
@@ -1678,7 +1694,7 @@ int search_binary_handler(struct linux_binprm *bprm)

 	read_lock(&binfmt_lock);
 	put_binfmt(fmt);
-	if (retval < 0 && !bprm->mm) {
+	if (retval < 0 && bprm->called_exec_mmap) {
 		/* we got to flush_old_exec() and failed after it */
 		read_unlock(&binfmt_lock);
 		force_sigsegv(SIGSEGV);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..a345d9f 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,13 @@ struct linux_binprm {
 		 * exec has happened. Used to sanitize execution environment
 		 * and to set AT_SECURE auxv for glibc.
 		 */
-		secureexec:1;
+		secureexec:1,
+		/*
+		 * Set by flush_old_exec, when exec_mmap has been called.
+		 * This is past the point of no return, when the
+		 * exec_update_mutex has been taken.
+		 */
+		called_exec_mmap:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..a29df79 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -224,7 +224,14 @@ struct signal_struct {

 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
-					 * (notably. ptrace) */
+					 * (notably. ptrace)
+					 * Deprecated do not use in new code.
+					 * Use exec_update_mutex instead.
+					 */
+	struct mutex exec_update_mutex;	/* Held while task_struct is being
+					 * updated during exec, and may have
+					 * inconsistent permissions.
+					 */
 } __randomize_layout;

 /*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..bd403ed 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@
 	.multiprocess = HLIST_HEAD_INIT,
 	.rlim = INIT_RLIMITS,
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+	.exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
 #ifdef CONFIG_POSIX_TIMERS
 	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
 	.cputimer = {
diff --git a/kernel/fork.c b/kernel/fork.c
index 8642530..036b692 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;

 	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->exec_update_mutex);

 	return 0;
 }
Ah, sorry, this is actually v4 5/5. Should I send a new version or can you handle it?
On Thu, Mar 19, 2020 at 10:13:20AM +0100, Bernd Edlinger wrote:
Ah, sorry, this is actually v4 5/5. Should I send a new version or can you handle it?
This thread is a total crazy mess of different versions.
I know I can't unwind any of this, so I _STRONGLY_ suggest resending the whole series, properly versioned, as a new thread.
Would you want to try to pick out the proper patches from this pile?
thanks,
greg k-h
Yes, thanks, good suggestion.
I will do that in the evening.
Okay, meanwhile I collected everything I could find from this thread and sent it again:
[PATCH v6 00/16] Infrastructure to allow fixing exec deadlocks https://lore.kernel.org/lkml/AM6PR03MB5170B2F5BE24A28980D05780E4F50@AM6PR03M...
[PATCH v6 01/16] exec: Only compute current once in flush_old_exec https://lore.kernel.org/lkml/AM6PR03MB5170FC93B158EB8179F91D6AE4F50@AM6PR03M...
[PATCH v6 02/16] exec: Factor unshare_sighand out of de_thread and call it separately https://lore.kernel.org/lkml/AM6PR03MB51708AECEA6E05CAE2FDC166E4F50@AM6PR03M...
[PATCH v6 03/16] exec: Move cleanup of posix timers on exec out of de_thread https://lore.kernel.org/lkml/AM6PR03MB5170CCB8D8B36F6002446FBDE4F50@AM6PR03M...
[PATCH v6 04/16] exec: Move exec_mmap right after de_thread in flush_old_exec https://lore.kernel.org/lkml/AM6PR03MB5170FDB2C9B5225224B76398E4F50@AM6PR03M...
[PATCH v6 05/16] exec: Add exec_update_mutex to replace cred_guard_mutex https://lore.kernel.org/lkml/AM6PR03MB5170739C1B582B37E637279EE4F50@AM6PR03M...
[PATCH v6 06/16] exec: Fix a deadlock in strace https://lore.kernel.org/lkml/AM6PR03MB51709A321EBA829CC36EE1F8E4F50@AM6PR03M...
[PATCH v6 07/16] selftests/ptrace: add test cases for dead-locks https://lore.kernel.org/lkml/AM6PR03MB517022530A9BECDBCAADC8D2E4F50@AM6PR03M...
[PATCH v6 08/16] mm: docs: Fix a comment in process_vm_rw_core https://lore.kernel.org/lkml/AM6PR03MB517027F6ACBB4CF2D9BF014CE4F50@AM6PR03M...
[PATCH v6 09/16] kernel: doc: remove outdated comment cred.c https://lore.kernel.org/lkml/AM6PR03MB51705CEFAB7D02E6EA6CEBA6E4F50@AM6PR03M...
[PATCH v6 10/16] kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve https://lore.kernel.org/lkml/AM6PR03MB5170FFDE1D7BF09DD2663EDEE4F50@AM6PR03M...
[PATCH v6 11/16] proc: Use new infrastructure to fix deadlocks in execve https://lore.kernel.org/lkml/AM6PR03MB5170C4D177DD76E3C65E8033E4F50@AM6PR03M...
[PATCH v6 12/16] proc: io_accounting: Use new infrastructure to fix deadlocks in execve https://lore.kernel.org/lkml/AM6PR03MB51701CB541B08F21D56DCAC9E4F50@AM6PR03M...
[PATCH v6 13/16] perf: Use new infrastructure to fix deadlocks in execve https://lore.kernel.org/lkml/AM6PR03MB51704A188C3A1FA02B76B9EFE4F50@AM6PR03M...
[PATCH v6 14/16] pidfd: Use new infrastructure to fix deadlocks in execve https://lore.kernel.org/lkml/e2ae1c06-b205-a053-d36c-045be27b3138@hotmail.de...
[PATCH v6 15/16] exec: Fix dead-lock in de_thread with ptrace_attach https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.de...
[PATCH v6 16/16] doc: Update documentation of ->exec_*_mutex https://lore.kernel.org/lkml/3ce46b88-7ed3-2f21-c0ed-8f6055d38ebb@hotmail.de...
Each patch in this series builds on the previous one and is independent of the patches that follow it. So if one or more of these turn out to be controversial, the earlier patches are still an improvement, especially [PATCH v6 06/16], which fixes the deadlock in strace and with it the most important tracing deadlocks.
Thanks Bernd.
This completes the new infrastructure patch: it replaces the cred_guard_mutex with an exec_guard_mutex and a boolean that is set when a dead-lock situation is detected.
I also changed ptrace_traceme to use the new mutex. I consider it a bug that it did not take any mutex previously, since it calls security_ptrace_traceme, and all the security modules operate under the assumption that execve is not running in parallel.
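The interplay of the mutex and the boolean can be illustrated outside the kernel. This is a simplified user-space sketch under stated assumptions: POSIX mutexes stand in for kernel mutexes, the field names mirror the patch, and the `model_*` functions are hypothetical helpers, not kernel code. It only shows the protocol (drop the guard early when a sibling is traced, and make the other side fail with -EAGAIN instead of blocking):

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of the two fields; names mirror the patch. */
struct signal_model {
	pthread_mutex_t exec_guard_mutex;
	bool unsafe_execve_in_progress;	/* only valid with the mutex held */
};

/*
 * Exec side: take the guard. If a sibling thread is traced, set the
 * flag and drop the guard again so that a tracer never blocks on it.
 */
static void model_exec_begin(struct signal_model *sig, bool sibling_traced)
{
	pthread_mutex_lock(&sig->exec_guard_mutex);
	if (sibling_traced) {
		sig->unsafe_execve_in_progress = true;
		pthread_mutex_unlock(&sig->exec_guard_mutex);
	}
}

/* Exec side: re-take the guard if it was dropped, clear the flag, unlock. */
static void model_exec_end(struct signal_model *sig, bool dropped_early)
{
	if (dropped_early)
		pthread_mutex_lock(&sig->exec_guard_mutex);
	sig->unsafe_execve_in_progress = false;
	pthread_mutex_unlock(&sig->exec_guard_mutex);
}

/* Tracer side: 0 if attaching may proceed, -EAGAIN during an unsafe exec. */
static int model_attach(struct signal_model *sig)
{
	int ret = 0;

	pthread_mutex_lock(&sig->exec_guard_mutex);
	if (sig->unsafe_execve_in_progress)
		ret = -EAGAIN;
	pthread_mutex_unlock(&sig->exec_guard_mutex);
	return ret;
}
```

In the model, as in the patch, the flag is only read with the mutex held, so the tracer side either sees a fully guarded exec (and waits) or an unsafe one (and fails fast).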
This patch fixes the test case tools/testing/selftests/ptrace/vmaccess:
[==========] Running 2 tests from 1 test cases.
[ RUN      ] global.vmaccess
[       OK ] global.vmaccess
[ RUN      ] global.attach
[       OK ] global.attach    <= this was still failing
[==========] 2 / 2 tests passed.
[  PASSED  ]
Yes, it is an API change, but only in a very special case, so I would expect it to be unnoticeable to user space applications.
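For a tracer, the visible consequence would be one more error code to handle. A minimal sketch, assuming a tracer chooses to treat EAGAIN from an attach attempt as "try again later"; `attach_fn`, `attach_with_retry` and `fake_attach` are hypothetical names for illustration and are not taken from this series or from any existing tracer:

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical callback: performs one attach attempt and returns 0 on
 * success or -1 with errno set, the way ptrace(PTRACE_ATTACH, ...) does.
 */
typedef int (*attach_fn)(int pid, void *ctx);

/*
 * Retry an attach attempt while it fails with EAGAIN, at most max_tries
 * times. Returns 0 on success, -1 with errno set on failure.
 */
static int attach_with_retry(attach_fn attach, int pid, void *ctx, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		if (attach(pid, ctx) == 0)
			return 0;
		if (errno != EAGAIN)
			return -1;	/* a real error: do not retry */
		/*
		 * A real tracer would wait for and handle pending
		 * tracee events here before trying again.
		 */
	}
	errno = EAGAIN;
	return -1;
}

/* Test double: fails with EAGAIN until the counter in ctx reaches zero. */
static int fake_attach(int pid, void *ctx)
{
	int *remaining = ctx;

	(void)pid;
	if (*remaining > 0) {
		(*remaining)--;
		errno = EAGAIN;
		return -1;
	}
	return 0;
}
```

Passing the attach operation as a callback keeps the retry policy testable without a real tracee; a tracer would pass a thin wrapper around its actual attach call.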
Bernd Edlinger (2):
  exec: Fix dead-lock in de_thread with ptrace_attach
  doc: Update documentation of ->exec_*_mutex

 Documentation/security/credentials.rst | 29 +++++++++++++++-------
 fs/exec.c                              | 44 +++++++++++++++++++++++++++-------
 fs/proc/base.c                         | 13 ++++++----
 include/linux/sched/signal.h           | 14 +++++++----
 init/init_task.c                       |  2 +-
 kernel/cred.c                          |  2 +-
 kernel/fork.c                          |  2 +-
 kernel/ptrace.c                        | 20 +++++++++++++---
 kernel/seccomp.c                       | 15 +++++++-----
 9 files changed, 102 insertions(+), 39 deletions(-)
This removes the last users of cred_guard_mutex and replaces it with a new mutex exec_guard_mutex, and a boolean unsafe_execve_in_progress.
This addresses the case where at least one of the sibling threads is traced: the tracer may then dead-lock in ptrace_attach, while de_thread has to wait for the tracer to continue execution.
The solution is to detect this situation and make ptrace_attach and similar functions return -EAGAIN, but only in a situation where a dead-lock is imminent.
This means this is an API change, but only when the process is traced while execve happens in a multi-threaded application.
See tools/testing/selftests/ptrace/vmaccess.c for a test case that gets fixed by this change.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de
---
 fs/exec.c                    | 44 +++++++++++++++++++++++++++++++++++---------
 fs/proc/base.c               | 13 ++++++++-----
 include/linux/sched/signal.h | 14 +++++++++-----
 init/init_task.c             |  2 +-
 kernel/cred.c                |  2 +-
 kernel/fork.c                |  2 +-
 kernel/ptrace.c              | 20 +++++++++++++++++---
 kernel/seccomp.c             | 15 +++++++++------
 8 files changed, 81 insertions(+), 31 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 11974a1..6b78518 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1073,14 +1073,26 @@ static int de_thread(struct task_struct *tsk)
 	struct signal_struct *sig = tsk->signal;
 	struct sighand_struct *oldsighand = tsk->sighand;
 	spinlock_t *lock = &oldsighand->siglock;
+	struct task_struct *t = tsk;

 	if (thread_group_empty(tsk))
 		goto no_thread_group;

+	spin_lock_irq(lock);
+	while_each_thread(tsk, t) {
+		if (unlikely(t->ptrace))
+			sig->unsafe_execve_in_progress = true;
+	}
+
+	if (unlikely(sig->unsafe_execve_in_progress)) {
+		spin_unlock_irq(lock);
+		mutex_unlock(&sig->exec_guard_mutex);
+		spin_lock_irq(lock);
+	}
+
 	/*
 	 * Kill all other threads in the thread group.
 	 */
-	spin_lock_irq(lock);
 	if (signal_group_exit(sig)) {
 		/*
 		 * Another group action in progress, just
@@ -1424,22 +1436,30 @@ void finalize_exec(struct linux_binprm *bprm)
 EXPORT_SYMBOL(finalize_exec);

 /*
- * Prepare credentials and lock ->cred_guard_mutex.
+ * Prepare credentials and lock ->exec_guard_mutex.
  * install_exec_creds() commits the new creds and drops the lock.
  * Or, if exec fails before, free_bprm() should release ->cred and
  * and unlock.
  */
 static int prepare_bprm_creds(struct linux_binprm *bprm)
 {
-	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
+	int ret;
+
+	if (mutex_lock_interruptible(&current->signal->exec_guard_mutex))
 		return -ERESTARTNOINTR;

+	ret = -EAGAIN;
+	if (unlikely(current->signal->unsafe_execve_in_progress))
+		goto out;
+
 	bprm->cred = prepare_exec_creds();
 	if (likely(bprm->cred))
 		return 0;

-	mutex_unlock(&current->signal->cred_guard_mutex);
-	return -ENOMEM;
+	ret = -ENOMEM;
+out:
+	mutex_unlock(&current->signal->exec_guard_mutex);
+	return ret;
 }

 static void free_bprm(struct linux_binprm *bprm)
@@ -1448,7 +1468,10 @@ static void free_bprm(struct linux_binprm *bprm)
 	if (bprm->cred) {
 		if (bprm->called_exec_mmap)
 			mutex_unlock(&current->signal->exec_update_mutex);
-		mutex_unlock(&current->signal->cred_guard_mutex);
+		if (unlikely(current->signal->unsafe_execve_in_progress))
+			mutex_lock(&current->signal->exec_guard_mutex);
+		current->signal->unsafe_execve_in_progress = false;
+		mutex_unlock(&current->signal->exec_guard_mutex);
 		abort_creds(bprm->cred);
 	}
 	if (bprm->file) {
@@ -1492,19 +1515,22 @@ void install_exec_creds(struct linux_binprm *bprm)
 	if (get_dumpable(current->mm) != SUID_DUMP_USER)
 		perf_event_exit_task(current);
 	/*
-	 * cred_guard_mutex must be held at least to this point to prevent
+	 * exec_guard_mutex must be held at least to this point to prevent
 	 * ptrace_attach() from altering our determination of the task's
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
 	mutex_unlock(&current->signal->exec_update_mutex);
-	mutex_unlock(&current->signal->cred_guard_mutex);
+	if (unlikely(current->signal->unsafe_execve_in_progress))
+		mutex_lock(&current->signal->exec_guard_mutex);
+	current->signal->unsafe_execve_in_progress = false;
+	mutex_unlock(&current->signal->exec_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);

 /*
  * determine how safe it is to execute the proposed program
- * - the caller must hold ->cred_guard_mutex to protect against
+ * - the caller must hold ->exec_guard_mutex to protect against
  *   PTRACE_ATTACH or seccomp thread-sync
  */
 static void check_unsafe_exec(struct linux_binprm *bprm)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6b13fc4..a428536 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2680,14 +2680,17 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
 	}

 	/* Guard against adverse ptrace interaction */
-	rv = mutex_lock_interruptible(&current->signal->cred_guard_mutex);
+	rv = mutex_lock_interruptible(&current->signal->exec_guard_mutex);
 	if (rv < 0)
 		goto out_free;

-	rv = security_setprocattr(PROC_I(inode)->op.lsm,
-				  file->f_path.dentry->d_name.name, page,
-				  count);
-	mutex_unlock(&current->signal->cred_guard_mutex);
+	if (unlikely(current->signal->unsafe_execve_in_progress))
+		rv = -EAGAIN;
+	else
+		rv = security_setprocattr(PROC_I(inode)->op.lsm,
+					  file->f_path.dentry->d_name.name,
+					  page, count);
+	mutex_unlock(&current->signal->exec_guard_mutex);
 out_free:
 	kfree(page);
 out:
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index a29df79..e83cef2 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -212,6 +212,13 @@ struct signal_struct {
 #endif

 	/*
+	 * Set while execve is executing but is *not* holding
+	 * exec_guard_mutex to avoid possible dead-locks.
+	 * Only valid when exec_guard_mutex is held.
+	 */
+	bool unsafe_execve_in_progress;
+
+	/*
 	 * Thread is the potential origin of an oom condition; kill first on
 	 * oom
 	 */
@@ -222,11 +229,8 @@ struct signal_struct {
 	struct mm_struct *oom_mm;	/* recorded mm when the thread group got
 					 * killed by the oom killer */

-	struct mutex cred_guard_mutex;	/* guard against foreign influences on
-					 * credential calculations
-					 * (notably. ptrace)
-					 * Deprecated do not use in new code.
-					 * Use exec_update_mutex instead.
+	struct mutex exec_guard_mutex;	/* Held while execve runs, except when
+					 * a sibling thread is being traced.
 					 */
 	struct mutex exec_update_mutex;	/* Held while task_struct is being
 					 * updated during exec, and may have
diff --git a/init/init_task.c b/init/init_task.c
index bd403ed..6f96327 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -25,7 +25,7 @@
 	},
 	.multiprocess = HLIST_HEAD_INIT,
 	.rlim = INIT_RLIMITS,
-	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+	.exec_guard_mutex = __MUTEX_INITIALIZER(init_signals.exec_guard_mutex),
 	.exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
 #ifdef CONFIG_POSIX_TIMERS
 	.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
diff --git a/kernel/cred.c b/kernel/cred.c
index 71a7926..341ca59 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -295,7 +295,7 @@ struct cred *prepare_creds(void)

 /*
  * Prepare credentials for current to perform an execve()
- * - The caller must hold ->cred_guard_mutex
+ * - The caller must hold ->exec_guard_mutex
  */
 struct cred *prepare_exec_creds(void)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index e23ccac..98012f7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1593,7 +1593,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;

-	mutex_init(&sig->cred_guard_mutex);
+	mutex_init(&sig->exec_guard_mutex);
 	mutex_init(&sig->exec_update_mutex);

 	return 0;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179..221759e 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -392,9 +392,13 @@ static int ptrace_attach(struct task_struct *task, long request,
 	 * under ptrace.
 	 */
 	retval = -ERESTARTNOINTR;
-	if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
+	if (mutex_lock_interruptible(&task->signal->exec_guard_mutex))
 		goto out;

+	retval = -EAGAIN;
+	if (unlikely(task->signal->unsafe_execve_in_progress))
+		goto unlock_creds;
+
 	task_lock(task);
 	retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
 	task_unlock(task);
@@ -447,7 +451,7 @@ static int ptrace_attach(struct task_struct *task, long request,
 unlock_tasklist:
 	write_unlock_irq(&tasklist_lock);
 unlock_creds:
-	mutex_unlock(&task->signal->cred_guard_mutex);
+	mutex_unlock(&task->signal->exec_guard_mutex);
 out:
 	if (!retval) {
 		/*
@@ -472,10 +476,18 @@ static int ptrace_attach(struct task_struct *task, long request,
  */
 static int ptrace_traceme(void)
 {
-	int ret = -EPERM;
+	int ret;
+
+	if (mutex_lock_interruptible(&current->signal->exec_guard_mutex))
+		return -ERESTARTNOINTR;
+
+	ret = -EAGAIN;
+	if (unlikely(current->signal->unsafe_execve_in_progress))
+		goto unlock_creds;

 	write_lock_irq(&tasklist_lock);
 	/* Are we already being traced? */
+	ret = -EPERM;
 	if (!current->ptrace) {
 		ret = security_ptrace_traceme(current->parent);
 		/*
@@ -490,6 +502,8 @@ static int ptrace_traceme(void)
 	}
 	write_unlock_irq(&tasklist_lock);

+unlock_creds:
+	mutex_unlock(&current->signal->exec_guard_mutex);
 	return ret;
 }

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b6ea3dc..acd6960 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -329,7 +329,7 @@ static int is_ancestor(struct seccomp_filter *parent,
 /**
  * seccomp_can_sync_threads: checks if all threads can be synchronized
  *
- * Expects sighand and cred_guard_mutex locks to be held.
+ * Expects sighand and exec_guard_mutex locks to be held.
  *
  * Returns 0 on success, -ve on error, or the pid of a thread which was
  * either not in the correct seccomp mode or did not have an ancestral
@@ -339,9 +339,12 @@ static inline pid_t seccomp_can_sync_threads(void)
 {
 	struct task_struct *thread, *caller;

-	BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
+	BUG_ON(!mutex_is_locked(&current->signal->exec_guard_mutex));
 	assert_spin_locked(&current->sighand->siglock);

+	if (unlikely(current->signal->unsafe_execve_in_progress))
+		return -EAGAIN;
+
 	/* Validate all threads being eligible for synchronization. */
 	caller = current;
 	for_each_thread(caller, thread) {
@@ -371,7 +374,7 @@ static inline pid_t seccomp_can_sync_threads(void)
 /**
  * seccomp_sync_threads: sets all threads to use current's filter
  *
- * Expects sighand and cred_guard_mutex locks to be held, and for
+ * Expects sighand and exec_guard_mutex locks to be held, and for
  * seccomp_can_sync_threads() to have returned success already
  * without dropping the locks.
  *
@@ -380,7 +383,7 @@ static inline void seccomp_sync_threads(unsigned long flags)
 {
 	struct task_struct *thread, *caller;

-	BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
+	BUG_ON(!mutex_is_locked(&current->signal->exec_guard_mutex));
 	assert_spin_locked(&current->sighand->siglock);

 	/* Synchronize all threads. */
@@ -1319,7 +1322,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	 * while another thread is in the middle of calling exec.
 	 */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
-	    mutex_lock_killable(&current->signal->cred_guard_mutex))
+	    mutex_lock_killable(&current->signal->exec_guard_mutex))
 		goto out_put_fd;

 	spin_lock_irq(&current->sighand->siglock);
@@ -1337,7 +1340,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
 out:
 	spin_unlock_irq(&current->sighand->siglock);
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
-		mutex_unlock(&current->signal->cred_guard_mutex);
+		mutex_unlock(&current->signal->exec_guard_mutex);
 out_put_fd:
 	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
 		if (ret) {
This brings the outdated Documentation/security/credentials.rst back in line with the current implementation, and describes the purpose of current->signal->exec_update_mutex, current->signal->exec_guard_mutex and current->signal->unsafe_execve_in_progress.
Signed-off-by: Bernd Edlinger bernd.edlinger@hotmail.de --- Documentation/security/credentials.rst | 29 +++++++++++++++++++++-------- 1 file changed, 21 insertions(+), 8 deletions(-)
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst index 282e79f..fe4cd76 100644 --- a/Documentation/security/credentials.rst +++ b/Documentation/security/credentials.rst @@ -437,15 +437,30 @@ new set of credentials by calling::
struct cred *prepare_creds(void);
-this locks current->cred_replace_mutex and then allocates and constructs a -duplicate of the current process's credentials, returning with the mutex still -held if successful. It returns NULL if not successful (out of memory). +this allocates and constructs a duplicate of the current process's credentials. +It returns NULL if not successful (out of memory). + +If called from __do_execve_file, the mutex current->signal->exec_guard_mutex +is acquired before this function gets called, and usually released after +the new process mmap and credentials are installed. However if one of the +sibling threads are being traced when the execve is invoked, there is no +guarantee how long it takes to terminate all sibling threads, and therefore +the variable current->signal->unsafe_execve_in_progress is set, and the +exec_guard_mutex is released immediately. Functions that may have effect +on the credentials of a different thread need to lock the exec_guard_mutex +and additionally check the unsafe_execve_in_progress status, and fail with +-EAGAIN if that variable is set.
The mutex prevents ``ptrace()`` from altering the ptrace state of a process while security checks on credentials construction and changing is taking place as the ptrace state may alter the outcome, particularly in the case of ``execve()``.
+The mutex current->signal->exec_update_mutex is acquired when only a single
+thread is remaining, and the credentials and the process mmap are actually
+changed.  Functions that only need to access to a consistent state of the
+credentials and the process mmap do only need to aquire this mutex.
+
 The new credentials set should be altered appropriately, and any security
 checks and hooks done.  Both the current and the proposed sets of credentials
 are available for this purpose as current_cred() will return the current set
@@ -466,9 +481,8 @@ by calling::
 This will alter various aspects of the credentials and the process, giving the
 LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.

 This function is guaranteed to return 0, so that it can be tail-called at the
 end of such functions as ``sys_setresuid()``.
@@ -486,8 +500,7 @@ invoked::
void abort_creds(struct cred *new);
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
A typical credentials alteration function would look something like this::
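As an aside on the locking rule that the first hunk above adds to credentials.rst (this is not the kernel code, and not the example that follows the sentence above in the real file): a minimal userspace sketch, assuming a pthread mutex as a stand-in for exec_guard_mutex. The names struct signal_model, exec_start(), exec_finish(), guard_lock() and guard_unlock() are hypothetical and exist only for illustration. The text above does not say how or when unsafe_execve_in_progress is cleared, so exec_finish() is an assumption too.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model of the two signal_struct fields named above. */
struct signal_model {
	pthread_mutex_t exec_guard_mutex;
	bool unsafe_execve_in_progress;
};

/*
 * execve side: take the guard mutex first. If a sibling thread is traced,
 * set the flag and drop the mutex right away, so that nobody sleeps on it
 * while exec waits (possibly for a long time) for the siblings to exit.
 */
static void exec_start(struct signal_model *sig, bool sibling_traced)
{
	int rc = pthread_mutex_lock(&sig->exec_guard_mutex);

	assert(rc == 0);
	if (sibling_traced) {
		sig->unsafe_execve_in_progress = true;
		pthread_mutex_unlock(&sig->exec_guard_mutex);
	}
}

/* execve side, end of exec (assumed): clear the flag and release the mutex. */
static void exec_finish(struct signal_model *sig)
{
	if (sig->unsafe_execve_in_progress) {
		/* The mutex was dropped early, so retake it to clear the flag. */
		int rc = pthread_mutex_lock(&sig->exec_guard_mutex);

		assert(rc == 0);
		sig->unsafe_execve_in_progress = false;
	}
	pthread_mutex_unlock(&sig->exec_guard_mutex);
}

/*
 * Caller side, for code that may affect the credentials of another thread.
 * Returns 0 with the mutex held, or -EAGAIN with the mutex released when
 * an unsafe execve is in progress.
 */
static int guard_lock(struct signal_model *sig)
{
	int rc = pthread_mutex_lock(&sig->exec_guard_mutex);

	assert(rc == 0);
	if (sig->unsafe_execve_in_progress) {
		pthread_mutex_unlock(&sig->exec_guard_mutex);
		return -EAGAIN;
	}
	return 0;
}

static void guard_unlock(struct signal_model *sig)
{
	pthread_mutex_unlock(&sig->exec_guard_mutex);
}
```

The point of the flag is that a caller never sleeps on the mutex for the duration of the sibling teardown: it either gets the mutex quickly or gets -EAGAIN and can retry later.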
On 3/13/20 10:13 AM, Kirill Tkhai wrote:
> Although this should fix the problem, this looks like a broken puzzle.
> We can't use bprm->cred as an indicator of whether the mutex was locked or not. We can check bprm->cred in regard to cred_guard_mutex, because there is a strong rule: "cred_guard_mutex becomes locked together with bprm->cred assignment (see prepare_bprm_creds()), and it becomes unlocked together with bprm->cred zeroing". Pay attention to the modularity of all this: there are no dependencies on anything else.
> In regard to the newly introduced exec_update_mutex, your fix and the original patch look like an obfuscation. The mutex becomes tightly glued to the unrelated bprm->cred and bprm->mm, and this will cause problems for future modification and maintenance of all involved entities. If someone wants to move some functions in relation to each other, it will be painful, and this person will have to retrace the same dependencies and hit the same bug that Eric stepped on in the original patch.
Okay, yes, valid points you make, thanks. I just wanted to understand what was exactly wrong with this patch, since the failure mode looked a lot like it was failing because of something clobbering the data unexpectedly.
So I have posted a few updated patches for the failed one here:
[PATCH v3 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
[PATCH] pidfd: Use new infrastructure to fix deadlocks in execve
which replaces these:
[PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex https://lore.kernel.org/lkml/87zhcq4jdj.fsf_-_@x220.int.ebiederm.org/
[PATCH] pidfd: Stop taking cred_guard_mutex https://lore.kernel.org/lkml/87wo7svy96.fsf_-_@x220.int.ebiederm.org/
and a new patch series to fix deadlock in ptrace_attach and update doc:
[PATCH 0/2] exec: Fix dead-lock in de_thread with ptrace_attach
[PATCH 1/2] exec: Fix dead-lock in de_thread with ptrace_attach
[PATCH 2/2] doc: Update documentation of ->exec_*_mutex
Other patches needed, still valid:
[PATCH v2 1/5] exec: Only compute current once in flush_old_exec https://lore.kernel.org/lkml/87pndm5y3l.fsf_-_@x220.int.ebiederm.org/
[PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately https://lore.kernel.org/lkml/87k13u5y26.fsf_-_@x220.int.ebiederm.org/
[PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec https://lore.kernel.org/lkml/875zfe5xzb.fsf_-_@x220.int.ebiederm.org/
[PATCH 1/4] exec: Fix a deadlock in ptrace https://lore.kernel.org/lkml/AM6PR03MB517033EAD25BED15CC84E17DE4FF0@AM6PR03M...
[PATCH 2/4] selftests/ptrace: add test cases for dead-locks https://lore.kernel.org/lkml/AM6PR03MB51703199741A2C27A78980FFE4FF0@AM6PR03M...
[PATCH 3/4] mm: docs: Fix a comment in process_vm_rw_core https://lore.kernel.org/lkml/AM6PR03MB5170ED6D4D216EEEEF400136E4FF0@AM6PR03M...
[PATCH 4/4] kernel: doc: remove outdated comment cred.c https://lore.kernel.org/lkml/AM6PR03MB517039DB07AB641C194FEA57E4FF0@AM6PR03M...
[PATCH 1/4] kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve https://lore.kernel.org/lkml/AM6PR03MB517057A2269C3A4FB287B76EE4FF0@AM6PR03M...
[PATCH 2/4] proc: Use new infrastructure to fix deadlocks in execve https://lore.kernel.org/lkml/AM6PR03MB51705D211EC8E7EA270627B1E4FF0@AM6PR03M...
[PATCH 3/4] proc: io_accounting: Use new infrastructure to fix deadlocks in execve https://lore.kernel.org/lkml/AM6PR03MB5170BD2476E35068E182EFA4E4FF0@AM6PR03M...
[PATCH 4/4] perf: Use new infrastructure to fix deadlocks in execve https://lore.kernel.org/lkml/AM6PR03MB517035DEEDB9C8699CB6B34EE4FF0@AM6PR03M...
I think most of the existing patches are already approved, but if there are still change requests, please let me know.
Thanks Bernd.
On 3/14/20 10:57 AM, Bernd Edlinger wrote:
> On 3/13/20 10:13 AM, Kirill Tkhai wrote:
>> Although this should fix the problem, this looks like a broken puzzle.
>> We can't use bprm->cred as an indicator of whether the mutex was locked or not. We can check bprm->cred in regard to cred_guard_mutex, because there is a strong rule: "cred_guard_mutex becomes locked together with bprm->cred assignment (see prepare_bprm_creds()), and it becomes unlocked together with bprm->cred zeroing". Pay attention to the modularity of all this: there are no dependencies on anything else.
>> In regard to the newly introduced exec_update_mutex, your fix and the original patch look like an obfuscation. The mutex becomes tightly glued to the unrelated bprm->cred and bprm->mm, and this will cause problems for future modification and maintenance of all involved entities. If someone wants to move some functions in relation to each other, it will be painful, and this person will have to retrace the same dependencies and hit the same bug that Eric stepped on in the original patch.
> Okay, yes, valid points you make, thanks. I just wanted to understand what was exactly wrong with this patch, since the failure mode looked a lot like it was failing because of something clobbering the data unexpectedly.
> So I have posted a few updated patches for the failed one here:
> [PATCH v3 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
> [PATCH] pidfd: Use new infrastructure to fix deadlocks in execve
> which replaces these:
> [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex https://lore.kernel.org/lkml/87zhcq4jdj.fsf_-_@x220.int.ebiederm.org/
> [PATCH] pidfd: Stop taking cred_guard_mutex https://lore.kernel.org/lkml/87wo7svy96.fsf_-_@x220.int.ebiederm.org/
> and a new patch series to fix deadlock in ptrace_attach and update doc:
> [PATCH 0/2] exec: Fix dead-lock in de_thread with ptrace_attach
> [PATCH 1/2] exec: Fix dead-lock in de_thread with ptrace_attach
> [PATCH 2/2] doc: Update documentation of ->exec_*_mutex
> Other patches needed, still valid:
> [PATCH v2 1/5] exec: Only compute current once in flush_old_exec https://lore.kernel.org/lkml/87pndm5y3l.fsf_-_@x220.int.ebiederm.org/
> [PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately https://lore.kernel.org/lkml/87k13u5y26.fsf_-_@x220.int.ebiederm.org/
Ah, sorry, forgot this one: [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread https://lore.kernel.org/lkml/87eeu25y14.fsf_-_@x220.int.ebiederm.org/
> [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec https://lore.kernel.org/lkml/875zfe5xzb.fsf_-_@x220.int.ebiederm.org/
> [PATCH 1/4] exec: Fix a deadlock in ptrace https://lore.kernel.org/lkml/AM6PR03MB517033EAD25BED15CC84E17DE4FF0@AM6PR03M...
> [PATCH 2/4] selftests/ptrace: add test cases for dead-locks https://lore.kernel.org/lkml/AM6PR03MB51703199741A2C27A78980FFE4FF0@AM6PR03M...
> [PATCH 3/4] mm: docs: Fix a comment in process_vm_rw_core https://lore.kernel.org/lkml/AM6PR03MB5170ED6D4D216EEEEF400136E4FF0@AM6PR03M...
> [PATCH 4/4] kernel: doc: remove outdated comment cred.c https://lore.kernel.org/lkml/AM6PR03MB517039DB07AB641C194FEA57E4FF0@AM6PR03M...
> [PATCH 1/4] kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve https://lore.kernel.org/lkml/AM6PR03MB517057A2269C3A4FB287B76EE4FF0@AM6PR03M...
> [PATCH 2/4] proc: Use new infrastructure to fix deadlocks in execve https://lore.kernel.org/lkml/AM6PR03MB51705D211EC8E7EA270627B1E4FF0@AM6PR03M...
> [PATCH 3/4] proc: io_accounting: Use new infrastructure to fix deadlocks in execve https://lore.kernel.org/lkml/AM6PR03MB5170BD2476E35068E182EFA4E4FF0@AM6PR03M...
> [PATCH 4/4] perf: Use new infrastructure to fix deadlocks in execve https://lore.kernel.org/lkml/AM6PR03MB517035DEEDB9C8699CB6B34EE4FF0@AM6PR03M...
> I think most of the existing patches are already approved, but if there are still change requests, please let me know.
> Thanks Bernd.
Hope it is correct now. I haven't seen the new patches on the kernel archives yet, so I cannot add URLs for them.
Bernd.
On 14.03.2020 13:02, Bernd Edlinger wrote:
> On 3/14/20 10:57 AM, Bernd Edlinger wrote:
>> On 3/13/20 10:13 AM, Kirill Tkhai wrote:
>>> Although this should fix the problem, this looks like a broken puzzle.
>>> We can't use bprm->cred as an indicator of whether the mutex was locked or not. We can check bprm->cred in regard to cred_guard_mutex, because there is a strong rule: "cred_guard_mutex becomes locked together with bprm->cred assignment (see prepare_bprm_creds()), and it becomes unlocked together with bprm->cred zeroing". Pay attention to the modularity of all this: there are no dependencies on anything else.
>>> In regard to the newly introduced exec_update_mutex, your fix and the original patch look like an obfuscation. The mutex becomes tightly glued to the unrelated bprm->cred and bprm->mm, and this will cause problems for future modification and maintenance of all involved entities. If someone wants to move some functions in relation to each other, it will be painful, and this person will have to retrace the same dependencies and hit the same bug that Eric stepped on in the original patch.
>> Okay, yes, valid points you make, thanks. I just wanted to understand what was exactly wrong with this patch, since the failure mode looked a lot like it was failing because of something clobbering the data unexpectedly.
>> So I have posted a few updated patches for the failed one here:
>> [PATCH v3 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
>> [PATCH] pidfd: Use new infrastructure to fix deadlocks in execve
>> which replaces these:
>> [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex https://lore.kernel.org/lkml/87zhcq4jdj.fsf_-_@x220.int.ebiederm.org/
>> [PATCH] pidfd: Stop taking cred_guard_mutex https://lore.kernel.org/lkml/87wo7svy96.fsf_-_@x220.int.ebiederm.org/
>> and a new patch series to fix deadlock in ptrace_attach and update doc:
>> [PATCH 0/2] exec: Fix dead-lock in de_thread with ptrace_attach
>> [PATCH 1/2] exec: Fix dead-lock in de_thread with ptrace_attach
>> [PATCH 2/2] doc: Update documentation of ->exec_*_mutex
>> Other patches needed, still valid:
>> [PATCH v2 1/5] exec: Only compute current once in flush_old_exec https://lore.kernel.org/lkml/87pndm5y3l.fsf_-_@x220.int.ebiederm.org/
>> [PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately https://lore.kernel.org/lkml/87k13u5y26.fsf_-_@x220.int.ebiederm.org/
> Ah, sorry, forgot this one: [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread https://lore.kernel.org/lkml/87eeu25y14.fsf_-_@x220.int.ebiederm.org/
>> [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec https://lore.kernel.org/lkml/875zfe5xzb.fsf_-_@x220.int.ebiederm.org/
1-4/5 look OK for me. You may add my
Reviewed-by: Kirill Tkhai ktkhai@virtuozzo.com
>> [PATCH 1/4] exec: Fix a deadlock in ptrace https://lore.kernel.org/lkml/AM6PR03MB517033EAD25BED15CC84E17DE4FF0@AM6PR03M...
>> [PATCH 2/4] selftests/ptrace: add test cases for dead-locks https://lore.kernel.org/lkml/AM6PR03MB51703199741A2C27A78980FFE4FF0@AM6PR03M...
>> [PATCH 3/4] mm: docs: Fix a comment in process_vm_rw_core https://lore.kernel.org/lkml/AM6PR03MB5170ED6D4D216EEEEF400136E4FF0@AM6PR03M...
>> [PATCH 4/4] kernel: doc: remove outdated comment cred.c https://lore.kernel.org/lkml/AM6PR03MB517039DB07AB641C194FEA57E4FF0@AM6PR03M...
>> [PATCH 1/4] kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve https://lore.kernel.org/lkml/AM6PR03MB517057A2269C3A4FB287B76EE4FF0@AM6PR03M...
>> [PATCH 2/4] proc: Use new infrastructure to fix deadlocks in execve https://lore.kernel.org/lkml/AM6PR03MB51705D211EC8E7EA270627B1E4FF0@AM6PR03M...
>> [PATCH 3/4] proc: io_accounting: Use new infrastructure to fix deadlocks in execve https://lore.kernel.org/lkml/AM6PR03MB5170BD2476E35068E182EFA4E4FF0@AM6PR03M...
>> [PATCH 4/4] perf: Use new infrastructure to fix deadlocks in execve https://lore.kernel.org/lkml/AM6PR03MB517035DEEDB9C8699CB6B34EE4FF0@AM6PR03M...
>> I think most of the existing patches are already approved, but if there are still change requests, please let me know.
>> Thanks Bernd.
> Hope it is correct now. I haven't seen the new patches on the kernel archives yet, so I cannot add URLs for them.
> Bernd.
On 3/8/20 10:34 PM, Eric W. Biederman wrote:
> Bernd, everyone
> This is how I think the infrastructure change should look that makes way for fixing this issue.
> - Cleanup and reorder the code so code that can potentially wait indefinitely for userspace comes at the beginning for flush_old_exec.
> - Add a new mutex and take it after we have passed any potential indefinite waits for userspace.
> Then I think it is just going through the existing users of cred_guard_mutex and fixing them to use the new one.
> There really aren't that many users of cred_guard_mutex so we should be able to get through the easy ones fairly quickly. And anything that isn't easy we can wait until we have a good fix.
> The users of cred_guard_mutex that I saw were:
>     fs/proc/base.c:
>        proc_pid_attr_write
>        do_io_accounting
>        proc_pid_stack
>        proc_pid_syscall
>        proc_pid_personality
>     perf_event_open
>     mm_access
>     kcmp
>     pidfd_fget
>     seccomp_set_mode_filter
> Bernd I think I have addressed the issues you pointed out in v1. Please let me know if you see anything else.
Yes, looks good, except some nits.
Thanks Bernd.
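The ordering Eric describes above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel implementation: pthread mutexes stand in for cred_guard_mutex and the new mutex, an integer stands in for the mm, and struct task_model, exec_wait_phase(), exec_commit_phase() and try_read_mm() are hypothetical names. A trylock from the same thread is used to show what a reader in another process (for example a tracer in mm_access) would experience.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical model: two mutexes and a counter standing in for the mm. */
struct task_model {
	pthread_mutex_t cred_guard_mutex;  /* held for the whole exec */
	pthread_mutex_t exec_update_mutex; /* held only while mm/creds change */
	int mm_generation;                 /* bumped when exec installs a new mm */
};

/*
 * Phase 1: everything that can wait indefinitely for userspace comes first.
 * Only cred_guard_mutex is held here.
 */
static void exec_wait_phase(struct task_model *t)
{
	int rc = pthread_mutex_lock(&t->cred_guard_mutex);

	assert(rc == 0);
	/* A de_thread()-like wait for sibling threads would sit here. */
}

/*
 * Phase 2: past all indefinite waits, take the new mutex, install the
 * new state, and release both mutexes.
 */
static void exec_commit_phase(struct task_model *t)
{
	int rc = pthread_mutex_lock(&t->exec_update_mutex);

	assert(rc == 0);
	t->mm_generation++;
	pthread_mutex_unlock(&t->exec_update_mutex);
	pthread_mutex_unlock(&t->cred_guard_mutex);
}

/*
 * Reader: returns the generation it saw, or -EBUSY if taking the given
 * mutex would have blocked.
 */
static int try_read_mm(struct task_model *t, pthread_mutex_t *lock)
{
	int gen;

	if (pthread_mutex_trylock(lock) != 0)
		return -EBUSY;
	gen = t->mm_generation;
	pthread_mutex_unlock(lock);
	return gen;
}
```

While exec sits in the wait phase, a reader that uses cred_guard_mutex would block (the deadlock ingredient), whereas a reader that uses exec_update_mutex gets through and still sees the old, consistent state.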
On Tue, Mar 03, 2020 at 09:18:44AM -0600, Eric W. Biederman wrote:
> Bernd Edlinger <bernd.edlinger@hotmail.de> writes:
>> This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread are running.
>> I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but the strace is no longer able to respond, since it is blocked in vm_access.
>> The deadlock is always happening when strace needs to access the tracees process mmap, while another thread in the tracee starts to execve a child process, but that cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> A couple of things.
> Why do we think it is safe to change the behavior exposed to userspace? Not the deadlock but all of the times the current code would not deadlock?
> Especially given that this is a small window it might be hard for people to track down and report so we need a strong argument that this won't break existing userspace before we just change things.
> Usually surveying all of the users of a system call that we can find and checking to see if they might be affected by the change in behavior is difficult enough that we usually opt for not being lazy and preserving the behavior.
> This patch is up to two changes in behavior now, that could potentially affect a whole array of programs. Adding linux-api so that this change in behavior can be documented if/when this change goes through.
> If you can split the documentation and test fixes out into separate patches that would help reviewing this code, or please make it explicit that you are changing documentation about behavior that is changing with this patch.
Agreed. I think it'd be good to do it in three patches:
1. unrelated documentation update
2. fix + documentation changes specific to the fix
3. test(s)
Christian
On 03/01, Bernd Edlinger wrote:
> This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread are running.
Heh. Yes, known problem. See my attempt to fix it: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
> @@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>  	struct mm_struct *mm;
>  	int err;
>
> -	err = mutex_lock_killable(&task->signal->cred_guard_mutex);
> +	err = mutex_lock_killable(&task->signal->cred_change_mutex);
So if I understand correctly your patch doesn't fix other problems with debugger waiting for cred_guard_mutex.
I too do not think this can justify the new mutex in signal_struct...
Oleg.
On 3/2/20 1:28 PM, Oleg Nesterov wrote:
> On 03/01, Bernd Edlinger wrote:
>> This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread are running.
> Heh. Yes, known problem. See my attempt to fix it: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
>> @@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>>  	struct mm_struct *mm;
>>  	int err;
>>
>> -	err = mutex_lock_killable(&task->signal->cred_guard_mutex);
>> +	err = mutex_lock_killable(&task->signal->cred_change_mutex);
> So if I understand correctly your patch doesn't fix other problems with debugger waiting for cred_guard_mutex.
No, but I see this just as a first step.
> I too do not think this can justify the new mutex in signal_struct...
I think for mm_access the semantics of this mutex are clear: it prevents the credentials from changing while it is held by mm_access, and other places can probably take advantage of this mutex as well.
While on the other hand, the cred_guard_mutex is needed to avoid two threads calling execve at the same time. So that is needed as well.
What remains is probably making PTRACE_ATTACH detect that the process is currently in execve, and making that call fail in that situation. I have not thought in depth about that problem, but it will probably just need the right mutex to access current->in_execve.
That's at least how I see it.
Thanks Bernd.