On 3/3/20 3:26 AM, Kees Cook wrote:
On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread is running.
I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but strace is no longer able to respond, since it is blocked in mm_access.
The deadlock always happens when strace needs to access the tracee's process mmap while another thread in the tracee starts to execve a child process, but that cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
strace          D    0 30614  30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

expect          D    0 31933  30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The proposed solution is to take the cred_guard_mutex only in a critical section at the beginning and at the end of the execve function, and to let PTRACE_ATTACH fail with EAGAIN while execve is not complete, while other functions like mm_access are allowed to complete normally.
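To illustrate the locking pattern described above, here is a minimal userspace sketch. It is a model, not kernel code and not the patch itself: the struct and the function names (signal_model, exec_begin, exec_end, attach_model, mm_access_model) are made up for illustration, and only cred_guard_mutex and cred_locked_for_ptrace are names taken from this thread. The idea is that the mutex is held only for two short critical sections, and a flag records that an exec is in flight.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Userspace model only; struct and function names are hypothetical. */
struct signal_model {
    pthread_mutex_t cred_guard_mutex;
    bool cred_locked_for_ptrace;    /* true while an exec is in flight */
};

static struct signal_model sig = {
    .cred_guard_mutex = PTHREAD_MUTEX_INITIALIZER,
    .cred_locked_for_ptrace = false,
};

/* Start of exec: short critical section, then the mutex is dropped. */
static void exec_begin(struct signal_model *s)
{
    pthread_mutex_lock(&s->cred_guard_mutex);
    s->cred_locked_for_ptrace = true;
    pthread_mutex_unlock(&s->cred_guard_mutex);
}

/* End of exec: second short critical section clears the flag. */
static void exec_end(struct signal_model *s)
{
    pthread_mutex_lock(&s->cred_guard_mutex);
    assert(s->cred_locked_for_ptrace);
    s->cred_locked_for_ptrace = false;
    pthread_mutex_unlock(&s->cred_guard_mutex);
}

/* Stand-in for PTRACE_ATTACH: refuses with -EAGAIN during an exec. */
static int attach_model(struct signal_model *s)
{
    int ret = 0;

    pthread_mutex_lock(&s->cred_guard_mutex);
    if (s->cred_locked_for_ptrace)
        ret = -EAGAIN;
    pthread_mutex_unlock(&s->cred_guard_mutex);
    return ret;
}

/*
 * Stand-in for mm_access: it takes the mutex only briefly, so it no
 * longer waits for the whole exec to finish.
 */
static int mm_access_model(struct signal_model *s)
{
    pthread_mutex_lock(&s->cred_guard_mutex);
    pthread_mutex_unlock(&s->cred_guard_mutex);
    return 0;
}
```

In this model, a tracer calling mm_access_model() between exec_begin() and exec_end() returns immediately, while attach_model() in the same window returns -EAGAIN and would have to be retried by the caller.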
Sorry to be a bummer, but I don't think this will work. A few more things during the exec process depend on cred_guard_mutex being held.
If I'm reading this patch correctly, this changes the lifetime of the cred_guard_mutex lock to be:
- during prepare_bprm_creds()
- from flush_old_exec() through install_exec_creds()
Before, cred_guard_mutex was held from prepare_bprm_creds() through install_exec_creds().
That means, for example, that check_unsafe_exec()'s documented invariant is violated:

/*
 * determine how safe it is to execute the proposed program
 * - the caller must hold ->cred_guard_mutex to protect against
 *   PTRACE_ATTACH or seccomp thread-sync
 */
Oh, right, I haven't understood that hint...
static void check_unsafe_exec(struct linux_binprm *bprm) ...
which is looking at no_new_privs as well as other details, and making decisions about the bprm state from the current state.
I think it also means that the potentially multiple invocations of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and binfmt_misc.c) would be changing bprm->cred details (uid, gid) without a lock (another place where current's no_new_privs is evaluated).
So no_new_privs can change from 0->1, but should not when execve is running.
As long as the calling thread is in execve it won't do this, and the only other place where it may be set for other threads is in seccomp_sync_threads, but that can easily be avoided, see below.
Related, it also means that cred_guard_mutex is unheld for every invocation of search_binary_handler() (which can loop via the previously mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid() currently.)
For seccomp, the expectations about existing thread states risk races too. There are two locks held for TSYNC:
- current->sighand->siglock is held to keep new threads from appearing/disappearing, which would destroy filter refcounting and lead to memory corruption.
I don't understand what you mean here. How can this lead to memory corruption?
- cred_guard_mutex is held to keep no_new_privs in sync with filters to avoid no_new_privs and filter confusion during exec, which could lead to exploitable setuid conditions (see below).
Just racing a malicious thread during TSYNC is not a very strong example (a malicious thread could do lots of fun things to "current" before it ever got near calling TSYNC), but I think there is the risk of mismatched/confused states that we don't want to allow. One is a particularly bad state that could lead to privilege escalations (in the form of the old "sendmail doesn't check setuid" flaw; if a setuid process has a filter attached that silently fails a priv-dropping setuid call and continues execution with elevated privs, it can be tricked into doing bad things on behalf of the unprivileged parent, which was the primary goal of the original use of cred_guard_mutex with TSYNC[1]):
thread A clones thread B
thread B starts setuid exec
thread A sets no_new_privs
thread A calls seccomp with TSYNC
thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
thread B passes check_unsafe_exec() with no_new_privs unset
thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
thread A still in seccomp_sync_threads() sets no_new_privs on thread B
thread B finishes exec, now running with elevated privs, a filter chosen by thread A, _and_ nnp set (which doesn't matter)
With the original locking, thread B will fail check_unsafe_exec() because filter and nnp state are changed together, with "atomicity" protected by the cred_guard_mutex.
Ah, good point, thanks!
This can be fixed by checking current->signal->cred_locked_for_ptrace while the cred_guard_mutex is locked, like this for instance:
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b6ea3dc..377abf0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
 	BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
 	assert_spin_locked(&current->sighand->siglock);
 
+	if (current->signal->cred_locked_for_ptrace)
+		return -EAGAIN;
+
 	/* Validate all threads being eligible for synchronization. */
 	caller = current;
 	for_each_thread(caller, thread) {
And this is just the bad state I _can_ see. I'm worried there are more...
All this said, I do see a small similarity here to the work I did to stabilize stack rlimits (there was an ongoing problem with making multiple decisions for the bprm based on current's state -- but current's state was mutable during exec). For this, I saved rlim_stack to bprm and ignored current's copy until exec ended and then stored bprm's copy into current. If the only problem anyone can see here is the handling of no_new_privs, we might be able to solve that similarly, at least disentangling tsync/nnp from cred_guard_mutex.
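The snapshot idea described above can be sketched in userspace roughly as follows. This is only a model under stated assumptions: task_model, bprm_model and the three helper functions are hypothetical names, and carrying no_new_privs in the snapshot is the speculative extension mentioned in the paragraph, not something that exists in the kernel. The sketch shows only that decisions made during exec stay consistent when they read the snapshot instead of the live task state; it does not by itself address a filter being installed on the thread in the middle of an exec.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; all names here are hypothetical. */
struct task_model {
    unsigned long rlim_stack;
    bool no_new_privs;
};

struct bprm_model {
    unsigned long rlim_stack;   /* snapshot taken at exec start */
    bool no_new_privs;          /* assumed extension of the same idea */
};

/* Copy the mutable task state once, at the start of exec. */
static void bprm_snapshot(struct bprm_model *bprm, const struct task_model *task)
{
    assert(bprm && task);
    bprm->rlim_stack = task->rlim_stack;
    bprm->no_new_privs = task->no_new_privs;
}

/* Every decision during exec reads the snapshot, never the task. */
static bool bprm_may_gain_privs(const struct bprm_model *bprm)
{
    return !bprm->no_new_privs;
}

/* At the end of exec, store the snapshot's stack limit into the task. */
static void bprm_commit(struct task_model *task, const struct bprm_model *bprm)
{
    task->rlim_stack = bprm->rlim_stack;
}
```

If the task's fields are changed between bprm_snapshot() and bprm_commit(), bprm_may_gain_privs() still answers from the state seen at exec start, and the stack limit stored at the end is the one the exec actually used.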
I still think that is solvable by using cred_locked_for_ptrace and simply making the tsync fail if it would otherwise be blocked.
Thanks Bernd.