Celeste!
I'll pick up your thread [1] here, since there's code here! :-) Let's try to clear up/define the possible flow.
The common/generic entry syscall_enter_from_user_mode{,_work}() says that a return value of -1 means that the syscall should be skipped.
The RISC-V calling convention uses the same register for arg0/retval (a0). So, when a ptracer (PTRACE_SYSCALL+PTRACE_GETREGS). That means that the kernel cannot call into the tracer, *after* changing a0.
The interesting flow for a syscall is roughly: 1. The exception/trap function 2. syscall_enter_from_user_mode() which might return -1, meaning that the syscall should be skipped. A tracer might have altered the regs. More importantly, if it's -1 the kernel cannot change the return value, because seccomp filtering might already done that. 3. If it's a valid syscall, perform the syscall.
Now some scenarios: * A user does a valid syscall. * A user does an invalid syscall(-1) * A user does an invalid syscall(!= -1)
Then there're the tracing variants of those, and that's where we go get trouble.
Now the bugs we've seen in RISC-V:
1. strace "tracing": Requires that regs->a0 is not tampered with prior ptrace notification
E.g.: | # ./strace / | execve("/", ["/"], 0x7ffffaac3890 /* 21 vars */) = -1 EACCES (Permission denied) | ./strace: exec: Permission denied | +++ exited with 1 +++ | # ./disable_ptrace_get_syscall_info ./strace / | execve(0xffffffffffffffda, ["/"], 0x7fffd893ce10 /* 21 vars */) = -1 EACCES (Permission denied) | ./strace: exec: Permission denied | +++ exited with 1 +++
In the second case, arg0 is prematurely set to -ENOSYS (0xffffffffffffffda).
2. strace "syscall tampering": Requires that ENOSYS is returned for syscall(-1), and not skipped w/o a proper return value.
E.g.: | ./strace -a0 -ewrite -einject=write:error=enospc echo helloject=write:error=enospc echo hello
Here, strace expects that injecting -1, would result in a ENOSYS.
3. seccomp filtering: Requires that the a0 is not tampered to
First 3 was broken (tampering with a0 after seccomp), then 2 was broken (not setting ENOSYS for -1), and finally 1 was broken (and still is!).
Now for your patch:
Celeste Liu via B4 Relay devnull+CoelacanthusHex.gmail.com@kernel.org writes:
From: Celeste Liu CoelacanthusHex@gmail.com
The return value of syscall_enter_from_user_mode() is always -1 when the syscall was filtered. We can't know whether syscall_nr is -1 when we get -1 from syscall_enter_from_user_mode(). And the old syscall variable is unusable because syscall_enter_from_user_mode() may change a7 register. So get correct syscall number from syscall_get_nr().
So syscall number part of return value of syscall_enter_from_user_mode() is completely useless. We can remove it from API and require caller to get syscall number from syscall_get_nr(). But this change affect more architectures and will block more time. So we split it into another patchset to avoid block this fix. (Other architectures can works without this change but riscv need it, see Link: tag below)
Fixes: 61119394631f ("riscv: entry: always initialize regs->a0 to -ENOSYS") Reported-by: Andrea Bolognani abologna@redhat.com Closes: https://github.com/strace/strace/issues/315 Link: https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/ Signed-off-by: Celeste Liu CoelacanthusHex@gmail.com
arch/riscv/kernel/traps.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c index 51ebfd23e0076447518081d137102a9a11ff2e45..3125fab8ee4af468ace9f692dd34e1797555cce3 100644 --- a/arch/riscv/kernel/traps.c +++ b/arch/riscv/kernel/traps.c @@ -316,18 +316,25 @@ void do_trap_ecall_u(struct pt_regs *regs) { if (user_mode(regs)) { long syscall = regs->a7;
long res;
regs->epc += 4; regs->orig_a0 = regs->a0;
regs->a0 = -ENOSYS;
riscv_v_vstate_discard(regs);
syscall = syscall_enter_from_user_mode(regs, syscall);
res = syscall_enter_from_user_mode(regs, syscall);
/*
* Call syscall_get_nr() again because syscall_enter_from_user_mode()
* may change a7 register.
*/
syscall = syscall_get_nr(current, regs);
add_random_kstack_offset();
if (syscall >= 0 && syscall < NR_syscalls)
if (syscall < 0 || syscall >= NR_syscalls)
regs->a0 = -ENOSYS;
else if (res != -1) syscall_handler(regs, syscall);
Here we can perform the syscall, even if res is -1. E.g., if this path [2] is taken, we might have a valid syscall number in a7, but the syscall should not be performed.
Also, one reason for the generic entry is so that it should be less work. Here, you pull (IMO) details that belong to the common entry implementation/API up the common entry user. Wdyt about pushing it down to common entry? Something like:
--8<-- diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c index 51ebfd23e007..66fded8e4b60 100644 --- a/arch/riscv/kernel/traps.c +++ b/arch/riscv/kernel/traps.c @@ -319,7 +319,6 @@ void do_trap_ecall_u(struct pt_regs *regs)
regs->epc += 4; regs->orig_a0 = regs->a0; - regs->a0 = -ENOSYS;
riscv_v_vstate_discard(regs);
@@ -329,6 +328,8 @@ void do_trap_ecall_u(struct pt_regs *regs)
if (syscall >= 0 && syscall < NR_syscalls) syscall_handler(regs, syscall); + else if (syscall != -1) + regs->a0 = -ENOSYS;
/* * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(), diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index 1e50cdb83ae5..9b69c2ad4f12 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -14,6 +14,7 @@ #include <linux/kmsan.h>
#include <asm/entry-common.h> +#include <asm/syscall.h>
/* * Define dummy _TIF work flags if not defined by the architecture or for @@ -166,6 +167,8 @@ static __always_inline long syscall_enter_from_user_mode_work(struct pt_regs *re
if (work & SYSCALL_WORK_ENTER) syscall = syscall_trace_enter(regs, syscall, work); + else if (syscall == -1L) + syscall_set_return_value(current, regs, -ENOSYS, 0);
return syscall; } diff --git a/kernel/entry/common.c b/kernel/entry/common.c index 5b6934e23c21..99742aee5002 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -43,8 +43,10 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall, /* Handle ptrace */ if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) { ret = ptrace_report_syscall_entry(regs); - if (ret || (work & SYSCALL_WORK_SYSCALL_EMU)) + if (ret || (work & SYSCALL_WORK_SYSCALL_EMU)) { + syscall_set_return_value(current, regs, -ENOSYS, 0); return -1L; + } }
/* Do seccomp after ptrace, to catch any tracer changes. */ @@ -66,6 +68,14 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall, syscall = syscall_get_nr(current, regs); }
+ /* + * If we're not setting the return value here, the context is + * gone; For higher callers, -1 means that the syscall should + * be skipped. + */ + if (syscall == -1L) + syscall_set_return_value(current, regs, -ENOSYS, 0); + syscall_enter_audit(regs, syscall);
return ret ? : syscall; --8<--
I did a quick test of the above, and it seems to cover all the previous bugs -- but who knows! ;-)
Björn
[1] https://lore.kernel.org/linux-riscv/59505464-c84a-403d-972f-d4b2055eeaac@gma... [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kern...