[PATCH] seccomp: passthrough uretprobe systemcall without filtering

List overview All Threads
Download

newer

older

[PATCH] Revert "mmc: sdhci_am654:...

[PATCH v2 0/2] Allow non-coherent...

Eyal Birger

17 Jan 2025 17 Jan '25

12:55 a.m.

When attaching uretprobes to processes running inside docker, the attached process is segfaulted when encountering the retprobe.

The reason is that now that uretprobe is a system call the default seccomp filters in docker block it as they only allow a specific set of known syscalls. This is true for other userspace applications which use seccomp to control their syscall surface.

Since uretprobe is a "kernel implementation detail" system call which is not used by userspace application code directly, it is impractical and there's very little point in forcing all userspace applications to explicitly allow it in order to avoid crashing tracked processes.

Pass this systemcall through seccomp without depending on configuration.

Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe") Reported-by: Rafael Buchbinder rafi@rbk.io Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2Aq... Cc: stable@vger.kernel.org Signed-off-by: Eyal Birger eyal.birger@gmail.com ---

The following reproduction script synthetically demonstrates the problem:

cat > /tmp/x.c << EOF

char *syscalls[] = { "write", "exit_group", "fstat", };

__attribute__((noinline)) int probed(void) { printf("Probed\n"); return 1; }

void apply_seccomp_filter(char **syscalls, int num_syscalls) { scmp_filter_ctx ctx;

ctx = seccomp_init(SCMP_ACT_KILL); for (int i = 0; i < num_syscalls; i++) { seccomp_rule_add(ctx, SCMP_ACT_ALLOW, seccomp_syscall_resolve_name(syscalls[i]), 0); } seccomp_load(ctx); seccomp_release(ctx); }

int main(int argc, char *argv[]) { int num_syscalls = sizeof(syscalls) / sizeof(syscalls[0]);

apply_seccomp_filter(syscalls, num_syscalls);

probed();

return 0; } EOF

cat > /tmp/trace.bt << EOF uretprobe:/tmp/x:probed { printf("ret=%d\n", retval); } EOF

gcc -o /tmp/x /tmp/x.c -lseccomp

/usr/bin/bpftrace /tmp/trace.bt &

sleep 5 # wait for uretprobe attach /tmp/x

pkill bpftrace

rm /tmp/x /tmp/x.c /tmp/trace.bt --- kernel/seccomp.c | 5 +++++ 1 file changed, 5 insertions(+)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 385d48293a5f..10a55c9b5c18 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd) this_syscall = sd ? sd->nr : syscall_get_nr(current, current_pt_regs());

+#ifdef CONFIG_X86_64 + if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall()) + return 0; +#endif + switch (mode) { case SECCOMP_MODE_STRICT: __secure_computing_strict(this_syscall); /* may call do_exit */

-- 2.43.0

Show replies by date

Oleg Nesterov

17 Jan 17 Jan

1:39 a.m.

On 01/16, Eyal Birger wrote:

...

Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe") Reported-by: Rafael Buchbinder rafi@rbk.io Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2Aq... Cc: stable@vger.kernel.org

...

@@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd) this_syscall = sd ? sd->nr : syscall_get_nr(current, current_pt_regs());

+#ifdef CONFIG_X86_64
if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
return 0;
+#endif

Acked-by: Oleg Nesterov oleg@redhat.com

A note for the seccomp maintainers...

I don't know what do you think, but I agree in advance that the very fact this patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.

The problem is that we need a simple patch for -stable which fixes the real problem. We can cleanup this logic later, I think.

Oleg.

Masami Hiramatsu

8:02 a.m.

On Fri, 17 Jan 2025 02:39:28 +0100 Oleg Nesterov oleg@redhat.com wrote:

...

On 01/16, Eyal Birger wrote:

...
Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe") Reported-by: Rafael Buchbinder rafi@rbk.io Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2Aq... Cc: stable@vger.kernel.org

...

...
@@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd) this_syscall = sd ? sd->nr : syscall_get_nr(current, current_pt_regs());

+#ifdef CONFIG_X86_64
if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
return 0;
+#endif
Acked-by: Oleg Nesterov oleg@redhat.com

A note for the seccomp maintainers...

I don't know what do you think, but I agree in advance that the very fact this patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.

Indeed. in_ia32_syscall() depends arch/x86 too. We can add an inline function like;

``` uprobes.h static inline bool is_uprobe_syscall(int syscall) { // arch_is_uprobe_syscall check can be replaced by Kconfig, // something like CONFIG_ARCH_URETPROBE_SYSCALL. #ifdef arch_is_uprobe_syscall return arch_is_uprobe_syscall(syscall) #else return false; #endif } ``` and ``` arch/x86/include/asm/uprobes.h #define arch_is_uprobe_syscall(syscall) \ (IS_ENABLED(CONFIG_X86_64) && syscall == __NR_uretprobe && !in_ia32_syscall()) ```

...

The problem is that we need a simple patch for -stable which fixes the real problem. We can cleanup this logic later, I think.

Hmm, at least we should make it is_uprobe_syscall() in uprobes.h so that do not pollute the seccomp subsystem with #ifdef.

Thank you,

...

Oleg.

-- Masami Hiramatsu (Google) mhiramat@kernel.org

Eyal Birger

1:36 p.m.

On Fri, Jan 17, 2025 at 12:02 AM Masami Hiramatsu mhiramat@kernel.org wrote:

...

On Fri, 17 Jan 2025 02:39:28 +0100 Oleg Nesterov oleg@redhat.com wrote:

...
On 01/16, Eyal Birger wrote:

...
Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe") Reported-by: Rafael Buchbinder rafi@rbk.io Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2Aq... Cc: stable@vger.kernel.org

...

...
@@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd) this_syscall = sd ? sd->nr : syscall_get_nr(current, current_pt_regs());

+#ifdef CONFIG_X86_64
if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
      return 0;
+#endif
Acked-by: Oleg Nesterov oleg@redhat.com

A note for the seccomp maintainers...

I don't know what do you think, but I agree in advance that the very fact this patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.
Indeed. in_ia32_syscall() depends arch/x86 too. We can add an inline function like;
static inline bool is_uprobe_syscall(int syscall)
{
        // arch_is_uprobe_syscall check can be replaced by Kconfig,
        // something like CONFIG_ARCH_URETPROBE_SYSCALL.
#ifdef arch_is_uprobe_syscall
        return arch_is_uprobe_syscall(syscall)
#else
        return false;
#endif
}
and
#define arch_is_uprobe_syscall(syscall) \
        (IS_ENABLED(CONFIG_X86_64) && syscall == __NR_uretprobe && !in_ia32_syscall())

Notice it'll need to be enclosed in ifdef CONFIG_X86_64 as __NR_uretprobe isn't defined otherwise so the IS_ENABLED isn't needed.

...

...
The problem is that we need a simple patch for -stable which fixes the real problem. We can cleanup this logic later, I think.

Hmm, at least we should make it is_uprobe_syscall() in uprobes.h so that do not pollute the seccomp subsystem with #ifdef.

I like this approach.

Notice it does add a couple of includes that weren't there before: - arch/x86/include/asm/uprobes.h would include asm/unistd.h - seccomp.c would include linux/uprobes.h

So it's a less "trivial" patch... If that's ok I can repost with these changes.

Eyal.

Oleg Nesterov

2:09 p.m.

On 01/17, Masami Hiramatsu wrote:

...

On Fri, 17 Jan 2025 02:39:28 +0100 Oleg Nesterov oleg@redhat.com wrote:

...
A note for the seccomp maintainers...

I don't know what do you think, but I agree in advance that the very fact this patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.

Indeed. in_ia32_syscall() depends arch/x86 too. We can add an inline function like;
static inline bool is_uprobe_syscall(int syscall)
{

We can, and this is what I tried to suggest from the very beginning. But I agree with Eyal who decided to send the most trivial fix for -stable, we can add the helper later.

I don't think it should live in uprobes.h and I'd prefer something like arch_seccomp_ignored(int) but I won't insist.

...

// arch_is_uprobe_syscall check can be replaced by Kconfig, // something like CONFIG_ARCH_URETPROBE_SYSCALL.

Or sysctl or both. This is another issue.

...

#define arch_is_uprobe_syscall(syscall) \
   (IS_ENABLED(CONFIG_X86_64) && syscall == __NR_uretprobe && !in_ia32_syscall())

This won't compile if IS_ENABLED(CONFIG_X86_64) == false, __NR_uretprobe will be undefined.

...

...
The problem is that we need a simple patch for -stable which fixes the real problem. We can cleanup this logic later, I think.

Hmm, at least we should make it is_uprobe_syscall() in uprobes.h so that do not pollute the seccomp subsystem with #ifdef.

See above. But I won't insist.

Oleg.

Andrii Nakryiko

5:51 p.m.

On Fri, Jan 17, 2025 at 6:10 AM Oleg Nesterov oleg@redhat.com wrote:

...

On 01/17, Masami Hiramatsu wrote:

...
On Fri, 17 Jan 2025 02:39:28 +0100 Oleg Nesterov oleg@redhat.com wrote:

...
A note for the seccomp maintainers...

I don't know what do you think, but I agree in advance that the very fact this patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.

Indeed. in_ia32_syscall() depends arch/x86 too. We can add an inline function like;
static inline bool is_uprobe_syscall(int syscall)
{
We can, and this is what I tried to suggest from the very beginning. But I agree with Eyal who decided to send the most trivial fix for -stable, we can add the helper later.

I don't think it should live in uprobes.h and I'd prefer something like arch_seccomp_ignored(int) but I won't insist.

yep, I think this is the way, keeping it as a general category. Should we also put rt_sigreturn there explicitly as well? Also, wouldn't it be better to have it as a non-arch-specific function for something like rt_sigreturn where defining it per each arch is cumbersome, and have the default implementation also call into an arch-specific function?

...

...
  // arch_is_uprobe_syscall check can be replaced by Kconfig,
  // something like CONFIG_ARCH_URETPROBE_SYSCALL.
Or sysctl or both. This is another issue.

...
#define arch_is_uprobe_syscall(syscall) \
      (IS_ENABLED(CONFIG_X86_64) && syscall == __NR_uretprobe && !in_ia32_syscall())
This won't compile if IS_ENABLED(CONFIG_X86_64) == false, __NR_uretprobe will be undefined.

...
...
The problem is that we need a simple patch for -stable which fixes the real problem. We can cleanup this logic later, I think.

Hmm, at least we should make it is_uprobe_syscall() in uprobes.h so that do not pollute the seccomp subsystem with #ifdef.

See above. But I won't insist.

Oleg.

Eyal Birger

7:24 p.m.

On Fri, Jan 17, 2025 at 9:51 AM Andrii Nakryiko andrii.nakryiko@gmail.com wrote:

...

On Fri, Jan 17, 2025 at 6:10 AM Oleg Nesterov oleg@redhat.com wrote:

...
On 01/17, Masami Hiramatsu wrote:

...
On Fri, 17 Jan 2025 02:39:28 +0100 Oleg Nesterov oleg@redhat.com wrote:

...
A note for the seccomp maintainers...

I don't know what do you think, but I agree in advance that the very fact this patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.

Indeed. in_ia32_syscall() depends arch/x86 too. We can add an inline function like;
static inline bool is_uprobe_syscall(int syscall)
{
We can, and this is what I tried to suggest from the very beginning. But I agree with Eyal who decided to send the most trivial fix for -stable, we can add the helper later.

I don't think it should live in uprobes.h and I'd prefer something like arch_seccomp_ignored(int) but I won't insist.
yep, I think this is the way, keeping it as a general category. Should we also put rt_sigreturn there explicitly as well? Also, wouldn't it be better to have it as a non-arch-specific function for something like rt_sigreturn where defining it per each arch is cumbersome, and have the default implementation also call into an arch-specific function?

I like the more generic approach and keeping CONFIG_X86 out of seccomp, and more generic than uprobes, however, I'm not sure where a common part to place it which includes arch/x86/include/asm/syscall.h would be. And as mentioned before, this would make this bugfix more complex to backport.

For that reason I wouldn't refactor handling rt_sigreturn as part of this fix.

Thanks! Eyal.

Andrii Nakryiko

7:34 p.m.

On Fri, Jan 17, 2025 at 11:24 AM Eyal Birger eyal.birger@gmail.com wrote:

...

On Fri, Jan 17, 2025 at 9:51 AM Andrii Nakryiko andrii.nakryiko@gmail.com wrote:

...
On Fri, Jan 17, 2025 at 6:10 AM Oleg Nesterov oleg@redhat.com wrote:

...
On 01/17, Masami Hiramatsu wrote:

...
On Fri, 17 Jan 2025 02:39:28 +0100 Oleg Nesterov oleg@redhat.com wrote:

...
A note for the seccomp maintainers...

I don't know what do you think, but I agree in advance that the very fact this patch adds "#ifdef CONFIG_X86_64" into __secure_computing() doesn't look nice.

Indeed. in_ia32_syscall() depends arch/x86 too. We can add an inline function like;
static inline bool is_uprobe_syscall(int syscall)
{
We can, and this is what I tried to suggest from the very beginning. But I agree with Eyal who decided to send the most trivial fix for -stable, we can add the helper later.

I don't think it should live in uprobes.h and I'd prefer something like arch_seccomp_ignored(int) but I won't insist.
yep, I think this is the way, keeping it as a general category. Should we also put rt_sigreturn there explicitly as well? Also, wouldn't it be better to have it as a non-arch-specific function for something like rt_sigreturn where defining it per each arch is cumbersome, and have the default implementation also call into an arch-specific function?
I like the more generic approach and keeping CONFIG_X86 out of seccomp, and more generic than uprobes, however, I'm not sure where a common part to place it which includes arch/x86/include/asm/syscall.h would be. And as mentioned before, this would make this bugfix more complex to backport.

For that reason I wouldn't refactor handling rt_sigreturn as part of this fix.

SGTM, it can always be improved later, if necessary

...

Thanks! Eyal.

Oleg Nesterov

18 Jan 18 Jan

3:05 p.m.

On 01/17, Andrii Nakryiko wrote:

...

On Fri, Jan 17, 2025 at 6:10 AM Oleg Nesterov oleg@redhat.com wrote:

...
We can, and this is what I tried to suggest from the very beginning. But I agree with Eyal who decided to send the most trivial fix for -stable, we can add the helper later.

I don't think it should live in uprobes.h and I'd prefer something like arch_seccomp_ignored(int) but I won't insist.

yep, I think this is the way, keeping it as a general category. Should we also put rt_sigreturn there explicitly as well? Also, wouldn't it be better to have it as a non-arch-specific function for something like rt_sigreturn where defining it per each arch is cumbersome, and have the default implementation also call into an arch-specific function?

I personally don't think we should exclude rt_sigreturn. and I guess we can't do it in a arch-agnostic way, think of __NR_ia32_sigreturn.

However. These are all good questions that need a separate discussion. Plus the SECCOMP_RET_TRACE/strace issue raised by Dmitry. And probably even more.

But IMO it would be better to push the trivial (and urgent) fix to -stable first, then discuss the possible cleanups/improvements.

What do you think?

Oleg.

Dmitry V. Levin

17 Jan 17 Jan

6:34 p.m.

On Thu, Jan 16, 2025 at 04:55:39PM -0800, Eyal Birger wrote:

...

When attaching uretprobes to processes running inside docker, the attached process is segfaulted when encountering the retprobe.

The reason is that now that uretprobe is a system call the default seccomp filters in docker block it as they only allow a specific set of known syscalls. This is true for other userspace applications which use seccomp to control their syscall surface.

Since uretprobe is a "kernel implementation detail" system call which is not used by userspace application code directly, it is impractical and there's very little point in forcing all userspace applications to explicitly allow it in order to avoid crashing tracked processes.

Pass this systemcall through seccomp without depending on configuration.

Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe") Reported-by: Rafael Buchbinder rafi@rbk.io Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2Aq... Cc: stable@vger.kernel.org Signed-off-by: Eyal Birger eyal.birger@gmail.com

The following reproduction script synthetically demonstrates the problem:

cat > /tmp/x.c << EOF

char *syscalls[] = { "write", "exit_group", "fstat", };

__attribute__((noinline)) int probed(void) { printf("Probed\n"); return 1; }

void apply_seccomp_filter(char **syscalls, int num_syscalls) { scmp_filter_ctx ctx;

ctx = seccomp_init(SCMP_ACT_KILL); for (int i = 0; i < num_syscalls; i++) { seccomp_rule_add(ctx, SCMP_ACT_ALLOW, seccomp_syscall_resolve_name(syscalls[i]), 0); } seccomp_load(ctx); seccomp_release(ctx); }

int main(int argc, char *argv[]) { int num_syscalls = sizeof(syscalls) / sizeof(syscalls[0]);

apply_seccomp_filter(syscalls, num_syscalls);

probed();

return 0; } EOF

cat > /tmp/trace.bt << EOF uretprobe:/tmp/x:probed { printf("ret=%d\n", retval); } EOF

gcc -o /tmp/x /tmp/x.c -lseccomp

/usr/bin/bpftrace /tmp/trace.bt &

sleep 5 # wait for uretprobe attach /tmp/x

pkill bpftrace

rm /tmp/x /tmp/x.c /tmp/trace.bt

kernel/seccomp.c | 5 +++++ 1 file changed, 5 insertions(+)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 385d48293a5f..10a55c9b5c18 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd) this_syscall = sd ? sd->nr : syscall_get_nr(current, current_pt_regs()); +#ifdef CONFIG_X86_64
if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
return 0;
+#endif

switch (mode) { case SECCOMP_MODE_STRICT: __secure_computing_strict(this_syscall); /* may call do_exit */

This seems to be a hot fix to bypass some SECCOMP_RET_ERRNO filters. However, this way it bypasses seccomp completely, including SECCOMP_RET_TRACE, making it invisible to strace --seccomp, and I wonder why do you want that.

-- ldv

Eyal Birger

6:52 p.m.

On Fri, Jan 17, 2025 at 10:34 AM Dmitry V. Levin ldv@strace.io wrote:

...

On Thu, Jan 16, 2025 at 04:55:39PM -0800, Eyal Birger wrote:

...
When attaching uretprobes to processes running inside docker, the attached process is segfaulted when encountering the retprobe.

The reason is that now that uretprobe is a system call the default seccomp filters in docker block it as they only allow a specific set of known syscalls. This is true for other userspace applications which use seccomp to control their syscall surface.

Since uretprobe is a "kernel implementation detail" system call which is not used by userspace application code directly, it is impractical and there's very little point in forcing all userspace applications to explicitly allow it in order to avoid crashing tracked processes.

Pass this systemcall through seccomp without depending on configuration.

Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe") Reported-by: Rafael Buchbinder rafi@rbk.io Link: https://lore.kernel.org/lkml/CAHsH6Gs3Eh8DFU0wq58c_LF8A4_+o6z456J7BidmcVY2Aq... Cc: stable@vger.kernel.org Signed-off-by: Eyal Birger eyal.birger@gmail.com

The following reproduction script synthetically demonstrates the problem:

cat > /tmp/x.c << EOF

char *syscalls[] = { "write", "exit_group", "fstat", };

__attribute__((noinline)) int probed(void) { printf("Probed\n"); return 1; }

void apply_seccomp_filter(char **syscalls, int num_syscalls) { scmp_filter_ctx ctx;
  ctx = seccomp_init(SCMP_ACT_KILL);
  for (int i = 0; i < num_syscalls; i++) {
          seccomp_rule_add(ctx, SCMP_ACT_ALLOW,
                           seccomp_syscall_resolve_name(syscalls[i]), 0);
  }
  seccomp_load(ctx);
  seccomp_release(ctx);
}

int main(int argc, char *argv[]) { int num_syscalls = sizeof(syscalls) / sizeof(syscalls[0]);
  apply_seccomp_filter(syscalls, num_syscalls);

  probed();

  return 0;
} EOF

cat > /tmp/trace.bt << EOF uretprobe:/tmp/x:probed { printf("ret=%d\n", retval); } EOF

gcc -o /tmp/x /tmp/x.c -lseccomp

/usr/bin/bpftrace /tmp/trace.bt &

sleep 5 # wait for uretprobe attach /tmp/x

pkill bpftrace

rm /tmp/x /tmp/x.c /tmp/trace.bt

kernel/seccomp.c | 5 +++++ 1 file changed, 5 insertions(+)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 385d48293a5f..10a55c9b5c18 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -1359,6 +1359,11 @@ int __secure_computing(const struct seccomp_data *sd) this_syscall = sd ? sd->nr : syscall_get_nr(current, current_pt_regs());

+#ifdef CONFIG_X86_64
if (unlikely(this_syscall == __NR_uretprobe) && !in_ia32_syscall())
        return 0;
+#endif
switch (mode) {
case SECCOMP_MODE_STRICT:
        __secure_computing_strict(this_syscall);  /* may call do_exit */
This seems to be a hot fix to bypass some SECCOMP_RET_ERRNO filters.

It's a little broader than just SECCOMP_RET_ERRNO, but yes, this is a hotfix to avoid filtering this system call in seccomp.

The rationale is that this is not a userspace created system call - the kernel uses it to instrument the function - and the fact that it's a system call is just an implementation detail. Ideally, userspace wouldn't need to know or care about it.

...

However, this way it bypasses seccomp completely, including SECCOMP_RET_TRACE, making it invisible to strace --seccomp, and I wonder why do you want that.

It's a good question. I could move this check to both "strict" seccomp and after the BPF verdict is received, but before it's applied, but I fear this would make the fix more error prone, and way harder to backmerge. So I'm wondering whether supporting strace --seccomp-bpf for this particular syscall is a priority.

Eyal.

Kees Cook

18 Jan 18 Jan

8:21 p.m.

On Thu, Jan 16, 2025 at 04:55:39PM -0800, Eyal Birger wrote:

...

Since uretprobe is a "kernel implementation detail" system call which is not used by userspace application code directly, it is impractical and there's very little point in forcing all userspace applications to explicitly allow it in order to avoid crashing tracked processes.

How is this any different from sigreturn, rt_sigreturn, or restart_syscall? These are all handled explicitly by userspace filters already, and I don't see why uretprobe should be any different. Docker has had plenty of experience with fixing their seccomp filters for new syscalls. For example, many times already a given libc will suddenly start using a new syscall when it sees its available, etc.

Basically, this is a Docker issue, not a kernel issue. Seccomp is behaving correctly. I don't want to start making syscalls invisible without an extremely good reason. If _anything_ should be invisible, it is restart_syscall (which actually IS invisible under certain architectures).

-Kees

-- Kees Cook

Darrick J. Wong

8:31 p.m.

On Sat, Jan 18, 2025 at 12:21:51PM -0800, Kees Cook wrote:

...

On Thu, Jan 16, 2025 at 04:55:39PM -0800, Eyal Birger wrote:

...
Since uretprobe is a "kernel implementation detail" system call which is not used by userspace application code directly, it is impractical and there's very little point in forcing all userspace applications to explicitly allow it in order to avoid crashing tracked processes.

How is this any different from sigreturn, rt_sigreturn, or restart_syscall? These are all handled explicitly by userspace filters already, and I don't see why uretprobe should be any different. Docker has had plenty of experience with fixing their seccomp filters for new syscalls. For example, many times already a given libc will suddenly start using a new syscall when it sees its available, etc.

Basically, this is a Docker issue, not a kernel issue. Seccomp is behaving correctly. I don't want to start making syscalls invisible without an extremely good reason. If _anything_ should be invisible, it is restart_syscall (which actually IS invisible under certain architectures).

I was wondering that too -- if ______'s security policy is to disallow by default, then fix the security policy. Don't blow a hole in seccomp for all users. Maybe someone *wants* to block uretprobe. Maybe doing so will be needed some day as a crude mitigation for a zeroday.

--D

...

-Kees

-- Kees Cook

Eyal Birger

8:45 p.m.

Hןת

On Sat, Jan 18, 2025 at 12:21 PM Kees Cook kees@kernel.org wrote:

...

On Thu, Jan 16, 2025 at 04:55:39PM -0800, Eyal Birger wrote:

...
Since uretprobe is a "kernel implementation detail" system call which is not used by userspace application code directly, it is impractical and there's very little point in forcing all userspace applications to explicitly allow it in order to avoid crashing tracked processes.

How is this any different from sigreturn, rt_sigreturn, or restart_syscall? These are all handled explicitly by userspace filters already, and I don't see why uretprobe should be any different. Docker has had plenty of experience with fixing their seccomp filters for new syscalls. For example, many times already a given libc will suddenly start using a new syscall when it sees its available, etc.

I think the difference is that this syscall is not part of the process's code - it is inserted there by another process tracing it. So this is different than desiring to deploy a new version of a binary that uses a new libc or a new syscall. Here the case is that there are three players - the tracer running out of docker, the tracee running in docker, and docker itself. All three were running fine in a specific kernel version, but upgrading the kernel now crashes the traced process.

...

Basically, this is a Docker issue, not a kernel issue.

As mentione above, for all three given binaries, nothing changed - only the kernel version.

...

Seccomp is behaving correctly. I don't want to start making syscalls invisible without an extremely good reason. If _anything_ should be invisible, it is restart_syscall (which actually IS invisible under certain architectures).

I think this syscall is different in that respect for the reasons described. I don't know if seccomp is behaving correctly when it blocks a kernel implementation detail that isn't user created. IMHO the fact that this implementation detail is implemented as a syscall is unfortunate, and I'm trying to mitigate the result.

Eyal.

...

-Kees

-- Kees Cook

Kees Cook

19 Jan 19 Jan

2:24 a.m.

On January 18, 2025 12:45:47 PM PST, Eyal Birger eyal.birger@gmail.com wrote:

...

I think the difference is that this syscall is not part of the process's code - it is inserted there by another process tracing it.

Well that's nothing like syscall_restart, and now I'm convinced seccomp must never ignore uretprobe -- a process might want to block uretprobe!

So, no, sorry, this needs to be handled by the seccomp policy that is applied to the process.

...

So this is different than desiring to deploy a new version of a binary that uses a new libc or a new syscall.

Uh, no, the case I used as an example was no changes to anything except the kernel. Libc noticed the available syscall, uses it, and is instantly killed by the Docker seccomp policy which didn't know about that syscall.

...

Here the case is that there are three players - the tracer running out of docker, the tracee running in docker, and docker itself. All three were running fine in a specific kernel version, but upgrading the kernel now crashes the traced process.

If uretprobe used to work without a syscall, then that seems to be the problem. But I think easiest is just fixing the Docker policy. (Which is a text file configuration change; no new binaries, no rebuilds!).

...

I think this syscall is different in that respect for the reasons described.

I don't agree, sorry. Seccomp has a really singular and specific purpose, which is explicitly *externalizing* policy. I do not want to have policy within seccomp itself.

...

I don't know if seccomp is behaving correctly when it blocks a kernel implementation detail that isn't user created.

But it is user created? Something added a uretprobe to a process who's seccomp policy is not expecting it. This seems precisely by design.

-Kees

-- Kees Cook

Eyal Birger

3:39 a.m.

Hi,

Thank you for the detailed response.

On Sat, Jan 18, 2025 at 6:25 PM Kees Cook kees@kernel.org wrote:

...

On January 18, 2025 12:45:47 PM PST, Eyal Birger eyal.birger@gmail.com wrote:

...
I think the difference is that this syscall is not part of the process's code - it is inserted there by another process tracing it.

Well that's nothing like syscall_restart, and now I'm convinced seccomp must never ignore uretprobe -- a process might want to block uretprobe!

I think I understand your point. But do you think this is intentional? i.e. seccomp couldn't have been used to block uretprobes before this syscall implementation afaict.

...

So, no, sorry, this needs to be handled by the seccomp policy that is applied to the process.

The problem we're facing is that existing workloads are breaking, and as mentioned I'm not sure how practical it is to demand replacing a working docker environment because of a new syscall that was added for performance reasons.

...

...
So this is different than desiring to deploy a new version of a binary that uses a new libc or a new syscall.

Uh, no, the case I used as an example was no changes to anything except the kernel. Libc noticed the available syscall, uses it, and is instantly killed by the Docker seccomp policy which didn't know about that syscall.

That's an interesting situation and quite unexpected :) I'm glad I didn't have to face that one in production.

...

...
Here the case is that there are three players - the tracer running out of docker, the tracee running in docker, and docker itself. All three were running fine in a specific kernel version, but upgrading the kernel now crashes the traced process.

If uretprobe used to work without a syscall, then that seems to be the problem.

I agree.

...

But I think easiest is just fixing the Docker policy. (Which is a text file configuration change; no new binaries, no rebuilds!).

As far as I can tell libseccomp needs to provide support for this new syscall and a new docker version would need to be deployed, so It's not just a configuration change. Also the default policy which comes packed in docker would probably need to be changed to avoid having to explicitly provide a seccomp configuration for each deployment.

...

...
I think this syscall is different in that respect for the reasons described.

I don't agree, sorry. Seccomp has a really singular and specific purpose, which is explicitly *externalizing* policy. I do not want to have policy within seccomp itself.

Understood.

...

...
I don't know if seccomp is behaving correctly when it blocks a kernel implementation detail that isn't user created.

But it is user created? Something added a uretprobe to a process who's seccomp policy is not expecting it. This seems precisely by design.

I think I wasn't accurate in my wording. The uretprobe syscall is added to the tracee by the kernel. The tracer itself is merely requesting to attach a uretprobe bpf function. In previous versions, this was implemented by the kernel installing an int3 instruction, and in the new implementation the kernel is installing a uretprobe syscall. The "user" in this case - the tracer program - didn't deliberately install the syscall, but anyway this is semantics.

I think I understand your point that it is regarded as "policy", only that it creates a problem in actual deployments, where in order to be able to run the tracer software which has been working on newer kernels a new docker has to be deployed.

I'm trying to find a pragmatic solution to this problem, and I understand the motivation to avoid policy in seccomp.

Alternatively, maybe this syscall implementation should be reverted?

Thanks again, Eyal.

Jiri Olsa

10:44 a.m.

On Sat, Jan 18, 2025 at 07:39:25PM -0800, Eyal Birger wrote:

...

Hi,

Thank you for the detailed response.

On Sat, Jan 18, 2025 at 6:25 PM Kees Cook kees@kernel.org wrote:

...
On January 18, 2025 12:45:47 PM PST, Eyal Birger eyal.birger@gmail.com wrote:

...
I think the difference is that this syscall is not part of the process's code - it is inserted there by another process tracing it.

Well that's nothing like syscall_restart, and now I'm convinced seccomp must never ignore uretprobe -- a process might want to block uretprobe!

I think I understand your point. But do you think this is intentional? i.e. seccomp couldn't have been used to block uretprobes before this syscall implementation afaict.

...
So, no, sorry, this needs to be handled by the seccomp policy that is applied to the process.

The problem we're facing is that existing workloads are breaking, and as mentioned I'm not sure how practical it is to demand replacing a working docker environment because of a new syscall that was added for performance reasons.

...
...
So this is different than desiring to deploy a new version of a binary that uses a new libc or a new syscall.

Uh, no, the case I used as an example was no changes to anything except the kernel. Libc noticed the available syscall, uses it, and is instantly killed by the Docker seccomp policy which didn't know about that syscall.

That's an interesting situation and quite unexpected :) I'm glad I didn't have to face that one in production.

...
...
Here the case is that there are three players - the tracer running out of docker, the tracee running in docker, and docker itself. All three were running fine in a specific kernel version, but upgrading the kernel now crashes the traced process.

If uretprobe used to work without a syscall, then that seems to be the problem.

I agree.

...
But I think easiest is just fixing the Docker policy. (Which is a text file configuration change; no new binaries, no rebuilds!).

As far as I can tell libseccomp needs to provide support for this new syscall and a new docker version would need to be deployed, so It's not just a configuration change. Also the default policy which comes packed in docker would probably need to be changed to avoid having to explicitly provide a seccomp configuration for each deployment.

...
...
I think this syscall is different in that respect for the reasons described.

I don't agree, sorry. Seccomp has a really singular and specific purpose, which is explicitly *externalizing* policy. I do not want to have policy within seccomp itself.

Understood.

...
...
I don't know if seccomp is behaving correctly when it blocks a kernel implementation detail that isn't user created.

But it is user created? Something added a uretprobe to a process who's seccomp policy is not expecting it. This seems precisely by design.

I think I wasn't accurate in my wording. The uretprobe syscall is added to the tracee by the kernel. The tracer itself is merely requesting to attach a uretprobe bpf function. In previous versions, this was implemented by the kernel installing an int3 instruction, and in the new implementation the kernel is installing a uretprobe syscall. The "user" in this case - the tracer program - didn't deliberately install the syscall, but anyway this is semantics.

that's correct, uretprobe syscall is installed by kernel to special user memory map and it can be executed only from there and if process calls it from another place it receives sigill

so at the end the process executes the uretprobe syscall, but it's up to kernel to decide that and set it up.. but I don't know if that's strong enough reason for seccomp to ignore the syscall

...

I think I understand your point that it is regarded as "policy", only that it creates a problem in actual deployments, where in order to be able to run the tracer software which has been working on newer kernels a new docker has to be deployed.

I'm trying to find a pragmatic solution to this problem, and I understand the motivation to avoid policy in seccomp.

I could think of sysctl for that.. you complained earlier about weird semantics for that [1], but I think it's better than to remove it

jirka

...

Alternatively, maybe this syscall implementation should be reverted?

Thanks again, Eyal.

[1] https://lore.kernel.org/bpf/CAHsH6Gs03iJt-ziWt5Bye_DuqCbk3TpMmgPbkYh64XBvpGa...

Kees Cook

20 Jan 20 Jan

9:34 p.m.

On Sat, Jan 18, 2025 at 07:39:25PM -0800, Eyal Birger wrote:

...

Alternatively, maybe this syscall implementation should be reverted?

Honestly, that seems the best choice. I don't think any thought was given to how it would interact with syscall interposers (including ptrace, strict mode seccomp, etc).

-- Kees Cook

Eyal Birger

27 Jan 27 Jan

7:24 p.m.

Hi Kees,

On Mon, Jan 20, 2025 at 1:34 PM Kees Cook kees@kernel.org wrote:

...

On Sat, Jan 18, 2025 at 07:39:25PM -0800, Eyal Birger wrote:

...
Alternatively, maybe this syscall implementation should be reverted?

Honestly, that seems the best choice. I don't think any thought was given to how it would interact with syscall interposers (including ptrace, strict mode seccomp, etc).

I don't know if you noticed Andrii's and others' comments on this [1].

Given that: - this issue requires immediate remediation - there seems to be pushback for reverting the syscall implementation - filtering uretprobe is not within the capabilities of seccomp without this syscall (so reverting the syscall is equivalent to just passing it through seccomp)

is it possible to consider applying this current fix, with the possibility of extending seccomp in the future to support filtering uretprobe if deemed necessary (for example by allowing userspace to define a stricter policy)?

Thanks, Eyal.

[1] https://lore.kernel.org/lkml/20250121182939.33d05470@gandalf.local.home/T/#m...

Kees Cook

7:33 p.m.

On Mon, Jan 27, 2025 at 11:24:02AM -0800, Eyal Birger wrote:

...

Hi Kees,

On Mon, Jan 20, 2025 at 1:34 PM Kees Cook kees@kernel.org wrote:

...
On Sat, Jan 18, 2025 at 07:39:25PM -0800, Eyal Birger wrote:

...
Alternatively, maybe this syscall implementation should be reverted?

Honestly, that seems the best choice. I don't think any thought was given to how it would interact with syscall interposers (including ptrace, strict mode seccomp, etc).

I don't know if you noticed Andrii's and others' comments on this [1].

Given that:

this issue requires immediate remediation

there seems to be pushback for reverting the syscall implementation

filtering uretprobe is not within the capabilities of seccomp without this syscall (so reverting the syscall is equivalent to just passing it through seccomp)

is it possible to consider applying this current fix, with the possibility of extending seccomp in the future to support filtering uretprobe if deemed necessary (for example by allowing userspace to define a stricter policy)?

I still think this is a Docker problem, but I agree that uretprobe without syscall is just as unfilterable as seccomp ignoring the syscall.

Can you please update the patch to use the existing action_cache bitmaps instead of adding an open-coded check? We can consider adding syscall_restart to this as well in the future...

-- Kees Cook

Eyal Birger

7:39 p.m.

On Mon, Jan 27, 2025 at 11:33 AM Kees Cook kees@kernel.org wrote:

...

On Mon, Jan 27, 2025 at 11:24:02AM -0800, Eyal Birger wrote:

...
Hi Kees,

On Mon, Jan 20, 2025 at 1:34 PM Kees Cook kees@kernel.org wrote:

...
On Sat, Jan 18, 2025 at 07:39:25PM -0800, Eyal Birger wrote:

...
Alternatively, maybe this syscall implementation should be reverted?

Honestly, that seems the best choice. I don't think any thought was given to how it would interact with syscall interposers (including ptrace, strict mode seccomp, etc).

I don't know if you noticed Andrii's and others' comments on this [1].

Given that:

this issue requires immediate remediation

there seems to be pushback for reverting the syscall implementation

filtering uretprobe is not within the capabilities of seccomp without this syscall (so reverting the syscall is equivalent to just passing it through seccomp)

is it possible to consider applying this current fix, with the possibility of extending seccomp in the future to support filtering uretprobe if deemed necessary (for example by allowing userspace to define a stricter policy)?

I still think this is a Docker problem, but I agree that uretprobe without syscall is just as unfilterable as seccomp ignoring the syscall.

Can you please update the patch to use the existing action_cache bitmaps instead of adding an open-coded check? We can consider adding syscall_restart to this as well in the future...

I can. The main difference as far as I can tell is that it would not apply to strict mode. Is that OK? it means that existing binaries using strict mode would still crash if uretprobe is attached to them.

Eyal.

Kees Cook

7:43 p.m.

On Mon, Jan 27, 2025 at 11:39:44AM -0800, Eyal Birger wrote:

...

On Mon, Jan 27, 2025 at 11:33 AM Kees Cook kees@kernel.org wrote:

...
On Mon, Jan 27, 2025 at 11:24:02AM -0800, Eyal Birger wrote:

...
Hi Kees,

On Mon, Jan 20, 2025 at 1:34 PM Kees Cook kees@kernel.org wrote:

...
On Sat, Jan 18, 2025 at 07:39:25PM -0800, Eyal Birger wrote:

...
Alternatively, maybe this syscall implementation should be reverted?

Honestly, that seems the best choice. I don't think any thought was given to how it would interact with syscall interposers (including ptrace, strict mode seccomp, etc).

I don't know if you noticed Andrii's and others' comments on this [1].

Given that:

this issue requires immediate remediation

there seems to be pushback for reverting the syscall implementation

filtering uretprobe is not within the capabilities of seccomp without this syscall (so reverting the syscall is equivalent to just passing it through seccomp)

is it possible to consider applying this current fix, with the possibility of extending seccomp in the future to support filtering uretprobe if deemed necessary (for example by allowing userspace to define a stricter policy)?

I still think this is a Docker problem, but I agree that uretprobe without syscall is just as unfilterable as seccomp ignoring the syscall.

Can you please update the patch to use the existing action_cache bitmaps instead of adding an open-coded check? We can consider adding syscall_restart to this as well in the future...

I can. The main difference as far as I can tell is that it would not apply to strict mode. Is that OK? it means that existing binaries using strict mode would still crash if uretprobe is attached to them.

Ah, good point. Please also add it to mode1_syscalls for strict. :)

-- Kees Cook

Jiri Olsa

21 Jan 21 Jan

2:38 p.m.

On Sat, Jan 18, 2025 at 07:39:25PM -0800, Eyal Birger wrote:

SNIP

...

I think I wasn't accurate in my wording. The uretprobe syscall is added to the tracee by the kernel. The tracer itself is merely requesting to attach a uretprobe bpf function. In previous versions, this was implemented by the kernel installing an int3 instruction, and in the new implementation the kernel is installing a uretprobe syscall. The "user" in this case - the tracer program - didn't deliberately install the syscall, but anyway this is semantics.

I think I understand your point that it is regarded as "policy", only that it creates a problem in actual deployments, where in order to be able to run the tracer software which has been working on newer kernels a new docker has to be deployed.

I'm trying to find a pragmatic solution to this problem, and I understand the motivation to avoid policy in seccomp.

Alternatively, maybe this syscall implementation should be reverted?

you mentioned in the previous reply:

...

...
As far as I can tell libseccomp needs to provide support for this new syscall and a new docker version would need to be deployed, so It's not just a configuration change. Also the default policy which comes packed in docker would probably need to be changed to avoid having to explicitly provide a seccomp configuration for each deployment.

please disregard if this is too stupid.. but could another way out be just to disable it (easy to do) and meanwhile teach libseccomp to allow uretprobe (or whatever mechanism needs to be added to libseccomp) plus the needed docker change ... to minimize the impact ?

or there's just too many other seccomp user space libraries

I'm still trying to come up with some other solution but wanted to exhaust all the options I could think of

thanks, jirka

Eyal Birger

2:47 p.m.

Hi,

On Tue, Jan 21, 2025 at 6:38 AM Jiri Olsa olsajiri@gmail.com wrote:

...

On Sat, Jan 18, 2025 at 07:39:25PM -0800, Eyal Birger wrote:

SNIP

...
I think I wasn't accurate in my wording. The uretprobe syscall is added to the tracee by the kernel. The tracer itself is merely requesting to attach a uretprobe bpf function. In previous versions, this was implemented by the kernel installing an int3 instruction, and in the new implementation the kernel is installing a uretprobe syscall. The "user" in this case - the tracer program - didn't deliberately install the syscall, but anyway this is semantics.

I think I understand your point that it is regarded as "policy", only that it creates a problem in actual deployments, where in order to be able to run the tracer software which has been working on newer kernels a new docker has to be deployed.

I'm trying to find a pragmatic solution to this problem, and I understand the motivation to avoid policy in seccomp.

Alternatively, maybe this syscall implementation should be reverted?

you mentioned in the previous reply:

...
...
As far as I can tell libseccomp needs to provide support for this new syscall and a new docker version would need to be deployed, so It's not just a configuration change. Also the default policy which comes packed in docker would probably need to be changed to avoid having to explicitly provide a seccomp configuration for each deployment.

please disregard if this is too stupid.. but could another way out be just to disable it (easy to do) and meanwhile teach libseccomp to allow uretprobe (or whatever mechanism needs to be added to libseccomp) plus the needed docker change ... to minimize the impact ?

Right. the patch I was thinking to suggest wouldn't revert the entire thing, but instead disable its use for now and allow a careful reconsideration of the available options.

If that makes sense, I'll post it.

...

or there's just too many other seccomp user space libraries

I think in theory, the example of a simple binary using "restrict" mode makes it problematic to assume that this can be fixed solely from userspace i.e. for such binary, uretprobes would still work in one kernel version and break on another. It's hard to tell how common this is.

...

I'm still trying to come up with some other solution but wanted to exhaust all the options I could think of

thanks, jirka

Steven Rostedt

4:16 p.m.

[ Watching this with popcorn from the sidelines, but I'll chime in anyway ]

On Tue, 21 Jan 2025 15:38:48 +0100 Jiri Olsa olsajiri@gmail.com wrote:

...

I'm still trying to come up with some other solution but wanted to exhaust all the options I could think of

I think this may have been mentioned, but is there a way that the kernel could know that this system call is being monitored by seccomp, and if so, just stick with the interrupt version? If not, enable the system call?

-- Steve

Oleg Nesterov

4:44 p.m.

On 01/21, Steven Rostedt wrote:

...

I think this may have been mentioned, but is there a way that the kernel could know that this system call is being monitored by seccomp, and if so, just stick with the interrupt version? If not, enable the system call?

Consider

int func_to_uretprobe() { seccomp(SECCOMP_SET_MODE_STRICT/whatever); return 123; }

by the time it is called, the kernel can't know that this function will call seccomp/install-the-filters/etc, so prepare_uretprobe() can't know if it is safe to use uretprobe or not.

Oleg.

Jiri Olsa

4:55 p.m.

On Tue, Jan 21, 2025 at 11:16:31AM -0500, Steven Rostedt wrote:

...

[ Watching this with popcorn from the sidelines, but I'll chime in anyway ]

On Tue, 21 Jan 2025 15:38:48 +0100 Jiri Olsa olsajiri@gmail.com wrote:

...
I'm still trying to come up with some other solution but wanted to exhaust all the options I could think of

I think this may have been mentioned, but is there a way that the kernel could know that this system call is being monitored by seccomp, and if so, just stick with the interrupt version? If not, enable the system call?

yes [1], the problem with that solution is that we install uretprobe trampoline at function's uprobe entry probe, so we won't catch case where seccomp is enabled in this probed function, like:

foo uprobe -> install uretprobe trampoline ... seccomp(SECCOMP_MODE_STRICT.. ... ret -> execute uretprobe trampoline with sys_uretprobe

I thought we could perhaps switch existing uretprobe trampoline to int3 when we are in sys_seccomp, but another user thread might be already executing the existing uretprobe trampoline, so I don't think we can do that

jirka

[1] https://lore.kernel.org/bpf/20250114123257.GD19816@redhat.com/

Andrii Nakryiko

10:38 p.m.

On Tue, Jan 21, 2025 at 8:55 AM Jiri Olsa olsajiri@gmail.com wrote:

...

On Tue, Jan 21, 2025 at 11:16:31AM -0500, Steven Rostedt wrote:

...
[ Watching this with popcorn from the sidelines, but I'll chime in anyway ]

On Tue, 21 Jan 2025 15:38:48 +0100 Jiri Olsa olsajiri@gmail.com wrote:

...
I'm still trying to come up with some other solution but wanted to exhaust all the options I could think of

I think this may have been mentioned, but is there a way that the kernel could know that this system call is being monitored by seccomp, and if so, just stick with the interrupt version? If not, enable the system call?

yes [1], the problem with that solution is that we install uretprobe trampoline at function's uprobe entry probe, so we won't catch case where seccomp is enabled in this probed function, like:

foo uprobe -> install uretprobe trampoline ... seccomp(SECCOMP_MODE_STRICT.. ... ret -> execute uretprobe trampoline with sys_uretprobe

I thought we could perhaps switch existing uretprobe trampoline to int3 when we are in sys_seccomp, but another user thread might be already executing the existing uretprobe trampoline, so I don't think we can do that

Jiri,

We should abandon the vector of "let's try to detect whether someone is blocking sys_uretprobe" as a solution, I don't believe it's possible. Blocking sys_uretprobe is too dynamic of a thing. There is an arbitrary periods of time between adding uretprobe trampoline (i.e., sys_uretprobe) and actually disabling sys_uretprobe through seccomp (or even BPF: LSM or even kprobes can do that, why not?), and userspace can flip this decision many times over.

And as Oleg said, sysctl "please-make-my-uretprobe-2x-faster-assuming-i-know-about-this-option" makes no sense either, this will basically almost never get enabled.

Kees,

You said yourself that sys_uretprobe is no different from rt_sigreturn and restart_syscall, so why would we rollback sys_uretprobe if we wouldn't rollback rt_sigreturn/restart_syscall? Given it's impossible, generally speaking, to know if userspace is blocking the syscall (and that can change dynamically and very frequently), any improvement or optimization that kernel would do with the help of special syscall is now prohibited, effectively. That doesn't seem wise to restrict the kernel development so much just because libseccomp blocks any unknown syscall by default.

I'm OK either asking libseccomp to learn about sys_uretprobe and not block it (like systemd is doing), or if we want to bend over backwards, prevent user policy from filtering theses special syscalls which are meant to be used by kernel only. We can't single out sys_uretprobe just because it's the newest of this special cohort.

You also asked "what if userspace wants to block uprobes"? If that's really the goal, that would be done at uprobe attachment time, not when uprobe is (conceptually) attached, new process is forked, and kernel installs uretprobe trampoline with uretprobe syscall. Or just control that through (lack of) capabilities. Using seccomp to block *second part of uretprobe handling* doesn't make much sense. It's just the wrong place for that.

P.S. Also using FRED as an excuse for not doing sys_uretprobe is manipulative. When we get FRED-enabled CPUs widely available and deployed *and* all (or at least majority of) the currently used CPUs are decommissioned, only then we can realistically talk about sys_uretprobe being unnecessary. That's years and years. sys_uretprobe is necessary and important *right now* and will be for the foreseeable future.

...

jirka

[1] https://lore.kernel.org/bpf/20250114123257.GD19816@redhat.com/

Steven Rostedt

10:46 p.m.

On Tue, 21 Jan 2025 14:38:09 -0800 Andrii Nakryiko andrii.nakryiko@gmail.com wrote:

...

You said yourself that sys_uretprobe is no different from rt_sigreturn and restart_syscall, so why would we rollback sys_uretprobe if we wouldn't rollback rt_sigreturn/restart_syscall? Given it's impossible, generally speaking, to know if userspace is blocking the syscall (and that can change dynamically and very frequently), any improvement or optimization that kernel would do with the help of special syscall is now prohibited, effectively. That doesn't seem wise to restrict the kernel development so much just because libseccomp blocks any unknown syscall by default.

What happens if the system call is hit when there was no uprobe attached to it? Perhaps it should segfault? That is, this system call is only used when the kernel attaches it, if the kernel did not attach it, perhaps there's some malicious code that is trying to use it for some CVE corner case. But if it always crashes when added, the only thing the malicious code can do by adding this system call is to crash the application. That shouldn't be a problem, as if malicious code can add a system call, it can also change the code to crash as well.

Perhaps the security folks would feel better if there were other protections against this system call when not used as expected?

-- Steve

Eyal Birger

11:13 p.m.

On Tue, Jan 21, 2025 at 2:46 PM Steven Rostedt rostedt@goodmis.org wrote:

...

On Tue, 21 Jan 2025 14:38:09 -0800 Andrii Nakryiko andrii.nakryiko@gmail.com wrote:

...
You said yourself that sys_uretprobe is no different from rt_sigreturn and restart_syscall, so why would we rollback sys_uretprobe if we wouldn't rollback rt_sigreturn/restart_syscall? Given it's impossible, generally speaking, to know if userspace is blocking the syscall (and that can change dynamically and very frequently), any improvement or optimization that kernel would do with the help of special syscall is now prohibited, effectively. That doesn't seem wise to restrict the kernel development so much just because libseccomp blocks any unknown syscall by default.

What happens if the system call is hit when there was no uprobe attached to it? Perhaps it should segfault? That is, this system call is only used when the kernel attaches it, if the kernel did not attach it, perhaps there's some malicious code that is trying to use it for some CVE corner case. But if it always crashes when added, the only thing the malicious code can do by adding this system call is to crash the application. That shouldn't be a problem, as if malicious code can add a system call, it can also change the code to crash as well.

Perhaps the security folks would feel better if there were other protections against this system call when not used as expected?

Isn't that the case already, or maybe I misunderstood what Jiri wrote [1]:

...

On Sun, Jan 19, 2025 at 2:44 AM Jiri Olsa olsajiri@gmail.com wrote: that's correct, uretprobe syscall is installed by kernel to special user memory map and it can be executed only from there and if process calls it from another place it receives sigill

Eyal.

[1] https://lore.kernel.org/lkml/Z4zXlaEMPbiYYlQ8@krava/

Steven Rostedt

11:29 p.m.

On Tue, 21 Jan 2025 15:13:52 -0800 Eyal Birger eyal.birger@gmail.com wrote:

...

Isn't that the case already, or maybe I misunderstood what Jiri wrote [1]:

...
On Sun, Jan 19, 2025 at 2:44 AM Jiri Olsa olsajiri@gmail.com wrote: that's correct, uretprobe syscall is installed by kernel to special user memory map and it can be executed only from there and if process calls it from another place it receives sigill

Eyal.

[1] https://lore.kernel.org/lkml/Z4zXlaEMPbiYYlQ8@krava/

Ah, he did. Thanks I missed that:

...

that's correct, uretprobe syscall is installed by kernel to special user memory map and it can be executed only from there and if process calls it from another place it receives sigill

-- Steve

Andy Lutomirski

19 Jan 19 Jan

6:36 p.m.

On Sat, Jan 18, 2025 at 6:25 PM Kees Cook kees@kernel.org wrote:

...

On January 18, 2025 12:45:47 PM PST, Eyal Birger eyal.birger@gmail.com wrote:

...
I think the difference is that this syscall is not part of the process's code - it is inserted there by another process tracing it.

Well that's nothing like syscall_restart, and now I'm convinced seccomp must never ignore uretprobe -- a process might want to block uretprobe!

I've been contemplating this. uretprobe is a very odd syscall: it's a syscall that emulates a breakpoint. So, before uretprobe-the-syscall was added, the process would breakpoint via a non-syscall vector, and the tracing code would do its thing, and seccomp would be none the wiser.

There's a distinction between different types of operations that seccomp is entirely unaware of right now: is the task trying to:

a) do something *to itself*

b) do something that doesn't have meaningful side effects on the rest of the world, at least in a non-buggy kernel, but where the process is actually trying to restrict its own actions

c) interacting with something outside the process, that has privilege over the process, where the interaction in question should absolutely be subject to security policy, but that security policy really ought to apply to the outside process.

uretprobe is very much in category c, and it's kind of unique in this sense *as a syscall*. But there are plenty of other examples that just happen to not be syscalls. For example, ptrace breakpoints use the #DB vector, which isn't a syscall.

Here are few factors that may be vaguely relevant:

- uretprobe is conceptually a bit like sigreturn in the sense that both of them are having the kernel help with something that process can kind-of-sort-of do all by itself.

- BUT: sigreturn is not supposed to have side effects reaching outside the calling task. uretprobe does, and that's the whole point.

- uretprobe-the-syscall is, in a rather optimistic sense, obsolete. Once FRED becomes common (if ever...), it won't really serve much purpose any more. FRED, for those not watching, at least in theory, makes "event delivery" and "return" fast, for all (hah!) events. Including breakpoints. And returns to usermode where rcx != rip, etc.

So I don't know what the right answer is. There's a real argument to be made that seccomp ought to decide that its policy permits whomever installed the uretprobe to do so, and that this decision means that the uretprobe machinery is therefore permissible.

Oleg Nesterov

12:40 p.m.

On 01/18, Kees Cook wrote:

...

On Thu, Jan 16, 2025 at 04:55:39PM -0800, Eyal Birger wrote:

...
Since uretprobe is a "kernel implementation detail" system call which is not used by userspace application code directly, it is impractical and there's very little point in forcing all userspace applications to explicitly allow it in order to avoid crashing tracked processes.

How is this any different from sigreturn, rt_sigreturn, or restart_syscall? These are all handled explicitly by userspace filters already, and I don't see why uretprobe should be any different.

The only difference is that sys_uretprobe() is new and existing setups doesn't know about it. Suppose you have

int func(void) { return 123; }

int main(void) { seccomp(SECCOMP_SET_MODE_STRICT, 0,0); for (;;) func(); }

and it runs with func() uretprobed.

If you install the new kernel, this application will crash immediately.

I understand your objections, but what do you think we can do instead? I don't think a new "try_to_speedup_uretprobes_at_your_own_risk" sysctl makes sense, it will be almost never enabled...

Oleg.

Kees Cook

20 Jan 20 Jan

9:32 p.m.

On Sun, Jan 19, 2025 at 01:40:22PM +0100, Oleg Nesterov wrote:

...

On 01/18, Kees Cook wrote:

...
On Thu, Jan 16, 2025 at 04:55:39PM -0800, Eyal Birger wrote:

...
Since uretprobe is a "kernel implementation detail" system call which is not used by userspace application code directly, it is impractical and there's very little point in forcing all userspace applications to explicitly allow it in order to avoid crashing tracked processes.

How is this any different from sigreturn, rt_sigreturn, or restart_syscall? These are all handled explicitly by userspace filters already, and I don't see why uretprobe should be any different.

The only difference is that sys_uretprobe() is new and existing setups doesn't know about it. Suppose you have

int func(void) { return 123; }

int main(void) { seccomp(SECCOMP_SET_MODE_STRICT, 0,0); for (;;) func(); }

and it runs with func() uretprobed.

If you install the new kernel, this application will crash immediately.

I understand your objections, but what do you think we can do instead? I don't think a new "try_to_speedup_uretprobes_at_your_own_risk" sysctl makes sense, it will be almost never enabled...

This seems like a uretprobes design problem. If it's going to use syscalls, it must take things like seccomp into account. SECCOMP_SET_MODE_STRICT will also crash in the face of syscall_restart...

-- Kees Cook

Oleg Nesterov

21 Jan 21 Jan

3:28 p.m.

On 01/20, Kees Cook wrote:

...

...
The only difference is that sys_uretprobe() is new and existing setups doesn't know about it. Suppose you have

int func(void) { return 123; }

int main(void) { seccomp(SECCOMP_SET_MODE_STRICT, 0,0); for (;;) func(); }

and it runs with func() uretprobed.

If you install the new kernel, this application will crash immediately.

I understand your objections, but what do you think we can do instead? I don't think a new "try_to_speedup_uretprobes_at_your_own_risk" sysctl makes sense, it will be almost never enabled...

This seems like a uretprobes design problem. If it's going to use syscalls, it must take things like seccomp into account.

True. I reviewed that patch, and I forgot about seccomp too.

...

SECCOMP_SET_MODE_STRICT will also crash in the face of syscall_restart...

Yes, I guess SECCOMP_SET_MODE_STRICT assumes that read/write can't return ERESTART_RESTARTBLOCK.

But again, what can we do right now?

I do not like the idea to revert the patch which adds sys_uretprobe(). Don't get me wrong, I do not use uprobes, so personally I don't really care about the performance improvements it adds. Not to mention FRED, although I have no idea when it will be available.

Lets forget about sys_uretprobe(). Lets suppose the kernel doesn't have ERESTART_RESTARTBLOCK/sys_restart_syscall and we want to add this feature today.

How do you think we can do this without breaking the existing setups which use seccomp ?

Oleg.

189

days inactive

199

days old

linux-stable-mirror@lists.linaro.org

34 comments

participants

tags (0)

participants (10)

Andrii Nakryiko
Andy Lutomirski
Darrick J. Wong
Dmitry V. Levin
Eyal Birger
Jiri Olsa
Kees Cook
Masami Hiramatsu
Oleg Nesterov
Steven Rostedt