On Sat, Jan 18, 2025 at 07:39:25PM -0800, Eyal Birger wrote:
Hi,
Thank you for the detailed response.
On Sat, Jan 18, 2025 at 6:25 PM Kees Cook kees@kernel.org wrote:
On January 18, 2025 12:45:47 PM PST, Eyal Birger eyal.birger@gmail.com wrote:
I think the difference is that this syscall is not part of the process's code - it is inserted there by another process tracing it.
Well that's nothing like syscall_restart, and now I'm convinced seccomp must never ignore uretprobe -- a process might want to block uretprobe!
I think I understand your point. But do you think this is intentional? i.e. seccomp couldn't have been used to block uretprobes before this syscall implementation afaict.
So, no, sorry, this needs to be handled by the seccomp policy that is applied to the process.
The problem we're facing is that existing workloads are breaking, and as mentioned I'm not sure how practical it is to demand replacing a working docker environment because of a new syscall that was added for performance reasons.
So this is different than desiring to deploy a new version of a binary that uses a new libc or a new syscall.
Uh, no, the case I used as an example was no changes to anything except the kernel. Libc noticed the available syscall, uses it, and is instantly killed by the Docker seccomp policy which didn't know about that syscall.
That's an interesting situation and quite unexpected :) I'm glad I didn't have to face that one in production.
Here the case is that there are three players - the tracer running out of docker, the tracee running in docker, and docker itself. All three were running fine in a specific kernel version, but upgrading the kernel now crashes the traced process.
If uretprobe used to work without a syscall, then that seems to be the problem.
I agree.
But I think easiest is just fixing the Docker policy. (Which is a text file configuration change; no new binaries, no rebuilds!).
As far as I can tell libseccomp needs to provide support for this new syscall and a new docker version would need to be deployed, so It's not just a configuration change. Also the default policy which comes packed in docker would probably need to be changed to avoid having to explicitly provide a seccomp configuration for each deployment.
I think this syscall is different in that respect for the reasons described.
I don't agree, sorry. Seccomp has a really singular and specific purpose, which is explicitly *externalizing* policy. I do not want to have policy within seccomp itself.
Understood.
I don't know if seccomp is behaving correctly when it blocks a kernel implementation detail that isn't user created.
But it is user created? Something added a uretprobe to a process who's seccomp policy is not expecting it. This seems precisely by design.
I think I wasn't accurate in my wording. The uretprobe syscall is added to the tracee by the kernel. The tracer itself is merely requesting to attach a uretprobe bpf function. In previous versions, this was implemented by the kernel installing an int3 instruction, and in the new implementation the kernel is installing a uretprobe syscall. The "user" in this case - the tracer program - didn't deliberately install the syscall, but anyway this is semantics.
that's correct, uretprobe syscall is installed by kernel to special user memory map and it can be executed only from there and if process calls it from another place it receives sigill
so at the end the process executes the uretprobe syscall, but it's up to kernel to decide that and set it up.. but I don't know if that's strong enough reason for seccomp to ignore the syscall
I think I understand your point that it is regarded as "policy", only that it creates a problem in actual deployments, where in order to be able to run the tracer software which has been working on newer kernels a new docker has to be deployed.
I'm trying to find a pragmatic solution to this problem, and I understand the motivation to avoid policy in seccomp.
I could think of sysctl for that.. you complained earlier about weird semantics for that [1], but I think it's better than to remove it
jirka
Alternatively, maybe this syscall implementation should be reverted?
Thanks again, Eyal.
[1] https://lore.kernel.org/bpf/CAHsH6Gs03iJt-ziWt5Bye_DuqCbk3TpMmgPbkYh64XBvpGa...