On Thu, Aug 8, 2024 at 10:16 AM Mark Brown <broonie@kernel.org> wrote:
Since clone3() is readily extensible let's add support for specifying a shadow stack when creating a new thread or process in a similar manner to how the normal stack is specified, keeping the current implicit allocation behaviour if one is not specified either with clone3() or through the use of clone(). The user must provide a shadow stack address and size; this must point to memory mapped for use as a shadow stack by map_shadow_stack() with a shadow stack token at the top of the stack.
As a heads-up so you don't get surprised by this in the future:
Because clone3() does not pass the flags in a register like clone() does (they live in a struct in userspace memory, which a seccomp filter cannot read), it is not available in places like Docker containers that use the default Docker seccomp policy (https://github.com/moby/moby/blob/master/profiles/seccomp/default.json). Docker uses seccomp to filter clone() arguments (to prevent stuff like namespace creation), and that's not possible with clone3(), so clone3() is blocked.
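To make the difference concrete, here is a minimal sketch in C of the decision such a policy can make. It is a plain model, not real seccomp BPF: struct model_syscall, the MODEL_* ids and model_policy() are hypothetical names made up for illustration, and the exact errno values a given policy returns are an assumption. The namespace flag values are the ones from the Linux UAPI headers, and the flags argument is assumed to be the first clone() argument, as on x86-64.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of what a seccomp filter can inspect: the syscall
 * number and the raw argument registers, nothing behind a pointer. */
struct model_syscall {
    int nr;
    uint64_t args[6];
};

/* Illustrative ids for this model only, not real syscall numbers. */
enum { MODEL_CLONE = 1, MODEL_CLONE3 = 2 };

/* Namespace-creation flags (values as in the Linux UAPI headers). */
#define MODEL_NS_FLAGS (0x00020000ULL /* CLONE_NEWNS */     | \
                        0x02000000ULL /* CLONE_NEWCGROUP */ | \
                        0x04000000ULL /* CLONE_NEWUTS */    | \
                        0x08000000ULL /* CLONE_NEWIPC */    | \
                        0x10000000ULL /* CLONE_NEWUSER */   | \
                        0x20000000ULL /* CLONE_NEWPID */    | \
                        0x40000000ULL /* CLONE_NEWNET */)

/* Returns 0 to allow the call, or the errno the model policy fails it with. */
static int model_policy(const struct model_syscall *sc)
{
    switch (sc->nr) {
    case MODEL_CLONE:
        /* clone(): the flags are in a register, so they can be masked. */
        return (sc->args[0] & MODEL_NS_FLAGS) ? EPERM : 0;
    case MODEL_CLONE3:
        /* clone3(): args[0] is a pointer to struct clone_args and args[1]
         * is its size. The flags sit in memory the filter cannot read, so
         * the policy cannot tell a thread from a new namespace and has to
         * refuse the whole syscall. */
        return ENOSYS;
    default:
        return 0;
    }
}
```

In this model a clone() call with CLONE_VM | CLONE_THREAD is allowed and one with CLONE_NEWUSER is rejected, while every clone3() call is refused no matter what its struct contains.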
The same thing applies to things like sandboxed renderer processes of web browsers - they want to block anything other than creating normal threads, so they use seccomp to block stuff like namespace creation and creating new processes.
I briefly mentioned this here during clone3 development, though I probably should have been more explicit about how it would be beneficial for clone3 to pass flags in a register: https://lore.kernel.org/all/CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17TA@mail.gmail.com/
So if you want your feature to be available in such contexts, you'll probably have to either add a new syscall clone4() that passes the flags in a register; or do the plumbing work required to make it possible to seccomp-filter things other than register contents (by invoking seccomp again from the clone3 handler with some kinda pseudo-syscall?); or change the signature of the existing syscall (but that would require something like using the high bit of the size to signal that there's a flags argument in another register, which is probably more ugly than just adding a new syscall).
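For illustration only, the third option could look roughly like the following C sketch. Nothing here is an existing or proposed ABI: HYPO_FLAGS_IN_REG, struct hypo_clone3_call and hypo_decode() are hypothetical names, and the choice of bit 63 and of a third argument register is an assumption made just to show the shape of the idea.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker: the top bit of the size argument means "a copy of
 * the flags is also passed in a third register". Not part of any real ABI. */
#define HYPO_FLAGS_IN_REG (1ULL << 63)

struct hypo_clone3_call {
    uint64_t struct_size;   /* size of struct clone_args, marker stripped */
    bool     has_reg_flags; /* true if the caller set the marker bit */
    uint64_t reg_flags;     /* flags from the register, 0 if absent */
};

/* Split the overloaded size argument into its two meanings. */
static struct hypo_clone3_call hypo_decode(uint64_t size_arg, uint64_t third_arg)
{
    struct hypo_clone3_call call;

    call.has_reg_flags = (size_arg & HYPO_FLAGS_IN_REG) != 0;
    call.struct_size = size_arg & ~HYPO_FLAGS_IN_REG;
    /* Without the marker, the third register is ignored, as it is today. */
    call.reg_flags = call.has_reg_flags ? third_arg : 0;
    return call;
}
```

A seccomp policy could then allow only calls that set the marker bit and whose register flags pass its mask, and the kernel would have to check that the register copy matches the flags in the struct. Overloading the size like this is the ugliness mentioned above.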