On Fri, Aug 09, 2024 at 07:19:26PM +0100, Catalin Marinas wrote:
On Thu, Aug 08, 2024 at 09:15:25AM +0100, Mark Brown wrote:
- /* This should really be an atomic cmpxchg. It is not. */
- if (access_remote_vm(mm, addr, &val, sizeof(val),
FOLL_FORCE) != sizeof(val))
goto out;
If we restrict the shadow stack creation only to the CLONE_VM case, we'd not need the remote vm access, it's in the current mm context already. More on this below.
The discussion in previous iterations was that it seemed better to allow even surprising use cases since it simplifies the analysis of what we have covered. If the user has specified a shadow stack we just do what they asked for and let them worry about if it's useful.
- if (val != expected)
goto out;
I'm confused that we need to consume the token here. I could not find the default shadow stack allocation doing this, only setting it via create_rstor_token() (or I did not search enough). In the default case, is the user consuming it? To me the only difference should been the default allocation vs the one passed by the user via clone3(), with the latter maybe requiring the user to set the token initially.
As discussed for a couple of previous versions if we don't have the token and userspace can specify any old shadow stack page as the shadow stack this allows clone3() to be used to overwrite the shadow stack of another thread, you can point to a shadow stack page which is currently in use and then run some code that causes shadow stack writes. This could potentially then in turn be used as part of a bigger exploit chain, probably it's hard to get anything beyond just causing the other thread to fault but won't be impossible.
With a kernel allocated shadow stack this is not an issue since we are placing the shadow stack in new memory, userspace can't control where we place it so it can't overwrite an existing shadow stack.
/*
* For CLONE_VFORK the child will share the parents
* shadow stack. Make sure to clear the internal
* tracking of the thread shadow stack so the freeing
* logic run for child knows to leave it alone.
*/
if (clone_flags & CLONE_VFORK) {
shstk->base = 0;
shstk->size = 0;
return 0;
}
I think we should leave the CLONE_VFORK check on its own independent of the clone3() arguments. If one passes both CLONE_VFORK and specific shadow stack address/size, they should be ignored (or maybe return an error if you want to make it stricter).
This is existing logic from the current x86 code that's been reindented due to the addition of explicitly specified shadow stacks, it's not new behaviour. It is needed to stop the child thinking it has the parent's shadow stack in the CLONE_VFORK case.
- /*
* For !CLONE_VM the child will use a copy of the parents shadow
* stack.
*/
- if (!(clone_flags & CLONE_VM))
return 0;
/*
* For !CLONE_VM the child will use a copy of the
* parents shadow stack.
*/
if (!(clone_flags & CLONE_VM))
return 0;
Is the !CLONE_VM case specific only to the default shadow stack allocation? Sorry if this has been discussed already (or I completely forgot) but I thought we'd only implement this for the thread creation case. The typical fork() for a new process should inherit the parent's layout, so applicable to the clone3() with the shadow stack arguments as well (which should be ignored or maybe return an error with !CLONE_VM).
This is again all existing behaviour for the case where the user has not specified a shadow stack reindented, as mentioned above if the user has specified one explicitly then we just do what we were asked. The existing behaviour is to only create a new shadow stack for the child in the CLONE_VM case and leave the child using the same shadow stack as the parent in the copied mm for !CLONE_VM.
@@ -2790,6 +2808,8 @@ pid_t kernel_clone(struct kernel_clone_args *args) */ trace_sched_process_fork(current, p);
- shstk_post_fork(p, args);
Do we need this post fork call? Can we not handle the setup via the copy_thread() path in shstk_alloc_thread_stack()?
It looks like we do actually have the new mm in the process before we call copy_thread() so we could move things into there though we'd loose a small bit of factoring out of the error handling (at one point I had more code factored out but right now it's quite small, looking again we could also factor out the get_task_mm()/mput()). ISTR having the new process' mm was the biggest reason for this initially but looking again I'm not sure why that was. It does still feel like even the small amount that's factored out currently is useful though, a bit less duplication in the architecture code which feels welcome here.