This series is a follow-up to Joey's Permission Overlay Extension (POE) series [1] that recently landed on mainline. The goal is to improve the way we handle the register that governs which pkeys/POIndex are accessible (POR_EL0) during signal delivery. As things stand, we may unexpectedly fail to write the signal frame on the stack because POR_EL0 is not reset before the uaccess operations. See patch 3 for more details and the main changes this series brings.
A similar series landed recently for x86/MPK [2]; the present series aims at aligning arm64 with x86. Worth noting: once the signal frame is written, POR_EL0 is still set to POR_EL0_INIT, granting access to pkey 0 only. This means that a program that sets up an alternate signal stack with a non-zero pkey will need some assembly trampoline to set POR_EL0 before invoking the real signal handler, as discussed here [3].
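As an illustration of what such a trampoline would need to compute, here is a minimal user-space sketch of how a POR_EL0 value granting access to one extra pkey could be built. It assumes the encoding used by the arm64 headers at the time of this series (4 bits per pkey, POE_RW = 0x5, POE_RXW = 0x7, POR_EL0_INIT = RXW for pkey 0 only); por_set_pkey_perms() is a hypothetical helper, not an existing kernel or libc function, and actually writing the register would still require an `msr` instruction in assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, mirroring the arm64 sysreg definitions at the time. */
#define POE_RW			UINT64_C(0x5)
#define POE_RXW			UINT64_C(0x7)
#define POE_MASK		UINT64_C(0xf)
#define POR_BITS_PER_PKEY	4
#define POR_EL0_INIT		POE_RXW	/* pkey 0: RXW, all other pkeys: no access */

/*
 * Hypothetical helper: return @por with the 4-bit permission field of
 * @pkey replaced by @perm. Only computes a value; it does not touch
 * the real register.
 */
static uint64_t por_set_pkey_perms(uint64_t por, unsigned int pkey,
				   uint64_t perm)
{
	unsigned int shift = pkey * POR_BITS_PER_PKEY;

	por &= ~(POE_MASK << shift);
	por |= (perm & POE_MASK) << shift;
	return por;
}
```

With these assumptions, por_set_pkey_perms(POR_EL0_INIT, 1, POE_RW) gives the value a trampoline would write to keep pkey 0 fully accessible and add read/write access to pkey 1.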
The x86 series also added kselftests to ensure that no spurious SIGSEGV occurs during signal delivery regardless of which pkey is accessible at the point where the signal is delivered. This series adapts those kselftests to allow running them on arm64 (patches 4-5).
Finally, patch 2 is a clean-up following feedback on Joey's series [4].
I have tested this series on arm64 and x86_64 (booting and running the protection_keys and pkey_sighandler_tests mm kselftests).
- Kevin
[1] https://lore.kernel.org/linux-arm-kernel/20240822151113.1479789-1-joey.gouly...
[2] https://lore.kernel.org/lkml/20240802061318.2140081-1-aruna.ramakrishna@orac...
[3] https://lore.kernel.org/lkml/CABi2SkWxNkP2O7ipkP67WKz0-LV33e5brReevTTtba6oKU...
[4] https://lore.kernel.org/linux-arm-kernel/20241015114116.GA19334@willie-the-t...
Cc: akpm@linux-foundation.org
Cc: anshuman.khandual@arm.com
Cc: aruna.ramakrishna@oracle.com
Cc: broonie@kernel.org
Cc: catalin.marinas@arm.com
Cc: dave.hansen@linux.intel.com
Cc: dave.martin@arm.com
Cc: jeffxu@chromium.org
Cc: joey.gouly@arm.com
Cc: shuah@kernel.org
Cc: will@kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: x86@kernel.org
Kevin Brodsky (5):
  arm64: signal: Remove unused macro
  arm64: signal: Remove unnecessary check when saving POE state
  arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
  selftests/mm: Use generic pkey register manipulation
  selftests/mm: Enable pkey_sighandler_tests on arm64

 arch/arm64/kernel/signal.c                    |  92 +++++++++++++---
 tools/testing/selftests/mm/Makefile           |   8 +-
 tools/testing/selftests/mm/pkey-arm64.h       |   1 +
 tools/testing/selftests/mm/pkey-x86.h         |   2 +
 .../selftests/mm/pkey_sighandler_tests.c      | 101 +++++++++++++-----
 5 files changed, 159 insertions(+), 45 deletions(-)
Commit 33f082614c34 ("arm64: signal: Allow expansion of the signal frame") introduced the BASE_SIGFRAME_SIZE macro but it has apparently never been used.
Signed-off-by: Kevin Brodsky kevin.brodsky@arm.com
---
 arch/arm64/kernel/signal.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 561986947530..dc998326e24d 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -66,7 +66,6 @@ struct rt_sigframe_user_layout {
 	unsigned long end_offset;
 };
 
-#define BASE_SIGFRAME_SIZE	round_up(sizeof(struct rt_sigframe), 16)
 #define TERMINATOR_SIZE		round_up(sizeof(struct _aarch64_ctx), 16)
 #define EXTRA_CONTEXT_SIZE	round_up(sizeof(struct extra_context), 16)
 
On Thu, Oct 17, 2024 at 02:39:05PM +0100, Kevin Brodsky wrote:
Commit 33f082614c34 ("arm64: signal: Allow expansion of the signal frame") introduced the BASE_SIGFRAME_SIZE macro but it has apparently never been used.
Nit: Should there be a statement of what the patch does?
Same throughout the series.
(Yes, I know it's in the subject line, but Mutt doesn't think that's part of the message body, so I can't see it now that I'm replying... and submitting-patches.rst and e.g., maintainer-tip.rst seem to take the same policy, albeit without quite stating it explicitly.)
Signed-off-by: Kevin Brodsky kevin.brodsky@arm.com
Weird. Maybe there are places where this could have been used, but I guess we have managed fine without it.
Or possibly some unmerged version of the SVE patches used this but it disappeared in refactoring.
Either way:
Reviewed-by: Dave Martin Dave.Martin@arm.com
 arch/arm64/kernel/signal.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 561986947530..dc998326e24d 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -66,7 +66,6 @@ struct rt_sigframe_user_layout {
 	unsigned long end_offset;
 };
 
-#define BASE_SIGFRAME_SIZE	round_up(sizeof(struct rt_sigframe), 16)
 #define TERMINATOR_SIZE		round_up(sizeof(struct _aarch64_ctx), 16)
 #define EXTRA_CONTEXT_SIZE	round_up(sizeof(struct extra_context), 16)
 
-- 
2.43.0
On 17/10/2024 17:49, Dave Martin wrote:
On Thu, Oct 17, 2024 at 02:39:05PM +0100, Kevin Brodsky wrote:
Commit 33f082614c34 ("arm64: signal: Allow expansion of the signal frame") introduced the BASE_SIGFRAME_SIZE macro but it has apparently never been used.
Nit: Should there be a statement of what the patch does?
Same throughout the series.
(Yes, I know it's in the subject line, but Mutt doesn't think that's part of the message body, so I can't see it now that I'm replying... and submitting-patches.rst and e.g., maintainer-tip.rst seem to take the same policy, albeit without quite stating it explicitly.)
Ah good point, I didn't consider that. Will make it explicit in patch 1 and 2.
Kevin
On Mon, Oct 21, 2024 at 12:05:30PM +0200, Kevin Brodsky wrote:
On 17/10/2024 17:49, Dave Martin wrote:
On Thu, Oct 17, 2024 at 02:39:05PM +0100, Kevin Brodsky wrote:
Commit 33f082614c34 ("arm64: signal: Allow expansion of the signal frame") introduced the BASE_SIGFRAME_SIZE macro but it has apparently never been used.
Nit: Should there be a statement of what the patch does?
Same throughout the series.
(Yes, I know it's in the subject line, but Mutt doesn't think that's part of the message body, so I can't see it now that I'm replying... and submitting-patches.rst and e.g., maintainer-tip.rst seem to take the same policy, albeit without quite stating it explicitly.)
Ah good point, I didn't consider that. Will make it explicit in patch 1 and 2.
Thanks.
(I have a patch for submitting-patches.rst knocking about to propose making this more explicit, but I didn't dare to post it so far...)
Cheers ---Dave
On Thu, Oct 17, 2024 at 02:39:05PM +0100, Kevin Brodsky wrote:
Commit 33f082614c34 ("arm64: signal: Allow expansion of the signal frame") introduced the BASE_SIGFRAME_SIZE macro but it has apparently never been used.
Signed-off-by: Kevin Brodsky kevin.brodsky@arm.com
Acked-by: Catalin Marinas catalin.marinas@arm.com
The POE frame record is allocated unconditionally if POE is supported. If the allocation fails, a SIGSEGV is delivered before setup_sigframe() can be reached. As a result there is no need to check that poe_offset has been set before saving POR_EL0; this is in line with other frame records (FPMR, TPIDR2).
Signed-off-by: Kevin Brodsky kevin.brodsky@arm.com
---
 arch/arm64/kernel/signal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index dc998326e24d..f5fb48dabebe 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -1092,7 +1092,7 @@ static int setup_sigframe(struct rt_sigframe_user_layout *user,
 		err |= preserve_fpmr_context(fpmr_ctx);
 	}
 
-	if (system_supports_poe() && err == 0 && user->poe_offset) {
+	if (system_supports_poe() && err == 0) {
 		struct poe_context __user *poe_ctx =
 			apply_user_offset(user, user->poe_offset);
 
On Thu, Oct 17, 2024 at 02:39:06PM +0100, Kevin Brodsky wrote:
The POE frame record is allocated unconditionally if POE is supported. If the allocation fails, a SIGSEGV is delivered before setup_sigframe() can be reached. As a result there is no need to check that poe_offset has been set before saving POR_EL0; this is in line with other frame records (FPMR, TPIDR2).
Reviewed-by: Mark Brown broonie@kernel.org
On Thu, Oct 17, 2024 at 02:39:06PM +0100, Kevin Brodsky wrote:
The POE frame record is allocated unconditionally if POE is supported. If the allocation fails, a SIGSEGV is delivered before setup_sigframe() can be reached. As a result there is no need to check that poe_offset has been set before saving POR_EL0; this is in line with other frame records (FPMR, TPIDR2).
Signed-off-by: Kevin Brodsky kevin.brodsky@arm.com
Reviewed-by: Dave Martin Dave.Martin@arm.com
 arch/arm64/kernel/signal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index dc998326e24d..f5fb48dabebe 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -1092,7 +1092,7 @@ static int setup_sigframe(struct rt_sigframe_user_layout *user,
 		err |= preserve_fpmr_context(fpmr_ctx);
 	}
 
-	if (system_supports_poe() && err == 0 && user->poe_offset) {
+	if (system_supports_poe() && err == 0) {
 		struct poe_context __user *poe_ctx =
 			apply_user_offset(user, user->poe_offset);
 
-- 
2.43.0
On Thu, Oct 17, 2024 at 02:39:06PM +0100, Kevin Brodsky wrote:
The POE frame record is allocated unconditionally if POE is supported. If the allocation fails, a SIGSEGV is delivered before setup_sigframe() can be reached. As a result there is no need to check that poe_offset has been set before saving POR_EL0; this is in line with other frame records (FPMR, TPIDR2).
Signed-off-by: Kevin Brodsky kevin.brodsky@arm.com
Acked-by: Catalin Marinas catalin.marinas@arm.com
TL;DR: reset POR_EL0 to "allow all" before writing the signal frame, preventing spurious uaccess failures.
When POE is supported, the POR_EL0 register constrains memory accesses based on the target page's POIndex (pkey). This raises the question: what constraints should apply to a signal handler? The current answer is that POR_EL0 is reset to POR_EL0_INIT when invoking the handler, giving it full access to POIndex 0. This is in line with x86's MPK support and remains unchanged.
This is only part of the story, though. POR_EL0 constrains all unprivileged memory accesses, meaning that uaccess routines such as put_user() are also impacted. As a result POR_EL0 may prevent the signal frame from being written to the signal stack (ultimately causing a SIGSEGV). This is especially concerning when an alternate signal stack is used, because userspace may want to prevent access to it outside of signal handlers. There is currently no provision for that: POR_EL0 is reset after writing to the stack, and POR_EL0_INIT only enables access to POIndex 0.
This patch ensures that POR_EL0 is reset to its most permissive state before the signal stack is accessed. Once the signal frame has been fully written, POR_EL0 is still set to POR_EL0_INIT - it is up to the signal handler to enable access to additional pkeys if needed. As to sigreturn(), it expects having access to the stack like any other syscall; we only need to ensure that POR_EL0 is restored from the signal frame after all uaccess calls. This approach is in line with the recent x86/pkeys series [1].
Resetting POR_EL0 early introduces some complications, in that we can no longer read the register directly in preserve_poe_context(). This is addressed by introducing a struct (unpriv_access_state) and helpers to manage any such register impacting uaccess. Things look like this on signal delivery:

1. Save original POR_EL0 into struct [save_reset_unpriv_access_state()]
2. Set POR_EL0 to "allow all" [save_reset_unpriv_access_state()]
3. Create signal frame
4. Write saved POR_EL0 value to the signal frame [preserve_poe_context()]
5. Finalise signal frame
6. Set POR_EL0 to POR_EL0_INIT [set_handler_unpriv_access_state()]

The return path (sys_rt_sigreturn) doesn't strictly require any change since restore_poe_context() is already called last. However, to avoid uaccess calls being accidentally added after that point, we use the same approach as in the delivery path, i.e. separating uaccess from writing to the register:

1. Read saved POR_EL0 value from the signal frame [restore_poe_context()]
2. Set POR_EL0 to the saved value [restore_unpriv_access_state()]
[1] https://lore.kernel.org/lkml/20240802061318.2140081-1-aruna.ramakrishna@orac...
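The delivery and return sequences described above can be modelled in plain user-space C, which may help when checking the ordering. This is only a sketch: model_por_el0 is an ordinary variable standing in for the real register, the frame is a local struct rather than user memory, and all model_* names and MODEL_* constants are illustrative assumptions rather than kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_POR_INIT		UINT64_C(0x7)		/* assumed: RXW for pkey 0 only */
#define MODEL_POR_ALLOW_ALL	UINT64_C(0x77777777)	/* assumed: RXW for 8 pkeys */

static uint64_t model_por_el0;		/* stand-in for the POR_EL0 register */

struct model_access_state {		/* plays the role of unpriv_access_state */
	uint64_t por_el0;
};

struct model_frame {			/* plays the role of the POE frame record */
	uint64_t por_el0;
};

/* Delivery steps 1 and 2: save the current value, then lift restrictions. */
static void model_save_reset(struct model_access_state *state)
{
	state->por_el0 = model_por_el0;
	model_por_el0 = MODEL_POR_ALLOW_ALL;
}

/* Delivery steps 1 to 6 in order. */
static void model_deliver(struct model_frame *frame)
{
	struct model_access_state state;

	model_save_reset(&state);
	/* Steps 3 to 5: the frame records the saved value, not the live one. */
	frame->por_el0 = state.por_el0;
	/* Step 6: last action, no frame access after this point. */
	model_por_el0 = MODEL_POR_INIT;
}

/* Return path: read from the frame first, write the register last. */
static void model_sigreturn(const struct model_frame *frame)
{
	struct model_access_state state;

	state.por_el0 = frame->por_el0;
	model_por_el0 = state.por_el0;
}
```

In this model, a value present in model_por_el0 before model_deliver() ends up in the frame, the handler runs with MODEL_POR_INIT, and model_sigreturn() puts the original value back.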
Signed-off-by: Kevin Brodsky kevin.brodsky@arm.com
---
 arch/arm64/kernel/signal.c | 89 ++++++++++++++++++++++++++++++++------
 1 file changed, 75 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index f5fb48dabebe..3548146084b3 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -66,9 +66,64 @@ struct rt_sigframe_user_layout {
 	unsigned long end_offset;
 };
 
+/*
+ * Holds any EL0-controlled state that influences unprivileged memory accesses.
+ * This includes both accesses done in userspace and uaccess done in the kernel.
+ *
+ * This state needs to be carefully managed to ensure that it doesn't cause
+ * uaccess to fail when setting up the signal frame, and the signal handler
+ * itself also expects a well-defined state when entered.
+ */
+struct unpriv_access_state {
+	u64 por_el0;
+};
+
 #define TERMINATOR_SIZE		round_up(sizeof(struct _aarch64_ctx), 16)
 #define EXTRA_CONTEXT_SIZE	round_up(sizeof(struct extra_context), 16)
 
+/*
+ * Save the unpriv access state into ua_state and reset it to disable any
+ * restrictions.
+ */
+static void save_reset_unpriv_access_state(struct unpriv_access_state *ua_state)
+{
+	if (system_supports_poe()) {
+		/*
+		 * Enable all permissions in all 8 keys
+		 * (inspired by REPEAT_BYTE())
+		 */
+		u64 por_enable_all = (~0u / POE_MASK) * POE_RXW;
+
+		ua_state->por_el0 = read_sysreg_s(SYS_POR_EL0);
+		write_sysreg_s(por_enable_all, SYS_POR_EL0);
+		/* Ensure that any subsequent uaccess observes the updated value */
+		isb();
+	}
+}
+
+/*
+ * Set the unpriv access state for invoking the signal handler.
+ *
+ * No uaccess should be done after that function is called.
+ */
+static void set_handler_unpriv_access_state(void)
+{
+	if (system_supports_poe())
+		write_sysreg_s(POR_EL0_INIT, SYS_POR_EL0);
+
+}
+
+/*
+ * Restore the unpriv access state to the values saved in ua_state.
+ *
+ * No uaccess should be done after that function is called.
+ */
+static void restore_unpriv_access_state(const struct unpriv_access_state *ua_state)
+{
+	if (system_supports_poe())
+		write_sysreg_s(ua_state->por_el0, SYS_POR_EL0);
+}
+
 static void init_user_layout(struct rt_sigframe_user_layout *user)
 {
 	const size_t reserved_size =
@@ -260,18 +315,20 @@ static int restore_fpmr_context(struct user_ctxs *user)
 	return err;
 }
 
-static int preserve_poe_context(struct poe_context __user *ctx)
+static int preserve_poe_context(struct poe_context __user *ctx,
+				const struct unpriv_access_state *ua_state)
 {
 	int err = 0;
 
 	__put_user_error(POE_MAGIC, &ctx->head.magic, err);
 	__put_user_error(sizeof(*ctx), &ctx->head.size, err);
-	__put_user_error(read_sysreg_s(SYS_POR_EL0), &ctx->por_el0, err);
+	__put_user_error(ua_state->por_el0, &ctx->por_el0, err);
 
 	return err;
 }
 
-static int restore_poe_context(struct user_ctxs *user)
+static int restore_poe_context(struct user_ctxs *user,
+			       struct unpriv_access_state *ua_state)
 {
 	u64 por_el0;
 	int err = 0;
@@ -281,7 +338,7 @@ static int restore_poe_context(struct user_ctxs *user)
 
 	__get_user_error(por_el0, &(user->poe->por_el0), err);
 	if (!err)
-		write_sysreg_s(por_el0, SYS_POR_EL0);
+		ua_state->por_el0 = por_el0;
 
 	return err;
 }
@@ -849,7 +906,8 @@ static int parse_user_sigframe(struct user_ctxs *user,
 }
 
 static int restore_sigframe(struct pt_regs *regs,
-			    struct rt_sigframe __user *sf)
+			    struct rt_sigframe __user *sf,
+			    struct unpriv_access_state *ua_state)
 {
 	sigset_t set;
 	int i, err;
@@ -898,7 +956,7 @@ static int restore_sigframe(struct pt_regs *regs,
 		err = restore_zt_context(&user);
 
 	if (err == 0 && system_supports_poe() && user.poe)
-		err = restore_poe_context(&user);
+		err = restore_poe_context(&user, ua_state);
 
 	return err;
 }
@@ -907,6 +965,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
 {
 	struct pt_regs *regs = current_pt_regs();
 	struct rt_sigframe __user *frame;
+	struct unpriv_access_state ua_state;
 
 	/* Always make any pending restarted system calls return -EINTR */
 	current->restart_block.fn = do_no_restart_syscall;
@@ -923,12 +982,14 @@ SYSCALL_DEFINE0(rt_sigreturn)
 	if (!access_ok(frame, sizeof (*frame)))
 		goto badframe;
 
-	if (restore_sigframe(regs, frame))
+	if (restore_sigframe(regs, frame, &ua_state))
 		goto badframe;
 
 	if (restore_altstack(&frame->uc.uc_stack))
 		goto badframe;
 
+	restore_unpriv_access_state(&ua_state);
+
 	return regs->regs[0];
 
 badframe:
@@ -1034,7 +1095,8 @@ static int setup_sigframe_layout(struct rt_sigframe_user_layout *user,
 }
 
 static int setup_sigframe(struct rt_sigframe_user_layout *user,
-			  struct pt_regs *regs, sigset_t *set)
+			  struct pt_regs *regs, sigset_t *set,
+			  const struct unpriv_access_state *ua_state)
 {
 	int i, err = 0;
 	struct rt_sigframe __user *sf = user->sigframe;
@@ -1096,10 +1158,9 @@ static int setup_sigframe(struct rt_sigframe_user_layout *user,
 		struct poe_context __user *poe_ctx =
 			apply_user_offset(user, user->poe_offset);
 
-		err |= preserve_poe_context(poe_ctx);
+		err |= preserve_poe_context(poe_ctx, ua_state);
 	}
 
-
 	/* ZA state if present */
 	if (system_supports_sme() && err == 0 && user->za_offset) {
 		struct za_context __user *za_ctx =
@@ -1236,9 +1297,6 @@ static void setup_return(struct pt_regs *regs, struct k_sigaction *ka,
 		sme_smstop();
 	}
 
-	if (system_supports_poe())
-		write_sysreg_s(POR_EL0_INIT, SYS_POR_EL0);
-
 	if (ka->sa.sa_flags & SA_RESTORER)
 		sigtramp = ka->sa.sa_restorer;
 	else
@@ -1252,9 +1310,11 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set,
 {
 	struct rt_sigframe_user_layout user;
 	struct rt_sigframe __user *frame;
+	struct unpriv_access_state ua_state;
 	int err = 0;
 
 	fpsimd_signal_preserve_current_state();
+	save_reset_unpriv_access_state(&ua_state);
 
 	if (get_sigframe(&user, ksig, regs))
 		return 1;
@@ -1265,7 +1325,7 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set,
 	__put_user_error(NULL, &frame->uc.uc_link, err);
 
 	err |= __save_altstack(&frame->uc.uc_stack, regs->sp);
-	err |= setup_sigframe(&user, regs, set);
+	err |= setup_sigframe(&user, regs, set, &ua_state);
 	if (err == 0) {
 		setup_return(regs, &ksig->ka, &user, usig);
 		if (ksig->ka.sa.sa_flags & SA_SIGINFO) {
@@ -1273,6 +1333,7 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set,
 			regs->regs[1] = (unsigned long)&frame->info;
 			regs->regs[2] = (unsigned long)&frame->uc;
 		}
+		set_handler_unpriv_access_state();
 	}
 
 	return err;
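The "allow all" constant in save_reset_unpriv_access_state() may look opaque at first, so here is a small user-space check of what the expression evaluates to. It is a sketch under assumptions: POE_MASK = 0xf and POE_RXW = 0x7 mirror the arm64 definitions at the time, unsigned int is taken to be 32 bits wide, and both function names are hypothetical. The trick is the same one as REPEAT_BYTE(): dividing an all-ones word by the field mask yields a 1 in the lowest bit of every field, and multiplying by the field value replicates it.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, mirroring the arm64 sysreg definitions at the time. */
#define POE_RXW		UINT64_C(0x7)
#define POE_MASK	UINT64_C(0xf)

/*
 * Same expression as in the patch. ~0u is a 32-bit all-ones value, so
 * 0xffffffff / 0xf = 0x11111111, i.e. one set bit per 4-bit field for
 * 8 fields; multiplying by POE_RXW fills each of them with RXW.
 */
static uint64_t por_enable_all_trick(void)
{
	return (~0u / POE_MASK) * POE_RXW;
}

/* Straightforward construction, used to cross-check the expression above. */
static uint64_t por_enable_all_loop(unsigned int nr_pkeys)
{
	uint64_t por = 0;
	unsigned int pkey;

	for (pkey = 0; pkey < nr_pkeys; pkey++)
		por |= POE_RXW << (4 * pkey);
	return por;
}
```

Both functions give 0x77777777 for 8 pkeys, which matches the "all 8 keys" wording in the comment; the upper 32 bits of the 64-bit value stay clear.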
On Thu, Oct 17, 2024 at 02:39:07PM +0100, Kevin Brodsky wrote:
TL;DR: reset POR_EL0 to "allow all" before writing the signal frame, preventing spurious uaccess failures.
When POE is supported, the POR_EL0 register constrains memory accesses based on the target page's POIndex (pkey). This raises the question: what constraints should apply to a signal handler? The current answer is that POR_EL0 is reset to POR_EL0_INIT when invoking the handler, giving it full access to POIndex 0. This is in line with x86's MPK support and remains unchanged.
This is only part of the story, though. POR_EL0 constrains all unprivileged memory accesses, meaning that uaccess routines such as put_user() are also impacted. As a result POR_EL0 may prevent the signal frame from being written to the signal stack (ultimately causing a SIGSEGV). This is especially concerning when an alternate signal stack is used, because userspace may want to prevent access to it outside of signal handlers. There is currently no provision for that: POR_EL0 is reset after writing to the stack, and POR_EL0_INIT only enables access to POIndex 0.
This all seems a bit convoluted.
The issues seem to boil down to: the signal frame is read and written on behalf of the signal handler, and so should be done with consistent permissions to those the signal handler gets (instead of whatever random prevailing permissions were specified in POR_EL0 before signal delivery... and were possibly responsible for triggering the signal in the first place.)
The need to save/restore POR_EL0 via staging storage seems to follow on naturally from that, since if POR_EL0 was independent of uaccess then this patch would be redundant...
This patch ensures that POR_EL0 is reset to its most permissive
Is this right? See my comment on save_reset_unpriv_access_state() below.
state before the signal stack is accessed. Once the signal frame has been fully written, POR_EL0 is still set to POR_EL0_INIT - it is up to the signal handler to enable access to additional pkeys if needed. As to sigreturn(), it expects having access to the stack like any other syscall; we only need to ensure that POR_EL0 is restored from the signal frame after all uaccess calls. This approach is in line with the recent x86/pkeys series [1].
Resetting POR_EL0 early introduces some complications, in that we can no longer read the register directly in preserve_poe_context(). This is addressed by introducing a struct (unpriv_access_state) and helpers to manage any such register impacting uaccess. Things look like this on signal delivery:
- Save original POR_EL0 into struct [save_reset_unpriv_access_state()]
- Set POR_EL0 to "allow all" [save_reset_unpriv_access_state()]
- Create signal frame
- Write saved POR_EL0 value to the signal frame [preserve_poe_context()]
- Finalise signal frame
- Set POR_EL0 to POR_EL0_INIT [set_handler_unpriv_access_state()]
The return path (sys_rt_sigreturn) doesn't strictly require any change since restore_poe_context() is already called last. However, to avoid uaccess calls being accidentally added after that point, we use the same approach as in the delivery path, i.e. separating uaccess from writing to the register:
- Read saved POR_EL0 value from the signal frame [restore_poe_context()]
- Set POR_EL0 to the saved value [restore_unpriv_access_state()]
[1] https://lore.kernel.org/lkml/20240802061318.2140081-1-aruna.ramakrishna@orac...
Signed-off-by: Kevin Brodsky kevin.brodsky@arm.com
 arch/arm64/kernel/signal.c | 89 ++++++++++++++++++++++++++++++++------
 1 file changed, 75 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index f5fb48dabebe..3548146084b3 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -66,9 +66,64 @@ struct rt_sigframe_user_layout {
 	unsigned long end_offset;
 };
 
+/*
+ * Holds any EL0-controlled state that influences unprivileged memory accesses.
+ * This includes both accesses done in userspace and uaccess done in the kernel.
+ *
+ * This state needs to be carefully managed to ensure that it doesn't cause
+ * uaccess to fail when setting up the signal frame, and the signal handler
+ * itself also expects a well-defined state when entered.
+ */
+struct unpriv_access_state {
+	u64 por_el0;
+};
+
 #define TERMINATOR_SIZE		round_up(sizeof(struct _aarch64_ctx), 16)
 #define EXTRA_CONTEXT_SIZE	round_up(sizeof(struct extra_context), 16)
 
+/*
+ * Save the unpriv access state into ua_state and reset it to disable any
+ * restrictions.
+ */
+static void save_reset_unpriv_access_state(struct unpriv_access_state *ua_state)
Would _user_ be more consistent naming than _unpriv_ ?
Same elsewhere.
+{
+	if (system_supports_poe()) {
+		/*
+		 * Enable all permissions in all 8 keys
+		 * (inspired by REPEAT_BYTE())
+		 */
+		u64 por_enable_all = (~0u / POE_MASK) * POE_RXW;
Yikes!
Seriously though, why are we granting permissions that the signal handler isn't itself going to have over its own stack?
I think the logical thing to do is to think of the write/read of the signal frame as being done on behalf of the signal handler, so the permissions should be those we're going to give the signal handler: not less, and (so far as we can approximate) not more.
+		ua_state->por_el0 = read_sysreg_s(SYS_POR_EL0);
+		write_sysreg_s(por_enable_all, SYS_POR_EL0);
+		/* Ensure that any subsequent uaccess observes the updated value */
+		isb();
+	}
+}
+
+/*
+ * Set the unpriv access state for invoking the signal handler.
+ *
+ * No uaccess should be done after that function is called.
+ */
+static void set_handler_unpriv_access_state(void)
+{
+	if (system_supports_poe())
+		write_sysreg_s(POR_EL0_INIT, SYS_POR_EL0);
+
Spurious blank line?
+}
[...]
@@ -1252,9 +1310,11 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set,
 {
 	struct rt_sigframe_user_layout user;
 	struct rt_sigframe __user *frame;
+	struct unpriv_access_state ua_state;
 	int err = 0;
 
 	fpsimd_signal_preserve_current_state();
+	save_reset_unpriv_access_state(&ua_state);
(Trivial nit: maybe put the blank line before this rather than after? This has nothing to do with "settling" the kernel's internal context switch state, and a lot to do with generating the signal frame...)
 	if (get_sigframe(&user, ksig, regs))
 		return 1;
[...]
@@ -1273,6 +1333,7 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set,
 			regs->regs[1] = (unsigned long)&frame->info;
 			regs->regs[2] = (unsigned long)&frame->uc;
 		}
+		set_handler_unpriv_access_state();
This bit feels prematurely factored? We don't have separate functions for the other low-level preparation done here...
It works either way though, and I don't have a strong view.
Overall, this all looks reasonable.
Cheers ---Dave
On 17/10/2024 17:53, Dave Martin wrote:
[...]
+/*
+ * Save the unpriv access state into ua_state and reset it to disable any
+ * restrictions.
+ */
+static void save_reset_unpriv_access_state(struct unpriv_access_state *ua_state)
Would _user_ be more consistent naming than _unpriv_ ?
I did ponder on the naming. I considered user_access/uaccess instead of unpriv_access, but my concern is that it might imply that only uaccess is concerned, while in reality loads/stores that userspace itself executes are impacted too. I thought using the "unpriv" terminology from the Arm ARM (used for stage 1 permissions) might avoid such misunderstanding. I'm interested to hear opinions on this, maybe accuracy sacrifices readability.
Same elsewhere.
+{
+	if (system_supports_poe()) {
+		/*
+		 * Enable all permissions in all 8 keys
+		 * (inspired by REPEAT_BYTE())
+		 */
+		u64 por_enable_all = (~0u / POE_MASK) * POE_RXW;
Yikes!
Seriously though, why are we granting permissions that the signal handler isn't itself going to have over its own stack?
I think the logical thing to do is to think of the write/read of the signal frame as being done on behalf of the signal handler, so the permissions should be those we're going to give the signal handler: not less, and (so far as we can approximate) not more.
Will continue that discussion on the cover letter.
+		ua_state->por_el0 = read_sysreg_s(SYS_POR_EL0);
+		write_sysreg_s(por_enable_all, SYS_POR_EL0);
+		/* Ensure that any subsequent uaccess observes the updated value */
+		isb();
+	}
+}
+
+/*
+ * Set the unpriv access state for invoking the signal handler.
+ *
+ * No uaccess should be done after that function is called.
+ */
+static void set_handler_unpriv_access_state(void)
+{
+	if (system_supports_poe())
+		write_sysreg_s(POR_EL0_INIT, SYS_POR_EL0);
+
Spurious blank line?
Thanks!
+}
[...]
@@ -1252,9 +1310,11 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set,
 {
 	struct rt_sigframe_user_layout user;
 	struct rt_sigframe __user *frame;
+	struct unpriv_access_state ua_state;
 	int err = 0;
 
 	fpsimd_signal_preserve_current_state();
+	save_reset_unpriv_access_state(&ua_state);
(Trivial nit: maybe put the blank line before this rather than after? This has nothing to do with "settling" the kernel's internal context switch state, and a lot to do with generating the signal frame...)
In fact considering the concern Catalin brought up with POR_EL0 being reset even when we fail to deliver the signal [1], I'm realising this call should be moved after get_sigframe(), since the latter doesn't use uaccess and can fail.
[1] https://lore.kernel.org/linux-arm-kernel/Zw6D2waVyIwYE7wd@arm.com/
 	if (get_sigframe(&user, ksig, regs))
 		return 1;
[...]
@@ -1273,6 +1333,7 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set,
 			regs->regs[1] = (unsigned long)&frame->info;
 			regs->regs[2] = (unsigned long)&frame->uc;
 		}
+		set_handler_unpriv_access_state();
This bit feels prematurely factored? We don't have separate functions for the other low-level preparation done here...
I preferred to have a consistent API for all manipulations of POR_EL0, the idea being that if more registers are added to struct unpriv_access_state, only the *unpriv_access* helpers need to be amended.
It works either way though, and I don't have a strong view.
Overall, this all looks reasonable.
Thanks for the review!
Kevin
On Mon, Oct 21, 2024 at 12:06:07PM +0200, Kevin Brodsky wrote:
On 17/10/2024 17:53, Dave Martin wrote:
[...]
+/*
+ * Save the unpriv access state into ua_state and reset it to disable any
+ * restrictions.
+ */
+static void save_reset_unpriv_access_state(struct unpriv_access_state *ua_state)
Would _user_ be more consistent naming than _unpriv_ ?
I did ponder on the naming. I considered user_access/uaccess instead of unpriv_access, but my concern is that it might imply that only uaccess is concerned, while in reality loads/stores that userspace itself executes are impacted too. I thought using the "unpriv" terminology from the Arm ARM (used for stage 1 permissions) might avoid such misunderstanding. I'm interested to hear opinions on this, maybe accuracy sacrifices readability.
"user_access" seemed natural to me: it parses equally as "[user access]" (i.e., uaccess) and "[user] access" (i.e., access by, to, or on behalf of user(space)).
Introducing an architectural term when there is already a generic OS and Linux kernel term that means the right thing seemed not to improve readability, but I guess it's a matter of opinion.
Anyway, it doesn't really matter.
Same elsewhere.
+{
+	if (system_supports_poe()) {
+		/*
+		 * Enable all permissions in all 8 keys
+		 * (inspired by REPEAT_BYTE())
+		 */
+		u64 por_enable_all = (~0u / POE_MASK) * POE_RXW;
Yikes!
Seriously though, why are we granting permissions that the signal handler isn't itself going to have over its own stack?
I think the logical thing to do is to think of the write/read of the signal frame as being done on behalf of the signal handler, so the permissions should be those we're going to give the signal handler: not less, and (so far as we can approximate) not more.
Will continue that discussion on the cover letter.
+		ua_state->por_el0 = read_sysreg_s(SYS_POR_EL0);
+		write_sysreg_s(por_enable_all, SYS_POR_EL0);
+		/* Ensure that any subsequent uaccess observes the updated value */
+		isb();
+	}
+}
+
+/*
+ * Set the unpriv access state for invoking the signal handler.
+ *
+ * No uaccess should be done after that function is called.
+ */
+static void set_handler_unpriv_access_state(void)
+{
+	if (system_supports_poe())
+		write_sysreg_s(POR_EL0_INIT, SYS_POR_EL0);
+
Spurious blank line?
Thanks!
+}
[...]
@@ -1252,9 +1310,11 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set,
 {
 	struct rt_sigframe_user_layout user;
 	struct rt_sigframe __user *frame;
+	struct unpriv_access_state ua_state;
 	int err = 0;
 
 	fpsimd_signal_preserve_current_state();
+	save_reset_unpriv_access_state(&ua_state);
(Trivial nit: maybe put the blank line before this rather than after? This has nothing to do with "settling" the kernel's internal context switch state, and a lot to do with generating the signal frame...)
In fact considering the concern Catalin brought up with POR_EL0 being reset even when we fail to deliver the signal [1], I'm realising this call should be moved after get_sigframe(), since the latter doesn't use uaccess and can fail.
[1] https://lore.kernel.org/linux-arm-kernel/Zw6D2waVyIwYE7wd@arm.com/
 	if (get_sigframe(&user, ksig, regs))
 		return 1;
[...]
^
Ah, good point. The save_reset_unpriv_access_state(&ua_state) call probably belongs just before the first __put_user() then.
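To make the effect of that reordering concrete, here is a user-space sketch of the delivery path with the save/reset step moved after the layout step. It is a model under assumptions, not the v2 code: sim_por_el0 is a plain variable standing in for the register, the layout_ok flag stands in for get_sigframe() succeeding or failing, and all sim_* and SIM_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIM_POR_INIT		UINT64_C(0x7)		/* assumed: RXW for pkey 0 only */
#define SIM_POR_ALLOW_ALL	UINT64_C(0x77777777)	/* assumed: RXW for 8 pkeys */

static uint64_t sim_por_el0;	/* stand-in for the POR_EL0 register */

/*
 * Returns 0 on success and 1 on failure, like setup_rt_frame().
 * @layout_ok models get_sigframe(), which does no uaccess and can fail.
 * @frame_por models the POE record written into the signal frame.
 */
static int sim_setup_rt_frame(bool layout_ok, uint64_t *frame_por)
{
	uint64_t saved;

	if (!layout_ok)
		return 1;	/* early failure: the register is left untouched */

	/* Save and reset only once uaccess is about to start. */
	saved = sim_por_el0;
	sim_por_el0 = SIM_POR_ALLOW_ALL;

	*frame_por = saved;	/* models the uaccess writes to the frame */

	sim_por_el0 = SIM_POR_INIT;	/* state for the handler, set last */
	return 0;
}
```

With this ordering, a failure in the layout step leaves the caller's POR_EL0 value in place, which is the property the discussion above is after. The model does not cover a failure during the frame writes themselves.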
@@ -1273,6 +1333,7 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set,
 			regs->regs[1] = (unsigned long)&frame->info;
 			regs->regs[2] = (unsigned long)&frame->uc;
 		}
+		set_handler_unpriv_access_state();
This bit feels prematurely factored? We don't have separate functions for the other low-level preparation done here...
I preferred to have a consistent API for all manipulations of POR_EL0, the idea being that if more registers are added to struct unpriv_access_state, only the *unpriv_access* helpers need to be amended.
Certainly if that struct grows more state, then the factoring will help in future. I wasn't clear on how we expect this all to evolve.
Either way, this is basically a non-issue, and keeping the symmetry is probably a good idea.
Cheers ---Dave
On 21/10/2024 15:43, Dave Martin wrote:
On Mon, Oct 21, 2024 at 12:06:07PM +0200, Kevin Brodsky wrote:
On 17/10/2024 17:53, Dave Martin wrote:
[...]
+/*
+ * Save the unpriv access state into ua_state and reset it to disable any
+ * restrictions.
+ */
+static void save_reset_unpriv_access_state(struct unpriv_access_state *ua_state)
Would _user_ be more consistent naming than _unpriv_ ?
I did ponder on the naming. I considered user_access/uaccess instead of unpriv_access, but my concern is that it might imply that only uaccess is concerned, while in reality loads/stores that userspace itself executes are impacted too. I thought using the "unpriv" terminology from the Arm ARM (used for stage 1 permissions) might avoid such misunderstanding. I'm interested to hear opinions on this, maybe accuracy sacrifices readability.
"user_access" seemed natural to me: it parses equally as "[user access]" (i.e., uaccess) and "[user] access" (i.e., access by, to, or on behalf of user(space)).
Introducing an architectural term when there is already a generic OS and Linux kernel term that means the right thing seemed not to improve readability, but I guess it's a matter of opinion.
Both good points. "user_access" seems to strike the right balance, plus it's slightly shorter. Will switch to that naming in v2.
Kevin
Hi,
On Tue, Oct 22, 2024 at 02:34:09PM +0200, Kevin Brodsky wrote:
On 21/10/2024 15:43, Dave Martin wrote:
On Mon, Oct 21, 2024 at 12:06:07PM +0200, Kevin Brodsky wrote:
On 17/10/2024 17:53, Dave Martin wrote:
[...]
+/*
+ * Save the unpriv access state into ua_state and reset it to disable any
+ * restrictions.
+ */
+static void save_reset_unpriv_access_state(struct unpriv_access_state *ua_state)
Would _user_ be more consistent naming than _unpriv_ ?
I did ponder on the naming. I considered user_access/uaccess instead of unpriv_access, but my concern is that it might imply that only uaccess is concerned, while in reality loads/stores that userspace itself executes are impacted too. I thought using the "unpriv" terminology from the Arm ARM (used for stage 1 permissions) might avoid such misunderstanding. I'm interested to hear opinions on this, maybe accuracy sacrifices readability.
"user_access" seemed natural to me: it parses equally as "[user access]" (i.e., uaccess) and "[user] access" (i.e., access by, to, or on behalf of user(space)).
Introducing an architectural term when there is already a generic OS and Linux kernel term that means the right thing seemed not to improve readability, but I guess it's a matter of opinion.
Both good points. "user_access" seems to strike the right balance, plus it's slightly shorter. Will switch to that naming in v2.
Suits me (wasn't sure I was going to win that one actually!)
Cheers ---Dave
Hi,
Just in case the reply I thought I'd sent to this evaporated (or I imagined it):
On Mon, Oct 21, 2024 at 12:06:07PM +0200, Kevin Brodsky wrote:
On 17/10/2024 17:53, Dave Martin wrote:
[...]
+/*
+ * Save the unpriv access state into ua_state and reset it to disable any
+ * restrictions.
+ */
+static void save_reset_unpriv_access_state(struct unpriv_access_state *ua_state)
Would _user_ be more consistent naming than _unpriv_ ?
I did ponder on the naming. I considered user_access/uaccess instead of unpriv_access, but my concern is that it might imply that only uaccess is concerned, while in reality loads/stores that userspace itself executes are impacted too. I thought using the "unpriv" terminology from the Arm ARM (used for stage 1 permissions) might avoid such misunderstanding. I'm interested to hear opinions on this, maybe accuracy sacrifices readability.
Same elsewhere.
I think "user" covers these meanings, though including the word "access" makes it sound like this is specific to uaccess.
Maybe something like:
save_reset_user_permissions() restore_user_permissions()
would make sense? (But again, it's not a big deal.)
+{
+	if (system_supports_poe()) {
+		/*
+		 * Enable all permissions in all 8 keys
+		 * (inspired by REPEAT_BYTE())
+		 */
+		u64 por_enable_all = (~0u / POE_MASK) * POE_RXW;
Yikes!
Seriously though, why are we granting permissions that the signal handler isn't itself going to have over its own stack?
I think the logical thing to do is to think of the write/read of the signal frame as being done on behalf of the signal handler, so the permissions should be those we're going to give the signal handler: not less, and (so far as we can approximate) not more.
Will continue that discussion on the cover letter.
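To make the quoted `(~0u / POE_MASK) * POE_RXW` expression less surprising, here is a minimal userspace sketch of the arithmetic. The constant values (`0xf` for the per-key field mask, `0x7` for read, execute and write) are assumptions about the arm64 encoding rather than values taken from the patch, and the `MODEL_`/`model_` names are made up for this illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, modelled on the arm64 POE field encoding (not copied from the patch). */
#define MODEL_POE_MASK	0xfULL	/* one 4-bit permission field per key */
#define MODEL_POE_RXW	0x7ULL	/* read + execute + write */

/*
 * Same trick as REPEAT_BYTE(): dividing an all-ones 32-bit word by the field
 * mask yields a 1 in the lowest bit of every 4-bit field, and multiplying by
 * the per-field value replicates that value into each field.
 */
static uint64_t model_por_enable_all(void)
{
	uint64_t one_per_field = ~0u / MODEL_POE_MASK;	/* 0x11111111 */

	return one_per_field * MODEL_POE_RXW;
}

/* Extract the 4-bit permission field of a given key from a register value. */
static unsigned int model_key_perms(uint64_t por, unsigned int key)
{
	return (por >> (key * 4)) & MODEL_POE_MASK;
}

/* Usage check: each of the 8 keys ends up with RXW. */
void model_check_enable_all(void)
{
	uint64_t por = model_por_enable_all();

	assert(por == 0x77777777ULL);
	for (unsigned int key = 0; key < 8; key++)
		assert(model_key_perms(por, key) == MODEL_POE_RXW);
}
```

Under these assumptions the result is the same bit pattern as the `PKEY_ALLOW_ALL` value (`0x77777777`) in the arm64 selftest header quoted elsewhere in this thread.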
+		ua_state->por_el0 = read_sysreg_s(SYS_POR_EL0);
+		write_sysreg_s(por_enable_all, SYS_POR_EL0);
+		/* Ensure that any subsequent uaccess observes the updated value */
+		isb();
+	}
+}
+
+/*
+ * Set the unpriv access state for invoking the signal handler.
+ * No uaccess should be done after that function is called.
+ */
+static void set_handler_unpriv_access_state(void)
+{
+	if (system_supports_poe())
+		write_sysreg_s(POR_EL0_INIT, SYS_POR_EL0);
+
Spurious blank line?
Thanks!
+}
[...]
@@ -1252,9 +1310,11 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set,
 {
 	struct rt_sigframe_user_layout user;
 	struct rt_sigframe __user *frame;
+	struct unpriv_access_state ua_state;
 	int err = 0;

 	fpsimd_signal_preserve_current_state();
+	save_reset_unpriv_access_state(&ua_state);
(Trivial nit: maybe put the blank line before this rather than after? This has nothing to do with "settling" the kernel's internal context switch state, and a lot to do with generating the signal frame...)
In fact considering the concern Catalin brought up with POR_EL0 being reset even when we fail to deliver the signal [1], I'm realising this call should be moved after get_sigframe(), since the latter doesn't use uaccess and can fail.
Good point...
[1] https://lore.kernel.org/linux-arm-kernel/Zw6D2waVyIwYE7wd@arm.com/
 	if (get_sigframe(&user, ksig, regs))
 		return 1;
[...]
I guess the call can be pushed to just before the first __put_user(), after here?
@@ -1273,6 +1333,7 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set,
 		regs->regs[1] = (unsigned long)&frame->info;
 		regs->regs[2] = (unsigned long)&frame->uc;
 	}

+	set_handler_unpriv_access_state();
This bit feels prematurely factored? We don't have separate functions for the other low-level preparation done here...
I preferred to have a consistent API for all manipulations of POR_EL0, the idea being that if more registers are added to struct unpriv_access_state, only the *unpriv_access* helpers need to be amended.
It works either way though, and I don't have a strong view.
Overall, this all looks reasonable.
Keeping the symmetry seems generally a good idea, especially if we expect that struct to grow more state over time. I wasn't sure how we anticipate this evolving.
[...]
Cheers ---Dave
pkey_sighandler_tests.c currently hardcodes x86 PKRU encodings. The first step towards running those tests on arm64 is to abstract away the pkey register values.
Since those tests want to deny access to all keys except a few, we have each arch define PKEY_ALLOW_NONE, the pkey register value denying access to all keys. We then use the existing set_pkey_bits() helper to grant access to specific keys.
Because pkeys may also remove the execute permission on arm64, we need to be a little careful: all code is mapped with pkey 0, and we need it to remain executable. pkey_reg_no_access is introduced for that purpose: this value prevents RW access to all pkeys, but retains X permission for pkey 0.
test_pkru_preserved_after_sigusr1() only checks that the pkey register value remains unchanged after a signal is delivered, so the particular value is irrelevant. We enable pkey 0 and a few more arbitrary keys in the smallest range available on all architectures (8 keys on arm64).
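For readers who want to check that the abstraction reproduces the previously hardcoded x86 values, here is a minimal sketch. It assumes the x86 PKRU layout of 2 bits per key (bit 0 disables access, bit 1 disables writes); `model_set_pkey_bits()` is a simplified, hypothetical stand-in for the selftests' set_pkey_bits() helper, not the helper itself:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86 PKRU layout: 2 bits per key (bit 0 = access-disable, bit 1 = write-disable). */
#define MODEL_BITS_PER_PKEY	2
#define MODEL_PKEY_MASK		0x3ULL
#define MODEL_DISABLE_ACCESS	0x1ULL
#define MODEL_ALLOW_NONE	0x55555555ULL	/* same value as the x86 PKEY_ALLOW_NONE */

/* Simplified stand-in for set_pkey_bits(): replace one key's field with flags. */
static uint64_t model_set_pkey_bits(uint64_t reg, int pkey, uint64_t flags)
{
	unsigned int shift = pkey * MODEL_BITS_PER_PKEY;

	reg &= ~(MODEL_PKEY_MASK << shift);		/* clear the key's field */
	reg |= (flags & MODEL_PKEY_MASK) << shift;	/* install the new flags */
	return reg;
}

/* Mirrors how pkey_reg_no_access is initialised in main(). */
static uint64_t model_no_access(void)
{
	return model_set_pkey_bits(MODEL_ALLOW_NONE, 0, MODEL_DISABLE_ACCESS);
}

/* Usage check against the constants that the patch removes. */
void model_check_x86_encodings(void)
{
	uint64_t none = model_no_access();
	uint64_t reg;

	assert(none == 0x55555555ULL);				/* everything disabled */
	assert(model_set_pkey_bits(none, 1, 0) == 0x55555551ULL);	/* only key 1 */
	assert(model_set_pkey_bits(none, 2, 0) == 0x55555545ULL);	/* only key 2 */

	reg = model_set_pkey_bits(none, 0, 0);
	assert(model_set_pkey_bits(reg, 1, 0) == 0x55555550ULL);	/* keys 0 and 1 */
	assert(model_set_pkey_bits(reg, 2, 0) == 0x55555544ULL);	/* keys 0 and 2 */
}
```

On arm64 the register encoding differs (4 bits per key), so this sketch says nothing about the arm64 values.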
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 tools/testing/selftests/mm/pkey-arm64.h       |  1 +
 tools/testing/selftests/mm/pkey-x86.h         |  2 +
 .../selftests/mm/pkey_sighandler_tests.c      | 39 ++++++++++++++-----
 3 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/mm/pkey-arm64.h b/tools/testing/selftests/mm/pkey-arm64.h
index 580e1b0bb38e..5ec53d67dfc7 100644
--- a/tools/testing/selftests/mm/pkey-arm64.h
+++ b/tools/testing/selftests/mm/pkey-arm64.h
@@ -31,6 +31,7 @@
 #define NR_RESERVED_PKEYS 1 /* pkey-0 */

 #define PKEY_ALLOW_ALL 0x77777777
+#define PKEY_ALLOW_NONE 0

 #define PKEY_BITS_PER_PKEY 4
 #define PAGE_SIZE sysconf(_SC_PAGESIZE)
diff --git a/tools/testing/selftests/mm/pkey-x86.h b/tools/testing/selftests/mm/pkey-x86.h
index 5f28e26a2511..53ed9a336ffe 100644
--- a/tools/testing/selftests/mm/pkey-x86.h
+++ b/tools/testing/selftests/mm/pkey-x86.h
@@ -34,6 +34,8 @@
 #define PAGE_SIZE 4096
 #define MB (1<<20)

+#define PKEY_ALLOW_NONE 0x55555555
+
 static inline void __page_o_noops(void)
 {
 	/* 8-bytes of instruction * 512 bytes = 1 page */
diff --git a/tools/testing/selftests/mm/pkey_sighandler_tests.c b/tools/testing/selftests/mm/pkey_sighandler_tests.c
index a8088b645ad6..b5e1767ee5d9 100644
--- a/tools/testing/selftests/mm/pkey_sighandler_tests.c
+++ b/tools/testing/selftests/mm/pkey_sighandler_tests.c
@@ -37,6 +37,8 @@ pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
 pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
 siginfo_t siginfo = {0};

+static u64 pkey_reg_no_access;
+
 /*
  * We need to use inline assembly instead of glibc's syscall because glibc's
  * syscall will attempt to access the PLT in order to call a library function
@@ -113,7 +115,7 @@ static void raise_sigusr2(void)
 static void *thread_segv_with_pkey0_disabled(void *ptr)
 {
 	/* Disable MPK 0 (and all others too) */
-	__write_pkey_reg(0x55555555);
+	__write_pkey_reg(pkey_reg_no_access);

 	/* Segfault (with SEGV_MAPERR) */
 	*(int *) (0x1) = 1;
@@ -123,7 +125,7 @@ static void *thread_segv_with_pkey0_disabled(void *ptr)
 static void *thread_segv_pkuerr_stack(void *ptr)
 {
 	/* Disable MPK 0 (and all others too) */
-	__write_pkey_reg(0x55555555);
+	__write_pkey_reg(pkey_reg_no_access);

 	/* After we disable MPK 0, we can't access the stack to return */
 	return NULL;
@@ -133,6 +135,7 @@ static void *thread_segv_maperr_ptr(void *ptr)
 {
 	stack_t *stack = ptr;
 	int *bad = (int *)1;
+	u64 pkey_reg;

 	/*
 	 * Setup alternate signal stack, which should be pkey_mprotect()ed by
@@ -142,7 +145,8 @@ static void *thread_segv_maperr_ptr(void *ptr)
 	syscall_raw(SYS_sigaltstack, (long)stack, 0, 0, 0, 0, 0);

 	/* Disable MPK 0. Only MPK 1 is enabled. */
-	__write_pkey_reg(0x55555551);
+	pkey_reg = set_pkey_bits(pkey_reg_no_access, 1, 0);
+	__write_pkey_reg(pkey_reg);

 	/* Segfault */
 	*bad = 1;
@@ -240,6 +244,7 @@ static void test_sigsegv_handler_with_different_pkey_for_stack(void)
 	int pkey;
 	int parent_pid = 0;
 	int child_pid = 0;
+	u64 pkey_reg;

 	sa.sa_flags = SA_SIGINFO | SA_ONSTACK;

@@ -257,7 +262,9 @@ static void test_sigsegv_handler_with_different_pkey_for_stack(void)
 	assert(stack != MAP_FAILED);

 	/* Allow access to MPK 0 and MPK 1 */
-	__write_pkey_reg(0x55555550);
+	pkey_reg = set_pkey_bits(pkey_reg_no_access, 0, 0);
+	pkey_reg = set_pkey_bits(pkey_reg, 1, 0);
+	__write_pkey_reg(pkey_reg);

 	/* Protect the new stack with MPK 1 */
 	pkey = pkey_alloc(0, 0);
@@ -307,7 +314,12 @@ static void test_sigsegv_handler_with_different_pkey_for_stack(void)
 static void test_pkru_preserved_after_sigusr1(void)
 {
 	struct sigaction sa;
-	unsigned long pkru = 0x45454544;
+	u64 pkey_reg;
+
+	/* Allow access to MPK 0 and an arbitrary set of keys */
+	pkey_reg = set_pkey_bits(pkey_reg_no_access, 0, 0);
+	pkey_reg = set_pkey_bits(pkey_reg, 3, 0);
+	pkey_reg = set_pkey_bits(pkey_reg, 7, 0);

 	sa.sa_flags = SA_SIGINFO;

@@ -320,7 +332,7 @@ static void test_pkru_preserved_after_sigusr1(void)

 	memset(&siginfo, 0, sizeof(siginfo));

-	__write_pkey_reg(pkru);
+	__write_pkey_reg(pkey_reg);

 	raise(SIGUSR1);

@@ -330,7 +342,7 @@ static void test_pkru_preserved_after_sigusr1(void)
 	pthread_mutex_unlock(&mutex);

 	/* Ensure the pkru value is the same after returning from signal. */
-	ksft_test_result(pkru == __read_pkey_reg() &&
+	ksft_test_result(pkey_reg == __read_pkey_reg() &&
 			 siginfo.si_signo == SIGUSR1, "%s\n", __func__);
 }
@@ -347,6 +359,7 @@ static noinline void *thread_sigusr2_self(void *ptr)
 		'S', 'I', 'G', 'U', 'S', 'R', '2',
 		'.', '.', '.', '\n', '\0'};
 	stack_t *stack = ptr;
+	u64 pkey_reg;

 	/*
 	 * Setup alternate signal stack, which should be pkey_mprotect()ed by
@@ -356,7 +369,8 @@ static noinline void *thread_sigusr2_self(void *ptr)
 	syscall(SYS_sigaltstack, (long)stack, 0, 0, 0, 0, 0);

 	/* Disable MPK 0. Only MPK 2 is enabled. */
-	__write_pkey_reg(0x55555545);
+	pkey_reg = set_pkey_bits(pkey_reg_no_access, 2, 0);
+	__write_pkey_reg(pkey_reg);

 	raise_sigusr2();

@@ -384,6 +398,7 @@ static void test_pkru_sigreturn(void)
 	int pkey;
 	int parent_pid = 0;
 	int child_pid = 0;
+	u64 pkey_reg;

 	sa.sa_handler = SIG_DFL;
 	sa.sa_flags = 0;
@@ -418,7 +433,9 @@ static void test_pkru_sigreturn(void)
 	 * the current thread's stack is protected by the default MPK 0. Hence
 	 * both need to be enabled.
 	 */
-	__write_pkey_reg(0x55555544);
+	pkey_reg = set_pkey_bits(pkey_reg_no_access, 0, 0);
+	pkey_reg = set_pkey_bits(pkey_reg, 2, 0);
+	__write_pkey_reg(pkey_reg);

 	/* Protect the stack with MPK 2 */
 	pkey = pkey_alloc(0, 0);
@@ -473,6 +490,10 @@ int main(int argc, char *argv[])
 	ksft_print_header();
 	ksft_set_plan(ARRAY_SIZE(pkey_tests));

+	/* Only allow X for MPK 0 and nothing for other keys */
+	pkey_reg_no_access = set_pkey_bits(PKEY_ALLOW_NONE, 0,
+					   PKEY_DISABLE_ACCESS);
+
 	for (i = 0; i < ARRAY_SIZE(pkey_tests); i++)
 		(*pkey_tests[i])();
pkey_sighandler_tests.c makes raw syscalls using its own helper, syscall_raw(). One of those syscalls is clone, which is problematic as every architecture has a different opinion on the order of its arguments.
To complete arm64 support, we therefore add an appropriate implementation in syscall_raw(), and introduce a clone_raw() helper that shuffles arguments as needed for each arch.
Having done this, we enable building pkey_sighandler_tests for arm64 in the Makefile.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 tools/testing/selftests/mm/Makefile           |  8 +--
 .../selftests/mm/pkey_sighandler_tests.c      | 62 ++++++++++++++-----
 2 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 02e1204971b0..0f8c110e0805 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -105,12 +105,12 @@ endif
 ifeq ($(CAN_BUILD_X86_64),1)
 TEST_GEN_FILES += $(BINARIES_64)
 endif
-else

-ifneq (,$(filter $(ARCH),arm64 powerpc))
+else ifeq ($(ARCH),arm64)
+TEST_GEN_FILES += protection_keys
+TEST_GEN_FILES += pkey_sighandler_tests
+else ifeq ($(ARCH),powerpc)
 TEST_GEN_FILES += protection_keys
-endif
-
 endif

 ifneq (,$(filter $(ARCH),arm64 mips64 parisc64 powerpc riscv64 s390x sparc64 x86_64 s390))
diff --git a/tools/testing/selftests/mm/pkey_sighandler_tests.c b/tools/testing/selftests/mm/pkey_sighandler_tests.c
index b5e1767ee5d9..97460980811c 100644
--- a/tools/testing/selftests/mm/pkey_sighandler_tests.c
+++ b/tools/testing/selftests/mm/pkey_sighandler_tests.c
@@ -61,12 +61,44 @@ long syscall_raw(long n, long a1, long a2, long a3, long a4, long a5, long a6)
 		      : "=a"(ret)
 		      : "a"(n), "b"(a1), "c"(a2), "d"(a3), "S"(a4), "D"(a5)
 		      : "memory");
+#elif defined __aarch64__
+	register long x0 asm("x0") = a1;
+	register long x1 asm("x1") = a2;
+	register long x2 asm("x2") = a3;
+	register long x3 asm("x3") = a4;
+	register long x4 asm("x4") = a5;
+	register long x5 asm("x5") = a6;
+	register long x8 asm("x8") = n;
+	asm volatile ("svc #0"
+		      : "=r"(x0)
+		      : "r"(x0), "r"(x1), "r"(x2), "r"(x3), "r"(x4), "r"(x5), "r"(x8)
+		      : "memory");
+	ret = x0;
 #else
 # error syscall_raw() not implemented
 #endif
 	return ret;
 }

+static inline long clone_raw(unsigned long flags, void *stack,
+			     int *parent_tid, int *child_tid)
+{
+	long a1 = flags;
+	long a2 = (long)stack;
+	long a3 = (long)parent_tid;
+#if defined(__x86_64__) || defined(__i386)
+	long a4 = (long)child_tid;
+	long a5 = 0;
+#elif defined(__aarch64__)
+	long a4 = 0;
+	long a5 = (long)child_tid;
+#else
+# error clone_raw() not implemented
+#endif
+
+	return syscall_raw(SYS_clone, a1, a2, a3, a4, a5, 0);
+}
+
 static void sigsegv_handler(int signo, siginfo_t *info, void *ucontext)
 {
 	pthread_mutex_lock(&mutex);
@@ -279,14 +311,13 @@ static void test_sigsegv_handler_with_different_pkey_for_stack(void)
 	memset(&siginfo, 0, sizeof(siginfo));

 	/* Use clone to avoid newer glibcs using rseq on new threads */
-	long ret = syscall_raw(SYS_clone,
-			       CLONE_VM | CLONE_FS | CLONE_FILES |
-			       CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM |
-			       CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |
-			       CLONE_DETACHED,
-			       (long) ((char *)(stack) + STACK_SIZE),
-			       (long) &parent_pid,
-			       (long) &child_pid, 0, 0);
+	long ret = clone_raw(CLONE_VM | CLONE_FS | CLONE_FILES |
+			     CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM |
+			     CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |
+			     CLONE_DETACHED,
+			     stack + STACK_SIZE,
+			     &parent_pid,
+			     &child_pid);

 	if (ret < 0) {
 		errno = -ret;
@@ -448,14 +479,13 @@ static void test_pkru_sigreturn(void)
 	sigstack.ss_size = STACK_SIZE;

 	/* Use clone to avoid newer glibcs using rseq on new threads */
-	long ret = syscall_raw(SYS_clone,
-			       CLONE_VM | CLONE_FS | CLONE_FILES |
-			       CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM |
-			       CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |
-			       CLONE_DETACHED,
-			       (long) ((char *)(stack) + STACK_SIZE),
-			       (long) &parent_pid,
-			       (long) &child_pid, 0, 0);
+	long ret = clone_raw(CLONE_VM | CLONE_FS | CLONE_FILES |
+			     CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM |
+			     CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |
+			     CLONE_DETACHED,
+			     stack + STACK_SIZE,
+			     &parent_pid,
+			     &child_pid);

 	if (ret < 0) {
 		errno = -ret;
On Thu, Oct 17, 2024 at 02:39:04PM +0100, Kevin Brodsky wrote:
This series is a follow-up to Joey's Permission Overlay Extension (POE) series [1] that recently landed on mainline. The goal is to improve the way we handle the register that governs which pkeys/POIndex are accessible (POR_EL0) during signal delivery. As things stand, we may unexpectedly fail to write the signal frame on the stack because POR_EL0 is not reset before the uaccess operations. See patch 3 for more details and the main changes this series brings.
A similar series landed recently for x86/MPK [2]; the present series aims at aligning arm64 with x86. Worth noting: once the signal frame is written, POR_EL0 is still set to POR_EL0_INIT, granting access to pkey 0 only. This means that a program that sets up an alternate signal stack with a non-zero pkey will need some assembly trampoline to set POR_EL0 before invoking the real signal handler, as discussed here [3].
This feels a bit bogus (though it's anyway orthogonal to this series).
Really, we want some way for userspace to tell the kernel what permissions to use for the alternate signal stack and signal handlers using it, and then honour that request consistently (just as we try to do for the main stack today).
ss_flags is mostly unused... I wonder whether we could add something in there? Or add a sigaltstack2()?
[...]
Cheers ---Dave
On 17/10/2024 17:48, Dave Martin wrote:
On Thu, Oct 17, 2024 at 02:39:04PM +0100, Kevin Brodsky wrote:
This series is a follow-up to Joey's Permission Overlay Extension (POE) series [1] that recently landed on mainline. The goal is to improve the way we handle the register that governs which pkeys/POIndex are accessible (POR_EL0) during signal delivery. As things stand, we may unexpectedly fail to write the signal frame on the stack because POR_EL0 is not reset before the uaccess operations. See patch 3 for more details and the main changes this series brings.
A similar series landed recently for x86/MPK [2]; the present series aims at aligning arm64 with x86. Worth noting: once the signal frame is written, POR_EL0 is still set to POR_EL0_INIT, granting access to pkey 0 only. This means that a program that sets up an alternate signal stack with a non-zero pkey will need some assembly trampoline to set POR_EL0 before invoking the real signal handler, as discussed here [3].
This feels a bit bogus (though it's anyway orthogonal to this series).
I'm not very fond of this either. However I believe this is the correct first step: bring arm64 in line with x86. Removing all restrictions before uaccess and then setting POR_EL0 to POR_EL0_INIT enables userspace to use any pkey for the alternate signal stack without an ABI change, albeit not in a very comfortable way (if the pkey is not 0).
Really, we want some way for userspace to tell the kernel what permissions to use for the alternate signal stack and signal handlers using it, and then honour that request consistently (just as we try to do for the main stack today).
ss_flags is mostly unused... I wonder whether we could add something in there? Or add a sigaltstack2()?
Yes, this would be sensible as a second step (backwards-compatible extension). Exactly what that API would look like is not trivial though: is the pkey implicitly derived from the pointer provided to sigaltstack()? Is there a need to specify another pkey for code, or do we just assume that the signal handler is only using code with pkey 0? (Not a concern on x86 as MPK doesn't restrict execution.) Would be very interested to hear opinions on this.
Kevin
On Mon, Oct 21, 2024 at 12:06:25PM +0200, Kevin Brodsky wrote:
On 17/10/2024 17:48, Dave Martin wrote:
On Thu, Oct 17, 2024 at 02:39:04PM +0100, Kevin Brodsky wrote:
This series is a follow-up to Joey's Permission Overlay Extension (POE) series [1] that recently landed on mainline. The goal is to improve the way we handle the register that governs which pkeys/POIndex are accessible (POR_EL0) during signal delivery. As things stand, we may unexpectedly fail to write the signal frame on the stack because POR_EL0 is not reset before the uaccess operations. See patch 3 for more details and the main changes this series brings.
A similar series landed recently for x86/MPK [2]; the present series aims at aligning arm64 with x86. Worth noting: once the signal frame is written, POR_EL0 is still set to POR_EL0_INIT, granting access to pkey 0 only. This means that a program that sets up an alternate signal stack with a non-zero pkey will need some assembly trampoline to set POR_EL0 before invoking the real signal handler, as discussed here [3].
This feels a bit bogus (though it's anyway orthogonal to this series).
I'm not very fond of this either. However I believe this is the correct first step: bring arm64 in line with x86. Removing all restrictions before uaccess and then setting POR_EL0 to POR_EL0_INIT enables userspace to use any pkey for the alternate signal stack without an ABI change, albeit not in a very comfortable way (if the pkey is not 0).
I see: we try not to prevent userspace from using whatever pkey it likes for the alternate signal stack, but we are only permissive for the bare minimum operations that userspace can't possibly control for itself (i.e., writing the signal frame).
This whole thing feels a bit of a botch, though.
Do we know of anyone actually using a sigaltstack with a pkey other than 0? Why the urgency? Code relying on an asm shim on x86 is already nonportable, unless I've misunderstood something, so simply turning on arm64 pkeys support in the kernel and libc shouldn't break anything today? (At least, nothing that wasn't asking to be broken.)
Really, we want some way for userspace to tell the kernel what permissions to use for the alternate signal stack and signal handler using it, and then honour that request consistently (just as we try to do for the main stack today).
ss_flags is mostly unused... I wonder whether we could add something in there? Or add a sigaltstack2()?
Yes, this would be sensible as a second step (backwards-compatible extension). Exactly what that API would look like is not trivial though: is the pkey implicitly derived from the pointer provided to sigaltstack()? Is there a need to specify another pkey for code, or do we just assume that the signal handler is only using code with pkey 0? (Not a concern on x86 as MPK doesn't restrict execution.) Would be very interested to hear opinions on this.
Kevin
I would vote for specifying the pkey (or, if feasible, PKRU or modifications to it) in some bits of ss_flags, or in an additional flags argument to sigaltstack2().
Memory with a non-zero pkey cannot be used 100% portably, period, and having non-RW(X) permissions on pkey 0 at any time is also not portable, period. So I'm not sure that having libc magically guess what userspace's pkeys policy is supposed to be based on racily digging metadata out of /proc/self/maps or a cache of it etc. would be such a good idea.
There are other ways to approach (or not approach) this though -- I would be interested to hear what other people think too...
Cheers ---Dave
On Mon, Oct 21, 2024 at 02:31:08PM +0100, Dave P Martin wrote:
On Thu, Oct 17, 2024 at 02:39:04PM +0100, Kevin Brodsky wrote:
This series is a follow-up to Joey's Permission Overlay Extension (POE) series [1] that recently landed on mainline. The goal is to improve the way we handle the register that governs which pkeys/POIndex are accessible (POR_EL0) during signal delivery. As things stand, we may unexpectedly fail to write the signal frame on the stack because POR_EL0 is not reset before the uaccess operations. See patch 3 for more details and the main changes this series brings.
A similar series landed recently for x86/MPK [2]; the present series aims at aligning arm64 with x86. Worth noting: once the signal frame is written, POR_EL0 is still set to POR_EL0_INIT, granting access to pkey 0 only. This means that a program that sets up an alternate signal stack with a non-zero pkey will need some assembly trampoline to set POR_EL0 before invoking the real signal handler, as discussed here [3].
[...]
Memory with a non-zero pkey cannot be used 100% portably, period, and having non-RW(X) permissions on pkey 0 at any time is also not portable, period. So I'm not sure that having libc magically guess what userspace's pkeys policy is supposed to be based on racily digging metadata out of /proc/self/maps or a cache of it etc. would be such a good idea.
I agree that changing RWX overlay permission for pkey 0 to anything else is a really bad idea. We can't prevent it but we shouldn't actively try to work around it in the kernel either. With the current signal ABI, I don't think we should support anything other than pkey 0 for the stack. Since the user shouldn't change the pkey 0 RWX overlay permission anyway, I don't think we should reset POR_EL0 _prior_ to writing the signal frame. The best we can do is document it somewhere.
So on patch 3 I'd only ensure that we have POR_EL0_INIT when invoking the signal handler and not when performing the uaccess. If the uaccess fails, we'd get a fatal SIGSEGV. The user may have got it already if it made the stack read-only.
Currently the primary use of pkeys is for W^X and signal stacks shouldn't fall into this category. If we ever have a strong case for non-zero pkeys on the signal stack, we'll need to look into some new ABI. I'm not sure about SS_* flags though, I think the signal POR_EL0 should be associated with the sigaction rather than the stack (the latter would just be mapped by the user with the right pkey, the kernel doesn't need to know which, only what POR_EL0 is needed by the handler).
Until such case turns up, I'd not put any effort into ABI improvements. I can think of some light compartmentalisation where we have a pkey that's "privileged" and all threads have a POR_EL0 that prevents access to that pkey. The signal handler would have more permissive rights to that privileged pkey. I'd not proactively add support for this though.
On Mon, Oct 21, 2024 at 04:30:04PM +0100, Catalin Marinas wrote:
On Mon, Oct 21, 2024 at 02:31:08PM +0100, Dave P Martin wrote:
On Thu, Oct 17, 2024 at 02:39:04PM +0100, Kevin Brodsky wrote:
This series is a follow-up to Joey's Permission Overlay Extension (POE) series [1] that recently landed on mainline. The goal is to improve the way we handle the register that governs which pkeys/POIndex are accessible (POR_EL0) during signal delivery. As things stand, we may unexpectedly fail to write the signal frame on the stack because POR_EL0 is not reset before the uaccess operations. See patch 3 for more details and the main changes this series brings.
A similar series landed recently for x86/MPK [2]; the present series aims at aligning arm64 with x86. Worth noting: once the signal frame is written, POR_EL0 is still set to POR_EL0_INIT, granting access to pkey 0 only. This means that a program that sets up an alternate signal stack with a non-zero pkey will need some assembly trampoline to set POR_EL0 before invoking the real signal handler, as discussed here [3].
[...]
Memory with a non-zero pkey cannot be used 100% portably, period, and having non-RW(X) permissions on pkey 0 at any time is also not portable, period. So I'm not sure that having libc magically guess what userspace's pkeys policy is supposed to be based on racily digging metadata out of /proc/self/maps or a cache of it etc. would be such a good idea.
I agree that changing RWX overlay permission for pkey 0 to anything else is a really bad idea. We can't prevent it but we shouldn't actively try to work around it in the kernel either. With the current signal ABI, I don't think we should support anything other than pkey 0 for the stack. Since the user shouldn't change the pkey 0 RWX overlay permission anyway, I don't think we should reset POR_EL0 _prior_ to writing the signal frame. The best we can do is document it somewhere.
So on patch 3 I'd only ensure that we have POR_EL0_INIT when invoking the signal handler and not when performing the uaccess. If the uaccess fails, we'd get a fatal SIGSEGV. The user may have got it already if it made the stack read-only.
Hmm, but based on what Kevin's saying, this would mean actively choosing a different ABI than what x86 is trying to get to.
Currently the primary use of pkeys is for W^X and signal stacks shouldn't fall into this category. If we ever have a strong case for non-zero pkeys on the signal stack, we'll need to look into some new ABI. I'm not sure about SS_* flags though, I think the signal POR_EL0 should be associated with the sigaction rather than the stack (the latter would just be mapped by the user with the right pkey, the kernel doesn't need to know which, only what POR_EL0 is needed by the handler).
Until such case turns up, I'd not put any effort into ABI improvements.
Kevin -- do you know what prompted x86 to want the pkey to be reset early in signal delivery? Perhaps such a use-case already exists.
I can think of some light compartmentalisation where we have a pkey that's "privileged" and all threads have a POR_EL0 that prevents access to that pkey. The signal handler would have more permissive rights to that privileged pkey. I'd not proactively add support for this though.
I'd not proactively diverge from other architectures, either :p
Will
On Mon, Oct 21, 2024 at 06:19:38PM +0100, Will Deacon wrote:
On Mon, Oct 21, 2024 at 04:30:04PM +0100, Catalin Marinas wrote:
On Mon, Oct 21, 2024 at 02:31:08PM +0100, Dave P Martin wrote:
On Thu, Oct 17, 2024 at 02:39:04PM +0100, Kevin Brodsky wrote:
This series is a follow-up to Joey's Permission Overlay Extension (POE) series [1] that recently landed on mainline. The goal is to improve the way we handle the register that governs which pkeys/POIndex are accessible (POR_EL0) during signal delivery. As things stand, we may unexpectedly fail to write the signal frame on the stack because POR_EL0 is not reset before the uaccess operations. See patch 3 for more details and the main changes this series brings.
A similar series landed recently for x86/MPK [2]; the present series aims at aligning arm64 with x86. Worth noting: once the signal frame is written, POR_EL0 is still set to POR_EL0_INIT, granting access to pkey 0 only. This means that a program that sets up an alternate signal stack with a non-zero pkey will need some assembly trampoline to set POR_EL0 before invoking the real signal handler, as discussed here [3].
[...]
Memory with a non-zero pkey cannot be used 100% portably, period, and having non-RW(X) permissions on pkey 0 at any time is also not portable, period. So I'm not sure that having libc magically guess what userspace's pkeys policy is supposed to be based on racily digging metadata out of /proc/self/maps or a cache of it etc. would be such a good idea.
I agree that changing RWX overlay permission for pkey 0 to anything else is a really bad idea. We can't prevent it but we shouldn't actively try to work around it in the kernel either. With the current signal ABI, I don't think we should support anything other than pkey 0 for the stack. Since the user shouldn't change the pkey 0 RWX overlay permission anyway, I don't think we should reset POR_EL0 _prior_ to writing the signal frame. The best we can do is document it somewhere.
So on patch 3 I'd only ensure that we have POR_EL0_INIT when invoking the signal handler and not when performing the uaccess. If the uaccess fails, we'd get a fatal SIGSEGV. The user may have got it already if it made the stack read-only.
Hmm, but based on what Kevin's saying, this would mean actively choosing a different ABI than what x86 is trying to get to.
I was more thinking of not relaxing the ABI further at this point in the rc cycle rather than completely diverging (x86 did relax the ABI subsequently to handle non-zero pkey sigaltstack).
Currently the primary use of pkeys is for W^X and signal stacks shouldn't fall into this category. If we ever have a strong case for non-zero pkeys on the signal stack, we'll need to look into some new ABI. I'm not sure about SS_* flags though, I think the signal POR_EL0 should be associated with the sigaction rather than the stack (the latter would just be mapped by the user with the right pkey, the kernel doesn't need to know which, only what POR_EL0 is needed by the handler).
Until such case turns up, I'd not put any effort into ABI improvements.
Kevin -- do you know what prompted x86 to want the pkey to be reset early in signal delivery? Perhaps such a use-case already exists.
Given the email from Pierre with Chrome potentially using a sigaltstack with a non-zero pkey, Kevin's patches (and the x86 changes) make more sense. The question is whether we do this as a fix now or we leave the relaxation for a subsequent kernel release. I guess we could squeeze it now if we agree on the implementation.
Dave Martin <Dave.Martin@arm.com> writes:
On Mon, Oct 21, 2024 at 12:06:25PM +0200, Kevin Brodsky wrote:
On 17/10/2024 17:48, Dave Martin wrote:
On Thu, Oct 17, 2024 at 02:39:04PM +0100, Kevin Brodsky wrote:
This series is a follow-up to Joey's Permission Overlay Extension (POE) series [1] that recently landed on mainline. The goal is to improve the way we handle the register that governs which pkeys/POIndex are accessible (POR_EL0) during signal delivery. As things stand, we may unexpectedly fail to write the signal frame on the stack because POR_EL0 is not reset before the uaccess operations. See patch 3 for more details and the main changes this series brings.
A similar series landed recently for x86/MPK [2]; the present series aims at aligning arm64 with x86. Worth noting: once the signal frame is written, POR_EL0 is still set to POR_EL0_INIT, granting access to pkey 0 only. This means that a program that sets up an alternate signal stack with a non-zero pkey will need some assembly trampoline to set POR_EL0 before invoking the real signal handler, as discussed here [3].
This feels a bit bogus (though it's anyway orthogonal to this series).
I'm not very fond of this either. However I believe this is the correct first step: bring arm64 in line with x86. Removing all restrictions before uaccess and then setting POR_EL0 to POR_EL0_INIT enables userspace to use any pkey for the alternate signal stack without an ABI change, albeit not in a very comfortable way (if the pkey is not 0).
I see: we try not to prevent userspace from using whatever pkey it likes for the alternate signal stack, but we are only permissive for the bare minimum operations that userspace can't possibly control for itself (i.e., writing the signal frame).
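To make the ordering discussed above concrete, here is a toy user-space model (not kernel code). It assumes POR_EL0 holds one 4-bit permission field per pkey with R=1, X=2, W=4, and that POR_EL0_INIT grants RWX to pkey 0 only. Every name in it (MODEL_POR_INIT, MODEL_POR_ALLOW_ALL, deliver_early_reset, deliver_late_reset, and so on) is made up for illustration and does not correspond to anything in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: these names are illustrative, not the kernel's. */
#define PERM_R		0x1u
#define PERM_X		0x2u
#define PERM_W		0x4u
#define PERM_RWX	(PERM_R | PERM_X | PERM_W)
#define BITS_PER_PKEY	4

/* Assumption: INIT grants RWX to pkey 0 and nothing to the other pkeys. */
#define MODEL_POR_INIT		((uint64_t)PERM_RWX)
/* Assumption: "no restrictions" means RWX in all 16 fields. */
#define MODEL_POR_ALLOW_ALL	0x7777777777777777ULL

/* Return a copy of 'por' with the 4-bit field for 'pkey' set to 'perm'. */
static uint64_t por_set(uint64_t por, unsigned int pkey, uint64_t perm)
{
	unsigned int shift = pkey * BITS_PER_PKEY;

	return (por & ~(0xfULL << shift)) | ((perm & 0xf) << shift);
}

/* Does 'por' allow writes to memory tagged with 'pkey'? */
static bool por_allows_write(uint64_t por, unsigned int pkey)
{
	return ((por >> (pkey * BITS_PER_PKEY)) & PERM_W) != 0;
}

struct delivery {
	bool frame_written;	/* did writing the signal frame succeed? */
	uint64_t handler_por;	/* register value seen by the handler */
};

/* Lift restrictions, write the frame, then enter the handler with INIT. */
static struct delivery deliver_early_reset(uint64_t thread_por,
					   unsigned int stack_pkey)
{
	struct delivery d;

	(void)thread_por;	/* would be saved in the frame; unused here */
	d.frame_written = por_allows_write(MODEL_POR_ALLOW_ALL, stack_pkey);
	d.handler_por = MODEL_POR_INIT;
	return d;
}

/* Write the frame under the interrupted thread's value, reset afterwards. */
static struct delivery deliver_late_reset(uint64_t thread_por,
					  unsigned int stack_pkey)
{
	struct delivery d;

	d.frame_written = por_allows_write(thread_por, stack_pkey);
	d.handler_por = MODEL_POR_INIT;
	return d;
}
```

With a signal stack tagged with pkey 3 and a thread that only has read access to pkey 3, the early-reset model writes the frame and the late-reset model does not. In both cases the handler starts without write access to pkey 3, which is why a trampoline is still needed for a non-zero pkey.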
This whole thing feels a bit of a botch, though.
Do we know of anyone actually using a sigaltstack with a pkey other than 0? Why the urgency? Code relying on an asm shim on x86 is already nonportable, unless I've misunderstood something, so simply turning on arm64 pkeys support in the kernel and libc shouldn't break anything today? (At least, nothing that wasn't asking to be broken.)
As far as I know, Chrome plans on using a sigaltstack with a non-zero pkey as part of the V8 CFI and W^X work [0][1][2]. IIUC that was part of the motivation for the x86 change. I don't know if it's urgent though.
I added Stephen on CC who might be able to comment on the current state of things in Chrome. I don't think the code that uses a pkey on a sigaltstack is upstream yet.
[0]: https://v8.dev/blog/control-flow-integrity#signal-frame-corruption
[1]: https://lore.kernel.org/lkml/CAEAAPHa3g0QwU=DZ2zVCqTCSh-+n2TtVKrQ07LvpwDjQ-F...
[2]: https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgea...
Really, we want some way for userspace to tell the kernel what permissions to use for the alternate signal stack and signal handler using it, and then honour that request consistently (just as we try to do for the main stack today).
ss_flags is mostly unused... I wonder whether we could add something in there? Or add a sigaltstack2()?
Yes, this would be sensible as a second step (backwards-compatible extension). Exactly what that API would look like is not trivial though: is the pkey implicitly derived from the pointer provided to sigaltstack()? Is there a need to specify another pkey for code, or do we just assume that the signal handler is only using code with pkey 0? (Not a concern on x86 as MPK doesn't restrict execution.) Would be very interested to hear opinions on this.
I hadn't considered setting a non-zero pkey for code, but it sounds appealing.
The general goal, IIUC, is for signal handlers to run in an isolated "context" using pkeys, in order to stop an attacker from corrupting the CPU state saved on the stack from another thread and then using that to bypass any CFI mitigation by setting an arbitrary PC and registers.
So sigaltstack+pkey helps with isolating the stack. Then it's up to the programmer to carefully write the signal handler code so it only uses pkey-tagged data that other threads cannot corrupt in order to trick the signal handler into writing to its own stack.
In this context, using a non-default pkey for code might be useful, in order to differentiate between "valid" signal handlers and other functions. It could help fend off an attacker who is able to use sigaction as a whole-function gadget to install any function as a signal handler. Basically, it mitigates going from a limited CFI bypass to an arbitrary CFI bypass.
That being said, regarding the kernel API, it might be possible to do the above with this patch. We'd be using the proposed "assembly prologues" that set POR_EL0 as the first thing and then continue to the real signal handler. But if we can avoid those and directly tell the kernel what POR_EL0 should be set to, it'd be simpler (and maybe safer).
Kevin
I would vote for specifying the pkey (or, if feasible, PKRU or modifications to it) in some bits of ss_flags, or in an additional flags argument to sigaltstack2().
Memory with a non-zero pkey cannot be used 100% portably, period, and having non-RW(X) permissions on pkey 0 at any time is also not portable, period. So I'm not sure that having libc magically guess what userspace's pkeys policy is supposed to be based on racily digging metadata out of /proc/self/maps or a cache of it etc. would be such a good idea.
There are other ways to approach (or not approach) this though -- I would be interested to hear what other people think too...
Thinking about this, I'm not sure about tying this API to sigaltstack, as this is about configuring the POR_EL0 register, which may control more than the stack.
I actually have a concrete example of this in V8. There's a SetDefaultPermissionsForSignalHandler [3] function that needs to be called first thing in signal handlers to configure access to an allocated non-zero key. This is independent of whether the sigaltstack is pkey-tagged, but I suppose later on it will need to be replaced with assembly when the stack is no longer accessible.
[3]: https://source.chromium.org/chromium/chromium/src/+/main:v8/include/v8-platf...
Doing this via sigaction as Catalin suggested makes sense to me, but I'm unsure how we'd express the required POR_EL0 value solely using SA_* flags. Are we able to add a new architecture-specific payload to sigaction, or would that result in a new syscall like sigaction2?
As an alternative, I was wondering if this would warrant a new syscall like sigaltstack, but for CPU state.
Thanks, Pierre
On Tue, Oct 22, 2024 at 11:31 AM Pierre Langlois <pierre.langlois@arm.com> wrote:
Dave Martin <Dave.Martin@arm.com> writes:
On Mon, Oct 21, 2024 at 12:06:25PM +0200, Kevin Brodsky wrote:
On 17/10/2024 17:48, Dave Martin wrote:
On Thu, Oct 17, 2024 at 02:39:04PM +0100, Kevin Brodsky wrote:
This series is a follow-up to Joey's Permission Overlay Extension (POE) series [1] that recently landed on mainline. The goal is to improve the way we handle the register that governs which pkeys/POIndex are accessible (POR_EL0) during signal delivery. As things stand, we may unexpectedly fail to write the signal frame on the stack because POR_EL0 is not reset before the uaccess operations. See patch 3 for more details and the main changes this series brings.
A similar series landed recently for x86/MPK [2]; the present series aims at aligning arm64 with x86. Worth noting: once the signal frame is written, POR_EL0 is still set to POR_EL0_INIT, granting access to pkey 0 only. This means that a program that sets up an alternate signal stack with a non-zero pkey will need some assembly trampoline to set POR_EL0 before invoking the real signal handler, as discussed here [3].
This feels a bit bogus (though it's anyway orthogonal to this series).
I'm not very fond of this either. However I believe this is the correct first step: bring arm64 in line with x86. Removing all restrictions before uaccess and then setting POR_EL0 to POR_EL0_INIT enables userspace to use any pkey for the alternate signal stack without an ABI change, albeit not in a very comfortable way (if the pkey is not 0).
I see: we try not to prevent userspace from using whatever pkey it likes for the alternate signal stack, but we are only permissive for the bare minimum operations that userspace can't possibly control for itself (i.e., writing the signal frame).
This whole thing feels a bit of a botch, though.
Do we know of anyone actually using a sigaltstack with a pkey other than 0? Why the urgency? Code relying on an asm shim on x86 is already nonportable, unless I've misunderstood something, so simply turning on arm64 pkeys support in the kernel and libc shouldn't break anything today? (At least, nothing that wasn't asking to be broken.)
As far as I know, Chrome plans on using a sigaltstack with a non-zero pkey as part of the V8 CFI and W^X work [0][1][2]. IIUC that was part of the motivation for the x86 change. I don't know if it's urgent though.
I added Stephen on CC who might be able to comment on the current state of things in Chrome. I don't think the code that uses a pkey on a sigaltstack is upstream yet.
We don't have any code for this in Chrome, since I believe it's not supported by the kernel yet.
Really, we want some way for userspace to tell the kernel what permissions to use for the alternate signal stack and signal handler using it, and then honour that request consistently (just as we try to do for the main stack today).
ss_flags is mostly unused... I wonder whether we could add something in there? Or add a sigaltstack2()?
Yes, this would be sensible as a second step (backwards-compatible extension). Exactly what that API would look like is not trivial though: is the pkey implicitly derived from the pointer provided to sigaltstack()? Is there a need to specify another pkey for code, or do we just assume that the signal handler is only using code with pkey 0? (Not a concern on x86 as MPK doesn't restrict execution.) Would be very interested to hear opinions on this.
I hadn't considered setting a non-zero pkey for code, but it sounds appealing.
The general goal, IIUC, is for signal handlers to run in an isolated "context" using pkeys, in order to stop an attacker from corrupting the CPU state saved on the stack from another thread and then using that to bypass any CFI mitigation by setting an arbitrary PC and registers.
Right. We're mainly looking for a solution to protect the signal frame against memory corruption. I'm aware of two proposals on how to achieve this:
1) using a pkey-protected sigaltstack, which requires a patchset like [0] to allow xsave to write to the stack
2) storing part of the sigframe on the shadow stack, as Rick Edgecombe proposed in [1]
[0] https://lore.kernel.org/lkml/20240802061318.2140081-1-aruna.ramakrishna@orac...
[1] https://lore.kernel.org/lkml/2fb80876e286b4db8f9ef36bcce04bbf02af0de2.camel@...
So sigaltstack+pkey helps with isolating the stack. Then it's up to the programmer to carefully write the signal handler code so it only uses pkey-tagged data that other threads cannot corrupt in order to trick the signal handler into writing to its own stack.
In this context, using a non-default pkey for code might be useful, in order to differentiate between "valid" signal handlers and other functions. It could help fend off an attacker who is able to use sigaction as a whole-function gadget to install any function as a signal handler. Basically, it mitigates going from a limited CFI bypass to an arbitrary CFI bypass.
That being said, regarding the kernel API, it might be possible to do the above with this patch. We'd be using the proposed "assembly prologues" that set POR_EL0 as the first thing and then continue to the real signal handler. But if we can avoid those and directly tell the kernel what POR_EL0 should be set to, it'd be simpler (and maybe safer).
Kevin
I would vote for specifying the pkey (or, if feasible, PKRU or modifications to it) in some bits of ss_flags, or in an additional flags argument to sigaltstack2().
Memory with a non-zero pkey cannot be used 100% portably, period, and having non-RW(X) permissions on pkey 0 at any time is also not portable, period. So I'm not sure that having libc magically guess what userspace's pkeys policy is supposed to be based on racily digging metadata out of /proc/self/maps or a cache of it etc. would be such a good idea.
There are other ways to approach (or not approach) this though -- I would be interested to hear what other people think too...
Thinking about this, I'm not sure about tying this API to sigaltstack, as this is about configuring the POR_EL0 register, which may control more than the stack.
I actually have a concrete example of this in V8. There's a SetDefaultPermissionsForSignalHandler [3] function that needs to be called first thing in signal handlers to configure access to an allocated non-zero key. This is independent of whether the sigaltstack is pkey-tagged, but I suppose later on it will need to be replaced with assembly when the stack is no longer accessible.
Doing this via sigaction as Catalin suggested makes sense to me, but I'm unsure how we'd express the required POR_EL0 value solely using SA_* flags. Are we able to add a new architecture-specific payload to sigaction, or would that result in a new syscall like sigaction2?
As an alternative, I was wondering if this would warrant a new syscall like sigaltstack, but for CPU state.
Thanks, Pierre