When running with return thunks enabled under 32-bit EFI, the system crashes with:
[ 0.137688] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) [ 0.138136] BUG: unable to handle page fault for address: 000000005bc02900 [ 0.138136] #PF: supervisor instruction fetch in kernel mode [ 0.138136] #PF: error_code(0x0011) - permissions violation [ 0.138136] PGD 18f7063 P4D 18f7063 PUD 18ff063 PMD 190e063 PTE 800000005bc02063 [ 0.138136] Oops: 0011 [#1] PREEMPT SMP PTI [ 0.138136] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.19.0-rc6+ #166 [ 0.138136] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 0.138136] RIP: 0010:0x5bc02900 [ 0.138136] Code: Unable to access opcode bytes at RIP 0x5bc028d6. [ 0.138136] RSP: 0018:ffffffffb3203e10 EFLAGS: 00010046 [ 0.138136] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000048 [ 0.138136] RDX: 000000000190dfac RSI: 0000000000001710 RDI: 000000007eae823b [ 0.138136] RBP: ffffffffb3203e70 R08: 0000000001970000 R09: ffffffffb3203e28 [ 0.138136] R10: 747563657865206c R11: 6c6977203a696665 R12: 0000000000001710 [ 0.138136] R13: 0000000000000030 R14: 0000000001970000 R15: 0000000000000001 [ 0.138136] FS: 0000000000000000(0000) GS:ffff8e013ca00000(0000) knlGS:0000000000000000 [ 0.138136] CS: 0010 DS: 0018 ES: 0018 CR0: 0000000080050033 [ 0.138136] CR2: 000000005bc02900 CR3: 0000000001930000 CR4: 00000000000006f0 [ 0.138136] Call Trace: [ 0.138136] <TASK> [ 0.138136] ? efi_set_virtual_address_map+0x9c/0x175 [ 0.138136] efi_enter_virtual_mode+0x4a6/0x53e [ 0.138136] start_kernel+0x67c/0x71e [ 0.138136] x86_64_start_reservations+0x24/0x2a [ 0.138136] x86_64_start_kernel+0xe9/0xf4 [ 0.138136] secondary_startup_64_no_verify+0xe5/0xeb [ 0.138136] </TASK>
That's because it cannot jump to the return thunk from the 32-bit code. Using a naked RET and marking it as safe allows the system to proceed booting.
Fixes: aa3d480315ba ("x86: Use return-thunk in asm code") Reported-by: Guenter Roeck linux@roeck-us.net Signed-off-by: Thadeu Lima de Souza Cascardo cascardo@canonical.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Josh Poimboeuf jpoimboe@kernel.org Cc: stable@vger.kernel.org ---
Does this leave one potential attack vector open? Perhaps, since this is running under a different mapping (AFAIU), the risk is reduced? Or rather, the attacker could attack using the firmware RETs anyway?
Alternatively, we could use IBPB when available when using the wrapper.
Thoughts?
--- arch/x86/platform/efi/efi_thunk_64.S | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/x86/platform/efi/efi_thunk_64.S b/arch/x86/platform/efi/efi_thunk_64.S index 9ffe2bad27d5..4e5257a4811b 100644 --- a/arch/x86/platform/efi/efi_thunk_64.S +++ b/arch/x86/platform/efi/efi_thunk_64.S @@ -23,6 +23,7 @@ #include <linux/objtool.h> #include <asm/page_types.h> #include <asm/segment.h> +#include <asm/nospec-branch.h>
.text .code64 @@ -75,7 +76,9 @@ STACK_FRAME_NON_STANDARD __efi64_thunk 1: movq 0x20(%rsp), %rsp pop %rbx pop %rbp - RET + ANNOTATE_UNRET_SAFE + ret + int3
.code32 2: pushl $__KERNEL_CS
On Fri, Jul 15, 2022 at 04:45:50PM -0300, Thadeu Lima de Souza Cascardo wrote:
When running with return thunks enabled under 32-bit EFI, the system crashes with:
[ 0.137688] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) [ 0.138136] BUG: unable to handle page fault for address: 000000005bc02900 [ 0.138136] #PF: supervisor instruction fetch in kernel mode [ 0.138136] #PF: error_code(0x0011) - permissions violation [ 0.138136] PGD 18f7063 P4D 18f7063 PUD 18ff063 PMD 190e063 PTE 800000005bc02063 [ 0.138136] Oops: 0011 [#1] PREEMPT SMP PTI [ 0.138136] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.19.0-rc6+ #166 [ 0.138136] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 0.138136] RIP: 0010:0x5bc02900 [ 0.138136] Code: Unable to access opcode bytes at RIP 0x5bc028d6. [ 0.138136] RSP: 0018:ffffffffb3203e10 EFLAGS: 00010046 [ 0.138136] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000048 [ 0.138136] RDX: 000000000190dfac RSI: 0000000000001710 RDI: 000000007eae823b [ 0.138136] RBP: ffffffffb3203e70 R08: 0000000001970000 R09: ffffffffb3203e28 [ 0.138136] R10: 747563657865206c R11: 6c6977203a696665 R12: 0000000000001710 [ 0.138136] R13: 0000000000000030 R14: 0000000001970000 R15: 0000000000000001 [ 0.138136] FS: 0000000000000000(0000) GS:ffff8e013ca00000(0000) knlGS:0000000000000000 [ 0.138136] CS: 0010 DS: 0018 ES: 0018 CR0: 0000000080050033 [ 0.138136] CR2: 000000005bc02900 CR3: 0000000001930000 CR4: 00000000000006f0 [ 0.138136] Call Trace: [ 0.138136] <TASK> [ 0.138136] ? efi_set_virtual_address_map+0x9c/0x175 [ 0.138136] efi_enter_virtual_mode+0x4a6/0x53e [ 0.138136] start_kernel+0x67c/0x71e [ 0.138136] x86_64_start_reservations+0x24/0x2a [ 0.138136] x86_64_start_kernel+0xe9/0xf4 [ 0.138136] secondary_startup_64_no_verify+0xe5/0xeb [ 0.138136] </TASK>
That's because it cannot jump to the return thunk from the 32-bit code. Using a naked RET and marking it as safe allows the system to proceed booting.
Fixes: aa3d480315ba ("x86: Use return-thunk in asm code") Reported-by: Guenter Roeck linux@roeck-us.net Signed-off-by: Thadeu Lima de Souza Cascardo cascardo@canonical.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Josh Poimboeuf jpoimboe@kernel.org Cc: stable@vger.kernel.org
No more crashes with this patch applies on top of the mainline kernel (sha e5d523f1ae8f).
Tested-by: Guenter Roeck linux@roeck-us.net
Guenter
On Fri, Jul 15, 2022 at 04:45:50PM -0300, Thadeu Lima de Souza Cascardo wrote:
When running with return thunks enabled under 32-bit EFI, the system crashes with:
[ 0.137688] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) [ 0.138136] BUG: unable to handle page fault for address: 000000005bc02900 [ 0.138136] #PF: supervisor instruction fetch in kernel mode [ 0.138136] #PF: error_code(0x0011) - permissions violation [ 0.138136] PGD 18f7063 P4D 18f7063 PUD 18ff063 PMD 190e063 PTE 800000005bc02063 [ 0.138136] Oops: 0011 [#1] PREEMPT SMP PTI [ 0.138136] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.19.0-rc6+ #166 [ 0.138136] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 0.138136] RIP: 0010:0x5bc02900 [ 0.138136] Code: Unable to access opcode bytes at RIP 0x5bc028d6. [ 0.138136] RSP: 0018:ffffffffb3203e10 EFLAGS: 00010046 [ 0.138136] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000048 [ 0.138136] RDX: 000000000190dfac RSI: 0000000000001710 RDI: 000000007eae823b [ 0.138136] RBP: ffffffffb3203e70 R08: 0000000001970000 R09: ffffffffb3203e28 [ 0.138136] R10: 747563657865206c R11: 6c6977203a696665 R12: 0000000000001710 [ 0.138136] R13: 0000000000000030 R14: 0000000001970000 R15: 0000000000000001 [ 0.138136] FS: 0000000000000000(0000) GS:ffff8e013ca00000(0000) knlGS:0000000000000000 [ 0.138136] CS: 0010 DS: 0018 ES: 0018 CR0: 0000000080050033 [ 0.138136] CR2: 000000005bc02900 CR3: 0000000001930000 CR4: 00000000000006f0 [ 0.138136] Call Trace: [ 0.138136] <TASK> [ 0.138136] ? efi_set_virtual_address_map+0x9c/0x175 [ 0.138136] efi_enter_virtual_mode+0x4a6/0x53e [ 0.138136] start_kernel+0x67c/0x71e [ 0.138136] x86_64_start_reservations+0x24/0x2a [ 0.138136] x86_64_start_kernel+0xe9/0xf4 [ 0.138136] secondary_startup_64_no_verify+0xe5/0xeb [ 0.138136] </TASK>
That's because it cannot jump to the return thunk from the 32-bit code. Using a naked RET and marking it as safe allows the system to proceed booting.
Fixes: aa3d480315ba ("x86: Use return-thunk in asm code") Reported-by: Guenter Roeck linux@roeck-us.net Signed-off-by: Thadeu Lima de Souza Cascardo cascardo@canonical.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Josh Poimboeuf jpoimboe@kernel.org Cc: stable@vger.kernel.org
Does this leave one potential attack vector open? Perhaps, since this is running under a different mapping (AFAIU), the risk is reduced? Or rather, the attacker could attack using the firmware RETs anyway?
Alternatively, we could use IBPB when available when using the wrapper.
Thoughts?
What actual uarch are you running this on? Is this AMD hardware?
For Intel we'll enable IBRS for firmware if it is not otherwise enabled (upstream will always enable IBRS for the SKL family chips, but Thomas just posted the retbleed=stuff approach yesterday that will not)
On AMD I think you're stuck with IBPB, but that has to be issued *before* calling the firmware muck.
In either case, I think the patch as proposed is fine. But perhaps we want something like the below on top.
--- Subject: x86/amd: Use IBPB for firmware calls
On AMD IBRS does not prevent Retbleed; as such use IBPB before a firmware call to flush the branch history state.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org --- arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/nospec-branch.h | 2 ++ arch/x86/kernel/cpu/bugs.c | 11 ++++++++++- 3 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index 00f5227c8459..a77b915d36a8 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -302,6 +302,7 @@ #define X86_FEATURE_RETPOLINE_LFENCE (11*32+13) /* "" Use LFENCE for Spectre variant 2 */ #define X86_FEATURE_RETHUNK (11*32+14) /* "" Use REturn THUNK */ #define X86_FEATURE_UNRET (11*32+15) /* "" AMD BTB untrain return */ +#define X86_FEATURE_USE_IBPB_FW (11*32+16) /* "" Use IBPB during runtime firmware calls */
/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */ #define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */ diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h index 10a3bfc1eb23..f934dcdb7c0d 100644 --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -297,6 +297,8 @@ do { \ alternative_msr_write(MSR_IA32_SPEC_CTRL, \ spec_ctrl_current() | SPEC_CTRL_IBRS, \ X86_FEATURE_USE_IBRS_FW); \ + altnerative_msr_write(MSR_IA32_PRED_CMD, PRED_CMD_IBPB, \ + X86_FEATURE_USE_IBPB_FW); \ } while (0)
#define firmware_restrict_branch_speculation_end() \ diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index aa34f908c39f..78c9082242a9 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -1516,7 +1516,16 @@ static void __init spectre_v2_select_mitigation(void) * the CPU supports Enhanced IBRS, kernel might un-intentionally not * enable IBRS around firmware calls. */ - if (boot_cpu_has(X86_FEATURE_IBRS) && !spectre_v2_in_ibrs_mode(mode)) { + if (boot_cpu_has_bug(X86_BUG_RETBLEED) && + (boot_cpu_data.x86_vendor == X86_VENDOR_AMD || + boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)) { + + if (retbleed_cmd != RETBLEED_CMD_IBPB) { + setup_force_cpu_cap(X86_FEATURE_USE_IBPB_FW); + pr_info("Enabling Speculation Barrier for firmware calls\n"); + } + + } else if (boot_cpu_has(X86_FEATURE_IBRS) && !spectre_v2_in_ibrs_mode(mode)) { setup_force_cpu_cap(X86_FEATURE_USE_IBRS_FW); pr_info("Enabling Restricted Speculation for firmware calls\n"); }
On Mon, Jul 18, 2022 at 01:41:37PM +0200, Peter Zijlstra wrote:
On Fri, Jul 15, 2022 at 04:45:50PM -0300, Thadeu Lima de Souza Cascardo wrote:
When running with return thunks enabled under 32-bit EFI, the system crashes with:
[ 0.137688] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) [ 0.138136] BUG: unable to handle page fault for address: 000000005bc02900 [ 0.138136] #PF: supervisor instruction fetch in kernel mode [ 0.138136] #PF: error_code(0x0011) - permissions violation [ 0.138136] PGD 18f7063 P4D 18f7063 PUD 18ff063 PMD 190e063 PTE 800000005bc02063 [ 0.138136] Oops: 0011 [#1] PREEMPT SMP PTI [ 0.138136] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.19.0-rc6+ #166 [ 0.138136] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 0.138136] RIP: 0010:0x5bc02900 [ 0.138136] Code: Unable to access opcode bytes at RIP 0x5bc028d6. [ 0.138136] RSP: 0018:ffffffffb3203e10 EFLAGS: 00010046 [ 0.138136] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000048 [ 0.138136] RDX: 000000000190dfac RSI: 0000000000001710 RDI: 000000007eae823b [ 0.138136] RBP: ffffffffb3203e70 R08: 0000000001970000 R09: ffffffffb3203e28 [ 0.138136] R10: 747563657865206c R11: 6c6977203a696665 R12: 0000000000001710 [ 0.138136] R13: 0000000000000030 R14: 0000000001970000 R15: 0000000000000001 [ 0.138136] FS: 0000000000000000(0000) GS:ffff8e013ca00000(0000) knlGS:0000000000000000 [ 0.138136] CS: 0010 DS: 0018 ES: 0018 CR0: 0000000080050033 [ 0.138136] CR2: 000000005bc02900 CR3: 0000000001930000 CR4: 00000000000006f0 [ 0.138136] Call Trace: [ 0.138136] <TASK> [ 0.138136] ? efi_set_virtual_address_map+0x9c/0x175 [ 0.138136] efi_enter_virtual_mode+0x4a6/0x53e [ 0.138136] start_kernel+0x67c/0x71e [ 0.138136] x86_64_start_reservations+0x24/0x2a [ 0.138136] x86_64_start_kernel+0xe9/0xf4 [ 0.138136] secondary_startup_64_no_verify+0xe5/0xeb [ 0.138136] </TASK>
That's because it cannot jump to the return thunk from the 32-bit code. Using a naked RET and marking it as safe allows the system to proceed booting.
Fixes: aa3d480315ba ("x86: Use return-thunk in asm code") Reported-by: Guenter Roeck linux@roeck-us.net Signed-off-by: Thadeu Lima de Souza Cascardo cascardo@canonical.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Josh Poimboeuf jpoimboe@kernel.org Cc: stable@vger.kernel.org
Does this leave one potential attack vector open? Perhaps, since this is running under a different mapping (AFAIU), the risk is reduced? Or rather, the attacker could attack using the firmware RETs anyway?
Alternatively, we could use IBPB when available when using the wrapper.
Thoughts?
What actual uarch are you running this on? Is this AMD hardware?
For Intel we'll enable IBRS for firmware if it is not otherwise enabled (upstream will always enable IBRS for the SKL family chips, but Thomas just posted the retbleed=stuff approach yesterday that will not)
On AMD I think you're stuck with IBPB, but that has to be issued *before* calling the firmware muck.
In either case, I think the patch as proposed is fine. But perhaps we want something like the below on top.
I was testing this on Intel, because Guenter Roeck had reported such failures when booting with 32-bit EFI. My patch fixes the boot problem, but then I asked whether it would leave us vulnerable and, then, I was thinking about AMD mostly, as you pointed out.
And I think you nailed what I had in mind for using IBPB when doing firmware calls, and perhaps this is wanted even when we ignore this naked RET here.
There is a typo on your patch below, but I will give it a try and see if it doesn't blow up on AMD systems without IBPB (by way of emulation).
Thanks. Cascardo.
Subject: x86/amd: Use IBPB for firmware calls
On AMD IBRS does not prevent Retbleed; as such use IBPB before a firmware call to flush the branch history state.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org
arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/nospec-branch.h | 2 ++ arch/x86/kernel/cpu/bugs.c | 11 ++++++++++- 3 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index 00f5227c8459..a77b915d36a8 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -302,6 +302,7 @@ #define X86_FEATURE_RETPOLINE_LFENCE (11*32+13) /* "" Use LFENCE for Spectre variant 2 */ #define X86_FEATURE_RETHUNK (11*32+14) /* "" Use REturn THUNK */ #define X86_FEATURE_UNRET (11*32+15) /* "" AMD BTB untrain return */ +#define X86_FEATURE_USE_IBPB_FW (11*32+16) /* "" Use IBPB during runtime firmware calls */ /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */ #define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */ diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h index 10a3bfc1eb23..f934dcdb7c0d 100644 --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -297,6 +297,8 @@ do { \ alternative_msr_write(MSR_IA32_SPEC_CTRL, \ spec_ctrl_current() | SPEC_CTRL_IBRS, \ X86_FEATURE_USE_IBRS_FW); \
- altnerative_msr_write(MSR_IA32_PRED_CMD, PRED_CMD_IBPB, \
X86_FEATURE_USE_IBPB_FW); \
} while (0) #define firmware_restrict_branch_speculation_end() \ diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index aa34f908c39f..78c9082242a9 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -1516,7 +1516,16 @@ static void __init spectre_v2_select_mitigation(void) * the CPU supports Enhanced IBRS, kernel might un-intentionally not * enable IBRS around firmware calls. */
- if (boot_cpu_has(X86_FEATURE_IBRS) && !spectre_v2_in_ibrs_mode(mode)) {
- if (boot_cpu_has_bug(X86_BUG_RETBLEED) &&
(boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)) {
if (retbleed_cmd != RETBLEED_CMD_IBPB) {
setup_force_cpu_cap(X86_FEATURE_USE_IBPB_FW);
pr_info("Enabling Speculation Barrier for firmware calls\n");
}
- } else if (boot_cpu_has(X86_FEATURE_IBRS) && !spectre_v2_in_ibrs_mode(mode)) { setup_force_cpu_cap(X86_FEATURE_USE_IBRS_FW); pr_info("Enabling Restricted Speculation for firmware calls\n"); }
On Mon, Jul 18, 2022 at 10:59:18AM -0300, Thadeu Lima de Souza Cascardo wrote:
And I think you nailed what I had in mind for using IBPB when doing firmware calls, and perhaps this is wanted even when we ignore this naked RET here.
There is a typo on your patch below, but I will give it a try and see if it doesn't blow up on AMD systems without IBPB (by way of emulation).
Yeah, Boris pointed out the same. Typing is hard ;-)
On Mon, Jul 18, 2022 at 01:41:37PM +0200, Peter Zijlstra wrote:
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h index 10a3bfc1eb23..f934dcdb7c0d 100644 --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -297,6 +297,8 @@ do { \ alternative_msr_write(MSR_IA32_SPEC_CTRL, \ spec_ctrl_current() | SPEC_CTRL_IBRS, \ X86_FEATURE_USE_IBRS_FW); \
- altnerative_msr_write(MSR_IA32_PRED_CMD, PRED_CMD_IBPB, \
X86_FEATURE_USE_IBPB_FW); \
} while (0)
So I'm being told we need to untrain on return from EFI to protect the kernel from it. Ontop of yours.
Asm looks correct but lemme run it through the test builds first...
--- diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h index 38a3e86e665e..e58f18555022 100644 --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -62,6 +62,12 @@ dec reg; \ jnz 771b;
+#ifdef CONFIG_CPU_UNRET_ENTRY +#define CALL_ZEN_UNTRAIN_RET "call zen_untrain_ret" +#else +#define CALL_ZEN_UNTRAIN_RET "" +#endif + #ifdef __ASSEMBLY__
/* @@ -128,12 +134,6 @@ .Lskip_rsb_@: .endm
-#ifdef CONFIG_CPU_UNRET_ENTRY -#define CALL_ZEN_UNTRAIN_RET "call zen_untrain_ret" -#else -#define CALL_ZEN_UNTRAIN_RET "" -#endif - /* * Mitigate RETBleed for AMD/Hygon Zen uarch. Requires KERNEL CR3 because the * return thunk isn't mapped into the userspace tables (then again, AMD @@ -169,6 +169,16 @@ extern void __x86_return_thunk(void); extern void zen_untrain_ret(void); extern void entry_ibpb(void);
+#if defined(CONFIG_CPU_UNRET_ENTRY) || defined(CONFIG_CPU_IBPB_ENTRY) +# define UNTRAIN_RET \ + asm volatile(ALTERNATIVE_2(ANNOTATE_RETPOLINE_SAFE "nop", \ + CALL_ZEN_UNTRAIN_RET, X86_FEATURE_UNRET, \ + "call entry_ibpb", X86_FEATURE_ENTRY_IBPB) \ + ::) +#else +# define UNTRAIN_RET +#endif + #ifdef CONFIG_RETPOLINE
#define GEN(reg) \ @@ -306,6 +316,7 @@ do { \ alternative_msr_write(MSR_IA32_SPEC_CTRL, \ spec_ctrl_current(), \ X86_FEATURE_USE_IBRS_FW); \ + UNTRAIN_RET; \ preempt_enable(); \ } while (0)
On Mon, Jul 18, 2022 at 06:28:27PM +0200, Borislav Petkov wrote:
On Mon, Jul 18, 2022 at 01:41:37PM +0200, Peter Zijlstra wrote:
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h index 10a3bfc1eb23..f934dcdb7c0d 100644 --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -297,6 +297,8 @@ do { \ alternative_msr_write(MSR_IA32_SPEC_CTRL, \ spec_ctrl_current() | SPEC_CTRL_IBRS, \ X86_FEATURE_USE_IBRS_FW); \
- altnerative_msr_write(MSR_IA32_PRED_CMD, PRED_CMD_IBPB, \
X86_FEATURE_USE_IBPB_FW); \
} while (0)
So I'm being told we need to untrain on return from EFI to protect the kernel from it. Ontop of yours.
I don't think there's any credible way we can protect against EFI taking over the system if it wants to. It runs at CPL0 and has access to the direct map. If EFI wants it can take over the system without trying.
On 18/07/2022 17:38, Peter Zijlstra wrote:
On Mon, Jul 18, 2022 at 06:28:27PM +0200, Borislav Petkov wrote:
On Mon, Jul 18, 2022 at 01:41:37PM +0200, Peter Zijlstra wrote:
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h index 10a3bfc1eb23..f934dcdb7c0d 100644 --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -297,6 +297,8 @@ do { \ alternative_msr_write(MSR_IA32_SPEC_CTRL, \ spec_ctrl_current() | SPEC_CTRL_IBRS, \ X86_FEATURE_USE_IBRS_FW); \
- altnerative_msr_write(MSR_IA32_PRED_CMD, PRED_CMD_IBPB, \
X86_FEATURE_USE_IBPB_FW); \
} while (0)
So I'm being told we need to untrain on return from EFI to protect the kernel from it. Ontop of yours.
I don't think there's any credible way we can protect against EFI taking over the system if it wants to. It runs at CPL0 and has access to the direct map. If EFI wants it can take over the system without trying.
I don't think an untrain is needed either. EFI RS can do anything it wants, architecturally speaking, so the only threat is it acting as a confused deputy.
The IBPB on the way in mitigates any BTC attacks against EFI-RS.
The "safe" BTB entry can be evicted due to competition or an alias, both in kernel code or EFI code, but neither of these contexts will be deliberately creating a malicious entry.
~Andrew
On Mon, Jul 18, 2022 at 9:28 AM Borislav Petkov bp@suse.de wrote:
So I'm being told we need to untrain on return from EFI to protect the kernel from it.
Why would we have to protect the kernel from EFI?
If we can't trust EFI, then the machine is already compromised. We just *called* an EFI routine, if EFI is untrusted, it did something random.
I mean, it could have already done something bad at boot time when it loaded the kernel.
So no, let's not "protect ourselves from EFI".
Linus
On Mon, Jul 18, 2022 at 11:34:02AM -0700, Linus Torvalds wrote:
Why would we have to protect the kernel from EFI?
Yes, we cleared this up on IRC in the meantime.
This was raised as a concern in case we don't trust EFI. But we cannot not (double negation on purpose) trust EFI because it can do whatever it likes anyway, "underneath" the OS.
I'm keeping the UNTRAIN_RET-in-C diff in my patches/ folder, though - I get the feeling we might need it soon for something else.
:-)
On Mon, 18 Jul 2022 at 20:46, Borislav Petkov bp@suse.de wrote:
On Mon, Jul 18, 2022 at 11:34:02AM -0700, Linus Torvalds wrote:
Why would we have to protect the kernel from EFI?
Yes, we cleared this up on IRC in the meantime.
This was raised as a concern in case we don't trust EFI. But we cannot not (double negation on purpose) trust EFI because it can do whatever it likes anyway, "underneath" the OS.
I'm keeping the UNTRAIN_RET-in-C diff in my patches/ folder, though - I get the feeling we might need it soon for something else.
:-)
The code in question is a little trampoline that executes from the EFI mixed mode 1:1 mapping of the kernel text, and never via the kernel mapping, so we should just move it into .rodata instead (and fix up the mixed mode virtual address map logic accordingly). I don't think mapping the kernel text and rodata into the 1;1 EFI map is needed at all, tbh, and the only thing we ever access via that mapping is that little trampoline.
Something like
--- a/arch/x86/platform/efi/efi_64.c +++ b/arch/x86/platform/efi/efi_64.c @@ -176,7 +176,7 @@ virt_to_phys_or_null_size(void *va, unsigned long size)
int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages) { - unsigned long pfn, text, pf, rodata; + unsigned long pfn, pf, rodata; struct page *page; unsigned npages; pgd_t *pgd = efi_mm.pgd; @@ -222,7 +222,7 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages) /* * When making calls to the firmware everything needs to be 1:1 * mapped and addressable with 32-bit pointers. Map the kernel - * text and allocate a new stack because we can't rely on the + * rodata and allocate a new stack because we can't rely on the * stack pointer being < 4GB. */ if (!efi_is_mixed()) @@ -236,21 +236,11 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
efi_mixed_mode_stack_pa = page_to_phys(page + 1); /* stack grows down */
- npages = (_etext - _text) >> PAGE_SHIFT; - text = __pa(_text); - pfn = text >> PAGE_SHIFT; - - pf = _PAGE_ENC; - if (kernel_map_pages_in_pgd(pgd, pfn, text, npages, pf)) { - pr_err("Failed to map kernel text 1:1\n"); - return 1; - } - npages = (__end_rodata - __start_rodata) >> PAGE_SHIFT; rodata = __pa(__start_rodata); pfn = rodata >> PAGE_SHIFT;
- pf = _PAGE_NX | _PAGE_ENC; + pf = _PAGE_ENC; if (kernel_map_pages_in_pgd(pgd, pfn, rodata, npages, pf)) { pr_err("Failed to map kernel rodata 1:1\n"); return 1; diff --git a/arch/x86/platform/efi/efi_thunk_64.S b/arch/x86/platform/efi/efi_thunk_64.S index 854dd81804b7..d4ee75ebfad6 100644 --- a/arch/x86/platform/efi/efi_thunk_64.S +++ b/arch/x86/platform/efi/efi_thunk_64.S @@ -71,7 +71,9 @@ STACK_FRAME_NON_STANDARD __efi64_thunk pushq $__KERNEL32_CS pushq %rdi /* EFI runtime service address */ lretq +SYM_FUNC_END(__efi64_thunk)
+ .section ".rodata", "a" 1: movq 0x20(%rsp), %rsp pop %rbx pop %rbp @@ -81,7 +83,6 @@ STACK_FRAME_NON_STANDARD __efi64_thunk 2: pushl $__KERNEL_CS pushl %ebp lret -SYM_FUNC_END(__efi64_thunk)
.bss .balign 8
On Tue, Jul 19, 2022 at 05:22:28PM +0200, Ard Biesheuvel wrote:
The code in question is a little trampoline that executes from the EFI mixed mode 1:1 mapping of the kernel text, and never via the kernel mapping, so we should just move it into .rodata instead (and fix up the mixed mode virtual address map logic accordingly). I don't think mapping the kernel text and rodata into the 1;1 EFI map is needed at all, tbh, and the only thing we ever access via that mapping is that little trampoline.
Something like
I'm obviously always for simplifications like that. I'm guessing this should be tested for a full next release before it goes to Linus?
Thx.
linux-stable-mirror@lists.linaro.org