The git of this backport can be found here: git clone --branch pti_v4.1.49 https://github.com/soleen/linux
The patches were backported from stable 4.4 to Oracle UEK4.1, and from UEK4.1 to Stable 4.1
Aaron Lu (1): x86/irq: Do not substract irq_tlb_count from irq_call_count
Andy Lutomirski (16): x86/mm: Add INVPCID helpers x86/mm: Add a 'noinvpcid' boot option to turn off INVPCID x86/mm: If INVPCID is available, use it to flush global mappings sched/core: Add switch_mm_irqs_off() and use it in the scheduler x86/mm: Build arch/x86/mm/tlb.c even on !SMP x86/mm, sched/core: Turn off IRQs in switch_mm() sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off() x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly() x86/mm: Remove flush_tlb() and flush_tlb_current_task() x86/mm: Make flush_tlb_mm_range() more predictable x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range() x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code x86/mm: Disable PCID on 32-bit kernels x86/mm: Add the 'nopcid' boot option to turn off PCID x86/mm: Enable CR4.PCIDE on supported systems x86/mm/64: Fix reboot interaction with CR4.PCIDE
Borislav Petkov (5): x86/mm: Fix INVPCID asm constraint x86/kaiser: Rename and simplify X86_FEATURE_KAISER handling x86/kaiser: Check boottime cmdline params x86/kaiser: Reenable PARAVIRT x86/kaiser: Move feature detection up
Dave Hansen (2): kaiser: merged update kaiser: enhanced by kernel and user PCIDs
Denys Vlasenko (3): x86/entry: Stop using PER_CPU_VAR(kernel_stack) x86/entry: Remove unused 'kernel_stack' per-cpu variable x86/entry: Define 'cpu_current_top_of_stack' for 64-bit code
Hugh Dickins (26): kaiser: do not set _PAGE_NX on pgd_none kaiser: stack map PAGE_SIZE at THREAD_SIZE-PAGE_SIZE kaiser: fix build and FIXME in alloc_ldt_struct() kaiser: KAISER depends on SMP kaiser: fix regs to do_nmi() ifndef CONFIG_KAISER kaiser: fix perf crashes kaiser: ENOMEM if kaiser_pagetable_walk() NULL kaiser: tidied up asm/kaiser.h somewhat kaiser: tidied up kaiser_add/remove_mapping slightly kaiser: kaiser_remove_mapping() move along the pgd kaiser: cleanups while trying for gold link kaiser: name that 0x1000 KAISER_SHADOW_PGD_OFFSET kaiser: delete KAISER_REAL_SWITCH option kaiser: vmstat show NR_KAISERTABLE as nr_overhead kaiser: load_new_mm_cr3() let SWITCH_USER_CR3 flush user kaiser: PCID 0 for kernel and 128 for user kaiser: x86_cr3_pcid_noflush and x86_cr3_pcid_user kaiser: paranoid_entry pass cr3 need to paranoid_exit kaiser: _pgd_alloc() without __GFP_REPEAT to avoid stalls kaiser: fix unlikely error in alloc_ldt_struct() kaiser: add "nokaiser" boot option, using ALTERNATIVE kaiser: add "nokaiser" boot option, using ALTERNATIVE kaiser: use ALTERNATIVE instead of x86_cr3_pcid_noflush kaiser: drop is_atomic arg to kaiser_pagetable_walk() kaiser: asm/tlbflush.h handle noPGE at lower level kaiser: kaiser_flush_tlb_on_return_to_user() check PCID
Ingo Molnar (1): mm/mmu_context, sched/core: Fix mmu_context.h assumption
Jamie Iles (1): x86/ldt: fix crash in ldt freeing.
Jiri Kosina (1): PTI: unbreak EFI old_memmap
Kees Cook (1): KPTI: Rename to PAGE_TABLE_ISOLATION
Konrad Rzeszutek Wilk (1): kpti: Disable when running under Xen PV
Pavel Tatashin (3): x86/mm, sched/core: Uninline switch_mm() pti: Rename X86_FEATURE_KAISER to X86_FEATURE_PTI x86/pti/efi: broken conversion from efi to kernel page table
Richard Fellner (1): KAISER: Kernel Address Isolation
Steven Rostedt (1): ARM: Hide finish_arch_post_lock_switch() from modules
Thomas Gleixner (1): x86/paravirt: Dont patch flush_tlb_single
Tom Lendacky (1): x86/boot: Add early cmdline parsing for options with arguments
Documentation/kernel-parameters.txt | 12 + arch/arm/include/asm/mmu_context.h | 2 + arch/x86/boot/compressed/misc.h | 1 + arch/x86/ia32/ia32entry.S | 11 +- arch/x86/include/asm/cmdline.h | 2 + arch/x86/include/asm/cpufeature.h | 7 + arch/x86/include/asm/desc.h | 2 +- arch/x86/include/asm/disabled-features.h | 4 +- arch/x86/include/asm/hardirq.h | 6 +- arch/x86/include/asm/hw_irq.h | 2 +- arch/x86/include/asm/kaiser.h | 151 ++++++++++ arch/x86/include/asm/mmu.h | 6 - arch/x86/include/asm/mmu_context.h | 101 +------ arch/x86/include/asm/pgtable.h | 28 +- arch/x86/include/asm/pgtable_64.h | 25 +- arch/x86/include/asm/pgtable_types.h | 29 +- arch/x86/include/asm/processor.h | 2 +- arch/x86/include/asm/thread_info.h | 8 +- arch/x86/include/asm/tlbflush.h | 233 +++++++++------ arch/x86/include/uapi/asm/processor-flags.h | 3 +- arch/x86/kernel/cpu/bugs.c | 8 + arch/x86/kernel/cpu/common.c | 86 +++++- arch/x86/kernel/cpu/perf_event_intel_ds.c | 57 +++- arch/x86/kernel/entry_64.S | 170 +++++++++-- arch/x86/kernel/espfix_64.c | 10 + arch/x86/kernel/head_64.S | 35 ++- arch/x86/kernel/irq.c | 3 +- arch/x86/kernel/irqinit.c | 2 +- arch/x86/kernel/ldt.c | 25 +- arch/x86/kernel/paravirt_patch_64.c | 2 - arch/x86/kernel/process.c | 2 +- arch/x86/kernel/process_32.c | 5 +- arch/x86/kernel/process_64.c | 3 - arch/x86/kernel/reboot.c | 4 + arch/x86/kernel/setup.c | 7 + arch/x86/kernel/smpboot.c | 2 - arch/x86/kernel/tracepoint.c | 2 + arch/x86/kernel/vm86_32.c | 2 +- arch/x86/kvm/x86.c | 3 +- arch/x86/lib/cmdline.c | 105 +++++++ arch/x86/mm/Makefile | 4 +- arch/x86/mm/init.c | 4 +- arch/x86/mm/init_64.c | 10 + arch/x86/mm/kaiser.c | 449 ++++++++++++++++++++++++++++ arch/x86/mm/pageattr.c | 63 +++- arch/x86/mm/pgtable.c | 16 +- arch/x86/mm/tlb.c | 194 ++++++++---- arch/x86/platform/efi/efi_64.c | 6 + arch/x86/realmode/init.c | 4 +- arch/x86/realmode/rm/trampoline_64.S | 3 +- arch/x86/xen/enlighten.c | 6 + arch/x86/xen/xen-asm_64.S | 6 +- include/asm-generic/vmlinux.lds.h | 7 + include/linux/kaiser.h | 52 ++++ include/linux/mmu_context.h | 7 + include/linux/mmzone.h | 3 +- include/linux/percpu-defs.h | 32 +- init/main.c | 2 + kernel/fork.c | 6 + kernel/sched/core.c | 4 +- mm/mmu_context.c | 2 +- mm/vmstat.c | 1 + security/Kconfig | 10 + 63 files changed, 1687 insertions(+), 372 deletions(-) create mode 100644 arch/x86/include/asm/kaiser.h create mode 100644 arch/x86/mm/kaiser.c create mode 100644 include/linux/kaiser.h
From: Andy Lutomirski luto@kernel.org
commit 060a402a1ddb551455ee410de2eadd3349f2801b upstream.
This adds helpers for each of the four currently-specified INVPCID modes.
Signed-off-by: Andy Lutomirski luto@kernel.org Reviewed-by: Borislav Petkov bp@suse.de Cc: Andrew Morton akpm@linux-foundation.org Cc: Andrey Ryabinin aryabinin@virtuozzo.com Cc: Andy Lutomirski luto@amacapital.net Cc: Borislav Petkov bp@alien8.de Cc: Brian Gerst brgerst@gmail.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Denys Vlasenko dvlasenk@redhat.com Cc: H. Peter Anvin hpa@zytor.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Luis R. Rodriguez mcgrof@suse.com Cc: Oleg Nesterov oleg@redhat.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Toshi Kani toshi.kani@hp.com Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/8a62b23ad686888cee01da134c91409e22064db9.1454096309... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit becf292446e9f2dc8842c448836bbe8005e24db0) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/include/asm/tlbflush.h | 48 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 48 insertions(+)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 7e459b7ee708..995937999e1f 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -7,6 +7,54 @@ #include <asm/processor.h> #include <asm/special_insns.h>
+static inline void __invpcid(unsigned long pcid, unsigned long addr, + unsigned long type) +{ + u64 desc[2] = { pcid, addr }; + + /* + * The memory clobber is because the whole point is to invalidate + * stale TLB entries and, especially if we're flushing global + * mappings, we don't want the compiler to reorder any subsequent + * memory accesses before the TLB flush. + * + * The hex opcode is invpcid (%ecx), %eax in 32-bit mode and + * invpcid (%rcx), %rax in long mode. + */ + asm volatile (".byte 0x66, 0x0f, 0x38, 0x82, 0x01" + : : "m" (desc), "a" (type), "c" (desc) : "memory"); +} + +#define INVPCID_TYPE_INDIV_ADDR 0 +#define INVPCID_TYPE_SINGLE_CTXT 1 +#define INVPCID_TYPE_ALL_INCL_GLOBAL 2 +#define INVPCID_TYPE_ALL_NON_GLOBAL 3 + +/* Flush all mappings for a given pcid and addr, not including globals. */ +static inline void invpcid_flush_one(unsigned long pcid, + unsigned long addr) +{ + __invpcid(pcid, addr, INVPCID_TYPE_INDIV_ADDR); +} + +/* Flush all mappings for a given PCID, not including globals. */ +static inline void invpcid_flush_single_context(unsigned long pcid) +{ + __invpcid(pcid, 0, INVPCID_TYPE_SINGLE_CTXT); +} + +/* Flush all mappings, including globals, for all PCIDs. */ +static inline void invpcid_flush_all(void) +{ + __invpcid(0, 0, INVPCID_TYPE_ALL_INCL_GLOBAL); +} + +/* Flush all mappings for all PCIDs except globals. */ +static inline void invpcid_flush_all_nonglobals(void) +{ + __invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL); +} + #ifdef CONFIG_PARAVIRT #include <asm/paravirt.h> #else
From: Borislav Petkov bp@suse.de
commit e2c7698cd61f11d4077fdb28148b2d31b82ac848 upstream.
So we want to specify the dependency on both @pcid and @addr so that the compiler doesn't reorder accesses to them *before* the TLB flush. But for that to work, we need to express this properly in the inline asm and deref the whole desc array, not the pointer to it. See clwb() for an example.
This fixes the build error on 32-bit:
arch/x86/include/asm/tlbflush.h: In function __invpcid arch/x86/include/asm/tlbflush.h:26:18: error: memory input 0 is not directly addressable
which gcc4.7 caught but 5.x didn't. Which is strange. :-\
Signed-off-by: Borislav Petkov bp@suse.de Cc: Andrew Morton akpm@linux-foundation.org Cc: Andrey Ryabinin aryabinin@virtuozzo.com Cc: Andy Lutomirski luto@amacapital.net Cc: Borislav Petkov bp@alien8.de Cc: Brian Gerst brgerst@gmail.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Denys Vlasenko dvlasenk@redhat.com Cc: H. Peter Anvin hpa@zytor.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Luis R. Rodriguez mcgrof@suse.com Cc: Michael Matz matz@suse.de Cc: Oleg Nesterov oleg@redhat.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Toshi Kani toshi.kani@hp.com Cc: linux-mm@kvack.org Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 04ec428b15f161ce8449756fb64b6f380c8d95fd) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/include/asm/tlbflush.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 995937999e1f..ed2317f19ec7 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -10,7 +10,7 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr, unsigned long type) { - u64 desc[2] = { pcid, addr }; + struct { u64 d[2]; } desc = { { pcid, addr } };
/* * The memory clobber is because the whole point is to invalidate @@ -22,7 +22,7 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr, * invpcid (%rcx), %rax in long mode. */ asm volatile (".byte 0x66, 0x0f, 0x38, 0x82, 0x01" - : : "m" (desc), "a" (type), "c" (desc) : "memory"); + : : "m" (desc), "a" (type), "c" (&desc) : "memory"); }
#define INVPCID_TYPE_INDIV_ADDR 0
From: Andy Lutomirski luto@kernel.org
commit d12a72b844a49d4162f24cefdab30bed3f86730e upstream.
This adds a chicken bit to turn off INVPCID in case something goes wrong. It's an early_param() because we do TLB flushes before we parse __setup() parameters.
Signed-off-by: Andy Lutomirski luto@kernel.org Reviewed-by: Borislav Petkov bp@suse.de Cc: Andrew Morton akpm@linux-foundation.org Cc: Andrey Ryabinin aryabinin@virtuozzo.com Cc: Andy Lutomirski luto@amacapital.net Cc: Borislav Petkov bp@alien8.de Cc: Brian Gerst brgerst@gmail.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Denys Vlasenko dvlasenk@redhat.com Cc: H. Peter Anvin hpa@zytor.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Luis R. Rodriguez mcgrof@suse.com Cc: Oleg Nesterov oleg@redhat.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Toshi Kani toshi.kani@hp.com Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/f586317ed1bc2b87aee652267e515b90051af385.1454096309... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 791a0f3fecdabe18cc291e5f9b7ebbdc81895975) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- Documentation/kernel-parameters.txt | 2 ++ arch/x86/kernel/cpu/common.c | 16 ++++++++++++++++ 2 files changed, 18 insertions(+)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 08dc303d0d47..ceaab09a279e 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2435,6 +2435,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
nointroute [IA-64]
+ noinvpcid [X86] Disable the INVPCID cpu feature. + nojitter [IA-64] Disables jitter checking for ITC timers.
no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 5732326ec126..90ef802d9d90 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -172,6 +172,22 @@ static int __init x86_xsaves_setup(char *s) } __setup("noxsaves", x86_xsaves_setup);
+static int __init x86_noinvpcid_setup(char *s) +{ + /* noinvpcid doesn't accept parameters */ + if (s) + return -EINVAL; + + /* do not emit a message if the feature is not present */ + if (!boot_cpu_has(X86_FEATURE_INVPCID)) + return 0; + + setup_clear_cpu_cap(X86_FEATURE_INVPCID); + pr_info("noinvpcid: INVPCID feature disabled\n"); + return 0; +} +early_param("noinvpcid", x86_noinvpcid_setup); + #ifdef CONFIG_X86_32 static int cachesize_override = -1; static int disable_x86_serial_nr = 1;
From: Andy Lutomirski luto@kernel.org
commit d8bced79af1db6734f66b42064cc773cada2ce99 upstream.
On my Skylake laptop, INVPCID function 2 (flush absolutely everything) takes about 376ns, whereas saving flags, twiddling CR4.PGE to flush global mappings, and restoring flags takes about 539ns.
Signed-off-by: Andy Lutomirski luto@kernel.org Reviewed-by: Borislav Petkov bp@suse.de Cc: Andrew Morton akpm@linux-foundation.org Cc: Andrey Ryabinin aryabinin@virtuozzo.com Cc: Andy Lutomirski luto@amacapital.net Cc: Borislav Petkov bp@alien8.de Cc: Brian Gerst brgerst@gmail.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Denys Vlasenko dvlasenk@redhat.com Cc: H. Peter Anvin hpa@zytor.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Luis R. Rodriguez mcgrof@suse.com Cc: Oleg Nesterov oleg@redhat.com Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Toshi Kani toshi.kani@hp.com Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/ed0ef62581c0ea9c99b9bf6df726015e96d44743.1454096309... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 85d3700c744a11ee2989252acf50ccbbd814167a) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/include/asm/tlbflush.h | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index ed2317f19ec7..433eeaafe498 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -152,6 +152,15 @@ static inline void __native_flush_tlb_global(void) { unsigned long flags;
+ if (static_cpu_has(X86_FEATURE_INVPCID)) { + /* + * Using INVPCID is considerably faster than a pair of writes + * to CR4 sandwiched inside an IRQ flag save/restore. + */ + invpcid_flush_all(); + return; + } + /* * Read-modify-write to CR4 - protect it from preemption and * from interrupts. (Use the raw variant because this code can
From: Ingo Molnar mingo@kernel.org
commit 8efd755ac2fe262d4c8d5c9bbe054bb67dae93da upstream.
Some architectures (such as Alpha) rely on include/linux/sched.h definitions in their mmu_context.h files.
So include sched.h before mmu_context.h.
Cc: Andy Lutomirski luto@kernel.org Cc: Borislav Petkov bp@alien8.de Cc: Linus Torvalds torvalds@linux-foundation.org Cc: linux-kernel@vger.kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit dfe513a4e8ddde75ffc6abd3f139c5d65bf925d7) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- mm/mmu_context.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/mmu_context.c b/mm/mmu_context.c index f802c2d216a7..6f4d27c5bb32 100644 --- a/mm/mmu_context.c +++ b/mm/mmu_context.c @@ -4,9 +4,9 @@ */
#include <linux/mm.h> +#include <linux/sched.h> #include <linux/mmu_context.h> #include <linux/export.h> -#include <linux/sched.h>
#include <asm/mmu_context.h>
From: Andy Lutomirski luto@kernel.org
commit f98db6013c557c216da5038d9c52045be55cd039 upstream.
By default, this is the same thing as switch_mm().
x86 will override it as an optimization.
Signed-off-by: Andy Lutomirski luto@kernel.org Reviewed-by: Borislav Petkov bp@suse.de Cc: Borislav Petkov bp@alien8.de Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Link: http://lkml.kernel.org/r/df401df47bdd6be3e389c6f1e3f5310d70e81b2c.1461688545... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 425f13a36652523d604fd96413d6c438d415dd70) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- include/linux/mmu_context.h | 7 +++++++ kernel/sched/core.c | 6 +++--- 2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h index 70fffeba7495..a4441784503b 100644 --- a/include/linux/mmu_context.h +++ b/include/linux/mmu_context.h @@ -1,9 +1,16 @@ #ifndef _LINUX_MMU_CONTEXT_H #define _LINUX_MMU_CONTEXT_H
+#include <asm/mmu_context.h> + struct mm_struct;
void use_mm(struct mm_struct *mm); void unuse_mm(struct mm_struct *mm);
+/* Architectures that care about IRQ state in switch_mm can override this. */ +#ifndef switch_mm_irqs_off +# define switch_mm_irqs_off switch_mm +#endif + #endif diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8fbedeb5553f..d253618d09c6 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -32,7 +32,7 @@ #include <linux/init.h> #include <linux/uaccess.h> #include <linux/highmem.h> -#include <asm/mmu_context.h> +#include <linux/mmu_context.h> #include <linux/interrupt.h> #include <linux/capability.h> #include <linux/completion.h> @@ -2339,7 +2339,7 @@ context_switch(struct rq *rq, struct task_struct *prev, atomic_inc(&oldmm->mm_count); enter_lazy_tlb(oldmm, next); } else - switch_mm(oldmm, mm, next); + switch_mm_irqs_off(oldmm, mm, next);
if (!prev->mm) { prev->active_mm = NULL; @@ -4979,7 +4979,7 @@ void idle_task_exit(void) BUG_ON(cpu_online(smp_processor_id()));
if (mm != &init_mm) { - switch_mm(mm, &init_mm, current); + switch_mm_irqs_off(mm, &init_mm, current); finish_arch_post_lock_switch(); } mmdrop(mm);
From: Andy Lutomirski luto@kernel.org
commit e1074888c326038340a1ada9129d679e661f2ea6 upstream.
Currently all of the functions that live in tlb.c are inlined on !SMP builds. One can debate whether this is a good idea (in many respects the code in tlb.c is better than the inlined UP code).
Regardless, I want to add code that needs to be built on UP and SMP kernels and relates to tlb flushing, so arrange for tlb.c to be compiled unconditionally.
Signed-off-by: Andy Lutomirski luto@kernel.org Reviewed-by: Borislav Petkov bp@suse.de Cc: Borislav Petkov bp@alien8.de Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Link: http://lkml.kernel.org/r/f0d778f0d828fc46e5d1946bca80f0aaf9abf032.1461688545... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 83cc4b50e3a977915666ade0b951ba446e7181bd) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/mm/Makefile | 3 +-- arch/x86/mm/tlb.c | 4 ++++ 2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index a482d105172b..d893640d5c68 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -1,5 +1,5 @@ obj-y := init.o init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \ - pat.o pgtable.o physaddr.o gup.o setup_nx.o + pat.o pgtable.o physaddr.o gup.o setup_nx.o tlb.o
# Make sure __phys_addr has no stackprotector nostackp := $(call cc-option, -fno-stack-protector) @@ -9,7 +9,6 @@ CFLAGS_setup_nx.o := $(nostackp) CFLAGS_fault.o := -I$(src)/../include/asm/trace
obj-$(CONFIG_X86_PAT) += pat_rbtree.o -obj-$(CONFIG_SMP) += tlb.o
obj-$(CONFIG_X86_32) += pgtable_32.o iomap_32.o
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 061e0114005e..a1aa5f59e3ad 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -28,6 +28,8 @@ * Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi */
+#ifdef CONFIG_SMP + struct flush_tlb_info { struct mm_struct *flush_mm; unsigned long flush_start; @@ -346,3 +348,5 @@ static int __init create_tlb_single_page_flush_ceiling(void) return 0; } late_initcall(create_tlb_single_page_flush_ceiling); + +#endif /* CONFIG_SMP */
commit 69c0319aabba45bcf33178916a2f06967b4adede upstream.
It's fairly large and it has quite a few callers. This may also help untangle some headers down the road.
Signed-off-by: Andy Lutomirski luto@kernel.org Reviewed-by: Borislav Petkov bp@suse.de Cc: Borislav Petkov bp@alien8.de Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Link: http://lkml.kernel.org/r/54f3367803e7f80b2be62c8a21879aa74b1a5f57.1461688545... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 70a39c7fd167399fde76aeac314dce026a255b49) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/include/asm/mmu_context.h --- arch/x86/include/asm/mmu_context.h | 97 +---------------------------------- arch/x86/mm/tlb.c | 100 +++++++++++++++++++++++++++++++++++++ 2 files changed, 102 insertions(+), 95 deletions(-)
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 73e38f14ddeb..375fafccb32c 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -92,101 +92,8 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) #endif }
-static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, - struct task_struct *tsk) -{ - unsigned cpu = smp_processor_id(); - - if (likely(prev != next)) { -#ifdef CONFIG_SMP - this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); - this_cpu_write(cpu_tlbstate.active_mm, next); -#endif - cpumask_set_cpu(cpu, mm_cpumask(next)); - - /* - * Re-load page tables. - * - * This logic has an ordering constraint: - * - * CPU 0: Write to a PTE for 'next' - * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI. - * CPU 1: set bit 1 in next's mm_cpumask - * CPU 1: load from the PTE that CPU 0 writes (implicit) - * - * We need to prevent an outcome in which CPU 1 observes - * the new PTE value and CPU 0 observes bit 1 clear in - * mm_cpumask. (If that occurs, then the IPI will never - * be sent, and CPU 0's TLB will contain a stale entry.) - * - * The bad outcome can occur if either CPU's load is - * reordered before that CPU's store, so both CPUs must - * execute full barriers to prevent this from happening. - * - * Thus, switch_mm needs a full barrier between the - * store to mm_cpumask and any operation that could load - * from next->pgd. TLB fills are special and can happen - * due to instruction fetches or for no reason at all, - * and neither LOCK nor MFENCE orders them. - * Fortunately, load_cr3() is serializing and gives the - * ordering guarantee we need. - * - */ - load_cr3(next->pgd); - - trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); - - /* Stop flush ipis for the previous mm */ - cpumask_clear_cpu(cpu, mm_cpumask(prev)); - - /* Load per-mm CR4 state */ - load_mm_cr4(next); - - /* - * Load the LDT, if the LDT is different. - * - * It's possible that prev->context.ldt doesn't match - * the LDT register. This can happen if leave_mm(prev) - * was called and then modify_ldt changed - * prev->context.ldt but suppressed an IPI to this CPU. - * In this case, prev->context.ldt != NULL, because we - * never set context.ldt to NULL while the mm still - * exists. That means that next->context.ldt != - * prev->context.ldt, because mms never share an LDT. - */ - if (unlikely(prev->context.ldt != next->context.ldt)) - load_mm_ldt(next); - } -#ifdef CONFIG_SMP - else { - this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); - BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next); - - if (!cpumask_test_cpu(cpu, mm_cpumask(next))) { - /* - * On established mms, the mm_cpumask is only changed - * from irq context, from ptep_clear_flush() while in - * lazy tlb mode, and here. Irqs are blocked during - * schedule, protecting us from simultaneous changes. - */ - cpumask_set_cpu(cpu, mm_cpumask(next)); - - /* - * We were in lazy tlb mode and leave_mm disabled - * tlb flush IPI delivery. We must reload CR3 - * to make sure to use no freed page tables. - * - * As above, load_cr3() is serializing and orders TLB - * fills with respect to the mm_cpumask write. - */ - load_cr3(next->pgd); - trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); - load_mm_cr4(next); - load_mm_ldt(next); - } - } -#endif -} +extern void switch_mm(struct mm_struct *prev, struct mm_struct *next, + struct task_struct *tsk);
#define activate_mm(prev, next) \ do { \ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index a1aa5f59e3ad..40c640980720 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -59,6 +59,106 @@ void leave_mm(int cpu) } EXPORT_SYMBOL_GPL(leave_mm);
+#endif /* CONFIG_SMP */ + +void switch_mm(struct mm_struct *prev, struct mm_struct *next, + struct task_struct *tsk) +{ + unsigned cpu = smp_processor_id(); + + if (likely(prev != next)) { +#ifdef CONFIG_SMP + this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); + this_cpu_write(cpu_tlbstate.active_mm, next); +#endif + cpumask_set_cpu(cpu, mm_cpumask(next)); + + /* + * Re-load page tables. + * + * This logic has an ordering constraint: + * + * CPU 0: Write to a PTE for 'next' + * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI. + * CPU 1: set bit 1 in next's mm_cpumask + * CPU 1: load from the PTE that CPU 0 writes (implicit) + * + * We need to prevent an outcome in which CPU 1 observes + * the new PTE value and CPU 0 observes bit 1 clear in + * mm_cpumask. (If that occurs, then the IPI will never + * be sent, and CPU 0's TLB will contain a stale entry.) + * + * The bad outcome can occur if either CPU's load is + * reordered before that CPU's store, so both CPUs must + * execute full barriers to prevent this from happening. + * + * Thus, switch_mm needs a full barrier between the + * store to mm_cpumask and any operation that could load + * from next->pgd. TLB fills are special and can happen + * due to instruction fetches or for no reason at all, + * and neither LOCK nor MFENCE orders them. + * Fortunately, load_cr3() is serializing and gives the + * ordering guarantee we need. + * + */ + load_cr3(next->pgd); + + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); + + /* Stop flush ipis for the previous mm */ + cpumask_clear_cpu(cpu, mm_cpumask(prev)); + + /* Load per-mm CR4 state */ + load_mm_cr4(next); + + /* + * Load the LDT, if the LDT is different. + * + * It's possible that prev->context.ldt doesn't match + * the LDT register. This can happen if leave_mm(prev) + * was called and then modify_ldt changed + * prev->context.ldt but suppressed an IPI to this CPU. + * In this case, prev->context.ldt != NULL, because we + * never set context.ldt to NULL while the mm still + * exists. That means that next->context.ldt != + * prev->context.ldt, because mms never share an LDT. + */ + if (unlikely(prev->context.ldt != next->context.ldt)) + load_mm_ldt(next); + } +#ifdef CONFIG_SMP + else { + this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); + BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next); + + if (!cpumask_test_cpu(cpu, mm_cpumask(next))) { + /* + * On established mms, the mm_cpumask is only changed + * from irq context, from ptep_clear_flush() while in + * lazy tlb mode, and here. Irqs are blocked during + * schedule, protecting us from simultaneous changes. + */ + cpumask_set_cpu(cpu, mm_cpumask(next)); + + /* + * We were in lazy tlb mode and leave_mm disabled + * tlb flush IPI delivery. We must reload CR3 + * to make sure to use no freed page tables. + * + * As above, load_cr3() is serializing and orders TLB + * fills with respect to the mm_cpumask write. + */ + load_cr3(next->pgd); + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); + load_mm_cr4(next); + load_mm_ldt(next); + } + } +#endif +} + +#ifdef CONFIG_SMP + /* * The flush IPI assumes that a thread switch happens in this order: * [cpu0: the cpu that switches]
From: Andy Lutomirski luto@kernel.org
commit 078194f8e9fe3cf54c8fd8bded48a1db5bd8eb8a upstream.
Potential races between switch_mm() and TLB-flush or LDT-flush IPIs could be very messy. AFAICT the code is currently okay, whether by accident or by careful design, but enabling PCID will make it considerably more complicated and will no longer be obviously safe.
Fix it with a big hammer: run switch_mm() with IRQs off.
To avoid a performance hit in the scheduler, we take advantage of our knowledge that the scheduler already has IRQs disabled when it calls switch_mm().
Signed-off-by: Andy Lutomirski luto@kernel.org Reviewed-by: Borislav Petkov bp@suse.de Cc: Borislav Petkov bp@alien8.de Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Link: http://lkml.kernel.org/r/f19baf759693c9dcae64bbff76189db77cb13398.1461688545... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 4ead44fd2525ed97e5362a806d312a0e3b0ea445) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/include/asm/mmu_context.h --- arch/x86/include/asm/mmu_context.h | 4 ++++ arch/x86/mm/tlb.c | 10 ++++++++++ 2 files changed, 14 insertions(+)
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 375fafccb32c..2bf4bae1c65e 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -95,6 +95,10 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) extern void switch_mm(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk);
+extern void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, + struct task_struct *tsk); +#define switch_mm_irqs_off switch_mm_irqs_off + #define activate_mm(prev, next) \ do { \ paravirt_activate_mm((prev), (next)); \ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 40c640980720..4ce6569ad963 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -63,6 +63,16 @@ EXPORT_SYMBOL_GPL(leave_mm);
void switch_mm(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk) +{ + unsigned long flags; + + local_irq_save(flags); + switch_mm_irqs_off(prev, next, tsk); + local_irq_restore(flags); +} + +void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, + struct task_struct *tsk) { unsigned cpu = smp_processor_id();
From: Steven Rostedt rostedt@goodmis.org
commit ef0491ea17f8019821c7e9c8e801184ecf17f85a upstream.
The introduction of switch_mm_irqs_off() brought back an old bug regarding the use of preempt_enable_no_resched:
As part of:
62b94a08da1b ("sched/preempt: Take away preempt_enable_no_resched() from modules")
the definition of preempt_enable_no_resched() is only available in built-in code, not in loadable modules, so we can't generally use it from header files.
However, the ARM version of finish_arch_post_lock_switch() calls preempt_enable_no_resched() and is defined as a static inline function in asm/mmu_context.h. This in turn means we cannot include asm/mmu_context.h from modules.
With today's tip tree, asm/mmu_context.h gets included from linux/mmu_context.h, which is normally the exact pattern one would expect, but unfortunately, linux/mmu_context.h can be included from the vhost driver that is a loadable module, now causing this compile time error with modular configs:
In file included from ../include/linux/mmu_context.h:4:0, from ../drivers/vhost/vhost.c:18: ../arch/arm/include/asm/mmu_context.h: In function 'finish_arch_post_lock_switch': ../arch/arm/include/asm/mmu_context.h:88:3: error: implicit declaration of function 'preempt_enable_no_resched' [-Werror=implicit-function-declaration] preempt_enable_no_resched();
Andy already tried to fix the bug by including linux/preempt.h from asm/mmu_context.h, but that didn't help. Arnd suggested reordering the header files, which wasn't popular, so let's use this workaround instead:
The finish_arch_post_lock_switch() definition is now also hidden inside of #ifdef MODULE, so we don't see anything referencing preempt_enable_no_resched() from a header file. I've built a few hundred randconfig kernels with this, and did not see any new problems.
Tested-by: Guenter Roeck linux@roeck-us.net Signed-off-by: Steven Rostedt rostedt@goodmis.org Signed-off-by: Arnd Bergmann arnd@arndb.de Acked-by: Russell King rmk+kernel@arm.linux.org.uk Cc: Alexander Shishkin alexander.shishkin@linux.intel.com Cc: Andy Lutomirski luto@amacapital.net Cc: Andy Lutomirski luto@kernel.org Cc: Ard Biesheuvel ard.biesheuvel@linaro.org Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Borislav Petkov bp@suse.de Cc: Frederic Weisbecker fweisbec@gmail.com Cc: Jiri Olsa jolsa@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Mel Gorman mgorman@techsingularity.net Cc: Peter Zijlstra peterz@infradead.org Cc: Russell King - ARM Linux linux@armlinux.org.uk Cc: Stephane Eranian eranian@google.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Vince Weaver vincent.weaver@maine.edu Cc: linux-arm-kernel@lists.infradead.org Fixes: f98db6013c55 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler") Link: http://lkml.kernel.org/r/1463146234-161304-1-git-send-email-arnd@arndb.de Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit c22d4b4d1c7fcc0d9eb4d8618d86c554c48ed9c0) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/arm/include/asm/mmu_context.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/arch/arm/include/asm/mmu_context.h b/arch/arm/include/asm/mmu_context.h index 9b32f76bb0dd..10f662498eb7 100644 --- a/arch/arm/include/asm/mmu_context.h +++ b/arch/arm/include/asm/mmu_context.h @@ -61,6 +61,7 @@ static inline void check_and_switch_context(struct mm_struct *mm, cpu_switch_mm(mm->pgd, mm); }
+#ifndef MODULE #define finish_arch_post_lock_switch \ finish_arch_post_lock_switch static inline void finish_arch_post_lock_switch(void) @@ -82,6 +83,7 @@ static inline void finish_arch_post_lock_switch(void) preempt_enable_no_resched(); } } +#endif /* !MODULE */
#endif /* CONFIG_MMU */
From: Andy Lutomirski luto@kernel.org
commit 252d2a4117bc181b287eeddf848863788da733ae upstream.
idle_task_exit() can be called with IRQs on x86 on and therefore should use switch_mm(), not switch_mm_irqs_off().
This doesn't seem to cause any problems right now, but it will confuse my upcoming TLB flush changes. Nonetheless, I think it should be backported because it's trivial. There won't be any meaningful performance impact because idle_task_exit() is only used when offlining a CPU.
Signed-off-by: Andy Lutomirski luto@kernel.org Cc: Borislav Petkov bp@suse.de Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: stable@vger.kernel.org Fixes: f98db6013c55 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler") Link: http://lkml.kernel.org/r/ca3d1a9fa93a0b49f5a8ff729eda3640fb6abdf9.1497034141... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 18a5348d49afcfc2b95da939143c9420edd78b9e) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- kernel/sched/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d253618d09c6..9c905bd94ff0 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4979,7 +4979,7 @@ void idle_task_exit(void) BUG_ON(cpu_online(smp_processor_id()));
if (mm != &init_mm) { - switch_mm_irqs_off(mm, &init_mm, current); + switch_mm(mm, &init_mm, current); finish_arch_post_lock_switch(); } mmdrop(mm);
From: Aaron Lu aaron.lu@intel.com
commit 82ba4faca1bffad429f15c90c980ffd010366c25 upstream.
Since commit:
52aec3308db8 ("x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR")
the TLB remote shootdown is done through call function vector. That commit didn't take care of irq_tlb_count, which a later commit:
fd0f5869724f ("x86: Distinguish TLB shootdown interrupts from other functions call interrupts")
... tried to fix.
The fix assumes every increase of irq_tlb_count has a corresponding increase of irq_call_count. So the irq_call_count is always bigger than irq_tlb_count and we could substract irq_tlb_count from irq_call_count.
Unfortunately this is not true for the smp_call_function_single() case. The IPI is only sent if the target CPU's call_single_queue is empty when adding a csd into it in generic_exec_single. That means if two threads are both adding flush tlb csds to the same CPU's call_single_queue, only one IPI is sent. In other words, the irq_call_count is incremented by 1 but irq_tlb_count is incremented by 2. Over time, irq_tlb_count will be bigger than irq_call_count and the substract will produce a very large irq_call_count value due to overflow.
Considering that:
1) it's not worth to send more IPIs for the sake of accurate counting of irq_call_count in generic_exec_single();
2) it's not easy to tell if the call function interrupt is for TLB shootdown in __smp_call_function_single_interrupt().
Not to exclude TLB shootdown from call function count seems to be the simplest fix and this patch just does that.
This bug was found by LKP's cyclic performance regression tracking recently with the vm-scalability test suite. I have bisected to commit:
3dec0ba0be6a ("mm/rmap: share the i_mmap_rwsem")
This commit didn't do anything wrong but revealed the irq_call_count problem. IIUC, the commit makes rwc->remap_one in rmap_walk_file concurrent with multiple threads. When remap_one is try_to_unmap_one(), then multiple threads could queue flush TLB to the same CPU but only one IPI will be sent.
Since the commit was added in Linux v3.19, the counting problem only shows up from v3.19 onwards.
Signed-off-by: Aaron Lu aaron.lu@intel.com Cc: Alex Shi alex.shi@linaro.org Cc: Andy Lutomirski luto@kernel.org Cc: Borislav Petkov bp@alien8.de Cc: Brian Gerst brgerst@gmail.com Cc: Davidlohr Bueso dave@stgolabs.net Cc: Denys Vlasenko dvlasenk@redhat.com Cc: H. Peter Anvin hpa@zytor.com Cc: Huang Ying ying.huang@intel.com Cc: Josh Poimboeuf jpoimboe@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Tomoki Sekiyama tomoki.sekiyama.qu@hitachi.com Link: http://lkml.kernel.org/r/20160811074430.GA18163@aaronlu.sh.intel.com Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 3b9d9ec0d8261bb9b12f858e66f0c84cd2a6a3bb) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/include/asm/hardirq.h | 4 ---- arch/x86/kernel/irq.c | 3 +-- 2 files changed, 1 insertion(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h index 0f5fb6b6567e..ebaf64d0a785 100644 --- a/arch/x86/include/asm/hardirq.h +++ b/arch/x86/include/asm/hardirq.h @@ -21,10 +21,6 @@ typedef struct { #ifdef CONFIG_SMP unsigned int irq_resched_count; unsigned int irq_call_count; - /* - * irq_tlb_count is double-counted in irq_call_count, so it must be - * subtracted from irq_call_count when displaying irq_call_count - */ unsigned int irq_tlb_count; #endif #ifdef CONFIG_X86_THERMAL_VECTOR diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index e5952c225532..b6460c5a9cab 100644 --- a/arch/x86/kernel/irq.c +++ b/arch/x86/kernel/irq.c @@ -96,8 +96,7 @@ int arch_show_interrupts(struct seq_file *p, int prec) seq_puts(p, " Rescheduling interrupts\n"); seq_printf(p, "%*s: ", prec, "CAL"); for_each_online_cpu(j) - seq_printf(p, "%10u ", irq_stats(j)->irq_call_count - - irq_stats(j)->irq_tlb_count); + seq_printf(p, "%10u ", irq_stats(j)->irq_call_count); seq_puts(p, " Function call interrupts\n"); seq_printf(p, "%*s: ", prec, "TLB"); for_each_online_cpu(j)
From: Andy Lutomirski luto@kernel.org
commit 9ccee2373f0658f234727700e619df097ba57023 upstream.
mark_screen_rdonly() is the last remaining caller of flush_tlb(). flush_tlb_mm_range() is potentially faster and isn't obsolete.
Compile-tested only because I don't know whether software that uses this mechanism even exists.
Signed-off-by: Andy Lutomirski luto@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: Borislav Petkov bp@alien8.de Cc: Brian Gerst brgerst@gmail.com Cc: Dave Hansen dave.hansen@intel.com Cc: Denys Vlasenko dvlasenk@redhat.com Cc: H. Peter Anvin hpa@zytor.com Cc: Josh Poimboeuf jpoimboe@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Michal Hocko mhocko@suse.com Cc: Nadav Amit namit@vmware.com Cc: Peter Zijlstra peterz@infradead.org Cc: Rik van Riel riel@redhat.com Cc: Sasha Levin sasha.levin@oracle.com Cc: Thomas Gleixner tglx@linutronix.de Link: http://lkml.kernel.org/r/791a644076fc3577ba7f7b7cafd643cc089baa7d.1492844372... Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Hugh Dickins hughd@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 6ce9d1e6819e53c4de0bf980555c4e07bbedb4ce) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/kernel/vm86_32.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c index fc9db6ef2a95..e0ae0a8ad5bd 100644 --- a/arch/x86/kernel/vm86_32.c +++ b/arch/x86/kernel/vm86_32.c @@ -194,7 +194,7 @@ static void mark_screen_rdonly(struct mm_struct *mm) pte_unmap_unlock(pte, ptl); out: up_write(&mm->mmap_sem); - flush_tlb(); + flush_tlb_mm_range(mm, 0xA0000, 0xA0000 + 32*PAGE_SIZE, 0UL); }
From: Andy Lutomirski luto@kernel.org
commit 29961b59a51f8c6838a26a45e871a7ed6771809b upstream.
I was trying to figure out what how flush_tlb_current_task() would possibly work correctly if current->mm != current->active_mm, but I realized I could spare myself the effort: it has no callers except the unused flush_tlb() macro.
Signed-off-by: Andy Lutomirski luto@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: Borislav Petkov bp@alien8.de Cc: Brian Gerst brgerst@gmail.com Cc: Dave Hansen dave.hansen@intel.com Cc: Denys Vlasenko dvlasenk@redhat.com Cc: H. Peter Anvin hpa@zytor.com Cc: Josh Poimboeuf jpoimboe@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Michal Hocko mhocko@suse.com Cc: Nadav Amit namit@vmware.com Cc: Peter Zijlstra peterz@infradead.org Cc: Rik van Riel riel@redhat.com Cc: Thomas Gleixner tglx@linutronix.de Link: http://lkml.kernel.org/r/e52d64c11690f85e9f1d69d7b48cc2269cd2e94b.1492844372... Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Hugh Dickins hughd@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 227d6f0e79f809e448d3157fbfd00eb54dcbb54e) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/include/asm/tlbflush.h | 9 --------- arch/x86/mm/tlb.c | 17 ----------------- 2 files changed, 26 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 433eeaafe498..d9ee4674c235 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -197,7 +197,6 @@ static inline void __flush_tlb_one(unsigned long addr) /* * TLB flushing: * - * - flush_tlb() flushes the current mm struct TLBs * - flush_tlb_all() flushes all processes TLBs * - flush_tlb_mm(mm) flushes the specified mm context TLB's * - flush_tlb_page(vma, vmaddr) flushes one page @@ -229,11 +228,6 @@ static inline void flush_tlb_all(void) __flush_tlb_all(); }
-static inline void flush_tlb(void) -{ - __flush_tlb_up(); -} - static inline void local_flush_tlb(void) { __flush_tlb_up(); @@ -295,14 +289,11 @@ static inline void flush_tlb_kernel_range(unsigned long start, flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags)
extern void flush_tlb_all(void); -extern void flush_tlb_current_task(void); extern void flush_tlb_page(struct vm_area_struct *, unsigned long); extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned long vmflag); extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
-#define flush_tlb() flush_tlb_current_task() - void native_flush_tlb_others(const struct cpumask *cpumask, struct mm_struct *mm, unsigned long start, unsigned long end); diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 4ce6569ad963..262abde154eb 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -265,23 +265,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask, smp_call_function_many(cpumask, flush_tlb_func, &info, 1); }
-void flush_tlb_current_task(void) -{ - struct mm_struct *mm = current->mm; - - preempt_disable(); - - count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); - - /* This is an implicit full barrier that synchronizes with switch_mm. */ - local_flush_tlb(); - - trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL); - if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL); - preempt_enable(); -} - /* * See Documentation/x86/tlb.txt for details. We choose 33 * because it is large enough to cover the vast majority (at
From: Andy Lutomirski luto@kernel.org
commit ce27374fabf553153c3f53efcaa9bfab9216bd8c upstream.
I'm about to rewrite the function almost completely, but first I want to get a functional change out of the way. Currently, if flush_tlb_mm_range() does not flush the local TLB at all, it will never do individual page flushes on remote CPUs. This seems to be an accident, and preserving it will be awkward. Let's change it first so that any regressions in the rewrite will be easier to bisect and so that the rewrite can attempt to change no visible behavior at all.
The fix is simple: we can simply avoid short-circuiting the calculation of base_pages_to_flush.
As a side effect, this also eliminates a potential corner case: if tlb_single_page_flush_ceiling == TLB_FLUSH_ALL, flush_tlb_mm_range() could have ended up flushing the entire address space one page at a time.
Signed-off-by: Andy Lutomirski luto@kernel.org Acked-by: Dave Hansen dave.hansen@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: Borislav Petkov bp@alien8.de Cc: Brian Gerst brgerst@gmail.com Cc: Denys Vlasenko dvlasenk@redhat.com Cc: H. Peter Anvin hpa@zytor.com Cc: Josh Poimboeuf jpoimboe@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Michal Hocko mhocko@suse.com Cc: Nadav Amit namit@vmware.com Cc: Peter Zijlstra peterz@infradead.org Cc: Rik van Riel riel@redhat.com Cc: Thomas Gleixner tglx@linutronix.de Link: http://lkml.kernel.org/r/4b29b771d9975aad7154c314534fec235618175a.1492844372... Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Hugh Dickins hughd@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 9f4d1ba1d407e56dac833aa0b11c60f952939e1c) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/mm/tlb.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 262abde154eb..95d5b4fff799 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -285,6 +285,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long base_pages_to_flush = TLB_FLUSH_ALL;
preempt_disable(); + + if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB)) + base_pages_to_flush = (end - start) >> PAGE_SHIFT; + if (base_pages_to_flush > tlb_single_page_flush_ceiling) + base_pages_to_flush = TLB_FLUSH_ALL; + if (current->active_mm != mm) { /* Synchronize with switch_mm. */ smp_mb(); @@ -301,15 +307,11 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, goto out; }
- if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB)) - base_pages_to_flush = (end - start) >> PAGE_SHIFT; - /* * Both branches below are implicit full barriers (MOV to CR or * INVLPG) that synchronize with switch_mm. */ - if (base_pages_to_flush > tlb_single_page_flush_ceiling) { - base_pages_to_flush = TLB_FLUSH_ALL; + if (base_pages_to_flush == TLB_FLUSH_ALL) { count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); local_flush_tlb(); } else {
From: Andy Lutomirski luto@kernel.org
commit ca6c99c0794875c6d1db6e22f246699691ab7e6b upstream.
flush_tlb_page() was very similar to flush_tlb_mm_range() except that it had a couple of issues:
- It was missing an smp_mb() in the case where current->active_mm != mm. (This is a longstanding bug reported by Nadav Amit)
- It was missing tracepoints and vm counter updates.
The only reason that I can see for keeping it at as a separate function is that it could avoid a few branches that flush_tlb_mm_range() needs to decide to flush just one page. This hardly seems worthwhile. If we decide we want to get rid of those branches again, a better way would be to introduce an __flush_tlb_mm_range() helper and make both flush_tlb_page() and flush_tlb_mm_range() use it.
Signed-off-by: Andy Lutomirski luto@kernel.org Acked-by: Kees Cook keescook@chromium.org Cc: Andrew Morton akpm@linux-foundation.org Cc: Borislav Petkov bpetkov@suse.de Cc: Dave Hansen dave.hansen@intel.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Mel Gorman mgorman@suse.de Cc: Michal Hocko mhocko@suse.com Cc: Nadav Amit nadav.amit@gmail.com Cc: Nadav Amit namit@vmware.com Cc: Peter Zijlstra peterz@infradead.org Cc: Rik van Riel riel@redhat.com Cc: Thomas Gleixner tglx@linutronix.de Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/3cc3847cf888d8907577569b8bac3f01992ef8f9.1495492063... Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Hugh Dickins hughd@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 3efba6062a410a2a65fc9d6f53dca63db2602e65) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/include/asm/tlbflush.h | 6 +++++- arch/x86/mm/tlb.c | 27 --------------------------- 2 files changed, 5 insertions(+), 28 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index d9ee4674c235..f23ee750bb68 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -289,11 +289,15 @@ static inline void flush_tlb_kernel_range(unsigned long start, flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags)
extern void flush_tlb_all(void); -extern void flush_tlb_page(struct vm_area_struct *, unsigned long); extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned long vmflag); extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a) +{ + flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, VM_NONE); +} + void native_flush_tlb_others(const struct cpumask *cpumask, struct mm_struct *mm, unsigned long start, unsigned long end); diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 95d5b4fff799..77fdf801efde 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -332,33 +332,6 @@ out: preempt_enable(); }
-void flush_tlb_page(struct vm_area_struct *vma, unsigned long start) -{ - struct mm_struct *mm = vma->vm_mm; - - preempt_disable(); - - if (current->active_mm == mm) { - if (current->mm) { - /* - * Implicit full barrier (INVLPG) that synchronizes - * with switch_mm. - */ - __flush_tlb_one(start); - } else { - leave_mm(smp_processor_id()); - - /* Synchronize with switch_mm. */ - smp_mb(); - } - } - - if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) - flush_tlb_others(mm_cpumask(mm), mm, start, 0UL); - - preempt_enable(); -} - static void do_flush_tlb_all(void *info) { count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
From: Andy Lutomirski luto@kernel.org
commit ce4a4e565f5264909a18c733b864c3f74467f69e upstream.
The UP asm/tlbflush.h generates somewhat nicer code than the SMP version. Aside from that, it's fallen quite a bit behind the SMP code:
- flush_tlb_mm_range() didn't flush individual pages if the range was small.
- The lazy TLB code was much weaker. This usually wouldn't matter, but, if a kernel thread flushed its lazy "active_mm" more than once (due to reclaim or similar), it wouldn't be unlazied and would instead pointlessly flush repeatedly.
- Tracepoints were missing.
Aside from that, simply having the UP code around was a maintanence burden, since it means that any change to the TLB flush code had to make sure not to break it.
Simplify everything by deleting the UP code.
Signed-off-by: Andy Lutomirski luto@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: Arjan van de Ven arjan@linux.intel.com Cc: Borislav Petkov bpetkov@suse.de Cc: Dave Hansen dave.hansen@intel.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Mel Gorman mgorman@suse.de Cc: Michal Hocko mhocko@suse.com Cc: Nadav Amit nadav.amit@gmail.com Cc: Nadav Amit namit@vmware.com Cc: Peter Zijlstra peterz@infradead.org Cc: Rik van Riel riel@redhat.com Cc: Thomas Gleixner tglx@linutronix.de Cc: linux-mm@kvack.org Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Hugh Dickins hughd@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
(cherry picked from commit b2e24274d50e0ecdf560ebe06dbed0cc648ad3f9) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/Kconfig --- arch/x86/include/asm/hardirq.h | 2 +- arch/x86/include/asm/mmu.h | 6 --- arch/x86/include/asm/mmu_context.h | 2 - arch/x86/include/asm/tlbflush.h | 78 +------------------------------------- arch/x86/mm/init.c | 2 - arch/x86/mm/tlb.c | 17 +-------- 6 files changed, 4 insertions(+), 103 deletions(-)
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h index ebaf64d0a785..edd280d39a31 100644 --- a/arch/x86/include/asm/hardirq.h +++ b/arch/x86/include/asm/hardirq.h @@ -21,8 +21,8 @@ typedef struct { #ifdef CONFIG_SMP unsigned int irq_resched_count; unsigned int irq_call_count; - unsigned int irq_tlb_count; #endif + unsigned int irq_tlb_count; #ifdef CONFIG_X86_THERMAL_VECTOR unsigned int irq_thermal_count; #endif diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h index 364d27481a52..652517e86e01 100644 --- a/arch/x86/include/asm/mmu.h +++ b/arch/x86/include/asm/mmu.h @@ -22,12 +22,6 @@ typedef struct { atomic_t perf_rdpmc_allowed; /* nonzero if rdpmc is allowed */ } mm_context_t;
-#ifdef CONFIG_SMP void leave_mm(int cpu); -#else -static inline void leave_mm(int cpu) -{ -} -#endif
#endif /* _ASM_X86_MMU_H */ diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 2bf4bae1c65e..29292b7aa370 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -86,10 +86,8 @@ void destroy_context(struct mm_struct *mm);
static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) { -#ifdef CONFIG_SMP if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY); -#endif }
extern void switch_mm(struct mm_struct *prev, struct mm_struct *next, diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index f23ee750bb68..4f7017c9cae8 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -6,6 +6,7 @@
#include <asm/processor.h> #include <asm/special_insns.h> +#include <asm/smp.h>
static inline void __invpcid(unsigned long pcid, unsigned long addr, unsigned long type) @@ -64,10 +65,8 @@ static inline void invpcid_flush_all_nonglobals(void) #endif
struct tlb_state { -#ifdef CONFIG_SMP struct mm_struct *active_mm; int state; -#endif
/* * Access to this CR4 shadow and to H/W CR4 is protected by @@ -208,79 +207,6 @@ static inline void __flush_tlb_one(unsigned long addr) * and page-granular flushes are available only on i486 and up. */
-#ifndef CONFIG_SMP - -/* "_up" is for UniProcessor. - * - * This is a helper for other header functions. *Not* intended to be called - * directly. All global TLB flushes need to either call this, or to bump the - * vm statistics themselves. - */ -static inline void __flush_tlb_up(void) -{ - count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); - __flush_tlb(); -} - -static inline void flush_tlb_all(void) -{ - count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); - __flush_tlb_all(); -} - -static inline void local_flush_tlb(void) -{ - __flush_tlb_up(); -} - -static inline void flush_tlb_mm(struct mm_struct *mm) -{ - if (mm == current->active_mm) - __flush_tlb_up(); -} - -static inline void flush_tlb_page(struct vm_area_struct *vma, - unsigned long addr) -{ - if (vma->vm_mm == current->active_mm) - __flush_tlb_one(addr); -} - -static inline void flush_tlb_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end) -{ - if (vma->vm_mm == current->active_mm) - __flush_tlb_up(); -} - -static inline void flush_tlb_mm_range(struct mm_struct *mm, - unsigned long start, unsigned long end, unsigned long vmflag) -{ - if (mm == current->active_mm) - __flush_tlb_up(); -} - -static inline void native_flush_tlb_others(const struct cpumask *cpumask, - struct mm_struct *mm, - unsigned long start, - unsigned long end) -{ -} - -static inline void reset_lazy_tlbstate(void) -{ -} - -static inline void flush_tlb_kernel_range(unsigned long start, - unsigned long end) -{ - flush_tlb_all(); -} - -#else /* SMP */ - -#include <asm/smp.h> - #define local_flush_tlb() __flush_tlb()
#define flush_tlb_mm(mm) flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL) @@ -311,8 +237,6 @@ static inline void reset_lazy_tlbstate(void) this_cpu_write(cpu_tlbstate.active_mm, &init_mm); }
-#endif /* SMP */ - #ifndef CONFIG_PARAVIRT #define flush_tlb_others(mask, mm, start, end) \ native_flush_tlb_others(mask, mm, start, end) diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index 3e1bb1c8daea..0a06e5a14278 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -752,10 +752,8 @@ void __init zone_sizes_init(void) }
DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = { -#ifdef CONFIG_SMP .active_mm = &init_mm, .state = 0, -#endif .cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */ }; EXPORT_SYMBOL_GPL(cpu_tlbstate); diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 77fdf801efde..95624d903c0e 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -15,7 +15,7 @@ #include <linux/debugfs.h>
/* - * Smarter SMP flushing macros. + * TLB flushing, formerly SMP-only * c/o Linus Torvalds. * * These mean you can really definitely utterly forget about @@ -28,8 +28,6 @@ * Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi */
-#ifdef CONFIG_SMP - struct flush_tlb_info { struct mm_struct *flush_mm; unsigned long flush_start; @@ -59,8 +57,6 @@ void leave_mm(int cpu) } EXPORT_SYMBOL_GPL(leave_mm);
-#endif /* CONFIG_SMP */ - void switch_mm(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk) { @@ -77,10 +73,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, unsigned cpu = smp_processor_id();
if (likely(prev != next)) { -#ifdef CONFIG_SMP this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); this_cpu_write(cpu_tlbstate.active_mm, next); -#endif cpumask_set_cpu(cpu, mm_cpumask(next));
/* @@ -135,9 +129,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, */ if (unlikely(prev->context.ldt != next->context.ldt)) load_mm_ldt(next); - } -#ifdef CONFIG_SMP - else { + } else { this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
@@ -164,11 +156,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, load_mm_ldt(next); } } -#endif }
-#ifdef CONFIG_SMP - /* * The flush IPI assumes that a thread switch happens in this order: * [cpu0: the cpu that switches] @@ -416,5 +405,3 @@ static int __init create_tlb_single_page_flush_ceiling(void) return 0; } late_initcall(create_tlb_single_page_flush_ceiling); - -#endif /* CONFIG_SMP */
From: Andy Lutomirski luto@kernel.org
commit cba4671af7550e008f7a7835f06df0763825bf3e upstream.
32-bit kernels on new hardware will see PCID in CPUID, but PCID can only be used in 64-bit mode. Rather than making all PCID code conditional, just disable the feature on 32-bit builds.
Signed-off-by: Andy Lutomirski luto@kernel.org Reviewed-by: Nadav Amit nadav.amit@gmail.com Reviewed-by: Borislav Petkov bp@suse.de Reviewed-by: Thomas Gleixner tglx@linutronix.de Cc: Andrew Morton akpm@linux-foundation.org Cc: Arjan van de Ven arjan@linux.intel.com Cc: Borislav Petkov bp@alien8.de Cc: Dave Hansen dave.hansen@intel.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Mel Gorman mgorman@suse.de Cc: Peter Zijlstra peterz@infradead.org Cc: Rik van Riel riel@redhat.com Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/2e391769192a4d31b808410c383c6bf0734bc6ea.1498751203... Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Hugh Dickins hughd@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
(cherry picked from commit 78043e5b6fb2921d836b31f23e89e52925191153) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/include/asm/disabled-features.h | 4 +++- arch/x86/kernel/cpu/bugs.c | 8 ++++++++ 2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h index f226df064660..8b17c2ad1048 100644 --- a/arch/x86/include/asm/disabled-features.h +++ b/arch/x86/include/asm/disabled-features.h @@ -21,11 +21,13 @@ # define DISABLE_K6_MTRR (1<<(X86_FEATURE_K6_MTRR & 31)) # define DISABLE_CYRIX_ARR (1<<(X86_FEATURE_CYRIX_ARR & 31)) # define DISABLE_CENTAUR_MCR (1<<(X86_FEATURE_CENTAUR_MCR & 31)) +# define DISABLE_PCID 0 #else # define DISABLE_VME 0 # define DISABLE_K6_MTRR 0 # define DISABLE_CYRIX_ARR 0 # define DISABLE_CENTAUR_MCR 0 +# define DISABLE_PCID (1<<(X86_FEATURE_PCID & 31)) #endif /* CONFIG_X86_64 */
/* @@ -35,7 +37,7 @@ #define DISABLED_MASK1 0 #define DISABLED_MASK2 0 #define DISABLED_MASK3 (DISABLE_CYRIX_ARR|DISABLE_CENTAUR_MCR|DISABLE_K6_MTRR) -#define DISABLED_MASK4 0 +#define DISABLED_MASK4 (DISABLE_PCID) #define DISABLED_MASK5 0 #define DISABLED_MASK6 0 #define DISABLED_MASK7 0 diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 03445346ee0a..4c7dd836304a 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -65,6 +65,14 @@ static void __init check_fpu(void)
void __init check_bugs(void) { +#ifdef CONFIG_X86_32 + /* + * Regardless of whether PCID is enumerated, the SDM says + * that it can't be enabled in 32-bit mode. + */ + setup_clear_cpu_cap(X86_FEATURE_PCID); +#endif + identify_boot_cpu(); #ifndef CONFIG_SMP pr_info("CPU: ");
From: Andy Lutomirski luto@kernel.org
commit 0790c9aad84901ca1bdc14746175549c8b5da215 upstream.
The parameter is only present on x86_64 systems to save a few bytes, as PCID is always disabled on x86_32.
Signed-off-by: Andy Lutomirski luto@kernel.org Reviewed-by: Nadav Amit nadav.amit@gmail.com Reviewed-by: Borislav Petkov bp@suse.de Reviewed-by: Thomas Gleixner tglx@linutronix.de Cc: Andrew Morton akpm@linux-foundation.org Cc: Arjan van de Ven arjan@linux.intel.com Cc: Borislav Petkov bp@alien8.de Cc: Dave Hansen dave.hansen@intel.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Mel Gorman mgorman@suse.de Cc: Peter Zijlstra peterz@infradead.org Cc: Rik van Riel riel@redhat.com Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/8bbb2e65bcd249a5f18bfb8128b4689f08ac2b60.1498751203... Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Hugh Dickins hughd@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
(cherry picked from commit dcccd3c266e24ce80ecf592765315c54c222ac33) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- Documentation/kernel-parameters.txt | 2 ++ arch/x86/kernel/cpu/common.c | 18 ++++++++++++++++++ 2 files changed, 20 insertions(+)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index ceaab09a279e..97bc24101896 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2471,6 +2471,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. nopat [X86] Disable PAT (page attribute table extension of pagetables) support.
+ nopcid [X86-64] Disable the PCID cpu feature. + norandmaps Don't use address space randomization. Equivalent to echo 0 > /proc/sys/kernel/randomize_va_space
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 90ef802d9d90..843be1de5ddb 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -172,6 +172,24 @@ static int __init x86_xsaves_setup(char *s) } __setup("noxsaves", x86_xsaves_setup);
+#ifdef CONFIG_X86_64 +static int __init x86_pcid_setup(char *s) +{ + /* require an exact match without trailing characters */ + if (strlen(s)) + return 0; + + /* do not emit a message if the feature is not present */ + if (!boot_cpu_has(X86_FEATURE_PCID)) + return 1; + + setup_clear_cpu_cap(X86_FEATURE_PCID); + pr_info("nopcid: PCID feature disabled\n"); + return 1; +} +__setup("nopcid", x86_pcid_setup); +#endif + static int __init x86_noinvpcid_setup(char *s) { /* noinvpcid doesn't accept parameters */
From: Andy Lutomirski luto@kernel.org
commit 660da7c9228f685b2ebe664f9fd69aaddcc420b5 upstream.
We can use PCID if the CPU has PCID and PGE and we're not on Xen.
By itself, this has no effect. A followup patch will start using PCID.
Signed-off-by: Andy Lutomirski luto@kernel.org Reviewed-by: Nadav Amit nadav.amit@gmail.com Reviewed-by: Boris Ostrovsky boris.ostrovsky@oracle.com Reviewed-by: Thomas Gleixner tglx@linutronix.de Cc: Andrew Morton akpm@linux-foundation.org Cc: Arjan van de Ven arjan@linux.intel.com Cc: Borislav Petkov bp@alien8.de Cc: Dave Hansen dave.hansen@intel.com Cc: Juergen Gross jgross@suse.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Mel Gorman mgorman@suse.de Cc: Peter Zijlstra peterz@infradead.org Cc: Rik van Riel riel@redhat.com Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/6327ecd907b32f79d5aa0d466f04503bbec5df88.1498751203... Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Hugh Dickins hughd@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
(cherry picked from commit fd0504525efd2ce2063cd4229baabd3e3a56ecbc) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/include/asm/tlbflush.h | 8 ++++++++ arch/x86/kernel/cpu/common.c | 22 ++++++++++++++++++++++ arch/x86/xen/enlighten.c | 6 ++++++ 3 files changed, 36 insertions(+)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 4f7017c9cae8..09a70d8d293e 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -183,6 +183,14 @@ static inline void __flush_tlb_all(void) __flush_tlb_global(); else __flush_tlb(); + + /* + * Note: if we somehow had PCID but not PGE, then this wouldn't work -- + * we'd end up flushing kernel translations for the current ASID but + * we might fail to flush kernel translations for other cached ASIDs. + * + * To avoid this issue, we force PCID off if PGE is off. + */ }
static inline void __flush_tlb_one(unsigned long addr) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 843be1de5ddb..f5647d238337 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -339,6 +339,25 @@ static __always_inline void setup_smap(struct cpuinfo_x86 *c) } }
+static void setup_pcid(struct cpuinfo_x86 *c) +{ + if (cpu_has(c, X86_FEATURE_PCID)) { + if (cpu_has(c, X86_FEATURE_PGE)) { + cr4_set_bits(X86_CR4_PCIDE); + } else { + /* + * flush_tlb_all(), as currently implemented, won't + * work if PCID is on but PGE is not. Since that + * combination doesn't exist on real hardware, there's + * no reason to try to fully support it, but it's + * polite to avoid corrupting data if we're on + * an improperly configured VM. + */ + clear_cpu_cap(c, X86_FEATURE_PCID); + } + } +} + /* * Some CPU features depend on higher CPUID levels, which may not always * be available due to CPUID level capping or broken virtualization @@ -968,6 +987,9 @@ static void identify_cpu(struct cpuinfo_x86 *c) setup_smep(c); setup_smap(c);
+ /* Set up PCID */ + setup_pcid(c); + /* * The vendor-specific functions might have changed features. * Now we do "generic changes." diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c index 1ecae556d4ed..809730c09e2b 100644 --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c @@ -432,6 +432,12 @@ static void __init xen_init_cpuid_mask(void) ~((1 << X86_FEATURE_MTRR) | /* disable MTRR */ (1 << X86_FEATURE_ACC)); /* thermal monitoring */
+ /* + * Xen PV would need some work to support PCID: CR3 handling as well + * as xen_flush_tlb_others() would need updating. + */ + cpuid_leaf1_ecx_mask &= ~(1 << (X86_FEATURE_PCID % 32)); /* disable PCID */ + if (!xen_initial_domain()) cpuid_leaf1_edx_mask &= ~((1 << X86_FEATURE_ACPI)); /* disable ACPI */
From: Andy Lutomirski luto@kernel.org
commit 924c6b900cfdf376b07bccfd80e62b21914f8a5a upstream.
Trying to reboot via real mode fails with PCID on: long mode cannot be exited while CR4.PCIDE is set. (No, I have no idea why, but the SDM and actual CPUs are in agreement here.) The result is a GPF and a hang instead of a reboot.
I didn't catch this in testing because neither my computer nor my VM reboots this way. I can trigger it with reboot=bios, though.
Fixes: 660da7c9228f ("x86/mm: Enable CR4.PCIDE on supported systems") Reported-and-tested-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Andy Lutomirski luto@kernel.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Borislav Petkov bp@alien8.de Link: https://lkml.kernel.org/r/f1e7d965998018450a7a70c2823873686a8b21c0.150752474... Cc: Hugh Dickins hughd@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
(cherry picked from commit 6c4db09c291a19da66512f99c4bcb378a862f9e6) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/kernel/reboot.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c index 0549ae3cb332..d9ea27ec9dbd 100644 --- a/arch/x86/kernel/reboot.c +++ b/arch/x86/kernel/reboot.c @@ -93,6 +93,10 @@ void __noreturn machine_real_restart(unsigned int type) load_cr3(initial_page_table); #else write_cr3(real_mode_header->trampoline_pgd); + + /* Exiting long mode will fail if CR4.PCIDE is set. */ + if (static_cpu_has(X86_FEATURE_PCID)) + cr4_clear_bits(X86_CR4_PCIDE); #endif
/* Jump to the identity-mapped low memory code */
From: Tom Lendacky thomas.lendacky@amd.com
commit e505371dd83963caae1a37ead9524e8d997341be upstream.
Add a cmdline_find_option() function to look for cmdline options that take arguments. The argument is returned in a supplied buffer and the argument length (regardless of whether it fits in the supplied buffer) is returned, with -1 indicating not found.
Signed-off-by: Tom Lendacky thomas.lendacky@amd.com Reviewed-by: Thomas Gleixner tglx@linutronix.de Cc: Alexander Potapenko glider@google.com Cc: Andrey Ryabinin aryabinin@virtuozzo.com Cc: Andy Lutomirski luto@kernel.org Cc: Arnd Bergmann arnd@arndb.de Cc: Borislav Petkov bp@alien8.de Cc: Brijesh Singh brijesh.singh@amd.com Cc: Dave Young dyoung@redhat.com Cc: Dmitry Vyukov dvyukov@google.com Cc: Jonathan Corbet corbet@lwn.net Cc: Konrad Rzeszutek Wilk konrad.wilk@oracle.com Cc: Larry Woodman lwoodman@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Matt Fleming matt@codeblueprint.co.uk Cc: Michael S. Tsirkin mst@redhat.com Cc: Paolo Bonzini pbonzini@redhat.com Cc: Peter Zijlstra peterz@infradead.org Cc: Radim Krčmář rkrcmar@redhat.com Cc: Rik van Riel riel@redhat.com Cc: Toshimitsu Kani toshi.kani@hpe.com Cc: kasan-dev@googlegroups.com Cc: kvm@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-efi@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/36b5f97492a9745dce27682305f990fc20e5cf8a.1500319216... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
(cherry picked from commit 0fa147b407478e73fe7a478677ff2b12bb824014) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/include/asm/cmdline.h | 2 + arch/x86/lib/cmdline.c | 105 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 107 insertions(+)
diff --git a/arch/x86/include/asm/cmdline.h b/arch/x86/include/asm/cmdline.h index e01f7f7ccb0c..84ae170bc3d0 100644 --- a/arch/x86/include/asm/cmdline.h +++ b/arch/x86/include/asm/cmdline.h @@ -2,5 +2,7 @@ #define _ASM_X86_CMDLINE_H
int cmdline_find_option_bool(const char *cmdline_ptr, const char *option); +int cmdline_find_option(const char *cmdline_ptr, const char *option, + char *buffer, int bufsize);
#endif /* _ASM_X86_CMDLINE_H */ diff --git a/arch/x86/lib/cmdline.c b/arch/x86/lib/cmdline.c index 422db000d727..a744506856b1 100644 --- a/arch/x86/lib/cmdline.c +++ b/arch/x86/lib/cmdline.c @@ -82,3 +82,108 @@ int cmdline_find_option_bool(const char *cmdline, const char *option)
return 0; /* Buffer overrun */ } + +/* + * Find a non-boolean option (i.e. option=argument). In accordance with + * standard Linux practice, if this option is repeated, this returns the + * last instance on the command line. + * + * @cmdline: the cmdline string + * @max_cmdline_size: the maximum size of cmdline + * @option: option string to look for + * @buffer: memory buffer to return the option argument + * @bufsize: size of the supplied memory buffer + * + * Returns the length of the argument (regardless of if it was + * truncated to fit in the buffer), or -1 on not found. + */ +static int +__cmdline_find_option(const char *cmdline, int max_cmdline_size, + const char *option, char *buffer, int bufsize) +{ + char c; + int pos = 0, len = -1; + const char *opptr = NULL; + char *bufptr = buffer; + enum { + st_wordstart = 0, /* Start of word/after whitespace */ + st_wordcmp, /* Comparing this word */ + st_wordskip, /* Miscompare, skip */ + st_bufcpy, /* Copying this to buffer */ + } state = st_wordstart; + + if (!cmdline) + return -1; /* No command line */ + + /* + * This 'pos' check ensures we do not overrun + * a non-NULL-terminated 'cmdline' + */ + while (pos++ < max_cmdline_size) { + c = *(char *)cmdline++; + if (!c) + break; + + switch (state) { + case st_wordstart: + if (myisspace(c)) + break; + + state = st_wordcmp; + opptr = option; + /* fall through */ + + case st_wordcmp: + if ((c == '=') && !*opptr) { + /* + * We matched all the way to the end of the + * option we were looking for, prepare to + * copy the argument. + */ + len = 0; + bufptr = buffer; + state = st_bufcpy; + break; + } else if (c == *opptr++) { + /* + * We are currently matching, so continue + * to the next character on the cmdline. + */ + break; + } + state = st_wordskip; + /* fall through */ + + case st_wordskip: + if (myisspace(c)) + state = st_wordstart; + break; + + case st_bufcpy: + if (myisspace(c)) { + state = st_wordstart; + } else { + /* + * Increment len, but don't overrun the + * supplied buffer and leave room for the + * NULL terminator. + */ + if (++len < bufsize) + *bufptr++ = c; + } + break; + } + } + + if (bufsize) + *bufptr = '\0'; + + return len; +} + +int cmdline_find_option(const char *cmdline, const char *option, char *buffer, + int bufsize) +{ + return __cmdline_find_option(cmdline, COMMAND_LINE_SIZE, option, + buffer, bufsize); +}
From: Denys Vlasenko dvlasenk@redhat.com
PER_CPU_VAR(kernel_stack) is redundant:
- On the 64-bit build, we can use PER_CPU_VAR(cpu_tss + TSS_sp0). - On the 32-bit build, we can use PER_CPU_VAR(cpu_current_top_of_stack).
PER_CPU_VAR(kernel_stack) will be deleted by a separate change.
Signed-off-by: Denys Vlasenko dvlasenk@redhat.com Cc: Alexei Starovoitov ast@plumgrid.com Cc: Andrew Morton akpm@linux-foundation.org Cc: Andy Lutomirski luto@amacapital.net Cc: Borislav Petkov bp@alien8.de Cc: Frederic Weisbecker fweisbec@gmail.com Cc: H. Peter Anvin hpa@zytor.com Cc: Kees Cook keescook@chromium.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Oleg Nesterov oleg@redhat.com Cc: Peter Zijlstra peterz@infradead.org Cc: Steven Rostedt rostedt@goodmis.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Will Drewry wad@chromium.org Link: http://lkml.kernel.org/r/1429889495-27850-1-git-send-email-dvlasenk@redhat.c... Signed-off-by: Ingo Molnar mingo@kernel.org (cherry picked from commit 63332a8455d8310b77d38779c6c21a660a8d9feb) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/ia32/ia32entry.S | 2 +- arch/x86/include/asm/thread_info.h | 8 +++++++- arch/x86/kernel/entry_64.S | 4 ++-- arch/x86/xen/xen-asm_64.S | 5 +++-- 4 files changed, 13 insertions(+), 6 deletions(-)
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S index 27e54946ef35..2ae22c951fa0 100644 --- a/arch/x86/ia32/ia32entry.S +++ b/arch/x86/ia32/ia32entry.S @@ -356,7 +356,7 @@ ENTRY(ia32_cstar_target) SWAPGS_UNSAFE_STACK movl %esp,%r8d CFI_REGISTER rsp,r8 - movq PER_CPU_VAR(kernel_stack),%rsp + movq PER_CPU_VAR(cpu_tss + TSS_sp0),%rsp ENABLE_INTERRUPTS(CLBR_NONE)
/* Zero-extending 32-bit regs, do not remove */ diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index b4bdec3e9523..d656a363e1eb 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -198,9 +198,15 @@ static inline unsigned long current_stack_pointer(void) #else /* !__ASSEMBLY__ */
/* Load thread_info address into "reg" */ +#ifdef CONFIG_X86_32 #define GET_THREAD_INFO(reg) \ - _ASM_MOV PER_CPU_VAR(kernel_stack),reg ; \ + _ASM_MOV PER_CPU_VAR(cpu_current_top_of_stack),reg ; \ _ASM_SUB $(THREAD_SIZE),reg ; +#else +#define GET_THREAD_INFO(reg) \ + _ASM_MOV PER_CPU_VAR(cpu_tss + TSS_sp0),reg ; \ + _ASM_SUB $(THREAD_SIZE),reg ; +#endif
/* * ASM operand which evaluates to a 'thread_info' address of diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 6c9cb6073832..99f67a38f7f0 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -216,7 +216,7 @@ ENTRY(system_call) GLOBAL(system_call_after_swapgs)
movq %rsp,PER_CPU_VAR(rsp_scratch) - movq PER_CPU_VAR(kernel_stack),%rsp + movq PER_CPU_VAR(cpu_tss + TSS_sp0),%rsp
/* Construct struct pt_regs on stack */ pushq_cfi $__USER_DS /* pt_regs->ss */ @@ -1464,7 +1464,7 @@ ENTRY(nmi) SWAPGS_UNSAFE_STACK cld movq %rsp, %rdx - movq PER_CPU_VAR(kernel_stack), %rsp + movq PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp pushq 5*8(%rdx) /* pt_regs->ss */ pushq 4*8(%rdx) /* pt_regs->rsp */ pushq 3*8(%rdx) /* pt_regs->flags */ diff --git a/arch/x86/xen/xen-asm_64.S b/arch/x86/xen/xen-asm_64.S index 985fc3ee0973..acc49e088ec5 100644 --- a/arch/x86/xen/xen-asm_64.S +++ b/arch/x86/xen/xen-asm_64.S @@ -15,6 +15,7 @@ #include <asm/percpu.h> #include <asm/processor-flags.h> #include <asm/segment.h> +#include <asm/asm-offsets.h>
#include <xen/interface/xen.h>
@@ -69,7 +70,7 @@ ENTRY(xen_sysret64) * still with the kernel gs, so we can easily switch back */ movq %rsp, PER_CPU_VAR(rsp_scratch) - movq PER_CPU_VAR(kernel_stack), %rsp + movq PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
pushq $__USER_DS pushq PER_CPU_VAR(rsp_scratch) @@ -88,7 +89,7 @@ ENTRY(xen_sysret32) * still with the kernel gs, so we can easily switch back */ movq %rsp, PER_CPU_VAR(rsp_scratch) - movq PER_CPU_VAR(kernel_stack), %rsp + movq PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
pushq $__USER32_DS pushq PER_CPU_VAR(rsp_scratch)
From: Denys Vlasenko dvlasenk@redhat.com
Signed-off-by: Denys Vlasenko dvlasenk@redhat.com Acked-by: Andy Lutomirski luto@kernel.org Cc: Alexei Starovoitov ast@plumgrid.com Cc: Andrew Morton akpm@linux-foundation.org Cc: Borislav Petkov bp@alien8.de Cc: Frederic Weisbecker fweisbec@gmail.com Cc: H. Peter Anvin hpa@zytor.com Cc: Kees Cook keescook@chromium.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Oleg Nesterov oleg@redhat.com Cc: Peter Zijlstra peterz@infradead.org Cc: Steven Rostedt rostedt@goodmis.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Will Drewry wad@chromium.org Link: http://lkml.kernel.org/r/1429889495-27850-2-git-send-email-dvlasenk@redhat.c... Signed-off-by: Ingo Molnar mingo@kernel.org (cherry picked from commit fed7c3f0f750f225317828d691e9eb76eec887b3) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/include/asm/thread_info.h | 2 -- arch/x86/kernel/cpu/common.c | 4 ---- arch/x86/kernel/process_32.c | 5 +---- arch/x86/kernel/process_64.c | 3 --- arch/x86/kernel/smpboot.c | 2 -- 5 files changed, 1 insertion(+), 15 deletions(-)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index d656a363e1eb..472288962c99 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -177,8 +177,6 @@ struct thread_info { */ #ifndef __ASSEMBLY__
-DECLARE_PER_CPU(unsigned long, kernel_stack); - static inline struct thread_info *current_thread_info(void) { return (struct thread_info *)(current_top_of_stack() - THREAD_SIZE); diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index f5647d238337..4f1db34113e2 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1210,10 +1210,6 @@ static __init int setup_disablecpuid(char *arg) } __setup("clearcpuid=", setup_disablecpuid);
-DEFINE_PER_CPU(unsigned long, kernel_stack) = - (unsigned long)&init_thread_union + THREAD_SIZE; -EXPORT_PER_CPU_SYMBOL(kernel_stack); - #ifdef CONFIG_X86_64 struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table }; struct desc_ptr debug_idt_descr = { NR_VECTORS * 16 - 1, diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index 8ed2106b06da..a99900cedc22 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -302,13 +302,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) arch_end_context_switch(next_p);
/* - * Reload esp0, kernel_stack, and current_top_of_stack. This changes + * Reload esp0 and cpu_current_top_of_stack. This changes * current_thread_info(). */ load_sp0(tss, next); - this_cpu_write(kernel_stack, - (unsigned long)task_stack_page(next_p) + - THREAD_SIZE); this_cpu_write(cpu_current_top_of_stack, (unsigned long)task_stack_page(next_p) + THREAD_SIZE); diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index f7724a1d7de1..b6533ef508c9 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -410,9 +410,6 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) /* Reload esp0 and ss1. This changes current_thread_info(). */ load_sp0(tss, next);
- this_cpu_write(kernel_stack, - (unsigned long)task_stack_page(next_p) + THREAD_SIZE); - /* * Now maybe reload the debug registers and handle I/O bitmaps */ diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 50e547eac8cd..023cccf5a4ae 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -792,8 +792,6 @@ void common_cpu_up(unsigned int cpu, struct task_struct *idle) clear_tsk_thread_flag(idle, TIF_FORK); initial_gs = per_cpu_offset(cpu); #endif - per_cpu(kernel_stack, cpu) = - (unsigned long)task_stack_page(idle) + THREAD_SIZE; }
/*
From: Denys Vlasenko dvlasenk@redhat.com
32-bit code has PER_CPU_VAR(cpu_current_top_of_stack). 64-bit code uses somewhat more obscure: PER_CPU_VAR(cpu_tss + TSS_sp0).
Define the 'cpu_current_top_of_stack' macro on CONFIG_X86_64 as well so that the PER_CPU_VAR(cpu_current_top_of_stack) expression can be used in both 32-bit and 64-bit code.
Signed-off-by: Denys Vlasenko dvlasenk@redhat.com Cc: Alexei Starovoitov ast@plumgrid.com Cc: Andrew Morton akpm@linux-foundation.org Cc: Andy Lutomirski luto@amacapital.net Cc: Borislav Petkov bp@alien8.de Cc: Frederic Weisbecker fweisbec@gmail.com Cc: H. Peter Anvin hpa@zytor.com Cc: Kees Cook keescook@chromium.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Oleg Nesterov oleg@redhat.com Cc: Peter Zijlstra peterz@infradead.org Cc: Steven Rostedt rostedt@goodmis.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Will Drewry wad@chromium.org Link: http://lkml.kernel.org/r/1429889495-27850-3-git-send-email-dvlasenk@redhat.c... Signed-off-by: Ingo Molnar mingo@kernel.org (cherry picked from commit 3a23208e69679597e767cf3547b1a30dd845d9b5) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/ia32/ia32entry.S | 4 ++-- arch/x86/include/asm/thread_info.h | 10 ++++------ arch/x86/kernel/entry_64.S | 4 ++-- arch/x86/xen/xen-asm_64.S | 5 +++-- 4 files changed, 11 insertions(+), 12 deletions(-)
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S index 2ae22c951fa0..b2fafcb37d4e 100644 --- a/arch/x86/ia32/ia32entry.S +++ b/arch/x86/ia32/ia32entry.S @@ -119,7 +119,7 @@ ENTRY(ia32_sysenter_target) * it is too small to ever cause noticeable irq latency. */ SWAPGS_UNSAFE_STACK - movq PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp + movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp ENABLE_INTERRUPTS(CLBR_NONE)
/* Zero-extending 32-bit regs, do not remove */ @@ -356,7 +356,7 @@ ENTRY(ia32_cstar_target) SWAPGS_UNSAFE_STACK movl %esp,%r8d CFI_REGISTER rsp,r8 - movq PER_CPU_VAR(cpu_tss + TSS_sp0),%rsp + movq PER_CPU_VAR(cpu_current_top_of_stack),%rsp ENABLE_INTERRUPTS(CLBR_NONE)
/* Zero-extending 32-bit regs, do not remove */ diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index 472288962c99..225ee545e1a0 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -195,16 +195,14 @@ static inline unsigned long current_stack_pointer(void)
#else /* !__ASSEMBLY__ */
+#ifdef CONFIG_X86_64 +# define cpu_current_top_of_stack (cpu_tss + TSS_sp0) +#endif + /* Load thread_info address into "reg" */ -#ifdef CONFIG_X86_32 #define GET_THREAD_INFO(reg) \ _ASM_MOV PER_CPU_VAR(cpu_current_top_of_stack),reg ; \ _ASM_SUB $(THREAD_SIZE),reg ; -#else -#define GET_THREAD_INFO(reg) \ - _ASM_MOV PER_CPU_VAR(cpu_tss + TSS_sp0),reg ; \ - _ASM_SUB $(THREAD_SIZE),reg ; -#endif
/* * ASM operand which evaluates to a 'thread_info' address of diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 99f67a38f7f0..eaf3d4df76d5 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -216,7 +216,7 @@ ENTRY(system_call) GLOBAL(system_call_after_swapgs)
movq %rsp,PER_CPU_VAR(rsp_scratch) - movq PER_CPU_VAR(cpu_tss + TSS_sp0),%rsp + movq PER_CPU_VAR(cpu_current_top_of_stack),%rsp
/* Construct struct pt_regs on stack */ pushq_cfi $__USER_DS /* pt_regs->ss */ @@ -1464,7 +1464,7 @@ ENTRY(nmi) SWAPGS_UNSAFE_STACK cld movq %rsp, %rdx - movq PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp + movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp pushq 5*8(%rdx) /* pt_regs->ss */ pushq 4*8(%rdx) /* pt_regs->rsp */ pushq 3*8(%rdx) /* pt_regs->flags */ diff --git a/arch/x86/xen/xen-asm_64.S b/arch/x86/xen/xen-asm_64.S index acc49e088ec5..5e15e92099de 100644 --- a/arch/x86/xen/xen-asm_64.S +++ b/arch/x86/xen/xen-asm_64.S @@ -16,6 +16,7 @@ #include <asm/processor-flags.h> #include <asm/segment.h> #include <asm/asm-offsets.h> +#include <asm/thread_info.h>
#include <xen/interface/xen.h>
@@ -70,7 +71,7 @@ ENTRY(xen_sysret64) * still with the kernel gs, so we can easily switch back */ movq %rsp, PER_CPU_VAR(rsp_scratch) - movq PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp + movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
pushq $__USER_DS pushq PER_CPU_VAR(rsp_scratch) @@ -89,7 +90,7 @@ ENTRY(xen_sysret32) * still with the kernel gs, so we can easily switch back */ movq %rsp, PER_CPU_VAR(rsp_scratch) - movq PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp + movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
pushq $__USER32_DS pushq PER_CPU_VAR(rsp_scratch)
From: Richard Fellner richard.fellner@student.tugraz.at
This patch introduces our implementation of KAISER (Kernel Address Isolation to have Side-channels Efficiently Removed), a kernel isolation technique to close hardware side channels on kernel address information.
More information about the patch can be found on:
https://github.com/IAIK/KAISER
From: Richard Fellner richard.fellner@student.tugraz.at From: Daniel Gruss daniel.gruss@iaik.tugraz.at X-Subject: [RFC, PATCH] x86_64: KAISER - do not map kernel in user mode Date: Thu, 4 May 2017 14:26:50 +0200 Link: http://marc.info/?l=linux-kernel&m=149390087310405&w=2 Kaiser-4.10-SHA1: c4b1831d44c6144d3762ccc72f0c4e71a0c713e5
To: linux-kernel@vger.kernel.org To: kernel-hardening@lists.openwall.com Cc: clementine.maurice@iaik.tugraz.at Cc: moritz.lipp@iaik.tugraz.at Cc: Michael Schwarz michael.schwarz@iaik.tugraz.at Cc: Richard Fellner richard.fellner@student.tugraz.at Cc: Ingo Molnar mingo@kernel.org Cc: kirill.shutemov@linux.intel.com Cc: anders.fogh@gdata-adan.de
After several recent works [1,2,3] KASLR on x86_64 was basically considered dead by many researchers. We have been working on an efficient but effective fix for this problem and found that not mapping the kernel space when running in user mode is the solution to this problem [4] (the corresponding paper [5] will be presented at ESSoS17).
With this RFC patch we allow anybody to configure their kernel with the flag CONFIG_KAISER to add our defense mechanism.
If there are any questions we would love to answer them. We also appreciate any comments!
Cheers, Daniel (+ the KAISER team from Graz University of Technology)
[1] http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf [2] https://www.blackhat.com/docs/us-16/materials/us-16-Fogh-Using-Undocumented-... [3] https://www.blackhat.com/docs/us-16/materials/us-16-Jang-Breaking-Kernel-Add... [4] https://github.com/IAIK/KAISER [5] https://gruss.cc/files/kaiser.pdf
[patch based also on https://raw.githubusercontent.com/IAIK/KAISER/master/KAISER/0001-KAISER-Kern...]
Signed-off-by: Richard Fellner richard.fellner@student.tugraz.at Signed-off-by: Moritz Lipp moritz.lipp@iaik.tugraz.at Signed-off-by: Daniel Gruss daniel.gruss@iaik.tugraz.at Signed-off-by: Michael Schwarz michael.schwarz@iaik.tugraz.at Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Hugh Dickins hughd@google.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 8a43ddfb93a0c6ae1a6e1f5c25705ec5d1843c40) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/entry/entry_64.S (not in this tree) arch/x86/kernel/entry_64.S (patched instead of that) arch/x86/entry/entry_64_compat.S (not in this tree) arch/x86/ia32/ia32entry.S (patched instead of that) arch/x86/include/asm/hw_irq.h arch/x86/kernel/irqinit.c kernel/fork.c --- arch/x86/ia32/ia32entry.S | 6 ++ arch/x86/include/asm/hw_irq.h | 2 +- arch/x86/include/asm/kaiser.h | 113 +++++++++++++++++++++++++ arch/x86/include/asm/pgtable.h | 4 + arch/x86/include/asm/pgtable_64.h | 21 +++++ arch/x86/include/asm/pgtable_types.h | 12 ++- arch/x86/include/asm/processor.h | 7 +- arch/x86/kernel/cpu/common.c | 4 +- arch/x86/kernel/entry_64.S | 18 +++- arch/x86/kernel/espfix_64.c | 6 ++ arch/x86/kernel/head_64.S | 16 +++- arch/x86/kernel/irqinit.c | 2 +- arch/x86/kernel/process.c | 2 +- arch/x86/mm/Makefile | 1 + arch/x86/mm/kaiser.c | 160 +++++++++++++++++++++++++++++++++++ arch/x86/mm/pageattr.c | 2 +- arch/x86/mm/pgtable.c | 26 ++++++ include/asm-generic/vmlinux.lds.h | 11 ++- include/linux/percpu-defs.h | 30 +++++++ init/main.c | 6 ++ kernel/fork.c | 8 ++ security/Kconfig | 7 ++ 22 files changed, 448 insertions(+), 16 deletions(-) create mode 100644 arch/x86/include/asm/kaiser.h create mode 100644 arch/x86/mm/kaiser.c
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S index b2fafcb37d4e..665e2c7887fe 100644 --- a/arch/x86/ia32/ia32entry.S +++ b/arch/x86/ia32/ia32entry.S @@ -15,6 +15,7 @@ #include <asm/irqflags.h> #include <asm/asm.h> #include <asm/smap.h> +#include <asm/kaiser.h> #include <linux/linkage.h> #include <linux/err.h>
@@ -119,6 +120,7 @@ ENTRY(ia32_sysenter_target) * it is too small to ever cause noticeable irq latency. */ SWAPGS_UNSAFE_STACK + SWITCH_KERNEL_CR3_NO_STACK movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp ENABLE_INTERRUPTS(CLBR_NONE)
@@ -207,6 +209,7 @@ sysexit_from_sys_call: movl EFLAGS(%rsp),%r11d /* User eflags */ /*CFI_RESTORE rflags*/ TRACE_IRQS_ON + SWITCH_USER_CR3
/* * SYSRETL works even on Intel CPUs. Use it in preference to SYSEXIT, @@ -354,6 +357,7 @@ ENTRY(ia32_cstar_target) * it is too small to ever cause noticeable irq latency. */ SWAPGS_UNSAFE_STACK + SWITCH_KERNEL_CR3_NO_STACK movl %esp,%r8d CFI_REGISTER rsp,r8 movq PER_CPU_VAR(cpu_current_top_of_stack),%rsp @@ -419,6 +423,7 @@ sysretl_from_sys_call: xorq %r9,%r9 xorq %r8,%r8 TRACE_IRQS_ON + SWITCH_USER_CR3 movl RSP(%rsp),%esp CFI_RESTORE rsp /* @@ -513,6 +518,7 @@ ENTRY(ia32_syscall) PARAVIRT_ADJUST_EXCEPTION_FRAME ASM_CLAC /* Do this early to minimize exposure */ SWAPGS + SWITCH_KERNEL_CR3_NO_STACK ENABLE_INTERRUPTS(CLBR_NONE)
/* Zero-extending 32-bit regs, do not remove */ diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h index e9571ddabc4f..08017d37db0a 100644 --- a/arch/x86/include/asm/hw_irq.h +++ b/arch/x86/include/asm/hw_irq.h @@ -190,7 +190,7 @@ extern char irq_entries_start[]; #define VECTOR_RETRIGGERED (-2)
typedef int vector_irq_t[NR_VECTORS]; -DECLARE_PER_CPU(vector_irq_t, vector_irq); +DECLARE_PER_CPU_USER_MAPPED(vector_irq_t, vector_irq);
#endif /* !ASSEMBLY_ */
diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h new file mode 100644 index 000000000000..63ee8309b35b --- /dev/null +++ b/arch/x86/include/asm/kaiser.h @@ -0,0 +1,113 @@ +#ifndef _ASM_X86_KAISER_H +#define _ASM_X86_KAISER_H + +/* This file includes the definitions for the KAISER feature. + * KAISER is a counter measure against x86_64 side channel attacks on the kernel virtual memory. + * It has a shodow-pgd for every process. the shadow-pgd has a minimalistic kernel-set mapped, + * but includes the whole user memory. Within a kernel context switch, or when an interrupt is handled, + * the pgd is switched to the normal one. When the system switches to user mode, the shadow pgd is enabled. + * By this, the virtual memory chaches are freed, and the user may not attack the whole kernel memory. + * + * A minimalistic kernel mapping holds the parts needed to be mapped in user mode, as the entry/exit functions + * of the user space, or the stacks. + */ +#ifdef __ASSEMBLY__ +#ifdef CONFIG_KAISER + +.macro _SWITCH_TO_KERNEL_CR3 reg +movq %cr3, \reg +andq $(~0x1000), \reg +movq \reg, %cr3 +.endm + +.macro _SWITCH_TO_USER_CR3 reg +movq %cr3, \reg +orq $(0x1000), \reg +movq \reg, %cr3 +.endm + +.macro SWITCH_KERNEL_CR3 +pushq %rax +_SWITCH_TO_KERNEL_CR3 %rax +popq %rax +.endm + +.macro SWITCH_USER_CR3 +pushq %rax +_SWITCH_TO_USER_CR3 %rax +popq %rax +.endm + +.macro SWITCH_KERNEL_CR3_NO_STACK +movq %rax, PER_CPU_VAR(unsafe_stack_register_backup) +_SWITCH_TO_KERNEL_CR3 %rax +movq PER_CPU_VAR(unsafe_stack_register_backup), %rax +.endm + + +.macro SWITCH_USER_CR3_NO_STACK + +movq %rax, PER_CPU_VAR(unsafe_stack_register_backup) +_SWITCH_TO_USER_CR3 %rax +movq PER_CPU_VAR(unsafe_stack_register_backup), %rax + +.endm + +#else /* CONFIG_KAISER */ + +.macro SWITCH_KERNEL_CR3 reg +.endm +.macro SWITCH_USER_CR3 reg +.endm +.macro SWITCH_USER_CR3_NO_STACK +.endm +.macro SWITCH_KERNEL_CR3_NO_STACK +.endm + +#endif /* CONFIG_KAISER */ +#else /* __ASSEMBLY__ */ + + +#ifdef CONFIG_KAISER +// Upon kernel/user mode switch, it may happen that +// the address space has to be switched before the registers have been stored. +// To change the address space, another register is needed. +// A register therefore has to be stored/restored. +// +DECLARE_PER_CPU_USER_MAPPED(unsigned long, unsafe_stack_register_backup); + +#endif /* CONFIG_KAISER */ + +/** + * shadowmem_add_mapping - map a virtual memory part to the shadow mapping + * @addr: the start address of the range + * @size: the size of the range + * @flags: The mapping flags of the pages + * + * the mapping is done on a global scope, so no bigger synchronization has to be done. + * the pages have to be manually unmapped again when they are not needed any longer. + */ +extern void kaiser_add_mapping(unsigned long addr, unsigned long size, unsigned long flags); + + +/** + * shadowmem_remove_mapping - unmap a virtual memory part of the shadow mapping + * @addr: the start address of the range + * @size: the size of the range + */ +extern void kaiser_remove_mapping(unsigned long start, unsigned long size); + +/** + * shadowmem_initialize_mapping - Initalize the shadow mapping + * + * most parts of the shadow mapping can be mapped upon boot time. + * only the thread stacks have to be mapped on runtime. + * the mapped regions are not unmapped at all. + */ +extern void kaiser_init(void); + +#endif + + + +#endif /* _ASM_X86_KAISER_H */ diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index fe57e7a98839..53d44406af45 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -829,6 +829,10 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count) { memcpy(dst, src, count * sizeof(pgd_t)); +#ifdef CONFIG_KAISER + // clone the shadow pgd part as well + memcpy(native_get_shadow_pgd(dst), native_get_shadow_pgd(src), count * sizeof(pgd_t)); +#endif }
#define PTE_SHIFT ilog2(PTRS_PER_PTE) diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index 2ee781114d34..2131edda6620 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -106,9 +106,30 @@ static inline void native_pud_clear(pud_t *pud) native_set_pud(pud, native_make_pud(0)); }
+#ifdef CONFIG_KAISER +static inline pgd_t * native_get_shadow_pgd(pgd_t *pgdp) { + return (pgd_t *)(void*)((unsigned long)(void*)pgdp | (unsigned long)PAGE_SIZE); +} + +static inline pgd_t * native_get_normal_pgd(pgd_t *pgdp) { + return (pgd_t *)(void*)((unsigned long)(void*)pgdp & ~(unsigned long)PAGE_SIZE); +} +#endif /* CONFIG_KAISER */ + static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd) { +#ifdef CONFIG_KAISER + // We know that a pgd is page aligned. + // Therefore the lower indices have to be mapped to user space. + // These pages are mapped to the shadow mapping. + if ((((unsigned long)pgdp) % PAGE_SIZE) < (PAGE_SIZE / 2)) { + native_get_shadow_pgd(pgdp)->pgd = pgd.pgd; + } + + pgdp->pgd = pgd.pgd & ~_PAGE_USER; +#else /* CONFIG_KAISER */ *pgdp = pgd; +#endif }
static inline void native_pgd_clear(pgd_t *pgd) diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 74fcdf3f1534..c0a0560b19af 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -39,7 +39,11 @@ #define _PAGE_ACCESSED (_AT(pteval_t, 1) << _PAGE_BIT_ACCESSED) #define _PAGE_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_DIRTY) #define _PAGE_PSE (_AT(pteval_t, 1) << _PAGE_BIT_PSE) -#define _PAGE_GLOBAL (_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL) +#ifdef CONFIG_KAISER +#define _PAGE_GLOBAL (_AT(pteval_t, 0)) +#else +#define _PAGE_GLOBAL (_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL) +#endif #define _PAGE_SOFTW1 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW1) #define _PAGE_SOFTW2 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW2) #define _PAGE_PAT (_AT(pteval_t, 1) << _PAGE_BIT_PAT) @@ -89,7 +93,11 @@ #define _PAGE_NX (_AT(pteval_t, 0)) #endif
-#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE) +#ifdef CONFIG_KAISER +#define _PAGE_PROTNONE (_AT(pteval_t, 0)) +#else +#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE) +#endif
#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \ _PAGE_ACCESSED | _PAGE_DIRTY) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 23ba6765b718..4abbd7d7cfb0 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -300,7 +300,7 @@ struct tss_struct {
} ____cacheline_aligned;
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss); +DECLARE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct tss_struct, cpu_tss);
#ifdef CONFIG_X86_32 DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack); @@ -449,6 +449,11 @@ union irq_stack_union { char gs_base[40]; unsigned long stack_canary; }; + + struct { + char irq_stack_pointer[64]; + char unused[IRQ_STACK_SIZE - 64]; + }; };
DECLARE_PER_CPU_FIRST(union irq_stack_union, irq_stack_union) __visible; diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 4f1db34113e2..47194fac7eb7 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -91,7 +91,7 @@ static const struct cpu_dev default_cpu = {
static const struct cpu_dev *this_cpu = &default_cpu;
-DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = { +DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct gdt_page, gdt_page) = { .gdt = { #ifdef CONFIG_X86_64 /* * We need valid kernel segments for data and code in long mode too @@ -1247,7 +1247,7 @@ static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = { [DEBUG_STACK - 1] = DEBUG_STKSZ };
-static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks +DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(char, exception_stacks [(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
/* May not be marked __init: used by software suspend */ diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index eaf3d4df76d5..8435b806d6a5 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -45,6 +45,7 @@ #include <asm/context_tracking.h> #include <asm/smap.h> #include <asm/pgtable_types.h> +#include <asm/kaiser.h> #include <linux/err.h>
/* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this. */ @@ -208,6 +209,7 @@ ENTRY(system_call) * it is too small to ever cause noticeable irq latency. */ SWAPGS_UNSAFE_STACK + SWITCH_KERNEL_CR3_NO_STACK /* * A hypervisor implementation might want to use a label * after the swapgs, so that it can do the swapgs @@ -284,11 +286,12 @@ system_call_fastpath:
CFI_REMEMBER_STATE
- RESTORE_C_REGS_EXCEPT_RCX_R11 movq RIP(%rsp),%rcx CFI_REGISTER rip,rcx movq EFLAGS(%rsp),%r11 /*CFI_REGISTER rflags,r11*/ + RESTORE_C_REGS_EXCEPT_RCX_R11 + SWITCH_USER_CR3 movq RSP(%rsp),%rsp /* * 64bit SYSRET restores rip from rcx, @@ -477,11 +480,13 @@ syscall_return_via_sysret: CFI_REMEMBER_STATE /* r11 is already restored (see code above) */ RESTORE_C_REGS_EXCEPT_R11 + SWITCH_USER_CR3 movq RSP(%rsp),%rsp USERGS_SYSRET64 CFI_RESTORE_STATE
opportunistic_sysret_failed: + SWITCH_USER_CR3 SWAPGS jmp restore_c_regs_and_iret CFI_ENDPROC @@ -689,6 +694,7 @@ END(irq_entries_start) testl $3, CS-RBP(%rsp) je 1f SWAPGS + SWITCH_KERNEL_CR3 1: /* * Save previous stack pointer, optionally switch to interrupt stack. @@ -765,6 +771,7 @@ retint_swapgs: /* return to user-space */ DISABLE_INTERRUPTS(CLBR_ANY) TRACE_IRQS_IRETQ
+ SWITCH_USER_CR3 SWAPGS jmp restore_c_regs_and_iret
@@ -820,6 +827,7 @@ native_irq_return_ldt: pushq_cfi %rax pushq_cfi %rdi SWAPGS + SWITCH_KERNEL_CR3 movq PER_CPU_VAR(espfix_waddr),%rdi movq %rax,(0*8)(%rdi) /* RAX */ movq (2*8)(%rsp),%rax /* RIP */ @@ -835,6 +843,7 @@ native_irq_return_ldt: andl $0xffff0000,%eax popq_cfi %rdi orq PER_CPU_VAR(espfix_stack),%rax + SWITCH_USER_CR3 SWAPGS movq %rax,%rsp popq_cfi %rax @@ -1283,6 +1292,7 @@ ENTRY(paranoid_entry) testl %edx,%edx js 1f /* negative -> in kernel */ SWAPGS + SWITCH_KERNEL_CR3 xorl %ebx,%ebx 1: ret CFI_ENDPROC @@ -1306,6 +1316,7 @@ ENTRY(paranoid_exit) testl %ebx,%ebx /* swapgs needed? */ jnz paranoid_exit_no_swapgs TRACE_IRQS_IRETQ + SWITCH_USER_CR3_NO_STACK SWAPGS_UNSAFE_STACK jmp paranoid_exit_restore paranoid_exit_no_swapgs: @@ -1332,6 +1343,7 @@ ENTRY(error_entry) je error_kernelspace error_swapgs: SWAPGS + SWITCH_KERNEL_CR3 error_sti: TRACE_IRQS_OFF ret @@ -1460,8 +1472,8 @@ ENTRY(nmi) * We also must not push anything to the stack before switching * stacks lest we corrupt the "NMI executing" variable. */ - SWAPGS_UNSAFE_STACK + SWITCH_KERNEL_CR3_NO_STACK cld movq %rsp, %rdx movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp @@ -1501,6 +1513,7 @@ ENTRY(nmi) * work, because we don't want to enable interrupts. Fortunately, * do_nmi doesn't modify pt_regs. */ + SWITCH_USER_CR3 SWAPGS jmp restore_c_regs_and_iret
@@ -1710,6 +1723,7 @@ end_repeat_nmi: testl %ebx,%ebx /* swapgs needed? */ jnz nmi_restore nmi_swapgs: + SWITCH_USER_CR3_NO_STACK SWAPGS_UNSAFE_STACK nmi_restore: RESTORE_EXTRA_REGS diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c index f5d0730e7b08..9b0aef0dce94 100644 --- a/arch/x86/kernel/espfix_64.c +++ b/arch/x86/kernel/espfix_64.c @@ -41,6 +41,7 @@ #include <asm/pgalloc.h> #include <asm/setup.h> #include <asm/espfix.h> +#include <asm/kaiser.h>
/* * Note: we only need 6*8 = 48 bytes for the espfix stack, but round @@ -126,6 +127,11 @@ void __init init_espfix_bsp(void) /* Install the espfix pud into the kernel page directory */ pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)]; pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page); +#ifdef CONFIG_KAISER + // add the esp stack pud to the shadow mapping here. + // This can be done directly, because the fixup stack has its own pud + set_pgd(native_get_shadow_pgd(pgd_p), __pgd(_PAGE_TABLE | __pa((pud_t *)espfix_pud_page))); +#endif
/* Randomize the locations */ init_espfix_random(); diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S index 174fa035a09a..0ee14f5bf59c 100644 --- a/arch/x86/kernel/head_64.S +++ b/arch/x86/kernel/head_64.S @@ -441,6 +441,14 @@ early_idt_ripmsg: .balign PAGE_SIZE; \ GLOBAL(name)
+#ifdef CONFIG_KAISER +#define NEXT_PGD_PAGE(name) \ + .balign 2 * PAGE_SIZE; \ +GLOBAL(name) +#else +#define NEXT_PGD_PAGE(name) NEXT_PAGE(name) +#endif + /* Automate the creation of 1 to 1 mapping pmd entries */ #define PMDS(START, PERM, COUNT) \ i = 0 ; \ @@ -450,7 +458,7 @@ GLOBAL(name) .endr
__INITDATA -NEXT_PAGE(early_level4_pgt) +NEXT_PGD_PAGE(early_level4_pgt) .fill 511,8,0 .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
@@ -460,10 +468,10 @@ NEXT_PAGE(early_dynamic_pgts) .data
#ifndef CONFIG_XEN -NEXT_PAGE(init_level4_pgt) - .fill 512,8,0 +NEXT_PGD_PAGE(init_level4_pgt) + .fill 2*512,8,0 #else -NEXT_PAGE(init_level4_pgt) +NEXT_PGD_PAGE(init_level4_pgt) .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE .org init_level4_pgt + L4_PAGE_OFFSET*8, 0 .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c index cd10a6437264..77b07c72c3e3 100644 --- a/arch/x86/kernel/irqinit.c +++ b/arch/x86/kernel/irqinit.c @@ -51,7 +51,7 @@ static struct irqaction irq2 = { .flags = IRQF_NO_THREAD, };
-DEFINE_PER_CPU(vector_irq_t, vector_irq) = { +DEFINE_PER_CPU_USER_MAPPED(vector_irq_t, vector_irq) = { [0 ... NR_VECTORS - 1] = VECTOR_UNDEFINED, };
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 971743774248..21e00403877a 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -38,7 +38,7 @@ * section. Since TSS's are completely CPU-local, we want them * on exact cacheline boundaries, to eliminate cacheline ping-pong. */ -__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = { +__visible DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct tss_struct, cpu_tss) = { .x86_tss = { .sp0 = TOP_OF_INIT_STACK, #ifdef CONFIG_X86_32 diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index d893640d5c68..4c15b80bf7df 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -32,3 +32,4 @@ obj-$(CONFIG_ACPI_NUMA) += srat.o obj-$(CONFIG_NUMA_EMU) += numa_emulation.o
obj-$(CONFIG_X86_INTEL_MPX) += mpx.o +obj-$(CONFIG_KAISER) += kaiser.o diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c new file mode 100644 index 000000000000..cf1bb922d467 --- /dev/null +++ b/arch/x86/mm/kaiser.c @@ -0,0 +1,160 @@ + + +#include <linux/kernel.h> +#include <linux/errno.h> +#include <linux/string.h> +#include <linux/types.h> +#include <linux/bug.h> +#include <linux/init.h> +#include <linux/spinlock.h> +#include <linux/mm.h> + +#include <linux/uaccess.h> +#include <asm/pgtable.h> +#include <asm/pgalloc.h> +#include <asm/desc.h> +#ifdef CONFIG_KAISER + +__visible DEFINE_PER_CPU_USER_MAPPED(unsigned long, unsafe_stack_register_backup); + +/** + * Get the real ppn from a address in kernel mapping. + * @param address The virtual adrress + * @return the physical address + */ +static inline unsigned long get_pa_from_mapping (unsigned long address) +{ + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *pte; + + pgd = pgd_offset_k(address); + BUG_ON(pgd_none(*pgd) || pgd_large(*pgd)); + + pud = pud_offset(pgd, address); + BUG_ON(pud_none(*pud)); + + if (pud_large(*pud)) { + return (pud_pfn(*pud) << PAGE_SHIFT) | (address & ~PUD_PAGE_MASK); + } + + pmd = pmd_offset(pud, address); + BUG_ON(pmd_none(*pmd)); + + if (pmd_large(*pmd)) { + return (pmd_pfn(*pmd) << PAGE_SHIFT) | (address & ~PMD_PAGE_MASK); + } + + pte = pte_offset_kernel(pmd, address); + BUG_ON(pte_none(*pte)); + + return (pte_pfn(*pte) << PAGE_SHIFT) | (address & ~PAGE_MASK); +} + +void _kaiser_copy (unsigned long start_addr, unsigned long size, + unsigned long flags) +{ + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *pte; + unsigned long address; + unsigned long end_addr = start_addr + size; + unsigned long target_address; + + for (address = PAGE_ALIGN(start_addr - (PAGE_SIZE - 1)); + address < PAGE_ALIGN(end_addr); address += PAGE_SIZE) { + target_address = get_pa_from_mapping(address); + + pgd = native_get_shadow_pgd(pgd_offset_k(address)); + + BUG_ON(pgd_none(*pgd) && "All shadow pgds should be mapped at this time\n"); + BUG_ON(pgd_large(*pgd)); + + pud = pud_offset(pgd, address); + if (pud_none(*pud)) { + set_pud(pud, __pud(_PAGE_TABLE | __pa(pmd_alloc_one(0, address)))); + } + BUG_ON(pud_large(*pud)); + + pmd = pmd_offset(pud, address); + if (pmd_none(*pmd)) { + set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte_alloc_one_kernel(0, address)))); + } + BUG_ON(pmd_large(*pmd)); + + pte = pte_offset_kernel(pmd, address); + if (pte_none(*pte)) { + set_pte(pte, __pte(flags | target_address)); + } else { + BUG_ON(__pa(pte_page(*pte)) != target_address); + } + } +} + +// at first, add a pmd for every pgd entry in the shadowmem-kernel-part of the kernel mapping +static inline void __init _kaiser_init(void) +{ + pgd_t *pgd; + int i = 0; + + pgd = native_get_shadow_pgd(pgd_offset_k((unsigned long )0)); + for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++) { + set_pgd(pgd + i, __pgd(_PAGE_TABLE |__pa(pud_alloc_one(0, 0)))); + } +} + +extern char __per_cpu_user_mapped_start[], __per_cpu_user_mapped_end[]; +spinlock_t shadow_table_lock; +void __init kaiser_init(void) +{ + int cpu; + spin_lock_init(&shadow_table_lock); + + spin_lock(&shadow_table_lock); + + _kaiser_init(); + + for_each_possible_cpu(cpu) { + // map the per cpu user variables + _kaiser_copy( + (unsigned long) (__per_cpu_user_mapped_start + per_cpu_offset(cpu)), + (unsigned long) __per_cpu_user_mapped_end - (unsigned long) __per_cpu_user_mapped_start, + __PAGE_KERNEL); + } + + // map the entry/exit text section, which is responsible to switch between user- and kernel mode + _kaiser_copy( + (unsigned long) __entry_text_start, + (unsigned long) __entry_text_end - (unsigned long) __entry_text_start, + __PAGE_KERNEL_RX); + + // the fixed map address of the idt_table + _kaiser_copy( + (unsigned long) idt_descr.address, + sizeof(gate_desc) * NR_VECTORS, + __PAGE_KERNEL_RO); + + spin_unlock(&shadow_table_lock); +} + +// add a mapping to the shadow-mapping, and synchronize the mappings +void kaiser_add_mapping(unsigned long addr, unsigned long size, unsigned long flags) +{ + spin_lock(&shadow_table_lock); + _kaiser_copy(addr, size, flags); + spin_unlock(&shadow_table_lock); +} + +extern void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end); +void kaiser_remove_mapping(unsigned long start, unsigned long size) +{ + pgd_t *pgd = native_get_shadow_pgd(pgd_offset_k(start)); + spin_lock(&shadow_table_lock); + do { + unmap_pud_range(pgd, start, start + size); + } while (pgd++ != native_get_shadow_pgd(pgd_offset_k(start + size))); + spin_unlock(&shadow_table_lock); +} +#endif /* CONFIG_KAISER */ diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index 1e5a786e75ce..5553b7fbc6cb 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -802,7 +802,7 @@ static void unmap_pmd_range(pud_t *pud, unsigned long start, unsigned long end) pud_clear(pud); }
-static void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end) +void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end) { pud_t *pud = pud_offset(pgd, start);
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 0b97d2c75df3..c847acf67933 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -342,12 +342,38 @@ static inline void _pgd_free(pgd_t *pgd) #else static inline pgd_t *_pgd_alloc(void) { +#ifdef CONFIG_KAISER + // Instead of one PML4, we aquire two PML4s and, thus, an 8kb-aligned memory + // block. Therefore, we have to allocate at least 3 pages. However, the + // __get_free_pages returns us 4 pages. Hence, we store the base pointer at + // the beginning of the page of our 8kb-aligned memory block in order to + // correctly free it afterwars. + + unsigned long pages = __get_free_pages(PGALLOC_GFP, get_order(4*PAGE_SIZE)); + + if(native_get_normal_pgd((pgd_t*) pages) == (pgd_t*) pages) + { + *((unsigned long*)(pages + 2 * PAGE_SIZE)) = pages; + return (pgd_t *) pages; + } + else + { + *((unsigned long*)(pages + 3 * PAGE_SIZE)) = pages; + return (pgd_t *) (pages + PAGE_SIZE); + } +#else return (pgd_t *)__get_free_page(PGALLOC_GFP); +#endif }
static inline void _pgd_free(pgd_t *pgd) { +#ifdef CONFIG_KAISER + unsigned long pages = *((unsigned long*) ((char*) pgd + 2 * PAGE_SIZE)); + free_pages(pages, get_order(4*PAGE_SIZE)); +#else free_page((unsigned long)pgd); +#endif } #endif /* CONFIG_X86_PAE */
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index 13671bc9a288..88c02f965f2e 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -715,7 +715,16 @@ */ #define PERCPU_INPUT(cacheline) \ VMLINUX_SYMBOL(__per_cpu_start) = .; \ - *(.data..percpu..first) \ + \ + VMLINUX_SYMBOL(__per_cpu_user_mapped_start) = .; \ + *(.data..percpu..first) \ + . = ALIGN(cacheline); \ + *(.data..percpu..user_mapped) \ + *(.data..percpu..user_mapped..shared_aligned) \ + . = ALIGN(PAGE_SIZE); \ + *(.data..percpu..user_mapped..page_aligned) \ + VMLINUX_SYMBOL(__per_cpu_user_mapped_end) = .; \ + \ . = ALIGN(PAGE_SIZE); \ *(.data..percpu..page_aligned) \ . = ALIGN(cacheline); \ diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h index 57f3a1c550dc..141d0de913a9 100644 --- a/include/linux/percpu-defs.h +++ b/include/linux/percpu-defs.h @@ -35,6 +35,12 @@
#endif
+#ifdef CONFIG_KAISER +#define USER_MAPPED_SECTION "..user_mapped" +#else +#define USER_MAPPED_SECTION "" +#endif + /* * Base implementations of per-CPU variable declarations and definitions, where * the section in which the variable is to be placed is provided by the @@ -115,6 +121,12 @@ #define DEFINE_PER_CPU(type, name) \ DEFINE_PER_CPU_SECTION(type, name, "")
+#define DECLARE_PER_CPU_USER_MAPPED(type, name) \ + DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION) + +#define DEFINE_PER_CPU_USER_MAPPED(type, name) \ + DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION) + /* * Declaration/definition used for per-CPU variables that must come first in * the set of variables. @@ -144,6 +156,14 @@ DEFINE_PER_CPU_SECTION(type, name, PER_CPU_SHARED_ALIGNED_SECTION) \ ____cacheline_aligned_in_smp
+#define DECLARE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name) \ + DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \ + ____cacheline_aligned_in_smp + +#define DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name) \ + DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \ + ____cacheline_aligned_in_smp + #define DECLARE_PER_CPU_ALIGNED(type, name) \ DECLARE_PER_CPU_SECTION(type, name, PER_CPU_ALIGNED_SECTION) \ ____cacheline_aligned @@ -162,6 +182,16 @@ #define DEFINE_PER_CPU_PAGE_ALIGNED(type, name) \ DEFINE_PER_CPU_SECTION(type, name, "..page_aligned") \ __aligned(PAGE_SIZE) +/* + * Declaration/definition used for per-CPU variables that must be page aligned and need to be mapped in user mode. + */ +#define DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name) \ + DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \ + __aligned(PAGE_SIZE) + +#define DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name) \ + DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \ + __aligned(PAGE_SIZE)
/* * Declaration/definition used for per-CPU variables that must be read mostly. diff --git a/init/main.c b/init/main.c index 2a89545e0a5d..de1951eaf1cc 100644 --- a/init/main.c +++ b/init/main.c @@ -87,6 +87,9 @@ #include <asm/setup.h> #include <asm/sections.h> #include <asm/cacheflush.h> +#ifdef CONFIG_KAISER +#include <asm/kaiser.h> +#endif
static int kernel_init(void *);
@@ -487,6 +490,9 @@ static void __init mm_init(void) pgtable_init(); vmalloc_init(); ioremap_huge_init(); +#ifdef CONFIG_KAISER + kaiser_init(); +#endif }
asmlinkage __visible void __init start_kernel(void) diff --git a/kernel/fork.c b/kernel/fork.c index edc1916e89ee..efe6f94d8dc3 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -167,8 +167,12 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk, return page ? page_address(page) : NULL; }
+extern void kaiser_remove_mapping(unsigned long start_addr, unsigned long size); static inline void free_thread_info(struct thread_info *ti) { +#ifdef CONFIG_KAISER + kaiser_remove_mapping((unsigned long)ti, THREAD_SIZE); +#endif free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER); } # else @@ -325,6 +329,7 @@ void set_task_stack_end_magic(struct task_struct *tsk) *stackend = STACK_END_MAGIC; /* for overflow detection */ }
+extern void kaiser_add_mapping(unsigned long addr, unsigned long size, unsigned long flags); static struct task_struct *dup_task_struct(struct task_struct *orig) { struct task_struct *tsk; @@ -345,6 +350,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig) goto free_ti;
tsk->stack = ti; +#ifdef CONFIG_KAISER + kaiser_add_mapping((unsigned long)tsk->stack, THREAD_SIZE, __PAGE_KERNEL); +#endif #ifdef CONFIG_SECCOMP /* * We must handle setting up seccomp filters once we're under diff --git a/security/Kconfig b/security/Kconfig index bf4ec46474b6..9da2a6029703 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -30,6 +30,13 @@ config SECURITY model will be used.
If you are unsure how to answer this question, answer N. +config KAISER + bool "Remove the kernel mapping in user mode" + depends on X86_64 + depends on !PARAVIRT + help + This enforces a strict kernel and user space isolation in order to close + hardware side channels on kernel address information.
config SECURITYFS bool "Enable the securityfs filesystem"
From: Dave Hansen dave.hansen@linux.intel.com
Merged fixes and cleanups, rebased to 4.4.89 tree (no 5-level paging).
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit bed9bb7f3e6d4045013d2bb9e4004896de57f02b) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/entry/entry_64.S (not in this tree) arch/x86/kernel/entry_64.S (patched instead of that) arch/x86/kernel/ldt.c kernel/fork.c --- arch/x86/include/asm/kaiser.h | 43 +++-- arch/x86/include/asm/pgtable.h | 18 +- arch/x86/include/asm/pgtable_64.h | 48 +++++- arch/x86/include/asm/pgtable_types.h | 6 +- arch/x86/kernel/entry_64.S | 107 ++++++++++-- arch/x86/kernel/espfix_64.c | 13 +- arch/x86/kernel/head_64.S | 19 ++- arch/x86/kernel/ldt.c | 27 ++- arch/x86/kernel/tracepoint.c | 2 + arch/x86/mm/kaiser.c | 314 +++++++++++++++++++++++++---------- arch/x86/mm/pageattr.c | 63 +++++-- arch/x86/mm/pgtable.c | 40 ++--- include/linux/kaiser.h | 26 +++ kernel/fork.c | 9 +- security/Kconfig | 5 + 15 files changed, 551 insertions(+), 189 deletions(-) create mode 100644 include/linux/kaiser.h
diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h index 63ee8309b35b..0703f48777f3 100644 --- a/arch/x86/include/asm/kaiser.h +++ b/arch/x86/include/asm/kaiser.h @@ -16,13 +16,17 @@
.macro _SWITCH_TO_KERNEL_CR3 reg movq %cr3, \reg +#ifdef CONFIG_KAISER_REAL_SWITCH andq $(~0x1000), \reg +#endif movq \reg, %cr3 .endm
.macro _SWITCH_TO_USER_CR3 reg movq %cr3, \reg +#ifdef CONFIG_KAISER_REAL_SWITCH orq $(0x1000), \reg +#endif movq \reg, %cr3 .endm
@@ -65,48 +69,53 @@ movq PER_CPU_VAR(unsafe_stack_register_backup), %rax .endm
#endif /* CONFIG_KAISER */ + #else /* __ASSEMBLY__ */
#ifdef CONFIG_KAISER -// Upon kernel/user mode switch, it may happen that -// the address space has to be switched before the registers have been stored. -// To change the address space, another register is needed. -// A register therefore has to be stored/restored. -// -DECLARE_PER_CPU_USER_MAPPED(unsigned long, unsafe_stack_register_backup); +/* + * Upon kernel/user mode switch, it may happen that the address + * space has to be switched before the registers have been + * stored. To change the address space, another register is + * needed. A register therefore has to be stored/restored. +*/
-#endif /* CONFIG_KAISER */ +DECLARE_PER_CPU_USER_MAPPED(unsigned long, unsafe_stack_register_backup);
/** - * shadowmem_add_mapping - map a virtual memory part to the shadow mapping + * kaiser_add_mapping - map a virtual memory part to the shadow (user) mapping * @addr: the start address of the range * @size: the size of the range * @flags: The mapping flags of the pages * - * the mapping is done on a global scope, so no bigger synchronization has to be done. - * the pages have to be manually unmapped again when they are not needed any longer. + * The mapping is done on a global scope, so no bigger + * synchronization has to be done. the pages have to be + * manually unmapped again when they are not needed any longer. */ -extern void kaiser_add_mapping(unsigned long addr, unsigned long size, unsigned long flags); +extern int kaiser_add_mapping(unsigned long addr, unsigned long size, unsigned long flags);
/** - * shadowmem_remove_mapping - unmap a virtual memory part of the shadow mapping + * kaiser_remove_mapping - unmap a virtual memory part of the shadow mapping * @addr: the start address of the range * @size: the size of the range */ extern void kaiser_remove_mapping(unsigned long start, unsigned long size);
/** - * shadowmem_initialize_mapping - Initalize the shadow mapping + * kaiser_initialize_mapping - Initalize the shadow mapping * - * most parts of the shadow mapping can be mapped upon boot time. - * only the thread stacks have to be mapped on runtime. - * the mapped regions are not unmapped at all. + * Most parts of the shadow mapping can be mapped upon boot + * time. Only per-process things like the thread stacks + * or a new LDT have to be mapped at runtime. These boot- + * time mappings are permanent and nevertunmapped. */ extern void kaiser_init(void);
-#endif +#endif /* CONFIG_KAISER */ + +#endif /* __ASSEMBLY */
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 53d44406af45..34e5177bd7b3 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -626,7 +626,17 @@ static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
static inline int pgd_bad(pgd_t pgd) { - return (pgd_flags(pgd) & ~_PAGE_USER) != _KERNPG_TABLE; + pgdval_t ignore_flags = _PAGE_USER; + /* + * We set NX on KAISER pgds that map userspace memory so + * that userspace can not meaningfully use the kernel + * page table by accident; it will fault on the first + * instruction it tries to run. See native_set_pgd(). + */ + if (IS_ENABLED(CONFIG_KAISER)) + ignore_flags |= _PAGE_NX; + + return (pgd_flags(pgd) & ~ignore_flags) != _KERNPG_TABLE; }
static inline int pgd_none(pgd_t pgd) @@ -830,8 +840,10 @@ static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count) { memcpy(dst, src, count * sizeof(pgd_t)); #ifdef CONFIG_KAISER - // clone the shadow pgd part as well - memcpy(native_get_shadow_pgd(dst), native_get_shadow_pgd(src), count * sizeof(pgd_t)); + /* Clone the shadow pgd part as well */ + memcpy(native_get_shadow_pgd(dst), + native_get_shadow_pgd(src), + count * sizeof(pgd_t)); #endif }
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index 2131edda6620..b0adf0d11c9a 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -107,26 +107,58 @@ static inline void native_pud_clear(pud_t *pud) }
#ifdef CONFIG_KAISER -static inline pgd_t * native_get_shadow_pgd(pgd_t *pgdp) { +static inline pgd_t * native_get_shadow_pgd(pgd_t *pgdp) +{ return (pgd_t *)(void*)((unsigned long)(void*)pgdp | (unsigned long)PAGE_SIZE); }
-static inline pgd_t * native_get_normal_pgd(pgd_t *pgdp) { +static inline pgd_t * native_get_normal_pgd(pgd_t *pgdp) +{ return (pgd_t *)(void*)((unsigned long)(void*)pgdp & ~(unsigned long)PAGE_SIZE); } +#else +static inline pgd_t * native_get_shadow_pgd(pgd_t *pgdp) +{ + BUILD_BUG_ON(1); + return NULL; +} +static inline pgd_t * native_get_normal_pgd(pgd_t *pgdp) +{ + return pgdp; +} #endif /* CONFIG_KAISER */
+/* + * Page table pages are page-aligned. The lower half of the top + * level is used for userspace and the top half for the kernel. + * This returns true for user pages that need to get copied into + * both the user and kernel copies of the page tables, and false + * for kernel pages that should only be in the kernel copy. + */ +static inline bool is_userspace_pgd(void *__ptr) +{ + unsigned long ptr = (unsigned long)__ptr; + + return ((ptr % PAGE_SIZE) < (PAGE_SIZE / 2)); +} + static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd) { #ifdef CONFIG_KAISER - // We know that a pgd is page aligned. - // Therefore the lower indices have to be mapped to user space. - // These pages are mapped to the shadow mapping. - if ((((unsigned long)pgdp) % PAGE_SIZE) < (PAGE_SIZE / 2)) { + pteval_t extra_kern_pgd_flags = 0; + /* Do we need to also populate the shadow pgd? */ + if (is_userspace_pgd(pgdp)) { native_get_shadow_pgd(pgdp)->pgd = pgd.pgd; + /* + * Even if the entry is *mapping* userspace, ensure + * that userspace can not use it. This way, if we + * get out to userspace running on the kernel CR3, + * userspace will crash instead of running. + */ + extra_kern_pgd_flags = _PAGE_NX; } - - pgdp->pgd = pgd.pgd & ~_PAGE_USER; + pgdp->pgd = pgd.pgd; + pgdp->pgd |= extra_kern_pgd_flags; #else /* CONFIG_KAISER */ *pgdp = pgd; #endif diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index c0a0560b19af..7a5fc4410c00 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -42,7 +42,7 @@ #ifdef CONFIG_KAISER #define _PAGE_GLOBAL (_AT(pteval_t, 0)) #else -#define _PAGE_GLOBAL (_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL) +#define _PAGE_GLOBAL (_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL) #endif #define _PAGE_SOFTW1 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW1) #define _PAGE_SOFTW2 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW2) @@ -93,11 +93,7 @@ #define _PAGE_NX (_AT(pteval_t, 0)) #endif
-#ifdef CONFIG_KAISER -#define _PAGE_PROTNONE (_AT(pteval_t, 0)) -#else #define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE) -#endif
#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \ _PAGE_ACCESSED | _PAGE_DIRTY) diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 8435b806d6a5..1bda5ebd1013 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -291,6 +291,13 @@ system_call_fastpath: movq EFLAGS(%rsp),%r11 /*CFI_REGISTER rflags,r11*/ RESTORE_C_REGS_EXCEPT_RCX_R11 + /* + * This opens a window where we have a user CR3, but are + * running in the kernel. This makes using the CS + * register useless for telling whether or not we need to + * switch CR3 in NMIs. Normal interrupts are OK because + * they are off here. + */ SWITCH_USER_CR3 movq RSP(%rsp),%rsp /* @@ -480,12 +487,26 @@ syscall_return_via_sysret: CFI_REMEMBER_STATE /* r11 is already restored (see code above) */ RESTORE_C_REGS_EXCEPT_R11 + /* + * This opens a window where we have a user CR3, but are + * running in the kernel. This makes using the CS + * register useless for telling whether or not we need to + * switch CR3 in NMIs. Normal interrupts are OK because + * they are off here. + */ SWITCH_USER_CR3 movq RSP(%rsp),%rsp USERGS_SYSRET64 CFI_RESTORE_STATE
opportunistic_sysret_failed: + /* + * This opens a window where we have a user CR3, but are + * running in the kernel. This makes using the CS + * register useless for telling whether or not we need to + * switch CR3 in NMIs. Normal interrupts are OK because + * they are off here. + */ SWITCH_USER_CR3 SWAPGS jmp restore_c_regs_and_iret @@ -1338,12 +1359,18 @@ ENTRY(error_entry) cld SAVE_C_REGS 8 SAVE_EXTRA_REGS 8 + /* + * error_entry() always returns with a kernel gsbase and + * CR3. We must also have a kernel CR3/gsbase before + * calling TRACE_IRQS_*. Just unconditionally switch to + * the kernel CR3 here. + */ + SWITCH_KERNEL_CR3 xorl %ebx,%ebx testl $3,CS+8(%rsp) je error_kernelspace error_swapgs: SWAPGS - SWITCH_KERNEL_CR3 error_sti: TRACE_IRQS_OFF ret @@ -1473,7 +1500,10 @@ ENTRY(nmi) * stacks lest we corrupt the "NMI executing" variable. */ SWAPGS_UNSAFE_STACK - SWITCH_KERNEL_CR3_NO_STACK + /* + * percpu variables are mapped with user CR3, so no need + * to switch CR3 here. + */ cld movq %rsp, %rdx movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp @@ -1506,14 +1536,32 @@ ENTRY(nmi) */ movq %rsp, %rdi movq $-1, %rsi +#ifdef CONFIG_KAISER + /* Unconditionally use kernel CR3 for do_nmi() */ + /* %rax is saved above, so OK to clobber here */ + movq %cr3, %rax + pushq %rax +#ifdef CONFIG_KAISER_REAL_SWITCH + andq $(~0x1000), %rax +#endif + movq %rax, %cr3 +#endif call do_nmi - + /* + * Unconditionally restore CR3. I know we return to + * kernel code that needs user CR3, but do we ever return + * to "user mode" where we need the kernel CR3? + */ +#ifdef CONFIG_KAISER + popq %rax + mov %rax, %cr3 +#endif /* * Return back to user mode. We must *not* do the normal exit - * work, because we don't want to enable interrupts. Fortunately, - * do_nmi doesn't modify pt_regs. + * work, because we don't want to enable interrupts. Do not + * switch to user CR3: we might be going back to kernel code + * that had a user CR3 set. */ - SWITCH_USER_CR3 SWAPGS jmp restore_c_regs_and_iret
@@ -1706,24 +1754,55 @@ end_repeat_nmi: ALLOC_PT_GPREGS_ON_STACK
/* - * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit - * as we should not be calling schedule in NMI context. - * Even with normal interrupts enabled. An NMI should not be - * setting NEED_RESCHED or anything that normal interrupts and - * exceptions might do. + * Use the same approach as paranoid_entry to handle SWAPGS, but + * without CR3 handling since we do that differently in NMIs. No + * need to use paranoid_exit as we should not be calling schedule + * in NMI context. Even with normal interrupts enabled. An NMI + * should not be setting NEED_RESCHED or anything that normal + * interrupts and exceptions might do. */ - call paranoid_entry - DEFAULT_FRAME 0 + cld + SAVE_C_REGS + SAVE_EXTRA_REGS + movl $1, %ebx + movl $MSR_GS_BASE, %ecx + rdmsr + testl %edx, %edx + js 1f /* negative -> in kernel */ + SWAPGS + xorl %ebx, %ebx +1: +#ifdef CONFIG_KAISER + /* Unconditionally use kernel CR3 for do_nmi() */ + /* %rax is saved above, so OK to clobber here */ + movq %cr3, %rax + pushq %rax +#ifdef CONFIG_KAISER_REAL_SWITCH + andq $(~0x1000), %rax +#endif + movq %rax, %cr3 +#endif + DEFAULT_FRAME 0 /* XXX: Do we need this? */
/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */ movq %rsp,%rdi + addq $8, %rdi /* point %rdi at ptregs, fixed up for CR3 */ movq $-1,%rsi call do_nmi + /* + * Unconditionally restore CR3. We might be returning to + * kernel code that needs user CR3, like just just before + * a sysret. + */ +#ifdef CONFIG_KAISER + popq %rax + mov %rax, %cr3 +#endif
testl %ebx,%ebx /* swapgs needed? */ jnz nmi_restore nmi_swapgs: - SWITCH_USER_CR3_NO_STACK + /* We fixed up CR3 above, so no need to switch it here */ SWAPGS_UNSAFE_STACK nmi_restore: RESTORE_EXTRA_REGS diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c index 9b0aef0dce94..a48fad060fe9 100644 --- a/arch/x86/kernel/espfix_64.c +++ b/arch/x86/kernel/espfix_64.c @@ -127,11 +127,14 @@ void __init init_espfix_bsp(void) /* Install the espfix pud into the kernel page directory */ pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)]; pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page); -#ifdef CONFIG_KAISER - // add the esp stack pud to the shadow mapping here. - // This can be done directly, because the fixup stack has its own pud - set_pgd(native_get_shadow_pgd(pgd_p), __pgd(_PAGE_TABLE | __pa((pud_t *)espfix_pud_page))); -#endif + /* + * Just copy the top-level PGD that is mapping the espfix + * area to ensure it is mapped into the shadow user page + * tables. + */ + if (IS_ENABLED(CONFIG_KAISER)) + set_pgd(native_get_shadow_pgd(pgd_p), + __pgd(_KERNPG_TABLE | __pa((pud_t *)espfix_pud_page)));
/* Randomize the locations */ init_espfix_random(); diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S index 0ee14f5bf59c..73e7f8e90fac 100644 --- a/arch/x86/kernel/head_64.S +++ b/arch/x86/kernel/head_64.S @@ -442,11 +442,24 @@ early_idt_ripmsg: GLOBAL(name)
#ifdef CONFIG_KAISER +/* + * Each PGD needs to be 8k long and 8k aligned. We do not + * ever go out to userspace with these, so we do not + * strictly *need* the second page, but this allows us to + * have a single set_pgd() implementation that does not + * need to worry about whether it has 4k or 8k to work + * with. + * + * This ensures PGDs are 8k long: + */ +#define KAISER_USER_PGD_FILL 512 +/* This ensures they are 8k-aligned: */ #define NEXT_PGD_PAGE(name) \ .balign 2 * PAGE_SIZE; \ GLOBAL(name) #else #define NEXT_PGD_PAGE(name) NEXT_PAGE(name) +#define KAISER_USER_PGD_FILL 0 #endif
/* Automate the creation of 1 to 1 mapping pmd entries */ @@ -461,6 +474,7 @@ GLOBAL(name) NEXT_PGD_PAGE(early_level4_pgt) .fill 511,8,0 .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE + .fill KAISER_USER_PGD_FILL,8,0
NEXT_PAGE(early_dynamic_pgts) .fill 512*EARLY_DYNAMIC_PAGE_TABLES,8,0 @@ -469,7 +483,8 @@ NEXT_PAGE(early_dynamic_pgts)
#ifndef CONFIG_XEN NEXT_PGD_PAGE(init_level4_pgt) - .fill 2*512,8,0 + .fill 512,8,0 + .fill KAISER_USER_PGD_FILL,8,0 #else NEXT_PGD_PAGE(init_level4_pgt) .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE @@ -478,6 +493,7 @@ NEXT_PGD_PAGE(init_level4_pgt) .org init_level4_pgt + L4_START_KERNEL*8, 0 /* (2^48-(2*1024*1024*1024))/(2^39) = 511 */ .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE + .fill KAISER_USER_PGD_FILL,8,0
NEXT_PAGE(level3_ident_pgt) .quad level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE @@ -488,6 +504,7 @@ NEXT_PAGE(level2_ident_pgt) */ PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD) #endif + .fill KAISER_USER_PGD_FILL,8,0
NEXT_PAGE(level3_kernel_pgt) .fill L3_START_KERNEL,8,0 diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c index 2bcc0525f1c1..9c6ebaf0a83b 100644 --- a/arch/x86/kernel/ldt.c +++ b/arch/x86/kernel/ldt.c @@ -15,6 +15,7 @@ #include <linux/slab.h> #include <linux/vmalloc.h> #include <linux/uaccess.h> +#include <linux/kaiser.h>
#include <asm/ldt.h> #include <asm/desc.h> @@ -33,11 +34,21 @@ static void flush_ldt(void *current_mm) set_ldt(pc->ldt->entries, pc->ldt->size); }
+static void __free_ldt_struct(struct ldt_struct *ldt) +{ + if (ldt->size * LDT_ENTRY_SIZE > PAGE_SIZE) + vfree(ldt->entries); + else + free_page((unsigned long)ldt->entries); + kfree(ldt); +} + /* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */ static struct ldt_struct *alloc_ldt_struct(int size) { struct ldt_struct *new_ldt; int alloc_size; + int ret = 0;
if (size > LDT_ENTRIES) return NULL; @@ -65,6 +76,14 @@ static struct ldt_struct *alloc_ldt_struct(int size) return NULL; }
+ // FIXME: make kaiser_add_mapping() return an error code + // when it fails + kaiser_add_mapping((unsigned long)new_ldt->entries, alloc_size, + __PAGE_KERNEL); + if (ret) { + __free_ldt_struct(new_ldt); + return NULL; + } new_ldt->size = size; return new_ldt; } @@ -91,12 +110,10 @@ static void free_ldt_struct(struct ldt_struct *ldt) if (likely(!ldt)) return;
+ kaiser_remove_mapping((unsigned long)ldt->entries, + ldt->size * LDT_ENTRY_SIZE); paravirt_free_ldt(ldt->entries, ldt->size); - if (ldt->size * LDT_ENTRY_SIZE > PAGE_SIZE) - vfree(ldt->entries); - else - kfree(ldt->entries); - kfree(ldt); + __free_ldt_struct(ldt); }
/* diff --git a/arch/x86/kernel/tracepoint.c b/arch/x86/kernel/tracepoint.c index 1c113db9ed57..2bb5ee464df3 100644 --- a/arch/x86/kernel/tracepoint.c +++ b/arch/x86/kernel/tracepoint.c @@ -9,10 +9,12 @@ #include <linux/atomic.h>
atomic_t trace_idt_ctr = ATOMIC_INIT(0); +__aligned(PAGE_SIZE) struct desc_ptr trace_idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) trace_idt_table };
/* No need to be aligned, but done to keep all IDTs defined the same way. */ +__aligned(PAGE_SIZE) gate_desc trace_idt_table[NR_VECTORS] __page_aligned_bss;
static int trace_irq_vector_refcount; diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index cf1bb922d467..f29c0db1c9b4 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -1,160 +1,306 @@ - - +#include <linux/bug.h> #include <linux/kernel.h> #include <linux/errno.h> #include <linux/string.h> #include <linux/types.h> #include <linux/bug.h> #include <linux/init.h> +#include <linux/interrupt.h> #include <linux/spinlock.h> #include <linux/mm.h> - #include <linux/uaccess.h> +#include <linux/ftrace.h> + +#include <asm/kaiser.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> #include <asm/desc.h> #ifdef CONFIG_KAISER
__visible DEFINE_PER_CPU_USER_MAPPED(unsigned long, unsafe_stack_register_backup); +/* + * At runtime, the only things we map are some things for CPU + * hotplug, and stacks for new processes. No two CPUs will ever + * be populating the same addresses, so we only need to ensure + * that we protect between two CPUs trying to allocate and + * populate the same page table page. + * + * Only take this lock when doing a set_p[4um]d(), but it is not + * needed for doing a set_pte(). We assume that only the *owner* + * of a given allocation will be doing this for _their_ + * allocation. + * + * This ensures that once a system has been running for a while + * and there have been stacks all over and these page tables + * are fully populated, there will be no further acquisitions of + * this lock. + */ +static DEFINE_SPINLOCK(shadow_table_allocation_lock);
-/** - * Get the real ppn from a address in kernel mapping. - * @param address The virtual adrress - * @return the physical address +/* + * Returns -1 on error. */ -static inline unsigned long get_pa_from_mapping (unsigned long address) +static inline unsigned long get_pa_from_mapping(unsigned long vaddr) { pgd_t *pgd; pud_t *pud; pmd_t *pmd; pte_t *pte;
- pgd = pgd_offset_k(address); - BUG_ON(pgd_none(*pgd) || pgd_large(*pgd)); - - pud = pud_offset(pgd, address); - BUG_ON(pud_none(*pud)); + pgd = pgd_offset_k(vaddr); + /* + * We made all the kernel PGDs present in kaiser_init(). + * We expect them to stay that way. + */ + BUG_ON(pgd_none(*pgd)); + /* + * PGDs are either 512GB or 128TB on all x86_64 + * configurations. We don't handle these. + */ + BUG_ON(pgd_large(*pgd));
- if (pud_large(*pud)) { - return (pud_pfn(*pud) << PAGE_SHIFT) | (address & ~PUD_PAGE_MASK); + pud = pud_offset(pgd, vaddr); + if (pud_none(*pud)) { + WARN_ON_ONCE(1); + return -1; }
- pmd = pmd_offset(pud, address); - BUG_ON(pmd_none(*pmd)); + if (pud_large(*pud)) + return (pud_pfn(*pud) << PAGE_SHIFT) | (vaddr & ~PUD_PAGE_MASK);
- if (pmd_large(*pmd)) { - return (pmd_pfn(*pmd) << PAGE_SHIFT) | (address & ~PMD_PAGE_MASK); + pmd = pmd_offset(pud, vaddr); + if (pmd_none(*pmd)) { + WARN_ON_ONCE(1); + return -1; }
- pte = pte_offset_kernel(pmd, address); - BUG_ON(pte_none(*pte)); + if (pmd_large(*pmd)) + return (pmd_pfn(*pmd) << PAGE_SHIFT) | (vaddr & ~PMD_PAGE_MASK);
- return (pte_pfn(*pte) << PAGE_SHIFT) | (address & ~PAGE_MASK); + pte = pte_offset_kernel(pmd, vaddr); + if (pte_none(*pte)) { + WARN_ON_ONCE(1); + return -1; + } + + return (pte_pfn(*pte) << PAGE_SHIFT) | (vaddr & ~PAGE_MASK); }
-void _kaiser_copy (unsigned long start_addr, unsigned long size, - unsigned long flags) +/* + * This is a relatively normal page table walk, except that it + * also tries to allocate page tables pages along the way. + * + * Returns a pointer to a PTE on success, or NULL on failure. + */ +static pte_t *kaiser_pagetable_walk(unsigned long address, bool is_atomic) { - pgd_t *pgd; - pud_t *pud; pmd_t *pmd; - pte_t *pte; - unsigned long address; - unsigned long end_addr = start_addr + size; - unsigned long target_address; + pud_t *pud; + pgd_t *pgd = native_get_shadow_pgd(pgd_offset_k(address)); + gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
- for (address = PAGE_ALIGN(start_addr - (PAGE_SIZE - 1)); - address < PAGE_ALIGN(end_addr); address += PAGE_SIZE) { - target_address = get_pa_from_mapping(address); + might_sleep(); + if (is_atomic) { + gfp &= ~GFP_KERNEL; + gfp |= __GFP_HIGH; + }
- pgd = native_get_shadow_pgd(pgd_offset_k(address)); + if (pgd_none(*pgd)) { + WARN_ONCE(1, "All shadow pgds should have been populated"); + return NULL; + } + BUILD_BUG_ON(pgd_large(*pgd) != 0);
- BUG_ON(pgd_none(*pgd) && "All shadow pgds should be mapped at this time\n"); - BUG_ON(pgd_large(*pgd)); + pud = pud_offset(pgd, address); + /* The shadow page tables do not use large mappings: */ + if (pud_large(*pud)) { + WARN_ON(1); + return NULL; + } + if (pud_none(*pud)) { + unsigned long new_pmd_page = __get_free_page(gfp); + if (!new_pmd_page) + return NULL; + spin_lock(&shadow_table_allocation_lock); + if (pud_none(*pud)) + set_pud(pud, __pud(_KERNPG_TABLE | __pa(new_pmd_page))); + else + free_page(new_pmd_page); + spin_unlock(&shadow_table_allocation_lock); + }
- pud = pud_offset(pgd, address); - if (pud_none(*pud)) { - set_pud(pud, __pud(_PAGE_TABLE | __pa(pmd_alloc_one(0, address)))); - } - BUG_ON(pud_large(*pud)); + pmd = pmd_offset(pud, address); + /* The shadow page tables do not use large mappings: */ + if (pmd_large(*pmd)) { + WARN_ON(1); + return NULL; + } + if (pmd_none(*pmd)) { + unsigned long new_pte_page = __get_free_page(gfp); + if (!new_pte_page) + return NULL; + spin_lock(&shadow_table_allocation_lock); + if (pmd_none(*pmd)) + set_pmd(pmd, __pmd(_KERNPG_TABLE | __pa(new_pte_page))); + else + free_page(new_pte_page); + spin_unlock(&shadow_table_allocation_lock); + }
- pmd = pmd_offset(pud, address); - if (pmd_none(*pmd)) { - set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte_alloc_one_kernel(0, address)))); - } - BUG_ON(pmd_large(*pmd)); + return pte_offset_kernel(pmd, address); +}
- pte = pte_offset_kernel(pmd, address); +int kaiser_add_user_map(const void *__start_addr, unsigned long size, + unsigned long flags) +{ + int ret = 0; + pte_t *pte; + unsigned long start_addr = (unsigned long )__start_addr; + unsigned long address = start_addr & PAGE_MASK; + unsigned long end_addr = PAGE_ALIGN(start_addr + size); + unsigned long target_address; + + for (;address < end_addr; address += PAGE_SIZE) { + target_address = get_pa_from_mapping(address); + if (target_address == -1) { + ret = -EIO; + break; + } + pte = kaiser_pagetable_walk(address, false); if (pte_none(*pte)) { set_pte(pte, __pte(flags | target_address)); } else { - BUG_ON(__pa(pte_page(*pte)) != target_address); + pte_t tmp; + set_pte(&tmp, __pte(flags | target_address)); + WARN_ON_ONCE(!pte_same(*pte, tmp)); } } + return ret; +} + +static int kaiser_add_user_map_ptrs(const void *start, const void *end, unsigned long flags) +{ + unsigned long size = end - start; + + return kaiser_add_user_map(start, size, flags); }
-// at first, add a pmd for every pgd entry in the shadowmem-kernel-part of the kernel mapping -static inline void __init _kaiser_init(void) +/* + * Ensure that the top level of the (shadow) page tables are + * entirely populated. This ensures that all processes that get + * forked have the same entries. This way, we do not have to + * ever go set up new entries in older processes. + * + * Note: we never free these, so there are no updates to them + * after this. + */ +static void __init kaiser_init_all_pgds(void) { pgd_t *pgd; int i = 0;
pgd = native_get_shadow_pgd(pgd_offset_k((unsigned long )0)); for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++) { - set_pgd(pgd + i, __pgd(_PAGE_TABLE |__pa(pud_alloc_one(0, 0)))); + pgd_t new_pgd; + pud_t *pud = pud_alloc_one(&init_mm, PAGE_OFFSET + i * PGDIR_SIZE); + if (!pud) { + WARN_ON(1); + break; + } + new_pgd = __pgd(_KERNPG_TABLE |__pa(pud)); + /* + * Make sure not to stomp on some other pgd entry. + */ + if (!pgd_none(pgd[i])) { + WARN_ON(1); + continue; + } + set_pgd(pgd + i, new_pgd); } }
+#define kaiser_add_user_map_early(start, size, flags) do { \ + int __ret = kaiser_add_user_map(start, size, flags); \ + WARN_ON(__ret); \ +} while (0) + +#define kaiser_add_user_map_ptrs_early(start, end, flags) do { \ + int __ret = kaiser_add_user_map_ptrs(start, end, flags); \ + WARN_ON(__ret); \ +} while (0) + extern char __per_cpu_user_mapped_start[], __per_cpu_user_mapped_end[]; -spinlock_t shadow_table_lock; +/* + * If anything in here fails, we will likely die on one of the + * first kernel->user transitions and init will die. But, we + * will have most of the kernel up by then and should be able to + * get a clean warning out of it. If we BUG_ON() here, we run + * the risk of being before we have good console output. + */ void __init kaiser_init(void) { int cpu; - spin_lock_init(&shadow_table_lock); - - spin_lock(&shadow_table_lock);
- _kaiser_init(); + kaiser_init_all_pgds();
for_each_possible_cpu(cpu) { - // map the per cpu user variables - _kaiser_copy( - (unsigned long) (__per_cpu_user_mapped_start + per_cpu_offset(cpu)), - (unsigned long) __per_cpu_user_mapped_end - (unsigned long) __per_cpu_user_mapped_start, - __PAGE_KERNEL); + void *percpu_vaddr = __per_cpu_user_mapped_start + + per_cpu_offset(cpu); + unsigned long percpu_sz = __per_cpu_user_mapped_end - + __per_cpu_user_mapped_start; + kaiser_add_user_map_early(percpu_vaddr, percpu_sz, + __PAGE_KERNEL); }
- // map the entry/exit text section, which is responsible to switch between user- and kernel mode - _kaiser_copy( - (unsigned long) __entry_text_start, - (unsigned long) __entry_text_end - (unsigned long) __entry_text_start, - __PAGE_KERNEL_RX); + /* + * Map the entry/exit text section, which is needed at + * switches from user to and from kernel. + */ + kaiser_add_user_map_ptrs_early(__entry_text_start, __entry_text_end, + __PAGE_KERNEL_RX);
- // the fixed map address of the idt_table - _kaiser_copy( - (unsigned long) idt_descr.address, - sizeof(gate_desc) * NR_VECTORS, - __PAGE_KERNEL_RO); - - spin_unlock(&shadow_table_lock); +#if defined(CONFIG_FUNCTION_GRAPH_TRACER) || defined(CONFIG_KASAN) + kaiser_add_user_map_ptrs_early(__irqentry_text_start, + __irqentry_text_end, + __PAGE_KERNEL_RX); +#endif + kaiser_add_user_map_early((void *)idt_descr.address, + sizeof(gate_desc) * NR_VECTORS, + __PAGE_KERNEL_RO); +#ifdef CONFIG_TRACING + kaiser_add_user_map_early(&trace_idt_descr, + sizeof(trace_idt_descr), + __PAGE_KERNEL); + kaiser_add_user_map_early(&trace_idt_table, + sizeof(gate_desc) * NR_VECTORS, + __PAGE_KERNEL); +#endif + kaiser_add_user_map_early(&debug_idt_descr, sizeof(debug_idt_descr), + __PAGE_KERNEL); + kaiser_add_user_map_early(&debug_idt_table, + sizeof(gate_desc) * NR_VECTORS, + __PAGE_KERNEL); }
+extern void unmap_pud_range_nofree(pgd_t *pgd, unsigned long start, unsigned long end); // add a mapping to the shadow-mapping, and synchronize the mappings -void kaiser_add_mapping(unsigned long addr, unsigned long size, unsigned long flags) +int kaiser_add_mapping(unsigned long addr, unsigned long size, unsigned long flags) { - spin_lock(&shadow_table_lock); - _kaiser_copy(addr, size, flags); - spin_unlock(&shadow_table_lock); + return kaiser_add_user_map((const void *)addr, size, flags); }
-extern void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end); void kaiser_remove_mapping(unsigned long start, unsigned long size) { - pgd_t *pgd = native_get_shadow_pgd(pgd_offset_k(start)); - spin_lock(&shadow_table_lock); - do { - unmap_pud_range(pgd, start, start + size); - } while (pgd++ != native_get_shadow_pgd(pgd_offset_k(start + size))); - spin_unlock(&shadow_table_lock); + unsigned long end = start + size; + unsigned long addr; + + for (addr = start; addr < end; addr += PGDIR_SIZE) { + pgd_t *pgd = native_get_shadow_pgd(pgd_offset_k(addr)); + /* + * unmap_p4d_range() handles > P4D_SIZE unmaps, + * so no need to trim 'end'. + */ + unmap_pud_range_nofree(pgd, addr, end); + } } #endif /* CONFIG_KAISER */ diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index 5553b7fbc6cb..c2ab36151d44 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -52,6 +52,7 @@ static DEFINE_SPINLOCK(cpa_lock); #define CPA_FLUSHTLB 1 #define CPA_ARRAY 2 #define CPA_PAGES_ARRAY 4 +#define CPA_FREE_PAGETABLES 8
#ifdef CONFIG_PROC_FS static unsigned long direct_pages_count[PG_LEVEL_NUM]; @@ -696,10 +697,13 @@ static int split_large_page(struct cpa_data *cpa, pte_t *kpte, return 0; }
-static bool try_to_free_pte_page(pte_t *pte) +static bool try_to_free_pte_page(struct cpa_data *cpa, pte_t *pte) { int i;
+ if (!(cpa->flags & CPA_FREE_PAGETABLES)) + return false; + for (i = 0; i < PTRS_PER_PTE; i++) if (!pte_none(pte[i])) return false; @@ -708,10 +712,13 @@ static bool try_to_free_pte_page(pte_t *pte) return true; }
-static bool try_to_free_pmd_page(pmd_t *pmd) +static bool try_to_free_pmd_page(struct cpa_data *cpa, pmd_t *pmd) { int i;
+ if (!(cpa->flags & CPA_FREE_PAGETABLES)) + return false; + for (i = 0; i < PTRS_PER_PMD; i++) if (!pmd_none(pmd[i])) return false; @@ -732,7 +739,9 @@ static bool try_to_free_pud_page(pud_t *pud) return true; }
-static bool unmap_pte_range(pmd_t *pmd, unsigned long start, unsigned long end) +static bool unmap_pte_range(struct cpa_data *cpa, pmd_t *pmd, + unsigned long start, + unsigned long end) { pte_t *pte = pte_offset_kernel(pmd, start);
@@ -743,22 +752,23 @@ static bool unmap_pte_range(pmd_t *pmd, unsigned long start, unsigned long end) pte++; }
- if (try_to_free_pte_page((pte_t *)pmd_page_vaddr(*pmd))) { + if (try_to_free_pte_page(cpa, (pte_t *)pmd_page_vaddr(*pmd))) { pmd_clear(pmd); return true; } return false; }
-static void __unmap_pmd_range(pud_t *pud, pmd_t *pmd, +static void __unmap_pmd_range(struct cpa_data *cpa, pud_t *pud, pmd_t *pmd, unsigned long start, unsigned long end) { - if (unmap_pte_range(pmd, start, end)) - if (try_to_free_pmd_page((pmd_t *)pud_page_vaddr(*pud))) + if (unmap_pte_range(cpa, pmd, start, end)) + if (try_to_free_pmd_page(cpa, (pmd_t *)pud_page_vaddr(*pud))) pud_clear(pud); }
-static void unmap_pmd_range(pud_t *pud, unsigned long start, unsigned long end) +static void unmap_pmd_range(struct cpa_data *cpa, pud_t *pud, + unsigned long start, unsigned long end) { pmd_t *pmd = pmd_offset(pud, start);
@@ -769,7 +779,7 @@ static void unmap_pmd_range(pud_t *pud, unsigned long start, unsigned long end) unsigned long next_page = (start + PMD_SIZE) & PMD_MASK; unsigned long pre_end = min_t(unsigned long, end, next_page);
- __unmap_pmd_range(pud, pmd, start, pre_end); + __unmap_pmd_range(cpa, pud, pmd, start, pre_end);
start = pre_end; pmd++; @@ -782,7 +792,8 @@ static void unmap_pmd_range(pud_t *pud, unsigned long start, unsigned long end) if (pmd_large(*pmd)) pmd_clear(pmd); else - __unmap_pmd_range(pud, pmd, start, start + PMD_SIZE); + __unmap_pmd_range(cpa, pud, pmd, + start, start + PMD_SIZE);
start += PMD_SIZE; pmd++; @@ -792,17 +803,19 @@ static void unmap_pmd_range(pud_t *pud, unsigned long start, unsigned long end) * 4K leftovers? */ if (start < end) - return __unmap_pmd_range(pud, pmd, start, end); + return __unmap_pmd_range(cpa, pud, pmd, start, end);
/* * Try again to free the PMD page if haven't succeeded above. */ if (!pud_none(*pud)) - if (try_to_free_pmd_page((pmd_t *)pud_page_vaddr(*pud))) + if (try_to_free_pmd_page(cpa, (pmd_t *)pud_page_vaddr(*pud))) pud_clear(pud); }
-void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end) +static void __unmap_pud_range(struct cpa_data *cpa, pgd_t *pgd, + unsigned long start, + unsigned long end) { pud_t *pud = pud_offset(pgd, start);
@@ -813,7 +826,7 @@ void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end) unsigned long next_page = (start + PUD_SIZE) & PUD_MASK; unsigned long pre_end = min_t(unsigned long, end, next_page);
- unmap_pmd_range(pud, start, pre_end); + unmap_pmd_range(cpa, pud, start, pre_end);
start = pre_end; pud++; @@ -827,7 +840,7 @@ void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end) if (pud_large(*pud)) pud_clear(pud); else - unmap_pmd_range(pud, start, start + PUD_SIZE); + unmap_pmd_range(cpa, pud, start, start + PUD_SIZE);
start += PUD_SIZE; pud++; @@ -837,7 +850,7 @@ void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end) * 2M leftovers? */ if (start < end) - unmap_pmd_range(pud, start, end); + unmap_pmd_range(cpa, pud, start, end);
/* * No need to try to free the PUD page because we'll free it in @@ -845,6 +858,24 @@ void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end) */ }
+static void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end) +{ + struct cpa_data cpa = { + .flags = CPA_FREE_PAGETABLES, + }; + + __unmap_pud_range(&cpa, pgd, start, end); +} + +void unmap_pud_range_nofree(pgd_t *pgd, unsigned long start, unsigned long end) +{ + struct cpa_data cpa = { + .flags = 0, + }; + + __unmap_pud_range(&cpa, pgd, start, end); +} + static void unmap_pgd_range(pgd_t *root, unsigned long addr, unsigned long end) { pgd_t *pgd_entry = root + pgd_index(addr); diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index c847acf67933..b0848a7e9c68 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -340,40 +340,26 @@ static inline void _pgd_free(pgd_t *pgd) kmem_cache_free(pgd_cache, pgd); } #else -static inline pgd_t *_pgd_alloc(void) -{ + #ifdef CONFIG_KAISER - // Instead of one PML4, we aquire two PML4s and, thus, an 8kb-aligned memory - // block. Therefore, we have to allocate at least 3 pages. However, the - // __get_free_pages returns us 4 pages. Hence, we store the base pointer at - // the beginning of the page of our 8kb-aligned memory block in order to - // correctly free it afterwars. - - unsigned long pages = __get_free_pages(PGALLOC_GFP, get_order(4*PAGE_SIZE)); - - if(native_get_normal_pgd((pgd_t*) pages) == (pgd_t*) pages) - { - *((unsigned long*)(pages + 2 * PAGE_SIZE)) = pages; - return (pgd_t *) pages; - } - else - { - *((unsigned long*)(pages + 3 * PAGE_SIZE)) = pages; - return (pgd_t *) (pages + PAGE_SIZE); - } +/* + * Instead of one pmd, we aquire two pmds. Being order-1, it is + * both 8k in size and 8k-aligned. That lets us just flip bit 12 + * in a pointer to swap between the two 4k halves. + */ +#define PGD_ALLOCATION_ORDER 1 #else - return (pgd_t *)__get_free_page(PGALLOC_GFP); +#define PGD_ALLOCATION_ORDER 0 #endif + +static inline pgd_t *_pgd_alloc(void) +{ + return (pgd_t *)__get_free_pages(PGALLOC_GFP, PGD_ALLOCATION_ORDER); }
static inline void _pgd_free(pgd_t *pgd) { -#ifdef CONFIG_KAISER - unsigned long pages = *((unsigned long*) ((char*) pgd + 2 * PAGE_SIZE)); - free_pages(pages, get_order(4*PAGE_SIZE)); -#else - free_page((unsigned long)pgd); -#endif + free_pages((unsigned long)pgd, PGD_ALLOCATION_ORDER); } #endif /* CONFIG_X86_PAE */
diff --git a/include/linux/kaiser.h b/include/linux/kaiser.h new file mode 100644 index 000000000000..9db5433c2284 --- /dev/null +++ b/include/linux/kaiser.h @@ -0,0 +1,26 @@ +#ifndef _INCLUDE_KAISER_H +#define _INCLUDE_KAISER_H + +#ifdef CONFIG_KAISER +#include <asm/kaiser.h> +#else + +/* + * These stubs are used whenever CONFIG_KAISER is off, which + * includes architectures that support KAISER, but have it + * disabled. + */ + +static inline void kaiser_init(void) +{ +} +static inline void kaiser_remove_mapping(unsigned long start, unsigned long size) +{ +} +static inline int kaiser_add_mapping(unsigned long addr, unsigned long size, unsigned long flags) +{ + return 0; +} + +#endif /* !CONFIG_KAISER */ +#endif /* _INCLUDE_KAISER_H */ diff --git a/kernel/fork.c b/kernel/fork.c index efe6f94d8dc3..eb1ca74c48ce 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -58,6 +58,7 @@ #include <linux/tsacct_kern.h> #include <linux/cn_proc.h> #include <linux/freezer.h> +#include <linux/kaiser.h> #include <linux/delayacct.h> #include <linux/taskstats_kern.h> #include <linux/random.h> @@ -329,7 +330,6 @@ void set_task_stack_end_magic(struct task_struct *tsk) *stackend = STACK_END_MAGIC; /* for overflow detection */ }
-extern void kaiser_add_mapping(unsigned long addr, unsigned long size, unsigned long flags); static struct task_struct *dup_task_struct(struct task_struct *orig) { struct task_struct *tsk; @@ -350,9 +350,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig) goto free_ti;
tsk->stack = ti; -#ifdef CONFIG_KAISER - kaiser_add_mapping((unsigned long)tsk->stack, THREAD_SIZE, __PAGE_KERNEL); -#endif + + err= kaiser_add_mapping((unsigned long)tsk->stack, THREAD_SIZE, __PAGE_KERNEL); + if (err) + goto free_ti; #ifdef CONFIG_SECCOMP /* * We must handle setting up seccomp filters once we're under diff --git a/security/Kconfig b/security/Kconfig index 9da2a6029703..cdad6dd3693a 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -32,12 +32,17 @@ config SECURITY If you are unsure how to answer this question, answer N. config KAISER bool "Remove the kernel mapping in user mode" + default y depends on X86_64 depends on !PARAVIRT help This enforces a strict kernel and user space isolation in order to close hardware side channels on kernel address information.
+config KAISER_REAL_SWITCH + bool "KAISER: actually switch page tables" + default y + config SECURITYFS bool "Enable the securityfs filesystem" help
From: Hugh Dickins hughd@google.com
native_pgd_clear() uses native_set_pgd(), so native_set_pgd() must avoid setting the _PAGE_NX bit on an otherwise pgd_none() entry: usually that just generated a warning on exit, but sometimes more mysterious and damaging failures (our production machines could not complete booting).
The original fix to this just avoided adding _PAGE_NX to an empty entry; but eventually more problems surfaced with kexec, and EFI mapping expected to be a problem too. So now instead change native_set_pgd() to update shadow only if _PAGE_USER:
A few places (kernel/machine_kexec_64.c, platform/efi/efi_64.c for sure) use set_pgd() to set up a temporary internal virtual address space, with physical pages remapped at what Kaiser regards as userspace addresses: Kaiser then assumes a shadow pgd follows, which it will try to corrupt.
This appears to be responsible for the recent kexec and kdump failures; though it's unclear how those did not manifest as a problem before. Ah, the shadow pgd will only be assumed to "follow" if the requested pgd is on an even-numbered page: so I suppose it was going wrong 50% of the time all along.
What we need is a flag to set_pgd(), to tell it we're dealing with userspace. Er, isn't that what the pgd's _PAGE_USER bit is saying? Add a test for that. But we cannot do the same for pgd_clear() (which may be called to clear corrupted entries - set aside the question of "corrupt in which pgd?" until later), so there just rely on pgd_clear() not being called in the problematic cases - with a WARN_ON_ONCE() which should fire half the time if it is.
But this is getting too big for an inline function: move it into arch/x86/mm/kaiser.c (which then demands a boot/compressed mod); and de-void and de-space native_get_shadow/normal_pgd() while here.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit edde73205b3fdde8c8a3adfce78cc6d0de72386b) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/boot/compressed/misc.h | 1 + arch/x86/include/asm/pgtable_64.h | 51 ++++++++++----------------------------- arch/x86/mm/kaiser.c | 42 ++++++++++++++++++++++++++++++++ 3 files changed, 56 insertions(+), 38 deletions(-)
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h index 805d25ca5f1d..cbaa2ebfa88d 100644 --- a/arch/x86/boot/compressed/misc.h +++ b/arch/x86/boot/compressed/misc.h @@ -9,6 +9,7 @@ */ #undef CONFIG_PARAVIRT #undef CONFIG_PARAVIRT_SPINLOCKS +#undef CONFIG_KAISER #undef CONFIG_KASAN
#include <linux/linkage.h> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index b0adf0d11c9a..8629be9e6649 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -107,61 +107,36 @@ static inline void native_pud_clear(pud_t *pud) }
#ifdef CONFIG_KAISER -static inline pgd_t * native_get_shadow_pgd(pgd_t *pgdp) +extern pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd); + +static inline pgd_t *native_get_shadow_pgd(pgd_t *pgdp) { - return (pgd_t *)(void*)((unsigned long)(void*)pgdp | (unsigned long)PAGE_SIZE); + return (pgd_t *)((unsigned long)pgdp | (unsigned long)PAGE_SIZE); }
-static inline pgd_t * native_get_normal_pgd(pgd_t *pgdp) +static inline pgd_t *native_get_normal_pgd(pgd_t *pgdp) { - return (pgd_t *)(void*)((unsigned long)(void*)pgdp & ~(unsigned long)PAGE_SIZE); + return (pgd_t *)((unsigned long)pgdp & ~(unsigned long)PAGE_SIZE); } #else -static inline pgd_t * native_get_shadow_pgd(pgd_t *pgdp) +static inline pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd) +{ + return pgd; +} +static inline pgd_t *native_get_shadow_pgd(pgd_t *pgdp) { BUILD_BUG_ON(1); return NULL; } -static inline pgd_t * native_get_normal_pgd(pgd_t *pgdp) +static inline pgd_t *native_get_normal_pgd(pgd_t *pgdp) { return pgdp; } #endif /* CONFIG_KAISER */
-/* - * Page table pages are page-aligned. The lower half of the top - * level is used for userspace and the top half for the kernel. - * This returns true for user pages that need to get copied into - * both the user and kernel copies of the page tables, and false - * for kernel pages that should only be in the kernel copy. - */ -static inline bool is_userspace_pgd(void *__ptr) -{ - unsigned long ptr = (unsigned long)__ptr; - - return ((ptr % PAGE_SIZE) < (PAGE_SIZE / 2)); -} - static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd) { -#ifdef CONFIG_KAISER - pteval_t extra_kern_pgd_flags = 0; - /* Do we need to also populate the shadow pgd? */ - if (is_userspace_pgd(pgdp)) { - native_get_shadow_pgd(pgdp)->pgd = pgd.pgd; - /* - * Even if the entry is *mapping* userspace, ensure - * that userspace can not use it. This way, if we - * get out to userspace running on the kernel CR3, - * userspace will crash instead of running. - */ - extra_kern_pgd_flags = _PAGE_NX; - } - pgdp->pgd = pgd.pgd; - pgdp->pgd |= extra_kern_pgd_flags; -#else /* CONFIG_KAISER */ - *pgdp = pgd; -#endif + *pgdp = kaiser_set_shadow_pgd(pgdp, pgd); }
static inline void native_pgd_clear(pgd_t *pgd) diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index f29c0db1c9b4..df7f6591d5aa 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -303,4 +303,46 @@ void kaiser_remove_mapping(unsigned long start, unsigned long size) unmap_pud_range_nofree(pgd, addr, end); } } + +/* + * Page table pages are page-aligned. The lower half of the top + * level is used for userspace and the top half for the kernel. + * This returns true for user pages that need to get copied into + * both the user and kernel copies of the page tables, and false + * for kernel pages that should only be in the kernel copy. + */ +static inline bool is_userspace_pgd(pgd_t *pgdp) +{ + return ((unsigned long)pgdp % PAGE_SIZE) < (PAGE_SIZE / 2); +} + +pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd) +{ + /* + * Do we need to also populate the shadow pgd? Check _PAGE_USER to + * skip cases like kexec and EFI which make temporary low mappings. + */ + if (pgd.pgd & _PAGE_USER) { + if (is_userspace_pgd(pgdp)) { + native_get_shadow_pgd(pgdp)->pgd = pgd.pgd; + /* + * Even if the entry is *mapping* userspace, ensure + * that userspace can not use it. This way, if we + * get out to userspace running on the kernel CR3, + * userspace will crash instead of running. + */ + pgd.pgd |= _PAGE_NX; + } + } else if (!pgd.pgd) { + /* + * pgd_clear() cannot check _PAGE_USER, and is even used to + * clear corrupted pgd entries: so just rely on cases like + * kexec and EFI never to be using pgd_clear(). + */ + if (!WARN_ON_ONCE((unsigned long)pgdp & PAGE_SIZE) && + is_userspace_pgd(pgdp)) + native_get_shadow_pgd(pgdp)->pgd = pgd.pgd; + } + return pgd; +} #endif /* CONFIG_KAISER */
From: Hugh Dickins hughd@google.com
Kaiser only needs to map one page of the stack; and kernel/fork.c did not build on powerpc (no __PAGE_KERNEL). It's all cleaner if linux/kaiser.h provides kaiser_map_thread_stack() and kaiser_unmap_thread_stack() wrappers around asm/kaiser.h's kaiser_add_mapping() and kaiser_remove_mapping(). And use linux/kaiser.h in init/main.c to avoid the #ifdefs there.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 003e476716906afa135faf605ae0a5c3598c0293) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: init/main.c --- include/linux/kaiser.h | 40 +++++++++++++++++++++++++++++++++------- init/main.c | 6 +----- kernel/fork.c | 7 ++----- 3 files changed, 36 insertions(+), 17 deletions(-)
diff --git a/include/linux/kaiser.h b/include/linux/kaiser.h index 9db5433c2284..4a4d6d911a14 100644 --- a/include/linux/kaiser.h +++ b/include/linux/kaiser.h @@ -1,26 +1,52 @@ -#ifndef _INCLUDE_KAISER_H -#define _INCLUDE_KAISER_H +#ifndef _LINUX_KAISER_H +#define _LINUX_KAISER_H
#ifdef CONFIG_KAISER #include <asm/kaiser.h> + +static inline int kaiser_map_thread_stack(void *stack) +{ + /* + * Map that page of kernel stack on which we enter from user context. + */ + return kaiser_add_mapping((unsigned long)stack + + THREAD_SIZE - PAGE_SIZE, PAGE_SIZE, __PAGE_KERNEL); +} + +static inline void kaiser_unmap_thread_stack(void *stack) +{ + /* + * Note: may be called even when kaiser_map_thread_stack() failed. + */ + kaiser_remove_mapping((unsigned long)stack + + THREAD_SIZE - PAGE_SIZE, PAGE_SIZE); +} #else
/* * These stubs are used whenever CONFIG_KAISER is off, which - * includes architectures that support KAISER, but have it - * disabled. + * includes architectures that support KAISER, but have it disabled. */
static inline void kaiser_init(void) { } -static inline void kaiser_remove_mapping(unsigned long start, unsigned long size) +static inline int kaiser_add_mapping(unsigned long addr, + unsigned long size, unsigned long flags) +{ + return 0; +} +static inline void kaiser_remove_mapping(unsigned long start, + unsigned long size) { } -static inline int kaiser_add_mapping(unsigned long addr, unsigned long size, unsigned long flags) +static inline int kaiser_map_thread_stack(void *stack) { return 0; } +static inline void kaiser_unmap_thread_stack(void *stack) +{ +}
#endif /* !CONFIG_KAISER */ -#endif /* _INCLUDE_KAISER_H */ +#endif /* _LINUX_KAISER_H */ diff --git a/init/main.c b/init/main.c index de1951eaf1cc..c8d3596a721f 100644 --- a/init/main.c +++ b/init/main.c @@ -81,15 +81,13 @@ #include <linux/integrity.h> #include <linux/proc_ns.h> #include <linux/io.h> +#include <linux/kaiser.h>
#include <asm/io.h> #include <asm/bugs.h> #include <asm/setup.h> #include <asm/sections.h> #include <asm/cacheflush.h> -#ifdef CONFIG_KAISER -#include <asm/kaiser.h> -#endif
static int kernel_init(void *);
@@ -490,9 +488,7 @@ static void __init mm_init(void) pgtable_init(); vmalloc_init(); ioremap_huge_init(); -#ifdef CONFIG_KAISER kaiser_init(); -#endif }
asmlinkage __visible void __init start_kernel(void) diff --git a/kernel/fork.c b/kernel/fork.c index eb1ca74c48ce..823bcff9b047 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -168,12 +168,9 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk, return page ? page_address(page) : NULL; }
-extern void kaiser_remove_mapping(unsigned long start_addr, unsigned long size); static inline void free_thread_info(struct thread_info *ti) { -#ifdef CONFIG_KAISER - kaiser_remove_mapping((unsigned long)ti, THREAD_SIZE); -#endif + kaiser_unmap_thread_stack(ti); free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER); } # else @@ -351,7 +348,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
tsk->stack = ti;
- err= kaiser_add_mapping((unsigned long)tsk->stack, THREAD_SIZE, __PAGE_KERNEL); + err = kaiser_map_thread_stack(tsk->stack); if (err) goto free_ti; #ifdef CONFIG_SECCOMP
From: Hugh Dickins hughd@google.com
Include linux/kaiser.h instead of asm/kaiser.h to build ldt.c without CONFIG_KAISER. kaiser_add_mapping() does already return an error code, so fix the FIXME.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 9b94cf97f42ca30fe9b5010900fa6e1d6855a9f6) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/kernel/ldt.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c index 9c6ebaf0a83b..c388247e0353 100644 --- a/arch/x86/kernel/ldt.c +++ b/arch/x86/kernel/ldt.c @@ -48,7 +48,7 @@ static struct ldt_struct *alloc_ldt_struct(int size) { struct ldt_struct *new_ldt; int alloc_size; - int ret = 0; + int ret;
if (size > LDT_ENTRIES) return NULL; @@ -76,10 +76,8 @@ static struct ldt_struct *alloc_ldt_struct(int size) return NULL; }
- // FIXME: make kaiser_add_mapping() return an error code - // when it fails - kaiser_add_mapping((unsigned long)new_ldt->entries, alloc_size, - __PAGE_KERNEL); + ret = kaiser_add_mapping((unsigned long)new_ldt->entries, alloc_size, + __PAGE_KERNEL); if (ret) { __free_ldt_struct(new_ldt); return NULL;
From: Hugh Dickins hughd@google.com
It is absurd that KAISER should depend on SMP, but apparently nobody has tried a UP build before: which breaks on implicit declaration of function 'per_cpu_offset' in arch/x86/mm/kaiser.c.
Now, you would expect that to be trivially fixed up; but looking at the System.map when that block is #ifdef'ed out of kaiser_init(), I see that in a UP build __per_cpu_user_mapped_end is precisely at __per_cpu_user_mapped_start, and the items carefully gathered into that section for user-mapping on SMP, dispersed elsewhere on UP.
So, some other kind of section assignment will be needed on UP, but implementing that is not a priority: just make KAISER depend on SMP for now.
Also inserted a blank line before the option, tidied up the brief Kconfig help message, and added an "If unsure, Y".
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit d94df20135ccfdfb77b1479c501564e9b4ab5bc9) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- security/Kconfig | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/security/Kconfig b/security/Kconfig index cdad6dd3693a..5f4b656bfa4c 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -30,14 +30,16 @@ config SECURITY model will be used.
If you are unsure how to answer this question, answer N. + config KAISER bool "Remove the kernel mapping in user mode" default y - depends on X86_64 - depends on !PARAVIRT + depends on X86_64 && SMP && !PARAVIRT help - This enforces a strict kernel and user space isolation in order to close - hardware side channels on kernel address information. + This enforces a strict kernel and user space isolation, in order + to close hardware side channels on kernel address information. + + If you are unsure how to answer this question, answer Y.
config KAISER_REAL_SWITCH bool "KAISER: actually switch page tables"
From: Hugh Dickins hughd@google.com
pjt has observed that nmi's second (nmi_from_kernel) call to do_nmi() adjusted the %rdi regs arg, rightly when CONFIG_KAISER, but wrongly when not CONFIG_KAISER.
Although the minimal change is to add an #ifdef CONFIG_KAISER around the addq line, that looks cluttered, and I prefer how the first call to do_nmi() handled it: prepare args in %rdi and %rsi before getting into the CONFIG_KAISER block, since it does not touch them at all.
And while we're here, place the "#ifdef CONFIG_KAISER" that follows each, to enclose the "Unconditionally restore CR3" comment: matching how the "Unconditionally use kernel CR3" comment above is enclosed.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 487f0b73d82611a2dc48d7d78409e2e9d994006a) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/entry/entry_64.S (not in this tree) arch/x86/kernel/entry_64.S (patched instead of that) --- arch/x86/kernel/entry_64.S | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 1bda5ebd1013..8e4056d4b1d6 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -1547,12 +1547,12 @@ ENTRY(nmi) movq %rax, %cr3 #endif call do_nmi +#ifdef CONFIG_KAISER /* * Unconditionally restore CR3. I know we return to * kernel code that needs user CR3, but do we ever return * to "user mode" where we need the kernel CR3? */ -#ifdef CONFIG_KAISER popq %rax mov %rax, %cr3 #endif @@ -1772,6 +1772,8 @@ end_repeat_nmi: SWAPGS xorl %ebx, %ebx 1: + movq %rsp, %rdi + movq $-1, %rsi #ifdef CONFIG_KAISER /* Unconditionally use kernel CR3 for do_nmi() */ /* %rax is saved above, so OK to clobber here */ @@ -1785,16 +1787,13 @@ end_repeat_nmi: DEFAULT_FRAME 0 /* XXX: Do we need this? */
/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */ - movq %rsp,%rdi - addq $8, %rdi /* point %rdi at ptregs, fixed up for CR3 */ - movq $-1,%rsi call do_nmi +#ifdef CONFIG_KAISER /* * Unconditionally restore CR3. We might be returning to * kernel code that needs user CR3, like just just before * a sysret. */ -#ifdef CONFIG_KAISER popq %rax mov %rax, %cr3 #endif
From: Hugh Dickins hughd@google.com
Avoid perf crashes: place debug_store in the user-mapped per-cpu area instead of allocating, and use page allocator plus kaiser_add_mapping() to keep the BTS and PEBS buffers user-mapped (that is, present in the user mapping, though visible only to kernel and hardware). The PEBS fixup buffer does not need this treatment.
The need for a user-mapped struct debug_store showed up before doing any conscious perf testing: in a couple of kernel paging oopses on Westmere, implicating the debug_store offset of the per-cpu area.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 20cbe9a3aa2e341824da57ce0ac6d52cbffaa570) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/kernel/cpu/perf_event_intel_ds.c --- arch/x86/kernel/cpu/perf_event_intel_ds.c | 57 ++++++++++++++++++++++++------- 1 file changed, 45 insertions(+), 12 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c index 7f73b3553e2e..e4f109a7ed9a 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c @@ -2,11 +2,15 @@ #include <linux/types.h> #include <linux/slab.h>
+#include <asm/kaiser.h> #include <asm/perf_event.h> #include <asm/insn.h>
#include "perf_event.h"
+static +DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct debug_store, cpu_debug_store); + /* The size of a BTS record in bytes: */ #define BTS_RECORD_SIZE 24
@@ -246,6 +250,39 @@ void fini_debug_store_on_cpu(int cpu)
static DEFINE_PER_CPU(void *, insn_buffer);
+static void *dsalloc(size_t size, gfp_t flags, int node) +{ +#ifdef CONFIG_KAISER + unsigned int order = get_order(size); + struct page *page; + unsigned long addr; + + page = alloc_pages_node(node, flags | __GFP_ZERO, order); + if (!page) + return NULL; + addr = (unsigned long)page_address(page); + if (kaiser_add_mapping(addr, size, __PAGE_KERNEL) < 0) { + __free_pages(page, order); + addr = 0; + } + return (void *)addr; +#else + return kmalloc_node(size, flags | __GFP_ZERO, node); +#endif +} + +static void dsfree(const void *buffer, size_t size) +{ +#ifdef CONFIG_KAISER + if (!buffer) + return; + kaiser_remove_mapping((unsigned long)buffer, size); + free_pages((unsigned long)buffer, get_order(size)); +#else + kfree(buffer); +#endif +} + static int alloc_pebs_buffer(int cpu) { struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds; @@ -256,7 +293,7 @@ static int alloc_pebs_buffer(int cpu) if (!x86_pmu.pebs) return 0;
- buffer = kzalloc_node(PEBS_BUFFER_SIZE, GFP_KERNEL, node); + buffer = dsalloc(PEBS_BUFFER_SIZE, GFP_KERNEL, node); if (unlikely(!buffer)) return -ENOMEM;
@@ -267,7 +304,7 @@ static int alloc_pebs_buffer(int cpu) if (x86_pmu.intel_cap.pebs_format < 2) { ibuffer = kzalloc_node(PEBS_FIXUP_SIZE, GFP_KERNEL, node); if (!ibuffer) { - kfree(buffer); + dsfree(buffer, PEBS_BUFFER_SIZE); return -ENOMEM; } per_cpu(insn_buffer, cpu) = ibuffer; @@ -296,7 +333,8 @@ static void release_pebs_buffer(int cpu) kfree(per_cpu(insn_buffer, cpu)); per_cpu(insn_buffer, cpu) = NULL;
- kfree((void *)(unsigned long)ds->pebs_buffer_base); + dsfree((void *)(unsigned long)ds->pebs_buffer_base, + PEBS_BUFFER_SIZE); ds->pebs_buffer_base = 0; }
@@ -310,7 +348,7 @@ static int alloc_bts_buffer(int cpu) if (!x86_pmu.bts) return 0;
- buffer = kzalloc_node(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node); + buffer = dsalloc(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node); if (unlikely(!buffer)) { WARN_ONCE(1, "%s: BTS buffer allocation failure\n", __func__); return -ENOMEM; @@ -336,19 +374,15 @@ static void release_bts_buffer(int cpu) if (!ds || !x86_pmu.bts) return;
- kfree((void *)(unsigned long)ds->bts_buffer_base); + dsfree((void *)(unsigned long)ds->bts_buffer_base, BTS_BUFFER_SIZE); ds->bts_buffer_base = 0; }
static int alloc_ds_buffer(int cpu) { - int node = cpu_to_node(cpu); - struct debug_store *ds; - - ds = kzalloc_node(sizeof(*ds), GFP_KERNEL, node); - if (unlikely(!ds)) - return -ENOMEM; + struct debug_store *ds = per_cpu_ptr(&cpu_debug_store, cpu);
+ memset(ds, 0, sizeof(*ds)); per_cpu(cpu_hw_events, cpu).ds = ds;
return 0; @@ -362,7 +396,6 @@ static void release_ds_buffer(int cpu) return;
per_cpu(cpu_hw_events, cpu).ds = NULL; - kfree(ds); }
void release_ds_buffers(void)
From: Hugh Dickins hughd@google.com
kaiser_add_user_map() took no notice when kaiser_pagetable_walk() failed. And avoid its might_sleep() when atomic (though atomic at present unused).
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 407c3ff6a24c7cb418b77a124d17e282f9622037) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/mm/kaiser.c --- arch/x86/mm/kaiser.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index df7f6591d5aa..058d0886086b 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -99,11 +99,11 @@ static pte_t *kaiser_pagetable_walk(unsigned long address, bool is_atomic) pgd_t *pgd = native_get_shadow_pgd(pgd_offset_k(address)); gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
- might_sleep(); if (is_atomic) { gfp &= ~GFP_KERNEL; gfp |= __GFP_HIGH; - } + } else + might_sleep();
if (pgd_none(*pgd)) { WARN_ONCE(1, "All shadow pgds should have been populated"); @@ -160,13 +160,17 @@ int kaiser_add_user_map(const void *__start_addr, unsigned long size, unsigned long end_addr = PAGE_ALIGN(start_addr + size); unsigned long target_address;
- for (;address < end_addr; address += PAGE_SIZE) { + for (; address < end_addr; address += PAGE_SIZE) { target_address = get_pa_from_mapping(address); if (target_address == -1) { ret = -EIO; break; } pte = kaiser_pagetable_walk(address, false); + if (!pte) { + ret = -ENOMEM; + break; + } if (pte_none(*pte)) { set_pte(pte, __pte(flags | target_address)); } else {
From: Hugh Dickins hughd@google.com
Mainly deleting a surfeit of blank lines, and reflowing header comment.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 5fbd46c4be78174656b52e1b04d3057a5dd7af66) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/include/asm/kaiser.h | 32 +++++++++++++------------------- 1 file changed, 13 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h index 0703f48777f3..7394ba9f9951 100644 --- a/arch/x86/include/asm/kaiser.h +++ b/arch/x86/include/asm/kaiser.h @@ -1,15 +1,17 @@ #ifndef _ASM_X86_KAISER_H #define _ASM_X86_KAISER_H - -/* This file includes the definitions for the KAISER feature. - * KAISER is a counter measure against x86_64 side channel attacks on the kernel virtual memory. - * It has a shodow-pgd for every process. the shadow-pgd has a minimalistic kernel-set mapped, - * but includes the whole user memory. Within a kernel context switch, or when an interrupt is handled, - * the pgd is switched to the normal one. When the system switches to user mode, the shadow pgd is enabled. - * By this, the virtual memory chaches are freed, and the user may not attack the whole kernel memory. +/* + * This file includes the definitions for the KAISER feature. + * KAISER is a counter measure against x86_64 side channel attacks on + * the kernel virtual memory. It has a shadow pgd for every process: the + * shadow pgd has a minimalistic kernel-set mapped, but includes the whole + * user memory. Within a kernel context switch, or when an interrupt is handled, + * the pgd is switched to the normal one. When the system switches to user mode, + * the shadow pgd is enabled. By this, the virtual memory caches are freed, + * and the user may not attack the whole kernel memory. * - * A minimalistic kernel mapping holds the parts needed to be mapped in user mode, as the entry/exit functions - * of the user space, or the stacks. + * A minimalistic kernel mapping holds the parts needed to be mapped in user + * mode, such as the entry/exit functions of the user space, or the stacks. */ #ifdef __ASSEMBLY__ #ifdef CONFIG_KAISER @@ -48,13 +50,10 @@ _SWITCH_TO_KERNEL_CR3 %rax movq PER_CPU_VAR(unsafe_stack_register_backup), %rax .endm
- .macro SWITCH_USER_CR3_NO_STACK - movq %rax, PER_CPU_VAR(unsafe_stack_register_backup) _SWITCH_TO_USER_CR3 %rax movq PER_CPU_VAR(unsafe_stack_register_backup), %rax - .endm
#else /* CONFIG_KAISER */ @@ -72,7 +71,6 @@ movq PER_CPU_VAR(unsafe_stack_register_backup), %rax
#else /* __ASSEMBLY__ */
- #ifdef CONFIG_KAISER /* * Upon kernel/user mode switch, it may happen that the address @@ -80,7 +78,6 @@ movq PER_CPU_VAR(unsafe_stack_register_backup), %rax * stored. To change the address space, another register is * needed. A register therefore has to be stored/restored. */ - DECLARE_PER_CPU_USER_MAPPED(unsigned long, unsafe_stack_register_backup);
/** @@ -95,7 +92,6 @@ DECLARE_PER_CPU_USER_MAPPED(unsigned long, unsafe_stack_register_backup); */ extern int kaiser_add_mapping(unsigned long addr, unsigned long size, unsigned long flags);
- /** * kaiser_remove_mapping - unmap a virtual memory part of the shadow mapping * @addr: the start address of the range @@ -104,12 +100,12 @@ extern int kaiser_add_mapping(unsigned long addr, unsigned long size, unsigned l extern void kaiser_remove_mapping(unsigned long start, unsigned long size);
/** - * kaiser_initialize_mapping - Initalize the shadow mapping + * kaiser_init - Initialize the shadow mapping * * Most parts of the shadow mapping can be mapped upon boot * time. Only per-process things like the thread stacks * or a new LDT have to be mapped at runtime. These boot- - * time mappings are permanent and nevertunmapped. + * time mappings are permanent and never unmapped. */ extern void kaiser_init(void);
@@ -117,6 +113,4 @@ extern void kaiser_init(void);
#endif /* __ASSEMBLY */
- - #endif /* _ASM_X86_KAISER_H */
From: Hugh Dickins hughd@google.com
Yes, unmap_pud_range_nofree()'s declaration ought to be in a header file really, but I'm not sure we want to use it anyway: so for now just declare it inside kaiser_remove_mapping(). And there doesn't seem to be such a thing as unmap_p4d_range(), even in a 5-level paging tree.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 0c68228f7b39c96cabd89bee3e1d6bd55926df80) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/mm/kaiser.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index 058d0886086b..190371071a02 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -286,8 +286,7 @@ void __init kaiser_init(void) __PAGE_KERNEL); }
-extern void unmap_pud_range_nofree(pgd_t *pgd, unsigned long start, unsigned long end); -// add a mapping to the shadow-mapping, and synchronize the mappings +/* Add a mapping to the shadow mapping, and synchronize the mappings */ int kaiser_add_mapping(unsigned long addr, unsigned long size, unsigned long flags) { return kaiser_add_user_map((const void *)addr, size, flags); @@ -295,15 +294,13 @@ int kaiser_add_mapping(unsigned long addr, unsigned long size, unsigned long fla
void kaiser_remove_mapping(unsigned long start, unsigned long size) { + extern void unmap_pud_range_nofree(pgd_t *pgd, + unsigned long start, unsigned long end); unsigned long end = start + size; unsigned long addr;
for (addr = start; addr < end; addr += PGDIR_SIZE) { pgd_t *pgd = native_get_shadow_pgd(pgd_offset_k(addr)); - /* - * unmap_p4d_range() handles > P4D_SIZE unmaps, - * so no need to trim 'end'. - */ unmap_pud_range_nofree(pgd, addr, end); } }
From: Hugh Dickins hughd@google.com
When removing the bogus comment from kaiser_remove_mapping(), I really ought to have checked the extent of its bogosity: as Neel points out, there is nothing to stop unmap_pud_range_nofree() from continuing beyond the end of a pud (and starting in the wrong position on the next).
Fix kaiser_remove_mapping() to constrain the extent and advance pgd pointer correctly: use pgd_addr_end() macro as used throughout base mm (but don't assume page-rounded start and size in this case).
But this bug was very unlikely to trigger in this backport: since any buddy allocation is contained within a single pud extent, and we are not using vmapped stacks (and are only mapping one page of stack anyway): the only way to hit this bug here would be when freeing a large modified ldt.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit f127705d26b34c053e59b47aef84b3ea564dd743) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/mm/kaiser.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index 190371071a02..8996f3292596 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -297,11 +297,13 @@ void kaiser_remove_mapping(unsigned long start, unsigned long size) extern void unmap_pud_range_nofree(pgd_t *pgd, unsigned long start, unsigned long end); unsigned long end = start + size; - unsigned long addr; + unsigned long addr, next; + pgd_t *pgd;
- for (addr = start; addr < end; addr += PGDIR_SIZE) { - pgd_t *pgd = native_get_shadow_pgd(pgd_offset_k(addr)); - unmap_pud_range_nofree(pgd, addr, end); + pgd = native_get_shadow_pgd(pgd_offset_k(start)); + for (addr = start; addr < end; pgd++, addr = next) { + next = pgd_addr_end(addr, end); + unmap_pud_range_nofree(pgd, addr, next); } }
From: Hugh Dickins hughd@google.com
While trying to get our gold link to work, four cleanups: matched the gdt_page declaration to its definition; in fiddling unsuccessfully with PERCPU_INPUT(), lined up backslashes; lined up the backslashes according to convention in percpu-defs.h; deleted the unused irq_stack_pointer addition to irq_stack_union.
Sad to report that aligning backslashes does not appear to help gold align to 8192: but while these did not help, they are worth keeping.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit c52e55a2a82d3a44189810d35717d81cb4cf61d4) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/include/asm/desc.h | 2 +- arch/x86/include/asm/processor.h | 5 ----- include/asm-generic/vmlinux.lds.h | 18 ++++++++---------- include/linux/percpu-defs.h | 22 +++++++++++----------- 4 files changed, 20 insertions(+), 27 deletions(-)
diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h index 4e10d73cf018..880db91d9457 100644 --- a/arch/x86/include/asm/desc.h +++ b/arch/x86/include/asm/desc.h @@ -43,7 +43,7 @@ struct gdt_page { struct desc_struct gdt[GDT_ENTRIES]; } __attribute__((aligned(PAGE_SIZE)));
-DECLARE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page); +DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct gdt_page, gdt_page);
static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu) { diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 4abbd7d7cfb0..21c7924595e4 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -449,11 +449,6 @@ union irq_stack_union { char gs_base[40]; unsigned long stack_canary; }; - - struct { - char irq_stack_pointer[64]; - char unused[IRQ_STACK_SIZE - 64]; - }; };
DECLARE_PER_CPU_FIRST(union irq_stack_union, irq_stack_union) __visible; diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index 88c02f965f2e..8e265917b0a9 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -715,16 +715,14 @@ */ #define PERCPU_INPUT(cacheline) \ VMLINUX_SYMBOL(__per_cpu_start) = .; \ - \ - VMLINUX_SYMBOL(__per_cpu_user_mapped_start) = .; \ - *(.data..percpu..first) \ - . = ALIGN(cacheline); \ - *(.data..percpu..user_mapped) \ - *(.data..percpu..user_mapped..shared_aligned) \ - . = ALIGN(PAGE_SIZE); \ - *(.data..percpu..user_mapped..page_aligned) \ - VMLINUX_SYMBOL(__per_cpu_user_mapped_end) = .; \ - \ + VMLINUX_SYMBOL(__per_cpu_user_mapped_start) = .; \ + *(.data..percpu..first) \ + . = ALIGN(cacheline); \ + *(.data..percpu..user_mapped) \ + *(.data..percpu..user_mapped..shared_aligned) \ + . = ALIGN(PAGE_SIZE); \ + *(.data..percpu..user_mapped..page_aligned) \ + VMLINUX_SYMBOL(__per_cpu_user_mapped_end) = .; \ . = ALIGN(PAGE_SIZE); \ *(.data..percpu..page_aligned) \ . = ALIGN(cacheline); \ diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h index 141d0de913a9..7cecbdbf5c25 100644 --- a/include/linux/percpu-defs.h +++ b/include/linux/percpu-defs.h @@ -121,10 +121,10 @@ #define DEFINE_PER_CPU(type, name) \ DEFINE_PER_CPU_SECTION(type, name, "")
-#define DECLARE_PER_CPU_USER_MAPPED(type, name) \ +#define DECLARE_PER_CPU_USER_MAPPED(type, name) \ DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
-#define DEFINE_PER_CPU_USER_MAPPED(type, name) \ +#define DEFINE_PER_CPU_USER_MAPPED(type, name) \ DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
/* @@ -156,11 +156,11 @@ DEFINE_PER_CPU_SECTION(type, name, PER_CPU_SHARED_ALIGNED_SECTION) \ ____cacheline_aligned_in_smp
-#define DECLARE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name) \ +#define DECLARE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name) \ DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \ ____cacheline_aligned_in_smp
-#define DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name) \ +#define DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name) \ DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \ ____cacheline_aligned_in_smp
@@ -185,18 +185,18 @@ /* * Declaration/definition used for per-CPU variables that must be page aligned and need to be mapped in user mode. */ -#define DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name) \ - DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \ - __aligned(PAGE_SIZE) +#define DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name) \ + DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \ + __aligned(PAGE_SIZE)
-#define DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name) \ - DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \ - __aligned(PAGE_SIZE) +#define DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name) \ + DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \ + __aligned(PAGE_SIZE)
/* * Declaration/definition used for per-CPU variables that must be read mostly. */ -#define DECLARE_PER_CPU_READ_MOSTLY(type, name) \ +#define DECLARE_PER_CPU_READ_MOSTLY(type, name) \ DECLARE_PER_CPU_SECTION(type, name, "..read_mostly")
#define DEFINE_PER_CPU_READ_MOSTLY(type, name) \
From: Hugh Dickins hughd@google.com
There's a 0x1000 in various places, which looks better with a name.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit aeda21d77e22fb382c51fd3f6bbb18df69bc032f) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/entry/entry_64.S (not in this tree) arch/x86/kernel/entry_64.S (patched instead of that) --- arch/x86/include/asm/kaiser.h | 7 +++++-- arch/x86/kernel/entry_64.S | 4 ++-- 2 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h index 7394ba9f9951..051acf678cda 100644 --- a/arch/x86/include/asm/kaiser.h +++ b/arch/x86/include/asm/kaiser.h @@ -13,13 +13,16 @@ * A minimalistic kernel mapping holds the parts needed to be mapped in user * mode, such as the entry/exit functions of the user space, or the stacks. */ + +#define KAISER_SHADOW_PGD_OFFSET 0x1000 + #ifdef __ASSEMBLY__ #ifdef CONFIG_KAISER
.macro _SWITCH_TO_KERNEL_CR3 reg movq %cr3, \reg #ifdef CONFIG_KAISER_REAL_SWITCH -andq $(~0x1000), \reg +andq $(~KAISER_SHADOW_PGD_OFFSET), \reg #endif movq \reg, %cr3 .endm @@ -27,7 +30,7 @@ movq \reg, %cr3 .macro _SWITCH_TO_USER_CR3 reg movq %cr3, \reg #ifdef CONFIG_KAISER_REAL_SWITCH -orq $(0x1000), \reg +orq $(KAISER_SHADOW_PGD_OFFSET), \reg #endif movq \reg, %cr3 .endm diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 8e4056d4b1d6..5c6c11d9d779 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -1542,7 +1542,7 @@ ENTRY(nmi) movq %cr3, %rax pushq %rax #ifdef CONFIG_KAISER_REAL_SWITCH - andq $(~0x1000), %rax + andq $(~KAISER_SHADOW_PGD_OFFSET), %rax #endif movq %rax, %cr3 #endif @@ -1780,7 +1780,7 @@ end_repeat_nmi: movq %cr3, %rax pushq %rax #ifdef CONFIG_KAISER_REAL_SWITCH - andq $(~0x1000), %rax + andq $(~KAISER_SHADOW_PGD_OFFSET), %rax #endif movq %rax, %cr3 #endif
From: Hugh Dickins hughd@google.com
We fail to see what CONFIG_KAISER_REAL_SWITCH is for: it seems to be left over from early development, and now just obscures tricky parts of the code. Delete it before adding PCIDs, or nokaiser boot option.
(Or if there is some good reason to keep the option, then it needs a help text - and a "depends on KAISER", so that all those without KAISER are not asked the question.)
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit b9d2ccc54e17b5aa50dd0c036d3f4fb4e5248d54) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/entry/entry_64.S (not in this tree) arch/x86/kernel/entry_64.S (patched instead of that) --- arch/x86/include/asm/kaiser.h | 4 ---- arch/x86/kernel/entry_64.S | 4 ---- security/Kconfig | 4 ---- 3 files changed, 12 deletions(-)
diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h index 051acf678cda..e0fc45e77aee 100644 --- a/arch/x86/include/asm/kaiser.h +++ b/arch/x86/include/asm/kaiser.h @@ -21,17 +21,13 @@
.macro _SWITCH_TO_KERNEL_CR3 reg movq %cr3, \reg -#ifdef CONFIG_KAISER_REAL_SWITCH andq $(~KAISER_SHADOW_PGD_OFFSET), \reg -#endif movq \reg, %cr3 .endm
.macro _SWITCH_TO_USER_CR3 reg movq %cr3, \reg -#ifdef CONFIG_KAISER_REAL_SWITCH orq $(KAISER_SHADOW_PGD_OFFSET), \reg -#endif movq \reg, %cr3 .endm
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 5c6c11d9d779..d2b2372e01c6 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -1541,9 +1541,7 @@ ENTRY(nmi) /* %rax is saved above, so OK to clobber here */ movq %cr3, %rax pushq %rax -#ifdef CONFIG_KAISER_REAL_SWITCH andq $(~KAISER_SHADOW_PGD_OFFSET), %rax -#endif movq %rax, %cr3 #endif call do_nmi @@ -1779,9 +1777,7 @@ end_repeat_nmi: /* %rax is saved above, so OK to clobber here */ movq %cr3, %rax pushq %rax -#ifdef CONFIG_KAISER_REAL_SWITCH andq $(~KAISER_SHADOW_PGD_OFFSET), %rax -#endif movq %rax, %cr3 #endif DEFAULT_FRAME 0 /* XXX: Do we need this? */ diff --git a/security/Kconfig b/security/Kconfig index 5f4b656bfa4c..9d63046e4794 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -41,10 +41,6 @@ config KAISER
If you are unsure how to answer this question, answer Y.
-config KAISER_REAL_SWITCH - bool "KAISER: actually switch page tables" - default y - config SECURITYFS bool "Enable the securityfs filesystem" help
From: Hugh Dickins hughd@google.com
The kaiser update made an interesting choice, never to free any shadow page tables. Contention on global spinlock was worrying, particularly with it held across page table scans when freeing. Something had to be done: I was going to add refcounting; but simply never to free them is an appealing choice, minimizing contention without complicating the code (the more a page table is found already, the less the spinlock is used).
But leaking pages in this way is also a worry: can we get away with it? At the very least, we need a count to show how bad it actually gets: in principle, one might end up wasting about 1/256 of memory that way (1/512 for when direct-mapped pages have to be user-mapped, plus 1/512 for when they are user-mapped from the vmalloc area on another occasion (but we don't have vmalloc'ed stacks, so only large ldts are vmalloc'ed).
Add per-cpu stat NR_KAISERTABLE: including 256 at startup for the shared pgd entries, and 1 for each intermediate page table added thereafter for user-mapping - but leave out the 1 per mm, for its shadow pgd, because that distracts from the monotonic increase. Shown in /proc/vmstat as nr_overhead (0 if kaiser not enabled).
In practice, it doesn't look so bad so far: more like 1/12000 after nine hours of gtests below; and movable pageblock segregation should tend to cluster the kaiser tables into a subset of the address space (if not, they will be bad for compaction too). But production may tell a different story: keep an eye on this number, and bring back lighter freeing if it gets out of control (maybe a shrinker).
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 3e3d38fd9832e82a8cb1a5b1154acfa43ac08d15) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/mm/kaiser.c | 16 +++++++++++----- include/linux/mmzone.h | 3 ++- mm/vmstat.c | 1 + 3 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index 8996f3292596..50d650799f39 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -122,9 +122,11 @@ static pte_t *kaiser_pagetable_walk(unsigned long address, bool is_atomic) if (!new_pmd_page) return NULL; spin_lock(&shadow_table_allocation_lock); - if (pud_none(*pud)) + if (pud_none(*pud)) { set_pud(pud, __pud(_KERNPG_TABLE | __pa(new_pmd_page))); - else + __inc_zone_page_state(virt_to_page((void *) + new_pmd_page), NR_KAISERTABLE); + } else free_page(new_pmd_page); spin_unlock(&shadow_table_allocation_lock); } @@ -140,9 +142,11 @@ static pte_t *kaiser_pagetable_walk(unsigned long address, bool is_atomic) if (!new_pte_page) return NULL; spin_lock(&shadow_table_allocation_lock); - if (pmd_none(*pmd)) + if (pmd_none(*pmd)) { set_pmd(pmd, __pmd(_KERNPG_TABLE | __pa(new_pte_page))); - else + __inc_zone_page_state(virt_to_page((void *) + new_pte_page), NR_KAISERTABLE); + } else free_page(new_pte_page); spin_unlock(&shadow_table_allocation_lock); } @@ -206,11 +210,13 @@ static void __init kaiser_init_all_pgds(void) pgd = native_get_shadow_pgd(pgd_offset_k((unsigned long )0)); for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++) { pgd_t new_pgd; - pud_t *pud = pud_alloc_one(&init_mm, PAGE_OFFSET + i * PGDIR_SIZE); + pud_t *pud = pud_alloc_one(&init_mm, + PAGE_OFFSET + i * PGDIR_SIZE); if (!pud) { WARN_ON(1); break; } + inc_zone_page_state(virt_to_page(pud), NR_KAISERTABLE); new_pgd = __pgd(_KERNPG_TABLE |__pa(pud)); /* * Make sure not to stomp on some other pgd entry. diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 54d74f6eb233..42c56e0c947f 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -131,8 +131,9 @@ enum zone_stat_item { NR_SLAB_RECLAIMABLE, NR_SLAB_UNRECLAIMABLE, NR_PAGETABLE, /* used for pagetables */ - NR_KERNEL_STACK, /* Second 128 byte cacheline */ + NR_KERNEL_STACK, + NR_KAISERTABLE, NR_UNSTABLE_NFS, /* NFS unstable pages */ NR_BOUNCE, NR_VMSCAN_WRITE, diff --git a/mm/vmstat.c b/mm/vmstat.c index 4f5cd974e11a..8e0cbcd0fccc 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -714,6 +714,7 @@ const char * const vmstat_text[] = { "nr_slab_unreclaimable", "nr_page_table_pages", "nr_kernel_stack", + "nr_overhead", "nr_unstable", "nr_bounce", "nr_vmscan_write",
From: Dave Hansen dave.hansen@linux.intel.com
Merged performance improvements to Kaiser, using distinct kernel and user Process Context Identifiers to minimize the TLB flushing.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit eb82151d0b1df53d1ad8d060ecd554ca12eb552a) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/entry/entry_64.S (not in this tree) arch/x86/kernel/entry_64.S (patched instead of that) arch/x86/entry/entry_64_compat.S (not in this tree) arch/x86/ia32/ia32entry.S (patched instead of that) arch/x86/include/asm/tlbflush.h --- arch/x86/ia32/ia32entry.S | 1 + arch/x86/include/asm/cpufeature.h | 1 + arch/x86/include/asm/kaiser.h | 15 ++++++++-- arch/x86/include/asm/pgtable_types.h | 26 ++++++++++++++++ arch/x86/include/asm/tlbflush.h | 37 +++++++++++++++++++++-- arch/x86/include/uapi/asm/processor-flags.h | 3 +- arch/x86/kernel/cpu/common.c | 34 +++++++++++++++++++++ arch/x86/kernel/entry_64.S | 10 +++++-- arch/x86/kvm/x86.c | 3 +- arch/x86/mm/kaiser.c | 7 +++++ arch/x86/mm/tlb.c | 46 +++++++++++++++++++++++++++-- 11 files changed, 172 insertions(+), 11 deletions(-)
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S index 665e2c7887fe..070470d161f6 100644 --- a/arch/x86/ia32/ia32entry.S +++ b/arch/x86/ia32/ia32entry.S @@ -15,6 +15,7 @@ #include <asm/irqflags.h> #include <asm/asm.h> #include <asm/smap.h> +#include <asm/pgtable_types.h> #include <asm/kaiser.h> #include <linux/linkage.h> #include <linux/err.h> diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index 3d6606fb97d0..80a2b9487b29 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -185,6 +185,7 @@ #define X86_FEATURE_ARAT ( 7*32+ 1) /* Always Running APIC Timer */ #define X86_FEATURE_CPB ( 7*32+ 2) /* AMD Core Performance Boost */ #define X86_FEATURE_EPB ( 7*32+ 3) /* IA32_ENERGY_PERF_BIAS support */ +#define X86_FEATURE_INVPCID_SINGLE ( 7*32+ 4) /* Effectively INVPCID && CR4.PCIDE=1 */ #define X86_FEATURE_PLN ( 7*32+ 5) /* Intel Power Limit Notification */ #define X86_FEATURE_PTS ( 7*32+ 6) /* Intel Package Thermal Status */ #define X86_FEATURE_DTHERM ( 7*32+ 7) /* Digital Thermal Sensor */ diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h index e0fc45e77aee..360ff3bc44a9 100644 --- a/arch/x86/include/asm/kaiser.h +++ b/arch/x86/include/asm/kaiser.h @@ -1,5 +1,8 @@ #ifndef _ASM_X86_KAISER_H #define _ASM_X86_KAISER_H + +#include <uapi/asm/processor-flags.h> /* For PCID constants */ + /* * This file includes the definitions for the KAISER feature. * KAISER is a counter measure against x86_64 side channel attacks on @@ -21,13 +24,21 @@
.macro _SWITCH_TO_KERNEL_CR3 reg movq %cr3, \reg -andq $(~KAISER_SHADOW_PGD_OFFSET), \reg +andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), \reg +orq X86_CR3_PCID_KERN_VAR, \reg movq \reg, %cr3 .endm
.macro _SWITCH_TO_USER_CR3 reg movq %cr3, \reg -orq $(KAISER_SHADOW_PGD_OFFSET), \reg +andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), \reg +/* + * This can obviously be one instruction by putting the + * KAISER_SHADOW_PGD_OFFSET bit in the X86_CR3_PCID_USER_VAR. + * But, just leave it now for simplicity. + */ +orq X86_CR3_PCID_USER_VAR, \reg +orq $(KAISER_SHADOW_PGD_OFFSET), \reg movq \reg, %cr3 .endm
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 7a5fc4410c00..b3c77e6529c2 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -106,6 +106,32 @@ _PAGE_SOFT_DIRTY) #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
+/* The ASID is the lower 12 bits of CR3 */ +#define X86_CR3_PCID_ASID_MASK (_AC((1<<12)-1,UL)) + +/* Mask for all the PCID-related bits in CR3: */ +#define X86_CR3_PCID_MASK (X86_CR3_PCID_NOFLUSH | X86_CR3_PCID_ASID_MASK) +#if defined(CONFIG_KAISER) && defined(CONFIG_X86_64) +#define X86_CR3_PCID_ASID_KERN (_AC(0x4,UL)) +#define X86_CR3_PCID_ASID_USER (_AC(0x6,UL)) + +#define X86_CR3_PCID_KERN_FLUSH (X86_CR3_PCID_ASID_KERN) +#define X86_CR3_PCID_USER_FLUSH (X86_CR3_PCID_ASID_USER) +#define X86_CR3_PCID_KERN_NOFLUSH (X86_CR3_PCID_NOFLUSH | X86_CR3_PCID_ASID_KERN) +#define X86_CR3_PCID_USER_NOFLUSH (X86_CR3_PCID_NOFLUSH | X86_CR3_PCID_ASID_USER) +#else +#define X86_CR3_PCID_ASID_KERN (_AC(0x0,UL)) +#define X86_CR3_PCID_ASID_USER (_AC(0x0,UL)) +/* + * PCIDs are unsupported on 32-bit and none of these bits can be + * set in CR3: + */ +#define X86_CR3_PCID_KERN_FLUSH (0) +#define X86_CR3_PCID_USER_FLUSH (0) +#define X86_CR3_PCID_KERN_NOFLUSH (0) +#define X86_CR3_PCID_USER_NOFLUSH (0) +#endif + /* * The cache modes defined here are used to translate between pure SW usage * and the HW defined cache mode bits and/or PAT entries. diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 09a70d8d293e..b558d1996621 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -12,7 +12,6 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr, unsigned long type) { struct { u64 d[2]; } desc = { { pcid, addr } }; - /* * The memory clobber is because the whole point is to invalidate * stale TLB entries and, especially if we're flushing global @@ -134,6 +133,14 @@ static inline void cr4_set_bits_and_update_boot(unsigned long mask) static inline void __native_flush_tlb(void) { native_write_cr3(native_read_cr3()); + /* + * We are no longer using globals with KAISER, so a + * "nonglobals" flush would work too. But, this is more + * conservative. + * + * Note, this works with CR4.PCIDE=0 or 1. + */ + invpcid_flush_all(); }
static inline void __native_flush_tlb_global_irq_disabled(void) @@ -155,6 +162,8 @@ static inline void __native_flush_tlb_global(void) /* * Using INVPCID is considerably faster than a pair of writes * to CR4 sandwiched inside an IRQ flag save/restore. + * + * Note, this works with CR4.PCIDE=0 or 1. */ invpcid_flush_all(); return; @@ -174,7 +183,31 @@ static inline void __native_flush_tlb_global(void)
static inline void __native_flush_tlb_single(unsigned long addr) { - asm volatile("invlpg (%0)" ::"r" (addr) : "memory"); + /* + * SIMICS #GP's if you run INVPCID with type 2/3 + * and X86_CR4_PCIDE clear. Shame! + * + * The ASIDs used below are hard-coded. But, we must not + * call invpcid(type=1/2) before CR4.PCIDE=1. Just call + * invpcid in the case we are called early. + */ + if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE)) { + asm volatile("invlpg (%0)" ::"r" (addr) : "memory"); + return; + } + /* Flush the address out of both PCIDs. */ + /* + * An optimization here might be to determine addresses + * that are only kernel-mapped and only flush the kernel + * ASID. But, userspace flushes are probably much more + * important performance-wise. + * + * Make sure to do only a single invpcid when KAISER is + * disabled and we have only a single ASID. + */ + if (X86_CR3_PCID_ASID_KERN != X86_CR3_PCID_ASID_USER) + invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr); + invpcid_flush_one(X86_CR3_PCID_ASID_USER, addr); }
static inline void __flush_tlb_all(void) diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h index 180a0c3c224d..bd4513b7b877 100644 --- a/arch/x86/include/uapi/asm/processor-flags.h +++ b/arch/x86/include/uapi/asm/processor-flags.h @@ -79,7 +79,8 @@ #define X86_CR3_PWT _BITUL(X86_CR3_PWT_BIT) #define X86_CR3_PCD_BIT 4 /* Page Cache Disable */ #define X86_CR3_PCD _BITUL(X86_CR3_PCD_BIT) -#define X86_CR3_PCID_MASK _AC(0x00000fff,UL) /* PCID Mask */ +#define X86_CR3_PCID_NOFLUSH_BIT 63 /* Preserve old PCID */ +#define X86_CR3_PCID_NOFLUSH _BITULL(X86_CR3_PCID_NOFLUSH_BIT)
/* * Intel CPU features in CR4 diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 47194fac7eb7..6fb0d332f440 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -339,11 +339,45 @@ static __always_inline void setup_smap(struct cpuinfo_x86 *c) } }
+/* + * These can have bit 63 set, so we can not just use a plain "or" + * instruction to get their value or'd into CR3. It would take + * another register. So, we use a memory reference to these + * instead. + * + * This is also handy because systems that do not support + * PCIDs just end up or'ing a 0 into their CR3, which does + * no harm. + */ +__aligned(PAGE_SIZE) unsigned long X86_CR3_PCID_KERN_VAR = 0; +__aligned(PAGE_SIZE) unsigned long X86_CR3_PCID_USER_VAR = 0; + static void setup_pcid(struct cpuinfo_x86 *c) { if (cpu_has(c, X86_FEATURE_PCID)) { if (cpu_has(c, X86_FEATURE_PGE)) { cr4_set_bits(X86_CR4_PCIDE); + /* + * These variables are used by the entry/exit + * code to change PCIDs. + */ +#ifdef CONFIG_KAISER + X86_CR3_PCID_KERN_VAR = X86_CR3_PCID_KERN_NOFLUSH; + X86_CR3_PCID_USER_VAR = X86_CR3_PCID_USER_NOFLUSH; +#endif + /* + * INVPCID has two "groups" of types: + * 1/2: Invalidate an individual address + * 3/4: Invalidate all contexts + * + * 1/2 take a PCID, but 3/4 do not. So, 3/4 + * ignore the PCID argument in the descriptor. + * But, we have to be careful not to call 1/2 + * with an actual non-zero PCID in them before + * we do the above cr4_set_bits(). + */ + if (cpu_has(c, X86_FEATURE_INVPCID)) + set_cpu_cap(c, X86_FEATURE_INVPCID_SINGLE); } else { /* * flush_tlb_all(), as currently implemented, won't diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index d2b2372e01c6..0f7bba928264 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -1541,7 +1541,10 @@ ENTRY(nmi) /* %rax is saved above, so OK to clobber here */ movq %cr3, %rax pushq %rax - andq $(~KAISER_SHADOW_PGD_OFFSET), %rax + /* mask off "user" bit of pgd address and 12 PCID bits: */ + andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax + /* Add back kernel PCID and "no flush" bit */ + orq X86_CR3_PCID_KERN_VAR, %rax movq %rax, %cr3 #endif call do_nmi @@ -1777,7 +1780,10 @@ end_repeat_nmi: /* %rax is saved above, so OK to clobber here */ movq %cr3, %rax pushq %rax - andq $(~KAISER_SHADOW_PGD_OFFSET), %rax + /* mask off "user" bit of pgd address and 12 PCID bits: */ + andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax + /* Add back kernel PCID and "no flush" bit */ + orq X86_CR3_PCID_KERN_VAR, %rax movq %rax, %cr3 #endif DEFAULT_FRAME 0 /* XXX: Do we need this? */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e4e7d45fd551..153afbb6c080 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -733,7 +733,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4) return 1;
/* PCID can not be enabled when cr3[11:0]!=000H or EFER.LMA=0 */ - if ((kvm_read_cr3(vcpu) & X86_CR3_PCID_MASK) || !is_long_mode(vcpu)) + if ((kvm_read_cr3(vcpu) & X86_CR3_PCID_ASID_MASK) || + !is_long_mode(vcpu)) return 1; }
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index 50d650799f39..91968328ccdf 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -240,6 +240,8 @@ static void __init kaiser_init_all_pgds(void) } while (0)
extern char __per_cpu_user_mapped_start[], __per_cpu_user_mapped_end[]; +extern unsigned long X86_CR3_PCID_KERN_VAR; +extern unsigned long X86_CR3_PCID_USER_VAR; /* * If anything in here fails, we will likely die on one of the * first kernel->user transitions and init will die. But, we @@ -290,6 +292,11 @@ void __init kaiser_init(void) kaiser_add_user_map_early(&debug_idt_table, sizeof(gate_desc) * NR_VECTORS, __PAGE_KERNEL); + + kaiser_add_user_map_early(&X86_CR3_PCID_KERN_VAR, PAGE_SIZE, + __PAGE_KERNEL); + kaiser_add_user_map_early(&X86_CR3_PCID_USER_VAR, PAGE_SIZE, + __PAGE_KERNEL); }
/* Add a mapping to the shadow mapping, and synchronize the mappings */ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 95624d903c0e..7176496112ff 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -34,6 +34,46 @@ struct flush_tlb_info { unsigned long flush_end; };
+static void load_new_mm_cr3(pgd_t *pgdir) +{ + unsigned long new_mm_cr3 = __pa(pgdir); + + /* + * KAISER, plus PCIDs needs some extra work here. But, + * if either of features is not present, we need no + * PCIDs here and just do a normal, full TLB flush with + * the write_cr3() + */ + if (!IS_ENABLED(CONFIG_KAISER) || + !cpu_feature_enabled(X86_FEATURE_PCID)) + goto out_set_cr3; + /* + * We reuse the same PCID for different tasks, so we must + * flush all the entires for the PCID out when we change + * tasks. + */ + new_mm_cr3 = X86_CR3_PCID_KERN_FLUSH | __pa(pgdir); + + /* + * The flush from load_cr3() may leave old TLB entries + * for userspace in place. We must flush that context + * separately. We can theoretically delay doing this + * until we actually load up the userspace CR3, but + * that's a bit tricky. We have to have the "need to + * flush userspace PCID" bit per-cpu and check it in the + * exit-to-userspace paths. + */ + invpcid_flush_single_context(X86_CR3_PCID_ASID_USER); + +out_set_cr3: + /* + * Caution: many callers of this function expect + * that load_cr3() is serializing and orders TLB + * fills with respect to the mm_cpumask writes. + */ + write_cr3(new_mm_cr3); +} + /* * We cannot call mmdrop() because we are in interrupt context, * instead update mm->cpu_vm_mask. @@ -45,7 +85,7 @@ void leave_mm(int cpu) BUG(); if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) { cpumask_clear_cpu(cpu, mm_cpumask(active_mm)); - load_cr3(swapper_pg_dir); + load_new_mm_cr3(swapper_pg_dir); /* * This gets called in the idle path where RCU * functions differently. Tracing normally @@ -105,7 +145,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, * ordering guarantee we need. * */ - load_cr3(next->pgd); + load_new_mm_cr3(next->pgd);
trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
@@ -150,7 +190,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, * As above, load_cr3() is serializing and orders TLB * fills with respect to the mm_cpumask write. */ - load_cr3(next->pgd); + load_new_mm_cr3(next->pgd); trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); load_mm_cr4(next); load_mm_ldt(next);
From: Hugh Dickins hughd@google.com
We have many machines (Westmere, Sandybridge, Ivybridge) supporting PCID but not INVPCID: on these load_new_mm_cr3() simply crashed.
Flushing user context inside load_new_mm_cr3() without the use of invpcid is difficult: momentarily switch from kernel to user context and back to do so? I'm not sure whether that can be safely done at all, and would risk polluting user context with kernel internals, and kernel context with stale user externals.
Instead, follow the hint in the comment that was there: change X86_CR3_PCID_USER_VAR to be a per-cpu variable, then load_new_mm_cr3() can leave a note in it, for SWITCH_USER_CR3 on return to userspace to flush user context TLB, instead of default X86_CR3_PCID_USER_NOFLUSH.
Which works well enough that there's no need to do it this way only when invpcid is unsupported: it's a good alternative to invpcid here. But there's a couple of inlines in asm/tlbflush.h that need to do the same trick, so it's best to localize all this per-cpu business in mm/kaiser.c: moving that part of the initialization from setup_pcid() to kaiser_setup_pcid(); with kaiser_flush_tlb_on_return_to_user() the function for noting an X86_CR3_PCID_USER_FLUSH. And let's keep a KAISER_SHADOW_PGD_OFFSET in there, to avoid the extra OR on exit.
I did try to make the feature tests in asm/tlbflush.h more consistent with each other: there seem to be far too many ways of performing such tests, and I don't have a good grasp of their differences. At first I converted them all to be static_cpu_has(): but that proved to be a mistake, as the comment in __native_flush_tlb_single() hints; so then I reversed and made them all this_cpu_has(). Probably all gratuitous change, but that's the way it's working at present.
I am slightly bothered by the way non-per-cpu X86_CR3_PCID_KERN_VAR gets re-initialized by each cpu (before and after these changes): no problem when (as usual) all cpus on a machine have the same features, but in principle incorrect. However, my experiment to per-cpu-ify that one did not end well...
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 0731188fc74cc2237975a2b5bedd36e2463ef10b) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/include/asm/tlbflush.h --- arch/x86/include/asm/kaiser.h | 18 ++++++++------ arch/x86/include/asm/tlbflush.h | 54 ++++++++++++++++++++++++++++++++--------- arch/x86/kernel/cpu/common.c | 22 +---------------- arch/x86/mm/kaiser.c | 50 +++++++++++++++++++++++++++++++++----- arch/x86/mm/tlb.c | 46 ++++++++++++++--------------------- 5 files changed, 117 insertions(+), 73 deletions(-)
diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h index 360ff3bc44a9..009bca514c20 100644 --- a/arch/x86/include/asm/kaiser.h +++ b/arch/x86/include/asm/kaiser.h @@ -32,13 +32,12 @@ movq \reg, %cr3 .macro _SWITCH_TO_USER_CR3 reg movq %cr3, \reg andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), \reg -/* - * This can obviously be one instruction by putting the - * KAISER_SHADOW_PGD_OFFSET bit in the X86_CR3_PCID_USER_VAR. - * But, just leave it now for simplicity. - */ -orq X86_CR3_PCID_USER_VAR, \reg -orq $(KAISER_SHADOW_PGD_OFFSET), \reg +orq PER_CPU_VAR(X86_CR3_PCID_USER_VAR), \reg +js 9f +// FLUSH this time, reset to NOFLUSH for next time +// But if nopcid? Consider using 0x80 for user pcid? +movb $(0x80), PER_CPU_VAR(X86_CR3_PCID_USER_VAR+7) +9: movq \reg, %cr3 .endm
@@ -90,6 +89,11 @@ movq PER_CPU_VAR(unsafe_stack_register_backup), %rax */ DECLARE_PER_CPU_USER_MAPPED(unsigned long, unsafe_stack_register_backup);
+extern unsigned long X86_CR3_PCID_KERN_VAR; +DECLARE_PER_CPU(unsigned long, X86_CR3_PCID_USER_VAR); + +extern char __per_cpu_user_mapped_start[], __per_cpu_user_mapped_end[]; + /** * kaiser_add_mapping - map a virtual memory part to the shadow (user) mapping * @addr: the start address of the range diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index b558d1996621..ff8c5eb95caf 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -12,6 +12,7 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr, unsigned long type) { struct { u64 d[2]; } desc = { { pcid, addr } }; + /* * The memory clobber is because the whole point is to invalidate * stale TLB entries and, especially if we're flushing global @@ -130,17 +131,42 @@ static inline void cr4_set_bits_and_update_boot(unsigned long mask) cr4_set_bits(mask); }
+/* + * Declare a couple of kaiser interfaces here for convenience, + * to avoid the need for asm/kaiser.h in unexpected places. + */ +#ifdef CONFIG_KAISER +extern void kaiser_setup_pcid(void); +extern void kaiser_flush_tlb_on_return_to_user(void); +#else +static inline void kaiser_setup_pcid(void) +{ +} +static inline void kaiser_flush_tlb_on_return_to_user(void) +{ +} +#endif + static inline void __native_flush_tlb(void) { - native_write_cr3(native_read_cr3()); + if (this_cpu_has(X86_FEATURE_INVPCID)) { + /* + * Note, this works with CR4.PCIDE=0 or 1. + */ + invpcid_flush_all_nonglobals(); + return; + } + /* - * We are no longer using globals with KAISER, so a - * "nonglobals" flush would work too. But, this is more - * conservative. - * - * Note, this works with CR4.PCIDE=0 or 1. + * If current->mm == NULL then we borrow a mm which may change during a + * task switch and therefore we must not be preempted while we write CR3 + * back: */ - invpcid_flush_all(); + preempt_disable(); + if (this_cpu_has(X86_FEATURE_PCID)) + kaiser_flush_tlb_on_return_to_user(); + native_write_cr3(native_read_cr3()); + preempt_enable(); }
static inline void __native_flush_tlb_global_irq_disabled(void) @@ -156,9 +182,13 @@ static inline void __native_flush_tlb_global_irq_disabled(void)
static inline void __native_flush_tlb_global(void) { +#ifdef CONFIG_KAISER + /* Globals are not used at all */ + __native_flush_tlb(); +#else unsigned long flags;
- if (static_cpu_has(X86_FEATURE_INVPCID)) { + if (this_cpu_has(X86_FEATURE_INVPCID)) { /* * Using INVPCID is considerably faster than a pair of writes * to CR4 sandwiched inside an IRQ flag save/restore. @@ -175,10 +205,9 @@ static inline void __native_flush_tlb_global(void) * be called from deep inside debugging code.) */ raw_local_irq_save(flags); - __native_flush_tlb_global_irq_disabled(); - raw_local_irq_restore(flags); +#endif }
static inline void __native_flush_tlb_single(unsigned long addr) @@ -189,9 +218,12 @@ static inline void __native_flush_tlb_single(unsigned long addr) * * The ASIDs used below are hard-coded. But, we must not * call invpcid(type=1/2) before CR4.PCIDE=1. Just call - * invpcid in the case we are called early. + * invlpg in the case we are called early. */ + if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE)) { + if (this_cpu_has(X86_FEATURE_PCID)) + kaiser_flush_tlb_on_return_to_user(); asm volatile("invlpg (%0)" ::"r" (addr) : "memory"); return; } diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 6fb0d332f440..9e09e190ba92 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -339,32 +339,11 @@ static __always_inline void setup_smap(struct cpuinfo_x86 *c) } }
-/* - * These can have bit 63 set, so we can not just use a plain "or" - * instruction to get their value or'd into CR3. It would take - * another register. So, we use a memory reference to these - * instead. - * - * This is also handy because systems that do not support - * PCIDs just end up or'ing a 0 into their CR3, which does - * no harm. - */ -__aligned(PAGE_SIZE) unsigned long X86_CR3_PCID_KERN_VAR = 0; -__aligned(PAGE_SIZE) unsigned long X86_CR3_PCID_USER_VAR = 0; - static void setup_pcid(struct cpuinfo_x86 *c) { if (cpu_has(c, X86_FEATURE_PCID)) { if (cpu_has(c, X86_FEATURE_PGE)) { cr4_set_bits(X86_CR4_PCIDE); - /* - * These variables are used by the entry/exit - * code to change PCIDs. - */ -#ifdef CONFIG_KAISER - X86_CR3_PCID_KERN_VAR = X86_CR3_PCID_KERN_NOFLUSH; - X86_CR3_PCID_USER_VAR = X86_CR3_PCID_USER_NOFLUSH; -#endif /* * INVPCID has two "groups" of types: * 1/2: Invalidate an individual address @@ -390,6 +369,7 @@ static void setup_pcid(struct cpuinfo_x86 *c) clear_cpu_cap(c, X86_FEATURE_PCID); } } + kaiser_setup_pcid(); }
/* diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index 91968328ccdf..7923190cd123 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -12,12 +12,26 @@ #include <linux/ftrace.h>
#include <asm/kaiser.h> +#include <asm/tlbflush.h> /* to verify its kaiser declarations */ #include <asm/pgtable.h> #include <asm/pgalloc.h> #include <asm/desc.h> + #ifdef CONFIG_KAISER +__visible +DEFINE_PER_CPU_USER_MAPPED(unsigned long, unsafe_stack_register_backup); + +/* + * These can have bit 63 set, so we can not just use a plain "or" + * instruction to get their value or'd into CR3. It would take + * another register. So, we use a memory reference to these instead. + * + * This is also handy because systems that do not support PCIDs + * just end up or'ing a 0 into their CR3, which does no harm. + */ +__aligned(PAGE_SIZE) unsigned long X86_CR3_PCID_KERN_VAR; +DEFINE_PER_CPU(unsigned long, X86_CR3_PCID_USER_VAR);
-__visible DEFINE_PER_CPU_USER_MAPPED(unsigned long, unsafe_stack_register_backup); /* * At runtime, the only things we map are some things for CPU * hotplug, and stacks for new processes. No two CPUs will ever @@ -239,9 +253,6 @@ static void __init kaiser_init_all_pgds(void) WARN_ON(__ret); \ } while (0)
-extern char __per_cpu_user_mapped_start[], __per_cpu_user_mapped_end[]; -extern unsigned long X86_CR3_PCID_KERN_VAR; -extern unsigned long X86_CR3_PCID_USER_VAR; /* * If anything in here fails, we will likely die on one of the * first kernel->user transitions and init will die. But, we @@ -295,8 +306,6 @@ void __init kaiser_init(void)
kaiser_add_user_map_early(&X86_CR3_PCID_KERN_VAR, PAGE_SIZE, __PAGE_KERNEL); - kaiser_add_user_map_early(&X86_CR3_PCID_USER_VAR, PAGE_SIZE, - __PAGE_KERNEL); }
/* Add a mapping to the shadow mapping, and synchronize the mappings */ @@ -361,4 +370,33 @@ pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd) } return pgd; } + +void kaiser_setup_pcid(void) +{ + unsigned long kern_cr3 = 0; + unsigned long user_cr3 = KAISER_SHADOW_PGD_OFFSET; + + if (this_cpu_has(X86_FEATURE_PCID)) { + kern_cr3 |= X86_CR3_PCID_KERN_NOFLUSH; + user_cr3 |= X86_CR3_PCID_USER_NOFLUSH; + } + /* + * These variables are used by the entry/exit + * code to change PCID and pgd and TLB flushing. + */ + X86_CR3_PCID_KERN_VAR = kern_cr3; + this_cpu_write(X86_CR3_PCID_USER_VAR, user_cr3); +} + +/* + * Make a note that this cpu will need to flush USER tlb on return to user. + * Caller checks whether this_cpu_has(X86_FEATURE_PCID) before calling: + * if cpu does not, then the NOFLUSH bit will never have been set. + */ +void kaiser_flush_tlb_on_return_to_user(void) +{ + this_cpu_write(X86_CR3_PCID_USER_VAR, + X86_CR3_PCID_USER_FLUSH | KAISER_SHADOW_PGD_OFFSET); +} +EXPORT_SYMBOL(kaiser_flush_tlb_on_return_to_user); #endif /* CONFIG_KAISER */ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 7176496112ff..9c9657f139cd 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -6,13 +6,14 @@ #include <linux/interrupt.h> #include <linux/module.h> #include <linux/cpu.h> +#include <linux/debugfs.h>
#include <asm/tlbflush.h> #include <asm/mmu_context.h> #include <asm/cache.h> #include <asm/apic.h> #include <asm/uv/uv.h> -#include <linux/debugfs.h> +#include <asm/kaiser.h>
/* * TLB flushing, formerly SMP-only @@ -38,34 +39,23 @@ static void load_new_mm_cr3(pgd_t *pgdir) { unsigned long new_mm_cr3 = __pa(pgdir);
- /* - * KAISER, plus PCIDs needs some extra work here. But, - * if either of features is not present, we need no - * PCIDs here and just do a normal, full TLB flush with - * the write_cr3() - */ - if (!IS_ENABLED(CONFIG_KAISER) || - !cpu_feature_enabled(X86_FEATURE_PCID)) - goto out_set_cr3; - /* - * We reuse the same PCID for different tasks, so we must - * flush all the entires for the PCID out when we change - * tasks. - */ - new_mm_cr3 = X86_CR3_PCID_KERN_FLUSH | __pa(pgdir); - - /* - * The flush from load_cr3() may leave old TLB entries - * for userspace in place. We must flush that context - * separately. We can theoretically delay doing this - * until we actually load up the userspace CR3, but - * that's a bit tricky. We have to have the "need to - * flush userspace PCID" bit per-cpu and check it in the - * exit-to-userspace paths. - */ - invpcid_flush_single_context(X86_CR3_PCID_ASID_USER); +#ifdef CONFIG_KAISER + if (this_cpu_has(X86_FEATURE_PCID)) { + /* + * We reuse the same PCID for different tasks, so we must + * flush all the entries for the PCID out when we change tasks. + * Flush KERN below, flush USER when returning to userspace in + * kaiser's SWITCH_USER_CR3 (_SWITCH_TO_USER_CR3) macro. + * + * invpcid_flush_single_context(X86_CR3_PCID_ASID_USER) could + * do it here, but can only be used if X86_FEATURE_INVPCID is + * available - and many machines support pcid without invpcid. + */ + new_mm_cr3 |= X86_CR3_PCID_KERN_FLUSH; + kaiser_flush_tlb_on_return_to_user(); + } +#endif /* CONFIG_KAISER */
-out_set_cr3: /* * Caution: many callers of this function expect * that load_cr3() is serializing and orders TLB
From: Hugh Dickins hughd@google.com
Why was 4 chosen for kernel PCID and 6 for user PCID? No good reason in a backport where PCIDs are only used for Kaiser.
If we continue with those, then we shall need to add Andy Lutomirski's 4.13 commit 6c690ee1039b ("x86/mm: Split read_cr3() into read_cr3_pa() and __read_cr3()"), which deals with the problem of read_cr3() callers finding stray bits in the cr3 that they expected to be page-aligned; and for hibernation, his 4.14 commit f34902c5c6c0 ("x86/hibernate/64: Mask off CR3's PCID bits in the saved CR3").
But if 0 is used for kernel PCID, then there's no need to add in those commits - whenever the kernel looks, it sees 0 in the lower bits; and 0 for kernel seems an obvious choice.
And I naughtily propose 128 for user PCID. Because there's a place in _SWITCH_TO_USER_CR3 where it takes note of the need for TLB FLUSH, but needs to reset that to NOFLUSH for the next occasion. Currently it does so with a "movb $(0x80)" into the high byte of the per-cpu quadword, but that will cause a machine without PCID support to crash. Now, if %al just happened to have 0x80 in it at that point, on a machine with PCID support, but 0 on a machine without PCID support...
(That will go badly wrong once the pgd can be at a physical address above 2^56, but even with 5-level paging, physical goes up to 2^52.)
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 3b4ce0e1a17228eec71815d7997e49e403ebf2a7) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/include/asm/kaiser.h | 19 ++++++++++++------- arch/x86/include/asm/pgtable_types.h | 7 ++++--- arch/x86/mm/tlb.c | 3 +++ 3 files changed, 19 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h index 009bca514c20..110a73e0572d 100644 --- a/arch/x86/include/asm/kaiser.h +++ b/arch/x86/include/asm/kaiser.h @@ -29,14 +29,19 @@ orq X86_CR3_PCID_KERN_VAR, \reg movq \reg, %cr3 .endm
-.macro _SWITCH_TO_USER_CR3 reg +.macro _SWITCH_TO_USER_CR3 reg regb +/* + * regb must be the low byte portion of reg: because we have arranged + * for the low byte of the user PCID to serve as the high byte of NOFLUSH + * (0x80 for each when PCID is enabled, or 0x00 when PCID and NOFLUSH are + * not enabled): so that the one register can update both memory and cr3. + */ movq %cr3, \reg andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), \reg orq PER_CPU_VAR(X86_CR3_PCID_USER_VAR), \reg js 9f -// FLUSH this time, reset to NOFLUSH for next time -// But if nopcid? Consider using 0x80 for user pcid? -movb $(0x80), PER_CPU_VAR(X86_CR3_PCID_USER_VAR+7) +/* FLUSH this time, reset to NOFLUSH for next time (if PCID enabled) */ +movb \regb, PER_CPU_VAR(X86_CR3_PCID_USER_VAR+7) 9: movq \reg, %cr3 .endm @@ -49,7 +54,7 @@ popq %rax
.macro SWITCH_USER_CR3 pushq %rax -_SWITCH_TO_USER_CR3 %rax +_SWITCH_TO_USER_CR3 %rax %al popq %rax .endm
@@ -61,7 +66,7 @@ movq PER_CPU_VAR(unsafe_stack_register_backup), %rax
.macro SWITCH_USER_CR3_NO_STACK movq %rax, PER_CPU_VAR(unsafe_stack_register_backup) -_SWITCH_TO_USER_CR3 %rax +_SWITCH_TO_USER_CR3 %rax %al movq PER_CPU_VAR(unsafe_stack_register_backup), %rax .endm
@@ -69,7 +74,7 @@ movq PER_CPU_VAR(unsafe_stack_register_backup), %rax
.macro SWITCH_KERNEL_CR3 reg .endm -.macro SWITCH_USER_CR3 reg +.macro SWITCH_USER_CR3 reg regb .endm .macro SWITCH_USER_CR3_NO_STACK .endm diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index b3c77e6529c2..93554f37df99 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -111,16 +111,17 @@
/* Mask for all the PCID-related bits in CR3: */ #define X86_CR3_PCID_MASK (X86_CR3_PCID_NOFLUSH | X86_CR3_PCID_ASID_MASK) +#define X86_CR3_PCID_ASID_KERN (_AC(0x0,UL)) + #if defined(CONFIG_KAISER) && defined(CONFIG_X86_64) -#define X86_CR3_PCID_ASID_KERN (_AC(0x4,UL)) -#define X86_CR3_PCID_ASID_USER (_AC(0x6,UL)) +/* Let X86_CR3_PCID_ASID_USER be usable for the X86_CR3_PCID_NOFLUSH bit */ +#define X86_CR3_PCID_ASID_USER (_AC(0x80,UL))
#define X86_CR3_PCID_KERN_FLUSH (X86_CR3_PCID_ASID_KERN) #define X86_CR3_PCID_USER_FLUSH (X86_CR3_PCID_ASID_USER) #define X86_CR3_PCID_KERN_NOFLUSH (X86_CR3_PCID_NOFLUSH | X86_CR3_PCID_ASID_KERN) #define X86_CR3_PCID_USER_NOFLUSH (X86_CR3_PCID_NOFLUSH | X86_CR3_PCID_ASID_USER) #else -#define X86_CR3_PCID_ASID_KERN (_AC(0x0,UL)) #define X86_CR3_PCID_ASID_USER (_AC(0x0,UL)) /* * PCIDs are unsupported on 32-bit and none of these bits can be diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 9c9657f139cd..749e2134bd7a 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -50,6 +50,9 @@ static void load_new_mm_cr3(pgd_t *pgdir) * invpcid_flush_single_context(X86_CR3_PCID_ASID_USER) could * do it here, but can only be used if X86_FEATURE_INVPCID is * available - and many machines support pcid without invpcid. + * + * The line below is a no-op: X86_CR3_PCID_KERN_FLUSH is now 0; + * but keep that line in there in case something changes. */ new_mm_cr3 |= X86_CR3_PCID_KERN_FLUSH; kaiser_flush_tlb_on_return_to_user();
From: Hugh Dickins hughd@google.com
Mostly this commit is just unshouting X86_CR3_PCID_KERN_VAR and X86_CR3_PCID_USER_VAR: we usually name variables in lower-case.
But why does x86_cr3_pcid_noflush need to be __aligned(PAGE_SIZE)? Ah, it's a leftover from when kaiser_add_user_map() once complained about mapping the same page twice. Make it __read_mostly instead. (I'm a little uneasy about all the unrelated data which shares its page getting user-mapped too, but that was so before, and not a big deal: though we call it user-mapped, it's not mapped with _PAGE_USER.)
And there is a little change around the two calls to do_nmi(). Previously they set the NOFLUSH bit (if PCID supported) when forcing to kernel context before do_nmi(); now they also have the NOFLUSH bit set (if PCID supported) when restoring context after: nothing done in do_nmi() should require a TLB to be flushed here.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 20268a10ffecd9fcc04880b21fc99a9192394599) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/entry/entry_64.S (not in this tree) arch/x86/kernel/entry_64.S (patched instead of that) --- arch/x86/include/asm/kaiser.h | 11 +++++------ arch/x86/kernel/entry_64.S | 8 ++++---- arch/x86/mm/kaiser.c | 13 +++++++------ 3 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h index 110a73e0572d..48d8d70dd8c7 100644 --- a/arch/x86/include/asm/kaiser.h +++ b/arch/x86/include/asm/kaiser.h @@ -25,7 +25,7 @@ .macro _SWITCH_TO_KERNEL_CR3 reg movq %cr3, \reg andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), \reg -orq X86_CR3_PCID_KERN_VAR, \reg +orq x86_cr3_pcid_noflush, \reg movq \reg, %cr3 .endm
@@ -37,11 +37,10 @@ movq \reg, %cr3 * not enabled): so that the one register can update both memory and cr3. */ movq %cr3, \reg -andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), \reg -orq PER_CPU_VAR(X86_CR3_PCID_USER_VAR), \reg +orq PER_CPU_VAR(x86_cr3_pcid_user), \reg js 9f /* FLUSH this time, reset to NOFLUSH for next time (if PCID enabled) */ -movb \regb, PER_CPU_VAR(X86_CR3_PCID_USER_VAR+7) +movb \regb, PER_CPU_VAR(x86_cr3_pcid_user+7) 9: movq \reg, %cr3 .endm @@ -94,8 +93,8 @@ movq PER_CPU_VAR(unsafe_stack_register_backup), %rax */ DECLARE_PER_CPU_USER_MAPPED(unsigned long, unsafe_stack_register_backup);
-extern unsigned long X86_CR3_PCID_KERN_VAR; -DECLARE_PER_CPU(unsigned long, X86_CR3_PCID_USER_VAR); +extern unsigned long x86_cr3_pcid_noflush; +DECLARE_PER_CPU(unsigned long, x86_cr3_pcid_user);
extern char __per_cpu_user_mapped_start[], __per_cpu_user_mapped_end[];
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 0f7bba928264..0508420c6cde 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -1540,11 +1540,11 @@ ENTRY(nmi) /* Unconditionally use kernel CR3 for do_nmi() */ /* %rax is saved above, so OK to clobber here */ movq %cr3, %rax + /* If PCID enabled, NOFLUSH now and NOFLUSH on return */ + orq x86_cr3_pcid_noflush, %rax pushq %rax /* mask off "user" bit of pgd address and 12 PCID bits: */ andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax - /* Add back kernel PCID and "no flush" bit */ - orq X86_CR3_PCID_KERN_VAR, %rax movq %rax, %cr3 #endif call do_nmi @@ -1779,11 +1779,11 @@ end_repeat_nmi: /* Unconditionally use kernel CR3 for do_nmi() */ /* %rax is saved above, so OK to clobber here */ movq %cr3, %rax + /* If PCID enabled, NOFLUSH now and NOFLUSH on return */ + orq x86_cr3_pcid_noflush, %rax pushq %rax /* mask off "user" bit of pgd address and 12 PCID bits: */ andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax - /* Add back kernel PCID and "no flush" bit */ - orq X86_CR3_PCID_KERN_VAR, %rax movq %rax, %cr3 #endif DEFAULT_FRAME 0 /* XXX: Do we need this? */ diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index 7923190cd123..f6e85067d58e 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -29,8 +29,8 @@ DEFINE_PER_CPU_USER_MAPPED(unsigned long, unsafe_stack_register_backup); * This is also handy because systems that do not support PCIDs * just end up or'ing a 0 into their CR3, which does no harm. */ -__aligned(PAGE_SIZE) unsigned long X86_CR3_PCID_KERN_VAR; -DEFINE_PER_CPU(unsigned long, X86_CR3_PCID_USER_VAR); +unsigned long x86_cr3_pcid_noflush __read_mostly; +DEFINE_PER_CPU(unsigned long, x86_cr3_pcid_user);
/* * At runtime, the only things we map are some things for CPU @@ -304,7 +304,8 @@ void __init kaiser_init(void) sizeof(gate_desc) * NR_VECTORS, __PAGE_KERNEL);
- kaiser_add_user_map_early(&X86_CR3_PCID_KERN_VAR, PAGE_SIZE, + kaiser_add_user_map_early(&x86_cr3_pcid_noflush, + sizeof(x86_cr3_pcid_noflush), __PAGE_KERNEL); }
@@ -384,8 +385,8 @@ void kaiser_setup_pcid(void) * These variables are used by the entry/exit * code to change PCID and pgd and TLB flushing. */ - X86_CR3_PCID_KERN_VAR = kern_cr3; - this_cpu_write(X86_CR3_PCID_USER_VAR, user_cr3); + x86_cr3_pcid_noflush = kern_cr3; + this_cpu_write(x86_cr3_pcid_user, user_cr3); }
/* @@ -395,7 +396,7 @@ void kaiser_setup_pcid(void) */ void kaiser_flush_tlb_on_return_to_user(void) { - this_cpu_write(X86_CR3_PCID_USER_VAR, + this_cpu_write(x86_cr3_pcid_user, X86_CR3_PCID_USER_FLUSH | KAISER_SHADOW_PGD_OFFSET); } EXPORT_SYMBOL(kaiser_flush_tlb_on_return_to_user);
From: Hugh Dickins hughd@google.com
Neel Natu points out that paranoid_entry() was wrong to assume that an entry that did not need swapgs would not need SWITCH_KERNEL_CR3: paranoid_entry (used for debug breakpoint, int3, double fault or MCE; though I think it's only the MCE case that is cause for concern here) can break in at an awkward time, between cr3 switch and swapgs, but its handling always needs kernel gs and kernel cr3.
Easy to fix in itself, but paranoid_entry() also needs to convey to paranoid_exit() (and my reading of macro idtentry says paranoid_entry and paranoid_exit are always paired) how to restore the prior state. The swapgs state is already conveyed by %ebx (0 or 1), so extend that also to convey when SWITCH_USER_CR3 will be needed (2 or 3).
(Yes, I'd much prefer that 0 meant no swapgs, whereas it's the other way round: and a convention shared with error_entry() and error_exit(), which I don't want to touch. Perhaps I should have inverted the bit for switch cr3 too, but did not.)
paranoid_exit() would be straightforward, except for TRACE_IRQS: it did TRACE_IRQS_IRETQ when doing swapgs, but TRACE_IRQS_IRETQ_DEBUG when not: which is it supposed to use when SWITCH_USER_CR3 is split apart from that? As best as I can determine, commit 5963e317b1e9 ("ftrace/x86: Do not change stacks in DEBUG when calling lockdep") missed the swapgs case, and should have used TRACE_IRQS_IRETQ_DEBUG there too (the discrepancy has nothing to do with the liberal use of _NO_STACK and _UNSAFE_STACK hereabouts: TRACE_IRQS_OFF_DEBUG has just been used in all cases); discrepancy lovingly preserved across several paranoid_exit() cleanups, but I'm now removing it.
Neel further indicates that to use SWITCH_USER_CR3_NO_STACK there in paranoid_exit() is now not only unnecessary but unsafe: might corrupt syscall entry's unsafe_stack_register_backup of %rax. Just use SWITCH_USER_CR3: and delete SWITCH_USER_CR3_NO_STACK altogether, before we make the mistake of using it again.
hughd adds: this commit fixes an issue in the Kaiser-without-PCIDs part of the series, and ought to be moved earlier, if you decided to make a release of Kaiser-without-PCIDs.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit fc8334e6b3e5d28afd4eec8a74493933f73b2784) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/entry/entry_64.S (not in this tree) arch/x86/kernel/entry_64.S (patched instead of that) --- arch/x86/include/asm/kaiser.h | 8 -------- arch/x86/kernel/entry_64.S | 46 +++++++++++++++++++++++++++++++++---------- 2 files changed, 36 insertions(+), 18 deletions(-)
diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h index 48d8d70dd8c7..3dc5f4c39b3e 100644 --- a/arch/x86/include/asm/kaiser.h +++ b/arch/x86/include/asm/kaiser.h @@ -63,20 +63,12 @@ _SWITCH_TO_KERNEL_CR3 %rax movq PER_CPU_VAR(unsafe_stack_register_backup), %rax .endm
-.macro SWITCH_USER_CR3_NO_STACK -movq %rax, PER_CPU_VAR(unsafe_stack_register_backup) -_SWITCH_TO_USER_CR3 %rax %al -movq PER_CPU_VAR(unsafe_stack_register_backup), %rax -.endm - #else /* CONFIG_KAISER */
.macro SWITCH_KERNEL_CR3 reg .endm .macro SWITCH_USER_CR3 reg regb .endm -.macro SWITCH_USER_CR3_NO_STACK -.endm .macro SWITCH_KERNEL_CR3_NO_STACK .endm
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 0508420c6cde..3973c43903f5 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -1300,7 +1300,11 @@ idtentry machine_check has_error_code=0 paranoid=1 do_sym=*machine_check_vector( /* * Save all registers in pt_regs, and switch gs if needed. * Use slow, but surefire "are we in kernel?" check. - * Return: ebx=0: need swapgs on exit, ebx=1: otherwise + * + * Return: ebx=0: needs swapgs but not SWITCH_USER_CR3 in paranoid_exit + * ebx=1: needs neither swapgs nor SWITCH_USER_CR3 in paranoid_exit + * ebx=2: needs both swapgs and SWITCH_USER_CR3 in paranoid_exit + * ebx=3: needs SWITCH_USER_CR3 but not swapgs in paranoid_exit */ ENTRY(paranoid_entry) XCPT_FRAME 1 15*8 @@ -1313,9 +1317,26 @@ ENTRY(paranoid_entry) testl %edx,%edx js 1f /* negative -> in kernel */ SWAPGS - SWITCH_KERNEL_CR3 xorl %ebx,%ebx -1: ret +1: +#ifdef CONFIG_KAISER + /* + * We might have come in between a swapgs and a SWITCH_KERNEL_CR3 + * on entry, or between a SWITCH_USER_CR3 and a swapgs on exit. + * Do a conditional SWITCH_KERNEL_CR3: this could safely be done + * unconditionally, but we need to find out whether the reverse + * should be done on return (conveyed to paranoid_exit in %ebx). + */ + movq %cr3, %rax + testl $KAISER_SHADOW_PGD_OFFSET, %eax + jz 2f + orl $2, %ebx + andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax + orq x86_cr3_pcid_noflush, %rax + movq %rax, %cr3 +2: +#endif + ret CFI_ENDPROC END(paranoid_entry)
@@ -1328,21 +1349,26 @@ END(paranoid_entry) * in syscall entry), so checking for preemption here would * be complicated. Fortunately, we there's no good reason * to try to handle preemption here. + * On entry: ebx=0: needs swapgs but not SWITCH_USER_CR3 + * ebx=1: needs neither swapgs nor SWITCH_USER_CR3 + * ebx=2: needs both swapgs and SWITCH_USER_CR3 + * ebx=3: needs SWITCH_USER_CR3 but not swapgs */ -/* On entry, ebx is "no swapgs" flag (1: don't need swapgs, 0: need it) */ ENTRY(paranoid_exit) DEFAULT_FRAME DISABLE_INTERRUPTS(CLBR_NONE) TRACE_IRQS_OFF_DEBUG - testl %ebx,%ebx /* swapgs needed? */ + TRACE_IRQS_IRETQ_DEBUG +#ifdef CONFIG_KAISER + testl $2, %ebx /* SWITCH_USER_CR3 needed? */ + jz paranoid_exit_no_switch + SWITCH_USER_CR3 +paranoid_exit_no_switch: +#endif + testl $1, %ebx /* swapgs needed? */ jnz paranoid_exit_no_swapgs - TRACE_IRQS_IRETQ - SWITCH_USER_CR3_NO_STACK SWAPGS_UNSAFE_STACK - jmp paranoid_exit_restore paranoid_exit_no_swapgs: - TRACE_IRQS_IRETQ_DEBUG -paranoid_exit_restore: RESTORE_EXTRA_REGS RESTORE_C_REGS REMOVE_PT_GPREGS_FROM_STACK 8
From: Hugh Dickins hughd@google.com
Synthetic filesystem mempressure testing has shown softlockups, with hour-long page allocation stalls, and pgd_alloc() trying for order:1 with __GFP_REPEAT in one of the backtraces each time.
That's _pgd_alloc() going for a Kaiser double-pgd, using the __GFP_REPEAT common to all page table allocations, but actually having no effect on order:0 (see should_alloc_oom() and should_continue_reclaim() in this tree, but beware that ports to another tree might behave differently).
Order:1 stack allocation has been working satisfactorily without __GFP_REPEAT forever, and page table allocation only asks __GFP_REPEAT for awkward occasions in a long-running process: it's not appropriate at fork or exec time, and seems to be doing much more harm than good: getting those contiguous pages under very heavy mempressure can be hard (though even without it, Kaiser does generate more mempressure).
Mask out that __GFP_REPEAT inside _pgd_alloc(). Why not take it out of the PGALLOG_GFP altogether, as v4.7 commit a3a9a59d2067 ("x86: get rid of superfluous __GFP_REPEAT") did? Because I think that might make a difference to our page table memcg charging, which I'd prefer not to interfere with at this time.
hughd adds: __alloc_pages_slowpath() in the 4.4.89-stable tree handles __GFP_REPEAT a little differently than in prod kernel or 3.18.72-stable, so it may not always be exactly a no-op on order:0 pages, as said above; but I think still appropriate to omit it from Kaiser or non-Kaiser pgd.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit d41f46f778951b0ea851ca52b88b2549c6336b47) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/mm/pgtable.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index b0848a7e9c68..20b8438f35c0 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -6,7 +6,7 @@ #include <asm/fixmap.h> #include <asm/mtrr.h>
-#define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO +#define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO)
#ifdef CONFIG_HIGHPTE #define PGALLOC_USER_GFP __GFP_HIGHMEM @@ -354,7 +354,9 @@ static inline void _pgd_free(pgd_t *pgd)
static inline pgd_t *_pgd_alloc(void) { - return (pgd_t *)__get_free_pages(PGALLOC_GFP, PGD_ALLOCATION_ORDER); + /* No __GFP_REPEAT: to avoid page allocation stalls in order-1 case */ + return (pgd_t *)__get_free_pages(PGALLOC_GFP & ~__GFP_REPEAT, + PGD_ALLOCATION_ORDER); }
static inline void _pgd_free(pgd_t *pgd)
From: Hugh Dickins hughd@google.com
An error from kaiser_add_mapping() here is not at all likely, but Eric Biggers rightly points out that __free_ldt_struct() relies on new_ldt->size being initialized: move that up.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 500943e57db8d3e298e98f595f835c5b613e843b) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/kernel/ldt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c index c388247e0353..5797d437710d 100644 --- a/arch/x86/kernel/ldt.c +++ b/arch/x86/kernel/ldt.c @@ -78,11 +78,11 @@ static struct ldt_struct *alloc_ldt_struct(int size)
ret = kaiser_add_mapping((unsigned long)new_ldt->entries, alloc_size, __PAGE_KERNEL); + new_ldt->size = size; if (ret) { __free_ldt_struct(new_ldt); return NULL; } - new_ldt->size = size; return new_ldt; }
From: Hugh Dickins hughd@google.com
Added "nokaiser" boot option: an early param like "noinvpcid". Most places now check int kaiser_enabled (#defined 0 when not CONFIG_KAISER) instead of #ifdef CONFIG_KAISER; but entry_64.S and entry_64_compat.S are using the ALTERNATIVE technique, which patches in the preferred instructions at runtime. That technique is tied to x86 cpu features, so X86_FEATURE_KAISER is fabricated.
Prior to "nokaiser", Kaiser #defined _PAGE_GLOBAL 0: revert that, but be careful with both _PAGE_GLOBAL and CR4.PGE: setting them when nokaiser like when !CONFIG_KAISER, but not setting either when kaiser - neither matters on its own, but it's hard to be sure that _PAGE_GLOBAL won't get set in some obscure corner, or something add PGE into CR4. By omitting _PAGE_GLOBAL from __supported_pte_mask when kaiser_enabled, all page table setup which uses pte_pfn() masks it out of the ptes.
It's slightly shameful that the same declaration versus definition of kaiser_enabled appears in not one, not two, but in three header files (asm/kaiser.h, asm/pgtable.h, asm/tlbflush.h). I felt safer that way, than with #including any of those in any of the others; and did not feel it worth an asm/kaiser_enabled.h - kernel/cpu/common.c includes them all, so we shall hear about it if they get out of synch.
Cleanups while in the area: removed the silly #ifdef CONFIG_KAISER from kaiser.c; removed the unused native_get_normal_pgd(); removed the spurious reg clutter from SWITCH_*_CR3 macro stubs; corrected some comments. But more interestingly, set CR4.PSE in secondary_startup_64: the manual is clear that it does not matter whether it's 0 or 1 when 4-level-pts are enabled, but I was distracted to find cr4 different on BSP and auxiliaries - BSP alone was adding PSE, in probe_page_size_mask().
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit e345dcc9481543edf4a0a5df4c4c2f9597b0a997) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/entry/entry_64.S (not in this tree) arch/x86/kernel/entry_64.S (patched instead of that) --- Documentation/kernel-parameters.txt | 2 ++ arch/x86/include/asm/cpufeature.h | 3 +++ arch/x86/include/asm/kaiser.h | 27 ++++++++++++++++++------- arch/x86/include/asm/pgtable.h | 20 ++++++++++++------ arch/x86/include/asm/pgtable_64.h | 13 ++++-------- arch/x86/include/asm/pgtable_types.h | 4 ---- arch/x86/include/asm/tlbflush.h | 39 +++++++++++++++++++++++------------- arch/x86/kernel/cpu/common.c | 28 +++++++++++++++++++++++++- arch/x86/kernel/entry_64.S | 15 +++++++------- arch/x86/kernel/espfix_64.c | 3 ++- arch/x86/kernel/head_64.S | 4 ++-- arch/x86/mm/init.c | 2 +- arch/x86/mm/init_64.c | 10 +++++++++ arch/x86/mm/kaiser.c | 26 ++++++++++++++++++++---- arch/x86/mm/pgtable.c | 8 ++------ arch/x86/mm/tlb.c | 4 +--- 16 files changed, 143 insertions(+), 65 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 97bc24101896..2d75f55419e0 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2439,6 +2439,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
nojitter [IA-64] Disables jitter checking for ITC timers.
+ nokaiser [X86-64] Disable KAISER isolation of kernel from user. + no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver
no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index 80a2b9487b29..b62f5b1a4361 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -198,6 +198,9 @@ #define X86_FEATURE_HWP_PKG_REQ ( 7*32+14) /* Intel HWP_PKG_REQ */ #define X86_FEATURE_INTEL_PT ( 7*32+15) /* Intel Processor Trace */
+/* Because the ALTERNATIVE scheme is for members of the X86_FEATURE club... */ +#define X86_FEATURE_KAISER ( 7*32+31) /* CONFIG_KAISER w/o nokaiser */ + /* Virtualization flags: Linux defined, word 8 */ #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */ #define X86_FEATURE_VNMI ( 8*32+ 1) /* Intel Virtual NMI */ diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h index 3dc5f4c39b3e..96643a9c194c 100644 --- a/arch/x86/include/asm/kaiser.h +++ b/arch/x86/include/asm/kaiser.h @@ -46,28 +46,33 @@ movq \reg, %cr3 .endm
.macro SWITCH_KERNEL_CR3 -pushq %rax +ALTERNATIVE "jmp 8f", "pushq %rax", X86_FEATURE_KAISER _SWITCH_TO_KERNEL_CR3 %rax popq %rax +8: .endm
.macro SWITCH_USER_CR3 -pushq %rax +ALTERNATIVE "jmp 8f", "pushq %rax", X86_FEATURE_KAISER _SWITCH_TO_USER_CR3 %rax %al popq %rax +8: .endm
.macro SWITCH_KERNEL_CR3_NO_STACK -movq %rax, PER_CPU_VAR(unsafe_stack_register_backup) +ALTERNATIVE "jmp 8f", \ + __stringify(movq %rax, PER_CPU_VAR(unsafe_stack_register_backup)), \ + X86_FEATURE_KAISER _SWITCH_TO_KERNEL_CR3 %rax movq PER_CPU_VAR(unsafe_stack_register_backup), %rax +8: .endm
#else /* CONFIG_KAISER */
-.macro SWITCH_KERNEL_CR3 reg +.macro SWITCH_KERNEL_CR3 .endm -.macro SWITCH_USER_CR3 reg regb +.macro SWITCH_USER_CR3 .endm .macro SWITCH_KERNEL_CR3_NO_STACK .endm @@ -90,6 +95,16 @@ DECLARE_PER_CPU(unsigned long, x86_cr3_pcid_user);
extern char __per_cpu_user_mapped_start[], __per_cpu_user_mapped_end[];
+extern int kaiser_enabled; +#else +#define kaiser_enabled 0 +#endif /* CONFIG_KAISER */ + +/* + * Kaiser function prototypes are needed even when CONFIG_KAISER is not set, + * so as to build with tests on kaiser_enabled instead of #ifdefs. + */ + /** * kaiser_add_mapping - map a virtual memory part to the shadow (user) mapping * @addr: the start address of the range @@ -119,8 +134,6 @@ extern void kaiser_remove_mapping(unsigned long start, unsigned long size); */ extern void kaiser_init(void);
-#endif /* CONFIG_KAISER */ - #endif /* __ASSEMBLY */
#endif /* _ASM_X86_KAISER_H */ diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 34e5177bd7b3..af9713487b36 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -18,6 +18,12 @@ #ifndef __ASSEMBLY__ #include <asm/x86_init.h>
+#ifdef CONFIG_KAISER +extern int kaiser_enabled; +#else +#define kaiser_enabled 0 +#endif + void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd);
/* @@ -633,7 +639,7 @@ static inline int pgd_bad(pgd_t pgd) * page table by accident; it will fault on the first * instruction it tries to run. See native_set_pgd(). */ - if (IS_ENABLED(CONFIG_KAISER)) + if (kaiser_enabled) ignore_flags |= _PAGE_NX;
return (pgd_flags(pgd) & ~ignore_flags) != _KERNPG_TABLE; @@ -838,12 +844,14 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, */ static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count) { - memcpy(dst, src, count * sizeof(pgd_t)); + memcpy(dst, src, count * sizeof(pgd_t)); #ifdef CONFIG_KAISER - /* Clone the shadow pgd part as well */ - memcpy(native_get_shadow_pgd(dst), - native_get_shadow_pgd(src), - count * sizeof(pgd_t)); + if (kaiser_enabled) { + /* Clone the shadow pgd part as well */ + memcpy(native_get_shadow_pgd(dst), + native_get_shadow_pgd(src), + count * sizeof(pgd_t)); + } #endif }
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index 8629be9e6649..233d19c9f22e 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -111,13 +111,12 @@ extern pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd);
static inline pgd_t *native_get_shadow_pgd(pgd_t *pgdp) { +#ifdef CONFIG_DEBUG_VM + /* linux/mmdebug.h may not have been included at this point */ + BUG_ON(!kaiser_enabled); +#endif return (pgd_t *)((unsigned long)pgdp | (unsigned long)PAGE_SIZE); } - -static inline pgd_t *native_get_normal_pgd(pgd_t *pgdp) -{ - return (pgd_t *)((unsigned long)pgdp & ~(unsigned long)PAGE_SIZE); -} #else static inline pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd) { @@ -128,10 +127,6 @@ static inline pgd_t *native_get_shadow_pgd(pgd_t *pgdp) BUILD_BUG_ON(1); return NULL; } -static inline pgd_t *native_get_normal_pgd(pgd_t *pgdp) -{ - return pgdp; -} #endif /* CONFIG_KAISER */
static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd) diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 93554f37df99..d5b89550fb53 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -39,11 +39,7 @@ #define _PAGE_ACCESSED (_AT(pteval_t, 1) << _PAGE_BIT_ACCESSED) #define _PAGE_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_DIRTY) #define _PAGE_PSE (_AT(pteval_t, 1) << _PAGE_BIT_PSE) -#ifdef CONFIG_KAISER -#define _PAGE_GLOBAL (_AT(pteval_t, 0)) -#else #define _PAGE_GLOBAL (_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL) -#endif #define _PAGE_SOFTW1 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW1) #define _PAGE_SOFTW2 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW2) #define _PAGE_PAT (_AT(pteval_t, 1) << _PAGE_BIT_PAT) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index ff8c5eb95caf..b376095a1fd9 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -136,9 +136,11 @@ static inline void cr4_set_bits_and_update_boot(unsigned long mask) * to avoid the need for asm/kaiser.h in unexpected places. */ #ifdef CONFIG_KAISER +extern int kaiser_enabled; extern void kaiser_setup_pcid(void); extern void kaiser_flush_tlb_on_return_to_user(void); #else +#define kaiser_enabled 0 static inline void kaiser_setup_pcid(void) { } @@ -163,7 +165,7 @@ static inline void __native_flush_tlb(void) * back: */ preempt_disable(); - if (this_cpu_has(X86_FEATURE_PCID)) + if (kaiser_enabled && this_cpu_has(X86_FEATURE_PCID)) kaiser_flush_tlb_on_return_to_user(); native_write_cr3(native_read_cr3()); preempt_enable(); @@ -174,20 +176,30 @@ static inline void __native_flush_tlb_global_irq_disabled(void) unsigned long cr4;
cr4 = this_cpu_read(cpu_tlbstate.cr4); - /* clear PGE */ - native_write_cr4(cr4 & ~X86_CR4_PGE); - /* write old PGE again and flush TLBs */ - native_write_cr4(cr4); + if (cr4 & X86_CR4_PGE) { + /* clear PGE and flush TLB of all entries */ + native_write_cr4(cr4 & ~X86_CR4_PGE); + /* restore PGE as it was before */ + native_write_cr4(cr4); + } else { + /* + * x86_64 microcode update comes this way when CR4.PGE is not + * enabled, and it's safer for all callers to allow this case. + */ + native_write_cr3(native_read_cr3()); + } }
static inline void __native_flush_tlb_global(void) { -#ifdef CONFIG_KAISER - /* Globals are not used at all */ - __native_flush_tlb(); -#else unsigned long flags;
+ if (kaiser_enabled) { + /* Globals are not used at all */ + __native_flush_tlb(); + return; + } + if (this_cpu_has(X86_FEATURE_INVPCID)) { /* * Using INVPCID is considerably faster than a pair of writes @@ -207,7 +219,6 @@ static inline void __native_flush_tlb_global(void) raw_local_irq_save(flags); __native_flush_tlb_global_irq_disabled(); raw_local_irq_restore(flags); -#endif }
static inline void __native_flush_tlb_single(unsigned long addr) @@ -222,7 +233,7 @@ static inline void __native_flush_tlb_single(unsigned long addr) */
if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE)) { - if (this_cpu_has(X86_FEATURE_PCID)) + if (kaiser_enabled && this_cpu_has(X86_FEATURE_PCID)) kaiser_flush_tlb_on_return_to_user(); asm volatile("invlpg (%0)" ::"r" (addr) : "memory"); return; @@ -237,9 +248,9 @@ static inline void __native_flush_tlb_single(unsigned long addr) * Make sure to do only a single invpcid when KAISER is * disabled and we have only a single ASID. */ - if (X86_CR3_PCID_ASID_KERN != X86_CR3_PCID_ASID_USER) - invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr); - invpcid_flush_one(X86_CR3_PCID_ASID_USER, addr); + if (kaiser_enabled) + invpcid_flush_one(X86_CR3_PCID_ASID_USER, addr); + invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr); }
static inline void __flush_tlb_all(void) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 9e09e190ba92..4404a78bdc0a 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -188,6 +188,20 @@ static int __init x86_pcid_setup(char *s) return 1; } __setup("nopcid", x86_pcid_setup); + +static int __init x86_nokaiser_setup(char *s) +{ + /* nokaiser doesn't accept parameters */ + if (s) + return -EINVAL; +#ifdef CONFIG_KAISER + kaiser_enabled = 0; + setup_clear_cpu_cap(X86_FEATURE_KAISER); + pr_info("nokaiser: KAISER feature disabled\n"); +#endif + return 0; +} +early_param("nokaiser", x86_nokaiser_setup); #endif
static int __init x86_noinvpcid_setup(char *s) @@ -342,7 +356,7 @@ static __always_inline void setup_smap(struct cpuinfo_x86 *c) static void setup_pcid(struct cpuinfo_x86 *c) { if (cpu_has(c, X86_FEATURE_PCID)) { - if (cpu_has(c, X86_FEATURE_PGE)) { + if (cpu_has(c, X86_FEATURE_PGE) || kaiser_enabled) { cr4_set_bits(X86_CR4_PCIDE); /* * INVPCID has two "groups" of types: @@ -762,6 +776,10 @@ void get_cpu_cap(struct cpuinfo_x86 *c) c->x86_power = cpuid_edx(0x80000007);
init_scattered_cpuid_features(c); +#ifdef CONFIG_KAISER + if (kaiser_enabled) + set_cpu_cap(c, X86_FEATURE_KAISER); +#endif }
static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c) @@ -1425,6 +1443,14 @@ void cpu_init(void) * try to read it. */ cr4_init_shadow(); + if (!kaiser_enabled) { + /* + * secondary_startup_64() deferred setting PGE in cr4: + * probe_page_size_mask() sets it on the boot cpu, + * but it needs to be set on each secondary cpu. + */ + cr4_set_bits(X86_CR4_PGE); + }
/* * Load microcode on this cpu if a valid microcode is available. diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 3973c43903f5..71189e3d1073 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -1327,7 +1327,7 @@ ENTRY(paranoid_entry) * unconditionally, but we need to find out whether the reverse * should be done on return (conveyed to paranoid_exit in %ebx). */ - movq %cr3, %rax + ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_KAISER testl $KAISER_SHADOW_PGD_OFFSET, %eax jz 2f orl $2, %ebx @@ -1360,6 +1360,7 @@ ENTRY(paranoid_exit) TRACE_IRQS_OFF_DEBUG TRACE_IRQS_IRETQ_DEBUG #ifdef CONFIG_KAISER + /* No ALTERNATIVE for X86_FEATURE_KAISER: paranoid_entry sets %ebx */ testl $2, %ebx /* SWITCH_USER_CR3 needed? */ jz paranoid_exit_no_switch SWITCH_USER_CR3 @@ -1565,13 +1566,14 @@ ENTRY(nmi) #ifdef CONFIG_KAISER /* Unconditionally use kernel CR3 for do_nmi() */ /* %rax is saved above, so OK to clobber here */ - movq %cr3, %rax + ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_KAISER /* If PCID enabled, NOFLUSH now and NOFLUSH on return */ orq x86_cr3_pcid_noflush, %rax pushq %rax /* mask off "user" bit of pgd address and 12 PCID bits: */ andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax movq %rax, %cr3 +2: #endif call do_nmi #ifdef CONFIG_KAISER @@ -1580,8 +1582,7 @@ ENTRY(nmi) * kernel code that needs user CR3, but do we ever return * to "user mode" where we need the kernel CR3? */ - popq %rax - mov %rax, %cr3 + ALTERNATIVE "", "popq %rax; movq %rax, %cr3", X86_FEATURE_KAISER #endif /* * Return back to user mode. We must *not* do the normal exit @@ -1804,13 +1805,14 @@ end_repeat_nmi: #ifdef CONFIG_KAISER /* Unconditionally use kernel CR3 for do_nmi() */ /* %rax is saved above, so OK to clobber here */ - movq %cr3, %rax + ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_KAISER /* If PCID enabled, NOFLUSH now and NOFLUSH on return */ orq x86_cr3_pcid_noflush, %rax pushq %rax /* mask off "user" bit of pgd address and 12 PCID bits: */ andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax movq %rax, %cr3 +2: #endif DEFAULT_FRAME 0 /* XXX: Do we need this? */
@@ -1822,8 +1824,7 @@ end_repeat_nmi: * kernel code that needs user CR3, like just just before * a sysret. */ - popq %rax - mov %rax, %cr3 + ALTERNATIVE "", "popq %rax; movq %rax, %cr3", X86_FEATURE_KAISER #endif
testl %ebx,%ebx /* swapgs needed? */ diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c index a48fad060fe9..aef263b0577d 100644 --- a/arch/x86/kernel/espfix_64.c +++ b/arch/x86/kernel/espfix_64.c @@ -132,9 +132,10 @@ void __init init_espfix_bsp(void) * area to ensure it is mapped into the shadow user page * tables. */ - if (IS_ENABLED(CONFIG_KAISER)) + if (kaiser_enabled) { set_pgd(native_get_shadow_pgd(pgd_p), __pgd(_KERNPG_TABLE | __pa((pud_t *)espfix_pud_page))); + }
/* Randomize the locations */ init_espfix_random(); diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S index 73e7f8e90fac..0d89f538efb6 100644 --- a/arch/x86/kernel/head_64.S +++ b/arch/x86/kernel/head_64.S @@ -183,8 +183,8 @@ ENTRY(secondary_startup_64) movq $(init_level4_pgt - __START_KERNEL_map), %rax 1:
- /* Enable PAE mode and PGE */ - movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx + /* Enable PAE and PSE, but defer PGE until kaiser_enabled is decided */ + movl $(X86_CR4_PAE | X86_CR4_PSE), %ecx movq %rcx, %cr4
/* Setup early boot stage 4 level pagetables. */ diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index 0a06e5a14278..785bd0347d13 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -162,7 +162,7 @@ static void __init probe_page_size_mask(void) cr4_set_bits_and_update_boot(X86_CR4_PSE);
/* Enable PGE if available */ - if (cpu_has_pge) { + if (cpu_has_pge && !kaiser_enabled) { cr4_set_bits_and_update_boot(X86_CR4_PGE); __supported_pte_mask |= _PAGE_GLOBAL; } else diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index f9977a7a9444..a97d3cd16354 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -395,6 +395,16 @@ void __init cleanup_highmap(void) continue; if (vaddr < (unsigned long) _text || vaddr > end) set_pmd(pmd, __pmd(0)); + else if (kaiser_enabled) { + /* + * level2_kernel_pgt is initialized with _PAGE_GLOBAL: + * clear that now. This is not important, so long as + * CR4.PGE remains clear, but it removes an anomaly. + * Physical mapping setup below avoids _PAGE_GLOBAL + * by use of massage_pgprot() inside pfn_pte() etc. + */ + set_pmd(pmd, pmd_clear_flags(*pmd, _PAGE_GLOBAL)); + } } }
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index f6e85067d58e..bfd48347eaac 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -17,7 +17,9 @@ #include <asm/pgalloc.h> #include <asm/desc.h>
-#ifdef CONFIG_KAISER +int kaiser_enabled __read_mostly = 1; +EXPORT_SYMBOL(kaiser_enabled); /* for inlined TLB flush functions */ + __visible DEFINE_PER_CPU_USER_MAPPED(unsigned long, unsafe_stack_register_backup);
@@ -168,8 +170,8 @@ static pte_t *kaiser_pagetable_walk(unsigned long address, bool is_atomic) return pte_offset_kernel(pmd, address); }
-int kaiser_add_user_map(const void *__start_addr, unsigned long size, - unsigned long flags) +static int kaiser_add_user_map(const void *__start_addr, unsigned long size, + unsigned long flags) { int ret = 0; pte_t *pte; @@ -178,6 +180,15 @@ int kaiser_add_user_map(const void *__start_addr, unsigned long size, unsigned long end_addr = PAGE_ALIGN(start_addr + size); unsigned long target_address;
+ /* + * It is convenient for callers to pass in __PAGE_KERNEL etc, + * and there is no actual harm from setting _PAGE_GLOBAL, so + * long as CR4.PGE is not set. But it is nonetheless troubling + * to see Kaiser itself setting _PAGE_GLOBAL (now that "nokaiser" + * requires that not to be #defined to 0): so mask it off here. + */ + flags &= ~_PAGE_GLOBAL; + for (; address < end_addr; address += PAGE_SIZE) { target_address = get_pa_from_mapping(address); if (target_address == -1) { @@ -264,6 +275,8 @@ void __init kaiser_init(void) { int cpu;
+ if (!kaiser_enabled) + return; kaiser_init_all_pgds();
for_each_possible_cpu(cpu) { @@ -312,6 +325,8 @@ void __init kaiser_init(void) /* Add a mapping to the shadow mapping, and synchronize the mappings */ int kaiser_add_mapping(unsigned long addr, unsigned long size, unsigned long flags) { + if (!kaiser_enabled) + return 0; return kaiser_add_user_map((const void *)addr, size, flags); }
@@ -323,6 +338,8 @@ void kaiser_remove_mapping(unsigned long start, unsigned long size) unsigned long addr, next; pgd_t *pgd;
+ if (!kaiser_enabled) + return; pgd = native_get_shadow_pgd(pgd_offset_k(start)); for (addr = start; addr < end; pgd++, addr = next) { next = pgd_addr_end(addr, end); @@ -344,6 +361,8 @@ static inline bool is_userspace_pgd(pgd_t *pgdp)
pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd) { + if (!kaiser_enabled) + return pgd; /* * Do we need to also populate the shadow pgd? Check _PAGE_USER to * skip cases like kexec and EFI which make temporary low mappings. @@ -400,4 +419,3 @@ void kaiser_flush_tlb_on_return_to_user(void) X86_CR3_PCID_USER_FLUSH | KAISER_SHADOW_PGD_OFFSET); } EXPORT_SYMBOL(kaiser_flush_tlb_on_return_to_user); -#endif /* CONFIG_KAISER */ diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 20b8438f35c0..1cec796635d7 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -341,16 +341,12 @@ static inline void _pgd_free(pgd_t *pgd) } #else
-#ifdef CONFIG_KAISER /* - * Instead of one pmd, we aquire two pmds. Being order-1, it is + * Instead of one pgd, Kaiser acquires two pgds. Being order-1, it is * both 8k in size and 8k-aligned. That lets us just flip bit 12 * in a pointer to swap between the two 4k halves. */ -#define PGD_ALLOCATION_ORDER 1 -#else -#define PGD_ALLOCATION_ORDER 0 -#endif +#define PGD_ALLOCATION_ORDER kaiser_enabled
static inline pgd_t *_pgd_alloc(void) { diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 749e2134bd7a..5539d613c263 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -39,8 +39,7 @@ static void load_new_mm_cr3(pgd_t *pgdir) { unsigned long new_mm_cr3 = __pa(pgdir);
-#ifdef CONFIG_KAISER - if (this_cpu_has(X86_FEATURE_PCID)) { + if (kaiser_enabled && this_cpu_has(X86_FEATURE_PCID)) { /* * We reuse the same PCID for different tasks, so we must * flush all the entries for the PCID out when we change tasks. @@ -57,7 +56,6 @@ static void load_new_mm_cr3(pgd_t *pgdir) new_mm_cr3 |= X86_CR3_PCID_KERN_FLUSH; kaiser_flush_tlb_on_return_to_user(); } -#endif /* CONFIG_KAISER */
/* * Caution: many callers of this function expect
From: Hugh Dickins hughd@google.com
Added "nokaiser" boot option: an early param like "noinvpcid". Most places now check int kaiser_enabled (#defined 0 when not CONFIG_KAISER) instead of #ifdef CONFIG_KAISER; but entry_64.S and entry_64_compat.S are using the ALTERNATIVE technique, which patches in the preferred instructions at runtime. That technique is tied to x86 cpu features, so X86_FEATURE_KAISER is fabricated.
Prior to "nokaiser", Kaiser #defined _PAGE_GLOBAL 0: revert that, but be careful with both _PAGE_GLOBAL and CR4.PGE: setting them when nokaiser like when !CONFIG_KAISER, but not setting either when kaiser - neither matters on its own, but it's hard to be sure that _PAGE_GLOBAL won't get set in some obscure corner, or something add PGE into CR4. By omitting _PAGE_GLOBAL from __supported_pte_mask when kaiser_enabled, all page table setup which uses pte_pfn() masks it out of the ptes.
It's slightly shameful that the same declaration versus definition of kaiser_enabled appears in not one, not two, but in three header files (asm/kaiser.h, asm/pgtable.h, asm/tlbflush.h). I felt safer that way, than with #including any of those in any of the others; and did not feel it worth an asm/kaiser_enabled.h - kernel/cpu/common.c includes them all, so we shall hear about it if they get out of synch.
Cleanups while in the area: removed the silly #ifdef CONFIG_KAISER from kaiser.c; removed the unused native_get_normal_pgd(); removed the spurious reg clutter from SWITCH_*_CR3 macro stubs; corrected some comments. But more interestingly, set CR4.PSE in secondary_startup_64: the manual is clear that it does not matter whether it's 0 or 1 when 4-level-pts are enabled, but I was distracted to find cr4 different on BSP and auxiliaries - BSP alone was adding PSE, in probe_page_size_mask().
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit e345dcc9481543edf4a0a5df4c4c2f9597b0a997) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/entry/entry_64.S (not in this tree) arch/x86/kernel/entry_64.S (patched instead of that) --- arch/x86/include/asm/cpufeature.h | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index b62f5b1a4361..8effe086de27 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -201,6 +201,9 @@ /* Because the ALTERNATIVE scheme is for members of the X86_FEATURE club... */ #define X86_FEATURE_KAISER ( 7*32+31) /* CONFIG_KAISER w/o nokaiser */
+/* Because the ALTERNATIVE scheme is for members of the X86_FEATURE club... */ +#define X86_FEATURE_KAISER ( 7*32+31) /* CONFIG_KAISER w/o nokaiser */ + /* Virtualization flags: Linux defined, word 8 */ #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */ #define X86_FEATURE_VNMI ( 8*32+ 1) /* Intel Virtual NMI */
From: Borislav Petkov bp@suse.de
Concentrate it in arch/x86/mm/kaiser.c and use the upstream string "nopti".
Signed-off-by: Borislav Petkov bp@suse.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit dea9aa9ffae11c91285335cc3215b4f0e48e8139) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- Documentation/kernel-parameters.txt | 2 +- arch/x86/kernel/cpu/common.c | 18 ------------------ arch/x86/mm/kaiser.c | 20 +++++++++++++++++++- 3 files changed, 20 insertions(+), 20 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 2d75f55419e0..c1f3dbed0021 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2439,7 +2439,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
nojitter [IA-64] Disables jitter checking for ITC timers.
- nokaiser [X86-64] Disable KAISER isolation of kernel from user. + nopti [X86-64] Disable KAISER isolation of kernel from user.
no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 4404a78bdc0a..e80e6e5ede48 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -188,20 +188,6 @@ static int __init x86_pcid_setup(char *s) return 1; } __setup("nopcid", x86_pcid_setup); - -static int __init x86_nokaiser_setup(char *s) -{ - /* nokaiser doesn't accept parameters */ - if (s) - return -EINVAL; -#ifdef CONFIG_KAISER - kaiser_enabled = 0; - setup_clear_cpu_cap(X86_FEATURE_KAISER); - pr_info("nokaiser: KAISER feature disabled\n"); -#endif - return 0; -} -early_param("nokaiser", x86_nokaiser_setup); #endif
static int __init x86_noinvpcid_setup(char *s) @@ -776,10 +762,6 @@ void get_cpu_cap(struct cpuinfo_x86 *c) c->x86_power = cpuid_edx(0x80000007);
init_scattered_cpuid_features(c); -#ifdef CONFIG_KAISER - if (kaiser_enabled) - set_cpu_cap(c, X86_FEATURE_KAISER); -#endif }
static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c) diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index bfd48347eaac..a724496a5852 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -275,8 +275,13 @@ void __init kaiser_init(void) { int cpu;
- if (!kaiser_enabled) + if (!kaiser_enabled) { + setup_clear_cpu_cap(X86_FEATURE_KAISER); return; + } + + setup_force_cpu_cap(X86_FEATURE_KAISER); + kaiser_init_all_pgds();
for_each_possible_cpu(cpu) { @@ -419,3 +424,16 @@ void kaiser_flush_tlb_on_return_to_user(void) X86_CR3_PCID_USER_FLUSH | KAISER_SHADOW_PGD_OFFSET); } EXPORT_SYMBOL(kaiser_flush_tlb_on_return_to_user); + +static int __init x86_nokaiser_setup(char *s) +{ + /* nopti doesn't accept parameters */ + if (s) + return -EINVAL; + + kaiser_enabled = 0; + pr_info("Kernel/User page tables isolation: disabled\n"); + + return 0; +} +early_param("nopti", x86_nokaiser_setup);
From: Borislav Petkov bp@suse.de
AMD (and possibly other vendors) are not affected by the leak KAISER is protecting against.
Keep the "nopti" for traditional reasons and add pti=<on|off|auto> like upstream.
Signed-off-by: Borislav Petkov bp@suse.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit e405a064bd7d6eca88935342ddb71057a9d6ceab) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- Documentation/kernel-parameters.txt | 6 ++++ arch/x86/mm/kaiser.c | 59 ++++++++++++++++++++++++++----------- 2 files changed, 47 insertions(+), 18 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index c1f3dbed0021..f6c046f03905 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2972,6 +2972,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted. pt. [PARIDE] See Documentation/blockdev/paride.txt.
+ pti= [X86_64] + Control KAISER user/kernel address space isolation: + on - enable + off - disable + auto - default setting + pty.legacy_count= [KNL] Number of legacy pty's. Overwrites compiled-in default number. diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index a724496a5852..88b4526d57a5 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -16,6 +16,7 @@ #include <asm/pgtable.h> #include <asm/pgalloc.h> #include <asm/desc.h> +#include <asm/cmdline.h>
int kaiser_enabled __read_mostly = 1; EXPORT_SYMBOL(kaiser_enabled); /* for inlined TLB flush functions */ @@ -264,6 +265,43 @@ static void __init kaiser_init_all_pgds(void) WARN_ON(__ret); \ } while (0)
+void __init kaiser_check_boottime_disable(void) +{ + bool enable = true; + char arg[5]; + int ret; + + ret = cmdline_find_option(boot_command_line, "pti", arg, sizeof(arg)); + if (ret > 0) { + if (!strncmp(arg, "on", 2)) + goto enable; + + if (!strncmp(arg, "off", 3)) + goto disable; + + if (!strncmp(arg, "auto", 4)) + goto skip; + } + + if (cmdline_find_option_bool(boot_command_line, "nopti")) + goto disable; + +skip: + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) + goto disable; + +enable: + if (enable) + setup_force_cpu_cap(X86_FEATURE_KAISER); + + return; + +disable: + pr_info("Kernel/User page tables isolation: disabled\n"); + kaiser_enabled = 0; + setup_clear_cpu_cap(X86_FEATURE_KAISER); +} + /* * If anything in here fails, we will likely die on one of the * first kernel->user transitions and init will die. But, we @@ -275,12 +313,10 @@ void __init kaiser_init(void) { int cpu;
- if (!kaiser_enabled) { - setup_clear_cpu_cap(X86_FEATURE_KAISER); - return; - } + kaiser_check_boottime_disable();
- setup_force_cpu_cap(X86_FEATURE_KAISER); + if (!kaiser_enabled) + return;
kaiser_init_all_pgds();
@@ -424,16 +460,3 @@ void kaiser_flush_tlb_on_return_to_user(void) X86_CR3_PCID_USER_FLUSH | KAISER_SHADOW_PGD_OFFSET); } EXPORT_SYMBOL(kaiser_flush_tlb_on_return_to_user); - -static int __init x86_nokaiser_setup(char *s) -{ - /* nopti doesn't accept parameters */ - if (s) - return -EINVAL; - - kaiser_enabled = 0; - pr_info("Kernel/User page tables isolation: disabled\n"); - - return 0; -} -early_param("nopti", x86_nokaiser_setup);
From: Hugh Dickins hughd@google.com
Now that we're playing the ALTERNATIVE game, use that more efficient method: instead of user-mapping an extra page, and reading an extra cacheline each time for x86_cr3_pcid_noflush.
Neel has found that __stringify(bts $X86_CR3_PCID_NOFLUSH_BIT, %rax) is a working substitute for the "bts $63, %rax" in these ALTERNATIVEs; but the one line with $63 in looks clearer, so let's stick with that.
Worried about what happens with an ALTERNATIVE between the jump and jump label in another ALTERNATIVE? I was, but have checked the combinations in SWITCH_KERNEL_CR3_NO_STACK at entry_SYSCALL_64, and it does a good job.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 2dff99eb0335f9e0817410696a180dba25ca7371) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/entry/entry_64.S (not in this tree) arch/x86/kernel/entry_64.S (patched instead of that) --- arch/x86/include/asm/kaiser.h | 6 +++--- arch/x86/kernel/entry_64.S | 7 ++++--- arch/x86/mm/kaiser.c | 11 +---------- 3 files changed, 8 insertions(+), 16 deletions(-)
diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h index 96643a9c194c..906150d6094e 100644 --- a/arch/x86/include/asm/kaiser.h +++ b/arch/x86/include/asm/kaiser.h @@ -25,7 +25,8 @@ .macro _SWITCH_TO_KERNEL_CR3 reg movq %cr3, \reg andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), \reg -orq x86_cr3_pcid_noflush, \reg +/* If PCID enabled, set X86_CR3_PCID_NOFLUSH_BIT */ +ALTERNATIVE "", "bts $63, \reg", X86_FEATURE_PCID movq \reg, %cr3 .endm
@@ -39,7 +40,7 @@ movq \reg, %cr3 movq %cr3, \reg orq PER_CPU_VAR(x86_cr3_pcid_user), \reg js 9f -/* FLUSH this time, reset to NOFLUSH for next time (if PCID enabled) */ +/* If PCID enabled, FLUSH this time, reset to NOFLUSH for next time */ movb \regb, PER_CPU_VAR(x86_cr3_pcid_user+7) 9: movq \reg, %cr3 @@ -90,7 +91,6 @@ movq PER_CPU_VAR(unsafe_stack_register_backup), %rax */ DECLARE_PER_CPU_USER_MAPPED(unsigned long, unsafe_stack_register_backup);
-extern unsigned long x86_cr3_pcid_noflush; DECLARE_PER_CPU(unsigned long, x86_cr3_pcid_user);
extern char __per_cpu_user_mapped_start[], __per_cpu_user_mapped_end[]; diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 71189e3d1073..f681e3bc6c22 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -1332,7 +1332,8 @@ ENTRY(paranoid_entry) jz 2f orl $2, %ebx andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax - orq x86_cr3_pcid_noflush, %rax + /* If PCID enabled, set X86_CR3_PCID_NOFLUSH_BIT */ + ALTERNATIVE "", "bts $63, %rax", X86_FEATURE_PCID movq %rax, %cr3 2: #endif @@ -1568,7 +1569,7 @@ ENTRY(nmi) /* %rax is saved above, so OK to clobber here */ ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_KAISER /* If PCID enabled, NOFLUSH now and NOFLUSH on return */ - orq x86_cr3_pcid_noflush, %rax + ALTERNATIVE "", "bts $63, %rax", X86_FEATURE_PCID pushq %rax /* mask off "user" bit of pgd address and 12 PCID bits: */ andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax @@ -1807,7 +1808,7 @@ end_repeat_nmi: /* %rax is saved above, so OK to clobber here */ ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_KAISER /* If PCID enabled, NOFLUSH now and NOFLUSH on return */ - orq x86_cr3_pcid_noflush, %rax + ALTERNATIVE "", "bts $63, %rax", X86_FEATURE_PCID pushq %rax /* mask off "user" bit of pgd address and 12 PCID bits: */ andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index 88b4526d57a5..a8ce2e4737e0 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -32,7 +32,6 @@ DEFINE_PER_CPU_USER_MAPPED(unsigned long, unsafe_stack_register_backup); * This is also handy because systems that do not support PCIDs * just end up or'ing a 0 into their CR3, which does no harm. */ -unsigned long x86_cr3_pcid_noflush __read_mostly; DEFINE_PER_CPU(unsigned long, x86_cr3_pcid_user);
/* @@ -357,10 +356,6 @@ void __init kaiser_init(void) kaiser_add_user_map_early(&debug_idt_table, sizeof(gate_desc) * NR_VECTORS, __PAGE_KERNEL); - - kaiser_add_user_map_early(&x86_cr3_pcid_noflush, - sizeof(x86_cr3_pcid_noflush), - __PAGE_KERNEL); }
/* Add a mapping to the shadow mapping, and synchronize the mappings */ @@ -434,18 +429,14 @@ pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
void kaiser_setup_pcid(void) { - unsigned long kern_cr3 = 0; unsigned long user_cr3 = KAISER_SHADOW_PGD_OFFSET;
- if (this_cpu_has(X86_FEATURE_PCID)) { - kern_cr3 |= X86_CR3_PCID_KERN_NOFLUSH; + if (this_cpu_has(X86_FEATURE_PCID)) user_cr3 |= X86_CR3_PCID_USER_NOFLUSH; - } /* * These variables are used by the entry/exit * code to change PCID and pgd and TLB flushing. */ - x86_cr3_pcid_noflush = kern_cr3; this_cpu_write(x86_cr3_pcid_user, user_cr3); }
From: Hugh Dickins hughd@google.com
I have not observed a might_sleep() warning from setup_fixmap_gdt()'s use of kaiser_add_mapping() in our tree (why not?), but like upstream we have not provided a way for that to pass is_atomic true down to kaiser_pagetable_walk(), and at startup it's far from a likely source of trouble: so just delete the walk's is_atomic arg and might_sleep().
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 28c6de5441740f868a5b371804a0e8dde03757fb) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/mm/kaiser.c --- arch/x86/mm/kaiser.c | 10 ++-------- 1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index a8ce2e4737e0..c64bfef99ee8 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -108,19 +108,13 @@ static inline unsigned long get_pa_from_mapping(unsigned long vaddr) * * Returns a pointer to a PTE on success, or NULL on failure. */ -static pte_t *kaiser_pagetable_walk(unsigned long address, bool is_atomic) +static pte_t *kaiser_pagetable_walk(unsigned long address) { pmd_t *pmd; pud_t *pud; pgd_t *pgd = native_get_shadow_pgd(pgd_offset_k(address)); gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
- if (is_atomic) { - gfp &= ~GFP_KERNEL; - gfp |= __GFP_HIGH; - } else - might_sleep(); - if (pgd_none(*pgd)) { WARN_ONCE(1, "All shadow pgds should have been populated"); return NULL; @@ -195,7 +189,7 @@ static int kaiser_add_user_map(const void *__start_addr, unsigned long size, ret = -EIO; break; } - pte = kaiser_pagetable_walk(address, false); + pte = kaiser_pagetable_walk(address); if (!pte) { ret = -ENOMEM; break;
From: Hugh Dickins hughd@google.com
I found asm/tlbflush.h too twisty, and think it safer not to avoid __native_flush_tlb_global_irq_disabled() in the kaiser_enabled case, but instead let it handle kaiser_enabled along with cr3: it can just use __native_flush_tlb() for that, no harm in re-disabling preemption.
(This is not the same change as Kirill and Dave have suggested for upstream, flipping PGE in cr4: that's neat, but needs a cpu_has_pge check; cr3 is enough for kaiser, and thought to be cheaper than cr4.)
Also delete the X86_FEATURE_INVPCID invpcid_flush_all_nonglobals() preference from __native_flush_tlb(): unlike the invpcid_flush_all() preference in __native_flush_tlb_global(), it's not seen in upstream 4.14, and was recently reported to be surprisingly slow.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 0651b3ad99dd59269e2ec883338ab8fba617e203) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/include/asm/tlbflush.h | 27 +++------------------------ 1 file changed, 3 insertions(+), 24 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index b376095a1fd9..6fdc8c399601 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -151,14 +151,6 @@ static inline void kaiser_flush_tlb_on_return_to_user(void)
static inline void __native_flush_tlb(void) { - if (this_cpu_has(X86_FEATURE_INVPCID)) { - /* - * Note, this works with CR4.PCIDE=0 or 1. - */ - invpcid_flush_all_nonglobals(); - return; - } - /* * If current->mm == NULL then we borrow a mm which may change during a * task switch and therefore we must not be preempted while we write CR3 @@ -182,11 +174,8 @@ static inline void __native_flush_tlb_global_irq_disabled(void) /* restore PGE as it was before */ native_write_cr4(cr4); } else { - /* - * x86_64 microcode update comes this way when CR4.PGE is not - * enabled, and it's safer for all callers to allow this case. - */ - native_write_cr3(native_read_cr3()); + /* do it with cr3, letting kaiser flush user PCID */ + __native_flush_tlb(); } }
@@ -194,12 +183,6 @@ static inline void __native_flush_tlb_global(void) { unsigned long flags;
- if (kaiser_enabled) { - /* Globals are not used at all */ - __native_flush_tlb(); - return; - } - if (this_cpu_has(X86_FEATURE_INVPCID)) { /* * Using INVPCID is considerably faster than a pair of writes @@ -255,11 +238,7 @@ static inline void __native_flush_tlb_single(unsigned long addr)
static inline void __flush_tlb_all(void) { - if (cpu_has_pge) - __flush_tlb_global(); - else - __flush_tlb(); - + __flush_tlb_global(); /* * Note: if we somehow had PCID but not PGE, then this wouldn't work -- * we'd end up flushing kernel translations for the current ASID but
From: Hugh Dickins hughd@google.com
Let kaiser_flush_tlb_on_return_to_user() do the X86_FEATURE_PCID check, instead of each caller doing it inline first: nobody needs to optimize for the noPCID case, it's clearer this way, and better suits later changes. Replace those no-op X86_CR3_PCID_KERN_FLUSH lines by a BUILD_BUG_ON() in load_new_mm_cr3(), in case something changes.
Signed-off-by: Hugh Dickins hughd@google.com Acked-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 8eaca4c7d9f167209a9cc568ff028c0a3b0deb2d) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/include/asm/tlbflush.h | 4 ++-- arch/x86/mm/kaiser.c | 6 +++--- arch/x86/mm/tlb.c | 8 ++++---- 3 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 6fdc8c399601..73865b174090 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -157,7 +157,7 @@ static inline void __native_flush_tlb(void) * back: */ preempt_disable(); - if (kaiser_enabled && this_cpu_has(X86_FEATURE_PCID)) + if (kaiser_enabled) kaiser_flush_tlb_on_return_to_user(); native_write_cr3(native_read_cr3()); preempt_enable(); @@ -216,7 +216,7 @@ static inline void __native_flush_tlb_single(unsigned long addr) */
if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE)) { - if (kaiser_enabled && this_cpu_has(X86_FEATURE_PCID)) + if (kaiser_enabled) kaiser_flush_tlb_on_return_to_user(); asm volatile("invlpg (%0)" ::"r" (addr) : "memory"); return; diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index c64bfef99ee8..9d6b7517fca5 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -436,12 +436,12 @@ void kaiser_setup_pcid(void)
/* * Make a note that this cpu will need to flush USER tlb on return to user. - * Caller checks whether this_cpu_has(X86_FEATURE_PCID) before calling: - * if cpu does not, then the NOFLUSH bit will never have been set. + * If cpu does not have PCID, then the NOFLUSH bit will never have been set. */ void kaiser_flush_tlb_on_return_to_user(void) { - this_cpu_write(x86_cr3_pcid_user, + if (this_cpu_has(X86_FEATURE_PCID)) + this_cpu_write(x86_cr3_pcid_user, X86_CR3_PCID_USER_FLUSH | KAISER_SHADOW_PGD_OFFSET); } EXPORT_SYMBOL(kaiser_flush_tlb_on_return_to_user); diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 5539d613c263..5939c549905f 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -39,7 +39,7 @@ static void load_new_mm_cr3(pgd_t *pgdir) { unsigned long new_mm_cr3 = __pa(pgdir);
- if (kaiser_enabled && this_cpu_has(X86_FEATURE_PCID)) { + if (kaiser_enabled) { /* * We reuse the same PCID for different tasks, so we must * flush all the entries for the PCID out when we change tasks. @@ -50,10 +50,10 @@ static void load_new_mm_cr3(pgd_t *pgdir) * do it here, but can only be used if X86_FEATURE_INVPCID is * available - and many machines support pcid without invpcid. * - * The line below is a no-op: X86_CR3_PCID_KERN_FLUSH is now 0; - * but keep that line in there in case something changes. + * If X86_CR3_PCID_KERN_FLUSH actually added something, then it + * would be needed in the write_cr3() below - if PCIDs enabled. */ - new_mm_cr3 |= X86_CR3_PCID_KERN_FLUSH; + BUILD_BUG_ON(X86_CR3_PCID_KERN_FLUSH); kaiser_flush_tlb_on_return_to_user(); }
From: Thomas Gleixner tglx@linutronix.de
commit a035795499ca1c2bd1928808d1a156eda1420383 upstream
native_flush_tlb_single() will be changed with the upcoming PAGE_TABLE_ISOLATION feature. This requires to have more code in there than INVLPG.
Remove the paravirt patching for it.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Josh Poimboeuf jpoimboe@redhat.com Reviewed-by: Juergen Gross jgross@suse.com Acked-by: Peter Zijlstra peterz@infradead.org Cc: Andy Lutomirski luto@kernel.org Cc: Boris Ostrovsky boris.ostrovsky@oracle.com Cc: Borislav Petkov bp@alien8.de Cc: Borislav Petkov bpetkov@suse.de Cc: Brian Gerst brgerst@gmail.com Cc: Dave Hansen dave.hansen@intel.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: David Laight David.Laight@aculab.com Cc: Denys Vlasenko dvlasenk@redhat.com Cc: Eduardo Valentin eduval@amazon.com Cc: Greg KH gregkh@linuxfoundation.org Cc: H. Peter Anvin hpa@zytor.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Rik van Riel riel@redhat.com Cc: Will Deacon will.deacon@arm.com Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: linux-mm@kvack.org Cc: michael.schwarz@iaik.tugraz.at Cc: moritz.lipp@iaik.tugraz.at Cc: richard.fellner@student.tugraz.at Link: https://lkml.kernel.org/r/20171204150606.828111617@linutronix.de Signed-off-by: Ingo Molnar mingo@kernel.org Acked-by: Borislav Petkov bp@suse.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 3e809caffdd7beeac731feb16788873c3bdb811e) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/kernel/paravirt_patch_64.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/arch/x86/kernel/paravirt_patch_64.c b/arch/x86/kernel/paravirt_patch_64.c index a1da6737ba5b..a91d9b9b4bde 100644 --- a/arch/x86/kernel/paravirt_patch_64.c +++ b/arch/x86/kernel/paravirt_patch_64.c @@ -9,7 +9,6 @@ DEF_NATIVE(pv_irq_ops, save_fl, "pushfq; popq %rax"); DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax"); DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax"); DEF_NATIVE(pv_mmu_ops, write_cr3, "movq %rdi, %cr3"); -DEF_NATIVE(pv_mmu_ops, flush_tlb_single, "invlpg (%rdi)"); DEF_NATIVE(pv_cpu_ops, clts, "clts"); DEF_NATIVE(pv_cpu_ops, wbinvd, "wbinvd");
@@ -57,7 +56,6 @@ unsigned native_patch(u8 type, u16 clobbers, void *ibuf, PATCH_SITE(pv_mmu_ops, read_cr3); PATCH_SITE(pv_mmu_ops, write_cr3); PATCH_SITE(pv_cpu_ops, clts); - PATCH_SITE(pv_mmu_ops, flush_tlb_single); PATCH_SITE(pv_cpu_ops, wbinvd);
patch_site:
From: Borislav Petkov bp@suse.de
Now that the required bits have been addressed, reenable PARAVIRT.
Signed-off-by: Borislav Petkov bp@suse.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 750fb627d764eb66430c36961b94ab0002694c02) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- security/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/security/Kconfig b/security/Kconfig index 9d63046e4794..ae44b94d63b2 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -34,7 +34,7 @@ config SECURITY config KAISER bool "Remove the kernel mapping in user mode" default y - depends on X86_64 && SMP && !PARAVIRT + depends on X86_64 && SMP help This enforces a strict kernel and user space isolation, in order to close hardware side channels on kernel address information.
From: Borislav Petkov bp@suse.de
... before the first use of kaiser_enabled as otherwise funky things happen:
about to get started... (XEN) d0v0 Unhandled page fault fault/trap [#14, ec=0000] (XEN) Pagetable walk from ffff88022a449090: (XEN) L4[0x110] = 0000000229e0e067 0000000000001e0e (XEN) L3[0x008] = 0000000000000000 ffffffffffffffff (XEN) domain_crash_sync called from entry.S: fault at ffff82d08033fd08 entry.o#create_bounce_frame+0x135/0x14d (XEN) Domain 0 (vcpu#0) crashed on cpu#0: (XEN) ----[ Xen-4.9.1_02-3.21 x86_64 debug=n Not tainted ]---- (XEN) CPU: 0 (XEN) RIP: e033:[<ffffffff81007460>] (XEN) RFLAGS: 0000000000000286 EM: 1 CONTEXT: pv guest (d0v0)
Signed-off-by: Borislav Petkov bp@suse.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 7f79599df9c4a36130f7a4f6778b334a97632477) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/kernel/setup.c --- arch/x86/include/asm/kaiser.h | 2 ++ arch/x86/kernel/setup.c | 7 +++++++ arch/x86/mm/kaiser.c | 2 -- 3 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h index 906150d6094e..b5e46aa683f4 100644 --- a/arch/x86/include/asm/kaiser.h +++ b/arch/x86/include/asm/kaiser.h @@ -96,8 +96,10 @@ DECLARE_PER_CPU(unsigned long, x86_cr3_pcid_user); extern char __per_cpu_user_mapped_start[], __per_cpu_user_mapped_end[];
extern int kaiser_enabled; +extern void __init kaiser_check_boottime_disable(void); #else #define kaiser_enabled 0 +static inline void __init kaiser_check_boottime_disable(void) {} #endif /* CONFIG_KAISER */
/* diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 1473a02e6ccb..1adaca0050b0 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -111,6 +111,7 @@ #include <asm/mce.h> #include <asm/alternative.h> #include <asm/prom.h> +#include <asm/kaiser.h>
/* * max_low_pfn_mapped: highest direct mapped pfn under 4GB @@ -1034,6 +1035,12 @@ void __init setup_arch(char **cmdline_p) */ init_hypervisor_platform();
+ /* + * This needs to happen right after XENPV is set on xen and + * kaiser_enabled is checked below in cleanup_highmap(). + */ + kaiser_check_boottime_disable(); + x86_init.resources.probe_roms();
/* after parse_early_param, so could debug it */ diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index 9d6b7517fca5..50783ad4861d 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -306,8 +306,6 @@ void __init kaiser_init(void) { int cpu;
- kaiser_check_boottime_disable(); - if (!kaiser_enabled) return;
From: Kees Cook keescook@chromium.org
This renames CONFIG_KAISER to CONFIG_PAGE_TABLE_ISOLATION.
Signed-off-by: Kees Cook keescook@chromium.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 3e1457d6bf26d9ec300781f84cd0057e44deb45d) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com
Conflicts: arch/x86/entry/entry_64.S (not in this tree) arch/x86/kernel/entry_64.S (patched instead of that) --- arch/x86/boot/compressed/misc.h | 2 +- arch/x86/include/asm/cpufeature.h | 2 +- arch/x86/include/asm/kaiser.h | 12 ++++++------ arch/x86/include/asm/pgtable.h | 4 ++-- arch/x86/include/asm/pgtable_64.h | 4 ++-- arch/x86/include/asm/pgtable_types.h | 2 +- arch/x86/include/asm/tlbflush.h | 2 +- arch/x86/kernel/cpu/perf_event_intel_ds.c | 4 ++-- arch/x86/kernel/entry_64.S | 12 ++++++------ arch/x86/kernel/head_64.S | 2 +- arch/x86/mm/Makefile | 2 +- include/linux/kaiser.h | 6 +++--- include/linux/percpu-defs.h | 2 +- security/Kconfig | 2 +- 14 files changed, 29 insertions(+), 29 deletions(-)
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h index cbaa2ebfa88d..6d92220983e6 100644 --- a/arch/x86/boot/compressed/misc.h +++ b/arch/x86/boot/compressed/misc.h @@ -9,7 +9,7 @@ */ #undef CONFIG_PARAVIRT #undef CONFIG_PARAVIRT_SPINLOCKS -#undef CONFIG_KAISER +#undef CONFIG_PAGE_TABLE_ISOLATION #undef CONFIG_KASAN
#include <linux/linkage.h> diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index 8effe086de27..a22aa8c5903c 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -202,7 +202,7 @@ #define X86_FEATURE_KAISER ( 7*32+31) /* CONFIG_KAISER w/o nokaiser */
/* Because the ALTERNATIVE scheme is for members of the X86_FEATURE club... */ -#define X86_FEATURE_KAISER ( 7*32+31) /* CONFIG_KAISER w/o nokaiser */ +#define X86_FEATURE_KAISER ( 7*32+31) /* CONFIG_PAGE_TABLE_ISOLATION w/o nokaiser */
/* Virtualization flags: Linux defined, word 8 */ #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */ diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h index b5e46aa683f4..802bbbdfe143 100644 --- a/arch/x86/include/asm/kaiser.h +++ b/arch/x86/include/asm/kaiser.h @@ -20,7 +20,7 @@ #define KAISER_SHADOW_PGD_OFFSET 0x1000
#ifdef __ASSEMBLY__ -#ifdef CONFIG_KAISER +#ifdef CONFIG_PAGE_TABLE_ISOLATION
.macro _SWITCH_TO_KERNEL_CR3 reg movq %cr3, \reg @@ -69,7 +69,7 @@ movq PER_CPU_VAR(unsafe_stack_register_backup), %rax 8: .endm
-#else /* CONFIG_KAISER */ +#else /* CONFIG_PAGE_TABLE_ISOLATION */
.macro SWITCH_KERNEL_CR3 .endm @@ -78,11 +78,11 @@ movq PER_CPU_VAR(unsafe_stack_register_backup), %rax .macro SWITCH_KERNEL_CR3_NO_STACK .endm
-#endif /* CONFIG_KAISER */ +#endif /* CONFIG_PAGE_TABLE_ISOLATION */
#else /* __ASSEMBLY__ */
-#ifdef CONFIG_KAISER +#ifdef CONFIG_PAGE_TABLE_ISOLATION /* * Upon kernel/user mode switch, it may happen that the address * space has to be switched before the registers have been @@ -100,10 +100,10 @@ extern void __init kaiser_check_boottime_disable(void); #else #define kaiser_enabled 0 static inline void __init kaiser_check_boottime_disable(void) {} -#endif /* CONFIG_KAISER */ +#endif /* CONFIG_PAGE_TABLE_ISOLATION */
/* - * Kaiser function prototypes are needed even when CONFIG_KAISER is not set, + * Kaiser function prototypes are needed even when CONFIG_PAGE_TABLE_ISOLATION is not set, * so as to build with tests on kaiser_enabled instead of #ifdefs. */
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index af9713487b36..4f971f28a7ec 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -18,7 +18,7 @@ #ifndef __ASSEMBLY__ #include <asm/x86_init.h>
-#ifdef CONFIG_KAISER +#ifdef CONFIG_PAGE_TABLE_ISOLATION extern int kaiser_enabled; #else #define kaiser_enabled 0 @@ -845,7 +845,7 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count) { memcpy(dst, src, count * sizeof(pgd_t)); -#ifdef CONFIG_KAISER +#ifdef CONFIG_PAGE_TABLE_ISOLATION if (kaiser_enabled) { /* Clone the shadow pgd part as well */ memcpy(native_get_shadow_pgd(dst), diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index 233d19c9f22e..c810226e741a 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -106,7 +106,7 @@ static inline void native_pud_clear(pud_t *pud) native_set_pud(pud, native_make_pud(0)); }
-#ifdef CONFIG_KAISER +#ifdef CONFIG_PAGE_TABLE_ISOLATION extern pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd);
static inline pgd_t *native_get_shadow_pgd(pgd_t *pgdp) @@ -127,7 +127,7 @@ static inline pgd_t *native_get_shadow_pgd(pgd_t *pgdp) BUILD_BUG_ON(1); return NULL; } -#endif /* CONFIG_KAISER */ +#endif /* CONFIG_PAGE_TABLE_ISOLATION */
static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd) { diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index d5b89550fb53..ea03e53e3c07 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -109,7 +109,7 @@ #define X86_CR3_PCID_MASK (X86_CR3_PCID_NOFLUSH | X86_CR3_PCID_ASID_MASK) #define X86_CR3_PCID_ASID_KERN (_AC(0x0,UL))
-#if defined(CONFIG_KAISER) && defined(CONFIG_X86_64) +#if defined(CONFIG_PAGE_TABLE_ISOLATION) && defined(CONFIG_X86_64) /* Let X86_CR3_PCID_ASID_USER be usable for the X86_CR3_PCID_NOFLUSH bit */ #define X86_CR3_PCID_ASID_USER (_AC(0x80,UL))
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 73865b174090..a691b66cc40a 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -135,7 +135,7 @@ static inline void cr4_set_bits_and_update_boot(unsigned long mask) * Declare a couple of kaiser interfaces here for convenience, * to avoid the need for asm/kaiser.h in unexpected places. */ -#ifdef CONFIG_KAISER +#ifdef CONFIG_PAGE_TABLE_ISOLATION extern int kaiser_enabled; extern void kaiser_setup_pcid(void); extern void kaiser_flush_tlb_on_return_to_user(void); diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c index e4f109a7ed9a..c0fbb9393628 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c @@ -252,7 +252,7 @@ static DEFINE_PER_CPU(void *, insn_buffer);
static void *dsalloc(size_t size, gfp_t flags, int node) { -#ifdef CONFIG_KAISER +#ifdef CONFIG_PAGE_TABLE_ISOLATION unsigned int order = get_order(size); struct page *page; unsigned long addr; @@ -273,7 +273,7 @@ static void *dsalloc(size_t size, gfp_t flags, int node)
static void dsfree(const void *buffer, size_t size) { -#ifdef CONFIG_KAISER +#ifdef CONFIG_PAGE_TABLE_ISOLATION if (!buffer) return; kaiser_remove_mapping((unsigned long)buffer, size); diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index f681e3bc6c22..95b57f4393b7 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -1319,7 +1319,7 @@ ENTRY(paranoid_entry) SWAPGS xorl %ebx,%ebx 1: -#ifdef CONFIG_KAISER +#ifdef CONFIG_PAGE_TABLE_ISOLATION /* * We might have come in between a swapgs and a SWITCH_KERNEL_CR3 * on entry, or between a SWITCH_USER_CR3 and a swapgs on exit. @@ -1360,7 +1360,7 @@ ENTRY(paranoid_exit) DISABLE_INTERRUPTS(CLBR_NONE) TRACE_IRQS_OFF_DEBUG TRACE_IRQS_IRETQ_DEBUG -#ifdef CONFIG_KAISER +#ifdef CONFIG_PAGE_TABLE_ISOLATION /* No ALTERNATIVE for X86_FEATURE_KAISER: paranoid_entry sets %ebx */ testl $2, %ebx /* SWITCH_USER_CR3 needed? */ jz paranoid_exit_no_switch @@ -1564,7 +1564,7 @@ ENTRY(nmi) */ movq %rsp, %rdi movq $-1, %rsi -#ifdef CONFIG_KAISER +#ifdef CONFIG_PAGE_TABLE_ISOLATION /* Unconditionally use kernel CR3 for do_nmi() */ /* %rax is saved above, so OK to clobber here */ ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_KAISER @@ -1577,7 +1577,7 @@ ENTRY(nmi) 2: #endif call do_nmi -#ifdef CONFIG_KAISER +#ifdef CONFIG_PAGE_TABLE_ISOLATION /* * Unconditionally restore CR3. I know we return to * kernel code that needs user CR3, but do we ever return @@ -1803,7 +1803,7 @@ end_repeat_nmi: 1: movq %rsp, %rdi movq $-1, %rsi -#ifdef CONFIG_KAISER +#ifdef CONFIG_PAGE_TABLE_ISOLATION /* Unconditionally use kernel CR3 for do_nmi() */ /* %rax is saved above, so OK to clobber here */ ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_KAISER @@ -1819,7 +1819,7 @@ end_repeat_nmi:
/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */ call do_nmi -#ifdef CONFIG_KAISER +#ifdef CONFIG_PAGE_TABLE_ISOLATION /* * Unconditionally restore CR3. We might be returning to * kernel code that needs user CR3, like just just before diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S index 0d89f538efb6..53ef274d2a89 100644 --- a/arch/x86/kernel/head_64.S +++ b/arch/x86/kernel/head_64.S @@ -441,7 +441,7 @@ early_idt_ripmsg: .balign PAGE_SIZE; \ GLOBAL(name)
-#ifdef CONFIG_KAISER +#ifdef CONFIG_PAGE_TABLE_ISOLATION /* * Each PGD needs to be 8k long and 8k aligned. We do not * ever go out to userspace with these, so we do not diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index 4c15b80bf7df..3fbf7de9e9bd 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -32,4 +32,4 @@ obj-$(CONFIG_ACPI_NUMA) += srat.o obj-$(CONFIG_NUMA_EMU) += numa_emulation.o
obj-$(CONFIG_X86_INTEL_MPX) += mpx.o -obj-$(CONFIG_KAISER) += kaiser.o +obj-$(CONFIG_PAGE_TABLE_ISOLATION) += kaiser.o diff --git a/include/linux/kaiser.h b/include/linux/kaiser.h index 4a4d6d911a14..58c55b1589d0 100644 --- a/include/linux/kaiser.h +++ b/include/linux/kaiser.h @@ -1,7 +1,7 @@ #ifndef _LINUX_KAISER_H #define _LINUX_KAISER_H
-#ifdef CONFIG_KAISER +#ifdef CONFIG_PAGE_TABLE_ISOLATION #include <asm/kaiser.h>
static inline int kaiser_map_thread_stack(void *stack) @@ -24,7 +24,7 @@ static inline void kaiser_unmap_thread_stack(void *stack) #else
/* - * These stubs are used whenever CONFIG_KAISER is off, which + * These stubs are used whenever CONFIG_PAGE_TABLE_ISOLATION is off, which * includes architectures that support KAISER, but have it disabled. */
@@ -48,5 +48,5 @@ static inline void kaiser_unmap_thread_stack(void *stack) { }
-#endif /* !CONFIG_KAISER */ +#endif /* !CONFIG_PAGE_TABLE_ISOLATION */ #endif /* _LINUX_KAISER_H */ diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h index 7cecbdbf5c25..7c2f5bb9e0e9 100644 --- a/include/linux/percpu-defs.h +++ b/include/linux/percpu-defs.h @@ -35,7 +35,7 @@
#endif
-#ifdef CONFIG_KAISER +#ifdef CONFIG_PAGE_TABLE_ISOLATION #define USER_MAPPED_SECTION "..user_mapped" #else #define USER_MAPPED_SECTION "" diff --git a/security/Kconfig b/security/Kconfig index ae44b94d63b2..b6e47c9d1a7e 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -31,7 +31,7 @@ config SECURITY
If you are unsure how to answer this question, answer N.
-config KAISER +config PAGE_TABLE_ISOLATION bool "Remove the kernel mapping in user mode" default y depends on X86_64 && SMP
From: Jamie Iles jamie.iles@oracle.com
94b1f3e2c4b7 (kaiser: merged update) factored out __free_ldt_struct() to use vfree/free_page, but in the page allocation case it is actually allocated with kmalloc so needs to be freed with kfree and not free_page().
Reported-by: Vegard Nossum vegard.nossum@oracle.com Signed-off-by: Jamie Iles jamie.iles@oracle.com Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/kernel/ldt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c index 5797d437710d..5de9fbc4ab50 100644 --- a/arch/x86/kernel/ldt.c +++ b/arch/x86/kernel/ldt.c @@ -39,7 +39,7 @@ static void __free_ldt_struct(struct ldt_struct *ldt) if (ldt->size * LDT_ENTRY_SIZE > PAGE_SIZE) vfree(ldt->entries); else - free_page((unsigned long)ldt->entries); + kfree(ldt->entries); kfree(ldt); }
From: Jiri Kosina jkosina@suse.cz
old_memmap's efi_call_phys_prolog() calls set_pgd() with swapper PGD that has PAGE_USER set, which makes PTI set NX on it, and therefore EFI can't execute it's code.
Fix that by forcefully clearing _PAGE_NX from the PGD (this can't be done by the pgprot API).
_PAGE_NX will be automatically reintroduced in efi_call_phys_epilog(), as _set_pgd() will again notice that this is _PAGE_USER, and set _PAGE_NX on it.
Signed-off-by: Jiri Kosina jkosina@suse.cz Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/platform/efi/efi_64.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c index 18dfaad71c99..12118bae3caf 100644 --- a/arch/x86/platform/efi/efi_64.c +++ b/arch/x86/platform/efi/efi_64.c @@ -90,6 +90,12 @@ pgd_t * __init efi_call_phys_prolog(void) save_pgd[pgd] = *pgd_offset_k(pgd * PGDIR_SIZE); vaddress = (unsigned long)__va(pgd * PGDIR_SIZE); set_pgd(pgd_offset_k(pgd * PGDIR_SIZE), *pgd_offset_k(vaddress)); + /* + * pgprot API doesn't clear it for PGD + * + * Will be brought back automatically in _epilog() + */ + pgd_offset_k(pgd * PGDIR_SIZE)->pgd &= ~_PAGE_NX; } out: __flush_tlb_all();
From: Konrad Rzeszutek Wilk konrad.wilk@oracle.com
Very very partial backport from aa8c6248f8c75 where there is a check to see if this is an Xen PV guest - and if so disable it.
The reason is that the PV ABI would require a major overhaul to be Meltdown resistent.
Instead there are mitigations (PV in HVM) which are far more suitable.
Signed-off-by: Konrad Rzeszutek Wilk konrad.wilk@oracle.com --- arch/x86/mm/kaiser.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index 50783ad4861d..05fb616c9e2a 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -10,6 +10,7 @@ #include <linux/mm.h> #include <linux/uaccess.h> #include <linux/ftrace.h> +#include <xen/xen.h>
#include <asm/kaiser.h> #include <asm/tlbflush.h> /* to verify its kaiser declarations */ @@ -264,6 +265,9 @@ void __init kaiser_check_boottime_disable(void) char arg[5]; int ret;
+ if (xen_pv_domain()) + goto disable; + ret = cmdline_find_option(boot_command_line, "pti", arg, sizeof(arg)); if (ret > 0) { if (!strncmp(arg, "on", 2))
cat /proc/cpuinfo still shows kaiser feature, and want only pti to be visible to users. Therefore, rename this macro to get correct user visible output.
Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com Signed-off-by: Brian Maly brian.maly@oracle.com --- arch/x86/include/asm/cpufeature.h | 2 +- arch/x86/include/asm/kaiser.h | 6 +++--- arch/x86/kernel/entry_64.S | 12 ++++++------ arch/x86/mm/kaiser.c | 4 ++-- 4 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index a22aa8c5903c..fc790e2f0d4e 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -202,7 +202,7 @@ #define X86_FEATURE_KAISER ( 7*32+31) /* CONFIG_KAISER w/o nokaiser */
/* Because the ALTERNATIVE scheme is for members of the X86_FEATURE club... */ -#define X86_FEATURE_KAISER ( 7*32+31) /* CONFIG_PAGE_TABLE_ISOLATION w/o nokaiser */ +#define X86_FEATURE_PTI ( 7*32+31) /* CONFIG_PAGE_TABLE_ISOLATION w/o nopti */
/* Virtualization flags: Linux defined, word 8 */ #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */ diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h index 802bbbdfe143..5d85ddf26166 100644 --- a/arch/x86/include/asm/kaiser.h +++ b/arch/x86/include/asm/kaiser.h @@ -47,14 +47,14 @@ movq \reg, %cr3 .endm
.macro SWITCH_KERNEL_CR3 -ALTERNATIVE "jmp 8f", "pushq %rax", X86_FEATURE_KAISER +ALTERNATIVE "jmp 8f", "pushq %rax", X86_FEATURE_PTI _SWITCH_TO_KERNEL_CR3 %rax popq %rax 8: .endm
.macro SWITCH_USER_CR3 -ALTERNATIVE "jmp 8f", "pushq %rax", X86_FEATURE_KAISER +ALTERNATIVE "jmp 8f", "pushq %rax", X86_FEATURE_PTI _SWITCH_TO_USER_CR3 %rax %al popq %rax 8: @@ -63,7 +63,7 @@ popq %rax .macro SWITCH_KERNEL_CR3_NO_STACK ALTERNATIVE "jmp 8f", \ __stringify(movq %rax, PER_CPU_VAR(unsafe_stack_register_backup)), \ - X86_FEATURE_KAISER + X86_FEATURE_PTI _SWITCH_TO_KERNEL_CR3 %rax movq PER_CPU_VAR(unsafe_stack_register_backup), %rax 8: diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 95b57f4393b7..204dfb58c9f0 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -1327,7 +1327,7 @@ ENTRY(paranoid_entry) * unconditionally, but we need to find out whether the reverse * should be done on return (conveyed to paranoid_exit in %ebx). */ - ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_KAISER + ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_PTI testl $KAISER_SHADOW_PGD_OFFSET, %eax jz 2f orl $2, %ebx @@ -1361,7 +1361,7 @@ ENTRY(paranoid_exit) TRACE_IRQS_OFF_DEBUG TRACE_IRQS_IRETQ_DEBUG #ifdef CONFIG_PAGE_TABLE_ISOLATION - /* No ALTERNATIVE for X86_FEATURE_KAISER: paranoid_entry sets %ebx */ + /* No ALTERNATIVE for X86_FEATURE_PTI: paranoid_entry sets %ebx */ testl $2, %ebx /* SWITCH_USER_CR3 needed? */ jz paranoid_exit_no_switch SWITCH_USER_CR3 @@ -1567,7 +1567,7 @@ ENTRY(nmi) #ifdef CONFIG_PAGE_TABLE_ISOLATION /* Unconditionally use kernel CR3 for do_nmi() */ /* %rax is saved above, so OK to clobber here */ - ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_KAISER + ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_PTI /* If PCID enabled, NOFLUSH now and NOFLUSH on return */ ALTERNATIVE "", "bts $63, %rax", X86_FEATURE_PCID pushq %rax @@ -1583,7 +1583,7 @@ ENTRY(nmi) * kernel code that needs user CR3, but do we ever return * to "user mode" where we need the kernel CR3? */ - ALTERNATIVE "", "popq %rax; movq %rax, %cr3", X86_FEATURE_KAISER + ALTERNATIVE "", "popq %rax; movq %rax, %cr3", X86_FEATURE_PTI #endif /* * Return back to user mode. We must *not* do the normal exit @@ -1806,7 +1806,7 @@ end_repeat_nmi: #ifdef CONFIG_PAGE_TABLE_ISOLATION /* Unconditionally use kernel CR3 for do_nmi() */ /* %rax is saved above, so OK to clobber here */ - ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_KAISER + ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_PTI /* If PCID enabled, NOFLUSH now and NOFLUSH on return */ ALTERNATIVE "", "bts $63, %rax", X86_FEATURE_PCID pushq %rax @@ -1825,7 +1825,7 @@ end_repeat_nmi: * kernel code that needs user CR3, like just just before * a sysret. */ - ALTERNATIVE "", "popq %rax; movq %rax, %cr3", X86_FEATURE_KAISER + ALTERNATIVE "", "popq %rax; movq %rax, %cr3", X86_FEATURE_PTI #endif
testl %ebx,%ebx /* swapgs needed? */ diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c index 05fb616c9e2a..e37ac77e6cb6 100644 --- a/arch/x86/mm/kaiser.c +++ b/arch/x86/mm/kaiser.c @@ -289,14 +289,14 @@ skip:
enable: if (enable) - setup_force_cpu_cap(X86_FEATURE_KAISER); + setup_force_cpu_cap(X86_FEATURE_PTI);
return;
disable: pr_info("Kernel/User page tables isolation: disabled\n"); kaiser_enabled = 0; - setup_clear_cpu_cap(X86_FEATURE_KAISER); + setup_clear_cpu_cap(X86_FEATURE_PTI); }
/*
In entry_64.S we have code like this:
/* Unconditionally use kernel CR3 for do_nmi() */ /* %rax is saved above, so OK to clobber here */ ALTERNATIVE "jmp 2f", "movq %cr3, %rax", X86_FEATURE_KAISER /* If PCID enabled, NOFLUSH now and NOFLUSH on return */ ALTERNATIVE "", "bts $63, %rax", X86_FEATURE_PCID pushq %rax /* mask off "user" bit of pgd address and 12 PCID bits: */ andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax movq %rax, %cr3 2:
/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */ call do_nmi
With this instruction: andq $(~(X86_CR3_PCID_ASID_MASK | KAISER_SHADOW_PGD_OFFSET)), %rax
We unconditionally switch from whatever our CR3 was to kernel page table. But, in arch/x86/platform/efi/efi_64.c We temporarily set a different page table, that does not have the kernel page table with 0x1000 offset from it.
Look in efi_thunk() and efi_thunk_set_virtual_address_map().
So, while CR3 points to the other page table, we get an NMI interrupt, and clear 0x1000 from CR3, resulting in a bogus CR3 if the 0x1000 bit was set.
The efi page table comes from realmode/rm/trampoline_64.S:
arch/x86/realmode/rm/trampoline_64.S
141 .bss 142 .balign PAGE_SIZE 143 GLOBAL(trampoline_pgd) .space PAGE_SIZE
Notice: alignment is PAGE_SIZE, so after applying KAISER_SHADOW_PGD_OFFSET which equal to PAGE_SIZE, we can get a different page table.
But, even if we fix alignment, here the trampoline binary is later copied into dynamically allocated memory in reserve_real_mode(), so we need to fix that place as well.
Fixes: 8a43ddfb93a0 ("KAISER: Kernel Address Isolation")
Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com Reviewed-by: Steven Sistare steven.sistare@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 7ec5d87df34a90758cf2aaf6824bb748454a8f35) Signed-off-by: Pavel Tatashin pasha.tatashin@oracle.com --- arch/x86/include/asm/kaiser.h | 10 ++++++++++ arch/x86/realmode/init.c | 4 +++- arch/x86/realmode/rm/trampoline_64.S | 3 ++- 3 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h index 5d85ddf26166..620619d8de57 100644 --- a/arch/x86/include/asm/kaiser.h +++ b/arch/x86/include/asm/kaiser.h @@ -19,6 +19,16 @@
#define KAISER_SHADOW_PGD_OFFSET 0x1000
+#ifdef CONFIG_PAGE_TABLE_ISOLATION +/* + * A page table address must have this alignment to stay the same when + * KAISER_SHADOW_PGD_OFFSET mask is applied + */ +#define KAISER_KERNEL_PGD_ALIGNMENT (KAISER_SHADOW_PGD_OFFSET << 1) +#else +#define KAISER_KERNEL_PGD_ALIGNMENT PAGE_SIZE +#endif + #ifdef __ASSEMBLY__ #ifdef CONFIG_PAGE_TABLE_ISOLATION
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c index 0b7a63d98440..805a3271a137 100644 --- a/arch/x86/realmode/init.c +++ b/arch/x86/realmode/init.c @@ -4,6 +4,7 @@ #include <asm/cacheflush.h> #include <asm/pgtable.h> #include <asm/realmode.h> +#include <asm/kaiser.h>
struct real_mode_header *real_mode_header; u32 *trampoline_cr4_features; @@ -15,7 +16,8 @@ void __init reserve_real_mode(void) size_t size = PAGE_ALIGN(real_mode_blob_end - real_mode_blob);
/* Has to be under 1M so we can execute real-mode AP code. */ - mem = memblock_find_in_range(0, 1<<20, size, PAGE_SIZE); + mem = memblock_find_in_range(0, 1 << 20, size, + KAISER_KERNEL_PGD_ALIGNMENT); if (!mem) panic("Cannot allocate trampoline\n");
diff --git a/arch/x86/realmode/rm/trampoline_64.S b/arch/x86/realmode/rm/trampoline_64.S index dac7b20d2f9d..781cca63f795 100644 --- a/arch/x86/realmode/rm/trampoline_64.S +++ b/arch/x86/realmode/rm/trampoline_64.S @@ -30,6 +30,7 @@ #include <asm/msr.h> #include <asm/segment.h> #include <asm/processor-flags.h> +#include <asm/kaiser.h> #include "realmode.h"
.text @@ -139,7 +140,7 @@ tr_gdt: tr_gdt_end:
.bss - .balign PAGE_SIZE + .balign KAISER_KERNEL_PGD_ALIGNMENT GLOBAL(trampoline_pgd) .space PAGE_SIZE
.balign 8
[ adding Hugh and Miroslav to CC ]
On Mon, 5 Mar 2018, Pavel Tatashin wrote:
The git of this backport can be found here: git clone --branch pti_v4.1.49 https://github.com/soleen/linux
The patches were backported from stable 4.4 to Oracle UEK4.1, and from UEK4.1 to Stable 4.1
You'd want to consider also taking trampoline stack entry patches we currently are putting into SLE12-SP[23] kernel tree [1], which also bases on the original Hugh's 4.4 port.
I've notified Greg already that trampoline stack entry should be put into 4.4-stable proper as well, but that hasn't happened yet.
[1] you can find those on https://kernel.suse.com/; the patches that are currently in (commits 6d3ee9dc14bafff and f7de61af517916 in kernel-source.git there) do pass the usual battery of PTI testing. There will be some changes made to it to avoid external module ABI breakage, but those will not be picked up by stable I guess, and you probably don't care either.
linux-stable-mirror@lists.linaro.org