Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a pure-userspace application get regularly interrupted by IPIs sent from housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs leading to various on_each_cpu() calls, e.g.:
  64359.052209596    NetworkManager   0   1405   smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
        smp_call_function_many_cond+0x1
        smp_call_function+0x39
        on_each_cpu+0x2a
        flush_tlb_kernel_range+0x7b
        __purge_vmap_area_lazy+0x70
        _vm_unmap_aliases.part.42+0xdf
        change_page_attr_set_clr+0x16a
        set_memory_ro+0x26
        bpf_int_jit_compile+0x2f9
        bpf_prog_select_runtime+0xc6
        bpf_prepare_filter+0x523
        sk_attach_filter+0x13
        sock_setsockopt+0x92c
        __sys_setsockopt+0x16a
        __x64_sys_setsockopt+0x20
        do_syscall_64+0x87
        entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the thought that while we cannot remove NOHZ_FULL CPUs from the list of CPUs targeted by these IPIs, they may not have to execute the callbacks immediately. Anything that only affects kernelspace can wait until the next user->kernel transition, provided it can be executed "early enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB invalidation deferral to that [2], and I picked it up from there.
Deferral approach
=================
Storing each and every callback, like a secondary call_single_queue, turned out to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in userspace for as long as possible, so no signal of any form would be sent when deferring an IPI. This means that any form of queuing for deferred callbacks would end up as a convoluted memory leak.
Deferred IPIs must thus be coalesced, which this series achieves by assigning IPIs a "type" and having a mapping of IPI type to callback, leveraged upon kernel entry.
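For illustration, here is a minimal sketch of that idea. All names (example_*) are hypothetical, and it glosses over everything the real infrastructure has to get right (atomicity vs. the context-tracking state, ordering, and being noinstr-safe in the entry path):

	#include <linux/percpu.h>
	#include <linux/bitops.h>
	#include <asm/sync_core.h>
	#include <asm/tlbflush.h>

	/* One bit per deferrable IPI "type". */
	enum {
		EXAMPLE_WORK_SYNC_CORE,
		EXAMPLE_WORK_TLB_FLUSH,
	};

	/* Per-CPU mask of work deferred while the CPU was in userspace. */
	static DEFINE_PER_CPU(unsigned long, example_deferred_work);

	/* Housekeeping side: instead of sending an IPI, coalesce into the mask. */
	static void example_defer_work(int cpu, int work)
	{
		set_bit(work, per_cpu_ptr(&example_deferred_work, cpu));
	}

	/* Isolated CPU side: run the type->callback mapping upon kernel entry. */
	static void example_flush_deferred_work(void)
	{
		unsigned long *work = this_cpu_ptr(&example_deferred_work);

		if (test_and_clear_bit(EXAMPLE_WORK_SYNC_CORE, work))
			sync_core();
		if (test_and_clear_bit(EXAMPLE_WORK_TLB_FLUSH, work))
			flush_tlb_all();
	}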
What about IPIs whose callbacks take a parameter, you may ask?
Peter suggested during OSPM23 [3] that since on_each_cpu() targets housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or housekeeping-CPU-local state to "reconstruct" the data that would have been sent via the IPI.
This series does not affect any IPI callback that requires an argument, but the approach would remain the same (one coalescable callback executed on kernel entry).
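To make that concrete, here is a hedged sketch (hypothetical example_* names, not taken from the series) of how an argument-carrying flush could be coalesced: housekeeping CPUs only bump a piece of global state, and an isolated CPU that notices it is behind performs a full, argument-free flush on kernel entry:

	#include <linux/atomic.h>
	#include <linux/percpu.h>
	#include <asm/tlbflush.h>

	/* Global state written by housekeeping CPUs instead of IPI'ing everyone... */
	static atomic_t example_flush_gen = ATOMIC_INIT(0);
	/* ...and per-CPU bookkeeping of what each isolated CPU has caught up to. */
	static DEFINE_PER_CPU(int, example_seen_gen);

	static void example_note_kernel_flush(void)
	{
		atomic_inc(&example_flush_gen);
	}

	static void example_catch_up(void)
	{
		int gen = atomic_read(&example_flush_gen);

		if (this_cpu_read(example_seen_gen) != gen) {
			flush_tlb_all();	/* no argument to reconstruct */
			this_cpu_write(example_seen_gen, gen);
		}
	}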
Kernel entry vs execution of the deferred operation
====================================================
This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].
There is a non-zero length of code that is executed upon kernel entry before the deferred operation itself can be executed (i.e. before we start getting into context_tracking.c proper):
  idtentry_func_foo()                <--- we're in the kernel
    irqentry_enter()
      enter_from_user_mode()
        __ct_user_exit()
          ct_kernel_enter_state()
            ct_work_flush()          <--- deferred operation is executed here
This means one must take extra care about what can happen in the early entry code, and ensure that <bad things> cannot happen. For instance, we really don't want to hit instructions that have been modified by a remote text_poke() while we're on our way to execute a deferred sync_core(). The patches doing the actual deferral have more detail on this.
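As a purely illustrative timeline of the hazard (a hypothetical interleaving, not lifted from the patches), written as a C comment:

	/*
	 *   housekeeping CPU                   isolated NOHZ_FULL CPU
	 *   ----------------                   ----------------------
	 *   text_poke_bp(addr, ...)            (userspace, patching IPI deferred)
	 *   ...                                syscall/IRQ -> kernel entry
	 *                                      executes insn at addr   <- not serialized yet!
	 *                                      ct_work_flush()
	 *                                        sync_core()           <- too late
	 */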
Patches
=======
o Patches 1-2 are standalone objtool cleanups.
o Patches 3-4 add an RCU testing feature.
o Patches 5-6 add infrastructure for annotating static keys and static calls
  that may be used in noinstr code (courtesy of Josh).
o Patches 7-19 use said annotations on relevant keys / calls.
o Patch 20 enforces proper usage of said annotations (courtesy of Josh).
o Patches 21-23 fiddle with CT_STATE* within context tracking
o Patches 24-29 add the actual IPI deferral faff
o Patch 30 adds a freebie: deferring IPIs for NOHZ_IDLE. Not tested that much!
  If you care about battery-powered devices and energy consumption, go give it a try!
Patches are also available at: https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v4
Stuff I'd like eyes and neurons on
==================================
Context-tracking vs idle. Patch 22 "improves" the situation by adding an IDLE->KERNEL transition when getting an IRQ while idle, but it leaves the following window:
  ~> IRQ
  ct_nmi_enter()
    state = state + CT_STATE_KERNEL - CT_STATE_IDLE

  [...]

  ct_nmi_exit()
    state = state - CT_STATE_KERNEL + CT_STATE_IDLE

  [...] /!\ CT_STATE_IDLE here while we're really in kernelspace! /!\

  ct_cpuidle_exit()
    state = state + CT_STATE_KERNEL - CT_STATE_IDLE
Said window is contained within cpu_idle_poll() and the cpuidle call within cpuidle_enter_state(), both of which are noinstr (the former is __cpuidle, which is noinstr itself). Thus objtool will consider it as early entry and will warn accordingly about any static key / call misuse, so the damage is somewhat contained, but it's not ideal.
I tried fiddling with this but idle polling likes being annoying, as it is shaped like so:
ct_cpuidle_enter();
	raw_local_irq_enable();
	while (!tif_need_resched() &&
	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
		cpu_relax();
	raw_local_irq_disable();
ct_cpuidle_exit();
IOW, getting an IRQ that doesn't end up setting NEED_RESCHED while idle-polling doesn't come near ct_cpuidle_exit(), which prevents me from having the outermost ct_nmi_exit() leave the state as CT_STATE_KERNEL (rather than CT_STATE_IDLE).
Testing
=======
Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs. RHEL9 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
  $ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
                     -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
                     -e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
                     rteval --onlyload --loads-cpulist=$HK_CPUS \
                     --hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference (with a bit of fuzz at the start/end of the workload when spawning the processes). All tests were done with a duration of 6 hours.
v6.13-rc6

# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
    531 callback=generic_smp_call_function_single_interrupt+0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
  12818 func=do_flush_tlb_all
    910 func=do_kernel_range_flush
     78 func=do_sync_core
v6.13-rc6 + patches:

# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
# Zilch!
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
# Nada!
Note that tlb_remove_table_smp_sync() showed up during testing of v3, and has gone away as mysteriously as it showed up. Yair had a series addressing this [5] which would be worth revisiting.
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o All of the folks who attended various (too many?) talks about this and
  provided precious feedback.
Links
=====
[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/
Revisions
=========
RFCv3 -> v4
++++++++++++++
o Rebased onto v6.13-rc6
o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)
o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups
o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ
RFCv2 -> RFCv3
++++++++++++++
o Rebased onto v6.12-rc6
o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
  existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an rcutorture case for a low-size counter (Paul)
o Fixed flush_tlb_kernel_range_deferrable() definition
Josh Poimboeuf (3):
  jump_label: Add annotations for validating noinstr usage
  static_call: Add read-only-after-init static calls
  objtool: Add noinstr validation for static branches/calls
Peter Zijlstra (1):
  x86,tlb: Make __flush_tlb_global() noinstr-compliant
Valentin Schneider (26):
  objtool: Make validate_call() recognize indirect calls to pv_ops[]
  objtool: Flesh out warning related to pv_ops[] calls
  rcu: Add a small-width RCU watching counter debug option
  rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
  x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
  x86/idle: Mark x86_idle static call as __ro_after_init
  x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
  riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
  loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
  perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
  sched/clock: Mark sched_clock_running key as __ro_after_init
  x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr
  sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
  x86/kvm/vmx: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys as
    allowed in .noinstr
  stackleak: Mark stack_erasing_bypass key as allowed in .noinstr
  context_tracking: Explicitly use CT_STATE_KERNEL where it is missing
  context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry
  context_tracking: Turn CT_STATE_* into bits
  context-tracking: Introduce work deferral infrastructure
  context_tracking,x86: Defer kernel text patching IPIs
  x86/tlb: Make __flush_tlb_local() noinstr-compliant
  x86/tlb: Make __flush_tlb_all() noinstr
  x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE
 arch/Kconfig                                  |   9 ++
 arch/arm/kernel/paravirt.c                    |   2 +-
 arch/arm64/kernel/paravirt.c                  |   2 +-
 arch/loongarch/kernel/paravirt.c              |   2 +-
 arch/riscv/kernel/paravirt.c                  |   2 +-
 arch/x86/Kconfig                              |   1 +
 arch/x86/events/amd/brs.c                     |   2 +-
 arch/x86/include/asm/context_tracking_work.h  |  22 ++++
 arch/x86/include/asm/invpcid.h                |  13 +--
 arch/x86/include/asm/paravirt.h               |   4 +-
 arch/x86/include/asm/text-patching.h          |   1 +
 arch/x86/include/asm/tlbflush.h               |   3 +-
 arch/x86/include/asm/xen/hypercall.h          |  11 +-
 arch/x86/kernel/alternative.c                 |  38 ++++++-
 arch/x86/kernel/cpu/bugs.c                    |   9 +-
 arch/x86/kernel/kprobes/core.c                |   4 +-
 arch/x86/kernel/kprobes/opt.c                 |   4 +-
 arch/x86/kernel/module.c                      |   2 +-
 arch/x86/kernel/paravirt.c                    |   4 +-
 arch/x86/kernel/process.c                     |   2 +-
 arch/x86/kvm/vmx/vmx.c                        |  11 +-
 arch/x86/mm/tlb.c                             |  46 ++++++--
 arch/x86/xen/mmu_pv.c                         |  10 +-
 arch/x86/xen/xen-ops.h                        |  12 +-
 include/asm-generic/sections.h                |  15 +++
 include/linux/context_tracking.h              |  21 ++++
 include/linux/context_tracking_state.h        |  64 +++++++++--
 include/linux/context_tracking_work.h         |  28 +++++
 include/linux/jump_label.h                    |  30 ++++-
 include/linux/objtool.h                       |   7 ++
 include/linux/static_call.h                   |  19 ++++
 kernel/context_tracking.c                     |  98 ++++++++++++++--
 kernel/rcu/Kconfig.debug                      |  15 +++
 kernel/sched/clock.c                          |   7 +-
 kernel/stackleak.c                            |   6 +-
 kernel/time/Kconfig                           |  19 ++++
 mm/vmalloc.c                                  |  35 +++++-
 tools/objtool/Documentation/objtool.txt       |  34 ++++++
 tools/objtool/check.c                         | 106 +++++++++++++++---
 tools/objtool/include/objtool/check.h         |   1 +
 tools/objtool/include/objtool/elf.h           |   1 +
 tools/objtool/include/objtool/special.h       |   1 +
 tools/objtool/special.c                       |  18 ++-
 .../selftests/rcutorture/configs/rcu/TREE04   |   1 +
 44 files changed, 635 insertions(+), 107 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h
-- 2.43.0
call_dest_name() does not get passed the file pointer of validate_call(), which means its invocation of insn_reloc() will always return NULL. Make it take a file pointer.
While at it, make sure call_dest_name() uses arch_dest_reloc_offset(), otherwise it gets the pv_ops[] offset wrong.
Fabricating an intentional warning shows the change; previously:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section
now:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[1]() leaves .noinstr.text section
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 tools/objtool/check.c | 14 ++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 76060da755b5c..b35763f05a548 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3448,7 +3448,7 @@ static inline bool func_uaccess_safe(struct symbol *func)
 	return false;
 }
 
-static inline const char *call_dest_name(struct instruction *insn)
+static inline const char *call_dest_name(struct objtool_file *file, struct instruction *insn)
 {
 	static char pvname[19];
 	struct reloc *reloc;
@@ -3457,9 +3457,9 @@ static inline const char *call_dest_name(struct instruction *insn)
 	if (insn_call_dest(insn))
 		return insn_call_dest(insn)->name;
 
-	reloc = insn_reloc(NULL, insn);
+	reloc = insn_reloc(file, insn);
 	if (reloc && !strcmp(reloc->sym->name, "pv_ops")) {
-		idx = (reloc_addend(reloc) / sizeof(void *));
+		idx = (arch_dest_reloc_offset(reloc_addend(reloc)) / sizeof(void *));
 		snprintf(pvname, sizeof(pvname), "pv_ops[%d]", idx);
 		return pvname;
 	}
@@ -3538,17 +3538,19 @@ static int validate_call(struct objtool_file *file,
 {
 	if (state->noinstr && state->instr <= 0 &&
 	    !noinstr_call_dest(file, insn, insn_call_dest(insn))) {
-		WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(insn));
+		WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(file, insn));
 		return 1;
 	}
 
 	if (state->uaccess && !func_uaccess_safe(insn_call_dest(insn))) {
-		WARN_INSN(insn, "call to %s() with UACCESS enabled", call_dest_name(insn));
+		WARN_INSN(insn, "call to %s() with UACCESS enabled",
+			  call_dest_name(file, insn));
 		return 1;
 	}
 
 	if (state->df) {
-		WARN_INSN(insn, "call to %s() with DF set", call_dest_name(insn));
+		WARN_INSN(insn, "call to %s() with DF set",
+			  call_dest_name(file, insn));
 		return 1;
 	}
I had to look into objtool itself to understand what this warning was about; make it more explicit.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 tools/objtool/check.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index b35763f05a548..6aa9259fc9940 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3486,7 +3486,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
 
 	list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
 		if (!target->sec->noinstr) {
-			WARN("pv_ops[%d]: %s", idx, target->name);
+			WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name);
 			file->pv_ops[idx].clean = false;
 		}
 	}
A later commit will reduce the size of the RCU watching counter to free up some bits for another purpose. Paul suggested adding a config option to test the extreme case where the counter is reduced to its minimum usable width for rcutorture to poke at, so do that.
Make it only configurable under RCU_EXPERT. While at it, add a comment to explain the layout of context_tracking->state.
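As a worked example (assuming a 32-bit ->state field, and if I'm reading the macros added below right):

	CT_SIZE               = 32
	CT_STATE_WIDTH        = bits_per(CT_STATE_MAX - 1) = bits_per(3) = 2   -> bits 0-1
	CT_RCU_WATCHING_WIDTH = 2 under CONFIG_RCU_DYNTICKS_TORTURE            -> bits 2-3
	CT_UNUSED_WIDTH       = 28

i.e. the torture option leaves a watching counter that wraps after only four watching-state transitions, which is the extreme case rcutorture is meant to poke at.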
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
---
 include/linux/context_tracking_state.h | 44 ++++++++++++++++++++++----
 kernel/rcu/Kconfig.debug               | 15 +++++++++
 2 files changed, 52 insertions(+), 7 deletions(-)
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 7b8433d5a8efe..0b81248aa03e2 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -18,12 +18,6 @@ enum ctx_state {
 	CT_STATE_MAX	= 4,
 };
 
-/* Odd value for watching, else even. */
-#define CT_RCU_WATCHING		CT_STATE_MAX
-
-#define CT_STATE_MASK		(CT_STATE_MAX - 1)
-#define CT_RCU_WATCHING_MASK	(~CT_STATE_MASK)
-
 struct context_tracking {
 #ifdef CONFIG_CONTEXT_TRACKING_USER
 	/*
@@ -44,9 +38,45 @@ struct context_tracking {
 #endif
 };
 
+/*
+ * We cram two different things within the same atomic variable:
+ *
+ *                    CT_RCU_WATCHING_START  CT_STATE_START
+ *                                |                |
+ *                                v                v
+ *     MSB [ RCU watching counter ][ context_state ] LSB
+ *         ^                        ^
+ *         |                        |
+ * CT_RCU_WATCHING_END        CT_STATE_END
+ *
+ * Bits are used from the LSB upwards, so unused bits (if any) will always be in
+ * upper bits of the variable.
+ */
 #ifdef CONFIG_CONTEXT_TRACKING
+#define CT_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
+
+#define CT_STATE_WIDTH bits_per(CT_STATE_MAX - 1)
+#define CT_STATE_START 0
+#define CT_STATE_END   (CT_STATE_START + CT_STATE_WIDTH - 1)
+
+#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH)
+#define CT_RCU_WATCHING_WIDTH     (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH)
+#define CT_RCU_WATCHING_START     (CT_STATE_END + 1)
+#define CT_RCU_WATCHING_END       (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1)
+#define CT_RCU_WATCHING           BIT(CT_RCU_WATCHING_START)
+
+#define CT_STATE_MASK        GENMASK(CT_STATE_END,        CT_STATE_START)
+#define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START)
+
+#define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH)
+
+static_assert(CT_STATE_WIDTH        +
+	      CT_RCU_WATCHING_WIDTH +
+	      CT_UNUSED_WIDTH       ==
+	      CT_SIZE);
+
 DECLARE_PER_CPU(struct context_tracking, context_tracking);
-#endif
+#endif	/* CONFIG_CONTEXT_TRACKING */
 
 #ifdef CONFIG_CONTEXT_TRACKING_USER
 static __always_inline int __ct_state(void)
diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index 9b0b52e1836fa..ea36953803a1e 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -168,4 +168,19 @@ config RCU_STRICT_GRACE_PERIOD
 	  when looking for certain types of RCU usage bugs, for example,
 	  too-short RCU read-side critical sections.
 
+
+config RCU_DYNTICKS_TORTURE
+	bool "Minimize RCU dynticks counter size"
+	depends on RCU_EXPERT && !COMPILE_TEST
+	default n
+	help
+	  This option sets the width of the dynticks counter to its
+	  minimum usable value. This minimum width greatly increases
+	  the probability of flushing out bugs involving counter wrap,
+	  but it also increases the probability of extending grace period
+	  durations. This Kconfig option should therefore be avoided in
+	  production due to the consequent increased probability of OOMs.
+
+	  This has no value for production and is only for testing.
+
 endmenu # "RCU Debugging"
On Tue, Jan 14, 2025 at 06:51:16PM +0100, Valentin Schneider wrote:
A later commit will reduce the size of the RCU watching counter to free up some bits for another purpose. Paul suggested adding a config option to test the extreme case where the counter is reduced to its minimum usable width for rcutorture to poke at, so do that.
Make it only configurable under RCU_EXPERT. While at it, add a comment to explain the layout of context_tracking->state.
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
We now have an RCU_EXPERT config for testing a small-sized RCU dynticks counter: CONFIG_RCU_DYNTICKS_TORTURE.

Modify scenario TREE04 to use this config in order to test a ridiculously small counter (2 bits).
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
---
 tools/testing/selftests/rcutorture/configs/rcu/TREE04 | 1 +
 1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index dc4985064b3ad..67caf4276bb01 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
 CONFIG_RCU_EXPERT=y
 CONFIG_RCU_EQS_DEBUG=y
 CONFIG_RCU_LAZY=y
+CONFIG_RCU_DYNTICKS_TORTURE=y
On Tue, Jan 14, 2025 at 06:51:17PM +0100, Valentin Schneider wrote:
We now have an RCU_EXPERT config for testing small-sized RCU dynticks counter: CONFIG_RCU_DYNTICKS_TORTURE.
Modify scenario TREE04 to exercise to use this config in order to test a ridiculously small counter (2 bits).
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

 tools/testing/selftests/rcutorture/configs/rcu/TREE04 | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index dc4985064b3ad..67caf4276bb01 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
 CONFIG_RCU_EXPERT=y
 CONFIG_RCU_EQS_DEBUG=y
 CONFIG_RCU_LAZY=y
+CONFIG_RCU_DYNTICKS_TORTURE=y
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
-- 2.43.0
From: Josh Poimboeuf <jpoimboe@kernel.org>
Deferring a code patching IPI is unsafe if the patched code is in a noinstr region. In that case the text poke code must trigger an immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ CPU running in userspace.
Some noinstr static branches may really need to be patched at runtime, despite the resulting disruption. Add DEFINE_STATIC_KEY_*_NOINSTR() variants for those. They don't do anything special yet; that will come later.
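As a hypothetical usage sketch (example_key, example_entry_helper() and arch_do_the_thing() are made up for illustration, not part of this patch):

	/*
	 * Flipped at runtime from some sysfs knob; read in noinstr entry code.
	 * Deferring the patching IPI is not an option here because <rationale>.
	 */
	DEFINE_STATIC_KEY_FALSE_NOINSTR(example_key);

	extern void arch_do_the_thing(void);	/* hypothetical helper */

	void noinstr example_entry_helper(void)
	{
		if (static_branch_unlikely(&example_key))
			arch_do_the_thing();
	}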
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 include/linux/jump_label.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index f5a2727ca4a9a..88bb6e32fdcbc 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -385,6 +385,23 @@ struct static_key_false {
 #define DEFINE_STATIC_KEY_FALSE_RO(name)	\
 	struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT
 
+/*
+ * The _NOINSTR variants are used to tell objtool the static key is allowed to
+ * be used in noinstr code.
+ *
+ * They should almost never be used, as they prevent code patching IPIs from
+ * being deferred, which can be problematic for isolated NOHZ_FULL CPUs running
+ * in pure userspace.
+ *
+ * If using one of these _NOINSTR variants, please add a comment above the
+ * definition with the rationale.
+ */
+#define DEFINE_STATIC_KEY_TRUE_NOINSTR(name)	\
+	DEFINE_STATIC_KEY_TRUE(name)
+
+#define DEFINE_STATIC_KEY_FALSE_NOINSTR(name)	\
+	DEFINE_STATIC_KEY_FALSE(name)
+
 #define DECLARE_STATIC_KEY_FALSE(name)	\
 	extern struct static_key_false name
From: Josh Poimboeuf <jpoimboe@kernel.org>
Deferring a code patching IPI is unsafe if the patched code is in a noinstr region. In that case the text poke code must trigger an immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ CPU running in userspace.
If a noinstr static call only needs to be patched during boot, its key can be made ro-after-init to ensure it will never be patched at runtime.
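A hypothetical sketch of the intended pattern (example_* names are made up): the call is only retargeted from __init code, while __ro_after_init data is still writable, and is merely invoked afterwards:

	static int example_default(void) { return 0; }
	static int example_faster(void)  { return 1; }

	/* The key lands in .data..ro_after_init thanks to the _RO variant. */
	DEFINE_STATIC_CALL_RO(example_call, example_default);

	static int __init example_pick_impl(void)
	{
		/* Retargeting is fine here: ro_after_init isn't sealed yet. */
		static_call_update(example_call, example_faster);
		return 0;
	}
	early_initcall(example_pick_impl);

	/* After init the target can no longer change, so noinstr usage is fine. */
	int noinstr example_read(void)
	{
		return static_call(example_call)();
	}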
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 include/linux/static_call.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index 78a77a4ae0ea8..ea6ca57e2a829 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -192,6 +192,14 @@ extern long __static_call_return0(void);
 	};								\
 	ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
 
+#define DEFINE_STATIC_CALL_RO(name, _func)				\
+	DECLARE_STATIC_CALL(name, _func);				\
+	struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\
+		.func = _func,						\
+		.type = 1,						\
+	};								\
+	ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
+
 #define DEFINE_STATIC_CALL_NULL(name, _func)				\
 	DECLARE_STATIC_CALL(name, _func);				\
 	struct static_call_key STATIC_CALL_KEY(name) = {		\
@@ -200,6 +208,14 @@ extern long __static_call_return0(void);
 	};								\
 	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
 
+#define DEFINE_STATIC_CALL_NULL_RO(name, _func)				\
+	DECLARE_STATIC_CALL(name, _func);				\
+	struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\
+		.func = NULL,						\
+		.type = 1,						\
+	};								\
+	ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
+
 #define DEFINE_STATIC_CALL_RET0(name, _func)				\
 	DECLARE_STATIC_CALL(name, _func);				\
 	struct static_call_key STATIC_CALL_KEY(name) = {		\
Later commits will cause objtool to warn about static calls being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
pv_sched_clock is updated in:
o __init vmware_paravirt_ops_setup()
o __init xen_init_time_common()
o kvm_sched_clock_init() <- __init kvmclock_init()
o hv_setup_sched_clock() <- __init hv_init_tsc_clocksource()
IOW purely init context, and can thus be marked as __ro_after_init.
Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index fec3815335558..ae6675167877b 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -73,7 +73,7 @@ static u64 native_steal_clock(int cpu)
 }
 
 DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
-DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
+DEFINE_STATIC_CALL_RO(pv_sched_clock, native_sched_clock);
 
 void paravirt_set_sched_clock(u64 (*func)(void))
 {
Later commits will cause objtool to warn about static calls being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
x86_idle is updated in:
o xen_set_default_idle() <- __init xen_arch_setup()
o __init select_idle_routine()
IOW purely init context, and can thus be marked as __ro_after_init.
Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/process.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index f63f8fd00a91f..85cd62f61d633 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -746,7 +746,7 @@ void __cpuidle default_idle(void)
 EXPORT_SYMBOL(default_idle);
 #endif
 
-DEFINE_STATIC_CALL_NULL(x86_idle, default_idle);
+DEFINE_STATIC_CALL_NULL_RO(x86_idle, default_idle);
 
 static bool x86_idle_set(void)
 {
The static call is only ever updated in
  __init pv_time_init()
  __init xen_init_time_common()
  __init vmware_paravirt_ops_setup()
  __init xen_time_setup_guest()
so mark it appropriately as __ro_after_init.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ae6675167877b..a19b0002b18d8 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -72,7 +72,7 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 DEFINE_STATIC_CALL_RO(pv_sched_clock, native_sched_clock);
 
 void paravirt_set_sched_clock(u64 (*func)(void))
The static call is only ever updated in:
  __init pv_time_init()
  __init xen_time_setup_guest()
so mark it appropriately as __ro_after_init.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/riscv/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/riscv/kernel/paravirt.c b/arch/riscv/kernel/paravirt.c
index fa6b0339a65de..dfe8808016fd8 100644
--- a/arch/riscv/kernel/paravirt.c
+++ b/arch/riscv/kernel/paravirt.c
@@ -30,7 +30,7 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 
 static bool steal_acc = true;
 static int __init parse_no_stealacc(char *arg)
On Tue, Jan 14, 2025 at 06:51:23PM +0100, Valentin Schneider wrote:
The static call is only ever updated in:
__init pv_time_init() __init xen_time_setup_guest()
so mark it appropriately as __ro_after_init.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>

 arch/riscv/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/riscv/kernel/paravirt.c b/arch/riscv/kernel/paravirt.c
index fa6b0339a65de..dfe8808016fd8 100644
--- a/arch/riscv/kernel/paravirt.c
+++ b/arch/riscv/kernel/paravirt.c
@@ -30,7 +30,7 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 
 static bool steal_acc = true;
 static int __init parse_no_stealacc(char *arg)

-- 
2.43.0
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
The static call is only ever updated in
  __init pv_time_init()
  __init xen_time_setup_guest()
so mark it appropriately as __ro_after_init.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/loongarch/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c
index e5a39bbad0780..b011578d3e931 100644
--- a/arch/loongarch/kernel/paravirt.c
+++ b/arch/loongarch/kernel/paravirt.c
@@ -20,7 +20,7 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 
 static bool steal_acc = true;
The static call is only ever updated in
  __init pv_time_init()
  __init xen_time_setup_guest()
so mark it appropriately as __ro_after_init.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/arm64/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/paravirt.c b/arch/arm64/kernel/paravirt.c
index aa718d6a9274a..ad28fa23c9228 100644
--- a/arch/arm64/kernel/paravirt.c
+++ b/arch/arm64/kernel/paravirt.c
@@ -32,7 +32,7 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
 
 struct pv_time_stolen_time_region {
 	struct pvclock_vcpu_stolen_time __rcu *kaddr;
The static call is only ever updated in
__init xen_time_setup_guest()
so mark it appropriately as __ro_after_init.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/arm/kernel/paravirt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm/kernel/paravirt.c b/arch/arm/kernel/paravirt.c
index 7dd9806369fb0..632d8d5e06db3 100644
--- a/arch/arm/kernel/paravirt.c
+++ b/arch/arm/kernel/paravirt.c
@@ -20,4 +20,4 @@ static u64 native_steal_clock(int cpu)
 	return 0;
 }
 
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
Later commits will cause objtool to warn about static calls being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
perf_lopwr_cb is used in .noinstr code, but is only ever updated in __init amd_brs_lopwr_init(), and can thus be marked as __ro_after_init.
Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/events/amd/brs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c
index 780acd3dff22a..e2ff03af15d82 100644
--- a/arch/x86/events/amd/brs.c
+++ b/arch/x86/events/amd/brs.c
@@ -422,7 +422,7 @@ void noinstr perf_amd_brs_lopwr_cb(bool lopwr_in)
 	}
 }
 
-DEFINE_STATIC_CALL_NULL(perf_lopwr_cb, perf_amd_brs_lopwr_cb);
+DEFINE_STATIC_CALL_NULL_RO(perf_lopwr_cb, perf_amd_brs_lopwr_cb);
 EXPORT_STATIC_CALL_TRAMP_GPL(perf_lopwr_cb);
 
 void __init amd_brs_lopwr_init(void)
sched_clock_running is only ever enabled in the __init functions sched_clock_init() and sched_clock_init_late(), and is never disabled. Mark it __ro_after_init.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/sched/clock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index a09655b481402..200e5568b9894 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -66,7 +66,7 @@ notrace unsigned long long __weak sched_clock(void)
 }
 EXPORT_SYMBOL_GPL(sched_clock);
 
-static DEFINE_STATIC_KEY_FALSE(sched_clock_running);
+static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);
 
 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 /*
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
mds_idle_clear is used in .noinstr code, and can be modified at runtime (SMT hotplug). Suppressing the text_poke_sync() IPI has little benefit for this key, as hotplug implies eventually going through takedown_cpu() -> stop_machine_cpuslocked(), which is going to cause interference on all online CPUs anyway.
Mark it to let objtool know not to warn about it.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kernel/cpu/bugs.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 47a01d4028f60..acad84dcfc3cd 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -113,8 +113,13 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
 /* Control unconditional IBPB in switch_mm() */
 DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
 
-/* Control MDS CPU buffer clear before idling (halt, mwait) */
-DEFINE_STATIC_KEY_FALSE(mds_idle_clear);
+/*
+ * Control MDS CPU buffer clear before idling (halt, mwait)
+ *
+ * Allowed in .noinstr as this key is updated during hotplug which comes with
+ * more interference than just the text patching IPI.
+ */
+DEFINE_STATIC_KEY_FALSE_NOINSTR(mds_idle_clear);
 EXPORT_SYMBOL_GPL(mds_idle_clear);
 
 /*
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
__sched_clock_stable is used in .noinstr code, and can be modified at runtime (e.g. time_cpufreq_notifier()). Suppressing the text_poke_sync() IPI has little benefit for this key, as NOHZ_FULL is incompatible with an unstable TSC anyway.
Mark it to let objtool know not to warn about it.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/sched/clock.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 200e5568b9894..e59986bc14a43 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -75,8 +75,11 @@ static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);
  *
  * Similarly we start with __sched_clock_stable_early, thereby assuming we
  * will become stable, such that there's only a single 1 -> 0 transition.
+ *
+ * Allowed in .noinstr as an unstable TSC is incompatible with NOHZ_FULL,
+ * thus the text patching IPI would be the least of our concerns.
  */
-static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(__sched_clock_stable);
 static int __sched_clock_stable_early = 1;
 
 /*
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
These keys are used in .noinstr code, and can be modified at runtime (/proc/kernel/vmx* write). However it is not expected that they will be flipped during latency-sensitive operations, and thus shouldn't be a source of interference wrt the text patching IPI.
Mark it to let objtool know not to warn about it.
Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 893366e537322..a028c38f44e02 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -225,8 +225,15 @@ module_param(pt_mode, int, S_IRUGO);
 
 struct x86_pmu_lbr __ro_after_init vmx_lbr_caps;
 
-static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
-static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond);
+/*
+ * Both of these static keys end up being used in .noinstr sections, however
+ * they are only modified:
+ * - at init
+ * - from a /proc/kernel/vmx* write
+ * thus during latency-sensitive operations they should remain stable.
+ */
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_should_flush);
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_flush_cond);
 static DEFINE_MUTEX(vmx_l1d_flush_mutex);
 
 /* Storage for pre module init parameter parsing */
Please use "KVM: VMX:" for the scope.
On Tue, Jan 14, 2025, Valentin Schneider wrote:
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
These keys are used in .noinstr code, and can be modified at runtime (/proc/kernel/vmx* write). However it is not expected that they will be flipped during latency-sensitive operations, and thus shouldn't be a source of interference wrt the text patching IPI.
This misses KVM's static key that's buried behind CONFIG_HYPERV=m|y.
vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x241: __kvm_is_using_evmcs: non-RO static key usage in noinstr
vmlinux.o: warning: objtool: vmx_update_host_rsp+0x13: __kvm_is_using_evmcs: non-RO static key usage in noinstr
Side topic, it's super annoying that "objtool --noinstr" only runs on vmlinux.o. I realize objtool doesn't have the visibility to validate cross-object calls, but couldn't objtool validate calls and static key/branch usage so long as the target or key/branch is defined in the same object?
On 14/01/25 13:19, Sean Christopherson wrote:
Please use "KVM: VMX:" for the scope.
On Tue, Jan 14, 2025, Valentin Schneider wrote:
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
These keys are used in .noinstr code, and can be modified at runtime (/proc/kernel/vmx* write). However it is not expected that they will be flipped during latency-sensitive operations, and thus shouldn't be a source of interference wrt the text patching IPI.
This misses KVM's static key that's buried behind CONFIG_HYPERV=m|y.
vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x241: __kvm_is_using_evmcs: non-RO static key usage in noinstr
vmlinux.o: warning: objtool: vmx_update_host_rsp+0x13: __kvm_is_using_evmcs: non-RO static key usage in noinstr
Thanks, I'll add these to v5.
Side topic, it's super annoying that "objtool --noinstr" only runs on vmlinux.o. I realize objtool doesn't have the visibility to validate cross-object calls, but couldn't objtool validate calls and static key/branch usage so long as the target or key/branch is defined in the same object?
Per my testing you can manually run it on individual objects, but it can and will easily get hung up on the first noinstr violation it finds and not search further within one given function.
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
stack_erasing_bypass is used in .noinstr code, and can be modified at runtime (/proc/sys/kernel/stack_erasing write). However it is not expected that it will be flipped during latency-sensitive operations, and thus shouldn't be a source of interference wrt the text patching IPI.
Mark it to let objtool know not to warn about it.
Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/stackleak.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/stackleak.c b/kernel/stackleak.c
index 39fd620a7db6f..a4f07fbc13f61 100644
--- a/kernel/stackleak.c
+++ b/kernel/stackleak.c
@@ -18,7 +18,11 @@
 #include <linux/sysctl.h>
 #include <linux/init.h>
 
-static DEFINE_STATIC_KEY_FALSE(stack_erasing_bypass);
+/*
+ * This static key can only be modified via its sysctl interface. It is
+ * expected it will remain stable during latency-sensitive operations.
+ */
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(stack_erasing_bypass);
 
 #ifdef CONFIG_SYSCTL
 static int stack_erasing_sysctl(const struct ctl_table *table, int write,
From: Josh Poimboeuf <jpoimboe@kernel.org>
Warn about static branches/calls in noinstr regions, unless the corresponding key is RO-after-init or has been manually whitelisted with DEFINE_STATIC_KEY_*_NOINSTR().
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
[Added NULL check for insn_call_dest() return value]
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/jump_label.h              | 17 +++--
 include/linux/objtool.h                 |  7 ++
 include/linux/static_call.h             |  3 +
 tools/objtool/Documentation/objtool.txt | 34 +++++++++
 tools/objtool/check.c                   | 92 ++++++++++++++++++++++---
 tools/objtool/include/objtool/check.h   |  1 +
 tools/objtool/include/objtool/elf.h     |  1 +
 tools/objtool/include/objtool/special.h |  1 +
 tools/objtool/special.c                 | 18 ++++-
 9 files changed, 156 insertions(+), 18 deletions(-)
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h index 88bb6e32fdcbc..dc8a82a62c109 100644 --- a/include/linux/jump_label.h +++ b/include/linux/jump_label.h @@ -75,6 +75,7 @@
#include <linux/types.h> #include <linux/compiler.h> +#include <linux/objtool.h>
extern bool static_key_initialized;
@@ -373,8 +374,9 @@ struct static_key_false { #define DEFINE_STATIC_KEY_TRUE(name) \ struct static_key_true name = STATIC_KEY_TRUE_INIT
-#define DEFINE_STATIC_KEY_TRUE_RO(name) \ - struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT +#define DEFINE_STATIC_KEY_TRUE_RO(name) \ + struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT; \ + ANNOTATE_NOINSTR_ALLOWED(name)
#define DECLARE_STATIC_KEY_TRUE(name) \ extern struct static_key_true name @@ -382,8 +384,9 @@ struct static_key_false { #define DEFINE_STATIC_KEY_FALSE(name) \ struct static_key_false name = STATIC_KEY_FALSE_INIT
-#define DEFINE_STATIC_KEY_FALSE_RO(name) \ - struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT +#define DEFINE_STATIC_KEY_FALSE_RO(name) \ + struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT; \ + ANNOTATE_NOINSTR_ALLOWED(name)
/* * The _NOINSTR variants are used to tell objtool the static key is allowed to @@ -397,10 +400,12 @@ struct static_key_false { * definition with the rationale. */ #define DEFINE_STATIC_KEY_TRUE_NOINSTR(name) \ - DEFINE_STATIC_KEY_TRUE(name) + DEFINE_STATIC_KEY_TRUE(name); \ + ANNOTATE_NOINSTR_ALLOWED(name)
#define DEFINE_STATIC_KEY_FALSE_NOINSTR(name) \ - DEFINE_STATIC_KEY_FALSE(name) + DEFINE_STATIC_KEY_FALSE(name); \ + ANNOTATE_NOINSTR_ALLOWED(name)
#define DECLARE_STATIC_KEY_FALSE(name) \ extern struct static_key_false name diff --git a/include/linux/objtool.h b/include/linux/objtool.h index b3b8d3dab52d5..1a7389f273063 100644 --- a/include/linux/objtool.h +++ b/include/linux/objtool.h @@ -34,6 +34,12 @@ static void __used __section(".discard.func_stack_frame_non_standard") \ *__func_stack_frame_non_standard_##func = func
+#define __ANNOTATE_NOINSTR_ALLOWED(key) \ + static void __used __section(".discard.noinstr_allowed") \ + *__annotate_noinstr_allowed_##key = &key + +#define ANNOTATE_NOINSTR_ALLOWED(key) __ANNOTATE_NOINSTR_ALLOWED(key) + /* * STACK_FRAME_NON_STANDARD_FP() is a frame-pointer-specific function ignore * for the case where a function is intentionally missing frame pointer setup, @@ -157,6 +163,7 @@ #define STACK_FRAME_NON_STANDARD_FP(func) #define ANNOTATE_NOENDBR #define ASM_REACHABLE +#define ANNOTATE_NOINSTR_ALLOWED(key) #else #define ANNOTATE_INTRA_FUNCTION_CALL .macro UNWIND_HINT type:req sp_reg=0 sp_offset=0 signal=0 diff --git a/include/linux/static_call.h b/include/linux/static_call.h index ea6ca57e2a829..0d4b16d348501 100644 --- a/include/linux/static_call.h +++ b/include/linux/static_call.h @@ -133,6 +133,7 @@
#include <linux/types.h> #include <linux/cpu.h> +#include <linux/objtool.h> #include <linux/static_call_types.h>
#ifdef CONFIG_HAVE_STATIC_CALL @@ -198,6 +199,7 @@ extern long __static_call_return0(void); .func = _func, \ .type = 1, \ }; \ + ANNOTATE_NOINSTR_ALLOWED(STATIC_CALL_TRAMP(name)); \ ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
#define DEFINE_STATIC_CALL_NULL(name, _func) \ @@ -214,6 +216,7 @@ extern long __static_call_return0(void); .func = NULL, \ .type = 1, \ }; \ + ANNOTATE_NOINSTR_ALLOWED(STATIC_CALL_TRAMP(name)); \ ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
#define DEFINE_STATIC_CALL_RET0(name, _func) \ diff --git a/tools/objtool/Documentation/objtool.txt b/tools/objtool/Documentation/objtool.txt index 7c3ee959b63c7..922d3b41541d0 100644 --- a/tools/objtool/Documentation/objtool.txt +++ b/tools/objtool/Documentation/objtool.txt @@ -447,6 +447,40 @@ the objtool maintainers. names and does not use module_init() / module_exit() macros to create them.
+13. file.o: warning: func()+0x2a: key: non-RO static key usage in noinstr code + file.o: warning: func()+0x2a: key: non-RO static call usage in noinstr code + + This means that noinstr function func() uses a static key or + static call named 'key' which can be modified at runtime. This is + discouraged because it prevents code patching IPIs from being + deferred. + + You have the following options: + + 1) Check whether the static key/call in question is only modified + during init. If so, define it as read-only-after-init with + DEFINE_STATIC_KEY_*_RO() or DEFINE_STATIC_CALL_RO(). + + 2) Avoid the runtime patching. For static keys this can be done by + using static_key_enabled() or by getting rid of the static key + altogether if performance is not a concern. + + For static calls, something like the following could be done: + + target = static_call_query(foo); + if (target == func1) + func1(); + else if (target == func2) + func2(); + ... + + 3) Silence the warning by defining the static key/call with + DEFINE_STATIC_*_NOINSTR(). This decision should not + be taken lightly as it may result in code patching IPIs getting + sent to isolated NOHZ_FULL CPUs running in pure userspace. A + comment should be added above the definition explaining the + rationale for the decision. +
If the error doesn't seem to make sense, it could be a bug in objtool. Feel free to ask the objtool maintainer for help. diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 6aa9259fc9940..24219538c1587 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -1068,6 +1068,45 @@ static int create_direct_call_sections(struct objtool_file *file) return 0; }
+static int read_noinstr_allowed(struct objtool_file *file) +{ + struct section *rsec; + struct symbol *sym; + struct reloc *reloc; + + rsec = find_section_by_name(file->elf, ".rela.discard.noinstr_allowed"); + if (!rsec) + return 0; + + for_each_reloc(rsec, reloc) { + switch (reloc->sym->type) { + case STT_OBJECT: + case STT_FUNC: + sym = reloc->sym; + break; + + case STT_SECTION: + sym = find_symbol_by_offset(reloc->sym->sec, + reloc_addend(reloc)); + if (!sym) { + WARN_FUNC("can't find static key/call symbol", + reloc->sym->sec, reloc_addend(reloc)); + return -1; + } + break; + + default: + WARN("unexpected relocation symbol type in %s: %d", + rsec->name, reloc->sym->type); + return -1; + } + + sym->noinstr_allowed = 1; + } + + return 0; +} + /* * Warnings shouldn't be reported for ignored functions. */ @@ -1955,6 +1994,8 @@ static int handle_jump_alt(struct objtool_file *file, return -1; }
+ orig_insn->key = special_alt->key; + if (opts.hack_jump_label && special_alt->key_addend & 2) { struct reloc *reloc = insn_reloc(file, orig_insn);
@@ -2731,6 +2772,10 @@ static int decode_sections(struct objtool_file *file) if (ret) return ret;
+ ret = read_noinstr_allowed(file); + if (ret) + return ret; + return 0; }
@@ -3494,9 +3539,9 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn) return file->pv_ops[idx].clean; }
-static inline bool noinstr_call_dest(struct objtool_file *file, - struct instruction *insn, - struct symbol *func) +static inline bool noinstr_call_allowed(struct objtool_file *file, + struct instruction *insn, + struct symbol *func) { /* * We can't deal with indirect function calls at present; @@ -3516,10 +3561,10 @@ static inline bool noinstr_call_dest(struct objtool_file *file, return true;
/* - * If the symbol is a static_call trampoline, we can't tell. + * Only DEFINE_STATIC_CALL_*_RO allowed. */ if (func->static_call_tramp) - return true; + return func->noinstr_allowed;
/* * The __ubsan_handle_*() calls are like WARN(), they only happen when @@ -3532,14 +3577,29 @@ static inline bool noinstr_call_dest(struct objtool_file *file, return false; }
+static char *static_call_name(struct symbol *func) +{ + return func->name + strlen("__SCT__"); +} + static int validate_call(struct objtool_file *file, struct instruction *insn, struct insn_state *state) { - if (state->noinstr && state->instr <= 0 && - !noinstr_call_dest(file, insn, insn_call_dest(insn))) { - WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(file, insn)); - return 1; + if (state->noinstr && state->instr <= 0) { + struct symbol *dest = insn_call_dest(insn); + + if (dest && dest->static_call_tramp) { + if (!dest->noinstr_allowed) { + WARN_INSN(insn, "%s: non-RO static call usage in noinstr", + static_call_name(dest)); + } + + } else if (dest && !noinstr_call_allowed(file, insn, dest)) { + WARN_INSN(insn, "call to %s() leaves .noinstr.text section", + call_dest_name(file, insn)); + return 1; + } }
if (state->uaccess && !func_uaccess_safe(insn_call_dest(insn))) { @@ -3604,6 +3664,17 @@ static int validate_return(struct symbol *func, struct instruction *insn, struct return 0; }
+static int validate_static_key(struct instruction *insn, struct insn_state *state) +{ + if (state->noinstr && state->instr <= 0 && !insn->key->noinstr_allowed) { + WARN_INSN(insn, "%s: non-RO static key usage in noinstr", + insn->key->name); + return 1; + } + + return 0; +} + static struct instruction *next_insn_to_validate(struct objtool_file *file, struct instruction *insn) { @@ -3765,6 +3836,9 @@ static int validate_branch(struct objtool_file *file, struct symbol *func, if (handle_insn_ops(insn, next_insn, &state)) return 1;
+ if (insn->key) + validate_static_key(insn, &state); + switch (insn->type) {
case INSN_RETURN: diff --git a/tools/objtool/include/objtool/check.h b/tools/objtool/include/objtool/check.h index daa46f1f0965a..c0da7246eac7b 100644 --- a/tools/objtool/include/objtool/check.h +++ b/tools/objtool/include/objtool/check.h @@ -77,6 +77,7 @@ struct instruction { struct symbol *sym; struct stack_op *stack_ops; struct cfi_state *cfi; + struct symbol *key; };
static inline struct symbol *insn_func(struct instruction *insn) diff --git a/tools/objtool/include/objtool/elf.h b/tools/objtool/include/objtool/elf.h index d7e815c2fd156..0cb79931262bb 100644 --- a/tools/objtool/include/objtool/elf.h +++ b/tools/objtool/include/objtool/elf.h @@ -69,6 +69,7 @@ struct symbol { u8 embedded_insn : 1; u8 local_label : 1; u8 frame_pointer : 1; + u8 noinstr_allowed : 1; struct list_head pv_target; struct reloc *relocs; }; diff --git a/tools/objtool/include/objtool/special.h b/tools/objtool/include/objtool/special.h index 86d4af9c5aa9d..ce4759358ec48 100644 --- a/tools/objtool/include/objtool/special.h +++ b/tools/objtool/include/objtool/special.h @@ -20,6 +20,7 @@ struct special_alt { bool skip_alt; bool jump_or_nop; u8 key_addend; + struct symbol *key;
struct section *orig_sec; unsigned long orig_off; diff --git a/tools/objtool/special.c b/tools/objtool/special.c index 097a69db82a0e..982d5cb55e1bb 100644 --- a/tools/objtool/special.c +++ b/tools/objtool/special.c @@ -119,14 +119,26 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
if (entry->key) { struct reloc *key_reloc; + struct symbol *key; + s64 key_addend;
key_reloc = find_reloc_by_dest(elf, sec, offset + entry->key); if (!key_reloc) { - WARN_FUNC("can't find key reloc", - sec, offset + entry->key); + WARN_FUNC("can't find key reloc", sec, offset + entry->key); return -1; } - alt->key_addend = reloc_addend(key_reloc); + + key = key_reloc->sym; + key_addend = reloc_addend(key_reloc); + + if (key->type == STT_SECTION) + key = find_symbol_by_offset(key->sec, key_addend & ~3); + + /* embedded keys not supported */ + if (key) { + alt->key = key; + alt->key_addend = key_addend; + } }
return 0;
CT_STATE_KERNEL being zero means it can be (and is) omitted in a handful of places. A later patch will change CT_STATE_KERNEL into a non-zero value; prepare for that by using it where it should be:
o In the initial CT state
o At kernel entry / exit
No change in functionality intended.
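As a worked example of what this changes (per the current enum values, where CT_STATE_KERNEL == 0, so today the subtraction is a no-op):

	/*
	 * Kernel -> user transition in __ct_user_enter(), state == CT_STATE_USER:
	 *
	 *	raw_atomic_add(state - CT_STATE_KERNEL, &ct->state);
	 *
	 * With CT_STATE_KERNEL == 0 this is the same add as before; once
	 * CT_STATE_KERNEL becomes a non-zero bit, the subtraction is what
	 * actually flips ->state from KERNEL to USER.
	 */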
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/context_tracking.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 938c48952d265..a61498a8425e2 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -31,7 +31,7 @@ DEFINE_PER_CPU(struct context_tracking, context_tracking) = {
 	.nesting = 1,
 	.nmi_nesting = CT_NESTING_IRQ_NONIDLE,
 #endif
-	.state = ATOMIC_INIT(CT_RCU_WATCHING),
+	.state = ATOMIC_INIT(CT_RCU_WATCHING | CT_STATE_KERNEL),
 };
 EXPORT_SYMBOL_GPL(context_tracking);
 
@@ -147,7 +147,7 @@ static void noinstr ct_kernel_exit(bool user, int offset)
 	instrumentation_end();
 	WRITE_ONCE(ct->nesting, 0); /* Avoid irq-access tearing. */
 	// RCU is watching here ...
-	ct_kernel_exit_state(offset);
+	ct_kernel_exit_state(offset - CT_STATE_KERNEL);
 	// ... but is no longer watching here.
 	rcu_task_exit();
 }
@@ -175,7 +175,7 @@ static void noinstr ct_kernel_enter(bool user, int offset)
 	}
 	rcu_task_enter();
 	// RCU is not watching here ...
-	ct_kernel_enter_state(offset);
+	ct_kernel_enter_state(offset + CT_STATE_KERNEL);
 	// ... but is watching here.
 	instrumentation_begin();
 
@@ -537,7 +537,7 @@ void noinstr __ct_user_enter(enum ctx_state state)
 			 * RCU only requires CT_RCU_WATCHING increments to be fully
 			 * ordered.
 			 */
-			raw_atomic_add(state, &ct->state);
+			raw_atomic_add(state - CT_STATE_KERNEL, &ct->state);
 		}
 	}
 }
@@ -647,7 +647,7 @@ void noinstr __ct_user_exit(enum ctx_state state)
 			 * RCU only requires CT_RCU_WATCHING increments to be fully
 			 * ordered.
 			 */
-			raw_atomic_sub(state, &ct->state);
+			raw_atomic_sub(state - CT_STATE_KERNEL, &ct->state);
 		}
 	}
 }
ct_nmi_{enter, exit}() only touches the RCU watching counter and doesn't modify the actual CT state part of context_tracking.state. This means that upon receiving an IRQ when idle, the CT_STATE_IDLE->CT_STATE_KERNEL transition only happens in ct_idle_exit().
One can note that ct_nmi_enter() can only ever be entered with the CT state as either CT_STATE_KERNEL or CT_STATE_IDLE, as an IRQ/NMI happening in the CT_STATE_USER or CT_STATE_GUEST states will be routed down to ct_user_exit().
Add/remove CT_STATE_IDLE from the context tracking state as needed in ct_nmi_{enter, exit}().
Note that this leaves the following window where the CPU is executing code in kernelspace, but the context tracking state is CT_STATE_IDLE:
  ~> IRQ
  ct_nmi_enter()
    state = state + CT_STATE_KERNEL - CT_STATE_IDLE

  [...]

  ct_nmi_exit()
    state = state - CT_STATE_KERNEL + CT_STATE_IDLE

  [...] /!\ CT_STATE_IDLE here while we're really in kernelspace! /!\

  ct_cpuidle_exit()
    state = state + CT_STATE_KERNEL - CT_STATE_IDLE
Signed-off-by: Valentin Schneider vschneid@redhat.com --- kernel/context_tracking.c | 22 +++++++++++++++++++--- 1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index a61498a8425e2..15f10ddec8cbe 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -236,7 +236,9 @@ void noinstr ct_nmi_exit(void) instrumentation_end();
// RCU is watching here ... - ct_kernel_exit_state(CT_RCU_WATCHING); + ct_kernel_exit_state(CT_RCU_WATCHING - + CT_STATE_KERNEL + + CT_STATE_IDLE); // ... but is no longer watching here.
if (!in_nmi()) @@ -259,6 +261,7 @@ void noinstr ct_nmi_enter(void) { long incby = 2; struct context_tracking *ct = this_cpu_ptr(&context_tracking); + int curr_state;
/* Complain about underflow. */ WARN_ON_ONCE(ct_nmi_nesting() < 0); @@ -271,13 +274,26 @@ void noinstr ct_nmi_enter(void) * to be in the outermost NMI handler that interrupted an RCU-idle * period (observation due to Andy Lutomirski). */ - if (!rcu_is_watching_curr_cpu()) { + curr_state = raw_atomic_read(this_cpu_ptr(&context_tracking.state)); + if (!(curr_state & CT_RCU_WATCHING)) {
if (!in_nmi()) rcu_task_enter();
+ /* + * RCU isn't watching, so we're one of + * CT_STATE_IDLE + * CT_STATE_USER + * CT_STATE_GUEST + * guest/user entry is handled by ct_user_enter(), so this has + * to be idle entry. + */ + WARN_ON_ONCE((curr_state & CT_STATE_MASK) != CT_STATE_IDLE); + // RCU is not watching here ... - ct_kernel_enter_state(CT_RCU_WATCHING); + ct_kernel_enter_state(CT_RCU_WATCHING + + CT_STATE_KERNEL - + CT_STATE_IDLE); // ... but is watching here.
instrumentation_begin();
Le Tue, Jan 14, 2025 at 06:51:35PM +0100, Valentin Schneider a écrit :
ct_nmi_{enter, exit}() only touches the RCU watching counter and doesn't modify the actual CT state part of context_tracking.state. This means that upon receiving an IRQ when idle, the CT_STATE_IDLE->CT_STATE_KERNEL transition only happens in ct_idle_exit().
One can note that ct_nmi_enter() can only ever be entered with the CT state as either CT_STATE_KERNEL or CT_STATE_IDLE, as an IRQ/NMI happening in the CT_STATE_USER or CT_STATE_GUEST states will be routed down to ct_user_exit().
Are you sure? An NMI can fire between guest_state_enter_irqoff() and __svm_vcpu_run(). And NMIs interrupting userspace don't call enter_from_user_mode(). In fact they don't call irqentry_enter_from_user_mode() like regular IRQs but irqentry_nmi_enter() instead. Well that's for archs implementing common entry code, I can't speak for the others.
Unifying the behaviour between user and idle such that the IRQs/NMIs exit the CT_STATE can be interesting but I fear this may not come for free. You would need to save the old state on IRQ/NMI entry and restore it on exit.
Do we really need it?
Thanks.
On Wed, Jan 22, 2025, Frederic Weisbecker wrote:
Le Tue, Jan 14, 2025 at 06:51:35PM +0100, Valentin Schneider a écrit :
ct_nmi_{enter, exit}() only touches the RCU watching counter and doesn't modify the actual CT state part of context_tracking.state. This means that upon receiving an IRQ when idle, the CT_STATE_IDLE->CT_STATE_KERNEL transition only happens in ct_idle_exit().
One can note that ct_nmi_enter() can only ever be entered with the CT state as either CT_STATE_KERNEL or CT_STATE_IDLE, as an IRQ/NMI happening in the CT_STATE_USER or CT_STATE_GUEST states will be routed down to ct_user_exit().
Are you sure? An NMI can fire between guest_state_enter_irqoff() and __svm_vcpu_run().
Heh, technically, they can't. On SVM, KVM clears GIF prior to svm_vcpu_enter_exit(), and restores GIF=1 only after it returns. I.e. NMIs are fully blocked _on SVM_.
VMX unfortunately doesn't provide GIF, and so NMIs can arrive at any time. It's infeasible for software to prevent them, so we're stuck with that. [In theory, KVM could deliberately generate an NMI and not do IRET so that NMIs are blocked, but that would be beyond crazy].
And NMIs interrupting userspace don't call enter_from_user_mode(). In fact they don't call irqentry_enter_from_user_mode() like regular IRQs but irqentry_nmi_enter() instead. Well that's for archs implementing common entry code, I can't speak for the others.
Unifying the behaviour between user and idle such that the IRQs/NMIs exit the CT_STATE can be interesting but I fear this may not come for free. You would need to save the old state on IRQ/NMI entry and restore it on exit.
Do we really need it?
Thanks.
On 22/01/25 01:22, Frederic Weisbecker wrote:
Le Tue, Jan 14, 2025 at 06:51:35PM +0100, Valentin Schneider a écrit :
ct_nmi_{enter, exit}() only touches the RCU watching counter and doesn't modify the actual CT state part of context_tracking.state. This means that upon receiving an IRQ when idle, the CT_STATE_IDLE->CT_STATE_KERNEL transition only happens in ct_idle_exit().
One can note that ct_nmi_enter() can only ever be entered with the CT state as either CT_STATE_KERNEL or CT_STATE_IDLE, as an IRQ/NMI happening in the CT_STATE_USER or CT_STATE_GUEST states will be routed down to ct_user_exit().
Are you sure? An NMI can fire between guest_state_enter_irqoff() and __svm_vcpu_run().
Urgh, you're quite right.
And NMIs interrupting userspace don't call enter_from_user_mode(). In fact they don't call irqentry_enter_from_user_mode() like regular IRQs but irqentry_nmi_enter() instead. Well that's for archs implementing common entry code, I can't speak for the others.
That I didn't realize, so thank you for pointing it out. Having another look now, I mistook DEFINE_IDTENTRY_RAW(exc_int3) for the general case when it really isn't :(
Unifying the behaviour between user and idle such that the IRQs/NMIs exit the CT_STATE can be interesting but I fear this may not come for free. You would need to save the old state on IRQ/NMI entry and restore it on exit.
That's what I tried to avoid, but it sounds like there's no nice way around it.
Do we really need it?
Well, my problem with not doing IDLE->KERNEL transitions on IRQ/NMI is that this leads the IPI deferral logic to observe a technically-out-of-sync state for remote CPUs. Consider:

  CPUx                                     CPUy
                                             state := CT_STATE_IDLE
                                             ...
                                             ~> IRQ
                                               ...
                                               ct_nmi_enter()
                                             [in the kernel proper by now]
  text_poke_bp_batch()
    ct_set_cpu_work(CPUy, CT_WORK_SYNC)
      READ CPUy ct->state
        `-> CT_STATE_IDLE
            `-> defer IPI
I thought this meant I would need to throw out the "defer IPIs if CPU is idle" part, but AIUI this also affects CT_STATE_USER and CT_STATE_GUEST, which is a bummer :(
On 27/01/25 12:17, Valentin Schneider wrote:
On 22/01/25 01:22, Frederic Weisbecker wrote:
And NMIs interrupting userspace don't call enter_from_user_mode(). In fact they don't call irqentry_enter_from_user_mode() like regular IRQs but irqentry_nmi_enter() instead. Well that's for archs implementing common entry code, I can't speak for the others.
That I didn't realize, so thank you for pointing it out. Having another look now, I mistook DEFINE_IDTENTRY_RAW(exc_int3) for the general case when it really isn't :(
Unifying the behaviour between user and idle such that the IRQs/NMIs exit the CT_STATE can be interesting but I fear this may not come for free. You would need to save the old state on IRQ/NMI entry and restore it on exit.
That's what I tried to avoid, but it sounds like there's no nice way around it.
Do we really need it?
Well, my problem with not doing IDLE->KERNEL transitions on IRQ/NMI is that this leads the IPI deferral logic to observe a technically-out-of-sync state for remote CPUs. Consider:

  CPUx                                     CPUy
                                             state := CT_STATE_IDLE
                                             ...
                                             ~> IRQ
                                               ...
                                               ct_nmi_enter()
                                             [in the kernel proper by now]
  text_poke_bp_batch()
    ct_set_cpu_work(CPUy, CT_WORK_SYNC)
      READ CPUy ct->state
        `-> CT_STATE_IDLE
            `-> defer IPI
I thought this meant I would need to throw out the "defer IPIs if CPU is idle" part, but AIUI this also affects CT_STATE_USER and CT_STATE_GUEST, which is a bummer :(
Soooo I've been thinking...
Isn't
(context_tracking.state & CT_RCU_WATCHING)
pretty much a proxy for knowing whether a CPU is executing in kernelspace, including NMIs?
  NMI interrupts userspace/VM/idle -> ct_nmi_enter()   -> it becomes true
  IRQ interrupts idle              -> ct_irq_enter()   -> it becomes true
  IRQ interrupts userspace         -> __ct_user_exit() -> it becomes true
  IRQ interrupts VM                -> __ct_user_exit() -> it becomes true
IOW, if I gate setting deferred work by checking for this instead of explicitly for CT_STATE_KERNEL, "it should work" and prevent the aforementioned issue? Or should I be out drinking instead? :-)
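A minimal sketch of the check being proposed here (illustrative only; the helper name is an assumption, and in the series the deferral decision is made inside ct_set_cpu_work()):

  /* Hypothetical helper, for illustration only. */
  static bool ct_cpu_in_kernel(unsigned int cpu)
  {
          struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);

          /* RCU watching == executing kernel code, IRQs and NMIs included. */
          return atomic_read(&ct->state) & CT_RCU_WATCHING;
  }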
Le Fri, Feb 07, 2025 at 06:06:45PM +0100, Valentin Schneider a écrit :
On 27/01/25 12:17, Valentin Schneider wrote:
On 22/01/25 01:22, Frederic Weisbecker wrote:
And NMIs interrupting userspace don't call enter_from_user_mode(). In fact they don't call irqentry_enter_from_user_mode() like regular IRQs but irqentry_nmi_enter() instead. Well that's for archs implementing common entry code, I can't speak for the others.
That I didn't realize, so thank you for pointing it out. Having another look now, I mistook DEFINE_IDTENTRY_RAW(exc_int3) for the general case when it really isn't :(
Unifying the behaviour between user and idle such that the IRQs/NMIs exit the CT_STATE can be interesting but I fear this may not come for free. You would need to save the old state on IRQ/NMI entry and restore it on exit.
That's what I tried to avoid, but it sounds like there's no nice way around it.
Do we really need it?
Well, my problem with not doing IDLE->KERNEL transitions on IRQ/NMI is that this leads the IPI deferral logic to observe a technically-out-of-sync state for remote CPUs. Consider:

  CPUx                                     CPUy
                                             state := CT_STATE_IDLE
                                             ...
                                             ~> IRQ
                                               ...
                                               ct_nmi_enter()
                                             [in the kernel proper by now]
  text_poke_bp_batch()
    ct_set_cpu_work(CPUy, CT_WORK_SYNC)
      READ CPUy ct->state
        `-> CT_STATE_IDLE
            `-> defer IPI
I thought this meant I would need to throw out the "defer IPIs if CPU is idle" part, but AIUI this also affects CT_STATE_USER and CT_STATE_GUEST, which is a bummer :(
Soooo I've been thinking...
Isn't
(context_tracking.state & CT_RCU_WATCHING)
pretty much a proxy for knowing whether a CPU is executing in kernelspace, including NMIs?
You got it!
  NMI interrupts userspace/VM/idle -> ct_nmi_enter()   -> it becomes true
  IRQ interrupts idle              -> ct_irq_enter()   -> it becomes true
  IRQ interrupts userspace         -> __ct_user_exit() -> it becomes true
  IRQ interrupts VM                -> __ct_user_exit() -> it becomes true
IOW, if I gate setting deferred work by checking for this instead of explicitly for CT_STATE_KERNEL, "it should work" and prevent the aforementioned issue? Or should I be out drinking instead? :-)
Exactly it should work! Now that doesn't mean you can't go out for a drink :-)
Thanks.
On 07/02/25 19:37, Frederic Weisbecker wrote:
Le Fri, Feb 07, 2025 at 06:06:45PM +0100, Valentin Schneider a écrit :
Soooo I've been thinking...
Isn't
(context_tracking.state & CT_RCU_WATCHING)
pretty much a proxy for knowing whether a CPU is executing in kernelspace, including NMIs?
You got it!
Yay!
  NMI interrupts userspace/VM/idle -> ct_nmi_enter()   -> it becomes true
  IRQ interrupts idle              -> ct_irq_enter()   -> it becomes true
  IRQ interrupts userspace         -> __ct_user_exit() -> it becomes true
  IRQ interrupts VM                -> __ct_user_exit() -> it becomes true
IOW, if I gate setting deferred work by checking for this instead of explicitly for CT_STATE_KERNEL, "it should work" and prevent the aforementioned issue? Or should I be out drinking instead? :-)
Exactly it should work! Now that doesn't mean you can't go out for a drink :-)
Well, drinks were had very shortly after sending this email :D
Thanks.
A later patch will require easily excluding CT_STATE_KERNEL from a genuine ct->state read, which requires that value to be non-zero and exclusive with the other CT_STATE_* values.
This increases the size of the CT_STATE region of the ct->state variable by two bits.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- include/linux/context_tracking_state.h | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h index 0b81248aa03e2..eb2149b20baef 100644 --- a/include/linux/context_tracking_state.h +++ b/include/linux/context_tracking_state.h @@ -11,11 +11,11 @@
enum ctx_state { CT_STATE_DISABLED = -1, /* returned by ct_state() if unknown */ - CT_STATE_KERNEL = 0, - CT_STATE_IDLE = 1, - CT_STATE_USER = 2, - CT_STATE_GUEST = 3, - CT_STATE_MAX = 4, + CT_STATE_KERNEL = 1, + CT_STATE_IDLE = 2, + CT_STATE_USER = 4, + CT_STATE_GUEST = 8, + CT_STATE_MAX = 9, };
struct context_tracking {
smp_call_function() & friends have the unfortunate habit of sending IPIs to isolated, NOHZ_FULL, in-userspace CPUs, as they blindly target all online CPUs.
Some callsites can be bent into doing the right thing, such as done by commit:
cc9e303c91f5 ("x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs")
Unfortunately, not all SMP callbacks can be omitted in this fashion. However, some of them only affect execution in kernelspace, which means they don't have to be executed *immediately* if the target CPU is in userspace: stashing the callback and executing it upon the next kernel entry would suffice. x86 kernel instruction patching or kernel TLB invalidation are prime examples of it.
Reduce the RCU dynticks counter width to free up some bits to be used as a deferred callback bitmask. Add some build-time checks to validate that setup.
Presence of CT_STATE_KERNEL in the ct_state prevents queuing deferred work.
Later commits introduce the bit:callback mappings.
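A usage sketch of the pattern the later patches follow (CT_WORK_FOO, do_foo() and foo_sync() are placeholders, not part of this patch): the housekeeping side first tries to defer the work, and only sends the IPI when deferral wasn't possible:

  static void do_foo(void *info)
  {
          /* The actual callback: run via IPI, or deferred to kernel entry. */
  }

  static bool do_foo_defer_cond(int cpu, void *info)
  {
          /* Returning false skips the IPI for @cpu: the work was deferred. */
          return !ct_set_cpu_work(cpu, CT_WORK_FOO);
  }

  static void foo_sync(void)
  {
          on_each_cpu_cond(do_foo_defer_cond, do_foo, NULL, 1);
  }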
Link: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/ Signed-off-by: Nicolas Saenz Julienne nsaenzju@redhat.com Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/Kconfig | 9 +++ arch/x86/Kconfig | 1 + arch/x86/include/asm/context_tracking_work.h | 16 +++++ include/linux/context_tracking.h | 21 +++++++ include/linux/context_tracking_state.h | 30 ++++++--- include/linux/context_tracking_work.h | 26 ++++++++ kernel/context_tracking.c | 66 +++++++++++++++++++- kernel/time/Kconfig | 5 ++ 8 files changed, 162 insertions(+), 12 deletions(-) create mode 100644 arch/x86/include/asm/context_tracking_work.h create mode 100644 include/linux/context_tracking_work.h
diff --git a/arch/Kconfig b/arch/Kconfig index 6682b2a53e342..b637f20f0fc68 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -952,6 +952,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK - No use of instrumentation, unless instrumentation_begin() got called.
+config HAVE_CONTEXT_TRACKING_WORK + bool + help + Architecture supports deferring work while not in kernel context. + This is especially useful on setups with isolated CPUs that might + want to avoid being interrupted to perform housekeeping tasks (for + ex. TLB invalidation or icache invalidation). The housekeeping + operations are performed upon re-entering the kernel. + config HAVE_TIF_NOHZ bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 9d7bd0ae48c42..ca5dd4cfc354b 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -216,6 +216,7 @@ config X86 select HAVE_CMPXCHG_LOCAL select HAVE_CONTEXT_TRACKING_USER if X86_64 select HAVE_CONTEXT_TRACKING_USER_OFFSTACK if HAVE_CONTEXT_TRACKING_USER + select HAVE_CONTEXT_TRACKING_WORK if X86_64 select HAVE_C_RECORDMCOUNT select HAVE_OBJTOOL_MCOUNT if HAVE_OBJTOOL select HAVE_OBJTOOL_NOP_MCOUNT if HAVE_OBJTOOL_MCOUNT diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h new file mode 100644 index 0000000000000..5f3b2d0977235 --- /dev/null +++ b/arch/x86/include/asm/context_tracking_work.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H +#define _ASM_X86_CONTEXT_TRACKING_WORK_H + +static __always_inline void arch_context_tracking_work(enum ct_work work) +{ + switch (work) { + case CT_WORK_n: + // Do work... + break; + case CT_WORK_MAX: + WARN_ON_ONCE(true); + } +} + +#endif diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index af9fe87a09225..0b0faa040e9b5 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -5,6 +5,7 @@ #include <linux/sched.h> #include <linux/vtime.h> #include <linux/context_tracking_state.h> +#include <linux/context_tracking_work.h> #include <linux/instrumentation.h>
#include <asm/ptrace.h> @@ -137,6 +138,26 @@ static __always_inline unsigned long ct_state_inc(int incby) return raw_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state)); }
+#ifdef CONFIG_CONTEXT_TRACKING_WORK +static __always_inline unsigned long ct_state_inc_clear_work(int incby) +{ + struct context_tracking *ct = this_cpu_ptr(&context_tracking); + unsigned long new, old, state; + + state = arch_atomic_read(&ct->state); + do { + old = state; + new = old & ~CT_WORK_MASK; + new += incby; + state = arch_atomic_cmpxchg(&ct->state, old, new); + } while (old != state); + + return new; +} +#else +#define ct_state_inc_clear_work(x) ct_state_inc(x) +#endif + static __always_inline bool warn_rcu_enter(void) { bool ret = false; diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h index eb2149b20baef..b6ddfccba9714 100644 --- a/include/linux/context_tracking_state.h +++ b/include/linux/context_tracking_state.h @@ -5,6 +5,7 @@ #include <linux/percpu.h> #include <linux/static_key.h> #include <linux/context_tracking_irq.h> +#include <linux/context_tracking_work.h>
/* Offset to allow distinguishing irq vs. task-based idle entry/exit. */ #define CT_NESTING_IRQ_NONIDLE ((LONG_MAX / 2) + 1) @@ -39,16 +40,19 @@ struct context_tracking { };
/* - * We cram two different things within the same atomic variable: + * We cram up to three different things within the same atomic variable: * - * CT_RCU_WATCHING_START CT_STATE_START - * | | - * v v - * MSB [ RCU watching counter ][ context_state ] LSB - * ^ ^ - * | | - * CT_RCU_WATCHING_END CT_STATE_END + * CT_RCU_WATCHING_START CT_STATE_START + * | CT_WORK_START | + * | | | + * v v v + * MSB [ RCU watching counter ][ context work ][ context_state ] LSB + * ^ ^ ^ + * | | | + * | CT_WORK_END | + * CT_RCU_WATCHING_END CT_STATE_END * + * The [ context work ] region spans 0 bits if CONFIG_CONTEXT_WORK=n * Bits are used from the LSB upwards, so unused bits (if any) will always be in * upper bits of the variable. */ @@ -59,18 +63,24 @@ struct context_tracking { #define CT_STATE_START 0 #define CT_STATE_END (CT_STATE_START + CT_STATE_WIDTH - 1)
-#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH) +#define CT_WORK_WIDTH (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? CT_WORK_MAX_OFFSET : 0) +#define CT_WORK_START (CT_STATE_END + 1) +#define CT_WORK_END (CT_WORK_START + CT_WORK_WIDTH - 1) + +#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_WORK_WIDTH - CT_STATE_WIDTH) #define CT_RCU_WATCHING_WIDTH (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH) -#define CT_RCU_WATCHING_START (CT_STATE_END + 1) +#define CT_RCU_WATCHING_START (CT_WORK_END + 1) #define CT_RCU_WATCHING_END (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1) #define CT_RCU_WATCHING BIT(CT_RCU_WATCHING_START)
#define CT_STATE_MASK GENMASK(CT_STATE_END, CT_STATE_START) +#define CT_WORK_MASK GENMASK(CT_WORK_END, CT_WORK_START) #define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START)
#define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH)
static_assert(CT_STATE_WIDTH + + CT_WORK_WIDTH + CT_RCU_WATCHING_WIDTH + CT_UNUSED_WIDTH == CT_SIZE); diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h new file mode 100644 index 0000000000000..c68245f8d77c5 --- /dev/null +++ b/include/linux/context_tracking_work.h @@ -0,0 +1,26 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_CONTEXT_TRACKING_WORK_H +#define _LINUX_CONTEXT_TRACKING_WORK_H + +#include <linux/bitops.h> + +enum { + CT_WORK_n_OFFSET, + CT_WORK_MAX_OFFSET +}; + +enum ct_work { + CT_WORK_n = BIT(CT_WORK_n_OFFSET), + CT_WORK_MAX = BIT(CT_WORK_MAX_OFFSET) +}; + +#include <asm/context_tracking_work.h> + +#ifdef CONFIG_CONTEXT_TRACKING_WORK +extern bool ct_set_cpu_work(unsigned int cpu, enum ct_work work); +#else +static inline bool +ct_set_cpu_work(unsigned int cpu, unsigned int work) { return false; } +#endif + +#endif diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 15f10ddec8cbe..f7019c12269f9 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -72,6 +72,67 @@ static __always_inline void rcu_task_trace_heavyweight_exit(void) #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */ }
+#ifdef CONFIG_CONTEXT_TRACKING_WORK +static noinstr void ct_work_flush(unsigned long seq) +{ + int bit; + + seq = (seq & CT_WORK_MASK) >> CT_WORK_START; + + /* + * arch_context_tracking_work() must be noinstr, non-blocking, + * and NMI safe. + */ + for_each_set_bit(bit, &seq, CT_WORK_MAX) + arch_context_tracking_work(BIT(bit)); +} + +/** + * ct_set_cpu_work - set work to be run at next kernel context entry + * + * If @cpu is not currently executing in kernelspace, it will execute the + * callback mapped to @work (see arch_context_tracking_work()) at its next + * transition to CT_KERNEL_STATE. + * + * If it is already in CT_KERNEL_STATE, this will be a no-op. + */ +bool ct_set_cpu_work(unsigned int cpu, enum ct_work work) +{ + struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu); + unsigned int old; + bool ret = false; + + if (!ct->active) + return false; + + preempt_disable(); + + old = atomic_read(&ct->state); + + /* + * We only want to set the work bit if the target CPU is not in + * kernelspace, so we clear the KERNEL bit here and let the cmpxchg do + * the check for us - the state could change between the atomic_read() and + * the cmpxchg(). + */ + old &= ~CT_STATE_KERNEL; + /* + * Try setting the work until either + * - the target CPU has entered kernelspace + * - the work has been set + */ + do { + ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START)); + } while (!ret && ((old & CT_STATE_MASK) != CT_STATE_KERNEL)); + + preempt_enable(); + return ret; +} +#else +static __always_inline void ct_work_flush(unsigned long work) { } +static __always_inline void ct_work_clear(struct context_tracking *ct) { } +#endif + /* * Record entry into an extended quiescent state. This is only to be * called when not already in an extended quiescent state, that is, @@ -88,7 +149,7 @@ static noinstr void ct_kernel_exit_state(int offset) * next idle sojourn. */ rcu_task_trace_heavyweight_enter(); // Before CT state update! - seq = ct_state_inc(offset); + seq = ct_state_inc_clear_work(offset); // RCU is no longer watching. Better be in extended quiescent state! WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & CT_RCU_WATCHING)); } @@ -100,7 +161,7 @@ static noinstr void ct_kernel_exit_state(int offset) */ static noinstr void ct_kernel_enter_state(int offset) { - int seq; + unsigned long seq;
/* * CPUs seeing atomic_add_return() must see prior idle sojourns, @@ -108,6 +169,7 @@ static noinstr void ct_kernel_enter_state(int offset) * critical section. */ seq = ct_state_inc(offset); + ct_work_flush(seq); // RCU is now watching. Better not be in an extended quiescent state! rcu_task_trace_heavyweight_exit(); // After CT state update! WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & CT_RCU_WATCHING)); diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig index b0b97a60aaa6f..7e8106a0d981f 100644 --- a/kernel/time/Kconfig +++ b/kernel/time/Kconfig @@ -181,6 +181,11 @@ config CONTEXT_TRACKING_USER_FORCE Say N otherwise, this option brings an overhead that you don't want in production.
+config CONTEXT_TRACKING_WORK + bool + depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER + default y + config NO_HZ bool "Old Idle dynticks config" help
text_poke_bp_batch() sends IPIs to all online CPUs to synchronize them vs the newly patched instruction. CPUs that are executing in userspace do not need this synchronization to happen immediately, and this is actually harmful interference for NOHZ_FULL CPUs.
As the synchronization IPIs are sent using a blocking call, returning from text_poke_bp_batch() implies all CPUs will observe the patched instruction(s), and this should be preserved even if the IPI is deferred. In other words, to safely defer this synchronization, any kernel instruction leading to the execution of the deferred instruction sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
This means we must pay attention to mutable instructions in the early entry code: - alternatives - static keys - static calls - all sorts of probes (kprobes/ftrace/bpf/???)
The early entry code leading to ct_work_flush() is noinstr, which gets rid of the probes.
Alternatives are safe, because it's boot-time patching (before SMP is even brought up) which is before any IPI deferral can happen.
This leaves us with static keys and static calls.
Any static key used in early entry code should be only forever-enabled at boot time, IOW __ro_after_init (pretty much like alternatives). Exceptions are explicitly marked as allowed in .noinstr and will always generate an IPI when flipped.
The same applies to static calls - they should be only updated at boot time, or manually marked as an exception.
Objtool is now able to point at static keys/calls that don't respect this, and all static keys/calls used in early entry code have now been verified as behaving appropriately.
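As an illustration of the rule above (the key and initcall names are made up): a key that is only ever flipped during boot can use the __ro_after_init flavour of the static key definition macros:

  /* Illustrative only: enabled once, before __ro_after_init memory is sealed. */
  static DEFINE_STATIC_KEY_FALSE_RO(my_entry_key);

  static int __init my_entry_key_setup(void)
  {
          static_branch_enable(&my_entry_key);
          return 0;
  }
  early_initcall(my_entry_key_setup);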
Leverage the new context_tracking infrastructure to defer sync_core() IPIs to a target CPU's next kernel entry.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Signed-off-by: Nicolas Saenz Julienne nsaenzju@redhat.com Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/include/asm/context_tracking_work.h | 6 ++-- arch/x86/include/asm/text-patching.h | 1 + arch/x86/kernel/alternative.c | 38 ++++++++++++++++---- arch/x86/kernel/kprobes/core.c | 4 +-- arch/x86/kernel/kprobes/opt.c | 4 +-- arch/x86/kernel/module.c | 2 +- include/asm-generic/sections.h | 15 ++++++++ include/linux/context_tracking_work.h | 4 +-- 8 files changed, 59 insertions(+), 15 deletions(-)
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h index 5f3b2d0977235..485b32881fde5 100644 --- a/arch/x86/include/asm/context_tracking_work.h +++ b/arch/x86/include/asm/context_tracking_work.h @@ -2,11 +2,13 @@ #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H #define _ASM_X86_CONTEXT_TRACKING_WORK_H
+#include <asm/sync_core.h> + static __always_inline void arch_context_tracking_work(enum ct_work work) { switch (work) { - case CT_WORK_n: - // Do work... + case CT_WORK_SYNC: + sync_core(); break; case CT_WORK_MAX: WARN_ON_ONCE(true); diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index ab9e143ec9fea..9dfa46f721c1d 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -33,6 +33,7 @@ extern void apply_relocation(u8 *buf, const u8 * const instr, size_t instrlen, u */ extern void *text_poke(void *addr, const void *opcode, size_t len); extern void text_poke_sync(void); +extern void text_poke_sync_deferrable(void); extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len); extern void *text_poke_copy(void *addr, const void *opcode, size_t len); #define text_poke_copy text_poke_copy diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 243843e44e89d..633deea8a89cb 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -18,6 +18,7 @@ #include <linux/mmu_context.h> #include <linux/bsearch.h> #include <linux/sync_core.h> +#include <linux/context_tracking.h> #include <asm/text-patching.h> #include <asm/alternative.h> #include <asm/sections.h> @@ -2109,9 +2110,24 @@ static void do_sync_core(void *info) sync_core(); }
+static bool do_sync_core_defer_cond(int cpu, void *info) +{ + return !ct_set_cpu_work(cpu, CT_WORK_SYNC); +} + +static void __text_poke_sync(smp_cond_func_t cond_func) +{ + on_each_cpu_cond(cond_func, do_sync_core, NULL, 1); +} + void text_poke_sync(void) { - on_each_cpu(do_sync_core, NULL, 1); + __text_poke_sync(NULL); +} + +void text_poke_sync_deferrable(void) +{ + __text_poke_sync(do_sync_core_defer_cond); }
/* @@ -2282,6 +2298,7 @@ static int tp_vec_nr; */ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries) { + smp_cond_func_t cond = do_sync_core_defer_cond; unsigned char int3 = INT3_INSN_OPCODE; unsigned int i; int do_sync; @@ -2317,11 +2334,20 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries * First step: add a int3 trap to the address that will be patched. */ for (i = 0; i < nr_entries; i++) { - tp[i].old = *(u8 *)text_poke_addr(&tp[i]); - text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE); + void *addr = text_poke_addr(&tp[i]); + + /* + * There's no safe way to defer IPIs for patching text in + * .noinstr, record whether there is at least one such poke. + */ + if (is_kernel_noinstr_text((unsigned long)addr)) + cond = NULL; + + tp[i].old = *((u8 *)addr); + text_poke(addr, &int3, INT3_INSN_SIZE); }
- text_poke_sync(); + __text_poke_sync(cond);
/* * Second step: update all but the first byte of the patched range. @@ -2383,7 +2409,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries * not necessary and we'd be safe even without it. But * better safe than sorry (plus there's not only Intel). */ - text_poke_sync(); + __text_poke_sync(cond); }
/* @@ -2404,7 +2430,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries }
if (do_sync) - text_poke_sync(); + __text_poke_sync(cond);
/* * Remove and wait for refs to be zero. diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c index 72e6a45e7ec24..c2fd2578fd5fc 100644 --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -817,7 +817,7 @@ void arch_arm_kprobe(struct kprobe *p) u8 int3 = INT3_INSN_OPCODE;
text_poke(p->addr, &int3, 1); - text_poke_sync(); + text_poke_sync_deferrable(); perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1); }
@@ -827,7 +827,7 @@ void arch_disarm_kprobe(struct kprobe *p)
perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1); text_poke(p->addr, &p->opcode, 1); - text_poke_sync(); + text_poke_sync_deferrable(); }
void arch_remove_kprobe(struct kprobe *p) diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c index 36d6809c6c9e1..b2ce4d9c3ba56 100644 --- a/arch/x86/kernel/kprobes/opt.c +++ b/arch/x86/kernel/kprobes/opt.c @@ -513,11 +513,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op) JMP32_INSN_SIZE - INT3_INSN_SIZE);
text_poke(addr, new, INT3_INSN_SIZE); - text_poke_sync(); + text_poke_sync_deferrable(); text_poke(addr + INT3_INSN_SIZE, new + INT3_INSN_SIZE, JMP32_INSN_SIZE - INT3_INSN_SIZE); - text_poke_sync(); + text_poke_sync_deferrable();
perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE); } diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c index 8984abd91c001..acc810150b76c 100644 --- a/arch/x86/kernel/module.c +++ b/arch/x86/kernel/module.c @@ -194,7 +194,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs, write, apply);
if (!early) { - text_poke_sync(); + text_poke_sync_deferrable(); mutex_unlock(&text_mutex); }
diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index c768de6f19a9a..1b3ba3084fd2f 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -199,6 +199,21 @@ static inline bool is_kernel_inittext(unsigned long addr) addr < (unsigned long)_einittext; }
+ +/** + * is_kernel_noinstr_text - checks if the pointer address is located in the + * .noinstr section + * + * @addr: address to check + * + * Returns: true if the address is located in .noinstr, false otherwise. + */ +static inline bool is_kernel_noinstr_text(unsigned long addr) +{ + return addr >= (unsigned long)__noinstr_text_start && + addr < (unsigned long)__noinstr_text_end; +} + /** * __is_kernel_text - checks if the pointer address is located in the * .text section diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h index c68245f8d77c5..931ade1dcbcc2 100644 --- a/include/linux/context_tracking_work.h +++ b/include/linux/context_tracking_work.h @@ -5,12 +5,12 @@ #include <linux/bitops.h>
enum { - CT_WORK_n_OFFSET, + CT_WORK_SYNC, CT_WORK_MAX_OFFSET };
enum ct_work { - CT_WORK_n = BIT(CT_WORK_n_OFFSET), + CT_WORK_SYNC = BIT(CT_WORK_SYNC_OFFSET), CT_WORK_MAX = BIT(CT_WORK_MAX_OFFSET) };
On Tue, Jan 14, 2025, Valentin Schneider wrote:
text_poke_bp_batch() sends IPIs to all online CPUs to synchronize them vs the newly patched instruction. CPUs that are executing in userspace do not need this synchronization to happen immediately, and this is actually harmful interference for NOHZ_FULL CPUs.
...
This leaves us with static keys and static calls.
...
@@ -2317,11 +2334,20 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries * First step: add a int3 trap to the address that will be patched. */ for (i = 0; i < nr_entries; i++) {
tp[i].old = *(u8 *)text_poke_addr(&tp[i]);
text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
void *addr = text_poke_addr(&tp[i]);
/*
* There's no safe way to defer IPIs for patching text in
* .noinstr, record whether there is at least one such poke.
*/
if (is_kernel_noinstr_text((unsigned long)addr))
cond = NULL;
Maybe pre-check "cond", especially if multiple ranges need to be checked? I.e.
if (cond && is_kernel_noinstr_text(...))
tp[i].old = *((u8 *)addr);
}text_poke(addr, &int3, INT3_INSN_SIZE);
- text_poke_sync();
- __text_poke_sync(cond);
/* * Second step: update all but the first byte of the patched range.
...
+/**
- is_kernel_noinstr_text - checks if the pointer address is located in the
.noinstr section
- @addr: address to check
- Returns: true if the address is located in .noinstr, false otherwise.
- */
+static inline bool is_kernel_noinstr_text(unsigned long addr) +{
- return addr >= (unsigned long)__noinstr_text_start &&
addr < (unsigned long)__noinstr_text_end;
+}
This doesn't do the right thing for modules, which matters because KVM can be built as a module on x86, and because context tracking understands transitions to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not being in the kernel, and thus will have IPIs deferred. If KVM uses a static key or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically use the wrong static path.
I don't expect this to ever cause problems in practice, because patching code in KVM's VM-Enter/VM-Exit path that has *functional* implications, while CPUs are actively running guest code, would be all kinds of crazy. But I do think we should plug the hole.
If this issue is unique to KVM, i.e. is not a generic problem for all modules (I assume module code generally isn't allowed in the entry path, even via NMI?), one idea would be to let KVM register its noinstr section for text poking.
On Tue, Jan 14, 2025, Sean Christopherson wrote:
On Tue, Jan 14, 2025, Valentin Schneider wrote:
+/**
- is_kernel_noinstr_text - checks if the pointer address is located in the
.noinstr section
- @addr: address to check
- Returns: true if the address is located in .noinstr, false otherwise.
- */
+static inline bool is_kernel_noinstr_text(unsigned long addr) +{
- return addr >= (unsigned long)__noinstr_text_start &&
addr < (unsigned long)__noinstr_text_end;
+}
This doesn't do the right thing for modules, which matters because KVM can be built as a module on x86, and because context tracking understands transitions to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not being in the kernel, and thus will have IPIs deferred. If KVM uses a static key or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically use the wrong static path.
I don't expect this to ever cause problems in practice, because patching code in KVM's VM-Enter/VM-Exit path that has *functional* implications, while CPUs are actively running guest code, would be all kinds of crazy. But I do think we should plug the hole.
If this issue is unique to KVM, i.e. is not a generic problem for all modules (I assume module code generally isn't allowed in the entry path, even via NMI?), one idea would be to let KVM register its noinstr section for text poking.
Another idea would be to track which keys/branches are tagged noinstr, i.e. generate the information at compile time instead of doing lookups at runtime. The biggest downside I can think of is that it would require plumbing in the information to text_poke_bp_batch().
On 14/01/25 13:48, Sean Christopherson wrote:
On Tue, Jan 14, 2025, Sean Christopherson wrote:
On Tue, Jan 14, 2025, Valentin Schneider wrote:
+/**
- is_kernel_noinstr_text - checks if the pointer address is located in the
.noinstr section
- @addr: address to check
- Returns: true if the address is located in .noinstr, false otherwise.
- */
+static inline bool is_kernel_noinstr_text(unsigned long addr) +{
- return addr >= (unsigned long)__noinstr_text_start &&
addr < (unsigned long)__noinstr_text_end;
+}
This doesn't do the right thing for modules, which matters because KVM can be built as a module on x86, and because context tracking understands transitions to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not being in the kernel, and thus will have IPIs deferred. If KVM uses a static key or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically use the wrong static path.
I don't expect this to ever cause problems in practice, because patching code in KVM's VM-Enter/VM-Exit path that has *functional* implications, while CPUs are actively running guest code, would be all kinds of crazy. But I do think we should plug the hole.
If this issue is unique to KVM, i.e. is not a generic problem for all modules (I assume module code generally isn't allowed in the entry path, even via NMI?), one idea would be to let KVM register its noinstr section for text poking.
Another idea would be to track which keys/branches are tagged noinstr, i.e. generate the information at compile time instead of doing lookups at runtime. The biggest downside I can think of is that it would require plumbing in the information to text_poke_bp_batch().
IIUC that's what I went for in v3:
https://lore.kernel.org/lkml/20241119153502.41361-11-vschneid@redhat.com
but, modules notwithstanding, simply checking if the patched instruction is in .noinstr was a lot neater.
On 14/01/25 13:13, Sean Christopherson wrote:
On Tue, Jan 14, 2025, Valentin Schneider wrote:
text_poke_bp_batch() sends IPIs to all online CPUs to synchronize them vs the newly patched instruction. CPUs that are executing in userspace do not need this synchronization to happen immediately, and this is actually harmful interference for NOHZ_FULL CPUs.
...
This leaves us with static keys and static calls.
...
@@ -2317,11 +2334,20 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries * First step: add a int3 trap to the address that will be patched. */ for (i = 0; i < nr_entries; i++) {
tp[i].old = *(u8 *)text_poke_addr(&tp[i]);
text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
void *addr = text_poke_addr(&tp[i]);
/*
* There's no safe way to defer IPIs for patching text in
* .noinstr, record whether there is at least one such poke.
*/
if (is_kernel_noinstr_text((unsigned long)addr))
cond = NULL;
Maybe pre-check "cond", especially if multiple ranges need to be checked? I.e.
if (cond && is_kernel_noinstr_text(...))
tp[i].old = *((u8 *)addr);
}text_poke(addr, &int3, INT3_INSN_SIZE);
- text_poke_sync();
__text_poke_sync(cond);
/*
- Second step: update all but the first byte of the patched range.
...
+/**
- is_kernel_noinstr_text - checks if the pointer address is located in the
.noinstr section
- @addr: address to check
- Returns: true if the address is located in .noinstr, false otherwise.
- */
+static inline bool is_kernel_noinstr_text(unsigned long addr) +{
- return addr >= (unsigned long)__noinstr_text_start &&
addr < (unsigned long)__noinstr_text_end;
+}
This doesn't do the right thing for modules, which matters because KVM can be built as a module on x86, and because context tracking understands transitions to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not being in the kernel, and thus will have IPIs deferred. If KVM uses a static key or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically use the wrong static path.
AFAICT guest_state_{enter,exit}_irqoff() are only used in noinstr functions and thus such a static key usage should at the very least be caught and warned about by objtool - when this isn't built as a module.
I never really thought about noinstr sections for modules; I can get objtool to warn about a non-noinstr allowed key being used in e.g. vmx_vcpu_enter_exit() just by feeding it the vmx.o:
arch/x86/kvm/vmx/vmx.o: warning: objtool: vmx_vcpu_enter_exit.isra.0+0x0: dummykey: non-RO static key usage in noinstr
...but that requires removing a lot of code first because objtool stops earlier in its noinstr checks as it hits functions it doesn't have full information on, e.g.
arch/x86/kvm/vmx/vmx.o: warning: objtool: vmx_vcpu_enter_exit+0x21c: call to __ct_user_enter() leaves .noinstr.text section
__ct_user_enter() *is* noinstr, but you don't get that from just the header prototype.
I don't expect this to ever cause problems in practice, because patching code in KVM's VM-Enter/VM-Exit path that has *functional* implications, while CPUs are actively running guest code, would be all kinds of crazy. But I do think we should plug the hole.
If this issue is unique to KVM, i.e. is not a generic problem for all modules (I assume module code generally isn't allowed in the entry path, even via NMI?), one idea would be to let KVM register its noinstr section for text poking.
On Fri, Jan 17, 2025, Valentin Schneider wrote:
On 14/01/25 13:13, Sean Christopherson wrote:
On Tue, Jan 14, 2025, Valentin Schneider wrote:
+/**
- is_kernel_noinstr_text - checks if the pointer address is located in the
.noinstr section
- @addr: address to check
- Returns: true if the address is located in .noinstr, false otherwise.
- */
+static inline bool is_kernel_noinstr_text(unsigned long addr) +{
- return addr >= (unsigned long)__noinstr_text_start &&
addr < (unsigned long)__noinstr_text_end;
+}
This doesn't do the right thing for modules, which matters because KVM can be built as a module on x86, and because context tracking understands transitions to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not being in the kernel, and thus will have IPIs deferred. If KVM uses a static key or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically use the wrong static path.
AFAICT guest_state_{enter,exit}_irqoff() are only used in noinstr functions and thus such a static key usage should at the very least be caught and warned about by objtool - when this isn't built as a module.
That doesn't magically do the right thing though. If KVM is built as a module, is_kernel_noinstr_text() will get false negatives even for static keys/branches that are annotated as NOINSTR.
On 17/01/25 09:15, Sean Christopherson wrote:
On Fri, Jan 17, 2025, Valentin Schneider wrote:
On 14/01/25 13:13, Sean Christopherson wrote:
On Tue, Jan 14, 2025, Valentin Schneider wrote:
+/**
- is_kernel_noinstr_text - checks if the pointer address is located in the
.noinstr section
- @addr: address to check
- Returns: true if the address is located in .noinstr, false otherwise.
- */
+static inline bool is_kernel_noinstr_text(unsigned long addr) +{
- return addr >= (unsigned long)__noinstr_text_start &&
addr < (unsigned long)__noinstr_text_end;
+}
This doesn't do the right thing for modules, which matters because KVM can be built as a module on x86, and because context tracking understands transitions to GUEST mode, i.e. CPUs that are running in a KVM guest will be treated as not being in the kernel, and thus will have IPIs deferred. If KVM uses a static key or branch between guest_state_enter_irqoff() and guest_state_exit_irqoff(), the patching code won't wait for CPUs to exit guest mode, i.e. KVM could theoretically use the wrong static path.
AFAICT guest_state_{enter,exit}_irqoff() are only used in noinstr functions and thus such a static key usage should at the very least be caught and warned about by objtool - when this isn't built as a module.
That doesn't magically do the right thing though. If KVM is built as a module, is_kernel_noinstr_text() will get false negatives even for static keys/branches that are annotated as NOINSTR.
Quite so.
I've been looking at mod_mem_type & friends, I'm thinking adding a MOD_NOINSTR_TEXT type might be overkill considering modules really shouldn't be involved with early entry, KVM being the one exception.
Your suggestion to have a KVM-module-specific noinstr section sounds good to me, I'll have a look at that.
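For illustration, one rough shape such a registration could take (every name below is hypothetical; this is not what the series implements): the module hands the text-patching code an address range to be treated like .noinstr when deciding whether a poke may use the deferred-IPI path.

  /* Hypothetical sketch only. */
  static unsigned long mod_noinstr_start;
  static unsigned long mod_noinstr_end;

  void text_poke_set_mod_noinstr_range(unsigned long start, unsigned long end)
  {
          mod_noinstr_start = start;
          mod_noinstr_end = end;
  }

  static bool is_mod_noinstr_text(unsigned long addr)
  {
          return addr >= mod_noinstr_start && addr < mod_noinstr_end;
  }

KVM would call the setter from its module init with the bounds of its VM-entry/exit code, and text_poke_bp_batch() would then check is_mod_noinstr_text() alongside is_kernel_noinstr_text().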
On Tue, Jan 14, 2025, Valentin Schneider wrote:
static __always_inline void arch_context_tracking_work(enum ct_work work) { switch (work) {
- case CT_WORK_n:
// Do work...
- case CT_WORK_SYNC:
sync_core();
Not your bug, but serialize() needs to be __always_inline. Not sure what exactly caused it to be uninlined, but this is the obvious new usage.
vmlinux.o: warning: objtool: __static_call_update_early+0x4e: call to serialize() leaves .noinstr.text section vmlinux.o: warning: objtool: ct_work_flush+0x69: call to serialize() leaves .noinstr.text section
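A minimal sketch of the fix being asked for, assuming the current definition of serialize() in arch/x86/include/asm/special_insns.h: force-inline it so that noinstr callers like ct_work_flush() don't end up with an out-of-section call.

  /* Was "static inline"; __always_inline keeps it usable from noinstr code. */
  static __always_inline void serialize(void)
  {
          /* Instruction opcode for SERIALIZE; supported in binutils >= 2.35. */
          asm volatile(".byte 0xf, 0x1, 0xe8" ::: "memory");
  }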
Hello Valentin,
On 1/14/2025 11:21 PM, Valentin Schneider wrote:
[..snip..] diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h index c68245f8d77c5..931ade1dcbcc2 100644 --- a/include/linux/context_tracking_work.h +++ b/include/linux/context_tracking_work.h @@ -5,12 +5,12 @@ #include <linux/bitops.h> enum {
- CT_WORK_n_OFFSET,
- CT_WORK_SYNC,
nit.
Shouldn't this be "CT_WORK_SYNC_OFFSET"?
I believe it gets fixed in Patch 29/30 when "CT_WORK_TLBI_OFFSET" is added but perhaps that can be moved here to preserve bisection.
CT_WORK_MAX_OFFSET }; enum ct_work {
- CT_WORK_n = BIT(CT_WORK_n_OFFSET),
- CT_WORK_SYNC = BIT(CT_WORK_SYNC_OFFSET), CT_WORK_MAX = BIT(CT_WORK_MAX_OFFSET) };
From: Peter Zijlstra peterz@infradead.org
Later patches will require issuing a __flush_tlb_all() from noinstr code. This requires making both __flush_tlb_local() and __flush_tlb_global() noinstr-compliant.
For __flush_tlb_global(), both native_flush_tlb_global() and xen_flush_tlb() need to be made noinstr.
Forgo using __native_flush_tlb_global() / native_write_cr4() and have the ASM directly inlined in the native function. For the Xen stuff, __always_inline a handful of helpers.
Not-signed-off-by: Peter Zijlstra peterz@infradead.org [Changelog faff] Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/include/asm/invpcid.h | 13 ++++++------- arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/xen/hypercall.h | 11 +++++++++-- arch/x86/mm/tlb.c | 15 +++++++++++---- arch/x86/xen/mmu_pv.c | 10 +++++----- arch/x86/xen/xen-ops.h | 12 ++++++++---- 6 files changed, 40 insertions(+), 23 deletions(-)
diff --git a/arch/x86/include/asm/invpcid.h b/arch/x86/include/asm/invpcid.h index 734482afbf81d..ff26136fcd9c6 100644 --- a/arch/x86/include/asm/invpcid.h +++ b/arch/x86/include/asm/invpcid.h @@ -2,7 +2,7 @@ #ifndef _ASM_X86_INVPCID #define _ASM_X86_INVPCID
-static inline void __invpcid(unsigned long pcid, unsigned long addr, +static __always_inline void __invpcid(unsigned long pcid, unsigned long addr, unsigned long type) { struct { u64 d[2]; } desc = { { pcid, addr } }; @@ -13,7 +13,7 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr, * mappings, we don't want the compiler to reorder any subsequent * memory accesses before the TLB flush. */ - asm volatile("invpcid %[desc], %[type]" + asm_inline volatile("invpcid %[desc], %[type]" :: [desc] "m" (desc), [type] "r" (type) : "memory"); }
@@ -23,26 +23,25 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr, #define INVPCID_TYPE_ALL_NON_GLOBAL 3
/* Flush all mappings for a given pcid and addr, not including globals. */ -static inline void invpcid_flush_one(unsigned long pcid, - unsigned long addr) +static __always_inline void invpcid_flush_one(unsigned long pcid, unsigned long addr) { __invpcid(pcid, addr, INVPCID_TYPE_INDIV_ADDR); }
/* Flush all mappings for a given PCID, not including globals. */ -static inline void invpcid_flush_single_context(unsigned long pcid) +static __always_inline void invpcid_flush_single_context(unsigned long pcid) { __invpcid(pcid, 0, INVPCID_TYPE_SINGLE_CTXT); }
/* Flush all mappings, including globals, for all PCIDs. */ -static inline void invpcid_flush_all(void) +static __always_inline void invpcid_flush_all(void) { __invpcid(0, 0, INVPCID_TYPE_ALL_INCL_GLOBAL); }
/* Flush all mappings for all PCIDs except globals. */ -static inline void invpcid_flush_all_nonglobals(void) +static __always_inline void invpcid_flush_all_nonglobals(void) { __invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL); } diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index d4eb9e1d61b8e..b3daee3d46677 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -75,7 +75,7 @@ static inline void __flush_tlb_local(void) PVOP_VCALL0(mmu.flush_tlb_user); }
-static inline void __flush_tlb_global(void) +static __always_inline void __flush_tlb_global(void) { PVOP_VCALL0(mmu.flush_tlb_kernel); } diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h index 97771b9d33af3..291e9f8006f62 100644 --- a/arch/x86/include/asm/xen/hypercall.h +++ b/arch/x86/include/asm/xen/hypercall.h @@ -365,8 +365,8 @@ MULTI_mmu_update(struct multicall_entry *mcl, struct mmu_update *req, trace_xen_mc_entry(mcl, 4); }
-static inline void -MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count, +static __always_inline void +__MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count, int *success_count, domid_t domid) { mcl->op = __HYPERVISOR_mmuext_op; @@ -374,6 +374,13 @@ MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count, mcl->args[1] = count; mcl->args[2] = (unsigned long)success_count; mcl->args[3] = domid; +} + +static inline void +MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count, + int *success_count, domid_t domid) +{ + __MULTI_mmuext_op(mcl, op, count, success_count, domid);
trace_xen_mc_entry(mcl, 4); } diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index a2becb85bea79..2d2ab3e221f0c 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -1169,9 +1169,10 @@ void flush_tlb_one_user(unsigned long addr) /* * Flush everything */ -STATIC_NOPV void native_flush_tlb_global(void) +STATIC_NOPV noinstr void native_flush_tlb_global(void) { unsigned long flags; + unsigned long cr4;
if (static_cpu_has(X86_FEATURE_INVPCID)) { /* @@ -1190,9 +1191,15 @@ STATIC_NOPV void native_flush_tlb_global(void) * be called from deep inside debugging code.) */ raw_local_irq_save(flags); - - __native_tlb_flush_global(this_cpu_read(cpu_tlbstate.cr4)); - + cr4 = this_cpu_read(cpu_tlbstate.cr4); + asm volatile("mov %0,%%cr4": : "r" (cr4 ^ X86_CR4_PGE) : "memory"); + asm volatile("mov %0,%%cr4": : "r" (cr4) : "memory"); + /* + * In lieu of not having the pinning crap, hard fail if CR4 doesn't + * match the expected value. This ensures that anybody doing dodgy gets + * the fallthrough check. + */ + BUG_ON(cr4 != this_cpu_read(cpu_tlbstate.cr4)); raw_local_irq_restore(flags); }
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c index 55a4996d0c04f..4eb265eb867af 100644 --- a/arch/x86/xen/mmu_pv.c +++ b/arch/x86/xen/mmu_pv.c @@ -1231,22 +1231,22 @@ static noinstr void xen_write_cr2(unsigned long cr2) this_cpu_read(xen_vcpu)->arch.cr2 = cr2; }
-static noinline void xen_flush_tlb(void) +static noinline noinstr void xen_flush_tlb(void) { struct mmuext_op *op; struct multicall_space mcs;
- preempt_disable(); + preempt_disable_notrace();
mcs = xen_mc_entry(sizeof(*op));
op = mcs.args; op->cmd = MMUEXT_TLB_FLUSH_LOCAL; - MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF); + __MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
- xen_mc_issue(XEN_LAZY_MMU); + __xen_mc_issue(XEN_LAZY_MMU);
- preempt_enable(); + preempt_enable_notrace(); }
static void xen_flush_tlb_one_user(unsigned long addr) diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h index 63c13a2ccf556..effb1a54afbd1 100644 --- a/arch/x86/xen/xen-ops.h +++ b/arch/x86/xen/xen-ops.h @@ -235,15 +235,19 @@ static inline struct multicall_space xen_mc_entry(size_t args) void xen_mc_flush(void);
/* Issue a multicall if we're not in a lazy mode */ -static inline void xen_mc_issue(unsigned mode) +static __always_inline void __xen_mc_issue(unsigned mode) { - trace_xen_mc_issue(mode); - if ((xen_get_lazy_mode() & mode) == 0) xen_mc_flush();
/* restore flags saved in xen_mc_batch */ - local_irq_restore(this_cpu_read(xen_mc_irq_flags)); + raw_local_irq_restore(this_cpu_read(xen_mc_irq_flags)); +} + +static inline void xen_mc_issue(unsigned mode) +{ + trace_xen_mc_issue(mode); + __xen_mc_issue(mode); }
/* Set up a callback to be called when the current batch is flushed */
On 1/14/25 09:51, Valentin Schneider wrote:
- cr4 = this_cpu_read(cpu_tlbstate.cr4);
- asm volatile("mov %0,%%cr4": : "r" (cr4 ^ X86_CR4_PGE) : "memory");
- asm volatile("mov %0,%%cr4": : "r" (cr4) : "memory");
- /*
* In lieu of not having the pinning crap, hard fail if CR4 doesn't
* match the expected value. This ensures that anybody doing dodgy gets
* the fallthrough check.
*/
- BUG_ON(cr4 != this_cpu_read(cpu_tlbstate.cr4));
Let's say someone managed to write to cpu_tlbstate.cr4 where they cleared one of the pinned bits.
Before this patch, CR4 pinning would WARN_ONCE() about it pretty quickly and also reset the cleared bits.
After this patch, the first native_flush_tlb_global() can clear pinned bits, at least until native_write_cr4() gets called the next time. That seems like it'll undermine CR4 pinning at least somewhat.
What keeps native_write_cr4() from being noinstr-compliant now? Is it just the WARN_ONCE()?
If so, I'd kinda rather have a native_write_cr4_nowarn() that's noinstr-compliant but retains all the other CR4 pinning behavior. Would something like the attached patch be _worse_?
On 14/01/25 13:45, Dave Hansen wrote:
On 1/14/25 09:51, Valentin Schneider wrote:
- cr4 = this_cpu_read(cpu_tlbstate.cr4);
- asm volatile("mov %0,%%cr4": : "r" (cr4 ^ X86_CR4_PGE) : "memory");
- asm volatile("mov %0,%%cr4": : "r" (cr4) : "memory");
- /*
* In lieu of not having the pinning crap, hard fail if CR4 doesn't
* match the expected value. This ensures that anybody doing dodgy gets
* the fallthrough check.
*/
- BUG_ON(cr4 != this_cpu_read(cpu_tlbstate.cr4));
Let's say someone managed to write to cpu_tlbstate.cr4 where they cleared one of the pinned bits.
Before this patch, CR4 pinning would WARN_ONCE() about it pretty quickly and also reset the cleared bits.
After this patch, the first native_flush_tlb_global() can clear pinned bits, at least until native_write_cr4() gets called the next time. That seems like it'll undermine CR4 pinning at least somewhat.
The BUG_ON() should still catch any pinned bit mishandling, however...
What keeps native_write_cr4() from being noinstr-compliant now? Is it just the WARN_ONCE()?
I don't think that's even an issue since __WARN_printf() wraps the print in instrumentation_{begin,end}(). In v3 I made native_write_cr4() noinstr and added a non-noinstr wrapper to be used in existing callsites.
AFAICT if acceptable we could make the whole thing noinstr and stick with that; Peter, is there something I missed that made you write the handmade noinstr CR4 RMW?
Later patches will require issuing a __flush_tlb_all() from noinstr code. This requires making both __flush_tlb_local() and __flush_tlb_global() noinstr-compliant.
For __flush_tlb_local(), xen_flush_tlb() has already been made noinstr, so what's left is native_flush_tlb_local(), and simply __always_inline'ing invalidate_user_asid() gets us there.
Signed-off-by: Valentin Schneider vschneid@redhat.com
---
 arch/x86/include/asm/paravirt.h | 2 +-
 arch/x86/mm/tlb.c               | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index b3daee3d46677..0c0dd186c03e6 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -70,7 +70,7 @@ void native_flush_tlb_one_user(unsigned long addr);
 void native_flush_tlb_multi(const struct cpumask *cpumask,
			     const struct flush_tlb_info *info);

-static inline void __flush_tlb_local(void)
+static __always_inline void __flush_tlb_local(void)
 {
 	PVOP_VCALL0(mmu.flush_tlb_user);
 }
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2d2ab3e221f0c..18b40bbc2fa15 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -257,7 +257,7 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
  *
  * See SWITCH_TO_USER_CR3.
  */
-static inline void invalidate_user_asid(u16 asid)
+static __always_inline void invalidate_user_asid(u16 asid)
 {
 	/* There is no user ASID if address space separation is off */
 	if (!IS_ENABLED(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION))
@@ -1206,7 +1206,7 @@ STATIC_NOPV noinstr void native_flush_tlb_global(void)
 /*
  * Flush the entire current user mapping
  */
-STATIC_NOPV void native_flush_tlb_local(void)
+STATIC_NOPV noinstr void native_flush_tlb_local(void)
 {
 	/*
 	 * Preemption or interrupts must be disabled to protect the access
On Tue, Jan 14, 2025, Valentin Schneider wrote:
Later patches will require issuing a __flush_tlb_all() from noinstr code. This requires making both __flush_tlb_local() and __flush_tlb_global() noinstr-compliant.
For __flush_tlb_local(), xen_flush_tlb() has already been made noinstr, so it's just native_flush_tlb_local(), and simply __always_inline'ing invalidate_user_asid() gets us there.
Signed-off-by: Valentin Schneider vschneid@redhat.com
...
@@ -1206,7 +1206,7 @@ STATIC_NOPV noinstr void native_flush_tlb_global(void)
 /*
  * Flush the entire current user mapping
  */
-STATIC_NOPV void native_flush_tlb_local(void)
+STATIC_NOPV noinstr void native_flush_tlb_local(void)
native_write_cr3() and __native_read_cr3() need to be __always_inline.
vmlinux.o: warning: objtool: native_flush_tlb_local+0x8: call to __native_read_cr3() leaves .noinstr.text section
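For reference, a sketch of the change being asked for here - assuming the current definitions in arch/x86/include/asm/special_insns.h, not taken from any posted patch:

/* Marked __always_inline so callers in .noinstr.text don't emit a call */
static __always_inline unsigned long __native_read_cr3(void)
{
	unsigned long val;

	asm volatile("mov %%cr3,%0\n\t" : "=r" (val) : __FORCE_ORDER);
	return val;
}

static __always_inline void native_write_cr3(unsigned long val)
{
	asm volatile("mov %0,%%cr3" : : "r" (val) : "memory");
}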
Later patches will require issuing a __flush_tlb_all() from noinstr code. Both __flush_tlb_local() and __flush_tlb_global() are now noinstr-compliant, so __flush_tlb_all() can be made noinstr itself.
Signed-off-by: Valentin Schneider vschneid@redhat.com
---
 arch/x86/include/asm/tlbflush.h | 2 +-
 arch/x86/mm/tlb.c               | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 69e79fff41b80..4d11396250999 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -17,7 +17,7 @@

 DECLARE_PER_CPU(u64, tlbstate_untag_mask);

-void __flush_tlb_all(void);
+noinstr void __flush_tlb_all(void);

 #define TLB_FLUSH_ALL			-1UL
 #define TLB_GENERATION_INVALID		0
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 18b40bbc2fa15..119765772ab11 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1229,7 +1229,7 @@ void flush_tlb_local(void)
 /*
  * Flush everything
  */
-void __flush_tlb_all(void)
+noinstr void __flush_tlb_all(void)
 {
 	/*
 	 * This is to catch users with enabled preemption and the PGE feature
@@ -1243,7 +1243,7 @@ void __flush_tlb_all(void)
 		/*
 		 * !PGE -> !PCID (setup_pcid()), thus every flush is total.
 		 */
-		flush_tlb_local();
+		__flush_tlb_local();
 	}
 }
 EXPORT_SYMBOL_GPL(__flush_tlb_all);
vunmap()'s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as they are hit by the flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
Deferral vs early entry danger zone ===================================
This requires a guarantee that nothing in the vmalloc range can be vunmap'd and then accessed in early entry code.
Vmalloc uses are, as reported by vmallocinfo:
$ cat /proc/vmallocinfo | awk '{ print $3 }' | sort | uniq
  __pci_enable_msix_range+0x32b/0x560
  acpi_os_map_iomem+0x22d/0x250
  bpf_prog_alloc_no_stats+0x34/0x140
  fork_idle+0x79/0x120
  gen_pool_add_owner+0x3e/0xb0 ?
  hpet_enable+0xbf/0x470
  irq_init_percpu_irqstack+0x129/0x160
  kernel_clone+0xab/0x3b0
  memremap+0x164/0x330
  n_tty_open+0x18/0xa0
  pcpu_create_chunk+0x4e/0x1b0
  pcpu_create_chunk+0x75/0x1b0
  pcpu_get_vm_areas+0x0/0x1100
  unpurged
  vp_modern_map_capability+0x140/0x270
  zisofs_init+0x16/0x30
I've categorized these as:
a) Device or percpu mappings
For these to be unmapped, the device (or CPU) has to be removed and an eventual IRQ freed. Even if the IRQ isn't freed, context tracking entry happens before handling the IRQ itself, per irqentry_enter() usage.
  __pci_enable_msix_range()
  acpi_os_map_iomem()
  irq_init_percpu_irqstack() (not even unmapped when CPU is hot-unplugged!)
  memremap()
  n_tty_open()
  pcpu_create_chunk()
  pcpu_get_vm_areas()
  vp_modern_map_capability()
b) CONFIG_VMAP_STACK
fork_idle() & kernel_clone()
vmalloc'd kernel stacks are AFAICT a safe example, as a task running in userspace needs to enter kernelspace to execute do_exit() before its stack can be vfree'd.
c) Non-issues
  bpf_prog_alloc_no_stats() - early entry is noinstr, no BPF!
  hpet_enable() - hpet_clear_mapping() is only called if the __init function fails, no runtime worries
  zisofs_init() - used for zisofs block decompression, that's way past context tracking entry
d) I'm not sure, have a look?
gen_pool_add_owner() - AIUI this is mainly for PCI / DMA stuff, which again I wouldn't expect to be accessed before context tracking entry.
Changes ======
Blindly deferring any and all flushes of the kernel mappings is a risky move, so introduce a variant of flush_tlb_kernel_range() that explicitly allows deferral. Use it for vunmap flushes.
Note that while flush_tlb_kernel_range() may end up issuing a full flush (including user mappings), this only happens when reaching an invalidation range threshold where it is cheaper to do a full flush than to individually invalidate each page in the range via INVLPG. IOW, it doesn't *require* invalidating user mappings, and thus remains safe to defer until a later kernel entry.
Signed-off-by: Valentin Schneider vschneid@redhat.com
---
 arch/x86/include/asm/context_tracking_work.h |  4 +++
 arch/x86/include/asm/tlbflush.h              |  1 +
 arch/x86/mm/tlb.c                            | 23 +++++++++++--
 include/linux/context_tracking_work.h        |  4 ++-
 mm/vmalloc.c                                 | 35 +++++++++++++++++---
 5 files changed, 58 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 485b32881fde5..da3d270289836 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H

 #include <asm/sync_core.h>
+#include <asm/tlbflush.h>

 static __always_inline void arch_context_tracking_work(enum ct_work work)
 {
@@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(enum ct_work work)
 	case CT_WORK_SYNC:
 		sync_core();
 		break;
+	case CT_WORK_TLBI:
+		__flush_tlb_all();
+		break;
 	case CT_WORK_MAX:
 		WARN_ON_ONCE(true);
 	}
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4d11396250999..6e690acc35e53 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -248,6 +248,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
				unsigned long end, unsigned int stride_shift,
				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);

 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 119765772ab11..47fb437acf52a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
 #include <linux/task_work.h>
 #include <linux/mmu_notifier.h>
 #include <linux/mmu_context.h>
+#include <linux/context_tracking.h>

 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -1042,6 +1043,11 @@ static void do_flush_tlb_all(void *info)
 	__flush_tlb_all();
 }

+static bool do_kernel_flush_defer_cond(int cpu, void *info)
+{
+	return !ct_set_cpu_work(cpu, CT_WORK_TLBI);
+}
+
 void flush_tlb_all(void)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
@@ -1058,12 +1064,13 @@ static void do_kernel_range_flush(void *info)
 		flush_tlb_one_kernel(addr);
 }

-void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+static inline void
+__flush_tlb_kernel_range(smp_cond_func_t cond_func, unsigned long start, unsigned long end)
 {
 	/* Balance as user space task's flush, a bit conservative */
 	if (end == TLB_FLUSH_ALL ||
 	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
+		on_each_cpu_cond(cond_func, do_flush_tlb_all, NULL, 1);
 	} else {
 		struct flush_tlb_info *info;

@@ -1071,13 +1078,23 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 		info = get_flush_tlb_info(NULL, start, end, 0, false,
					  TLB_GENERATION_INVALID);

-		on_each_cpu(do_kernel_range_flush, info, 1);
+		on_each_cpu_cond(cond_func, do_kernel_range_flush, info, 1);

 		put_flush_tlb_info();
 		preempt_enable();
 	}
 }

+void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(NULL, start, end);
+}
+
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(do_kernel_flush_defer_cond, start, end);
+}
+
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index 931ade1dcbcc2..1be60c64cdeca 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -5,12 +5,14 @@
 #include <linux/bitops.h>

 enum {
-	CT_WORK_SYNC,
+	CT_WORK_SYNC_OFFSET,
+	CT_WORK_TLBI_OFFSET,
 	CT_WORK_MAX_OFFSET
 };

 enum ct_work {
 	CT_WORK_SYNC	= BIT(CT_WORK_SYNC_OFFSET),
+	CT_WORK_TLBI	= BIT(CT_WORK_TLBI_OFFSET),
 	CT_WORK_MAX	= BIT(CT_WORK_MAX_OFFSET)
 };

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5c88d0e90c209..e8aad4d55e955 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -467,6 +467,31 @@ void vunmap_range_noflush(unsigned long start, unsigned long end)
 	__vunmap_range_noflush(start, end);
 }

+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+/*
+ * !!! BIG FAT WARNING !!!
+ *
+ * The CPU is free to cache any part of the paging hierarchy it wants at any
+ * time. It's also free to set accessed and dirty bits at any time, even for
+ * instructions that may never execute architecturally.
+ *
+ * This means that deferring a TLB flush affecting freed page-table-pages (IOW,
+ * keeping them in a CPU's paging hierarchy cache) is akin to dancing in a
+ * minefield.
+ *
+ * This isn't a problem for deferral of TLB flushes in vmalloc, because
+ * page-table-pages used for vmap() mappings are never freed - see how
+ * __vunmap_range_noflush() walks the whole mapping but only clears the leaf PTEs.
+ * If this ever changes, TLB flush deferral will cause misery.
+ */
+void __weak flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	flush_tlb_kernel_range(start, end);
+}
+#else
+#define flush_tlb_kernel_range_deferrable(start, end) flush_tlb_kernel_range(start, end)
+#endif
+
 /**
  * vunmap_range - unmap kernel virtual addresses
  * @addr: start of the VM area to unmap
@@ -480,7 +505,7 @@ void vunmap_range(unsigned long addr, unsigned long end)
 {
 	flush_cache_vunmap(addr, end);
 	vunmap_range_noflush(addr, end);
-	flush_tlb_kernel_range(addr, end);
+	flush_tlb_kernel_range_deferrable(addr, end);
 }

 static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
@@ -2281,7 +2306,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,

 	nr_purge_nodes = cpumask_weight(&purge_nodes);
 	if (nr_purge_nodes > 0) {
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);

 		/* One extra worker is per a lazy_max_pages() full set minus one. */
 		nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages();
@@ -2384,7 +2409,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
 	flush_cache_vunmap(va->va_start, va->va_end);
 	vunmap_range_noflush(va->va_start, va->va_end);
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(va->va_start, va->va_end);
+		flush_tlb_kernel_range_deferrable(va->va_start, va->va_end);

 	free_vmap_area_noflush(va);
 }
@@ -2832,7 +2857,7 @@ static void vb_free(unsigned long addr, unsigned long size)
 	vunmap_range_noflush(addr, addr + size);

 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(addr, addr + size);
+		flush_tlb_kernel_range_deferrable(addr, addr + size);

 	spin_lock(&vb->lock);

@@ -2897,7 +2922,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 	free_purged_blocks(&purge_list);

 	if (!__purge_vmap_area_lazy(start, end, false) && flush)
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
 	mutex_unlock(&vmap_purge_lock);
 }
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider vschneid@redhat.com wrote:
vunmap()'s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as they are hit by the flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
Deferral vs early entry danger zone
This requires a guarantee that nothing in the vmalloc range can be vunmap'd and then accessed in early entry code.
In other words, it needs a guarantee that no vmalloc allocations that have been created in the vmalloc region while the CPU was idle can then be accessed during early entry, right?
On 14/01/25 19:16, Jann Horn wrote:
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider vschneid@redhat.com wrote:
vunmap()'s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as they are hit by the flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
Deferral vs early entry danger zone
This requires a guarantee that nothing in the vmalloc range can be vunmap'd and then accessed in early entry code.
In other words, it needs a guarantee that no vmalloc allocations that have been created in the vmalloc region while the CPU was idle can then be accessed during early entry, right?
I'm not sure if that would be a problem (not an mm expert, please do correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't deferred anyway.
So after vmapping something, I wouldn't expect isolated CPUs to have invalid TLB entries for the newly vmapped page.
However, upon vunmap'ing something, the TLB flush is deferred, and thus stale TLB entries can and will remain on isolated CPUs, up until they execute the deferred flush themselves (IOW for the entire duration of the "danger zone").
Does that make sense?
On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider vschneid@redhat.com wrote:
On 14/01/25 19:16, Jann Horn wrote:
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider vschneid@redhat.com wrote:
vunmap()'s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as they are hit by the flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
Deferral vs early entry danger zone
This requires a guarantee that nothing in the vmalloc range can be vunmap'd and then accessed in early entry code.
In other words, it needs a guarantee that no vmalloc allocations that have been created in the vmalloc region while the CPU was idle can then be accessed during early entry, right?
I'm not sure if that would be a problem (not an mm expert, please do correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't deferred anyway.
flush_cache_vmap() is about stuff like flushing data caches on architectures with virtually indexed caches; that doesn't do TLB maintenance. When you look for its definition on x86 or arm64, you'll see that they use the generic implementation which is simply an empty inline function.
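For reference, the generic version is essentially an empty inline that architectures can override - roughly this shape in include/asm-generic/cacheflush.h:

#ifndef flush_cache_vmap
static inline void flush_cache_vmap(unsigned long start, unsigned long end)
{
	/* no-op: nothing to do without virtually indexed data caches */
}
#endif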
So after vmapping something, I wouldn't expect isolated CPUs to have invalid TLB entries for the newly vmapped page.
However, upon vunmap'ing something, the TLB flush is deferred, and thus stale TLB entries can and will remain on isolated CPUs, up until they execute the deferred flush themselves (IOW for the entire duration of the "danger zone").
Does that make sense?
The design idea wrt TLB flushes in the vmap code is that you don't do TLB flushes when you unmap stuff or when you map stuff, because doing TLB flushes across the entire system on every vmap/vunmap would be a bit costly; instead you just do batched TLB flushes in between, in __purge_vmap_area_lazy().
In other words, the basic idea is that you can keep calling vmap() and vunmap() a bunch of times without ever doing TLB flushes until you run out of virtual memory in the vmap region; then you do one big TLB flush, and afterwards you can reuse the free virtual address space for new allocations again.
So if you "defer" that batched TLB flush for CPUs that are not currently running in the kernel, I think the consequence is that those CPUs may end up with incoherent TLB state after a reallocation of the virtual address space.
Actually, I think this would mean that your optimization is disallowed at least on arm64 - I'm not sure about the exact wording, but arm64 has a "break before make" rule that forbids conflicting writable address translations or something like that.
(I said "until you run out of virtual memory in the vmap region", but that's not actually true - see the comment above lazy_max_pages() for an explanation of the actual heuristic. You might be able to tune that a bit if you'd be significantly happier with less frequent interruptions, or something along those lines.)
On 17/01/25 16:52, Jann Horn wrote:
On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider vschneid@redhat.com wrote:
On 14/01/25 19:16, Jann Horn wrote:
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider vschneid@redhat.com wrote:
vunmap()'s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as they are hit by the flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
Deferral vs early entry danger zone
This requires a guarantee that nothing in the vmalloc range can be vunmap'd and then accessed in early entry code.
In other words, it needs a guarantee that no vmalloc allocations that have been created in the vmalloc region while the CPU was idle can then be accessed during early entry, right?
I'm not sure if that would be a problem (not an mm expert, please do correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't deferred anyway.
flush_cache_vmap() is about stuff like flushing data caches on architectures with virtually indexed caches; that doesn't do TLB maintenance. When you look for its definition on x86 or arm64, you'll see that they use the generic implementation which is simply an empty inline function.
So after vmapping something, I wouldn't expect isolated CPUs to have invalid TLB entries for the newly vmapped page.
However, upon vunmap'ing something, the TLB flush is deferred, and thus stale TLB entries can and will remain on isolated CPUs, up until they execute the deferred flush themselves (IOW for the entire duration of the "danger zone").
Does that make sense?
The design idea wrt TLB flushes in the vmap code is that you don't do TLB flushes when you unmap stuff or when you map stuff, because doing TLB flushes across the entire system on every vmap/vunmap would be a bit costly; instead you just do batched TLB flushes in between, in __purge_vmap_area_lazy().
In other words, the basic idea is that you can keep calling vmap() and vunmap() a bunch of times without ever doing TLB flushes until you run out of virtual memory in the vmap region; then you do one big TLB flush, and afterwards you can reuse the free virtual address space for new allocations again.
So if you "defer" that batched TLB flush for CPUs that are not currently running in the kernel, I think the consequence is that those CPUs may end up with incoherent TLB state after a reallocation of the virtual address space.
Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc that occurred while an isolated CPU was NOHZ-FULL can be an issue if said CPU accesses it during early entry;
Actually, I think this would mean that your optimization is disallowed at least on arm64 - I'm not sure about the exact wording, but arm64 has a "break before make" rule that forbids conflicting writable address translations or something like that.
On the bright side of things, arm64 is not as bad as x86 when it comes to IPI'ing isolated CPUs :-) I'll add that to my notes, thanks!
(I said "until you run out of virtual memory in the vmap region", but that's not actually true - see the comment above lazy_max_pages() for an explanation of the actual heuristic. You might be able to tune that a bit if you'd be significantly happier with less frequent interruptions, or something along those lines.)
On Fri, Jan 17, 2025 at 05:53:33PM +0100, Valentin Schneider wrote:
On 17/01/25 16:52, Jann Horn wrote:
On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider vschneid@redhat.com wrote:
On 14/01/25 19:16, Jann Horn wrote:
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider vschneid@redhat.com wrote:
vunmap()'s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as they are hit by the flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
Deferral vs early entry danger zone
This requires a guarantee that nothing in the vmalloc range can be vunmap'd and then accessed in early entry code.
In other words, it needs a guarantee that no vmalloc allocations that have been created in the vmalloc region while the CPU was idle can then be accessed during early entry, right?
I'm not sure if that would be a problem (not an mm expert, please do correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't deferred anyway.
flush_cache_vmap() is about stuff like flushing data caches on architectures with virtually indexed caches; that doesn't do TLB maintenance. When you look for its definition on x86 or arm64, you'll see that they use the generic implementation which is simply an empty inline function.
So after vmapping something, I wouldn't expect isolated CPUs to have invalid TLB entries for the newly vmapped page.
However, upon vunmap'ing something, the TLB flush is deferred, and thus stale TLB entries can and will remain on isolated CPUs, up until they execute the deferred flush themselves (IOW for the entire duration of the "danger zone").
Does that make sense?
The design idea wrt TLB flushes in the vmap code is that you don't do TLB flushes when you unmap stuff or when you map stuff, because doing TLB flushes across the entire system on every vmap/vunmap would be a bit costly; instead you just do batched TLB flushes in between, in __purge_vmap_area_lazy().
In other words, the basic idea is that you can keep calling vmap() and vunmap() a bunch of times without ever doing TLB flushes until you run out of virtual memory in the vmap region; then you do one big TLB flush, and afterwards you can reuse the free virtual address space for new allocations again.
So if you "defer" that batched TLB flush for CPUs that are not currently running in the kernel, I think the consequence is that those CPUs may end up with incoherent TLB state after a reallocation of the virtual address space.
Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc that occurred while an isolated CPU was NOHZ-FULL can be an issue if said CPU accesses it during early entry;
So the issue is:
CPU1: unmaps vmalloc page X which was previously mapped to physical page P1.
CPU2: does a whole bunch of vmalloc and vfree eventually crossing some lazy threshold and sending out IPIs. It then goes ahead and does an allocation that maps the same virtual page X to physical page P2.
CPU3 is isolated and executes some early entry code before receiving said IPIs which are supposedly deferred by Valentin's patches.
It does not receive the IPI because it is deferred, thus access by early entry code to page X on this CPU results in a UAF access to P1.
Is that the issue?
thanks,
- Joel
On 19/02/25 10:05, Joel Fernandes wrote:
On Fri, Jan 17, 2025 at 05:53:33PM +0100, Valentin Schneider wrote:
On 17/01/25 16:52, Jann Horn wrote:
On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider vschneid@redhat.com wrote:
On 14/01/25 19:16, Jann Horn wrote:
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider vschneid@redhat.com wrote:
vunmap()'s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as they are hit by the flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
Deferral vs early entry danger zone
This requires a guarantee that nothing in the vmalloc range can be vunmap'd and then accessed in early entry code.
In other words, it needs a guarantee that no vmalloc allocations that have been created in the vmalloc region while the CPU was idle can then be accessed during early entry, right?
I'm not sure if that would be a problem (not an mm expert, please do correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't deferred anyway.
flush_cache_vmap() is about stuff like flushing data caches on architectures with virtually indexed caches; that doesn't do TLB maintenance. When you look for its definition on x86 or arm64, you'll see that they use the generic implementation which is simply an empty inline function.
So after vmapping something, I wouldn't expect isolated CPUs to have invalid TLB entries for the newly vmapped page.
However, upon vunmap'ing something, the TLB flush is deferred, and thus stale TLB entries can and will remain on isolated CPUs, up until they execute the deferred flush themselves (IOW for the entire duration of the "danger zone").
Does that make sense?
The design idea wrt TLB flushes in the vmap code is that you don't do TLB flushes when you unmap stuff or when you map stuff, because doing TLB flushes across the entire system on every vmap/vunmap would be a bit costly; instead you just do batched TLB flushes in between, in __purge_vmap_area_lazy().
In other words, the basic idea is that you can keep calling vmap() and vunmap() a bunch of times without ever doing TLB flushes until you run out of virtual memory in the vmap region; then you do one big TLB flush, and afterwards you can reuse the free virtual address space for new allocations again.
So if you "defer" that batched TLB flush for CPUs that are not currently running in the kernel, I think the consequence is that those CPUs may end up with incoherent TLB state after a reallocation of the virtual address space.
Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc that occurred while an isolated CPU was NOHZ-FULL can be an issue if said CPU accesses it during early entry;
So the issue is:
CPU1: unmaps vmalloc page X which was previously mapped to physical page P1.
CPU2: does a whole bunch of vmalloc and vfree eventually crossing some lazy threshold and sending out IPIs. It then goes ahead and does an allocation that maps the same virtual page X to physical page P2.
CPU3 is isolated and executes some early entry code before receiving said IPIs which are supposedly deferred by Valentin's patches.
It does not receive the IPI because it is deferred, thus access by early entry code to page X on this CPU results in a UAF access to P1.
Is that the issue?
Pretty much so yeah. That is, *if* there is such a vmalloc'd address access in early entry code - testing says it's not the case, but I haven't found a way to instrumentally verify this.
On 2/19/2025 11:18 AM, Valentin Schneider wrote:
On 19/02/25 10:05, Joel Fernandes wrote:
On Fri, Jan 17, 2025 at 05:53:33PM +0100, Valentin Schneider wrote:
On 17/01/25 16:52, Jann Horn wrote:
On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider vschneid@redhat.com wrote:
On 14/01/25 19:16, Jann Horn wrote:
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider vschneid@redhat.com wrote:
> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> flush_tlb_kernel_range() IPIs.
>
> Given that CPUs executing in userspace do not access data in the vmalloc
> range, these IPIs could be deferred until their next kernel entry.
>
> Deferral vs early entry danger zone
> ===================================
>
> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> and then accessed in early entry code.
In other words, it needs a guarantee that no vmalloc allocations that have been created in the vmalloc region while the CPU was idle can then be accessed during early entry, right?
I'm not sure if that would be a problem (not an mm expert, please do correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't deferred anyway.
flush_cache_vmap() is about stuff like flushing data caches on architectures with virtually indexed caches; that doesn't do TLB maintenance. When you look for its definition on x86 or arm64, you'll see that they use the generic implementation which is simply an empty inline function.
So after vmapping something, I wouldn't expect isolated CPUs to have invalid TLB entries for the newly vmapped page.
However, upon vunmap'ing something, the TLB flush is deferred, and thus stale TLB entries can and will remain on isolated CPUs, up until they execute the deferred flush themselves (IOW for the entire duration of the "danger zone").
Does that make sense?
The design idea wrt TLB flushes in the vmap code is that you don't do TLB flushes when you unmap stuff or when you map stuff, because doing TLB flushes across the entire system on every vmap/vunmap would be a bit costly; instead you just do batched TLB flushes in between, in __purge_vmap_area_lazy().
In other words, the basic idea is that you can keep calling vmap() and vunmap() a bunch of times without ever doing TLB flushes until you run out of virtual memory in the vmap region; then you do one big TLB flush, and afterwards you can reuse the free virtual address space for new allocations again.
So if you "defer" that batched TLB flush for CPUs that are not currently running in the kernel, I think the consequence is that those CPUs may end up with incoherent TLB state after a reallocation of the virtual address space.
Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc that occurred while an isolated CPU was NOHZ-FULL can be an issue if said CPU accesses it during early entry;
So the issue is:
CPU1: unmaps vmalloc page X which was previously mapped to physical page P1.
CPU2: does a whole bunch of vmalloc and vfree eventually crossing some lazy threshold and sending out IPIs. It then goes ahead and does an allocation that maps the same virtual page X to physical page P2.
CPU3 is isolated and executes some early entry code before receiving said IPIs which are supposedly deferred by Valentin's patches.
It does not receive the IPI because it is deferred, thus access by early entry code to page X on this CPU results in a UAF access to P1.
Is that the issue?
Pretty much so yeah. That is, *if* there is such a vmalloc'd address access in early entry code - testing says it's not the case, but I haven't found a way to instrumentally verify this.
Ok, thanks for confirming. Maybe there is an address sanitizer way of verifying, but yeah it is subtle and there could be more than one way of solving it. Too much 'fun' ;)
- Joel
On 2/19/25 09:08, Joel Fernandes wrote:
Pretty much so yeah. That is, *if* there is such a vmalloc'd address access in early entry code - testing says it's not the case, but I haven't found a way to instrumentally verify this.
Ok, thanks for confirming. Maybe there is an address sanitizer way of verifying, but yeah it is subtle and there could be more than one way of solving it. Too much 'fun' 😉
For debugging, you could just make a copy of part or all of the page tables and run the NOHZ_FULL tasks from those while they're in userspace. Then, instead of flushing the TLB in the deferred work, you switch over to the "real" page tables.
That would _behave_ like a CPU with a big TLB and really old, crusty TLB entries from the last time the kernel ran.
BTW, the other option for all of this is just to say that if you want IPI-free TLB flushing, you need to go buy some hardware that supports it, as opposed to taking on all of this complexity.
On Fri, Jan 17, 2025 at 04:52:19PM +0100, Jann Horn wrote:
On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider vschneid@redhat.com wrote:
On 14/01/25 19:16, Jann Horn wrote:
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider vschneid@redhat.com wrote:
vunmap()'s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as they are hit by the flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
Deferral vs early entry danger zone
This requires a guarantee that nothing in the vmalloc range can be vunmap'd and then accessed in early entry code.
In other words, it needs a guarantee that no vmalloc allocations that have been created in the vmalloc region while the CPU was idle can then be accessed during early entry, right?
I'm not sure if that would be a problem (not an mm expert, please do correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't deferred anyway.
flush_cache_vmap() is about stuff like flushing data caches on architectures with virtually indexed caches; that doesn't do TLB maintenance. When you look for its definition on x86 or arm64, you'll see that they use the generic implementation which is simply an empty inline function.
So after vmapping something, I wouldn't expect isolated CPUs to have invalid TLB entries for the newly vmapped page.
However, upon vunmap'ing something, the TLB flush is deferred, and thus stale TLB entries can and will remain on isolated CPUs, up until they execute the deferred flush themselves (IOW for the entire duration of the "danger zone").
Does that make sense?
The design idea wrt TLB flushes in the vmap code is that you don't do TLB flushes when you unmap stuff or when you map stuff, because doing TLB flushes across the entire system on every vmap/vunmap would be a bit costly; instead you just do batched TLB flushes in between, in __purge_vmap_area_lazy().
In other words, the basic idea is that you can keep calling vmap() and vunmap() a bunch of times without ever doing TLB flushes until you run out of virtual memory in the vmap region; then you do one big TLB flush, and afterwards you can reuse the free virtual address space for new allocations again.
So if you "defer" that batched TLB flush for CPUs that are not currently running in the kernel, I think the consequence is that those CPUs may end up with incoherent TLB state after a reallocation of the virtual address space.
Actually, I think this would mean that your optimization is disallowed at least on arm64 - I'm not sure about the exact wording, but arm64 has a "break before make" rule that forbids conflicting writable address translations or something like that.
Yes, that would definitely be a problem. There's also the more obvious issue that the CnP ("Common not Private") feature of some Arm CPUs means that TLB entries can be shared between cores, so the whole idea of using a CPU's exception level to predicate invalidation is flawed on such a system.
Will
On 17/01/25 16:52, Jann Horn wrote:
On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider vschneid@redhat.com wrote:
On 14/01/25 19:16, Jann Horn wrote:
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider vschneid@redhat.com wrote:
vunmap()'s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as they are hit by the flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
Deferral vs early entry danger zone
This requires a guarantee that nothing in the vmalloc range can be vunmap'd and then accessed in early entry code.
In other words, it needs a guarantee that no vmalloc allocations that have been created in the vmalloc region while the CPU was idle can then be accessed during early entry, right?
I'm not sure if that would be a problem (not an mm expert, please do correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't deferred anyway.
flush_cache_vmap() is about stuff like flushing data caches on architectures with virtually indexed caches; that doesn't do TLB maintenance. When you look for its definition on x86 or arm64, you'll see that they use the generic implementation which is simply an empty inline function.
So after vmapping something, I wouldn't expect isolated CPUs to have invalid TLB entries for the newly vmapped page.
However, upon vunmap'ing something, the TLB flush is deferred, and thus stale TLB entries can and will remain on isolated CPUs, up until they execute the deferred flush themselves (IOW for the entire duration of the "danger zone").
Does that make sense?
The design idea wrt TLB flushes in the vmap code is that you don't do TLB flushes when you unmap stuff or when you map stuff, because doing TLB flushes across the entire system on every vmap/vunmap would be a bit costly; instead you just do batched TLB flushes in between, in __purge_vmap_area_lazy().
In other words, the basic idea is that you can keep calling vmap() and vunmap() a bunch of times without ever doing TLB flushes until you run out of virtual memory in the vmap region; then you do one big TLB flush, and afterwards you can reuse the free virtual address space for new allocations again.
So if you "defer" that batched TLB flush for CPUs that are not currently running in the kernel, I think the consequence is that those CPUs may end up with incoherent TLB state after a reallocation of the virtual address space.
Actually, I think this would mean that your optimization is disallowed at least on arm64 - I'm not sure about the exact wording, but arm64 has a "break before make" rule that forbids conflicting writable address translations or something like that.
(I said "until you run out of virtual memory in the vmap region", but that's not actually true - see the comment above lazy_max_pages() for an explanation of the actual heuristic. You might be able to tune that a bit if you'd be significantly happier with less frequent interruptions, or something along those lines.)
I've been thinking some more (this is your cue to grab a brown paper bag)...
Experimentation (unmap the whole VMALLOC range upon return to userspace, see what explodes upon entry into the kernel) suggests that the early entry "danger zone" should only access the vmap'd stack, which itself isn't an issue.
That is obviously just a test on one system configuration, and the problem I'm facing is trying to put in place /some/ form of instrumentation that would at the very least cause a warning for any future patch that would introduce a vmap'd access in early entry code. That, or a complete mitigation that prevents those accesses altogether.
What if isolated CPUs unconditionally did a TLBi as late as possible in the stack right before returning to userspace? This would mean that upon re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel range translation - with the exception of whatever lies between the last-minute flush and the actual userspace entry, which should be feasible to vet? Then AFAICT there wouldn't be any work/flush to defer, the IPI could be entirely silenced if it targets an isolated CPU.
On Mon, Feb 10, 2025 at 7:36 PM Valentin Schneider vschneid@redhat.com wrote:
What if isolated CPUs unconditionally did a TLBi as late as possible in the stack right before returning to userspace? This would mean that upon re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel range translation - with the exception of whatever lies between the last-minute flush and the actual userspace entry, which should be feasible to vet? Then AFAICT there wouldn't be any work/flush to defer, the IPI could be entirely silenced if it targets an isolated CPU.
Two issues with that:
1. I think the "Common not Private" feature Will Deacon referred to is incompatible with this idea: https://developer.arm.com/documentation/101811/0104/Address-spaces/Common-not-Private says "When the CnP bit is set, the software promises to use the ASIDs and VMIDs in the same way on all processors, which allows the TLB entries that are created by one processor to be used by another"
2. It's wrong to assume that TLB entries are only populated for addresses you access - thanks to speculative execution, you have to assume that the CPU might be populating random TLB entries all over the place.
On 10/02/25 23:08, Jann Horn wrote:
On Mon, Feb 10, 2025 at 7:36 PM Valentin Schneider vschneid@redhat.com wrote:
What if isolated CPUs unconditionally did a TLBi as late as possible in the stack right before returning to userspace? This would mean that upon re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel range translation - with the exception of whatever lies between the last-minute flush and the actual userspace entry, which should be feasible to vet? Then AFAICT there wouldn't be any work/flush to defer, the IPI could be entirely silenced if it targets an isolated CPU.
Two issues with that:
Firstly, thank you for entertaining the idea :-)
- I think the "Common not Private" feature Will Deacon referred to is
incompatible with this idea: https://developer.arm.com/documentation/101811/0104/Address-spaces/Common-not-Private says "When the CnP bit is set, the software promises to use the ASIDs and VMIDs in the same way on all processors, which allows the TLB entries that are created by one processor to be used by another"
Sorry for being obtuse - I can understand inconsistent TLB states (old vs new translations being present in separate TLBs) due to not sending the flush IPI causing an issue with that, but not "flushing early". Even if TLB entries can be shared/accessed between CPUs, a CPU should be allowed not to have a shared entry in its TLB - what am I missing?
- It's wrong to assume that TLB entries are only populated for
addresses you access - thanks to speculative execution, you have to assume that the CPU might be populating random TLB entries all over the place.
Gotta love speculation. Now it is supposed to be limited to genuinely accessible data & code, right? Say theoretically we have a full TLBi as literally the last thing before doing the return-to-userspace, speculation should be limited to executing maybe bits of the return-from-userspace code?
Furthermore, I would hope that once a CPU is executing in userspace, it's not going to populate the TLB with kernel address translations - AIUI the whole vulnerability mitigation debacle was about preventing this sort of thing.
On Tue, Feb 11, 2025 at 02:33:51PM +0100, Valentin Schneider wrote:
On 10/02/25 23:08, Jann Horn wrote:
On Mon, Feb 10, 2025 at 7:36 PM Valentin Schneider vschneid@redhat.com wrote:
What if isolated CPUs unconditionally did a TLBi as late as possible in the stack right before returning to userspace? This would mean that upon re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel range translation - with the exception of whatever lies between the last-minute flush and the actual userspace entry, which should be feasible to vet? Then AFAICT there wouldn't be any work/flush to defer, the IPI could be entirely silenced if it targets an isolated CPU.
Two issues with that:
Firstly, thank you for entertaining the idea :-)
- I think the "Common not Private" feature Will Deacon referred to is
incompatible with this idea: https://developer.arm.com/documentation/101811/0104/Address-spaces/Common-not-Private says "When the CnP bit is set, the software promises to use the ASIDs and VMIDs in the same way on all processors, which allows the TLB entries that are created by one processor to be used by another"
Sorry for being obtuse - I can understand inconsistent TLB states (old vs new translations being present in separate TLBs) due to not sending the flush IPI causing an issue with that, but not "flushing early". Even if TLB entries can be shared/accessed between CPUs, a CPU should be allowed not to have a shared entry in its TLB - what am I missing?
- It's wrong to assume that TLB entries are only populated for
addresses you access - thanks to speculative execution, you have to assume that the CPU might be populating random TLB entries all over the place.
Gotta love speculation. Now it is supposed to be limited to genuinely accessible data & code, right? Say theoretically we have a full TLBi as literally the last thing before doing the return-to-userspace, speculation should be limited to executing maybe bits of the return-from-userspace code?
I think it's easier to ignore speculation entirely, and just assume that the MMU can arbitrarily fill TLB entries from any page table entries which are valid/accessible in the active page tables. Hardware prefetchers can do that regardless of the specific path of speculative execution.
Thus TLB fills are not limited to VAs which would be used on that return-to-userspace path.
Furthermore, I would hope that once a CPU is executing in userspace, it's not going to populate the TLB with kernel address translations - AIUI the whole vulnerability mitigation debacle was about preventing this sort of thing.
The CPU can definitely do that; the vulnerability mitigations are all about what userspace can observe rather than what the CPU can do in the background. Additionally, there are features like SPE and TRBE that use kernel addresses while the CPU is executing userspace instructions.
The latest ARM Architecture Reference Manual (ARM DDI 0487 L.a) is fairly clear about that in section D8.16 "Translation Lookaside Buffers", where it says (among other things):
When address translation is enabled, if a translation table entry meets all of the following requirements, then that translation table entry is permitted to be cached in a TLB or intermediate TLB caching structure at any time:
• The translation table entry itself does not generate a Translation fault, an Address size fault, or an Access flag fault.
• The translation table entry is not from a translation regime configured by an Exception level that is lower than the current Exception level.
Here "permitted to be cached in a TLB" also implies that the HW is allowed to fetch the translation tabl entry (which is what ARM call page table entries).
The PDF can be found at:
https://developer.arm.com/documentation/ddi0487/la/?lang=en
Mark.
On 11/02/25 14:03, Mark Rutland wrote:
On Tue, Feb 11, 2025 at 02:33:51PM +0100, Valentin Schneider wrote:
On 10/02/25 23:08, Jann Horn wrote:
- It's wrong to assume that TLB entries are only populated for
addresses you access - thanks to speculative execution, you have to assume that the CPU might be populating random TLB entries all over the place.
Gotta love speculation. Now it is supposed to be limited to genuinely accessible data & code, right? Say theoretically we have a full TLBi as literally the last thing before doing the return-to-userspace, speculation should be limited to executing maybe bits of the return-from-userspace code?
I think it's easier to ignore speculation entirely, and just assume that the MMU can arbitrarily fill TLB entries from any page table entries which are valid/accessible in the active page tables. Hardware prefetchers can do that regardless of the specific path of speculative execution.
Thus TLB fills are not limited to VAs which would be used on that return-to-userspace path.
Furthermore, I would hope that once a CPU is executing in userspace, it's not going to populate the TLB with kernel address translations - AIUI the whole vulnerability mitigation debacle was about preventing this sort of thing.
The CPU can definitely do that; the vulnerability mitigations are all about what userspace can observe rather than what the CPU can do in the background. Additionally, there are features like SPE and TRBE that use kernel addresses while the CPU is executing userspace instructions.
The latest ARM Architecture Reference Manual (ARM DDI 0487 L.a) is fairly clear about that in section D8.16 "Translation Lookaside Buffers", where it says (among other things):
When address translation is enabled, if a translation table entry meets all of the following requirements, then that translation table entry is permitted to be cached in a TLB or intermediate TLB caching structure at any time:
• The translation table entry itself does not generate a Translation fault, an Address size fault, or an Access flag fault.
• The translation table entry is not from a translation regime configured by an Exception level that is lower than the current Exception level.
Here "permitted to be cached in a TLB" also implies that the HW is allowed to fetch the translation tabl entry (which is what ARM call page table entries).
That's actually fairly clear all things considered, thanks for the education and for fishing out the relevant DDI section!
The PDF can be found at:
https://developer.arm.com/documentation/ddi0487/la/?lang=en
Mark.
On 2/11/25 05:33, Valentin Schneider wrote:
- It's wrong to assume that TLB entries are only populated for
addresses you access - thanks to speculative execution, you have to assume that the CPU might be populating random TLB entries all over the place.
Gotta love speculation. Now it is supposed to be limited to genuinely accessible data & code, right? Say theoretically we have a full TLBi as literally the last thing before doing the return-to-userspace, speculation should be limited to executing maybe bits of the return-from-userspace code?
In practice, it's mostly limited like that.
Architecturally, there are no promises from the CPU. It is within its rights to cache anything from the page tables at any time. If it's in the CR3 tree, it's fair game.
Furthermore, I would hope that once a CPU is executing in userspace, it's not going to populate the TLB with kernel address translations - AIUI the whole vulnerability mitigation debacle was about preventing this sort of thing.
Nope, unfortunately. There are two big exceptions to this. First, "implicit supervisor-mode accesses". There are structures for which the CPU gets a virtual address and accesses it even while userspace is running. The LDT and GDT are the most obvious examples, but there are some less ubiquitous ones like the buffers for PEBS events.
Second, remember that user versus supervisor is determined *BY* the page tables. Before Linear Address Space Separation (LASS), all virtual memory accesses walk the page tables, even userspace accesses to kernel addresses. The User/Supervisor bit is *in* the page tables, of course.
A userspace access to a kernel address results in a page walk and the CPU is completely free to cache all or part of that page walk. A Meltdown-style _speculative_ userspace access to kernel memory won't generate a fault either. It won't leak data like it used to, of course, but it can still walk the page tables. That's one reason LASS is needed.
On 11/02/25 06:22, Dave Hansen wrote:
On 2/11/25 05:33, Valentin Schneider wrote:
- It's wrong to assume that TLB entries are only populated for
addresses you access - thanks to speculative execution, you have to assume that the CPU might be populating random TLB entries all over the place.
Gotta love speculation. Now it is supposed to be limited to genuinely accessible data & code, right? Say theoretically we have a full TLBi as literally the last thing before doing the return-to-userspace, speculation should be limited to executing maybe bits of the return-from-userspace code?
In practice, it's mostly limited like that.
Architecturally, there are no promises from the CPU. It is within its rights to cache anything from the page tables at any time. If it's in the CR3 tree, it's fair game.
Furthermore, I would hope that once a CPU is executing in userspace, it's not going to populate the TLB with kernel address translations - AIUI the whole vulnerability mitigation debacle was about preventing this sort of thing.
Nope, unfortunately. There are two big exceptions to this. First, "implicit supervisor-mode accesses". There are structures for which the CPU gets a virtual address and accesses it even while userspace is running. The LDT and GDT are the most obvious examples, but there are some less ubiquitous ones like the buffers for PEBS events.
Second, remember that user versus supervisor is determined *BY* the page tables. Before Linear Address Space Separation (LASS), all virtual memory accesses walk the page tables, even userspace accesses to kernel addresses. The User/Supervisor bit is *in* the page tables, of course.
A userspace access to a kernel address results in a page walk and the CPU is completely free to cache all or part of that page walk. A Meltdown-style _speculative_ userspace access to kernel memory won't generate a fault either. It won't leak data like it used to, of course, but it can still walk the page tables. That's one reason LASS is needed.
Bummer, now I have at least two architectures proving me wrong :-) Thank you as well for the education, I really appreciate it.
On 11/02/25 06:22, Dave Hansen wrote:
On 2/11/25 05:33, Valentin Schneider wrote:
- It's wrong to assume that TLB entries are only populated for
addresses you access - thanks to speculative execution, you have to assume that the CPU might be populating random TLB entries all over the place.
Gotta love speculation. Now it is supposed to be limited to genuinely accessible data & code, right? Say theoretically we have a full TLBi as literally the last thing before doing the return-to-userspace, speculation should be limited to executing maybe bits of the return-from-userspace code?
In practice, it's mostly limited like that.
Architecturally, there are no promises from the CPU. It is within its rights to cache anything from the page tables at any time. If it's in the CR3 tree, it's fair game.
So what if the VMEMMAP range *isn't* in the CR3 tree when a CPU is executing in userspace?
AIUI that's the case with kPTI - the remaining kernel pages should mostly be .entry.text and cpu_entry_area, at least for x86.
It sounds like it wouldn't do much for arm64 though, if with CnP a CPU executing in userspace and with the user/trampoline page table installed can still use TLB entries of another CPU executing in kernelspace with the kernel page table installed.
On 2/18/25 14:40, Valentin Schneider wrote:
In practice, it's mostly limited like that.
Architecturally, there are no promises from the CPU. It is within its rights to cache anything from the page tables at any time. If it's in the CR3 tree, it's fair game.
So what if the VMEMMAP range *isn't* in the CR3 tree when a CPU is executing in userspace?
AIUI that's the case with kPTI - the remaining kernel pages should mostly be .entry.text and cpu_entry_area, at least for x86.
Having part of VMEMMAP not in the CR3 tree should be harmless while running userspace. VMEMMAP is a purely software structure; the hardware doesn't do implicit supervisor accesses to it. It's also not needed in super early entry.
Maybe I missed part of the discussion though. Is VMEMMAP your only concern? I would have guessed that the more generic vmalloc() functionality would be harder to pin down.
On 18/02/25 16:39, Dave Hansen wrote:
On 2/18/25 14:40, Valentin Schneider wrote:
In practice, it's mostly limited like that.
Architecturally, there are no promises from the CPU. It is within its rights to cache anything from the page tables at any time. If it's in the CR3 tree, it's fair game.
So what if the VMEMMAP range *isn't* in the CR3 tree when a CPU is executing in userspace?
AIUI that's the case with kPTI - the remaining kernel pages should mostly be .entry.text and cpu_entry_area, at least for x86.
Having part of VMEMMAP not in the CR3 tree should be harmless while running userspace. VMEMMAP is a purely software structure; the hardware doesn't do implicit supervisor accesses to it. It's also not needed in super early entry.
Maybe I missed part of the discussion though. Is VMEMMAP your only concern? I would have guessed that the more generic vmalloc() functionality would be harder to pin down.
Urgh, that'll teach me to send emails that late - I did indeed mean the vmalloc() range, not at all VMEMMAP. IIUC *neither* are present in the user kPTI page table and AFAICT the page table swap is done before the actual vmap'd stack (CONFIG_VMAP_STACK=y) gets used.
On 2/19/25 07:13, Valentin Schneider wrote:
Maybe I missed part of the discussion though. Is VMEMMAP your only concern? I would have guessed that the more generic vmalloc() functionality would be harder to pin down.
Urgh, that'll teach me to send emails that late - I did indeed mean the vmalloc() range, not at all VMEMMAP. IIUC *neither* are present in the user kPTI page table and AFAICT the page table swap is done before the actual vmap'd stack (CONFIG_VMAP_STACK=y) gets used.
OK, so rewriting your question... ;)
So what if the vmalloc() range *isn't* in the CR3 tree when a CPU is executing in userspace?
The LDT and maybe the PEBS buffers are the only implicit supervisor accesses to vmalloc()'d memory that I can think of. But those are both handled specially and shouldn't ever get zapped while in use. The LDT replacement has its own IPIs separate from TLB flushing.
But I'm actually not all that worried about accesses while actually running userspace. It's that "danger zone" in the kernel between entry and when the TLB might have dangerous garbage in it.
BTW, I hope this whole thing is turned off on 32-bit. There, we can actually take and handle faults on the vmalloc() area. If you get one of those faults in your "danger zone", it'll start running page fault code which will branch out to god-knows-where and certainly isn't noinstr.
On 19/02/25 12:25, Dave Hansen wrote:
On 2/19/25 07:13, Valentin Schneider wrote:
Maybe I missed part of the discussion though. Is VMEMMAP your only concern? I would have guessed that the more generic vmalloc() functionality would be harder to pin down.
Urgh, that'll teach me to send emails that late - I did indeed mean the vmalloc() range, not at all VMEMMAP. IIUC *neither* are present in the user kPTI page table and AFAICT the page table swap is done before the actual vmap'd stack (CONFIG_VMAP_STACK=y) gets used.
OK, so rewriting your question... ;)
So what if the vmalloc() range *isn't* in the CR3 tree when a CPU is executing in userspace?
The LDT and maybe the PEBS buffers are the only implicit supervisor accesses to vmalloc()'d memory that I can think of. But those are both handled specially and shouldn't ever get zapped while in use. The LDT replacement has its own IPIs separate from TLB flushing.
But I'm actually not all that worried about accesses while actually running userspace. It's that "danger zone" in the kernel between entry and when the TLB might have dangerous garbage in it.
So say we have kPTI, thus no vmalloc() mapped in CR3 when running userspace, and do a full TLB flush right before switching to userspace - could the TLB still end up with vmalloc()-range-related entries when we're back in the kernel and going through the danger zone?
BTW, I hope this whole thing is turned off on 32-bit. There, we can actually take and handle faults on the vmalloc() area. If you get one of those faults in your "danger zone", it'll start running page fault code which will branch out to god-knows-where and certainly isn't noinstr.
Sounds... Fun. Thanks for pointing out the landmines.
On 2/20/25 09:10, Valentin Schneider wrote:
The LDT and maybe the PEBS buffers are the only implicit supervisor accesses to vmalloc()'d memory that I can think of. But those are both handled specially and shouldn't ever get zapped while in use. The LDT replacement has its own IPIs separate from TLB flushing.
But I'm actually not all that worried about accesses while actually running userspace. It's that "danger zone" in the kernel between entry and when the TLB might have dangerous garbage in it.
So say we have kPTI, thus no vmalloc() mapped in CR3 when running userspace, and do a full TLB flush right before switching to userspace - could the TLB still end up with vmalloc()-range-related entries when we're back in the kernel and going through the danger zone?
Yes, because the danger zone includes the switch back to the kernel CR3 with vmalloc() fully mapped. All bets are off about what's in the TLB the moment that CR3 write occurs.
Actually, you could probably use that.
If a mapping is in the PTI user page table, you can't defer the flushes for it. Basically the same rule for text poking in the danger zone.
If there's a deferred flush pending, make sure that all of the SWITCH_TO_KERNEL_CR3's fully flush the TLB. You'd need something similar to user_pcid_flush_mask.
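For illustration, here is a minimal sketch of that idea, assuming a hypothetical per-CPU "deferred kernel flush pending" flag. The names below are made up and the real SWITCH_TO_KERNEL_CR3 logic lives in entry assembly, so treat this as conceptual C rather than the actual implementation:

/*
 * Conceptual sketch only, with made-up names (kernel_tlb_flush_deferred,
 * note_deferred_kernel_flush, build_kernel_cr3). The real user->kernel
 * CR3 switch is assembly in the PTI entry code; the point is just the
 * shape of the logic described above: if a kernel-range flush was
 * deferred for this CPU, don't let the CR3 write preserve cached
 * translations.
 */
#include <linux/percpu.h>
#include <asm/processor-flags.h>

static DEFINE_PER_CPU(bool, kernel_tlb_flush_deferred);

/* Housekeeping side: record the deferral instead of IPIing the CPU. */
static void note_deferred_kernel_flush(int cpu)
{
        per_cpu(kernel_tlb_flush_deferred, cpu) = true;
}

/* Entry side: conceptual equivalent of building the kernel CR3 value. */
static unsigned long build_kernel_cr3(unsigned long cr3)
{
        if (this_cpu_read(kernel_tlb_flush_deferred)) {
                /* Deferred flush pending: let the CR3 write wipe this
                 * PCID's non-global entries instead of preserving them. */
                this_cpu_write(kernel_tlb_flush_deferred, false);
        } else {
                /* No deferral: keep the usual PCID-preserving switch
                 * (real code only sets this bit when PCIDs are in use). */
                cr3 |= X86_CR3_PCID_NOFLUSH;
        }

        return cr3;
}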
But, honestly, I'm still not sure this is worth all the trouble. If folks want to avoid IPIs for TLB flushes, there are hardware features that *DO* that. Just get new hardware instead of adding this complicated pile of software that we have to maintain forever. In 10 years, we'll still have this software *and* 95% of our hardware will have the hardware feature too.
On 20/02/25 09:38, Dave Hansen wrote:
On 2/20/25 09:10, Valentin Schneider wrote:
The LDT and maybe the PEBS buffers are the only implicit supervisor accesses to vmalloc()'d memory that I can think of. But those are both handled specially and shouldn't ever get zapped while in use. The LDT replacement has its own IPIs separate from TLB flushing.
But I'm actually not all that worried about accesses while actually running userspace. It's that "danger zone" in the kernel between entry and when the TLB might have dangerous garbage in it.
So say we have kPTI, thus no vmalloc() mapped in CR3 when running userspace, and do a full TLB flush right before switching to userspace - could the TLB still end up with vmalloc()-range-related entries when we're back in the kernel and going through the danger zone?
Yes, because the danger zone includes the switch back to the kernel CR3 with vmalloc() fully mapped. All bets are off about what's in the TLB the moment that CR3 write occurs.
Actually, you could probably use that.
If a mapping is in the PTI user page table, you can't defer the flushes for it. Basically the same rule for text poking in the danger zone.
If there's a deferred flush pending, make sure that all of the SWITCH_TO_KERNEL_CR3's fully flush the TLB. You'd need something similar to user_pcid_flush_mask.
Right, that's what I (roughly) had in mind...
But, honestly, I'm still not sure this is worth all the trouble. If folks want to avoid IPIs for TLB flushes, there are hardware features that *DO* that. Just get new hardware instead of adding this complicated pile of software that we have to maintain forever. In 10 years, we'll still have this software *and* 95% of our hardware will have the hardware feature too.
... But yeah, it pretty much circumvents arch_context_tracking_work, or at the very least adds an early(er) flushing of the context tracking work... Urgh.
Thank you for grounding my wild ideas in reality. I'll try to think some more and see if I see any other way out (other than "buy hardware that does what you want and ditch the one that doesn't").
On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
On 14/01/25 19:16, Jann Horn wrote:
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider vschneid@redhat.com wrote:
vunmap()s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as they are hit by the flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
Deferral vs early entry danger zone
This requires a guarantee that nothing in the vmalloc range can be vunmap'd and then accessed in early entry code.
In other words, it needs a guarantee that no vmalloc allocations that have been created in the vmalloc region while the CPU was idle can then be accessed during early entry, right?
I'm not sure if that would be a problem (not an mm expert, please do correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't deferred anyway.
So after vmapping something, I wouldn't expect isolated CPUs to have invalid TLB entries for the newly vmapped page.
However, upon vunmap'ing something, the TLB flush is deferred, and thus stale TLB entries can and will remain on isolated CPUs, up until they execute the deferred flush themselves (IOW for the entire duration of the "danger zone").
Does that make sense?
Probably I am missing something and need to have a look at your patches, but how do you guarantee that no one maps the same area that you defer for TLB flushing?
As noted by Jann, we already defer TLB flushing by batching freed areas until a certain threshold, and just after we cross it we do a flush.
-- Uladzislau Rezki
On 17/01/25 17:11, Uladzislau Rezki wrote:
On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
On 14/01/25 19:16, Jann Horn wrote:
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider vschneid@redhat.com wrote:
vunmap()s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as they are hit by the flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
Deferral vs early entry danger zone
This requires a guarantee that nothing in the vmalloc range can be vunmap'd and then accessed in early entry code.
In other words, it needs a guarantee that no vmalloc allocations that have been created in the vmalloc region while the CPU was idle can then be accessed during early entry, right?
I'm not sure if that would be a problem (not an mm expert, please do correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't deferred anyway.
So after vmapping something, I wouldn't expect isolated CPUs to have invalid TLB entries for the newly vmapped page.
However, upon vunmap'ing something, the TLB flush is deferred, and thus stale TLB entries can and will remain on isolated CPUs, up until they execute the deferred flush themselves (IOW for the entire duration of the "danger zone").
Does that make sense?
Probably I am missing something and need to have a look at your patches, but how do you guarantee that no one maps the same area that you defer for TLB flushing?
That's the cool part: I don't :')
For deferring instruction patching IPIs, I (well Josh really) managed to get instrumentation to back me up and catch any problematic area.
I looked into getting something similar for vmalloc region access in .noinstr code, but I didn't get anywhere. I even tried using emulated watchpoints on QEMU to watch the whole vmalloc range, but that went about as well as you could expect.
That left me with staring at code. AFAICT the only vmap'd thing that is accessed during early entry is the task stack (CONFIG_VMAP_STACK), which itself cannot be freed until the task exits - thus can't be subject to invalidation when a task is entering kernelspace.
If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).
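To make the intended mechanism more concrete, here is a rough sketch of how the sending side could coalesce the deferral. ct_set_cpu_work() and CT_WORK_TLB_FLUSH are assumed names standing in for the context-tracking work infrastructure, not the exact code of this series:

/*
 * Hedged sketch only: ct_set_cpu_work() / CT_WORK_TLB_FLUSH are assumed
 * names. CPUs that are deep in userspace get a pending-work bit instead
 * of an IPI; the deferred flushes are coalesced into one full kernel TLB
 * flush executed on that CPU's next kernel entry. Everyone else gets the
 * usual synchronous IPI.
 */
#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/smp.h>
#include <asm/tlbflush.h>

static void ipi_flush_kernel_tlb(void *unused)
{
        __flush_tlb_all();      /* ranges are coalesced, so flush everything */
}

static void flush_tlb_kernel_range_deferrable(void)
{
        cpumask_var_t ipi_mask;
        int cpu;

        if (!zalloc_cpumask_var(&ipi_mask, GFP_ATOMIC)) {
                on_each_cpu(ipi_flush_kernel_tlb, NULL, 1);
                return;
        }

        for_each_online_cpu(cpu) {
                /*
                 * Hypothetical helper: atomically tags the CPU's context
                 * tracking state with pending work iff it is currently in
                 * userspace; returns false if the CPU must be IPI'd now.
                 */
                if (!ct_set_cpu_work(cpu, CT_WORK_TLB_FLUSH))
                        cpumask_set_cpu(cpu, ipi_mask);
        }

        /* Only the CPUs we could not defer get the synchronous IPI. */
        on_each_cpu_mask(ipi_mask, ipi_flush_kernel_tlb, NULL, 1);
        free_cpumask_var(ipi_mask);
}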
As noted by Jann, we already defer TLB flushing by batching freed areas until a certain threshold, and just after we cross it we do a flush.
-- Uladzislau Rezki
On Fri, Jan 17, 2025 at 06:00:30PM +0100, Valentin Schneider wrote:
On 17/01/25 17:11, Uladzislau Rezki wrote:
On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
On 14/01/25 19:16, Jann Horn wrote:
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider vschneid@redhat.com wrote:
vunmap()s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as they are hit by the flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
Deferral vs early entry danger zone
This requires a guarantee that nothing in the vmalloc range can be vunmap'd and then accessed in early entry code.
In other words, it needs a guarantee that no vmalloc allocations that have been created in the vmalloc region while the CPU was idle can then be accessed during early entry, right?
I'm not sure if that would be a problem (not an mm expert, please do correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't deferred anyway.
So after vmapping something, I wouldn't expect isolated CPUs to have invalid TLB entries for the newly vmapped page.
However, upon vunmap'ing something, the TLB flush is deferred, and thus stale TLB entries can and will remain on isolated CPUs, up until they execute the deferred flush themselves (IOW for the entire duration of the "danger zone").
Does that make sense?
Probably I am missing something and need to have a look at your patches, but how do you guarantee that no one maps the same area that you defer for TLB flushing?
That's the cool part: I don't :')
Indeed, sounds unsafe :) Then we just should not free such areas.
For deferring instruction patching IPIs, I (well Josh really) managed to get instrumentation to back me up and catch any problematic area.
I looked into getting something similar for vmalloc region access in .noinstr code, but I didn't get anywhere. I even tried using emulated watchpoints on QEMU to watch the whole vmalloc range, but that went about as well as you could expect.
That left me with staring at code. AFAICT the only vmap'd thing that is accessed during early entry is the task stack (CONFIG_VMAP_STACK), which itself cannot be freed until the task exits - thus can't be subject to invalidation when a task is entering kernelspace.
If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).
As noted before, we defer flushing for vmalloc. We have a lazy threshold which can be exposed (if you need it) via sysfs for tuning, so we could add that.
-- Uladzislau Rezki
On 20/01/25 12:15, Uladzislau Rezki wrote:
On Fri, Jan 17, 2025 at 06:00:30PM +0100, Valentin Schneider wrote:
On 17/01/25 17:11, Uladzislau Rezki wrote:
On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
On 14/01/25 19:16, Jann Horn wrote:
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider vschneid@redhat.com wrote:
vunmap()s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as they are hit by the flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
Deferral vs early entry danger zone
This requires a guarantee that nothing in the vmalloc range can be vunmap'd and then accessed in early entry code.
In other words, it needs a guarantee that no vmalloc allocations that have been created in the vmalloc region while the CPU was idle can then be accessed during early entry, right?
I'm not sure if that would be a problem (not an mm expert, please do correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't deferred anyway.
So after vmapping something, I wouldn't expect isolated CPUs to have invalid TLB entries for the newly vmapped page.
However, upon vunmap'ing something, the TLB flush is deferred, and thus stale TLB entries can and will remain on isolated CPUs, up until they execute the deferred flush themselves (IOW for the entire duration of the "danger zone").
Does that make sense?
Probably I am missing something and need to have a look at your patches, but how do you guarantee that no one maps the same area that you defer for TLB flushing?
That's the cool part: I don't :')
Indeed, sounds unsafe :) Then we just should not free such areas.
For deferring instruction patching IPIs, I (well Josh really) managed to get instrumentation to back me up and catch any problematic area.
I looked into getting something similar for vmalloc region access in .noinstr code, but I didn't get anywhere. I even tried using emulated watchpoints on QEMU to watch the whole vmalloc range, but that went about as well as you could expect.
That left me with staring at code. AFAICT the only vmap'd thing that is accessed during early entry is the task stack (CONFIG_VMAP_STACK), which itself cannot be freed until the task exits - thus can't be subject to invalidation when a task is entering kernelspace.
If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).
As noted before, we defer flushing for vmalloc. We have a lazy threshold which can be exposed (if you need it) via sysfs for tuning, so we could add that.
In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a single userspace application that will never enter the kernel, unless forced to by some interference (e.g. IPI sent from a housekeeping CPU).
Increasing the lazy threshold would unfortunately only delay the interference - housekeeping CPUs are free to run whatever, and so they will eventually cause the lazy threshold to be hit and IPI all the CPUs, including the isolated/NOHZ_FULL ones.
I was thinking maybe we could subdivide the vmap space into two regions with their own thresholds, but a task may allocate/vmap stuff while on a HK CPU and be moved to an isolated CPU afterwards, and also I still don't have any strong guarantee about what accesses an isolated CPU can do in its early entry code :(
-- Uladzislau Rezki
As noted before, we defer flushing for vmalloc. We have a lazy threshold which can be exposed (if you need it) via sysfs for tuning, so we could add that.
In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a single userspace application that will never enter the kernel, unless forced to by some interference (e.g. IPI sent from a housekeeping CPU).
Increasing the lazy threshold would unfortunately only delay the interference - housekeeping CPUs are free to run whatever, and so they will eventually cause the lazy threshold to be hit and IPI all the CPUs, including the isolated/NOHZ_FULL ones.
Do you have any testing results for your workload? I mean, how much can we potentially allocate? Again, maybe it is enough to just accumulate the areas and offload them once per hour.
Apart from that, how critically does IPIing CPUs affect your workloads?
I was thinking maybe we could subdivide the vmap space into two regions with their own thresholds, but a task may allocate/vmap stuff while on a HK CPU and be moved to an isolated CPU afterwards, and also I still don't have any strong guarantee about what accesses an isolated CPU can do in its early entry code :(
I agree it is not worth playing with the vmap space in terms of splitting it.
Sorry for the late answer, and thank you!
-- Uladzislau Rezki
On 21/01/25 18:00, Uladzislau Rezki wrote:
As noted before, we defer flushing for vmalloc. We have a lazy threshold which can be exposed (if you need it) via sysfs for tuning, so we could add that.
In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a single userspace application that will never enter the kernel, unless forced to by some interference (e.g. IPI sent from a housekeeping CPU).
Increasing the lazy threshold would unfortunately only delay the interference - housekeeping CPUs are free to run whatever, and so they will eventually cause the lazy threshold to be hit and IPI all the CPUs, including the isolated/NOHZ_FULL ones.
Do you have any testing results for your workload? I mean, how much can we potentially allocate? Again, maybe it is enough to just accumulate the areas and offload them once per hour.
Potentially as much as you want... In our OpenShift environments, you can get any sort of container executing on the housekeeping CPUs, and they're free to do pretty much whatever they want. As per CPU isolation, however, they're not allowed/meant to disturb isolated CPUs.
Apart from that, how critically does IPIing CPUs affect your workloads?
If I'm being pedantic, a single IPI to an isolated CPU breaks the isolation. If we can't quiesce IPIs to isolated CPUs, then we can't guarantee that whatever is running on the isolated CPUs is actually isolated / shielded from third party interference.
On Fri, Jan 24, 2025 at 04:22:19PM +0100, Valentin Schneider wrote:
On 21/01/25 18:00, Uladzislau Rezki wrote:
As noted before, we defer flushing for vmalloc. We have a lazy threshold which can be exposed (if you need it) via sysfs for tuning, so we could add that.
In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a single userspace application that will never enter the kernel, unless forced to by some interference (e.g. IPI sent from a housekeeping CPU).
Increasing the lazy threshold would unfortunately only delay the interference - housekeeping CPUs are free to run whatever, and so they will eventually cause the lazy threshold to be hit and IPI all the CPUs, including the isolated/NOHZ_FULL ones.
Do you have any testing results for your workload? I mean, how much can we potentially allocate? Again, maybe it is enough to just accumulate the areas and offload them once per hour.
Potentially as much as you want... In our OpenShift environments, you can get any sort of container executing on the housekeeping CPUs, and they're free to do pretty much whatever they want. As per CPU isolation, however, they're not allowed/meant to disturb isolated CPUs.
Apart from that, how critically does IPIing CPUs affect your workloads?
If I'm being pedantic, a single IPI to an isolated CPU breaks the isolation. If we can't quiesce IPIs to isolated CPUs, then we can't guarantee that whatever is running on the isolated CPUs is actually isolated / shielded from third party interference.
I see. I thought you were fixing some specific issue. I do not see a straightforward way to remove such "distortion". Probably we could block the range which we defer for flushing, but that can also be problematic because of other constraints.
Thanks!
-- Uladzislau Rezki
With NO_HZ_IDLE, we get CONTEXT_TRACKING_IDLE, so we get these transitions:
ct_idle_enter()
  ct_kernel_exit()
    ct_state_inc_clear_work()

ct_idle_exit()
  ct_kernel_enter()
    ct_work_flush()
With just CONTEXT_TRACKING_IDLE, ct_state_inc_clear_work() is just ct_state_inc() and ct_work_flush() is a no-op. However, making them functional, as they are under CONTEXT_TRACKING_WORK, would allow NO_HZ_IDLE to leverage IPI deferral to keep idle CPUs idle longer.
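For illustration, a minimal sketch of the entry-side half, assuming work bits carried in the context-tracking state; the CT_WORK_* names and the exact plumbing are made up here, not taken from the patches:

/*
 * Hedged sketch: CT_WORK_SYNC_CORE / CT_WORK_TLB_FLUSH and the body below
 * are illustrative only. On kernel entry (or idle exit), work bits that
 * were set in lieu of IPIs while this CPU was in userspace/idle are
 * fetched-and-cleared, and the corresponding coalesced callbacks run
 * before anything that depends on them can execute.
 */
#include <asm/sync_core.h>
#include <asm/tlbflush.h>

static noinstr void ct_work_flush(unsigned long work)
{
        if (work & CT_WORK_SYNC_CORE)
                sync_core();            /* deferred text patching serialization */

        if (work & CT_WORK_TLB_FLUSH)
                __flush_tlb_all();      /* deferred kernel-range TLB flush */
}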
Having this enabled for NO_HZ_IDLE is a different argument than for having it for NO_HZ_FULL (power savings vs latency/performance), but the backing mechanism is identical.
Add a default-no option to enable IPI deferral with NO_HZ_IDLE.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/time/Kconfig | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 7e8106a0d981f..c7398fe5382a0 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -183,9 +183,23 @@ config CONTEXT_TRACKING_USER_FORCE
 
 config CONTEXT_TRACKING_WORK
         bool
-        depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
+        depends on HAVE_CONTEXT_TRACKING_WORK && (CONTEXT_TRACKING_USER || CONTEXT_TRACKING_WORK_IDLE)
         default y
 
+config CONTEXT_TRACKING_WORK_IDLE
+        bool
+        depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_IDLE && !CONTEXT_TRACKING_USER
+        default n
+        help
+          This option enables deferral of some IPIs when they are targeted at CPUs
+          that are idle. This can help keep CPUs idle longer, but induces some
+          extra overhead to idle <-> kernel transitions and to IPI sending.
+
+          Say Y if the power improvements are worth more to you than the added
+          overheads.
+
+          Say N otherwise.
+
 config NO_HZ
         bool "Old Idle dynticks config"
         help