Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a pure-userspace application get regularly interrupted by IPIs sent from housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs leading to various on_each_cpu() calls, e.g.:
  64359.052209596    NetworkManager   0   1405   smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
        smp_call_function_many_cond+0x1
        smp_call_function+0x39
        on_each_cpu+0x2a
        flush_tlb_kernel_range+0x7b
        __purge_vmap_area_lazy+0x70
        _vm_unmap_aliases.part.42+0xdf
        change_page_attr_set_clr+0x16a
        set_memory_ro+0x26
        bpf_int_jit_compile+0x2f9
        bpf_prog_select_runtime+0xc6
        bpf_prepare_filter+0x523
        sk_attach_filter+0x13
        sock_setsockopt+0x92c
        __sys_setsockopt+0x16a
        __x64_sys_setsockopt+0x20
        do_syscall_64+0x87
        entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the thought that while we cannot remove NOHZ_FULL CPUs from the list of CPUs targeted by these IPIs, they may not have to execute the callbacks immediately. Anything that only affects kernelspace can wait until the next user->kernel transition, provided it can be executed "early enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB invalidation deferral to that [2], and I picked it up from there.
Deferral approach
=================
Storing each and every callback, like a secondary call_single_queue, turned out to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in userspace for as long as possible - no signal of any form would be sent when deferring an IPI. This means that any form of queuing for deferred callbacks would end up as a convoluted memory leak.
Deferred IPIs must thus be coalesced, which this series achieves by assigning IPIs a "type" and having a mapping of IPI type to callback, leveraged upon kernel entry.
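To make the coalescing scheme a bit more concrete, here is a rough sketch of the idea. The identifiers below are made up for illustration only; the series actually implements this via the context_tracking work infrastructure introduced in patches 10-14:

  /*
   * Illustrative sketch only - not the series' actual interfaces.
   * Each deferrable IPI type maps to one bit in a per-CPU mask and one
   * argument-less callback; kernel entry flushes whatever bits are set.
   */
  enum deferred_ipi_type {
          DEFER_SYNC_CORE,        /* e.g. text_poke sync_core() */
          DEFER_TLB_FLUSH,        /* e.g. kernel range TLB flush */
          NR_DEFERRED_IPIS,
  };

  static void (*deferred_ipi_cb[NR_DEFERRED_IPIS])(void) = {
          /* one argument-less callback per IPI type */
  };

  static DEFINE_PER_CPU(unsigned long, deferred_ipi_mask);

  /* Housekeeping side: record the IPI type instead of sending it now */
  static void defer_ipi(int cpu, enum deferred_ipi_type type)
  {
          set_bit(type, per_cpu_ptr(&deferred_ipi_mask, cpu));
  }

  /* Isolated CPU side: called early in the user->kernel transition */
  static void flush_deferred_ipis(void)
  {
          unsigned long work = xchg(this_cpu_ptr(&deferred_ipi_mask), 0);
          int type;

          for_each_set_bit(type, &work, NR_DEFERRED_IPIS)
                  deferred_ipi_cb[type]();
  }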
What about IPIs whose callbacks take a parameter, you may ask?
Peter suggested during OSPM23 [3] that since on_each_cpu() targets housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or housekeeping-CPU-local state to "reconstruct" the data that would have been sent via the IPI.
This series does not affect any IPI callback that requires an argument, but the approach would remain the same (one coalescable callback executed on kernel entry).
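As an illustration of that reconstruction idea: a deferred kernel TLB flush doesn't need the range that would have been carried by the IPI, since the entry-time callback can over-approximate with an argument-free full flush. This is a minimal sketch, not the series' actual callback:

  static void defer_tlb_flush_cb(void)
  {
          /*
           * The range the housekeeping CPU wanted flushed is unknown here,
           * so over-approximate with a full kernel TLB flush.
           */
          __flush_tlb_all();
  }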
Kernel entry vs execution of the deferred operation
===================================================
This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].
There is a non-zero amount of code executed upon kernel entry before the deferred operation itself can be run (i.e. before we get into context_tracking.c proper):
  idtentry_func_foo()                <--- we're in the kernel
    irqentry_enter()
      enter_from_user_mode()
        __ct_user_exit()
          ct_kernel_enter_state()
            ct_work_flush()          <--- deferred operation is executed here
This means one must take extra care about what can happen in the early entry code, and ensure that <bad things> cannot happen. For instance, we really don't want to hit instructions that have been modified by a remote text_poke() while we're on our way to execute a deferred sync_core(). The patches doing the actual deferral have more detail on this.
Patches
=======
o Patches 1-3 are standalone cleanups.
o Patches 4-5 add an RCU testing feature.
o Patches 6-8 add a new type of jump label for static keys that will not have
  their IPI deferred.
o Patch 9 adds objtool verification of static keys vs their text_poke IPI
  deferral.
o Patches 10-14 add the actual IPI deferrals.
o Patch 15 is a freebie to enable the deferral feature for NO_HZ_IDLE
Patches are also available at: https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v3
RFC status
==========
Things I'd like to get comments on and/or that are a bit WIPish; they're called out in the individual changelogs:
o "forceful" jump label naming which I don't particularly like
o objtool usage of 'offset_of(static_key.type)' and JUMP_TYPE_FORCEFUL. I've hardcoded them but it could do with being shoved in a kernel header objtool can include directly
o The noinstr variant of __flush_tlb_all() doesn't have a paravirt variant, does it need one?
Testing
=======
Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs. RHEL9 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
$ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
	     -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
	     -e "ipi_send_cpu"     -f "cpu & CPUS{$ISOL_CPUS}" \
	     rteval --onlyload --loads-cpulist=$HK_CPUS \
	     --hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference (with a bit of fuzz at the start/end of the workload when spawning the processes). All tests were done with a duration of 1hr.
v6.12-rc4

# This is the actual IPI count
$ trace-cmd report trace-base.dat | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
   1782 callback=generic_smp_call_function_single_interrupt+0x0
     73 callback=0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
  22048 func=tlb_remove_table_smp_sync
  16536 func=do_sync_core
   2262 func=do_flush_tlb_all
    182 func=do_kernel_range_flush
    144 func=rcu_exp_handler
     60 func=sched_ttwu_pending
v6.12-rc4 + patches:

# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
   1168 callback=generic_smp_call_function_single_interrupt+0x0
     74 callback=0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
  23686 func=tlb_remove_table_smp_sync
    192 func=rcu_exp_handler
     65 func=sched_ttwu_pending
Interestingly, tlb_remove_table_smp_sync() started showing up on this machine, even though it didn't during testing for v2 on the very same machine. Yair had a series addressing this [5] which, per these results, would be worth revisiting.
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for his guidance regarding objtool and hinting at the .data..ro_after_init section.
o All of the folks who attended various talks about this and provided precious feedback.
Links
=====
[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/
Revisions
=========
RFCv2 -> RFCv3
++++++++++++++
o Rebased onto v6.12-rc7
o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral
o Overall changelog & comments cleanup
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
  existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an rcutorture case for a low-size counter (Paul)
o Fixed flush_tlb_kernel_range_deferrable() definition
Valentin Schneider (15):
  objtool: Make validate_call() recognize indirect calls to pv_ops[]
  objtool: Flesh out warning related to pv_ops[] calls
  sched/clock: Make sched_clock_running __ro_after_init
  rcu: Add a small-width RCU watching counter debug option
  rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
  jump_label: Add forceful jump label type
  x86/speculation/mds: Make mds_idle_clear forceful
  sched/clock, x86: Make __sched_clock_stable forceful
  objtool: Warn about non __ro_after_init static key usage in .noinstr
  x86/alternatives: Record text_poke's of JUMP_TYPE_FORCEFUL labels
  context-tracking: Introduce work deferral infrastructure
  context_tracking,x86: Defer kernel text patching IPIs
  context_tracking,x86: Add infrastructure to defer kernel TLBI
  x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs
  context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE
 arch/Kconfig                                  |  9 +++
 arch/x86/Kconfig                              |  1 +
 arch/x86/include/asm/context_tracking_work.h  | 20 +++++++
 arch/x86/include/asm/special_insns.h          |  1 +
 arch/x86/include/asm/text-patching.h          | 13 ++++-
 arch/x86/include/asm/tlbflush.h               | 17 +++++-
 arch/x86/kernel/alternative.c                 | 49 ++++++++++++----
 arch/x86/kernel/cpu/bugs.c                    |  2 +-
 arch/x86/kernel/cpu/common.c                  |  6 +-
 arch/x86/kernel/jump_label.c                  |  7 ++-
 arch/x86/kernel/kprobes/core.c                |  4 +-
 arch/x86/kernel/kprobes/opt.c                 |  4 +-
 arch/x86/kernel/module.c                      |  2 +-
 arch/x86/mm/tlb.c                             | 49 ++++++++++++--
 include/linux/context_tracking.h              | 21 +++++++
 include/linux/context_tracking_state.h        | 54 ++++++++++++++---
 include/linux/context_tracking_work.h         | 28 +++++++++
 include/linux/jump_label.h                    | 26 ++++++---
 kernel/context_tracking.c                     | 46 ++++++++++++++-
 kernel/rcu/Kconfig.debug                      | 14 +++++
 kernel/sched/clock.c                          |  4 +-
 kernel/time/Kconfig                           | 19 ++++++
 mm/vmalloc.c                                  | 35 +++++++++--
 tools/objtool/Documentation/objtool.txt       | 13 +++++
 tools/objtool/check.c                         | 58 ++++++++++++++---
 tools/objtool/include/objtool/check.h         |  1 +
 tools/objtool/include/objtool/special.h       |  2 +
 tools/objtool/special.c                       |  3 +
 .../selftests/rcutorture/configs/rcu/TREE04   |  1 +
 29 files changed, 450 insertions(+), 59 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h
--
2.43.0
call_dest_name() does not get passed the file pointer of validate_call(), which means its invocation of insn_reloc() will always return NULL. Make it take a file pointer.
While at it, make sure call_dest_name() uses arch_dest_reloc_offset(), otherwise it gets the pv_ops[] offset wrong.
Fabricating an intentional warning shows the change; previously:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section
now:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[1]() leaves .noinstr.text section
Signed-off-by: Valentin Schneider vschneid@redhat.com
---
 tools/objtool/check.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 6604f5d038aad..5f1d0f95fc04b 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3448,7 +3448,7 @@ static inline bool func_uaccess_safe(struct symbol *func)
 	return false;
 }

-static inline const char *call_dest_name(struct instruction *insn)
+static inline const char *call_dest_name(struct objtool_file *file, struct instruction *insn)
 {
 	static char pvname[19];
 	struct reloc *reloc;
@@ -3457,9 +3457,9 @@ static inline const char *call_dest_name(struct instruction *insn)
 	if (insn_call_dest(insn))
 		return insn_call_dest(insn)->name;

-	reloc = insn_reloc(NULL, insn);
+	reloc = insn_reloc(file, insn);
 	if (reloc && !strcmp(reloc->sym->name, "pv_ops")) {
-		idx = (reloc_addend(reloc) / sizeof(void *));
+		idx = (arch_dest_reloc_offset(reloc_addend(reloc)) / sizeof(void *));
 		snprintf(pvname, sizeof(pvname), "pv_ops[%d]", idx);
 		return pvname;
 	}
@@ -3538,17 +3538,20 @@ static int validate_call(struct objtool_file *file,
 {
 	if (state->noinstr && state->instr <= 0 &&
 	    !noinstr_call_dest(file, insn, insn_call_dest(insn))) {
-		WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(insn));
+		WARN_INSN(insn, "call to %s() leaves .noinstr.text section",
+			  call_dest_name(file, insn));
 		return 1;
 	}

 	if (state->uaccess && !func_uaccess_safe(insn_call_dest(insn))) {
-		WARN_INSN(insn, "call to %s() with UACCESS enabled", call_dest_name(insn));
+		WARN_INSN(insn, "call to %s() with UACCESS enabled",
+			  call_dest_name(file, insn));
 		return 1;
 	}

 	if (state->df) {
-		WARN_INSN(insn, "call to %s() with DF set", call_dest_name(insn));
+		WARN_INSN(insn, "call to %s() with DF set",
+			  call_dest_name(file, insn));
 		return 1;
 	}
On Tue, Nov 19, 2024 at 04:34:48PM +0100, Valentin Schneider wrote:
call_dest_name() does not get passed the file pointer of validate_call(), which means its invocation of insn_reloc() will always return NULL. Make it take a file pointer.
While at it, make sure call_dest_name() uses arch_dest_reloc_offset(), otherwise it gets the pv_ops[] offset wrong.
Fabricating an intentional warning shows the change; previously:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section
now:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[1]() leaves .noinstr.text section
Signed-off-by: Valentin Schneider vschneid@redhat.com
Acked-by: Josh Poimboeuf jpoimboe@kernel.org
I had to look into objtool itself to understand what this warning was about; make it more explicit.
Signed-off-by: Valentin Schneider vschneid@redhat.com
---
 tools/objtool/check.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 5f1d0f95fc04b..00e25492f5065 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3486,7 +3486,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)

 	list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) {
 		if (!target->sec->noinstr) {
-			WARN("pv_ops[%d]: %s", idx, target->name);
+			WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name);
 			file->pv_ops[idx].clean = false;
 		}
 	}
On Tue, Nov 19, 2024 at 04:34:49PM +0100, Valentin Schneider wrote:
I had to look into objtool itself to understand what this warning was about; make it more explicit.
Signed-off-by: Valentin Schneider vschneid@redhat.com
Acked-by: Josh Poimboeuf jpoimboe@kernel.org
sched_clock_running is only ever enabled in the __init functions sched_clock_init() and sched_clock_init_late(), and is never disabled. Mark it __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com
---
 kernel/sched/clock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index a09655b481402..200e5568b9894 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -66,7 +66,7 @@ notrace unsigned long long __weak sched_clock(void)
 }
 EXPORT_SYMBOL_GPL(sched_clock);

-static DEFINE_STATIC_KEY_FALSE(sched_clock_running);
+static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);

 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
 /*
A later commit will reduce the size of the RCU watching counter to free up some bits for another purpose. Paul suggested adding a config option to test the extreme case where the counter is reduced to its minimum usable width for rcutorture to poke at, so do that.
Make it only configurable under RCU_EXPERT. While at it, add a comment to explain the layout of context_tracking->state.
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney paulmck@kernel.org
Signed-off-by: Valentin Schneider vschneid@redhat.com
---
 include/linux/context_tracking_state.h | 44 ++++++++++++++++++++++----
 kernel/rcu/Kconfig.debug               | 14 ++++++++
 2 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 7b8433d5a8efe..0b81248aa03e2 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -18,12 +18,6 @@ enum ctx_state {
 	CT_STATE_MAX		= 4,
 };

-/* Odd value for watching, else even. */
-#define CT_RCU_WATCHING CT_STATE_MAX
-
-#define CT_STATE_MASK (CT_STATE_MAX - 1)
-#define CT_RCU_WATCHING_MASK (~CT_STATE_MASK)
-
 struct context_tracking {
 #ifdef CONFIG_CONTEXT_TRACKING_USER
 	/*
@@ -44,9 +38,45 @@ struct context_tracking {
 #endif
 };

+/*
+ * We cram two different things within the same atomic variable:
+ *
+ *                  CT_RCU_WATCHING_START  CT_STATE_START
+ *                             |                |
+ *                             v                v
+ *     MSB [ RCU watching counter ][ context_state ] LSB
+ *         ^                       ^
+ *         |                       |
+ * CT_RCU_WATCHING_END        CT_STATE_END
+ *
+ * Bits are used from the LSB upwards, so unused bits (if any) will always be in
+ * upper bits of the variable.
+ */
 #ifdef CONFIG_CONTEXT_TRACKING
+#define CT_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
+
+#define CT_STATE_WIDTH bits_per(CT_STATE_MAX - 1)
+#define CT_STATE_START 0
+#define CT_STATE_END   (CT_STATE_START + CT_STATE_WIDTH - 1)
+
+#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH)
+#define CT_RCU_WATCHING_WIDTH     (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH)
+#define CT_RCU_WATCHING_START     (CT_STATE_END + 1)
+#define CT_RCU_WATCHING_END       (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1)
+#define CT_RCU_WATCHING           BIT(CT_RCU_WATCHING_START)
+
+#define CT_STATE_MASK        GENMASK(CT_STATE_END,        CT_STATE_START)
+#define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START)
+
+#define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH)
+
+static_assert(CT_STATE_WIDTH        +
+	      CT_RCU_WATCHING_WIDTH +
+	      CT_UNUSED_WIDTH       ==
+	      CT_SIZE);
+
 DECLARE_PER_CPU(struct context_tracking, context_tracking);
-#endif
+#endif	/* CONFIG_CONTEXT_TRACKING */

 #ifdef CONFIG_CONTEXT_TRACKING_USER
 static __always_inline int __ct_state(void)
diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index 9b0b52e1836fa..8dc505d841f8d 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -168,4 +168,18 @@ config RCU_STRICT_GRACE_PERIOD
 	  when looking for certain types of RCU usage bugs, for example,
 	  too-short RCU read-side critical sections.

+
+config RCU_DYNTICKS_TORTURE
+	bool "Minimize RCU dynticks counter size"
+	depends on RCU_EXPERT
+	default n
+	help
+	  This option controls the width of the dynticks counter.
+
+	  Lower values will make overflows more frequent, which will increase
+	  the likelihood of extending grace-periods.  This option sets the width
+	  to its minimum usable value.
+
+	  This has no value for production and is only for testing.
+
 endmenu # "RCU Debugging"
On Tue, Nov 19, 2024 at 04:34:51PM +0100, Valentin Schneider wrote:
A later commit will reduce the size of the RCU watching counter to free up some bits for another purpose. Paul suggested adding a config option to test the extreme case where the counter is reduced to its minimum usable width for rcutorture to poke at, so do that.
Make it only configurable under RCU_EXPERT. While at it, add a comment to explain the layout of context_tracking->state.
Note that this means it will get selected by allyesconfig and the like, is that desired?
If no, depends on !COMPILE_TEST can help here.
On 20/11/24 15:50, Peter Zijlstra wrote:
On Tue, Nov 19, 2024 at 04:34:51PM +0100, Valentin Schneider wrote:
A later commit will reduce the size of the RCU watching counter to free up some bits for another purpose. Paul suggested adding a config option to test the extreme case where the counter is reduced to its minimum usable width for rcutorture to poke at, so do that.
Make it only configurable under RCU_EXPERT. While at it, add a comment to explain the layout of context_tracking->state.
Note that this means it will get selected by allyesconfig and the like, is that desired?
I would say no
If no, depends on !COMPILE_TEST can help here.
Noted, thank you!
On Tue, Nov 19, 2024 at 04:34:51PM +0100, Valentin Schneider wrote:
A later commit will reduce the size of the RCU watching counter to free up some bits for another purpose. Paul suggested adding a config option to test the extreme case where the counter is reduced to its minimum usable width for rcutorture to poke at, so do that.
Make it only configurable under RCU_EXPERT. While at it, add a comment to explain the layout of context_tracking->state.
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney paulmck@kernel.org
Signed-off-by: Valentin Schneider vschneid@redhat.com
Looks good, one help-text nit below.
Reviewed-by: Paul E. McKenney paulmck@kernel.org
 include/linux/context_tracking_state.h | 44 ++++++++++++++++++++++----
 kernel/rcu/Kconfig.debug               | 14 ++++++++
 2 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 7b8433d5a8efe..0b81248aa03e2 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -18,12 +18,6 @@ enum ctx_state {
 	CT_STATE_MAX		= 4,
 };

-/* Odd value for watching, else even. */
-#define CT_RCU_WATCHING CT_STATE_MAX
-
-#define CT_STATE_MASK (CT_STATE_MAX - 1)
-#define CT_RCU_WATCHING_MASK (~CT_STATE_MASK)
-
 struct context_tracking {
 #ifdef CONFIG_CONTEXT_TRACKING_USER
 	/*
@@ -44,9 +38,45 @@ struct context_tracking {
 #endif
 };

+/*
+ * We cram two different things within the same atomic variable:
+ *
+ *                  CT_RCU_WATCHING_START  CT_STATE_START
+ *                             |                |
+ *                             v                v
+ *     MSB [ RCU watching counter ][ context_state ] LSB
+ *         ^                       ^
+ *         |                       |
+ * CT_RCU_WATCHING_END        CT_STATE_END
+ *
+ * Bits are used from the LSB upwards, so unused bits (if any) will always be in
+ * upper bits of the variable.
+ */
 #ifdef CONFIG_CONTEXT_TRACKING
+#define CT_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
+
+#define CT_STATE_WIDTH bits_per(CT_STATE_MAX - 1)
+#define CT_STATE_START 0
+#define CT_STATE_END   (CT_STATE_START + CT_STATE_WIDTH - 1)
+
+#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH)
+#define CT_RCU_WATCHING_WIDTH     (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH)
+#define CT_RCU_WATCHING_START     (CT_STATE_END + 1)
+#define CT_RCU_WATCHING_END       (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1)
+#define CT_RCU_WATCHING           BIT(CT_RCU_WATCHING_START)
+
+#define CT_STATE_MASK        GENMASK(CT_STATE_END,        CT_STATE_START)
+#define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START)
+
+#define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH)
+
+static_assert(CT_STATE_WIDTH        +
+	      CT_RCU_WATCHING_WIDTH +
+	      CT_UNUSED_WIDTH       ==
+	      CT_SIZE);
+
 DECLARE_PER_CPU(struct context_tracking, context_tracking);
-#endif
+#endif	/* CONFIG_CONTEXT_TRACKING */

 #ifdef CONFIG_CONTEXT_TRACKING_USER
 static __always_inline int __ct_state(void)
diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index 9b0b52e1836fa..8dc505d841f8d 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -168,4 +168,18 @@ config RCU_STRICT_GRACE_PERIOD
 	  when looking for certain types of RCU usage bugs, for example,
 	  too-short RCU read-side critical sections.

+
+config RCU_DYNTICKS_TORTURE
+	bool "Minimize RCU dynticks counter size"
+	depends on RCU_EXPERT
+	default n
+	help
+	  This option controls the width of the dynticks counter.
+
+	  Lower values will make overflows more frequent, which will increase
+	  the likelihood of extending grace-periods.  This option sets the width
+	  to its minimum usable value.
The second sentence ("Lower values ...") sounds at first reading like this Kconfig option directly controls the width. The third sentence sets things straight, but the reader might well be irretrievably confused by that point. How about something like this instead?
	help
	  This option sets the width of the dynticks counter to its
	  minimum usable value.  This minimum width greatly increases
	  the probability of flushing out bugs involving counter wrap,
	  but it also increases the probability of extending grace period
	  durations.  This Kconfig option should therefore be avoided in
	  production due to the consequent increased probability of OOMs.
This has no value for production and is only for testing.
endmenu # "RCU Debugging"
2.43.0
On 22/11/24 04:53, Paul E. McKenney wrote:
On Tue, Nov 19, 2024 at 04:34:51PM +0100, Valentin Schneider wrote:
+config RCU_DYNTICKS_TORTURE
+	bool "Minimize RCU dynticks counter size"
+	depends on RCU_EXPERT
+	default n
+	help
+	  This option controls the width of the dynticks counter.
+
+	  Lower values will make overflows more frequent, which will increase
+	  the likelihood of extending grace-periods.  This option sets the width
+	  to its minimum usable value.
The second sentence ("Lower values ...") sounds at first reading like this Kconfig option directly controls the width. The third sentence sets things straight, but the reader might well be irretrievably confused by that point. How about something like this instead?
	help
	  This option sets the width of the dynticks counter to its
	  minimum usable value.  This minimum width greatly increases
	  the probability of flushing out bugs involving counter wrap,
	  but it also increases the probability of extending grace period
	  durations.  This Kconfig option should therefore be avoided in
	  production due to the consequent increased probability of OOMs.

	  This has no value for production and is only for testing.
Much better, I'll take that, thank you!
endmenu # "RCU Debugging"
2.43.0
We now have an RCU_EXPERT config for testing small-sized RCU dynticks counter: CONFIG_RCU_DYNTICKS_TORTURE.
Modify scenario TREE04 to use this config in order to test a ridiculously small counter (2 bits).
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney paulmck@kernel.org
Signed-off-by: Valentin Schneider vschneid@redhat.com
---
 tools/testing/selftests/rcutorture/configs/rcu/TREE04 | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index dc4985064b3ad..67caf4276bb01 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
 CONFIG_RCU_EXPERT=y
 CONFIG_RCU_EQS_DEBUG=y
 CONFIG_RCU_LAZY=y
+CONFIG_RCU_DYNTICKS_TORTURE=y
On Tue, Nov 19, 2024 at 04:34:52PM +0100, Valentin Schneider wrote:
We now have an RCU_EXPERT config for testing small-sized RCU dynticks counter: CONFIG_RCU_DYNTICKS_TORTURE.
Modify scenario TREE04 to use this config in order to test a ridiculously small counter (2 bits).
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney paulmck@kernel.org
Signed-off-by: Valentin Schneider vschneid@redhat.com
Reviewed-by: Paul E. McKenney paulmck@kernel.org
 tools/testing/selftests/rcutorture/configs/rcu/TREE04 | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index dc4985064b3ad..67caf4276bb01 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
 CONFIG_RCU_EXPERT=y
 CONFIG_RCU_EQS_DEBUG=y
 CONFIG_RCU_LAZY=y
+CONFIG_RCU_DYNTICKS_TORTURE=y
2.43.0
Later commits will cause objtool to warn about non __ro_after_init static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
Two such keys currently exist: mds_idle_clear and __sched_clock_stable, which can both be modified at runtime.
As discussed at LPC 2024 during the realtime micro-conference, modifying these specific keys incurs additional interference (SMT hotplug) or can even be downright incompatible with NOHZ_FULL (unstable TSC).
Suppressing the IPI associated with modifying such keys would thus bring little benefit wrt NOHZ_FULL interference, so add a jump type that will be leveraged by both the kernel (to know not to defer the IPI) and objtool (to know not to generate the aforementioned warning).
Signed-off-by: Valentin Schneider vschneid@redhat.com
---
 include/linux/jump_label.h | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index f5a2727ca4a9a..93e729545b941 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -200,7 +200,8 @@ struct module;
 #define JUMP_TYPE_FALSE		0UL
 #define JUMP_TYPE_TRUE		1UL
 #define JUMP_TYPE_LINKED	2UL
-#define JUMP_TYPE_MASK		3UL
+#define JUMP_TYPE_FORCEFUL	4UL
+#define JUMP_TYPE_MASK		7UL

 static __always_inline bool static_key_false(struct static_key *key)
 {
@@ -244,12 +245,15 @@ extern enum jump_label_type jump_label_init_type(struct jump_entry *entry);
  * raw value, but have added a BUILD_BUG_ON() to catch any issues in
  * jump_label_init() see: kernel/jump_label.c.
  */
-#define STATIC_KEY_INIT_TRUE					\
-	{ .enabled = { 1 },					\
-	  { .type = JUMP_TYPE_TRUE } }
-#define STATIC_KEY_INIT_FALSE					\
-	{ .enabled = { 0 },					\
-	  { .type = JUMP_TYPE_FALSE } }
+#define __STATIC_KEY_INIT(_true, force)					\
+	{ .enabled = { _true },						\
+	  { .type = (_true ? JUMP_TYPE_TRUE : JUMP_TYPE_FALSE) |	\
+		    (force ? JUMP_TYPE_FORCEFUL : 0UL)} }
+
+#define STATIC_KEY_INIT_TRUE        __STATIC_KEY_INIT(true, false)
+#define STATIC_KEY_INIT_FALSE       __STATIC_KEY_INIT(false, false)
+#define STATIC_KEY_INIT_TRUE_FORCE  __STATIC_KEY_INIT(true, true)
+#define STATIC_KEY_INIT_FALSE_FORCE __STATIC_KEY_INIT(false, true)

 #else /* !CONFIG_JUMP_LABEL */

@@ -369,6 +373,8 @@ struct static_key_false {

 #define STATIC_KEY_TRUE_INIT  (struct static_key_true) { .key = STATIC_KEY_INIT_TRUE,  }
 #define STATIC_KEY_FALSE_INIT (struct static_key_false){ .key = STATIC_KEY_INIT_FALSE, }
+#define STATIC_KEY_TRUE_FORCE_INIT  (struct static_key_true) { .key = STATIC_KEY_INIT_TRUE_FORCE,  }
+#define STATIC_KEY_FALSE_FORCE_INIT (struct static_key_false){ .key = STATIC_KEY_INIT_FALSE_FORCE, }

 #define DEFINE_STATIC_KEY_TRUE(name)	\
 	struct static_key_true name = STATIC_KEY_TRUE_INIT
@@ -376,6 +382,9 @@ struct static_key_false {
 #define DEFINE_STATIC_KEY_TRUE_RO(name)	\
 	struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT

+#define DEFINE_STATIC_KEY_TRUE_FORCE(name)	\
+	struct static_key_true name = STATIC_KEY_TRUE_FORCE_INIT
+
 #define DECLARE_STATIC_KEY_TRUE(name)	\
 	extern struct static_key_true name

@@ -385,6 +394,9 @@ struct static_key_false {
 #define DEFINE_STATIC_KEY_FALSE_RO(name)	\
 	struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT

+#define DEFINE_STATIC_KEY_FALSE_FORCE(name)	\
+	struct static_key_false name = STATIC_KEY_FALSE_FORCE_INIT
+
 #define DECLARE_STATIC_KEY_FALSE(name)	\
 	extern struct static_key_false name
On Tue, Nov 19, 2024 at 04:34:53PM +0100, Valentin Schneider wrote:
Later commits will cause objtool to warn about non __ro_after_init static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
Don't we need similar checking for static calls?
Two such keys currently exist: mds_idle_clear and __sched_clock_stable, which can both be modified at runtime.
Not sure if feasible, but it sure would be a lot simpler to just make "no noinstr patching" a hard rule and then convert the above keys (or at least their noinstr-specific usage) to regular branches.
Then "no noinstr patching" could be unilaterally enforced in text_poke_bp().
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index f5a2727ca4a9a..93e729545b941 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -200,7 +200,8 @@ struct module;
 #define JUMP_TYPE_FALSE		0UL
 #define JUMP_TYPE_TRUE		1UL
 #define JUMP_TYPE_LINKED	2UL
-#define JUMP_TYPE_MASK		3UL
+#define JUMP_TYPE_FORCEFUL	4UL
JUMP_TYPE_NOINSTR_ALLOWED ?
On Tue, Nov 19, 2024 at 03:39:02PM -0800, Josh Poimboeuf wrote:
On Tue, Nov 19, 2024 at 04:34:53PM +0100, Valentin Schneider wrote:
Later commits will cause objtool to warn about non __ro_after_init static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
Don't we need similar checking for static calls?
Two such keys currently exist: mds_idle_clear and __sched_clock_stable, which can both be modified at runtime.
Not sure if feasible, but it sure would be a lot simpler to just make "no noinstr patching" a hard rule and then convert the above keys (or at least their noinstr-specific usage) to regular branches.
That'll be a bit of a mess. Also, then we're adding overhead/cost for all people for the benefit of this fringe case (NOHZ_FULL). Not a desirable trade-off IMO.
So I do think the proposed solution (+- naming, I like your naming proposal better) is the better one.
But I think we can make the fall-back safer, we can simply force the IPI when we poke at noinstr code -- then NOHZ_FULL gets to keep the pieces, but at least we don't violate any correctness constraints.
On Wed, Nov 20, 2024 at 03:56:49PM +0100, Peter Zijlstra wrote:
But I think we can make the fall-back safer, we can simply force the IPI when we poke at noinstr code -- then NOHZ_FULL gets to keep the pieces, but at least we don't violate any correctness constraints.
I should have read more; that's what is being proposed.
On Wed, Nov 20, 2024 at 03:57:46PM +0100, Peter Zijlstra wrote:
On Wed, Nov 20, 2024 at 03:56:49PM +0100, Peter Zijlstra wrote:
But I think we can make the fall-back safer, we can simply force the IPI when we poke at noinstr code -- then NOHZ_FULL gets to keep the pieces, but at least we don't violate any correctness constraints.
I should have read more; that's what is being proposed.
Hm, now I'm wondering what you read, as I only see the text poke IPIs being forced when the caller sets force_ipi, rather than the text poke code itself detecting a write to .noinstr.
On Wed, Nov 20, 2024 at 08:55:15AM -0800, Josh Poimboeuf wrote:
On Wed, Nov 20, 2024 at 03:57:46PM +0100, Peter Zijlstra wrote:
On Wed, Nov 20, 2024 at 03:56:49PM +0100, Peter Zijlstra wrote:
But I think we can make the fall-back safer, we can simply force the IPI when we poke at noinstr code -- then NOHZ_FULL gets to keep the pieces, but at least we don't violate any correctness constraints.
I should have read more; that's what is being proposed.
Hm, now I'm wondering what you read, as I only see the text poke IPIs being forced when the caller sets force_ipi, rather than the text poke code itself detecting a write to .noinstr.
Right, so I had much confusion and my initial thought was that it would do something dangerous. Then upon reading more I see it forces the IPI for these special keys -- with that force_ipi thing.
Now, there's only two keys marked special, and both have a noinstr presence -- the entire reason they get marked.
So effectively we force the IPI when patching noinstr, no?
But yeah, this is not quite the same as not marking anything and simply forcing the IPI when the target address is noinstr.
And having written all that; perhaps that is the better solution, it sticks the logic in text_poke and ensure it automagically work for all its users, obviating the need for special marking.
Is that what you were thinking?
On Thu, Nov 21, 2024 at 12:00:20PM +0100, Peter Zijlstra wrote:
But yeah, this is not quite the same as not marking anything and simply forcing the IPI when the target address is noinstr.
And having written all that; perhaps that is the better solution, it sticks the logic in text_poke and ensure it automagically work for all its users, obviating the need for special marking.
Is that what you were thinking?
Yes, though I can't take credit for the idea as that's what I thought you were suggesting! That seems simpler and more bulletproof.
On 21/11/24 12:00, Peter Zijlstra wrote:
On Wed, Nov 20, 2024 at 08:55:15AM -0800, Josh Poimboeuf wrote:
On Wed, Nov 20, 2024 at 03:57:46PM +0100, Peter Zijlstra wrote:
On Wed, Nov 20, 2024 at 03:56:49PM +0100, Peter Zijlstra wrote:
But I think we can make the fall-back safer, we can simply force the IPI when we poke at noinstr code -- then NOHZ_FULL gets to keep the pieces, but at least we don't violate any correctness constraints.
I should have read more; that's what is being proposed.
Hm, now I'm wondering what you read, as I only see the text poke IPIs being forced when the caller sets force_ipi, rather than the text poke code itself detecting a write to .noinstr.
Right, so I had much confusion and my initial thought was that it would do something dangerous. Then upon reading more I see it forces the IPI for these special keys -- with that force_ipi thing.
Now, there's only two keys marked special, and both have a noinstr presence -- the entire reason they get marked.
So effectively we force the IPI when patching noinstr, no?
But yeah, this is not quite the same as not marking anything and simply forcing the IPI when the target address is noinstr.
And having written all that; perhaps that is the better solution, it sticks the logic in text_poke and ensure it automagically work for all its users, obviating the need for special marking.
Okay so forcing the IPI for .noinstr patching lets us get rid of all the force_ipi faff; however I would still want the special marking to tell objtool "yep we're okay with this one", and still get warnings when a new .noinstr key gets added so we double think about it.
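For reference, the "detect it in text_poke" variant discussed above could look roughly like the below. This is only a sketch under the assumption that the .noinstr.text boundary symbols are usable from the patching code; it is not part of the posted patches:

  /*
   * Hypothetical sketch: force the text_poke_sync() IPI whenever the
   * patched address lives in .noinstr.text, instead of relying on
   * callers passing force_ipi / marking their static keys.
   */
  static bool text_poke_addr_is_noinstr(void *addr)
  {
          unsigned long a = (unsigned long)addr;

          return a >= (unsigned long)__noinstr_text_start &&
                 a <  (unsigned long)__noinstr_text_end;
  }

text_poke_bp_batch() could then refuse to defer (i.e. always IPI) whenever any entry in the batch trips that check.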
On Thu, Nov 21, 2024 at 04:51:09PM +0100, Valentin Schneider wrote:
Okay so forcing the IPI for .noinstr patching lets us get rid of all the force_ipi faff; however I would still want the special marking to tell objtool "yep we're okay with this one", and still get warnings when a new .noinstr key gets added so we double think about it.
Yeah. Though, instead of DECLARE_STATIC_KEY_FALSE_NOINSTR adding a new jump label type, it could just add an objtool annotation pointing to the key. If that's the way we're going I could whip up a patch if that would help.
On 21/11/24 12:21, Josh Poimboeuf wrote:
On Thu, Nov 21, 2024 at 04:51:09PM +0100, Valentin Schneider wrote:
Okay so forcing the IPI for .noinstr patching lets us get rid of all the force_ipi faff; however I would still want the special marking to tell objtool "yep we're okay with this one", and still get warnings when a new .noinstr key gets added so we double think about it.
Yeah. Though, instead of DECLARE_STATIC_KEY_FALSE_NOINSTR adding a new jump label type, it could just add an objtool annotation pointing to the key. If that's the way we're going I could whip up a patch if that would help.
Well I'm down for the approach and I'd appreciate help for the objtool side :-)
-- Josh
On 19/11/24 15:39, Josh Poimboeuf wrote:
On Tue, Nov 19, 2024 at 04:34:53PM +0100, Valentin Schneider wrote:
Later commits will cause objtool to warn about non __ro_after_init static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
Don't we need similar checking for static calls?
/sifts through my notes throwing paper all around
Huh, I thought I had something, but no... Per the results they don't seem to be flipped around as much as static keys, but they also end up in text_poke_bp(), so yeah, we do. Welp, I'll add that to the list.
Two such keys currently exist: mds_idle_clear and __sched_clock_stable, which can both be modified at runtime.
Not sure if feasible, but it sure would be a lot simpler to just make "no noinstr patching" a hard rule and then convert the above keys (or at least their noinstr-specific usage) to regular branches.
Then "no noinstr patching" could be unilaterally enforced in text_poke_bp().
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index f5a2727ca4a9a..93e729545b941 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -200,7 +200,8 @@ struct module;
 #define JUMP_TYPE_FALSE		0UL
 #define JUMP_TYPE_TRUE		1UL
 #define JUMP_TYPE_LINKED	2UL
-#define JUMP_TYPE_MASK		3UL
+#define JUMP_TYPE_FORCEFUL	4UL
JUMP_TYPE_NOINSTR_ALLOWED ?
That's better, I'll take it. Thanks!
-- Josh
On Tue, Nov 19, 2024 at 04:34:53PM +0100, Valentin Schneider wrote:
+++ b/include/linux/jump_label.h
@@ -200,7 +200,8 @@ struct module;
 #define JUMP_TYPE_FALSE		0UL
 #define JUMP_TYPE_TRUE		1UL
 #define JUMP_TYPE_LINKED	2UL
-#define JUMP_TYPE_MASK		3UL
+#define JUMP_TYPE_FORCEFUL	4UL
+#define JUMP_TYPE_MASK		7UL
Hm, I don't think we can (ab)use this pointer bit on 32-bit arches, as the address could be 4 byte aligned?
On Tue, Nov 19, 2024 at 04:05:32PM -0800, Josh Poimboeuf wrote:
On Tue, Nov 19, 2024 at 04:34:53PM +0100, Valentin Schneider wrote:
+++ b/include/linux/jump_label.h
@@ -200,7 +200,8 @@ struct module;
 #define JUMP_TYPE_FALSE		0UL
 #define JUMP_TYPE_TRUE		1UL
 #define JUMP_TYPE_LINKED	2UL
-#define JUMP_TYPE_MASK		3UL
+#define JUMP_TYPE_FORCEFUL	4UL
+#define JUMP_TYPE_MASK		7UL
Hm, I don't think we can (ab)use this pointer bit on 32-bit arches, as the address could be 4 byte aligned?
Right, you can force the alignment of the thing, workqueues do similar hacks to get more bits.
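For the record, the alignment trick alluded to here amounts to over-aligning the pointed-to objects so the low bits of the pointer are guaranteed free even on 32-bit; a minimal sketch (simplified struct layout, purely illustrative):

  /*
   * Illustrative only: forcing 8-byte alignment on the jump entries
   * guarantees 3 free low bits in the static_key entries pointer,
   * which is where the JUMP_TYPE_* flags are stashed.
   */
  struct jump_entry {
          s32 code;
          s32 target;
          long key;
  } __aligned(8);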
Later commits will cause objtool to warn about non __ro_after_init static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
mds_idle_clear is used in .noinstr code, and can be modified at runtime (SMT hotplug). Suppressing the text_poke_sync() IPI has little benefit for this key, as hotplug implies eventually going through takedown_cpu() -> stop_machine_cpuslocked() which is going to cause interference on all online CPUs anyway.
Mark it as forceful to let the kernel know to always send the text_poke_sync() IPI for it, and to let objtool know not to warn about it.
Signed-off-by: Valentin Schneider vschneid@redhat.com
---
 arch/x86/kernel/cpu/bugs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 47a01d4028f60..dcc4d5d6e3b95 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -114,7 +114,7 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
 DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);

 /* Control MDS CPU buffer clear before idling (halt, mwait) */
-DEFINE_STATIC_KEY_FALSE(mds_idle_clear);
+DEFINE_STATIC_KEY_FALSE_FORCE(mds_idle_clear);
 EXPORT_SYMBOL_GPL(mds_idle_clear);

 /*
Later commits will cause objtool to warn about non __ro_after_init static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
__sched_clock_stable is used in .noinstr code, and can be modified at runtime (e.g. KVM module loading). Suppressing the text_poke_sync() IPI has little benefit for this key, as NOHZ_FULL is incompatible with an unstable TSC anyway.
Mark it as forceful to let the kernel know to always send the text_poke_sync() IPI for it, and to let objtool know not to warn about it.
Signed-off-by: Valentin Schneider vschneid@redhat.com
---
 kernel/sched/clock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 200e5568b9894..dc94b3717f5ce 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -76,7 +76,7 @@ static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);
  * Similarly we start with __sched_clock_stable_early, thereby assuming we
  * will become stable, such that there's only a single 1 -> 0 transition.
  */
-static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
+static DEFINE_STATIC_KEY_FALSE_FORCE(__sched_clock_stable);
 static int __sched_clock_stable_early = 1;

 /*
On Tue, Nov 19, 2024 at 04:34:55PM +0100, Valentin Schneider wrote:
Later commits will cause objtool to warn about non __ro_after_init static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
__sched_clock_stable is used in .noinstr code, and can be modified at runtime (e.g. KVM module loading). Suppressing the text_poke_sync() IPI has
Wait, what !? loading KVM causes the TSC to be marked unstable?
On 20/11/24 15:59, Peter Zijlstra wrote:
On Tue, Nov 19, 2024 at 04:34:55PM +0100, Valentin Schneider wrote:
Later commits will cause objtool to warn about non __ro_after_init static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
__sched_clock_stable is used in .noinstr code, and can be modified at runtime (e.g. KVM module loading). Suppressing the text_poke_sync() IPI has
Wait, what !? loading KVM causes the TSC to be marked unstable?
Ah, maybe not, I saw the below but that's actually the x86 specific stuff and IIUC can only be builtin:
kvm_init_platform()
`\ kvmclock_init()
   `\ kvm_sched_clock_init()
      `\ clear_sched_clock_stable()
There is however this:
kvm_arch_vcpu_load()
`\ mark_tsc_unstable()
So plugging a VCPU might do that.
On Wed, Nov 20, 2024 at 05:34:32PM +0100, Valentin Schneider wrote:
On 20/11/24 15:59, Peter Zijlstra wrote:
On Tue, Nov 19, 2024 at 04:34:55PM +0100, Valentin Schneider wrote:
Later commits will cause objtool to warn about non __ro_after_init static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
__sched_clock_stable is used in .noinstr code, and can be modified at runtime (e.g. KVM module loading). Suppressing the text_poke_sync() IPI has
Wait, what !? loading KVM causes the TSC to be marked unstable?
There is however this:
kvm_arch_vcpu_load()
`\ mark_tsc_unstable()
So plugging a VCPU might do that.
Right, but that only happens if it observes the TSC doing dodgy, so that's deserved and shouldn't happen on hardware from this decade, and possibly the one before that.
Later commits will disallow runtime-mutable text in .noinstr sections in order to safely defer instruction patching IPIs.
All static keys used in .noinstr sections have now been checked as being either flagged as __ro_after_init, or as forceful static keys. Any occurrence of this new warning would be the result of a code change that will need looking at.
Suggested-by: Josh Poimboeuf jpoimboe@redhat.com
Signed-off-by: Valentin Schneider vschneid@redhat.com
---
offset_of(static_key.type) and JUMP_TYPE_FORCEFUL would need to be shoved into
a somewhat standalone header file that could be included by objtool itself.
---
 tools/objtool/Documentation/objtool.txt | 13 ++++++
 tools/objtool/check.c                   | 41 +++++++++++++++++++++++++
 tools/objtool/include/objtool/check.h   |  1 +
 tools/objtool/include/objtool/special.h |  2 ++
 tools/objtool/special.c                 |  3 ++
 5 files changed, 60 insertions(+)
diff --git a/tools/objtool/Documentation/objtool.txt b/tools/objtool/Documentation/objtool.txt
index 7c3ee959b63c7..06fa285873387 100644
--- a/tools/objtool/Documentation/objtool.txt
+++ b/tools/objtool/Documentation/objtool.txt
@@ -447,6 +447,19 @@ the objtool maintainers.
     names and does not use module_init() / module_exit() macros to create
     them.

+13. file.o: warning: func()+0x2a: non __ro_after_init static key "key" in .noinstr section
+
+    This means that the noinstr function func() uses a static key that can be
+    modified at runtime. This is not allowed as noinstr functions rely on
+    containing stable instructions after init.
+
+    Check whether the static key in question can really be modified at runtime,
+    and if it is only enabled during init then mark it as __ro_after_init. If it
+    genuinely needs to be modified at runtime:
+
+    1) Directly rely on the underlying atomic count of the key in the noinstr
+       functions.
+    2) Mark the static key as forceful.

 If the error doesn't seem to make sense, it could be a bug in objtool. Feel free
 to ask the objtool maintainer for help.
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 00e25492f5065..c1fb02c326839 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -2056,6 +2056,9 @@ static int add_special_section_alts(struct objtool_file *file)
 		alt->next = orig_insn->alts;
 		orig_insn->alts = alt;

+		if (special_alt->key_sym)
+			orig_insn->key_sym = special_alt->key_sym;
+
 		list_del(&special_alt->list);
 		free(special_alt);
 	}
@@ -3605,6 +3608,41 @@ static int validate_return(struct symbol *func, struct instruction *insn, struct
 	return 0;
 }

+static bool static_key_is_forceful(struct symbol *key)
+{
+	if (!strcmp(key->sec->name, ".data")) {
+		unsigned long data_offset = key->sec->data->d_off;
+		unsigned long key_offset = key->sym.st_value;
+		char* data = key->sec->data->d_buf;
+
+		/*
+		 *         offset_of(static_key.type)
+		 *         v
+		 *         v     JUMP_TYPE_FORCEFUL
+		 *         v     v
+		 */
+		return data[(key_offset + 8) - data_offset] & 0x4;
+	}
+
+	return false;
+}
+
+static int validate_static_key(struct instruction *insn, struct insn_state *state)
+{
+	if (state->noinstr && state->instr <= 0) {
+		if (static_key_is_forceful(insn->key_sym))
+			return 0;
+
+		if ((strcmp(insn->key_sym->sec->name, ".data..ro_after_init"))) {
+			WARN_INSN(insn, "non __ro_after_init static key \"%s\" in .noinstr section",
+				  insn->key_sym->name);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
 static struct instruction *next_insn_to_validate(struct objtool_file *file,
 						 struct instruction *insn)
 {
@@ -3766,6 +3804,9 @@ static int validate_branch(struct objtool_file *file, struct symbol *func,
 		if (handle_insn_ops(insn, next_insn, &state))
 			return 1;

+		if (insn->key_sym)
+			validate_static_key(insn, &state);
+
 		switch (insn->type) {

 		case INSN_RETURN:
diff --git a/tools/objtool/include/objtool/check.h b/tools/objtool/include/objtool/check.h
index daa46f1f0965a..35dd21f8f41e1 100644
--- a/tools/objtool/include/objtool/check.h
+++ b/tools/objtool/include/objtool/check.h
@@ -77,6 +77,7 @@ struct instruction {
 	struct symbol *sym;
 	struct stack_op *stack_ops;
 	struct cfi_state *cfi;
+	struct symbol *key_sym;
 };

 static inline struct symbol *insn_func(struct instruction *insn)
diff --git a/tools/objtool/include/objtool/special.h b/tools/objtool/include/objtool/special.h
index 86d4af9c5aa9d..0e61f34fe3a28 100644
--- a/tools/objtool/include/objtool/special.h
+++ b/tools/objtool/include/objtool/special.h
@@ -27,6 +27,8 @@ struct special_alt {
 	struct section *new_sec;
 	unsigned long new_off;

+	struct symbol *key_sym;
+
 	unsigned int orig_len, new_len; /* group only */
 };

diff --git a/tools/objtool/special.c b/tools/objtool/special.c
index 097a69db82a0e..fefab2b471bf8 100644
--- a/tools/objtool/special.c
+++ b/tools/objtool/special.c
@@ -127,6 +127,9 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
 			return -1;
 		}
 		alt->key_addend = reloc_addend(key_reloc);
+
+		reloc_to_sec_off(key_reloc, &sec, &offset);
+		alt->key_sym = find_symbol_by_offset(sec, offset & ~3);
 	}

 	return 0;
On Tue, Nov 19, 2024 at 04:34:56PM +0100, Valentin Schneider wrote:
Later commits will disallow runtime-mutable text in .noinstr sections in order to safely defer instruction patching IPIs.
All static keys used in .noinstr sections have now been checked as being either flagged as __ro_after_init, or as forceful static keys. Any occurrence of this new warning would be the result of a code change that will need looking at.
Suggested-by: Josh Poimboeuf jpoimboe@redhat.com Signed-off-by: Valentin Schneider vschneid@redhat.com
offset_of(static_key.type) and JUMP_TYPE_FORCEFUL would need to be shoved into a somewhat standalone header file that could be included by objtool itself.
static_key and JUMP_TYPE_* can be moved to jump_label_types.h which can be included by jump_label.h and also synced to tools/include/linux for objtool to access. I guess objtool would have to "#define CONFIG_JUMP_LABEL" before including it to get the full definition.
@@ -3605,6 +3608,41 @@ static int validate_return(struct symbol *func, struct instruction *insn, struct
 	return 0;
 }

+static bool static_key_is_forceful(struct symbol *key)
+{
+	if (!strcmp(key->sec->name, ".data")) {
There are some configs (and more coming in the future) which compile the kernel with -fdata-sections. So this may need to be something like

	if (strstarts(key->sec->name, ".data"))
Forceful static keys are used in early entry code where it is unsafe to defer the sync_core() IPIs, and flagged as such via their ->type field.
Record that information when creating a text_poke_loc. The text_poke_loc.old field is written to when first iterating a text_poke() entry, and as such can be (ab)used to store this information at the start of text_poke_bp_batch().
Signed-off-by: Valentin Schneider vschneid@redhat.com
---
 arch/x86/include/asm/text-patching.h | 12 ++++++++++--
 arch/x86/kernel/alternative.c        | 16 ++++++++++------
 arch/x86/kernel/jump_label.c         |  7 ++++---
 3 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 6259f1937fe77..e34de36cab61e 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -38,9 +38,17 @@ extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy_locked(void *addr, const void *opcode, size_t len, bool core_ok);
 extern void *text_poke_set(void *addr, int c, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
-extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate);
+extern void __text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate, bool force_ipi);
+static inline void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
+{
+	__text_poke_bp(addr, opcode, len, emulate, false);
+}

-extern void text_poke_queue(void *addr, const void *opcode, size_t len, const void *emulate);
+extern void __text_poke_queue(void *addr, const void *opcode, size_t len, const void *emulate, bool force_ipi);
+static inline void text_poke_queue(void *addr, const void *opcode, size_t len, const void *emulate)
+{
+	__text_poke_queue(addr, opcode, len, emulate, false);
+}
 extern void text_poke_finish(void);

 #define INT3_INSN_SIZE		1
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index d17518ca19b8b..954c4c0f7fc58 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -2098,7 +2098,10 @@ struct text_poke_loc {
 	u8 opcode;
 	const u8 text[POKE_MAX_OPCODE_SIZE];
 	/* see text_poke_bp_batch() */
-	u8 old;
+	union {
+		u8 old;
+		u8 force_ipi;
+	};
 };

 struct bp_patching_desc {
@@ -2385,7 +2388,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 }

 static void text_poke_loc_init(struct text_poke_loc *tp, void *addr,
-			       const void *opcode, size_t len, const void *emulate)
+			       const void *opcode, size_t len, const void *emulate, bool force_ipi)
 {
 	struct insn insn;
 	int ret, i = 0;
@@ -2402,6 +2405,7 @@ static void text_poke_loc_init(struct text_poke_loc *tp, void *addr,
 	tp->rel_addr = addr - (void *)_stext;
 	tp->len = len;
 	tp->opcode = insn.opcode.bytes[0];
+	tp->force_ipi = force_ipi;

 	if (is_jcc32(&insn)) {
 		/*
@@ -2493,14 +2497,14 @@ void text_poke_finish(void)
 	text_poke_flush(NULL);
 }

-void __ref text_poke_queue(void *addr, const void *opcode, size_t len, const void *emulate)
+void __ref __text_poke_queue(void *addr, const void *opcode, size_t len, const void *emulate, bool force_ipi)
 {
 	struct text_poke_loc *tp;

 	text_poke_flush(addr);

 	tp = &tp_vec[tp_vec_nr++];
-	text_poke_loc_init(tp, addr, opcode, len, emulate);
+	text_poke_loc_init(tp, addr, opcode, len, emulate, force_ipi);
 }

 /**
@@ -2514,10 +2518,10 @@ void __ref text_poke_queue(void *addr, const void *opcode, size_t len, const voi
  * dynamically allocated memory. This function should be used when it is
  * not possible to allocate memory.
  */
-void __ref text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
+void __ref __text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate, bool force_ipi)
 {
 	struct text_poke_loc tp;

-	text_poke_loc_init(&tp, addr, opcode, len, emulate);
+	text_poke_loc_init(&tp, addr, opcode, len, emulate, force_ipi);
 	text_poke_bp_batch(&tp, 1);
 }
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index f5b8ef02d172c..e03a4f56b30fd 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -101,8 +101,8 @@ __jump_label_transform(struct jump_entry *entry,
 		text_poke_early((void *)jump_entry_code(entry), jlp.code, jlp.size);
 		return;
 	}
-
-	text_poke_bp((void *)jump_entry_code(entry), jlp.code, jlp.size, NULL);
+	__text_poke_bp((void *)jump_entry_code(entry), jlp.code, jlp.size, NULL,
+		       jump_entry_key(entry)->type & JUMP_TYPE_FORCEFUL);
 }

 static void __ref jump_label_transform(struct jump_entry *entry,
@@ -135,7 +135,8 @@ bool arch_jump_label_transform_queue(struct jump_entry *entry,

 	mutex_lock(&text_mutex);
 	jlp = __jump_label_patch(entry, type);
-	text_poke_queue((void *)jump_entry_code(entry), jlp.code, jlp.size, NULL);
+	__text_poke_queue((void *)jump_entry_code(entry), jlp.code, jlp.size, NULL,
+			  jump_entry_key(entry)->type & JUMP_TYPE_FORCEFUL);
 	mutex_unlock(&text_mutex);
 	return true;
 }
smp_call_function() & friends have the unfortunate habit of sending IPIs to isolated, NOHZ_FULL, in-userspace CPUs, as they blindly target all online CPUs.
Some callsites can be bent into doing the right thing, as done by commit:
cc9e303c91f5 ("x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs")
Unfortunately, not all SMP callbacks can be omitted in this fashion. However, some of them only affect execution in kernelspace, which means they don't have to be executed *immediately* if the target CPU is in userspace: stashing the callback and executing it upon the next kernel entry would suffice. x86 kernel instruction patching and kernel TLB invalidation are prime examples of this.
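As a rough sketch of the idea (simplified pseudo-code rather than the exact interfaces introduced below; CONTEXT_WORK_FOO and foo_callback() are made-up placeholders):

  /* Housekeeping CPU: try to defer instead of unconditionally IPI'ing */
  if (!ct_set_cpu_work(cpu, CONTEXT_WORK_FOO))
          /* Target CPU is in the kernel, can't defer: IPI as usual */
          smp_call_function_single(cpu, foo_callback, NULL, 1);

  /* Isolated CPU: upon its next user->kernel transition */
  ct_work_flush(state);   /* executes foo_callback()'s equivalent locally */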
Reduce the RCU dynticks counter width to free up some bits to be used as a deferred callback bitmask. Add some build-time checks to validate that setup.
Presence of CT_STATE_KERNEL in the ct_state prevents queuing deferred work.
Later commits introduce the bit:callback mappings.
Link: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/Kconfig                                 |  9 ++++
 arch/x86/Kconfig                             |  1 +
 arch/x86/include/asm/context_tracking_work.h | 14 ++++++
 include/linux/context_tracking.h             | 21 +++++++++
 include/linux/context_tracking_state.h       | 30 ++++++++-----
 include/linux/context_tracking_work.h        | 26 +++++++++++
 kernel/context_tracking.c                    | 46 +++++++++++++++++++-
 kernel/time/Kconfig                          |  5 +++
 8 files changed, 140 insertions(+), 12 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h
diff --git a/arch/Kconfig b/arch/Kconfig index bd9f095d69fa0..e7f3f797a34a4 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -912,6 +912,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK - No use of instrumentation, unless instrumentation_begin() got called.
+config HAVE_CONTEXT_TRACKING_WORK + bool + help + Architecture supports deferring work while not in kernel context. + This is especially useful on setups with isolated CPUs that might + want to avoid being interrupted to perform housekeeping tasks (for + ex. TLB invalidation or icache invalidation). The housekeeping + operations are performed upon re-entering the kernel. + config HAVE_TIF_NOHZ bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 16354dfa6d965..c703376dd326b 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -213,6 +213,7 @@ config X86 select HAVE_CMPXCHG_LOCAL select HAVE_CONTEXT_TRACKING_USER if X86_64 select HAVE_CONTEXT_TRACKING_USER_OFFSTACK if HAVE_CONTEXT_TRACKING_USER + select HAVE_CONTEXT_TRACKING_WORK if X86_64 select HAVE_C_RECORDMCOUNT select HAVE_OBJTOOL_MCOUNT if HAVE_OBJTOOL select HAVE_OBJTOOL_NOP_MCOUNT if HAVE_OBJTOOL_MCOUNT diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h new file mode 100644 index 0000000000000..5bc29e6b2ed38 --- /dev/null +++ b/arch/x86/include/asm/context_tracking_work.h @@ -0,0 +1,14 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H +#define _ASM_X86_CONTEXT_TRACKING_WORK_H + +static __always_inline void arch_context_tracking_work(int work) +{ + switch (work) { + case CONTEXT_WORK_n: + // Do work... + break; + } +} + +#endif diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index af9fe87a09225..16a2eb7525f1f 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -5,6 +5,7 @@ #include <linux/sched.h> #include <linux/vtime.h> #include <linux/context_tracking_state.h> +#include <linux/context_tracking_work.h> #include <linux/instrumentation.h>
#include <asm/ptrace.h> @@ -137,6 +138,26 @@ static __always_inline unsigned long ct_state_inc(int incby) return raw_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state)); }
+#ifdef CONTEXT_TRACKING_WORK +static __always_inline unsigned long ct_state_inc_clear_work(int incby) +{ + struct context_tracking *ct = this_cpu_ptr(&context_tracking); + unsigned long new, old, state; + + state = arch_atomic_read(&ct->state); + do { + old = state; + new = old & ~CONTEXT_WORK_MASK; + new += incby; + state = arch_atomic_cmpxchg(&ct->state, old, new); + } while (old != state); + + return new; +} +#else +#define ct_state_inc_clear_work(x) ct_state_inc(x) +#endif + static __always_inline bool warn_rcu_enter(void) { bool ret = false; diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h index 0b81248aa03e2..d1d37dbdf7195 100644 --- a/include/linux/context_tracking_state.h +++ b/include/linux/context_tracking_state.h @@ -5,6 +5,7 @@ #include <linux/percpu.h> #include <linux/static_key.h> #include <linux/context_tracking_irq.h> +#include <linux/context_tracking_work.h>
/* Offset to allow distinguishing irq vs. task-based idle entry/exit. */ #define CT_NESTING_IRQ_NONIDLE ((LONG_MAX / 2) + 1) @@ -39,16 +40,19 @@ struct context_tracking { };
/* - * We cram two different things within the same atomic variable: + * We cram up to three different things within the same atomic variable: * - * CT_RCU_WATCHING_START CT_STATE_START - * | | - * v v - * MSB [ RCU watching counter ][ context_state ] LSB - * ^ ^ - * | | - * CT_RCU_WATCHING_END CT_STATE_END + * CT_RCU_WATCHING_START CT_STATE_START + * | CT_WORK_START | + * | | | + * v v v + * MSB [ RCU watching counter ][ context work ][ context_state ] LSB + * ^ ^ ^ + * | | | + * | CT_WORK_END | + * CT_RCU_WATCHING_END CT_STATE_END * + * The [ context work ] region spans 0 bits if CONFIG_CONTEXT_WORK=n * Bits are used from the LSB upwards, so unused bits (if any) will always be in * upper bits of the variable. */ @@ -59,18 +63,24 @@ struct context_tracking { #define CT_STATE_START 0 #define CT_STATE_END (CT_STATE_START + CT_STATE_WIDTH - 1)
-#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH) +#define CT_WORK_WIDTH (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? CONTEXT_WORK_MAX_OFFSET : 0) +#define CT_WORK_START (CT_STATE_END + 1) +#define CT_WORK_END (CT_WORK_START + CT_WORK_WIDTH - 1) + +#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_WORK_WIDTH - CT_STATE_WIDTH) #define CT_RCU_WATCHING_WIDTH (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH) -#define CT_RCU_WATCHING_START (CT_STATE_END + 1) +#define CT_RCU_WATCHING_START (CT_WORK_END + 1) #define CT_RCU_WATCHING_END (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1) #define CT_RCU_WATCHING BIT(CT_RCU_WATCHING_START)
#define CT_STATE_MASK GENMASK(CT_STATE_END, CT_STATE_START) +#define CT_WORK_MASK GENMASK(CT_WORK_END, CT_WORK_START) #define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START)
#define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH)
static_assert(CT_STATE_WIDTH + + CT_WORK_WIDTH + CT_RCU_WATCHING_WIDTH + CT_UNUSED_WIDTH == CT_SIZE); diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h new file mode 100644 index 0000000000000..fb74db8876dd2 --- /dev/null +++ b/include/linux/context_tracking_work.h @@ -0,0 +1,26 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_CONTEXT_TRACKING_WORK_H +#define _LINUX_CONTEXT_TRACKING_WORK_H + +#include <linux/bitops.h> + +enum { + CONTEXT_WORK_n_OFFSET, + CONTEXT_WORK_MAX_OFFSET +}; + +enum ct_work { + CONTEXT_WORK_n = BIT(CONTEXT_WORK_n_OFFSET), + CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET) +}; + +#include <asm/context_tracking_work.h> + +#ifdef CONFIG_CONTEXT_TRACKING_WORK +extern bool ct_set_cpu_work(unsigned int cpu, unsigned int work); +#else +static inline bool +ct_set_cpu_work(unsigned int cpu, unsigned int work) { return false; } +#endif + +#endif diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 938c48952d265..37b094ea56fb6 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -72,6 +72,47 @@ static __always_inline void rcu_task_trace_heavyweight_exit(void) #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */ }
+#ifdef CONFIG_CONTEXT_TRACKING_WORK +static noinstr void ct_work_flush(unsigned long seq) +{ + int bit; + + seq = (seq & CT_WORK_MASK) >> CT_WORK_START; + + /* + * arch_context_tracking_work() must be noinstr, non-blocking, + * and NMI safe. + */ + for_each_set_bit(bit, &seq, CONTEXT_WORK_MAX) + arch_context_tracking_work(BIT(bit)); +} + +bool ct_set_cpu_work(unsigned int cpu, unsigned int work) +{ + struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu); + unsigned int old; + bool ret = false; + + preempt_disable(); + + old = atomic_read(&ct->state); + /* + * Try setting the work until either + * - the target CPU has entered kernelspace + * - the work has been set + */ + do { + ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START)); + } while (!ret && ((old & CT_STATE_MASK) != CT_STATE_KERNEL)); + + preempt_enable(); + return ret; +} +#else +static __always_inline void ct_work_flush(unsigned long work) { } +static __always_inline void ct_work_clear(struct context_tracking *ct) { } +#endif + /* * Record entry into an extended quiescent state. This is only to be * called when not already in an extended quiescent state, that is, @@ -88,7 +129,7 @@ static noinstr void ct_kernel_exit_state(int offset) * next idle sojourn. */ rcu_task_trace_heavyweight_enter(); // Before CT state update! - seq = ct_state_inc(offset); + seq = ct_state_inc_clear_work(offset); // RCU is no longer watching. Better be in extended quiescent state! WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & CT_RCU_WATCHING)); } @@ -100,7 +141,7 @@ static noinstr void ct_kernel_exit_state(int offset) */ static noinstr void ct_kernel_enter_state(int offset) { - int seq; + unsigned long seq;
/* * CPUs seeing atomic_add_return() must see prior idle sojourns, @@ -108,6 +149,7 @@ static noinstr void ct_kernel_enter_state(int offset) * critical section. */ seq = ct_state_inc(offset); + ct_work_flush(seq); // RCU is now watching. Better not be in an extended quiescent state! rcu_task_trace_heavyweight_exit(); // After CT state update! WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & CT_RCU_WATCHING)); diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig index 8ebb6d5a106be..04efc2b605823 100644 --- a/kernel/time/Kconfig +++ b/kernel/time/Kconfig @@ -186,6 +186,11 @@ config CONTEXT_TRACKING_USER_FORCE Say N otherwise, this option brings an overhead that you don't want in production.
+config CONTEXT_TRACKING_WORK + bool + depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER + default y + config NO_HZ bool "Old Idle dynticks config" help
On Tue, Nov 19, 2024 at 04:34:58PM +0100, Valentin Schneider wrote:
+bool ct_set_cpu_work(unsigned int cpu, unsigned int work) +{
- struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
- unsigned int old;
- bool ret = false;
- preempt_disable();
- old = atomic_read(&ct->state);
- /*
* Try setting the work until either
* - the target CPU has entered kernelspace
* - the work has been set
*/
- do {
ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START));
- } while (!ret && ((old & CT_STATE_MASK) != CT_STATE_KERNEL));
- preempt_enable();
- return ret;
Does it ignore the IPI even if:
(ret && ((old & CT_STATE_MASK) == CT_STATE_KERNEL))
?
And what about CT_STATE_IDLE?
Is the work ignored in those two cases?
But wouldn't it be cleaner to never set the work if the target is anywhere other than CT_STATE_USER? Then you wouldn't need to clear the work on kernel exit but rather on kernel entry.
That is:
bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
{
        struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
        unsigned int old;
        bool ret = false;

        preempt_disable();

        old = atomic_read(&ct->state);

        /* Start with our best wishes */
        old &= ~CT_STATE_MASK;
        old |= CT_STATE_USER;

        /*
         * Try setting the work until either
         * - the target CPU has exited userspace
         * - the work has been set
         */
        do {
                ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START));
        } while (!ret && ((old & CT_STATE_MASK) == CT_STATE_USER));

        preempt_enable();

        return ret;
}
Thanks.
On Wed, Nov 20, 2024 at 11:54:36AM +0100, Frederic Weisbecker wrote:
On Tue, Nov 19, 2024 at 04:34:58PM +0100, Valentin Schneider wrote:
+bool ct_set_cpu_work(unsigned int cpu, unsigned int work) +{
- struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
- unsigned int old;
- bool ret = false;
- preempt_disable();
- old = atomic_read(&ct->state);
- /*
* Try setting the work until either
* - the target CPU has entered kernelspace
* - the work has been set
*/
- do {
ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START));
- } while (!ret && ((old & CT_STATE_MASK) != CT_STATE_KERNEL));
- preempt_enable();
- return ret;
Does it ignore the IPI even if:
(ret && ((old & CT_STATE_MASK) == CT_STATE_KERNEL))
?
And what about CT_STATE_IDLE?
Is the work ignored in those two cases?
But wouldn't it be cleaner to never set the work if the target is anywhere other than CT_STATE_USER? Then you wouldn't need to clear the work on kernel exit but rather on kernel entry.
That is:
bool ct_set_cpu_work(unsigned int cpu, unsigned int work) { struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu); unsigned int old; bool ret = false;
preempt_disable();
old = atomic_read(&ct->state);
/* Start with our best wishes */ old &= ~CT_STATE_MASK; old |= CT_STATE_USER
/* * Try setting the work until either * - the target CPU has exited userspace * - the work has been set */ do { ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START)); } while (!ret && ((old & CT_STATE_MASK) == CT_STATE_USER));
preempt_enable();
return ret; }
Ah but there is CT_STATE_GUEST and I see the last patch also applies that to CT_STATE_IDLE.
So that could be:
bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
{
        struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
        unsigned int old;
        bool ret = false;

        preempt_disable();

        old = atomic_read(&ct->state);

        /* CT_STATE_IDLE can be added to last patch here */
        if (!(old & (CT_STATE_USER | CT_STATE_GUEST))) {
                old &= ~CT_STATE_MASK;
                old |= CT_STATE_USER;
        }

        /*
         * Try setting the work until either
         * - the target CPU has exited userspace / guest
         * - the work has been set
         */
        do {
                ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START));
        } while (!ret && old & (CT_STATE_USER | CT_STATE_GUEST));

        preempt_enable();

        return ret;
}
Thanks.
On 20/11/24 15:23, Frederic Weisbecker wrote:
Ah but there is CT_STATE_GUEST and I see the last patch also applies that to CT_STATE_IDLE.
So that could be:
bool ct_set_cpu_work(unsigned int cpu, unsigned int work) { struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu); unsigned int old; bool ret = false;
preempt_disable();
old = atomic_read(&ct->state);
/* CT_STATE_IDLE can be added to last patch here */ if (!(old & (CT_STATE_USER | CT_STATE_GUEST))) { old &= ~CT_STATE_MASK; old |= CT_STATE_USER; }
Hmph, so that lets us leverage the cmpxchg for a !CT_STATE_KERNEL check, but we get an extra loop if the target CPU exits kernelspace not to userspace (e.g. vcpu or idle) in the meantime - not great, not terrible.
At the cost of one extra bit for the CT_STATE area, with CT_STATE_KERNEL=1 we could do:
old = atomic_read(&ct->state);
old &= ~CT_STATE_KERNEL;
On Wed, Nov 20, 2024 at 06:10:43PM +0100, Valentin Schneider wrote:
On 20/11/24 15:23, Frederic Weisbecker wrote:
Ah but there is CT_STATE_GUEST and I see the last patch also applies that to CT_STATE_IDLE.
So that could be:
bool ct_set_cpu_work(unsigned int cpu, unsigned int work) { struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu); unsigned int old; bool ret = false;
preempt_disable();
old = atomic_read(&ct->state);
/* CT_STATE_IDLE can be added to last patch here */ if (!(old & (CT_STATE_USER | CT_STATE_GUEST))) { old &= ~CT_STATE_MASK; old |= CT_STATE_USER; }
Hmph, so that lets us leverage the cmpxchg for a !CT_STATE_KERNEL check, but we get an extra loop if the target CPU exits kernelspace not to userspace (e.g. vcpu or idle) in the meantime - not great, not terrible.
The thing is, what you read with atomic_read() should be close to reality. If it already is != CT_STATE_KERNEL then you're good (minus racy changes). If it is CT_STATE_KERNEL then you still must do a failing cmpxchg() in any case, at least to make sure you didn't miss a context tracking change. So the best you can do is a bet.
At the cost of one extra bit for the CT_STATE area, with CT_STATE_KERNEL=1 we could do:
old = atomic_read(&ct->state); old &= ~CT_STATE_KERNEL;
And perhaps also old |= CT_STATE_IDLE (I'm seeing the last patch now), so you at least get a chance of making it right (only ~CT_STATE_KERNEL will always fail) and CPUs usually spend most of their time idle.
Thanks.
On 20/11/24 18:30, Frederic Weisbecker wrote:
On Wed, Nov 20, 2024 at 06:10:43PM +0100, Valentin Schneider wrote:
On 20/11/24 15:23, Frederic Weisbecker wrote:
Ah but there is CT_STATE_GUEST and I see the last patch also applies that to CT_STATE_IDLE.
So that could be:
bool ct_set_cpu_work(unsigned int cpu, unsigned int work) { struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu); unsigned int old; bool ret = false;
preempt_disable();
old = atomic_read(&ct->state);
/* CT_STATE_IDLE can be added to last patch here */ if (!(old & (CT_STATE_USER | CT_STATE_GUEST))) { old &= ~CT_STATE_MASK; old |= CT_STATE_USER; }
Hmph, so that lets us leverage the cmpxchg for a !CT_STATE_KERNEL check, but we get an extra loop if the target CPU exits kernelspace not to userspace (e.g. vcpu or idle) in the meantime - not great, not terrible.
The thing is, what you read with atomic_read() should be close to reality. If it already is != CT_STATE_KERNEL then you're good (minus racy changes). If it is CT_STATE_KERNEL then you still must do a failing cmpxchg() in any case, at least to make sure you didn't miss a context tracking change. So the best you can do is a bet.
At the cost of one extra bit for the CT_STATE area, with CT_STATE_KERNEL=1 we could do:
old = atomic_read(&ct->state); old &= ~CT_STATE_KERNEL;
And perhaps also old |= CT_STATE_IDLE (I'm seeing the last patch now), so you at least get a chance of making it right (only ~CT_STATE_KERNEL will always fail) and CPUs usually spend most of their time idle.
I'm thinking with:
CT_STATE_IDLE   = 0,
CT_STATE_USER   = 1,
CT_STATE_GUEST  = 2,
CT_STATE_KERNEL = 4, /* Keep that as a standalone bit */
we can stick with old &= ~CT_STATE_KERNEL; and that'll let the cmpxchg succeed for any of IDLE/USER/GUEST.
Thanks.
On Fri, Nov 22, 2024 at 03:56:59PM +0100, Valentin Schneider wrote:
On 20/11/24 18:30, Frederic Weisbecker wrote:
On Wed, Nov 20, 2024 at 06:10:43PM +0100, Valentin Schneider wrote:
On 20/11/24 15:23, Frederic Weisbecker wrote:
Ah but there is CT_STATE_GUEST and I see the last patch also applies that to CT_STATE_IDLE.
So that could be:
bool ct_set_cpu_work(unsigned int cpu, unsigned int work) { struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu); unsigned int old; bool ret = false;
preempt_disable();
old = atomic_read(&ct->state);
/* CT_STATE_IDLE can be added to last patch here */ if (!(old & (CT_STATE_USER | CT_STATE_GUEST))) { old &= ~CT_STATE_MASK; old |= CT_STATE_USER; }
Hmph, so that lets us leverage the cmpxchg for a !CT_STATE_KERNEL check, but we get an extra loop if the target CPU exits kernelspace not to userspace (e.g. vcpu or idle) in the meantime - not great, not terrible.
The thing is, what you read with atomic_read() should be close to reality. If it already is != CT_STATE_KERNEL then you're good (minus racy changes). If it is CT_STATE_KERNEL then you still must do a failing cmpxchg() in any case, at least to make sure you didn't miss a context tracking change. So the best you can do is a bet.
At the cost of one extra bit for the CT_STATE area, with CT_STATE_KERNEL=1 we could do:
old = atomic_read(&ct->state); old &= ~CT_STATE_KERNEL;
And perhaps also old |= CT_STATE_IDLE (I'm seeing the last patch now), so you at least get a chance of making it right (only ~CT_STATE_KERNEL will always fail) and CPUs usually spend most of their time idle.
I'm thinking with:
CT_STATE_IDLE = 0, CT_STATE_USER = 1, CT_STATE_GUEST = 2, CT_STATE_KERNEL = 4, /* Keep that as a standalone bit */
Right!
we can stick with old &= ~CT_STATE_KERNEL; and that'll let the cmpxchg succeed for any of IDLE/USER/GUEST.
Sure but if (old & CT_STATE_KERNEL), cmpxchg() will consistently fail. But you can make a bet that it has switched to CT_STATE_IDLE between the atomic_read() and the first atomic_cmpxchg(). This way you still have a tiny chance to succeed.
That is:
old = atomic_read(&ct->state);
if (old & CT_STATE_KERNEL)
        old |= CT_STATE_IDLE;
old &= ~CT_STATE_KERNEL;

do {
        atomic_try_cmpxchg(...)
Hmm?
On 24/11/24 22:46, Frederic Weisbecker wrote:
On Fri, Nov 22, 2024 at 03:56:59PM +0100, Valentin Schneider wrote:
On 20/11/24 18:30, Frederic Weisbecker wrote:
On Wed, Nov 20, 2024 at 06:10:43PM +0100, Valentin Schneider wrote:
On 20/11/24 15:23, Frederic Weisbecker wrote:
Ah but there is CT_STATE_GUEST and I see the last patch also applies that to CT_STATE_IDLE.
So that could be:
bool ct_set_cpu_work(unsigned int cpu, unsigned int work) { struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu); unsigned int old; bool ret = false;
preempt_disable();
old = atomic_read(&ct->state);
/* CT_STATE_IDLE can be added to last patch here */ if (!(old & (CT_STATE_USER | CT_STATE_GUEST))) { old &= ~CT_STATE_MASK; old |= CT_STATE_USER; }
Hmph, so that lets us leverage the cmpxchg for a !CT_STATE_KERNEL check, but we get an extra loop if the target CPU exits kernelspace not to userspace (e.g. vcpu or idle) in the meantime - not great, not terrible.
The thing is, what you read with atomic_read() should be close to reality. If it already is != CT_STATE_KERNEL then you're good (minus racy changes). If it is CT_STATE_KERNEL then you still must do a failing cmpxchg() in any case, at least to make sure you didn't miss a context tracking change. So the best you can do is a bet.
At the cost of one extra bit for the CT_STATE area, with CT_STATE_KERNEL=1 we could do:
old = atomic_read(&ct->state); old &= ~CT_STATE_KERNEL;
And perhaps also old |= CT_STATE_IDLE (I'm seeing the last patch now), so you at least get a chance of making it right (only ~CT_STATE_KERNEL will always fail) and CPUs usually spend most of their time idle.
I'm thinking with:
CT_STATE_IDLE = 0, CT_STATE_USER = 1, CT_STATE_GUEST = 2, CT_STATE_KERNEL = 4, /* Keep that as a standalone bit */
Right!
we can stick with old &= ~CT_STATE_KERNEL; and that'll let the cmpxchg succeed for any of IDLE/USER/GUEST.
Sure but if (old & CT_STATE_KERNEL), cmpxchg() will consistently fail. But you can make a bet that it has switched to CT_STATE_IDLE between the atomic_read() and the first atomic_cmpxchg(). This way you still have a tiny chance to succeed.
That is:
old = atomic_read(&ct->state);
if (old & CT_STATE_KERNEL)
        old |= CT_STATE_IDLE;
old &= ~CT_STATE_KERNEL;

do {
        atomic_try_cmpxchg(...)
Hmm?
But it could equally be CT_STATE_{USER, GUEST}, right? That is, if we have all of this enabled then we assume the isolated CPUs spend the least amount of time in the kernel; if they don't, we get to blame the user.
On Fri, Nov 29, 2024 at 05:40:29PM +0100, Valentin Schneider wrote:
On 24/11/24 22:46, Frederic Weisbecker wrote:
On Fri, Nov 22, 2024 at 03:56:59PM +0100, Valentin Schneider wrote:
On 20/11/24 18:30, Frederic Weisbecker wrote:
On Wed, Nov 20, 2024 at 06:10:43PM +0100, Valentin Schneider wrote:
On 20/11/24 15:23, Frederic Weisbecker wrote:
Ah but there is CT_STATE_GUEST and I see the last patch also applies that to CT_STATE_IDLE.
So that could be:
bool ct_set_cpu_work(unsigned int cpu, unsigned int work) { struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu); unsigned int old; bool ret = false;
preempt_disable();
old = atomic_read(&ct->state);
/* CT_STATE_IDLE can be added to last patch here */ if (!(old & (CT_STATE_USER | CT_STATE_GUEST))) { old &= ~CT_STATE_MASK; old |= CT_STATE_USER; }
Hmph, so that lets us leverage the cmpxchg for a !CT_STATE_KERNEL check, but we get an extra loop if the target CPU exits kernelspace not to userspace (e.g. vcpu or idle) in the meantime - not great, not terrible.
The thing is, what you read with atomic_read() should be close to reality. If it already is != CT_STATE_KERNEL then you're good (minus racy changes). If it is CT_STATE_KERNEL then you still must do a failing cmpxchg() in any case, at least to make sure you didn't miss a context tracking change. So the best you can do is a bet.
At the cost of one extra bit for the CT_STATE area, with CT_STATE_KERNEL=1 we could do:
old = atomic_read(&ct->state); old &= ~CT_STATE_KERNEL;
And perhaps also old |= CT_STATE_IDLE (I'm seeing the last patch now), so you at least get a chance of making it right (only ~CT_STATE_KERNEL will always fail) and CPUs usually spend most of their time idle.
I'm thinking with:
CT_STATE_IDLE = 0, CT_STATE_USER = 1, CT_STATE_GUEST = 2, CT_STATE_KERNEL = 4, /* Keep that as a standalone bit */
Right!
we can stick with old &= ~CT_STATE_KERNEL; and that'll let the cmpxchg succeed for any of IDLE/USER/GUEST.
Sure but if (old & CT_STATE_KERNEL), cmpxchg() will consistently fail. But you can make a bet that it has switched to CT_STATE_IDLE between the atomic_read() and the first atomic_cmpxchg(). This way you still have a tiny chance to succeed.
That is:
old = atomic_read(&ct->state);
if (old & CT_STATE_KERNEL)
        old |= CT_STATE_IDLE;
old &= ~CT_STATE_KERNEL;

do {
        atomic_try_cmpxchg(...)
Hmm?
But it could equally be CT_STATE_{USER, GUEST}, right? That is, if we have all of this enabled then we assume the isolated CPUs spend the least amount of time in the kernel; if they don't, we get to blame the user.
Unless CONTEXT_TRACKING_WORK_IDLE=y yes.
Anyway, that's just a detail that can be refined in the future. I'm fine with just clearing CT_STATE_KERNEL and going with that.
Thanks.
text_poke_bp_batch() sends IPIs to all online CPUs to synchronize them vs the newly patched instruction. CPUs that are executing in userspace do not need this synchronization to happen immediately, and this is actually harmful interference for NOHZ_FULL CPUs.
As the synchronization IPIs are sent using a blocking call, returning from text_poke_bp_batch() implies all CPUs will observe the patched instruction(s), and this should be preserved even if the IPI is deferred. In other words, to safely defer this synchronization, any kernel instruction leading to the execution of the deferred instruction sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
This means we must pay attention to mutable instructions in the early entry code:
- alternatives
- static keys
- all sorts of probes (kprobes/ftrace/bpf/???)
The early entry code leading to ct_work_flush() is noinstr, which gets rid of the probes.
Alternatives are safe, because it's boot-time patching (before SMP is even brought up) which is before any IPI deferral can happen.
This leaves us with static keys. Any static key used in early entry code should only ever be enabled at boot time and never flipped afterwards, IOW __ro_after_init (pretty much like alternatives). Exceptions are marked as forceful and will always generate an IPI when flipped. Objtool can now point at static keys that don't respect this, and all static keys used in early entry code have been verified to behave accordingly.
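For illustration, a static key that is (transitively) used in early entry code would typically be declared with the existing *_RO variants, along these lines (a minimal sketch; the key, setup function and boot_condition names are made up):

  #include <linux/jump_label.h>
  #include <linux/init.h>

  /* Only ever flipped during __init, read-only afterwards */
  static DEFINE_STATIC_KEY_FALSE_RO(early_entry_key);

  static int __init early_entry_key_setup(void)
  {
          if (boot_condition)     /* placeholder */
                  static_branch_enable(&early_entry_key);
          return 0;
  }
  early_initcall(early_entry_key_setup);

Keys that genuinely must be flipped at runtime are the ones that need the "forceful" marking introduced earlier in the series.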
Leverage the new context_tracking infrastructure to defer sync_core() IPIs to a target CPU's next kernel entry.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  6 ++--
 arch/x86/include/asm/text-patching.h         |  1 +
 arch/x86/kernel/alternative.c                | 33 +++++++++++++++++---
 arch/x86/kernel/kprobes/core.c               |  4 +--
 arch/x86/kernel/kprobes/opt.c                |  4 +--
 arch/x86/kernel/module.c                     |  2 +-
 include/linux/context_tracking_work.h        |  4 +--
 7 files changed, 41 insertions(+), 13 deletions(-)
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h index 5bc29e6b2ed38..2c66687ce00e2 100644 --- a/arch/x86/include/asm/context_tracking_work.h +++ b/arch/x86/include/asm/context_tracking_work.h @@ -2,11 +2,13 @@ #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H #define _ASM_X86_CONTEXT_TRACKING_WORK_H
+#include <asm/sync_core.h> + static __always_inline void arch_context_tracking_work(int work) { switch (work) { - case CONTEXT_WORK_n: - // Do work... + case CONTEXT_WORK_SYNC: + sync_core(); break; } } diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index e34de36cab61e..37344e95f738f 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -33,6 +33,7 @@ extern void apply_relocation(u8 *buf, const u8 * const instr, size_t instrlen, u */ extern void *text_poke(void *addr, const void *opcode, size_t len); extern void text_poke_sync(void); +extern void text_poke_sync_deferrable(void); extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len); extern void *text_poke_copy(void *addr, const void *opcode, size_t len); extern void *text_poke_copy_locked(void *addr, const void *opcode, size_t len, bool core_ok); diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 954c4c0f7fc58..4ce224d927b03 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -18,6 +18,7 @@ #include <linux/mmu_context.h> #include <linux/bsearch.h> #include <linux/sync_core.h> +#include <linux/context_tracking.h> #include <asm/text-patching.h> #include <asm/alternative.h> #include <asm/sections.h> @@ -2080,9 +2081,24 @@ static void do_sync_core(void *info) sync_core(); }
+static bool do_sync_core_defer_cond(int cpu, void *info) +{ + return !ct_set_cpu_work(cpu, CONTEXT_WORK_SYNC); +} + +static void __text_poke_sync(smp_cond_func_t cond_func) +{ + on_each_cpu_cond(cond_func, do_sync_core, NULL, 1); +} + void text_poke_sync(void) { - on_each_cpu(do_sync_core, NULL, 1); + __text_poke_sync(NULL); +} + +void text_poke_sync_deferrable(void) +{ + __text_poke_sync(do_sync_core_defer_cond); }
/* @@ -2257,6 +2273,8 @@ static int tp_vec_nr; static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries) { unsigned char int3 = INT3_INSN_OPCODE; + bool force_ipi = false; + void (*sync_fn)(void); unsigned int i; int do_sync;
@@ -2291,11 +2309,18 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries * First step: add a int3 trap to the address that will be patched. */ for (i = 0; i < nr_entries; i++) { + /* + * Record that we need to send the IPI if at least one location + * in the batch requires it. + */ + force_ipi |= tp[i].force_ipi; tp[i].old = *(u8 *)text_poke_addr(&tp[i]); text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE); }
- text_poke_sync(); + sync_fn = force_ipi ? text_poke_sync : text_poke_sync_deferrable; + + sync_fn();
/* * Second step: update all but the first byte of the patched range. @@ -2357,7 +2382,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries * not necessary and we'd be safe even without it. But * better safe than sorry (plus there's not only Intel). */ - text_poke_sync(); + sync_fn(); }
/* @@ -2378,7 +2403,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries }
if (do_sync) - text_poke_sync(); + sync_fn();
/* * Remove and wait for refs to be zero. diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c index 72e6a45e7ec24..c2fd2578fd5fc 100644 --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -817,7 +817,7 @@ void arch_arm_kprobe(struct kprobe *p) u8 int3 = INT3_INSN_OPCODE;
text_poke(p->addr, &int3, 1); - text_poke_sync(); + text_poke_sync_deferrable(); perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1); }
@@ -827,7 +827,7 @@ void arch_disarm_kprobe(struct kprobe *p)
perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1); text_poke(p->addr, &p->opcode, 1); - text_poke_sync(); + text_poke_sync_deferrable(); }
void arch_remove_kprobe(struct kprobe *p) diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c index 36d6809c6c9e1..b2ce4d9c3ba56 100644 --- a/arch/x86/kernel/kprobes/opt.c +++ b/arch/x86/kernel/kprobes/opt.c @@ -513,11 +513,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op) JMP32_INSN_SIZE - INT3_INSN_SIZE);
text_poke(addr, new, INT3_INSN_SIZE); - text_poke_sync(); + text_poke_sync_deferrable(); text_poke(addr + INT3_INSN_SIZE, new + INT3_INSN_SIZE, JMP32_INSN_SIZE - INT3_INSN_SIZE); - text_poke_sync(); + text_poke_sync_deferrable();
perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE); } diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c index 837450b6e882f..00e71ad30c01d 100644 --- a/arch/x86/kernel/module.c +++ b/arch/x86/kernel/module.c @@ -191,7 +191,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs, write, apply);
if (!early) { - text_poke_sync(); + text_poke_sync_deferrable(); mutex_unlock(&text_mutex); }
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h index fb74db8876dd2..13fc97b395030 100644 --- a/include/linux/context_tracking_work.h +++ b/include/linux/context_tracking_work.h @@ -5,12 +5,12 @@ #include <linux/bitops.h>
enum { - CONTEXT_WORK_n_OFFSET, + CONTEXT_WORK_SYNC_OFFSET, CONTEXT_WORK_MAX_OFFSET };
enum ct_work { - CONTEXT_WORK_n = BIT(CONTEXT_WORK_n_OFFSET), + CONTEXT_WORK_SYNC = BIT(CONTEXT_WORK_SYNC_OFFSET), CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET) };
On Tue, Nov 19, 2024 at 04:34:59PM +0100, Valentin Schneider wrote:
+static void __text_poke_sync(smp_cond_func_t cond_func) +{
- on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
+}
void text_poke_sync(void) {
- on_each_cpu(do_sync_core, NULL, 1);
- __text_poke_sync(NULL);
+}
+void text_poke_sync_deferrable(void) +{
- __text_poke_sync(do_sync_core_defer_cond);
}
How about we unwrap some of that like so:
/* @@ -2257,6 +2273,8 @@ static int tp_vec_nr; static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries) { unsigned char int3 = INT3_INSN_OPCODE;
- bool force_ipi = false;
- void (*sync_fn)(void);
smp_cond_func_t cond = do_sync_core_defer_cond;
unsigned int i; int do_sync; @@ -2291,11 +2309,18 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries * First step: add a int3 trap to the address that will be patched. */ for (i = 0; i < nr_entries; i++) {
/*
* Record that we need to send the IPI if at least one location
* in the batch requires it.
*/
force_ipi |= tp[i].force_ipi;
if (tp[i].force_ipi) cond = NULL;
tp[i].old = *(u8 *)text_poke_addr(&tp[i]); text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
}
- text_poke_sync();
- sync_fn = force_ipi ? text_poke_sync : text_poke_sync_deferrable;
- sync_fn();
__text_poke_sync(cond);
Kernel TLB invalidation IPIs are a common source of interference on NOHZ_FULL CPUs. Given NOHZ_FULL CPUs executing in userspace are not accessing any kernel addresses, these invalidations do not need to happen immediately, and can be deferred until the next user->kernel transition.
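On the sending side this mirrors the text patching deferral; a rough sketch of the wiring (the cond function name here is made up, and the actual hunk plugging this into flush_tlb_kernel_range() comes in a later patch):

  static bool flush_tlb_kernel_defer_cond(int cpu, void *info)
  {
          /* false == skip the IPI, the flush was deferred to kernel entry */
          return !ct_set_cpu_work(cpu, CONTEXT_WORK_TLBI);
  }

  /* ... the kernel range flush then becomes: */
  on_each_cpu_cond(flush_tlb_kernel_defer_cond, do_flush_tlb_all, NULL, 1);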
Add a minimal, noinstr-compliant variant of __flush_tlb_all() that doesn't try to leverage INVPCID. To achieve this:
o Add a noinstr variant of native_write_cr4(), keeping native_write_cr4() as "only" __no_profile.
o Make invalidate_user_asid() __always_inline
XXX: what about paravirt? Should these instead be new operations made available through pv_ops.mmu.*? x86_64_start_kernel() uses __native_tlb_flush_global() regardless of paravirt, so I'm thinking the paravirt variants are more optimizations than hard requirements?
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  4 +++
 arch/x86/include/asm/special_insns.h         |  1 +
 arch/x86/include/asm/tlbflush.h              | 16 ++++++++++--
 arch/x86/kernel/cpu/common.c                 |  6 ++++-
 arch/x86/mm/tlb.c                            | 26 ++++++++++++++++++--
 include/linux/context_tracking_work.h        |  2 ++
 6 files changed, 50 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h index 2c66687ce00e2..9d4f021b5a45b 100644 --- a/arch/x86/include/asm/context_tracking_work.h +++ b/arch/x86/include/asm/context_tracking_work.h @@ -3,6 +3,7 @@ #define _ASM_X86_CONTEXT_TRACKING_WORK_H
#include <asm/sync_core.h> +#include <asm/tlbflush.h>
static __always_inline void arch_context_tracking_work(int work) { @@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(int work) case CONTEXT_WORK_SYNC: sync_core(); break; + case CONTEXT_WORK_TLBI: + __flush_tlb_all_noinstr(); + break; } }
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index aec6e2d3aa1d5..b97157a69d48e 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -74,6 +74,7 @@ static inline unsigned long native_read_cr4(void) return val; }
+void native_write_cr4_noinstr(unsigned long val); void native_write_cr4(unsigned long val);
#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 69e79fff41b80..a653b5f47f0e6 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -18,6 +18,7 @@ DECLARE_PER_CPU(u64, tlbstate_untag_mask);
void __flush_tlb_all(void); +void noinstr __flush_tlb_all_noinstr(void);
#define TLB_FLUSH_ALL -1UL #define TLB_GENERATION_INVALID 0 @@ -418,9 +419,20 @@ static inline void cpu_tlbstate_update_lam(unsigned long lam, u64 untag_mask) #endif #endif /* !MODULE */
+#define __NATIVE_TLB_FLUSH_GLOBAL(suffix, cr4) \ + native_write_cr4##suffix(cr4 ^ X86_CR4_PGE); \ + native_write_cr4##suffix(cr4) +#define NATIVE_TLB_FLUSH_GLOBAL(cr4) __NATIVE_TLB_FLUSH_GLOBAL(, cr4) +#define NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4) __NATIVE_TLB_FLUSH_GLOBAL(_noinstr, cr4) + static inline void __native_tlb_flush_global(unsigned long cr4) { - native_write_cr4(cr4 ^ X86_CR4_PGE); - native_write_cr4(cr4); + NATIVE_TLB_FLUSH_GLOBAL(cr4); } + +static inline void __native_tlb_flush_global_noinstr(unsigned long cr4) +{ + NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4); +} + #endif /* _ASM_X86_TLBFLUSH_H */ diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index a5f221ea56888..a84bb8511650b 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -424,7 +424,7 @@ void native_write_cr0(unsigned long val) } EXPORT_SYMBOL(native_write_cr0);
-void __no_profile native_write_cr4(unsigned long val) +noinstr void native_write_cr4_noinstr(unsigned long val) { unsigned long bits_changed = 0;
@@ -442,6 +442,10 @@ void __no_profile native_write_cr4(unsigned long val) bits_changed); } } +void native_write_cr4(unsigned long val) +{ + native_write_cr4_noinstr(val); +} #if IS_MODULE(CONFIG_LKDTM) EXPORT_SYMBOL_GPL(native_write_cr4); #endif diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 86593d1b787d8..973a4ab3f53b3 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -256,7 +256,7 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen, * * See SWITCH_TO_USER_CR3. */ -static inline void invalidate_user_asid(u16 asid) +static __always_inline void invalidate_user_asid(u16 asid) { /* There is no user ASID if address space separation is off */ if (!IS_ENABLED(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION)) @@ -1198,7 +1198,7 @@ STATIC_NOPV void native_flush_tlb_global(void) /* * Flush the entire current user mapping */ -STATIC_NOPV void native_flush_tlb_local(void) +static noinstr void native_flush_tlb_local_noinstr(void) { /* * Preemption or interrupts must be disabled to protect the access @@ -1213,6 +1213,11 @@ STATIC_NOPV void native_flush_tlb_local(void) native_write_cr3(__native_read_cr3()); }
+STATIC_NOPV void native_flush_tlb_local(void) +{ + native_flush_tlb_local_noinstr(); +} + void flush_tlb_local(void) { __flush_tlb_local(); @@ -1240,6 +1245,23 @@ void __flush_tlb_all(void) } EXPORT_SYMBOL_GPL(__flush_tlb_all);
+void noinstr __flush_tlb_all_noinstr(void) +{ + /* + * This is for invocation in early entry code that cannot be + * instrumented. A RMW to CR4 works for most cases, but relies on + * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID + * would require also resetting CR3.PCID, so just try with CR4.PGE, else + * do the CR3 write. + * + * XXX: this gives paravirt the finger. + */ + if (cpu_feature_enabled(X86_FEATURE_PGE)) + __native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4)); + else + native_flush_tlb_local_noinstr(); +} + void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) { struct flush_tlb_info *info; diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h index 13fc97b395030..47d5ced39a43a 100644 --- a/include/linux/context_tracking_work.h +++ b/include/linux/context_tracking_work.h @@ -6,11 +6,13 @@
enum { CONTEXT_WORK_SYNC_OFFSET, + CONTEXT_WORK_TLBI_OFFSET, CONTEXT_WORK_MAX_OFFSET };
enum ct_work { CONTEXT_WORK_SYNC = BIT(CONTEXT_WORK_SYNC_OFFSET), + CONTEXT_WORK_TLBI = BIT(CONTEXT_WORK_TLBI_OFFSET), CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET) };
On Tue, Nov 19, 2024 at 04:35:00PM +0100, Valentin Schneider wrote:
+void noinstr __flush_tlb_all_noinstr(void) +{
- /*
* This is for invocation in early entry code that cannot be
* instrumented. A RMW to CR4 works for most cases, but relies on
* being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
* would require also resetting CR3.PCID, so just try with CR4.PGE, else
* do the CR3 write.
*
* XXX: this gives paravirt the finger.
*/
- if (cpu_feature_enabled(X86_FEATURE_PGE))
__native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4));
- else
native_flush_tlb_local_noinstr();
+}
Urgh, so that's a lot of ugleh, and cr4 has that pinning stuff and gah.
Why not always just do the CR3 write and call it a day? That should also work for paravirt, no? Just make the whole write_cr3 thing noinstr and voila.
On Wed, Nov 20, 2024 at 04:22:16PM +0100, Peter Zijlstra wrote:
On Tue, Nov 19, 2024 at 04:35:00PM +0100, Valentin Schneider wrote:
+void noinstr __flush_tlb_all_noinstr(void) +{
- /*
* This is for invocation in early entry code that cannot be
* instrumented. A RMW to CR4 works for most cases, but relies on
* being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
* would require also resetting CR3.PCID, so just try with CR4.PGE, else
* do the CR3 write.
*
* XXX: this gives paravirt the finger.
*/
- if (cpu_feature_enabled(X86_FEATURE_PGE))
__native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4));
- else
native_flush_tlb_local_noinstr();
+}
Urgh, so that's a lot of ugleh, and cr4 has that pinning stuff and gah.
Why not always just do the CR3 write and call it a day? That should also work for paravirt, no? Just make the whole write_cr3 thing noinstr and voila.
Oh gawd, just having looked at xen_write_cr3() this might not be entirely trivial to mark noinstr :/
On 20/11/24 16:32, Peter Zijlstra wrote:
On Wed, Nov 20, 2024 at 04:22:16PM +0100, Peter Zijlstra wrote:
On Tue, Nov 19, 2024 at 04:35:00PM +0100, Valentin Schneider wrote:
+void noinstr __flush_tlb_all_noinstr(void) +{
- /*
* This is for invocation in early entry code that cannot be
* instrumented. A RMW to CR4 works for most cases, but relies on
* being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
* would require also resetting CR3.PCID, so just try with CR4.PGE, else
* do the CR3 write.
*
* XXX: this gives paravirt the finger.
*/
- if (cpu_feature_enabled(X86_FEATURE_PGE))
__native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4));
- else
native_flush_tlb_local_noinstr();
+}
Urgh, so that's a lot of ugleh, and cr4 has that pinning stuff and gah.
Why not always just do the CR3 write and call it a day? That should also work for paravirt, no? Just make the whole write_cr3 thing noinstr and voila.
Oh gawd, just having looked at xen_write_cr3() this might not be entirely trivial to mark noinstr :/
... I hadn't even seen that.
AIUI the CR3 RMW is not "enough" if we have PGE enabled, because then global pages aren't flushed.
The question becomes: what is held in global pages and do we care about that when it comes to vmalloc()? I'm starting to think no, but this is x86, I don't know what surprises are waiting for me.
I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag, and it correctly uses the non-deferrable flush_tlb_kernel_range().
On Wed, Nov 20, 2024 at 06:24:56PM +0100, Valentin Schneider wrote:
Oh gawd, just having looked at xen_write_cr3() this might not be entirely trivial to mark noinstr :/
... I hadn't even seen that.
AIUI the CR3 RMW is not "enough" if we have PGE enabled, because then global pages aren't flushed.
The question becomes: what is held in global pages and do we care about that when it comes to vmalloc()? I'm starting to think no, but this is x86, I don't know what surprises are waiting for me.
I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag, and it correctly uses the non-deferrable flush_tlb_kernel_range().
I always forget what we use global pages for, dhansen might know, but let me try and have a look.
I *think* we only have GLOBAL on kernel text, and that only sometimes.
On 11/21/24 03:12, Peter Zijlstra wrote:
I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag, and it correctly uses the non-deferrable flush_tlb_kernel_range().
I always forget what we use global pages for, dhansen might know, but let me try and have a look.
I *think* we only have GLOBAL on kernel text, and that only sometimes.
I think you're remembering how _PAGE_GLOBAL gets used when KPTI is in play.
Ignoring KPTI for a sec... We use _PAGE_GLOBAL for all kernel mappings. Before PCIDs, global mappings let the kernel TLB entries live across CR3 writes. When PCIDs are in play, global mappings let two different ASIDs share TLB entries.
When KPTI is around, the kernel writes CR3 at user/kernel switches to make sure secrets are unmapped and can't be leaked by Meltdown. But unmapping those secrets doesn't do squat if they were mapped globally since they'll still be in the TLB and quite usable. There, we're more judicious and only mark performance-sensitive things that are not secret to be global, like kernel text.
On Thu, Nov 21, 2024 at 07:07:44AM -0800, Dave Hansen wrote:
On 11/21/24 03:12, Peter Zijlstra wrote:
I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag, and it correctly uses the non-deferrable flush_tlb_kernel_range().
I always forget what we use global pages for, dhansen might know, but let me try and have a look.
I *think* we only have GLOBAL on kernel text, and that only sometimes.
I think you're remembering how _PAGE_GLOBAL gets used when KPTI is in play.
Yah, I suppose I am. That was the last time I had a good look at this stuff :-)
Ignoring KPTI for a sec... We use _PAGE_GLOBAL for all kernel mappings. Before PCIDs, global mappings let the kernel TLB entries live across CR3 writes. When PCIDs are in play, global mappings let two different ASIDs share TLB entries.
Hurmph.. bah. That means we do need that horrible CR4 dance :/
On Thu, 21 Nov 2024 16:30:16 +0100 Peter Zijlstra peterz@infradead.org wrote:
On Thu, Nov 21, 2024 at 07:07:44AM -0800, Dave Hansen wrote:
On 11/21/24 03:12, Peter Zijlstra wrote:
I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag, and it correctly uses the non-deferrable flush_tlb_kernel_range().
I always forget what we use global pages for, dhansen might know, but let me try and have a look.
I *think* we only have GLOBAL on kernel text, and that only sometimes.
I think you're remembering how _PAGE_GLOBAL gets used when KPTI is in play.
Yah, I suppose I am. That was the last time I had a good look at this stuff :-)
Ignoring KPTI for a sec... We use _PAGE_GLOBAL for all kernel mappings. Before PCIDs, global mappings let the kernel TLB entries live across CR3 writes. When PCIDs are in play, global mappings let two different ASIDs share TLB entries.
Hurmph.. bah. That means we do need that horrible CR4 dance :/
In general, yes.
But I wonder what exactly was the original scenario encountered by Valentin. I mean, if TLB entry invalidations were necessary to sync changes to kernel text after flipping a static branch, then it might be less overhead to make a list of affected pages and call INVLPG on them.
AFAIK there is currently no such IPI function for doing that, but we could add one if the list of invalidated global pages is reasonably short, of course.
Valentin, do you happen to know?
Petr T
On 05/12/24 18:31, Petr Tesarik wrote:
On Thu, 21 Nov 2024 16:30:16 +0100 Peter Zijlstra peterz@infradead.org wrote:
On Thu, Nov 21, 2024 at 07:07:44AM -0800, Dave Hansen wrote:
On 11/21/24 03:12, Peter Zijlstra wrote:
I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag, and it correctly uses the non-deferrable flush_tlb_kernel_range().
I always forget what we use global pages for, dhansen might know, but let me try and have a look.
I *think* we only have GLOBAL on kernel text, and that only sometimes.
I think you're remembering how _PAGE_GLOBAL gets used when KPTI is in play.
Yah, I suppose I am. That was the last time I had a good look at this stuff :-)
Ignoring KPTI for a sec... We use _PAGE_GLOBAL for all kernel mappings. Before PCIDs, global mappings let the kernel TLB entries live across CR3 writes. When PCIDs are in play, global mappings let two different ASIDs share TLB entries.
Hurmph.. bah. That means we do need that horrible CR4 dance :/
In general, yes.
But I wonder what exactly was the original scenario encountered by Valentin. I mean, if TLB entry invalidations were necessary to sync changes to kernel text after flipping a static branch, then it might be less overhead to make a list of affected pages and call INVLPG on them.
AFAIK there is currently no such IPI function for doing that, but we could add one if the list of invalidated global pages is reasonably short, of course.
Valentin, do you happen to know?
So from my experimentation (hackbench + kernel compilation on housekeeping CPUs, dummy while(1) userspace loop on isolated CPUs), the TLB flushes only occurred from vunmap() - mainly from all the hackbench threads coming and going.
Static branch updates only seem to trigger the sync_core() IPI, at least on x86.
Petr T
On Mon, Dec 09, 2024 at 01:04:43PM +0100, Valentin Schneider wrote:
But I wonder what exactly was the original scenario encountered by Valentin. I mean, if TLB entry invalidations were necessary to sync changes to kernel text after flipping a static branch, then it might be less overhead to make a list of affected pages and call INVLPG on them.
No; TLB is not involved with text patching (on x86).
Valentin, do you happen to know?
So from my experimentation (hackbench + kernel compilation on housekeeping CPUs, dummy while(1) userspace loop on isolated CPUs), the TLB flushes only occurred from vunmap() - mainly from all the hackbench threads coming and going.
Right, we have virtually mapped stacks.
On Mon, 9 Dec 2024 13:12:49 +0100 Peter Zijlstra peterz@infradead.org wrote:
On Mon, Dec 09, 2024 at 01:04:43PM +0100, Valentin Schneider wrote:
But I wonder what exactly was the original scenario encountered by Valentin. I mean, if TLB entry invalidations were necessary to sync changes to kernel text after flipping a static branch, then it might be less overhead to make a list of affected pages and call INVLPG on them.
No; TLB is not involved with text patching (on x86).
Valentin, do you happen to know?
So from my experimentation (hackbench + kernel compilation on housekeeping CPUs, dummy while(1) userspace loop on isolated CPUs), the TLB flushes only occurred from vunmap() - mainly from all the hackbench threads coming and going.
Right, we have virtually mapped stacks.
Wait... Are you talking about the kernel stack? But that's only 4 pages (or 8 pages with KASAN), so that should be easily handled with INVLPG. No CR4 dances are needed for that.
What am I missing?
Petr T
On 09/12/24 15:42, Petr Tesarik wrote:
On Mon, 9 Dec 2024 13:12:49 +0100 Peter Zijlstra peterz@infradead.org wrote:
On Mon, Dec 09, 2024 at 01:04:43PM +0100, Valentin Schneider wrote:
But I wonder what exactly was the original scenario encountered by Valentin. I mean, if TLB entry invalidations were necessary to sync changes to kernel text after flipping a static branch, then it might be less overhead to make a list of affected pages and call INVLPG on them.
No; TLB is not involved with text patching (on x86).
Valentin, do you happen to know?
So from my experimentation (hackbench + kernel compilation on housekeeping CPUs, dummy while(1) userspace loop on isolated CPUs), the TLB flushes only occurred from vunmap() - mainly from all the hackbench threads coming and going.
Right, we have virtually mapped stacks.
Wait... Are you talking about the kernel stack? But that's only 4 pages (or 8 pages with KASAN), so that should be easily handled with INVLPG. No CR4 dances are needed for that.
What am I missing?
So the gist of the IPI deferral thing is to coalesce IPI callbacks into a single flag value that is read & acted on upon kernel entry. Freeing a task's kernel stack is not the only thing that can issue a vunmap(), so instead of tracking all the pages affected by the unmap (which is potentially an ever-growing memory leak as long as no kernel entry happens on the isolated CPUs), we just flush everything.
Quick tracing with my dummy benchmark mostly shows
vfree_atomic() -> drain_vmap_work()
but pretty much any vfree() / kvfree_rcu() from the housekeeping CPUs can get us that IPI.
On Tue, 10 Dec 2024 14:53:36 +0100 Valentin Schneider vschneid@redhat.com wrote:
On 09/12/24 15:42, Petr Tesarik wrote:
On Mon, 9 Dec 2024 13:12:49 +0100 Peter Zijlstra peterz@infradead.org wrote:
On Mon, Dec 09, 2024 at 01:04:43PM +0100, Valentin Schneider wrote:
But I wonder what exactly was the original scenario encountered by Valentin. I mean, if TLB entry invalidations were necessary to sync changes to kernel text after flipping a static branch, then it might be less overhead to make a list of affected pages and call INVLPG on them.
No; TLB is not involved with text patching (on x86).
Valentin, do you happen to know?
So from my experimentation (hackbench + kernel compilation on housekeeping CPUs, dummy while(1) userspace loop on isolated CPUs), the TLB flushes only occurred from vunmap() - mainly from all the hackbench threads coming and going.
Right, we have virtually mapped stacks.
Wait... Are you talking about the kernel stack? But that's only 4 pages (or 8 pages with KASAN), so that should be easily handled with INVLPG. No CR4 dances are needed for that.
What am I missing?
So the gist of the IPI deferral thing is to coalesce IPI callbacks into a single flag value that is read & acted on upon kernel entry. Freeing a task's kernel stack is not the only thing that can issue a vunmap(), so
Thank you for confirming it's not the kernel stack. Peter's remark left me a little confused.
instead of tracking all the pages affected by the unmap (which is potentially an ever-growing memory leak as long as no kernel entry happens on the isolated CPUs), we just flush everything.
Yes, this makes some sense. Of course, there is no way to avoid the cost; we can only defer it to a "more suitable" point in time, and current low-latency requirements make kernel entry better than IPI. It is at least more predictable (as long as device interrupts are routed to other CPUs).
I have looked into ways to reduce the number of page faults _after_ flushing the TLB. FWIW if we decide to track to-be-flushed pages, we only need an array of tlb_single_page_flush_ceiling pages. If there are more, flushing the entire TLB is believed to be cheaper. That is, I merely suggest to use the same logic which is already implemented by flush_tlb_kernel_range().
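For reference, the logic I have in mind is roughly what flush_tlb_kernel_range() already does today (paraphrased from arch/x86/mm/tlb.c, details may differ between kernel versions):

  void flush_tlb_kernel_range(unsigned long start, unsigned long end)
  {
          /* Past the ceiling, a full flush is deemed cheaper than N INVLPGs */
          if (end == TLB_FLUSH_ALL ||
              (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
                  on_each_cpu(do_flush_tlb_all, NULL, 1);
          } else {
                  struct flush_tlb_info *info;

                  preempt_disable();
                  info = get_flush_tlb_info(NULL, start, end, 0, false,
                                            TLB_GENERATION_INVALID);
                  /* do_kernel_range_flush() issues one INVLPG per page */
                  on_each_cpu(do_kernel_range_flush, info, 1);
                  put_flush_tlb_info();
                  preempt_enable();
          }
  }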
Anyway, since there is no easy trick, let's leave the discussion for a later optimization. I definitely do not want to block progress on this patch series.
Thanks for all your input!
Petr T
On Mon, 09 Dec 2024 13:04:43 +0100 Valentin Schneider vschneid@redhat.com wrote:
On 05/12/24 18:31, Petr Tesarik wrote:
On Thu, 21 Nov 2024 16:30:16 +0100 Peter Zijlstra peterz@infradead.org wrote:
On Thu, Nov 21, 2024 at 07:07:44AM -0800, Dave Hansen wrote:
On 11/21/24 03:12, Peter Zijlstra wrote:
I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag, and it correctly uses the non-deferrable flush_tlb_kernel_range().
I always forget what we use global pages for, dhansen might know, but let me try and have a look.
I *think* we only have GLOBAL on kernel text, and that only sometimes.
I think you're remembering how _PAGE_GLOBAL gets used when KPTI is in play.
Yah, I suppose I am. That was the last time I had a good look at this stuff :-)
Ignoring KPTI for a sec... We use _PAGE_GLOBAL for all kernel mappings. Before PCIDs, global mappings let the kernel TLB entries live across CR3 writes. When PCIDs are in play, global mappings let two different ASIDs share TLB entries.
Hurmph.. bah. That means we do need that horrible CR4 dance :/
In general, yes.
But I wonder what exactly was the original scenario encountered by Valentin. I mean, if TLB entry invalidations were necessary to sync changes to kernel text after flipping a static branch, then it might be less overhead to make a list of affected pages and call INVLPG on them.
AFAIK there is currently no IPI function for doing that, but we could add one - provided the list of invalidated global pages is reasonably short, of course.
Valentin, do you happen to know?
So from my experimentation (hackbench + kernel compilation on housekeeping CPUs, dummy while(1) userspace loop on isolated CPUs), the TLB flushes only occurred from vunmap() - mainly from all the hackbench threads coming and going.
Static branch updates only seem to trigger the sync_core() IPI, at least on x86.
Thank you, this is helpful.
So, these allocations span more than tlb_single_page_flush_ceiling pages (default 33). Is THP enabled? If yes, we could possibly get below that threshold by improving flushing of huge pages (cf. footnote [1] in Documentation/arch/x86/tlb.rst).
OTOH even though a series of INVLPG may reduce subsequent TLB misses, it will not exactly improve latency, so it would go against the main goal of this whole patch series.
Hmmm... I see, the CR4 dance is the best solution after all. :-|
Petr T
On Tue, Nov 19, 2024 at 04:35:00PM +0100, Valentin Schneider wrote:
@@ -418,9 +419,20 @@ static inline void cpu_tlbstate_update_lam(unsigned long lam, u64 untag_mask)
 #endif
 #endif /* !MODULE */
 
+#define __NATIVE_TLB_FLUSH_GLOBAL(suffix, cr4)		\
+	native_write_cr4##suffix(cr4 ^ X86_CR4_PGE);	\
+	native_write_cr4##suffix(cr4)
+#define NATIVE_TLB_FLUSH_GLOBAL(cr4) __NATIVE_TLB_FLUSH_GLOBAL(, cr4)
+#define NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4) __NATIVE_TLB_FLUSH_GLOBAL(_noinstr, cr4)
+
 static inline void __native_tlb_flush_global(unsigned long cr4)
 {
-	native_write_cr4(cr4 ^ X86_CR4_PGE);
-	native_write_cr4(cr4);
+	NATIVE_TLB_FLUSH_GLOBAL(cr4);
 }
+
+static inline void __native_tlb_flush_global_noinstr(unsigned long cr4)
+{
+	NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4);
+}
How about something like this instead? I've only compile tested the tlb.c bit, but it should get __flush_tlb_global() to be noinstr I think, including the Xen bit (unless I missed something but then objtool should complain).
---
diff --git a/arch/x86/include/asm/invpcid.h b/arch/x86/include/asm/invpcid.h
index 734482afbf81..ff26136fcd9c 100644
--- a/arch/x86/include/asm/invpcid.h
+++ b/arch/x86/include/asm/invpcid.h
@@ -2,7 +2,7 @@
 #ifndef _ASM_X86_INVPCID
 #define _ASM_X86_INVPCID
 
-static inline void __invpcid(unsigned long pcid, unsigned long addr,
+static __always_inline void __invpcid(unsigned long pcid, unsigned long addr,
 			     unsigned long type)
 {
 	struct { u64 d[2]; } desc = { { pcid, addr } };
@@ -13,7 +13,7 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr,
 	 * mappings, we don't want the compiler to reorder any subsequent
 	 * memory accesses before the TLB flush.
 	 */
-	asm volatile("invpcid %[desc], %[type]"
+	asm_inline volatile("invpcid %[desc], %[type]"
 		     :: [desc] "m" (desc), [type] "r" (type) : "memory");
 }
 
@@ -23,26 +23,25 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr,
 #define INVPCID_TYPE_ALL_NON_GLOBAL	3
 
 /* Flush all mappings for a given pcid and addr, not including globals. */
-static inline void invpcid_flush_one(unsigned long pcid,
-				     unsigned long addr)
+static __always_inline void invpcid_flush_one(unsigned long pcid, unsigned long addr)
 {
 	__invpcid(pcid, addr, INVPCID_TYPE_INDIV_ADDR);
 }
 
 /* Flush all mappings for a given PCID, not including globals. */
-static inline void invpcid_flush_single_context(unsigned long pcid)
+static __always_inline void invpcid_flush_single_context(unsigned long pcid)
 {
 	__invpcid(pcid, 0, INVPCID_TYPE_SINGLE_CTXT);
 }
 
 /* Flush all mappings, including globals, for all PCIDs. */
-static inline void invpcid_flush_all(void)
+static __always_inline void invpcid_flush_all(void)
 {
 	__invpcid(0, 0, INVPCID_TYPE_ALL_INCL_GLOBAL);
 }
 
 /* Flush all mappings for all PCIDs except globals. */
-static inline void invpcid_flush_all_nonglobals(void)
+static __always_inline void invpcid_flush_all_nonglobals(void)
 {
 	__invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d4eb9e1d61b8..b3daee3d4667 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -75,7 +75,7 @@ static inline void __flush_tlb_local(void)
 	PVOP_VCALL0(mmu.flush_tlb_user);
 }
 
-static inline void __flush_tlb_global(void)
+static __always_inline void __flush_tlb_global(void)
 {
 	PVOP_VCALL0(mmu.flush_tlb_kernel);
 }
diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
index a2dd24947eb8..b4c635b20538 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -357,8 +357,8 @@ MULTI_mmu_update(struct multicall_entry *mcl, struct mmu_update *req,
 	trace_xen_mc_entry(mcl, 4);
 }
 
-static inline void
-MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
+static __always_inline void
+__MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
 		int *success_count, domid_t domid)
 {
 	mcl->op = __HYPERVISOR_mmuext_op;
@@ -366,6 +366,13 @@ MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
 	mcl->args[1] = count;
 	mcl->args[2] = (unsigned long)success_count;
 	mcl->args[3] = domid;
+}
+
+static inline void
+MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
+		int *success_count, domid_t domid)
+{
+	__MULTI_mmuext_op(mcl, op, count, success_count, domid);
 
 	trace_xen_mc_entry(mcl, 4);
 }
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b0d5a644fc84..0cfc00a34b7e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1168,9 +1168,10 @@ void flush_tlb_one_user(unsigned long addr)
 /*
  * Flush everything
  */
-STATIC_NOPV void native_flush_tlb_global(void)
+STATIC_NOPV noinstr void native_flush_tlb_global(void)
 {
 	unsigned long flags;
+	unsigned long cr4;
 
 	if (static_cpu_has(X86_FEATURE_INVPCID)) {
 		/*
@@ -1189,9 +1190,15 @@ STATIC_NOPV void native_flush_tlb_global(void)
 	 * be called from deep inside debugging code.)
 	 */
 	raw_local_irq_save(flags);
-
-	__native_tlb_flush_global(this_cpu_read(cpu_tlbstate.cr4));
-
+	cr4 = this_cpu_read(cpu_tlbstate.cr4);
+	asm volatile("mov %0,%%cr4": : "r" (cr4 ^ X86_CR4_PGE) : "memory");
+	asm volatile("mov %0,%%cr4": : "r" (cr4) : "memory");
+	/*
+	 * In lieu of not having the pinning crap, hard fail if CR4 doesn't
+	 * match the expected value. This ensures that anybody doing dodgy gets
+	 * the fallthrough check.
+	 */
+	BUG_ON(cr4 != this_cpu_read(cpu_tlbstate.cr4));
 	raw_local_irq_restore(flags);
 }
 
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 55a4996d0c04..4eb265eb867a 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1231,22 +1231,22 @@ static noinstr void xen_write_cr2(unsigned long cr2)
 	this_cpu_read(xen_vcpu)->arch.cr2 = cr2;
 }
 
-static noinline void xen_flush_tlb(void)
+static noinline noinstr void xen_flush_tlb(void)
 {
 	struct mmuext_op *op;
 	struct multicall_space mcs;
 
-	preempt_disable();
+	preempt_disable_notrace();
 
 	mcs = xen_mc_entry(sizeof(*op));
 
 	op = mcs.args;
 	op->cmd = MMUEXT_TLB_FLUSH_LOCAL;
-	MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
+	__MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
 
-	xen_mc_issue(XEN_LAZY_MMU);
+	__xen_mc_issue(XEN_LAZY_MMU);
 
-	preempt_enable();
+	preempt_enable_notrace();
 }
 
 static void xen_flush_tlb_one_user(unsigned long addr)
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index e1b782e823e6..31eddca45c27 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -235,15 +235,19 @@ static inline struct multicall_space xen_mc_entry(size_t args)
 void xen_mc_flush(void);
 
 /* Issue a multicall if we're not in a lazy mode */
-static inline void xen_mc_issue(unsigned mode)
+static __always_inline void __xen_mc_issue(unsigned mode)
 {
-	trace_xen_mc_issue(mode);
-
 	if ((xen_get_lazy_mode() & mode) == 0)
 		xen_mc_flush();
 
 	/* restore flags saved in xen_mc_batch */
-	local_irq_restore(this_cpu_read(xen_mc_irq_flags));
+	raw_local_irq_restore(this_cpu_read(xen_mc_irq_flags));
+}
+
+static inline void xen_mc_issue(unsigned mode)
+{
+	trace_xen_mc_issue(mode);
+	__xen_mc_issue(mode);
 }
 
/* Set up a callback to be called when the current batch is flushed */
vunmap()s issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, as the latter are hit by the resulting flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
Deferral vs early entry danger zone ===================================
This requires a guarantee that nothing in the vmalloc range can be vunmap'd and then accessed in early entry code.
Vmalloc users, as reported by /proc/vmallocinfo, are:
  $ cat /proc/vmallocinfo | awk '{ print $3 }' | sort | uniq
    __pci_enable_msix_range+0x32b/0x560
    acpi_os_map_iomem+0x22d/0x250
    bpf_prog_alloc_no_stats+0x34/0x140
    fork_idle+0x79/0x120
    gen_pool_add_owner+0x3e/0xb0
    ? hpet_enable+0xbf/0x470
    irq_init_percpu_irqstack+0x129/0x160
    kernel_clone+0xab/0x3b0
    memremap+0x164/0x330
    n_tty_open+0x18/0xa0
    pcpu_create_chunk+0x4e/0x1b0
    pcpu_create_chunk+0x75/0x1b0
    pcpu_get_vm_areas+0x0/0x1100
    unpurged
    vp_modern_map_capability+0x140/0x270
    zisofs_init+0x16/0x30
I've categorized these as:
a) Device or percpu mappings
For these to be unmapped, the device (or CPU) has to be removed and any associated IRQ freed. Even if the IRQ isn't freed, context tracking entry happens before the IRQ itself is handled, per irqentry_enter() usage.
  __pci_enable_msix_range()
  acpi_os_map_iomem()
  irq_init_percpu_irqstack() (not even unmapped when CPU is hot-unplugged!)
  memremap()
  n_tty_open()
  pcpu_create_chunk()
  pcpu_get_vm_areas()
  vp_modern_map_capability()
b) CONFIG_VMAP_STACK
fork_idle() & kernel_clone()
vmalloc'd kernel stacks are AFAICT a safe example, as a task running in userspace needs to enter kernelspace to execute do_exit() before its stack can be vfree'd.
c) Non-issues
  bpf_prog_alloc_no_stats() - early entry is noinstr, no BPF!
  hpet_enable() - hpet_clear_mapping() is only called if the __init function fails, no runtime worries
  zisofs_init() - used for zisofs block decompression, that's way past context tracking entry
d) I'm not sure, have a look?
gen_pool_add_owner() - AIUI this is mainly for PCI / DMA stuff, which again I wouldn't expect to be accessed before context tracking entry.
Changes =======
Blindly deferring any and all flushes of the kernel mappings is a risky move, so introduce a variant of flush_tlb_kernel_range() that explicitly allows deferral, and use it for vunmap flushes.
Note that while flush_tlb_kernel_range() may end up issuing a full flush (including user mappings), this only happens when an invalidation range threshold is reached, past which it is cheaper to do a full flush than to individually invalidate each page in the range via INVLPG. IOW, it doesn't *require* invalidating user mappings, and thus remains safe to defer until a later kernel entry.
Signed-off-by: Valentin Schneider vschneid@redhat.com
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c               | 23 +++++++++++++++++++---
 mm/vmalloc.c                    | 35 ++++++++++++++++++++++++++++-----
 3 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index a653b5f47f0e6..d89345c85fa9c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -249,6 +249,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 973a4ab3f53b3..bf6ff16a1a523 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
 #include <linux/task_work.h>
 #include <linux/mmu_notifier.h>
 #include <linux/mmu_context.h>
+#include <linux/context_tracking.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -1041,6 +1042,11 @@ static void do_flush_tlb_all(void *info)
 	__flush_tlb_all();
 }
 
+static bool do_kernel_flush_defer_cond(int cpu, void *info)
+{
+	return !ct_set_cpu_work(cpu, CONTEXT_WORK_TLBI);
+}
+
 void flush_tlb_all(void)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
@@ -1057,12 +1063,13 @@ static void do_kernel_range_flush(void *info)
 		flush_tlb_one_kernel(addr);
 }
 
-void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+static inline void
+__flush_tlb_kernel_range(smp_cond_func_t cond_func, unsigned long start, unsigned long end)
 {
 	/* Balance as user space task's flush, a bit conservative */
 	if (end == TLB_FLUSH_ALL ||
 	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
+		on_each_cpu_cond(cond_func, do_flush_tlb_all, NULL, 1);
 	} else {
 		struct flush_tlb_info *info;
 
@@ -1070,13 +1077,23 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 		info = get_flush_tlb_info(NULL, start, end, 0, false,
 					  TLB_GENERATION_INVALID);
 
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		on_each_cpu_cond(cond_func, do_kernel_range_flush, info, 1);
 
 		put_flush_tlb_info();
 		preempt_enable();
 	}
 }
 
+void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(NULL, start, end);
+}
+
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(do_kernel_flush_defer_cond, start, end);
+}
+
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 634162271c004..02838c515ce2c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -467,6 +467,31 @@ void vunmap_range_noflush(unsigned long start, unsigned long end)
 	__vunmap_range_noflush(start, end);
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+/*
+ * !!! BIG FAT WARNING !!!
+ *
+ * The CPU is free to cache any part of the paging hierarchy it wants at any
+ * time. It's also free to set accessed and dirty bits at any time, even for
+ * instructions that may never execute architecturally.
+ *
+ * This means that deferring a TLB flush affecting freed page-table-pages (IOW,
+ * keeping them in a CPU's paging hierarchy cache) is akin to dancing in a
+ * minefield.
+ *
+ * This isn't a problem for deferral of TLB flushes in vmalloc, because
+ * page-table-pages used for vmap() mappings are never freed - see how
+ * __vunmap_range_noflush() walks the whole mapping but only clears the leaf PTEs.
+ * If this ever changes, TLB flush deferral will cause misery.
+ */
+void __weak flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	flush_tlb_kernel_range(start, end);
+}
+#else
+#define flush_tlb_kernel_range_deferrable(start, end) flush_tlb_kernel_range(start, end)
+#endif
+
 /**
  * vunmap_range - unmap kernel virtual addresses
  * @addr: start of the VM area to unmap
@@ -480,7 +505,7 @@ void vunmap_range(unsigned long addr, unsigned long end)
 {
 	flush_cache_vunmap(addr, end);
 	vunmap_range_noflush(addr, end);
-	flush_tlb_kernel_range(addr, end);
+	flush_tlb_kernel_range_deferrable(addr, end);
 }
 
 static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
@@ -2265,7 +2290,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 
 	nr_purge_nodes = cpumask_weight(&purge_nodes);
 	if (nr_purge_nodes > 0) {
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
 
 		/* One extra worker is per a lazy_max_pages() full set minus one. */
 		nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages();
@@ -2368,7 +2393,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
 	flush_cache_vunmap(va->va_start, va->va_end);
 	vunmap_range_noflush(va->va_start, va->va_end);
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(va->va_start, va->va_end);
+		flush_tlb_kernel_range_deferrable(va->va_start, va->va_end);
 
 	free_vmap_area_noflush(va);
 }
@@ -2816,7 +2841,7 @@ static void vb_free(unsigned long addr, unsigned long size)
 	vunmap_range_noflush(addr, addr + size);
 
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(addr, addr + size);
+		flush_tlb_kernel_range_deferrable(addr, addr + size);
 
 	spin_lock(&vb->lock);
 
@@ -2881,7 +2906,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 	free_purged_blocks(&purge_list);
 
 	if (!__purge_vmap_area_lazy(start, end, false) && flush)
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
 	mutex_unlock(&vmap_purge_lock);
 }
With NO_HZ_IDLE, we get CONTEXT_TRACKING_IDLE, so we get these transitions:
  ct_idle_enter()
    ct_kernel_exit()
      ct_state_inc_clear_work()

  ct_idle_exit()
    ct_kernel_enter()
      ct_work_flush()
With just CONTEXT_TRACKING_IDLE, ct_state_inc_clear_work() is just ct_state_inc() and ct_work_flush() is a no-op. However, making them functional, as they are under CONTEXT_TRACKING_WORK, would allow NO_HZ_IDLE to leverage IPI deferral to keep idle CPUs idle longer.
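To make that concrete, here is a rough sketch of the semantics ct_state_inc_clear_work() needs to provide; the bit layout and helper name below are assumptions for illustration, not the series' exact code:

  /*
   * The context-tracking state counter and the deferred-work bits share one
   * atomic word, so idle (or kernel) entry can advance the state and collect
   * pending work in a single atomic operation.
   */
  #define CT_WORK_SHIFT	16
  #define CT_WORK_MASK	(~0L << CT_WORK_SHIFT)

  static long ct_state_inc_clear_work_sketch(atomic_long_t *state)
  {
  	long old = atomic_long_read(state);
  	long new;

  	do {
  		/* Advance the state counter, drop any pending work bits. */
  		new = (old & ~CT_WORK_MASK) + 1;
  	} while (!atomic_long_try_cmpxchg(state, &old, new));

  	return (old & CT_WORK_MASK) >> CT_WORK_SHIFT;
  }

ct_work_flush() would then act on the returned bits, exactly as it does on user->kernel transitions with CONTEXT_TRACKING_WORK.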
Having this enabled for NO_HZ_IDLE is a different argument than for having it for NO_HZ_FULL (power savings vs latency/performance), but the backing mechanism is identical.
Add a default-no option to enable IPI deferral with NO_HZ_IDLE.
Signed-off-by: Valentin Schneider vschneid@redhat.com
---
 kernel/time/Kconfig | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 04efc2b605823..a3cb903a4e022 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -188,9 +188,23 @@ config CONTEXT_TRACKING_USER_FORCE
 
 config CONTEXT_TRACKING_WORK
 	bool
-	depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
+	depends on HAVE_CONTEXT_TRACKING_WORK && (CONTEXT_TRACKING_USER || CONTEXT_TRACKING_WORK_IDLE)
 	default y
 
+config CONTEXT_TRACKING_WORK_IDLE
+	bool
+	depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_IDLE && !CONTEXT_TRACKING_USER
+	default n
+	help
+	  This option enables deferral of some IPIs when they are targeted at CPUs
+	  that are idle. This can help keep CPUs idle longer, but induces some
+	  extra overhead to idle <-> kernel transitions and to IPI sending.
+
+	  Say Y if the power improvements are worth more to you than the added
+	  overheads.
+
+	  Say N otherwise.
+
 config NO_HZ
 	bool "Old Idle dynticks config"
 	help
On Tue, 19 Nov 2024 16:34:47 +0100 Valentin Schneider vschneid@redhat.com wrote:
Context
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a pure-userspace application get regularly interrupted by IPIs sent from housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs leading to various on_each_cpu() calls, e.g.:
FYI,
Sending a patch series at the start of the merge window is likely going to get ignored.
Care to resend after rc1 is released? Or at least ping about it.
-- Steve
On 19/11/24 11:45, Steven Rostedt wrote:
On Tue, 19 Nov 2024 16:34:47 +0100 Valentin Schneider vschneid@redhat.com wrote:
Context
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a pure-userspace application get regularly interrupted by IPIs sent from housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs leading to various on_each_cpu() calls, e.g.:
FYI,
Sending a patch series at the start of the merge window is likely going to get ignored.
Care to resend after rc1 is released? Or at least ping about it.
Heh, I was so far down "stop rewriting changelogs for the Nth time, get it in a decent shape and send it out" that I forgot about this "small little detail". Thanks for the reminder.