This series contains RCU fixes and improvements intended for the next merge window. The nocb patches have had one round of review but still need review tags.
- Updated commit messages for clarity based on review feedback
- Testing: All rcutorture scenarios tested successfully for 2 hours on:
  - 144-core ARM64 NVIDIA Grace (aarch64)
  - 128-core AMD EPYC (x86_64)
Link to tag: https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/tag/?h=rcu-n...
Joel Fernandes (6):
  rcu/nocb: Remove unnecessary WakeOvfIsDeferred wake path
  rcu/nocb: Add warning if no rcuog wake up attempt happened during overload
  rcu/nocb: Add warning to detect if overload advancement is ever useful
  rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
  rcutorture: Prevent concurrent kvm.sh runs on same source tree
  rcutorture: Add --kill-previous option to terminate previous kvm.sh runs
Yao Kai (1):
  rcu: Fix rcu_read_unlock() deadloop due to softirq
Zqiang (1):
  srcu: Use suitable gfp_flags for the init_srcu_struct_nodes()
 kernel/rcu/srcutree.c                         |  2 +-
 kernel/rcu/tree.c                             | 16 ++++++
 kernel/rcu/tree.h                             |  4 +-
 kernel/rcu/tree_nocb.h                        | 53 +++++++++----------
 kernel/rcu/tree_plugin.h                      | 15 +++---
 tools/testing/selftests/rcutorture/.gitignore |  1 +
 tools/testing/selftests/rcutorture/bin/kvm.sh | 40 ++++++++++++++
 7 files changed, 95 insertions(+), 36 deletions(-)
base-commit: d26143bb38e2546fe6f8c9860c13a88146ce5dd6
From: Yao Kai yaokai34@huawei.com
Commit 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()") removed the recursion-protection code from __rcu_read_unlock(). As a result, with ftrace enabled, rcu_read_unlock_special() can recurse endlessly through raise_softirq_irqoff() (a deadloop), as shown below:
WARNING: CPU: 0 PID: 0 at kernel/trace/trace.c:3021 __ftrace_trace_stack.constprop.0+0x172/0x180
Modules linked in: my_irq_work(O)
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.18.0-rc7-dirty #23 PREEMPT(full)
Tainted: [O]=OOT_MODULE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:__ftrace_trace_stack.constprop.0+0x172/0x180
RSP: 0018:ffffc900000034a8 EFLAGS: 00010002
RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
RDX: 0000000000000003 RSI: ffffffff826d7b87 RDI: ffffffff826e9329
RBP: 0000000000090009 R08: 0000000000000005 R09: ffffffff82afbc4c
R10: 0000000000000008 R11: 0000000000011d7a R12: 0000000000000000
R13: ffff888003874100 R14: 0000000000000003 R15: ffff8880038c1054
FS: 0000000000000000(0000) GS:ffff8880fa8ea000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055b31fa7f540 CR3: 00000000078f4005 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <IRQ>
 trace_buffer_unlock_commit_regs+0x6d/0x220
 trace_event_buffer_commit+0x5c/0x260
 trace_event_raw_event_softirq+0x47/0x80
 raise_softirq_irqoff+0x6e/0xa0
 rcu_read_unlock_special+0xb1/0x160
 unwind_next_frame+0x203/0x9b0
 __unwind_start+0x15d/0x1c0
 arch_stack_walk+0x62/0xf0
 stack_trace_save+0x48/0x70
 __ftrace_trace_stack.constprop.0+0x144/0x180
 trace_buffer_unlock_commit_regs+0x6d/0x220
 trace_event_buffer_commit+0x5c/0x260
 trace_event_raw_event_softirq+0x47/0x80
 raise_softirq_irqoff+0x6e/0xa0
 rcu_read_unlock_special+0xb1/0x160
 unwind_next_frame+0x203/0x9b0
 __unwind_start+0x15d/0x1c0
 arch_stack_walk+0x62/0xf0
 stack_trace_save+0x48/0x70
 __ftrace_trace_stack.constprop.0+0x144/0x180
 trace_buffer_unlock_commit_regs+0x6d/0x220
 trace_event_buffer_commit+0x5c/0x260
 trace_event_raw_event_softirq+0x47/0x80
 raise_softirq_irqoff+0x6e/0xa0
 rcu_read_unlock_special+0xb1/0x160
 unwind_next_frame+0x203/0x9b0
 __unwind_start+0x15d/0x1c0
 arch_stack_walk+0x62/0xf0
 stack_trace_save+0x48/0x70
 __ftrace_trace_stack.constprop.0+0x144/0x180
 trace_buffer_unlock_commit_regs+0x6d/0x220
 trace_event_buffer_commit+0x5c/0x260
 trace_event_raw_event_softirq+0x47/0x80
 raise_softirq_irqoff+0x6e/0xa0
 rcu_read_unlock_special+0xb1/0x160
 __is_insn_slot_addr+0x54/0x70
 kernel_text_address+0x48/0xc0
 __kernel_text_address+0xd/0x40
 unwind_get_return_address+0x1e/0x40
 arch_stack_walk+0x9c/0xf0
 stack_trace_save+0x48/0x70
 __ftrace_trace_stack.constprop.0+0x144/0x180
 trace_buffer_unlock_commit_regs+0x6d/0x220
 trace_event_buffer_commit+0x5c/0x260
 trace_event_raw_event_softirq+0x47/0x80
 __raise_softirq_irqoff+0x61/0x80
 __flush_smp_call_function_queue+0x115/0x420
 __sysvec_call_function_single+0x17/0xb0
 sysvec_call_function_single+0x8c/0xc0
 </IRQ>
Commit b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work") fixed the corresponding infinite loop in rcu_read_unlock_special() for IRQ work by setting a flag before calling irq_work_queue_on(). Fix the softirq case the same way by setting that flag before calling raise_softirq_irqoff(), and rename the flag to defer_qs_pending since it now covers both paths.
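For illustration only, here is a minimal user-space sketch (not kernel code; all names are made up) of the recursion-breaking pattern the fix applies: a pending flag is tested and set before invoking a path that can re-enter the caller, so the re-entry becomes a no-op instead of looping:

  #include <stdbool.h>
  #include <stdio.h>

  static bool defer_pending;   /* stands in for rdp->defer_qs_pending */
  static int depth;

  static void notify(void);    /* stands in for raise_softirq_irqoff() */

  /* Stands in for rcu_read_unlock_special(): may be re-entered via notify(). */
  static void unlock_special(void)
  {
          if (depth > 5) {     /* only needed if the flag check below is removed */
                  printf("unbounded recursion without the flag\n");
                  return;
          }
          if (!defer_pending) {        /* the fix: test and set before notifying */
                  defer_pending = true;
                  notify();
          }
  }

  /* The notification path re-enters the caller, as the tracing path did. */
  static void notify(void)
  {
          depth++;
          unlock_special();
          depth--;
  }

  int main(void)
  {
          unlock_special();    /* terminates after a single nested call */
          printf("final depth: %d, pending: %d\n", depth, defer_pending);
          return 0;
  }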
Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
Reported-by: Tengda Wu wutengda2@huawei.com
Signed-off-by: Yao Kai yaokai34@huawei.com
Reviewed-by: Joel Fernandes joelagnelf@nvidia.com
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
---
 kernel/rcu/tree.h        |  2 +-
 kernel/rcu/tree_plugin.h | 15 +++++++++------
 2 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index b8bbe7960cda..2265b9c2906e 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -203,7 +203,7 @@ struct rcu_data { /* during and after the last grace */ /* period it is aware of. */ struct irq_work defer_qs_iw; /* Obtain later scheduler attention. */ - int defer_qs_iw_pending; /* Scheduler attention pending? */ + int defer_qs_pending; /* irqwork or softirq pending? */ struct work_struct strict_work; /* Schedule readers for strict GPs. */
/* 2) batch handling */ diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index dbe2d02be824..95ad967adcf3 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -487,8 +487,8 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags) union rcu_special special;
rdp = this_cpu_ptr(&rcu_data); - if (rdp->defer_qs_iw_pending == DEFER_QS_PENDING) - rdp->defer_qs_iw_pending = DEFER_QS_IDLE; + if (rdp->defer_qs_pending == DEFER_QS_PENDING) + rdp->defer_qs_pending = DEFER_QS_IDLE;
/* * If RCU core is waiting for this CPU to exit its critical section, @@ -645,7 +645,7 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp) * 5. Deferred QS reporting does not happen. */ if (rcu_preempt_depth() > 0) - WRITE_ONCE(rdp->defer_qs_iw_pending, DEFER_QS_IDLE); + WRITE_ONCE(rdp->defer_qs_pending, DEFER_QS_IDLE); }
/* @@ -747,7 +747,10 @@ static void rcu_read_unlock_special(struct task_struct *t) // Using softirq, safe to awaken, and either the // wakeup is free or there is either an expedited // GP in flight or a potential need to deboost. - raise_softirq_irqoff(RCU_SOFTIRQ); + if (rdp->defer_qs_pending != DEFER_QS_PENDING) { + rdp->defer_qs_pending = DEFER_QS_PENDING; + raise_softirq_irqoff(RCU_SOFTIRQ); + } } else { // Enabling BH or preempt does reschedule, so... // Also if no expediting and no possible deboosting, @@ -755,11 +758,11 @@ static void rcu_read_unlock_special(struct task_struct *t) // tick enabled. set_need_resched_current(); if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled && - needs_exp && rdp->defer_qs_iw_pending != DEFER_QS_PENDING && + needs_exp && rdp->defer_qs_pending != DEFER_QS_PENDING && cpu_online(rdp->cpu)) { // Get scheduler to re-evaluate and call hooks. // If !IRQ_WORK, FQS scan will eventually IPI. - rdp->defer_qs_iw_pending = DEFER_QS_PENDING; + rdp->defer_qs_pending = DEFER_QS_PENDING; irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu); } }
From: Zqiang qiang.zhang@linux.dev
When a srcu_struct is initialized dynamically via init_srcu_struct*(), its ->srcu_sup and ->sda are already allocated with GFP_KERNEL. Similarly, when SRCU_SIZING_INIT is set, the srcu_sup's ->node array can also be allocated with GFP_KERNEL in that path; there is no need to use GFP_ATOMIC unconditionally.
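As a hypothetical illustration (this module is not part of the patch), the dynamic-init path runs in process context, where sleeping GFP_KERNEL allocations are already permitted, which is why the node array can use GFP_KERNEL there as well:

  #include <linux/module.h>
  #include <linux/srcu.h>

  static struct srcu_struct my_srcu;

  static int __init my_srcu_init(void)
  {
          /* Process context: may sleep, so GFP_KERNEL allocations are fine. */
          return init_srcu_struct(&my_srcu);
  }

  static void __exit my_srcu_exit(void)
  {
          cleanup_srcu_struct(&my_srcu);
  }

  module_init(my_srcu_init);
  module_exit(my_srcu_exit);
  MODULE_LICENSE("GPL");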
Signed-off-by: Zqiang qiang.zhang@linux.dev
Reviewed-by: Joel Fernandes joelagnelf@nvidia.com
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
---
 kernel/rcu/srcutree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index ea3f128de06f..c469c708fdd6 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -262,7 +262,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static) ssp->srcu_sup->srcu_gp_seq_needed_exp = SRCU_GP_SEQ_INITIAL_VAL; ssp->srcu_sup->srcu_last_gp_end = ktime_get_mono_fast_ns(); if (READ_ONCE(ssp->srcu_sup->srcu_size_state) == SRCU_SIZE_SMALL && SRCU_SIZING_IS_INIT()) { - if (!init_srcu_struct_nodes(ssp, GFP_ATOMIC)) + if (!init_srcu_struct_nodes(ssp, is_static ? GFP_ATOMIC : GFP_KERNEL)) goto err_free_sda; WRITE_ONCE(ssp->srcu_sup->srcu_size_state, SRCU_SIZE_BIG); }
The WakeOvfIsDeferred code path in __call_rcu_nocb_wake() attempts to wake rcuog when the callback count exceeds qhimark and the callbacks are not yet done with their grace period (newly queued or still awaiting a GP). However, extensive testing shows that this wakeup is always redundant or useless.
In the flooding case, rcuog is always waiting for a grace period to finish, so waking it up is pointless: the deferred-wakeup timer only adds overhead, and rcuog simply wakes up and goes back to sleep, achieving nothing.
This path also issues a full memory barrier and makes unnecessary modifications to the timer expiry.
The root cause is that WakeOvfIsDeferred fires when !rcu_segcblist_ready_cbs() (GP not complete), but waking rcuog cannot accelerate GP completion.
This commit therefore removes this path.
Tested with rcutorture scenarios TREE01, TREE05, and TREE08 (all NOCB configurations) - all pass. Also stress-tested with a kernel module that floods call_rcu() to trigger the overload condition; the resulting observations confirm the analysis above.
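The flooding module itself is not included in this series; a minimal sketch of that kind of test (module name, callback count, and structure are assumptions) looks roughly like this:

  #include <linux/module.h>
  #include <linux/rcupdate.h>
  #include <linux/slab.h>

  #define FLOOD_NR 300000            /* well above the default qhimark of 10000 */

  struct flood_item {
          struct rcu_head rh;
  };

  static void flood_cb(struct rcu_head *rh)
  {
          kfree(container_of(rh, struct flood_item, rh));
  }

  static int __init flood_init(void)
  {
          int i;

          /* Queue a large burst of callbacks so __call_rcu_nocb_wake()
           * repeatedly hits its overload branch. */
          for (i = 0; i < FLOOD_NR; i++) {
                  struct flood_item *item = kmalloc(sizeof(*item), GFP_KERNEL);

                  if (!item)
                          break;
                  call_rcu(&item->rh, flood_cb);
          }
          pr_info("queued %d callbacks\n", i);
          return 0;
  }

  static void __exit flood_exit(void)
  {
          rcu_barrier();               /* wait for queued callbacks before unload */
  }

  module_init(flood_init);
  module_exit(flood_exit);
  MODULE_LICENSE("GPL");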
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
---
 kernel/rcu/tree.h      |  1 -
 kernel/rcu/tree_nocb.h | 35 +++++++++--------------------------
 2 files changed, 9 insertions(+), 27 deletions(-)
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 2265b9c2906e..653fb4ba5852 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -301,7 +301,6 @@ struct rcu_data { #define RCU_NOCB_WAKE_BYPASS 1 #define RCU_NOCB_WAKE_LAZY 2 #define RCU_NOCB_WAKE 3 -#define RCU_NOCB_WAKE_FORCE 4
#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500)) /* For jiffies_till_first_fqs and */ diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h index e6cd56603cad..daff2756cd90 100644 --- a/kernel/rcu/tree_nocb.h +++ b/kernel/rcu/tree_nocb.h @@ -518,10 +518,8 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, }
/* - * Awaken the no-CBs grace-period kthread if needed, either due to it - * legitimately being asleep or due to overload conditions. - * - * If warranted, also wake up the kthread servicing this CPUs queues. + * Awaken the no-CBs grace-period kthread if needed due to it legitimately + * being asleep. */ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, unsigned long flags) @@ -533,7 +531,6 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, long lazy_len; long len; struct task_struct *t; - struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
// If we are being polled or there is no kthread, just leave. t = READ_ONCE(rdp->nocb_gp_kthread); @@ -549,22 +546,22 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, lazy_len = READ_ONCE(rdp->lazy_len); if (was_alldone) { rdp->qlen_last_fqs_check = len; + rcu_nocb_unlock(rdp); // Only lazy CBs in bypass list if (lazy_len && bypass_len == lazy_len) { - rcu_nocb_unlock(rdp); wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY, TPS("WakeLazy")); } else if (!irqs_disabled_flags(flags)) { /* ... if queue was empty ... */ - rcu_nocb_unlock(rdp); wake_nocb_gp(rdp, false); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeEmpty")); } else { - rcu_nocb_unlock(rdp); wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE, TPS("WakeEmptyIsDeferred")); } + + return; } else if (len > rdp->qlen_last_fqs_check + qhimark) { /* ... or if many callbacks queued. */ rdp->qlen_last_fqs_check = len; @@ -575,21 +572,10 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, rcu_advance_cbs_nowake(rdp->mynode, rdp); rdp->nocb_gp_adv_time = j; } - smp_mb(); /* Enqueue before timer_pending(). */ - if ((rdp->nocb_cb_sleep || - !rcu_segcblist_ready_cbs(&rdp->cblist)) && - !timer_pending(&rdp_gp->nocb_timer)) { - rcu_nocb_unlock(rdp); - wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE, - TPS("WakeOvfIsDeferred")); - } else { - rcu_nocb_unlock(rdp); - trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); - } - } else { - rcu_nocb_unlock(rdp); - trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); } + + rcu_nocb_unlock(rdp); + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); }
static void call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *head, @@ -966,7 +952,6 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp_gp, unsigned long flags) __releases(rdp_gp->nocb_gp_lock) { - int ndw; int ret;
if (!rcu_nocb_need_deferred_wakeup(rdp_gp, level)) { @@ -974,8 +959,7 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp_gp, return false; }
- ndw = rdp_gp->nocb_defer_wakeup; - ret = __wake_nocb_gp(rdp_gp, rdp, ndw == RCU_NOCB_WAKE_FORCE, flags); + ret = __wake_nocb_gp(rdp_gp, rdp, false, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DeferredWake"));
return ret; @@ -991,7 +975,6 @@ static void do_nocb_deferred_wakeup_timer(struct timer_list *t) trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Timer"));
raw_spin_lock_irqsave(&rdp->nocb_gp_lock, flags); - smp_mb__after_spinlock(); /* Timer expire before wakeup. */ do_nocb_deferred_wakeup_common(rdp, rdp, RCU_NOCB_WAKE_BYPASS, flags); }
To make sure no rcuog wakeups are lost, add a warning covering the case where the rdp is overloaded with callbacks but no wakeup was attempted.
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
---
 kernel/rcu/tree.c      | 4 ++++
 kernel/rcu/tree.h      | 1 +
 kernel/rcu/tree_nocb.h | 6 +++++-
 3 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 293bbd9ac3f4..78c045a5ef03 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3767,6 +3767,10 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) debug_rcu_head_unqueue(&rdp->barrier_head); rcu_barrier_trace(TPS("IRQNQ"), -1, rcu_state.barrier_sequence); } +#ifdef CONFIG_RCU_NOCB_CPU + if (wake_nocb) + rdp->nocb_gp_wake_attempt = true; +#endif rcu_nocb_unlock(rdp); if (wake_nocb) wake_nocb_gp(rdp, false); diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 653fb4ba5852..74bd6a2a2f84 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -257,6 +257,7 @@ struct rcu_data { unsigned long nocb_gp_loops; /* # passes through wait code. */ struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */ bool nocb_cb_sleep; /* Is the nocb CB thread asleep? */ + bool nocb_gp_wake_attempt; /* Was a rcuog wakeup attempted? */ struct task_struct *nocb_cb_kthread; struct list_head nocb_head_rdp; /* * Head of rcu_data list in wakeup chain, diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h index daff2756cd90..7e9d465c8ab1 100644 --- a/kernel/rcu/tree_nocb.h +++ b/kernel/rcu/tree_nocb.h @@ -546,6 +546,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, lazy_len = READ_ONCE(rdp->lazy_len); if (was_alldone) { rdp->qlen_last_fqs_check = len; + rdp->nocb_gp_wake_attempt = true; rcu_nocb_unlock(rdp); // Only lazy CBs in bypass list if (lazy_len && bypass_len == lazy_len) { @@ -563,7 +564,8 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
return; } else if (len > rdp->qlen_last_fqs_check + qhimark) { - /* ... or if many callbacks queued. */ + /* Callback overload condition. */ + WARN_ON_ONCE(!rdp->nocb_gp_wake_attempt); rdp->qlen_last_fqs_check = len; j = jiffies; if (j != rdp->nocb_gp_adv_time && @@ -688,6 +690,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) bypass_ncbs > 2 * qhimark)) { flush_bypass = true; } else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) { + rdp->nocb_gp_wake_attempt = false; rcu_nocb_unlock_irqrestore(rdp, flags); continue; /* No callbacks here, try next. */ } @@ -1254,6 +1257,7 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) continue; } rcu_nocb_try_flush_bypass(rdp, jiffies); + rdp->nocb_gp_wake_attempt = true; rcu_nocb_unlock_irqrestore(rdp, flags); wake_nocb_gp(rdp, false); sc->nr_to_scan -= _count;
During callback overload, the NOCB code attempts an opportunistic advancement via rcu_advance_cbs_nowake().
Analysis via tracing while flooding 300,000 callbacks shows this optimization is likely dead code:
- 30 overload conditions triggered
- 0 advancements actually occurred
- 100% of the time, no advancement was possible because the current GP had not completed
I also ran TREE05 and TREE08 for 2 hours and could not trigger it.
When callbacks overflow (exceed qhimark), they are waiting for a grace period that hasn't completed yet. The optimization requires the GP to be complete to advance callbacks, but the overload condition itself is caused by callbacks piling up faster than GPs can complete. This creates a logical contradiction where the advancement cannot happen.
In *theory* the advancement could happen if a GP completed just in the nick of time as the overload was hit, but that is so rare that it could not be triggered even with synthetic callback flooding. Attempting the advancement is therefore a waste of cycles, let alone useful, and is maintenance complexity we do not need.
I suggest deleting it. Out of an abundance of caution, however, add a WARN_ON_ONCE for a merge window or two and remove the code afterward.
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
---
 kernel/rcu/tree_nocb.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h index 7e9d465c8ab1..d3e6a0e77210 100644 --- a/kernel/rcu/tree_nocb.h +++ b/kernel/rcu/tree_nocb.h @@ -571,8 +571,20 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, if (j != rdp->nocb_gp_adv_time && rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) && rcu_seq_done(&rdp->mynode->gp_seq, cur_gp_seq)) { + long done_before = rcu_segcblist_get_seglen(&rdp->cblist, RCU_DONE_TAIL); + rcu_advance_cbs_nowake(rdp->mynode, rdp); rdp->nocb_gp_adv_time = j; + + /* + * The advance_cbs call above is not useful. Under an + * overload condition, nocb_gp_wait() is always waiting + * for GP completion, due to this nothing can be moved + * from WAIT to DONE, in the list. WARN if an + * advancement happened (next step is deletion of advance). + */ + WARN_ON_ONCE(rcu_segcblist_get_seglen(&rdp->cblist, + RCU_DONE_TAIL) > done_before); } }
The RCU grace-period mechanism uses a two-phase FQS (Force Quiescent State) design: the first FQS scan saves dyntick-idle snapshots and the second FQS compares them. On idle systems this adds long, unnecessary latency to synchronize_rcu() whenever one FQS wait would have sufficed: with the default RCU_JIFFIES_TILL_FORCE_QS of 1 + (HZ > 250) + (HZ > 500) = 3 jiffies, that is two FQS waits of ~3 ms each at HZ=1000.
Investigation showed that the GP kthread's CPU is frequently the holdout CPU after the first FQS: it cannot be detected as "idle" because it is actively running the FQS scan in the GP kthread.
Therefore, at the end of rcu_gp_init(), immediately report a quiescent state for the GP kthread's CPU using rcu_qs() + rcu_report_qs_rdp(). The GP kthread cannot be in an RCU read-side critical section while running GP initialization, so this is safe and results in significant latency improvements.
The following tests were performed:
(1) synchronize_rcu() benchmarking
100 synchronize_rcu() calls with 32 CPUs, 10 runs each (default fqs jiffies settings):
Baseline (without fix):

| Run | Mean      | Min      | Max       |
|-----|-----------|----------|-----------|
| 1   | 10.088 ms | 9.989 ms | 18.848 ms |
| 2   | 10.064 ms | 9.982 ms | 16.470 ms |
| 3   | 10.051 ms | 9.988 ms | 15.113 ms |
| 4   | 10.125 ms | 9.929 ms | 22.411 ms |
| 5   |  8.695 ms | 5.996 ms | 15.471 ms |
| 6   | 10.157 ms | 9.977 ms | 25.723 ms |
| 7   | 10.102 ms | 9.990 ms | 20.224 ms |
| 8   |  8.050 ms | 5.985 ms | 10.007 ms |
| 9   | 10.059 ms | 9.978 ms | 15.934 ms |
| 10  | 10.077 ms | 9.984 ms | 17.703 ms |
With fix:

| Run | Mean     | Min      | Max       |
|-----|----------|----------|-----------|
| 1   | 6.027 ms | 5.915 ms |  8.589 ms |
| 2   | 6.032 ms | 5.984 ms |  9.241 ms |
| 3   | 6.010 ms | 5.986 ms |  7.004 ms |
| 4   | 6.076 ms | 5.993 ms | 10.001 ms |
| 5   | 6.084 ms | 5.893 ms | 10.250 ms |
| 6   | 6.034 ms | 5.908 ms |  9.456 ms |
| 7   | 6.051 ms | 5.993 ms | 10.000 ms |
| 8   | 6.057 ms | 5.941 ms | 10.001 ms |
| 9   | 6.016 ms | 5.927 ms |  7.540 ms |
| 10  | 6.036 ms | 5.993 ms |  9.579 ms |
Summary:
- Mean latency: 9.75 ms -> 6.04 ms (38% improvement)
- Max latency: 25.72 ms -> 10.25 ms (60% improvement)
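A measurement along these lines can be collected with a small in-kernel timing loop; the following is a hypothetical sketch of such a module, not the exact harness used for the numbers above:

  #include <linux/kernel.h>
  #include <linux/ktime.h>
  #include <linux/math64.h>
  #include <linux/module.h>
  #include <linux/rcupdate.h>

  #define NR_CALLS 100

  static int __init sync_bench_init(void)
  {
          u64 total_ns = 0, min_ns = U64_MAX, max_ns = 0;
          int i;

          for (i = 0; i < NR_CALLS; i++) {
                  ktime_t start = ktime_get();
                  u64 delta_ns;

                  synchronize_rcu();
                  delta_ns = ktime_to_ns(ktime_sub(ktime_get(), start));
                  total_ns += delta_ns;
                  if (delta_ns < min_ns)
                          min_ns = delta_ns;
                  if (delta_ns > max_ns)
                          max_ns = delta_ns;
          }
          pr_info("synchronize_rcu: mean %llu us, min %llu us, max %llu us\n",
                  div_u64(total_ns, NR_CALLS * 1000),
                  div_u64(min_ns, 1000), div_u64(max_ns, 1000));
          return 0;
  }

  static void __exit sync_bench_exit(void) { }

  module_init(sync_bench_init);
  module_exit(sync_bench_exit);
  MODULE_LICENSE("GPL");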
(2) Bridge setup/teardown latency (Uladzislau Rezki)
x86_64 with 64 CPUs, 100 iterations of bridge add/configure/delete:
Real time:
1 - default:                   24.221s
2 - this patch:                20.754s (14% faster)
3 - this patch + wake_from_gp: 15.895s (34% faster)
4 - wake_from_gp only:         18.947s (22% faster)
Per-synchronize_rcu() latency (in usec):

|        | 1       | 2       | 3     | 4     |
|--------|---------|---------|-------|-------|
| median | 37249.5 | 31540.5 | 15765 | 22480 |
| min    | 7881    | 7918    | 9803  | 7857  |
| max    | 63651   | 55639   | 31861 | 32040 |
This patch combined with rcu_normal_wake_from_gp reduces bridge setup/teardown time from 24 seconds to 16 seconds.
(3) CPU overhead verification (Uladzislau Rezki)
System CPU time across 5 runs showed no measurable increase:
  default:    1.698s - 1.937s
  this patch: 1.667s - 1.930s
Conclusion: variations are within noise, no CPU overhead regression.
(4) rcutorture
Tested TREE and SRCU configurations - no regressions.
Reviewed-by: "Paul E. McKenney" paulmck@kernel.org
Tested-by: Uladzislau Rezki (Sony) urezki@gmail.com
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
---
 kernel/rcu/tree.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 78c045a5ef03..b7c818cabe44 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -160,6 +160,7 @@ static void rcu_report_qs_rnp(unsigned long mask, struct rcu_node *rnp, unsigned long gps, unsigned long flags); static void invoke_rcu_core(void); static void rcu_report_exp_rdp(struct rcu_data *rdp); +static void rcu_report_qs_rdp(struct rcu_data *rdp); static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp); static bool rcu_rdp_is_offloaded(struct rcu_data *rdp); static bool rcu_rdp_cpu_online(struct rcu_data *rdp); @@ -1983,6 +1984,17 @@ static noinline_for_stack bool rcu_gp_init(void) if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
+ /* + * Immediately report QS for the GP kthread's CPU. The GP kthread + * cannot be in an RCU read-side critical section while running + * the FQS scan. This eliminates the need for a second FQS wait + * when all CPUs are idle. + */ + preempt_disable(); + rcu_qs(); + rcu_report_qs_rdp(this_cpu_ptr(&rcu_data)); + preempt_enable(); + return true; }
Add flock-based locking to kvm.sh to prevent multiple instances from running concurrently on the same source tree. This avoids build failures caused by one instance's "make clean" deleting generated files while another instance is still building.
The lock file is placed in the rcutorture directory and added to .gitignore.
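For background, the flock(2) behavior the script relies on can be seen in the following minimal user-space sketch (illustrative only, not part of the patch): the lock is taken non-blocking on an open descriptor, and the kernel releases it automatically when the holder exits or is killed.

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/file.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open(".kvm.sh.lock", O_CREAT | O_RDWR, 0644);

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
                  fprintf(stderr, "another instance already holds the lock\n");
                  return 1;
          }
          printf("lock held; kill this process and the lock is released\n");
          sleep(60);      /* no explicit unlock needed: released on exit/kill */
          return 0;
  }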
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
---
 tools/testing/selftests/rcutorture/.gitignore |  1 +
 tools/testing/selftests/rcutorture/bin/kvm.sh | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)
diff --git a/tools/testing/selftests/rcutorture/.gitignore b/tools/testing/selftests/rcutorture/.gitignore index f6cbce77460b..b8fd42547a6e 100644 --- a/tools/testing/selftests/rcutorture/.gitignore +++ b/tools/testing/selftests/rcutorture/.gitignore @@ -3,3 +3,4 @@ initrd b[0-9]* res *.swp +.kvm.sh.lock diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh index fff15821c44c..d1fbd092e22a 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm.sh @@ -275,6 +275,23 @@ do shift done
+# Prevent concurrent kvm.sh runs on the same source tree. The flock +# is automatically released when the script exits, even if killed. +TORTURE_LOCK="$RCUTORTURE/.kvm.sh.lock" +if test -z "$dryrun" +then + # Create a file descriptor and flock it, so that when kvm.sh (and its + # children) exit, the flock is released by the kernel automatically. + exec 9>"$TORTURE_LOCK" + if ! flock -n 9 + then + echo "ERROR: Another kvm.sh instance is already running on this tree." + echo " Lock file: $TORTURE_LOCK" + echo " To run kvm.sh, kill all existing kvm.sh runs first." + exit 1 + fi +fi + if test -n "$dryrun" || test -z "$TORTURE_INITRD" || tools/testing/selftests/rcutorture/bin/mkinitrd.sh then :
When kvm.sh is killed, its child processes (make, gcc, qemu, etc.) may continue running and keep holding the flock. This prevents new kvm.sh instances from starting even though the parent is gone.
Add a --kill-previous option that uses fuser(1) to terminate all processes holding the flock file before attempting to acquire it. This provides a clean way to recover from stale kvm.sh runs that may have left many qemu and compiler processes running and interfering.
Signed-off-by: Joel Fernandes joelagnelf@nvidia.com
---
 tools/testing/selftests/rcutorture/bin/kvm.sh | 25 ++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh index d1fbd092e22a..65b04b832733 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm.sh @@ -80,6 +80,7 @@ usage () { echo " --kasan" echo " --kconfig Kconfig-options" echo " --kcsan" + echo " --kill-previous" echo " --kmake-arg kernel-make-arguments" echo " --mac nn:nn:nn:nn:nn:nn" echo " --memory megabytes|nnnG" @@ -206,6 +207,9 @@ do --kcsan) TORTURE_KCONFIG_KCSAN_ARG="$debuginfo CONFIG_KCSAN=y CONFIG_KCSAN_STRICT=y CONFIG_KCSAN_REPORT_ONCE_IN_MS=100000 CONFIG_KCSAN_VERBOSE=y CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y"; export TORTURE_KCONFIG_KCSAN_ARG ;; + --kill-previous) + TORTURE_KILL_PREVIOUS=1 + ;; --kmake-arg|--kmake-args) checkarg --kmake-arg "(kernel make arguments)" $# "$2" '.*' '^error$' TORTURE_KMAKE_ARG="`echo "$TORTURE_KMAKE_ARG $2" | sed -e 's/^ *//' -e 's/ *$//'`" @@ -278,6 +282,25 @@ done # Prevent concurrent kvm.sh runs on the same source tree. The flock # is automatically released when the script exits, even if killed. TORTURE_LOCK="$RCUTORTURE/.kvm.sh.lock" + +# Terminate any processes holding the lock file, if requested. +if test -n "$TORTURE_KILL_PREVIOUS" +then + if test -e "$TORTURE_LOCK" + then + echo "Killing processes holding $TORTURE_LOCK..." + if fuser -k "$TORTURE_LOCK" >/dev/null 2>&1 + then + sleep 2 + echo "Previous kvm.sh processes killed." + else + echo "No processes were holding the lock." + fi + else + echo "No lock file exists, nothing to kill." + fi +fi + if test -z "$dryrun" then # Create a file descriptor and flock it, so that when kvm.sh (and its @@ -287,7 +310,7 @@ then then echo "ERROR: Another kvm.sh instance is already running on this tree." echo " Lock file: $TORTURE_LOCK" - echo " To run kvm.sh, kill all existing kvm.sh runs first." + echo " To run kvm.sh, kill all existing kvm.sh runs first (--kill-previous)." exit 1 fi fi