This is v2 of the previous set (https://lore.kernel.org/lkml/20210128171222.131380-1-frederic@kernel.org/) minus the patches already applied in rcu/dev, and it is based on the latest rcu/dev.
Changelog since v1:
"rcu/nocb: Fix potential missed nocb_timer rearm"
	* Remove nocb_defer_wakeup reset from do_nocb_deferred_wakeup_common() (paulmck)
	* Only reset/del if the timer is actually armed
	* Add secondary potential cause for missed rearm in the changelog
"rcu/nocb: Disable bypass when CPU isn't completely offloaded"
	* Improve comments on state machine (paulmck)
	* Add comment (a full quote from Paul) explaining why early flush is enough (paulmck)
	* Move sanity check to the very end of deoffloading (paulmck)
	* Clarify some comments about nocb locking on de-offloading (paulmck)
"rcu/nocb: Remove stale comment above rcu_segcblist_offload()"
	* New patch, reported by paulmck
"rcu/nocb: Merge nocb_timer to the rdp leader"
	* Remove rcu_running_nocb_timer() and its use in rcu_rdp_is_offloaded() debugging, since the timer doesn't refer to any rdp offloading anymore.
	* Only delete nocb_timer when armed, in nocb_gp_wait()
	* Clarify some comments about nocb locking on de-offloading (paulmck)
	* Remove stale code "re-enabling" nocb timer on offloading. Not necessary anymore and even buggy.
"timer: Revert "timer: Add timer_curr_running()""
	* New patch
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git rcu/dev-v2
HEAD: 925ee3076eb694db893e2c6664d90ad8fb9cb6e5
Thanks,
	Frederic
---
Frederic Weisbecker (13):
      rcu/nocb: Fix potential missed nocb_timer rearm
      rcu/nocb: Disable bypass when CPU isn't completely offloaded
      rcu/nocb: Remove stale comment above rcu_segcblist_offload()
      rcu/nocb: Move trace_rcu_nocb_wake() calls outside nocb_lock when possible
      rcu/nocb: Merge nocb_timer to the rdp leader
      timer: Revert "timer: Add timer_curr_running()"
      rcu/nocb: Directly call __wake_nocb_gp() from bypass timer
      rcu/nocb: Allow de-offloading rdp leader
      rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup
      rcu/nocb: Delete bypass_timer upon nocb_gp wakeup
      rcu/nocb: Only cancel nocb timer if not polling
      rcu/nocb: Prepare for finegrained deferred wakeup
      rcu/nocb: Unify timers
 include/linux/rcu_segcblist.h |   7 +-
 include/linux/timer.h         |   2 -
 include/trace/events/rcu.h    |   1 +
 kernel/rcu/rcu_segcblist.c    |   3 +-
 kernel/rcu/tree.c             |   2 +-
 kernel/rcu/tree.h             |   9 +-
 kernel/rcu/tree_plugin.h      | 233 +++++++++++++++++++++++-------------------
 kernel/time/timer.c           |  14 ---
 8 files changed, 141 insertions(+), 130 deletions(-)
Two situations can cause a missed nocb timer rearm:
1) rdp(CPU A) queues its nocb timer. The grace period elapses before the timer gets a chance to fire. The nocb_gp kthread is awakened by rdp(CPU B). The nocb_cb kthread for rdp(CPU A) is awakened and processes the callbacks, again before the nocb_timer for CPU A gets a chance to fire. rdp(CPU A) then queues a callback and wakes up the nocb_gp kthread, cancelling the pending nocb_timer without resetting the corresponding nocb_defer_wakeup.
2) The "nocb_bypass_timer" ends up calling wake_nocb_gp(), which deletes the pending "nocb_timer" (note they are not the same timers) for the given rdp without resetting the matching state stored in nocb_defer_wakeup.
In both situations, a future call_rcu() on that rdp may be fooled into thinking the timer is armed when it's not, missing a deferred nocb_gp wakeup.
Case 1) is very unlikely due to timing constraints (the timer fires after one jiffy) but is still possible in theory. Case 2) is more likely to happen. In any case, such scenarios require the CPU to spend a long time within a kernel thread without exiting to idle or user space, which is pretty exotic behaviour.
Fix this by resetting rdp->nocb_defer_wakeup every time we disarm the timer.
Fixes: d1b222c6be1f ("rcu/nocb: Add bypass callback queueing")
Cc: Stable <stable@vger.kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree_plugin.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 2ec9d7f55f99..dd0dc66c282d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1720,7 +1720,11 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force,
 		rcu_nocb_unlock_irqrestore(rdp, flags);
 		return false;
 	}
-	del_timer(&rdp->nocb_timer);
+
+	if (READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT) {
+		WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
+		del_timer(&rdp->nocb_timer);
+	}
 	rcu_nocb_unlock_irqrestore(rdp, flags);
 	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
 	if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) {
@@ -2349,7 +2353,6 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp)
 		return false;
 	}
 	ndw = READ_ONCE(rdp->nocb_defer_wakeup);
-	WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
 	ret = wake_nocb_gp(rdp, ndw == RCU_NOCB_WAKE_FORCE, flags);
 	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DeferredWake"));
On Tue, Feb 23, 2021 at 01:09:59AM +0100, Frederic Weisbecker wrote:
Two situations can cause a missed nocb timer rearm:
- rdp(CPU A) queues its nocb timer. The grace period elapses before the timer gets a chance to fire. The nocb_gp kthread is awakened by rdp(CPU B). The nocb_cb kthread for rdp(CPU A) is awakened and processes the callbacks, again before the nocb_timer for CPU A gets a chance to fire. rdp(CPU A) then queues a callback and wakes up the nocb_gp kthread, cancelling the pending nocb_timer without resetting the corresponding nocb_defer_wakeup.
As discussed offlist, expanding the above scenario results in this sequence of steps:
1. There are no callbacks queued for any CPU covered by CPU 0-2's ->nocb_gp_kthread.
2. CPU 0 enqueues its first callback with interrupts disabled, and thus must defer awakening its ->nocb_gp_kthread. It therefore queues its rcu_data structure's ->nocb_timer.
3. CPU 1, which shares the same ->nocb_gp_kthread, also enqueues a callback, but with interrupts enabled, allowing it to directly awaken the ->nocb_gp_kthread.
4. The newly awakened ->nocb_gp_kthread associates both CPU 0's and CPU 1's callbacks with a future grace period and arranges for that grace period to be started.
5. This ->nocb_gp_kthread goes to sleep waiting for the end of this future grace period.
6. This grace period elapses before CPU 0's timer fires. This is normally improbable given that the timer is set for only one jiffy, but timers can be delayed. Besides, it is possible that the kernel was built with CONFIG_RCU_STRICT_GRACE_PERIOD=y.
7. The grace period ends, so rcu_gp_kthread awakens the ->nocb_gp_kthread, which in turn awakens both CPU 0's and CPU 1's ->nocb_cb_kthread.
8. CPU 0's ->nocb_cb_kthread invokes its callback.
9. Note that neither kthread updated any ->nocb_timer state, so CPU 0's ->nocb_defer_wakeup is still set to either RCU_NOCB_WAKE or RCU_NOCB_WAKE_FORCE.
10. CPU 0 enqueues its second callback, again with interrupts disabled, and thus must again defer awakening its ->nocb_gp_kthread. However, ->nocb_defer_wakeup prevents CPU 0 from queueing the timer.
So far so good, but why isn't the timer still queued from back in step 2? What am I missing here? Either way, could you please update the commit logs to tell the full story? At some later time, you might be very happy that you did. ;-)
Thanx, Paul
On Wed, Feb 24, 2021 at 10:37:09AM -0800, Paul E. McKenney wrote:
On Tue, Feb 23, 2021 at 01:09:59AM +0100, Frederic Weisbecker wrote:
Two situations can cause a missed nocb timer rearm:
- rdp(CPU A) queues its nocb timer. The grace period elapses before the timer gets a chance to fire. The nocb_gp kthread is awakened by rdp(CPU B). The nocb_cb kthread for rdp(CPU A) is awakened and processes the callbacks, again before the nocb_timer for CPU A gets a chance to fire. rdp(CPU A) then queues a callback and wakes up the nocb_gp kthread, cancelling the pending nocb_timer without resetting the corresponding nocb_defer_wakeup.
As discussed offlist, expanding the above scenario results in this sequence of steps:
1. There are no callbacks queued for any CPU covered by CPU 0-2's ->nocb_gp_kthread.

2. CPU 0 enqueues its first callback with interrupts disabled, and thus must defer awakening its ->nocb_gp_kthread. It therefore queues its rcu_data structure's ->nocb_timer.

3. CPU 1, which shares the same ->nocb_gp_kthread, also enqueues a callback, but with interrupts enabled, allowing it to directly awaken the ->nocb_gp_kthread.

4. The newly awakened ->nocb_gp_kthread associates both CPU 0's and CPU 1's callbacks with a future grace period and arranges for that grace period to be started.

5. This ->nocb_gp_kthread goes to sleep waiting for the end of this future grace period.

6. This grace period elapses before CPU 0's timer fires. This is normally improbable given that the timer is set for only one jiffy, but timers can be delayed. Besides, it is possible that the kernel was built with CONFIG_RCU_STRICT_GRACE_PERIOD=y.

7. The grace period ends, so rcu_gp_kthread awakens the ->nocb_gp_kthread, which in turn awakens both CPU 0's and CPU 1's ->nocb_cb_kthread.

8. CPU 0's ->nocb_cb_kthread invokes its callback.

9. Note that neither kthread updated any ->nocb_timer state, so CPU 0's ->nocb_defer_wakeup is still set to either RCU_NOCB_WAKE or RCU_NOCB_WAKE_FORCE.

10. CPU 0 enqueues its second callback, again with interrupts disabled, and thus must again defer awakening its ->nocb_gp_kthread. However, ->nocb_defer_wakeup prevents CPU 0 from queueing the timer.
I managed to recollect some pieces of my brain. So keep the above but let's change the point 10:
10. CPU 0 enqueues its second callback, this time with interrupts enabled so it can directly wake ->nocb_gp_kthread. It does so by calling __wake_nocb_gp(), which also cancels the pending timer that got queued in step 2. But that doesn't reset CPU 0's ->nocb_defer_wakeup, which is still set to RCU_NOCB_WAKE. So CPU 0's ->nocb_defer_wakeup and CPU 0's ->nocb_timer are now desynchronized.
11. ->nocb_gp_kthread associates the callback queued in 10 with a new grace period, arranges for it to start and sleeps on it.
12. The grace period ends, ->nocb_gp_kthread awakens and wakes up CPU 0's ->nocb_cb_kthread which invokes the callback queued in 10.
13. CPU 0 enqueues its third callback, this time with interrupts disabled so it tries to queue a deferred wakeup. However, ->nocb_defer_wakeup has a stale RCU_NOCB_WAKE value which prevents CPU 0's ->nocb_timer, cancelled in 10, from being armed.
14. CPU 0 has its pending callback and it may go unnoticed until some other CPU wakes up ->nocb_gp_kthread or CPU 0 reaches an explicit deferred wakeup point such as idle entry.
I hope I'm not missing something this time...
Thanks.
On Wed, Feb 24, 2021 at 11:06:06PM +0100, Frederic Weisbecker wrote:
Thank you, that does sound plausible. I guess I can see how rcutorture might have missed this one!
Thanx, Paul
On Wed, Feb 24, 2021 at 04:14:25PM -0800, Paul E. McKenney wrote:
Thank you, that does sound plausible. I guess I can see how rcutorture might have missed this one!
I must admit it requires a lot of stars to be aligned :-)
Thanks.
On Thu, Feb 25, 2021 at 01:48:13AM +0100, Frederic Weisbecker wrote:
I must admit it requires a lot of stars to be aligned :-)
It nevertheless constitutes a bug in rcutorture. Or maybe an additional challenge for the formal verification people. ;-)
Thanx, Paul
On Wed, Feb 24, 2021 at 11:06:06PM +0100, Frederic Weisbecker wrote:
On Wed, Feb 24, 2021 at 10:37:09AM -0800, Paul E. McKenney wrote:
On Tue, Feb 23, 2021 at 01:09:59AM +0100, Frederic Weisbecker wrote:
Two situations can cause a missed nocb timer rearm:
- rdp(CPU A) queues its nocb timer. The grace period elapses before the timer gets a chance to fire. The nocb_gp kthread is awakened by rdp(CPU B). The nocb_cb kthread for rdp(CPU A) is awakened and processes the callbacks, again before the nocb_timer for CPU A gets a chance to fire. rdp(CPU A) then queues a callback and wakes up the nocb_gp kthread, cancelling the pending nocb_timer without resetting the corresponding nocb_defer_wakeup.
As discussed offlist, expanding the above scenario results in this sequence of steps:
I renumbered the CPUs, since the ->nocb_gp_kthread would normally be associated with CPU 0. If the first CPU to enqueue a callback was also CPU 0, nocb_gp_wait() might clear that CPU's ->nocb_defer_wakeup, which would prevent this scenario from playing out. (But admittedly only if some other CPU handled by this same ->nocb_gp_kthread used its bypass.)
1. There are no callbacks queued for any CPU covered by CPU 0-2's ->nocb_gp_kthread.
And ->nocb_gp_kthread is associated with CPU 0.
2. CPU 1 enqueues its first callback with interrupts disabled, and thus must defer awakening its ->nocb_gp_kthread. It therefore queues its rcu_data structure's ->nocb_timer.
At this point, CPU 1's rdp->nocb_defer_wakeup is RCU_NOCB_WAKE.
3. CPU 2, which shares the same ->nocb_gp_kthread, also enqueues a callback, but with interrupts enabled, allowing it to directly awaken the ->nocb_gp_kthread.

4. The newly awakened ->nocb_gp_kthread associates both CPU 1's and CPU 2's callbacks with a future grace period and arranges for that grace period to be started.

5. This ->nocb_gp_kthread goes to sleep waiting for the end of this future grace period.
6. This grace period elapses before CPU 1's timer fires. This is normally improbable given that the timer is set for only one jiffy, but timers can be delayed. Besides, it is possible that the kernel was built with CONFIG_RCU_STRICT_GRACE_PERIOD=y.
7. The grace period ends, so rcu_gp_kthread awakens the ->nocb_gp_kthread, which in turn awakens both CPU 1's and CPU 2's ->nocb_cb_kthread.
And then ->nocb_cb_kthread sleeps waiting for more callbacks.
8. CPU 1's ->nocb_cb_kthread invokes its callback.

9. Note that neither kthread updated any ->nocb_timer state, so CPU 1's ->nocb_defer_wakeup is still set to RCU_NOCB_WAKE.

10. CPU 1 enqueues its second callback, again with interrupts disabled, and thus must again defer awakening its ->nocb_gp_kthread. However, ->nocb_defer_wakeup prevents CPU 1 from queueing the timer.
I managed to recollect some pieces of my brain. So keep the above but let's change the point 10:
10. CPU 1 enqueues its second callback, this time with interrupts enabled so it can directly wake ->nocb_gp_kthread. It does so by calling __wake_nocb_gp(), which also cancels the
wake_nocb_gp() in current -rcu, correct?
pending timer that got queued in step 2. But that doesn't reset CPU 1's ->nocb_defer_wakeup, which is still set to RCU_NOCB_WAKE. So CPU 1's ->nocb_defer_wakeup and CPU 1's ->nocb_timer are now desynchronized.
Agreed, and agreed that this is a bug. Thank you for persisting on this one!
11. ->nocb_gp_kthread associates the callback queued in 10 with a new grace period, arranges for it to start and sleeps on it.

12. The grace period ends, ->nocb_gp_kthread awakens and wakes up CPU 1's ->nocb_cb_kthread, which invokes the callback queued in 10.

13. CPU 1 enqueues its third callback, this time with interrupts disabled so it tries to queue a deferred wakeup. However, ->nocb_defer_wakeup has a stale RCU_NOCB_WAKE value which prevents CPU 1's ->nocb_timer, cancelled in 10, from being armed.

14. CPU 1 has its pending callback and it may go unnoticed until some other CPU wakes up ->nocb_gp_kthread or CPU 1 reaches an explicit deferred wakeup point such as idle entry.
I hope I'm not missing something this time...
If you are missing something, then so am I! ;-)
So far so good, but why isn't the timer still queued from back in step 2? What am I missing here? Either way, could you please update the commit logs to tell the full story? At some later time, you might be very happy that you did. ;-)
- The "nocb_bypass_timer" ends up calling wake_nocb_gp(), which deletes the pending "nocb_timer" (note they are not the same timers) for the given rdp without resetting the matching state stored in nocb_defer_wakeup.
Would you like to similarly expand this one, or would you prefer to rest your case on Case 1) above?
Thanx, Paul
On Mon, Mar 01, 2021 at 05:48:29PM -0800, Paul E. McKenney wrote:
I renumbered the CPUs, since the ->nocb_gp_kthread would normally be associated with CPU 0. If the first CPU to enqueue a callback was also CPU 0, nocb_gp_wait() might clear that CPU's ->nocb_defer_wakeup, which would prevent this scenario from playing out. (But admittedly only if some other CPU handled by this same ->nocb_gp_kthread used its bypass.)
Ok good point.
- There are no callbacks queued for any CPU covered by CPU 0-2's ->nocb_gp_kthread.
And ->nocb_gp_kthread is associated with CPU 0.
- CPU 1 enqueues its first callback with interrupts disabled, and thus must defer awakening its ->nocb_gp_kthread. It therefore queues its rcu_data structure's ->nocb_timer.
At this point, CPU 1's rdp->nocb_defer_wakeup is RCU_NOCB_WAKE.
Right.
- The grace period ends, so rcu_gp_kthread awakens the ->nocb_gp_kthread, which in turn awakens both CPU 1's and CPU 2's ->nocb_cb_kthread.
And then ->nocb_cb_kthread sleeps waiting for more callbacks.
Yep
I managed to recollect some pieces of my brain. So keep the above but let's change point 10:
- CPU 1 enqueues its second callback, this time with interrupts
enabled so it can wake directly ->nocb_gp_kthread. It does so with calling __wake_nocb_gp() which also cancels the
wake_nocb_gp() in current -rcu, correct?
Heh, right.
So far so good, but why isn't the timer still queued from back in step 2? What am I missing here? Either way, could you please update the commit logs to tell the full story? At some later time, you might be very happy that you did. ;-)
- The "nocb_bypass_timer" ends up calling wake_nocb_gp(), which deletes the pending "nocb_timer" (note they are not the same timers) for the given rdp without resetting the matching state stored in nocb_defer_wakeup.
Would you like to similarly expand this one, or would you prefer to rest your case on Case 1) above?
I was about to say that we can skip that one, the changelog will already be big enough but the "Fixes:" tag refers to the second scenario, since it's the oldest vulnerable commit AFAICS.
Fixes: d1b222c6be1f (rcu/nocb: Add bypass callback queueing)
Thanks.
On Tue, Mar 02, 2021 at 01:34:44PM +0100, Frederic Weisbecker wrote:
OK, how about if I queue a temporary commit (shown below) that just calls out the first scenario so that I can start testing, and you get me more detail on the second scenario? I can then update the commit.
Thanx, Paul
------------------------------------------------------------------------
commit 302fd54b9ae98f678624cbf9bf7a4ca88455a8f9
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Tue Feb 23 01:09:59 2021 +0100
rcu/nocb: Fix missed nocb_timer requeue
This sequence of events can lead to a failure to requeue a CPU's ->nocb_timer:
1. There are no callbacks queued for any CPU covered by CPU 0-2's ->nocb_gp_kthread. Note that ->nocb_gp_kthread is associated with CPU 0.
2. CPU 1 enqueues its first callback with interrupts disabled, and thus must defer awakening its ->nocb_gp_kthread. It therefore queues its rcu_data structure's ->nocb_timer. At this point, CPU 1's rdp->nocb_defer_wakeup is RCU_NOCB_WAKE.
3. CPU 2, which shares the same ->nocb_gp_kthread, also enqueues a callback, but with interrupts enabled, allowing it to directly awaken the ->nocb_gp_kthread.
4. The newly awakened ->nocb_gp_kthread associates both CPU 1's and CPU 2's callbacks with a future grace period and arranges for that grace period to be started.
5. This ->nocb_gp_kthread goes to sleep waiting for the end of this future grace period.
6. This grace period elapses before CPU 1's timer fires. This is normally improbable given that the timer is set for only one jiffy, but timers can be delayed. Besides, it is possible that the kernel was built with CONFIG_RCU_STRICT_GRACE_PERIOD=y.
7. The grace period ends, so rcu_gp_kthread awakens the ->nocb_gp_kthread, which in turn awakens both CPU 1's and CPU 2's ->nocb_cb_kthread. Then ->nocb_gp_kthread sleeps waiting for more newly queued callbacks.
8. CPU 1's ->nocb_cb_kthread invokes its callback, then sleeps waiting for more invocable callbacks.
9. Note that neither kthread updated any ->nocb_timer state, so CPU 1's ->nocb_defer_wakeup is still set to RCU_NOCB_WAKE.
10. CPU 1 enqueues its second callback, this time with interrupts enabled so it can directly wake ->nocb_gp_kthread. It does so by calling wake_nocb_gp(), which also cancels the pending timer that got queued in step 2. But that doesn't reset CPU 1's ->nocb_defer_wakeup, which is still set to RCU_NOCB_WAKE. So CPU 1's ->nocb_defer_wakeup and its ->nocb_timer are now desynchronized.
11. ->nocb_gp_kthread associates the callback queued in 10 with a new grace period, arranges for that grace period to start and sleeps waiting for it to complete.
12. The grace period ends, rcu_gp_kthread awakens ->nocb_gp_kthread, which in turn wakes up CPU 1's ->nocb_cb_kthread which then invokes the callback queued in 10.
13. CPU 1 enqueues its third callback, this time with interrupts disabled so it must queue a timer for a deferred wakeup. However the value of its ->nocb_defer_wakeup is RCU_NOCB_WAKE which incorrectly indicates that a timer is already queued. Instead, CPU 1's ->nocb_timer was cancelled in 10. CPU 1 therefore fails to queue the ->nocb_timer.
14. CPU 1 has its pending callback and it may go unnoticed until some other CPU ever wakes up ->nocb_gp_kthread or CPU 1 ever calls an explicit deferred wakeup, for example, during idle entry.
This commit fixes this bug by resetting rdp->nocb_defer_wakeup every time we delete the ->nocb_timer.
Fixes: d1b222c6be1f ("rcu/nocb: Add bypass callback queueing")
Cc: <stable@vger.kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 44746d8..429491d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1721,7 +1721,11 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force,
 		rcu_nocb_unlock_irqrestore(rdp, flags);
 		return false;
 	}
-	del_timer(&rdp->nocb_timer);
+
+	if (READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT) {
+		WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
+		del_timer(&rdp->nocb_timer);
+	}
 	rcu_nocb_unlock_irqrestore(rdp, flags);
 	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
 	if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) {
@@ -2350,7 +2354,6 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp)
 		return false;
 	}
 	ndw = READ_ONCE(rdp->nocb_defer_wakeup);
-	WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
 	ret = wake_nocb_gp(rdp, ndw == RCU_NOCB_WAKE_FORCE, flags);
 	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DeferredWake"));
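As an aside (illustration only, not kernel code): the flag/timer desynchronization described in steps 10-13 can be modeled in a tiny userspace C sketch. The names queue_deferred_wake(), wake_gp_buggy() and wake_gp_fixed() are invented for this illustration; defer_wakeup stands in for ->nocb_defer_wakeup and timer_armed for a pending ->nocb_timer.

```c
#include <assert.h>
#include <stdbool.h>

enum { WAKE_NOT, WAKE };            /* stands in for RCU_NOCB_WAKE_NOT / RCU_NOCB_WAKE */

static int defer_wakeup = WAKE_NOT; /* stands in for rdp->nocb_defer_wakeup */
static bool timer_armed;            /* stands in for a pending rdp->nocb_timer */

/* Enqueue with interrupts disabled: defer the wakeup through a timer
 * (steps 2 and 13 above). Returns false when the flag claims a timer
 * is already queued, in which case no timer is armed. */
static bool queue_deferred_wake(void)
{
	if (defer_wakeup != WAKE_NOT)
		return false;
	defer_wakeup = WAKE;
	timer_armed = true;
	return true;
}

/* Direct wakeup, buggy variant (step 10): cancels the timer but leaves
 * defer_wakeup stale, so flag and timer become desynchronized. */
static void wake_gp_buggy(void)
{
	timer_armed = false;
}

/* Direct wakeup with the fix applied: the flag is reset whenever the
 * timer is deleted, so flag and timer can never disagree. */
static void wake_gp_fixed(void)
{
	if (defer_wakeup > WAKE_NOT) {
		defer_wakeup = WAKE_NOT;
		timer_armed = false;
	}
}
```

With the buggy variant, a later deferred enqueue is refused even though no timer is armed, which is exactly the missed rearm of step 13.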
On Tue, Mar 02, 2021 at 10:17:29AM -0800, Paul E. McKenney wrote:
On Tue, Mar 02, 2021 at 01:34:44PM +0100, Frederic Weisbecker wrote:
OK, how about if I queue a temporary commit (shown below) that just calls out the first scenario so that I can start testing, and you get me more detail on the second scenario? I can then update the commit.
Sure, meanwhile here is an attempt for a nocb_bypass_timer based scenario, it's overly hairy and perhaps I picture more power in the hands of callbacks advancing on nocb_cb_wait() than it really has:
0. CPU 0's ->nocb_cb_kthread just called rcu_do_batch() and executed all the ready callbacks. Its segcblist is now entirely empty. It's preempted while calling local_bh_enable().
1. A new callback is enqueued on CPU 0 with IRQs enabled. So the ->nocb_gp_kthread for CPUs 0-2 is awakened. Then a storm of callback enqueues follows on CPU 0 and even reaches the bypass queue. Note that ->nocb_gp_kthread is also associated with CPU 0.
2. CPU 0 queues one last bypass callback.
3. The ->nocb_gp_kthread wakes up and associates a grace period with the whole queue of regular callbacks on CPU 0. It also tries to flush the bypass queue of CPU 0, but the bypass lock is contended due to the concurrent enqueuing in the previous step 2, so the flush fails.
4. This ->nocb_gp_kthread arms its ->nocb_bypass_timer and goes to sleep waiting for the end of this future grace period.
5. This grace period elapses before the ->nocb_bypass_timer fires. This is normally improbable given that the timer is set for only two jiffies, but timers can be delayed. Besides, it is possible that the kernel was built with CONFIG_RCU_STRICT_GRACE_PERIOD=y.
6. The grace period ends, so rcu_gp_kthread awakens the ->nocb_gp_kthread, but it doesn't get a chance to run on a CPU for a while.
7. CPU 0's ->nocb_cb_kthread gets back on the CPU after its preemption. As it notices the newly completed grace period, it advances the callbacks and executes them. Then it gets preempted again in local_bh_enable().
8. A new callback enqueued on CPU 0 itself flushes the bypass queue because CPU 0's ->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy.
9. CPUs from other ->nocb_gp_kthread groups (above CPU 2) initiate and elapse a few grace periods. CPU 0's ->nocb_gp_kthread still hasn't had an opportunity to run on a CPU and its ->nocb_bypass_timer still hasn't fired.
10. CPU 0's ->nocb_cb_kthread wakes up from preemption. It notices the new grace periods that have elapsed, advances all the callbacks and executes them. Then it goes to sleep waiting for invocable callbacks.
11. CPU 0 enqueues a new callback with interrupts disabled, and defers awakening its ->nocb_gp_kthread even though ->nocb_gp_sleep is actually false. It therefore queues its rcu_data structure's ->nocb_timer. At this point, CPU 0's rdp->nocb_defer_wakeup is RCU_NOCB_WAKE.
12. The ->nocb_bypass_timer finally fires! It doesn't wake up ->nocb_gp_kthread because it is actually already awake. But it cancels CPU 0's ->nocb_timer armed in step 11. Yet it doesn't reset CPU 0's ->nocb_defer_wakeup, which keeps the stale RCU_NOCB_WAKE value. So CPU 0's ->nocb_defer_wakeup and its ->nocb_timer are now desynchronized.
13. The ->nocb_gp_kthread finally runs. It cancels the ->nocb_bypass_timer, which has already fired. It sees the new callback on CPU 0, associates it with a new grace period, then sleeps on it.
14. The grace period elapses, rcu_gp_kthread wakes up ->nocb_gp_kthread, which wakes up CPU 0's ->nocb_cb_kthread, which runs the callback. Both ->nocb_gp_kthread and CPU 0's ->nocb_cb_kthread now wait for new callbacks.
15. CPU 0 enqueues another callback, again with interrupts disabled so it must queue a timer for a deferred wakeup. However the value of its ->nocb_defer_wakeup is RCU_NOCB_WAKE which incorrectly indicates that a timer is already queued. Instead, CPU 0's ->nocb_timer was cancelled in 12. CPU 0 therefore fails to queue the ->nocb_timer.
16. CPU 0 has its pending callback and it may go unnoticed until some other CPU ever wakes up ->nocb_gp_kthread or CPU 0 ever calls an explicit deferred wakeup, for example, during idle entry.
Thanks.
On Wed, Mar 03, 2021 at 02:35:33AM +0100, Frederic Weisbecker wrote:
Thank you very much!
I must defer looking through this in detail until I am more awake, but I do very much like the fine-grained exposition.
Thanx, Paul
On Tue, Mar 02, 2021 at 06:06:43PM -0800, Paul E. McKenney wrote:
I'm just not so sure about the above point 10. Even though a few grace periods have elapsed, the callback queued in 8 is in RCU_NEXT_TAIL at this point. Perhaps one more grace period is necessary after that.
Anyway, I need to be more awake as well before checking that again.
Hi,
On 3/2/2021 11:47 PM, Paul E. McKenney wrote:
Reviewed-by: Neeraj Upadhyay neeraju@codeaurora.org
Thanks Neeraj
Instead of flushing bypass at the very last moment in the deoffloading process, just disable bypass enqueue as soon as we start the deoffloading process and flush the pending bypass early. It's less fragile and it gives the kthreads and softirqs some time to process quietly.
Symmetrically, only enable bypass once we safely complete the offloading process.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/rcu_segcblist.h |  7 ++++---
 kernel/rcu/tree_plugin.h      | 38 ++++++++++++++++++++++++++---------
 2 files changed, 33 insertions(+), 12 deletions(-)
diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h
index 8afe886e85f1..3db96c4f45fd 100644
--- a/include/linux/rcu_segcblist.h
+++ b/include/linux/rcu_segcblist.h
@@ -109,7 +109,7 @@ struct rcu_cblist {
  * |                         SEGCBLIST_KTHREAD_GP                             |
  * |                                                                          |
  * |  Kthreads handle callbacks holding nocb_lock, local rcu_core() stops     |
- * |  handling callbacks.                                                     |
+ * |  handling callbacks. Enable bypass queueing.                             |
  * ----------------------------------------------------------------------------
  */
@@ -125,7 +125,7 @@ struct rcu_cblist {
  * |                         SEGCBLIST_KTHREAD_GP                             |
  * |                                                                          |
  * |  CB/GP kthreads handle callbacks holding nocb_lock, local rcu_core()     |
- * |  ignores callbacks.                                                      |
+ * |  ignores callbacks. Bypass enqueue is enabled.                           |
  * ----------------------------------------------------------------------------
  *                                       |
  *                                       v
@@ -134,7 +134,8 @@ struct rcu_cblist {
  * |                         SEGCBLIST_KTHREAD_GP                             |
  * |                                                                          |
  * |  CB/GP kthreads and local rcu_core() handle callbacks concurrently       |
- * |  holding nocb_lock. Wake up CB and GP kthreads if necessary.             |
+ * |  holding nocb_lock. Wake up CB and GP kthreads if necessary. Disable     |
+ * |  bypass enqueue.                                                         |
  * ----------------------------------------------------------------------------
  *                                       |
  *                                       v
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index dd0dc66c282d..924fa3d1df0d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1842,11 +1842,22 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	unsigned long j = jiffies;
 	long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
 
+	lockdep_assert_irqs_disabled();
+
+	// Pure softirq/rcuc based processing: no bypassing, no
+	// locking.
 	if (!rcu_rdp_is_offloaded(rdp)) {
+		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
+		return false;
+	}
+
+	// In the process of (de-)offloading: no bypassing, but
+	// locking.
+	if (!rcu_segcblist_completely_offloaded(&rdp->cblist)) {
+		rcu_nocb_lock(rdp);
 		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
 		return false; /* Not offloaded, no bypassing. */
 	}
-	lockdep_assert_irqs_disabled();
 
 	// Don't use ->nocb_bypass during early boot.
 	if (rcu_scheduler_active != RCU_SCHEDULER_RUNNING) {
@@ -2429,7 +2440,16 @@ static long rcu_nocb_rdp_deoffload(void *arg)
 	pr_info("De-offloading %d\n", rdp->cpu);
 
 	rcu_nocb_lock_irqsave(rdp, flags);
-
+	/*
+	 * Flush once and for all now. This suffices because we are
+	 * running on the target CPU holding ->nocb_lock (thus having
+	 * interrupts disabled), and because rdp_offload_toggle()
+	 * invokes rcu_segcblist_offload(), which clears SEGCBLIST_OFFLOADED.
+	 * Thus future calls to rcu_segcblist_completely_offloaded() will
+	 * return false, which means that future calls to rcu_nocb_try_bypass()
+	 * will refuse to put anything into the bypass.
+	 */
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
 	ret = rdp_offload_toggle(rdp, false, flags);
 	swait_event_exclusive(rdp->nocb_state_wq,
 			      !rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB |
@@ -2441,21 +2461,21 @@ static long rcu_nocb_rdp_deoffload(void *arg)
 	del_timer_sync(&rdp->nocb_timer);
 
 	/*
-	 * Flush bypass. While IRQs are disabled and once we set
-	 * SEGCBLIST_SOFTIRQ_ONLY, no callback is supposed to be
-	 * enqueued on bypass.
+	 * Theoretically we could set SEGCBLIST_SOFTIRQ_ONLY with CB unlocked
+	 * and IRQs disabled but let's be paranoid.
 	 */
 	rcu_nocb_lock_irqsave(rdp, flags);
-	rcu_nocb_flush_bypass(rdp, NULL, jiffies);
 	rcu_segcblist_set_flags(cblist, SEGCBLIST_SOFTIRQ_ONLY);
 	/*
 	 * With SEGCBLIST_SOFTIRQ_ONLY, we can't use
-	 * rcu_nocb_unlock_irqrestore() anymore. Theoretically we
-	 * could set SEGCBLIST_SOFTIRQ_ONLY with cb unlocked and IRQs
-	 * disabled now, but let's be paranoid.
+	 * rcu_nocb_unlock_irqrestore() anymore.
 	 */
 	raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags);
 
+	/* Sanity check */
+	WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
+
 	return ret;
 }
Remove the stale comment claiming that the cblist must be empty before changing the offloading state. This was true only when the offloaded state was defined exclusively at boot time.
Reported-by: Paul E. McKenney paulmck@kernel.org Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com --- kernel/rcu/rcu_segcblist.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index 7f181c9675f7..aaa111237b60 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -261,8 +261,7 @@ void rcu_segcblist_disable(struct rcu_segcblist *rsclp) }
/* - * Mark the specified rcu_segcblist structure as offloaded. This - * structure must be empty. + * Mark the specified rcu_segcblist structure as offloaded. */ void rcu_segcblist_offload(struct rcu_segcblist *rsclp, bool offload) {
Those tracing calls don't need to be under the nocb lock. Move them outside.
Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com --- kernel/rcu/tree_plugin.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 924fa3d1df0d..587df271d640 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1715,9 +1715,9 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force,
lockdep_assert_held(&rdp->nocb_lock); if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) { + rcu_nocb_unlock_irqrestore(rdp, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("AlreadyAwake")); - rcu_nocb_unlock_irqrestore(rdp, flags); return false; }
@@ -1967,9 +1967,9 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, // If we are being polled or there is no kthread, just leave. t = READ_ONCE(rdp->nocb_gp_kthread); if (rcu_nocb_poll || !t) { + rcu_nocb_unlock_irqrestore(rdp, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNotPoll")); - rcu_nocb_unlock_irqrestore(rdp, flags); return; } // Need to actually to a wakeup. @@ -2004,8 +2004,8 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, TPS("WakeOvfIsDeferred")); rcu_nocb_unlock_irqrestore(rdp, flags); } else { + rcu_nocb_unlock_irqrestore(rdp, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); - rcu_nocb_unlock_irqrestore(rdp, flags); } return; }
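The change above can be sketched as a plain userspace model (hypothetical, simplified: a pthread mutex stands in for ->nocb_lock and a counter for the tracepoint; names are made up for illustration):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t nocb_lock = PTHREAD_MUTEX_INITIALIZER;
static int trace_events;

static void trace_wake(const char *reason)
{
	(void)reason;
	trace_events++;		/* stands in for trace_rcu_nocb_wake() */
}

/* Returns 1 if a wakeup would be issued, 0 otherwise. */
static int wake_attempt(int kthread_present)
{
	pthread_mutex_lock(&nocb_lock);
	if (!kthread_present) {
		/* Drop the lock first: the tracepoint reads no locked state,
		 * so moving it outside shrinks the critical section. */
		pthread_mutex_unlock(&nocb_lock);
		trace_wake("AlreadyAwake");
		return 0;
	}
	pthread_mutex_unlock(&nocb_lock);
	return 1;
}
```

The point is only the ordering: unlock first, trace second, with no change in behavior.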
Currently each offloaded rdp has its own nocb_timer armed when the nocb_gp wakeup must be deferred. This layout has many drawbacks, compared to a solution based on a single timer per rdp group:
* It's a lot of timers to maintain.
* The per rdp nocb lock must be held to arm and cancel the timer and it's already quite contended.
* Firing one timer doesn't cancel the other rdp's timers in the same group:
  - They may conflict and end up with spurious wakeups
  - Each rdp that armed a timer needs to lock both nocb_lock and then nocb_gp_lock upon exit to idle/user/guest mode.
* We can't cancel all of them if we detect an unflushed bypass in nocb_gp_wait(). In fact currently we only ever cancel the nocb_timer of the leader group.
* The leader group's nocb_timer is cancelled without locking nocb_lock in nocb_gp_wait(). It appears to be safe currently but is very error prone and hairy for reviewers.
* Since the timer takes the nocb_lock, it requires extra care in the NOCB (de-)offloading process, needing it to be either enabled or disabled and flushed.
Use a single per-group timer instead, armed on the rdp group leader. It is protected by ->nocb_gp_lock, which is _way_ less contended and stays so after this change. As a matter of fact, the nocb_timer almost never fires and the deferred wakeup is mostly handled on idle/user/guest entry. Now the early check performed at this point in do_nocb_deferred_wakeup() is done on rdp_gp->nocb_defer_wakeup, which is of course racy. It doesn't matter though, because we only need the guarantee to see the timer armed if we were the last one to arm it. Any other situation (another rdp has armed it and we either see it or not) is fine.
This solves all the issues listed above.
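A rough userspace model of the resulting design (hypothetical names, a pthread mutex standing in for ->nocb_gp_lock, a boolean standing in for the single group timer; this is a sketch of the scheme, not the kernel code):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

enum wake_type { WAKE_NOT = 0, WAKE_NORMAL = 1, WAKE_FORCE = 2 };

struct group_leader {
	pthread_mutex_t gp_lock;	/* models ->nocb_gp_lock */
	int defer_wakeup;		/* models ->nocb_defer_wakeup */
	bool timer_armed;		/* models the single group nocb_timer */
};

/* Any rdp in the group defers through the leader: arm the single
 * timer if not armed, escalate the wake type if needed. */
static void wake_gp_defer(struct group_leader *gp, int waketype)
{
	pthread_mutex_lock(&gp->gp_lock);
	if (gp->defer_wakeup == WAKE_NOT)
		gp->timer_armed = true;		/* models mod_timer() */
	if (gp->defer_wakeup < waketype)
		gp->defer_wakeup = waketype;
	pthread_mutex_unlock(&gp->gp_lock);
}

/* One cancellation point disarms the timer for the whole group. */
static bool wake_gp(struct group_leader *gp)
{
	bool did_wake = false;

	pthread_mutex_lock(&gp->gp_lock);
	if (gp->defer_wakeup > WAKE_NOT) {
		gp->defer_wakeup = WAKE_NOT;
		gp->timer_armed = false;	/* models del_timer() */
		did_wake = true;
	}
	pthread_mutex_unlock(&gp->gp_lock);
	return did_wake;
}
```

Two rdps deferring back to back arm the timer once and merely escalate the state, and a single wakeup cancels it for everyone, which is exactly what removes the per-rdp timer maintenance listed above.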
Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com --- kernel/rcu/tree.h | 1 - kernel/rcu/tree_plugin.h | 142 +++++++++++++++++++++------------------ 2 files changed, 78 insertions(+), 65 deletions(-)
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 71821d59d95c..b280a843bd2c 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -257,7 +257,6 @@ struct rcu_data { };
/* Values for nocb_defer_wakeup field in struct rcu_data. */ -#define RCU_NOCB_WAKE_OFF -1 #define RCU_NOCB_WAKE_NOT 0 #define RCU_NOCB_WAKE 1 #define RCU_NOCB_WAKE_FORCE 2 diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 587df271d640..847636d3e93d 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -33,10 +33,6 @@ static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp) return false; }
-static inline bool rcu_running_nocb_timer(struct rcu_data *rdp) -{ - return (timer_curr_running(&rdp->nocb_timer) && !in_irq()); -} #else static inline int rcu_lockdep_is_held_nocb(struct rcu_data *rdp) { @@ -48,11 +44,6 @@ static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp) return false; }
-static inline bool rcu_running_nocb_timer(struct rcu_data *rdp) -{ - return false; -} - #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
static bool rcu_rdp_is_offloaded(struct rcu_data *rdp) @@ -72,8 +63,7 @@ static bool rcu_rdp_is_offloaded(struct rcu_data *rdp) rcu_lockdep_is_held_nocb(rdp) || (rdp == this_cpu_ptr(&rcu_data) && !(IS_ENABLED(CONFIG_PREEMPT_COUNT) && preemptible())) || - rcu_current_is_nocb_kthread(rdp) || - rcu_running_nocb_timer(rdp)), + rcu_current_is_nocb_kthread(rdp)), "Unsafe read of RCU_NOCB offloaded state" );
@@ -1702,43 +1692,50 @@ bool rcu_is_nocb_cpu(int cpu) return false; }
-/* - * Kick the GP kthread for this NOCB group. Caller holds ->nocb_lock - * and this function releases it. - */ -static bool wake_nocb_gp(struct rcu_data *rdp, bool force, - unsigned long flags) - __releases(rdp->nocb_lock) +static bool __wake_nocb_gp(struct rcu_data *rdp_gp, + struct rcu_data *rdp, + bool force, unsigned long flags) + __releases(rdp_gp->nocb_gp_lock) { bool needwake = false; - struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
- lockdep_assert_held(&rdp->nocb_lock); if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) { - rcu_nocb_unlock_irqrestore(rdp, flags); + raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("AlreadyAwake")); return false; }
- if (READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT) { - WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT); - del_timer(&rdp->nocb_timer); + if (rdp_gp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) { + WRITE_ONCE(rdp_gp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT); + del_timer(&rdp_gp->nocb_timer); } - rcu_nocb_unlock_irqrestore(rdp, flags); - raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags); + if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) { WRITE_ONCE(rdp_gp->nocb_gp_sleep, false); needwake = true; + } + raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags); + if (needwake) { trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DoWake")); - } - raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags); - if (needwake) wake_up_process(rdp_gp->nocb_gp_kthread); + }
return needwake; }
+/* + * Kick the GP kthread for this NOCB group. + */ +static bool wake_nocb_gp(struct rcu_data *rdp, bool force) +{ + unsigned long flags; + struct rcu_data *rdp_gp = rdp->nocb_gp_rdp; + + raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags); + return __wake_nocb_gp(rdp_gp, rdp, force, flags); +} + /* * Arrange to wake the GP kthread for this NOCB group at some future * time when it is safe to do so. @@ -1746,12 +1743,18 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force, static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, const char *reason) { - if (rdp->nocb_defer_wakeup == RCU_NOCB_WAKE_OFF) - return; - if (rdp->nocb_defer_wakeup == RCU_NOCB_WAKE_NOT) - mod_timer(&rdp->nocb_timer, jiffies + 1); - if (rdp->nocb_defer_wakeup < waketype) - WRITE_ONCE(rdp->nocb_defer_wakeup, waketype); + unsigned long flags; + struct rcu_data *rdp_gp = rdp->nocb_gp_rdp; + + raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags); + + if (rdp_gp->nocb_defer_wakeup == RCU_NOCB_WAKE_NOT) + mod_timer(&rdp_gp->nocb_timer, jiffies + 1); + if (rdp_gp->nocb_defer_wakeup < waketype) + WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype); + + raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags); + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, reason); }
@@ -1978,13 +1981,14 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, rdp->qlen_last_fqs_check = len; if (!irqs_disabled_flags(flags)) { /* ... if queue was empty ... */ - wake_nocb_gp(rdp, false, flags); + rcu_nocb_unlock_irqrestore(rdp, flags); + wake_nocb_gp(rdp, false); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeEmpty")); } else { + rcu_nocb_unlock_irqrestore(rdp, flags); wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE, TPS("WakeEmptyIsDeferred")); - rcu_nocb_unlock_irqrestore(rdp, flags); } } else if (len > rdp->qlen_last_fqs_check + qhimark) { /* ... or if many callbacks queued. */ @@ -1999,10 +2003,14 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, smp_mb(); /* Enqueue before timer_pending(). */ if ((rdp->nocb_cb_sleep || !rcu_segcblist_ready_cbs(&rdp->cblist)) && - !timer_pending(&rdp->nocb_bypass_timer)) + !timer_pending(&rdp->nocb_bypass_timer)) { + rcu_nocb_unlock_irqrestore(rdp, flags); wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE, TPS("WakeOvfIsDeferred")); - rcu_nocb_unlock_irqrestore(rdp, flags); + } else { + rcu_nocb_unlock_irqrestore(rdp, flags); + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); + } } else { rcu_nocb_unlock_irqrestore(rdp, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); @@ -2128,11 +2136,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) bypass = true; } rnp = rdp->mynode; - if (bypass) { // Avoid race with first bypass CB. - WRITE_ONCE(my_rdp->nocb_defer_wakeup, - RCU_NOCB_WAKE_NOT); - del_timer(&my_rdp->nocb_timer); - } + // Advance callbacks if helpful and low contention. needwake_gp = false; if (!rcu_segcblist_restempty(&rdp->cblist, @@ -2178,11 +2182,18 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) my_rdp->nocb_gp_bypass = bypass; my_rdp->nocb_gp_gp = needwait_gp; my_rdp->nocb_gp_seq = needwait_gp ? 
wait_gp_seq : 0; - if (bypass && !rcu_nocb_poll) { - // At least one child with non-empty ->nocb_bypass, so set - // timer in order to avoid stranding its callbacks. + if (bypass) { raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags); - mod_timer(&my_rdp->nocb_bypass_timer, j + 2); + // Avoid race with first bypass CB. + if (my_rdp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) { + WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT); + del_timer(&my_rdp->nocb_timer); + } + if (!rcu_nocb_poll) { + // At least one child with non-empty ->nocb_bypass, so set + // timer in order to avoid stranding its callbacks. + mod_timer(&my_rdp->nocb_bypass_timer, j + 2); + } raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags); } if (rcu_nocb_poll) { @@ -2356,15 +2367,18 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp) { unsigned long flags; int ndw; + struct rcu_data *rdp_gp = rdp->nocb_gp_rdp; int ret;
- rcu_nocb_lock_irqsave(rdp, flags); - if (!rcu_nocb_need_deferred_wakeup(rdp)) { - rcu_nocb_unlock_irqrestore(rdp, flags); + raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags); + + if (!rcu_nocb_need_deferred_wakeup(rdp_gp)) { + raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags); return false; } - ndw = READ_ONCE(rdp->nocb_defer_wakeup); - ret = wake_nocb_gp(rdp, ndw == RCU_NOCB_WAKE_FORCE, flags); + + ndw = rdp_gp->nocb_defer_wakeup; + ret = __wake_nocb_gp(rdp_gp, rdp, ndw == RCU_NOCB_WAKE_FORCE, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DeferredWake"));
return ret; @@ -2385,7 +2399,10 @@ static void do_nocb_deferred_wakeup_timer(struct timer_list *t) */ static bool do_nocb_deferred_wakeup(struct rcu_data *rdp) { - if (rcu_nocb_need_deferred_wakeup(rdp)) + if (!rdp->nocb_gp_rdp) + return false; + + if (rcu_nocb_need_deferred_wakeup(rdp->nocb_gp_rdp)) return do_nocb_deferred_wakeup_common(rdp); return false; } @@ -2454,17 +2471,15 @@ static long rcu_nocb_rdp_deoffload(void *arg) swait_event_exclusive(rdp->nocb_state_wq, !rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB | SEGCBLIST_KTHREAD_GP)); - rcu_nocb_lock_irqsave(rdp, flags); - /* Make sure nocb timer won't stay around */ - WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_OFF); - rcu_nocb_unlock_irqrestore(rdp, flags); - del_timer_sync(&rdp->nocb_timer); - /* - * Theoretically we could set SEGCBLIST_SOFTIRQ_ONLY with CB unlocked - * and IRQs disabled but let's be paranoid. + * Lock one last time to acquire latest callback updates from kthreads + * so we can later handle callbacks locally without locking. */ rcu_nocb_lock_irqsave(rdp, flags); + /* + * Theoretically we could set SEGCBLIST_SOFTIRQ_ONLY after the nocb + * lock is released but how about being paranoid for once? + */ rcu_segcblist_set_flags(cblist, SEGCBLIST_SOFTIRQ_ONLY); /* * With SEGCBLIST_SOFTIRQ_ONLY, we can't use @@ -2528,8 +2543,7 @@ static long rcu_nocb_rdp_offload(void *arg) * SEGCBLIST_SOFTIRQ_ONLY mode. */ raw_spin_lock_irqsave(&rdp->nocb_lock, flags); - /* Re-enable nocb timer */ - WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT); + /* * We didn't take the nocb lock while working on the * rdp->cblist in SEGCBLIST_SOFTIRQ_ONLY mode.
On Tue, Feb 23, 2021 at 01:10:03AM +0100, Frederic Weisbecker wrote:
Currently each offloaded rdp has its own nocb_timer armed when the nocb_gp wakeup must be deferred. This layout has many drawbacks, compared to a solution based on a single timer per rdp group:
There are a lot of timers to maintain.
The per-rdp ->nocb_lock must be held to queue and cancel the timer and this lock can already be quite contended.
One timer firing doesn't cancel the other timers in the same group:
- These other timers can thus cause spurious wakeups
- Each rdp that queued a timer must lock both ->nocb_lock and then ->nocb_gp_lock upon exit from the kernel to idle/user/guest mode.
We can't cancel all of them if we detect an unflushed bypass in nocb_gp_wait(). In fact currently we only ever cancel the nocb_timer of the leader group.
The leader group's nocb_timer is cancelled without locking ->nocb_lock in nocb_gp_wait(). This currently appears to be safe but is an accident waiting to happen.
Since the timer acquires ->nocb_lock, it requires extra care in the NOCB (de-)offloading process, requiring that it be either enabled or disabled and flushed.
This commit instead uses the rcuog kthread's CPU's ->nocb_timer. It is protected by ->nocb_gp_lock, which is _way_ less contended and remains so even after this change. As a matter of fact, the nocb_timer almost never fires and the deferred wakeup is mostly carried out upon idle/user/guest entry. Now the early check performed at this point in do_nocb_deferred_wakeup() is done on rdp_gp->nocb_defer_wakeup, which is of course racy. However, this raciness is harmless because we only need the guarantee that the timer is queued if we were the last one to queue it. Any other situation (another CPU has queued it and we either see it or not) is fine.
This solves all the issues listed above.
Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com
I pulled in the previous three (2-4/13) with the usual commit-log wordsmithing, thank you! And I could not resist wordsmithing above.
I do very much like the general approach, but a few questions below.
The first question is of course: Did you try this with lockdep enabled? ;-)
kernel/rcu/tree.h | 1 - kernel/rcu/tree_plugin.h | 142 +++++++++++++++++++++------------------ 2 files changed, 78 insertions(+), 65 deletions(-)
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 71821d59d95c..b280a843bd2c 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -257,7 +257,6 @@ struct rcu_data { }; /* Values for nocb_defer_wakeup field in struct rcu_data. */ -#define RCU_NOCB_WAKE_OFF -1 #define RCU_NOCB_WAKE_NOT 0 #define RCU_NOCB_WAKE 1 #define RCU_NOCB_WAKE_FORCE 2 diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 587df271d640..847636d3e93d 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -33,10 +33,6 @@ static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp) return false; } -static inline bool rcu_running_nocb_timer(struct rcu_data *rdp) -{
- return (timer_curr_running(&rdp->nocb_timer) && !in_irq());
-} #else static inline int rcu_lockdep_is_held_nocb(struct rcu_data *rdp) { @@ -48,11 +44,6 @@ static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp) return false; } -static inline bool rcu_running_nocb_timer(struct rcu_data *rdp) -{
- return false;
-}
#endif /* #ifdef CONFIG_RCU_NOCB_CPU */ static bool rcu_rdp_is_offloaded(struct rcu_data *rdp) @@ -72,8 +63,7 @@ static bool rcu_rdp_is_offloaded(struct rcu_data *rdp) rcu_lockdep_is_held_nocb(rdp) || (rdp == this_cpu_ptr(&rcu_data) && !(IS_ENABLED(CONFIG_PREEMPT_COUNT) && preemptible())) ||
-		  rcu_current_is_nocb_kthread(rdp) ||
-		  rcu_running_nocb_timer(rdp)),
+		  rcu_current_is_nocb_kthread(rdp)),
 		 "Unsafe read of RCU_NOCB offloaded state"
 	);
@@ -1702,43 +1692,50 @@ bool rcu_is_nocb_cpu(int cpu) return false; } -/*
- Kick the GP kthread for this NOCB group. Caller holds ->nocb_lock
- and this function releases it.
- */
-static bool wake_nocb_gp(struct rcu_data *rdp, bool force,
unsigned long flags)
- __releases(rdp->nocb_lock)
+static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
struct rcu_data *rdp,
bool force, unsigned long flags)
- __releases(rdp_gp->nocb_gp_lock)
{ bool needwake = false;
- struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
-	lockdep_assert_held(&rdp->nocb_lock);
 	if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) {
-		rcu_nocb_unlock_irqrestore(rdp, flags);
+		raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
 		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("AlreadyAwake"));
 		return false;
 	}

-	if (READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT) {
-		WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
-		del_timer(&rdp->nocb_timer);
+	if (rdp_gp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {

So there are no longer any data races involving ->nocb_defer_wakeup?

(Yes, I could fire up KCSAN, but my KCSAN-capable system is otherwise occupied for several more hours.)

+		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
+		del_timer(&rdp_gp->nocb_timer);
 	}
- rcu_nocb_unlock_irqrestore(rdp, flags);
- raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
- if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) { WRITE_ONCE(rdp_gp->nocb_gp_sleep, false); needwake = true;
- }
- raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
- if (needwake) { trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DoWake"));
- }
- raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
- if (needwake) wake_up_process(rdp_gp->nocb_gp_kthread);
- }
return needwake; } +/*
- Kick the GP kthread for this NOCB group.
- */
+static bool wake_nocb_gp(struct rcu_data *rdp, bool force) +{
- unsigned long flags;
- struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
- raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
- return __wake_nocb_gp(rdp_gp, rdp, force, flags);
+}
/*
- Arrange to wake the GP kthread for this NOCB group at some future
- time when it is safe to do so.
@@ -1746,12 +1743,18 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force, static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, const char *reason) {
- if (rdp->nocb_defer_wakeup == RCU_NOCB_WAKE_OFF)
return;
- if (rdp->nocb_defer_wakeup == RCU_NOCB_WAKE_NOT)
mod_timer(&rdp->nocb_timer, jiffies + 1);
- if (rdp->nocb_defer_wakeup < waketype)
WRITE_ONCE(rdp->nocb_defer_wakeup, waketype);
- unsigned long flags;
- struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
- raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
- if (rdp_gp->nocb_defer_wakeup == RCU_NOCB_WAKE_NOT)
mod_timer(&rdp_gp->nocb_timer, jiffies + 1);
- if (rdp_gp->nocb_defer_wakeup < waketype)
WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
- raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
- trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, reason);
} @@ -1978,13 +1981,14 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, rdp->qlen_last_fqs_check = len; if (!irqs_disabled_flags(flags)) { /* ... if queue was empty ... */
wake_nocb_gp(rdp, false, flags);
rcu_nocb_unlock_irqrestore(rdp, flags);
} else {wake_nocb_gp(rdp, false); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeEmpty"));
rcu_nocb_unlock_irqrestore(rdp, flags); wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE, TPS("WakeEmptyIsDeferred"));
} } else if (len > rdp->qlen_last_fqs_check + qhimark) { /* ... or if many callbacks queued. */rcu_nocb_unlock_irqrestore(rdp, flags);
@@ -1999,10 +2003,14 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
 		smp_mb(); /* Enqueue before timer_pending(). */
 		if ((rdp->nocb_cb_sleep ||
 		     !rcu_segcblist_ready_cbs(&rdp->cblist)) &&
-		    !timer_pending(&rdp->nocb_bypass_timer))
+		    !timer_pending(&rdp->nocb_bypass_timer)) {
+			rcu_nocb_unlock_irqrestore(rdp, flags);
 			wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE,
					   TPS("WakeOvfIsDeferred"));
-			rcu_nocb_unlock_irqrestore(rdp, flags);
+		} else {
+			rcu_nocb_unlock_irqrestore(rdp, flags);
+			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot"));
+		}
 	} else {
 		rcu_nocb_unlock_irqrestore(rdp, flags);
 		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot"));
@@ -2128,11 +2136,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) bypass = true; } rnp = rdp->mynode;
if (bypass) { // Avoid race with first bypass CB.
WRITE_ONCE(my_rdp->nocb_defer_wakeup,
RCU_NOCB_WAKE_NOT);
del_timer(&my_rdp->nocb_timer);
}
What ensures that the timer will be properly queued when a non-empty bypass needs it to be?
Never mind, I now see the code below...
- // Advance callbacks if helpful and low contention. needwake_gp = false; if (!rcu_segcblist_restempty(&rdp->cblist,
@@ -2178,11 +2182,18 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) my_rdp->nocb_gp_bypass = bypass; my_rdp->nocb_gp_gp = needwait_gp; my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
-	if (bypass && !rcu_nocb_poll) {
-		// At least one child with non-empty ->nocb_bypass, so set
-		// timer in order to avoid stranding its callbacks.
+	if (bypass) {
 		raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
-		mod_timer(&my_rdp->nocb_bypass_timer, j + 2);
+		// Avoid race with first bypass CB.
+		if (my_rdp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {
+			WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
+			del_timer(&my_rdp->nocb_timer);
+		}

Given that the timer does not get queued if rcu_nocb_poll, why not move the above "if" statement under the one following?

+		if (!rcu_nocb_poll) {
+			// At least one child with non-empty ->nocb_bypass, so set
+			// timer in order to avoid stranding its callbacks.
+			mod_timer(&my_rdp->nocb_bypass_timer, j + 2);
+		}
 		raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
 	}
 	if (rcu_nocb_poll) {
@@ -2356,15 +2367,18 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp) { unsigned long flags; int ndw;
- struct rcu_data *rdp_gp = rdp->nocb_gp_rdp; int ret;
-	rcu_nocb_lock_irqsave(rdp, flags);
-	if (!rcu_nocb_need_deferred_wakeup(rdp)) {
-		rcu_nocb_unlock_irqrestore(rdp, flags);
+	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
+
+	if (!rcu_nocb_need_deferred_wakeup(rdp_gp)) {
+		raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
 		return false;
 	}
- ndw = READ_ONCE(rdp->nocb_defer_wakeup);
- ret = wake_nocb_gp(rdp, ndw == RCU_NOCB_WAKE_FORCE, flags);
- ndw = rdp_gp->nocb_defer_wakeup;
- ret = __wake_nocb_gp(rdp_gp, rdp, ndw == RCU_NOCB_WAKE_FORCE, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DeferredWake"));
return ret; @@ -2385,7 +2399,10 @@ static void do_nocb_deferred_wakeup_timer(struct timer_list *t) */ static bool do_nocb_deferred_wakeup(struct rcu_data *rdp) {
-	if (rcu_nocb_need_deferred_wakeup(rdp))
+	if (!rdp->nocb_gp_rdp)
+		return false;
This check was not necessary previously because each CPU used its own rdp, correct? The theory is that this early return is taken only during boot, and that the spawning of the kthreads will act as an implicit wakeup? Or am I missing something subtle here?
+	if (rcu_nocb_need_deferred_wakeup(rdp->nocb_gp_rdp))
 		return do_nocb_deferred_wakeup_common(rdp);
 	return false;
} @@ -2454,17 +2471,15 @@ static long rcu_nocb_rdp_deoffload(void *arg) swait_event_exclusive(rdp->nocb_state_wq, !rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB | SEGCBLIST_KTHREAD_GP));
- rcu_nocb_lock_irqsave(rdp, flags);
- /* Make sure nocb timer won't stay around */
- WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_OFF);
- rcu_nocb_unlock_irqrestore(rdp, flags);
- del_timer_sync(&rdp->nocb_timer);
 	/*
-	 * Theoretically we could set SEGCBLIST_SOFTIRQ_ONLY with CB unlocked
-	 * and IRQs disabled but let's be paranoid.
+	 * Lock one last time to acquire latest callback updates from kthreads
+	 * so we can later handle callbacks locally without locking.
 	 */
 	rcu_nocb_lock_irqsave(rdp, flags);
+	/*
+	 * Theoretically we could set SEGCBLIST_SOFTIRQ_ONLY after the nocb
+	 * lock is released but how about being paranoid for once?
+	 */
 	rcu_segcblist_set_flags(cblist, SEGCBLIST_SOFTIRQ_ONLY);
 	/*
 	 * With SEGCBLIST_SOFTIRQ_ONLY, we can't use
@@ -2528,8 +2543,7 @@ static long rcu_nocb_rdp_offload(void *arg) * SEGCBLIST_SOFTIRQ_ONLY mode. */ raw_spin_lock_irqsave(&rdp->nocb_lock, flags);
- /* Re-enable nocb timer */
- WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
- /*
- We didn't take the nocb lock while working on the
- rdp->cblist in SEGCBLIST_SOFTIRQ_ONLY mode.
-- 2.25.1
On Tue, Mar 02, 2021 at 05:15:57PM -0800, Paul E. McKenney wrote:
The first question is of course: Did you try this with lockdep enabled? ;-)
Yep, I always do. But I may miss some configs in my testing. I usually test at least TREE01 on x86 and arm64.
@@ -1702,43 +1692,50 @@ bool rcu_is_nocb_cpu(int cpu) return false; } -/*
- Kick the GP kthread for this NOCB group. Caller holds ->nocb_lock
- and this function releases it.
- */
-static bool wake_nocb_gp(struct rcu_data *rdp, bool force,
unsigned long flags)
- __releases(rdp->nocb_lock)
+static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
struct rcu_data *rdp,
bool force, unsigned long flags)
- __releases(rdp_gp->nocb_gp_lock)
{ bool needwake = false;
- struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
-	lockdep_assert_held(&rdp->nocb_lock);
 	if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) {
-		rcu_nocb_unlock_irqrestore(rdp, flags);
+		raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
 		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("AlreadyAwake"));
 		return false;
 	}

-	if (READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT) {
-		WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
-		del_timer(&rdp->nocb_timer);
+	if (rdp_gp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {
So there are no longer any data races involving ->nocb_defer_wakeup?
(Yes, I could fire up KCSAN, but my KCSAN-capable system is otherwise occupied for several more hours.)
To be more specific, there is no more unlocked write to the timer (queue/cancel) and its nocb_defer_wakeup matching state. And there is only one (on purpose) racy reader of ->nocb_defer_wakeup which is the non-timer deferred wakeup.
So the writes to the timer keep their WRITE_ONCE() and only the reader in do_nocb_deferred_wakeup() keeps its READ_ONCE(). Other readers are protected by the ->nocb_gp_lock.
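That pattern, all writers serialized by the group lock with one tolerated lockless reader, can be sketched in userspace C (hypothetical names; C11 relaxed atomics stand in for the kernel's WRITE_ONCE()/READ_ONCE() annotations):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t gp_lock = PTHREAD_MUTEX_INITIALIZER;
static _Atomic int defer_wakeup;	/* models ->nocb_defer_wakeup */

/* All state changes happen under the group lock, so writers never
 * race with each other; the store is still marked because of the
 * lockless reader below. */
static void arm_deferred_wake(int waketype)
{
	pthread_mutex_lock(&gp_lock);	/* models ->nocb_gp_lock */
	atomic_store_explicit(&defer_wakeup, waketype, memory_order_relaxed);
	pthread_mutex_unlock(&gp_lock);
}

/* Early lockless check (models the READ_ONCE() in
 * do_nocb_deferred_wakeup()): it may miss a concurrent arming by
 * another CPU, which is harmless; we only need to observe our own
 * last arming. */
static int peek_deferred_wake(void)
{
	return atomic_load_explicit(&defer_wakeup, memory_order_relaxed);
}
```

A CPU that just armed the state is guaranteed to see it on its own subsequent peek, which is the only guarantee the fast path relies on.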
- // Advance callbacks if helpful and low contention. needwake_gp = false; if (!rcu_segcblist_restempty(&rdp->cblist,
@@ -2178,11 +2182,18 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) my_rdp->nocb_gp_bypass = bypass; my_rdp->nocb_gp_gp = needwait_gp; my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
-	if (bypass && !rcu_nocb_poll) {
-		// At least one child with non-empty ->nocb_bypass, so set
-		// timer in order to avoid stranding its callbacks.
+	if (bypass) {
 		raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
-		mod_timer(&my_rdp->nocb_bypass_timer, j + 2);
+		// Avoid race with first bypass CB.
+		if (my_rdp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {
+			WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
+			del_timer(&my_rdp->nocb_timer);
+		}
Given that the timer does not get queued if rcu_nocb_poll, why not move the above "if" statement under the one following?
It's done later in the set.
+		if (!rcu_nocb_poll) {
+			// At least one child with non-empty ->nocb_bypass, so set
+			// timer in order to avoid stranding its callbacks.
+			mod_timer(&my_rdp->nocb_bypass_timer, j + 2);
+		}
 		raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
 	}
 	if (rcu_nocb_poll) {
@@ -2385,7 +2399,10 @@ static void do_nocb_deferred_wakeup_timer(struct timer_list *t) */ static bool do_nocb_deferred_wakeup(struct rcu_data *rdp) {
-	if (rcu_nocb_need_deferred_wakeup(rdp))
+	if (!rdp->nocb_gp_rdp)
+		return false;
This check was not necessary previously because each CPU used its own rdp, correct?
Exactly!
The theory is that this early return is taken only during boot, and that the spawning of the kthreads will act as an implicit wakeup?
You guessed right! That probably deserve a comment.
Thanks!
On Wed, Mar 10, 2021 at 11:05:07PM +0100, Frederic Weisbecker wrote:
On Tue, Mar 02, 2021 at 05:15:57PM -0800, Paul E. McKenney wrote:
The first question is of course: Did you try this with lockdep enabled? ;-)
Yep I always do. But I may miss some configs on my testings. I usually test at least TREE01 on x86 and arm64.
@@ -1702,43 +1692,50 @@ bool rcu_is_nocb_cpu(int cpu)
 	return false;
 }
 
-/*
- * Kick the GP kthread for this NOCB group.  Caller holds ->nocb_lock
- * and this function releases it.
- */
-static bool wake_nocb_gp(struct rcu_data *rdp, bool force,
-			 unsigned long flags)
-	__releases(rdp->nocb_lock)
+static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
+			   struct rcu_data *rdp,
+			   bool force, unsigned long flags)
+	__releases(rdp_gp->nocb_gp_lock)
 {
 	bool needwake = false;
-	struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
 
-	lockdep_assert_held(&rdp->nocb_lock);
 	if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) {
-		rcu_nocb_unlock_irqrestore(rdp, flags);
+		raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
 		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
 				    TPS("AlreadyAwake"));
 		return false;
 	}
 
-	if (READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT) {
-		WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
-		del_timer(&rdp->nocb_timer);
+	if (rdp_gp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {
+		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
+		del_timer(&rdp_gp->nocb_timer);
 	}
So there are no longer any data races involving ->nocb_defer_wakeup?
(Yes, I could fire up KCSAN, but my KCSAN-capable system is otherwise occupied for several more hours.)
To be more specific, there is no longer any unlocked write to the timer (queue/cancel) or to its matching nocb_defer_wakeup state. And there is only one (deliberately) racy reader of ->nocb_defer_wakeup, namely the non-timer deferred wakeup.
So the writes to the timer keep their WRITE_ONCE() and only the reader in do_nocb_deferred_wakeup() keeps its READ_ONCE(). Other readers are protected by the ->nocb_gp_lock.
 		// Advance callbacks if helpful and low contention.
 		needwake_gp = false;
 		if (!rcu_segcblist_restempty(&rdp->cblist,
@@ -2178,11 +2182,18 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	my_rdp->nocb_gp_bypass = bypass;
 	my_rdp->nocb_gp_gp = needwait_gp;
 	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
-	if (bypass && !rcu_nocb_poll) {
-		// At least one child with non-empty ->nocb_bypass, so set
-		// timer in order to avoid stranding its callbacks.
+	if (bypass) {
 		raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
-		mod_timer(&my_rdp->nocb_bypass_timer, j + 2);
+		// Avoid race with first bypass CB.
+		if (my_rdp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {
+			WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
+			del_timer(&my_rdp->nocb_timer);
+		}
Given that the timer does not get queued if rcu_nocb_poll, why not move the above "if" statement under the one following?
It's done later in the set.
+		if (!rcu_nocb_poll) {
+			// At least one child with non-empty ->nocb_bypass, so set
+			// timer in order to avoid stranding its callbacks.
+			mod_timer(&my_rdp->nocb_bypass_timer, j + 2);
+		}
 		raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
 	}
 	if (rcu_nocb_poll) {
@@ -2385,7 +2399,10 @@ static void do_nocb_deferred_wakeup_timer(struct timer_list *t)
  */
 static bool do_nocb_deferred_wakeup(struct rcu_data *rdp)
 {
-	if (rcu_nocb_need_deferred_wakeup(rdp))
+	if (!rdp->nocb_gp_rdp)
+		return false;
This check was not necessary previously because each CPU used its own rdp, correct?
Exactly!
The theory is that this early return is taken only during boot, and that the spawning of the kthreads will act as an implicit wakeup?
You guessed right! That probably deserves a comment.
OK, I have queued these for further review and testing, and also to look at the overall effect. Thank you very much!
Thanx, Paul
This reverts commit dcd42591ebb8a25895b551a5297ea9c24414ba54. The only user was RCU/nocb.
Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com Cc: Thomas Gleixner tglx@linutronix.de --- include/linux/timer.h | 2 -- kernel/time/timer.c | 14 -------------- 2 files changed, 16 deletions(-)
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 4118a97e62fb..fda13c9d1256 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -192,8 +192,6 @@ extern int try_to_del_timer_sync(struct timer_list *timer);
 
 #define del_singleshot_timer_sync(t) del_timer_sync(t)
 
-extern bool timer_curr_running(struct timer_list *timer);
-
 extern void init_timers(void);
 
 struct hrtimer;
 extern enum hrtimer_restart it_real_fn(struct hrtimer *);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index f475f1a027c8..8dbc008f8942 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1237,20 +1237,6 @@ int try_to_del_timer_sync(struct timer_list *timer)
 }
 EXPORT_SYMBOL(try_to_del_timer_sync);
 
-bool timer_curr_running(struct timer_list *timer)
-{
-	int i;
-
-	for (i = 0; i < NR_BASES; i++) {
-		struct timer_base *base = this_cpu_ptr(&timer_bases[i]);
-
-		if (base->running_timer == timer)
-			return true;
-	}
-
-	return false;
-}
-
 #ifdef CONFIG_PREEMPT_RT
 static __init void timer_base_init_expiry_lock(struct timer_base *base)
 {
The bypass timer calls __call_rcu_nocb_wake() instead of directly calling __wake_nocb_gp(). The only difference is that rdp->qlen_last_fqs_check gets overridden. But resetting the deferred force-quiescent-state baseline shouldn't be relevant for that timer. In fact, the bypass queue in question can belong to any rdp in the group, not necessarily the rdp leader to which the bypass timer is attached.
Therefore we can simply call __wake_nocb_gp() directly. This way we don't even need to take the nocb_lock.
Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com --- kernel/rcu/tree_plugin.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 847636d3e93d..c80b214a86bb 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2025,9 +2025,10 @@ static void do_nocb_bypass_wakeup_timer(struct timer_list *t)
 	struct rcu_data *rdp = from_timer(rdp, t, nocb_bypass_timer);
 
 	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Timer"));
-	rcu_nocb_lock_irqsave(rdp, flags);
+
+	raw_spin_lock_irqsave(&rdp->nocb_gp_lock, flags);
 	smp_mb__after_spinlock(); /* Timer expire before wakeup. */
-	__call_rcu_nocb_wake(rdp, true, flags);
+	__wake_nocb_gp(rdp, rdp, false, flags);
 }
 
 /*
The only thing that prevented an rdp leader from being de-offloaded was the nocb_bypass_timer that used to lock the nocb_lock of the rdp leader.
If an rdp gets de-offloaded, it will subtly ignore rcu_nocb_lock() calls and the timer will do its job unsafely. Worse yet: if the rdp gets re-offloaded in the middle of the timer handler, rcu_nocb_unlock() will try to release a lock that was never taken, leaving the locking imbalanced.
Now that the nocb_bypass_timer doesn't use the nocb_lock anymore, we can safely allow an rdp leader to be de-offloaded.
Reported-by: Paul E. McKenney paulmck@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com Signed-off-by: Frederic Weisbecker frederic@kernel.org --- kernel/rcu/tree_plugin.h | 4 ---- 1 file changed, 4 deletions(-)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index c80b214a86bb..0fdf0223f08f 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2500,10 +2500,6 @@ int rcu_nocb_cpu_deoffload(int cpu)
 	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
 	int ret = 0;
 
-	if (rdp == rdp->nocb_gp_rdp) {
-		pr_info("Can't deoffload an rdp GP leader (yet)\n");
-		return -EINVAL;
-	}
 	mutex_lock(&rcu_state.barrier_mutex);
 	cpus_read_lock();
 	if (rcu_rdp_is_offloaded(rdp)) {
As we wake up in nocb_gp_wait(), there is no need to keep the nocb_timer around as we are going to go through the whole rdp list again. Any update performed before the timer was armed will now be visible after the nocb_gp_lock acquire.
Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com --- kernel/rcu/tree_plugin.h | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0fdf0223f08f..b62ad79bbda5 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2221,6 +2221,10 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 		raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
 		if (bypass)
 			del_timer(&my_rdp->nocb_bypass_timer);
+		if (my_rdp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {
+			WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
+			del_timer(&my_rdp->nocb_timer);
+		}
 		WRITE_ONCE(my_rdp->nocb_gp_sleep, true);
 		raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
 	}
A NOCB-gp wake up can safely delete the nocb_bypass_timer. nocb_gp_wait() is going to check again the bypass state and rearm the bypass timer if necessary.
Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com --- kernel/rcu/tree_plugin.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index b62ad79bbda5..9da67b0d3997 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1711,6 +1711,8 @@ static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
 		del_timer(&rdp_gp->nocb_timer);
 	}
 
+	del_timer(&rdp_gp->nocb_bypass_timer);
+
 	if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) {
 		WRITE_ONCE(rdp_gp->nocb_gp_sleep, false);
 		needwake = true;
On Tue, Feb 23, 2021 at 01:10:08AM +0100, Frederic Weisbecker wrote:
A NOCB-gp wake up can safely delete the nocb_bypass_timer. nocb_gp_wait() is going to check again the bypass state and rearm the bypass timer if necessary.
Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com
Given that you delete this code a couple of patches later in this series, why not just leave it out entirely? ;-)
Thanx, Paul
 kernel/rcu/tree_plugin.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index b62ad79bbda5..9da67b0d3997 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1711,6 +1711,8 @@ static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
 		del_timer(&rdp_gp->nocb_timer);
 	}
 
+	del_timer(&rdp_gp->nocb_bypass_timer);
+
 	if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) {
 		WRITE_ONCE(rdp_gp->nocb_gp_sleep, false);
 		needwake = true;
-- 2.25.1
On Tue, Mar 02, 2021 at 05:24:56PM -0800, Paul E. McKenney wrote:
On Tue, Feb 23, 2021 at 01:10:08AM +0100, Frederic Weisbecker wrote:
A NOCB-gp wake up can safely delete the nocb_bypass_timer. nocb_gp_wait() is going to check again the bypass state and rearm the bypass timer if necessary.
Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com
Give that you delete this code a couple of patches later in this series, why not just leave it out entirely? ;-)
It's not exactly deleted later, it's rather merged within the "del_timer(&rdp_gp->nocb_timer)".
The purpose of that patch is to make it clear that we explicitly cancel the nocb_bypass_timer here before we do it implicitly later with the merge of nocb_bypass_timer into nocb_timer.
We could drop that patch; the resulting code at the end of the patchset will of course be the same, but the behaviour detail described here might slip out of the reviewers' attention :-)
Thanx, Paul
 kernel/rcu/tree_plugin.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index b62ad79bbda5..9da67b0d3997 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1711,6 +1711,8 @@ static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
 		del_timer(&rdp_gp->nocb_timer);
 	}
 
+	del_timer(&rdp_gp->nocb_bypass_timer);
+
 	if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) {
 		WRITE_ONCE(rdp_gp->nocb_gp_sleep, false);
 		needwake = true;
Thanks.
On Wed, Mar 10, 2021 at 11:17:02PM +0100, Frederic Weisbecker wrote:
On Tue, Mar 02, 2021 at 05:24:56PM -0800, Paul E. McKenney wrote:
On Tue, Feb 23, 2021 at 01:10:08AM +0100, Frederic Weisbecker wrote:
A NOCB-gp wake up can safely delete the nocb_bypass_timer. nocb_gp_wait() is going to check again the bypass state and rearm the bypass timer if necessary.
Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com
Give that you delete this code a couple of patches later in this series, why not just leave it out entirely? ;-)
It's not exactly deleted later, it's rather merged within the "del_timer(&rdp_gp->nocb_timer)".
The purpose of that patch is to make it clear that we explicitly cancel the nocb_bypass_timer here before we do it implicitly later with the merge of nocb_bypass_timer into nocb_timer.
We could drop that patch, the resulting code in the end of the patchset will be the same of course but the behaviour detail described here might slip out of the reviewers attention :-)
How about merging the timers first and adding those small improvements later? i.e. move patches #12 and #13 right after #7 (IIUC, #7 is the last requirement you need for merging the timers), and then patches #8~#11 just follow, because IIUC those patches are not about correctness but more about avoiding unnecessary timer fire-ups, right?
Just my 2 cents. The overall patchset looks good to me ;-)
Feel free to add
Reviewed-by: Boqun Feng boqun.feng@gmail.com
Regards, Boqun
Thanx, Paul
 kernel/rcu/tree_plugin.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index b62ad79bbda5..9da67b0d3997 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1711,6 +1711,8 @@ static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
 		del_timer(&rdp_gp->nocb_timer);
 	}
 
+	del_timer(&rdp_gp->nocb_bypass_timer);
+
 	if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) {
 		WRITE_ONCE(rdp_gp->nocb_gp_sleep, false);
 		needwake = true;
Thanks.
On Mon, Mar 15, 2021 at 10:53:36PM +0800, Boqun Feng wrote:
On Wed, Mar 10, 2021 at 11:17:02PM +0100, Frederic Weisbecker wrote:
On Tue, Mar 02, 2021 at 05:24:56PM -0800, Paul E. McKenney wrote:
On Tue, Feb 23, 2021 at 01:10:08AM +0100, Frederic Weisbecker wrote:
A NOCB-gp wake up can safely delete the nocb_bypass_timer. nocb_gp_wait() is going to check again the bypass state and rearm the bypass timer if necessary.
Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com
Give that you delete this code a couple of patches later in this series, why not just leave it out entirely? ;-)
It's not exactly deleted later, it's rather merged within the "del_timer(&rdp_gp->nocb_timer)".
The purpose of that patch is to make it clear that we explicitly cancel the nocb_bypass_timer here before we do it implicitly later with the merge of nocb_bypass_timer into nocb_timer.
We could drop that patch, the resulting code in the end of the patchset will be the same of course but the behaviour detail described here might slip out of the reviewers attention :-)
How about merging the timers first and adding those small improvements later? i.e. move patch #12 #13 right after #7 (IIUC, #7 is the last requirement you need for merging timers)
Hmm, nope, patches 9 and 10 are actually preparation work for the timers merge. In fact they could even be skipped and the timers could be merged directly, but I wanted the unified behaviour to be fully explicit for reviewers through those incremental changes before merging the timers together.
, and then patch #8~#11 just follow
Patch 8 really needs to stay where it is because it lifts an important limitation on nocb de-offloading that can be removed right after patch 7 (which itself removes the sole reason for an rdp leader to remain nocb), and it doesn't depend on the timers unification that comes after.
Just my 2 cents. The overall patchset looks good to me ;-)
Feel free to add
Reviewed-by: Boqun Feng boqun.feng@gmail.com
Thanks a lot for checking that!
On Mon, Mar 15, 2021 at 11:56:33PM +0100, Frederic Weisbecker wrote:
On Mon, Mar 15, 2021 at 10:53:36PM +0800, Boqun Feng wrote:
On Wed, Mar 10, 2021 at 11:17:02PM +0100, Frederic Weisbecker wrote:
On Tue, Mar 02, 2021 at 05:24:56PM -0800, Paul E. McKenney wrote:
On Tue, Feb 23, 2021 at 01:10:08AM +0100, Frederic Weisbecker wrote:
A NOCB-gp wake up can safely delete the nocb_bypass_timer. nocb_gp_wait() is going to check again the bypass state and rearm the bypass timer if necessary.
Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com
Give that you delete this code a couple of patches later in this series, why not just leave it out entirely? ;-)
It's not exactly deleted later, it's rather merged within the "del_timer(&rdp_gp->nocb_timer)".
The purpose of that patch is to make it clear that we explicitly cancel the nocb_bypass_timer here before we do it implicitly later with the merge of nocb_bypass_timer into nocb_timer.
We could drop that patch, the resulting code in the end of the patchset will be the same of course but the behaviour detail described here might slip out of the reviewers attention :-)
How about merging the timers first and adding those small improvements later? i.e. move patch #12 #13 right after #7 (IIUC, #7 is the last requirement you need for merging timers)
Hmm, nope, patches 9 and 10 are actually preparation work for timers merge. In fact they could even be skipped and timers could be merged directly but I wanted the unified behaviour to be fully explicit for reviewers through those incremental changes before merging the timers together.
, and then patch #8~#11 just follow
Patch 8 really need to stay where it is because it is an important limitation on nocb de-offloading that can be removed right after patch 7 (which itself removes the sole reason for rdp leader to remain nocb) and it doesn't depend on the timers unification that comes after.
Just my 2 cents. The overall patchset looks good to me ;-)
Feel free to add
Reviewed-by: Boqun Feng boqun.feng@gmail.com
Thanks a lot for checking that!
Applied to 10/13, thank you both!
Thanx, Paul
No need to disarm the nocb_timer if rcu_nocb is polling because it shouldn't be armed either.
Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com --- kernel/rcu/tree_plugin.h | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 9da67b0d3997..d8b50ff40e4b 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2186,18 +2186,18 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	my_rdp->nocb_gp_gp = needwait_gp;
 	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
 	if (bypass) {
-		raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
-		// Avoid race with first bypass CB.
-		if (my_rdp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {
-			WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
-			del_timer(&my_rdp->nocb_timer);
-		}
 		if (!rcu_nocb_poll) {
+			raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
+			// Avoid race with first bypass CB.
+			if (my_rdp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {
+				WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
+				del_timer(&my_rdp->nocb_timer);
+			}
 			// At least one child with non-empty ->nocb_bypass, so set
 			// timer in order to avoid stranding its callbacks.
 			mod_timer(&my_rdp->nocb_bypass_timer, j + 2);
+			raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
 		}
-		raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
 	}
 	if (rcu_nocb_poll) {
 		/* Polling, so trace if first poll in the series. */
On Tue, Feb 23, 2021 at 01:10:09AM +0100, Frederic Weisbecker wrote:
No need to disarm the nocb_timer if rcu_nocb is polling because it shouldn't be armed either.
Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com
OK, so it does make sense to move that del_timer() under the following "if" statement, then. ;-)
kernel/rcu/tree_plugin.h | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 9da67b0d3997..d8b50ff40e4b 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2186,18 +2186,18 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	my_rdp->nocb_gp_gp = needwait_gp;
 	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
 	if (bypass) {
-		raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
-		// Avoid race with first bypass CB.
-		if (my_rdp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {
-			WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
-			del_timer(&my_rdp->nocb_timer);
-		}
 		if (!rcu_nocb_poll) {
+			raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
+			// Avoid race with first bypass CB.
+			if (my_rdp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {
+				WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
+				del_timer(&my_rdp->nocb_timer);
+			}
 			// At least one child with non-empty ->nocb_bypass, so set
 			// timer in order to avoid stranding its callbacks.
 			mod_timer(&my_rdp->nocb_bypass_timer, j + 2);
+			raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
 		}
-		raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
 	}
 	if (rcu_nocb_poll) {
 		/* Polling, so trace if first poll in the series. */
-- 2.25.1
On Tue, Mar 02, 2021 at 05:22:29PM -0800, Paul E. McKenney wrote:
On Tue, Feb 23, 2021 at 01:10:09AM +0100, Frederic Weisbecker wrote:
No need to disarm the nocb_timer if rcu_nocb is polling because it shouldn't be armed either.
Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com
OK, so it does make sense to move that del_timer() under the following "if" statement, then. ;-)
Right, probably I should have handled that in the beginning of the patchset instead of the end but heh, my mind is never that clear.
Thanks.
Provide a way to tune the deferred wakeup level we want to perform from a safe wakeup point. Currently those sites are:
* nocb_timer
* user/idle/guest entry
* CPU down
* softirq/rcuc
All of these sites perform the wake up for both RCU_NOCB_WAKE and RCU_NOCB_WAKE_FORCE.
In order to merge nocb_timer and nocb_bypass_timer together, we plan to add a new RCU_NOCB_WAKE_BYPASS that really wants to be deferred until a timer fires, so that we don't wake up the NOCB-gp kthread too early.
To prepare for that, integrate a wakeup level/limit that each callsite is allowed to perform.
Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Joel Fernandes joel@joelfernandes.org Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Boqun Feng boqun.feng@gmail.com --- kernel/rcu/tree.c | 2 +- kernel/rcu/tree.h | 2 +- kernel/rcu/tree_plugin.h | 15 ++++++++------- 3 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2c9cf4df942c..9951a4bef504 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3823,7 +3823,7 @@ static int rcu_pending(int user)
 	check_cpu_stall(rdp);
 
 	/* Does this CPU need a deferred NOCB wakeup? */
-	if (rcu_nocb_need_deferred_wakeup(rdp))
+	if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
 		return 1;
 
 	/* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index b280a843bd2c..2510e86265c1 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -433,7 +433,7 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 				bool *was_alldone, unsigned long flags);
 static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
 				 unsigned long flags);
-static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp);
+static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);
 static bool do_nocb_deferred_wakeup(struct rcu_data *rdp);
 static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
 static void rcu_spawn_cpu_nocb_kthread(int cpu);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index d8b50ff40e4b..e0420e3b30e6 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2364,13 +2364,14 @@ static int rcu_nocb_cb_kthread(void *arg)
 }
 
 /* Is a deferred wakeup of rcu_nocb_kthread() required? */
-static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
+static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level)
 {
-	return READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT;
+	return READ_ONCE(rdp->nocb_defer_wakeup) >= level;
 }
 
 /* Do a deferred wakeup of rcu_nocb_kthread(). */
-static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp)
+static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp,
+					   int level)
 {
 	unsigned long flags;
 	int ndw;
@@ -2379,7 +2380,7 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp)
 
 	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
 
-	if (!rcu_nocb_need_deferred_wakeup(rdp_gp)) {
+	if (!rcu_nocb_need_deferred_wakeup(rdp_gp, level)) {
 		raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
 		return false;
 	}
@@ -2396,7 +2397,7 @@ static void do_nocb_deferred_wakeup_timer(struct timer_list *t)
 {
 	struct rcu_data *rdp = from_timer(rdp, t, nocb_timer);
 
-	do_nocb_deferred_wakeup_common(rdp);
+	do_nocb_deferred_wakeup_common(rdp, RCU_NOCB_WAKE);
 }
 
 /*
@@ -2409,8 +2410,8 @@ static bool do_nocb_deferred_wakeup(struct rcu_data *rdp)
 	if (!rdp->nocb_gp_rdp)
 		return false;
 
-	if (rcu_nocb_need_deferred_wakeup(rdp->nocb_gp_rdp))
-		return do_nocb_deferred_wakeup_common(rdp);
+	if (rcu_nocb_need_deferred_wakeup(rdp->nocb_gp_rdp, RCU_NOCB_WAKE))
+		return do_nocb_deferred_wakeup_common(rdp, RCU_NOCB_WAKE);
 	return false;
 }
On Tue, Feb 23, 2021 at 01:10:10AM +0100, Frederic Weisbecker wrote:
Provide a way to tune the deferred wakeup level we want to perform from a safe wakeup point. Currently those sites are:
- nocb_timer
- user/idle/guest entry
- CPU down
- softirq/rcuc
All of these sites perform the wake up for both RCU_NOCB_WAKE and RCU_NOCB_WAKE_FORCE.
In order to merge nocb_timer and nocb_bypass_timer together, we plan to add a new RCU_NOCB_WAKE_BYPASS that really want to be deferred until a timer fires so that we don't wake up the NOCB-gp kthread too early.
To prepare for that, integrate a wake up level/limit that a callsite is deemed to perform.
This appears to need the following in order to build for non-NOCB configurations. I will fold it in and am retesting.
Thanx, Paul
------------------------------------------------------------------------
commit 55f59dd75a11455cf558fd387fbf9011017dcc8a
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Mon Mar 15 20:00:34 2021 -0700
squash! rcu/nocb: Prepare for fine-grained deferred wakeup
[ paulmck: Fix non-NOCB rcu_nocb_need_deferred_wakeup() definition. ] Signed-off-by: Paul E. McKenney paulmck@kernel.org
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0cc7f68..dfb048e 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2926,7 +2926,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
 {
 }
 
-static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
+static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level)
 {
 	return false;
 }
On Mon, Mar 15, 2021 at 08:02:39PM -0700, Paul E. McKenney wrote:
On Tue, Feb 23, 2021 at 01:10:10AM +0100, Frederic Weisbecker wrote:
Provide a way to tune the deferred wakeup level we want to perform from a safe wakeup point. Currently those sites are:
- nocb_timer
- user/idle/guest entry
- CPU down
- softirq/rcuc
All of these sites perform the wake up for both RCU_NOCB_WAKE and RCU_NOCB_WAKE_FORCE.
In order to merge nocb_timer and nocb_bypass_timer together, we plan to add a new RCU_NOCB_WAKE_BYPASS that really want to be deferred until a timer fires so that we don't wake up the NOCB-gp kthread too early.
To prepare for that, integrate a wake up level/limit that a callsite is deemed to perform.
This appears to need the following in order to build for non-NOCB configurations. I will fold it in and am retesting.
Thanx, Paul
commit 55f59dd75a11455cf558fd387fbf9011017dcc8a Author: Paul E. McKenney paulmck@kernel.org Date: Mon Mar 15 20:00:34 2021 -0700
squash! rcu/nocb: Prepare for fine-grained deferred wakeup
[ paulmck: Fix non-NOCB rcu_nocb_need_deferred_wakeup() definition. ] Signed-off-by: Paul E. McKenney paulmck@kernel.org
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0cc7f68..dfb048e 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2926,7 +2926,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
 {
 }
 
-static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
+static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level)
 {
 	return false;
 }
Oops thanks, I missed that.
On Tue, Mar 16, 2021 at 12:45:26PM +0100, Frederic Weisbecker wrote:
On Mon, Mar 15, 2021 at 08:02:39PM -0700, Paul E. McKenney wrote:
On Tue, Feb 23, 2021 at 01:10:10AM +0100, Frederic Weisbecker wrote:
Provide a way to tune the deferred wakeup level we want to perform from a safe wakeup point. Currently those sites are:
- nocb_timer
- user/idle/guest entry
- CPU down
- softirq/rcuc
All of these sites perform the wake up for both RCU_NOCB_WAKE and RCU_NOCB_WAKE_FORCE.
In order to merge nocb_timer and nocb_bypass_timer together, we plan to add a new RCU_NOCB_WAKE_BYPASS that really want to be deferred until a timer fires so that we don't wake up the NOCB-gp kthread too early.
To prepare for that, integrate a wake up level/limit that a callsite is deemed to perform.
This appears to need the following in order to build for non-NOCB configurations. I will fold it in and am retesting.
Thanx, Paul
commit 55f59dd75a11455cf558fd387fbf9011017dcc8a Author: Paul E. McKenney paulmck@kernel.org Date: Mon Mar 15 20:00:34 2021 -0700
squash! rcu/nocb: Prepare for fine-grained deferred wakeup
[ paulmck: Fix non-NOCB rcu_nocb_need_deferred_wakeup() definition. ] Signed-off-by: Paul E. McKenney paulmck@kernel.org
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0cc7f68..dfb048e 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2926,7 +2926,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
 {
 }
 
-static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
+static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level)
 {
 	return false;
 }
Oops thanks, I missed that.
And with this, light rcutorture testing passed. I hope to pound on it harder later this week.
Thanx, Paul
Now that nocb_timer and nocb_bypass_timer have become very similar, merge them together. A new RCU_NOCB_WAKE_BYPASS wake level is introduced. As a result, timers perform all kinds of deferred wake ups but other deferred wakeup callsites only handle non-bypass wakeups in order not to wake up rcuo too early.
The timer now also performs the full memory barrier unconditionally, in order to order timer_pending() against callback enqueue, even though the only path that makes use of this ordering, the one performing RCU_NOCB_WAKE_FORCE, is itself debatable. That path should also test against the rdp leader instead of the current rdp.
The permanent full barrier shouldn't bring visible overhead since the timers almost never fire.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/trace/events/rcu.h |  1 +
 kernel/rcu/tree.h          |  6 +--
 kernel/rcu/tree_plugin.h   | 92 ++++++++++++++++----------------------
 3 files changed, 43 insertions(+), 56 deletions(-)
diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 5fc29400e1a2..c16cb7d78f51 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -278,6 +278,7 @@ TRACE_EVENT_RCU(rcu_exp_funnel_lock,
  * "WakeNot": Don't wake rcuo kthread.
  * "WakeNotPoll": Don't wake rcuo kthread because it is polling.
  * "WakeOvfIsDeferred": Wake rcuo kthread later, CB list is huge.
+ * "WakeBypassIsDeferred": Wake rcuo kthread later, bypass list is contended.
  * "WokeEmpty": rcuo CB kthread woke to find empty list.
  */
 TRACE_EVENT_RCU(rcu_nocb_wake,
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 2510e86265c1..9a16487edfca 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -218,7 +218,6 @@ struct rcu_data {
 
 	/* The following fields are used by GP kthread, hence own cacheline. */
 	raw_spinlock_t nocb_gp_lock ____cacheline_internodealigned_in_smp;
-	struct timer_list nocb_bypass_timer; /* Force nocb_bypass flush. */
 	u8 nocb_gp_sleep;	/* Is the nocb GP thread asleep? */
 	u8 nocb_gp_bypass;	/* Found a bypass on last scan? */
 	u8 nocb_gp_gp;		/* GP to wait for on last scan? */
@@ -258,8 +257,9 @@ struct rcu_data {
 
 /* Values for nocb_defer_wakeup field in struct rcu_data. */
 #define RCU_NOCB_WAKE_NOT	0
-#define RCU_NOCB_WAKE		1
-#define RCU_NOCB_WAKE_FORCE	2
+#define RCU_NOCB_WAKE_BYPASS	1
+#define RCU_NOCB_WAKE		2
+#define RCU_NOCB_WAKE_FORCE	3
 
 #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
					/* For jiffies_till_first_fqs and */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index e0420e3b30e6..6bf35a1fe68e 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1711,8 +1711,6 @@ static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
 		del_timer(&rdp_gp->nocb_timer);
 	}
 
-	del_timer(&rdp_gp->nocb_bypass_timer);
-
 	if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) {
 		WRITE_ONCE(rdp_gp->nocb_gp_sleep, false);
 		needwake = true;
@@ -1750,10 +1748,19 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
 	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
 
-	if (rdp_gp->nocb_defer_wakeup == RCU_NOCB_WAKE_NOT)
-		mod_timer(&rdp_gp->nocb_timer, jiffies + 1);
-	if (rdp_gp->nocb_defer_wakeup < waketype)
+	/*
+	 * Bypass wakeup overrides previous deferments. In case
+	 * of callback storm, no need to wake up too early.
+	 */
+	if (waketype == RCU_NOCB_WAKE_BYPASS) {
+		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
 		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
+	} else {
+		if (rdp_gp->nocb_defer_wakeup < RCU_NOCB_WAKE)
+			mod_timer(&rdp_gp->nocb_timer, jiffies + 1);
+		if (rdp_gp->nocb_defer_wakeup < waketype)
+			WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
+	}
 
 	raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
 
@@ -2005,7 +2012,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
 		smp_mb(); /* Enqueue before timer_pending(). */
 		if ((rdp->nocb_cb_sleep ||
 		     !rcu_segcblist_ready_cbs(&rdp->cblist)) &&
-		    !timer_pending(&rdp->nocb_bypass_timer)) {
+		    !timer_pending(&rdp->nocb_timer)) {
 			rcu_nocb_unlock_irqrestore(rdp, flags);
 			wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE,
 					   TPS("WakeOvfIsDeferred"));
@@ -2020,19 +2027,6 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
 		return;
 	}
 
-/* Wake up the no-CBs GP kthread to flush ->nocb_bypass. */
-static void do_nocb_bypass_wakeup_timer(struct timer_list *t)
-{
-	unsigned long flags;
-	struct rcu_data *rdp = from_timer(rdp, t, nocb_bypass_timer);
-
-	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Timer"));
-
-	raw_spin_lock_irqsave(&rdp->nocb_gp_lock, flags);
-	smp_mb__after_spinlock(); /* Timer expire before wakeup. */
-	__wake_nocb_gp(rdp, rdp, false, flags);
-}
-
 /*
  * Check if we ignore this rdp.
  *
@@ -2185,19 +2179,12 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	my_rdp->nocb_gp_bypass = bypass;
 	my_rdp->nocb_gp_gp = needwait_gp;
 	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
-	if (bypass) {
-		if (!rcu_nocb_poll) {
-			raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
-			// Avoid race with first bypass CB.
-			if (my_rdp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {
-				WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
-				del_timer(&my_rdp->nocb_timer);
-			}
-			// At least one child with non-empty ->nocb_bypass, so set
-			// timer in order to avoid stranding its callbacks.
-			mod_timer(&my_rdp->nocb_bypass_timer, j + 2);
-			raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
-		}
+
+	if (bypass && !rcu_nocb_poll) {
+		// At least one child with non-empty ->nocb_bypass, so set
+		// timer in order to avoid stranding its callbacks.
+		wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
+				   TPS("WakeBypassIsDeferred"));
 	}
 	if (rcu_nocb_poll) {
 		/* Polling, so trace if first poll in the series. */
@@ -2221,8 +2208,6 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	}
 	if (!rcu_nocb_poll) {
 		raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
-		if (bypass)
-			del_timer(&my_rdp->nocb_bypass_timer);
 		if (my_rdp->nocb_defer_wakeup > RCU_NOCB_WAKE_NOT) {
 			WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
 			del_timer(&my_rdp->nocb_timer);
@@ -2370,16 +2355,14 @@ static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level)
 }
 /* Do a deferred wakeup of rcu_nocb_kthread(). */
-static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp,
-					   int level)
+static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp_gp,
+					   struct rcu_data *rdp, int level,
+					   unsigned long flags)
+	__releases(rdp_gp->nocb_gp_lock)
 {
-	unsigned long flags;
 	int ndw;
-	struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
 	int ret;
 
-	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
-
 	if (!rcu_nocb_need_deferred_wakeup(rdp_gp, level)) {
 		raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
 		return false;
@@ -2395,9 +2378,15 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp,
 /* Do a deferred wakeup of rcu_nocb_kthread() from a timer handler. */
 static void do_nocb_deferred_wakeup_timer(struct timer_list *t)
 {
+	unsigned long flags;
 	struct rcu_data *rdp = from_timer(rdp, t, nocb_timer);
 
-	do_nocb_deferred_wakeup_common(rdp, RCU_NOCB_WAKE);
+	WARN_ON_ONCE(rdp->nocb_gp_rdp != rdp);
+	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Timer"));
+
+	raw_spin_lock_irqsave(&rdp->nocb_gp_lock, flags);
+	smp_mb__after_spinlock(); /* Timer expire before wakeup. */
+	do_nocb_deferred_wakeup_common(rdp, rdp, RCU_NOCB_WAKE_BYPASS, flags);
 }
 
 /*
@@ -2407,12 +2396,14 @@ static void do_nocb_deferred_wakeup_timer(struct timer_list *t)
  */
 static bool do_nocb_deferred_wakeup(struct rcu_data *rdp)
 {
-	if (!rdp->nocb_gp_rdp)
+	unsigned long flags;
+	struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
+
+	if (!rdp_gp || !rcu_nocb_need_deferred_wakeup(rdp_gp, RCU_NOCB_WAKE))
 		return false;
 
-	if (rcu_nocb_need_deferred_wakeup(rdp->nocb_gp_rdp, RCU_NOCB_WAKE))
-		return do_nocb_deferred_wakeup_common(rdp, RCU_NOCB_WAKE);
-	return false;
+	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
+	return do_nocb_deferred_wakeup_common(rdp_gp, rdp, RCU_NOCB_WAKE, flags);
 }
 
 void rcu_nocb_flush_deferred_wakeup(void)
@@ -2655,7 +2646,6 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
 	raw_spin_lock_init(&rdp->nocb_bypass_lock);
 	raw_spin_lock_init(&rdp->nocb_gp_lock);
 	timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0);
-	timer_setup(&rdp->nocb_bypass_timer, do_nocb_bypass_wakeup_timer, 0);
 	rcu_cblist_init(&rdp->nocb_bypass);
 }
@@ -2814,13 +2804,12 @@ static void show_rcu_nocb_gp_state(struct rcu_data *rdp)
 {
 	struct rcu_node *rnp = rdp->mynode;
 
-	pr_info("nocb GP %d %c%c%c%c%c%c %c[%c%c] %c%c:%ld rnp %d:%d %lu %c CPU %d%s\n",
+	pr_info("nocb GP %d %c%c%c%c%c %c[%c%c] %c%c:%ld rnp %d:%d %lu %c CPU %d%s\n",
 		rdp->cpu,
 		"kK"[!!rdp->nocb_gp_kthread],
 		"lL"[raw_spin_is_locked(&rdp->nocb_gp_lock)],
 		"dD"[!!rdp->nocb_defer_wakeup],
 		"tT"[timer_pending(&rdp->nocb_timer)],
-		"bB"[timer_pending(&rdp->nocb_bypass_timer)],
 		"sS"[!!rdp->nocb_gp_sleep],
 		".W"[swait_active(&rdp->nocb_gp_wq)],
 		".W"[swait_active(&rnp->nocb_gp_wq[0])],
@@ -2841,7 +2830,6 @@ static void show_rcu_nocb_state(struct rcu_data *rdp)
 	char bufr[20];
 	struct rcu_segcblist *rsclp = &rdp->cblist;
 	bool waslocked;
-	bool wastimer;
 	bool wassleep;
 
 	if (rdp->nocb_gp_rdp == rdp)
@@ -2878,15 +2866,13 @@ static void show_rcu_nocb_state(struct rcu_data *rdp)
 		return;
 
 	waslocked = raw_spin_is_locked(&rdp->nocb_gp_lock);
-	wastimer = timer_pending(&rdp->nocb_bypass_timer);
 	wassleep = swait_active(&rdp->nocb_gp_wq);
-	if (!rdp->nocb_gp_sleep && !waslocked && !wastimer && !wassleep)
+	if (!rdp->nocb_gp_sleep && !waslocked && !wassleep)
 		return;  /* Nothing untowards. */
 
-	pr_info("   nocb GP activity on CB-only CPU!!! %c%c%c%c %c\n",
+	pr_info("   nocb GP activity on CB-only CPU!!! %c%c%c %c\n",
 		"lL"[waslocked],
 		"dD"[!!rdp->nocb_defer_wakeup],
-		"tT"[wastimer],
 		"sS"[!!rdp->nocb_gp_sleep],
 		".W"[wassleep]);
 }