Motivation of backport: -----------------------
1. The cfcdef5e30469 ("rcu: Allow rcu_do_batch() to dynamically adjust batch sizes") broke the default behaviour of "offloading rcu callbacks" setup. In that scenario after each callback the caller context was used to check if it has to be rescheduled giving a CPU time for others. After that change an "offloaded" setup can switch to time-based RCU callbacks processing, what can be long for latency sensitive workloads and SCHED_FIFO processes, i.e. callbacks are invoked for a long time with keeping preemption off and without checking cond_resched().
2. Our devices which run Android and 5.10 kernel have some critical areas which are sensitive to latency. It is a low latency audio, 8k video, UI stack and so on. For example below is a trace that illustrates a delay of "irq/396-5-0072" RT task to complete IRQ processing:
<snip> rcuop/6-54 [000] d.h2 183.752989: irq_handler_entry: irq=85 name=i2c_geni rcuop/6-54 [000] d.h5 183.753007: sched_waking: comm=irq/396-5-0072 pid=12675 prio=49 target_cpu=000 rcuop/6-54 [000] dNh6 183.753014: sched_wakeup: irq/396-5-0072:12675 [49] success=1 CPU:000 rcuop/6-54 [000] dNh2 183.753015: irq_handler_exit: irq=85 ret=handled rcuop/6-54 [000] .N.. 183.753018: rcu_invoke_callback: rcu_preempt rhp=0xffffff88ffd440b0 func=__d_free.cfi_jt rcuop/6-54 [000] .N.. 183.753020: rcu_invoke_callback: rcu_preempt rhp=0xffffff892ffd8400 func=inode_free_by_rcu.cfi_jt rcuop/6-54 [000] .N.. 183.753021: rcu_invoke_callback: rcu_preempt rhp=0xffffff89327cd708 func=i_callback.cfi_jt ... rcuop/6-54 [000] .N.. 183.755941: rcu_invoke_callback: rcu_preempt rhp=0xffffff8993c5a968 func=i_callback.cfi_jt rcuop/6-54 [000] .N.. 183.755942: rcu_invoke_callback: rcu_preempt rhp=0xffffff8993c4bd20 func=__d_free.cfi_jt rcuop/6-54 [000] dN.. 183.755944: rcu_batch_end: rcu_preempt CBs-invoked=2112 idle=>c<>c<>c<>c< rcuop/6-54 [000] dN.. 183.755946: rcu_utilization: Start context switch rcuop/6-54 [000] dN.. 183.755946: rcu_utilization: End context switch rcuop/6-54 [000] d..2 183.755959: sched_switch: rcuop/6:54 [120] R ==> migration/0:16 [0] ... migratio-16 [000] d..2 183.756021: sched_switch: migration/0:16 [0] S ==> irq/396-5-0072:12675 [49] <snip>
The "irq/396-5-0072:12675" was delayed for ~3 milliseconds due to introduced side effect. Please note, on our Android devices we get ~70 000 callbacks registered to be invoked by the "rcuop/x" workers. This is during 1 seconds time interval and regular handset usage. Latencies bigger that 3 milliseconds affect our high-resolution audio streaming over the LDAC/Bluetooth stack.
Two patches depend on each other.
Frederic Weisbecker (2): rcu: [for 5.15 stable] Fix callbacks processing time limit retaining cond_resched() rcu: [for 5.15 stable] Apply callbacks processing time limit only on softirq
kernel/rcu/tree.c | 31 ++++++++++++++++++------------- 1 file changed, 18 insertions(+), 13 deletions(-)
From: Frederic Weisbecker frederic@kernel.org
commit 3e61e95e2d095e308616cba4ffb640f95a480e01 upstream.
The callbacks processing time limit makes sure we are not exceeding a given amount of time executing the queue.
However its "continue" clause bypasses the cond_resched() call on rcuc and NOCB kthreads, delaying it until we reach the limit, which can be very long...
Make sure the scheduler has a higher priority than the time limit.
Motivation of backport: -----------------------
1. The cfcdef5e30469 ("rcu: Allow rcu_do_batch() to dynamically adjust batch sizes") broke the default behaviour of "offloading rcu callbacks" setup. In that scenario after each callback the caller context was used to check if it has to be rescheduled giving a CPU time for others. After that change an "offloaded" setup can switch to time-based RCU callbacks processing, what can be long for latency sensitive workloads and SCHED_FIFO processes, i.e. callbacks are invoked for a long time with keeping preemption off and without checking cond_resched().
2. Our devices which run Android and 5.10 kernel have some critical areas which are sensitive to latency. It is a low latency audio, 8k video, UI stack and so on. For example below is a trace that illustrates a delay of "irq/396-5-0072" RT task to complete IRQ processing:
<snip> rcuop/6-54 [000] d.h2 183.752989: irq_handler_entry: irq=85 name=i2c_geni rcuop/6-54 [000] d.h5 183.753007: sched_waking: comm=irq/396-5-0072 pid=12675 prio=49 target_cpu=000 rcuop/6-54 [000] dNh6 183.753014: sched_wakeup: irq/396-5-0072:12675 [49] success=1 CPU:000 rcuop/6-54 [000] dNh2 183.753015: irq_handler_exit: irq=85 ret=handled rcuop/6-54 [000] .N.. 183.753018: rcu_invoke_callback: rcu_preempt rhp=0xffffff88ffd440b0 func=__d_free.cfi_jt rcuop/6-54 [000] .N.. 183.753020: rcu_invoke_callback: rcu_preempt rhp=0xffffff892ffd8400 func=inode_free_by_rcu.cfi_jt rcuop/6-54 [000] .N.. 183.753021: rcu_invoke_callback: rcu_preempt rhp=0xffffff89327cd708 func=i_callback.cfi_jt ... rcuop/6-54 [000] .N.. 183.755941: rcu_invoke_callback: rcu_preempt rhp=0xffffff8993c5a968 func=i_callback.cfi_jt rcuop/6-54 [000] .N.. 183.755942: rcu_invoke_callback: rcu_preempt rhp=0xffffff8993c4bd20 func=__d_free.cfi_jt rcuop/6-54 [000] dN.. 183.755944: rcu_batch_end: rcu_preempt CBs-invoked=2112 idle=>c<>c<>c<>c< rcuop/6-54 [000] dN.. 183.755946: rcu_utilization: Start context switch rcuop/6-54 [000] dN.. 183.755946: rcu_utilization: End context switch rcuop/6-54 [000] d..2 183.755959: sched_switch: rcuop/6:54 [120] R ==> migration/0:16 [0] ... migratio-16 [000] d..2 183.756021: sched_switch: migration/0:16 [0] S ==> irq/396-5-0072:12675 [49] <snip>
The "irq/396-5-0072:12675" was delayed for ~3 milliseconds due to introduced side effect. Please note, on our Android devices we get ~70 000 callbacks registered to be invoked by the "rcuop/x" workers. This is during 1 seconds time interval and regular handset usage. Latencies bigger that 3 milliseconds affect our high-resolution audio streaming over the LDAC/Bluetooth stack.
Fixes: cfcdef5e30469 ("rcu: Allow rcu_do_batch() to dynamically adjust batch sizes") Reviewed-by: Valentin Schneider valentin.schneider@arm.com Tested-by: Valentin Schneider valentin.schneider@arm.com Tested-by: Sebastian Andrzej Siewior bigeasy@linutronix.de Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Valentin Schneider valentin.schneider@arm.com Cc: Peter Zijlstra peterz@infradead.org Cc: Sebastian Andrzej Siewior bigeasy@linutronix.de Cc: Josh Triplett josh@joshtriplett.org Cc: Joel Fernandes joel@joelfernandes.org Cc: Boqun Feng boqun.feng@gmail.com Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Uladzislau Rezki urezki@gmail.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Oleksiy Avramchenko oleksiy.avramchenko@sony.com [UR: backport to 5.15-stable + commit update] Signed-off-by: Paul E. McKenney paulmck@kernel.org Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com --- kernel/rcu/tree.c | 27 ++++++++++++++++----------- 1 file changed, 16 insertions(+), 11 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 4ca6d5b199e8..a968cc67b2bd 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2513,10 +2513,22 @@ static void rcu_do_batch(struct rcu_data *rdp) /* * Stop only if limit reached and CPU has something to do. */ - if (count >= bl && !offloaded && - (need_resched() || - (!is_idle_task(current) && !rcu_is_callbacks_kthread()))) - break; + if (in_serving_softirq()) { + if (count >= bl && (need_resched() || + (!is_idle_task(current) && !rcu_is_callbacks_kthread()))) + break; + } else { + local_bh_enable(); + lockdep_assert_irqs_enabled(); + cond_resched_tasks_rcu_qs(); + lockdep_assert_irqs_enabled(); + local_bh_disable(); + } + + /* + * Make sure we don't spend too much time here and deprive other + * softirq vectors of CPU cycles. + */ if (unlikely(tlimit)) { /* only call local_clock() every 32 callbacks */ if (likely((count & 31) || local_clock() < tlimit)) @@ -2524,13 +2536,6 @@ static void rcu_do_batch(struct rcu_data *rdp) /* Exceeded the time limit, so leave. */ break; } - if (!in_serving_softirq()) { - local_bh_enable(); - lockdep_assert_irqs_enabled(); - cond_resched_tasks_rcu_qs(); - lockdep_assert_irqs_enabled(); - local_bh_disable(); - } }
local_irq_save(flags);
From: Frederic Weisbecker frederic@kernel.org
commit a554ba288845fd3f6f12311fd76a51694233458a upstream.
Time limit only makes sense when callbacks are serviced in softirq mode because:
_ In case we need to get back to the scheduler, cond_resched_tasks_rcu_qs() is called after each callback.
_ In case some other softirq vector needs the CPU, the call to local_bh_enable() before cond_resched_tasks_rcu_qs() takes care about them via a call to do_softirq().
Therefore, make sure the time limit only applies to softirq mode.
Fixes: cfcdef5e30469 ("rcu: Allow rcu_do_batch() to dynamically adjust batch sizes") Reviewed-by: Valentin Schneider valentin.schneider@arm.com Tested-by: Valentin Schneider valentin.schneider@arm.com Tested-by: Sebastian Andrzej Siewior bigeasy@linutronix.de Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Valentin Schneider valentin.schneider@arm.com Cc: Peter Zijlstra peterz@infradead.org Cc: Sebastian Andrzej Siewior bigeasy@linutronix.de Cc: Josh Triplett josh@joshtriplett.org Cc: Joel Fernandes joel@joelfernandes.org Cc: Boqun Feng boqun.feng@gmail.com Cc: Neeraj Upadhyay neeraju@codeaurora.org Cc: Uladzislau Rezki urezki@gmail.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Oleksiy Avramchenko oleksiy.avramchenko@sony.com [UR: backport to 5.15-stable] Signed-off-by: Paul E. McKenney paulmck@kernel.org Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com --- kernel/rcu/tree.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index a968cc67b2bd..a4a9d68b1fdc 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2476,7 +2476,7 @@ static void rcu_do_batch(struct rcu_data *rdp) div = READ_ONCE(rcu_divisor); div = div < 0 ? 7 : div > sizeof(long) * 8 - 2 ? sizeof(long) * 8 - 2 : div; bl = max(rdp->blimit, pending >> div); - if (unlikely(bl > 100)) { + if (in_serving_softirq() && unlikely(bl > 100)) { long rrn = READ_ONCE(rcu_resched_ns);
rrn = rrn < NSEC_PER_MSEC ? NSEC_PER_MSEC : rrn > NSEC_PER_SEC ? NSEC_PER_SEC : rrn; @@ -2517,6 +2517,18 @@ static void rcu_do_batch(struct rcu_data *rdp) if (count >= bl && (need_resched() || (!is_idle_task(current) && !rcu_is_callbacks_kthread()))) break; + + /* + * Make sure we don't spend too much time here and deprive other + * softirq vectors of CPU cycles. + */ + if (unlikely(tlimit)) { + /* only call local_clock() every 32 callbacks */ + if (likely((count & 31) || local_clock() < tlimit)) + continue; + /* Exceeded the time limit, so leave. */ + break; + } } else { local_bh_enable(); lockdep_assert_irqs_enabled(); @@ -2524,18 +2536,6 @@ static void rcu_do_batch(struct rcu_data *rdp) lockdep_assert_irqs_enabled(); local_bh_disable(); } - - /* - * Make sure we don't spend too much time here and deprive other - * softirq vectors of CPU cycles. - */ - if (unlikely(tlimit)) { - /* only call local_clock() every 32 callbacks */ - if (likely((count & 31) || local_clock() < tlimit)) - continue; - /* Exceeded the time limit, so leave. */ - break; - } }
local_irq_save(flags);
On Tue, May 03, 2022 at 09:17:07PM +0200, Uladzislau Rezki (Sony) wrote:
Motivation of backport:
- The cfcdef5e30469 ("rcu: Allow rcu_do_batch() to dynamically adjust batch sizes")
broke the default behaviour of "offloading rcu callbacks" setup. In that scenario after each callback the caller context was used to check if it has to be rescheduled giving a CPU time for others. After that change an "offloaded" setup can switch to time-based RCU callbacks processing, what can be long for latency sensitive workloads and SCHED_FIFO processes, i.e. callbacks are invoked for a long time with keeping preemption off and without checking cond_resched().
- Our devices which run Android and 5.10 kernel have some critical areas which
are sensitive to latency. It is a low latency audio, 8k video, UI stack and so on. For example below is a trace that illustrates a delay of "irq/396-5-0072" RT task to complete IRQ processing:
<snip> rcuop/6-54 [000] d.h2 183.752989: irq_handler_entry: irq=85 name=i2c_geni rcuop/6-54 [000] d.h5 183.753007: sched_waking: comm=irq/396-5-0072 pid=12675 prio=49 target_cpu=000 rcuop/6-54 [000] dNh6 183.753014: sched_wakeup: irq/396-5-0072:12675 [49] success=1 CPU:000 rcuop/6-54 [000] dNh2 183.753015: irq_handler_exit: irq=85 ret=handled rcuop/6-54 [000] .N.. 183.753018: rcu_invoke_callback: rcu_preempt rhp=0xffffff88ffd440b0 func=__d_free.cfi_jt rcuop/6-54 [000] .N.. 183.753020: rcu_invoke_callback: rcu_preempt rhp=0xffffff892ffd8400 func=inode_free_by_rcu.cfi_jt rcuop/6-54 [000] .N.. 183.753021: rcu_invoke_callback: rcu_preempt rhp=0xffffff89327cd708 func=i_callback.cfi_jt ... rcuop/6-54 [000] .N.. 183.755941: rcu_invoke_callback: rcu_preempt rhp=0xffffff8993c5a968 func=i_callback.cfi_jt rcuop/6-54 [000] .N.. 183.755942: rcu_invoke_callback: rcu_preempt rhp=0xffffff8993c4bd20 func=__d_free.cfi_jt rcuop/6-54 [000] dN.. 183.755944: rcu_batch_end: rcu_preempt CBs-invoked=2112 idle=>c<>c<>c<>c< rcuop/6-54 [000] dN.. 183.755946: rcu_utilization: Start context switch rcuop/6-54 [000] dN.. 183.755946: rcu_utilization: End context switch rcuop/6-54 [000] d..2 183.755959: sched_switch: rcuop/6:54 [120] R ==> migration/0:16 [0] ... migratio-16 [000] d..2 183.756021: sched_switch: migration/0:16 [0] S ==> irq/396-5-0072:12675 [49] <snip>
The "irq/396-5-0072:12675" was delayed for ~3 milliseconds due to introduced side effect. Please note, on our Android devices we get ~70 000 callbacks registered to be invoked by the "rcuop/x" workers. This is during 1 seconds time interval and regular handset usage. Latencies bigger that 3 milliseconds affect our high-resolution audio streaming over the LDAC/Bluetooth stack.
Two patches depend on each other.
All now queued up, thanks.
greg k-h
Thank you!
On Tue, May 10, 2022 at 1:12 PM Greg KH greg@kroah.com wrote:
On Tue, May 03, 2022 at 09:17:07PM +0200, Uladzislau Rezki (Sony) wrote:
Motivation of backport:
- The cfcdef5e30469 ("rcu: Allow rcu_do_batch() to dynamically adjust batch sizes")
broke the default behaviour of "offloading rcu callbacks" setup. In that scenario after each callback the caller context was used to check if it has to be rescheduled giving a CPU time for others. After that change an "offloaded" setup can switch to time-based RCU callbacks processing, what can be long for latency sensitive workloads and SCHED_FIFO processes, i.e. callbacks are invoked for a long time with keeping preemption off and without checking cond_resched().
- Our devices which run Android and 5.10 kernel have some critical areas which
are sensitive to latency. It is a low latency audio, 8k video, UI stack and so on. For example below is a trace that illustrates a delay of "irq/396-5-0072" RT task to complete IRQ processing:
<snip> rcuop/6-54 [000] d.h2 183.752989: irq_handler_entry: irq=85 name=i2c_geni rcuop/6-54 [000] d.h5 183.753007: sched_waking: comm=irq/396-5-0072 pid=12675 prio=49 target_cpu=000 rcuop/6-54 [000] dNh6 183.753014: sched_wakeup: irq/396-5-0072:12675 [49] success=1 CPU:000 rcuop/6-54 [000] dNh2 183.753015: irq_handler_exit: irq=85 ret=handled rcuop/6-54 [000] .N.. 183.753018: rcu_invoke_callback: rcu_preempt rhp=0xffffff88ffd440b0 func=__d_free.cfi_jt rcuop/6-54 [000] .N.. 183.753020: rcu_invoke_callback: rcu_preempt rhp=0xffffff892ffd8400 func=inode_free_by_rcu.cfi_jt rcuop/6-54 [000] .N.. 183.753021: rcu_invoke_callback: rcu_preempt rhp=0xffffff89327cd708 func=i_callback.cfi_jt ... rcuop/6-54 [000] .N.. 183.755941: rcu_invoke_callback: rcu_preempt rhp=0xffffff8993c5a968 func=i_callback.cfi_jt rcuop/6-54 [000] .N.. 183.755942: rcu_invoke_callback: rcu_preempt rhp=0xffffff8993c4bd20 func=__d_free.cfi_jt rcuop/6-54 [000] dN.. 183.755944: rcu_batch_end: rcu_preempt CBs-invoked=2112 idle=>c<>c<>c<>c< rcuop/6-54 [000] dN.. 183.755946: rcu_utilization: Start context switch rcuop/6-54 [000] dN.. 183.755946: rcu_utilization: End context switch rcuop/6-54 [000] d..2 183.755959: sched_switch: rcuop/6:54 [120] R ==> migration/0:16 [0] ... migratio-16 [000] d..2 183.756021: sched_switch: migration/0:16 [0] S ==> irq/396-5-0072:12675 [49] <snip>
The "irq/396-5-0072:12675" was delayed for ~3 milliseconds due to introduced side effect. Please note, on our Android devices we get ~70 000 callbacks registered to be invoked by the "rcuop/x" workers. This is during 1 seconds time interval and regular handset usage. Latencies bigger that 3 milliseconds affect our high-resolution audio streaming over the LDAC/Bluetooth stack.
Two patches depend on each other.
All now queued up, thanks.
greg k-h
linux-stable-mirror@lists.linaro.org