(This was [PATCH 0/4] sched/idle: Fix missing need_resched() checks after rcu_idle_enter() v2)
I initially followed Peterz's review but eventually tried a different approach. Instead of handling the late wake up from rcu_idle_enter(), I've split the delayed rcuog wake up out and moved it right before the last generic need_resched() check. This makes more sense and we don't need to fiddle with the cpuidle core and drivers anymore. It's also less error prone.
I also fixed the nohz_full case and (hopefully) the guest case.
And this comes with debug code to prevent that pattern from happening again.
Only lightly tested so far.
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git sched/idle-v3
HEAD: d95fc510e804a5c4658a823ff12d9caba1d906c7
Thanks,
	Frederic
---
Frederic Weisbecker (8):
  rcu: Remove superfluous rdp fetch
  rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
  rcu/nocb: Perform deferred wake up before last idle's need_resched() check
  rcu/nocb: Trigger self-IPI on late deferred wake up before user resume
  entry: Explicitly flush pending rcuog wakeup before last rescheduling points
  sched: Report local wake up on resched blind zone within idle loop
  entry: Report local wake up on resched blind zone while resuming to user
  timer: Report ignored local enqueue in nohz mode
 include/linux/rcupdate.h |  2 ++
 include/linux/sched.h    | 11 ++++++++
 kernel/entry/common.c    | 10 ++++++++
 kernel/rcu/tree.c        | 27 ++++++++++++++++++--
 kernel/rcu/tree.h        |  2 +-
 kernel/rcu/tree_plugin.h | 30 +++++++++++++++-------
 kernel/sched/core.c      | 66 +++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/idle.c      |  6 +++++
 kernel/sched/sched.h     |  3 +++
 9 files changed, 144 insertions(+), 13 deletions(-)
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/rcu/tree.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 40e5e3dd253e..fef90c467670 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -643,7 +643,6 @@ static noinstr void rcu_eqs_enter(bool user)
 	instrumentation_begin();
 	trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks));
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
-	rdp = this_cpu_ptr(&rcu_data);
 	do_nocb_deferred_wakeup(rdp);
 	rcu_prepare_for_idle();
 	rcu_preempt_deferred_qs(current);
On Sat, Jan 09, 2021 at 03:05:29AM +0100, Frederic Weisbecker wrote:
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>

 kernel/rcu/tree.c | 1 -
 1 file changed, 1 deletion(-)
I know I will not take patches without any changelog comments, maybe other maintainers are more lax. Please write something real.
And as for sending this to stable@vger, here's my form letter:
<formletter>
This is not the correct way to submit patches for inclusion in the stable kernel tree. Please read: https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html for how to do this properly.
</formletter>
On Sat, Jan 09, 2021 at 10:03:33AM +0100, Greg KH wrote:
On Sat, Jan 09, 2021 at 03:05:29AM +0100, Frederic Weisbecker wrote:
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>

 kernel/rcu/tree.c | 1 -
 1 file changed, 1 deletion(-)
I know I will not take patches without any changelog comments, maybe other maintainers are more lax. Please write something real.
I must admit I've been lazy. Also I shouldn't have Cc'ed stable on this one. Only a few commits are tagged for stable in this set. I'll fix that on the next round.
Thanks!
And as for sending this to stable@vger, here's my form letter:
<formletter>
This is not the correct way to submit patches for inclusion in the stable kernel tree. Please read: https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html for how to do this properly.
</formletter>
Deferred wakeup of rcuog kthreads upon RCU idle mode entry is going to be handled differently depending on whether it is initiated by idle, user or guest. Prepare for that by pulling this control up to rcu_eqs_enter() callers.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/rcu/tree.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index fef90c467670..b9fff18d14d9 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -643,7 +643,6 @@ static noinstr void rcu_eqs_enter(bool user)
 	instrumentation_begin();
 	trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks));
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
-	do_nocb_deferred_wakeup(rdp);
 	rcu_prepare_for_idle();
 	rcu_preempt_deferred_qs(current);
@@ -671,7 +670,10 @@ static noinstr void rcu_eqs_enter(bool user)
  */
 void rcu_idle_enter(void)
 {
+	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+
 	lockdep_assert_irqs_disabled();
+	do_nocb_deferred_wakeup(rdp);
 	rcu_eqs_enter(false);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -690,7 +692,10 @@ EXPORT_SYMBOL_GPL(rcu_idle_enter);
  */
 noinstr void rcu_user_enter(void)
 {
+	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+
 	lockdep_assert_irqs_disabled();
+	do_nocb_deferred_wakeup(rdp);
 	rcu_eqs_enter(true);
 }
 #endif /* CONFIG_NO_HZ_FULL */
Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP kthread (rcuog) to be serviced.
Usually a local wake up happening while running the idle task is handled in one of the need_resched() checks carefully placed within the idle loop that can break to the scheduler.
Unfortunately the call to rcu_idle_enter() is already beyond the last generic need_resched() check and we may halt the CPU with a resched request unhandled, leaving the task hanging.
Fix this by splitting the rcuog wakeup handling out of rcu_idle_enter() and placing it before the last generic need_resched() check in the idle loop. It is then assumed that no call to call_rcu() will be performed after that point in the idle loop until the CPU is put in low power mode. Further debug code will help spot the offenders.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Fixes: 96d3fd0d315a ("rcu: Break call_rcu() deadlock involving scheduler and perf")
Cc: stable@vger.kernel.org
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/rcupdate.h | 2 ++
 kernel/rcu/tree.c        | 3 ---
 kernel/rcu/tree_plugin.h | 5 +++++
 kernel/sched/idle.c      | 3 +++
 4 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index de0826411311..4068234fb303 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -104,8 +104,10 @@ static inline void rcu_user_exit(void) { }
 
 #ifdef CONFIG_RCU_NOCB_CPU
 void rcu_init_nohz(void);
+void rcu_nocb_flush_deferred_wakeup(void);
 #else /* #ifdef CONFIG_RCU_NOCB_CPU */
 static inline void rcu_init_nohz(void) { }
+static inline void rcu_nocb_flush_deferred_wakeup(void) { }
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
 
 /**
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index b9fff18d14d9..b6e1377774e3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -670,10 +670,7 @@ static noinstr void rcu_eqs_enter(bool user)
  */
 void rcu_idle_enter(void)
 {
-	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
-
 	lockdep_assert_irqs_disabled();
-	do_nocb_deferred_wakeup(rdp);
 	rcu_eqs_enter(false);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 7e291ce0a1d6..d5b38c28abd1 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2187,6 +2187,11 @@ static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
 		do_nocb_deferred_wakeup_common(rdp);
 }
 
+void rcu_nocb_flush_deferred_wakeup(void)
+{
+	do_nocb_deferred_wakeup(this_cpu_ptr(&rcu_data));
+}
+
 void __init rcu_init_nohz(void)
 {
 	int cpu;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 305727ea0677..b601a3aa2152 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -55,6 +55,7 @@ __setup("hlt", cpu_idle_nopoll_setup);
 static noinline int __cpuidle cpu_idle_poll(void)
 {
 	trace_cpu_idle(0, smp_processor_id());
+	rcu_nocb_flush_deferred_wakeup();
 	stop_critical_timings();
 	rcu_idle_enter();
 	local_irq_enable();
@@ -173,6 +174,8 @@ static void cpuidle_idle_call(void)
 	struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
 	int next_state, entered_state;
 
+	rcu_nocb_flush_deferred_wakeup();
+
 	/*
	 * Check if the idle task must be rescheduled. If it is the
	 * case, exit the function after re-enabling the local irq.
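For readers skimming the series, the window being closed by the patch above can be pictured with a simplified sketch of the idle path; this is illustrative code only, not the actual kernel/sched/idle.c flow:

	/*
	 * Illustrative sketch only (heavily trimmed): why the flush must sit
	 * before the last generic need_resched() check.
	 */
	static void idle_path_sketch(void)
	{
		/* New: flush the deferred rcuog wakeup first... */
		rcu_nocb_flush_deferred_wakeup();

		/* ...so that any resulting resched request is caught here. */
		if (need_resched())
			return;

		/*
		 * Previously the wakeup happened inside rcu_idle_enter(),
		 * i.e. after the check above: TIF_NEED_RESCHED could be set
		 * with nothing left to notice it before the CPU halts.
		 */
		rcu_idle_enter();
		arch_cpu_idle();
		rcu_idle_exit();
	}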
Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP kthread (rcuog) to be serviced.
Unfortunately the call to rcu_user_enter() is already past the last rescheduling opportunity before we resume to userspace or to guest mode. We may escape there with the woken task ignored.
The last resort to fix every call site is to trigger a self-IPI (nohz_full depends on IRQ_WORK) that will force a reschedule on IRQ tail or guest exit.
Eventually every site that wants a saner treatment will need to carefully place a call to rcu_nocb_flush_deferred_wakeup() before the last explicit need_resched() check upon resume.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Fixes: 96d3fd0d315a ("rcu: Break call_rcu() deadlock involving scheduler and perf")
Cc: stable@vger.kernel.org
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree.c        | 22 +++++++++++++++++++++-
 kernel/rcu/tree.h        |  2 +-
 kernel/rcu/tree_plugin.h | 25 ++++++++++++++++---------
 3 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index b6e1377774e3..2920dfc9f58c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -676,6 +676,18 @@ void rcu_idle_enter(void)
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
 
 #ifdef CONFIG_NO_HZ_FULL
+
+/*
+ * An empty function that will trigger a reschedule on
+ * IRQ tail once IRQs get re-enabled on userspace resume.
+ */
+static void late_wakeup_func(struct irq_work *work)
+{
+}
+
+static DEFINE_PER_CPU(struct irq_work, late_wakeup_work) =
+	IRQ_WORK_INIT(late_wakeup_func);
+
 /**
  * rcu_user_enter - inform RCU that we are resuming userspace.
  *
@@ -692,9 +704,17 @@ noinstr void rcu_user_enter(void)
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 
 	lockdep_assert_irqs_disabled();
-	do_nocb_deferred_wakeup(rdp);
+	/*
+	 * We may be past the last rescheduling opportunity in the entry code.
+	 * Trigger a self IPI that will fire and reschedule once we resume to
+	 * user/guest mode.
+	 */
+	if (do_nocb_deferred_wakeup(rdp) && need_resched())
+		irq_work_queue(this_cpu_ptr(&late_wakeup_work));
+
 	rcu_eqs_enter(true);
 }
+
 #endif /* CONFIG_NO_HZ_FULL */
 
 /**
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 7708ed161f4a..9226f4021a36 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -433,7 +433,7 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
 				 unsigned long flags);
 static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp);
-static void do_nocb_deferred_wakeup(struct rcu_data *rdp);
+static bool do_nocb_deferred_wakeup(struct rcu_data *rdp);
 static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
 static void rcu_spawn_cpu_nocb_kthread(int cpu);
 static void __init rcu_spawn_nocb_kthreads(void);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index d5b38c28abd1..384856e4d13e 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1631,8 +1631,8 @@ bool rcu_is_nocb_cpu(int cpu)
  * Kick the GP kthread for this NOCB group. Caller holds ->nocb_lock
  * and this function releases it.
  */
-static void wake_nocb_gp(struct rcu_data *rdp, bool force,
-			 unsigned long flags)
+static bool wake_nocb_gp(struct rcu_data *rdp, bool force,
+			 unsigned long flags)
 	__releases(rdp->nocb_lock)
 {
 	bool needwake = false;
@@ -1643,7 +1643,7 @@ static void wake_nocb_gp(struct rcu_data *rdp, bool force,
 		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("AlreadyAwake"));
 		rcu_nocb_unlock_irqrestore(rdp, flags);
-		return;
+		return false;
 	}
 	del_timer(&rdp->nocb_timer);
 	rcu_nocb_unlock_irqrestore(rdp, flags);
@@ -1656,6 +1656,8 @@ static void wake_nocb_gp(struct rcu_data *rdp, bool force,
 	raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
 	if (needwake)
 		wake_up_process(rdp_gp->nocb_gp_kthread);
+
+	return needwake;
 }
 
 /*
@@ -2152,20 +2154,23 @@ static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
 }
 
 /* Do a deferred wakeup of rcu_nocb_kthread(). */
-static void do_nocb_deferred_wakeup_common(struct rcu_data *rdp)
+static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp)
 {
 	unsigned long flags;
 	int ndw;
+	int ret;
 
 	rcu_nocb_lock_irqsave(rdp, flags);
 	if (!rcu_nocb_need_deferred_wakeup(rdp)) {
 		rcu_nocb_unlock_irqrestore(rdp, flags);
-		return;
+		return false;
 	}
 	ndw = READ_ONCE(rdp->nocb_defer_wakeup);
 	WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
-	wake_nocb_gp(rdp, ndw == RCU_NOCB_WAKE_FORCE, flags);
+	ret = wake_nocb_gp(rdp, ndw == RCU_NOCB_WAKE_FORCE, flags);
 	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DeferredWake"));
+
+	return ret;
 }
 
 /* Do a deferred wakeup of rcu_nocb_kthread() from a timer handler. */
@@ -2181,10 +2186,11 @@ static void do_nocb_deferred_wakeup_timer(struct timer_list *t)
  * This means we do an inexact common-case check. Note that if
  * we miss, ->nocb_timer will eventually clean things up.
  */
-static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
+static bool do_nocb_deferred_wakeup(struct rcu_data *rdp)
 {
 	if (rcu_nocb_need_deferred_wakeup(rdp))
-		do_nocb_deferred_wakeup_common(rdp);
+		return do_nocb_deferred_wakeup_common(rdp);
+	return false;
 }
 
 void rcu_nocb_flush_deferred_wakeup(void)
@@ -2523,8 +2529,9 @@ static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
 	return false;
 }
 
-static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
+static bool do_nocb_deferred_wakeup(struct rcu_data *rdp)
 {
+	return false;
 }
 
 static void rcu_spawn_cpu_nocb_kthread(int cpu)
On Sat, Jan 09, 2021 at 03:05:32AM +0100, Frederic Weisbecker wrote:
Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP kthread (rcuog) to be serviced.
Unfortunately the call to rcu_user_enter() is already past the last rescheduling opportunity before we resume to userspace or to guest mode. We may escape there with the woken task ignored.
The ultimate resort to fix every callsites is to trigger a self-IPI (nohz_full depends on IRQ_WORK) that will trigger a reschedule on IRQ tail or guest exit.
Eventually every site that want a saner treatment will need to carefully place a call to rcu_nocb_flush_deferred_wakeup() before the last explicit need_resched() check upon resume.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Fixes: 96d3fd0d315a ("rcu: Break call_rcu() deadlock involving scheduler and perf")
Cc: stable@vger.kernel.org
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

 kernel/rcu/tree.c        | 22 +++++++++++++++++++++-
 kernel/rcu/tree.h        |  2 +-
 kernel/rcu/tree_plugin.h | 25 ++++++++++++++++---------
 3 files changed, 38 insertions(+), 11 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index b6e1377774e3..2920dfc9f58c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -676,6 +676,18 @@ void rcu_idle_enter(void)
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
 
 #ifdef CONFIG_NO_HZ_FULL
+
+/*
+ * An empty function that will trigger a reschedule on
+ * IRQ tail once IRQs get re-enabled on userspace resume.
+ */
+static void late_wakeup_func(struct irq_work *work)
+{
+}
+
+static DEFINE_PER_CPU(struct irq_work, late_wakeup_work) =
+	IRQ_WORK_INIT(late_wakeup_func);
+
 /**
  * rcu_user_enter - inform RCU that we are resuming userspace.
  *
@@ -692,9 +704,17 @@ noinstr void rcu_user_enter(void)
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 
 	lockdep_assert_irqs_disabled();
-	do_nocb_deferred_wakeup(rdp);
+	/*
+	 * We may be past the last rescheduling opportunity in the entry code.
+	 * Trigger a self IPI that will fire and reschedule once we resume to
+	 * user/guest mode.
+	 */
+	if (do_nocb_deferred_wakeup(rdp) && need_resched())
+		irq_work_queue(this_cpu_ptr(&late_wakeup_work));
+
 	rcu_eqs_enter(true);
 }
Do we have the guarantee that every architecture that supports NOHZ_FULL has arch_irq_work_raise() on?
Also, can't you do the same thing you did earlier and do that wakeup thing before we complete exit_to_user_mode_prepare() ?
On Mon, Jan 11, 2021 at 01:04:24PM +0100, Peter Zijlstra wrote:
+static DEFINE_PER_CPU(struct irq_work, late_wakeup_work) =
+	IRQ_WORK_INIT(late_wakeup_func);
+
 /**
  * rcu_user_enter - inform RCU that we are resuming userspace.
  *
@@ -692,9 +704,17 @@ noinstr void rcu_user_enter(void)
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 
 	lockdep_assert_irqs_disabled();
-	do_nocb_deferred_wakeup(rdp);
+	/*
+	 * We may be past the last rescheduling opportunity in the entry code.
+	 * Trigger a self IPI that will fire and reschedule once we resume to
+	 * user/guest mode.
+	 */
+	if (do_nocb_deferred_wakeup(rdp) && need_resched())
+		irq_work_queue(this_cpu_ptr(&late_wakeup_work));
+
 	rcu_eqs_enter(true);
 }
Do we have the guarantee that every architecture that supports NOHZ_FULL has arch_irq_work_raise() on?
Yes, it's a requirement for NOHZ_FULL to work. But you make me realize this is tacit and isn't constrained anywhere in the code. I'm going to add HAVE_IRQ_WORK_RAISE and replace the weak definition with a config-based one.
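For illustration only, such a constraint could look roughly like the snippet below. HAVE_IRQ_WORK_RAISE is merely proposed in this mail, so both the Kconfig symbol and the config-gated stub are hypothetical, not part of the posted series:

	/*
	 * Hypothetical sketch: the kernel/irq_work.c fallback guarded by a new
	 * CONFIG_HAVE_IRQ_WORK_RAISE symbol (selected by architectures that
	 * implement arch_irq_work_raise()), replacing today's __weak stub.
	 * NO_HZ_FULL would then depend on HAVE_IRQ_WORK_RAISE.
	 */
	#ifndef CONFIG_HAVE_IRQ_WORK_RAISE
	void arch_irq_work_raise(void)
	{
		/* Lame architectures will get the timer tick callback */
	}
	#endif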
Also, can't you do the same thing you did earlier and do that wakeup thing before we complete exit_to_user_mode_prepare() ?
I do it for CONFIG_GENERIC_ENTRY but the other architectures have their own exit to user loop that I would need to audit and make sure that interrupts aren't ever re-enabled before resuming to user and there is no possible rescheduling point. I could manage to handle arm and arm64 but the others scare me:
$ git grep HAVE_CONTEXT_TRACKING
arch/csky/Kconfig:      select HAVE_CONTEXT_TRACKING
arch/mips/Kconfig:      select HAVE_CONTEXT_TRACKING
arch/powerpc/Kconfig:   select HAVE_CONTEXT_TRACKING if PPC64
arch/riscv/Kconfig:     select HAVE_CONTEXT_TRACKING
arch/sparc/Kconfig:     select HAVE_CONTEXT_TRACKING
:-s
Following the idle loop model, cleanly check for pending rcuog wakeup before the last rescheduling point on resuming to user mode. This way we can avoid doing it from rcu_user_enter() with the last-resort self-IPI hack that enforces rescheduling.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 kernel/entry/common.c |  6 ++++++
 kernel/rcu/tree.c     | 12 +++++++-----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 378341642f94..8f3292b5f9b7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -178,6 +178,9 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		/* Architecture specific TIF work */
 		arch_exit_to_user_mode_work(regs, ti_work);
 
+		/* Check if any of the above work has queued a deferred wakeup */
+		rcu_nocb_flush_deferred_wakeup();
+
 		/*
 		 * Disable interrupts and reevaluate the work flags as they
 		 * might have changed while interrupts and preemption was
@@ -197,6 +200,9 @@ static void exit_to_user_mode_prepare(struct pt_regs *regs)
 
 	lockdep_assert_irqs_disabled();
 
+	/* Flush pending rcuog wakeup before the last need_resched() check */
+	rcu_nocb_flush_deferred_wakeup();
+
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
 		ti_work = exit_to_user_mode_loop(regs, ti_work);
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2920dfc9f58c..3c4c0d5cea65 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -705,12 +705,14 @@ noinstr void rcu_user_enter(void)
 
 	lockdep_assert_irqs_disabled();
 	/*
-	 * We may be past the last rescheduling opportunity in the entry code.
-	 * Trigger a self IPI that will fire and reschedule once we resume to
-	 * user/guest mode.
+	 * Other than generic entry implementation, we may be past the last
+	 * rescheduling opportunity in the entry code. Trigger a self IPI
+	 * that will fire and reschedule once we resume in user/guest mode.
 	 */
-	if (do_nocb_deferred_wakeup(rdp) && need_resched())
-		irq_work_queue(this_cpu_ptr(&late_wakeup_work));
+	if (!IS_ENABLED(CONFIG_GENERIC_ENTRY) || (current->flags & PF_VCPU)) {
+		if (do_nocb_deferred_wakeup(rdp) && need_resched())
+			irq_work_queue(this_cpu_ptr(&late_wakeup_work));
+	}
 
 	rcu_eqs_enter(true);
 }
On Sat, Jan 09, 2021 at 03:05:33AM +0100, Frederic Weisbecker wrote:
Following the idle loop model, cleanly check for pending rcuog wakeup before the last rescheduling point on resuming to user mode. This way we can avoid to do it from rcu_user_enter() with the last resort self-IPI hack that enforces rescheduling.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

 kernel/entry/common.c |  6 ++++++
 kernel/rcu/tree.c     | 12 +++++++-----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 378341642f94..8f3292b5f9b7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -178,6 +178,9 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		/* Architecture specific TIF work */
 		arch_exit_to_user_mode_work(regs, ti_work);
 
+		/* Check if any of the above work has queued a deferred wakeup */
+		rcu_nocb_flush_deferred_wakeup();
So this needs to be moved to the IRQs-disabled section, just a few lines later; otherwise preemption may schedule another task that in turn does call_rcu() and creates a new deferred wake up (thanks Paul for the warning). Not to mention that we may migrate to another CPU with its own deferred wakeups to flush...
I'll fix that for the next version.
Thanks.
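Presumably that means moving the flush below the IRQ-disabling point of the work loop, roughly along these lines (a sketch of the intended change for the next version, not a posted patch):

		/* Architecture specific TIF work */
		arch_exit_to_user_mode_work(regs, ti_work);

		/*
		 * Disable interrupts and reevaluate the work flags as they
		 * might have changed while interrupts and preemption was
		 * enabled above.
		 */
		local_irq_disable_exit_to_user();

		/*
		 * Check for a deferred rcuog wakeup with IRQs disabled: no
		 * preemption can schedule a task doing call_rcu() behind our
		 * back and no migration can move us to another CPU anymore.
		 */
		rcu_nocb_flush_deferred_wakeup();

		ti_work = READ_ONCE(current_thread_info()->flags);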
On Mon, Jan 11, 2021 at 01:40:14AM +0100, Frederic Weisbecker wrote:
On Sat, Jan 09, 2021 at 03:05:33AM +0100, Frederic Weisbecker wrote:
Following the idle loop model, cleanly check for pending rcuog wakeup before the last rescheduling point on resuming to user mode. This way we can avoid to do it from rcu_user_enter() with the last resort self-IPI hack that enforces rescheduling.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

 kernel/entry/common.c |  6 ++++++
 kernel/rcu/tree.c     | 12 +++++++-----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 378341642f94..8f3292b5f9b7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -178,6 +178,9 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		/* Architecture specific TIF work */
 		arch_exit_to_user_mode_work(regs, ti_work);
 
+		/* Check if any of the above work has queued a deferred wakeup */
+		rcu_nocb_flush_deferred_wakeup();
So this needs to be moved to the IRQs disabled section, just a few lines later, otherwise preemption may schedule another task that in turn do call_rcu() and create new deferred wake up (thank Paul for the warning). Not to mention moving to another CPU with its own deferred wakeups to flush...
I'll fix that for the next version.
Ah, so it was not just my laptop dying, then! ;-)
Thanx, Paul
On Sun, Jan 10, 2021 at 09:13:18PM -0800, Paul E. McKenney wrote:
On Mon, Jan 11, 2021 at 01:40:14AM +0100, Frederic Weisbecker wrote:
On Sat, Jan 09, 2021 at 03:05:33AM +0100, Frederic Weisbecker wrote:
Following the idle loop model, cleanly check for pending rcuog wakeup before the last rescheduling point on resuming to user mode. This way we can avoid to do it from rcu_user_enter() with the last resort self-IPI hack that enforces rescheduling.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

 kernel/entry/common.c |  6 ++++++
 kernel/rcu/tree.c     | 12 +++++++-----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 378341642f94..8f3292b5f9b7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -178,6 +178,9 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		/* Architecture specific TIF work */
 		arch_exit_to_user_mode_work(regs, ti_work);
 
+		/* Check if any of the above work has queued a deferred wakeup */
+		rcu_nocb_flush_deferred_wakeup();
So this needs to be moved to the IRQs disabled section, just a few lines later, otherwise preemption may schedule another task that in turn do call_rcu() and create new deferred wake up (thank Paul for the warning). Not to mention moving to another CPU with its own deferred wakeups to flush...
I'll fix that for the next version.
Ah, so it was not just my laptop dying, then! ;-)
Note that it fixes the "smp_processor_id() in preemptible" warnings you reported but it shouldn't fix the other issues.
On Sat, Jan 09, 2021 at 03:05:33AM +0100, Frederic Weisbecker wrote:
Following the idle loop model, cleanly check for pending rcuog wakeup before the last rescheduling point on resuming to user mode. This way we can avoid to do it from rcu_user_enter() with the last resort self-IPI hack that enforces rescheduling.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

 kernel/entry/common.c |  6 ++++++
 kernel/rcu/tree.c     | 12 +++++++-----
 2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 378341642f94..8f3292b5f9b7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -178,6 +178,9 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		/* Architecture specific TIF work */
 		arch_exit_to_user_mode_work(regs, ti_work);
 
+		/* Check if any of the above work has queued a deferred wakeup */
+		rcu_nocb_flush_deferred_wakeup();
+
 		/*
 		 * Disable interrupts and reevaluate the work flags as they
 		 * might have changed while interrupts and preemption was
@@ -197,6 +200,9 @@ static void exit_to_user_mode_prepare(struct pt_regs *regs)
 
 	lockdep_assert_irqs_disabled();
 
+	/* Flush pending rcuog wakeup before the last need_resched() check */
+	rcu_nocb_flush_deferred_wakeup();
+
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
 		ti_work = exit_to_user_mode_loop(regs, ti_work);
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2920dfc9f58c..3c4c0d5cea65 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -705,12 +705,14 @@ noinstr void rcu_user_enter(void)
 
 	lockdep_assert_irqs_disabled();
 	/*
-	 * We may be past the last rescheduling opportunity in the entry code.
-	 * Trigger a self IPI that will fire and reschedule once we resume to
-	 * user/guest mode.
+	 * Other than generic entry implementation, we may be past the last
+	 * rescheduling opportunity in the entry code. Trigger a self IPI
+	 * that will fire and reschedule once we resume in user/guest mode.
 	 */
-	if (do_nocb_deferred_wakeup(rdp) && need_resched())
-		irq_work_queue(this_cpu_ptr(&late_wakeup_work));
+	if (!IS_ENABLED(CONFIG_GENERIC_ENTRY) || (current->flags & PF_VCPU)) {
We have xfer_to_guest_mode_work() for that PF_VCPU case.
+		if (do_nocb_deferred_wakeup(rdp) && need_resched())
+			irq_work_queue(this_cpu_ptr(&late_wakeup_work));
+	}
On Mon, Jan 11, 2021 at 01:08:08PM +0100, Peter Zijlstra wrote:
On Sat, Jan 09, 2021 at 03:05:33AM +0100, Frederic Weisbecker wrote:
Following the idle loop model, cleanly check for pending rcuog wakeup before the last rescheduling point on resuming to user mode. This way we can avoid to do it from rcu_user_enter() with the last resort self-IPI hack that enforces rescheduling.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

 kernel/entry/common.c |  6 ++++++
 kernel/rcu/tree.c     | 12 +++++++-----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 378341642f94..8f3292b5f9b7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -178,6 +178,9 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		/* Architecture specific TIF work */
 		arch_exit_to_user_mode_work(regs, ti_work);
 
+		/* Check if any of the above work has queued a deferred wakeup */
+		rcu_nocb_flush_deferred_wakeup();
+
 		/*
 		 * Disable interrupts and reevaluate the work flags as they
 		 * might have changed while interrupts and preemption was
@@ -197,6 +200,9 @@ static void exit_to_user_mode_prepare(struct pt_regs *regs)
 
 	lockdep_assert_irqs_disabled();
 
+	/* Flush pending rcuog wakeup before the last need_resched() check */
+	rcu_nocb_flush_deferred_wakeup();
+
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
 		ti_work = exit_to_user_mode_loop(regs, ti_work);
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2920dfc9f58c..3c4c0d5cea65 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -705,12 +705,14 @@ noinstr void rcu_user_enter(void)
 
 	lockdep_assert_irqs_disabled();
 	/*
-	 * We may be past the last rescheduling opportunity in the entry code.
-	 * Trigger a self IPI that will fire and reschedule once we resume to
-	 * user/guest mode.
+	 * Other than generic entry implementation, we may be past the last
+	 * rescheduling opportunity in the entry code. Trigger a self IPI
+	 * that will fire and reschedule once we resume in user/guest mode.
 	 */
-	if (do_nocb_deferred_wakeup(rdp) && need_resched())
-		irq_work_queue(this_cpu_ptr(&late_wakeup_work));
+	if (!IS_ENABLED(CONFIG_GENERIC_ENTRY) || (current->flags & PF_VCPU)) {
We have xfer_to_guest_mode_work() for that PF_VCPU case.
Ah very nice! I'll integrate that on the next iteration.
Thanks.
+		if (do_nocb_deferred_wakeup(rdp) && need_resched())
+			irq_work_queue(this_cpu_ptr(&late_wakeup_work));
+	}
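As an aside on the guest side: the flush would then presumably sit in the generic xfer_to_guest_mode_work() loop, next to its need_resched() handling, with the same IRQs-disabled placement caveat discussed for the user path. A rough sketch under those assumptions (not the posted code):

	static int xfer_to_guest_mode_work_sketch(struct kvm_vcpu *vcpu,
						  unsigned long ti_work)
	{
		do {
			if (ti_work & _TIF_NEED_RESCHED)
				schedule();

			if (ti_work & _TIF_NOTIFY_RESUME)
				tracehook_notify_resume(NULL);

			/*
			 * Flush the deferred rcuog wakeup before the loop
			 * re-checks the work flags and need_resched(), so a
			 * late local wake up is not silently lost.
			 */
			rcu_nocb_flush_deferred_wakeup();

			ti_work = READ_ONCE(current_thread_info()->flags);
		} while (ti_work & XFER_TO_GUEST_MODE_WORK || need_resched());

		return 0;
	}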
The idle loop has several need_resched() checks that make sure we don't miss a rescheduling request. This means that any wake up performed on the local runqueue after the last generic need_resched() check is going to have its rescheduling silently ignored. This has happened in the past with rcu kthreads awakened from rcu_idle_enter(), for example.
Perform sanity checks to report these situations.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 include/linux/sched.h | 11 +++++++++++
 kernel/sched/core.c   | 42 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/idle.c   |  3 +++
 kernel/sched/sched.h  |  3 +++
 4 files changed, 59 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e3a5eeec509..83fedda54943 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1917,6 +1917,17 @@ static __always_inline bool need_resched(void)
 	return unlikely(tif_need_resched());
 }
 
+#ifdef CONFIG_SCHED_DEBUG
+extern void sched_resched_local_allow(void);
+extern void sched_resched_local_forbid(void);
+extern void sched_resched_local_assert_allowed(void);
+#else
+static inline void sched_resched_local_allow(void) { }
+static inline void sched_resched_local_forbid(void) { }
+static inline void sched_resched_local_assert_allowed(void) { }
+#endif
+
+
 /*
  * Wrappers for p->thread_info->cpu access. No-op on UP.
  */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 15d2562118d1..6056f0374674 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -591,6 +591,44 @@ void wake_up_q(struct wake_q_head *head)
 	}
 }
 
+#ifdef CONFIG_SCHED_DEBUG
+void noinstr sched_resched_local_allow(void)
+{
+	this_rq()->resched_local_allow = 1;
+}
+
+void noinstr sched_resched_local_forbid(void)
+{
+	this_rq()->resched_local_allow = 0;
+}
+
+void noinstr sched_resched_local_assert_allowed(void)
+{
+	if (this_rq()->resched_local_allow)
+		return;
+
+	/*
+	 * Idle interrupts break the CPU from its pause and
+	 * rescheduling happens on idle loop exit.
+	 */
+	if (in_hardirq())
+		return;
+
+	/*
+	 * What applies to hardirq also applies to softirq as
+	 * we assume they execute on hardirq tail. Ksoftirqd
+	 * shouldn't have resched_local_allow == 0.
+	 * We also assume that no local_bh_enable() call may
+	 * execute softirqs inline on fragile idle/entry
+	 * path...
+	 */
+	if (in_serving_softirq())
+		return;
+
+	WARN_ONCE(1, "Late current task rescheduling may be lost\n");
+}
+#endif
+
 /*
  * resched_curr - mark rq's current task 'to be rescheduled now'.
  *
@@ -613,6 +651,7 @@ void resched_curr(struct rq *rq)
 	if (cpu == smp_processor_id()) {
 		set_tsk_need_resched(curr);
 		set_preempt_need_resched();
+		sched_resched_local_assert_allowed();
 		return;
 	}
 
@@ -7796,6 +7835,9 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
+#ifdef CONFIG_SCHED_DEBUG
+		rq->resched_local_allow = 1;
+#endif
 	}
 
 	set_load_weight(&init_task, false);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index b601a3aa2152..cdffd32812bd 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -185,6 +185,8 @@ static void cpuidle_idle_call(void)
 		return;
 	}
 
+	sched_resched_local_forbid();
+
 	/*
 	 * The RCU framework needs to be told that we are entering an idle
 	 * section, so no more rcu read side critical sections and one more
@@ -247,6 +249,7 @@ static void cpuidle_idle_call(void)
 	}
 
 exit_idle:
+	sched_resched_local_allow();
 	__current_set_polling();
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 12ada79d40f3..a9416c383451 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1060,6 +1060,9 @@ struct rq {
 #endif
 	unsigned int		push_busy;
 	struct cpu_stop_work	push_work;
+#ifdef CONFIG_SCHED_DEBUG
+	unsigned int		resched_local_allow;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
On Sat, Jan 09, 2021 at 03:05:34AM +0100, Frederic Weisbecker wrote:
The idle loop has several need_resched() checks that make sure we don't miss a rescheduling request. This means that any wake up performed on the local runqueue after the last generic need_resched() check is going to have its rescheduling silently ignored. This has happened in the past with rcu kthreads awaken from rcu_idle_enter() for example.
Perform sanity checks to report these situations.
I really don't like this..
- it's too specific to the actual reschedule condition, any wakeup this late is dodgy, not only those that happen to cause a local reschedule.
- we can already test this with unwind and checking against __cpuidle
- moving all of __cpuidle into noinstr would also cover this. And we're going to have to do that anyway.
+void noinstr sched_resched_local_assert_allowed(void)
+{
+	if (this_rq()->resched_local_allow)
+		return;
+
+	/*
+	 * Idle interrupts break the CPU from its pause and
+	 * rescheduling happens on idle loop exit.
+	 */
+	if (in_hardirq())
+		return;
+
+	/*
+	 * What applies to hardirq also applies to softirq as
+	 * we assume they execute on hardirq tail. Ksoftirqd
+	 * shouldn't have resched_local_allow == 0.
+	 * We also assume that no local_bh_enable() call may
+	 * execute softirqs inline on fragile idle/entry
+	 * path...
+	 */
+	if (in_serving_softirq())
+		return;
+
+	WARN_ONCE(1, "Late current task rescheduling may be lost\n");
That seems like it wants to be:
WARN_ONCE(in_task(), "...");
+}
On Mon, Jan 11, 2021 at 01:25:59PM +0100, Peter Zijlstra wrote:
On Sat, Jan 09, 2021 at 03:05:34AM +0100, Frederic Weisbecker wrote:
The idle loop has several need_resched() checks that make sure we don't miss a rescheduling request. This means that any wake up performed on the local runqueue after the last generic need_resched() check is going to have its rescheduling silently ignored. This has happened in the past with rcu kthreads awaken from rcu_idle_enter() for example.
Perform sanity checks to report these situations.
I really don't like this..
- it's too specific to the actual reschedule condition, any wakeup this late is dodgy, not only those that happen to cause a local reschedule.
Right.
- we can already test this with unwind and checking against __cpuidle
- moving all of __cpuidle into noinstr would also cover this. And we're going to have to do that anyway.
Ok then, I'll wait for that instead.
+void noinstr sched_resched_local_assert_allowed(void)
+{
+	if (this_rq()->resched_local_allow)
+		return;
+
+	/*
+	 * Idle interrupts break the CPU from its pause and
+	 * rescheduling happens on idle loop exit.
+	 */
+	if (in_hardirq())
+		return;
+
+	/*
+	 * What applies to hardirq also applies to softirq as
+	 * we assume they execute on hardirq tail. Ksoftirqd
+	 * shouldn't have resched_local_allow == 0.
+	 * We also assume that no local_bh_enable() call may
+	 * execute softirqs inline on fragile idle/entry
+	 * path...
+	 */
+	if (in_serving_softirq())
+		return;
+
+	WARN_ONCE(1, "Late current task rescheduling may be lost\n");
That seems like it wants to be:
WARN_ONCE(in_task(), "...");
Right! But I guess I'll drop that patch now.
Thanks.
Greeting,
FYI, we noticed the following commit (built with gcc-9):
commit: 9720a64438d901dad40d4791daf017507fe67f51 ("sched: Report local wake up on resched blind zone within idle loop")
url: https://github.com/0day-ci/linux/commits/Frederic-Weisbecker/rcu-sched-Fix-i...
in testcase: boot
on test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 8G
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
+---------------------------------------------------------------------+------------+------------+
|                                                                     | 13b5aef705 | 9720a64438 |
+---------------------------------------------------------------------+------------+------------+
| boot_successes                                                      | 16         | 0          |
| boot_failures                                                       | 0          | 18         |
| WARNING:at_kernel/sched/core.c:#sched_resched_local_assert_allowed  | 0          | 18         |
| EIP:sched_resched_local_assert_allowed                              | 0          | 18         |
| EIP:default_idle                                                    | 0          | 18         |
+---------------------------------------------------------------------+------------+------------+
If you fix the issue, kindly add following tag
Reported-by: kernel test robot <oliver.sang@intel.com>
[    0.278654] WARNING: CPU: 1 PID: 0 at kernel/sched/core.c:628 sched_resched_local_assert_allowed (kbuild/src/consumer/kernel/sched/core.c:628 (discriminator 13))
[    0.278654] Modules linked in:
[    0.278654] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.11.0-rc2-00006-g9720a64438d9 #2
[    0.278654] EIP: sched_resched_local_assert_allowed (kbuild/src/consumer/kernel/sched/core.c:628 (discriminator 13))
[    0.278654] Call Trace:
[    0.278654]  resched_curr (kbuild/src/consumer/kernel/sched/core.c:655 (discriminator 24))
[    0.278654]  check_preempt_curr (kbuild/src/consumer/kernel/sched/core.c:1750 (discriminator 4))
[    0.278654]  ttwu_do_wakeup (kbuild/src/consumer/kernel/sched/core.c:2976)
[    0.278654]  ttwu_do_activate (kbuild/src/consumer/kernel/sched/core.c:3027)
[    0.278654]  try_to_wake_up (kbuild/src/consumer/kernel/sched/core.c:3216 kbuild/src/consumer/kernel/sched/core.c:3493)
[    0.278654]  wake_up_process (kbuild/src/consumer/kernel/sched/core.c:3564)
[    0.278654]  wakeup_softirqd (kbuild/src/consumer/kernel/softirq.c:77 (discriminator 3))
[    0.278654]  raise_softirq_irqoff (kbuild/src/consumer/kernel/softirq.c:467 (discriminator 1))
[    0.278654]  raise_softirq (kbuild/src/consumer/kernel/softirq.c:476 (discriminator 7))
[    0.278654]  invoke_rcu_core (kbuild/src/consumer/kernel/rcu/tree.c:2793 (discriminator 4))
[    0.278654]  rcu_cleanup_after_idle (kbuild/src/consumer/kernel/rcu/tree_plugin.h:1434 (discriminator 1))
[    0.278654]  rcu_nmi_enter (kbuild/src/consumer/kernel/rcu/tree.c:1033 (discriminator 1))
[    0.278654]  rcu_irq_enter (kbuild/src/consumer/kernel/rcu/tree.c:1087 (discriminator 49))
[    0.278654]  irqentry_enter (kbuild/src/consumer/kernel/entry/common.c:369 (discriminator 1))
[    0.278654]  sysvec_call_function_single (kbuild/src/consumer/arch/x86/kernel/smp.c:243)
[    0.278654]  handle_exception (kbuild/src/consumer/arch/x86/entry/entry_32.S:1179)
[    0.278654] EIP: default_idle (kbuild/src/consumer/arch/x86/kernel/process.c:689)
[    0.278654]  arch_cpu_idle (kbuild/src/consumer/arch/x86/kernel/process.c:681)
[    0.278654]  default_idle_call (kbuild/src/consumer/kernel/sched/idle.c:121 (discriminator 2))
[    0.278654]  cpuidle_idle_call (kbuild/src/consumer/kernel/sched/idle.c:200 (discriminator 1))
[    0.278654]  do_idle (kbuild/src/consumer/kernel/sched/idle.c:307)
[    0.278654]  cpu_startup_entry (kbuild/src/consumer/kernel/sched/idle.c:401 (discriminator 1))
[    0.278654]  start_secondary (kbuild/src/consumer/arch/x86/kernel/smpboot.c:272)
[    0.278654]  startup_32_smp (kbuild/src/consumer/arch/x86/kernel/head_32.S:328)
[    0.278654] irq event stamp: 1280
[    0.278654] hardirqs last enabled at (1279): default_idle_call (kbuild/src/consumer/kernel/sched/idle.c:96 (discriminator 2))
[    0.278654] hardirqs last disabled at (1280): sysvec_call_function_single (kbuild/src/consumer/arch/x86/kernel/smp.c:243)
[    0.278654] softirqs last enabled at (1246): __do_softirq (kbuild/src/consumer/kernel/softirq.c:371)
[    0.278654] softirqs last disabled at (1201): do_softirq_own_stack (kbuild/src/consumer/arch/x86/kernel/irq_32.c:59 kbuild/src/consumer/arch/x86/kernel/irq_32.c:148)
[    0.278654] ---[ end trace f16ac7c94443e620 ]---
[ register dumps, decoded instruction bytes, unreliable "?" stack entries and subsequent ACPI/PCI probe messages trimmed from the attached log ]
To reproduce:
	# build kernel
	cd linux
	cp config-5.11.0-rc2-00006-g9720a64438d9 .config
	make HOSTCC=gcc-9 CC=gcc-9 ARCH=i386 olddefconfig prepare modules_prepare bzImage

	git clone https://github.com/intel/lkp-tests.git
	cd lkp-tests
	bin/lkp qemu -k <bzImage> job-script  # job-script is attached in this email
Thanks, Oliver Sang
The last rescheduling opportunity while resuming to user is in exit_to_user_mode_loop(). This means that any wake up performed on the local runqueue after this point is going to have its rescheduling silently ignored.
Perform sanity checks to report these situations.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 kernel/entry/common.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 8f3292b5f9b7..1dfb97762336 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -5,6 +5,7 @@
 #include <linux/highmem.h>
 #include <linux/livepatch.h>
 #include <linux/audit.h>
+#include <linux/sched.h>
 
 #include "common.h"
 
@@ -23,6 +24,8 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
 	instrumentation_begin();
 	trace_hardirqs_off_finish();
 	instrumentation_end();
+
+	sched_resched_local_allow();
 }
 
 void noinstr enter_from_user_mode(struct pt_regs *regs)
@@ -206,6 +209,7 @@ static void exit_to_user_mode_prepare(struct pt_regs *regs)
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
 		ti_work = exit_to_user_mode_loop(regs, ti_work);
 
+	sched_resched_local_forbid();
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 
 	/* Ensure that the address limit is intact and no locks are held */
Greeting,
FYI, we noticed a -2.2% regression of unixbench.score due to commit:
commit: 8e01c5f10451c019e384d68ee8edb9129e3f0f7f ("entry: Report local wake up on resched blind zone while resuming to user")
url: https://github.com/0day-ci/linux/commits/Frederic-Weisbecker/rcu-sched-Fix-i...

in testcase: unixbench
on test machine: 96 threads Intel(R) Xeon(R) CPU @ 2.30GHz with 128G memory
with following parameters:

	runtime: 300s
	nr_task: 1
	test: syscall
	cpufreq_governor: performance
	ucode: 0x4003003

test-description: UnixBench is the original BYTE UNIX benchmark suite aims to test performance of Unix-like system.
test-url: https://github.com/kdlucas/byte-unixbench
In addition to that, the commit also has significant impact on the following tests:
+------------------+---------------------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_thread_ops -2.0% regression              |
| test machine     | 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory |
| test parameters  | cpufreq_governor=performance                                              |
|                  | mode=thread                                                               |
|                  | nr_task=50%                                                               |
|                  | test=futex3                                                               |
|                  | ucode=0x5003003                                                           |
+------------------+---------------------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_thread_ops -1.5% regression              |
| test machine     | 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory |
| test parameters  | cpufreq_governor=performance                                              |
|                  | mode=thread                                                               |
|                  | nr_task=16                                                                |
|                  | test=futex4                                                               |
|                  | ucode=0x5003003                                                           |
+------------------+---------------------------------------------------------------------------+
If you fix the issue, kindly add following tag
Reported-by: kernel test robot <oliver.sang@intel.com>
Details are as below: -------------------------------------------------------------------------------------------------->
To reproduce:
	git clone https://github.com/intel/lkp-tests.git
	cd lkp-tests
	bin/lkp install job.yaml   # job file is attached in this email
	bin/lkp run job.yaml
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_task/rootfs/runtime/tbox_group/test/testcase/ucode:
  gcc-9/performance/x86_64-rhel-8.3/1/debian-10.4-x86_64-20200603.cgz/300s/lkp-csl-2sp4/syscall/unixbench/0x4003003

commit:
  9720a64438 ("sched: Report local wake up on resched blind zone within idle loop")
  8e01c5f104 ("entry: Report local wake up on resched blind zone while resuming to user")
9720a64438d901da 8e01c5f10451c019e384d68ee8e ---------------- --------------------------- fail:runs %reproduction fail:runs | | | 0:4 -2% 0:4 perf-profile.children.cycles-pp.error_entry 0:4 -1% 0:4 perf-profile.self.cycles-pp.error_entry %stddev %change %stddev \ | \ 1566 -2.2% 1532 unixbench.score 198.20 -1.2% 195.82 unixbench.time.system_time 100.35 +2.4% 102.77 unixbench.time.user_time 9.165e+08 -2.2% 8.965e+08 unixbench.workload 105519 ±116% -72.3% 29231 ± 10% cpuidle.C1.usage 0.02 ± 31% -56.9% 0.01 ± 33% perf-sched.sch_delay.max.ms.schedule_timeout.wait_for_completion.__flush_work.lru_add_drain_all 10909 ± 4% -12.2% 9580 ± 6% numa-vmstat.node0.nr_slab_reclaimable 7745 ± 5% +17.3% 9087 ± 8% numa-vmstat.node1.nr_slab_reclaimable 2558 ± 5% +16.4% 2977 slabinfo.fsnotify_mark_connector.active_objs 2558 ± 5% +16.4% 2977 slabinfo.fsnotify_mark_connector.num_objs 570484 ± 4% +6.7% 608647 ± 6% sched_debug.cpu.max_idle_balance_cost.max 10507 ± 42% +62.3% 17056 ± 11% sched_debug.cpu.max_idle_balance_cost.stddev 8.73 ± 7% -16.0% 7.33 ± 5% sched_debug.cpu.nr_uninterruptible.stddev 43640 ± 4% -12.2% 38321 ± 6% numa-meminfo.node0.KReclaimable 43640 ± 4% -12.2% 38321 ± 6% numa-meminfo.node0.SReclaimable 135268 ± 2% -8.5% 123810 ± 4% numa-meminfo.node0.Slab 30984 ± 5% +17.3% 36352 ± 8% numa-meminfo.node1.KReclaimable 30984 ± 5% +17.3% 36352 ± 8% numa-meminfo.node1.SReclaimable 101801 ± 3% +11.6% 113655 ± 4% numa-meminfo.node1.Slab 7.036e+08 ± 2% +4.3% 7.34e+08 perf-stat.i.branch-instructions 1.074e+09 +2.5% 1.101e+09 perf-stat.i.dTLB-loads 6.915e+08 +4.1% 7.199e+08 perf-stat.i.dTLB-stores 26.16 +3.0% 26.93 perf-stat.i.metric.M/sec 1479 ± 2% +4.1% 1540 perf-stat.overall.path-length 7.018e+08 ± 2% +4.3% 7.322e+08 perf-stat.ps.branch-instructions 1.071e+09 +2.6% 1.098e+09 perf-stat.ps.dTLB-loads 6.895e+08 +4.1% 7.179e+08 perf-stat.ps.dTLB-stores 3.75 ± 5% -0.8 2.99 ± 15% perf-profile.calltrace.cycles-pp.tick_sched_timer.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.asm_call_sysvec_on_stack 2.99 ± 6% -0.6 2.39 ± 17% perf-profile.calltrace.cycles-pp.update_process_times.tick_sched_handle.tick_sched_timer.__hrtimer_run_queues.hrtimer_interrupt 1.46 ± 6% -0.3 1.18 ± 14% perf-profile.calltrace.cycles-pp.scheduler_tick.update_process_times.tick_sched_handle.tick_sched_timer.__hrtimer_run_queues 0.96 ± 10% +0.2 1.16 ± 12% perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.entry_SYSCALL_64_after_hwframe 3.86 ± 4% -0.8 3.06 ± 15% perf-profile.children.cycles-pp.tick_sched_timer 3.09 ± 6% -0.6 2.48 ± 16% perf-profile.children.cycles-pp.update_process_times 1.51 ± 6% -0.3 1.25 ± 13% perf-profile.children.cycles-pp.scheduler_tick 0.05 ± 58% +0.0 0.09 ± 12% perf-profile.children.cycles-pp.rcu_dynticks_eqs_enter 0.28 ± 11% +0.1 0.34 ± 7% perf-profile.children.cycles-pp.__intel_pmu_enable_all 0.93 ± 7% +0.1 1.07 ± 12% perf-profile.children.cycles-pp.syscall_enter_from_user_mode 0.03 ±100% +0.2 0.18 ± 17% perf-profile.children.cycles-pp.sched_resched_local_allow 1.47 ± 8% +0.3 1.75 ± 10% perf-profile.children.cycles-pp.exit_to_user_mode_prepare 0.00 +0.3 0.33 ± 10% perf-profile.children.cycles-pp.sched_resched_local_forbid 0.47 ± 9% -0.1 0.36 ± 19% perf-profile.self.cycles-pp.update_process_times 0.05 ± 58% +0.0 0.09 ± 12% perf-profile.self.cycles-pp.rcu_dynticks_eqs_enter 0.10 ± 5% +0.0 0.14 ± 17% perf-profile.self.cycles-pp.__x64_sys_close 0.28 ± 11% +0.1 0.34 ± 7% perf-profile.self.cycles-pp.__intel_pmu_enable_all 0.01 ±173% +0.2 0.18 ± 15% 
perf-profile.self.cycles-pp.sched_resched_local_allow 0.00 +0.2 0.17 ± 21% perf-profile.self.cycles-pp.sched_resched_local_forbid 3.78 ± 48% +1.4 5.16 ± 39% perf-profile.self.cycles-pp.cpuidle_enter_state 75783 ± 2% +7.7% 81634 ± 3% interrupts.CAL:Function_call_interrupts 148.75 ± 14% -33.4% 99.00 ± 34% interrupts.CPU16.NMI:Non-maskable_interrupts 148.75 ± 14% -33.4% 99.00 ± 34% interrupts.CPU16.PMI:Performance_monitoring_interrupts 805.75 ±144% -87.4% 101.50 ± 33% interrupts.CPU19.NMI:Non-maskable_interrupts 805.75 ±144% -87.4% 101.50 ± 33% interrupts.CPU19.PMI:Performance_monitoring_interrupts 1312 ±153% -92.4% 100.25 ± 34% interrupts.CPU23.NMI:Non-maskable_interrupts 1312 ±153% -92.4% 100.25 ± 34% interrupts.CPU23.PMI:Performance_monitoring_interrupts 618.00 ± 5% +10.3% 681.50 ± 2% interrupts.CPU39.CAL:Function_call_interrupts 579.50 ± 12% +18.2% 685.00 ± 2% interrupts.CPU48.CAL:Function_call_interrupts 254.50 ± 65% -60.8% 99.75 ± 34% interrupts.CPU48.NMI:Non-maskable_interrupts 254.50 ± 65% -60.8% 99.75 ± 34% interrupts.CPU48.PMI:Performance_monitoring_interrupts 136.25 ± 13% -32.5% 92.00 ± 18% interrupts.CPU49.NMI:Non-maskable_interrupts 136.25 ± 13% -32.5% 92.00 ± 18% interrupts.CPU49.PMI:Performance_monitoring_interrupts 134.50 ± 15% -29.9% 94.25 ± 22% interrupts.CPU50.NMI:Non-maskable_interrupts 134.50 ± 15% -29.9% 94.25 ± 22% interrupts.CPU50.PMI:Performance_monitoring_interrupts 668.75 ± 5% +176.1% 1846 ± 64% interrupts.CPU56.CAL:Function_call_interrupts 143.50 ± 14% -23.7% 109.50 ± 15% interrupts.CPU60.NMI:Non-maskable_interrupts 143.50 ± 14% -23.7% 109.50 ± 15% interrupts.CPU60.PMI:Performance_monitoring_interrupts 140.75 ± 17% -32.9% 94.50 ± 26% interrupts.CPU62.NMI:Non-maskable_interrupts 140.75 ± 17% -32.9% 94.50 ± 26% interrupts.CPU62.PMI:Performance_monitoring_interrupts 143.00 ± 10% -43.7% 80.50 ± 36% interrupts.CPU64.NMI:Non-maskable_interrupts 143.00 ± 10% -43.7% 80.50 ± 36% interrupts.CPU64.PMI:Performance_monitoring_interrupts 650.75 +20.1% 781.50 ± 20% interrupts.CPU69.CAL:Function_call_interrupts 510.00 ±123% -80.8% 98.00 ± 34% interrupts.CPU71.NMI:Non-maskable_interrupts 510.00 ±123% -80.8% 98.00 ± 34% interrupts.CPU71.PMI:Performance_monitoring_interrupts 648.00 ± 2% +35.6% 878.75 ± 36% interrupts.CPU73.CAL:Function_call_interrupts 648.75 ± 2% +169.4% 1748 ± 92% interrupts.CPU88.CAL:Function_call_interrupts
unixbench.score
[collapsed ASCII trend plot: the [*] bisect-good samples (parent 9720a64438) fluctuate roughly between 1560 and 1585, while the [O] bisect-bad samples (8e01c5f104) cluster around 1530-1535, consistent with the -2.2% unixbench.score change reported above.]
unixbench.workload
[collapsed ASCII trend plot: the [*] bisect-good samples sit roughly between 9.15e+08 and 9.3e+08, while the [O] bisect-bad samples drop to about 8.95e+08-9e+08, consistent with the -2.2% unixbench.workload change reported above.]
[*] bisect-good sample [O] bisect-bad sample
*************************************************************************************************** lkp-csl-2ap2: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory ========================================================================================= compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode: gcc-9/performance/x86_64-rhel-8.3/thread/50%/debian-10.4-x86_64-20200603.cgz/lkp-csl-2ap2/futex3/will-it-scale/0x5003003
commit: 9720a64438 ("sched: Report local wake up on resched blind zone within idle loop") 8e01c5f104 ("entry: Report local wake up on resched blind zone while resuming to user")
9720a64438d901da 8e01c5f10451c019e384d68ee8e ---------------- --------------------------- %stddev %change %stddev \ | \ 9.783e+08 -2.0% 9.59e+08 will-it-scale.96.threads 10190429 -2.0% 9989144 will-it-scale.per_thread_ops 9.783e+08 -2.0% 9.59e+08 will-it-scale.workload 0.06 +0.0 0.07 ± 2% mpstat.cpu.all.soft% 28015 +1.1% 28324 proc-vmstat.nr_slab_reclaimable 4971 ± 6% -11.4% 4405 ± 7% sched_debug.cpu.nr_switches.stddev 1275 ± 70% +306.7% 5187 ± 86% numa-vmstat.node0.nr_shmem 65283 ± 3% -17.2% 54026 ± 18% numa-vmstat.node3.nr_shmem 2721 ± 3% +12.1% 3049 ± 4% slabinfo.PING.active_objs 2721 ± 3% +12.1% 3049 ± 4% slabinfo.PING.num_objs 1520 ± 6% +17.8% 1790 ± 7% slabinfo.khugepaged_mm_slot.active_objs 1520 ± 6% +17.8% 1790 ± 7% slabinfo.khugepaged_mm_slot.num_objs 5105 ± 70% +307.4% 20798 ± 86% numa-meminfo.node0.Shmem 372490 ± 36% -57.6% 157918 ± 53% numa-meminfo.node1.AnonPages.max 251355 ± 3% -17.6% 207138 ± 18% numa-meminfo.node3.Active 251355 ± 3% -17.6% 207138 ± 18% numa-meminfo.node3.Active(anon) 261667 ± 3% -17.3% 216523 ± 18% numa-meminfo.node3.Shmem 946.63 ±173% +493.3% 5616 ± 26% perf-sched.wait_and_delay.avg.ms.preempt_schedule_common._cond_resched.generic_perform_write.__generic_file_write_iter.generic_file_write_iter 240.00 ± 48% -36.7% 152.00 ± 60% perf-sched.wait_and_delay.count.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt.[unknown] 148.50 ± 17% -24.1% 112.75 ± 13% perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll 1873 ±173% +300.6% 7504 perf-sched.wait_and_delay.max.ms.preempt_schedule_common._cond_resched.generic_perform_write.__generic_file_write_iter.generic_file_write_iter 0.02 ± 39% -76.9% 0.00 ±173% perf-sched.wait_time.avg.ms.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_call_function_single.[unknown] 973.25 ±166% +477.1% 5616 ± 26% perf-sched.wait_time.avg.ms.preempt_schedule_common._cond_resched.generic_perform_write.__generic_file_write_iter.generic_file_write_iter 0.03 ± 41% -74.1% 0.01 ±173% perf-sched.wait_time.max.ms.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_call_function_single.[unknown] 2031 ±155% +269.4% 7504 perf-sched.wait_time.max.ms.preempt_schedule_common._cond_resched.generic_perform_write.__generic_file_write_iter.generic_file_write_iter 0.01 ± 60% +133.3% 0.02 ± 19% perf-sched.wait_time.max.ms.schedule_timeout.wait_for_completion.stop_one_cpu.affine_move_task 6.958e+10 +3.6% 7.205e+10 perf-stat.i.branch-instructions 0.72 -0.0 0.68 perf-stat.i.branch-miss-rate% 4.961e+08 -1.9% 4.867e+08 perf-stat.i.branch-misses 14.70 ± 3% +1.1 15.81 perf-stat.i.cache-miss-rate% 1497135 ± 4% +11.5% 1668752 ± 4% perf-stat.i.cache-misses 228415 ± 4% -11.6% 201875 ± 6% perf-stat.i.cycles-between-cache-misses 1.114e+11 +1.5% 1.131e+11 perf-stat.i.dTLB-loads 8.403e+10 +2.6% 8.619e+10 perf-stat.i.dTLB-stores 3747984 +2.7% 3849820 perf-stat.i.iTLB-loads 1.53 ± 4% +5.8% 1.62 ± 3% perf-stat.i.major-faults 1.39 +5.1% 1.46 ± 3% perf-stat.i.metric.K/sec 1379 +2.4% 1412 perf-stat.i.metric.M/sec 301494 +9.0% 328692 ± 5% perf-stat.i.node-load-misses 0.71 -0.0 0.68 perf-stat.overall.branch-miss-rate% 14.61 ± 3% +0.9 15.55 perf-stat.overall.cache-miss-rate% 195763 ± 4% -10.5% 175161 ± 4% perf-stat.overall.cycles-between-cache-misses 0.00 -0.0 0.00 perf-stat.overall.dTLB-store-miss-rate% 134378 +2.2% 137315 perf-stat.overall.path-length 6.93e+10 +3.5% 7.175e+10 perf-stat.ps.branch-instructions 4.942e+08 -1.9% 4.848e+08 perf-stat.ps.branch-misses 
1510988 ± 4% +11.4% 1683127 ± 4% perf-stat.ps.cache-misses 203.58 -1.5% 200.43 perf-stat.ps.cpu-migrations 1.11e+11 +1.5% 1.126e+11 perf-stat.ps.dTLB-loads 8.368e+10 +2.6% 8.583e+10 perf-stat.ps.dTLB-stores 3733148 +2.7% 3832271 perf-stat.ps.iTLB-loads 305850 +9.2% 333869 ± 5% perf-stat.ps.node-load-misses 1.52 ± 10% +0.3 1.79 ± 11% perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.entry_SYSCALL_64_after_hwframe.syscall 1.68 ± 9% +0.3 2.01 ± 11% perf-profile.calltrace.cycles-pp.syscall_enter_from_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall 3.23 ± 10% +0.4 3.58 ± 11% perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.entry_SYSCALL_64_after_hwframe.syscall 0.10 ± 23% -0.1 0.04 ± 58% perf-profile.children.cycles-pp.ktime_get 0.09 ± 14% -0.0 0.04 ± 59% perf-profile.children.cycles-pp.clockevents_program_event 0.09 ± 10% +0.0 0.13 ± 9% perf-profile.children.cycles-pp.perf_prepare_sample 0.11 ± 8% +0.0 0.15 ± 8% perf-profile.children.cycles-pp.perf_tp_event 0.10 ± 10% +0.0 0.15 ± 10% perf-profile.children.cycles-pp.perf_swevent_overflow 0.11 ± 8% +0.0 0.15 ± 10% perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime 0.10 ± 12% +0.1 0.15 ± 10% perf-profile.children.cycles-pp.__perf_event_overflow 0.10 ± 12% +0.1 0.15 ± 10% perf-profile.children.cycles-pp.perf_event_output_forward 0.00 +0.1 0.06 ± 14% perf-profile.children.cycles-pp.account_system_index_time 0.20 ± 10% +0.1 0.26 ± 9% perf-profile.children.cycles-pp.task_tick_fair 0.11 ± 11% +0.1 0.18 ± 10% perf-profile.children.cycles-pp.update_curr 0.22 ± 9% +0.1 0.29 ± 9% perf-profile.children.cycles-pp.scheduler_tick 0.35 ± 7% +0.1 0.47 ± 9% perf-profile.children.cycles-pp.__hrtimer_run_queues 0.30 ± 10% +0.1 0.43 ± 9% perf-profile.children.cycles-pp.tick_sched_timer 0.28 ± 9% +0.1 0.42 ± 9% perf-profile.children.cycles-pp.update_process_times 0.28 ± 9% +0.1 0.43 ± 8% perf-profile.children.cycles-pp.tick_sched_handle 0.00 +0.2 0.22 ± 11% perf-profile.children.cycles-pp.sched_resched_local_allow 2.37 ± 10% +0.2 2.61 ± 12% perf-profile.children.cycles-pp.testcase 1.94 ± 10% +0.3 2.23 ± 11% perf-profile.children.cycles-pp.exit_to_user_mode_prepare 1.69 ± 9% +0.3 2.02 ± 11% perf-profile.children.cycles-pp.syscall_enter_from_user_mode 3.66 ± 10% +0.4 4.02 ± 11% perf-profile.children.cycles-pp.syscall_exit_to_user_mode 0.00 +0.4 0.45 ± 11% perf-profile.children.cycles-pp.sched_resched_local_forbid 0.09 ± 20% -0.1 0.03 ±100% perf-profile.self.cycles-pp.ktime_get 0.00 +0.1 0.05 ± 9% perf-profile.self.cycles-pp.account_system_index_time 1.91 ± 10% +0.2 2.13 ± 12% perf-profile.self.cycles-pp.testcase 0.00 +0.2 0.22 ± 11% perf-profile.self.cycles-pp.sched_resched_local_forbid 0.00 +0.2 0.22 ± 11% perf-profile.self.cycles-pp.sched_resched_local_allow 39568 -12.1% 34775 softirqs.CPU0.SCHED 26074 ± 6% -30.8% 18054 ± 18% softirqs.CPU1.RCU 13937 ± 27% +96.1% 27328 ± 20% softirqs.CPU1.SCHED 487.75 ± 60% +1455.7% 7587 ±129% softirqs.CPU10.NET_RX 7471 ± 99% -99.6% 32.50 ± 38% softirqs.CPU103.TIMER 22133 ± 15% -27.0% 16160 ± 29% softirqs.CPU107.RCU 23683 ± 12% -34.9% 15423 ± 25% softirqs.CPU110.RCU 21771 ± 13% -27.0% 15887 ± 28% softirqs.CPU119.RCU 27268 ± 7% -33.6% 18105 ± 24% softirqs.CPU12.RCU 9800 ± 82% +147.2% 24228 ± 16% softirqs.CPU12.SCHED 35848 ± 10% -52.3% 17101 ± 52% softirqs.CPU123.SCHED 21873 ± 9% -28.4% 15658 ± 19% softirqs.CPU125.RCU 23701 ± 7% -24.4% 17906 ± 20% softirqs.CPU129.RCU 23812 ± 15% -27.5% 17268 ± 7% softirqs.CPU130.RCU 35487 ± 8% -38.9% 21674 ± 33% 
softirqs.CPU131.SCHED 24202 ± 14% -26.3% 17841 ± 24% softirqs.CPU139.RCU 26857 ± 9% -33.1% 17956 ± 24% softirqs.CPU145.RCU 24985 -25.4% 18643 ± 24% softirqs.CPU146.RCU 19845 ± 11% +32.6% 26307 ± 18% softirqs.CPU146.SCHED 24163 ± 10% -30.7% 16746 ± 16% softirqs.CPU147.RCU 25991 ± 11% -28.0% 18706 ± 20% softirqs.CPU150.RCU 31382 ± 16% -46.1% 16909 ± 33% softirqs.CPU156.SCHED 26315 ± 5% -29.0% 18686 ± 28% softirqs.CPU16.RCU 24924 ± 9% -26.4% 18336 ± 26% softirqs.CPU163.RCU 25795 ± 12% -30.4% 17948 ± 17% softirqs.CPU165.RCU 23494 ± 9% -31.4% 16118 ± 17% softirqs.CPU169.RCU 15434 ± 38% +67.3% 25820 ± 21% softirqs.CPU169.SCHED 23443 ± 7% -25.3% 17521 ± 15% softirqs.CPU17.RCU 22698 ± 9% -20.2% 18116 ± 15% softirqs.CPU172.RCU 21677 ± 9% -29.8% 15224 ± 15% softirqs.CPU173.RCU 20602 ± 27% +53.8% 31690 ± 16% softirqs.CPU173.SCHED 19982 ± 10% -23.1% 15368 ± 16% softirqs.CPU188.RCU 31405 ± 9% -52.0% 15062 ± 43% softirqs.CPU189.SCHED 27459 ± 5% -29.9% 19244 ± 23% softirqs.CPU19.RCU 23837 ± 8% -24.8% 17931 ± 21% softirqs.CPU191.RCU 27482 ± 3% -26.7% 20133 ± 25% softirqs.CPU2.RCU 27374 ± 5% -28.3% 19620 ± 29% softirqs.CPU20.RCU 8946 ± 55% +120.5% 19723 ± 40% softirqs.CPU20.SCHED 23561 ± 9% -27.4% 17102 ± 18% softirqs.CPU21.RCU 24920 ± 8% -28.3% 17869 ± 14% softirqs.CPU22.RCU 27899 ± 5% -36.3% 17760 ± 29% softirqs.CPU27.RCU 9230 ± 33% +202.9% 27954 ± 31% softirqs.CPU27.SCHED 25209 ± 7% -24.7% 18973 ± 22% softirqs.CPU3.RCU 27974 ± 9% -31.3% 19231 ± 13% softirqs.CPU32.RCU 28747 ± 5% -36.5% 18268 ± 14% softirqs.CPU35.RCU 9574 ± 33% +145.3% 23490 ± 32% softirqs.CPU35.SCHED 24738 ± 15% -27.4% 17967 ± 15% softirqs.CPU36.RCU 27437 ± 13% -34.7% 17904 ± 22% softirqs.CPU37.RCU 27259 ± 9% -33.7% 18083 ± 23% softirqs.CPU38.RCU 14438 ± 52% +86.8% 26971 ± 13% softirqs.CPU38.SCHED 26156 ± 9% -32.6% 17617 ± 29% softirqs.CPU4.RCU 27287 ± 6% -31.5% 18695 ± 27% softirqs.CPU40.RCU 26370 ± 10% -30.6% 18302 ± 17% softirqs.CPU41.RCU 26793 ± 8% -30.3% 18668 ± 19% softirqs.CPU46.RCU 15557 ± 45% +64.8% 25642 ± 21% softirqs.CPU46.SCHED 25335 ± 12% -27.3% 18416 ± 24% softirqs.CPU47.RCU 25154 ± 2% -25.0% 18872 ± 20% softirqs.CPU5.RCU 23480 ± 4% -23.3% 18018 ± 23% softirqs.CPU55.RCU 26294 ± 3% -33.0% 17630 ± 20% softirqs.CPU56.RCU 13958 ± 32% +109.1% 29187 ± 15% softirqs.CPU56.SCHED 27194 ± 7% -32.8% 18287 ± 22% softirqs.CPU57.RCU 26424 ± 7% -33.4% 17603 ± 23% softirqs.CPU60.RCU 13405 ± 41% +110.0% 28152 ± 20% softirqs.CPU60.SCHED 24662 ± 17% -30.3% 17187 ± 32% softirqs.CPU66.RCU 27174 ± 28% -33.1% 18168 ± 48% softirqs.CPU67.SCHED 23980 ± 7% -28.8% 17083 ± 25% softirqs.CPU7.RCU 16015 ± 12% +69.5% 27140 ± 30% softirqs.CPU7.SCHED 29430 ± 19% -35.8% 18884 ± 34% softirqs.CPU73.SCHED 25123 ± 6% -25.5% 18715 ± 18% softirqs.CPU74.RCU 24340 ± 22% -44.1% 13615 ± 38% softirqs.CPU77.SCHED 23940 ± 8% -24.2% 18145 ± 22% softirqs.CPU83.RCU 22452 ± 7% -18.7% 18253 ± 15% softirqs.CPU90.RCU 24046 ± 3% -32.2% 16309 ± 24% softirqs.CPU93.RCU 13685 ± 19% +119.3% 30012 ± 21% softirqs.CPU93.SCHED 9316 ± 5% +40.3% 13075 ± 3% softirqs.CPU96.SCHED 32207 ± 9% -45.1% 17687 ± 30% softirqs.CPU97.SCHED 37350 ± 4% -24.4% 28241 ± 21% softirqs.CPU98.SCHED 30743 ± 11% -22.5% 23841 ± 14% softirqs.CPU99.SCHED 932.00 ± 64% +1402.0% 13998 ±133% interrupts.31:PCI-MSI.524289-edge.eth0-TxRx-0 120.75 ± 8% +75.6% 212.00 ± 5% interrupts.CPU0.RES:Rescheduling_interrupts 223.00 ± 11% -64.6% 79.00 ± 38% interrupts.CPU1.RES:Rescheduling_interrupts 981.50 ± 19% -42.4% 565.25 ± 46% interrupts.CPU1.TLB:TLB_shootdowns 932.00 ± 64% +1402.0% 13998 ±133% 
interrupts.CPU10.31:PCI-MSI.524289-edge.eth0-TxRx-0 4140 ± 28% +84.1% 7623 ± 14% interrupts.CPU100.NMI:Non-maskable_interrupts 4140 ± 28% +84.1% 7623 ± 14% interrupts.CPU100.PMI:Performance_monitoring_interrupts 3585 ± 8% -9.7% 3238 ± 5% interrupts.CPU104.CAL:Function_call_interrupts 6437 ± 14% +34.5% 8655 interrupts.CPU104.NMI:Non-maskable_interrupts 6437 ± 14% +34.5% 8655 interrupts.CPU104.PMI:Performance_monitoring_interrupts 49.50 ±129% +169.2% 133.25 ± 41% interrupts.CPU108.RES:Rescheduling_interrupts 276.75 ±108% +237.5% 934.00 ± 38% interrupts.CPU108.TLB:TLB_shootdowns 3058 ± 11% +15.2% 3523 ± 5% interrupts.CPU11.CAL:Function_call_interrupts 8162 ± 12% -38.5% 5023 ± 47% interrupts.CPU110.NMI:Non-maskable_interrupts 8162 ± 12% -38.5% 5023 ± 47% interrupts.CPU110.PMI:Performance_monitoring_interrupts 3115 ± 6% -18.7% 2534 ± 4% interrupts.CPU114.CAL:Function_call_interrupts 32.25 ±113% +271.3% 119.75 ± 58% interrupts.CPU116.RES:Rescheduling_interrupts 3704 ± 2% -18.4% 3021 ± 14% interrupts.CPU12.CAL:Function_call_interrupts 1544 ± 6% -51.3% 752.00 ± 48% interrupts.CPU12.TLB:TLB_shootdowns 2530 ± 10% +57.7% 3991 ± 15% interrupts.CPU123.CAL:Function_call_interrupts 34.75 ± 80% +415.1% 179.00 ± 35% interrupts.CPU123.RES:Rescheduling_interrupts 264.75 ± 60% +330.3% 1139 ± 42% interrupts.CPU123.TLB:TLB_shootdowns 8062 ± 9% -29.0% 5722 ± 35% interrupts.CPU125.NMI:Non-maskable_interrupts 8062 ± 9% -29.0% 5722 ± 35% interrupts.CPU125.PMI:Performance_monitoring_interrupts 1059 ± 24% -51.3% 515.75 ± 51% interrupts.CPU125.TLB:TLB_shootdowns 2648 ± 12% +37.0% 3627 ± 9% interrupts.CPU131.CAL:Function_call_interrupts 35.75 ± 76% +253.1% 126.25 ± 32% interrupts.CPU131.RES:Rescheduling_interrupts 426.00 ± 44% +148.5% 1058 ± 27% interrupts.CPU131.TLB:TLB_shootdowns 737.50 ± 44% +60.4% 1182 ± 13% interrupts.CPU133.TLB:TLB_shootdowns 76.50 ± 77% +104.2% 156.25 ± 30% interrupts.CPU134.RES:Rescheduling_interrupts 568.25 ± 55% +62.7% 924.75 ± 30% interrupts.CPU134.TLB:TLB_shootdowns 2879 ± 4% +14.7% 3303 ± 4% interrupts.CPU136.CAL:Function_call_interrupts 484.00 ± 66% +114.2% 1036 ± 20% interrupts.CPU136.TLB:TLB_shootdowns 82.25 ± 69% +88.8% 155.25 ± 30% interrupts.CPU142.RES:Rescheduling_interrupts 4178 ± 17% -24.9% 3136 ± 13% interrupts.CPU145.CAL:Function_call_interrupts 204.00 ± 35% -66.8% 67.75 ± 25% interrupts.CPU145.RES:Rescheduling_interrupts 1429 ± 17% -49.8% 717.50 ± 32% interrupts.CPU145.TLB:TLB_shootdowns 165.50 ± 9% -45.9% 89.50 ± 23% interrupts.CPU146.RES:Rescheduling_interrupts 8063 ± 14% -53.9% 3717 ± 15% interrupts.CPU15.NMI:Non-maskable_interrupts 8063 ± 14% -53.9% 3717 ± 15% interrupts.CPU15.PMI:Performance_monitoring_interrupts 2702 ± 4% +23.8% 3345 ± 12% interrupts.CPU152.CAL:Function_call_interrupts 74.00 ± 54% +135.8% 174.50 ± 27% interrupts.CPU152.RES:Rescheduling_interrupts 431.00 ± 36% +151.4% 1083 ± 31% interrupts.CPU152.TLB:TLB_shootdowns 580.25 ± 59% +91.8% 1112 ± 26% interrupts.CPU156.TLB:TLB_shootdowns 8427 ± 4% -53.3% 3932 ± 22% interrupts.CPU16.NMI:Non-maskable_interrupts 8427 ± 4% -53.3% 3932 ± 22% interrupts.CPU16.PMI:Performance_monitoring_interrupts 234.75 ± 15% -46.6% 125.25 ± 53% interrupts.CPU16.RES:Rescheduling_interrupts 7739 ± 9% -48.9% 3953 ± 30% interrupts.CPU164.NMI:Non-maskable_interrupts 7739 ± 9% -48.9% 3953 ± 30% interrupts.CPU164.PMI:Performance_monitoring_interrupts 3669 ± 7% -16.7% 3055 ± 8% interrupts.CPU165.CAL:Function_call_interrupts 7853 ± 16% -49.9% 3933 ± 47% interrupts.CPU165.NMI:Non-maskable_interrupts 7853 ± 16% -49.9% 3933 ± 47% 
interrupts.CPU165.PMI:Performance_monitoring_interrupts 1430 ± 17% -44.7% 790.50 ± 35% interrupts.CPU165.TLB:TLB_shootdowns 5312 ± 18% +35.9% 7220 ± 18% interrupts.CPU168.NMI:Non-maskable_interrupts 5312 ± 18% +35.9% 7220 ± 18% interrupts.CPU168.PMI:Performance_monitoring_interrupts 3547 ± 3% -16.2% 2972 ± 11% interrupts.CPU169.CAL:Function_call_interrupts 202.25 ± 24% -56.2% 88.50 ± 64% interrupts.CPU169.RES:Rescheduling_interrupts 8001 ± 13% -45.4% 4370 ± 46% interrupts.CPU17.NMI:Non-maskable_interrupts 8001 ± 13% -45.4% 4370 ± 46% interrupts.CPU17.PMI:Performance_monitoring_interrupts 8053 ± 8% -42.5% 4627 ± 58% interrupts.CPU172.NMI:Non-maskable_interrupts 8053 ± 8% -42.5% 4627 ± 58% interrupts.CPU172.PMI:Performance_monitoring_interrupts 159.75 ± 33% -65.7% 54.75 ± 72% interrupts.CPU173.RES:Rescheduling_interrupts 8384 -53.1% 3930 ± 47% interrupts.CPU176.NMI:Non-maskable_interrupts 8384 -53.1% 3930 ± 47% interrupts.CPU176.PMI:Performance_monitoring_interrupts 636.00 ± 41% +65.0% 1049 ± 35% interrupts.CPU179.TLB:TLB_shootdowns 3017 ± 12% +15.3% 3479 ± 11% interrupts.CPU189.CAL:Function_call_interrupts 275.00 ± 6% -38.6% 168.75 ± 22% interrupts.CPU2.RES:Rescheduling_interrupts 1540 ± 12% -23.6% 1176 ± 13% interrupts.CPU2.TLB:TLB_shootdowns 260.00 ± 19% -51.2% 127.00 ± 39% interrupts.CPU20.RES:Rescheduling_interrupts 253.75 ± 11% -71.3% 72.75 ± 69% interrupts.CPU27.RES:Rescheduling_interrupts 1480 ± 11% -60.9% 578.25 ± 81% interrupts.CPU27.TLB:TLB_shootdowns 219.00 ± 13% -37.9% 136.00 ± 10% interrupts.CPU3.RES:Rescheduling_interrupts 714.50 ± 49% +83.4% 1310 ± 22% interrupts.CPU30.TLB:TLB_shootdowns 3577 ± 6% -14.7% 3053 ± 11% interrupts.CPU35.CAL:Function_call_interrupts 248.50 ± 10% -50.8% 122.25 ± 61% interrupts.CPU35.RES:Rescheduling_interrupts 1340 ± 12% -49.2% 681.50 ± 41% interrupts.CPU35.TLB:TLB_shootdowns 239.25 ± 12% -62.7% 89.25 ± 95% interrupts.CPU4.RES:Rescheduling_interrupts 225.50 ± 20% -24.3% 170.75 ± 27% interrupts.CPU42.RES:Rescheduling_interrupts 200.50 ± 31% -52.2% 95.75 ± 44% interrupts.CPU46.RES:Rescheduling_interrupts 377.75 ± 65% +179.2% 1054 ± 21% interrupts.CPU49.TLB:TLB_shootdowns 153.00 ± 17% -42.3% 88.25 ± 25% interrupts.CPU55.RES:Rescheduling_interrupts 212.75 ± 14% -67.5% 69.25 ± 37% interrupts.CPU56.RES:Rescheduling_interrupts 1383 ± 13% -49.1% 703.75 ± 50% interrupts.CPU56.TLB:TLB_shootdowns 242.50 ± 17% -57.6% 102.75 ±103% interrupts.CPU57.RES:Rescheduling_interrupts 3764 ± 9% -20.9% 2976 ± 8% interrupts.CPU60.CAL:Function_call_interrupts 218.75 ± 24% -61.7% 83.75 ± 52% interrupts.CPU60.RES:Rescheduling_interrupts 1316 ± 23% -48.7% 675.25 ± 45% interrupts.CPU60.TLB:TLB_shootdowns 204.00 ± 8% -60.8% 80.00 ± 66% interrupts.CPU7.RES:Rescheduling_interrupts 249.25 ± 12% -26.4% 183.50 ± 23% interrupts.CPU74.RES:Rescheduling_interrupts 124.25 ± 31% +46.1% 181.50 ± 21% interrupts.CPU77.RES:Rescheduling_interrupts 3508 ± 8% -10.4% 3144 ± 12% interrupts.CPU78.CAL:Function_call_interrupts 6194 ± 35% -42.3% 3574 ± 34% interrupts.CPU8.NMI:Non-maskable_interrupts 6194 ± 35% -42.3% 3574 ± 34% interrupts.CPU8.PMI:Performance_monitoring_interrupts 5092 ± 25% +67.3% 8522 interrupts.CPU80.NMI:Non-maskable_interrupts 5092 ± 25% +67.3% 8522 interrupts.CPU80.PMI:Performance_monitoring_interrupts 169.25 ± 29% -54.2% 77.50 ± 46% interrupts.CPU90.RES:Rescheduling_interrupts 216.00 ± 7% -73.8% 56.50 ± 61% interrupts.CPU93.RES:Rescheduling_interrupts 254.50 ± 3% -26.8% 186.25 ± 15% interrupts.CPU96.RES:Rescheduling_interrupts 1372 ± 12% -16.3% 1149 ± 18% 
interrupts.CPU96.TLB:TLB_shootdowns 92.50 ± 23% +98.1% 183.25 ± 18% interrupts.CPU97.RES:Rescheduling_interrupts 158.75 ± 98% +221.1% 509.75 ± 34% interrupts.CPU98.TLB:TLB_shootdowns 28796 ± 3% -17.4% 23785 ± 15% interrupts.RES:Rescheduling_interrupts
*************************************************************************************************** lkp-csl-2ap2: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory ========================================================================================= compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode: gcc-9/performance/x86_64-rhel-8.3/thread/16/debian-10.4-x86_64-20200603.cgz/lkp-csl-2ap2/futex4/will-it-scale/0x5003003
commit: 9720a64438 ("sched: Report local wake up on resched blind zone within idle loop") 8e01c5f104 ("entry: Report local wake up on resched blind zone while resuming to user")
9720a64438d901da 8e01c5f10451c019e384d68ee8e ---------------- --------------------------- %stddev %change %stddev \ | \ 1.068e+08 -1.5% 1.052e+08 will-it-scale.16.threads 6674552 -1.5% 6571881 will-it-scale.per_thread_ops 1.068e+08 -1.5% 1.052e+08 will-it-scale.workload 540984 ± 27% -45.1% 296787 ± 49% numa-numastat.node2.local_node 1158 ± 6% -10.7% 1034 ± 7% slabinfo.file_lock_cache.active_objs 1158 ± 6% -10.7% 1034 ± 7% slabinfo.file_lock_cache.num_objs 6900 ±127% -95.6% 301.00 ±102% softirqs.CPU11.NET_RX 21404 ± 9% +30.2% 27867 ± 6% softirqs.CPU111.SCHED 23371 ± 8% -26.7% 17133 ± 8% softirqs.CPU15.SCHED 243.75 ± 63% +112.1% 517.00 ± 16% numa-vmstat.node0.nr_page_table_pages 16717 ± 4% +13.7% 19002 ± 3% numa-vmstat.node0.nr_slab_unreclaimable 425644 ± 14% +16.4% 495424 ± 8% numa-vmstat.node0.numa_local 1374 ± 55% -71.5% 391.25 ±114% numa-vmstat.node1.nr_shmem 4803 ± 17% +60.0% 7686 ± 39% numa-vmstat.node1.nr_slab_reclaimable 775917 ± 8% +10.3% 855691 ± 3% numa-meminfo.node0.MemUsed 977.75 ± 63% +112.9% 2081 ± 15% numa-meminfo.node0.PageTables 66871 ± 4% +13.7% 76009 ± 3% numa-meminfo.node0.SUnreclaim 19215 ± 17% +60.0% 30749 ± 39% numa-meminfo.node1.KReclaimable 19215 ± 17% +60.0% 30749 ± 39% numa-meminfo.node1.SReclaimable 5497 ± 55% -71.5% 1566 ±114% numa-meminfo.node1.Shmem 0.01 ± 48% +114.3% 0.01 ± 38% perf-sched.sch_delay.avg.ms.schedule_timeout.wait_for_completion.__flush_work.lru_add_drain_all 0.01 ± 15% +373.7% 0.04 ±108% perf-sched.sch_delay.max.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64 0.01 ± 8% +205.4% 0.04 ±102% perf-sched.sch_delay.max.ms.futex_wait_queue_me.futex_wait.do_futex.__x64_sys_futex 0.01 ± 48% +114.3% 0.01 ± 38% perf-sched.sch_delay.max.ms.schedule_timeout.wait_for_completion.__flush_work.lru_add_drain_all 7256 ± 3% +12.9% 8193 ± 3% perf-sched.total_wait_and_delay.max.ms 7256 ± 3% +12.9% 8193 ± 3% perf-sched.total_wait_time.max.ms 595.05 ± 11% -12.9% 518.40 ± 5% perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range_clock.ep_poll.do_epoll_wait.__x64_sys_epoll_wait 5903 ± 21% +38.8% 8193 ± 3% perf-sched.wait_and_delay.max.ms.worker_thread.kthread.ret_from_fork 595.04 ± 11% -12.9% 518.39 ± 5% perf-sched.wait_time.avg.ms.schedule_hrtimeout_range_clock.ep_poll.do_epoll_wait.__x64_sys_epoll_wait 5903 ± 21% +38.8% 8193 ± 3% perf-sched.wait_time.max.ms.worker_thread.kthread.ret_from_fork 0.98 ± 6% +0.1 1.12 ± 8% perf-profile.calltrace.cycles-pp.syscall_enter_from_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall 0.91 ± 9% +0.1 1.05 ± 7% perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.entry_SYSCALL_64_after_hwframe.syscall 0.30 ± 7% +0.0 0.34 ± 6% perf-profile.children.cycles-pp.scheduler_tick 0.11 ± 12% +0.1 0.16 ± 31% perf-profile.children.cycles-pp.tick_irq_enter 0.58 ± 10% +0.1 0.68 ± 11% perf-profile.children.cycles-pp.tick_sched_timer 0.00 +0.1 0.15 ± 14% perf-profile.children.cycles-pp.sched_resched_local_allow 0.98 ± 6% +0.1 1.13 ± 8% perf-profile.children.cycles-pp.syscall_enter_from_user_mode 1.16 ± 9% +0.2 1.33 ± 8% perf-profile.children.cycles-pp.exit_to_user_mode_prepare 1.38 ± 8% +0.2 1.56 ± 9% perf-profile.children.cycles-pp.hrtimer_interrupt 0.00 +0.3 0.28 ± 11% perf-profile.children.cycles-pp.sched_resched_local_forbid 0.00 +0.1 0.14 ± 9% perf-profile.self.cycles-pp.sched_resched_local_forbid 0.00 +0.1 0.14 ± 12% perf-profile.self.cycles-pp.sched_resched_local_allow 9.413e+09 +2.9% 9.684e+09 perf-stat.i.branch-instructions 1.521e+10 +1.5% 1.544e+10 perf-stat.i.dTLB-loads 1.174e+10 +2.1% 
1.198e+10 perf-stat.i.dTLB-stores 54241332 -1.8% 53267348 perf-stat.i.iTLB-load-misses 1082 +2.1% 1104 perf-stat.i.instructions-per-iTLB-miss 189.49 +2.1% 193.48 perf-stat.i.metric.M/sec 1078 +2.1% 1100 perf-stat.overall.instructions-per-iTLB-miss 165026 +1.8% 167961 perf-stat.overall.path-length 9.381e+09 +2.9% 9.651e+09 perf-stat.ps.branch-instructions 1.516e+10 +1.5% 1.538e+10 perf-stat.ps.dTLB-loads 1.17e+10 +2.1% 1.194e+10 perf-stat.ps.dTLB-stores 54057433 -1.8% 53087204 perf-stat.ps.iTLB-load-misses 13120 ±126% -95.7% 563.75 ±106% interrupts.32:PCI-MSI.524290-edge.eth0-TxRx-1 6947 ± 12% -37.7% 4326 ± 34% interrupts.CPU0.NMI:Non-maskable_interrupts 6947 ± 12% -37.7% 4326 ± 34% interrupts.CPU0.PMI:Performance_monitoring_interrupts 13120 ±126% -95.7% 563.75 ±106% interrupts.CPU11.32:PCI-MSI.524290-edge.eth0-TxRx-1 288.50 ± 18% -40.6% 171.25 ± 21% interrupts.CPU111.TLB:TLB_shootdowns 101.25 ± 28% +49.1% 151.00 ± 19% interrupts.CPU122.NMI:Non-maskable_interrupts 101.25 ± 28% +49.1% 151.00 ± 19% interrupts.CPU122.PMI:Performance_monitoring_interrupts 118.50 ± 5% +15.4% 136.75 ± 8% interrupts.CPU123.NMI:Non-maskable_interrupts 118.50 ± 5% +15.4% 136.75 ± 8% interrupts.CPU123.PMI:Performance_monitoring_interrupts 99.25 ± 24% +38.5% 137.50 ± 11% interrupts.CPU125.NMI:Non-maskable_interrupts 99.25 ± 24% +38.5% 137.50 ± 11% interrupts.CPU125.PMI:Performance_monitoring_interrupts 98.25 ± 23% +45.0% 142.50 ± 24% interrupts.CPU126.NMI:Non-maskable_interrupts 98.25 ± 23% +45.0% 142.50 ± 24% interrupts.CPU126.PMI:Performance_monitoring_interrupts 114.25 ± 5% +24.1% 141.75 ± 12% interrupts.CPU135.NMI:Non-maskable_interrupts 114.25 ± 5% +24.1% 141.75 ± 12% interrupts.CPU135.PMI:Performance_monitoring_interrupts 99.00 ± 23% +33.6% 132.25 ± 12% interrupts.CPU137.NMI:Non-maskable_interrupts 99.00 ± 23% +33.6% 132.25 ± 12% interrupts.CPU137.PMI:Performance_monitoring_interrupts 98.75 ± 24% +31.9% 130.25 ± 13% interrupts.CPU138.NMI:Non-maskable_interrupts 98.75 ± 24% +31.9% 130.25 ± 13% interrupts.CPU138.PMI:Performance_monitoring_interrupts 98.50 ± 24% +31.0% 129.00 ± 12% interrupts.CPU139.NMI:Non-maskable_interrupts 98.50 ± 24% +31.0% 129.00 ± 12% interrupts.CPU139.PMI:Performance_monitoring_interrupts 98.25 ± 24% +32.8% 130.50 ± 12% interrupts.CPU140.NMI:Non-maskable_interrupts 98.25 ± 24% +32.8% 130.50 ± 12% interrupts.CPU140.PMI:Performance_monitoring_interrupts 84.00 ± 30% +55.1% 130.25 ± 11% interrupts.CPU141.NMI:Non-maskable_interrupts 84.00 ± 30% +55.1% 130.25 ± 11% interrupts.CPU141.PMI:Performance_monitoring_interrupts 86.50 ± 26% +51.7% 131.25 ± 12% interrupts.CPU142.NMI:Non-maskable_interrupts 86.50 ± 26% +51.7% 131.25 ± 12% interrupts.CPU142.PMI:Performance_monitoring_interrupts 84.00 ± 30% +97.3% 165.75 ± 25% interrupts.CPU143.NMI:Non-maskable_interrupts 84.00 ± 30% +97.3% 165.75 ± 25% interrupts.CPU143.PMI:Performance_monitoring_interrupts 253.50 ± 20% +43.8% 364.50 ± 14% interrupts.CPU15.TLB:TLB_shootdowns 101.50 ± 24% +32.3% 134.25 ± 12% interrupts.CPU150.NMI:Non-maskable_interrupts 101.50 ± 24% +32.3% 134.25 ± 12% interrupts.CPU150.PMI:Performance_monitoring_interrupts 121.75 ± 10% +115.4% 262.25 ± 84% interrupts.CPU153.NMI:Non-maskable_interrupts 121.75 ± 10% +115.4% 262.25 ± 84% interrupts.CPU153.PMI:Performance_monitoring_interrupts 77.75 ± 40% +71.1% 133.00 ± 12% interrupts.CPU167.NMI:Non-maskable_interrupts 77.75 ± 40% +71.1% 133.00 ± 12% interrupts.CPU167.PMI:Performance_monitoring_interrupts 77.75 ± 30% +137.3% 184.50 ± 49% interrupts.CPU169.NMI:Non-maskable_interrupts 77.75 ± 30% 
+137.3% 184.50 ± 49% interrupts.CPU169.PMI:Performance_monitoring_interrupts 7583 ± 14% -46.7% 4043 ± 31% interrupts.CPU2.NMI:Non-maskable_interrupts 7583 ± 14% -46.7% 4043 ± 31% interrupts.CPU2.PMI:Performance_monitoring_interrupts 85.25 ± 33% +96.8% 167.75 ± 31% interrupts.CPU26.NMI:Non-maskable_interrupts 85.25 ± 33% +96.8% 167.75 ± 31% interrupts.CPU26.PMI:Performance_monitoring_interrupts 100.50 ± 27% +46.5% 147.25 ± 17% interrupts.CPU29.NMI:Non-maskable_interrupts 100.50 ± 27% +46.5% 147.25 ± 17% interrupts.CPU29.PMI:Performance_monitoring_interrupts 115.00 ± 7% +16.7% 134.25 ± 12% interrupts.CPU37.NMI:Non-maskable_interrupts 115.00 ± 7% +16.7% 134.25 ± 12% interrupts.CPU37.PMI:Performance_monitoring_interrupts 113.75 ± 4% +16.5% 132.50 ± 11% interrupts.CPU38.NMI:Non-maskable_interrupts 113.75 ± 4% +16.5% 132.50 ± 11% interrupts.CPU38.PMI:Performance_monitoring_interrupts 113.50 ± 4% +22.0% 138.50 ± 10% interrupts.CPU39.NMI:Non-maskable_interrupts 113.50 ± 4% +22.0% 138.50 ± 10% interrupts.CPU39.PMI:Performance_monitoring_interrupts 113.25 ± 5% +16.1% 131.50 ± 11% interrupts.CPU41.NMI:Non-maskable_interrupts 113.25 ± 5% +16.1% 131.50 ± 11% interrupts.CPU41.PMI:Performance_monitoring_interrupts 101.50 ± 20% +28.1% 130.00 ± 12% interrupts.CPU46.NMI:Non-maskable_interrupts 101.50 ± 20% +28.1% 130.00 ± 12% interrupts.CPU46.PMI:Performance_monitoring_interrupts 99.25 ± 24% +50.6% 149.50 ± 13% interrupts.CPU47.NMI:Non-maskable_interrupts 99.25 ± 24% +50.6% 149.50 ± 13% interrupts.CPU47.PMI:Performance_monitoring_interrupts 87.25 ± 30% +168.2% 234.00 ± 71% interrupts.CPU57.NMI:Non-maskable_interrupts 87.25 ± 30% +168.2% 234.00 ± 71% interrupts.CPU57.PMI:Performance_monitoring_interrupts 91.50 ± 26% +57.1% 143.75 ± 24% interrupts.CPU58.NMI:Non-maskable_interrupts 91.50 ± 26% +57.1% 143.75 ± 24% interrupts.CPU58.PMI:Performance_monitoring_interrupts 36.25 ±103% -77.9% 8.00 ± 63% interrupts.CPU6.RES:Rescheduling_interrupts 7788 ± 14% -37.3% 4883 ± 25% interrupts.CPU7.NMI:Non-maskable_interrupts 7788 ± 14% -37.3% 4883 ± 25% interrupts.CPU7.PMI:Performance_monitoring_interrupts 251.75 ± 30% -33.1% 168.50 ± 20% interrupts.CPU97.TLB:TLB_shootdowns
Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
Thanks, Oliver Sang
Enqueuing a timer on the local CPU after the tick has been stopped results in that timer being ignored until the next, unrelated interrupt happens to wake the CPU.
Perform sanity checks to report these situations.
Signed-off-by: Frederic Weisbecker frederic@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Ingo Molnar mingo@kernel.org Cc: Paul E. McKenney paulmck@kernel.org Cc: Rafael J. Wysocki rafael.j.wysocki@intel.com --- kernel/sched/core.c | 24 +++++++++++++++++++++++- 1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6056f0374674..6c8b04272a9a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -715,6 +715,26 @@ int get_nohz_timer_target(void)
 	return cpu;
 }
 
+static void wake_idle_assert_possible(void)
+{
+#ifdef CONFIG_SCHED_DEBUG
+	/* Timers are re-evaluated after idle IRQs */
+	if (in_hardirq())
+		return;
+	/*
+	 * Same as hardirqs, assuming they are executing
+	 * on IRQ tail. Ksoftirqd shouldn't reach here
+	 * as the timer base wouldn't be idle. And inline
+	 * softirq processing after a call to local_bh_enable()
+	 * within the idle loop sounds too fun to be considered here.
+	 */
+	if (in_serving_softirq())
+		return;
+
+	WARN_ONCE(1, "Late timer enqueue may be ignored\n");
+#endif
+}
+
 /*
  * When add_timer_on() enqueues a timer into the timer wheel of an
  * idle CPU then this timer might expire before the next timer event
@@ -729,8 +749,10 @@ static void wake_up_idle_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
-	if (cpu == smp_processor_id())
+	if (cpu == smp_processor_id()) {
+		wake_idle_assert_possible();
 		return;
+	}
 
 	if (set_nr_and_not_polling(rq->idle))
 		smp_send_reschedule(cpu);
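To make the window being checked for concrete, here is a minimal sketch, not taken from the patch: the simplified idle-entry body and the idle_entry_sketch() name are assumptions for illustration only, while tick_nohz_idle_stop_tick(), need_resched() and cpuidle_idle_call() are the real kernel helpers involved.

/*
 * Simplified illustration (hypothetical helper, not kernel code as-is):
 * once the tick has been stopped, the next timer hardware event is
 * already programmed from the current set of timers. A timer enqueued
 * on this very CPU afterwards, from plain task context, takes the
 * cpu == smp_processor_id() early return in wake_up_idle_cpu() and
 * nothing reprograms the event: the timer is ignored until some
 * unrelated interrupt wakes the CPU.
 */
static void idle_entry_sketch(void)
{
	tick_nohz_idle_stop_tick();	/* next event programmed from the current timer set */

	/* a timer enqueued on this CPU at this point would be silently missed */

	if (!need_resched())
		cpuidle_idle_call();	/* may sleep well past the new timer's deadline */
}

Hardirq and softirq-tail contexts re-evaluate the next timer event on exit, which is why wake_idle_assert_possible() returns early for in_hardirq() and in_serving_softirq() and only warns for the remaining contexts.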