Hi, Alan,
On Wed, Aug 16, 2023 at 11:57 PM Alan Huang mmpgouride@gmail.com wrote:
>>>> >>>> Currently rcu_cpu_stall_reset() set rcu_state.jiffies_stall to one check >>>> period later, i.e. jiffies + rcu_jiffies_till_stall_check(). But jiffies >>>> is only updated in the timer interrupt, so when kgdb_cpu_enter() begins >>>> to run there may already be nearly one rcu check period after jiffies. >>>> Since all interrupts are disabled during kgdb_cpu_enter(), jiffies will >>>> not be updated. When kgdb_cpu_enter() returns, rcu_state.jiffies_stall >>>> maybe already gets timeout. >>>> >>>> We can set rcu_state.jiffies_stall to two rcu check periods later, e.g. >>>> jiffies + (rcu_jiffies_till_stall_check() * 2) in rcu_cpu_stall_reset() >>>> to avoid this problem. But this isn't a complete solution because kgdb >>>> may take a very long time in irq disabled context. >>>> >>>> Instead, update jiffies at the beginning of rcu_cpu_stall_reset() can >>>> solve all kinds of problems. >>> >>> Would it make sense for there to be a kgdb_cpu_exit()? In that case, >>> the stalls could simply be suppressed at the beginning of the debug >>> session and re-enabled upon exit, as is currently done for sysrq output >>> via rcu_sysrq_start() and rcu_sysrq_end(). >> Thank you for your advice, but that doesn't help. Because >> rcu_sysrq_start() and rcu_sysrq_end() try to suppress the warnings >> during sysrq, but kgdb already has no warnings during kgdb_cpu_enter() >> since it is executed in irq disabled context. Instead, this patch >> wants to suppress the warnings *after* kgdb_cpu_enter() due to a very >> old jiffies value. >> > > Hello, Huacai > > Is it possible to set the rcu_cpu_stall_suppress is true in > dbg_touch_watchdogs() > and reset the rcu_cpu_stall_suppress at the beginning and end of the > RCU grace period? This is possible but not the best: 1, kgdb is not the only caller of rcu_cpu_stall_reset(); 2, it is difficult to find the "end" to reset rcu_cpu_stall_suppress.
You can replace rcu_state.jiffies_stall update by setting rcu_cpu_stall_suppress in rcu_cpu_stall_reset(), and reset rcu_cpu_stall_suppress in rcu_gp_init() and rcu_gp_cleanup().
What's the advantage compared with updating jiffies? Updating jiffies seems more straight forward.
In do_update_jiffies_64(), need to acquire jiffies_lock raw spinlock, like you said, kgdb is not the only caller of rcu_cpu_stall_reset(), the rcu_cpu_stall_reset() maybe invoke in NMI (arch/x86/platform/uv/uv_nmi.c)
Reset rcu_cpu_stall_suppress in rcu_gp_init()/rcu_gp_cleanup() is still not so good to me, because it does a useless operation in most cases. Moreover, the rcu core is refactored again and again, something may be changed in future.
If do_update_jiffies_64() cannot be used in NMI context, can we
What about updating jiffies in dbg_touch_watchdogs or adding a wrapper which updates both jiffies and jiffies_stall?
This can solve the kgdb problem, but I found that most callers of rcu_cpu_stall_reset() are in irq disabled context so they may meet similar problems. Modifying rcu_cpu_stall_reset() can solve all of them.
But due to the NMI issue, from my point of view, setting jiffies_stall to jiffies + 300*HZ is the best solution now. :)
Huacai
consider my old method [1]? https://lore.kernel.org/rcu/CAAhV-H7j9Y=VvRLm8thLw-EX1PGqBA9YfT4G1AN7ucYS=iP...
Of course we should set rcu_state.jiffies_stall large enough, so we can do like this:
void rcu_cpu_stall_reset(void) { WRITE_ONCE(rcu_state.jiffies_stall,
- jiffies + rcu_jiffies_till_stall_check());
- jiffies + 300 * HZ);
}
300s is the largest timeout value, and I think 300s is enough here in practice.
Huacai
Thanks Zqiang
Huacai
Thanks Zqiang
> or set rcupdate.rcu_cpu_stall_suppress_at_boot=1 in bootargs can > suppress RCU stall > in booting. This is also possible, but it suppresses all kinds of stall warnings, which is not what we want.
Huacai > > > Thanks > Zqiang > > >> >> Huacai >> >>> >>> Thanx, Paul >>> >>>> Cc: stable@vger.kernel.org >>>> Fixes: a80be428fbc1f1f3bc9ed924 ("rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()") >>>> Reported-off-by: Binbin Zhou zhoubinbin@loongson.cn >>>> Signed-off-by: Huacai Chen chenhuacai@loongson.cn >>>> --- >>>> kernel/rcu/tree_stall.h | 1 + >>>> 1 file changed, 1 insertion(+) >>>> >>>> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h >>>> index b10b8349bb2a..1c7b540985bf 100644 >>>> --- a/kernel/rcu/tree_stall.h >>>> +++ b/kernel/rcu/tree_stall.h >>>> @@ -153,6 +153,7 @@ static void panic_on_rcu_stall(void) >>>> */ >>>> void rcu_cpu_stall_reset(void) >>>> { >>>> + do_update_jiffies_64(ktime_get()); >>>> WRITE_ONCE(rcu_state.jiffies_stall, >>>> jiffies + rcu_jiffies_till_stall_check()); >>>> } >>>> -- >>>> 2.39.3