Hi, Alan,
On Thu, Aug 17, 2023 at 12:52 AM Alan Huang mmpgouride@gmail.com wrote:
>>>>>> >>>>>> Currently rcu_cpu_stall_reset() set rcu_state.jiffies_stall to one check >>>>>> period later, i.e. jiffies + rcu_jiffies_till_stall_check(). But jiffies >>>>>> is only updated in the timer interrupt, so when kgdb_cpu_enter() begins >>>>>> to run there may already be nearly one rcu check period after jiffies. >>>>>> Since all interrupts are disabled during kgdb_cpu_enter(), jiffies will >>>>>> not be updated. When kgdb_cpu_enter() returns, rcu_state.jiffies_stall >>>>>> maybe already gets timeout. >>>>>> >>>>>> We can set rcu_state.jiffies_stall to two rcu check periods later, e.g. >>>>>> jiffies + (rcu_jiffies_till_stall_check() * 2) in rcu_cpu_stall_reset() >>>>>> to avoid this problem. But this isn't a complete solution because kgdb >>>>>> may take a very long time in irq disabled context. >>>>>> >>>>>> Instead, update jiffies at the beginning of rcu_cpu_stall_reset() can >>>>>> solve all kinds of problems. >>>>> >>>>> Would it make sense for there to be a kgdb_cpu_exit()? In that case, >>>>> the stalls could simply be suppressed at the beginning of the debug >>>>> session and re-enabled upon exit, as is currently done for sysrq output >>>>> via rcu_sysrq_start() and rcu_sysrq_end(). >>>> Thank you for your advice, but that doesn't help. Because >>>> rcu_sysrq_start() and rcu_sysrq_end() try to suppress the warnings >>>> during sysrq, but kgdb already has no warnings during kgdb_cpu_enter() >>>> since it is executed in irq disabled context. Instead, this patch >>>> wants to suppress the warnings *after* kgdb_cpu_enter() due to a very >>>> old jiffies value. >>>> >>> >>> Hello, Huacai >>> >>> Is it possible to set the rcu_cpu_stall_suppress is true in >>> dbg_touch_watchdogs() >>> and reset the rcu_cpu_stall_suppress at the beginning and end of the >>> RCU grace period? >> This is possible but not the best: 1, kgdb is not the only caller of >> rcu_cpu_stall_reset(); 2, it is difficult to find the "end" to reset >> rcu_cpu_stall_suppress. >> > > You can replace rcu_state.jiffies_stall update by setting rcu_cpu_stall_suppress > in rcu_cpu_stall_reset(), and reset rcu_cpu_stall_suppress in rcu_gp_init() and > rcu_gp_cleanup(). What's the advantage compared with updating jiffies? Updating jiffies seems more straight forward.
In do_update_jiffies_64(), need to acquire jiffies_lock raw spinlock, like you said, kgdb is not the only caller of rcu_cpu_stall_reset(), the rcu_cpu_stall_reset() maybe invoke in NMI (arch/x86/platform/uv/uv_nmi.c)
Reset rcu_cpu_stall_suppress in rcu_gp_init()/rcu_gp_cleanup() is still not so good to me, because it does a useless operation in most cases. Moreover, the rcu core is refactored again and again, something may be changed in future.
If do_update_jiffies_64() cannot be used in NMI context, can we
What about updating jiffies in dbg_touch_watchdogs or adding a wrapper which updates both jiffies and jiffies_stall?
This can solve the kgdb problem, but I found that most callers of rcu_cpu_stall_reset() are in irq disabled context so they may meet
The duration of other contexts where interrupts are disabled may not be as long as in the case of kgdb?
similar problems. Modifying rcu_cpu_stall_reset() can solve all of them.
But due to the NMI issue, from my point of view, setting jiffies_stall to jiffies + 300*HZ is the best solution now. :)
If I understand correctly, the NMI issue is the deadlock issue? If so, plus the short duration of other irq disabled contexts, it’s ok just update jiffies in dbg_touch_watchdogs.
Please correct me if anything wrong. :)
The timeout value can be configured as short as 3 seconds, in this case other callers may also have problems.
Huacai
Huacai
consider my old method [1]? https://lore.kernel.org/rcu/CAAhV-H7j9Y=VvRLm8thLw-EX1PGqBA9YfT4G1AN7ucYS=iP...
Of course we should set rcu_state.jiffies_stall large enough, so we can do like this:
void rcu_cpu_stall_reset(void) { WRITE_ONCE(rcu_state.jiffies_stall,
- jiffies + rcu_jiffies_till_stall_check());
- jiffies + 300 * HZ);
}
300s is the largest timeout value, and I think 300s is enough here in practice.
Huacai
Thanks Zqiang
Huacai
> > Thanks > Zqiang > >> >>> or set rcupdate.rcu_cpu_stall_suppress_at_boot=1 in bootargs can >>> suppress RCU stall >>> in booting. >> This is also possible, but it suppresses all kinds of stall warnings, >> which is not what we want. >> >> Huacai >>> >>> >>> Thanks >>> Zqiang >>> >>> >>>> >>>> Huacai >>>> >>>>> >>>>> Thanx, Paul >>>>> >>>>>> Cc: stable@vger.kernel.org >>>>>> Fixes: a80be428fbc1f1f3bc9ed924 ("rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()") >>>>>> Reported-off-by: Binbin Zhou zhoubinbin@loongson.cn >>>>>> Signed-off-by: Huacai Chen chenhuacai@loongson.cn >>>>>> --- >>>>>> kernel/rcu/tree_stall.h | 1 + >>>>>> 1 file changed, 1 insertion(+) >>>>>> >>>>>> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h >>>>>> index b10b8349bb2a..1c7b540985bf 100644 >>>>>> --- a/kernel/rcu/tree_stall.h >>>>>> +++ b/kernel/rcu/tree_stall.h >>>>>> @@ -153,6 +153,7 @@ static void panic_on_rcu_stall(void) >>>>>> */ >>>>>> void rcu_cpu_stall_reset(void) >>>>>> { >>>>>> + do_update_jiffies_64(ktime_get()); >>>>>> WRITE_ONCE(rcu_state.jiffies_stall, >>>>>> jiffies + rcu_jiffies_till_stall_check()); >>>>>> } >>>>>> -- >>>>>> 2.39.3