On Sun, Aug 27, 2023 at 1:51 AM Huacai Chen <chenhuacai@kernel.org> wrote: [..]
The only way I know of to avoid these sorts of false positives is for the user to manually suppress all stall timeouts (perhaps using a kernel-boot parameter for your early-boot case), do the gdb work, and then unsuppress them. Even that won't work for networking, because the other system's clock will be running throughout.
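For example, something along these lines with the existing rcupdate stall-suppress module parameter (adjust as needed for your setup):

	# At boot, suppress RCU CPU stall warnings entirely:
	rcupdate.rcu_cpu_stall_suppress=1

	# Or toggle around the gdb session at runtime:
	echo 1 > /sys/module/rcupdate/parameters/rcu_cpu_stall_suppress
	# ... do the gdb work ...
	echo 0 > /sys/module/rcupdate/parameters/rcu_cpu_stall_suppress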
In other words, from what I know now, there is no perfect solution. Therefore, there are sharp limits to the complexity of any solution that I will be willing to accept.
I think the simplest solution is (I hope Joel will not be angry):
Not angry at all, I just want to help. ;-) The problem is that the 300*HZ solution will also affect VM workloads, which do a similar reset. Allow me a few days to see if I can take a shot at fixing it slightly differently. I am trying Paul's idea of setting jiffies at a later time; I think it is doable. The advantage of doing this is that it will make stall detection more robust in the face of these gaps in the jiffies update, and that solution does not even need us to rely on ktime (and all the issues that come with that).
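Roughly, what I have in mind is something like this (just a sketch of the approach, not the final patch; nr_fqs_jiffies_stall is the new countdown field):

	/* Stall-reset path (e.g. after a gdb session or VM restore):
	 * do not compute a deadline from the possibly-stale jiffies.
	 * Instead, push the deadline out and arm a short countdown. */
	void rcu_cpu_stall_reset(void)
	{
		WRITE_ONCE(rcu_state.nr_fqs_jiffies_stall, 3);
		WRITE_ONCE(rcu_state.jiffies_stall, ULONG_MAX);
	}

	/* In rcu_gp_fqs(), which runs only in the GP kthread: each FQS
	 * scan proves that jiffies is ticking again, so after a few
	 * scans it is safe to compute the real stall deadline. */
	nr_fqs = READ_ONCE(rcu_state.nr_fqs_jiffies_stall);
	if (nr_fqs) {
		if (nr_fqs == 1)
			WRITE_ONCE(rcu_state.jiffies_stall,
				   jiffies + rcu_jiffies_till_stall_check());
		WRITE_ONCE(rcu_state.nr_fqs_jiffies_stall, --nr_fqs);
	}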
I wrote a patch along the lines of Paul's idea and sent it out for review; the advantage is that it is based purely on jiffies. Could you try it out and let me know?
If you can cc my gmail address (chenhuacai@gmail.com), that would be better.
Sure, will do.
I have read your patch; maybe the counter (nr_fqs_jiffies_stall) should be an atomic_t, and we should use an atomic operation to decrement its value, because rcu_gp_fqs() can run concurrently and we may miss the (nr_fqs == 1) condition.
I don't think so. There is only one place where the RMW operation happens, and rcu_gp_fqs() is called only from the GP kthread, so a concurrent RMW (and hence a lost update) is not possible.
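To make that concrete, the lost update an atomic would guard against requires two interleaved read-modify-write sequences on the counter (illustrative interleaving):

	/*
	 *   GP kthread           hypothetical 2nd writer
	 *   reads nr_fqs == 2    reads nr_fqs == 2
	 *   writes nr_fqs = 1    writes nr_fqs = 1  <- one decrement lost,
	 *                                              (nr_fqs == 1) step skipped
	 *
	 * But the decrement is done only in rcu_gp_fqs(), which runs only
	 * in the single GP kthread, so the second column cannot exist.
	 * READ_ONCE()/WRITE_ONCE() are still used so that concurrent
	 * readers on other CPUs see consistent values.
	 */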
Could you test the patch for the issue you are seeing and provide your Tested-by tag? Thanks,
- Joel