* Paul E. McKenney <paulmck@kernel.org> [230906 14:03]:
On Wed, Sep 06, 2023 at 01:29:54PM -0400, Liam R. Howlett wrote:
* Paul E. McKenney <paulmck@kernel.org> [230906 13:24]:
On Wed, Sep 06, 2023 at 11:23:25AM -0400, Liam R. Howlett wrote:
(Adding Paul & Shanker to the Cc list; please see below for why.)
Apologies for the late response. I was away and have been struggling to get a working PPC32 test environment.
* Geert Uytterhoeven <geert@linux-m68k.org> [230829 12:42]:
Hi Liam,
On Fri, 18 Aug 2023, Liam R. Howlett wrote:
The current implementation of append may cause duplicate data and/or incorrect ranges to be returned to a reader during an update. Although this has not been reported or seen, disable the append write operation while the tree is in rcu mode out of an abundance of caution.
...
...
RCU-related configs:
$ grep RCU .config
# RCU Subsystem
CONFIG_TINY_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_TINY_SRCU=y
# end of RCU Subsystem
# RCU Debugging
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
# end of RCU Debugging
I used the configuration from Debian 8 and ran 'make oldconfig' to build my kernel. I have attached the configuration.
...
It appears to be something to do with struct maple_tree sparse_irqs. If you drop the rcu flag from that maple tree, then my configuration boots without the warning.
I *think* this is because we will reuse a lot more nodes. And I *think* the rcu flag is not needed, since there is a comment about reading the tree being protected by the mutex sparse_irq_lock within the kernel/irq/irqdesc.c file. Shanker, can you comment on that?
I wonder if there is a limit to the number of RCU free events before something is triggered to flush them out which could trigger IRQ enabling? Paul, could this be the case?
Are you asking if call_rcu() will re-enable interrupts in the following use case?
	local_irq_disable();
	call_rcu(&p->rh, my_cb_func);
	local_irq_enable();
I am not.
...
Or am I missing your point?
This is very early in the boot sequence when interrupts have not been enabled. What we are seeing is a WARN_ON() that is triggered by interrupts being enabled before they should be enabled.
I was wondering whether, for example, calling call_rcu() a lot *before* interrupts were enabled could trigger something that would either enable interrupts or indicate that the task needs rescheduling?
You aren't doing call_rcu() enough to hit OOM, are you? The actual RCU callback invocations won't happen until some time after the scheduler starts up.
I am not, it's just a detection of IRQs being enabled early.
Specifically the rescheduling part is suspect. I tracked down the call to a mutex_lock() which calls cond_resched(), so could rcu be 'encouraging' the rcu window by a reschedule request?
During boot before interrupts are enabled, RCU has not yet spawned any of its kthreads. Therefore, all of its attempts to do wakeups would notice a NULL task_struct pointer and refrain from actually doing the wakeup. If it did do the wakeup, you would see a NULL-pointer exception. See for example, invoke_rcu_core_kthread(), though that won't happen unless you booted with rcutree.use_softirq=0.
Besides, since when did doing a wakeup enable interrupts? That would make it hard to do wakeups from hardware interrupt handlers, no?
Taking the mutex lock in kernel/irq/manage.c __setup_irq() is calling a cond_resched().
From what Michael said [1] in this thread, since something has already set TIF_NEED_RESCHED, it will eventually enable interrupts on us.
I've traced this to call_rcu() in kernel/rcu/tiny.c: is_idle_task(current) is true there, which means RCU runs:

	/* force scheduling for rcu_qs() */
	resched_cpu(0);

The task is set idle in sched_init() -> init_idle() and never changed, afaict.
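To make the suspected chain easier to follow, here is a minimal userspace sketch of it. This is not kernel code: struct model_cpu and every model_* function are hypothetical stand-ins, and the assumption that the scheduler path returns with interrupts enabled comes from Michael's comment in [1] rather than from anything verified here.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; none of these are the kernel's real definitions. */
struct model_cpu {
	bool irqs_enabled;     /* stands in for the CPU interrupt-enable state */
	bool need_resched;     /* stands in for TIF_NEED_RESCHED on current */
	bool current_is_idle;  /* stands in for is_idle_task(current) */
};

/* Models the tail of Tiny RCU's call_rcu(): an idle caller forces a resched. */
static void model_call_rcu(struct model_cpu *cpu)
{
	if (cpu->current_is_idle)
		cpu->need_resched = true;  /* resched_cpu(0) in the real code */
}

/* Models schedule(); ASSUMED to return with interrupts enabled. */
static void model_schedule(struct model_cpu *cpu)
{
	cpu->need_resched = false;
	cpu->irqs_enabled = true;
}

/* Models cond_resched() reached via mutex_lock(): acts only if the flag is set. */
static void model_cond_resched(struct model_cpu *cpu)
{
	if (cpu->need_resched)
		model_schedule(cpu);
}

/* Returns true if interrupts end up enabled after the modelled early-boot path. */
static bool model_early_boot(bool tree_uses_rcu, bool boot_task_is_idle)
{
	struct model_cpu cpu = {
		.irqs_enabled = false,
		.need_resched = false,
		.current_is_idle = boot_task_is_idle,
	};

	assert(!cpu.irqs_enabled);     /* early boot starts with IRQs off */
	if (tree_uses_rcu)
		model_call_rcu(&cpu);  /* maple tree frees a node via call_rcu() */
	model_cond_resched(&cpu);      /* mutex_lock() in __setup_irq() */
	return cpu.irqs_enabled;
}
```

In this model, interrupts are only enabled early when the tree uses RCU and the boot task counts as idle. Dropping either condition keeps them off, which would match what I see when removing the RCU flag.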
Removing the RCU option from the maple tree in kernel/irq/irqdesc.c fixes the issue by avoiding the maple tree running call_rcu(). I am not sure about the locking of the tree, so I feel this change may cause other issues. Also, it's before lockdep_init(), so any issue I introduce may not be detected.
When CONFIG_DEBUG_ATOMIC_SLEEP is configured, it seems that RCU does the same thing, but the IRQs are not enabled on return. So resched_cpu(0) is called, but the warning about IRQs being enabled isn't triggered. I failed to find a reason why.
I am not entirely sure what makes ppc32 different from other platforms, in that the initial task is configured as an idle task and the first call to call_rcu() (tiny!) would cause the observed behaviour.
Non-tiny RCU uses (as I am sure you know, but others may not) the call_rcu() in kernel/rcu/tree.c, which in turn calls __call_rcu_common(). That function is far more complex than the tiny version. Maybe it's part of why we see different behaviour based on platforms? I don't see an idle check in that version of call_rcu().
Or maybe PPC32 has something set incorrectly to cause this failure in early boot and I've just found something that needs to be set differently?
But why not put some WARN_ON_ONCE(!irqs_disabled()) calls in the areas of greatest suspicion, starting from the stack trace generated by that mutex_lock()? A stray interrupt-enable could be pretty much anywhere.
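As a rough illustration of that probing approach, here is a userspace sketch. It is only a model: the WARN_ON_ONCE() and irqs_disabled() below are simplified stand-ins for the kernel's, and the three step_* functions are hypothetical placeholders for points along the real stack trace, with the stray enable planted in the middle one.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

static bool model_irqs_on;            /* hypothetical stand-in for CPU IRQ state */
static int model_warn_count;          /* how many probes have fired */
static const char *model_first_warn;  /* function holding the first probe that fired */

static bool irqs_disabled(void)
{
	return !model_irqs_on;
}

/* Simplified WARN_ON_ONCE(): each call site fires at most once per process. */
#define WARN_ON_ONCE(cond)                                  \
	do {                                                \
		static bool warned_;                        \
		if ((cond) && !warned_) {                   \
			warned_ = true;                     \
			model_warn_count++;                 \
			if (!model_first_warn)              \
				model_first_warn = __func__; \
		}                                           \
	} while (0)

static void step_alloc_desc(void)
{
	WARN_ON_ONCE(!irqs_disabled());
}

static void step_take_mutex(void)
{
	WARN_ON_ONCE(!irqs_disabled());
	model_irqs_on = true;  /* the planted stray interrupt-enable */
}

static void step_after_mutex(void)
{
	WARN_ON_ONCE(!irqs_disabled());
}

/* Runs the probed path once, starting with interrupts off. */
static void run_probed_path(void)
{
	model_irqs_on = false;
	model_warn_count = 0;
	model_first_warn = NULL;
	assert(irqs_disabled());

	step_alloc_desc();
	step_take_mutex();
	step_after_mutex();
}

/* True if the first probe that fired lives in the named function. */
static bool first_warn_is(const char *name)
{
	return model_first_warn && strcmp(model_first_warn, name) == 0;
}
```

The first probe to fire sits after the planted enable, so the culprit lies between the last quiet probe and the first noisy one. Adding probes inside that window narrows it further.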
But where are those call_rcu() invocations? Before rcu_init()?
During init_IRQ(), which is after rcu_init() but before rcu_init_nohz(), srcu_init(), and softirq_init() in init/main.c start_kernel().
Presumably before init is spawned and the early_init() calls.
And what is the RCU-related Kconfig and boot-parameter setup?
The .config was attached to the email I sent, and it matches what was quoted above in the "RCU-related configs" section.
[1] https://lore.kernel.org/linux-mm/87v8cv22jh.fsf@mail.lhotse/