On Mon, Sep 11, 2023 at 07:54:52PM -0400, Liam R. Howlett wrote:
- Paul E. McKenney paulmck@kernel.org [230906 14:03]:
On Wed, Sep 06, 2023 at 01:29:54PM -0400, Liam R. Howlett wrote:
- Paul E. McKenney paulmck@kernel.org [230906 13:24]:
On Wed, Sep 06, 2023 at 11:23:25AM -0400, Liam R. Howlett wrote:
(Adding Paul & Shanker to Cc list.. please see below for why)
Apologies on the late response, I was away and have been struggling to get a working PPC32 test environment.
- Geert Uytterhoeven geert@linux-m68k.org [230829 12:42]:
Hi Liam,
On Fri, 18 Aug 2023, Liam R. Howlett wrote: > The current implementation of append may cause duplicate data and/or > incorrect ranges to be returned to a reader during an update. Although > this has not been reported or seen, disable the append write operation > while the tree is in rcu mode out of an abundance of caution.
...
>
...
RCU-related configs:
$ grep RCU .config # RCU Subsystem CONFIG_TINY_RCU=y
I must have been asleep last time I looked at this. I was looking at Tree RCU. Please accept my apologies for my lapse. :-/
However, Tiny RCU's call_rcu() also avoids enabling IRQs, so I would have said the same thing, albeit after looking at a lot less RCU code.
TL;DR:
1. Try making the __setup_irq() function's call to mutex_lock() instead be as follows:
if (!mutex_trylock(&desc->request_mutex)) mutex_lock(&desc->request_mutex);
This might fail if __setup_irq() has other dependencies on a fully operational scheduler.
2. Move that ppc32 call to __setup_irq() much later, most definitely after interrupts have been enabled and the scheduler is fully operational. Invoking mutex_lock() before that time is not a good idea. ;-)
For more detail, please read on!
# CONFIG_RCU_EXPERT is not set CONFIG_TINY_SRCU=y # end of RCU Subsystem # RCU Debugging # CONFIG_RCU_SCALE_TEST is not set # CONFIG_RCU_TORTURE_TEST is not set # CONFIG_RCU_REF_SCALE_TEST is not set # CONFIG_RCU_TRACE is not set # CONFIG_RCU_EQS_DEBUG is not set # end of RCU Debugging
I used the configuration from debian 8 and ran 'make oldconfig' to build my kernel. I have attached the configuration.
...
It appears to be something to do with struct maple_tree sparse_irqs. If you drop the rcu flag from that maple tree, then my configuration boots without the warning.
I *think* this is because we will reuse a lot more nodes. And I *think* the rcu flag is not needed, since there is a comment about reading the tree being protected by the mutex sparse_irq_lock within the kernel/irq/irqdesc.c file. Shanker, can you comment on that?
I wonder if there is a limit to the number of RCU free events before something is triggered to flush them out which could trigger IRQ enabling? Paul, could this be the case?
Are you asking if call_rcu() will re-enable interrupts in the following use case?
local_irq_disable(); call_rcu(&p->rh, my_cb_func); local_irq_enable();
I am not.
...
Or am I missing your point?
This is very early in the boot sequence when interrupts have not been enabled. What we are seeing is a WARN_ON() that is triggered by interrupts being enabled before they should be enabled.
I was wondering if, for example, I called call_rcu() a lot *before* interrupts were enabled, that something could trigger that would either enable interrupts or indicate the task needs rescheduling?
You aren't doing call_rcu() enough to hit OOM, are you? The actual RCU callback invocations won't happen until some time after the scheduler starts up.
I am not, it's just a detection of IRQs being enabled early.
Specifically the rescheduling part is suspect. I tracked down the call to a mutex_lock() which calls cond_resched(), so could rcu be 'encouraging' the rcu window by a reschedule request?
During boot before interrupts are enabled, RCU has not yet spawned any of its kthreads. Therefore, all of its attempts to do wakeups would notice a NULL task_struct pointer and refrain from actually doing the wakeup. If it did do the wakeup, you would see a NULL-pointer exception. See for example, invoke_rcu_core_kthread(), though that won't happen unless you booted with rcutree.use_softirq=0.
Besides, since when did doing a wakeup enable interrupts? That would make it hard to do wakeups from hardware interrupt handlers, not?
Taking the mutex lock in kernel/irq/manage.c __setup_irq() is calling a cond_resched().
From what Michael said [1] in this thread, since something has already
set TIF_NEED_RESCHED, it will eventually enable interrupts on us.
I've traced this to running call_rcu() in kernel/rcu/tiny.c and is_idle_task(current) is true, which means rcu runs: /* force scheduling for rcu_qs() */ resched_cpu(0);
the task is set idle in sched_init() -> init_idle() and never changed, afaict.
Yes, because RCU eventually needs a context switch in order to make a grace period happen. And Maple Tree isn't the only thing invoking call_rcu() early, so this has been in place for a very long time.
Removing the RCU option from the maple tree in kernel/irq/irqdesc.c fixes the issue by avoiding the maple tree running call_rcu(). I am not sure on the locking of the tree so I feel this change may cause other issues...also it's before lockdep_init(), so any issue I introduce may not be detected.
When CONFIG_DEBUG_ATOMIC_SLEEP is configured, it seems that rcu does the same thing, but the IRQs are not enabled on return. So, resched_cpu(0) is called, but the IRQs warning of enabled isn't triggered. I failed to find a reason why.
Here you mean IRQs being enabled upon return from __setup_irq(), correct?
But yes, __setup_irq() does call mutex_lock(). Which will call preempt_schedule_common() via might_sleep() and __cond_resched(), even though that is clearly a very bad thing this early. And let's face it, the whole purpose of mutex_lock() is to block when needed. And a big purpose of that might_sleep() is to yell at people invoking this with interrupts disabled.
And most of the wrappers around __setup_irq() look to be intended for much later, after interrupts have been enabled. One exception is setup_percpu_irq(), which says that it is for "the early boot process", whenever that might be. But this is only invoked from mips in v6.5.
The __request_percpu_irq() function is wrappered by request_percpu_irq(), and its header comment suggests that it is to be called after there are multiple CPUs. I am not seeing a call that is obviously from ppc32, but there are a number of drivers using this with which I am unfamiliar.
The request_percpu_nmi() has to be followed up on each CPU using prepare_percpu_nmi() and enable_percpu_nmi(), so it is not clear that it is useful to invoke this before interrupts are enabled. But this is used by ARM, not ppc32 from what I can see.
So even though I am not seeing how ppc32 invokes __setup_irq() early, your testing clearly indicates that it is doing so.
I am not entirely sure what makes ppc32 different than other platforms in that the initial task is configured to an idle task and the first call to call_rcu (tiny!) would cause the observed behaviour.
Maybe something like this in __setup_irq(), right before the mutex_lock()?
WARN_ON_ONCE(irqs_disabled());
This will dump the stack trace showing how __setup_irq() is being invoked in early boot on ppc32.
Again, given that __setup_irq() invokes mutex_lock(), invoking this function in its current form before interrupts have been enabled is a bad idea.
Non-tiny rcu calls (as I am sure you know, but others may not) kernel/rcu/tree.c which in turn calls __call_rcu_common(). That function is far more complex than the tiny version. Maybe it's part of why we see different behaviour based on platforms? I don't see an idle check in that version of call_rcu().
Tree RCU can rely on the grace-period kthread starting up later in boot. This kthread will handle any need for a grace period from early boot invocation of call_rcu(). Tiny RCU does not have such a kthread, hence the resched_cpu(). Which is fine, as long as nothing (incorrectly!) invokes a blocking operation before the scheduler has been initialized.
Or maybe PPC32 has something set incorrectly to cause this failure in early boot and I've just found something that needs to be set differently?
This might be a failure of imagination on my part, but I certainly don't see why ppc32 should expect to get away with invoking __setup_irq() anywhere near that early! ;-)
But if ppc32 cannot be adjusted so as to invoke __setup_irq() later, the following change to __setup_irq() might work:
if (!mutex_trylock(&desc->request_mutex)) mutex_lock(&desc->request_mutex);
The reason for "might" rather than "will" is that __setup_irq() might well have other dependencies on the scheduler being up and running.
It would be cleaner for that ppc32 call to __setup_irq() to be moved later, after the scheduler is up and running.
But why not put some WARN_ON_ONCE(!irqs_disabled()) calls in the areas of greatest suspicion, starting from the stack trace generated by that mutex_lock()? A stray interrupt-enable could be pretty much anywhere.
But where are those call_rcu() invocations? Before rcu_init()?
During init_IRQ(), which is after rcu_init() but before rcu_init_nohz(), srcu_init(), and softirq_init() in init/main.c start_kernel().
That is a perfectly legal time and place for a call_rcu() invocation. It is also legal before rcu_init(), but it has to be late enough that things like per-CPU variables are accessible. Which is still very early.
Presumably before init is spawned and the early_init() calls.
And what is the RCU-related Kconfig and boot-parameter setup?
The .config was attached to the email I sent, and it matches what was quoted above in the "RCU-related configs" section.
[1] https://lore.kernel.org/linux-mm/87v8cv22jh.fsf@mail.lhotse/
Again, apologies for having misread that .config snippet!!!
Thanx, Paul