Le 12/09/2023 à 10:14, Paul E. McKenney a écrit :
On Mon, Sep 11, 2023 at 07:54:52PM -0400, Liam R. Howlett wrote:
- Paul E. McKenney paulmck@kernel.org [230906 14:03]:
On Wed, Sep 06, 2023 at 01:29:54PM -0400, Liam R. Howlett wrote:
- Paul E. McKenney paulmck@kernel.org [230906 13:24]:
On Wed, Sep 06, 2023 at 11:23:25AM -0400, Liam R. Howlett wrote:
(Adding Paul & Shanker to Cc list.. please see below for why)
Apologies on the late response, I was away and have been struggling to get a working PPC32 test environment.
- Geert Uytterhoeven geert@linux-m68k.org [230829 12:42]:
> Hi Liam, > > On Fri, 18 Aug 2023, Liam R. Howlett wrote: >> The current implementation of append may cause duplicate data and/or >> incorrect ranges to be returned to a reader during an update. Although >> this has not been reported or seen, disable the append write operation >> while the tree is in rcu mode out of an abundance of caution.
... >>
...
> RCU-related configs: > > $ grep RCU .config > # RCU Subsystem > CONFIG_TINY_RCU=y
I must have been asleep last time I looked at this. I was looking at Tree RCU. Please accept my apologies for my lapse. :-/
However, Tiny RCU's call_rcu() also avoids enabling IRQs, so I would have said the same thing, albeit after looking at a lot less RCU code.
TL;DR:
Try making the __setup_irq() function's call to mutex_lock() instead be as follows:
if (!mutex_trylock(&desc->request_mutex)) mutex_lock(&desc->request_mutex);
This might fail if __setup_irq() has other dependencies on a fully operational scheduler.
Move that ppc32 call to __setup_irq() much later, most definitely after interrupts have been enabled and the scheduler is fully operational. Invoking mutex_lock() before that time is not a good idea. ;-)
For more detail, please read on!
> # CONFIG_RCU_EXPERT is not set > CONFIG_TINY_SRCU=y > # end of RCU Subsystem > # RCU Debugging > # CONFIG_RCU_SCALE_TEST is not set > # CONFIG_RCU_TORTURE_TEST is not set > # CONFIG_RCU_REF_SCALE_TEST is not set > # CONFIG_RCU_TRACE is not set > # CONFIG_RCU_EQS_DEBUG is not set > # end of RCU Debugging
I used the configuration from debian 8 and ran 'make oldconfig' to build my kernel. I have attached the configuration.
...
It appears to be something to do with struct maple_tree sparse_irqs. If you drop the rcu flag from that maple tree, then my configuration boots without the warning.
I *think* this is because we will reuse a lot more nodes. And I *think* the rcu flag is not needed, since there is a comment about reading the tree being protected by the mutex sparse_irq_lock within the kernel/irq/irqdesc.c file. Shanker, can you comment on that?
I wonder if there is a limit to the number of RCU free events before something is triggered to flush them out which could trigger IRQ enabling? Paul, could this be the case?
Are you asking if call_rcu() will re-enable interrupts in the following use case?
local_irq_disable(); call_rcu(&p->rh, my_cb_func); local_irq_enable();
I am not.
...
Or am I missing your point?
This is very early in the boot sequence when interrupts have not been enabled. What we are seeing is a WARN_ON() that is triggered by interrupts being enabled before they should be enabled.
I was wondering if, for example, I called call_rcu() a lot *before* interrupts were enabled, that something could trigger that would either enable interrupts or indicate the task needs rescheduling?
You aren't doing call_rcu() enough to hit OOM, are you? The actual RCU callback invocations won't happen until some time after the scheduler starts up.
I am not, it's just a detection of IRQs being enabled early.
Specifically the rescheduling part is suspect. I tracked down the call to a mutex_lock() which calls cond_resched(), so could rcu be 'encouraging' the rcu window by a reschedule request?
During boot before interrupts are enabled, RCU has not yet spawned any of its kthreads. Therefore, all of its attempts to do wakeups would notice a NULL task_struct pointer and refrain from actually doing the wakeup. If it did do the wakeup, you would see a NULL-pointer exception. See for example, invoke_rcu_core_kthread(), though that won't happen unless you booted with rcutree.use_softirq=0.
Besides, since when did doing a wakeup enable interrupts? That would make it hard to do wakeups from hardware interrupt handlers, not?
Taking the mutex lock in kernel/irq/manage.c __setup_irq() is calling a cond_resched().
From what Michael said [1] in this thread, since something has already
set TIF_NEED_RESCHED, it will eventually enable interrupts on us.
I've traced this to running call_rcu() in kernel/rcu/tiny.c and is_idle_task(current) is true, which means rcu runs: /* force scheduling for rcu_qs() */ resched_cpu(0);
the task is set idle in sched_init() -> init_idle() and never changed, afaict.
Yes, because RCU eventually needs a context switch in order to make a grace period happen. And Maple Tree isn't the only thing invoking call_rcu() early, so this has been in place for a very long time.
Removing the RCU option from the maple tree in kernel/irq/irqdesc.c fixes the issue by avoiding the maple tree running call_rcu(). I am not sure on the locking of the tree so I feel this change may cause other issues...also it's before lockdep_init(), so any issue I introduce may not be detected.
When CONFIG_DEBUG_ATOMIC_SLEEP is configured, it seems that rcu does the same thing, but the IRQs are not enabled on return. So, resched_cpu(0) is called, but the IRQs warning of enabled isn't triggered. I failed to find a reason why.
Here you mean IRQs being enabled upon return from __setup_irq(), correct?
But yes, __setup_irq() does call mutex_lock(). Which will call preempt_schedule_common() via might_sleep() and __cond_resched(), even though that is clearly a very bad thing this early. And let's face it, the whole purpose of mutex_lock() is to block when needed. And a big purpose of that might_sleep() is to yell at people invoking this with interrupts disabled.
And most of the wrappers around __setup_irq() look to be intended for much later, after interrupts have been enabled. One exception is setup_percpu_irq(), which says that it is for "the early boot process", whenever that might be. But this is only invoked from mips in v6.5.
The __request_percpu_irq() function is wrappered by request_percpu_irq(), and its header comment suggests that it is to be called after there are multiple CPUs. I am not seeing a call that is obviously from ppc32, but there are a number of drivers using this with which I am unfamiliar.
The request_percpu_nmi() has to be followed up on each CPU using prepare_percpu_nmi() and enable_percpu_nmi(), so it is not clear that it is useful to invoke this before interrupts are enabled. But this is used by ARM, not ppc32 from what I can see.
So even though I am not seeing how ppc32 invokes __setup_irq() early, your testing clearly indicates that it is doing so.
I am not entirely sure what makes ppc32 different than other platforms in that the initial task is configured to an idle task and the first call to call_rcu (tiny!) would cause the observed behaviour.
Maybe something like this in __setup_irq(), right before the mutex_lock()?
WARN_ON_ONCE(irqs_disabled());
This will dump the stack trace showing how __setup_irq() is being invoked in early boot on ppc32.
Again, given that __setup_irq() invokes mutex_lock(), invoking this function in its current form before interrupts have been enabled is a bad idea.
No trigger of that WARN_ON() I added in __setup_irq() as instructed above, still getting (pmac32_defconfig on MAC99 QEMU)
------------[ cut here ]------------ Interrupts were enabled early WARNING: CPU: 0 PID: 0 at init/main.c:992 start_kernel+0x4d8/0x5c0 Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 6.6.0-rc1-dirty #481 Hardware name: PowerMac3,1 7400 0xc0209 PowerMac NIP: c0a6052c LR: c0a6052c CTR: 00000000 REGS: c0c4dee0 TRAP: 0700 Not tainted (6.6.0-rc1-dirty) MSR: 00029032 <EE,ME,IR,DR,RI> CR: 24000282 XER: 20000000
GPR00: c0a6052c c0c4dfa0 c0b92580 0000001d c0b9d128 00000001 c0b9d148 3ffffdff GPR08: c0ba80f0 00000000 00000000 3ffffe00 44000282 00000000 00000000 0199abfc GPR16: 0199b0a4 7fde7fa4 7fc5ac0c 000000bb 41000000 01c690c8 c0b92014 c09b4bdc GPR24: c0c55220 c0ac0000 00000000 efff9109 efff9100 0000000a c0c6d000 c0b920a0 NIP [c0a6052c] start_kernel+0x4d8/0x5c0 LR [c0a6052c] start_kernel+0x4d8/0x5c0 Call Trace: [c0c4dfa0] [c0a6052c] start_kernel+0x4d8/0x5c0 (unreliable) [c0c4dff0] [00003540] 0x3540 Code: 480037b1 48023c05 4bab88ed 90620260 480139e9 4b657ced 7d2000a6 71298000 41a20014 3c60c09a 3863b78c 4b5e9ccd <0fe00000> 39200000 99380008 7d2000a6 ---[ end trace 0000000000000000 ]---