Tianyu,
On Mon, 25 Mar 2019, lantianyu1986@gmail.com wrote:
thanks for tracking this down.
A few nit picks vs. the subject first:
Subject: [Fix PATCH] cpu/hotplug: Fix bug report when add "nosmt" parameter with CONFIG_HOTPLUG_CPU=N
[PATCH] is sufficient. The extra 'Fix' has no real value. 'Fix bug report' is weird. Bug reports cannot be fixed, only bugs can be fixed. Also please avoid overly long subjects. Subjects should be concise and describe the problem/change as precise as it goes.
When add "nosmt" parameter, kernel still boots up all logical cpus once and set CR4.MCE on each CPU. This is to avoid shutting down machine when a broadacasted MCE is observed CR4.MCE=0b. (Detail please see comment in the cpu_smt_allowed()). Smt cpus will bring up and bring down during kernel boot with "nosmt" parameter.
When CONFIG_HOTPLUG_CPU=Y, CPU_DYING callbacks will be called inside stop-machine and irq is disabled. This happens in the take_cpu_down() callback. When CONFIG_HOTPLUG_CPU=N,CPU_DYING callbacks will be called with irq enabled.
Yes, that's bad.
smpcfd_dying_cpu() is one of CPU_DYING callbacks and it assumes to be called when irq is disabled. smpcfd_dying_cpu() calls flush_smp_call_ function_queue() which requires to be called with irq disabled.
When CONFIG_HOTPLUG_CPU=N and add "nosmt" parameter, smpcfd_dying_cpu() is called with irq enalbed and this triggers BUG_ON(!irqs_disabled()) in the irq_work_run_list(). This patch is to fix the issue.
That's not a fix, it's papering over the root cause. As you figured out already, the DYING callbacks should be called with interrupts disabled.
So rather than "fixing" that particular callback we need to look at the reason why these callbacks are invoked with interrupts enabled and fix that.
Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
That has nothing to do with 'nosmt'. It's a general bug in the rollback code when HOTPLUG_CPU=n. 'nosmt' is using the rollback mechanism and is just a reliable way to trigger the problem. This happens in the same way when the bringup of a CPU fails for any other reason. That bug was there way before 0cc3cd21657b.
I'll have a look, but I fear that needs some surgery to fix.
Thanks,
tglx