From: Lan Tianyu Tianyu.Lan@microsoft.com
When add "nosmt" parameter, kernel still boots up all logical cpus once and set CR4.MCE on each CPU. This is to avoid shutting down machine when a broadacasted MCE is observed CR4.MCE=0b. (Detail please see comment in the cpu_smt_allowed()). Smt cpus will bring up and bring down during kernel boot with "nosmt" parameter.
When CONFIG_HOTPLUG_CPU=Y, CPU_DYING callbacks will be called inside stop-machine and irq is disabled. This happens in the take_cpu_down() callback. When CONFIG_HOTPLUG_CPU=N,CPU_DYING callbacks will be called with irq enabled.
smpcfd_dying_cpu() is one of CPU_DYING callbacks and it assumes to be called when irq is disabled. smpcfd_dying_cpu() calls flush_smp_call_ function_queue() which requires to be called with irq disabled.
When CONFIG_HOTPLUG_CPU=N and add "nosmt" parameter, smpcfd_dying_cpu() is called with irq enalbed and this triggers BUG_ON(!irqs_disabled()) in the irq_work_run_list(). This patch is to fix the issue.
Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once") Signed-off-by: Lan Tianyu Tianyu.Lan@microsoft.com --- kernel/smp.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/kernel/smp.c b/kernel/smp.c index f4cf1b0..33f1970 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -73,6 +73,8 @@ int smpcfd_dead_cpu(unsigned int cpu)
int smpcfd_dying_cpu(unsigned int cpu) { + unsigned long flags; + /* * The IPIs for the smp-call-function callbacks queued by other * CPUs might arrive late, either due to hardware latencies or @@ -82,7 +84,10 @@ int smpcfd_dying_cpu(unsigned int cpu) * ensure that the outgoing CPU doesn't go offline with work * still pending. */ + local_irq_save(flags); flush_smp_call_function_queue(false); + local_irq_restore(flags); + return 0; }
Tianyu,
On Mon, 25 Mar 2019, lantianyu1986@gmail.com wrote:
thanks for tracking this down.
A few nit picks vs. the subject first:
Subject: [Fix PATCH] cpu/hotplug: Fix bug report when add "nosmt" parameter with CONFIG_HOTPLUG_CPU=N
[PATCH] is sufficient. The extra 'Fix' has no real value. 'Fix bug report' is weird. Bug reports cannot be fixed, only bugs can be fixed. Also please avoid overly long subjects. Subjects should be concise and describe the problem/change as precise as it goes.
When add "nosmt" parameter, kernel still boots up all logical cpus once and set CR4.MCE on each CPU. This is to avoid shutting down machine when a broadacasted MCE is observed CR4.MCE=0b. (Detail please see comment in the cpu_smt_allowed()). Smt cpus will bring up and bring down during kernel boot with "nosmt" parameter.
When CONFIG_HOTPLUG_CPU=Y, CPU_DYING callbacks will be called inside stop-machine and irq is disabled. This happens in the take_cpu_down() callback. When CONFIG_HOTPLUG_CPU=N,CPU_DYING callbacks will be called with irq enabled.
Yes, that's bad.
smpcfd_dying_cpu() is one of CPU_DYING callbacks and it assumes to be called when irq is disabled. smpcfd_dying_cpu() calls flush_smp_call_ function_queue() which requires to be called with irq disabled.
When CONFIG_HOTPLUG_CPU=N and add "nosmt" parameter, smpcfd_dying_cpu() is called with irq enalbed and this triggers BUG_ON(!irqs_disabled()) in the irq_work_run_list(). This patch is to fix the issue.
That's not a fix, it's papering over the root cause. As you figured out already, the DYING callbacks should be called with interrupts disabled.
So rather than "fixing" that particular callback we need to look at the reason why these callbacks are invoked with interrupts enabled and fix that.
Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
That has nothing to do with 'nosmt'. It's a general bug in the rollback code when HOTPLUG_CPU=n. 'nosmt' is using the rollback mechanism and is just a reliable way to trigger the problem. This happens in the same way when the bringup of a CPU fails for any other reason. That bug was there way before 0cc3cd21657b.
I'll have a look, but I fear that needs some surgery to fix.
Thanks,
tglx
On Mon, 25 Mar 2019, Thomas Gleixner wrote:
That has nothing to do with 'nosmt'. It's a general bug in the rollback code when HOTPLUG_CPU=n. 'nosmt' is using the rollback mechanism and is just a reliable way to trigger the problem. This happens in the same way when the bringup of a CPU fails for any other reason. That bug was there way before 0cc3cd21657b.
I'll have a look, but I fear that needs some surgery to fix.
This is SMP=y HOTPLUG_CPU=n case is broken and was broken forever.
Of course that tglx moron did not notice this particular wreckage when cleaning up the hotplug code. So we just kept the historical brainfart around.
The problem is that a failure in the bringup path triggers the rollback down to the point where the CPU is dead and resources are cleaned up.
But the interesting part of the rollback, i.e. takedown_cpu(), is not available when CONFIG_HOTPLUG_CPU=n.
As a consequence the former CPU_DYING callbacks are invoked on the control CPU with interrupts disabled. That rightfully explodes and even if interrupts were disabled then it's still violating the fundamental requirements of CPU unplug which include stop machine and various synchronizations in facilities like RCU which are all unavailable with HOTPLUG_CPU=n.
The pre state machine code has a different failure mode, but the end result is a straight forward crash as well. Just less obvious to decode.
So the SMP=y HOTPLUG_CPU=n onlining is just working by chance as long as none of the bringup callbacks fails.
'nosmt' exposes this flaw nicely because the wonderful MCE design of x86 forces us to bringup the sibling thread and then tear it down right away. That's equivalent to a callback failure and triggers the same rollback code which is broken...
But this is not a x86 only problem. Any architecture which supports the SMP=y HOTPLUG_CPU=n combination suffers from the same issue. It's just less likely to be triggered because in 99.99999% of the cases all bringup callbacks succeed.
Now making HOTPLUG_CPU mandatory for SMP is not working on all architectures as the following architectures have either no hotplug support at all or not all subarchitectures support it:
alpha, arc, hexagon, openrisc, riscv, sparc (32bit), mips (partial).
So I looked into a 'minimal rollback' for the HOTPLUG=n case which prevents at least the full crash. That would mean:
- the CPU is brought down to the point where the stop_machine takedown would happen.
- the CPU stays there forever and is idle
- The CPU is cleared in the CPU active mask, but not in the CPU online mask which is a legit state.
- Interrupts are not forced away from the CPU
- All facilities which only look at online mask would still see it, but that is the case on normal hotplug/unplug operations as well. It's just a longer time frame.
There might be some subtle stuff broken, but that's something we should figure out anyway as it's then also broken during unplug to the point where it is actually taken down. Letting it stay there just increases the time window. Same applies for the onlining path.
And indeed with that minimal rollback workaround in place this makes the vmstat code trigger an preemtible warning because it schedules work on a CPU which is online but inactive. The workqueue for that CPU is not bound yet and that triggers a preemptible warning of some sort over and over.
And that's not a problem of the workaround. Just onlining the CPU to the same point makes that problem visible as well.
A simple test which delayed the full onlining of a CPU and a test module scheduling work on this already marked online CPU triggers the same issue when the worker callback uses e.g. this_cpu_read(). So there is some nasty crap lurking all over the place.
We surely want to fix the fallout of this, but I think the sane short term solution for x86 is to enforce HOTPLUG_CPU when SMP is enabled which is also the sane and simple fix for backporting. That 'nosmt' stuff is popular nowadays for some reasons.
Minimal rollback patch to prevent crashing below.
Thoughts?
tglx
8<---------------- --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -564,6 +564,20 @@ static void undo_cpu_up(unsigned int cpu cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL); }
+static inline bool can_rollback_cpu(struct cpuhp_cpu_state *st) +{ + if (IS_ENABLED(CONFIG_HOTPLUG_CPU)) + return true; + /* + * When CPU hotplug is disabled, then taking the CPU down is not + * possible because takedown_cpu() and the architecture and + * subsystem specific mechanisms are not available. So the CPU + * which would be completely unplugged again needs to stay around + * in the current state, i.e. <= CPUHP_AP_ONLINE_IDLE. + */ + return st->state <= CPUHP_BRINGUP_CPU; +} + static int cpuhp_up_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st, enum cpuhp_state target) { @@ -574,8 +588,10 @@ static int cpuhp_up_callbacks(unsigned i st->state++; ret = cpuhp_invoke_callback(cpu, st->state, true, NULL, NULL); if (ret) { - st->target = prev_state; - undo_cpu_up(cpu, st); + if (can_rollback_cpu(st)) { + st->target = prev_state; + undo_cpu_up(cpu, st); + } break; } }
Hi Thomas: Thanks for your review.
On Tue, Mar 26, 2019 at 6:39 AM Thomas Gleixner tglx@linutronix.de wrote:
On Mon, 25 Mar 2019, Thomas Gleixner wrote:
That has nothing to do with 'nosmt'. It's a general bug in the rollback code when HOTPLUG_CPU=n. 'nosmt' is using the rollback mechanism and is just a reliable way to trigger the problem. This happens in the same way when the bringup of a CPU fails for any other reason. That bug was there way before 0cc3cd21657b.
I'll have a look, but I fear that needs some surgery to fix.
This is SMP=y HOTPLUG_CPU=n case is broken and was broken forever.
Of course that tglx moron did not notice this particular wreckage when cleaning up the hotplug code. So we just kept the historical brainfart around.
The problem is that a failure in the bringup path triggers the rollback down to the point where the CPU is dead and resources are cleaned up.
But the interesting part of the rollback, i.e. takedown_cpu(), is not available when CONFIG_HOTPLUG_CPU=n.
As a consequence the former CPU_DYING callbacks are invoked on the control CPU with interrupts disabled. That rightfully explodes and even if interrupts were disabled then it's still violating the fundamental requirements of CPU unplug which include stop machine and various synchronizations in facilities like RCU which are all unavailable with HOTPLUG_CPU=n.
The pre state machine code has a different failure mode, but the end result is a straight forward crash as well. Just less obvious to decode.
So the SMP=y HOTPLUG_CPU=n onlining is just working by chance as long as none of the bringup callbacks fails.
'nosmt' exposes this flaw nicely because the wonderful MCE design of x86 forces us to bringup the sibling thread and then tear it down right away. That's equivalent to a callback failure and triggers the same rollback code which is broken...
But this is not a x86 only problem. Any architecture which supports the SMP=y HOTPLUG_CPU=n combination suffers from the same issue. It's just less likely to be triggered because in 99.99999% of the cases all bringup callbacks succeed.
Now making HOTPLUG_CPU mandatory for SMP is not working on all architectures as the following architectures have either no hotplug support at all or not all subarchitectures support it:
alpha, arc, hexagon, openrisc, riscv, sparc (32bit), mips (partial).
So I looked into a 'minimal rollback' for the HOTPLUG=n case which prevents at least the full crash. That would mean:
the CPU is brought down to the point where the stop_machine takedown would happen.
the CPU stays there forever and is idle
The CPU is cleared in the CPU active mask, but not in the CPU online mask which is a legit state.
Interrupts are not forced away from the CPU
All facilities which only look at online mask would still see it, but that is the case on normal hotplug/unplug operations as well. It's just a longer time frame.
There might be some subtle stuff broken, but that's something we should figure out anyway as it's then also broken during unplug to the point where it is actually taken down. Letting it stay there just increases the time window. Same applies for the onlining path.
And indeed with that minimal rollback workaround in place this makes the vmstat code trigger an preemtible warning because it schedules work on a CPU which is online but inactive. The workqueue for that CPU is not bound yet and that triggers a preemptible warning of some sort over and over.
And that's not a problem of the workaround. Just onlining the CPU to the same point makes that problem visible as well.
A simple test which delayed the full onlining of a CPU and a test module scheduling work on this already marked online CPU triggers the same issue when the worker callback uses e.g. this_cpu_read(). So there is some nasty crap lurking all over the place.
We surely want to fix the fallout of this, but I think the sane short term solution for x86 is to enforce HOTPLUG_CPU when SMP is enabled which is also the sane and simple fix for backporting. That 'nosmt' stuff is popular nowadays for some reasons.
Minimal rollback patch to prevent crashing below.
Thoughts?
Yes, you are right. The "nosmt" parameter just exposes the issue and root cause is that DYING callbacks should be called with interrupts disabled.I didn't notice synchronizations (e.g RCU) were also not available when HOTPLUG_CPU=n. Other synchronization issues maybe not exposed.
Agree. Force HOTPLUG_CPU=Y if SMP=Y on x86 would be a quick fix and it's easy to be backported.
tglx
8<---------------- --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -564,6 +564,20 @@ static void undo_cpu_up(unsigned int cpu cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL); }
+static inline bool can_rollback_cpu(struct cpuhp_cpu_state *st) +{
if (IS_ENABLED(CONFIG_HOTPLUG_CPU))
return true;
/*
* When CPU hotplug is disabled, then taking the CPU down is not
* possible because takedown_cpu() and the architecture and
* subsystem specific mechanisms are not available. So the CPU
* which would be completely unplugged again needs to stay around
* in the current state, i.e. <= CPUHP_AP_ONLINE_IDLE.
*/
return st->state <= CPUHP_BRINGUP_CPU;
+}
static int cpuhp_up_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st, enum cpuhp_state target) { @@ -574,8 +588,10 @@ static int cpuhp_up_callbacks(unsigned i st->state++; ret = cpuhp_invoke_callback(cpu, st->state, true, NULL, NULL); if (ret) {
st->target = prev_state;
undo_cpu_up(cpu, st);
if (can_rollback_cpu(st)) {
st->target = prev_state;
undo_cpu_up(cpu, st);
} break; } }
I have tested your patch. It resolves the crash with "nosmt" parameter.
On Mon, Mar 25, 2019 at 09:51:23PM +0800, lantianyu1986@gmail.com wrote:
From: Lan Tianyu Tianyu.Lan@microsoft.com
When add "nosmt" parameter, kernel still boots up all logical cpus once and set CR4.MCE on each CPU. This is to avoid shutting down machine when a broadacasted MCE is observed CR4.MCE=0b. (Detail please see comment in the cpu_smt_allowed()). Smt cpus will bring up and bring down during kernel boot with "nosmt" parameter.
When CONFIG_HOTPLUG_CPU=Y, CPU_DYING callbacks will be called inside stop-machine and irq is disabled. This happens in the take_cpu_down() callback. When CONFIG_HOTPLUG_CPU=N,CPU_DYING callbacks will be called with irq enabled.
smpcfd_dying_cpu() is one of CPU_DYING callbacks and it assumes to be called when irq is disabled. smpcfd_dying_cpu() calls flush_smp_call_ function_queue() which requires to be called with irq disabled.
When CONFIG_HOTPLUG_CPU=N and add "nosmt" parameter, smpcfd_dying_cpu() is called with irq enalbed and this triggers BUG_ON(!irqs_disabled()) in the irq_work_run_list(). This patch is to fix the issue.
Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once") Signed-off-by: Lan Tianyu Tianyu.Lan@microsoft.com
kernel/smp.c | 5 +++++ 1 file changed, 5 insertions(+)
<formletter>
This is not the correct way to submit patches for inclusion in the stable kernel tree. Please read: https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html for how to do this properly.
</formletter>
linux-stable-mirror@lists.linaro.org