On Wed, Mar 20, 2024 at 7:33 PM Valentin Schneider vschneid@redhat.com wrote:
On 20/03/24 19:17, Valentin Schneider wrote:
On 20/03/24 15:03, Daniel Vacek wrote:
Hi Valentin,
On Mon, Mar 18, 2024 at 6:34 PM Valentin Schneider vschneid@redhat.com wrote:
On 18/03/24 12:17, Daniel Vacek wrote:
Bill Peters reported CPU hangs while offlining/onlining CPUs on s390.
Analysis of the vmcore data shows that `stop_one_cpu_nowait()` in `affine_move_task()` can fail when racing with CPU off-/on-lining, resulting in a deadlock: the caller waits for the pending migration stop work to complete, but that completion is never signalled.
Fix this by correctly handling such a condition.
IIUC the problem is that the dest_cpu and its stopper thread can be taken down by take_cpu_down(), and affine_move_task() currently isn't aware of that. I thought we had tested this vs hotplug, but oh well...
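(For illustration only, not the actual patch: one way of "correctly handling such a condition" could look roughly like the sketch below. It assumes affine_move_task() checks the bool return value of stop_one_cpu_nowait(), which is false when the stop work could not be queued; refcounting and locking details are glossed over.)

~~~
	/*
	 * Hypothetical sketch, not the submitted fix: if the destination
	 * CPU's stopper has already been parked/taken down, the stop work
	 * is never queued and `pending->done` is never completed, so detect
	 * that here instead of waiting forever.
	 */
	if (!stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
				 &pending->arg, &pending->stop_work)) {
		pending->stop_pending = false;
		complete_all(&pending->done);
	}
~~~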
I'm sorry, I should have provided more context in the first place. The machine is an LPAR with 2 CPUs and CPU 0 was onlining (hotplugging?) CPU 1. The traces show this scenario:
cpuplugd task 1429 (CPU 0)
  - holds the `cpu_hotplug_lock` for writing in _cpu_up+0x16a
  - blocked on the `cpuhp_state:1.done_up` completion in __cpuhp_kick_ap+0x76

cpuhp/1 task 17 (CPU 1)
  - supposed to complete bringup of the CPU (`cpuhp_state:1.done_up`) in cpuhp_thread_fun+0x108
  - blocked on `wq_pool_attach_mutex` in workqueue_online_cpu+0x9e

xfs-conv/dm-0 task 745 (CPU 0)
  - holds the `wq_pool_attach_mutex` in worker_attach_to_pool+0x66
  - blocked on the `task->migration_pending->done` completion in affine_move_task+0x10a
~~~
crash> b 1429
PID: 1429   TASK: 99398000   CPU: 0   COMMAND: "cpuplugd"
 #0 [997df970] __schedule+0x34c at 3089c424
 #1 [997df9e0] schedule+0x7e at 3089cafe
 #2 [997dfa20] schedule_timeout+0x26e at 308a1d8e
      [inlined]  do_wait_for_common
      [inlined]  __wait_for_common
 #3 [997dfad8] wait_for_common+0x14a at 3089d902
      [ret call] wait_for_completion+0x1a at 3089d96a
      [inlined]  wait_for_ap_thread            <<< blocked on `cpuhp_state:1.done_up` completion
      [ret call] __cpuhp_kick_ap+0x76 at 300c610e
 #4 [997dfb58] cpuhp_kick_ap+0xc4 at 300c61dc
      [inlined]  bringup_wait_for_ap
      [ret call] bringup_cpu+0xea at 300c6402
 #5 [997dfba8] cpuhp_invoke_callback+0xcc at 300c4f14
 #6 [997dfc40] _cpu_up+0x16a at 300c798a       <<< holds the `cpu_hotplug_lock` for writing
 #7 [997dfc98] do_cpu_up+0xc6 at 300c7b66
 #8 [997dfcd8] cpu_subsys_online+0x58 at 305a0a00
 #9 [997dfd28] device_online+0x9e at 30598e7e
#10 [997dfd68] online_store+0x88 at 30598f28
#11 [997dfda8] kernfs_fop_write+0xdc at 3040276c
#12 [997dfdf8] vfs_write+0xa8 at 30354760
#13 [997dfe58] ksys_write+0x62 at 30354a32

crash> cpuhp_cpu_state.state cpuhp_state:a | paste - -
[0]: 1aef424e0    state = CPUHP_ONLINE,               # (195)
[1]: 1aef654e0    state = CPUHP_AP_WORKQUEUE_ONLINE,  # (159)

crash> cpuhp_cpu_state.bringup,thread,done_up.done cpuhp_state:1 -d | paste - - - -
[1]: 1aef654e0    bringup = true,    thread = 0x81134400,    done_up.done = 0,    <<<

crash> b 17
PID: 17   TASK: 81134400   CPU: 1   COMMAND: "cpuhp/1"
 #0 [81143b68] __schedule+0x34c at 3089c424
 #1 [81143bd8] schedule+0x7e at 3089cafe
 #2 [81143c18] schedule_preempt_disabled+0x2a at 3089cfba
 #3 [81143c30] __mutex_lock+0x320 at 3089df60
 #4 [81143cb0] workqueue_online_cpu+0x9e at 300e847e    <<< blocked on `wq_pool_attach_mutex`
 #5 [81143d20] cpuhp_invoke_callback+0xcc at 300c4f14
 #6 [81143db8] cpuhp_thread_fun+0x108 at 300c6848       <<< supposed to complete the bring-up of the CPU (`cpuhp_state:1.done_up`)

crash> b 745
PID: 745   TASK: 82359100   CPU: 0   COMMAND: "xfs-conv/dm-0"
 #0 [8b4bfa20] __schedule+0x34c at 3089c424
 #1 [8b4bfa90] schedule+0x7e at 3089cafe
 #2 [8b4bfad0] schedule_timeout+0x26e at 308a1d8e
      [inlined]  do_wait_for_common
      [inlined]  __wait_for_common
 #3 [8b4bfb88] wait_for_common+0x14a at 3089d902
      [ret call] wait_for_completion+0x1a at 3089d96a
 #4 [8b4bfc08] affine_move_task+0x10a at 300fb51a       <<< blocked on `task->migration_pending->done` completion
 #5 [8b4bfd08] __set_cpus_allowed_ptr+0x12e at 300fb926
      [ret call] set_cpus_allowed_ptr+0xa at 300fba32
 #6 [8b4bfd78] worker_attach_to_pool+0x66 at 300e1dae   <<< holds the `wq_pool_attach_mutex`
 #7 [8b4bfdc8] rescuer_thread+0x12c at 300e5bac

crash> rx 8b4bfea0
8b4bfea0: [863373c0:kmalloc-192]

crash> worker.task,rescue_wq 863373c0
  task = 0x82359100,
  rescue_wq = 0x8aa44400,

crash> list -s pool_workqueue.pool pool_workqueue.mayday_node -hO workqueue_struct.maydays 0x8aa44400 | paste - -
1fffff7f751900    pool = 0x1aef56a00,

crash> worker_pool.attrs 0x1aef56a00
  attrs = 0x80088180,

crash> workqueue_attrs.cpumask[0].bits 0x80088180
  cpumask[0].bits = {0x1, 0x0, ...

crash> cpumask.bits __cpu_active_mask
  bits = {0x1, 0x0, ...

crash> cpumask.bits __cpu_online_mask
  bits = {0x3, 0x0, ...

crash> task_struct.migration_pending,flags 0x82359100
  migration_pending = 0x8b4bfce8,
  flags = 0x4208060,
              ^ PF_KTHREAD

crash> pd distribute_cpu_mask_prev:0
per_cpu(distribute_cpu_mask_prev, 0) = 0

crash> set_affinity_pending.refs.refs.counter,arg,stop_pending,done.done 0x8b4bfce8 -d
  refs.refs.counter = 1
  arg = {
    task = 0x82359100,
    dest_cpu = 0,
    pending = 0x8b4bfce8
  }
  stop_pending = 1,
  done.done = 0,
~~~
In other words, `set_cpus_allowed_ptr()` is called from a worker thread which tries to migrate. The worker pool is only allowed on CPU 0, and that was supposed to be the destination as per the stack structure. In this case I thought it would be OK to leave the task on the old CPU
AFAICT if a call to set_cpus_allowed_ptr() ends up in affine_move_task() and down to the stopper call, that means the task isn't allowed on its current CPU and needs to be moved.
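(For reference, paraphrased from affine_move_task() in recent upstream kernels; exact code differs by version. The stopper is only reached when the task may not stay on its current CPU:)

~~~
	/* Can the task run on the task's current CPU? If so, we're done */
	if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
		/* ... */
		return 0;
	}
	/* ... otherwise the task has to be moved, possibly via the stopper: */
	stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
			    &pending->arg, &pending->stop_work);
~~~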
and Bill's testing scenario was successful with the proposed patch. IIUC, it exercises hotplug in combination with load-balancing.
This was on a RHEL 8.8.z kernel. I see upstream has changed a bit, so I'm not sure it's still reproducible there. Also, I'm not sure why this only happens on s390 and not on x86; I imagine the CPU hotplug path differs slightly? Anyway, this seems to be timing-sensitive, and the timing will certainly differ greatly.
Thanks for the extra context!
Double-checking what I wrote before: I forgot that RCU considers preempt-off sections as read-side critical sections. __set_cpus_allowed_ptr() already has preemption disabled all the way from reading the cpu_active_mask to the stop_one_cpu_nowait() call, via task_rq_lock() + preempt_disable().
IOW we have:
  __set_cpus_allowed_ptr()
    task_rq_lock()                  <-- PREEMPT OFF
    __set_cpus_allowed_ptr_locked()
      cpu_valid_mask = cpu_active_mask;
      dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);
      affine_move_task()
        preempt_disable();
        task_rq_unlock();
        stop_one_cpu_nowait();      <-- preemption still OFF
And, considering:
  sched_cpu_deactivate()
    set_cpu_active(cpu, false);
    synchronize_rcu();
Then, if __set_cpus_allowed_ptr() observes a CPU as being in the cpu_active_mask and uses that one as a destination CPU, said CPU cannot reach CPUHP_TEARDOWN_CPU:take_cpu_down() and park the stopper thread because its hotplug machinery will wait on the synchronize_rcu() in CPUHP_AP_ACTIVE:sched_cpu_deactivate().
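(For context, the stopper parking mentioned above happens in take_cpu_down(); paraphrased from kernel/cpu.c, exact code differs by version:)

~~~
static int take_cpu_down(void *_param)
{
	int err, cpu = smp_processor_id();

	/* Ensure this CPU doesn't handle any more interrupts. */
	err = __cpu_disable();
	if (err < 0)
		return err;

	/* ... */

	/* Park the stopper thread */
	stop_machine_park(cpu);
	return 0;
}
~~~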
So "in theory", this shouldn't happen upstream.
Eh, nevermind, in your stacktrace the relevant task is a rescuer thread, which is a kthread, so the cpu_valid_mask in use there is cpu_online_mask, not cpu_active_mask... Back to reading code :-)
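(Paraphrased from __set_cpus_allowed_ptr_locked() in recent upstream kernels; details differ by version. This is where kthreads end up with cpu_online_mask as their cpu_valid_mask:)

~~~
	const struct cpumask *cpu_valid_mask = cpu_active_mask;

	if (p->flags & PF_KTHREAD || is_migration_disabled(p)) {
		/*
		 * Kernel threads are allowed on online && !active CPUs,
		 * however, during cpu-hot-unplug, even these might get pushed
		 * away if not KTHREAD_IS_PER_CPU.
		 */
		cpu_valid_mask = cpu_online_mask;
	}
~~~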
I just wanted to correct you on this point but you figured it out for yourself after all.
On the other hand the `new_mask` (which comes from the worker pool allowed CPUs)
~~~
2102 static void worker_attach_to_pool(struct worker *worker,
2103                                   struct worker_pool *pool)
2104 {
2105         mutex_lock(&wq_pool_attach_mutex);
--
2117         if (worker->rescue_wq)
2118                 set_cpus_allowed_ptr(worker->task, pool_allowed_cpus(pool));

2827 static int rescuer_thread(void *__rescuer)
2828 {
2829         struct worker *rescuer = __rescuer;
2830         struct workqueue_struct *wq = rescuer->rescue_wq;
--
2856         while (!list_empty(&wq->maydays)) {
2857                 struct pool_workqueue *pwq = list_first_entry(&wq->maydays,
2858                                         struct pool_workqueue, mayday_node);
2859                 struct worker_pool *pool = pwq->pool;
--
2863                 list_del_init(&pwq->mayday_node);
2864
2865                 raw_spin_unlock_irq(&wq_mayday_lock);
2866
2867                 worker_attach_to_pool(rescuer, pool);
2868
2869                 raw_spin_lock_irq(&pool->lock);
~~~
matches the `cpu_valid_mask` and is ANDed with the `cpu_online_mask`, so the final result is the same...
Actually, double-checking now, I may have pulled the wrong pool: the right `pwq` is removed from the `maydays` list on line 2863 (the code is similar in RHEL 8.8), and I used whatever `pwq` was remaining on that list, possibly an unrelated one.
~~~
crash> rx 8b4bfe30
8b4bfe30: 00000001aef56a00

crash> worker_pool.attrs 1aef56a00
  attrs = 0x80088180,

crash> workqueue_attrs.cpumask[0].bits 0x80088180
  cpumask[0].bits = {0x1, 0x0, ...

crash> set_affinity_pending.refs.refs.counter,arg,stop_pending,done.done 0x8b4bfce8
  refs.refs.counter = 0x1
  arg = {
    task = 0x82359100,
    dest_cpu = 0x0,
    pending = 0x8b4bfce8
  }
  stop_pending = 0x1,
  done.done = 0x0,
~~~
Nah, so it's not unrelated after all. It is the right one. The rescuer had been deadlocked for almost 3 days by the time the vmcore was collected, so it seems the `pwq` returned to the `maydays` list in the meantime. And I guess the pool cpumask could have changed as well. Nevertheless, the local `affine_move_task::my_pending.arg.dest_cpu` on the stack still shows that CPU 0 was selected at the time.
So far I don't see why upstream would not be prone to the same issue. But I may be missing something.
--nX