On Wed, Mar 20, 2024 at 7:33 PM Valentin Schneider vschneid@redhat.com wrote:
On 20/03/24 19:17, Valentin Schneider wrote:
On 20/03/24 15:03, Daniel Vacek wrote:
Hi Valentin,
On Mon, Mar 18, 2024 at 6:34 PM Valentin Schneider vschneid@redhat.com wrote:
On 18/03/24 12:17, Daniel Vacek wrote:
Bill Peters reported CPU hangs while offlining/onlining CPUs on s390.
Analysis of the vmcore data shows that `stop_one_cpu_nowait()` in `affine_move_task()` can fail when racing with CPU off-/on-lining, resulting in a deadlock: the caller waits for the pending migration stop work to complete, but that completion is never signalled.
Fix this by correctly handling such a condition.
IIUC the problem is that the dest_cpu and its stopper thread can be taken down by take_cpu_down(), and affine_move_task() currently isn't aware of that. I thought we had tested this vs hotplug, but oh well...
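(For illustration only, not the actual patch: one way of "correctly handling such a condition" could look roughly like the sketch below. It assumes affine_move_task() checks the bool return value of stop_one_cpu_nowait(), which is false when the stop work could not be queued; refcounting and locking details are glossed over.)

~~~
	/*
	 * Hypothetical sketch, not the submitted fix: if the destination
	 * CPU's stopper has already been parked/taken down, the stop work
	 * is never queued and `pending->done` is never completed, so detect
	 * that here instead of waiting forever.
	 */
	if (!stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
				 &pending->arg, &pending->stop_work)) {
		pending->stop_pending = false;
		complete_all(&pending->done);
	}
~~~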
I'm sorry, I should have provided more context in the first place. The machine is an LPAR with 2 CPUs and CPU 0 was onlining (hotplugging?) CPU 1. The traces show this scenario:
cpuplugd task 1429 (CPU 0)
  - holds the `cpu_hotplug_lock` for writing in _cpu_up+0x16a
  - blocked on the `cpuhp_state:1.done_up` completion in __cpuhp_kick_ap+0x76

cpuhp/1 task 17 (CPU 1)
  - supposed to complete bringup of the CPU (`cpuhp_state:1.done_up`) in cpuhp_thread_fun+0x108
  - blocked on `wq_pool_attach_mutex` in workqueue_online_cpu+0x9e

xfs-conv/dm-0 task 745 (CPU 0)
  - holds the `wq_pool_attach_mutex` in worker_attach_to_pool+0x66
  - blocked on the `task->migration_pending->done` completion in affine_move_task+0x10a
~~~
crash> b 1429
PID: 1429   TASK: 99398000   CPU: 0   COMMAND: "cpuplugd"
 #0 [997df970] __schedule+0x34c at 3089c424
 #1 [997df9e0] schedule+0x7e at 3089cafe
 #2 [997dfa20] schedule_timeout+0x26e at 308a1d8e
      [inlined]  do_wait_for_common
      [inlined]  __wait_for_common
 #3 [997dfad8] wait_for_common+0x14a at 3089d902
      [ret call] wait_for_completion+0x1a at 3089d96a
      [inlined]  wait_for_ap_thread            <<< blocked on `cpuhp_state:1.done_up` completion
      [ret call] __cpuhp_kick_ap+0x76 at 300c610e
 #4 [997dfb58] cpuhp_kick_ap+0xc4 at 300c61dc
      [inlined]  bringup_wait_for_ap
      [ret call] bringup_cpu+0xea at 300c6402
 #5 [997dfba8] cpuhp_invoke_callback+0xcc at 300c4f14
 #6 [997dfc40] _cpu_up+0x16a at 300c798a       <<< holds the `cpu_hotplug_lock` for writing
 #7 [997dfc98] do_cpu_up+0xc6 at 300c7b66
 #8 [997dfcd8] cpu_subsys_online+0x58 at 305a0a00
 #9 [997dfd28] device_online+0x9e at 30598e7e
#10 [997dfd68] online_store+0x88 at 30598f28
#11 [997dfda8] kernfs_fop_write+0xdc at 3040276c
#12 [997dfdf8] vfs_write+0xa8 at 30354760
#13 [997dfe58] ksys_write+0x62 at 30354a32

crash> cpuhp_cpu_state.state cpuhp_state:a | paste - -
[0]: 1aef424e0    state = CPUHP_ONLINE,               # (195)
[1]: 1aef654e0    state = CPUHP_AP_WORKQUEUE_ONLINE,  # (159)

crash> cpuhp_cpu_state.bringup,thread,done_up.done cpuhp_state:1 -d | paste - - - -
[1]: 1aef654e0    bringup = true,    thread = 0x81134400,    done_up.done = 0,    <<<

crash> b 17
PID: 17   TASK: 81134400   CPU: 1   COMMAND: "cpuhp/1"
 #0 [81143b68] __schedule+0x34c at 3089c424
 #1 [81143bd8] schedule+0x7e at 3089cafe
 #2 [81143c18] schedule_preempt_disabled+0x2a at 3089cfba
 #3 [81143c30] __mutex_lock+0x320 at 3089df60
 #4 [81143cb0] workqueue_online_cpu+0x9e at 300e847e    <<< blocked on `wq_pool_attach_mutex`
 #5 [81143d20] cpuhp_invoke_callback+0xcc at 300c4f14
 #6 [81143db8] cpuhp_thread_fun+0x108 at 300c6848       <<< supposed to complete the bring-up of the CPU (`cpuhp_state:1.done_up`)

crash> b 745
PID: 745   TASK: 82359100   CPU: 0   COMMAND: "xfs-conv/dm-0"
 #0 [8b4bfa20] __schedule+0x34c at 3089c424
 #1 [8b4bfa90] schedule+0x7e at 3089cafe
 #2 [8b4bfad0] schedule_timeout+0x26e at 308a1d8e
      [inlined]  do_wait_for_common
      [inlined]  __wait_for_common
 #3 [8b4bfb88] wait_for_common+0x14a at 3089d902
      [ret call] wait_for_completion+0x1a at 3089d96a
 #4 [8b4bfc08] affine_move_task+0x10a at 300fb51a       <<< blocked on `task->migration_pending->done` completion
 #5 [8b4bfd08] __set_cpus_allowed_ptr+0x12e at 300fb926
      [ret call] set_cpus_allowed_ptr+0xa at 300fba32
 #6 [8b4bfd78] worker_attach_to_pool+0x66 at 300e1dae   <<< holds the `wq_pool_attach_mutex`
 #7 [8b4bfdc8] rescuer_thread+0x12c at 300e5bac

crash> rx 8b4bfea0
8b4bfea0: [863373c0:kmalloc-192]

crash> worker.task,rescue_wq 863373c0
  task = 0x82359100,
  rescue_wq = 0x8aa44400,

crash> list -s pool_workqueue.pool pool_workqueue.mayday_node -hO workqueue_struct.maydays 0x8aa44400 | paste - -
1fffff7f751900    pool = 0x1aef56a00,

crash> worker_pool.attrs 0x1aef56a00
  attrs = 0x80088180,

crash> workqueue_attrs.cpumask[0].bits 0x80088180
  cpumask[0].bits = {0x1, 0x0, ...

crash> cpumask.bits __cpu_active_mask
  bits = {0x1, 0x0, ...

crash> cpumask.bits __cpu_online_mask
  bits = {0x3, 0x0, ...

crash> task_struct.migration_pending,flags 0x82359100
  migration_pending = 0x8b4bfce8,
  flags = 0x4208060,
              ^ PF_KTHREAD

crash> pd distribute_cpu_mask_prev:0
per_cpu(distribute_cpu_mask_prev, 0) = 0

crash> set_affinity_pending.refs.refs.counter,arg,stop_pending,done.done 0x8b4bfce8 -d
  refs.refs.counter = 1
  arg = {
    task = 0x82359100,
    dest_cpu = 0,
    pending = 0x8b4bfce8
  }
  stop_pending = 1,
  done.done = 0,
~~~
In other words, `set_cpus_allowed_ptr()` is called from a worker thread which tries to migrate. The worker pool is only allowed on CPU 0, and that was supposed to be the destination as per the stack structure. In this case I thought it would be OK to leave the task on the old CPU
AFAICT if a call to set_cpus_allowed_ptr() ends up in affine_move_task() and down to the stopper call, that means the task isn't allowed on its current CPU and needs to be moved.
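(For reference, paraphrased from affine_move_task() in recent upstream kernels; exact code differs by version. The stopper is only reached when the task may not stay on its current CPU:)

~~~
	/* Can the task run on the task's current CPU? If so, we're done */
	if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
		/* ... */
		return 0;
	}
	/* ... otherwise the task has to be moved, possibly via the stopper: */
	stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
			    &pending->arg, &pending->stop_work);
~~~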
and Bill's testing scenario was successful with the proposed patch. IIUC, it exercises hotplug in combination with load-balancing.
This was on a RHEL 8.8.z kernel. I see upstream has changed a bit, so I'm not sure it's still reproducible there. Also, I'm not sure why this only happens on s390 and not on x86; I imagine the CPU hotplug path differs slightly? Anyway, this seems to be timing-sensitive, and the timing will certainly differ greatly.
Thanks for the extra context!
Double-checking what I wrote before: I forgot that RCU considers preempt-off sections as read-side critical sections. __set_cpus_allowed_ptr() already has preemption disabled all the way from reading the cpu_active_mask to the stop_one_cpu_nowait() call, via task_rq_lock() + preempt_disable().
IOW we have:
  __set_cpus_allowed_ptr()
    task_rq_lock()                  <-- PREEMPT OFF
    __set_cpus_allowed_ptr_locked()
      cpu_valid_mask = cpu_active_mask;
      dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);
      affine_move_task()
        preempt_disable();
        task_rq_unlock();
        stop_one_cpu_nowait();      <-- preemption still OFF
And, considering:
  sched_cpu_deactivate()
    set_cpu_active(cpu, false);
    synchronize_rcu();
Then, if __set_cpus_allowed_ptr() observes a CPU as being in the cpu_active_mask and uses that one as a destination CPU, said CPU cannot reach CPUHP_TEARDOWN_CPU:take_cpu_down() and park the stopper thread because its hotplug machinery will wait on the synchronize_rcu() in CPUHP_AP_ACTIVE:sched_cpu_deactivate().
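(For context, the stopper parking mentioned above happens in take_cpu_down(); paraphrased from kernel/cpu.c, exact code differs by version:)

~~~
static int take_cpu_down(void *_param)
{
	int err, cpu = smp_processor_id();

	/* Ensure this CPU doesn't handle any more interrupts. */
	err = __cpu_disable();
	if (err < 0)
		return err;

	/* ... */

	/* Park the stopper thread */
	stop_machine_park(cpu);
	return 0;
}
~~~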
So "in theory", this shouldn't happen upstream.
Eh, nevermind, in your stacktrace the relevant task is a rescuer thread, which is a kthread, so the cpu_valid_mask in use there is cpu_online_mask, not cpu_active_mask... Back to reading code :-)
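(Paraphrased from __set_cpus_allowed_ptr_locked() in recent upstream kernels; details differ by version. This is where kthreads end up with cpu_online_mask as their cpu_valid_mask:)

~~~
	const struct cpumask *cpu_valid_mask = cpu_active_mask;

	if (p->flags & PF_KTHREAD || is_migration_disabled(p)) {
		/*
		 * Kernel threads are allowed on online && !active CPUs,
		 * however, during cpu-hot-unplug, even these might get pushed
		 * away if not KTHREAD_IS_PER_CPU.
		 */
		cpu_valid_mask = cpu_online_mask;
	}
~~~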
I just wanted to correct you on this point but you figured it out for yourself after all.
On the other hand the `new_mask` (which comes from the worker pool allowed CPUs)
~~~
2102 static void worker_attach_to_pool(struct worker *worker,
2103                                   struct worker_pool *pool)
2104 {
2105         mutex_lock(&wq_pool_attach_mutex);
--
2117         if (worker->rescue_wq)
2118                 set_cpus_allowed_ptr(worker->task, pool_allowed_cpus(pool));

2827 static int rescuer_thread(void *__rescuer)
2828 {
2829         struct worker *rescuer = __rescuer;
2830         struct workqueue_struct *wq = rescuer->rescue_wq;
--
2856         while (!list_empty(&wq->maydays)) {
2857                 struct pool_workqueue *pwq = list_first_entry(&wq->maydays,
2858                                         struct pool_workqueue, mayday_node);
2859                 struct worker_pool *pool = pwq->pool;
--
2863                 list_del_init(&pwq->mayday_node);
2864
2865                 raw_spin_unlock_irq(&wq_mayday_lock);
2866
2867                 worker_attach_to_pool(rescuer, pool);
2868
2869                 raw_spin_lock_irq(&pool->lock);
~~~
matches the `cpu_valid_mask` and is ANDed with the `cpu_online_mask`, so the final result is the same...
Actually, double-checking now, I may have pulled the wrong pool: the right `pwq` is removed from the `maydays` list on line 2863 (the code is similar in RHEL 8.8), and I used whatever `pwq` was remaining on that list, possibly an unrelated one.
~~~
crash> rx 8b4bfe30
8b4bfe30: 00000001aef56a00

crash> worker_pool.attrs 1aef56a00
  attrs = 0x80088180,

crash> workqueue_attrs.cpumask[0].bits 0x80088180
  cpumask[0].bits = {0x1, 0x0, ...

crash> set_affinity_pending.refs.refs.counter,arg,stop_pending,done.done 0x8b4bfce8
  refs.refs.counter = 0x1
  arg = {
    task = 0x82359100,
    dest_cpu = 0x0,
    pending = 0x8b4bfce8
  }
  stop_pending = 0x1,
  done.done = 0x0,
~~~
Nah, so it's not unrelated after all. It is the right one. The rescuer had been deadlocked for almost 3 days by the time the vmcore was collected, so it seems the `pwq` returned to the `maydays` list in the meantime. And I guess the pool cpumask could have changed as well. Nevertheless, the local `affine_move_task::my_pending.arg.dest_cpu` on the stack still shows that CPU 0 was selected at the time.
So far I don't see why upstream would not be prone to the same issue. But I may be missing something.
--nX