From: Henrik Austad haustad@cisco.com
Short story:
The following patches are needed on a 4.4 kernel to avoid Oops in the scheduler when a sched_rr and sched_deadline task contends on the same futex (with PI).
Longer story:
On one of our arm64 systems, we occasionally crash with an Oops in the scheduler with the following backtrace.
[<ffffffc0000ee398>] enqueue_task_dl+0x1f0/0x420 [<ffffffc0000d0f14>] activate_task+0x7c/0x90 [<ffffffc0000edbdc>] push_dl_task+0x164/0x1c8 [<ffffffc0000edc60>] push_dl_tasks+0x20/0x30 [<ffffffc0000cc00c>] __balance_callback+0x44/0x68 [<ffffffc000d2c018>] __schedule+0x6f0/0x728 [<ffffffc000d2c278>] schedule+0x78/0x98 [<ffffffc000d2e76c>] __rt_mutex_slowlock+0x9c/0x108 [<ffffffc000d2e9d0>] rt_mutex_slowlock+0xd8/0x198 [<ffffffc0000f7f28>] rt_mutex_timed_futex_lock+0x30/0x40 [<ffffffc00012c1a8>] futex_lock_pi+0x200/0x3b0 [<ffffffc00012cf84>] do_futex+0x1c4/0x550 [<ffffffc00012d92c>] compat_SyS_futex+0x10c/0x138 [<ffffffc00008504c>] __sys_trace_return+0x0/0x4
This seems to be the same bug Xuneli Pang triggered and fixed in e96a7705e7d3 "sched/rtmutex/deadline: Fix a PI crash for deadline tasks". As noted by Peter Zijlstra in the previous attempt, this fix requires a few other patches, most notably the FUTEX_UNLOCK_PI series [1]
Testing this on a dual-core VM I have not been able to reproduce the same crash, but pi_stress (part of the rt-test suite) reveals that vanilla 4.4.162 behaves rather badly with a mix of deadline and sched_(rr|fifo) tasks:
time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=100000,deadline=200000,period=200000 Starting PI Stress Test Number of thread groups: 1 Duration of test run: infinite Number of inversions per group: unlimited Admin thread SCHED_RR priority 4 1 groups of 3 threads will be created High thread SCHED_DEADLINE runtime 100000 deadline 200000 period 200000 Med thread SCHED_RR priority 2 Low thread SCHED_RR priority 1 Current Inversions: 141627 WATCHDOG triggered: group 0 is deadlocked! reporter stopping due to watchdog event Stopping test Terminated
real 0m26.291s user 0m0.148s sys 0m18.819s
With this series applied, the test ran for ~4.5 hours and again for 129 minutes (when I remembered to time it) before crashing:
time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=100000,deadline=200000,period=200000 Starting PI Stress Test Number of thread groups: 1 Duration of test run: infinite Number of inversions per group: unlimited Admin thread SCHED_RR priority 4 1 groups of 3 threads will be created High thread SCHED_DEADLINE runtime 100000 deadline 200000 period 200000 Med thread SCHED_RR priority 2 Low thread SCHED_RR priority 1 Current Inversions: 51985223 WATCHDOG triggered: group 0 is deadlocked! reporter stopping due to watchdog event Stopping test Terminated
real 129m38.807s user 0m59.084s sys 109m53.666s
So clearly not perfect, but a *lot* better.
The same series on our vendor-4.4 kernel moves pi_stress up from ~30 seconds before deadlock up to the same level as the VM (the test is still going as of this writing).
I suspect other users of 4.4 would benefit from having these patches backported, so tag them for stable. I assume 4.9 and 4.14 could benefit as well, but I have not had time to look into those.
1) https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1359667.html
Peter Zijlstra (13): futex: Cleanup variable names for futex_top_waiter() futex: Use smp_store_release() in mark_wake_futex() futex: Remove rt_mutex_deadlock_account_*() futex,rt_mutex: Provide futex specific rt_mutex API futex: Change locking rules futex: Cleanup refcounting futex: Rework inconsistent rt_mutex/futex_q state futex: Pull rt_mutex_futex_unlock() out from under hb->lock futex,rt_mutex: Introduce rt_mutex_init_waiter() futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock() futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock() futex: Futex_unlock_pi() determinism futex: Drop hb->lock before enqueueing on the rtmutex
Thomas Gleixner (2): rtmutex: Make wait_lock irq safe futex: Rename free_pi_state() to put_pi_state()
Xunlei Pang (2): rtmutex: Deboost before waking up the top waiter sched/rtmutex/deadline: Fix a PI crash for deadline tasks
include/linux/init_task.h | 1 + include/linux/sched.h | 2 + include/linux/sched/rt.h | 1 + kernel/fork.c | 1 + kernel/futex.c | 532 ++++++++++++++++++++++++++-------------- kernel/locking/rtmutex-debug.c | 9 - kernel/locking/rtmutex-debug.h | 3 - kernel/locking/rtmutex.c | 406 ++++++++++++++++++------------ kernel/locking/rtmutex.h | 2 - kernel/locking/rtmutex_common.h | 24 +- kernel/sched/core.c | 2 + 11 files changed, 620 insertions(+), 363 deletions(-)
From: Peter Zijlstra peterz@infradead.org
commit 499f5aca2cdd5e958b27e2655e7e7f82524f46b1 uptream.
futex_top_waiter() returns the top-waiter on the pi_mutex. Assinging this to a variable 'match' totally obscures the code.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.554710645@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Henrik Austad haustad@cisco.com --- kernel/futex.c | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index a26d217..bb87324 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1116,14 +1116,14 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key, static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb, union futex_key *key, struct futex_pi_state **ps) { - struct futex_q *match = futex_top_waiter(hb, key); + struct futex_q *top_waiter = futex_top_waiter(hb, key);
/* * If there is a waiter on that futex, validate it and * attach to the pi_state when the validation succeeds. */ - if (match) - return attach_to_pi_state(uval, match->pi_state, ps); + if (top_waiter) + return attach_to_pi_state(uval, top_waiter->pi_state, ps);
/* * We are the first waiter - try to look up the owner based on @@ -1170,7 +1170,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb, struct task_struct *task, int set_waiters) { u32 uval, newval, vpid = task_pid_vnr(task); - struct futex_q *match; + struct futex_q *top_waiter; int ret;
/* @@ -1196,9 +1196,9 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb, * Lookup existing state first. If it exists, try to attach to * its pi_state. */ - match = futex_top_waiter(hb, key); - if (match) - return attach_to_pi_state(uval, match->pi_state, ps); + top_waiter = futex_top_waiter(hb, key); + if (top_waiter) + return attach_to_pi_state(uval, top_waiter->pi_state, ps);
/* * No waiter and user TID is 0. We are here because the @@ -1288,11 +1288,11 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q) q->lock_ptr = NULL; }
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this, +static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter, struct futex_hash_bucket *hb) { struct task_struct *new_owner; - struct futex_pi_state *pi_state = this->pi_state; + struct futex_pi_state *pi_state = top_waiter->pi_state; u32 uninitialized_var(curval), newval; WAKE_Q(wake_q); bool deboost; @@ -1313,11 +1313,11 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
/* * It is possible that the next waiter (the one that brought - * this owner to the kernel) timed out and is no longer + * top_waiter owner to the kernel) timed out and is no longer * waiting on the lock. */ if (!new_owner) - new_owner = this->task; + new_owner = top_waiter->task;
/* * We pass it to the next owner. The WAITERS bit is always @@ -2639,7 +2639,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags) u32 uninitialized_var(curval), uval, vpid = task_pid_vnr(current); union futex_key key = FUTEX_KEY_INIT; struct futex_hash_bucket *hb; - struct futex_q *match; + struct futex_q *top_waiter; int ret;
retry: @@ -2663,9 +2663,9 @@ retry: * all and we at least want to know if user space fiddled * with the futex value instead of blindly unlocking. */ - match = futex_top_waiter(hb, &key); - if (match) { - ret = wake_futex_pi(uaddr, uval, match, hb); + top_waiter = futex_top_waiter(hb, &key); + if (top_waiter) { + ret = wake_futex_pi(uaddr, uval, top_waiter, hb); /* * In case of success wake_futex_pi dropped the hash * bucket lock.
From: Peter Zijlstra peterz@infradead.org
commit 1b367ece0d7e696cab1c8501bab282cc6a538b3f upstream.
Since the futex_q can dissapear the instruction after assigning NULL, this really should be a RELEASE barrier. That stops loads from hitting dead memory too.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.604296452@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Henrik Austad haustad@cisco.com --- kernel/futex.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index bb87324..9e92f12 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1284,8 +1284,7 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q) * memory barrier is required here to prevent the following * store to lock_ptr from getting ahead of the plist_del. */ - smp_wmb(); - q->lock_ptr = NULL; + smp_store_release(&q->lock_ptr, NULL); }
static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
From: Peter Zijlstra peterz@infradead.org
commit fffa954fb528963c2fb7b0c0084eb77e2be7ab52 upstream
These are unused and clutter up the code.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.652692478@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de
Conflicts: kernel/locking/rtmutex.c (WAKE_Q) Tested-by: Henrik Austad haustad@cisco.com --- kernel/locking/rtmutex-debug.c | 9 -------- kernel/locking/rtmutex-debug.h | 3 --- kernel/locking/rtmutex.c | 47 ++++++++++++++++-------------------------- kernel/locking/rtmutex.h | 2 -- 4 files changed, 18 insertions(+), 43 deletions(-)
diff --git a/kernel/locking/rtmutex-debug.c b/kernel/locking/rtmutex-debug.c index 62b6cee..0613c4b 100644 --- a/kernel/locking/rtmutex-debug.c +++ b/kernel/locking/rtmutex-debug.c @@ -173,12 +173,3 @@ void debug_rt_mutex_init(struct rt_mutex *lock, const char *name) lock->name = name; }
-void -rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task) -{ -} - -void rt_mutex_deadlock_account_unlock(struct task_struct *task) -{ -} - diff --git a/kernel/locking/rtmutex-debug.h b/kernel/locking/rtmutex-debug.h index d0519c3..b585af9 100644 --- a/kernel/locking/rtmutex-debug.h +++ b/kernel/locking/rtmutex-debug.h @@ -9,9 +9,6 @@ * This file contains macros used solely by rtmutex.c. Debug version. */
-extern void -rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task); -extern void rt_mutex_deadlock_account_unlock(struct task_struct *task); extern void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter); extern void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter); extern void debug_rt_mutex_init(struct rt_mutex *lock, const char *name); diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index b066724d..6cf9dab7 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -937,8 +937,6 @@ takeit: */ rt_mutex_set_owner(lock, task);
- rt_mutex_deadlock_account_lock(lock, task); - return 1; }
@@ -1331,8 +1329,6 @@ static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
debug_rt_mutex_unlock(lock);
- rt_mutex_deadlock_account_unlock(current); - /* * We must be careful here if the fast path is enabled. If we * have no waiters queued we cannot set owner to NULL here @@ -1398,11 +1394,10 @@ rt_mutex_fastlock(struct rt_mutex *lock, int state, struct hrtimer_sleeper *timeout, enum rtmutex_chainwalk chwalk)) { - if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) { - rt_mutex_deadlock_account_lock(lock, current); + if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) return 0; - } else - return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK); + + return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK); }
static inline int @@ -1414,21 +1409,19 @@ rt_mutex_timed_fastlock(struct rt_mutex *lock, int state, enum rtmutex_chainwalk chwalk)) { if (chwalk == RT_MUTEX_MIN_CHAINWALK && - likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) { - rt_mutex_deadlock_account_lock(lock, current); + likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) return 0; - } else - return slowfn(lock, state, timeout, chwalk); + + return slowfn(lock, state, timeout, chwalk); }
static inline int rt_mutex_fasttrylock(struct rt_mutex *lock, int (*slowfn)(struct rt_mutex *lock)) { - if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) { - rt_mutex_deadlock_account_lock(lock, current); + if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) return 1; - } + return slowfn(lock); }
@@ -1438,19 +1431,18 @@ rt_mutex_fastunlock(struct rt_mutex *lock, struct wake_q_head *wqh)) { WAKE_Q(wake_q); + bool deboost;
- if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) { - rt_mutex_deadlock_account_unlock(current); + if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) + return;
- } else { - bool deboost = slowfn(lock, &wake_q); + deboost = slowfn(lock, &wake_q);
- wake_up_q(&wake_q); + wake_up_q(&wake_q);
- /* Undo pi boosting if necessary: */ - if (deboost) - rt_mutex_adjust_prio(current); - } + /* Undo pi boosting if necessary: */ + if (deboost) + rt_mutex_adjust_prio(current); }
/** @@ -1561,10 +1553,9 @@ EXPORT_SYMBOL_GPL(rt_mutex_unlock); bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock, struct wake_q_head *wqh) { - if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) { - rt_mutex_deadlock_account_unlock(current); + if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) return false; - } + return rt_mutex_slowunlock(lock, wqh); }
@@ -1622,7 +1613,6 @@ void rt_mutex_init_proxy_locked(struct rt_mutex *lock, __rt_mutex_init(lock, NULL); debug_rt_mutex_proxy_lock(lock, proxy_owner); rt_mutex_set_owner(lock, proxy_owner); - rt_mutex_deadlock_account_lock(lock, proxy_owner); }
/** @@ -1638,7 +1628,6 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock, { debug_rt_mutex_proxy_unlock(lock); rt_mutex_set_owner(lock, NULL); - rt_mutex_deadlock_account_unlock(proxy_owner); }
/** diff --git a/kernel/locking/rtmutex.h b/kernel/locking/rtmutex.h index c406058..6607802 100644 --- a/kernel/locking/rtmutex.h +++ b/kernel/locking/rtmutex.h @@ -11,8 +11,6 @@ */
#define rt_mutex_deadlock_check(l) (0) -#define rt_mutex_deadlock_account_lock(m, t) do { } while (0) -#define rt_mutex_deadlock_account_unlock(l) do { } while (0) #define debug_rt_mutex_init_waiter(w) do { } while (0) #define debug_rt_mutex_free_waiter(w) do { } while (0) #define debug_rt_mutex_lock(l) do { } while (0)
From: Thomas Gleixner tglx@linutronix.de
commit b4abf91047cf054f203dcfac97e1038388826937 upstream.
Sasha reported a lockdep splat about a potential deadlock between RCU boosting rtmutex and the posix timer it_lock.
CPU0 CPU1
rtmutex_lock(&rcu->rt_mutex) spin_lock(&rcu->rt_mutex.wait_lock) local_irq_disable() spin_lock(&timer->it_lock) spin_lock(&rcu->mutex.wait_lock) --> Interrupt spin_lock(&timer->it_lock)
This is caused by the following code sequence on CPU1
rcu_read_lock() x = lookup(); if (x) spin_lock_irqsave(&x->it_lock); rcu_read_unlock(); return x;
We could fix that in the posix timer code by keeping rcu read locked across the spinlocked and irq disabled section, but the above sequence is common and there is no reason not to support it.
Taking rt_mutex.wait_lock irq safe prevents the deadlock.
Reported-by: Sasha Levin sasha.levin@oracle.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Peter Zijlstra peterz@infradead.org Cc: Paul McKenney paulmck@linux.vnet.ibm.com Tested-by:Henrik Austad haustad@cisco.com --- kernel/futex.c | 18 +++---- kernel/locking/rtmutex.c | 135 +++++++++++++++++++++++++---------------------- 2 files changed, 81 insertions(+), 72 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 9e92f12..0f44952 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1307,7 +1307,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter if (pi_state->owner != current) return -EINVAL;
- raw_spin_lock(&pi_state->pi_mutex.wait_lock); + raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
/* @@ -1343,22 +1343,22 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter ret = -EINVAL; } if (ret) { - raw_spin_unlock(&pi_state->pi_mutex.wait_lock); + raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); return ret; }
- raw_spin_lock_irq(&pi_state->owner->pi_lock); + raw_spin_lock(&pi_state->owner->pi_lock); WARN_ON(list_empty(&pi_state->list)); list_del_init(&pi_state->list); - raw_spin_unlock_irq(&pi_state->owner->pi_lock); + raw_spin_unlock(&pi_state->owner->pi_lock);
- raw_spin_lock_irq(&new_owner->pi_lock); + raw_spin_lock(&new_owner->pi_lock); WARN_ON(!list_empty(&pi_state->list)); list_add(&pi_state->list, &new_owner->pi_state_list); pi_state->owner = new_owner; - raw_spin_unlock_irq(&new_owner->pi_lock); + raw_spin_unlock(&new_owner->pi_lock);
- raw_spin_unlock(&pi_state->pi_mutex.wait_lock); + raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
@@ -2269,11 +2269,11 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked) * we returned due to timeout or signal without taking the * rt_mutex. Too late. */ - raw_spin_lock(&q->pi_state->pi_mutex.wait_lock); + raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock); owner = rt_mutex_owner(&q->pi_state->pi_mutex); if (!owner) owner = rt_mutex_next_owner(&q->pi_state->pi_mutex); - raw_spin_unlock(&q->pi_state->pi_mutex.wait_lock); + raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock); ret = fixup_pi_state_owner(uaddr, q, owner); goto out; } diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index 6cf9dab7..b8d08c7 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -163,13 +163,14 @@ static inline void mark_rt_mutex_waiters(struct rt_mutex *lock) * 2) Drop lock->wait_lock * 3) Try to unlock the lock with cmpxchg */ -static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock) +static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock, + unsigned long flags) __releases(lock->wait_lock) { struct task_struct *owner = rt_mutex_owner(lock);
clear_rt_mutex_waiters(lock); - raw_spin_unlock(&lock->wait_lock); + raw_spin_unlock_irqrestore(&lock->wait_lock, flags); /* * If a new waiter comes in between the unlock and the cmpxchg * we have two situations: @@ -211,11 +212,12 @@ static inline void mark_rt_mutex_waiters(struct rt_mutex *lock) /* * Simple slow path only version: lock->owner is protected by lock->wait_lock. */ -static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock) +static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock, + unsigned long flags) __releases(lock->wait_lock) { lock->owner = NULL; - raw_spin_unlock(&lock->wait_lock); + raw_spin_unlock_irqrestore(&lock->wait_lock, flags); return true; } #endif @@ -497,7 +499,6 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, int ret = 0, depth = 0; struct rt_mutex *lock; bool detect_deadlock; - unsigned long flags; bool requeue = true;
detect_deadlock = rt_mutex_cond_detect_deadlock(orig_waiter, chwalk); @@ -540,7 +541,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, /* * [1] Task cannot go away as we did a get_task() before ! */ - raw_spin_lock_irqsave(&task->pi_lock, flags); + raw_spin_lock_irq(&task->pi_lock);
/* * [2] Get the waiter on which @task is blocked on. @@ -624,7 +625,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, * operations. */ if (!raw_spin_trylock(&lock->wait_lock)) { - raw_spin_unlock_irqrestore(&task->pi_lock, flags); + raw_spin_unlock_irq(&task->pi_lock); cpu_relax(); goto retry; } @@ -655,7 +656,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, /* * No requeue[7] here. Just release @task [8] */ - raw_spin_unlock_irqrestore(&task->pi_lock, flags); + raw_spin_unlock(&task->pi_lock); put_task_struct(task);
/* @@ -663,14 +664,14 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, * If there is no owner of the lock, end of chain. */ if (!rt_mutex_owner(lock)) { - raw_spin_unlock(&lock->wait_lock); + raw_spin_unlock_irq(&lock->wait_lock); return 0; }
/* [10] Grab the next task, i.e. owner of @lock */ task = rt_mutex_owner(lock); get_task_struct(task); - raw_spin_lock_irqsave(&task->pi_lock, flags); + raw_spin_lock(&task->pi_lock);
/* * No requeue [11] here. We just do deadlock detection. @@ -685,8 +686,8 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, top_waiter = rt_mutex_top_waiter(lock);
/* [13] Drop locks */ - raw_spin_unlock_irqrestore(&task->pi_lock, flags); - raw_spin_unlock(&lock->wait_lock); + raw_spin_unlock(&task->pi_lock); + raw_spin_unlock_irq(&lock->wait_lock);
/* If owner is not blocked, end of chain. */ if (!next_lock) @@ -707,7 +708,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, rt_mutex_enqueue(lock, waiter);
/* [8] Release the task */ - raw_spin_unlock_irqrestore(&task->pi_lock, flags); + raw_spin_unlock(&task->pi_lock); put_task_struct(task);
/* @@ -725,14 +726,14 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, */ if (prerequeue_top_waiter != rt_mutex_top_waiter(lock)) wake_up_process(rt_mutex_top_waiter(lock)->task); - raw_spin_unlock(&lock->wait_lock); + raw_spin_unlock_irq(&lock->wait_lock); return 0; }
/* [10] Grab the next task, i.e. the owner of @lock */ task = rt_mutex_owner(lock); get_task_struct(task); - raw_spin_lock_irqsave(&task->pi_lock, flags); + raw_spin_lock(&task->pi_lock);
/* [11] requeue the pi waiters if necessary */ if (waiter == rt_mutex_top_waiter(lock)) { @@ -786,8 +787,8 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, top_waiter = rt_mutex_top_waiter(lock);
/* [13] Drop the locks */ - raw_spin_unlock_irqrestore(&task->pi_lock, flags); - raw_spin_unlock(&lock->wait_lock); + raw_spin_unlock(&task->pi_lock); + raw_spin_unlock_irq(&lock->wait_lock);
/* * Make the actual exit decisions [12], based on the stored @@ -810,7 +811,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, goto again;
out_unlock_pi: - raw_spin_unlock_irqrestore(&task->pi_lock, flags); + raw_spin_unlock_irq(&task->pi_lock); out_put_task: put_task_struct(task);
@@ -820,7 +821,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, /* * Try to take an rt-mutex * - * Must be called with lock->wait_lock held. + * Must be called with lock->wait_lock held and interrupts disabled * * @lock: The lock to be acquired. * @task: The task which wants to acquire the lock @@ -830,8 +831,6 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task, struct rt_mutex_waiter *waiter) { - unsigned long flags; - /* * Before testing whether we can acquire @lock, we set the * RT_MUTEX_HAS_WAITERS bit in @lock->owner. This forces all @@ -916,7 +915,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task, * case, but conditionals are more expensive than a redundant * store. */ - raw_spin_lock_irqsave(&task->pi_lock, flags); + raw_spin_lock(&task->pi_lock); task->pi_blocked_on = NULL; /* * Finish the lock acquisition. @task is the new owner. If @@ -925,7 +924,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task, */ if (rt_mutex_has_waiters(lock)) rt_mutex_enqueue_pi(task, rt_mutex_top_waiter(lock)); - raw_spin_unlock_irqrestore(&task->pi_lock, flags); + raw_spin_unlock(&task->pi_lock);
takeit: /* We got the lock. */ @@ -945,7 +944,7 @@ takeit: * * Prepare waiter and propagate pi chain * - * This must be called with lock->wait_lock held. + * This must be called with lock->wait_lock held and interrupts disabled */ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, @@ -956,7 +955,6 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, struct rt_mutex_waiter *top_waiter = waiter; struct rt_mutex *next_lock; int chain_walk = 0, res; - unsigned long flags;
/* * Early deadlock detection. We really don't want the task to @@ -970,7 +968,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, if (owner == task) return -EDEADLK;
- raw_spin_lock_irqsave(&task->pi_lock, flags); + raw_spin_lock(&task->pi_lock); __rt_mutex_adjust_prio(task); waiter->task = task; waiter->lock = lock; @@ -983,12 +981,12 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
task->pi_blocked_on = waiter;
- raw_spin_unlock_irqrestore(&task->pi_lock, flags); + raw_spin_unlock(&task->pi_lock);
if (!owner) return 0;
- raw_spin_lock_irqsave(&owner->pi_lock, flags); + raw_spin_lock(&owner->pi_lock); if (waiter == rt_mutex_top_waiter(lock)) { rt_mutex_dequeue_pi(owner, top_waiter); rt_mutex_enqueue_pi(owner, waiter); @@ -1003,7 +1001,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, /* Store the lock on which owner is blocked or NULL */ next_lock = task_blocked_on_lock(owner);
- raw_spin_unlock_irqrestore(&owner->pi_lock, flags); + raw_spin_unlock(&owner->pi_lock); /* * Even if full deadlock detection is on, if the owner is not * blocked itself, we can avoid finding this out in the chain @@ -1019,12 +1017,12 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, */ get_task_struct(owner);
- raw_spin_unlock(&lock->wait_lock); + raw_spin_unlock_irq(&lock->wait_lock);
res = rt_mutex_adjust_prio_chain(owner, chwalk, lock, next_lock, waiter, task);
- raw_spin_lock(&lock->wait_lock); + raw_spin_lock_irq(&lock->wait_lock);
return res; } @@ -1033,15 +1031,14 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, * Remove the top waiter from the current tasks pi waiter tree and * queue it up. * - * Called with lock->wait_lock held. + * Called with lock->wait_lock held and interrupts disabled. */ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q, struct rt_mutex *lock) { struct rt_mutex_waiter *waiter; - unsigned long flags;
- raw_spin_lock_irqsave(¤t->pi_lock, flags); + raw_spin_lock(¤t->pi_lock);
waiter = rt_mutex_top_waiter(lock);
@@ -1063,7 +1060,7 @@ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q, */ lock->owner = (void *) RT_MUTEX_HAS_WAITERS;
- raw_spin_unlock_irqrestore(¤t->pi_lock, flags); + raw_spin_unlock(¤t->pi_lock);
wake_q_add(wake_q, waiter->task); } @@ -1071,7 +1068,7 @@ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q, /* * Remove a waiter from a lock and give up * - * Must be called with lock->wait_lock held and + * Must be called with lock->wait_lock held and interrupts disabled. I must * have just failed to try_to_take_rt_mutex(). */ static void remove_waiter(struct rt_mutex *lock, @@ -1080,12 +1077,11 @@ static void remove_waiter(struct rt_mutex *lock, bool is_top_waiter = (waiter == rt_mutex_top_waiter(lock)); struct task_struct *owner = rt_mutex_owner(lock); struct rt_mutex *next_lock; - unsigned long flags;
- raw_spin_lock_irqsave(¤t->pi_lock, flags); + raw_spin_lock(¤t->pi_lock); rt_mutex_dequeue(lock, waiter); current->pi_blocked_on = NULL; - raw_spin_unlock_irqrestore(¤t->pi_lock, flags); + raw_spin_unlock(¤t->pi_lock);
/* * Only update priority if the waiter was the highest priority @@ -1094,7 +1090,7 @@ static void remove_waiter(struct rt_mutex *lock, if (!owner || !is_top_waiter) return;
- raw_spin_lock_irqsave(&owner->pi_lock, flags); + raw_spin_lock(&owner->pi_lock);
rt_mutex_dequeue_pi(owner, waiter);
@@ -1106,7 +1102,7 @@ static void remove_waiter(struct rt_mutex *lock, /* Store the lock on which owner is blocked or NULL */ next_lock = task_blocked_on_lock(owner);
- raw_spin_unlock_irqrestore(&owner->pi_lock, flags); + raw_spin_unlock(&owner->pi_lock);
/* * Don't walk the chain, if the owner task is not blocked @@ -1118,12 +1114,12 @@ static void remove_waiter(struct rt_mutex *lock, /* gets dropped in rt_mutex_adjust_prio_chain()! */ get_task_struct(owner);
- raw_spin_unlock(&lock->wait_lock); + raw_spin_unlock_irq(&lock->wait_lock);
rt_mutex_adjust_prio_chain(owner, RT_MUTEX_MIN_CHAINWALK, lock, next_lock, NULL, current);
- raw_spin_lock(&lock->wait_lock); + raw_spin_lock_irq(&lock->wait_lock); }
/* @@ -1159,11 +1155,11 @@ void rt_mutex_adjust_pi(struct task_struct *task) * __rt_mutex_slowlock() - Perform the wait-wake-try-to-take loop * @lock: the rt_mutex to take * @state: the state the task should block in (TASK_INTERRUPTIBLE - * or TASK_UNINTERRUPTIBLE) + * or TASK_UNINTERRUPTIBLE) * @timeout: the pre-initialized and started timer, or NULL for none * @waiter: the pre-initialized rt_mutex_waiter * - * lock->wait_lock must be held by the caller. + * Must be called with lock->wait_lock held and interrupts disabled */ static int __sched __rt_mutex_slowlock(struct rt_mutex *lock, int state, @@ -1191,13 +1187,13 @@ __rt_mutex_slowlock(struct rt_mutex *lock, int state, break; }
- raw_spin_unlock(&lock->wait_lock); + raw_spin_unlock_irq(&lock->wait_lock);
debug_rt_mutex_print_deadlock(waiter);
schedule();
- raw_spin_lock(&lock->wait_lock); + raw_spin_lock_irq(&lock->wait_lock); set_current_state(state); }
@@ -1234,17 +1230,26 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state, enum rtmutex_chainwalk chwalk) { struct rt_mutex_waiter waiter; + unsigned long flags; int ret = 0;
debug_rt_mutex_init_waiter(&waiter); RB_CLEAR_NODE(&waiter.pi_tree_entry); RB_CLEAR_NODE(&waiter.tree_entry);
- raw_spin_lock(&lock->wait_lock); + /* + * Technically we could use raw_spin_[un]lock_irq() here, but this can + * be called in early boot if the cmpxchg() fast path is disabled + * (debug, no architecture support). In this case we will acquire the + * rtmutex with lock->wait_lock held. But we cannot unconditionally + * enable interrupts in that early boot case. So we need to use the + * irqsave/restore variants. + */ + raw_spin_lock_irqsave(&lock->wait_lock, flags);
/* Try to acquire the lock again: */ if (try_to_take_rt_mutex(lock, current, NULL)) { - raw_spin_unlock(&lock->wait_lock); + raw_spin_unlock_irqrestore(&lock->wait_lock, flags); return 0; }
@@ -1273,7 +1278,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state, */ fixup_rt_mutex_waiters(lock);
- raw_spin_unlock(&lock->wait_lock); + raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
/* Remove pending timer: */ if (unlikely(timeout)) @@ -1289,6 +1294,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state, */ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock) { + unsigned long flags; int ret;
/* @@ -1300,10 +1306,10 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock) return 0;
/* - * The mutex has currently no owner. Lock the wait lock and - * try to acquire the lock. + * The mutex has currently no owner. Lock the wait lock and try to + * acquire the lock. We use irqsave here to support early boot calls. */ - raw_spin_lock(&lock->wait_lock); + raw_spin_lock_irqsave(&lock->wait_lock, flags);
ret = try_to_take_rt_mutex(lock, current, NULL);
@@ -1313,7 +1319,7 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock) */ fixup_rt_mutex_waiters(lock);
- raw_spin_unlock(&lock->wait_lock); + raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
return ret; } @@ -1325,7 +1331,10 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock) static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock, struct wake_q_head *wake_q) { - raw_spin_lock(&lock->wait_lock); + unsigned long flags; + + /* irqsave required to support early boot calls */ + raw_spin_lock_irqsave(&lock->wait_lock, flags);
debug_rt_mutex_unlock(lock);
@@ -1362,10 +1371,10 @@ static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock, */ while (!rt_mutex_has_waiters(lock)) { /* Drops lock->wait_lock ! */ - if (unlock_rt_mutex_safe(lock) == true) + if (unlock_rt_mutex_safe(lock, flags) == true) return false; /* Relock the rtmutex and try again */ - raw_spin_lock(&lock->wait_lock); + raw_spin_lock_irqsave(&lock->wait_lock, flags); }
/* @@ -1376,7 +1385,7 @@ static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock, */ mark_wakeup_next_waiter(wake_q, lock);
- raw_spin_unlock(&lock->wait_lock); + raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
/* check PI boosting */ return true; @@ -1649,10 +1658,10 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock, { int ret;
- raw_spin_lock(&lock->wait_lock); + raw_spin_lock_irq(&lock->wait_lock);
if (try_to_take_rt_mutex(lock, task, NULL)) { - raw_spin_unlock(&lock->wait_lock); + raw_spin_unlock_irq(&lock->wait_lock); return 1; }
@@ -1673,7 +1682,7 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock, if (unlikely(ret)) remove_waiter(lock, waiter);
- raw_spin_unlock(&lock->wait_lock); + raw_spin_unlock_irq(&lock->wait_lock);
debug_rt_mutex_print_deadlock(waiter);
@@ -1721,7 +1730,7 @@ int rt_mutex_finish_proxy_lock(struct rt_mutex *lock, { int ret;
- raw_spin_lock(&lock->wait_lock); + raw_spin_lock_irq(&lock->wait_lock);
set_current_state(TASK_INTERRUPTIBLE);
@@ -1737,7 +1746,7 @@ int rt_mutex_finish_proxy_lock(struct rt_mutex *lock, */ fixup_rt_mutex_waiters(lock);
- raw_spin_unlock(&lock->wait_lock); + raw_spin_unlock_irq(&lock->wait_lock);
return ret; }
From: Peter Zijlstra peterz@infradead.org
commit 5293c2efda37775346885c7e924d4ef7018ea60b upstream.
Part of what makes futex_unlock_pi() intricate is that rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop rt_mutex::wait_lock.
This means it cannot rely on the atomicy of wait_lock, which would be preferred in order to not rely on hb->lock so much.
The reason rt_mutex_slowunlock() needs to drop wait_lock is because it can race with the rt_mutex fastpath, however futexes have their own fast path.
Since futexes already have a bunch of separate rt_mutex accessors, complete that set and implement a rt_mutex variant without fastpath for them.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.702962446@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by:Henrik Austad haustad@cisco.com --- kernel/futex.c | 30 +++++++++++----------- kernel/locking/rtmutex.c | 55 ++++++++++++++++++++++++++++++----------- kernel/locking/rtmutex_common.h | 9 +++++-- 3 files changed, 62 insertions(+), 32 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 0f44952..e1200b9 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -910,7 +910,7 @@ void exit_pi_state_list(struct task_struct *curr) pi_state->owner = NULL; raw_spin_unlock_irq(&curr->pi_lock);
- rt_mutex_unlock(&pi_state->pi_mutex); + rt_mutex_futex_unlock(&pi_state->pi_mutex);
spin_unlock(&hb->lock);
@@ -1358,20 +1358,18 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter pi_state->owner = new_owner; raw_spin_unlock(&new_owner->pi_lock);
- raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); - - deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q); - /* - * First unlock HB so the waiter does not spin on it once he got woken - * up. Second wake up the waiter before the priority is adjusted. If we - * deboost first (and lose our higher priority), then the task might get - * scheduled away before the wake up can take place. + * We've updated the uservalue, this unlock cannot fail. */ + deboost = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q); + + raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); spin_unlock(&hb->lock); - wake_up_q(&wake_q); - if (deboost) + + if (deboost) { + wake_up_q(&wake_q); rt_mutex_adjust_prio(current); + }
return 0; } @@ -2259,7 +2257,7 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked) * task acquired the rt_mutex after we removed ourself from the * rt_mutex waiters list. */ - if (rt_mutex_trylock(&q->pi_state->pi_mutex)) { + if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) { locked = 1; goto out; } @@ -2574,7 +2572,7 @@ retry_private: if (!trylock) { ret = rt_mutex_timed_futex_lock(&q.pi_state->pi_mutex, to); } else { - ret = rt_mutex_trylock(&q.pi_state->pi_mutex); + ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex); /* Fixup the trylock return value: */ ret = ret ? 0 : -EWOULDBLOCK; } @@ -2597,7 +2595,7 @@ retry_private: * it and return the fault to userspace. */ if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current)) - rt_mutex_unlock(&q.pi_state->pi_mutex); + rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
/* Unqueue and drop the lock */ unqueue_me_pi(&q); @@ -2904,7 +2902,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, spin_lock(q.lock_ptr); ret = fixup_pi_state_owner(uaddr2, &q, current); if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current) - rt_mutex_unlock(&q.pi_state->pi_mutex); + rt_mutex_futex_unlock(&q.pi_state->pi_mutex); /* * Drop the reference to the pi state which * the requeue_pi() code acquired for us. @@ -2944,7 +2942,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, * userspace. */ if (ret && rt_mutex_owner(pi_mutex) == current) - rt_mutex_unlock(pi_mutex); + rt_mutex_futex_unlock(pi_mutex);
/* Unqueue and drop the lock. */ unqueue_me_pi(&q); diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index b8d08c7..28c1d40 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -1486,15 +1486,23 @@ EXPORT_SYMBOL_GPL(rt_mutex_lock_interruptible);
/* * Futex variant with full deadlock detection. + * Futex variants must not use the fast-path, see __rt_mutex_futex_unlock(). */ -int rt_mutex_timed_futex_lock(struct rt_mutex *lock, +int __sched rt_mutex_timed_futex_lock(struct rt_mutex *lock, struct hrtimer_sleeper *timeout) { might_sleep();
- return rt_mutex_timed_fastlock(lock, TASK_INTERRUPTIBLE, timeout, - RT_MUTEX_FULL_CHAINWALK, - rt_mutex_slowlock); + return rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, + timeout, RT_MUTEX_FULL_CHAINWALK); +} + +/* + * Futex variant, must not use fastpath. + */ +int __sched rt_mutex_futex_trylock(struct rt_mutex *lock) +{ + return rt_mutex_slowtrylock(lock); }
/** @@ -1553,19 +1561,38 @@ void __sched rt_mutex_unlock(struct rt_mutex *lock) EXPORT_SYMBOL_GPL(rt_mutex_unlock);
/** - * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock - * @lock: the rt_mutex to be unlocked - * - * Returns: true/false indicating whether priority adjustment is - * required or not. + * Futex variant, that since futex variants do not use the fast-path, can be + * simple and will not need to retry. */ -bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock, - struct wake_q_head *wqh) +bool __sched __rt_mutex_futex_unlock(struct rt_mutex *lock, + struct wake_q_head *wake_q) { - if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) - return false; + lockdep_assert_held(&lock->wait_lock); + + debug_rt_mutex_unlock(lock); + + if (!rt_mutex_has_waiters(lock)) { + lock->owner = NULL; + return false; /* done */ + } + + mark_wakeup_next_waiter(wake_q, lock); + return true; /* deboost and wakeups */ +}
- return rt_mutex_slowunlock(lock, wqh); +void __sched rt_mutex_futex_unlock(struct rt_mutex *lock) +{ + WAKE_Q(wake_q); + bool deboost; + + raw_spin_lock_irq(&lock->wait_lock); + deboost = __rt_mutex_futex_unlock(lock, &wake_q); + raw_spin_unlock_irq(&lock->wait_lock); + + if (deboost) { + wake_up_q(&wake_q); + rt_mutex_adjust_prio(current); + } }
/** diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h index e317e1c..2441c2d 100644 --- a/kernel/locking/rtmutex_common.h +++ b/kernel/locking/rtmutex_common.h @@ -109,9 +109,14 @@ extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock, extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock, struct hrtimer_sleeper *to, struct rt_mutex_waiter *waiter); + extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to); -extern bool rt_mutex_futex_unlock(struct rt_mutex *lock, - struct wake_q_head *wqh); +extern int rt_mutex_futex_trylock(struct rt_mutex *l); + +extern void rt_mutex_futex_unlock(struct rt_mutex *lock); +extern bool __rt_mutex_futex_unlock(struct rt_mutex *lock, + struct wake_q_head *wqh); + extern void rt_mutex_adjust_prio(struct task_struct *task);
#ifdef CONFIG_DEBUG_RT_MUTEXES
From: Peter Zijlstra peterz@infradead.org
commit 734009e96d1983ad739e5b656e03430b3660c913 upstream.
Currently futex-pi relies on hb->lock to serialize everything. But hb->lock creates another set of problems, especially priority inversions on RT where hb->lock becomes a rt_mutex itself.
The rt_mutex::wait_lock is the most obvious protection for keeping the futex user space value and the kernel internal pi_state in sync.
Rework and document the locking so rt_mutex::wait_lock is held accross all operations which modify the user space value and the pi state.
This allows to invoke rt_mutex_unlock() (including deboost) without holding hb->lock as a next step.
Nothing yet relies on the new locking rules.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.751993333@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Henrik Austad haustad@cisco.com --- kernel/futex.c | 165 +++++++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 132 insertions(+), 33 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index e1200b9..52e3678 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -967,6 +967,39 @@ void exit_pi_state_list(struct task_struct *curr) * * [10] There is no transient state which leaves owner and user space * TID out of sync. + * + * + * Serialization and lifetime rules: + * + * hb->lock: + * + * hb -> futex_q, relation + * futex_q -> pi_state, relation + * + * (cannot be raw because hb can contain arbitrary amount + * of futex_q's) + * + * pi_mutex->wait_lock: + * + * {uval, pi_state} + * + * (and pi_mutex 'obviously') + * + * p->pi_lock: + * + * p->pi_state_list -> pi_state->list, relation + * + * pi_state->refcount: + * + * pi_state lifetime + * + * + * Lock order: + * + * hb->lock + * pi_mutex->wait_lock + * p->pi_lock + * */
/* @@ -974,10 +1007,12 @@ void exit_pi_state_list(struct task_struct *curr) * the pi_state against the user space value. If correct, attach to * it. */ -static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state, +static int attach_to_pi_state(u32 __user *uaddr, u32 uval, + struct futex_pi_state *pi_state, struct futex_pi_state **ps) { pid_t pid = uval & FUTEX_TID_MASK; + int ret, uval2;
/* * Userspace might have messed up non-PI and PI futexes [3] @@ -985,9 +1020,34 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state, if (unlikely(!pi_state)) return -EINVAL;
+ /* + * We get here with hb->lock held, and having found a + * futex_top_waiter(). This means that futex_lock_pi() of said futex_q + * has dropped the hb->lock in between queue_me() and unqueue_me_pi(), + * which in turn means that futex_lock_pi() still has a reference on + * our pi_state. + */ WARN_ON(!atomic_read(&pi_state->refcount));
/* + * Now that we have a pi_state, we can acquire wait_lock + * and do the state validation. + */ + raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); + + /* + * Since {uval, pi_state} is serialized by wait_lock, and our current + * uval was read without holding it, it can have changed. Verify it + * still is what we expect it to be, otherwise retry the entire + * operation. + */ + if (get_futex_value_locked(&uval2, uaddr)) + goto out_efault; + + if (uval != uval2) + goto out_eagain; + + /* * Handle the owner died case: */ if (uval & FUTEX_OWNER_DIED) { @@ -1002,11 +1062,11 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state, * is not 0. Inconsistent state. [5] */ if (pid) - return -EINVAL; + goto out_einval; /* * Take a ref on the state and return success. [4] */ - goto out_state; + goto out_attach; }
/* @@ -1018,14 +1078,14 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state, * Take a ref on the state and return success. [6] */ if (!pid) - goto out_state; + goto out_attach; } else { /* * If the owner died bit is not set, then the pi_state * must have an owner. [7] */ if (!pi_state->owner) - return -EINVAL; + goto out_einval; }
/* @@ -1034,11 +1094,29 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state, * user space TID. [9/10] */ if (pid != task_pid_vnr(pi_state->owner)) - return -EINVAL; -out_state: + goto out_einval; + +out_attach: atomic_inc(&pi_state->refcount); + raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); *ps = pi_state; return 0; + +out_einval: + ret = -EINVAL; + goto out_error; + +out_eagain: + ret = -EAGAIN; + goto out_error; + +out_efault: + ret = -EFAULT; + goto out_error; + +out_error: + raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); + return ret; }
/* @@ -1089,6 +1167,9 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
/* * No existing pi state. First waiter. [2] + * + * This creates pi_state, we have hb->lock held, this means nothing can + * observe this state, wait_lock is irrelevant. */ pi_state = alloc_pi_state();
@@ -1113,7 +1194,8 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key, return 0; }
-static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb, +static int lookup_pi_state(u32 __user *uaddr, u32 uval, + struct futex_hash_bucket *hb, union futex_key *key, struct futex_pi_state **ps) { struct futex_q *top_waiter = futex_top_waiter(hb, key); @@ -1123,7 +1205,7 @@ static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb, * attach to the pi_state when the validation succeeds. */ if (top_waiter) - return attach_to_pi_state(uval, top_waiter->pi_state, ps); + return attach_to_pi_state(uaddr, uval, top_waiter->pi_state, ps);
/* * We are the first waiter - try to look up the owner based on @@ -1142,7 +1224,7 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval) if (unlikely(cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))) return -EFAULT;
- /*If user space value changed, let the caller retry */ + /* If user space value changed, let the caller retry */ return curval != uval ? -EAGAIN : 0; }
@@ -1198,7 +1280,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb, */ top_waiter = futex_top_waiter(hb, key); if (top_waiter) - return attach_to_pi_state(uval, top_waiter->pi_state, ps); + return attach_to_pi_state(uaddr, uval, top_waiter->pi_state, ps);
/* * No waiter and user TID is 0. We are here because the @@ -1330,6 +1412,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter
if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) { ret = -EFAULT; + } else if (curval != uval) { /* * If a unconditional UNLOCK_PI operation (user space did not @@ -1342,6 +1425,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter else ret = -EINVAL; } + if (ret) { raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); return ret; @@ -1856,7 +1940,7 @@ retry_private: * rereading and handing potential crap to * lookup_pi_state. */ - ret = lookup_pi_state(ret, hb2, &key2, &pi_state); + ret = lookup_pi_state(uaddr2, ret, hb2, &key2, &pi_state); }
switch (ret) { @@ -2128,10 +2212,13 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, { u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS; struct futex_pi_state *pi_state = q->pi_state; - struct task_struct *oldowner = pi_state->owner; u32 uval, uninitialized_var(curval), newval; + struct task_struct *oldowner; int ret;
+ raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); + + oldowner = pi_state->owner; /* Owner died? */ if (!pi_state->owner) newtid |= FUTEX_OWNER_DIED; @@ -2147,11 +2234,10 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, * because we can fault here. Imagine swapped out pages or a fork * that marked all the anonymous memory readonly for cow. * - * Modifying pi_state _before_ the user space value would - * leave the pi_state in an inconsistent state when we fault - * here, because we need to drop the hash bucket lock to - * handle the fault. This might be observed in the PID check - * in lookup_pi_state. + * Modifying pi_state _before_ the user space value would leave the + * pi_state in an inconsistent state when we fault here, because we + * need to drop the locks to handle the fault. This might be observed + * in the PID check in lookup_pi_state. */ retry: if (get_futex_value_locked(&uval, uaddr)) @@ -2172,47 +2258,60 @@ retry: * itself. */ if (pi_state->owner != NULL) { - raw_spin_lock_irq(&pi_state->owner->pi_lock); + raw_spin_lock(&pi_state->owner->pi_lock); WARN_ON(list_empty(&pi_state->list)); list_del_init(&pi_state->list); - raw_spin_unlock_irq(&pi_state->owner->pi_lock); + raw_spin_unlock(&pi_state->owner->pi_lock); }
pi_state->owner = newowner;
- raw_spin_lock_irq(&newowner->pi_lock); + raw_spin_lock(&newowner->pi_lock); WARN_ON(!list_empty(&pi_state->list)); list_add(&pi_state->list, &newowner->pi_state_list); - raw_spin_unlock_irq(&newowner->pi_lock); + raw_spin_unlock(&newowner->pi_lock); + raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); + return 0;
/* - * To handle the page fault we need to drop the hash bucket - * lock here. That gives the other task (either the highest priority - * waiter itself or the task which stole the rtmutex) the - * chance to try the fixup of the pi_state. So once we are - * back from handling the fault we need to check the pi_state - * after reacquiring the hash bucket lock and before trying to - * do another fixup. When the fixup has been done already we - * simply return. + * To handle the page fault we need to drop the locks here. That gives + * the other task (either the highest priority waiter itself or the + * task which stole the rtmutex) the chance to try the fixup of the + * pi_state. So once we are back from handling the fault we need to + * check the pi_state after reacquiring the locks and before trying to + * do another fixup. When the fixup has been done already we simply + * return. + * + * Note: we hold both hb->lock and pi_mutex->wait_lock. We can safely + * drop hb->lock since the caller owns the hb -> futex_q relation. + * Dropping the pi_mutex->wait_lock requires the state revalidate. */ handle_fault: + raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); spin_unlock(q->lock_ptr);
ret = fault_in_user_writeable(uaddr);
spin_lock(q->lock_ptr); + raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
/* * Check if someone else fixed it for us: */ - if (pi_state->owner != oldowner) - return 0; + if (pi_state->owner != oldowner) { + ret = 0; + goto out_unlock; + }
if (ret) - return ret; + goto out_unlock;
goto retry; + +out_unlock: + raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); + return ret; }
static long futex_wait_restart(struct restart_block *restart);
From: Peter Zijlstra peterz@infradead.org
commit bf92cf3a5100f5a0d5f9834787b130159397cb22 upstream.
Add a put_pit_state() as counterpart for get_pi_state() so the refcounting becomes consistent.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.801778516@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de
Conflicts: kernel/futex.c Tested-by:Henrik Austad haustad@cisco.com --- kernel/futex.c | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 52e3678..9d7d462 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -799,7 +799,7 @@ static int refill_pi_state_cache(void) return 0; }
-static struct futex_pi_state * alloc_pi_state(void) +static struct futex_pi_state *alloc_pi_state(void) { struct futex_pi_state *pi_state = current->pi_state_cache;
@@ -809,6 +809,11 @@ static struct futex_pi_state * alloc_pi_state(void) return pi_state; }
+static void get_pi_state(struct futex_pi_state *pi_state) +{ + WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount)); +} + /* * Must be called with the hb lock held. */ @@ -850,7 +855,7 @@ static void free_pi_state(struct futex_pi_state *pi_state) * Look up the task based on what TID userspace gave us. * We dont trust it. */ -static struct task_struct * futex_find_get_task(pid_t pid) +static struct task_struct *futex_find_get_task(pid_t pid) { struct task_struct *p;
@@ -1097,7 +1102,7 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval, goto out_einval;
out_attach: - atomic_inc(&pi_state->refcount); + get_pi_state(pi_state); raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); *ps = pi_state; return 0; @@ -2019,8 +2024,12 @@ retry_private: * of requeue_pi if we couldn't acquire the lock atomically. */ if (requeue_pi) { - /* Prepare the waiter to take the rt_mutex. */ - atomic_inc(&pi_state->refcount); + /* + * Prepare the waiter to take the rt_mutex. Take a + * refcount on the pi_state and store the pointer in + * the futex_q object of the waiter. + */ + get_pi_state(pi_state); this->pi_state = pi_state; ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex, this->rt_waiter,
From: Peter Zijlstra peterz@infradead.org
commit 73d786bd043ebc855f349c81ea805f6b11cbf2aa upstream.
There is a weird state in the futex_unlock_pi() path when it interleaves with a concurrent futex_lock_pi() at the point where it drops hb->lock.
In this case, it can happen that the rt_mutex wait_list and the futex_q disagree on pending waiters, in particular rt_mutex will find no pending waiters where futex_q thinks there are. In this case the rt_mutex unlock code cannot assign an owner.
The futex side fixup code has to cleanup the inconsistencies with quite a bunch of interesting corner cases.
Simplify all this by changing wake_futex_pi() to return -EAGAIN when this situation occurs. This then gives the futex_lock_pi() code the opportunity to continue and the retried futex_unlock_pi() will now observe a coherent state.
The only problem is that this breaks RT timeliness guarantees. That is, consider the following scenario:
T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)
CPU0
T1 lock_pi() queue_me() <- Waiter is visible
preemption
T2 unlock_pi() loops with -EAGAIN forever
Which is undesirable for PI primitives. Future patches will rectify this.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.850383690@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Henrik Austad haustad@cisco.com --- kernel/futex.c | 50 ++++++++++++++------------------------------------ 1 file changed, 14 insertions(+), 36 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 9d7d462..91acb65 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1398,12 +1398,19 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
/* - * It is possible that the next waiter (the one that brought - * top_waiter owner to the kernel) timed out and is no longer - * waiting on the lock. + * When we interleave with futex_lock_pi() where it does + * rt_mutex_timed_futex_lock(), we might observe @this futex_q waiter, + * but the rt_mutex's wait_list can be empty (either still, or again, + * depending on which side we land). + * + * When this happens, give up our locks and try again, giving the + * futex_lock_pi() instance time to complete, either by waiting on the + * rtmutex or removing itself from the futex queue. */ - if (!new_owner) - new_owner = top_waiter->task; + if (!new_owner) { + raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); + return -EAGAIN; + }
/* * We pass it to the next owner. The WAITERS bit is always @@ -2342,7 +2349,6 @@ static long futex_wait_restart(struct restart_block *restart); */ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked) { - struct task_struct *owner; int ret = 0;
if (locked) { @@ -2356,43 +2362,15 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked) }
/* - * Catch the rare case, where the lock was released when we were on the - * way back before we locked the hash bucket. - */ - if (q->pi_state->owner == current) { - /* - * Try to get the rt_mutex now. This might fail as some other - * task acquired the rt_mutex after we removed ourself from the - * rt_mutex waiters list. - */ - if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) { - locked = 1; - goto out; - } - - /* - * pi_state is incorrect, some other task did a lock steal and - * we returned due to timeout or signal without taking the - * rt_mutex. Too late. - */ - raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock); - owner = rt_mutex_owner(&q->pi_state->pi_mutex); - if (!owner) - owner = rt_mutex_next_owner(&q->pi_state->pi_mutex); - raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock); - ret = fixup_pi_state_owner(uaddr, q, owner); - goto out; - } - - /* * Paranoia check. If we did not take the lock, then we should not be * the owner of the rt_mutex. */ - if (rt_mutex_owner(&q->pi_state->pi_mutex) == current) + if (rt_mutex_owner(&q->pi_state->pi_mutex) == current) { printk(KERN_ERR "fixup_owner: ret = %d pi-mutex: %p " "pi-state %p\n", ret, q->pi_state->pi_mutex.owner, q->pi_state->owner); + }
out: return ret ? ret : locked;
From: Thomas Gleixner tglx@linutronix.de
commit 29e9ee5d48c35d6cf8afe09bdf03f77125c9ac11 upstream.
free_pi_state() is confusing as it is in fact only freeing/caching the pi state when the last reference is gone. Rename it to put_pi_state() which reflects better what it is doing.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Peter Zijlstra peterz@infradead.org Cc: Darren Hart darren@dvhart.com Cc: Davidlohr Bueso dave@stgolabs.net Cc: Bhuvanesh_Surachari@mentor.com Cc: Andy Lowe Andy_Lowe@mentor.com Link: http://lkml.kernel.org/r/20151219200607.259636467@linutronix.de Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Henrik Austad haustad@cisco.com --- kernel/futex.c | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 91acb65..09f698a 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -815,9 +815,12 @@ static void get_pi_state(struct futex_pi_state *pi_state) }
/* + * Drops a reference to the pi_state object and frees or caches it + * when the last reference is gone. + * * Must be called with the hb lock held. */ -static void free_pi_state(struct futex_pi_state *pi_state) +static void put_pi_state(struct futex_pi_state *pi_state) { if (!pi_state) return; @@ -1959,7 +1962,7 @@ retry_private: case 0: break; case -EFAULT: - free_pi_state(pi_state); + put_pi_state(pi_state); pi_state = NULL; double_unlock_hb(hb1, hb2); hb_waiters_dec(hb2); @@ -1976,7 +1979,7 @@ retry_private: * exit to complete. * - The user space value changed. */ - free_pi_state(pi_state); + put_pi_state(pi_state); pi_state = NULL; double_unlock_hb(hb1, hb2); hb_waiters_dec(hb2); @@ -2049,7 +2052,7 @@ retry_private: } else if (ret) { /* -EDEADLK */ this->pi_state = NULL; - free_pi_state(pi_state); + put_pi_state(pi_state); goto out_unlock; } } @@ -2058,7 +2061,7 @@ retry_private: }
out_unlock: - free_pi_state(pi_state); + put_pi_state(pi_state); double_unlock_hb(hb1, hb2); wake_up_q(&wake_q); hb_waiters_dec(hb2); @@ -2211,7 +2214,7 @@ static void unqueue_me_pi(struct futex_q *q) __unqueue_futex(q);
BUG_ON(!q->pi_state); - free_pi_state(q->pi_state); + put_pi_state(q->pi_state); q->pi_state = NULL;
spin_unlock(q->lock_ptr); @@ -2993,7 +2996,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, * Drop the reference to the pi state which * the requeue_pi() code acquired for us. */ - free_pi_state(q.pi_state); + put_pi_state(q.pi_state); spin_unlock(q.lock_ptr); } } else {
From: Peter Zijlstra peterz@infradead.org
commit 16ffa12d742534d4ff73e8b3a4e81c1de39196f0 upstream.
There's a number of 'interesting' problems, all caused by holding hb->lock while doing the rt_mutex_unlock() equivalient.
Notably:
- a PI inversion on hb->lock; and,
- a SCHED_DEADLINE crash because of pointer instability.
The previous changes:
- changed the locking rules to cover {uval,pi_state} with wait_lock.
- allow to do rt_mutex_futex_unlock() without dropping wait_lock; which in turn allows to rely on wait_lock atomicity completely.
- simplified the waiter conundrum.
It's now sufficient to hold rtmutex::wait_lock and a reference on the pi_state to protect the state consistency, so hb->lock can be dropped before calling rt_mutex_futex_unlock().
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.900002056@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de
Conflicts: kernel/futex.c Tested-by:Henrik Austad haustad@cisco.com --- kernel/futex.c | 154 +++++++++++++++++++++++++++++++++++++-------------------- 1 file changed, 100 insertions(+), 54 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 09f698a..7054ca3 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -918,10 +918,12 @@ void exit_pi_state_list(struct task_struct *curr) pi_state->owner = NULL; raw_spin_unlock_irq(&curr->pi_lock);
- rt_mutex_futex_unlock(&pi_state->pi_mutex); - + get_pi_state(pi_state); spin_unlock(&hb->lock);
+ rt_mutex_futex_unlock(&pi_state->pi_mutex); + put_pi_state(pi_state); + raw_spin_lock_irq(&curr->pi_lock); } raw_spin_unlock_irq(&curr->pi_lock); @@ -1034,6 +1036,11 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval, * has dropped the hb->lock in between queue_me() and unqueue_me_pi(), * which in turn means that futex_lock_pi() still has a reference on * our pi_state. + * + * The waiter holding a reference on @pi_state also protects against + * the unlocked put_pi_state() in futex_unlock_pi(), futex_lock_pi() + * and futex_wait_requeue_pi() as it cannot go to 0 and consequently + * free pi_state before we can take a reference ourselves. */ WARN_ON(!atomic_read(&pi_state->refcount));
@@ -1377,48 +1384,40 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q) smp_store_release(&q->lock_ptr, NULL); }
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter, - struct futex_hash_bucket *hb) +/* + * Caller must hold a reference on @pi_state. + */ +static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state) { - struct task_struct *new_owner; - struct futex_pi_state *pi_state = top_waiter->pi_state; u32 uninitialized_var(curval), newval; + struct task_struct *new_owner; + bool deboost = false; WAKE_Q(wake_q); - bool deboost; int ret = 0;
- if (!pi_state) - return -EINVAL; - - /* - * If current does not own the pi_state then the futex is - * inconsistent and user space fiddled with the futex value. - */ - if (pi_state->owner != current) - return -EINVAL; - raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); new_owner = rt_mutex_next_owner(&pi_state->pi_mutex); - - /* - * When we interleave with futex_lock_pi() where it does - * rt_mutex_timed_futex_lock(), we might observe @this futex_q waiter, - * but the rt_mutex's wait_list can be empty (either still, or again, - * depending on which side we land). - * - * When this happens, give up our locks and try again, giving the - * futex_lock_pi() instance time to complete, either by waiting on the - * rtmutex or removing itself from the futex queue. - */ if (!new_owner) { - raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); - return -EAGAIN; + /* + * Since we held neither hb->lock nor wait_lock when coming + * into this function, we could have raced with futex_lock_pi() + * such that we might observe @this futex_q waiter, but the + * rt_mutex's wait_list can be empty (either still, or again, + * depending on which side we land). + * + * When this happens, give up our locks and try again, giving + * the futex_lock_pi() instance time to complete, either by + * waiting on the rtmutex or removing itself from the futex + * queue. + */ + ret = -EAGAIN; + goto out_unlock; }
/* - * We pass it to the next owner. The WAITERS bit is always - * kept enabled while there is PI state around. We cleanup the - * owner died bit, because we are the owner. + * We pass it to the next owner. The WAITERS bit is always kept + * enabled while there is PI state around. We cleanup the owner + * died bit, because we are the owner. */ newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
@@ -1441,10 +1440,8 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter ret = -EINVAL; }
- if (ret) { - raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); - return ret; - } + if (ret) + goto out_unlock;
raw_spin_lock(&pi_state->owner->pi_lock); WARN_ON(list_empty(&pi_state->list)); @@ -1462,15 +1459,15 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter */ deboost = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+out_unlock: raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); - spin_unlock(&hb->lock);
if (deboost) { wake_up_q(&wake_q); rt_mutex_adjust_prio(current); }
- return 0; + return ret; }
/* @@ -2245,7 +2242,8 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, /* * We are here either because we stole the rtmutex from the * previous highest priority waiter or we are the highest priority - * waiter but failed to get the rtmutex the first time. + * waiter but have failed to get the rtmutex the first time. + * * We have to replace the newowner TID in the user space variable. * This must be atomic as we have to preserve the owner died bit here. * @@ -2262,7 +2260,7 @@ retry: if (get_futex_value_locked(&uval, uaddr)) goto handle_fault;
- while (1) { + for (;;) { newval = (uval & FUTEX_OWNER_DIED) | newtid;
if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) @@ -2358,6 +2356,10 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked) /* * Got the lock. We might not be the anticipated owner if we * did a lock-steal - fix up the PI-state in that case: + * + * We can safely read pi_state->owner without holding wait_lock + * because we now own the rt_mutex, only the owner will attempt + * to change it. */ if (q->pi_state->owner != current) ret = fixup_pi_state_owner(uaddr, q, current); @@ -2597,6 +2599,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int trylock) { struct hrtimer_sleeper timeout, *to = NULL; + struct futex_pi_state *pi_state = NULL; struct futex_hash_bucket *hb; struct futex_q q = futex_q_init; int res, ret; @@ -2683,12 +2686,19 @@ retry_private: * If fixup_owner() faulted and was unable to handle the fault, unlock * it and return the fault to userspace. */ - if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current)) - rt_mutex_futex_unlock(&q.pi_state->pi_mutex); + if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current)) { + pi_state = q.pi_state; + get_pi_state(pi_state); + }
/* Unqueue and drop the lock */ unqueue_me_pi(&q);
+ if (pi_state) { + rt_mutex_futex_unlock(&pi_state->pi_mutex); + put_pi_state(pi_state); + } + goto out_put_key;
out_unlock_put_key: @@ -2751,10 +2761,36 @@ retry: */ top_waiter = futex_top_waiter(hb, &key); if (top_waiter) { - ret = wake_futex_pi(uaddr, uval, top_waiter, hb); + struct futex_pi_state *pi_state = top_waiter->pi_state; + + ret = -EINVAL; + if (!pi_state) + goto out_unlock; + + /* + * If current does not own the pi_state then the futex is + * inconsistent and user space fiddled with the futex value. + */ + if (pi_state->owner != current) + goto out_unlock; + /* - * In case of success wake_futex_pi dropped the hash - * bucket lock. + * Grab a reference on the pi_state and drop hb->lock. + * + * The reference ensures pi_state lives, dropping the hb->lock + * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to + * close the races against futex_lock_pi(), but in case of + * _any_ fail we'll abort and retry the whole deal. + */ + get_pi_state(pi_state); + spin_unlock(&hb->lock); + + ret = wake_futex_pi(uaddr, uval, pi_state); + + put_pi_state(pi_state); + + /* + * Success, we're done! No tricky corner cases. */ if (!ret) goto out_putkey; @@ -2769,7 +2805,6 @@ retry: * setting the FUTEX_WAITERS bit. Try again. */ if (ret == -EAGAIN) { - spin_unlock(&hb->lock); put_futex_key(&key); goto retry; } @@ -2777,7 +2812,7 @@ retry: * wake_futex_pi has detected invalid state. Tell user * space. */ - goto out_unlock; + goto out_putkey; }
/* @@ -2787,8 +2822,10 @@ retry: * preserve the WAITERS bit not the OWNER_DIED one. We are the * owner. */ - if (cmpxchg_futex_value_locked(&curval, uaddr, uval, 0)) + if (cmpxchg_futex_value_locked(&curval, uaddr, uval, 0)) { + spin_unlock(&hb->lock); goto pi_faulted; + }
/* * If uval has changed, let user space handle it. @@ -2802,7 +2839,6 @@ out_putkey: return ret;
pi_faulted: - spin_unlock(&hb->lock); put_futex_key(&key);
ret = fault_in_user_writeable(uaddr); @@ -2906,6 +2942,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, u32 __user *uaddr2) { struct hrtimer_sleeper timeout, *to = NULL; + struct futex_pi_state *pi_state = NULL; struct rt_mutex_waiter rt_waiter; struct futex_hash_bucket *hb; union futex_key key2 = FUTEX_KEY_INIT; @@ -2990,8 +3027,10 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, if (q.pi_state && (q.pi_state->owner != current)) { spin_lock(q.lock_ptr); ret = fixup_pi_state_owner(uaddr2, &q, current); - if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current) - rt_mutex_futex_unlock(&q.pi_state->pi_mutex); + if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current) { + pi_state = q.pi_state; + get_pi_state(pi_state); + } /* * Drop the reference to the pi state which * the requeue_pi() code acquired for us. @@ -3030,13 +3069,20 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, * the fault, unlock the rt_mutex and return the fault to * userspace. */ - if (ret && rt_mutex_owner(pi_mutex) == current) - rt_mutex_futex_unlock(pi_mutex); + if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current) { + pi_state = q.pi_state; + get_pi_state(pi_state); + }
/* Unqueue and drop the lock. */ unqueue_me_pi(&q); }
+ if (pi_state) { + rt_mutex_futex_unlock(&pi_state->pi_mutex); + put_pi_state(pi_state); + } + if (ret == -EINTR) { /* * We've already been requeued, but cannot restart by calling
From: Peter Zijlstra peterz@infradead.org
commit 50809358dd7199aa7ce232f6877dd09ec30ef374 upstream.
Since there's already two copies of this code, introduce a helper now before adding a third one.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.950039479@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Henrik Austad haustad@cisco.com --- kernel/futex.c | 5 +---- kernel/locking/rtmutex.c | 12 +++++++++--- kernel/locking/rtmutex_common.h | 1 + 3 files changed, 11 insertions(+), 7 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 7054ca3..4d70fd7 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -2969,10 +2969,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, * The waiter is allocated on our stack, manipulated by the requeue * code while we sleep on uaddr. */ - debug_rt_mutex_init_waiter(&rt_waiter); - RB_CLEAR_NODE(&rt_waiter.pi_tree_entry); - RB_CLEAR_NODE(&rt_waiter.tree_entry); - rt_waiter.task = NULL; + rt_mutex_init_waiter(&rt_waiter);
ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE); if (unlikely(ret != 0)) diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index 28c1d40..8778ac3 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -1151,6 +1151,14 @@ void rt_mutex_adjust_pi(struct task_struct *task) next_lock, NULL, task); }
+void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter) +{ + debug_rt_mutex_init_waiter(waiter); + RB_CLEAR_NODE(&waiter->pi_tree_entry); + RB_CLEAR_NODE(&waiter->tree_entry); + waiter->task = NULL; +} + /** * __rt_mutex_slowlock() - Perform the wait-wake-try-to-take loop * @lock: the rt_mutex to take @@ -1233,9 +1241,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state, unsigned long flags; int ret = 0;
- debug_rt_mutex_init_waiter(&waiter); - RB_CLEAR_NODE(&waiter.pi_tree_entry); - RB_CLEAR_NODE(&waiter.tree_entry); + rt_mutex_init_waiter(&waiter);
/* * Technically we could use raw_spin_[un]lock_irq() here, but this can diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h index 2441c2d..d16de236 100644 --- a/kernel/locking/rtmutex_common.h +++ b/kernel/locking/rtmutex_common.h @@ -103,6 +103,7 @@ extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock, struct task_struct *proxy_owner); extern void rt_mutex_proxy_unlock(struct rt_mutex *lock, struct task_struct *proxy_owner); +extern void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter); extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, struct task_struct *task);
From: Peter Zijlstra peterz@infradead.org
commit 38d589f2fd08f1296aea3ce62bebd185125c6d81 upstream.
With the ultimate goal of keeping rt_mutex wait_list and futex_q waiters consistent it's necessary to split 'rt_mutex_futex_lock()' into finer parts, such that only the actual blocking can be done without hb->lock held.
Split split_mutex_finish_proxy_lock() into two parts, one that does the blocking and one that does remove_waiter() when the lock acquire failed.
When the rtmutex was acquired successfully the waiter can be removed in the acquisiton path safely, since there is no concurrency on the lock owner.
This means that, except for futex_lock_pi(), all wait_list modifications are done with both hb->lock and wait_lock held.
[bigeasy@linutronix.de: fix for futex_requeue_pi_signal_restart]
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104152.001659630@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Henrik Austad haustad@cisco.com --- kernel/futex.c | 7 ++++-- kernel/locking/rtmutex.c | 52 +++++++++++++++++++++++++++++++++++------ kernel/locking/rtmutex_common.h | 8 ++++--- 3 files changed, 55 insertions(+), 12 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 4d70fd7..dce3250 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3045,10 +3045,13 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, */ WARN_ON(!q.pi_state); pi_mutex = &q.pi_state->pi_mutex; - ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter); - debug_rt_mutex_free_waiter(&rt_waiter); + ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
spin_lock(q.lock_ptr); + if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter)) + ret = 0; + + debug_rt_mutex_free_waiter(&rt_waiter); /* * Fixup the pi_state owner and possibly acquire the lock if we * haven't already. diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index 8778ac3..78ecea6 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -1743,21 +1743,23 @@ struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock) }
/** - * rt_mutex_finish_proxy_lock() - Complete lock acquisition + * rt_mutex_wait_proxy_lock() - Wait for lock acquisition * @lock: the rt_mutex we were woken on * @to: the timeout, null if none. hrtimer should already have * been started. * @waiter: the pre-initialized rt_mutex_waiter * - * Complete the lock acquisition started our behalf by another thread. + * Wait for the the lock acquisition started on our behalf by + * rt_mutex_start_proxy_lock(). Upon failure, the caller must call + * rt_mutex_cleanup_proxy_lock(). * * Returns: * 0 - success * <0 - error, one of -EINTR, -ETIMEDOUT * - * Special API call for PI-futex requeue support + * Special API call for PI-futex support */ -int rt_mutex_finish_proxy_lock(struct rt_mutex *lock, +int rt_mutex_wait_proxy_lock(struct rt_mutex *lock, struct hrtimer_sleeper *to, struct rt_mutex_waiter *waiter) { @@ -1770,9 +1772,6 @@ int rt_mutex_finish_proxy_lock(struct rt_mutex *lock, /* sleep on the mutex */ ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter);
- if (unlikely(ret)) - remove_waiter(lock, waiter); - /* * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might * have to fix that up. @@ -1783,3 +1782,42 @@ int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
return ret; } + +/** + * rt_mutex_cleanup_proxy_lock() - Cleanup failed lock acquisition + * @lock: the rt_mutex we were woken on + * @waiter: the pre-initialized rt_mutex_waiter + * + * Attempt to clean up after a failed rt_mutex_wait_proxy_lock(). + * + * Unless we acquired the lock; we're still enqueued on the wait-list and can + * in fact still be granted ownership until we're removed. Therefore we can + * find we are in fact the owner and must disregard the + * rt_mutex_wait_proxy_lock() failure. + * + * Returns: + * true - did the cleanup, we done. + * false - we acquired the lock after rt_mutex_wait_proxy_lock() returned, + * caller should disregards its return value. + * + * Special API call for PI-futex support + */ +bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock, + struct rt_mutex_waiter *waiter) +{ + bool cleanup = false; + + raw_spin_lock_irq(&lock->wait_lock); + /* + * Unless we're the owner; we're still enqueued on the wait_list. + * So check if we became owner, if not, take us off the wait_list. + */ + if (rt_mutex_owner(lock) != current) { + remove_waiter(lock, waiter); + fixup_rt_mutex_waiters(lock); + cleanup = true; + } + raw_spin_unlock_irq(&lock->wait_lock); + + return cleanup; +} diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h index d16de236..fedd5ab 100644 --- a/kernel/locking/rtmutex_common.h +++ b/kernel/locking/rtmutex_common.h @@ -107,9 +107,11 @@ extern void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter); extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, struct task_struct *task); -extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock, - struct hrtimer_sleeper *to, - struct rt_mutex_waiter *waiter); +extern int rt_mutex_wait_proxy_lock(struct rt_mutex *lock, + struct hrtimer_sleeper *to, + struct rt_mutex_waiter *waiter); +extern bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock, + struct rt_mutex_waiter *waiter);
extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to); extern int rt_mutex_futex_trylock(struct rt_mutex *l);
From: Peter Zijlstra peterz@infradead.org
commit cfafcd117da0216520568c195cb2f6cd1980c4bb upstream.
By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() all wait_list modifications are done under both hb->lock and wait_lock.
This closes the obvious interleave pattern between futex_lock_pi() and futex_unlock_pi(), but not entirely so. See below:
Before:
futex_lock_pi() futex_unlock_pi() unlock hb->lock
lock hb->lock unlock hb->lock
lock rt_mutex->wait_lock unlock rt_mutex_wait_lock -EAGAIN
lock rt_mutex->wait_lock list_add unlock rt_mutex->wait_lock
schedule()
lock rt_mutex->wait_lock list_del unlock rt_mutex->wait_lock
<idem> -EAGAIN
lock hb->lock
After:
futex_lock_pi() futex_unlock_pi()
lock hb->lock lock rt_mutex->wait_lock list_add unlock rt_mutex->wait_lock unlock hb->lock
schedule() lock hb->lock unlock hb->lock lock hb->lock lock rt_mutex->wait_lock list_del unlock rt_mutex->wait_lock
lock rt_mutex->wait_lock unlock rt_mutex_wait_lock -EAGAIN
unlock hb->lock
It does however solve the earlier starvation/live-lock scenario which got introduced with the -EAGAIN since unlike the before scenario; where the -EAGAIN happens while futex_unlock_pi() doesn't hold any locks; in the after scenario it happens while futex_unlock_pi() actually holds a lock, and then it is serialized on that lock.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104152.062785528@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Henrik Austad haustad@cisco.com --- kernel/futex.c | 77 +++++++++++++++++++++++++++++------------ kernel/locking/rtmutex.c | 26 ++++---------- kernel/locking/rtmutex_common.h | 1 - 3 files changed, 62 insertions(+), 42 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index dce3250..1cc40dd 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -2112,20 +2112,7 @@ queue_unlock(struct futex_hash_bucket *hb) hb_waiters_dec(hb); }
-/** - * queue_me() - Enqueue the futex_q on the futex_hash_bucket - * @q: The futex_q to enqueue - * @hb: The destination hash bucket - * - * The hb->lock must be held by the caller, and is released here. A call to - * queue_me() is typically paired with exactly one call to unqueue_me(). The - * exceptions involve the PI related operations, which may use unqueue_me_pi() - * or nothing if the unqueue is done as part of the wake process and the unqueue - * state is implicit in the state of woken task (see futex_wait_requeue_pi() for - * an example). - */ -static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb) - __releases(&hb->lock) +static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *hb) { int prio;
@@ -2142,6 +2129,24 @@ static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb) plist_node_init(&q->list, prio); plist_add(&q->list, &hb->chain); q->task = current; +} + +/** + * queue_me() - Enqueue the futex_q on the futex_hash_bucket + * @q: The futex_q to enqueue + * @hb: The destination hash bucket + * + * The hb->lock must be held by the caller, and is released here. A call to + * queue_me() is typically paired with exactly one call to unqueue_me(). The + * exceptions involve the PI related operations, which may use unqueue_me_pi() + * or nothing if the unqueue is done as part of the wake process and the unqueue + * state is implicit in the state of woken task (see futex_wait_requeue_pi() for + * an example). + */ +static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb) + __releases(&hb->lock) +{ + __queue_me(q, hb); spin_unlock(&hb->lock); }
@@ -2600,6 +2605,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, { struct hrtimer_sleeper timeout, *to = NULL; struct futex_pi_state *pi_state = NULL; + struct rt_mutex_waiter rt_waiter; struct futex_hash_bucket *hb; struct futex_q q = futex_q_init; int res, ret; @@ -2652,25 +2658,52 @@ retry_private: } }
+ WARN_ON(!q.pi_state); + /* * Only actually queue now that the atomic ops are done: */ - queue_me(&q, hb); + __queue_me(&q, hb);
- WARN_ON(!q.pi_state); - /* - * Block on the PI mutex: - */ - if (!trylock) { - ret = rt_mutex_timed_futex_lock(&q.pi_state->pi_mutex, to); - } else { + if (trylock) { ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex); /* Fixup the trylock return value: */ ret = ret ? 0 : -EWOULDBLOCK; + goto no_block; + } + + /* + * We must add ourselves to the rt_mutex waitlist while holding hb->lock + * such that the hb and rt_mutex wait lists match. + */ + rt_mutex_init_waiter(&rt_waiter); + ret = rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current); + if (ret) { + if (ret == 1) + ret = 0; + + goto no_block; }
+ spin_unlock(q.lock_ptr); + + if (unlikely(to)) + hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS); + + ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter); + spin_lock(q.lock_ptr); /* + * If we failed to acquire the lock (signal/timeout), we must + * first acquire the hb->lock before removing the lock from the + * rt_mutex waitqueue, such that we can keep the hb and rt_mutex + * wait lists consistent. + */ + if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter)) + ret = 0; + +no_block: + /* * Fixup the pi_state owner and possibly acquire the lock if we * haven't already. */ diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index 78ecea6..3025f61 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -1491,19 +1491,6 @@ int __sched rt_mutex_lock_interruptible(struct rt_mutex *lock) EXPORT_SYMBOL_GPL(rt_mutex_lock_interruptible);
/* - * Futex variant with full deadlock detection. - * Futex variants must not use the fast-path, see __rt_mutex_futex_unlock(). - */ -int __sched rt_mutex_timed_futex_lock(struct rt_mutex *lock, - struct hrtimer_sleeper *timeout) -{ - might_sleep(); - - return rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, - timeout, RT_MUTEX_FULL_CHAINWALK); -} - -/* * Futex variant, must not use fastpath. */ int __sched rt_mutex_futex_trylock(struct rt_mutex *lock) @@ -1772,12 +1759,6 @@ int rt_mutex_wait_proxy_lock(struct rt_mutex *lock, /* sleep on the mutex */ ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter);
- /* - * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might - * have to fix that up. - */ - fixup_rt_mutex_waiters(lock); - raw_spin_unlock_irq(&lock->wait_lock);
return ret; @@ -1817,6 +1798,13 @@ bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock, fixup_rt_mutex_waiters(lock); cleanup = true; } + + /* + * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might + * have to fix that up. + */ + fixup_rt_mutex_waiters(lock); + raw_spin_unlock_irq(&lock->wait_lock);
return cleanup; diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h index fedd5ab..4fe3f32 100644 --- a/kernel/locking/rtmutex_common.h +++ b/kernel/locking/rtmutex_common.h @@ -113,7 +113,6 @@ extern int rt_mutex_wait_proxy_lock(struct rt_mutex *lock, extern bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock, struct rt_mutex_waiter *waiter);
-extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to); extern int rt_mutex_futex_trylock(struct rt_mutex *l);
extern void rt_mutex_futex_unlock(struct rt_mutex *lock);
From: Peter Zijlstra peterz@infradead.org
commit bebe5b514345f09be2c15e414d076b02ecb9cce8 upstream.
The problem with returning -EAGAIN when the waiter state mismatches is that it becomes very hard to proof a bounded execution time on the operation. And seeing that this is a RT operation, this is somewhat important.
While in practise; given the previous patch; it will be very unlikely to ever really take more than one or two rounds, proving so becomes rather hard.
However, now that modifying wait_list is done while holding both hb->lock and wait_lock, the scenario can be avoided entirely by acquiring wait_lock while still holding hb-lock. Doing a hand-over, without leaving a hole.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104152.112378812@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Henrik Austad haustad@cisco.com --- kernel/futex.c | 24 +++++++++++------------- 1 file changed, 11 insertions(+), 13 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 1cc40dd..14d270e 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1395,15 +1395,10 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_ WAKE_Q(wake_q); int ret = 0;
- raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); new_owner = rt_mutex_next_owner(&pi_state->pi_mutex); - if (!new_owner) { + if (WARN_ON_ONCE(!new_owner)) { /* - * Since we held neither hb->lock nor wait_lock when coming - * into this function, we could have raced with futex_lock_pi() - * such that we might observe @this futex_q waiter, but the - * rt_mutex's wait_list can be empty (either still, or again, - * depending on which side we land). + * As per the comment in futex_unlock_pi() this should not happen. * * When this happens, give up our locks and try again, giving * the futex_lock_pi() instance time to complete, either by @@ -2807,15 +2802,18 @@ retry: if (pi_state->owner != current) goto out_unlock;
+ get_pi_state(pi_state); /* - * Grab a reference on the pi_state and drop hb->lock. + * Since modifying the wait_list is done while holding both + * hb->lock and wait_lock, holding either is sufficient to + * observe it. * - * The reference ensures pi_state lives, dropping the hb->lock - * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to - * close the races against futex_lock_pi(), but in case of - * _any_ fail we'll abort and retry the whole deal. + * By taking wait_lock while still holding hb->lock, we ensure + * there is no point where we hold neither; and therefore + * wake_futex_pi() must observe a state consistent with what we + * observed. */ - get_pi_state(pi_state); + raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); spin_unlock(&hb->lock);
ret = wake_futex_pi(uaddr, uval, pi_state);
From: Peter Zijlstra peterz@infradead.org
commit 56222b212e8edb1cf51f5dd73ff645809b082b40 upstream.
When PREEMPT_RT_FULL does the spinlock -> rt_mutex substitution the PI chain code will (falsely) report a deadlock and BUG.
The problem is that it hold hb->lock (now an rt_mutex) while doing task_blocks_on_rt_mutex on the futex's pi_state::rtmutex. This, when interleaved just right with futex_unlock_pi() leads it to believe to see an AB-BA deadlock.
Task1 (holds rt_mutex, Task2 (does FUTEX_LOCK_PI) does FUTEX_UNLOCK_PI)
lock hb->lock lock rt_mutex (as per start_proxy) lock hb->lock
Which is a trivial AB-BA.
It is not an actual deadlock, because it won't be holding hb->lock by the time it actually blocks on the rt_mutex, but the chainwalk code doesn't know that and it would be a nightmare to handle this gracefully.
To avoid this problem, do the same as in futex_unlock_pi() and drop hb->lock after acquiring wait_lock. This still fully serializes against futex_unlock_pi(), since adding to the wait_list does the very same lock dance, and removing it holds both locks.
Aside of solving the RT problem this makes the lock and unlock mechanism symetric and reduces the hb->lock held time.
Reported-and-tested-by: Sebastian Andrzej Siewior bigeasy@linutronix.de Suggested-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: juri.lelli@arm.com Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104152.161341537@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Henrik Austad haustad@cisco.com --- kernel/futex.c | 30 +++++++++++++++++-------- kernel/locking/rtmutex.c | 49 +++++++++++++++++++++++------------------ kernel/locking/rtmutex_common.h | 3 +++ 3 files changed, 52 insertions(+), 30 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index 14d270e..afb02a7 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -2667,20 +2667,33 @@ retry_private: goto no_block; }
+ rt_mutex_init_waiter(&rt_waiter); + /* - * We must add ourselves to the rt_mutex waitlist while holding hb->lock - * such that the hb and rt_mutex wait lists match. + * On PREEMPT_RT_FULL, when hb->lock becomes an rt_mutex, we must not + * hold it while doing rt_mutex_start_proxy(), because then it will + * include hb->lock in the blocking chain, even through we'll not in + * fact hold it while blocking. This will lead it to report -EDEADLK + * and BUG when futex_unlock_pi() interleaves with this. + * + * Therefore acquire wait_lock while holding hb->lock, but drop the + * latter before calling rt_mutex_start_proxy_lock(). This still fully + * serializes against futex_unlock_pi() as that does the exact same + * lock handoff sequence. */ - rt_mutex_init_waiter(&rt_waiter); - ret = rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current); + raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock); + spin_unlock(q.lock_ptr); + ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current); + raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock); + if (ret) { if (ret == 1) ret = 0;
+ spin_lock(q.lock_ptr); goto no_block; }
- spin_unlock(q.lock_ptr);
if (unlikely(to)) hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS); @@ -2693,6 +2706,9 @@ retry_private: * first acquire the hb->lock before removing the lock from the * rt_mutex waitqueue, such that we can keep the hb and rt_mutex * wait lists consistent. + * + * In particular; it is important that futex_unlock_pi() can not + * observe this inconsistency. */ if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter)) ret = 0; @@ -2804,10 +2820,6 @@ retry:
get_pi_state(pi_state); /* - * Since modifying the wait_list is done while holding both - * hb->lock and wait_lock, holding either is sufficient to - * observe it. - * * By taking wait_lock while still holding hb->lock, we ensure * there is no point where we hold neither; and therefore * wake_futex_pi() must observe a state consistent with what we diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index 3025f61..b061a79 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -1659,31 +1659,14 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock, rt_mutex_set_owner(lock, NULL); }
-/** - * rt_mutex_start_proxy_lock() - Start lock acquisition for another task - * @lock: the rt_mutex to take - * @waiter: the pre-initialized rt_mutex_waiter - * @task: the task to prepare - * - * Returns: - * 0 - task blocked on lock - * 1 - acquired the lock for task, caller should wake it up - * <0 - error - * - * Special API call for FUTEX_REQUEUE_PI support. - */ -int rt_mutex_start_proxy_lock(struct rt_mutex *lock, +int __rt_mutex_start_proxy_lock(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, struct task_struct *task) { int ret;
- raw_spin_lock_irq(&lock->wait_lock); - - if (try_to_take_rt_mutex(lock, task, NULL)) { - raw_spin_unlock_irq(&lock->wait_lock); + if (try_to_take_rt_mutex(lock, task, NULL)) return 1; - }
/* We enforce deadlock detection for futexes */ ret = task_blocks_on_rt_mutex(lock, waiter, task, @@ -1702,14 +1685,38 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock, if (unlikely(ret)) remove_waiter(lock, waiter);
- raw_spin_unlock_irq(&lock->wait_lock); - debug_rt_mutex_print_deadlock(waiter);
return ret; }
/** + * rt_mutex_start_proxy_lock() - Start lock acquisition for another task + * @lock: the rt_mutex to take + * @waiter: the pre-initialized rt_mutex_waiter + * @task: the task to prepare + * + * Returns: + * 0 - task blocked on lock + * 1 - acquired the lock for task, caller should wake it up + * <0 - error + * + * Special API call for FUTEX_REQUEUE_PI support. + */ +int rt_mutex_start_proxy_lock(struct rt_mutex *lock, + struct rt_mutex_waiter *waiter, + struct task_struct *task) +{ + int ret; + + raw_spin_lock_irq(&lock->wait_lock); + ret = __rt_mutex_start_proxy_lock(lock, waiter, task); + raw_spin_unlock_irq(&lock->wait_lock); + + return ret; +} + +/** * rt_mutex_next_owner - return the next owner of the lock * * @lock: the rt lock query diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h index 4fe3f32..25ccf71 100644 --- a/kernel/locking/rtmutex_common.h +++ b/kernel/locking/rtmutex_common.h @@ -104,6 +104,9 @@ extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock, extern void rt_mutex_proxy_unlock(struct rt_mutex *lock, struct task_struct *proxy_owner); extern void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter); +extern int __rt_mutex_start_proxy_lock(struct rt_mutex *lock, + struct rt_mutex_waiter *waiter, + struct task_struct *task); extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, struct task_struct *task);
From: Xunlei Pang xlpang@redhat.com
commit 2a1c6029940675abb2217b590512dbf691867ec4 upstream.
We should deboost before waking the high-priority task, such that we don't run two tasks with the same "state" (priority, deadline, sched_class, etc).
In order to make sure the boosting task doesn't start running between unlock and deboost (due to 'spurious' wakeup), we move the deboost under the wait_lock, that way its serialized against the wait loop in __rt_mutex_slowlock().
Doing the deboost early can however lead to priority-inversion if current would get preempted after the deboost but before waking our high-prio task, hence we disable preemption before doing deboost, and enabling it after the wake up is over.
This gets us the right semantic order, but most importantly however; this change ensures pointer stability for the next patch, where we have rt_mutex_setprio() cache a pointer to the top-most waiter task. If we, as before this change, do the wakeup first and then deboost, this pointer might point into thin air.
[peterz: Changelog + patch munging] Suggested-by: Peter Zijlstra peterz@infradead.org Signed-off-by: Xunlei Pang xlpang@redhat.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Acked-by: Steven Rostedt rostedt@goodmis.org Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170323150216.110065320@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Henrik Austad haustad@cisco.com --- kernel/futex.c | 5 +--- kernel/locking/rtmutex.c | 59 ++++++++++++++++++++++------------------- kernel/locking/rtmutex_common.h | 2 +- 3 files changed, 34 insertions(+), 32 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c index afb02a7..63fa840 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1457,10 +1457,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_ out_unlock: raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
- if (deboost) { - wake_up_q(&wake_q); - rt_mutex_adjust_prio(current); - } + rt_mutex_postunlock(&wake_q, deboost);
return ret; } diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index b061a79..c01d7f4 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -371,24 +371,6 @@ static void __rt_mutex_adjust_prio(struct task_struct *task) }
/* - * Adjust task priority (undo boosting). Called from the exit path of - * rt_mutex_slowunlock() and rt_mutex_slowlock(). - * - * (Note: We do this outside of the protection of lock->wait_lock to - * allow the lock to be taken while or before we readjust the priority - * of task. We do not use the spin_xx_mutex() variants here as we are - * outside of the debug path.) - */ -void rt_mutex_adjust_prio(struct task_struct *task) -{ - unsigned long flags; - - raw_spin_lock_irqsave(&task->pi_lock, flags); - __rt_mutex_adjust_prio(task); - raw_spin_unlock_irqrestore(&task->pi_lock, flags); -} - -/* * Deadlock detection is conditional: * * If CONFIG_DEBUG_RT_MUTEXES=n, deadlock detection is only conducted @@ -1049,6 +1031,7 @@ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q, * lock->wait_lock. */ rt_mutex_dequeue_pi(current, waiter); + __rt_mutex_adjust_prio(current);
/* * As we are waking up the top waiter, and the waiter stays @@ -1391,6 +1374,16 @@ static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock, */ mark_wakeup_next_waiter(wake_q, lock);
+ /* + * We should deboost before waking the top waiter task such that + * we don't run two tasks with the 'same' priority. This however + * can lead to prio-inversion if we would get preempted after + * the deboost but before waking our high-prio task, hence the + * preempt_disable before unlock. Pairs with preempt_enable() in + * rt_mutex_postunlock(); + */ + preempt_disable(); + raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
/* check PI boosting */ @@ -1440,6 +1433,18 @@ rt_mutex_fasttrylock(struct rt_mutex *lock, return slowfn(lock); }
+/* + * Undo pi boosting (if necessary) and wake top waiter. + */ +void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost) +{ + wake_up_q(wake_q); + + /* Pairs with preempt_disable() in rt_mutex_slowunlock() */ + if (deboost) + preempt_enable(); +} + static inline void rt_mutex_fastunlock(struct rt_mutex *lock, bool (*slowfn)(struct rt_mutex *lock, @@ -1453,11 +1458,7 @@ rt_mutex_fastunlock(struct rt_mutex *lock,
deboost = slowfn(lock, &wake_q);
- wake_up_q(&wake_q); - - /* Undo pi boosting if necessary: */ - if (deboost) - rt_mutex_adjust_prio(current); + rt_mutex_postunlock(&wake_q, deboost); }
/** @@ -1570,6 +1571,13 @@ bool __sched __rt_mutex_futex_unlock(struct rt_mutex *lock, }
mark_wakeup_next_waiter(wake_q, lock); + /* + * We've already deboosted, retain preempt_disabled when dropping + * the wait_lock to avoid inversion until the wakeup. Matched + * by rt_mutex_postunlock(); + */ + preempt_disable(); + return true; /* deboost and wakeups */ }
@@ -1582,10 +1590,7 @@ void __sched rt_mutex_futex_unlock(struct rt_mutex *lock) deboost = __rt_mutex_futex_unlock(lock, &wake_q); raw_spin_unlock_irq(&lock->wait_lock);
- if (deboost) { - wake_up_q(&wake_q); - rt_mutex_adjust_prio(current); - } + rt_mutex_postunlock(&wake_q, deboost); }
/** diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h index 25ccf71..6f07357 100644 --- a/kernel/locking/rtmutex_common.h +++ b/kernel/locking/rtmutex_common.h @@ -122,7 +122,7 @@ extern void rt_mutex_futex_unlock(struct rt_mutex *lock); extern bool __rt_mutex_futex_unlock(struct rt_mutex *lock, struct wake_q_head *wqh);
-extern void rt_mutex_adjust_prio(struct task_struct *task); +extern void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost);
#ifdef CONFIG_DEBUG_RT_MUTEXES # include "rtmutex-debug.h"
From: Xunlei Pang xlpang@redhat.com
commit e96a7705e7d3fef96aec9b590c63b2f6f7d2ba22 upstream.
A crash happened while I was playing with deadline PI rtmutex.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30 PGD 232a75067 PUD 230947067 PMD 0 Oops: 0000 [#1] SMP CPU: 1 PID: 10994 Comm: a.out Not tainted
Call Trace: [<ffffffff810b658c>] enqueue_task+0x2c/0x80 [<ffffffff810ba763>] activate_task+0x23/0x30 [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260 [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20 [<ffffffff8164e783>] __schedule+0xd3/0x900 [<ffffffff8164efd9>] schedule+0x29/0x70 [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0 [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190 [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60 [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390 [<ffffffff810ed8b0>] do_futex+0x190/0x5b0 [<ffffffff810edd50>] SyS_futex+0x80/0x180
This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi() are only protected by pi_lock when operating pi waiters, while rt_mutex_get_top_task(), will access them with rq lock held but not holding pi_lock.
In order to tackle it, we introduce new "pi_top_task" pointer cached in task_struct, and add new rt_mutex_update_top_task() to update its value, it can be called by rt_mutex_setprio() which held both owner's pi_lock and rq lock. Thus "pi_top_task" can be safely accessed by enqueue_task_dl() under rq lock.
Originally-From: Peter Zijlstra peterz@infradead.org Signed-off-by: Xunlei Pang xlpang@redhat.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Acked-by: Steven Rostedt rostedt@goodmis.org Reviewed-by: Thomas Gleixner tglx@linutronix.de Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.org Signed-off-by: Thomas Gleixner tglx@linutronix.de
Conflicts: include/linux/sched.h Tested-by:Henrik Austad haustad@cisco.com --- include/linux/init_task.h | 1 + include/linux/sched.h | 2 ++ include/linux/sched/rt.h | 1 + kernel/fork.c | 1 + kernel/locking/rtmutex.c | 29 +++++++++++++++++++++-------- kernel/sched/core.c | 2 ++ 6 files changed, 28 insertions(+), 8 deletions(-)
diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 1c1ff7e..a561ce0c 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -162,6 +162,7 @@ extern struct task_group root_task_group; #ifdef CONFIG_RT_MUTEXES # define INIT_RT_MUTEXES(tsk) \ .pi_waiters = RB_ROOT, \ + .pi_top_task = NULL, \ .pi_waiters_leftmost = NULL, #else # define INIT_RT_MUTEXES(tsk) diff --git a/include/linux/sched.h b/include/linux/sched.h index b30540d..89cd0d0 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1617,6 +1617,8 @@ struct task_struct { /* PI waiters blocked on a rt_mutex held by this task */ struct rb_root pi_waiters; struct rb_node *pi_waiters_leftmost; + /* Updated under owner's pi_lock and rq lock */ + struct task_struct *pi_top_task; /* Deadlock detection and priority inheritance handling */ struct rt_mutex_waiter *pi_blocked_on; #endif diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h index a30b172..60d0c47 100644 --- a/include/linux/sched/rt.h +++ b/include/linux/sched/rt.h @@ -19,6 +19,7 @@ static inline int rt_task(struct task_struct *p) extern int rt_mutex_getprio(struct task_struct *p); extern void rt_mutex_setprio(struct task_struct *p, int prio); extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio); +extern void rt_mutex_update_top_task(struct task_struct *p); extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task); extern void rt_mutex_adjust_pi(struct task_struct *p); static inline bool tsk_is_pi_blocked(struct task_struct *tsk) diff --git a/kernel/fork.c b/kernel/fork.c index dd2f79a..9376270 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1242,6 +1242,7 @@ static void rt_mutex_init_task(struct task_struct *p) #ifdef CONFIG_RT_MUTEXES p->pi_waiters = RB_ROOT; p->pi_waiters_leftmost = NULL; + p->pi_top_task = NULL; p->pi_blocked_on = NULL; #endif } diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index c01d7f4..dd3b1e9 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -321,6 +321,19 @@ rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter) }
/* + * Must hold both p->pi_lock and task_rq(p)->lock. + */ +void rt_mutex_update_top_task(struct task_struct *p) +{ + if (!task_has_pi_waiters(p)) { + p->pi_top_task = NULL; + return; + } + + p->pi_top_task = task_top_pi_waiter(p)->task; +} + +/* * Calculate task priority from the waiter tree priority * * Return task->normal_prio when the waiter tree is empty or when @@ -335,12 +348,12 @@ int rt_mutex_getprio(struct task_struct *task) task->normal_prio); }
+/* + * Must hold either p->pi_lock or task_rq(p)->lock. + */ struct task_struct *rt_mutex_get_top_task(struct task_struct *task) { - if (likely(!task_has_pi_waiters(task))) - return NULL; - - return task_top_pi_waiter(task)->task; + return task->pi_top_task; }
/* @@ -349,12 +362,12 @@ struct task_struct *rt_mutex_get_top_task(struct task_struct *task) */ int rt_mutex_get_effective_prio(struct task_struct *task, int newprio) { - if (!task_has_pi_waiters(task)) + struct task_struct *top_task = rt_mutex_get_top_task(task); + + if (!top_task) return newprio;
- if (task_top_pi_waiter(task)->task->prio <= newprio) - return task_top_pi_waiter(task)->task->prio; - return newprio; + return min(top_task->prio, newprio); }
/* diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 65ed350..0c24d1f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3417,6 +3417,8 @@ void rt_mutex_setprio(struct task_struct *p, int prio) goto out_unlock; }
+ rt_mutex_update_top_task(p); + trace_sched_pi_setprio(p, prio); oldprio = p->prio; prev_class = p->sched_class;
On Fri, Nov 09, 2018 at 11:07:28AM +0100, Henrik Austad wrote:
From: Henrik Austad haustad@cisco.com
Short story:
Sorry for the spam, it looks like I was not very specific in /which/ version I targeted this to, as well as not providing a full Cc-list for the cover-letter.
The series is targeted at stable v4.4.162.
Expanding Cc-list to those missing from the first attempt.
-Henrik
The following patches are needed on a 4.4 kernel to avoid Oops in the scheduler when a sched_rr and sched_deadline task contends on the same futex (with PI).
Longer story:
On one of our arm64 systems, we occasionally crash with an Oops in the scheduler with the following backtrace.
[<ffffffc0000ee398>] enqueue_task_dl+0x1f0/0x420 [<ffffffc0000d0f14>] activate_task+0x7c/0x90 [<ffffffc0000edbdc>] push_dl_task+0x164/0x1c8 [<ffffffc0000edc60>] push_dl_tasks+0x20/0x30 [<ffffffc0000cc00c>] __balance_callback+0x44/0x68 [<ffffffc000d2c018>] __schedule+0x6f0/0x728 [<ffffffc000d2c278>] schedule+0x78/0x98 [<ffffffc000d2e76c>] __rt_mutex_slowlock+0x9c/0x108 [<ffffffc000d2e9d0>] rt_mutex_slowlock+0xd8/0x198 [<ffffffc0000f7f28>] rt_mutex_timed_futex_lock+0x30/0x40 [<ffffffc00012c1a8>] futex_lock_pi+0x200/0x3b0 [<ffffffc00012cf84>] do_futex+0x1c4/0x550 [<ffffffc00012d92c>] compat_SyS_futex+0x10c/0x138 [<ffffffc00008504c>] __sys_trace_return+0x0/0x4
This seems to be the same bug Xuneli Pang triggered and fixed in e96a7705e7d3 "sched/rtmutex/deadline: Fix a PI crash for deadline tasks". As noted by Peter Zijlstra in the previous attempt, this fix requires a few other patches, most notably the FUTEX_UNLOCK_PI series [1]
Testing this on a dual-core VM I have not been able to reproduce the same crash, but pi_stress (part of the rt-test suite) reveals that vanilla 4.4.162 behaves rather badly with a mix of deadline and sched_(rr|fifo) tasks:
time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=100000,deadline=200000,period=200000 Starting PI Stress Test Number of thread groups: 1 Duration of test run: infinite Number of inversions per group: unlimited Admin thread SCHED_RR priority 4 1 groups of 3 threads will be created High thread SCHED_DEADLINE runtime 100000 deadline 200000 period 200000 Med thread SCHED_RR priority 2 Low thread SCHED_RR priority 1 Current Inversions: 141627 WATCHDOG triggered: group 0 is deadlocked! reporter stopping due to watchdog event Stopping test Terminated
real 0m26.291s user 0m0.148s sys 0m18.819s
With this series applied, the test ran for ~4.5 hours and again for 129 minutes (when I remembered to time it) before crashing:
time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=100000,deadline=200000,period=200000 Starting PI Stress Test Number of thread groups: 1 Duration of test run: infinite Number of inversions per group: unlimited Admin thread SCHED_RR priority 4 1 groups of 3 threads will be created High thread SCHED_DEADLINE runtime 100000 deadline 200000 period 200000 Med thread SCHED_RR priority 2 Low thread SCHED_RR priority 1 Current Inversions: 51985223 WATCHDOG triggered: group 0 is deadlocked! reporter stopping due to watchdog event Stopping test Terminated
real 129m38.807s user 0m59.084s sys 109m53.666s
So clearly not perfect, but a *lot* better.
The same series on our vendor-4.4 kernel moves pi_stress up from ~30 seconds before deadlock up to the same level as the VM (the test is still going as of this writing).
I suspect other users of 4.4 would benefit from having these patches backported, so tag them for stable. I assume 4.9 and 4.14 could benefit as well, but I have not had time to look into those.
Peter Zijlstra (13): futex: Cleanup variable names for futex_top_waiter() futex: Use smp_store_release() in mark_wake_futex() futex: Remove rt_mutex_deadlock_account_*() futex,rt_mutex: Provide futex specific rt_mutex API futex: Change locking rules futex: Cleanup refcounting futex: Rework inconsistent rt_mutex/futex_q state futex: Pull rt_mutex_futex_unlock() out from under hb->lock futex,rt_mutex: Introduce rt_mutex_init_waiter() futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock() futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock() futex: Futex_unlock_pi() determinism futex: Drop hb->lock before enqueueing on the rtmutex
Thomas Gleixner (2): rtmutex: Make wait_lock irq safe futex: Rename free_pi_state() to put_pi_state()
Xunlei Pang (2): rtmutex: Deboost before waking up the top waiter sched/rtmutex/deadline: Fix a PI crash for deadline tasks
include/linux/init_task.h | 1 + include/linux/sched.h | 2 + include/linux/sched/rt.h | 1 + kernel/fork.c | 1 + kernel/futex.c | 532 ++++++++++++++++++++++++++-------------- kernel/locking/rtmutex-debug.c | 9 - kernel/locking/rtmutex-debug.h | 3 - kernel/locking/rtmutex.c | 406 ++++++++++++++++++------------ kernel/locking/rtmutex.h | 2 - kernel/locking/rtmutex_common.h | 24 +- kernel/sched/core.c | 2 + 11 files changed, 620 insertions(+), 363 deletions(-)
-- 2.7.4
On Fri, Nov 09, 2018 at 11:35:31AM +0100, Henrik Austad wrote:
On Fri, Nov 09, 2018 at 11:07:28AM +0100, Henrik Austad wrote:
From: Henrik Austad haustad@cisco.com
Short story:
Sorry for the spam, it looks like I was not very specific in /which/ version I targeted this to, as well as not providing a full Cc-list for the cover-letter.
Gentle prod. I realize this was sent out just before plumbers and that people had pretty packed agendas, so a small nudge to gain a spot closer to the top of the inbox :)
This series has now been running on an arm64 system for 9 days without any issues and pi_stress showed a dramatic improvement from ~30 seconds and up to several ours (it finally deadlocked at 3.9e9 inversions).
I'd greatly appreciate if someone could give the list of patches a quick glance to verify that I got all the required patches and then if it could be added to 4.4.y.
Thanks!
-Henrik
The series is targeted at stable v4.4.162.
Expanding Cc-list to those missing from the first attempt.
-Henrik
The following patches are needed on a 4.4 kernel to avoid Oops in the scheduler when a sched_rr and sched_deadline task contends on the same futex (with PI).
Longer story:
On one of our arm64 systems, we occasionally crash with an Oops in the scheduler with the following backtrace.
[<ffffffc0000ee398>] enqueue_task_dl+0x1f0/0x420 [<ffffffc0000d0f14>] activate_task+0x7c/0x90 [<ffffffc0000edbdc>] push_dl_task+0x164/0x1c8 [<ffffffc0000edc60>] push_dl_tasks+0x20/0x30 [<ffffffc0000cc00c>] __balance_callback+0x44/0x68 [<ffffffc000d2c018>] __schedule+0x6f0/0x728 [<ffffffc000d2c278>] schedule+0x78/0x98 [<ffffffc000d2e76c>] __rt_mutex_slowlock+0x9c/0x108 [<ffffffc000d2e9d0>] rt_mutex_slowlock+0xd8/0x198 [<ffffffc0000f7f28>] rt_mutex_timed_futex_lock+0x30/0x40 [<ffffffc00012c1a8>] futex_lock_pi+0x200/0x3b0 [<ffffffc00012cf84>] do_futex+0x1c4/0x550 [<ffffffc00012d92c>] compat_SyS_futex+0x10c/0x138 [<ffffffc00008504c>] __sys_trace_return+0x0/0x4
This seems to be the same bug Xuneli Pang triggered and fixed in e96a7705e7d3 "sched/rtmutex/deadline: Fix a PI crash for deadline tasks". As noted by Peter Zijlstra in the previous attempt, this fix requires a few other patches, most notably the FUTEX_UNLOCK_PI series [1]
Testing this on a dual-core VM I have not been able to reproduce the same crash, but pi_stress (part of the rt-test suite) reveals that vanilla 4.4.162 behaves rather badly with a mix of deadline and sched_(rr|fifo) tasks:
time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=100000,deadline=200000,period=200000 Starting PI Stress Test Number of thread groups: 1 Duration of test run: infinite Number of inversions per group: unlimited Admin thread SCHED_RR priority 4 1 groups of 3 threads will be created High thread SCHED_DEADLINE runtime 100000 deadline 200000 period 200000 Med thread SCHED_RR priority 2 Low thread SCHED_RR priority 1 Current Inversions: 141627 WATCHDOG triggered: group 0 is deadlocked! reporter stopping due to watchdog event Stopping test Terminated
real 0m26.291s user 0m0.148s sys 0m18.819s
With this series applied, the test ran for ~4.5 hours and again for 129 minutes (when I remembered to time it) before crashing:
time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=100000,deadline=200000,period=200000 Starting PI Stress Test Number of thread groups: 1 Duration of test run: infinite Number of inversions per group: unlimited Admin thread SCHED_RR priority 4 1 groups of 3 threads will be created High thread SCHED_DEADLINE runtime 100000 deadline 200000 period 200000 Med thread SCHED_RR priority 2 Low thread SCHED_RR priority 1 Current Inversions: 51985223 WATCHDOG triggered: group 0 is deadlocked! reporter stopping due to watchdog event Stopping test Terminated
real 129m38.807s user 0m59.084s sys 109m53.666s
So clearly not perfect, but a *lot* better.
The same series on our vendor-4.4 kernel moves pi_stress up from ~30 seconds before deadlock up to the same level as the VM (the test is still going as of this writing).
I suspect other users of 4.4 would benefit from having these patches backported, so tag them for stable. I assume 4.9 and 4.14 could benefit as well, but I have not had time to look into those.
Peter Zijlstra (13): futex: Cleanup variable names for futex_top_waiter() futex: Use smp_store_release() in mark_wake_futex() futex: Remove rt_mutex_deadlock_account_*() futex,rt_mutex: Provide futex specific rt_mutex API futex: Change locking rules futex: Cleanup refcounting futex: Rework inconsistent rt_mutex/futex_q state futex: Pull rt_mutex_futex_unlock() out from under hb->lock futex,rt_mutex: Introduce rt_mutex_init_waiter() futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock() futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock() futex: Futex_unlock_pi() determinism futex: Drop hb->lock before enqueueing on the rtmutex
Thomas Gleixner (2): rtmutex: Make wait_lock irq safe futex: Rename free_pi_state() to put_pi_state()
Xunlei Pang (2): rtmutex: Deboost before waking up the top waiter sched/rtmutex/deadline: Fix a PI crash for deadline tasks
include/linux/init_task.h | 1 + include/linux/sched.h | 2 + include/linux/sched/rt.h | 1 + kernel/fork.c | 1 + kernel/futex.c | 532 ++++++++++++++++++++++++++-------------- kernel/locking/rtmutex-debug.c | 9 - kernel/locking/rtmutex-debug.h | 3 - kernel/locking/rtmutex.c | 406 ++++++++++++++++++------------ kernel/locking/rtmutex.h | 2 - kernel/locking/rtmutex_common.h | 24 +- kernel/sched/core.c | 2 + 11 files changed, 620 insertions(+), 363 deletions(-)
-- 2.7.4
-- Henrik Austad
On Mon, Nov 19, 2018 at 12:27:21PM +0100, Henrik Austad wrote:
On Fri, Nov 09, 2018 at 11:35:31AM +0100, Henrik Austad wrote:
On Fri, Nov 09, 2018 at 11:07:28AM +0100, Henrik Austad wrote:
From: Henrik Austad haustad@cisco.com
Short story:
Sorry for the spam, it looks like I was not very specific in /which/ version I targeted this to, as well as not providing a full Cc-list for the cover-letter.
Gentle prod. I realize this was sent out just before plumbers and that people had pretty packed agendas, so a small nudge to gain a spot closer to the top of the inbox :)
This series has now been running on an arm64 system for 9 days without any issues and pi_stress showed a dramatic improvement from ~30 seconds and up to several ours (it finally deadlocked at 3.9e9 inversions).
I'd greatly appreciate if someone could give the list of patches a quick glance to verify that I got all the required patches and then if it could be added to 4.4.y.
This is a really intrusive series of patches, and without some testing and verification by others, I am really reluctant to take these patches.
Why not just move to the 4.9.y tree, or better yet, 4.19.y to resolve this issue for your systems?
thanks,
greg k-h
On Fri, Dec 14, 2018 at 08:18:26AM +0100, Greg Kroah-Hartman wrote:
On Mon, Nov 19, 2018 at 12:27:21PM +0100, Henrik Austad wrote:
On Fri, Nov 09, 2018 at 11:35:31AM +0100, Henrik Austad wrote:
On Fri, Nov 09, 2018 at 11:07:28AM +0100, Henrik Austad wrote:
From: Henrik Austad haustad@cisco.com
Short story:
Sorry for the spam, it looks like I was not very specific in /which/ version I targeted this to, as well as not providing a full Cc-list for the cover-letter.
Gentle prod. I realize this was sent out just before plumbers and that people had pretty packed agendas, so a small nudge to gain a spot closer to the top of the inbox :)
This series has now been running on an arm64 system for 9 days without any issues and pi_stress showed a dramatic improvement from ~30 seconds and up to several ours (it finally deadlocked at 3.9e9 inversions).
I'd greatly appreciate if someone could give the list of patches a quick glance to verify that I got all the required patches and then if it could be added to 4.4.y.
Hi Greg,
This is a really intrusive series of patches, and without some testing and verification by others, I am really reluctant to take these patches.
Yes I know, they are intrusive, and they touch core parts of the kernel in interesting ways.
I completely agree with the need for testing, and I do not _expect_ these pathces to be merged. It was a "this was useful for us, it is probably useful for others" kind of series.
Perhaps it is not that many others out there using pi_futex shared between a sched_rr thread and a sched_deadline thread, which is how you back yourself into this corner.
Why not just move to the 4.9.y tree, or better yet, 4.19.y to resolve this issue for your systems?
That would indeed be the best solution, but vendor will not update kernel past 4.4 for this particular SoC, so we have no way of moving this to a later kernel :(
Anyway, I'm happy to carry these in our local tree for our own use. If something pops up in our internal testing requiring update to the series, I'll send an update for others to see should they experience the same issue. :)
Thanks for the reply!
linux-stable-mirror@lists.linaro.org