Quoting Linus Torvalds (2020-04-07 18:28:34)
On Tue, Apr 7, 2020 at 9:04 AM Chris Wilson <chris@chris-wilson.co.uk> wrote:
It'll take some time to reconstruct the original report, but the case in question was: on removing the last element of the last list, we switch to a global lock over all such lists in order to park the HW, and in doing so add one more element to the original list. [That is, if we happen to be retiring along the kernel timeline in the first place.]
Please point to the real code and the list.
Honestly, what you describe sounds complex enough that I think your locking is simply just buggy.
IOW, this patch seems to really just paper over a locking bug, and the KASAN report tied to it.
Because the fundamental issue is that READ_ONCE() can not fix a bug here. Reading the next pointer once fundamentally cannot matter if it can change concurrently: the code is buggy, and the READ_ONCE() just means that it gets one or the other value randomly, and that the list walking is fundamentally racy.
On the other hand, if the next pointer _cannot_ change concurrently, then READ_ONCE() cannot make a difference.
So as far as I can tell, we have two possibilities, and in both cases changing the code to use READ_ONCE() is not the right thing to do. In one case it hides a bug, and in the other case it's just pointless.
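[For context, READ_ONCE() reduces to roughly the following -- a simplified sketch, not the exact definition from <linux/compiler.h> / <asm-generic/rwonce.h>:]

/*
 * Simplified sketch of READ_ONCE(): a single load through a
 * volatile-qualified pointer, so the compiler may neither refetch
 * nor tear it.  Nothing here orders the load against a concurrent
 * writer, which is the point being made above.
 */
#define READ_ONCE_SKETCH(x)	(*(const volatile typeof(x) *)&(x))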
list->next changes from pointing at the list_head to pointing at the new element instead. However, we don't need to inspect that next element at all; we want the list iteration to terminate.
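[A hypothetical sketch of that scenario -- invented names, not the i915 code; the point is that the 'safe' iterator has already captured the successor before the body runs:]

#include <linux/list.h>

struct request { struct list_head link; };

void maybe_park(void);	/* hypothetical: may list_add_tail() onto this same list */

static void retire_all(struct list_head *requests)
{
	struct list_head *pos, *n;

	list_for_each_safe(pos, n, requests) {	/* n = pos->next is captured here */
		list_del(pos);			/* retire the element ...         */
		maybe_park();			/* ... which, for the final one,  */
						/* may add a new element          */
	}
	/*
	 * When the deleted element was the last one, n already points at the
	 * list head, so the walk terminates without ever visiting the element
	 * added by maybe_park().
	 */
}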
I'd really like to see the actual code that has that list walking. You say:
For reference, drivers/gpu/drm/i915/gt/intel_engine_pm.c::__engine_park()
.. but that function doesn't have any locking or list-walking. Is it the "call_idle_barriers()" loop? What is it?
list_for_each_safe @ intel_gt_requests.c::retire_requests
list_del           @ i915_requests.c::i915_request_retire
list_add           @ i915_requests.c::__i915_request_create
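[Roughly, i.e. paraphrased and condensed for illustration rather than quoted verbatim, the loop in question is:]

/* Approximation of retire_requests() in intel_gt_requests.c */
static bool retire_requests(struct intel_timeline *tl)
{
	struct i915_request *rq, *rn;

	/*
	 * 'rn' caches the successor; i915_request_retire() performs the
	 * list_del(), and retiring the final request can drop the last
	 * engine wakeref and thereby trigger __engine_park().
	 */
	list_for_each_entry_safe(rq, rn, &tl->requests, link)
		if (!i915_request_retire(rq))
			return false;

	return true;
}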
I'd really like to see the KASAN report and the discussion about this change.
And if there was no discussion, then the patch just seems like "I changed code randomly and the KASAN report went away, so it's all good".
Global activity is serialised by engine->wakeref.mutex; every active timeline is required to hold an engine wakeref, but retiring is local to timelines and serialised by their own timeline->mutex.
lock(&timeline->lock)
list_for_each_safe(&timeline->requests)
  -> i915_request_retire [list_del(&timeline->requests)]
     -> intel_timeline_exit
        -> lock(&engine->wakeref.mutex)
           engine_park [list_add_tail(&engine->kernel_context->timeline->requests)]
in that particular list_for_each_safe() thing, there's no possibility that the 'next' field would be reloaded, since the list_del() in the above will be something the compiler is aware of.
So yes, the beginning list_for_each_safe() might load it twice (or a hundred times), but by the time that list_del() in i915_request_retire() has been called, if the compiler then reloads it afterwards, that would be a major compiler bug, since it's after that value could have been written in the local thread.
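[For reference, list_for_each_safe() captures the successor before the loop body runs, so the only reload in question would be the compiler re-reading pos->next after its own list_del(); from include/linux/list.h, modulo exact formatting:]

#define list_for_each_safe(pos, n, head) \
	for (pos = (head)->next, n = pos->next; \
	     pos != (head); \
	     pos = n, n = pos->next)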
My understanding of how KCSAN works is that it installs a watchpoint after a read and complains if the value changes within X us.
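[Conceptually, i.e. a simplified sketch with invented helper names rather than the real kernel/kcsan implementation, that check is:]

static void kcsan_style_check(const volatile void *ptr, size_t size)
{
	u64 before = read_bytes(ptr, size);	/* invented helper */

	arm_watchpoint(ptr, size);		/* invented helper */
	udelay(WATCH_DELAY_US);			/* stall for a few microseconds */

	if (read_bytes(ptr, size) != before)	/* did the value change under us? */
		report_data_race(ptr);		/* invented helper */

	disarm_watchpoint(ptr, size);		/* invented helper */
}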
So this doesn't explain it to me.
What it *sounds* like is that the "engine" lock that you do *not* hold initially, is not protecting some accessor to that list, so you have a race on the list at the time of that list_del().
We take an engine wakeref prior to creating a request, and release it upon retiring the request [through the respective context/timelines, across all the engines that might be used for that context]. When the engine wakeref hits zero, that mutex is taken to prevent any other new request being created, and under that mutex while we know that the engine->kernel_context->timeline is empty, we submit a request to switch the GPU to point into our scratch context.
That submission can run concurrently to the list iteration, but only _after_ the final list_del.
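[A condensed, hypothetical sketch of that wakeref pattern -- the real code is the i915 wakeref / engine-pm machinery referenced above, not this:]

struct engine {
	struct {
		atomic_t count;
		struct mutex mutex;
	} wakeref;
};

static void engine_park(struct engine *engine);	/* submits the parking request */

static void engine_wakeref_put(struct engine *engine)
{
	/* Only the final put takes the mutex ... */
	if (!atomic_dec_and_mutex_lock(&engine->wakeref.count,
				       &engine->wakeref.mutex))
		return;

	/*
	 * ... so no new request can be created while we park.  Parking
	 * submits one last request to switch the GPU to the kernel context,
	 * i.e. a list_add_tail() onto the kernel timeline's request list --
	 * which may be the very list the caller has just finished retiring.
	 */
	engine_park(engine);

	mutex_unlock(&engine->wakeref.mutex);
}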
And that race may be what KASAN is reporting, and what that patch is _hiding_ from KASAN - but not fixing.
Yes, it was sent as a means to silence KCSAN with a query as to whether or not it could happen in practice with an aggressive compiler.
See what I am saying and why I find this patch questionable?
There may be something really subtle going on, but it really smells like "two threads are modifying the same list at the same time".
In strict succession.
And there's no way that the READ_ONCE() will fix that bug, it will only make KASAN shut up about it.
There's some more shutting up required to bring KCSAN's noise down to usable levels, which I hope has already been done so that I don't have to argue for it, such as:
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 1e6650ed066d..c7c8dd89f279 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -164,7 +164,7 @@ static inline void destroy_timer_on_stack(struct timer_list *timer) { }
  */
 static inline int timer_pending(const struct timer_list * timer)
 {
-	return timer->entry.pprev != NULL;
+	return READ_ONCE(timer->entry.pprev) != NULL;
 }

 extern void add_timer_on(struct timer_list *timer, int cpu);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 5352ce50a97e..7461b3f33629 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -565,8 +565,9 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner,
 		/*
 		 * Use vcpu_is_preempted to detect lock holder preemption issue.
 		 */
-		if (!owner->on_cpu || need_resched() ||
-		    vcpu_is_preempted(task_cpu(owner))) {
+		if (!READ_ONCE(owner->on_cpu) ||
+		    need_resched() ||
+		    vcpu_is_preempted(task_cpu(owner))) {
 			ret = false;
 			break;
 		}
@@ -602,7 +603,7 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
 	 * on cpu or its cpu is preempted
 	 */
 	if (owner)
-		retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
+		retval = READ_ONCE(owner->on_cpu) && !vcpu_is_preempted(task_cpu(owner));
 	rcu_read_unlock();

 	/*
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 1f7734949ac8..4a81fba4cf70 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -75,7 +75,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 		 * wait for either @lock to point to us, through its Step-B, or
 		 * wait for a new @node->next from its Step-C.
 		 */
-		if (node->next) {
+		if (READ_ONCE(node->next)) {
 			next = xchg(&node->next, NULL);
 			if (next)
 				break;
@@ -154,7 +154,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	 */

 	for (;;) {
-		if (prev->next == node &&
+		if (READ_ONCE(prev->next) == node &&
 		    cmpxchg(&prev->next, node, NULL) == node)
 			break;

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 0d9b6be9ecc8..eef4835cecf2 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -650,7 +650,7 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 	 * As lock holder preemption issue, we both skip spinning if
 	 * task is not on cpu or its cpu is preempted
 	 */
-	return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
+	return READ_ONCE(owner->on_cpu) && !vcpu_is_preempted(task_cpu(owner));
 }

 static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,