On Thu, 2025-10-30 at 13:06 +0100, Pierre-Eric Pelloux-Prayer wrote:
On 30/10/2025 at 12:17, Philipp Stanner wrote:
On Wed, 2025-10-29 at 10:11 +0100, Pierre-Eric Pelloux-Prayer wrote:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908 pointed out
This link should be moved to the tag section at the bottom as a Closes: tag. Optionally a Reported-by:, too.
The bug report is about a different issue. The potential deadlock being fixed by this patch was discovered while investigating it. I'll add a Reported-by tag though.
a possible deadlock:
[ 1231.611031] Possible interrupt unsafe locking scenario:
[ 1231.611033]        CPU0                    CPU1
[ 1231.611034]        ----                    ----
[ 1231.611035]   lock(&xa->xa_lock#17);
[ 1231.611038]                                local_irq_disable();
[ 1231.611039]                                lock(&fence->lock);
[ 1231.611041]                                lock(&xa->xa_lock#17);
[ 1231.611044]   <Interrupt>
[ 1231.611045]     lock(&fence->lock);
[ 1231.611047]  *** DEADLOCK ***
The commit message is lacking an explanation as to _how_ and _when_ the deadlock comes to be. That's a prerequisite for understanding why the below is the proper fix and solution.
I copy-pasted a small chunk of the full deadlock analysis report included in the ticket because it's 300+ lines long. Copying the full log isn't useful IMO, but I can add more context.
The log wouldn't be useful, but a human-written explanation like the one you give below would be.
The problem is that a thread (CPU0 above) can lock the job's dependencies xa_array without disabling the interrupts.
Which drm_sched function would that be?
If a fence signals while CPU0 holds this lock and drm_sched_entity_kill_jobs_cb is called, it will try to grab the xa_array lock which is not possible because CPU0 holds it already.
You mean an *interrupt* signals the fence? Shouldn't interrupt issues be solved with spin_lock_irqsave(), but we can't have that because it's the xarray doing the locking internally?
You don't have to explain that in this mail thread; a v2 detailing that would be sufficient.
The issue seems to be that you cannot perform certain tasks from within that work item?
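For readers following along, the interrupt scenario described above can be sketched with a toy user-space model. This is not kernel code and not taken from the patch: every toy_* name is hypothetical, the "lock" is just a flag standing in for the xarray's xa_lock, and the "interrupt" is a plain function call standing in for the fence callback. It only shows why taking the lock with interrupts enabled lets the handler run into a lock its own CPU already holds, and why an irq-disabling lock variant (in the kernel, primitives such as spin_lock_irqsave() or xa_lock_irqsave()) avoids that.

```c
#include <stdbool.h>

/* Toy model of one CPU. All names here are hypothetical. */
struct toy_cpu {
	bool irqs_enabled;       /* are interrupts currently enabled? */
	bool irq_pending;        /* an interrupt arrived while disabled */
	bool lock_held;          /* stands in for the xarray's xa_lock */
	bool handler_would_spin; /* handler found the lock already held */
};

/* Stands in for the fence callback running in interrupt context. */
static void toy_irq_handler(struct toy_cpu *cpu)
{
	/* A real spinlock would spin forever here: the owner is the
	 * interrupted code on this same CPU and can never release it. */
	if (cpu->lock_held)
		cpu->handler_would_spin = true;
}

/* A fence signals: deliver now if possible, otherwise remember it. */
static void toy_raise_irq(struct toy_cpu *cpu)
{
	if (cpu->irqs_enabled)
		toy_irq_handler(cpu);
	else
		cpu->irq_pending = true;
}

/* Re-enable interrupts and deliver anything that was held back. */
static void toy_irq_enable(struct toy_cpu *cpu)
{
	cpu->irqs_enabled = true;
	if (cpu->irq_pending) {
		cpu->irq_pending = false;
		toy_irq_handler(cpu);
	}
}

/* Returns true if the handler would have deadlocked. */
static bool toy_run(bool disable_irqs)
{
	struct toy_cpu cpu = { .irqs_enabled = true };

	if (disable_irqs)
		cpu.irqs_enabled = false; /* models the _irqsave variant */
	cpu.lock_held = true;             /* CPU0 takes the lock */

	toy_raise_irq(&cpu);              /* fence signals meanwhile */

	cpu.lock_held = false;            /* CPU0 drops the lock */
	if (disable_irqs)
		toy_irq_enable(&cpu);     /* models the irqrestore */

	return cpu.handler_would_spin;
}
```

In this model toy_run(false) returns true (the handler runs while the lock is held), while toy_run(true) returns false (the handler is delayed until after the unlock). Whether disabling interrupts or moving the work out of the callback is the right fix for the scheduler is exactly what the v2 commit message should explain.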
[…]
+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+					  struct dma_fence_cb *cb);
+
 static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
 {
 	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
- drm_sched_fence_scheduled(job->s_fence, NULL);
- drm_sched_fence_finished(job->s_fence, -ESRCH);
- WARN_ON(job->s_fence->parent);
- job->sched->ops->free_job(job);
Can free_job() really not be called from within work item context?
It's still called from drm_sched_entity_kill_jobs_work but the diff is slightly confusing.
OK, probably my bad. But just asking, do you use git format-patch --histogram?
The histogram algorithm often produces more readable diffs than the default.
P.