[Linaro-mm-sig] Re: [PATCH v3] drm/sched: Fix deadlock in drm_sched_entity_kill_jobs_cb

4 Nov 2025

On Tue, 2025-11-04 at 10:53 +0100, Pierre-Eric Pelloux-Prayer wrote:
...
The Mesa issue referenced below pointed out a possible deadlock:
[ 1231.611031]  Possible interrupt unsafe locking scenario:
[ 1231.611033]        CPU0                    CPU1
[ 1231.611034]        ----                    ----
[ 1231.611035]   lock(&xa->xa_lock#17);
[ 1231.611038]                                local_irq_disable();
[ 1231.611039]                                lock(&fence->lock);
[ 1231.611041]                                lock(&xa->xa_lock#17);
[ 1231.611044]   <Interrupt>
[ 1231.611045]     lock(&fence->lock);
[ 1231.611047]
                *** DEADLOCK ***
In this example, CPU0 would be any function accessing job->dependencies
through the xa_* functions without disabling interrupts (e.g.
drm_sched_job_add_dependency, drm_sched_entity_kill_jobs_cb).
CPU1 is executing drm_sched_entity_kill_jobs_cb as a fence signalling
callback, so in an interrupt context. It will deadlock when trying to
grab the xa_lock which is already held by CPU0.
Replacing all xa_* usage by their xa_*_irq counterparts would fix
this issue, but Christian pointed out another issue: dma_fence_signal
takes fence.lock and so does dma_fence_add_callback.
dma_fence_signal() // locks f1.lock
  -> drm_sched_entity_kill_jobs_cb()
  -> foreach dependencies
     -> dma_fence_add_callback() // locks f2.lock
This will deadlock if f1 and f2 share the same spinlock.
To fix both issues, the code iterating on dependencies and re-arming them
is moved out to drm_sched_entity_kill_jobs_work.
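To make the second issue and the fix concrete, here is a minimal userspace sketch, not the kernel code. Every toy_* name is hypothetical and invented for illustration. The toy lock records a nested acquire instead of spinning forever, the toy fences loosely mimic how dma_fence_signal() runs callbacks with the fence lock held, and a flag stands in for queuing a work item. It only assumes, as described above, that the re-arming step moves from the callback into the work function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy non-recursive lock: a nested acquire is counted, not spun on. */
struct toy_lock {
	bool held;
	int deadlocks;
};

static bool toy_lock_acquire(struct toy_lock *l)
{
	if (l->held) {
		l->deadlocks++;	/* a real spinlock would hang here */
		return false;
	}
	l->held = true;
	return true;
}

static void toy_lock_release(struct toy_lock *l)
{
	l->held = false;
}

struct toy_fence;
typedef void (*toy_cb_t)(struct toy_fence *f, void *data);

struct toy_fence {
	struct toy_lock *lock;	/* may be shared between fences */
	toy_cb_t cb;
	void *cb_data;
};

struct toy_job {
	struct toy_fence *next_dep;	/* next dependency to wait on */
	bool work_queued;		/* stands in for a queued work item */
};

/* Installs a callback under the fence lock. */
static void toy_fence_add_callback(struct toy_fence *f, toy_cb_t cb, void *data)
{
	if (!toy_lock_acquire(f->lock))
		return;
	f->cb = cb;
	f->cb_data = data;
	toy_lock_release(f->lock);
}

/* Runs the callback with the fence lock held. */
static void toy_fence_signal(struct toy_fence *f)
{
	if (!toy_lock_acquire(f->lock))
		return;
	if (f->cb)
		f->cb(f, f->cb_data);
	toy_lock_release(f->lock);
}

/* Problematic pattern: re-arm on the next dependency inside the callback. */
static void toy_cb_rearm_inline(struct toy_fence *f, void *data)
{
	struct toy_job *job = data;

	(void)f;
	toy_fence_add_callback(job->next_dep, toy_cb_rearm_inline, job);
}

/* Deferred pattern: the callback only queues work. */
static void toy_cb_deferred(struct toy_fence *f, void *data)
{
	struct toy_job *job = data;

	(void)f;
	job->work_queued = true;
}

/* Work function: runs later, with no fence lock held, and re-arms there. */
static void toy_job_work(struct toy_job *job)
{
	if (!job->work_queued)
		return;
	job->work_queued = false;
	toy_fence_add_callback(job->next_dep, toy_cb_deferred, job);
}

/* Returns how many nested acquisitions of the shared lock were attempted. */
static int toy_run(bool defer)
{
	struct toy_lock shared = { false, 0 };
	struct toy_fence f1 = { &shared, NULL, NULL };
	struct toy_fence f2 = { &shared, NULL, NULL };
	struct toy_job job = { &f2, false };

	toy_fence_add_callback(&f1, defer ? toy_cb_deferred : toy_cb_rearm_inline, &job);
	toy_fence_signal(&f1);
	toy_job_work(&job);
	return shared.deadlocks;
}
```

With f1 and f2 sharing one lock, toy_run(false) reports one nested acquisition (the case that would deadlock), while toy_run(true) reports none because the re-arm happens after the signalling path has dropped the lock.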
v2: reworded commit message (Philipp)
v3: added Fixes tag (Philipp)
Thx for the update.
In the future please put the changelog below, between a pair of '---':
---
v2: …
v3: …
---
Some things I have unfortunately overlooked below.
...
Fixes: 2fdb8a8f07c2 ("drm/scheduler: rework entity flush, kill and fini")
We should +Cc stable. It's a deadlock after all.
...
Link: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>

drivers/gpu/drm/scheduler/sched_entity.c | 34 +++++++++++++-----------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index c8e949f4a568..fe174a4857be 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -173,26 +173,15 @@ int drm_sched_entity_error(struct drm_sched_entity *entity)
 }
 EXPORT_SYMBOL(drm_sched_entity_error);
 
+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+					  struct dma_fence_cb *cb);

It's far better to move the function up instead. Can you do that?

[…]

+/* Signal the scheduler finished fence when the entity in question is killed. */
+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+					  struct dma_fence_cb *cb)
+{
+	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
+						 finish_cb);
+
+	dma_fence_put(f);

It would be great if we knew what fence is being dropped here and why.
I know you're just moving the pre-existing code, but if you should
know, informing about that via comment would be great.
Optional.
Rest of the code looks good. No further objections.
P.

    
