Le 30/10/2025 à 12:17, Philipp Stanner a écrit :
On Wed, 2025-10-29 at 10:11 +0100, Pierre-Eric Pelloux-Prayer wrote:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908 pointed out
This link should be moved to the tag section at the bottom as a Closes: tag. Optionally add a Reported-by:, too.
The bug report is about a different issue; the potential deadlock being fixed by this patch was discovered while investigating it. I'll add a Reported-by: tag, though.
a possible deadlock:
[ 1231.611031] Possible interrupt unsafe locking scenario:
[ 1231.611033] CPU0 CPU1 [ 1231.611034] ---- ---- [ 1231.611035] lock(&xa->xa_lock#17); [ 1231.611038] local_irq_disable(); [ 1231.611039] lock(&fence->lock); [ 1231.611041] lock(&xa->xa_lock#17); [ 1231.611044] <Interrupt> [ 1231.611045] lock(&fence->lock); [ 1231.611047] *** DEADLOCK ***
The commit message is lacking an explanation as to _how_ and _when_ the deadlock comes to be. That's a prerequisite for understanding why the below is the proper fix and solution.
I copy-pasted a small chunk of the full deadlock analysis report included in the ticket because it's 300+ lines long. Copying the full log isn't useful IMO, but I can add more context.
The problem is that a thread (CPU0 above) can lock the job's dependencies xarray without disabling interrupts. If a fence signals while CPU0 holds this lock and drm_sched_entity_kill_jobs_cb is called from interrupt context, the callback will try to grab the xarray lock, which is not possible because CPU0 already holds it.
The issue seems to be that you cannot perform certain tasks from within that work item?
My initial fix was to replace xa_erase with xa_erase_irq, but Christian pointed out that calling dma_fence_add_callback from a callback can also deadlock if the signalling fence and the one passed to dma_fence_add_callback share the same lock.
To fix both issues, the code iterating over the dependencies and re-arming them is moved out to drm_sched_entity_kill_jobs_work.
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
 drivers/gpu/drm/scheduler/sched_entity.c | 34 +++++++++++++-----------
 1 file changed, 19 insertions(+), 15 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index c8e949f4a568..fe174a4857be 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -173,26 +173,15 @@ int drm_sched_entity_error(struct drm_sched_entity *entity)
 }
 EXPORT_SYMBOL(drm_sched_entity_error);

+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+					  struct dma_fence_cb *cb);
+
 static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
 {
 	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
- drm_sched_fence_scheduled(job->s_fence, NULL);
- drm_sched_fence_finished(job->s_fence, -ESRCH);
- WARN_ON(job->s_fence->parent);
- job->sched->ops->free_job(job);
Can free_job() really not be called from within work item context?
It's still called from drm_sched_entity_kill_jobs_work, but the diff is slightly confusing.
Pierre-Eric
P.