Le 30/10/2025 à 12:17, Philipp Stanner a écrit :
On Wed, 2025-10-29 at 10:11 +0100, Pierre-Eric Pelloux-Prayer wrote:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908 pointed out
This link should be moved to the tag section at the bottom as a Closes: tag. Optionally add a Reported-by:, too.
The bug report is about a different issue; the potential deadlock being fixed by this patch was discovered while investigating it. I'll add a Reported-by: tag, though.
a possible deadlock:
[ 1231.611031] Possible interrupt unsafe locking scenario:
[ 1231.611033] CPU0 CPU1 [ 1231.611034] ---- ---- [ 1231.611035] lock(&xa->xa_lock#17); [ 1231.611038] local_irq_disable(); [ 1231.611039] lock(&fence->lock); [ 1231.611041] lock(&xa->xa_lock#17); [ 1231.611044] <Interrupt> [ 1231.611045] lock(&fence->lock); [ 1231.611047] *** DEADLOCK ***
The commit message is lacking an explanation as to _how_ and _when_ the deadlock comes to be. That's a prerequisite for understanding why the below is the proper fix and solution.
I copy-pasted a small chunk of the full deadlock analysis report included in the ticket because it's 300+ lines long. Copying the full log isn't useful IMO, but I can add more context.
The problem is that a thread (CPU0 above) can lock the job's dependencies xarray without disabling interrupts. If a fence signals while CPU0 holds this lock and drm_sched_entity_kill_jobs_cb is called from interrupt context, the callback will try to grab the xarray lock, which is not possible because CPU0 already holds it.
The issue seems to be that you cannot perform certain tasks from within that work item?
My initial fix was to replace xa_erase with xa_erase_irq, but Christian pointed out that calling dma_fence_add_callback from a callback can also deadlock if the signalling fence and the one passed to dma_fence_add_callback share the same lock.
To fix both issues, the code iterating over the dependencies and re-arming them is moved out to drm_sched_entity_kill_jobs_work.
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
 drivers/gpu/drm/scheduler/sched_entity.c | 34 +++++++++++++-----------
 1 file changed, 19 insertions(+), 15 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index c8e949f4a568..fe174a4857be 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -173,26 +173,15 @@ int drm_sched_entity_error(struct drm_sched_entity *entity)
 }
 EXPORT_SYMBOL(drm_sched_entity_error);

+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+					  struct dma_fence_cb *cb);
+
 static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
 {
 	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
- drm_sched_fence_scheduled(job->s_fence, NULL);
- drm_sched_fence_finished(job->s_fence, -ESRCH);
- WARN_ON(job->s_fence->parent);
- job->sched->ops->free_job(job);
Can free_job() really not be called from within work item context?
It's still called from drm_sched_entity_kill_jobs_work, but the diff is slightly confusing.
Pierre-Eric
P.