https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908 pointed out a possible deadlock:
[ 1231.611031] Possible interrupt unsafe locking scenario:
[ 1231.611033]        CPU0                    CPU1
[ 1231.611034]        ----                    ----
[ 1231.611035]   lock(&xa->xa_lock#17);
[ 1231.611038]                                local_irq_disable();
[ 1231.611039]                                lock(&fence->lock);
[ 1231.611041]                                lock(&xa->xa_lock#17);
[ 1231.611044]   <Interrupt>
[ 1231.611045]     lock(&fence->lock);
[ 1231.611047]  *** DEADLOCK ***
My initial fix was to replace xa_erase by xa_erase_irq, but Christian pointed out that calling dma_fence_add_callback from a callback can also deadlock if the signalling fence and the one passed to dma_fence_add_callback share the same lock.
To fix both issues, the code iterating on dependencies and re-arming them is moved out to drm_sched_entity_kill_jobs_work.
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 34 +++++++++++++-----------
 1 file changed, 19 insertions(+), 15 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index c8e949f4a568..fe174a4857be 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -173,26 +173,15 @@ int drm_sched_entity_error(struct drm_sched_entity *entity)
 }
 EXPORT_SYMBOL(drm_sched_entity_error);
 
+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+					  struct dma_fence_cb *cb);
+
 static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
 {
 	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
-
-	drm_sched_fence_scheduled(job->s_fence, NULL);
-	drm_sched_fence_finished(job->s_fence, -ESRCH);
-	WARN_ON(job->s_fence->parent);
-	job->sched->ops->free_job(job);
-}
-
-/* Signal the scheduler finished fence when the entity in question is killed. */
-static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
-					  struct dma_fence_cb *cb)
-{
-	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
-						 finish_cb);
+	struct dma_fence *f;
 	unsigned long index;
 
-	dma_fence_put(f);
-
 	/* Wait for all dependencies to avoid data corruptions */
 	xa_for_each(&job->dependencies, index, f) {
 		struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
@@ -220,6 +209,21 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
 		dma_fence_put(f);
 	}
 
+	drm_sched_fence_scheduled(job->s_fence, NULL);
+	drm_sched_fence_finished(job->s_fence, -ESRCH);
+	WARN_ON(job->s_fence->parent);
+	job->sched->ops->free_job(job);
+}
+
+/* Signal the scheduler finished fence when the entity in question is killed. */
+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+					  struct dma_fence_cb *cb)
+{
+	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
+						 finish_cb);
+
+	dma_fence_put(f);
+
 	INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
 	schedule_work(&job->work);
 }
On 10/29/25 10:11, Pierre-Eric Pelloux-Prayer wrote:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908 pointed out a possible deadlock:
[ 1231.611031] Possible interrupt unsafe locking scenario:
[ 1231.611033]        CPU0                    CPU1
[ 1231.611034]        ----                    ----
[ 1231.611035]   lock(&xa->xa_lock#17);
[ 1231.611038]                                local_irq_disable();
[ 1231.611039]                                lock(&fence->lock);
[ 1231.611041]                                lock(&xa->xa_lock#17);
[ 1231.611044]   <Interrupt>
[ 1231.611045]     lock(&fence->lock);
[ 1231.611047]  *** DEADLOCK ***
My initial fix was to replace xa_erase by xa_erase_irq, but Christian pointed out that calling dma_fence_add_callback from a callback can also deadlock if the signalling fence and the one passed to dma_fence_add_callback share the same lock.
To fix both issues, the code iterating on dependencies and re-arming them is moved out to drm_sched_entity_kill_jobs_work.
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
 drivers/gpu/drm/scheduler/sched_entity.c | 34 +++++++++++++-----------
 1 file changed, 19 insertions(+), 15 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index c8e949f4a568..fe174a4857be 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -173,26 +173,15 @@ int drm_sched_entity_error(struct drm_sched_entity *entity)
 }
 EXPORT_SYMBOL(drm_sched_entity_error);
 
+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+					  struct dma_fence_cb *cb);
+
 static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
 {
 	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
-
-	drm_sched_fence_scheduled(job->s_fence, NULL);
-	drm_sched_fence_finished(job->s_fence, -ESRCH);
-	WARN_ON(job->s_fence->parent);
-	job->sched->ops->free_job(job);
-}
-
-/* Signal the scheduler finished fence when the entity in question is killed. */
-static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
-					  struct dma_fence_cb *cb)
-{
-	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
-						 finish_cb);
+	struct dma_fence *f;
 	unsigned long index;
 
-	dma_fence_put(f);
-
 	/* Wait for all dependencies to avoid data corruptions */
 	xa_for_each(&job->dependencies, index, f) {
 		struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
@@ -220,6 +209,21 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
 		dma_fence_put(f);
 	}
 
+	drm_sched_fence_scheduled(job->s_fence, NULL);
+	drm_sched_fence_finished(job->s_fence, -ESRCH);
+	WARN_ON(job->s_fence->parent);
+	job->sched->ops->free_job(job);
+}
+
+/* Signal the scheduler finished fence when the entity in question is killed. */
+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+					  struct dma_fence_cb *cb)
+{
+	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
+						 finish_cb);
+
+	dma_fence_put(f);
+
 	INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
 	schedule_work(&job->work);
 }
On Wed, 2025-10-29 at 10:11 +0100, Pierre-Eric Pelloux-Prayer wrote:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908 pointed out
This link should be moved to the tag section at the bottom as a Closes: tag. Optionally a Reported-by:, too.
a possible deadlock:
[ 1231.611031] Possible interrupt unsafe locking scenario:
[ 1231.611033]        CPU0                    CPU1
[ 1231.611034]        ----                    ----
[ 1231.611035]   lock(&xa->xa_lock#17);
[ 1231.611038]                                local_irq_disable();
[ 1231.611039]                                lock(&fence->lock);
[ 1231.611041]                                lock(&xa->xa_lock#17);
[ 1231.611044]   <Interrupt>
[ 1231.611045]     lock(&fence->lock);
[ 1231.611047]  *** DEADLOCK ***
The commit message is lacking an explanation as to _how_ and _when_ the deadlock comes to be. That's a prerequisite for understanding why the below is the proper fix and solution.
The issue seems to be that you cannot perform certain tasks from within that work item?
My initial fix was to replace xa_erase by xa_erase_irq, but Christian pointed out that calling dma_fence_add_callback from a callback can also deadlock if the signalling fence and the one passed to dma_fence_add_callback share the same lock.
To fix both issues, the code iterating on dependencies and re-arming them is moved out to drm_sched_entity_kill_jobs_work.
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
 drivers/gpu/drm/scheduler/sched_entity.c | 34 +++++++++++++-----------
 1 file changed, 19 insertions(+), 15 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index c8e949f4a568..fe174a4857be 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -173,26 +173,15 @@ int drm_sched_entity_error(struct drm_sched_entity *entity)
 }
 EXPORT_SYMBOL(drm_sched_entity_error);
 
+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+					  struct dma_fence_cb *cb);
+
 static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
 {
 	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
-
-	drm_sched_fence_scheduled(job->s_fence, NULL);
-	drm_sched_fence_finished(job->s_fence, -ESRCH);
-	WARN_ON(job->s_fence->parent);
-	job->sched->ops->free_job(job);
Can free_job() really not be called from within work item context?
P.
On 30/10/2025 at 12:17, Philipp Stanner wrote:
On Wed, 2025-10-29 at 10:11 +0100, Pierre-Eric Pelloux-Prayer wrote:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908 pointed out
This link should be moved to the tag section at the bottom as a Closes: tag. Optionally a Reported-by:, too.
The bug report is about a different issue. The potential deadlock being fixed by this patch was discovered while investigating it. I'll add a Reported-by tag though.
a possible deadlock:
[ 1231.611031] Possible interrupt unsafe locking scenario:
[ 1231.611033]        CPU0                    CPU1
[ 1231.611034]        ----                    ----
[ 1231.611035]   lock(&xa->xa_lock#17);
[ 1231.611038]                                local_irq_disable();
[ 1231.611039]                                lock(&fence->lock);
[ 1231.611041]                                lock(&xa->xa_lock#17);
[ 1231.611044]   <Interrupt>
[ 1231.611045]     lock(&fence->lock);
[ 1231.611047]  *** DEADLOCK ***
The commit message is lacking an explanation as to _how_ and _when_ the deadlock comes to be. That's a prerequisite for understanding why the below is the proper fix and solution.
I copy-pasted a small chunk of the full deadlock analysis report included in the ticket because it's 300+ lines long. Copying the full log isn't useful IMO, but I can add more context.
The problem is that a thread (CPU0 above) can take the job's dependencies xarray lock without disabling interrupts. If a fence signals while CPU0 holds this lock and drm_sched_entity_kill_jobs_cb is called from the interrupt, it will try to grab the xarray lock, which it can never get because CPU0 already holds it.
The issue seems to be that you cannot perform certain tasks from within that work item?
My initial fix was to replace xa_erase by xa_erase_irq, but Christian pointed out that calling dma_fence_add_callback from a callback can also deadlock if the signalling fence and the one passed to dma_fence_add_callback share the same lock.
To fix both issues, the code iterating on dependencies and re-arming them is moved out to drm_sched_entity_kill_jobs_work.
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
 drivers/gpu/drm/scheduler/sched_entity.c | 34 +++++++++++++-----------
 1 file changed, 19 insertions(+), 15 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index c8e949f4a568..fe174a4857be 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -173,26 +173,15 @@ int drm_sched_entity_error(struct drm_sched_entity *entity)
 }
 EXPORT_SYMBOL(drm_sched_entity_error);
 
+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+					  struct dma_fence_cb *cb);
+
 static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
 {
 	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
-
-	drm_sched_fence_scheduled(job->s_fence, NULL);
-	drm_sched_fence_finished(job->s_fence, -ESRCH);
-	WARN_ON(job->s_fence->parent);
-	job->sched->ops->free_job(job);
Can free_job() really not be called from within work item context?
It's still called from drm_sched_entity_kill_jobs_work but the diff is slightly confusing.
Pierre-Eric
P.
On Thu, 2025-10-30 at 13:06 +0100, Pierre-Eric Pelloux-Prayer wrote:
On 30/10/2025 at 12:17, Philipp Stanner wrote:
On Wed, 2025-10-29 at 10:11 +0100, Pierre-Eric Pelloux-Prayer wrote:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908 pointed out
This link should be moved to the tag section at the bottom as a Closes: tag. Optionally a Reported-by:, too.
The bug report is about a different issue. The potential deadlock being fixed by this patch was discovered while investigating it. I'll add a Reported-by tag though.
a possible deadlock:
[ 1231.611031] Possible interrupt unsafe locking scenario:
[ 1231.611033]        CPU0                    CPU1
[ 1231.611034]        ----                    ----
[ 1231.611035]   lock(&xa->xa_lock#17);
[ 1231.611038]                                local_irq_disable();
[ 1231.611039]                                lock(&fence->lock);
[ 1231.611041]                                lock(&xa->xa_lock#17);
[ 1231.611044]   <Interrupt>
[ 1231.611045]     lock(&fence->lock);
[ 1231.611047]  *** DEADLOCK ***
The commit message is lacking an explanation as to _how_ and _when_ the deadlock comes to be. That's a prerequisite for understanding why the below is the proper fix and solution.
I copy-pasted a small chunk of the full deadlock analysis report included in the ticket because it's 300+ lines long. Copying the full log isn't useful IMO, but I can add more context.
The log wouldn't be useful, but a human-generated explanation like the one you give below would be.
The problem is that a thread (CPU0 above) can take the job's dependencies xarray lock without disabling interrupts.
Which drm_sched function would that be?
If a fence signals while CPU0 holds this lock and drm_sched_entity_kill_jobs_cb is called, it will try to grab the xarray lock, which it can never get because CPU0 already holds it.
You mean an *interrupt* signals the fence? Shouldn't interrupt issues be solved with spin_lock_irqsave() – but we can't have that because it's the xarray taking that lock internally?
You don't have to explain that in this mail thread, a v2 detailing that would be sufficient.
The issue seems to be that you cannot perform certain tasks from within that work item?
[…]
+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+					  struct dma_fence_cb *cb);
+
 static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
 {
 	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
-
-	drm_sched_fence_scheduled(job->s_fence, NULL);
-	drm_sched_fence_finished(job->s_fence, -ESRCH);
-	WARN_ON(job->s_fence->parent);
-	job->sched->ops->free_job(job);
Can free_job() really not be called from within work item context?
It's still called from drm_sched_entity_kill_jobs_work but the diff is slightly confusing.
OK, probably my bad. But just asking: do you use git format-patch --histogram?
--histogram often produces better diffs than the default algorithm.
P.