[PATCH AUTOSEL 6.17] drm/amdgpu: Fix fence signaling race condition in userqueue

25 Oct 2025

From: "Jesse.Zhang" <Jesse.Zhang@amd.com>
[ Upstream commit b8ae2640f9acd4f411c9227d2493755d03fe440a ]

This commit fixes a potential race condition in the userqueue fence
signaling mechanism by replacing dma_fence_is_signaled_locked() with
dma_fence_is_signaled().

The issue occurred because:
1. dma_fence_is_signaled_locked() should only be used when holding
   the fence's individual lock, not just the fence list lock
2. Using the locked variant without the proper fence lock could lead
   to double-signaling scenarios:
   - Hardware completion signals the fence
   - Software path also tries to signal the same fence

By using dma_fence_is_signaled() instead, we properly handle the
locking hierarchy and avoid the race condition while still maintaining
the necessary synchronization through the fence_list_lock.

v2: drop the comment (Christian)

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
- `amdgpu_userq_fence_create()` only holds the queue-wide
  `fence_list_lock` when it checks completion, so calling
  `dma_fence_is_signaled_locked()` there violated the documented
  precondition that the per-fence spinlock be held
  (`include/linux/dma-fence.h:414-425`). That allowed the helper to run
  `dma_fence_signal_locked()` without proper serialization, so hardware
  completion and the software fast path could both signal the same
  fence, corrupting the callback list and triggering the “double signal”
  race the author observed.
- Switching to `dma_fence_is_signaled()` at
  `drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c:286-290` makes the
  same completion check but lets the helper take the per-fence lock
  itself before signaling, matching the lock ordering already used by
  the runtime completion path (`amdgpu_userq_fence_driver_process()`
  calls `dma_fence_signal()` under the same `fence_list_lock`; see
  `drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c:162-175`). That
  closes the race without functional side effects—the fence still ends
  up signaled or enqueued exactly as before.
- The regression comes from 2e65ea1ab2f6f (“drm/amdgpu: screen freeze
  and userq driver crash”), so every stable kernel that picked up
  userqueue support since that change is exposed. This fix is a
  single-line change, introduces no new APIs, and aligns with existing
  locking patterns, so the backport risk is very low.
- Residual risk: other userqueue helpers still call `_locked` variants
  while holding only the driver lock, so additional audits may be
  warranted, but this patch addresses the high-risk race in the job
  creation fast path and should land in stable promptly.
Suggested next step: cherry-pick into all stable trees that contain
2e65ea1ab2f6f.
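
To illustrate the precondition discussed above, here is a minimal
userspace sketch. It is not kernel code: `struct model_fence` and every
`model_fence_*` helper are hypothetical names invented for this example,
and the "lock" is only a flag that records ownership so the `_locked`
precondition can be asserted. It is also simplified in that the unlocked
helper holds the lock around the whole check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model; not the kernel's struct dma_fence. */
struct model_fence {
	bool lock_held;    /* stands in for ownership of the per-fence lock */
	bool hw_done;      /* what a driver completion check would report */
	bool signaled;     /* stands in for the fence's signaled state */
	int signal_count;  /* how many times the signal path actually ran */
};

static void model_fence_init(struct model_fence *f)
{
	f->lock_held = false;
	f->hw_done = false;
	f->signaled = false;
	f->signal_count = 0;
}

/* Caller must hold the per-fence lock. */
static void model_fence_signal_locked(struct model_fence *f)
{
	assert(f->lock_held);
	if (f->signaled)
		return;            /* already signaled: never signal twice */
	f->signaled = true;
	f->signal_count++;
}

/* "_locked" variant: only valid while the per-fence lock is held. */
static bool model_fence_is_signaled_locked(struct model_fence *f)
{
	assert(f->lock_held);
	if (f->signaled)
		return true;
	if (f->hw_done) {
		model_fence_signal_locked(f);
		return true;
	}
	return false;
}

/* Unlocked variant: acquires the per-fence lock on the caller's behalf. */
static bool model_fence_is_signaled(struct model_fence *f)
{
	bool ret;

	assert(!f->lock_held);     /* the caller must not already hold it */
	f->lock_held = true;       /* "lock" */
	ret = model_fence_is_signaled_locked(f);
	f->lock_held = false;      /* "unlock" */
	return ret;
}
```

In this model, a caller that holds only some other lock (the analogue of
`fence_list_lock`) and calls `model_fence_is_signaled_locked()` trips the
assertion, whereas `model_fence_is_signaled()` is safe to call from that
context.
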
drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
index c2a983ff23c95..b372baae39797 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
@@ -276,7 +276,7 @@ static int amdgpu_userq_fence_create(struct amdgpu_usermode_queue *userq,
 	/* Check if hardware has already processed the job */
 	spin_lock_irqsave(&fence_drv->fence_list_lock, flags);
-	if (!dma_fence_is_signaled_locked(fence))
+	if (!dma_fence_is_signaled(fence))
 		list_add_tail(&userq_fence->link, &fence_drv->fences);
 	else
 		dma_fence_put(fence);
-- 
2.51.0



    
