From: "Jesse.Zhang" <Jesse.Zhang@amd.com>
[ Upstream commit b8ae2640f9acd4f411c9227d2493755d03fe440a ]
This commit fixes a potential race condition in the userqueue fence signaling mechanism by replacing dma_fence_is_signaled_locked() with dma_fence_is_signaled().
The issue occurred because:
1. dma_fence_is_signaled_locked() should only be used when holding
   the fence's individual lock, not just the fence list lock
2. Using the locked variant without the proper fence lock could lead
   to double-signaling scenarios:
   - Hardware completion signals the fence
   - Software path also tries to signal the same fence
By using dma_fence_is_signaled() instead, we properly handle the locking hierarchy and avoid the race condition while still maintaining the necessary synchronization through the fence_list_lock.
v2: drop the comment (Christian)
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES

- `amdgpu_userq_fence_create()` only holds the queue-wide `fence_list_lock` when it checks completion, so calling `dma_fence_is_signaled_locked()` there violated the documented precondition that the per-fence spinlock be held (`include/linux/dma-fence.h:414-425`). That allowed the helper to run `dma_fence_signal_locked()` without proper serialization, so hardware completion and the software fast path could both signal the same fence, corrupting the callback list and triggering the “double signal” race the author observed.
- Switching to `dma_fence_is_signaled()` at `drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c:286-290` makes the same completion check but lets the helper take the per-fence lock itself before signaling, matching the lock ordering already used by the runtime completion path (`amdgpu_userq_fence_driver_process()` calls `dma_fence_signal()` under the same `fence_list_lock`; see `drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c:162-175`). That closes the race without functional side effects; the fence still ends up signaled or enqueued exactly as before.
- The regression comes from 2e65ea1ab2f6f (“drm/amdgpu: screen freeze and userq driver crash”), so every stable kernel that picked up userqueue support since that change is exposed. This fix is a single-line change, introduces no new APIs, and aligns with existing locking patterns, so the backport risk is very low.
- Residual risk: other userqueue helpers still call `_locked` variants while holding only the driver lock, so additional audits may be warranted, but this patch addresses the high-risk race in the job creation fast path and should land in stable promptly.
Suggested next step: cherry-pick into all stable trees that contain 2e65ea1ab2f6f.
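For readers unfamiliar with the `_locked` naming convention, the locking contract described above can be sketched as a small userspace model. This is an illustration only, not kernel code: `struct model_fence`, the `model_*` functions, and the `lock_held` flag are hypothetical names invented for this sketch, pthread mutexes stand in for spinlocks, and the real `dma_fence_is_signaled()` is more involved (for example, it tests the signaled flag before taking any lock).

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a fence; NOT the kernel's struct dma_fence. */
struct model_fence {
	pthread_mutex_t lock;	/* per-fence lock (models fence->lock) */
	bool lock_held;		/* crude stand-in for a lockdep-style check */
	bool signaled;		/* models the "already signaled" flag */
	bool hw_done;		/* what a ->signaled() callback would report */
	int callbacks_run;	/* how many times the callback list was processed */
};

void model_fence_init(struct model_fence *f, bool hw_done)
{
	pthread_mutex_init(&f->lock, NULL);
	f->lock_held = false;
	f->signaled = false;
	f->hw_done = hw_done;
	f->callbacks_run = 0;
}

static void model_fence_lock(struct model_fence *f)
{
	pthread_mutex_lock(&f->lock);
	f->lock_held = true;
}

static void model_fence_unlock(struct model_fence *f)
{
	f->lock_held = false;
	pthread_mutex_unlock(&f->lock);
}

/* Caller must hold f->lock. Returns true only for the first signal. */
static bool model_signal_locked(struct model_fence *f)
{
	assert(f->lock_held);
	if (f->signaled)
		return false;
	f->signaled = true;
	f->callbacks_run++;
	return true;
}

/* "_locked" query: the caller must already hold f->lock. */
static bool model_is_signaled_locked(struct model_fence *f)
{
	assert(f->lock_held);
	if (f->signaled)
		return true;
	if (f->hw_done) {
		model_signal_locked(f);
		return true;
	}
	return false;
}

/* Plain query: takes the per-fence lock itself before checking and signaling. */
bool model_is_signaled(struct model_fence *f)
{
	bool ret;

	model_fence_lock(f);
	ret = model_is_signaled_locked(f);
	model_fence_unlock(f);
	return ret;
}

/*
 * Models the create path: only the list lock is held around the check,
 * so the plain query is the variant that satisfies the contract.
 * Returns true if the fence would have been queued on the list.
 */
bool model_create_path(struct model_fence *f, pthread_mutex_t *list_lock)
{
	bool queued = false;

	pthread_mutex_lock(list_lock);
	if (!model_is_signaled(f))
		queued = true;	/* the driver would list_add_tail() here */
	pthread_mutex_unlock(list_lock);
	return queued;
}
```

In this model, calling `model_is_signaled_locked()` from `model_create_path()` would trip the `assert()`, because holding the list lock says nothing about the per-fence lock. The plain variant serializes the check-and-signal step itself, so the callback counter can only ever advance once per fence.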
 drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
index c2a983ff23c95..b372baae39797 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
@@ -276,7 +276,7 @@ static int amdgpu_userq_fence_create(struct amdgpu_usermode_queue *userq,
 	/* Check if hardware has already processed the job */
 	spin_lock_irqsave(&fence_drv->fence_list_lock, flags);
-	if (!dma_fence_is_signaled_locked(fence))
+	if (!dma_fence_is_signaled(fence))
 		list_add_tail(&userq_fence->link, &fence_drv->fences);
 	else
 		dma_fence_put(fence);