From: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
[ Upstream commit bd8150a1b3370a9f7761c5814202a3fe5a79f44f ]
This commit simplifies the amdgpu_gem_va_ioctl function; key updates include:

- Moved the logic for managing the last update fence directly into
  amdgpu_gem_va_update_vm.
- Introduced checks for the timeline point to enable conditional
  replacement or addition of fences.
v2: Addressed review comments from Christian.
v3: Updated comments (Christian).
v4: The previous version selected the fence too early and did not manage
its reference correctly, which could lead to stale or freed fences being
used. This resulted in refcount underflows and could crash when updating
GPU timelines. The fence is now chosen only after the VA mapping work is
completed, and its reference is taken safely. After exporting it to the
VM timeline syncobj, the driver always drops its local fence reference,
ensuring balanced refcounting and avoiding use-after-free on dma_fence.
Crash signature:

[ 205.828135] refcount_t: underflow; use-after-free.
[ 205.832963] WARNING: CPU: 30 PID: 7274 at lib/refcount.c:28 refcount_warn_saturate+0xbe/0x110
...
[ 206.074014] Call Trace:
[ 206.076488]  <TASK>
[ 206.078608]  amdgpu_gem_va_ioctl+0x6ea/0x740 [amdgpu]
[ 206.084040]  ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
[ 206.089994]  drm_ioctl_kernel+0x86/0xe0 [drm]
[ 206.094415]  drm_ioctl+0x26e/0x520 [drm]
[ 206.098424]  ? __pfx_amdgpu_gem_va_ioctl+0x10/0x10 [amdgpu]
[ 206.104402]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]
[ 206.109387]  __x64_sys_ioctl+0x96/0xe0
[ 206.113156]  do_syscall_64+0x66/0x2d0
...
[ 206.553351] BUG: unable to handle page fault for address: ffffffffc0dfde90
...
[ 206.553378] RIP: 0010:dma_fence_signal_timestamp_locked+0x39/0xe0
...
[ 206.553405] Call Trace:
[ 206.553409]  <IRQ>
[ 206.553415]  ? __pfx_drm_sched_fence_free_rcu+0x10/0x10 [gpu_sched]
[ 206.553424]  dma_fence_signal+0x30/0x60
[ 206.553427]  drm_sched_job_done.isra.0+0x123/0x150 [gpu_sched]
[ 206.553434]  dma_fence_signal_timestamp_locked+0x6e/0xe0
[ 206.553437]  dma_fence_signal+0x30/0x60
[ 206.553441]  amdgpu_fence_process+0xd8/0x150 [amdgpu]
[ 206.553854]  sdma_v4_0_process_trap_irq+0x97/0xb0 [amdgpu]
[ 206.554270]  amdgpu_irq_dispatch+0x150/0x230 [amdgpu]
[ 206.554702]  amdgpu_ih_process+0x6a/0x180 [amdgpu]
[ 206.555101]  amdgpu_irq_handler+0x23/0x60 [amdgpu]
[ 206.555500]  __handle_irq_event_percpu+0x4a/0x1c0
[ 206.555506]  handle_irq_event+0x38/0x80
[ 206.555509]  handle_edge_irq+0x92/0x1e0
[ 206.555513]  __common_interrupt+0x3e/0xb0
[ 206.555519]  common_interrupt+0x80/0xa0
[ 206.555525]  </IRQ>
[ 206.555527]  <TASK>
...
[ 206.555650] RIP: 0010:dma_fence_signal_timestamp_locked+0x39/0xe0
...
[ 206.555667] Kernel panic - not syncing: Fatal exception in interrupt
Link: https://patchwork.freedesktop.org/patch/654669/
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
---
## Complete Analysis
### 1. COMMIT MESSAGE ANALYSIS
The commit title says "Refactor" but the v4 description reveals this is a **critical bug fix**:
"The previous version selected the fence too early and did not manage
its reference correctly, which could lead to stale or freed fences being used. This resulted in refcount underflows and could crash when updating GPU timelines."
The commit includes a **concrete crash signature** showing:

- `refcount_t: underflow; use-after-free` in `amdgpu_gem_va_ioctl`
- `BUG: unable to handle page fault` in `dma_fence_signal_timestamp_locked` from IRQ context
- `Kernel panic - not syncing: Fatal exception in interrupt`
This is a **kernel panic** triggered via a userspace ioctl path (`amdgpu_gem_va_ioctl`).
### 2. CODE CHANGE ANALYSIS - THE BUG
The bug is in `amdgpu_gem_update_bo_mapping()`, which was introduced by commit 70773bef4e091 ("drm/amdgpu: update userqueue BOs and PDs") in v6.16-rc1.
**Bug mechanism** in the old code at lines 115-154 of the current file:
```132:154:drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
	/* Find the last update fence */
	switch (operation) {
	case AMDGPU_VA_OP_MAP:
	case AMDGPU_VA_OP_REPLACE:
		if (bo && (bo->tbo.base.resv == vm->root.bo->tbo.base.resv))
			last_update = vm->last_update;
		else
			last_update = bo_va->last_pt_update;
		break;
	case AMDGPU_VA_OP_UNMAP:
	case AMDGPU_VA_OP_CLEAR:
		last_update = fence;
		break;
	default:
		return;
	}

	/* Add fence to timeline */
	if (!point)
		drm_syncobj_replace_fence(syncobj, last_update);
	else
		drm_syncobj_add_point(syncobj, chain, last_update, point);
```
**Problem 1 - Missing dma_fence references**: `last_update = vm->last_update` and `last_update = bo_va->last_pt_update` read fence pointers **without taking a reference** via `dma_fence_get()`. These fences can be replaced and their references dropped by concurrent VM operations or fence signaling. The bare pointer is then passed to `drm_syncobj_replace_fence()` or `drm_syncobj_add_point()`, which attempts `dma_fence_get()` on a potentially freed fence - a **use-after-free**.
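To make the failure mode concrete, here is a minimal, self-contained toy model of this bug class (illustrative only - `toy_fence`, `toy_get`, `toy_put`, and `publish_new_fence` are made-up names, not kernel APIs; the real objects are `struct dma_fence` with `kref` counting):

```c
/* Toy model of the hazard above. Compile: cc -o demo demo.c && ./demo */
#include <stdio.h>
#include <stdlib.h>

struct toy_fence {
	int refcount;
	int freed;	/* flag standing in for actual freeing */
};

static struct toy_fence *toy_get(struct toy_fence *f)
{
	if (f) {
		if (f->freed)
			printf("BUG: get on freed fence (use-after-free)\n");
		f->refcount++;
	}
	return f;
}

static void toy_put(struct toy_fence *f)
{
	if (f && --f->refcount == 0) {
		f->freed = 1;	/* stand-in for kfree() */
		printf("fence freed\n");
	}
}

static struct toy_fence *last_update;	/* models vm->last_update */

static void publish_new_fence(void)
{
	/* Models the commit path: replace the published fence and drop
	 * the reference held through the old pointer. */
	struct toy_fence *old = last_update;

	last_update = calloc(1, sizeof(*last_update));
	last_update->refcount = 1;
	toy_put(old);
}

int main(void)
{
	last_update = calloc(1, sizeof(*last_update));
	last_update->refcount = 1;

	/* Old code: capture a bare pointer without taking a reference. */
	struct toy_fence *bare = last_update;

	/* A concurrent update replaces the fence and frees the old one. */
	publish_new_fence();

	/* Old code hands the stale pointer to the syncobj helper, which
	 * tries to take a reference -> use-after-free. */
	toy_get(bare);

	/* The fix: pin the fence while the pointer is known valid ... */
	struct toy_fence *pinned = toy_get(last_update);
	publish_new_fence();	/* replacement is now harmless */
	toy_put(pinned);	/* ... and drop the reference when done */
	return 0;
}
```

Run sequentially, the model prints the use-after-free warning for the bare-pointer path and completes cleanly for the pinned path.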
I verified that `vm->last_update` is replaced via `swap()` in `amdgpu_vm_sdma_commit()` (lines 106-146 of `amdgpu_vm_sdma.c`), which puts the old fence. If the fence was already signaled and no other holders exist, it is freed.
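The replace-and-put step described above follows roughly this pattern (a hedged paraphrase; the actual `amdgpu_vm_sdma_commit()` body differs in detail):

```c
/* Paraphrase of the commit-side replacement, not the literal kernel code.
 * `f` holds a reference to the fence of the just-submitted PT update. */
swap(vm->last_update, f);	/* publish the new fence */
dma_fence_put(f);		/* drop the ref held via the old pointer;
				 * if it was the last one, the old fence is
				 * freed and any bare pointer now dangles */
```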
**Problem 2 - Fence leak**: In the calling code (lines 971-983 of 6.16.y), when `timeline_syncobj` is set and the operation is MAP/REPLACE, the `fence` (clear_freed fence) returned from `amdgpu_gem_va_update_vm()` is **never put** - it's passed to `amdgpu_gem_update_bo_mapping()` which ignores it for MAP/REPLACE operations.
**The fix** properly addresses both issues:

- Moves fence selection into `amdgpu_gem_va_update_vm()`, which runs immediately after the VM operations complete
- Takes explicit references with `dma_fence_get()` on the selected fence
- Returns the properly referenced fence to the caller
- The caller **always** calls `dma_fence_put(fence)` regardless of whether a timeline syncobj is used (see the sketch below)
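The resulting caller-side contract can be condensed as follows (a simplified sketch of the ioctl flow shown in the diff below, with error handling and the `AMDGPU_VM_DELAY_UPDATE` check elided):

```c
/* Simplified from the diff below: the ioctl owns exactly one reference
 * on the returned fence and always drops it; the drm_syncobj helpers
 * take their own references internally. */
fence = amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va, args->operation);

if (timeline_syncobj && fence) {
	if (!args->vm_timeline_point)
		drm_syncobj_replace_fence(timeline_syncobj, fence);
	else
		drm_syncobj_add_point(timeline_syncobj, timeline_chain,
				      fence, args->vm_timeline_point);
}
dma_fence_put(fence);	/* always balanced, whether or not it was exported */
```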
### 3. CLASSIFICATION
This is a **use-after-free / refcount underflow bug fix** disguised as a "refactor." The crash is a kernel panic from interrupt context - one of the most severe possible outcomes. It's reachable from userspace via the `amdgpu_gem_va_ioctl` ioctl.
### 4. AFFECTED VERSIONS
- Feature introduced: commit 70773bef4e091, first in **v6.16-rc1** (not in v6.15 or earlier)
- Partial memleak fix: commit ad6c120f68880, also in v6.16-rc1
- **Stable trees affected**: 6.16.y, 6.17.y, 6.18.y (all confirmed to have the buggy code)
- **NOT affected**: 6.12.y LTS, 6.6.y LTS, 6.1.y LTS, 5.15.y LTS (no timeline syncobj code)
### 5. SCOPE AND RISK
- **Lines changed**: +82/-53 per the diffstat (net +29 lines, mostly restructuring)
- **Files changed**: 1 (`amdgpu_gem.c`)
- **Risk**: Moderate - it's a structural change, but well-contained and well-reviewed
- **Review quality**: Reviewed-by Christian König (the top DRM fence/amdgpu expert), Suggested-by Christian König
- **Dependency**: Uses `amdgpu_vm_is_bo_always_valid()`, which has existed since v6.11 - available in all affected trees
- **Backport concern**: The diff may need minor adaptation for 6.16.y/6.17.y due to `va_flags` changes (6716a823d18d4 in v6.18-rc1)
### 6. USER IMPACT
- **Severity**: Kernel panic (fatal exception in interrupt handler)
- **Trigger**: Userspace GPU applications using the VM timeline syncobj feature (usermode queues)
- **Who is affected**: AMD GPU users on 6.16+ kernels using Mesa/AMDVLK with timeline synchronization
### 7. VERDICT
Despite being titled "Refactor," this commit fixes:

1. A **use-after-free** on `dma_fence` objects (accessing freed memory)
2. A **refcount underflow** leading to premature fence destruction
3. A **kernel panic** in interrupt context (`dma_fence_signal_timestamp_locked`)
4. A **fence memory leak** (the clear_freed fence was never put when a timeline was used)
The fix is well-reviewed by the subsystem expert, self-contained to one file, and addresses a real crash that users hit. The affected code exists in 6.16.y, 6.17.y, and 6.18.y stable trees.
**YES**
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 135 ++++++++++++++----------
 1 file changed, 82 insertions(+), 53 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index 3e38c5db29871..c5570a9698ed2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -112,47 +112,6 @@ amdgpu_gem_update_timeline_node(struct drm_file *filp,
 	return 0;
 }
 
-static void
-amdgpu_gem_update_bo_mapping(struct drm_file *filp,
-			     struct amdgpu_bo_va *bo_va,
-			     uint32_t operation,
-			     uint64_t point,
-			     struct dma_fence *fence,
-			     struct drm_syncobj *syncobj,
-			     struct dma_fence_chain *chain)
-{
-	struct amdgpu_bo *bo = bo_va ? bo_va->base.bo : NULL;
-	struct amdgpu_fpriv *fpriv = filp->driver_priv;
-	struct amdgpu_vm *vm = &fpriv->vm;
-	struct dma_fence *last_update;
-
-	if (!syncobj)
-		return;
-
-	/* Find the last update fence */
-	switch (operation) {
-	case AMDGPU_VA_OP_MAP:
-	case AMDGPU_VA_OP_REPLACE:
-		if (bo && (bo->tbo.base.resv == vm->root.bo->tbo.base.resv))
-			last_update = vm->last_update;
-		else
-			last_update = bo_va->last_pt_update;
-		break;
-	case AMDGPU_VA_OP_UNMAP:
-	case AMDGPU_VA_OP_CLEAR:
-		last_update = fence;
-		break;
-	default:
-		return;
-	}
-
-	/* Add fence to timeline */
-	if (!point)
-		drm_syncobj_replace_fence(syncobj, last_update);
-	else
-		drm_syncobj_add_point(syncobj, chain, last_update, point);
-}
-
 static vm_fault_t amdgpu_gem_fault(struct vm_fault *vmf)
 {
 	struct ttm_buffer_object *bo = vmf->vma->vm_private_data;
@@ -764,16 +723,19 @@ amdgpu_gem_va_update_vm(struct amdgpu_device *adev,
 				    struct amdgpu_bo_va *bo_va,
 				    uint32_t operation)
 {
-	struct dma_fence *fence = dma_fence_get_stub();
+	struct dma_fence *clear_fence = dma_fence_get_stub();
+	struct dma_fence *last_update = NULL;
 	int r;
 
 	if (!amdgpu_vm_ready(vm))
-		return fence;
+		return clear_fence;
 
-	r = amdgpu_vm_clear_freed(adev, vm, &fence);
+	/* First clear freed BOs and get a fence for that work, if any. */
+	r = amdgpu_vm_clear_freed(adev, vm, &clear_fence);
 	if (r)
 		goto error;
 
+	/* For MAP/REPLACE we also need to update the BO mappings. */
 	if (operation == AMDGPU_VA_OP_MAP ||
 	    operation == AMDGPU_VA_OP_REPLACE) {
 		r = amdgpu_vm_bo_update(adev, bo_va, false);
@@ -781,13 +743,59 @@ amdgpu_gem_va_update_vm(struct amdgpu_device *adev,
 			goto error;
 	}
 
+	/* Always update PDEs after we touched the mappings. */
 	r = amdgpu_vm_update_pdes(adev, vm, false);
+	if (r)
+		goto error;
+
+	/*
+	 * Decide which fence represents the "last update" for this VM/BO:
+	 *
+	 * - For MAP/REPLACE we want the PT update fence, which is tracked as
+	 *   either vm->last_update (for always-valid BOs) or bo_va->last_pt_update
+	 *   (for per-BO updates).
+	 *
+	 * - For UNMAP/CLEAR we rely on the fence returned by
+	 *   amdgpu_vm_clear_freed(), which already covers the page table work
+	 *   for the removed mappings.
+	 */
+	switch (operation) {
+	case AMDGPU_VA_OP_MAP:
+	case AMDGPU_VA_OP_REPLACE:
+		if (bo_va && bo_va->base.bo) {
+			if (amdgpu_vm_is_bo_always_valid(vm, bo_va->base.bo)) {
+				if (vm->last_update)
+					last_update = dma_fence_get(vm->last_update);
+			} else {
+				if (bo_va->last_pt_update)
+					last_update = dma_fence_get(bo_va->last_pt_update);
+			}
+		}
+		break;
+	case AMDGPU_VA_OP_UNMAP:
+	case AMDGPU_VA_OP_CLEAR:
+		if (clear_fence)
+			last_update = dma_fence_get(clear_fence);
+		break;
+	default:
+		break;
+	}
 
 error:
 	if (r && r != -ERESTARTSYS)
 		DRM_ERROR("Couldn't update BO_VA (%d)\n", r);
 
-	return fence;
+	/*
+	 * If we managed to pick a more specific last-update fence, prefer it
+	 * over the generic clear_fence and drop the extra reference to the
+	 * latter.
+	 */
+	if (last_update) {
+		dma_fence_put(clear_fence);
+		return last_update;
+	}
+
+	return clear_fence;
 }
 
 int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
@@ -813,6 +821,7 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 	uint64_t vm_size;
 	int r = 0;
 
+	/* Validate virtual address range against reserved regions. */
 	if (args->va_address < AMDGPU_VA_RESERVED_BOTTOM) {
 		dev_dbg(dev->dev,
 			"va_address 0x%llx is in reserved area 0x%llx\n",
@@ -846,6 +855,7 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 		return -EINVAL;
 	}
 
+	/* Validate operation type. */
 	switch (args->operation) {
 	case AMDGPU_VA_OP_MAP:
 	case AMDGPU_VA_OP_UNMAP:
@@ -869,6 +879,7 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 		abo = NULL;
 	}
 
+	/* Add input syncobj fences (if any) for synchronization. */
 	r = amdgpu_gem_add_input_fence(filp,
 				       args->input_fence_syncobj_handles,
 				       args->num_syncobj_handles);
@@ -891,6 +902,7 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 		goto error;
 	}
 
+	/* Resolve the BO-VA mapping for this VM/BO combination. */
 	if (abo) {
 		bo_va = amdgpu_vm_bo_find(&fpriv->vm, abo);
 		if (!bo_va) {
@@ -903,6 +915,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 		bo_va = NULL;
 	}
 
+	/*
+	 * Prepare the timeline syncobj node if the user requested a VM
+	 * timeline update. This only allocates/looks up the syncobj and
+	 * chain node; the actual fence is attached later.
+	 */
 	r = amdgpu_gem_update_timeline_node(filp,
 					    args->vm_timeline_syncobj_out,
 					    args->vm_timeline_point,
@@ -934,18 +951,30 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 	default:
 		break;
 	}
+
+	/*
+	 * Once the VA operation is done, update the VM and obtain the fence
+	 * that represents the last relevant update for this mapping. This
+	 * fence can then be exported to the user-visible VM timeline.
+	 */
 	if (!r && !(args->flags & AMDGPU_VM_DELAY_UPDATE) &&
 	    !adev->debug_vm) {
 		fence = amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
 						args->operation);
 
-		if (timeline_syncobj)
-			amdgpu_gem_update_bo_mapping(filp, bo_va,
-						     args->operation,
-						     args->vm_timeline_point,
-						     fence, timeline_syncobj,
-						     timeline_chain);
-		else
-			dma_fence_put(fence);
+		if (timeline_syncobj && fence) {
+			if (!args->vm_timeline_point) {
+				/* Replace the existing fence when no point is given. */
+				drm_syncobj_replace_fence(timeline_syncobj,
+							  fence);
+			} else {
+				/* Attach the last-update fence at a specific point. */
+				drm_syncobj_add_point(timeline_syncobj,
+						      timeline_chain,
+						      fence,
+						      args->vm_timeline_point);
+			}
+		}
+		dma_fence_put(fence);
 	}