From: Heng Zhou Heng.Zhou@amd.com
[ Upstream commit 859958a7faefe5b7742b7b8cdbc170713d4bf158 ]
If a amdgpu_bo_va is fpriv->prt_va, the bo of this one is always NULL. So, such kind of amdgpu_bo_va should be updated separately before amdgpu_vm_handle_moved.
Signed-off-by: Heng Zhou Heng.Zhou@amd.com Reviewed-by: Kasiviswanathan, Harish Harish.Kasiviswanathan@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Sasha Levin sashal@kernel.org ---
LLM Generated explanations, may be completely bogus:
YES
- Bug impact: The commit fixes a real, user-visible NULL pointer dereference during KFD process BO restore. In the KFD restore path, PRT (partial resident texture) mappings live in `fpriv->prt_va`, which is a VM mapping with no backing BO. This mapping can appear in the VM’s “moved/invalidated” lists, and `amdgpu_vm_handle_moved()` will then dereference `bo_va->base.bo`, causing a NULL deref. Specifically, `amdgpu_vm_handle_moved()` dereferences `bo_va->base.bo` in the invalidated loop to fetch the reservation object: drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:1608 and drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:1618. That’s unsafe for PRT VA since its BO is always NULL.
- Why the bug exists: `fpriv->prt_va` is created with a NULL BO (as intended) at drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:1428, which sets up a special VA mapping without a BO: it calls `amdgpu_vm_bo_add(adev, &fpriv->vm, NULL)`. Consequently, any generic handling that assumes `bo_va->base.bo` is non-NULL can crash if the PRT VA ends up in the VM’s invalidation or movement queues.
- What the change does: The patch updates the PRT mapping before calling the generic VM “handle moved” pass, ensuring the PRT VA is not present in those lists when the code that assumes a non-NULL BO runs. - Before: In the restore path, after validating PDs/PTs, the code directly calls `amdgpu_vm_handle_moved(adev, peer_vm, &exec.ticket)` for all VMs (drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2992). - After: It first derives `fpriv` from the VM, then explicitly updates `fpriv->prt_va` with `amdgpu_vm_bo_update(adev, fpriv->prt_va, false)` before calling `amdgpu_vm_handle_moved()` (as per the diff). This mirrors how command submission already handles PRT VA before calling handle_moved (see drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1149–1191). - Rationale: `amdgpu_vm_bo_update()` safely supports `bo_va->base.bo == NULL` (PRT case) and moves the mapping’s state to “done” without dereferencing a BO, see drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:1245–1390. It uses the VM’s fence (`vm->last_update`) instead of a BO fence when `bo == NULL`, and it moves the mapping out of the invalidated/moved state via `amdgpu_vm_bo_done(&bo_va->base)`.
- Safety and minimality: - The fix is small, localized to `amdgpu_amdkfd_gpuvm_restore_process_bos()` (drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2986–3004 region), with no architectural changes. - It follows an established pattern already present in the CS path: `amdgpu_cs` updates `fpriv->prt_va` before `amdgpu_vm_handle_moved()` (drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1149–1191). - It operates while PD/PTs are already reserved via the DRM exec/ticket path, matching the expectations of both `amdgpu_vm_bo_update()` and `amdgpu_vm_handle_moved()`. - The additional log message changes (`pr_debug` → `dev_dbg` with PID) are non-functional and low risk.
- Stable tree criteria: - Fixes a real crash (NULL deref), so it’s an important bug fix. - Change is small and contained to the AMDGPU KFD restore path. - No new features or API changes; no architectural rework. - Regression risk is low, because it matches the existing, proven sequence used in command submission code. - Touches a critical subsystem (AMDGPU), but the pattern is already used elsewhere, reducing risk.
- Applicability across stable branches: - This is relevant for stable kernels that have the KFD restore path structured like the one in your 6.17 workspace (with “Update mappings not managed by KFD” and the call to `amdgpu_vm_handle_moved()` within `amdgpu_amdkfd_gpuvm_restore_process_bos` at drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:2986–3004). - Older long-term branches (e.g., 5.4 in your `$CVEKERNELTREE`) have a different implementation of the restore path and do not invoke `amdgpu_vm_handle_moved()` there (see drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:1952 in that tree). For those, this exact backport is not applicable or needed.
Conclusion: This is a correct, minimal, and safe bug fix preventing a NULL pointer crash in the KFD eviction-restore path and mirrors existing correct handling in CS paths. It should be backported to stable trees that have this KFD restore flow.
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 902eac2c685f3..30d4a47535882 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -2993,9 +2993,22 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence __rcu * struct amdgpu_device *adev = amdgpu_ttm_adev( peer_vm->root.bo->tbo.bdev);
+ struct amdgpu_fpriv *fpriv = + container_of(peer_vm, struct amdgpu_fpriv, vm); + + ret = amdgpu_vm_bo_update(adev, fpriv->prt_va, false); + if (ret) { + dev_dbg(adev->dev, + "Memory eviction: handle PRT moved failed, pid %8d. Try again.\n", + pid_nr(process_info->pid)); + goto validate_map_fail; + } + ret = amdgpu_vm_handle_moved(adev, peer_vm, &exec.ticket); if (ret) { - pr_debug("Memory eviction: handle moved failed. Try again\n"); + dev_dbg(adev->dev, + "Memory eviction: handle moved failed, pid %8d. Try again.\n", + pid_nr(process_info->pid)); goto validate_map_fail; } }