[PATCH AUTOSEL 6.17] drm/amdgpu: Add fallback to pipe reset if KCQ ring reset fails

25 Oct 2025

From: "Jesse.Zhang" <Jesse.Zhang@amd.com>
[ Upstream commit 7469567d882374dcac3fdb8b300e0f28cf875a75 ]
Add a fallback mechanism to attempt pipe reset when KCQ reset
fails to recover the ring. After performing the KCQ reset and
queue remapping, test the ring functionality. If the ring test
fails, initiate a pipe reset as an additional recovery step.
v2: fix the typo (Lijo)
v3: try pipeline reset when kiq mapping fails (Lijo)
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
- The patch makes `gfx_v9_4_3_reset_kcq()` retry with a pipe-level reset
  when queue-level recovery fails: it tracks the current mode
  (`reset_mode` at drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c:3563), flips
  it when `gfx_v9_4_3_reset_hw_pipe()` runs
  (drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c:3600), and now re-enters the
  reset logic if the KIQ queue remap or the final ring validation still
  fails while only a per-queue reset has been attempted
  (drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c:3623 and :3631). This plugs
  the hole where the earlier pipe-reset support never triggered on those
  later failure points.
- Without this fallback, a KCQ reset that cannot revive the ring bubbles
  up as an error, sending the scheduler down the full GPU reset path in
  `amdgpu_job.c` (drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:132-170); that
  is a user-visible functional failure. The new logic keeps recovery
  local to the ring, exactly as the original pipe-reset series intended.
- The change is confined to GC 9.4.3’s compute reset path, is only
  exercised when recovery is already failing, and relies solely on the
  pipe-reset infrastructure that has shipped since v6.12 (e.g., commit
  ad17b124). Risk of regression is therefore minimal for stable trees
  carrying this IP block. Branches that lack the earlier pipe-reset
  support simply wouldn’t take this patch.
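
The control flow described in the bullets above can be sketched in user space as follows. This is a simplified model, not the driver code: `struct sim_ring`, `sim_reset_kcq()`, the `SIM_*` return codes and the boolean outcome flags are all hypothetical stand-ins for the hardware steps, and the behaviour after a failed pipe reset is an assumption, since the diff does not show that context.

```c
#include <stdbool.h>

/* Hypothetical return codes for this model (not kernel errno values). */
enum { SIM_OK = 0, SIM_FAIL = -1, SIM_UNSUPPORTED = -2 };

/* Mirrors the role of reset_mode in the patch. */
enum sim_reset_mode { SIM_PER_QUEUE, SIM_PER_PIPE };

/* Hypothetical ring model: each flag scripts the outcome of one step. */
struct sim_ring {
	bool queue_reset_ok;   /* per-queue reset step succeeds */
	bool pipe_supported;   /* per-pipe reset is advertised as supported */
	bool pipe_reset_ok;    /* pipe reset itself succeeds */
	bool remap_ok_queue;   /* queue remap works after a queue-only reset */
	bool test_ok_queue;    /* ring test passes after a queue-only reset */
	bool remap_ok_pipe;    /* queue remap works after a pipe reset */
	int pipe_resets;       /* number of pipe resets attempted */
};

static int sim_reset_kcq(struct sim_ring *ring)
{
	enum sim_reset_mode mode = SIM_PER_QUEUE;
	int r = ring->queue_reset_ok ? SIM_OK : SIM_FAIL;

pipe_reset:
	if (r) {
		if (!ring->pipe_supported)
			return SIM_UNSUPPORTED;
		ring->pipe_resets++;
		/* Record the escalation so later failures do not loop back. */
		mode = SIM_PER_PIPE;
		if (!ring->pipe_reset_ok)
			return SIM_FAIL; /* assumed: a failed pipe reset is final */
	}

	/* Remap the queue; the outcome depends on how far we escalated. */
	if (mode == SIM_PER_QUEUE)
		r = ring->remap_ok_queue ? SIM_OK : SIM_FAIL;
	else
		r = ring->remap_ok_pipe ? SIM_OK : SIM_FAIL;
	if (r) {
		if (mode == SIM_PER_QUEUE)
			goto pipe_reset; /* fallback on remap failure */
		return r;
	}

	/* Validate the ring only when no pipe reset has been tried yet. */
	if (mode == SIM_PER_QUEUE) {
		r = ring->test_ok_queue ? SIM_OK : SIM_FAIL;
		if (r)
			goto pipe_reset; /* fallback on ring-test failure */
	}

	return SIM_OK;
}
```

Because the backward jump is taken only while the mode is still per-queue, and the pipe-reset branch always switches the mode, the model escalates at most once per call.
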
drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
index 51babf5c78c86..f06bc94cf6e14 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
@@ -3562,6 +3562,7 @@ static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
    struct amdgpu_device *adev = ring->adev;
    struct amdgpu_kiq *kiq = &adev->gfx.kiq[ring->xcc_id];
    struct amdgpu_ring *kiq_ring = &kiq->ring;
+	int reset_mode = AMDGPU_RESET_TYPE_PER_QUEUE;
    unsigned long flags;
    int r;
@@ -3599,6 +3600,7 @@ static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
    	if (!(adev->gfx.compute_supported_reset & AMDGPU_RESET_TYPE_PER_PIPE))
    		return -EOPNOTSUPP;
    	r = gfx_v9_4_3_reset_hw_pipe(ring);
+		reset_mode = AMDGPU_RESET_TYPE_PER_PIPE;
    	dev_info(adev->dev, "ring: %s pipe reset :%s\n", ring->name,
    			r ? "failed" : "successfully");
    	if (r)
@@ -3621,10 +3623,20 @@ static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
    r = amdgpu_ring_test_ring(kiq_ring);
    spin_unlock_irqrestore(&kiq->ring_lock, flags);
    if (r) {
+		if (reset_mode == AMDGPU_RESET_TYPE_PER_QUEUE)
+			goto pipe_reset;
+
    	dev_err(adev->dev, "fail to remap queue\n");
    	return r;
    }
+	if (reset_mode == AMDGPU_RESET_TYPE_PER_QUEUE) {
+		r = amdgpu_ring_test_ring(ring);
+		if (r)
+			goto pipe_reset;
+	}
+
+
    return amdgpu_ring_reset_helper_end(ring, timedout_fence);
 }
-- 
2.51.0



    
