From: "Jesse.Zhang" <Jesse.Zhang@amd.com>
[ Upstream commit 7469567d882374dcac3fdb8b300e0f28cf875a75 ]
Add a fallback mechanism to attempt pipe reset when KCQ reset fails to recover the ring. After performing the KCQ reset and queue remapping, test the ring functionality. If the ring test fails, initiate a pipe reset as an additional recovery step.
v2: fix the typo (Lijo)
v3: try pipeline reset when kiq mapping fails (Lijo)
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES

- The patch makes `gfx_v9_4_3_reset_kcq()` retry with a pipe-level reset
  when queue-level recovery fails: it tracks the current mode
  (`reset_mode` at drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c:3563), flips it
  when `gfx_v9_4_3_reset_hw_pipe()` runs
  (drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c:3600), and now re-enters the
  reset logic if the KIQ queue remap or the final ring validation still
  fails while only a per-queue reset was attempted
  (drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c:3623 and :3631). This plugs the
  hole where the earlier pipe-reset support never triggered on those later
  failure points.
- Without this fallback, a KCQ reset that cannot revive the ring bubbles up
  as an error, sending the scheduler down the full GPU reset path in
  `amdgpu_job.c` (drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:132-170); that is
  a user-visible functional failure. The new logic keeps recovery local to
  the ring, exactly as the original pipe-reset series intended.
- The change is confined to GC 9.4.3’s compute reset path, runs only when
  recovery is already failing, and relies solely on the pipe-reset
  infrastructure that has shipped since v6.12 (e.g., commit ad17b124). Risk
  of regression is therefore minimal for stable trees carrying this IP
  block. Branches that lack the earlier pipe-reset support simply wouldn’t
  take this patch.
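The escalation described above can be modeled outside the kernel. The following is a minimal userspace sketch, not the driver code: `struct fake_hw`, `enum reset_mode`, and `reset_kcq_sketch()` are hypothetical names invented for illustration, and the boolean fields merely simulate whether each hardware step would succeed. The final `amdgpu_ring_reset_helper_end()` call is modeled as a plain `return 0`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the reset types; not the amdgpu definitions. */
enum reset_mode { RESET_PER_QUEUE, RESET_PER_PIPE };

/* Hypothetical hardware model: each flag says whether a step succeeds. */
struct fake_hw {
	bool queue_reset_ok;       /* per-queue (KCQ) reset succeeds */
	bool pipe_supported;       /* pipe reset is advertised as supported */
	bool pipe_reset_ok;        /* pipe reset itself succeeds */
	bool remap_ok_after_queue; /* queue remap works after a queue reset */
	bool ring_ok_after_queue;  /* ring test passes after a queue reset */
	bool remap_ok_after_pipe;  /* queue remap works after a pipe reset */
	int pipe_resets;           /* counts how many pipe resets were issued */
};

/* Sketch of the control flow: try a queue reset, escalate to a pipe reset
 * at most once if the remap or the ring test fails afterwards. */
int reset_kcq_sketch(struct fake_hw *hw)
{
	enum reset_mode mode = RESET_PER_QUEUE;
	int r;

	assert(hw != NULL);
	r = hw->queue_reset_ok ? 0 : -EIO;

pipe_reset:
	if (r) {
		if (!hw->pipe_supported)
			return -EOPNOTSUPP;
		hw->pipe_resets++;
		/* Recording the mode prevents a second escalation below. */
		mode = RESET_PER_PIPE;
		r = hw->pipe_reset_ok ? 0 : -EIO;
		if (r)
			return r;
	}

	/* Simulated queue remap through the KIQ. */
	if (mode == RESET_PER_QUEUE)
		r = hw->remap_ok_after_queue ? 0 : -EIO;
	else
		r = hw->remap_ok_after_pipe ? 0 : -EIO;
	if (r) {
		if (mode == RESET_PER_QUEUE)
			goto pipe_reset;
		return r; /* already escalated: give up */
	}

	/* Only a queue-level reset gets the extra ring validation. */
	if (mode == RESET_PER_QUEUE) {
		r = hw->ring_ok_after_queue ? 0 : -EIO;
		if (r)
			goto pipe_reset;
	}

	return 0; /* stands in for the final helper call */
}
```

Because `mode` flips to the pipe-level value before any retry, each `goto pipe_reset` can fire at most once per call, so in this model the function cannot loop indefinitely.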
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
index 51babf5c78c86..f06bc94cf6e14 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
@@ -3562,6 +3562,7 @@ static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
 	struct amdgpu_device *adev = ring->adev;
 	struct amdgpu_kiq *kiq = &adev->gfx.kiq[ring->xcc_id];
 	struct amdgpu_ring *kiq_ring = &kiq->ring;
+	int reset_mode = AMDGPU_RESET_TYPE_PER_QUEUE;
 	unsigned long flags;
 	int r;
 
@@ -3599,6 +3600,7 @@ static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
 		if (!(adev->gfx.compute_supported_reset & AMDGPU_RESET_TYPE_PER_PIPE))
 			return -EOPNOTSUPP;
 		r = gfx_v9_4_3_reset_hw_pipe(ring);
+		reset_mode = AMDGPU_RESET_TYPE_PER_PIPE;
 		dev_info(adev->dev, "ring: %s pipe reset :%s\n", ring->name,
 			 r ? "failed" : "successfully");
 		if (r)
@@ -3621,10 +3623,20 @@ static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
 	r = amdgpu_ring_test_ring(kiq_ring);
 	spin_unlock_irqrestore(&kiq->ring_lock, flags);
 	if (r) {
+		if (reset_mode == AMDGPU_RESET_TYPE_PER_QUEUE)
+			goto pipe_reset;
+
 		dev_err(adev->dev, "fail to remap queue\n");
 		return r;
 	}
 
+	if (reset_mode == AMDGPU_RESET_TYPE_PER_QUEUE) {
+		r = amdgpu_ring_test_ring(ring);
+		if (r)
+			goto pipe_reset;
+	}
+
+
 	return amdgpu_ring_reset_helper_end(ring, timedout_fence);
 }
 