From: Ce Sun <cesun102@amd.com>
[ Upstream commit da467352296f8e50c7ab7057ead44a1df1c81496 ]
Move amdgpu_device_health_check into amdgpu_device_gpu_recover so that whether the device is still present on the bus can be checked before reset.
The reason is:

1. During the DPC event, the device on which the DPC event occurs is
   not present on the bus.

2. When a DPC event and an ATHUB event occur simultaneously, the DPC
   thread holds the reset domain lock while detecting the error, and
   the GPU recover thread acquires the hive lock. Because the device
   is simultaneously in the amdgpu_ras_in_recovery and occurs_dpc
   states, the GPU recover thread does not go through
   amdgpu_device_health_check; instead it waits for the reset domain
   lock, which the DPC thread has not released. Meanwhile the DPC
   callback slot_reset tries to obtain the hive lock, which is held by
   the GPU recover thread at this point. So a deadlock occurs.
Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
- What it fixes
  - Prevents a real-world deadlock between the GPU recovery path and
    PCIe DPC error handling when an ATHUB error and a DPC event occur
    concurrently in an XGMI hive. Today, the GPU recovery thread takes
    the hive lock first and then waits for the reset-domain lock, while
    the DPC path already holds the reset-domain lock and later needs
    the hive lock, leading to a lock inversion and a hang.
  - The deadlock arises because the health check is currently skipped
    during DPC (gated by `occurs_dpc`), so the GPU recovery path
    proceeds far enough to try to take the reset lock rather than
    bailing out early when the device is no longer present on the bus.
- Current behavior in this tree
  - Health check implementation:
    `drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6123` (uses
    `amdgpu_device_bus_status_check` in
    `drivers/gpu/drm/amd/amdgpu/amdgpu.h:1774`, which reads PCI config
    space to verify the device is still on the bus; a hedged sketch of
    this idea follows this list).
  - The health check is called from `amdgpu_device_recovery_prepare()`
    and is skipped during DPC: see the call at
    `drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6162`, guarded by
    `if (!adev->pcie_reset_ctx.occurs_dpc)`.
  - GPU recovery lock order: the hive lock is taken before the reset
    lock is attempted; the reset lock is acquired at
    `drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6464`.
  - The DPC path takes the reset lock early (in error_detected, at
    `drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6911`) and later needs
    the hive lock, which is taken again in `amdgpu_pci_slot_reset()`.
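The presence test behind the health check amounts to a PCI config-space
read: a device that has fallen off the bus reads back all-ones. Below is
a minimal kernel-style sketch of that idea; the helper name
`device_present_on_bus` is hypothetical and this is not the actual
amdgpu implementation.

```c
#include <linux/pci.h>

/*
 * Illustrative only: a config-space read that returns all-ones means
 * the device is no longer present on the bus, e.g. after a DPC event
 * has brought the link down. Not the actual amdgpu implementation.
 */
static int device_present_on_bus(struct pci_dev *pdev)
{
	u32 id;

	/* The vendor/device ID reads back as 0xffffffff when absent. */
	pci_read_config_dword(pdev, PCI_VENDOR_ID, &id);
	if (id == 0xffffffff)
		return -ENODEV;

	return 0;
}
```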
- What this commit changes
  - Moves the health check out of `amdgpu_device_recovery_prepare()`
    (and drops its return value) so `amdgpu_device_recovery_prepare()`
    only builds the reset device list. This eliminates the
    `occurs_dpc` gate there.
  - Calls `amdgpu_device_health_check()` in
    `amdgpu_device_gpu_recover()` immediately after building the device
    list and before attempting to take the reset-domain lock. This
    ensures that when DPC has occurred (device lost from the bus), the
    check returns `-ENODEV` and the recovery path bails out early,
    releasing the hive lock and avoiding the lock inversion with the
    DPC thread.
  - Net effect: on DPC, GPU recovery no longer contends for the
    reset-domain lock; it exits cleanly because the health check fails,
    allowing the DPC thread to proceed.
- Why this prevents the deadlock
  - Before: GPU recovery holds the hive lock, skips the health check
    due to `occurs_dpc`, then blocks on the reset-domain lock; DPC
    holds the reset-domain lock and later blocks on the hive lock →
    deadlock.
  - After: GPU recovery holds the hive lock, runs the health check
    unconditionally for non-VF, sees the device is lost (DPC), returns
    early and releases the hive lock; DPC can then obtain the hive lock
    and complete. A standalone sketch of the inversion follows below.
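To make the inversion concrete, here is a minimal standalone pthreads
sketch (userspace, not driver code); the two mutexes stand in for the
hive and reset-domain locks, and both thread bodies are hypothetical.
With unlucky scheduling, each thread ends up blocked on the lock the
other holds:

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t hive_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t reset_domain_lock = PTHREAD_MUTEX_INITIALIZER;

/* Models the GPU recovery thread: hive lock first, reset lock second. */
static void *gpu_recover_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&hive_lock);
	/* Pre-patch: health check skipped because occurs_dpc is set. */
	pthread_mutex_lock(&reset_domain_lock);	/* blocks: DPC holds it */
	pthread_mutex_unlock(&reset_domain_lock);
	pthread_mutex_unlock(&hive_lock);
	return NULL;
}

/*
 * Models the DPC thread: reset lock taken in error_detected(), hive
 * lock needed later in slot_reset().
 */
static void *dpc_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&reset_domain_lock);
	pthread_mutex_lock(&hive_lock);	/* blocks: recovery holds it */
	pthread_mutex_unlock(&hive_lock);
	pthread_mutex_unlock(&reset_domain_lock);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, gpu_recover_thread, NULL);
	pthread_create(&t2, NULL, dpc_thread, NULL);
	pthread_join(t1, NULL);	/* with unlucky interleaving, never returns */
	pthread_join(t2, NULL);
	printf("no deadlock on this run\n");
	return 0;
}
```

In these terms, the fix corresponds to the recovery thread checking bus
presence (and bailing out, dropping `hive_lock`) before ever touching
`reset_domain_lock`.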
- Risk and scope
  - Scope: single-file change; no architectural refactor. The change
    restores the earlier, safer placement/semantics (health check
    before lock acquisition) and removes the special-case gating for
    DPC.
  - Callers: the signature change of `amdgpu_device_recovery_prepare()`
    is localized within the same file; its only callers are updated
    accordingly. No interfaces exposed outside the driver are changed.
  - Behavior under SR-IOV VF remains unchanged (the health check is
    still skipped for VFs).
  - Note: the diff also removes the "skip slot reset during RAS
    recovery" early return in `amdgpu_pci_slot_reset()`
    (`drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6969`). That widens
    when slot reset proceeds during RAS recovery. While unrelated to
    the deadlock fix, it keeps error handling consistent and avoids
    suppressing the DPC recovery path; the main deadlock is eliminated
    by the earlier health-check bail-out. This is the only part that
    slightly increases behavioral surface area, but it remains confined
    to AMDGPU's AER recovery.
- Stable backport fit
  - Fixes an important deadlock affecting users under real error
    conditions (DPC + ATHUB).
  - Small and contained: a revert of a fragile conditional (the
    `occurs_dpc` gate) plus a call-placement tweak.
  - No new features, minimal regression risk, limited to the AMDGPU
    reset/AER code paths.
Given the bug severity (deadlock/hang) and contained nature of the fix, this is a good candidate for stable backport.
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c8459337fcb89..dfa68cb411966 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6132,12 +6132,11 @@ static int amdgpu_device_health_check(struct list_head *device_list_handle)
 	return ret;
 }
 
-static int amdgpu_device_recovery_prepare(struct amdgpu_device *adev,
+static void amdgpu_device_recovery_prepare(struct amdgpu_device *adev,
 					  struct list_head *device_list,
 					  struct amdgpu_hive_info *hive)
 {
 	struct amdgpu_device *tmp_adev = NULL;
-	int r;
 
 	/*
 	 * Build list of devices to reset.
@@ -6157,14 +6156,6 @@ static int amdgpu_device_recovery_prepare(struct amdgpu_device *adev,
 	} else {
 		list_add_tail(&adev->reset_list, device_list);
 	}
-
-	if (!amdgpu_sriov_vf(adev) && (!adev->pcie_reset_ctx.occurs_dpc)) {
-		r = amdgpu_device_health_check(device_list);
-		if (r)
-			return r;
-	}
-
-	return 0;
 }
 
 static void amdgpu_device_recovery_get_reset_lock(struct amdgpu_device *adev,
@@ -6457,8 +6448,13 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	reset_context->hive = hive;
 	INIT_LIST_HEAD(&device_list);
 
-	if (amdgpu_device_recovery_prepare(adev, &device_list, hive))
-		goto end_reset;
+	amdgpu_device_recovery_prepare(adev, &device_list, hive);
+
+	if (!amdgpu_sriov_vf(adev)) {
+		r = amdgpu_device_health_check(&device_list);
+		if (r)
+			goto end_reset;
+	}
 
 	/* We need to lock reset domain only once both for XGMI and single device */
 	amdgpu_device_recovery_get_reset_lock(adev, &device_list);
@@ -6965,12 +6961,6 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 	int r = 0, i;
 	u32 memsize;
 
-	/* PCI error slot reset should be skipped During RAS recovery */
-	if ((amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) ||
-	     amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 4)) &&
-	    amdgpu_ras_in_recovery(adev))
-		return PCI_ERS_RESULT_RECOVERED;
-
 	dev_info(adev->dev, "PCI error: slot reset callback!!\n");
 
 	memset(&reset_context, 0, sizeof(reset_context));