From: Ce Sun <cesun102@amd.com>
[ Upstream commit da467352296f8e50c7ab7057ead44a1df1c81496 ]
Move amdgpu_device_health_check into amdgpu_device_gpu_recover so that whether the device is still present on the bus can be checked before reset.
The reason is:

1. During the DPC event, the device on which the DPC event occurs is
   not present on the bus.

2. When a DPC event and an ATHUB event occur simultaneously, the DPC
   thread holds the reset domain lock while detecting the error, and
   the GPU recover thread acquires the hive lock. Because the device
   is simultaneously in the amdgpu_ras_in_recovery and occurs_dpc
   states, the GPU recover thread does not go through
   amdgpu_device_health_check; instead it waits for the reset domain
   lock, which the DPC thread has not released. Meanwhile the DPC
   callback slot_reset tries to obtain the hive lock, which is held by
   the GPU recover thread at this point. So a deadlock occurs.
Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
- What it fixes
  - Prevents a real-world deadlock between the GPU recovery path and
    PCIe DPC error handling when an ATHUB error and a DPC event occur
    concurrently in an XGMI hive. Today, the GPU recovery thread takes
    the hive lock first and then waits for the reset-domain lock, while
    the DPC path already holds the reset-domain lock and later needs
    the hive lock, leading to a lock inversion and a hang.
  - The deadlock arises because the health check is currently skipped
    during DPC (gated by `occurs_dpc`), so the GPU recovery path
    proceeds far enough to try to take the reset lock rather than
    bailing out early when the device is no longer present on the bus.
- Current behavior in this tree
  - Health check implementation:
    `drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6123` (uses
    `amdgpu_device_bus_status_check` in
    `drivers/gpu/drm/amd/amdgpu/amdgpu.h:1774`, which reads PCI config
    space to verify the device is still on the bus; a hedged sketch of
    this idea follows this list).
  - The health check is called from `amdgpu_device_recovery_prepare()`
    and is skipped during DPC: see the call at
    `drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6162`, guarded by
    `if (!adev->pcie_reset_ctx.occurs_dpc)`.
  - GPU recovery lock order: the hive lock is taken before the reset
    lock is attempted; the reset lock is acquired at
    `drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6464`.
  - The DPC path takes the reset lock early (in error_detected, at
    `drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6911`) and later needs
    the hive lock, which is taken again in `amdgpu_pci_slot_reset()`.
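The presence test behind the health check amounts to a PCI config-space
read: a device that has fallen off the bus reads back all-ones. Below is
a minimal kernel-style sketch of that idea; the helper name
`device_present_on_bus` is hypothetical and this is not the actual
amdgpu implementation.

```c
#include <linux/pci.h>

/*
 * Illustrative only: a config-space read that returns all-ones means
 * the device is no longer present on the bus, e.g. after a DPC event
 * has brought the link down. Not the actual amdgpu implementation.
 */
static int device_present_on_bus(struct pci_dev *pdev)
{
	u32 id;

	/* The vendor/device ID reads back as 0xffffffff when absent. */
	pci_read_config_dword(pdev, PCI_VENDOR_ID, &id);
	if (id == 0xffffffff)
		return -ENODEV;

	return 0;
}
```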
- What this commit changes
  - Moves the health check out of `amdgpu_device_recovery_prepare()`
    (and drops its return value) so `amdgpu_device_recovery_prepare()`
    only builds the reset device list. This eliminates the
    `occurs_dpc` gate there.
  - Calls `amdgpu_device_health_check()` in
    `amdgpu_device_gpu_recover()` immediately after building the device
    list and before attempting to take the reset-domain lock. This
    ensures that when DPC has occurred (device lost from the bus), the
    check returns `-ENODEV` and the recovery path bails out early,
    releasing the hive lock and avoiding the lock inversion with the
    DPC thread.
  - Net effect: on DPC, GPU recovery no longer contends for the
    reset-domain lock; it exits cleanly because the health check fails,
    allowing the DPC thread to proceed.
- Why this prevents the deadlock
  - Before: GPU recovery holds the hive lock, skips the health check
    due to `occurs_dpc`, then blocks on the reset-domain lock; DPC
    holds the reset-domain lock and later blocks on the hive lock →
    deadlock.
  - After: GPU recovery holds the hive lock, runs the health check
    unconditionally for non-VF, sees the device is lost (DPC), returns
    early and releases the hive lock; DPC can then obtain the hive lock
    and complete. A standalone sketch of the inversion follows below.
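To make the inversion concrete, here is a minimal standalone pthreads
sketch (userspace, not driver code); the two mutexes stand in for the
hive and reset-domain locks, and both thread bodies are hypothetical.
With unlucky scheduling, each thread ends up blocked on the lock the
other holds:

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t hive_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t reset_domain_lock = PTHREAD_MUTEX_INITIALIZER;

/* Models the GPU recovery thread: hive lock first, reset lock second. */
static void *gpu_recover_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&hive_lock);
	/* Pre-patch: health check skipped because occurs_dpc is set. */
	pthread_mutex_lock(&reset_domain_lock);	/* blocks: DPC holds it */
	pthread_mutex_unlock(&reset_domain_lock);
	pthread_mutex_unlock(&hive_lock);
	return NULL;
}

/*
 * Models the DPC thread: reset lock taken in error_detected(), hive
 * lock needed later in slot_reset().
 */
static void *dpc_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&reset_domain_lock);
	pthread_mutex_lock(&hive_lock);	/* blocks: recovery holds it */
	pthread_mutex_unlock(&hive_lock);
	pthread_mutex_unlock(&reset_domain_lock);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, gpu_recover_thread, NULL);
	pthread_create(&t2, NULL, dpc_thread, NULL);
	pthread_join(t1, NULL);	/* with unlucky interleaving, never returns */
	pthread_join(t2, NULL);
	printf("no deadlock on this run\n");
	return 0;
}
```

In these terms, the fix corresponds to the recovery thread checking bus
presence (and bailing out, dropping `hive_lock`) before ever touching
`reset_domain_lock`.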
- Risk and scope
  - Scope: single-file change; no architectural refactor. The change
    restores the earlier, safer placement/semantics (health check
    before lock acquisition) and removes the special-case gating for
    DPC.
  - Callers: the signature change of `amdgpu_device_recovery_prepare()`
    is localized within the same file; its only callers are updated
    accordingly. No interfaces exposed outside the driver are changed.
  - Behavior under SR-IOV VF remains unchanged (the health check is
    still skipped for VFs).
  - Note: the diff also removes the "skip slot reset during RAS
    recovery" early return in `amdgpu_pci_slot_reset()`
    (`drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6969`). That widens
    when slot reset proceeds during RAS recovery. While unrelated to
    the deadlock fix, it keeps error handling consistent and avoids
    suppressing the DPC recovery path; the main deadlock is eliminated
    by the earlier health-check bail-out. This is the only part that
    slightly increases behavioral surface area, but it remains confined
    to AMDGPU's AER recovery.
- Stable backport fit
  - Fixes an important deadlock affecting users under real error
    conditions (DPC + ATHUB).
  - Small and contained: a revert of a fragile conditional (the
    `occurs_dpc` gate) plus a call-placement tweak.
  - No new features, minimal regression risk, limited to the AMDGPU
    reset/AER code paths.
Given the bug severity (deadlock/hang) and contained nature of the fix, this is a good candidate for stable backport.
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c8459337fcb89..dfa68cb411966 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6132,12 +6132,11 @@ static int amdgpu_device_health_check(struct list_head *device_list_handle)
 	return ret;
 }
 
-static int amdgpu_device_recovery_prepare(struct amdgpu_device *adev,
+static void amdgpu_device_recovery_prepare(struct amdgpu_device *adev,
 					  struct list_head *device_list,
 					  struct amdgpu_hive_info *hive)
 {
 	struct amdgpu_device *tmp_adev = NULL;
-	int r;
 
 	/*
 	 * Build list of devices to reset.
@@ -6157,14 +6156,6 @@ static int amdgpu_device_recovery_prepare(struct amdgpu_device *adev,
 	} else {
 		list_add_tail(&adev->reset_list, device_list);
 	}
-
-	if (!amdgpu_sriov_vf(adev) && (!adev->pcie_reset_ctx.occurs_dpc)) {
-		r = amdgpu_device_health_check(device_list);
-		if (r)
-			return r;
-	}
-
-	return 0;
 }
 
 static void amdgpu_device_recovery_get_reset_lock(struct amdgpu_device *adev,
@@ -6457,8 +6448,13 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	reset_context->hive = hive;
 	INIT_LIST_HEAD(&device_list);
 
-	if (amdgpu_device_recovery_prepare(adev, &device_list, hive))
-		goto end_reset;
+	amdgpu_device_recovery_prepare(adev, &device_list, hive);
+
+	if (!amdgpu_sriov_vf(adev)) {
+		r = amdgpu_device_health_check(&device_list);
+		if (r)
+			goto end_reset;
+	}
 
 	/* We need to lock reset domain only once both for XGMI and single device */
 	amdgpu_device_recovery_get_reset_lock(adev, &device_list);
@@ -6965,12 +6961,6 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 	int r = 0, i;
 	u32 memsize;
 
-	/* PCI error slot reset should be skipped During RAS recovery */
-	if ((amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) ||
-	     amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 4)) &&
-	    amdgpu_ras_in_recovery(adev))
-		return PCI_ERS_RESULT_RECOVERED;
-
 	dev_info(adev->dev, "PCI error: slot reset callback!!\n");
 
 	memset(&reset_context, 0, sizeof(reset_context));