[PATCH AUTOSEL 6.17] drm/amdgpu: Release hive reference properly

25 Oct 2025

From: Lijo Lazar lijo.lazar@amd.com
[ Upstream commit c1456fadce0c99175f97e66c2b982dd051e01aa2 ]
xgmi hive reference is taken on function entry, but not released
correctly for all paths. Use __free() to release reference properly.
Signed-off-by: Lijo Lazar lijo.lazar@amd.com
Reviewed-by: Ce Sun cesun102@amd.com
Signed-off-by: Alex Deucher alexander.deucher@amd.com
Signed-off-by: Sasha Levin sashal@kernel.org
---
LLM Generated explanations, may be completely bogus:
## Backport Analysis Complete
**Answer: YES**
This commit should be backported to stable kernel trees.
---
## Comprehensive Analysis
### Bug Description
This commit fixes a **kobject reference leak** in the AMDGPU driver's
PCI error handling code. The function `amdgpu_pci_error_detected()`
obtains an XGMI hive reference at function entry via
`amdgpu_get_xgmi_hive(adev)` but only releases it in one out of multiple
return paths, causing the reference to leak in the following cases:
**Leak Paths (lines 6940-6990 in amdgpu_device.c):**
1. **`pci_channel_io_normal` case** (line 6952): Returns
   `PCI_ERS_RESULT_CAN_RECOVER` without calling
   `amdgpu_put_xgmi_hive(hive)` - **LEAK**
2. **`pci_channel_io_frozen` with unsupported link reset** (line 6963):
   Returns `PCI_ERS_RESULT_DISCONNECT` when
   `amdgpu_dpm_is_link_reset_supported()` returns false - **LEAK**
3. **`pci_channel_io_perm_failure` case** (line 6986): Returns
   `PCI_ERS_RESULT_DISCONNECT` without releasing - **LEAK**
4. **Default fallthrough** (line 6990): Returns
   `PCI_ERS_RESULT_NEED_RESET` without releasing - **LEAK**
**Only correct path:** The `pci_channel_io_frozen` case (line 6980)
properly calls `amdgpu_put_xgmi_hive(hive)` after unlocking the mutex.
### The Fix
The commit elegantly solves this by:
1. **Defining a cleanup macro** in `amdgpu_xgmi.h:131`:
  ```c
  DEFINE_FREE(xgmi_put_hive, struct amdgpu_hive_info *, if (_T)
  amdgpu_put_xgmi_hive(_T))
  ```
2. **Using the `__free()` attribute** in `amdgpu_device.c:6940-6941`:
  ```c
  struct amdgpu_hive_info *hive __free(xgmi_put_hive) =
  amdgpu_get_xgmi_hive(adev);
  ```
This ensures automatic cleanup when the `hive` variable goes out of
scope, regardless of which return path is taken.
3. **Removing the manual cleanup** (lines 6980-6982): The explicit
   `amdgpu_put_xgmi_hive(hive)` call is removed since it's now handled
   automatically.
### Historical Context
**Timeline of the bug:**
- **March 2025 (v6.16)**: Commit `8ba904f54148d` ("drm/amdgpu: Multi-GPU
  DPC recovery support") by Ce Sun introduced the XGMI hive reference to
  this function, creating 3 leak paths
- **July 2025 (v6.18-rc1)**: Commit `91c4fd416463a6` ("drm/amdgpu: Set
  dpc status appropriately") by Lijo Lazar added the 4th leak path
- **September 2025 (v6.18-rc1)**: This commit `c1456fadce0c9` by Lijo
  Lazar fixes all leak paths
**Pattern of XGMI hive leaks:** My research found multiple previous
commits fixing similar XGMI hive reference leaks:
- `2efc30f0161b0` (2022): "drm/amdgpu: Fix hive reference count leak"
- `9dfa4860efb8c` (2022): "drm/amdgpu: fix hive reference leak when
  adding xgmi device"
- `1ff186ff32997` (2022): "drm/amdgpu: fix hive reference leak when
  reflecting psp topology info"
This demonstrates that XGMI hive reference management has been error-
prone, making the automatic cleanup approach especially valuable.
### Impact Assessment
**Severity:** Medium to High for affected systems
**Who is affected:**
- Systems with AMD XGMI multi-GPU configurations (MI200/MI300 series
  data center GPUs)
- Only triggers when PCI errors are detected on these systems
**Consequences:**
- **Kobject reference leak**: The XGMI hive kobject's reference count is
  incremented but not decremented
- **Memory leak**: The hive structure cannot be freed even when it
  should be released
- **Accumulation over time**: Repeated PCI errors will continue leaking
  references
- **System instability**: Eventually could lead to memory exhaustion or
  kobject reference count overflow
**Frequency:**
- PCI error detection is typically rare in healthy systems
- More common in systems with hardware issues, RAS (Reliability,
  Availability, Serviceability) testing, or during DPC (Downstream Port
  Containment) events
- The `pci_channel_io_normal` case (early return) is likely the most
  common leak path
### Technical Quality of the Fix
**Strengths:**
- **Modern kernel pattern**: Uses the `__free()` cleanup attribute
  introduced in v6.10
- **Automatic cleanup**: Compiler-enforced, eliminates human error
- **Small and contained**: Only 11 lines changed across 2 files
- **No behavioral changes**: Pure bug fix, no functional modifications
- **Well-tested pattern**: The cleanup infrastructure is widely used and
  battle-tested
**Weaknesses:**
- **Missing Fixes: tag**: Should have included `Fixes: 8ba904f54148d
  ("drm/amdgpu: Multi-GPU DPC recovery support")`
- **No stable tag**: Commit message lacks `Cc: stable@vger.kernel.org`
### Backport Compatibility
**Infrastructure requirements:**
- Requires `include/linux/cleanup.h` with `DEFINE_FREE()` and `__free()`
  support
- First introduced in v6.10 (commit `54da6a0924311`)
**Affected kernel versions:**
- **Bug introduced**: v6.16 (first 3 leak paths)
- **Bug worsened**: v6.18-rc1 (4th leak path added)
- **Bug fixed**: v6.18-rc1 (this commit)
**Backport targets:**
- ✅ v6.16.x stable - Has cleanup.h infrastructure, has the bug
- ✅ v6.17.x stable - Has cleanup.h infrastructure, has the bug
- ❌ v6.15.x and earlier - Bug not present (commit 8ba904f54148d not in
  these versions)
- ❌ v6.9.x and earlier - cleanup.h infrastructure not available
**Backport process:**
- Direct cherry-pick should work without modification
- All dependencies (cleanup.h, DEFINE_FREE) are present in v6.16+
### Why This Should Be Backported
**Meets stable kernel criteria:**
✅ **Fixes a real bug**: Kobject reference leak causing memory leak
✅ **Affects real users**: AMD XGMI multi-GPU systems with PCI errors
✅ **Important bugfix**: PCI error handling is critical for system
reliability
✅ **Small and simple**: Only 11 lines changed, minimal complexity
✅ **Low regression risk**: Pure cleanup fix, no logic changes
✅ **No new features**: Just bug fix, no feature additions
✅ **No architectural changes**: Confined to one function
✅ **Already in mainline**: Present in v6.18-rc1
✅ **Clean dependencies**: Only needs cleanup.h (available since v6.10)
**Additional justification:**
1. **Critical subsystem**: PCI error recovery is essential for system
   stability and RAS features in data center environments
2. **Resource leak accumulation**: Unlike some bugs that have one-time
   impact, this leak accumulates with each PCI error event
3. **Modern cleanup pattern**: This fix demonstrates best practices for
   resource management in the kernel
4. **Prevents future issues**: The automatic cleanup prevents this class
   of bug from being reintroduced
### Recommendation
**BACKPORT TO:**
- linux-6.17.y stable tree
- linux-6.16.y stable tree
**NOT NEEDED FOR:**
- linux-6.15.y and earlier (bug not present)
**SUGGESTED STABLE COMMIT MESSAGE ADDITIONS:**
```
Fixes: 8ba904f54148d ("drm/amdgpu: Multi-GPU DPC recovery support")
Cc: stable@vger.kernel.org # v6.16+
```
### Verification
To verify my analysis, I:
1. ✅ Used semcode MCP tools to analyze function implementations
2. ✅ Examined kobject reference counting mechanism
   (`amdgpu_get_xgmi_hive` calls `kobject_get`, `amdgpu_put_xgmi_hive`
   calls `kobject_put`)
3. ✅ Traced git history to find when the bug was introduced
4. ✅ Verified cleanup.h infrastructure availability across kernel
   versions
5. ✅ Analyzed all code paths in the affected function
6. ✅ Researched related XGMI hive leak fixes
7. ✅ Confirmed the fix is already in mainline (v6.18-rc1)
---
**FINAL VERDICT: YES - This commit should be backported to v6.16.x and
v6.17.x stable kernel trees.**
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   | 4 ++++
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 274bb4d857d36..56a737df87cc7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6880,7 +6880,8 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
 {
    struct drm_device *dev = pci_get_drvdata(pdev);
    struct amdgpu_device *adev = drm_to_adev(dev);
-	struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev);
+	struct amdgpu_hive_info *hive __free(xgmi_put_hive) =
+		amdgpu_get_xgmi_hive(adev);
    struct amdgpu_reset_context reset_context;
    struct list_head device_list;
@@ -6911,10 +6912,8 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
    	amdgpu_device_recovery_get_reset_lock(adev, &device_list);
    	amdgpu_device_halt_activities(adev, NULL, &reset_context, &device_list,
    				      hive, false);
-		if (hive) {
+		if (hive)
    		mutex_unlock(&hive->hive_lock);
-			amdgpu_put_xgmi_hive(hive);
-		}
    	return PCI_ERS_RESULT_NEED_RESET;
    case pci_channel_io_perm_failure:
    	/* Permanent error, prepare for device removal */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
index bba0b26fee8f1..5f36aff17e79e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
@@ -126,4 +126,8 @@ uint32_t amdgpu_xgmi_get_max_bandwidth(struct amdgpu_device *adev);
void amgpu_xgmi_set_max_speed_width(struct amdgpu_device *adev,
    			    uint16_t max_speed, uint8_t max_width);
+
+/* Cleanup macro for use with __free(xgmi_put_hive) */
+DEFINE_FREE(xgmi_put_hive, struct amdgpu_hive_info *, if (_T) amdgpu_put_xgmi_hive(_T))
+
 #endif
-- 
2.51.0



    

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

[PATCH AUTOSEL 6.17] drm/amdgpu: Release hive reference properly