From: Timur Kristóf <timur.kristof@gmail.com>
[ Upstream commit 7bdd91abf0cb3ea78160e2e78fb58b12f6a38d55 ]
Enabling ASPM causes random hangs on Tahiti and Oland on Zen4. It's unclear if this is a platform-specific or GPU-specific issue. Disable ASPM on SI for the time being.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
Based on my comprehensive semantic analysis of this commit, here is my assessment:
## **RECOMMENDATION: YES**
This commit should be backported to stable kernel trees.
---
## **DETAILED ANALYSIS**
### **1. Semantic Code Analysis Performed**
**Tools Used:**
- `mcp__semcode__find_function`: Located the modified `amdgpu_device_aspm_support_quirk()` function
- `mcp__semcode__find_callers`: Traced the impact scope through the call graph
- `mcp__semcode__find_callchain`: Analyzed complete call chains to understand initialization paths
- Git history analysis: Examined related ASPM commits and patterns
**Key Findings:**
**Call Graph Analysis:**
```
amdgpu_device_aspm_support_quirk() [MODIFIED]
    ↓ called by
amdgpu_device_should_use_aspm() [1 caller]
    ↓ called by (9 callers across multiple GPU generations)
    ├─ si_program_aspm() [SI generation - directly affected]
    ├─ vi_program_aspm() [VI generation]
    ├─ cik_program_aspm() [CIK generation]
    ├─ nv_program_aspm() [Navi generation]
    ├─ soc15_program_aspm() [SoC15 generation]
    └─ ... and 4 more hardware initialization functions
```
The change adds an early return when `adev->family == AMDGPU_FAMILY_SI`, which specifically targets Southern Islands GPUs (Tahiti, Oland, Verde, Pitcairn, Hainan from ~2012).
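The shape of that early return can be illustrated with a minimal, self-contained sketch. This is not the driver code: `model_device`, `model_family`, and `host_quirk` are hypothetical stand-ins introduced here, with `host_quirk` collapsing the pre-existing CPU-based checks into a single flag.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's types; not the real definitions. */
enum model_family { MODEL_FAMILY_SI, MODEL_FAMILY_CI, MODEL_FAMILY_VI };

struct model_device {
	enum model_family family;
	bool host_quirk; /* assumption: stands in for the existing CPU checks */
};

/* Returns true when ASPM should be left disabled for this device. */
static bool model_aspm_support_quirk(const struct model_device *dev)
{
	/* Modeled on the patch: every SI part is quirked, whatever the host. */
	if (dev->family == MODEL_FAMILY_SI)
		return true;

	/* Other families fall through to the host-specific checks. */
	return dev->host_quirk;
}
```

Because the family check sits at function entry, an SI device is reported as quirked before any host-specific logic runs, while the result for other families is unchanged.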
### **2. Code Changes Analysis**
**Change Size:** Minimal - only 7 lines added (4 comment + 2 code + 1 blank)
- Lines added: `+7`
- Lines removed: `0`
- Files modified: `1` (drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:1883-1889)
**Change Type:** Conservative quirk addition
- Uses existing quirk infrastructure (function already handles Intel Alder Lake/Raptor Lake quirks)
- No refactoring or architectural changes
- Simply adds hardware-specific condition at function entry
### **3. Bug Impact Assessment**
**Severity:** **CRITICAL** - Random system hangs
- Symptom: Random hangs on Tahiti and Oland GPUs
- Platform: Zen4 (AMD Ryzen 7000 series)
- Affected Hardware: SI family GPUs (AMDGPU_FAMILY_SI)

**User Exposure:** From call chain analysis, the code path is triggered during:
- Hardware initialization (`si_common_hw_init` at drivers/gpu/drm/amd/amdgpu/si.c:2640)
- Executed automatically when SI GPU is present in system
- No special user action required to trigger the bug

**Impact Scope:**
- **Narrow hardware scope**: Only SI family GPUs (12+ year old hardware, but still in use)
- **Platform-specific trigger**: Issues observed on Zen4 platforms
- **Well-contained fix**: Isolated to ASPM quirk handling code
### **4. Backport Suitability - Positive Indicators**
✅ **Fixes Critical Stability Bug**: Random system hangs are severe issues affecting system usability
✅ **Minimal Change Size**: Only 7 lines added, zero lines removed - extremely low complexity
✅ **No Dependencies**: Uses existing code infrastructure (`AMDGPU_FAMILY_SI` constant, quirk pattern)
✅ **Conservative Fix**: Disables problematic feature rather than attempting complex behavior changes
✅ **Low Regression Risk**:
- Only affects SI generation GPUs
- Disabling ASPM is safe (may slightly increase power consumption but prevents hangs)
- No code path changes for other GPU families

✅ **Follows Established Pattern**: From git history analysis, found similar ASPM quirk commits:
- `c770ef19673fb` - "disable ASPM in some situations"
- `d9b3a066dfcd3` - "Exclude dGPUs in eGPU enclosures from DPM quirks"
- `2757a848cb0f1` - "Explicitly disable ASPM when dynamic switching disabled"
✅ **Clear Hardware Scope**: Specifically targets well-defined hardware (SI family)
✅ **Stable Tree Compliant**:
- Pure bug fix, not a feature addition
- No architectural changes
- Fixes user-visible problem
### **5. Backport Suitability - Considerations**
⚠️ **No Explicit Cc: stable Tag**: Commit lacks `Cc: stable@vger.kernel.org` tag
- This may be intentional (let it bake in mainline first)
- Or could be an oversight given the severity

⚠️ **Root Cause Uncertainty**: The commit message and code comment state it is "unclear if this is a platform-specific or GPU-specific issue"
- However, the conservative fix (disable ASPM) is appropriate regardless
- This uncertainty doesn't affect fix correctness

⚠️ **Affects Legacy Hardware**: SI GPUs from 2012
- But users still run this hardware on modern Zen4 platforms
- Hangs make systems unusable, so fix is valuable
### **6. Technical Correctness**
**Function Behavior:**
- `amdgpu_device_aspm_support_quirk()` returns `true` when ASPM should be **disabled** (quirk detected)
- Adding SI family check causes function to return `true` early
- This propagates through `amdgpu_device_should_use_aspm()` which returns `false` when quirk detected
- Result: `si_program_aspm()` exits early without programming ASPM registers
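The propagation described in the list above can be sketched as a small model. All names here (`sketch_device`, `sketch_aspm_quirk`, `sketch_should_use_aspm`, `sketch_program_aspm`) are hypothetical, and the model deliberately omits whatever other conditions the real `amdgpu_device_should_use_aspm()` evaluates; it only shows how a `true` quirk result turns into an early exit in the programming step.

```c
#include <stdbool.h>

/* Hypothetical model of the call chain; not the driver's real structures. */
struct sketch_device {
	bool is_si;                 /* device belongs to the SI family */
	bool platform_aspm_enabled; /* assumption: ASPM is enabled for the link */
	bool aspm_programmed;       /* set when ASPM registers would be written */
};

/* Quirk predicate: true means "do not use ASPM". */
static bool sketch_aspm_quirk(const struct sketch_device *dev)
{
	return dev->is_si;
}

/* Policy check: a detected quirk vetoes ASPM outright. */
static bool sketch_should_use_aspm(const struct sketch_device *dev)
{
	if (sketch_aspm_quirk(dev))
		return false;
	return dev->platform_aspm_enabled;
}

/* Programming step: bails out before touching any register state. */
static void sketch_program_aspm(struct sketch_device *dev)
{
	if (!sketch_should_use_aspm(dev))
		return;
	dev->aspm_programmed = true; /* stands in for the register writes */
}
```

In this model an SI device never reaches the assignment in `sketch_program_aspm()`, which mirrors the claim that the ASPM registers are left unprogrammed.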
**Semantic Verification:** Checked the function at commit `7a899fb4de29c` (before this change) - it only handled Intel Alder Lake/Raptor Lake quirks. The new SI quirk follows identical pattern.
### **7. Historical Context**
From git log analysis:
- First appeared in v6.18-rc2 (October 2025)
- Part of ongoing ASPM quirk management in AMD GPU driver
- Follows 2+ year history of ASPM-related fixes in this subsystem
- Reviewed by Alex Deucher (AMD maintainer)
### **8. Final Assessment**
**BACKPORT: YES**
**Justification:**
1. **Critical stability fix** resolving random system hangs
2. **Minimal risk** due to small, contained change
3. **Clear benefit** for users with SI GPUs on Zen4 platforms
4. **Proper technical approach** using established quirk infrastructure
5. **No side effects** on other hardware generations
6. **Conservative solution** that safely disables problematic feature
**Recommendation:** This commit should be backported to all active stable kernel trees that support SI generation AMD GPUs (likely 5.x and 6.x stable series). The fix is low-risk and provides significant stability improvements for affected users.
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c8459337fcb89..e01264d644701 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1879,6 +1879,13 @@ static bool amdgpu_device_pcie_dynamic_switching_supported(struct amdgpu_device
 static bool amdgpu_device_aspm_support_quirk(struct amdgpu_device *adev)
 {
+	/* Enabling ASPM causes randoms hangs on Tahiti and Oland on Zen4.
+	 * It's unclear if this is a platform-specific or GPU-specific issue.
+	 * Disable ASPM on SI for the time being.
+	 */
+	if (adev->family == AMDGPU_FAMILY_SI)
+		return true;
+
 #if IS_ENABLED(CONFIG_X86)
 	struct cpuinfo_x86 *c = &cpu_data(0);