[PATCH AUTOSEL 6.17] drm/amdgpu: Notify pmfw bad page threshold exceeded

25 Oct 2025

From: Xiang Liu <xiang.liu@amd.com>

[ Upstream commit c8d6e90abe50377110f92702fbebc6efdd22391d ]

Notify pmfw when bad page threshold is exceeded, no matter the module
parameter 'bad_page_threshold' is set or not.

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:

YES

**Rationale**
- What changed
  - The call to notify the PMFW/SMU about an RMA reason
    (`amdgpu_dpm_send_rma_reason(adev)`) is moved outside the inner
    check that previously only executed for user-defined thresholds. Now
    it runs whenever the bad-page threshold is exceeded (and the feature
    isn’t disabled), regardless of whether the module parameter is left
    at default (-1) or formula-based (-2).
  - Reference: `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:772`
    (inner check for user-defined thresholds),
    `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:783` (unconditional
    PMFW notify within the threshold-exceeded block).
- Why it matters (bug fix, not a feature)
  - With the default (-1) or formula-based (-2) settings of
    `bad_page_threshold`, the driver already computes a threshold and
    warns when it’s exceeded, but previously did not always notify PMFW.
    This commit ensures PMFW is notified whenever the bad-page count
    crosses the computed threshold, aligning behavior across
    configurations and avoiding missed PMFW-side actions/telemetry.
  - Threshold semantics are documented and unchanged: -1 (default), 0
    (disable), -2 (formula), N>0 (user-defined). Reference:
    `drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:979` (module param
    description), `drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:980`
    (parameter definition); threshold computation paths:
    `drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3283`,
    `drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3289`,
    `drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3292`.
- Scope and containment
  - The change is confined to a single function in AMDGPU RAS EEPROM
    handling and only adjusts when a single notification is sent. No
    architectural changes, no interface changes.
- Safety and regression risk
  - The PMFW notification path is robust: `amdgpu_dpm_send_rma_reason`
    guards for unsupported SW SMU and returns `-EOPNOTSUPP`; the caller
    ignores such failures by design (see comment just above the call).
    References: `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:782`
    (comment “ignore the -ENOTSUPP”),
    `drivers/gpu/drm/amd/pm/amdgpu_dpm.c:760` (unsupported check),
    `drivers/gpu/drm/amd/pm/amdgpu_dpm.c:763` (mutex),
    `drivers/gpu/drm/amd/pm/amdgpu_dpm.c:764` (SMU call),
    `drivers/gpu/drm/amd/pm/amdgpu_dpm.c:767` (return).
  - The driver continues to mark RMA in the EEPROM header (`ras->is_rma
    = true` and `header = RAS_TABLE_HDR_BAD`) only for user-defined
    thresholds, unchanged. Reference:
    `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:772` to
    `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:780`.
  - The feature remains disabled when `bad_page_threshold == 0`; the
    outer guard still requires `amdgpu_bad_page_threshold != 0`.
    Reference: `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:763`.
- User impact
  - Fixes a real behavioral gap: in the common default/auto modes, PMFW
    was not notified when the threshold was exceeded. This can affect
    reliability handling/telemetry on systems that rely on PMFW
    awareness. The fix is minimal, localized, and low risk.
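
The control flow described above can be sketched as a small standalone
model. This is only an illustration, not driver code: `struct ras_model`,
`struct ras_outcome`, and `update_header_model()` are hypothetical names,
the inner check is assumed to treat any value other than -1 and -2 as
user-defined, and "exceeded" is assumed to mean a strict greater-than
comparison.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state the driver consults. */
struct ras_model {
	int bad_page_threshold_param;        /* -1 default, 0 disable, -2 formula, N>0 user-defined */
	unsigned int bad_page_cnt_threshold; /* threshold the driver computed or was given */
	unsigned int bad_page_count;         /* bad pages recorded so far */
};

/* Hypothetical summary of what the threshold-exceeded path did. */
struct ras_outcome {
	bool marked_rma;    /* models ras->is_rma = true and the BAD table header */
	bool pmfw_notified; /* models a call to amdgpu_dpm_send_rma_reason() */
};

/*
 * Model of the threshold-exceeded block, before (patched == false)
 * and after (patched == true) the change.
 */
static struct ras_outcome
update_header_model(const struct ras_model *m, bool patched)
{
	struct ras_outcome out = { false, false };

	/* Outer guard: feature enabled and (assumed) count strictly above threshold. */
	if (m->bad_page_threshold_param != 0 &&
	    m->bad_page_count > m->bad_page_cnt_threshold) {
		/* Inner check: assumed to select user-defined thresholds only. */
		if (m->bad_page_threshold_param != -1 &&
		    m->bad_page_threshold_param != -2) {
			out.marked_rma = true;
			if (!patched)
				out.pmfw_notified = true; /* old placement of the notify */
		}

		if (patched)
			out.pmfw_notified = true; /* new placement: every non-zero mode */
	}

	return out;
}
```

Under these assumptions, a default (-1) configuration whose count exceeds
the threshold is notified only in the patched model, while the RMA
marking stays limited to user-defined thresholds in both.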
Given the small, targeted nature of the fix, its correctness, and low
regression risk, this is a good candidate for stable backport.

 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 9bda9ad13f882..88ded6296be34 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -774,9 +774,10 @@ amdgpu_ras_eeprom_update_header(struct amdgpu_ras_eeprom_control *control)
 				control->tbl_rai.health_percent = 0;
 			}
 			ras->is_rma = true;
-			/* ignore the -ENOTSUPP return value */
-			amdgpu_dpm_send_rma_reason(adev);
 		}
+
+		/* ignore the -ENOTSUPP return value */
+		amdgpu_dpm_send_rma_reason(adev);
 	}
 
 	if (control->tbl_hdr.version >= RAS_TABLE_VER_V2_1)
-- 
2.51.0