From: Xiang Liu <xiang.liu@amd.com>
[ Upstream commit c8d6e90abe50377110f92702fbebc6efdd22391d ]
Notify pmfw when the bad page threshold is exceeded, regardless of whether the module parameter 'bad_page_threshold' is set.
Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
**Rationale**

- What changed
  - The call to notify the PMFW/SMU about an RMA reason (`amdgpu_dpm_send_rma_reason(adev)`) is moved outside the inner check that previously only executed for user-defined thresholds. Now it runs whenever the bad-page threshold is exceeded (and the feature isn’t disabled), regardless of whether the module parameter is left at default (-1) or formula-based (-2).
  - Reference: `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:772` (inner check for user-defined thresholds), `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:783` (unconditional PMFW notify within the threshold-exceeded block).
- Why it matters (bug fix, not a feature)
  - With the default (-1) or formula-based (-2) settings of `bad_page_threshold`, the driver already computes a threshold and warns when it’s exceeded, but previously did not always notify PMFW. This commit ensures PMFW is notified whenever the bad-page count crosses the computed threshold, aligning behavior across configurations and avoiding missed PMFW-side actions/telemetry.
  - Threshold semantics are documented and unchanged: -1 (default), 0 (disable), -2 (formula), N>0 (user-defined).
  - Reference: `drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:979` (module param description), `drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:980` (parameter definition); threshold computation paths: `drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3283`, `drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3289`, `drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3292`.
- Scope and containment
  - The change is confined to a single function in AMDGPU RAS EEPROM handling and only adjusts when a single notification is sent. No architectural changes, no interface changes.
- Safety and regression risk
  - The PMFW notification path is robust: `amdgpu_dpm_send_rma_reason` guards for unsupported SW SMU and returns `-EOPNOTSUPP`; the caller ignores such failures by design (see comment just above the call).
  - References: `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:782` (comment “ignore the -ENOTSUPP”), `drivers/gpu/drm/amd/pm/amdgpu_dpm.c:760` (unsupported check), `drivers/gpu/drm/amd/pm/amdgpu_dpm.c:763` (mutex), `drivers/gpu/drm/amd/pm/amdgpu_dpm.c:764` (SMU call), `drivers/gpu/drm/amd/pm/amdgpu_dpm.c:767` (return).
  - The driver continues to mark RMA in the EEPROM header (`ras->is_rma = true` and `header = RAS_TABLE_HDR_BAD`) only for user-defined thresholds, unchanged. Reference: `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:772` to `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:780`.
  - The feature remains disabled when `bad_page_threshold == 0`; the outer guard still requires `amdgpu_bad_page_threshold != 0`. Reference: `drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c:763`.
- User impact
  - Fixes a real behavioral gap: in the common default/auto modes, PMFW was not being notified when the threshold was exceeded. This can affect reliability handling/telemetry on systems that rely on PMFW awareness. The fix is minimal, localized, and low risk.
Given the small, targeted nature of the fix, its correctness, and low regression risk, this is a good candidate for stable backport.
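
The control-flow change described above can be sketched as a small standalone model. This is only an illustration, not driver code: `struct rma_outcome`, `is_user_defined()`, `update_header_before()`, and `update_header_after()` are hypothetical names, the `exceeded` flag stands in for the driver's own threshold comparison, and the exact form of the inner "user-defined" check is assumed from the threshold semantics listed in the rationale.

```c
#include <stdbool.h>

/* Hypothetical model of the two side effects discussed in the rationale. */
struct rma_outcome {
	bool marked_rma;    /* stands in for ras->is_rma = true and the BAD header */
	bool pmfw_notified; /* stands in for amdgpu_dpm_send_rma_reason(adev) */
};

/* Assumption: "user-defined" means neither -1 (default) nor -2 (formula). */
bool is_user_defined(int bad_page_threshold)
{
	return bad_page_threshold != -1 && bad_page_threshold != -2;
}

/* Before the patch: PMFW is notified only inside the user-defined branch. */
struct rma_outcome update_header_before(int bad_page_threshold, bool exceeded)
{
	struct rma_outcome out = { false, false };

	if (bad_page_threshold != 0 && exceeded) {
		if (is_user_defined(bad_page_threshold)) {
			out.marked_rma = true;
			out.pmfw_notified = true;
		}
	}
	return out;
}

/* After the patch: PMFW is notified whenever the outer guard passes. */
struct rma_outcome update_header_after(int bad_page_threshold, bool exceeded)
{
	struct rma_outcome out = { false, false };

	if (bad_page_threshold != 0 && exceeded) {
		if (is_user_defined(bad_page_threshold))
			out.marked_rma = true;

		/* Notification no longer depends on the threshold mode. */
		out.pmfw_notified = true;
	}
	return out;
}
```

In this model, `update_header_before(-1, true)` leaves `pmfw_notified` false while `update_header_after(-1, true)` sets it, and `marked_rma` stays false in both cases. A threshold of 0 suppresses both side effects before and after the change.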
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 9bda9ad13f882..88ded6296be34 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -774,9 +774,10 @@ amdgpu_ras_eeprom_update_header(struct amdgpu_ras_eeprom_control *control)
 				control->tbl_rai.health_percent = 0;
 			}
 			ras->is_rma = true;
-			/* ignore the -ENOTSUPP return value */
-			amdgpu_dpm_send_rma_reason(adev);
 		}
+
+		/* ignore the -ENOTSUPP return value */
+		amdgpu_dpm_send_rma_reason(adev);
 	}
 
 	if (control->tbl_hdr.version >= RAS_TABLE_VER_V2_1)