From: Xiang Liu <xiang.liu@amd.com>
[ Upstream commit 8f0245ee95c5ba65a2fe03f60386868353c6a3a0 ]
Update the IPID register value for bad page threshold CPER according to the latest definition.
Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
- What changed: In the bad page threshold CPER builder, the IPID fields are no longer hardcoded; they are computed from the GPU’s socket ID per the “latest definition.”
  - Previous behavior: `IPID_LO = 0x0` and `IPID_HI = 0x96` (drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c:237-238).
  - New behavior: Introduces `socket_id` and sets:
    - `IPID_LO = (socket_id / 4) & 0x01`
    - `IPID_HI = 0x096 | (((socket_id % 4) & 0x3) << 12)`
    These replace the constants, encoding the socket information in IPID per the updated spec.
- Scope and containment:
  - The change is confined to one function: `amdgpu_cper_entry_fill_bad_page_threshold_section()` in drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c.
  - It only affects construction of CPER records for “bad page threshold” events; normal runtime, CE/DE/UE CPERs still use real ACA bank IPID values (drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c:391-404).
- Rationale and user impact:
  - This corrects CPER content by encoding the GPU socket in IPID, improving RAS diagnostics. Previously, CPERs for this event carried a fixed, misleading IPID, which can misidentify the device/location and hamper triage and RMA workflows.
  - The commit message aligns with this: “Update … according to the latest definition,” i.e., a spec-compliance fix rather than a feature.
- Dependencies and compatibility:
  - It uses `adev->smuio.funcs->get_socket_id` if available, otherwise falls back to 0, preserving prior behavior on ASICs without socket ID support. This same pattern is already used elsewhere in this file for `record_id` and FRU text (drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c:73-81, 123-131), so there is no new dependency risk.
  - No API/ABI changes; no headers or structures changed; no architectural changes.
- Risk assessment:
  - Minimal risk: pure data-field fix inside a CPER payload builder; no control flow or subsystem behavior changes.
  - Side effects are limited to CPER contents produced when bad page threshold is exceeded (trigger path in drivers/gpu/drm/amd/pm/amdgpu_dpm.c:764-778).
- Stable backport criteria:
  - Fixes a real (though non-crashing) bug affecting users of RAS/CPER reporting in multi-GPU or multi-socket environments.
  - Small, localized change with clear intent and low regression risk.
  - No new features or architectural changes; adheres to stable rules.
- Practical note for backporting:
  - Backport to stable trees that already contain CPER generation for bad page threshold and the `smuio.get_socket_id` plumbing. Where `get_socket_id` is absent, the fallback keeps behavior identical to pre-fix.
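
The IPID arithmetic and the socket ID fallback described above can be sketched as a standalone C fragment. This is only an illustration under stated assumptions: `demo_smuio_funcs`, `demo_ipid`, `demo_socket_id()` and `demo_encode_ipid()` are hypothetical names invented for this sketch and do not exist in the driver; only the two expressions for the low and high words are taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's smuio callback table. */
struct demo_smuio_funcs {
	uint32_t (*get_socket_id)(void *dev);
};

/* Hypothetical pair holding the two 32-bit halves of the IPID value. */
struct demo_ipid {
	uint32_t lo;
	uint32_t hi;
};

/* Use the callback when it is present, otherwise fall back to socket 0. */
static uint32_t demo_socket_id(const struct demo_smuio_funcs *funcs, void *dev)
{
	return (funcs && funcs->get_socket_id) ? funcs->get_socket_id(dev) : 0;
}

/* Same arithmetic as the patch: quotient by 4 goes to LO, remainder to HI[13:12]. */
static struct demo_ipid demo_encode_ipid(uint32_t socket_id)
{
	struct demo_ipid ipid;

	ipid.lo = (socket_id / 4) & 0x01;
	ipid.hi = 0x096 | (((socket_id % 4) & 0x3) << 12);
	return ipid;
}
```

With this encoding, socket 0 yields the previous constants (`IPID_LO = 0x0`, `IPID_HI = 0x96`), which is why the fallback to 0 keeps the old output, while socket 5, for example, yields `IPID_LO = 0x1` and `IPID_HI = 0x1096`.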
 drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c
index 25252231a68a9..6c266f18c5981 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c
@@ -206,6 +206,7 @@ int amdgpu_cper_entry_fill_bad_page_threshold_section(struct amdgpu_device *adev
 {
 	struct cper_sec_desc *section_desc;
 	struct cper_sec_nonstd_err *section;
+	uint32_t socket_id;

 	section_desc = (struct cper_sec_desc *)((uint8_t *)hdr + SEC_DESC_OFFSET(idx));
 	section = (struct cper_sec_nonstd_err *)((uint8_t *)hdr +
@@ -224,6 +225,9 @@ int amdgpu_cper_entry_fill_bad_page_threshold_section(struct amdgpu_device *adev
 	section->ctx.reg_arr_size = sizeof(section->ctx.reg_dump);

 	/* Hardcoded Reg dump for bad page threshold CPER */
+	socket_id = (adev->smuio.funcs && adev->smuio.funcs->get_socket_id) ?
+				adev->smuio.funcs->get_socket_id(adev) :
+				0;
 	section->ctx.reg_dump[CPER_ACA_REG_CTL_LO] = 0x1;
 	section->ctx.reg_dump[CPER_ACA_REG_CTL_HI] = 0x0;
 	section->ctx.reg_dump[CPER_ACA_REG_STATUS_LO] = 0x137;
@@ -234,8 +238,8 @@ int amdgpu_cper_entry_fill_bad_page_threshold_section(struct amdgpu_device *adev
 	section->ctx.reg_dump[CPER_ACA_REG_MISC0_HI] = 0x0;
 	section->ctx.reg_dump[CPER_ACA_REG_CONFIG_LO] = 0x2;
 	section->ctx.reg_dump[CPER_ACA_REG_CONFIG_HI] = 0x1ff;
-	section->ctx.reg_dump[CPER_ACA_REG_IPID_LO] = 0x0;
-	section->ctx.reg_dump[CPER_ACA_REG_IPID_HI] = 0x96;
+	section->ctx.reg_dump[CPER_ACA_REG_IPID_LO] = (socket_id / 4) & 0x01;
+	section->ctx.reg_dump[CPER_ACA_REG_IPID_HI] = 0x096 | (((socket_id % 4) & 0x3) << 12);
 	section->ctx.reg_dump[CPER_ACA_REG_SYND_LO] = 0x0;
 	section->ctx.reg_dump[CPER_ACA_REG_SYND_HI] = 0x0;