From: Xiang Liu <xiang.liu@amd.com>
[ Upstream commit f320ed01cf5f2259e2035a56900952cb3cc77e7a ]
Correct valid_bits and ms_chk_bits of section info field for bad page threshold exceed CPER to match OOB's behavior.
Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
- Fixes incorrect CPER section info for “bad page threshold exceeded” events by explicitly marking the MS_CHECK subfield valid and setting the key status bits:
  - Sets `valid_bits.ms_chk = 1` so MS_CHECK content is defined for consumers: drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c:222
  - Marks error type presence and value: `ms_chk_bits.err_type_valid = 1` and `ms_chk_bits.err_type = 1`: drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c:223-224
  - Marks processor context corrupted: `ms_chk_bits.pcc = 1`: drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c:225
  - Without these, decoders can treat MS_CHECK as invalid/unknown, leading to misclassification or ignoring of the event.
- Corrects CPER header validity flags by removing an invalid assertion that a reserved field is present:
  - Drops `hdr->valid_bits.partition_id = 1` (the field is reserved in this format), preventing consumers from assuming a valid partition ID when none is provided: drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c:70-71 (absence of the old assignment)
  - The header explicitly documents `partition_id` as reserved: drivers/gpu/drm/amd/include/amd_cper.h:118
- Scope and risk:
  - Small, contained change in one driver file: drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c.
  - No API/ABI or architectural changes; only corrects record formatting bits.
  - Runtime behavior of the GPU or the kernel isn’t affected; this only alters metadata in generated CPER records written to the AMDGPU CPER ring.
  - Extremely low regression risk; improves compatibility with OOB tooling by matching expected CPER semantics.
- User impact:
  - Fixes a real correctness bug in error reporting: previously, MS_CHECK data was not flagged valid and key semantics (error type, PCC) were not asserted for the bad-page-threshold CPER, causing potential misinterpretation by diagnostics/management tools.
  - Aligns driver-generated records with out-of-band behavior as stated in the commit message.
- Stable criteria:
  - Important bugfix in a confined subsystem (AMDGPU RAS/CPER formatting).
  - Minimal change set, no feature additions, no cross-subsystem fallout.
  - Suitable for all stable trees that include AMDGPU CPER generation (e.g., where `amdgpu_cper_generate_bp_threshold_record()` is present: drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c:323-341).
Given the above, this is a low-risk correctness fix that improves error record fidelity and should be backported.
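To illustrate why the missing validity bit matters to a consumer, here is a minimal userspace sketch. It is not driver code: `struct demo_sec_info`, `demo_fill_unpatched()`, `demo_fill_patched()` and `demo_decode_ms_chk()` are hypothetical names, and the bitfield layout is simplified rather than copied from amd_cper.h. It assumes a decoder that reads the MS_CHECK bits only when `valid_bits.ms_chk` is set, which is the behavior the notes above describe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* HYPOTHETICAL, simplified layout; not the real structure from amd_cper.h. */
struct demo_sec_info {
	struct {
		unsigned int ms_chk : 1;         /* MS_CHECK subfield holds defined data */
	} valid_bits;
	struct {
		unsigned int err_type_valid : 1; /* err_type below is meaningful */
		unsigned int err_type : 2;       /* error type value */
		unsigned int pcc : 1;            /* processor context corrupted */
	} ms_chk_bits;
};

/* Mirrors the pre-patch assignments: only err_type_valid is set. */
void demo_fill_unpatched(struct demo_sec_info *info)
{
	assert(info != NULL);
	memset(info, 0, sizeof(*info));
	info->ms_chk_bits.err_type_valid = 1;
}

/* Mirrors the post-patch assignments shown in the diff. */
void demo_fill_patched(struct demo_sec_info *info)
{
	assert(info != NULL);
	memset(info, 0, sizeof(*info));
	info->valid_bits.ms_chk = 1;
	info->ms_chk_bits.err_type_valid = 1;
	info->ms_chk_bits.err_type = 1;
	info->ms_chk_bits.pcc = 1;
}

/*
 * Hypothetical consumer: reports the error type and PCC flag only when the
 * record says the MS_CHECK subfield and the error type are valid.
 * Returns false (outputs untouched) otherwise.
 */
bool demo_decode_ms_chk(const struct demo_sec_info *info,
			unsigned int *err_type, bool *pcc)
{
	assert(info != NULL && err_type != NULL && pcc != NULL);
	if (!info->valid_bits.ms_chk || !info->ms_chk_bits.err_type_valid)
		return false;
	*err_type = info->ms_chk_bits.err_type;
	*pcc = info->ms_chk_bits.pcc != 0;
	return true;
}
```

With a record filled by `demo_fill_unpatched()`, this decoder returns false and reports nothing; with `demo_fill_patched()` it returns true and yields the error type and PCC flag. Real decoders may apply different gating.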
 drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c
index 6c266f18c5981..12710496adae5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c
@@ -68,7 +68,6 @@ void amdgpu_cper_entry_fill_hdr(struct amdgpu_device *adev,
 	hdr->error_severity = sev;
 
 	hdr->valid_bits.platform_id = 1;
-	hdr->valid_bits.partition_id = 1;
 	hdr->valid_bits.timestamp = 1;
 
 	amdgpu_cper_get_timestamp(&hdr->timestamp);
@@ -220,7 +219,10 @@ int amdgpu_cper_entry_fill_bad_page_threshold_section(struct amdgpu_device *adev
 	section->hdr.valid_bits.err_context_cnt = 1;
 
 	section->info.error_type = RUNTIME;
+	section->info.valid_bits.ms_chk = 1;
 	section->info.ms_chk_bits.err_type_valid = 1;
+	section->info.ms_chk_bits.err_type = 1;
+	section->info.ms_chk_bits.pcc = 1;
 	section->ctx.reg_ctx_type = CPER_CTX_TYPE_CRASH;
 	section->ctx.reg_arr_size = sizeof(section->ctx.reg_dump);