From: Shuai Xue <xueshuai@linux.alibaba.com>
[ Upstream commit 79a5ae3c4c5eb7e38e0ebe4d6bf602d296080060 ]
If a synchronous error is detected as a result of user-space process triggering a 2-bit uncorrected error, the CPU will take a synchronous error exception such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a memory_failure() work which poisons the related page, unmaps the page, and then sends a SIGBUS to the process, so that a system wide panic can be avoided.
However, no memory_failure() work will be queued when abnormal synchronous errors occur. These errors can include situations like invalid PA, unexpected severity, no memory failure config support, invalid GUID section, etc. In such a case, the user-space process will trigger SEA again. This loop can potentially exceed the platform firmware threshold or even trigger a kernel hard lockup, leading to a system reboot.
Fix it by performing a force kill if no memory_failure() work is queued for synchronous errors.
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Reviewed-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Link: https://patch.msgid.link/20250714114212.31660-2-xueshuai@linux.alibaba.com
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
**Backport Status: YES**
This commit should be backported to stable kernel trees for the following reasons:
1. **Critical Bug Fix**: The commit fixes a serious issue where synchronous memory errors that cannot be recovered (abnormal synchronous errors) would cause an infinite loop. The commit message clearly describes how "the user-space process will trigger SEA again. This loop can potentially exceed the platform firmware threshold or even trigger a kernel hard lockup, leading to a system reboot."
2. **Security and Stability Impact**: The bug can lead to:
   - Infinite exception loops
   - Platform firmware threshold exceeded
   - Kernel hard lockups
   - Unexpected system reboots
These are severe stability issues that affect system reliability.
3. **Small and Contained Fix**: The code change is minimal and well-contained:

```c
+	/*
+	 * If no memory failure work is queued for abnormal synchronous
+	 * errors, do a force kill.
+	 */
+	if (sync && !queued) {
+		dev_err(ghes->dev,
+			HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
+			current->comm, task_pid_nr(current));
+		force_sig(SIGBUS);
+	}
```

   The fix adds only 10 lines of code that check if we're in a synchronous context (`sync`) and no memory failure work was queued (`!queued`), then sends SIGBUS to the current process.
4. **Clear Problem and Solution**: The commit addresses a specific gap in error handling. When `ghes_handle_memory_failure()` returns false (meaning no memory_failure() work was queued) for synchronous errors, the process that triggered the error continues execution and will hit the same error again, creating an infinite loop.
5. **Follows Stable Rules**: This fix meets the stable kernel criteria:
   - Fixes a real bug that affects users
   - Small change (< 100 lines)
   - Obviously correct and tested (has multiple Reviewed-by tags)
   - Does not add new features
   - Addresses a serious issue (system stability/reboot)
6. **Related to Previous Work**: This appears to be part of a series addressing synchronous error handling issues in GHES. The commit c1f1fda14137 mentioned in the git log shows ongoing work to properly handle synchronous exceptions, and this commit addresses a critical gap where abnormal synchronous errors weren't being handled at all.
The fix ensures that when a synchronous memory error cannot be properly handled through the normal memory_failure() path, the kernel will at least terminate the offending process with SIGBUS rather than allowing it to continue and create an infinite exception loop that can crash the system.
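The behaviour described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: `handle_error()`, `count_sync_faults()`, the `outcome` enum, the `patched` flag and the `max_faults` cap are all hypothetical names invented for illustration. The cap merely stands in for whatever limit (firmware threshold, lockup detector) would eventually stop a real loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of handling one error record in this toy model. */
enum outcome {
	OUTCOME_RECOVERY_QUEUED,	/* memory_failure() work was queued */
	OUTCOME_FORCE_KILLED,		/* models force_sig(SIGBUS) */
	OUTCOME_UNHANDLED		/* nothing stops the task */
};

/*
 * Toy stand-in for the tail of ghes_do_proc(). "queued" says whether any
 * section handler queued recovery work; "patched" toggles the new check.
 */
enum outcome handle_error(bool sync, bool queued, bool patched)
{
	if (queued)
		return OUTCOME_RECOVERY_QUEUED;
	if (patched && sync)
		return OUTCOME_FORCE_KILLED;
	return OUTCOME_UNHANDLED;
}

/*
 * Count how often a task takes the same synchronous exception before
 * something stops it. max_faults is a made-up cap for the model.
 */
unsigned int count_sync_faults(bool queued, bool patched,
			       unsigned int max_faults)
{
	unsigned int faults = 0;

	assert(max_faults > 0);
	while (faults < max_faults) {
		faults++;	/* the task re-executes the faulting access */
		if (handle_error(true, queued, patched) != OUTCOME_UNHANDLED)
			break;	/* recovery queued or task killed */
	}
	return faults;
}
```

In this model, an unpatched synchronous error with no queued work keeps faulting until the cap is reached, while the patched path stops after the first exception. Asynchronous errors are deliberately left unchanged, mirroring the `sync &&` part of the condition.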
 drivers/acpi/apei/ghes.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index f0584ccad451..281a0a2f6730 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -902,6 +902,17 @@ static bool ghes_do_proc(struct ghes *ghes,
 		}
 	}
 
+	/*
+	 * If no memory failure work is queued for abnormal synchronous
+	 * errors, do a force kill.
+	 */
+	if (sync && !queued) {
+		dev_err(ghes->dev,
+			HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
+			current->comm, task_pid_nr(current));
+		force_sig(SIGBUS);
+	}
+
 	return queued;
 }