Thanks for your comment, Andi.
On Thu, Jun 20, 2024 at 3:53 PM Andi Kleen <ak@linux.intel.com> wrote:
Jiaqi Yan <jiaqiyan@google.com> writes:
Correctable memory errors are very common on servers with large amounts of memory, and are corrected by ECC, but with two pain points to users:
- Correction usually happens on the fly and adds latency overhead
- A not-fully-proven theory states that excessive correctable memory errors can develop into uncorrectable memory errors.
This patchkit is amusing (or maybe sad) because it basically tries to reconstruct the original soft offline design using a user space daemon instead of doing policy badly in the kernel.
Some clarifications. I don't intend to reconstruct the original design. I think this patchset can also be treated as "patch some missing places so that the kernel doesn't soft offline behind the back of the userspace daemon". I agree with you (IIUC) that the policy for corrected memory errors should live in userspace. But the situation is that some code paths in the kernel don't respect that (they either have a reason not to, or just forget to). enable_soft_offline is basically the big button userspace can use to block these kernel violators.
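To make the "big button" concrete, here is a minimal sketch of how a userspace daemon might flip it. It assumes the knob is exposed as a sysctl file at /proc/sys/vm/enable_soft_offline taking "0" or "1"; that path, and the helper names read_soft_offline/set_soft_offline, are assumptions for illustration rather than something spelled out above.

```python
from pathlib import Path

# Assumed location of the knob; this thread only names "enable_soft_offline".
DEFAULT_KNOB = Path("/proc/sys/vm/enable_soft_offline")


def read_soft_offline(knob: Path = DEFAULT_KNOB) -> bool:
    """Return True if the kernel is currently allowed to soft offline pages."""
    return knob.read_text().strip() == "1"


def set_soft_offline(enabled: bool, knob: Path = DEFAULT_KNOB) -> None:
    """Write 1 to allow soft offline, 0 to block it (needs root on a real system)."""
    knob.write_text("1\n" if enabled else "0\n")
```

A daemon that wants to own the policy would call set_soft_offline(False) at startup, so that no kernel path soft offlines a page without being asked.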
You can still have it by enabling CONFIG_X86_MCELOG_LEGACY and using http://www.mcelog.org or an equivalent daemon of your choosing that listens to /dev/mcelog.
If I didn't miss anything important in https://github.com/andikleen/mcelog and arch/x86/kernel/cpu/mce/dev-mcelog.c, I don't think /dev/mcelog works on ARM platforms, where CPER is used to convey hardware errors from the platform to the OS.
In addition, again taking an ARM platform as an example, I don't think any userspace daemon has a way to stop the GHES driver from soft offlining memory pages: https://github.com/torvalds/linux/blob/master/drivers/acpi/apei/ghes.c#L521. But of course it is not a problem if userspace always wants soft offline to happen.
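For comparison, this is roughly what a userspace-owned policy could look like, and what an in-kernel soft offline would bypass. It is only a sketch: the threshold value, the 4 KiB page size, and the CorrectedErrorPolicy class are hypothetical. It relies on the sysfs file /sys/devices/system/memory/soft_offline_page, which soft offlines the page containing the physical address written to it, and it leaves out how the daemon learns about corrected errors in the first place.

```python
from collections import Counter
from pathlib import Path

# Writing a physical address here asks the kernel to soft offline that page.
SOFT_OFFLINE_PAGE = Path("/sys/devices/system/memory/soft_offline_page")
PAGE_SIZE = 4096   # assumption: 4 KiB base pages
CE_THRESHOLD = 10  # hypothetical policy value, not taken from this thread


class CorrectedErrorPolicy:
    """Count corrected errors per page and soft offline once a threshold is hit."""

    def __init__(self, threshold=CE_THRESHOLD, sysfs=SOFT_OFFLINE_PAGE):
        self.threshold = threshold
        self.sysfs = sysfs
        self.counts = Counter()
        self.offlined = set()

    def report(self, phys_addr: int) -> bool:
        """Record one corrected error; return True if the page was offlined now."""
        pfn = phys_addr // PAGE_SIZE
        if pfn in self.offlined:
            return False
        self.counts[pfn] += 1
        if self.counts[pfn] < self.threshold:
            return False
        # Write the page-aligned physical address in hex.
        self.sysfs.write_text(f"{pfn * PAGE_SIZE:#x}\n")
        self.offlined.add(pfn)
        return True
```

With something like this running, a GHES-triggered soft offline happens regardless of what counts or thresholds the daemon has chosen, which is the gap I am trying to close.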
-Andi