Jiaqi Yan <jiaqiyan@google.com> writes:
Correctable memory errors are very common on servers with large amounts of memory, and are corrected by ECC, but they cause two pain points for users:
- Correction usually happens on the fly and adds latency overhead
- A not-fully-proven theory states that excessive correctable memory errors can develop into uncorrectable memory errors.
This patchkit is amusing (or maybe sad) because it basically tries to reconstruct the original soft offline design using a user space daemon instead of doing policy badly in the kernel.
You can still have it by enabling CONFIG_X86_MCELOG_LEGACY and using http://www.mcelog.org or an equivalent daemon of your choosing that listens to /dev/mcelog.
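
For illustration only, here is a minimal sketch of the kind of policy such a user space daemon could apply: count corrected errors per page and soft offline a page once it crosses a threshold within a time window. The class name, the threshold of 10 errors per 24 hours, and the 4 KiB page size are assumptions made for the example, not taken from this thread or from any particular daemon. Decoding records from /dev/mcelog is left out; the sketch assumes the physical address has already been extracted. The only kernel interface used is the soft_offline_page sysfs file, which needs root and a kernel built with memory failure support.

```python
from collections import defaultdict, deque

PAGE_SIZE = 4096  # assumption: 4 KiB base pages
SOFT_OFFLINE_PATH = "/sys/devices/system/memory/soft_offline_page"


class PageErrorTracker:
    """Hypothetical per-page corrected-error counter with a sliding window."""

    def __init__(self, threshold=10, window_seconds=24 * 3600):
        # Both defaults are illustrative policy values, not kernel defaults.
        self.threshold = threshold
        self.window = window_seconds
        self.events = defaultdict(deque)  # page address -> error timestamps

    def record(self, phys_addr, now):
        """Record one corrected error; return True if the page should be offlined."""
        page = phys_addr & ~(PAGE_SIZE - 1)
        times = self.events[page]
        times.append(now)
        # Forget errors that have fallen out of the sliding window.
        while times and now - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold


def soft_offline(phys_addr, path=SOFT_OFFLINE_PATH):
    """Ask the kernel to soft offline the page containing phys_addr (needs root)."""
    with open(path, "w") as f:
        f.write(f"{phys_addr:#x}\n")
```

A daemon loop would call record() for each corrected error it decodes and call soft_offline() when record() returns True. Keeping the counting separate from the sysfs write means the policy can be changed or tested without touching real memory.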
-Andi