On Tue, Sep 16, 2025 at 03:20:49PM +0000, Luck, Tony wrote:
Reported-by: Shawn Fan <shawn.fan@intel.com>
Interesting. What did Shawn report? (Closes:!).
Tony or Shawn, could you please point me to the original report? Thanks!
Original report is internal to Intel, so no useful link for the community (but I still wanted to give credit).
Recap of the original problem: some BIOSes keep track of an error threshold per rank and use this GHES mechanism to report that the threshold was exceeded on the rank.
Systems that stay up a long time can accumulate enough soft errors to trigger this threshold. But taking a page offline isn't going to help. For a 4K page this is merely annoying. For a 1G page it can mess things up badly.
My original patch for this just skipped the GHES->offline process for huge pages. But I wasn't aware of the sysctl control. That provides a better solution.
Tony, does that mean you're OK with using the existing sysctl interface? If so, I'll just send a separate patch to update the sysfs-memory-page-offline documentation and drop the rest.
Thanks, Kyle Meyer