Reported-by: Shawn Fan <shawn.fan@intel.com>
Interesting. What did Shawn report? (Closes:!).
Tony or Shawn, could you please point me to the original report? Thanks!
The original report is internal to Intel, so there is no useful link for the community (but I still wanted to give credit).
To recap the original problem: some BIOSes keep track of an error threshold per rank and use this GHES mechanism to report when the threshold has been exceeded on a rank.
Systems that stay up a long time can accumulate enough soft errors to trigger this threshold. But taking a page offline isn't going to help, since the threshold is tracked for the whole rank rather than for that one page. For a 4K page this is merely annoying. For a 1G page it can mess things up badly.
My original patch for this just skipped the GHES->offline process for huge pages. But I wasn't aware of the sysctl control. That provides a better solution.
-Tony