Hi,
Memory corruption may occur if the location of the HFI memory buffer is not restored when resuming from hibernation or suspend-to-memory.
During a normal boot, the kernel allocates a memory buffer and gives it to the hardware for reporting updates in the HFI table. The same allocation process is done by a restore kernel when resuming from suspend or hibernation.
The location of the memory that the restore kernel allocates may differ from that allocated by the image kernel. If nothing is done, the hardware keeps writing to the buffer from the restore kernel after the image kernel takes over, corrupting memory that the image kernel may be using for something else. To prevent this, it is necessary to disable HFI before transferring control to the image kernel. Once running, the image kernel must restore the location of the HFI memory and enable HFI.
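To illustrate the hazard, here is a minimal userspace sketch of the handoff. It is not driver code: `struct hfi_hw_model` and the `hfi_model_*()` helpers are hypothetical names that only model "hardware writes to the last address it was given".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define HFI_MODEL_TABLE_SIZE 8

/* Hypothetical model of what the hardware knows about one HFI instance. */
struct hfi_hw_model {
	unsigned char *table;	/* address the hardware writes updates to */
	bool enabled;
};

/* Program the buffer location and enable reporting. */
static void hfi_model_enable(struct hfi_hw_model *hw, unsigned char *buf)
{
	hw->table = buf;
	hw->enabled = true;
}

static void hfi_model_disable(struct hfi_hw_model *hw)
{
	hw->enabled = false;
}

/* The hardware posts an update to whatever address it was last given. */
static void hfi_model_hw_update(struct hfi_hw_model *hw, unsigned char val)
{
	assert(hw != NULL);
	if (hw->enabled && hw->table)
		memset(hw->table, val, HFI_MODEL_TABLE_SIZE);
}

/*
 * Walk through a resume from hibernation. Returns true if the memory
 * that the restore kernel used as HFI buffer is left untouched after
 * the image kernel takes over.
 */
static bool hfi_model_resume(bool fix_applied)
{
	unsigned char restore_buf[HFI_MODEL_TABLE_SIZE] = { 0 };
	unsigned char image_buf[HFI_MODEL_TABLE_SIZE] = { 0 };
	struct hfi_hw_model hw = { NULL, false };

	/* The restore kernel boots and hands its own buffer to the hardware. */
	hfi_model_enable(&hw, restore_buf);

	if (fix_applied) {
		/* Disable HFI before transferring control ... */
		hfi_model_disable(&hw);
		/* ... and let the image kernel restore its own location. */
		hfi_model_enable(&hw, image_buf);
	}

	/* From here on, restore_buf stands for memory reused by the image kernel. */
	hfi_model_hw_update(&hw, 0xAA);

	return restore_buf[0] == 0;
}
```

In this model, `hfi_model_resume(false)` reports a stray write into the old buffer, while `hfi_model_resume(true)` does not.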
The patchset addresses the described bug on systems with one or more HFI instances (i.e., packages) using CPU hotplug callbacks and a suspend notifier.
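The per-instance bookkeeping implied by the hotplug callbacks can be sketched as follows. This is a simplified userspace model under my own assumptions, not the code in the patches: `struct hfi_instance_model` and both helpers are hypothetical, and locking, MSR accesses and the suspend notifier are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state of one HFI instance (i.e., one package). */
struct hfi_instance_model {
	unsigned int online_cpus;	/* CPUs of this instance that are online */
	unsigned long table_addr;	/* buffer location this kernel allocated */
	unsigned long hw_addr;		/* location currently known to the hardware */
	bool hw_enabled;
};

/* CPU online callback: the first CPU of the instance programs and enables it. */
static void hfi_model_cpu_online(struct hfi_instance_model *inst)
{
	assert(inst != NULL);
	if (inst->online_cpus++ == 0) {
		inst->hw_addr = inst->table_addr;
		inst->hw_enabled = true;
	}
}

/* CPU offline callback: the last CPU of the instance disables it. */
static void hfi_model_cpu_offline(struct hfi_instance_model *inst)
{
	assert(inst != NULL);
	if (inst->online_cpus == 0)
		return;
	if (--inst->online_cpus == 0)
		inst->hw_enabled = false;
}
```

With this scheme, an instance whose CPUs all went offline and came back online ends up programmed with the location known to the running kernel, whatever address the hardware held before.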
I tested this patchset on Meteor Lake and Sapphire Rapids. The systems completed 3500 (in two separate tests of 1500 and 2000 repeats) and 1000 hibernate-resume cycles, respectively. I tested it using Rafael's testing branch as of 20th December 2023.
Thanks and BR,
Ricardo

Ricardo Neri (4):
  thermal: intel: hfi: Refactor enabling code into helper functions
  thermal: intel: hfi: Enable an HFI instance from its first online CPU
  thermal: intel: hfi: Disable an HFI instance when all its CPUs go offline
  thermal: intel: hfi: Add a suspend notifier

 drivers/thermal/intel/intel_hfi.c | 142 ++++++++++++++++++++++++------
 1 file changed, 116 insertions(+), 26 deletions(-)