Dear Reviewers,
we suggest to backport a commit to Linux kernel-5.10 stable tree to fix thermal bug. Thanks a lot
source patch: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/dr... thermal: core: Fix thermal zone suspend-resume synchronization There are 3 synchronization issues with thermal zone suspend-resume during system-wide transitions:
1. The resume code runs in a PM notifier which is invoked after user space has been thawed, so it can run concurrently with user space which can trigger a thermal zone device removal. If that happens, the thermal zone resume code may use a stale pointer to the next list element and crash, because it does not hold thermal_list_lock while walking thermal_tz_list.
2. The thermal zone resume code calls thermal_zone_device_init() outside the zone lock, so user space or an update triggered by the platform firmware may see an inconsistent state of a thermal zone leading to unexpected behavior.
3. Clearing the in_suspend global variable in thermal_pm_notify() allows __thermal_zone_device_update() to continue for all thermal zones and it may as well run before the thermal_tz_list walk (or at any point during the list walk for that matter) and attempt to operate on a thermal zone that has not been resumed yet. It may also race destructively with thermal_zone_device_init().
To address these issues, add thermal_list_lock locking to thermal_pm_notify(), especially arount the thermal_tz_list, make it call thermal_zone_device_init() back-to-back with __thermal_zone_device_update() under the zone lock and replace in_suspend with per-zone bool "suspend" indicators set and unset under the given zone's lock.
Link: https://lore.kernel.org/linux-pm/20231218162348.69101-1-bo.ye@mediatek.com/ Reported-by: Bo Ye bo.ye@mediatek.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.co thermal: core: Fix thermal zone suspend-resume synchronization There are 3 synchronization issues with thermal zone suspend-resume during system-wide transitions:
1. The resume code runs in a PM notifier which is invoked after user space has been thawed, so it can run concurrently with user space which can trigger a thermal zone device removal. If that happens, the thermal zone resume code may use a stale pointer to the next list element and crash, because it does not hold thermal_list_lock while walking thermal_tz_list.
2. The thermal zone resume code calls thermal_zone_device_init() outside the zone lock, so user space or an update triggered by the platform firmware may see an inconsistent state of a thermal zone leading to unexpected behavior.
3. Clearing the in_suspend global variable in thermal_pm_notify() allows __thermal_zone_device_update() to continue for all thermal zones and it may as well run before the thermal_tz_list walk (or at any point during the list walk for that matter) and attempt to operate on a thermal zone that has not been resumed yet. It may also race destructively with thermal_zone_device_init().
To address these issues, add thermal_list_lock locking to thermal_pm_notify(), especially arount the thermal_tz_list, make it call thermal_zone_device_init() back-to-back with __thermal_zone_device_update() under the zone lock and replace in_suspend with per-zone bool "suspend" indicators set and unset under the given zone's lock.
Link: https://lore.kernel.org/linux-pm/20231218162348.69101-1-bo.ye@mediatek.com/ Reported-by: Bo Ye bo.ye@mediatek.com Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com BRs Bo Ye