From: Jake Hillion jake@hillion.co.uk
[ Upstream commit f8afb12a2d7503de6558c23cacd7acbf6e9fe678 ]
__hsmp_send_message sleeps between result read attempts and has a timeout of 100ms. Under extreme load it's possible for these sleeps to take a long time, exceeding the 100ms. In this case the current code does not check the register and fails with ETIMEDOUT.
Refactor the loop to ensure there is at least one read of the register after a sleep of any duration. This removes instances of ETIMEDOUT with a single caller, even with a misbehaving scheduler. Tested on AMD Bergamo machines.
Suggested-by: Blaise Sanouillet linux@blaise.sanouillet.com Reviewed-by: Suma Hegde suma.hegde@amd.com Tested-by: Suma Hegde suma.hegde@amd.com Signed-off-by: Jake Hillion jake@hillion.co.uk Link: https://lore.kernel.org/r/20250605-amd-hsmp-v2-1-a811bc3dd74a@hillion.co.uk Reviewed-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com Signed-off-by: Sasha Levin sashal@kernel.org ---
**YES**
This commit should be backported to stable kernel trees. Here's my extensive analysis:
## Bug Analysis
The commit fixes a critical race condition in the AMD HSMP driver's timeout handling. The issue occurs in the `__hsmp_send_message()` function where:
1. **Original code flaw**: The loop checks `time_before(jiffies, timeout)` at the start of each iteration, then sleeps, but doesn't check the register one final time after the last sleep completes.
2. **Race condition scenario**: Under heavy system load (as mentioned in the commit message - "Under extreme load"), the sleep operations (`usleep_range()`) can take significantly longer than intended. If the final sleep extends beyond the 100ms timeout window, the code exits with `-ETIMEDOUT` without checking if the SMU actually responded during that extended sleep.
3. **The fix**: Changes the loop from `while (time_before(jiffies, timeout))` to `while (true)` and moves the timeout check to after the register read, ensuring at least one register check occurs after any sleep duration.
## Why This Qualifies for Stable Backport
1. **Fixes a real bug affecting users**: The commit message explicitly states this was "Tested on AMD Bergamo machines" and fixes actual instances of ETIMEDOUT errors. AMD Bergamo is a server processor used in production data centers.
2. **Small, contained fix**: The change is minimal - just restructuring the loop logic without changing functionality. This meets the stable kernel criteria of being a small, obvious fix.
3. **No new features**: This purely fixes existing behavior without adding capabilities.
4. **Critical subsystem**: The HSMP driver is essential for AMD EPYC server management, controlling: - Power management - Thermal monitoring - Performance states - System telemetry
5. **Production impact**: Spurious `-ETIMEDOUT` errors would cause failures in: - Data center management tools - Power capping operations - Performance monitoring - Thermal management
6. **Clear problem and solution**: The race condition is well- understood, and the fix ensures the code behaves as intended - checking the register after sleeping rather than potentially timing out without a final check.
## Risk Assessment
The risk is minimal because: - The logic change is straightforward and correct - It's been tested on production AMD Bergamo systems - It only affects the timeout path behavior - The worst case is the same as before (timeout after 100ms) - The best case fixes false timeouts under load
This is exactly the type of bug fix that stable kernels exist to deliver - fixing real issues users encounter in production without introducing new risks.
drivers/platform/x86/amd/hsmp/hsmp.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/platform/x86/amd/hsmp/hsmp.c b/drivers/platform/x86/amd/hsmp/hsmp.c index a3ac09a90de45..ab877112f4c80 100644 --- a/drivers/platform/x86/amd/hsmp/hsmp.c +++ b/drivers/platform/x86/amd/hsmp/hsmp.c @@ -99,7 +99,7 @@ static int __hsmp_send_message(struct hsmp_socket *sock, struct hsmp_message *ms short_sleep = jiffies + msecs_to_jiffies(HSMP_SHORT_SLEEP); timeout = jiffies + msecs_to_jiffies(HSMP_MSG_TIMEOUT);
- while (time_before(jiffies, timeout)) { + while (true) { ret = sock->amd_hsmp_rdwr(sock, mbinfo->msg_resp_off, &mbox_status, HSMP_RD); if (ret) { dev_err(sock->dev, "Error %d reading mailbox status\n", ret); @@ -108,6 +108,10 @@ static int __hsmp_send_message(struct hsmp_socket *sock, struct hsmp_message *ms
if (mbox_status != HSMP_STATUS_NOT_READY) break; + + if (!time_before(jiffies, timeout)) + break; + if (time_before(jiffies, short_sleep)) usleep_range(50, 100); else