From: Timur Kristóf <timur.kristof@gmail.com>
[ Upstream commit 813d13524a3bdcc5f0253e06542440ca74c2653a ]
The SMC can take an excessive amount of time to process some messages under some conditions.
Background: Sending a message to the SMC works by writing the message into the mmSMC_MESSAGE_0 register and its optional parameter into mmSMC_SCRATCH0, and then polling mmSMC_RESP_0. Previously the timeout was AMDGPU_MAX_USEC_TIMEOUT, i.e. 100 ms.
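The handshake described above can be modelled in user space. The following is only a sketch under stated assumptions: `struct fake_smc`, `fake_read_resp` and `fake_send_msg` are hypothetical names, the registers are plain struct fields rather than MMIO, each poll stands in for one `udelay(1)` iteration, and the model clears the response itself when a message is posted.

```c
#include <stdint.h>

/* Hypothetical stand-in for the SMC mailbox; these are not real MMIO registers. */
struct fake_smc {
	uint32_t message;    /* models mmSMC_MESSAGE_0 */
	uint32_t scratch;    /* models mmSMC_SCRATCH0 (optional parameter) */
	uint32_t resp;       /* models mmSMC_RESP_0; 0 means "no answer yet" */
	uint32_t busy_polls; /* polls the fake firmware needs before it answers */
};

/* One call models one poll iteration, i.e. roughly 1 us of waiting. */
static uint32_t fake_read_resp(struct fake_smc *smc)
{
	if (smc->busy_polls > 0)
		smc->busy_polls--;
	else
		smc->resp = 1; /* assumed "OK" response code for this model */
	return smc->resp;
}

/* Post a message, then poll the response until it is non-zero or the budget runs out. */
static uint32_t fake_send_msg(struct fake_smc *smc, uint32_t msg,
			      uint32_t param, uint32_t usec_timeout)
{
	uint32_t i, tmp = 0;

	smc->scratch = param;
	smc->resp = 0;      /* assumption of this model: posting clears the response */
	smc->message = msg;

	for (i = 0; i < usec_timeout; i++) {
		tmp = fake_read_resp(smc);
		if (tmp != 0)
			break;
	}

	return tmp; /* 0 here means the poll budget expired */
}
```

In this model, a fake SMC that needs 150000 polls looks like a failure with a budget of 100000 but answers normally with a budget of 200000.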
Increase the timeout to 200 ms for all messages and to 1 sec for a
few messages which I've observed to be especially slow:
PPSMC_MSG_NoForcedLevel
PPSMC_MSG_SetEnabledLevels
PPSMC_MSG_SetForcedLevels
PPSMC_MSG_DisableULV
PPSMC_MSG_SwitchToSwState
This fixes the following problems on Tahiti when switching from a lower clock power state to a higher clock state, such as when DC turns on a display which was previously turned off.
* si_restrict_performance_levels_before_switch would fail
  (if the user previously forced high clocks using sysfs)
* si_set_sw_state would fail (always)
It turns out that both of those failures were SMC timeouts and that the SMC didn't actually fail or hang; it just needs more time to process those messages.
Add a warning when there is an SMC timeout to make it easier to identify this type of problem in the future.
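As an illustration of why the warning helps, a caller-side view might separate "no response within the budget" from an explicit error code. This is a sketch only: `classify_smc_resp`, `enum smc_outcome` and `SKETCH_RESULT_OK` are hypothetical, the value 0x01 for "OK" is an assumption of the sketch, and `fprintf` stands in for the kernel's `drm_warn`.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical outcome classes for this sketch. */
enum smc_outcome {
	SMC_OUTCOME_OK,
	SMC_OUTCOME_TIMEOUT,
	SMC_OUTCOME_ERROR,
};

/* Assumed "OK" response code; check ppsmc.h for the real definition. */
#define SKETCH_RESULT_OK 0x01u

/*
 * A response of 0 after polling means the SMC never answered in time,
 * which is different from the SMC answering with an error code.
 */
static enum smc_outcome classify_smc_resp(uint32_t resp, uint32_t msg,
					  uint32_t scratch)
{
	if (resp == 0) {
		/* Stand-in for the kernel warning; makes timeouts visible in logs. */
		fprintf(stderr, "SMC timeout on message: %x (scratch: %x)\n",
			(unsigned int)msg, (unsigned int)scratch);
		return SMC_OUTCOME_TIMEOUT;
	}

	return resp == SKETCH_RESULT_OK ? SMC_OUTCOME_OK : SMC_OUTCOME_ERROR;
}
```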
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
**Change Details**
- Increases SMC message polling timeout in `amdgpu_si_send_msg_to_smc` from the device default to longer, message-specific intervals:
  - Adds local `usec_timeout` and selects 1s for slow messages and 200ms for others in `drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c:175` and cases at `drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c:179`.
  - Applies the new timeout in the poll loop `for (i = 0; i < usec_timeout; i++)` at `drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c:196`.
  - Emits a warning on timeout to aid debugging at `drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c:203`.
- The messages given extended timeout are specifically the ones observed to be slow: `PPSMC_MSG_NoForcedLevel`, `PPSMC_MSG_SetEnabledLevels`, `PPSMC_MSG_SetForcedLevels`, `PPSMC_MSG_DisableULV`, `PPSMC_MSG_SwitchToSwState` (see switch at `drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c:179`; message IDs defined under `drivers/gpu/drm/amd/pm/legacy-dpm/ppsmc.h:79`, `drivers/gpu/drm/amd/pm/legacy-dpm/ppsmc.h:81`, `drivers/gpu/drm/amd/pm/legacy-dpm/ppsmc.h:99`, `drivers/gpu/drm/amd/pm/legacy-dpm/ppsmc.h:106`, `drivers/gpu/drm/amd/pm/legacy-dpm/ppsmc.h:107`).
- Prior behavior used the device default timeout `adev->usec_timeout` (100 ms) for all messages; that default is defined as `AMDGPU_MAX_USEC_TIMEOUT` at `drivers/gpu/drm/amd/amdgpu/amdgpu.h:280` and initialized in `drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:4414`.
**Backport Assessment**
- Fixes a user-visible bug: On SI (e.g., Tahiti), switching from lower to higher clocks timed out spuriously, causing failures in:
  - `si_restrict_performance_levels_before_switch`, which sends `PPSMC_MSG_NoForcedLevel` and `PPSMC_MSG_SetEnabledLevels` (`drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c:3899` and following).
  - `si_set_sw_state`, which sends `PPSMC_MSG_SwitchToSwState` (`drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c:3949`).
- Scope is small and contained: one function in the SI legacy DPM SMC path only (`drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c`); no API/ABI changes; no architectural changes.
- Risk is minimal and bounded:
  - Only increases timeouts when sending SMC messages; does not alter state-machine logic.
  - Longest busy-wait increases from 100 ms to 1 s, but only for a narrow set of transitions; these are not hot paths and the long latency is needed for hardware that legitimately responds slowly.
  - Still finite (no indefinite waits) and adds `drm_warn` for diagnostics (`drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c:203`).
- Constrained impact: Applies only to amdgpu’s SI legacy DPM; other ASICs and paths unaffected. Other SMC waits (e.g., `amdgpu_si_wait_for_smc_inactive`) still use the driver default timeout (`drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c:221`).
- Aligns with stable rules: important reliability fix without new features or architectural churn; low regression risk; confined to a subsystem.
Given the clear user impact, narrow scope, and low risk, this is a strong candidate for stable backport in trees that include SI legacy DPM.
 drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 26 ++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
index 4e65ab9e931c9..281a5e377aee4 100644
--- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
+++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
@@ -172,20 +172,42 @@ PPSMC_Result amdgpu_si_send_msg_to_smc(struct amdgpu_device *adev,
 {
 	u32 tmp;
 	int i;
+	int usec_timeout;
+
+	/* SMC seems to process some messages exceptionally slowly. */
+	switch (msg) {
+	case PPSMC_MSG_NoForcedLevel:
+	case PPSMC_MSG_SetEnabledLevels:
+	case PPSMC_MSG_SetForcedLevels:
+	case PPSMC_MSG_DisableULV:
+	case PPSMC_MSG_SwitchToSwState:
+		usec_timeout = 1000000; /* 1 sec */
+		break;
+	default:
+		usec_timeout = 200000; /* 200 ms */
+		break;
+	}
 
 	if (!amdgpu_si_is_smc_running(adev))
 		return PPSMC_Result_Failed;
 
 	WREG32(mmSMC_MESSAGE_0, msg);
 
-	for (i = 0; i < adev->usec_timeout; i++) {
+	for (i = 0; i < usec_timeout; i++) {
 		tmp = RREG32(mmSMC_RESP_0);
 		if (tmp != 0)
 			break;
 		udelay(1);
 	}
 
-	return (PPSMC_Result)RREG32(mmSMC_RESP_0);
+	tmp = RREG32(mmSMC_RESP_0);
+	if (tmp == 0) {
+		drm_warn(adev_to_drm(adev),
+			 "%s timeout on message: %x (SMC_SCRATCH0: %x)\n",
+			 __func__, msg, RREG32(mmSMC_SCRATCH0));
+	}
+
+	return (PPSMC_Result)tmp;
 }
 
 PPSMC_Result amdgpu_si_wait_for_smc_inactive(struct amdgpu_device *adev)