Users have been reporting that momentary GPU temperature fluctuations can trigger a software CTF (critical thermal fault) shutdown.
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1267
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2779
This behavior was fixed in kernel 6.5, and this series backports the fix to the LTS kernel.
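For readers skimming the series, the shape of the fix is a thermal debounce: on a software-CTF interrupt, wait a short grace period and re-read the temperature before powering off. Below is a minimal, self-contained userspace C sketch of that pattern; read_temperature() is a hypothetical stand-in for the driver's sensor read, not an amdgpu API. The real kernel implementation (delayed work plus orderly_poweroff()) is in the patches that follow.

/* Hypothetical sketch of the debounce pattern used by this series. */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define SWCTF_EXTRA_DELAY_MS 50		/* mirrors AMDGPU_SWCTF_EXTRA_DELAY */

/* Hypothetical sensor read (degrees Celsius); stubbed for the example. */
static int read_temperature(void)
{
	return 95;
}

/* Wait out a possible momentary spike, then confirm before acting. */
static bool still_over_threshold(int threshold)
{
	usleep(SWCTF_EXTRA_DELAY_MS * 1000);
	return read_temperature() >= threshold;
}

int main(void)
{
	int sw_ctf_threshold = 110;	/* example trip point */

	if (still_over_threshold(sw_ctf_threshold))
		printf("still hot after %d ms: shut down\n", SWCTF_EXTRA_DELAY_MS);
	else
		printf("momentary fluctuation: no action\n");
	return 0;
}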
Evan Quan (4):
  drm/amd/pm: fulfill swsmu peak profiling mode shader/memory clock settings
  drm/amd/pm: expose swctf threshold setting for legacy powerplay
  drm/amd/pm: fulfill powerplay peak profiling mode shader/memory clock settings
  drm/amd/pm: avoid unintentional shutdown due to temperature momentary fluctuation
 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  3 +
 .../gpu/drm/amd/include/kgd_pp_interface.h    |  2 +
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h       |  2 +
 .../gpu/drm/amd/pm/powerplay/amd_powerplay.c  | 58 +++++++++++++-
 .../amd/pm/powerplay/hwmgr/hardwaremanager.c  |  4 +-
 .../drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c  | 16 +++-
 .../drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c   | 78 +++++++++++++++----
 .../drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c   | 16 +++-
 .../drm/amd/pm/powerplay/hwmgr/smu_helper.c   | 27 +++----
 .../drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c | 41 ++++++++--
 .../drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c | 26 +++++++
 .../drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c | 24 +++---
 drivers/gpu/drm/amd/pm/powerplay/inc/hwmgr.h  |  4 +
 .../drm/amd/pm/powerplay/inc/power_state.h    |  1 +
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c     | 42 ++++++++++
 drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h |  2 +
 .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c    |  9 +--
 .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c    |  9 +--
 18 files changed, 293 insertions(+), 71 deletions(-)
From: Evan Quan <evan.quan@amd.com>
Enable peak profiling mode shader/memory clock reporting for the swsmu framework.
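For context (not part of the patch), a driver-internal caller could query the new sensors through the existing read-sensor plumbing roughly as follows. This is a sketch with error handling elided; the 10 kHz scaling (MHz * 100) follows the handling of the other pstate clock sensors in the hunk below.

/* Sketch: querying the new peak pstate sensors from driver code. */
uint32_t peak_sclk, peak_mclk;
int size = sizeof(uint32_t);

if (!amdgpu_dpm_read_sensor(adev, AMDGPU_PP_SENSOR_PEAK_PSTATE_SCLK,
			    (void *)&peak_sclk, &size))
	dev_info(adev->dev, "peak pstate sclk: %u MHz\n", peak_sclk / 100);
if (!amdgpu_dpm_read_sensor(adev, AMDGPU_PP_SENSOR_PEAK_PSTATE_MCLK,
			    (void *)&peak_mclk, &size))
	dev_info(adev->dev, "peak pstate mclk: %u MHz\n", peak_mclk / 100);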
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 975b4b1d90ccf83da252907108f4090fb61b816e)
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
---
 drivers/gpu/drm/amd/include/kgd_pp_interface.h | 2 ++
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c      | 8 ++++++++
 2 files changed, 10 insertions(+)
diff --git a/drivers/gpu/drm/amd/include/kgd_pp_interface.h b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
index d18162e9ed1d..f3d64c78feaa 100644
--- a/drivers/gpu/drm/amd/include/kgd_pp_interface.h
+++ b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
@@ -139,6 +139,8 @@ enum amd_pp_sensors {
 	AMDGPU_PP_SENSOR_MIN_FAN_RPM,
 	AMDGPU_PP_SENSOR_MAX_FAN_RPM,
 	AMDGPU_PP_SENSOR_VCN_POWER_STATE,
+	AMDGPU_PP_SENSOR_PEAK_PSTATE_SCLK,
+	AMDGPU_PP_SENSOR_PEAK_PSTATE_MCLK,
 };
 enum amd_pp_task {
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index 91dfc229e34d..6d90ab55cea3 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -2520,6 +2520,14 @@ static int smu_read_sensor(void *handle,
 		*((uint32_t *)data) = pstate_table->uclk_pstate.standard * 100;
 		*size = 4;
 		break;
+	case AMDGPU_PP_SENSOR_PEAK_PSTATE_SCLK:
+		*((uint32_t *)data) = pstate_table->gfxclk_pstate.peak * 100;
+		*size = 4;
+		break;
+	case AMDGPU_PP_SENSOR_PEAK_PSTATE_MCLK:
+		*((uint32_t *)data) = pstate_table->uclk_pstate.peak * 100;
+		*size = 4;
+		break;
 	case AMDGPU_PP_SENSOR_ENABLED_SMC_FEATURES_MASK:
 		ret = smu_feature_get_enabled_mask(smu, (uint64_t *)data);
 		*size = 8;
From: Evan Quan <evan.quan@amd.com>
Preparation for the coming optimization which eliminates the influence of momentary GPU temperature fluctuations.
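Condensed from the hunks below, the plumbing added here is two steps: each ASIC backend reports its SW CTF trip point alongside the rest of its thermal range, and the common thermal-controller start-up caches it for the interrupt path (a fragment drawn from the diffs, not standalone code):

/* 1) The ASIC backend fills in the trip point, e.g. the vega12 hunk below: */
thermal_data->sw_ctf_threshold = pptable_information->us_software_shutdown_temp *
	PP_TEMPERATURE_UNITS_PER_CENTIGRADES;

/* 2) phm_start_thermal_controller() caches it for later IRQ handling: */
adev->pm.dpm.thermal.sw_ctf_threshold = range.sw_ctf_threshold;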
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 064329c595da56eff6d7a7e7760660c726433139)
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
---
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h               |  2 ++
 .../gpu/drm/amd/pm/powerplay/hwmgr/hardwaremanager.c  |  4 +++-
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c   |  2 ++
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c | 10 ++++++++++
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c |  4 ++++
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c |  4 ++++
 drivers/gpu/drm/amd/pm/powerplay/inc/power_state.h    |  1 +
 7 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
index cb5b9df78b4d..338fce249f5a 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
@@ -89,6 +89,8 @@ struct amdgpu_dpm_thermal {
 	int max_mem_crit_temp;
 	/* memory max emergency(shutdown) temp */
 	int max_mem_emergency_temp;
+	/* SWCTF threshold */
+	int sw_ctf_threshold;
 	/* was last interrupt low to high or high to low */
 	bool high_to_low;
 	/* interrupt source */
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/hardwaremanager.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/hardwaremanager.c
index 981dc8c7112d..90452b66e107 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/hardwaremanager.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/hardwaremanager.c
@@ -241,7 +241,8 @@ int phm_start_thermal_controller(struct pp_hwmgr *hwmgr)
 		TEMP_RANGE_MAX,
 		TEMP_RANGE_MIN,
 		TEMP_RANGE_MAX,
-		TEMP_RANGE_MAX};
+		TEMP_RANGE_MAX,
+		0};
 	struct amdgpu_device *adev = hwmgr->adev;
 	if (!hwmgr->not_vf)
@@ -265,6 +266,7 @@ int phm_start_thermal_controller(struct pp_hwmgr *hwmgr)
 	adev->pm.dpm.thermal.min_mem_temp = range.mem_min;
 	adev->pm.dpm.thermal.max_mem_crit_temp = range.mem_crit_max;
 	adev->pm.dpm.thermal.max_mem_emergency_temp = range.mem_emergency_max;
+	adev->pm.dpm.thermal.sw_ctf_threshold = range.sw_ctf_threshold;
 	return ret;
 }
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
index 7ef7e81525a3..b9e6e49ba4f0 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
@@ -5381,6 +5381,8 @@ static int smu7_get_thermal_temperature_range(struct pp_hwmgr *hwmgr,
 	thermal_data->max = data->thermal_temp_setting.temperature_shutdown *
 		PP_TEMPERATURE_UNITS_PER_CENTIGRADES;
+	thermal_data->sw_ctf_threshold = thermal_data->max;
+
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c
index c8c9fb827bda..c78f8b2b056d 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c
@@ -5221,6 +5221,9 @@ static int vega10_get_thermal_temperature_range(struct pp_hwmgr *hwmgr,
 {
 	struct vega10_hwmgr *data = hwmgr->backend;
 	PPTable_t *pp_table = &(data->smc_state_table.pp_table);
+	struct phm_ppt_v2_information *pp_table_info =
+		(struct phm_ppt_v2_information *)(hwmgr->pptable);
+	struct phm_tdp_table *tdp_table = pp_table_info->tdp_table;
 	memcpy(thermal_data, &SMU7ThermalWithDelayPolicy[0],
 		sizeof(struct PP_TemperatureRange));
@@ -5237,6 +5240,13 @@ static int vega10_get_thermal_temperature_range(struct pp_hwmgr *hwmgr,
 	thermal_data->mem_emergency_max = (pp_table->ThbmLimit + CTF_OFFSET_HBM)*
 		PP_TEMPERATURE_UNITS_PER_CENTIGRADES;
+	if (tdp_table->usSoftwareShutdownTemp > pp_table->ThotspotLimit &&
+	    tdp_table->usSoftwareShutdownTemp < VEGA10_THERMAL_MAXIMUM_ALERT_TEMP)
+		thermal_data->sw_ctf_threshold = tdp_table->usSoftwareShutdownTemp;
+	else
+		thermal_data->sw_ctf_threshold = VEGA10_THERMAL_MAXIMUM_ALERT_TEMP;
+	thermal_data->sw_ctf_threshold *= PP_TEMPERATURE_UNITS_PER_CENTIGRADES;
+
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c
index a2f4d6773d45..0fe821dff0a4 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c
@@ -2742,6 +2742,8 @@ static int vega12_notify_cac_buffer_info(struct pp_hwmgr *hwmgr,
 static int vega12_get_thermal_temperature_range(struct pp_hwmgr *hwmgr,
 		struct PP_TemperatureRange *thermal_data)
 {
+	struct phm_ppt_v3_information *pptable_information =
+		(struct phm_ppt_v3_information *)hwmgr->pptable;
 	struct vega12_hwmgr *data =
 			(struct vega12_hwmgr *)(hwmgr->backend);
 	PPTable_t *pp_table = &(data->smc_state_table.pp_table);
@@ -2760,6 +2762,8 @@ static int vega12_get_thermal_temperature_range(struct pp_hwmgr *hwmgr,
 		PP_TEMPERATURE_UNITS_PER_CENTIGRADES;
 	thermal_data->mem_emergency_max = (pp_table->ThbmLimit + CTF_OFFSET_HBM)*
 		PP_TEMPERATURE_UNITS_PER_CENTIGRADES;
+	thermal_data->sw_ctf_threshold = pptable_information->us_software_shutdown_temp *
+		PP_TEMPERATURE_UNITS_PER_CENTIGRADES;
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c
index b30684c84e20..8e4743cb7443 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c
@@ -4213,6 +4213,8 @@ static int vega20_notify_cac_buffer_info(struct pp_hwmgr *hwmgr,
 static int vega20_get_thermal_temperature_range(struct pp_hwmgr *hwmgr,
 		struct PP_TemperatureRange *thermal_data)
 {
+	struct phm_ppt_v3_information *pptable_information =
+		(struct phm_ppt_v3_information *)hwmgr->pptable;
 	struct vega20_hwmgr *data =
 			(struct vega20_hwmgr *)(hwmgr->backend);
 	PPTable_t *pp_table = &(data->smc_state_table.pp_table);
@@ -4231,6 +4233,8 @@ static int vega20_get_thermal_temperature_range(struct pp_hwmgr *hwmgr,
 		PP_TEMPERATURE_UNITS_PER_CENTIGRADES;
 	thermal_data->mem_emergency_max = (pp_table->ThbmLimit + CTF_OFFSET_HBM)*
 		PP_TEMPERATURE_UNITS_PER_CENTIGRADES;
+	thermal_data->sw_ctf_threshold = pptable_information->us_software_shutdown_temp *
+		PP_TEMPERATURE_UNITS_PER_CENTIGRADES;
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/pm/powerplay/inc/power_state.h b/drivers/gpu/drm/amd/pm/powerplay/inc/power_state.h
index a5f2227a3971..0ffc2347829d 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/inc/power_state.h
+++ b/drivers/gpu/drm/amd/pm/powerplay/inc/power_state.h
@@ -131,6 +131,7 @@ struct PP_TemperatureRange {
 	int mem_min;
 	int mem_crit_max;
 	int mem_emergency_max;
+	int sw_ctf_threshold;
 };
struct PP_StateValidationBlock {
From: Evan Quan <evan.quan@amd.com>
Enable peak profiling mode shader/memory clock reporting for the powerplay framework.
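The interesting part of this patch is the smu7 heuristic for picking the stable-pstate sclk: derive a target as a percentage of the chosen mclk, then snap down to the nearest entry of the voltage-dependency table. Here is a small standalone C illustration with hypothetical clock values (the real tables come from the pptable):

#include <stdio.h>

int main(void)
{
	/* hypothetical ascending sclk dependency table, in MHz */
	int sclk_levels[] = { 300, 600, 900, 960 };
	int top_sclk = 960, top_mclk = 1200;		/* golden table top levels */
	int percentage = 100 * top_sclk / top_mclk;	/* = 80 */
	int pstate_mclk = 1000;				/* second-highest mclk level */
	int tmp_sclk = pstate_mclk * percentage / 100;	/* 800 MHz target */
	int i, pstate_sclk = sclk_levels[0];

	for (i = 3; i >= 0; i--) {			/* snap down to a real level */
		if (tmp_sclk >= sclk_levels[i]) {
			pstate_sclk = sclk_levels[i];
			break;
		}
	}
	printf("stable pstate sclk: %d MHz\n", pstate_sclk);	/* prints 600 */
	return 0;
}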
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit b1a9557a7d00c758ed9e701fbb3445a13a49506f)
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
---
 .../gpu/drm/amd/pm/powerplay/amd_powerplay.c  | 10 ++-
 .../drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c  | 16 +++-
 .../drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c   | 76 +++++++++++++++----
 .../drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c   | 16 +++-
 .../drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c | 31 ++++++--
 .../drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c | 22 ++++++
 .../drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c | 20 ++---
 drivers/gpu/drm/amd/pm/powerplay/inc/hwmgr.h  |  2 +
 8 files changed, 155 insertions(+), 38 deletions(-)
diff --git a/drivers/gpu/drm/amd/pm/powerplay/amd_powerplay.c b/drivers/gpu/drm/amd/pm/powerplay/amd_powerplay.c
index 1159ae114dd0..3f4a476d7802 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/amd_powerplay.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/amd_powerplay.c
@@ -769,10 +769,16 @@ static int pp_dpm_read_sensor(void *handle, int idx,
 	switch (idx) {
 	case AMDGPU_PP_SENSOR_STABLE_PSTATE_SCLK:
-		*((uint32_t *)value) = hwmgr->pstate_sclk;
+		*((uint32_t *)value) = hwmgr->pstate_sclk * 100;
 		return 0;
 	case AMDGPU_PP_SENSOR_STABLE_PSTATE_MCLK:
-		*((uint32_t *)value) = hwmgr->pstate_mclk;
+		*((uint32_t *)value) = hwmgr->pstate_mclk * 100;
+		return 0;
+	case AMDGPU_PP_SENSOR_PEAK_PSTATE_SCLK:
+		*((uint32_t *)value) = hwmgr->pstate_sclk_peak * 100;
+		return 0;
+	case AMDGPU_PP_SENSOR_PEAK_PSTATE_MCLK:
+		*((uint32_t *)value) = hwmgr->pstate_mclk_peak * 100;
 		return 0;
 	case AMDGPU_PP_SENSOR_MIN_FAN_RPM:
 		*((uint32_t *)value) = hwmgr->thermal_controller.fanInfo.ulMinRPM;
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
index ede71de2343d..86d6e88c7386 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
@@ -375,6 +375,17 @@ static int smu10_enable_gfx_off(struct pp_hwmgr *hwmgr)
 	return 0;
 }
+static void smu10_populate_umdpstate_clocks(struct pp_hwmgr *hwmgr)
+{
+	hwmgr->pstate_sclk = SMU10_UMD_PSTATE_GFXCLK;
+	hwmgr->pstate_mclk = SMU10_UMD_PSTATE_FCLK;
+
+	smum_send_msg_to_smc(hwmgr,
+			     PPSMC_MSG_GetMaxGfxclkFrequency,
+			     &hwmgr->pstate_sclk_peak);
+	hwmgr->pstate_mclk_peak = SMU10_UMD_PSTATE_PEAK_FCLK;
+}
+
 static int smu10_enable_dpm_tasks(struct pp_hwmgr *hwmgr)
 {
 	struct amdgpu_device *adev = hwmgr->adev;
@@ -398,6 +409,8 @@ static int smu10_enable_dpm_tasks(struct pp_hwmgr *hwmgr)
 			return ret;
 	}
+	smu10_populate_umdpstate_clocks(hwmgr);
+
 	return 0;
 }
@@ -574,9 +587,6 @@ static int smu10_hwmgr_backend_init(struct pp_hwmgr *hwmgr)
hwmgr->platform_descriptor.minimumClocksReductionPercentage = 50;
-	hwmgr->pstate_sclk = SMU10_UMD_PSTATE_GFXCLK * 100;
-	hwmgr->pstate_mclk = SMU10_UMD_PSTATE_FCLK * 100;
-
 	/* enable the pp_od_clk_voltage sysfs file */
 	hwmgr->od_enabled = 1;
 	/* disabled fine grain tuning function by default */
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
index b9e6e49ba4f0..44ec238cfeff 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
@@ -1501,6 +1501,65 @@ static int smu7_populate_edc_leakage_registers(struct pp_hwmgr *hwmgr)
 	return ret;
 }
+static void smu7_populate_umdpstate_clocks(struct pp_hwmgr *hwmgr)
+{
+	struct smu7_hwmgr *data = (struct smu7_hwmgr *)(hwmgr->backend);
+	struct smu7_dpm_table *golden_dpm_table = &data->golden_dpm_table;
+	struct phm_clock_voltage_dependency_table *vddc_dependency_on_sclk =
+			hwmgr->dyn_state.vddc_dependency_on_sclk;
+	struct phm_ppt_v1_information *table_info =
+			(struct phm_ppt_v1_information *)(hwmgr->pptable);
+	struct phm_ppt_v1_clock_voltage_dependency_table *vdd_dep_on_sclk =
+			table_info->vdd_dep_on_sclk;
+	int32_t tmp_sclk, count, percentage;
+
+	if (golden_dpm_table->mclk_table.count == 1) {
+		percentage = 70;
+		hwmgr->pstate_mclk = golden_dpm_table->mclk_table.dpm_levels[0].value;
+	} else {
+		percentage = 100 * golden_dpm_table->sclk_table.dpm_levels[golden_dpm_table->sclk_table.count - 1].value /
+				golden_dpm_table->mclk_table.dpm_levels[golden_dpm_table->mclk_table.count - 1].value;
+		hwmgr->pstate_mclk = golden_dpm_table->mclk_table.dpm_levels[golden_dpm_table->mclk_table.count - 2].value;
+	}
+
+	tmp_sclk = hwmgr->pstate_mclk * percentage / 100;
+
+	if (hwmgr->pp_table_version == PP_TABLE_V0) {
+		for (count = vddc_dependency_on_sclk->count - 1; count >= 0; count--) {
+			if (tmp_sclk >= vddc_dependency_on_sclk->entries[count].clk) {
+				hwmgr->pstate_sclk = vddc_dependency_on_sclk->entries[count].clk;
+				break;
+			}
+		}
+		if (count < 0)
+			hwmgr->pstate_sclk = vddc_dependency_on_sclk->entries[0].clk;
+
+		hwmgr->pstate_sclk_peak =
+				vddc_dependency_on_sclk->entries[vddc_dependency_on_sclk->count - 1].clk;
+	} else if (hwmgr->pp_table_version == PP_TABLE_V1) {
+		for (count = vdd_dep_on_sclk->count - 1; count >= 0; count--) {
+			if (tmp_sclk >= vdd_dep_on_sclk->entries[count].clk) {
+				hwmgr->pstate_sclk = vdd_dep_on_sclk->entries[count].clk;
+				break;
+			}
+		}
+		if (count < 0)
+			hwmgr->pstate_sclk = vdd_dep_on_sclk->entries[0].clk;
+
+		hwmgr->pstate_sclk_peak =
+				vdd_dep_on_sclk->entries[vdd_dep_on_sclk->count - 1].clk;
+	}
+
+	hwmgr->pstate_mclk_peak =
+			golden_dpm_table->mclk_table.dpm_levels[golden_dpm_table->mclk_table.count - 1].value;
+
+	/* make sure the output is in Mhz */
+	hwmgr->pstate_sclk /= 100;
+	hwmgr->pstate_mclk /= 100;
+	hwmgr->pstate_sclk_peak /= 100;
+	hwmgr->pstate_mclk_peak /= 100;
+}
+
 static int smu7_enable_dpm_tasks(struct pp_hwmgr *hwmgr)
 {
 	int tmp_result = 0;
@@ -1625,6 +1684,8 @@ static int smu7_enable_dpm_tasks(struct pp_hwmgr *hwmgr)
 	PP_ASSERT_WITH_CODE((0 == tmp_result),
 			"pcie performance request failed!", result = tmp_result);
+	smu7_populate_umdpstate_clocks(hwmgr);
+
 	return 0;
 }
@@ -3143,15 +3204,12 @@ static int smu7_get_profiling_clk(struct pp_hwmgr *hwmgr, enum amd_dpm_forced_le
 		for (count = hwmgr->dyn_state.vddc_dependency_on_sclk->count-1;
 			count >= 0; count--) {
 			if (tmp_sclk >= hwmgr->dyn_state.vddc_dependency_on_sclk->entries[count].clk) {
-				tmp_sclk = hwmgr->dyn_state.vddc_dependency_on_sclk->entries[count].clk;
 				*sclk_mask = count;
 				break;
 			}
 		}
-		if (count < 0 || level == AMD_DPM_FORCED_LEVEL_PROFILE_MIN_SCLK) {
+		if (count < 0 || level == AMD_DPM_FORCED_LEVEL_PROFILE_MIN_SCLK)
 			*sclk_mask = 0;
-			tmp_sclk = hwmgr->dyn_state.vddc_dependency_on_sclk->entries[0].clk;
-		}
 		if (level == AMD_DPM_FORCED_LEVEL_PROFILE_PEAK)
 			*sclk_mask = hwmgr->dyn_state.vddc_dependency_on_sclk->count-1;
@@ -3161,15 +3219,12 @@ static int smu7_get_profiling_clk(struct pp_hwmgr *hwmgr, enum amd_dpm_forced_le
 		for (count = table_info->vdd_dep_on_sclk->count-1;
 			count >= 0; count--) {
 			if (tmp_sclk >= table_info->vdd_dep_on_sclk->entries[count].clk) {
-				tmp_sclk = table_info->vdd_dep_on_sclk->entries[count].clk;
 				*sclk_mask = count;
 				break;
 			}
 		}
-		if (count < 0 || level == AMD_DPM_FORCED_LEVEL_PROFILE_MIN_SCLK) {
+		if (count < 0 || level == AMD_DPM_FORCED_LEVEL_PROFILE_MIN_SCLK)
 			*sclk_mask = 0;
-			tmp_sclk = table_info->vdd_dep_on_sclk->entries[0].clk;
-		}
 		if (level == AMD_DPM_FORCED_LEVEL_PROFILE_PEAK)
 			*sclk_mask = table_info->vdd_dep_on_sclk->count - 1;
@@ -3181,8 +3236,6 @@ static int smu7_get_profiling_clk(struct pp_hwmgr *hwmgr, enum amd_dpm_forced_le
 		*mclk_mask = golden_dpm_table->mclk_table.count - 1;
 	*pcie_mask = data->dpm_table.pcie_speed_table.count - 1;
-	hwmgr->pstate_sclk = tmp_sclk;
-	hwmgr->pstate_mclk = tmp_mclk;
 	return 0;
 }
@@ -3195,9 +3248,6 @@ static int smu7_force_dpm_level(struct pp_hwmgr *hwmgr,
 	uint32_t mclk_mask = 0;
 	uint32_t pcie_mask = 0;
-	if (hwmgr->pstate_sclk == 0)
-		smu7_get_profiling_clk(hwmgr, level, &sclk_mask, &mclk_mask, &pcie_mask);
-
 	switch (level) {
 	case AMD_DPM_FORCED_LEVEL_HIGH:
 		ret = smu7_force_dpm_highest(hwmgr);
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c
index b50fd4a4a3d1..b015a601b385 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c
@@ -1016,6 +1016,18 @@ static void smu8_reset_acp_boot_level(struct pp_hwmgr *hwmgr)
 	data->acp_boot_level = 0xff;
 }
+static void smu8_populate_umdpstate_clocks(struct pp_hwmgr *hwmgr)
+{
+	struct phm_clock_voltage_dependency_table *table =
+				hwmgr->dyn_state.vddc_dependency_on_sclk;
+
+	hwmgr->pstate_sclk = table->entries[0].clk / 100;
+	hwmgr->pstate_mclk = 0;
+
+	hwmgr->pstate_sclk_peak = table->entries[table->count - 1].clk / 100;
+	hwmgr->pstate_mclk_peak = 0;
+}
+
 static int smu8_enable_dpm_tasks(struct pp_hwmgr *hwmgr)
 {
 	smu8_program_voting_clients(hwmgr);
@@ -1024,6 +1036,8 @@ static int smu8_enable_dpm_tasks(struct pp_hwmgr *hwmgr)
 	smu8_program_bootup_state(hwmgr);
 	smu8_reset_acp_boot_level(hwmgr);
+	smu8_populate_umdpstate_clocks(hwmgr);
+
 	return 0;
 }
@@ -1167,8 +1181,6 @@ static int smu8_phm_unforce_dpm_levels(struct pp_hwmgr *hwmgr)
 	data->sclk_dpm.soft_min_clk = table->entries[0].clk;
 	data->sclk_dpm.hard_min_clk = table->entries[0].clk;
-	hwmgr->pstate_sclk = table->entries[0].clk;
-	hwmgr->pstate_mclk = 0;
level = smu8_get_max_sclk_level(hwmgr) - 1;
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c
index c78f8b2b056d..d8cd23438b76 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c
@@ -3008,6 +3008,30 @@ static int vega10_enable_disable_PCC_limit_feature(struct pp_hwmgr *hwmgr, bool
 	return 0;
 }
+static void vega10_populate_umdpstate_clocks(struct pp_hwmgr *hwmgr)
+{
+	struct phm_ppt_v2_information *table_info =
+			(struct phm_ppt_v2_information *)(hwmgr->pptable);
+
+	if (table_info->vdd_dep_on_sclk->count > VEGA10_UMD_PSTATE_GFXCLK_LEVEL &&
+	    table_info->vdd_dep_on_mclk->count > VEGA10_UMD_PSTATE_MCLK_LEVEL) {
+		hwmgr->pstate_sclk = table_info->vdd_dep_on_sclk->entries[VEGA10_UMD_PSTATE_GFXCLK_LEVEL].clk;
+		hwmgr->pstate_mclk = table_info->vdd_dep_on_mclk->entries[VEGA10_UMD_PSTATE_MCLK_LEVEL].clk;
+	} else {
+		hwmgr->pstate_sclk = table_info->vdd_dep_on_sclk->entries[0].clk;
+		hwmgr->pstate_mclk = table_info->vdd_dep_on_mclk->entries[0].clk;
+	}
+
+	hwmgr->pstate_sclk_peak = table_info->vdd_dep_on_sclk->entries[table_info->vdd_dep_on_sclk->count - 1].clk;
+	hwmgr->pstate_mclk_peak = table_info->vdd_dep_on_mclk->entries[table_info->vdd_dep_on_mclk->count - 1].clk;
+
+	/* make sure the output is in Mhz */
+	hwmgr->pstate_sclk /= 100;
+	hwmgr->pstate_mclk /= 100;
+	hwmgr->pstate_sclk_peak /= 100;
+	hwmgr->pstate_mclk_peak /= 100;
+}
+
 static int vega10_enable_dpm_tasks(struct pp_hwmgr *hwmgr)
 {
 	struct vega10_hwmgr *data = hwmgr->backend;
@@ -3082,6 +3106,8 @@ static int vega10_enable_dpm_tasks(struct pp_hwmgr *hwmgr)
 				result = tmp_result);
 	}
+	vega10_populate_umdpstate_clocks(hwmgr);
+
 	return result;
 }
@@ -4169,8 +4195,6 @@ static int vega10_get_profiling_clk_mask(struct pp_hwmgr *hwmgr, enum amd_dpm_fo
 		*sclk_mask = VEGA10_UMD_PSTATE_GFXCLK_LEVEL;
 		*soc_mask = VEGA10_UMD_PSTATE_SOCCLK_LEVEL;
 		*mclk_mask = VEGA10_UMD_PSTATE_MCLK_LEVEL;
-		hwmgr->pstate_sclk = table_info->vdd_dep_on_sclk->entries[VEGA10_UMD_PSTATE_GFXCLK_LEVEL].clk;
-		hwmgr->pstate_mclk = table_info->vdd_dep_on_mclk->entries[VEGA10_UMD_PSTATE_MCLK_LEVEL].clk;
 	}
 	if (level == AMD_DPM_FORCED_LEVEL_PROFILE_MIN_SCLK) {
@@ -4281,9 +4305,6 @@ static int vega10_dpm_force_dpm_level(struct pp_hwmgr *hwmgr,
 	uint32_t mclk_mask = 0;
 	uint32_t soc_mask = 0;
-	if (hwmgr->pstate_sclk == 0)
-		vega10_get_profiling_clk_mask(hwmgr, level, &sclk_mask, &mclk_mask, &soc_mask);
-
 	switch (level) {
 	case AMD_DPM_FORCED_LEVEL_HIGH:
 		ret = vega10_force_dpm_highest(hwmgr);
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c
index 0fe821dff0a4..1069eaaae2f8 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c
@@ -1026,6 +1026,25 @@ static int vega12_get_all_clock_ranges(struct pp_hwmgr *hwmgr)
 	return 0;
 }
+static void vega12_populate_umdpstate_clocks(struct pp_hwmgr *hwmgr)
+{
+	struct vega12_hwmgr *data = (struct vega12_hwmgr *)(hwmgr->backend);
+	struct vega12_single_dpm_table *gfx_dpm_table = &(data->dpm_table.gfx_table);
+	struct vega12_single_dpm_table *mem_dpm_table = &(data->dpm_table.mem_table);
+
+	if (gfx_dpm_table->count > VEGA12_UMD_PSTATE_GFXCLK_LEVEL &&
+	    mem_dpm_table->count > VEGA12_UMD_PSTATE_MCLK_LEVEL) {
+		hwmgr->pstate_sclk = gfx_dpm_table->dpm_levels[VEGA12_UMD_PSTATE_GFXCLK_LEVEL].value;
+		hwmgr->pstate_mclk = mem_dpm_table->dpm_levels[VEGA12_UMD_PSTATE_MCLK_LEVEL].value;
+	} else {
+		hwmgr->pstate_sclk = gfx_dpm_table->dpm_levels[0].value;
+		hwmgr->pstate_mclk = mem_dpm_table->dpm_levels[0].value;
+	}
+
+	hwmgr->pstate_sclk_peak = gfx_dpm_table->dpm_levels[gfx_dpm_table->count].value;
+	hwmgr->pstate_mclk_peak = mem_dpm_table->dpm_levels[mem_dpm_table->count].value;
+}
+
 static int vega12_enable_dpm_tasks(struct pp_hwmgr *hwmgr)
 {
 	int tmp_result, result = 0;
@@ -1077,6 +1096,9 @@ static int vega12_enable_dpm_tasks(struct pp_hwmgr *hwmgr)
 	PP_ASSERT_WITH_CODE(!result,
 			"Failed to setup default DPM tables!",
 			return result);
+
+	vega12_populate_umdpstate_clocks(hwmgr);
+
 	return result;
 }
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c
index 8e4743cb7443..ff77a3683efd 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c
@@ -1555,26 +1555,23 @@ static int vega20_set_mclk_od(
 	return 0;
 }
-static int vega20_populate_umdpstate_clocks(
-		struct pp_hwmgr *hwmgr)
+static void vega20_populate_umdpstate_clocks(struct pp_hwmgr *hwmgr)
 {
 	struct vega20_hwmgr *data = (struct vega20_hwmgr *)(hwmgr->backend);
 	struct vega20_single_dpm_table *gfx_table = &(data->dpm_table.gfx_table);
 	struct vega20_single_dpm_table *mem_table = &(data->dpm_table.mem_table);
-	hwmgr->pstate_sclk = gfx_table->dpm_levels[0].value;
-	hwmgr->pstate_mclk = mem_table->dpm_levels[0].value;
-
 	if (gfx_table->count > VEGA20_UMD_PSTATE_GFXCLK_LEVEL &&
 	    mem_table->count > VEGA20_UMD_PSTATE_MCLK_LEVEL) {
 		hwmgr->pstate_sclk = gfx_table->dpm_levels[VEGA20_UMD_PSTATE_GFXCLK_LEVEL].value;
 		hwmgr->pstate_mclk = mem_table->dpm_levels[VEGA20_UMD_PSTATE_MCLK_LEVEL].value;
+	} else {
+		hwmgr->pstate_sclk = gfx_table->dpm_levels[0].value;
+		hwmgr->pstate_mclk = mem_table->dpm_levels[0].value;
 	}
-	hwmgr->pstate_sclk = hwmgr->pstate_sclk * 100;
-	hwmgr->pstate_mclk = hwmgr->pstate_mclk * 100;
-
-	return 0;
+	hwmgr->pstate_sclk_peak = gfx_table->dpm_levels[gfx_table->count - 1].value;
+	hwmgr->pstate_mclk_peak = mem_table->dpm_levels[mem_table->count - 1].value;
 }
 static int vega20_get_max_sustainable_clock(struct pp_hwmgr *hwmgr,
@@ -1753,10 +1750,7 @@ static int vega20_enable_dpm_tasks(struct pp_hwmgr *hwmgr)
 			"[EnableDPMTasks] Failed to initialize odn settings!",
 			return result);
-	result = vega20_populate_umdpstate_clocks(hwmgr);
-	PP_ASSERT_WITH_CODE(!result,
-			"[EnableDPMTasks] Failed to populate umdpstate clocks!",
-			return result);
+	vega20_populate_umdpstate_clocks(hwmgr);
 	result = smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_GetPptLimit,
 			POWER_SOURCE_AC << 16, &hwmgr->default_power_limit);
diff --git a/drivers/gpu/drm/amd/pm/powerplay/inc/hwmgr.h b/drivers/gpu/drm/amd/pm/powerplay/inc/hwmgr.h
index 27f8d0e0e6a8..5ce433e2c16a 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/inc/hwmgr.h
+++ b/drivers/gpu/drm/amd/pm/powerplay/inc/hwmgr.h
@@ -809,6 +809,8 @@ struct pp_hwmgr {
 	uint32_t workload_prority[Workload_Policy_Max];
 	uint32_t workload_setting[Workload_Policy_Max];
 	bool gfxoff_state_changed_by_workload;
+	uint32_t pstate_sclk_peak;
+	uint32_t pstate_mclk_peak;
 };
int hwmgr_early_init(struct pp_hwmgr *hwmgr);
From: Evan Quan <evan.quan@amd.com>
Add an intentional delay when a soft CTF is triggered, then double-check the GPU temperature before taking further action. This avoids unintended shutdowns caused by momentary temperature fluctuations.
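The lifecycle the patch wires up is the standard delayed-work pattern: initialize the work at sw-init, schedule it from the thermal interrupt with a 50 ms delay, and cancel it synchronously on suspend/teardown so the handler can never run against freed state. A toy kernel-module sketch of that lifecycle (illustrative only, not driver code):

#include <linux/module.h>
#include <linux/workqueue.h>

static struct delayed_work debounce_work;

/* Runs 50 ms after the "event"; decide here whether to act. */
static void debounce_handler(struct work_struct *work)
{
	pr_info("condition still present after delay: act now\n");
}

static int __init demo_init(void)
{
	INIT_DELAYED_WORK(&debounce_work, debounce_handler);
	/* stand-in for the thermal interrupt firing: */
	schedule_delayed_work(&debounce_work, msecs_to_jiffies(50));
	return 0;
}

static void __exit demo_exit(void)
{
	/* mirror pp_hw_fini()/pp_suspend(): never leave the work pending */
	cancel_delayed_work_sync(&debounce_work);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");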
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit b75efe88b20c2be28b67e2821a794cc183e32374)
Hand-modified because:
* XCP support added to amdgpu.h in kernel 6.5 and is not necessary for this fix.
* SMU microcode initialization moved in 32806038aa76 ("drm/amd: Load SMU microcode during early_init")
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1267
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2779
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  3 ++
 .../gpu/drm/amd/pm/powerplay/amd_powerplay.c  | 48 +++++++++++++++++++
 .../drm/amd/pm/powerplay/hwmgr/smu_helper.c   | 27 ++++-------
 drivers/gpu/drm/amd/pm/powerplay/inc/hwmgr.h  |  2 +
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c     | 34 +++++++++++++
 drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h |  2 +
 .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c    |  9 +---
 .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c    |  9 +---
 8 files changed, 102 insertions(+), 32 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c0e782a95e72..43eb3a3dadff 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -283,6 +283,9 @@ extern int amdgpu_vcnfw_log;
 #define AMDGPU_SMARTSHIFT_MAX_BIAS (100)
 #define AMDGPU_SMARTSHIFT_MIN_BIAS (-100)
+/* Extra time delay(in ms) to eliminate the influence of temperature momentary fluctuation */
+#define AMDGPU_SWCTF_EXTRA_DELAY		50
+
 struct amdgpu_device;
 struct amdgpu_irq_src;
 struct amdgpu_fpriv;
diff --git a/drivers/gpu/drm/amd/pm/powerplay/amd_powerplay.c b/drivers/gpu/drm/amd/pm/powerplay/amd_powerplay.c
index 3f4a476d7802..179e1c593a53 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/amd_powerplay.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/amd_powerplay.c
@@ -26,6 +26,7 @@
 #include <linux/gfp.h>
 #include <linux/slab.h>
 #include <linux/firmware.h>
+#include <linux/reboot.h>
 #include "amd_shared.h"
 #include "amd_powerplay.h"
 #include "power_state.h"
@@ -91,6 +92,45 @@ static int pp_early_init(void *handle)
 	return 0;
 }
+static void pp_swctf_delayed_work_handler(struct work_struct *work)
+{
+	struct pp_hwmgr *hwmgr =
+		container_of(work, struct pp_hwmgr, swctf_delayed_work.work);
+	struct amdgpu_device *adev = hwmgr->adev;
+	struct amdgpu_dpm_thermal *range =
+				&adev->pm.dpm.thermal;
+	uint32_t gpu_temperature, size;
+	int ret;
+
+	/*
+	 * If the hotspot/edge temperature is confirmed as below SW CTF setting point
+	 * after the delay enforced, nothing will be done.
+	 * Otherwise, a graceful shutdown will be performed to prevent further damage.
+	 */
+	if (range->sw_ctf_threshold &&
+	    hwmgr->hwmgr_func->read_sensor) {
+		ret = hwmgr->hwmgr_func->read_sensor(hwmgr,
+						     AMDGPU_PP_SENSOR_HOTSPOT_TEMP,
+						     &gpu_temperature,
+						     &size);
+		/*
+		 * For some legacy ASICs, hotspot temperature retrieving might be not
+		 * supported. Check the edge temperature instead then.
+		 */
+		if (ret == -EOPNOTSUPP)
+			ret = hwmgr->hwmgr_func->read_sensor(hwmgr,
+							     AMDGPU_PP_SENSOR_EDGE_TEMP,
+							     &gpu_temperature,
+							     &size);
+		if (!ret && gpu_temperature / 1000 < range->sw_ctf_threshold)
+			return;
+	}
+
+	dev_emerg(adev->dev, "ERROR: GPU over temperature range(SW CTF) detected!\n");
+	dev_emerg(adev->dev, "ERROR: System is going to shutdown due to GPU SW CTF!\n");
+	orderly_poweroff(true);
+}
+
 static int pp_sw_init(void *handle)
 {
 	struct amdgpu_device *adev = handle;
@@ -101,6 +141,10 @@ static int pp_sw_init(void *handle)
pr_debug("powerplay sw init %s\n", ret ? "failed" : "successfully");
+	if (!ret)
+		INIT_DELAYED_WORK(&hwmgr->swctf_delayed_work,
+				  pp_swctf_delayed_work_handler);
+
 	return ret;
 }
@@ -136,6 +180,8 @@ static int pp_hw_fini(void *handle)
 	struct amdgpu_device *adev = handle;
 	struct pp_hwmgr *hwmgr = adev->powerplay.pp_handle;
+	cancel_delayed_work_sync(&hwmgr->swctf_delayed_work);
+
 	hwmgr_hw_fini(hwmgr);
 	return 0;
@@ -222,6 +268,8 @@ static int pp_suspend(void *handle)
 	struct amdgpu_device *adev = handle;
 	struct pp_hwmgr *hwmgr = adev->powerplay.pp_handle;
+	cancel_delayed_work_sync(&hwmgr->swctf_delayed_work);
+
 	return hwmgr_suspend(hwmgr);
 }
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu_helper.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu_helper.c
index bfe80ac0ad8c..d0b1ab6c4523 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu_helper.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu_helper.c
@@ -603,21 +603,17 @@ int phm_irq_process(struct amdgpu_device *adev,
 			struct amdgpu_irq_src *source,
 			struct amdgpu_iv_entry *entry)
 {
+	struct pp_hwmgr *hwmgr = adev->powerplay.pp_handle;
 	uint32_t client_id = entry->client_id;
 	uint32_t src_id = entry->src_id;
 	if (client_id == AMDGPU_IRQ_CLIENTID_LEGACY) {
 		if (src_id == VISLANDS30_IV_SRCID_CG_TSS_THERMAL_LOW_TO_HIGH) {
-			dev_emerg(adev->dev, "ERROR: GPU over temperature range(SW CTF) detected!\n");
-			/*
-			 * SW CTF just occurred.
-			 * Try to do a graceful shutdown to prevent further damage.
-			 */
-			dev_emerg(adev->dev, "ERROR: System is going to shutdown due to GPU SW CTF!\n");
-			orderly_poweroff(true);
-		} else if (src_id == VISLANDS30_IV_SRCID_CG_TSS_THERMAL_HIGH_TO_LOW)
+			schedule_delayed_work(&hwmgr->swctf_delayed_work,
+					      msecs_to_jiffies(AMDGPU_SWCTF_EXTRA_DELAY));
+		} else if (src_id == VISLANDS30_IV_SRCID_CG_TSS_THERMAL_HIGH_TO_LOW) {
 			dev_emerg(adev->dev, "ERROR: GPU under temperature range detected!\n");
-		else if (src_id == VISLANDS30_IV_SRCID_GPIO_19) {
+		} else if (src_id == VISLANDS30_IV_SRCID_GPIO_19) {
 			dev_emerg(adev->dev, "ERROR: GPU HW Critical Temperature Fault(aka CTF) detected!\n");
 			/*
 			 * HW CTF just occurred. Shutdown to prevent further damage.
@@ -626,15 +622,10 @@ int phm_irq_process(struct amdgpu_device *adev,
 			orderly_poweroff(true);
 		}
 	} else if (client_id == SOC15_IH_CLIENTID_THM) {
-		if (src_id == 0) {
-			dev_emerg(adev->dev, "ERROR: GPU over temperature range(SW CTF) detected!\n");
-			/*
-			 * SW CTF just occurred.
-			 * Try to do a graceful shutdown to prevent further damage.
-			 */
-			dev_emerg(adev->dev, "ERROR: System is going to shutdown due to GPU SW CTF!\n");
-			orderly_poweroff(true);
-		} else
+		if (src_id == 0)
+			schedule_delayed_work(&hwmgr->swctf_delayed_work,
+					      msecs_to_jiffies(AMDGPU_SWCTF_EXTRA_DELAY));
+		else
 			dev_emerg(adev->dev, "ERROR: GPU under temperature range detected!\n");
 	} else if (client_id == SOC15_IH_CLIENTID_ROM_SMUIO) {
 		dev_emerg(adev->dev, "ERROR: GPU HW Critical Temperature Fault(aka CTF) detected!\n");
diff --git a/drivers/gpu/drm/amd/pm/powerplay/inc/hwmgr.h b/drivers/gpu/drm/amd/pm/powerplay/inc/hwmgr.h
index 5ce433e2c16a..ec10643edea3 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/inc/hwmgr.h
+++ b/drivers/gpu/drm/amd/pm/powerplay/inc/hwmgr.h
@@ -811,6 +811,8 @@ struct pp_hwmgr {
 	bool gfxoff_state_changed_by_workload;
 	uint32_t pstate_sclk_peak;
 	uint32_t pstate_mclk_peak;
+
+	struct delayed_work swctf_delayed_work;
 };
 int hwmgr_early_init(struct pp_hwmgr *hwmgr);
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index 6d90ab55cea3..d191ff52d4f0 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -24,6 +24,7 @@
 #include <linux/firmware.h>
 #include <linux/pci.h>
+#include <linux/reboot.h>
#include "amdgpu.h" #include "amdgpu_smu.h" @@ -1061,6 +1062,34 @@ static void smu_interrupt_work_fn(struct work_struct *work) smu->ppt_funcs->interrupt_work(smu); }
+static void smu_swctf_delayed_work_handler(struct work_struct *work)
+{
+	struct smu_context *smu =
+		container_of(work, struct smu_context, swctf_delayed_work.work);
+	struct smu_temperature_range *range =
+				&smu->thermal_range;
+	struct amdgpu_device *adev = smu->adev;
+	uint32_t hotspot_tmp, size;
+
+	/*
+	 * If the hotspot temperature is confirmed as below SW CTF setting point
+	 * after the delay enforced, nothing will be done.
+	 * Otherwise, a graceful shutdown will be performed to prevent further damage.
+	 */
+	if (range->software_shutdown_temp &&
+	    smu->ppt_funcs->read_sensor &&
+	    !smu->ppt_funcs->read_sensor(smu,
+					 AMDGPU_PP_SENSOR_HOTSPOT_TEMP,
+					 &hotspot_tmp,
+					 &size) &&
+	    hotspot_tmp / 1000 < range->software_shutdown_temp)
+		return;
+
+	dev_emerg(adev->dev, "ERROR: GPU over temperature range(SW CTF) detected!\n");
+	dev_emerg(adev->dev, "ERROR: System is going to shutdown due to GPU SW CTF!\n");
+	orderly_poweroff(true);
+}
+
 static int smu_sw_init(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
@@ -1109,6 +1138,9 @@ static int smu_sw_init(void *handle)
 		return ret;
 	}
+	INIT_DELAYED_WORK(&smu->swctf_delayed_work,
+			  smu_swctf_delayed_work_handler);
+
 	ret = smu_smc_table_sw_init(smu);
 	if (ret) {
 		dev_err(adev->dev, "Failed to sw init smc table!\n");
@@ -1581,6 +1613,8 @@ static int smu_smc_hw_cleanup(struct smu_context *smu)
 			return ret;
 	}
+	cancel_delayed_work_sync(&smu->swctf_delayed_work);
+
 	ret = smu_disable_dpms(smu);
 	if (ret) {
 		dev_err(adev->dev, "Fail to disable dpm features!\n");
diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
index 3bc4128a22ac..1ab77a6cdb65 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
@@ -573,6 +573,8 @@ struct smu_context
 	u32 debug_param_reg;
 	u32 debug_msg_reg;
 	u32 debug_resp_reg;
+
+	struct delayed_work swctf_delayed_work;
 };
 struct i2c_adapter;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
index ad5f6a15a1d7..d490b571c8ff 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
@@ -1438,13 +1438,8 @@ static int smu_v11_0_irq_process(struct amdgpu_device *adev,
 	if (client_id == SOC15_IH_CLIENTID_THM) {
 		switch (src_id) {
 		case THM_11_0__SRCID__THM_DIG_THERM_L2H:
-			dev_emerg(adev->dev, "ERROR: GPU over temperature range(SW CTF) detected!\n");
-			/*
-			 * SW CTF just occurred.
-			 * Try to do a graceful shutdown to prevent further damage.
-			 */
-			dev_emerg(adev->dev, "ERROR: System is going to shutdown due to GPU SW CTF!\n");
-			orderly_poweroff(true);
+			schedule_delayed_work(&smu->swctf_delayed_work,
+					      msecs_to_jiffies(AMDGPU_SWCTF_EXTRA_DELAY));
 			break;
 		case THM_11_0__SRCID__THM_DIG_THERM_H2L:
 			dev_emerg(adev->dev, "ERROR: GPU under temperature range detected\n");
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
index 47fafb1fa608..3104d4937909 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
@@ -1386,13 +1386,8 @@ static int smu_v13_0_irq_process(struct amdgpu_device *adev,
 	if (client_id == SOC15_IH_CLIENTID_THM) {
 		switch (src_id) {
 		case THM_11_0__SRCID__THM_DIG_THERM_L2H:
-			dev_emerg(adev->dev, "ERROR: GPU over temperature range(SW CTF) detected!\n");
-			/*
-			 * SW CTF just occurred.
-			 * Try to do a graceful shutdown to prevent further damage.
-			 */
-			dev_emerg(adev->dev, "ERROR: System is going to shutdown due to GPU SW CTF!\n");
-			orderly_poweroff(true);
+			schedule_delayed_work(&smu->swctf_delayed_work,
+					      msecs_to_jiffies(AMDGPU_SWCTF_EXTRA_DELAY));
 			break;
 		case THM_11_0__SRCID__THM_DIG_THERM_H2L:
 			dev_emerg(adev->dev, "ERROR: GPU under temperature range detected\n");
On Fri, Aug 11, 2023 at 11:40:27AM -0500, Mario Limonciello wrote:
> Users have been reporting that momentary GPU temperature fluctuations can
> trigger a software CTF (critical thermal fault) shutdown.
>
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1267
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2779
>
> This behavior was fixed in kernel 6.5, and this series backports the fix to
> the LTS kernel.
All now queued up, thanks.
greg k-h