From: Ce Sun cesun102@amd.com
[ Upstream commit d8442bcad0764c5613e9f8b2356f3e0a48327e20 ]
By polling, poll ACA bank count to ensure that valid ACA bank reg info can be obtained
v2: add corresponding delay before send msg to SMU to query mca bank info (Stanley)
v3: the loop cannot exit. (Thomas)
v4: remove amdgpu_aca_clear_bank_count. (Kevin)
v5: continuously inject ce. If a creation interruption occurs at this time, bank reg info will be lost. (Thomas) v5: each cycle is delayed by 100ms. (Tao)
Signed-off-by: Ce Sun cesun102@amd.com Reviewed-by: Hawking Zhang Hawking.Zhang@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com Signed-off-by: Sasha Levin sashal@kernel.org ---
LLM Generated explanations, may be completely bogus:
YES
Explanation - Fixes real bug: Prevents loss of ACA/MCA bank register info when poison consumption interrupts poison creation, which can drop error records and delay/skip bad-page retirement. The change adds explicit coordination between “creation” and “consumption” paths. - Scoped and minimal: Changes are confined to amdgpu RAS/UMC v12 logic; no uAPI changes or architectural rewrites. - Bounded behavior: Polling now has a clear, short timeout (about 1 second) to avoid hangs while ensuring valid ACA/MCA bank data is captured.
Key technical changes - Add explicit creation/consumption counters to gate polling completion: - New fields to track state: - `struct ras_ecc_log_info`: `de_queried_count`, `consumption_q_count` (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h:495, drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h:496). - `struct amdgpu_ras`: `atomic_t poison_consumption_count` (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h:568). - Creation path now waits until both sides are observed (or timeout): - `amdgpu_ras_poison_creation_handler()` resets both counters each cycle, polls via `amdgpu_ras_query_error_status_with_event()`, and exits early when both `de_queried_count` and `consumption_q_count` are non-zero; otherwise sleeps 100ms, up to 10 cycles (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3405, drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3423, drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3424, drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3426). - If any DEs were actually found (`de_queried_count`), schedule page- retirement work promptly (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3432). - Defines the polling budget as 10 cycles, each 100ms, by changing `MAX_UMC_POISON_POLLING_TIME_ASYNC` to 10 (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:125) and using `msleep(100)`. - Consumption path signals promptly: - When queuing poison consumption messages, increment `poison_consumption_count` to indicate pending consumption (drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c:255). - For UMC v12 bank scans, if the IP decode indicates it is a SMU consumption query (i.e., not UMC HWID/MCATYPE), increment `consumption_q_count` so the creation loop knows consumption was observed (drivers/gpu/drm/amd/amdgpu/umc_v12_0.c:541). - Thread sequencing and reset hygiene: - The page retirement thread now processes the creation loop while there are creation requests and stops early if a consumption event is pending (`!atomic_read(&con->poison_consumption_count)`) (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3531). - On GPU reset conditions, clear both `poison_creation_count` and `poison_consumption_count`, and flush/clear the FIFO, ensuring clean state and avoiding lost bank info across resets (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3548, drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3573). - Initialize `poison_consumption_count` on recovery init (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3680).
Why it meets stable criteria - Important bugfix: Prevents loss of RAS bank register info, ensuring accurate ECC error logging and timely bad-page retirement on affected AMD GPUs. - Low risk of regression: Changes are local to RAS/UMC v12 error handling, use bounded waits, add simple counters, and don’t alter external interfaces. - No architectural churn: Purely corrective synchronization and sequencing; no redesign or feature addition. - Performance impact is negligible: Only affects rare error paths, with short bounded waits.
Notes for backport - Target stable series that include ACA/UMC v12 poison handling; the patch relies on existing ACA/MCA decoding paths and `amdgpu_ras_query_error_status_with_event()`. - No userspace ABI impact; struct layout changes are internal to the driver.
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 50 +++++++++++-------------- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 5 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 1 + drivers/gpu/drm/amd/amdgpu/umc_v12_0.c | 5 ++- 4 files changed, 29 insertions(+), 32 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index 54909bcf181f3..893cae9813fbb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c @@ -122,7 +122,7 @@ const char *get_ras_block_str(struct ras_common_if *ras_block) /* typical ECC bad page rate is 1 bad page per 100MB VRAM */ #define RAS_BAD_PAGE_COVER (100 * 1024 * 1024ULL)
-#define MAX_UMC_POISON_POLLING_TIME_ASYNC 300 //ms +#define MAX_UMC_POISON_POLLING_TIME_ASYNC 10
#define AMDGPU_RAS_RETIRE_PAGE_INTERVAL 100 //ms
@@ -3239,7 +3239,7 @@ static void amdgpu_ras_ecc_log_init(struct ras_ecc_log_info *ecc_log)
INIT_RADIX_TREE(&ecc_log->de_page_tree, GFP_KERNEL); ecc_log->de_queried_count = 0; - ecc_log->prev_de_queried_count = 0; + ecc_log->consumption_q_count = 0; }
static void amdgpu_ras_ecc_log_fini(struct ras_ecc_log_info *ecc_log) @@ -3259,7 +3259,7 @@ static void amdgpu_ras_ecc_log_fini(struct ras_ecc_log_info *ecc_log)
mutex_destroy(&ecc_log->lock); ecc_log->de_queried_count = 0; - ecc_log->prev_de_queried_count = 0; + ecc_log->consumption_q_count = 0; }
static bool amdgpu_ras_schedule_retirement_dwork(struct amdgpu_ras *con, @@ -3309,47 +3309,34 @@ static int amdgpu_ras_poison_creation_handler(struct amdgpu_device *adev, int ret = 0; struct ras_ecc_log_info *ecc_log; struct ras_query_if info; - uint32_t timeout = 0; + u32 timeout = MAX_UMC_POISON_POLLING_TIME_ASYNC; struct amdgpu_ras *ras = amdgpu_ras_get_context(adev); - uint64_t de_queried_count; - uint32_t new_detect_count, total_detect_count; - uint32_t need_query_count = poison_creation_count; + u64 de_queried_count; + u64 consumption_q_count; enum ras_event_type type = RAS_EVENT_TYPE_POISON_CREATION;
memset(&info, 0, sizeof(info)); info.head.block = AMDGPU_RAS_BLOCK__UMC;
ecc_log = &ras->umc_ecc_log; - total_detect_count = 0; + ecc_log->de_queried_count = 0; + ecc_log->consumption_q_count = 0; + do { ret = amdgpu_ras_query_error_status_with_event(adev, &info, type); if (ret) return ret;
de_queried_count = ecc_log->de_queried_count; - if (de_queried_count > ecc_log->prev_de_queried_count) { - new_detect_count = de_queried_count - ecc_log->prev_de_queried_count; - ecc_log->prev_de_queried_count = de_queried_count; - timeout = 0; - } else { - new_detect_count = 0; - } + consumption_q_count = ecc_log->consumption_q_count;
- if (new_detect_count) { - total_detect_count += new_detect_count; - } else { - if (!timeout && need_query_count) - timeout = MAX_UMC_POISON_POLLING_TIME_ASYNC; + if (de_queried_count && consumption_q_count) + break;
- if (timeout) { - if (!--timeout) - break; - msleep(1); - } - } - } while (total_detect_count < need_query_count); + msleep(100); + } while (--timeout);
- if (total_detect_count) + if (de_queried_count) schedule_delayed_work(&ras->page_retirement_dwork, 0);
if (amdgpu_ras_is_rma(adev) && atomic_cmpxchg(&ras->rma_in_recovery, 0, 1) == 0) @@ -3446,7 +3433,8 @@ static int amdgpu_ras_page_retirement_thread(void *param) atomic_sub(poison_creation_count, &con->poison_creation_count); atomic_sub(poison_creation_count, &con->page_retirement_req_cnt); } - } while (atomic_read(&con->poison_creation_count)); + } while (atomic_read(&con->poison_creation_count) && + !atomic_read(&con->poison_consumption_count));
if (ret != -EIO) { msg_count = kfifo_len(&con->poison_fifo); @@ -3463,6 +3451,7 @@ static int amdgpu_ras_page_retirement_thread(void *param) /* gpu mode-1 reset is ongoing or just completed ras mode-1 reset */ /* Clear poison creation request */ atomic_set(&con->poison_creation_count, 0); + atomic_set(&con->poison_consumption_count, 0);
/* Clear poison fifo */ amdgpu_ras_clear_poison_fifo(adev); @@ -3487,6 +3476,8 @@ static int amdgpu_ras_page_retirement_thread(void *param) atomic_sub(msg_count, &con->page_retirement_req_cnt); }
+ atomic_set(&con->poison_consumption_count, 0); + /* Wake up work to save bad pages to eeprom */ schedule_delayed_work(&con->page_retirement_dwork, 0); } @@ -3590,6 +3581,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev, bool init_bp_info) init_waitqueue_head(&con->page_retirement_wq); atomic_set(&con->page_retirement_req_cnt, 0); atomic_set(&con->poison_creation_count, 0); + atomic_set(&con->poison_consumption_count, 0); con->page_retirement_thread = kthread_run(amdgpu_ras_page_retirement_thread, adev, "umc_page_retirement"); if (IS_ERR(con->page_retirement_thread)) { diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h index 699953c02649f..96cb62a44a35b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h @@ -492,8 +492,8 @@ struct ras_ecc_err { struct ras_ecc_log_info { struct mutex lock; struct radix_tree_root de_page_tree; - uint64_t de_queried_count; - uint64_t prev_de_queried_count; + uint64_t de_queried_count; + uint64_t consumption_q_count; };
struct amdgpu_ras { @@ -558,6 +558,7 @@ struct amdgpu_ras { struct mutex page_retirement_lock; atomic_t page_retirement_req_cnt; atomic_t poison_creation_count; + atomic_t poison_consumption_count; struct mutex page_rsv_lock; DECLARE_KFIFO(poison_fifo, struct ras_poison_msg, 128); struct ras_ecc_log_info umc_ecc_log; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c index c92b8794aa73d..2e039fb778ea8 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c @@ -252,6 +252,7 @@ int amdgpu_umc_pasid_poison_handler(struct amdgpu_device *adev, block, pasid, pasid_fn, data, reset); if (!ret) { atomic_inc(&con->page_retirement_req_cnt); + atomic_inc(&con->poison_consumption_count); wake_up(&con->page_retirement_wq); } } diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c index e590cbdd8de96..8dc32787d6250 100644 --- a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c +++ b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c @@ -536,8 +536,11 @@ static int umc_v12_0_update_ecc_status(struct amdgpu_device *adev, hwid = REG_GET_FIELD(ipid, MCMP1_IPIDT0, HardwareID); mcatype = REG_GET_FIELD(ipid, MCMP1_IPIDT0, McaType);
- if ((hwid != MCA_UMC_HWID_V12_0) || (mcatype != MCA_UMC_MCATYPE_V12_0)) + /* The IP block decode of consumption is SMU */ + if (hwid != MCA_UMC_HWID_V12_0 || mcatype != MCA_UMC_MCATYPE_V12_0) { + con->umc_ecc_log.consumption_q_count++; return 0; + }
if (!status) return 0;