[PATCH AUTOSEL 6.17] drm/amdgpu: Correct the loss of aca bank reg info

25 Oct 2025

From: Ce Sun cesun102@amd.com
[ Upstream commit d8442bcad0764c5613e9f8b2356f3e0a48327e20 ]
By polling, poll ACA bank count to ensure that valid
ACA bank reg info can be obtained
v2: add corresponding delay before send msg to SMU to query mca bank info
(Stanley)
v3: the loop cannot exit. (Thomas)
v4: remove amdgpu_aca_clear_bank_count. (Kevin)
v5: continuously inject ce. If a creation interruption
occurs at this time, bank reg info will be lost. (Thomas)
v5: each cycle is delayed by 100ms. (Tao)
Signed-off-by: Ce Sun cesun102@amd.com
Reviewed-by: Hawking Zhang Hawking.Zhang@amd.com
Signed-off-by: Alex Deucher alexander.deucher@amd.com
Signed-off-by: Sasha Levin sashal@kernel.org
---
LLM Generated explanations, may be completely bogus:
YES
Explanation
- Fixes real bug: Prevents loss of ACA/MCA bank register info when
  poison consumption interrupts poison creation, which can drop error
  records and delay/skip bad-page retirement. The change adds explicit
  coordination between “creation” and “consumption” paths.
- Scoped and minimal: Changes are confined to amdgpu RAS/UMC v12 logic;
  no uAPI changes or architectural rewrites.
- Bounded behavior: Polling now has a clear, short timeout (about 1
  second) to avoid hangs while ensuring valid ACA/MCA bank data is
  captured.
Key technical changes
- Add explicit creation/consumption counters to gate polling completion:
  - New fields to track state:
    - `struct ras_ecc_log_info`: `de_queried_count`,
      `consumption_q_count`
      (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h:495,
      drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h:496).
    - `struct amdgpu_ras`: `atomic_t poison_consumption_count`
      (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h:568).
- Creation path now waits until both sides are observed (or timeout):
  - `amdgpu_ras_poison_creation_handler()` resets both counters each
    cycle, polls via `amdgpu_ras_query_error_status_with_event()`, and
    exits early when both `de_queried_count` and `consumption_q_count`
    are non-zero; otherwise sleeps 100ms, up to 10 cycles
    (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3405,
    drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3423,
    drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3424,
    drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3426).
  - If any DEs were actually found (`de_queried_count`), schedule page-
    retirement work promptly
    (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3432).
  - Defines the polling budget as 10 cycles, each 100ms, by changing
    `MAX_UMC_POISON_POLLING_TIME_ASYNC` to 10
    (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:125) and using
    `msleep(100)`.
- Consumption path signals promptly:
  - When queuing poison consumption messages, increment
    `poison_consumption_count` to indicate pending consumption
    (drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c:255).
  - For UMC v12 bank scans, if the IP decode indicates it is a SMU
    consumption query (i.e., not UMC HWID/MCATYPE), increment
    `consumption_q_count` so the creation loop knows consumption was
    observed (drivers/gpu/drm/amd/amdgpu/umc_v12_0.c:541).
- Thread sequencing and reset hygiene:
  - The page retirement thread now processes the creation loop while
    there are creation requests and stops early if a consumption event
    is pending (`!atomic_read(&con->poison_consumption_count)`)
    (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3531).
  - On GPU reset conditions, clear both `poison_creation_count` and
    `poison_consumption_count`, and flush/clear the FIFO, ensuring clean
    state and avoiding lost bank info across resets
    (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3548,
    drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3573).
  - Initialize `poison_consumption_count` on recovery init
    (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:3680).
Why it meets stable criteria
- Important bugfix: Prevents loss of RAS bank register info, ensuring
  accurate ECC error logging and timely bad-page retirement on affected
  AMD GPUs.
- Low risk of regression: Changes are local to RAS/UMC v12 error
  handling, use bounded waits, add simple counters, and don’t alter
  external interfaces.
- No architectural churn: Purely corrective synchronization and
  sequencing; no redesign or feature addition.
- Performance impact is negligible: Only affects rare error paths, with
  short bounded waits.
Notes for backport
- Target stable series that include ACA/UMC v12 poison handling; the
  patch relies on existing ACA/MCA decoding paths and
  `amdgpu_ras_query_error_status_with_event()`.
- No userspace ABI impact; struct layout changes are internal to the
  driver.
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 50 +++++++++++--------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  5 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c |  1 +
 drivers/gpu/drm/amd/amdgpu/umc_v12_0.c  |  5 ++-
 4 files changed, 29 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 54909bcf181f3..893cae9813fbb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -122,7 +122,7 @@ const char *get_ras_block_str(struct ras_common_if *ras_block)
 /* typical ECC bad page rate is 1 bad page per 100MB VRAM */
 #define RAS_BAD_PAGE_COVER              (100 * 1024 * 1024ULL)
-#define MAX_UMC_POISON_POLLING_TIME_ASYNC  300  //ms
+#define MAX_UMC_POISON_POLLING_TIME_ASYNC  10
#define AMDGPU_RAS_RETIRE_PAGE_INTERVAL 100  //ms
@@ -3239,7 +3239,7 @@ static void amdgpu_ras_ecc_log_init(struct ras_ecc_log_info *ecc_log)
INIT_RADIX_TREE(&ecc_log->de_page_tree, GFP_KERNEL);
    ecc_log->de_queried_count = 0;
-	ecc_log->prev_de_queried_count = 0;
+	ecc_log->consumption_q_count = 0;
 }
static void amdgpu_ras_ecc_log_fini(struct ras_ecc_log_info *ecc_log)
@@ -3259,7 +3259,7 @@ static void amdgpu_ras_ecc_log_fini(struct ras_ecc_log_info *ecc_log)
mutex_destroy(&ecc_log->lock);
    ecc_log->de_queried_count = 0;
-	ecc_log->prev_de_queried_count = 0;
+	ecc_log->consumption_q_count = 0;
 }
static bool amdgpu_ras_schedule_retirement_dwork(struct amdgpu_ras *con,
@@ -3309,47 +3309,34 @@ static int amdgpu_ras_poison_creation_handler(struct amdgpu_device *adev,
    int ret = 0;
    struct ras_ecc_log_info *ecc_log;
    struct ras_query_if info;
-	uint32_t timeout = 0;
+	u32 timeout = MAX_UMC_POISON_POLLING_TIME_ASYNC;
    struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
-	uint64_t de_queried_count;
-	uint32_t new_detect_count, total_detect_count;
-	uint32_t need_query_count = poison_creation_count;
+	u64 de_queried_count;
+	u64 consumption_q_count;
    enum ras_event_type type = RAS_EVENT_TYPE_POISON_CREATION;
memset(&info, 0, sizeof(info));
    info.head.block = AMDGPU_RAS_BLOCK__UMC;
ecc_log = &ras->umc_ecc_log;
-	total_detect_count = 0;
+	ecc_log->de_queried_count = 0;
+	ecc_log->consumption_q_count = 0;
+
    do {
    	ret = amdgpu_ras_query_error_status_with_event(adev, &info, type);
    	if (ret)
    		return ret;
de_queried_count = ecc_log->de_queried_count;
-		if (de_queried_count > ecc_log->prev_de_queried_count) {
-			new_detect_count = de_queried_count - ecc_log->prev_de_queried_count;
-			ecc_log->prev_de_queried_count = de_queried_count;
-			timeout = 0;
-		} else {
-			new_detect_count = 0;
-		}
+		consumption_q_count = ecc_log->consumption_q_count;
-		if (new_detect_count) {
-			total_detect_count += new_detect_count;
-		} else {
-			if (!timeout && need_query_count)
-				timeout = MAX_UMC_POISON_POLLING_TIME_ASYNC;
+		if (de_queried_count && consumption_q_count)
+			break;
-			if (timeout) {
-				if (!--timeout)
-					break;
-				msleep(1);
-			}
-		}
-	} while (total_detect_count < need_query_count);
+		msleep(100);
+	} while (--timeout);
-	if (total_detect_count)
+	if (de_queried_count)
    	schedule_delayed_work(&ras->page_retirement_dwork, 0);
if (amdgpu_ras_is_rma(adev) && atomic_cmpxchg(&ras->rma_in_recovery, 0, 1) == 0)
@@ -3446,7 +3433,8 @@ static int amdgpu_ras_page_retirement_thread(void *param)
    			atomic_sub(poison_creation_count, &con->poison_creation_count);
    			atomic_sub(poison_creation_count, &con->page_retirement_req_cnt);
    		}
-		} while (atomic_read(&con->poison_creation_count));
+		} while (atomic_read(&con->poison_creation_count) &&
+			!atomic_read(&con->poison_consumption_count));
if (ret != -EIO) {
    		msg_count = kfifo_len(&con->poison_fifo);
@@ -3463,6 +3451,7 @@ static int amdgpu_ras_page_retirement_thread(void *param)
    		/* gpu mode-1 reset is ongoing or just completed ras mode-1 reset */
    		/* Clear poison creation request */
    		atomic_set(&con->poison_creation_count, 0);
+			atomic_set(&con->poison_consumption_count, 0);
/* Clear poison fifo */
    		amdgpu_ras_clear_poison_fifo(adev);
@@ -3487,6 +3476,8 @@ static int amdgpu_ras_page_retirement_thread(void *param)
    			atomic_sub(msg_count, &con->page_retirement_req_cnt);
    		}
+			atomic_set(&con->poison_consumption_count, 0);
+
    		/* Wake up work to save bad pages to eeprom */
    		schedule_delayed_work(&con->page_retirement_dwork, 0);
    	}
@@ -3590,6 +3581,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev, bool init_bp_info)
    init_waitqueue_head(&con->page_retirement_wq);
    atomic_set(&con->page_retirement_req_cnt, 0);
    atomic_set(&con->poison_creation_count, 0);
+	atomic_set(&con->poison_consumption_count, 0);
    con->page_retirement_thread =
    	kthread_run(amdgpu_ras_page_retirement_thread, adev, "umc_page_retirement");
    if (IS_ERR(con->page_retirement_thread)) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 699953c02649f..96cb62a44a35b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -492,8 +492,8 @@ struct ras_ecc_err {
 struct ras_ecc_log_info {
    struct mutex lock;
    struct radix_tree_root de_page_tree;
-	uint64_t	de_queried_count;
-	uint64_t	prev_de_queried_count;
+	uint64_t de_queried_count;
+	uint64_t consumption_q_count;
 };
struct amdgpu_ras {
@@ -558,6 +558,7 @@ struct amdgpu_ras {
    struct mutex page_retirement_lock;
    atomic_t page_retirement_req_cnt;
    atomic_t poison_creation_count;
+	atomic_t poison_consumption_count;
    struct mutex page_rsv_lock;
    DECLARE_KFIFO(poison_fifo, struct ras_poison_msg, 128);
    struct ras_ecc_log_info  umc_ecc_log;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index c92b8794aa73d..2e039fb778ea8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -252,6 +252,7 @@ int amdgpu_umc_pasid_poison_handler(struct amdgpu_device *adev,
    			block, pasid, pasid_fn, data, reset);
    		if (!ret) {
    			atomic_inc(&con->page_retirement_req_cnt);
+				atomic_inc(&con->poison_consumption_count);
    			wake_up(&con->page_retirement_wq);
    		}
    	}
diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
index e590cbdd8de96..8dc32787d6250 100644
--- a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
@@ -536,8 +536,11 @@ static int umc_v12_0_update_ecc_status(struct amdgpu_device *adev,
    hwid = REG_GET_FIELD(ipid, MCMP1_IPIDT0, HardwareID);
    mcatype = REG_GET_FIELD(ipid, MCMP1_IPIDT0, McaType);
-	if ((hwid != MCA_UMC_HWID_V12_0) || (mcatype != MCA_UMC_MCATYPE_V12_0))
+	/* The IP block decode of consumption is SMU */
+	if (hwid != MCA_UMC_HWID_V12_0 || mcatype != MCA_UMC_MCATYPE_V12_0) {
+		con->umc_ecc_log.consumption_q_count++;
    	return 0;
+	}
if (!status)
    	return 0;
-- 
2.51.0



    

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

[PATCH AUTOSEL 6.17] drm/amdgpu: Correct the loss of aca bank reg info