From: Tao Zhou <tao.zhou1@amd.com>
[ Upstream commit 2b17c240e8cd9ac61d3c82277fbed27edad7f002 ]
Exclude invalid bad pages.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
- What it fixes
  - Centralizes and strengthens address validation for RAS bad pages, so invalid addresses are rejected early and consistently and invalid EEPROM entries are excluded. Previously, validity checks were duplicated at some call sites and missing at others.
  - Prevents invalid “bad pages” from being considered or processed (e.g., loaded from EEPROM or used in injection), which could lead to incorrect reservations or error-injection behavior.
- Key code changes
  - Return type change and range check: changes `amdgpu_ras_check_bad_page_unlock()` and `amdgpu_ras_check_bad_page()` from `bool` to `int` and adds an explicit address range check. The functions now return:
    - `-EINVAL` for invalid addresses
    - `1` if the page is already bad
    - `0` otherwise
  - In your tree, the current declarations are `bool` (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:137, drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:139) and the definitions are `bool` with no range check (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2777, drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2796).
  - The added range check uses `adev->gmc.mc_vram_size` and `RAS_UMC_INJECT_ADDR_LIMIT` (already defined in your tree at drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:110), so it is aligned with the existing constraints elsewhere in the file. A caller-side sketch of the new convention follows this list.
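  For illustration, a minimal sketch of how a caller consumes the new tri-state
  return value (the helper name example_reserve is hypothetical; the real call
  sites appear in the diff below):

      /* Post-patch convention of amdgpu_ras_check_bad_page():
       *   -EINVAL  address fails the mc_vram_size / RAS_UMC_INJECT_ADDR_LIMIT check
       *   1        page is already recorded as bad
       *   0        valid address, not yet recorded
       */
      static int example_reserve(struct amdgpu_device *adev, uint64_t address)
      {
              int ret = amdgpu_ras_check_bad_page(adev, address);

              if (ret == -EINVAL)
                      return -EINVAL; /* reject invalid input address */
              else if (ret == 1)
                      return 0;       /* already marked bad, nothing to do */

              /* ... proceed to reserve/record the page ... */
              return 0;
      }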
- Reserve page path
  - `amdgpu_reserve_page_direct()` currently performs its own invalid-address check and separately checks whether the page was already marked bad (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:164–189). The patch replaces these ad hoc checks with a single call to the updated `amdgpu_ras_check_bad_page()` and branches on its return value. This is behaviorally equivalent for both valid and invalid addresses, but more robust and less error-prone.
- Debugfs inject path
  - In `amdgpu_ras_debugfs_ctrl_write()` “case 2” (inject), your tree rejects invalid addresses for all blocks (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:508–518) and then prohibits UMC injection into already-bad pages (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:522–528).
  - The patch refines this:
    - Only UMC injections are gated by address-range validity (via the updated `amdgpu_ras_check_bad_page()`). This is correct because the address field is meaningful for UMC but not necessarily used by other blocks.
    - If `amdgpu_ras_check_bad_page()` returns `-EINVAL`, the driver warns about invalid input; if it returns `1`, it warns about an already-bad page. This prevents false “invalid address” rejections for non-UMC blocks while preserving the correct UMC validations.
- Excluding invalid entries when loading/saving bad pages
  - `amdgpu_ras_add_bad_pages()` uses `amdgpu_ras_check_bad_page_unlock()` to skip duplicates (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2660–2704). With the new semantics, a return of `-EINVAL` is nonzero and is therefore treated as “skip”, effectively filtering out invalid entries sourced from EEPROM; see the sketch below. This matches the commit message “Exclude invalid bad pages.”
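  A sketch of how that filtering falls out of the existing loop (the loop shape
  below is paraphrased from `amdgpu_ras_add_bad_pages()`, not quoted verbatim
  from your tree):

      /* The unlock variant now returns -EINVAL for out-of-range addresses,
       * so the "nonzero means skip" test silently drops invalid EEPROM
       * records along with duplicates.
       */
      for (i = 0; i < pages; i++) {
              if (amdgpu_ras_check_bad_page_unlock(con,
                              bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT))
                      continue;       /* duplicate (1) or invalid (-EINVAL) */

              /* ... append bps[i] to con->eh_data ... */
      }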
- Why this is a good stable backport
  - Small, self-contained change in a single file (`amdgpu_ras.c`), with no architectural changes.
  - Fixes real-world correctness/safety issues:
    - Invalid addresses are consistently rejected, avoiding incorrect reservations or injections.
    - Invalid EEPROM-stored entries are ignored instead of processed.
    - Non-UMC injection is no longer spuriously blocked by a VRAM address check that is irrelevant for those blocks.
  - All modified functions are `static` and localized, minimizing regression risk.
  - The debugfs semantics (not a stable ABI) are improved in a way that reduces surprising failures, and core RAS boot-time behavior improves by excluding invalid entries.
- Dependencies and backport notes
  - The upstream context in your diff shows a newer signature for `amdgpu_ras_add_bad_pages()` (an extra boolean parameter) than in your tree (drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2665). This commit itself does not depend on that signature change; it can be adapted trivially (see the sketch after this list) by:
    - Changing the two `bool` declarations/definitions to `int` and adding the range check in `amdgpu_ras_check_bad_page_unlock()`.
    - Updating the callers to handle `-EINVAL` vs. `1` vs. `0` as shown in the diff, while keeping your existing `amdgpu_ras_add_bad_pages()` signature.
  - No impact beyond the AMDGPU RAS paths; no user ABI changes.
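  Concretely, the adapted prototypes in your tree would end up as below (a
  sketch of the trivial adaptation described above, assuming your tree's
  current `bool` declarations):

      /* before (your tree) */
      static bool amdgpu_ras_check_bad_page_unlock(struct amdgpu_ras *con,
                                      uint64_t addr);
      static bool amdgpu_ras_check_bad_page(struct amdgpu_device *adev,
                                      uint64_t addr);

      /* after (adapted backport): same names, int return; the range check
       * is added at the top of amdgpu_ras_check_bad_page_unlock(), and the
       * existing amdgpu_ras_add_bad_pages() signature is kept unchanged.
       */
      static int amdgpu_ras_check_bad_page_unlock(struct amdgpu_ras *con,
                                      uint64_t addr);
      static int amdgpu_ras_check_bad_page(struct amdgpu_device *adev,
                                      uint64_t addr);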
- Risk assessment
  - Low: the types change from `bool` to `int` on `static` functions, and all call sites in this file are updated. Callers that test the result in a boolean context will treat `-EINVAL` as true (nonzero), which is intentionally leveraged to skip invalid entries when loading bad pages.
  - Improves robustness and reduces incorrect behavior; unlikely to cause regressions in supported stable trees.
Given the above, this is a clear bugfix with minimal risk and targeted scope, and it improves correctness and safety. It meets stable criteria and should be backported.
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 58 ++++++++++++-------------
 1 file changed, 28 insertions(+), 30 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 540817e296da6..c88123302a071 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -136,9 +136,9 @@ enum amdgpu_ras_retire_page_reservation {
 
 atomic_t amdgpu_ras_in_intr = ATOMIC_INIT(0);
 
-static bool amdgpu_ras_check_bad_page_unlock(struct amdgpu_ras *con,
+static int amdgpu_ras_check_bad_page_unlock(struct amdgpu_ras *con,
 				uint64_t addr);
-static bool amdgpu_ras_check_bad_page(struct amdgpu_device *adev,
+static int amdgpu_ras_check_bad_page(struct amdgpu_device *adev,
 				uint64_t addr);
 #ifdef CONFIG_X86_MCE_AMD
 static void amdgpu_register_bad_pages_mca_notifier(struct amdgpu_device *adev);
@@ -169,18 +169,16 @@ static int amdgpu_reserve_page_direct(struct amdgpu_device *adev, uint64_t addre
 	struct eeprom_table_record err_rec;
 	int ret;
 
-	if ((address >= adev->gmc.mc_vram_size) ||
-	    (address >= RAS_UMC_INJECT_ADDR_LIMIT)) {
+	ret = amdgpu_ras_check_bad_page(adev, address);
+	if (ret == -EINVAL) {
 		dev_warn(adev->dev,
-		         "RAS WARN: input address 0x%llx is invalid.\n",
-		         address);
+			 "RAS WARN: input address 0x%llx is invalid.\n",
+			 address);
 		return -EINVAL;
-	}
-
-	if (amdgpu_ras_check_bad_page(adev, address)) {
+	} else if (ret == 1) {
 		dev_warn(adev->dev,
-			 "RAS WARN: 0x%llx has already been marked as bad page!\n",
-			 address);
+			 "RAS WARN: 0x%llx has already been marked as bad page!\n",
+			 address);
 		return 0;
 	}
 
@@ -513,22 +511,16 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f,
 		ret = amdgpu_ras_feature_enable(adev, &data.head, 1);
 		break;
 	case 2:
-		if ((data.inject.address >= adev->gmc.mc_vram_size &&
-		    adev->gmc.mc_vram_size) ||
-		    (data.inject.address >= RAS_UMC_INJECT_ADDR_LIMIT)) {
-			dev_warn(adev->dev, "RAS WARN: input address "
-					"0x%llx is invalid.",
+		/* umc ce/ue error injection for a bad page is not allowed */
+		if (data.head.block == AMDGPU_RAS_BLOCK__UMC)
+			ret = amdgpu_ras_check_bad_page(adev, data.inject.address);
+		if (ret == -EINVAL) {
+			dev_warn(adev->dev, "RAS WARN: input address 0x%llx is invalid.",
 				data.inject.address);
-			ret = -EINVAL;
 			break;
-		}
-
-		/* umc ce/ue error injection for a bad page is not allowed */
-		if ((data.head.block == AMDGPU_RAS_BLOCK__UMC) &&
-		    amdgpu_ras_check_bad_page(adev, data.inject.address)) {
-			dev_warn(adev->dev, "RAS WARN: inject: 0x%llx has "
-					"already been marked as bad!\n",
-					data.inject.address);
+		} else if (ret == 1) {
+			dev_warn(adev->dev, "RAS WARN: inject: 0x%llx has already been marked as bad!\n",
+				data.inject.address);
 			break;
 		}
 
@@ -3134,18 +3126,24 @@ static int amdgpu_ras_load_bad_pages(struct amdgpu_device *adev)
 	return ret;
 }
 
-static bool amdgpu_ras_check_bad_page_unlock(struct amdgpu_ras *con,
+static int amdgpu_ras_check_bad_page_unlock(struct amdgpu_ras *con,
 				uint64_t addr)
 {
 	struct ras_err_handler_data *data = con->eh_data;
+	struct amdgpu_device *adev = con->adev;
 	int i;
 
+	if ((addr >= adev->gmc.mc_vram_size &&
+	    adev->gmc.mc_vram_size) ||
+	    (addr >= RAS_UMC_INJECT_ADDR_LIMIT))
+		return -EINVAL;
+
 	addr >>= AMDGPU_GPU_PAGE_SHIFT;
 	for (i = 0; i < data->count; i++)
 		if (addr == data->bps[i].retired_page)
-			return true;
+			return 1;
 
-	return false;
+	return 0;
 }
 
 /*
@@ -3153,11 +3151,11 @@ static bool amdgpu_ras_check_bad_page_unlock(struct amdgpu_ras *con,
  *
  * Note: this check is only for umc block
  */
-static bool amdgpu_ras_check_bad_page(struct amdgpu_device *adev,
+static int amdgpu_ras_check_bad_page(struct amdgpu_device *adev,
 				uint64_t addr)
 {
 	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
-	bool ret = false;
+	int ret = 0;
 
 	if (!con || !con->eh_data)
 		return ret;