From: John Harrison <John.C.Harrison@Intel.com>
[ Upstream commit 45fbb51050e72723c2bdcedc1ce32305256c70ed ]
The GuC load process will abort if certain status codes (which are indicative of a fatal error) are reported. Otherwise, it keeps waiting until the 'success' code is returned. New error codes have been added in recent GuC releases, so add support for aborting on those as well.
v2: Shuffle HWCONFIG_START to the front of the switch to keep the ordering as per the enum define for clarity (review feedback by Jonathan). Also add a description for the basic 'invalid init data' code which was missing.
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Link: https://lore.kernel.org/r/20250726024337.4056272-1-John.C.Harrison@Intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
Why this is a good stable backport
- Fixes real-world hangs/timeouts: New GuC firmware versions can report additional fatal load status codes. Without this patch, the Xe driver may continue waiting until the generic timeout, causing long delays and poorer diagnostics. Recognizing these as terminal failures is a correctness and robustness fix, not a feature.
- Small and contained: Changes are limited to two Xe files, only touching enums and switch cases that read GuC status. No architectural changes, no API/UAPI changes, no behavior change unless the new error codes are actually returned.
- Forward-compatibility with newer GuC: Distros often update GuC via linux-firmware independently of the kernel. This patch keeps older kernels robust when paired with newer GuC blobs.
- Low regression risk: Older GuC won’t emit the new codes, so behavior is unchanged there. New codes are explicitly fatal, so aborting earlier is the correct action. Additional logging improves triage.
What changes and why they matter
- Add new GuC load error codes in the ABI header
  - drivers/gpu/drm/xe/abi/guc_errors_abi.h:49 defines `enum xe_guc_load_status`. This patch adds:
    - `XE_GUC_LOAD_STATUS_BOOTROM_VERSION_MISMATCH = 0x08` (fatal)
    - `XE_GUC_LOAD_STATUS_KLV_WORKAROUND_INIT_ERROR = 0x75` (fatal)
    - `XE_GUC_LOAD_STATUS_INVALID_FTR_FLAG = 0x76` (fatal)
  - In current tree, the relevant region is at drivers/gpu/drm/xe/abi/guc_errors_abi.h:49–72. Adding these entries fills previously unused values (0x08, 0x75, 0x76) and keeps them in the “invalid init data” range where appropriate, preserving ordering and ABI clarity.
- Treat the new codes as terminal failures in the load state machine
  - drivers/gpu/drm/xe/xe_guc.c:517 `guc_load_done()` is the terminal-state detector for the load loop.
  - Existing fatal cases are in the switch at drivers/gpu/drm/xe/xe_guc.c:526–535.
  - The patch adds the new codes to this fatal set, so `guc_load_done()` returns -1 immediately instead of waiting for a timeout. This prevents long waits and aligns behavior with the intended semantics of these GuC codes.
- Improve diagnostics for new failure modes during load
  - drivers/gpu/drm/xe/xe_guc.c:593 `guc_wait_ucode()` logs the reason for failure.
  - New message cases are added to the `ukernel` switch (today at drivers/gpu/drm/xe/xe_guc.c:672–685):
    - The logging case for `HWCONFIG_START` is moved to the front for clarity; its message is unchanged (“still extracting hwconfig table.”).
    - New diagnostics for:
      - `INIT_DATA_INVALID`: “illegal init/ADS data”
      - `KLV_WORKAROUND_INIT_ERROR`: “illegal workaround KLV data”
      - `INVALID_FTR_FLAG`: “illegal feature flag specified”
  - These improve visibility into what went wrong without altering control flow beyond early abort on fatal codes.
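The classification and logging described above can be sketched as a small stand-alone C mock. This is only an illustration, not the driver code: the `mock_*` names are hypothetical, the numeric values are the subset visible in the enum hunk of this patch, and returning 1 for the READY code is an assumption made here (the diff only shows the fatal cases returning -1).

```c
#include <stddef.h>
#include <stdint.h>

/* Subset of load status values, copied from the enum hunk in this patch. */
enum mock_guc_load_status {
	MOCK_STATUS_HWCONFIG_START                 = 0x05,
	MOCK_STATUS_HWCONFIG_ERROR                 = 0x07,
	MOCK_STATUS_BOOTROM_VERSION_MISMATCH       = 0x08, /* new */
	MOCK_STATUS_MPU_DATA_INVALID               = 0x73,
	MOCK_STATUS_INIT_MMIO_SAVE_RESTORE_INVALID = 0x74,
	MOCK_STATUS_KLV_WORKAROUND_INIT_ERROR      = 0x75, /* new */
	MOCK_STATUS_INVALID_FTR_FLAG               = 0x76, /* new */
	MOCK_STATUS_READY                          = 0xF0,
};

/*
 * Hypothetical stand-in for the terminal-state check:
 *  -1 : fatal code, stop polling now
 *   1 : load finished (assumed convention for READY)
 *   0 : not terminal, keep polling until the timeout
 */
int mock_load_done(uint32_t code)
{
	switch (code) {
	case MOCK_STATUS_READY:
		return 1;
	case MOCK_STATUS_HWCONFIG_ERROR:
	case MOCK_STATUS_BOOTROM_VERSION_MISMATCH:
	case MOCK_STATUS_MPU_DATA_INVALID:
	case MOCK_STATUS_INIT_MMIO_SAVE_RESTORE_INVALID:
	case MOCK_STATUS_KLV_WORKAROUND_INIT_ERROR:
	case MOCK_STATUS_INVALID_FTR_FLAG:
		return -1;
	}
	return 0;
}

/* Failure-path message for a code, or NULL if no specific message exists. */
const char *mock_load_reason(uint32_t code)
{
	switch (code) {
	case MOCK_STATUS_HWCONFIG_START:
		return "still extracting hwconfig table.";
	case MOCK_STATUS_INIT_MMIO_SAVE_RESTORE_INVALID:
		return "illegal register in save/restore workaround list";
	case MOCK_STATUS_KLV_WORKAROUND_INIT_ERROR:
		return "illegal workaround KLV data";
	case MOCK_STATUS_INVALID_FTR_FLAG:
		return "illegal feature flag specified";
	}
	return NULL;
}
```

In this sketch a poll loop would stop as soon as `mock_load_done()` returns non-zero. Before the patch, 0x08, 0x75 and 0x76 would have fallen through to 0 and been polled until the timeout. `HWCONFIG_START` is not terminal but still has a message, because it is reported when the wait gives up while the firmware is in that state.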
Cross-check with i915 (parity and precedent)
- i915 already handles one of these newer codes:
  - `INTEL_GUC_LOAD_STATUS_KLV_WORKAROUND_INIT_ERROR` is defined and handled in i915 (drivers/gpu/drm/i915/gt/uc/abi/guc_errors_abi.h:24 and :39; drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c:118, 245), confirming this class of additions is standard and low risk.
- Bringing Xe up to parity on load error handling is consistent with upstream direction and improves stability for GuC firmware evolution.
Stable criteria assessment
- Bug fix that affects users: Yes, it avoids long waits and wedges, with clearer diagnostics when GuC reports new fatal statuses.
- Minimal and contained: Yes, a handful of enum entries and switch cases in two Xe files.
- No architectural changes: Correct, only error-code recognition and messaging.
- Critical subsystem: It’s a GPU driver; impact is localized to GuC bring-up, not core kernel.
- Explicit stable tags: Not present, but the change is a standard, low-risk, forward-compat fix consistent with stable rules.
- Dependencies: None apparent; the new constants are self-contained.

Note: in some branches the header’s response enum is named `xe_guc_response_status` (drivers/gpu/drm/xe/abi/guc_errors_abi.h:9), not `xe_guc_response` as in the posted diff context. This patch does not alter that enum and the backport simply adds entries to `xe_guc_load_status`, so this naming difference does not block the backport.
Potential risks and why they’re acceptable
- Earlier abort on these statuses vs. timing out: That is intended; these codes are designated fatal by GuC. For older GuC which never emit them, behavior is unchanged.
- No ABI or userspace exposure: The enums are internal to the driver/firmware interface.
Conclusion
- This is a targeted robustness fix for GuC load error handling, consistent with established patterns in i915, with minimal risk and clear user benefit. It should be backported to stable.
 drivers/gpu/drm/xe/abi/guc_errors_abi.h |  3 +++
 drivers/gpu/drm/xe/xe_guc.c             | 19 +++++++++++++++++--
 2 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/abi/guc_errors_abi.h b/drivers/gpu/drm/xe/abi/guc_errors_abi.h
index ecf748fd87df3..ad76b4baf42e9 100644
--- a/drivers/gpu/drm/xe/abi/guc_errors_abi.h
+++ b/drivers/gpu/drm/xe/abi/guc_errors_abi.h
@@ -63,6 +63,7 @@ enum xe_guc_load_status {
 	XE_GUC_LOAD_STATUS_HWCONFIG_START = 0x05,
 	XE_GUC_LOAD_STATUS_HWCONFIG_DONE = 0x06,
 	XE_GUC_LOAD_STATUS_HWCONFIG_ERROR = 0x07,
+	XE_GUC_LOAD_STATUS_BOOTROM_VERSION_MISMATCH = 0x08,
 	XE_GUC_LOAD_STATUS_GDT_DONE = 0x10,
 	XE_GUC_LOAD_STATUS_IDT_DONE = 0x20,
 	XE_GUC_LOAD_STATUS_LAPIC_DONE = 0x30,
@@ -75,6 +76,8 @@ enum xe_guc_load_status {
 	XE_GUC_LOAD_STATUS_INVALID_INIT_DATA_RANGE_START,
 	XE_GUC_LOAD_STATUS_MPU_DATA_INVALID = 0x73,
 	XE_GUC_LOAD_STATUS_INIT_MMIO_SAVE_RESTORE_INVALID = 0x74,
+	XE_GUC_LOAD_STATUS_KLV_WORKAROUND_INIT_ERROR = 0x75,
+	XE_GUC_LOAD_STATUS_INVALID_FTR_FLAG = 0x76,
 	XE_GUC_LOAD_STATUS_INVALID_INIT_DATA_RANGE_END,
 
 	XE_GUC_LOAD_STATUS_READY = 0xF0,
diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
index 270fc37924936..9e0ed8fabcd54 100644
--- a/drivers/gpu/drm/xe/xe_guc.c
+++ b/drivers/gpu/drm/xe/xe_guc.c
@@ -990,11 +990,14 @@ static int guc_load_done(u32 status)
 	case XE_GUC_LOAD_STATUS_GUC_PREPROD_BUILD_MISMATCH:
 	case XE_GUC_LOAD_STATUS_ERROR_DEVID_INVALID_GUCTYPE:
 	case XE_GUC_LOAD_STATUS_HWCONFIG_ERROR:
+	case XE_GUC_LOAD_STATUS_BOOTROM_VERSION_MISMATCH:
 	case XE_GUC_LOAD_STATUS_DPC_ERROR:
 	case XE_GUC_LOAD_STATUS_EXCEPTION:
 	case XE_GUC_LOAD_STATUS_INIT_DATA_INVALID:
 	case XE_GUC_LOAD_STATUS_MPU_DATA_INVALID:
 	case XE_GUC_LOAD_STATUS_INIT_MMIO_SAVE_RESTORE_INVALID:
+	case XE_GUC_LOAD_STATUS_KLV_WORKAROUND_INIT_ERROR:
+	case XE_GUC_LOAD_STATUS_INVALID_FTR_FLAG:
 		return -1;
 	}
 
@@ -1134,17 +1137,29 @@ static void guc_wait_ucode(struct xe_guc *guc)
 	}
 
 	switch (ukernel) {
+	case XE_GUC_LOAD_STATUS_HWCONFIG_START:
+		xe_gt_err(gt, "still extracting hwconfig table.\n");
+		break;
+
 	case XE_GUC_LOAD_STATUS_EXCEPTION:
 		xe_gt_err(gt, "firmware exception. EIP: %#x\n",
 			  xe_mmio_read32(mmio, SOFT_SCRATCH(13)));
 		break;
 
+	case XE_GUC_LOAD_STATUS_INIT_DATA_INVALID:
+		xe_gt_err(gt, "illegal init/ADS data\n");
+		break;
+
 	case XE_GUC_LOAD_STATUS_INIT_MMIO_SAVE_RESTORE_INVALID:
 		xe_gt_err(gt, "illegal register in save/restore workaround list\n");
 		break;
 
-	case XE_GUC_LOAD_STATUS_HWCONFIG_START:
-		xe_gt_err(gt, "still extracting hwconfig table.\n");
+	case XE_GUC_LOAD_STATUS_KLV_WORKAROUND_INIT_ERROR:
+		xe_gt_err(gt, "illegal workaround KLV data\n");
+		break;
+
+	case XE_GUC_LOAD_STATUS_INVALID_FTR_FLAG:
+		xe_gt_err(gt, "illegal feature flag specified\n");
 		break;
 	}