From: "Stanley.Yang" <Stanley.Yang@amd.com>
[ Upstream commit b1b29aa88f5367d0367c8eeef643635bc6009a9a ]
Why:
[13014.890792] Call Trace:
[13014.890793]  <TASK>
[13014.890795]  ? show_trace_log_lvl+0x1d6/0x2ea
[13014.890799]  ? show_trace_log_lvl+0x1d6/0x2ea
[13014.890800]  ? vcn_v5_0_1_hw_fini+0xe9/0x110 [amdgpu]
[13014.890872]  ? show_regs.part.0+0x23/0x29
[13014.890873]  ? show_regs.cold+0x8/0xd
[13014.890874]  ? amdgpu_irq_put+0xc6/0xe0 [amdgpu]
[13014.890934]  ? __warn+0x8c/0x100
[13014.890936]  ? amdgpu_irq_put+0xc6/0xe0 [amdgpu]
[13014.890995]  ? report_bug+0xa4/0xd0
[13014.890999]  ? handle_bug+0x39/0x90
[13014.891001]  ? exc_invalid_op+0x19/0x70
[13014.891003]  ? asm_exc_invalid_op+0x1b/0x20
[13014.891005]  ? amdgpu_irq_put+0xc6/0xe0 [amdgpu]
[13014.891065]  ? amdgpu_irq_put+0x63/0xe0 [amdgpu]
[13014.891124]  vcn_v5_0_1_hw_fini+0xe9/0x110 [amdgpu]
[13014.891189]  amdgpu_ip_block_hw_fini+0x3b/0x78 [amdgpu]
[13014.891309]  amdgpu_device_fini_hw+0x3c1/0x479 [amdgpu]

How:
Add omitted vcn poison irq get call.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
- Root cause and symptom:
  - vcn_v5_0_1 registers the VCN poison IRQ source in sw_init via `amdgpu_irq_add_id()` (drivers/gpu/drm/amd/amdgpu/vcn_v5_0_1.c:100), and disables it in hw_fini via `amdgpu_irq_put()` (drivers/gpu/drm/amd/amdgpu/vcn_v5_0_1.c:288). However, it never enables the IRQ at init time (i.e., no `amdgpu_irq_get()`), so the `amdgpu_irq_put()` in hw_fini hits the WARN in `amdgpu_irq_put()` when the IRQ wasn’t enabled, matching the call trace in the commit message (invalid op from WARN_ON in IRQ put).
  - The WARN is explicitly emitted by `amdgpu_irq_put()` when the IRQ isn’t enabled: drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:619.
- What the patch does:
  - VCN: Adds the missing `amdgpu_irq_get()` for the poison IRQ in `vcn_v5_0_1_ras_late_init()` so the later `amdgpu_irq_put()` in `vcn_v5_0_1_hw_fini()` is balanced.
    - Before: `vcn_v5_0_1_ras_late_init()` only called `amdgpu_ras_bind_aca()` and returned (drivers/gpu/drm/amd/amdgpu/vcn_v5_0_1.c:1593).
    - After (per patch): if RAS is supported and `ras_poison_irq.funcs` is set, call `amdgpu_irq_get(adev, &adev->vcn.inst->ras_poison_irq, 0)`. This mirrors the established pattern in the generic helper `amdgpu_vcn_ras_late_init()` (drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c:1214), which performs the `amdgpu_irq_get()` per instance. vcn_v5_0_1 overrides the generic ras_late_init and had omitted this step; the patch restores this missing piece.
  - JPEG: Reorders operations in `jpeg_v5_0_1_ras_late_init()` to bind ACA before enabling the poison IRQ. While that JPEG v5.0.1 file may not exist on all branches, the change is a benign ordering fix that keeps RAS/ACA setup consistent before enabling the IRQ.
- Why this is a correct and minimal fix:
  - The call trace shows a WARN in `amdgpu_irq_put()` due to an unbalanced put; adding a matching `amdgpu_irq_get()` in ras_late_init is the smallest correct change to restore balance.
  - The guard `amdgpu_ras_is_supported(adev, ras_block->block) && adev->vcn.inst->ras_poison_irq.funcs` ensures the get only occurs when RAS is supported and the IRQ source is correctly set up, minimizing risk.
  - Other VCN versions rely on the generic `amdgpu_vcn_ras_late_init()` which already does an `amdgpu_irq_get()`; this change simply brings vcn_v5_0_1 in line with the established pattern and with its own `hw_fini` which unconditionally calls `amdgpu_irq_put()` when RAS is supported (drivers/gpu/drm/amd/amdgpu/vcn_v5_0_1.c:288).
- Backport suitability:
  - Fixes a real user-visible bug (WARN/trace on shutdown/suspend/reset paths), confirmed by the provided stack trace.
  - Small, self-contained, and localized to the AMDGPU VCN/JPEG RAS init path.
  - No API/ABI or architectural changes; no feature additions.
  - Aligns behavior with other IP blocks and the generic RAS late init code path.
  - Low regression risk: only enables an IRQ that is already registered and later disabled; gated by RAS support and presence of IRQ funcs.
- Specific code references to support the analysis:
  - Missing get in vcn v5.0.1: drivers/gpu/drm/amd/amdgpu/vcn_v5_0_1.c:1593
  - Unbalanced put causing WARN: drivers/gpu/drm/amd/amdgpu/vcn_v5_0_1.c:288
  - WARN in `amdgpu_irq_put()` when IRQ not enabled: drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:619
  - Correct generic pattern (does get in late init): drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c:1214
  - VCN poison IRQ registered in sw_init (needs get to enable): drivers/gpu/drm/amd/amdgpu/vcn_v5_0_1.c:119
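
The get/put imbalance described in the root-cause bullets can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct model_irq_src` and every `model_*` function are hypothetical stand-ins invented for illustration, not the amdgpu data structures or API, and the `-1` return merely marks the point where the kernel would hit its WARN.

```c
/* Userspace model only: NOT the amdgpu implementation. */
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for an IRQ source with an enable refcount. */
struct model_irq_src {
	int enabled_refs;	/* number of outstanding get() calls */
	bool hw_enabled;	/* whether the interrupt is unmasked */
};

/* Take a reference; the first reference unmasks the interrupt. */
static int model_irq_get(struct model_irq_src *src)
{
	if (src->enabled_refs++ == 0)
		src->hw_enabled = true;
	return 0;
}

/* Drop a reference; returns -1 (where the kernel would WARN) if none is held. */
static int model_irq_put(struct model_irq_src *src)
{
	if (src->enabled_refs == 0) {
		fprintf(stderr, "WARN: put without matching get\n");
		return -1;
	}
	if (--src->enabled_refs == 0)
		src->hw_enabled = false;
	return 0;
}

/* Late init before the fix: the source is registered but never enabled. */
static int model_late_init_before(struct model_irq_src *src)
{
	(void)src;
	return 0;
}

/* Late init after the fix: take the reference that teardown will drop. */
static int model_late_init_after(struct model_irq_src *src)
{
	return model_irq_get(src);
}

/* Teardown: drops one reference, in both the old and the new flow. */
static int model_hw_fini(struct model_irq_src *src)
{
	return model_irq_put(src);
}
```

In this model, calling `model_late_init_before()` followed by `model_hw_fini()` on a zero-initialized source makes the put fail, whereas `model_late_init_after()` followed by `model_hw_fini()` leaves the refcount balanced at zero.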
Given the above, this commit is an important, minimal-risk bugfix and should be backported to stable trees that contain VCN/JPEG 5.0.1.
 drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c | 10 +++++-----
 drivers/gpu/drm/amd/amdgpu/vcn_v5_0_1.c  |  7 +++++++
 2 files changed, 12 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
index 54523dc1f7026..03ec4b741d194 100644
--- a/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c
@@ -1058,6 +1058,11 @@ static int jpeg_v5_0_1_ras_late_init(struct amdgpu_device *adev, struct ras_comm
 	if (r)
 		return r;
 
+	r = amdgpu_ras_bind_aca(adev, AMDGPU_RAS_BLOCK__JPEG,
+				&jpeg_v5_0_1_aca_info, NULL);
+	if (r)
+		goto late_fini;
+
 	if (amdgpu_ras_is_supported(adev, ras_block->block) &&
 		adev->jpeg.inst->ras_poison_irq.funcs) {
 		r = amdgpu_irq_get(adev, &adev->jpeg.inst->ras_poison_irq, 0);
@@ -1065,11 +1070,6 @@ static int jpeg_v5_0_1_ras_late_init(struct amdgpu_device *adev, struct ras_comm
 			goto late_fini;
 	}
 
-	r = amdgpu_ras_bind_aca(adev, AMDGPU_RAS_BLOCK__JPEG,
-				&jpeg_v5_0_1_aca_info, NULL);
-	if (r)
-		goto late_fini;
-
 	return 0;
 
 late_fini:
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_1.c b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_1.c
index d8bbb93767318..cb560d64da08c 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_1.c
@@ -1608,6 +1608,13 @@ static int vcn_v5_0_1_ras_late_init(struct amdgpu_device *adev, struct ras_commo
 	if (r)
 		goto late_fini;
 
+	if (amdgpu_ras_is_supported(adev, ras_block->block) &&
+		adev->vcn.inst->ras_poison_irq.funcs) {
+		r = amdgpu_irq_get(adev, &adev->vcn.inst->ras_poison_irq, 0);
+		if (r)
+			goto late_fini;
+	}
+
 	return 0;
 
 late_fini: