From: Antonino Maniscalco <antomani103@gmail.com>
[ Upstream commit 10fb1b2fcaee5545a5e54db1ed4d7b15c2db50c8 ]
If two fault IRQs arrive in short succession, recovery work will be
queued up twice. When recovery then runs a second time it may end up
killing an unrelated context. Prevent this by masking off interrupts
when triggering recovery.
Signed-off-by: Antonino Maniscalco <antomani103@gmail.com>
Reviewed-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Patchwork: https://patchwork.freedesktop.org/patch/670023/
Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
- What it fixes: Two fault IRQs arriving back-to-back can queue `recover_work` twice; the second recovery may kill an unrelated context. The change masks interrupts before queuing recovery so subsequent fault IRQs cannot re-queue it.
- Core change: In `drivers/gpu/drm/msm/adreno/a6xx_gpu.c:1821-1824`, `a6xx_fault_detect_irq()` adds `gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, 0);` to mask all RBBM interrupts before `kthread_queue_work(gpu->worker, &gpu->recover_work);`. This follows the hangcheck timer being disabled, ensuring no further spurious recovery triggers while the first recovery proceeds.
- Call path impact: `a6xx_irq()` invokes `a6xx_fault_detect_irq()` on hang detect (`drivers/gpu/drm/msm/adreno/a6xx_gpu.c:1891-1900`). With the new mask, subsequent IRQs won't retrigger recovery for the same incident.
- Interrupts are safely restored: During recovery the GPU is reinitialized and interrupts are re-enabled in `a6xx_hw_init()` via `REG_A6XX_RBBM_INT_0_MASK` (sets `A6XX_INT_MASK`/`A7XX_INT_MASK`) at `drivers/gpu/drm/msm/adreno/a6xx_gpu.c:1410-1413`. `a6xx_recover()` calls `msm_gpu_hw_init(gpu)` to perform this re-init (`drivers/gpu/drm/msm/adreno/a6xx_gpu.c:1614`). Additionally, the top-level wrapper brackets `hw_init` with `disable_irq()`/`enable_irq()` (`drivers/gpu/drm/msm/msm_gpu.c:168-174`), so the flow cleanly unmasks after reset.
- Consistency with existing patterns: A similar mask-on-fault pattern already exists for a7xx SW fuse violations (`drivers/gpu/drm/msm/adreno/a6xx_gpu.c:1831-1834`), indicating this is the established approach to prevent repeated fault handling.
- Stable suitability:
  - User-visible bugfix: prevents an erroneous second recovery that can kill unrelated contexts.
  - Small and contained: one register write in an error path; no ABI or feature changes.
  - Low regression risk: interrupts are restored during the normal recovery/reinit path; only a6xx hang/fault handling is affected.
  - No architectural churn; limited to the DRM/MSM Adreno a6xx driver.
Conclusion: This is a minimal, targeted fix for a real correctness issue with low risk and clear recovery restore points, making it a good candidate for backporting to all supported stable kernels that include the a6xx driver.
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 45dd5fd1c2bfc..f8992a68df7fb 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1727,6 +1727,9 @@ static void a6xx_fault_detect_irq(struct msm_gpu *gpu)
 	/* Turn off the hangcheck timer to keep it from bothering us */
 	timer_delete(&gpu->hangcheck_timer);
 
+	/* Turn off interrupts to avoid triggering recovery again */
+	gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, 0);
+
 	kthread_queue_work(gpu->worker, &gpu->recover_work);
 }