In rocket_job_run(), after taking an extra fence reference for job->done_fence via dma_fence_get(), the error paths have three bugs:
- The dma_fence reference held by job->done_fence is never released, causing a reference leak. - pm_runtime_get_sync() increments the usage counter even on failure, but the error path does not decrement it, leaking the runtime PM reference and preventing the NPU from suspending. - A valid but unsignaled fence is returned to the DRM scheduler, which triggers WARN("Fence ... released with pending signals!") when the scheduler drops its reference.
Fix by replacing pm_runtime_get_sync() with pm_runtime_resume_and_get() which auto-balances the usage counter on failure, releasing both fence references on error, and returning ERR_PTR(ret) instead of the unsignaled fence.
Cc: stable@vger.kernel.org Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL") Signed-off-by: ZhaoJinming zhaojinming@uniontech.com --- drivers/accel/rocket/rocket_job.c | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c index ac51bff39833..e8a073e22ac2 100644 --- a/drivers/accel/rocket/rocket_job.c +++ b/drivers/accel/rocket/rocket_job.c @@ -310,13 +310,22 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job) dma_fence_put(job->done_fence); job->done_fence = dma_fence_get(fence);
- ret = pm_runtime_get_sync(core->dev); - if (ret < 0) - return fence; + ret = pm_runtime_resume_and_get(core->dev); + if (ret < 0) { + dma_fence_put(job->done_fence); + job->done_fence = NULL; + dma_fence_put(fence); + return ERR_PTR(ret); + }
ret = iommu_attach_group(job->domain->domain, core->iommu_group); - if (ret < 0) - return fence; + if (ret < 0) { + pm_runtime_put(core->dev); + dma_fence_put(job->done_fence); + job->done_fence = NULL; + dma_fence_put(fence); + return ERR_PTR(ret); + }
scoped_guard(mutex, &core->job_lock) { core->in_flight_job = job;
Two bugs in the IRQ handling path:
1) iommu_group reference leak in rocket_job_handle_irq(): iommu_group_get() increments the reference count but the returned pointer is passed directly to iommu_detach_group() which does not consume it. Since this runs on every completed job, the reference count accumulates and prevents the group from being freed. Use core->iommu_group instead, consistent with rocket_reset().
2) Unsafe hardware register access in shared IRQ handler: rocket_job_irq_handler() is registered with IRQF_SHARED but accesses hardware registers without checking runtime PM status. If another device on the same IRQ line triggers an interrupt while the NPU is suspended, register reads return 0xffffffff, spuriously triggering WARN_ON macros and falsely returning IRQ_WAKE_THREAD.
Add pm_runtime_get_if_active() in the hardirq handler to atomically verify the device is active before accessing registers. Each handler (hardirq and threaded) independently acquires and releases its own runtime PM reference to avoid coalescing-related leaks when the IRQ core coalesces multiple wakeups into a single thread execution.
Cc: stable@vger.kernel.org Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL") Signed-off-by: ZhaoJinming zhaojinming@uniontech.com --- drivers/accel/rocket/rocket_job.c | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c index e8a073e22ac2..334bf01c3382 100644 --- a/drivers/accel/rocket/rocket_job.c +++ b/drivers/accel/rocket/rocket_job.c @@ -349,7 +349,7 @@ static void rocket_job_handle_irq(struct rocket_core *core) return; }
- iommu_detach_group(NULL, iommu_group_get(core->dev)); + iommu_detach_group(NULL, core->iommu_group); dma_fence_signal(core->in_flight_job->done_fence); pm_runtime_put_autosuspend(core->dev); core->in_flight_job = NULL; @@ -420,7 +420,10 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data) { struct rocket_core *core = data;
- rocket_job_handle_irq(core); + if (pm_runtime_get_if_active(core->dev) == 1) { + rocket_job_handle_irq(core); + pm_runtime_put(core->dev); + }
return IRQ_HANDLED; } @@ -428,16 +431,28 @@ static irqreturn_t rocket_job_irq_handler_thread(int irq, void *data) static irqreturn_t rocket_job_irq_handler(int irq, void *data) { struct rocket_core *core = data; - u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS); + u32 raw_status; + int ret; + + ret = pm_runtime_get_if_active(core->dev); + if (ret == 0) + return IRQ_NONE; + + raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR); WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_WRITE_ERROR);
if (!(raw_status & PC_INTERRUPT_RAW_STATUS_DPU_0 || - raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1)) + raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1)) { + if (ret > 0) + pm_runtime_put(core->dev); return IRQ_NONE; + }
rocket_pc_writel(core, INTERRUPT_MASK, 0x0); + if (ret > 0) + pm_runtime_put(core->dev);
return IRQ_WAKE_THREAD; }
Hi,
Le mercredi 10 juin 2026 à 15:10 +0800, ZhaoJinming a écrit :
In rocket_job_run(), after taking an extra fence reference for job->done_fence via dma_fence_get(), the error paths have three bugs:
- The dma_fence reference held by job->done_fence is never released,
causing a reference leak.
- pm_runtime_get_sync() increments the usage counter even on failure,
but the error path does not decrement it, leaking the runtime PM reference and preventing the NPU from suspending.
- A valid but unsignaled fence is returned to the DRM scheduler,
which triggers WARN("Fence ... released with pending signals!") when the scheduler drops its reference.
Fix by replacing pm_runtime_get_sync() with pm_runtime_resume_and_get() which auto-balances the usage counter on failure, releasing both fence references on error, and returning ERR_PTR(ret) instead of the unsignaled fence.
Cc: stable@vger.kernel.org Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL") Signed-off-by: ZhaoJinming zhaojinming@uniontech.com
This is a lot of versions within the same day. You should slow down a little so a human can provide a review, and also document the differences in this section, after the ---, or using a cover letter.
Nicolas
drivers/accel/rocket/rocket_job.c | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c index ac51bff39833..e8a073e22ac2 100644 --- a/drivers/accel/rocket/rocket_job.c +++ b/drivers/accel/rocket/rocket_job.c @@ -310,13 +310,22 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job) dma_fence_put(job->done_fence); job->done_fence = dma_fence_get(fence);
- ret = pm_runtime_get_sync(core->dev);
- if (ret < 0)
return fence;
- ret = pm_runtime_resume_and_get(core->dev);
- if (ret < 0) {
dma_fence_put(job->done_fence);job->done_fence = NULL;dma_fence_put(fence);return ERR_PTR(ret);- }
ret = iommu_attach_group(job->domain->domain, core->iommu_group);
- if (ret < 0)
return fence;
- if (ret < 0) {
pm_runtime_put(core->dev);dma_fence_put(job->done_fence);job->done_fence = NULL;dma_fence_put(fence);return ERR_PTR(ret);- }
scoped_guard(mutex, &core->job_lock) { core->in_flight_job = job;
Hi,
Apologies for the confusion caused. I was trying to incorporate feedback from the AI to make my changes safer, but the result seems unsatisfactory.
I'd like to split the two independent fixes and tackle them separately. Let's first review [PATCH v6 1/2] accel/rocket: Fix error path handling in rocket_job_run().
I currently don't fully grasp the AI's problem description regarding [PATCH v6 2/2] accel/rocket: Fix iommu_group leak and unsafe IRQ register access.
Please let me know if I should start a new thread for [PATCH v6 1/2].
Thanks,
--------------
Zhao Jinming
Hi,
Le mercredi 10 juin 2026 à 15:10 +0800, ZhaoJinming a écrit?:
In rocket_job_run(), after taking an extra fence reference for
job->done_fence via dma_fence_get(), the error paths have three bugs:
- The dma_fence reference held by job->done_fence is never released,
? causing a reference leak.
- pm_runtime_get_sync() increments the usage counter even on failure,
? but the error path does not decrement it, leaking the runtime PM
? reference and preventing the NPU from suspending.
- A valid but unsignaled fence is returned to the DRM scheduler,
? which triggers WARN("Fence ... released with pending signals!")
? when the scheduler drops its reference.
Fix by replacing pm_runtime_get_sync() with pm_runtime_resume_and_get()
which auto-balances the usage counter on failure, releasing both fence
references on error, and returning ERR_PTR(ret) instead of the
unsignaled fence.
Cc: stable@vger.kernel.org
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Signed-off-by: ZhaoJinming zhaojinming@uniontech.com
This is a lot of versions within the same day. You should slow down a little so
a human can provide a review, and also document the differences in this section,
after the ---, or using a cover letter.
Nicolas
?drivers/accel/rocket/rocket_job.c | 19 ++++++++++++++-----
?1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/drivers/accel/rocket/rocket_job.c
b/drivers/accel/rocket/rocket_job.c
index ac51bff39833..e8a073e22ac2 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -310,13 +310,22 @@ static struct dma_fence *rocket_job_run(struct
drm_sched_job *sched_job)
? dma_fence_put(job->done_fence);
? job->done_fence = dma_fence_get(fence);
?
- ret = pm_runtime_get_sync(core->dev);
- if (ret < 0)
return fence;
- ret = pm_runtime_resume_and_get(core->dev);
- if (ret < 0) {
dma_fence_put(job->done_fence);
job->done_fence = NULL;
dma_fence_put(fence);
return ERR_PTR(ret);
- }
?
? ret = iommu_attach_group(job->domain->domain, core->iommu_group);
- if (ret < 0)
return fence;
- if (ret < 0) {
pm_runtime_put(core->dev);
dma_fence_put(job->done_fence);
job->done_fence = NULL;
dma_fence_put(fence);
return ERR_PTR(ret);
- }
?
? scoped_guard(mutex, &core->job_lock) {
? core->in_flight_job = job;
linaro-mm-sig@lists.linaro.org