When a CL/CSD job times out, we check if the GPU has made any progress since the last timeout. If so, instead of resetting the hardware, we skip the reset and let the timer get rearmed. This gives long-running jobs a chance to complete.
However, when `timedout_job()` is called, the job in question is removed from the pending list, which means it won't be automatically freed through `free_job()`. Consequently, when we skip the reset and keep the job running, the job won't be freed when it finally completes.
This situation leads to a memory leak, as exposed in [1].
Similarly to commit 704d3d60fec4 ("drm/etnaviv: don't block scheduler when GPU is still active"), this patch ensures the job is put back on the pending list when extending the timeout.
Cc: stable@vger.kernel.org # 6.0
Link: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12227 [1]
Reported-by: Daivik Bhatia <dtgs1208@gmail.com>
Signed-off-by: Maíra Canal <mcanal@igalia.com>
---
 drivers/gpu/drm/v3d/v3d_sched.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index b3be08b0ca91..a98dcf9d75b1 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -734,11 +734,6 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
 	return DRM_GPU_SCHED_STAT_NOMINAL;
 }
-/* If the current address or return address have changed, then the GPU
- * has probably made progress and we should delay the reset. This
- * could fail if the GPU got in an infinite loop in the CL, but that
- * is pretty unlikely outside of an i-g-t testcase.
- */
 static enum drm_gpu_sched_stat
 v3d_cl_job_timedout(struct drm_sched_job *sched_job, enum v3d_queue q,
 		    u32 *timedout_ctca, u32 *timedout_ctra)
@@ -748,9 +743,16 @@ v3d_cl_job_timedout(struct drm_sched_job *sched_job, enum v3d_queue q,
 	u32 ctca = V3D_CORE_READ(0, V3D_CLE_CTNCA(q));
 	u32 ctra = V3D_CORE_READ(0, V3D_CLE_CTNRA(q));
 
+	/* If the current address or return address have changed, then the GPU
+	 * has probably made progress and we should delay the reset. This
+	 * could fail if the GPU got in an infinite loop in the CL, but that
+	 * is pretty unlikely outside of an i-g-t testcase.
+	 */
 	if (*timedout_ctca != ctca || *timedout_ctra != ctra) {
 		*timedout_ctca = ctca;
 		*timedout_ctra = ctra;
+
+		list_add(&sched_job->list, &sched_job->sched->pending_list);
 		return DRM_GPU_SCHED_STAT_NOMINAL;
 	}
@@ -790,11 +792,13 @@ v3d_csd_job_timedout(struct drm_sched_job *sched_job)
 	struct v3d_dev *v3d = job->base.v3d;
 	u32 batches = V3D_CORE_READ(0, V3D_CSD_CURRENT_CFG4(v3d->ver));
-	/* If we've made progress, skip reset and let the timer get
-	 * rearmed.
+	/* If we've made progress, skip reset, add the job to the pending
+	 * list, and let the timer get rearmed.
 	 */
 	if (job->timedout_batches != batches) {
 		job->timedout_batches = batches;
+
+		list_add(&sched_job->list, &sched_job->sched->pending_list);
 		return DRM_GPU_SCHED_STAT_NOMINAL;
 	}
Hi Maira,
Looks good to me, but don't we need to do the same in v3d_csd_job_timedout?
Iago
On Sun, 27 Apr 2025 at 17:28 -0300, Maíra Canal wrote:
Never mind, somehow I missed you're already doing that in this patch.
Reviewed-by: Iago Toral Quiroga itoral@igalia.com
On Mon, 28 Apr 2025 at 07:45 +0200, Iago Toral wrote:
> Hi Maira,
>
> Looks good to me, but don't we need to do the same in
> v3d_csd_job_timedout?
>
> Iago