Hi,
I just ran into an issue again where soft recovery isn't properly reported to userspace, which causes apps to keep submitting hanging work in a loop (in my case, this ended up hanging the entire machine). It seems this patch never made it into amd-staging-drm-next. Given that it has a Reviewed-by and everything, was this just an oversight, or are there blockers to pushing it that I missed?
If not, I'd be grateful if the patch could get merged.
Thanks, Friedrich
On 08.03.24 09:33, Christian König wrote:
Am 07.03.24 um 20:04 schrieb Joshua Ashton:
As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset.
Marek, you have been in favor of this for like forever. So I would like to ask you to put your Reviewed-by on it, and I will just push it into our internal kernel branch.
Regards, Christian.
[1]: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/

Signed-off-by: Joshua Ashton <joshua@froggi.es>
Cc: Friedrich Vock <friedrich.vock@gmx.de>
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Cc: Christian König <christian.koenig@amd.com>
Cc: André Almeida <andrealmeid@igalia.com>
Cc: stable@vger.kernel.org
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4b3000c21ef2..aebf59855e9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
 	struct dma_fence *fence = NULL;
 	int r;
 
-	/* Ignore soft recovered fences here */
 	r = drm_sched_entity_error(s_entity);
-	if (r && r != -ENODATA)
+	if (r)
 		goto error;
 
 	if (!fence && job->gang_submit)