From: Len Brown <len.brown@intel.com>
This reverts commit f7d6779df642720e22bffd449e683bb8690bd3bf.
This bisected regression has impacted suspend-resume stability since 5.15-rc1. It regressed -stable via 5.14.10.
https://bugzilla.kernel.org/show_bug.cgi?id=215315
Fixes: f7d6779df64 ("drm/amdgpu: stop scheduler when calling hw_fini (v2)")
Cc: Guchun Chen <guchun.chen@amd.com>
Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 5.14+
Signed-off-by: Len Brown <len.brown@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 9afd11ca2709..45977a72b5dd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -547,9 +547,6 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
 		if (!ring || !ring->fence_drv.initialized)
 			continue;
 
-		if (!ring->no_scheduler)
-			drm_sched_stop(&ring->sched, NULL);
-
 		/* You can't wait for HW to signal if it's gone */
 		if (!drm_dev_is_unplugged(adev_to_drm(adev)))
 			r = amdgpu_fence_wait_empty(ring);
@@ -609,11 +606,6 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
 		if (!ring || !ring->fence_drv.initialized)
 			continue;
 
-		if (!ring->no_scheduler) {
-			drm_sched_resubmit_jobs(&ring->sched);
-			drm_sched_start(&ring->sched, true);
-		}
-
 		/* enable the interrupt */
 		if (ring->fence_drv.irq_src)
 			amdgpu_irq_get(adev, ring->fence_drv.irq_src,
[Public]
-----Original Message-----
From: Len Brown <lenb417@gmail.com> On Behalf Of Len Brown
Sent: Sunday, January 9, 2022 1:12 PM
To: torvalds@linux-foundation.org
Cc: linux-pm@vger.kernel.org; linux-kernel@vger.kernel.org; Len Brown <len.brown@intel.com>; Chen, Guchun <Guchun.Chen@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; stable@vger.kernel.org
Subject: [PATCH REGRESSION] Revert "drm/amdgpu: stop scheduler when calling hw_fini (v2)"
@Chen, Guchun, @Grodzovsky, Andrey, @Koenig, Christian
Any ideas? What's the consequence of reverting this patch? Didn't this patch fix another suspend/resume issue?
Alex
On 10.01.22 at 17:08, Deucher, Alexander wrote:
> @Chen, Guchun, @Grodzovsky, Andrey, @Koenig, Christian
>
> Any ideas? What's the consequence of reverting this patch? Didn't this patch fix another suspend/resume issue?
I think Guchun was just trying to adapt to the fact that we removed the scheduler stop from the fence driver hw_fini path.
Not sure if that actually fixed something or was just a precaution.
Regards, Christian.
[Public]
-----Original Message-----
From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Monday, January 10, 2022 11:16 AM
To: Deucher, Alexander <Alexander.Deucher@amd.com>; Len Brown <lenb@kernel.org>; torvalds@linux-foundation.org; Chen, Guchun <Guchun.Chen@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Cc: linux-pm@vger.kernel.org; linux-kernel@vger.kernel.org; Len Brown <len.brown@intel.com>; stable@vger.kernel.org
Subject: Re: [PATCH REGRESSION] Revert "drm/amdgpu: stop scheduler when calling hw_fini (v2)"
> I think Guchun was just trying to adapt to the fact that we removed the scheduler stop from the fence driver hw_fini path.
>
> Not sure if that actually fixed something or was just a precaution.
Thanks. I'll wait for feedback from Guchun and Andrey and if they are ok with it, I'll apply the revert.
Alex
[Public]
Hi Alex/Christian,
The original patch (f7d6779df64) calls drm_sched_stop to stop the scheduler before amdgpu_fence_wait_empty. Otherwise there is a possible race in which the DRM scheduler keeps submitting commands to the hardware during suspend, so amdgpu_fence_wait_empty never gets a chance to see the ring empty. This is based on an earlier discussion with Andrey.
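The ordering argument above can be illustrated with a small userspace toy model. This is only a sketch under loud assumptions: `toy_ring`, `toy_tick`, `toy_wait_empty` and `toy_suspend` are hypothetical names that do not exist in the kernel, the "hardware" retires exactly one job per step, and nothing here models the suspend/resume regression reported in the bug. It only shows why a wait-for-empty loop may time out while a scheduler is still feeding the ring.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in the kernel. */
struct toy_ring {
	int queued;          /* jobs still held by the software scheduler */
	int in_flight;       /* jobs on the "hardware", fence not yet signaled */
	bool sched_stopped;  /* true once the scheduler has been stopped */
};

/* One simulated step: hardware retires one job, scheduler may push one. */
static void toy_tick(struct toy_ring *ring)
{
	if (ring->in_flight > 0)
		ring->in_flight--;
	if (!ring->sched_stopped && ring->queued > 0) {
		ring->queued--;
		ring->in_flight++;
	}
}

/* Returns true if the ring drained within max_ticks, false on timeout. */
static bool toy_wait_empty(struct toy_ring *ring, int max_ticks)
{
	for (int i = 0; i < max_ticks && ring->in_flight > 0; i++)
		toy_tick(ring);
	assert(ring->in_flight >= 0);
	return ring->in_flight == 0;
}

/* Suspend path: optionally stop the scheduler before waiting. */
static bool toy_suspend(struct toy_ring *ring, bool stop_first, int max_ticks)
{
	if (stop_first)
		ring->sched_stopped = true;
	return toy_wait_empty(ring, max_ticks);
}
```

In this model, a ring with 2 jobs in flight and 100 queued does not drain within 10 steps unless the scheduler is stopped first; with the stop, it drains in 2 steps and the queued jobs stay with the scheduler for a later resubmit.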
In Len Brown's case, his test runs fine for 10 hours without this patch, but with the patch applied the issue occurs in under an hour. I suspect the patch exposes another underlying problem: if it were simply wrong, the test would fail on the first suspend/resume cycle rather than after several cycles. https://bugzilla.kernel.org/show_bug.cgi?id=215315
Anyway, we can revert it for now, and I will continue investigating the root cause.
Regards, Guchun
-----Original Message-----
From: Deucher, Alexander <Alexander.Deucher@amd.com>
Sent: Tuesday, January 11, 2022 12:26 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; Len Brown <lenb@kernel.org>; torvalds@linux-foundation.org; Chen, Guchun <Guchun.Chen@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Cc: linux-pm@vger.kernel.org; linux-kernel@vger.kernel.org; Len Brown <len.brown@intel.com>; stable@vger.kernel.org
Subject: RE: [PATCH REGRESSION] Revert "drm/amdgpu: stop scheduler when calling hw_fini (v2)"
linux-stable-mirror@lists.linaro.org