In a loop test of thousands of screen rotations with virtual display enabled, a CPU hard lockup can occur, leaving the system unresponsive until it crashes.
do {
	xrandr --output Virtual --rotate inverted
	xrandr --output Virtual --rotate right
	xrandr --output Virtual --rotate left
	xrandr --output Virtual --rotate normal
} while (1);
NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
 ? hrtimer_run_softirq+0x140/0x140
 ? store_vblank+0xe0/0xe0 [drm]
 hrtimer_cancel+0x15/0x30
 amdgpu_vkms_disable_vblank+0x15/0x30 [amdgpu]
 drm_vblank_disable_and_save+0x185/0x1f0 [drm]
 drm_crtc_vblank_off+0x159/0x4c0 [drm]
 ? record_print_text.cold+0x11/0x11
 ? wait_for_completion_timeout+0x232/0x280
 ? drm_crtc_wait_one_vblank+0x40/0x40 [drm]
 ? bit_wait_io_timeout+0xe0/0xe0
 ? wait_for_completion_interruptible+0x1d7/0x320
 ? mutex_unlock+0x81/0xd0
 amdgpu_vkms_crtc_atomic_disable
It is caused by a lock dependency between two CPUs in the following scenario, where each side waits on a resource the other holds.
CPU1                                        CPU2
drm_crtc_vblank_off                         hrtimer_interrupt
  grab event_lock (irq disabled)              __hrtimer_run_queues
    grab vbl_lock/vblank_time_lock              amdgpu_vkms_vblank_simulate
      amdgpu_vkms_disable_vblank                  drm_handle_vblank
        hrtimer_cancel                              grab dev->event_lock
So CPU1 is stuck in hrtimer_cancel, waiting for the timer callback to finish on its clock base, but the callback running on CPU2 can never finish, because it fails to acquire dev->event_lock, which CPU1 holds. After its threshold expires, the NMI watchdog reports the hard lockup, and every CPU that subsequently contends for these locks is blocked as well.
So use hrtimer_try_to_cancel to fix this, as the disable_vblank callback does not need to wait for the handler to finish. It is also unnecessary to check the return value of hrtimer_try_to_cancel: even if it returns -1, meaning the timer callback is currently running, the timer will simply be reprogrammed by the next hrtimer_start when enable_vblank is called again.
Cc: stable@vger.kernel.org
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
index 53ff91fc6cf6..70fb0df039e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
@@ -81,7 +81,7 @@ static void amdgpu_vkms_disable_vblank(struct drm_crtc *crtc)
 {
 	struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc);
 
-	hrtimer_cancel(&amdgpu_crtc->vblank_timer);
+	hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer);
 }
 
 static bool amdgpu_vkms_get_vblank_timestamp(struct drm_crtc *crtc,
On 06.07.23 at 10:35, Guchun Chen wrote:
That's a first step, but not sufficient.
You also need to change the "return HRTIMER_RESTART;" in amdgpu_vkms_vblank_simulate() to only re-arm the interrupt when it is enabled.
Finally, I strongly suggest implementing an amdgpu_vkms_destroy() function to make sure the hrtimer is properly cleaned up.
Regards,
Christian.
[Public]
-----Original Message-----
From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Thursday, July 6, 2023 7:25 PM
To: Chen, Guchun <Guchun.Chen@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Milinkovic, Dusica <Dusica.Milinkovic@amd.com>; Prica, Nikola <Nikola.Prica@amd.com>; Cui, Flora <Flora.Cui@amd.com>
Cc: stable@vger.kernel.org
Subject: Re: [PATCH] drm/amdgpu/vkms: relax timer deactivation by hrtimer_try_to_cancel
On 06.07.23 at 10:35, Guchun Chen wrote:
That's a first step, but not sufficient.
You also need to change the "return HRTIMER_RESTART;" in amdgpu_vkms_vblank_simulate() to only re-arm the interrupt when it is enabled.
Finally, I strongly suggest implementing an amdgpu_vkms_destroy() function to make sure the hrtimer is properly cleaned up.
Good suggestion, will fix it in V2.
Regards,
Guchun
Regards,
Christian.
[Public]
-----Original Message-----
From: Chen, Guchun
Sent: Friday, July 7, 2023 9:06 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Milinkovic, Dusica <Dusica.Milinkovic@amd.com>; Prica, Nikola <Nikola.Prica@amd.com>; Cui, Flora <Flora.Cui@amd.com>
Cc: stable@vger.kernel.org
Subject: RE: [PATCH] drm/amdgpu/vkms: relax timer deactivation by hrtimer_try_to_cancel
-----Original Message-----
From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Thursday, July 6, 2023 7:25 PM
To: Chen, Guchun <Guchun.Chen@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Milinkovic, Dusica <Dusica.Milinkovic@amd.com>; Prica, Nikola <Nikola.Prica@amd.com>; Cui, Flora <Flora.Cui@amd.com>
Cc: stable@vger.kernel.org
Subject: Re: [PATCH] drm/amdgpu/vkms: relax timer deactivation by hrtimer_try_to_cancel
On 06.07.23 at 10:35, Guchun Chen wrote:
That's a first step, but not sufficient.
You also need to change the "return HRTIMER_RESTART;" in amdgpu_vkms_vblank_simulate() to only re-arm the interrupt when it is enabled.
Finally, I strongly suggest implementing an amdgpu_vkms_destroy() function to make sure the hrtimer is properly cleaned up.
Good suggestion, will fix it in V2.
Hi Christian,
I just sent out patch v2 to address the return value problem in amdgpu_vkms_vblank_simulate.
Regarding hrtimer cleanup, it is already handled in sw_fini in the amdgpu_vkms code, so I think it is fine for now. Anyway, we can continue the discussion in the new patch set thread.
Regards,
Guchun