RE: [PATCH] drm/amdgpu/vkms: relax timer deactivation by hrtimer_try_to_cancel

7 Jul 2023

[Public]
...
-----Original Message-----
From: Koenig, Christian Christian.Koenig@amd.com
Sent: Thursday, July 6, 2023 7:25 PM
To: Chen, Guchun Guchun.Chen@amd.com; amd-
gfx@lists.freedesktop.org; Deucher, Alexander
Alexander.Deucher@amd.com; Zhang, Hawking
Hawking.Zhang@amd.com; Milinkovic, Dusica
Dusica.Milinkovic@amd.com; Prica, Nikola Nikola.Prica@amd.com; Cui,
Flora Flora.Cui@amd.com
Cc: stable@vger.kernel.org
Subject: Re: [PATCH] drm/amdgpu/vkms: relax timer deactivation by
hrtimer_try_to_cancel
Am 06.07.23 um 10:35 schrieb Guchun Chen:
...
In below thousands of screen rotation loop tests with virtual display
enabled, a CPU hard lockup issue may happen, leading system to
unresponsive and crash.
do {
    xrandr --output Virtual --rotate inverted
    xrandr --output Virtual --rotate right
    xrandr --output Virtual --rotate left
    xrandr --output Virtual --rotate normal } while (1);
NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
? hrtimer_run_softirq+0x140/0x140
? store_vblank+0xe0/0xe0 [drm]
hrtimer_cancel+0x15/0x30
amdgpu_vkms_disable_vblank+0x15/0x30 [amdgpu]
drm_vblank_disable_and_save+0x185/0x1f0 [drm]
drm_crtc_vblank_off+0x159/0x4c0 [drm]
? record_print_text.cold+0x11/0x11
? wait_for_completion_timeout+0x232/0x280
? drm_crtc_wait_one_vblank+0x40/0x40 [drm] ?
bit_wait_io_timeout+0xe0/0xe0 ?
wait_for_completion_interruptible+0x1d7/0x320
? mutex_unlock+0x81/0xd0
amdgpu_vkms_crtc_atomic_disable
It's caused by a stuck in lock dependency in such scenario on
different CPUs.
CPU1                                             CPU2
drm_crtc_vblank_off                              hrtimer_interrupt
     grab event_lock (irq disabled)                   __hrtimer_run_queues
         grab vbl_lock/vblank_time_block
amdgpu_vkms_vblank_simulate
...
         amdgpu_vkms_disable_vblank                       drm_handle_vblank
             hrtimer_cancel                                   grab dev->event_lock


So CPU1 stucks in hrtimer_cancel as timer callback is running endless
on current clock base, as that timer queue on CPU2 has no chance to
finish it because of failing to hold the lock. So NMI watchdog will
throw the errors after its threshold, and all later CPUs are
impacted/blocked.
...
So use hrtimer_try_to_cancel to fix this, as disable_vblank callback
does not need to wait the handler to finish. And also it's not
necessary to check the return value of hrtimer_try_to_cancel, because
even if it's
-1 which means current timer callback is running, it will be
reprogrammed in hrtimer_start with calling enable_vblank to make it works.
Cc: stable@vger.kernel.org
Suggested-by: Christian König christian.koenig@amd.com
Signed-off-by: Guchun Chen guchun.chen@amd.com

drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
index 53ff91fc6cf6..70fb0df039e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
@@ -81,7 +81,7 @@ static void amdgpu_vkms_disable_vblank(struct
drm_crtc *crtc)
...
{
    struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc);

hrtimer_cancel(&amdgpu_crtc->vblank_timer);


hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer);

That's a first step, but not sufficient.
You also need to change the "return HRTIMER_RESTART;" in
amdgpu_vkms_vblank_simulate() to only re-arm the interrupt when it is
enabled.
Finally I strongly suggest to implement a amdgpu_vkms_destroy() function to
make sure the HRTIMER is properly cleaned up.
Good suggestion, will fix it in V2.
Regards,
Guchun
...
Regards,
Christian.
...
}
static bool amdgpu_vkms_get_vblank_timestamp(struct drm_crtc *crtc,

    

2025

2024

2023

2022

2021

2020

2019

2018

2017

RE: [PATCH] drm/amdgpu/vkms: relax timer deactivation by hrtimer_try_to_cancel