Re: [Intel-gfx] [PATCH v3 0/2] Fix error propagation amongst request

7 Mar 2023

      Hi GG,
On Tue, Mar 07, 2023 at 09:33:12AM +0200, Gwan-gyeong Mun wrote:
...
Hi Andi,
After applying these two patches, deadlock is being detected in the call
stack below. Please review whether the patch to update the
intel_context_migrate_copy() part affected the deadlock.
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_114451v1/bat-dg2-8/igt@i9...
Thanks for looking into this. Yes, there is a basic locking issue
here coming from migrate. migrate() takes the timeline lock and
then calls the request_create() which tries to lock again. We
inevitably fall into deadlock.
The locking of the timeline is quite exotic, it's started in
request_create() and released in request_add().
It's still in trybot, but this is supposed to be the next
version:
https://patchwork.freedesktop.org/series/114645/
This creates new version of request_create_locked() and
request_add_locked() where there the timeline is not locked in
the process.
There are still some selftests that need to be fixed, though.
Andi
...
<4> [33.070967] ============================================
<4> [33.070968] WARNING: possible recursive locking detected
<4> [33.070969] 6.2.0-Patchwork_114451v1-g8589fd9227ca+ #1 Not tainted
<4> [33.070970] --------------------------------------------
<4> [33.070971] i915_module_loa/948 is trying to acquire lock:
<4> [33.070972] ffff8881127f0478 (migrate){+.+.}-{3:3}, at:
i915_request_create+0x1c6/0x230 [i915]
<4> [33.071215]
but task is already holding lock:
<4> [33.071235] ffff8881127f0478 (migrate){+.+.}-{3:3}, at:
intel_context_migrate_copy+0x1b3/0xa80 [i915]
<4> [33.071484]
other info that might help us debug this:
<4> [33.071504]  Possible unsafe locking scenario:
<4> [33.071522]        CPU0
<4> [33.071532]        ----
<4> [33.071541]   lock(migrate);
<4> [33.071554]   lock(migrate);
<4> [33.071567]
 *** DEADLOCK ***
<4> [33.071585]  May be due to missing lock nesting notation
<4> [33.071606] 3 locks held by i915_module_loa/948:
<4> [33.071622]  #0: ffffc90001eb7b70
(reservation_ww_class_acquire){+.+.}-{0:0}, at:
i915_gem_do_execbuffer+0xae2/0x21c0 [i915]
<4> [33.071893]  #1: ffff8881127b9c28
(reservation_ww_class_mutex){+.+.}-{3:3}, at:
__intel_context_do_pin_ww+0x7a/0xa30 [i915]
<4> [33.072133]  #2: ffff8881127f0478 (migrate){+.+.}-{3:3}, at:
intel_context_migrate_copy+0x1b3/0xa80 [i915]
<4> [33.072384]
stack backtrace:
<4> [33.072399] CPU: 7 PID: 948 Comm: i915_module_loa Not tainted
6.2.0-Patchwork_114451v1-g8589fd9227ca+ #1
<4> [33.072428] Hardware name: Intel Corporation CoffeeLake Client
Platform/CoffeeLake S UDIMM RVP, BIOS CNLSFWR1.R00.X220.B00.2103302221
03/30/2021
<4> [33.072465] Call Trace:
<4> [33.072475]  <TASK>
<4> [33.072486]  dump_stack_lvl+0x5b/0x85
<4> [33.072503]  __lock_acquire.cold+0x158/0x33b
<4> [33.072524]  lock_acquire+0xd6/0x310
<4> [33.072541]  ? i915_request_create+0x1c6/0x230 [i915]
<4> [33.072812]  __mutex_lock+0x95/0xf40
<4> [33.072829]  ? i915_request_create+0x1c6/0x230 [i915]
<4> [33.073093]  ? rcu_read_lock_sched_held+0x55/0x80
<4> [33.073112]  ? __mutex_lock+0x133/0xf40
<4> [33.073128]  ? i915_request_create+0x1c6/0x230 [i915]
<4> [33.073388]  ? intel_context_migrate_copy+0x1b3/0xa80 [i915]
<4> [33.073619]  ? i915_request_create+0x1c6/0x230 [i915]
<4> [33.073876]  i915_request_create+0x1c6/0x230 [i915]
<4> [33.074135]  intel_context_migrate_copy+0x1d0/0xa80 [i915]
<4> [33.074360]  __i915_ttm_move+0x7a8/0x940 [i915]
<4> [33.074538]  ? _raw_spin_unlock_irqrestore+0x41/0x70
<4> [33.074552]  ? dma_resv_iter_next+0x91/0xb0
<4> [33.074564]  ? dma_resv_iter_first+0x42/0xb0
<4> [33.074576]  ? i915_deps_add_resv+0x4c/0xc0 [i915]
<4> [33.074744]  i915_ttm_move+0x2ac/0x430 [i915]
<4> [33.074910]  ttm_bo_handle_move_mem+0xb5/0x140 [ttm]
<4> [33.074930]  ttm_bo_validate+0xe9/0x1a0 [ttm]
<4> [33.074947]  __i915_ttm_get_pages+0x4e/0x190 [i915]
<4> [33.075112]  i915_ttm_get_pages+0xf3/0x160 [i915]
<4> [33.075280]  ____i915_gem_object_get_pages+0x36/0xb0 [i915]
<4> [33.075446]  __i915_gem_object_get_pages+0x95/0xa0 [i915]
<4> [33.075608]  i915_vma_get_pages+0xfa/0x160 [i915]
<4> [33.075779]  i915_vma_pin_ww+0xdc/0xb50 [i915]
<4> [33.075953]  eb_validate_vmas+0x1c6/0xac0 [i915]
<4> [33.076114]  i915_gem_do_execbuffer+0xb2a/0x21c0 [i915]
<4> [33.076276]  ? __stack_depot_save+0x3f/0x4e0
<4> [33.076292]  ? 0xffffffff81000000
<4> [33.076301]  ? _raw_spin_unlock_irq+0x41/0x50
<4> [33.076312]  ? lockdep_hardirqs_on+0xc3/0x140
<4> [33.076325]  ? set_track_update+0x25/0x50
<4> [33.076338]  ? __lock_acquire+0x5f2/0x2130
<4> [33.076356]  i915_gem_execbuffer2_ioctl+0x123/0x2e0 [i915]
<4> [33.076519]  ? __pfx_i915_gem_execbuffer2_ioctl+0x10/0x10 [i915]
<4> [33.076679]  drm_ioctl_kernel+0xb4/0x150
<4> [33.076692]  drm_ioctl+0x21d/0x420
<4> [33.076703]  ? __pfx_i915_gem_execbuffer2_ioctl+0x10/0x10 [i915]
<4> [33.076864]  ? __vm_munmap+0xd3/0x170
<4> [33.076877]  __x64_sys_ioctl+0x76/0xb0
<4> [33.076889]  do_syscall_64+0x3c/0x90
<4> [33.076900]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
<4> [33.076913] RIP: 0033:0x7f304aa903ab
<4> [33.076923] Code: 0f 1e fa 48 8b 05 e5 7a 0d 00 64 c7 00 26 00 00 00 48
c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48>
3d 01 f0 ff ff 73 01 c3 48 8b 0d b5 7a 0d 00 f7 d8 64 89 01 48
<4> [33.076957] RSP: 002b:00007fffb1424cf8 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
<4> [33.076975] RAX: ffffffffffffffda RBX: 00007fffb1424da0 RCX:
00007f304aa903ab
<4> [33.076990] RDX: 00007fffb1424da0 RSI: 0000000040406469 RDI:
0000000000000005
<4> [33.077004] RBP: 0000000040406469 R08: 0000000000000005 R09:
0000000100003000
<4> [33.077019] R10: 0000000000000001 R11: 0000000000000246 R12:
0000000000010000
<4> [33.077034] R13: 0000000000000005 R14: 00000000ffffffff R15:
00000000000056a0
<4> [33.077052]  </TASK>
Br,
G.G.
On 2/28/23 4:11 AM, Andi Shyti wrote:
...
Hi,
This series of two patches fixes the issue introduced in
cf586021642d80 ("drm/i915/gt: Pipelined page migration") where,
as reported by Matt, in a chain of requests an error is reported
only if happens in the last request.
However Chris noticed that without ensuring exclusivity in the
locking we might end up in some deadlock. That's why patch 1
throttles for the ringspace in order to make sure that no one is
holding it.
Version 1 of this patch has been reviewed by matt and this
version is adding Chris exclusive locking.
Thanks Chris for this work.
Andi
Changelog
v1 -> v2

Add patch 1 for ensuring exclusive locking of the timeline
Reword git commit of patch 2.

Andi Shyti (1):
   drm/i915/gt: Make sure that errors are propagated through request
     chains
Chris Wilson (1):
   drm/i915: Throttle for ringspace prior to taking the timeline mutex
drivers/gpu/drm/i915/gt/intel_context.c | 41 +++++++++++++++++++++++++
  drivers/gpu/drm/i915/gt/intel_context.h |  2 ++
  drivers/gpu/drm/i915/gt/intel_migrate.c | 39 +++++++++++++++++------
  drivers/gpu/drm/i915/i915_request.c     |  3 ++
  4 files changed, 75 insertions(+), 10 deletions(-)

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [Intel-gfx] [PATCH v3 0/2] Fix error propagation amongst request

Changelog