[Linaro-mm-sig] Re: dma_buf_detach lockdep splat

27 Jun 2024


      On Thu, 2024-06-27 at 10:04 +0200, Daniel Vetter wrote:
...
On Wed, Jun 26, 2024 at 05:58:02PM +0200, Thomas Hellström wrote:
...
Hi!
I'm seeing the below lockdep splat 1) with the xe driver in an
imported
dma-buf object destruction path.
It's not because we hold the dma_resv lock at that point, but
rather
because we hold *another* dma_resv lock at that point, and the
dma_resv
detach happens when the object is idle, in this case it was idle at
the
final put(), and dma_buf_detach() is called in the putting process.
Holding another dma-buf lock might happen as part of
drm_exec_unlock_all, or simply if the wider vm dma_resv was held at
object put time, so it's not an uncommon pattern, even if the
drm_exec
instance can be fixed by putting all bos after unlocking them all.
Two solutions coming to mind here:

Provide a dma_buf_detach_locked()

This smells way too much like the endless headaches we had with
drm_gem_object_put_locked and friends against
drm_device.struct_mutex. Or
I'm not understanding what you're doing, because I'm pretty sure you
have
to take the dma_resv lock on final put() of imported objects. Because
that
final put() is of the import wrapper, the exporter (and other
importers)
can still get at that object and so dma_resv_lock is very much
needed.
Yeah, the TTM final put looks like
if (!dma_resv_trylock() || !idle)
    queue_work(final_distruction);
dma_resv_unlock();
dma_buf_detach(); <--- lockdep splat
Here's where a dma_buf_detach_locked() would've made sense before the
dma_resv_unlock().
But if you think this will cause grief, I'm completely fine with
fixing this in TTM by always taking the deferring path.
/Thomas
...
Or it's a completely different final put(), but I have no idea how
you get
that on an imported dma_buf.
...

Have TTM always take the delayed delete path for imported dma-

buf
objects.
I'd prefer 1) since I think the correct place to call this is in
the
TTM callback delete_mem_notify() where the bo is already locked,
and I
figure non-TTM gem backends may come to suffer from the same
problem.
Opinions, suggestions?
Imo 2) or trying to push the object puts outside of the
dma_resv_lock. The
latter is imo natural, since usually you grab references, then lock.
And
this even holds for at least the slow path of lru eviction, because
you
need to drop all locks and then do a ww_mutex_lock_slow, and that
requires
that you can hold references to unlocked objects.
But 2) alone is imo fine, dma_buf have become really big objects that
go
across drivers, extremely similar to struct file, and that is doing
the
delayed final put unconditionally since years too, using task_work.
It's
simply a solid design.
Cheers, Sima
...
[1]
[   99.136161] ============================================
[   99.136162] WARNING: possible recursive locking detected
[   99.136163] 6.10.0-rc2+ #6 Tainted: G     U            
[   99.136165] --------------------------------------------
[   99.136166] glxgears:sh0/4675 is trying to acquire lock:
[   99.136167] ffff9967dcdd91a8 (reservation_ww_class_mutex){+.+.}-
{3:3}, at: dma_buf_detach+0x3b/0xf0
[   99.136184] 
               but task is already holding lock:
[   99.136186] ffff9967d8c145a8 (reservation_ww_class_mutex){+.+.}-
{3:3}, at: drm_exec_lock_obj+0x49/0x2b0 [drm_exec]
[   99.136191] 
               other info that might help us debug this:
[   99.136192]  Possible unsafe locking scenario:
[   99.136194]        CPU0
[   99.136194]        ----
[   99.136195]   lock(reservation_ww_class_mutex);
[   99.136197]   lock(reservation_ww_class_mutex);
[   99.136199] 
                *** DEADLOCK ***
[   99.136199]  May be due to missing lock nesting notation
[   99.136200] 5 locks held by glxgears:sh0/4675:
[   99.136202]  #0: ffff9967d8c104c8 (&xef->vm.lock){+.+.}-{3:3},
at:
xe_file_close+0xde/0x1c0 [xe]
[   99.136272]  #1: ffff9967d5bb7480 (&vm->lock){++++}-{3:3}, at:
xe_vm_close_and_put+0x161/0x9b0 [xe]
[   99.136350]  #2: ffff9967ef88a970 (&val->lock){.+.+}-{3:3}, at:
xe_validation_ctx_init+0x6d/0x70 [xe]
[   99.136440]  #3: ffffbd6a085577b8
(reservation_ww_class_acquire){+.+.}-{0:0}, at:
xe_vma_destroy_unlocked+0x7f/0xe0 [xe]
[   99.136546]  #4: ffff9967d8c145a8
(reservation_ww_class_mutex){+.+.}-{3:3}, at:
drm_exec_lock_obj+0x49/0x2b0 [drm_exec]
[   99.136552] 
               stack backtrace:
[   99.136553] CPU: 10 PID: 4675 Comm: glxgears:sh0 Tainted: G    
U  
6.10.0-rc2+ #6
[   99.136555] Hardware name: ASUS System Product Name/PRIME B560M-
A
AC, BIOS 2001 02/01/2023
[   99.136557] Call Trace:
[   99.136558]  <TASK>
[   99.136560]  dump_stack_lvl+0x77/0xb0
[   99.136564]  __lock_acquire+0x1232/0x2160
[   99.136569]  lock_acquire+0xcb/0x2d0
[   99.136570]  ? dma_buf_detach+0x3b/0xf0
[   99.136574]  ? __lock_acquire+0x417/0x2160
[   99.136577]  __ww_mutex_lock.constprop.0+0xd0/0x13b0
[   99.136580]  ? dma_buf_detach+0x3b/0xf0
[   99.136584]  ? dma_buf_detach+0x3b/0xf0
[   99.136588]  ? ww_mutex_lock+0x2b/0x90
[   99.136590]  ww_mutex_lock+0x2b/0x90
[   99.136592]  dma_buf_detach+0x3b/0xf0
[   99.136595]  drm_prime_gem_destroy+0x2f/0x40 [drm]
[   99.136638]  xe_ttm_bo_destroy+0x32/0x220 [xe]
[   99.136734]  ? __mutex_unlock_slowpath+0x3a/0x290
[   99.136738]  drm_exec_unlock_all+0xa1/0xd0 [drm_exec]
[   99.136741]  drm_exec_fini+0x12/0xb0 [drm_exec]
[   99.136743]  xe_validation_ctx_fini+0x15/0x40 [xe]
[   99.136848]  xe_vma_destroy_unlocked+0xb1/0xe0 [xe]
[   99.136954]  xe_vm_close_and_put+0x41a/0x9b0 [xe]
[   99.137056]  ? xa_find+0xe3/0x1e0
[   99.137060]  xe_file_close+0x10a/0x1c0 [xe]
[   99.137157]  drm_file_free+0x22a/0x280 [drm]
[   99.137193]  drm_release_noglobal+0x22/0x70 [drm]
[   99.137227]  __fput+0xf1/0x2d0
[   99.137231]  task_work_run+0x59/0x90
[   99.137235]  do_exit+0x330/0xb40
[   99.137238]  do_group_exit+0x36/0xa0
[   99.137241]  get_signal+0xbd2/0xbe0
[   99.137245]  arch_do_signal_or_restart+0x3e/0x240
[   99.137249]  syscall_exit_to_user_mode+0x1e7/0x290
[   99.137252]  do_syscall_64+0xa1/0x180
[   99.137255]  ? _raw_spin_unlock+0x23/0x40
[   99.137257]  ? look_up_lock_class+0x6f/0x120
[   99.137261]  ? __lock_acquire+0x417/0x2160
[   99.137264]  ? lock_acquire+0xcb/0x2d0
[   99.137266]  ? __set_task_comm+0x28/0x1e0
[   99.137268]  ? find_held_lock+0x2b/0x80
[   99.137271]  ? __set_task_comm+0xe1/0x1e0
[   99.137273]  ? lock_release+0xca/0x290
[   99.137277]  ? __do_sys_prctl+0x245/0xab0
[   99.137279]  ? lockdep_hardirqs_on_prepare+0xde/0x190
[   99.137281]  ? syscall_exit_to_user_mode+0xb0/0x290
[   99.137284]  ? do_syscall_64+0xa1/0x180
[   99.137286]  ? cpuset_cpus_allowed+0x36/0x140
[   99.137289]  ? find_held_lock+0x2b/0x80
[   99.137291]  ? find_held_lock+0x2b/0x80
[   99.137294]  ? __sched_setaffinity+0x78/0x240
[   99.137297]  ? kfree+0xe2/0x310
[   99.137301]  ? kfree+0x202/0x310
[   99.137303]  ? __sched_setaffinity+0x78/0x240
[   99.137305]  ? __x64_sys_sched_setaffinity+0x69/0xb0
[   99.137307]  ? kfree+0xe2/0x310
[   99.137310]  ? lockdep_hardirqs_on_prepare+0xde/0x190
[   99.137312]  ? syscall_exit_to_user_mode+0xb0/0x290
[   99.137315]  ? do_syscall_64+0xa1/0x180
[   99.137317]  ? trace_hardirqs_off+0x4b/0xc0
[   99.137321]  ? clear_bhb_loop+0x45/0xa0
[   99.137325]  ? clear_bhb_loop+0x45/0xa0
[   99.137327]  ? clear_bhb_loop+0x45/0xa0
[   99.137330]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   99.137333] RIP: 0033:0x7fda70ee6169
[   99.137351] Code: Unable to access opcode bytes at
0x7fda70ee613f.
[   99.137352] RSP: 002b:00007fda5fdffc80 EFLAGS: 00000246
ORIG_RAX:
00000000000000ca
[   99.137354] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX:
00007fda70ee6169
[   99.137356] RDX: 0000000000000000 RSI: 0000000000000189 RDI:
0000564a96f45b30
[   99.137358] RBP: 00007fda5fdffcb0 R08: 0000000000000000 R09:
00000000ffffffff
[   99.137359] R10: 0000000000000000 R11: 0000000000000246 R12:
0000000000000000
[   99.137360] R13: 0000000000000000 R14: 0000000000000000 R15:
0000564a96f45b30
[   99.137365]  </TASK>

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

[Linaro-mm-sig] Re: dma_buf_detach lockdep splat