I haven't gone to see where it started, but lately a good number of pretty nasty deadlock issues have appeared in the kernel. Easy reproduction recipe on a laptop with i915/amdgpu PRIME and lockdep enabled:
DRI_PRIME=1 glxinfo
Additionally, some more race conditions exist that I've managed to trigger with piglit and lockdep enabled after applying these patches:
=============================
WARNING: suspicious RCU usage
4.14.3Lyude-Test+ #2 Not tainted
-----------------------------
./include/linux/reservation.h:216 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by ext_image_dma_b/27451:
 #0:  (reservation_ww_class_mutex){+.+.}, at: [<ffffffffa034f2ff>] ttm_bo_unref+0x9f/0x3c0 [ttm]

stack backtrace:
CPU: 0 PID: 27451 Comm: ext_image_dma_b Not tainted 4.14.3Lyude-Test+ #2
Hardware name: HP HP ZBook 15 G4/8275, BIOS P70 Ver. 01.02 06/09/2017
Call Trace:
 dump_stack+0x8e/0xce
 lockdep_rcu_suspicious+0xc5/0x100
 reservation_object_copy_fences+0x292/0x2b0
 ? ttm_bo_unref+0x9f/0x3c0 [ttm]
 ttm_bo_unref+0xbd/0x3c0 [ttm]
 amdgpu_bo_unref+0x2a/0x50 [amdgpu]
 amdgpu_gem_object_free+0x4b/0x50 [amdgpu]
 drm_gem_object_free+0x1f/0x40 [drm]
 drm_gem_object_put_unlocked+0x40/0xb0 [drm]
 drm_gem_object_handle_put_unlocked+0x6c/0xb0 [drm]
 drm_gem_object_release_handle+0x51/0x90 [drm]
 drm_gem_handle_delete+0x5e/0x90 [drm]
 ? drm_gem_handle_create+0x40/0x40 [drm]
 drm_gem_close_ioctl+0x20/0x30 [drm]
 drm_ioctl_kernel+0x5d/0xb0 [drm]
 drm_ioctl+0x2f7/0x3b0 [drm]
 ? drm_gem_handle_create+0x40/0x40 [drm]
 ? trace_hardirqs_on_caller+0xf4/0x190
 ? trace_hardirqs_on+0xd/0x10
 amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
 do_vfs_ioctl+0x93/0x670
 ? __fget+0x108/0x1f0
 SyS_ioctl+0x79/0x90
 entry_SYSCALL_64_fastpath+0x23/0xc2
I've also added the relevant fixes for the issue mentioned above.
Christian König (3):
  drm/ttm: fix ttm_bo_cleanup_refs_or_queue once more
  dma-buf: make reservation_object_copy_fences rcu save
  drm/amdgpu: reserve root PD while releasing it

Michel Dänzer (1):
  drm/ttm: Always and only destroy bo->ttm_resv in ttm_bo_release_list

 drivers/dma-buf/reservation.c          | 56 +++++++++++++++++++++++++---------
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 13 ++++++--
 drivers/gpu/drm/ttm/ttm_bo.c           | 43 +++++++++++++-------------
 3 files changed, 74 insertions(+), 38 deletions(-)

--
2.14.3
From: Christian König <christian.koenig@amd.com>
With shared reservation objects __ttm_bo_reserve() can easily fail even on destroyed BOs. This prevents correct handling when we need to individualize the reservation object.
Fix this by individualizing the object before even trying to reserve it.
commit 378e2d5b504fe0231c557751e58b80fcf717cc20 upstream
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Lyude Paul <lyude@redhat.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 180ce6296416..bee77d31895b 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -440,28 +440,29 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo)
 	struct ttm_bo_global *glob = bo->glob;
 	int ret;
 
+	ret = ttm_bo_individualize_resv(bo);
+	if (ret) {
+		/* Last resort, if we fail to allocate memory for the
+		 * fences block for the BO to become idle
+		 */
+		reservation_object_wait_timeout_rcu(bo->resv, true, false,
+						    30 * HZ);
+		spin_lock(&glob->lru_lock);
+		goto error;
+	}
+
 	spin_lock(&glob->lru_lock);
 	ret = __ttm_bo_reserve(bo, false, true, NULL);
-
 	if (!ret) {
-		if (!ttm_bo_wait(bo, false, true)) {
+		if (reservation_object_test_signaled_rcu(&bo->ttm_resv, true)) {
 			ttm_bo_del_from_lru(bo);
 			spin_unlock(&glob->lru_lock);
+			if (bo->resv != &bo->ttm_resv)
+				reservation_object_unlock(&bo->ttm_resv);
 			ttm_bo_cleanup_memtype_use(bo);
-
 			return;
 		}
 
-		ret = ttm_bo_individualize_resv(bo);
-		if (ret) {
-			/* Last resort, if we fail to allocate memory for the
-			 * fences block for the BO to become idle and free it.
-			 */
-			spin_unlock(&glob->lru_lock);
-			ttm_bo_wait(bo, true, true);
-			ttm_bo_cleanup_memtype_use(bo);
-			return;
-		}
-
 		ttm_bo_flush_all_fences(bo);
 
 		/*
@@ -474,11 +475,12 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo)
 			ttm_bo_add_to_lru(bo);
 		}
 
-		if (bo->resv != &bo->ttm_resv)
-			reservation_object_unlock(&bo->ttm_resv);
 		__ttm_bo_unreserve(bo);
 	}
+	if (bo->resv != &bo->ttm_resv)
+		reservation_object_unlock(&bo->ttm_resv);
 
+error:
 	kref_get(&bo->list_kref);
 	list_add_tail(&bo->ddestroy, &bdev->ddestroy);
 	spin_unlock(&glob->lru_lock);
From: Christian König <christian.koenig@amd.com>
Stop requiring that the src reservation object is locked for this operation.
commit 39e16ba16c147e662bf9fbcee9a99d70d420382f upstream
Acked-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1504551766-5093-1-git-send-ema...
Signed-off-by: Lyude Paul <lyude@redhat.com>
---
 drivers/dma-buf/reservation.c | 56 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 42 insertions(+), 14 deletions(-)
diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
index dec3a815455d..b44d9d7db347 100644
--- a/drivers/dma-buf/reservation.c
+++ b/drivers/dma-buf/reservation.c
@@ -266,8 +266,7 @@ EXPORT_SYMBOL(reservation_object_add_excl_fence);
  * @dst: the destination reservation object
  * @src: the source reservation object
  *
- * Copy all fences from src to dst. Both src->lock as well as dst-lock must be
- * held.
+ * Copy all fences from src to dst. dst-lock must be held.
  */
 int reservation_object_copy_fences(struct reservation_object *dst,
 				   struct reservation_object *src)
@@ -277,33 +276,62 @@ int reservation_object_copy_fences(struct reservation_object *dst,
 	size_t size;
 	unsigned i;
 
-	src_list = reservation_object_get_list(src);
+	rcu_read_lock();
+	src_list = rcu_dereference(src->fence);
 
+retry:
 	if (src_list) {
-		size = offsetof(typeof(*src_list),
-				shared[src_list->shared_count]);
+		unsigned shared_count = src_list->shared_count;
+
+		size = offsetof(typeof(*src_list), shared[shared_count]);
+		rcu_read_unlock();
+
 		dst_list = kmalloc(size, GFP_KERNEL);
 		if (!dst_list)
 			return -ENOMEM;
 
-		dst_list->shared_count = src_list->shared_count;
-		dst_list->shared_max = src_list->shared_count;
-		for (i = 0; i < src_list->shared_count; ++i)
-			dst_list->shared[i] =
-				dma_fence_get(src_list->shared[i]);
+		rcu_read_lock();
+		src_list = rcu_dereference(src->fence);
+		if (!src_list || src_list->shared_count > shared_count) {
+			kfree(dst_list);
+			goto retry;
+		}
+
+		dst_list->shared_count = 0;
+		dst_list->shared_max = shared_count;
+		for (i = 0; i < src_list->shared_count; ++i) {
+			struct dma_fence *fence;
+
+			fence = rcu_dereference(src_list->shared[i]);
+			if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
+				     &fence->flags))
+				continue;
+
+			if (!dma_fence_get_rcu(fence)) {
+				kfree(dst_list);
+				src_list = rcu_dereference(src->fence);
+				goto retry;
+			}
+
+			if (dma_fence_is_signaled(fence)) {
+				dma_fence_put(fence);
+				continue;
+			}
+
+			dst_list->shared[dst_list->shared_count++] = fence;
+		}
 	} else {
 		dst_list = NULL;
 	}
 
+	new = dma_fence_get_rcu_safe(&src->fence_excl);
+	rcu_read_unlock();
+
 	kfree(dst->staged);
 	dst->staged = NULL;
 
 	src_list = reservation_object_get_list(dst);
-
 	old = reservation_object_get_excl(dst);
-	new = reservation_object_get_excl(src);
-
-	dma_fence_get(new);
 
 	preempt_disable();
 	write_seqcount_begin(&dst->seq);
From: Christian König <christian.koenig@amd.com>
Otherwise somebody could try to evict it at the same time and try to use half torn down structures.
commit 2642cf110d08a403f585a051e4cbf45a90b3adea upstream
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-and-Tested-by: Michel Dänzer <michel.daenzer@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Lyude Paul <lyude@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index bd20ff018512..591c6dc2c1f7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2586,7 +2586,8 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)
 {
 	struct amdgpu_bo_va_mapping *mapping, *tmp;
 	bool prt_fini_needed = !!adev->gart.gart_funcs->set_prt;
-	int i;
+	struct amdgpu_bo *root;
+	int i, r;
 
 	amd_sched_entity_fini(vm->entity.sched, &vm->entity);
 
@@ -2609,7 +2610,15 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)
 		amdgpu_vm_free_mapping(adev, vm, mapping, NULL);
 	}
 
-	amdgpu_vm_free_levels(&vm->root);
+	root = amdgpu_bo_ref(vm->root.bo);
+	r = amdgpu_bo_reserve(root, true);
+	if (r) {
+		dev_err(adev->dev, "Leaking page tables because BO reservation failed\n");
+	} else {
+		amdgpu_vm_free_levels(&vm->root);
+		amdgpu_bo_unreserve(root);
+	}
+	amdgpu_bo_unref(&root);
 	dma_fence_put(vm->last_dir_update);
 	for (i = 0; i < AMDGPU_MAX_VMHUBS; i++)
 		amdgpu_vm_free_reserved_vmid(adev, vm, i);
From: Michel Dänzer <michel.daenzer@amd.com>
Fixes a use-after-free due to a race condition in ttm_bo_cleanup_refs_and_unlock, which allows one task to reserve a BO and destroy its ttm_resv while another task is waiting for it to signal in reservation_object_wait_timeout_rcu.
commit e1fc12c5d9ad06a2a74e97a91f1b0c5f4c723b50 upstream
v2:
* Always initialize bo->ttm_resv in ttm_bo_init_reserved (Christian König)
Fixes: 0d2bd2ae045d ("drm/ttm: fix memory leak while individualizing BOs")
Reviewed-by: Chunming Zhou <david1.zhou@amd.com> # v1
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Michel Dänzer <michel.daenzer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Lyude Paul <lyude@redhat.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bee77d31895b..c088703777e2 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -150,8 +150,7 @@ static void ttm_bo_release_list(struct kref *list_kref)
 	ttm_tt_destroy(bo->ttm);
 	atomic_dec(&bo->glob->bo_count);
 	dma_fence_put(bo->moving);
-	if (bo->resv == &bo->ttm_resv)
-		reservation_object_fini(&bo->ttm_resv);
+	reservation_object_fini(&bo->ttm_resv);
 	mutex_destroy(&bo->wu_mutex);
 	if (bo->destroy)
 		bo->destroy(bo);
@@ -402,14 +401,11 @@ static int ttm_bo_individualize_resv(struct ttm_buffer_object *bo)
 	if (bo->resv == &bo->ttm_resv)
 		return 0;
 
-	reservation_object_init(&bo->ttm_resv);
 	BUG_ON(!reservation_object_trylock(&bo->ttm_resv));
 
 	r = reservation_object_copy_fences(&bo->ttm_resv, bo->resv);
-	if (r) {
+	if (r)
 		reservation_object_unlock(&bo->ttm_resv);
-		reservation_object_fini(&bo->ttm_resv);
-	}
 
 	return r;
 }
@@ -459,6 +455,7 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo)
 		spin_unlock(&glob->lru_lock);
 		if (bo->resv != &bo->ttm_resv)
 			reservation_object_unlock(&bo->ttm_resv);
+
 		ttm_bo_cleanup_memtype_use(bo);
 		return;
 	}
@@ -1205,8 +1202,8 @@ int ttm_bo_init_reserved(struct ttm_bo_device *bdev,
 		lockdep_assert_held(&bo->resv->lock.base);
 	} else {
 		bo->resv = &bo->ttm_resv;
-		reservation_object_init(&bo->ttm_resv);
 	}
+	reservation_object_init(&bo->ttm_resv);
 	atomic_inc(&bo->glob->bo_count);
 	drm_vma_node_reset(&bo->vma_node);
 	bo->priority = 0;
On 01.12.2017 01:23, Lyude Paul wrote:
> I haven't gone to see where it started, but as of late a good number of
> pretty nasty deadlock issues have appeared with the kernel. Easy
> reproduction recipe on a laptop with i915/amdgpu prime with lockdep
> enabled:
>
> DRI_PRIME=1 glxinfo
Acked-by: Christian König <christian.koenig@amd.com>
Thanks for taking care of this, Christian.
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
On Thu, Nov 30, 2017 at 07:23:02PM -0500, Lyude Paul wrote:
> I haven't gone to see where it started, but as of late a good number of
> pretty nasty deadlock issues have appeared with the kernel. Easy
> reproduction recipe on a laptop with i915/amdgpu prime with lockdep
> enabled:
>
> DRI_PRIME=1 glxinfo
> I've also added the relevant fixes for the issue mentioned above.
>
> Christian König (3):
>   drm/ttm: fix ttm_bo_cleanup_refs_or_queue once more
>   dma-buf: make reservation_object_copy_fences rcu save
>   drm/amdgpu: reserve root PD while releasing it
>
> Michel Dänzer (1):
>   drm/ttm: Always and only destroy bo->ttm_resv in ttm_bo_release_list
All now queued up, thanks.
greg k-h