On Wed, May 26, 2021 at 3:32 PM Christian König <ckoenig.leichtzumerken@gmail.com> wrote:
On 25.05.21 at 17:23, Daniel Vetter wrote:
On Tue, May 25, 2021 at 5:05 PM Christian König <ckoenig.leichtzumerken@gmail.com> wrote:
Hi Daniel,
On 25.05.21 at 15:05, Daniel Vetter wrote:
Hi Christian,
On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
On 21.05.21 at 20:31, Daniel Vetter wrote:

This works by adding the fence of the last eviction DMA operation to BOs when their backing store is newly allocated. That's what the ttm_bo_add_move_fence() function you stumbled over is good for: https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo...
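Roughly, the pattern looks like this (a simplified sketch of ttm_bo_add_move_fence() against the v5.13 API; the no_wait_gpu path and the bo->moving bookkeeping are elided):

/*
 * Sketch: when a BO gets new backing store, attach the fence of the
 * last move/eviction on that memory manager as a shared fence, so
 * the new user of the memory waits for the old contents to be moved
 * out before touching it.
 */
static int add_move_fence_sketch(struct ttm_buffer_object *bo,
                                 struct ttm_resource_manager *man)
{
        struct dma_fence *fence;
        int ret;

        spin_lock(&man->move_lock);
        fence = dma_fence_get(man->move); /* last eviction on this manager */
        spin_unlock(&man->move_lock);

        if (!fence)
                return 0;

        /* make room for one more shared fence slot */
        ret = dma_resv_reserve_shared(bo->base.resv, 1);
        if (ret) {
                dma_fence_put(fence);
                return ret;
        }

        /* dma_resv_add_shared_fence() grabs its own reference */
        dma_resv_add_shared_fence(bo->base.resv, fence);
        dma_fence_put(fence);
        return 0;
}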
Now the problem is that the application can be terminated before it completes its command submission. But since resource management only waits for the shared fences when some are present, there is a chance that we free up memory while it is still in use.
Hm, where is this code? It would help in the audit I wanted to do this week. If you look at most other places, like the drm_gem_fence_array_add_implicit() helper I mentioned earlier, we don't treat the shared fences specially and always also include the exclusive one.
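For reference, that helper boils down to something like this (simplified from the v5.13 code, error handling trimmed):

/*
 * Sketch of drm_gem_fence_array_add_implicit(): the exclusive fence
 * is always taken into account, the shared fences only for writes.
 */
static int fence_array_add_implicit_sketch(struct xarray *fence_array,
                                           struct drm_gem_object *obj,
                                           bool write)
{
        struct dma_fence **fences;
        unsigned int i, fence_count;
        int ret;

        if (!write) {
                /* readers still wait for the last writer */
                struct dma_fence *fence = dma_resv_get_excl_rcu(obj->resv);

                return drm_gem_fence_array_add(fence_array, fence);
        }

        /* passing NULL for the exclusive fence pointer means it is
         * returned as part of the fences array as well */
        ret = dma_resv_get_fences_rcu(obj->resv, NULL, &fence_count, &fences);
        if (ret || !fence_count)
                return ret;

        for (i = 0; i < fence_count && !ret; i++)
                ret = drm_gem_fence_array_add(fence_array, fences[i]);

        kfree(fences);
        return ret;
}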
See amdgpu_gem_object_close():
...
fence = dma_resv_get_excl(bo->tbo.base.resv);
if (fence) {
        amdgpu_bo_fence(bo, fence, true);
        fence = NULL;
}
...
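Annotated, the idea behind that snippet is the following (my sketch; amdgpu_bo_fence() with shared=true essentially wraps dma_resv_add_shared_fence(), and a shared slot must have been reserved beforehand):

/*
 * Re-add the current exclusive fence (e.g. a pending page table
 * update) to the shared slots, so that resource management which
 * only waits for shared fences can't free the memory too early.
 * Assumes the reservation lock is held.
 */
struct dma_fence *fence = dma_resv_get_excl(bo->tbo.base.resv);

if (fence)
        dma_resv_add_shared_fence(bo->tbo.base.resv, fence);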
We explicitly added that because resource management of some other driver was going totally bananas without that.
But I'm not sure which one that was. Maybe dig a bit in the git and mailing history of that.
Hm I looked and it's
commit 82c416b13cb7d22b96ec0888b296a48dff8a09eb
Author: Christian König <christian.koenig@amd.com>
Date:   Thu Mar 12 12:03:34 2020 +0100

    drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4
That sounded more like amdgpu itself needing this, not another driver?
No, that patch was just a follow up moving the functionality around.
That patch added the "add exclusive fence to shared slots before amdgpu_vm_clear_freed() call", which I thought was at least part of your fix.
But looking at amdgpu_vm_bo_update_mapping() it seems to pick the right fencing mode for gpu pte clearing, so I'm really not sure what the bug was that you worked around here? The implementation boils down to amdgpu_sync_resv(), which syncs to the exclusive fence, always. And there's nothing else that I could find in public history at least, no references to bug reports or anything. I think you need to dig internally, because as-is I'm not seeing the problem here.
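As far as I can see the shape of it is roughly this (heavily simplified sketch; should_sync() is a hypothetical stand-in for amdgpu's owner/mode filtering, not a real function):

/*
 * The exclusive fence is always added to the sync object; only the
 * shared fences go through the mode/owner rules.
 */
static int sync_resv_sketch(struct amdgpu_sync *sync, struct dma_resv *resv,
                            enum amdgpu_sync_mode mode, void *owner)
{
        struct dma_resv_list *flist = dma_resv_get_list(resv);
        struct dma_fence *f;
        unsigned int i;
        int r;

        /* always sync to the exclusive fence */
        f = dma_resv_get_excl(resv);
        r = amdgpu_sync_fence(sync, f);
        if (r)
                return r;

        for (i = 0; flist && i < flist->shared_count; i++) {
                f = rcu_dereference_protected(flist->shared[i],
                                              dma_resv_held(resv));
                if (!should_sync(mode, owner, f)) /* hypothetical filter */
                        continue;
                r = amdgpu_sync_fence(sync, f);
                if (r)
                        return r;
        }
        return 0;
}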
Or am I missing something here?
See the code here for example: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/nouveau/nouve...
Nouveau assumes that when a shared fence is present it doesn't need to wait for the exclusive one, because the shared fences are always supposed to finish after the exclusive one.
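The problematic check boils down to something like this (paraphrased from nouveau_fence_sync(); wait_fence() is a hypothetical stand-in for the actual per-fence sync/wait logic):

/*
 * Nouveau's assumption made explicit: when shared fences exist, the
 * exclusive fence is only waited on for exclusive access.
 */
static int nouveau_sync_sketch(struct dma_resv *resv, bool exclusive)
{
        struct dma_resv_list *fobj = dma_resv_get_list(resv);
        struct dma_fence *fence = dma_resv_get_excl(resv);
        unsigned int i;
        int ret = 0;

        /* the exclusive fence is skipped when shared fences are
         * present and the caller doesn't need exclusive access --
         * this is where a page table unmap fence in the exclusive
         * slot can fall through the cracks */
        if (fence && (!exclusive || !fobj || !fobj->shared_count))
                ret = wait_fence(fence);

        if (!exclusive || !fobj)
                return ret;

        for (i = 0; i < fobj->shared_count && !ret; i++) {
                fence = rcu_dereference_protected(fobj->shared[i],
                                                  dma_resv_held(resv));
                ret = wait_fence(fence);
        }
        return ret;
}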
But for page table unmap fences that isn't true and we ran into a really nasty and hard to reproduce bug because of this.
I think it would be much more defensive if we could say that we always wait for the exclusive fence, fix the use case in nouveau, and double-check whether somebody else does stuff like that as well.
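The defensive variant is basically what dma_resv_wait_timeout_rcu() already implements (sketch against the v5.13 API):

/*
 * Always wait for the exclusive fence; wait_all additionally waits
 * for all shared fences. No ordering assumption between exclusive
 * and shared fences is needed.
 */
static long wait_resv_defensive(struct dma_resv *resv, bool wait_all,
                                bool intr, unsigned long timeout)
{
        return dma_resv_wait_timeout_rcu(resv, wait_all, intr, timeout);
}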
Yeah, most other drivers do the defensive thing here. I think it would be good to standardize on that. I'll add that to my list and do more auditing.

-Daniel