Linaro-mm-sig

linaro-mm-sig@lists.linaro.org


Re: [Linaro-mm-sig] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

by Daniel Vetter

On Thu, Feb 20, 2020 at 11:51:07PM +0100, Thomas Hellström (VMware) wrote: > On 2/20/20 9:08 PM, Daniel Vetter wrote: > > On Thu, Feb 20, 2020 at 08:46:27PM +0100, Thomas Hellström (VMware) wrote: > > > On 2/20/20 7:04 PM, Daniel Vetter wrote: > > > > On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote: > > > > > On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote: > > > > > > On 2/18/20 10:01 PM, Daniel Vetter wrote: > > > > > > > On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware) > > > > > > > <thomas_os(a)shipmail.org> wrote: > > > > > > > > On 2/17/20 6:55 PM, Daniel Vetter wrote: > > > > > > > > > On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote: > > > > > > > > > > Implement the importer side of unpinned DMA-buf handling. > > > > > > > > > > > > > > > > > > > > v2: update page tables immediately > > > > > > > > > > > > > > > > > > > > Signed-off-by: Christian König <christian.koenig(a)amd.com> > > > > > > > > > > --- > > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 > > > > > > > > > > ++++++++++++++++++++- > > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++ > > > > > > > > > > 2 files changed, 71 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > > > > > > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > > > > > > > > > > index 770baba621b3..48de7624d49c 100644 > > > > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > > > > > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > > > > > > > > > > @@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct > > > > > > > > > > drm_device *dev, struct dma_buf *dma_buf) > > > > > > > > > > return ERR_PTR(ret); > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > +/** > > > > > > > > > > + * amdgpu_dma_buf_move_notify - &attach.move_notify implementation > > > > > > > > > > + * > > > > > > > > > > + * @attach: the DMA-buf 
attachment > > > > > > > > > > + * > > > > > > > > > > + * Invalidate the DMA-buf attachment, making sure that > > > > > > > > > > the we re-create the > > > > > > > > > > + * mapping before the next use. > > > > > > > > > > + */ > > > > > > > > > > +static void > > > > > > > > > > +amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach) > > > > > > > > > > +{ > > > > > > > > > > + struct drm_gem_object *obj = attach->importer_priv; > > > > > > > > > > + struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv); > > > > > > > > > > + struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); > > > > > > > > > > + struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev); > > > > > > > > > > + struct ttm_operation_ctx ctx = { false, false }; > > > > > > > > > > + struct ttm_placement placement = {}; > > > > > > > > > > + struct amdgpu_vm_bo_base *bo_base; > > > > > > > > > > + int r; > > > > > > > > > > + > > > > > > > > > > + if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM) > > > > > > > > > > + return; > > > > > > > > > > + > > > > > > > > > > + r = ttm_bo_validate(&bo->tbo, &placement, &ctx); > > > > > > > > > > + if (r) { > > > > > > > > > > + DRM_ERROR("Failed to invalidate DMA-buf > > > > > > > > > > import (%d))\n", r); > > > > > > > > > > + return; > > > > > > > > > > + } > > > > > > > > > > + > > > > > > > > > > + for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) { > > > > > > > > > > + struct amdgpu_vm *vm = bo_base->vm; > > > > > > > > > > + struct dma_resv *resv = vm->root.base.bo->tbo.base.resv; > > > > > > > > > > + > > > > > > > > > > + if (ticket) { > > > > > > > > > Yeah so this is kinda why I've been a total pain about the > > > > > > > > > exact semantics > > > > > > > > > of the move_notify hook. I think we should flat-out require > > > > > > > > > that importers > > > > > > > > > _always_ have a ticket attach when they call this, and that > > > > > > > > > they can cope > > > > > > > > > with additional locks being taken (i.e. 
full EDEADLCK) handling. > > > > > > > > > > > > > > > > > > Simplest way to force that contract is to add a dummy 2nd > > > > > > > > > ww_mutex lock to > > > > > > > > > the dma_resv object, which we then can take #ifdef > > > > > > > > > CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus mabye a WARN_ON(!ticket). > > > > > > > > > > > > > > > > > > Now the real disaster is how we handle deadlocks. Two issues: > > > > > > > > > > > > > > > > > > - Ideally we'd keep any lock we've taken locked until the > > > > > > > > > end, it helps > > > > > > > > > needless backoffs. I've played around a bit with that > > > > > > > > > but not even poc > > > > > > > > > level, just an idea: > > > > > > > > > > > > > > > > > > https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d2… > > > > > > > > > > > > > > > > > > > > > > > > > > > Idea is essentially to track a list of objects we had to > > > > > > > > > lock as part of > > > > > > > > > the ttm_bo_validate of the main object. > > > > > > > > > > > > > > > > > > - Second one is if we get a EDEADLCK on one of these > > > > > > > > > sublocks (like the > > > > > > > > > one here). We need to pass that up the entire callchain, > > > > > > > > > including a > > > > > > > > > temporary reference (we have to drop locks to do the > > > > > > > > > ww_mutex_lock_slow > > > > > > > > > call), and need a custom callback to drop that temporary reference > > > > > > > > > (since that's all driver specific, might even be > > > > > > > > > internal ww_mutex and > > > > > > > > > not anything remotely looking like a normal dma_buf). 
> > > > > > > > > This probably > > > > > > > > > needs the exec util helpers from ttm, but at the > > > > > > > > > dma_resv level, so that > > > > > > > > > we can do something like this: > > > > > > > > > > > > > > > > > > struct dma_resv_ticket { > > > > > > > > > struct ww_acquire_ctx base; > > > > > > > > > > > > > > > > > > /* can be set by anyone (including other drivers) > > > > > > > > > that got hold of > > > > > > > > > * this ticket and had to acquire some new lock. This > > > > > > > > > lock might > > > > > > > > > * protect anything, including driver-internal stuff, and isn't > > > > > > > > > * required to be a dma_buf or even just a dma_resv. */ > > > > > > > > > struct ww_mutex *contended_lock; > > > > > > > > > > > > > > > > > > /* callback which the driver (which might be a dma-buf exporter > > > > > > > > > * and not matching the driver that started this > > > > > > > > > locking ticket) > > > > > > > > > * sets together with @contended_lock, for the main > > > > > > > > > driver to drop > > > > > > > > > * when it calls dma_resv_unlock on the contended_lock. */ > > > > > > > > > void (drop_ref*)(struct ww_mutex *contended_lock); > > > > > > > > > }; > > > > > > > > > > > > > > > > > > This is all supremely nasty (also ttm_bo_validate would need to be > > > > > > > > > improved to handle these sublocks and random new objects > > > > > > > > > that could force > > > > > > > > > a ww_mutex_lock_slow). > > > > > > > > > > > > > > > > > Just a short comment on this: > > > > > > > > > > > > > > > > Neither the currently used wait-die or the wound-wait algorithm > > > > > > > > *strictly* requires a slow lock on the contended lock. For > > > > > > > > wait-die it's > > > > > > > > just very convenient since it makes us sleep instead of spinning with > > > > > > > > -EDEADLK on the contended lock. 
For wound-wait IIRC one could just > > > > > > > > immediately restart the whole locking transaction after an > > > > > > > > -EDEADLK, and > > > > > > > > the transaction would automatically end up waiting on the contended > > > > > > > > lock, provided the mutex lock stealing is not allowed. There is however > > > > > > > > a possibility that the transaction will be wounded again on another > > > > > > > > lock, taken before the contended lock, but I think there are ways to > > > > > > > > improve the wound-wait algorithm to reduce that probability. > > > > > > > > > > > > > > > > So in short, choosing the wound-wait algorithm instead of wait-die and > > > > > > > > perhaps modifying the ww mutex code somewhat would probably help > > > > > > > > passing > > > > > > > > an -EDEADLK up the call chain without requiring passing the contended > > > > > > > > lock, as long as each locker releases its own locks when receiving an > > > > > > > > -EDEADLK. > > > > > > > Hm this is kinda tempting, since rolling out the full backoff tricker > > > > > > > across driver boundaries is going to be real painful. > > > > > > > > > > > > > > What I'm kinda worried about is the debug/validation checks we're > > > > > > > losing with this. The required backoff has this nice property that > > > > > > > ww_mutex debug code can check that we've fully unwound everything when > > > > > > > we should, that we've blocked on the right lock, and that we're > > > > > > > restarting everything without keeling over. Without that I think we > > > > > > > could end up with situations where a driver in the middle feels like > > > > > > > handling the EDEADLCK, which might go well most of the times (the > > > > > > > deadlock will probably be mostly within a given driver, not across). > > > > > > > Right up to the point where someone creates a deadlock across drivers, > > > > > > > and the lack of full rollback will be felt. 
> > > > > > > > > > > > > > So not sure whether we can still keep all these debug/validation > > > > > > > checks, or whether this is a step too far towards clever tricks. > > > > > > I think we could definitely find a way to keep debugging to make sure > > > > > > everything is unwound before attempting to restart the locking > > > > > > transaction. But the debug check that we're restarting on the contended > > > > > > lock only really makes sense for wait-die, (and we could easily keep it > > > > > > for wait-die). The lock returning -EDEADLK for wound-wait may actually > > > > > > not be the contending lock but an arbitrary lock that the wounded > > > > > > transaction attempts to take after it is wounded. > > > > > > > > > > > > So in the end IMO this is a tradeoff between added (possibly severe) > > > > > > locking complexity into dma-buf and not being able to switch back to > > > > > > wait-die efficiently if we need / want to do that. > > > > > > > > > > > > /Thomas > > > > > And as a consequence an interface *could* be: > > > > > > > > > > *) We introduce functions > > > > > > > > > > void ww_acquire_relax(struct ww_acquire_ctx *ctx); > > > > > int ww_acquire_relax_interruptible(struct ww_acquire_ctx *ctx); > > > > > > > > > > that can be used instead of ww_mutex_lock_slow() in the absence of a > > > > > contending lock to avoid spinning on -EDEADLK. While trying to take the > > > > > contending lock is probably the best choice there are various second best > > > > > approaches that can be explored, for example waiting on the contending > > > > > acquire to finish or in the wound-wait case, perhaps do nothing. These > > > > > functions will also help us keep the debugging. > > > > Hm ... I guess this could work. Trouble is, it only gets rid of the > > > > slowpath locking book-keeping headaches, we still have quite a few others. > > > > > > > > > *) A function returning -EDEADLK to a caller *must* have already released > > > > > its own locks. 
> > > > So this ties to another question, as in should these callbacks have to
> > > > drop the locks they acquire (much simpler code) or not (less thrashing,
> > > > if we drop locks we might end up in a situation where threads thrash
> > > > around instead of realizing quicker that they're actually deadlocking and
> > > > one of them should stop and back off).
> > > Hmm.. Could you describe such a thrashing case with an example?
> > Ignoring cross device fun and all that, just a simplified example of why
> > holding onto locks you've acquired for eviction is useful, at least in a
> > slow path.
> >
> > - one thread trying to do an execbuf with a huge bo
> >
> > vs.
> >
> > - an entire pile of threads that try to do execbuf with just a few small bo
> >
> > First thread is in the eviction loop, selects a bo, wins against all the
> > other threads since it's been doing this forever already, gets the bo moved
> > out, unlocks.
> >
> > Since it's competing against lots of other threads with small bo, it'll
> > have to do that a lot of times. Often enough to create a contiguous hole.
> > If you have a smarter allocator that tries to create that hole more
> > actively, just assume that the single huge bo is a substantial part of
> > total vram.
> >
> > The other threads will be quicker in cramming new stuff in, even if they
> > occasionally lose the ww dance against the single thread. So the big
> > thread livelocks.
> >
> > If otoh the big thread would hold onto all the locks, eventually it would
> > have the entire vram locked, and every other thread is guaranteed to lose
> > against it in the ww dance and queue up behind. And it could finally put
> > its huge bo into vram and execute.
>
> Hmm, yes this indeed explains why it's beneficial in some cases to keep a
> number of locks held across certain operations, but I still fail to see why
> we would like *all* locks held across the entire transaction? In the above
> case I'd envision us ending up with something like:
>
> int validate(ctx, bo)
> {
>
>     for_each_suitable_bo_to_evict(ebo) {
>         r = lock(ctx, ebo);
>         if (r == -EDEADLK)
>             goto out_unlock;
>
>         r = move_notify(ctx, ebo); // locks and unlocks GPU VM bo.

Yeah I think for move_notify the "keep the locks" thing is probably not
what we want. That's more for when you have to evict stuff and similar
things like that (which hopefully no driver needs to do in their
->move_notify). But for placing buffers we kinda want to keep things, and
that's also a cross-driver thing (eventually at least I think).

>         if (r == -EDEADLK)
>             goto out_unlock;
>         evict();
>     }
>
>     place_bo(bo);
>     // Repeat until success.
>
>
> out_unlock:
>     for_each_locked_bo(ebo)
>         unlock(ctx, ebo);

So this unlock loop would perhaps need to be moved up to higher levels.
This here would solve the example of a single big bo, but if you have
multiple then you still end up with a lot of thrashing until the younger
thread realizes that it needs to back off.

> }
>
>
> int command_submission()
> {
>     acquire_init(ctx);
>
> restart:
>     for_each_bo_in_cs(bo) {
>         r = lock(ctx, bo);
>         if (r == -EDEADLK)
>             goto out_unreserve;
>     }
>
>     for_each_bo_in_cs(bo) {
>         r = validate(ctx, bo);
>         if (r == -EDEADLK)
>             goto out_unreserve;
>     }
>
>     cs();
>
>     for_each_bo_in_cs(bo)
>         unlock(ctx, bo);
>
>     acquire_fini(ctx);
>     return 0;
>
> out_unreserve:
>     for_each_locked_bo()
>         unlock(ctx, bo);
>
>     acquire_relax();
>     goto restart;
> }
> >
> > Vary the example for multi-gpu and more realism, but that's roughly it.
> >
> > Aside, a lot of the stuff Christian has been doing in ttm is to improve
> > the chances that the competing threads will hit one of the locked objects
> > of the big thread, and at least back off a bit. That's at least my
> > understanding of what's been happening.
> > -Daniel
>
> OK understood. For vmwgfx the crude simplistic idea to avoid that situation
> has been to have an rwsem around command submission: When the thread with
> the big bo has run a full LRU worth of eviction without succeeding it would
> get annoyed and take the rwsem in write mode, blocking competing threads.
> But that would of course never work in a dma-buf setting, and IIRC the
> implementation is not complete either....

Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly degenerating
into essentially a global lock. But only when there's actual contention and
thrashing.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

6 years, 2 months
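
A rough illustration of the restart pattern in the command_submission()
pseudo-code quoted in the message above: on -EDEADLK every lock taken so far
is dropped and the whole transaction starts over. This is only a
single-threaded toy, not kernel code. struct mock_bo, mock_lock(),
mock_unlock(), unlock_all() and the deadlocks_left knob are all made up for
the example; a real driver would use the ww_mutex API, and acquire_relax()
exists only as a proposal in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer object guarded by a ww_mutex. */
struct mock_bo {
    bool locked;
    int deadlocks_left; /* how often mock_lock() reports -EDEADLK first */
};

/* Hypothetical lock: reports -EDEADLK while deadlocks_left is positive. */
static int mock_lock(struct mock_bo *bo)
{
    if (bo->deadlocks_left > 0) {
        bo->deadlocks_left--;
        return -EDEADLK;
    }
    bo->locked = true;
    return 0;
}

static void mock_unlock(struct mock_bo *bo)
{
    bo->locked = false;
}

/* Full rollback: drop every lock this transaction currently holds. */
static void unlock_all(struct mock_bo *bos, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (bos[i].locked)
            mock_unlock(&bos[i]);
}

/* Lock all buffers, "submit", unlock. Returns the number of restarts. */
static int command_submission(struct mock_bo *bos, size_t n)
{
    int restarts = 0;

restart:
    for (size_t i = 0; i < n; i++) {
        if (mock_lock(&bos[i]) == -EDEADLK) {
            unlock_all(bos, n);
            restarts++; /* the proposed acquire_relax() would sit here */
            goto restart;
        }
    }

    /* Every buffer is locked at this point; the submission would go here. */
    for (size_t i = 0; i < n; i++)
        assert(bos[i].locked);

    unlock_all(bos, n);
    return restarts;
}
```

Calling command_submission() on an array where some entries start with a
non-zero deadlocks_left shows one restart per reported -EDEADLK, which is
the cost the thread is weighing against keeping locks held.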

Re: [Linaro-mm-sig] [PATCH] dma-buf: Fix missing excl fence waiting

by Christian König

On 23.02.20 at 13:21, Pan, Xinhui wrote:
>
>> On Feb 23, 2020, at 20:04, Koenig, Christian <Christian.Koenig(a)amd.com> wrote:
>>
>> On 23.02.20 at 12:56, Pan, Xinhui wrote:
>>> If the shared fence list is not empty, the excl fence is ignored even if we want to test all fences.
>>> That is obviously wrong, so fix it.
>> Yeah that is a known issue and I completely agree with you, but others disagree.
>>
>> See, the shared fences are meant to depend on the exclusive fence. So all shared fences must finish only after the exclusive one has finished as well.
> fair enough.
>
>> The problem now is that for error handling this isn't necessarily true. In other words, when a shared fence completes with an error it is perfectly possible that it does this before the exclusive fence is finished.
>>
>> I've been trying to convince Daniel for years that this is a problem :)
>>
> I have met problems: eviction races with bo release. System memory is overwritten by sDMA. The kernel is 4.19, stable one, LOL.

Ok, sounds like we add some shared fence which doesn't depend on the exclusive one to finish. That is of course highly problematic for the current handling.

It might be that this happens with the TTM move fence, but I don't know offhand how to prevent that either.

Question to Daniel and others: Can we finally drop this assumption that all shared fences must finish after the exclusive one?

Thanks for pointing this out Xinhui,
Christian.

>
> amdgpu adds an excl fence to the bo to move system memory, which is done by the drm scheduler.
> After sDMA finishes the moving job, the memory might have already been released, as dma_resv_test_signaled_rcu did not check the excl fence.
>
> Our local customer reported this issue. I spent 4 days on it. sigh
>
> thanks
> xinhui
>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: xinhui pan <xinhui.pan(a)amd.com>
>>> ---
>>>  drivers/dma-buf/dma-resv.c | 9 +++++----
>>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
>>> index 4264e64788c4..44dc64c547c6 100644
>>> --- a/drivers/dma-buf/dma-resv.c
>>> +++ b/drivers/dma-buf/dma-resv.c
>>> @@ -632,14 +632,14 @@ static inline int dma_resv_test_signaled_single(struct dma_fence *passed_fence)
>>>   */
>>>  bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all)
>>>  {
>>> -	unsigned seq, shared_count;
>>> +	unsigned int seq, shared_count, left;
>>>  	int ret;
>>>  	rcu_read_lock();
>>>  retry:
>>>  	ret = true;
>>>  	shared_count = 0;
>>> -	seq = read_seqcount_begin(&obj->seq);
>>> +	left = seq = read_seqcount_begin(&obj->seq);
>>>  	if (test_all) {
>>>  		unsigned i;
>>> @@ -647,7 +647,7 @@ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all)
>>>  		struct dma_resv_list *fobj = rcu_dereference(obj->fence);
>>>  		if (fobj)
>>> -			shared_count = fobj->shared_count;
>>> +			left = shared_count = fobj->shared_count;
>>>  		for (i = 0; i < shared_count; ++i) {
>>>  			struct dma_fence *fence = rcu_dereference(fobj->shared[i]);
>>> @@ -657,13 +657,14 @@ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all)
>>>  				goto retry;
>>>  			else if (!ret)
>>>  				break;
>>> +			left--;
>>>  		}
>>>  		if (read_seqcount_retry(&obj->seq, seq))
>>>  			goto retry;
>>>  	}
>>> -	if (!shared_count) {
>>> +	if (!left) {
>>>  		struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl);
>>>  		if (fence_excl) {

6 years, 2 months
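
To make the problem described in the message above concrete, here is a
small userspace model of the check, assuming the intent of the patch is
"with test_all set, the exclusive fence must be tested too once every shared
fence has signaled". It is not the kernel function: struct model_resv and
both helpers are invented for illustration, and the RCU and seqcount retry
handling of dma_resv_test_signaled_rcu() is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of a reservation object; not the kernel's struct dma_resv. */
struct model_resv {
    const bool *shared_signaled; /* one flag per shared fence */
    size_t shared_count;
    bool has_excl;               /* is an exclusive fence attached? */
    bool excl_signaled;
};

/* Pre-patch logic: the exclusive fence is only tested when no shared
 * fences were looked at, so test_all with shared fences skips it. */
static bool test_signaled_old(const struct model_resv *obj, bool test_all)
{
    size_t shared_count = 0;

    assert(obj->shared_count == 0 || obj->shared_signaled != NULL);

    if (test_all) {
        shared_count = obj->shared_count;
        for (size_t i = 0; i < shared_count; i++)
            if (!obj->shared_signaled[i])
                return false;
    }

    if (!shared_count && obj->has_excl)
        return obj->excl_signaled;

    return true;
}

/* Assumed intent of the fix: after all requested shared fences have
 * signaled, the exclusive fence is tested as well. */
static bool test_signaled_fixed(const struct model_resv *obj, bool test_all)
{
    assert(obj->shared_count == 0 || obj->shared_signaled != NULL);

    if (test_all)
        for (size_t i = 0; i < obj->shared_count; i++)
            if (!obj->shared_signaled[i])
                return false;

    if (obj->has_excl)
        return obj->excl_signaled;

    return true;
}
```

With two signaled shared fences and a still-pending exclusive fence, the old
variant reports "signaled" for test_all while the fixed one does not, which
matches the premature release described in the report.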

Re: [Linaro-mm-sig] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

by Daniel Vetter

On Thu, Feb 20, 2020 at 08:46:27PM +0100, Thomas Hellström (VMware) wrote: > On 2/20/20 7:04 PM, Daniel Vetter wrote: > > On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote: > > > On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote: > > > > On 2/18/20 10:01 PM, Daniel Vetter wrote: > > > > > On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware) > > > > > <thomas_os(a)shipmail.org> wrote: > > > > > > On 2/17/20 6:55 PM, Daniel Vetter wrote: > > > > > > > On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote: > > > > > > > > Implement the importer side of unpinned DMA-buf handling. > > > > > > > > > > > > > > > > v2: update page tables immediately > > > > > > > > > > > > > > > > Signed-off-by: Christian König <christian.koenig(a)amd.com> > > > > > > > > --- > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 > > > > > > > > ++++++++++++++++++++- > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++ > > > > > > > > 2 files changed, 71 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > > > > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > > > > > > > > index 770baba621b3..48de7624d49c 100644 > > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > > > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > > > > > > > > @@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct > > > > > > > > drm_device *dev, struct dma_buf *dma_buf) > > > > > > > > return ERR_PTR(ret); > > > > > > > > } > > > > > > > > > > > > > > > > +/** > > > > > > > > + * amdgpu_dma_buf_move_notify - &attach.move_notify implementation > > > > > > > > + * > > > > > > > > + * @attach: the DMA-buf attachment > > > > > > > > + * > > > > > > > > + * Invalidate the DMA-buf attachment, making sure that > > > > > > > > the we re-create the > > > > > > > > + * mapping before the next use. 
> > > > > > > > + */ > > > > > > > > +static void > > > > > > > > +amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach) > > > > > > > > +{ > > > > > > > > + struct drm_gem_object *obj = attach->importer_priv; > > > > > > > > + struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv); > > > > > > > > + struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); > > > > > > > > + struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev); > > > > > > > > + struct ttm_operation_ctx ctx = { false, false }; > > > > > > > > + struct ttm_placement placement = {}; > > > > > > > > + struct amdgpu_vm_bo_base *bo_base; > > > > > > > > + int r; > > > > > > > > + > > > > > > > > + if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM) > > > > > > > > + return; > > > > > > > > + > > > > > > > > + r = ttm_bo_validate(&bo->tbo, &placement, &ctx); > > > > > > > > + if (r) { > > > > > > > > + DRM_ERROR("Failed to invalidate DMA-buf > > > > > > > > import (%d))\n", r); > > > > > > > > + return; > > > > > > > > + } > > > > > > > > + > > > > > > > > + for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) { > > > > > > > > + struct amdgpu_vm *vm = bo_base->vm; > > > > > > > > + struct dma_resv *resv = vm->root.base.bo->tbo.base.resv; > > > > > > > > + > > > > > > > > + if (ticket) { > > > > > > > Yeah so this is kinda why I've been a total pain about the > > > > > > > exact semantics > > > > > > > of the move_notify hook. I think we should flat-out require > > > > > > > that importers > > > > > > > _always_ have a ticket attach when they call this, and that > > > > > > > they can cope > > > > > > > with additional locks being taken (i.e. full EDEADLCK) handling. > > > > > > > > > > > > > > Simplest way to force that contract is to add a dummy 2nd > > > > > > > ww_mutex lock to > > > > > > > the dma_resv object, which we then can take #ifdef > > > > > > > CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus mabye a WARN_ON(!ticket). 
> > > > > > > > > > > > > > Now the real disaster is how we handle deadlocks. Two issues: > > > > > > > > > > > > > > - Ideally we'd keep any lock we've taken locked until the > > > > > > > end, it helps > > > > > > > needless backoffs. I've played around a bit with that > > > > > > > but not even poc > > > > > > > level, just an idea: > > > > > > > > > > > > > > https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d2… > > > > > > > > > > > > > > > > > > > > > Idea is essentially to track a list of objects we had to > > > > > > > lock as part of > > > > > > > the ttm_bo_validate of the main object. > > > > > > > > > > > > > > - Second one is if we get a EDEADLCK on one of these > > > > > > > sublocks (like the > > > > > > > one here). We need to pass that up the entire callchain, > > > > > > > including a > > > > > > > temporary reference (we have to drop locks to do the > > > > > > > ww_mutex_lock_slow > > > > > > > call), and need a custom callback to drop that temporary reference > > > > > > > (since that's all driver specific, might even be > > > > > > > internal ww_mutex and > > > > > > > not anything remotely looking like a normal dma_buf). > > > > > > > This probably > > > > > > > needs the exec util helpers from ttm, but at the > > > > > > > dma_resv level, so that > > > > > > > we can do something like this: > > > > > > > > > > > > > > struct dma_resv_ticket { > > > > > > > struct ww_acquire_ctx base; > > > > > > > > > > > > > > /* can be set by anyone (including other drivers) > > > > > > > that got hold of > > > > > > > * this ticket and had to acquire some new lock. This > > > > > > > lock might > > > > > > > * protect anything, including driver-internal stuff, and isn't > > > > > > > * required to be a dma_buf or even just a dma_resv. 
*/ > > > > > > > struct ww_mutex *contended_lock; > > > > > > > > > > > > > > /* callback which the driver (which might be a dma-buf exporter > > > > > > > * and not matching the driver that started this > > > > > > > locking ticket) > > > > > > > * sets together with @contended_lock, for the main > > > > > > > driver to drop > > > > > > > * when it calls dma_resv_unlock on the contended_lock. */ > > > > > > > void (drop_ref*)(struct ww_mutex *contended_lock); > > > > > > > }; > > > > > > > > > > > > > > This is all supremely nasty (also ttm_bo_validate would need to be > > > > > > > improved to handle these sublocks and random new objects > > > > > > > that could force > > > > > > > a ww_mutex_lock_slow). > > > > > > > > > > > > > Just a short comment on this: > > > > > > > > > > > > Neither the currently used wait-die or the wound-wait algorithm > > > > > > *strictly* requires a slow lock on the contended lock. For > > > > > > wait-die it's > > > > > > just very convenient since it makes us sleep instead of spinning with > > > > > > -EDEADLK on the contended lock. For wound-wait IIRC one could just > > > > > > immediately restart the whole locking transaction after an > > > > > > -EDEADLK, and > > > > > > the transaction would automatically end up waiting on the contended > > > > > > lock, provided the mutex lock stealing is not allowed. There is however > > > > > > a possibility that the transaction will be wounded again on another > > > > > > lock, taken before the contended lock, but I think there are ways to > > > > > > improve the wound-wait algorithm to reduce that probability. > > > > > > > > > > > > So in short, choosing the wound-wait algorithm instead of wait-die and > > > > > > perhaps modifying the ww mutex code somewhat would probably help > > > > > > passing > > > > > > an -EDEADLK up the call chain without requiring passing the contended > > > > > > lock, as long as each locker releases its own locks when receiving an > > > > > > -EDEADLK. 
> > > > > Hm this is kinda tempting, since rolling out the full backoff tricker > > > > > across driver boundaries is going to be real painful. > > > > > > > > > > What I'm kinda worried about is the debug/validation checks we're > > > > > losing with this. The required backoff has this nice property that > > > > > ww_mutex debug code can check that we've fully unwound everything when > > > > > we should, that we've blocked on the right lock, and that we're > > > > > restarting everything without keeling over. Without that I think we > > > > > could end up with situations where a driver in the middle feels like > > > > > handling the EDEADLCK, which might go well most of the times (the > > > > > deadlock will probably be mostly within a given driver, not across). > > > > > Right up to the point where someone creates a deadlock across drivers, > > > > > and the lack of full rollback will be felt. > > > > > > > > > > So not sure whether we can still keep all these debug/validation > > > > > checks, or whether this is a step too far towards clever tricks. > > > > I think we could definitely find a way to keep debugging to make sure > > > > everything is unwound before attempting to restart the locking > > > > transaction. But the debug check that we're restarting on the contended > > > > lock only really makes sense for wait-die, (and we could easily keep it > > > > for wait-die). The lock returning -EDEADLK for wound-wait may actually > > > > not be the contending lock but an arbitrary lock that the wounded > > > > transaction attempts to take after it is wounded. > > > > > > > > So in the end IMO this is a tradeoff between added (possibly severe) > > > > locking complexity into dma-buf and not being able to switch back to > > > > wait-die efficiently if we need / want to do that. 
> > > >
> > > > /Thomas
> > > And as a consequence an interface *could* be:
> > >
> > > *) We introduce functions
> > >
> > > void ww_acquire_relax(struct ww_acquire_ctx *ctx);
> > > int ww_acquire_relax_interruptible(struct ww_acquire_ctx *ctx);
> > >
> > > that can be used instead of ww_mutex_lock_slow() in the absence of a
> > > contending lock to avoid spinning on -EDEADLK. While trying to take the
> > > contending lock is probably the best choice there are various second best
> > > approaches that can be explored, for example waiting on the contending
> > > acquire to finish or in the wound-wait case, perhaps do nothing. These
> > > functions will also help us keep the debugging.
> > Hm ... I guess this could work. Trouble is, it only gets rid of the
> > slowpath locking book-keeping headaches, we still have quite a few others.
> >
> > > *) A function returning -EDEADLK to a caller *must* have already released
> > > its own locks.
> > So this ties to another question, as in should these callbacks have to
> > drop the locks they acquire (much simpler code) or not (less thrashing,
> > if we drop locks we might end up in a situation where threads thrash
> > around instead of realizing quicker that they're actually deadlocking and
> > one of them should stop and back off).
>
> Hmm.. Could you describe such a thrashing case with an example?

Ignoring cross device fun and all that, just a simplified example of why
holding onto locks you've acquired for eviction is useful, at least in a
slow path.

- one thread trying to do an execbuf with a huge bo

vs.

- an entire pile of threads that try to do execbuf with just a few small bo

First thread is in the eviction loop, selects a bo, wins against all the
other threads since it's been doing this forever already, gets the bo moved
out, unlocks.

Since it's competing against lots of other threads with small bo, it'll
have to do that a lot of times. Often enough to create a contiguous hole.
If you have a smarter allocator that tries to create that hole more
actively, just assume that the single huge bo is a substantial part of
total vram.

The other threads will be quicker in cramming new stuff in, even if they
occasionally lose the ww dance against the single thread. So the big
thread livelocks.

If otoh the big thread would hold onto all the locks, eventually it would
have the entire vram locked, and every other thread is guaranteed to lose
against it in the ww dance and queue up behind. And it could finally put
its huge bo into vram and execute.

Vary the example for multi-gpu and more realism, but that's roughly it.

Aside, a lot of the stuff Christian has been doing in ttm is to improve
the chances that the competing threads will hit one of the locked objects
of the big thread, and at least back off a bit. That's at least my
understanding of what's been happening.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

6 years, 2 months
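
Since the thread above keeps contrasting wait-die with wound-wait, a
minimal sketch of how the two policies resolve a single conflict may help.
It assumes the usual convention that a lower ticket stamp means an older
transaction. The enum and both functions are hypothetical names for this
example only and are not part of the kernel's ww_mutex API.

```c
#include <assert.h>

/* Possible results when a transaction requests a lock that another holds.
 * Hypothetical names, for illustration only. */
enum ww_outcome {
    WW_REQUESTER_WAITS,     /* requester sleeps until the lock is released */
    WW_REQUESTER_BACKS_OFF, /* requester "dies": it sees -EDEADLK right away */
    WW_HOLDER_WOUNDED       /* holder is "wounded" and must back off later */
};

/* Wait-die: an older requester waits, a younger requester backs off. */
static enum ww_outcome wait_die(unsigned long requester, unsigned long holder)
{
    assert(requester != holder); /* distinct transactions, distinct stamps */
    return requester < holder ? WW_REQUESTER_WAITS : WW_REQUESTER_BACKS_OFF;
}

/* Wound-wait: an older requester wounds the holder, a younger one waits. */
static enum ww_outcome wound_wait(unsigned long requester, unsigned long holder)
{
    assert(requester != holder);
    return requester < holder ? WW_HOLDER_WOUNDED : WW_REQUESTER_WAITS;
}
```

The WW_HOLDER_WOUNDED case is what Thomas refers to in the quoted text: the
wounded transaction only notices at its next lock attempt, so the lock that
returns -EDEADLK need not be the contended one.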

Re: [Linaro-mm-sig] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

by Daniel Vetter

On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote: > On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote: > > On 2/18/20 10:01 PM, Daniel Vetter wrote: > > > On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware) > > > <thomas_os(a)shipmail.org> wrote: > > > > On 2/17/20 6:55 PM, Daniel Vetter wrote: > > > > > On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote: > > > > > > Implement the importer side of unpinned DMA-buf handling. > > > > > > > > > > > > v2: update page tables immediately > > > > > > > > > > > > Signed-off-by: Christian König <christian.koenig(a)amd.com> > > > > > > --- > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 > > > > > > ++++++++++++++++++++- > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++ > > > > > > 2 files changed, 71 insertions(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > > > > > > index 770baba621b3..48de7624d49c 100644 > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > > > > > > @@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct > > > > > > drm_device *dev, struct dma_buf *dma_buf) > > > > > > return ERR_PTR(ret); > > > > > > } > > > > > > > > > > > > +/** > > > > > > + * amdgpu_dma_buf_move_notify - &attach.move_notify implementation > > > > > > + * > > > > > > + * @attach: the DMA-buf attachment > > > > > > + * > > > > > > + * Invalidate the DMA-buf attachment, making sure that > > > > > > the we re-create the > > > > > > + * mapping before the next use. 
> > > > > > + */ > > > > > > +static void > > > > > > +amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach) > > > > > > +{ > > > > > > + struct drm_gem_object *obj = attach->importer_priv; > > > > > > + struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv); > > > > > > + struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); > > > > > > + struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev); > > > > > > + struct ttm_operation_ctx ctx = { false, false }; > > > > > > + struct ttm_placement placement = {}; > > > > > > + struct amdgpu_vm_bo_base *bo_base; > > > > > > + int r; > > > > > > + > > > > > > + if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM) > > > > > > + return; > > > > > > + > > > > > > + r = ttm_bo_validate(&bo->tbo, &placement, &ctx); > > > > > > + if (r) { > > > > > > + DRM_ERROR("Failed to invalidate DMA-buf > > > > > > import (%d))\n", r); > > > > > > + return; > > > > > > + } > > > > > > + > > > > > > + for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) { > > > > > > + struct amdgpu_vm *vm = bo_base->vm; > > > > > > + struct dma_resv *resv = vm->root.base.bo->tbo.base.resv; > > > > > > + > > > > > > + if (ticket) { > > > > > Yeah so this is kinda why I've been a total pain about the > > > > > exact semantics > > > > > of the move_notify hook. I think we should flat-out require > > > > > that importers > > > > > _always_ have a ticket attach when they call this, and that > > > > > they can cope > > > > > with additional locks being taken (i.e. full EDEADLCK) handling. > > > > > > > > > > Simplest way to force that contract is to add a dummy 2nd > > > > > ww_mutex lock to > > > > > the dma_resv object, which we then can take #ifdef > > > > > CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus mabye a WARN_ON(!ticket). > > > > > > > > > > Now the real disaster is how we handle deadlocks. Two issues: > > > > > > > > > > - Ideally we'd keep any lock we've taken locked until the > > > > > end, it helps > > > > > needless backoffs. 
I've played around a bit with that > > > > > but not even poc > > > > > level, just an idea: > > > > > > > > > > https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d2… > > > > > > > > > > > > > > > Idea is essentially to track a list of objects we had to > > > > > lock as part of > > > > > the ttm_bo_validate of the main object. > > > > > > > > > > - Second one is if we get a EDEADLCK on one of these > > > > > sublocks (like the > > > > > one here). We need to pass that up the entire callchain, > > > > > including a > > > > > temporary reference (we have to drop locks to do the > > > > > ww_mutex_lock_slow > > > > > call), and need a custom callback to drop that temporary reference > > > > > (since that's all driver specific, might even be > > > > > internal ww_mutex and > > > > > not anything remotely looking like a normal dma_buf). > > > > > This probably > > > > > needs the exec util helpers from ttm, but at the > > > > > dma_resv level, so that > > > > > we can do something like this: > > > > > > > > > > struct dma_resv_ticket { > > > > > struct ww_acquire_ctx base; > > > > > > > > > > /* can be set by anyone (including other drivers) > > > > > that got hold of > > > > > * this ticket and had to acquire some new lock. This > > > > > lock might > > > > > * protect anything, including driver-internal stuff, and isn't > > > > > * required to be a dma_buf or even just a dma_resv. */ > > > > > struct ww_mutex *contended_lock; > > > > > > > > > > /* callback which the driver (which might be a dma-buf exporter > > > > > * and not matching the driver that started this > > > > > locking ticket) > > > > > * sets together with @contended_lock, for the main > > > > > driver to drop > > > > > * when it calls dma_resv_unlock on the contended_lock. 
*/ > > > > > void (drop_ref*)(struct ww_mutex *contended_lock); > > > > > }; > > > > > > > > > > This is all supremely nasty (also ttm_bo_validate would need to be > > > > > improved to handle these sublocks and random new objects > > > > > that could force > > > > > a ww_mutex_lock_slow). > > > > > > > > > Just a short comment on this: > > > > > > > > Neither the currently used wait-die or the wound-wait algorithm > > > > *strictly* requires a slow lock on the contended lock. For > > > > wait-die it's > > > > just very convenient since it makes us sleep instead of spinning with > > > > -EDEADLK on the contended lock. For wound-wait IIRC one could just > > > > immediately restart the whole locking transaction after an > > > > -EDEADLK, and > > > > the transaction would automatically end up waiting on the contended > > > > lock, provided the mutex lock stealing is not allowed. There is however > > > > a possibility that the transaction will be wounded again on another > > > > lock, taken before the contended lock, but I think there are ways to > > > > improve the wound-wait algorithm to reduce that probability. > > > > > > > > So in short, choosing the wound-wait algorithm instead of wait-die and > > > > perhaps modifying the ww mutex code somewhat would probably help > > > > passing > > > > an -EDEADLK up the call chain without requiring passing the contended > > > > lock, as long as each locker releases its own locks when receiving an > > > > -EDEADLK. > > > Hm this is kinda tempting, since rolling out the full backoff tricker > > > across driver boundaries is going to be real painful. > > > > > > What I'm kinda worried about is the debug/validation checks we're > > > losing with this. The required backoff has this nice property that > > > ww_mutex debug code can check that we've fully unwound everything when > > > we should, that we've blocked on the right lock, and that we're > > > restarting everything without keeling over. 
Without that I think we > > > could end up with situations where a driver in the middle feels like > > > handling the EDEADLCK, which might go well most of the times (the > > > deadlock will probably be mostly within a given driver, not across). > > > Right up to the point where someone creates a deadlock across drivers, > > > and the lack of full rollback will be felt. > > > > > > So not sure whether we can still keep all these debug/validation > > > checks, or whether this is a step too far towards clever tricks. > > > > I think we could definitely find a way to keep debugging to make sure > > everything is unwound before attempting to restart the locking > > transaction. But the debug check that we're restarting on the contended > > lock only really makes sense for wait-die, (and we could easily keep it > > for wait-die). The lock returning -EDEADLK for wound-wait may actually > > not be the contending lock but an arbitrary lock that the wounded > > transaction attempts to take after it is wounded. > > > > So in the end IMO this is a tradeoff between added (possibly severe) > > locking complexity into dma-buf and not being able to switch back to > > wait-die efficiently if we need / want to do that. > > > > /Thomas > > And as a consequence an interface *could* be: > > *) We introduce functions > > void ww_acquire_relax(struct ww_acquire_ctx *ctx); > int ww_acquire_relax_interruptible(struct ww_acquire_ctx *ctx); > > that can be used instead of ww_mutex_lock_slow() in the absence of a > contending lock to avoid spinning on -EDEADLK. While trying to take the > contending lock is probably the best choice there are various second best > approaches that can be explored, for example waiting on the contending > acquire to finish or in the wound-wait case, perhaps do nothing. These > functions will also help us keep the debugging. Hm ... I guess this could work. 
Trouble is, it only gets rid of the slowpath locking book-keeping headaches;
we still have quite a few others.

> *) A function returning -EDEADLK to a caller *must* have already released
> its own locks.

So this ties to another question: should these callbacks have to drop the
locks they acquire (much simpler code) or not (less thrashing; if we drop
locks we might end up in a situation where threads thrash around instead of
realizing quicker that they're actually deadlocking and one of them should
stop and back off).

But keeping locks locked means massive amounts of book-keeping in the dma_resv
layer, so it all goes downhill from there.

> *) move_notify() explicitly takes a struct ww_acquire_ctx * to make sure
> there is no ambiguity. (I think it would be valuable if we could do the same
> for ttm_bo_validate()).

Yeah I think a more explicit locking ctx would be really good no matter what.
Implicitly fishing the acquire_ctx out of the lock for the object you're
called on is kinda nasty.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
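
For illustration only, the rule "a function returning -EDEADLK must have
already released its own locks" can be sketched in plain userspace C. This is
a single-threaded toy model, not kernel code: `struct toy_lock` and
`toy_lock_all()` are hypothetical names invented here, the "lock" merely
records an owner pointer, and the caller is assumed not to hold any of the
locks already.

```c
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a ww_mutex: it only records which context holds it. */
struct toy_lock {
	const void *owner; /* NULL when unlocked */
};

/*
 * Try to take every lock in @locks on behalf of @ctx, in order.
 * On contention, release everything this call acquired and return
 * -EDEADLK, so the caller never sees a partially locked set.
 */
static int toy_lock_all(struct toy_lock **locks, size_t n, const void *ctx)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (locks[i]->owner != NULL) {
			/* Contended: unwind our own locks before returning. */
			while (i-- > 0)
				locks[i]->owner = NULL;
			return -EDEADLK;
		}
		locks[i]->owner = ctx;
	}
	return 0;
}
```

The point of the sketch is only the unwind-before-return contract; it says
nothing about sleeping, fairness, or which lock to wait on after backing off,
which is exactly the part still under discussion above.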

6 years, 2 months

Re: [Linaro-mm-sig] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

by Daniel Vetter

On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware) <thomas_os(a)shipmail.org> wrote: > > On 2/17/20 6:55 PM, Daniel Vetter wrote: > > On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote: > >> Implement the importer side of unpinned DMA-buf handling. > >> > >> v2: update page tables immediately > >> > >> Signed-off-by: Christian König <christian.koenig(a)amd.com> > >> --- > >> drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 ++++++++++++++++++++- > >> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++ > >> 2 files changed, 71 insertions(+), 1 deletion(-) > >> > >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > >> index 770baba621b3..48de7624d49c 100644 > >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > >> @@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct dma_buf *dma_buf) > >> return ERR_PTR(ret); > >> } > >> > >> +/** > >> + * amdgpu_dma_buf_move_notify - &attach.move_notify implementation > >> + * > >> + * @attach: the DMA-buf attachment > >> + * > >> + * Invalidate the DMA-buf attachment, making sure that the we re-create the > >> + * mapping before the next use. 
> >> + */ > >> +static void > >> +amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach) > >> +{ > >> + struct drm_gem_object *obj = attach->importer_priv; > >> + struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv); > >> + struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); > >> + struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev); > >> + struct ttm_operation_ctx ctx = { false, false }; > >> + struct ttm_placement placement = {}; > >> + struct amdgpu_vm_bo_base *bo_base; > >> + int r; > >> + > >> + if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM) > >> + return; > >> + > >> + r = ttm_bo_validate(&bo->tbo, &placement, &ctx); > >> + if (r) { > >> + DRM_ERROR("Failed to invalidate DMA-buf import (%d))\n", r); > >> + return; > >> + } > >> + > >> + for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) { > >> + struct amdgpu_vm *vm = bo_base->vm; > >> + struct dma_resv *resv = vm->root.base.bo->tbo.base.resv; > >> + > >> + if (ticket) { > > Yeah so this is kinda why I've been a total pain about the exact semantics > > of the move_notify hook. I think we should flat-out require that importers > > _always_ have a ticket attach when they call this, and that they can cope > > with additional locks being taken (i.e. full EDEADLCK) handling. > > > > Simplest way to force that contract is to add a dummy 2nd ww_mutex lock to > > the dma_resv object, which we then can take #ifdef > > CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus mabye a WARN_ON(!ticket). > > > > Now the real disaster is how we handle deadlocks. Two issues: > > > > - Ideally we'd keep any lock we've taken locked until the end, it helps > > needless backoffs. I've played around a bit with that but not even poc > > level, just an idea: > > > > https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d2… > > > > Idea is essentially to track a list of objects we had to lock as part of > > the ttm_bo_validate of the main object. 
> > > > - Second one is if we get a EDEADLCK on one of these sublocks (like the > > one here). We need to pass that up the entire callchain, including a > > temporary reference (we have to drop locks to do the ww_mutex_lock_slow > > call), and need a custom callback to drop that temporary reference > > (since that's all driver specific, might even be internal ww_mutex and > > not anything remotely looking like a normal dma_buf). This probably > > needs the exec util helpers from ttm, but at the dma_resv level, so that > > we can do something like this: > > > > struct dma_resv_ticket { > > struct ww_acquire_ctx base; > > > > /* can be set by anyone (including other drivers) that got hold of > > * this ticket and had to acquire some new lock. This lock might > > * protect anything, including driver-internal stuff, and isn't > > * required to be a dma_buf or even just a dma_resv. */ > > struct ww_mutex *contended_lock; > > > > /* callback which the driver (which might be a dma-buf exporter > > * and not matching the driver that started this locking ticket) > > * sets together with @contended_lock, for the main driver to drop > > * when it calls dma_resv_unlock on the contended_lock. */ > > void (drop_ref*)(struct ww_mutex *contended_lock); > > }; > > > > This is all supremely nasty (also ttm_bo_validate would need to be > > improved to handle these sublocks and random new objects that could force > > a ww_mutex_lock_slow). > > > Just a short comment on this: > > Neither the currently used wait-die or the wound-wait algorithm > *strictly* requires a slow lock on the contended lock. For wait-die it's > just very convenient since it makes us sleep instead of spinning with > -EDEADLK on the contended lock. For wound-wait IIRC one could just > immediately restart the whole locking transaction after an -EDEADLK, and > the transaction would automatically end up waiting on the contended > lock, provided the mutex lock stealing is not allowed. 
There is however
> a possibility that the transaction will be wounded again on another
> lock, taken before the contended lock, but I think there are ways to
> improve the wound-wait algorithm to reduce that probability.
>
> So in short, choosing the wound-wait algorithm instead of wait-die and
> perhaps modifying the ww mutex code somewhat would probably help passing
> an -EDEADLK up the call chain without requiring passing the contended
> lock, as long as each locker releases its own locks when receiving an
> -EDEADLK.

Hm this is kinda tempting, since rolling out the full backoff trickery across
driver boundaries is going to be real painful.

What I'm kinda worried about is the debug/validation checks we're losing with
this. The required backoff has this nice property that ww_mutex debug code can
check that we've fully unwound everything when we should, that we've blocked
on the right lock, and that we're restarting everything without keeling over.
Without that I think we could end up with situations where a driver in the
middle feels like handling the EDEADLK, which might go well most of the time
(the deadlock will probably be mostly within a given driver, not across).
Right up to the point where someone creates a deadlock across drivers, and the
lack of full rollback will be felt.

So not sure whether we can still keep all these debug/validation checks, or
whether this is a step too far towards clever tricks.

But definitely a neat idea ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
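
As a rough aid for following the wait-die vs. wound-wait comparison above,
here is a minimal userspace sketch of the decision each algorithm makes when a
transaction requests a lock that another transaction holds. All names
(`toy_ww_algo`, `toy_ww_action`, `toy_ww_resolve`) are hypothetical and exist
only for this example; it assumes unique stamps where a lower stamp means an
older transaction, and it models only the decision, not the locking.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; none of this is kernel API. */
enum toy_ww_algo { TOY_WAIT_DIE, TOY_WOUND_WAIT };

enum toy_ww_action {
	TOY_REQUESTER_WAITS,     /* requester sleeps until the lock is free */
	TOY_REQUESTER_BACKS_OFF, /* requester gets -EDEADLK and must unwind */
	TOY_HOLDER_WOUNDED,      /* holder is asked to back off; requester waits */
};

/* A lower stamp means an older transaction; stamps are assumed unique. */
static enum toy_ww_action toy_ww_resolve(enum toy_ww_algo algo,
					 unsigned long requester_stamp,
					 unsigned long holder_stamp)
{
	bool requester_is_older = requester_stamp < holder_stamp;

	if (algo == TOY_WAIT_DIE)
		return requester_is_older ? TOY_REQUESTER_WAITS
					  : TOY_REQUESTER_BACKS_OFF;

	/* Wound-wait: only an older requester preempts the holder. */
	return requester_is_older ? TOY_HOLDER_WOUNDED : TOY_REQUESTER_WAITS;
}
```

In the wound-wait case the wounded holder typically only notices the wound the
next time it tries to take a lock, which is why the lock that returns -EDEADLK
there need not be the contended one.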

6 years, 2 months

RFC: Unpinned DMA-buf handling

by Christian König

The basic idea has stayed the same since the last version of these patches:
the exporter can provide explicit pin/unpin functions and the importer a
move_notify callback. This allows us to avoid pinning buffers while importers
have a mapping for them.

In contrast to the last version, the locking changes were separated from this
patchset and committed to drm-misc-next. This allows drivers to implement the
new locking semantics without the extra unpinned handling, but of course the
changed locking semantics are still a prerequisite for the unpinned handling.

The last time this set was sent out, the discussion ended by questioning
whether the move_notify callback was really the right approach for notifying
the importers that a buffer is about to change its placement. A possible
alternative would be to add a specially crafted fence object instead.

Let's discuss the different approaches once more,
Christian.

6 years, 2 months

RFC: Unpinned DMA-buf handling

by Christian König

Hi everyone,

hopefully the last iteration of those patches.

For now I've addressed the issue of unmapping imported BOs from the amdgpu
page tables immediately by locking the page tables in place.

For HMM handling we are getting the ability to invalidate BOs without locking
the VM anyway, so this last TODO will probably go away rather soon.

Please comment,
Christian.

6 years, 3 months

Re: [Linaro-mm-sig] [PATCH] dma-buf: Fix a typo in Kconfig

by Daniel Vetter

On Sun, Feb 16, 2020 at 12:47:08PM +0100, Christophe JAILLET wrote:
> A 'h' ismissing in' syncronization'
>
> Signed-off-by: Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>

Applied, thanks for your patch.
-Daniel

> ---
> drivers/dma-buf/Kconfig | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/dma-buf/Kconfig b/drivers/dma-buf/Kconfig
> index 0613bb7770f5..e7d820ce0724 100644
> --- a/drivers/dma-buf/Kconfig
> +++ b/drivers/dma-buf/Kconfig
> @@ -6,7 +6,7 @@ config SYNC_FILE
> default n
> select DMA_SHARED_BUFFER
> ---help---
> - The Sync File Framework adds explicit syncronization via
> + The Sync File Framework adds explicit synchronization via
> userspace. It enables send/receive 'struct dma_fence' objects to/from
> userspace via Sync File fds for synchronization between drivers via
> userspace components. It has been ported from Android.
> --
> 2.20.1
>
> _______________________________________________
> dri-devel mailing list
> dri-devel(a)lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

6 years, 3 months

Re: [Linaro-mm-sig] [PATCH -next] drm/panfrost: Remove set but not used variable 'bo'

by Rob Herring

On Mon, Feb 3, 2020 at 9:33 AM YueHaibing <yuehaibing(a)huawei.com> wrote:
>
> Fixes gcc '-Wunused-but-set-variable' warning:
>
> drivers/gpu/drm/panfrost/panfrost_job.c: In function 'panfrost_job_cleanup':
> drivers/gpu/drm/panfrost/panfrost_job.c:278:31: warning:
> variable 'bo' set but not used [-Wunused-but-set-variable]
>
> commit bdefca2d8dc0 ("drm/panfrost: Add the panfrost_gem_mapping concept")
> involved this unused variable.
>
> Reported-by: Hulk Robot <hulkci(a)huawei.com>
> Signed-off-by: YueHaibing <yuehaibing(a)huawei.com>
> ---
> drivers/gpu/drm/panfrost/panfrost_job.c | 6 +-----
> 1 file changed, 1 insertion(+), 5 deletions(-)

Applied to drm-misc-fixes.

Rob

6 years, 3 months

[PATCH v2 3/4] drm/virtio: move mapping teardown to virtio_gpu_cleanup_object()

by Gerd Hoffmann

Stop sending DETACH_BACKING commands, that will happen anyway when
releasing resources via UNREF. Handle guest-side cleanup in
virtio_gpu_cleanup_object(), called when the host finished processing
the UNREF command.

Signed-off-by: Gerd Hoffmann <kraxel(a)redhat.com>
---
 drivers/gpu/drm/virtio/virtgpu_drv.h    |  2 --
 drivers/gpu/drm/virtio/virtgpu_object.c | 14 ++++++--
 drivers/gpu/drm/virtio/virtgpu_vq.c     | 46 -------------------------
 3 files changed, 12 insertions(+), 50 deletions(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.h b/drivers/gpu/drm/virtio/virtgpu_drv.h
index 1bc13f6b161b..d37ddd7644f6 100644
--- a/drivers/gpu/drm/virtio/virtgpu_drv.h
+++ b/drivers/gpu/drm/virtio/virtgpu_drv.h
@@ -281,8 +281,6 @@ void virtio_gpu_cmd_set_scanout(struct virtio_gpu_device *vgdev,
 int virtio_gpu_object_attach(struct virtio_gpu_device *vgdev,
 			     struct virtio_gpu_object *obj,
 			     struct virtio_gpu_fence *fence);
-void virtio_gpu_object_detach(struct virtio_gpu_device *vgdev,
-			      struct virtio_gpu_object *obj);
 int virtio_gpu_attach_status_page(struct virtio_gpu_device *vgdev);
 int virtio_gpu_detach_status_page(struct virtio_gpu_device *vgdev);
 void virtio_gpu_cursor_ping(struct virtio_gpu_device *vgdev,
diff --git a/drivers/gpu/drm/virtio/virtgpu_object.c b/drivers/gpu/drm/virtio/virtgpu_object.c
index 28a161af7503..bce2b3d843fe 100644
--- a/drivers/gpu/drm/virtio/virtgpu_object.c
+++ b/drivers/gpu/drm/virtio/virtgpu_object.c
@@ -23,6 +23,7 @@
  * WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
  */

+#include <linux/dma-mapping.h>
 #include <linux/moduleparam.h>

 #include "virtgpu_drv.h"
@@ -65,6 +66,17 @@ void virtio_gpu_cleanup_object(struct virtio_gpu_object *bo)
 {
 	struct virtio_gpu_device *vgdev = bo->base.base.dev->dev_private;

+	if (bo->pages) {
+		if (bo->mapped) {
+			dma_unmap_sg(vgdev->vdev->dev.parent,
+				     bo->pages->sgl, bo->mapped,
+				     DMA_TO_DEVICE);
+			bo->mapped = 0;
+		}
+		sg_free_table(bo->pages);
+		bo->pages = NULL;
+		drm_gem_shmem_unpin(&bo->base.base);
+	}
 	virtio_gpu_resource_id_put(vgdev, bo->hw_res_handle);
 	drm_gem_shmem_free_object(&bo->base.base);
 }
@@ -74,8 +86,6 @@ static void virtio_gpu_free_object(struct drm_gem_object *obj)
 	struct virtio_gpu_object *bo = gem_to_virtio_gpu_obj(obj);
 	struct virtio_gpu_device *vgdev = bo->base.base.dev->dev_private;

-	if (bo->pages)
-		virtio_gpu_object_detach(vgdev, bo);
 	if (bo->created) {
 		virtio_gpu_cmd_unref_resource(vgdev, bo);
 		/* completion handler calls virtio_gpu_cleanup_object() */
diff --git a/drivers/gpu/drm/virtio/virtgpu_vq.c b/drivers/gpu/drm/virtio/virtgpu_vq.c
index 4e22c3914f94..87c439156151 100644
--- a/drivers/gpu/drm/virtio/virtgpu_vq.c
+++ b/drivers/gpu/drm/virtio/virtgpu_vq.c
@@ -545,22 +545,6 @@ void virtio_gpu_cmd_unref_resource(struct virtio_gpu_device *vgdev,
 	virtio_gpu_queue_ctrl_buffer(vgdev, vbuf);
 }

-static void virtio_gpu_cmd_resource_inval_backing(struct virtio_gpu_device *vgdev,
-						  uint32_t resource_id,
-						  struct virtio_gpu_fence *fence)
-{
-	struct virtio_gpu_resource_detach_backing *cmd_p;
-	struct virtio_gpu_vbuffer *vbuf;
-
-	cmd_p = virtio_gpu_alloc_cmd(vgdev, &vbuf, sizeof(*cmd_p));
-	memset(cmd_p, 0, sizeof(*cmd_p));
-
-	cmd_p->hdr.type = cpu_to_le32(VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING);
-	cmd_p->resource_id = cpu_to_le32(resource_id);
-
-	virtio_gpu_queue_fenced_ctrl_buffer(vgdev, vbuf, fence);
-}
-
 void virtio_gpu_cmd_set_scanout(struct virtio_gpu_device *vgdev,
 				uint32_t scanout_id, uint32_t resource_id,
 				uint32_t width, uint32_t height,
@@ -1155,36 +1139,6 @@ int virtio_gpu_object_attach(struct virtio_gpu_device *vgdev,
 	return 0;
 }

-void virtio_gpu_object_detach(struct virtio_gpu_device *vgdev,
-			      struct virtio_gpu_object *obj)
-{
-	bool use_dma_api = !virtio_has_iommu_quirk(vgdev->vdev);
-
-	if (WARN_ON_ONCE(!obj->pages))
-		return;
-
-	if (use_dma_api && obj->mapped) {
-		struct virtio_gpu_fence *fence = virtio_gpu_fence_alloc(vgdev);
-		/* detach backing and wait for the host process it ... */
-		virtio_gpu_cmd_resource_inval_backing(vgdev, obj->hw_res_handle, fence);
-		dma_fence_wait(&fence->f, true);
-		dma_fence_put(&fence->f);
-
-		/* ... then tear down iommu mappings */
-		dma_unmap_sg(vgdev->vdev->dev.parent,
-			     obj->pages->sgl, obj->mapped,
-			     DMA_TO_DEVICE);
-		obj->mapped = 0;
-	} else {
-		virtio_gpu_cmd_resource_inval_backing(vgdev, obj->hw_res_handle, NULL);
-	}
-
-	sg_free_table(obj->pages);
-	obj->pages = NULL;
-
-	drm_gem_shmem_unpin(&obj->base.base);
-}
-
 void virtio_gpu_cursor_ping(struct virtio_gpu_device *vgdev,
 			    struct virtio_gpu_output *output)
 {
-- 
2.18.1

6 years, 3 months
