RFC
I don't have a good name for this yet, and I did not spend
any time on documentation (for that reason).
We create fences (out fences) as part of operation execution. These are
short-lived objects, and we want to release all of their memory once
operation execution is completed or when the operation gets
cancelled/deleted via ioctl().
This creates a bit of a problem. DMA fences are refcounted objects and
the exporter never knows when an importer takes or puts a fence
reference, so the exporter never knows when the fence will be destroyed.
That should not be a problem for refcounted objects, but here comes the
twist...
operation A - creates and exports out fence X
... user-space imports fence X
operation A - finishes execution, signals fence X
kfree operation A, put dma_fence
DMA fences are designed to borrow the spinlock that they use to
protect struct dma_fence members:
struct dma_fence {
        spinlock_t *lock;
        const struct dma_fence_ops *ops;
        .....
};

void dma_fence_init(struct dma_fence *fence,
                    const struct dma_fence_ops *ops,
                    spinlock_t *lock,
                    u64 context,
                    u64 seqno);
So the `lock` should have at least the same lifespan as the DMA fence
that borrows it, which is impossible to guarantee in our case. When
we kfree the operation A struct we also kfree the ->lock that the
operation lends to the DMA fence, and the fence outlives operation A
(depending on what fence importers do and when they drop the imported
fence refcount).
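To make the lifetime issue concrete, here is a minimal sketch of the
problematic pattern; struct my_operation and my_operation_finish() are
hypothetical names used purely for illustration:

/*
 * Hypothetical short-lived operation object, for illustration only;
 * not part of this patch.
 */
struct my_operation {
        spinlock_t lock;          /* lent to the out fence below */
        struct dma_fence *fence;  /* refcounted, exported to user-space */
};

static void my_operation_finish(struct my_operation *op)
{
        dma_fence_signal(op->fence);
        dma_fence_put(op->fence); /* importers may still hold references */
        /*
         * The fence can outlive this kfree(), but op->lock - which the
         * fence borrowed via dma_fence_init() - is freed right here.
         */
        kfree(op);
}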
This patch adds a new member to struct dma_fence: __lock_inplace.
It is a lock that the DMA fence will use to protect its own data when
it cannot reliably borrow a lock from an outside object.
I also had a patch that puts the inplace and borrowed locks into an
anonymous union and adds one more dma_fence_flag_bits entry to
distinguish between fences with borrowed and inplace locks:
struct dma_fence {
        union {
                spinlock_t *lock;
                spinlock_t __lock_inplace;
        };
        ...
};
And then, instead of locking/unlocking ->lock directly, we would use
dma_fence_lock_irqsave()/dma_fence_unlock_irqrestore() macros which
would check the fence flags and use either the borrowed or the inplace lock.
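A rough sketch of what those macros could have looked like (the flag bit
and the macros are hypothetical and were never posted):

/*
 * Hypothetical sketch of the scrapped approach;
 * DMA_FENCE_FLAG_INPLACE_LOCK_BIT would have to be added to
 * enum dma_fence_flag_bits.
 */
#define dma_fence_lock_irqsave(fence, flags)                              \
do {                                                                      \
        if (test_bit(DMA_FENCE_FLAG_INPLACE_LOCK_BIT, &(fence)->flags))   \
                spin_lock_irqsave(&(fence)->__lock_inplace, flags);       \
        else                                                              \
                spin_lock_irqsave((fence)->lock, flags);                  \
} while (0)

#define dma_fence_unlock_irqrestore(fence, flags)                         \
do {                                                                      \
        if (test_bit(DMA_FENCE_FLAG_INPLACE_LOCK_BIT, &(fence)->flags))   \
                spin_unlock_irqrestore(&(fence)->__lock_inplace, flags);  \
        else                                                              \
                spin_unlock_irqrestore((fence)->lock, flags);             \
} while (0)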
But after seeing how often drivers directly access fence ->lock I
decided to scratch that approach and just add an extra spinlock member.
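With the approach in this patch, a short-lived exporter would then use
the new helper roughly like this (my_fence_ops and my_create_out_fence()
are hypothetical, shown for illustration only):

static const char *my_fence_get_driver_name(struct dma_fence *fence)
{
        return "my-driver";
}

static const char *my_fence_get_timeline_name(struct dma_fence *fence)
{
        return "my-timeline";
}

static const struct dma_fence_ops my_fence_ops = {
        .get_driver_name   = my_fence_get_driver_name,
        .get_timeline_name = my_fence_get_timeline_name,
};

static struct dma_fence *my_create_out_fence(u64 context, u64 seqno)
{
        struct dma_fence *fence = kzalloc(sizeof(*fence), GFP_KERNEL);

        if (!fence)
                return NULL;

        /*
         * No lock is borrowed from the short-lived operation: the fence
         * protects itself with its own __lock_inplace.
         */
        dma_fence_inplace_lock_init(fence, &my_fence_ops, context, seqno);
        return fence;
}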
Not-Yet-Signed-off-by: Sergey Senozhatsky <senozhatsky(a)chromium.org>
---
 drivers/dma-buf/dma-fence.c | 10 ++++++++++
 include/linux/dma-fence.h   |  6 ++++++
 2 files changed, 16 insertions(+)
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 066400ed8841..7ae40b8adb73 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -958,3 +958,13 @@ dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
trace_dma_fence_init(fence);
}
EXPORT_SYMBOL(dma_fence_init);
+
+void dma_fence_inplace_lock_init(struct dma_fence *fence,
+ const struct dma_fence_ops *ops,
+ u64 context, u64 seqno)
+{
+ spin_lock_init(&fence->__lock_inplace);
+
+ dma_fence_init(fence, ops, &fence->__lock_inplace, context, seqno);
+}
+EXPORT_SYMBOL(dma_fence_inplace_lock_init);
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 1ea691753bd3..6b15a0d2eccf 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -64,6 +64,8 @@ struct dma_fence_cb;
*/
struct dma_fence {
spinlock_t *lock;
+ spinlock_t __lock_inplace;
+
const struct dma_fence_ops *ops;
/*
* We clear the callback list on kref_put so that by the time we
@@ -262,6 +264,10 @@ struct dma_fence_ops {
void dma_fence_init(struct dma_fence *fence, const struct dma_fence_ops *ops,
spinlock_t *lock, u64 context, u64 seqno);
+void dma_fence_inplace_lock_init(struct dma_fence *fence,
+ const struct dma_fence_ops *ops,
+ u64 context, u64 seqno);
+
void dma_fence_release(struct kref *kref);
void dma_fence_free(struct dma_fence *fence);
void dma_fence_describe(struct dma_fence *fence, struct seq_file *seq);
--
2.36.1.124.g0e6072fb45-goog
Hi,
On Wed, 25 May 2022 at 15:07, Simon Ser <contact(a)emersion.fr> wrote:
> On Wednesday, May 25th, 2022 at 15:51, Daniel Vetter <daniel(a)ffwll.ch> wrote:
> > > You can add that to the list of reasons why compositors need to stop
> > > using buffers with unsignaled fences. There's plenty of other reasons
> > > there already (the big one being that otherwise slow clients can slow
> > > down the compositor, even if the compositor uses a high priority context
> > > and the HW supports preemption).
> >
> >
> > Yeah that's tbh another reason why I think we shouldn't do umf as a
> > transparent thing - compositors need to get better anyway, so we might as
> > well take this as a chance to do this right.
>
> As a compositor dev, I agree -- we should definitely be smarter about
> this. Note, it would help a lot to have a good way to integrate the
> waits into a poll(2) event loop.
The same holds for Weston. We're currently working through a bunch of
internal infrastructure to be able to handle this. Mutter (aka GNOME)
is also really well-placed to be able to do this.
Having pollable waits would be really useful, but I don't think it's
essential. In my strawman I'm just waking up at the usual
just-before-repaint point and checking; if it doesn't make it for this
frame then we'll wait for the next frame. If someone submits buffers
which take 4 repaint periods to clear then we'll have 3 'unnecessary'
wakeups, but given the GPU is already slammed, I don't think it's an
efficiency problem. (I don't know if all the other
compositor people share this view.)
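As a strawman-of-the-strawman: assuming the usual model of a user-space
memory fence as a monotonically increasing 64-bit value at a GPU-visible
address (an assumption on my part, not something defined in this
thread), the per-repaint check is just an atomic read:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check at the just-before-repaint point: the buffer is
 * ready once the fence value has reached the submission's wait value.
 */
static bool umf_signalled(_Atomic uint64_t *fence_addr, uint64_t wait_value)
{
        return atomic_load_explicit(fence_addr, memory_order_acquire) >= wait_value;
}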
Cheers,
Daniel
On Wed, May 25, 2022 at 03:28:41PM +0200, Michel Dänzer wrote:
> On 2022-05-25 15:05, Daniel Vetter wrote:
> > On Tue, May 17, 2022 at 12:28:17PM +0200, Christian König wrote:
> >> Am 09.05.22 um 16:10 schrieb Daniel Vetter:
> >>> On Mon, May 09, 2022 at 08:56:41AM +0200, Christian König wrote:
> >>>> Am 04.05.22 um 12:08 schrieb Daniel Vetter:
> >>>>>
> >>>>> If the goal is specifically atomic kms, then there's an entire can of
> >>>>> worms there that I really don't want to think about, but it exists: We
> >>>>> have dma_fence as out-fences from atomic commit, and that's already
> >>>>> massively broken since most drivers allocate some memory or at least take
> >>>>> locks which can allocate memory in their commit path. Like i2c. Putting a
> >>>>> userspace memory fence as in-fence in there makes that problem
> >>>>> substantially worse, since at least in theory you're just not allowed to
> > > > > might_fault in atomic_commit_tail.
> >>>> Yes, that's unfortunately one of the goals as well and yes I completely
> >>>> agree on the can of worms. But I think I've solved that.
> >>>>
> >>>> What I do in the patch set is to enforce that the out fence is a user fence
> >>>> when the driver supports user in-fences as well.
> >>>>
> >>>> Since user fences don't have the memory management dependency, drivers can
> >>>> actually allocate memory or call I2C functions which take locks which have
> >>>> memory allocation dependencies.
> >>>>
> >>>> Or do I miss some other reason why you can't fault or allocate memory in
> >>>> atomic_commit_tail? At least lockdep seems to be happy about that now.
> >>> The problem is a bit that this breaks the uapi already. At least if the
> >>> goal is to have this all be perfectly transparent for userspace - as
> >>> soon as you have multi-gpu setups going on at least.
> >>
> >> Question here is why do you think there is a UAPI break? We currently wait
> >> in a work item already, so where exactly is the problem?
> >
> > It's a bit washy, but dma_fence and hence implicit sync is supposed to
> > finish in finite time. umf just doesn't.
> >
> > Ofc in reality you can still flood your compositor and they're not very
> > robust, but with umf it's trivial to just hang your compositor forever and
> > nothing happens.
>
> You can add that to the list of reasons why compositors need to stop
> using buffers with unsignaled fences. There's plenty of other reasons
> there already (the big one being that otherwise slow clients can slow
> down the compositor, even if the compositor uses a high priority context
> and the HW supports preemption).
Yeah that's tbh another reason why I think we shouldn't do umf as a
transparent thing - compositors need to get better anyway, so we might as
well take this as a chance to do this right.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Hello everyone,
it's a well-known problem that the DMA-buf subsystem mixes synchronization and memory management requirements into the same dma_fence and dma_resv objects. Because of this, dma_fence objects need to guarantee that they complete within a finite amount of time, or otherwise the system can easily deadlock.
One of the few good things about this problem is that it is really well understood by now.
Daniel and others came up with some documentation: https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_…
And Jason did an excellent presentation about that problem at last year's LPC: https://lpc.events/event/11/contributions/1115/
Based on that we have been able to reject new implementations of infinite/user DMA fences and mitigate the effect of the few existing ones.
The still remaining downside is that we don't have a way of using user fences as a dependency in either the explicit (sync_file, drm_syncobj) or the implicit (dma_resv) synchronization objects, resulting in numerous problems and limitations for things like HMM, user queues, etc.
This patch set now tries to tackle this problem by untangling the synchronization from the memory management. What it does *not* try to do is fix the existing kernel fences, because I think we can now all agree that this isn't really possible.
To achieve this goal, what I do in this patch set is add some parallel infrastructure to cleanly separate normal kernel dma_fence objects from indefinite/user fences:
1. It introduces a DMA_FENCE_FLAG_USER define (after renaming some existing driver defines) to note that a certain dma_fence is a user fence and *must* be ignored by memory management and never used as a dependency for normal non-user dma_fence objects.
2. The dma_fence_array and dma_fence_chain containers are modified so that they are marked as user fences whenever any of their contained fences is a user fence (see the sketch after this list).
3. The dma_resv object gets a new DMA_RESV_USAGE_USER flag which must be used with indefinite/user fences and separates those into their own synchronization domain.
4. The existing dma_buf_poll_add_cb() function is modified so that indefinite/user fences are included in the polling.
5. The sync_file synchronization object is modified so that we essentially have two fence streams instead of just one.
6. The drm_syncobj is modified in a similar way. User fences are just ignored unless the driver explicitly states support to wait for them.
7. The DRM subsystem gains a new DRIVER_USER_FENCE flag which drivers can use to indicate the need for user fences. If user fences are used the atomic mode setting starts to support user fences as IN/OUT fences.
8. Lockdep is used at various critical locations to ensure that nobody ever tries to mix user fences with non-user fences.
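As an illustration of point 2 only (DMA_FENCE_FLAG_USER is the flag
proposed in this series, not an existing kernel define, and the helper
name is made up), the container propagation could look roughly like this:

/*
 * Illustrative sketch only: mark a dma_fence_array as a user fence if
 * any of its contained fences is a user fence.
 */
static void my_fence_array_propagate_user_flag(struct dma_fence_array *array)
{
        unsigned int i;

        for (i = 0; i < array->num_fences; i++) {
                if (test_bit(DMA_FENCE_FLAG_USER, &array->fences[i]->flags)) {
                        set_bit(DMA_FENCE_FLAG_USER, &array->base.flags);
                        break;
                }
        }
}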
The general approach is to just ignore user fences unless a driver has explicitly stated support for them.
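On the consumer side, that "ignore unless explicitly supported" rule
could be enforced roughly like the sketch below; struct my_job and
my_add_dependency() are hypothetical and not part of the patch set:

struct my_job;

/*
 * Illustrative sketch only: reject user fences as dependencies unless
 * the caller explicitly opted in. Memory management paths must never
 * wait on user fences.
 */
static int my_add_dependency(struct my_job *job, struct dma_fence *fence,
                             bool allow_user_fences)
{
        if (test_bit(DMA_FENCE_FLAG_USER, &fence->flags) && !allow_user_fences)
                return -EINVAL;

        /* ...add the fence to the job's dependency list as usual... */
        return 0;
}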
On top of all of this I've hacked amdgpu so that we add the resulting CS fence to the dma_resv object only as a kernel dependency, and additionally add it wrapped up in a dma_fence_array together with a stub user fence.
The result is that the newly added atomic modeset functions now correctly wait for the user fence to complete before doing the flip. And dependent CS don't pipeline any more, but rather block on the CPU before submitting work.
After tons of debugging and testing everything now seems to not go up in flames immediately and even lockdep is happy with the annotations.
I'm perfectly aware that this is probably by far the most controversial patch set I've ever created and I really wish we didn't need it. But we certainly have the requirement for this and I don't see much other chance of getting this working in a UAPI-compatible way.
Thoughts/comments?
Regards,
Christian.