On 12/11/25 16:13, Tvrtko Ursulin wrote:
>
> On 11/12/2025 13:16, Christian König wrote:
>> Using the inline lock is now the recommended way for dma_fence implementations.
>>
>> So use this approach for the scheduler fences as well, just in case
>> anybody uses this as a blueprint for their own implementation.
>>
>> Also saves about 4 bytes for the external spinlock.
>>
>> Signed-off-by: Christian König <christian.koenig(a)amd.com>
>> ---
>> drivers/gpu/drm/scheduler/sched_fence.c | 7 +++----
>> include/drm/gpu_scheduler.h | 4 ----
>> 2 files changed, 3 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
>> index 08ccbde8b2f5..47471b9e43f9 100644
>> --- a/drivers/gpu/drm/scheduler/sched_fence.c
>> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
>> @@ -161,7 +161,7 @@ static void drm_sched_fence_set_deadline_finished(struct dma_fence *f,
>> /* If we already have an earlier deadline, keep it: */
>> if (test_bit(DRM_SCHED_FENCE_FLAG_HAS_DEADLINE_BIT, &f->flags) &&
>> ktime_before(fence->deadline, deadline)) {
>> - spin_unlock_irqrestore(&fence->lock, flags);
>> + dma_fence_unlock_irqrestore(f, flags);
>
> Rebase error I guess. Pull into the locking helpers patch.
No that is actually completely intentional here.
Previously we had a separate lock which protected both the DMA-fences as well as the deadline state.
Now we turn that upside down by dropping the separate lock and protecting the deadline state with the dma_fence lock instead.
Regards,
Christian.
>
> Regards,
>
> Tvrtko
>
>> return;
>> }
>> @@ -217,7 +217,6 @@ struct drm_sched_fence *drm_sched_fence_alloc(struct drm_sched_entity *entity,
>> fence->owner = owner;
>> fence->drm_client_id = drm_client_id;
>> - spin_lock_init(&fence->lock);
>> return fence;
>> }
>> @@ -230,9 +229,9 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
>> fence->sched = entity->rq->sched;
>> seq = atomic_inc_return(&entity->fence_seq);
>> dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
>> - &fence->lock, entity->fence_context, seq);
>> + NULL, entity->fence_context, seq);
>> dma_fence_init(&fence->finished, &drm_sched_fence_ops_finished,
>> - &fence->lock, entity->fence_context + 1, seq);
>> + NULL, entity->fence_context + 1, seq);
>> }
>> module_init(drm_sched_fence_slab_init);
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index fb88301b3c45..b77f24a783e3 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -297,10 +297,6 @@ struct drm_sched_fence {
>> * belongs to.
>> */
>> struct drm_gpu_scheduler *sched;
>> - /**
>> - * @lock: the lock used by the scheduled and the finished fences.
>> - */
>> - spinlock_t lock;
>> /**
>> * @owner: job owner for debugging
>> */
>
On 12/12/25 14:20, Karol Wachowski wrote:
> Add missing drm_gem_object_put() call when drm_gem_object_lookup()
> successfully returns an object. This fixes a GEM object reference
> leak that can prevent driver modules from unloading when using
> prime buffers.
>
> Fixes: 53096728b891 ("drm: Add DRM prime interface to reassign GEM handle")
> Signed-off-by: Karol Wachowski <karol.wachowski(a)linux.intel.com>
> ---
> Changes between v1 and v2:
> - move setting ret value under if branch as suggested in review
> - add Cc: stable 6.18+
Oh don't CC the stable list on the review mail directly, just add "CC: stable(a)vger.kernel.org # 6.18+" to the tags. Greg is going to complain about that :(
With that done Reviewed-by: Christian König <christian.koenig(a)amd.com> and please push to drm-misc-fixes.
If you don't have commit rights for drm-misc-fixes please ping me and I'm going to push that.
Thanks,
Christian.
> ---
> drivers/gpu/drm/drm_gem.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> index ca1956608261..bcc08a6aebf8 100644
> --- a/drivers/gpu/drm/drm_gem.c
> +++ b/drivers/gpu/drm/drm_gem.c
> @@ -1010,8 +1010,10 @@ int drm_gem_change_handle_ioctl(struct drm_device *dev, void *data,
> if (!obj)
> return -ENOENT;
>
> - if (args->handle == args->new_handle)
> - return 0;
> + if (args->handle == args->new_handle) {
> + ret = 0;
> + goto out;
> + }
>
> mutex_lock(&file_priv->prime.lock);
>
> @@ -1043,6 +1045,8 @@ int drm_gem_change_handle_ioctl(struct drm_device *dev, void *data,
>
> out_unlock:
> mutex_unlock(&file_priv->prime.lock);
> +out:
> + drm_gem_object_put(obj);
>
> return ret;
> }
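For readers following along, the cleanup pattern this fix relies on can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct obj, obj_lookup(), obj_put() and change_handle() are hypothetical stand-ins, not the DRM API. The point is that every exit path after a successful lookup funnels through one label that drops the reference.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted GEM object. */
struct obj {
	int refcount;
};

/* Stand-in for a lookup that returns the object with an extra reference. */
static struct obj *obj_lookup(struct obj *table, size_t n, size_t handle)
{
	if (handle >= n)
		return NULL;
	table[handle].refcount++;
	return &table[handle];
}

/* Drop the reference taken by obj_lookup(). */
static void obj_put(struct obj *o)
{
	o->refcount--;
}

/* Every path after a successful lookup leaves through "out". */
static int change_handle(struct obj *table, size_t n,
			 size_t handle, size_t new_handle)
{
	struct obj *o = obj_lookup(table, n, handle);
	int ret;

	if (!o)
		return -ENOENT;	/* nothing was looked up, nothing to put */

	if (handle == new_handle) {
		ret = 0;	/* early exit, but still owns a reference */
		goto out;
	}

	if (new_handle >= n) {
		ret = -EINVAL;
		goto out;
	}

	ret = 0;		/* the real reassignment work would go here */
out:
	obj_put(o);
	return ret;
}
```

With a single exit label, the reference count after the call matches the count before it on every path, which is exactly what the early "return 0" broke.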
On 12/11/25 15:35, Tvrtko Ursulin wrote:
>
> Hi,
>
> On 11/12/2025 13:16, Christian König wrote:
>> Implement per-fence spinlocks, allowing implementations to not give an
>> external spinlock to protect the fence internal state. Instead a spinlock
>> embedded into the fence structure itself is used in this case.
>>
>> Shared spinlocks have the problem that implementations need to guarantee
>> that the lock lives at least as long as all fences referencing it.
>>
>> Using a per-fence spinlock allows completely decoupling spinlock producer
>> and consumer lifetimes, simplifying the handling in most use cases.
>>
>> v2: improve naming, coverage and function documentation
>> v3: fix one additional locking in the selftests
>>
>> Signed-off-by: Christian König <christian.koenig(a)amd.com>
>> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
>
> I don't think I gave r-b on this one. Not just yet at least. Maybe you have missed the comments I had in the previous two rounds? I will repeat them below.
I was already wondering why you gave comments and an rb, but thought that the comments might just be optional.
I'm going to remove that and look at the comments below.
>> @@ -365,7 +364,7 @@ void dma_fence_signal_timestamp_locked(struct dma_fence *fence,
>> struct dma_fence_cb *cur, *tmp;
>> struct list_head cb_list;
>> - lockdep_assert_held(fence->lock);
>> + lockdep_assert_held(dma_fence_spinlock(fence));
>> if (unlikely(test_and_set_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
>> &fence->flags)))
>> @@ -412,9 +411,9 @@ void dma_fence_signal_timestamp(struct dma_fence *fence, ktime_t timestamp)
>> if (WARN_ON(!fence))
>> return;
>> - spin_lock_irqsave(fence->lock, flags);
>> + dma_fence_lock_irqsave(fence, flags);
>
> For the locking wrappers I think it would be better to introduce them in a purely mechanical patch preceding this one. That is, just add the wrappers and nothing else.
That doesn't fully work for all cases, but I will separate it out a bit more.
>> static inline uint64_t amdgpu_vm_tlb_seq(struct amdgpu_vm *vm)
>> {
>> + struct dma_fence *fence;
>> unsigned long flags;
>> - spinlock_t *lock;
>> /*
>> * Workaround to stop racing between the fence signaling and handling
>> - * the cb. The lock is static after initially setting it up, just make
>> - * sure that the dma_fence structure isn't freed up.
>> + * the cb.
>> */
>> rcu_read_lock();
>> - lock = vm->last_tlb_flush->lock;
>> + fence = dma_fence_get_rcu(vm->last_tlb_flush);
>
> Why does this belong here? If taking a reference fixes some race it needs to be a separate patch. If it doesn't then this patch shouldn't be adding it.
The code previously assumed that the lock is global and can't go away while the function is called. When we start to use an inline lock that assumption is not true any more.
But you're right that can be a separate patch.
>> @@ -362,6 +368,38 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
>> } while (1);
>> }
>> +/**
>> + * dma_fence_spinlock - return pointer to the spinlock protecting the fence
>> + * @fence: the fence to get the lock from
>> + *
>> + * Return either the pointer to the embedded or the external spin lock.
>> + */
>> +static inline spinlock_t *dma_fence_spinlock(struct dma_fence *fence)
>> +{
>> + return test_bit(DMA_FENCE_FLAG_INLINE_LOCK_BIT, &fence->flags) ?
>> + &fence->inline_lock : fence->extern_lock;
>
> Is sprinkling of conditionals better than growing the struct? Probably yes, since branch misses are cheaper than cache misses. Unless the code grows significantly on some hot path and we get instruction cache misses instead. Who knows. But let's say in the commit message that we considered it and decided on this solution due to xyz.
Sure.
>
> On a quick grep there is one arch where this grows the struct past a cache line anyway, but as it is PA-RISC I guess no one cares. Let's mention that in the commit message as well.
Interesting, I was aware of the problems on Sparc regarding spinlocks, but that PA-RISC also has something more complicated than an int is news to me.
Anyway I agree it doesn't really matter.
Regards,
Christian.
>
> Regards,
>
> Tvrtko
>> +}
>> +
>> +/**
>> + * dma_fence_lock_irqsave - irqsave lock the fence
>> + * @fence: the fence to lock
>> + * @flags: where to store the CPU flags.
>> + *
>> + * Lock the fence, preventing it from changing to the signaled state.
>> + */
>> +#define dma_fence_lock_irqsave(fence, flags) \
>> + spin_lock_irqsave(dma_fence_spinlock(fence), flags)
>> +
>> +/**
>> + * dma_fence_unlock_irqrestore - unlock the fence and irqrestore
>> + * @fence: the fence to unlock
>> + * @flags: the CPU flags to restore
>> + *
>> + * Unlock the fence, allowing it to change its state to signaled again.
>> + */
>> +#define dma_fence_unlock_irqrestore(fence, flags) \
>> + spin_unlock_irqrestore(dma_fence_spinlock(fence), flags)
>> +
>> #ifdef CONFIG_LOCKDEP
>> bool dma_fence_begin_signalling(void);
>> void dma_fence_end_signalling(bool cookie);
>
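The lock selection discussed in this thread can be sketched in userspace C as follows. This is a minimal sketch, not the kernel code: struct fake_lock stands in for spinlock_t, the flag value is made up, and placing the two lock members in a union is an assumption based on the discussion about not growing the struct.

```c
#include <stdbool.h>
#include <stddef.h>

/* Trivial stand-in for spinlock_t; a real lock would go here. */
struct fake_lock {
	bool held;
};

#define FENCE_FLAG_INLINE_LOCK (1UL << 0)	/* hypothetical flag bit */

struct fence {
	unsigned long flags;
	union {	/* assumption: both members share storage */
		struct fake_lock inline_lock;	/* no external lock given */
		struct fake_lock *extern_lock;	/* caller supplied a lock */
	};
};

/* Passing NULL for the external lock selects the embedded one. */
static void fence_init(struct fence *f, struct fake_lock *external)
{
	f->flags = 0;
	if (external) {
		f->extern_lock = external;
	} else {
		f->flags |= FENCE_FLAG_INLINE_LOCK;
		f->inline_lock.held = false;
	}
}

/* Return whichever lock protects this fence. */
static struct fake_lock *fence_lock_ptr(struct fence *f)
{
	return (f->flags & FENCE_FLAG_INLINE_LOCK) ?
		&f->inline_lock : f->extern_lock;
}

static void fence_lock(struct fence *f)
{
	fence_lock_ptr(f)->held = true;
}

static void fence_unlock(struct fence *f)
{
	fence_lock_ptr(f)->held = false;
}
```

Callers only ever go through fence_lock() and fence_unlock(), so they do not need to know which of the two locks is in use. The cost is the one conditional per lock operation that the review comment asks to justify in the commit message.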
On 12/11/25 13:33, Philipp Stanner wrote:
> On Thu, 2025-12-11 at 13:16 +0100, Christian König wrote:
>> Hi everyone,
>>
>> dma_fences have always lived under the tyranny dictated by the module
>> lifetime of their issuer, leading to crashes should anybody still hold
>> a reference to a dma_fence when the module of the issuer is unloaded.
>>
>> The basic problem is that when buffers are shared between drivers,
>> dma_fence objects can leak into external drivers and stay there even
>> after they are signaled. The dma_resv object, for example, only lazily
>> releases dma_fences.
>>
>> So what happens is that when the module which originally created the
>> dma_fence unloads, the dma_fence_ops function table becomes unavailable
>> as well, and so any attempt to release the fence crashes the system.
>>
>> Previously various approaches have been discussed, including changing the
>> locking semantics of the dma_fence callbacks (by me) as well as using the
>> drm scheduler as intermediate layer (by Sima) to disconnect dma_fences
>> from their actual users, but none of them actually solves all problems.
>>
>> Tvrtko did some really nice prerequisite work by protecting the returned
>> strings of the dma_fence_ops by RCU. This way dma_fence creators were
>> able to just wait for an RCU grace period after fence signaling before
>> it was safe to free those data structures.
>>
>> Now this patch set goes a step further and protects the whole
>> dma_fence_ops structure by RCU, so that after the fence signals the
>> pointer to the dma_fence_ops is set to NULL when there is neither a wait
>> nor a release callback given. All functionality which uses the
>> dma_fence_ops reference is put inside an RCU critical section, except for
>> the deprecated issuer specific wait and of course the optional release
>> callback.
>>
>> In addition to the RCU changes, the lock protecting the dma_fence state
>> previously had to be allocated externally. This set now changes the
>> functionality to make that external lock optional and allows dma_fences
>> to use an inline lock and be self contained.
>>
>> v4:
>>
>> Rebases the whole set on upstream changes, especially the cleanup
>> from Philipp in patch "drm/amdgpu: independence for the amdkfd_fence!".
>>
>> Adds two patches which bring the DMA-fence self tests up to date.
>> The first selftest change removes the mock_wait and so actually starts
>> testing the default behavior instead of some hacky implementation in the
>> test. This one should probably go upstream independent of this set.
>> The second drops the mock_fence as well and tests the new RCU and inline
>> spinlock functionality.
>>
>> Especially the first patch still needs a Reviewed-by; apart from that I
>> think I've addressed all review comments.
>>
>> The plan is to push the core DMA-buf changes to drm-misc-next and then the
>> driver specific changes through the driver channels as appropriate.
>
> This does not apply to drm-misc-next (unless I'm screwing up badly).
>
> Where can I apply it? I'd like to test the drm_sched changes before
> this gets merged.
drm-tip from a few days ago, otherwise the xe changes won't work.
Regards,
Christian.
>
> P.
>
>>
>> Please review and comment,
>> Christian.
>>
>>
>
On Fri, Dec 12, 2025 at 4:31 AM Eric Chanudet <echanude(a)redhat.com> wrote:
>
> The system dma-buf heap lets userspace allocate buffers from the page
> allocator. However, these allocations are not accounted for in memcg,
> allowing processes to escape limits that may be configured.
>
> Pass __GFP_ACCOUNT for our allocations to account them in memcg.
We had a discussion just last night in the MM track at LPC about how
shared memory accounted in memcg is pretty broken. Without a way to
identify (and possibly transfer) ownership of a shared buffer, this
makes the accounting of shared memory, and zombie memcg problems
worse. :\
>
> Userspace components using the system heap can be constrained with, e.g:
> systemd-run --user --scope -p MemoryMax=10M ...
>
> Signed-off-by: Eric Chanudet <echanude(a)redhat.com>
> ---
> drivers/dma-buf/heaps/system_heap.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> index 4c782fe33fd4..c91fcdff4b77 100644
> --- a/drivers/dma-buf/heaps/system_heap.c
> +++ b/drivers/dma-buf/heaps/system_heap.c
> @@ -38,10 +38,10 @@ struct dma_heap_attachment {
> bool mapped;
> };
>
> -#define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO)
> +#define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO | __GFP_ACCOUNT)
> #define HIGH_ORDER_GFP (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
> | __GFP_NORETRY) & ~__GFP_RECLAIM) \
> - | __GFP_COMP)
> + | __GFP_COMP | __GFP_ACCOUNT)
> static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};
> /*
> * The selection of the orders used for allocation (1MB, 64K, 4K) is designed
> --
> 2.52.0
>
Hi everyone,

dma_fences have always lived under the tyranny dictated by the module
lifetime of their issuer, leading to crashes should anybody still hold
a reference to a dma_fence when the module of the issuer is unloaded.

The basic problem is that when buffers are shared between drivers,
dma_fence objects can leak into external drivers and stay there even
after they are signaled. The dma_resv object, for example, only lazily
releases dma_fences.

So what happens is that when the module which originally created the
dma_fence unloads, the dma_fence_ops function table becomes unavailable
as well, and so any attempt to release the fence crashes the system.

Previously various approaches have been discussed, including changing the
locking semantics of the dma_fence callbacks (by me) as well as using the
drm scheduler as intermediate layer (by Sima) to disconnect dma_fences
from their actual users, but none of them actually solves all problems.

Tvrtko did some really nice prerequisite work by protecting the returned
strings of the dma_fence_ops by RCU. This way dma_fence creators were
able to just wait for an RCU grace period after fence signaling before
it was safe to free those data structures.

Now this patch set goes a step further and protects the whole
dma_fence_ops structure by RCU, so that after the fence signals the
pointer to the dma_fence_ops is set to NULL when there is neither a wait
nor a release callback given. All functionality which uses the
dma_fence_ops reference is put inside an RCU critical section, except for
the deprecated issuer specific wait and of course the optional release
callback.

In addition to the RCU changes, the lock protecting the dma_fence state
previously had to be allocated externally. This set now changes the
functionality to make that external lock optional and allows dma_fences
to use an inline lock and be self contained.

v4:

Rebases the whole set on upstream changes, especially the cleanup
from Philipp in patch "drm/amdgpu: independence for the amdkfd_fence!".

Adds two patches which bring the DMA-fence self tests up to date.
The first selftest change removes the mock_wait and so actually starts
testing the default behavior instead of some hacky implementation in the
test. This one should probably go upstream independent of this set.
The second drops the mock_fence as well and tests the new RCU and inline
spinlock functionality.

Especially the first patch still needs a Reviewed-by; apart from that I
think I've addressed all review comments.

The plan is to push the core DMA-buf changes to drm-misc-next and then the
driver specific changes through the driver channels as appropriate.

Please review and comment,
Christian.
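
As a rough illustration of the "ops pointer becomes NULL after signaling" idea from the cover letter, here is a userspace sketch using C11 atomics. It is deliberately simplified and makes several assumptions: all names are hypothetical, the "signaled" fallback string is a placeholder, and a plain atomic pointer is NOT a substitute for RCU. The real series needs RCU read-side critical sections so that the issuer's ops table cannot go away while a reader is still calling into it; the sketch only shows the NULL check and fallback.

```c
#include <stdatomic.h>
#include <stddef.h>

struct fence;

/* Hypothetical, cut-down ops table owned by the issuing module. */
struct fence_ops {
	const char *(*get_name)(struct fence *f);
};

struct fence {
	/* Cleared on signal so nothing calls into the issuer afterwards. */
	_Atomic(const struct fence_ops *) ops;
};

/* Example issuer callback and ops table, for demonstration only. */
static const char *demo_name(struct fence *f)
{
	(void)f;
	return "demo";
}

static const struct fence_ops demo_ops = {
	.get_name = demo_name,
};

/* Readers load the pointer once and fall back when it is gone. */
static const char *fence_name(struct fence *f)
{
	const struct fence_ops *ops = atomic_load(&f->ops);

	if (!ops || !ops->get_name)
		return "signaled";	/* placeholder fallback string */
	return ops->get_name(f);
}

/* Signaling detaches the fence from its issuer's ops table. */
static void fence_signal(struct fence *f)
{
	atomic_store(&f->ops, (const struct fence_ops *)NULL);
}
```

Once fence_signal() has run, fence_name() no longer dereferences anything owned by the issuer, which is the property that lets the issuing module go away while references to the fence still exist.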
If there is a large number (hundreds) of dmabufs allocated, the text
output generated from dmabuf_iter_seq_show can exceed common user buffer
sizes (e.g. PAGE_SIZE) necessitating multiple start/stop cycles to
iterate through all dmabufs. However the dmabuf iterator currently
returns NULL in dmabuf_iter_seq_start for all non-zero pos values, which
results in the truncation of the output before all dmabufs are handled.
After dma_buf_iter_begin / dma_buf_iter_next, the refcount of the buffer
is elevated so that the BPF iterator program can run without holding any
locks. When a stop occurs, instead of immediately dropping the reference
on the buffer, stash a pointer to the buffer in seq->priv until
either start is called or the iterator is released. This also enables
the resumption of iteration without first walking through the list of
dmabufs based on the pos value.
Fixes: 76ea95534995 ("bpf: Add dmabuf iterator")
Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
---
kernel/bpf/dmabuf_iter.c | 56 +++++++++++++++++++++++++++++++++++-----
1 file changed, 49 insertions(+), 7 deletions(-)
diff --git a/kernel/bpf/dmabuf_iter.c b/kernel/bpf/dmabuf_iter.c
index 4dd7ef7c145c..cd500248abd9 100644
--- a/kernel/bpf/dmabuf_iter.c
+++ b/kernel/bpf/dmabuf_iter.c
@@ -6,10 +6,33 @@
#include <linux/kernel.h>
#include <linux/seq_file.h>
+struct dmabuf_iter_priv {
+ /*
+ * If this pointer is non-NULL, the buffer's refcount is elevated to
+ * prevent destruction between stop/start. If reading is not resumed and
+ * start is never called again, then dmabuf_iter_seq_fini drops the
+ * reference when the iterator is released.
+ */
+ struct dma_buf *dmabuf;
+};
+
static void *dmabuf_iter_seq_start(struct seq_file *seq, loff_t *pos)
{
- if (*pos)
- return NULL;
+ struct dmabuf_iter_priv *p = seq->private;
+
+ if (*pos) {
+ struct dma_buf *dmabuf = p->dmabuf;
+
+ if (!dmabuf)
+ return NULL;
+
+ /*
+ * Always resume from where we stopped, regardless of the value
+ * of pos.
+ */
+ p->dmabuf = NULL;
+ return dmabuf;
+ }
return dma_buf_iter_begin();
}
@@ -54,8 +77,11 @@ static void dmabuf_iter_seq_stop(struct seq_file *seq, void *v)
{
struct dma_buf *dmabuf = v;
- if (dmabuf)
- dma_buf_put(dmabuf);
+ if (dmabuf) {
+ struct dmabuf_iter_priv *p = seq->private;
+
+ p->dmabuf = dmabuf;
+ }
}
static const struct seq_operations dmabuf_iter_seq_ops = {
@@ -71,11 +97,27 @@ static void bpf_iter_dmabuf_show_fdinfo(const struct bpf_iter_aux_info *aux,
seq_puts(seq, "dmabuf iter\n");
}
+static int dmabuf_iter_seq_init(void *priv, struct bpf_iter_aux_info *aux)
+{
+ struct dmabuf_iter_priv *p = (struct dmabuf_iter_priv *)priv;
+
+ p->dmabuf = NULL;
+ return 0;
+}
+
+static void dmabuf_iter_seq_fini(void *priv)
+{
+ struct dmabuf_iter_priv *p = (struct dmabuf_iter_priv *)priv;
+
+ if (p->dmabuf)
+ dma_buf_put(p->dmabuf);
+}
+
static const struct bpf_iter_seq_info dmabuf_iter_seq_info = {
.seq_ops = &dmabuf_iter_seq_ops,
- .init_seq_private = NULL,
- .fini_seq_private = NULL,
- .seq_priv_size = 0,
+ .init_seq_private = dmabuf_iter_seq_init,
+ .fini_seq_private = dmabuf_iter_seq_fini,
+ .seq_priv_size = sizeof(struct dmabuf_iter_priv),
};
static struct bpf_iter_reg bpf_dmabuf_reg_info = {
base-commit: 30f09200cc4aefbd8385b01e41bde2e4565a6f0e
--
2.52.0.177.g9f829587af-goog
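
The stash-on-stop idea from the commit message can be modelled in userspace C roughly like this. It is a sketch under assumptions: struct item, struct iter_priv and the iter_*() helpers are hypothetical stand-ins for the dmabuf list and the seq_file callbacks, and all locking is left out.

```c
#include <stddef.h>

/* Hypothetical refcounted list element standing in for a dma_buf. */
struct item {
	int refcount;
	struct item *next;
};

struct iter_priv {
	struct item *stashed;	/* holds a reference between stop and start */
};

static struct item *item_get(struct item *it)
{
	if (it)
		it->refcount++;
	return it;
}

static void item_put(struct item *it)
{
	it->refcount--;
}

/* pos == 0 begins a fresh walk, anything else resumes from the stash. */
static struct item *iter_start(struct iter_priv *p, struct item *head, long pos)
{
	if (pos) {
		struct item *it = p->stashed;

		p->stashed = NULL;	/* the reference moves to the caller */
		return it;		/* NULL once the walk has finished */
	}
	return item_get(head);
}

/* Take a reference on the next element, then drop the current one. */
static struct item *iter_next(struct item *cur)
{
	struct item *next = item_get(cur->next);

	item_put(cur);
	return next;
}

/* Keep the reference instead of dropping it, so start can resume here. */
static void iter_stop(struct iter_priv *p, struct item *cur)
{
	if (cur)
		p->stashed = cur;
}

/* Drop a leftover reference if reading was never resumed. */
static void iter_fini(struct iter_priv *p)
{
	if (p->stashed) {
		item_put(p->stashed);
		p->stashed = NULL;
	}
}
```

A reader that stops after every element and restarts with a non-zero pos still visits every item exactly once, and the reference counts return to their initial values after iter_fini().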
On Fri, Dec 05, 2025 at 04:18:38PM +0900, Byungchul Park wrote:
> Add documents describing the concept and APIs of dept.
>
> Signed-off-by: Byungchul Park <byungchul(a)sk.com>
> ---
> Documentation/dev-tools/dept.rst | 778 +++++++++++++++++++++++++++
> Documentation/dev-tools/dept_api.rst | 125 +++++
You forgot to add toctree entries:
---- >8 ----
diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst
index 4b8425e348abd1..02c858f5ed1fa2 100644
--- a/Documentation/dev-tools/index.rst
+++ b/Documentation/dev-tools/index.rst
@@ -22,6 +22,8 @@ Documentation/process/debugging/index.rst
clang-format
coccinelle
sparse
+ dept
+ dept_api
kcov
gcov
kasan
> +Lockdep detects a deadlock by checking lock acquisition order. For
> +example, a graph to track acquisition order built by lockdep might look
> +like:
> +
> +.. literal::
> +
> + A -> B -
> + \
> + -> E
> + /
> + C -> D -
> +
> + where 'A -> B' means that acquisition A is prior to acquisition B
> + with A still held.
Use code-block directive for literal code blocks:
---- >8 ----
diff --git a/Documentation/dev-tools/dept.rst b/Documentation/dev-tools/dept.rst
index 333166464543d7..8394c4ea81bc2a 100644
--- a/Documentation/dev-tools/dept.rst
+++ b/Documentation/dev-tools/dept.rst
@@ -10,7 +10,7 @@ Lockdep detects a deadlock by checking lock acquisition order. For
example, a graph to track acquisition order built by lockdep might look
like:
-.. literal::
+.. code-block::
A -> B -
\
@@ -25,7 +25,7 @@ Lockdep keeps adding each new acquisition order into the graph at
runtime. For example, 'E -> C' will be added when the two locks have
been acquired in the order, E and then C. The graph will look like:
-.. literal::
+.. code-block::
A -> B -
\
@@ -41,7 +41,7 @@ been acquired in the order, E and then C. The graph will look like:
This graph contains a subgraph that demonstrates a loop like:
-.. literal::
+.. code-block::
-> E -
/ \
@@ -76,7 +76,7 @@ e.g. irq context, normal process context, wq worker context, or so on.
Can lockdep detect the following deadlock?
-.. literal::
+.. code-block::
context X context Y context Z
@@ -91,7 +91,7 @@ Can lockdep detect the following deadlock?
No. What about the following?
-.. literal::
+.. code-block::
context X context Y
@@ -116,7 +116,7 @@ What leads a deadlock
A deadlock occurs when one or multi contexts are waiting for events that
will never happen. For example:
-.. literal::
+.. code-block::
context X context Y context Z
@@ -148,7 +148,7 @@ In terms of dependency:
Dependency graph reflecting this example will look like:
-.. literal::
+.. code-block::
-> C -> A -> B -
/ \
@@ -171,7 +171,7 @@ Introduce DEPT
DEPT(DEPendency Tracker) tracks wait and event instead of lock
acquisition order so as to recognize the following situation:
-.. literal::
+.. code-block::
context X context Y context Z
@@ -186,7 +186,7 @@ acquisition order so as to recognize the following situation:
and builds up a dependency graph at runtime that is similar to lockdep.
The graph might look like:
-.. literal::
+.. code-block::
-> C -> A -> B -
/ \
@@ -199,7 +199,7 @@ DEPT keeps adding each new dependency into the graph at runtime. For
example, 'B -> D' will be added when event D occurrence is a
prerequisite to reaching event B like:
-.. literal::
+.. code-block::
context W
@@ -211,7 +211,7 @@ prerequisite to reaching event B like:
After the addition, the graph will look like:
-.. literal::
+.. code-block::
-> D
/
@@ -236,7 +236,7 @@ How DEPT works
Let's take a look how DEPT works with the 1st example in the section
'Limitation of lockdep'.
-.. literal::
+.. code-block::
context X context Y context Z
@@ -256,7 +256,7 @@ event.
Adding comments to describe DEPT's view in detail:
-.. literal::
+.. code-block::
context X context Y context Z
@@ -293,7 +293,7 @@ Adding comments to describe DEPT's view in detail:
Let's build up dependency graph with this example. Firstly, context X:
-.. literal::
+.. code-block::
context X
@@ -304,7 +304,7 @@ Let's build up dependency graph with this example. Firstly, context X:
There are no events to create dependency. Next, context Y:
-.. literal::
+.. code-block::
context Y
@@ -332,7 +332,7 @@ event A cannot be triggered if wait B cannot be awakened by event B.
Therefore, we can say event A depends on event B, say, 'A -> B'. The
graph will look like after adding the dependency:
-.. literal::
+.. code-block::
A -> B
@@ -340,7 +340,7 @@ graph will look like after adding the dependency:
Lastly, context Z:
-.. literal::
+.. code-block::
context Z
@@ -362,7 +362,7 @@ triggered if wait A cannot be awakened by event A. Therefore, we can
say event B depends on event A, say, 'B -> A'. The graph will look like
after adding the dependency:
-.. literal::
+.. code-block::
-> A -> B -
/ \
@@ -386,7 +386,7 @@ Interpret DEPT report
The following is the same example in the section 'How DEPT works'.
-.. literal::
+.. code-block::
context X context Y context Z
@@ -425,7 +425,7 @@ We can simplify this by labeling each waiting point with [W], each
point where its event's context starts with [S] and each event with [E].
This example will look like after the labeling:
-.. literal::
+.. code-block::
context X context Y context Z
@@ -443,7 +443,7 @@ DEPT uses the symbols [W], [S] and [E] in its report as described above.
The following is an example reported by DEPT for a real problem in
practice.
-.. literal::
+.. code-block::
Link: https://lore.kernel.org/lkml/6383cde5-cf4b-facf-6e07-1378a485657d@I-love.SA…
Link: https://lore.kernel.org/lkml/1674268856-31807-1-git-send-email-byungchul.pa…
@@ -646,7 +646,7 @@ practice.
Let's take a look at the summary that is the most important part.
-.. literal::
+.. code-block::
---------------------------------------------------
summary
@@ -669,7 +669,7 @@ Let's take a look at the summary that is the most important part.
The summary shows the following scenario:
-.. literal::
+.. code-block::
context A context B context ?(unknown)
@@ -684,7 +684,7 @@ The summary shows the following scenario:
Adding comments to describe DEPT's view in detail:
-.. literal::
+.. code-block::
context A context B context ?(unknown)
@@ -711,7 +711,7 @@ Adding comments to describe DEPT's view in detail:
Let's build up dependency graph with this report. Firstly, context A:
-.. literal::
+.. code-block::
context A
@@ -735,7 +735,7 @@ unlock(&ni->ni_lock:0) depends on folio_unlock(&f1), say,
The graph will look like after adding the dependency:
-.. literal::
+.. code-block::
unlock(&ni->ni_lock:0) -> folio_unlock(&f1)
@@ -743,7 +743,7 @@ The graph will look like after adding the dependency:
Secondly, context B:
-.. literal::
+.. code-block::
context B
@@ -762,7 +762,7 @@ folio_unlock(&f1) depends on unlock(&ni->ni_lock:0), say,
The graph will look like after adding the dependency:
-.. literal::
+.. code-block::
-> unlock(&ni->ni_lock:0) -> folio_unlock(&f1) -
/ \
> +Limitation of lockdep
> +---------------------
> +
> +Lockdep deals with a deadlock by typical lock e.g. spinlock and mutex,
> +that are supposed to be released within the acquisition context.
> +However, when it comes to a deadlock by folio lock that is not supposed
> +to be released within the acquisition context or other general
> +synchronization mechanisms, lockdep doesn't work.
> +
> +NOTE: In this document, 'context' refers to any type of unique context
> +e.g. irq context, normal process context, wq worker context, or so on.
> +
> +Can lockdep detect the following deadlock?
> +
> +.. literal::
> +
> + context X context Y context Z
> +
> + mutex_lock A
> + folio_lock B
> + folio_lock B <- DEADLOCK
> + mutex_lock A <- DEADLOCK
> + folio_unlock B
> + folio_unlock B
> + mutex_unlock A
> + mutex_unlock A
> +
> +No. What about the following?
> +
> +.. literal::
> +
> + context X context Y
> +
> + mutex_lock A
> + mutex_lock A <- DEADLOCK
> + wait_for_complete B <- DEADLOCK
> + complete B
> + mutex_unlock A
> + mutex_unlock A
> +
> +No.
One unanswered question from my v17 review [1]: You explain in "How DEPT works"
section how DEPT detects deadlock in the first example (the former with three
contexts). Can you do the same on the second example (the latter with two
contexts)?
Thanks.
[1]: https://lore.kernel.org/linux-doc/aN84jKyrE1BumpLj@archie.me/
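
For context, the loop check that the quoted document describes for both lockdep and DEPT boils down to cycle detection in a directed dependency graph. A minimal userspace sketch, assuming a tiny fixed-size adjacency matrix and hypothetical names (struct dep_graph, dep_add()), could look like this; it is not how either tool is implemented.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_NODES 8

struct dep_graph {
	/* edge[a][b] is true when 'a -> b' has been recorded */
	bool edge[MAX_NODES][MAX_NODES];
};

static void dep_init(struct dep_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/* Depth-first search: can 'to' be reached from 'from'? */
static bool reachable(const struct dep_graph *g, int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int n = 0; n < MAX_NODES; n++)
		if (g->edge[from][n] && !seen[n] && reachable(g, n, to, seen))
			return true;
	return false;
}

/* Record 'a -> b'; returns true when the new edge closes a loop. */
static bool dep_add(struct dep_graph *g, int a, int b)
{
	bool seen[MAX_NODES] = { false };
	bool loop = reachable(g, b, a, seen);

	g->edge[a][b] = true;
	return loop;
}
```

Feeding it the graph from the document (A -> B, B -> E, C -> D, D -> E) reports no loop until 'E -> C' is added, at which point the C -> D -> E -> C loop is found.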
--
An old man doll... just what I always wanted! - Clara
On Wednesday, 26.11.2025 at 16:44 +0100, Philipp Stanner wrote:
> On Wed, 2025-11-26 at 16:03 +0100, Christian König wrote:
>
> > >
[...]
> > > My hope would be that in the mid-term future we'd get firmware
> > > rings
> > > that can be preempted through a firmware call for all major
> > > hardware.
> > > Then a huge share of our problems would disappear.
> >
> > At least on AMD HW pre-emption is actually horribly unreliable as
> > well.
>
> Do you mean new GPUs with firmware scheduling, or what is "HW pre-
> emption"?
>
> With firmware interfaces, my hope would be that you could simply tell
>
> stop_running_ring(nr_of_ring)
> // time slice for someone else
> start_running_ring(nr_of_ring)
>
> Thereby getting real scheduling and all that. And eliminating many
> other problems we know well from drm/sched.
It doesn't really matter if you have firmware scheduling or not for
preemption to be a hard problem on GPUs. CPUs have limited software
visible state that needs to be saved/restored on a context switch and
even there people start complaining now that they need to context
switch the AVX512 register set.
GPUs have megabytes of software visible state. Which needs to be
saved/restored on the context switch if you want fine grained
preemption with low preemption latency. There might be points in the
command execution where you can ignore most of that state, but reaching
those points can have basically unbounded latency. So either you can
reliably save/restore lots of state or you are limited to very coarse
grained preemption with all the usual issues of timeouts and DoS
vectors.
I'm not totally up to speed with the current state across all relevant
GPUs, but until recently NVidia was the only vendor to have real
reliable fine-grained preemption.
Regards,
Lucas
On Sun, Nov 23, 2025 at 10:51:25PM +0000, Pavel Begunkov wrote:
> +struct dma_token *blkdev_dma_map(struct file *file,
> + struct dma_token_params *params)
Given that this is a direct file operation instance it should be
in block/fops.c. If we do want a generic helper below it, it
should take a struct block_device instead. But we can probably
defer that until a user for that shows up.
> +static void nvme_sync_dma(struct nvme_dev *nvme_dev, struct request *req,
> + enum dma_data_direction dir)
> +{
> + struct blk_mq_dma_map *map = req->dma_map;
> + int length = blk_rq_payload_bytes(req);
> + bool for_cpu = dir == DMA_FROM_DEVICE;
> + struct device *dev = nvme_dev->dev;
> + dma_addr_t *dma_list = map->private;
> + struct bio *bio = req->bio;
> + int offset, map_idx;
> +
> + offset = bio->bi_iter.bi_bvec_done;
> + map_idx = offset / NVME_CTRL_PAGE_SIZE;
> + length += offset & (NVME_CTRL_PAGE_SIZE - 1);
> +
> + while (length > 0) {
> + u64 dma_addr = dma_list[map_idx++];
> +
> + if (for_cpu)
> + __dma_sync_single_for_cpu(dev, dma_addr,
> + NVME_CTRL_PAGE_SIZE, dir);
> + else
> + __dma_sync_single_for_device(dev, dma_addr,
> + NVME_CTRL_PAGE_SIZE, dir);
> + length -= NVME_CTRL_PAGE_SIZE;
> + }
This looks really inefficient. Usually the ranges in the dmabuf should
be much larger than a controller page.
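
To illustrate the point about range sizes: one way to cut down on the number of sync calls is to merge physically contiguous controller pages into larger ranges first and then issue one sync per range. The following is only a userspace sketch of that merging step with hypothetical names (struct range, coalesce_pages()); it is an assumption about what a more efficient loop could look like, not a proposal for the actual driver code, which may be better off syncing per mapped segment instead.

```c
#include <stddef.h>
#include <stdint.h>

struct range {
	uint64_t addr;
	uint64_t len;
};

/*
 * Merge per-page addresses into contiguous ranges.
 * Returns the number of ranges written to out (at most out_cap).
 */
static size_t coalesce_pages(const uint64_t *pages, size_t n,
			     uint64_t page_size,
			     struct range *out, size_t out_cap)
{
	size_t nr = 0;

	for (size_t i = 0; i < n; i++) {
		/* Extend the previous range if this page directly follows it. */
		if (nr && out[nr - 1].addr + out[nr - 1].len == pages[i]) {
			out[nr - 1].len += page_size;
			continue;
		}
		if (nr == out_cap)
			break;	/* out of space, caller handles the rest */
		out[nr].addr = pages[i];
		out[nr].len = page_size;
		nr++;
	}
	return nr;
}
```

Five pages that form two contiguous runs then need two sync calls instead of five.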
> +static void nvme_unmap_premapped_data(struct nvme_dev *dev,
> + struct request *req)
> +{
> + struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +
> + if (rq_data_dir(req) == READ)
> + nvme_sync_dma(dev, req, DMA_FROM_DEVICE);
> + if (!(iod->flags & IOD_SINGLE_SEGMENT))
> + nvme_free_descriptors(req);
> +}
This doesn't really unmap anything :)
Also the dma ownership rules say that you always need to call the
sync_for_device helpers before I/O and the sync_for_cpu helpers after I/O,
no matter if it is a read or a write. The implementations then make
them a no-op where possible.
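The ordering described above can be sketched in userspace as follows. This is only an illustration under stated assumptions: mock_sync_for_device() and mock_sync_for_cpu() are hypothetical stand-ins for the kernel's dma_sync_single_for_device() and dma_sync_single_for_cpu(), and struct sync_log exists purely to make the call pattern observable.

```c
#include <assert.h>
#include <stddef.h>

enum xfer_dir { XFER_TO_DEVICE, XFER_FROM_DEVICE };

/* Counts how often each ownership transfer happened. */
struct sync_log {
	int for_device;
	int for_cpu;
};

/* Hypothetical stand-in for dma_sync_single_for_device(). */
static void mock_sync_for_device(struct sync_log *log, enum xfer_dir dir)
{
	(void)dir;	/* a real implementation may no-op based on dir */
	log->for_device++;
}

/* Hypothetical stand-in for dma_sync_single_for_cpu(). */
static void mock_sync_for_cpu(struct sync_log *log, enum xfer_dir dir)
{
	(void)dir;
	log->for_cpu++;
}

/* Both syncs are called unconditionally, for reads and writes alike. */
static void do_io(struct sync_log *log, enum xfer_dir dir)
{
	assert(log != NULL);

	mock_sync_for_device(log, dir);	/* hand the buffer to the device */
	/* ... the device performs the transfer here ... */
	mock_sync_for_cpu(log, dir);	/* hand the buffer back to the CPU */
}
```

The caller does not branch on the direction; deciding whether a given sync is a no-op is left to the implementation behind the helpers.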
> +
> + offset = bio->bi_iter.bi_bvec_done;
> + map_idx = offset / NVME_CTRL_PAGE_SIZE;
> + offset &= (NVME_CTRL_PAGE_SIZE - 1);
> +
> + prp1_dma = dma_list[map_idx++] + offset;
> +
> + length -= (NVME_CTRL_PAGE_SIZE - offset);
> + if (length <= 0) {
> + prp2_dma = 0;
Urgg, why is this building PRPs instead of SGLs? Yes, SGLs are an
optional feature, but for devices where you want to micro-optimize
like this I think we should simply require them. This should cut
down on both the memory use and the amount of special mapping code.
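A rough back-of-the-envelope comparison of the memory use, as a userspace sketch. It assumes 4 KiB controller pages, 8-byte PRP entries and 16-byte SGL descriptors, and it ignores that the first entries live in the command itself as well as any list chaining. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CTRL_PAGE_SIZE	4096u	/* assumed controller page size */
#define PRP_ENTRY_SIZE	8u	/* a PRP entry is one 64-bit address */
#define SGL_DESC_SIZE	16u	/* an SGL descriptor is 16 bytes */

/* One PRP entry per controller page touched by the transfer. */
static size_t prp_entries(size_t offset, size_t len)
{
	assert(offset < CTRL_PAGE_SIZE);
	return (offset + len + CTRL_PAGE_SIZE - 1) / CTRL_PAGE_SIZE;
}

static size_t prp_bytes(size_t offset, size_t len)
{
	return prp_entries(offset, len) * PRP_ENTRY_SIZE;
}

/* One SGL data descriptor per contiguous DMA range, whatever its length. */
static size_t sgl_bytes(size_t nr_ranges)
{
	return nr_ranges * SGL_DESC_SIZE;
}
```

Under these assumptions a single contiguous 1 MiB range needs 256 PRP entries (2048 bytes) but only one SGL descriptor (16 bytes), which is where the saving comes from when the dmabuf ranges are large.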
On Sun, Nov 23, 2025 at 10:51:25PM +0000, Pavel Begunkov wrote:
> Add blk-mq infrastructure to handle dmabuf tokens. There are two main
Please spell out infrastructure in the subject as well.
> +struct dma_token *blkdev_dma_map(struct file *file,
> + struct dma_token_params *params)
> +{
> + struct request_queue *q = bdev_get_queue(file_bdev(file));
> +
> + if (!(file->f_flags & O_DIRECT))
> + return ERR_PTR(-EINVAL);
Shouldn't the O_DIRECT check be in the caller?
> +++ b/block/blk-mq-dma-token.c
Missing SPDX and Copyright statement.
> @@ -0,0 +1,236 @@
> +#include <linux/blk-mq-dma-token.h>
> +#include <linux/dma-resv.h>
> +
> +struct blk_mq_dma_fence {
> + struct dma_fence base;
> + spinlock_t lock;
> +};
And a high-level comment explaining the fencing logic would be nice
as well.
> + struct blk_mq_dma_map *map = container_of(ref, struct blk_mq_dma_map, refs);
Overly long line.
> +static struct blk_mq_dma_map *blk_mq_alloc_dma_mapping(struct blk_mq_dma_token *token)
Another one. Also kinda inconsistent between _map in the data structure
and _mapping in the function name.
> +static inline
> +struct blk_mq_dma_map *blk_mq_get_token_map(struct blk_mq_dma_token *token)
Really odd return value / scope formatting.
> +{
> + struct blk_mq_dma_map *map;
> +
> + guard(rcu)();
> +
> + map = rcu_dereference(token->map);
> + if (unlikely(!map || !percpu_ref_tryget_live_rcu(&map->refs)))
> + return NULL;
> + return map;
Please use good old rcu_read_unlock to make this readable.
> + guard(mutex)(&token->mapping_lock);
Same.
> +
> + map = blk_mq_get_token_map(token);
> + if (map)
> + return map;
> +
> + map = blk_mq_alloc_dma_mapping(token);
> + if (IS_ERR(map))
> + return NULL;
> +
> + dma_resv_lock(dmabuf->resv, NULL);
> + ret = dma_resv_wait_timeout(dmabuf->resv, DMA_RESV_USAGE_BOOKKEEP,
> + true, MAX_SCHEDULE_TIMEOUT);
> + ret = ret ? ret : -ETIME;
if (!ret)
ret = -ETIME;
> +blk_status_t blk_rq_assign_dma_map(struct request *rq,
> + struct blk_mq_dma_token *token)
> +{
> + struct blk_mq_dma_map *map;
> +
> + map = blk_mq_get_token_map(token);
> + if (map)
> + goto complete;
> +
> + if (rq->cmd_flags & REQ_NOWAIT)
> + return BLK_STS_AGAIN;
> +
> + map = blk_mq_create_dma_map(token);
> + if (IS_ERR(map))
> + return BLK_STS_RESOURCE;
Having a few comments that say this is creating the map lazily
would probably help the reader. Also why not keep the !map
case in the branch, as the map case should be the fast path and
thus usually be straight line in the function?
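The suggested shape, with the lazy creation kept inside the !map branch, might look like the following userspace sketch. The types and helpers (struct token, token_create_map, assign_map) are hypothetical simplifications, not the functions from the patch, and plain errno values stand in for blk_status_t.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct dma_map { int id; };

struct token {
	struct dma_map *map;	/* NULL until the first request needs it */
	struct dma_map storage;	/* backing storage, avoids malloc here */
	int creations;		/* how often the slow path ran */
};

struct request { struct dma_map *map; };

/* Hypothetical slow path: may block in the real code. */
static struct dma_map *token_create_map(struct token *t)
{
	t->creations++;
	t->map = &t->storage;
	return t->map;
}

static int assign_map(struct request *rq, struct token *t, bool nowait)
{
	struct dma_map *map;

	assert(rq != NULL && t != NULL);

	map = t->map;
	if (!map) {
		/* The map is created lazily on first use. */
		if (nowait)
			return -EAGAIN;
		map = token_create_map(t);
		if (!map)
			return -ENOMEM;
	}

	/* Fast path: straight line, no goto needed. */
	rq->map = map;
	return 0;
}
```

A nowait caller that hits the missing map gets -EAGAIN, and every caller after the first successful blocking one takes the straight-line path.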
> +void blk_mq_dma_map_move_notify(struct blk_mq_dma_token *token)
> +{
> + blk_mq_dma_map_remove(token);
> +}
Is there a good reason for having this blk_mq_dma_map_move_notify
wrapper?
> + if (bio_flagged(bio, BIO_DMA_TOKEN)) {
> + struct blk_mq_dma_token *token;
> + blk_status_t ret;
> +
> + token = dma_token_to_blk_mq(bio->dma_token);
> + ret = blk_rq_assign_dma_map(rq, token);
> + if (ret) {
> + if (ret == BLK_STS_AGAIN) {
> + bio_wouldblock_error(bio);
> + } else {
> + bio->bi_status = BLK_STS_RESOURCE;
> + bio_endio(bio);
> + }
> + goto queue_exit;
> + }
> + }
Any reason to not just keep the dma_token_to_blk_mq? Also why is this
overriding non-BLK_STS_AGAIN errors with BLK_STS_RESOURCE?
(I really wish we could make all BLK_STS_AGAIN errors be quiet without
the explicit setting of BIO_QUIET, which is a bit annoying, but that's
not for this patch).
> +static inline
> +struct blk_mq_dma_token *dma_token_to_blk_mq(struct dma_token *token)
More odd formatting.
> diff --git a/block/bio.c b/block/bio.c
> index 7b13bdf72de0..8793f1ee559d 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -843,6 +843,11 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
> bio_clone_blkg_association(bio, bio_src);
> }
>
> + if (bio_flagged(bio_src, BIO_DMA_TOKEN)) {
> + bio->dma_token = bio_src->dma_token;
> + bio_set_flag(bio, BIO_DMA_TOKEN);
> + }
Historically __bio_clone itself does not clone the payload, just the
bio. But we got rid of the callers that want to clone a bio but not
the payload a long time ago.
I'd suggest a prep patch that moves assigning bi_io_vec from
bio_alloc_clone and bio_init_clone into __bio_clone, and given that they
are the same field that'll take care of the dma token as well.
Alternatively do it in an if/else that the compiler will hopefully
optimize away.
> @@ -1349,6 +1366,10 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
> bio_iov_bvec_set(bio, iter);
> iov_iter_advance(iter, bio->bi_iter.bi_size);
> return 0;
> + } else if (iov_iter_is_dma_token(iter)) {
No else after a return please.
> +++ b/block/blk-merge.c
> @@ -328,6 +328,29 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
> unsigned nsegs = 0, bytes = 0, gaps = 0;
> struct bvec_iter iter;
>
> + if (bio_flagged(bio, BIO_DMA_TOKEN)) {
Please split the dmabuf logic into a self-contained
helper here.
> + int offset = offset_in_page(bio->bi_iter.bi_bvec_done);
> +
> + nsegs = ALIGN(bio->bi_iter.bi_size + offset, PAGE_SIZE);
> + nsegs >>= PAGE_SHIFT;
Why are we hardcoding PAGE_SIZE based "segments" here?
> +
> + if (offset & lim->dma_alignment || bytes & len_align_mask)
> + return -EINVAL;
> +
> + if (bio->bi_iter.bi_size > max_bytes) {
> + bytes = max_bytes;
> + nsegs = (bytes + offset) >> PAGE_SHIFT;
> + goto split;
> + } else if (nsegs > lim->max_segments) {
No else after a goto either.
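For reference, the style point about else after a return or goto, shown on a made-up function (sign_with_else and sign_straight are hypothetical and unrelated to the patch). Both variants behave identically; the second just avoids the redundant else.

```c
#include <assert.h>

/* Discouraged: the else is redundant because the branch above returns. */
static int sign_with_else(int v)
{
	if (v < 0)
		return -1;
	else if (v == 0)
		return 0;
	return 1;
}

/* Preferred: control never falls through a return, so no else is needed. */
static int sign_straight(int v)
{
	if (v < 0)
		return -1;
	if (v == 0)
		return 0;
	return 1;
}

/* Helper to check that both spellings agree for a given input. */
static int styles_agree(int v)
{
	return sign_with_else(v) == sign_straight(v);
}
```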
On Sun, Nov 23, 2025 at 10:51:23PM +0000, Pavel Begunkov wrote:
> We'll need bio_flagged() earlier in bio.h in the next patch, move it
> together with all related helpers, and mark the bio_flagged()'s bio
> argument as const.
>
> Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Looks good:
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Maybe ask Jens to queue it up ASAP to get it out of the way?
On Sun, Nov 23, 2025 at 10:51:22PM +0000, Pavel Begunkov wrote:
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 5b127043a151..1b22594ca35b 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -29,6 +29,7 @@ enum iter_type {
> ITER_FOLIOQ,
> ITER_XARRAY,
> ITER_DISCARD,
> + ITER_DMA_TOKEN,
Please use DMABUF/dmabuf naming everywhere, this is about dmabufs and
not dma in general.
Otherwise this looks good.