On Fri, Dec 05, 2025 at 04:18:38PM +0900, Byungchul Park wrote:
> Add documents describing the concept and APIs of dept.
>
> Signed-off-by: Byungchul Park <byungchul(a)sk.com>
> ---
> Documentation/dev-tools/dept.rst | 778 +++++++++++++++++++++++++++
> Documentation/dev-tools/dept_api.rst | 125 +++++
You forgot to add toctree entries:
---- >8 ----
diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst
index 4b8425e348abd1..02c858f5ed1fa2 100644
--- a/Documentation/dev-tools/index.rst
+++ b/Documentation/dev-tools/index.rst
@@ -22,6 +22,8 @@ Documentation/process/debugging/index.rst
clang-format
coccinelle
sparse
+ dept
+ dept_api
kcov
gcov
kasan
> +Lockdep detects a deadlock by checking lock acquisition order. For
> +example, a graph to track acquisition order built by lockdep might look
> +like:
> +
> +.. literal::
> +
> + A -> B -
> + \
> + -> E
> + /
> + C -> D -
> +
> + where 'A -> B' means that acquisition A is prior to acquisition B
> + with A still held.
Use the code-block directive for literal code blocks:
---- >8 ----
diff --git a/Documentation/dev-tools/dept.rst b/Documentation/dev-tools/dept.rst
index 333166464543d7..8394c4ea81bc2a 100644
--- a/Documentation/dev-tools/dept.rst
+++ b/Documentation/dev-tools/dept.rst
@@ -10,7 +10,7 @@ Lockdep detects a deadlock by checking lock acquisition order. For
example, a graph to track acquisition order built by lockdep might look
like:
-.. literal::
+.. code-block::
A -> B -
\
@@ -25,7 +25,7 @@ Lockdep keeps adding each new acquisition order into the graph at
runtime. For example, 'E -> C' will be added when the two locks have
been acquired in the order, E and then C. The graph will look like:
-.. literal::
+.. code-block::
A -> B -
\
@@ -41,7 +41,7 @@ been acquired in the order, E and then C. The graph will look like:
This graph contains a subgraph that demonstrates a loop like:
-.. literal::
+.. code-block::
-> E -
/ \
@@ -76,7 +76,7 @@ e.g. irq context, normal process context, wq worker context, or so on.
Can lockdep detect the following deadlock?
-.. literal::
+.. code-block::
context X context Y context Z
@@ -91,7 +91,7 @@ Can lockdep detect the following deadlock?
No. What about the following?
-.. literal::
+.. code-block::
context X context Y
@@ -116,7 +116,7 @@ What leads a deadlock
A deadlock occurs when one or multi contexts are waiting for events that
will never happen. For example:
-.. literal::
+.. code-block::
context X context Y context Z
@@ -148,7 +148,7 @@ In terms of dependency:
Dependency graph reflecting this example will look like:
-.. literal::
+.. code-block::
-> C -> A -> B -
/ \
@@ -171,7 +171,7 @@ Introduce DEPT
DEPT(DEPendency Tracker) tracks wait and event instead of lock
acquisition order so as to recognize the following situation:
-.. literal::
+.. code-block::
context X context Y context Z
@@ -186,7 +186,7 @@ acquisition order so as to recognize the following situation:
and builds up a dependency graph at runtime that is similar to lockdep.
The graph might look like:
-.. literal::
+.. code-block::
-> C -> A -> B -
/ \
@@ -199,7 +199,7 @@ DEPT keeps adding each new dependency into the graph at runtime. For
example, 'B -> D' will be added when event D occurrence is a
prerequisite to reaching event B like:
-.. literal::
+.. code-block::
context W
@@ -211,7 +211,7 @@ prerequisite to reaching event B like:
After the addition, the graph will look like:
-.. literal::
+.. code-block::
-> D
/
@@ -236,7 +236,7 @@ How DEPT works
Let's take a look how DEPT works with the 1st example in the section
'Limitation of lockdep'.
-.. literal::
+.. code-block::
context X context Y context Z
@@ -256,7 +256,7 @@ event.
Adding comments to describe DEPT's view in detail:
-.. literal::
+.. code-block::
context X context Y context Z
@@ -293,7 +293,7 @@ Adding comments to describe DEPT's view in detail:
Let's build up dependency graph with this example. Firstly, context X:
-.. literal::
+.. code-block::
context X
@@ -304,7 +304,7 @@ Let's build up dependency graph with this example. Firstly, context X:
There are no events to create dependency. Next, context Y:
-.. literal::
+.. code-block::
context Y
@@ -332,7 +332,7 @@ event A cannot be triggered if wait B cannot be awakened by event B.
Therefore, we can say event A depends on event B, say, 'A -> B'. The
graph will look like after adding the dependency:
-.. literal::
+.. code-block::
A -> B
@@ -340,7 +340,7 @@ graph will look like after adding the dependency:
Lastly, context Z:
-.. literal::
+.. code-block::
context Z
@@ -362,7 +362,7 @@ triggered if wait A cannot be awakened by event A. Therefore, we can
say event B depends on event A, say, 'B -> A'. The graph will look like
after adding the dependency:
-.. literal::
+.. code-block::
-> A -> B -
/ \
@@ -386,7 +386,7 @@ Interpret DEPT report
The following is the same example in the section 'How DEPT works'.
-.. literal::
+.. code-block::
context X context Y context Z
@@ -425,7 +425,7 @@ We can simplify this by labeling each waiting point with [W], each
point where its event's context starts with [S] and each event with [E].
This example will look like after the labeling:
-.. literal::
+.. code-block::
context X context Y context Z
@@ -443,7 +443,7 @@ DEPT uses the symbols [W], [S] and [E] in its report as described above.
The following is an example reported by DEPT for a real problem in
practice.
-.. literal::
+.. code-block::
Link: https://lore.kernel.org/lkml/6383cde5-cf4b-facf-6e07-1378a485657d@I-love.SA…
Link: https://lore.kernel.org/lkml/1674268856-31807-1-git-send-email-byungchul.pa…
@@ -646,7 +646,7 @@ practice.
Let's take a look at the summary that is the most important part.
-.. literal::
+.. code-block::
---------------------------------------------------
summary
@@ -669,7 +669,7 @@ Let's take a look at the summary that is the most important part.
The summary shows the following scenario:
-.. literal::
+.. code-block::
context A context B context ?(unknown)
@@ -684,7 +684,7 @@ The summary shows the following scenario:
Adding comments to describe DEPT's view in detail:
-.. literal::
+.. code-block::
context A context B context ?(unknown)
@@ -711,7 +711,7 @@ Adding comments to describe DEPT's view in detail:
Let's build up dependency graph with this report. Firstly, context A:
-.. literal::
+.. code-block::
context A
@@ -735,7 +735,7 @@ unlock(&ni->ni_lock:0) depends on folio_unlock(&f1), say,
The graph will look like after adding the dependency:
-.. literal::
+.. code-block::
unlock(&ni->ni_lock:0) -> folio_unlock(&f1)
@@ -743,7 +743,7 @@ The graph will look like after adding the dependency:
Secondly, context B:
-.. literal::
+.. code-block::
context B
@@ -762,7 +762,7 @@ folio_unlock(&f1) depends on unlock(&ni->ni_lock:0), say,
The graph will look like after adding the dependency:
-.. literal::
+.. code-block::
-> unlock(&ni->ni_lock:0) -> folio_unlock(&f1) -
/ \
> +Limitation of lockdep
> +---------------------
> +
> +Lockdep deals with a deadlock by typical lock e.g. spinlock and mutex,
> +that are supposed to be released within the acquisition context.
> +However, when it comes to a deadlock by folio lock that is not supposed
> +to be released within the acquisition context or other general
> +synchronization mechanisms, lockdep doesn't work.
> +
> +NOTE: In this document, 'context' refers to any type of unique context
> +e.g. irq context, normal process context, wq worker context, or so on.
> +
> +Can lockdep detect the following deadlock?
> +
> +.. literal::
> +
> + context X context Y context Z
> +
> + mutex_lock A
> + folio_lock B
> + folio_lock B <- DEADLOCK
> + mutex_lock A <- DEADLOCK
> + folio_unlock B
> + folio_unlock B
> + mutex_unlock A
> + mutex_unlock A
> +
> +No. What about the following?
> +
> +.. literal::
> +
> + context X context Y
> +
> + mutex_lock A
> + mutex_lock A <- DEADLOCK
> + wait_for_complete B <- DEADLOCK
> + complete B
> + mutex_unlock A
> + mutex_unlock A
> +
> +No.
One unanswered question from my v17 review [1]: in the "How DEPT works"
section you explain how DEPT detects the deadlock in the first example (the
one with three contexts). Can you do the same for the second example (the
one with two contexts)?
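For reference, here is my own reading of that second example in DEPT's
wait/event terms (just a sketch; please correct me if I have the
wait/event pairing wrong):

   context X                    context Y

   mutex_lock A
                                mutex_lock A      <- wait 'A'
   wait_for_complete B          <- wait 'B'
                                complete B        <- event 'B'
                                mutex_unlock A
   mutex_unlock A               <- event 'A'

In context X, event A (the unlock) cannot be triggered until wait B is
awakened by event B, so 'A -> B'. In context Y, event B cannot be
triggered until wait A is awakened by event A, so 'B -> A'. The graph
then contains the cycle 'A -> B -> A', which is what I would expect
DEPT to report.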
Thanks.
[1]: https://lore.kernel.org/linux-doc/aN84jKyrE1BumpLj@archie.me/
--
An old man doll... just what I always wanted! - Clara
On Wednesday, 2025-11-26 at 16:44 +0100, Philipp Stanner wrote:
> On Wed, 2025-11-26 at 16:03 +0100, Christian König wrote:
>
> > >
[...]
> > > My hope would be that in the mid-term future we'd get firmware
> > > rings
> > > that can be preempted through a firmware call for all major
> > > hardware.
> > > Then a huge share of our problems would disappear.
> >
> > At least on AMD HW pre-emption is actually horrible unreliable as
> > well.
>
> Do you mean new GPUs with firmware scheduling, or what is "HW pre-
> emption"?
>
> With firmware interfaces, my hope would be that you could simply tell
>
> stop_running_ring(nr_of_ring)
> // time slice for someone else
> start_running_ring(nr_of_ring)
>
> Thereby getting real scheduling and all that. And eliminating many
> other problems we know well from drm/sched.
It doesn't really matter whether you have firmware scheduling or not;
preemption is a hard problem on GPUs either way. CPUs have a limited
amount of software-visible state that needs to be saved/restored on a
context switch, and even there people are now starting to complain
about having to context switch the AVX512 register set.
GPUs have megabytes of software-visible state, which needs to be
saved/restored on a context switch if you want fine-grained preemption
with low preemption latency. There might be points in the command
execution where you can ignore most of that state, but reaching those
points can have basically unbounded latency. So either you can reliably
save/restore lots of state, or you are limited to very coarse-grained
preemption with all the usual issues of timeouts and DoS vectors.
I'm not totally up to speed with the current state across all relevant
GPUs, but until recently NVidia was the only vendor to have really
reliable fine-grained preemption.
Regards,
Lucas
On Sun, Nov 23, 2025 at 10:51:25PM +0000, Pavel Begunkov wrote:
> +struct dma_token *blkdev_dma_map(struct file *file,
> + struct dma_token_params *params)
Given that this is a direct file operation instance it should be
in block/fops.c. If we do want a generic helper below it, it
should take a struct block_device instead. But we can probably
defer that until a user for that shows up.
> +static void nvme_sync_dma(struct nvme_dev *nvme_dev, struct request *req,
> + enum dma_data_direction dir)
> +{
> + struct blk_mq_dma_map *map = req->dma_map;
> + int length = blk_rq_payload_bytes(req);
> + bool for_cpu = dir == DMA_FROM_DEVICE;
> + struct device *dev = nvme_dev->dev;
> + dma_addr_t *dma_list = map->private;
> + struct bio *bio = req->bio;
> + int offset, map_idx;
> +
> + offset = bio->bi_iter.bi_bvec_done;
> + map_idx = offset / NVME_CTRL_PAGE_SIZE;
> + length += offset & (NVME_CTRL_PAGE_SIZE - 1);
> +
> + while (length > 0) {
> + u64 dma_addr = dma_list[map_idx++];
> +
> + if (for_cpu)
> + __dma_sync_single_for_cpu(dev, dma_addr,
> + NVME_CTRL_PAGE_SIZE, dir);
> + else
> + __dma_sync_single_for_device(dev, dma_addr,
> + NVME_CTRL_PAGE_SIZE, dir);
> + length -= NVME_CTRL_PAGE_SIZE;
> + }
This looks really inefficient. Usually the ranges in the dmabuf should
be much larger than a controller page.
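E.g. coalescing contiguous entries before each sync call, roughly like
this (untested sketch; 'nr_entries' is a hypothetical bound on the
dma_list):

	while (length > 0) {
		u64 dma_addr = dma_list[map_idx++];
		size_t len = NVME_CTRL_PAGE_SIZE;

		/* merge physically contiguous controller pages into one sync */
		while (length > len && map_idx < map->nr_entries &&
		       dma_list[map_idx] == dma_addr + len) {
			len += NVME_CTRL_PAGE_SIZE;
			map_idx++;
		}
		if (for_cpu)
			__dma_sync_single_for_cpu(dev, dma_addr, len, dir);
		else
			__dma_sync_single_for_device(dev, dma_addr, len, dir);
		length -= len;
	}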
> +static void nvme_unmap_premapped_data(struct nvme_dev *dev,
> + struct request *req)
> +{
> + struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +
> + if (rq_data_dir(req) == READ)
> + nvme_sync_dma(dev, req, DMA_FROM_DEVICE);
> + if (!(iod->flags & IOD_SINGLE_SEGMENT))
> + nvme_free_descriptors(req);
> +}
This doesn't really unmap anything :)
Also the DMA ownership rules say that you always need to call the
sync_for_device helpers before I/O and the sync_for_cpu helpers after I/O,
no matter whether it is a read or write. The implementation then makes
them a no-op where possible.
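I.e. roughly this pattern (sketch; rq_dma_dir() is the request's DMA
direction):

	/* submit path: hand the buffer to the device, reads and writes alike */
	dma_sync_single_for_device(dev, dma_addr, len, rq_dma_dir(req));

	/* ... the device performs the transfer ... */

	/* completion path: hand it back to the CPU, again for both directions */
	dma_sync_single_for_cpu(dev, dma_addr, len, rq_dma_dir(req));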
> +
> + offset = bio->bi_iter.bi_bvec_done;
> + map_idx = offset / NVME_CTRL_PAGE_SIZE;
> + offset &= (NVME_CTRL_PAGE_SIZE - 1);
> +
> + prp1_dma = dma_list[map_idx++] + offset;
> +
> + length -= (NVME_CTRL_PAGE_SIZE - offset);
> + if (length <= 0) {
> + prp2_dma = 0;
Urgg, why is this building PRPs instead of SGLs? Yes, SGLs are an
optional feature, but for devices where you want to micro-optimize
like this I think we should simply require them. This should cut
down on both the memory use and the amount of special mapping code.
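For comparison, with SGLs a premapped range that is contiguous in device
address space collapses to a single data block descriptor, something like
(untested sketch):

	struct nvme_sgl_desc *sge = &cmnd->rw.dptr.sgl;

	sge->addr = cpu_to_le64(dma_list[map_idx] + offset);
	sge->length = cpu_to_le32(length);
	sge->type = NVME_SGL_FMT_DATA_DESC << 4;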
On Sun, Nov 23, 2025 at 10:51:25PM +0000, Pavel Begunkov wrote:
> Add blk-mq infrastructure to handle dmabuf tokens. There are two main
Please spell out infrastructure in the subject as well.
> +struct dma_token *blkdev_dma_map(struct file *file,
> + struct dma_token_params *params)
> +{
> + struct request_queue *q = bdev_get_queue(file_bdev(file));
> +
> + if (!(file->f_flags & O_DIRECT))
> + return ERR_PTR(-EINVAL);
Shouldn't the O_DIRECT check be in the caller?
> +++ b/block/blk-mq-dma-token.c
Missing SPDX and Copyright statement.
> @@ -0,0 +1,236 @@
> +#include <linux/blk-mq-dma-token.h>
> +#include <linux/dma-resv.h>
> +
> +struct blk_mq_dma_fence {
> + struct dma_fence base;
> + spinlock_t lock;
> +};
And a high-level comment explaining the fencing logic would be nice
as well.
> + struct blk_mq_dma_map *map = container_of(ref, struct blk_mq_dma_map, refs);
Overly long line.
> +static struct blk_mq_dma_map *blk_mq_alloc_dma_mapping(struct blk_mq_dma_token *token)
Another one. Also kinda inconsistent between _map in the data structure
and _mapping in the function name.
> +static inline
> +struct blk_mq_dma_map *blk_mq_get_token_map(struct blk_mq_dma_token *token)
Really odd return value / scope formatting.
> +{
> + struct blk_mq_dma_map *map;
> +
> + guard(rcu)();
> +
> + map = rcu_dereference(token->map);
> + if (unlikely(!map || !percpu_ref_tryget_live_rcu(&map->refs)))
> + return NULL;
> + return map;
Please use good old rcu_read_unlock to make this readable.
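Something like (untested):

	struct blk_mq_dma_map *map;

	rcu_read_lock();
	map = rcu_dereference(token->map);
	if (unlikely(map && !percpu_ref_tryget_live_rcu(&map->refs)))
		map = NULL;
	rcu_read_unlock();
	return map;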
> + guard(mutex)(&token->mapping_lock);
Same.
> +
> + map = blk_mq_get_token_map(token);
> + if (map)
> + return map;
> +
> + map = blk_mq_alloc_dma_mapping(token);
> + if (IS_ERR(map))
> + return NULL;
> +
> + dma_resv_lock(dmabuf->resv, NULL);
> + ret = dma_resv_wait_timeout(dmabuf->resv, DMA_RESV_USAGE_BOOKKEEP,
> + true, MAX_SCHEDULE_TIMEOUT);
> + ret = ret ? ret : -ETIME;
if (!ret)
ret = -ETIME;
> +blk_status_t blk_rq_assign_dma_map(struct request *rq,
> + struct blk_mq_dma_token *token)
> +{
> + struct blk_mq_dma_map *map;
> +
> + map = blk_mq_get_token_map(token);
> + if (map)
> + goto complete;
> +
> + if (rq->cmd_flags & REQ_NOWAIT)
> + return BLK_STS_AGAIN;
> +
> + map = blk_mq_create_dma_map(token);
> + if (IS_ERR(map))
> + return BLK_STS_RESOURCE;
Having a few comments that say this is creating the map lazily
would probably help the reader. Also, why not keep the !map
case in the branch, as the map case should be the fast path and
thus usually be straight-line code in the function?
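Something like (untested):

	/* fast path: a live mapping already exists */
	map = blk_mq_get_token_map(token);
	if (!map) {
		if (rq->cmd_flags & REQ_NOWAIT)
			return BLK_STS_AGAIN;

		/* slow path: lazily create the DMA mapping on first use */
		map = blk_mq_create_dma_map(token);
		if (IS_ERR(map))
			return BLK_STS_RESOURCE;
	}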
> +void blk_mq_dma_map_move_notify(struct blk_mq_dma_token *token)
> +{
> + blk_mq_dma_map_remove(token);
> +}
Is there a good reason for having this blk_mq_dma_map_move_notify
wrapper?
> + if (bio_flagged(bio, BIO_DMA_TOKEN)) {
> + struct blk_mq_dma_token *token;
> + blk_status_t ret;
> +
> + token = dma_token_to_blk_mq(bio->dma_token);
> + ret = blk_rq_assign_dma_map(rq, token);
> + if (ret) {
> + if (ret == BLK_STS_AGAIN) {
> + bio_wouldblock_error(bio);
> + } else {
> + bio->bi_status = BLK_STS_RESOURCE;
> + bio_endio(bio);
> + }
> + goto queue_exit;
> + }
> + }
Any reason to not just keep the dma_token_to_blk_mq? Also why is this
overriding non-BLK_STS_AGAIN errors with BLK_STS_RESOURCE?
(I really wish we could make all BLK_STS_AGAIN errors be quiet without
the explicit setting of BIO_QUIET, which is a bit annoying, but that's
not for this patch).
> +static inline
> +struct blk_mq_dma_token *dma_token_to_blk_mq(struct dma_token *token)
More odd formatting.