On Sun, Nov 23, 2025 at 10:51:25PM +0000, Pavel Begunkov wrote:
> +struct dma_token *blkdev_dma_map(struct file *file,
> + struct dma_token_params *params)
Given that this is a direct file operation instance it should be
in block/fops.c. If we do want a generic helper below it, it
should take a struct block_device instead. But we can probably
defer that until a user for that shows up.
> +static void nvme_sync_dma(struct nvme_dev *nvme_dev, struct request *req,
> + enum dma_data_direction dir)
> +{
> + struct blk_mq_dma_map *map = req->dma_map;
> + int length = blk_rq_payload_bytes(req);
> + bool for_cpu = dir == DMA_FROM_DEVICE;
> + struct device *dev = nvme_dev->dev;
> + dma_addr_t *dma_list = map->private;
> + struct bio *bio = req->bio;
> + int offset, map_idx;
> +
> + offset = bio->bi_iter.bi_bvec_done;
> + map_idx = offset / NVME_CTRL_PAGE_SIZE;
> + length += offset & (NVME_CTRL_PAGE_SIZE - 1);
> +
> + while (length > 0) {
> + u64 dma_addr = dma_list[map_idx++];
> +
> + if (for_cpu)
> + __dma_sync_single_for_cpu(dev, dma_addr,
> + NVME_CTRL_PAGE_SIZE, dir);
> + else
> + __dma_sync_single_for_device(dev, dma_addr,
> + NVME_CTRL_PAGE_SIZE, dir);
> + length -= NVME_CTRL_PAGE_SIZE;
> + }
This looks really inefficient. Usually the ranges in the dmabuf should
be much larger than a controller page.
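Even with the current per-page dma_list, merging contiguous entries into a
single sync call would already help, e.g. (untested sketch, still assuming
the NVME_CTRL_PAGE_SIZE granular dma_list):

	while (length > 0) {
		dma_addr_t start = dma_list[map_idx++];
		int run = NVME_CTRL_PAGE_SIZE;

		/* merge physically contiguous pages into one sync call */
		while (length > run && dma_list[map_idx] == start + run) {
			run += NVME_CTRL_PAGE_SIZE;
			map_idx++;
		}

		if (for_cpu)
			dma_sync_single_for_cpu(dev, start, run, dir);
		else
			dma_sync_single_for_device(dev, start, run, dir);
		length -= run;
	}

but ideally the mapping would record the larger ranges to begin with.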
> +static void nvme_unmap_premapped_data(struct nvme_dev *dev,
> + struct request *req)
> +{
> + struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +
> + if (rq_data_dir(req) == READ)
> + nvme_sync_dma(dev, req, DMA_FROM_DEVICE);
> + if (!(iod->flags & IOD_SINGLE_SEGMENT))
> + nvme_free_descriptors(req);
> +}
This doesn't really unmap anything :)
Also the dma ownership rules say that you always need to call the
sync_to_device helpers before I/O and the sync_to_cpu helpers after I/O,
no matter if it is a read or write. The implementations then make
them a no-op where possible.
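I.e. independent of the request's data direction the handover should look
roughly like this (just a sketch, not the patch code):

	/*
	 * Sketch of the ownership rule for a premapped range: hand the
	 * buffer to the device before issuing the command and take it
	 * back on completion, for both reads and writes.  The dir
	 * argument lets the arch code skip whatever it can.
	 */
	static void premapped_sync(struct device *dev, dma_addr_t addr,
				   size_t len, enum dma_data_direction dir,
				   bool before_io)
	{
		if (before_io)
			dma_sync_single_for_device(dev, addr, len, dir);
		else
			dma_sync_single_for_cpu(dev, addr, len, dir);
	}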
> +
> + offset = bio->bi_iter.bi_bvec_done;
> + map_idx = offset / NVME_CTRL_PAGE_SIZE;
> + offset &= (NVME_CTRL_PAGE_SIZE - 1);
> +
> + prp1_dma = dma_list[map_idx++] + offset;
> +
> + length -= (NVME_CTRL_PAGE_SIZE - offset);
> + if (length <= 0) {
> + prp2_dma = 0;
Urgg, why is this building PRPs instead of SGLs? Yes, SGLs are an
optional feature, but for devices where you want to micro-optimize
like this I think we should simply require them. This should cut
down on both the memory use and the amount of special mapping code.
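For reference, with SGLs one contiguous premapped range collapses into a
single data block descriptor instead of one PRP entry per controller page,
roughly (sketch only, mirroring what nvme-pci does for regular SGL data):

	static void nvme_sgl_set_premapped(struct nvme_sgl_desc *sge,
					   dma_addr_t addr, u32 len)
	{
		sge->addr = cpu_to_le64(addr);
		sge->length = cpu_to_le32(len);
		sge->type = NVME_SGL_FMT_DATA_DESC << 4;
	}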
On Sun, Nov 23, 2025 at 10:51:25PM +0000, Pavel Begunkov wrote:
> Add blk-mq infrastructure to handle dmabuf tokens. There are two main
Please spell out infrastructure in the subject as well.
> +struct dma_token *blkdev_dma_map(struct file *file,
> + struct dma_token_params *params)
> +{
> + struct request_queue *q = bdev_get_queue(file_bdev(file));
> +
> + if (!(file->f_flags & O_DIRECT))
> + return ERR_PTR(-EINVAL);
Shouldn't the O_DIRECT check be in the caller?
> +++ b/block/blk-mq-dma-token.c
Missing SPDX and Copyright statement.
> @@ -0,0 +1,236 @@
> +#include <linux/blk-mq-dma-token.h>
> +#include <linux/dma-resv.h>
> +
> +struct blk_mq_dma_fence {
> + struct dma_fence base;
> + spinlock_t lock;
> +};
And a high-level comment explaining the fencing logic would be nice
as well.
> + struct blk_mq_dma_map *map = container_of(ref, struct blk_mq_dma_map, refs);
Overly long line.
> +static struct blk_mq_dma_map *blk_mq_alloc_dma_mapping(struct blk_mq_dma_token *token)
Another one. Also kinda inconsistent between _map in the data structure
and _mapping in the function name.
> +static inline
> +struct blk_mq_dma_map *blk_mq_get_token_map(struct blk_mq_dma_token *token)
Really odd return value / scope formatting.
> +{
> + struct blk_mq_dma_map *map;
> +
> + guard(rcu)();
> +
> + map = rcu_dereference(token->map);
> + if (unlikely(!map || !percpu_ref_tryget_live_rcu(&map->refs)))
> + return NULL;
> + return map;
Please use good old rcu_read_unlock to make this readable.
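E.g. something like this (untested) is much easier to follow:

	static struct blk_mq_dma_map *
	blk_mq_get_token_map(struct blk_mq_dma_token *token)
	{
		struct blk_mq_dma_map *map;

		rcu_read_lock();
		map = rcu_dereference(token->map);
		if (!map || !percpu_ref_tryget_live_rcu(&map->refs))
			map = NULL;
		rcu_read_unlock();

		return map;
	}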
> + guard(mutex)(&token->mapping_lock);
Same.
> +
> + map = blk_mq_get_token_map(token);
> + if (map)
> + return map;
> +
> + map = blk_mq_alloc_dma_mapping(token);
> + if (IS_ERR(map))
> + return NULL;
> +
> + dma_resv_lock(dmabuf->resv, NULL);
> + ret = dma_resv_wait_timeout(dmabuf->resv, DMA_RESV_USAGE_BOOKKEEP,
> + true, MAX_SCHEDULE_TIMEOUT);
> + ret = ret ? ret : -ETIME;
	if (!ret)
		ret = -ETIME;
> +blk_status_t blk_rq_assign_dma_map(struct request *rq,
> + struct blk_mq_dma_token *token)
> +{
> + struct blk_mq_dma_map *map;
> +
> + map = blk_mq_get_token_map(token);
> + if (map)
> + goto complete;
> +
> + if (rq->cmd_flags & REQ_NOWAIT)
> + return BLK_STS_AGAIN;
> +
> + map = blk_mq_create_dma_map(token);
> + if (IS_ERR(map))
> + return BLK_STS_RESOURCE;
Having a few comments that say this is creating the map lazily
would probably help the reader. Also why not keep the !map
case in the branch, as the map case should be the fast path and
thus usually be straight-line code in the function?
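I.e. roughly (untested):

	map = blk_mq_get_token_map(token);
	if (!map) {
		/* no mapping yet, create it lazily on first use */
		if (rq->cmd_flags & REQ_NOWAIT)
			return BLK_STS_AGAIN;
		map = blk_mq_create_dma_map(token);
		if (IS_ERR(map))
			return BLK_STS_RESOURCE;
	}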
> +void blk_mq_dma_map_move_notify(struct blk_mq_dma_token *token)
> +{
> + blk_mq_dma_map_remove(token);
> +}
Is there a good reason for having this blk_mq_dma_map_move_notify
wrapper?
> + if (bio_flagged(bio, BIO_DMA_TOKEN)) {
> + struct blk_mq_dma_token *token;
> + blk_status_t ret;
> +
> + token = dma_token_to_blk_mq(bio->dma_token);
> + ret = blk_rq_assign_dma_map(rq, token);
> + if (ret) {
> + if (ret == BLK_STS_AGAIN) {
> + bio_wouldblock_error(bio);
> + } else {
> + bio->bi_status = BLK_STS_RESOURCE;
> + bio_endio(bio);
> + }
> + goto queue_exit;
> + }
> + }
Any reason to not just keep the dma_token_to_blk_mq? Also why is this
overriding non-BLK_STS_AGAIN errors with BLK_STS_RESOURCE?
(I really wish we could make all BLK_STS_AGAIN errors be quiet without
the explicit setting of BIO_QUIET, which is a bit annoying, but that's
not for this patch).
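I'd expect the real status to be propagated, i.e. something like (sketch):

		ret = blk_rq_assign_dma_map(rq, token);
		if (ret) {
			if (ret == BLK_STS_AGAIN) {
				bio_wouldblock_error(bio);
			} else {
				bio->bi_status = ret;
				bio_endio(bio);
			}
			goto queue_exit;
		}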
> +static inline
> +struct blk_mq_dma_token *dma_token_to_blk_mq(struct dma_token *token)
More odd formatting.
> diff --git a/block/bio.c b/block/bio.c
> index 7b13bdf72de0..8793f1ee559d 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -843,6 +843,11 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
> bio_clone_blkg_association(bio, bio_src);
> }
>
> + if (bio_flagged(bio_src, BIO_DMA_TOKEN)) {
> + bio->dma_token = bio_src->dma_token;
> + bio_set_flag(bio, BIO_DMA_TOKEN);
> + }
Historically __bio_clone itself does not clone the payload, just the
bio. But we got rid of the callers that want to clone a bio but not
the payload a long time ago.
I'd suggest a prep patch that moves assigning bi_io_vec from
bio_alloc_clone and bio_init_clone into __bio_clone, and given that they
are the same field that'll take care of the dma token as well.
Alternatively do it in an if/else that the compiler will hopefully
optimize away.
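I.e. something like this, assuming dma_token and bi_io_vec really do share
the same storage as the patch implies (sketch):

	if (bio_flagged(bio_src, BIO_DMA_TOKEN)) {
		bio->dma_token = bio_src->dma_token;
		bio_set_flag(bio, BIO_DMA_TOKEN);
	} else {
		bio->bi_io_vec = bio_src->bi_io_vec;
	}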
> @@ -1349,6 +1366,10 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
> bio_iov_bvec_set(bio, iter);
> iov_iter_advance(iter, bio->bi_iter.bi_size);
> return 0;
> + } else if (iov_iter_is_dma_token(iter)) {
No else after a return please.
> +++ b/block/blk-merge.c
> @@ -328,6 +328,29 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
> unsigned nsegs = 0, bytes = 0, gaps = 0;
> struct bvec_iter iter;
>
> + if (bio_flagged(bio, BIO_DMA_TOKEN)) {
Please split the dmabuf logic into a self-contained
helper here.
> + int offset = offset_in_page(bio->bi_iter.bi_bvec_done);
> +
> + nsegs = ALIGN(bio->bi_iter.bi_size + offset, PAGE_SIZE);
> + nsegs >>= PAGE_SHIFT;
Why are we hardcoding PAGE_SIZE based "segments" here?
> +
> + if (offset & lim->dma_alignment || bytes & len_align_mask)
> + return -EINVAL;
> +
> + if (bio->bi_iter.bi_size > max_bytes) {
> + bytes = max_bytes;
> + nsegs = (bytes + offset) >> PAGE_SHIFT;
> + goto split;
> + } else if (nsegs > lim->max_segments) {
No else after a goto either.
On Sun, Nov 23, 2025 at 10:51:23PM +0000, Pavel Begunkov wrote:
> We'll need bio_flagged() earlier in bio.h in the next patch, move it
> together with all related helpers, and mark the bio_flagged()'s bio
> argument as const.
>
> Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Looks good:
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Maybe ask Jens to queue it up ASAP to get it out of the way?
On Sun, Nov 23, 2025 at 10:51:22PM +0000, Pavel Begunkov wrote:
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 5b127043a151..1b22594ca35b 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -29,6 +29,7 @@ enum iter_type {
> ITER_FOLIOQ,
> ITER_XARRAY,
> ITER_DISCARD,
> + ITER_DMA_TOKEN,
Please use DMABUF/dmabuf naming everywhere, this is about dmabufs and
not dma in general.
Otherwise this looks good.
On Sun, Nov 23, 2025 at 10:51:21PM +0000, Pavel Begunkov wrote:
> +static inline struct dma_token *
> +dma_token_create(struct file *file, struct dma_token_params *params)
> +{
> + struct dma_token *res;
> +
> + if (!file->f_op->dma_map)
> + return ERR_PTR(-EOPNOTSUPP);
> + res = file->f_op->dma_map(file, params);
Calling the file operation ->dma_map feels really misleading.
create_token as in the function name is already much better, but
it really is not just dma, but dmabuf related, and that should really
be encoded in the name.
Also why not pass the dmabuf and direction directly instead of wrapping
it in the odd params struct making the whole thing hard to follow?
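I.e. something along these lines (name and signature made up):

	/* in struct file_operations */
	struct dma_token *(*dmabuf_map)(struct file *file,
					struct dma_buf *dmabuf,
					enum dma_data_direction dir);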
For retrieving a pointer to the struct dma_resv for a given GEM object. We
also introduce it in a new trait, BaseObjectPrivate, which we automatically
implement for all gem objects and don't expose to users outside of the
crate.
Signed-off-by: Lyude Paul <lyude(a)redhat.com>
---
rust/kernel/drm/gem/mod.rs | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/rust/kernel/drm/gem/mod.rs b/rust/kernel/drm/gem/mod.rs
index 5c215e83c1b09..ec3c1b1775196 100644
--- a/rust/kernel/drm/gem/mod.rs
+++ b/rust/kernel/drm/gem/mod.rs
@@ -199,6 +199,18 @@ fn create_mmap_offset(&self) -> Result<u64> {
impl<T: IntoGEMObject> BaseObject for T {}
+/// Crate-private base operations shared by all GEM object classes.
+#[expect(unused)]
+pub(crate) trait BaseObjectPrivate: IntoGEMObject {
+ /// Return a pointer to this object's dma_resv.
+ fn raw_dma_resv(&self) -> *mut bindings::dma_resv {
+ // SAFETY: `as_raw()` always returns a valid pointer to the base DRM gem object
+ unsafe { (*self.as_raw()).resv }
+ }
+}
+
+impl<T: IntoGEMObject> BaseObjectPrivate for T {}
+
/// A base GEM object.
///
/// # Invariants
--
2.52.0
On 11/28/25 11:10, Philipp Stanner wrote:
> On Fri, 2025-11-28 at 11:06 +0100, Christian König wrote:
>> On 11/27/25 12:10, Philipp Stanner wrote:
>>> On Thu, 2025-11-13 at 15:51 +0100, Christian König wrote:
>>>> This should allow amdkfd_fences to outlive the amdgpu module.
>>>>
>>>> v2: implement Felix suggestion to lock the fence while signaling it.
>>>>
>>>> Signed-off-by: Christian König <christian.koenig(a)amd.com>
>>>> ---
>>>>
>>>>
>
> […]
>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> index a085faac9fe1..8fac70b839ed 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>>> @@ -1173,7 +1173,7 @@ static void kfd_process_wq_release(struct work_struct *work)
>>>> synchronize_rcu();
>>>> ef = rcu_access_pointer(p->ef);
>>>> if (ef)
>>>> - dma_fence_signal(ef);
>>>> + amdkfd_fence_signal(ef);
>>>>
>>>> kfd_process_remove_sysfs(p);
>>>> kfd_debugfs_remove_process(p);
>>>> @@ -1990,7 +1990,6 @@ kfd_process_gpuid_from_node(struct kfd_process *p, struct kfd_node *node,
>>>> static int signal_eviction_fence(struct kfd_process *p)
>>>> {
>>>> struct dma_fence *ef;
>>>> - int ret;
>>>>
>>>> rcu_read_lock();
>>>> ef = dma_fence_get_rcu_safe(&p->ef);
>>>> @@ -1998,10 +1997,10 @@ static int signal_eviction_fence(struct kfd_process *p)
>>>> if (!ef)
>>>> return -EINVAL;
>>>>
>>>> - ret = dma_fence_signal(ef);
>>>> + amdkfd_fence_signal(ef);
>>>> dma_fence_put(ef);
>>>>
>>>> - return ret;
>>>> + return 0;
>>>
>>> Oh wait, that's the code I'm also touching in my return code series!
>>>
>>> https://lore.kernel.org/dri-devel/cef83fed-5994-4c77-962c-9c7aac9f7306@amd.…
>>>
>>>
>>> Does this series then solve the problem Felix pointed out in
>>> evict_process_worker()?
>>
>> No it doesn't, I wasn't aware that the higher level code actually needs the status. After all Felix is the maintainer of this part.
>>
>> This patch here needs to be rebased on top of yours and changed accordingly to still return the fence status correctly.
>>
>> But thanks for pointing that out.
>
>
> Alright, so my (repaired, v2) status-code-removal series shall enter drm-misc-next first, and then your series here. ACK?
Works for me, I just need both to re-base the amdgpu patches on top.
Christian.
>
>
> P.
On 11/27/25 11:57, Philipp Stanner wrote:
> On Thu, 2025-11-13 at 15:51 +0100, Christian König wrote:
>> Calling dma_fence_is_signaled() here is illegal!
>
> OK, but why is that patch in this series?
Because the next patch depends on it, otherwise the series won't compile.
My plan is to push the amdgpu patches through amd-staging-drm-next as soon as Alex rebases that branch on drm-next during the next cycle.
Regards,
Christian.
>
> P.
>
>>
>> Signed-off-by: Christian König <christian.koenig(a)amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c | 6 ------
>> 1 file changed, 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>> index 1ef758ac5076..09c919f72b6c 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
>> @@ -120,12 +120,6 @@ static bool amdkfd_fence_enable_signaling(struct dma_fence *f)
>> {
>> struct amdgpu_amdkfd_fence *fence = to_amdgpu_amdkfd_fence(f);
>>
>> - if (!fence)
>> - return false;
>> -
>> - if (dma_fence_is_signaled(f))
>> - return true;
>> -
>> if (!fence->svm_bo) {
>> if (!kgd2kfd_schedule_evict_and_restore_process(fence->mm, f))
>> return true;
>
On 11/27/25 10:48, Philipp Stanner wrote:
> On Wed, 2025-11-26 at 16:24 -0500, Kuehling, Felix wrote:
>>
>> On 2025-11-26 08:19, Philipp Stanner wrote:
>>> The return code of dma_fence_signal() is not really useful as there is
>>> nothing reasonable to do if a fence was already signaled. That return
>>> code shall be removed from the kernel.
>>>
>>> Ignore dma_fence_signal()'s return code.
>>
>> I think this is not correct. Looking at the comment in
> > evict_process_worker, we use the return value to decide a race
> > condition where multiple threads are trying to signal the eviction
>> fence. Only one of them should schedule the restore work. And the other
>> ones need to increment the reference count to keep evictions balanced.
>
> Thank you for pointing that out. Seems then amdkfd is the only user who
> actually relies on the feature. See below
>
>>
>> Regards,
>> Felix
>>
>>
>>>
>>> Suggested-by: Christian König <christian.koenig(a)amd.com>
>>> Signed-off-by: Philipp Stanner <phasta(a)kernel.org>
>>> ---
>>> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 5 ++---
>>> 1 file changed, 2 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> index ddfe30c13e9d..950fafa4b3c3 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> @@ -1986,7 +1986,6 @@ kfd_process_gpuid_from_node(struct kfd_process *p, struct kfd_node *node,
>>> static int signal_eviction_fence(struct kfd_process *p)
>>> {
>>> struct dma_fence *ef;
>>> - int ret;
>>>
>>> rcu_read_lock();
>>> ef = dma_fence_get_rcu_safe(&p->ef);
>>> @@ -1994,10 +1993,10 @@ static int signal_eviction_fence(struct kfd_process *p)
>>> if (!ef)
>>> return -EINVAL;
>>>
>>> - ret = dma_fence_signal(ef);
>>> + dma_fence_signal(ef);
>
> The issue now is that dma_fence_signal()'s return code is actually non-
> racy, because check + bit-set are protected by lock.
>
> Christian's new spinlock series would add a lock function for fences:
> https://lore.kernel.org/dri-devel/20251113145332.16805-5-christian.koenig@a…
>
>
> So I suppose this should work:
>
> 	dma_fence_lock_irqsave(ef, flags);
> 	if (dma_fence_test_signaled_flag(ef)) {
> 		dma_fence_unlock_irqrestore(ef, flags);
> 		return true;
> 	}
> 	dma_fence_signal_locked(ef);
> 	dma_fence_unlock_irqrestore(ef, flags);
>
> 	return false;
>
>
> + some cosmetic adjustments for the boolean of course.
>
>
> Would that fly and be reasonable? @Felix, Christian.
I was just about to reply with the same idea when your mail arrived.
So yes looks totally reasonable to me.
Regards,
Christian.
>
>
> P.
On 11/27/25 09:23, Viresh Kumar wrote:
> On 27-11-25, 09:07, Christian König wrote:
>> On 11/27/25 08:40, Viresh Kumar wrote:
>>> Move several dma-buf function declarations under
>>> CONFIG_DMA_SHARED_BUFFER and provide static inline no-op implementations
>>> for the disabled case to allow the callers to build when the feature is
>>> not compiled in.
>>
>> Good point, but which driver actually needs that?
>
> This broke some WIP stuff [1] which isn't posted upstream yet. That's why I
> didn't mention anything in the commit log, though I could have added a comment
> about that in the non-commit-log part.
Well then better send that out with the full patch set.
>> In other words there should be a concrete example of what breaks in the commit message.
>
> It will take time for those changes to be posted, and I'm not sure if they will be
> accepted or not. But either way, this change makes sense in general, so I
> thought there is nothing wrong with getting it upstream right away.
Yeah, when it is unused in the meantime that is usually a no-go, even if I agree that it makes sense.
>>> +static inline struct dma_buf *dma_buf_get(int fd)
>>> +{
>>> + return NULL;
>>
>> And here ERR_PTR(-EINVAL).
>
> I am not really sure if this should be EINVAL in this case. EOPNOTSUPP still
> makes sense as the API isn't supported.
When the API isn't compiled in, the fd can't be valid (because you can't create a dma_buf object in the first place).
So returning -EINVAL still makes a lot of sense.
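The disabled stub would then simply be:

	static inline struct dma_buf *dma_buf_get(int fd)
	{
		/* no exporter can exist without the core, so any fd is bogus */
		return ERR_PTR(-EINVAL);
	}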
Regards,
Christian.
>
>>> +static inline struct dma_buf *dma_buf_iter_begin(void)
>>> +{
>>> + return NULL;
>>> +}
>>> +
>>> +static inline struct dma_buf *dma_buf_iter_next(struct dma_buf *dmbuf)
>>> +{
>>> + return NULL;
>>> +}
>>
>> Those two are only for BPF and not driver use.
>
> Will drop them.
>
On 11/27/25 10:16, Philipp Stanner wrote:
> On Thu, 2025-11-27 at 09:11 +0100, Christian König wrote:
>> On 11/26/25 17:55, Matthew Brost wrote:
>>> On Wed, Nov 26, 2025 at 08:41:27AM -0800, Matthew Brost wrote:
>>>> On Wed, Nov 26, 2025 at 02:19:10PM +0100, Philipp Stanner wrote:
>>>>> The dma_fence framework checks at many places whether the signaled flag
>>>>> of a fence is already set. The code can be simplified and made more
>>>>> readable by providing a helper function for that.
>>>>>
>>>>> Add dma_fence_test_signaled_flag(), which only checks whether a fence is
>>>>> signaled. Use it internally.
>>>>>
>>>>> Suggested-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
>>>>> Signed-off-by: Philipp Stanner <phasta(a)kernel.org>
>>>>
>>>> This is a nice cleanup:
>>>> Reviewed-by: Matthew Brost <matthew.brost(a)intel.com>
>>>>
>>>>> ---
>>>>> drivers/dma-buf/dma-fence.c | 19 +++++++++----------
>>>>> include/linux/dma-fence.h | 24 ++++++++++++++++++++++--
>>>>> 2 files changed, 31 insertions(+), 12 deletions(-)
>>>>>
>>>>> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
>>>>> index 39e6f93dc310..25117a906846 100644
>>>>> --- a/drivers/dma-buf/dma-fence.c
>>>>> +++ b/drivers/dma-buf/dma-fence.c
>>>>> @@ -372,8 +372,7 @@ int dma_fence_signal_timestamp_locked(struct dma_fence *fence,
>>>>>
>>>>> lockdep_assert_held(fence->lock);
>>>>>
>>>>> - if (unlikely(test_and_set_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
>>>>> - &fence->flags)))
>>>
>>> I need to read a little better, I think this change isn't quite right.
>>> The original code is test and set, the updated code is test only (i.e.,
>>> you are missing the set step). So maybe just leave this line as is.
>>
>> Oh, good point! I've totally missed that as well.
>
> Oh dear; I also just saw it when opening the mail client ._.
>
>>
>> But that means that this patch set hasn't even been smoke tested.
>
> I've built it and did some basic testing with my Nouveau system. Any
> suggestions? Do you have a CI that one can trigger?
DMA-buf has CONFIG_DMABUF_SELFTESTS which should be able to catch things like that.
But even running Nouveau should have found this since basically no fence at all would signal any more.
Regards,
Christian.
>
> Thx
> P.
On 11/26/25 17:55, Matthew Brost wrote:
> On Wed, Nov 26, 2025 at 08:41:27AM -0800, Matthew Brost wrote:
>> On Wed, Nov 26, 2025 at 02:19:10PM +0100, Philipp Stanner wrote:
>>> The dma_fence framework checks at many places whether the signaled flag
>>> of a fence is already set. The code can be simplified and made more
>>> readable by providing a helper function for that.
>>>
>>> Add dma_fence_test_signaled_flag(), which only checks whether a fence is
>>> signaled. Use it internally.
>>>
>>> Suggested-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
>>> Signed-off-by: Philipp Stanner <phasta(a)kernel.org>
>>
>> This is a nice cleanup:
>> Reviewed-by: Matthew Brost <matthew.brost(a)intel.com>
>>
>>> ---
>>> drivers/dma-buf/dma-fence.c | 19 +++++++++----------
>>> include/linux/dma-fence.h | 24 ++++++++++++++++++++++--
>>> 2 files changed, 31 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
>>> index 39e6f93dc310..25117a906846 100644
>>> --- a/drivers/dma-buf/dma-fence.c
>>> +++ b/drivers/dma-buf/dma-fence.c
>>> @@ -372,8 +372,7 @@ int dma_fence_signal_timestamp_locked(struct dma_fence *fence,
>>>
>>> lockdep_assert_held(fence->lock);
>>>
>>> - if (unlikely(test_and_set_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
>>> - &fence->flags)))
>
> I need to read a little better, I think this change isn't quite right.
> The original code is test and set, the updated code is test only (i.e.,
> you are missing the set step). So maybe just leave this line as is.
Oh, good point! I've totally missed that as well.
But that means that this patch set hasn't even been smoke tested.
Regards,
Christian.
>
> Matt
>
>>> + if (unlikely(dma_fence_test_signaled_flag(fence)))
>>> return -EINVAL;
>>>
>>> /* Stash the cb_list before replacing it with the timestamp */
>>> @@ -545,7 +544,7 @@ void dma_fence_release(struct kref *kref)
>>> trace_dma_fence_destroy(fence);
>>>
>>> if (!list_empty(&fence->cb_list) &&
>>> - !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
>>> + !dma_fence_test_signaled_flag(fence)) {
>>> const char __rcu *timeline;
>>> const char __rcu *driver;
>>> unsigned long flags;
>>> @@ -602,7 +601,7 @@ static bool __dma_fence_enable_signaling(struct dma_fence *fence)
>>> was_set = test_and_set_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
>>> &fence->flags);
>>>
>>> - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>>> + if (dma_fence_test_signaled_flag(fence))
>>> return false;
>>>
>>> if (!was_set && fence->ops->enable_signaling) {
>>> @@ -666,7 +665,7 @@ int dma_fence_add_callback(struct dma_fence *fence, struct dma_fence_cb *cb,
>>> if (WARN_ON(!fence || !func))
>>> return -EINVAL;
>>>
>>> - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
>>> + if (dma_fence_test_signaled_flag(fence)) {
>>> INIT_LIST_HEAD(&cb->node);
>>> return -ENOENT;
>>> }
>>> @@ -783,7 +782,7 @@ dma_fence_default_wait(struct dma_fence *fence, bool intr, signed long timeout)
>>>
>>> spin_lock_irqsave(fence->lock, flags);
>>>
>>> - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>>> + if (dma_fence_test_signaled_flag(fence))
>>> goto out;
>>>
>>> if (intr && signal_pending(current)) {
>>> @@ -800,7 +799,7 @@ dma_fence_default_wait(struct dma_fence *fence, bool intr, signed long timeout)
>>> cb.task = current;
>>> list_add(&cb.base.node, &fence->cb_list);
>>>
>>> - while (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags) && ret > 0) {
>>> + while (!dma_fence_test_signaled_flag(fence) && ret > 0) {
>>> if (intr)
>>> __set_current_state(TASK_INTERRUPTIBLE);
>>> else
>>> @@ -832,7 +831,7 @@ dma_fence_test_signaled_any(struct dma_fence **fences, uint32_t count,
>>>
>>> for (i = 0; i < count; ++i) {
>>> struct dma_fence *fence = fences[i];
>>> - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
>>> + if (dma_fence_test_signaled_flag(fence)) {
>>> if (idx)
>>> *idx = i;
>>> return true;
>>> @@ -1108,7 +1107,7 @@ const char __rcu *dma_fence_driver_name(struct dma_fence *fence)
>>> RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
>>> "RCU protection is required for safe access to returned string");
>>>
>>> - if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>>> + if (!dma_fence_test_signaled_flag(fence))
>>> return fence->ops->get_driver_name(fence);
>>> else
>>> return "detached-driver";
>>> @@ -1140,7 +1139,7 @@ const char __rcu *dma_fence_timeline_name(struct dma_fence *fence)
>>> RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
>>> "RCU protection is required for safe access to returned string");
>>>
>>> - if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>>> + if (!dma_fence_test_signaled_flag(fence))
>>> return fence->ops->get_timeline_name(fence);
>>> else
>>> return "signaled-timeline";
>>> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
>>> index 64639e104110..19972f5d176f 100644
>>> --- a/include/linux/dma-fence.h
>>> +++ b/include/linux/dma-fence.h
>>> @@ -401,6 +401,26 @@ void dma_fence_enable_sw_signaling(struct dma_fence *fence);
>>> const char __rcu *dma_fence_driver_name(struct dma_fence *fence);
>>> const char __rcu *dma_fence_timeline_name(struct dma_fence *fence);
>>>
>>> +/*
>>> + * dma_fence_test_signaled_flag - Only check whether a fence is signaled yet.
>>> + * @fence: the fence to check
>>> + *
>>> + * This function just checks whether @fence is signaled, without interacting
>>> + * with the fence in any way. The user must, therefore, ensure through other
>>> + * means that fences get signaled eventually.
>>> + *
>>> + * This function uses test_bit(), which is thread-safe. Naturally, this function
>>> + * should be used opportunistically; a fence could get signaled at any moment
>>> + * after the check is done.
>>> + *
>>> + * Return: true if signaled, false otherwise.
>>> + */
>>> +static inline bool
>>> +dma_fence_test_signaled_flag(struct dma_fence *fence)
>>> +{
>>> + return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags);
>>> +}
>>> +
>>> /**
>>> * dma_fence_is_signaled_locked - Return an indication if the fence
>>> * is signaled yet.
>>> @@ -418,7 +438,7 @@ const char __rcu *dma_fence_timeline_name(struct dma_fence *fence);
>>> static inline bool
>>> dma_fence_is_signaled_locked(struct dma_fence *fence)
>>> {
>>> - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>>> + if (dma_fence_test_signaled_flag(fence))
>>> return true;
>>>
>>> if (fence->ops->signaled && fence->ops->signaled(fence)) {
>>> @@ -448,7 +468,7 @@ dma_fence_is_signaled_locked(struct dma_fence *fence)
>>> static inline bool
>>> dma_fence_is_signaled(struct dma_fence *fence)
>>> {
>>> - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>>> + if (dma_fence_test_signaled_flag(fence))
>>> return true;
>>>
>>> if (fence->ops->signaled && fence->ops->signaled(fence)) {
>>> --
>>> 2.49.0
>>>
On 2025-11-26 08:19, Philipp Stanner wrote:
> The return code of dma_fence_signal() is not really useful as there is
> nothing reasonable to do if a fence was already signaled. That return
> code shall be removed from the kernel.
>
> Ignore dma_fence_signal()'s return code.
I think this is not correct. Looking at the comment in
evict_process_worker, we use the return value to decide a race
condition where multiple threads are trying to signal the eviction
fence. Only one of them should schedule the restore work. And the other
ones need to increment the reference count to keep evictions balanced.
Regards,
Felix
>
> Suggested-by: Christian König <christian.koenig(a)amd.com>
> Signed-off-by: Philipp Stanner <phasta(a)kernel.org>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index ddfe30c13e9d..950fafa4b3c3 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -1986,7 +1986,6 @@ kfd_process_gpuid_from_node(struct kfd_process *p, struct kfd_node *node,
> static int signal_eviction_fence(struct kfd_process *p)
> {
> struct dma_fence *ef;
> - int ret;
>
> rcu_read_lock();
> ef = dma_fence_get_rcu_safe(&p->ef);
> @@ -1994,10 +1993,10 @@ static int signal_eviction_fence(struct kfd_process *p)
> if (!ef)
> return -EINVAL;
>
> - ret = dma_fence_signal(ef);
> + dma_fence_signal(ef);
> dma_fence_put(ef);
>
> - return ret;
> + return 0;
> }
>
> static void evict_process_worker(struct work_struct *work)
This series is the start of adding full DMABUF support to
iommufd. Currently it is limited to only work with VFIO's DMABUF exporter.
It sits on top of Leon's series to add a DMABUF exporter to VFIO:
https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.c…
The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF fd's, but
otherwise works the same as it does today for a memfd. The user can select
a slice of the FD to map into the ioas, and if the underlying alignment
requirements are met it will be placed in the iommu_domain.
Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR
memory from VFIO to an iommu_domain controlled by iommufd. This is used
for PCI Peer to Peer support in VMs, and is the last feature of the VFIO
type 1 container that iommufd couldn't provide.
The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime
control and is a use-after-free security problem.
Instead iommufd relies on revokable DMABUFs. Whenever VFIO thinks there
should be no access to the MMIO it can shoot down the mapping in iommufd
which will unmap it from the iommu_domain. There is no automatic remap,
this is a safety protocol so the kernel doesn't get stuck. Userspace is
expected to know it is doing something that will revoke the dmabuf and
map/unmap it around the activity. Eg when QEMU goes to issue FLR it should
do the map/unmap to iommufd.
Since DMABUF is missing some key general features for this use case it
relies on a "private interconnect" between VFIO and iommufd via the
vfio_pci_dma_buf_iommufd_map() call.
The call confirms the DMABUF has revoke semantics and delivers a phys_addr
for the memory suitable for use with iommu_map().
Medium term there is a desire to expand the supported DMABUFs to include
GPU drivers to support DPDK/SPDK type use cases so future series will work
to add a general concept of revoke and a general negotiation of
interconnect to remove vfio_pci_dma_buf_iommufd_map().
I also plan another series to modify iommufd's vfio_compat to
transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI
of type1.
The latest series for interconnect negotiation to exchange a phys_addr is:
https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com
And the discussion for design of revoke is here:
https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/
This is on github: https://github.com/jgunthorpe/linux/commits/iommufd_dmabuf
v2:
- Rebase on Leon's v9
- Fix mislocking in an iopt_fill_domain() error path
- Revise the comments around how the sub page offset works
- Remove a useless WARN_ON in iopt_pages_rw_access()
- Fixed missed memory free in the selftest
v1: https://patch.msgid.link/r/0-v1-64bed2430cdb+31b-iommufd_dmabuf_jgg@nvidia.…
Jason Gunthorpe (9):
vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
iommufd: Add DMABUF to iopt_pages
iommufd: Do not map/unmap revoked DMABUFs
iommufd: Allow a DMABUF to be revoked
iommufd: Allow MMIO pages in a batch
iommufd: Have pfn_reader process DMABUF iopt_pages
iommufd: Have iopt_map_file_pages convert the fd to a file
iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
iommufd/selftest: Add some tests for the dmabuf flow
drivers/iommu/iommufd/io_pagetable.c | 78 +++-
drivers/iommu/iommufd/io_pagetable.h | 54 ++-
drivers/iommu/iommufd/ioas.c | 8 +-
drivers/iommu/iommufd/iommufd_private.h | 14 +-
drivers/iommu/iommufd/iommufd_test.h | 10 +
drivers/iommu/iommufd/main.c | 10 +
drivers/iommu/iommufd/pages.c | 414 ++++++++++++++++--
drivers/iommu/iommufd/selftest.c | 143 ++++++
drivers/vfio/pci/vfio_pci_dmabuf.c | 34 ++
include/linux/vfio_pci_core.h | 4 +
tools/testing/selftests/iommu/iommufd.c | 43 ++
tools/testing/selftests/iommu/iommufd_utils.h | 44 ++
12 files changed, 786 insertions(+), 70 deletions(-)
base-commit: f836737ed56db9e2d5b047c56a31e05af0f3f116
--
2.43.0
On Tue, Nov 25, 2025 at 05:11:18PM -0800, Alex Mastro wrote:
> fill_sg_entry() splits large DMA buffers into multiple scatter-gather
> entries, each holding up to UINT_MAX bytes. When calculating the DMA
> address for entries beyond the second one, the expression (i * UINT_MAX)
> causes integer overflow due to 32-bit arithmetic.
>
> This manifests when the input arg length is >= 8 GiB, which results in the
> loop reaching i >= 2.
>
> Fix by casting i to dma_addr_t before multiplication.
>
> Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping routine")
> Signed-off-by: Alex Mastro <amastro(a)fb.com>
> ---
> More color about how I discovered this in [1] for the commit at [2]:
>
> [1] https://lore.kernel.org/all/aSZHO6otK0Heh+Qj@devgpu015.cco6.facebook.com
> [2] https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-6-d7f71607f371@nvidia.c…
> ---
> drivers/dma-buf/dma-buf-mapping.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
Reviewed-by: Jason Gunthorpe <jgg(a)nvidia.com>
AlexW, can you pick this up?
Thanks,
Jason
On Wed, Nov 26, 2025 at 08:08:24AM -0800, Alex Mastro wrote:
> On Wed, Nov 26, 2025 at 01:12:40PM +0000, Pranjal Shrivastava wrote:
> > On Tue, Nov 25, 2025 at 04:18:03PM -0800, Alex Mastro wrote:
> > > On Thu, Nov 20, 2025 at 11:28:25AM +0200, Leon Romanovsky wrote:
> > > > +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> > > > + dma_addr_t addr)
> > > > +{
> > > > + unsigned int len, nents;
> > > > + int i;
> > > > +
> > > > + nents = DIV_ROUND_UP(length, UINT_MAX);
> > > > + for (i = 0; i < nents; i++) {
> > > > + len = min_t(size_t, length, UINT_MAX);
> > > > + length -= len;
> > > > + /*
> > > > + * DMABUF abuses scatterlist to create a scatterlist
> > > > + * that does not have any CPU list, only the DMA list.
> > > > + * Always set the page related values to NULL to ensure
> > > > + * importers can't use it. The phys_addr based DMA API
> > > > + * does not require the CPU list for mapping or unmapping.
> > > > + */
> > > > + sg_set_page(sgl, NULL, 0, 0);
> > > > + sg_dma_address(sgl) = addr + i * UINT_MAX;
> > >
> > > (i * UINT_MAX) happens in 32-bit before being promoted to dma_addr_t for
> > > addition with addr. Overflows for i >=2 when length >= 8 GiB. Needs a cast:
> > >
> > > sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
Yeah, and i should not be signed.
> > > Discovered this while debugging why dma-buf import was failing for
> > > an 8 GiB dma-buf using my earlier toy program [1]. It was surfaced by
> > > ib_umem_find_best_pgsz() returning 0 due to malformed scatterlist, which bubbles
> > > up as an EINVAL.
> > >
> >
> > Thanks a lot for testing & reporting this!
> >
> > However, I believe the casting approach is a little fragile (and
> > potentially prone to issues depending on how dma_addr_t is sized on
> > different platforms). Thus, approaching this with accumulation seems
> > better as it avoids the multiplication logic entirely, maybe something
> > like the following (untested) diff ?
>
> If the function input range is well-formed, then all values in
> [addr..addr+length) must be expressible by dma_addr_t, so I don't think overflow
> after casting is possible as long as nents is valid.
It is probably not perfect, but validate_dmabuf_input() limits length
to a valid size_t
The signature is:
bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
phys_addr_t phys, size_t size)
And that function should fail if size is too large. I think it mostly
does, but it looks like there are a few little misses:
iova_align(iovad, size + iova_off),
return ALIGN(size, iovad->granule);
etc are all unchecked math that could overflow.
> That said, `nents = DIV_ROUND_UP(length, UINT_MAX)` is simply broken on any
> system where size_t is 32b. I don't know if that's a practical consideration for
> these code paths though.
Yeah, that's a good point.
Casting to u64 will trigger 64 bit division errors on 32 bit too.
	// DIV_ROUND_UP that is safe at the type limits
	nents = size / UINT_MAX;
	if (size % UINT_MAX)
		nents++;
Compiler should turn the % into bit math.
Jason