On Mon, 18 Aug 2025 16:54:23 +0530 Meghana Malladi wrote:
> @@ -1332,6 +1350,13 @@ static int prueth_xsk_wakeup(struct net_device *ndev, u32 qid, u32 flags)
> }
> }
>
> + if (flags & XDP_WAKEUP_RX) {
> + if (!napi_if_scheduled_mark_missed(&emac->napi_rx)) {
> + if (likely(napi_schedule_prep(&emac->napi_rx)))
> + __napi_schedule(&emac->napi_rx);
> + }
> + }
> +
> return 0;
I suspect this series was generated against an old source tree, or there's
another conflicting series in flight, because git ends up applying
this chunk to prueth_xsk_pool_disable() :S
Before you proceed with AF_XDP, could you make this driver build under
COMPILE_TEST on x86? Breakage like this is very easy to miss; luckily we got
an off-list report, but that's pure luck. And it's obviously much more effort
for the maintainers to investigate than if it were caught by the CI.
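For reference, the Kconfig side of that usually looks roughly like the hunk
below. This is only a sketch: it assumes the symbol is TI_ICSSG_PRUETH, the
real dependency list still needs checking, and any arch-specific headers the
driver pulls in would need COMPILE_TEST-safe stubs as well.

 config TI_ICSSG_PRUETH
 	tristate "TI Gigabit PRU Ethernet driver"
-	depends on ARCH_K3
+	depends on ARCH_K3 || COMPILE_TEST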
--
pw-bot: cr
On 19.08.25 11:04, Janusz Krzysztofik wrote:
> On Monday, 18 August 2025 16:42:56 CEST Christian König wrote:
>> On 18.08.25 16:30, Janusz Krzysztofik wrote:
>>> Hi Christian,
>>>
>>> On Thursday, 14 August 2025 14:24:29 CEST Christian König wrote:
>>>>
>>>> On 14.08.25 10:16, Janusz Krzysztofik wrote:
>>>>> When first user starts waiting on a not yet signaled fence of a chain
>>>>> link, a dma_fence_chain callback is added to a user fence of that link.
>>>>> When the user fence of that chain link is then signaled, the chain is
>>>>> traversed in search for a first not signaled link and the callback is
>>>>> rearmed on a user fence of that link.
>>>>>
>>>>> Since chain fences may be exposed to user space, e.g. over drm_syncobj
>>>>> IOCTLs, users may start waiting on any link of the chain, then many links
>>>>> of a chain may have signaling enabled and their callbacks added to their
>>>>> user fences. Once an arbitrary user fence is signaled, all
>>>>> dma_fence_chain callbacks added to it so far must be rearmed to another
>>>>> user fence of the chain. In extreme scenarios, when all N links of a
>>>>> chain are awaited and then signaled in reverse order, the dma_fence_chain
>>>>> callback may be called up to N * (N + 1) / 2 times (an arithmetic series).
>>>>>
>>>>> To avoid that potential excessive accumulation of dma_fence_chain
>>>>> callbacks, rearm a trimmed-down, signal only callback version to the base
>>>>> fence of a previous link, if not yet signaled, otherwise just signal the
>>>>> base fence of the current link instead of traversing the chain in search
>>>>> for a first not signaled link and moving all callbacks collected so far to
>>>>> a user fence of that link.
>>>>
>>>> Well clear NAK to that! You can easily overflow the kernel stack with that!
>>>
>>> I'll be happy to propose a better solution, but for that I need to understand
>>> better your message. Could you please point out an exact piece of the
>>> proposed code and/or describe a scenario where you can see the risk of stack
>>> overflow?
>>
>> The sentence "rearm .. to the base fence of a previous link" sounds like you are trying to install a callback on the signaling to the previous chain element.
>>
>> That is exactly what I pointed out previously where you need to be super careful because when this chain signals the callbacks will execute recursively which means that you can trivially overflow the kernel stack if you have more than a handful of chain elements.
>>
>> In other words A waits for B, B waits for C, C waits for D etc.... when D finally signals it will call C which in turn calls B which in turn calls A.
>
> OK, maybe my commit description was not precise enough; however, I intentionally
> didn't describe implementation details (the how).
> When D signals, it doesn't call C directly; it only submits an irq work
> that calls C. Then C doesn't call B directly either; it submits another irq work
> that calls B, and so on.
> Doesn't that code pattern effectively break the recursion loop into separate
> work items, each with its own separate stack?
No, it's architecture dependent whether the irq_work executes on a separate stack or not.
You would need a work_struct to really separate the two, and I would reject that because it adds additional latency to the signaling path.
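For illustration, the work_struct variant I mean would look roughly like the
sketch below (the sig_work field is hypothetical, not part of struct
dma_fence_chain); the deferral to a worker kthread is what buys the separate
stack, and also what adds the scheduling latency:

static void dma_fence_chain_signal_work(struct work_struct *work)
{
	/* sig_work is a hypothetical field added only for this sketch */
	struct dma_fence_chain *chain =
		container_of(work, struct dma_fence_chain, sig_work);

	dma_fence_signal(&chain->base);
	dma_fence_put(&chain->base);
}

/*
 * The fence callback would then do schedule_work(&chain->sig_work) instead
 * of irq_work_queue(&chain->work), so signaling always runs in process
 * context on a worker thread's own stack.
 */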
>>
>> Even if the chain is a recursive data structure you absolutely can't use recursion for the handling of it.
>>
>> Maybe I misunderstood your textual description but reading a sentence like this rings all alarm bells here. Otherwise I can't see what the patch is supposed to be optimizing.
>
> OK, maybe I should start my commit description of this patch with a copy of
> the first sentence from cover letter and also from patch 1/4 description that
> informs about the problem as reported by CI. Maybe I should also provide a
> comparison of measured signaling times from trybot executions [1][2][3].
> Here are example numbers from CI machine fi-bsw-n3050:
Yeah and I've pointed out before that this is irrelevant.
The problem is *not* the dma_fence_chain implementation; that one is doing exactly what it is expected to do.
The problem is that the test case is nonsense. I've pointed that out numerous times now!
Regards,
Christian.
>
> With signaling time reports only added to selftests (patch 1 of 4):
> <6> [777.914451] dma-buf: Running dma_fence_chain/wait_forward
> <6> [778.123516] wait_forward: 4096 signals in 21373487 ns
> <6> [778.335709] dma-buf: Running dma_fence_chain/wait_backward
> <6> [795.791546] wait_backward: 4096 signals in 17249051192 ns
> <6> [795.859699] dma-buf: Running dma_fence_chain/wait_random
> <6> [796.161375] wait_random: 4096 signals in 97386256 ns
>
> With dma_fence_enable_signaling() replaced in selftests with dma_fence_wait()
> (patches 1-3 of 4):
> <6> [782.505692] dma-buf: Running dma_fence_chain/wait_forward
> <6> [784.609213] wait_forward: 4096 signals in 36513103 ns
> <3> [784.837226] Reported -4 for kthread_stop_put(0)!
> <6> [785.147643] dma-buf: Running dma_fence_chain/wait_backward
> <6> [806.367763] wait_backward: 4096 signals in 18428009499 ns
> <6> [807.175325] dma-buf: Running dma_fence_chain/wait_random
> <6> [809.453942] wait_random: 4096 signals in 119761950 ns
>
> With the fix (patches 1-4 of 4):
> <6> [731.519020] dma-buf: Running dma_fence_chain/wait_forward
> <6> [733.623375] wait_forward: 4096 signals in 31890220 ns
> <6> [734.258972] dma-buf: Running dma_fence_chain/wait_backward
> <6> [736.267325] wait_backward: 4096 signals in 39007955 ns
> <6> [736.700221] dma-buf: Running dma_fence_chain/wait_random
> <6> [739.346706] wait_random: 4096 signals in 48384865 ns
>
> Signaling time in wait_backward selftest has been reduced from 17s to 39ms.
>
> [1] https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_152785v1/index.html?
> [2] https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_152828v2/index.html?
> [3] https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_152830v2/index.html?
>
>>
>>>>
>>>> Additional to this messing with the fence ops outside of the dma_fence code is an absolute no-go.
>>>
>>> Could you please explain what piece of code you are referring to when you say
>>> "messing with the fence ops outside the dma_fence code"? If not this patch
>>> then which particular one of this series did you mean? I'm assuming you
>>> didn't mean drm_syncobj code that I mentioned in my commit descriptions.
>>
>> See below.
>>
>>>
>>> Thanks,
>>> Janusz
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12904
>>>>> Suggested-by: Chris Wilson <chris.p.wilson(a)linux.intel.com>
>>>>> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik(a)linux.intel.com>
>>>>> ---
>>>>> drivers/dma-buf/dma-fence-chain.c | 101 +++++++++++++++++++++++++-----
>>>>> 1 file changed, 84 insertions(+), 17 deletions(-)
>>>>>
>>>>> diff --git a/drivers/dma-buf/dma-fence-chain.c b/drivers/dma-buf/dma-fence-chain.c
>>>>> index a8a90acf4f34d..90eff264ee05c 100644
>>>>> --- a/drivers/dma-buf/dma-fence-chain.c
>>>>> +++ b/drivers/dma-buf/dma-fence-chain.c
>>>>> @@ -119,46 +119,113 @@ static const char *dma_fence_chain_get_timeline_name(struct dma_fence *fence)
>>>>> return "unbound";
>>>>> }
>>>>>
>>>>> -static void dma_fence_chain_irq_work(struct irq_work *work)
>>>>> +static void signal_irq_work(struct irq_work *work)
>>>>> {
>>>>> struct dma_fence_chain *chain;
>>>>>
>>>>> chain = container_of(work, typeof(*chain), work);
>>>>>
>>>>> - /* Try to rearm the callback */
>>>>> - if (!dma_fence_chain_enable_signaling(&chain->base))
>>>>> - /* Ok, we are done. No more unsignaled fences left */
>>>>> - dma_fence_signal(&chain->base);
>>>>> + dma_fence_signal(&chain->base);
>>>>> dma_fence_put(&chain->base);
>>>>> }
>>>>>
>>>>> -static void dma_fence_chain_cb(struct dma_fence *f, struct dma_fence_cb *cb)
>>>>> +static void signal_cb(struct dma_fence *f, struct dma_fence_cb *cb)
>>>>> +{
>>>>> + struct dma_fence_chain *chain;
>>>>> +
>>>>> + chain = container_of(cb, typeof(*chain), cb);
>>>>> + init_irq_work(&chain->work, signal_irq_work);
>>>>> + irq_work_queue(&chain->work);
>>>>> +}
>>>>> +
>>>>> +static void rearm_irq_work(struct irq_work *work)
>>>>> +{
>>>>> + struct dma_fence_chain *chain;
>>>>> + struct dma_fence *prev;
>>>>> +
>>>>> + chain = container_of(work, typeof(*chain), work);
>>>>> +
>>>>> + rcu_read_lock();
>>>>> + prev = rcu_dereference(chain->prev);
>>>>> + if (prev && dma_fence_add_callback(prev, &chain->cb, signal_cb))
>>>>> + prev = NULL;
>>>>> + rcu_read_unlock();
>>>>> + if (prev)
>>>>> + return;
>>>>> +
>>>>> + /* Ok, we are done. No more unsignaled fences left */
>>>>> + signal_irq_work(work);
>>>>> +}
>>>>> +
>>>>> +static inline bool fence_is_signaled__nested(struct dma_fence *fence)
>>>>> +{
>>>>> + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>>>>> + return true;
>>>>> +
>>
>>>>> + if (fence->ops->signaled && fence->ops->signaled(fence)) {
>>
>> Calling this outside of dma-fence.[ch] is a clear no-go.
>
> But this patch applies only to drivers/dma-buf/dma-fence-chain.c, not
> outside of it.
>
> Thanks,
> Janusz
>
>>
>> Regards,
>> Christian.
>>
>>>>> + unsigned long flags;
>>>>> +
>>>>> + spin_lock_irqsave_nested(fence->lock, flags, SINGLE_DEPTH_NESTING);
>>>>> + dma_fence_signal_locked(fence);
>>>>> + spin_unlock_irqrestore(fence->lock, flags);
>>>>> +
>>>>> + return true;
>>>>> + }
>>>>> +
>>>>> + return false;
>>>>> +}
>>>>> +
>>>>> +static bool prev_is_signaled(struct dma_fence_chain *chain)
>>>>> +{
>>>>> + struct dma_fence *prev;
>>>>> + bool result;
>>>>> +
>>>>> + rcu_read_lock();
>>>>> + prev = rcu_dereference(chain->prev);
>>>>> + result = !prev || fence_is_signaled__nested(prev);
>>>>> + rcu_read_unlock();
>>>>> +
>>>>> + return result;
>>>>> +}
>>>>> +
>>>>> +static void rearm_or_signal_cb(struct dma_fence *f, struct dma_fence_cb *cb)
>>>>> {
>>>>> struct dma_fence_chain *chain;
>>>>>
>>>>> chain = container_of(cb, typeof(*chain), cb);
>>>>> - init_irq_work(&chain->work, dma_fence_chain_irq_work);
>>>>> + if (prev_is_signaled(chain)) {
>>>>> + /* Ok, we are done. No more unsignaled fences left */
>>>>> + init_irq_work(&chain->work, signal_irq_work);
>>>>> + } else {
>>>>> + /* Try to rearm the callback */
>>>>> + init_irq_work(&chain->work, rearm_irq_work);
>>>>> + }
>>>>> +
>>>>> irq_work_queue(&chain->work);
>>>>> - dma_fence_put(f);
>>>>> }
>>>>>
>>>>> static bool dma_fence_chain_enable_signaling(struct dma_fence *fence)
>>>>> {
>>>>> struct dma_fence_chain *head = to_dma_fence_chain(fence);
>>>>> + int err = -ENOENT;
>>>>>
>>>>> - dma_fence_get(&head->base);
>>>>> - dma_fence_chain_for_each(fence, &head->base) {
>>>>> - struct dma_fence *f = dma_fence_chain_contained(fence);
>>>>> + if (WARN_ON(!head))
>>>>> + return false;
>>>>>
>>>>> - dma_fence_get(f);
>>>>> - if (!dma_fence_add_callback(f, &head->cb, dma_fence_chain_cb)) {
>>>>> + dma_fence_get(fence);
>>>>> + if (head->fence)
>>>>> + err = dma_fence_add_callback(head->fence, &head->cb, rearm_or_signal_cb);
>>>>> + if (err) {
>>>>> + if (prev_is_signaled(head)) {
>>>>> dma_fence_put(fence);
>>>>> - return true;
>>>>> + } else {
>>>>> + init_irq_work(&head->work, rearm_irq_work);
>>>>> + irq_work_queue(&head->work);
>>>>> + err = 0;
>>>>> }
>>>>> - dma_fence_put(f);
>>>>> }
>>>>> - dma_fence_put(&head->base);
>>>>> - return false;
>>>>> +
>>>>> + return !err;
>>>>> }
>>>>>
>>>>> static bool dma_fence_chain_signaled(struct dma_fence *fence)
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
>
>
Current dma-buf vmap semantics require that the mapped buffer remains
in place until the corresponding vunmap has completed.
For GEM-SHMEM, this used to be guaranteed by a pin operation performed while
creating an S/G table on import. GEM-SHMEM can now import dma-buf objects
without creating the S/G table, so the pin is missing. This leads to
page-fault errors, such as the one shown below.
[ 102.101726] BUG: unable to handle page fault for address: ffffc90127000000
[...]
[ 102.157102] RIP: 0010:udl_compress_hline16+0x219/0x940 [udl]
[...]
[ 102.243250] Call Trace:
[ 102.245695] <TASK>
[ 102.247795] ? validate_chain+0x24e/0x5e0
[ 102.251805] ? __lock_acquire+0x568/0xae0
[ 102.255807] udl_render_hline+0x165/0x341 [udl]
[ 102.260338] ? __pfx_udl_render_hline+0x10/0x10 [udl]
[ 102.265379] ? local_clock_noinstr+0xb/0x100
[ 102.269642] ? __lock_release.isra.0+0x16c/0x2e0
[ 102.274246] ? mark_held_locks+0x40/0x70
[ 102.278177] udl_primary_plane_helper_atomic_update+0x43e/0x680 [udl]
[ 102.284606] ? __pfx_udl_primary_plane_helper_atomic_update+0x10/0x10 [udl]
[ 102.291551] ? lockdep_hardirqs_on_prepare.part.0+0x92/0x170
[ 102.297208] ? lockdep_hardirqs_on+0x88/0x130
[ 102.301554] ? _raw_spin_unlock_irq+0x24/0x50
[ 102.305901] ? wait_for_completion_timeout+0x2bb/0x3a0
[ 102.311028] ? drm_atomic_helper_calc_timestamping_constants+0x141/0x200
[ 102.317714] ? drm_atomic_helper_commit_planes+0x3b6/0x1030
[ 102.323279] drm_atomic_helper_commit_planes+0x3b6/0x1030
[ 102.328664] drm_atomic_helper_commit_tail+0x41/0xb0
[ 102.333622] commit_tail+0x204/0x330
[...]
[ 102.529946] ---[ end trace 0000000000000000 ]---
[ 102.651980] RIP: 0010:udl_compress_hline16+0x219/0x940 [udl]
In this stack trace, udl (based on GEM-SHMEM) imported and vmap'ed a
dma-buf from amdgpu. Amdgpu relocated the buffer, thereby invalidating the
mapping.
Provide a custom dma-buf vmap method in amdgpu that pins the object before
mapping its buffer's pages into kernel address space. Do the opposite in
vunmap.
Note that dma-buf vmap differs from GEM vmap in how it handles relocation.
While dma-buf vmap keeps the buffer in place, GEM vmap requires the caller
to keep the buffer in place. Hence, this fix is in amdgpu's dma-buf code
instead of its GEM code.
A discussion of various approaches to solving the problem is available
at [1].
Signed-off-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Fixes: 660cd44659a0 ("drm/shmem-helper: Import dmabuf without mapping its sg_table")
Reported-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Closes: https://lore.kernel.org/dri-devel/ba1bdfb8-dbf7-4372-bdcb-df7e0511c702@suse…
Cc: Shixiong Ou <oushixiong(a)kylinos.cn>
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Maxime Ripard <mripard(a)kernel.org>
Cc: David Airlie <airlied(a)gmail.com>
Cc: Simona Vetter <simona(a)ffwll.ch>
Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
Cc: "Christian König" <christian.koenig(a)amd.com>
Cc: dri-devel(a)lists.freedesktop.org
Cc: linux-media(a)vger.kernel.org
Cc: linaro-mm-sig(a)lists.linaro.org
Link: https://lore.kernel.org/dri-devel/9792c6c3-a2b8-4b2b-b5ba-fba19b153e21@suse… # [1]
---
drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 36 +++++++++++++++++++--
1 file changed, 34 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index 5743ebb2f1b7..5b33776eeece 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -285,6 +285,38 @@ static int amdgpu_dma_buf_begin_cpu_access(struct dma_buf *dma_buf,
return ret;
}
+static int amdgpu_dma_buf_vmap(struct dma_buf *dma_buf, struct iosys_map *map)
+{
+ struct drm_gem_object *obj = dma_buf->priv;
+ struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
+ int ret;
+
+ /*
+ * Pin to keep buffer in place while it's vmap'ed. The actual
+ * location is not important as long as it's mappable.
+ *
+ * This code is required for exporting to GEM-SHMEM without S/G table.
+ * Once GEM-SHMEM supports dynamic imports, it should be dropped.
+ */
+ ret = amdgpu_bo_pin(bo, AMDGPU_GEM_DOMAIN_MASK);
+ if (ret)
+ return ret;
+ ret = drm_gem_dmabuf_vmap(dma_buf, map);
+ if (ret)
+ amdgpu_bo_unpin(bo);
+
+ return ret;
+}
+
+static void amdgpu_dma_buf_vunmap(struct dma_buf *dma_buf, struct iosys_map *map)
+{
+ struct drm_gem_object *obj = dma_buf->priv;
+ struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
+
+ drm_gem_dmabuf_vunmap(dma_buf, map);
+ amdgpu_bo_unpin(bo);
+}
+
const struct dma_buf_ops amdgpu_dmabuf_ops = {
.attach = amdgpu_dma_buf_attach,
.pin = amdgpu_dma_buf_pin,
@@ -294,8 +326,8 @@ const struct dma_buf_ops amdgpu_dmabuf_ops = {
.release = drm_gem_dmabuf_release,
.begin_cpu_access = amdgpu_dma_buf_begin_cpu_access,
.mmap = drm_gem_dmabuf_mmap,
- .vmap = drm_gem_dmabuf_vmap,
- .vunmap = drm_gem_dmabuf_vunmap,
+ .vmap = amdgpu_dma_buf_vmap,
+ .vunmap = amdgpu_dma_buf_vunmap,
};
/**
--
2.50.1
On 18.08.25 16:30, Janusz Krzysztofik wrote:
> Hi Christian,
>
> On Thursday, 14 August 2025 14:24:29 CEST Christian König wrote:
>>
>> On 14.08.25 10:16, Janusz Krzysztofik wrote:
>>> When first user starts waiting on a not yet signaled fence of a chain
>>> link, a dma_fence_chain callback is added to a user fence of that link.
>>> When the user fence of that chain link is then signaled, the chain is
>>> traversed in search for a first not signaled link and the callback is
>>> rearmed on a user fence of that link.
>>>
>>> Since chain fences may be exposed to user space, e.g. over drm_syncobj
>>> IOCTLs, users may start waiting on any link of the chain, then many links
>>> of a chain may have signaling enabled and their callbacks added to their
>>> user fences. Once an arbitrary user fence is signaled, all
>>> dma_fence_chain callbacks added to it so far must be rearmed to another
>>> user fence of the chain. In extreme scenarios, when all N links of a
>>> chain are awaited and then signaled in reverse order, the dma_fence_chain
>>> callback may be called up to N * (N + 1) / 2 times (an arithmetic series).
>>>
>>> To avoid that potential excessive accumulation of dma_fence_chain
>>> callbacks, rearm a trimmed-down, signal only callback version to the base
>>> fence of a previous link, if not yet signaled, otherwise just signal the
>>> base fence of the current link instead of traversing the chain in search
>>> for a first not signaled link and moving all callbacks collected so far to
>>> a user fence of that link.
>>
>> Well clear NAK to that! You can easily overflow the kernel stack with that!
>
> I'll be happy to propose a better solution, but for that I need to understand
> better your message. Could you please point out an exact piece of the
> proposed code and/or describe a scenario where you can see the risk of stack
> overflow?
The sentence "rearm .. to the base fence of a previous link" sounds like you are trying to install a callback on the signaling to the previous chain element.
That is exactly what I pointed out previously where you need to be super careful because when this chain signals the callbacks will execute recursively which means that you can trivially overflow the kernel stack if you have more than a handful of chain elements.
In other words A waits for B, B waits for C, C waits for D etc.... when D finally signals it will call C which in turn calls B which in turn calls A.
Even if the chain is a recursive data structure you absolutely can't use recursion for the handling of it.
Maybe I misunderstood your textual description but reading a sentence like this rings all alarm bells here. Otherwise I can't see what the patch is supposed to be optimizing.
>>
>> Additional to this messing with the fence ops outside of the dma_fence code is an absolute no-go.
>
> Could you please explain what piece of code you are referring to when you say
> "messing with the fence ops outside the dma_fence code"? If not this patch
> then which particular one of this series did you mean? I'm assuming you
> didn't mean drm_syncobj code that I mentioned in my commit descriptions.
See below.
>
> Thanks,
> Janusz
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12904
>>> Suggested-by: Chris Wilson <chris.p.wilson(a)linux.intel.com>
>>> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik(a)linux.intel.com>
>>> ---
>>> drivers/dma-buf/dma-fence-chain.c | 101 +++++++++++++++++++++++++-----
>>> 1 file changed, 84 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/dma-fence-chain.c b/drivers/dma-buf/dma-fence-chain.c
>>> index a8a90acf4f34d..90eff264ee05c 100644
>>> --- a/drivers/dma-buf/dma-fence-chain.c
>>> +++ b/drivers/dma-buf/dma-fence-chain.c
>>> @@ -119,46 +119,113 @@ static const char *dma_fence_chain_get_timeline_name(struct dma_fence *fence)
>>> return "unbound";
>>> }
>>>
>>> -static void dma_fence_chain_irq_work(struct irq_work *work)
>>> +static void signal_irq_work(struct irq_work *work)
>>> {
>>> struct dma_fence_chain *chain;
>>>
>>> chain = container_of(work, typeof(*chain), work);
>>>
>>> - /* Try to rearm the callback */
>>> - if (!dma_fence_chain_enable_signaling(&chain->base))
>>> - /* Ok, we are done. No more unsignaled fences left */
>>> - dma_fence_signal(&chain->base);
>>> + dma_fence_signal(&chain->base);
>>> dma_fence_put(&chain->base);
>>> }
>>>
>>> -static void dma_fence_chain_cb(struct dma_fence *f, struct dma_fence_cb *cb)
>>> +static void signal_cb(struct dma_fence *f, struct dma_fence_cb *cb)
>>> +{
>>> + struct dma_fence_chain *chain;
>>> +
>>> + chain = container_of(cb, typeof(*chain), cb);
>>> + init_irq_work(&chain->work, signal_irq_work);
>>> + irq_work_queue(&chain->work);
>>> +}
>>> +
>>> +static void rearm_irq_work(struct irq_work *work)
>>> +{
>>> + struct dma_fence_chain *chain;
>>> + struct dma_fence *prev;
>>> +
>>> + chain = container_of(work, typeof(*chain), work);
>>> +
>>> + rcu_read_lock();
>>> + prev = rcu_dereference(chain->prev);
>>> + if (prev && dma_fence_add_callback(prev, &chain->cb, signal_cb))
>>> + prev = NULL;
>>> + rcu_read_unlock();
>>> + if (prev)
>>> + return;
>>> +
>>> + /* Ok, we are done. No more unsignaled fences left */
>>> + signal_irq_work(work);
>>> +}
>>> +
>>> +static inline bool fence_is_signaled__nested(struct dma_fence *fence)
>>> +{
>>> + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>>> + return true;
>>> +
>>> + if (fence->ops->signaled && fence->ops->signaled(fence)) {
Calling this outside of dma-fence.[ch] is a clear no-go.
Regards,
Christian.
>>> + unsigned long flags;
>>> +
>>> + spin_lock_irqsave_nested(fence->lock, flags, SINGLE_DEPTH_NESTING);
>>> + dma_fence_signal_locked(fence);
>>> + spin_unlock_irqrestore(fence->lock, flags);
>>> +
>>> + return true;
>>> + }
>>> +
>>> + return false;
>>> +}
>>> +
>>> +static bool prev_is_signaled(struct dma_fence_chain *chain)
>>> +{
>>> + struct dma_fence *prev;
>>> + bool result;
>>> +
>>> + rcu_read_lock();
>>> + prev = rcu_dereference(chain->prev);
>>> + result = !prev || fence_is_signaled__nested(prev);
>>> + rcu_read_unlock();
>>> +
>>> + return result;
>>> +}
>>> +
>>> +static void rearm_or_signal_cb(struct dma_fence *f, struct dma_fence_cb *cb)
>>> {
>>> struct dma_fence_chain *chain;
>>>
>>> chain = container_of(cb, typeof(*chain), cb);
>>> - init_irq_work(&chain->work, dma_fence_chain_irq_work);
>>> + if (prev_is_signaled(chain)) {
>>> + /* Ok, we are done. No more unsignaled fences left */
>>> + init_irq_work(&chain->work, signal_irq_work);
>>> + } else {
>>> + /* Try to rearm the callback */
>>> + init_irq_work(&chain->work, rearm_irq_work);
>>> + }
>>> +
>>> irq_work_queue(&chain->work);
>>> - dma_fence_put(f);
>>> }
>>>
>>> static bool dma_fence_chain_enable_signaling(struct dma_fence *fence)
>>> {
>>> struct dma_fence_chain *head = to_dma_fence_chain(fence);
>>> + int err = -ENOENT;
>>>
>>> - dma_fence_get(&head->base);
>>> - dma_fence_chain_for_each(fence, &head->base) {
>>> - struct dma_fence *f = dma_fence_chain_contained(fence);
>>> + if (WARN_ON(!head))
>>> + return false;
>>>
>>> - dma_fence_get(f);
>>> - if (!dma_fence_add_callback(f, &head->cb, dma_fence_chain_cb)) {
>>> + dma_fence_get(fence);
>>> + if (head->fence)
>>> + err = dma_fence_add_callback(head->fence, &head->cb, rearm_or_signal_cb);
>>> + if (err) {
>>> + if (prev_is_signaled(head)) {
>>> dma_fence_put(fence);
>>> - return true;
>>> + } else {
>>> + init_irq_work(&head->work, rearm_irq_work);
>>> + irq_work_queue(&head->work);
>>> + err = 0;
>>> }
>>> - dma_fence_put(f);
>>> }
>>> - dma_fence_put(&head->base);
>>> - return false;
>>> +
>>> + return !err;
>>> }
>>>
>>> static bool dma_fence_chain_signaled(struct dma_fence *fence)
>>
>>
>
>
>
>
Hello all,
This series makes udmabuf sync the backing buffer with the set of attached
devices, as required for DMA-BUFs, when doing {begin,end}_cpu_access.
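For reference, the {begin,end}_cpu_access side of this is conceptually just a
walk over the currently attached devices; a minimal sketch, assuming a
hypothetical udmabuf_attachment bookkeeping struct and attachments list (the
real patches may differ):

static int begin_cpu_udmabuf(struct dma_buf *buf,
			     enum dma_data_direction direction)
{
	struct udmabuf *ubuf = buf->priv;
	struct udmabuf_attachment *a;	/* hypothetical bookkeeping struct */

	mutex_lock(&ubuf->attachments_lock);
	list_for_each_entry(a, &ubuf->attachments, list)
		dma_sync_sgtable_for_cpu(a->dev, a->table, direction);
	mutex_unlock(&ubuf->attachments_lock);

	return 0;
}

end_cpu_access would do the same walk with dma_sync_sgtable_for_device().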
Thanks
Andrew
Changes for v2:
- fix attachment table use-after-free
- rebased on v6.17-rc1
Andrew Davis (3):
udmabuf: Keep track current device mappings
udmabuf: Sync buffer mappings for attached devices
udmabuf: Use module_misc_device() to register this device
drivers/dma-buf/udmabuf.c | 133 +++++++++++++++++++-------------------
1 file changed, 67 insertions(+), 66 deletions(-)
--
2.39.2
The Arm Ethos-U65/85 NPUs are designed for edge AI inference
applications[0].
The driver works with Mesa Teflon. A merge request for Ethos support is
here[1]. The UAPI should also be compatible with the downstream (open
source) driver stack[2] and Vela compiler though that has not been
implemented.
Testing so far has been on i.MX93 boards with Ethos-U65. Support for U85 is
still to do; only minor changes on the driver side will be needed for it.
A git tree is here[3].
Rob
[0] https://www.arm.com/products/silicon-ip-cpu?families=ethos%20npus
[1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36699/
[2] https://gitlab.arm.com/artificial-intelligence/ethos-u/
[3] git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git ethos-v2
Signed-off-by: Rob Herring (Arm) <robh(a)kernel.org>
---
Changes in v2:
- Rebase on v6.17-rc1 adapting to scheduler changes
- scheduler: Drop the reset workqueue. According to the scheduler docs,
we don't need it since we have a single h/w queue.
- scheduler: Rework the timeout handling to continue running if we are
making progress. Fixes timeouts on larger jobs.
- Reset the NPU on resume so it's in a known state
- Add error handling on clk_get() calls
- Fix drm_mm splat on module unload. We were missing a put on the
cmdstream BO in the scheduler clean-up.
- Fix 0-day report needing explicit bitfield.h include
- Link to v1: https://lore.kernel.org/r/20250722-ethos-v1-0-cc1c5a0cbbfb@kernel.org
---
Rob Herring (Arm) (2):
dt-bindings: npu: Add Arm Ethos-U65/U85
accel: Add Arm Ethos-U NPU driver
.../devicetree/bindings/npu/arm,ethos.yaml | 79 +++
MAINTAINERS | 9 +
drivers/accel/Kconfig | 1 +
drivers/accel/Makefile | 1 +
drivers/accel/ethos/Kconfig | 10 +
drivers/accel/ethos/Makefile | 4 +
drivers/accel/ethos/ethos_device.h | 181 ++++++
drivers/accel/ethos/ethos_drv.c | 418 ++++++++++++
drivers/accel/ethos/ethos_drv.h | 15 +
drivers/accel/ethos/ethos_gem.c | 707 +++++++++++++++++++++
drivers/accel/ethos/ethos_gem.h | 46 ++
drivers/accel/ethos/ethos_job.c | 514 +++++++++++++++
drivers/accel/ethos/ethos_job.h | 41 ++
include/uapi/drm/ethos_accel.h | 262 ++++++++
14 files changed, 2288 insertions(+)
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20250715-ethos-3fdd39ef6f19
Best regards,
--
Rob Herring (Arm) <robh(a)kernel.org>
On 14.08.25 10:16, Janusz Krzysztofik wrote:
> When first user starts waiting on a not yet signaled fence of a chain
> link, a dma_fence_chain callback is added to a user fence of that link.
> When the user fence of that chain link is then signaled, the chain is
> traversed in search for a first not signaled link and the callback is
> rearmed on a user fence of that link.
>
> Since chain fences may be exposed to user space, e.g. over drm_syncobj
> IOCTLs, users may start waiting on any link of the chain, then many links
> of a chain may have signaling enabled and their callbacks added to their
> user fences. Once an arbitrary user fence is signaled, all
> dma_fence_chain callbacks added to it so far must be rearmed to another
> user fence of the chain. In extreme scenarios, when all N links of a
> chain are awaited and then signaled in reverse order, the dma_fence_chain
> callback may be called up to N * (N + 1) / 2 times (an arithmetic series).
>
> To avoid that potential excessive accumulation of dma_fence_chain
> callbacks, rearm a trimmed-down, signal only callback version to the base
> fence of a previous link, if not yet signaled, otherwise just signal the
> base fence of the current link instead of traversing the chain in search
> for a first not signaled link and moving all callbacks collected so far to
> a user fence of that link.
Well clear NAK to that! You can easily overflow the kernel stack with that!
Additional to this messing with the fence ops outside of the dma_fence code is an absolute no-go.
Regards,
Christian.
>
> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12904
> Suggested-by: Chris Wilson <chris.p.wilson(a)linux.intel.com>
> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik(a)linux.intel.com>
> ---
> drivers/dma-buf/dma-fence-chain.c | 101 +++++++++++++++++++++++++-----
> 1 file changed, 84 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-fence-chain.c b/drivers/dma-buf/dma-fence-chain.c
> index a8a90acf4f34d..90eff264ee05c 100644
> --- a/drivers/dma-buf/dma-fence-chain.c
> +++ b/drivers/dma-buf/dma-fence-chain.c
> @@ -119,46 +119,113 @@ static const char *dma_fence_chain_get_timeline_name(struct dma_fence *fence)
> return "unbound";
> }
>
> -static void dma_fence_chain_irq_work(struct irq_work *work)
> +static void signal_irq_work(struct irq_work *work)
> {
> struct dma_fence_chain *chain;
>
> chain = container_of(work, typeof(*chain), work);
>
> - /* Try to rearm the callback */
> - if (!dma_fence_chain_enable_signaling(&chain->base))
> - /* Ok, we are done. No more unsignaled fences left */
> - dma_fence_signal(&chain->base);
> + dma_fence_signal(&chain->base);
> dma_fence_put(&chain->base);
> }
>
> -static void dma_fence_chain_cb(struct dma_fence *f, struct dma_fence_cb *cb)
> +static void signal_cb(struct dma_fence *f, struct dma_fence_cb *cb)
> +{
> + struct dma_fence_chain *chain;
> +
> + chain = container_of(cb, typeof(*chain), cb);
> + init_irq_work(&chain->work, signal_irq_work);
> + irq_work_queue(&chain->work);
> +}
> +
> +static void rearm_irq_work(struct irq_work *work)
> +{
> + struct dma_fence_chain *chain;
> + struct dma_fence *prev;
> +
> + chain = container_of(work, typeof(*chain), work);
> +
> + rcu_read_lock();
> + prev = rcu_dereference(chain->prev);
> + if (prev && dma_fence_add_callback(prev, &chain->cb, signal_cb))
> + prev = NULL;
> + rcu_read_unlock();
> + if (prev)
> + return;
> +
> + /* Ok, we are done. No more unsignaled fences left */
> + signal_irq_work(work);
> +}
> +
> +static inline bool fence_is_signaled__nested(struct dma_fence *fence)
> +{
> + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
> + return true;
> +
> + if (fence->ops->signaled && fence->ops->signaled(fence)) {
> + unsigned long flags;
> +
> + spin_lock_irqsave_nested(fence->lock, flags, SINGLE_DEPTH_NESTING);
> + dma_fence_signal_locked(fence);
> + spin_unlock_irqrestore(fence->lock, flags);
> +
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static bool prev_is_signaled(struct dma_fence_chain *chain)
> +{
> + struct dma_fence *prev;
> + bool result;
> +
> + rcu_read_lock();
> + prev = rcu_dereference(chain->prev);
> + result = !prev || fence_is_signaled__nested(prev);
> + rcu_read_unlock();
> +
> + return result;
> +}
> +
> +static void rearm_or_signal_cb(struct dma_fence *f, struct dma_fence_cb *cb)
> {
> struct dma_fence_chain *chain;
>
> chain = container_of(cb, typeof(*chain), cb);
> - init_irq_work(&chain->work, dma_fence_chain_irq_work);
> + if (prev_is_signaled(chain)) {
> + /* Ok, we are done. No more unsignaled fences left */
> + init_irq_work(&chain->work, signal_irq_work);
> + } else {
> + /* Try to rearm the callback */
> + init_irq_work(&chain->work, rearm_irq_work);
> + }
> +
> irq_work_queue(&chain->work);
> - dma_fence_put(f);
> }
>
> static bool dma_fence_chain_enable_signaling(struct dma_fence *fence)
> {
> struct dma_fence_chain *head = to_dma_fence_chain(fence);
> + int err = -ENOENT;
>
> - dma_fence_get(&head->base);
> - dma_fence_chain_for_each(fence, &head->base) {
> - struct dma_fence *f = dma_fence_chain_contained(fence);
> + if (WARN_ON(!head))
> + return false;
>
> - dma_fence_get(f);
> - if (!dma_fence_add_callback(f, &head->cb, dma_fence_chain_cb)) {
> + dma_fence_get(fence);
> + if (head->fence)
> + err = dma_fence_add_callback(head->fence, &head->cb, rearm_or_signal_cb);
> + if (err) {
> + if (prev_is_signaled(head)) {
> dma_fence_put(fence);
> - return true;
> + } else {
> + init_irq_work(&head->work, rearm_irq_work);
> + irq_work_queue(&head->work);
> + err = 0;
> }
> - dma_fence_put(f);
> }
> - dma_fence_put(&head->base);
> - return false;
> +
> + return !err;
> }
>
> static bool dma_fence_chain_signaled(struct dma_fence *fence)
Hi Amirreza,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 2674d1eadaa2fd3a918dfcdb6d0bb49efe8a8bb9]
url: https://github.com/intel-lab-lkp/linux/commits/Amirreza-Zarrabi/tee-allow-a…
base: 2674d1eadaa2fd3a918dfcdb6d0bb49efe8a8bb9
patch link: https://lore.kernel.org/r/20250812-qcom-tee-using-tee-ss-without-mem-obj-v7…
patch subject: [PATCH v7 08/11] tee: add Qualcomm TEE driver
config: hexagon-randconfig-r072-20250814 (https://download.01.org/0day-ci/archive/20250814/202508140527.ighXikjo-lkp@…)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 3769ce013be2879bf0b329c14a16f5cb766f26ce)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250814/202508140527.ighXikjo-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508140527.ighXikjo-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> drivers/tee/qcomtee/user_obj.c:384:12: warning: format specifies type 'unsigned long' but the argument has type 'u64' (aka 'unsigned long long') [-Wformat]
383 | &qcomtee_user_object_ops, "uo-%lu",
| ~~~
| %llu
384 | param->u.objref.id);
| ^~~~~~~~~~~~~~~~~~
1 warning generated.
vim +384 drivers/tee/qcomtee/user_obj.c
355
356 /**
357 * qcomtee_user_param_to_object() - OBJREF parameter to &struct qcomtee_object.
358 * @object: object returned.
359 * @param: TEE parameter.
360 * @ctx: context in which the conversion should happen.
361 *
362 * @param is an OBJREF with %QCOMTEE_OBJREF_FLAG_USER flags.
363 *
364 * Return: On success, returns 0; on failure, returns < 0.
365 */
366 int qcomtee_user_param_to_object(struct qcomtee_object **object,
367 struct tee_param *param,
368 struct tee_context *ctx)
369 {
370 struct qcomtee_user_object *user_object __free(kfree) = NULL;
371 int err;
372
373 user_object = kzalloc(sizeof(*user_object), GFP_KERNEL);
374 if (!user_object)
375 return -ENOMEM;
376
377 user_object->ctx = ctx;
378 user_object->object_id = param->u.objref.id;
379 /* By default, always notify userspace upon release. */
380 user_object->notify = true;
381 err = qcomtee_object_user_init(&user_object->object,
382 QCOMTEE_OBJECT_TYPE_CB,
383 &qcomtee_user_object_ops, "uo-%lu",
> 384 param->u.objref.id);
385 if (err)
386 return err;
387 /* Matching teedev_ctx_put() is in qcomtee_user_object_release(). */
388 teedev_ctx_get(ctx);
389
390 *object = &no_free_ptr(user_object)->object;
391
392 return 0;
393 }
394
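The warning itself is the usual u64-vs-%lu format mismatch; the obvious fix
(untested sketch) is to switch the format to the long long variant the
compiler suggests:

	err = qcomtee_object_user_init(&user_object->object,
				       QCOMTEE_OBJECT_TYPE_CB,
				       &qcomtee_user_object_ops, "uo-%llu",
				       param->u.objref.id);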
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Amir,
On Wed, Aug 13, 2025 at 2:37 AM Amirreza Zarrabi
<amirreza.zarrabi(a)oss.qualcomm.com> wrote:
>
> This patch series introduces a Trusted Execution Environment (TEE)
> driver for Qualcomm TEE (QTEE). QTEE enables Trusted Applications (TAs)
> and services to run securely. It uses an object-based interface, where
> each service is an object with sets of operations. Clients can invoke
> these operations on objects, which can generate results, including other
> objects. For example, an object can load a TA and return another object
> that represents the loaded TA, allowing access to its services.
>
There are some build errors/warnings for arm and x86_64, see
https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/jens/plans/31DmCOn1pF…
Thanks,
Jens
On 25-07-25, 16:20, Jyothi Kumar Seerapu wrote:
>
>
> On 7/23/2025 12:45 PM, Vinod Koul wrote:
> > On 22-07-25, 15:46, Dmitry Baryshkov wrote:
> > > On Tue, Jul 22, 2025 at 05:50:08PM +0530, Jyothi Kumar Seerapu wrote:
> > > > On 7/19/2025 3:27 PM, Dmitry Baryshkov wrote:
> > > > > On Mon, Jul 07, 2025 at 09:58:30PM +0530, Jyothi Kumar Seerapu wrote:
> > > > > > On 7/4/2025 1:11 AM, Dmitry Baryshkov wrote:
> > > > > > > On Thu, 3 Jul 2025 at 15:51, Jyothi Kumar Seerapu
> >
> > [Folks, would be nice to trim replies]
> >
> > > > > > Could you please confirm if can go with the similar approach of unmap the
> > > > > > processed TREs based on a fixed threshold or constant value, instead of
> > > > > > unmapping them all at once?
> > > > >
> > > > > I'd still say, that's a bad idea. Please stay within the boundaries of
> > > > > the DMA API.
> > > > >
> > > > I agree with the approach you suggested—it's the GPI's responsibility to
> > > > manage the available TREs.
> > > >
> > > > However, I'm curious whether we can set a dynamic watermark value (perhaps
> > > > half the available TREs) to trigger unmapping of processed TREs? This would
> > > > allow the software to prepare the next set of TREs while the hardware
> > > > continues processing the remaining ones, enabling better parallelism and
> > > > throughput.
> > >
> > > Let's land the simple implementation first, which can then be improved.
> > > However I don't see any way to return 'above the watermark' from the DMA
> > > controller. You might need to enhance the API.
> >
> > Traditionally, we set the dma transfers for a watermark level and we get an
> > interrupt. So you might want to set the callback for the watermark level
> > and then do mapping/unmapping etc. in the callback. This is the typical
> > model for dmaengines; we should follow it here as well.
> >
> > BR
>
> Thanks Dmitry and Vinod. I will work on a v7 patch that submits the I2C
> messages until they fit and unmaps all processed messages together for
> now.
>
> Regarding the watermark mechanism, it looks like GENI SE DMA supports
> watermark interrupts, but GPI DMA doesn't appear to have such a watermark
> provision.
What is the mechanism to get interrupts from the GPI? If you submit 10
txn, can you ask it to interrupt when half of them are done?
--
~Vinod
On Mon, 21 Jul 2025 11:17:27 +0200, Tomeu Vizoso wrote:
> This series adds a new driver for the NPU that Rockchip includes in its
> newer SoCs, developed by them on the NVDLA base.
>
> In its current form, it supports the specific NPU in the RK3588 SoC.
>
> The userspace driver is part of Mesa and an initial draft can be found at:
>
> [...]
Applied, thanks!
[07/10] arm64: dts: rockchip: add pd_npu label for RK3588 power domains
commit: 6d64bceb97a1c93b3cc2131f7e023ef2f9cf33f2
[08/10] arm64: dts: rockchip: Add nodes for NPU and its MMU to rk3588-base
commit: a31dfc060a747f08705ace36d8de006bc13318fa
[09/10] arm64: dts: rockchip: Enable the NPU on quartzpro64
commit: 640366d644b1e282771a09c72be37162b6eda438
[10/10] arm64: dts: rockchip: enable NPU on ROCK 5B
commit: 3af6a83fc85033e44ce5cd0e1de54dc20b7e15af
Best regards,
--
Heiko Stuebner <heiko(a)sntech.de>
Hi Ling,
kernel test robot noticed the following build warnings:
[auto build test WARNING on char-misc/char-misc-testing]
[also build test WARNING on char-misc/char-misc-next char-misc/char-misc-linus linus/master v6.16 next-20250806]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Ling-Xu/misc-fastrpc-Save-ac…
base: char-misc/char-misc-testing
patch link: https://lore.kernel.org/r/20250806115114.688814-5-quic_lxu5%40quicinc.com
patch subject: [PATCH v2 4/4] misc: fastrpc: Skip reference for DMA handles
config: hexagon-randconfig-002-20250807 (https://download.01.org/0day-ci/archive/20250807/202508070731.S30957lV-lkp@…)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 7b8dea265e72c3037b6b1e54d5ab51b7e14f328b)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250807/202508070731.S30957lV-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508070731.S30957lV-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> drivers/misc/fastrpc.c:368:30: warning: unused variable 'sess' [-Wunused-variable]
368 | struct fastrpc_session_ctx *sess = fl->sctx;
| ^~~~
1 warning generated.
vim +/sess +368 drivers/misc/fastrpc.c
c68cfb718c8f97 Srinivas Kandagatla 2019-02-08 363
8f6c1d8c4f0cc3 Vamsi Krishna Gattupalli 2022-02-14 364
8f6c1d8c4f0cc3 Vamsi Krishna Gattupalli 2022-02-14 365 static int fastrpc_map_lookup(struct fastrpc_user *fl, int fd,
1922c68c56c660 Ling Xu 2025-08-06 366 struct fastrpc_map **ppmap)
c68cfb718c8f97 Srinivas Kandagatla 2019-02-08 367 {
9446fa1683a7e3 Abel Vesa 2022-11-24 @368 struct fastrpc_session_ctx *sess = fl->sctx;
c68cfb718c8f97 Srinivas Kandagatla 2019-02-08 369 struct fastrpc_map *map = NULL;
d259063578ed76 Ling Xu 2025-08-06 370 struct dma_buf *buf;
9446fa1683a7e3 Abel Vesa 2022-11-24 371 int ret = -ENOENT;
c68cfb718c8f97 Srinivas Kandagatla 2019-02-08 372
d259063578ed76 Ling Xu 2025-08-06 373 buf = dma_buf_get(fd);
d259063578ed76 Ling Xu 2025-08-06 374 if (IS_ERR(buf))
d259063578ed76 Ling Xu 2025-08-06 375 return PTR_ERR(buf);
d259063578ed76 Ling Xu 2025-08-06 376
9446fa1683a7e3 Abel Vesa 2022-11-24 377 spin_lock(&fl->lock);
c68cfb718c8f97 Srinivas Kandagatla 2019-02-08 378 list_for_each_entry(map, &fl->maps, node) {
d259063578ed76 Ling Xu 2025-08-06 379 if (map->fd != fd || map->buf != buf)
9446fa1683a7e3 Abel Vesa 2022-11-24 380 continue;
9446fa1683a7e3 Abel Vesa 2022-11-24 381
9446fa1683a7e3 Abel Vesa 2022-11-24 382 *ppmap = map;
9446fa1683a7e3 Abel Vesa 2022-11-24 383 ret = 0;
9446fa1683a7e3 Abel Vesa 2022-11-24 384 break;
c68cfb718c8f97 Srinivas Kandagatla 2019-02-08 385 }
9446fa1683a7e3 Abel Vesa 2022-11-24 386 spin_unlock(&fl->lock);
8f6c1d8c4f0cc3 Vamsi Krishna Gattupalli 2022-02-14 387
8f6c1d8c4f0cc3 Vamsi Krishna Gattupalli 2022-02-14 388 return ret;
8f6c1d8c4f0cc3 Vamsi Krishna Gattupalli 2022-02-14 389 }
8f6c1d8c4f0cc3 Vamsi Krishna Gattupalli 2022-02-14 390
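Since fastrpc_map_lookup() no longer uses the session context after this
change, the warning can presumably be silenced by simply dropping the local
(sketch, not a tested patch):

 static int fastrpc_map_lookup(struct fastrpc_user *fl, int fd,
 			      struct fastrpc_map **ppmap)
 {
-	struct fastrpc_session_ctx *sess = fl->sctx;
 	struct fastrpc_map *map = NULL;
 	struct dma_buf *buf;
 	int ret = -ENOENT;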
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Mon, Aug 04, 2025 at 10:10:32AM -0400, Benjamin LaHaise wrote:
> FYI: this entire patch series was rejected as spam by large numbers of
> linux-mm subscribers using @gmail.com email addresses.
Thanks for the heads-up. Are you aware of any issues from my side?
I'm sending patches with git-send-email through mail.kernel.org SMTP.
Thanks
>
> -ben (owner-linux-mm)
From: Leon Romanovsky <leonro(a)nvidia.com>
---------------------------------------------------------------------------
Based on blk and DMA patches which will be sent during coming merge window.
---------------------------------------------------------------------------
This series extends the VFIO PCI subsystem to support exporting MMIO regions
from PCI device BARs as dma-buf objects, enabling safe sharing of non-struct
page memory with controlled lifetime management. This allows RDMA and other
subsystems to import dma-buf FDs and build them into memory regions for PCI
P2P operations.
The series supports a use case for SPDK where an NVMe device will be owned
by SPDK through VFIO while interacting with an RDMA device. The RDMA device
may directly access the NVMe CMB or directly manipulate the NVMe device's
doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with
VFIO. This dmabuf approach can also be used by iommufd for generic and safe
P2P mappings.
In addition to the SPDK use-case mentioned above, the capability added
in this patch series can also be useful when a buffer (located in device
memory such as VRAM) needs to be shared between any two dGPU devices or
instances (assuming one of them is bound to VFIO PCI) as long as they
are P2P DMA compatible.
The implementation provides a revocable attachment mechanism using dma-buf
move operations. MMIO regions are normally pinned as BARs don't change
physical addresses, but access is revoked when the VFIO device is closed
or a PCI reset is issued. This ensures kernel self-defense against
potentially hostile userspace.
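The revoke path itself is conceptually small; a rough sketch, where the
structure and field names (vfio_pci_dma_buf, dmabuf, revoked) are assumptions
for illustration rather than the posted code:

static void vfio_pci_dma_buf_revoke(struct vfio_pci_dma_buf *priv)
{
	/*
	 * Taking the reservation lock and calling dma_buf_move_notify()
	 * forces importers to unmap; later map attempts then see
	 * priv->revoked and fail.
	 */
	dma_resv_lock(priv->dmabuf->resv, NULL);
	priv->revoked = true;
	dma_buf_move_notify(priv->dmabuf);
	dma_resv_unlock(priv->dmabuf->resv);
}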
The series includes significant refactoring of the PCI P2PDMA subsystem
to separate core P2P functionality from memory allocation features,
making it more modular and suitable for VFIO use cases that don't need
struct page support.
-----------------------------------------------------------------------
This is based on
https://lore.kernel.org/all/20250307052248.405803-1-vivek.kasireddy@intel.c…
but heavily rewritten to be based on DMA physical API.
-----------------------------------------------------------------------
The WIP branch can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=…
Thanks
Leon Romanovsky (8):
PCI/P2PDMA: Remove redundant bus_offset from map state
PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner
abstraction
PCI/P2PDMA: Simplify bus address mapping API
PCI/P2PDMA: Refactor to separate core P2P functionality from memory
allocation
PCI/P2PDMA: Export pci_p2pdma_map_type() function
types: move phys_vec definition to common header
vfio/pci: Enable peer-to-peer DMA transactions by default
vfio/pci: Add dma-buf export support for MMIO regions
Vivek Kasireddy (2):
vfio: Export vfio device get and put registration helpers
vfio/pci: Share the core device pointer while invoking feature
functions
block/blk-mq-dma.c | 7 +-
drivers/iommu/dma-iommu.c | 4 +-
drivers/pci/p2pdma.c | 144 +++++++++----
drivers/vfio/pci/Kconfig | 20 ++
drivers/vfio/pci/Makefile | 2 +
drivers/vfio/pci/vfio_pci_config.c | 22 +-
drivers/vfio/pci/vfio_pci_core.c | 59 ++++--
drivers/vfio/pci/vfio_pci_dmabuf.c | 321 +++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_priv.h | 23 +++
drivers/vfio/vfio_main.c | 2 +
include/linux/dma-buf.h | 1 +
include/linux/pci-p2pdma.h | 114 +++++-----
include/linux/types.h | 5 +
include/linux/vfio.h | 2 +
include/linux/vfio_pci_core.h | 4 +
include/uapi/linux/vfio.h | 19 ++
kernel/dma/direct.c | 4 +-
mm/hmm.c | 2 +-
18 files changed, 631 insertions(+), 124 deletions(-)
create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c
--
2.50.1
On Tue, Jul 29, 2025 at 02:54:13PM -0600, Logan Gunthorpe wrote:
>
>
> On 2025-07-28 17:11, Jason Gunthorpe wrote:
> >> If the dma mapping for P2P memory doesn't need to create an iommu
> >> mapping then that's fine. But it should be the dma-iommu layer to decide
> >> that.
> >
> > So above, we can't use dma-iommu.c, it might not be compiled into the
> > kernel but the dma_map_phys() path is still valid.
>
> This is an easily solved problem. I did a very rough sketch below to say
> it's really not that hard. (Note it has some rough edges that could be
> cleaned up and I based it off Leon's git repo which appears to not be
> the same as what was posted, but the core concept is sound).
I started to prepare v2, which is why the posted version is slightly
different from the dmabuf-vfio branch.
In addition to what Jason wrote, there is extra complexity with using the
state. The wrappers which operate on dma_iova_state assume that all memory
which is going to be mapped is of the same type: either p2p or not.
This is not the case for HMM/RDMA users; there you create the state in
advance and get mixed types of pages.
Thanks
On Tue, Jul 29, 2025 at 02:54:13PM -0600, Logan Gunthorpe wrote:
>
>
> On 2025-07-28 17:11, Jason Gunthorpe wrote:
> >> If the dma mapping for P2P memory doesn't need to create an iommu
> >> mapping then that's fine. But it should be the dma-iommu layer to decide
> >> that.
> >
> > So above, we can't use dma-iommu.c, it might not be compiled into the
> > kernel but the dma_map_phys() path is still valid.
>
> This is an easily solved problem. I did a very rough sketch below to say
> it's really not that hard. (Note it has some rough edges that could be
> cleaned up and I based it off Leon's git repo which appears to not be
> the same as what was posted, but the core concept is sound).
I did hope for something like this in the early days, but it proved
not so easy to get agreement on the details :(
My feeling was we should get some actual examples of using this thing,
and then it is far easier to discuss ideas, like yours here, to
improve it. Many of the discussions kind of got confused without
enough actual user code for everyone to refer to.
For instance the nvme use case is a big driver for the API design, and
it is quite different from these simpler flows, this idea needs to see
how it would work there.
Maybe this idea could also have provider = NULL meaning it is CPU
cachable memory?
> +static inline void dma_iova_try_alloc_p2p(struct p2pdma_provider *provider,
> + struct device *dev, struct dma_iova_state *state, phys_addr_t phys,
> + size_t size)
> +{
> +}
Can't be empty - PCI_P2PDMA_MAP_THRU_HOST_BRIDGE vs
PCI_P2PDMA_MAP_BUS_ADDR still matters so it still must set
dma_iova_state::bus_addr to get dma_map_phys_prealloc() to do the
right thing.
Still, it would make sense to put something like that in dma/mapping.c
and rely on the static inline stub for dma_iova_try_alloc()..
> for (i = 0; i < priv->nr_ranges; i++) {
> - if (!state) {
> - addr = pci_p2pdma_bus_addr_map(provider,
> - phys_vec[i].paddr);
> - } else if (dma_use_iova(state)) {
> - ret = dma_iova_link(attachment->dev, state,
> - phys_vec[i].paddr, 0,
> - phys_vec[i].len, dir, attrs);
> - if (ret)
> - goto err_unmap_dma;
> -
> - mapped_len += phys_vec[i].len;
> - } else {
> - addr = dma_map_phys(attachment->dev, phys_vec[i].paddr,
> - phys_vec[i].len, dir, attrs);
> - ret = dma_mapping_error(attachment->dev, addr);
> - if (ret)
> - goto err_unmap_dma;
> - }
> + addr = dma_map_phys_prealloc(attachment->dev, phys_vec[i].paddr,
> + phys_vec[i].len, dir, attrs, state,
> + provider);
There was a draft of something like this at some point. The
DMA_MAPPING_USE_IOVA is a new twist though
> #define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
> struct dma_iova_state {
> dma_addr_t addr;
> u64 __size;
> + bool bus_addr;
> };
Growing this structure has been strongly pushed back on. This can probably
be solved in some other way, a bitfield on size perhaps...
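A rough sketch of that idea (the bit is a placeholder; it would have to be
one that __size doesn't already carry, and callers would need to test it
before dma_use_iova(), since any non-zero __size reads as "IOVA in use"):

#define DMA_IOVA_BUS_ADDR	(1ULL << 62)	/* placeholder bit */

static inline void dma_iova_set_bus_addr(struct dma_iova_state *state)
{
	state->__size |= DMA_IOVA_BUS_ADDR;
}

static inline bool dma_iova_bus_addr(const struct dma_iova_state *state)
{
	return state->__size & DMA_IOVA_BUS_ADDR;
}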
> +dma_addr_t dma_map_phys_prealloc(struct device *dev, phys_addr_t phys,
> size_t size,
> + enum dma_data_direction dir, unsigned long attrs,
> + struct dma_iova_state *state, struct p2pdma_provider *provider)
> +{
> + int ret;
> +
> + if (state->bus_addr)
> + return pci_p2pdma_bus_addr_map(provider, phys);
> +
> + if (dma_use_iova(state)) {
> + ret = dma_iova_link(dev, state, phys, 0, size, dir, attrs);
> + if (ret)
> + return DMA_MAPPING_ERROR;
> +
> + return DMA_MAPPING_USE_IOVA;
> + }
> +
> + return dma_map_phys(dev, phys, size, dir, attrs);
> +}
> +EXPORT_SYMBOL_GPL(dma_map_phys_prealloc);
I would be tempted to inline this
Overall, yeah I would certainly welcome improvements like this if
everyone can agree, but I'd really like to see nvme merged before we
start working on ideas. That way the proposal can be properly
evaluated by all the stakeholders.
Jason
On Mon, Jul 28, 2025 at 11:07:34AM -0600, Logan Gunthorpe wrote:
>
>
> On 2025-07-28 10:41, Leon Romanovsky wrote:
> > On Mon, Jul 28, 2025 at 10:12:31AM -0600, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2025-07-27 13:05, Jason Gunthorpe wrote:
> >>> On Fri, Jul 25, 2025 at 10:30:46AM -0600, Logan Gunthorpe wrote:
> >>>>
> >>>>
> >>>> On 2025-07-24 02:13, Leon Romanovsky wrote:
> >>>>> On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> >>>>>> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> >>>>>>> From: Leon Romanovsky <leonro(a)nvidia.com>
> >>>>>>>
> >>>>>>> Export the pci_p2pdma_map_type() function to allow external modules
> >>>>>>> and subsystems to determine the appropriate mapping type for P2PDMA
> >>>>>>> transfers between a provider and target device.
> >>>>>>
> >>>>>> External modules have no business doing this.
> >>>>>
> >>>>> VFIO PCI code is built as module. There is no way to access PCI p2p code
> >>>>> without exporting functions in it.
> >>>>
> >>>> The solution that would make more sense to me would be for either
> >>>> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
> >>>> P2PDMA case.
> >>>
> >>> This has nothing to do with dma-iommu.c, the decisions here still need
> >>> to be made even if dma-iommu.c is not compiled in.
> >>
> >> Doesn't it though? Every single call in patch 10 to the newly exported
> >> PCI functions calls into the the dma-iommu functions.
Patch 10 has lots of flows, only one will end up in dma-iommu.c
vfio_pci_dma_buf_map() calls pci_p2pdma_bus_addr_map(),
dma_iova_link(), dma_map_phys().
Only iova_link would call to dma-iommu.c - if dma_map_phys() is called
we know that dma-iommu.c won't be called by it.
> >> If there were non-iommu paths then I would expect the code would
> >> use the regular DMA api directly which would then call in to
> >> dma-iommu.
> >
> > If p2p type is PCI_P2PDMA_MAP_BUS_ADDR, there will no dma-iommu and DMA
> > at all.
>
> I understand that and it is completely beside my point.
>
> If the dma mapping for P2P memory doesn't need to create an iommu
> mapping then that's fine. But it should be the dma-iommu layer to decide
> that.
So above, we can't use dma-iommu.c, it might not be compiled into the
kernel but the dma_map_phys() path is still valid.
> It's not a decision that should be made by every driver doing this
> kind of thing.
Sort of. I think we are trying to get to some place where there are
subsystem-specific, or at least data-structure-specific, helpers that do this
(i.e. nvme has BIO helpers), but the helpers should be running this
logic directly for performance. Leon hasn't done it, but I think we
should see helpers for DMABUF too, encapsulating the logic shown in
patch 10. I think we need to prove out these basic points first
before trying to go and convert a bunch of GPU drivers.
The vfio code in patch 10 is not the full example since it effectively only
has a single "scatter/gather" entry, but the generalized version loops
over pci_p2pdma_bus_addr_map(), dma_iova_link(), dma_map_phys() for
each page.
Part of the new API design is to only do one kind of mapping operation
at once, and part of the design is we know that the P2P type is fixed.
It makes no performance sense to check the type inside the
pci_p2pdma_bus_addr_map()/ dma_iova_link()/dma_map_phys() within the
per-page loop.
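Roughly, that generalized loop has the shape below; this is only an
illustrative sketch built on the functions named above (error unwinding and
the exact phys_vec plumbing are elided), not code from the series:

static int map_ranges(struct device *dev, struct p2pdma_provider *provider,
		      struct dma_iova_state *state, struct phys_vec *phys,
		      dma_addr_t *dma, unsigned int nr,
		      enum dma_data_direction dir, unsigned long attrs)
{
	size_t offset = 0;
	unsigned int i;
	int ret;

	/* The P2P type is fixed for the whole transfer: decide once, not per page. */
	switch (pci_p2pdma_map_type(provider, dev)) {
	case PCI_P2PDMA_MAP_BUS_ADDR:
		for (i = 0; i < nr; i++)
			dma[i] = pci_p2pdma_bus_addr_map(provider, phys[i].paddr);
		return 0;
	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
		if (dma_use_iova(state)) {
			for (i = 0; i < nr; i++) {
				ret = dma_iova_link(dev, state, phys[i].paddr,
						    offset, phys[i].len, dir, attrs);
				if (ret)
					return ret;
				offset += phys[i].len;
			}
			return 0;
		}
		for (i = 0; i < nr; i++) {
			dma[i] = dma_map_phys(dev, phys[i].paddr,
					      phys[i].len, dir, attrs);
			if (dma_mapping_error(dev, dma[i]))
				return -ENOMEM;
		}
		return 0;
	default:
		return -EINVAL;
	}
}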
I do think some level of abstraction has been lost here in pursuit of
performance. If someone does have a better way to structure this
without a performance hit then fantastic, but that's going back and
revising the new DMA API. This just builds on top of that, and yes, it
is not so abstract.
Jason
On Mon, Jul 28, 2025 at 10:12:31AM -0600, Logan Gunthorpe wrote:
>
>
> On 2025-07-27 13:05, Jason Gunthorpe wrote:
> > On Fri, Jul 25, 2025 at 10:30:46AM -0600, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2025-07-24 02:13, Leon Romanovsky wrote:
> >>> On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> >>>> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> >>>>> From: Leon Romanovsky <leonro(a)nvidia.com>
> >>>>>
> >>>>> Export the pci_p2pdma_map_type() function to allow external modules
> >>>>> and subsystems to determine the appropriate mapping type for P2PDMA
> >>>>> transfers between a provider and target device.
> >>>>
> >>>> External modules have no business doing this.
> >>>
> >>> VFIO PCI code is built as module. There is no way to access PCI p2p code
> >>> without exporting functions in it.
> >>
> >> The solution that would make more sense to me would be for either
> >> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
> >> P2PDMA case.
> >
> > This has nothing to do with dma-iommu.c, the decisions here still need
> > to be made even if dma-iommu.c is not compiled in.
>
> Doesn't it though? Every single call in patch 10 to the newly exported
> PCI functions calls into the dma-iommu functions. If there were
> non-iommu paths then I would expect the code would use the regular DMA
> API directly, which would then call into dma-iommu.
If the p2p type is PCI_P2PDMA_MAP_BUS_ADDR, there will be no dma-iommu and
no DMA API involvement at all.
+static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
+ struct dma_buf_attachment *attachment)
+{
+ struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+ if (!attachment->peer2peer)
+ return -EOPNOTSUPP;
+
+ if (priv->revoked)
+ return -ENODEV;
+
+ switch (pci_p2pdma_map_type(priv->vdev->provider, attachment->dev)) {
+ case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+ break;
+ case PCI_P2PDMA_MAP_BUS_ADDR:
+ /*
+ * There is no need in IOVA at all for this flow.
+ * We rely on attachment->priv == NULL as a marker
+ * for this mode.
+ */
+ return 0;
+ default:
+ return -EINVAL;
+ }
+
+ attachment->priv = kzalloc(sizeof(struct dma_iova_state), GFP_KERNEL);
+ if (!attachment->priv)
+ return -ENOMEM;
+
+ dma_iova_try_alloc(attachment->dev, attachment->priv, 0, priv->phys_vec.len);
+ return 0;
+}
We've discussed a number of times how some heap names are bad, but
not really what makes a good heap name.
Let's document what we expect the heap names to look like.
Reviewed-by: Bagas Sanjaya <bagasdotme(a)gmail.com>
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
Changes in v3:
- Grammar, spelling fixes
- Remove the cacheable / uncacheable name suggestion
- Link to v2: https://lore.kernel.org/r/20250616-dma-buf-heap-names-doc-v2-1-8ae43174cdbf…
Changes in v2:
- Added justifications for each requirement / suggestions
- Added a mention and example of buffer attributes
- Link to v1: https://lore.kernel.org/r/20250520-dma-buf-heap-names-doc-v1-1-ab31f74809ee…
---
Documentation/userspace-api/dma-buf-heaps.rst | 35 +++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/Documentation/userspace-api/dma-buf-heaps.rst b/Documentation/userspace-api/dma-buf-heaps.rst
index 535f49047ce6450796bf4380c989e109355efc05..3ee4e7961fe390ba356a2125d53b060546c3e4a6 100644
--- a/Documentation/userspace-api/dma-buf-heaps.rst
+++ b/Documentation/userspace-api/dma-buf-heaps.rst
@@ -21,5 +21,40 @@ following heaps:
usually created either through the kernel commandline through the
`cma` parameter, a memory region Device-Tree node with the
`linux,cma-default` property set, or through the `CMA_SIZE_MBYTES` or
`CMA_SIZE_PERCENTAGE` Kconfig options. Depending on the platform, it
might be called ``reserved``, ``linux,cma``, or ``default-pool``.
+
+Naming Convention
+=================
+
+``dma-buf`` heap names should meet a number of constraints:
+
+- The name must be stable, and must not change from one kernel version to the
+  next. Userspace identifies heaps by their name, so if the names ever change,
+  we would likely introduce regressions.
+
+- The name must describe the memory region the heap will allocate from, and
+  must uniquely identify it on a given platform. Since userspace applications
+  use the heap name as the discriminant, they must be able to tell reliably
+  which heap they want to use when there are multiple heaps.
+
+- The name must not mention implementation details, such as the allocator. The
+  heap driver will change over time, and implementation details from when it
+  was introduced might not be relevant in the future.
+
+- The name should describe properties of the buffers that would be allocated.
+ Doing so will make heap identification easier for userspace. Such properties
+ are:
+
+ - ``contiguous`` for physically contiguous buffers;
+
+  - ``protected`` for encrypted buffers not accessible to the OS;
+
+- The name may describe intended usage. Doing so will make heap identification
+ easier for userspace applications and users.
+
+For example, assuming a platform with a reserved memory region located at the
+RAM address 0x42000000, intended to allocate video framebuffers, physically
+contiguous, and backed by the CMA kernel allocator, good names would be
+``memory@42000000-cacheable-contiguous`` or ``video@42000000``, but
+``cma-video`` wouldn't.
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250520-dma-buf-heap-names-doc-31261aa0cfe6
Best regards,
--
Maxime Ripard <mripard(a)kernel.org>
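To connect the naming rules above to how userspace actually consumes them:
heaps are opened by name under /dev/dma_heap/, so a renamed heap silently
breaks existing users. Below is a minimal userspace sketch against the
existing DMA_HEAP_IOCTL_ALLOC UAPI; the heap name is just the example from
the patch above and error reporting is trimmed.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

/* Allocate len bytes from the named heap, return the dma-buf fd or -1. */
static int heap_alloc(const char *name, size_t len)
{
	struct dma_heap_allocation_data data = {
		.len = len,
		.fd_flags = O_RDWR | O_CLOEXEC,
	};
	char path[128];
	int heap_fd, ret;

	snprintf(path, sizeof(path), "/dev/dma_heap/%s", name);
	heap_fd = open(path, O_RDWR | O_CLOEXEC);
	if (heap_fd < 0)
		return -1;

	ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
	close(heap_fd);
	return ret < 0 ? -1 : (int)data.fd;
}

A caller would do e.g. heap_alloc("video@42000000", 4096) and pass the
returned dma-buf fd to the importing driver; if the heap were later renamed,
that lookup is exactly what breaks.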
On Fri, Jul 25, 2025 at 10:30:46AM -0600, Logan Gunthorpe wrote:
>
>
> On 2025-07-24 02:13, Leon Romanovsky wrote:
> > On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> >> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> >>> From: Leon Romanovsky <leonro(a)nvidia.com>
> >>>
> >>> Export the pci_p2pdma_map_type() function to allow external modules
> >>> and subsystems to determine the appropriate mapping type for P2PDMA
> >>> transfers between a provider and target device.
> >>
> >> External modules have no business doing this.
> >
> > VFIO PCI code is built as module. There is no way to access PCI p2p code
> > without exporting functions in it.
>
> The solution that would make more sense to me would be for either
> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
> P2PDMA case.
This has nothing to do with dma-iommu.c, the decisions here still need
to be made even if dma-iommu.c is not compiled in.
It could be exported from the main dma code, but I think it would just
be a 1 line wrapper around the existing function? I'd rather rename
the functions and leave them in the p2pdma.c files...
Jason
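For reference, the "1 line wrapper" alternative would look roughly like the
sketch below: keep pci_p2pdma_map_type() private and export a trivial
wrapper from the built-in DMA core instead. The provider-based signature is
the one this series uses, and the wrapper name is made up for illustration.

/* Hypothetical wrapper living in kernel/dma/, exported for modules. */
enum pci_p2pdma_map_type dma_p2pdma_map_type(struct p2pdma_provider *provider,
					     struct device *dev)
{
	return pci_p2pdma_map_type(provider, dev);
}
EXPORT_SYMBOL_GPL(dma_p2pdma_map_type);

Whether that indirection is worth it is exactly the disagreement in this
thread; it avoids exporting the p2pdma internals but only adds a second name
for the same decision.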
On Fri, Jul 25, 2025 at 05:34:40AM +0000, Kasireddy, Vivek wrote:
> Hi Leon,
>
> > Subject: Re: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO
> > regions
> >
> > > >
> > > > From: Leon Romanovsky <leonro(a)nvidia.com>
> > > >
> > > > Add support for exporting PCI device MMIO regions through dma-buf,
> > > > enabling safe sharing of non-struct page memory with controlled
> > > > lifetime management. This allows RDMA and other subsystems to import
> > > > dma-buf FDs and build them into memory regions for PCI P2P operations.
> > > >
> > > > The implementation provides a revocable attachment mechanism using
> > > > dma-buf move operations. MMIO regions are normally pinned as BARs
> > > > don't change physical addresses, but access is revoked when the VFIO
> > > > device is closed or a PCI reset is issued. This ensures kernel
> > > > self-defense against potentially hostile userspace.
> > > >
> > > > Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
> > > > Signed-off-by: Vivek Kasireddy <vivek.kasireddy(a)intel.com>
> > > > Signed-off-by: Leon Romanovsky <leonro(a)nvidia.com>
> > > > ---
> > > > drivers/vfio/pci/Kconfig | 20 ++
> > > > drivers/vfio/pci/Makefile | 2 +
> > > > drivers/vfio/pci/vfio_pci_config.c | 22 +-
> > > > drivers/vfio/pci/vfio_pci_core.c | 25 ++-
> > > >  drivers/vfio/pci/vfio_pci_dmabuf.c | 321 +++++++++++++++++++++++++++++
> > > > drivers/vfio/pci/vfio_pci_priv.h | 23 +++
> > > > include/linux/dma-buf.h | 1 +
> > > > include/linux/vfio_pci_core.h | 3 +
> > > > include/uapi/linux/vfio.h | 19 ++
> > > > 9 files changed, 431 insertions(+), 5 deletions(-)
> > > > create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c
> >
> > <...>
> >
> > > > +static int validate_dmabuf_input(struct vfio_pci_core_device *vdev,
> > > > +				 struct vfio_device_feature_dma_buf *dma_buf)
> > > > +{
> > > > + struct pci_dev *pdev = vdev->pdev;
> > > > + u32 bar = dma_buf->region_index;
> > > > + u64 offset = dma_buf->offset;
> > > > + u64 len = dma_buf->length;
> > > > + resource_size_t bar_size;
> > > > + u64 sum;
> > > > +
> > > > + /*
> > > > + * For PCI the region_index is the BAR number like everything else.
> > > > + */
> > > > + if (bar >= VFIO_PCI_ROM_REGION_INDEX)
> > > > + return -ENODEV;
> >
> > <...>
> >
> > > > +/**
> > > > + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> > > > + * regions selected.
> > > > + *
> > > > + * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
> > > > + * etc. offset/length specify a slice of the region to create the dmabuf from.
> > > > + * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf.
> > > Any particular reason why you dropped the option (nr_ranges) of creating a
> > > single dmabuf from multiple ranges of an MMIO region?
> >
> > I did it for two reasons. First, I wanted to simplify the code in order
> > to speed up discussion over the patchset itself. Second, I failed to
> > find justification for the need for multiple ranges, as the number of BARs
> > is limited by VFIO_PCI_ROM_REGION_INDEX (6) and the same functionality
> > can be achieved by multiple calls to DMABUF import.
> I don't think the same functionality can be achieved by multiple calls to
> dmabuf import. AFAIU, a dmabuf (as of today) is backed by a SGL that can
> have multiple entries because it represents a scattered buffer (multiple
> non-contiguous entries in System RAM or an MMIO region).
I don't know all the reasons why SG was chosen, but one of the main
reasons is that the DMA SG API was the only possible way to handle p2p
transfers (the peer2peer flag).
> But in this patch you are constraining it such that only one entry associated with a
> buffer would be included, which effectively means that we cannot create
> a dmabuf to represent scattered buffers (located in a single MMIO region
> such as VRAM or other device memory) anymore.
Yes
>
> >
> > >
> > > Restricting the dmabuf to a single range (or having to create multiple dmabufs
> > > to represent multiple regions/ranges associated with a single scattered buffer)
> > > would be very limiting and may not work in all cases. For instance, in my use-case,
> > > I am trying to share a large (4k mode) framebuffer (FB) located in GPU's VRAM
> > > between two (p2p compatible) GPU devices. And, this would probably not work
> > > given that allocating a large contiguous FB (nr_ranges = 1) in VRAM may not be
> > > feasible when there is memory pressure.
> >
> > Can you please help me and point to the place in the code where this can
> > fail?
> > I'm probably missing something basic as there are no large allocations
> > in the current patchset.
> Sorry, I was not very clear. What I meant is that it is not prudent to assume that
> there will only be one range associated with an MMIO region which we need to
> consider while creating a dmabuf. And, I was pointing out my use-case as an
> example where vfio-pci needs to create a dmabuf for a large buffer (FB) that
> would likely be scattered (and not contiguous) in an MMIO region (such as VRAM).
>
> Let me further explain with my use-case. Here is a link to my Qemu-based test:
> https://gitlab.freedesktop.org/Vivek/qemu/-/commit/b2bdb16d9cfaf55384c95b1f…
Ohh, thanks. I'll add nr_ranges in the next version. I see that you are
using the same region_index for all ranges, and this is how I would like to
keep it: "multiple nr_ranges, same region_index".
Thanks
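For the record, a "multiple ranges, same region_index" UAPI could look
roughly like the sketch below. This is purely illustrative and not part of
the posted patch; the field names are made up.

struct vfio_region_dma_range {
	__u64 offset;
	__u64 length;
};

struct vfio_device_feature_dma_buf {
	__u32 region_index;
	__u32 open_flags;
	__u32 nr_ranges;
	__u32 __reserved;
	struct vfio_region_dma_range dma_ranges[];
};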
On Fri, Jul 25, 2025 at 01:12:35PM -0600, Logan Gunthorpe wrote:
>
>
> On 2025-07-25 12:54, Leon Romanovsky wrote:
> >> The solution that would make more sense to me would be for either
> >> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
> >> P2PDMA case. dma-iommu.c already uses those same interfaces and thus
> >> there would be no need to export the low level helpers from the p2pdma code.
> >
> > I had the same idea in early versions of the DMA phys API discussion and it
> > was pointed out (absolutely rightly) that this is a layering violation.
>
> Respectfully, I have to disagree with this. Having the layer (ie.
> dma-iommu) that normally checks how to handle a P2PDMA request now check
> how to handle these DMA requests is the exact opposite of a layering
> violation.
I'm aware of your implementation and have a feeling that it was heavily
influenced by NVMe requirements, so the end result is very tailored
to it. Other users have very different paths when p2p is taken. Just
see the last VFIO patch in this series; it skips all DMA logic.
> Expecting every driver that wants to do P2PDMA to have to
> figure out for themselves how to map the memory before calling into the
> DMA API doesn't seem like a good design choice to me.
We had this discussion on previous versions too. The summary is
that p2p-capable devices are very special anyway; they need to work with
p2p natively. BTW, the implementation is not supposed to live in the
drivers, but in their respective subsystems.
>
> > So unfortunately, I think that dma*.c|h is not the right place for the p2p
> > type check.
>
> dma*.c is already where those checks are done. I'm not sure patches to
> remove the code from that layer and put it into the NVMe driver would
> make a lot of sense (and then, of course, we'd have to put it into every
> other driver that wants to participate in p2p transactions).
I don't have plans to remove existing checks right now, but NVMe was already
converted to the new DMA phys API.
Thanks
>
> Logan
>
>
On Fri, Jul 25, 2025 at 10:30:46AM -0600, Logan Gunthorpe wrote:
>
>
> On 2025-07-24 02:13, Leon Romanovsky wrote:
> > On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> >> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> >>> From: Leon Romanovsky <leonro(a)nvidia.com>
> >>>
> >>> Export the pci_p2pdma_map_type() function to allow external modules
> >>> and subsystems to determine the appropriate mapping type for P2PDMA
> >>> transfers between a provider and target device.
> >>
> >> External modules have no business doing this.
> >
> > VFIO PCI code is built as module. There is no way to access PCI p2p code
> > without exporting functions in it.
>
> The solution that would make more sense to me would be for either
> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
> P2PDMA case. dma-iommu.c already uses those same interfaces and thus
> there would be no need to export the low level helpers from the p2pdma code.
I had the same idea in early versions of the DMA phys API discussion and it
was pointed out (absolutely rightly) that this is a layering violation.
At that time, that remark wasn't so clear to me because the HMM code
performs a p2p check on every page and calls dma_iova_try_alloc()
before that check. But this VFIO DMABUF code shows it much more clearly.
The p2p check is performed before any DMA calls, and in the case of the
PCI_P2PDMA_MAP_BUS_ADDR p2p type between the DMABUF exporter and importer
devices, we don't call dma_iova_try_alloc() or any DMA API at all.
So unfortunately, I think that dma*.c|h is not the right place for the p2p
type check.
Thanks
>
> Logan
>
On Thu, Jul 24, 2025 at 05:13:49AM +0000, Kasireddy, Vivek wrote:
> Hi Leon,
>
> > Subject: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO
> > regions
> >
> > From: Leon Romanovsky <leonro(a)nvidia.com>
> >
> > Add support for exporting PCI device MMIO regions through dma-buf,
> > enabling safe sharing of non-struct page memory with controlled
> > lifetime management. This allows RDMA and other subsystems to import
> > dma-buf FDs and build them into memory regions for PCI P2P operations.
> >
> > The implementation provides a revocable attachment mechanism using
> > dma-buf move operations. MMIO regions are normally pinned as BARs
> > don't change physical addresses, but access is revoked when the VFIO
> > device is closed or a PCI reset is issued. This ensures kernel
> > self-defense against potentially hostile userspace.
> >
> > Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy(a)intel.com>
> > Signed-off-by: Leon Romanovsky <leonro(a)nvidia.com>
> > ---
> > drivers/vfio/pci/Kconfig | 20 ++
> > drivers/vfio/pci/Makefile | 2 +
> > drivers/vfio/pci/vfio_pci_config.c | 22 +-
> > drivers/vfio/pci/vfio_pci_core.c | 25 ++-
> > drivers/vfio/pci/vfio_pci_dmabuf.c | 321 +++++++++++++++++++++++++++++
> > drivers/vfio/pci/vfio_pci_priv.h | 23 +++
> > include/linux/dma-buf.h | 1 +
> > include/linux/vfio_pci_core.h | 3 +
> > include/uapi/linux/vfio.h | 19 ++
> > 9 files changed, 431 insertions(+), 5 deletions(-)
> > create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c
<...>
> > +static int validate_dmabuf_input(struct vfio_pci_core_device *vdev,
> > + struct vfio_device_feature_dma_buf *dma_buf)
> > +{
> > + struct pci_dev *pdev = vdev->pdev;
> > + u32 bar = dma_buf->region_index;
> > + u64 offset = dma_buf->offset;
> > + u64 len = dma_buf->length;
> > + resource_size_t bar_size;
> > + u64 sum;
> > +
> > + /*
> > + * For PCI the region_index is the BAR number like everything else.
> > + */
> > + if (bar >= VFIO_PCI_ROM_REGION_INDEX)
> > + return -ENODEV;
<...>
> > +/**
> > + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> > + * regions selected.
> > + *
> > + * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
> > + * etc. offset/length specify a slice of the region to create the dmabuf from.
> > + * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf.
> Any particular reason why you dropped the option (nr_ranges) of creating a
> single dmabuf from multiple ranges of an MMIO region?
I did it for two reasons. First, I wanted to simplify the code in order
to speed up discussion over the patchset itself. Second, I failed to
find justification for the need for multiple ranges, as the number of BARs
is limited by VFIO_PCI_ROM_REGION_INDEX (6) and the same functionality
can be achieved by multiple calls to DMABUF import.
>
> Restricting the dmabuf to a single range (or having to create multiple dmabufs
> to represent multiple regions/ranges associated with a single scattered buffer)
> would be very limiting and may not work in all cases. For instance, in my use-case,
> I am trying to share a large (4k mode) framebuffer (FB) located in GPU's VRAM
> between two (p2p compatible) GPU devices. And, this would probably not work
> given that allocating a large contiguous FB (nr_ranges = 1) in VRAM may not be
> feasible when there is memory pressure.
Can you please help me and point to the place in the code where this can fail?
I'm probably missing something basic as there are no large allocations
in the current patchset.
>
> Furthermore, since you are adding a new UAPI with this patch/feature, as you know,
> we cannot go back and tweak it (to add support for nr_ranges > 1) should there
> be a need in the future, but you can always use nr_ranges = 1 anytime. Therefore,
> I think it makes sense to be flexible in terms of the number of ranges to include
> while creating a dmabuf instead of restricting ourselves to one range.
I'm not a big fan of over-engineering. Let's first understand if this
case is needed.
Thanks
>
> Thanks,
> Vivek
>
> > + *
> > + * Return: The fd number on success, -1 and errno is set on failure.
> > + */
> > +#define VFIO_DEVICE_FEATURE_DMA_BUF 11
> > +
> > +struct vfio_device_feature_dma_buf {
> > + __u32 region_index;
> > + __u32 open_flags;
> > + __u64 offset;
> > + __u64 length;
> > +};
> > +
> > /* -------- API for Type1 VFIO IOMMU -------- */
> >
> > /**
> > --
> > 2.50.1
>
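For completeness, the proposed feature would be driven through the existing
VFIO_DEVICE_FEATURE ioctl. Below is a userspace sketch, assuming headers
patched with the struct above; the function name and error handling are
illustrative only, not part of the posted patch.

#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Ask VFIO for a dma-buf fd covering a slice of one BAR. */
static int get_bar_dmabuf(int device_fd, uint32_t region_index,
			  uint64_t offset, uint64_t length)
{
	size_t sz = sizeof(struct vfio_device_feature) +
		    sizeof(struct vfio_device_feature_dma_buf);
	struct vfio_device_feature *feat = calloc(1, sz);
	struct vfio_device_feature_dma_buf *dma_buf;
	int fd;

	if (!feat)
		return -1;

	feat->argsz = sz;
	feat->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_BUF;
	dma_buf = (void *)feat->data;
	dma_buf->region_index = region_index;
	dma_buf->open_flags = O_RDWR | O_CLOEXEC;
	dma_buf->offset = offset;
	dma_buf->length = length;

	/* Per the proposed UAPI, the ioctl returns the new dma-buf fd. */
	fd = ioctl(device_fd, VFIO_DEVICE_FEATURE, feat);
	free(feat);
	return fd;
}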