We've discussed a number of times of how some heap names are bad, but
not really what makes a good heap name.
Let's document what we expect the heap names to look like.
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
Documentation/userspace-api/dma-buf-heaps.rst | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/Documentation/userspace-api/dma-buf-heaps.rst b/Documentation/userspace-api/dma-buf-heaps.rst
index 535f49047ce6450796bf4380c989e109355efc05..b24618e360a9a9ba0bd85135d8c1760776f1a37f 100644
--- a/Documentation/userspace-api/dma-buf-heaps.rst
+++ b/Documentation/userspace-api/dma-buf-heaps.rst
@@ -21,5 +21,24 @@ following heaps:
usually created either through the kernel commandline through the
`cma` parameter, a memory region Device-Tree node with the
`linux,cma-default` property set, or through the `CMA_SIZE_MBYTES` or
`CMA_SIZE_PERCENTAGE` Kconfig options. Depending on the platform, it
might be called ``reserved``, ``linux,cma``, or ``default-pool``.
+
+Naming Convention
+=================
+
+A good heap name is a name that:
+
+- Is stable, and won't change from one version to the other;
+
+- Describes the memory region the heap will allocate from, and will
+ uniquely identify it in a given platform;
+
+- Doesn't use implementation details, such as the allocator;
+
+- Can describe intended usage.
+
+For example, assuming a platform with a reserved memory region located
+at the RAM address 0x42000000, intended to allocate video framebuffers,
+and backed by the CMA kernel allocator. Good names would be
+`memory@42000000` or `video@42000000`, but `cma-video` wouldn't.
---
base-commit: 92a09c47464d040866cf2b4cd052bc60555185fb
change-id: 20250520-dma-buf-heap-names-doc-31261aa0cfe6
Best regards,
--
Maxime Ripard <mripard(a)kernel.org>
On Fri, Jun 13, 2025 at 09:33:55AM +0000, wangtao wrote:
>
> >
> > On Mon, Jun 09, 2025 at 09:32:20AM +0000, wangtao wrote:
> > > Are you suggesting adding an ITER_DMABUF type to iov_iter,
> >
> > Yes.
>
> May I clarify: Do all disk operations require data to pass through
> memory (reading into memory or writing from memory)? In the block layer,
> the bio structure uses bio_iov_iter_get_pages to convert iter_type
> objects into memory-backed bio_vec representations.
> However, some dmabufs are not memory-based, making page-to-bio_vec
> conversion impossible. This suggests adding a callback function in
> dma_buf_ops to handle dmabuf- to-bio_vec conversion.
bios do support PCI P2P tranfers. This could be fairly easily extended
to other peer to peer transfers if we manage to come up with a coherent
model for them. No need for a callback.
On Fri, Jun 13, 2025 at 09:43:08AM +0000, wangtao wrote:
> Here's my analysis on Linux 6.6 with F2FS/iomap.
Linux 6.6 is almost two years old and completely irrelevant. Please
provide numbers on 6.16 or current Linus' tree.
On 6/10/25 18:42, Tvrtko Ursulin wrote:
> Dma-fence objects currently suffer from a potential use after free problem
> where fences exported to userspace and other drivers can outlive the
> exporting driver, or the associated data structures.
>
> The discussion on how to address this concluded that adding reference
> counting to all the involved objects is not desirable, since it would need
> to be very wide reaching and could cause unloadable drivers if another
> entity would be holding onto a signaled fence reference potentially
> indefinitely.
>
> This patch enables the safe access by introducing and documenting a
> contract between fence exporters and users. It documents a set of
> contraints and adds helpers which a) drivers with potential to suffer from
> the use after free must use and b) users of the dma-fence API must use as
> well.
>
> Premise of the design has multiple sides:
>
> 1. Drivers (fence exporters) MUST ensure a RCU grace period between
> signalling a fence and freeing the driver private data associated with it.
>
> The grace period does not have to follow the signalling immediately but
> HAS to happen before data is freed.
>
> 2. Users of the dma-fence API marked with such requirement MUST contain
> the complete access to the data within a single code block guarded by
> rcu_read_lock() and rcu_read_unlock().
>
> The combination of the two ensures that whoever sees the
> DMA_FENCE_FLAG_SIGNALED_BIT not set is guaranteed to have access to a
> valid fence->lock and valid data potentially accessed by the fence->ops
> virtual functions, until the call to rcu_read_unlock().
>
> 3. Module unload (fence->ops) disappearing is for now explicitly not
> handled. That would required a more complex protection, possibly needing
> SRCU instead of RCU to handle callers such as dma_fence_release() and
> dma_fence_wait_timeout(), where race between
> dma_fence_enable_sw_signaling, signalling, and dereference of
> fence->ops->wait() would need a sleeping SRCU context.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
Reviewed-by: Christian König <christian.koenig(a)amd.com>
> ---
> drivers/dma-buf/dma-fence.c | 111 ++++++++++++++++++++++++++++---
> include/linux/dma-fence.h | 31 ++++++---
> include/trace/events/dma_fence.h | 38 +++++++++--
> 3 files changed, 157 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> index 74f9e4b665e3..3f78c56b58dc 100644
> --- a/drivers/dma-buf/dma-fence.c
> +++ b/drivers/dma-buf/dma-fence.c
> @@ -511,12 +511,20 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
>
> dma_fence_enable_sw_signaling(fence);
>
> - trace_dma_fence_wait_start(fence);
> + if (trace_dma_fence_wait_start_enabled()) {
> + rcu_read_lock();
> + trace_dma_fence_wait_start(fence);
> + rcu_read_unlock();
> + }
> if (fence->ops->wait)
> ret = fence->ops->wait(fence, intr, timeout);
> else
> ret = dma_fence_default_wait(fence, intr, timeout);
> - trace_dma_fence_wait_end(fence);
> + if (trace_dma_fence_wait_end_enabled()) {
> + rcu_read_lock();
> + trace_dma_fence_wait_end(fence);
> + rcu_read_unlock();
> + }
> return ret;
> }
> EXPORT_SYMBOL(dma_fence_wait_timeout);
> @@ -533,16 +541,23 @@ void dma_fence_release(struct kref *kref)
> struct dma_fence *fence =
> container_of(kref, struct dma_fence, refcount);
>
> + rcu_read_lock();
> trace_dma_fence_destroy(fence);
>
> - if (WARN(!list_empty(&fence->cb_list) &&
> - !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags),
> - "Fence %s:%s:%llx:%llx released with pending signals!\n",
> - dma_fence_driver_name(fence),
> - dma_fence_timeline_name(fence),
> - fence->context, fence->seqno)) {
> + if (!list_empty(&fence->cb_list) &&
> + !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
> + const char __rcu *timeline;
> + const char __rcu *driver;
> unsigned long flags;
>
> + driver = dma_fence_driver_name(fence);
> + timeline = dma_fence_timeline_name(fence);
> +
> + WARN(1,
> + "Fence %s:%s:%llx:%llx released with pending signals!\n",
> + rcu_dereference(driver), rcu_dereference(timeline),
> + fence->context, fence->seqno);
> +
> /*
> * Failed to signal before release, likely a refcounting issue.
> *
> @@ -556,6 +571,8 @@ void dma_fence_release(struct kref *kref)
> spin_unlock_irqrestore(fence->lock, flags);
> }
>
> + rcu_read_unlock();
> +
> if (fence->ops->release)
> fence->ops->release(fence);
> else
> @@ -982,11 +999,21 @@ EXPORT_SYMBOL(dma_fence_set_deadline);
> */
> void dma_fence_describe(struct dma_fence *fence, struct seq_file *seq)
> {
> + const char __rcu *timeline;
> + const char __rcu *driver;
> +
> + rcu_read_lock();
> +
> + timeline = dma_fence_timeline_name(fence);
> + driver = dma_fence_driver_name(fence);
> +
> seq_printf(seq, "%s %s seq %llu %ssignalled\n",
> - dma_fence_driver_name(fence),
> - dma_fence_timeline_name(fence),
> + rcu_dereference(driver),
> + rcu_dereference(timeline),
> fence->seqno,
> dma_fence_is_signaled(fence) ? "" : "un");
> +
> + rcu_read_unlock();
> }
> EXPORT_SYMBOL(dma_fence_describe);
>
> @@ -1055,3 +1082,67 @@ dma_fence_init64(struct dma_fence *fence, const struct dma_fence_ops *ops,
> BIT(DMA_FENCE_FLAG_SEQNO64_BIT));
> }
> EXPORT_SYMBOL(dma_fence_init64);
> +
> +/**
> + * dma_fence_driver_name - Access the driver name
> + * @fence: the fence to query
> + *
> + * Returns a driver name backing the dma-fence implementation.
> + *
> + * IMPORTANT CONSIDERATION:
> + * Dma-fence contract stipulates that access to driver provided data (data not
> + * directly embedded into the object itself), such as the &dma_fence.lock and
> + * memory potentially accessed by the &dma_fence.ops functions, is forbidden
> + * after the fence has been signalled. Drivers are allowed to free that data,
> + * and some do.
> + *
> + * To allow safe access drivers are mandated to guarantee a RCU grace period
> + * between signalling the fence and freeing said data.
> + *
> + * As such access to the driver name is only valid inside a RCU locked section.
> + * The pointer MUST be both queried and USED ONLY WITHIN a SINGLE block guarded
> + * by the &rcu_read_lock and &rcu_read_unlock pair.
> + */
> +const char __rcu *dma_fence_driver_name(struct dma_fence *fence)
> +{
> + RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
> + "RCU protection is required for safe access to returned string");
> +
> + if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
> + return fence->ops->get_driver_name(fence);
> + else
> + return "detached-driver";
> +}
> +EXPORT_SYMBOL(dma_fence_driver_name);
> +
> +/**
> + * dma_fence_timeline_name - Access the timeline name
> + * @fence: the fence to query
> + *
> + * Returns a timeline name provided by the dma-fence implementation.
> + *
> + * IMPORTANT CONSIDERATION:
> + * Dma-fence contract stipulates that access to driver provided data (data not
> + * directly embedded into the object itself), such as the &dma_fence.lock and
> + * memory potentially accessed by the &dma_fence.ops functions, is forbidden
> + * after the fence has been signalled. Drivers are allowed to free that data,
> + * and some do.
> + *
> + * To allow safe access drivers are mandated to guarantee a RCU grace period
> + * between signalling the fence and freeing said data.
> + *
> + * As such access to the driver name is only valid inside a RCU locked section.
> + * The pointer MUST be both queried and USED ONLY WITHIN a SINGLE block guarded
> + * by the &rcu_read_lock and &rcu_read_unlock pair.
> + */
> +const char __rcu *dma_fence_timeline_name(struct dma_fence *fence)
> +{
> + RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
> + "RCU protection is required for safe access to returned string");
> +
> + if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
> + return fence->ops->get_driver_name(fence);
> + else
> + return "signaled-timeline";
> +}
> +EXPORT_SYMBOL(dma_fence_timeline_name);
> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
> index 10a849cb2d3f..64639e104110 100644
> --- a/include/linux/dma-fence.h
> +++ b/include/linux/dma-fence.h
> @@ -378,15 +378,28 @@ bool dma_fence_remove_callback(struct dma_fence *fence,
> struct dma_fence_cb *cb);
> void dma_fence_enable_sw_signaling(struct dma_fence *fence);
>
> -static inline const char *dma_fence_driver_name(struct dma_fence *fence)
> -{
> - return fence->ops->get_driver_name(fence);
> -}
> -
> -static inline const char *dma_fence_timeline_name(struct dma_fence *fence)
> -{
> - return fence->ops->get_timeline_name(fence);
> -}
> +/**
> + * DOC: Safe external access to driver provided object members
> + *
> + * All data not stored directly in the dma-fence object, such as the
> + * &dma_fence.lock and memory potentially accessed by functions in the
> + * &dma_fence.ops table, MUST NOT be accessed after the fence has been signalled
> + * because after that point drivers are allowed to free it.
> + *
> + * All code accessing that data via the dma-fence API (or directly, which is
> + * discouraged), MUST make sure to contain the complete access within a
> + * &rcu_read_lock and &rcu_read_unlock pair.
> + *
> + * Some dma-fence API handles this automatically, while other, as for example
> + * &dma_fence_driver_name and &dma_fence_timeline_name, leave that
> + * responsibility to the caller.
> + *
> + * To enable this scheme to work drivers MUST ensure a RCU grace period elapses
> + * between signalling the fence and freeing the said data.
> + *
> + */
> +const char __rcu *dma_fence_driver_name(struct dma_fence *fence);
> +const char __rcu *dma_fence_timeline_name(struct dma_fence *fence);
>
> /**
> * dma_fence_is_signaled_locked - Return an indication if the fence
> diff --git a/include/trace/events/dma_fence.h b/include/trace/events/dma_fence.h
> index 84c83074ee81..4814a65b68dc 100644
> --- a/include/trace/events/dma_fence.h
> +++ b/include/trace/events/dma_fence.h
> @@ -34,14 +34,44 @@ DECLARE_EVENT_CLASS(dma_fence,
> __entry->seqno)
> );
>
> -DEFINE_EVENT(dma_fence, dma_fence_emit,
> +/*
> + * Safe only for call sites which are guaranteed to not race with fence
> + * signaling,holding the fence->lock and having checked for not signaled, or the
> + * signaling path itself.
> + */
> +DECLARE_EVENT_CLASS(dma_fence_unsignaled,
> +
> + TP_PROTO(struct dma_fence *fence),
> +
> + TP_ARGS(fence),
> +
> + TP_STRUCT__entry(
> + __string(driver, fence->ops->get_driver_name(fence))
> + __string(timeline, fence->ops->get_timeline_name(fence))
> + __field(unsigned int, context)
> + __field(unsigned int, seqno)
> + ),
> +
> + TP_fast_assign(
> + __assign_str(driver);
> + __assign_str(timeline);
> + __entry->context = fence->context;
> + __entry->seqno = fence->seqno;
> + ),
> +
> + TP_printk("driver=%s timeline=%s context=%u seqno=%u",
> + __get_str(driver), __get_str(timeline), __entry->context,
> + __entry->seqno)
> +);
> +
> +DEFINE_EVENT(dma_fence_unsignaled, dma_fence_emit,
>
> TP_PROTO(struct dma_fence *fence),
>
> TP_ARGS(fence)
> );
>
> -DEFINE_EVENT(dma_fence, dma_fence_init,
> +DEFINE_EVENT(dma_fence_unsignaled, dma_fence_init,
>
> TP_PROTO(struct dma_fence *fence),
>
> @@ -55,14 +85,14 @@ DEFINE_EVENT(dma_fence, dma_fence_destroy,
> TP_ARGS(fence)
> );
>
> -DEFINE_EVENT(dma_fence, dma_fence_enable_signal,
> +DEFINE_EVENT(dma_fence_unsignaled, dma_fence_enable_signal,
>
> TP_PROTO(struct dma_fence *fence),
>
> TP_ARGS(fence)
> );
>
> -DEFINE_EVENT(dma_fence, dma_fence_signaled,
> +DEFINE_EVENT(dma_fence_unsignaled, dma_fence_signaled,
>
> TP_PROTO(struct dma_fence *fence),
>
Dear All,
This patchset fixes the incorrect use of dma_sync_sg_*() calls in
media and related drivers. They are replaced with much safer
dma_sync_sgtable_*() variants, which take care of passing the proper
number of elements for the sync operation.
Best regards
Marek Szyprowski, PhD
Samsung R&D Institute Poland
Change log:
v3: added cc: stable to tags
v2: fixes typos and added cc: stable
Patch summary:
Marek Szyprowski (3):
media: videobuf2: use sgtable-based scatterlist wrappers
udmabuf: use sgtable-based scatterlist wrappers
media: omap3isp: use sgtable-based scatterlist wrappers
drivers/dma-buf/udmabuf.c | 5 ++---
drivers/media/common/videobuf2/videobuf2-dma-sg.c | 4 ++--
drivers/media/platform/ti/omap3isp/ispccdc.c | 8 ++++----
drivers/media/platform/ti/omap3isp/ispstat.c | 6 ++----
4 files changed, 10 insertions(+), 13 deletions(-)
--
2.34.1
On 6/9/25 11:32, wangtao wrote:
>> -----Original Message-----
>> From: Christoph Hellwig <hch(a)infradead.org>
>> Sent: Monday, June 9, 2025 12:35 PM
>> To: Christian König <christian.koenig(a)amd.com>
>> Cc: wangtao <tao.wangtao(a)honor.com>; Christoph Hellwig
>> <hch(a)infradead.org>; sumit.semwal(a)linaro.org; kraxel(a)redhat.com;
>> vivek.kasireddy(a)intel.com; viro(a)zeniv.linux.org.uk; brauner(a)kernel.org;
>> hughd(a)google.com; akpm(a)linux-foundation.org; amir73il(a)gmail.com;
>> benjamin.gaignard(a)collabora.com; Brian.Starkey(a)arm.com;
>> jstultz(a)google.com; tjmercier(a)google.com; jack(a)suse.cz;
>> baolin.wang(a)linux.alibaba.com; linux-media(a)vger.kernel.org; dri-
>> devel(a)lists.freedesktop.org; linaro-mm-sig(a)lists.linaro.org; linux-
>> kernel(a)vger.kernel.org; linux-fsdevel(a)vger.kernel.org; linux-
>> mm(a)kvack.org; wangbintian(BintianWang) <bintian.wang(a)honor.com>;
>> yipengxiang <yipengxiang(a)honor.com>; liulu 00013167
>> <liulu.liu(a)honor.com>; hanfeng 00012985 <feng.han(a)honor.com>
>> Subject: Re: [PATCH v4 0/4] Implement dmabuf direct I/O via
>> copy_file_range
>>
>> On Fri, Jun 06, 2025 at 01:20:48PM +0200, Christian König wrote:
>>>> dmabuf acts as a driver and shouldn't be handled by VFS, so I made
>>>> dmabuf implement copy_file_range callbacks to support direct I/O
>>>> zero-copy. I'm open to both approaches. What's the preference of VFS
>>>> experts?
>>>
>>> That would probably be illegal. Using the sg_table in the DMA-buf
>>> implementation turned out to be a mistake.
>>
>> Two thing here that should not be directly conflated. Using the sg_table was
>> a huge mistake, and we should try to move dmabuf to switch that to a pure
> I'm a bit confused: don't dmabuf importers need to traverse sg_table to
> access folios or dma_addr/len? Do you mean restricting sg_table access
> (e.g., only via iov_iter) or proposing alternative approaches?
No, accessing pages folios inside the sg_table of a DMA-buf is strictly forbidden.
We have removed most use cases of that over the years and push back on generating new ones.
>
>> dma_addr_t/len array now that the new DMA API supporting that has been
>> merged. Is there any chance the dma-buf maintainers could start to kick this
>> off? I'm of course happy to assist.
Work on that is already underway for some time.
Most GPU drivers already do sg_table -> DMA array conversion, I need to push on the remaining to clean up.
But there are also tons of other users of dma_buf_map_attachment() which needs to be converted.
>> But that notwithstanding, dma-buf is THE buffer sharing mechanism in the
>> kernel, and we should promote it instead of reinventing it badly.
>> And there is a use case for having a fully DMA mapped buffer in the block
>> layer and I/O path, especially on systems with an IOMMU.
>> So having an iov_iter backed by a dma-buf would be extremely helpful.
>> That's mostly lib/iov_iter.c code, not VFS, though.
> Are you suggesting adding an ITER_DMABUF type to iov_iter, or
> implementing dmabuf-to-iov_bvec conversion within iov_iter?
That would be rather nice to have, yeah.
>
>>
>>> The question Christoph raised was rather why is your CPU so slow that
>>> walking the page tables has a significant overhead compared to the
>>> actual I/O?
>>
>> Yes, that's really puzzling and should be addressed first.
> With high CPU performance (e.g., 3GHz), GUP (get_user_pages) overhead
> is relatively low (observed in 3GHz tests).
Even on a low end CPU walking the page tables and grabbing references shouldn't be that much of an overhead.
There must be some reason why you see so much CPU overhead. E.g. compound pages are broken up or similar which should not happen in the first place.
Regards,
Christian.
> | 32x32MB Read 1024MB |Creat-ms|Close-ms| I/O-ms|I/O-MB/s| I/O%
> |---------------------------|--------|--------|--------|--------|-----
> | 1) memfd direct R/W| 1 | 118 | 312 | 3448 | 100%
> | 2) u+memfd direct R/W| 196 | 123 | 295 | 3651 | 105%
> | 3) u+memfd direct sendfile| 175 | 102 | 976 | 1100 | 31%
> | 4) u+memfd direct splice| 173 | 103 | 443 | 2428 | 70%
> | 5) udmabuf buffer R/W| 183 | 100 | 453 | 2375 | 68%
> | 6) dmabuf buffer R/W| 34 | 4 | 427 | 2519 | 73%
> | 7) udmabuf direct c_f_r| 200 | 102 | 278 | 3874 | 112%
> | 8) dmabuf direct c_f_r| 36 | 5 | 269 | 4002 | 116%
>
> With lower CPU performance (e.g., 1GHz), GUP overhead becomes more
> significant (as seen in 1GHz tests).
> | 32x32MB Read 1024MB |Creat-ms|Close-ms| I/O-ms|I/O-MB/s| I/O%
> |---------------------------|--------|--------|--------|--------|-----
> | 1) memfd direct R/W| 2 | 393 | 969 | 1109 | 100%
> | 2) u+memfd direct R/W| 592 | 424 | 570 | 1884 | 169%
> | 3) u+memfd direct sendfile| 587 | 356 | 2229 | 481 | 43%
> | 4) u+memfd direct splice| 568 | 352 | 795 | 1350 | 121%
> | 5) udmabuf buffer R/W| 597 | 343 | 1238 | 867 | 78%
> | 6) dmabuf buffer R/W| 69 | 13 | 1128 | 952 | 85%
> | 7) udmabuf direct c_f_r| 595 | 345 | 372 | 2889 | 260%
> | 8) dmabuf direct c_f_r| 80 | 13 | 274 | 3929 | 354%
>
> Regards,
> Wangtao.