On 5/22/25 10:02, wangtao wrote:
>> -----Original Message-----
>> From: Christian König <christian.koenig(a)amd.com>
>> Sent: Wednesday, May 21, 2025 7:57 PM
>> To: wangtao <tao.wangtao(a)honor.com>; T.J. Mercier
>> <tjmercier(a)google.com>
>> Cc: sumit.semwal(a)linaro.org; benjamin.gaignard(a)collabora.com;
>> Brian.Starkey(a)arm.com; jstultz(a)google.com; linux-media(a)vger.kernel.org;
>> dri-devel(a)lists.freedesktop.org; linaro-mm-sig(a)lists.linaro.org; linux-
>> kernel(a)vger.kernel.org; wangbintian(BintianWang)
>> <bintian.wang(a)honor.com>; yipengxiang <yipengxiang(a)honor.com>; liulu
>> 00013167 <liulu.liu(a)honor.com>; hanfeng 00012985 <feng.han(a)honor.com>;
>> amir73il(a)gmail.com
>> Subject: Re: [PATCH 2/2] dmabuf/heaps: implement
>> DMA_BUF_IOCTL_RW_FILE for system_heap
>>
>> On 5/21/25 12:25, wangtao wrote:
>>> [wangtao] I previously explained that
>>> read/sendfile/splice/copy_file_range
>>> syscalls can't achieve dmabuf direct IO zero-copy.
>>
>> And why can't you work on improving those syscalls instead of creating a new
>> IOCTL?
>>
> [wangtao] As I mentioned in previous emails, these syscalls cannot
> achieve dmabuf zero-copy due to technical constraints.
Yeah, and why can't you work on removing those technical constraints?
What is blocking you from improving the sendfile system call or proposing a patch to remove the copy_file_range restrictions?
Regards,
Christian.
> Could you
> specify the technical points, code, or principles that need
> optimization?
>
> Let me explain again why these syscalls can't work:
> 1. read() syscall
> - dmabuf fops lacks a read callback implementation; even with one,
> the regular file's fd could not be passed through read()
> - read(file_fd, dmabuf_ptr, len) cannot reach the dmabuf pages when
> the mmap is remap_pfn_range-based, forcing buffered-mode reads
>
> 2. sendfile() syscall
> - Requires a CPU copy from page cache to the memory file
> (tmpfs/shmem); see the sketch after this list:
> [DISK] --DMA--> [page cache] --CPU copy--> [MEMORY file]
> - CPU overhead (both buffer/direct modes involve copies):
> 55.08% do_sendfile
> |- 55.08% do_splice_direct
> |-|- 55.08% splice_direct_to_actor
> |-|-|- 22.51% copy_splice_read
> |-|-|-|- 16.57% f2fs_file_read_iter
> |-|-|-|-|- 15.12% __iomap_dio_rw
> |-|-|- 32.33% direct_splice_actor
> |-|-|-|- 32.11% iter_file_splice_write
> |-|-|-|-|- 28.42% vfs_iter_write
> |-|-|-|-|-|- 28.42% do_iter_write
> |-|-|-|-|-|-|- 28.39% shmem_file_write_iter
> |-|-|-|-|-|-|-|- 24.62% generic_perform_write
> |-|-|-|-|-|-|-|-|- 18.75% __pi_memmove
>
> 3. splice() requires one end to be a pipe, so it cannot connect a
> regular file directly to a dmabuf.
>
> 4. copy_file_range()
> - Blocked by cross-FS restrictions (Amir's commit 868f9f2f8e00)
> - Even without this restriction, implementing the copy_file_range
> callback in dmabuf fops would only allow dmabuf reads from regular
> files: copy_file_range dispatches through
> file_out->f_op->copy_file_range, which cannot support dmabuf writes
> to regular files.
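
For reference, the sendfile path in point 2 boils down to the sequence
below (userspace sketch based on the udmabuf sendfile test steps quoted
later in this thread; error handling and the UDMABUF_CREATE ioctl are
omitted, names are illustrative):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <unistd.h>

/* Sketch of the "udmabuf sendfile" case: every byte crosses the CPU on
 * the page-cache -> shmem write, which is the shmem_file_write_iter /
 * __pi_memmove cost in the profile above. */
static ssize_t copy_via_sendfile(const char *path, size_t len)
{
	int back_fd = open(path, O_RDONLY);          /* data file */
	int mem_fd = memfd_create("udmabuf-src", 0); /* shmem backing */

	ftruncate(mem_fd, len);
	/* a udmabuf would be created from mem_fd here (UDMABUF_CREATE) */

	/* [DISK] --DMA--> [page cache] --CPU copy--> [memory file] */
	return sendfile(mem_fd, back_fd, NULL, len);
}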
>
> Test results confirm these limitations:
> T.J. Mercier's 1G from ext4 on 6.12.20 | read/sendfile (ms) w/ 3 > drop_caches
> ------------------------|-------------------
> udmabuf buffer read | 1210
> udmabuf direct read | 671
> udmabuf buffer sendfile | 1096
> udmabuf direct sendfile | 2340
>
> My 3GHz CPU tests (cache cleared):
> Method | alloc (ms) | read (ms) | vs. (%)
> -----------------------------------------------
> udmabuf buffer read | 135 | 546 | 180%
> udmabuf direct read | 159 | 300 | 99%
> udmabuf buffer sendfile | 134 | 303 | 100%
> udmabuf direct sendfile | 141 | 912 | 301%
> dmabuf buffer read | 22 | 362 | 119%
> my patch direct read | 29 | 265 | 87%
>
> My 1GHz CPU tests (cache cleared):
> Method | alloc (ms) | read (ms) | vs. (%)
> -----------------------------------------------
> udmabuf buffer read | 552 | 2067 | 198%
> udmabuf direct read | 540 | 627 | 60%
> udmabuf buffer sendfile | 497 | 1045 | 100%
> udmabuf direct sendfile | 527 | 2330 | 223%
> dmabuf buffer read | 40 | 1111 | 106%
> patch direct read | 44 | 310 | 30%
>
> Test observations align with expectations:
> 1. dmabuf buffer read requires slow CPU copies
> 2. udmabuf direct read achieves zero-copy but pays page-retrieval
> latency resolving the dmabuf pages from the vaddr (see the sketch
> after this list)
> 3. udmabuf buffer sendfile suffers CPU copy overhead
> 4. udmabuf direct sendfile combines CPU copies with frequent DMA
> operations due to small pipe buffers
> 5. dmabuf buffer read also requires CPU copies
> 6. My direct read patch enables zero-copy with better performance
> on low-power CPUs
> 7. udmabuf creation time remains problematic (as you’ve noted).
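
For comparison, the "udmabuf direct read" rows above correspond roughly
to this (sketch only; udmabuf_fd is assumed to come from
UDMABUF_CREATE, and O_DIRECT alignment requirements are glossed over):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* O_DIRECT DMA lands straight in the udmabuf's pages via
 * get_user_pages() on the mapping, so no CPU copy is needed -- but
 * resolving the pages from the vaddr is the latency noted in
 * observation 2. */
static ssize_t direct_read_into_udmabuf(const char *path, int udmabuf_fd,
					size_t len)
{
	int file_fd = open(path, O_RDONLY | O_DIRECT);
	void *vaddr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_SHARED, udmabuf_fd, 0);

	return read(file_fd, vaddr, len);
}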
>
>>> My focus is enabling dmabuf direct I/O for [regular file] <--DMA-->
>>> [dmabuf] zero-copy.
>>
>> Yeah and that focus is wrong. You need to work on a general solution to the
>> issue and not specific to your problem.
>>
>>> Any API achieving this would work. Are there other uAPIs you think
>>> could help? Could you recommend experts who might offer suggestions?
>>
>> Well once more: Either work on sendfile or copy_file_range or eventually
>> splice to make it what you want to do.
>>
>> When that is done we can discuss with the VFS people if that approach is
>> feasible.
>>
>> But just bypassing the VFS review by implementing a DMA-buf specific IOCTL
>> is a NO-GO. That is clearly not something you can do in any way.
> [wangtao] The issue is that only dmabuf lacks Direct I/O zero-copy support. Tmpfs/shmem
> already work with Direct I/O zero-copy. As explained, existing syscalls or
> generic methods can't enable dmabuf direct I/O zero-copy, which is why I
> propose adding an IOCTL command.
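
For illustration, usage of such a command would look roughly like this
(hypothetical sketch: the struct name, fields and ioctl number below
are assumed for illustration, not the actual uAPI from the patch):

#include <linux/ioctl.h>
#include <linux/types.h>
#include <sys/ioctl.h>

/* Hypothetical uAPI -- layout and numbering assumed, not the patch's */
struct dma_buf_rw_file {
	__u32 fd;      /* regular file to read from / write to */
	__u32 flags;   /* direction etc., assumed */
	__u64 offset;  /* offset within the regular file */
	__u64 len;     /* bytes to transfer */
};
#define DMA_BUF_IOCTL_RW_FILE _IOW('b', 0x80, struct dma_buf_rw_file)

/* [DISK] --DMA--> [dmabuf pages]: no CPU copy, no bounce file */
static int dmabuf_direct_read(int dmabuf_fd, int file_fd, __u64 len)
{
	struct dma_buf_rw_file arg = { .fd = file_fd, .len = len };

	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_RW_FILE, &arg);
}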
>
> I respect your perspective. Could you clarify specific technical aspects,
> code requirements, or implementation principles for modifying sendfile()
> or copy_file_range()? This would help advance our discussion.
>
> Thank you for engaging in this dialogue.
>
>>
>> Regards,
>> Christian.
This is the next version of the shmem backed GEM objects series
originally from Asahi, previously posted by Daniel Almeida. Along with
bindings for shmem backed GEM objects, it also adds a few features that
various users like Tyr and Asahi are interested in:
* The ability to pass custom arguments to new GEM objects (needed by
Tyr)
* OpaqueObject (to enable the use of custom private GEM objects, which I
believe asahi wanted)
It also replaces some of the hand-rolled API bindings (mainly sg_table)
with the WIP patch series adding kernel-wide bindings, and addresses
the review comments from the last version of this series.
This series currently doesn't apply on an upstream branch, but should
very soon, as all of its dependencies are already on a mailing list.
The current branch this can be applied on top of is here:
https://gitlab.freedesktop.org/lyudess/linux/-/commits/rust%2Fgem-shmem-base
Which is based on top of nova/nova-next with the following patch series
applied:
* My (hopefully final) gem bindings cleanup:
https://lkml.org/lkml/2025/5/20/1541
* Benno's derive Zeroable series:
https://lkml.org/lkml/2025/5/20/1446
* Abdiel's sg_table series:
https://lwn.net/Articles/1020986/
Also, there is one FIXES patch on top of Abdiel's work to fix some
iterator bugs. These fixes have already been mentioned on the mailing
list and should not be needed for the v2 of that series.
Asahi Lina (3):
rust: helpers: Add bindings/wrappers for dma_resv_lock
rust: drm: gem: shmem: Add DRM shmem helper abstraction
rust: drm: gem: shmem: Add share_dma_resv to ObjectConfig
Lyude Paul (9):
rust: drm: gem: Add raw_dma_resv() function
drm/gem/shmem: Extract drm_gem_shmem_init() from
drm_gem_shmem_create()
drm/gem/shmem: Extract drm_gem_shmem_release() from
drm_gem_shmem_free()
rust: gem: Introduce BaseDriverObject::Args
rust: drm: gem: Add OpaqueObject
rust: drm: gem: Introduce OwnedSGTable
rust: Add dma_buf stub bindings
rust: drm: gem: Add export() callback
rust: drm: gem: Add BaseObject::prime_export()
drivers/gpu/drm/drm_gem_shmem_helper.c | 98 +++++--
drivers/gpu/drm/nova/gem.rs | 6 +-
include/drm/drm_gem_shmem_helper.h | 2 +
rust/bindings/bindings_helper.h | 4 +
rust/helpers/dma-resv.c | 13 +
rust/helpers/drm.c | 48 +++-
rust/helpers/helpers.c | 1 +
rust/kernel/dma_buf.rs | 39 +++
rust/kernel/drm/gem/mod.rs | 187 ++++++++++++-
rust/kernel/drm/gem/shmem.rs | 370 +++++++++++++++++++++++++
rust/kernel/lib.rs | 1 +
11 files changed, 727 insertions(+), 42 deletions(-)
create mode 100644 rust/helpers/dma-resv.c
create mode 100644 rust/kernel/dma_buf.rs
create mode 100644 rust/kernel/drm/gem/shmem.rs
--
2.49.0
Hi Jared,
On Thu, Apr 24, 2025 at 09:11:24AM -0700, Jared Kangas wrote:
> > > struct cma_heap {
> > > struct dma_heap *heap;
> > > @@ -394,15 +395,26 @@ static int __init __add_cma_heap(struct cma *cma, const char *name)
> > > static int __init add_default_cma_heap(void)
> > > {
> > > struct cma *default_cma = dev_get_cma_area(NULL);
> > > + const char *legacy_cma_name;
> > > int ret;
> > >
> > > if (!default_cma)
> > > return 0;
> > >
> > > - ret = __add_cma_heap(default_cma, cma_get_name(default_cma));
> > > + ret = __add_cma_heap(default_cma, DEFAULT_CMA_NAME);
> > > if (ret)
> > > return ret;
> > >
> > > + legacy_cma_name = cma_get_name(default_cma);
> > > +
> > > + if (IS_ENABLED(CONFIG_DMABUF_HEAPS_CMA_LEGACY) &&
> > > + strcmp(legacy_cma_name, DEFAULT_CMA_NAME)) {
> > > + ret = __add_cma_heap(default_cma, legacy_cma_name);
> > > + if (ret)
> > > + pr_warn("cma_heap: failed to add legacy heap: %pe\n",
> > > + ERR_PTR(-ret));
> > > + }
> > > +
> >
> > It would also simplify this part, since you would always create the legacy heap.
>
> By "always", do you mean removing the strcmp? I added this to guard
> against cases where the devicetree node's name clashed with the default
> name, given that the DT name isn't necessarily restricted to one of the
> current names in use ("linux,cma" or "default-pool"). It seems like the
> strcmp would be relevant regardless of the naming choice, but if this is
> overly cautious, I can remove it in v3.
That's not overly cautious, that's something I overlooked :)
You're totally right that we should check for that. We should probably
add a more specific error message in that case though.
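
e.g. something along these lines (sketch only, exact wording and
structure up to you):

	if (IS_ENABLED(CONFIG_DMABUF_HEAPS_CMA_LEGACY)) {
		if (!strcmp(legacy_cma_name, DEFAULT_CMA_NAME)) {
			pr_warn("cma_heap: legacy name \"%s\" clashes with the default heap name, not adding legacy heap\n",
				legacy_cma_name);
		} else {
			ret = __add_cma_heap(default_cma, legacy_cma_name);
			if (ret)
				pr_warn("cma_heap: failed to add legacy heap: %pe\n",
					ERR_PTR(-ret));
		}
	}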
Maxime
On 5/21/25 12:25, wangtao wrote:
> [wangtao] I previously explained that read/sendfile/splice/copy_file_range
> syscalls can't achieve dmabuf direct IO zero-copy.
And why can't you work on improving those syscalls instead of creating a new IOCTL?
> My focus is enabling dmabuf direct I/O for [regular file] <--DMA--> [dmabuf]
> zero-copy.
Yeah and that focus is wrong. You need to work on a general solution to the issue and not specific to your problem.
> Any API achieving this would work. Are there other uAPIs you think
> could help? Could you recommend experts who might offer suggestions?
Well once more: Either work on sendfile or copy_file_range or eventually splice to make it what you want to do.
When that is done we can discuss with the VFS people if that approach is feasible.
But just bypassing the VFS review by implementing a DMA-buf specific IOCTL is a NO-GO. That is clearly not something you can do in any way.
Regards,
Christian.
On 5/21/25 06:17, wangtao wrote:
>>> Reducing CPU overhead/power consumption is critical for mobile devices.
>>> We need simpler and more efficient dmabuf direct I/O support.
>>>
>>> As Christian evaluated sendfile performance based on your data, could
>>> you confirm whether the cache was cleared? If not, please share the
>>> post-cache-clearing test data. Thank you for your support.
>>
>> Yes sorry, I was out yesterday riding motorcycles. I did not clear the cache for
>> the buffered reads, I didn't realize you had. The IO plus the copy certainly
>> explains the difference.
>>
>> Your point about the unlikelihood of any of that data being in the cache also
>> makes sense.
> [wangtao] Thank you for testing and clarifying.
>
>>
>> I'm not sure it changes anything about the ioctl approach though.
>> Another way to do this would be to move the (optional) support for direct IO
>> into the exporter via dma_buf_fops and dma_buf_ops. Then normal read()
>> syscalls would just work for buffers that support them.
>> I know that's more complicated, but at least it doesn't require inventing new
>> uapi to do it.
>>
> [wangtao] Thank you for the discussion. I fully support any method that enables
> dmabuf direct I/O.
>
> I understand using sendfile/splice with regular files for dmabuf
> adds an extra CPU copy, preventing zero-copy. For example:
> sendfile path: [DISK] → DMA → [page cache] → CPU copy → [memory file].
Yeah, but why can't you work on improving that?
> The read() syscall can't pass regular file fd parameters, so I added
> an ioctl command.
> While copy_file_range() supports two fds (fd_in/fd_out), it blocks cross-fs use.
> Even without this restriction, file_out->f_op->copy_file_range
> only enables dmabuf direct reads from regular files, not writes.
>
> Since dmabuf's direct I/O limitation comes from its unique
> attachment/map/fence model, and no existing syscall fits that model,
> adding an ioctl seems necessary.
I absolutely don't see that. Both splice and sendfile can take two regular file descriptors.
That the underlying fops currently can't do that is not a valid argument for adding new uAPI. It just means that you need to work on improving those fops.
As long as nobody proves to me that the existing uAPI isn't sufficient for this use case I will systematically reject any approach to adding new one.
Regards,
Christian.
> When system exporters return a duplicated sg_table via map_dma_buf
> (used exclusively like a pages array), they should retain control
> over it.
>
> I welcome all solutions to achieve dmabuf direct I/O! Your feedback
> is greatly appreciated.
>
>> 1G from ext4 on 6.12.20 | read/sendfile (ms) w/ 3 > drop_caches
>> ------------------------|-------------------
>> udmabuf buffer read | 1210
>> udmabuf direct read | 671
>> udmabuf buffer sendfile | 1096
>> udmabuf direct sendfile | 2340
>>
>>
>>
>>>
>>>>>
>>>>>>> dmabuf buffer read | 51 | 1068 | 1118
>>>>>>> dmabuf direct read | 52 | 297 | 349
>>>>>>>
>>>>>>> udmabuf sendfile test steps:
>>>>>>> 1. Open data file(1024MB), get back_fd 2. Create memfd(32MB) #
>>>>>>> Loop steps 2-6 3. Allocate udmabuf with memfd 4. Call
>>>>>>> sendfile(memfd,
>>>>>>> back_fd) 5. Close memfd after sendfile 6. Close udmabuf 7.
>>>>>>> Close back_fd
>>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>
>>>>>>
>>>
On Tue, May 20, 2025 at 12:26:54PM GMT, Tomeu Vizoso wrote:
> Add the bindings for the Neural Processing Unit IP from Rockchip.
>
> v2:
> - Adapt to new node structure (one node per core, each with its own
> IOMMU)
> - Several misc. fixes from Sebastian Reichel
>
> v3:
> - Split register block in its constituent subblocks, and only require
> the ones that the kernel would ever use (Nicolas Frattaroli)
> - Group supplies (Rob Herring)
> - Explain the way in which the top core is special (Rob Herring)
>
> v4:
> - Change required node name to npu@ (Rob Herring and Krzysztof Kozlowski)
> - Remove unneeded items: (Krzysztof Kozlowski)
> - Fix use of minItems/maxItems (Krzysztof Kozlowski)
> - Add reg-names to list of required properties (Krzysztof Kozlowski)
> - Fix example (Krzysztof Kozlowski)
>
> v5:
> - Rename file to rockchip,rk3588-rknn-core.yaml (Krzysztof Kozlowski)
> - Streamline compatible property (Krzysztof Kozlowski)
>
This is a big patchset, so please slow down and do not send it every day
but allow people to actually review the version you post.
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski(a)linaro.org>
Best regards,
Krzysztof
Hi Tomeu,
Am Dienstag, dem 20.05.2025 um 12:27 +0200 schrieb Tomeu Vizoso:
> The NPU cores have their own access to the memory bus, and this isn't
> cache coherent with the CPUs.
>
> Add IOCTLs so userspace can mark when the caches need to be flushed, and
> also when a writer job needs to be waited for before the buffer can be
> accessed from the CPU.
>
> Initially based on the same IOCTLs from the Etnaviv driver.
>
> v2:
> - Don't break UABI by reordering the IOCTL IDs (Jeff Hugo)
>
> v3:
> - Check that padding fields in IOCTLs are zero (Jeff Hugo)
>
> Signed-off-by: Tomeu Vizoso <tomeu(a)tomeuvizoso.net>
> ---
> drivers/accel/rocket/rocket_drv.c | 2 +
> drivers/accel/rocket/rocket_gem.c | 80 +++++++++++++++++++++++++++++++++++++++
> drivers/accel/rocket/rocket_gem.h | 5 +++
> include/uapi/drm/rocket_accel.h | 37 ++++++++++++++++++
> 4 files changed, 124 insertions(+)
>
> diff --git a/drivers/accel/rocket/rocket_drv.c b/drivers/accel/rocket/rocket_drv.c
> index fef9b93372d3f65c41c1ac35a9bfa0c01ee721a5..c06e66939e6c39909fe08bef3c4f301b07bf8fbf 100644
> --- a/drivers/accel/rocket/rocket_drv.c
> +++ b/drivers/accel/rocket/rocket_drv.c
> @@ -59,6 +59,8 @@ static const struct drm_ioctl_desc rocket_drm_driver_ioctls[] = {
>
> ROCKET_IOCTL(CREATE_BO, create_bo),
> ROCKET_IOCTL(SUBMIT, submit),
> + ROCKET_IOCTL(PREP_BO, prep_bo),
> + ROCKET_IOCTL(FINI_BO, fini_bo),
> };
>
> DEFINE_DRM_ACCEL_FOPS(rocket_accel_driver_fops);
> diff --git a/drivers/accel/rocket/rocket_gem.c b/drivers/accel/rocket/rocket_gem.c
> index 8a8a7185daac4740081293aae6945c9b2bbeb2dd..cdc5238a93fa5978129dc1ac8ec8de955160dc18 100644
> --- a/drivers/accel/rocket/rocket_gem.c
> +++ b/drivers/accel/rocket/rocket_gem.c
> @@ -129,3 +129,83 @@ int rocket_ioctl_create_bo(struct drm_device *dev, void *data, struct drm_file *
>
> return ret;
> }
> +
> +static inline enum dma_data_direction rocket_op_to_dma_dir(u32 op)
> +{
> + if (op & ROCKET_PREP_READ)
> + return DMA_FROM_DEVICE;
> + else if (op & ROCKET_PREP_WRITE)
> + return DMA_TO_DEVICE;
> + else
> + return DMA_BIDIRECTIONAL;
> +}
This has copied over the bug fixed in etnaviv commit 58979ad6330a
("drm/etnaviv: fix DMA direction handling for cached RW buffers").
Regards,
Lucas