On 5/22/25 10:02, wangtao wrote:
-----Original Message-----
From: Christian König christian.koenig@amd.com
Sent: Wednesday, May 21, 2025 7:57 PM
To: wangtao tao.wangtao@honor.com; T.J. Mercier tjmercier@google.com
Cc: sumit.semwal@linaro.org; benjamin.gaignard@collabora.com; Brian.Starkey@arm.com; jstultz@google.com; linux-media@vger.kernel.org; dri-devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux-kernel@vger.kernel.org; wangbintian(BintianWang) bintian.wang@honor.com; yipengxiang yipengxiang@honor.com; liulu 00013167 liulu.liu@honor.com; hanfeng 00012985 feng.han@honor.com; amir73il@gmail.com
Subject: Re: [PATCH 2/2] dmabuf/heaps: implement DMA_BUF_IOCTL_RW_FILE for system_heap
On 5/21/25 12:25, wangtao wrote:
[wangtao] I previously explained that read/sendfile/splice/copy_file_range syscalls can't achieve dmabuf direct IO zero-copy.
And why can't you work on improving those syscalls instead of creating a new IOCTL?
[wangtao] As I mentioned in previous emails, these syscalls cannot achieve dmabuf zero-copy due to technical constraints.
Yeah, and why can't you work on removing those technical constraints?
What is blocking you from improving the sendfile system call or proposing a patch to remove the copy_file_range restrictions?
Regards, Christian.
[wangtao] Could you specify the technical points, code, or principles that need optimization?
Let me explain again why these syscalls can't work:
read() syscall
- dmabuf fops lacks a read callback implementation; even if one were added, there is no way to pass the source file_fd through it
- read(file_fd, dmabuf_ptr, len) cannot access the dmabuf pages when the mmap is remap_pfn_range-based, forcing buffered-mode reads (see the sketch below)
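To make this concrete, here is a minimal userspace sketch (the helper name is illustrative; it assumes file_fd was opened with O_DIRECT and the dmabuf exporter mmaps via remap_pfn_range()):

#include <sys/mman.h>
#include <unistd.h>

/* Try to read file data straight into a dmabuf mapping. */
static ssize_t read_into_dmabuf(int file_fd, int dmabuf_fd, size_t len)
{
	void *ptr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED, dmabuf_fd, 0);
	ssize_t n;

	if (ptr == MAP_FAILED)
		return -1;

	/*
	 * With O_DIRECT the kernel must pin these pages via
	 * get_user_pages(); that fails on a VM_PFNMAP (remap_pfn_range)
	 * mapping, so the read errors out or degrades to a buffered CPU copy.
	 */
	n = read(file_fd, ptr, len);

	munmap(ptr, len);
	return n;
}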
sendfile() syscall
- Requires a CPU copy from the page cache to the memory file (tmpfs/shmem): [DISK] --DMA--> [page cache] --CPU copy--> [MEMORY file]
- CPU overhead (both buffer and direct modes involve copies):
  55.08% do_sendfile
  |- 55.08% do_splice_direct
  |-|- 55.08% splice_direct_to_actor
  |-|-|- 22.51% copy_splice_read
  |-|-|-|- 16.57% f2fs_file_read_iter
  |-|-|-|-|- 15.12% __iomap_dio_rw
  |-|-|- 32.33% direct_splice_actor
  |-|-|-|- 32.11% iter_file_splice_write
  |-|-|-|-|- 28.42% vfs_iter_write
  |-|-|-|-|-|- 28.42% do_iter_write
  |-|-|-|-|-|-|- 28.39% shmem_file_write_iter
  |-|-|-|-|-|-|-|- 24.62% generic_perform_write
  |-|-|-|-|-|-|-|-|- 18.75% __pi_memmove
splice() syscall
- Requires one end to be a pipe; neither a regular file nor a dmabuf is a pipe, so data cannot be spliced between them directly
copy_file_range()
- Blocked by cross-FS restrictions (Amir's commit 868f9f2f8e00)
- Even without this restriction, implementing a copy_file_range callback in the dmabuf fops would only allow reading from a regular file into a dmabuf. This is because copy_file_range dispatches on file_out->f_op->copy_file_range, so it cannot support writing dmabuf data out to a regular file (see the hypothetical sketch below).
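For illustration only, here is a hypothetical sketch (not part of this patch) of such a callback; it mainly shows why the dispatch direction matters:

#include <linux/fs.h>
#include <linux/dma-buf.h>

/*
 * Hypothetical sketch: the VFS calls file_out->f_op->copy_file_range(),
 * so this would only be reached when the dmabuf is the destination
 * (regular file -> dmabuf). For dmabuf -> regular file, the target
 * filesystem owns file_out->f_op and knows nothing about dmabuf.
 */
static ssize_t dma_buf_copy_file_range(struct file *file_in, loff_t pos_in,
				       struct file *file_out, loff_t pos_out,
				       size_t len, unsigned int flags)
{
	struct dma_buf *dmabuf = file_out->private_data;

	/* Issue direct I/O from file_in into the dmabuf's pages (omitted). */
	(void)dmabuf;
	return -EOPNOTSUPP;
}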
Test results confirm these limitations.

T.J. Mercier's 1G from ext4 on 6.12.20 | read/sendfile (ms) w/ 3 > drop_caches
---------------------------------------|---------------------------------------
udmabuf buffer read                    | 1210
udmabuf direct read                    | 671
udmabuf buffer sendfile                | 1096
udmabuf direct sendfile                | 2340
My 3GHz CPU tests (cache cleared):
Method                   | alloc (ms) | read (ms) | vs. (%)
-------------------------|------------|-----------|--------
udmabuf buffer read      | 135        | 546       | 180%
udmabuf direct read      | 159        | 300       | 99%
udmabuf buffer sendfile  | 134        | 303       | 100%
udmabuf direct sendfile  | 141        | 912       | 301%
dmabuf buffer read       | 22         | 362       | 119%
my patch direct read     | 29         | 265       | 87%
My 1GHz CPU tests (cache cleared):
Method                   | alloc (ms) | read (ms) | vs. (%)
-------------------------|------------|-----------|--------
udmabuf buffer read      | 552        | 2067      | 198%
udmabuf direct read      | 540        | 627       | 60%
udmabuf buffer sendfile  | 497        | 1045      | 100%
udmabuf direct sendfile  | 527        | 2330      | 223%
dmabuf buffer read       | 40         | 1111      | 106%
patch direct read        | 44         | 310       | 30%
Test observations align with expectations:
- udmabuf buffer read requires slow CPU copies
- udmabuf direct read achieves zero-copy but incurs page-retrieval latency from the vaddr
- udmabuf buffer sendfile suffers CPU copy overhead
- udmabuf direct sendfile combines CPU copies with frequent DMA operations due to small pipe buffers
- dmabuf buffer read also requires CPU copies
- My direct read patch enables zero-copy with better performance on low-power CPUs
- udmabuf creation time remains problematic (as you’ve noted).
My focus is enabling dmabuf direct I/O for [regular file] <--DMA--> [dmabuf] zero-copy.
Yeah, and that focus is wrong. You need to work on a general solution to the issue and not one specific to your problem.
[wangtao] Any API achieving this would work. Are there other uAPIs you think could help? Could you recommend experts who might offer suggestions?
Well once more: Either work on sendfile or copy_file_range or eventually splice to make it what you want to do.
When that is done we can discuss with the VFS people if that approach is feasible.
But just bypassing the VFS review by implementing a DMA-buf specific IOCTL is a NO-GO. That is clearly not something you can do in any way.
[wangtao] The issue is that only dmabuf lacks Direct I/O zero-copy support. Tmpfs/shmem already work with Direct I/O zero-copy. As explained, existing syscalls or generic methods can't enable dmabuf direct I/O zero-copy, which is why I propose adding an IOCTL command.
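For context, here is a hypothetical userspace usage sketch of the proposed interface; the struct layout and field names are assumptions for illustration and may differ from the actual patch (it assumes the patch's uapi header, with dmabuf_fd, file_fd and buf_len set up elsewhere):

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/types.h>

/* Hypothetical layout; the real uapi struct is defined by the patch. */
struct dma_buf_rw_file {
	__u32 file_fd;     /* regular file opened with O_DIRECT */
	__u32 flags;       /* e.g. direction: read file into dmabuf */
	__u64 file_offset;
	__u64 buf_offset;
	__u64 len;
};

static int dmabuf_read_file(int dmabuf_fd, int file_fd, __u64 buf_len)
{
	struct dma_buf_rw_file args = {
		.file_fd     = file_fd,
		.file_offset = 0,
		.buf_offset  = 0,
		.len         = buf_len,
		.flags       = 0,
	};

	/* DMA_BUF_IOCTL_RW_FILE comes from the proposed patch. */
	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_RW_FILE, &args) < 0) {
		perror("DMA_BUF_IOCTL_RW_FILE");
		return -1;
	}
	return 0;
}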
I respect your perspective. Could you clarify specific technical aspects, code requirements, or implementation principles for modifying sendfile() or copy_file_range()? This would help advance our discussion.
Thank you for engaging in this dialogue.
Regards, Christian.