This is a really weird interface. No one has yet explained why dmabuf is so special that we can't support direct I/O to it when we can support it to otherwise exotic mappings like PCI P2P ones.
On 6/3/25 15:00, Christoph Hellwig wrote:
> This is a really weird interface. No one has yet explained why dmabuf is so special that we can't support direct I/O to it when we can support it to otherwise exotic mappings like PCI P2P ones.

With udmabuf you can do direct I/O, it's just inefficient to walk the page tables for it when you already have an array of all the folios.

Regards, Christian.
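For illustration, a minimal userspace sketch of the path being described, assuming a udmabuf fd has already been created via the UDMABUF_CREATE ioctl on /dev/udmabuf (names and error handling are trimmed; this is not code from the series). The O_DIRECT read into the mmap'ed buffer is what forces the kernel to walk the page tables and pin the very pages the exporter already holds in an array:

```c
/* Sketch only: udmabuf_fd and size are assumed to come from an
 * earlier UDMABUF_CREATE ioctl; all error handling is omitted. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static void read_into_udmabuf(int udmabuf_fd, size_t size, const char *path)
{
	/* Map the dma-buf so it is reachable through user page tables. */
	void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			 udmabuf_fd, 0);

	/* O_DIRECT read straight into the mapping: the kernel has to walk
	 * the page tables (GUP) to find the folios the exporter already
	 * has in an array. */
	int fd = open(path, O_RDONLY | O_DIRECT);
	ssize_t ret = pread(fd, buf, size, 0);
	(void)ret;

	close(fd);
	munmap(buf, size);
}
```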
On Tue, Jun 03, 2025 at 03:14:20PM +0200, Christian König wrote:
> On 6/3/25 15:00, Christoph Hellwig wrote:
>> This is a really weird interface. No one has yet explained why dmabuf is so special that we can't support direct I/O to it when we can support it to otherwise exotic mappings like PCI P2P ones.
>
> With udmabuf you can do direct I/O, it's just inefficient to walk the page tables for it when you already have an array of all the folios.

Does it matter compared to the I/O in this case?

Either way, there has been talk (in the case of networking implementations) of using a dmabuf as a first-class container for lower-level I/O. I'd much rather do that than add odd side interfaces, i.e. have a version of splice that doesn't bother with the pipe, but instead just uses in-kernel direct I/O on one side and dmabuf-provided folios on the other.
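A rough sketch of what such a pipe-less interface could look like. Everything here is hypothetical: dma_buf_folio_batch, get_folio_batch and dma_buf_rw_file() do not exist, the names are invented purely to illustrate the idea of the exporter handing its folios to in-kernel direct I/O instead of being mmap'ed and walked through page tables:

```c
/* Hypothetical sketch, not an existing API. */
struct dma_buf_folio_batch {
	struct folio	**folios;	/* folios backing the dma-buf */
	unsigned int	nr;		/* number of entries in @folios */
	size_t		first_offset;	/* byte offset into the first folio */
};

struct dma_buf_io_ops {
	/* Exporter hands out its folio array directly, no GUP needed. */
	int (*get_folio_batch)(struct dma_buf *dmabuf,
			       struct dma_buf_folio_batch *batch);
	void (*put_folio_batch)(struct dma_buf *dmabuf,
				struct dma_buf_folio_batch *batch);
};

/*
 * Pipe-less splice: wrap the exporter's folios in an ITER_BVEC and
 * issue in-kernel direct I/O against @file at @pos.
 */
ssize_t dma_buf_rw_file(struct dma_buf *dmabuf, struct file *file,
			loff_t pos, size_t len, unsigned int direction);
```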
On 6/3/25 15:19, Christoph Hellwig wrote:
> On Tue, Jun 03, 2025 at 03:14:20PM +0200, Christian König wrote:
>> On 6/3/25 15:00, Christoph Hellwig wrote:
>>> This is a really weird interface. No one has yet explained why dmabuf is so special that we can't support direct I/O to it when we can support it to otherwise exotic mappings like PCI P2P ones.
>>
>> With udmabuf you can do direct I/O, it's just inefficient to walk the page tables for it when you already have an array of all the folios.
>
> Does it matter compared to the I/O in this case?

It unfortunately does, see the numbers on patches 3 and 4.

I'm not very keen on it either, but I don't see much other way to do this.

> Either way, there has been talk (in the case of networking implementations) of using a dmabuf as a first-class container for lower-level I/O. I'd much rather do that than add odd side interfaces, i.e. have a version of splice that doesn't bother with the pipe, but instead just uses in-kernel direct I/O on one side and dmabuf-provided folios on the other.

That would work for me as well. But whether splice or copy_file_range is used is not that important to me.

My question is rather whether it's ok to call f_op->write_iter() and f_op->read_iter() with pages allocated by alloc_pages(), e.g. pages where drivers potentially ignore the page count and just re-use them as they like?

Regards, Christian.
On Tue, Jun 03, 2025 at 04:18:22PM +0200, Christian König wrote:
>> Does it matter compared to the I/O in this case?
>
> It unfortunately does, see the numbers on patches 3 and 4.

That's kinda weird. Why does the page table lookup take so much time compared to normal I/O?

> My question is rather whether it's ok to call f_op->write_iter() and f_op->read_iter() with pages allocated by alloc_pages(), e.g. pages where drivers potentially ignore the page count and just re-use them as they like?

read_iter and write_iter with ITER_BVEC just use the pages as source and destination of the I/O. They must not touch the refcounts or do anything fancy with them. Various places in the kernel rely on that.
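To make that calling convention concrete, here is a minimal in-kernel sketch (not the code from this series; error handling is trimmed, the file is assumed to have been opened with O_DIRECT, and the helper name is made up) that wraps pages from alloc_pages() in an ITER_BVEC and hands them to ->read_iter(). The callee only uses them as the I/O destination and never takes or drops references:

```c
#include <linux/bvec.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/uio.h>

/* Sketch only: read file contents at @pos into @nr_pages pages that the
 * caller got from alloc_pages() and fully owns. */
static ssize_t read_into_pages(struct file *file, struct page **pages,
			       unsigned int nr_pages, loff_t pos)
{
	struct bio_vec *bvec;
	struct iov_iter iter;
	struct kiocb kiocb;
	unsigned int i;
	ssize_t ret;

	bvec = kcalloc(nr_pages, sizeof(*bvec), GFP_KERNEL);
	if (!bvec)
		return -ENOMEM;

	/* Describe the pages; no reference is taken here or by the callee. */
	for (i = 0; i < nr_pages; i++)
		bvec_set_page(&bvec[i], pages[i], PAGE_SIZE, 0);

	init_sync_kiocb(&kiocb, file);
	kiocb.ki_pos = pos;
	iov_iter_bvec(&iter, ITER_DEST, bvec, nr_pages,
		      (size_t)nr_pages << PAGE_SHIFT);

	ret = call_read_iter(file, &kiocb, &iter);

	kfree(bvec);
	return ret;
}
```

Whether the pages come from alloc_pages() or from a dma-buf heap makes no difference to the callee here, which is exactly the guarantee being asked about above.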
On 6/3/25 16:28, Christoph Hellwig wrote:
> On Tue, Jun 03, 2025 at 04:18:22PM +0200, Christian König wrote:
>>> Does it matter compared to the I/O in this case?
>>
>> It unfortunately does, see the numbers on patches 3 and 4.
>
> That's kinda weird. Why does the page table lookup take so much time compared to normal I/O?

I have absolutely no idea. It's rather surprising to me as well.

The user seems to have a rather slow CPU paired with fast I/O, but it still looks rather fishy to me.

In addition to that, allocating memory through memfd_create() is *much* slower on that box than through dma-buf heaps (which basically just use GFP and an array).

We have seen something similar on customer systems which we couldn't explain so far.

>> My question is rather whether it's ok to call f_op->write_iter() and f_op->read_iter() with pages allocated by alloc_pages(), e.g. pages where drivers potentially ignore the page count and just re-use them as they like?
>
> read_iter and write_iter with ITER_BVEC just use the pages as source and destination of the I/O. They must not touch the refcounts or do anything fancy with them. Various places in the kernel rely on that.

Perfect, thanks for that info.

Regards, Christian.
On Tue, Jun 03, 2025 at 05:55:18PM +0200, Christian König wrote:
> On 6/3/25 16:28, Christoph Hellwig wrote:
>> On Tue, Jun 03, 2025 at 04:18:22PM +0200, Christian König wrote:
>>>> Does it matter compared to the I/O in this case?
>>>
>>> It unfortunately does, see the numbers on patches 3 and 4.
>>
>> That's kinda weird. Why does the page table lookup take so much time compared to normal I/O?
>
> I have absolutely no idea. It's rather surprising to me as well.
>
> The user seems to have a rather slow CPU paired with fast I/O, but it still looks rather fishy to me.
>
> In addition to that, allocating memory through memfd_create() is *much* slower on that box than through dma-buf heaps (which basically just use GFP and an array).

Can someone try to reproduce these results on a normal system before we build infrastructure based on these numbers?