On Wed, Jul 31, 2024 at 11:34:49AM +0800, Huan Yang wrote:
> The current udmabuf_folio contains a list_head and the corresponding
> folio pointer, with a size of 24 bytes. udmabuf_folio uses kmalloc to
> allocate memory.
>
> However, kmalloc is a public pool, starting from 64 bytes. This means
> that each udmabuf_folio allocation will waste 40 bytes.
>
> Considering that each udmabuf creates a udmabuf_folio for every folio
> it holds, the wasted memory can be significant, especially in the case
> of memory fragmentation.
>
> Furthermore, if udmabuf is frequently used, the allocation and
> deallocation of udmabuf_folio will also be frequent.
>
> Therefore, this patch adds a kmem_cache dedicated to the allocation
> and deallocation of udmabuf_folio. This is expected to improve
> allocation and deallocation performance while also avoiding the memory
> waste.
>
> Signed-off-by: Huan Yang <link(a)vivo.com>
> ---
> drivers/dma-buf/udmabuf.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> index 047c3cd2ceff..db4de8c745ce 100644
> --- a/drivers/dma-buf/udmabuf.c
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -24,6 +24,8 @@ static int size_limit_mb = 64;
> module_param(size_limit_mb, int, 0644);
> MODULE_PARM_DESC(size_limit_mb, "Max size of a dmabuf, in megabytes. Default is 64.");
>
> +static struct kmem_cache *udmabuf_folio_cachep;
> +
> struct udmabuf {
> pgoff_t pagecount;
> struct folio **folios;
> @@ -169,7 +171,7 @@ static void unpin_all_folios(struct list_head *unpin_list)
> unpin_folio(ubuf_folio->folio);
>
> list_del(&ubuf_folio->list);
> - kfree(ubuf_folio);
> + kmem_cache_free(udmabuf_folio_cachep, ubuf_folio);
> }
> }
>
> @@ -178,7 +180,7 @@ static int add_to_unpin_list(struct list_head *unpin_list,
> {
> struct udmabuf_folio *ubuf_folio;
>
> - ubuf_folio = kzalloc(sizeof(*ubuf_folio), GFP_KERNEL);
> + ubuf_folio = kmem_cache_alloc(udmabuf_folio_cachep, GFP_KERNEL);
> if (!ubuf_folio)
> return -ENOMEM;
>
> @@ -492,10 +494,20 @@ static int __init udmabuf_dev_init(void)
> if (ret < 0) {
> pr_err("Could not setup DMA mask for udmabuf device\n");
> misc_deregister(&udmabuf_misc);
misc_deregister() is now called twice in this error path; I think you've
forgotten to delete this line too?
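I.e. something like this (untested):

	if (ret < 0) {
		pr_err("Could not setup DMA mask for udmabuf device\n");
		goto err;
	}

	udmabuf_folio_cachep = KMEM_CACHE(udmabuf_folio, 0);
	if (!udmabuf_folio_cachep) {
		ret = -ENOMEM;
		goto err;
	}

	return 0;

err:
	misc_deregister(&udmabuf_misc);
	return ret;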
Otherwise lgtm.
-Sima
> - return ret;
> + goto err;
> + }
> +
> + udmabuf_folio_cachep = KMEM_CACHE(udmabuf_folio, 0);
> + if (unlikely(!udmabuf_folio_cachep)) {
> + ret = -ENOMEM;
> + goto err;
> }
>
> return 0;
> +
> +err:
> + misc_deregister(&udmabuf_misc);
> + return ret;
> }
>
> static void __exit udmabuf_dev_exit(void)
>
> base-commit: cd19ac2f903276b820f5d0d89de0c896c27036ed
> --
> 2.45.2
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Tue, Jul 30, 2024 at 08:04:04PM +0800, Huan Yang wrote:
>
> On 2024/7/30 17:05, Huan Yang wrote:
> >
> > On 2024/7/30 16:56, Daniel Vetter wrote:
> > >
> > > On Tue, Jul 30, 2024 at 03:57:44PM +0800, Huan Yang wrote:
> > > > UDMA-BUF step:
> > > > 1. memfd_create
> > > > 2. open file(buffer/direct)
> > > > 3. udmabuf create
> > > > 4. mmap memfd
> > > > 5. read file into memfd vaddr
> > > Yeah this is really slow and the worst way to do it. You absolutely
> > > want to start _all_ the I/O before you start creating the dma-buf,
> > > ideally with everything running in parallel. But just starting the
> > > direct I/O with async and then creating the udmabuf should be a lot
> > > faster and avoid
> > That's great. Let me rephrase that, and please correct me if I'm wrong.
> >
> > UDMA-BUF step:
> > 1. memfd_create
> > 2. mmap memfd
> > 3. open file(buffer/direct)
> > 4. start thread to async read
> > 5. udmabuf create
> >
> > With this, performance can improve.
>
> I just tested it. The steps are:
>
> UDMA-BUF step:
> 1. memfd_create
> 2. mmap memfd
> 3. open file(buffer/direct)
> 4. start thread to async read
> 5. udmabuf create
>
> 6. join wait
>
> Reading the 3GB file, all steps together cost 1,527,103,431 ns. That's great.
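>
> For reference, a minimal userspace sketch of this flow (error handling
> omitted; only the udmabuf uapi from <linux/udmabuf.h> is assumed, and
> size must be page aligned for both O_DIRECT and udmabuf):
>
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <pthread.h>
> #include <sys/ioctl.h>
> #include <sys/mman.h>
> #include <unistd.h>
> #include <linux/udmabuf.h>
>
> struct rd { int fd; void *buf; size_t sz; };
>
> static void *read_worker(void *p)
> {
> 	struct rd *r = p;
> 	size_t off = 0;
> 	ssize_t n;
>
> 	/* the O_DIRECT read lands straight in the memfd pages */
> 	while (off < r->sz &&
> 	       (n = pread(r->fd, (char *)r->buf + off, r->sz - off, off)) > 0)
> 		off += n;
> 	return NULL;
> }
>
> int file_to_udmabuf(const char *path, size_t size)
> {
> 	struct udmabuf_create c = { 0 };
> 	struct rd r = { .sz = size };
> 	pthread_t t;
> 	int memfd, devfd, buf_fd;
>
> 	memfd = memfd_create("src", MFD_ALLOW_SEALING);    /* 1 */
> 	ftruncate(memfd, size);
> 	fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK); /* udmabuf requires this seal */
>
> 	r.buf = mmap(NULL, size, PROT_READ | PROT_WRITE,   /* 2 */
> 		     MAP_SHARED, memfd, 0);
> 	r.fd = open(path, O_RDONLY | O_DIRECT);            /* 3 */
>
> 	pthread_create(&t, NULL, read_worker, &r);         /* 4 */
>
> 	devfd = open("/dev/udmabuf", O_RDWR);              /* 5 */
> 	c.memfd = memfd;
> 	c.offset = 0;
> 	c.size = size;
> 	buf_fd = ioctl(devfd, UDMABUF_CREATE, &c);
>
> 	pthread_join(t, NULL);                             /* 6 */
> 	return buf_fd;
> }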
Ok that's almost the throughput of your patch set, which I think is close
enough. The remaining difference is probably just the mmap overhead, not
sure whether/how we can do direct I/O to an fd directly ... in principle
it's possible for any file that uses the standard pagecache.
-Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On 31/07/2024 at 09:37, Huan Yang wrote:
> The current udmabuf_folio contains a list_head and the corresponding
> folio pointer, with a size of 24 bytes. udmabuf_folio uses kmalloc to
> allocate memory.
>
> However, kmalloc allocates from shared pools whose sizes start at
> 8, 16, 32 bytes, and so on. If the requested size does not match a
> kmalloc size, it is rounded up to the next one.
> This means that each udmabuf_folio allocation gets 32 bytes, wasting
> 8 bytes.
>
> Considering that each udmabuf creates a udmabuf_folio for every folio
> it holds, the wasted memory can be significant, especially in the case
> of memory fragmentation.
>
> Furthermore, if udmabuf is frequently used, the allocation and
> deallocation of udmabuf_folio will also be frequent.
>
> Therefore, this patch adds a kmem_cache dedicated to the allocation
> and deallocation of udmabuf_folio. This is expected to improve
> allocation and deallocation performance while also avoiding the memory
> waste.
>
> Signed-off-by: Huan Yang <link(a)vivo.com>
> ---
> v2 -> v3: fix error description.
> v1 -> v2: fix double unregister, remove unlikely.
> drivers/dma-buf/udmabuf.c | 19 +++++++++++++++----
> 1 file changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> index 047c3cd2ceff..c112c58ef09a 100644
> --- a/drivers/dma-buf/udmabuf.c
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -24,6 +24,8 @@ static int size_limit_mb = 64;
> module_param(size_limit_mb, int, 0644);
> MODULE_PARM_DESC(size_limit_mb, "Max size of a dmabuf, in megabytes. Default is 64.");
>
> +static struct kmem_cache *udmabuf_folio_cachep;
> +
> struct udmabuf {
> pgoff_t pagecount;
> struct folio **folios;
> @@ -169,7 +171,7 @@ static void unpin_all_folios(struct list_head *unpin_list)
> unpin_folio(ubuf_folio->folio);
>
> list_del(&ubuf_folio->list);
> - kfree(ubuf_folio);
> + kmem_cache_free(udmabuf_folio_cachep, ubuf_folio);
> }
> }
>
> @@ -178,7 +180,7 @@ static int add_to_unpin_list(struct list_head *unpin_list,
> {
> struct udmabuf_folio *ubuf_folio;
>
> - ubuf_folio = kzalloc(sizeof(*ubuf_folio), GFP_KERNEL);
> + ubuf_folio = kmem_cache_alloc(udmabuf_folio_cachep, GFP_KERNEL);
> if (!ubuf_folio)
> return -ENOMEM;
>
> @@ -491,11 +493,20 @@ static int __init udmabuf_dev_init(void)
> DMA_BIT_MASK(64));
> if (ret < 0) {
> pr_err("Could not setup DMA mask for udmabuf device\n");
> - misc_deregister(&udmabuf_misc);
> - return ret;
> + goto err;
> + }
> +
> + udmabuf_folio_cachep = KMEM_CACHE(udmabuf_folio, 0);
> + if (!udmabuf_folio_cachep) {
> + ret = -ENOMEM;
> + goto err;
> }
>
> return 0;
> +
> +err:
> + misc_deregister(&udmabuf_misc);
> + return ret;
> }
>
> static void __exit udmabuf_dev_exit(void)
Hi,
should a kmem_cache_destroy() also be added in udmabuf_dev_exit()?
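Something like this, perhaps (assuming the exit path currently only
deregisters the misc device):

	static void __exit udmabuf_dev_exit(void)
	{
		kmem_cache_destroy(udmabuf_folio_cachep);
		misc_deregister(&udmabuf_misc);
	}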
CJ
>
> base-commit: cd19ac2f903276b820f5d0d89de0c896c27036ed
Hi Huan,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 931a3b3bccc96e7708c82b30b2b5fa82dfd04890]
url: https://github.com/intel-lab-lkp/linux/commits/Huan-Yang/dma-buf-heaps-Intr…
base: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
patch link: https://lore.kernel.org/r/20240730075755.10941-4-link%40vivo.com
patch subject: [PATCH v2 3/5] dma-buf: heaps: support alloc async read file
config: xtensa-allyesconfig (https://download.01.org/0day-ci/archive/20240731/202407312202.LhLTLEhX-lkp@…)
compiler: xtensa-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240731/202407312202.LhLTLEhX-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407312202.LhLTLEhX-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/dma-buf/dma-heap.c:45: warning: Function parameter or struct member 'priv' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:45: warning: Function parameter or struct member 'heap_devt' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:45: warning: Function parameter or struct member 'list' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:45: warning: Function parameter or struct member 'heap_cdev' not described in 'dma_heap'
>> drivers/dma-buf/dma-heap.c:158: warning: Function parameter or struct member 'lock' not described in 'dma_heap_file_control'
drivers/dma-buf/dma-heap.c:482: warning: expecting prototype for Trigger sync file read, read into dma(). Prototype was for dma_heap_read_file_sync() instead
vim +158 drivers/dma-buf/dma-heap.c
133
134 /**
135 * struct dma_heap_file_control - global control of dma_heap file read.
136 * @works: @dma_heap_file_work's list head.
137 *
138 * @threadwq: wait queue for @work_thread, if commit work, @work_thread
139 * wakeup and read this work's file contains.
140 *
141 * @workwq: used for main thread wait for file read end, if allocation
142 * end before file read. @dma_heap_file_task ref effect this.
143 *
144 * @work_thread: file read kthread. the dma_heap_file_task work's consumer.
145 *
146 * @heap_fwork_cachep: @dma_heap_file_work's cachep, it's alloc/free frequently.
147 *
148 * @nr_work: global number of how many work committed.
149 */
150 struct dma_heap_file_control {
151 struct list_head works;
152 spinlock_t lock; // only lock for @works.
153 wait_queue_head_t threadwq;
154 wait_queue_head_t workwq;
155 struct task_struct *work_thread;
156 struct kmem_cache *heap_fwork_cachep;
157 atomic_t nr_work;
> 158 };
159
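A fix would be to document the member in the kernel-doc block above,
e.g. (matching the inline comment):

 * @lock: spinlock; protects only @works.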
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Huan,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 931a3b3bccc96e7708c82b30b2b5fa82dfd04890]
url: https://github.com/intel-lab-lkp/linux/commits/Huan-Yang/dma-buf-heaps-Intr…
base: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
patch link: https://lore.kernel.org/r/20240730075755.10941-2-link%40vivo.com
patch subject: [PATCH v2 1/5] dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
config: xtensa-allyesconfig (https://download.01.org/0day-ci/archive/20240731/202407311822.ZneNMq5I-lkp@…)
compiler: xtensa-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240731/202407311822.ZneNMq5I-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407311822.ZneNMq5I-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/dma-buf/dma-heap.c:44: warning: Function parameter or struct member 'priv' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:44: warning: Function parameter or struct member 'heap_devt' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:44: warning: Function parameter or struct member 'list' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:44: warning: Function parameter or struct member 'heap_cdev' not described in 'dma_heap'
>> drivers/dma-buf/dma-heap.c:104: warning: expecting prototype for Trigger sync file read, read into dma(). Prototype was for dma_heap_read_file_sync() instead
vim +104 drivers/dma-buf/dma-heap.c
86
87 /**
88 * Trigger sync file read, read into dma-buf.
89 *
90 * @dmabuf: which we done alloced and export.
91 * @heap_file: file info wrapper to read from.
92 *
93 * Whether to use buffer I/O or direct I/O depends on the mode when the
94 * file is opened.
95 * Remember, if use direct I/O, file must be page aligned.
96 * Since the buffer used for file reading is provided by dma-buf, when
97 * using direct I/O, the file content will be directly filled into
98 * dma-buf without the need for additional CPU copying.
99 *
100 * 0 on success, negative if anything wrong.
101 */
102 static int dma_heap_read_file_sync(struct dma_buf *dmabuf,
103 struct dma_heap_file *heap_file)
> 104 {
105 struct iosys_map map;
106 ssize_t bytes;
107 int ret;
108
109 ret = dma_buf_vmap(dmabuf, &map);
110 if (ret)
111 return ret;
112
113 /**
114 * The kernel_read_file function can handle file reading effectively,
115 * and if the return value does not match the file size,
116 * then it indicates an error.
117 */
118 bytes = kernel_read_file(heap_file->file, 0, &map.vaddr, dmabuf->size,
119 &heap_file->fsize, READING_POLICY);
120 if (bytes != heap_file->fsize)
121 ret = -EIO;
122
123 dma_buf_vunmap(dmabuf, &map);
124
125 return ret;
126 }
127
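The kernel-doc comment needs to start with the function name so it
matches the prototype; something like:

/**
 * dma_heap_read_file_sync - trigger a sync file read into the dma-buf
 * @dmabuf: the dma-buf we already allocated and exported.
 * @heap_file: file info wrapper to read from.
 *
 * Return: 0 on success, negative errno on failure.
 */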
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Tue, Jul 30, 2024 at 1:14 AM Huan Yang <link(a)vivo.com> wrote:
>
>
> On 2024/7/30 16:03, Christian König wrote:
> > Am 30.07.24 um 09:57 schrieb Huan Yang:
> >> Background
> >> ====
> >> Some users may need to load a file into a dma-buf; the current way is:
> >> 1. allocate a dma-buf, get dma-buf fd
> >> 2. mmap dma-buf fd into user vaddr
> >> 3. read(file_fd, vaddr, fsz)
> >> Because the dma-buf user mapping can't support direct I/O[1], the
> >> file read must use buffered I/O.
> >>
> >> This means that during the process of reading the file into dma-buf,
> >> page cache needs to be generated, and the corresponding content needs to
> >> be first copied to the page cache before being copied to the dma-buf.
> >>
> >> This way worked well when reading relatively small files before, as
> >> the page cache can cache the file content, thus improving performance.
> >>
> >> However, there are new challenges currently, especially as AI models are
> >> becoming larger and need to be shared between DMA devices and the CPU
> >> via dma-buf.
> >>
> >> For example, our 7B model file is around 3.4GB. Using the previous
> >> approach would mean generating a total of 3.4GB of page cache
> >> (even if it will be reclaimed), and also requiring the copying of 3.4GB
> >> of content between page cache and dma-buf.
> >>
> >> Due to the limited system memory, files in the gigabyte range cannot
> >> persist in memory indefinitely, so this portion of page cache may not
> >> help subsequent reads much. Additionally, the page cache consumes
> >> extra system resources because of the additional copying required by
> >> the CPU.
> >>
> >> Therefore, I think it is necessary for dma-buf to support direct I/O.
> >>
> >> However, direct I/O file reads cannot be performed using the buffer
> >> mmaped by the user space for the dma-buf.[1]
> >>
> >> Here are some discussions on implementing direct I/O using dma-buf:
> >>
> >> mmap[1]
> >> ---
> >> dma-buf never support user map vaddr use of direct I/O.
> >>
> >> udmabuf[2]
> >> ---
> >> Currently, udmabuf can use the memfd method to read files into
> >> dma-buf in direct I/O mode.
> >>
> >> However, if the size is large, the current udmabuf needs its
> >> size_limit raised (default 64MB).
> >> But using udmabuf for files at the 3GB level is not a very good
> >> approach: it needs some internal adjustments to handle this[3], or
> >> else creation fails.
> >>
> >> But, it is indeed a viable way to enable dma-buf to support direct I/O.
> >> However, it is necessary to initiate the file read after the memory
> >> allocation
> >> is completed, and handle race conditions carefully.
> >>
> >> sendfile/splice[4]
> >> ---
> >> Another way to enable dma-buf to support direct I/O is by implementing
> >> splice_write/write_iter in the dma-buf file operations (fops) to adapt
> >> to the sendfile method.
> >> However, the current sendfile/splice calls are pipe based. When using
> >> direct I/O to read a file, the content must first be copied into the
> >> buffer allocated by the pipe (default 64KB), and then the dma-buf
> >> fops' splice_write must be called to write the content into the
> >> dma-buf.
> >> This approach serially reads one pipe-buffer's worth of the file and
> >> then waits for the dma-buf write before reading the next chunk. (The
> >> I/O performance is relatively weak under direct I/O.)
> >> Moreover, because of the pipe buffer, even when using direct I/O and
> >> not generating additional page cache, a CPU copy is still needed.
> >>
> >> copy_file_range[5]
> >> ---
> >> As for copy_file_range, it only supports copying files within the
> >> same file system, so it is similarly impractical here.
> >>
> >>
> >> So, currently, there is no particularly suitable solution on VFS to
> >> allow dma-buf to support direct I/O for large file reads.
> >>
> >> This patchset provides an idea to complete file reads when requesting a
> >> dma-buf.
> >>
> >> Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
> >> ===
> >> This patch provides a method to immediately read the file content after
> >> the dma-buf is allocated, and only returns the dma-buf file descriptor
> >> after the file is fully read.
> >>
> >> Since the dma-buf file descriptor is not returned, no other thread can
> >> access it except for the current thread, so we don't need to worry about
> >> race conditions.
> >
> > That is a completely false assumption.
> Can you provide a detailed explanation as to why this assumption is
> incorrect? Thanks.
> >
> >>
> >> Map the dma-buf to the vmalloc area and initiate file reads in kernel
> >> space, supporting both buffer I/O and direct I/O.
> >>
> >> This patch adds the DMA_HEAP_ALLOC_AND_READ heap_flag for user.
> >> When a user needs to allocate a dma-buf and read a file, they should
> >> pass this heap flag. As the size of the file being read is fixed,
> >> there is no
> >> need to pass the 'len' parameter. Instead, The file_fd needs to be
> >> passed to
> >> indicate to the kernel the file that needs to be read.
> >>
> >> The file open flags determine the file reading mode.
> >> But please note that if direct I/O (O_DIRECT) is needed to read the
> >> file, the file size must be page aligned. (With patches 2-5, this is
> >> not needed.)
> >>
> >> Therefore, for the user, len and file_fd are mutually exclusive,
> >> and they are combined using a union.
> >>
> >> Once the user obtains the dma-buf fd, the dma-buf directly contains the
> >> file content.
> >
> > And I'm repeating myself, but this is a complete NAK from my side to
> > this approach.
> >
> > We pointed out multiple ways of how to implement this cleanly and not
> > by hacking functionality into the kernel which absolutely doesn't
> > belong there.
> In this patchset, I have provided performance comparisons of each of
> these methods. Can you please provide more opinions?
> >
> > Regards,
> > Christian.
> >
> >>
> >> Patch 1 implement it.
> >>
> >> Patch 2-5 provides an approach for performance improvement.
> >>
> >> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
> >> synchronously read files using direct I/O.
> >>
> >> This approach helps to save CPU copying and avoid a certain degree of
> >> memory thrashing (page cache generation and reclamation)
> >>
> >> When dealing with large file sizes, the benefits of this approach become
> >> particularly significant.
> >>
> >> However, there are currently some methods that can improve performance,
> >> not just save system resources:
> >>
> >> Due to the large file size, for example an AI 7B model of around
> >> 3.4GB, the time taken to allocate DMA-BUF memory will be relatively
> >> long. Waiting
> >> for the allocation to complete before reading the file will add to the
> >> overall time consumption. Therefore, the total time for DMA-BUF
> >> allocation and file read can be calculated using the formula
> >> T(total) = T(alloc) + T(I/O)
> >>
> >> However, if we change our approach, we don't necessarily need to wait
> >> for the DMA-BUF allocation to complete before initiating I/O. In fact,
> >> during the allocation process, we already hold a portion of the pages,
> >> which means that waiting for subsequent page allocations to complete
> >> before carrying out file reads is actually unfair to the pages that have
> >> already been allocated.
> >>
> >> The allocation of pages is sequential, and the reading of the file is
> >> also sequential, with the content and size corresponding to the file.
> >> This means that the memory location for each page, which holds the
> >> content of a specific position in the file, can be determined at the
> >> time of allocation.
> >>
> >> However, to fully leverage I/O performance, it is best to wait and
> >> gather a certain number of pages before initiating batch processing.
> >>
> >> The default gather size is 128MB. Each gathered batch can be seen as
> >> one file-read work item: it maps the gathered pages to the vmalloc
> >> area to obtain a contiguous virtual address, which is used as the
> >> buffer to store the contents of the corresponding part of the file.
> >> So, if using direct I/O to read the file, the file content is written
> >> directly into the corresponding dma-buf memory without any additional
> >> copying (compare the pipe buffer).
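> >>
> >> Roughly, each gather work item looks like this (an illustrative
> >> sketch only, not the exact patch code; direct I/O into the vmap'd
> >> buffer additionally depends on kvec direct I/O support, see [7]):
> >>
> >> static int heap_read_gather(struct file *file, struct page **pages,
> >> 			    unsigned int nr_pages, loff_t pos)
> >> {
> >> 	/* map the gathered dma-buf pages to a contiguous kernel address */
> >> 	void *vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
> >> 	ssize_t bytes;
> >>
> >> 	if (!vaddr)
> >> 		return -ENOMEM;
> >>
> >> 	/* direct I/O writes the file content straight into the dma-buf pages */
> >> 	bytes = kernel_read(file, vaddr, nr_pages * PAGE_SIZE, &pos);
> >>
> >> 	vunmap(vaddr);
> >> 	return bytes < 0 ? bytes : 0;
> >> }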
> >>
> >> Consider the other ways to read into a dma-buf. If we read after
> >> mmap of the dma-buf, we need to map the dma-buf's pages into the user
> >> virtual address space; the udmabuf memfd approach needs this too.
> >> Even if we support sendfile, the file copy still needs a buffer that
> >> must be set up.
> >> So, mapping pages to the vmalloc area does not incur any additional
> >> performance overhead compared to the other methods.[6]
> >>
> >> The administrator can also modify the gather size (see patch 5).
> >>
> >> The formula for the time taken for system_heap buffer allocation and
> >> file reading through async_read is as follows:
> >>
> >> T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
> >>
> >> Compared to the synchronous read:
> >> T(total) = T(alloc) + T(I/O)
> >>
> >> Whichever of the allocation and the I/O takes longer dominates the
> >> total, and the shorter one is hidden behind it. For example, with
> >> patch1's 3GB breakdown below (alloc ~0.19s, read ~4.18s), nearly all
> >> of the allocation time can be overlapped with the I/O.
> >>
> >> Therefore, the larger the size of the file that needs to be read, the
> >> greater the corresponding benefits will be.
> >>
> >> How to use
> >> ===
> >> Consider the current pathway for loading model files into DMA-BUF:
> >> 1. open dma-heap, get heap fd
> >> 2. open file, get file_fd(can't use O_DIRECT)
> >> 3. use file len to allocate dma-buf, get dma-buf fd
> >> 4. mmap dma-buf fd, get vaddr
> >> 5. read(file_fd, vaddr, file_size) into dma-buf pages
> >> 6. share, attach, whatever you want
> >>
> >> Use DMA_HEAP_ALLOC_AND_READ_FILE JUST a little change:
> >> 1. open dma-heap, get heap fd
> >> 2. open file, get file_fd(buffer/direct)
> >> 3. allocate dma-buf with the DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
> >> setting file_fd instead of len; get a dma-buf fd that already
> >> contains the file content (see the sketch after this list)
> >> 4. share, attach, whatever you want
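> >>
> >> For illustration, step 3 could look like this (assuming the uapi
> >> union member is named file_fd, as introduced in patch 1):
> >>
> >> 	struct dma_heap_allocation_data data = {
> >> 		.file_fd = file_fd,	/* union with len */
> >> 		.fd_flags = O_RDWR | O_CLOEXEC,
> >> 		.heap_flags = DMA_HEAP_ALLOC_AND_READ_FILE,
> >> 	};
> >>
> >> 	/* returns only after the file is fully read into the buffer */
> >> 	ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
> >> 	int dmabuf_fd = data.fd;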
> >>
> >> So, testing it is easy.
> >>
> >> How to test
> >> ===
> >> The performance comparison will be conducted for the following
> >> scenarios:
> >> 1. normal
> >> 2. udmabuf with [3] patch
> >> 3. sendfile
> >> 4. only patch 1
> >> 5. patch1 - patch4.
> >>
> >> normal:
> >> 1. open dma-heap, get heap fd
> >> 2. open file, get file_fd(can't use O_DIRECT)
> >> 3. use file len to allocate dma-buf, get dma-buf fd
> >> 4. mmap dma-buf fd, get vaddr
> >> 5. read(file_fd, vaddr, file_size) into dma-buf pages
> >> 6. share, attach, whatever you want
> >>
> >> UDMA-BUF step:
> >> 1. memfd_create
> >> 2. open file(buffer/direct)
> >> 3. udmabuf create
> >> 4. mmap memfd
> >> 5. read file into memfd vaddr
> >>
> >> Sendfile steps (needs suitable splice_write/write_iter, used only for comparison):
> >> 1. open dma-heap, get heap fd
> >> 2. open file, get file_fd(buffer/direct)
> >> 3. use file len to allocate dma-buf, get dma-buf fd
> >> 4. sendfile file_fd to dma-buf fd
> >> 5. share, attach, whatever you want
> >>
> >> patch1/patch1-4:
> >> 1. open dma-heap, get heap fd
> >> 2. open file, get file_fd(buffer/direct)
> >> 3. allocate dma-buf with the DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
> >> setting file_fd instead of len; get a dma-buf fd that already
> >> contains the file content
> >> 4. share, attach, whatever you want
> >>
> >> You can create a file to test it. Compare the performance gap between
> >> the two.
> >> It is best to compare the differences in file size from KB to MB to GB.
> >>
> >> The following test data will compare the performance differences
> >> between 512KB,
> >> 8MB, 1GB, and 3GB under various scenarios.
> >>
> >> Performance Test
> >> ===
> >> 12G RAM phone
> >> UFS4.0 (maximum speed 4GB/s),
> >> f2fs
> >> kernel 6.1 with patch[7] (otherwise kvec direct I/O reads are not
> >> supported)
> >> no memory pressure.
> >> drop_cache is used for each test.
> >>
> >> The average of 5 test results:
> >> | scheme-size         | 512KB(ns)  | 8MB(ns)    | 1GB(ns)       | 3GB(ns)       |
> >> | ------------------- | ---------- | ---------- | ------------- | ------------- |
> >> | normal              | 2,790,861  | 14,535,784 | 1,520,790,492 | 3,332,438,754 |
> >> | udmabuf buffer I/O  | 1,704,046  | 11,313,476 | 821,348,000   | 2,108,419,923 |
> >> | sendfile buffer I/O | 3,261,261  | 12,112,292 | 1,565,939,938 | 3,062,052,984 |
> >> | patch1-4 buffer I/O | 2,064,538  | 10,771,474 | 986,338,800   | 2,187,570,861 |
> >> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 9,777,661,077 |
> >> | patch1 direct I/O   | 813,215    | 6,962,092  | 2,364,211,877 | 5,648,897,554 |
> >> | udmabuf direct I/O  | 1,289,554  | 8,968,138  | 921,480,784   | 2,158,305,738 |
> >> | patch1-4 direct I/O | 1,957,661  | 6,581,999  | 520,003,538   | 1,400,006,107 |
>
> With this test, pipe-based sendfile can't help much.
>
> udmabuf is good, but I think our OEM driver can't use it. (And AOSP
> does not enable this feature.)
Hi Huan,
We should be able to turn on udmabuf for the Android kernels. We don't
have CONFIG_UDMABUF because nobody has wanted it so far. It's
encouraging to see your latest results!
-T.J.
>
> Anyway, I am sending this patchset in the hope of further discussion.
>
> Thanks.
>
> >>
> >> So, based on the test results:
> >>
> >> When the file is large, the patchset gives the highest performance.
> >> Compared to normal, the full patchset is a 50% improvement;
> >> compared to normal, patch1 alone shows only a 41% degradation.
> >> patch1's typical performance breakdown is as follows:
> >> 1. alloc cost 188,802,693 ns
> >> 2. vmap cost 42,491,385 ns
> >> 3. file read cost 4,180,876,702 ns
> >> Therefore, directly performing a single direct I/O read on a large file
> >> may not be the most optimal way for performance.
> >>
> >> The performance of direct I/O implemented by the sendfile method is
> >> the worst.
> >>
> >> When the file size is small, the difference in performance is not
> >> significant. This is consistent with expectations.
> >>
> >>
> >>
> >> Suggested use cases
> >> ===
> >> 1. When there is a need to read large files (GB level) and system
> >> resources are scarce, especially when memory is limited. In this
> >> scenario, using direct I/O for file reading can even bring
> >> performance improvements. (May need patches 2-3.)
> >> 2. For embedded devices with limited RAM, using direct I/O can save
> >> system resources and avoid unnecessary data copying. So even if
> >> performance is lower when reading small files, it can still be
> >> used effectively.
> >> 3. If there is sufficient memory, pinning the page cache of the
> >> model files in memory and placing the files in the EROFS file
> >> system for read-only access may be better. (EROFS does not
> >> support direct I/O.)
> >>
> >>
> >> Changelog
> >> ===
> >> v1 [8]
> >> v1->v2:
> >> Uses the heap flag method for alloc and read instead of adding a new
> >> DMA-buf ioctl command. [9]
> >> Split the patchset to facilitate review and test.
> >> patch 1 implement alloc and read, offer heap flag into it.
> >> patch 2-4 offer async read
> >> patch 5 can change gather limit.
> >>
> >> Reference
> >> ===
> >> [1]
> >> https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
> >> [2]
> >> https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
> >> [3]
> >> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
> >> [4]
> >> https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
> >> [5]
> >> https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
> >> [6]
> >> https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
> >> [7]
> >> https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230…
> >> [8]
> >> https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
> >> [9]
> >> https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
> >>
> >> Huan Yang (5):
> >> dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
> >> dma-buf: heaps: Introduce async alloc read ops
> >> dma-buf: heaps: support alloc async read file
> >> dma-buf: heaps: system_heap alloc support async read
> >> dma-buf: heaps: configurable async read gather limit
> >>
> >> drivers/dma-buf/dma-heap.c | 552 +++++++++++++++++++++++++++-
> >> drivers/dma-buf/heaps/system_heap.c | 70 +++-
> >> include/linux/dma-heap.h | 53 ++-
> >> include/uapi/linux/dma-heap.h | 11 +-
> >> 4 files changed, 673 insertions(+), 13 deletions(-)
> >>
> >>
> >> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
> >