On 30.07.24 at 10:14, Huan Yang wrote:
> On 2024/7/30 at 16:03, Christian König wrote:
>> On 30.07.24 at 09:57, Huan Yang wrote:
>>> Background
>>> ====
>>> Some users may need to load a file into a dma-buf. The current way is:
>>> 1. allocate a dma-buf, get dma-buf fd
>>> 2. mmap dma-buf fd into user vaddr
>>> 3. read(file_fd, vaddr, fsz)
>>> Because a user mapping of a dma-buf can't support direct I/O[1], the
>>> file read must use buffered I/O.
>>>
>>> This means that while the file is read into the dma-buf, page cache
>>> has to be generated, and the content is first copied into the page
>>> cache before being copied to the dma-buf.
>>>
>>> This way worked well when reading relatively small files before, as
>>> the page cache can cache the file content, thus improving performance.
>>>
>>> However, there are new challenges now, especially as AI models are
>>> becoming larger and need to be shared between DMA devices and the CPU
>>> via dma-buf.
>>>
>>> For example, our 7B model file is around 3.4GB. Using the previous
>>> approach would mean generating a total of 3.4GB of page cache
>>> (even if it will be reclaimed), and also copying 3.4GB of content
>>> between page cache and dma-buf.
>>>
>>> Because system memory is limited, files in the gigabyte range cannot
>>> stay in memory indefinitely, so this portion of page cache may not
>>> help subsequent reads much. Additionally, the page cache consumes
>>> extra system resources because of the additional CPU copy.
>>>
>>> Therefore, I think it is necessary for dma-buf to support direct I/O.
>>>
>>> However, direct I/O file reads cannot be performed using the buffer
>>> mmaped by the user space for the dma-buf.[1]
>>>
>>> Here are some discussions on implementing direct I/O using dma-buf:
>>>
>>> mmap[1]
>>> ---
>>> dma-buf has never supported direct I/O on a user-mapped vaddr.
>>>
>>> udmabuf[2]
>>> ---
>>> Currently, udmabuf can use the memfd method to read files into
>>> dma-buf in direct I/O mode.
>>>
>>> However, if the size is large, the current udmabuf needs its
>>> size_limit (default 64MB) adjusted.
>>> Using udmabuf for files at the 3GB level is not a very good approach:
>>> it needs some internal adjustments to handle this[3], or else the
>>> create fails.
>>>
>>> Still, it is a viable way to enable dma-buf to support direct I/O.
>>> However, the file read has to be initiated after the memory allocation
>>> is completed, and race conditions must be handled carefully.
>>>
>>> sendfile/splice[4]
>>> ---
>>> Another way to enable dma-buf to support direct I/O is by implementing
>>> splice_write/write_iter in the dma-buf file operations (fops) to adapt
>>> to the sendfile method.
>>> However, the current sendfile/splice calls are based on a pipe. When
>>> using direct I/O to read a file, the content needs to be copied to the
>>> buffer allocated by the pipe (default 64KB), and then the dma-buf fops'
>>> splice_write needs to be called to write the content into the dma-buf.
>>> This approach serially reads one pipe-buffer-sized chunk of the file
>>> into the pipe buffer and then waits for it to be written to the dma-buf
>>> before reading the next one. (The I/O performance is relatively weak
>>> under direct I/O.)
>>> Moreover, because of the pipe buffer, a CPU copy is still needed even
>>> when direct I/O is used and no additional page cache is generated.
>>>
>>> copy_file_range[5]
>>> ---
>>> copy_file_range only supports copying files within the same file
>>> system, so it is not very practical either.
>>>
>>>
>>> So, currently, there is no particularly suitable solution on VFS to
>>> allow dma-buf to support direct I/O for large file reads.
>>>
>>> This patchset provides an idea to complete file reads when requesting a
>>> dma-buf.
>>>
>>> Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>> ===
>>> This patch provides a method to immediately read the file content after
>>> the dma-buf is allocated, and only returns the dma-buf file descriptor
>>> after the file is fully read.
>>>
>>> Since the dma-buf file descriptor is not returned, no other thread can
>>> access it except for the current thread, so we don't need to worry
>>> about race conditions.
>>
>> That is a completely false assumption.
> Can you provide a detailed explanation as to why this assumption is
> incorrect? thanks.
File descriptors can be guessed and are available to userspace as soon as
dma_buf_fd() is called.
What could potentially work is to call system_heap_allocate() without
calling dma_buf_fd(), but I'm not sure if you can then make I/O to the
underlying pages.
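To illustrate why an fd number counts as guessable: POSIX hands out the
lowest unused descriptor number, so the number the next allocation in a
process will receive can be predicted. A minimal userspace sketch, assuming
no other thread opens or closes descriptors in between; predict_next_fd()
is a hypothetical helper name, not something from the patchset:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns the descriptor number that the next open()/dup()/fd-returning
 * ioctl in this process is expected to get, or -1 on error.
 * This works because the lowest free number is always used.
 */
int predict_next_fd(void)
{
	int fd = open("/dev/null", O_RDONLY);

	if (fd < 0)
		return -1;
	close(fd);	/* the number is free again and will be reused */
	return fd;
}
```

Any thread sharing the file table can probe such a number while the
installing thread is still filling the buffer.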
>>
>>>
>>> Map the dma-buf to the vmalloc area and initiate file reads in kernel
>>> space, supporting both buffer I/O and direct I/O.
>>>
>>> This patch adds the DMA_HEAP_ALLOC_AND_READ_FILE heap_flag for users.
>>> When a user needs to allocate a dma-buf and read a file, they should
>>> pass this heap flag. As the size of the file being read is fixed,
>>> there is no need to pass the 'len' parameter. Instead, the file_fd
>>> needs to be passed to tell the kernel which file to read.
>>>
>>> The file open flag determines the mode of file reading.
>>> Please note that if direct I/O (O_DIRECT) is used to read the file,
>>> the file size must be page aligned. (With patches 2-5 this is not
>>> needed.)
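The page-alignment condition quoted above can be checked from userspace
before choosing O_DIRECT. A minimal sketch; the helper names are
hypothetical and only the alignment arithmetic is shown:

```c
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* True when size is a multiple of align (align must be non-zero). */
bool size_is_aligned(uint64_t size, uint64_t align)
{
	return align != 0 && size % align == 0;
}

/* Rounds size up to the next multiple of align (align must be non-zero). */
uint64_t round_up_to(uint64_t size, uint64_t align)
{
	return (size + align - 1) / align * align;
}

/* True when size is a multiple of the running system's page size. */
bool file_size_is_page_aligned(uint64_t size)
{
	long page = sysconf(_SC_PAGESIZE);

	return page > 0 && size_is_aligned(size, (uint64_t)page);
}
```

A caller would typically fstat() the file and fall back to buffered I/O
when file_size_is_page_aligned(st.st_size) is false.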
>>>
>>> Therefore, for the user, len and file_fd are mutually exclusive,
>>> and they are combined using a union.
>>>
>>> Once the user obtains the dma-buf fd, the dma-buf directly contains the
>>> file content.
>>
>> And I'm repeating myself, but this is a complete NAK from my side to
>> this approach.
>>
>> We pointed out multiple ways of how to implement this cleanly and not
>> by hacking functionality into the kernel which absolutely doesn't
>> belong there.
> In this patchset, I have provided performance comparisons of each of
> these methods. Can you please provide more opinions?
Either drop the whole approach or change udmabuf to do what you want to do.
Apart from that I don't see a doable way which can be accepted into the
kernel.
Regards,
Christian.
>>
>> Regards,
>> Christian.
>>
>>>
>>> Patch 1 implements it.
>>>
>>> Patches 2-5 provide an approach for performance improvement.
>>>
>>> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
>>> synchronously read files using direct I/O.
>>>
>>> This approach helps to save CPU copying and avoid a certain degree of
>>> memory thrashing (page cache generation and reclamation)
>>>
>>> When dealing with large file sizes, the benefits of this approach
>>> become particularly significant.
>>>
>>> However, there are currently some methods that can improve performance,
>>> not just save system resources:
>>>
>>> Because the file is large (for example, a 7B AI model of around
>>> 3.4GB), allocating the DMA-BUF memory takes a relatively long time.
>>> Waiting for the allocation to complete before reading the file adds to
>>> the overall time. The total time for DMA-BUF allocation and file read
>>> is therefore
>>> T(total) = T(alloc) + T(I/O)
>>>
>>> However, we don't necessarily need to wait for the DMA-BUF allocation
>>> to complete before initiating I/O. During the allocation process we
>>> already hold a portion of the pages, so waiting for the remaining page
>>> allocations to complete before starting file reads leaves the pages
>>> that have already been allocated idle.
>>>
>>> The allocation of pages is sequential, and the reading of the file is
>>> also sequential, with the content and size corresponding to the file.
>>> This means that the memory location for each page, which holds the
>>> content of a specific position in the file, can be determined at the
>>> time of allocation.
>>>
>>> However, to fully leverage I/O performance, it is best to wait and
>>> gather a certain number of pages before initiating batch processing.
>>>
>>> The default gather size is 128MB. Each gathered batch is treated as
>>> one file-read work item: the gathered pages are mapped into the vmalloc
>>> area to obtain a contiguous virtual address, which is used as the
>>> buffer for the corresponding part of the file. So, when direct I/O is
>>> used, the file content is written directly to the dma-buf memory
>>> without any additional copying (compared to the pipe buffer).
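The gather scheme described above splits the file into fixed-size batches,
each covering one contiguous file range. A minimal sketch of that
arithmetic only, assuming the 128MB default from the cover letter; the
struct and function names are hypothetical and do not come from the
patches:

```c
#include <stdint.h>

/* 128MB, the default gather size named in the cover letter. */
#define GATHER_SIZE_DEFAULT (128ULL << 20)

/* File range handled by one gathered batch (hypothetical type). */
struct gather_batch {
	uint64_t file_offset;
	uint64_t len;
};

/* Number of batches needed to cover file_size bytes. */
uint64_t gather_batch_count(uint64_t file_size, uint64_t gather_size)
{
	if (gather_size == 0)
		return 0;
	return (file_size + gather_size - 1) / gather_size;
}

/* Range covered by batch `index`; len is 0 when the index is past EOF. */
struct gather_batch gather_batch_at(uint64_t file_size, uint64_t gather_size,
				    uint64_t index)
{
	struct gather_batch b = { index * gather_size, 0 };

	if (gather_size != 0 && b.file_offset < file_size) {
		uint64_t remain = file_size - b.file_offset;

		/* The last batch may be shorter than gather_size. */
		b.len = remain < gather_size ? remain : gather_size;
	}
	return b;
}
```

With these definitions a 3GB file is covered by 24 batches of 128MB, and
only the final batch of a non-multiple file size is short.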
>>>
>>> Consider other ways to read into a dma-buf. If we read after mmap of
>>> the dma-buf, we need to map the pages of the dma-buf into the user
>>> virtual address space. The udmabuf memfd approach needs to do this too.
>>> Even if we support sendfile, the file copy also needs a buffer that
>>> must be set up.
>>> So, mapping pages to the vmalloc area does not incur any additional
>>> performance overhead compared to other methods.[6]
>>>
>>> The administrator can also modify the gather size through patch 5.
>>>
>>> The formula for the time taken for system_heap buffer allocation and
>>> file reading through async_read is as follows:
>>>
>>> T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
>>>
>>> Compared to the synchronous read:
>>> T(total) = T(alloc) + T(I/O)
>>>
>>> If either the allocation time or the I/O time is long, only the larger
>>> of the two contributes to the total; the shorter one is hidden behind
>>> it.
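The two formulas quoted above can be compared directly. A minimal sketch
with hypothetical function names; any inputs fed to it are illustrative
and not measurements from this thread:

```c
#include <stdint.h>

/* Synchronous read: T(total) = T(alloc) + T(I/O) */
uint64_t total_sync_ns(uint64_t alloc_ns, uint64_t io_ns)
{
	return alloc_ns + io_ns;
}

/*
 * Async read:
 * T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
 */
uint64_t total_async_ns(uint64_t first_gather_ns, uint64_t remain_alloc_ns,
			uint64_t io_ns)
{
	uint64_t overlapped = remain_alloc_ns > io_ns ? remain_alloc_ns : io_ns;

	return first_gather_ns + overlapped;
}
```

For example, with 200 units of allocation (10 of them for the first
gather) and 1000 units of I/O, the synchronous total is 1200 and the
async total is 1010: the remaining 190 units of allocation are hidden
behind the I/O.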
>>>
>>> Therefore, the larger the size of the file that needs to be read, the
>>> greater the corresponding benefits will be.
>>>
>>> How to use
>>> ===
>>> Consider the current pathway for loading model files into DMA-BUF:
>>> 1. open dma-heap, get heap fd
>>> 2. open file, get file_fd(can't use O_DIRECT)
>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>> 4. mmap dma-buf fd, get vaddr
>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>> 6. share, attach, whatever you want
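The current pathway listed above can be sketched in userspace C with the
existing dma-heap uAPI (DMA_HEAP_IOCTL_ALLOC and struct
dma_heap_allocation_data from <linux/dma-heap.h>). This is a minimal
sketch under assumptions: load_file_into_dmabuf() is a hypothetical helper
name, the heap path (for example /dev/dma_heap/system) is supplied by the
caller, and the DMA_BUF_IOCTL_SYNC bracketing of CPU access is omitted for
brevity:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/dma-heap.h>

/* Returns a dma-buf fd holding the file content, or -1 on error. */
int load_file_into_dmabuf(const char *heap_path, const char *file_path)
{
	struct dma_heap_allocation_data alloc;
	struct stat st;
	void *vaddr;
	size_t size, done = 0;
	int heap_fd, file_fd, ret = -1;

	heap_fd = open(heap_path, O_RDONLY);		/* step 1 */
	if (heap_fd < 0)
		return -1;
	file_fd = open(file_path, O_RDONLY);		/* step 2: buffered only */
	if (file_fd < 0)
		goto out_heap;
	if (fstat(file_fd, &st) < 0 || st.st_size <= 0)
		goto out_file;
	size = (size_t)st.st_size;

	memset(&alloc, 0, sizeof(alloc));		/* step 3 */
	alloc.len = size;
	alloc.fd_flags = O_RDWR;
	if (ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc) < 0)
		goto out_file;

	vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
		     (int)alloc.fd, 0);			/* step 4 */
	if (vaddr == MAP_FAILED)
		goto out_buf;

	while (done < size) {				/* step 5 */
		ssize_t n = read(file_fd, (char *)vaddr + done, size - done);

		if (n <= 0)
			break;
		done += (size_t)n;
	}
	munmap(vaddr, size);
	if (done == size)
		ret = (int)alloc.fd;	/* step 6 is left to the caller */
out_buf:
	if (ret < 0)
		close((int)alloc.fd);
out_file:
	close(file_fd);
out_heap:
	close(heap_fd);
	return ret;
}
```

Every byte read this way passes through the page cache first, which is the
extra copy the cover letter is concerned with.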
>>>
>>> Using DMA_HEAP_ALLOC_AND_READ_FILE needs just a small change:
>>> 1. open dma-heap, get heap fd
>>> 2. open file, get file_fd(buffer/direct)
>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
>>>    set file_fd instead of len. get dma-buf fd(contains file content)
>>> 4. share, attach, whatever you want
>>>
>>> So, it is easy to test.
>>>
>>> How to test
>>> ===
>>> The performance comparison is conducted for the following scenarios:
>>> 1. normal
>>> 2. udmabuf with [3] patch
>>> 3. sendfile
>>> 4. only patch 1
>>> 5. patch1 - patch4.
>>>
>>> normal:
>>> 1. open dma-heap, get heap fd
>>> 2. open file, get file_fd(can't use O_DIRECT)
>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>> 4. mmap dma-buf fd, get vaddr
>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>> 6. share, attach, whatever you want
>>>
>>> UDMA-BUF step:
>>> 1. memfd_create
>>> 2. open file(buffer/direct)
>>> 3. udmabuf create
>>> 4. mmap memfd
>>> 5. read file into memfd vaddr
>>>
>>> Sendfile step (needs splice_write/write_iter support, used only for
>>> comparison):
>>> 1. open dma-heap, get heap fd
>>> 2. open file, get file_fd(buffer/direct)
>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>> 4. sendfile file_fd to dma-buf fd
>>> 5. share, attach, whatever you want
>>>
>>> patch1/patch1-4:
>>> 1. open dma-heap, get heap fd
>>> 2. open file, get file_fd(buffer/direct)
>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
>>>    set file_fd instead of len. get dma-buf fd(contains file content)
>>> 4. share, attach, whatever you want
>>>
>>> You can create a file to test it and compare the performance gap
>>> between the approaches.
>>> It is best to compare file sizes from KB to MB to GB.
>>>
>>> The following test data compares the performance at 512KB, 8MB, 1GB,
>>> and 3GB under various scenarios.
>>>
>>> Performance Test
>>> ===
>>> 12G RAM phone
>>> UFS4.0 (the maximum speed is 4GB/s),
>>> f2fs
>>> kernel 6.1 with patch[7] (otherwise kvec direct I/O reads are not
>>> supported)
>>> no memory pressure.
>>> drop_cache is used for each test.
>>>
>>> The average of 5 test results:
>>> | scheme-size         | 512KB(ns)  | 8MB(ns)    | 1GB(ns)       | 3GB(ns)       |
>>> | ------------------- | ---------- | ---------- | ------------- | ------------- |
>>> | normal              | 2,790,861  | 14,535,784 | 1,520,790,492 | 3,332,438,754 |
>>> | udmabuf buffer I/O  | 1,704,046  | 11,313,476 | 821,348,000   | 2,108,419,923 |
>>> | sendfile buffer I/O | 3,261,261  | 12,112,292 | 1,565,939,938 | 3,062,052,984 |
>>> | patch1-4 buffer I/O | 2,064,538  | 10,771,474 | 986,338,800   | 2,187,570,861 |
>>> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 9,777,661,077 |
>>> | patch1 direct I/O   | 813,215    | 6,962,092  | 2,364,211,877 | 5,648,897,554 |
>>> | udmabuf direct I/O  | 1,289,554  | 8,968,138  | 921,480,784   | 2,158,305,738 |
>>> | patch1-4 direct I/O | 1,957,661  | 6,581,999  | 520,003,538   | 1,400,006,107 |
>
> This test shows that sendfile, being based on the pipe buffer, does not
> help much.
>
> udmabuf is good, but I think our OEM driver can't work with it. (And AOSP
> does not enable this feature.)
>
>
> Anyway, I am sending this patchset in the hope of further discussion.
>
> Thanks.
>
>>>
>>> So, based on the test results:
>>>
>>> When the file is large, the patchset has the highest performance.
>>> Compared to normal, the patchset is a 50% improvement;
>>> compared to normal, patch 1 alone shows a degradation of 41%.
>>> A typical performance breakdown for patch 1 is as follows:
>>> 1. alloc cost 188,802,693 ns
>>> 2. vmap cost 42,491,385 ns
>>> 3. file read cost 4,180,876,702 ns
>>> Therefore, directly performing a single direct I/O read on a large file
>>> may not be the best way for performance.
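The percentages above can be related to the table by converting elapsed
times into a throughput change. A minimal sketch with hypothetical helper
names; it only restates arithmetic on the numbers already in the table:

```c
#include <stdint.h>

/* Throughput in bytes per second for `bytes` moved in `ns` nanoseconds. */
double throughput_bps(uint64_t bytes, uint64_t ns)
{
	return ns ? (double)bytes * 1e9 / (double)ns : 0.0;
}

/*
 * Throughput change of a scheme relative to a baseline for the same
 * amount of data, in percent. Negative means the scheme is slower.
 */
double throughput_change_pct(uint64_t baseline_ns, uint64_t scheme_ns)
{
	if (scheme_ns == 0)
		return 0.0;
	return ((double)baseline_ns / (double)scheme_ns - 1.0) * 100.0;
}
```

Fed with the 3GB column, throughput_change_pct(3332438754, 5648897554)
comes out at roughly -41, which agrees with the 41% degradation quoted for
patch 1 when that figure is read as a throughput change.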
>>>
>>> The performance of direct I/O implemented by the sendfile method is
>>> the worst.
>>>
>>> When the file size is small, the difference in performance is not
>>> significant. This is consistent with expectations.
>>>
>>>
>>>
>>> Suggested use cases
>>> ===
>>> 1. When large files need to be read and system resources are scarce,
>>>    especially when memory is limited (GB level). In this scenario,
>>>    using direct I/O for file reading can even bring performance
>>>    improvements. (may need patch2-3)
>>> 2. For embedded devices with limited RAM, using direct I/O can save
>>>    system resources and avoid unnecessary data copying. Therefore, even
>>>    if the performance is lower when reading small files, it can still
>>>    be used effectively.
>>> 3. If there is sufficient memory, pinning the page cache of the model
>>>    files in memory and placing the files in an EROFS file system for
>>>    read-only access may be better. (EROFS does not support direct I/O)
>>>
>>>
>>> Changelog
>>> ===
>>> v1 [8]
>>> v1->v2:
>>> Use the heap flag method for alloc and read instead of adding a new
>>> DMA-buf ioctl command. [9]
>>> Split the patchset to facilitate review and test.
>>> patch 1 implements alloc and read, and adds the heap flag for it.
>>> patches 2-4 add async read
>>> patch 5 makes the gather limit configurable.
>>>
>>> Reference
>>> ===
>>> [1] https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
>>> [2] https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
>>> [3] https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>>> [4] https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
>>> [5] https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
>>> [6] https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
>>> [7] https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230…
>>> [8] https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
>>> [9] https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
>>>
>>> Huan Yang (5):
>>>   dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>   dma-buf: heaps: Introduce async alloc read ops
>>>   dma-buf: heaps: support alloc async read file
>>>   dma-buf: heaps: system_heap alloc support async read
>>>   dma-buf: heaps: configurable async read gather limit
>>>
>>>  drivers/dma-buf/dma-heap.c          | 552 +++++++++++++++++++++++++++-
>>>  drivers/dma-buf/heaps/system_heap.c |  70 +++-
>>>  include/linux/dma-heap.h            |  53 ++-
>>>  include/uapi/linux/dma-heap.h       |  11 +-
>>>  4 files changed, 673 insertions(+), 13 deletions(-)
>>>
>>>
>>> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
>>
On Mon, Jul 29, 2024 at 10:46:04AM +0800, Zenghui Yu wrote:
> Even if a vgem device is configured in, we will skip the import_vgem_fd()
> test almost every time.
>
> TAP version 13
> 1..11
> # Testing heap: system
> # =======================================
> # Testing allocation and importing:
> ok 1 # SKIP Could not open vgem -1
>
> The problem is that we use the DRM_IOCTL_VERSION ioctl to query the driver
> version information but leave the name field a non-null-terminated string.
> Terminate it properly to actually test against the vgem device.
>
> While at it, let's check the length of the driver name is exactly 4 bytes
> and return early otherwise (in case there is a name like "vgemfoo" that
> gets converted to "vgem\0" unexpectedly).
>
> Signed-off-by: Zenghui Yu <yuzenghui(a)huawei.com>
> ---
> * From v1 [1]:
> - Check version.name_len is exactly 4 bytes and return early otherwise
>
> [1] https://lore.kernel.org/r/20240708134654.1725-1-yuzenghui@huawei.com
Thanks for your patch, I'll push it to drm-misc-next-fixes.
> P.S., Maybe worth including the kselftests file into "DMA-BUF HEAPS
> FRAMEWORK" MAINTAINERS entry?
Good idea, want to do the patch for that too?
Cheers, Sima
>
> tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c b/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
> index 5f541522364f..5d0a809dc2df 100644
> --- a/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
> +++ b/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
> @@ -29,9 +29,11 @@ static int check_vgem(int fd)
> version.name = name;
>
> ret = ioctl(fd, DRM_IOCTL_VERSION, &version);
> - if (ret)
> + if (ret || version.name_len != 4)
> return 0;
>
> + name[4] = '\0';
> +
> return !strcmp(name, "vgem");
> }
>
> --
> 2.33.0
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
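
For readers following the fix above: the patch terminates the buffer and keeps strcmp(). An alternative that never depends on termination is a length check followed by memcmp(). The sketch below is only an illustration of that alternative, not what the selftest does; name_is_vgem() and its parameters are hypothetical names, with "buf" and "len" standing in for version.name and version.name_len.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: "buf" holds the driver name as copied out by the
 * kernel and "len" is the reported name length. The buffer is not assumed
 * to be NUL-terminated.
 */
static int name_is_vgem(const char *buf, size_t len)
{
	assert(buf != NULL);

	/* Reject names such as "vgemfoo" before looking at the bytes. */
	if (len != 4)
		return 0;

	/* memcmp() compares exactly four bytes and never reads past them. */
	return memcmp(buf, "vgem", 4) == 0;
}
```

With this shape the comparison is bounded by the reported length, so a missing terminator cannot turn a match into a mismatch.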
On Mon, 22 Jul 2024 08:53:32 +0200, Alexandre Mergnat wrote:
> Add the audio codec sub-device. This sub-device is used to set the
> optional voltage values according to the hardware.
> The properties are:
> - Setup of microphone bias voltage.
> - Setup of the speaker pin pull-down.
>
> Also, add the audio power supply property which is dedicated for
> the audio codec sub-device.
>
> [...]
Applied, thanks!
[03/16] dt-bindings: mfd: mediatek: Add codec property for MT6357 PMIC
commit: 3821149eb101fe2d45a4697659e60930828400d8
--
Lee Jones [李琼斯]
On Fri, 14 Jun 2024 09:27:46 +0200, Alexandre Mergnat wrote:
> Add the audio codec sub-device. This sub-device is used to set the
> optional voltage values according to the hardware.
> The properties are:
> - Setup of microphone bias voltage.
> - Setup of the speaker pin pull-down.
>
> Also, add the audio power supply property which is dedicated for
> the audio codec sub-device.
>
> [...]
Applied, thanks!
[03/16] dt-bindings: mfd: mediatek: Add codec property for MT6357 PMIC
commit: 3821149eb101fe2d45a4697659e60930828400d8
--
Lee Jones [李琼斯]
On Thu, 25 Jul 2024 at 07:15, Amirreza Zarrabi
<quic_azarrabi(a)quicinc.com> wrote:
>
>
>
> On 7/25/2024 2:09 PM, Dmitry Baryshkov wrote:
> > On Thu, Jul 25, 2024 at 01:19:07PM GMT, Amirreza Zarrabi wrote:
> >>
> >>
> >> On 7/4/2024 5:34 PM, Dmitry Baryshkov wrote:
> >>> On Thu, 4 Jul 2024 at 00:40, Amirreza Zarrabi <quic_azarrabi(a)quicinc.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 7/3/2024 10:13 PM, Dmitry Baryshkov wrote:
> >>>>> On Tue, Jul 02, 2024 at 10:57:36PM GMT, Amirreza Zarrabi wrote:
> >>>>>> Qualcomm TEE hosts Trusted Applications and Services that run in the
> >>>>>> secure world. Access to these resources is provided using object
> >>>>>> capabilities. A TEE client with access to the capability can invoke
> >>>>>> the object and request a service. Similarly, TEE can request a service
> >>>>>> from nonsecure world with object capabilities that are exported to secure
> >>>>>> world.
> >>>>>>
> >>>>>> We provide qcom_tee_object which represents an object in both secure
> >>>>>> and nonsecure world. TEE clients can invoke an instance of qcom_tee_object
> >>>>>> to access TEE. TEE can issue a callback request to nonsecure world
> >>>>>> by invoking an instance of qcom_tee_object in nonsecure world.
> >>>>>
> >>>>> Please see Documentation/process/submitting-patches.rst on how to write
> >>>>> commit messages.
> >>>>
> >>>> Ack.
> >>>>
> >>>>>
> >>>>>>
> >>>>>> Any driver in nonsecure world that wants to export a struct (or a
> >>>>>> service object) to TEE must embed an instance of qcom_tee_object in
> >>>>>> the relevant struct and implement the dispatcher function, which is
> >>>>>> called when TEE invokes the service object.
> >>>>>>
> >>>>>> We also provide a simplified API which implements the Qualcomm TEE transport
> >>>>>> protocol. The implementation is independent from any services that may
> >>>>>> reside in nonsecure world.
> >>>>>
> >>>>> "also" usually means that it should go to a separate commit.
> >>>>
> >>>> I will split this patch to multiple smaller ones.
> >>>>
> >>>
> >>> [...]
> >>>
> >>>>>
> >>>>>> + } in, out;
> >>>>>> +};
> >>>>>> +
> >>>>>> +int qcom_tee_object_do_invoke(struct qcom_tee_object_invoke_ctx *oic,
> >>>>>> + struct qcom_tee_object *object, unsigned long op, struct qcom_tee_arg u[], int *result);
> >>>>>
> >>>>> What's the difference between a result that gets returned by the
> >>>>> function and the result that gets returned via the pointer?
> >>>>
> >>>> The function result is local to the kernel, for instance a memory allocation
> >>>> failure or a failure to issue the SMC call. The result in the pointer is the
> >>>> remote result, for instance the return value from the TA or from the TEE itself.
> >>>>
> >>>> I'll use a better name, e.g. 'remote_result'?
> >>>
> >>> See how this is handled by other parties. For example, PSCI. If you
> >>> have a standard set of return codes, translate them to -ESOMETHING in
> >>> your framework and let everybody else see only the standard errors.
> >>>
> >>>
> >>
> >> I cannot hide this return value; it is TA dependent. The client of a TA
> >> needs to see it. Just knowing that something has failed is not enough in
> >> case the client needs to act on it. I cannot even translate these values,
> >> as they are TA specific, so the range is unknown.
> >
> > I'd say it's a sad design. At least error values should be standard.
> >
>
> Sure. But it is normal. If we finally move to TEE subsystem, this is the value that
> would be copied to struct tee_ioctl_invoke_arg.ret to pass to the caller of
> TEE_IOC_INVOKE.
Ack
--
With best wishes
Dmitry
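
The split discussed above (a kernel-local return value plus a remote result returned through a pointer) can be sketched in plain C. This is only a sketch under stated assumptions: do_invoke(), transport_fn and the two fake transports are hypothetical names made up for illustration, and none of this is the qcom_tee_object_do_invoke() implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical transport hook: returns 0 if the request reached the TEE. */
typedef int (*transport_fn)(unsigned long op, int *remote_status);

/*
 * Return value: local status (bad arguments, transport failure).
 * *remote_result: whatever the remote side answered, passed through
 * untranslated because its range is defined by the TA.
 */
static int do_invoke(transport_fn send, unsigned long op, int *remote_result)
{
	int ret;

	if (!send || !remote_result)
		return -EINVAL;	/* local failure: nothing was sent */

	*remote_result = 0;
	ret = send(op, remote_result);
	if (ret)
		return ret;	/* local failure: transport error */

	return 0;		/* delivered; *remote_result holds the remote answer */
}

/* Fake transport: the call is delivered, the TA answers with its own code. */
static int fake_ta_rejects(unsigned long op, int *remote_status)
{
	(void)op;
	*remote_status = 42;	/* TA-specific value, opaque to the caller's framework */
	return 0;
}

/* Fake transport: the call never reaches the TEE. */
static int fake_transport_down(unsigned long op, int *remote_status)
{
	(void)op;
	(void)remote_status;
	return -EIO;
}
```

A caller checks the return value first and only looks at the remote result when the return value is 0.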
On Thu, Jul 25, 2024 at 01:19:07PM GMT, Amirreza Zarrabi wrote:
>
>
> On 7/4/2024 5:34 PM, Dmitry Baryshkov wrote:
> > On Thu, 4 Jul 2024 at 00:40, Amirreza Zarrabi <quic_azarrabi(a)quicinc.com> wrote:
> >>
> >>
> >>
> >> On 7/3/2024 10:13 PM, Dmitry Baryshkov wrote:
> >>> On Tue, Jul 02, 2024 at 10:57:36PM GMT, Amirreza Zarrabi wrote:
> >>>> Qualcomm TEE hosts Trusted Applications and Services that run in the
> >>>> secure world. Access to these resources is provided using object
> >>>> capabilities. A TEE client with access to the capability can invoke
> >>>> the object and request a service. Similarly, TEE can request a service
> >>>> from nonsecure world with object capabilities that are exported to secure
> >>>> world.
> >>>>
> >>>> We provide qcom_tee_object which represents an object in both secure
> >>>> and nonsecure world. TEE clients can invoke an instance of qcom_tee_object
> >>>> to access TEE. TEE can issue a callback request to nonsecure world
> >>>> by invoking an instance of qcom_tee_object in nonsecure world.
> >>>
> >>> Please see Documentation/process/submitting-patches.rst on how to write
> >>> commit messages.
> >>
> >> Ack.
> >>
> >>>
> >>>>
> >>>> Any driver in nonsecure world that wants to export a struct (or a
> >>>> service object) to TEE must embed an instance of qcom_tee_object in
> >>>> the relevant struct and implement the dispatcher function, which is
> >>>> called when TEE invokes the service object.
> >>>>
> >>>> We also provide a simplified API which implements the Qualcomm TEE transport
> >>>> protocol. The implementation is independent from any services that may
> >>>> reside in nonsecure world.
> >>>
> >>> "also" usually means that it should go to a separate commit.
> >>
> >> I will split this patch to multiple smaller ones.
> >>
> >
> > [...]
> >
> >>>
> >>>> + } in, out;
> >>>> +};
> >>>> +
> >>>> +int qcom_tee_object_do_invoke(struct qcom_tee_object_invoke_ctx *oic,
> >>>> + struct qcom_tee_object *object, unsigned long op, struct qcom_tee_arg u[], int *result);
> >>>
> >>> What's the difference between a result that gets returned by the
> >>> function and the result that gets returned via the pointer?
> >>
> >> The function result is local to the kernel, for instance a memory allocation
> >> failure or a failure to issue the SMC call. The result in the pointer is the
> >> remote result, for instance the return value from the TA or from the TEE itself.
> >>
> >> I'll use a better name, e.g. 'remote_result'?
> >
> > See how this is handled by other parties. For example, PSCI. If you
> > have a standard set of return codes, translate them to -ESOMETHING in
> > your framework and let everybody else see only the standard errors.
> >
> >
>
> I cannot hide this return value; it is TA dependent. The client of a TA
> needs to see it. Just knowing that something has failed is not enough in
> case the client needs to act on it. I cannot even translate these values,
> as they are TA specific, so the range is unknown.
I'd say it's a sad design. At least error values should be standard.
--
With best wishes
Dmitry
On 22/07/2024 08:53, Alexandre Mergnat wrote:
> Add the audio codec sub-device. This sub-device is used to set the
> optional voltage values according to the hardware.
> The properties are:
> - Setup of microphone bias voltage.
> - Setup of the speaker pin pull-down.
>
> Also, add the audio power supply property which is dedicated for
> the audio codec sub-device.
>
> Signed-off-by: Alexandre Mergnat <amergnat(a)baylibre.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski(a)linaro.org>
Best regards,
Krzysztof
On 03/07/2024 07:57, Amirreza Zarrabi wrote:
> Qualcomm TEE hosts Trusted Applications and Services that run in the
> secure world. Access to these resources is provided using object
> capabilities. A TEE client with access to the capability can invoke
> the object and request a service. Similarly, TEE can request a service
> from nonsecure world with object capabilities that are exported to secure
> world.
>
> We provide qcom_tee_object which represents an object in both secure
> and nonsecure world. TEE clients can invoke an instance of qcom_tee_object
> to access TEE. TEE can issue a callback request to nonsecure world
> by invoking an instance of qcom_tee_object in nonsecure world.
>
> Any driver in nonsecure world that wants to export a struct (or a
> service object) to TEE must embed an instance of qcom_tee_object in
> the relevant struct and implement the dispatcher function, which is
> called when TEE invokes the service object.
>
> We also provide a simplified API which implements the Qualcomm TEE transport
> protocol. The implementation is independent from any services that may
> reside in nonsecure world.
>
> Signed-off-by: Amirreza Zarrabi <quic_azarrabi(a)quicinc.com>
> ---
> drivers/firmware/qcom/Kconfig | 14 +
> drivers/firmware/qcom/Makefile | 2 +
> drivers/firmware/qcom/qcom_object_invoke/Makefile | 4 +
> drivers/firmware/qcom/qcom_object_invoke/async.c | 142 +++
> drivers/firmware/qcom/qcom_object_invoke/core.c | 1139 ++++++++++++++++++++
> drivers/firmware/qcom/qcom_object_invoke/core.h | 186 ++++
> .../qcom/qcom_object_invoke/qcom_scm_invoke.c | 22 +
> .../firmware/qcom/qcom_object_invoke/release_wq.c | 90 ++
> include/linux/firmware/qcom/qcom_object_invoke.h | 233 ++++
> 9 files changed, 1832 insertions(+)
>
> diff --git a/drivers/firmware/qcom/Kconfig b/drivers/firmware/qcom/Kconfig
> index 7f6eb4174734..103ab82bae9f 100644
> --- a/drivers/firmware/qcom/Kconfig
> +++ b/drivers/firmware/qcom/Kconfig
> @@ -84,4 +84,18 @@ config QCOM_QSEECOM_UEFISECAPP
> Select Y here to provide access to EFI variables on the aforementioned
> platforms.
>
> +config QCOM_OBJECT_INVOKE_CORE
Let's avoid another rant from Linus and add here either proper defaults
or dependencies.
> + bool "Secure TEE Communication Support"
> + help
> + Various Qualcomm SoCs have a Trusted Execution Environment (TEE) running
> + in the Trust Zone. This module provides an interface to that via the
> + capability based object invocation, using SMC calls.
> +
> + OBJECT_INVOKE_CORE allows capability based secure communication between
> + TEE and VMs. Using OBJECT_INVOKE_CORE, kernel can issue calls to TEE or
> + TAs to request a service or exposes services to TEE and TAs. It implements
> + the necessary marshaling of messages with TEE.
> +
> + Select Y here to provide access to TEE.
> +
> endmenu
> diff --git a/drivers/firmware/qcom/Makefile b/drivers/firmware/qc
...
> + } else {
> + /* TEE obtained the ownership of QCOM_TEE_OBJECT_TYPE_CB_OBJECT
> + * input objects in 'u'. On further failure, TEE is responsible
> + * to release them.
> + */
> +
> + oic->flags |= OIC_FLAG_QCOM_TEE;
> + }
> +
> + /* Is it a callback request?! */
> + if (response_type != QCOM_TEE_RESULT_INBOUND_REQ_NEEDED) {
> + if (!*result) {
> + ret = update_args(u, oic);
> + if (ret) {
> + arg_for_each_output_object(i, u)
> + put_qcom_tee_object(u[i].o);
> + }
> + }
> +
> + break;
> +
> + } else {
> + oic->flags |= OIC_FLAG_BUSY;
> +
> + /* Before dispatching the request, handle any pending async requests. */
> + __fetch__async_reqs(oic);
> +
> + qcom_tee_object_invoke(oic, cb_msg);
> + }
> + }
> +
> + __fetch__async_reqs(oic);
> +
> +out:
> + qcom_tee_object_invoke_ctx_uninit(oic);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(qcom_tee_object_do_invoke);
> +
> +/* Primordial Object. */
> +/* It is invoked by TEE for kernel services. */
> +
> +static struct qcom_tee_object *primordial_object = NULL_QCOM_TEE_OBJECT;
> +static DEFINE_MUTEX(primordial_object_lock);
Oh my... except that it looks like undocumented ABI, please avoid
file-scope variables.
Best regards,
Krzysztof
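
The "embed an object in your struct and implement a dispatcher" pattern described in the commit message above is the usual container_of() idiom. A minimal userspace sketch follows, assuming a simplified object type: struct tee_object, struct counter_service, counter_dispatch() and invoke() are hypothetical stand-ins, and the real qcom_tee_object layout and dispatcher signature are whatever the patch defines.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for qcom_tee_object. */
struct tee_object {
	int (*dispatch)(struct tee_object *obj, unsigned long op);
};

/* A driver-private service that embeds the object it exports. */
struct counter_service {
	int count;
	struct tee_object object;
};

/* Dispatcher: recover the enclosing service from the embedded object. */
static int counter_dispatch(struct tee_object *obj, unsigned long op)
{
	struct counter_service *svc =
		container_of(obj, struct counter_service, object);

	svc->count += (int)op;
	return svc->count;
}

/* What the transport side would do when the remote end invokes the object. */
static int invoke(struct tee_object *obj, unsigned long op)
{
	assert(obj && obj->dispatch);
	return obj->dispatch(obj, op);
}
```

The transport only ever sees the embedded object; the service state is reachable solely through the dispatcher, which keeps per-service data out of file scope.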
Adding TEE mailing list and maintainers to the CC list.
Amirreza, please include them in future even if you are not going to use
the framework.
On Wed, Jul 10, 2024 at 09:16:48AM GMT, Amirreza Zarrabi wrote:
>
>
> On 7/3/2024 9:36 PM, Dmitry Baryshkov wrote:
> > On Tue, Jul 02, 2024 at 10:57:35PM GMT, Amirreza Zarrabi wrote:
> >> Qualcomm TEE hosts Trusted Applications (TAs) and services that run in
> >> the secure world. Access to these resources is provided using MinkIPC.
> >> MinkIPC is a capability-based synchronous message passing facility. It
> >> allows code executing in one domain to invoke objects running in other
> >> domains. When a process holds a reference to an object that lives in
> >> another domain, that object reference is a capability. Capabilities
> >> allow us to separate implementation of policies from implementation of
> >> the transport.
> >>
> >> As part of the upstreaming of the object invoke driver (called SMC-Invoke
> >> driver), we need to provide a reasonable kernel API and UAPI. The clear
> >> option is to use TEE subsystem and write a back-end driver, however the
> >> TEE subsystem doesn't fit with the design of Qualcomm TEE.
> >>
>
> To answer your "general comment", maybe a bit of background :).
>
> Traditionally, policy enforcement is based on access-control models,
> either (1) access-control list or (2) capability [0]. A capability is an
> opaque ("non-forge-able") object reference that grants the holder the
> right to perform certain operations on the object (e.g. Read, Write,
> Execute, or Grant). Capabilities are the preferred mechanism for representing
> a policy, due to their fine-grained representation of access rights, in line
> with
> (P1) the principle of least privilege [1], and
> (P2) the ability to avoid the confused deputy problem [2].
>
> [0] Jack B. Dennis and Earl C. Van Horn. 1966. Programming Semantics for
> Multiprogrammed Computations. Commun. ACM 9 (1966), 143–155.
>
> [1] Jerome H. Saltzer and Michael D. Schroeder. 1975. The Protection of
> Information in Computer Systems. Proc. IEEE 63 (1975), 1278–1308.
>
> [2] Norm Hardy. 1988. The Confused Deputy (or Why Capabilities Might Have
> Been Invented). ACM Operating Systems Review 22, 4 (1988), 36–38.
>
> For MinkIPC, an object represents a TEE or TA service. The reference to
> the object is the "handle" that is returned from TEE (let's call it
> TEE-Handle). The supported operations are "service invocation" (similar
> to Execute), and "sharing access to a service" (similar to Grant).
> Anyone with access to the TEE-Handle can invoke the service or pass the
> TEE-Handle to someone else to access the same service.
>
> The responsibility of the MinkIPC framework is to hide the TEE-Handle,
> so that the client can not forge it, and allow the owner of the handle
> to transfer it to other clients as it wishes. Using a file descriptor
> table we can achieve that. We wrap the TEE-Handle as a FD and let the
> client invoke FD (e.g. using IOCTL), or transfer the FD (e.g. using
> UNIX socket).
>
> As a side note, for the sake of completeness, capabilities are fundamentally
> a "discretionary mechanism", as the holder of the object reference has the
> ability to share it with others. A secure system requires "mandatory
> enforcement" (i.e. ability to revoke authority and ability to control
> the authority propagation). This is out of scope for the MinkIPC.
> MinkIPC is only interested in P1 and P2 (mentioned above).
>
>
> >> Does TEE subsystem fit requirements of a capability based system?
> >> -----------------------------------------------------------------
> >> In TEE subsystem, to invoke a function:
> >> - client should open a device file "/dev/teeX",
> >> - create a session with a TA, and
> >> - invoke the functions in that session.
> >>
> >> 1. The privilege to invoke a function is determined by a session. If a
> >> client has a session, it cannot share it with other clients. Even if
> >> it does, it is not fine-grained enough, i.e. either all accessible
> >> functions/resources in a session or none. Assume a scenario where a client
> >> wants to grant another client permission to invoke just one function that
> >> it has rights to.
> >>
> >> The "all or nothing" for sharing sessions is not in line with our
> >> capability system: "if you own a capability, you should be able to grant
> >> or share it".
> >
> > Can you please be more specific here? What kind of sharing is expected
> > on the user side of it?
>
> In MinkIPC, after authenticating a client credential, a TA (or TEE) may
> return multiple TEE-Handles, each representing a service that the client
> has privilege to access. The client should be able to "individually"
> reference each TEE-Handle, e.g. to invoke and share it (as per capability-
> based system requirements).
>
> If we use TEE subsystem, which has a session based design, all TEE-Handles
> are meaningful with respect to the session in which they are allocated,
> hence the use of "__u32 session" in "struct tee_ioctl_invoke_arg".
>
> Here, we have a contradiction with MinkIPC. We may ignore the session
> and say "even though a TEE-Handle is allocated in a session but it is also
> valid outside a session", i.e. the session-id in TEE uapi becomes redundant
> (a case of divergence from definition).
>
> >
> >> 2. In TEE subsystem, resources are managed in a context. Every time a
> >> client opens "/dev/teeX", a new context is created to keep track of
> >> the allocated resources, including opened sessions and remote objects. Any
> >> effort for sharing resources between two independent clients requires
> >> involvement of context manager, i.e. the back-end driver. This requires
> >> implementing some form of policy in the back-end driver.
> >
> > What kind of resource sharing?
>
> TEE subsystem "rightfully" allocates a context each time a client opens
> a device file. This context is passed to the backend driver to identify
> independent clients that opened the device file.
>
> The context is used by the backend driver to keep track of the resources. The
> types of resources are TEE driver dependent. As an example of a resource in TEE
> subsystem, you can look into 'shm' register and unregister (especially,
> see the comment in function 'shm_alloc_helper').
>
> For MinkIPC, all clients are treated the same and the TEE-Handles are
> representative of the resources, accessible "globally" if a client has the
> capability for them. In kernel, clients access an object if they have
> access to "qcom_tee_object", in userspace, clients access an object if
> they have the FD wrapper for the TEE-Handle.
>
> If we use contexts instead of the file descriptor table, any form of object
> transfer requires involvement of the backend driver. If we use the file
> descriptor table, contexts become useless for MinkIPC (i.e.
> 'ctx->data' will "always" be null).
>
> >
> >> 3. The TEE subsystem supports two types of memory sharing:
> >> - per-device memory pools, and
> >> - user defined memory references.
> >> User defined memory references are private to the application and cannot
> >> be shared. Memory allocated from per-device "shared" pools are accessible
> >> using a file descriptor. It can be mapped by any process if it has
> >> access to it. This means, we cannot provide the resource isolation
> >> between two clients. Assume a scenario when a client wants to allocate a
> >> memory (which is shared with TEE) from an "isolated" pool and share it
> >> with another client, without the right to access the contents of memory.
> >
> > This doesn't explain, why would it want to share such memory with
> > another client.
>
> Ok, I believe there is a misunderstanding here. I did not try to justify
> a specific use case. We want to separate the memory allocation from the
> framework. This way, how the memory is obtained, e.g. whether it is allocated
> from (1) an isolated pool, (2) a shared pool, (3) a secure heap,
> (4) a system dma-heap, (5) process address space, or (6) other memory
> with "different constraints", becomes independent.
>
> We introduced "memory object" type. User implements a kernel service
> using "qcom_tee_object" to represent the memory object. We have an
> implementation of memory objects based on dma-buf.
>
> >
> >> 4. The kernel API provided by TEE subsystem does not support a kernel
> >> supplicant. Adding support requires an execution context (e.g. a
> >> kernel thread) due to the TEE subsystem design. tee_driver_ops supports
> >> only "send" and "receive" callbacks and to deliver a request, someone
> >> should wait on "receive".
> >
> > There is nothing wrong here, but maybe I'm misunderstanding something.
>
> I agree. But, I am trying to re-emphasize how useful TEE subsystem is
> for MinkIPC. For kernel services, we solely rely on the backend driver.
> For instance, to expose RPMB service we will use "qcom_tee_object".
> So there is nothing provided by the framework to simplify the service
> development.
>
> >
> >> We need a callback to "dispatch" or "handle" a request in the context of
> >> the client thread. It should redirect a request to a kernel service or
> >> a user supplicant. In TEE subsystem such requirement should be implemented
> >> in TEE back-end driver, independent from the TEE subsystem.
> >>
> >> 5. The UAPI provided by TEE subsystem is similar to the GPTEE Client
> >> interface. This interface is not suitable for a capability system.
> >> For instance, there is no session in a capability system which means
> >> either it should not be used, or we should overload its definition.
> >
> > General comment: maybe adding more detailed explanation of how the
> > capabilities are acquired and how they can be used might make sense.
> >
> > BTW. It might be my imperfect English, but each time I see the word
> > 'capability' I'm thinking that someone is capable of doing something. I
> > find it hard to use 'capability' for the reference to another object.
> >
>
> Explained at the top :).
>
> >>
> >> Can we use TEE subsystem?
> >> -------------------------
> >> There are workarounds for some of the issues above. The question is if we
> >> should define our own UAPI or try to use a hack-y way of fitting into
> >> the TEE subsystem. I use the word hack-y, as most of the workarounds
> >> involve:
> >>
> >> - "diverging from the definition". For instance, ignoring the session
> >> open and close ioctl calls or use file descriptors for all remote
> >> resources (as fd is the closest thing to a capability), which undermines the
> >> isolation provided by the contexts,
> >>
> >> - "overloading the variables". For instance, passing object ID as file
> >> descriptors in a place of session ID, or
> >>
> >> - "bypassing TEE subsystem". For instance, relying extensively on meta
> >> parameters or pushing everything (e.g. kernel services) to the back-end
> >> driver, which means leaving almost all of the TEE subsystem unused.
> >>
> >> We cannot take the full benefits of TEE subsystem and may need to
> >> implement most of the requirements in the back-end driver. Also, as
> >> discussed above, the UAPI is not suitable for capability-based use cases.
> >> We proposed a new set of ioctl calls for SMC-Invoke driver.
> >>
> >> In this series we posted three patches. We implemented a transport
> >> driver that provides qcom_tee_object. Any object on secure side is
> >> represented with an instance of qcom_tee_object and any struct exposed
> >> to TEE should embed an instance of qcom_tee_object. Support for new
> >> services, e.g. memory objects, RPMB, userspace clients or supplicants, is
> >> implemented independently from the driver.
> >>
> >> We have a simple memory object and a user driver that uses
> >> qcom_tee_object.
> >
> > Could you please point out any user for the uAPI? I'd like to understand
> > how it looks from the userspace point of view.
>
> Sure :), I'll write up a test patch and send it in next series.
>
> Summary.
>
> TEE framework provides some nice facilities, including:
> - uapi and ioctl interface,
> - marshaling parameters and context management,
> - memory mapping and sharing, and
> - TEE bus and TA drivers.
>
> For MinkIPC, we will not use any of them. The only usable piece is the uapi
> interface, which is not suitable for MinkIPC, as discussed above.
>
> >
> >>
> >> Signed-off-by: Amirreza Zarrabi <quic_azarrabi(a)quicinc.com>
> >> ---
> >> Amirreza Zarrabi (3):
> >> firmware: qcom: implement object invoke support
> >> firmware: qcom: implement memory object support for TEE
> >> firmware: qcom: implement ioctl for TEE object invocation
> >>
> >> drivers/firmware/qcom/Kconfig | 36 +
> >> drivers/firmware/qcom/Makefile | 2 +
> >> drivers/firmware/qcom/qcom_object_invoke/Makefile | 12 +
> >> drivers/firmware/qcom/qcom_object_invoke/async.c | 142 +++
> >> drivers/firmware/qcom/qcom_object_invoke/core.c | 1139 ++++++++++++++++++
> >> drivers/firmware/qcom/qcom_object_invoke/core.h | 186 +++
> >> .../qcom/qcom_object_invoke/qcom_scm_invoke.c | 22 +
> >> .../firmware/qcom/qcom_object_invoke/release_wq.c | 90 ++
> >> .../qcom/qcom_object_invoke/xts/mem_object.c | 406 +++++++
> >> .../qcom_object_invoke/xts/object_invoke_uapi.c | 1231 ++++++++++++++++++++
> >> include/linux/firmware/qcom/qcom_object_invoke.h | 233 ++++
> >> include/uapi/misc/qcom_tee.h | 117 ++
> >> 12 files changed, 3616 insertions(+)
> >> ---
> >> base-commit: 74564adfd3521d9e322cfc345fdc132df80f3c79
> >> change-id: 20240702-qcom-tee-object-and-ioctls-6f52fde03485
> >>
> >> Best regards,
> >> --
> >> Amirreza Zarrabi <quic_azarrabi(a)quicinc.com>
> >>
> >
--
With best wishes
Dmitry
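
The "wrap the TEE-Handle as an FD and transfer it over a UNIX socket" idea from the background section quoted above relies on the standard SCM_RIGHTS mechanism. A minimal sketch of that mechanism follows, assuming an already connected AF_UNIX socket; send_fd() and recv_fd() are hypothetical helper names, and nothing here is specific to MinkIPC or to the proposed uapi.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized for one descriptor, aligned as a cmsghdr. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send "fd" over the connected AF_UNIX socket "sock". Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
	char byte = 0;	/* at least one byte of real data must accompany the fd */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from "sock". Returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file, which is why the holder of such an FD can grant access without the receiver ever seeing the underlying handle.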