Commit 04c7adb5871a ("dma-buf: system_heap: use larger contiguous mappings instead of per-page mmap") facilitates the use of PTE_CONT. The system_heap allocates pages of order 4 and 8, which meet the alignment requirements for PTE_CONT, so enable PTE_CONT for larger contiguous mappings.
After applying this patch, TLB misses are reduced by approximately 5% when opening the camera on Android systems.
Signed-off-by: gao xu gaoxu2@honor.com
---
 drivers/dma-buf/heaps/system_heap.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index 4c782fe33..103b06f89 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -202,12 +202,16 @@ static int system_heap_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma)
 		unsigned long n = (sg->length >> PAGE_SHIFT) - pgoff;
 		struct page *page = sg_page(sg) + pgoff;
 		unsigned long size = n << PAGE_SHIFT;
+		pgprot_t prot = vma->vm_page_prot;
 
 		if (addr + size > vma->vm_end)
 			size = vma->vm_end - addr;
 
+		if (((addr | size) & ~CONT_PTE_MASK) == 0)
+			prot = __pgprot(pgprot_val(prot) | PTE_CONT);
+
 		ret = remap_pfn_range(vma, addr, page_to_pfn(page),
-				      size, vma->vm_page_prot);
+				      size, prot);
 		if (ret)
 			return ret;
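For readers unfamiliar with the mask arithmetic in the hunk above, here is a minimal user-space sketch of what the check tests. All SKETCH_* names are hypothetical stand-ins, and the values assume arm64 with 4K base pages (16 PTEs per contiguous block, i.e. 64K); they are not taken from kernel headers. The last helper is an extra physical-alignment check that the patch as posted does not perform; it is shown only for comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for arm64 with 4K base pages; stand-ins, not kernel headers. */
#define SKETCH_PAGE_SHIFT    12
#define SKETCH_CONT_PTES     16u
#define SKETCH_CONT_PTE_SIZE ((uint64_t)SKETCH_CONT_PTES << SKETCH_PAGE_SHIFT)
#define SKETCH_CONT_PTE_MASK (~(SKETCH_CONT_PTE_SIZE - 1))

/* Bytes covered by a page allocation of the given order. */
static inline uint64_t sketch_order_bytes(unsigned int order)
{
	assert(order < 64 - SKETCH_PAGE_SHIFT); /* keep the shift in range */
	return (uint64_t)1 << (order + SKETCH_PAGE_SHIFT);
}

/* Same shape as the patch's test: addr and size are both multiples of 64K. */
static inline bool sketch_virt_range_ok(uint64_t addr, uint64_t size)
{
	return ((addr | size) & ~SKETCH_CONT_PTE_MASK) == 0;
}

/* Not in the patch: the first physical frame must also start a 64K block. */
static inline bool sketch_phys_start_ok(uint64_t pfn)
{
	return (pfn & (SKETCH_CONT_PTES - 1)) == 0;
}
```

Under these assumptions an order-4 allocation covers exactly one 64K block and an order-8 allocation covers 1 MiB, so both sizes are multiples of the contiguous block size.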
On Mon, Dec 8, 2025 at 5:41 PM gao xu gaoxu2@honor.com wrote:
commit 04c7adb5871a ("dma-buf: system_heap: use larger contiguous mappings instead of per-page mmap") facilitates the use of PTE_CONT. The system_heap allocates pages of order 4 and 8 that meet the alignment requirements for PTE_CONT. enabling PTE_CONT for larger contiguous mappings.
Unfortunately, we don't have pte_cont for architectures other than AArch64. On the other hand, AArch64 isn't automatically mapping cont_pte for mmap. It might be better if this were done automatically by the ARM code.
Ryan (Cc'ed) is the expert on automatically setting cont_pte for contiguous mappings, so let's ask Ryan for some advice.
Thanks Barry
On 08/12/2025 09:52, Barry Song wrote:
Unfortunately, we don't have pte_cont for architectures other than AArch64. On the other hand, AArch64 isn't automatically mapping cont_pte for mmap. It might be better if this were done automatically by the ARM code.
Yes indeed; CONT_PTE_MASK and PTE_CONT are arm64-specific macros that cannot be used outside of the arm64 arch code.
Ryan(Cced) is the expert on automatically setting cont_pte for contiguous mapping, so let's ask for some advice from Ryan.
arm64 arch code will automatically and transparently apply PTE_CONT whenever it detects suitable conditions. Those suitable conditions include:
- physically contiguous block of 64K, aligned to 64K
- virtually contiguous block of 64K, aligned to 64K
- 64K block has the same access permissions
- 64K block all belongs to the same folio
- not a special mapping
The last 2 requirements are the tricky ones here: We require that every page in the block belongs to the same folio because a contiguous mapping only maintains a single access and dirty bit for the whole 64K block, so we are losing fidelity vs per-page mappings. But the kernel tracks access/dirty per folio, so the extra fidelity we get for per-page mappings is ignored by the kernel anyway if the contiguous mapping only maps pages from a single folio. We reject special mappings because they are not backed by a folio at all.
For your case, remap_pfn_range() will create special mappings so we will never set the PTE_CONT bit.
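As a rough user-space model of the conditions listed above (a sketch only, not the arm64 implementation), the eligibility test can be written as a single predicate. The struct, its field names, and the 64K block size are hypothetical and assume 4K base pages; the block described is assumed to already be 64K long and contiguous both physically and virtually.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CONT_PTE_SIZE (64u * 1024u) /* assumed: 16 PTEs of 4K each */

/* Hypothetical description of one candidate 64K block. */
struct sketch_block {
	uint64_t vaddr;     /* first virtual address of the block */
	uint64_t paddr;     /* first physical address of the block */
	bool uniform_prot;  /* every page has the same access permissions */
	bool single_folio;  /* every page belongs to the same folio */
	bool special;       /* special mapping, e.g. made by remap_pfn_range() */
};

/* True only when every condition from the list holds. */
static inline bool sketch_contpte_eligible(const struct sketch_block *b)
{
	assert(b != NULL);
	if (b->vaddr % SKETCH_CONT_PTE_SIZE || b->paddr % SKETCH_CONT_PTE_SIZE)
		return false; /* virtual or physical start not 64K aligned */
	if (!b->uniform_prot)
		return false;
	if (b->special)
		return false; /* no folio behind it, so rejected today */
	return b->single_folio;
}
```

In this model a well-aligned block mapped through remap_pfn_range() fails only on the last two conditions, which is the situation described for system_heap_mmap().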
Likely we are being a bit too conservative here and we may be able to relax this requirement if we know that nothing will ever consume the access/dirty information for special mappings? I'm not sure if that is the case in general though - it would need some investigation.
With that issue resolved, there is still a second issue: there are 2 ways the arm64 arch code detects suitable contiguous mappings. The primary way is via a call to set_ptes(). This is part of the "PTE batching" API and explicitly tells the implementation that all the conditions are met (including the memory being backed by a folio). This is the most efficient approach. See contpte_set_ptes().
There is a second (hacky) approach which attempts to recognise when the last PTE of a contiguous block is set and automatically "fold" the mapping. See contpte_try_fold(). This approach has a cost because (for systems without BBML2_NOABORT) we have to issue a TLBI when we fold the range.
For remap_pfn_range(), we would be relying on the second approach since it is not currently batched (and could not use set_ptes() as currently spec'ed due to there being no folio). If we are going to add support for contiguous pfn-mapped PTEs, it would be preferable to add equivalent batching APIs (or relax set_ptes()).
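The difference between the two detection paths described above can be illustrated with a toy user-space model. This is a sketch under stated assumptions, not the code in contpte_set_ptes() or contpte_try_fold(): every sketch_* name is hypothetical, one table models a single aligned 64K block of 16 PTEs, and tlbi_count merely counts the invalidation that folding would need on systems without BBML2_NOABORT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CONT_PTES 16u /* assumed: PTEs per contiguous block */

struct sketch_pte { uint64_t pfn; bool valid; bool cont; };

struct sketch_table {
	struct sketch_pte pte[SKETCH_CONT_PTES]; /* one aligned 64K block */
	unsigned int tlbi_count;                 /* modelled invalidations */
};

/* Batched path: the caller vouches for the block, so mark it up front. */
static inline void sketch_set_ptes_batched(struct sketch_table *t, uint64_t first_pfn)
{
	assert(first_pfn % SKETCH_CONT_PTES == 0);
	for (size_t i = 0; i < SKETCH_CONT_PTES; i++) {
		t->pte[i].pfn = first_pfn + i;
		t->pte[i].valid = true;
		t->pte[i].cont = true;
	}
}

/* Fold only if all 16 entries are valid, consecutive, and start aligned. */
static inline void sketch_try_fold(struct sketch_table *t)
{
	uint64_t first = t->pte[0].pfn;

	if (first % SKETCH_CONT_PTES != 0)
		return;
	for (size_t i = 0; i < SKETCH_CONT_PTES; i++)
		if (!t->pte[i].valid || t->pte[i].pfn != first + i)
			return;
	t->tlbi_count++; /* folding rewrites live entries, so invalidate */
	for (size_t i = 0; i < SKETCH_CONT_PTES; i++)
		t->pte[i].cont = true;
}

/* Per-PTE path: a fold is attempted only when the last entry is written. */
static inline void sketch_set_pte_single(struct sketch_table *t, size_t idx, uint64_t pfn)
{
	assert(idx < SKETCH_CONT_PTES);
	t->pte[idx].pfn = pfn;
	t->pte[idx].valid = true;
	t->pte[idx].cont = false;
	if (idx == SKETCH_CONT_PTES - 1)
		sketch_try_fold(t);
}

/* Roughly how an unbatched mapper would fill the block, one PTE at a time. */
static inline void sketch_map_one_by_one(struct sketch_table *t, uint64_t first_pfn)
{
	for (size_t i = 0; i < SKETCH_CONT_PTES; i++)
		sketch_set_pte_single(t, i, first_pfn + i);
}
```

In this model both paths end with the contiguous bit set on an aligned block, but only the one-by-one path pays for an invalidation, which is the cost argument for adding a batching API for pfn-mapped PTEs.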
I think this would be a useful improvement, but it's not as straightforward as adding PTE_CONT in system_heap_mmap().
Thanks, Ryan
On Mon, Dec 8, 2025 at 6:38 PM Ryan Roberts ryan.roberts@arm.com wrote:
For remap_pfn_range(), we would be relying on the second approach since it is not currently batched (and could not use set_ptes() as currently spec'ed due to there being no folio). If we are going to add support for contiguous pfn-mapped PTEs, it would be preferable to add equivalent batching APIs (or relax set_ptes()).
Thanks a lot, Ryan. It seems quite tricky to support automatic cont_pte.
I think this would be a useful improvement, but it's not as straightforward as adding PTE_CONT in system_heap_mmap().
Since it's just a driver, I'm not sure if it's acceptable to use CONFIG_ARM64. However, I can find many instances of it in drivers.
drivers % git grep CONFIG_ARM64 | wc -l
127
On the other hand, a corner case is when the dma-buf is partially unmapped. I assume cont_pte can still be automatically unfolded, even for special mappings?
Thanks Barry
On 09/12/2025 11:37, Barry Song wrote:
Since it's just a driver, I'm not sure if it's acceptable to use CONFIG_ARM64. However, I can find many instances of it in drivers.
drivers % git grep CONFIG_ARM64 | wc -l
127
On the other hand, a corner case is when the dma-buf is partially unmapped. I assume cont_pte can still be automatically unfolded, even for special mappings?
I think unfolding will probably happen to work, but you're definitely in the neighbourhood of "horrible hack that may not work as intended in some corner cases".
I think it would be much better to support batching for pfn-mapped ptes. That would generalize to many more users. (and I might be interested in taking a look at some point next year if nobody else gets to it).
We deliberately didn't want to expose the idea of a single, specific contiguous size to the generic code so that the arch could make more fine-grained decisions. :)
Thanks, Ryan
linaro-mm-sig@lists.linaro.org