On Fri, Dec 19, 2025 at 05:24:36AM +0800, Barry Song wrote:
On Thu, Dec 18, 2025 at 9:55 PM Uladzislau Rezki <urezki@gmail.com> wrote:
On Thu, Dec 18, 2025 at 02:01:56PM +0100, David Hildenbrand (Red Hat) wrote:
On 12/15/25 06:30, Barry Song wrote:
From: Barry Song <v-songbaohua@oppo.com>
In many cases, the pages passed to vmap() may include high-order pages allocated with __GFP_COMP. For example, the system heap often allocates pages in descending order: order 8, then order 4, then order 0. Currently, vmap() iterates over every page individually; even pages inside a high-order block are handled one by one.
This patch detects high-order pages and maps them as a single contiguous block whenever possible.
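For background, a heap-style exporter typically ends up with such a mixed-order pages[] array because every tail page of a compound allocation gets its own slot before vmap() is called. The sketch below is purely illustrative (the function name is made up and the logic is heavily simplified from drivers/dma-buf/heaps/system_heap.c; it is not part of this patch):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Simplified heap-style allocator: fill a pages[] array from descending
 * orders (8, then 4, then 0) and vmap() the whole thing. */
static void *example_heap_vmap(unsigned int nr_pages, gfp_t gfp)
{
	static const unsigned int orders[] = { 8, 4, 0 };
	struct page **pages, *page;
	unsigned int filled = 0, i, j;
	void *vaddr = NULL;

	pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return NULL;

	while (filled < nr_pages) {
		page = NULL;
		for (i = 0; i < ARRAY_SIZE(orders); i++) {
			if (nr_pages - filled < (1U << orders[i]))
				continue;
			/* __GFP_COMP gives a compound (high-order) page */
			page = alloc_pages(gfp | __GFP_COMP, orders[i]);
			if (page)
				break;
		}
		if (!page)
			goto out;	/* freeing partial allocations elided */

		/* every tail page still occupies its own slot in the array */
		for (j = 0; j < (1U << orders[i]); j++)
			pages[filled++] = page + j;
	}

	/* vmap() currently walks this array one base page at a time */
	vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
out:
	kvfree(pages);
	return vaddr;
}

With the per-page loop, an order-8 block in that array is still processed as 256 separate entries; with this patch, __vmap_pages_range_noflush() recognises the compound page and hands the whole physically contiguous block to a single vmap_range_noflush() call.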
An alternative would be to implement a new API, vmap_sg(), but that change would be considerably larger in scope.
When vmapping a 128MB dma-buf allocated from the system heap, this patch makes system_heap_do_vmap() roughly 17× faster.
W/ patch:
[ 10.404769] system_heap_do_vmap took 2494000 ns
[ 12.525921] system_heap_do_vmap took 2467008 ns
[ 14.517348] system_heap_do_vmap took 2471008 ns
[ 16.593406] system_heap_do_vmap took 2444000 ns
[ 19.501341] system_heap_do_vmap took 2489008 ns
W/o patch:
[ 7.413756] system_heap_do_vmap took 42626000 ns
[ 9.425610] system_heap_do_vmap took 42500992 ns
[ 11.810898] system_heap_do_vmap took 42215008 ns
[ 14.336790] system_heap_do_vmap took 42134992 ns
[ 16.373890] system_heap_do_vmap took 42750000 ns
That's quite a speedup.
Cc: David Hildenbrand <david@kernel.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: John Stultz <jstultz@google.com>
Cc: Maxime Ripard <mripard@kernel.org>
Tested-by: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
* diff with rfc:
  - Many code refinements based on David's suggestions, thanks!
  - Refine comment and changelog according to Uladzislau, thanks!
  - rfc link: https://lore.kernel.org/linux-mm/20251122090343.81243-1-21cnbao@gmail.com/
 mm/vmalloc.c | 45 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 39 insertions(+), 6 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 41dd01e8430c..8d577767a9e5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -642,6 +642,29 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
 	return err;
 }
 
+static inline int get_vmap_batch_order(struct page **pages,
+		unsigned int stride, unsigned int max_steps, unsigned int idx)
+{
+	int nr_pages = 1;
unsigned int, maybe
Right
Why are you initializing nr_pages when you overwrite it below?
Right, initializing nr_pages can be dropped.
+	/*
+	 * Currently, batching is only supported in vmap_pages_range
+	 * when page_shift == PAGE_SHIFT.
I don't know the code, so realizing how we go from page_shift to stride took me a second. Maybe only talk about stride here?
OTOH, is "stride" really the right terminology?
we calculate it as
stride = 1U << (page_shift - PAGE_SHIFT);
page_shift - PAGE_SHIFT should give us an "order". So is this a "granularity" in nr_pages?
This is the case where vmalloc() may realize that it has high-order pages and therefore calls vmap_pages_range_noflush() with a page_shift larger than PAGE_SHIFT. For vmap(), we take a pages array, so page_shift is always PAGE_SHIFT.
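(For concreteness, assuming 4 KiB base pages: a PMD-level vmalloc mapping passes page_shift = 21, so stride = 1U << (21 - 12) = 512, i.e. each loop iteration consumes 512 pages[] entries and maps 2 MiB. For vmap(), page_shift == PAGE_SHIFT and the stride is always 1.)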
Again, I don't know this code, so sorry for the question.
To me "stride" also sounds unclear.
Thanks, David and Uladzislau. On second thought, this stride may be redundant, and it should be possible to drop it entirely. This results in the code below:
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 41dd01e8430c..3962bdcb43e5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -642,6 +642,20 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
 	return err;
 }
 
+static inline int get_vmap_batch_order(struct page **pages,
+		unsigned int max_steps, unsigned int idx)
+{
+	unsigned int nr_pages = compound_nr(pages[idx]);
+
+	if (nr_pages == 1 || max_steps < nr_pages)
+		return 0;
+
+	if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages)
+		return compound_order(pages[idx]);
+
+	return 0;
+}
+
 /*
  * vmap_pages_range_noflush is similar to vmap_pages_range, but does not
  * flush caches.
@@ -658,20 +672,35 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
 
 	WARN_ON(page_shift < PAGE_SHIFT);
 
+	/*
+	 * For vmap(), users may allocate pages from high orders down to
+	 * order 0, while always using PAGE_SHIFT as the page_shift.
+	 * We first check whether the initial page is a compound page. If so,
+	 * there may be an opportunity to batch multiple pages together.
+	 */
 	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
-			page_shift == PAGE_SHIFT)
+			(page_shift == PAGE_SHIFT && !PageCompound(pages[0])))
 		return vmap_small_pages_range_noflush(addr, end, prot, pages);
Hm.. If the first few pages are order-0 and the rest are compound, then we do nothing: the !PageCompound(pages[0]) check sends the whole array down vmap_small_pages_range_noflush().
-	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
+	for (i = 0; i < nr; ) {
 		int err;
+		unsigned int shift = page_shift;
 
-		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
+		/*
+		 * For vmap() cases, page_shift is always PAGE_SHIFT; even so,
+		 * the pages may be physically contiguous and can still be
+		 * mapped in a batch.
+		 */
+		if (page_shift == PAGE_SHIFT)
+			shift += get_vmap_batch_order(pages, nr - i, i);
+
+		err = vmap_range_noflush(addr, addr + (1UL << shift),
 					page_to_phys(pages[i]), prot,
-					page_shift);
+					shift);
 		if (err)
 			return err;
 
-		addr += 1UL << page_shift;
+		addr += 1UL << shift;
+		i += 1U << (shift - PAGE_SHIFT);
 	}
 
 	return 0;
Does this look clearer?
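As a side note for readers who have not seen the helper: the batching above leans on num_pages_contiguous(), which (roughly) reports how many of the first nr entries of pages[] are physically consecutive, starting at pages[0]. A rough sketch of that idea, with a made-up name; the in-tree helper may be implemented differently:

/* Illustrative only; the real num_pages_contiguous() may differ.
 * Count how many of the first @nr entries of @pages have consecutive
 * page frame numbers, starting from pages[0]. */
static inline unsigned long example_num_pages_contiguous(struct page **pages,
							  unsigned long nr)
{
	unsigned long pfn = page_to_pfn(pages[0]);
	unsigned long i;

	for (i = 1; i < nr; i++)
		if (page_to_pfn(pages[i]) != pfn + i)
			break;

	return i;
}

So get_vmap_batch_order() only reports a batch when all compound_nr(pages[idx]) entries are contiguous in the array; otherwise it returns 0 and the page at idx is mapped at order 0.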
The concern is that we mix it with the huge page mapping path. If we want to batch v-mapping for the page_shift == PAGE_SHIFT case, where the "pages" array may contain compound pages (folios) (a corner case to me), I think we should split it.
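For illustration only (this sketch is not from the thread, and the helper name is made up): such a split could keep the existing loop for page_shift > PAGE_SHIFT untouched and route the page_shift == PAGE_SHIFT, possibly-compound case through its own function, roughly:

/* Rough sketch of the suggested split -- not the posted patch. */
static int example_vmap_compound_pages_range_noflush(unsigned long addr,
		unsigned long end, pgprot_t prot, struct page **pages)
{
	unsigned int i, nr = (end - addr) >> PAGE_SHIFT;

	for (i = 0; i < nr; ) {
		unsigned int shift = PAGE_SHIFT +
				     get_vmap_batch_order(pages, nr - i, i);
		int err;

		err = vmap_range_noflush(addr, addr + (1UL << shift),
					 page_to_phys(pages[i]), prot, shift);
		if (err)
			return err;

		addr += 1UL << shift;
		i += 1U << (shift - PAGE_SHIFT);
	}

	return 0;
}

__vmap_pages_range_noflush() could then call such a helper only when page_shift == PAGE_SHIFT and the array starts with a compound page, leaving the plain order-0 case on the vmap_small_pages_range_noflush() path and the huge-mapping loop as it is today.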
-- Uladzislau Rezki