[PATCH 6.6.y 0/2] riscv: mm: Backport of mmap hint address fixes

Vivian Wang

8 Oct 2025

7:50 a.m.

Backport of the two riscv mmap patches from master. In effect, these two patches remove arch_get_mmap_{base,end} for riscv.

Guo Ren: Please take a look. Patch 1 has a slightly non-trivial conflict with your commit 97b7ac69be2e ("riscv: mm: Fixup compat arch_get_mmap_end"), which changed STACK_TOP_MAX from TASK_SIZE_64 to TASK_SIZE when CONFIG_64BIT=y. This shouldn't be a problem, but, well, just to be safe.

---
Charlie Jenkins (2):
      riscv: mm: Use hint address in mmap if available
      riscv: mm: Do not restrict mmap address based on hint

 arch/riscv/include/asm/processor.h | 33 +++++----------------------------
 1 file changed, 5 insertions(+), 28 deletions(-)
---
base-commit: 60a9e718726fa7019ae00916e4b1c52498da5b60
change-id: 20250917-riscv-mmap-addr-space-6-6-15e7db6b5db6

Best regards,

-- Vivian "dramforever" Wang

Vivian Wang

8 Oct

7:50 a.m.

New subject: [PATCH 6.6.y 1/2] riscv: mm: Use hint address in mmap if available

From: Charlie Jenkins <charlie@rivosinc.com>

[ Upstream commit b5b4287accd702f562a49a60b10dbfaf7d40270f ]

On riscv it is guaranteed that the address returned by mmap is less than the hint address. Allow mmap to return an address all the way up to addr, if provided, rather than just up to the lower address space.

This provides a performance benefit as well, allowing mmap to exit after checking that the address is in range rather than searching for a valid address.

It is possible to provide an address that uses at most the same number of bits, however it is significantly more computationally expensive to provide that number rather than setting the max to be the hint address. There is the instruction clz/clzw in Zbb that returns the highest set bit which could be used to performantly implement this, but it would still be slower than the current implementation. At worst case, half of the address would not be able to be allocated when a hint address is provided.

Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Link: https://lore.kernel.org/r/20240130-use_mmap_hint_address-v3-1-8a655cfa8bcb@r...
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
[ Adjust TASK_SIZE64 -> TASK_SIZE in moved lines ]
Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn>
Tested-by: Han Gao <rabenda.cn@gmail.com>
---
 arch/riscv/include/asm/processor.h | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index 4f6af8c6cfa060380594c6d0e727af6b02d08d70..938aef30dfb42ee477b7c59b5d2afc3871d8004d 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -13,22 +13,16 @@

 #include <asm/ptrace.h>

-#ifdef CONFIG_64BIT
-#define DEFAULT_MAP_WINDOW (UL(1) << (MMAP_VA_BITS - 1))
-#define STACK_TOP_MAX TASK_SIZE
-
 #define arch_get_mmap_end(addr, len, flags) \
 ({ \
     unsigned long mmap_end; \
     typeof(addr) _addr = (addr); \
-    if ((_addr) == 0 || (IS_ENABLED(CONFIG_COMPAT) && is_compat_task())) \
+    if ((_addr) == 0 || \
+        (IS_ENABLED(CONFIG_COMPAT) && is_compat_task()) || \
+        ((_addr + len) > BIT(VA_BITS - 1))) \
         mmap_end = STACK_TOP_MAX; \
-    else if ((_addr) >= VA_USER_SV57) \
-        mmap_end = STACK_TOP_MAX; \
-    else if ((((_addr) >= VA_USER_SV48)) && (VA_BITS >= VA_BITS_SV48)) \
-        mmap_end = VA_USER_SV48; \
     else \
-        mmap_end = VA_USER_SV39; \
+        mmap_end = (_addr + len); \
     mmap_end; \
 })

@@ -38,17 +32,18 @@
     typeof(addr) _addr = (addr); \
     typeof(base) _base = (base); \
     unsigned long rnd_gap = DEFAULT_MAP_WINDOW - (_base); \
-    if ((_addr) == 0 || (IS_ENABLED(CONFIG_COMPAT) && is_compat_task())) \
+    if ((_addr) == 0 || \
+        (IS_ENABLED(CONFIG_COMPAT) && is_compat_task()) || \
+        ((_addr + len) > BIT(VA_BITS - 1))) \
         mmap_base = (_base); \
-    else if (((_addr) >= VA_USER_SV57) && (VA_BITS >= VA_BITS_SV57)) \
-        mmap_base = VA_USER_SV57 - rnd_gap; \
-    else if ((((_addr) >= VA_USER_SV48)) && (VA_BITS >= VA_BITS_SV48)) \
-        mmap_base = VA_USER_SV48 - rnd_gap; \
     else \
-        mmap_base = VA_USER_SV39 - rnd_gap; \
+        mmap_base = (_addr + len) - rnd_gap; \
     mmap_base; \
 })

+#ifdef CONFIG_64BIT
+#define DEFAULT_MAP_WINDOW (UL(1) << (MMAP_VA_BITS - 1))
+#define STACK_TOP_MAX TASK_SIZE
 #else
 #define DEFAULT_MAP_WINDOW TASK_SIZE
 #define STACK_TOP_MAX TASK_SIZE

--
2.50.1
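
Editorial illustration, not part of the patch: the commit message of patch 1 weighs capping the search at the hint itself against capping it at the largest address that uses the same number of bits as the hint. The userspace sketch below computes the second bound with a count-leading-zeros operation. The function name `limit_same_bits` is hypothetical, and the GCC/Clang builtin `__builtin_clzll` is assumed here as a stand-in for the Zbb `clz` instruction.

```c
#include <limits.h>

/*
 * Hypothetical helper (not kernel code): exclusive upper bound of the
 * address range that needs no more bits than `hint` does.
 */
unsigned long long limit_same_bits(unsigned long long hint)
{
    int bits;

    if (hint == 0)
        return 0;               /* no hint given; clz of 0 is undefined */

    bits = 64 - __builtin_clzll(hint);  /* position of highest set bit + 1 */
    if (bits == 64)
        return ULLONG_MAX;      /* 1 << 64 would overflow */

    return 1ULL << bits;
}
```

For a hint of `1ULL << 50` this bound is `1ULL << 51`, so capping at the hint instead leaves the upper half of that range unusable, which is the worst case the commit message mentions.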

Vivian Wang

7:50 a.m.

New subject: [PATCH 6.6.y 2/2] riscv: mm: Do not restrict mmap address based on hint

From: Charlie Jenkins <charlie@rivosinc.com>

[ Upstream commit 2116988d5372aec51f8c4fb85bf8e305ecda47a0 ]

The hint address should not forcefully restrict the addresses returned by mmap as this causes mmap to report ENOMEM when there is memory still available.

Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Fixes: b5b4287accd7 ("riscv: mm: Use hint address in mmap if available")
Fixes: add2cc6b6515 ("RISC-V: mm: Restrict address space for sv39,sv48,sv57")
Closes: https://lore.kernel.org/linux-kernel/ZbxTNjQPFKBatMq+@ghost/T/#mccb1890466bf...
Link: https://lore.kernel.org/r/20240826-riscv_mmap-v1-3-cd8962afe47f@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
[ Adjust removed lines ]
Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn>
Tested-by: Han Gao <rabenda.cn@gmail.com>
---
 arch/riscv/include/asm/processor.h | 22 ++--------------------
 1 file changed, 2 insertions(+), 20 deletions(-)

diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index 938aef30dfb42ee477b7c59b5d2afc3871d8004d..4747277983ad1a2a9666c03a4e69758f56d22dbc 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -15,30 +15,12 @@

 #define arch_get_mmap_end(addr, len, flags) \
 ({ \
-    unsigned long mmap_end; \
-    typeof(addr) _addr = (addr); \
-    if ((_addr) == 0 || \
-        (IS_ENABLED(CONFIG_COMPAT) && is_compat_task()) || \
-        ((_addr + len) > BIT(VA_BITS - 1))) \
-        mmap_end = STACK_TOP_MAX; \
-    else \
-        mmap_end = (_addr + len); \
-    mmap_end; \
+    STACK_TOP_MAX; \
 })

 #define arch_get_mmap_base(addr, base) \
 ({ \
-    unsigned long mmap_base; \
-    typeof(addr) _addr = (addr); \
-    typeof(base) _base = (base); \
-    unsigned long rnd_gap = DEFAULT_MAP_WINDOW - (_base); \
-    if ((_addr) == 0 || \
-        (IS_ENABLED(CONFIG_COMPAT) && is_compat_task()) || \
-        ((_addr + len) > BIT(VA_BITS - 1))) \
-        mmap_base = (_base); \
-    else \
-        mmap_base = (_addr + len) - rnd_gap; \
-    mmap_base; \
+    base; \
 })

 #ifdef CONFIG_64BIT

--
2.50.1
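
Editorial illustration, not part of the series: the three states of `arch_get_mmap_end` touched by these two patches can be modelled in userspace to see which search ceiling a given hint produces. This is a simplified sketch under stated assumptions: the compat-task branch is ignored, `STACK_TOP_MAX` is assumed to equal `1 << (VA_BITS - 1)`, the `VA_USER_SVxx` constants are assumed to be `1 << (bits - 1)`, and all function names are hypothetical.

```c
/* User half of a `bits`-wide virtual address space (assumed definition). */
#define VA_USER(bits) (1ULL << ((bits) - 1))

/* 6.6.y before this series: the hint selects one of three fixed ceilings. */
unsigned long long mmap_end_before(unsigned long long addr, int va_bits)
{
    if (addr == 0)
        return VA_USER(va_bits);        /* STACK_TOP_MAX */
    if (addr >= VA_USER(57))
        return VA_USER(va_bits);        /* STACK_TOP_MAX */
    if (addr >= VA_USER(48) && va_bits >= 48)
        return VA_USER(48);
    return VA_USER(39);
}

/* After patch 1: the ceiling is hint + len when that fits in user space. */
unsigned long long mmap_end_patch1(unsigned long long addr,
                                   unsigned long long len, int va_bits)
{
    if (addr == 0 || addr + len > VA_USER(va_bits))
        return VA_USER(va_bits);        /* STACK_TOP_MAX */
    return addr + len;
}

/* After patch 2: the hint no longer restricts the search at all. */
unsigned long long mmap_end_patch2(int va_bits)
{
    return VA_USER(va_bits);            /* STACK_TOP_MAX */
}
```

In this model a hint of `1ULL << 50` on Sv57 yields a ceiling of `1ULL << 47` before the series, `(1ULL << 50) + len` after patch 1, and `1ULL << 56` after patch 2. The patch 1 ceiling is what patch 2's commit message objects to: once the space below the hint is exhausted, mmap reports ENOMEM even though addresses above the hint are still free.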

Greg KH

10:20 a.m.

On Wed, Oct 08, 2025 at 03:50:15PM +0800, Vivian Wang wrote:

> Backport of the two riscv mmap patches from master. In effect, these two patches remove arch_get_mmap_{base,end} for riscv.

Why is this needed? What bug does this fix?

thanks,

greg k-h

Vivian Wang

9 Oct

4:19 a.m.

On 10/8/25 18:20, Greg KH wrote:

> On Wed, Oct 08, 2025 at 03:50:15PM +0800, Vivian Wang wrote:
>
> > Backport of the two riscv mmap patches from master. In effect, these two patches remove arch_get_mmap_{base,end} for riscv.
>
> Why is this needed? What bug does this fix?

The behavior of mmap hint address in current 6.6.y is broken when > 39 bits of virtual address is available (i.e. Sv48 or Sv57, having 48 and 57 bits of VA available, respectively). The man-pages mmap(2) page states, for the hint address [1]:

If addr is NULL, then the kernel chooses the (page-aligned) address at which to create the mapping; this is the most portable method of creating a new mapping. If addr is not NULL, then the kernel takes it as a hint about where to place the mapping; on Linux, the kernel will pick a nearby page boundary (but always above or equal to the value specified by /proc/sys/vm/mmap_min_addr) and attempt to create the mapping there. If another mapping already exists there, the kernel picks a new address that may or may not depend on the hint. The address of the new mapping is returned as the result of the call.

Therefore, if a userspace program specifies a large hint address of e.g. 1<<50, and both the kernel and the hardware support it, it should be used even if MAP_FIXED is not specified. This is also the behavior implemented in x86_64, arm64, and, on a recent enough (> 6.10) kernel, riscv64.

However, current 6.6.y for riscv64 implements a bizarre behavior, where the hint address is treated as an upper bound instead. Therefore, passing 1<<50 would actually return a VA in 48-bit space.

To reproduce, call mmap with arguments like:

mmap(hint, 4096, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

Comparison:

hint = 0x4000000000000 i.e. 1 << 50

        6.6.106           6.6.106 + patch
sv48    0x7fff90223000    0x7fff93b4e000
sv57    0x7fffb7d49000    0x4000000000000

When the hint is not used, the exact address is of course random, which is expected. However, since the address 1<<50 is supported under Sv57, it should be usable by mmap, but with current 6.6.y behavior it is not used, and some other address from 48-bit space is used instead.

There's not yet real riscv64 hardware with Sv57, but an analogous problem arises on Sv48 with an address like 1<<40.
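
Editorial illustration: the reproduction described above can be wrapped in a small helper. This is a minimal sketch assuming a 64-bit Linux userspace; the name `probe_hint` is hypothetical, and the mapping is released again because only the chosen address is of interest.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: ask for an anonymous mapping at `hint` without
 * MAP_FIXED and report where the kernel actually placed it.
 * Returns 0 if mmap failed.
 */
unsigned long probe_hint(unsigned long hint, size_t len)
{
    void *p = mmap((void *)hint, len, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return 0;

    munmap(p, len);             /* only the address matters here */
    return (unsigned long)p;
}
```

Printing `probe_hint(1UL << 50, 4096)` with `%#lx` on an Sv57 system, or `probe_hint(1UL << 40, 4096)` on Sv48, shows whether the hint was honoured: a return value equal to the hint corresponds to the patched behavior in the comparison above.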

One real userspace program that runs into this is the Go programming language runtime with TSAN enabled. Excerpt from a test log [2], which was run on an Eswin EIC7700x, which supports Sv48:

fatal error: too many address space collisions for -race mode

runtime stack:
runtime.throw({0x257eaa?, 0x4000000?})
    /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/panic.go:1246 +0x38 fp=0x7ffff84af758 sp=0x7ffff84af730 pc=0xc9310
runtime.(*mheap).sysAlloc(0x3e3c20, 0x81cc8?, 0x3f3e28, 0x3f3e50)
    /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/malloc.go:799 +0x56c fp=0x7ffff84af7f8 sp=0x7ffff84af758 pc=0x67944
runtime.(*mheap).grow(0x3e3c20, 0x7fffb69fee00?)
    /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/mheap.go:1568 +0x9c fp=0x7ffff84af870 sp=0x7ffff84af7f8 pc=0x824c4
runtime.(*mheap).allocSpan(0x3e3c20, 0x1, 0x0, 0x10)
[...]
FAIL    runtime/race    0.285s

With TSAN enabled, the Go runtime allocates a lot of virtual address space. As the message suggests, if the return value of mmap is not equal to a non-zero hint, the runtime assumes that mmap is failing to allocate the address because some other mapping is already there (in other words, it assumes the man-pages documented behavior), and unmaps it and tries a different address, until it tries too many times and gives up. This means Go with TSAN fails to initialize on Sv48 and current 6.6.y.
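
Editorial illustration: the retry logic described above can be modelled in C. This is not the Go runtime's code, only a sketch of the pattern under the assumption that a mismatch between the hint and the returned address is treated as a collision; `reserve_at_hints` is a hypothetical name.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical model: try each hint in turn and accept a mapping only if
 * the kernel placed it exactly at the hint. Returns the accepted address,
 * or 0 once every hint has "collided".
 */
unsigned long reserve_at_hints(const unsigned long *hints, size_t n,
                               size_t len)
{
    for (size_t i = 0; i < n; i++) {
        void *p = mmap((void *)hints[i], len, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            continue;
        if ((unsigned long)p == hints[i])
            return hints[i];    /* hint honoured, keep the mapping */

        munmap(p, len);         /* treated as a collision, try the next */
    }
    return 0;                   /* too many collisions */
}
```

On a kernel that treats the hint as an upper bound, every hint above that bound comes back at a different address, so a loop of this shape exhausts its candidates even though nothing is mapped at any of them.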

(cc Meng Zhuo, in case of any questions about the Go runtime here.)

Patch 1 here addresses the above issue, but introduced regressions (see replies in "Link"). Patch 2 addresses those regressions.

Thanks, Vivian "dramforever" Wang

[1]: https://man7.org/linux/man-pages/man2/mmap.2.html
[2]: https://logs.chromium.org/logs/golang/buildbucket/cr-buildbucket/87083013106...

Greg KH

5 a.m.

On Thu, Oct 09, 2025 at 12:19:46PM +0800, Vivian Wang wrote:

> On 10/8/25 18:20, Greg KH wrote:
>
> > On Wed, Oct 08, 2025 at 03:50:15PM +0800, Vivian Wang wrote:
> >
> > > Backport of the two riscv mmap patches from master. In effect, these two patches remove arch_get_mmap_{base,end} for riscv.
> >
> > Why is this needed? What bug does this fix?
>
> The behavior of mmap hint address in current 6.6.y is broken when > 39 bits of virtual address is available (i.e. Sv48 or Sv57, having 48 and 57 bits of VA available, respectively). The man-pages mmap(2) page states, for the hint address [1]:
>
> If addr is NULL, then the kernel chooses the (page-aligned) address at which to create the mapping; this is the most portable method of creating a new mapping. If addr is not NULL, then the kernel takes it as a hint about where to place the mapping; on Linux, the kernel will pick a nearby page boundary (but always above or equal to the value specified by /proc/sys/vm/mmap_min_addr) and attempt to create the mapping there. If another mapping already exists there, the kernel picks a new address that may or may not depend on the hint. The address of the new mapping is returned as the result of the call.
>
> Therefore, if a userspace program specifies a large hint address of e.g. 1<<50, and both the kernel and the hardware support it, it should be used even if MAP_FIXED is not specified. This is also the behavior implemented in x86_64, arm64, and, on a recent enough (> 6.10) kernel, riscv64.
>
> However, current 6.6.y for riscv64 implements a bizarre behavior, where the hint address is treated as an upper bound instead. Therefore, passing 1<<50 would actually return a VA in 48-bit space.
>
> To reproduce, call mmap with arguments like:
>
>    mmap(hint, 4096, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
>
> Comparison:
>
> hint = 0x4000000000000 i.e. 1 << 50
>
>         6.6.106           6.6.106 + patch
> sv48    0x7fff90223000    0x7fff93b4e000
> sv57    0x7fffb7d49000    0x4000000000000
>
> When the hint is not used, the exact address is of course random, which is expected. However, since the address 1<<50 is supported under Sv57, it should be usable by mmap, but with current 6.6.y behavior it is not used, and some other address from 48-bit space is used instead.
>
> There's not yet real riscv64 hardware with Sv57, but an analogous problem arises on Sv48 with an address like 1<<40.

As this issue has been fixed for many years now, why is it just showing up now? Shouldn't you be using 6.12.y for new hardware?

> One real userspace program that runs into this is the Go programming language runtime with TSAN enabled. Excerpt from a test log [2], which was run on an Eswin EIC7700x, which supports Sv48:
>
> fatal error: too many address space collisions for -race mode
>
> runtime stack:
> runtime.throw({0x257eaa?, 0x4000000?})
>     /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/panic.go:1246 +0x38 fp=0x7ffff84af758 sp=0x7ffff84af730 pc=0xc9310
> runtime.(*mheap).sysAlloc(0x3e3c20, 0x81cc8?, 0x3f3e28, 0x3f3e50)
>     /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/malloc.go:799 +0x56c fp=0x7ffff84af7f8 sp=0x7ffff84af758 pc=0x67944
> runtime.(*mheap).grow(0x3e3c20, 0x7fffb69fee00?)
>     /home/swarming/.swarming/w/ir/x/w/goroot/src/runtime/mheap.go:1568 +0x9c fp=0x7ffff84af870 sp=0x7ffff84af7f8 pc=0x824c4
> runtime.(*mheap).allocSpan(0x3e3c20, 0x1, 0x0, 0x10)
> [...]
> FAIL    runtime/race    0.285s
>
> With TSAN enabled, the Go runtime allocates a lot of virtual address space. As the message suggests, if the return value of mmap is not equal to a non-zero hint, the runtime assumes that mmap is failing to allocate the address because some other mapping is already there (in other words, it assumes the man-pages documented behavior), and unmaps it and tries a different address, until it tries too many times and gives up. This means Go with TSAN fails to initialize on Sv48 and current 6.6.y.
>
> (cc Meng Zhuo, in case of any questions about the Go runtime here.)
>
> Patch 1 here addresses the above issue, but introduced regressions (see replies in "Link"). Patch 2 addresses those regressions.

Ok, that makes a bit more sense, but again, why is this just showing up now? What changed to cause this to be noticed at and needed to be fixed at this moment in time and not before?

thanks,

greg k-h

Vivian Wang

5:50 a.m.

On 10/9/25 13:00, Greg KH wrote:

> On Thu, Oct 09, 2025 at 12:19:46PM +0800, Vivian Wang wrote:
>
> > [...]
>
> Ok, that makes a bit more sense, but again, why is this just showing up now? What changed to cause this to be noticed at and needed to be fixed at this moment in time and not before?

As for why this came quite late in the lifetime of the 6.6.y branch, I believe it's a combination of two factors.

Firstly, actual Sv48-capable RISC-V hardware came fairly late. Milk-V Megrez (with Eswin EIC7700X), on which the Go TSAN thing ran, was shipped only early this year. The DC ROMA II laptop (EIC7702X) and Framework mainboard with the same SoC have not even shipped yet, or maybe only shipped to developers - I'm not so certain. Most other RISC-V machines only have Sv39.

Secondly, there is interest among some Chinese software vendors to ship Linux distros based on a 6.6.y LTS kernel. The "RISC-V Common Kernel" (RVCK) project [1], with support from openEuler and various HW vendors, maintains backports on top of a 6.6.y kernel. "RockOS" [2] is a distro maintained by PLCT Lab, ISCAS, for EIC770{0,2}X-based boards, and it has a 6.6.y kernel branch. Both have cherry-picked the mmap patches for now.

We operate with the understanding that the official stable kernel will not be accepting new major features and drivers, but fixes do belong in stable, and at least from the perspective of PLCT Lab we generally try to send patches instead of hoarding them. Hence, the earlier backport request and this backport series.

I hope this explanation is acceptable.

Thanks, Vivian "dramforever" Wang

PS: This 6.6 kernel thing isn't just a RISC-V thing, by the way. KylinOS V11 has shipped in August with a 6.6 kernel. Deepin and UOS will be shipping with 6.6, with UOS "25" shipping maybe late this year or early 2026.

[1]: https://github.com/RVCK-Project/rvck
[2]: https://docs.rockos.dev/

Greg KH

1:31 p.m.

On Thu, Oct 09, 2025 at 01:50:11PM +0800, Vivian Wang wrote:

> On 10/9/25 13:00, Greg KH wrote:
>
> > On Thu, Oct 09, 2025 at 12:19:46PM +0800, Vivian Wang wrote:
> >
> > > [...]
> >
> > Ok, that makes a bit more sense, but again, why is this just showing up now? What changed to cause this to be noticed at and needed to be fixed at this moment in time and not before?
>
> As for why this came quite late in the lifetime of the 6.6.y branch, I believe it's a combination of two factors.
>
> Firstly, actual Sv48-capable RISC-V hardware came fairly late. Milk-V Megrez (with Eswin EIC7700X), on which the Go TSAN thing ran, was shipped only early this year. The DC ROMA II laptop (EIC7702X) and Framework mainboard with the same SoC have not even shipped yet, or maybe only shipped to developers - I'm not so certain. Most other RISC-V machines only have Sv39.
>
> Secondly, there is interest among some Chinese software vendors to ship Linux distros based on a 6.6.y LTS kernel. The "RISC-V Common Kernel" (RVCK) project [1], with support from openEuler and various HW vendors, maintains backports on top of a 6.6.y kernel. "RockOS" [2] is a distro maintained by PLCT Lab, ISCAS, for EIC770{0,2}X-based boards, and it has a 6.6.y kernel branch. Both have cherry-picked the mmap patches for now.
>
> We operate with the understanding that the official stable kernel will not be accepting new major features and drivers, but fixes do belong in stable, and at least from the perspective of PLCT Lab we generally try to send patches instead of hoarding them. Hence, the earlier backport request and this backport series.
>
> I hope this explanation is acceptable.

Thanks for the detailed explanation. I've queued these up now.

But wow, shipping new products on a 2 year old kernel feels very risky to me, but hey, what do I know? :)

> PS: This 6.6 kernel thing isn't just a RISC-V thing, by the way. KylinOS V11 has shipped in August with a 6.6 kernel. Deepin and UOS will be shipping with 6.6, with UOS "25" shipping maybe late this year or early 2026.

That too is crazy. They should know better.

Just to give a bit of context for this, for the latest 6.6.y release, 6.6.110, there are currently over 300 documented unfixed CVE items in that branch. Feels rough to be doing a new release based on that...

good luck!

greg k-h

linux-stable-mirror@lists.linaro.org
