On Wed, Jan 08, 2025 at 05:46:37PM +0100, David Hildenbrand wrote:
On 08.01.25 17:13, Thomas Weißschuh wrote:
On Wed, Jan 08, 2025 at 02:36:57PM +0100, David Hildenbrand wrote:
On 08.01.25 09:05, Thomas Weißschuh wrote:
On Wed, Jan 08, 2025 at 11:46:19AM +0530, Dev Jain wrote:
On 07/01/25 8:44 pm, Thomas Weißschuh wrote:
If not enough physical memory is available the kernel may fail mmap(); see __vm_enough_memory() and vm_commit_limit(). In that case the logic in validate_complete_va_space() does not make sense and will even incorrectly fail. Instead skip the test if no mmap() succeeded.
Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()") Cc: stable@vger.kernel.org
CC stable on tests is ... odd.
I thought it was fairly common, but it isn't. Will drop it.
As it's not really a "kernel BUG", it's rather uncommon.
I also used it on patch 2, which is now reproducibly broken on x86 mainline since my commit mentioned in that patch. But I'll drop it there, too.
Note that with MAP_NORESRVE, most setups we care about will allow mapping as much as you want, but on access OOM will fire.
Thanks for the hint.
So one could require that /proc/sys/vm/overcommit_memory is setup properly and use MAP_NORESRVE.
Isn't the check for lchunks == 0 essentially exactly this?
I assume paired with MAP_NORESERVE?
Yes.
Maybe, but it could be better to have something that says "if overcommit_memory is not setup properly I will SKIP this test", but otherwise I expect this to work and will FAIL if it doesn't".
Ok, I'll validate the sysctl value.
Or would you expect to run into lchunks == 0 even if overcommit_memory is setup properly and MAP_NORESERVE is used? (very very low memory that we cannot even create all the VMAs?)
No.
Reading from anonymous memory will populate the shared zeropage. To mitigate OOM from "too many page tables", one could simply unmap the pieces as they are verified (or MAP_FIXED over them, to free page tables).
The code has to figure out if a verified region was created by mmap(), otherwise an munmap() could crash the process. As the entries from /proc/self/maps may have been merged and (I assume)
Yes, and partial unmap (in chunk granularity?) would split them again.
the ordering of mappings is not guaranteed, some bespoke logic to establish the link will be needed.
My thinking was that you simply process one /proc/self/maps entry in some chunks. After processing a chunk, you munmap() it.
So you would process + munmap in chunks.
That is clear. The issue would be to figure which chunks are valid to unmap. If something critical like the executable file is unmapped, the process crashes. But see below.
Is it fine to rely on CONFIG_ANON_VMA_NAME? That would make it much easier to implement.
Can you elaborate how you would do it?
First set the VMA name after mmap():
for (i = 0; i < NR_CHUNKS_LOW; i++) { ptr[i] = mmap(NULL, MAP_CHUNK_SIZE, PROT_READ | PROT_WRITE, MAP_NORESERVE | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (ptr[i] == MAP_FAILED) { if (validate_lower_address_hint()) ksft_exit_fail_msg("mmap unexpectedly succeeded with hint\n"); break; }
validate_addr(ptr[i], 0); if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ptr[i], MAP_CHUNK_SIZE, "virtual_address_range")) ksft_exit_fail_msg("prctl(PR_SET_VMA_ANON_NAME) failed: %s\n", strerror(errno)); }
During validation:
hop = 0; while (start_addr + hop < end_addr) { if (write(fd, (void *)(start_addr + hop), 1) != 1) return 1; lseek(fd, 0, SEEK_SET);
if (!strncmp(line + path_offset, "[anon:virtual_address_range]", 28)) munmap((char *)(start_addr + hop), MAP_CHUNK_SIZE);
hop += MAP_CHUNK_SIZE;
}
It is done for each chunk, as all chunks may have been merged into a single VMA and a per-VMA unmap would not happen before OOM.
Using MAP_NORESERVE and eager munmap()s, the testcase works nicely even in very low physical memory conditions.
Cool.