Hi all,
This is v2 of the VMA count patch I previously posted at:
https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/
I've split it into multiple patches to address the feedback.
The main changes in v2 are:
- Use a capacity-based check for VMA count limit, per Lorenzo. - Rename map_count to vma_count, per David. - Add assertions for exceeding the limit, per Pedro. - Add tests for max_vma_count, per Liam. - Emit a trace event for failure due to insufficient capacity for observability
Tested on x86_64 and arm64:
  - Build test:
      - allyesconfig for the rename

  - Selftests:

        cd tools/testing/selftests/mm && \
            make && \
            ./run_vmtests.sh -t max_vma_count

    (with trace_max_vma_count_exceeded enabled)

  - vma tests:

        cd tools/testing/vma && \
            make && \
            ./vma
Thanks,
Kalesh
Kalesh Singh (7):
  mm: fix off-by-one error in VMA count limit checks
  mm/selftests: add max_vma_count tests
  mm: introduce vma_count_remaining()
  mm: rename mm_struct::map_count to vma_count
  mm: harden vma_count against direct modification
  mm: add assertion for VMA count limit
  mm/tracing: introduce max_vma_count_exceeded trace event
 fs/binfmt_elf.c                            |   2 +-
 fs/coredump.c                              |   2 +-
 include/linux/mm.h                         |  35 +-
 include/linux/mm_types.h                   |   5 +-
 include/trace/events/vma.h                 |  32 +
 kernel/fork.c                              |   2 +-
 mm/debug.c                                 |   2 +-
 mm/internal.h                              |   1 +
 mm/mmap.c                                  |  28 +-
 mm/mremap.c                                |  13 +-
 mm/nommu.c                                 |   8 +-
 mm/util.c                                  |   1 -
 mm/vma.c                                   |  88 ++-
 tools/testing/selftests/mm/Makefile        |   1 +
 .../selftests/mm/max_vma_count_tests.c     | 709 ++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh  |   5 +
 tools/testing/vma/vma.c                    |  32 +-
 tools/testing/vma/vma_internal.h           |  44 +-
 18 files changed, 949 insertions(+), 61 deletions(-)
 create mode 100644 include/trace/events/vma.h
 create mode 100644 tools/testing/selftests/mm/max_vma_count_tests.c
base-commit: f83ec76bf285bea5727f478a68b894f5543ca76e
The VMA count limit check in do_mmap() and do_brk_flags() uses a strict inequality (>), which allows a process's VMA count to exceed the configured sysctl_max_map_count limit by one.
A process with mm->map_count == sysctl_max_map_count will incorrectly pass this check and then exceed the limit upon allocation of a new VMA when its map_count is incremented.
Other VMA allocation paths, such as split_vma(), already use the correct, inclusive (>=) comparison.
Fix this bug by changing the comparison to be inclusive in do_mmap() and do_brk_flags(), bringing them in line with the correct behavior of other allocation paths.
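For example, with the default sysctl_max_map_count of 65530 and a process
whose map_count has already reached 65530:

	if (mm->map_count > sysctl_max_map_count)	/* 65530 > 65530: false */
		return -ENOMEM;		/* check passes; process grows to 65531 VMAs */

	if (mm->map_count >= sysctl_max_map_count)	/* 65530 >= 65530: true */
		return -ENOMEM;		/* the request correctly fails */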
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
Changes in v2:
  - Fix mmap check, per Pedro
 mm/mmap.c | 2 +-
 mm/vma.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 7306253cc3b5..e5370e7fcd8f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -374,7 +374,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		return -EOVERFLOW;
 
 	/* Too many mappings? */
-	if (mm->map_count > sysctl_max_map_count)
+	if (mm->map_count >= sysctl_max_map_count)
 		return -ENOMEM;
 
 	/*
diff --git a/mm/vma.c b/mm/vma.c
index 3b12c7579831..033a388bc4b1 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2772,7 +2772,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
-	if (mm->map_count > sysctl_max_map_count)
+	if (mm->map_count >= sysctl_max_map_count)
 		return -ENOMEM;
 
 	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
On Mon, 15 Sep 2025 09:36:32 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:

> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
lol.
x1:/usr/src/25> grep "Fixes.*1da177e4c3f4" ../gitlog|wc -l
661
we really blew it that time!
Add a new selftest to verify that the max VMA count limit is correctly enforced.
This test suite checks that various VMA operations (mmap, mprotect, munmap, mremap) succeed or fail as expected when the number of VMAs is close to the sysctl_max_map_count limit.
The test works by first creating a large number of VMAs to bring the process close to the limit, and then performing various operations that may or may not create new VMAs. The test then verifies that the operations that would exceed the limit fail, and that the operations that do not exceed the limit succeed.
NOTE: munmap is special in that it is allowed to temporarily exceed the limit by one for splits, since the count drops back to at or below the limit once the unmap succeeds.
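For example, unmapping the middle page of a single 3-page mapping frees
memory yet increases the VMA count from one to two (a minimal sketch;
page_size stands for the system page size):

	char *p = mmap(NULL, 3 * page_size, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);	/* 1 VMA */

	/* Splitting out the middle page leaves 2 VMAs: */
	/* [p, p + page_size) and [p + 2 * page_size, p + 3 * page_size) */
	munmap(p + page_size, page_size);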
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
Changes in v2:
  - Add tests, per Liam (note that the do_brk_flags() path is not easily
    tested from userspace, so it's not included here; exceeding the limit
    there should be uncommon)
 tools/testing/selftests/mm/Makefile        |   1 +
 .../selftests/mm/max_vma_count_tests.c     | 709 ++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh  |   5 +
 3 files changed, 715 insertions(+)
 create mode 100644 tools/testing/selftests/mm/max_vma_count_tests.c
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index d13b3cef2a2b..00a4b04eab06 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -91,6 +91,7 @@ TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += uffd-stress
 TEST_GEN_FILES += uffd-unit-tests
 TEST_GEN_FILES += uffd-wp-mremap
+TEST_GEN_FILES += max_vma_count_tests
 TEST_GEN_FILES += split_huge_page_test
 TEST_GEN_FILES += ksm_tests
 TEST_GEN_FILES += ksm_functional_tests
diff --git a/tools/testing/selftests/mm/max_vma_count_tests.c b/tools/testing/selftests/mm/max_vma_count_tests.c
new file mode 100644
index 000000000000..c8401c03425c
--- /dev/null
+++ b/tools/testing/selftests/mm/max_vma_count_tests.c
@@ -0,0 +1,709 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2025 Google LLC
+ */
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <linux/prctl.h> /* Definition of PR_* constants */
+#include <sys/prctl.h>
+
+#include "../kselftest.h"
+
+static int get_max_vma_count(void);
+static bool set_max_vma_count(int val);
+static int get_current_vma_count(void);
+static bool is_current_vma_count(const char *msg, int expected);
+static bool is_test_area_mapped(const char *msg);
+static void print_surrounding_maps(const char *msg);
+
+/* Globals initialized in test_suite_setup() */
+static int MAX_VMA_COUNT;
+static int ORIGINAL_MAX_VMA_COUNT;
+static int PAGE_SIZE;
+static int GUARD_SIZE;
+static int TEST_AREA_SIZE;
+static int EXTRA_MAP_SIZE;
+
+static int NR_EXTRA_MAPS;
+
+static char *TEST_AREA;
+static char *EXTRA_MAPS;
+
+#define DEFAULT_MAX_MAP_COUNT 65530
+#define TEST_AREA_NR_PAGES 3
+/* 1 before test area + 1 after test area + 1 after extra mappings */
+#define NR_GUARDS 3
+#define TEST_AREA_PROT (PROT_NONE)
+#define EXTRA_MAP_PROT (PROT_NONE)
+
+/**
+ * test_suite_setup - Set up the VMA layout for VMA count testing.
+ *
+ * Sets up the following VMA layout:
+ *
+ *   +----- base_addr
+ *   |
+ *   V
+ *   +------------+---------------------+------------+---------------+--------------+---------------+--------------+-----+---------------+--------------+
+ *   | Guard Page |                     | Guard Page | Extra Map 1   | Unmapped Gap | Extra Map 2   | Unmapped Gap | ... | Extra Map N   | Unmapped Gap |
+ *   | (unmapped) | TEST_AREA           | (unmapped) | (mapped page) | (1 page)     | (mapped page) | (1 page)     | ... | (mapped page) | (1 page)     |
+ *   | (1 page)   | (unmapped, 3 pages) | (1 page)   | (1 page)      |              | (1 page)      |              |     | (1 page)      |              |
+ *   +------------+---------------------+------------+---------------+--------------+---------------+--------------+-----+---------------+--------------+
+ *   ^            ^                     ^            ^
+ *   |            |                     |            |
+ *   +-GUARD_SIZE-+                     |            +-- EXTRA_MAPS points here;
+ *    (PAGE_SIZE) |                     |                sufficient EXTRA_MAPS to
+ *                |                     |                reach MAX_VMA_COUNT
+ *                +- TEST_AREA_SIZE ----+
+ *                |  (3 * PAGE_SIZE)    |
+ *                ^
+ *                |
+ *                +-- TEST_AREA starts here
+ *
+ * Populates TEST_AREA and other globals required for the tests.
+ * If successful, the current VMA count will be MAX_VMA_COUNT - 1.
+ *
+ * Return: true on success, false on failure.
+ */
+static bool test_suite_setup(void)
+{
+	int initial_vma_count;
+	size_t reservation_size;
+	void *base_addr = NULL;
+	char *ptr = NULL;
+
+	ksft_print_msg("Setting up vma_max_count test suite...\n");
+
+	/* Initialize globals */
+	PAGE_SIZE = sysconf(_SC_PAGESIZE);
+	TEST_AREA_SIZE = TEST_AREA_NR_PAGES * PAGE_SIZE;
+	GUARD_SIZE = PAGE_SIZE;
+	EXTRA_MAP_SIZE = PAGE_SIZE;
+
+	MAX_VMA_COUNT = get_max_vma_count();
+	if (MAX_VMA_COUNT < 0) {
+		ksft_print_msg("Failed to read /proc/sys/vm/max_map_count\n");
+		return false;
+	}
+
+	/*
+	 * If the current limit is higher than the kernel default,
+	 * we attempt to lower it to the default to ensure the test
+	 * can run with a reliably known boundary.
+	 */
+	ORIGINAL_MAX_VMA_COUNT = 0;
+
+	if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT) {
+		ORIGINAL_MAX_VMA_COUNT = MAX_VMA_COUNT;
+
+		ksft_print_msg("Max VMA count is %d, lowering to default %d for test...\n",
+			       MAX_VMA_COUNT, DEFAULT_MAX_MAP_COUNT);
+
+		if (!set_max_vma_count(DEFAULT_MAX_MAP_COUNT)) {
+			ksft_print_msg("WARNING: Failed to lower max_map_count to %d (requires root)\n",
+				       DEFAULT_MAX_MAP_COUNT);
+			ksft_print_msg("Skipping test. Please run as root: limit needs adjustment\n");
+
+			MAX_VMA_COUNT = ORIGINAL_MAX_VMA_COUNT;
+
+			return false;
+		}
+
+		/* Update MAX_VMA_COUNT for the test run */
+		MAX_VMA_COUNT = DEFAULT_MAX_MAP_COUNT;
+	}
+
+	initial_vma_count = get_current_vma_count();
+	if (initial_vma_count < 0) {
+		ksft_print_msg("Failed to read /proc/self/maps\n");
+		return false;
+	}
+
+	/*
+	 * Calculate how many extra mappings we need to create to reach
+	 * MAX_VMA_COUNT - 1 (excluding test area).
+	 */
+	NR_EXTRA_MAPS = MAX_VMA_COUNT - 1 - initial_vma_count;
+
+	if (NR_EXTRA_MAPS < 1) {
+		ksft_print_msg("Not enough available maps to run test\n");
+		ksft_print_msg("max_vma_count=%d, current_vma_count=%d\n",
+			       MAX_VMA_COUNT, initial_vma_count);
+		return false;
+	}
+
+	/*
+	 * Reserve space for:
+	 *   - Extra mappings with a 1-page gap after each (NR_EXTRA_MAPS * 2)
+	 *   - The test area itself (TEST_AREA_NR_PAGES)
+	 *   - The guard pages (NR_GUARDS)
+	 */
+	reservation_size = ((NR_EXTRA_MAPS * 2) +
+			    TEST_AREA_NR_PAGES + NR_GUARDS) * PAGE_SIZE;
+
+	base_addr = mmap(NULL, reservation_size, PROT_NONE,
+			 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+	if (base_addr == MAP_FAILED) {
+		ksft_print_msg("Failed to mmap initial reservation\n");
+		return false;
+	}
+
+	if (munmap(base_addr, reservation_size) == -1) {
+		ksft_print_msg("Failed to munmap initial reservation\n");
+		return false;
+	}
+
+	/* Get the addr of the test area */
+	TEST_AREA = (char *)base_addr + GUARD_SIZE;
+
+	/*
+	 * Get the addr of the region for extra mappings:
+	 * test area + 1 guard.
+	 */
+	EXTRA_MAPS = TEST_AREA + TEST_AREA_SIZE + GUARD_SIZE;
+
+	/* Create single-page mappings separated by unmapped pages */
+	ptr = EXTRA_MAPS;
+	for (int i = 0; i < NR_EXTRA_MAPS; ++i) {
+		if (mmap(ptr, PAGE_SIZE, EXTRA_MAP_PROT,
+			 MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
+			 -1, 0) == MAP_FAILED) {
+			perror("mmap in fill loop");
+			ksft_print_msg("Failed on mapping #%d of %d\n", i + 1,
+				       NR_EXTRA_MAPS);
+			return false;
+		}
+
+		/* Advance pointer by 2 to leave a gap */
+		ptr += (2 * EXTRA_MAP_SIZE);
+	}
+
+	if (!is_current_vma_count("test_suite_setup", MAX_VMA_COUNT - 1))
+		return false;
+
+	ksft_print_msg("vma_max_count test suite setup done.\n");
+
+	return true;
+}
+
+static void test_suite_teardown(void)
+{
+	if (ORIGINAL_MAX_VMA_COUNT && MAX_VMA_COUNT != ORIGINAL_MAX_VMA_COUNT) {
+		if (!set_max_vma_count(ORIGINAL_MAX_VMA_COUNT))
+			ksft_print_msg("Failed to restore max_map_count to %d\n",
+				       ORIGINAL_MAX_VMA_COUNT);
+	}
+}
+
+/* --- Test Helper Functions --- */
+static bool mmap_anon(void)
+{
+	void *addr = mmap(NULL, PAGE_SIZE, PROT_READ,
+			  MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+
+	/*
+	 * Handle cleanup here as the runner doesn't track where this
+	 * mapping is located.
+	 */
+	if (addr != MAP_FAILED)
+		munmap(addr, PAGE_SIZE);
+
+	return addr != MAP_FAILED;
+}
+
+static inline bool __mprotect(char *addr, int size)
+{
+	int new_prot = ~TEST_AREA_PROT & (PROT_READ | PROT_WRITE | PROT_EXEC);
+
+	return mprotect(addr, size, new_prot) == 0;
+}
+
+static bool mprotect_nosplit(void)
+{
+	return __mprotect(TEST_AREA, TEST_AREA_SIZE);
+}
+
+static bool mprotect_2way_split(void)
+{
+	return __mprotect(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
+}
+
+static bool mprotect_3way_split(void)
+{
+	return __mprotect(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
+}
+
+static inline bool __munmap(char *addr, int size)
+{
+	return munmap(addr, size) == 0;
+}
+
+static bool munmap_nosplit(void)
+{
+	return __munmap(TEST_AREA, TEST_AREA_SIZE);
+}
+
+static bool munmap_2way_split(void)
+{
+	return __munmap(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
+}
+
+static bool munmap_3way_split(void)
+{
+	return __munmap(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
+}
+
+/* mremap accounts for the worst case to fail early */
+static const int MREMAP_REQUIRED_VMA_SLOTS = 6;
+
+static bool mremap_dontunmap(void)
+{
+	void *new_addr;
+
+	/*
+	 * Using MREMAP_DONTUNMAP will create a new mapping without
+	 * removing the old one, consuming one VMA slot.
+	 */
+	new_addr = mremap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_SIZE,
+			  MREMAP_MAYMOVE | MREMAP_DONTUNMAP, NULL);
+
+	if (new_addr != MAP_FAILED)
+		munmap(new_addr, TEST_AREA_SIZE);
+
+	return new_addr != MAP_FAILED;
+}
+
+struct test {
+	const char *name;
+	bool (*test)(void);
+	/* How many VMA slots below the limit does this test need to start?
+	 */
+	int vma_slots_needed;
+	bool expect_success;
+};
+
+/* --- Test Cases --- */
+struct test tests[] = {
+	{
+		.name = "mmap_at_1_below_vma_count_limit",
+		.test = mmap_anon,
+		.vma_slots_needed = 1,
+		.expect_success = true,
+	},
+	{
+		.name = "mmap_at_vma_count_limit",
+		.test = mmap_anon,
+		.vma_slots_needed = 0,
+		.expect_success = false,
+	},
+	{
+		.name = "mprotect_nosplit_at_1_below_vma_count_limit",
+		.test = mprotect_nosplit,
+		.vma_slots_needed = 1,
+		.expect_success = true,
+	},
+	{
+		.name = "mprotect_nosplit_at_vma_count_limit",
+		.test = mprotect_nosplit,
+		.vma_slots_needed = 0,
+		.expect_success = true,
+	},
+	{
+		.name = "mprotect_2way_split_at_1_below_vma_count_limit",
+		.test = mprotect_2way_split,
+		.vma_slots_needed = 1,
+		.expect_success = true,
+	},
+	{
+		.name = "mprotect_2way_split_at_vma_count_limit",
+		.test = mprotect_2way_split,
+		.vma_slots_needed = 0,
+		.expect_success = false,
+	},
+	{
+		.name = "mprotect_3way_split_at_2_below_vma_count_limit",
+		.test = mprotect_3way_split,
+		.vma_slots_needed = 2,
+		.expect_success = true,
+	},
+	{
+		.name = "mprotect_3way_split_at_1_below_vma_count_limit",
+		.test = mprotect_3way_split,
+		.vma_slots_needed = 1,
+		.expect_success = false,
+	},
+	{
+		.name = "mprotect_3way_split_at_vma_count_limit",
+		.test = mprotect_3way_split,
+		.vma_slots_needed = 0,
+		.expect_success = false,
+	},
+	{
+		.name = "munmap_nosplit_at_1_below_vma_count_limit",
+		.test = munmap_nosplit,
+		.vma_slots_needed = 1,
+		.expect_success = true,
+	},
+	{
+		.name = "munmap_nosplit_at_vma_count_limit",
+		.test = munmap_nosplit,
+		.vma_slots_needed = 0,
+		.expect_success = true,
+	},
+	{
+		.name = "munmap_2way_split_at_1_below_vma_count_limit",
+		.test = munmap_2way_split,
+		.vma_slots_needed = 1,
+		.expect_success = true,
+	},
+	{
+		.name = "munmap_2way_split_at_vma_count_limit",
+		.test = munmap_2way_split,
+		.vma_slots_needed = 0,
+		.expect_success = true,
+	},
+	{
+		.name = "munmap_3way_split_at_2_below_vma_count_limit",
+		.test = munmap_3way_split,
+		.vma_slots_needed = 2,
+		.expect_success = true,
+	},
+	{
+		.name = "munmap_3way_split_at_1_below_vma_count_limit",
+		.test = munmap_3way_split,
+		.vma_slots_needed = 1,
+		.expect_success = true,
+	},
+	{
+		.name = "munmap_3way_split_at_vma_count_limit",
+		.test = munmap_3way_split,
+		.vma_slots_needed = 0,
+		.expect_success = false,
+	},
+	{
+		.name = "mremap_dontunmap_at_required_vma_count_capacity",
+		.test = mremap_dontunmap,
+		.vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS,
+		.expect_success = true,
+	},
+	{
+		.name = "mremap_dontunmap_at_1_below_required_vma_count_capacity",
+		.test = mremap_dontunmap,
+		.vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS - 1,
+		.expect_success = false,
+	},
+};
+
+/* --- Test Runner --- */
+int main(int argc, char **argv)
+{
+	int num_tests = ARRAY_SIZE(tests);
+	int failed_tests = 0;
+
+	ksft_set_plan(num_tests);
+
+	if (!test_suite_setup()) {
+		if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT)
+			ksft_exit_skip("max_map_count too high and cannot be lowered\n"
+				       "Please rerun as root.\n");
+		else
+			ksft_exit_fail_msg("Test suite setup failed. Aborting.\n");
+	}
+
+	for (int i = 0; i < num_tests; i++) {
+		int maps_to_unmap = tests[i].vma_slots_needed;
+		const char *name = tests[i].name;
+		bool test_passed;
+
+		errno = 0;
+
+		/* 1. Setup: TEST_AREA mapping */
+		if (mmap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_PROT,
+			 MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0)
+		    == MAP_FAILED) {
+			ksft_test_result_fail(
+				"%s: Test setup failed to map TEST_AREA\n",
+				name);
+			maps_to_unmap = 0;
+			goto fail;
+		}
+
+		/* Label TEST_AREA to ease debugging */
+		if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, TEST_AREA,
+			  TEST_AREA_SIZE, "TEST_AREA")) {
+			ksft_print_msg("WARNING: [%s] prctl(PR_SET_VMA) failed\n",
+				       name);
+			ksft_print_msg(
+				"Continuing without named TEST_AREA mapping\n");
+		}
+
+		/* 2. Setup: Adjust VMA count based on test requirements */
+		if (maps_to_unmap > NR_EXTRA_MAPS) {
+			ksft_test_result_fail(
+				"%s: Test setup failed: Invalid VMA slots required %d\n",
+				name, tests[i].vma_slots_needed);
+			maps_to_unmap = 0;
+			goto fail;
+		}
+
+		/* Unmap extra mappings, accounting for the 1-page gap */
+		for (int j = 0; j < maps_to_unmap; j++)
+			munmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
+			       EXTRA_MAP_SIZE);
+
+		/*
+		 * 3. Verify the preconditions.
+		 *
+		 * Sometimes there isn't an easy way to determine the cause
+		 * of a test failure, e.g. an mprotect ENOMEM may be due to
+		 * trying to protect an unmapped area or due to hitting the
+		 * MAX_VMA_COUNT limit.
+		 *
+		 * We verify the preconditions of the test to ensure any
+		 * expected failures are from the expected cause and not
+		 * coincidental.
+		 */
+		if (!is_current_vma_count(name,
+					  MAX_VMA_COUNT - tests[i].vma_slots_needed))
+			goto fail;
+
+		if (!is_test_area_mapped(name))
+			goto fail;
+
+		/* 4. Run the test */
+		test_passed = (tests[i].test() == tests[i].expect_success);
+		if (test_passed) {
+			ksft_test_result_pass("%s\n", name);
+		} else {
+fail:
+			failed_tests++;
+			ksft_test_result_fail(
+				"%s: current_vma_count=%d, max_vma_count=%d: errno: %d (%s)\n",
+				name, get_current_vma_count(), MAX_VMA_COUNT,
+				errno, strerror(errno));
+			print_surrounding_maps(name);
+		}
+
+		/* 5. Teardown: Unmap TEST_AREA. */
+		munmap(TEST_AREA, TEST_AREA_SIZE);
+
+		/* 6. Teardown: Restore extra mappings to test suite baseline */
+		for (int j = 0; j < maps_to_unmap; j++) {
+			/* Remap extra mappings, accounting for the gap */
+			mmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
+			     EXTRA_MAP_SIZE, EXTRA_MAP_PROT,
+			     MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
+			     -1, 0);
+		}
+	}
+
+	test_suite_teardown();
+
+	if (failed_tests > 0)
+		ksft_exit_fail();
+	else
+		ksft_exit_pass();
+}
+
+/* --- Utilities --- */
+
+static int get_max_vma_count(void)
+{
+	int max_count;
+	FILE *f;
+
+	f = fopen("/proc/sys/vm/max_map_count", "r");
+	if (!f)
+		return -1;
+
+	if (fscanf(f, "%d", &max_count) != 1)
+		max_count = -1;
+
+	fclose(f);
+
+	return max_count;
+}
+
+static bool set_max_vma_count(int val)
+{
+	FILE *f;
+	bool success = false;
+
+	f = fopen("/proc/sys/vm/max_map_count", "w");
+	if (!f)
+		return false;
+
+	if (fprintf(f, "%d", val) > 0)
+		success = true;
+
+	fclose(f);
+	return success;
+}
+
+static int get_current_vma_count(void)
+{
+	char line[1024];
+	int count = 0;
+	FILE *f;
+
+	f = fopen("/proc/self/maps", "r");
+	if (!f)
+		return -1;
+
+	while (fgets(line, sizeof(line), f)) {
+		if (!strstr(line, "[vsyscall]"))
+			count++;
+	}
+
+	fclose(f);
+
+	return count;
+}
+
+static bool is_current_vma_count(const char *msg, int expected)
+{
+	int current = get_current_vma_count();
+
+	if (current == expected)
+		return true;
+
+	ksft_print_msg("%s: vma count is %d, expected %d\n", msg, current,
+		       expected);
+	return false;
+}
+
+static bool is_test_area_mapped(const char *msg)
+{
+	unsigned long search_start = (unsigned long)TEST_AREA;
+	unsigned long search_end = search_start + TEST_AREA_SIZE;
+	bool found = false;
+	char line[1024];
+	FILE *f;
+
+	f = fopen("/proc/self/maps", "r");
+	if (!f) {
+		ksft_print_msg("failed to open /proc/self/maps\n");
+		return false;
+	}
+
+	while (fgets(line, sizeof(line), f)) {
+		unsigned long start, end;
+
+		if (sscanf(line, "%lx-%lx", &start, &end) != 2)
+			continue;
+
+		/* Check for an exact match of the range */
+		if (start == search_start && end == search_end) {
+			found = true;
+			break;
+		} else if (start > search_end) {
+			/*
+			 * Since maps are sorted, if we've passed the end, we
+			 * can stop searching.
+			 */
+			break;
+		}
+	}
+
+	fclose(f);
+
+	if (found)
+		return true;
+
+	/* Not found */
+	ksft_print_msg(
+		"%s: TEST_AREA is not mapped as a single contiguous block.\n",
+		msg);
+	print_surrounding_maps(msg);
+
+	return false;
+}
+
+static void print_surrounding_maps(const char *msg)
+{
+	unsigned long search_start = (unsigned long)TEST_AREA;
+	unsigned long search_end = search_start + TEST_AREA_SIZE;
+	unsigned long start;
+	unsigned long end;
+	char line[1024] = {};
+	int line_idx = 0;
+	int first_match_idx = -1;
+	int last_match_idx = -1;
+	FILE *f;
+
+	f = fopen("/proc/self/maps", "r");
+	if (!f)
+		return;
+
+	if (msg)
+		ksft_print_msg("%s\n", msg);
+
+	ksft_print_msg("--- Surrounding VMA entries for TEST_AREA (%p) ---\n",
+		       TEST_AREA);
+
+	/* First pass: Read all lines and find the range of matching entries */
+	fseek(f, 0, SEEK_SET); /* Rewind file */
+	while (fgets(line, sizeof(line), f)) {
+		if (sscanf(line, "%lx-%lx", &start, &end) != 2) {
+			line_idx++;
+			continue;
+		}
+
+		/* Check for any overlap */
+		if (start < search_end && end > search_start) {
+			if (first_match_idx == -1)
+				first_match_idx = line_idx;
+			last_match_idx = line_idx;
+		} else if (start > search_end) {
+			/*
+			 * Since maps are sorted, if we've passed the end, we
+			 * can stop searching.
+			 */
+			break;
+		}
+
+		line_idx++;
+	}
+
+	if (first_match_idx == -1) {
+		ksft_print_msg("TEST_AREA (%p) is not currently mapped.\n",
+			       TEST_AREA);
+	} else {
+		/* Second pass: Print the relevant lines */
+		fseek(f, 0, SEEK_SET); /* Rewind file */
+		line_idx = 0;
+		while (fgets(line, sizeof(line), f)) {
+			/* Print 2 lines before the first match */
+			if (line_idx >= first_match_idx - 2 &&
+			    line_idx < first_match_idx)
+				ksft_print_msg("   %s", line);
+
+			/* Print all matching TEST_AREA entries */
+			if (line_idx >= first_match_idx &&
+			    line_idx <= last_match_idx)
+				ksft_print_msg(">> %s", line);
+
+			/* Print 2 lines after the last match */
+			if (line_idx > last_match_idx &&
+			    line_idx <= last_match_idx + 2)
+				ksft_print_msg("   %s", line);
+
+			line_idx++;
+		}
+	}
+
+	ksft_print_msg("--------------------------------------------------\n");
+
+	fclose(f);
+}
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 471e539d82b8..3794b50ec280 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -49,6 +49,8 @@ separated by spaces:
 	test madvise(2) MADV_GUARD_INSTALL and MADV_GUARD_REMOVE options
 - madv_populate
 	test memadvise(2) MADV_POPULATE_{READ,WRITE} options
+- max_vma_count
+	tests for max vma_count
 - memfd_secret
 	test memfd_secret(2)
 - process_mrelease
@@ -417,6 +419,9 @@ fi # VADDR64
 # vmalloc stability smoke test
 CATEGORY="vmalloc" run_test bash ./test_vmalloc.sh smoke
+# test operations against max vma count limit
+CATEGORY="max_vma_count" run_test ./max_vma_count_tests
+
 CATEGORY="mremap" run_test ./mremap_dontunmap
CATEGORY="hmm" run_test bash ./test_hmm.sh smoke
The checks against sysctl_max_map_count are open-coded in multiple places. While simple checks are manageable, the logic in places like mremap.c involves arithmetic with magic numbers that can be difficult to reason about, e.g.:

    ... >= sysctl_max_map_count - 3
To improve readability and centralize the logic, introduce a new helper, vma_count_remaining(). This function returns the VMA count headroom available for a given process.
The most common case, checking whether there is capacity for a single new VMA, becomes:
if (!vma_count_remaining(mm))
And the complex checks in mremap.c become clearer by expressing the required capacity directly:
if (vma_count_remaining(mm) < 4)
While a capacity-based function could be misused (e.g., with an incorrect '<' vs '<=' comparison), the improved readability at the call sites makes such errors less likely than with the previous open-coded arithmetic.
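Concretely, with the default limit of 65530 and mm->map_count at 65528, vma_count_remaining() returns 2, so:

	if (vma_count_remaining(mm) < 4)	/* 2 < 4: fail early */
		return -ENOMEM;

matching the intent of the old open-coded arithmetic.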
As part of this change, sysctl_max_map_count is made static to mm/mmap.c to improve encapsulation.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
Changes in v2:
  - Fix documentation comment for vma_count_remaining(), per Mike
  - Remove extern in header, per Mike and Pedro
  - Move declaration to mm/internal.h, per Mike
  - Replace exceeds_max_map_count() with capacity-based
    vma_count_remaining(), per Lorenzo
  - Fix tools/testing/vma, per Lorenzo
 include/linux/mm.h               |  2 --
 mm/internal.h                    |  2 ++
 mm/mmap.c                        | 21 ++++++++++++++++++++-
 mm/mremap.c                      |  7 ++++---
 mm/nommu.c                       |  2 +-
 mm/util.c                        |  1 -
 mm/vma.c                         | 10 +++++-----
 tools/testing/vma/vma_internal.h |  9 +++++++++
 8 files changed, 41 insertions(+), 13 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 1ae97a0b8ec7..138bab2988f8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -192,8 +192,6 @@ static inline void __mm_zero_struct_page(struct page *page) #define MAPCOUNT_ELF_CORE_MARGIN (5) #define DEFAULT_MAX_MAP_COUNT (USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
-extern int sysctl_max_map_count; - extern unsigned long sysctl_user_reserve_kbytes; extern unsigned long sysctl_admin_reserve_kbytes;
diff --git a/mm/internal.h b/mm/internal.h index 45b725c3dc03..39f1c9535ae5 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1661,4 +1661,6 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end, void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm); int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
+int vma_count_remaining(const struct mm_struct *mm); + #endif /* __MM_INTERNAL_H */ diff --git a/mm/mmap.c b/mm/mmap.c index e5370e7fcd8f..af88ce1fbb5f 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -374,7 +374,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, return -EOVERFLOW;
/* Too many mappings? */ - if (mm->map_count >= sysctl_max_map_count) + if (!vma_count_remaining(mm)) return -ENOMEM;
/* @@ -1504,6 +1504,25 @@ struct vm_area_struct *_install_special_mapping( int sysctl_legacy_va_layout; #endif
+static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT; + +/** + * vma_count_remaining - Determine available VMA slots + * @mm: The memory descriptor for the process. + * + * Check how many more VMAs can be created for the given @mm + * before hitting the sysctl_max_map_count limit. + * + * Return: The number of new VMAs the process can accommodate. + */ +int vma_count_remaining(const struct mm_struct *mm) +{ + const int map_count = mm->map_count; + const int max_count = sysctl_max_map_count; + + return (max_count > map_count) ? (max_count - map_count) : 0; +} + static const struct ctl_table mmap_table[] = { { .procname = "max_map_count", diff --git a/mm/mremap.c b/mm/mremap.c index 35de0a7b910e..14d35d87e89b 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -1040,7 +1040,7 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm) * We'd prefer to avoid failure later on in do_munmap: * which may split one vma into three before unmapping. */ - if (current->mm->map_count >= sysctl_max_map_count - 3) + if (vma_count_remaining(current->mm) < 4) return -ENOMEM;
if (vma->vm_ops && vma->vm_ops->may_split) { @@ -1814,9 +1814,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm) * split in 3 before unmapping it. * That means 2 more maps (1 for each) to the ones we already hold. * Check whether current map count plus 2 still leads us to 4 maps below - * the threshold, otherwise return -ENOMEM here to be more safe. + * the threshold. In other words, is the current map count + 6 at or + * below the threshold? Otherwise return -ENOMEM here to be more safe. */ - if ((current->mm->map_count + 2) >= sysctl_max_map_count - 3) + if (vma_count_remaining(current->mm) < 6) return -ENOMEM;
return 0; diff --git a/mm/nommu.c b/mm/nommu.c index 8b819fafd57b..dd75f2334812 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1316,7 +1316,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, return -ENOMEM;
mm = vma->vm_mm; - if (mm->map_count >= sysctl_max_map_count) + if (!vma_count_remaining(mm)) return -ENOMEM;
region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL); diff --git a/mm/util.c b/mm/util.c index f814e6a59ab1..b6e83922cafe 100644 --- a/mm/util.c +++ b/mm/util.c @@ -751,7 +751,6 @@ EXPORT_SYMBOL(folio_mc_copy); int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS; static int sysctl_overcommit_ratio __read_mostly = 50; static unsigned long sysctl_overcommit_kbytes __read_mostly; -int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT; unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */ unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
diff --git a/mm/vma.c b/mm/vma.c index 033a388bc4b1..df0e8409f63d 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -491,8 +491,8 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma, }
/* - * __split_vma() bypasses sysctl_max_map_count checking. We use this where it - * has already been checked or doesn't make sense to fail. + * __split_vma() bypasses vma_count_remaining() checks. We use this where + * it has already been checked or doesn't make sense to fail. * VMA Iterator will point to the original VMA. */ static __must_check int @@ -592,7 +592,7 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, unsigned long addr, int new_below) { - if (vma->vm_mm->map_count >= sysctl_max_map_count) + if (!vma_count_remaining(vma->vm_mm)) return -ENOMEM;
return __split_vma(vmi, vma, addr, new_below); @@ -1345,7 +1345,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, * its limit temporarily, to help free resources as expected. */ if (vms->end < vms->vma->vm_end && - vms->vma->vm_mm->map_count >= sysctl_max_map_count) { + !vma_count_remaining(vms->vma->vm_mm)) { error = -ENOMEM; goto map_count_exceeded; } @@ -2772,7 +2772,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) return -ENOMEM;
- if (mm->map_count >= sysctl_max_map_count) + if (!vma_count_remaining(mm)) return -ENOMEM;
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT)) diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 3639aa8dd2b0..52cd7ddc73f4 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -1517,4 +1517,13 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi return vm_flags; }
+/* Helper to get VMA count capacity */ +static int vma_count_remaining(const struct mm_struct *mm) +{ + const int map_count = mm->map_count; + const int max_count = sysctl_max_map_count; + + return (max_count > map_count) ? (max_count - map_count) : 0; +} + #endif /* __MM_VMA_INTERNAL_H */
A mechanical rename of the mm_struct->map_count field to vma_count; no functional change is intended.
The name "map_count" is ambiguous within the memory management subsystem, as it can be confused with the folio/page->_mapcount field, which tracks PTE references.
The new name, vma_count, is more precise as this field has always counted the number of vm_area_structs associated with an mm_struct.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
Changes in v2:
  - map_count is easily confused with _mapcount; rename to vma_count,
    per David
 fs/binfmt_elf.c                  |  2 +-
 fs/coredump.c                    |  2 +-
 include/linux/mm_types.h         |  2 +-
 kernel/fork.c                    |  2 +-
 mm/debug.c                       |  2 +-
 mm/mmap.c                        |  6 +++---
 mm/nommu.c                       |  6 +++---
 mm/vma.c                         | 24 ++++++++++++------------
 tools/testing/vma/vma.c          | 32 ++++++++++++++++----------------
 tools/testing/vma/vma_internal.h |  6 +++---
 10 files changed, 42 insertions(+), 42 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 264fba0d44bd..52449dec12cb 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -1643,7 +1643,7 @@ static int fill_files_note(struct memelfnote *note, struct coredump_params *cprm data[0] = count; data[1] = PAGE_SIZE; /* - * Count usually is less than mm->map_count, + * Count usually is less than mm->vma_count, * we need to move filenames down. */ n = cprm->vma_count - count; diff --git a/fs/coredump.c b/fs/coredump.c index 60bc9685e149..8881459c53d9 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -1731,7 +1731,7 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
cprm->vma_data_size = 0; gate_vma = get_gate_vma(mm); - cprm->vma_count = mm->map_count + (gate_vma ? 1 : 0); + cprm->vma_count = mm->vma_count + (gate_vma ? 1 : 0);
cprm->vma_meta = kvmalloc_array(cprm->vma_count, sizeof(*cprm->vma_meta), GFP_KERNEL); if (!cprm->vma_meta) { diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 08bc2442db93..4343be2f9e85 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1020,7 +1020,7 @@ struct mm_struct { #ifdef CONFIG_MMU atomic_long_t pgtables_bytes; /* size of all page tables */ #endif - int map_count; /* number of VMAs */ + int vma_count; /* number of VMAs */
spinlock_t page_table_lock; /* Protects page tables and some * counters diff --git a/kernel/fork.c b/kernel/fork.c index c4ada32598bd..8fcbbf947579 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1037,7 +1037,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, mmap_init_lock(mm); INIT_LIST_HEAD(&mm->mmlist); mm_pgtables_bytes_init(mm); - mm->map_count = 0; + mm->vma_count = 0; mm->locked_vm = 0; atomic64_set(&mm->pinned_vm, 0); memset(&mm->rss_stat, 0, sizeof(mm->rss_stat)); diff --git a/mm/debug.c b/mm/debug.c index b4388f4dcd4d..40fc9425a84a 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -204,7 +204,7 @@ void dump_mm(const struct mm_struct *mm) mm->pgd, atomic_read(&mm->mm_users), atomic_read(&mm->mm_count), mm_pgtables_bytes(mm), - mm->map_count, + mm->vma_count, mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm, (u64)atomic64_read(&mm->pinned_vm), mm->data_vm, mm->exec_vm, mm->stack_vm, diff --git a/mm/mmap.c b/mm/mmap.c index af88ce1fbb5f..c6769394a174 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1308,7 +1308,7 @@ void exit_mmap(struct mm_struct *mm) vma = vma_next(&vmi); } while (vma && likely(!xa_is_zero(vma)));
- BUG_ON(count != mm->map_count); + BUG_ON(count != mm->vma_count);
trace_exit_mmap(mm); destroy: @@ -1517,7 +1517,7 @@ static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT; */ int vma_count_remaining(const struct mm_struct *mm) { - const int map_count = mm->map_count; + const int map_count = mm->vma_count; const int max_count = sysctl_max_map_count;
return (max_count > map_count) ? (max_count - map_count) : 0; @@ -1828,7 +1828,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) */ vma_iter_bulk_store(&vmi, tmp);
- mm->map_count++; + mm->vma_count++;
if (tmp->vm_ops && tmp->vm_ops->open) tmp->vm_ops->open(tmp); diff --git a/mm/nommu.c b/mm/nommu.c index dd75f2334812..9ab2e5ca736d 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -576,7 +576,7 @@ static void setup_vma_to_mm(struct vm_area_struct *vma, struct mm_struct *mm)
static void cleanup_vma_from_mm(struct vm_area_struct *vma) { - vma->vm_mm->map_count--; + vma->vm_mm->vma_count--; /* remove the VMA from the mapping */ if (vma->vm_file) { struct address_space *mapping; @@ -1198,7 +1198,7 @@ unsigned long do_mmap(struct file *file, goto error_just_free;
setup_vma_to_mm(vma, current->mm); - current->mm->map_count++; + current->mm->vma_count++; /* add the VMA to the tree */ vma_iter_store_new(&vmi, vma);
@@ -1366,7 +1366,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, setup_vma_to_mm(vma, mm); setup_vma_to_mm(new, mm); vma_iter_store_new(vmi, new); - mm->map_count++; + mm->vma_count++; return 0;
err_vmi_preallocate: diff --git a/mm/vma.c b/mm/vma.c index df0e8409f63d..64f4e7c867c3 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -352,7 +352,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi, * (it may either follow vma or precede it). */ vma_iter_store_new(vmi, vp->insert); - mm->map_count++; + mm->vma_count++; }
if (vp->anon_vma) { @@ -383,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi, } if (vp->remove->anon_vma) anon_vma_merge(vp->vma, vp->remove); - mm->map_count--; + mm->vma_count--; mpol_put(vma_policy(vp->remove)); if (!vp->remove2) WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end); @@ -683,13 +683,13 @@ void validate_mm(struct mm_struct *mm) } #endif /* Check for a infinite loop */ - if (++i > mm->map_count + 10) { + if (++i > mm->vma_count + 10) { i = -1; break; } } - if (i != mm->map_count) { - pr_emerg("map_count %d vma iterator %d\n", mm->map_count, i); + if (i != mm->vma_count) { + pr_emerg("vma_count %d vma iterator %d\n", mm->vma_count, i); bug = 1; } VM_BUG_ON_MM(bug, mm); @@ -1266,7 +1266,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, struct mm_struct *mm;
mm = current->mm; - mm->map_count -= vms->vma_count; + mm->vma_count -= vms->vma_count; mm->locked_vm -= vms->locked_vm; if (vms->unlock) mmap_write_downgrade(mm); @@ -1340,14 +1340,14 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, if (vms->start > vms->vma->vm_start) {
/* - * Make sure that map_count on return from munmap() will + * Make sure that vma_count on return from munmap() will * not exceed its limit; but let map_count go just above * its limit temporarily, to help free resources as expected. */ if (vms->end < vms->vma->vm_end && !vma_count_remaining(vms->vma->vm_mm)) { error = -ENOMEM; - goto map_count_exceeded; + goto vma_count_exceeded; }
/* Don't bother splitting the VMA if we can't unmap it anyway */ @@ -1461,7 +1461,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms, modify_vma_failed: reattach_vmas(mas_detach); start_split_failed: -map_count_exceeded: +vma_count_exceeded: return error; }
@@ -1795,7 +1795,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma) vma_start_write(vma); vma_iter_store_new(&vmi, vma); vma_link_file(vma); - mm->map_count++; + mm->vma_count++; validate_mm(mm); return 0; } @@ -2495,7 +2495,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap) /* Lock the VMA since it is modified after insertion into VMA tree */ vma_start_write(vma); vma_iter_store_new(vmi, vma); - map->mm->map_count++; + map->mm->vma_count++; vma_link_file(vma);
/* @@ -2810,7 +2810,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL)) goto mas_store_fail;
- mm->map_count++; + mm->vma_count++; validate_mm(mm); out: perf_event_mmap(vma); diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 656e1c75b711..69fa7d14a6c2 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -261,7 +261,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi) }
mtree_destroy(&mm->mm_mt); - mm->map_count = 0; + mm->vma_count = 0; return count; }
@@ -500,7 +500,7 @@ static bool test_merge_new(void) INIT_LIST_HEAD(&vma_d->anon_vma_chain); list_add(&dummy_anon_vma_chain_d.same_vma, &vma_d->anon_vma_chain); ASSERT_FALSE(merged); - ASSERT_EQ(mm.map_count, 4); + ASSERT_EQ(mm.vma_count, 4);
/* * Merge BOTH sides. @@ -519,7 +519,7 @@ static bool test_merge_new(void) ASSERT_EQ(vma->vm_pgoff, 0); ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 3); + ASSERT_EQ(mm.vma_count, 3);
/* * Merge to PREVIOUS VMA. @@ -536,7 +536,7 @@ static bool test_merge_new(void) ASSERT_EQ(vma->vm_pgoff, 0); ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 3); + ASSERT_EQ(mm.vma_count, 3);
/* * Merge to NEXT VMA. @@ -555,7 +555,7 @@ static bool test_merge_new(void) ASSERT_EQ(vma->vm_pgoff, 6); ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 3); + ASSERT_EQ(mm.vma_count, 3);
/* * Merge BOTH sides. @@ -573,7 +573,7 @@ static bool test_merge_new(void) ASSERT_EQ(vma->vm_pgoff, 0); ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 2); + ASSERT_EQ(mm.vma_count, 2);
/* * Merge to NEXT VMA. @@ -591,7 +591,7 @@ static bool test_merge_new(void) ASSERT_EQ(vma->vm_pgoff, 0xa); ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 2); + ASSERT_EQ(mm.vma_count, 2);
/* * Merge BOTH sides. @@ -608,7 +608,7 @@ static bool test_merge_new(void) ASSERT_EQ(vma->vm_pgoff, 0); ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 1); + ASSERT_EQ(mm.vma_count, 1);
/* * Final state. @@ -967,7 +967,7 @@ static bool test_vma_merge_new_with_close(void) ASSERT_EQ(vma->vm_pgoff, 0); ASSERT_EQ(vma->vm_ops, &vm_ops); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 2); + ASSERT_EQ(mm.vma_count, 2);
cleanup_mm(&mm, &vmi); return true; @@ -1017,7 +1017,7 @@ static bool test_merge_existing(void) ASSERT_EQ(vma->vm_pgoff, 2); ASSERT_TRUE(vma_write_started(vma)); ASSERT_TRUE(vma_write_started(vma_next)); - ASSERT_EQ(mm.map_count, 2); + ASSERT_EQ(mm.vma_count, 2);
/* Clear down and reset. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); @@ -1045,7 +1045,7 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_next->vm_pgoff, 2); ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_next)); - ASSERT_EQ(mm.map_count, 1); + ASSERT_EQ(mm.vma_count, 1);
/* Clear down and reset. We should have deleted vma. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); @@ -1079,7 +1079,7 @@ static bool test_merge_existing(void) ASSERT_EQ(vma->vm_pgoff, 6); ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 2); + ASSERT_EQ(mm.vma_count, 2);
/* Clear down and reset. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); @@ -1108,7 +1108,7 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->vm_pgoff, 0); ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); - ASSERT_EQ(mm.map_count, 1); + ASSERT_EQ(mm.vma_count, 1);
/* Clear down and reset. We should have deleted vma. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); @@ -1138,7 +1138,7 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->vm_pgoff, 0); ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); - ASSERT_EQ(mm.map_count, 1); + ASSERT_EQ(mm.vma_count, 1);
/* Clear down and reset. We should have deleted prev and next. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); @@ -1540,7 +1540,7 @@ static bool test_merge_extend(void) ASSERT_EQ(vma->vm_end, 0x4000); ASSERT_EQ(vma->vm_pgoff, 0); ASSERT_TRUE(vma_write_started(vma)); - ASSERT_EQ(mm.map_count, 1); + ASSERT_EQ(mm.vma_count, 1);
cleanup_mm(&mm, &vmi); return true; @@ -1652,7 +1652,7 @@ static bool test_mmap_region_basic(void) 0x24d, NULL); ASSERT_EQ(addr, 0x24d000);
- ASSERT_EQ(mm.map_count, 2); + ASSERT_EQ(mm.vma_count, 2);
for_each_vma(vmi, vma) { if (vma->vm_start == 0x300000) { diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 52cd7ddc73f4..15525b86145d 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -251,7 +251,7 @@ struct mutex {};
struct mm_struct { struct maple_tree mm_mt; - int map_count; /* number of VMAs */ + int vma_count; /* number of VMAs */ unsigned long total_vm; /* Total pages mapped */ unsigned long locked_vm; /* Pages that have PG_mlocked set */ unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */ @@ -1520,10 +1520,10 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi /* Helper to get VMA count capacity */ static int vma_count_remaining(const struct mm_struct *mm) { - const int map_count = mm->map_count; + const int vma_count = mm->vma_count; const int max_count = sysctl_max_map_count;
- return (max_count > map_count) ? (max_count - map_count) : 0; + return (max_count > vma_count) ? (max_count - vma_count) : 0; }
#endif /* __MM_VMA_INTERNAL_H */
To make VMA counting more robust, prevent direct modification of the mm->vma_count field. This is achieved by making the public-facing member const via a union and requiring all modifications to go through a new set of helper functions that operate on a private __vma_count.

While there are no other invariants tied to vma_count currently, this structural change improves maintainability, as it creates a single, centralized point for any future logic, such as adding debug checks or updating related statistics (in subsequent patches).
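As a quick illustration (a sketch, not part of the patch): reads through the public member still work, while a stray direct write no longer compiles:

	int n = mm->vma_count;	/* OK: reads are unchanged */

	mm->vma_count = 0;	/* compile error: assignment of read-only
				 * member 'vma_count' */
	vma_count_init(mm);	/* writers must go through the helpers */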
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
 include/linux/mm.h               | 25 +++++++++++++++++++++++++
 include/linux/mm_types.h         |  5 ++++-
 kernel/fork.c                    |  2 +-
 mm/mmap.c                        |  2 +-
 mm/vma.c                         | 12 ++++++------
 tools/testing/vma/vma.c          |  2 +-
 tools/testing/vma/vma_internal.h | 30 +++++++++++++++++++++++++++++-
 7 files changed, 67 insertions(+), 11 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 138bab2988f8..8bad1454984c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4219,4 +4219,29 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
void snapshot_page(struct page_snapshot *ps, const struct page *page);
+static inline void vma_count_init(struct mm_struct *mm) +{ + ACCESS_PRIVATE(mm, __vma_count) = 0; +} + +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas) +{ + ACCESS_PRIVATE(mm, __vma_count) += nr_vmas; +} + +static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas) +{ + vma_count_add(mm, -nr_vmas); +} + +static inline void vma_count_inc(struct mm_struct *mm) +{ + vma_count_add(mm, 1); +} + +static inline void vma_count_dec(struct mm_struct *mm) +{ + vma_count_sub(mm, 1); +} + #endif /* _LINUX_MM_H */ diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 4343be2f9e85..2ea8fc722aa2 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1020,7 +1020,10 @@ struct mm_struct { #ifdef CONFIG_MMU atomic_long_t pgtables_bytes; /* size of all page tables */ #endif - int vma_count; /* number of VMAs */ + union { + const int vma_count; /* number of VMAs */ + int __private __vma_count; + };
spinlock_t page_table_lock; /* Protects page tables and some * counters diff --git a/kernel/fork.c b/kernel/fork.c index 8fcbbf947579..ea9eff416e51 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1037,7 +1037,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, mmap_init_lock(mm); INIT_LIST_HEAD(&mm->mmlist); mm_pgtables_bytes_init(mm); - mm->vma_count = 0; + vma_count_init(mm); mm->locked_vm = 0; atomic64_set(&mm->pinned_vm, 0); memset(&mm->rss_stat, 0, sizeof(mm->rss_stat)); diff --git a/mm/mmap.c b/mm/mmap.c index c6769394a174..30ddd550197e 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1828,7 +1828,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) */ vma_iter_bulk_store(&vmi, tmp);
- mm->vma_count++; + vma_count_inc(mm);
if (tmp->vm_ops && tmp->vm_ops->open) tmp->vm_ops->open(tmp); diff --git a/mm/vma.c b/mm/vma.c index 64f4e7c867c3..0cd3cb472220 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -352,7 +352,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi, * (it may either follow vma or precede it). */ vma_iter_store_new(vmi, vp->insert); - mm->vma_count++; + vma_count_inc(mm); }
if (vp->anon_vma) { @@ -383,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi, } if (vp->remove->anon_vma) anon_vma_merge(vp->vma, vp->remove); - mm->vma_count--; + vma_count_dec(mm); mpol_put(vma_policy(vp->remove)); if (!vp->remove2) WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end); @@ -1266,7 +1266,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms, struct mm_struct *mm;
mm = current->mm; - mm->vma_count -= vms->vma_count; + vma_count_sub(mm, vms->vma_count); mm->locked_vm -= vms->locked_vm; if (vms->unlock) mmap_write_downgrade(mm); @@ -1795,7 +1795,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma) vma_start_write(vma); vma_iter_store_new(&vmi, vma); vma_link_file(vma); - mm->vma_count++; + vma_count_inc(mm); validate_mm(mm); return 0; } @@ -2495,7 +2495,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap) /* Lock the VMA since it is modified after insertion into VMA tree */ vma_start_write(vma); vma_iter_store_new(vmi, vma); - map->mm->vma_count++; + vma_count_inc(map->mm); vma_link_file(vma);
/* @@ -2810,7 +2810,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL)) goto mas_store_fail;
- mm->vma_count++; + vma_count_inc(mm); validate_mm(mm); out: perf_event_mmap(vma); diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 69fa7d14a6c2..ee5a1e2365e0 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -261,7 +261,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi) }
mtree_destroy(&mm->mm_mt); - mm->vma_count = 0; + vma_count_init(mm); return count; }
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 15525b86145d..6e724ba1adf4 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -251,7 +251,10 @@ struct mutex {};
struct mm_struct { struct maple_tree mm_mt; - int vma_count; /* number of VMAs */ + union { + const int vma_count; /* number of VMAs */ + int __vma_count; + }; unsigned long total_vm; /* Total pages mapped */ unsigned long locked_vm; /* Pages that have PG_mlocked set */ unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */ @@ -1526,4 +1529,29 @@ static int vma_count_remaining(const struct mm_struct *mm) return (max_count > vma_count) ? (max_count - vma_count) : 0; }
+static inline void vma_count_init(struct mm_struct *mm) +{ + mm->__vma_count = 0; +} + +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas) +{ + mm->__vma_count += nr_vmas; +} + +static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas) +{ + vma_count_add(mm, -nr_vmas); +} + +static inline void vma_count_inc(struct mm_struct *mm) +{ + vma_count_add(mm, 1); +} + +static inline void vma_count_dec(struct mm_struct *mm) +{ + vma_count_sub(mm, 1); +} + #endif /* __MM_VMA_INTERNAL_H */
Building on the vma_count helpers, add a VM_WARN_ON_ONCE() to detect cases where the VMA count exceeds the sysctl_max_map_count limit.
This check will help catch future bugs or regressions where VMAs are allocated in excess of the limit.

The warning is placed in the main vma_count_*() helpers, while the internal *_nocheck variants bypass it. The _nocheck helpers ensure that the assertion does not trigger a false positive in the legitimate case where a VMA split in munmap() temporarily pushes the count past the limit.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
Changes in v2:
  - Add assertions for exceeding the max_vma_count limit, per Pedro
 include/linux/mm.h               | 12 ++++++--
 mm/internal.h                    |  1 -
 mm/vma.c                         | 49 +++++++++++++++++++++++++-------
 tools/testing/vma/vma_internal.h |  7 ++++-
 4 files changed, 55 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 8bad1454984c..3a3749d7015c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4219,19 +4219,27 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
void snapshot_page(struct page_snapshot *ps, const struct page *page);
+int vma_count_remaining(const struct mm_struct *mm); + static inline void vma_count_init(struct mm_struct *mm) { ACCESS_PRIVATE(mm, __vma_count) = 0; }
-static inline void vma_count_add(struct mm_struct *mm, int nr_vmas) +static inline void __vma_count_add_nocheck(struct mm_struct *mm, int nr_vmas) { ACCESS_PRIVATE(mm, __vma_count) += nr_vmas; }
+static inline void vma_count_add(struct mm_struct *mm, int nr_vmas) +{ + VM_WARN_ON_ONCE(!vma_count_remaining(mm)); + __vma_count_add_nocheck(mm, nr_vmas); +} + static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas) { - vma_count_add(mm, -nr_vmas); + __vma_count_add_nocheck(mm, -nr_vmas); }
static inline void vma_count_inc(struct mm_struct *mm) diff --git a/mm/internal.h b/mm/internal.h index 39f1c9535ae5..e0567a3b64fa 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1661,6 +1661,5 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end, void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm); int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
-int vma_count_remaining(const struct mm_struct *mm);
#endif /* __MM_INTERNAL_H */ diff --git a/mm/vma.c b/mm/vma.c index 0cd3cb472220..0e4fcaebe209 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -323,15 +323,17 @@ static void vma_prepare(struct vma_prepare *vp) }
 /*
- * vma_complete- Helper function for handling the unlocking after altering VMAs,
- * or for inserting a VMA.
+ * This is the internal, unsafe version of vma_complete(). Unlike its
+ * wrapper, this function bypasses runtime checks for VMA count limits by
+ * using the _nocheck vma_count* helpers.
  *
- * @vp: The vma_prepare struct
- * @vmi: The vma iterator
- * @mm: The mm_struct
+ * Its use is restricted to __split_vma() where the VMA count can be
+ * temporarily higher than the sysctl_max_map_count limit.
+ *
+ * All other callers must use vma_complete().
  */
-static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
-			 struct mm_struct *mm)
+static void __vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
+			   struct mm_struct *mm)
 {
 	if (vp->file) {
 		if (vp->adj_next)
@@ -352,7 +354,11 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 		 * (it may either follow vma or precede it).
 		 */
 		vma_iter_store_new(vmi, vp->insert);
-		vma_count_inc(mm);
+		/*
+		 * Explicitly allow vma_count to exceed the threshold to avoid
+		 * blocking munmap() from freeing resources.
+		 */
+		__vma_count_add_nocheck(mm, 1);
 	}
 
 	if (vp->anon_vma) {
@@ -403,6 +409,26 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 		uprobe_mmap(vp->insert);
 }
 
+/*
+ * vma_complete- Helper function for handling the unlocking after altering VMAs,
+ * or for inserting a VMA.
+ *
+ * @vp: The vma_prepare struct
+ * @vmi: The vma iterator
+ * @mm: The mm_struct
+ */
+static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
+		struct mm_struct *mm)
+{
+	/*
+	 * __vma_complete() explicitly foregoes checking the new
+	 * vma_count against the sysctl_max_map_count limit, so
+	 * do it here.
+	 */
+	VM_WARN_ON_ONCE(!vma_count_remaining(mm));
+	__vma_complete(vp, vmi, mm);
+}
+
 /*
  * init_vma_prep() - Initializer wrapper for vma_prepare struct
  * @vp: The vma_prepare struct
@@ -564,8 +590,11 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		vma->vm_end = addr;
 	}
 
-	/* vma_complete stores the new vma */
-	vma_complete(&vp, vmi, vma->vm_mm);
+	/*
+	 * __vma_complete stores the new vma without checking against the
+	 * sysctl_max_map_count (vma_count) limit.
+	 */
+	__vma_complete(&vp, vmi, vma->vm_mm);
 	validate_mm(vma->vm_mm);
 
 	/* Success. */
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 6e724ba1adf4..d084b1eb2a5c 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -1534,11 +1534,16 @@ static inline void vma_count_init(struct mm_struct *mm)
 	mm->__vma_count = 0;
 }
 
-static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
+static inline void __vma_count_add_nocheck(struct mm_struct *mm, int nr_vmas)
 {
 	mm->__vma_count += nr_vmas;
 }
 
+static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
+{
+	__vma_count_add_nocheck(mm, nr_vmas);
+}
+
 static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas)
 {
 	vma_count_add(mm, -nr_vmas);
The needed observability for devices in the field can be collected with minimal overhead and can be toggled on and off. Event-driven telemetry can be done with tracepoint BPF programs.

The process comm is provided for aggregation across devices, and the tgid enables per-process aggregation on each device.

This allows for observing the distribution of such failures in the field, to determine whether they indicate legitimate bugs or whether a bump to the limit is warranted.
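As an illustration of such event-driven telemetry, the sketch below counts limit hits per tgid. It is illustrative only and not part of this series: the context struct is hand-derived from the event's TP_STRUCT__entry (normally a generated vmlinux.h would provide it), so treat the layout and all names as assumptions.

// SPDX-License-Identifier: GPL-2.0
/* Hypothetical BPF counter for the proposed tracepoint (not part of
 * this series). The context layout is hand-derived from
 * TP_STRUCT__entry: 8 bytes of common fields, then the __data_loc
 * offset for comm, then tgid. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct max_vma_count_exceeded_ctx {
	unsigned long long common;	/* common tracepoint fields */
	unsigned int comm;		/* __data_loc char[] comm */
	int tgid;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, int);			/* tgid */
	__type(value, unsigned long long);	/* failure count */
} vma_limit_hits SEC(".maps");

SEC("tracepoint/vma/max_vma_count_exceeded")
int count_hits(struct max_vma_count_exceeded_ctx *ctx)
{
	unsigned long long init = 1, *cnt;
	int tgid = ctx->tgid;

	/* Count failures per process; the map can be read out
	 * periodically from userspace for aggregation. */
	cnt = bpf_map_lookup_elem(&vma_limit_hits, &tgid);
	if (cnt)
		__sync_fetch_and_add(cnt, 1);
	else
		bpf_map_update_elem(&vma_limit_hits, &tgid, &init, BPF_ANY);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";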
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
Changes in v2:
- Add the needed observability for operations failing due to the VMA
  count limit, per Minchan. (Since the checks are external to the
  capacity-based vma_count_remaining() helper, there is no common point
  for debug logging; I used a trace event for its low overhead and to
  facilitate event-driven telemetry on devices in the field.)
 include/trace/events/vma.h | 32 ++++++++++++++++++++++++++++++++
 mm/mmap.c                  |  5 ++++-
 mm/mremap.c                | 10 ++++++++--
 mm/vma.c                   | 11 +++++++++--
 4 files changed, 53 insertions(+), 5 deletions(-)
 create mode 100644 include/trace/events/vma.h
diff --git a/include/trace/events/vma.h b/include/trace/events/vma.h
new file mode 100644
index 000000000000..2fed63b0d0a6
--- /dev/null
+++ b/include/trace/events/vma.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vma
+
+#if !defined(_TRACE_VMA_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VMA_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(max_vma_count_exceeded,
+
+	TP_PROTO(struct task_struct *task),
+
+	TP_ARGS(task),
+
+	TP_STRUCT__entry(
+		__string(comm, task->comm)
+		__field(pid_t, tgid)
+	),
+
+	TP_fast_assign(
+		__assign_str(comm);
+		__entry->tgid = task->tgid;
+	),
+
+	TP_printk("comm=%s tgid=%d", __get_str(comm), __entry->tgid)
+);
+
+#endif /* _TRACE_VMA_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/mmap.c b/mm/mmap.c
index 30ddd550197e..0bb311bf48f3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -56,6 +56,7 @@
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/mmap.h>
+#include <trace/events/vma.h>
 
 #include "internal.h"
 
@@ -374,8 +375,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		return -EOVERFLOW;
 
 	/* Too many mappings? */
-	if (!vma_count_remaining(mm))
+	if (!vma_count_remaining(mm)) {
+		trace_max_vma_count_exceeded(current);
 		return -ENOMEM;
+	}
 
 	/*
 	 * addr is returned from get_unmapped_area,
diff --git a/mm/mremap.c b/mm/mremap.c
index 14d35d87e89b..f42ac05f0069 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -30,6 +30,8 @@
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 
+#include <trace/events/vma.h>
+
 #include "internal.h"
 
 /* Classify the kind of remap operation being performed. */
@@ -1040,8 +1042,10 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm)
 	 * We'd prefer to avoid failure later on in do_munmap:
 	 * which may split one vma into three before unmapping.
 	 */
-	if (vma_count_remaining(current->mm) < 4)
+	if (vma_count_remaining(current->mm) < 4) {
+		trace_max_vma_count_exceeded(current);
 		return -ENOMEM;
+	}
 
 	if (vma->vm_ops && vma->vm_ops->may_split) {
 		if (vma->vm_start != old_addr)
@@ -1817,8 +1821,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
 	 * the threshold. In other words, is the current map count + 6 at or
 	 * below the threshold? Otherwise return -ENOMEM here to be more safe.
 	 */
-	if (vma_count_remaining(current->mm) < 6)
+	if (vma_count_remaining(current->mm) < 6) {
+		trace_max_vma_count_exceeded(current);
 		return -ENOMEM;
+	}
 
 	return 0;
 }
diff --git a/mm/vma.c b/mm/vma.c
index 0e4fcaebe209..692c33c3e84d 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -7,6 +7,8 @@
 #include "vma_internal.h"
 #include "vma.h"
 
+#include <trace/events/vma.h>
+
 struct mmap_state {
 	struct mm_struct *mm;
 	struct vma_iterator *vmi;
@@ -621,8 +623,10 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		     unsigned long addr, int new_below)
 {
-	if (!vma_count_remaining(vma->vm_mm))
+	if (!vma_count_remaining(vma->vm_mm)) {
+		trace_max_vma_count_exceeded(current);
 		return -ENOMEM;
+	}
 
 	return __split_vma(vmi, vma, addr, new_below);
 }
@@ -1375,6 +1379,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 	 */
 	if (vms->end < vms->vma->vm_end &&
 	    !vma_count_remaining(vms->vma->vm_mm)) {
+		trace_max_vma_count_exceeded(current);
 		error = -ENOMEM;
 		goto vma_count_exceeded;
 	}
@@ -2801,8 +2806,10 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
-	if (!vma_count_remaining(mm))
+	if (!vma_count_remaining(mm)) {
+		trace_max_vma_count_exceeded(current);
 		return -ENOMEM;
+	}
 
 	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
 		return -ENOMEM;
On Mon, 15 Sep 2025 09:36:38 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:
> The needed observability for devices in the field can be collected with
> minimal overhead and can be toggled on and off. Event-driven telemetry
> can be done with tracepoint BPF programs.
>
> The process comm is provided for aggregation across devices, and the
> tgid enables per-process aggregation on each device.
What do you mean about comm being used for aggregation across devices? What's special about this trace event that will make it used across devices?
Note, if BPF is being used, can't the BPF program just add the current comm? Why waste space in the ring buffer for it?
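For instance, something like this (an illustrative libbpf-style sketch, not from this series; names are hypothetical) avoids storing comm in the event at all:

/* Illustrative sketch: the BPF program recovers comm on its own
 * via bpf_get_current_comm() instead of the event recording it. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("tracepoint/vma/max_vma_count_exceeded")
int note_comm(void *ctx)
{
	char comm[16];	/* TASK_COMM_LEN */

	bpf_get_current_comm(comm, sizeof(comm));
	bpf_printk("max_vma_count exceeded by %s", comm);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";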
> +TRACE_EVENT(max_vma_count_exceeded,
> +
> +	TP_PROTO(struct task_struct *task),
Why pass in the task if it's always going to be current?
> +
> +	TP_ARGS(task),
> +
> +	TP_STRUCT__entry(
> +		__string(comm, task->comm)
This could be:
__string(comm, current)
But I still want to know what makes this trace event special over other trace events to store this, and can't it be retrieved another way, especially if BPF is being used to hook to it?
-- Steve
> +		__field(pid_t, tgid)
> +	),
> +
> +	TP_fast_assign(
> +		__assign_str(comm);
> +		__entry->tgid = task->tgid;
> +	),
> +
> +	TP_printk("comm=%s tgid=%d", __get_str(comm), __entry->tgid)
> +);
On Mon, 15 Sep 2025 09:36:31 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:
> Hi all,
>
> This is v2 to the VMA count patch I previously posted at:
> https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/
>
> I've split it into multiple patches to address the feedback.
>
> The main changes in v2 are:
>
> - Use a capacity-based check for VMA count limit, per Lorenzo.
> - Rename map_count to vma_count, per David.
> - Add assertions for exceeding the limit, per Pedro.
> - Add tests for max_vma_count, per Liam.
> - Emit a trace event for failure due to insufficient capacity for
>   observability.
>
> Tested on x86_64 and arm64:
>
> - Build test:
>   - allyesconfig for rename
>
> - Selftests:
>   cd tools/testing/selftests/mm && \
>   make && \
>   ./run_vmtests.sh -t max_vma_count
>   (With trace_max_vma_count_exceeded enabled)
>
> - vma tests:
>   cd tools/testing/vma && \
>   make && \
>   ./vma
fwiw, there's nothing in the above which is usable in a [0/N] overview.
While useful, the "what changed since the previous version" info isn't a suitable thing to carry in the permanent kernel record - it's short-term transient stuff, not helpful to someone who is looking at the patchset in 2029.
Similarly, the "how it was tested" material is also useful, but it becomes irrelevant as soon as the code hits linux-next and mainline.
Anyhow, this -rc cycle has been quite the firehose in MM and I'm feeling a need to slow things down for additional stabilization and so people hopefully get additional bandwidth to digest the material we've added this far. So I think I'll just cherrypick [1/7] for now. A great flood of positive review activity would probably make me revisit that ;)
On Mon, Sep 15, 2025 at 3:34 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> On Mon, 15 Sep 2025 09:36:31 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:
>
> [...]
>
> fwiw, there's nothing in the above which is usable in a [0/N] overview.
>
> While useful, the "what changed since the previous version" info isn't
> a suitable thing to carry in the permanent kernel record - it's
> short-term transient stuff, not helpful to someone who is looking at
> the patchset in 2029.
>
> Similarly, the "how it was tested" material is also useful, but it
> becomes irrelevant as soon as the code hits linux-next and mainline.
Hi Andrew,
Thanks for the feedback. Do you mean the cover letter was not needed in this case or that it lacked enough context?
> Anyhow, this -rc cycle has been quite the firehose in MM and I'm
> feeling a need to slow things down for additional stabilization and so
> people hopefully get additional bandwidth to digest the material we've
> added this far. So I think I'll just cherrypick [1/7] for now. A great
> flood of positive review activity would probably make me revisit that ;)
I understand. Yes, 1/7 is all we need for now, since it prevents an unrecoverable situation: once we get over the limit we cannot recover, as munmap() will then always fail.
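For reference, that failure mode can be sketched from userspace roughly like this (a hypothetical repro, not the selftest from this series): fill the VMA table until mmap() fails, then punch a hole, which requires a split.

/* Hypothetical repro sketch, not the selftest from this series. */
#include <err.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *big, *p;
	int i;

	/* A 3-page VMA whose middle page we will try to unmap later. */
	big = mmap(NULL, 3 * page, PROT_READ,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (big == MAP_FAILED)
		err(1, "mmap");

	/* Fill the table: alternating protections prevent merging, so
	 * each iteration consumes one VMA until mmap() hits the limit. */
	for (i = 0; ; i++) {
		p = mmap(NULL, page, (i & 1) ? PROT_READ : PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			break;
	}

	/* Punching a hole needs a VMA split; with the count at (or,
	 * with the off-by-one, over) the limit, this fails. */
	if (munmap(big + page, page))
		warn("munmap of middle page failed at the VMA limit");
	return 0;
}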
Thanks, Kalesh