Currently, guard regions are not visible to users except through /proc/$pid/pagemap, with no explicit visibility at the VMA level.
This makes the feature less useful, as it isn't entirely apparent which VMAs may have these entries present, especially when performing actions which walk through memory regions such as those performed by CRIU.
This series addresses this issue by introducing the VM_MAYBE_GUARD flag which fulfils this role, updating the smaps logic to display an entry for these.
The semantics of this flag are that a guard region MAY be present if set (we cannot be sure, as we can't efficiently track whether an MADV_GUARD_REMOVE finally removes all the guard regions in a VMA) - but if not set the VMA definitely does NOT have any guard regions present.
It's problematic to establish this flag without further action, because that means that VMAs with guard regions in them become non-mergeable with adjacent VMAs for no especially good reason.
To work around this, this series also introduces the concept of 'sticky' VMA flags - that is flags which:
a. if set in one VMA and not in another still permit those VMAs to be merged (if otherwise compatible).
b. When they are merged, the resultant VMA must have the flag set.
The VMA logic is updated to propagate these flags correctly.
Additionally, VM_MAYBE_GUARD being an explicit VMA flag allows us to solve an issue with file-backed guard regions - previously these established an anon_vma object for file-backed mappings solely to have vma_needs_copy() correctly propagate guard region mappings to child processes.
We introduce a new flag alias VM_COPY_ON_FORK (which currently only specifies VM_MAYBE_GUARD) and update vma_needs_copy() to check explicitly for this flag and to copy page tables if it is present, which resolves this issue.
Finally we introduce extensive VMA userland tests to assert that the sticky VMA logic behaves correctly as well as guard region self tests to assert that smaps visibility is correctly implemented.
Lorenzo Stoakes (3): mm: introduce VM_MAYBE_GUARD and make visible for guard regions mm: implement sticky, copy on fork VMA flags selftests/mm/guard-regions: add smaps visibility test
Documentation/filesystems/proc.rst | 1 + fs/proc/task_mmu.c | 1 + include/linux/mm.h | 33 ++++++ include/trace/events/mmflags.h | 1 + mm/madvise.c | 22 ++-- mm/memory.c | 3 + mm/vma.c | 22 ++-- tools/testing/selftests/mm/guard-regions.c | 120 +++++++++++++++++++++ tools/testing/selftests/mm/vm_util.c | 5 + tools/testing/selftests/mm/vm_util.h | 1 + tools/testing/vma/vma.c | 89 +++++++++++++-- tools/testing/vma/vma_internal.h | 33 ++++++ 12 files changed, 303 insertions(+), 28 deletions(-)
-- 2.51.0
Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions).
Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level.
This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges.
This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split).
We utilise 0x800 for this flag which makes it available to 32-bit architectures also, a flag that was previously used by VM_DENYWRITE, which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't bee reused yet.
The MADV_GUARD_INSTALL madvise() operation now must take an mmap write lock (and also VMA write lock) whereas previously it did not, but this seems a reasonable overhead.
We also update the smaps logic and documentation to identify these VMAs.
Another major use of this functionality is that we can use it to identify that we ought to copy page tables on fork.
For anonymous mappings this is inherent, however since commit f807123d578d ("mm: allow guard regions in file-backed and read-only mappings") which allowed file-backed guard regions, we have unfortunately had to enforce this behaviour by settings vma->anon_vma to force page table copying.
The existence of this flag removes the need for this, so we simply update vma_needs_copy() to check for this flag instead.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com --- Documentation/filesystems/proc.rst | 1 + fs/proc/task_mmu.c | 1 + include/linux/mm.h | 1 + include/trace/events/mmflags.h | 1 + mm/madvise.c | 22 ++++++++++++++-------- mm/memory.c | 4 ++++ tools/testing/vma/vma_internal.h | 1 + 7 files changed, 23 insertions(+), 8 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 0b86a8022fa1..b8a423ca590a 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -591,6 +591,7 @@ encoded manner. The codes are the following: sl sealed lf lock on fault pages dp always lazily freeable mapping + gu maybe contains guard regions (if not set, definitely doesn't) == =======================================
Note that there is no guarantee that every flag and associated mnemonic will diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index fc35a0543f01..db16ed91c269 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1146,6 +1146,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) [ilog2(VM_MAYSHARE)] = "ms", [ilog2(VM_GROWSDOWN)] = "gd", [ilog2(VM_PFNMAP)] = "pf", + [ilog2(VM_MAYBE_GUARD)] = "gu", [ilog2(VM_LOCKED)] = "lo", [ilog2(VM_IO)] = "io", [ilog2(VM_SEQ_READ)] = "sr", diff --git a/include/linux/mm.h b/include/linux/mm.h index aada935c4950..f963afa1b9de 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -296,6 +296,7 @@ extern unsigned int kobjsize(const void *objp); #define VM_UFFD_MISSING 0 #endif /* CONFIG_MMU */ #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */ +#define VM_MAYBE_GUARD 0x00000800 /* The VMA maybe contains guard regions. */ #define VM_UFFD_WP 0x00001000 /* wrprotect pages tracking */
#define VM_LOCKED 0x00002000 diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index aa441f593e9a..a6e5a44c9b42 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -213,6 +213,7 @@ IF_HAVE_PG_ARCH_3(arch_3) {VM_UFFD_MISSING, "uffd_missing" }, \ IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR, "uffd_minor" ) \ {VM_PFNMAP, "pfnmap" }, \ + {VM_MAYBE_GUARD, "maybe_guard" }, \ {VM_UFFD_WP, "uffd_wp" }, \ {VM_LOCKED, "locked" }, \ {VM_IO, "io" }, \ diff --git a/mm/madvise.c b/mm/madvise.c index fb1c86e630b6..216ae6ed344e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1141,15 +1141,22 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior) return -EINVAL;
/* - * If we install guard markers, then the range is no longer - * empty from a page table perspective and therefore it's - * appropriate to have an anon_vma. + * It would be confusing for anonymous mappings to have page table + * entries but no anon_vma established, so ensure that it is. + */ + if (vma_is_anonymous(vma)) + anon_vma_prepare(vma); + + /* + * Indicate that the VMA may contain guard regions, making it visible to + * the user that a VMA may contain these, narrowing down the range which + * must be scanned in order to detect them. * - * This ensures that on fork, we copy page tables correctly. + * This additionally causes page tables to be copied on fork regardless + * of whether the VMA is anonymous or not, correctly preserving the + * guard region page table entries. */ - err = anon_vma_prepare(vma); - if (err) - return err; + vm_flags_set(vma, VM_MAYBE_GUARD);
/* * Optimistically try to install the guard marker pages first. If any @@ -1709,7 +1716,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi case MADV_POPULATE_READ: case MADV_POPULATE_WRITE: case MADV_COLLAPSE: - case MADV_GUARD_INSTALL: case MADV_GUARD_REMOVE: return MADVISE_MMAP_READ_LOCK; case MADV_DONTNEED: diff --git a/mm/memory.c b/mm/memory.c index 4c3a7e09a159..a2c79ee43d68 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) if (src_vma->anon_vma) return true;
+ /* Guard regions have momdified page tables that require copying. */ + if (src_vma->vm_flags & VM_MAYBE_GUARD) + return true; + /* * Don't copy ptes where a page fault will fill them correctly. Fork * becomes much lighter when there are big shared or private readonly diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index d873667704e8..e40c93edc5a7 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -56,6 +56,7 @@ extern unsigned long dac_mmap_min_addr; #define VM_MAYEXEC 0x00000040 #define VM_GROWSDOWN 0x00000100 #define VM_PFNMAP 0x00000400 +#define VM_MAYBE_GUARD 0x00000800 #define VM_LOCKED 0x00002000 #define VM_IO 0x00004000 #define VM_SEQ_READ 0x00008000 /* App will access data sequentially */
On 10/29/25 9:50 AM, Lorenzo Stoakes wrote:
diff --git a/mm/memory.c b/mm/memory.c index 4c3a7e09a159..a2c79ee43d68 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) if (src_vma->anon_vma) return true;
- /* Guard regions have momdified page tables that require copying. */
modified
- if (src_vma->vm_flags & VM_MAYBE_GUARD)
return true;- /*
- Don't copy ptes where a page fault will fill them correctly. Fork
- becomes much lighter when there are big shared or private readonly
On Wed, Oct 29, 2025 at 12:50:18PM -0700, Randy Dunlap wrote:
On 10/29/25 9:50 AM, Lorenzo Stoakes wrote:
diff --git a/mm/memory.c b/mm/memory.c index 4c3a7e09a159..a2c79ee43d68 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) if (src_vma->anon_vma) return true;
- /* Guard regions have momdified page tables that require copying. */
modified
I appear to have entered some kind of typo nirvana in this series it seems :)
Thanks, will fix!
- if (src_vma->vm_flags & VM_MAYBE_GUARD)
return true;- /*
- Don't copy ptes where a page fault will fill them correctly. Fork
- becomes much lighter when there are big shared or private readonly
-- ~Randy
Cheers, Lorenzo
On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions).
Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level.
This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges.
This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split).
nit: I know I suck at naming but I think VM_MAY_HAVE_GUARDS would better represent the meaning.
We utilise 0x800 for this flag which makes it available to 32-bit architectures also, a flag that was previously used by VM_DENYWRITE, which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't bee reused yet.
s/bee/been but I'm not even sure the above paragraph has to be included in the changelog. It's a technical detail IMHO.
The MADV_GUARD_INSTALL madvise() operation now must take an mmap write lock (and also VMA write lock) whereas previously it did not, but this seems a reasonable overhead.
I guess this is because it is modifying vm_flags now?
We also update the smaps logic and documentation to identify these VMAs.
Another major use of this functionality is that we can use it to identify that we ought to copy page tables on fork.
For anonymous mappings this is inherent, however since commit f807123d578d ("mm: allow guard regions in file-backed and read-only mappings") which allowed file-backed guard regions, we have unfortunately had to enforce this behaviour by settings vma->anon_vma to force page table copying.
The existence of this flag removes the need for this, so we simply update vma_needs_copy() to check for this flag instead.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Overall, makes sense to me and I think we could use it.
Reviewed-by: Suren Baghdasaryan surenb@google.com
It would be nice to have a way for userspace to reset this flag if it confirms that the VMA does not really have any guards (using say PAGEMAP_SCAN) but I think such an API can be abused.
Documentation/filesystems/proc.rst | 1 + fs/proc/task_mmu.c | 1 + include/linux/mm.h | 1 + include/trace/events/mmflags.h | 1 + mm/madvise.c | 22 ++++++++++++++-------- mm/memory.c | 4 ++++ tools/testing/vma/vma_internal.h | 1 + 7 files changed, 23 insertions(+), 8 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 0b86a8022fa1..b8a423ca590a 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -591,6 +591,7 @@ encoded manner. The codes are the following: sl sealed lf lock on fault pages dp always lazily freeable mapping
- gu maybe contains guard regions (if not set, definitely doesn't) == =======================================
Note that there is no guarantee that every flag and associated mnemonic will diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index fc35a0543f01..db16ed91c269 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1146,6 +1146,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) [ilog2(VM_MAYSHARE)] = "ms", [ilog2(VM_GROWSDOWN)] = "gd", [ilog2(VM_PFNMAP)] = "pf",
[ilog2(VM_MAYBE_GUARD)] = "gu", [ilog2(VM_LOCKED)] = "lo", [ilog2(VM_IO)] = "io", [ilog2(VM_SEQ_READ)] = "sr",diff --git a/include/linux/mm.h b/include/linux/mm.h index aada935c4950..f963afa1b9de 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -296,6 +296,7 @@ extern unsigned int kobjsize(const void *objp); #define VM_UFFD_MISSING 0 #endif /* CONFIG_MMU */ #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */ +#define VM_MAYBE_GUARD 0x00000800 /* The VMA maybe contains guard regions. */ #define VM_UFFD_WP 0x00001000 /* wrprotect pages tracking */
#define VM_LOCKED 0x00002000 diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index aa441f593e9a..a6e5a44c9b42 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -213,6 +213,7 @@ IF_HAVE_PG_ARCH_3(arch_3) {VM_UFFD_MISSING, "uffd_missing" }, \ IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR, "uffd_minor" ) \ {VM_PFNMAP, "pfnmap" }, \
{VM_MAYBE_GUARD, "maybe_guard" }, \ {VM_UFFD_WP, "uffd_wp" }, \ {VM_LOCKED, "locked" }, \ {VM_IO, "io" }, \diff --git a/mm/madvise.c b/mm/madvise.c index fb1c86e630b6..216ae6ed344e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1141,15 +1141,22 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior) return -EINVAL;
/*
* If we install guard markers, then the range is no longer* empty from a page table perspective and therefore it's* appropriate to have an anon_vma.
* It would be confusing for anonymous mappings to have page table* entries but no anon_vma established, so ensure that it is.*/if (vma_is_anonymous(vma))anon_vma_prepare(vma);/** Indicate that the VMA may contain guard regions, making it visible to* the user that a VMA may contain these, narrowing down the range which* must be scanned in order to detect them. *
* This ensures that on fork, we copy page tables correctly.
* This additionally causes page tables to be copied on fork regardless* of whether the VMA is anonymous or not, correctly preserving the* guard region page table entries. */
err = anon_vma_prepare(vma);if (err)return err;
vm_flags_set(vma, VM_MAYBE_GUARD); /* * Optimistically try to install the guard marker pages first. If any@@ -1709,7 +1716,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi case MADV_POPULATE_READ: case MADV_POPULATE_WRITE: case MADV_COLLAPSE:
case MADV_GUARD_INSTALL: case MADV_GUARD_REMOVE: return MADVISE_MMAP_READ_LOCK; case MADV_DONTNEED:diff --git a/mm/memory.c b/mm/memory.c index 4c3a7e09a159..a2c79ee43d68 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) if (src_vma->anon_vma) return true;
/* Guard regions have momdified page tables that require copying. */if (src_vma->vm_flags & VM_MAYBE_GUARD)return true;/* * Don't copy ptes where a page fault will fill them correctly. Fork * becomes much lighter when there are big shared or private readonlydiff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index d873667704e8..e40c93edc5a7 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -56,6 +56,7 @@ extern unsigned long dac_mmap_min_addr; #define VM_MAYEXEC 0x00000040 #define VM_GROWSDOWN 0x00000100 #define VM_PFNMAP 0x00000400 +#define VM_MAYBE_GUARD 0x00000800 #define VM_LOCKED 0x00002000 #define VM_IO 0x00004000
#define VM_SEQ_READ 0x00008000 /* App will access data sequentially */
2.51.0
On Wed, Oct 29, 2025 at 06:05:54PM -0700, Suren Baghdasaryan wrote:
On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions).
Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level.
This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges.
This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split).
nit: I know I suck at naming but I think VM_MAY_HAVE_GUARDS would better represent the meaning.
We all suck at naming :) it's the hardest bit! :P
Hm I don't love that, bit overwrought, I do think 'maybe guard' is a better shorthand for this flag name.
I am open to other suggestions but I think the original wins on succinctness here!
We utilise 0x800 for this flag which makes it available to 32-bit architectures also, a flag that was previously used by VM_DENYWRITE, which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't bee reused yet.
s/bee/been
Yeah this series appears to be a bonanza of typos... not sure why :) will fix.
but I'm not even sure the above paragraph has to be included in the changelog. It's a technical detail IMHO.
Well, I think it's actually important to highlight that we have a VMA flag free and why. I know it's bordering on extraneous, but I don't think there's any harm in mentioning it.
Otherwise people might wonder 'oh is this flag used elsewhere somehow' etc.
The MADV_GUARD_INSTALL madvise() operation now must take an mmap write lock (and also VMA write lock) whereas previously it did not, but this seems a reasonable overhead.
I guess this is because it is modifying vm_flags now?
Yes
We also update the smaps logic and documentation to identify these VMAs.
Another major use of this functionality is that we can use it to identify that we ought to copy page tables on fork.
For anonymous mappings this is inherent, however since commit f807123d578d ("mm: allow guard regions in file-backed and read-only mappings") which allowed file-backed guard regions, we have unfortunately had to enforce this behaviour by settings vma->anon_vma to force page table copying.
The existence of this flag removes the need for this, so we simply update vma_needs_copy() to check for this flag instead.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Overall, makes sense to me and I think we could use it.
Reviewed-by: Suren Baghdasaryan surenb@google.com
Thanks!
It would be nice to have a way for userspace to reset this flag if it confirms that the VMA does not really have any guards (using say PAGEMAP_SCAN) but I think such an API can be abused.
Yeah, I'd rather not for that reason.
Documentation/filesystems/proc.rst | 1 + fs/proc/task_mmu.c | 1 + include/linux/mm.h | 1 + include/trace/events/mmflags.h | 1 + mm/madvise.c | 22 ++++++++++++++-------- mm/memory.c | 4 ++++ tools/testing/vma/vma_internal.h | 1 + 7 files changed, 23 insertions(+), 8 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 0b86a8022fa1..b8a423ca590a 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -591,6 +591,7 @@ encoded manner. The codes are the following: sl sealed lf lock on fault pages dp always lazily freeable mapping
- gu maybe contains guard regions (if not set, definitely doesn't) == =======================================
Note that there is no guarantee that every flag and associated mnemonic will diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index fc35a0543f01..db16ed91c269 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1146,6 +1146,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) [ilog2(VM_MAYSHARE)] = "ms", [ilog2(VM_GROWSDOWN)] = "gd", [ilog2(VM_PFNMAP)] = "pf",
[ilog2(VM_MAYBE_GUARD)] = "gu", [ilog2(VM_LOCKED)] = "lo", [ilog2(VM_IO)] = "io", [ilog2(VM_SEQ_READ)] = "sr",diff --git a/include/linux/mm.h b/include/linux/mm.h index aada935c4950..f963afa1b9de 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -296,6 +296,7 @@ extern unsigned int kobjsize(const void *objp); #define VM_UFFD_MISSING 0 #endif /* CONFIG_MMU */ #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */ +#define VM_MAYBE_GUARD 0x00000800 /* The VMA maybe contains guard regions. */ #define VM_UFFD_WP 0x00001000 /* wrprotect pages tracking */
#define VM_LOCKED 0x00002000 diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index aa441f593e9a..a6e5a44c9b42 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -213,6 +213,7 @@ IF_HAVE_PG_ARCH_3(arch_3) {VM_UFFD_MISSING, "uffd_missing" }, \ IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR, "uffd_minor" ) \ {VM_PFNMAP, "pfnmap" }, \
{VM_MAYBE_GUARD, "maybe_guard" }, \ {VM_UFFD_WP, "uffd_wp" }, \ {VM_LOCKED, "locked" }, \ {VM_IO, "io" }, \diff --git a/mm/madvise.c b/mm/madvise.c index fb1c86e630b6..216ae6ed344e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1141,15 +1141,22 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior) return -EINVAL;
/*
* If we install guard markers, then the range is no longer* empty from a page table perspective and therefore it's* appropriate to have an anon_vma.
* It would be confusing for anonymous mappings to have page table* entries but no anon_vma established, so ensure that it is.*/if (vma_is_anonymous(vma))anon_vma_prepare(vma);/** Indicate that the VMA may contain guard regions, making it visible to* the user that a VMA may contain these, narrowing down the range which* must be scanned in order to detect them. *
* This ensures that on fork, we copy page tables correctly.
* This additionally causes page tables to be copied on fork regardless* of whether the VMA is anonymous or not, correctly preserving the* guard region page table entries. */
err = anon_vma_prepare(vma);if (err)return err;
vm_flags_set(vma, VM_MAYBE_GUARD); /* * Optimistically try to install the guard marker pages first. If any@@ -1709,7 +1716,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi case MADV_POPULATE_READ: case MADV_POPULATE_WRITE: case MADV_COLLAPSE:
case MADV_GUARD_INSTALL: case MADV_GUARD_REMOVE: return MADVISE_MMAP_READ_LOCK; case MADV_DONTNEED:diff --git a/mm/memory.c b/mm/memory.c index 4c3a7e09a159..a2c79ee43d68 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) if (src_vma->anon_vma) return true;
/* Guard regions have momdified page tables that require copying. */if (src_vma->vm_flags & VM_MAYBE_GUARD)return true;/* * Don't copy ptes where a page fault will fill them correctly. Fork * becomes much lighter when there are big shared or private readonlydiff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index d873667704e8..e40c93edc5a7 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -56,6 +56,7 @@ extern unsigned long dac_mmap_min_addr; #define VM_MAYEXEC 0x00000040 #define VM_GROWSDOWN 0x00000100 #define VM_PFNMAP 0x00000400 +#define VM_MAYBE_GUARD 0x00000800 #define VM_LOCKED 0x00002000 #define VM_IO 0x00004000
#define VM_SEQ_READ 0x00008000 /* App will access data sequentially */
2.51.0
It's useful to be able to force a VMA to be copied on fork outside of the parameters specified by vma_needs_copy(), which otherwise only copies page tables if:
* The destination VMA has VM_UFFD_WP set * The mapping is a PFN or mixed map * The mapping is anonymous and forked in (i.e. vma->anon_vma is non-NULL)
Setting this flag implies that the page tables mapping the VMA are such that simply re-faulting the VMA will not re-establish them in identical form.
We introduce VM_COPY_ON_FORK to clearly identify which flags require this behaviour, which currently is only VM_MAYBE_GUARD.
Any VMA flags which require this behaviour are inherently 'sticky', that is, should we merge two VMAs together, this implies that the newly merged VMA maps a range that requires page table copying on fork.
In order to implement this we must both introduce the concept of a 'sticky' VMA flag and adjust the VMA merge logic accordingly, and also have VMA merge still successfully succeed should one VMA have the flag set and another not.
Note that we update the VMA expand logic to handle new VMA merging, as this function is the one ultimately called by all instances of merging of new VMAs.
This patch implements this, establishing VM_STICKY to contain all such flags and VM_IGNORE_MERGE for those flags which should be ignored when comparing adjacent VMA's flags for the purposes of merging.
As part of this change we place VM_SOFTDIRTY in VM_IGNORE_MERGE as it already had this behaviour, alongside VM_STICKY as sticky flags by implication must not disallow merge.
We update the VMA userland tests to account for the changes and, furthermore, in order to assert that the functionality is workingly correctly, update the new VMA and existing VMA merging logic to consider every permutation of the flag being set/not set in all VMAs being considered for merge.
As a result of this change, VMAs with guard ranges will now not have their merge behaviour impacted by doing so and can be freely merged with other VMAs without VM_MAYBE_GUARD set.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com --- include/linux/mm.h | 32 ++++++++++++ mm/memory.c | 3 +- mm/vma.c | 22 ++++---- tools/testing/vma/vma.c | 89 ++++++++++++++++++++++++++++---- tools/testing/vma/vma_internal.h | 32 ++++++++++++ 5 files changed, 156 insertions(+), 22 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index f963afa1b9de..a8811ba57150 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -522,6 +522,38 @@ extern unsigned int kobjsize(const void *objp); #endif #define VM_FLAGS_CLEAR (ARCH_VM_PKEY_FLAGS | VM_ARCH_CLEAR)
+/* Flags which should result in page tables being copied on fork. */ +#define VM_COPY_ON_FORK VM_MAYBE_GUARD + +/* + * Flags which should be 'sticky' on merge - that is, flags which, when one VMA + * possesses it but the other does not, the merged VMA should nonetheless have + * applied to it: + * + * VM_COPY_ON_FORK - These flags indicates that a VMA maps a range that contains + * metadata which should be unconditionally propagated upon + * fork. When merging two VMAs, we encapsulate this range in + * the merged VMA, so the flag should be 'sticky' as a result. + */ +#define VM_STICKY VM_COPY_ON_FORK + +/* + * VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one + * of these flags and the other not does not preclude a merge. + * + * VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but + * dirty bit -- the caller should mark merged VMA as dirty. If + * dirty bit won't be excluded from comparison, we increase + * pressure on the memory system forcing the kernel to generate + * new VMAs when old one could be extended instead. + * + * VM_STICKY - If one VMA has flags which most be 'sticky', that is ones + * which should propagate to all VMAs, but the other does not, + * the merge should still proceed with the merge logic applying + * sticky flags to the final VMA. + */ +#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY) + /* * mapping from the currently active vm_flags protection bits (the * low four bits) to a page protection mask.. diff --git a/mm/memory.c b/mm/memory.c index a2c79ee43d68..9528133e5147 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1478,8 +1478,7 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) if (src_vma->anon_vma) return true;
- /* Guard regions have momdified page tables that require copying. */ - if (src_vma->vm_flags & VM_MAYBE_GUARD) + if (src_vma->vm_flags & VM_COPY_ON_FORK) return true;
/* diff --git a/mm/vma.c b/mm/vma.c index 919d1fc63a52..50a6909c4be3 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -89,15 +89,7 @@ static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_nex
if (!mpol_equal(vmg->policy, vma_policy(vma))) return false; - /* - * VM_SOFTDIRTY should not prevent from VMA merging, if we - * match the flags but dirty bit -- the caller should mark - * merged VMA as dirty. If dirty bit won't be excluded from - * comparison, we increase pressure on the memory system forcing - * the kernel to generate new VMAs when old one could be - * extended instead. - */ - if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_SOFTDIRTY) + if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_IGNORE_MERGE) return false; if (vma->vm_file != vmg->file) return false; @@ -809,6 +801,7 @@ static bool can_merge_remove_vma(struct vm_area_struct *vma) static __must_check struct vm_area_struct *vma_merge_existing_range( struct vma_merge_struct *vmg) { + vm_flags_t sticky_flags = vmg->vm_flags & VM_STICKY; struct vm_area_struct *middle = vmg->middle; struct vm_area_struct *prev = vmg->prev; struct vm_area_struct *next; @@ -901,11 +894,13 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( if (merge_right) { vma_start_write(next); vmg->target = next; + sticky_flags |= (next->vm_flags & VM_STICKY); }
if (merge_left) { vma_start_write(prev); vmg->target = prev; + sticky_flags |= (prev->vm_flags & VM_STICKY); }
if (merge_both) { @@ -975,6 +970,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( if (err || commit_merge(vmg)) goto abort;
+ vm_flags_set(vmg->target, sticky_flags); khugepaged_enter_vma(vmg->target, vmg->vm_flags); vmg->state = VMA_MERGE_SUCCESS; return vmg->target; @@ -1125,6 +1121,10 @@ int vma_expand(struct vma_merge_struct *vmg) bool remove_next = false; struct vm_area_struct *target = vmg->target; struct vm_area_struct *next = vmg->next; + vm_flags_t sticky_flags; + + sticky_flags = vmg->vm_flags & VM_STICKY; + sticky_flags |= target->vm_flags & VM_STICKY;
VM_WARN_ON_VMG(!target, vmg);
@@ -1134,6 +1134,7 @@ int vma_expand(struct vma_merge_struct *vmg) if (next && (target != next) && (vmg->end == next->vm_end)) { int ret;
+ sticky_flags |= next->vm_flags & VM_STICKY; remove_next = true; /* This should already have been checked by this point. */ VM_WARN_ON_VMG(!can_merge_remove_vma(next), vmg); @@ -1160,6 +1161,7 @@ int vma_expand(struct vma_merge_struct *vmg) if (commit_merge(vmg)) goto nomem;
+ vm_flags_set(target, sticky_flags); return 0;
nomem: @@ -1903,7 +1905,7 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct * return a->vm_end == b->vm_start && mpol_equal(vma_policy(a), vma_policy(b)) && a->vm_file == b->vm_file && - !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_SOFTDIRTY)) && + !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_IGNORE_MERGE)) && b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT); }
diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 656e1c75b711..ee9d3547c421 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -48,6 +48,8 @@ static struct anon_vma dummy_anon_vma; #define ASSERT_EQ(_val1, _val2) ASSERT_TRUE((_val1) == (_val2)) #define ASSERT_NE(_val1, _val2) ASSERT_TRUE((_val1) != (_val2))
+#define IS_SET(_val, _flags) ((_val & _flags) == _flags) + static struct task_struct __current;
struct task_struct *get_current(void) @@ -441,7 +443,7 @@ static bool test_simple_shrink(void) return true; }
-static bool test_merge_new(void) +static bool __test_merge_new(bool is_sticky, bool a_is_sticky, bool b_is_sticky, bool c_is_sticky) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; @@ -469,23 +471,32 @@ static bool test_merge_new(void) struct vm_area_struct *vma, *vma_a, *vma_b, *vma_c, *vma_d; bool merged;
+ if (is_sticky) + vm_flags |= VM_STICKY; + /* * 0123456789abc * AA B CC */ vma_a = alloc_and_link_vma(&mm, 0, 0x2000, 0, vm_flags); ASSERT_NE(vma_a, NULL); + if (a_is_sticky) + vm_flags_set(vma_a, VM_STICKY); /* We give each VMA a single avc so we can test anon_vma duplication. */ INIT_LIST_HEAD(&vma_a->anon_vma_chain); list_add(&dummy_anon_vma_chain_a.same_vma, &vma_a->anon_vma_chain);
vma_b = alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, vm_flags); ASSERT_NE(vma_b, NULL); + if (b_is_sticky) + vm_flags_set(vma_b, VM_STICKY); INIT_LIST_HEAD(&vma_b->anon_vma_chain); list_add(&dummy_anon_vma_chain_b.same_vma, &vma_b->anon_vma_chain);
vma_c = alloc_and_link_vma(&mm, 0xb000, 0xc000, 0xb, vm_flags); ASSERT_NE(vma_c, NULL); + if (c_is_sticky) + vm_flags_set(vma_c, VM_STICKY); INIT_LIST_HEAD(&vma_c->anon_vma_chain); list_add(&dummy_anon_vma_chain_c.same_vma, &vma_c->anon_vma_chain);
@@ -520,6 +531,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3); + if (is_sticky || a_is_sticky || b_is_sticky) + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
/* * Merge to PREVIOUS VMA. @@ -537,6 +550,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3); + if (is_sticky || a_is_sticky) + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
/* * Merge to NEXT VMA. @@ -556,6 +571,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3); + if (is_sticky) /* D uses is_sticky. */ + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
/* * Merge BOTH sides. @@ -574,6 +591,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2); + if (is_sticky || a_is_sticky) + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
/* * Merge to NEXT VMA. @@ -592,6 +611,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2); + if (is_sticky || c_is_sticky) + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
/* * Merge BOTH sides. @@ -609,6 +630,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 1); + if (is_sticky || a_is_sticky || c_is_sticky) + ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
/* * Final state. @@ -637,6 +660,20 @@ static bool test_merge_new(void) return true; }
+static bool test_merge_new(void) +{ + int i, j, k, l; + + /* Generate every possible permutation of sticky flags. */ + for (i = 0; i < 2; i++) + for (j = 0; j < 2; j++) + for (k = 0; k < 2; k++) + for (l = 0; l < 2; l++) + ASSERT_TRUE(__test_merge_new(i, j, k, l)); + + return true; +} + static bool test_vma_merge_special_flags(void) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; @@ -973,9 +1010,11 @@ static bool test_vma_merge_new_with_close(void) return true; }
-static bool test_merge_existing(void) +static bool __test_merge_existing(bool prev_is_sticky, bool middle_is_sticky, bool next_is_sticky) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + vm_flags_t prev_flags = vm_flags; + vm_flags_t next_flags = vm_flags; struct mm_struct mm = {}; VMA_ITERATOR(vmi, &mm, 0); struct vm_area_struct *vma, *vma_prev, *vma_next; @@ -988,6 +1027,13 @@ static bool test_merge_existing(void) }; struct anon_vma_chain avc = {};
+ if (prev_is_sticky) + prev_flags |= VM_STICKY; + if (middle_is_sticky) + vm_flags |= VM_STICKY; + if (next_is_sticky) + next_flags |= VM_STICKY; + /* * Merge right case - partial span. * @@ -1000,7 +1046,7 @@ static bool test_merge_existing(void) */ vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags); vma->vm_ops = &vm_ops; /* This should have no impact. */ - vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags); vma_next->vm_ops = &vm_ops; /* This should have no impact. */ vmg_set_range_anon_vma(&vmg, 0x3000, 0x6000, 3, vm_flags, &dummy_anon_vma); vmg.middle = vma; @@ -1018,6 +1064,8 @@ static bool test_merge_existing(void) ASSERT_TRUE(vma_write_started(vma)); ASSERT_TRUE(vma_write_started(vma_next)); ASSERT_EQ(mm.map_count, 2); + if (middle_is_sticky || next_is_sticky) + ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));
/* Clear down and reset. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); @@ -1033,7 +1081,7 @@ static bool test_merge_existing(void) * NNNNNNN */ vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags); - vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags); vma_next->vm_ops = &vm_ops; /* This should have no impact. */ vmg_set_range_anon_vma(&vmg, 0x2000, 0x6000, 2, vm_flags, &dummy_anon_vma); vmg.middle = vma; @@ -1046,6 +1094,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_next)); ASSERT_EQ(mm.map_count, 1); + if (middle_is_sticky || next_is_sticky) + ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));
/* Clear down and reset. We should have deleted vma. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); @@ -1060,7 +1110,7 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPV */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags); vma->vm_ops = &vm_ops; /* This should have no impact. */ @@ -1080,6 +1130,8 @@ static bool test_merge_existing(void) ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2); + if (prev_is_sticky || middle_is_sticky) + ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
/* Clear down and reset. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 2); @@ -1094,7 +1146,7 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPP */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags); vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma); @@ -1109,6 +1161,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_EQ(mm.map_count, 1); + if (prev_is_sticky || middle_is_sticky) + ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
/* Clear down and reset. We should have deleted vma. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); @@ -1123,10 +1177,10 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPPPPP */ - vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags); - vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, next_flags); vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma); vmg.prev = vma_prev; vmg.middle = vma; @@ -1139,6 +1193,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_EQ(mm.map_count, 1); + if (prev_is_sticky || middle_is_sticky || next_is_sticky) + ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
/* Clear down and reset. We should have deleted prev and next. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1); @@ -1158,9 +1214,9 @@ static bool test_merge_existing(void) * PPPVVVVVNNN */
- vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags); + vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, vm_flags); - vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, vm_flags); + vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, next_flags);
vmg_set_range(&vmg, 0x4000, 0x5000, 4, vm_flags); vmg.prev = vma; @@ -1203,6 +1259,19 @@ static bool test_merge_existing(void) return true; }
+static bool test_merge_existing(void) +{ + int i, j, k; + + /* Generate every possible permutation of sticky flags. */ + for (i = 0; i < 2; i++) + for (j = 0; j < 2; j++) + for (k = 0; k < 2; k++) + ASSERT_TRUE(__test_merge_existing(i, j, k)); + + return true; +} + static bool test_anon_vma_non_mergeable(void) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index e40c93edc5a7..3d9cb3a9411a 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -117,6 +117,38 @@ extern unsigned long dac_mmap_min_addr; #define VM_SEALED VM_NONE #endif
+/* Flags which should result in page tables being copied on fork. */ +#define VM_COPY_ON_FORK VM_MAYBE_GUARD + +/* + * Flags which should be 'sticky' on merge - that is, flags which, when one VMA + * possesses it but the other does not, the merged VMA should nonetheless have + * applied to it: + * + * VM_COPY_ON_FORK - These flags indicates that a VMA maps a range that contains + * metadata which should be unconditionally propagated upon + * fork. When merging two VMAs, we encapsulate this range in + * the merged VMA, so the flag should be 'sticky' as a result. + */ +#define VM_STICKY VM_COPY_ON_FORK + +/* + * VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one + * of these flags and the other not does not preclude a merge. + * + * VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but + * dirty bit -- the caller should mark merged VMA as dirty. If + * dirty bit won't be excluded from comparison, we increase + * pressure on the memory system forcing the kernel to generate + * new VMAs when old one could be extended instead. + * + * VM_STICKY - If one VMA has flags which must be 'sticky', that is ones + * which should propagate to all VMAs, but the other does not, + * the merge should still proceed with the merge logic applying + * sticky flags to the final VMA. + */ +#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY) + #define FIRST_USER_ADDRESS 0UL #define USER_PGTABLES_CEILING 0UL
On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
It's useful to be able to force a VMA to be copied on fork outside of the parameters specified by vma_needs_copy(), which otherwise only copies page tables if:
- The destination VMA has VM_UFFD_WP set
- The mapping is a PFN or mixed map
- The mapping is anonymous and forked in (i.e. vma->anon_vma is non-NULL)
Setting this flag implies that the page tables mapping the VMA are such that simply re-faulting the VMA will not re-establish them in identical form.
We introduce VM_COPY_ON_FORK to clearly identify which flags require this behaviour, which currently is only VM_MAYBE_GUARD.
Any VMA flags which require this behaviour are inherently 'sticky', that is, should we merge two VMAs together, this implies that the newly merged VMA maps a range that requires page table copying on fork.
In order to implement this we must both introduce the concept of a 'sticky' VMA flag and adjust the VMA merge logic accordingly, and also have VMA merge still successfully succeed should one VMA have the flag set and another not.
"successfully succeed" sounds weird. Just "succeed"?
Note that we update the VMA expand logic to handle new VMA merging, as this function is the one ultimately called by all instances of merging of new VMAs.
This patch implements this, establishing VM_STICKY to contain all such flags and VM_IGNORE_MERGE for those flags which should be ignored when comparing adjacent VMA's flags for the purposes of merging.
As part of this change we place VM_SOFTDIRTY in VM_IGNORE_MERGE as it already had this behaviour, alongside VM_STICKY as sticky flags by implication must not disallow merge.
We update the VMA userland tests to account for the changes and, furthermore, in order to assert that the functionality is workingly
s/workingly/working
correctly, update the new VMA and existing VMA merging logic to consider every permutation of the flag being set/not set in all VMAs being considered for merge.
As a result of this change, VMAs with guard ranges will now not have their merge behaviour impacted by doing so and can be freely merged with other VMAs without VM_MAYBE_GUARD set.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
include/linux/mm.h | 32 ++++++++++++ mm/memory.c | 3 +- mm/vma.c | 22 ++++---- tools/testing/vma/vma.c | 89 ++++++++++++++++++++++++++++---- tools/testing/vma/vma_internal.h | 32 ++++++++++++ 5 files changed, 156 insertions(+), 22 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index f963afa1b9de..a8811ba57150 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -522,6 +522,38 @@ extern unsigned int kobjsize(const void *objp); #endif #define VM_FLAGS_CLEAR (ARCH_VM_PKEY_FLAGS | VM_ARCH_CLEAR)
+/* Flags which should result in page tables being copied on fork. */ +#define VM_COPY_ON_FORK VM_MAYBE_GUARD
+/*
- Flags which should be 'sticky' on merge - that is, flags which, when one VMA
- possesses it but the other does not, the merged VMA should nonetheless have
- applied to it:
- VM_COPY_ON_FORK - These flags indicates that a VMA maps a range that contains
metadata which should be unconditionally propagated upon
fork. When merging two VMAs, we encapsulate this range in
the merged VMA, so the flag should be 'sticky' as a result.
It's probably worth noting that after a split, we do not remove "sticky" flags even if the VMA acquired them as a result of a previous merge.
- */
+#define VM_STICKY VM_COPY_ON_FORK
+/*
- VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one
- of these flags and the other not does not preclude a merge.
- VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
dirty bit -- the caller should mark merged VMA as dirty. If
dirty bit won't be excluded from comparison, we increase
pressure on the memory system forcing the kernel to generate
new VMAs when old one could be extended instead.
- VM_STICKY - If one VMA has flags which most be 'sticky', that is ones
s/most/must ?
which should propagate to all VMAs, but the other does not,
the merge should still proceed with the merge logic applying
sticky flags to the final VMA.- */
+#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)
/*
- mapping from the currently active vm_flags protection bits (the
- low four bits) to a page protection mask..
diff --git a/mm/memory.c b/mm/memory.c index a2c79ee43d68..9528133e5147 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1478,8 +1478,7 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) if (src_vma->anon_vma) return true;
/* Guard regions have momdified page tables that require copying. */if (src_vma->vm_flags & VM_MAYBE_GUARD)
if (src_vma->vm_flags & VM_COPY_ON_FORK) return true; /*diff --git a/mm/vma.c b/mm/vma.c index 919d1fc63a52..50a6909c4be3 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -89,15 +89,7 @@ static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_nex
if (!mpol_equal(vmg->policy, vma_policy(vma))) return false;
/** VM_SOFTDIRTY should not prevent from VMA merging, if we* match the flags but dirty bit -- the caller should mark* merged VMA as dirty. If dirty bit won't be excluded from* comparison, we increase pressure on the memory system forcing* the kernel to generate new VMAs when old one could be* extended instead.*/if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_SOFTDIRTY)
if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_IGNORE_MERGE) return false; if (vma->vm_file != vmg->file) return false;@@ -809,6 +801,7 @@ static bool can_merge_remove_vma(struct vm_area_struct *vma) static __must_check struct vm_area_struct *vma_merge_existing_range( struct vma_merge_struct *vmg) {
vm_flags_t sticky_flags = vmg->vm_flags & VM_STICKY; struct vm_area_struct *middle = vmg->middle; struct vm_area_struct *prev = vmg->prev; struct vm_area_struct *next;@@ -901,11 +894,13 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( if (merge_right) { vma_start_write(next); vmg->target = next;
sticky_flags |= (next->vm_flags & VM_STICKY); } if (merge_left) { vma_start_write(prev); vmg->target = prev;sticky_flags |= (prev->vm_flags & VM_STICKY); } if (merge_both) {@@ -975,6 +970,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( if (err || commit_merge(vmg)) goto abort;
vm_flags_set(vmg->target, sticky_flags); khugepaged_enter_vma(vmg->target, vmg->vm_flags); vmg->state = VMA_MERGE_SUCCESS; return vmg->target;@@ -1125,6 +1121,10 @@ int vma_expand(struct vma_merge_struct *vmg) bool remove_next = false; struct vm_area_struct *target = vmg->target; struct vm_area_struct *next = vmg->next;
vm_flags_t sticky_flags;sticky_flags = vmg->vm_flags & VM_STICKY;sticky_flags |= target->vm_flags & VM_STICKY; VM_WARN_ON_VMG(!target, vmg);@@ -1134,6 +1134,7 @@ int vma_expand(struct vma_merge_struct *vmg) if (next && (target != next) && (vmg->end == next->vm_end)) { int ret;
sticky_flags |= next->vm_flags & VM_STICKY; remove_next = true; /* This should already have been checked by this point. */ VM_WARN_ON_VMG(!can_merge_remove_vma(next), vmg);@@ -1160,6 +1161,7 @@ int vma_expand(struct vma_merge_struct *vmg) if (commit_merge(vmg)) goto nomem;
vm_flags_set(target, sticky_flags); return 0;nomem: @@ -1903,7 +1905,7 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct * return a->vm_end == b->vm_start && mpol_equal(vma_policy(a), vma_policy(b)) && a->vm_file == b->vm_file &&
!((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_SOFTDIRTY)) &&
!((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_IGNORE_MERGE)) && b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);}
diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 656e1c75b711..ee9d3547c421 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c
I prefer tests in a separate patch, but that might just be me. Feel free to ignore.
@@ -48,6 +48,8 @@ static struct anon_vma dummy_anon_vma; #define ASSERT_EQ(_val1, _val2) ASSERT_TRUE((_val1) == (_val2)) #define ASSERT_NE(_val1, _val2) ASSERT_TRUE((_val1) != (_val2))
+#define IS_SET(_val, _flags) ((_val & _flags) == _flags)
static struct task_struct __current;
struct task_struct *get_current(void) @@ -441,7 +443,7 @@ static bool test_simple_shrink(void) return true; }
-static bool test_merge_new(void) +static bool __test_merge_new(bool is_sticky, bool a_is_sticky, bool b_is_sticky, bool c_is_sticky) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; @@ -469,23 +471,32 @@ static bool test_merge_new(void) struct vm_area_struct *vma, *vma_a, *vma_b, *vma_c, *vma_d; bool merged;
if (is_sticky)vm_flags |= VM_STICKY;/* * 0123456789abc * AA B CC */ vma_a = alloc_and_link_vma(&mm, 0, 0x2000, 0, vm_flags); ASSERT_NE(vma_a, NULL);if (a_is_sticky)vm_flags_set(vma_a, VM_STICKY); /* We give each VMA a single avc so we can test anon_vma duplication. */ INIT_LIST_HEAD(&vma_a->anon_vma_chain); list_add(&dummy_anon_vma_chain_a.same_vma, &vma_a->anon_vma_chain); vma_b = alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, vm_flags); ASSERT_NE(vma_b, NULL);if (b_is_sticky)vm_flags_set(vma_b, VM_STICKY); INIT_LIST_HEAD(&vma_b->anon_vma_chain); list_add(&dummy_anon_vma_chain_b.same_vma, &vma_b->anon_vma_chain); vma_c = alloc_and_link_vma(&mm, 0xb000, 0xc000, 0xb, vm_flags); ASSERT_NE(vma_c, NULL);if (c_is_sticky)vm_flags_set(vma_c, VM_STICKY); INIT_LIST_HEAD(&vma_c->anon_vma_chain); list_add(&dummy_anon_vma_chain_c.same_vma, &vma_c->anon_vma_chain);@@ -520,6 +531,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3);
if (is_sticky || a_is_sticky || b_is_sticky)ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY)); /* * Merge to PREVIOUS VMA.@@ -537,6 +550,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3);
if (is_sticky || a_is_sticky)ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY)); /* * Merge to NEXT VMA.@@ -556,6 +571,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3);
if (is_sticky) /* D uses is_sticky. */ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY)); /* * Merge BOTH sides.@@ -574,6 +591,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2);
if (is_sticky || a_is_sticky)ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY)); /* * Merge to NEXT VMA.@@ -592,6 +611,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2);
if (is_sticky || c_is_sticky)ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY)); /* * Merge BOTH sides.@@ -609,6 +630,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 1);
if (is_sticky || a_is_sticky || c_is_sticky)ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY)); /* * Final state.@@ -637,6 +660,20 @@ static bool test_merge_new(void) return true; }
+static bool test_merge_new(void) +{
int i, j, k, l;/* Generate every possible permutation of sticky flags. */for (i = 0; i < 2; i++)for (j = 0; j < 2; j++)for (k = 0; k < 2; k++)for (l = 0; l < 2; l++)ASSERT_TRUE(__test_merge_new(i, j, k, l));return true;+}
static bool test_vma_merge_special_flags(void) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; @@ -973,9 +1010,11 @@ static bool test_vma_merge_new_with_close(void) return true; }
-static bool test_merge_existing(void) +static bool __test_merge_existing(bool prev_is_sticky, bool middle_is_sticky, bool next_is_sticky) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
vm_flags_t prev_flags = vm_flags;vm_flags_t next_flags = vm_flags; struct mm_struct mm = {}; VMA_ITERATOR(vmi, &mm, 0); struct vm_area_struct *vma, *vma_prev, *vma_next;@@ -988,6 +1027,13 @@ static bool test_merge_existing(void) }; struct anon_vma_chain avc = {};
if (prev_is_sticky)prev_flags |= VM_STICKY;if (middle_is_sticky)vm_flags |= VM_STICKY;if (next_is_sticky)next_flags |= VM_STICKY;/* * Merge right case - partial span. *@@ -1000,7 +1046,7 @@ static bool test_merge_existing(void) */ vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags); vma->vm_ops = &vm_ops; /* This should have no impact. */
vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags);
vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags); vma_next->vm_ops = &vm_ops; /* This should have no impact. */ vmg_set_range_anon_vma(&vmg, 0x3000, 0x6000, 3, vm_flags, &dummy_anon_vma); vmg.middle = vma;@@ -1018,6 +1064,8 @@ static bool test_merge_existing(void) ASSERT_TRUE(vma_write_started(vma)); ASSERT_TRUE(vma_write_started(vma_next)); ASSERT_EQ(mm.map_count, 2);
if (middle_is_sticky || next_is_sticky)ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY)); /* Clear down and reset. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);@@ -1033,7 +1081,7 @@ static bool test_merge_existing(void) * NNNNNNN */ vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags);
vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags);
vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags); vma_next->vm_ops = &vm_ops; /* This should have no impact. */ vmg_set_range_anon_vma(&vmg, 0x2000, 0x6000, 2, vm_flags, &dummy_anon_vma); vmg.middle = vma;@@ -1046,6 +1094,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_next)); ASSERT_EQ(mm.map_count, 1);
if (middle_is_sticky || next_is_sticky)ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY)); /* Clear down and reset. We should have deleted vma. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);@@ -1060,7 +1110,7 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPV */
vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags); vma->vm_ops = &vm_ops; /* This should have no impact. */@@ -1080,6 +1130,8 @@ static bool test_merge_existing(void) ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2);
if (prev_is_sticky || middle_is_sticky)ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY)); /* Clear down and reset. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);@@ -1094,7 +1146,7 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPP */
vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags); vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma);@@ -1109,6 +1161,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_EQ(mm.map_count, 1);
if (prev_is_sticky || middle_is_sticky)ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY)); /* Clear down and reset. We should have deleted vma. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);@@ -1123,10 +1177,10 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPPPPP */
vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, vm_flags);
vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, next_flags); vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma); vmg.prev = vma_prev; vmg.middle = vma;@@ -1139,6 +1193,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_EQ(mm.map_count, 1);
if (prev_is_sticky || middle_is_sticky || next_is_sticky)ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY)); /* Clear down and reset. We should have deleted prev and next. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);@@ -1158,9 +1214,9 @@ static bool test_merge_existing(void) * PPPVVVVVNNN */
vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, vm_flags);
vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, vm_flags);
vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, next_flags); vmg_set_range(&vmg, 0x4000, 0x5000, 4, vm_flags); vmg.prev = vma;@@ -1203,6 +1259,19 @@ static bool test_merge_existing(void) return true; }
+static bool test_merge_existing(void) +{
int i, j, k;/* Generate every possible permutation of sticky flags. */for (i = 0; i < 2; i++)for (j = 0; j < 2; j++)for (k = 0; k < 2; k++)ASSERT_TRUE(__test_merge_existing(i, j, k));return true;+}
static bool test_anon_vma_non_mergeable(void) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index e40c93edc5a7..3d9cb3a9411a 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -117,6 +117,38 @@ extern unsigned long dac_mmap_min_addr; #define VM_SEALED VM_NONE #endif
+/* Flags which should result in page tables being copied on fork. */ +#define VM_COPY_ON_FORK VM_MAYBE_GUARD
+/*
- Flags which should be 'sticky' on merge - that is, flags which, when one VMA
- possesses it but the other does not, the merged VMA should nonetheless have
- applied to it:
- VM_COPY_ON_FORK - These flags indicates that a VMA maps a range that contains
metadata which should be unconditionally propagated upon
fork. When merging two VMAs, we encapsulate this range in
the merged VMA, so the flag should be 'sticky' as a result.- */
+#define VM_STICKY VM_COPY_ON_FORK
+/*
- VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one
- of these flags and the other not does not preclude a merge.
- VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
dirty bit -- the caller should mark merged VMA as dirty. If
dirty bit won't be excluded from comparison, we increase
pressure on the memory system forcing the kernel to generate
new VMAs when old one could be extended instead.
- VM_STICKY - If one VMA has flags which must be 'sticky', that is ones
which should propagate to all VMAs, but the other does not,
the merge should still proceed with the merge logic applying
sticky flags to the final VMA.- */
+#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)
#define FIRST_USER_ADDRESS 0UL #define USER_PGTABLES_CEILING 0UL
-- 2.51.0
On Wed, Oct 29, 2025 at 09:35:25PM -0700, Suren Baghdasaryan wrote:
On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
It's useful to be able to force a VMA to be copied on fork outside of the parameters specified by vma_needs_copy(), which otherwise only copies page tables if:
- The destination VMA has VM_UFFD_WP set
- The mapping is a PFN or mixed map
- The mapping is anonymous and forked in (i.e. vma->anon_vma is non-NULL)
Setting this flag implies that the page tables mapping the VMA are such that simply re-faulting the VMA will not re-establish them in identical form.
We introduce VM_COPY_ON_FORK to clearly identify which flags require this behaviour, which currently is only VM_MAYBE_GUARD.
Any VMA flags which require this behaviour are inherently 'sticky', that is, should we merge two VMAs together, this implies that the newly merged VMA maps a range that requires page table copying on fork.
In order to implement this we must both introduce the concept of a 'sticky' VMA flag and adjust the VMA merge logic accordingly, and also have VMA merge still successfully succeed should one VMA have the flag set and another not.
"successfully succeed" sounds weird. Just "succeed"?
Yeah... typo bonanza this series :) will fix.
Note that we update the VMA expand logic to handle new VMA merging, as this function is the one ultimately called by all instances of merging of new VMAs.
This patch implements this, establishing VM_STICKY to contain all such flags and VM_IGNORE_MERGE for those flags which should be ignored when comparing adjacent VMA's flags for the purposes of merging.
As part of this change we place VM_SOFTDIRTY in VM_IGNORE_MERGE as it already had this behaviour, alongside VM_STICKY as sticky flags by implication must not disallow merge.
We update the VMA userland tests to account for the changes and, furthermore, in order to assert that the functionality is workingly
s/workingly/working
Haha good lord. Will fix also!
correctly, update the new VMA and existing VMA merging logic to consider every permutation of the flag being set/not set in all VMAs being considered for merge.
As a result of this change, VMAs with guard ranges will now not have their merge behaviour impacted by doing so and can be freely merged with other VMAs without VM_MAYBE_GUARD set.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
include/linux/mm.h | 32 ++++++++++++ mm/memory.c | 3 +- mm/vma.c | 22 ++++---- tools/testing/vma/vma.c | 89 ++++++++++++++++++++++++++++---- tools/testing/vma/vma_internal.h | 32 ++++++++++++ 5 files changed, 156 insertions(+), 22 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index f963afa1b9de..a8811ba57150 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -522,6 +522,38 @@ extern unsigned int kobjsize(const void *objp); #endif #define VM_FLAGS_CLEAR (ARCH_VM_PKEY_FLAGS | VM_ARCH_CLEAR)
+/* Flags which should result in page tables being copied on fork. */ +#define VM_COPY_ON_FORK VM_MAYBE_GUARD
+/*
- Flags which should be 'sticky' on merge - that is, flags which, when one VMA
- possesses it but the other does not, the merged VMA should nonetheless have
- applied to it:
- VM_COPY_ON_FORK - These flags indicates that a VMA maps a range that contains
metadata which should be unconditionally propagated upon
fork. When merging two VMAs, we encapsulate this range in
the merged VMA, so the flag should be 'sticky' as a result.It's probably worth noting that after a split, we do not remove "sticky" flags even if the VMA acquired them as a result of a previous merge.
Hm I thought this was implied. Will update to be clear however!
- */
+#define VM_STICKY VM_COPY_ON_FORK
+/*
- VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one
- of these flags and the other not does not preclude a merge.
- VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
dirty bit -- the caller should mark merged VMA as dirty. If
dirty bit won't be excluded from comparison, we increase
pressure on the memory system forcing the kernel to generate
new VMAs when old one could be extended instead.
- VM_STICKY - If one VMA has flags which most be 'sticky', that is ones
s/most/must ?
I most learn to not typo so much :)
Yes you're right, will fix! :P
which should propagate to all VMAs, but the other does not,
the merge should still proceed with the merge logic applying
sticky flags to the final VMA.- */
+#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)
/*
- mapping from the currently active vm_flags protection bits (the
- low four bits) to a page protection mask..
diff --git a/mm/memory.c b/mm/memory.c index a2c79ee43d68..9528133e5147 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1478,8 +1478,7 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) if (src_vma->anon_vma) return true;
/* Guard regions have momdified page tables that require copying. */if (src_vma->vm_flags & VM_MAYBE_GUARD)
if (src_vma->vm_flags & VM_COPY_ON_FORK) return true; /*diff --git a/mm/vma.c b/mm/vma.c index 919d1fc63a52..50a6909c4be3 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -89,15 +89,7 @@ static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_nex
if (!mpol_equal(vmg->policy, vma_policy(vma))) return false;
/** VM_SOFTDIRTY should not prevent from VMA merging, if we* match the flags but dirty bit -- the caller should mark* merged VMA as dirty. If dirty bit won't be excluded from* comparison, we increase pressure on the memory system forcing* the kernel to generate new VMAs when old one could be* extended instead.*/if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_SOFTDIRTY)
if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_IGNORE_MERGE) return false; if (vma->vm_file != vmg->file) return false;@@ -809,6 +801,7 @@ static bool can_merge_remove_vma(struct vm_area_struct *vma) static __must_check struct vm_area_struct *vma_merge_existing_range( struct vma_merge_struct *vmg) {
vm_flags_t sticky_flags = vmg->vm_flags & VM_STICKY; struct vm_area_struct *middle = vmg->middle; struct vm_area_struct *prev = vmg->prev; struct vm_area_struct *next;@@ -901,11 +894,13 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( if (merge_right) { vma_start_write(next); vmg->target = next;
sticky_flags |= (next->vm_flags & VM_STICKY); } if (merge_left) { vma_start_write(prev); vmg->target = prev;sticky_flags |= (prev->vm_flags & VM_STICKY); } if (merge_both) {@@ -975,6 +970,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( if (err || commit_merge(vmg)) goto abort;
vm_flags_set(vmg->target, sticky_flags); khugepaged_enter_vma(vmg->target, vmg->vm_flags); vmg->state = VMA_MERGE_SUCCESS; return vmg->target;@@ -1125,6 +1121,10 @@ int vma_expand(struct vma_merge_struct *vmg) bool remove_next = false; struct vm_area_struct *target = vmg->target; struct vm_area_struct *next = vmg->next;
vm_flags_t sticky_flags;sticky_flags = vmg->vm_flags & VM_STICKY;sticky_flags |= target->vm_flags & VM_STICKY; VM_WARN_ON_VMG(!target, vmg);@@ -1134,6 +1134,7 @@ int vma_expand(struct vma_merge_struct *vmg) if (next && (target != next) && (vmg->end == next->vm_end)) { int ret;
sticky_flags |= next->vm_flags & VM_STICKY; remove_next = true; /* This should already have been checked by this point. */ VM_WARN_ON_VMG(!can_merge_remove_vma(next), vmg);@@ -1160,6 +1161,7 @@ int vma_expand(struct vma_merge_struct *vmg) if (commit_merge(vmg)) goto nomem;
vm_flags_set(target, sticky_flags); return 0;nomem: @@ -1903,7 +1905,7 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct * return a->vm_end == b->vm_start && mpol_equal(vma_policy(a), vma_policy(b)) && a->vm_file == b->vm_file &&
!((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_SOFTDIRTY)) &&
!((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_IGNORE_MERGE)) && b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);}
diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index 656e1c75b711..ee9d3547c421 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c
I prefer tests in a separate patch, but that might just be me. Feel free to ignore.
Yeah can split it out! I do tend to do that actually, not sure why I deviated from that here.
@@ -48,6 +48,8 @@ static struct anon_vma dummy_anon_vma; #define ASSERT_EQ(_val1, _val2) ASSERT_TRUE((_val1) == (_val2)) #define ASSERT_NE(_val1, _val2) ASSERT_TRUE((_val1) != (_val2))
+#define IS_SET(_val, _flags) ((_val & _flags) == _flags)
static struct task_struct __current;
struct task_struct *get_current(void) @@ -441,7 +443,7 @@ static bool test_simple_shrink(void) return true; }
-static bool test_merge_new(void) +static bool __test_merge_new(bool is_sticky, bool a_is_sticky, bool b_is_sticky, bool c_is_sticky) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; struct mm_struct mm = {}; @@ -469,23 +471,32 @@ static bool test_merge_new(void) struct vm_area_struct *vma, *vma_a, *vma_b, *vma_c, *vma_d; bool merged;
if (is_sticky)vm_flags |= VM_STICKY;/* * 0123456789abc * AA B CC */ vma_a = alloc_and_link_vma(&mm, 0, 0x2000, 0, vm_flags); ASSERT_NE(vma_a, NULL);if (a_is_sticky)vm_flags_set(vma_a, VM_STICKY); /* We give each VMA a single avc so we can test anon_vma duplication. */ INIT_LIST_HEAD(&vma_a->anon_vma_chain); list_add(&dummy_anon_vma_chain_a.same_vma, &vma_a->anon_vma_chain); vma_b = alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, vm_flags); ASSERT_NE(vma_b, NULL);if (b_is_sticky)vm_flags_set(vma_b, VM_STICKY); INIT_LIST_HEAD(&vma_b->anon_vma_chain); list_add(&dummy_anon_vma_chain_b.same_vma, &vma_b->anon_vma_chain); vma_c = alloc_and_link_vma(&mm, 0xb000, 0xc000, 0xb, vm_flags); ASSERT_NE(vma_c, NULL);if (c_is_sticky)vm_flags_set(vma_c, VM_STICKY); INIT_LIST_HEAD(&vma_c->anon_vma_chain); list_add(&dummy_anon_vma_chain_c.same_vma, &vma_c->anon_vma_chain);@@ -520,6 +531,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3);
if (is_sticky || a_is_sticky || b_is_sticky)ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY)); /* * Merge to PREVIOUS VMA.@@ -537,6 +550,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3);
if (is_sticky || a_is_sticky)ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY)); /* * Merge to NEXT VMA.@@ -556,6 +571,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 3);
if (is_sticky) /* D uses is_sticky. */ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY)); /* * Merge BOTH sides.@@ -574,6 +591,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2);
if (is_sticky || a_is_sticky)ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY)); /* * Merge to NEXT VMA.@@ -592,6 +611,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2);
if (is_sticky || c_is_sticky)ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY)); /* * Merge BOTH sides.@@ -609,6 +630,8 @@ static bool test_merge_new(void) ASSERT_EQ(vma->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 1);
if (is_sticky || a_is_sticky || c_is_sticky)ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY)); /* * Final state.@@ -637,6 +660,20 @@ static bool test_merge_new(void) return true; }
+static bool test_merge_new(void) +{
int i, j, k, l;/* Generate every possible permutation of sticky flags. */for (i = 0; i < 2; i++)for (j = 0; j < 2; j++)for (k = 0; k < 2; k++)for (l = 0; l < 2; l++)ASSERT_TRUE(__test_merge_new(i, j, k, l));return true;+}
static bool test_vma_merge_special_flags(void) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; @@ -973,9 +1010,11 @@ static bool test_vma_merge_new_with_close(void) return true; }
-static bool test_merge_existing(void) +static bool __test_merge_existing(bool prev_is_sticky, bool middle_is_sticky, bool next_is_sticky) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
vm_flags_t prev_flags = vm_flags;vm_flags_t next_flags = vm_flags; struct mm_struct mm = {}; VMA_ITERATOR(vmi, &mm, 0); struct vm_area_struct *vma, *vma_prev, *vma_next;@@ -988,6 +1027,13 @@ static bool test_merge_existing(void) }; struct anon_vma_chain avc = {};
if (prev_is_sticky)prev_flags |= VM_STICKY;if (middle_is_sticky)vm_flags |= VM_STICKY;if (next_is_sticky)next_flags |= VM_STICKY;/* * Merge right case - partial span. *@@ -1000,7 +1046,7 @@ static bool test_merge_existing(void) */ vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags); vma->vm_ops = &vm_ops; /* This should have no impact. */
vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags);
vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags); vma_next->vm_ops = &vm_ops; /* This should have no impact. */ vmg_set_range_anon_vma(&vmg, 0x3000, 0x6000, 3, vm_flags, &dummy_anon_vma); vmg.middle = vma;@@ -1018,6 +1064,8 @@ static bool test_merge_existing(void) ASSERT_TRUE(vma_write_started(vma)); ASSERT_TRUE(vma_write_started(vma_next)); ASSERT_EQ(mm.map_count, 2);
if (middle_is_sticky || next_is_sticky)ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY)); /* Clear down and reset. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);@@ -1033,7 +1081,7 @@ static bool test_merge_existing(void) * NNNNNNN */ vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags);
vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags);
vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags); vma_next->vm_ops = &vm_ops; /* This should have no impact. */ vmg_set_range_anon_vma(&vmg, 0x2000, 0x6000, 2, vm_flags, &dummy_anon_vma); vmg.middle = vma;@@ -1046,6 +1094,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_next)); ASSERT_EQ(mm.map_count, 1);
if (middle_is_sticky || next_is_sticky)ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY)); /* Clear down and reset. We should have deleted vma. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);@@ -1060,7 +1110,7 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPV */
vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags); vma->vm_ops = &vm_ops; /* This should have no impact. */@@ -1080,6 +1130,8 @@ static bool test_merge_existing(void) ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_TRUE(vma_write_started(vma)); ASSERT_EQ(mm.map_count, 2);
if (prev_is_sticky || middle_is_sticky)ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY)); /* Clear down and reset. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);@@ -1094,7 +1146,7 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPP */
vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags); vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma);@@ -1109,6 +1161,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_EQ(mm.map_count, 1);
if (prev_is_sticky || middle_is_sticky)ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY)); /* Clear down and reset. We should have deleted vma. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);@@ -1123,10 +1177,10 @@ static bool test_merge_existing(void) * 0123456789 * PPPPPPPPPP */
vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma_prev->vm_ops = &vm_ops; /* This should have no impact. */ vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, vm_flags);
vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, next_flags); vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma); vmg.prev = vma_prev; vmg.middle = vma;@@ -1139,6 +1193,8 @@ static bool test_merge_existing(void) ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma); ASSERT_TRUE(vma_write_started(vma_prev)); ASSERT_EQ(mm.map_count, 1);
if (prev_is_sticky || middle_is_sticky || next_is_sticky)ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY)); /* Clear down and reset. We should have deleted prev and next. */ ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);@@ -1158,9 +1214,9 @@ static bool test_merge_existing(void) * PPPVVVVVNNN */
vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags); vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, vm_flags);
vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, vm_flags);
vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, next_flags); vmg_set_range(&vmg, 0x4000, 0x5000, 4, vm_flags); vmg.prev = vma;@@ -1203,6 +1259,19 @@ static bool test_merge_existing(void) return true; }
+static bool test_merge_existing(void) +{
int i, j, k;/* Generate every possible permutation of sticky flags. */for (i = 0; i < 2; i++)for (j = 0; j < 2; j++)for (k = 0; k < 2; k++)ASSERT_TRUE(__test_merge_existing(i, j, k));return true;+}
static bool test_anon_vma_non_mergeable(void) { vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index e40c93edc5a7..3d9cb3a9411a 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -117,6 +117,38 @@ extern unsigned long dac_mmap_min_addr; #define VM_SEALED VM_NONE #endif
+/* Flags which should result in page tables being copied on fork. */ +#define VM_COPY_ON_FORK VM_MAYBE_GUARD
+/*
- Flags which should be 'sticky' on merge - that is, flags which, when one VMA
- possesses it but the other does not, the merged VMA should nonetheless have
- applied to it:
- VM_COPY_ON_FORK - These flags indicates that a VMA maps a range that contains
metadata which should be unconditionally propagated upon
fork. When merging two VMAs, we encapsulate this range in
the merged VMA, so the flag should be 'sticky' as a result.- */
+#define VM_STICKY VM_COPY_ON_FORK
+/*
- VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one
- of these flags and the other not does not preclude a merge.
- VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
dirty bit -- the caller should mark merged VMA as dirty. If
dirty bit won't be excluded from comparison, we increase
pressure on the memory system forcing the kernel to generate
new VMAs when old one could be extended instead.
- VM_STICKY - If one VMA has flags which must be 'sticky', that is ones
which should propagate to all VMAs, but the other does not,
the merge should still proceed with the merge logic applying
sticky flags to the final VMA.- */
+#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)
#define FIRST_USER_ADDRESS 0UL #define USER_PGTABLES_CEILING 0UL
-- 2.51.0
Thanks for review!
Assert that we observe guard regions appearing in /proc/$pid/smaps as expected, and when split/merge is performed too (with expected sticky behaviour).
Also add handling for file systems which don't sanely handle mmap() VMA merging so we don't incorrectly encounter a test failure in this situation.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com --- tools/testing/selftests/mm/guard-regions.c | 120 +++++++++++++++++++++ tools/testing/selftests/mm/vm_util.c | 5 + tools/testing/selftests/mm/vm_util.h | 1 + 3 files changed, 126 insertions(+)
diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c index 8dd81c0a4a5a..a9be11e03a6a 100644 --- a/tools/testing/selftests/mm/guard-regions.c +++ b/tools/testing/selftests/mm/guard-regions.c @@ -94,6 +94,7 @@ static void *mmap_(FIXTURE_DATA(guard_regions) * self, case ANON_BACKED: flags |= MAP_PRIVATE | MAP_ANON; fd = -1; + offset = 0; break; case SHMEM_BACKED: case LOCAL_FILE_BACKED: @@ -260,6 +261,54 @@ static bool is_buf_eq(char *buf, size_t size, char chr) return true; }
+/* + * Some file systems have issues with merging due to changing merge-sensitive + * parameters in the .mmap callback, and prior to .mmap_prepare being + * implemented everywhere this will now result in an unexpected failure to + * merge (e.g. - overlayfs). + * + * Perform a simple test to see if the local file system suffers from this, if + * it does then we can skip test logic that assumes local file system merging is + * sane. + */ +static bool local_fs_has_sane_mmap(FIXTURE_DATA(guard_regions) * self, + const FIXTURE_VARIANT(guard_regions) * variant) +{ + const unsigned long page_size = self->page_size; + char *ptr, *ptr2; + struct procmap_fd procmap; + + if (variant->backing != LOCAL_FILE_BACKED) + return true; + + /* Map 10 pages. */ + ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0); + if (ptr == MAP_FAILED) + return false; + /* Unmap the middle. */ + munmap(&ptr[5 * page_size], page_size); + + /* Map again. */ + ptr2 = mmap_(self, variant, &ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE, + MAP_FIXED, 5 * page_size); + + if (ptr2 == MAP_FAILED) + return false; + + /* Now make sure they all merged. */ + if (open_self_procmap(&procmap) != 0) + return false; + if (!find_vma_procmap(&procmap, ptr)) + return false; + if (procmap.query.vma_start != (unsigned long)ptr) + return false; + if (procmap.query.vma_end != (unsigned long)ptr + 10 * page_size) + return false; + close_procmap(&procmap); + + return true; +} + FIXTURE_SETUP(guard_regions) { self->page_size = (unsigned long)sysconf(_SC_PAGESIZE); @@ -2138,4 +2187,75 @@ TEST_F(guard_regions, pagemap_scan) ASSERT_EQ(munmap(ptr, 10 * page_size), 0); }
+TEST_F(guard_regions, smaps) +{ + const unsigned long page_size = self->page_size; + struct procmap_fd procmap; + char *ptr, *ptr2; + int i; + + /* Map a region. */ + ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* We shouldn't yet see a guard flag. */ + ASSERT_FALSE(check_vmflag_guard(ptr)); + + /* Install a single guard region. */ + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0); + + /* Now we should see a guard flag. */ + ASSERT_TRUE(check_vmflag_guard(ptr)); + + /* + * Removing the guard region should not change things because we simply + * cannot accurately track whether a given VMA has had all of its guard + * regions removed. + */ + ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_REMOVE), 0); + ASSERT_TRUE(check_vmflag_guard(ptr)); + + /* Install guard regions throughout. */ + for (i = 0; i < 10; i++) { + ASSERT_EQ(madvise(&ptr[i * page_size], page_size, MADV_GUARD_INSTALL), 0); + /* We should always see the guard region flag. */ + ASSERT_TRUE(check_vmflag_guard(ptr)); + } + + /* Split into two VMAs. */ + ASSERT_EQ(munmap(&ptr[4 * page_size], page_size), 0); + + /* Both VMAs should have the guard flag set. */ + ASSERT_TRUE(check_vmflag_guard(ptr)); + ASSERT_TRUE(check_vmflag_guard(&ptr[5 * page_size])); + + /* + * If the local file system is unable to merge VMAs due to having + * unusual characteristics, there is no point in asserting merge + * behaviour. + */ + if (!local_fs_has_sane_mmap(self, variant)) { + TH_LOG("local filesystem does not support sane merging skipping merge test"); + return; + } + + /* Map a fresh VMA between the two split VMAs. */ + ptr2 = mmap_(self, variant, &ptr[4 * page_size], page_size, + PROT_READ | PROT_WRITE, MAP_FIXED, 4 * page_size); + ASSERT_NE(ptr2, MAP_FAILED); + + /* + * Check the procmap to ensure that this VMA merged with the adjacent + * two. The guard region flag is 'sticky' so should not preclude + * merging. + */ + ASSERT_EQ(open_self_procmap(&procmap), 0); + ASSERT_TRUE(find_vma_procmap(&procmap, ptr)); + ASSERT_EQ(procmap.query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap.query.vma_end, (unsigned long)ptr + 10 * page_size); + ASSERT_EQ(close_procmap(&procmap), 0); + /* And, of course, this VMA should have the guard flag set. */ + ASSERT_TRUE(check_vmflag_guard(ptr)); +} + TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c index e33cda301dad..605cb58ea5c3 100644 --- a/tools/testing/selftests/mm/vm_util.c +++ b/tools/testing/selftests/mm/vm_util.c @@ -449,6 +449,11 @@ bool check_vmflag_pfnmap(void *addr) return check_vmflag(addr, "pf"); }
+bool check_vmflag_guard(void *addr) +{ + return check_vmflag(addr, "gu"); +} + bool softdirty_supported(void) { char *addr; diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index 26c30fdc0241..a8abdf414d46 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -98,6 +98,7 @@ int uffd_register_with_ioctls(int uffd, void *addr, uint64_t len, unsigned long get_free_hugepages(void); bool check_vmflag_io(void *addr); bool check_vmflag_pfnmap(void *addr); +bool check_vmflag_guard(void *addr); int open_procmap(pid_t pid, struct procmap_fd *procmap_out); int query_procmap(struct procmap_fd *procmap); bool find_vma_procmap(struct procmap_fd *procmap, void *address);
On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
Assert that we observe guard regions appearing in /proc/$pid/smaps as expected, and when split/merge is performed too (with expected sticky behaviour).
Also add handling for file systems which don't sanely handle mmap() VMA merging so we don't incorrectly encounter a test failure in this situation.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Reviewed-by: Suren Baghdasaryan surenb@google.com
tools/testing/selftests/mm/guard-regions.c | 120 +++++++++++++++++++++ tools/testing/selftests/mm/vm_util.c | 5 + tools/testing/selftests/mm/vm_util.h | 1 + 3 files changed, 126 insertions(+)
diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c index 8dd81c0a4a5a..a9be11e03a6a 100644 --- a/tools/testing/selftests/mm/guard-regions.c +++ b/tools/testing/selftests/mm/guard-regions.c @@ -94,6 +94,7 @@ static void *mmap_(FIXTURE_DATA(guard_regions) * self, case ANON_BACKED: flags |= MAP_PRIVATE | MAP_ANON; fd = -1;
offset = 0; break; case SHMEM_BACKED: case LOCAL_FILE_BACKED:@@ -260,6 +261,54 @@ static bool is_buf_eq(char *buf, size_t size, char chr) return true; }
+/*
- Some file systems have issues with merging due to changing merge-sensitive
- parameters in the .mmap callback, and prior to .mmap_prepare being
- implemented everywhere this will now result in an unexpected failure to
- merge (e.g. - overlayfs).
- Perform a simple test to see if the local file system suffers from this, if
- it does then we can skip test logic that assumes local file system merging is
- sane.
- */
+static bool local_fs_has_sane_mmap(FIXTURE_DATA(guard_regions) * self,
const FIXTURE_VARIANT(guard_regions) * variant)+{
const unsigned long page_size = self->page_size;char *ptr, *ptr2;struct procmap_fd procmap;if (variant->backing != LOCAL_FILE_BACKED)return true;/* Map 10 pages. */ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0);if (ptr == MAP_FAILED)return false;/* Unmap the middle. */munmap(&ptr[5 * page_size], page_size);/* Map again. */ptr2 = mmap_(self, variant, &ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE,MAP_FIXED, 5 * page_size);if (ptr2 == MAP_FAILED)return false;/* Now make sure they all merged. */if (open_self_procmap(&procmap) != 0)return false;if (!find_vma_procmap(&procmap, ptr))return false;if (procmap.query.vma_start != (unsigned long)ptr)return false;if (procmap.query.vma_end != (unsigned long)ptr + 10 * page_size)return false;close_procmap(&procmap);return true;+}
FIXTURE_SETUP(guard_regions) { self->page_size = (unsigned long)sysconf(_SC_PAGESIZE); @@ -2138,4 +2187,75 @@ TEST_F(guard_regions, pagemap_scan) ASSERT_EQ(munmap(ptr, 10 * page_size), 0); }
+TEST_F(guard_regions, smaps) +{
const unsigned long page_size = self->page_size;struct procmap_fd procmap;char *ptr, *ptr2;int i;/* Map a region. */ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0);ASSERT_NE(ptr, MAP_FAILED);/* We shouldn't yet see a guard flag. */ASSERT_FALSE(check_vmflag_guard(ptr));/* Install a single guard region. */ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0);/* Now we should see a guard flag. */ASSERT_TRUE(check_vmflag_guard(ptr));/** Removing the guard region should not change things because we simply* cannot accurately track whether a given VMA has had all of its guard* regions removed.*/ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_REMOVE), 0);ASSERT_TRUE(check_vmflag_guard(ptr));/* Install guard regions throughout. */for (i = 0; i < 10; i++) {ASSERT_EQ(madvise(&ptr[i * page_size], page_size, MADV_GUARD_INSTALL), 0);/* We should always see the guard region flag. */ASSERT_TRUE(check_vmflag_guard(ptr));}/* Split into two VMAs. */ASSERT_EQ(munmap(&ptr[4 * page_size], page_size), 0);/* Both VMAs should have the guard flag set. */ASSERT_TRUE(check_vmflag_guard(ptr));ASSERT_TRUE(check_vmflag_guard(&ptr[5 * page_size]));/** If the local file system is unable to merge VMAs due to having* unusual characteristics, there is no point in asserting merge* behaviour.*/if (!local_fs_has_sane_mmap(self, variant)) {TH_LOG("local filesystem does not support sane merging skipping merge test");return;}/* Map a fresh VMA between the two split VMAs. */ptr2 = mmap_(self, variant, &ptr[4 * page_size], page_size,PROT_READ | PROT_WRITE, MAP_FIXED, 4 * page_size);ASSERT_NE(ptr2, MAP_FAILED);/** Check the procmap to ensure that this VMA merged with the adjacent* two. The guard region flag is 'sticky' so should not preclude* merging.*/ASSERT_EQ(open_self_procmap(&procmap), 0);ASSERT_TRUE(find_vma_procmap(&procmap, ptr));ASSERT_EQ(procmap.query.vma_start, (unsigned long)ptr);ASSERT_EQ(procmap.query.vma_end, (unsigned long)ptr + 10 * page_size);ASSERT_EQ(close_procmap(&procmap), 0);/* And, of course, this VMA should have the guard flag set. */ASSERT_TRUE(check_vmflag_guard(ptr));+}
TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c index e33cda301dad..605cb58ea5c3 100644 --- a/tools/testing/selftests/mm/vm_util.c +++ b/tools/testing/selftests/mm/vm_util.c @@ -449,6 +449,11 @@ bool check_vmflag_pfnmap(void *addr) return check_vmflag(addr, "pf"); }
+bool check_vmflag_guard(void *addr) +{
return check_vmflag(addr, "gu");+}
bool softdirty_supported(void) { char *addr; diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index 26c30fdc0241..a8abdf414d46 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -98,6 +98,7 @@ int uffd_register_with_ioctls(int uffd, void *addr, uint64_t len, unsigned long get_free_hugepages(void); bool check_vmflag_io(void *addr); bool check_vmflag_pfnmap(void *addr); +bool check_vmflag_guard(void *addr); int open_procmap(pid_t pid, struct procmap_fd *procmap_out); int query_procmap(struct procmap_fd *procmap); bool find_vma_procmap(struct procmap_fd *procmap, void *address); -- 2.51.0
On Wed, Oct 29, 2025 at 09:40:19PM -0700, Suren Baghdasaryan wrote:
On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
Assert that we observe guard regions appearing in /proc/$pid/smaps as expected, and when split/merge is performed too (with expected sticky behaviour).
Also add handling for file systems which don't sanely handle mmap() VMA merging so we don't incorrectly encounter a test failure in this situation.
Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Reviewed-by: Suren Baghdasaryan surenb@google.com
Thanks!
linux-kselftest-mirror@lists.linaro.org