On Wed, Oct 29, 2025 at 06:05:54PM -0700, Suren Baghdasaryan wrote:
On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions).
Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level.
This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges.
This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split).
nit: I know I suck at naming but I think VM_MAY_HAVE_GUARDS would better represent the meaning.
We all suck at naming :) it's the hardest bit! :P
Hm I don't love that, bit overwrought, I do think 'maybe guard' is a better shorthand for this flag name.
I am open to other suggestions but I think the original wins on succinctness here!
We utilise 0x800 for this flag which makes it available to 32-bit architectures also, a flag that was previously used by VM_DENYWRITE, which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't bee reused yet.
s/bee/been
Yeah this series appears to be a bonanza of typos... not sure why :) will fix.
but I'm not even sure the above paragraph has to be included in the changelog. It's a technical detail IMHO.
Well, I think it's actually important to highlight that we have a VMA flag free and why. I know it's bordering on extraneous, but I don't think there's any harm in mentioning it.
Otherwise people might wonder 'oh is this flag used elsewhere somehow' etc.
The MADV_GUARD_INSTALL madvise() operation now must take an mmap write lock (and also VMA write lock) whereas previously it did not, but this seems a reasonable overhead.
I guess this is because it is modifying vm_flags now?
Yes
We also update the smaps logic and documentation to identify these VMAs.
Another major use of this functionality is that we can use it to identify that we ought to copy page tables on fork.
For anonymous mappings this is inherent, however since commit f807123d578d ("mm: allow guard regions in file-backed and read-only mappings") which allowed file-backed guard regions, we have unfortunately had to enforce this behaviour by setting vma->anon_vma to force page table copying.
The existence of this flag removes the need for this, so we simply update vma_needs_copy() to check for this flag instead.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Overall, makes sense to me and I think we could use it.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Thanks!
It would be nice to have a way for userspace to reset this flag if it confirms that the VMA does not really have any guards (using, say, PAGEMAP_SCAN), but I think such an API could be abused.
Yeah, I'd rather not for that reason.
 Documentation/filesystems/proc.rst |  1 +
 fs/proc/task_mmu.c                 |  1 +
 include/linux/mm.h                 |  1 +
 include/trace/events/mmflags.h     |  1 +
 mm/madvise.c                       | 22 ++++++++++++++--------
 mm/memory.c                        |  4 ++++
 tools/testing/vma/vma_internal.h   |  1 +
 7 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 0b86a8022fa1..b8a423ca590a 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -591,6 +591,7 @@ encoded manner. The codes are the following:
     sl    sealed
     lf    lock on fault pages
     dp    always lazily freeable mapping
+    gu    maybe contains guard regions (if not set, definitely doesn't)
     ==    =======================================

 Note that there is no guarantee that every flag and associated mnemonic will
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fc35a0543f01..db16ed91c269 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1146,6 +1146,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_MAYSHARE)]	= "ms",
 		[ilog2(VM_GROWSDOWN)]	= "gd",
 		[ilog2(VM_PFNMAP)]	= "pf",
+		[ilog2(VM_MAYBE_GUARD)]	= "gu",
 		[ilog2(VM_LOCKED)]	= "lo",
 		[ilog2(VM_IO)]		= "io",
 		[ilog2(VM_SEQ_READ)]	= "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index aada935c4950..f963afa1b9de 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -296,6 +296,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_UFFD_MISSING	0
 #endif /* CONFIG_MMU */

 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
+#define VM_MAYBE_GUARD	0x00000800	/* The VMA maybe contains guard regions. */
 #define VM_UFFD_WP	0x00001000	/* wrprotect pages tracking */
 #define VM_LOCKED	0x00002000
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index aa441f593e9a..a6e5a44c9b42 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -213,6 +213,7 @@ IF_HAVE_PG_ARCH_3(arch_3)
 	{VM_UFFD_MISSING,	"uffd_missing"	},	\
 IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR,	"uffd_minor"	)	\
 	{VM_PFNMAP,	"pfnmap"	},	\
+	{VM_MAYBE_GUARD,	"maybe_guard"	},	\
 	{VM_UFFD_WP,	"uffd_wp"	},	\
 	{VM_LOCKED,	"locked"	},	\
 	{VM_IO,		"io"	},	\
diff --git a/mm/madvise.c b/mm/madvise.c
index fb1c86e630b6..216ae6ed344e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1141,15 +1141,22 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
 		return -EINVAL;

 	/*
-	 * If we install guard markers, then the range is no longer
-	 * empty from a page table perspective and therefore it's
-	 * appropriate to have an anon_vma.
+	 * It would be confusing for anonymous mappings to have page table
+	 * entries but no anon_vma established, so ensure that it is.
+	 */
+	if (vma_is_anonymous(vma))
+		anon_vma_prepare(vma);
+
+	/*
+	 * Indicate that the VMA may contain guard regions, making it visible to
+	 * the user that a VMA may contain these, narrowing down the range which
+	 * must be scanned in order to detect them.
 	 *
-	 * This ensures that on fork, we copy page tables correctly.
+	 * This additionally causes page tables to be copied on fork regardless
+	 * of whether the VMA is anonymous or not, correctly preserving the
+	 * guard region page table entries.
 	 */
-	err = anon_vma_prepare(vma);
-	if (err)
-		return err;
+	vm_flags_set(vma, VM_MAYBE_GUARD);

 	/*
 	 * Optimistically try to install the guard marker pages first. If any
@@ -1709,7 +1716,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavior)
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
 	case MADV_COLLAPSE:
-	case MADV_GUARD_INSTALL:
 	case MADV_GUARD_REMOVE:
 		return MADVISE_MMAP_READ_LOCK;
 	case MADV_DONTNEED:
diff --git a/mm/memory.c b/mm/memory.c
index 4c3a7e09a159..a2c79ee43d68 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	if (src_vma->anon_vma)
 		return true;

+	/* Guard regions have modified page tables that require copying. */
+	if (src_vma->vm_flags & VM_MAYBE_GUARD)
+		return true;
+
 	/*
 	 * Don't copy ptes where a page fault will fill them correctly. Fork
 	 * becomes much lighter when there are big shared or private readonly
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index d873667704e8..e40c93edc5a7 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -56,6 +56,7 @@ extern unsigned long dac_mmap_min_addr;
 #define VM_MAYEXEC	0x00000040
 #define VM_GROWSDOWN	0x00000100
 #define VM_PFNMAP	0x00000400
+#define VM_MAYBE_GUARD	0x00000800
 #define VM_LOCKED	0x00002000
 #define VM_IO		0x00004000
 #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
2.51.0