On Wed, Nov 19, 2025 at 10:16:14AM +0100, David Hildenbrand (Red Hat) wrote:
+/* Can we retract page tables for this file-backed VMA? */
+static bool file_backed_vma_is_retractable(struct vm_area_struct *vma)
It's not really the VMA that is retractable :)
Given that the function this is called from is called "retract_page_tables" (and not file_backed_...) I guess I would just have called this
"page_tables_are_retractable"
"page_tables_support_retract"
Or sth. along those lines.
Well, it's specific to the VMA, and it starts getting messy; this is the problem with naming, you can_end_up_with_way_too_specific_names :)
Also you'd need to say file-backed for clarity really, and that's getting far too long...
I think this is fine as-is given it's a static function; a user wondering 'what does retractable mean?' can see right away from the comment immediately above the function name.
+{
+	/*
+	 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
+	 * got written to. These VMAs are likely not worth removing
+	 * page tables from, as PMD-mapping is likely to be split later.
+	 */
+	if (READ_ONCE(vma->anon_vma))
+		return false;
+
+	/*
+	 * When a vma is registered with uffd-wp, we cannot recycle
+	 * the page table because there may be pte markers installed.
+	 * Other vmas can still have the same file mapped hugely, but
+	 * skip this one: it will always be mapped in small page size
+	 * for uffd-wp registered ranges.
+	 */
+	if (userfaultfd_wp(vma))
+		return false;
+
+	/*
+	 * If the VMA contains guard regions then we can't collapse it.
+	 *
+	 * This is set atomically on guard marker installation under mmap/VMA
+	 * read lock, and here we may not hold any VMA or mmap lock at all.
+	 *
+	 * This is therefore serialised on the PTE page table lock, which is
+	 * obtained on guard region installation after the flag is set, so this
+	 * check being performed under this lock excludes races.
+	 */
+	if (vma_flag_test_atomic(vma, VM_MAYBE_GUARD_BIT))
+		return false;
+
+	return true;
+}
 static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma;
@@ -1724,14 +1761,6 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		spinlock_t *ptl;
 		bool success = false;

-		/*
-		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
-		 * got written to. These VMAs are likely not worth removing
-		 * page tables from, as PMD-mapping is likely to be split later.
-		 */
-		if (READ_ONCE(vma->anon_vma))
-			continue;
-
 		addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		if (addr & ~HPAGE_PMD_MASK ||
 		    vma->vm_end < addr + HPAGE_PMD_SIZE)
@@ -1743,14 +1772,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		if (hpage_collapse_test_exit(mm))
 			continue;
-		/*
-		 * When a vma is registered with uffd-wp, we cannot recycle
-		 * the page table because there may be pte markers installed.
-		 * Other vmas can still have the same file mapped hugely, but
-		 * skip this one: it will always be mapped in small page size
-		 * for uffd-wp registered ranges.
-		 */
-		if (userfaultfd_wp(vma))
+
+		if (!file_backed_vma_is_retractable(vma))
 			continue;

 		/* PTEs were notified when unmapped; but now for the PMD? */
@@ -1777,15 +1800,15 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 		/*
-		 * Huge page lock is still held, so normally the page table
-		 * must remain empty; and we have already skipped anon_vma
-		 * and userfaultfd_wp() vmas. But since the mmap_lock is not
-		 * held, it is still possible for a racing userfaultfd_ioctl()
-		 * to have inserted ptes or markers. Now that we hold ptlock,
-		 * repeating the anon_vma check protects from one category,
-		 * and repeating the userfaultfd_wp() check from another.
+		 * Huge page lock is still held, so normally the page table must
+		 * remain empty; and we have already skipped anon_vma and
+		 * userfaultfd_wp() vmas. But since the mmap_lock is not held,
+		 * it is still possible for a racing userfaultfd_ioctl() or
+		 * madvise() to have inserted ptes or markers. Now that we hold
+		 * ptlock, repeating the retractable checks protects us from
+		 * races against the prior checks.
 		 */
-		if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
+		if (likely(file_backed_vma_is_retractable(vma))) {
 			pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
 			pmdp_get_lockless_sync();
 			success = true;

diff --git a/mm/madvise.c b/mm/madvise.c
index 0b3280752bfb..5dbe40be7c65 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1141,15 +1141,21 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
 		return -EINVAL;

 	/*
-	 * If we install guard markers, then the range is no longer
-	 * empty from a page table perspective and therefore it's
-	 * appropriate to have an anon_vma.
-	 *
-	 * This ensures that on fork, we copy page tables correctly.
+	 * Set atomically under read lock. All pertinent readers will need to
+	 * acquire an mmap/VMA write lock to read it. All remaining readers may
+	 * or may not see the flag set, but we don't care.
 	 */
+	vma_flag_set_atomic(vma, VM_MAYBE_GUARD_BIT);
In general LGTM.
Thanks
+	/*
+	 * If anonymous and we are establishing page tables the VMA ought to
+	 * have an anon_vma associated with it.

Do you know why? I know that as soon as we have anon folios in there we need it, but is it still required for guard regions? Patch #5 should handle the fork case I guess.
Which other code depends on that?
There seems to be a general convention of people seeing 'vma->anon_vma' as meaning it has page tables, and vice-versa, for anon VMAs.
Obviously we change fork behaviour for this now with the flag, and _perhaps_ it's not necessary, but I'd rather keep this consistent for now (this is what we were doing before) and come back to it, rather than audit the code base for assumptions.
I'd probably like to do a patch adding vma_has_page_tables() or something that EXPLICITLY spells this out for cases that need it.
And it's not really overhead, as there'd not be much point in guard regions if you didn't fault in the memory (running off the end of an empty range doesn't really make sense).
The key change here is that file-backed guard regions no longer do the horrible thing of having your file-backed VMA act as if it were MAP_PRIVATE with its own pointless anon_vma just to get correct fork behaviour... :)
--
Cheers
David
Thanks, Lorenzo