MADV_COLLAPSE acts on one hugepage-aligned/sized region at a time, until it has collapsed all eligible memory contained within the bounds supplied by the user.
At the top of each hugepage iteration we (re)lock mmap_lock, (re)validate the VMA for eligibility, and update variables that might have changed while mmap_lock was dropped. One thing that might occur is that the VMA could be resized; as such, we refetch vma->vm_end to make sure we don't collapse past the VMA's new end.
However, it's possible that when refetching vma->vm_end we expand the region acted on by MADV_COLLAPSE, if vma->vm_end is greater than the requested end (start + len supplied by the user).
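For context, a simplified sketch of the relevant loop in madvise_collapse() (mm/khugepaged.c); this is a paraphrase with error handling elided, not the verbatim kernel code:

	hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
	hend = end & HPAGE_PMD_MASK;

	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
		if (!mmap_locked) {
			mmap_read_lock(mm);
			mmap_locked = true;
			result = hugepage_vma_revalidate(mm, addr, false,
							 &vma, cc);
			if (result != SCAN_SUCCEED)
				goto out_nolock;
			/* Buggy refetch: can move hend past the user's end */
			hend = vma->vm_end & HPAGE_PMD_MASK;
		}
		/* ... collapse the hugepage-sized region at addr ... */
	}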
The consequence here is that we may attempt to collapse more memory than requested, possibly yielding either "too much success" or "false failure" user-visible results. An example of the former: if we MADV_COLLAPSE the first 4MiB of a 2TiB mmap()'d file, the incorrect refetch would cause the operation to block for much longer than anticipated, as we attempt to collapse the entire 2TiB region. An example of the latter: applying MADV_COLLAPSE to a 4MiB file mapped to the start of a 6MiB VMA will successfully collapse the first 4MiB, then incorrectly attempt to collapse the last hugepage-aligned/sized region, fail (since readahead/page cache lookup will fail), and report a failure to the user.
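Walking through the second example with concrete numbers (hypothetical base address; HPAGE_PMD_SIZE assumed to be 2MiB):

	/*
	 * 4MiB file mapped at the start of a 6MiB VMA,
	 * madvise(start, 4MiB, MADV_COLLAPSE):
	 *
	 *   requested:     hend = (start + 4MiB) & HPAGE_PMD_MASK = start + 4MiB
	 *   buggy refetch: hend = vma->vm_end & HPAGE_PMD_MASK    = start + 6MiB
	 *
	 * The loop additionally tries [start + 4MiB, start + 6MiB), where the
	 * page cache lookup fails, so the call reports failure to the user
	 * even though the requested 4MiB was collapsed successfully.
	 */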
Don't expand the acted-on region when refetching vma->vm_end.
Fixes: 4d24de9425f7 ("mm: MADV_COLLAPSE: refetch vm_end after reacquiring mmap_lock")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Cc: Yang Shi <shy828301@gmail.com>
---
v1->v2: Updated changelog to make clear what user-visible issues this patch addresses, as well as to make the case for backporting (Andrew Morton).
While there aren't any stability risks, without this patch there exist trivial examples where MADV_COLLAPSE won't work; as such, this should be backported to stable 6.1.X to make MADV_COLLAPSE dependable in such cases.
v1: https://lore.kernel.org/linux-mm/CAAa6QmRx_b2UCJWE2XZ3=3c3-_N3R4cDGX6Wm4OT7q...
---
 mm/khugepaged.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5cb401aa2b9d..b4d2ec0a94ed 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2649,7 +2649,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 			goto out_nolock;
 		}
 
-		hend = vma->vm_end & HPAGE_PMD_MASK;
+		hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
 	}
 	mmap_assert_locked(mm);
 	memset(cc->node_load, 0, sizeof(cc->node_load));
SHMEM_HUGE_DENY is for emergency use by the admin, to disable allocation of shmem huge pages if, for example, a dangerous bug is found in their usage: see "deny" in Documentation/mm/transhuge.rst. An app using madvise(,,MADV_COLLAPSE) should not be allowed to override it: restore its precedence over shmem_huge_force.
Restore SHMEM_HUGE_DENY precedence over MADV_COLLAPSE.
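As a userspace illustration (a hedged sketch, with hypothetical sizes; MADV_COLLAPSE has been in the uapi headers since 6.1), the following is expected to fail once this patch is applied and the admin has written "deny" to /sys/kernel/mm/transparent_hugepage/shmem_enabled:

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* uapi value, kernels >= 6.1 */
	#endif

	int main(void)
	{
		/* 4MiB guarantees at least one aligned PMD-sized region */
		size_t len = 4UL << 20;
		/* MAP_SHARED|MAP_ANONYMOUS memory is shmem-backed */
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		memset(p, 1, len);	/* touch it so there is something to collapse */
		if (madvise(p, len, MADV_COLLAPSE))
			perror("madvise(MADV_COLLAPSE)");	/* expected under "deny" */
		return 0;
	}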
Fixes: 7c6c6cc4d3a2 ("mm/shmem: add flag to enforce shmem THP in hugepage_vma_check()")
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Cc: Yang Shi <shy828301@gmail.com>
---
v1->v2: Update changelog, and add note explaining rationale for backporting (Andrew Morton).
Request to backport this to 6.1.X stable. We'd like SHMEM_HUGE_DENY to take precedence over MADV_COLLAPSE. If we make this change later, it will be a userspace API change. As such, 6.1 cannot be allowed to continue as-is, and we should fix up the code there.
---
 mm/shmem.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index c301487be5fb..0005ab2c29af 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -478,12 +478,10 @@ bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode,
 	if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) ||
 	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)))
 		return false;
-	if (shmem_huge_force)
-		return true;
-	if (shmem_huge == SHMEM_HUGE_FORCE)
-		return true;
 	if (shmem_huge == SHMEM_HUGE_DENY)
 		return false;
+	if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
+		return true;
 
 	switch (SHMEM_SB(inode->i_sb)->huge) {
 	case SHMEM_HUGE_ALWAYS:
Hi,
Thanks for your patch.
FYI: kernel test robot notices the stable kernel rule is not satisfied.
Rule: 'Cc: stable@vger.kernel.org' or 'commit <sha1> upstream.'
Subject: [PATCH v2 1/2] mm/MADV_COLLAPSE: don't expand collapse when vm_end is past requested end
Link: https://lore.kernel.org/stable/20221224081203.3193960-1-zokeefe%40google.com
The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
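For example, either form below would satisfy the rule (illustrative placement only; the <sha1> placeholder is known only once the patch lands upstream): a stable tag in the sign-off area,

	Cc: stable@vger.kernel.org # 6.1.x

or, for a backport submission, a first changelog line of

	commit <sha1> upstream.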