During recent vma locking patch reviews Linus and Jann Horn noted a number of issues with vma locking and suggested improvements:
1. walk_page_range() does not have ability to write-lock a vma during the walk when it's done under mmap_write_lock. For example s390_reset_cmma().
2. Vma locking is hidden inside vm_flags modifiers and is hard to follow. Suggestion is to change vm_flags_reset{_once} to assert that vma is write-locked and require an explicit locking.
3. Same issue with vma_prepare() hiding vma locking.
4. In userfaultfd vm_flags are modified after vma->vm_userfaultfd_ctx and page faults can operate on a context while it's changed.
5. do_brk_flags() and __install_special_mapping() not locking a newly created vma before adding it into the mm. While not strictly a problem, this is fragile if vma is modified after insertion, as in the mmap_region() case which was recently fixed. Suggestion is to always lock a new vma before inserting it and making it visible to page faults.
6. vma_assert_write_locked() for CONFIG_PER_VMA_LOCK=n would benefit from being mmap_assert_write_locked() instead of no-op and then any place which operates on a vma and calls mmap_assert_write_locked() can be converted into vma_assert_write_locked().
I CC'ed stable only on the first patch because others are cleanups and the bug in userfaultfd does not affect stable (lock_vma_under_rcu prevents uffds from being handled under vma lock protection). However I would be happy if the whole series is merged into stable 6.4 since it makes vma locking more maintainable.
The patches apply cleanly over Linus' ToT and will conflict when applied over mm-unstable due to missing [1]. The conflict can be easily resolved by ignoring conflicting deletions but probably simpler to take [1] into mm-unstable and avoid later conflict.
[1] commit 6c21e066f925 ("mm/mempolicy: Take VMA lock before replacing policy")
Suren Baghdasaryan (6): mm: enable page walking API to lock vmas during the walk mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and mmap mm: replace mmap with vma write lock assertions when operating on a vma mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_once mm: always lock new vma before inserting into vma tree mm: move vma locking out of vma_prepare
arch/powerpc/kvm/book3s_hv_uvmem.c | 1 + arch/powerpc/mm/book3s64/subpage_prot.c | 2 +- arch/riscv/mm/pageattr.c | 4 ++-- arch/s390/mm/gmap.c | 10 ++++----- drivers/infiniband/hw/hfi1/file_ops.c | 1 + fs/proc/task_mmu.c | 10 ++++----- fs/userfaultfd.c | 6 +++++ include/linux/mm.h | 13 +++++++---- include/linux/pagewalk.h | 6 ++--- mm/damon/vaddr.c | 4 ++-- mm/hmm.c | 2 +- mm/hugetlb.c | 2 +- mm/khugepaged.c | 5 +++-- mm/ksm.c | 16 +++++++------- mm/madvise.c | 13 +++++------ mm/memcontrol.c | 4 ++-- mm/memory-failure.c | 2 +- mm/memory.c | 2 +- mm/mempolicy.c | 12 ++++------ mm/migrate_device.c | 2 +- mm/mincore.c | 2 +- mm/mlock.c | 5 +++-- mm/mmap.c | 29 ++++++++++++++++--------- mm/mprotect.c | 3 ++- mm/pagewalk.c | 13 ++++++++--- mm/vmscan.c | 3 ++- 26 files changed, 100 insertions(+), 72 deletions(-)
walk_page_range() and friends often operate under write-locked mmap_lock. With introduction of vma locks, the vmas have to be locked as well during such walks to prevent concurrent page faults in these areas. Add an additional parameter to walk_page_range() functions to indicate the walks which should lock the vmas before operating on them.
Cc: stable@vger.kernel.org # 6.4.x Suggested-by: Linus Torvalds torvalds@linuxfoundation.org Suggested-by: Jann Horn jannh@google.com Signed-off-by: Suren Baghdasaryan surenb@google.com --- arch/powerpc/mm/book3s64/subpage_prot.c | 2 +- arch/riscv/mm/pageattr.c | 4 ++-- arch/s390/mm/gmap.c | 10 +++++----- fs/proc/task_mmu.c | 10 +++++----- include/linux/pagewalk.h | 6 +++--- mm/damon/vaddr.c | 4 ++-- mm/hmm.c | 2 +- mm/ksm.c | 16 ++++++++-------- mm/madvise.c | 8 ++++---- mm/memcontrol.c | 4 ++-- mm/memory-failure.c | 2 +- mm/mempolicy.c | 12 ++++-------- mm/migrate_device.c | 2 +- mm/mincore.c | 2 +- mm/mlock.c | 2 +- mm/mprotect.c | 2 +- mm/pagewalk.c | 13 ++++++++++--- mm/vmscan.c | 3 ++- 18 files changed, 54 insertions(+), 50 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c index 0dc85556dec5..177e5c646d9c 100644 --- a/arch/powerpc/mm/book3s64/subpage_prot.c +++ b/arch/powerpc/mm/book3s64/subpage_prot.c @@ -159,7 +159,7 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr, */ for_each_vma_range(vmi, vma, addr + len) { vm_flags_set(vma, VM_NOHUGEPAGE); - walk_page_vma(vma, &subpage_walk_ops, NULL); + walk_page_vma(vma, &subpage_walk_ops, true, NULL); } } #else diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c index ea3d61de065b..95207994cbf0 100644 --- a/arch/riscv/mm/pageattr.c +++ b/arch/riscv/mm/pageattr.c @@ -167,7 +167,7 @@ int set_direct_map_invalid_noflush(struct page *page) };
mmap_read_lock(&init_mm); - ret = walk_page_range(&init_mm, start, end, &pageattr_ops, &masks); + ret = walk_page_range(&init_mm, start, end, &pageattr_ops, false, &masks); mmap_read_unlock(&init_mm);
return ret; @@ -184,7 +184,7 @@ int set_direct_map_default_noflush(struct page *page) };
mmap_read_lock(&init_mm); - ret = walk_page_range(&init_mm, start, end, &pageattr_ops, &masks); + ret = walk_page_range(&init_mm, start, end, &pageattr_ops, false, &masks); mmap_read_unlock(&init_mm);
return ret; diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c index 9c8af31be970..16a58c860c74 100644 --- a/arch/s390/mm/gmap.c +++ b/arch/s390/mm/gmap.c @@ -2523,7 +2523,7 @@ static inline void thp_split_mm(struct mm_struct *mm)
for_each_vma(vmi, vma) { vm_flags_mod(vma, VM_NOHUGEPAGE, VM_HUGEPAGE); - walk_page_vma(vma, &thp_split_walk_ops, NULL); + walk_page_vma(vma, &thp_split_walk_ops, true, NULL); } mm->def_flags |= VM_NOHUGEPAGE; } @@ -2584,7 +2584,7 @@ int s390_enable_sie(void) mm->context.has_pgste = 1; /* split thp mappings and disable thp for future mappings */ thp_split_mm(mm); - walk_page_range(mm, 0, TASK_SIZE, &zap_zero_walk_ops, NULL); + walk_page_range(mm, 0, TASK_SIZE, &zap_zero_walk_ops, true, NULL); mmap_write_unlock(mm); return 0; } @@ -2672,7 +2672,7 @@ int s390_enable_skey(void) mm->context.uses_skeys = 0; goto out_up; } - walk_page_range(mm, 0, TASK_SIZE, &enable_skey_walk_ops, NULL); + walk_page_range(mm, 0, TASK_SIZE, &enable_skey_walk_ops, true, NULL);
out_up: mmap_write_unlock(mm); @@ -2697,7 +2697,7 @@ static const struct mm_walk_ops reset_cmma_walk_ops = { void s390_reset_cmma(struct mm_struct *mm) { mmap_write_lock(mm); - walk_page_range(mm, 0, TASK_SIZE, &reset_cmma_walk_ops, NULL); + walk_page_range(mm, 0, TASK_SIZE, &reset_cmma_walk_ops, true, NULL); mmap_write_unlock(mm); } EXPORT_SYMBOL_GPL(s390_reset_cmma); @@ -2771,7 +2771,7 @@ int __s390_uv_destroy_range(struct mm_struct *mm, unsigned long start, while (r > 0) { state.count = 0; mmap_read_lock(mm); - r = walk_page_range(mm, state.next, end, &gather_pages_ops, &state); + r = walk_page_range(mm, state.next, end, &gather_pages_ops, false, &state); mmap_read_unlock(mm); cond_resched(); s390_uv_destroy_pfns(state.count, state.pfns); diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 507cd4e59d07..f0d0f2959f91 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -804,9 +804,9 @@ static void smap_gather_stats(struct vm_area_struct *vma,
/* mmap_lock is held in m_start */ if (!start) - walk_page_vma(vma, ops, mss); + walk_page_vma(vma, ops, false, mss); else - walk_page_range(vma->vm_mm, start, vma->vm_end, ops, mss); + walk_page_range(vma->vm_mm, start, vma->vm_end, ops, false, mss); }
#define SEQ_PUT_DEC(str, val) \ @@ -1307,7 +1307,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, 0, mm, 0, -1UL); mmu_notifier_invalidate_range_start(&range); } - walk_page_range(mm, 0, -1, &clear_refs_walk_ops, &cp); + walk_page_range(mm, 0, -1, &clear_refs_walk_ops, true, &cp); if (type == CLEAR_REFS_SOFT_DIRTY) { mmu_notifier_invalidate_range_end(&range); flush_tlb_mm(mm); @@ -1720,7 +1720,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf, ret = mmap_read_lock_killable(mm); if (ret) goto out_free; - ret = walk_page_range(mm, start_vaddr, end, &pagemap_ops, &pm); + ret = walk_page_range(mm, start_vaddr, end, &pagemap_ops, false, &pm); mmap_read_unlock(mm); start_vaddr = end;
@@ -1981,7 +1981,7 @@ static int show_numa_map(struct seq_file *m, void *v) seq_puts(m, " huge");
/* mmap_lock is held by m_start */ - walk_page_vma(vma, &show_numa_ops, md); + walk_page_vma(vma, &show_numa_ops, false, md);
if (!md->pages) goto out; diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h index 27a6df448ee5..69656ec44049 100644 --- a/include/linux/pagewalk.h +++ b/include/linux/pagewalk.h @@ -105,16 +105,16 @@ struct mm_walk {
int walk_page_range(struct mm_struct *mm, unsigned long start, unsigned long end, const struct mm_walk_ops *ops, - void *private); + bool lock_vma, void *private); int walk_page_range_novma(struct mm_struct *mm, unsigned long start, unsigned long end, const struct mm_walk_ops *ops, pgd_t *pgd, void *private); int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start, unsigned long end, const struct mm_walk_ops *ops, - void *private); + bool lock_vma, void *private); int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops, - void *private); + bool lock_vma, void *private); int walk_page_mapping(struct address_space *mapping, pgoff_t first_index, pgoff_t nr, const struct mm_walk_ops *ops, void *private); diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c index 2fcc9731528a..54f50b1aefe4 100644 --- a/mm/damon/vaddr.c +++ b/mm/damon/vaddr.c @@ -391,7 +391,7 @@ static const struct mm_walk_ops damon_mkold_ops = { static void damon_va_mkold(struct mm_struct *mm, unsigned long addr) { mmap_read_lock(mm); - walk_page_range(mm, addr, addr + 1, &damon_mkold_ops, NULL); + walk_page_range(mm, addr, addr + 1, &damon_mkold_ops, false, NULL); mmap_read_unlock(mm); }
@@ -536,7 +536,7 @@ static bool damon_va_young(struct mm_struct *mm, unsigned long addr, };
mmap_read_lock(mm); - walk_page_range(mm, addr, addr + 1, &damon_young_ops, &arg); + walk_page_range(mm, addr, addr + 1, &damon_young_ops, false, &arg); mmap_read_unlock(mm); return arg.young; } diff --git a/mm/hmm.c b/mm/hmm.c index 855e25e59d8f..f94f5e268e40 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -600,7 +600,7 @@ int hmm_range_fault(struct hmm_range *range) range->notifier_seq)) return -EBUSY; ret = walk_page_range(mm, hmm_vma_walk.last, range->end, - &hmm_walk_ops, &hmm_vma_walk); + &hmm_walk_ops, false, &hmm_vma_walk); /* * When -EBUSY is returned the loop restarts with * hmm_vma_walk.last set to an address that has not been stored diff --git a/mm/ksm.c b/mm/ksm.c index ba266359da55..494a1f3fcb97 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -470,7 +470,7 @@ static const struct mm_walk_ops break_ksm_ops = { * of the process that owns 'vma'. We also do not want to enforce * protection keys here anyway. */ -static int break_ksm(struct vm_area_struct *vma, unsigned long addr) +static int break_ksm(struct vm_area_struct *vma, unsigned long addr, bool lock_vma) { vm_fault_t ret = 0;
@@ -479,7 +479,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
cond_resched(); ksm_page = walk_page_range_vma(vma, addr, addr + 1, - &break_ksm_ops, NULL); + &break_ksm_ops, lock_vma, NULL); if (WARN_ON_ONCE(ksm_page < 0)) return ksm_page; if (!ksm_page) @@ -565,7 +565,7 @@ static void break_cow(struct ksm_rmap_item *rmap_item) mmap_read_lock(mm); vma = find_mergeable_vma(mm, addr); if (vma) - break_ksm(vma, addr); + break_ksm(vma, addr, false); mmap_read_unlock(mm); }
@@ -871,7 +871,7 @@ static void remove_trailing_rmap_items(struct ksm_rmap_item **rmap_list) * in cmp_and_merge_page on one of the rmap_items we would be removing. */ static int unmerge_ksm_pages(struct vm_area_struct *vma, - unsigned long start, unsigned long end) + unsigned long start, unsigned long end, bool lock_vma) { unsigned long addr; int err = 0; @@ -882,7 +882,7 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma, if (signal_pending(current)) err = -ERESTARTSYS; else - err = break_ksm(vma, addr); + err = break_ksm(vma, addr, lock_vma); } return err; } @@ -1029,7 +1029,7 @@ static int unmerge_and_remove_all_rmap_items(void) if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma) continue; err = unmerge_ksm_pages(vma, - vma->vm_start, vma->vm_end); + vma->vm_start, vma->vm_end, false); if (err) goto error; } @@ -2530,7 +2530,7 @@ static int __ksm_del_vma(struct vm_area_struct *vma) return 0;
if (vma->anon_vma) { - err = unmerge_ksm_pages(vma, vma->vm_start, vma->vm_end); + err = unmerge_ksm_pages(vma, vma->vm_start, vma->vm_end, true); if (err) return err; } @@ -2668,7 +2668,7 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start, return 0; /* just ignore the advice */
if (vma->anon_vma) { - err = unmerge_ksm_pages(vma, start, end); + err = unmerge_ksm_pages(vma, start, end, true); if (err) return err; } diff --git a/mm/madvise.c b/mm/madvise.c index 886f06066622..0e484111a1d2 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -287,7 +287,7 @@ static long madvise_willneed(struct vm_area_struct *vma, *prev = vma; #ifdef CONFIG_SWAP if (!file) { - walk_page_range(vma->vm_mm, start, end, &swapin_walk_ops, vma); + walk_page_range(vma->vm_mm, start, end, &swapin_walk_ops, false, vma); lru_add_drain(); /* Push any new pages onto the LRU now */ return 0; } @@ -546,7 +546,7 @@ static void madvise_cold_page_range(struct mmu_gather *tlb, };
tlb_start_vma(tlb, vma); - walk_page_range(vma->vm_mm, addr, end, &cold_walk_ops, &walk_private); + walk_page_range(vma->vm_mm, addr, end, &cold_walk_ops, false, &walk_private); tlb_end_vma(tlb, vma); }
@@ -584,7 +584,7 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb, };
tlb_start_vma(tlb, vma); - walk_page_range(vma->vm_mm, addr, end, &cold_walk_ops, &walk_private); + walk_page_range(vma->vm_mm, addr, end, &cold_walk_ops, false, &walk_private); tlb_end_vma(tlb, vma); }
@@ -786,7 +786,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma, mmu_notifier_invalidate_range_start(&range); tlb_start_vma(&tlb, vma); walk_page_range(vma->vm_mm, range.start, range.end, - &madvise_free_walk_ops, &tlb); + &madvise_free_walk_ops, false, &tlb); tlb_end_vma(&tlb, vma); mmu_notifier_invalidate_range_end(&range); tlb_finish_mmu(&tlb); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e8ca4bdcb03c..76aaadbd4bf9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6031,7 +6031,7 @@ static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm) unsigned long precharge;
mmap_read_lock(mm); - walk_page_range(mm, 0, ULONG_MAX, &precharge_walk_ops, NULL); + walk_page_range(mm, 0, ULONG_MAX, &precharge_walk_ops, false, NULL); mmap_read_unlock(mm);
precharge = mc.precharge; @@ -6332,7 +6332,7 @@ static void mem_cgroup_move_charge(void) * When we have consumed all precharges and failed in doing * additional charge, the page walk just aborts. */ - walk_page_range(mc.mm, 0, ULONG_MAX, &charge_walk_ops, NULL); + walk_page_range(mc.mm, 0, ULONG_MAX, &charge_walk_ops, false, NULL); mmap_read_unlock(mc.mm); atomic_dec(&mc.from->moving_account); } diff --git a/mm/memory-failure.c b/mm/memory-failure.c index ece5d481b5ff..763297df9240 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -860,7 +860,7 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
mmap_read_lock(p->mm); ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwp_walk_ops, - (void *)&priv); + false, (void *)&priv); if (ret == 1 && priv.tk.addr) kill_proc(&priv.tk, pfn, flags); else diff --git a/mm/mempolicy.c b/mm/mempolicy.c index c53f8beeb507..70ba53c70700 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -738,7 +738,7 @@ static const struct mm_walk_ops queue_pages_walk_ops = { static int queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end, nodemask_t *nodes, unsigned long flags, - struct list_head *pagelist) + struct list_head *pagelist, bool lock_vma) { int err; struct queue_pages qp = { @@ -750,7 +750,7 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end, .first = NULL, };
- err = walk_page_range(mm, start, end, &queue_pages_walk_ops, &qp); + err = walk_page_range(mm, start, end, &queue_pages_walk_ops, lock_vma, &qp);
if (!qp.first) /* whole range in hole */ @@ -1078,7 +1078,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest, vma = find_vma(mm, 0); VM_BUG_ON(!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))); queue_pages_range(mm, vma->vm_start, mm->task_size, &nmask, - flags | MPOL_MF_DISCONTIG_OK, &pagelist); + flags | MPOL_MF_DISCONTIG_OK, &pagelist, false);
if (!list_empty(&pagelist)) { err = migrate_pages(&pagelist, alloc_migration_target, NULL, @@ -1321,12 +1321,8 @@ static long do_mbind(unsigned long start, unsigned long len, * Lock the VMAs before scanning for pages to migrate, to ensure we don't * miss a concurrently inserted page. */ - vma_iter_init(&vmi, mm, start); - for_each_vma_range(vmi, vma, end) - vma_start_write(vma); - ret = queue_pages_range(mm, start, end, nmask, - flags | MPOL_MF_INVERT, &pagelist); + flags | MPOL_MF_INVERT, &pagelist, true);
if (ret < 0) { err = ret; diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 8365158460ed..1bc9937bf1fb 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -304,7 +304,7 @@ static void migrate_vma_collect(struct migrate_vma *migrate) mmu_notifier_invalidate_range_start(&range);
walk_page_range(migrate->vma->vm_mm, migrate->start, migrate->end, - &migrate_vma_walk_ops, migrate); + &migrate_vma_walk_ops, false, migrate);
mmu_notifier_invalidate_range_end(&range); migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT); diff --git a/mm/mincore.c b/mm/mincore.c index b7f7a516b26c..a06288c6c126 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -198,7 +198,7 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v memset(vec, 1, pages); return pages; } - err = walk_page_range(vma->vm_mm, addr, end, &mincore_walk_ops, vec); + err = walk_page_range(vma->vm_mm, addr, end, &mincore_walk_ops, false, vec); if (err < 0) return err; return (end - addr) >> PAGE_SHIFT; diff --git a/mm/mlock.c b/mm/mlock.c index 0a0c996c5c21..3634de0b28e3 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -389,7 +389,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma, vm_flags_reset_once(vma, newflags);
lru_add_drain(); - walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL); + walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, true, NULL); lru_add_drain();
if (newflags & VM_IO) { diff --git a/mm/mprotect.c b/mm/mprotect.c index 6f658d483704..f781f709c39d 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -599,7 +599,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb, pgprot_t new_pgprot = vm_get_page_prot(newflags);
error = walk_page_range(current->mm, start, end, - &prot_none_walk_ops, &new_pgprot); + &prot_none_walk_ops, true, &new_pgprot); if (error) return error; } diff --git a/mm/pagewalk.c b/mm/pagewalk.c index 2022333805d3..7503885fae75 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -406,6 +406,7 @@ static int __walk_page_range(unsigned long start, unsigned long end, * @start: start address of the virtual address range * @end: end address of the virtual address range * @ops: operation to call during the walk + * @lock_vma write-lock the vma before operating on it * @private: private data for callbacks' usage * * Recursively walk the page table tree of the process represented by @mm @@ -442,7 +443,7 @@ static int __walk_page_range(unsigned long start, unsigned long end, */ int walk_page_range(struct mm_struct *mm, unsigned long start, unsigned long end, const struct mm_walk_ops *ops, - void *private) + bool lock_vma, void *private) { int err = 0; unsigned long next; @@ -474,6 +475,8 @@ int walk_page_range(struct mm_struct *mm, unsigned long start, if (ops->pte_hole) err = ops->pte_hole(start, next, -1, &walk); } else { /* inside vma */ + if (lock_vma) + vma_start_write(vma); walk.vma = vma; next = min(end, vma->vm_end); vma = find_vma(mm, vma->vm_end); @@ -535,7 +538,7 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start, unsigned long end, const struct mm_walk_ops *ops, - void *private) + bool lock_vma, void *private) { struct mm_walk walk = { .ops = ops, @@ -550,11 +553,13 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start, return -EINVAL;
mmap_assert_locked(walk.mm); + if (lock_vma) + vma_start_write(vma); return __walk_page_range(start, end, &walk); }
int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops, - void *private) + bool lock_vma, void *private) { struct mm_walk walk = { .ops = ops, @@ -567,6 +572,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops, return -EINVAL;
mmap_assert_locked(walk.mm); + if (lock_vma) + vma_start_write(vma); return __walk_page_range(vma->vm_start, vma->vm_end, &walk); }
diff --git a/mm/vmscan.c b/mm/vmscan.c index 1080209a568b..d85f86871fd9 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -4306,7 +4306,8 @@ static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_
/* the caller might be holding the lock for write */ if (mmap_read_trylock(mm)) { - err = walk_page_range(mm, walk->next_addr, ULONG_MAX, &mm_walk_ops, walk); + err = walk_page_range(mm, walk->next_addr, ULONG_MAX, + &mm_walk_ops, false, walk);
mmap_read_unlock(mm); }
On Mon, 31 Jul 2023 at 10:12, Suren Baghdasaryan surenb@google.com wrote:
walk_page_vma(vma, &subpage_walk_ops, NULL);
walk_page_vma(vma, &subpage_walk_ops, true, NULL);
Rather than add a new argument to the walk_page_*() functions, I really think you should just add the locking rule to the 'const struct mm_walk_ops' structure.
The locking rule goes along with the rules for what you are actually doing, after all. Plus it would actually make it all much more legible when it's not just some random "true/false" argument, but a an actual
.write_lock = 1
in the ops definition.
Yes, yes, that might mean that some ops might need duplicating in case you really have a walk that sometimes takes the lock, and sometimes doesn't, but that is odd to begin with.
The only such case I found from a quick look was the very strange queue_pages_range() case. Is it really true that do_mbind() needs the write-lock, but do_migrate_pages() does not?
And if they really are that different maybe they should have different walk_ops?
Maybe there were other cases that I didn't notice.
error = walk_page_range(current->mm, start, end,
&prot_none_walk_ops, &new_pgprot);
&prot_none_walk_ops, true, &new_pgprot);
This looks odd. You're adding vma locking to a place that didn't do it before.
Yes, the mmap semaphore is held for writing, but this particular walk doesn't need it as far as I can tell.
In fact, this feels like that walker should maybe *verify* that it's held for writing, but not try to write it again?
Maybe the "lock_vma" flag should be a tri-state:
- lock for reading (no-op per vma), verify that the mmap sem is held for reading
- lock for reading (no-op per vma), but with mmap sem held for writing (this kind of "check before doing changes" walker)
- lock for writing (with mmap sem obviously needs to be held for writing)
mmap_assert_locked(walk.mm);
if (lock_vma)
vma_start_write(vma);
So I think this should also be tightened up, and something like
switch (ops->locking) { case WRLOCK: vma_start_write(vma); fallthrough; case WRLOCK_VERIFY: mmap_assert_write_locked(mm); break; case RDLOCK: mmap_assert_locked(walk.mm); }
because we shouldn't have a 'vma_start_write()' without holding the mmap sem for *writing*, and the above would also allow that mprotect_fixup() "walk to see if we can merge, verify that it was already locked" thing.
Hmm?
NOTE! The above names are just completely made up. I dcon't think it should actually be some "WRLOCK" enum. There are probably much better names. Take the above as a "maybe something kind of in this direction" rather than "do it exactly like this".
Linus
On Mon, Jul 31, 2023 at 11:02 AM Linus Torvalds torvalds@linux-foundation.org wrote:
On Mon, 31 Jul 2023 at 10:12, Suren Baghdasaryan surenb@google.com wrote:
walk_page_vma(vma, &subpage_walk_ops, NULL);
walk_page_vma(vma, &subpage_walk_ops, true, NULL);
Rather than add a new argument to the walk_page_*() functions, I really think you should just add the locking rule to the 'const struct mm_walk_ops' structure.
The locking rule goes along with the rules for what you are actually doing, after all. Plus it would actually make it all much more legible when it's not just some random "true/false" argument, but a an actual
.write_lock = 1
in the ops definition.
Yeah, I was thinking about that but thought a flag like this in a pure "ops" struct would be frowned upon. If this is acceptable then it makes it much cleaner.
Yes, yes, that might mean that some ops might need duplicating in case you really have a walk that sometimes takes the lock, and sometimes doesn't, but that is odd to begin with.
The only such case I found from a quick look was the very strange queue_pages_range() case. Is it really true that do_mbind() needs the write-lock, but do_migrate_pages() does not?
And if they really are that different maybe they should have different walk_ops?
Makes sense to me.
Maybe there were other cases that I didn't notice.
error = walk_page_range(current->mm, start, end,
&prot_none_walk_ops, &new_pgprot);
&prot_none_walk_ops, true, &new_pgprot);
This looks odd. You're adding vma locking to a place that didn't do it before.
Yes, the mmap semaphore is held for writing, but this particular walk doesn't need it as far as I can tell.
Yes you are correct. Locking a vma in this case seems unnecessary.
In fact, this feels like that walker should maybe *verify* that it's held for writing, but not try to write it again?
In this particular case, does this walk even require the vma to be write locked? Looks like it's simply checking the ptes. And if so, walk_page_range() already has mmap_assert_locked(walk.mm) at the beginning to ensure the tree is stable. Do we need anything else here?
Maybe the "lock_vma" flag should be a tri-state:
- lock for reading (no-op per vma), verify that the mmap sem is held
for reading
- lock for reading (no-op per vma), but with mmap sem held for
writing (this kind of "check before doing changes" walker)
- lock for writing (with mmap sem obviously needs to be held for writing)
mmap_assert_locked(walk.mm);
if (lock_vma)
vma_start_write(vma);
So I think this should also be tightened up, and something like
switch (ops->locking) { case WRLOCK: vma_start_write(vma); fallthrough; case WRLOCK_VERIFY: mmap_assert_write_locked(mm); break; case RDLOCK: mmap_assert_locked(walk.mm); }
I got the idea but a couple of modifications, if I may. walk_page_range() already does mmap_assert_locked() at the beginning, so we can change it to:
if (ops->locking == RDLOCK) mmap_assert_locked(walk.mm); else mmap_assert_write_locked(mm);
and during the walk:
switch (ops->locking) { case WRLOCK: vma_start_write(vma); break; #ifdef CONFIG_PER_VMA_LOCK case WRLOCK_VERIFY: vma_assert_write_locked(vma); break; #endif }
The above CONFIG_PER_VMA_LOCK is ugly but with !CONFIG_PER_VMA_LOCK vma_assert_write_locked() becomes mmap_assert_write_locked() and we already checked that, so it's unnecessary.
because we shouldn't have a 'vma_start_write()' without holding the mmap sem for *writing*, and the above would also allow that mprotect_fixup() "walk to see if we can merge, verify that it was already locked" thing.
Hmm?
NOTE! The above names are just completely made up. I dcon't think it should actually be some "WRLOCK" enum. There are probably much better names. Take the above as a "maybe something kind of in this direction" rather than "do it exactly like this".
I'm not great with names... Maybe just add a PGWALK_ prefix like this:
PGWALK_RDLOCK PGWALK_WRLOCK PGWALK_WRLOCK_VERIFY
?
Linus
On Mon, 31 Jul 2023 at 12:31, Suren Baghdasaryan surenb@google.com wrote:
I got the idea but a couple of modifications, if I may.
Ack, sounds sane to me.
Linus
On Mon, Jul 31, 2023 at 12:33 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Mon, 31 Jul 2023 at 12:31, Suren Baghdasaryan surenb@google.com wrote:
I got the idea but a couple of modifications, if I may.
Ack, sounds sane to me.
Ok, I'll wait for more feedback today and will post an update tomorrow. Thanks!
Linus
On Mon, Jul 31, 2023 at 1:24 PM Suren Baghdasaryan surenb@google.com wrote:
On Mon, Jul 31, 2023 at 12:33 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Mon, 31 Jul 2023 at 12:31, Suren Baghdasaryan surenb@google.com wrote:
I got the idea but a couple of modifications, if I may.
Ack, sounds sane to me.
Ok, I'll wait for more feedback today and will post an update tomorrow. Thanks!
I have the new patchset ready but I see 3 places where we walk the pages after mmap_write_lock() while *I think* we can tolerate concurrent page faults (don't need to lock the vmas):
s390_enable_sie() break_ksm() clear_refs_write()
In all these walks we lock PTL when modifying the page table entries, that's why I think we can skip locking the vma but maybe I'm missing something? Could someone please check these 3 cases and confirm or deny my claim? Thanks, Suren.
Linus
On Tue, Aug 01, 2023 at 01:28:56PM -0700, Suren Baghdasaryan wrote:
I have the new patchset ready but I see 3 places where we walk the pages after mmap_write_lock() while *I think* we can tolerate concurrent page faults (don't need to lock the vmas):
s390_enable_sie() break_ksm() clear_refs_write()
This one doesn't look right to be listed - tlb flushing is postponed after pgtable lock released, so I assume the same issue can happen like fork(): where we can have race coditions to corrupt data if, e.g., thread A writting with a writable (unflushed) tlb, alongside with thread B CoWing.
It'll indeed be nice to know whether break_ksm() can avoid that lock_vma parameter across quite a few function jumps. I don't yet see an immediate issue with this one.. No idea on s390_enable_sie(), but to make it simple and safe I'd simply leave it with the write vma lock to match the mmap write lock.
Thanks,
On Tue, Aug 1, 2023 at 2:34 PM Peter Xu peterx@redhat.com wrote:
On Tue, Aug 01, 2023 at 01:28:56PM -0700, Suren Baghdasaryan wrote:
I have the new patchset ready but I see 3 places where we walk the pages after mmap_write_lock() while *I think* we can tolerate concurrent page faults (don't need to lock the vmas):
s390_enable_sie() break_ksm() clear_refs_write()
This one doesn't look right to be listed - tlb flushing is postponed after pgtable lock released, so I assume the same issue can happen like fork(): where we can have race coditions to corrupt data if, e.g., thread A writting with a writable (unflushed) tlb, alongside with thread B CoWing.
Ah, good point.
It'll indeed be nice to know whether break_ksm() can avoid that lock_vma parameter across quite a few function jumps. I don't yet see an immediate issue with this one.. No idea on s390_enable_sie(), but to make it simple and safe I'd simply leave it with the write vma lock to match the mmap write lock.
Thanks for taking a look, Peter!
Ok, let me keep all three of them with vma locking in place to be safe and will post v2 for further discussion. Thanks, Suren.
Thanks,
-- Peter Xu
On Tue, Aug 1, 2023 at 2:46 PM Suren Baghdasaryan surenb@google.com wrote:
On Tue, Aug 1, 2023 at 2:34 PM Peter Xu peterx@redhat.com wrote:
On Tue, Aug 01, 2023 at 01:28:56PM -0700, Suren Baghdasaryan wrote:
I have the new patchset ready but I see 3 places where we walk the pages after mmap_write_lock() while *I think* we can tolerate concurrent page faults (don't need to lock the vmas):
s390_enable_sie() break_ksm() clear_refs_write()
This one doesn't look right to be listed - tlb flushing is postponed after pgtable lock released, so I assume the same issue can happen like fork(): where we can have race coditions to corrupt data if, e.g., thread A writting with a writable (unflushed) tlb, alongside with thread B CoWing.
Ah, good point.
It'll indeed be nice to know whether break_ksm() can avoid that lock_vma parameter across quite a few function jumps. I don't yet see an immediate issue with this one.. No idea on s390_enable_sie(), but to make it simple and safe I'd simply leave it with the write vma lock to match the mmap write lock.
Thanks for taking a look, Peter!
Ok, let me keep all three of them with vma locking in place to be safe and will post v2 for further discussion.
v2 posted at https://lore.kernel.org/all/20230801220733.1987762-1-surenb@google.com/
Thanks, Suren.
Thanks,
-- Peter Xu
When CONFIG_PER_VMA_LOCK=n, vma_assert_write_locked() should be equivalent to mmap_assert_write_locked().
Suggested-by: Jann Horn jannh@google.com Signed-off-by: Suren Baghdasaryan surenb@google.com --- include/linux/mm.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 406ab9ea818f..262b5f44101d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -750,7 +750,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {} static inline void vma_start_write(struct vm_area_struct *vma) {} static inline bool vma_try_start_write(struct vm_area_struct *vma) { return true; } -static inline void vma_assert_write_locked(struct vm_area_struct *vma) {} +static inline void vma_assert_write_locked(struct vm_area_struct *vma) + { mmap_assert_write_locked(vma->vm_mm); } static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached) {}
Hi,
Thanks for your patch.
FYI: kernel test robot notices the stable kernel rule is not satisfied.
Rule: 'Cc: stable@vger.kernel.org' or 'commit <sha1> upstream.' Subject: [PATCH 2/6] mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and mmap Link: https://lore.kernel.org/stable/20230731171233.1098105-3-surenb%40google.com
The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
Vma write lock assertion always includes mmap write lock assertion and additional vma lock checks when per-VMA locks are enabled. Replace weaker mmap_assert_write_locked() assertions with stronger vma_assert_write_locked() ones when we are operating on a vma which is expected to be locked.
Suggested-by: Jann Horn jannh@google.com Signed-off-by: Suren Baghdasaryan surenb@google.com --- mm/hugetlb.c | 2 +- mm/khugepaged.c | 5 +++-- mm/memory.c | 2 +- 3 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 64a3239b6407..1d871a1167d8 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5028,7 +5028,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, src_vma->vm_start, src_vma->vm_end); mmu_notifier_invalidate_range_start(&range); - mmap_assert_write_locked(src); + vma_assert_write_locked(src_vma); raw_write_seqcount_begin(&src->write_protect_seq); } else { /* diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 78c8d5d8b628..1e43a56fba31 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1495,7 +1495,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr, };
VM_BUG_ON(!PageTransHuge(hpage)); - mmap_assert_write_locked(vma->vm_mm); + vma_assert_write_locked(vma);
if (do_set_pmd(&vmf, hpage)) return SCAN_FAIL; @@ -1525,7 +1525,7 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v pmd_t pmd; struct mmu_notifier_range range;
- mmap_assert_write_locked(mm); + vma_assert_write_locked(vma); if (vma->vm_file) lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem); /* @@ -1570,6 +1570,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, int count = 0, result = SCAN_FAIL; int i;
+ /* Ensure vma can't change, it will be locked below after checks */ mmap_assert_write_locked(mm);
/* Fast check before locking page if already PMD-mapped */ diff --git a/mm/memory.c b/mm/memory.c index 603b2f419948..652d99b9858a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1312,7 +1312,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) * Use the raw variant of the seqcount_t write API to avoid * lockdep complaining about preemptibility. */ - mmap_assert_write_locked(src_mm); + vma_assert_write_locked(src_vma); raw_write_seqcount_begin(&src_mm->write_protect_seq); }
Implicit vma locking inside vm_flags_reset() and vm_flags_reset_once() is not obvious and makes it hard to understand where vma locking is happening. Also in some cases (like in dup_userfaultfd()) vma should be locked earlier than vma_flags modification. To make locking more visible, change these functions to assert that the vma write lock is taken and explicitly lock the vma beforehand. Fix userfaultfd functions which should lock the vma earlier.
Suggested-by: Linus Torvalds torvalds@linuxfoundation.org Signed-off-by: Suren Baghdasaryan surenb@google.com --- arch/powerpc/kvm/book3s_hv_uvmem.c | 1 + drivers/infiniband/hw/hfi1/file_ops.c | 1 + fs/userfaultfd.c | 6 ++++++ include/linux/mm.h | 10 +++++++--- mm/madvise.c | 5 ++--- mm/mlock.c | 3 ++- mm/mprotect.c | 1 + 7 files changed, 20 insertions(+), 7 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c index 709ebd578394..e2d6f9327f77 100644 --- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -410,6 +410,7 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm, ret = H_STATE; break; } + vma_start_write(vma); /* Copy vm_flags to avoid partial modifications in ksm_madvise */ vm_flags = vma->vm_flags; ret = ksm_madvise(vma, vma->vm_start, vma->vm_end, diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c index a5ab22cedd41..5920bfc1e1c5 100644 --- a/drivers/infiniband/hw/hfi1/file_ops.c +++ b/drivers/infiniband/hw/hfi1/file_ops.c @@ -344,6 +344,7 @@ static int hfi1_file_mmap(struct file *fp, struct vm_area_struct *vma) goto done; }
+ vma_start_write(vma); /* * vm_pgoff is used as a buffer selector cookie. Always mmap from * the beginning. diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 7cecd49e078b..6cde95533dcd 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -667,6 +667,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx, mmap_write_lock(mm); for_each_vma(vmi, vma) { if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) { + vma_start_write(vma); vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS); @@ -702,6 +703,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
octx = vma->vm_userfaultfd_ctx.ctx; if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) { + vma_start_write(vma); vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS); return 0; @@ -783,6 +785,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma, atomic_inc(&ctx->mmap_changing); } else { /* Drop uffd context if remap feature not enabled */ + vma_start_write(vma); vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS); } @@ -940,6 +943,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file) prev = vma; }
+ vma_start_write(vma); userfaultfd_set_vm_flags(vma, new_flags); vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; } @@ -1502,6 +1506,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, * the next vma was merged into the current one and * the current one has not been updated yet. */ + vma_start_write(vma); userfaultfd_set_vm_flags(vma, new_flags); vma->vm_userfaultfd_ctx.ctx = ctx;
@@ -1685,6 +1690,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, * the next vma was merged into the current one and * the current one has not been updated yet. */ + vma_start_write(vma); userfaultfd_set_vm_flags(vma, new_flags); vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
diff --git a/include/linux/mm.h b/include/linux/mm.h index 262b5f44101d..2c720c9bb1ae 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -780,18 +780,22 @@ static inline void vm_flags_init(struct vm_area_struct *vma, ACCESS_PRIVATE(vma, __vm_flags) = flags; }
-/* Use when VMA is part of the VMA tree and modifications need coordination */ +/* + * Use when VMA is part of the VMA tree and modifications need coordination + * Note: vm_flags_reset and vm_flags_reset_once do not lock the vma and + * it should be locked explicitly beforehand. + */ static inline void vm_flags_reset(struct vm_area_struct *vma, vm_flags_t flags) { - vma_start_write(vma); + vma_assert_write_locked(vma); vm_flags_init(vma, flags); }
static inline void vm_flags_reset_once(struct vm_area_struct *vma, vm_flags_t flags) { - vma_start_write(vma); + vma_assert_write_locked(vma); WRITE_ONCE(ACCESS_PRIVATE(vma, __vm_flags), flags); }
diff --git a/mm/madvise.c b/mm/madvise.c index 0e484111a1d2..54628f4ca217 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -173,9 +173,8 @@ static int madvise_update_vma(struct vm_area_struct *vma, }
success: - /* - * vm_flags is protected by the mmap_lock held in write mode. - */ + /* vm_flags is protected by the mmap_lock held in write mode. */ + vma_start_write(vma); vm_flags_reset(vma, new_flags); if (!vma->vm_file || vma_is_anon_shmem(vma)) { error = replace_anon_vma_name(vma, anon_name); diff --git a/mm/mlock.c b/mm/mlock.c index 3634de0b28e3..f0f5125188ba 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -386,6 +386,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma, */ if (newflags & VM_LOCKED) newflags |= VM_IO; + vma_start_write(vma); vm_flags_reset_once(vma, newflags);
lru_add_drain(); @@ -460,9 +461,9 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma, * It's okay if try_to_unmap_one unmaps a page just after we * set VM_LOCKED, populate_vma_page_range will bring it back. */ - if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) { /* No work to do, and mlocking twice would be wrong */ + vma_start_write(vma); vm_flags_reset(vma, newflags); } else { mlock_vma_pages_range(vma, start, end, newflags); diff --git a/mm/mprotect.c b/mm/mprotect.c index f781f709c39d..0eab019914db 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -656,6 +656,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb, * vm_flags and vm_page_prot are protected by the mmap_lock * held in write mode. */ + vma_start_write(vma); vm_flags_reset(vma, newflags); if (vma_wants_manual_pte_write_upgrade(vma)) mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
While it's not strictly necessary to lock a newly created vma before adding it into the vma tree (as long as no further changes are performed to it), it seems like a good policy to lock it and prevent accidental changes after it becomes visible to the page faults. Lock the vma before adding it into the vma tree.
Suggested-by: Jann Horn jannh@google.com Signed-off-by: Suren Baghdasaryan surenb@google.com --- mm/mmap.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c index 3937479d0e07..850a39dee075 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -412,6 +412,8 @@ static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma) if (vma_iter_prealloc(&vmi)) return -ENOMEM;
+ vma_start_write(vma); + if (vma->vm_file) { mapping = vma->vm_file->f_mapping; i_mmap_lock_write(mapping); @@ -477,7 +479,8 @@ static inline void vma_prepare(struct vma_prepare *vp) vma_start_write(vp->vma); if (vp->adj_next) vma_start_write(vp->adj_next); - /* vp->insert is always a newly created VMA, no need for locking */ + if (vp->insert) + vma_start_write(vp->insert); if (vp->remove) vma_start_write(vp->remove); if (vp->remove2) @@ -3098,6 +3101,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, vma->vm_pgoff = addr >> PAGE_SHIFT; vm_flags_init(vma, flags); vma->vm_page_prot = vm_get_page_prot(flags); + vma_start_write(vma); if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL)) goto mas_store_fail;
@@ -3345,7 +3349,6 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, get_file(new_vma->vm_file); if (new_vma->vm_ops && new_vma->vm_ops->open) new_vma->vm_ops->open(new_vma); - vma_start_write(new_vma); if (vma_link(mm, new_vma)) goto out_vma_link; *need_rmap_locks = false;
vma_prepare() is currently the central place where vmas are being locked before vma_complete() applies changes to them. While this is convenient, it also obscures vma locking and makes it hard to follow the locking rules. Move vma locking out of vma_prepare() and take vma locks explicitly at the locations where vmas are being modified.
Suggested-by: Linus Torvalds torvalds@linuxfoundation.org Signed-off-by: Suren Baghdasaryan surenb@google.com --- mm/mmap.c | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c index 850a39dee075..e59d83cb1d7a 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -476,16 +476,6 @@ static inline void init_vma_prep(struct vma_prepare *vp, */ static inline void vma_prepare(struct vma_prepare *vp) { - vma_start_write(vp->vma); - if (vp->adj_next) - vma_start_write(vp->adj_next); - if (vp->insert) - vma_start_write(vp->insert); - if (vp->remove) - vma_start_write(vp->remove); - if (vp->remove2) - vma_start_write(vp->remove2); - if (vp->file) { uprobe_munmap(vp->vma, vp->vma->vm_start, vp->vma->vm_end);
@@ -650,6 +640,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, bool remove_next = false; struct vma_prepare vp;
+ vma_start_write(vma); if (next && (vma != next) && (end == next->vm_end)) { int ret;
@@ -657,6 +648,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, ret = dup_anon_vma(vma, next); if (ret) return ret; + vma_start_write(next); }
init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL); @@ -708,6 +700,8 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, if (vma_iter_prealloc(vmi)) return -ENOMEM;
+ vma_start_write(vma); + init_vma_prep(&vp, vma); vma_prepare(&vp); vma_adjust_trans_huge(vma, start, end, 0); @@ -946,10 +940,12 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, /* Can we merge both the predecessor and the successor? */ if (merge_prev && merge_next && is_mergeable_anon_vma(prev->anon_vma, next->anon_vma, NULL)) { + vma_start_write(next); remove = next; /* case 1 */ vma_end = next->vm_end; err = dup_anon_vma(prev, next); if (curr) { /* case 6 */ + vma_start_write(curr); remove = curr; remove2 = next; if (!next->anon_vma) @@ -958,6 +954,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, } else if (merge_prev) { /* case 2 */ if (curr) { err = dup_anon_vma(prev, curr); + vma_start_write(curr); if (end == curr->vm_end) { /* case 7 */ remove = curr; } else { /* case 5 */ @@ -969,6 +966,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, res = next; if (prev && addr < prev->vm_end) { /* case 4 */ vma_end = addr; + vma_start_write(next); adjust = next; adj_start = -(prev->vm_end - addr); err = dup_anon_vma(next, prev); @@ -983,6 +981,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, vma_pgoff = next->vm_pgoff - pglen; if (curr) { /* case 8 */ vma_pgoff = curr->vm_pgoff; + vma_start_write(curr); remove = curr; err = dup_anon_vma(next, curr); } @@ -996,6 +995,8 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, if (vma_iter_prealloc(vmi)) return NULL;
+ vma_start_write(vma); + init_multi_vma_prep(&vp, vma, adjust, remove, remove2); VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma && vp.anon_vma != adjust->anon_vma); @@ -2373,6 +2374,9 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, if (new->vm_ops && new->vm_ops->open) new->vm_ops->open(new);
+ vma_start_write(vma); + vma_start_write(new); + init_vma_prep(&vp, vma); vp.insert = new; vma_prepare(&vp); @@ -3078,6 +3082,8 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, if (vma_iter_prealloc(vmi)) goto unacct_fail;
+ vma_start_write(vma); + init_vma_prep(&vp, vma); vma_prepare(&vp); vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0);
Adding Lorenzo since this also touches vma_merge() again..
* Suren Baghdasaryan surenb@google.com [230731 13:12]:
vma_prepare() is currently the central place where vmas are being locked before vma_complete() applies changes to them. While this is convenient, it also obscures vma locking and makes it hard to follow the locking rules. Move vma locking out of vma_prepare() and take vma locks explicitly at the locations where vmas are being modified.
I get the idea of locking closer to the edits, but vma_merge() is very hard to follow. It was worse when it was two functions and much larger, but adding this into vma_merge() is difficult to validate.
We still set vma = <another vma> in places, so that adds to the difficulty of ensuring the end result is all VMAs that will be modified/removed have been locked...and the 'locking rules' are also obscured.
It's also annoying that this doesn't fully allow you to follow the locking anyways. dup_anon_vma() still locks internally, with good reason, but it's still not clear that the VMA is locked when looking at this.
That being said, I did go through each case and it looks like it locks the correct VMA(s) to me.
Reviewed-by: Liam R. Howlett Liam.Howlett@oracle.com
Suggested-by: Linus Torvalds torvalds@linuxfoundation.org Signed-off-by: Suren Baghdasaryan surenb@google.com
mm/mmap.c | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c index 850a39dee075..e59d83cb1d7a 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -476,16 +476,6 @@ static inline void init_vma_prep(struct vma_prepare *vp, */ static inline void vma_prepare(struct vma_prepare *vp) {
- vma_start_write(vp->vma);
- if (vp->adj_next)
vma_start_write(vp->adj_next);
- if (vp->insert)
vma_start_write(vp->insert);
- if (vp->remove)
vma_start_write(vp->remove);
- if (vp->remove2)
vma_start_write(vp->remove2);
- if (vp->file) { uprobe_munmap(vp->vma, vp->vma->vm_start, vp->vma->vm_end);
@@ -650,6 +640,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, bool remove_next = false; struct vma_prepare vp;
- vma_start_write(vma); if (next && (vma != next) && (end == next->vm_end)) { int ret;
@@ -657,6 +648,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, ret = dup_anon_vma(vma, next); if (ret) return ret;
}vma_start_write(next);
init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL); @@ -708,6 +700,8 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, if (vma_iter_prealloc(vmi)) return -ENOMEM;
- vma_start_write(vma);
- init_vma_prep(&vp, vma); vma_prepare(&vp); vma_adjust_trans_huge(vma, start, end, 0);
@@ -946,10 +940,12 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, /* Can we merge both the predecessor and the successor? */ if (merge_prev && merge_next && is_mergeable_anon_vma(prev->anon_vma, next->anon_vma, NULL)) {
remove = next; /* case 1 */ vma_end = next->vm_end; err = dup_anon_vma(prev, next); if (curr) { /* case 6 */vma_start_write(next);
vma_start_write(curr); remove = curr; remove2 = next; if (!next->anon_vma)
@@ -958,6 +954,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, } else if (merge_prev) { /* case 2 */ if (curr) { err = dup_anon_vma(prev, curr);
vma_start_write(curr); if (end == curr->vm_end) { /* case 7 */ remove = curr; } else { /* case 5 */
@@ -969,6 +966,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, res = next; if (prev && addr < prev->vm_end) { /* case 4 */ vma_end = addr;
vma_start_write(next); adjust = next; adj_start = -(prev->vm_end - addr); err = dup_anon_vma(next, prev);
@@ -983,6 +981,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, vma_pgoff = next->vm_pgoff - pglen; if (curr) { /* case 8 */ vma_pgoff = curr->vm_pgoff;
vma_start_write(curr); remove = curr; err = dup_anon_vma(next, curr); }
@@ -996,6 +995,8 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, if (vma_iter_prealloc(vmi)) return NULL;
- vma_start_write(vma);
- init_multi_vma_prep(&vp, vma, adjust, remove, remove2); VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma && vp.anon_vma != adjust->anon_vma);
@@ -2373,6 +2374,9 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, if (new->vm_ops && new->vm_ops->open) new->vm_ops->open(new);
- vma_start_write(vma);
- vma_start_write(new);
- init_vma_prep(&vp, vma); vp.insert = new; vma_prepare(&vp);
@@ -3078,6 +3082,8 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, if (vma_iter_prealloc(vmi)) goto unacct_fail;
vma_start_write(vma);
- init_vma_prep(&vp, vma); vma_prepare(&vp); vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0);
-- 2.41.0.487.g6d72f3e995-goog
On Mon, Jul 31, 2023 at 1:30 PM Liam R. Howlett Liam.Howlett@oracle.com wrote:
Adding Lorenzo since this also touches vma_merge() again..
- Suren Baghdasaryan surenb@google.com [230731 13:12]:
vma_prepare() is currently the central place where vmas are being locked before vma_complete() applies changes to them. While this is convenient, it also obscures vma locking and makes it hard to follow the locking rules. Move vma locking out of vma_prepare() and take vma locks explicitly at the locations where vmas are being modified.
I get the idea of locking closer to the edits, but vma_merge() is very hard to follow. It was worse when it was two functions and much larger, but adding this into vma_merge() is difficult to validate.
We still set vma = <another vma> in places, so that adds to the difficulty of ensuring the end result is all VMAs that will be modified/removed have been locked...and the 'locking rules' are also obscured.
It's also annoying that this doesn't fully allow you to follow the locking anyways. dup_anon_vma() still locks internally, with good reason, but it's still not clear that the VMA is locked when looking at this.
That being said, I did go through each case and it looks like it locks the correct VMA(s) to me.
Thanks! Yeah, it took me some time to ensure the locking there is correct. If you see a better alternative to make the locking more obvious I'm open to suggestions. I accept that locking in vma_merge() is not easy to follow.
Reviewed-by: Liam R. Howlett Liam.Howlett@oracle.com
Suggested-by: Linus Torvalds torvalds@linuxfoundation.org Signed-off-by: Suren Baghdasaryan surenb@google.com
mm/mmap.c | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c index 850a39dee075..e59d83cb1d7a 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -476,16 +476,6 @@ static inline void init_vma_prep(struct vma_prepare *vp, */ static inline void vma_prepare(struct vma_prepare *vp) {
vma_start_write(vp->vma);
if (vp->adj_next)
vma_start_write(vp->adj_next);
if (vp->insert)
vma_start_write(vp->insert);
if (vp->remove)
vma_start_write(vp->remove);
if (vp->remove2)
vma_start_write(vp->remove2);
if (vp->file) { uprobe_munmap(vp->vma, vp->vma->vm_start, vp->vma->vm_end);
@@ -650,6 +640,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, bool remove_next = false; struct vma_prepare vp;
vma_start_write(vma); if (next && (vma != next) && (end == next->vm_end)) { int ret;
@@ -657,6 +648,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, ret = dup_anon_vma(vma, next); if (ret) return ret;
vma_start_write(next); } init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
@@ -708,6 +700,8 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, if (vma_iter_prealloc(vmi)) return -ENOMEM;
vma_start_write(vma);
init_vma_prep(&vp, vma); vma_prepare(&vp); vma_adjust_trans_huge(vma, start, end, 0);
@@ -946,10 +940,12 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, /* Can we merge both the predecessor and the successor? */ if (merge_prev && merge_next && is_mergeable_anon_vma(prev->anon_vma, next->anon_vma, NULL)) {
vma_start_write(next); remove = next; /* case 1 */ vma_end = next->vm_end; err = dup_anon_vma(prev, next); if (curr) { /* case 6 */
vma_start_write(curr); remove = curr; remove2 = next; if (!next->anon_vma)
@@ -958,6 +954,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, } else if (merge_prev) { /* case 2 */ if (curr) { err = dup_anon_vma(prev, curr);
vma_start_write(curr); if (end == curr->vm_end) { /* case 7 */ remove = curr; } else { /* case 5 */
@@ -969,6 +966,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, res = next; if (prev && addr < prev->vm_end) { /* case 4 */ vma_end = addr;
vma_start_write(next); adjust = next; adj_start = -(prev->vm_end - addr); err = dup_anon_vma(next, prev);
@@ -983,6 +981,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, vma_pgoff = next->vm_pgoff - pglen; if (curr) { /* case 8 */ vma_pgoff = curr->vm_pgoff;
vma_start_write(curr); remove = curr; err = dup_anon_vma(next, curr); }
@@ -996,6 +995,8 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, if (vma_iter_prealloc(vmi)) return NULL;
vma_start_write(vma);
init_multi_vma_prep(&vp, vma, adjust, remove, remove2); VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma && vp.anon_vma != adjust->anon_vma);
@@ -2373,6 +2374,9 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, if (new->vm_ops && new->vm_ops->open) new->vm_ops->open(new);
vma_start_write(vma);
vma_start_write(new);
init_vma_prep(&vp, vma); vp.insert = new; vma_prepare(&vp);
@@ -3078,6 +3082,8 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, if (vma_iter_prealloc(vmi)) goto unacct_fail;
vma_start_write(vma);
init_vma_prep(&vp, vma); vma_prepare(&vp); vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0);
-- 2.41.0.487.g6d72f3e995-goog
linux-stable-mirror@lists.linaro.org