On Sat, Jul 8, 2023 at 12:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
When forking a child process, the parent write-protects an anonymous page and COW-shares it with the child being forked using copy_present_pte(). The parent's TLB is flushed right before we drop the parent's mmap_lock in dup_mmap(). If we get a write-fault in the parent before that TLB flush, and we end up replacing that anonymous page in the parent process in do_wp_page() (because it is COW-shared with the child), this might leave stale writable TLB entries targeting the wrong (old) page. A similar issue happened in the past with userfaultfd (see the flush_tlb_page() call inside do_wp_page()).

Lock VMAs of the parent process when forking a child, which prevents concurrent page faults during the fork operation and avoids this issue.

This fix can potentially regress some fork-heavy workloads. Kernel build time did not show a noticeable regression on a 56-core machine, while a stress test mapping 10000 VMAs and forking 5000 times in a tight loop shows a ~5% regression. If such a fork time regression is unacceptable, disabling CONFIG_PER_VMA_LOCK should restore performance. Further optimizations are possible if this regression proves to be problematic.
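For reference, a fork stress test of the kind described above (many VMAs, fork in a tight loop) could look roughly like the sketch below. This is not the program used for the numbers quoted here; the helper names map_vmas(), fork_loop() and time_forks() are hypothetical, and alternating page protections is just one assumed way to keep the kernel from merging adjacent mappings into a single VMA.

```c
/* Hypothetical sketch of a fork stress test; not the original benchmark. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Create nr_vmas single-page anonymous VMAs. One range is mapped read-only
 * and every other page is then made writable, so neighbouring pages have
 * different protections and cannot be merged into one VMA.
 * Returns nr_vmas on success, -1 on failure.
 */
static int map_vmas(int nr_vmas)
{
	long page = sysconf(_SC_PAGESIZE);
	char *base;
	int i;

	base = mmap(NULL, (size_t)nr_vmas * page, PROT_READ,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	for (i = 0; i < nr_vmas; i += 2) {
		char *p = base + (size_t)i * page;

		if (mprotect(p, page, PROT_READ | PROT_WRITE))
			return -1;
		*p = 1;	/* fault the page in so fork has a PTE to copy */
	}
	return nr_vmas;
}

/*
 * Fork nr_forks children one after another; each child exits immediately.
 * Returns nr_forks on success, -1 on failure.
 */
static int fork_loop(int nr_forks)
{
	int i, status;

	for (i = 0; i < nr_forks; i++) {
		pid_t pid = fork();

		if (pid < 0)
			return -1;
		if (pid == 0)
			_exit(0);
		if (waitpid(pid, &status, 0) != pid)
			return -1;
		if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
			return -1;
	}
	return nr_forks;
}

/* Map the VMAs, then time the fork loop. Returns seconds, or -1.0 on error. */
static double time_forks(int nr_vmas, int nr_forks)
{
	struct timespec start, end;

	if (map_vmas(nr_vmas) < 0)
		return -1.0;
	clock_gettime(CLOCK_MONOTONIC, &start);
	if (fork_loop(nr_forks) < 0)
		return -1.0;
	clock_gettime(CLOCK_MONOTONIC, &end);
	return (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling time_forks(10000, 5000) from main() and printing the result would match the parameters mentioned above; comparing the result with and without CONFIG_PER_VMA_LOCK on an otherwise identical kernel is one way to look for the regression.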
Sending this earlier version of the patch per request from Linus and with his explanation here: https://lore.kernel.org/all/CAHk-=wi-99-DyMOGywTbjRnRRC+XfpPm=r=pei4A=MEL0QD...
Suggested-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asy...
Reported-by: Jacob Young <jacobly.alt@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
 kernel/fork.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/kernel/fork.c b/kernel/fork.c
index b85814e614a5..d2e12b6d2b18 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	for_each_vma(old_vmi, mpnt) {
 		struct file *file;
 
+		vma_start_write(mpnt);
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
 			continue;
--
2.41.0.390.g38632f3daf-goog