Dear All,
The original mail with this patch is not available on lore, so I decided to reply to this one.
On 03.10.2024 00:44, Andrew Morton wrote:
The patch titled
     Subject: mm/mremap: prevent racing change of old pmd type
has been added to the -mm mm-hotfixes-unstable branch.  Its filename is
     mm-mremap-prevent-racing-change-of-old-pmd-type.patch
This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches...
This patch will later appear in the mm-hotfixes-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days
From: Jann Horn <jannh@google.com>
Subject: mm/mremap: prevent racing change of old pmd type
Date: Wed, 02 Oct 2024 23:07:06 +0200
Prevent move_normal_pmd() in mremap() from racing with retract_page_tables() in MADVISE_COLLAPSE such that
    pmd_populate(mm, new_pmd, pmd_pgtable(pmd))
operates on an empty source pmd, causing creation of a new pmd which maps physical address 0 as a page table.
This bug is only reachable if either CONFIG_READ_ONLY_THP_FOR_FS is set or THP shmem is usable. (Unprivileged namespaces can be used to set up a tmpfs that can contain THP shmem pages with "huge=advise".)
If userspace triggers this bug *in multiple processes*, this could likely be used to create stale TLB entries pointing to freed pages or cause kernel UAF by breaking an invariant the rmap code relies on.
Fix it by moving the rmap locking up so that it covers the span from reading the PMD entry to moving the page table.
Link: https://lkml.kernel.org/r/20241002-move_normal_pmd-vs-collapse-fix-v1-1-7829...
Fixes: 1d65b771bc08 ("mm/khugepaged: retract_page_tables() without mmap or vma lock")
Signed-off-by: Jann Horn <jannh@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
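For readers who want the locking idea in the quoted description spelled out, here is a minimal userspace sketch. It is only an analogy, not the kernel patch: struct slot, rmap_lock, retract() and move_table_locked() are hypothetical names, and a pthread mutex stands in for the rmap locks. It shows the check of the source entry and the move happening under one lock, so a concurrent "retract" cannot empty the entry between the two steps.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a PMD slot: points to a "page table" or is NULL. */
struct slot {
	void *table;
};

/* Hypothetical stand-in for the rmap locks mentioned in the patch. */
static pthread_mutex_t rmap_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the collapse path: empties the slot under the lock. */
static void retract(struct slot *s)
{
	assert(s != NULL);
	pthread_mutex_lock(&rmap_lock);
	s->table = NULL;
	pthread_mutex_unlock(&rmap_lock);
}

/*
 * Move the table from old_slot to new_slot. The lock is taken before the
 * source entry is read and held until the move is done, so the entry cannot
 * change type in between. Returns 1 if a table was moved, 0 if the source
 * was empty.
 */
static int move_table_locked(struct slot *old_slot, struct slot *new_slot)
{
	int moved = 0;

	assert(old_slot != NULL && new_slot != NULL);
	pthread_mutex_lock(&rmap_lock);
	if (old_slot->table != NULL) {
		new_slot->table = old_slot->table;
		old_slot->table = NULL;
		moved = 1;
	}
	pthread_mutex_unlock(&rmap_lock);
	return moved;
}
```

In the racy ordering the source entry would be read before the lock is taken, so retract() could run in between and the mover would populate the destination from an entry that is already empty.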
This patch landed in today's linux-next as commit 46c1b3279220 ("mm/mremap: prevent racing change of old pmd type"). In my tests I found that it introduces a lockdep warning about a possible circular locking dependency on ARM64 machines. Reverting $subject together with commits a2fbe16f45a8 ("mm: mremap: move_ptes() use pte_offset_map_rw_nolock()") and 46c1b3279220 ("mm/mremap: prevent racing change of old pmd type") on top of next-20241004 fixes this problem.
Here is the observed lockdep warning:
Freeing unused kernel memory: 13824K
Run /sbin/init as init process
======================================================
WARNING: possible circular locking dependency detected
6.12.0-rc1+ #15391 Not tainted
------------------------------------------------------
init/1 is trying to acquire lock:
ffff000006943588 (&anon_vma->rwsem){+.+.}-{3:3}, at: vma_prepare+0x70/0x158
but task is already holding lock:
ffff0000048c9970 (&mapping->i_mmap_rwsem){+.+.}-{3:3}, at: vma_prepare+0x28/0x158
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&mapping->i_mmap_rwsem){+.+.}-{3:3}:
       down_write+0x50/0xe8
       dma_resv_lockdep+0x140/0x300
       do_one_initcall+0x68/0x300
       kernel_init_freeable+0x28c/0x50c
       kernel_init+0x20/0x1d8
       ret_from_fork+0x10/0x20
-> #1 (fs_reclaim){+.+.}-{0:0}:
       fs_reclaim_acquire+0xd0/0xe4
       __alloc_pages_noprof+0xe4/0x10d0
       alloc_pages_mpol_noprof+0x88/0x23c
       alloc_pages_noprof+0x48/0xc0
       __pud_alloc+0x44/0x254
       alloc_new_pud.constprop.0+0x154/0x160
       move_page_tables+0x1b0/0xc38
       relocate_vma_down+0xe4/0x1f8
       setup_arg_pages+0x190/0x370
       load_elf_binary+0x370/0x15c4
       bprm_execve+0x290/0x7a0
       kernel_execve+0xf8/0x16c
       run_init_process+0xa8/0xbc
       kernel_init+0xec/0x1d8
       ret_from_fork+0x10/0x20
-> #0 (&anon_vma->rwsem){+.+.}-{3:3}:
       __lock_acquire+0x1374/0x2224
       lock_acquire+0x200/0x340
       down_write+0x50/0xe8
       vma_prepare+0x70/0x158
       __split_vma+0x26c/0x388
       vma_modify+0x45c/0x7f4
       vma_modify_flags+0x90/0xc4
       mprotect_fixup+0x8c/0x2c0
       do_mprotect_pkey+0x2a8/0x464
       __arm64_sys_mprotect+0x20/0x30
       invoke_syscall+0x48/0x110
       el0_svc_common.constprop.0+0x40/0xe8
       do_el0_svc_compat+0x20/0x3c
       el0_svc_compat+0x44/0xe0
       el0t_32_sync_handler+0x98/0x148
       el0t_32_sync+0x194/0x198
other info that might help us debug this:
Chain exists of: &anon_vma->rwsem --> fs_reclaim --> &mapping->i_mmap_rwsem
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&mapping->i_mmap_rwsem);
                               lock(fs_reclaim);
                               lock(&mapping->i_mmap_rwsem);
  lock(&anon_vma->rwsem);
*** DEADLOCK ***
2 locks held by init/1:
 #0: ffff000006998188 (&mm->mmap_lock){++++}-{3:3}, at: do_mprotect_pkey+0xb4/0x464
 #1: ffff0000048c9970 (&mapping->i_mmap_rwsem){+.+.}-{3:3}, at: vma_prepare+0x28/0x158
stack backtrace:
CPU: 1 UID: 0 PID: 1 Comm: init Not tainted 6.12.0-rc1+ #15391
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0x94/0xec
 show_stack+0x18/0x24
 dump_stack_lvl+0x90/0xd0
 dump_stack+0x18/0x24
 print_circular_bug+0x298/0x37c
 check_noncircular+0x15c/0x170
 __lock_acquire+0x1374/0x2224
 lock_acquire+0x200/0x340
 down_write+0x50/0xe8
 vma_prepare+0x70/0x158
 __split_vma+0x26c/0x388
 vma_modify+0x45c/0x7f4
 vma_modify_flags+0x90/0xc4
 mprotect_fixup+0x8c/0x2c0
 do_mprotect_pkey+0x2a8/0x464
 __arm64_sys_mprotect+0x20/0x30
 invoke_syscall+0x48/0x110
 el0_svc_common.constprop.0+0x40/0xe8
 do_el0_svc_compat+0x20/0x3c
 el0_svc_compat+0x44/0xe0
 el0t_32_sync_handler+0x98/0x148
 el0t_32_sync+0x194/0x198
INIT: version 2.88 booting
...
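To make the reported chain easier to follow, the sketch below models the three lock classes from the warning as nodes in a small "acquired while holding" graph. It is an illustration only and not how lockdep is implemented: the enum names, dep[] matrix, reachable() and add_dependency() are all hypothetical. It records the two existing dependencies from the chain (&anon_vma->rwsem --> fs_reclaim --> &mapping->i_mmap_rwsem) and shows that taking &anon_vma->rwsem while holding &mapping->i_mmap_rwsem, as vma_prepare() does in the trace, would close a cycle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three lock classes named in the warning. */
enum lock_class { ANON_VMA_RWSEM, FS_RECLAIM, I_MMAP_RWSEM, NR_CLASSES };

/* dep[a][b] is true when lock b has been acquired while a was held. */
static bool dep[NR_CLASSES][NR_CLASSES];

/* Depth-first search: can 'to' be reached from 'from' via recorded edges? */
static bool reachable(int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_CLASSES; next++)
		if (dep[from][next] && !seen[next] && reachable(next, to, seen))
			return true;
	return false;
}

/*
 * Record "wanted is acquired while held is held". Returns false, without
 * recording anything, if 'wanted' already leads back to 'held', because
 * the new edge would then close a cycle.
 */
static bool add_dependency(int held, int wanted)
{
	bool seen[NR_CLASSES] = { false };

	assert(held >= 0 && held < NR_CLASSES);
	assert(wanted >= 0 && wanted < NR_CLASSES);
	if (reachable(wanted, held, seen))
		return false;
	dep[held][wanted] = true;
	return true;
}
```

Calling add_dependency(ANON_VMA_RWSEM, FS_RECLAIM) and add_dependency(FS_RECLAIM, I_MMAP_RWSEM) succeeds, matching entries #1 and #2 of the chain, while a following add_dependency(I_MMAP_RWSEM, ANON_VMA_RWSEM) is rejected, matching entry #0.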
Best regards