Hi All,
Changes since v1:
- left fixes only, improvements will be posted separately;
- Fixes: and -stable tags added to patch descriptions;
This series is an attempt to fix the violation of the lazy MMU mode context requirement as described for arch_enter_lazy_mmu_mode():
This mode can only be entered and left under the protection of the page table locks for all page tables which may be modified.
On s390, if I make arch_enter_lazy_mmu_mode() -> preempt_disable() and arch_leave_lazy_mmu_mode() -> preempt_enable(), I am getting this:
[ 553.332108] preempt_count: 1, expected: 0
[ 553.332117] no locks held by multipathd/2116.
[ 553.332128] CPU: 24 PID: 2116 Comm: multipathd Kdump: loaded Tainted:
[ 553.332139] Hardware name: IBM 3931 A01 701 (LPAR)
[ 553.332146] Call Trace:
[ 553.332152]  [<00000000158de23a>] dump_stack_lvl+0xfa/0x150
[ 553.332167]  [<0000000013e10d12>] __might_resched+0x57a/0x5e8
[ 553.332178]  [<00000000144eb6c2>] __alloc_pages+0x2ba/0x7c0
[ 553.332189]  [<00000000144d5cdc>] __get_free_pages+0x2c/0x88
[ 553.332198]  [<00000000145663f6>] kasan_populate_vmalloc_pte+0x4e/0x110
[ 553.332207]  [<000000001447625c>] apply_to_pte_range+0x164/0x3c8
[ 553.332218]  [<000000001448125a>] apply_to_pmd_range+0xda/0x318
[ 553.332226]  [<000000001448181c>] __apply_to_page_range+0x384/0x768
[ 553.332233]  [<0000000014481c28>] apply_to_page_range+0x28/0x38
[ 553.332241]  [<00000000145665da>] kasan_populate_vmalloc+0x82/0x98
[ 553.332249]  [<00000000144c88d0>] alloc_vmap_area+0x590/0x1c90
[ 553.332257]  [<00000000144ca108>] __get_vm_area_node.constprop.0+0x138/0x260
[ 553.332265]  [<00000000144d17fc>] __vmalloc_node_range+0x134/0x360
[ 553.332274]  [<0000000013d5dbf2>] alloc_thread_stack_node+0x112/0x378
[ 553.332284]  [<0000000013d62726>] dup_task_struct+0x66/0x430
[ 553.332293]  [<0000000013d63962>] copy_process+0x432/0x4b80
[ 553.332302]  [<0000000013d68300>] kernel_clone+0xf0/0x7d0
[ 553.332311]  [<0000000013d68bd6>] __do_sys_clone+0xae/0xc8
[ 553.332400]  [<0000000013d68dee>] __s390x_sys_clone+0xd6/0x118
[ 553.332410]  [<0000000013c9d34c>] do_syscall+0x22c/0x328
[ 553.332419]  [<00000000158e7366>] __do_syscall+0xce/0xf0
[ 553.332428]  [<0000000015913260>] system_call+0x70/0x98
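The experiment above amounted to roughly the following debug-only stubs (not a proposed change, just to enforce the atomicity assumption on s390):

static inline void arch_enter_lazy_mmu_mode(void)
{
	preempt_disable();
}

static inline void arch_leave_lazy_mmu_mode(void)
{
	preempt_enable();
}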
This exposes a KASAN issue that is fixed with patch 1 and an apply_to_pte_range() issue that is fixed with patch 3, while patch 2 is a prerequisite.
Commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash lazy mmu mode") looks like a powerpc-only fix, yet it does not entirely conform to the requirement quoted above (the page tables themselves are still not protected). If I am not mistaken, xen and sparc are in a similar situation.
Thanks!
Alexander Gordeev (3):
  kasan: Avoid sleepable page allocation from atomic context
  mm: Cleanup apply_to_pte_range() routine
  mm: Protect kernel pgtables in apply_to_pte_range()
 mm/kasan/shadow.c |  9 +++------
 mm/memory.c       | 33 +++++++++++++++++++++------------
 2 files changed, 24 insertions(+), 18 deletions(-)
apply_to_page_range() enters lazy MMU mode and then invokes the kasan_populate_vmalloc_pte() callback on each page table walk iteration. The lazy MMU mode may only be entered under protection of the page table lock. However, the callback can sleep when trying to allocate a single page.
Change __get_free_page() allocation mode from GFP_KERNEL to GFP_ATOMIC to avoid scheduling out while in atomic context.
Cc: stable@vger.kernel.org
Fixes: 3c5c3cfb9ef4 ("kasan: support backing vmalloc space with real shadow memory")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 mm/kasan/shadow.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 88d1c9dcb507..edfa77959474 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -301,7 +301,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 	if (likely(!pte_none(ptep_get(ptep))))
 		return 0;
 
-	page = __get_free_page(GFP_KERNEL);
+	page = __get_free_page(GFP_ATOMIC);
 	if (!page)
 		return -ENOMEM;
On 4/8/25 6:07 PM, Alexander Gordeev wrote:
apply_to_page_range() enters lazy MMU mode and then invokes the kasan_populate_vmalloc_pte() callback on each page table walk iteration. The lazy MMU mode may only be entered under protection of the page table lock. However, the callback can sleep when trying to allocate a single page.
Change __get_free_page() allocation mode from GFP_KERNEL to GFP_ATOMIC to avoid scheduling out while in atomic context.
Cc: stable@vger.kernel.org
Fixes: 3c5c3cfb9ef4 ("kasan: support backing vmalloc space with real shadow memory")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 mm/kasan/shadow.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 88d1c9dcb507..edfa77959474 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -301,7 +301,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 	if (likely(!pte_none(ptep_get(ptep))))
 		return 0;
 
-	page = __get_free_page(GFP_KERNEL);
+	page = __get_free_page(GFP_ATOMIC);
 	if (!page)
 		return -ENOMEM;
I think a better way to fix this would be to move the allocation out of atomic context: allocate the page prior to the apply_to_page_range() call and pass it down to kasan_populate_vmalloc_pte().

Whenever kasan_populate_vmalloc_pte() requires an additional page, we could bail out with -EAGAIN and allocate another one.
On Wed, Apr 09, 2025 at 04:10:58PM +0200, Andrey Ryabinin wrote:
Hi Andrey,
@@ -301,7 +301,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 	if (likely(!pte_none(ptep_get(ptep))))
 		return 0;
 
-	page = __get_free_page(GFP_KERNEL);
+	page = __get_free_page(GFP_ATOMIC);
 	if (!page)
 		return -ENOMEM;
I think a better way to fix this would be to move the allocation out of atomic context: allocate the page prior to the apply_to_page_range() call and pass it down to kasan_populate_vmalloc_pte().
I think the page address could be passed as the parameter to kasan_populate_vmalloc_pte().
Whenever kasan_populate_vmalloc_pte() requires an additional page, we could bail out with -EAGAIN and allocate another one.
When would it be needed? kasan_populate_vmalloc_pte() handles just one page.
Thanks!
On 4/9/25 4:25 PM, Alexander Gordeev wrote:
On Wed, Apr 09, 2025 at 04:10:58PM +0200, Andrey Ryabinin wrote:
Hi Andrey,
@@ -301,7 +301,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 	if (likely(!pte_none(ptep_get(ptep))))
 		return 0;
 
-	page = __get_free_page(GFP_KERNEL);
+	page = __get_free_page(GFP_ATOMIC);
 	if (!page)
 		return -ENOMEM;
I think a better way to fix this would be to move the allocation out of atomic context: allocate the page prior to the apply_to_page_range() call and pass it down to kasan_populate_vmalloc_pte().
I think the page address could be passed as the parameter to kasan_populate_vmalloc_pte().
We'll need to pass it as 'struct page **page' or maybe as a pointer to some struct, e.g.:

struct page_data {
	struct page *page;
};
So kasan_populate_vmalloc_pte() would do something like this:
kasan_populate_vmalloc_pte()
{
	if (!pte_none)
		return 0;

	if (!page_data->page)
		return -EAGAIN;

	// use page to set pte

	// NULLify pointer so that the next kasan_populate_vmalloc_pte() will
	// bail out to allocate a new page
	page_data->page = NULL;
}
And it might be a good idea to add 'last_addr' to page_data, so that we know where we stopped and the next apply_to_page_range() call can continue from there instead of starting from the beginning.
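For the caller side, I imagine something roughly like the sketch below (untested, just to illustrate the retry; the function name populate_range and the variable pd are placeholders):

struct page_data {
	struct page *page;
	unsigned long last_addr;
};

static int populate_range(unsigned long start, unsigned long end)
{
	struct page_data pd = { .page = NULL, .last_addr = start };
	int ret;

	do {
		if (!pd.page) {
			/* allocate outside of the lazy MMU section */
			pd.page = alloc_page(GFP_KERNEL);
			if (!pd.page)
				return -ENOMEM;
		}
		/* resume from where the previous attempt stopped */
		ret = apply_to_page_range(&init_mm, pd.last_addr,
					  end - pd.last_addr,
					  kasan_populate_vmalloc_pte, &pd);
	} while (ret == -EAGAIN);

	if (pd.page)
		__free_page(pd.page);

	return ret;
}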
Whenever kasan_populate_vmalloc_pte() requires an additional page, we could bail out with -EAGAIN and allocate another one.
When would it be needed? kasan_populate_vmalloc_pte() handles just one page.
apply_to_page_range() goes over a range of addresses and calls kasan_populate_vmalloc_pte() multiple times (each time with a different 'addr' but the same '*unused' arg). Things will go wrong if you use the same page multiple times for different addresses.
Thanks!
On Wed, Apr 09, 2025 at 04:56:29PM +0200, Andrey Ryabinin wrote:
Hi Andrey,
...
-	page = __get_free_page(GFP_KERNEL);
+	page = __get_free_page(GFP_ATOMIC);
 	if (!page)
I think a better way to fix this would be to move the allocation out of atomic context: allocate the page prior to the apply_to_page_range() call and pass it down to kasan_populate_vmalloc_pte().
I think the page address could be passed as the parameter to kasan_populate_vmalloc_pte().
We'll need to pass it as 'struct page **page' or maybe as a pointer to some struct, e.g.:

struct page_data {
	struct page *page;
};
...
Thanks for the hint! I will try to implement that, but will likely start in two weeks, after I am back from vacation.
Not sure whether this version needs to be dropped.
Thanks!
Reverse the nesting of the 'create' and 'mm == &init_mm' conditions and move the page table mask modification out of the atomic context. This is a prerequisite for fixing the missing kernel page table lock.
Cc: stable@vger.kernel.org
Fixes: 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy updates")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 mm/memory.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 2d8c265fc7d6..f0201c8ec1ce 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2915,24 +2915,28 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			      pte_fn_t fn, void *data, bool create,
 			      pgtbl_mod_mask *mask)
 {
+	int err = create ? -ENOMEM : -EINVAL;
 	pte_t *pte, *mapped_pte;
-	int err = 0;
 	spinlock_t *ptl;
 
-	if (create) {
-		mapped_pte = pte = (mm == &init_mm) ?
-			pte_alloc_kernel_track(pmd, addr, mask) :
-			pte_alloc_map_lock(mm, pmd, addr, &ptl);
+	if (mm == &init_mm) {
+		if (create)
+			pte = pte_alloc_kernel_track(pmd, addr, mask);
+		else
+			pte = pte_offset_kernel(pmd, addr);
 		if (!pte)
-			return -ENOMEM;
+			return err;
 	} else {
-		mapped_pte = pte = (mm == &init_mm) ?
-			pte_offset_kernel(pmd, addr) :
-			pte_offset_map_lock(mm, pmd, addr, &ptl);
+		if (create)
+			pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
+		else
+			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 		if (!pte)
-			return -EINVAL;
+			return err;
+		mapped_pte = pte;
 	}
 
+	err = 0;
 	arch_enter_lazy_mmu_mode();
 
 	if (fn) {
@@ -2944,12 +2948,14 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			}
 		} while (addr += PAGE_SIZE, addr != end);
 	}
-	*mask |= PGTBL_PTE_MODIFIED;
 
 	arch_leave_lazy_mmu_mode();
 
 	if (mm != &init_mm)
 		pte_unmap_unlock(mapped_pte, ptl);
+
+	*mask |= PGTBL_PTE_MODIFIED;
+
 	return err;
 }
The lazy MMU mode can only be entered and left under the protection of the page table locks for all page tables which may be modified. Yet, when it comes to kernel mappings, apply_to_pte_range() does not take any locks. That does not conform to the arch_enter|leave_lazy_mmu_mode() semantics and could potentially lead to rescheduling a process while in lazy MMU mode or to races on kernel page table updates.
Cc: stable@vger.kernel.org
Fixes: 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy updates")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 mm/kasan/shadow.c | 7 ++-----
 mm/memory.c       | 5 ++++-
 2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index edfa77959474..6531a7aa8562 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -308,14 +308,14 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 	__memset((void *)page, KASAN_VMALLOC_INVALID, PAGE_SIZE);
 	pte = pfn_pte(PFN_DOWN(__pa(page)), PAGE_KERNEL);
 
-	spin_lock(&init_mm.page_table_lock);
 	if (likely(pte_none(ptep_get(ptep)))) {
 		set_pte_at(&init_mm, addr, ptep, pte);
 		page = 0;
 	}
-	spin_unlock(&init_mm.page_table_lock);
+
 	if (page)
 		free_page(page);
+
 	return 0;
 }
@@ -401,13 +401,10 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 
 	page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);
 
-	spin_lock(&init_mm.page_table_lock);
-
 	if (likely(!pte_none(ptep_get(ptep)))) {
 		pte_clear(&init_mm, addr, ptep);
 		free_page(page);
 	}
-	spin_unlock(&init_mm.page_table_lock);
 
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index f0201c8ec1ce..1f3727104e99 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2926,6 +2926,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			pte = pte_offset_kernel(pmd, addr);
 		if (!pte)
 			return err;
+		spin_lock(&init_mm.page_table_lock);
 	} else {
 		if (create)
 			pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
@@ -2951,7 +2952,9 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 
 	arch_leave_lazy_mmu_mode();
 
-	if (mm != &init_mm)
+	if (mm == &init_mm)
+		spin_unlock(&init_mm.page_table_lock);
+	else
 		pte_unmap_unlock(mapped_pte, ptl);
 
 	*mask |= PGTBL_PTE_MODIFIED;
On Tue, Apr 08, 2025 at 06:07:32PM +0200, Alexander Gordeev wrote:
Hi Andrew,
The lazy MMU mode can only be entered and left under the protection of the page table locks for all page tables which may be modified.
Heiko Carstens noticed that the above claim has not been valid since v6.15-rc1 commit 691ee97e1a9d ("mm: fix lazy mmu docs and usage"), which restates it as:
"In the general case, no lock is guaranteed to be held between entry and exit of the lazy mode. So the implementation must assume preemption may be enabled"
That effectively invalidates this patch, so it needs to be dropped.
Patch 2 could still be fine, except for the -stable and Fixes tags, and it does not need to aim for 6.15-rcX. Do you want me to repost it?
Thanks!
On Thu, 10 Apr 2025 16:50:33 +0200 Alexander Gordeev <agordeev@linux.ibm.com> wrote:
On Tue, Apr 08, 2025 at 06:07:32PM +0200, Alexander Gordeev wrote:
Hi Andrew,
The lazy MMU mode can only be entered and left under the protection of the page table locks for all page tables which may be modified.
Heiko Carstens noticed that the above claim has not been valid since v6.15-rc1 commit 691ee97e1a9d ("mm: fix lazy mmu docs and usage"), which restates it as:
"In the general case, no lock is guaranteed to be held between entry and exit of the lazy mode. So the implementation must assume preemption may be enabled"
That effectively invalidates this patch, so it needs to be dropped.
Patch 2 could still be fine, except for the -stable and Fixes tags, and it does not need to aim for 6.15-rcX. Do you want me to repost it?
I dropped the whole series - let's start again.