We recently made GUP's common page table walking code also walk hugetlb VMAs without most hugetlb special-casing, preparing for a future with less hugetlb-specific page table walking code in the codebase. Turns out that we missed one page table locking detail: page table locking for hugetlb folios that are not mapped using a single PMD/PUD.
Assume we have a hugetlb folio that spans multiple PTEs (e.g., 64 KiB hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the page tables, will perform a pte_offset_map_lock() to grab the PTE table lock.
However, hugetlb code that concurrently modifies these page tables would actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the locks would differ. Something similar can happen right now with hugetlb folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS is enabled.
This issue can be reproduced [1], for example triggering:
[ 3105.936100] ------------[ cut here ]------------
[ 3105.939323] WARNING: CPU: 31 PID: 2732 at mm/gup.c:142 try_grab_folio+0x11c/0x188
[ 3105.944634] Modules linked in: [...]
[ 3105.974841] CPU: 31 PID: 2732 Comm: reproducer Not tainted 6.10.0-64.eln141.aarch64 #1
[ 3105.980406] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-4.fc40 05/24/2024
[ 3105.986185] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 3105.991108] pc : try_grab_folio+0x11c/0x188
[ 3105.994013] lr : follow_page_pte+0xd8/0x430
[ 3105.996986] sp : ffff80008eafb8f0
[ 3105.999346] x29: ffff80008eafb900 x28: ffffffe8d481f380 x27: 00f80001207cff43
[ 3106.004414] x26: 0000000000000001 x25: 0000000000000000 x24: ffff80008eafba48
[ 3106.009520] x23: 0000ffff9372f000 x22: ffff7a54459e2000 x21: ffff7a546c1aa978
[ 3106.014529] x20: ffffffe8d481f3c0 x19: 0000000000610041 x18: 0000000000000001
[ 3106.019506] x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000000
[ 3106.024494] x14: ffffb85477fdfe08 x13: 0000ffff9372ffff x12: 0000000000000000
[ 3106.029469] x11: 1fffef4a88a96be1 x10: ffff7a54454b5f0c x9 : ffffb854771b12f0
[ 3106.034324] x8 : 0008000000000000 x7 : ffff7a546c1aa980 x6 : 0008000000000080
[ 3106.038902] x5 : 00000000001207cf x4 : 0000ffff9372f000 x3 : ffffffe8d481f000
[ 3106.043420] x2 : 0000000000610041 x1 : 0000000000000001 x0 : 0000000000000000
[ 3106.047957] Call trace:
[ 3106.049522]  try_grab_folio+0x11c/0x188
[ 3106.051996]  follow_pmd_mask.constprop.0.isra.0+0x150/0x2e0
[ 3106.055527]  follow_page_mask+0x1a0/0x2b8
[ 3106.058118]  __get_user_pages+0xf0/0x348
[ 3106.060647]  faultin_page_range+0xb0/0x360
[ 3106.063651]  do_madvise+0x340/0x598
Let's make huge_pte_lockptr() effectively use the same PT locks as any core-mm page table walker would. Add ptep_lockptr() to obtain the PTE page table lock using a pte pointer -- unfortunately we cannot convert pte_lockptr() because virt_to_page() doesn't work with the kmap'ed page tables that we can have with CONFIG_HIGHPTE.
There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb folio being mapped using two PTE page tables. While hugetlb wants to take the PMD table lock, core-mm would grab the PTE table lock of one of the two PTE page tables. In such corner cases, we have to make sure that both locks match, which is (fortunately!) currently guaranteed for 8xx as it does not support SMP and consequently doesn't use split PT locks.
[1] https://lore.kernel.org/all/1bbfcc7f-f222-45a5-ac44-c5a1381c596d@redhat.com/
Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
Cc: stable@vger.kernel.org
Cc: Peter Xu <peterx@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
Still busy runtime-testing this version -- have to set up my ARM environment again. Dropped the RBs/ACKs because there was a significant change in the pte_lockptr() handling.
v1 -> 2:
* Extend patch description
* Drop "mm: let pte_lockptr() consume a pte_t pointer"
* Introduce ptep_lockptr() in this patch
I wish there was a nicer way to avoid messing with CONFIG_HIGHPTE ...
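For illustration only, the size-based selection that the new huge_pte_lockptr() performs can be modelled in plain userspace C. The sk_* names and the size constants below are made up for this sketch (they mimic a layout with 4 KiB base pages, 2 MiB PMDs and 1 GiB PUDs) and are not kernel identifiers:

```c
#include <assert.h>

/* Hypothetical sizes, assuming 4 KiB base pages (illustration only). */
#define SK_PAGE_SIZE (1UL << 12)
#define SK_PMD_SIZE  (1UL << 21)
#define SK_PUD_SIZE  (1UL << 30)

/* Which page table level's lock a hugetlb folio of a given size maps to. */
enum sk_lock_level { SK_LOCK_PTE, SK_LOCK_PMD, SK_LOCK_PUD };

/*
 * Mirrors the branch structure of the patched huge_pte_lockptr():
 * smaller than a PMD -> PTE table lock (only when highpte is off),
 * smaller than a PUD -> PMD table lock, everything else -> PUD table lock.
 */
static inline enum sk_lock_level sk_lock_level(unsigned long size, int highpte)
{
	if (size < SK_PMD_SIZE && !highpte)
		return SK_LOCK_PTE;
	else if (size < SK_PUD_SIZE)
		return SK_LOCK_PMD;
	return SK_LOCK_PUD;
}
```

With these assumed sizes, a 64 KiB folio maps to the PTE table lock, 2 MiB and 32 MiB folios map to the PMD table lock, and a 1 GiB folio maps to the PUD table lock. With highpte set, the 64 KiB case falls through to the PMD branch, as in the patch.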
---
 include/linux/hugetlb.h | 26 +++++++++++++++++++++++---
 include/linux/mm.h      | 10 ++++++++++
 2 files changed, 33 insertions(+), 3 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c9bf68c239a01..dd6d4ee5ee59c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -944,10 +944,30 @@ static inline bool htlb_allow_alloc_fallback(int reason)
 static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
 					   struct mm_struct *mm, pte_t *pte)
 {
-	if (huge_page_size(h) == PMD_SIZE)
+	VM_WARN_ON(huge_page_size(h) == PAGE_SIZE);
+	VM_WARN_ON(huge_page_size(h) >= P4D_SIZE);
+
+	/*
+	 * hugetlb must use the exact same PT locks as core-mm page table
+	 * walkers would. When modifying a PTE table, hugetlb must take the
+	 * PTE PT lock, when modifying a PMD table, hugetlb must take the PMD
+	 * PT lock etc.
+	 *
+	 * The expectation is that any hugetlb folio smaller than a PMD is
+	 * always mapped into a single PTE table and that any hugetlb folio
+	 * smaller than a PUD (but at least as big as a PMD) is always mapped
+	 * into a single PMD table.
+	 *
+	 * If that does not hold for an architecture, then that architecture
+	 * must disable split PT locks such that all *_lockptr() functions
+	 * will give us the same result: the per-MM PT lock.
+	 */
+	if (huge_page_size(h) < PMD_SIZE && !IS_ENABLED(CONFIG_HIGHPTE))
+		/* pte_alloc_huge() only applies with !CONFIG_HIGHPTE */
+		return ptep_lockptr(mm, pte);
+	else if (huge_page_size(h) < PUD_SIZE)
 		return pmd_lockptr(mm, (pmd_t *) pte);
-	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
-	return &mm->page_table_lock;
+	return pud_lockptr(mm, (pud_t *) pte);
 }
 
 #ifndef hugepages_supported
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b100df8cb5857..1b1f40ff00b7d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2926,6 +2926,12 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
 	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
 }
 
+static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
+{
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
+	return ptlock_ptr(virt_to_ptdesc(pte));
+}
+
 static inline bool ptlock_init(struct ptdesc *ptdesc)
 {
 	/*
@@ -2950,6 +2956,10 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
 {
 	return &mm->page_table_lock;
 }
+static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
+{
+	return &mm->page_table_lock;
+}
 static inline void ptlock_cache_init(void) {}
 static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
 static inline void ptlock_free(struct ptdesc *ptdesc) {}
On Tue, Jul 30, 2024 at 1:03 PM David Hildenbrand <david@redhat.com> wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b100df8cb5857..1b1f40ff00b7d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2926,6 +2926,12 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
>  	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
>  }
>
> +static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
> +{
> +	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
> +	return ptlock_ptr(virt_to_ptdesc(pte));

Hi David,
Small question: ptep_lockptr() does not handle the case where the size of the PTE table is larger than PAGE_SIZE, but pmd_lockptr() does. IIUC, for pte_lockptr() and ptep_lockptr() to return the same result in this case, ptep_lockptr() should be doing the masking that pmd_lockptr() is doing. Are you sure that you don't need to be doing it? (Or maybe I am misunderstanding something.)
Thanks for the fix!

> +}
> +
>  static inline bool ptlock_init(struct ptdesc *ptdesc)
>  {
>  	/*
> @@ -2950,6 +2956,10 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
>  {
>  	return &mm->page_table_lock;
>  }
> +static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
> +{
> +	return &mm->page_table_lock;
> +}
>  static inline void ptlock_cache_init(void) {}
>  static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
>  static inline void ptlock_free(struct ptdesc *ptdesc) {}
> --
> 2.45.2
On 30.07.24 22:43, James Houghton wrote:
> On Tue, Jul 30, 2024 at 1:03 PM David Hildenbrand <david@redhat.com> wrote:
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index b100df8cb5857..1b1f40ff00b7d 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2926,6 +2926,12 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
>>  	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
>>  }
>>
>> +static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
>> +{
>> +	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
>> +	return ptlock_ptr(virt_to_ptdesc(pte));
>
> Hi David,

Hi!

> Small question: ptep_lockptr() does not handle the case where the size of the PTE table is larger than PAGE_SIZE, but pmd_lockptr() does.

I thought I convinced myself that leaf page tables are always single pages and had a comment in v1.

But now I have to double-check again, and staring at pagetable_pte_ctor() callers I am left confused.

It certainly sounds more future proof to just align the pointer down to the start of the PTE table like pmd_lockptr() would.

> IIUC, for pte_lockptr() and ptep_lockptr() to return the same result in this case, ptep_lockptr() should be doing the masking that pmd_lockptr() is doing. Are you sure that you don't need to be doing it? (Or maybe I am misunderstanding something.)

It's a valid concern even if it would not be required. But I'm afraid I won't dig into the details and simply do the alignment in a v3.
I'm hoping I'll be done with that hugetlb crap soon; it's starting to annoy me and I really should be working on other stuff ...
On 30.07.24 23:00, David Hildenbrand wrote:
> On 30.07.24 22:43, James Houghton wrote:
>> On Tue, Jul 30, 2024 at 1:03 PM David Hildenbrand <david@redhat.com> wrote:
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index b100df8cb5857..1b1f40ff00b7d 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -2926,6 +2926,12 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
>>>  	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
>>>  }
>>>
>>> +static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
>>> +{
>>> +	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
>>> +	return ptlock_ptr(virt_to_ptdesc(pte));
>>
>> Hi David,
>
> Hi!
>
>> Small question: ptep_lockptr() does not handle the case where the size of the PTE table is larger than PAGE_SIZE, but pmd_lockptr() does.
>
> I thought I convinced myself that leaf page tables are always single pages and had a comment in v1.
>
> But now I have to double-check again, and staring at pagetable_pte_ctor() callers I am left confused.
>
> It certainly sounds more future proof to just align the pointer down to the start of the PTE table like pmd_lockptr() would.
>
>> IIUC, for pte_lockptr() and ptep_lockptr() to return the same result in this case, ptep_lockptr() should be doing the masking that pmd_lockptr() is doing. Are you sure that you don't need to be doing it? (Or maybe I am misunderstanding something.)
>
> It's a valid concern even if it would not be required. But I'm afraid I won't dig into the details and simply do the alignment in a v3.

To be precise, the following on top:
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1b1f40ff00b7d..f6c7fe8f5746f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2926,10 +2926,22 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
 	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
 }
 
-static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
+static inline struct page *ptep_pgtable_page(pte_t *pte)
 {
+	unsigned long mask = ~(PTRS_PER_PTE * sizeof(pte_t) - 1);
+
 	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
-	return ptlock_ptr(virt_to_ptdesc(pte));
+	return virt_to_page((void *)((unsigned long)pte & mask));
+}
+
+static inline struct ptdesc *ptep_ptdesc(pte_t *pte)
+{
+	return page_ptdesc(ptep_pgtable_page(pte));
+}
+
+static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
+{
+	return ptlock_ptr(ptep_ptdesc(pte));
 }

virt_to_ptdesc() really is of limited use in core-mm code as it seems ...
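To illustrate why the masking could matter, here is a userspace sketch (not kernel code). The SKETCH_* constants and sketch_* helpers are hypothetical: they assume a PTE table of 2048 eight-byte entries (16 KiB, i.e. four 4 KiB pages) that is naturally aligned to its own size, which is the same alignment assumption the mask relies on:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical geometry, chosen for illustration only. */
#define SKETCH_PAGE_SIZE    4096UL
#define SKETCH_PTRS_PER_PTE 2048UL
typedef uint64_t sketch_pte_t;

/* Mask that clears the byte offset of an entry within its table. */
static inline uintptr_t sketch_table_mask(void)
{
	return ~(uintptr_t)(SKETCH_PTRS_PER_PTE * sizeof(sketch_pte_t) - 1);
}

/* Address of the first entry of the table that contains 'pte'. */
static inline uintptr_t sketch_table_start(uintptr_t pte)
{
	return pte & sketch_table_mask();
}

/* Address of the page that contains 'pte' (no table-size masking). */
static inline uintptr_t sketch_page_start(uintptr_t pte)
{
	return pte & ~(uintptr_t)(SKETCH_PAGE_SIZE - 1);
}
```

For an entry in the last page of such a table, sketch_page_start() yields the fourth page while sketch_table_start() yields the first one, so only the masked variant consistently resolves to the same page for every entry of the table.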
On Tue, Jul 30, 2024 at 2:07 PM David Hildenbrand <david@redhat.com> wrote:
> On 30.07.24 23:00, David Hildenbrand wrote:
>> On 30.07.24 22:43, James Houghton wrote:
>>> On Tue, Jul 30, 2024 at 1:03 PM David Hildenbrand <david@redhat.com> wrote:
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index b100df8cb5857..1b1f40ff00b7d 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -2926,6 +2926,12 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
>>>>  	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
>>>>  }
>>>>
>>>> +static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
>>>> +{
>>>> +	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
>>>> +	return ptlock_ptr(virt_to_ptdesc(pte));
>>>
>>> Hi David,
>>
>> Hi!
>>
>>> Small question: ptep_lockptr() does not handle the case where the size of the PTE table is larger than PAGE_SIZE, but pmd_lockptr() does.
>>
>> I thought I convinced myself that leaf page tables are always single pages and had a comment in v1.
>>
>> But now I have to double-check again, and staring at pagetable_pte_ctor() callers I am left confused.
>>
>> It certainly sounds more future proof to just align the pointer down to the start of the PTE table like pmd_lockptr() would.
>>
>>> IIUC, for pte_lockptr() and ptep_lockptr() to return the same result in this case, ptep_lockptr() should be doing the masking that pmd_lockptr() is doing. Are you sure that you don't need to be doing it? (Or maybe I am misunderstanding something.)
>>
>> It's a valid concern even if it would not be required. But I'm afraid I won't dig into the details and simply do the alignment in a v3.
>
> To be precise, the following on top:
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1b1f40ff00b7d..f6c7fe8f5746f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2926,10 +2926,22 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
>  	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
>  }
>
> -static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
> +static inline struct page *ptep_pgtable_page(pte_t *pte)
>  {
> +	unsigned long mask = ~(PTRS_PER_PTE * sizeof(pte_t) - 1);
> +
>  	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
> -	return ptlock_ptr(virt_to_ptdesc(pte));
> +	return virt_to_page((void *)((unsigned long)pte & mask));
> +}
> +
> +static inline struct ptdesc *ptep_ptdesc(pte_t *pte)
> +{
> +	return page_ptdesc(ptep_pgtable_page(pte));
> +}
> +
> +static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
> +{
> +	return ptlock_ptr(ptep_ptdesc(pte));
>  }

Thanks! That looks right to me. Feel free to add
Reviewed-by: James Houghton <jthoughton@google.com>

> virt_to_ptdesc() really is of limited use in core-mm code as it seems ...
>
> --
> Cheers,
>
> David / dhildenb
On 30.07.24 23:17, James Houghton wrote:
> On Tue, Jul 30, 2024 at 2:07 PM David Hildenbrand <david@redhat.com> wrote:
>> On 30.07.24 23:00, David Hildenbrand wrote:
>>> On 30.07.24 22:43, James Houghton wrote:
>>>> On Tue, Jul 30, 2024 at 1:03 PM David Hildenbrand <david@redhat.com> wrote:
>>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>>> index b100df8cb5857..1b1f40ff00b7d 100644
>>>>> --- a/include/linux/mm.h
>>>>> +++ b/include/linux/mm.h
>>>>> @@ -2926,6 +2926,12 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
>>>>>  	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
>>>>>  }
>>>>>
>>>>> +static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
>>>>> +{
>>>>> +	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
>>>>> +	return ptlock_ptr(virt_to_ptdesc(pte));
>>>>
>>>> Hi David,
>>>
>>> Hi!
>>>
>>>> Small question: ptep_lockptr() does not handle the case where the size of the PTE table is larger than PAGE_SIZE, but pmd_lockptr() does.
>>>
>>> I thought I convinced myself that leaf page tables are always single pages and had a comment in v1.
>>>
>>> But now I have to double-check again, and staring at pagetable_pte_ctor() callers I am left confused.
>>>
>>> It certainly sounds more future proof to just align the pointer down to the start of the PTE table like pmd_lockptr() would.
>>>
>>>> IIUC, for pte_lockptr() and ptep_lockptr() to return the same result in this case, ptep_lockptr() should be doing the masking that pmd_lockptr() is doing. Are you sure that you don't need to be doing it? (Or maybe I am misunderstanding something.)
>>>
>>> It's a valid concern even if it would not be required. But I'm afraid I won't dig into the details and simply do the alignment in a v3.
>>
>> To be precise, the following on top:
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 1b1f40ff00b7d..f6c7fe8f5746f 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2926,10 +2926,22 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
>>  	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
>>  }
>>
>> -static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
>> +static inline struct page *ptep_pgtable_page(pte_t *pte)
>>  {
>> +	unsigned long mask = ~(PTRS_PER_PTE * sizeof(pte_t) - 1);
>> +
>>  	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
>> -	return ptlock_ptr(virt_to_ptdesc(pte));
>> +	return virt_to_page((void *)((unsigned long)pte & mask));
>> +}
>> +
>> +static inline struct ptdesc *ptep_ptdesc(pte_t *pte)
>> +{
>> +	return page_ptdesc(ptep_pgtable_page(pte));
>> +}
>> +
>> +static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
>> +{
>> +	return ptlock_ptr(ptep_ptdesc(pte));
>>  }
>
> Thanks! That looks right to me. Feel free to add
>
> Reviewed-by: James Houghton <jthoughton@google.com>

Thanks for the review, will send a v3 tomorrow after having wasted more valuable life time setting up the ARM environment again ... :)
On Tue, Jul 30, 2024 at 01:43:35PM -0700, James Houghton wrote:
> On Tue, Jul 30, 2024 at 1:03 PM David Hildenbrand <david@redhat.com> wrote:
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index b100df8cb5857..1b1f40ff00b7d 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2926,6 +2926,12 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
>>  	return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
>>  }
>>
>> +static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
>> +{
>> +	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
>> +	return ptlock_ptr(virt_to_ptdesc(pte));
>
> Hi David,
>
> Small question: ptep_lockptr() does not handle the case where the size of the PTE table is larger than PAGE_SIZE, but pmd_lockptr() does. IIUC, for pte_lockptr() and ptep_lockptr() to return the same result in this case, ptep_lockptr() should be doing the masking that pmd_lockptr() is doing. Are you sure that you don't need to be doing it? (Or maybe I am misunderstanding something.)

I was just curious and looked at pte_alloc_one(); not too many archs implement it besides the default (which calls pte_alloc_one_noprof(), and should be order=0 there). I didn't see any arch that actually allocates with non-zero orders.
The motorola/m68k one is slightly involved, but still, nothing I have spotted yet.
Thanks,