The patch below does not apply to the 5.15-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to stable@vger.kernel.org.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 3a5a8d343e1cf96eb9971b17cbd4b832ab19b8e7
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to 'stable@vger.kernel.org' --in-reply-to '2024061316-brook-imaging-a202@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
3a5a8d343e1c ("mm: fix race between __split_huge_pmd_locked() and GUP-fast") 4f83145721f3 ("mm: avoid unnecessary flush on change_huge_pmd()") c9fe66560bf2 ("mm/mprotect: do not flush when not required architecturally") 4a18419f71cd ("mm/mprotect: use mmu_gather")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3a5a8d343e1cf96eb9971b17cbd4b832ab19b8e7 Mon Sep 17 00:00:00 2001
From: Ryan Roberts <ryan.roberts@arm.com>
Date: Wed, 1 May 2024 15:33:10 +0100
Subject: [PATCH] mm: fix race between __split_huge_pmd_locked() and GUP-fast
__split_huge_pmd_locked() can be called for a present THP, devmap or (non-present) migration entry. It calls pmdp_invalidate() unconditionally on the pmdp and only determines if it is present or not based on the returned old pmd. This is a problem for the migration entry case because pmd_mkinvalid(), called by pmdp_invalidate(), must only be called for a present pmd.
On arm64 at least, pmd_mkinvalid() will mark the pmd such that any future call to pmd_present() will return true. And therefore any lockless pgtable walker could see the migration entry pmd in this state and start interpreting the fields as if it were present, leading to BadThings (TM). GUP-fast appears to be one such lockless pgtable walker.
x86 does not suffer the above problem, but instead pmd_mkinvalid() will corrupt the offset field of the swap entry within the swap pte. See link below for discussion of that problem.
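[As an illustrative aside, not part of the original commit message: a simplified model makes the arm64 failure mode concrete. The bit names and positions below only approximate arm64's encoding scheme and are assumptions for illustration; this is not the actual arch code.]

#include <stdio.h>

/* Illustrative bits: a hardware valid bit plus a software
 * "present but temporarily invalid" bit, arm64-style. */
#define PTE_VALID		(1UL << 0)
#define PMD_PRESENT_INVALID	(1UL << 59)

static unsigned long model_pmd_mkinvalid(unsigned long pmd)
{
	/* Clear the hardware valid bit but record the entry as present. */
	return (pmd & ~PTE_VALID) | PMD_PRESENT_INVALID;
}

static int model_pmd_present(unsigned long pmd)
{
	return !!(pmd & (PTE_VALID | PMD_PRESENT_INVALID));
}

int main(void)
{
	/* A migration entry: swap-type/offset bits only, valid bit clear. */
	unsigned long migration_pmd = 0xabcd00UL;

	printf("%d\n", model_pmd_present(migration_pmd));	/* 0: not present */
	migration_pmd = model_pmd_mkinvalid(migration_pmd);
	printf("%d\n", model_pmd_present(migration_pmd));	/* 1: the hazard */
	/* A lockless walker would now misread the swap bits as pfn bits. */
	return 0;
}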
Fix all of this by only calling pmdp_invalidate() for a present pmd. And for good measure let's add a warning to all implementations of pmdp_invalidate[_ad](). I've manually reviewed all other pmdp_invalidate[_ad]() call sites and believe all others to be conformant.
This is a theoretical bug found during code review. I don't have any test case to trigger it in practice.
Link: https://lkml.kernel.org/r/20240501143310.1381675-1-ryan.roberts@arm.com
Link: https://lore.kernel.org/all/0dd7827a-6334-439a-8fd0-43c98e6af22b@arm.com/
Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst
index 2466d3363af7..ad50ca6f495e 100644
--- a/Documentation/mm/arch_pgtable_helpers.rst
+++ b/Documentation/mm/arch_pgtable_helpers.rst
@@ -140,7 +140,8 @@ PMD Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pmd_swp_clear_soft_dirty  | Clears a soft dirty swapped PMD                  |
 +---------------------------+--------------------------------------------------+
-| pmd_mkinvalid             | Invalidates a mapped PMD [1]                     |
+| pmd_mkinvalid             | Invalidates a present PMD; do not call for       |
+|                           | non-present PMD [1]                              |
 +---------------------------+--------------------------------------------------+
 | pmd_set_huge              | Creates a PMD huge mapping                       |
 +---------------------------+--------------------------------------------------+
@@ -196,7 +197,8 @@ PUD Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pud_mkdevmap              | Creates a ZONE_DEVICE mapped PUD                 |
 +---------------------------+--------------------------------------------------+
-| pud_mkinvalid             | Invalidates a mapped PUD [1]                     |
+| pud_mkinvalid             | Invalidates a present PUD; do not call for       |
+|                           | non-present PUD [1]                              |
 +---------------------------+--------------------------------------------------+
 | pud_set_huge              | Creates a PUD huge mapping                       |
 +---------------------------+--------------------------------------------------+
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 83823db3488b..2975ea0841ba 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -170,6 +170,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 {
 	unsigned long old_pmd;
 
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
 	old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, _PAGE_INVALID);
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	return __pmd(old_pmd);
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 2cb2a2e7b34b..558902edbfec 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1769,8 +1769,10 @@ static inline pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
 static inline pmd_t pmdp_invalidate(struct vm_area_struct *vma,
 				    unsigned long addr, pmd_t *pmdp)
 {
-	pmd_t pmd = __pmd(pmd_val(*pmdp) | _SEGMENT_ENTRY_INVALID);
+	pmd_t pmd;
 
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
+	pmd = __pmd(pmd_val(*pmdp) | _SEGMENT_ENTRY_INVALID);
 	return pmdp_xchg_direct(vma->vm_mm, addr, pmdp, pmd);
 }
 
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 19642f7ffb52..8648a50afe88 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -249,6 +249,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 {
 	pmd_t old, entry;
 
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
 	entry = __pmd(pmd_val(*pmdp) & ~_PAGE_VALID);
 	old = pmdp_establish(vma, address, pmdp, entry);
 	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 94767c82fc0d..93e54ba91fbf 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -631,6 +631,8 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
 			 pmd_t *pmdp)
 {
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
+
 	/*
 	 * No flush is necessary. Once an invalid PTE is established, the PTE's
 	 * access and dirty bits cannot be updated.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 08e4f3343bcd..ccdcff73284a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2430,32 +2430,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		return __split_huge_zero_page_pmd(vma, haddr, pmd);
 	}
 
-	/*
-	 * Up to this point the pmd is present and huge and userland has the
-	 * whole access to the hugepage during the split (which happens in
-	 * place). If we overwrite the pmd with the not-huge version pointing
-	 * to the pte here (which of course we could if all CPUs were bug
-	 * free), userland could trigger a small page size TLB miss on the
-	 * small sized TLB while the hugepage TLB entry is still established in
-	 * the huge TLB. Some CPU doesn't like that.
-	 * See http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
-	 * 383 on page 105. Intel should be safe but is also warns that it's
-	 * only safe if the permission and cache attributes of the two entries
-	 * loaded in the two TLB is identical (which should be the case here).
-	 * But it is generally safer to never allow small and huge TLB entries
-	 * for the same virtual address to be loaded simultaneously. So instead
-	 * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
-	 * current pmd notpresent (atomically because here the pmd_trans_huge
-	 * must remain set at all times on the pmd until the split is complete
-	 * for this pmd), then we flush the SMP TLB and finally we write the
-	 * non-huge version of the pmd entry with pmd_populate.
-	 */
-	old_pmd = pmdp_invalidate(vma, haddr, pmd);
-
-	pmd_migration = is_pmd_migration_entry(old_pmd);
+	pmd_migration = is_pmd_migration_entry(*pmd);
 	if (unlikely(pmd_migration)) {
 		swp_entry_t entry;
 
+		old_pmd = *pmd;
 		entry = pmd_to_swp_entry(old_pmd);
 		page = pfn_swap_entry_to_page(entry);
 		write = is_writable_migration_entry(entry);
@@ -2466,6 +2445,30 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
+		/*
+		 * Up to this point the pmd is present and huge and userland has
+		 * the whole access to the hugepage during the split (which
+		 * happens in place). If we overwrite the pmd with the not-huge
+		 * version pointing to the pte here (which of course we could if
+		 * all CPUs were bug free), userland could trigger a small page
+		 * size TLB miss on the small sized TLB while the hugepage TLB
+		 * entry is still established in the huge TLB. Some CPU doesn't
+		 * like that. See
+		 * http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
+		 * 383 on page 105. Intel should be safe but is also warns that
+		 * it's only safe if the permission and cache attributes of the
+		 * two entries loaded in the two TLB is identical (which should
+		 * be the case here). But it is generally safer to never allow
+		 * small and huge TLB entries for the same virtual address to be
+		 * loaded simultaneously. So instead of doing "pmd_populate();
+		 * flush_pmd_tlb_range();" we first mark the current pmd
+		 * notpresent (atomically because here the pmd_trans_huge must
+		 * remain set at all times on the pmd until the split is
+		 * complete for this pmd), then we flush the SMP TLB and finally
+		 * we write the non-huge version of the pmd entry with
+		 * pmd_populate.
+		 */
+		old_pmd = pmdp_invalidate(vma, haddr, pmd);
 		page = pmd_page(old_pmd);
 		folio = page_folio(page);
 		if (pmd_dirty(old_pmd)) {
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4fcd959dcc4d..a78a4adf711a 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -198,6 +198,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
 	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	return old;
@@ -208,6 +209,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
 			 pmd_t *pmdp)
 {
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
 	return pmdp_invalidate(vma, address, pmdp);
 }
 #endif
__split_huge_pmd_locked() can be called for a present THP, devmap or (non-present) migration entry. It calls pmdp_invalidate() unconditionally on the pmdp and only determines if it is present or not based on the returned old pmd. This is a problem for the migration entry case because pmd_mkinvalid(), called by pmdp_invalidate(), must only be called for a present pmd.
On arm64 at least, pmd_mkinvalid() will mark the pmd such that any future call to pmd_present() will return true. And therefore any lockless pgtable walker could see the migration entry pmd in this state and start interpreting the fields as if it were present, leading to BadThings (TM). GUP-fast appears to be one such lockless pgtable walker.
x86 does not suffer the above problem, but instead pmd_mkinvalid() will corrupt the offset field of the swap entry within the swap pte. See link below for discussion of that problem.
Fix all of this by only calling pmdp_invalidate() for a present pmd. And for good measure let's add a warning to all implementations of pmdp_invalidate[_ad](). I've manually reviewed all other pmdp_invalidate[_ad]() call sites and believe all others to be conformant.
This is a theoretical bug found during code review. I don't have any test case to trigger it in practice.
Link: https://lkml.kernel.org/r/20240501143310.1381675-1-ryan.roberts@arm.com
Link: https://lore.kernel.org/all/0dd7827a-6334-439a-8fd0-43c98e6af22b@arm.com/
Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 3a5a8d343e1cf96eb9971b17cbd4b832ab19b8e7)
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 Documentation/vm/arch_pgtable_helpers.rst |  6 ++-
 arch/powerpc/mm/book3s64/pgtable.c        |  1 +
 arch/s390/include/asm/pgtable.h           |  4 +-
 arch/sparc/mm/tlb.c                       |  1 +
 mm/huge_memory.c                          | 49 ++++++++++++-----------
 mm/pgtable-generic.c                      |  1 +
 6 files changed, 36 insertions(+), 26 deletions(-)
diff --git a/Documentation/vm/arch_pgtable_helpers.rst b/Documentation/vm/arch_pgtable_helpers.rst
index 552567d863b8..b8ae5d040b99 100644
--- a/Documentation/vm/arch_pgtable_helpers.rst
+++ b/Documentation/vm/arch_pgtable_helpers.rst
@@ -134,7 +134,8 @@ PMD Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pmd_swp_clear_soft_dirty  | Clears a soft dirty swapped PMD                  |
 +---------------------------+--------------------------------------------------+
-| pmd_mkinvalid             | Invalidates a mapped PMD [1]                     |
+| pmd_mkinvalid             | Invalidates a present PMD; do not call for       |
+|                           | non-present PMD [1]                              |
 +---------------------------+--------------------------------------------------+
 | pmd_set_huge              | Creates a PMD huge mapping                       |
 +---------------------------+--------------------------------------------------+
@@ -190,7 +191,8 @@ PUD Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pud_mkdevmap              | Creates a ZONE_DEVICE mapped PUD                 |
 +---------------------------+--------------------------------------------------+
-| pud_mkinvalid             | Invalidates a mapped PUD [1]                     |
+| pud_mkinvalid             | Invalidates a present PUD; do not call for       |
+|                           | non-present PUD [1]                              |
 +---------------------------+--------------------------------------------------+
 | pud_set_huge              | Creates a PUD huge mapping                       |
 +---------------------------+--------------------------------------------------+
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index da15f28c7b13..3a22e7d970f3 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -115,6 +115,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 {
 	unsigned long old_pmd;
 
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
 	old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, _PAGE_INVALID);
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	return __pmd(old_pmd);
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index b61426c9ef17..b65ce0c90dd0 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1625,8 +1625,10 @@ static inline pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
 static inline pmd_t pmdp_invalidate(struct vm_area_struct *vma,
 				    unsigned long addr, pmd_t *pmdp)
 {
-	pmd_t pmd = __pmd(pmd_val(*pmdp) | _SEGMENT_ENTRY_INVALID);
+	pmd_t pmd;
 
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
+	pmd = __pmd(pmd_val(*pmdp) | _SEGMENT_ENTRY_INVALID);
 	return pmdp_xchg_direct(vma->vm_mm, addr, pmdp, pmd);
 }
 
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 9a725547578e..946f33c1b032 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -245,6 +245,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 {
 	pmd_t old, entry;
 
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
 	entry = __pmd(pmd_val(*pmdp) & ~_PAGE_VALID);
 	old = pmdp_establish(vma, address, pmdp, entry);
 	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 98ff57c8eda6..4e9dbaded0a4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2017,32 +2017,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		return __split_huge_zero_page_pmd(vma, haddr, pmd);
 	}
 
-	/*
-	 * Up to this point the pmd is present and huge and userland has the
-	 * whole access to the hugepage during the split (which happens in
-	 * place). If we overwrite the pmd with the not-huge version pointing
-	 * to the pte here (which of course we could if all CPUs were bug
-	 * free), userland could trigger a small page size TLB miss on the
-	 * small sized TLB while the hugepage TLB entry is still established in
-	 * the huge TLB. Some CPU doesn't like that.
-	 * See http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
-	 * 383 on page 105. Intel should be safe but is also warns that it's
-	 * only safe if the permission and cache attributes of the two entries
-	 * loaded in the two TLB is identical (which should be the case here).
-	 * But it is generally safer to never allow small and huge TLB entries
-	 * for the same virtual address to be loaded simultaneously. So instead
-	 * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
-	 * current pmd notpresent (atomically because here the pmd_trans_huge
-	 * must remain set at all times on the pmd until the split is complete
-	 * for this pmd), then we flush the SMP TLB and finally we write the
-	 * non-huge version of the pmd entry with pmd_populate.
-	 */
-	old_pmd = pmdp_invalidate(vma, haddr, pmd);
-
-	pmd_migration = is_pmd_migration_entry(old_pmd);
+	pmd_migration = is_pmd_migration_entry(*pmd);
 	if (unlikely(pmd_migration)) {
 		swp_entry_t entry;
 
+		old_pmd = *pmd;
 		entry = pmd_to_swp_entry(old_pmd);
 		page = pfn_swap_entry_to_page(entry);
 		write = is_writable_migration_entry(entry);
@@ -2050,6 +2029,30 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
+		/*
+		 * Up to this point the pmd is present and huge and userland has
+		 * the whole access to the hugepage during the split (which
+		 * happens in place). If we overwrite the pmd with the not-huge
+		 * version pointing to the pte here (which of course we could if
+		 * all CPUs were bug free), userland could trigger a small page
+		 * size TLB miss on the small sized TLB while the hugepage TLB
+		 * entry is still established in the huge TLB. Some CPU doesn't
+		 * like that. See
+		 * http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
+		 * 383 on page 105. Intel should be safe but is also warns that
+		 * it's only safe if the permission and cache attributes of the
+		 * two entries loaded in the two TLB is identical (which should
+		 * be the case here). But it is generally safer to never allow
+		 * small and huge TLB entries for the same virtual address to be
+		 * loaded simultaneously. So instead of doing "pmd_populate();
+		 * flush_pmd_tlb_range();" we first mark the current pmd
+		 * notpresent (atomically because here the pmd_trans_huge must
+		 * remain set at all times on the pmd until the split is
+		 * complete for this pmd), then we flush the SMP TLB and finally
+		 * we write the non-huge version of the pmd entry with
+		 * pmd_populate.
+		 */
+		old_pmd = pmdp_invalidate(vma, haddr, pmd);
 		page = pmd_page(old_pmd);
 		if (pmd_dirty(old_pmd))
 			SetPageDirty(page);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4e640baf9794..3bfc31a7cb38 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -194,6 +194,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
 	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	return old;
Sorry, please ignore this patch (see below)...
On 21/06/2024 16:22, Ryan Roberts wrote:
> [...]
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 4e640baf9794..3bfc31a7cb38 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -194,6 +194,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>  		     pmd_t *pmdp)
>  {
> +	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
>  	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
old needs to be defined before the warning; I'm getting a compile warning on 5.15. I fixed that up but failed to add it to this commit. But the real question is why I'm not seeing the same warning on mainline? Let me investigate and resend appropriate patches.
>  	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>  	return old;
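[An illustrative aside: a minimal sketch of the reordering that avoids the warning on 5.15 — declare first, then warn, then assign. This matches the mm/pgtable-generic.c hunk in the v2 posting further down.]

pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
		      pmd_t *pmdp)
{
	pmd_t old;

	/* The warning moves after the declaration, before the update. */
	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
	old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
	return old;
}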
On 21/06/2024 16:31, Ryan Roberts wrote:
> Sorry, please ignore this patch (see below)...
> [...]
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index 4e640baf9794..3bfc31a7cb38 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -194,6 +194,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>  		     pmd_t *pmdp)
>>  {
>> +	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
>>  	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
>
> old needs to be defined before the warning; I'm getting a compile warning on 5.15. I fixed that up but failed to add it to this commit. But the real question is why I'm not seeing the same warning on mainline? Let me investigate and resend appropriate patches.
OK, it turns out that commit b5ec6fd286df ("kbuild: Drop -Wdeclaration-after-statement") (v6.5 timeframe) stopped warning about declarations that follow statements, so when I did the original fix for mainline there was no warning for this.
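[As a generic illustration of what that flag catches — standalone C, not kernel code; build with something like gcc -c -Wdeclaration-after-statement test.c:]

int compute(void);

void example(void)
{
	compute();		/* a statement... */
	int x = compute();	/* ...then a declaration: with the flag set,
				 * gcc warns ("ISO C90 forbids mixed
				 * declarations and code") on this line */
	(void)x;
}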
Current status with backports for this patch: you have applied it to kernels back to 5.15, and prior to 5.15 there are conflicts. Clearly, when applied to any kernel prior to v6.5, this will result in a warning when DEBUG_VM is enabled.
What's the best way to proceed? Should we just fix up the backports, or am I going to have to deliver a separate (technically unnecessary) patch to mainline that can then be backported?
Thanks,
Ryan
On 21 Jun 2024, at 12:34, Ryan Roberts wrote:
> On 21/06/2024 16:31, Ryan Roberts wrote:
>> Sorry, please ignore this patch (see below)...
>> [...]
>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>> index 4e640baf9794..3bfc31a7cb38 100644
>>> --- a/mm/pgtable-generic.c
>>> +++ b/mm/pgtable-generic.c
>>> @@ -194,6 +194,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>>>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>>  		     pmd_t *pmdp)
>>>  {
>>> +	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
>>>  	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
>>
>> old needs to be defined before the warning; I'm getting a compile warning on 5.15. I fixed that up but failed to add it to this commit. But the real question is why I'm not seeing the same warning on mainline? Let me investigate and resend appropriate patches.
>
> OK, it turns out that commit b5ec6fd286df ("kbuild: Drop -Wdeclaration-after-statement") (v6.5 timeframe) stopped warning about declarations that follow statements, so when I did the original fix for mainline there was no warning for this.
>
> Current status with backports for this patch: you have applied it to kernels back to 5.15, and prior to 5.15 there are conflicts. Clearly, when applied to any kernel prior to v6.5, this will result in a warning when DEBUG_VM is enabled.
>
> What's the best way to proceed? Should we just fix up the backports, or am I going to have to deliver a separate (technically unnecessary) patch to mainline that can then be backported?
Probably just fix the backports. IMHO you would want to follow the expectation (no statement before declaration before v6.5) of the kernel to which you are backporting.
And thank you for taking time to fix all these.
--
Best Regards,
Yan, Zi
__split_huge_pmd_locked() can be called for a present THP, devmap or (non-present) migration entry. It calls pmdp_invalidate() unconditionally on the pmdp and only determines if it is present or not based on the returned old pmd. This is a problem for the migration entry case because pmd_mkinvalid(), called by pmdp_invalidate(), must only be called for a present pmd.
On arm64 at least, pmd_mkinvalid() will mark the pmd such that any future call to pmd_present() will return true. And therefore any lockless pgtable walker could see the migration entry pmd in this state and start interpreting the fields as if it were present, leading to BadThings (TM). GUP-fast appears to be one such lockless pgtable walker.
x86 does not suffer the above problem, but instead pmd_mkinvalid() will corrupt the offset field of the swap entry within the swap pte. See link below for discussion of that problem.
Fix all of this by only calling pmdp_invalidate() for a present pmd. And for good measure let's add a warning to all implementations of pmdp_invalidate[_ad](). I've manually reviewed all other pmdp_invalidate[_ad]() call sites and believe all others to be conformant.
This is a theoretical bug found during code review. I don't have any test case to trigger it in practice.
Link: https://lkml.kernel.org/r/20240501143310.1381675-1-ryan.roberts@arm.com
Link: https://lore.kernel.org/all/0dd7827a-6334-439a-8fd0-43c98e6af22b@arm.com/
Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 3a5a8d343e1cf96eb9971b17cbd4b832ab19b8e7)
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 Documentation/vm/arch_pgtable_helpers.rst |  6 ++-
 arch/powerpc/mm/book3s64/pgtable.c        |  1 +
 arch/s390/include/asm/pgtable.h           |  4 +-
 arch/sparc/mm/tlb.c                       |  1 +
 mm/huge_memory.c                          | 49 ++++++++++++-----------
 mm/pgtable-generic.c                      |  5 ++-
 6 files changed, 39 insertions(+), 27 deletions(-)
diff --git a/Documentation/vm/arch_pgtable_helpers.rst b/Documentation/vm/arch_pgtable_helpers.rst
index 552567d863b8..b8ae5d040b99 100644
--- a/Documentation/vm/arch_pgtable_helpers.rst
+++ b/Documentation/vm/arch_pgtable_helpers.rst
@@ -134,7 +134,8 @@ PMD Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pmd_swp_clear_soft_dirty  | Clears a soft dirty swapped PMD                  |
 +---------------------------+--------------------------------------------------+
-| pmd_mkinvalid             | Invalidates a mapped PMD [1]                     |
+| pmd_mkinvalid             | Invalidates a present PMD; do not call for       |
+|                           | non-present PMD [1]                              |
 +---------------------------+--------------------------------------------------+
 | pmd_set_huge              | Creates a PMD huge mapping                       |
 +---------------------------+--------------------------------------------------+
@@ -190,7 +191,8 @@ PUD Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pud_mkdevmap              | Creates a ZONE_DEVICE mapped PUD                 |
 +---------------------------+--------------------------------------------------+
-| pud_mkinvalid             | Invalidates a mapped PUD [1]                     |
+| pud_mkinvalid             | Invalidates a present PUD; do not call for       |
+|                           | non-present PUD [1]                              |
 +---------------------------+--------------------------------------------------+
 | pud_set_huge              | Creates a PUD huge mapping                       |
 +---------------------------+--------------------------------------------------+
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index da15f28c7b13..3a22e7d970f3 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -115,6 +115,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 {
 	unsigned long old_pmd;
 
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
 	old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, _PAGE_INVALID);
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	return __pmd(old_pmd);
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index b61426c9ef17..b65ce0c90dd0 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1625,8 +1625,10 @@ static inline pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
 static inline pmd_t pmdp_invalidate(struct vm_area_struct *vma,
 				    unsigned long addr, pmd_t *pmdp)
 {
-	pmd_t pmd = __pmd(pmd_val(*pmdp) | _SEGMENT_ENTRY_INVALID);
+	pmd_t pmd;
 
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
+	pmd = __pmd(pmd_val(*pmdp) | _SEGMENT_ENTRY_INVALID);
 	return pmdp_xchg_direct(vma->vm_mm, addr, pmdp, pmd);
 }
 
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 9a725547578e..946f33c1b032 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -245,6 +245,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 {
 	pmd_t old, entry;
 
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
 	entry = __pmd(pmd_val(*pmdp) & ~_PAGE_VALID);
 	old = pmdp_establish(vma, address, pmdp, entry);
 	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 98ff57c8eda6..4e9dbaded0a4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2017,32 +2017,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		return __split_huge_zero_page_pmd(vma, haddr, pmd);
 	}
 
-	/*
-	 * Up to this point the pmd is present and huge and userland has the
-	 * whole access to the hugepage during the split (which happens in
-	 * place). If we overwrite the pmd with the not-huge version pointing
-	 * to the pte here (which of course we could if all CPUs were bug
-	 * free), userland could trigger a small page size TLB miss on the
-	 * small sized TLB while the hugepage TLB entry is still established in
-	 * the huge TLB. Some CPU doesn't like that.
-	 * See http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
-	 * 383 on page 105. Intel should be safe but is also warns that it's
-	 * only safe if the permission and cache attributes of the two entries
-	 * loaded in the two TLB is identical (which should be the case here).
-	 * But it is generally safer to never allow small and huge TLB entries
-	 * for the same virtual address to be loaded simultaneously. So instead
-	 * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
-	 * current pmd notpresent (atomically because here the pmd_trans_huge
-	 * must remain set at all times on the pmd until the split is complete
-	 * for this pmd), then we flush the SMP TLB and finally we write the
-	 * non-huge version of the pmd entry with pmd_populate.
-	 */
-	old_pmd = pmdp_invalidate(vma, haddr, pmd);
-
-	pmd_migration = is_pmd_migration_entry(old_pmd);
+	pmd_migration = is_pmd_migration_entry(*pmd);
 	if (unlikely(pmd_migration)) {
 		swp_entry_t entry;
 
+		old_pmd = *pmd;
 		entry = pmd_to_swp_entry(old_pmd);
 		page = pfn_swap_entry_to_page(entry);
 		write = is_writable_migration_entry(entry);
@@ -2050,6 +2029,30 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
+		/*
+		 * Up to this point the pmd is present and huge and userland has
+		 * the whole access to the hugepage during the split (which
+		 * happens in place). If we overwrite the pmd with the not-huge
+		 * version pointing to the pte here (which of course we could if
+		 * all CPUs were bug free), userland could trigger a small page
+		 * size TLB miss on the small sized TLB while the hugepage TLB
+		 * entry is still established in the huge TLB. Some CPU doesn't
+		 * like that. See
+		 * http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
+		 * 383 on page 105. Intel should be safe but is also warns that
+		 * it's only safe if the permission and cache attributes of the
+		 * two entries loaded in the two TLB is identical (which should
+		 * be the case here). But it is generally safer to never allow
+		 * small and huge TLB entries for the same virtual address to be
+		 * loaded simultaneously. So instead of doing "pmd_populate();
+		 * flush_pmd_tlb_range();" we first mark the current pmd
+		 * notpresent (atomically because here the pmd_trans_huge must
+		 * remain set at all times on the pmd until the split is
+		 * complete for this pmd), then we flush the SMP TLB and finally
+		 * we write the non-huge version of the pmd entry with
+		 * pmd_populate.
+		 */
+		old_pmd = pmdp_invalidate(vma, haddr, pmd);
 		page = pmd_page(old_pmd);
 		if (pmd_dirty(old_pmd))
 			SetPageDirty(page);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4e640baf9794..efa5ab28f471 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -194,7 +194,10 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
-	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
+	pmd_t old;
+
+	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
+	old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	return old;
 }