From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit 0a85e51d37645e9ce57e5e1a30859e07810ed07c upstream.
Patch series "thp: fix few MADV_DONTNEED races"
For MADV_DONTNEED to work properly with huge pages, it's critical not to clear the pmd intermittently unless you hold down_write(mmap_sem).
Otherwise MADV_DONTNEED can miss the THP, which can lead to userspace breakage.
See an example of such a race in the commit message of patch 2/4.
All these races were found by code inspection; I haven't seen them triggered. I don't think it's worth applying them to stable@.
This patch (of 4):
Restructure code in preparation for a fix.
Link: http://lkml.kernel.org/r/20170302151034.27829-2-kirill.shutemov@linux.intel....
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[jwang: adjust context for 4.9]
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
 mm/huge_memory.c | 52 ++++++++++++++++++++++++++--------------------------
 1 file changed, 26 insertions(+), 26 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3cae1dc..660e732 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1509,37 +1509,37 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, pgprot_t newprot, int prot_numa)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
-	int ret = 0;
+	pmd_t entry;
+	bool preserve_write;
+	int ret;
 
 	ptl = __pmd_trans_huge_lock(pmd, vma);
-	if (ptl) {
-		pmd_t entry;
-		bool preserve_write = prot_numa && pmd_write(*pmd);
-		ret = 1;
+	if (!ptl)
+		return 0;
 
-		/*
-		 * Avoid trapping faults against the zero page. The read-only
-		 * data is likely to be read-cached on the local CPU and
-		 * local/remote hits to the zero page are not interesting.
-		 */
-		if (prot_numa && is_huge_zero_pmd(*pmd)) {
-			spin_unlock(ptl);
-			return ret;
-		}
+	preserve_write = prot_numa && pmd_write(*pmd);
+	ret = 1;
 
-		if (!prot_numa || !pmd_protnone(*pmd)) {
-			entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
-			entry = pmd_modify(entry, newprot);
-			if (preserve_write)
-				entry = pmd_mkwrite(entry);
-			ret = HPAGE_PMD_NR;
-			set_pmd_at(mm, addr, pmd, entry);
-			BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
-					pmd_write(entry));
-		}
-		spin_unlock(ptl);
-	}
+	/*
+	 * Avoid trapping faults against the zero page. The read-only
+	 * data is likely to be read-cached on the local CPU and
+	 * local/remote hits to the zero page are not interesting.
+	 */
+	if (prot_numa && is_huge_zero_pmd(*pmd))
+		goto unlock;
 
+	if (prot_numa && pmd_protnone(*pmd))
+		goto unlock;
+
+	entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
+	entry = pmd_modify(entry, newprot);
+	if (preserve_write)
+		entry = pmd_mkwrite(entry);
+	ret = HPAGE_PMD_NR;
+	set_pmd_at(mm, addr, pmd, entry);
+	BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
+unlock:
+	spin_unlock(ptl);
 	return ret;
 }
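For readability, here is change_huge_pmd() as it looks after this restructuring, reassembled from the hunks above (a sketch of the 4.9 result, not a verbatim copy of the tree):

int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
		unsigned long addr, pgprot_t newprot, int prot_numa)
{
	struct mm_struct *mm = vma->vm_mm;
	spinlock_t *ptl;
	pmd_t entry;
	bool preserve_write;
	int ret;

	/* Take the pmd lock; bail out if this is not a huge pmd. */
	ptl = __pmd_trans_huge_lock(pmd, vma);
	if (!ptl)
		return 0;

	preserve_write = prot_numa && pmd_write(*pmd);
	ret = 1;

	/* Avoid trapping faults against the zero page (see comment in diff). */
	if (prot_numa && is_huge_zero_pmd(*pmd))
		goto unlock;

	if (prot_numa && pmd_protnone(*pmd))
		goto unlock;

	entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
	entry = pmd_modify(entry, newprot);
	if (preserve_write)
		entry = pmd_mkwrite(entry);
	ret = HPAGE_PMD_NR;
	set_pmd_at(mm, addr, pmd, entry);
	BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
unlock:
	spin_unlock(ptl);
	return ret;
}

Behaviour is unchanged; the early return and the unlock label flatten the nesting so the next patch can insert the pmdp_invalidate() sequence without reindenting the whole function.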
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit ced108037c2aa542b3ed8b7afd1576064ad1362a upstream.
In the prot_numa case we are under down_read(mmap_sem). It's critical not to clear the pmd intermittently, to avoid racing with MADV_DONTNEED, which also runs under down_read(mmap_sem):
	CPU0:				CPU1:
				change_huge_pmd(prot_numa=1)
				 pmdp_huge_get_and_clear_notify()
	madvise_dontneed()
	 zap_pmd_range()
	  pmd_trans_huge(*pmd) == 0 (without ptl)
	  // skip the pmd
				 set_pmd_at();
				 // pmd is re-established
The race makes MADV_DONTNEED miss the huge pmd and fail to clear it, which may break userspace.
Found by code analysis; never seen triggered.
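The lockless check that loses the race is on the MADV_DONTNEED side. Abridged from zap_pmd_range() in 4.9's mm/memory.c (a sketch for illustration, not the verbatim loop):

	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
		/* split or zap the huge pmd, taking the ptl inside */
	}
	/*
	 * If change_huge_pmd() cleared *pmd just before these checks, the
	 * entry looks pmd_none() (no ptl is held here), the pmd is skipped,
	 * and set_pmd_at() on the other CPU re-establishes it afterwards.
	 */
	if (pmd_none_or_trans_huge_or_clear_bad(pmd))
		goto next;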
Link: http://lkml.kernel.org/r/20170302151034.27829-3-kirill.shutemov@linux.intel....
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[jwang: adjust context for 4.9]
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
 mm/huge_memory.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 660e732..c234c07 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1531,7 +1531,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	if (prot_numa && pmd_protnone(*pmd))
 		goto unlock;
 
-	entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
+	/*
+	 * In case prot_numa, we are under down_read(mmap_sem). It's critical
+	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
+	 * which is also under down_read(mmap_sem):
+	 *
+	 *	CPU0:				CPU1:
+	 *				change_huge_pmd(prot_numa=1)
+	 *				 pmdp_huge_get_and_clear_notify()
+	 * madvise_dontneed()
+	 *  zap_pmd_range()
+	 *   pmd_trans_huge(*pmd) == 0 (without ptl)
+	 *   // skip the pmd
+	 *				 set_pmd_at();
+	 *				 // pmd is re-established
+	 *
+	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
+	 * which may break userspace.
+	 *
+	 * pmdp_invalidate() is required to make sure we don't miss
+	 * dirty/young flags set by hardware.
+	 */
+	entry = *pmd;
+	pmdp_invalidate(vma, addr, pmd);
+
+	/*
+	 * Recover dirty/young flags.  It relies on pmdp_invalidate to not
+	 * corrupt them.
+	 */
+	if (pmd_dirty(*pmd))
+		entry = pmd_mkdirty(entry);
+	if (pmd_young(*pmd))
+		entry = pmd_mkyoung(entry);
+
 	entry = pmd_modify(entry, newprot);
 	if (preserve_write)
 		entry = pmd_mkwrite(entry);
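With the fix applied, the update sequence in change_huge_pmd() never leaves the pmd empty: pmdp_invalidate() marks the entry invalid (not-present) rather than zeroing it, so a concurrent zap_pmd_range() no longer observes pmd_none() and cannot skip the range. The resulting sequence, assembled from the hunk above:

	entry = *pmd;
	pmdp_invalidate(vma, addr, pmd);

	/*
	 * Re-read *pmd: hardware may have set dirty/young in the window
	 * between the snapshot and the invalidate; pmdp_invalidate() is
	 * relied on not to corrupt those bits.
	 */
	if (pmd_dirty(*pmd))
		entry = pmd_mkdirty(entry);
	if (pmd_young(*pmd))
		entry = pmd_mkyoung(entry);

	entry = pmd_modify(entry, newprot);
	if (preserve_write)
		entry = pmd_mkwrite(entry);
	ret = HPAGE_PMD_NR;
	set_pmd_at(mm, addr, pmd, entry);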
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
commit c0c379e2931b05facef538e53bf3b21f283d9a0b upstream.
Dave noticed that after fixing the MADV_DONTNEED vs. NUMA balancing race, the last user of pmdp_huge_get_and_clear_notify() is gone.
Let's drop the helper.
Link: http://lkml.kernel.org/r/20170306112047.24809-1-kirill.shutemov@linux.intel....
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[jwang: adjust context for 4.9]
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
 include/linux/mmu_notifier.h | 13 -------------
 1 file changed, 13 deletions(-)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 25c0dc3..854dfa6 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -381,18 +381,6 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 	___pmd;								\
 })
 
-#define pmdp_huge_get_and_clear_notify(__mm, __haddr, __pmd)		\
-({									\
-	unsigned long ___haddr = __haddr & HPAGE_PMD_MASK;		\
-	pmd_t ___pmd;							\
-									\
-	___pmd = pmdp_huge_get_and_clear(__mm, __haddr, __pmd);	\
-	mmu_notifier_invalidate_range(__mm, ___haddr,			\
-				      ___haddr + HPAGE_PMD_SIZE);	\
-									\
-	___pmd;								\
-})
-
 /*
  * set_pte_at_notify() sets the pte _after_ running the notifier.
  * This is safe to start by updating the secondary MMUs, because the primary MMU
@@ -480,7 +468,6 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 #define pmdp_clear_young_notify pmdp_test_and_clear_young
 #define ptep_clear_flush_notify ptep_clear_flush
 #define pmdp_huge_clear_flush_notify pmdp_huge_clear_flush
-#define pmdp_huge_get_and_clear_notify pmdp_huge_get_and_clear
 #define set_pte_at_notify set_pte_at
 
 #endif /* CONFIG_MMU_NOTIFIER */
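Any out-of-tree caller that still wants the old semantics can open-code the helper. A minimal equivalent of the removed macro as a static inline (hypothetical name, assuming CONFIG_MMU_NOTIFIER and the 4.9 APIs):

static inline pmd_t my_pmdp_huge_get_and_clear_notify(struct mm_struct *mm,
						      unsigned long haddr,
						      pmd_t *pmd)
{
	pmd_t entry;

	haddr &= HPAGE_PMD_MASK;
	/*
	 * This clears the pmd outright -- exactly the transient-empty
	 * window the previous two patches eliminate in change_huge_pmd().
	 * Prefer pmdp_invalidate() where that race matters.
	 */
	entry = pmdp_huge_get_and_clear(mm, haddr, pmd);
	mmu_notifier_invalidate_range(mm, haddr, haddr + HPAGE_PMD_SIZE);
	return entry;
}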