The page migration code employs try_to_unmap() to try and unmap the source page. This is accomplished by using rmap_walk to find all vmas where the page is mapped. This search stops when page mapcount is zero. For shared PMD huge pages, the page map count is always 1 no matter the number of mappings. Shared mappings are tracked via the reference count of the PMD page. Therefore, try_to_unmap stops prematurely and does not completely unmap all mappings of the source page.
This problem can result is data corruption as writes to the original source page can happen after contents of the page are copied to the target page. Hence, data is lost.
This problem was originally seen as DB corruption of shared global areas after a huge page was soft offlined due to ECC memory errors. DB developers noticed they could reproduce the issue by (hotplug) offlining memory used to back huge pages. A simple testcase can reproduce the problem by creating a shared PMD mapping (note that this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using migrate_pages() to migrate process pages between nodes while continually writing to the huge pages being migrated.
To fix, have the try_to_unmap_one routine check for huge PMD sharing by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a shared mapping it will be 'unshared' which removes the page table entry and drops the reference on the PMD page. After this, flush caches and TLB.
mmu notifiers are called before locking page tables, but we can not be sure of PMD sharing until page tables are locked. Therefore, check for the possibility of PMD sharing before locking so that notifiers can prepare for the worst possible case.
Fixes: 39dde65c9940 ("shared page table for hugetlb page") Cc: stable@vger.kernel.org Signed-off-by: Mike Kravetz mike.kravetz@oracle.com --- include/linux/hugetlb.h | 14 ++++++++++++++ mm/hugetlb.c | 40 +++++++++++++++++++++++++++++++++++++++ mm/rmap.c | 42 ++++++++++++++++++++++++++++++++++++++--- 3 files changed, 93 insertions(+), 3 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 36fa6a2a82e3..1c6cde68487f 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -140,6 +140,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, unsigned long sz); int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep); +bool huge_pmd_sharing_possible(struct vm_area_struct *vma, + unsigned long *start, unsigned long *end); struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address, int write); struct page *follow_huge_pd(struct vm_area_struct *vma, @@ -170,6 +172,18 @@ static inline unsigned long hugetlb_total_pages(void) return 0; }
+static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, + pte_t *ptep) +{ + return 0; +} + +bool huge_pmd_sharing_possible(struct vm_area_struct *vma, + unsigned long *start, unsigned long *end) +{ + return false; +} + #define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n) ({ BUG(); 0; }) #define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL) #define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; }) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 3103099f64fd..fd155dc52117 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4555,6 +4555,9 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
/* * check on proper vm_flags and page table alignment + * + * Note that this is the same check used in huge_pmd_sharing_possible. + * If you change one, consider changing both. */ if (vma->vm_flags & VM_MAYSHARE && vma->vm_start <= base && end <= vma->vm_end) @@ -4562,6 +4565,43 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr) return false; }
+/* + * Determine if start,end range within vma could be mapped by shared pmd. + * If yes, adjust start and end to cover range associated with possible + * shared pmd mappings. + */ +bool huge_pmd_sharing_possible(struct vm_area_struct *vma, + unsigned long *start, unsigned long *end) +{ + unsigned long check_addr = *start; + bool ret = false; + + if (!(vma->vm_flags & VM_MAYSHARE)) + return ret; + + for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) { + unsigned long a_start = check_addr & PUD_MASK; + unsigned long a_end = a_start + PUD_SIZE; + + /* + * If sharing is possible, adjust start/end if necessary. + * + * Note that this is the same check used in vma_shareable. If + * you change one, consider changing both. + */ + if (vma->vm_start <= a_start && a_end <= vma->vm_end) { + if (a_start < *start) + *start = a_start; + if (a_end > *end) + *end = a_end; + + ret = true; + } + } + + return ret; +} + /* * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc() * and returns the corresponding pte. While this is not necessary for the diff --git a/mm/rmap.c b/mm/rmap.c index eb477809a5c0..8cf853a4b093 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, }
/* - * We have to assume the worse case ie pmd for invalidation. Note that - * the page can not be free in this function as call of try_to_unmap() - * must hold a reference on the page. + * For THP, we have to assume the worse case ie pmd for invalidation. + * For hugetlb, it could be much worse if we need to do pud + * invalidation in the case of pmd sharing. + * + * Note that the page can not be free in this function as call of + * try_to_unmap() must hold a reference on the page. */ end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page))); + if (PageHuge(page)) { + /* + * If sharing is possible, start and end will be adjusted + * accordingly. + */ + (void)huge_pmd_sharing_possible(vma, &start, &end); + } mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
while (page_vma_mapped_walk(&pvmw)) { @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); address = pvmw.address;
+ if (PageHuge(page)) { + if (huge_pmd_unshare(mm, &address, pvmw.pte)) { + /* + * huge_pmd_unshare unmapped an entire PMD + * page. There is no way of knowing exactly + * which PMDs may be cached for this mm, so + * we must flush them all. start/end were + * already adjusted above to cover this range. + */ + flush_cache_range(vma, start, end); + flush_tlb_range(vma, start, end); + mmu_notifier_invalidate_range(mm, start, end); + + /* + * The ref count of the PMD page was dropped + * which is part of the way map counting + * is done for shared PMDs. Return 'true' + * here. When there is no other sharing, + * huge_pmd_unshare returns false and we will + * unmap the actual page and drop map count + * to zero. + */ + page_vma_mapped_walk_done(&pvmw); + break; + } + }
if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
Hi Mike,
I love your patch! Yet something to improve:
[auto build test ERROR on linus/master] [also build test ERROR on v4.18 next-20180821] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Mike-Kravetz/huge_pmd_unshare-migra... config: i386-tinyconfig (attached as .config) compiler: gcc-7 (Debian 7.3.0-16) 7.3.0 reproduce: # save the attached .config to linux build tree make ARCH=i386
All errors (new ones prefixed by >>):
arch/x86/mm/fault.o: In function `huge_pmd_sharing_possible':
fault.c:(.text+0xa06): multiple definition of `huge_pmd_sharing_possible'
arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here arch/x86/mm/pgtable.o: In function `huge_pmd_sharing_possible': pgtable.c:(.text+0x4): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here kernel/fork.o: In function `huge_pmd_sharing_possible': fork.c:(.text+0x309): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here kernel/sysctl.o: In function `huge_pmd_sharing_possible': sysctl.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here kernel/sched/core.o: In function `huge_pmd_sharing_possible': core.c:(.text+0x299): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here kernel/sched/loadavg.o: In function `huge_pmd_sharing_possible': loadavg.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here kernel/sched/clock.o: In function `huge_pmd_sharing_possible': clock.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here kernel/sched/cputime.o: In function `huge_pmd_sharing_possible': cputime.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here kernel/sched/idle.o: In function `huge_pmd_sharing_possible': idle.c:(.text+0x36): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here kernel/sched/fair.o: In function `huge_pmd_sharing_possible': fair.c:(.text+0x864): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here kernel/sched/rt.o: In function `huge_pmd_sharing_possible': rt.c:(.text+0x72b): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here kernel/sched/deadline.o: In function `huge_pmd_sharing_possible': deadline.c:(.text+0xac7): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here kernel/sched/wait.o: In function `huge_pmd_sharing_possible': wait.c:(.text+0x16e): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here kernel/sched/wait_bit.o: In function `huge_pmd_sharing_possible': wait_bit.c:(.text+0x7b): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here kernel/sched/swait.o: In function `huge_pmd_sharing_possible': swait.c:(.text+0x4): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here kernel/sched/completion.o: In function `huge_pmd_sharing_possible': completion.c:(.text+0x4): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/filemap.o: In function `huge_pmd_sharing_possible': filemap.c:(.text+0x3ca): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/page_alloc.o: In function `huge_pmd_sharing_possible': page_alloc.c:(.text+0xa95): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/swap.o: In function `huge_pmd_sharing_possible': swap.c:(.text+0x551): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/vmscan.o: In function `huge_pmd_sharing_possible': vmscan.c:(.text+0x5bb): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/shmem.o: In function `huge_pmd_sharing_possible': shmem.c:(.text+0x6d): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/util.o: In function `huge_pmd_sharing_possible': util.c:(.text+0xc): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/compaction.o: In function `huge_pmd_sharing_possible': compaction.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/debug.o: In function `huge_pmd_sharing_possible': debug.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/gup.o: In function `huge_pmd_sharing_possible': gup.c:(.text+0x17c): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/memory.o: In function `huge_pmd_sharing_possible': memory.c:(.text+0x5f9): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/mincore.o: In function `huge_pmd_sharing_possible': mincore.c:(.text+0x150): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/mlock.o: In function `huge_pmd_sharing_possible': mlock.c:(.text+0x245): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/mmap.o: In function `huge_pmd_sharing_possible': mmap.c:(.text+0x565): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/mprotect.o: In function `huge_pmd_sharing_possible': mprotect.c:(.text+0x39): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/mremap.o: In function `huge_pmd_sharing_possible': mremap.c:(.text+0xf2): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/page_vma_mapped.o: In function `huge_pmd_sharing_possible': page_vma_mapped.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/pagewalk.o: In function `huge_pmd_sharing_possible': pagewalk.c:(.text+0x13d): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here mm/rmap.o: In function `huge_pmd_sharing_possible': rmap.c:(.text+0x3bb): multiple definition of `huge_pmd_sharing_possible' arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
--- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
On 08/21/2018 03:03 PM, kbuild test robot wrote:
Hi Mike,
I love your patch! Yet something to improve:
[auto build test ERROR on linus/master] [also build test ERROR on v4.18 next-20180821] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Mike-Kravetz/huge_pmd_unshare-migra... config: i386-tinyconfig (attached as .config) compiler: gcc-7 (Debian 7.3.0-16) 7.3.0 reproduce: # save the attached .config to linux build tree make ARCH=i386
Oops, simple build fix addressed in updated patch below.
From 57b7d63d88f65b1c385a5f15c7cd5fa07513e955 Mon Sep 17 00:00:00 2001
From: Mike Kravetz mike.kravetz@oracle.com Date: Tue, 21 Aug 2018 10:50:17 -0700 Subject: [PATCH v4 1/2] mm: migration: fix migration of huge PMD shared pages
The page migration code employs try_to_unmap() to try and unmap the source page. This is accomplished by using rmap_walk to find all vmas where the page is mapped. This search stops when page mapcount is zero. For shared PMD huge pages, the page map count is always 1 no matter the number of mappings. Shared mappings are tracked via the reference count of the PMD page. Therefore, try_to_unmap stops prematurely and does not completely unmap all mappings of the source page.
This problem can result is data corruption as writes to the original source page can happen after contents of the page are copied to the target page. Hence, data is lost.
This problem was originally seen as DB corruption of shared global areas after a huge page was soft offlined due to ECC memory errors. DB developers noticed they could reproduce the issue by (hotplug) offlining memory used to back huge pages. A simple testcase can reproduce the problem by creating a shared PMD mapping (note that this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using migrate_pages() to migrate process pages between nodes while continually writing to the huge pages being migrated.
To fix, have the try_to_unmap_one routine check for huge PMD sharing by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a shared mapping it will be 'unshared' which removes the page table entry and drops the reference on the PMD page. After this, flush caches and TLB.
mmu notifiers are called before locking page tables, but we can not be sure of PMD sharing until page tables are locked. Therefore, check for the possibility of PMD sharing before locking so that notifiers can prepare for the worst possible case.
Fixes: 39dde65c9940 ("shared page table for hugetlb page") Cc: stable@vger.kernel.org Signed-off-by: Mike Kravetz mike.kravetz@oracle.com --- include/linux/hugetlb.h | 14 ++++++++++++++ mm/hugetlb.c | 40 +++++++++++++++++++++++++++++++++++++++ mm/rmap.c | 42 ++++++++++++++++++++++++++++++++++++++--- 3 files changed, 93 insertions(+), 3 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 36fa6a2a82e3..840b43a085cc 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -140,6 +140,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, unsigned long sz); int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep); +bool huge_pmd_sharing_possible(struct vm_area_struct *vma, + unsigned long *start, unsigned long *end); struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address, int write); struct page *follow_huge_pd(struct vm_area_struct *vma, @@ -170,6 +172,18 @@ static inline unsigned long hugetlb_total_pages(void) return 0; }
+static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, + pte_t *ptep) +{ + return 0; +} + +static inline bool huge_pmd_sharing_possible(struct vm_area_struct *vma, + unsigned long *start, unsigned long *end) +{ + return false; +} + #define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n) ({ BUG(); 0; }) #define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL) #define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; }) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 3103099f64fd..fd155dc52117 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4555,6 +4555,9 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
/* * check on proper vm_flags and page table alignment + * + * Note that this is the same check used in huge_pmd_sharing_possible. + * If you change one, consider changing both. */ if (vma->vm_flags & VM_MAYSHARE && vma->vm_start <= base && end <= vma->vm_end) @@ -4562,6 +4565,43 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr) return false; }
+/* + * Determine if start,end range within vma could be mapped by shared pmd. + * If yes, adjust start and end to cover range associated with possible + * shared pmd mappings. + */ +bool huge_pmd_sharing_possible(struct vm_area_struct *vma, + unsigned long *start, unsigned long *end) +{ + unsigned long check_addr = *start; + bool ret = false; + + if (!(vma->vm_flags & VM_MAYSHARE)) + return ret; + + for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) { + unsigned long a_start = check_addr & PUD_MASK; + unsigned long a_end = a_start + PUD_SIZE; + + /* + * If sharing is possible, adjust start/end if necessary. + * + * Note that this is the same check used in vma_shareable. If + * you change one, consider changing both. + */ + if (vma->vm_start <= a_start && a_end <= vma->vm_end) { + if (a_start < *start) + *start = a_start; + if (a_end > *end) + *end = a_end; + + ret = true; + } + } + + return ret; +} + /* * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc() * and returns the corresponding pte. While this is not necessary for the diff --git a/mm/rmap.c b/mm/rmap.c index eb477809a5c0..8cf853a4b093 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, }
/* - * We have to assume the worse case ie pmd for invalidation. Note that - * the page can not be free in this function as call of try_to_unmap() - * must hold a reference on the page. + * For THP, we have to assume the worse case ie pmd for invalidation. + * For hugetlb, it could be much worse if we need to do pud + * invalidation in the case of pmd sharing. + * + * Note that the page can not be free in this function as call of + * try_to_unmap() must hold a reference on the page. */ end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page))); + if (PageHuge(page)) { + /* + * If sharing is possible, start and end will be adjusted + * accordingly. + */ + (void)huge_pmd_sharing_possible(vma, &start, &end); + } mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
while (page_vma_mapped_walk(&pvmw)) { @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); address = pvmw.address;
+ if (PageHuge(page)) { + if (huge_pmd_unshare(mm, &address, pvmw.pte)) { + /* + * huge_pmd_unshare unmapped an entire PMD + * page. There is no way of knowing exactly + * which PMDs may be cached for this mm, so + * we must flush them all. start/end were + * already adjusted above to cover this range. + */ + flush_cache_range(vma, start, end); + flush_tlb_range(vma, start, end); + mmu_notifier_invalidate_range(mm, start, end); + + /* + * The ref count of the PMD page was dropped + * which is part of the way map counting + * is done for shared PMDs. Return 'true' + * here. When there is no other sharing, + * huge_pmd_unshare returns false and we will + * unmap the actual page and drop map count + * to zero. + */ + page_vma_mapped_walk_done(&pvmw); + break; + } + }
if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
Hi Mike,
I love your patch! Yet something to improve:
[auto build test ERROR on linus/master] [also build test ERROR on v4.18 next-20180821] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Mike-Kravetz/huge_pmd_unshare-migra... config: sparc64-allyesconfig (attached as .config) compiler: sparc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree GCC_VERSION=7.2.0 make.cross ARCH=sparc64
All errors (new ones prefixed by >>):
mm/rmap.o: In function `try_to_unmap_one':
rmap.c:(.text+0x2708): undefined reference to `huge_pmd_sharing_possible'
`.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o `.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o `.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o `.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o
--- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
On 08/21/2018 05:51 PM, kbuild test robot wrote:
Hi Mike,
I love your patch! Yet something to improve:
[auto build test ERROR on linus/master] [also build test ERROR on v4.18 next-20180821] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Mike-Kravetz/huge_pmd_unshare-migra... config: sparc64-allyesconfig (attached as .config) compiler: sparc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree GCC_VERSION=7.2.0 make.cross ARCH=sparc64
Ok, this should take care of all the build errors. Needed to address !CONFIG_HUGETLB_PAGE and !CONFIG_ARCH_WANT_HUGE_PMD_SHARE.
From 2868cea3ff33fccd03da9188dde2ae5b619bf300 Mon Sep 17 00:00:00 2001
From: Mike Kravetz mike.kravetz@oracle.com Date: Tue, 21 Aug 2018 10:50:17 -0700 Subject: [PATCH v5 1/2] mm: migration: fix migration of huge PMD shared pages
The page migration code employs try_to_unmap() to try and unmap the source page. This is accomplished by using rmap_walk to find all vmas where the page is mapped. This search stops when page mapcount is zero. For shared PMD huge pages, the page map count is always 1 no matter the number of mappings. Shared mappings are tracked via the reference count of the PMD page. Therefore, try_to_unmap stops prematurely and does not completely unmap all mappings of the source page.
This problem can result is data corruption as writes to the original source page can happen after contents of the page are copied to the target page. Hence, data is lost.
This problem was originally seen as DB corruption of shared global areas after a huge page was soft offlined due to ECC memory errors. DB developers noticed they could reproduce the issue by (hotplug) offlining memory used to back huge pages. A simple testcase can reproduce the problem by creating a shared PMD mapping (note that this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using migrate_pages() to migrate process pages between nodes while continually writing to the huge pages being migrated.
To fix, have the try_to_unmap_one routine check for huge PMD sharing by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a shared mapping it will be 'unshared' which removes the page table entry and drops the reference on the PMD page. After this, flush caches and TLB.
mmu notifiers are called before locking page tables, but we can not be sure of PMD sharing until page tables are locked. Therefore, check for the possibility of PMD sharing before locking so that notifiers can prepare for the worst possible case.
Fixes: 39dde65c9940 ("shared page table for hugetlb page") Cc: stable@vger.kernel.org Signed-off-by: Mike Kravetz mike.kravetz@oracle.com --- include/linux/hugetlb.h | 14 +++++++++++++ mm/hugetlb.c | 46 +++++++++++++++++++++++++++++++++++++++++ mm/rmap.c | 42 ++++++++++++++++++++++++++++++++++--- 3 files changed, 99 insertions(+), 3 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 36fa6a2a82e3..840b43a085cc 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -140,6 +140,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, unsigned long sz); int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep); +bool huge_pmd_sharing_possible(struct vm_area_struct *vma, + unsigned long *start, unsigned long *end); struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address, int write); struct page *follow_huge_pd(struct vm_area_struct *vma, @@ -170,6 +172,18 @@ static inline unsigned long hugetlb_total_pages(void) return 0; }
+static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, + pte_t *ptep) +{ + return 0; +} + +static inline bool huge_pmd_sharing_possible(struct vm_area_struct *vma, + unsigned long *start, unsigned long *end) +{ + return false; +} + #define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n) ({ BUG(); 0; }) #define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL) #define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; }) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 3103099f64fd..f085019a4724 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4555,6 +4555,9 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
/* * check on proper vm_flags and page table alignment + * + * Note that this is the same check used in huge_pmd_sharing_possible. + * If you change one, consider changing both. */ if (vma->vm_flags & VM_MAYSHARE && vma->vm_start <= base && end <= vma->vm_end) @@ -4562,6 +4565,43 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr) return false; }
+/* + * Determine if start,end range within vma could be mapped by shared pmd. + * If yes, adjust start and end to cover range associated with possible + * shared pmd mappings. + */ +bool huge_pmd_sharing_possible(struct vm_area_struct *vma, + unsigned long *start, unsigned long *end) +{ + unsigned long check_addr = *start; + bool ret = false; + + if (!(vma->vm_flags & VM_MAYSHARE)) + return ret; + + for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) { + unsigned long a_start = check_addr & PUD_MASK; + unsigned long a_end = a_start + PUD_SIZE; + + /* + * If sharing is possible, adjust start/end if necessary. + * + * Note that this is the same check used in vma_shareable. If + * you change one, consider changing both. + */ + if (vma->vm_start <= a_start && a_end <= vma->vm_end) { + if (a_start < *start) + *start = a_start; + if (a_end > *end) + *end = a_end; + + ret = true; + } + } + + return ret; +} + /* * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc() * and returns the corresponding pte. While this is not necessary for the @@ -4659,6 +4699,12 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep) { return 0; } + +bool huge_pmd_sharing_possible(struct vm_area_struct *vma, + unsigned long *start, unsigned long *end) +{ + return false; +} #define want_pmd_share() (0) #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
diff --git a/mm/rmap.c b/mm/rmap.c index eb477809a5c0..8cf853a4b093 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, }
/* - * We have to assume the worse case ie pmd for invalidation. Note that - * the page can not be free in this function as call of try_to_unmap() - * must hold a reference on the page. + * For THP, we have to assume the worse case ie pmd for invalidation. + * For hugetlb, it could be much worse if we need to do pud + * invalidation in the case of pmd sharing. + * + * Note that the page can not be free in this function as call of + * try_to_unmap() must hold a reference on the page. */ end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page))); + if (PageHuge(page)) { + /* + * If sharing is possible, start and end will be adjusted + * accordingly. + */ + (void)huge_pmd_sharing_possible(vma, &start, &end); + } mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
while (page_vma_mapped_walk(&pvmw)) { @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); address = pvmw.address;
+ if (PageHuge(page)) { + if (huge_pmd_unshare(mm, &address, pvmw.pte)) { + /* + * huge_pmd_unshare unmapped an entire PMD + * page. There is no way of knowing exactly + * which PMDs may be cached for this mm, so + * we must flush them all. start/end were + * already adjusted above to cover this range. + */ + flush_cache_range(vma, start, end); + flush_tlb_range(vma, start, end); + mmu_notifier_invalidate_range(mm, start, end); + + /* + * The ref count of the PMD page was dropped + * which is part of the way map counting + * is done for shared PMDs. Return 'true' + * here. When there is no other sharing, + * huge_pmd_unshare returns false and we will + * unmap the actual page and drop map count + * to zero. + */ + page_vma_mapped_walk_done(&pvmw); + break; + } + }
if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
On Tue 21-08-18 18:10:42, Mike Kravetz wrote: [...]
diff --git a/mm/rmap.c b/mm/rmap.c index eb477809a5c0..8cf853a4b093 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } /*
* We have to assume the worse case ie pmd for invalidation. Note that
* the page can not be free in this function as call of try_to_unmap()
* must hold a reference on the page.
* For THP, we have to assume the worse case ie pmd for invalidation.
* For hugetlb, it could be much worse if we need to do pud
* invalidation in the case of pmd sharing.
*
* Note that the page can not be free in this function as call of
*/ end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));* try_to_unmap() must hold a reference on the page.
- if (PageHuge(page)) {
/*
* If sharing is possible, start and end will be adjusted
* accordingly.
*/
(void)huge_pmd_sharing_possible(vma, &start, &end);
- } mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
I do not get this part. Why don't we simply unconditionally invalidate the whole huge page range?
while (page_vma_mapped_walk(&pvmw)) { @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); address = pvmw.address;
if (PageHuge(page)) {
if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
huge_pmd_unshare is documented to require a pte lock. Where do we take it?
On 08/22/2018 05:28 AM, Michal Hocko wrote:
On Tue 21-08-18 18:10:42, Mike Kravetz wrote: [...]
diff --git a/mm/rmap.c b/mm/rmap.c index eb477809a5c0..8cf853a4b093 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } /*
* We have to assume the worse case ie pmd for invalidation. Note that
* the page can not be free in this function as call of try_to_unmap()
* must hold a reference on the page.
* For THP, we have to assume the worse case ie pmd for invalidation.
* For hugetlb, it could be much worse if we need to do pud
* invalidation in the case of pmd sharing.
*
* Note that the page can not be free in this function as call of
*/ end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));* try_to_unmap() must hold a reference on the page.
- if (PageHuge(page)) {
/*
* If sharing is possible, start and end will be adjusted
* accordingly.
*/
(void)huge_pmd_sharing_possible(vma, &start, &end);
- } mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
I do not get this part. Why don't we simply unconditionally invalidate the whole huge page range?
In this routine, we are only unmapping a single page. The existing code is limiting the invalidate range to that page size: 4K or 2M. With shared PMDs, we have the possibility of unmapping a PUD_SIZE area: 1G. I don't think we want to unconditionally invalidate 1G. Is that what you are asking?
I do not know how often PMD sharing is exercised. It certainly is used by DBs for large shared areas. I suspect it is less frequent than hugtlb pages in general, and certainly less frequent than THP or base pages.
while (page_vma_mapped_walk(&pvmw)) { @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); address = pvmw.address;
if (PageHuge(page)) {
if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
huge_pmd_unshare is documented to require a pte lock. Where do we take it?
It is somewhat hidden, but we are in the loop:
while (page_vma_mapped_walk(&pvmw)) {
The routine page_vma_mapped_walk will acquire the lock, and it correctly checks for huge pages and uses huge_pte_lockptr().
page_vma_mapped_walk_done() will release the lock.
On Wed 22-08-18 09:48:16, Mike Kravetz wrote:
On 08/22/2018 05:28 AM, Michal Hocko wrote:
On Tue 21-08-18 18:10:42, Mike Kravetz wrote: [...]
diff --git a/mm/rmap.c b/mm/rmap.c index eb477809a5c0..8cf853a4b093 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } /*
* We have to assume the worse case ie pmd for invalidation. Note that
* the page can not be free in this function as call of try_to_unmap()
* must hold a reference on the page.
* For THP, we have to assume the worse case ie pmd for invalidation.
* For hugetlb, it could be much worse if we need to do pud
* invalidation in the case of pmd sharing.
*
* Note that the page can not be free in this function as call of
*/ end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));* try_to_unmap() must hold a reference on the page.
- if (PageHuge(page)) {
/*
* If sharing is possible, start and end will be adjusted
* accordingly.
*/
(void)huge_pmd_sharing_possible(vma, &start, &end);
- } mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
I do not get this part. Why don't we simply unconditionally invalidate the whole huge page range?
In this routine, we are only unmapping a single page. The existing code is limiting the invalidate range to that page size: 4K or 2M. With shared PMDs, we have the possibility of unmapping a PUD_SIZE area: 1G. I don't think we want to unconditionally invalidate 1G. Is that what you are asking?
But we know that huge_pmd_unshare unmapped a shared pte so we know when to flush 2MB or 1GB. I really do not like how huge_pmd_sharing_possible a) duplicates some checks and b) it updates start/stop out of line.
I do not know how often PMD sharing is exercised. It certainly is used by DBs for large shared areas. I suspect it is less frequent than hugtlb pages in general, and certainly less frequent than THP or base pages.
while (page_vma_mapped_walk(&pvmw)) { @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); address = pvmw.address;
if (PageHuge(page)) {
if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
huge_pmd_unshare is documented to require a pte lock. Where do we take it?
It is somewhat hidden, but we are in the loop:
while (page_vma_mapped_walk(&pvmw)) {
The routine page_vma_mapped_walk will acquire the lock, and it correctly checks for huge pages and uses huge_pte_lockptr().
page_vma_mapped_walk_done() will release the lock.
OK, I can see it now. Thanks for the clarification. page_vma_mapped_walk is quite hard to follow.
On Thu, Aug 23, 2018 at 09:30:35AM +0200, Michal Hocko wrote:
On Wed 22-08-18 09:48:16, Mike Kravetz wrote:
On 08/22/2018 05:28 AM, Michal Hocko wrote:
On Tue 21-08-18 18:10:42, Mike Kravetz wrote: [...]
diff --git a/mm/rmap.c b/mm/rmap.c index eb477809a5c0..8cf853a4b093 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } /*
* We have to assume the worse case ie pmd for invalidation. Note that
* the page can not be free in this function as call of try_to_unmap()
* must hold a reference on the page.
* For THP, we have to assume the worse case ie pmd for invalidation.
* For hugetlb, it could be much worse if we need to do pud
* invalidation in the case of pmd sharing.
*
* Note that the page can not be free in this function as call of
*/ end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));* try_to_unmap() must hold a reference on the page.
- if (PageHuge(page)) {
/*
* If sharing is possible, start and end will be adjusted
* accordingly.
*/
(void)huge_pmd_sharing_possible(vma, &start, &end);
- } mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
I do not get this part. Why don't we simply unconditionally invalidate the whole huge page range?
In this routine, we are only unmapping a single page. The existing code is limiting the invalidate range to that page size: 4K or 2M. With shared PMDs, we have the possibility of unmapping a PUD_SIZE area: 1G. I don't think we want to unconditionally invalidate 1G. Is that what you are asking?
But we know that huge_pmd_unshare unmapped a shared pte so we know when to flush 2MB or 1GB. I really do not like how huge_pmd_sharing_possible a) duplicates some checks and b) it updates start/stop out of line.
My reading on this is that mmu_notifier_invalidate_range_start() has to be called from sleepable context on the full range that *can* be invalidated before following mmu_notifier_invalidate_range_end().
In this case huge_pmd_unshare() may unmap aligned PUD_SIZE around the PMD page that effectively enlarge range that has to be covered by mmu_notifier_invalidate_range_start(). We cannot yet know if there's any shared page tables in the range, so we need to go with worst case scenario.
I don't see conceptually better solution than what is proposed.
On Thu 23-08-18 11:21:12, Kirill A. Shutemov wrote:
On Thu, Aug 23, 2018 at 09:30:35AM +0200, Michal Hocko wrote:
On Wed 22-08-18 09:48:16, Mike Kravetz wrote:
On 08/22/2018 05:28 AM, Michal Hocko wrote:
On Tue 21-08-18 18:10:42, Mike Kravetz wrote: [...]
diff --git a/mm/rmap.c b/mm/rmap.c index eb477809a5c0..8cf853a4b093 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } /*
* We have to assume the worse case ie pmd for invalidation. Note that
* the page can not be free in this function as call of try_to_unmap()
* must hold a reference on the page.
* For THP, we have to assume the worse case ie pmd for invalidation.
* For hugetlb, it could be much worse if we need to do pud
* invalidation in the case of pmd sharing.
*
* Note that the page can not be free in this function as call of
*/ end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));* try_to_unmap() must hold a reference on the page.
- if (PageHuge(page)) {
/*
* If sharing is possible, start and end will be adjusted
* accordingly.
*/
(void)huge_pmd_sharing_possible(vma, &start, &end);
- } mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
I do not get this part. Why don't we simply unconditionally invalidate the whole huge page range?
In this routine, we are only unmapping a single page. The existing code is limiting the invalidate range to that page size: 4K or 2M. With shared PMDs, we have the possibility of unmapping a PUD_SIZE area: 1G. I don't think we want to unconditionally invalidate 1G. Is that what you are asking?
But we know that huge_pmd_unshare unmapped a shared pte so we know when to flush 2MB or 1GB. I really do not like how huge_pmd_sharing_possible a) duplicates some checks and b) it updates start/stop out of line.
My reading on this is that mmu_notifier_invalidate_range_start() has to be called from sleepable context on the full range that *can* be invalidated before following mmu_notifier_invalidate_range_end().
In this case huge_pmd_unshare() may unmap aligned PUD_SIZE around the PMD page that effectively enlarge range that has to be covered by mmu_notifier_invalidate_range_start(). We cannot yet know if there's any shared page tables in the range, so we need to go with worst case scenario.
I don't see conceptually better solution than what is proposed.
I was thinking we would just pull PageHuge outside of the page_vma_mapped_walk. I thought it would look much more straightforward but I've tried to put something together and it grown into an ugly code as well. So going the Mike's way might be a better option after all.
On 08/23/2018 01:21 AM, Kirill A. Shutemov wrote:
On Thu, Aug 23, 2018 at 09:30:35AM +0200, Michal Hocko wrote:
On Wed 22-08-18 09:48:16, Mike Kravetz wrote:
On 08/22/2018 05:28 AM, Michal Hocko wrote:
On Tue 21-08-18 18:10:42, Mike Kravetz wrote: [...]
diff --git a/mm/rmap.c b/mm/rmap.c index eb477809a5c0..8cf853a4b093 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } /*
* We have to assume the worse case ie pmd for invalidation. Note that
* the page can not be free in this function as call of try_to_unmap()
* must hold a reference on the page.
* For THP, we have to assume the worse case ie pmd for invalidation.
* For hugetlb, it could be much worse if we need to do pud
* invalidation in the case of pmd sharing.
*
* Note that the page can not be free in this function as call of
*/ end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));* try_to_unmap() must hold a reference on the page.
- if (PageHuge(page)) {
/*
* If sharing is possible, start and end will be adjusted
* accordingly.
*/
(void)huge_pmd_sharing_possible(vma, &start, &end);
- } mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
I do not get this part. Why don't we simply unconditionally invalidate the whole huge page range?
In this routine, we are only unmapping a single page. The existing code is limiting the invalidate range to that page size: 4K or 2M. With shared PMDs, we have the possibility of unmapping a PUD_SIZE area: 1G. I don't think we want to unconditionally invalidate 1G. Is that what you are asking?
But we know that huge_pmd_unshare unmapped a shared pte so we know when to flush 2MB or 1GB. I really do not like how huge_pmd_sharing_possible a) duplicates some checks and b) it updates start/stop out of line.
My reading on this is that mmu_notifier_invalidate_range_start() has to be called from sleepable context on the full range that *can* be invalidated before following mmu_notifier_invalidate_range_end().
In this case huge_pmd_unshare() may unmap aligned PUD_SIZE around the PMD page that effectively enlarge range that has to be covered by mmu_notifier_invalidate_range_start(). We cannot yet know if there's any shared page tables in the range, so we need to go with worst case scenario.
Yes, that is a good summary. We can not know for sure if there is PMD sharing until we hold the page table lock. So, we don't know if we should invalidate/flush 2M or 1G. Yet, we need to call mmu_notifier_invalidate_range_start before taking the lock. And, the notifiers need to know the range of the worst possible case. The best approach I came up with is to adjust the range if sharing is 'possible'.
I don't see conceptually better solution than what is proposed.
I have updated the patches based on Kirill's previous comments and will send out a new version later today.
On Tue, Aug 21, 2018 at 06:10:42PM -0700, Mike Kravetz wrote:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 3103099f64fd..f085019a4724 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4555,6 +4555,9 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr) /* * check on proper vm_flags and page table alignment
*
* Note that this is the same check used in huge_pmd_sharing_possible.
* If you change one, consider changing both.
Should we have helper to isolate the check in one place?
*/
if (vma->vm_flags & VM_MAYSHARE && vma->vm_start <= base && end <= vma->vm_end) @@ -4562,6 +4565,43 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr) return false; } +/*
- Determine if start,end range within vma could be mapped by shared pmd.
- If yes, adjust start and end to cover range associated with possible
- shared pmd mappings.
- */
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
unsigned long *start, unsigned long *end)
+{
- unsigned long check_addr = *start;
- bool ret = false;
- if (!(vma->vm_flags & VM_MAYSHARE))
return ret;
Do we ever use return value? I don't see it.
And in this case function name is not really work...
- for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
unsigned long a_start = check_addr & PUD_MASK;
unsigned long a_end = a_start + PUD_SIZE;
/*
* If sharing is possible, adjust start/end if necessary.
*
* Note that this is the same check used in vma_shareable. If
* you change one, consider changing both.
*/
if (vma->vm_start <= a_start && a_end <= vma->vm_end) {
if (a_start < *start)
*start = a_start;
if (a_end > *end)
*end = a_end;
ret = true;
}
- }
- return ret;
+}
On 08/22/2018 02:05 PM, Kirill A. Shutemov wrote:
On Tue, Aug 21, 2018 at 06:10:42PM -0700, Mike Kravetz wrote:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 3103099f64fd..f085019a4724 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4555,6 +4555,9 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr) /* * check on proper vm_flags and page table alignment
*
* Note that this is the same check used in huge_pmd_sharing_possible.
* If you change one, consider changing both.
Should we have helper to isolate the check in one place?
Yes, I will create one. Most likely just a #define.
*/
if (vma->vm_flags & VM_MAYSHARE && vma->vm_start <= base && end <= vma->vm_end) @@ -4562,6 +4565,43 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr) return false; } +/*
- Determine if start,end range within vma could be mapped by shared pmd.
- If yes, adjust start and end to cover range associated with possible
- shared pmd mappings.
- */
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
unsigned long *start, unsigned long *end)
+{
- unsigned long check_addr = *start;
- bool ret = false;
- if (!(vma->vm_flags & VM_MAYSHARE))
return ret;
Do we ever use return value? I don't see it.
And in this case function name is not really work...
You are correct. None of the code uses the return value. I initially thought some caller would use it. But every caller wants/needs to adjust the range if sharing is possible. This is a really long name but how about:
void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma, unsigned long *start, unsigned long *end)
I'm open to other names and will update patch with suggestions.
On Tue 21-08-18 18:10:42, Mike Kravetz wrote: [...]
OK, after burning myself when trying to be clever here it seems like your proposed solution is indeed simpler.
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
unsigned long *start, unsigned long *end)
+{
- unsigned long check_addr = *start;
- bool ret = false;
- if (!(vma->vm_flags & VM_MAYSHARE))
return ret;
- for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
unsigned long a_start = check_addr & PUD_MASK;
unsigned long a_end = a_start + PUD_SIZE;
I guess this should be rather in HPAGE_SIZE * PTRS_PER_PTE units as huge_pmd_unshare does.
On 08/23/2018 05:48 AM, Michal Hocko wrote:
On Tue 21-08-18 18:10:42, Mike Kravetz wrote: [...]
OK, after burning myself when trying to be clever here it seems like your proposed solution is indeed simpler.
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
unsigned long *start, unsigned long *end)
+{
- unsigned long check_addr = *start;
- bool ret = false;
- if (!(vma->vm_flags & VM_MAYSHARE))
return ret;
- for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
unsigned long a_start = check_addr & PUD_MASK;
unsigned long a_end = a_start + PUD_SIZE;
I guess this should be rather in HPAGE_SIZE * PTRS_PER_PTE units as huge_pmd_unshare does.
Sure, I can do that.
However, I consider the statement making that calculation in huge_pmd_unshare to be VERY ugly and confusing code.
*addr = ALIGN(*addr, HPAGE_SIZE * PTRS_PER_PTE) - HPAGE_SIZE;
Note that it is adjusting the value of passed argument 'unsigned long *addr'. This argument/value is part of a loop condition in all current callers of huge_pmd_unshare. For instance:
for (; address < end; address += huge_page_size(h)) {
So, that calculation in huge_pmd_unshare gets the calling loop back to the starting address of the unmapped range. It even takes the loop increment 'huge_page_size(h)' into account. That is why that ' - HPAGE_SIZE' is at the end of the calculation.
ugly and confusing! And on my list of things to clean up.
On 08/23/2018 10:01 AM, Mike Kravetz wrote:
On 08/23/2018 05:48 AM, Michal Hocko wrote:
On Tue 21-08-18 18:10:42, Mike Kravetz wrote: [...]
OK, after burning myself when trying to be clever here it seems like your proposed solution is indeed simpler.
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
unsigned long *start, unsigned long *end)
+{
- unsigned long check_addr = *start;
- bool ret = false;
- if (!(vma->vm_flags & VM_MAYSHARE))
return ret;
- for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
unsigned long a_start = check_addr & PUD_MASK;
unsigned long a_end = a_start + PUD_SIZE;
I guess this should be rather in HPAGE_SIZE * PTRS_PER_PTE units as huge_pmd_unshare does.
Sure, I can do that.
On second thought, this is more similar to vma_shareable() which uses PUD_MASK and PUD_SIZE. In fact Kirill asked me to put in a common helper for this and vma_shareable. So, I would prefer to leave it as PUD* unless you feel strongly.
IMO, it would make more sense to change this in huge_pmd_unshare as PMD sharing is pretty explicitly tied to PUD_SIZE. But, that is a future cleanup.
On Thu 23-08-18 10:56:30, Mike Kravetz wrote:
On 08/23/2018 10:01 AM, Mike Kravetz wrote:
On 08/23/2018 05:48 AM, Michal Hocko wrote:
On Tue 21-08-18 18:10:42, Mike Kravetz wrote: [...]
OK, after burning myself when trying to be clever here it seems like your proposed solution is indeed simpler.
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
unsigned long *start, unsigned long *end)
+{
- unsigned long check_addr = *start;
- bool ret = false;
- if (!(vma->vm_flags & VM_MAYSHARE))
return ret;
- for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
unsigned long a_start = check_addr & PUD_MASK;
unsigned long a_end = a_start + PUD_SIZE;
I guess this should be rather in HPAGE_SIZE * PTRS_PER_PTE units as huge_pmd_unshare does.
Sure, I can do that.
On second thought, this is more similar to vma_shareable() which uses PUD_MASK and PUD_SIZE. In fact Kirill asked me to put in a common helper for this and vma_shareable. So, I would prefer to leave it as PUD* unless you feel strongly.
I don't
IMO, it would make more sense to change this in huge_pmd_unshare as PMD sharing is pretty explicitly tied to PUD_SIZE. But, that is a future cleanup.
Fair enough. I didn't realize we are PUD explicit elsewhere. So if you plan to update huge_pmd_unshare to be in sync then no objections from me at all. I merely wanted to be in sync with this because it is a central point to look at wrt pmd sharing.
linux-stable-mirror@lists.linaro.org