While looking at BUGs associated with invalid huge page map counts, it was discovered that a huge pte pointer can become 'invalid' and point into another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls huge_pte_alloc to get a ptep. Suppose the returned ptep points to a shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it unmaps every task that has the file mapped. If the range being truncated is covered by a shared pmd, huge_pmd_unshare will be called. For all but the last user of the shared pmd, huge_pmd_unshare clears the pud pointing to the pmd. If the task in the middle of the page fault is not the last user, the ptep returned by huge_pte_alloc now points into another task's page table, or worse. This leads to bad things such as incorrect page map/reference counts or invalid memory references.
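Pictured as an interleaving (illustrative only; the task labels are not from any trace):

	task A (page fault)                     task B (truncate)
	-------------------                     -----------------
	ptep = huge_pte_alloc()
	  /* ptep points into a shared pmd */
	                                        huge_pmd_unshare()
	                                          /* A is not the last user:
	                                             A's pud is cleared */
	entry = huge_ptep_get(ptep)
	  /* reads another task's page table */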
To fix, expand the use of i_mmap_rwsem as follows (the resulting caller pattern is sketched below):
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
  huge_pmd_share is only called via huge_pte_alloc, so callers of
  huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
  of huge_pte_alloc continue to hold the semaphore until finished with
  the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
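In code form, the read-side rule works out to the following caller pattern (a minimal sketch mirroring the hugetlb_fault() hunk below; the fault mutex and the fault handling itself are elided):

	struct address_space *mapping = vma->vm_file->f_mapping;
	pte_t *ptep;

	/* hold i_mmap_rwsem (read) across huge_pte_alloc and all uses of ptep */
	i_mmap_lock_read(mapping);
	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
	if (!ptep) {
		i_mmap_unlock_read(mapping);
		return VM_FAULT_OOM;
	}
	/* ... handle the fault, dereferencing ptep as needed ... */
	i_mmap_unlock_read(mapping);	/* ptep must not be used past this point */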
Cc: stable@vger.kernel.org
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c        | 67 ++++++++++++++++++++++++++++++++++-----------
 mm/memory-failure.c | 14 +++++++++-
 mm/migrate.c        | 13 ++++++++-
 mm/rmap.c           |  4 +++
 mm/userfaultfd.c    | 11 ++++++--
 5 files changed, 89 insertions(+), 20 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 309fb8c969af..2a3162030167 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3239,6 +3239,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	int cow;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
+	struct address_space *mapping = vma->vm_file->f_mapping;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 	int ret = 0;
@@ -3247,14 +3248,25 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
-	if (cow)
+	if (cow) {
 		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
+	} else {
+		/*
+		 * For shared mappings i_mmap_rwsem must be held to call
+		 * huge_pte_alloc, otherwise the returned ptep could go
+		 * away if part of a shared pmd and another thread calls
+		 * huge_pmd_unshare.
+		 */
+		i_mmap_lock_read(mapping);
+	}
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
+
 		src_pte = huge_pte_offset(src, addr, sz);
 		if (!src_pte)
 			continue;
+
 		dst_pte = huge_pte_alloc(dst, addr, sz);
 		if (!dst_pte) {
 			ret = -ENOMEM;
@@ -3325,6 +3337,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 
 	if (cow)
 		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
+	else
+		i_mmap_unlock_read(mapping);
 
 	return ret;
 }
@@ -3772,14 +3786,18 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			};
 
 			/*
-			 * hugetlb_fault_mutex must be dropped before
-			 * handling userfault.  Reacquire after handling
-			 * fault to make calling code simpler.
+			 * hugetlb_fault_mutex and i_mmap_rwsem must be
+			 * dropped before handling userfault.  Reacquire
+			 * after handling fault to make calling code simpler.
 			 */
 			hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
 							idx, haddr);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			i_mmap_unlock_read(mapping);
+
 			ret = handle_userfault(&vmf, VM_UFFD_MISSING);
+
+			i_mmap_lock_read(mapping);
 			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 			goto out;
 		}
@@ -3927,6 +3945,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (ptep) {
+		/*
+		 * Since we hold no locks, ptep could be stale.  That is
+		 * OK as we are only making decisions based on content and
+		 * not actually modifying content here.
+		 */
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
 			migration_entry_wait_huge(vma, mm, ptep);
@@ -3934,20 +3957,31 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON_LARGE |
 				VM_FAULT_SET_HINDEX(hstate_index(h));
-	} else {
-		ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
-		if (!ptep)
-			return VM_FAULT_OOM;
 	}
 
+	/*
+	 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
+	 * until finished with ptep.  This prevents huge_pmd_unshare from
+	 * being called elsewhere and making the ptep no longer valid.
+	 *
+	 * ptep could have already been assigned via huge_pte_offset.  That
+	 * is OK, as huge_pte_alloc will return the same value unless
+	 * something changed.
+	 */
 	mapping = vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, vma, haddr);
+	i_mmap_lock_read(mapping);
+	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
+	if (!ptep) {
+		i_mmap_unlock_read(mapping);
+		return VM_FAULT_OOM;
+	}
 
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
+	idx = vma_hugecache_offset(h, vma, haddr);
 	hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, haddr);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
@@ -4035,6 +4069,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 out_mutex:
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+	i_mmap_unlock_read(mapping);
 	/*
 	 * Generally it's safe to hold refcount during waiting page lock. But
 	 * here we just wait to defer the next page fault to avoid busy loop and
@@ -4639,10 +4674,12 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
  * !shared pmd case because we can allocate the pmd later as well, it makes the
- * code much cleaner. pmd allocation is essential for the shared case because
- * pud has to be populated inside the same i_mmap_rwsem section - otherwise
- * racing tasks could either miss the sharing (see huge_pte_offset) or select a
- * bad pmd for sharing.
+ * code much cleaner.
+ *
+ * This routine must be called with i_mmap_rwsem held in at least read mode.
+ * For hugetlbfs, this prevents removal of any page table entries associated
+ * with the address space.  This is important as we are setting up sharing
+ * based on existing page table entries (mappings).
  */
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 {
@@ -4659,7 +4696,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
 
-	i_mmap_lock_write(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -4689,7 +4725,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
-	i_mmap_unlock_write(mapping);
 	return pte;
 }
 
@@ -4700,7 +4735,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
  * indicated by page_count > 1, unmap is achieved by clearing pud and
  * decrementing the ref count. If count == 1, the pte page is not shared.
  *
- * called with page table lock held.
+ * Called with page table lock held and i_mmap_rwsem held in write mode.
  *
  * returns: 1 successfully unmapped a shared pte page
  *	    0 the underlying pte page is not shared, or it is the last user
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0cd3de3550f0..b992d1295578 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1028,7 +1028,19 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	if (kill)
 		collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
 
-	unmap_success = try_to_unmap(hpage, ttu);
+	if (!PageHuge(hpage)) {
+		unmap_success = try_to_unmap(hpage, ttu);
+	} else {
+		/*
+		 * For hugetlb pages, try_to_unmap could potentially call
+		 * huge_pmd_unshare.  Because of this, take semaphore in
+		 * write mode here and set TTU_RMAP_LOCKED to indicate we
+		 * have taken the lock at this higher level.
+		 */
+		i_mmap_lock_write(mapping);
+		unmap_success = try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
+		i_mmap_unlock_write(mapping);
+	}
 	if (!unmap_success)
 		pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
 		       pfn, page_mapcount(hpage));
diff --git a/mm/migrate.c b/mm/migrate.c
index 84381b55b2bd..725edaef238a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1307,8 +1307,19 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 		goto put_anon;
 
 	if (page_mapped(hpage)) {
+		struct address_space *mapping = page_mapping(hpage);
+
+		/*
+		 * try_to_unmap could potentially call huge_pmd_unshare.
+		 * Because of this, take semaphore in write mode here and
+		 * set TTU_RMAP_LOCKED to let lower levels know we have
+		 * taken the lock.
+		 */
+		i_mmap_lock_write(mapping);
 		try_to_unmap(hpage,
-			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
+			TTU_RMAP_LOCKED);
+		i_mmap_unlock_write(mapping);
 		page_was_mapped = 1;
 	}
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 85b7f9423352..c566bd552535 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -25,6 +25,7 @@
  *       page->flags PG_locked (lock_page)
  *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
  *           mapping->i_mmap_rwsem
+ *             hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
  *             anon_vma->rwsem
  *               mm->page_table_lock or pte_lock
  *                 zone_lru_lock (in mark_page_accessed, isolate_lru_page)
@@ -1374,6 +1375,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		/*
 		 * If sharing is possible, start and end will be adjusted
 		 * accordingly.
+		 *
+		 * If called for a huge page, caller must hold i_mmap_rwsem
+		 * in write mode as it is possible to call huge_pmd_unshare.
 		 */
 		adjust_range_if_pmd_sharing_possible(vma, &start, &end);
 	}
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 458acda96f20..48368589f519 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -267,10 +267,14 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		VM_BUG_ON(dst_addr & ~huge_page_mask(h));
 
 		/*
-		 * Serialize via hugetlb_fault_mutex
+		 * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
+		 * i_mmap_rwsem ensures the dst_pte remains valid even
+		 * in the case of shared pmds.  fault mutex prevents
+		 * races with other faulting threads.
 		 */
-		idx = linear_page_index(dst_vma, dst_addr);
 		mapping = dst_vma->vm_file->f_mapping;
+		i_mmap_lock_read(mapping);
+		idx = linear_page_index(dst_vma, dst_addr);
 		hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
 						idx, dst_addr);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -279,6 +283,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h));
 		if (!dst_pte) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			i_mmap_unlock_read(mapping);
 			goto out_unlock;
 		}
@@ -286,6 +291,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		dst_pteval = huge_ptep_get(dst_pte);
 		if (!huge_pte_none(dst_pteval)) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			i_mmap_unlock_read(mapping);
 			goto out_unlock;
 		}
@@ -293,6 +299,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 						dst_addr, src_addr, &page);
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+		i_mmap_unlock_read(mapping);
 		vm_alloc_shared = vm_shared;
 
 		cond_resched();
Greeting,
FYI, we noticed a -4.3% regression of vm-scalability.throughput due to commit:
commit: 9c83282117778856d647ffc461c4aede2abb6742 ("[PATCH v3 1/2] hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
url: https://github.com/0day-ci/linux/commits/Mike-Kravetz/hugetlbfs-use-i_mmap_r...

in testcase: vm-scalability
on test machine: 104 threads Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz with 64G memory
with following parameters:

	runtime: 300s
	size: 8T
	test: anon-cow-seq-hugetlb
	cpufreq_governor: performance
	ucode: 0x200004d

test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/

Details are as below:
-------------------------------------------------------------------------------------------------->
To reproduce:
        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        bin/lkp install job.yaml  # job file is attached in this email
        bin/lkp run     job.yaml
=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase/ucode:
  gcc-7/performance/x86_64-rhel-7.2/debian-x86_64-2018-04-03.cgz/300s/8T/lkp-skl-2sp4/anon-cow-seq-hugetlb/vm-scalability/0x200004d

commit:
  0cd60eb1a7 ("dma-mapping: fix flags in dma_alloc_wc")
  9c83282117 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
0cd60eb1a7b5421e 9c83282117778856d647ffc461
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
    184494           -10.7%     164684        vm-scalability.median
  20393229            -4.3%   19523319        vm-scalability.throughput
     37986 ±  2%      -4.3%      36341 ±  2%  vm-scalability.time.involuntary_context_switches
   3670375            -1.0%    3635385        vm-scalability.time.minor_page_faults
      5808            -9.9%       5236        vm-scalability.time.percent_of_cpu_this_job_got
     10665            -6.4%       9980        vm-scalability.time.system_time
      6873           -15.2%       5829        vm-scalability.time.user_time
   1561119           +42.4%    2222959        vm-scalability.time.voluntary_context_switches
    304034 ± 10%     -15.5%     256985 ±  7%  meminfo.DirectMap4k
   2455420           +17.5%    2884045        softirqs.SCHED
     15179 ± 57%     -77.2%       3468 ±167%  numa-numastat.node0.other_node
      5069 ±171%    +231.5%      16803 ± 34%  numa-numastat.node1.other_node
     58.25           -14.6%      49.75        vmstat.procs.r
     13194           +33.3%      17592        vmstat.system.cs
     30.81            +4.7       35.50        mpstat.cpu.idle%
      0.00 ± 39%      +0.0        0.00 ± 19%  mpstat.cpu.soft%
     22.13            -3.4       18.73        mpstat.cpu.usr%
      1608            -9.5%       1454        turbostat.Avg_MHz
     57.68            -5.5       52.16        turbostat.Busy%
     42.17           +12.7%      47.54        turbostat.CPU%c1
      1896 ± 10%     -13.5%       1639 ± 12%  slabinfo.UNIX.active_objs
      1896 ± 10%     -13.5%       1639 ± 12%  slabinfo.UNIX.num_objs
    512.00 ±  8%     +18.8%     608.00 ±  5%  slabinfo.ebitmap_node.active_objs
    512.00 ±  8%     +18.8%     608.00 ±  5%  slabinfo.ebitmap_node.num_objs
    832.00 ± 13%     +23.1%       1024 ± 10%  slabinfo.scsi_sense_cache.active_objs
    832.00 ± 13%     +23.1%       1024 ± 10%  slabinfo.scsi_sense_cache.num_objs
   1309088            -1.8%    1285325        proc-vmstat.nr_dirty_background_threshold
   2621507            -1.8%    2573971        proc-vmstat.nr_dirty_threshold
  13199577            -1.8%   12961837        proc-vmstat.nr_free_pages
      1742            +1.8%       1774        proc-vmstat.nr_page_table_pages
     22375            -2.8%      21752        proc-vmstat.nr_shmem
      1259 ± 37%     +61.5%       2033 ± 19%  proc-vmstat.numa_huge_pte_updates
    681268 ± 35%     +59.1%    1084220 ± 19%  proc-vmstat.numa_pte_updates
     13983            -8.3%      12823 ±  4%  proc-vmstat.pgactivate
      0.05            +0.0        0.05        perf-stat.branch-miss-rate%
 2.109e+09            +4.3%    2.2e+09        perf-stat.branch-misses
     78.76            -1.9       76.88        perf-stat.cache-miss-rate%
 1.113e+11            -2.9%  1.081e+11        perf-stat.cache-misses
   3996996           +33.6%    5341757        perf-stat.context-switches
      3.37            -9.0%       3.07        perf-stat.cpi
 4.944e+13            -9.6%  4.471e+13        perf-stat.cpu-cycles
    211278            +5.0%     221866        perf-stat.cpu-migrations
      0.00 ±  7%      +0.0        0.00 ±  5%  perf-stat.dTLB-load-miss-rate%
  49679544 ±  7%     +17.5%   58377845 ±  4%  perf-stat.dTLB-load-misses
      0.00 ±  4%      +0.0        0.00 ±  2%  perf-stat.dTLB-store-miss-rate%
  15180335 ±  4%     +14.0%   17307062 ±  2%  perf-stat.dTLB-store-misses
     10.83 ±  3%      -1.8        9.08 ±  3%  perf-stat.iTLB-load-miss-rate%
  44270724 ±  3%      -8.4%   40569884 ±  2%  perf-stat.iTLB-load-misses
 3.644e+08           +11.5%  4.065e+08        perf-stat.iTLB-loads
    331624 ±  3%      +8.4%     359414 ±  2%  perf-stat.instructions-per-iTLB-miss
      0.30            +9.9%       0.33        perf-stat.ipc
     51.92            +1.8       53.74        perf-stat.node-load-miss-rate%
  1.48e+10            -6.0%  1.391e+10        perf-stat.node-loads
 1.497e+10            -6.9%  1.394e+10        perf-stat.node-stores
     10272 ± 14%     -19.0%       8323 ± 13%  sched_debug.cfs_rq:/.load.avg
   7232660 ±  9%     -20.1%    5782120 ± 10%  sched_debug.cfs_rq:/.min_vruntime.max
      0.52 ±  5%     -18.9%       0.43 ±  5%  sched_debug.cfs_rq:/.nr_running.avg
      1.67 ± 10%     -33.1%       1.12 ± 15%  sched_debug.cfs_rq:/.nr_spread_over.avg
      7.52 ± 10%     -29.6%       5.29 ±  2%  sched_debug.cfs_rq:/.runnable_load_avg.avg
     10163 ± 13%     -18.7%       8262 ± 13%  sched_debug.cfs_rq:/.runnable_weight.avg
   2147344 ± 11%     -29.4%    1515179 ± 10%  sched_debug.cfs_rq:/.spread0.avg
   3673348 ± 11%     -22.3%    2854166 ±  5%  sched_debug.cfs_rq:/.spread0.max
    396.82 ± 13%     -26.6%     291.11 ±  4%  sched_debug.cfs_rq:/.util_est_enqueued.avg
      6.81 ±  4%     -25.8%       5.05        sched_debug.cpu.cpu_load[0].avg
      6.96 ±  6%     -25.3%       5.20 ±  2%  sched_debug.cpu.cpu_load[1].avg
      7.01 ±  4%     -23.0%       5.40 ±  2%  sched_debug.cpu.cpu_load[2].avg
      7.09 ±  3%     -19.2%       5.73 ±  2%  sched_debug.cpu.cpu_load[3].avg
     54.42 ± 33%     -55.2%      24.39 ±  9%  sched_debug.cpu.cpu_load[3].max
      8.94 ± 21%     -33.4%       5.96 ±  5%  sched_debug.cpu.cpu_load[3].stddev
      7.34 ±  3%     -15.0%       6.24 ±  2%  sched_debug.cpu.cpu_load[4].avg
     72.43 ± 16%     -29.4%      51.15 ± 18%  sched_debug.cpu.cpu_load[4].max
     10.51 ±  8%     -20.8%       8.32 ±  7%  sched_debug.cpu.cpu_load[4].stddev
     18364 ± 10%     +26.5%      23240 ± 11%  sched_debug.cpu.nr_switches.avg
     12769 ± 11%     +43.0%      18261 ± 13%  sched_debug.cpu.nr_switches.min
     17580 ± 10%     +28.1%      22513 ± 11%  sched_debug.cpu.sched_count.avg
     12302 ± 10%     +41.6%      17424 ± 11%  sched_debug.cpu.sched_count.min
      8539 ± 10%     +29.3%      11037 ± 11%  sched_debug.cpu.sched_goidle.avg
      5806 ± 11%     +43.1%       8309 ± 11%  sched_debug.cpu.sched_goidle.min
      8747 ± 10%     +28.1%      11205 ± 11%  sched_debug.cpu.ttwu_count.avg
     17367 ± 11%     +29.1%      22427 ±  6%  sched_debug.cpu.ttwu_count.max
      1788 ± 11%     +90.2%       3402 ± 12%  sched_debug.cpu.ttwu_count.stddev
      0.77 ±  3%      +0.2        0.95 ±  5%  perf-profile.calltrace.cycles-pp.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault
      0.66 ±  4%      +0.2        0.88 ±  5%  perf-profile.calltrace.cycles-pp.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault
      0.56 ±  6%      +0.3        0.83 ±  5%  perf-profile.calltrace.cycles-pp.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault
      0.27 ±100%      +0.5        0.73 ±  4%  perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages_nodemask.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page
      0.27 ±100%      +0.5        0.74 ±  4%  perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow
      0.56 ±  4%      -0.2        0.32 ±  3%  perf-profile.children.cycles-pp._raw_spin_lock
      0.42 ±  4%      -0.2        0.22        perf-profile.children.cycles-pp.release_pages
      0.41 ±  3%      -0.2        0.21 ±  2%  perf-profile.children.cycles-pp.free_huge_page
      0.42 ±  4%      -0.2        0.23 ±  2%  perf-profile.children.cycles-pp.arch_tlb_finish_mmu
      0.42 ±  4%      -0.2        0.23 ±  2%  perf-profile.children.cycles-pp.tlb_flush_mmu_free
      0.42 ±  4%      -0.2        0.23        perf-profile.children.cycles-pp.tlb_finish_mmu
      0.46 ±  4%      -0.2        0.28 ±  2%  perf-profile.children.cycles-pp.mmput
      0.46 ±  4%      -0.2        0.28        perf-profile.children.cycles-pp.__x64_sys_exit_group
      0.46 ±  4%      -0.2        0.28        perf-profile.children.cycles-pp.do_group_exit
      0.46 ±  4%      -0.2        0.28        perf-profile.children.cycles-pp.do_exit
      0.45 ±  3%      -0.2        0.28 ±  2%  perf-profile.children.cycles-pp.exit_mmap
      0.94 ±  3%      -0.1        0.85 ±  4%  perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.94 ±  3%      -0.1        0.85 ±  4%  perf-profile.children.cycles-pp.do_syscall_64
      0.17 ±  4%      -0.0        0.14 ±  3%  perf-profile.children.cycles-pp.update_and_free_page
      0.12 ±  5%      +0.0        0.14 ±  5%  perf-profile.children.cycles-pp.__account_scheduler_latency
      0.08 ±  8%      +0.0        0.10 ±  8%  perf-profile.children.cycles-pp.sched_ttwu_pending
      0.17 ±  6%      +0.0        0.20 ±  2%  perf-profile.children.cycles-pp.enqueue_entity
      0.18 ±  6%      +0.0        0.21 ±  3%  perf-profile.children.cycles-pp.enqueue_task_fair
      0.17 ±  4%      +0.0        0.20 ±  8%  perf-profile.children.cycles-pp.schedule
      0.18 ±  6%      +0.0        0.21 ±  2%  perf-profile.children.cycles-pp.ttwu_do_activate
      0.05 ±  9%      +0.0        0.09        perf-profile.children.cycles-pp.prep_new_huge_page
      0.16 ±  5%      +0.0        0.20 ±  4%  perf-profile.children.cycles-pp.io_serial_in
      0.24 ±  5%      +0.0        0.28 ±  6%  perf-profile.children.cycles-pp.__schedule
      0.03 ±100%      +0.0        0.07 ± 10%  perf-profile.children.cycles-pp.delay_tsc
      0.18 ±  4%      +0.1        0.24 ±  2%  perf-profile.children.cycles-pp.serial8250_console_putchar
      0.19 ±  6%      +0.1        0.26 ±  3%  perf-profile.children.cycles-pp.wait_for_xmitr
      0.18 ±  5%      +0.1        0.25 ±  2%  perf-profile.children.cycles-pp.uart_console_write
      0.20 ±  6%      +0.1        0.27 ±  2%  perf-profile.children.cycles-pp.serial8250_console_write
      0.20 ± 18%      +0.1        0.28 ±  5%  perf-profile.children.cycles-pp._fini
      0.20 ± 16%      +0.1        0.28 ±  5%  perf-profile.children.cycles-pp.devkmsg_write
      0.20 ± 16%      +0.1        0.28 ±  5%  perf-profile.children.cycles-pp.printk_emit
      0.26 ±  8%      +0.1        0.34 ±  5%  perf-profile.children.cycles-pp.__vfs_write
      0.23 ± 12%      +0.1        0.31 ±  5%  perf-profile.children.cycles-pp.vprintk_emit
      1.65 ±  4%      +0.1        1.73        perf-profile.children.cycles-pp.__mutex_lock
      0.22 ±  9%      +0.1        0.30 ±  3%  perf-profile.children.cycles-pp.console_unlock
      0.22 ± 13%      +0.1        0.30 ±  5%  perf-profile.children.cycles-pp.write
      0.26 ±  8%      +0.1        0.35 ±  4%  perf-profile.children.cycles-pp.ksys_write
      0.26 ±  8%      +0.1        0.35 ±  4%  perf-profile.children.cycles-pp.vfs_write
      0.59 ±  4%      +0.1        0.68 ±  3%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
      0.93 ±  3%      +0.2        1.12 ±  4%  perf-profile.children.cycles-pp.alloc_huge_page
      0.79 ±  2%      +0.2        1.03 ±  4%  perf-profile.children.cycles-pp.alloc_surplus_huge_page
      0.60 ±  2%      +0.3        0.88 ±  5%  perf-profile.children.cycles-pp.__alloc_pages_nodemask
      0.59 ±  2%      +0.3        0.87 ±  5%  perf-profile.children.cycles-pp.get_page_from_freelist
      0.66 ±  2%      +0.3        0.97 ±  4%  perf-profile.children.cycles-pp.alloc_fresh_huge_page
      0.15 ±  4%      +0.3        0.48 ±  6%  perf-profile.children.cycles-pp._raw_spin_lock_irqsave
     25.44 ±  6%      -2.5       22.95 ± 10%  perf-profile.self.cycles-pp.do_rw_once
      0.46 ±  2%      -0.0        0.41 ±  2%  perf-profile.self.cycles-pp.get_page_from_freelist
      0.17 ±  2%      -0.0        0.14 ±  5%  perf-profile.self.cycles-pp.update_and_free_page
      0.15 ±  7%      +0.0        0.20 ±  4%  perf-profile.self.cycles-pp.io_serial_in
      0.01 ±173%      +0.1        0.06 ±  6%  perf-profile.self.cycles-pp.delay_tsc
      1.59 ±  3%      +0.1        1.67        perf-profile.self.cycles-pp.mutex_spin_on_owner
      0.58 ±  3%      +0.1        0.68 ±  4%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
[ASCII time-series plots omitted; panels: vm-scalability.time.user_time, vm-scalability.time.system_time, vm-scalability.time.percent_of_cpu_this_job_got, vm-scalability.time.voluntary_context_switches, vm-scalability.throughput, vm-scalability.median; legend: [*] bisect-good sample, [O] bisect-bad sample]
Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
Thanks,
Rong Chen
On 12/28/18 6:26 AM, kernel test robot wrote:
> Greeting,
>
> FYI, we noticed a -4.3% regression of vm-scalability.throughput due to commit:
>
> commit: 9c83282117778856d647ffc461c4aede2abb6742 ("[PATCH v3 1/2] hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
> url: https://github.com/0day-ci/linux/commits/Mike-Kravetz/hugetlbfs-use-i_mmap_r...
>
> in testcase: vm-scalability
> on test machine: 104 threads Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz with 64G memory
> with following parameters:
>
>	runtime: 300s
>	size: 8T
>	test: anon-cow-seq-hugetlb
>	cpufreq_governor: performance
>	ucode: 0x200004d
I'll take a closer look.
The patch does introduce longer i_mmap_rwsem hold times for the sake of correctness. I need to understand the test and results more fully to determine whether this regression is expected.
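Roughly, the difference in hold time in the fault path is (a simplified sketch based on the patch, not exact code):

	/* before: i_mmap_rwsem taken/dropped (write) only inside huge_pmd_share() */
	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));	/* brief hold */
	/* ... rest of the fault runs without i_mmap_rwsem ... */

	/* after: i_mmap_rwsem (read) spans the entire fault */
	i_mmap_lock_read(mapping);
	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
	/* ... page allocation, copy and page table update all under the lock ... */
	i_mmap_unlock_read(mapping);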