Greetings,
FYI, we noticed a -4.3% regression of vm-scalability.throughput due to commit:

commit: 9c83282117778856d647ffc461c4aede2abb6742 ("[PATCH v3 1/2] hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
url: https://github.com/0day-ci/linux/commits/Mike-Kravetz/hugetlbfs-use-i_mmap_r...

in testcase: vm-scalability
on test machine: 104 threads Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz with 64G memory
with following parameters:

	runtime: 300s
	size: 8T
	test: anon-cow-seq-hugetlb
	cpufreq_governor: performance
	ucode: 0x200004d

test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/

Details are as below:
To reproduce:

	git clone https://github.com/intel/lkp-tests.git
	cd lkp-tests
	bin/lkp install job.yaml  # job file is attached in this email
	bin/lkp run     job.yaml
=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase/ucode:
  gcc-7/performance/x86_64-rhel-7.2/debian-x86_64-2018-04-03.cgz/300s/8T/lkp-skl-2sp4/anon-cow-seq-hugetlb/vm-scalability/0x200004d

commit:
  0cd60eb1a7 ("dma-mapping: fix flags in dma_alloc_wc")
  9c83282117 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
0cd60eb1a7b5421e 9c83282117778856d647ffc461
---------------- --------------------------
       %stddev      %change        %stddev
           \            |              \
    184494            -10.7%     164684         vm-scalability.median
  20393229             -4.3%   19523319         vm-scalability.throughput
     37986 ±  2%       -4.3%      36341 ±  2%   vm-scalability.time.involuntary_context_switches
   3670375             -1.0%    3635385         vm-scalability.time.minor_page_faults
      5808             -9.9%       5236         vm-scalability.time.percent_of_cpu_this_job_got
     10665             -6.4%       9980         vm-scalability.time.system_time
      6873            -15.2%       5829         vm-scalability.time.user_time
   1561119            +42.4%    2222959         vm-scalability.time.voluntary_context_switches
    304034 ± 10%      -15.5%     256985 ±  7%   meminfo.DirectMap4k
   2455420            +17.5%    2884045         softirqs.SCHED
     15179 ± 57%      -77.2%       3468 ±167%   numa-numastat.node0.other_node
      5069 ±171%     +231.5%      16803 ± 34%   numa-numastat.node1.other_node
     58.25            -14.6%      49.75         vmstat.procs.r
     13194            +33.3%      17592         vmstat.system.cs
     30.81             +4.7       35.50         mpstat.cpu.idle%
      0.00 ± 39%       +0.0        0.00 ± 19%   mpstat.cpu.soft%
     22.13             -3.4       18.73         mpstat.cpu.usr%
      1608             -9.5%       1454         turbostat.Avg_MHz
     57.68             -5.5       52.16         turbostat.Busy%
     42.17            +12.7%      47.54         turbostat.CPU%c1
      1896 ± 10%      -13.5%       1639 ± 12%   slabinfo.UNIX.active_objs
      1896 ± 10%      -13.5%       1639 ± 12%   slabinfo.UNIX.num_objs
    512.00 ±  8%      +18.8%     608.00 ±  5%   slabinfo.ebitmap_node.active_objs
    512.00 ±  8%      +18.8%     608.00 ±  5%   slabinfo.ebitmap_node.num_objs
    832.00 ± 13%      +23.1%       1024 ± 10%   slabinfo.scsi_sense_cache.active_objs
    832.00 ± 13%      +23.1%       1024 ± 10%   slabinfo.scsi_sense_cache.num_objs
   1309088             -1.8%    1285325         proc-vmstat.nr_dirty_background_threshold
   2621507             -1.8%    2573971         proc-vmstat.nr_dirty_threshold
  13199577             -1.8%   12961837         proc-vmstat.nr_free_pages
      1742             +1.8%       1774         proc-vmstat.nr_page_table_pages
     22375             -2.8%      21752         proc-vmstat.nr_shmem
      1259 ± 37%      +61.5%       2033 ± 19%   proc-vmstat.numa_huge_pte_updates
    681268 ± 35%      +59.1%    1084220 ± 19%   proc-vmstat.numa_pte_updates
     13983             -8.3%      12823 ±  4%   proc-vmstat.pgactivate
      0.05             +0.0        0.05         perf-stat.branch-miss-rate%
 2.109e+09             +4.3%    2.2e+09         perf-stat.branch-misses
     78.76             -1.9       76.88         perf-stat.cache-miss-rate%
 1.113e+11             -2.9%  1.081e+11         perf-stat.cache-misses
   3996996            +33.6%    5341757         perf-stat.context-switches
      3.37             -9.0%       3.07         perf-stat.cpi
 4.944e+13             -9.6%  4.471e+13         perf-stat.cpu-cycles
    211278             +5.0%     221866         perf-stat.cpu-migrations
      0.00 ±  7%       +0.0        0.00 ±  5%   perf-stat.dTLB-load-miss-rate%
  49679544 ±  7%      +17.5%   58377845 ±  4%   perf-stat.dTLB-load-misses
      0.00 ±  4%       +0.0        0.00 ±  2%   perf-stat.dTLB-store-miss-rate%
  15180335 ±  4%      +14.0%   17307062 ±  2%   perf-stat.dTLB-store-misses
     10.83 ±  3%       -1.8        9.08 ±  3%   perf-stat.iTLB-load-miss-rate%
  44270724 ±  3%       -8.4%   40569884 ±  2%   perf-stat.iTLB-load-misses
 3.644e+08            +11.5%  4.065e+08         perf-stat.iTLB-loads
    331624 ±  3%       +8.4%     359414 ±  2%   perf-stat.instructions-per-iTLB-miss
      0.30             +9.9%       0.33         perf-stat.ipc
     51.92             +1.8       53.74         perf-stat.node-load-miss-rate%
  1.48e+10             -6.0%  1.391e+10         perf-stat.node-loads
 1.497e+10             -6.9%  1.394e+10         perf-stat.node-stores
     10272 ± 14%      -19.0%       8323 ± 13%   sched_debug.cfs_rq:/.load.avg
   7232660 ±  9%      -20.1%    5782120 ± 10%   sched_debug.cfs_rq:/.min_vruntime.max
      0.52 ±  5%      -18.9%       0.43 ±  5%   sched_debug.cfs_rq:/.nr_running.avg
      1.67 ± 10%      -33.1%       1.12 ± 15%   sched_debug.cfs_rq:/.nr_spread_over.avg
      7.52 ± 10%      -29.6%       5.29 ±  2%   sched_debug.cfs_rq:/.runnable_load_avg.avg
     10163 ± 13%      -18.7%       8262 ± 13%   sched_debug.cfs_rq:/.runnable_weight.avg
   2147344 ± 11%      -29.4%    1515179 ± 10%   sched_debug.cfs_rq:/.spread0.avg
   3673348 ± 11%      -22.3%    2854166 ±  5%   sched_debug.cfs_rq:/.spread0.max
    396.82 ± 13%      -26.6%     291.11 ±  4%   sched_debug.cfs_rq:/.util_est_enqueued.avg
      6.81 ±  4%      -25.8%       5.05         sched_debug.cpu.cpu_load[0].avg
      6.96 ±  6%      -25.3%       5.20 ±  2%   sched_debug.cpu.cpu_load[1].avg
      7.01 ±  4%      -23.0%       5.40 ±  2%   sched_debug.cpu.cpu_load[2].avg
      7.09 ±  3%      -19.2%       5.73 ±  2%   sched_debug.cpu.cpu_load[3].avg
     54.42 ± 33%      -55.2%      24.39 ±  9%   sched_debug.cpu.cpu_load[3].max
      8.94 ± 21%      -33.4%       5.96 ±  5%   sched_debug.cpu.cpu_load[3].stddev
      7.34 ±  3%      -15.0%       6.24 ±  2%   sched_debug.cpu.cpu_load[4].avg
     72.43 ± 16%      -29.4%      51.15 ± 18%   sched_debug.cpu.cpu_load[4].max
     10.51 ±  8%      -20.8%       8.32 ±  7%   sched_debug.cpu.cpu_load[4].stddev
     18364 ± 10%      +26.5%      23240 ± 11%   sched_debug.cpu.nr_switches.avg
     12769 ± 11%      +43.0%      18261 ± 13%   sched_debug.cpu.nr_switches.min
     17580 ± 10%      +28.1%      22513 ± 11%   sched_debug.cpu.sched_count.avg
     12302 ± 10%      +41.6%      17424 ± 11%   sched_debug.cpu.sched_count.min
      8539 ± 10%      +29.3%      11037 ± 11%   sched_debug.cpu.sched_goidle.avg
      5806 ± 11%      +43.1%       8309 ± 11%   sched_debug.cpu.sched_goidle.min
      8747 ± 10%      +28.1%      11205 ± 11%   sched_debug.cpu.ttwu_count.avg
     17367 ± 11%      +29.1%      22427 ±  6%   sched_debug.cpu.ttwu_count.max
      1788 ± 11%      +90.2%       3402 ± 12%   sched_debug.cpu.ttwu_count.stddev
      0.77 ±  3%       +0.2        0.95 ±  5%   perf-profile.calltrace.cycles-pp.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault
      0.66 ±  4%       +0.2        0.88 ±  5%   perf-profile.calltrace.cycles-pp.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault
      0.56 ±  6%       +0.3        0.83 ±  5%   perf-profile.calltrace.cycles-pp.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault
      0.27 ±100%       +0.5        0.73 ±  4%   perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages_nodemask.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page
      0.27 ±100%       +0.5        0.74 ±  4%   perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow
      0.56 ±  4%       -0.2        0.32 ±  3%   perf-profile.children.cycles-pp._raw_spin_lock
      0.42 ±  4%       -0.2        0.22         perf-profile.children.cycles-pp.release_pages
      0.41 ±  3%       -0.2        0.21 ±  2%   perf-profile.children.cycles-pp.free_huge_page
      0.42 ±  4%       -0.2        0.23 ±  2%   perf-profile.children.cycles-pp.arch_tlb_finish_mmu
      0.42 ±  4%       -0.2        0.23 ±  2%   perf-profile.children.cycles-pp.tlb_flush_mmu_free
      0.42 ±  4%       -0.2        0.23         perf-profile.children.cycles-pp.tlb_finish_mmu
      0.46 ±  4%       -0.2        0.28 ±  2%   perf-profile.children.cycles-pp.mmput
      0.46 ±  4%       -0.2        0.28         perf-profile.children.cycles-pp.__x64_sys_exit_group
      0.46 ±  4%       -0.2        0.28         perf-profile.children.cycles-pp.do_group_exit
      0.46 ±  4%       -0.2        0.28         perf-profile.children.cycles-pp.do_exit
      0.45 ±  3%       -0.2        0.28 ±  2%   perf-profile.children.cycles-pp.exit_mmap
      0.94 ±  3%       -0.1        0.85 ±  4%   perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.94 ±  3%       -0.1        0.85 ±  4%   perf-profile.children.cycles-pp.do_syscall_64
      0.17 ±  4%       -0.0        0.14 ±  3%   perf-profile.children.cycles-pp.update_and_free_page
      0.12 ±  5%       +0.0        0.14 ±  5%   perf-profile.children.cycles-pp.__account_scheduler_latency
      0.08 ±  8%       +0.0        0.10 ±  8%   perf-profile.children.cycles-pp.sched_ttwu_pending
      0.17 ±  6%       +0.0        0.20 ±  2%   perf-profile.children.cycles-pp.enqueue_entity
      0.18 ±  6%       +0.0        0.21 ±  3%   perf-profile.children.cycles-pp.enqueue_task_fair
      0.17 ±  4%       +0.0        0.20 ±  8%   perf-profile.children.cycles-pp.schedule
      0.18 ±  6%       +0.0        0.21 ±  2%   perf-profile.children.cycles-pp.ttwu_do_activate
      0.05 ±  9%       +0.0        0.09         perf-profile.children.cycles-pp.prep_new_huge_page
      0.16 ±  5%       +0.0        0.20 ±  4%   perf-profile.children.cycles-pp.io_serial_in
      0.24 ±  5%       +0.0        0.28 ±  6%   perf-profile.children.cycles-pp.__schedule
      0.03 ±100%       +0.0        0.07 ± 10%   perf-profile.children.cycles-pp.delay_tsc
      0.18 ±  4%       +0.1        0.24 ±  2%   perf-profile.children.cycles-pp.serial8250_console_putchar
      0.19 ±  6%       +0.1        0.26 ±  3%   perf-profile.children.cycles-pp.wait_for_xmitr
      0.18 ±  5%       +0.1        0.25 ±  2%   perf-profile.children.cycles-pp.uart_console_write
      0.20 ±  6%       +0.1        0.27 ±  2%   perf-profile.children.cycles-pp.serial8250_console_write
      0.20 ± 18%       +0.1        0.28 ±  5%   perf-profile.children.cycles-pp._fini
      0.20 ± 16%       +0.1        0.28 ±  5%   perf-profile.children.cycles-pp.devkmsg_write
      0.20 ± 16%       +0.1        0.28 ±  5%   perf-profile.children.cycles-pp.printk_emit
      0.26 ±  8%       +0.1        0.34 ±  5%   perf-profile.children.cycles-pp.__vfs_write
      0.23 ± 12%       +0.1        0.31 ±  5%   perf-profile.children.cycles-pp.vprintk_emit
      1.65 ±  4%       +0.1        1.73         perf-profile.children.cycles-pp.__mutex_lock
      0.22 ±  9%       +0.1        0.30 ±  3%   perf-profile.children.cycles-pp.console_unlock
      0.22 ± 13%       +0.1        0.30 ±  5%   perf-profile.children.cycles-pp.write
      0.26 ±  8%       +0.1        0.35 ±  4%   perf-profile.children.cycles-pp.ksys_write
      0.26 ±  8%       +0.1        0.35 ±  4%   perf-profile.children.cycles-pp.vfs_write
      0.59 ±  4%       +0.1        0.68 ±  3%   perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
      0.93 ±  3%       +0.2        1.12 ±  4%   perf-profile.children.cycles-pp.alloc_huge_page
      0.79 ±  2%       +0.2        1.03 ±  4%   perf-profile.children.cycles-pp.alloc_surplus_huge_page
      0.60 ±  2%       +0.3        0.88 ±  5%   perf-profile.children.cycles-pp.__alloc_pages_nodemask
      0.59 ±  2%       +0.3        0.87 ±  5%   perf-profile.children.cycles-pp.get_page_from_freelist
      0.66 ±  2%       +0.3        0.97 ±  4%   perf-profile.children.cycles-pp.alloc_fresh_huge_page
      0.15 ±  4%       +0.3        0.48 ±  6%   perf-profile.children.cycles-pp._raw_spin_lock_irqsave
     25.44 ±  6%       -2.5       22.95 ± 10%   perf-profile.self.cycles-pp.do_rw_once
      0.46 ±  2%       -0.0        0.41 ±  2%   perf-profile.self.cycles-pp.get_page_from_freelist
      0.17 ±  2%       -0.0        0.14 ±  5%   perf-profile.self.cycles-pp.update_and_free_page
      0.15 ±  7%       +0.0        0.20 ±  4%   perf-profile.self.cycles-pp.io_serial_in
      0.01 ±173%       +0.1        0.06 ±  6%   perf-profile.self.cycles-pp.delay_tsc
      1.59 ±  3%       +0.1        1.67         perf-profile.self.cycles-pp.mutex_spin_on_owner
      0.58 ±  3%       +0.1        0.68 ±  4%   perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
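For readers unfamiliar with the table format: the %change column is an ordinary relative delta between the parent commit (left column) and the patched commit (right column). A minimal sketch of the arithmetic, using two rows from the table above (the function name is ours, not part of lkp):

```python
# Sketch: how lkp's "%change" column is derived for count-valued metrics.
def pct_change(parent: float, patched: float) -> float:
    """Relative delta of the patched commit vs. its parent, in percent."""
    return (patched - parent) / parent * 100.0

# Values taken from the comparison table above.
print(round(pct_change(184494, 164684), 1))      # vm-scalability.median   -> -10.7
print(round(pct_change(20393229, 19523319), 1))  # vm-scalability.throughput -> -4.3
```

Note that for metrics that are already percentages (e.g. mpstat.cpu.idle%, perf-stat.cache-miss-rate%) the middle column shows an absolute difference in percentage points rather than a relative delta.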
[ASCII time-series charts omitted; the original plots were garbled in transit. They tracked the following metrics across test runs, with [*] marking bisect-good samples and [O] marking bisect-bad samples:
  vm-scalability.time.user_time
  vm-scalability.time.system_time
  vm-scalability.time.percent_of_cpu_this_job_got
  vm-scalability.time.voluntary_context_switches
  vm-scalability.throughput
  vm-scalability.median]
Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
Thanks,
Rong Chen