On Tue, Oct 23, 2018 at 08:57:45AM +0100, Mel Gorman wrote:
Note that I accept it's trivial to fragment memory in a harmful way. I prototyped a test case yesterday that uses fio in the following way to fragment memory:

o fio of many small files (64K)
o create initial pages using writes that disable fallocate and create
  inodes on first open. This is massively inefficient from an IO
  perspective but it mixes slab and page cache allocations so all
  NUMA nodes get fragmented.
o Size the page cache so that it's 150% the size of memory so it forces
  reclaim activity and new fio activity to further mix slab and page
  cache allocations
o After initial write, run parallel readers to keep slab active and run
  this for the same length of time the initial writes took so fio has
  called stat() on the existing files and begun the read phase. This
  forces the slab and page cache pages to remain "live" and difficult
  to reclaim/compact.
o Finally, start a workload that allocates THP after the warmup phase
  but while fio is still running to measure allocation success rate
  and latencies
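As a rough illustration of the sizing step in the list above, the following sketch works out how many 64K files are needed for the data set to reach 150% of memory. It reads "size the page cache" as "size the fio data set", which is an assumption, and the helper names are hypothetical rather than taken from the actual test configuration.

```python
FILE_SIZE = 64 * 1024   # 64K per file, as in the description
CACHE_RATIO = 1.5       # data set sized to 150% of memory


def files_needed(mem_bytes, file_size=FILE_SIZE, ratio=CACHE_RATIO):
    """Number of files so that the total size is at least ratio * mem_bytes."""
    target = int(mem_bytes * ratio)
    # Ceiling division so the data set never falls short of the target.
    return -(-target // file_size)


def mem_total_bytes(path="/proc/meminfo"):
    """Read MemTotal (reported in kB) from /proc/meminfo on Linux."""
    with open(path) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) * 1024
    raise RuntimeError("MemTotal not found in " + path)
```

On a Linux machine, files_needed(mem_total_bytes()) would give the file count for that machine under these assumptions.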
The tests completed shortly after I wrote this mail so I can put some figures to the intuitions expressed in this mail. I'm truncating the reports for clarity but can upload the full data if necessary.
The target system is a 2-socket machine using E5-2670 v3 (Haswell). Base kernel is 4.19. The baseline is an unpatched kernel. relaxthisnode-v1r1 is patch 1 of Michal's series and does not include the second cleanup. noretry-v1r1 is David's alternative.
global-dhp__workload_usemem-stress-numa-compact (no filesystem as this is the trivial case of allocating anonymous memory on a freshly booted system. Figures are elapsed time)
                                  4.19.0                4.19.0                4.19.0
                                 vanilla    relaxthisnode-v1r1          noretry-v1r1
Amean System-1          14.16 (   0.00%)      12.35 *  12.75%*      15.96 * -12.70%*
Amean System-3          15.14 (   0.00%)       9.83 *  35.08%*      11.00 *  27.34%*
Amean System-4           9.88 (   0.00%)       9.85 (   0.25%)       9.80 (   0.75%)
Amean Elapsd-1          29.23 (   0.00%)      26.16 *  10.50%*      33.81 * -15.70%*
Amean Elapsd-3          25.67 (   0.00%)       7.28 *  71.63%*       8.49 *  66.93%*
Amean Elapsd-4           5.49 (   0.00%)       5.53 (  -0.76%)       5.46 (   0.49%)
The figures in () are the percentage gain/loss. If a figure is surrounded by *'s then the automation has guessed that the result is outside the noise.
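For anyone wanting to sanity-check the bracketed figures, here is a minimal sketch of the gain/loss arithmetic. It assumes the percentage is simply the difference relative to the vanilla baseline, with the sign flipped for metrics where lower is better (times, latencies). The function name is hypothetical and this is not the reporting tool's code. It does not attempt the noise test behind the *'s, and small differences from the report are expected because the inputs here are already rounded.

```python
def pct_gain(baseline, value, lower_is_better=True):
    """Percentage gain (positive) or loss (negative) relative to baseline.

    Returns None for a zero baseline, where the ratio is undefined.
    """
    if baseline == 0:
        return None
    if lower_is_better:
        delta = baseline - value   # a smaller time or latency is a gain
    else:
        delta = value - baseline   # a larger success rate is a gain
    return 100.0 * delta / baseline
```

For example, pct_gain(29.23, 26.16) comes out at roughly 10.5, and pct_gain(45.48, 95.09, lower_is_better=False) at roughly 109.1.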
System CPU usage is reduced by both as reported, but in elapsed time (Elapsd-1) Michal's gives a 10.5% gain and David's is a 15.7% loss. Both appear to be outside the noise. While not included here, the vanilla kernel swaps heavily with a 56% reclaim efficiency (pages scanned vs pages reclaimed) and it's all from direct reclaim activity; neither of the proposed patches swaps. Michal's patch does not enter reclaim, David's enters reclaim but it's very light.
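The reclaim efficiency figure can be approximated from two snapshots of /proc/vmstat. The sketch below assumes efficiency means pages reclaimed as a percentage of pages scanned, and that the direct reclaim counters are named pgscan_direct and pgsteal_direct as on kernels of this era. The helper names are hypothetical and this is not how the test harness itself computes the number.

```python
def parse_vmstat(text):
    """Parse 'name value' lines (the /proc/vmstat format) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def direct_reclaim_efficiency(before, after):
    """Pages reclaimed as a percentage of pages scanned by direct reclaim.

    Returns None if nothing was scanned between the two snapshots.
    """
    scanned = after.get("pgscan_direct", 0) - before.get("pgscan_direct", 0)
    stolen = after.get("pgsteal_direct", 0) - before.get("pgsteal_direct", 0)
    if scanned <= 0:
        return None
    return 100.0 * stolen / scanned
```

Taking one snapshot before the workload and one after, then passing both to direct_reclaim_efficiency(), gives a figure comparable in spirit to the 56% quoted above.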
global-dhp__workload_thpfioscale-xfs (Uses fio to fragment memory and keep slab and page cache active while there is an attempt to allocate THP in parallel. No special madvise flags or tuning is applied. A dedicated test partition is used for fio and XFS was the target filesystem that is recreated on every test)

thpfioscale Fault Latencies
                                  4.19.0                4.19.0                4.19.0
                                 vanilla    relaxthisnode-v1r1          noretry-v1r1
Amean fault-base-5    1471.95 (   0.00%)    1515.64 (  -2.97%)    1491.05 (  -1.30%)
Amean fault-huge-5       0.00 (   0.00%)     534.51 * -99.00%*       0.00 (   0.00%)

thpfioscale Percentage Faults Huge
                                  4.19.0                4.19.0                4.19.0
                                 vanilla    relaxthisnode-v1r1          noretry-v1r1
Percentage huge-5        0.00 (   0.00%)       1.18 ( 100.00%)       0.00 (   0.00%)
Both patches incur a slight hit to fault latency (measured in microseconds) but it's well within the noise. While not included here, the variance is massive (min 1052 microseconds, max 282348 microseconds in the vanilla kernel). Both patches reduce the worst-case scenarios. All kernels show terrible allocation success rates. Michal's had a 1.18% success rate but that's probably luck.
global-dhp__workload_thpfioscale-madvhugepage-xfs (Same as the last test but the THP allocation program uses MADV_HUGEPAGE)
thpfioscale Fault Latencies
                                  4.19.0                4.19.0                4.19.0
                                 vanilla    relaxthisnode-v1r1          noretry-v1r1
Amean fault-base-5    6772.84 (   0.00%)   10256.30 * -51.43%*    1574.45 *  76.75%*
Amean fault-huge-5    2644.19 (   0.00%)    5314.17 *-100.98%*    3517.89 ( -33.04%)

thpfioscale Percentage Faults Huge
                                  4.19.0                4.19.0                4.19.0
                                 vanilla    relaxthisnode-v1r1          noretry-v1r1
Percentage huge-5       45.48 (   0.00%)      95.09 ( 109.08%)       2.81 ( -93.81%)
The first point of interest is that even with the vanilla kernel, the allocation fault latency is much higher than average, reflecting that additional work is being done.
Next point of interest -- David's patch has much lower latency on average when allocating *base* pages, and the vmstats (not included) show that compaction activity is reduced but not eliminated.
To balance this, Michal's patch has a 95% allocation success rate for THP versus 45% on the default kernel at the cost of higher fault latency. This is almost certainly a reflection that THPs are being allocated on remote nodes. This can be considered good or bad depending on whether THP is more important than locality. Note with David's patch that the allocation success rate drops to 2.81%, showing that it's much less efficient at allocating THP.
This demonstrates a very clear trade-off between allocation latency and allocation success rate for THP. Which one is better is workload dependent.
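For readers unfamiliar with the hint used in the MADV_HUGEPAGE test, the following is a minimal sketch of an anonymous mapping that asks for huge pages and is then faulted in. It is not the test program from the benchmark. The 2M huge page size and the helper names are assumptions, and the hint is applied on a best-effort basis since mmap.madvise() needs Python 3.8+ and a kernel built with THP support.

```python
import mmap

HUGE_PAGE_SIZE = 2 * 1024 * 1024  # assumption: 2M THP, as on x86-64


def map_with_thp_hint(length=4 * HUGE_PAGE_SIZE):
    """Create a private anonymous mapping and hint MADV_HUGEPAGE if possible."""
    region = mmap.mmap(-1, length,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    hinted = False
    if hasattr(region, "madvise") and hasattr(mmap, "MADV_HUGEPAGE"):
        try:
            region.madvise(mmap.MADV_HUGEPAGE)
            hinted = True
        except OSError:
            pass  # kernel without THP support, or the hint was rejected
    return region, hinted


def touch_pages(region, page_size=4096):
    """Write one byte per base page so every page is actually faulted in."""
    for offset in range(0, len(region), page_size):
        region[offset] = 1
```

Whether the faults are satisfied with huge pages is then up to the kernel, which is exactly the allocation success rate that the tables above report.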
global-dhp__workload_thpfioscale-defrag-xfs (Same as global-dhp__workload_thpfioscale-xfs except that defrag is set to always)

thpfioscale Fault Latencies
                                  4.19.0                4.19.0                4.19.0
                                 vanilla    relaxthisnode-v1r1          noretry-v1r1
Amean fault-base-5    2678.60 (   0.00%)    4442.14 * -65.84%*    1640.15 *  38.77%*
Amean fault-huge-5    1324.61 (   0.00%)    1460.08 ( -10.23%)    2358.23 ( -78.03%)

thpfioscale Percentage Faults Huge
                                  4.19.0                4.19.0                4.19.0
                                 vanilla    relaxthisnode-v1r1          noretry-v1r1
Percentage huge-5        0.90 (   0.00%)       0.40 ( -55.56%)       0.22 ( -75.93%)
The allocation latency is again higher in this case as greater effort is made to allocate the huge page. Michal's takes a hit as it's still trying to allocate the THP while David's gives up early. In all cases the allocation success rate is terrible.
So it should be reasonably clear that no approach is a universal win. Michal's wins at the trivial case which is what the original problem was and why it was pushed at all. David's in general has lower latency because it gives up quickly but the allocation success rate when MADV_HUGEPAGE specifically asks for huge pages is terrible. This may make it a non-starter for the virtualisation case that wants huge pages on the basis that if an application asks for huge pages, it presumably is willing to pay the cost to get them.