On Tue, Oct 16, 2018 at 03:37:15PM -0700, Andrew Morton wrote:
On Tue, 16 Oct 2018 08:46:06 +0100 Mel Gorman mgorman@suse.de wrote:
I consider this to be an unfortunate outcome. On the one hand, we have a problem that three people can trivially reproduce with known test cases and a patch shown to resolve the problem. Two of those three people work on distributions that are exposed to a large number of users. On the other, we have a problem that requires the system to be in a specific state and an unknown workload that suffers badly from the remote access penalties with a patch that has review concerns and has not been proven to resolve the trivial cases. For distributions, the first patch addresses concerns with a common workload, whereas the second patch addresses an internal workload of a single company -- which admittedly affects many users indirectly but only one entity directly.
At the absolute minimum, a test case for the "system fragmentation incurs access penalties for a workload" scenario that could both replicate the fragmentation and demonstrate the problem should have been available before the patch was rejected. With the test case, there would be a chance that others could analyse the problem and prototype some fixes. The test case was requested in the thread and never produced, so even if someone were to prototype fixes, they would be dependent on a third party to test and produce data, which is a time-consuming loop. Instead, we are more or less in limbo.
OK, thanks.
But we're OK holding off for a few weeks, yes? If we do that we'll still make it into 4.19.1. Am reluctant to merge this while discussion, testing and possibly more development are ongoing.
Without a test case that reproduces the Google case, we are a bit stuck. Previous experience indicates that just fragmenting memory is not enough to give a reliable case because, unless the unmovable/reclaimable pages are "sticky", normal reclaim can handle it. Similarly, the access pattern of the target workload is important as its working set would need to be larger than L3 cache to constantly hit the access penalty. We do not know what the exact characteristics of the Google workload are, but we know that a fix for the three cases is not equivalent to a fix for the Google case.
The discussion has circled around wish-list items such as better fragmentation control, node-aware compaction, improved compaction deferral logic and lower latencies, with little in the way of actual specifics of implementation or patches. Improving fragmentation control would benefit from a workload that actually fragments memory so the extfrag events can be monitored, as well as maybe a dump of pageblocks with mixed pages.
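As a rough illustration of the kind of monitoring meant here, the sketch below reads /proc/buddyinfo and computes an "unusable free space index" per zone: the fraction of free pages that sit in blocks too small to satisfy an allocation of a given order. This is not from the thread. The function names are made up for the example, and the choice of order 9 (a 2MB huge page with 4K base pages on x86-64) is an assumption. The kernel's debugfs extfrag/unusable_index file reports a similar figure where it is available.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): [free block count per order]} from /proc/buddyinfo text."""
    zones = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected shape: Node 0, zone Normal c0 c1 c2 ...
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        node = int(fields[1].rstrip(","))
        zones[(node, fields[3])] = [int(c) for c in fields[4:]]
    return zones


def unusable_index(counts, order):
    """Fraction of free pages held in blocks smaller than 2**order pages."""
    total = sum(n << i for i, n in enumerate(counts))
    if total == 0:
        return 0.0
    usable = sum(n << i for i, n in enumerate(counts) if i >= order)
    return (total - usable) / total


def report(path="/proc/buddyinfo", order=9):
    """Print the index for every zone; order=9 is an assumed huge page order."""
    with open(path) as f:
        for (node, zone), counts in sorted(parse_buddyinfo(f.read()).items()):
            print(f"node {node} zone {zone}: {unusable_index(counts, order):.3f}")
```

Calling report() periodically while a candidate workload runs gives a cheap signal of whether it fragments memory at all. It says nothing about which pageblocks are mixed or whether the offending pages are "sticky", so it complements the extfrag events and a pageblock dump rather than replacing them.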
On node-aware compaction, that was not implemented initially simply because HighMem was common and that needs to be treated as a corner case -- we cannot safely migrate pages from zone normal to highmem. That one is relatively trivial to measure as it's a functional issue.
However, backing off compaction properly to maximise allocation success rates while minimising allocation latency and access latency needs a live workload that is representative. Trivial cases like the Java workloads, NAS or usemem won't do as they either exhibit special locality or are streaming readers/writers. Memcache might work, but the driver in that case is critical to ensure the access penalties are incurred. Again, a modern example is missing.
As for why this took so long to discover, it is highly likely that it's due to VMs being sized such that they typically fit in a NUMA node, which would have avoided the worst case scenarios. Furthermore, a machine dedicated to VMs has fewer concerns with respect to slab allocations and unmovable allocations fragmenting memory long-term. Finally, the worst case scenarios are encountered when there is a mix of different workloads of variable duration, which may be common in a Google-like setup with different jobs being dispatched across a large network but less so in other setups where a service tends to be persistent. We already know that some of the worst performance problems take years to discover.