On Dec 17, 2021, at 8:02 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
On Fri, Dec 17, 2021 at 3:53 PM Nadav Amit <namit@vmware.com> wrote:
I understand the discussion mainly revolves around correctness, which is obviously the most important property, but I would like to mention that having transient get_page() calls causing unnecessary COWs can cause hard-to-analyze and hard-to-avoid performance degradation.
Note that the COW itself is pretty cheap. Yes, there's the page allocation and copy, but it's mostly a local thing.
I don’t know about the page-lock overhead, but I understand your argument.
Having said that, I do know a bit about TLB flushes, which you did not mention as an overhead of COW. Such flushes can be quite expensive on multithreaded workloads (specifically on VMs, but let's put those aside).
Take for instance memcached and assume you overcommit memory with a very fast swap (e.g., pmem, zram, perhaps even slower). Now, it turns out memcached often accesses a page first for read and shortly after for write. In a similar scenario, I found that the page reference that lru_cache_add() takes during the first fault-in event (for read) causes a COW on a write page-fault that happens shortly after [1]. So on memcached I assume this would also trigger frequent unnecessary COWs.
Besides page allocation and copy, COW would then require a TLB flush, which, when performed locally, might not be too bad (~200 cycles). But if memcached has many threads, as it usually does, then you need a TLB shootdown and this one can be expensive (microseconds). If you start getting a TLB shootdown storm, you may avoid some IPIs since you see that other CPUs already queued IPIs for the target CPU. But then the kernel would flush the entire TLB on the target CPU, as it realizes that multiple TLB flushes were queued, and as it assumes that a full TLB flush would be cheaper.
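To make the local-versus-shootdown gap concrete, here is a rough back-of-the-envelope sketch in C. It is not kernel code and not a measurement. The names (cow_flush_model, us_to_cycles, cow_flush_cycles) are hypothetical, and the only figure taken from the text above is the ~200-cycle local flush. The shootdown cost and the clock frequency are assumptions the caller must supply. The model also deliberately ignores IPI batching and the full-TLB-flush fallback described above.

```c
#include <stdint.h>

/*
 * Illustrative cost model only. Every field is an assumption supplied
 * by the caller; nothing here is derived from the kernel sources.
 */
struct cow_flush_model {
    uint64_t local_flush_cycles; /* e.g. ~200, the figure quoted above   */
    uint64_t shootdown_cycles;   /* assumed cost when remote CPUs get IPIs */
};

/* Convert microseconds to cycles for an assumed clock given in MHz. */
uint64_t us_to_cycles(uint64_t us, uint64_t cpu_mhz)
{
    return us * cpu_mhz; /* 1 us at 1 MHz is exactly 1 cycle */
}

/*
 * Cycles spent on TLB invalidation for `cows` COW faults when the
 * address space is active on `cpus_running_mm` CPUs. With a single CPU
 * a local flush suffices; otherwise each COW is charged one shootdown.
 */
uint64_t cow_flush_cycles(const struct cow_flush_model *m,
                          uint64_t cows, unsigned int cpus_running_mm)
{
    uint64_t per_cow = (cpus_running_mm <= 1) ? m->local_flush_cycles
                                              : m->shootdown_cycles;
    return cows * per_cow;
}
```

For example, assuming a 2 us shootdown on a 2 GHz CPU (both made-up inputs), us_to_cycles(2, 2000) gives 4000 cycles, so under this model each unnecessary COW in a multithreaded process costs 20 times the local-flush case before counting the page allocation and copy.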
[ I can try to run a benchmark during the weekend to measure the impact, as I did not really measure the impact on memcached before/after 5.8. ]
So I am in no position to prioritize one overhead over the other, but I do not think that COW can be characterized as mostly-local and cheap in the case of multithreaded workloads.
[1] https://lore.kernel.org/linux-mm/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail....