On Wed, Jan 30, 2019 at 09:33:34PM -0500, Rik van Riel wrote:
On Thu, 2019-01-31 at 12:44 +1100, Dave Chinner wrote:
Indeed, the fs/inode.c change definitely needs reverting, because that is just *plain wrong* and breaks long-standing memory reclaim behaviour.
The long-standing behavior may be wrong here, because of just how incredibly slow ext4 and XFS are when it comes to reclaiming inodes with lots of dirty pages.
That has nothing to do with the reasons that the change that was made. The reasons the change were to band-aid a regression for a different change that has clearly introduced multiple regressions. This change of >15 year old behaviour can't be justified by arguing that "but what about <this other completely differnt issue>!"
If you have problems with writeback and inode reclaim, do what everyone else does: write a bug report showing where the problematic writeback is coming from, explain why it's a problem and propose a fix. And post it to linux-fsdevel@vger.kernel.org for review.
Rik, you know how to work with upstream development, I should not have to be telling you this.
We have observed some real system stalls when the reclaim code hits an ext4 or XFS inode with dozens of megabytes of dirty data that needs to be synced out before the inode can be reclaimed.
"reclaim code" is a massive foot print. Please be specific.
Have you observed any regressions due to not reclaiming inodes with cached pages attached?
That's not how we justify a change. The burden of proof of correctness is on the person proposing the change, not on the reviewer to prove the change is wrong. We have clear indication that is changes behaviour of common workloads (copying files at the same time as compiling a kernel is a common thing!), so we should revert it until we have a set of changes that actually are proved to work correctly for common workloads.
I seriously disagree with shovelling a different, largely untested and contentious change to the shrinker algorithm to try and patch over the symptoms of the original change. It leaves the underlying problem unfixed (dying memcgs need a reaper to shrink the remaining slab objects that pin that specific memcg) and instead plays "whack-a-mole" on what we alreayd know is a fundamentally broken assumption (i.e. that shrinking small slabs more agressively is side-effect free).
My patch shrinks small slabs with the same pressure as larger slabs. It also ensures that slabs from dead memcgs will get eventually reclaimed.
What am I missing?
That "freeable" is not a measure of slab cache size. It's a measure of how many freeable objects are in the cache. i.e. a large cache with nothing freeable returns the same value as a small cache with nothing freeable.
i.e. the basic premise that "freeable is small == this is a small cache" is flawed.
And then there's the bugs. Don't forget - the same shrinker can run concurrent on every node via kswapd and be called by every instance of direct reclaim on every CPU at the same time. (i.e. the unbound concurrency problem that fucks up inode caches and results in having to throttle (block) in the shrinker implementation so the working sets don't get trashed by massively excessive memory reclaim pressure).
-Dave.