The patch titled
     Subject: Revert "mm: don't reclaim inodes with many attached pages"
has been added to the -mm tree.  Its filename is
     revert-mm-dont-reclaim-inodes-with-many-attached-pages.patch
This patch should soon appear at http://ozlabs.org/~akpm/mmots/broken-out/revert-mm-dont-reclaim-inodes-with-... and later at http://ozlabs.org/~akpm/mmotm/broken-out/revert-mm-dont-reclaim-inodes-with-...
Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated there every 3-4 working days
------------------------------------------------------
From: Dave Chinner <dchinner@redhat.com>
Subject: Revert "mm: don't reclaim inodes with many attached pages"
This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
This change causes serious changes to page cache and inode cache behaviour and balance, resulting in major performance regressions when combining workloads such as large file copies and kernel compiles.
https://bugzilla.kernel.org/show_bug.cgi?id=202441
This change is a hack to work around the problems introduced by changing how aggressive shrinkers are on small caches in commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects"). It creates more problems than it solves and wasn't adequately reviewed or tested, so it needs to be reverted.
Link: http://lkml.kernel.org/r/20190130041707.27750-2-david@fromorbit.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Spock <dairinin@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
--- a/fs/inode.c~revert-mm-dont-reclaim-inodes-with-many-attached-pages
+++ a/fs/inode.c
@@ -730,11 +730,8 @@ static enum lru_status inode_lru_isolate
 		return LRU_REMOVED;
 	}
 
-	/*
-	 * Recently referenced inodes and inodes with many attached pages
-	 * get one more pass.
-	 */
-	if (inode->i_state & I_REFERENCED || inode->i_data.nrpages > 1) {
+	/* recently referenced inodes get one more pass */
+	if (inode->i_state & I_REFERENCED) {
 		inode->i_state &= ~I_REFERENCED;
 		spin_unlock(&inode->i_lock);
 		return LRU_ROTATE;
_
Patches currently in -mm which might be from dchinner@redhat.com are
revert-mm-dont-reclaim-inodes-with-many-attached-pages.patch
revert-mm-slowly-shrink-slabs-with-a-relatively-small-number-of-objects.patch
Hi, Andrew!
I believe that Rik's patch ( https://lkml.org/lkml/2019/1/28/1865 ) can make a difference here and might fix the regression. I'd give it a chance before reverting these two patches. Reverting will re-introduce the memcg leak, which is quite bad.
Thanks!
On Tue, Jan 29, 2019 at 09:10:09PM -0800, Andrew Morton wrote:
On Wed, 30 Jan 2019 05:54:27 +0000 Roman Gushchin guro@fb.com wrote:
I believe that Rik's patch ( https://lkml.org/lkml/2019/1/28/1865 ) can make a difference here and might fix the regression. I'd give it a chance before reverting these two patches. Reverting will re-introduce the memcg leak, which is quite bad.
Yup. I'll restore Rik's "mm, slab, vmscan: accumulate gradual pressure on small slabs" for next -mm.
On Wed, Jan 30, 2019 at 05:54:27AM +0000, Roman Gushchin wrote:
Hi, Andrew!
I believe that Rik's patch ( https://lkml.org/lkml/2019/1/28/1865 ) can make a difference here and might fix the regression. I'd give it a chance before reverting these two patches. Reverting will re-introduce the memcg leak, which is quite bad.
Rik's change is just another hack that will still have effects on reclaim behaviour.
Indeed, the fs/inode.c change definitely needs reverting, because that is just *plain wrong* and breaks long-standing memory reclaim behaviour.
I seriously disagree with shovelling a different, largely untested and contentious change to the shrinker algorithm to try and patch over the symptoms of the original change. It leaves the underlying problem unfixed (dying memcgs need a reaper to shrink the remaining slab objects that pin that specific memcg) and instead plays "whack-a-mole" on what we already know is a fundamentally broken assumption (i.e. that shrinking small slabs more aggressively is side-effect free).
-Dave.
On Thu, 2019-01-31 at 12:44 +1100, Dave Chinner wrote:
Indeed, the fs/inode.c change definitely needs reverting, because that is just *plain wrong* and breaks long-standing memory reclaim behaviour.
The long-standing behavior may be wrong here, because of just how incredibly slow ext4 and XFS are when it comes to reclaiming inodes with lots of dirty pages.
We have observed some real system stalls when the reclaim code hits an ext4 or XFS inode with dozens of megabytes of dirty data that needs to be synced out before the inode can be reclaimed.
Have you observed any regressions due to not reclaiming inodes with cached pages attached?
If so, what kind of behavioral differences are you seeing due to that regression?
It would be nice if we could figure out a way to avoid both bad behaviors...
I seriously disagree with shovelling a different, largely untested and contentious change to the shrinker algorithm to try and patch over the symptoms of the original change. It leaves the underlying problem unfixed (dying memcgs need a reaper to shrink the remaining slab objects that pin that specific memcg) and instead plays "whack-a-mole" on what we already know is a fundamentally broken assumption (i.e. that shrinking small slabs more aggressively is side-effect free).
My patch shrinks small slabs with the same pressure as larger slabs. It also ensures that slabs from dead memcgs will get eventually reclaimed.
What am I missing?
On Wed, Jan 30, 2019 at 09:33:34PM -0500, Rik van Riel wrote:
On Thu, 2019-01-31 at 12:44 +1100, Dave Chinner wrote:
Indeed, the fs/inode.c change definitely needs reverting, because that is just *plain wrong* and breaks long-standing memory reclaim behaviour.
The long-standing behavior may be wrong here, because of just how incredibly slow ext4 and XFS are when it comes to reclaiming inodes with lots of dirty pages.
That has nothing to do with the reasons the change was made. The change was made to band-aid a regression from a different change that has clearly introduced multiple regressions. This change to >15-year-old behaviour can't be justified by arguing "but what about <this other completely different issue>!"
If you have problems with writeback and inode reclaim, do what everyone else does: write a bug report showing where the problematic writeback is coming from, explain why it's a problem and propose a fix. And post it to linux-fsdevel@vger.kernel.org for review.
Rik, you know how to work with upstream development, I should not have to be telling you this.
We have observed some real system stalls when the reclaim code hits an ext4 or XFS inode with dozens of megabytes of dirty data that needs to be synced out before the inode can be reclaimed.
"reclaim code" is a massive footprint. Please be specific.
Have you observed any regressions due to not reclaiming inodes with cached pages attached?
That's not how we justify a change. The burden of proof of correctness is on the person proposing the change, not on the reviewer to prove the change is wrong. We have clear indication that it changes the behaviour of common workloads (copying files at the same time as compiling a kernel is a common thing!), so we should revert it until we have a set of changes that actually are proved to work correctly for common workloads.
I seriously disagree with shovelling a different, largely untested and contentious change to the shrinker algorithm to try and patch over the symptoms of the original change. It leaves the underlying problem unfixed (dying memcgs need a reaper to shrink the remaining slab objects that pin that specific memcg) and instead plays "whack-a-mole" on what we already know is a fundamentally broken assumption (i.e. that shrinking small slabs more aggressively is side-effect free).
My patch shrinks small slabs with the same pressure as larger slabs. It also ensures that slabs from dead memcgs will get eventually reclaimed.
What am I missing?
That "freeable" is not a measure of slab cache size. It's a measure of how many freeable objects are in the cache. i.e. a large cache with nothing freeable returns the same value as a small cache with nothing freeable.
i.e. the basic premise that "freeable is small == this is a small cache" is flawed.
And then there's the bugs. Don't forget - the same shrinker can run concurrent on every node via kswapd and be called by every instance of direct reclaim on every CPU at the same time. (i.e. the unbound concurrency problem that fucks up inode caches and results in having to throttle (block) in the shrinker implementation so the working sets don't get trashed by massively excessive memory reclaim pressure).
-Dave.