From: Rongwei Wang <rongwei.wang(a)linux.alibaba.com>
Subject: mm, thp: fix incorrect unmap behavior for private pages
When truncating the page cache on file THP, the private pages of a process
should not be unmapped. Doing so for a dynamic shared library (DSO) makes
the processes that map it dump core.
A simple test for a DSO (the prerequisite is that the DSO is mapped with
file THP):
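#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	int fd;

	/* Opening the DSO for writing is all the test does. */
	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	close(fd);
	return 0;
}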
The test only opens the target DSO and does nothing else. Yet this
operation makes one or more processes dump core. This patch fixes the
bug.
Link: https://lkml.kernel.org/r/20211025092134.18562-3-rongwei.wang@linux.alibaba…
Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
Signed-off-by: Rongwei Wang <rongwei.wang(a)linux.alibaba.com>
Tested-by: Xu Yu <xuyu(a)linux.alibaba.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Song Liu <song(a)kernel.org>
Cc: William Kucharski <william.kucharski(a)oracle.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Collin Fijalkovich <cfijalkovich(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/open.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
--- a/fs/open.c~mm-thp-fix-incorrect-unmap-behavior-for-private-pages
+++ a/fs/open.c
@@ -857,8 +857,17 @@ static int do_dentry_open(struct file *f
*/
smp_mb();
if (filemap_nr_thps(inode->i_mapping)) {
+ struct address_space *mapping = inode->i_mapping;
+
filemap_invalidate_lock(inode->i_mapping);
- truncate_pagecache(inode, 0);
+ /*
+ * unmap_mapping_range just need to be called once
+ * here, because the private pages is not need to be
+ * unmapped mapping (e.g. data segment of dynamic
+ * shared libraries here).
+ */
+ unmap_mapping_range(mapping, 0, 0, 0);
+ truncate_inode_pages(mapping, 0);
filemap_invalidate_unlock(inode->i_mapping);
}
}
_
From: Vasily Averin <vvs(a)virtuozzo.com>
Subject: memcg: prohibit unconditional exceeding the limit of dying tasks
Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. It is assumed that the amount of the memory charged by those tasks
is bounded and most of the memory will get released while the task is
exiting. This resembles the heuristic for the global OOM situation, when
tasks get access to memory reserves. There is no global memory shortage
at the memcg level so the memcg heuristic is more relaxed.
The above assumption is overly optimistic though. E.g. vmalloc can scale
to really large requests and the heuristic would allow that. We used to
have an early break in the vmalloc allocator for killed tasks but this has
been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when the
current task is killed""). There are likely other similar code paths
which do not check for fatal signals in an allocation and charge loop.
Also, some kernel objects charged to a memcg are not bound to a process
lifetime.
It has been observed that it is not really hard to trigger these bypasses
and cause a global OOM situation.
One potential way to address these runaways would be to limit the amount
of excess (similar to the global OOM with limited oom reserves). This is
certainly possible but it is not really clear how much of an excess is
desirable and still protects from global OOMs as that would have to
consider the overall memcg configuration.
This patch addresses the problem by removing the heuristic altogether.
Bypass is only allowed for requests which either cannot fail or where the
failure is not desirable while excess should be still limited (e.g.
atomic requests). Implementation-wise, a killed or dying task fails to
charge if it has passed the OOM killer stage. That should give all forms
of reclaim a chance to restore the limit before the failure (ENOMEM) and
tell the caller to back off.
In addition, this patch renames the should_force_charge() helper to
task_is_dying() because its use is no longer associated with forced
charging.
This patch depends on pagefault_out_of_memory() not triggering
out_of_memory(), because otherwise a memcg failure can unwind to
VM_FAULT_OOM and trigger the global OOM killer.
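As an illustration only, the charge path described above can be sketched
as a small userspace model. All names below (struct memcg_model,
reclaim_model(), oom_stage_model(), try_charge_model()) are hypothetical
and are not kernel symbols; reclaim and the OOM killer are reduced to
simple counters. The point it tries to show is that a dying task is no
longer force-charged: it gets one pass through the OOM stage and then
fails with the equivalent of ENOMEM if the limit is still exceeded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these names are not kernel symbols. */
enum charge_result { CHARGE_OK, CHARGE_NOMEM };

struct memcg_model {
	long usage;        /* pages currently charged */
	long limit;        /* hard limit, in pages */
	long reclaimable;  /* pages that reclaim is still able to free */
	long victim_pages; /* pages freed if the OOM killer kills another task */
	bool task_dying;   /* stands in for task_is_dying() on the charging task */
};

/* Free as much as reclaim can; returns the number of pages freed. */
static long reclaim_model(struct memcg_model *m)
{
	long freed = m->reclaimable < m->usage ? m->reclaimable : m->usage;

	m->reclaimable -= freed;
	m->usage -= freed;
	return freed;
}

/* OOM stage: reports success for a dying task or after killing a victim. */
static bool oom_stage_model(struct memcg_model *m)
{
	long freed;

	if (m->task_dying)
		return true;	/* nothing is killed, the task is already exiting */
	if (m->victim_pages == 0)
		return false;	/* no victim available, the OOM stage fails */
	freed = m->victim_pages < m->usage ? m->victim_pages : m->usage;
	m->usage -= freed;
	m->victim_pages = 0;
	return true;
}

static enum charge_result try_charge_model(struct memcg_model *m, long nr_pages)
{
	bool passed_oom = false;

	assert(nr_pages > 0);
	for (;;) {
		if (m->usage + nr_pages <= m->limit) {
			m->usage += nr_pages;
			return CHARGE_OK;
		}
		if (reclaim_model(m) > 0)
			continue;
		/* No forced charge: a dying task fails after the OOM stage. */
		if (passed_oom && m->task_dying)
			return CHARGE_NOMEM;
		if (!oom_stage_model(m))
			return CHARGE_NOMEM;
		passed_oom = true;
	}
}
```

In this model a dying task at the limit with nothing left to reclaim gets
CHARGE_NOMEM and its usage stays unchanged, whereas the removed heuristic
would have let the charge through.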
Link: https://lkml.kernel.org/r/8f5cebbb-06da-4902-91f0-6566fc4b4203@virtuozzo.com
Signed-off-by: Vasily Averin <vvs(a)virtuozzo.com>
Suggested-by: Michal Hocko <mhocko(a)suse.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Uladzislau Rezki <urezki(a)gmail.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel(a)i-love.sakura.ne.jp>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 27 ++++++++-------------------
1 file changed, 8 insertions(+), 19 deletions(-)
--- a/mm/memcontrol.c~memcg-prohibit-unconditional-exceeding-the-limit-of-dying-tasks
+++ a/mm/memcontrol.c
@@ -234,7 +234,7 @@ enum res_type {
iter != NULL; \
iter = mem_cgroup_iter(NULL, iter, NULL))
-static inline bool should_force_charge(void)
+static inline bool task_is_dying(void)
{
return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
(current->flags & PF_EXITING);
@@ -1624,7 +1624,7 @@ static bool mem_cgroup_out_of_memory(str
* A few threads which were not waiting at mutex_lock_killable() can
* fail to bail out. Therefore, check again after holding oom_lock.
*/
- ret = should_force_charge() || out_of_memory(&oc);
+ ret = task_is_dying() || out_of_memory(&oc);
unlock:
mutex_unlock(&oom_lock);
@@ -2579,6 +2579,7 @@ static int try_charge_memcg(struct mem_c
struct page_counter *counter;
enum oom_status oom_status;
unsigned long nr_reclaimed;
+ bool passed_oom = false;
bool may_swap = true;
bool drained = false;
unsigned long pflags;
@@ -2614,15 +2615,6 @@ retry:
goto force;
/*
- * Unlike in global OOM situations, memcg is not in a physical
- * memory shortage. Allow dying and OOM-killed tasks to
- * bypass the last charges so that they can exit quickly and
- * free their memory.
- */
- if (unlikely(should_force_charge()))
- goto force;
-
- /*
* Prevent unbounded recursion when reclaim operations need to
* allocate memory. This might exceed the limits temporarily,
* but we prefer facilitating memory reclaim and getting back
@@ -2679,8 +2671,9 @@ retry:
if (gfp_mask & __GFP_RETRY_MAYFAIL)
goto nomem;
- if (fatal_signal_pending(current))
- goto force;
+ /* Avoid endless loop for tasks bypassed by the oom killer */
+ if (passed_oom && task_is_dying())
+ goto nomem;
/*
* keep retrying as long as the memcg oom killer is able to make
@@ -2689,14 +2682,10 @@ retry:
*/
oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
get_order(nr_pages * PAGE_SIZE));
- switch (oom_status) {
- case OOM_SUCCESS:
+ if (oom_status == OOM_SUCCESS) {
+ passed_oom = true;
nr_retries = MAX_RECLAIM_RETRIES;
goto retry;
- case OOM_FAILED:
- goto force;
- default:
- goto nomem;
}
nomem:
if (!(gfp_mask & __GFP_NOFAIL))
_
From: Michal Hocko <mhocko(a)suse.com>
Subject: mm, oom: do not trigger out_of_memory from the #PF
Any allocation failure during the #PF path will return with VM_FAULT_OOM
which in turn results in pagefault_out_of_memory. This can happen for 2
different reasons. a) Memcg is out of memory and we rely on
mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal
allocation fails.
The latter is quite problematic because allocation paths already trigger
out_of_memory and the page allocator tries really hard to not fail
allocations. Anyway, if the OOM killer has been already invoked there is
no reason to invoke it again from the #PF path, especially when the OOM
condition might be gone by that time and we have no way to find out other
than to allocate.
Moreover, if the allocation failed and the OOM killer hasn't been invoked
then we are unlikely to do the right thing from the #PF context because we
have already lost the allocation context and restrictions and therefore
might oom kill a task from a different NUMA domain.
This all suggests that there is no legitimate reason to trigger
out_of_memory from pagefault_out_of_memory, so drop it. Just to be sure
that no #PF path returns with VM_FAULT_OOM without an allocation, print a
warning that this is happening before we restart the #PF.
[VvS: a #PF allocation can hit the limit of the cgroup v1 kmem controller.
This is a local problem related to memcg; however, it causes unnecessary
global OOM kills that are repeated over and over again and escalate into a
real disaster. This has been broken since kmem accounting was introduced
for cgroup v1 (3.8). There was no kmem specific reclaim for the separate
limit so the only way to handle the kmem hard limit was to return with
ENOMEM. In upstream the problem will be fixed by removing the outdated
kmem limit; however, stable and LTS kernels cannot do it and are still
affected. This patch fixes the problem and should be backported into
stable/LTS.]
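As an illustration only, the decision flow that pagefault_out_of_memory()
is left with can be sketched as a userspace model. The names below
(enum pf_oom_action, struct ratelimit_model, ratelimit_model_ok(),
pagefault_oom_model()) are hypothetical, and the rate limiter is a
simplified fixed-window stand-in rather than the kernel's __ratelimit().
In every case other than the memcg one the fault is simply retried; the
only difference is whether a warning is printed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these names are not kernel symbols. */
enum pf_oom_action {
	PF_OOM_MEMCG_HANDLED,	/* memcg OOM handling took care of it */
	PF_OOM_TASK_DYING,	/* fatal signal pending, let the task die */
	PF_OOM_WARN,		/* warn about the leaked VM_FAULT_OOM, retry */
	PF_OOM_SILENT,		/* warning suppressed by the rate limit, retry */
};

struct ratelimit_model {
	long interval;	/* window length, in arbitrary time units */
	int burst;	/* messages allowed per window */
	long begin;	/* start of the current window */
	int printed;	/* messages printed in the current window */
};

/* Simplified fixed-window rate limit: at most 'burst' hits per 'interval'. */
static bool ratelimit_model_ok(struct ratelimit_model *rs, long now)
{
	assert(rs->interval > 0 && rs->burst > 0);
	if (now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed >= rs->burst)
		return false;
	rs->printed++;
	return true;
}

static enum pf_oom_action pagefault_oom_model(bool memcg_handled,
					      bool fatal_signal,
					      struct ratelimit_model *rs,
					      long now)
{
	if (memcg_handled)
		return PF_OOM_MEMCG_HANDLED;
	if (fatal_signal)
		return PF_OOM_TASK_DYING;
	/* out_of_memory() is never called here, only a warning is printed. */
	return ratelimit_model_ok(rs, now) ? PF_OOM_WARN : PF_OOM_SILENT;
}
```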
Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Signed-off-by: Vasily Averin <vvs(a)virtuozzo.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Tetsuo Handa <penguin-kernel(a)i-love.sakura.ne.jp>
Cc: Uladzislau Rezki <urezki(a)gmail.com>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/oom_kill.c | 22 ++++++++--------------
1 file changed, 8 insertions(+), 14 deletions(-)
--- a/mm/oom_kill.c~mm-oom-do-not-trigger-out_of_memory-from-the-pf
+++ a/mm/oom_kill.c
@@ -1120,19 +1120,15 @@ bool out_of_memory(struct oom_control *o
}
/*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * The pagefault handler calls here because some allocation has failed. We have
+ * to take care of the memcg OOM here because this is the only safe context without
+ * any locks held but let the oom killer triggered from the allocation context care
+ * about the global OOM.
*/
void pagefault_out_of_memory(void)
{
- struct oom_control oc = {
- .zonelist = NULL,
- .nodemask = NULL,
- .memcg = NULL,
- .gfp_mask = 0,
- .order = 0,
- };
+ static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
if (mem_cgroup_oom_synchronize(true))
return;
@@ -1140,10 +1136,8 @@ void pagefault_out_of_memory(void)
if (fatal_signal_pending(current))
return;
- if (!mutex_trylock(&oom_lock))
- return;
- out_of_memory(&oc);
- mutex_unlock(&oom_lock);
+ if (__ratelimit(&pfoom_rs))
+ pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
}
SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
_
From: Vasily Averin <vvs(a)virtuozzo.com>
Subject: mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks
Patch series "memcg: prohibit unconditional exceeding the limit of dying tasks", v3.
Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. This can be misused to trigger a global OOM from inside a
memcg-limited container. On the other hand, if memcg fails an allocation
called from inside the #PF handler, it triggers a global OOM from inside
pagefault_out_of_memory().
To prevent these problems this patchset:
a) removes execution of out_of_memory() from pagefault_out_of_memory(),
because nobody can explain why it is necessary.
b) allows memcg to fail allocation of dying/killed tasks.
This patch (of 3):
Any allocation failure during the #PF path will return with VM_FAULT_OOM
which in turn results in pagefault_out_of_memory(), which in turn executes
out_of_memory() and can kill a random task.
An allocation might fail when the current task is the oom victim and there
are no memory reserves left. The OOM killer is already handled at the
page allocator level for the global OOM and at the charging level for the
memcg one. Both have much more information about the scope of
allocation/charge request. This means that either the OOM killer has been
invoked properly and didn't lead to the allocation success or it has been
skipped because it couldn't have been invoked. In both cases triggering
it from here is pointless and even harmful.
It makes much more sense to let the killed task die rather than to wake up
an eternally hungry oom-killer and send him to choose a fatter victim for
breakfast.
Link: https://lkml.kernel.org/r/0828a149-786e-7c06-b70a-52d086818ea3@virtuozzo.com
Signed-off-by: Vasily Averin <vvs(a)virtuozzo.com>
Suggested-by: Michal Hocko <mhocko(a)suse.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Tetsuo Handa <penguin-kernel(a)i-love.sakura.ne.jp>
Cc: Uladzislau Rezki <urezki(a)gmail.com>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/oom_kill.c | 3 +++
1 file changed, 3 insertions(+)
--- a/mm/oom_kill.c~mm-oom-pagefault_out_of_memory-dont-force-global-oom-for-dying-tasks
+++ a/mm/oom_kill.c
@@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
if (mem_cgroup_oom_synchronize(true))
return;
+ if (fatal_signal_pending(current))
+ return;
+
if (!mutex_trylock(&oom_lock))
return;
out_of_memory(&oc);
_
From: "Matthew Wilcox (Oracle)" <willy(a)infradead.org>
Subject: mm/filemap.c: remove bogus VM_BUG_ON
It is not safe to check page->index without holding the page lock. It can
be changed if the page is moved between the swap cache and the page cache
for a shmem file, for example. There is a VM_BUG_ON below which checks
page->index is correct after taking the page lock.
Link: https://lkml.kernel.org/r/20210818144932.940640-1-willy@infradead.org
Fixes: 5c211ba29deb ("mm: add and use find_lock_entries")
Signed-off-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Reported-by: <syzbot+c87be4f669d920c76330(a)syzkaller.appspotmail.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/filemap.c | 1 -
1 file changed, 1 deletion(-)
--- a/mm/filemap.c~mm-remove-bogus-vm_bug_on
+++ a/mm/filemap.c
@@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct addres
if (!xa_is_value(page)) {
if (page->index < start)
goto put;
- VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
if (page->index + thp_nr_pages(page) - 1 > end)
goto put;
if (!trylock_page(page))
_
From: Jan Kara <jack(a)suse.cz>
Subject: ocfs2: fix data corruption on truncate
Patch series "ocfs2: Truncate data corruption fix".
As further testing has shown, commit 5314454ea3f ("ocfs2: fix data
corruption after conversion from inline format") didn't fix all the data
corruption issues the customer started observing after 6dbf7bb55598 ("fs:
Don't invalidate page buffers in block_write_full_page()"). This time I
have tracked them down to two bugs in ocfs2 truncation code.
One bug (truncating page cache before clearing tail cluster and setting
i_size) could cause data corruption even before 6dbf7bb55598, but before
that commit it needed a race with page fault. After 6dbf7bb55598 it
started to be pretty deterministic.
Another bug (zeroing pages beyond old i_size) used to be a harmless
inefficiency before commit 6dbf7bb55598. But after commit 6dbf7bb55598, in
combination with the first bug, it resulted in deterministic data
corruption.
Although fixing only the first problem is needed to stop data corruption,
I've fixed both issues to make the code more robust.
This patch (of 2):
ocfs2_truncate_file() unmapped and invalidated page cache pages before
zeroing the partial tail cluster and setting i_size. Thus some pages could
be left (and likely were left if the cluster zeroing happened) in the page
cache beyond i_size after truncate finished, letting the user possibly see
stale data once the file was extended again. Also the tail cluster zeroing
was not guaranteed to finish before truncate finished, causing possible
stale data exposure. The problem started to be particularly easy to hit
after commit 6dbf7bb55598 ("fs: Don't invalidate page buffers in
block_write_full_page()") stopped invalidation of pages beyond i_size from
the page writeback path.
Fix these problems by unmapping and invalidating pages in the page cache
after the i_size is reduced and the tail cluster is zeroed out.
Link: https://lkml.kernel.org/r/20211025150008.29002-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20211025151332.11301-1-jack@suse.cz
Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem")
Signed-off-by: Jan Kara <jack(a)suse.cz>
Reviewed-by: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Cc: Mark Fasheh <mark(a)fasheh.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Changwei Ge <gechangwei(a)live.cn>
Cc: Gang He <ghe(a)suse.com>
Cc: Jun Piao <piaojun(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/file.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/file.c~ocfs2-fix-data-corruption-on-truncate
+++ a/fs/ocfs2/file.c
@@ -476,10 +476,11 @@ int ocfs2_truncate_file(struct inode *in
* greater than page size, so we have to truncate them
* anyway.
*/
- unmap_mapping_range(inode->i_mapping, new_i_size + PAGE_SIZE - 1, 0, 1);
- truncate_inode_pages(inode->i_mapping, new_i_size);
if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) {
+ unmap_mapping_range(inode->i_mapping,
+ new_i_size + PAGE_SIZE - 1, 0, 1);
+ truncate_inode_pages(inode->i_mapping, new_i_size);
status = ocfs2_truncate_inline(inode, di_bh, new_i_size,
i_size_read(inode), 1);
if (status)
@@ -498,6 +499,9 @@ int ocfs2_truncate_file(struct inode *in
goto bail_unlock_sem;
}
+ unmap_mapping_range(inode->i_mapping, new_i_size + PAGE_SIZE - 1, 0, 1);
+ truncate_inode_pages(inode->i_mapping, new_i_size);
+
status = ocfs2_commit_truncate(osb, inode, di_bh);
if (status < 0) {
mlog_errno(status);
_