From: Andrey Ryabinin <aryabinin@virtuozzo.com>
Subject: mm/vmscan: wake up flushers for legacy cgroups too
Commit 726d061fbd36 ("mm: vmscan: kick flushers when we encounter dirty
pages on the LRU") added flusher invocation to shrink_inactive_list() when
many dirty pages on the LRU are encountered.
However, that new wakeup is not reached during legacy cgroup reclaim, and the follow-up commit bbef938429f5 ("mm: vmscan: remove old flusher
wakeup from direct reclaim path") then removed the only remaining source of flusher wakeups on the legacy memcg reclaim path.
This leads to premature OOM if there are too many dirty pages in the cgroup:
# mkdir /sys/fs/cgroup/memory/test
# echo $$ > /sys/fs/cgroup/memory/test/tasks
# echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
# dd if=/dev/zero of=tmp_file bs=1M count=100
Killed
dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
Call Trace:
dump_stack+0x46/0x65
dump_header+0x6b/0x2ac
oom_kill_process+0x21c/0x4a0
out_of_memory+0x2a5/0x4b0
mem_cgroup_out_of_memory+0x3b/0x60
mem_cgroup_oom_synchronize+0x2ed/0x330
pagefault_out_of_memory+0x24/0x54
__do_page_fault+0x521/0x540
page_fault+0x45/0x50
Task in /test killed as a result of limit of /test
memory: usage 51200kB, limit 51200kB, failcnt 73
memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Wake up flushers in legacy cgroup reclaim too.
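As a minimal sketch of the intended control flow (simplified from 4.16's shrink_inactive_list(); assumes sane_reclaim() is false for legacy cgroup v1 reclaim and elides the surrounding throttling logic):

	/* Hoisted above the sane_reclaim() block so cgroup v1 reaches it too */
	if (stat.nr_unqueued_dirty == nr_taken)
		wakeup_flusher_threads(WB_REASON_VMSCAN);

	if (sane_reclaim(sc)) {
		/* The wakeup used to sit in here, unreachable for legacy memcg */
		if (stat.nr_unqueued_dirty == nr_taken)
			set_bit(PGDAT_DIRTY, &pgdat->flags);
	}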
Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com
Fixes: bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/vmscan.c | 31 ++++++++++++++++---------------
1 file changed, 16 insertions(+), 15 deletions(-)
diff -puN mm/vmscan.c~mm-vmscan-wake-up-flushers-for-legacy-cgroups-too mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-wake-up-flushers-for-legacy-cgroups-too
+++ a/mm/vmscan.c
@@ -1780,6 +1780,20 @@ shrink_inactive_list(unsigned long nr_to
set_bit(PGDAT_WRITEBACK, &pgdat->flags);
/*
+ * If dirty pages are scanned that are not queued for IO, it
+ * implies that flushers are not doing their job. This can
+ * happen when memory pressure pushes dirty pages to the end of
+ * the LRU before the dirty limits are breached and the dirty
+ * data has expired. It can also happen when the proportion of
+ * dirty pages grows not through writes but through memory
+ * pressure reclaiming all the clean cache. And in some cases,
+ * the flushers simply cannot keep up with the allocation
+ * rate. Nudge the flusher threads in case they are asleep.
+ */
+ if (stat.nr_unqueued_dirty == nr_taken)
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+ /*
* Legacy memcg will stall in page writeback so avoid forcibly
* stalling here.
*/
@@ -1791,22 +1805,9 @@ shrink_inactive_list(unsigned long nr_to
if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
set_bit(PGDAT_CONGESTED, &pgdat->flags);
- /*
- * If dirty pages are scanned that are not queued for IO, it
- * implies that flushers are not doing their job. This can
- * happen when memory pressure pushes dirty pages to the end of
- * the LRU before the dirty limits are breached and the dirty
- * data has expired. It can also happen when the proportion of
- * dirty pages grows not through writes but through memory
- * pressure reclaiming all the clean cache. And in some cases,
- * the flushers simply cannot keep up with the allocation
- * rate. Nudge the flusher threads in case they are asleep, but
- * also allow kswapd to start writing pages during reclaim.
- */
- if (stat.nr_unqueued_dirty == nr_taken) {
- wakeup_flusher_threads(WB_REASON_VMSCAN);
+ /* Allow kswapd to start writing pages during reclaim. */
+ if (stat.nr_unqueued_dirty == nr_taken)
set_bit(PGDAT_DIRTY, &pgdat->flags);
- }
/*
* If kswapd scans pages marked marked for immediate
_
From: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Subject: mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink()
shmem_unused_huge_shrink() gets called from the reclaim path. Waiting for
the page lock there may lead to deadlock.
There was a bug report that may be attributed to this:
http://lkml.kernel.org/r/alpine.LRH.2.11.1801242349220.30642@mail.ewheeler.…
Replace lock_page() with trylock_page() and skip the page if we fail to
lock it. We will get back to the page on the next scan.
We can test PageTransHuge() outside the page lock, as we only need
protection against the page being split under us; holding a pin on the page
is enough for this.
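The resulting pattern, as a minimal sketch (the deadlock scenario being that the task entering reclaim may itself hold the lock on this very page, e.g. if it locked the page and then blocked in an allocation that recursed into the shrinker):

	page = find_get_page(mapping, index);	/* pin only, do not lock */
	...
	if (!trylock_page(page)) {	/* never sleep on a page lock in reclaim */
		put_page(page);
		goto leave;	/* keep inode on the list; retry on the next scan */
	}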
Link: http://lkml.kernel.org/r/20180316210830.43738-1-kirill.shutemov@linux.intel…
Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Eric Wheeler <linux-mm@lists.ewheeler.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/shmem.c | 31 ++++++++++++++++++++-----------
1 file changed, 20 insertions(+), 11 deletions(-)
diff -puN mm/shmem.c~mm-shmem-do-not-wait-for-lock_page-in-shmem_unused_huge_shrink mm/shmem.c
--- a/mm/shmem.c~mm-shmem-do-not-wait-for-lock_page-in-shmem_unused_huge_shrink
+++ a/mm/shmem.c
@@ -493,36 +493,45 @@ next:
info = list_entry(pos, struct shmem_inode_info, shrinklist);
inode = &info->vfs_inode;
- if (nr_to_split && split >= nr_to_split) {
- iput(inode);
- continue;
- }
+ if (nr_to_split && split >= nr_to_split)
+ goto leave;
- page = find_lock_page(inode->i_mapping,
+ page = find_get_page(inode->i_mapping,
(inode->i_size & HPAGE_PMD_MASK) >> PAGE_SHIFT);
if (!page)
goto drop;
+ /* No huge page at the end of the file: nothing to split */
if (!PageTransHuge(page)) {
- unlock_page(page);
put_page(page);
goto drop;
}
+ /*
+ * Leave the inode on the list if we failed to lock
+ * the page at this time.
+ *
+ * Waiting for the lock may lead to deadlock in the
+ * reclaim path.
+ */
+ if (!trylock_page(page)) {
+ put_page(page);
+ goto leave;
+ }
+
ret = split_huge_page(page);
unlock_page(page);
put_page(page);
- if (ret) {
- /* split failed: leave it on the list */
- iput(inode);
- continue;
- }
+ /* If split failed leave the inode on the list */
+ if (ret)
+ goto leave;
split++;
drop:
list_del_init(&info->shrinklist);
removed++;
+leave:
iput(inode);
}
_
From: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Subject: mm/thp: do not wait for lock_page() in deferred_split_scan()
deferred_split_scan() gets called from the reclaim path. Waiting for the
page lock there may lead to deadlock.
Replace lock_page() with trylock_page() and skip the page if we fail to
lock it. We will get back to the page on the next scan.
Link: http://lkml.kernel.org/r/20180315150747.31945-1-kirill.shutemov@linux.intel…
Fixes: 9a982250f773 ("thp: introduce deferred_split_huge_page()")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/huge_memory.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff -puN mm/huge_memory.c~mm-thp-do-not-wait-for-lock_page-in-deferred_split_scan mm/huge_memory.c
--- a/mm/huge_memory.c~mm-thp-do-not-wait-for-lock_page-in-deferred_split_scan
+++ a/mm/huge_memory.c
@@ -2783,11 +2783,13 @@ static unsigned long deferred_split_scan
list_for_each_safe(pos, next, &list) {
page = list_entry((void *)pos, struct page, mapping);
- lock_page(page);
+ if (!trylock_page(page))
+ goto next;
/* split_huge_page() removes page from list on success */
if (!split_huge_page(page))
split++;
unlock_page(page);
+next:
put_page(page);
}
_
From: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Subject: mm/khugepaged.c: convert VM_BUG_ON() to collapse fail
khugepaged is not yet able to convert PTE-mapped huge pages back to
PMD-mapped ones, so we do not collapse such pages; see the check in
khugepaged_scan_pmd().
But if, between khugepaged_scan_pmd() and __collapse_huge_page_isolate(),
somebody manages to instantiate a THP in the range and then split the PMD
back to PTEs, VM_BUG_ON_PAGE(PageCompound(page)) triggers.
This is possible because we drop mmap_sem during collapse and re-take it
for write.
Replace the VM_BUG_ON() with a graceful collapse failure.
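An illustrative timeline of the race (simplified):

	khugepaged				another thread
	----------				--------------
	khugepaged_scan_pmd()
	  sees no compound pages
	up_read(mmap_sem)
						fault a THP into the range
						split the PMD back to PTEs
	down_write(mmap_sem)
	__collapse_huge_page_isolate()
	  PageCompound(page) is true
	  -> formerly VM_BUG_ON, now fails with SCAN_PAGE_COMPOUND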
Link: http://lkml.kernel.org/r/20180315152353.27989-1-kirill.shutemov@linux.intel…
Fixes: b1caa957ae6d ("khugepaged: ignore pmd tables with THP mapped with ptes")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/khugepaged.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff -puN mm/khugepaged.c~mm-khugepaged-convert-vm_bug_on-to-collapse-fail mm/khugepaged.c
--- a/mm/khugepaged.c~mm-khugepaged-convert-vm_bug_on-to-collapse-fail
+++ a/mm/khugepaged.c
@@ -530,7 +530,12 @@ static int __collapse_huge_page_isolate(
goto out;
}
- VM_BUG_ON_PAGE(PageCompound(page), page);
+ /* TODO: teach khugepaged to collapse THP mapped with pte */
+ if (PageCompound(page)) {
+ result = SCAN_PAGE_COMPOUND;
+ goto out;
+ }
+
VM_BUG_ON_PAGE(!PageAnon(page), page);
/*
_
From: Arnd Bergmann <arnd@arndb.de>
Subject: h8300: remove extraneous __BIG_ENDIAN definition
A bugfix I did earlier caused a build regression on h8300, which defines
the __BIG_ENDIAN macro in a slightly different way than the generic code does:
arch/h8300/include/asm/byteorder.h:5:0: warning: "__BIG_ENDIAN" redefined
We don't need to define it here: the same macro is already provided by
linux/byteorder/big_endian.h, and that version does not conflict.
While this is a v4.16 regression, my earlier patch also got backported to
the 4.14 and 4.15 stable kernels, so we need the fixup there as well.
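For illustration, the clash looks roughly like this (assuming kconfig.h as of commit 101110f6271c; both definitions expand to 4321, but the replacement lists differ textually, which is what makes the preprocessor warn):

	#define __BIG_ENDIAN 4321			/* include/linux/kconfig.h */
	#define __BIG_ENDIAN __ORDER_BIG_ENDIAN__	/* asm/byteorder.h:5 */

linux/byteorder/big_endian.h, by contrast, guards its definition with #ifndef __BIG_ENDIAN, which is why it never conflicts.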
Link: http://lkml.kernel.org/r/20180313120752.2645129-1-arnd@arndb.de
Fixes: 101110f6271c ("Kbuild: always define endianess in kconfig.h")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/h8300/include/asm/byteorder.h | 1 -
1 file changed, 1 deletion(-)
diff -puN arch/h8300/include/asm/byteorder.h~h8300-remove-extranous-__big_endian-definition arch/h8300/include/asm/byteorder.h
--- a/arch/h8300/include/asm/byteorder.h~h8300-remove-extranous-__big_endian-definition
+++ a/arch/h8300/include/asm/byteorder.h
@@ -2,7 +2,6 @@
#ifndef __H8300_BYTEORDER_H__
#define __H8300_BYTEORDER_H__
-#define __BIG_ENDIAN __ORDER_BIG_ENDIAN__
#include <linux/byteorder/big_endian.h>
#endif
_
From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlbfs: check for pgoff value overflow
A vma with vm_pgoff large enough to overflow a loff_t type when converted
to a byte offset can be passed via the remap_file_pages system call. The
hugetlbfs mmap routine uses the byte offset to calculate reservations and
file size.
A sequence such as:
mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);
will result in the following when the task exits or the file is closed:
kernel BUG at mm/hugetlb.c:749!
Call Trace:
hugetlbfs_evict_inode+0x2f/0x40
evict+0xcb/0x190
__dentry_kill+0xcb/0x150
__fput+0x164/0x1e0
task_work_run+0x84/0xa0
exit_to_usermode_loop+0x7d/0x80
do_syscall_64+0x18b/0x190
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
The overflowed pgoff value causes hugetlbfs to try to set up a mapping
with a negative range (end < start), which leaves invalid state behind and
ultimately triggers the BUG.
The previous overflow fix to this code was incomplete and did not take the
remap_file_pages() system call into account.
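A worked instance of the overflow, assuming a 64-bit long and PAGE_SHIFT == 12 (macro names as in the patch below):

	vm_pgoff = 0x20000000000000		/* 1UL << 53, from the reproducer */

	/* Old check: the shift overflows and in practice wraps to 0,
	 * so the "< 0" test lets the bogus offset through: */
	((loff_t)vm_pgoff << PAGE_SHIFT) < 0	/* 1 << 65 wraps to 0: false */

	/* New check: PGOFF_LOFFT_MAX covers the top PAGE_SHIFT + 1 = 13 bits
	 * (bits 51..63); bit 53 is among them, so the mmap is rejected: */
	vm_pgoff & (((1UL << 13) - 1) << 51)	/* nonzero -> -EINVAL */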
[mike.kravetz@oracle.com: v3]
Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com
[akpm@linux-foundation.org: include mmdebug.h]
[akpm@linux-foundation.org: fix -ve left shift count on sh]
Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com
Fixes: 045c7a3f53d9 ("hugetlbfs: fix offset overflow in hugetlbfs mmap")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Nic Losby <blurbdust@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
fs/hugetlbfs/inode.c | 17 ++++++++++++++---
mm/hugetlb.c | 7 +++++++
2 files changed, 21 insertions(+), 3 deletions(-)
diff -puN fs/hugetlbfs/inode.c~hugetlbfs-check-for-pgoff-value-overflow fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c~hugetlbfs-check-for-pgoff-value-overflow
+++ a/fs/hugetlbfs/inode.c
@@ -108,6 +108,16 @@ static void huge_pagevec_release(struct
pagevec_reinit(pvec);
}
+/*
+ * Mask used when checking the page offset value passed in via system
+ * calls. This value will be converted to a loff_t which is signed.
+ * Therefore, we want to check the upper PAGE_SHIFT + 1 bits of the
+ * value. The extra bit (- 1 in the shift value) is to take the sign
+ * bit into account.
+ */
+#define PGOFF_LOFFT_MAX \
+ (((1UL << (PAGE_SHIFT + 1)) - 1) << (BITS_PER_LONG - (PAGE_SHIFT + 1)))
+
static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
{
struct inode *inode = file_inode(file);
@@ -127,12 +137,13 @@ static int hugetlbfs_file_mmap(struct fi
vma->vm_ops = &hugetlb_vm_ops;
/*
- * Offset passed to mmap (before page shift) could have been
- * negative when represented as a (l)off_t.
+ * page based offset in vm_pgoff could be sufficiently large to
+ * overflow a (l)off_t when converted to byte offset.
*/
- if (((loff_t)vma->vm_pgoff << PAGE_SHIFT) < 0)
+ if (vma->vm_pgoff & PGOFF_LOFFT_MAX)
return -EINVAL;
+ /* must be huge page aligned */
if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
return -EINVAL;
diff -puN mm/hugetlb.c~hugetlbfs-check-for-pgoff-value-overflow mm/hugetlb.c
--- a/mm/hugetlb.c~hugetlbfs-check-for-pgoff-value-overflow
+++ a/mm/hugetlb.c
@@ -18,6 +18,7 @@
#include <linux/bootmem.h>
#include <linux/sysfs.h>
#include <linux/slab.h>
+#include <linux/mmdebug.h>
#include <linux/sched/signal.h>
#include <linux/rmap.h>
#include <linux/string_helpers.h>
@@ -4374,6 +4375,12 @@ int hugetlb_reserve_pages(struct inode *
struct resv_map *resv_map;
long gbl_reserve;
+ /* This should never happen */
+ if (from > to) {
+ VM_WARN(1, "%s called with a negative range\n", __func__);
+ return -EINVAL;
+ }
+
/*
* Only apply hugepage reservation if asked. At fault time, an
* attempt will be made for VM_NORESERVE to allocate a page
_