On Tue, May 1, 2018 at 4:52 PM <gregkh(a)linuxfoundation.org> wrote:
> This is a note to let you know that I've just added the patch titled
> mm, mlock, vmscan: no more skipping pagevecs
> to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
> The filename of the patch is:
> mm-mlock-vmscan-no-more-skipping-pagevecs.patch
> and it can be found in the queue-4.14 subdirectory.
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
I would advise against including this patch in the older stable trees. It
does not fix any critical bug, and on some very specific microbenchmarks on
very large machines it has shown a performance regression. So, I don't think
it is worth the risk.
> From foo@baz Tue May 1 16:18:19 PDT 2018
> From: Shakeel Butt <shakeelb(a)google.com>
> Date: Wed, 21 Feb 2018 14:45:28 -0800
> Subject: mm, mlock, vmscan: no more skipping pagevecs
> From: Shakeel Butt <shakeelb(a)google.com>
> [ Upstream commit 9c4e6b1a7027f102990c0395296015a812525f4d ]
> When a thread mlocks an address space backed either by file pages which
> are currently not present in memory or swapped out anon pages (not in
> swapcache), a new page is allocated and added to the local pagevec
> (lru_add_pvec), I/O is triggered and the thread then sleeps on the page.
> On I/O completion, the thread can wake on a different CPU; the mlock
> syscall will then set the PageMlocked() bit of the page but will not be
> able to put that page on the unevictable LRU as the page is on the pagevec
> of a different CPU. Even on drain, that page will go to an evictable LRU
> because the PageMlocked() bit is not checked on pagevec drain.
> The page will eventually go to the right LRU on reclaim but the LRU stats
> will remain skewed for a long time.
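(For anyone wanting to see the skew, a minimal userspace sketch of the
triggering pattern is below; the file path and sizes are illustrative and
error handling is omitted.)

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            /* Map a file whose pages are not yet in the page cache... */
            int fd = open("/tmp/coldfile", O_RDONLY);
            size_t len = 16 << 20;
            void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);

            /*
             * ...then mlock() it.  The fault path allocates pages, parks them
             * on the faulting CPU's lru_add_pvec and sleeps on I/O; if the
             * task wakes on another CPU, mlock sets PageMlocked() but cannot
             * move pages still sitting on the remote CPU's pagevec.
             */
            mlock(p, len);

            munlock(p, len);
            munmap(p, len);
            close(fd);
            return 0;
    }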
> This patch puts all the pages, even the unevictable ones, onto the pagevecs,
> and on drain the pages are added to the correct LRUs by checking their
> evictability. This resolves the issue of mlocked pages sitting on other
> CPUs' pagevecs, because when those pagevecs are drained the mlocked file
> pages will go to the unevictable LRU. It also makes the race with munlock
> easier to resolve because the pagevec drains happen under the LRU lock.
> However, there is still one place that makes a page evictable and does the
> PageLRU check on that page without the LRU lock, and it needs special
> attention: TestClearPageMlocked() and isolate_lru_page() in clear_page_mlock().
> #0: __pagevec_lru_add_fn              #1: clear_page_mlock
>
> SetPageLRU()                          if (!TestClearPageMlocked())
>                                             return
> smp_mb() // <--required
>                                       // inside does PageLRU
> if (!PageMlocked())                   if (isolate_lru_page())
>       move to evictable LRU                 putback_lru_page()
> else
>       move to unevictable LRU
> In '#1', TestClearPageMlocked() provides full memory barrier semantics and
> thus the PageLRU check (inside isolate_lru_page) cannot be reordered before
> it.
> In '#0', without an explicit memory barrier, the PageMlocked() check can be
> reordered before SetPageLRU(). If that happens, '#0' can put a page on the
> unevictable LRU while '#1' has just cleared the Mlocked bit of that page but
> fails to isolate it, because the PageLRU check fails as '#0' still hasn't
> set the PageLRU bit of that page. That page will be stranded on the
> unevictable LRU.
> There is one (good) side effect, though. Without this patch, the pages
> allocated for a System V shared memory segment are added to evictable LRUs
> even after shmctl(SHM_LOCK) on that segment. This patch will correctly put
> such pages on the unevictable LRU.
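(The SysV case above is just plain shmctl(SHM_LOCK); a rough userspace
sketch, with arbitrary sizes and no error handling:)

    #include <stddef.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
            size_t len = 4 << 20;
            int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
            char *p = shmat(id, NULL, 0);

            /* Lock the segment; its pages belong on the unevictable LRU. */
            shmctl(id, SHM_LOCK, NULL);

            /* Touch the pages; without this patch they land on evictable LRUs. */
            for (size_t i = 0; i < len; i += 4096)
                    p[i] = 1;

            shmdt(p);
            shmctl(id, IPC_RMID, NULL);
            return 0;
    }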
> Link: http://lkml.kernel.org/r/20171121211241.18877-1-shakeelb@google.com
> Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
> Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
> Cc: Jérôme Glisse <jglisse(a)redhat.com>
> Cc: Huang Ying <ying.huang(a)intel.com>
> Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
> Cc: Michal Hocko <mhocko(a)kernel.org>
> Cc: Greg Thelen <gthelen(a)google.com>
> Cc: Johannes Weiner <hannes(a)cmpxchg.org>
> Cc: Balbir Singh <bsingharora(a)gmail.com>
> Cc: Minchan Kim <minchan(a)kernel.org>
> Cc: Shaohua Li <shli(a)fb.com>
> Cc: Jan Kara <jack(a)suse.cz>
> Cc: Nicholas Piggin <npiggin(a)gmail.com>
> Cc: Dan Williams <dan.j.williams(a)intel.com>
> Cc: Mel Gorman <mgorman(a)suse.de>
> Cc: Hugh Dickins <hughd(a)google.com>
> Cc: Vlastimil Babka <vbabka(a)suse.cz>
> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
> Signed-off-by: Sasha Levin <alexander.levin(a)microsoft.com>
> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
> ---
> include/linux/swap.h | 2 -
> mm/mlock.c | 6 +++
> mm/swap.c | 82 +++++++++++++++++++++++++++++----------------------
> mm/vmscan.c | 59 ------------------------------------
> 4 files changed, 54 insertions(+), 95 deletions(-)
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -326,8 +326,6 @@ extern void deactivate_file_page(struct
> extern void mark_page_lazyfree(struct page *page);
> extern void swap_setup(void);
> -extern void add_page_to_unevictable_list(struct page *page);
> -
> extern void lru_cache_add_active_or_unevictable(struct page *page,
> struct vm_area_struct *vma);
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -64,6 +64,12 @@ void clear_page_mlock(struct page *page)
> mod_zone_page_state(page_zone(page), NR_MLOCK,
> -hpage_nr_pages(page));
> count_vm_event(UNEVICTABLE_PGCLEARED);
> + /*
> + * The previous TestClearPageMlocked() corresponds to the smp_mb()
> + * in __pagevec_lru_add_fn().
> + *
> + * See __pagevec_lru_add_fn for more explanation.
> + */
> if (!isolate_lru_page(page)) {
> putback_lru_page(page);
> } else {
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -446,30 +446,6 @@ void lru_cache_add(struct page *page)
> }
> /**
> - * add_page_to_unevictable_list - add a page to the unevictable list
> - * @page: the page to be added to the unevictable list
> - *
> - * Add page directly to its zone's unevictable list. To avoid races with
> - * tasks that might be making the page evictable, through eg. munlock,
> - * munmap or exit, while it's not on the lru, we want to add the page
> - * while it's locked or otherwise "invisible" to other tasks. This is
> - * difficult to do when using the pagevec cache, so bypass that.
> - */
> -void add_page_to_unevictable_list(struct page *page)
> -{
> - struct pglist_data *pgdat = page_pgdat(page);
> - struct lruvec *lruvec;
> -
> - spin_lock_irq(&pgdat->lru_lock);
> - lruvec = mem_cgroup_page_lruvec(page, pgdat);
> - ClearPageActive(page);
> - SetPageUnevictable(page);
> - SetPageLRU(page);
> - add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
> - spin_unlock_irq(&pgdat->lru_lock);
> -}
> -
> -/**
> * lru_cache_add_active_or_unevictable
> * @page: the page to be added to LRU
> * @vma: vma in which page is mapped for determining reclaimability
> @@ -484,13 +460,9 @@ void lru_cache_add_active_or_unevictable
> {
> VM_BUG_ON_PAGE(PageLRU(page), page);
> - if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
> + if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
> SetPageActive(page);
> - lru_cache_add(page);
> - return;
> - }
> -
> - if (!TestSetPageMlocked(page)) {
> + else if (!TestSetPageMlocked(page)) {
> /*
> * We use the irq-unsafe __mod_zone_page_stat because this
> * counter is not modified from interrupt context, and the pte
> @@ -500,7 +472,7 @@ void lru_cache_add_active_or_unevictable
> hpage_nr_pages(page));
> count_vm_event(UNEVICTABLE_PGMLOCKED);
> }
> - add_page_to_unevictable_list(page);
> + lru_cache_add(page);
> }
> /*
> @@ -883,15 +855,55 @@ void lru_add_page_tail(struct page *page
> static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
> void *arg)
> {
> - int file = page_is_file_cache(page);
> - int active = PageActive(page);
> - enum lru_list lru = page_lru(page);
> + enum lru_list lru;
> + int was_unevictable = TestClearPageUnevictable(page);
> VM_BUG_ON_PAGE(PageLRU(page), page);
> SetPageLRU(page);
> + /*
> + * Page becomes evictable in two ways:
> + * 1) Within LRU lock [munlock_vma_pages() and __munlock_pagevec()].
> + * 2) Before acquiring LRU lock to put the page to correct LRU and then
> + * a) do PageLRU check with lock [check_move_unevictable_pages]
> + * b) do PageLRU check before lock [clear_page_mlock]
> + *
> + * (1) & (2a) are ok as LRU lock will serialize them. For (2b), we need
> + * following strict ordering:
> + *
> + * #0: __pagevec_lru_add_fn           #1: clear_page_mlock
> + *
> + * SetPageLRU()                       TestClearPageMlocked()
> + * smp_mb() // explicit ordering      // above provides strict
> + *                                    // ordering
> + * PageMlocked()                      PageLRU()
> + *
> + *
> + * if '#1' does not observe setting of PG_lru by '#0' and fails
> + * isolation, the explicit barrier will make sure that page_evictable
> + * check will put the page in correct LRU. Without smp_mb(), SetPageLRU
> + * can be reordered after PageMlocked check and can make '#1' to fail
> + * the isolation of the page whose Mlocked bit is cleared (#0 is also
> + * looking at the same page) and the evictable page will be stranded
> + * in an unevictable LRU.
> + */
> + smp_mb();
> +
> + if (page_evictable(page)) {
> + lru = page_lru(page);
> + update_page_reclaim_stat(lruvec, page_is_file_cache(page),
> + PageActive(page));
> + if (was_unevictable)
> + count_vm_event(UNEVICTABLE_PGRESCUED);
> + } else {
> + lru = LRU_UNEVICTABLE;
> + ClearPageActive(page);
> + SetPageUnevictable(page);
> + if (!was_unevictable)
> + count_vm_event(UNEVICTABLE_PGCULLED);
> + }
> +
> add_page_to_lru_list(page, lruvec, lru);
> - update_page_reclaim_stat(lruvec, file, active);
> trace_mm_lru_insertion(page, lru);
> }
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -790,64 +790,7 @@ int remove_mapping(struct address_space
> */
> void putback_lru_page(struct page *page)
> {
> - bool is_unevictable;
> - int was_unevictable = PageUnevictable(page);
> -
> - VM_BUG_ON_PAGE(PageLRU(page), page);
> -
> -redo:
> - ClearPageUnevictable(page);
> -
> - if (page_evictable(page)) {
> - /*
> - * For evictable pages, we can use the cache.
> - * In event of a race, worst case is we end up with an
> - * unevictable page on [in]active list.
> - * We know how to handle that.
> - */
> - is_unevictable = false;
> - lru_cache_add(page);
> - } else {
> - /*
> - * Put unevictable pages directly on zone's unevictable
> - * list.
> - */
> - is_unevictable = true;
> - add_page_to_unevictable_list(page);
> - /*
> - * When racing with an mlock or AS_UNEVICTABLE clearing
> - * (page is unlocked) make sure that if the other thread
> - * does not observe our setting of PG_lru and fails
> - * isolation/check_move_unevictable_pages,
> - * we see PG_mlocked/AS_UNEVICTABLE cleared below and move
> - * the page back to the evictable list.
> - *
> - * The other side is TestClearPageMlocked() or shmem_lock().
> - */
> - smp_mb();
> - }
> -
> - /*
> - * page's status can change while we move it among lru. If an evictable
> - * page is on unevictable list, it never be freed. To avoid that,
> - * check after we added it to the list, again.
> - */
> - if (is_unevictable && page_evictable(page)) {
> - if (!isolate_lru_page(page)) {
> - put_page(page);
> - goto redo;
> - }
> - /* This means someone else dropped this page from LRU
> - * So, it will be freed or putback to LRU again. There is
> - * nothing to do here.
> - */
> - }
> -
> - if (was_unevictable && !is_unevictable)
> - count_vm_event(UNEVICTABLE_PGRESCUED);
> - else if (!was_unevictable && is_unevictable)
> - count_vm_event(UNEVICTABLE_PGCULLED);
> -
> + lru_cache_add(page);
> put_page(page); /* drop ref from isolate */
> }
> Patches currently in stable-queue which might be from shakeelb(a)google.com are
> queue-4.14/mm-slab-memcg_link-the-slab-s-kmem_cache.patch
> queue-4.14/mm-mlock-vmscan-no-more-skipping-pagevecs.patch
The latest binutils release (2.29.1) no longer allows proper computation of
GDT entries on 32-bit, with the following warnings:
arch/x86/xen/xen-pvh.S: Assembler messages:
arch/x86/xen/xen-pvh.S:150: Warning: shift count out of range (32 is not between 0 and 31)
arch/x86/xen/xen-pvh.S:150: Warning: shift count out of range (40 is not between 0 and 31)
arch/x86/xen/xen-pvh.S:150: Warning: shift count out of range (32 is not between 0 and 31)
arch/x86/xen/xen-pvh.S:152: Warning: shift count out of range (32 is not between 0 and 31)
arch/x86/xen/xen-pvh.S:152: Warning: shift count out of range (40 is not between 0 and 31)
arch/x86/xen/xen-pvh.S:152: Warning: shift count out of range (32 is not between 0 and 31)
Use the explicit value of the entry instead of the GDT_ENTRY() macro.
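For reference, the explicit quads can be cross-checked against the macro they
replace. A quick userspace sketch (the GDT_ENTRY() definition is paraphrased
from arch/x86/include/asm/segment.h; note the shifts by 32 and 40 that trip
the 32-bit assembler):

    #include <stdio.h>

    /* Paraphrased from arch/x86/include/asm/segment.h */
    #define GDT_ENTRY(flags, base, limit)                           \
            ((((base)  & 0xff000000ULL) << (56 - 24)) |             \
             (((flags) & 0x0000f0ffULL) << 40) |                    \
             (((limit) & 0x000f0000ULL) << (48 - 16)) |             \
             (((base)  & 0x00ffffffULL) << 16) |                    \
              ((limit) & 0x0000ffffULL))

    int main(void)
    {
            /* Prints 00af9a000000ffff, 00cf9a000000ffff, 00cf92000000ffff,
             * matching the explicit values used in the diff below.
             */
            printf("%016llx\n", GDT_ENTRY(0xa09a, 0, 0xfffff));
            printf("%016llx\n", GDT_ENTRY(0xc09a, 0, 0xfffff));
            printf("%016llx\n", GDT_ENTRY(0xc092, 0, 0xfffff));
            return 0;
    }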
Signed-off-by: Boris Ostrovsky <boris.ostrovsky(a)oracle.com>
Cc: stable(a)vger.kernel.org
---
arch/x86/xen/xen-pvh.S | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
index e1a5fbe..934f7d4 100644
--- a/arch/x86/xen/xen-pvh.S
+++ b/arch/x86/xen/xen-pvh.S
@@ -145,11 +145,11 @@ gdt_start:
.quad 0x0000000000000000 /* NULL descriptor */
.quad 0x0000000000000000 /* reserved */
#ifdef CONFIG_X86_64
- .quad GDT_ENTRY(0xa09a, 0, 0xfffff) /* __KERNEL_CS */
+ .quad 0x00af9a000000ffff /* __BOOT_CS */
#else
- .quad GDT_ENTRY(0xc09a, 0, 0xfffff) /* __KERNEL_CS */
+ .quad 0x00cf9a000000ffff /* __BOOT_CS */
#endif
- .quad GDT_ENTRY(0xc092, 0, 0xfffff) /* __KERNEL_DS */
+ .quad 0x00cf92000000ffff /* __BOOT_DS */
gdt_end:
.balign 4
--
2.9.3
The patches in the original Xen Security Advisory 155 covered only backend
drivers, leaving the frontend patches to be "developed and released (publicly)
after the embargo date". This is that series.
Marek Marczykowski-Górecki (6):
xen: Add RING_COPY_RESPONSE()
xen-netfront: copy response out of shared buffer before accessing it
xen-netfront: do not use data already exposed to backend
xen-netfront: add range check for Tx response id
xen-blkfront: make local copy of response before using it
xen-blkfront: prepare request locally, only then put it on the shared ring
drivers/block/xen-blkfront.c | 110 ++++++++++++++++++---------------
drivers/net/xen-netfront.c | 61 +++++++++---------
include/xen/interface/io/ring.h | 14 ++++-
3 files changed, 106 insertions(+), 79 deletions(-)
base-commit: 6d08b06e67cd117f6992c46611dfb4ce267cd71e
--
git-series 0.9.1
The patch titled
Subject: mm, oom: fix concurrent munlock and oom reaper unmap, v3
has been added to the -mm tree. Its filename is
mm-oom-fix-concurrent-munlock-and-oom-reaper-unmap.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-oom-fix-concurrent-munlock-and-…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-oom-fix-concurrent-munlock-and-…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: David Rientjes <rientjes(a)google.com>
Subject: mm, oom: fix concurrent munlock and oom reaper unmap, v3
Since exit_mmap() is done without the protection of mm->mmap_sem, it is
possible for the oom reaper to concurrently operate on an mm until
MMF_OOM_SKIP is set.
This allows munlock_vma_pages_all() to concurrently run while the oom
reaper is operating on a vma. Since munlock_vma_pages_range() depends on
clearing VM_LOCKED from vm_flags before actually doing the munlock to
determine if any other vmas are locking the same memory, the check for
VM_LOCKED in the oom reaper is racy.
This is especially noticeable on architectures such as powerpc where
clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is
zapped by the oom reaper during follow_page_mask() after the check for
pmd_none() is bypassed, this ends up dereferencing a NULL ptl or causes a
kernel oops.
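Schematically, the race looks roughly like this (a sketch of the description
above, not literal kernel code):

    exit_mmap()                          oom_reap_task()
      munlock_vma_pages_all()
        vma->vm_flags &= ~VM_LOCKED
                                           can_madv_dontneed_vma(vma)
                                             /* VM_LOCKED already cleared */
                                           unmap_page_range()
                                             /* zaps the huge pmd */
        follow_page_mask()
          /* pmd_none() check raced: NULL ptl dereference / oops */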
Fix this by manually freeing all possible memory from the mm before doing
the munlock and then setting MMF_OOM_SKIP. The oom reaper cannot run on the
mm anymore, so the munlock is safe to do in exit_mmap(). It also matches the
logic that the oom reaper currently uses for determining when to set
MMF_OOM_SKIP itself, so there's no new risk of excessive oom killing.
This fixes CVE-2018-1000200.
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.cor…
Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Signed-off-by: David Rientjes <rientjes(a)google.com>
Suggested-by: Tetsuo Handa <penguin-kernel(a)I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/oom.h | 2 +
mm/mmap.c | 44 +++++++++++++---------
mm/oom_kill.c | 81 ++++++++++++++++++++++--------------------
3 files changed, 71 insertions(+), 56 deletions(-)
diff -puN include/linux/oom.h~mm-oom-fix-concurrent-munlock-and-oom-reaper-unmap include/linux/oom.h
--- a/include/linux/oom.h~mm-oom-fix-concurrent-munlock-and-oom-reaper-unmap
+++ a/include/linux/oom.h
@@ -95,6 +95,8 @@ static inline int check_stable_address_s
return 0;
}
+void __oom_reap_task_mm(struct mm_struct *mm);
+
extern unsigned long oom_badness(struct task_struct *p,
struct mem_cgroup *memcg, const nodemask_t *nodemask,
unsigned long totalpages);
diff -puN mm/mmap.c~mm-oom-fix-concurrent-munlock-and-oom-reaper-unmap mm/mmap.c
--- a/mm/mmap.c~mm-oom-fix-concurrent-munlock-and-oom-reaper-unmap
+++ a/mm/mmap.c
@@ -3024,6 +3024,32 @@ void exit_mmap(struct mm_struct *mm)
/* mm's last user has gone, and its about to be pulled down */
mmu_notifier_release(mm);
+ if (unlikely(mm_is_oom_victim(mm))) {
+ /*
+ * Manually reap the mm to free as much memory as possible.
+ * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
+ * this mm from further consideration. Taking mm->mmap_sem for
+ * write after setting MMF_OOM_SKIP will guarantee that the oom
+ * reaper will not run on this mm again after mmap_sem is
+ * dropped.
+ *
+ * Nothing can be holding mm->mmap_sem here and the above call
+ * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
+ * __oom_reap_task_mm() will not block.
+ *
+ * This needs to be done before calling munlock_vma_pages_all(),
+ * which clears VM_LOCKED, otherwise the oom reaper cannot
+ * reliably test it.
+ */
+ mutex_lock(&oom_lock);
+ __oom_reap_task_mm(mm);
+ mutex_unlock(&oom_lock);
+
+ set_bit(MMF_OOM_SKIP, &mm->flags);
+ down_write(&mm->mmap_sem);
+ up_write(&mm->mmap_sem);
+ }
+
if (mm->locked_vm) {
vma = mm->mmap;
while (vma) {
@@ -3045,24 +3071,6 @@ void exit_mmap(struct mm_struct *mm)
/* update_hiwater_rss(mm) here? but nobody should be looking */
/* Use -1 here to ensure all VMAs in the mm are unmapped */
unmap_vmas(&tlb, vma, 0, -1);
-
- if (unlikely(mm_is_oom_victim(mm))) {
- /*
- * Wait for oom_reap_task() to stop working on this
- * mm. Because MMF_OOM_SKIP is already set before
- * calling down_read(), oom_reap_task() will not run
- * on this "mm" post up_write().
- *
- * mm_is_oom_victim() cannot be set from under us
- * either because victim->mm is already set to NULL
- * under task_lock before calling mmput and oom_mm is
- * set not NULL by the OOM killer only if victim->mm
- * is found not NULL while holding the task_lock.
- */
- set_bit(MMF_OOM_SKIP, &mm->flags);
- down_write(&mm->mmap_sem);
- up_write(&mm->mmap_sem);
- }
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb, 0, -1);
diff -puN mm/oom_kill.c~mm-oom-fix-concurrent-munlock-and-oom-reaper-unmap mm/oom_kill.c
--- a/mm/oom_kill.c~mm-oom-fix-concurrent-munlock-and-oom-reaper-unmap
+++ a/mm/oom_kill.c
@@ -469,7 +469,6 @@ bool process_shares_mm(struct task_struc
return false;
}
-
#ifdef CONFIG_MMU
/*
* OOM Reaper kernel thread which tries to reap the memory used by the OOM
@@ -480,16 +479,54 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reape
static struct task_struct *oom_reaper_list;
static DEFINE_SPINLOCK(oom_reaper_lock);
-static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+void __oom_reap_task_mm(struct mm_struct *mm)
{
- struct mmu_gather tlb;
struct vm_area_struct *vma;
+
+ /*
+ * Tell all users of get_user/copy_from_user etc... that the content
+ * is no longer stable. No barriers really needed because unmapping
+ * should imply barriers already and the reader would hit a page fault
+ * if it stumbled over a reaped memory.
+ */
+ set_bit(MMF_UNSTABLE, &mm->flags);
+
+ for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+ if (!can_madv_dontneed_vma(vma))
+ continue;
+
+ /*
+ * Only anonymous pages have a good chance to be dropped
+ * without additional steps which we cannot afford as we
+ * are OOM already.
+ *
+ * We do not even care about fs backed pages because all
+ * which are reclaimable have already been reclaimed and
+ * we do not want to block exit_mmap by keeping mm ref
+ * count elevated without a good reason.
+ */
+ if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
+ const unsigned long start = vma->vm_start;
+ const unsigned long end = vma->vm_end;
+ struct mmu_gather tlb;
+
+ tlb_gather_mmu(&tlb, mm, start, end);
+ mmu_notifier_invalidate_range_start(mm, start, end);
+ unmap_page_range(&tlb, vma, start, end, NULL);
+ mmu_notifier_invalidate_range_end(mm, start, end);
+ tlb_finish_mmu(&tlb, start, end);
+ }
+ }
+}
+
+static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+{
bool ret = true;
/*
* We have to make sure to not race with the victim exit path
* and cause premature new oom victim selection:
- * __oom_reap_task_mm              exit_mm
+ * oom_reap_task_mm                exit_mm
*   mmget_not_zero
*                                     mmput
*                                       atomic_dec_and_test
@@ -534,39 +571,8 @@ static bool __oom_reap_task_mm(struct ta
trace_start_task_reaping(tsk->pid);
- /*
- * Tell all users of get_user/copy_from_user etc... that the content
- * is no longer stable. No barriers really needed because unmapping
- * should imply barriers already and the reader would hit a page fault
- * if it stumbled over a reaped memory.
- */
- set_bit(MMF_UNSTABLE, &mm->flags);
-
- for (vma = mm->mmap ; vma; vma = vma->vm_next) {
- if (!can_madv_dontneed_vma(vma))
- continue;
+ __oom_reap_task_mm(mm);
- /*
- * Only anonymous pages have a good chance to be dropped
- * without additional steps which we cannot afford as we
- * are OOM already.
- *
- * We do not even care about fs backed pages because all
- * which are reclaimable have already been reclaimed and
- * we do not want to block exit_mmap by keeping mm ref
- * count elevated without a good reason.
- */
- if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
- const unsigned long start = vma->vm_start;
- const unsigned long end = vma->vm_end;
-
- tlb_gather_mmu(&tlb, mm, start, end);
- mmu_notifier_invalidate_range_start(mm, start, end);
- unmap_page_range(&tlb, vma, start, end, NULL);
- mmu_notifier_invalidate_range_end(mm, start, end);
- tlb_finish_mmu(&tlb, start, end);
- }
- }
pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
task_pid_nr(tsk), tsk->comm,
K(get_mm_counter(mm, MM_ANONPAGES)),
@@ -587,14 +593,13 @@ static void oom_reap_task(struct task_st
struct mm_struct *mm = tsk->signal->oom_mm;
/* Retry the down_read_trylock(mmap_sem) a few times */
- while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
+ while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
schedule_timeout_idle(HZ/10);
if (attempts <= MAX_OOM_REAP_RETRIES ||
test_bit(MMF_OOM_SKIP, &mm->flags))
goto done;
-
pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
task_pid_nr(tsk), tsk->comm);
debug_show_all_locks();
_
Patches currently in -mm which might be from rientjes(a)google.com are
mm-oom-fix-concurrent-munlock-and-oom-reaper-unmap.patch
The patch titled
Subject: mm, oom: fix concurrent munlock and oom reaper unmap
has been removed from the -mm tree. Its filename was
mm-oom-fix-concurrent-munlock-and-oom-reaper-unmap.patch
This patch was dropped because an updated version will be merged
------------------------------------------------------
From: David Rientjes <rientjes(a)google.com>
Subject: mm, oom: fix concurrent munlock and oom reaper unmap
Since exit_mmap() is done without the protection of mm->mmap_sem, it is
possible for the oom reaper to concurrently operate on an mm until
MMF_OOM_SKIP is set.
This allows munlock_vma_pages_all() to concurrently run while the oom
reaper is operating on a vma. Since munlock_vma_pages_range() depends on
clearing VM_LOCKED from vm_flags before actually doing the munlock to
determine if any other vmas are locking the same memory, the check for
VM_LOCKED in the oom reaper is racy.
This is especially noticeable on architectures such as powerpc where
clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is
zapped by the oom reaper during follow_page_mask() after the check for
pmd_none() is bypassed, this ends up dereferencing a NULL ptl.
Fix this by reusing MMF_UNSTABLE to specify that an mm should not be reaped.
This prevents munlock_vma_pages_range() and unmap_page_range() from running
concurrently. The oom reaper will simply not operate on an mm that has the
bit set, and leaves the unmapping to exit_mmap().
[rientjes(a)google.com: v2]
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804171951440.105401@chino.kir.cor…
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804171545460.53786@chino.kir.corp…
Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Signed-off-by: David Rientjes <rientjes(a)google.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Tetsuo Handa <penguin-kernel(a)I-love.SAKURA.ne.jp>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mmap.c | 38 ++++++++++++++++++++------------------
mm/oom_kill.c | 28 +++++++++++++---------------
2 files changed, 33 insertions(+), 33 deletions(-)
diff -puN mm/mmap.c~mm-oom-fix-concurrent-munlock-and-oom-reaper-unmap mm/mmap.c
--- a/mm/mmap.c~mm-oom-fix-concurrent-munlock-and-oom-reaper-unmap
+++ a/mm/mmap.c
@@ -3024,6 +3024,25 @@ void exit_mmap(struct mm_struct *mm)
/* mm's last user has gone, and its about to be pulled down */
mmu_notifier_release(mm);
+ if (unlikely(mm_is_oom_victim(mm))) {
+ /*
+ * Wait for oom_reap_task() to stop working on this mm. Because
+ * MMF_UNSTABLE is already set before calling down_read(),
+ * oom_reap_task() will not run on this mm after up_write().
+ * oom_reap_task() also depends on a stable VM_LOCKED flag to
+ * indicate it should not unmap during munlock_vma_pages_all().
+ *
+ * mm_is_oom_victim() cannot be set from under us because
+ * victim->mm is already set to NULL under task_lock before
+ * calling mmput() and victim->signal->oom_mm is set by the oom
+ * killer only if victim->mm is non-NULL while holding
+ * task_lock().
+ */
+ set_bit(MMF_UNSTABLE, &mm->flags);
+ down_write(&mm->mmap_sem);
+ up_write(&mm->mmap_sem);
+ }
+
if (mm->locked_vm) {
vma = mm->mmap;
while (vma) {
@@ -3045,26 +3064,9 @@ void exit_mmap(struct mm_struct *mm)
/* update_hiwater_rss(mm) here? but nobody should be looking */
/* Use -1 here to ensure all VMAs in the mm are unmapped */
unmap_vmas(&tlb, vma, 0, -1);
-
- if (unlikely(mm_is_oom_victim(mm))) {
- /*
- * Wait for oom_reap_task() to stop working on this
- * mm. Because MMF_OOM_SKIP is already set before
- * calling down_read(), oom_reap_task() will not run
- * on this "mm" post up_write().
- *
- * mm_is_oom_victim() cannot be set from under us
- * either because victim->mm is already set to NULL
- * under task_lock before calling mmput and oom_mm is
- * set not NULL by the OOM killer only if victim->mm
- * is found not NULL while holding the task_lock.
- */
- set_bit(MMF_OOM_SKIP, &mm->flags);
- down_write(&mm->mmap_sem);
- up_write(&mm->mmap_sem);
- }
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb, 0, -1);
+ set_bit(MMF_OOM_SKIP, &mm->flags);
/*
* Walk the list again, actually closing and freeing it,
diff -puN mm/oom_kill.c~mm-oom-fix-concurrent-munlock-and-oom-reaper-unmap mm/oom_kill.c
--- a/mm/oom_kill.c~mm-oom-fix-concurrent-munlock-and-oom-reaper-unmap
+++ a/mm/oom_kill.c
@@ -521,12 +521,17 @@ static bool __oom_reap_task_mm(struct ta
}
/*
- * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
- * work on the mm anymore. The check for MMF_OOM_SKIP must run
+ * Tell all users of get_user/copy_from_user etc... that the content
+ * is no longer stable. No barriers really needed because unmapping
+ * should imply barriers already and the reader would hit a page fault
+ * if it stumbled over reaped memory.
+ *
+ * MMF_UNSTABLE is also set by exit_mmap when the OOM reaper shouldn't
+ * work on the mm anymore. The check for MMF_UNSTABLE must run
* under mmap_sem for reading because it serializes against the
* down_write();up_write() cycle in exit_mmap().
*/
- if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+ if (test_and_set_bit(MMF_UNSTABLE, &mm->flags)) {
up_read(&mm->mmap_sem);
trace_skip_task_reaping(tsk->pid);
goto unlock_oom;
@@ -534,14 +539,6 @@ static bool __oom_reap_task_mm(struct ta
trace_start_task_reaping(tsk->pid);
- /*
- * Tell all users of get_user/copy_from_user etc... that the content
- * is no longer stable. No barriers really needed because unmapping
- * should imply barriers already and the reader would hit a page fault
- * if it stumbled over a reaped memory.
- */
- set_bit(MMF_UNSTABLE, &mm->flags);
-
for (vma = mm->mmap ; vma; vma = vma->vm_next) {
if (!can_madv_dontneed_vma(vma))
continue;
@@ -567,6 +564,7 @@ static bool __oom_reap_task_mm(struct ta
tlb_finish_mmu(&tlb, start, end);
}
}
+ set_bit(MMF_OOM_SKIP, &mm->flags);
pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
task_pid_nr(tsk), tsk->comm,
K(get_mm_counter(mm, MM_ANONPAGES)),
@@ -594,7 +592,6 @@ static void oom_reap_task(struct task_st
test_bit(MMF_OOM_SKIP, &mm->flags))
goto done;
-
pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
task_pid_nr(tsk), tsk->comm);
debug_show_all_locks();
@@ -603,10 +600,11 @@ done:
tsk->oom_reaper_list = NULL;
/*
- * Hide this mm from OOM killer because it has been either reaped or
- * somebody can't call up_write(mmap_sem).
+ * If the oom reaper could not get started on this mm and it has not yet
+ * reached exit_mmap(), set MMF_OOM_SKIP to disregard.
*/
- set_bit(MMF_OOM_SKIP, &mm->flags);
+ if (!test_bit(MMF_UNSTABLE, &mm->flags))
+ set_bit(MMF_OOM_SKIP, &mm->flags);
/* Drop a reference taken by wake_oom_reaper */
put_task_struct(tsk);
_
Patches currently in -mm which might be from rientjes(a)google.com are
On 5/1/2018 6:32 PM, gregkh(a)linuxfoundation.org wrote:
>
> This is a note to let you know that I've just added the patch titled
>
> net: qlge: Eliminate duplicate barriers on weakly-ordered archs
>
> to the 4.16-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> net-qlge-eliminate-duplicate-barriers-on-weakly-ordered-archs.patch
> and it can be found in the queue-4.16 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
>
I don't think these should be pulled at this moment. There is still more
barrier work to be done, and after discussion with Linus the patch below goes
in the wrong direction.
I'm waiting for all architectures to reach functional parity at the end of
4.17 before picking up where I left off.
There are still a few architectures violating the writel() guarantee.
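For context, the ordering pattern at issue is roughly the following (an
illustrative sketch, not the actual qlge code; 'db_reg' and 'prod' stand in
for the driver's doorbell register and producer index):

    #include <linux/io.h>
    #include <linux/types.h>

    static void ring_doorbell(void __iomem *db_reg, u32 prod)
    {
            /*
             * Current mainline: wmb() orders the descriptor writes against
             * the doorbell, and writel() adds another barrier on e.g. arm64.
             */
            wmb();
            writel(prod, db_reg);
    }

    static void ring_doorbell_relaxed(void __iomem *db_reg, u32 prod)
    {
            /*
             * The queued patch: rely on the explicit wmb() and skip the
             * barrier embedded in writel(); mmiowb() keeps MMIO ordering on
             * the architectures that still need it.
             */
            wmb();
            writel_relaxed(prod, db_reg);
            mmiowb();
    }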
>
> From foo@baz Tue May 1 14:59:17 PDT 2018
> From: Sinan Kaya <okaya(a)codeaurora.org>
> Date: Sun, 25 Mar 2018 10:39:19 -0400
> Subject: net: qlge: Eliminate duplicate barriers on weakly-ordered archs
>
> From: Sinan Kaya <okaya(a)codeaurora.org>
>
> [ Upstream commit e42d8cee343a545ac2d9557a3b28708bbca2bd31 ]
>
> Code includes wmb() followed by writel(). writel() already has a barrier on
> some architectures like arm64.
>
> This ends up with the CPU observing two barriers back to back before
> executing the register write.
>
> Create a new wrapper function with a relaxed write operator. Use the new
> wrapper when a write follows a wmb().
>
> Signed-off-by: Sinan Kaya <okaya(a)codeaurora.org>
> Signed-off-by: David S. Miller <davem(a)davemloft.net>
> Signed-off-by: Sasha Levin <alexander.levin(a)microsoft.com>
> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
> ---
> drivers/net/ethernet/qlogic/qlge/qlge.h | 16 ++++++++++++++++
> drivers/net/ethernet/qlogic/qlge/qlge_main.c | 3 ++-
> 2 files changed, 18 insertions(+), 1 deletion(-)
>
> --- a/drivers/net/ethernet/qlogic/qlge/qlge.h
> +++ b/drivers/net/ethernet/qlogic/qlge/qlge.h
> @@ -2185,6 +2185,22 @@ static inline void ql_write_db_reg(u32 v
> }
>
> /*
> + * Doorbell Registers:
> + * Doorbell registers are virtual registers in the PCI memory space.
> + * The space is allocated by the chip during PCI initialization. The
> + * device driver finds the doorbell address in BAR 3 in PCI config space.
> + * The registers are used to control outbound and inbound queues. For
> + * example, the producer index for an outbound queue. Each queue uses
> + * 1 4k chunk of memory. The lower half of the space is for outbound
> + * queues. The upper half is for inbound queues.
> + * Caller has to guarantee ordering.
> + */
> +static inline void ql_write_db_reg_relaxed(u32 val, void __iomem *addr)
> +{
> + writel_relaxed(val, addr);
> +}
> +
> +/*
> * Shadow Registers:
> * Outbound queues have a consumer index that is maintained by the chip.
> * Inbound queues have a producer index that is maintained by the chip.
> --- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c
> +++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
> @@ -2700,7 +2700,8 @@ static netdev_tx_t qlge_send(struct sk_b
> tx_ring->prod_idx = 0;
> wmb();
>
> - ql_write_db_reg(tx_ring->prod_idx, tx_ring->prod_idx_db_reg);
> + ql_write_db_reg_relaxed(tx_ring->prod_idx, tx_ring->prod_idx_db_reg);
> + mmiowb();
> netif_printk(qdev, tx_queued, KERN_DEBUG, qdev->ndev,
> "tx queued, slot %d, len %d\n",
> tx_ring->prod_idx, skb->len);
>
>
> Patches currently in stable-queue which might be from okaya(a)codeaurora.org are
>
> queue-4.16/net-qlge-eliminate-duplicate-barriers-on-weakly-ordered-archs.patch
>
--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.