From: Wengang Wang <wen.gang.wang@oracle.com>
Subject: ocfs2: initialize ip_next_orphan
Though the problem was found on an older 4.1.12 kernel, I think upstream
has the same issue.
On one node in the cluster, there is the following call trace:
# cat /proc/21473/stack
[<ffffffffc09a2f06>] __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
[<ffffffffc09a4481>] ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
[<ffffffffc09b2ce2>] ocfs2_evict_inode+0x152/0x820 [ocfs2]
[<ffffffff8122b36e>] evict+0xae/0x1a0
[<ffffffff8122bd26>] iput+0x1c6/0x230
[<ffffffffc09b60ed>] ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
[<ffffffffc0992ae0>] ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
[<ffffffffc099a1e9>] ocfs2_dir_foreach+0x29/0x30 [ocfs2]
[<ffffffffc09b7716>] ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
[<ffffffffc09b9b4e>] ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
[<ffffffff810a1399>] process_one_work+0x169/0x4a0
[<ffffffff810a1bcb>] worker_thread+0x5b/0x560
[<ffffffff810a7a2b>] kthread+0xcb/0xf0
[<ffffffff816f5d21>] ret_from_fork+0x61/0x90
[<ffffffffffffffff>] 0xffffffffffffffff
The above stack is not reasonable; the final iput() shouldn't happen in
the ocfs2_orphan_filldir() function. Looking at the code,
2067 /* Skip inodes which are already added to recover list, since dio may
2068 * happen concurrently with unlink/rename */
2069 if (OCFS2_I(iter)->ip_next_orphan) {
2070 iput(iter);
2071 return 0;
2072 }
2073
The logic thinks the inode is already in the recovery list on seeing
that ip_next_orphan is non-NULL, so it skips this inode after dropping
the reference that was taken in ocfs2_iget().
However, if the inode really were in the recovery list, it would hold
another reference, and the iput() at line 2070 would not be the final
iput() (dropping the last reference). So I don't think the inode is
really in the recovery list (no vmcore to confirm).
Note that ocfs2_queue_orphans(), though not shown in the call trace, is
holding the cluster lock on the orphan directory while looking up
unlinked inodes. The on-disk inode eviction can involve a lot of I/O,
which may take a long time to finish. That means this node could hold
the cluster lock for a very long time, which can leave lock requests on
the orphan directory from other nodes hanging for a long time.
Looking more closely at ip_next_orphan, I found that it is never
initialized when a new ocfs2_inode_info structure is allocated: the slab
constructor ocfs2_inode_init_once() does not set it, so it starts out as
whatever happened to be in memory. This causes the reflink operations
from some nodes to hang for a very long time waiting for the cluster
lock on the orphan directory.
Fix: initialize ip_next_orphan to NULL.
Link: https://lkml.kernel.org/r/20201109171746.27884-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
fs/ocfs2/super.c | 1 +
1 file changed, 1 insertion(+)
--- a/fs/ocfs2/super.c~ocfs2-initialize-ip_next_orphan
+++ a/fs/ocfs2/super.c
@@ -1713,6 +1713,7 @@ static void ocfs2_inode_init_once(void *
oi->ip_blkno = 0ULL;
oi->ip_clusters = 0;
+ oi->ip_next_orphan = NULL;
ocfs2_resv_init_once(&oi->ip_la_data_resv);
_
From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlbfs: fix anon huge page migration race
Qian Cai reported the following BUG in [1]
[ 6147.019063][T45242] LTP: starting move_pages12
[ 6147.475680][T64921] BUG: unable to handle page fault for address: ffffffffffffffe0
...
[ 6147.525866][T64921] RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170
avc_start_pgoff at mm/interval_tree.c:63
[ 6147.620914][T64921] Call Trace:
[ 6147.624078][T64921] rmap_walk_anon+0x141/0xa30
rmap_walk_anon at mm/rmap.c:1864
[ 6147.628639][T64921] try_to_unmap+0x209/0x2d0
try_to_unmap at mm/rmap.c:1763
[ 6147.633026][T64921] ? rmap_walk_locked+0x140/0x140
[ 6147.637936][T64921] ? page_remove_rmap+0x1190/0x1190
[ 6147.643020][T64921] ? page_not_mapped+0x10/0x10
[ 6147.647668][T64921] ? page_get_anon_vma+0x290/0x290
[ 6147.652664][T64921] ? page_mapcount_is_zero+0x10/0x10
[ 6147.657838][T64921] ? hugetlb_page_mapping_lock_write+0x97/0x180
[ 6147.663972][T64921] migrate_pages+0x1005/0x1fb0
[ 6147.668617][T64921] ? remove_migration_pte+0xac0/0xac0
[ 6147.673875][T64921] move_pages_and_store_status.isra.47+0xd7/0x1a0
[ 6147.680181][T64921] ? migrate_pages+0x1fb0/0x1fb0
[ 6147.685002][T64921] __x64_sys_move_pages+0xa5c/0x1100
[ 6147.690176][T64921] ? trace_hardirqs_on+0x20/0x1b5
[ 6147.695084][T64921] ? move_pages_and_store_status.isra.47+0x1a0/0x1a0
[ 6147.701653][T64921] ? rcu_read_lock_sched_held+0xaa/0xd0
[ 6147.707088][T64921] ? switch_fpu_return+0x196/0x400
[ 6147.712083][T64921] ? lockdep_hardirqs_on_prepare+0x38c/0x550
[ 6147.717954][T64921] ? do_syscall_64+0x24/0x310
[ 6147.722513][T64921] do_syscall_64+0x5f/0x310
[ 6147.726897][T64921] ? trace_hardirqs_off+0x12/0x1a0
[ 6147.731894][T64921] ? asm_exc_page_fault+0x8/0x30
[ 6147.736714][T64921] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Hugh Dickins diagnosed this as a migration bug caused by code introduced
to use i_mmap_rwsem for pmd sharing synchronization. Specifically, the
routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
flag to try_to_unmap() while holding i_mmap_rwsem. This is wrong for
anon pages, as the anon_vma lock should be held in this case. Further
analysis suggested that i_mmap_rwsem does not need to be held at all
when calling try_to_unmap() for anon pages, as an anon page can never be
part of a shared pmd mapping.
Discussion also revealed that the hack in hugetlb_page_mapping_lock_write()
to drop the page lock and acquire i_mmap_rwsem is wrong. There is no way
to keep the mapping valid while dropping the page lock.
This patch does the following:
- Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
calling try_to_unmap.
- Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
will now simply do a 'trylock' while still holding the page lock. If
the trylock fails, it will return NULL. This could impact the callers:
- migration calling code will receive -EAGAIN and retry up to the
hard-coded limit (10).
- memory error code will treat the page as BUSY. This will force
killing (SIGKILL) of any mapping tasks instead of sending them SIGBUS.
Do note that this change in behavior only happens when there is a race.
None of the standard kernel testing suites actually hit this race, but
it is possible.
[1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
[2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.a…
Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com
Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Qian Cai <cai@lca.pw>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/hugetlb.c | 90 ++----------------------------------------
mm/memory-failure.c | 36 +++++++---------
mm/migrate.c | 46 +++++++++++----------
mm/rmap.c | 5 --
4 files changed, 48 insertions(+), 129 deletions(-)
--- a/mm/hugetlb.c~hugetlbfs-fix-anon-huge-page-migration-race
+++ a/mm/hugetlb.c
@@ -1568,103 +1568,23 @@ int PageHeadHuge(struct page *page_head)
}
/*
- * Find address_space associated with hugetlbfs page.
- * Upon entry page is locked and page 'was' mapped although mapped state
- * could change. If necessary, use anon_vma to find vma and associated
- * address space. The returned mapping may be stale, but it can not be
- * invalid as page lock (which is held) is required to destroy mapping.
- */
-static struct address_space *_get_hugetlb_page_mapping(struct page *hpage)
-{
- struct anon_vma *anon_vma;
- pgoff_t pgoff_start, pgoff_end;
- struct anon_vma_chain *avc;
- struct address_space *mapping = page_mapping(hpage);
-
- /* Simple file based mapping */
- if (mapping)
- return mapping;
-
- /*
- * Even anonymous hugetlbfs mappings are associated with an
- * underlying hugetlbfs file (see hugetlb_file_setup in mmap
- * code). Find a vma associated with the anonymous vma, and
- * use the file pointer to get address_space.
- */
- anon_vma = page_lock_anon_vma_read(hpage);
- if (!anon_vma)
- return mapping; /* NULL */
-
- /* Use first found vma */
- pgoff_start = page_to_pgoff(hpage);
- pgoff_end = pgoff_start + pages_per_huge_page(page_hstate(hpage)) - 1;
- anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
- pgoff_start, pgoff_end) {
- struct vm_area_struct *vma = avc->vma;
-
- mapping = vma->vm_file->f_mapping;
- break;
- }
-
- anon_vma_unlock_read(anon_vma);
- return mapping;
-}
-
-/*
* Find and lock address space (mapping) in write mode.
*
- * Upon entry, the page is locked which allows us to find the mapping
- * even in the case of an anon page. However, locking order dictates
- * the i_mmap_rwsem be acquired BEFORE the page lock. This is hugetlbfs
- * specific. So, we first try to lock the sema while still holding the
- * page lock. If this works, great! If not, then we need to drop the
- * page lock and then acquire i_mmap_rwsem and reacquire page lock. Of
- * course, need to revalidate state along the way.
+ * Upon entry, the page is locked which means that page_mapping() is
+ * stable. Due to locking order, we can only trylock_write. If we can
+ * not get the lock, simply return NULL to caller.
*/
struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage)
{
- struct address_space *mapping, *mapping2;
+ struct address_space *mapping = page_mapping(hpage);
- mapping = _get_hugetlb_page_mapping(hpage);
-retry:
if (!mapping)
return mapping;
- /*
- * If no contention, take lock and return
- */
if (i_mmap_trylock_write(mapping))
return mapping;
- /*
- * Must drop page lock and wait on mapping sema.
- * Note: Once page lock is dropped, mapping could become invalid.
- * As a hack, increase map count until we lock page again.
- */
- atomic_inc(&hpage->_mapcount);
- unlock_page(hpage);
- i_mmap_lock_write(mapping);
- lock_page(hpage);
- atomic_add_negative(-1, &hpage->_mapcount);
-
- /* verify page is still mapped */
- if (!page_mapped(hpage)) {
- i_mmap_unlock_write(mapping);
- return NULL;
- }
-
- /*
- * Get address space again and verify it is the same one
- * we locked. If not, drop lock and retry.
- */
- mapping2 = _get_hugetlb_page_mapping(hpage);
- if (mapping2 != mapping) {
- i_mmap_unlock_write(mapping);
- mapping = mapping2;
- goto retry;
- }
-
- return mapping;
+ return NULL;
}
pgoff_t __basepage_index(struct page *page)
--- a/mm/memory-failure.c~hugetlbfs-fix-anon-huge-page-migration-race
+++ a/mm/memory-failure.c
@@ -1057,27 +1057,25 @@ static bool hwpoison_user_mappings(struc
if (!PageHuge(hpage)) {
unmap_success = try_to_unmap(hpage, ttu);
} else {
- /*
- * For hugetlb pages, try_to_unmap could potentially call
- * huge_pmd_unshare. Because of this, take semaphore in
- * write mode here and set TTU_RMAP_LOCKED to indicate we
- * have taken the lock at this higer level.
- *
- * Note that the call to hugetlb_page_mapping_lock_write
- * is necessary even if mapping is already set. It handles
- * ugliness of potentially having to drop page lock to obtain
- * i_mmap_rwsem.
- */
- mapping = hugetlb_page_mapping_lock_write(hpage);
-
- if (mapping) {
- unmap_success = try_to_unmap(hpage,
+ if (!PageAnon(hpage)) {
+ /*
+ * For hugetlb pages in shared mappings, try_to_unmap
+ * could potentially call huge_pmd_unshare. Because of
+ * this, take semaphore in write mode here and set
+ * TTU_RMAP_LOCKED to indicate we have taken the lock
+ * at this higer level.
+ */
+ mapping = hugetlb_page_mapping_lock_write(hpage);
+ if (mapping) {
+ unmap_success = try_to_unmap(hpage,
ttu|TTU_RMAP_LOCKED);
- i_mmap_unlock_write(mapping);
+ i_mmap_unlock_write(mapping);
+ } else {
+ pr_info("Memory failure: %#lx: could not lock mapping for mapped huge page\n", pfn);
+ unmap_success = false;
+ }
} else {
- pr_info("Memory failure: %#lx: could not find mapping for mapped huge page\n",
- pfn);
- unmap_success = false;
+ unmap_success = try_to_unmap(hpage, ttu);
}
}
if (!unmap_success)
--- a/mm/migrate.c~hugetlbfs-fix-anon-huge-page-migration-race
+++ a/mm/migrate.c
@@ -1328,34 +1328,38 @@ static int unmap_and_move_huge_page(new_
goto put_anon;
if (page_mapped(hpage)) {
- /*
- * try_to_unmap could potentially call huge_pmd_unshare.
- * Because of this, take semaphore in write mode here and
- * set TTU_RMAP_LOCKED to let lower levels know we have
- * taken the lock.
- */
- mapping = hugetlb_page_mapping_lock_write(hpage);
- if (unlikely(!mapping))
- goto unlock_put_anon;
-
- try_to_unmap(hpage,
- TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
- TTU_RMAP_LOCKED);
+ bool mapping_locked = false;
+ enum ttu_flags ttu = TTU_MIGRATION|TTU_IGNORE_MLOCK|
+ TTU_IGNORE_ACCESS;
+
+ if (!PageAnon(hpage)) {
+ /*
+ * In shared mappings, try_to_unmap could potentially
+ * call huge_pmd_unshare. Because of this, take
+ * semaphore in write mode here and set TTU_RMAP_LOCKED
+ * to let lower levels know we have taken the lock.
+ */
+ mapping = hugetlb_page_mapping_lock_write(hpage);
+ if (unlikely(!mapping))
+ goto unlock_put_anon;
+
+ mapping_locked = true;
+ ttu |= TTU_RMAP_LOCKED;
+ }
+
+ try_to_unmap(hpage, ttu);
page_was_mapped = 1;
- /*
- * Leave mapping locked until after subsequent call to
- * remove_migration_ptes()
- */
+
+ if (mapping_locked)
+ i_mmap_unlock_write(mapping);
}
if (!page_mapped(hpage))
rc = move_to_new_page(new_hpage, hpage, mode);
- if (page_was_mapped) {
+ if (page_was_mapped)
remove_migration_ptes(hpage,
- rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage, true);
- i_mmap_unlock_write(mapping);
- }
+ rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage, false);
unlock_put_anon:
unlock_page(new_hpage);
--- a/mm/rmap.c~hugetlbfs-fix-anon-huge-page-migration-race
+++ a/mm/rmap.c
@@ -1413,9 +1413,6 @@ static bool try_to_unmap_one(struct page
/*
* If sharing is possible, start and end will be adjusted
* accordingly.
- *
- * If called for a huge page, caller must hold i_mmap_rwsem
- * in write mode as it is possible to call huge_pmd_unshare.
*/
adjust_range_if_pmd_sharing_possible(vma, &range.start,
&range.end);
@@ -1462,7 +1459,7 @@ static bool try_to_unmap_one(struct page
subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
address = pvmw.address;
- if (PageHuge(page)) {
+ if (PageHuge(page) && !PageAnon(page)) {
/*
* To call huge_pmd_unshare, i_mmap_rwsem must be
* held in write mode. Caller needs to explicitly
_
From: Matteo Croce <mcroce@microsoft.com>
Subject: Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint"
Patch series "fix parsing of reboot= cmdline", v3.
The parsing of the reboot= cmdline has two major errors:
- a missing bound check can crash the system on reboot
- parsing of the cpu number only works if specified last
Fix both.
This patch (of 2):
This reverts commit 616feab753972b97.
kstrtoint() and simple_strtoul() have a subtle difference which makes
them non-interchangeable: if a non-digit character is found during
parsing, the former returns an error, while the latter just stops
parsing, e.g. simple_strtoul("123xyx", NULL, 0) returns 123.
The kernel cmdline reboot= argument allows specifying the CPU used for
rebooting, with the syntax `s####` among the other flags, e.g.
"reboot=warm,s31,force". So if this flag is not the last one given, it
is silently ignored, as are the subsequent flags.
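As an illustration (not part of the patch), the handler for the 's'
flag sees the rest of the command line, so with "reboot=warm,s31,force"
it is asked to parse "31,force", and the two helpers disagree on that
input:

	int cpu, rc;

	/* kstrtoint() rejects the trailing ",force" outright */
	rc = kstrtoint("31,force", 0, &cpu);       /* rc == -EINVAL */

	/* simple_strtoul() just stops at the first non-digit */
	cpu = simple_strtoul("31,force", NULL, 0); /* cpu == 31 */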
Link: https://lkml.kernel.org/r/20201103214025.116799-2-mcroce@linux.microsoft.com
Fixes: 616feab75397 ("kernel/reboot.c: convert simple_strtoul to kstrtoint")
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
kernel/reboot.c | 21 +++++++--------------
1 file changed, 7 insertions(+), 14 deletions(-)
--- a/kernel/reboot.c~revert-kernel-rebootc-convert-simple_strtoul-to-kstrtoint
+++ a/kernel/reboot.c
@@ -551,22 +551,15 @@ static int __init reboot_setup(char *str
break;
case 's':
- {
- int rc;
-
- if (isdigit(*(str+1))) {
- rc = kstrtoint(str+1, 0, &reboot_cpu);
- if (rc)
- return rc;
- } else if (str[1] == 'm' && str[2] == 'p' &&
- isdigit(*(str+3))) {
- rc = kstrtoint(str+3, 0, &reboot_cpu);
- if (rc)
- return rc;
- } else
+ if (isdigit(*(str+1)))
+ reboot_cpu = simple_strtoul(str+1, NULL, 0);
+ else if (str[1] == 'm' && str[2] == 'p' &&
+ isdigit(*(str+3)))
+ reboot_cpu = simple_strtoul(str+3, NULL, 0);
+ else
*mode = REBOOT_SOFT;
break;
- }
+
case 'g':
*mode = REBOOT_GPIO;
break;
_
From: Arvind Sankar <nivedita@alum.mit.edu>
Subject: compiler.h: fix barrier_data() on clang
Commit 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h
mutually exclusive") neglected to copy barrier_data() from compiler-gcc.h
into compiler-clang.h. The definition in compiler-gcc.h was really to
work around clang's more aggressive optimization, so this broke
barrier_data() on clang, and consequently memzero_explicit() as well.
For example, this results in at least the memzero_explicit() call in
lib/crypto/sha256.c:sha256_transform() being optimized away by clang.
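For context, memzero_explicit() (in lib/string.c) is essentially a
memset() followed by barrier_data() on the buffer, so a barrier_data()
that expands to a plain barrier() lets clang prove the stores are dead:

	void memzero_explicit(void *s, size_t count)
	{
		memset(s, 0, count);
		barrier_data(s);
	}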
Fix this by moving the definition of barrier_data() into compiler.h.
Also move the gcc/clang definition of barrier() into compiler.h: the old
fallback there, __memory_barrier(), is icc-specific (and barrier() is
already defined using it in compiler-intel.h), so it doesn't belong in
compiler.h.
[rdunlap@infradead.org: fix ALPHA builds when SMP is not enabled]
Link: https://lkml.kernel.org/r/20201101231835.4589-1-rdunlap@infradead.org
Link: https://lkml.kernel.org/r/20201014212631.207844-1-nivedita@alum.mit.edu
Fixes: 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h mutually exclusive")
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/asm-generic/barrier.h | 1 +
include/linux/compiler-clang.h | 6 ------
include/linux/compiler-gcc.h | 19 -------------------
include/linux/compiler.h | 18 ++++++++++++++++--
4 files changed, 17 insertions(+), 27 deletions(-)
--- a/include/asm-generic/barrier.h~compilerh-fix-barrier_data-on-clang
+++ a/include/asm-generic/barrier.h
@@ -13,6 +13,7 @@
#ifndef __ASSEMBLY__
+#include <linux/compiler.h>
#include <asm/rwonce.h>
#ifndef nop
--- a/include/linux/compiler-clang.h~compilerh-fix-barrier_data-on-clang
+++ a/include/linux/compiler-clang.h
@@ -60,12 +60,6 @@
#define COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW 1
#endif
-/* The following are for compatibility with GCC, from compiler-gcc.h,
- * and may be redefined here because they should not be shared with other
- * compilers, like ICC.
- */
-#define barrier() __asm__ __volatile__("" : : : "memory")
-
#if __has_feature(shadow_call_stack)
# define __noscs __attribute__((__no_sanitize__("shadow-call-stack")))
#endif
--- a/include/linux/compiler-gcc.h~compilerh-fix-barrier_data-on-clang
+++ a/include/linux/compiler-gcc.h
@@ -15,25 +15,6 @@
# error Sorry, your version of GCC is too old - please use 4.9 or newer.
#endif
-/* Optimization barrier */
-
-/* The "volatile" is due to gcc bugs */
-#define barrier() __asm__ __volatile__("": : :"memory")
-/*
- * This version is i.e. to prevent dead stores elimination on @ptr
- * where gcc and llvm may behave differently when otherwise using
- * normal barrier(): while gcc behavior gets along with a normal
- * barrier(), llvm needs an explicit input variable to be assumed
- * clobbered. The issue is as follows: while the inline asm might
- * access any memory it wants, the compiler could have fit all of
- * @ptr into memory registers instead, and since @ptr never escaped
- * from that, it proved that the inline asm wasn't touching any of
- * it. This version works well with both compilers, i.e. we're telling
- * the compiler that the inline asm absolutely may see the contents
- * of @ptr. See also: https://llvm.org/bugs/show_bug.cgi?id=15495
- */
-#define barrier_data(ptr) __asm__ __volatile__("": :"r"(ptr) :"memory")
-
/*
* This macro obfuscates arithmetic on a variable address so that gcc
* shouldn't recognize the original var, and make assumptions about it.
--- a/include/linux/compiler.h~compilerh-fix-barrier_data-on-clang
+++ a/include/linux/compiler.h
@@ -80,11 +80,25 @@ void ftrace_likely_update(struct ftrace_
/* Optimization barrier */
#ifndef barrier
-# define barrier() __memory_barrier()
+/* The "volatile" is due to gcc bugs */
+# define barrier() __asm__ __volatile__("": : :"memory")
#endif
#ifndef barrier_data
-# define barrier_data(ptr) barrier()
+/*
+ * This version is i.e. to prevent dead stores elimination on @ptr
+ * where gcc and llvm may behave differently when otherwise using
+ * normal barrier(): while gcc behavior gets along with a normal
+ * barrier(), llvm needs an explicit input variable to be assumed
+ * clobbered. The issue is as follows: while the inline asm might
+ * access any memory it wants, the compiler could have fit all of
+ * @ptr into memory registers instead, and since @ptr never escaped
+ * from that, it proved that the inline asm wasn't touching any of
+ * it. This version works well with both compilers, i.e. we're telling
+ * the compiler that the inline asm absolutely may see the contents
+ * of @ptr. See also: https://llvm.org/bugs/show_bug.cgi?id=15495
+ */
+# define barrier_data(ptr) __asm__ __volatile__("": :"r"(ptr) :"memory")
#endif
/* workaround for GCC PR82365 if needed */
_
From: Jason Gunthorpe <jgg@nvidia.com>
Subject: mm/gup: use unpin_user_pages() in __gup_longterm_locked()
When FOLL_PIN is passed to __get_user_pages(), the page list must be put
back using unpin_user_pages(); otherwise the page pin reference persists
in a corrupted state.
There are two places in the unwind of __gup_longterm_locked() that put
the pages back without checking. Normally on error this function would
return the partial page list, making this the caller's responsibility,
but in these two cases the caller is not allowed to see these pages at
all.
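For reference, a minimal sketch of the pairing rule from a caller's
point of view (buffer size and flags are illustrative only):

	struct page *pages[16];
	long n;

	/* pin_user_pages() passes FOLL_PIN down to __get_user_pages() */
	n = pin_user_pages(start, 16, FOLL_WRITE, pages, NULL);
	if (n > 0)
		unpin_user_pages(pages, n);	/* not put_page() */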
Link: https://lkml.kernel.org/r/0-v2-3ae7d9d162e2+2a7-gup_cma_fix_jgg@nvidia.com
Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reported-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/gup.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
--- a/mm/gup.c~mm-gup-use-unpin_user_pages-in-__gup_longterm_locked
+++ a/mm/gup.c
@@ -1647,8 +1647,11 @@ check_again:
/*
* drop the above get_user_pages reference.
*/
- for (i = 0; i < nr_pages; i++)
- put_page(pages[i]);
+ if (gup_flags & FOLL_PIN)
+ unpin_user_pages(pages, nr_pages);
+ else
+ for (i = 0; i < nr_pages; i++)
+ put_page(pages[i]);
if (migrate_pages(&cma_page_list, alloc_migration_target, NULL,
(unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
@@ -1728,8 +1731,11 @@ static long __gup_longterm_locked(struct
goto out;
if (check_dax_vmas(vmas_tmp, rc)) {
- for (i = 0; i < rc; i++)
- put_page(pages[i]);
+ if (gup_flags & FOLL_PIN)
+ unpin_user_pages(pages, rc);
+ else
+ for (i = 0; i < rc; i++)
+ put_page(pages[i]);
rc = -EOPNOTSUPP;
goto out;
}
_
From: Laurent Dufour <ldufour@linux.ibm.com>
Subject: mm/slub: fix panic in slab_alloc_node()
While doing a memory hot-unplug operation on a PowerPC VM running 1024
CPUs with 11TB of RAM, I hit the following panic:
BUG: Kernel NULL pointer dereference on read at 0x00000007
Faulting instruction address: 0xc000000000456048
Oops: Kernel access of bad area, sig: 11 [#2]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS= 2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp
CPU: 160 PID: 1 Comm: systemd Tainted: G D 5.9.0 #1
NIP: c000000000456048 LR: c000000000455fd4 CTR: c00000000047b350
REGS: c00006028d1b77a0 TRAP: 0300 Tainted: G D (5.9.0)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004228 XER: 00000000
CFAR: c00000000000f1b0 DAR: 0000000000000007 DSISR: 40000000 IRQMASK: 0
GPR00: c000000000455fd4 c00006028d1b7a30 c000000001bec800 0000000000000000
GPR04: 0000000000000dc0 0000000000000000 00000000000374ef c00007c53df99320
GPR08: 000007c53c980000 0000000000000000 000007c53c980000 0000000000000000
GPR12: 0000000000004400 c00000001e8e4400 0000000000000000 0000000000000f6a
GPR16: 0000000000000000 c000000001c25930 c000000001d62528 00000000000000c1
GPR20: c000000001d62538 c00006be469e9000 0000000fffffffe0 c0000000003c0ff8
GPR24: 0000000000000018 0000000000000000 0000000000000dc0 0000000000000000
GPR28: c00007c513755700 c000000001c236a4 c00007bc4001f800 0000000000000001
NIP [c000000000456048] __kmalloc_node+0x108/0x790
LR [c000000000455fd4] __kmalloc_node+0x94/0x790
Call Trace:
[c00006028d1b7a30] [c00007c51af92000] 0xc00007c51af92000 (unreliable)
[c00006028d1b7aa0] [c0000000003c0ff8] kvmalloc_node+0x58/0x110
[c00006028d1b7ae0] [c00000000047b45c] mem_cgroup_css_online+0x10c/0x270
[c00006028d1b7b30] [c000000000241fd8] online_css+0x48/0xd0
[c00006028d1b7b60] [c00000000024af14] cgroup_apply_control_enable+0x2c4/0x470
[c00006028d1b7c40] [c00000000024e838] cgroup_mkdir+0x408/0x5f0
[c00006028d1b7cb0] [c0000000005a4ef0] kernfs_iop_mkdir+0x90/0x100
[c00006028d1b7cf0] [c0000000004b8168] vfs_mkdir+0x138/0x250
[c00006028d1b7d40] [c0000000004baf04] do_mkdirat+0x154/0x1c0
[c00006028d1b7dc0] [c000000000032b38] system_call_exception+0xf8/0x200
[c00006028d1b7e20] [c00000000000c740] system_call_common+0xf0/0x27c
Instruction dump:
e93e0000 e90d0030 39290008 7cc9402a e94d0030 e93e0000 7ce95214 7f89502a
2fbc0000 419e0018 41920230 e9270010 <89290007> 7f994800 419e0220 7ee6bb78
This points to the following code:
mm/slub.c:2851
if (unlikely(!object || !node_match(page, node))) {
c000000000456038: 00 00 bc 2f cmpdi cr7,r28,0
c00000000045603c: 18 00 9e 41 beq cr7,c000000000456054 <__kmalloc_node+0x114>
node_match():
mm/slub.c:2491
if (node != NUMA_NO_NODE && page_to_nid(page) != node)
c000000000456040: 30 02 92 41 beq cr4,c000000000456270 <__kmalloc_node+0x330>
page_to_nid():
include/linux/mm.h:1294
c000000000456044: 10 00 27 e9 ld r9,16(r7)
c000000000456048: 07 00 29 89 lbz r9,7(r9) <<<< r9 = NULL
node_match():
mm/slub.c:2491
c00000000045604c: 00 48 99 7f cmpw cr7,r25,r9
c000000000456050: 20 02 9e 41 beq cr7,c000000000456270 <__kmalloc_node+0x330>
The panic occurred in slab_alloc_node() when checking for the page's node:
object = c->freelist;
page = c->page;
if (unlikely(!object || !node_match(page, node))) {
object = __slab_alloc(s, gfpflags, node, addr, c);
stat(s, ALLOC_SLOWPATH);
The issue is that object is not NULL while page is NULL, which is odd
but may happen if the cache flush happened after loading object but
before loading page. Thus checking the page pointer is required too.
The cache flush is done through an inter processor interrupt when a piece
of memory is off-lined. That interrupt is triggered when a memory
hot-unplug operation is initiated and offline_pages() is calling the
slub's MEM_GOING_OFFLINE callback slab_mem_going_offline_callback() which
is calling flush_cpu_slab(). If that interrupt is caught between the
reading of c->freelist and the reading of c->page, this could lead to such
a situation. That situation is expected and the later call to
this_cpu_cmpxchg_double() will detect the change to c->freelist and redo
the whole operation.
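A sketch of the problematic interleaving on one CPU (illustrative
only):

	object = c->freelist;	/* reads a non-NULL freelist */
		<-- IPI: flush_cpu_slab() sets c->page = NULL
		    and c->freelist = NULL
	page = c->page;		/* now reads NULL */
	node_match(page, node);	/* page_to_nid(NULL) -> oops */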
In commit 6159d0f5c03e ("mm/slub.c: page is always non-NULL in
node_match()"), the check on the page pointer was removed on the
assumption that page is always valid when node_match() is called. It
happens that this is not true in that particular case, so check for
page before calling node_match() here.
Link: https://lkml.kernel.org/r/20201027190406.33283-1-ldufour@linux.ibm.com
Fixes: 6159d0f5c03e ("mm/slub.c: page is always non-NULL in node_match()")
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/slub.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/slub.c~mm-slub-fix-panic-in-slab_alloc_node
+++ a/mm/slub.c
@@ -2852,7 +2852,7 @@ redo:
object = c->freelist;
page = c->page;
- if (unlikely(!object || !node_match(page, node))) {
+ if (unlikely(!object || !page || !node_match(page, node))) {
object = __slab_alloc(s, gfpflags, node, addr, c);
} else {
void *next_object = get_freepointer_safe(s, object);
_
From: Zi Yan <ziy@nvidia.com>
Subject: mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate
In isolate_migratepages_block(), if we have too many isolated pages and
nr_migratepages is not zero, we should try to migrate what we already
have without wasting time on isolating more.
In theory it's possible that multiple parallel compactions will cause
too_many_isolated() to become true even if each has isolated less than
COMPACT_CLUSTER_MAX, and loop forever in the while loop. Bailing
immediately prevents that.
[vbabka@suse.cz: changelog addition]
Link: https://lkml.kernel.org/r/20201030183809.3616803-2-zi.yan@sent.com
Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/compaction.c | 4 ++++
1 file changed, 4 insertions(+)
--- a/mm/compaction.c~mm-compaction-stop-isolation-if-too-many-pages-are-isolated-and-we-have-pages-to-migrate
+++ a/mm/compaction.c
@@ -817,6 +817,10 @@ isolate_migratepages_block(struct compac
* delay for some time until fewer pages are isolated
*/
while (unlikely(too_many_isolated(pgdat))) {
+ /* stop isolation if there are still pages not migrated */
+ if (cc->nr_migratepages)
+ return 0;
+
/* async migration should just abort */
if (cc->mode == MIGRATE_ASYNC)
return 0;
_
From: Zi Yan <ziy@nvidia.com>
Subject: mm/compaction: count pages and stop correctly during page isolation
In isolate_migratepages_block(), when cc->alloc_contig is true, we are
able to isolate compound pages, but nr_migratepages and nr_isolated did
not count compound pages correctly, causing us to isolate more pages
than we thought. Count compound pages as the number of base pages they
contain. Otherwise, we might be trapped in the too_many_isolated()
while loop, since the actual number of isolated pages can go up to
COMPACT_CLUSTER_MAX*512=16384 (COMPACT_CLUSTER_MAX is 32; 512 is the
number of base pages in a 2MB THP), because we only stop isolation
after cc->nr_migratepages reaches COMPACT_CLUSTER_MAX.
In addition, after we fix the issue above, cc->nr_migratepages could never
be equal to COMPACT_CLUSTER_MAX if compound pages are isolated, thus page
isolation could not stop as we intended. Change the isolation stop
condition to >=.
The issue can be triggered as follows: in a system with 16GB memory and
an 8GB CMA region reserved by hugetlb_cma, if we first allocate 10GB of
THPs and mlock them (so some THPs are allocated in the CMA region and
mlocked), reserving six 1GB hugetlb pages via
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages will get stuck
(looping in the too_many_isolated() function) until we kill either task.
With the patch applied, the OOM killer will kill the application holding
the 10GB of THPs and let the hugetlb page reservation finish.
[ziy@nvidia.com: v3]
Link: https://lkml.kernel.org/r/20201030183809.3616803-1-zi.yan@sent.com
Link: https://lkml.kernel.org/r/20201029200435.3386066-1-zi.yan@sent.com
Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/compaction.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
--- a/mm/compaction.c~mm-compaction-count-pages-and-stop-correctly-during-page-isolation
+++ a/mm/compaction.c
@@ -1012,8 +1012,8 @@ isolate_migratepages_block(struct compac
isolate_success:
list_add(&page->lru, &cc->migratepages);
- cc->nr_migratepages++;
- nr_isolated++;
+ cc->nr_migratepages += compound_nr(page);
+ nr_isolated += compound_nr(page);
/*
* Avoid isolating too much unless this block is being
@@ -1021,7 +1021,7 @@ isolate_success:
* or a lock is contended. For contention, isolate quickly to
* potentially remove one source of contention.
*/
- if (cc->nr_migratepages == COMPACT_CLUSTER_MAX &&
+ if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX &&
!cc->rescan && !cc->contended) {
++low_pfn;
break;
@@ -1132,7 +1132,7 @@ isolate_migratepages_range(struct compac
if (!pfn)
break;
- if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
+ if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX)
break;
}
_