From: Lars Persson <lars.persson(a)axis.com>
Subject: mm/migrate.c: add missing flush_dcache_page for non-mapped page migrate
Our MIPS 1004Kc SoCs were seeing random userspace crashes with SIGILL and
SIGSEGV that could not be traced back to a userspace code bug. They had
all the magic signs of an I/D cache coherency issue.
Now recently we noticed that the /proc/sys/vm/compact_memory interface was
quite efficient at provoking this class of userspace crashes.
Studying the code in mm/migrate.c there is a distinction made between
migrating a page that is mapped at the instant of migration and one that
is not mapped. Our problem turned out to be the non-mapped pages.
For the non-mapped page the code performs a copy of the page content and
all relevant meta-data of the page without doing the required D-cache
maintenance. This leaves dirty data in the D-cache of the CPU and on the
1004K cores this data is not visible to the I-cache. A subsequent
page-fault that triggers a mapping of the page will happily serve the
process with potentially stale code.
What about ARM then, this bug should have seen greater exposure? Well ARM
became immune to this flaw back in 2010, see commit c01778001a4f ("ARM:
6379/1: Assume new page cache pages have dirty D-cache").
My proposed fix moves the D-cache maintenance inside move_to_new_page to
make it common for both cases.
Link: http://lkml.kernel.org/r/20190315083502.11849-1-larper@axis.com
Fixes: 97ee0524614 ("flush cache before installing new page at migraton")
Signed-off-by: Lars Persson <larper(a)axis.com>
Reviewed-by: Paul Burton <paul.burton(a)mips.com>
Acked-by: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Ralf Baechle <ralf(a)linux-mips.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/migrate.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
--- a/mm/migrate.c~mm-migrate-add-missing-flush_dcache_page-for-non-mapped-page-migrate
+++ a/mm/migrate.c
@@ -248,10 +248,8 @@ static bool remove_migration_pte(struct
pte = swp_entry_to_pte(entry);
} else if (is_device_public_page(new)) {
pte = pte_mkdevmap(pte);
- flush_dcache_page(new);
}
- } else
- flush_dcache_page(new);
+ }
#ifdef CONFIG_HUGETLB_PAGE
if (PageHuge(new)) {
@@ -995,6 +993,13 @@ static int move_to_new_page(struct page
*/
if (!PageMappingFlags(page))
page->mapping = NULL;
+
+ if (unlikely(is_zone_device_page(newpage))) {
+ if (is_device_public_page(newpage))
+ flush_dcache_page(newpage);
+ } else
+ flush_dcache_page(newpage);
+
}
out:
return rc;
_
From: Qian Cai <cai(a)lca.pw>
Subject: mm/page_isolation.c: fix a wrong flag in set_migratetype_isolate()
Due to has_unmovable_pages() taking an incorrect irqsave flag instead of
the isolation flag in set_migratetype_isolate(), there are issues with
HWPOSION and error reporting where dump_page() is not called when there is
an unmovable page.
Link: http://lkml.kernel.org/r/20190320204941.53731-1-cai@lca.pw
Fixes: d381c54760dc ("mm: only report isolation failures when offlining memory")
Acked-by: Michal Hocko <mhocko(a)suse.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Signed-off-by: Qian Cai <cai(a)lca.pw>
Cc: <stable(a)vger.kernel.org> [5.0.x]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_isolation.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/page_isolation.c~mm-fix-a-wrong-flag-in-set_migratetype_isolate
+++ a/mm/page_isolation.c
@@ -59,7 +59,8 @@ static int set_migratetype_isolate(struc
* FIXME: Now, memory hotplug doesn't call shrink_slab() by itself.
* We just check MOVABLE pages.
*/
- if (!has_unmovable_pages(zone, page, arg.pages_found, migratetype, flags))
+ if (!has_unmovable_pages(zone, page, arg.pages_found, migratetype,
+ isol_flags))
ret = 0;
/*
_
From: Oscar Salvador <osalvador(a)suse.de>
Subject: mm/debug.c: fix __dump_page when mapping->host is not set
While debugging something, I added a dump_page() into do_swap_page(), and
I got the splat from below. The issue happens when dereferencing
mapping->host in __dump_page():
...
else if (mapping) {
pr_warn("%ps ", mapping->a_ops);
if (mapping->host->i_dentry.first) {
struct dentry *dentry;
dentry = container_of(mapping->host->i_dentry.first, struct dentry, d_u.d_alias);
pr_warn("name:\"%pd\" ", dentry);
}
}
...
Swap address space does not contain an inode information, and so
mapping->host equals NULL.
Although the dump_page() call was added artificially into do_swap_page(),
I am not sure if we can hit this from any other path, so it looks worth
fixing it. We can easily do that by checking mapping->host first.
Link: http://lkml.kernel.org/r/20190318072931.29094-1-osalvador@suse.de
Fixes: 1c6fb1d89e73c ("mm: print more information about mapping in __dump_page")
Signed-off-by: Oscar Salvador <osalvador(a)suse.de>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Acked-by: Hugh Dickins <hughd(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/debug.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/debug.c~mm-fix-__dump_page-when-mapping-host-is-not-set
+++ a/mm/debug.c
@@ -79,7 +79,7 @@ void __dump_page(struct page *page, cons
pr_warn("ksm ");
else if (mapping) {
pr_warn("%ps ", mapping->a_ops);
- if (mapping->host->i_dentry.first) {
+ if (mapping->host && mapping->host->i_dentry.first) {
struct dentry *dentry;
dentry = container_of(mapping->host->i_dentry.first, struct dentry, d_u.d_alias);
pr_warn("name:\"%pd\" ", dentry);
_
From: Yang Shi <yang.shi(a)linux.alibaba.com>
Subject: mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified
When MPOL_MF_STRICT was specified and an existing page was already on a
node that does not follow the policy, mbind() should return -EIO. But
6f4576e3687b ("mempolicy: apply page table walker on queue_pages_range()")
broke the rule.
And c8633798497c ("mm: mempolicy: mbind and migrate_pages support thp
migration") didn't return the correct value for THP mbind() too.
If MPOL_MF_STRICT is set, ignore vma_migratable() to make sure it reaches
queue_pages_to_pte_range() or queue_pages_pmd() to check if an existing
page was already on a node that does not follow the policy. And,
non-migratable vma may be used, return -EIO too if MPOL_MF_MOVE or
MPOL_MF_MOVE_ALL was specified.
Tested with https://github.com/metan-ucw/ltp/blob/master/testcases/kernel/syscalls/mbin…
[akpm(a)linux-foundation.org: tweak code comment]
Link: http://lkml.kernel.org/r/1553020556-38583-1-git-send-email-yang.shi@linux.a…
Fixes: 6f4576e3687b ("mempolicy: apply page table walker on queue_pages_range()")
Signed-off-by: Yang Shi <yang.shi(a)linux.alibaba.com>
Signed-off-by: Oscar Salvador <osalvador(a)suse.de>
Reported-by: Cyril Hrubis <chrubis(a)suse.cz>
Suggested-by: Kirill A. Shutemov <kirill(a)shutemov.name>
Acked-by: Rafael Aquini <aquini(a)redhat.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Acked-by: David Rientjes <rientjes(a)google.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/mempolicy.c | 40 +++++++++++++++++++++++++++++++++-------
1 file changed, 33 insertions(+), 7 deletions(-)
--- a/mm/mempolicy.c~mm-mempolicy-make-mbind-return-eio-when-mpol_mf_strict-is-specified
+++ a/mm/mempolicy.c
@@ -428,6 +428,13 @@ static inline bool queue_pages_required(
return node_isset(nid, *qp->nmask) == !(flags & MPOL_MF_INVERT);
}
+/*
+ * queue_pages_pmd() has three possible return values:
+ * 1 - pages are placed on the right node or queued successfully.
+ * 0 - THP was split.
+ * -EIO - is migration entry or MPOL_MF_STRICT was specified and an existing
+ * page was already on a node that does not follow the policy.
+ */
static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
@@ -437,7 +444,7 @@ static int queue_pages_pmd(pmd_t *pmd, s
unsigned long flags;
if (unlikely(is_pmd_migration_entry(*pmd))) {
- ret = 1;
+ ret = -EIO;
goto unlock;
}
page = pmd_page(*pmd);
@@ -454,8 +461,15 @@ static int queue_pages_pmd(pmd_t *pmd, s
ret = 1;
flags = qp->flags;
/* go to thp migration */
- if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
+ if (!vma_migratable(walk->vma)) {
+ ret = -EIO;
+ goto unlock;
+ }
+
migrate_page_add(page, qp->pagelist, flags);
+ } else
+ ret = -EIO;
unlock:
spin_unlock(ptl);
out:
@@ -480,8 +494,10 @@ static int queue_pages_pte_range(pmd_t *
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
ret = queue_pages_pmd(pmd, ptl, addr, end, walk);
- if (ret)
+ if (ret > 0)
return 0;
+ else if (ret < 0)
+ return ret;
}
if (pmd_trans_unstable(pmd))
@@ -502,11 +518,16 @@ static int queue_pages_pte_range(pmd_t *
continue;
if (!queue_pages_required(page, qp))
continue;
- migrate_page_add(page, qp->pagelist, flags);
+ if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
+ if (!vma_migratable(vma))
+ break;
+ migrate_page_add(page, qp->pagelist, flags);
+ } else
+ break;
}
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
- return 0;
+ return addr != end ? -EIO : 0;
}
static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
@@ -576,7 +597,12 @@ static int queue_pages_test_walk(unsigne
unsigned long endvma = vma->vm_end;
unsigned long flags = qp->flags;
- if (!vma_migratable(vma))
+ /*
+ * Need check MPOL_MF_STRICT to return -EIO if possible
+ * regardless of vma_migratable
+ */
+ if (!vma_migratable(vma) &&
+ !(flags & MPOL_MF_STRICT))
return 1;
if (endvma > end)
@@ -603,7 +629,7 @@ static int queue_pages_test_walk(unsigne
}
/* queue pages from current vma */
- if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ if (flags & MPOL_MF_VALID)
return 0;
return 1;
}
_
From: Nicolas Boichat <drinkcat(a)chromium.org>
Subject: iommu/io-pgtable-arm-v7s: request DMA32 memory, and improve debugging
IOMMUs using ARMv7 short-descriptor format require page tables (level 1
and 2) to be allocated within the first 4GB of RAM, even on 64-bit
systems.
For level 1/2 pages, ensure GFP_DMA32 is used if CONFIG_ZONE_DMA32 is
defined (e.g. on arm64 platforms).
For level 2 pages, allocate a slab cache in SLAB_CACHE_DMA32. Note that
we do not explicitly pass GFP_DMA[32] to kmem_cache_zalloc, as this is not
strictly necessary, and would cause a warning in mm/sl*b.c, as we did not
update GFP_SLAB_BUG_MASK.
Also, print an error when the physical address does not fit in
32-bit, to make debugging easier in the future.
Link: http://lkml.kernel.org/r/20181210011504.122604-3-drinkcat@chromium.org
Fixes: ad67f5a6545f ("arm64: replace ZONE_DMA with ZONE_DMA32")
Signed-off-by: Nicolas Boichat <drinkcat(a)chromium.org>
Acked-by: Will Deacon <will.deacon(a)arm.com>
Cc: Christoph Hellwig <hch(a)infradead.org>
Cc: Christoph Lameter <cl(a)linux.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Hsin-Yi Wang <hsinyi(a)chromium.org>
Cc: Huaisheng Ye <yehs1(a)lenovo.com>
Cc: Joerg Roedel <joro(a)8bytes.org>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Matthias Brugger <matthias.bgg(a)gmail.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Mike Rapoport <rppt(a)linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg(a)kernel.org>
Cc: Robin Murphy <robin.murphy(a)arm.com>
Cc: Sasha Levin <Alexander.Levin(a)microsoft.com>
Cc: Tomasz Figa <tfiga(a)google.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Yingjoe Chen <yingjoe.chen(a)mediatek.com>
Cc: Yong Wu <yong.wu(a)mediatek.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/iommu/io-pgtable-arm-v7s.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
--- a/drivers/iommu/io-pgtable-arm-v7s.c~iommu-io-pgtable-arm-v7s-request-dma32-memory-and-improve-debugging
+++ a/drivers/iommu/io-pgtable-arm-v7s.c
@@ -160,6 +160,14 @@
#define ARM_V7S_TCR_PD1 BIT(5)
+#ifdef CONFIG_ZONE_DMA32
+#define ARM_V7S_TABLE_GFP_DMA GFP_DMA32
+#define ARM_V7S_TABLE_SLAB_FLAGS SLAB_CACHE_DMA32
+#else
+#define ARM_V7S_TABLE_GFP_DMA GFP_DMA
+#define ARM_V7S_TABLE_SLAB_FLAGS SLAB_CACHE_DMA
+#endif
+
typedef u32 arm_v7s_iopte;
static bool selftest_running;
@@ -197,13 +205,16 @@ static void *__arm_v7s_alloc_table(int l
void *table = NULL;
if (lvl == 1)
- table = (void *)__get_dma_pages(__GFP_ZERO, get_order(size));
+ table = (void *)__get_free_pages(
+ __GFP_ZERO | ARM_V7S_TABLE_GFP_DMA, get_order(size));
else if (lvl == 2)
- table = kmem_cache_zalloc(data->l2_tables, gfp | GFP_DMA);
+ table = kmem_cache_zalloc(data->l2_tables, gfp);
phys = virt_to_phys(table);
- if (phys != (arm_v7s_iopte)phys)
+ if (phys != (arm_v7s_iopte)phys) {
/* Doesn't fit in PTE */
+ dev_err(dev, "Page table does not fit in PTE: %pa", &phys);
goto out_free;
+ }
if (table && !(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA)) {
dma = dma_map_single(dev, table, size, DMA_TO_DEVICE);
if (dma_mapping_error(dev, dma))
@@ -733,7 +744,7 @@ static struct io_pgtable *arm_v7s_alloc_
data->l2_tables = kmem_cache_create("io-pgtable_armv7s_l2",
ARM_V7S_TABLE_SIZE(2),
ARM_V7S_TABLE_SIZE(2),
- SLAB_CACHE_DMA, NULL);
+ ARM_V7S_TABLE_SLAB_FLAGS, NULL);
if (!data->l2_tables)
goto out_free_data;
_
From: Darrick J. Wong <darrick.wong(a)oracle.com>
Subject: ocfs2: fix inode bh swapping mixup in ocfs2_reflink_inodes_lock
ocfs2_reflink_inodes_lock() can swap the inode1/inode2 variables so that
we always grab cluster locks in order of increasing inode number.
Unfortunately, we forget to swap the inode record buffer head pointers
when we've done this, which leads to incorrect bookkeepping when we're
trying to make the two inodes have the same refcount tree.
This has the effect of causing filesystem shutdowns if you're trying to
reflink data from inode 100 into inode 97, where inode 100 already has a
refcount tree attached and inode 97 doesn't. The reflink code decides to
copy the refcount tree pointer from 100 to 97, but uses inode 97's inode
record to open the tree root (which it doesn't have) and blows up. This
issue causes filesystem shutdowns and metadata corruption!
Link: http://lkml.kernel.org/r/20190312214910.GK20533@magnolia
Fixes: 29ac8e856cb369 ("ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features")
Signed-off-by: Darrick J. Wong <darrick.wong(a)oracle.com>
Reviewed-by: Joseph Qi <jiangqi903(a)gmail.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Joseph Qi <joseph.qi(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/refcounttree.c | 42 +++++++++++++++++++++-----------------
1 file changed, 24 insertions(+), 18 deletions(-)
--- a/fs/ocfs2/refcounttree.c~ocfs2-fix-inode-bh-swapping-mixup-in-ocfs2_reflink_inodes_lock
+++ a/fs/ocfs2/refcounttree.c
@@ -4719,22 +4719,23 @@ out:
/* Lock an inode and grab a bh pointing to the inode. */
int ocfs2_reflink_inodes_lock(struct inode *s_inode,
- struct buffer_head **bh1,
+ struct buffer_head **bh_s,
struct inode *t_inode,
- struct buffer_head **bh2)
+ struct buffer_head **bh_t)
{
- struct inode *inode1;
- struct inode *inode2;
+ struct inode *inode1 = s_inode;
+ struct inode *inode2 = t_inode;
struct ocfs2_inode_info *oi1;
struct ocfs2_inode_info *oi2;
+ struct buffer_head *bh1 = NULL;
+ struct buffer_head *bh2 = NULL;
bool same_inode = (s_inode == t_inode);
+ bool need_swap = (inode1->i_ino > inode2->i_ino);
int status;
/* First grab the VFS and rw locks. */
lock_two_nondirectories(s_inode, t_inode);
- inode1 = s_inode;
- inode2 = t_inode;
- if (inode1->i_ino > inode2->i_ino)
+ if (need_swap)
swap(inode1, inode2);
status = ocfs2_rw_lock(inode1, 1);
@@ -4757,17 +4758,13 @@ int ocfs2_reflink_inodes_lock(struct ino
trace_ocfs2_double_lock((unsigned long long)oi1->ip_blkno,
(unsigned long long)oi2->ip_blkno);
- if (*bh1)
- *bh1 = NULL;
- if (*bh2)
- *bh2 = NULL;
-
/* We always want to lock the one with the lower lockid first. */
if (oi1->ip_blkno > oi2->ip_blkno)
mlog_errno(-ENOLCK);
/* lock id1 */
- status = ocfs2_inode_lock_nested(inode1, bh1, 1, OI_LS_REFLINK_TARGET);
+ status = ocfs2_inode_lock_nested(inode1, &bh1, 1,
+ OI_LS_REFLINK_TARGET);
if (status < 0) {
if (status != -ENOENT)
mlog_errno(status);
@@ -4776,15 +4773,25 @@ int ocfs2_reflink_inodes_lock(struct ino
/* lock id2 */
if (!same_inode) {
- status = ocfs2_inode_lock_nested(inode2, bh2, 1,
+ status = ocfs2_inode_lock_nested(inode2, &bh2, 1,
OI_LS_REFLINK_TARGET);
if (status < 0) {
if (status != -ENOENT)
mlog_errno(status);
goto out_cl1;
}
- } else
- *bh2 = *bh1;
+ } else {
+ bh2 = bh1;
+ }
+
+ /*
+ * If we swapped inode order above, we have to swap the buffer heads
+ * before passing them back to the caller.
+ */
+ if (need_swap)
+ swap(bh1, bh2);
+ *bh_s = bh1;
+ *bh_t = bh2;
trace_ocfs2_double_lock_end(
(unsigned long long)oi1->ip_blkno,
@@ -4794,8 +4801,7 @@ int ocfs2_reflink_inodes_lock(struct ino
out_cl1:
ocfs2_inode_unlock(inode1, 1);
- brelse(*bh1);
- *bh1 = NULL;
+ brelse(bh1);
out_rw2:
ocfs2_rw_unlock(inode2, 1);
out_i2:
_