From: Pavel Tatashin <pasha.tatashin(a)soleen.com>
Subject: mm/memory_hotplug: drain per-cpu pages again during memory offline
There is a race during page offline that can lead to infinite loop:
a page never ends up on a buddy list and __offline_pages() keeps
retrying infinitely or until a termination signal is received.
Thread#1 - a new process:
load_elf_binary
begin_new_exec
exec_mmap
mmput
exit_mmap
tlb_finish_mmu
tlb_flush_mmu
release_pages
free_unref_page_list
free_unref_page_prepare
set_pcppage_migratetype(page, migratetype);
// Set page->index migration type below MIGRATE_PCPTYPES
Thread#2 - hot-removes memory
__offline_pages
start_isolate_page_range
set_migratetype_isolate
set_pageblock_migratetype(page, MIGRATE_ISOLATE);
Set migration type to MIGRATE_ISOLATE-> set
drain_all_pages(zone);
// drain per-cpu page lists to buddy allocator.
Thread#1 - continue
free_unref_page_commit
migratetype = get_pcppage_migratetype(page);
// get old migration type
list_add(&page->lru, &pcp->lists[migratetype]);
// add new page to already drained pcp list
Thread#2
Never drains pcp again, and therefore gets stuck in the loop.
The fix is to try to drain per-cpu lists again after
check_pages_isolated_cb() fails.
Link: https://lkml.kernel.org/r/20200903140032.380431-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20200904151448.100489-2-pasha.tatashin@soleen.com
Link: http://lkml.kernel.org/r/20200904070235.GA15277@dhcp22.suse.cz
Fixes: c52e75935f8d ("mm: remove extra drain pages on pcp list")
Signed-off-by: Pavel Tatashin <pasha.tatashin(a)soleen.com>
Acked-by: David Rientjes <rientjes(a)google.com>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Wei Yang <richard.weiyang(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory_hotplug.c | 14 ++++++++++++++
mm/page_isolation.c | 8 ++++++++
2 files changed, 22 insertions(+)
--- a/mm/memory_hotplug.c~mm-memory_hotplug-drain-per-cpu-pages-again-during-memory-offline
+++ a/mm/memory_hotplug.c
@@ -1575,6 +1575,20 @@ static int __ref __offline_pages(unsigne
/* check again */
ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
NULL, check_pages_isolated_cb);
+ /*
+ * per-cpu pages are drained in start_isolate_page_range, but if
+ * there are still pages that are not free, make sure that we
+ * drain again, because when we isolated range we might
+ * have raced with another thread that was adding pages to pcp
+ * list.
+ *
+ * Forward progress should be still guaranteed because
+ * pages on the pcp list can only belong to MOVABLE_ZONE
+ * because has_unmovable_pages explicitly checks for
+ * PageBuddy on freed pages on other zones.
+ */
+ if (ret)
+ drain_all_pages(zone);
} while (ret);
/* Ok, all of our target is isolated.
--- a/mm/page_isolation.c~mm-memory_hotplug-drain-per-cpu-pages-again-during-memory-offline
+++ a/mm/page_isolation.c
@@ -170,6 +170,14 @@ __first_valid_page(unsigned long pfn, un
* pageblocks we may have modified and return -EBUSY to caller. This
* prevents two threads from simultaneously working on overlapping ranges.
*
+ * Please note that there is no strong synchronization with the page allocator
+ * either. Pages might be freed while their page blocks are marked ISOLATED.
+ * In some cases pages might still end up on pcp lists and that would allow
+ * for their allocation even when they are in fact isolated already. Depending
+ * on how strong of a guarantee the caller needs drain_all_pages might be needed
+ * (e.g. __offline_pages will need to call it after check for isolated range for
+ * a next retry).
+ *
* Return: the number of isolated pageblocks on success and -EBUSY if any part
* of range cannot be isolated.
*/
_
From: Ralph Campbell <rcampbell(a)nvidia.com>
Subject: mm/thp: fix __split_huge_pmd_locked() for migration PMD
A migrating transparent huge page has to already be unmapped. Otherwise,
the page could be modified while it is being copied to a new page and data
could be lost. The function __split_huge_pmd() checks for a PMD migration
entry before calling __split_huge_pmd_locked() leading one to think that
__split_huge_pmd_locked() can handle splitting a migrating PMD.
However, the code always increments the page->_mapcount and adjusts the
memory control group accounting assuming the page is mapped.
Also, if the PMD entry is a migration PMD entry, the call to
is_huge_zero_pmd(*pmd) is incorrect because it calls pmd_pfn(pmd) instead
of migration_entry_to_pfn(pmd_to_swp_entry(pmd)). Fix these problems by
checking for a PMD migration entry.
Link: https://lkml.kernel.org/r/20200903183140.19055-1-rcampbell@nvidia.com
Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Signed-off-by: Ralph Campbell <rcampbell(a)nvidia.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Reviewed-by: Zi Yan <ziy(a)nvidia.com>
Cc: Jerome Glisse <jglisse(a)redhat.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Bharata B Rao <bharata(a)linux.ibm.com>
Cc: Ben Skeggs <bskeggs(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/huge_memory.c | 42 +++++++++++++++++++++++-------------------
1 file changed, 23 insertions(+), 19 deletions(-)
--- a/mm/huge_memory.c~mm-thp-fix-__split_huge_pmd_locked-for-migration-pmd
+++ a/mm/huge_memory.c
@@ -2022,7 +2022,7 @@ static void __split_huge_pmd_locked(stru
put_page(page);
add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
return;
- } else if (is_huge_zero_pmd(*pmd)) {
+ } else if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) {
/*
* FIXME: Do we want to invalidate secondary mmu by calling
* mmu_notifier_invalidate_range() see comments below inside
@@ -2116,30 +2116,34 @@ static void __split_huge_pmd_locked(stru
pte = pte_offset_map(&_pmd, addr);
BUG_ON(!pte_none(*pte));
set_pte_at(mm, addr, pte, entry);
- atomic_inc(&page[i]._mapcount);
- pte_unmap(pte);
- }
-
- /*
- * Set PG_double_map before dropping compound_mapcount to avoid
- * false-negative page_mapped().
- */
- if (compound_mapcount(page) > 1 && !TestSetPageDoubleMap(page)) {
- for (i = 0; i < HPAGE_PMD_NR; i++)
+ if (!pmd_migration)
atomic_inc(&page[i]._mapcount);
+ pte_unmap(pte);
}
- lock_page_memcg(page);
- if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
- /* Last compound_mapcount is gone. */
- __dec_lruvec_page_state(page, NR_ANON_THPS);
- if (TestClearPageDoubleMap(page)) {
- /* No need in mapcount reference anymore */
+ if (!pmd_migration) {
+ /*
+ * Set PG_double_map before dropping compound_mapcount to avoid
+ * false-negative page_mapped().
+ */
+ if (compound_mapcount(page) > 1 &&
+ !TestSetPageDoubleMap(page)) {
for (i = 0; i < HPAGE_PMD_NR; i++)
- atomic_dec(&page[i]._mapcount);
+ atomic_inc(&page[i]._mapcount);
+ }
+
+ lock_page_memcg(page);
+ if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
+ /* Last compound_mapcount is gone. */
+ __dec_lruvec_page_state(page, NR_ANON_THPS);
+ if (TestClearPageDoubleMap(page)) {
+ /* No need in mapcount reference anymore */
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ atomic_dec(&page[i]._mapcount);
+ }
}
+ unlock_page_memcg(page);
}
- unlock_page_memcg(page);
smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
_
From: Muchun Song <songmuchun(a)bytedance.com>
Subject: kprobes: fix kill kprobe which has been marked as gone
If a kprobe is marked as gone, we should not kill it again. Otherwise, we
can disarm the kprobe more than once. In that case, the statistics of
kprobe_ftrace_enabled can unbalance which can lead to that kprobe do not
work.
Link: https://lkml.kernel.org/r/20200822030055.32383-1-songmuchun@bytedance.com
Fixes: e8386a0cb22f ("kprobes: support probing module __exit function")
Signed-off-by: Muchun Song <songmuchun(a)bytedance.com>
Co-developed-by: Chengming Zhou <zhouchengming(a)bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming(a)bytedance.com>
Acked-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: "Naveen N . Rao" <naveen.n.rao(a)linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy(a)intel.com>
Cc: David S. Miller <davem(a)davemloft.net>
Cc: Song Liu <songliubraving(a)fb.com>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/kprobes.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
--- a/kernel/kprobes.c~kprobes-fix-kill-kprobe-which-has-been-marked-as-gone
+++ a/kernel/kprobes.c
@@ -2140,6 +2140,9 @@ static void kill_kprobe(struct kprobe *p
lockdep_assert_held(&kprobe_mutex);
+ if (WARN_ON_ONCE(kprobe_gone(p)))
+ return;
+
p->flags |= KPROBE_FLAG_GONE;
if (kprobe_aggrprobe(p)) {
/*
@@ -2419,7 +2422,10 @@ static int kprobes_module_callback(struc
mutex_lock(&kprobe_mutex);
for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
head = &kprobe_table[i];
- hlist_for_each_entry(p, head, hlist)
+ hlist_for_each_entry(p, head, hlist) {
+ if (kprobe_gone(p))
+ continue;
+
if (within_module_init((unsigned long)p->addr, mod) ||
(checkcore &&
within_module_core((unsigned long)p->addr, mod))) {
@@ -2436,6 +2442,7 @@ static int kprobes_module_callback(struc
*/
kill_kprobe(p);
}
+ }
}
if (val == MODULE_STATE_GOING)
remove_module_kprobe_blacklist(mod);
_
From: Hugh Dickins <hughd(a)google.com>
Subject: ksm: reinstate memcg charge on copied pages
Patch series "mm: fixes to past from future testing".
Here's a set of independent fixes against 5.9-rc2: prompted by
testing Alex Shi's "warning on !memcg" and lru_lock series, but
I think fit for 5.9 - though maybe only the first for stable.
This patch (of 5):
In 5.8 some instances of memcg charging in do_swap_page() and unuse_pte()
were removed, on the understanding that swap cache is now already charged
at those points; but a case was missed, when ksm_might_need_to_copy() has
decided it must allocate a substitute page: such pages were never charged.
Fix it inside ksm_might_need_to_copy().
This was discovered by Alex Shi's prospective commit "mm/memcg: warning on
!memcg after readahead page charged".
But there is a another surprise: this also fixes some rarer uncharged
PageAnon cases, when KSM is configured in, but has never been activated.
ksm_might_need_to_copy()'s anon_vma->root and linear_page_index() check
sometimes catches a case which would need to have been copied if KSM were
turned on. Or that's my optimistic interpretation (of my own old code),
but it leaves some doubt as to whether everything is working as intended
there - might it hint at rare anon ptes which rmap cannot find? A
question not easily answered: put in the fix for missed memcg charges.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301343270.5954@eggly.anvils
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301358020.5954@eggly.anvils
Fixes: 4c6355b25e8b ("mm: memcontrol: charge swapin pages on instantiation")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Reviewed-by: Shakeel Butt <shakeelb(a)google.com>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Alex Shi <alex.shi(a)linux.alibaba.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc; Matthew Wilcox <willy(a)infradead.org>
Cc: Qian Cai <cai(a)lca.pw>
Cc: <stable(a)vger.kernel.org> [5.8]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/ksm.c | 4 ++++
1 file changed, 4 insertions(+)
--- a/mm/ksm.c~ksm-reinstate-memcg-charge-on-copied-pages
+++ a/mm/ksm.c
@@ -2586,6 +2586,10 @@ struct page *ksm_might_need_to_copy(stru
return page; /* let do_swap_page report the error */
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ if (new_page && mem_cgroup_charge(new_page, vma->vm_mm, GFP_KERNEL)) {
+ put_page(new_page);
+ new_page = NULL;
+ }
if (new_page) {
copy_user_highpage(new_page, page, address, vma);
_
A recent fix to the dm_dax_supported() flow uncovered a latent bug. When
dm_get_live_table() fails it is still required to drop the
srcu_read_lock(). Without this change the lvm2 test-suite triggers this
warning:
# lvm2-testsuite --only pvmove-abort-all.sh
WARNING: lock held when returning to user space!
5.9.0-rc5+ #251 Tainted: G OE
------------------------------------------------
lvm/1318 is leaving the kernel with locks still held!
1 lock held by lvm/1318:
#0: ffff9372abb5a340 (&md->io_barrier){....}-{0:0}, at: dm_get_live_table+0x5/0xb0 [dm_mod]
...and later on this hang signature:
INFO: task lvm:1344 blocked for more than 122 seconds.
Tainted: G OE 5.9.0-rc5+ #251
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:lvm state:D stack: 0 pid: 1344 ppid: 1 flags:0x00004000
Call Trace:
__schedule+0x45f/0xa80
? finish_task_switch+0x249/0x2c0
? wait_for_completion+0x86/0x110
schedule+0x5f/0xd0
schedule_timeout+0x212/0x2a0
? __schedule+0x467/0xa80
? wait_for_completion+0x86/0x110
wait_for_completion+0xb0/0x110
__synchronize_srcu+0xd1/0x160
? __bpf_trace_rcu_utilization+0x10/0x10
__dm_suspend+0x6d/0x210 [dm_mod]
dm_suspend+0xf6/0x140 [dm_mod]
Fixes: 7bf7eac8d648 ("dax: Arrange for dax_supported check to span multiple devices")
Cc: <stable(a)vger.kernel.org>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Alasdair Kergon <agk(a)redhat.com>
Cc: Mike Snitzer <snitzer(a)redhat.com>
Reported-by: Adrian Huang <ahuang12(a)lenovo.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
drivers/md/dm.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index fb0255d25e4b..4a40df8af7d3 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1136,15 +1136,16 @@ static bool dm_dax_supported(struct dax_device *dax_dev, struct block_device *bd
{
struct mapped_device *md = dax_get_private(dax_dev);
struct dm_table *map;
+ bool ret = false;
int srcu_idx;
- bool ret;
map = dm_get_live_table(md, &srcu_idx);
if (!map)
- return false;
+ goto out;
ret = dm_table_supports_dax(map, device_supports_dax, &blocksize);
+out:
dm_put_live_table(md, srcu_idx);
return ret;
Nick reports that the following error is produced in some cases when
using GCC+ld.bfd to build the ARM defconfig with Thumb2 enabled:
arch/arm/vfp/vfphw.o: in function `vfp_support_entry':
(.text+0xa): relocation truncated to fit: R_ARM_THM_JUMP19 against
symbol `vfp_kmode_exception' defined in .text.unlikely section in
arch/arm/vfp/vfpmodule.o
$ arm-linux-gnueabihf-ld --version
GNU ld (GNU Binutils for Debian) 2.34
Generally, the linker should be able to fix up out of range branches by
emitting veneers, but apparently, it fails to do so in this particular
case, i.e., a conditional 'tail call' to vfp_kmode_exception(), which
is not defined in the same object file.
So let's force the use of a non-conditional encoding of the B instruction,
which has more space for an immediate offset.
Cc: <stable(a)vger.kernel.org>
Reported-by: Nick Desaulniers <ndesaulniers(a)google.com>
Tested-by: Nick Desaulniers <ndesaulniers(a)google.com>
Signed-off-by: Ard Biesheuvel <ardb(a)kernel.org>
---
arch/arm/vfp/vfphw.S | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm/vfp/vfphw.S b/arch/arm/vfp/vfphw.S
index 4fcff9f59947..f1468702fbc9 100644
--- a/arch/arm/vfp/vfphw.S
+++ b/arch/arm/vfp/vfphw.S
@@ -82,6 +82,7 @@ ENTRY(vfp_support_entry)
ldr r3, [sp, #S_PSR] @ Neither lazy restore nor FP exceptions
and r3, r3, #MODE_MASK @ are supported in kernel mode
teq r3, #USR_MODE
+THUMB( it ne )
bne vfp_kmode_exception @ Returns through lr
VFPFMRX r1, FPEXC @ Is the VFP enabled?
--
2.17.1
From: Jonathan Lebon <jlebon(a)redhat.com>
[ Upstream commit 3e3e24b42043eceb97ed834102c2d094dfd7aaa6 ]
Currently, the SELinux LSM prevents one from setting the
`security.selinux` xattr on an inode without a policy first being
loaded. However, this restriction is problematic: it makes it impossible
to have newly created files with the correct label before actually
loading the policy.
This is relevant in distributions like Fedora, where the policy is
loaded by systemd shortly after pivoting out of the initrd. In such
instances, all files created prior to pivoting will be unlabeled. One
then has to relabel them after pivoting, an operation which inherently
races with other processes trying to access those same files.
Going further, there are use cases for creating the entire root
filesystem on first boot from the initrd (e.g. Container Linux supports
this today[1], and we'd like to support it in Fedora CoreOS as well[2]).
One can imagine doing this in two ways: at the block device level (e.g.
laying down a disk image), or at the filesystem level. In the former,
labeling can simply be part of the image. But even in the latter
scenario, one still really wants to be able to set the right labels when
populating the new filesystem.
This patch enables this by changing behaviour in the following two ways:
1. allow `setxattr` if we're not initialized
2. don't try to set the in-core inode SID if we're not initialized;
instead leave it as `LABEL_INVALID` so that revalidation may be
attempted at a later time
Note the first hunk of this patch is mostly the same as a previously
discussed one[3], though it was part of a larger series which wasn't
accepted.
[1] https://coreos.com/os/docs/latest/root-filesystem-placement.html
[2] https://github.com/coreos/fedora-coreos-tracker/issues/94
[3] https://www.spinics.net/lists/linux-initramfs/msg04593.html
Co-developed-by: Victor Kamensky <kamensky(a)cisco.com>
Signed-off-by: Victor Kamensky <kamensky(a)cisco.com>
Signed-off-by: Jonathan Lebon <jlebon(a)redhat.com>
Signed-off-by: Paul Moore <paul(a)paul-moore.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
security/selinux/hooks.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 452254fd89f87..250b725f5754c 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3304,6 +3304,9 @@ static int selinux_inode_setxattr(struct dentry *dentry, const char *name,
return dentry_has_perm(current_cred(), dentry, FILE__SETATTR);
}
+ if (!selinux_state.initialized)
+ return (inode_owner_or_capable(inode) ? 0 : -EPERM);
+
sbsec = inode->i_sb->s_security;
if (!(sbsec->flags & SBLABEL_MNT))
return -EOPNOTSUPP;
@@ -3387,6 +3390,15 @@ static void selinux_inode_post_setxattr(struct dentry *dentry, const char *name,
return;
}
+ if (!selinux_state.initialized) {
+ /* If we haven't even been initialized, then we can't validate
+ * against a policy, so leave the label as invalid. It may
+ * resolve to a valid label on the next revalidation try if
+ * we've since initialized.
+ */
+ return;
+ }
+
rc = security_context_to_sid_force(&selinux_state, value, size,
&newsid);
if (rc) {
--
2.25.1
From: Marc Zyngier <maz(a)kernel.org>
Commit 606f8e7b27bf ("arm64: capabilities: Use linear array for
detection and verification") changed the way we deal with per-CPU errata
by only calling the .matches() callback until one CPU is found to be
affected. At this point, .matches() stop being called, and .cpu_enable()
will be called on all CPUs.
This breaks the ARCH_WORKAROUND_2 handling, as only a single CPU will be
mitigated.
In order to address this, forcefully call the .matches() callback from a
.cpu_enable() callback, which brings us back to the original behaviour.
Fixes: 606f8e7b27bf ("arm64: capabilities: Use linear array for detection and verification")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Signed-off-by: Will Deacon <will(a)kernel.org>
---
arch/arm64/kernel/cpu_errata.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 3fe64bf5a58d..abfef5f3b5fd 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -457,6 +457,12 @@ static bool has_ssbd_mitigation(const struct arm64_cpu_capabilities *entry,
return required;
}
+static void cpu_enable_ssbd_mitigation(const struct arm64_cpu_capabilities *cap)
+{
+ if (ssbd_state != ARM64_SSBD_FORCE_DISABLE)
+ cap->matches(cap, SCOPE_LOCAL_CPU);
+}
+
/* known invulnerable cores */
static const struct midr_range arm64_ssb_cpus[] = {
MIDR_ALL_VERSIONS(MIDR_CORTEX_A35),
@@ -914,6 +920,7 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
.capability = ARM64_SSBD,
.type = ARM64_CPUCAP_LOCAL_CPU_ERRATUM,
.matches = has_ssbd_mitigation,
+ .cpu_enable = cpu_enable_ssbd_mitigation,
.midr_range_list = arm64_ssb_cpus,
},
#ifdef CONFIG_ARM64_ERRATUM_1418040
--
2.28.0.681.g6f77f65b4e-goog