musb_queue_resume_work() would call the provided callback if the runtime
PM status was 'active'. Otherwise, it would only enqueue the request if the
hardware was still suspended (musb->is_runtime_suspended is true).
This causes a race with the runtime PM handlers, as it is possible that
the runtime PM status is not yet 'active' while the hardware has already
been woken up (the PM resume function has been called).
When the race was hit, the resume work was not enqueued, which probably
triggered other bugs further down the stack. For instance, a telnet
connection on Ingenic SoCs would result in a 50/50 chance of a
segmentation fault somewhere in the musb code.
Rework the code so that the callback is called directly when
(musb->is_runtime_suspended == 0), and the work is enqueued otherwise.
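To make the new ordering concrete, here is a condensed userspace sketch of
the reworked flow from the diff below; a pthread mutex stands in for
musb->list_lock, a plain counter stands in for the pending list, and all
names are illustrative rather than the driver's real API:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_musb {
	pthread_mutex_t list_lock;	/* stands in for musb->list_lock */
	bool is_runtime_suspended;
	int pending;			/* stands in for musb->pending_list */
};

static int queue_resume_work(struct fake_musb *musb,
			     int (*callback)(struct fake_musb *))
{
	bool is_suspended;
	int error = 0;

	pthread_mutex_lock(&musb->list_lock);
	is_suspended = musb->is_runtime_suspended;
	if (is_suspended)
		musb->pending++;	/* defer to the resume handler */
	pthread_mutex_unlock(&musb->list_lock);

	/* Hardware already awake: run the callback outside the lock. */
	if (!is_suspended)
		error = callback(musb);

	return error;
}

static int say_hello(struct fake_musb *musb)
{
	printf("callback ran, pending=%d\n", musb->pending);
	return 0;
}

int main(void)
{
	struct fake_musb musb = {
		.list_lock = PTHREAD_MUTEX_INITIALIZER,
		.is_runtime_suspended = false,
	};

	return queue_resume_work(&musb, say_hello);
}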
Fixes: ea2f35c01d5e ("usb: musb: Fix sleeping function called from invalid context for hdrc glue")
Cc: stable(a)vger.kernel.org # v4.9+
Signed-off-by: Paul Cercueil <paul(a)crapouillou.net>
Reviewed-by: Tony Lindgren <tony(a)atomide.com>
Tested-by: Tony Lindgren <tony(a)atomide.com>
---
drivers/usb/musb/musb_core.c | 31 +++++++++++++++++--------------
1 file changed, 17 insertions(+), 14 deletions(-)
diff --git a/drivers/usb/musb/musb_core.c b/drivers/usb/musb/musb_core.c
index 849e0b770130..1cd87729ba60 100644
--- a/drivers/usb/musb/musb_core.c
+++ b/drivers/usb/musb/musb_core.c
@@ -2240,32 +2240,35 @@ int musb_queue_resume_work(struct musb *musb,
{
struct musb_pending_work *w;
unsigned long flags;
+ bool is_suspended;
int error;
if (WARN_ON(!callback))
return -EINVAL;
- if (pm_runtime_active(musb->controller))
- return callback(musb, data);
+ spin_lock_irqsave(&musb->list_lock, flags);
+ is_suspended = musb->is_runtime_suspended;
+
+ if (is_suspended) {
+ w = devm_kzalloc(musb->controller, sizeof(*w), GFP_ATOMIC);
+ if (!w) {
+ error = -ENOMEM;
+ goto out_unlock;
+ }
- w = devm_kzalloc(musb->controller, sizeof(*w), GFP_ATOMIC);
- if (!w)
- return -ENOMEM;
+ w->callback = callback;
+ w->data = data;
- w->callback = callback;
- w->data = data;
- spin_lock_irqsave(&musb->list_lock, flags);
- if (musb->is_runtime_suspended) {
list_add_tail(&w->node, &musb->pending_list);
error = 0;
- } else {
- dev_err(musb->controller, "could not add resume work %p\n",
- callback);
- devm_kfree(musb->controller, w);
- error = -EINPROGRESS;
}
+
+out_unlock:
spin_unlock_irqrestore(&musb->list_lock, flags);
+ if (!is_suspended)
+ error = callback(musb, data);
+
return error;
}
EXPORT_SYMBOL_GPL(musb_queue_resume_work);
--
2.29.2
From: Xiaoming Ni <nixiaoming(a)huawei.com>
Subject: proc_sysctl: fix oops caused by incorrect command parameters
process_sysctl_arg() does not check whether val is empty before invoking
strlen(val). If a command line parameter is incorrectly configured and val
is empty, an oops is triggered.
For example:
"hung_task_panic=1" is incorrectly written as "hung_task_panic", oops is
triggered. The call stack is as follows:
Kernel command line: .... hung_task_panic
......
Call trace:
__pi_strlen+0x10/0x98
parse_args+0x278/0x344
do_sysctl_args+0x8c/0xfc
kernel_init+0x5c/0xf4
ret_from_fork+0x10/0x30
To fix it, check whether "val" is empty when "param" refers to a sysctl
entry. Error codes are returned in the failure branch, and the error logs
are generated by parse_args().
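The check amounts to rejecting a NULL or empty val before strlen() and
kernel_write() ever see it; a minimal userspace illustration (process_arg()
is a made-up stand-in for process_sysctl_arg(), error values follow the
usual kernel convention):

#include <errno.h>
#include <stdio.h>
#include <string.h>

static int process_arg(const char *param, const char *val)
{
	size_t len;

	if (!val)		/* "hung_task_panic" with no '=' at all */
		return -EINVAL;

	len = strlen(val);
	if (len == 0)		/* "hung_task_panic=" with nothing after it */
		return -EINVAL;

	printf("would write \"%s\" (%zu bytes) to the %s sysctl\n",
	       val, len, param);
	return 0;
}

int main(void)
{
	printf("%d\n", process_arg("hung_task_panic", NULL));	/* -22 */
	printf("%d\n", process_arg("hung_task_panic", ""));	/* -22 */
	printf("%d\n", process_arg("hung_task_panic", "1"));	/*   0 */
	return 0;
}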
Link: https://lkml.kernel.org/r/20210118133029.28580-1-nixiaoming@huawei.com
Fixes: 3db978d480e2843 ("kernel/sysctl: support setting sysctl parameters from kernel command line")
Signed-off-by: Xiaoming Ni <nixiaoming(a)huawei.com>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Luis Chamberlain <mcgrof(a)kernel.org>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Iurii Zaikin <yzaikin(a)google.com>
Cc: Alexey Dobriyan <adobriyan(a)gmail.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Heiner Kallweit <hkallweit1(a)gmail.com>
Cc: Randy Dunlap <rdunlap(a)infradead.org>
Cc: <stable(a)vger.kernel.org> [5.8+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/proc/proc_sysctl.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
--- a/fs/proc/proc_sysctl.c~proc_sysctl-fix-oops-caused-by-incorrect-command-parameters
+++ a/fs/proc/proc_sysctl.c
@@ -1770,6 +1770,12 @@ static int process_sysctl_arg(char *para
return 0;
}
+ if (!val)
+ return -EINVAL;
+ len = strlen(val);
+ if (len == 0)
+ return -EINVAL;
+
/*
* To set sysctl options, we use a temporary mount of proc, look up the
* respective sys/ file and write to it. To avoid mounting it when no
@@ -1811,7 +1817,6 @@ static int process_sysctl_arg(char *para
file, param, val);
goto out;
}
- len = strlen(val);
wret = kernel_write(file, val, len, &pos);
if (wret < 0) {
err = wret;
_
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm: fix page reference leak in soft_offline_page()
The conversion to move pfn_to_online_page() internal to
soft_offline_page() missed that the get_user_pages() reference taken by
the madvise() path needs to be dropped when pfn_to_online_page() fails.
Note the direct sysfs-path to soft_offline_page() does not perform a
get_user_pages() lookup.
When soft_offline_page() is handed a pfn_valid() && !pfn_to_online_page()
pfn, the kernel hangs at dax-device shutdown due to a leaked reference.
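In other words, the reference taken when MF_COUNT_INCREASED is set must be
dropped on every early-error exit. A toy refcount model of that flow; the
names are made up and a plain counter stands in for the struct page
reference:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define MF_COUNT_INCREASED 1

static int refcount = 1;	/* reference taken by the madvise() path */

static void put_ref_page(bool have_ref)
{
	if (have_ref)
		refcount--;
}

/* 'online' mimics pfn_to_online_page() succeeding or failing. */
static int soft_offline(int flags, bool online)
{
	bool have_ref = flags & MF_COUNT_INCREASED;

	if (!online) {
		/* The pre-fix code returned here without dropping the ref. */
		put_ref_page(have_ref);
		return -EIO;
	}

	/* ... the rest of the offlining work would run here ... */
	return 0;
}

int main(void)
{
	soft_offline(MF_COUNT_INCREASED, false);
	printf("refcount after failed offline: %d\n", refcount);	/* 0 */
	return 0;
}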
Link: https://lkml.kernel.org/r/161058501210.1840162.8108917599181157327.stgit@dw…
Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reviewed-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Qian Cai <cai(a)lca.pw>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory-failure.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
--- a/mm/memory-failure.c~mm-fix-page-reference-leak-in-soft_offline_page
+++ a/mm/memory-failure.c
@@ -1885,6 +1885,12 @@ static int soft_offline_free_page(struct
return rc;
}
+static void put_ref_page(struct page *page)
+{
+ if (page)
+ put_page(page);
+}
+
/**
* soft_offline_page - Soft offline a page.
* @pfn: pfn to soft-offline
@@ -1910,20 +1916,26 @@ static int soft_offline_free_page(struct
int soft_offline_page(unsigned long pfn, int flags)
{
int ret;
- struct page *page;
bool try_again = true;
+ struct page *page, *ref_page = NULL;
+
+ WARN_ON_ONCE(!pfn_valid(pfn) && (flags & MF_COUNT_INCREASED));
if (!pfn_valid(pfn))
return -ENXIO;
+ if (flags & MF_COUNT_INCREASED)
+ ref_page = pfn_to_page(pfn);
+
/* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
page = pfn_to_online_page(pfn);
- if (!page)
+ if (!page) {
+ put_ref_page(ref_page);
return -EIO;
+ }
if (PageHWPoison(page)) {
pr_info("%s: %#lx page already poisoned\n", __func__, pfn);
- if (flags & MF_COUNT_INCREASED)
- put_page(page);
+ put_ref_page(ref_page);
return 0;
}
_
From: Shakeel Butt <shakeelb(a)google.com>
Subject: mm: fix numa stats for thp migration
Currently the kernel does not correctly update the numa stats for
NR_FILE_PAGES and NR_SHMEM on THP migration. Fix that. For NR_FILE_DIRTY
and NR_ZONE_WRITE_PENDING there is no need to handle THP migration yet,
as the kernel still does not have write support for file THP, but to be
more future proof this patch adds THP support for those stats as well.
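The arithmetic behind the fix: a THP spans thp_nr_pages() base pages (512
for a 2M THP with 4K base pages), so on migration the counters have to move
by nr rather than by 1. A small standalone model of the accounting, with
plain longs standing in for the lruvec counters:

#include <stdio.h>

#define THP_NR_PAGES 512	/* 2M THP, 4K base pages */

struct node_stats {
	long nr_file_pages;
	long nr_shmem;
};

/* Move one migrated page's worth of accounting between nodes. */
static void account_migration(struct node_stats *oldn,
			      struct node_stats *newn, int nr, int shmem)
{
	oldn->nr_file_pages -= nr;
	newn->nr_file_pages += nr;
	if (shmem) {
		oldn->nr_shmem -= nr;
		newn->nr_shmem += nr;
	}
}

int main(void)
{
	struct node_stats node0 = { .nr_file_pages = 512, .nr_shmem = 512 };
	struct node_stats node1 = { 0 };

	/* The old code effectively used nr == 1 even for a THP. */
	account_migration(&node0, &node1, THP_NR_PAGES, 1);

	printf("node0: file=%ld shmem=%ld\n",
	       node0.nr_file_pages, node0.nr_shmem);
	printf("node1: file=%ld shmem=%ld\n",
	       node1.nr_file_pages, node1.nr_shmem);
	return 0;
}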
Link: https://lkml.kernel.org/r/20210108155813.2914586-2-shakeelb@google.com
Fixes: e71769ae52609 ("mm: enable thp migration for shmem thp")
Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
Acked-by: Yang Shi <shy828301(a)gmail.com>
Reviewed-by: Roman Gushchin <guro(a)fb.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/migrate.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)
--- a/mm/migrate.c~mm-fix-numa-stats-for-thp-migration
+++ a/mm/migrate.c
@@ -402,6 +402,7 @@ int migrate_page_move_mapping(struct add
struct zone *oldzone, *newzone;
int dirty;
int expected_count = expected_page_refs(mapping, page) + extra_count;
+ int nr = thp_nr_pages(page);
if (!mapping) {
/* Anonymous page without mapping */
@@ -437,7 +438,7 @@ int migrate_page_move_mapping(struct add
*/
newpage->index = page->index;
newpage->mapping = page->mapping;
- page_ref_add(newpage, thp_nr_pages(page)); /* add cache reference */
+ page_ref_add(newpage, nr); /* add cache reference */
if (PageSwapBacked(page)) {
__SetPageSwapBacked(newpage);
if (PageSwapCache(page)) {
@@ -459,7 +460,7 @@ int migrate_page_move_mapping(struct add
if (PageTransHuge(page)) {
int i;
- for (i = 1; i < HPAGE_PMD_NR; i++) {
+ for (i = 1; i < nr; i++) {
xas_next(&xas);
xas_store(&xas, newpage);
}
@@ -470,7 +471,7 @@ int migrate_page_move_mapping(struct add
* to one less reference.
* We know this isn't the last reference.
*/
- page_ref_unfreeze(page, expected_count - thp_nr_pages(page));
+ page_ref_unfreeze(page, expected_count - nr);
xas_unlock(&xas);
/* Leave irq disabled to prevent preemption while updating stats */
@@ -493,17 +494,17 @@ int migrate_page_move_mapping(struct add
old_lruvec = mem_cgroup_lruvec(memcg, oldzone->zone_pgdat);
new_lruvec = mem_cgroup_lruvec(memcg, newzone->zone_pgdat);
- __dec_lruvec_state(old_lruvec, NR_FILE_PAGES);
- __inc_lruvec_state(new_lruvec, NR_FILE_PAGES);
+ __mod_lruvec_state(old_lruvec, NR_FILE_PAGES, -nr);
+ __mod_lruvec_state(new_lruvec, NR_FILE_PAGES, nr);
if (PageSwapBacked(page) && !PageSwapCache(page)) {
- __dec_lruvec_state(old_lruvec, NR_SHMEM);
- __inc_lruvec_state(new_lruvec, NR_SHMEM);
+ __mod_lruvec_state(old_lruvec, NR_SHMEM, -nr);
+ __mod_lruvec_state(new_lruvec, NR_SHMEM, nr);
}
if (dirty && mapping_can_writeback(mapping)) {
- __dec_lruvec_state(old_lruvec, NR_FILE_DIRTY);
- __dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
- __inc_lruvec_state(new_lruvec, NR_FILE_DIRTY);
- __inc_zone_state(newzone, NR_ZONE_WRITE_PENDING);
+ __mod_lruvec_state(old_lruvec, NR_FILE_DIRTY, -nr);
+ __mod_zone_page_state(oldzone, NR_ZONE_WRITE_PENDING, -nr);
+ __mod_lruvec_state(new_lruvec, NR_FILE_DIRTY, nr);
+ __mod_zone_page_state(newzone, NR_ZONE_WRITE_PENDING, nr);
}
}
local_irq_enable();
_
From: Shakeel Butt <shakeelb(a)google.com>
Subject: mm: memcg: fix memcg file_dirty numa stat
The kernel updates the per-node NR_FILE_DIRTY stats on page migration but
not the memcg numa stats. That was not an issue until commit 5f9a4f4a7096
("mm: memcontrol: add the missing numa_stat interface for cgroup v2")
recently exposed numa stats for the memcg. So fix the per-memcg file_dirty
numa stat.
Link: https://lkml.kernel.org/r/20210108155813.2914586-1-shakeelb@google.com
Fixes: 5f9a4f4a7096 ("mm: memcontrol: add the missing numa_stat interface for cgroup v2")
Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
Reviewed-by: Muchun Song <songmuchun(a)bytedance.com>
Acked-by: Yang Shi <shy828301(a)gmail.com>
Reviewed-by: Roman Gushchin <guro(a)fb.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/migrate.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/mm/migrate.c~mm-memcg-fix-memcg-file_dirty-numa-stat
+++ a/mm/migrate.c
@@ -500,9 +500,9 @@ int migrate_page_move_mapping(struct add
__inc_lruvec_state(new_lruvec, NR_SHMEM);
}
if (dirty && mapping_can_writeback(mapping)) {
- __dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
+ __dec_lruvec_state(old_lruvec, NR_FILE_DIRTY);
__dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
- __inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
+ __inc_lruvec_state(new_lruvec, NR_FILE_DIRTY);
__inc_zone_state(newzone, NR_ZONE_WRITE_PENDING);
}
}
_
From: Mike Rapoport <rppt(a)linux.ibm.com>
Subject: mm: fix initialization of struct page for holes in memory layout
There could be struct pages that are not backed by actual physical memory.
This can happen when the actual memory bank is not a multiple of
SECTION_SIZE or when an architecture does not register memory holes
reserved by the firmware as memblock.memory.
Such pages are currently initialized by the init_unavailable_mem()
function, which iterates through the PFNs in holes in memblock.memory and,
if there is a struct page corresponding to a PFN, sets the fields of that
page to default values and marks the page as Reserved.
init_unavailable_mem() does not take into account the zone and node the
page belongs to, and sets both the zone and node links in struct page to
zero.
On a system that has firmware-reserved holes in a zone above ZONE_DMA, for
instance in the configuration below:
# grep -A1 E820 /proc/iomem
7a17b000-7a216fff : Unknown E820 type
7a217000-7bffffff : System RAM
the unset zone link in struct page will trigger
VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
because there are pages in both ZONE_DMA32 and ZONE_DMA (due to the unset
zone link in struct page) in the same pageblock.
Update init_unavailable_mem() to use the zone constraints defined by the
architecture to properly set up the zone link, and use the node ID of the
adjacent range in memblock.memory to set the node link.
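The core of the change is clamping each hole to the PFN span the
architecture allows for a given zone before initializing its struct pages.
A minimal standalone sketch of that clamp, with made-up zone bounds in
place of arch_zone_lowest/highest_possible_pfn[]:

#include <stdio.h>

static unsigned long clampul(unsigned long v, unsigned long lo,
			     unsigned long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Model of the reworked init_unavailable_range(): only the part of a
 * hole that falls inside the zone's possible PFN range is initialized
 * for that zone, and those pages get that zone's link.
 */
static unsigned long init_hole_for_zone(unsigned long spfn,
					unsigned long epfn,
					unsigned long zone_spfn,
					unsigned long zone_epfn)
{
	spfn = clampul(spfn, zone_spfn, zone_epfn);
	epfn = clampul(epfn, zone_spfn, zone_epfn);

	return epfn > spfn ? epfn - spfn : 0;	/* pages initialized */
}

int main(void)
{
	/* The hole above (pfns 0x7a17b-0x7a216) against a DMA32-like span */
	unsigned long n = init_hole_for_zone(0x7a17b, 0x7a217,
					     0x1000, 0x100000);

	printf("struct pages initialized for this zone: %lu\n", n);
	return 0;
}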
Link: https://lkml.kernel.org/r/20210111194017.22696-3-rppt@kernel.org
Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Signed-off-by: Mike Rapoport <rppt(a)linux.ibm.com>
Reported-by: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Qian Cai <cai(a)lca.pw>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 84 +++++++++++++++++++++++++++-------------------
1 file changed, 50 insertions(+), 34 deletions(-)
--- a/mm/page_alloc.c~mm-fix-initialization-of-struct-page-for-holes-in-memory-layout
+++ a/mm/page_alloc.c
@@ -7078,23 +7078,26 @@ void __init free_area_init_memoryless_no
* Initialize all valid struct pages in the range [spfn, epfn) and mark them
* PageReserved(). Return the number of struct pages that were initialized.
*/
-static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
+static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn,
+ int zone, int nid)
{
- unsigned long pfn;
+ unsigned long pfn, zone_spfn, zone_epfn;
u64 pgcnt = 0;
+ zone_spfn = arch_zone_lowest_possible_pfn[zone];
+ zone_epfn = arch_zone_highest_possible_pfn[zone];
+
+ spfn = clamp(spfn, zone_spfn, zone_epfn);
+ epfn = clamp(epfn, zone_spfn, zone_epfn);
+
for (pfn = spfn; pfn < epfn; pfn++) {
if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
+ pageblock_nr_pages - 1;
continue;
}
- /*
- * Use a fake node/zone (0) for now. Some of these pages
- * (in memblock.reserved but not in memblock.memory) will
- * get re-initialized via reserve_bootmem_region() later.
- */
- __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
+
+ __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
__SetPageReserved(pfn_to_page(pfn));
pgcnt++;
}
@@ -7103,51 +7106,64 @@ static u64 __init init_unavailable_range
}
/*
- * Only struct pages that are backed by physical memory are zeroed and
- * initialized by going through __init_single_page(). But, there are some
- * struct pages which are reserved in memblock allocator and their fields
- * may be accessed (for example page_to_pfn() on some configuration accesses
- * flags). We must explicitly initialize those struct pages.
+ * Only struct pages that correspond to ranges defined by memblock.memory
+ * are zeroed and initialized by going through __init_single_page() during
+ * memmap_init().
*
- * This function also addresses a similar issue where struct pages are left
- * uninitialized because the physical address range is not covered by
- * memblock.memory or memblock.reserved. That could happen when memblock
- * layout is manually configured via memmap=, or when the highest physical
- * address (max_pfn) does not end on a section boundary.
+ * But, there could be struct pages that correspond to holes in
+ * memblock.memory. This can happen because of the following reasons:
+ * - physical memory bank size is not necessarily the exact multiple of the
+ * arbitrary section size
+ * - early reserved memory may not be listed in memblock.memory
+ * - memory layouts defined with memmap= kernel parameter may not align
+ * nicely with memmap sections
+ *
+ * Explicitly initialize those struct pages so that:
+ * - PG_Reserved is set
+ * - zone link is set according to the architecture constraints
+ * - node is set to node id of the next populated region except for the
+ * trailing hole where last node id is used
*/
-static void __init init_unavailable_mem(void)
+static void __init init_zone_unavailable_mem(int zone)
{
- phys_addr_t start, end;
- u64 i, pgcnt;
- phys_addr_t next = 0;
+ unsigned long start, end;
+ int i, nid;
+ u64 pgcnt;
+ unsigned long next = 0;
/*
- * Loop through unavailable ranges not covered by memblock.memory.
+ * Loop through holes in memblock.memory and initialize struct
+ * pages corresponding to these holes
*/
pgcnt = 0;
- for_each_mem_range(i, &start, &end) {
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
if (next < start)
- pgcnt += init_unavailable_range(PFN_DOWN(next),
- PFN_UP(start));
+ pgcnt += init_unavailable_range(next, start, zone, nid);
next = end;
}
/*
- * Early sections always have a fully populated memmap for the whole
- * section - see pfn_valid(). If the last section has holes at the
- * end and that section is marked "online", the memmap will be
- * considered initialized. Make sure that memmap has a well defined
- * state.
+ * Last section may surpass the actual end of memory (e.g. we can
+ * have 1Gb section and 512Mb of RAM populated).
+ * Make sure that memmap has a well defined state in this case.
*/
- pgcnt += init_unavailable_range(PFN_DOWN(next),
- round_up(max_pfn, PAGES_PER_SECTION));
+ end = round_up(max_pfn, PAGES_PER_SECTION);
+ pgcnt += init_unavailable_range(next, end, zone, nid);
/*
* Struct pages that do not have backing memory. This could be because
* firmware is using some of this memory, or for some other reasons.
*/
if (pgcnt)
- pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
+ pr_info("Zone %s: zeroed struct page in unavailable ranges: %lld pages", zone_names[zone], pgcnt);
+}
+
+static void __init init_unavailable_mem(void)
+{
+ int zone;
+
+ for (zone = 0; zone < ZONE_MOVABLE; zone++)
+ init_zone_unavailable_mem(zone);
}
#else
static inline void __init init_unavailable_mem(void)
_
From: Mike Rapoport <rppt(a)linux.ibm.com>
Subject: x86/setup: don't remove E820_TYPE_RAM for pfn 0
Patch series "mm: fix initialization of struct page for holes in memory layout", v3.
Commit 73a6e474cb37 ("mm: memmap_init: iterate over
memblock regions rather that check each PFN") exposed several issues with
the memory map initialization and these patches fix those issues.
Initially there were crashes during compaction that Qian Cai reported back
in April [1]. It seemed back then that the problem was fixed, but a few
weeks ago Andrea Arcangeli hit the same bug [2] and there was an additional
discussion at [3].
[1] https://lore.kernel.org/lkml/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
[2] https://lore.kernel.org/lkml/20201121194506.13464-1-aarcange@redhat.com
[3] https://lore.kernel.org/mm-commits/20201206005401.qKuAVgOXr%akpm@linux-foun…
This patch (of 2):
The first 4Kb of memory is a BIOS-owned area and, to avoid its allocation
for the kernel, it was not listed in the e820 tables as memory. As a
result, pfn 0 was never recognised by the generic memory management, and it
is part of neither node 0 nor ZONE_DMA.
If set_pfnblock_flags_mask() were ever called for the pageblock
corresponding to the first 2 Mbytes of memory, having pfn 0 outside of
ZONE_DMA would trigger
VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
Along with reserving the first 4Kb in the e820 tables, the first several
pages are reserved with memblock in several places during setup_arch().
These reservations are enough to ensure the kernel does not touch the BIOS
area, so it is not necessary to remove E820_TYPE_RAM for pfn 0.
Remove the update of the e820 table that changes the type of pfn 0, and
move the comment describing why it was done to trim_low_memory_range(),
which reserves the beginning of memory.
Link: https://lkml.kernel.org/r/20210111194017.22696-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt(a)linux.ibm.com>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Qian Cai <cai(a)lca.pw>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/x86/kernel/setup.c | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)
--- a/arch/x86/kernel/setup.c~x86-setup-dont-remove-e820_type_ram-for-pfn-0
+++ a/arch/x86/kernel/setup.c
@@ -661,17 +661,6 @@ static void __init trim_platform_memory_
static void __init trim_bios_range(void)
{
/*
- * A special case is the first 4Kb of memory;
- * This is a BIOS owned area, not kernel ram, but generally
- * not listed as such in the E820 table.
- *
- * This typically reserves additional memory (64KiB by default)
- * since some BIOSes are known to corrupt low memory. See the
- * Kconfig help text for X86_RESERVE_LOW.
- */
- e820__range_update(0, PAGE_SIZE, E820_TYPE_RAM, E820_TYPE_RESERVED);
-
- /*
* special case: Some BIOSes report the PC BIOS
* area (640Kb -> 1Mb) as RAM even though it is not.
* take them out.
@@ -728,6 +717,15 @@ early_param("reservelow", parse_reservel
static void __init trim_low_memory_range(void)
{
+ /*
+ * A special case is the first 4Kb of memory;
+ * This is a BIOS owned area, not kernel ram, but generally
+ * not listed as such in the E820 table.
+ *
+ * This typically reserves additional memory (64KiB by default)
+ * since some BIOSes are known to corrupt low memory. See the
+ * Kconfig help text for X86_RESERVE_LOW.
+ */
memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
}
_
The patch titled
Subject: mm: thp: fix MADV_REMOVE deadlock on shmem THP
has been added to the -mm tree. Its filename is
mm-thp-fix-madv_remove-deadlock-on-shmem-thp.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-thp-fix-madv_remove-deadlock-o…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-thp-fix-madv_remove-deadlock-o…
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm: thp: fix MADV_REMOVE deadlock on shmem THP
Sergey reported a deadlock between kswapd correctly doing its usual
lock_page(page) followed by down_read(page->mapping->i_mmap_rwsem), and
madvise(MADV_REMOVE) on an madvise(MADV_HUGEPAGE) area doing
down_write(page->mapping->i_mmap_rwsem) followed by lock_page(page).
This happened when shmem_fallocate(punch hole)'s unmap_mapping_range()
reaches zap_pmd_range()'s call to __split_huge_pmd(). The same deadlock
could occur when partially truncating a mapped huge tmpfs file, or using
fallocate(FALLOC_FL_PUNCH_HOLE) on it.
__split_huge_pmd()'s page lock was added in 5.8, to make sure that any
concurrent use of reuse_swap_page() (holding page lock) could not catch
the anon THP's mapcounts and swapcounts while they were being split.
Fortunately, reuse_swap_page() is never applied to a shmem or file THP
(not even by khugepaged, which checks PageSwapCache before calling), and
anonymous THPs are never created in shmem or file areas: so that
__split_huge_pmd()'s page lock can only be necessary for anonymous THPs,
on which there is no risk of deadlock with i_mmap_rwsem.
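The deadlock is a classic AB-BA inversion between the page lock and
i_mmap_rwsem. A self-contained pthread illustration of the two orderings
described above, with plain mutexes in place of the page lock and the
rwsem; main() runs the paths one after the other, since interleaving them
is precisely what can hang:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t i_mmap_lock = PTHREAD_MUTEX_INITIALIZER;

/* kswapd path: lock_page(), then down_read(i_mmap_rwsem) */
static void reclaim_path(void)
{
	pthread_mutex_lock(&page_lock);
	pthread_mutex_lock(&i_mmap_lock);
	puts("reclaim: page lock -> i_mmap lock");
	pthread_mutex_unlock(&i_mmap_lock);
	pthread_mutex_unlock(&page_lock);
}

/* MADV_REMOVE path: down_write(i_mmap_rwsem), then (pre-fix) lock_page() */
static void punch_hole_path(void)
{
	pthread_mutex_lock(&i_mmap_lock);
	pthread_mutex_lock(&page_lock);	/* the fix avoids this for non-anon THPs */
	puts("punch hole: i_mmap lock -> page lock");
	pthread_mutex_unlock(&page_lock);
	pthread_mutex_unlock(&i_mmap_lock);
}

int main(void)
{
	reclaim_path();
	punch_hole_path();
	return 0;
}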
Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2101161409470.2022@eggly.anvils
Fixes: c444eb564fb1 ("mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Reported-by: Sergey Senozhatsky <sergey.senozhatsky.work(a)gmail.com>
Reviewed-by: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/huge_memory.c | 37 +++++++++++++++++++++++--------------
1 file changed, 23 insertions(+), 14 deletions(-)
--- a/mm/huge_memory.c~mm-thp-fix-madv_remove-deadlock-on-shmem-thp
+++ a/mm/huge_memory.c
@@ -2202,7 +2202,7 @@ void __split_huge_pmd(struct vm_area_str
{
spinlock_t *ptl;
struct mmu_notifier_range range;
- bool was_locked = false;
+ bool do_unlock_page = false;
pmd_t _pmd;
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
@@ -2218,7 +2218,6 @@ void __split_huge_pmd(struct vm_area_str
VM_BUG_ON(freeze && !page);
if (page) {
VM_WARN_ON_ONCE(!PageLocked(page));
- was_locked = true;
if (page != pmd_page(*pmd))
goto out;
}
@@ -2227,19 +2226,29 @@ repeat:
if (pmd_trans_huge(*pmd)) {
if (!page) {
page = pmd_page(*pmd);
- if (unlikely(!trylock_page(page))) {
- get_page(page);
- _pmd = *pmd;
- spin_unlock(ptl);
- lock_page(page);
- spin_lock(ptl);
- if (unlikely(!pmd_same(*pmd, _pmd))) {
- unlock_page(page);
+ /*
+ * An anonymous page must be locked, to ensure that a
+ * concurrent reuse_swap_page() sees stable mapcount;
+ * but reuse_swap_page() is not used on shmem or file,
+ * and page lock must not be taken when zap_pmd_range()
+ * calls __split_huge_pmd() while i_mmap_lock is held.
+ */
+ if (PageAnon(page)) {
+ if (unlikely(!trylock_page(page))) {
+ get_page(page);
+ _pmd = *pmd;
+ spin_unlock(ptl);
+ lock_page(page);
+ spin_lock(ptl);
+ if (unlikely(!pmd_same(*pmd, _pmd))) {
+ unlock_page(page);
+ put_page(page);
+ page = NULL;
+ goto repeat;
+ }
put_page(page);
- page = NULL;
- goto repeat;
}
- put_page(page);
+ do_unlock_page = true;
}
}
if (PageMlocked(page))
@@ -2249,7 +2258,7 @@ repeat:
__split_huge_pmd_locked(vma, pmd, range.start, freeze);
out:
spin_unlock(ptl);
- if (!was_locked && page)
+ if (do_unlock_page)
unlock_page(page);
/*
* No need to double call mmu_notifier->invalidate_range() callback.
_
The patch titled
Subject: Revert "mm: memcontrol: avoid workload stalls when lowering memory.high"
has been added to the -mm tree. Its filename is
revert-mm-memcontrol-avoid-workload-stalls-when-lowering-memoryhigh.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/revert-mm-memcontrol-avoid-worklo…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/revert-mm-memcontrol-avoid-worklo…
------------------------------------------------------
From: Johannes Weiner <hannes(a)cmpxchg.org>
Subject: Revert "mm: memcontrol: avoid workload stalls when lowering memory.high"
This reverts commit 536d3bf261a2fc3b05b3e91e7eef7383443015cf, as it can
cause writers to memory.high to get stuck in the kernel forever,
performing page reclaim and consuming excessive amounts of CPU cycles.
Before the patch, a write to memory.high would first put the new limit in
place for the workload, and then reclaim the requested delta. After the
patch, the kernel tries to reclaim the delta before putting the new limit
into place, in order to not overwhelm the workload with a sudden, large
excess over the limit. However, if reclaim is actively racing with new
allocations from the uncurbed workload, it can keep the write() working
inside the kernel indefinitely.
This is causing problems in Facebook production. A privileged
system-level daemon that adjusts memory.high for various workloads running
on a host can get unexpectedly stuck in the kernel and essentially turn
into a sort of involuntary kswapd for one of the workloads. We've
observed that daemon busy-spinning in a write() for minutes at a time,
neglecting its other duties on the system, and expending privileged system
resources on behalf of a workload.
To remedy this, we first considered changing the reclaim logic to
break out after a couple of loops - whether the workload has converged to
the new limit or not - and bound the write() call this way. However, the
root cause that inspired the sequence change in the first place has been
fixed through other means, and so a revert back to the proven
limit-setting sequence, also used by memory.max, is preferable.
The sequence was changed to avoid extreme latencies in the workload when
the limit was lowered: the sudden, large excess created by the limit
lowering would erroneously trigger the penalty sleeping code that is meant
to throttle excessive growth from below. Allocating threads could end up
sleeping long after the write() had already reclaimed the delta for which
they were being punished.
However, erroneous throttling also caused problems in other scenarios at
around the same time. This resulted in commit b3ff92916af3 ("mm, memcg:
reclaim more aggressively before high allocator throttling"), included in
the same release as the offending commit. When allocating threads now
encounter large excess caused by a racing write() to memory.high, instead
of entering punitive sleeps, they will simply be tasked with helping
reclaim down the excess, and will be held no longer than it takes to
accomplish that. This is in line with regular limit enforcement - i.e.
if the workload allocates up against or over an otherwise unchanged limit
from below.
With the patch breaking userspace, and the root cause addressed by other
means already, revert it again.
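In sequence terms, the revert restores "publish the new limit first, then
reclaim toward it", instead of reclaiming down to the target before the
limit is visible. A condensed userspace model of the restored ordering; the
helper names and the retry bound here are made up:

#include <stdio.h>

#define RECLAIM_RETRIES 16

static unsigned long usage = 1000;	/* pages charged to the group */
static unsigned long high_limit = 2000;

/* Pretend reclaim frees up to 'want' pages and reports how many it got. */
static unsigned long try_reclaim(unsigned long want)
{
	unsigned long got = want > 100 ? 100 : want;

	usage -= got;
	return got;
}

static void write_memory_high(unsigned long new_high)
{
	int retries = RECLAIM_RETRIES;

	/* Restored order: the limit takes effect before any reclaim work. */
	high_limit = new_high;

	while (usage > high_limit && retries--) {
		if (!try_reclaim(usage - high_limit))
			break;
	}

	printf("high=%lu usage=%lu\n", high_limit, usage);
}

int main(void)
{
	write_memory_high(500);
	return 0;
}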
Link: https://lkml.kernel.org/r/20210122184341.292461-1-hannes@cmpxchg.org
Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
Signed-off-by: Johannes Weiner <hannes(a)cmpxchg.org>
Reported-by: Tejun Heo <tj(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Michal Koutný <mkoutny(a)suse.com>
Cc: <stable(a)vger.kernel.org> [5.8+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
--- a/mm/memcontrol.c~revert-mm-memcontrol-avoid-workload-stalls-when-lowering-memoryhigh
+++ a/mm/memcontrol.c
@@ -6271,6 +6271,8 @@ static ssize_t memory_high_write(struct
if (err)
return err;
+ page_counter_set_high(&memcg->memory, high);
+
for (;;) {
unsigned long nr_pages = page_counter_read(&memcg->memory);
@@ -6293,10 +6295,7 @@ static ssize_t memory_high_write(struct
break;
}
- page_counter_set_high(&memcg->memory, high);
-
memcg_wb_domain_size_changed(memcg);
-
return nbytes;
}
_