From: Daniel Jordan daniel.m.jordan@oracle.com
From: Daniel Jordan daniel.m.jordan@oracle.com
commit 117003c32771df617acf66e140fbdbdeb0ac71f5 upstream.
Patch series "initialize deferred pages with interrupts enabled", v4.
Keep interrupts enabled during deferred page initialization in order to make code more modular and allow jiffies to update.
Original approach, and discussion can be found here: http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.c...
This patch (of 3):
deferred_init_memmap() disables interrupts the entire time, so it calls touch_nmi_watchdog() periodically to avoid soft lockup splats. Soon it will run with interrupts enabled, at which point cond_resched() should be used instead.
deferred_grow_zone() makes the same watchdog calls through code shared with deferred init but will continue to run with interrupts disabled, so it can't call cond_resched().
Pull the watchdog calls up to these two places to allow the first to be changed later, independently of the second. The frequency reduces from twice per pageblock (init and free) to once per max order block.
Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages") Signed-off-by: Daniel Jordan daniel.m.jordan@oracle.com Signed-off-by: Pavel Tatashin pasha.tatashin@soleen.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: David Hildenbrand david@redhat.com Acked-by: Michal Hocko mhocko@suse.com Acked-by: Vlastimil Babka vbabka@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: Shile Zhang shile.zhang@linux.alibaba.com Cc: Kirill Tkhai ktkhai@virtuozzo.com Cc: James Morris jmorris@namei.org Cc: Sasha Levin sashal@kernel.org Cc: Yiqian Wei yiwei@redhat.com Cc: stable@vger.kernel.org [4.17+] Link: http://lkml.kernel.org/r/20200403140952.17177-2-pasha.tatashin@soleen.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org --- mm/page_alloc.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 13cc653122b7..f7130e4445d3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1692,7 +1692,6 @@ static void __init deferred_free_pages(unsigned long pfn, } else if (!(pfn & nr_pgmask)) { deferred_free_range(pfn - nr_free, nr_free); nr_free = 1; - touch_nmi_watchdog(); } else { nr_free++; } @@ -1722,7 +1721,6 @@ static unsigned long __init deferred_init_pages(struct zone *zone, continue; } else if (!page || !(pfn & nr_pgmask)) { page = pfn_to_page(pfn); - touch_nmi_watchdog(); } else { page++; } @@ -1862,8 +1860,10 @@ static int __init deferred_init_memmap(void *data) * that we can avoid introducing any issues with the buddy * allocator. */ - while (spfn < epfn) + while (spfn < epfn) { nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn); + touch_nmi_watchdog(); + } zone_empty: pgdat_resize_unlock(pgdat, &flags);
@@ -1947,6 +1947,7 @@ deferred_grow_zone(struct zone *zone, unsigned int order) first_deferred_pfn = spfn;
nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn); + touch_nmi_watchdog();
/* We should only stop along section boundaries */ if ((first_deferred_pfn ^ spfn) < PAGES_PER_SECTION)
From: Pavel Tatashin pasha.tatashin@soleen.com
commit da97f2d56bbd880b4138916a7ef96f9881a551b2 upstream.
Initializing struct pages is a long task and keeping interrupts disabled for the duration of this operation introduces a number of problems.
1. jiffies are not updated for long period of time, and thus incorrect time is reported. See proposed solution and discussion here: lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com 2. It prevents farther improving deferred page initialization by allowing intra-node multi-threading.
We are keeping interrupts disabled to solve a rather theoretical problem that was never observed in real world (See 3a2d7fa8a3d5).
Let's keep interrupts enabled. In case we ever encounter a scenario where an interrupt thread wants to allocate large amount of memory this early in boot we can deal with that by growing zone (see deferred_grow_zone()) by the needed amount before starting deferred_init_memmap() threads.
Before: [ 1.232459] node 0 initialised, 12058412 pages in 1ms
After: [ 1.632580] node 0 initialised, 12051227 pages in 436ms
Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages") Reported-by: Shile Zhang shile.zhang@linux.alibaba.com Signed-off-by: Pavel Tatashin pasha.tatashin@soleen.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Reviewed-by: Daniel Jordan daniel.m.jordan@oracle.com Reviewed-by: David Hildenbrand david@redhat.com Acked-by: Michal Hocko mhocko@suse.com Acked-by: Vlastimil Babka vbabka@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: James Morris jmorris@namei.org Cc: Kirill Tkhai ktkhai@virtuozzo.com Cc: Sasha Levin sashal@kernel.org Cc: Yiqian Wei yiwei@redhat.com Cc: stable@vger.kernel.org [4.17+] Link: http://lkml.kernel.org/r/20200403140952.17177-3-pasha.tatashin@soleen.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 3d060856adfc59afb9d029c233141334cfaba418) --- include/linux/mmzone.h | 2 ++ mm/page_alloc.c | 20 +++++++------------- 2 files changed, 9 insertions(+), 13 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 1b9de7d220fb..6196fd16b7f4 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -678,6 +678,8 @@ typedef struct pglist_data { /* * Must be held any time you expect node_start_pfn, * node_present_pages, node_spanned_pages or nr_zones to stay constant. + * Also synchronizes pgdat->first_deferred_pfn during deferred page + * init. * * pgdat_resize_lock() and pgdat_resize_unlock() are provided to * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f7130e4445d3..b03da51dee5d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1843,6 +1843,13 @@ static int __init deferred_init_memmap(void *data) BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat)); pgdat->first_deferred_pfn = ULONG_MAX;
+ /* + * Once we unlock here, the zone cannot be grown anymore, thus if an + * interrupt thread must allocate this early in boot, zone must be + * pre-grown prior to start of deferred page initialization. + */ + pgdat_resize_unlock(pgdat, &flags); + /* Only the highest zone is deferred so find it */ for (zid = 0; zid < MAX_NR_ZONES; zid++) { zone = pgdat->node_zones + zid; @@ -1865,8 +1872,6 @@ static int __init deferred_init_memmap(void *data) touch_nmi_watchdog(); } zone_empty: - pgdat_resize_unlock(pgdat, &flags); - /* Sanity check that the next zone really is unpopulated */ WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
@@ -1908,17 +1913,6 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
pgdat_resize_lock(pgdat, &flags);
- /* - * If deferred pages have been initialized while we were waiting for - * the lock, return true, as the zone was grown. The caller will retry - * this zone. We won't return to this function since the caller also - * has this static branch. - */ - if (!static_branch_unlikely(&deferred_pages)) { - pgdat_resize_unlock(pgdat, &flags); - return true; - } - /* * If someone grew this zone while we were waiting for spinlock, return * true, as there might be enough pages already.
From: Pavel Tatashin pasha.tatashin@soleen.com
commit 3d060856adfc59afb9d029c233141334cfaba418 upstream.
Now that deferred pages are initialized with interrupts enabled we can replace touch_nmi_watchdog() with cond_resched(), as it was before 3a2d7fa8a3d5.
For now, we cannot do the same in deferred_grow_zone() as it is still initializes pages with interrupts disabled.
This change fixes RCU problem described in https://lkml.kernel.org/r/20200401104156.11564-2-david@redhat.com
[ 60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks: [ 60.475000] rcu: 1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000 [ 60.475000] rcu: (detected by 0, t=60002 jiffies, g=-1199, q=1) [ 60.475000] Sending NMI from CPU 0 to CPUs 1: [ 1.760091] NMI backtrace for cpu 1 [ 1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1 [ 1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014 [ 1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f [ 1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41 [ 1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006 [ 1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000 [ 1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340 [ 1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002 [ 1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002 [ 1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c [ 1.760091] FS: 0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000 [ 1.760091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0 [ 1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1.760091] Call Trace: [ 1.760091] deferred_init_pages+0x8f/0xbf [ 1.760091] deferred_init_memmap+0x184/0x29d [ 1.760091] ? deferred_free_pages.isra.97+0xba/0xba [ 1.760091] kthread+0x112/0x130 [ 1.760091] ? kthread_flush_work_fn+0x10/0x10 [ 1.760091] ret_from_fork+0x35/0x40 [ 89.123011] node 0 initialised, 1055935372 pages in 88650ms
Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages") Reported-by: Yiqian Wei yiwei@redhat.com Signed-off-by: Pavel Tatashin pasha.tatashin@soleen.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Tested-by: David Hildenbrand david@redhat.com Reviewed-by: Daniel Jordan daniel.m.jordan@oracle.com Reviewed-by: David Hildenbrand david@redhat.com Reviewed-by: Pankaj Gupta pankaj.gupta.linux@gmail.com Acked-by: Michal Hocko mhocko@suse.com Cc: Dan Williams dan.j.williams@intel.com Cc: James Morris jmorris@namei.org Cc: Kirill Tkhai ktkhai@virtuozzo.com Cc: Sasha Levin sashal@kernel.org Cc: Shile Zhang shile.zhang@linux.alibaba.com Cc: Vlastimil Babka vbabka@suse.cz Cc: stable@vger.kernel.org [4.17+] Link: http://lkml.kernel.org/r/20200403140952.17177-4-pasha.tatashin@soleen.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit da97f2d56bbd880b4138916a7ef96f9881a551b2) --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b03da51dee5d..d0c0d9364aa6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1869,7 +1869,7 @@ static int __init deferred_init_memmap(void *data) */ while (spfn < epfn) { nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn); - touch_nmi_watchdog(); + cond_resched(); } zone_empty: /* Sanity check that the next zone really is unpopulated */
linux-stable-mirror@lists.linaro.org