From: Daniel Vacek <neelx@redhat.com>
Subject: mm/page_alloc: fix memmap_init_zone pageblock alignment
b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible") introduced a bug where move_freepages() triggers a VM_BUG_ON() on uninitialized page structure due to pageblock alignment. To fix this, simply align the skipped pfns in memmap_init_zone() the same way as in move_freepages_block().
Seen in one of the RHEL reports:
crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
kernel BUG at mm/page_alloc.c:1389!
invalid opcode: 0000 [#1] SMP
--
RIP: 0010:[<ffffffff8118833e>]  [<ffffffff8118833e>] move_freepages+0x15e/0x160
RSP: 0018:ffff88054d727688  EFLAGS: 00010087
--
Call Trace:
 [<ffffffff811883b3>] move_freepages_block+0x73/0x80
 [<ffffffff81189e63>] __rmqueue+0x263/0x460
 [<ffffffff8118c781>] get_page_from_freelist+0x7e1/0x9e0
 [<ffffffff8118caf6>] __alloc_pages_nodemask+0x176/0x420
--
RIP  [<ffffffff8118833e>] move_freepages+0x15e/0x160
 RSP <ffff88054d727688>
crash> page_init_bug -v | grep RAM
<struct resource 0xffff88067fffd2f8>       1000 -     9bfff  System RAM (620.00 KiB)
<struct resource 0xffff88067fffd3a0>     100000 -  430bffff  System RAM (  1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
<struct resource 0xffff88067fffd410>   4b0c8000 -  4bf9cfff  System RAM ( 14.83 MiB = 15188.00 KiB)
<struct resource 0xffff88067fffd480>   4bfac000 -  646b1fff  System RAM (391.02 MiB = 400408.00 KiB)
<struct resource 0xffff88067fffd560>   7b788000 -  7b7fffff  System RAM (480.00 KiB)
<struct resource 0xffff88067fffd640>  100000000 - 67fffffff  System RAM ( 22.00 GiB)
crash> page_init_bug | head -6
<struct resource 0xffff88067fffd560>   7b788000 -  7b7fffff  System RAM (480.00 KiB)
<struct page 0xffffea0001ede200>  1fffff00000000  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32  4096  1048575
<struct page 0xffffea0001ede200> 505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
<struct page 0xffffea0001ed8000>               0  0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA       1     4095
<struct page 0xffffea0001edffc0>  1fffff00000400  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32  4096  1048575
BUG, zones differ!
Note that this range follows two unpopulated sections, 68000000-77ffffff, in this zone. 7b788000-7b7fffff is the first range after that gap, which makes memmap_init_zone() skip all the pfns up to the beginning of this range. But this range is not pageblock (2M) aligned. In fact, no range has to be.
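To make the arithmetic concrete, here is a minimal userspace sketch of the rounding involved (assuming the usual x86_64 values of 4 KiB pages and a 2 MiB pageblock, i.e. pageblock_nr_pages = 512; standalone demonstration code, not kernel code):

~~~~
#include <stdio.h>

#define PAGE_SHIFT         12    /* 4 KiB pages (assumed) */
#define PAGEBLOCK_NR_PAGES 512UL /* 2 MiB / 4 KiB (assumed) */

int main(void)
{
	unsigned long start = 0x7b788000UL;      /* first valid address after the gap */
	unsigned long pfn = start >> PAGE_SHIFT; /* 0x7b788 */

	/* Round down to the pageblock start, as move_freepages_block() does. */
	unsigned long block = pfn & ~(PAGEBLOCK_NR_PAGES - 1);

	printf("pfn %#lx -> pageblock start pfn %#lx (phys %#lx)\n",
	       pfn, block, block << PAGE_SHIFT);
	/* Prints: pfn 0x7b788 -> pageblock start pfn 0x7b600 (phys 0x7b600000) */
	return 0;
}
~~~~

That pageblock start, physical 7b600000, is exactly the uninitialized struct page ffffea0001ed8000 in the kmem output below.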
crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffea0001e00000  78000000                0        0  0 0
ffffea0001ed7fc0  7b5ff000                0        0  0 0
ffffea0001ed8000  7b600000                0        0  0 0	<<<<
ffffea0001ede1c0  7b787000                0        0  0 0
ffffea0001ede200  7b788000                0        0  1 1fffff00000000
The top part of the page flags should contain the node id and zone number, which is not the case for page ffffea0001ed8000 here (marked <<<< above).
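For reference, this is roughly how page_zone() recovers the zone from those bits (simplified from mainline include/linux/mm.h; the exact bit layout of page->flags depends on the kernel configuration):

~~~~
/*
 * The zone number sits in the top bits of page->flags and indexes the
 * node's zone array. With flags == 0 on an uninitialized page, this
 * silently decodes to node 0, zone 0 (ZONE_DMA) instead of DMA32.
 */
static inline enum zone_type page_zonenum(const struct page *page)
{
	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}

static inline struct zone *page_zone(const struct page *page)
{
	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
}
~~~~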
crash> log | grep -o fffea0001ed[^\ ]* | sort -u
fffea0001ed8000
fffea0001eded20
fffea0001edffc0
crash> bt -r | grep -o fffea0001ed[^\ ]* | sort -u
fffea0001ed8000
fffea0001eded00
fffea0001eded20
fffea0001edffc0
Initialization of the whole beginning of the section is skipped up to the start of the range due to commit b92df1de5d28. Now any code calling move_freepages_block() (like reusing the page from a freelist as in this example) with a page from the beginning of the range will have the page rounded down to start_page ffffea0001ed8000 and passed to move_freepages(), which then crashes on the assertion because it computes the wrong zone number:
VM_BUG_ON(page_zone(start_page) != page_zone(end_page));
Note that page_zone() derives the zone from the page flags here.
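The rounding itself happens in move_freepages_block(); abridged from mm/page_alloc.c of this era (the num_movable accounting is trimmed for brevity, so treat this as a sketch rather than the verbatim function):

~~~~
int move_freepages_block(struct zone *zone, struct page *page,
			 int migratetype)
{
	unsigned long start_pfn, end_pfn;
	struct page *start_page, *end_page;

	/* Unconditionally round the caller's page down to its pageblock. */
	start_pfn = page_to_pfn(page);
	start_pfn = start_pfn & ~(pageblock_nr_pages - 1);
	start_page = pfn_to_page(start_pfn);
	end_page = start_page + pageblock_nr_pages - 1;
	end_pfn = start_pfn + pageblock_nr_pages - 1;

	/*
	 * The span checks do not catch our case: pfn 0x7b600 (505344)
	 * still lies inside DMA32's 4096..1048575 span; its struct page
	 * was simply never initialized.
	 */
	if (!zone_spans_pfn(zone, start_pfn))
		start_page = page;
	if (!zone_spans_pfn(zone, end_pfn))
		return 0;

	return move_freepages(zone, start_page, end_page, migratetype);
}
~~~~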
From a similar machine, before commit b92df1de5d28:
crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
fffff73941e00000  78000000                0        0  1 1fffff00000000
fffff73941ed7fc0  7b5ff000                0        0  1 1fffff00000000
fffff73941ed8000  7b600000                0        0  1 1fffff00000000
fffff73941edff80  7b7fe000                0        0  1 1fffff00000000
fffff73941edffc0  7b7ff000 ffff8e67e04d3ae0     ad84  1 1fffff00020068 uptodate,lru,active,mappedtodisk
All the pages since the beginning of the section are initialized, so move_freepages() is not going to blow up.
The same machine with this fix applied:
crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffea0001e00000  78000000                0        0  0 0
ffffea0001e00000  7b5ff000                0        0  0 0
ffffea0001ed8000  7b600000                0        0  1 1fffff00000000
ffffea0001edff80  7b7fe000                0        0  1 1fffff00000000
ffffea0001edffc0  7b7ff000 ffff88017fb13720        8  2 1fffff00020068 uptodate,lru,active,mappedtodisk
At least the bare minimum of pages is initialized, which prevents the crash as well.
Customers started reporting this as soon as RHEL 7.4 (where b92df1de5d28 was merged) was released; I remember reports from around September/October. The bug is not easily reproduced and happens only on a handful of machines, which I guess is why it went unnoticed for so long. That does not make it any less serious, though.
Though there actually is a report here: https://bugzilla.kernel.org/show_bug.cgi?id=196443
And there are reports for Fedora from July (https://bugzilla.redhat.com/show_bug.cgi?id=1473242) and for CentOS (https://bugs.centos.org/view.php?id=13964), and we internally track several dozen reports for the RHEL bug https://bugzilla.redhat.com/show_bug.cgi?id=1525121
Link: http://lkml.kernel.org/r/0485727b2e82da7efbce5f6ba42524b429d0391a.1520011945...
Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible")
Signed-off-by: Daniel Vacek <neelx@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 mm/page_alloc.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
diff -puN mm/page_alloc.c~mm-page_alloc-fix-memmap_init_zone-pageblock-alignment mm/page_alloc.c
--- a/mm/page_alloc.c~mm-page_alloc-fix-memmap_init_zone-pageblock-alignment
+++ a/mm/page_alloc.c
@@ -5359,9 +5359,14 @@ void __meminit memmap_init_zone(unsigned
 			/*
 			 * Skip to the pfn preceding the next valid one (or
 			 * end_pfn), such that we hit a valid pfn (or end_pfn)
-			 * on our next iteration of the loop.
+			 * on our next iteration of the loop. Note that it needs
+			 * to be pageblock aligned even when the region itself
+			 * is not. move_freepages_block() can shift ahead of
+			 * the valid region but still depends on correct page
+			 * metadata.
 			 */
-			pfn = memblock_next_valid_pfn(pfn, end_pfn) - 1;
+			pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
+					~(pageblock_nr_pages-1)) - 1;
 #endif
 			continue;
 		}
_
On Sat, Mar 10, 2018 at 12:51 AM, <akpm@linux-foundation.org> wrote:
> From: Daniel Vacek <neelx@redhat.com>
> Subject: mm/page_alloc: fix memmap_init_zone pageblock alignment
>
> b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where
> possible") introduced a bug where move_freepages() triggers a VM_BUG_ON()
> on an uninitialized page structure due to a pageblock alignment issue.
> To fix this, simply align the skipped pfns in memmap_init_zone() the
> same way as move_freepages_block() does.
>
> diff -puN mm/page_alloc.c~mm-page_alloc-fix-memmap_init_zone-pageblock-alignment mm/page_alloc.c
> --- a/mm/page_alloc.c~mm-page_alloc-fix-memmap_init_zone-pageblock-alignment
> +++ a/mm/page_alloc.c
> @@ -5359,9 +5359,14 @@ void __meminit memmap_init_zone(unsigned
>  			/*
>  			 * Skip to the pfn preceding the next valid one (or
>  			 * end_pfn), such that we hit a valid pfn (or end_pfn)
> -			 * on our next iteration of the loop.
> +			 * on our next iteration of the loop. Note that it needs
> +			 * to be pageblock aligned even when the region itself
> +			 * is not. move_freepages_block() can shift ahead of
> +			 * the valid region but still depends on correct page
> +			 * metadata.
>  			 */
> -			pfn = memblock_next_valid_pfn(pfn, end_pfn) - 1;
> +			pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
> +					~(pageblock_nr_pages-1)) - 1;
Again, end_pfn should be gone.

--nX

>  #endif
>  			continue;
>  		}
> _
Hmm. I've already applied Andrew's set.
Can the people involved please check the current upstream tree, and make sure it's either ok, or send in the proper fixes..
Linus
On Fri, Mar 9, 2018 at 5:02 PM, Daniel Vacek <neelx@redhat.com> wrote:

> Again, end_pfn should be gone.
On Sat, Mar 10, 2018 at 2:07 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Hmm. I've already applied Andrew's set.
>
> Can the people involved please check the current upstream tree, and
> make sure it's either ok, or send in the proper fixes..
As simple as this:
~~~~
neelx@metal:~/nX/src/linux$ git diff
diff --git a/mm/memblock.c b/mm/memblock.c
index b6ba6b7adadc..2a5facd236bb 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1101,8 +1101,7 @@ void __init_memblock __next_mem_pfn_range(int *idx, int nid,
 		*out_nid = r->nid;
 }
 
-unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn,
-						      unsigned long max_pfn)
+unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn)
 {
 	struct memblock_type *type = &memblock.memory;
 	unsigned int right = type->cnt;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3d974cb2a1a1..efb290268479 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5365,7 +5365,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			 * the valid region but still depends on correct page
 			 * metadata.
 			 */
-			pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
+			pfn = (memblock_next_valid_pfn(pfn) &
 					~(pageblock_nr_pages-1)) - 1;
 #endif
 			continue;
~~~~

This is what I originally sent; the argument is unused now.
--nX
> Linus
>
> On Fri, Mar 9, 2018 at 5:02 PM, Daniel Vacek <neelx@redhat.com> wrote:
>> Again, end_pfn should be gone.
On Sat, 10 Mar 2018 02:28:00 +0100 Daniel Vacek <neelx@redhat.com> wrote:

> As simple as this:
Is OK, I'll send a fix along next week.
max_pfn is not used anymore but survived due to a mismerge. Remove it for good.
Link: http://lkml.kernel.org/r/0485727b2e82da7efbce5f6ba42524b429d0391a.1520011945...
Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible")
Signed-off-by: Daniel Vacek <neelx@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: stable@vger.kernel.org
---
 mm/memblock.c   | 3 +--
 mm/page_alloc.c | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/mm/memblock.c b/mm/memblock.c
index b6ba6b7adadc..2a5facd236bb 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1101,8 +1101,7 @@ void __init_memblock __next_mem_pfn_range(int *idx, int nid,
 		*out_nid = r->nid;
 }
 
-unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn,
-						      unsigned long max_pfn)
+unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn)
 {
 	struct memblock_type *type = &memblock.memory;
 	unsigned int right = type->cnt;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3d974cb2a1a1..efb290268479 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5365,7 +5365,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			 * the valid region but still depends on correct page
 			 * metadata.
 			 */
-			pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
+			pfn = (memblock_next_valid_pfn(pfn) &
 					~(pageblock_nr_pages-1)) - 1;
 #endif
 			continue;
On Sat, 10 Mar 2018 03:17:56 +0100 Daniel Vacek <neelx@redhat.com> wrote:

> max_pfn is not used anymore but survived due to a mismerge. Remove it
> for good.
>
> ...
>
>  mm/memblock.c   | 3 +--
>  mm/page_alloc.c | 2 +-
No, the prototype needs changing as well.
This is very minor. I'll send a (tested!) fix next week.
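For context, the declaration Andrew refers to lives in include/linux/memblock.h; the matching header change would look roughly like this (a sketch of the remaining cleanup, not the fix that was eventually merged):

~~~~
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
-unsigned long memblock_next_valid_pfn(unsigned long pfn,
-				      unsigned long max_pfn);
+unsigned long memblock_next_valid_pfn(unsigned long pfn);
~~~~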
On Sat, Mar 10, 2018 at 3:23 AM, Andrew Morton <akpm@linux-foundation.org> wrote:

> On Sat, 10 Mar 2018 03:17:56 +0100 Daniel Vacek <neelx@redhat.com> wrote:
>
>> max_pfn is not used anymore but survived due to a mismerge. Remove it
>> for good.
>>
>> ...
>>
>>  mm/memblock.c   | 3 +--
>>  mm/page_alloc.c | 2 +-
>
> No, the prototype needs changing as well.
>
> This is very minor. I'll send a (tested!) fix next week.
My bad, sorry. Totally minor. I'll leave it to you. Thanks
--nX
I am sorry, I am still not back up to speed yet, but the patch below should also update the comment claiming:

	 * Skip to the pfn preceding the next valid one (or
	 * end_pfn), such that we hit a valid pfn (or end_pfn)

I will double-check whether that is still true; I got lost in the follow-up fixups during the original patch development.
On Sat 10-03-18 03:17:56, Daniel Vacek wrote:
> max_pfn is not used anymore but survived due to a mismerge. Remove it
> for good.
>
> Link: http://lkml.kernel.org/r/0485727b2e82da7efbce5f6ba42524b429d0391a.1520011945...
> Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible")
> Signed-off-by: Daniel Vacek <neelx@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Paul Burton <paul.burton@imgtec.com>
> Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: stable@vger.kernel.org
> ---
>  mm/memblock.c   | 3 +--
>  mm/page_alloc.c | 2 +-
>  2 files changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memblock.c b/mm/memblock.c
> index b6ba6b7adadc..2a5facd236bb 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1101,8 +1101,7 @@ void __init_memblock __next_mem_pfn_range(int *idx, int nid,
>  		*out_nid = r->nid;
>  }
>
> -unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn,
> -						      unsigned long max_pfn)
> +unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn)
>  {
>  	struct memblock_type *type = &memblock.memory;
>  	unsigned int right = type->cnt;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3d974cb2a1a1..efb290268479 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5365,7 +5365,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>  			 * the valid region but still depends on correct page
>  			 * metadata.
>  			 */
> -			pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
> +			pfn = (memblock_next_valid_pfn(pfn) &
>  					~(pageblock_nr_pages-1)) - 1;
>  #endif
>  			continue;
> --
> 2.16.2
On Fri, Mar 9, 2018 at 5:28 PM, Daniel Vacek <neelx@redhat.com> wrote:

> As simple as this:
Ahh, ok, so just a minor missed cleanup.
I worried there was something bigger going on.
I'll get it next round from Andrew,
Linus
On Fri, 9 Mar 2018 18:27:41 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Mar 9, 2018 at 5:28 PM, Daniel Vacek <neelx@redhat.com> wrote:
>> As simple as this:
>
> Ahh, ok, so just a minor missed cleanup.
>
> I worried there was something bigger going on.
>
> I'll get it next round from Andrew,
We've had a bug report against this patch, Daniel has a fix, and I'm awaiting the formal version of that fix. And we cc'ed -stable, sigh.
So I'll wait for that dust to settle and will slip this fixup in at a later stage.