From: Mike Rapoport rppt@linux.ibm.com
There could be struct pages that are not backed by actual physical memory. This can happen when the size of the actual memory bank is not a multiple of SECTION_SIZE or when an architecture does not register memory holes reserved by the firmware as memblock.memory.

Such pages are currently initialized using the init_unavailable_mem() function, which iterates through PFNs in holes in memblock.memory and, if there is a struct page corresponding to a PFN, sets the fields of this page to default values and marks it as Reserved.

init_unavailable_mem() does not take into account the zone and node the page belongs to and sets both zone and node links in struct page to zero.

On a system that has firmware-reserved holes in a zone above ZONE_DMA, for instance in the configuration below:
# grep -A1 E820 /proc/iomem
7a17b000-7a216fff : Unknown E820 type
7a217000-7bffffff : System RAM
unset zone link in struct page will trigger
VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link in struct page) in the same pageblock.
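For reference, zone_spans_pfn() is essentially:

static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
{
	return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
}

so a page whose zone link was left at zero claims to be in ZONE_DMA (the first 16MB on x86) while its pfn is far past that zone's end, and the check fires as soon as the pageblock flags for that block are updated.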
Moreover, it is possible that the start of the lowest node and zone is not aligned to the section boundary, for example on x86:
[ 0.078898] Zone ranges:
[ 0.078899] DMA [mem 0x0000000000001000-0x0000000000ffffff]
...
[ 0.078910] Early memory node ranges
[ 0.078912] node 0: [mem 0x0000000000001000-0x000000000009cfff]
[ 0.078913] node 0: [mem 0x0000000000100000-0x000000003fffffff]
and thus, with the SPARSEMEM memory model, the beginning of the memory map will have struct pages that are not spanned by any node or zone.
Update the detection of node boundaries in get_pfn_range_for_nid() so that the node range is expanded to cover whole memory map sections. Since zone spans are derived from the node span, there will always be a zone that covers the part of the memory map with unavailable pages.
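As a rough illustration of the rounding (a userspace sketch, not kernel code; PAGES_PER_SECTION = 32768 is an assumption, i.e. x86_64 with 4K pages and 128M sections), applied to the node 0 layout quoted above:

#include <stdio.h>

#define PAGES_PER_SECTION	32768UL			/* assumed x86_64 value */
#define round_down(x, y)	((x) & ~((y) - 1))	/* y must be a power of two */
#define round_up(x, y)		round_down((x) + (y) - 1, (y))

int main(void)
{
	unsigned long start_pfn = 0x1;		/* node 0 starts at 4K */
	unsigned long end_pfn = 0x40000;	/* node 0 ends at 1G */

	/* prints "node 0 span: [0x0, 0x40000)" */
	printf("node 0 span: [0x%lx, 0x%lx)\n",
	       round_down(start_pfn, PAGES_PER_SECTION),
	       round_up(end_pfn, PAGES_PER_SECTION));
	return 0;
}

With those numbers the node 0 span grows from [0x1, 0x40000) to [0x0, 0x40000), so the struct page for pfn 0 is covered by node 0 instead of sitting outside every node.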
Interleave initialization of the unavailable pages with the normal initialization of the memory map, so that zone and node information will be properly set on struct pages that are not backed by actual memory.
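To make the interleaving concrete with the E820 hole quoted above: the hole 7a17b000-7a216fff corresponds to pfns 0x7a17b-0x7a216, which lie inside ZONE_DMA32 (and, presumably, node 0 on that machine). When memmap_init_zone() walks that zone it initializes the surrounding memblock.memory ranges as before, and for the gap it now calls init_unavailable_range(0x7a17b, 0x7a217, zone_id, nid), so those struct pages end up PageReserved with the DMA32 zone link and the correct node instead of the zone 0/node 0 defaults.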
Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Reported-by: Andrea Arcangeli aarcange@redhat.com
Signed-off-by: Mike Rapoport rppt@linux.ibm.com
Cc: Baoquan He bhe@redhat.com
Cc: David Hildenbrand david@redhat.com
Cc: Mel Gorman mgorman@suse.de
Cc: Michal Hocko mhocko@kernel.org
Cc: Qian Cai cai@lca.pw
Cc: Vlastimil Babka vbabka@suse.cz
---
 mm/page_alloc.c | 160 +++++++++++++++++++++++-------------------------
 1 file changed, 75 insertions(+), 85 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6446778cbc6b..1c3f7521028f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6257,22 +6257,84 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 	}
 }
+#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
+/*
+ * Only struct pages that correspond to ranges defined by memblock.memory
+ * are zeroed and initialized by going through __init_single_page() during
+ * memmap_init_zone().
+ *
+ * But, there could be struct pages that correspond to holes in
+ * memblock.memory. This can happen because of the following reasons:
+ *  - physical memory bank size is not necessarily the exact multiple of the
+ *    arbitrary section size
+ *  - early reserved memory may not be listed in memblock.memory
+ *  - memory layouts defined with memmap= kernel parameter may not align
+ *    nicely with memmap sections
+ *
+ * Explicitly initialize those struct pages so that:
+ *  - PG_Reserved is set
+ *  - zone and node links point to zone and node that span the page
+ */
+static u64 __meminit init_unavailable_range(unsigned long spfn,
+					    unsigned long epfn,
+					    int zone, int node)
+{
+	unsigned long pfn;
+	u64 pgcnt = 0;
+
+	for (pfn = spfn; pfn < epfn; pfn++) {
+		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
+			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
+				+ pageblock_nr_pages - 1;
+			continue;
+		}
+		__init_single_page(pfn_to_page(pfn), pfn, zone, node);
+		__SetPageReserved(pfn_to_page(pfn));
+		pgcnt++;
+	}
+
+	return pgcnt;
+}
+#else
+static inline u64 init_unavailable_range(unsigned long spfn, unsigned long epfn,
+					 int zone, int node)
+{
+	return 0;
+}
+#endif
+
 void __meminit __weak memmap_init_zone(struct zone *zone)
 {
 	unsigned long zone_start_pfn = zone->zone_start_pfn;
 	unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
 	int i, nid = zone_to_nid(zone), zone_id = zone_idx(zone);
 	unsigned long start_pfn, end_pfn;
+	unsigned long hole_pfn = 0;
+	u64 pgcnt = 0;
 
 	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
 		start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
 		end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);
+		hole_pfn = clamp(hole_pfn, zone_start_pfn, zone_end_pfn);
 
 		if (end_pfn > start_pfn)
 			memmap_init_range(end_pfn - start_pfn, nid,
 					  zone_id, start_pfn, zone_end_pfn,
 					  MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
+
+		if (hole_pfn < start_pfn)
+			pgcnt += init_unavailable_range(hole_pfn, start_pfn,
+							zone_id, nid);
+		hole_pfn = end_pfn;
 	}
+
+	if (hole_pfn < zone_end_pfn)
+		pgcnt += init_unavailable_range(hole_pfn, zone_end_pfn,
+						zone_id, nid);
+
+	if (pgcnt)
+		pr_info("  %s zone: %lld pages in unavailable ranges\n",
+			zone->name, pgcnt);
 }
 
 static int zone_batchsize(struct zone *zone)
@@ -6519,8 +6581,19 @@ void __init get_pfn_range_for_nid(unsigned int nid,
 		*end_pfn = max(*end_pfn, this_end_pfn);
 	}
 
-	if (*start_pfn == -1UL)
+	if (*start_pfn == -1UL) {
 		*start_pfn = 0;
+		return;
+	}
+
+#ifdef CONFIG_SPARSEMEM
+	/*
+	 * Sections in the memory map may not match actual populated
+	 * memory, extend the node span to cover the entire section.
+	 */
+	*start_pfn = round_down(*start_pfn, PAGES_PER_SECTION);
+	*end_pfn = round_up(*end_pfn, PAGES_PER_SECTION);
+#endif
 }
 
 /*
@@ -7069,88 +7142,6 @@ void __init free_area_init_memoryless_node(int nid)
 	free_area_init_node(nid);
 }
 
-#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
-/*
- * Initialize all valid struct pages in the range [spfn, epfn) and mark them
- * PageReserved(). Return the number of struct pages that were initialized.
- */
-static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
-{
-	unsigned long pfn;
-	u64 pgcnt = 0;
-
-	for (pfn = spfn; pfn < epfn; pfn++) {
-		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
-			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
-				+ pageblock_nr_pages - 1;
-			continue;
-		}
-		/*
-		 * Use a fake node/zone (0) for now. Some of these pages
-		 * (in memblock.reserved but not in memblock.memory) will
-		 * get re-initialized via reserve_bootmem_region() later.
-		 */
-		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
-		__SetPageReserved(pfn_to_page(pfn));
-		pgcnt++;
-	}
-
-	return pgcnt;
-}
-
-/*
- * Only struct pages that are backed by physical memory are zeroed and
- * initialized by going through __init_single_page(). But, there are some
- * struct pages which are reserved in memblock allocator and their fields
- * may be accessed (for example page_to_pfn() on some configuration accesses
- * flags). We must explicitly initialize those struct pages.
- *
- * This function also addresses a similar issue where struct pages are left
- * uninitialized because the physical address range is not covered by
- * memblock.memory or memblock.reserved. That could happen when memblock
- * layout is manually configured via memmap=, or when the highest physical
- * address (max_pfn) does not end on a section boundary.
- */
-static void __init init_unavailable_mem(void)
-{
-	phys_addr_t start, end;
-	u64 i, pgcnt;
-	phys_addr_t next = 0;
-
-	/*
-	 * Loop through unavailable ranges not covered by memblock.memory.
-	 */
-	pgcnt = 0;
-	for_each_mem_range(i, &start, &end) {
-		if (next < start)
-			pgcnt += init_unavailable_range(PFN_DOWN(next),
-							PFN_UP(start));
-		next = end;
-	}
-
-	/*
-	 * Early sections always have a fully populated memmap for the whole
-	 * section - see pfn_valid(). If the last section has holes at the
-	 * end and that section is marked "online", the memmap will be
-	 * considered initialized. Make sure that memmap has a well defined
-	 * state.
-	 */
-	pgcnt += init_unavailable_range(PFN_DOWN(next),
-					round_up(max_pfn, PAGES_PER_SECTION));
-
-	/*
-	 * Struct pages that do not have backing memory. This could be because
-	 * firmware is using some of this memory, or for some other reasons.
-	 */
-	if (pgcnt)
-		pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
-}
-#else
-static inline void __init init_unavailable_mem(void)
-{
-}
-#endif /* !CONFIG_FLAT_NODE_MEM_MAP */
-
 #if MAX_NUMNODES > 1
 /*
  * Figure out the number of possible node ids.
@@ -7510,7 +7501,7 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 	memset(arch_zone_highest_possible_pfn, 0,
 				sizeof(arch_zone_highest_possible_pfn));
 
-	start_pfn = find_min_pfn_with_active_regions();
+	start_pfn = 0;
 	descending = arch_has_descending_max_zone_pfns();
 
 	for (i = 0; i < MAX_NR_ZONES; i++) {
@@ -7574,7 +7565,6 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
 	setup_nr_node_ids();
-	init_unavailable_mem();
 	for_each_online_node(nid) {
 		pg_data_t *pgdat = NODE_DATA(nid);
 		free_area_init_node(nid);
On Mon, 8 Feb 2021 13:08:20 +0200 Mike Rapoport rppt@kernel.org wrote:
[...]
Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
What are your thoughts on the priority of this (rather large!) fix? Are such systems sufficiently common to warrant a 5.11 merge? -stable?
On Mon, Feb 08, 2021 at 01:11:00PM -0800, Andrew Morton wrote:
On Mon, 8 Feb 2021 13:08:20 +0200 Mike Rapoport rppt@kernel.org wrote:
[...]
What are your thoughts on the priority of this (rather large!) fix? Are such systems sufficiently common to warrant a 5.11 merge? -stable?
I don't know how common such systems are, but the bug is exposed only in builds with DEBUG_VM=y, so after the problems with previous versions discovered by various CI systems I'd say hold it off till 5.11 is out.
If this time the fix works it'll make it to -stable anyway :)
On 08.02.21 12:08, Mike Rapoport wrote:
[...]
@@ -6519,8 +6581,19 @@ void __init get_pfn_range_for_nid(unsigned int nid,
 		*end_pfn = max(*end_pfn, this_end_pfn);
 	}
 
-	if (*start_pfn == -1UL)
+	if (*start_pfn == -1UL) {
 		*start_pfn = 0;
+		return;
+	}
+
+#ifdef CONFIG_SPARSEMEM
+	/*
+	 * Sections in the memory map may not match actual populated
+	 * memory, extend the node span to cover the entire section.
+	 */
+	*start_pfn = round_down(*start_pfn, PAGES_PER_SECTION);
+	*end_pfn = round_up(*end_pfn, PAGES_PER_SECTION);
Does that mean that we might create overlapping zones when one node starts in the middle of a section and the other one ends in the middle of a section?
Could it be a problem? (e.g., would we have to look at neighboring nodes when making the decision to extend, and how far to extend?)
On 12.02.21 10:55, David Hildenbrand wrote:
[...]
Does that mean that we might create overlapping zones when one node
s/overlapping zones/overlapping nodes/
starts in the middle of a section and the other one ends in the middle of a section?
Could it be a problem? (e.g., would we have to look at neighboring nodes when making the decision to extend, and how far to extend?)
On Fri 12-02-21 10:56:19, David Hildenbrand wrote:
On 12.02.21 10:55, David Hildenbrand wrote:
On 08.02.21 12:08, Mike Rapoport wrote:
[...]
Does that mean that we might create overlapping zones when one node
s/overlapping zones/overlapping nodes/
I didn't get to review the patch yet. Just wanted to note that we can interleave nodes/zone. Or what kind of concern do you have in mind?
On 12.02.21 11:11, Michal Hocko wrote:
On Fri 12-02-21 10:56:19, David Hildenbrand wrote:
On 12.02.21 10:55, David Hildenbrand wrote:
On 08.02.21 12:08, Mike Rapoport wrote:
[...]
Does that mean that we might create overlapping zones when one node
s/overlapping zones/overlapping nodes/
I didn't get to review the patch yet. Just wanted to note that we can interleave nodes/zone. Or what kind of concern do you have in mind?
I know that we can have it after boot, when hotplugging memory. How about during boot?
For example, which node will a PFN then actually be assigned to?
I was just wondering if this might result in issues - if that can already happen, then it's just fine I guess.
On Fri 12-02-21 11:16:28, David Hildenbrand wrote:
On 12.02.21 11:11, Michal Hocko wrote:
On Fri 12-02-21 10:56:19, David Hildenbrand wrote:
On 12.02.21 10:55, David Hildenbrand wrote:
On 08.02.21 12:08, Mike Rapoport wrote:
[...]
Does that mean that we might create overlapping zones when one node
s/overlapping zones/overlapping nodes/
I didn't get to review the patch yet. Just wanted to note that we can interleave nodes/zone. Or what kind of concern do you have in mind?
I know that we can have it after boot, when hotplugging memory. How about during boot?
I have seen systems where NUMA nodes are interleaved.
For example, which node will a PFN then actually be assigned to?
I would have to double check all the details but I do remember there was non-trivial work to handle those systems.
On Fri, Feb 12, 2021 at 10:56:19AM +0100, David Hildenbrand wrote:
On 12.02.21 10:55, David Hildenbrand wrote:
On 08.02.21 12:08, Mike Rapoport wrote:
[...]
Does that mean that we might create overlapping zones when one node
s/overlapping zones/overlapping nodes/
starts in the middle of a section and the other one ends in the middle of a section?
Could it be a problem? (e.g., would we have to look at neighboring nodes when making the decision to extend, and how far to extend?)
Having a node end/start in the middle of a section would be a problem, but in this case I don't see a way to detect how a node should be extended :(

We can return to v4 [1] without the x86 modifications. With that we'll have struct pages corresponding to a hole in the middle of a zone with a correct zone link and a good guess for the node.

As for pfn 0 on x86, it'll remain outside any node and zone, but since that has been the case, like, forever, I think we can live with it.
[1] https://lore.kernel.org/lkml/20210130221035.4169-1-rppt@kernel.org
On 14.02.21 18:29, Mike Rapoport wrote:
On Fri, Feb 12, 2021 at 10:56:19AM +0100, David Hildenbrand wrote:
On 12.02.21 10:55, David Hildenbrand wrote:
On 08.02.21 12:08, Mike Rapoport wrote:
[...]
Does that mean that we might create overlapping zones when one node
s/overlapping zones/overlapping nodes/
starts in the middle of a section and the other one ends in the middle of a section?
Could it be a problem? (e.g., would we have to look at neighboring nodes when making the decision to extend, and how far to extend?)
Having a node end/start in a middle of a section would be a problem, but in this case I don't see a way to detect how a node should be extended :(
Running QEMU with something like:
...
-m 8G \
-smp sockets=2,cores=2 \
-object memory-backend-ram,id=bmem0,size=4160M \
-object memory-backend-ram,id=bmem1,size=4032M \
-numa node,nodeid=0,cpus=0-1,memdev=bmem0 -numa node,nodeid=1,cpus=2-3,memdev=bmem1 \
...
Creates such a setup.
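For what it's worth, assuming the usual 128M x86_64 sections: 4160M is 32.5 sections and 4032M is 31.5 sections, and since the rest of the layout is section aligned, the node 0/node 1 boundary at 0x144000000 necessarily lands exactly halfway into a section.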
With an older kernel:
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000023fffffff] usable
[...]
[ 0.002506] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[ 0.002508] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
[ 0.002509] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x143ffffff]
[ 0.002510] ACPI: SRAT: Node 1 PXM 1 [mem 0x144000000-0x23fffffff]
[ 0.002511] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x00000000-0xbfffffff]
[ 0.002513] NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x143ffffff] -> [mem 0x00000000-0x143ffffff]
[ 0.002519] NODE_DATA(0) allocated [mem 0x143fd5000-0x143ffffff]
[ 0.002669] NODE_DATA(1) allocated [mem 0x23ffd2000-0x23fffcfff]
[ 0.017947] memblock: reserved range [0x0000000000000000-0x0000000000001000] is not in memory
[ 0.017953] memblock: reserved range [0x000000000009f000-0x0000000000100000] is not in memory
[ 0.017956] Zone ranges:
[ 0.017957] DMA [mem 0x0000000000000000-0x0000000000ffffff]
[ 0.017958] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
[ 0.017960] Normal [mem 0x0000000100000000-0x000000023fffffff]
[ 0.017961] Device empty
[ 0.017962] Movable zone start for each node
[ 0.017964] Early memory node ranges
[ 0.017965] node 0: [mem 0x0000000000000000-0x00000000bffdffff]
[ 0.017966] node 0: [mem 0x0000000100000000-0x0000000143ffffff]
[ 0.017967] node 1: [mem 0x0000000144000000-0x000000023fffffff]
[ 0.017969] Initmem setup node 0 [mem 0x0000000000000000-0x0000000143ffffff]
[ 0.017971] On node 0 totalpages: 1064928
[ 0.017972] DMA zone: 64 pages used for memmap
[ 0.017973] DMA zone: 21 pages reserved
[ 0.017974] DMA zone: 4096 pages, LIFO batch:0
[ 0.017994] DMA32 zone: 12224 pages used for memmap
[ 0.017995] DMA32 zone: 782304 pages, LIFO batch:63
[ 0.022281] DMA32: Zeroed struct page in unavailable ranges: 32
[ 0.022286] Normal zone: 4352 pages used for memmap
[ 0.022287] Normal zone: 278528 pages, LIFO batch:63
[ 0.023769] Initmem setup node 1 [mem 0x0000000144000000-0x000000023fffffff]
[ 0.023774] On node 1 totalpages: 1032192
[ 0.023775] Normal zone: 16128 pages used for memmap
[ 0.023775] Normal zone: 1032192 pages, LIFO batch:63
With current next/master:
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000023fffffff] usable
[...]
[ 0.002419] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[ 0.002421] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
[ 0.002422] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x143ffffff]
[ 0.002423] ACPI: SRAT: Node 1 PXM 1 [mem 0x144000000-0x23fffffff]
[ 0.002424] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x00000000-0xbfffffff]
[ 0.002426] NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x143ffffff] -> [mem 0x00000000-0x143ffffff]
[ 0.002432] NODE_DATA(0) allocated [mem 0x143fd5000-0x143ffffff]
[ 0.002583] NODE_DATA(1) allocated [mem 0x23ffd2000-0x23fffcfff]
[ 0.017722] Zone ranges:
[ 0.017726] DMA [mem 0x0000000000000000-0x0000000000ffffff]
[ 0.017728] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
[ 0.017729] Normal [mem 0x0000000100000000-0x000000023fffffff]
[ 0.017731] Device empty
[ 0.017732] Movable zone start for each node
[ 0.017734] Early memory node ranges
[ 0.017735] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.017736] node 0: [mem 0x0000000000100000-0x00000000bffdffff]
[ 0.017737] node 0: [mem 0x0000000100000000-0x0000000143ffffff]
[ 0.017738] node 1: [mem 0x0000000144000000-0x000000023fffffff]
[ 0.017741] Initmem setup node 0 [mem 0x0000000000000000-0x0000000147ffffff]
[ 0.017742] On node 0 totalpages: 1064830
[ 0.017743] DMA zone: 64 pages used for memmap
[ 0.017744] DMA zone: 21 pages reserved
[ 0.017745] DMA zone: 3998 pages, LIFO batch:0
[ 0.017765] DMA zone: 98 pages in unavailable ranges
[ 0.017766] DMA32 zone: 12224 pages used for memmap
[ 0.017766] DMA32 zone: 782304 pages, LIFO batch:63
[ 0.022042] DMA32 zone: 32 pages in unavailable ranges
[ 0.022046] Normal zone: 4608 pages used for memmap
[ 0.022047] Normal zone: 278528 pages, LIFO batch:63
[ 0.023601] Normal zone: 16384 pages in unavailable ranges
[ 0.023606] Initmem setup node 1 [mem 0x0000000140000000-0x000000023fffffff]
[ 0.023608] On node 1 totalpages: 1032192
[ 0.023609] Normal zone: 16384 pages used for memmap
[ 0.023609] Normal zone: 1032192 pages, LIFO batch:63
[ 0.029267] Normal zone: 16384 pages in unavailable ranges
In this setup, one node ends in the middle of a section (+64MB), the other one starts in the middle of the same section (+64MB).
After your patch, the nodes overlap (in one section)
I can spot that each node still has the same number of present pages and that each node now has exactly 64MB unavailable pages (the extra ones spanned).
So at least here, it looks like the machinery is still doing the right thing?
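Checking the arithmetic with 128M (0x8000-pfn) sections: the boundary pfn 0x144000 is not section aligned, so node 0's end rounds up to 0x148000 and node 1's start rounds down to 0x140000. Each node therefore spans 64M of memory map it has no memory for, i.e. 16384 struct pages, which matches both the "16384 pages in unavailable ranges" lines and the overlapping "Initmem setup node" ranges in the log.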
On Mon, Feb 15, 2021 at 09:45:30AM +0100, David Hildenbrand wrote:
On 14.02.21 18:29, Mike Rapoport wrote:
On Fri, Feb 12, 2021 at 10:56:19AM +0100, David Hildenbrand wrote:
On 12.02.21 10:55, David Hildenbrand wrote:
On 08.02.21 12:08, Mike Rapoport wrote:
[...]
Does that mean that we might create overlapping zones when one node
s/overlapping zones/overlapping nodes/
starts in the middle of a section and the other one ends in the middle of a section?
Could it be a problem? (e.g., would we have to look at neighboring nodes when making the decision to extend, and how far to extend?)
Having a node end/start in a middle of a section would be a problem, but in this case I don't see a way to detect how a node should be extended :(
Running QEMU with something like:
... -m 8G \
-smp sockets=2,cores=2 \
-object memory-backend-ram,id=bmem0,size=4160M \
-object memory-backend-ram,id=bmem1,size=4032M \
This is an interesting setup :)
TBH, I've tried to think what physical configuration would be problematic for the implicit node extension, and I had concerns about arm64 with its huge section size, but it entirely slipped my mind that a VM can have a really weird memory configuration.
-numa node,nodeid=0,cpus=0-1,memdev=bmem0 -numa node,nodeid=1,cpus=2-3,memdev=bmem1 \
...
Creates such a setup.
With an older kernel:
[...]
With current next/master:
[...]
In this setup, one node ends in the middle of a section (+64MB), the other one starts in the middle of the same section (+64MB).
After your patch, the nodes overlap (in one section)
I can spot that each node still has the same number of present pages and that each node now has exactly 64MB unavailable pages (the extra ones spanned).
So at least here, it looks like the machinery is still doing the right thing?
So in this setup we'll have pages in the overlapping section initialized twice, and they will end up linked to node 1, which is not exactly correct, but we care less about the nodes than about the zones. Well, at least we don't have VM_BUG_ON(!node_spans_pfn()) :)
On Mon 08-02-21 13:08:20, Mike Rapoport wrote:
From: Mike Rapoport rppt@linux.ibm.com
There could be struct pages that are not backed by actual physical memory. This can happen when the actual memory bank is not a multiple of SECTION_SIZE or when an architecture does not register memory holes reserved by the firmware as memblock.memory.
Such pages are currently initialized using init_unavailable_mem() function that iterates through PFNs in holes in memblock.memory and if there is a struct page corresponding to a PFN, the fields of this page are set to default values and it is marked as Reserved.
init_unavailable_mem() does not take into account zone and node the page belongs to and sets both zone and node links in struct page to zero.
IIUC the zone should be associated based on the pfn and architecture constraints on zones. Node is then guessed based on the last existing range, right?
On a system that has firmware reserved holes in a zone above ZONE_DMA, for instance in a configuration below:
# grep -A1 E820 /proc/iomem
7a17b000-7a216fff : Unknown E820 type
7a217000-7bffffff : System RAM
I like the description here though. Thanks, very useful.
unset zone link in struct page will trigger
VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
I guess you mean set_pfnblock_flags_mask(), right? Is this BUG_ON really needed? Maybe we just need to skip over reserved pages?
because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link in struct page) in the same pageblock.
Moreover, it is possible that the lowest node and zone start is not aligned to the section boundary, for example on x86:
[ 0.078898] Zone ranges:
[ 0.078899] DMA [mem 0x0000000000001000-0x0000000000ffffff]
...
[ 0.078910] Early memory node ranges
[ 0.078912] node 0: [mem 0x0000000000001000-0x000000000009cfff]
[ 0.078913] node 0: [mem 0x0000000000100000-0x000000003fffffff]
and thus with SPARSEMEM memory model the beginning of the memory map will have struct pages that are not spanned by any node and zone.
Update detection of node boundaries in get_pfn_range_for_nid() so that the node range will be expanded to cover memory map section. Since zone spans are derived from the node span, there always will be a zone that covers the part of the memory map with unavailable pages.
Interleave initialization of the unavailable pages with the normal initialization of memory map, so that zone and node information will be properly set on struct pages that are not backed by the actual memory.
I have to digest this but my first impression is that this is more heavy weight than it needs to. Pfn walkers should normally obey node range at least. The first pfn is usually excluded but I haven't seen real problems with that. The VM_BUG_ON blowing up is really bad but as said above we can simply make it less offensive in presence of reserved pages as those shouldn't reach that path AFAICS normally. Or we can just remove it. It is not really clear to me whether it has any value beyond debugging TBH.
On 12.02.21 11:33, Michal Hocko wrote:
On Mon 08-02-21 13:08:20, Mike Rapoport wrote:
[...]
I have to digest this but my first impression is that this is more heavy weight than it needs to. Pfn walkers should normally obey node range at least. The first pfn is usually excluded but I haven't seen real
We've seen examples where this is not sufficient. Simple example:
Have your physical memory end within a memory section. Easy via QEMU, just do a "-m 4000M". The remaining part of the last section has fake/wrong node/zone info.
Hotplug memory. The node/zone gets resized such that PFN walkers might stumble over it.
The basic idea is to make sure that any initialized/"online" pfn belongs to exactly one node/zone and that the node/zone spans that PFN.
problems with that. The VM_BUG_ON blowing up is really bad but as said above we can simply make it less offensive in presence of reserved pages as those shouldn't reach that path AFAICS normally.
Andrea tried working around it via PG_reserved pages and it resulted in quite some ugly code. Andrea also noted that we cannot rely on any random page walker to do the right thing when it comes to messed up node/zone info.
On Fri 12-02-21 11:42:15, David Hildenbrand wrote:
On 12.02.21 11:33, Michal Hocko wrote:
[...]
I have to digest this but my first impression is that this is more heavy weight than it needs to. Pfn walkers should normally obey node range at least. The first pfn is usually excluded but I haven't seen real
We've seen examples where this is not sufficient. Simple example:
Have your physical memory end within a memory section. Easy via QEMU, just do a "-m 4000M". The remaining part of the last section has fake/wrong node/zone info.
Does this really matter, though? If those pages are reserved then nobody will touch them regardless of their node/zone ids.
Hotplug memory. The node/zone gets resized such that PFN walkers might stumble over it.
The basic idea is to make sure that any initialized/"online" pfn belongs to exactly one node/zone and that the node/zone spans that PFN.
Yeah, this sounds like a good idea but what is the proper node for a hole between two ranges associated with different nodes/zones? This will always be a random number. We should have a clear way to tell "do not touch those pages" and PageReserved sounds like a good way to tell that.
problems with that. The VM_BUG_ON blowing up is really bad but as said above we can simply make it less offensive in presence of reserved pages as those shouldn't reach that path AFAICS normally.
Andrea tried tried working around if via PG_reserved pages and it resulted in quite some ugly code. Andrea also noted that we cannot rely on any random page walker to do the right think when it comes to messed up node/zone info.
I am sorry, I haven't followed previous discussions. Has the removal of the VM_BUG_ON been considered as an immediate workaround?
On Fri, Feb 12, 2021 at 02:18:20PM +0100, Michal Hocko wrote:
On Fri 12-02-21 11:42:15, David Hildenbrand wrote:
On 12.02.21 11:33, Michal Hocko wrote:
[...]
I have to digest this but my first impression is that this is more heavy weight than it needs to. Pfn walkers should normally obey node range at least. The first pfn is usually excluded but I haven't seen real
We've seen examples where this is not sufficient. Simple example:
Have your physical memory end within a memory section. Easy via QEMU, just do a "-m 4000M". The remaining part of the last section has fake/wrong node/zone info.
Does this really matter though. If those pages are reserved then nobody will touch them regardless of their node/zone ids.
Hotplug memory. The node/zone gets resized such that PFN walkers might stumble over it.
The basic idea is to make sure that any initialized/"online" pfn belongs to exactly one node/zone and that the node/zone spans that PFN.
Yeah, this sounds like a good idea but what is the poper node for hole between two ranges associated with a different nodes/zones? This will always be a random number. We should have a clear way to tell "do not touch those pages" and PageReserved sounds like a good way to tell that.
Nobody should touch reserved pages, but I don't think we can ensure that.
We can correctly set the zone links for the reserved pages in holes in the middle of a zone based on the architecture constraints; only the holes at the beginning/end of memory will then not be spanned by any node/zone, which in practice does not seem to be a problem, as the VM_BUG_ON in set_pfnblock_flags_mask() never triggered for pfn 0.
I believe that any improvement in memory map consistency is a step forward.
problems with that. The VM_BUG_ON blowing up is really bad but as said above we can simply make it less offensive in presence of reserved pages as those shouldn't reach that path AFAICS normally.
Andrea tried tried working around if via PG_reserved pages and it resulted in quite some ugly code. Andrea also noted that we cannot rely on any random page walker to do the right think when it comes to messed up node/zone info.
I am sorry, I haven't followed previous discussions. Has the removal of the VM_BUG_ON been considered as an immediate workaround?
It was never discussed, but I'm not sure it's a good idea.
Judging by the message of the commit that introduced the VM_BUG_ON (commit 86051ca5eaf5 ("mm: fix usemap initialization")), there was yet another inconsistency in the memory map that required special care.
On Sun 14-02-21 20:00:16, Mike Rapoport wrote:
On Fri, Feb 12, 2021 at 02:18:20PM +0100, Michal Hocko wrote:
On Fri 12-02-21 11:42:15, David Hildenbrand wrote:
On 12.02.21 11:33, Michal Hocko wrote:
[...]
I have to digest this but my first impression is that this is more heavy weight than it needs to. Pfn walkers should normally obey node range at least. The first pfn is usually excluded but I haven't seen real
We've seen examples where this is not sufficient. Simple example:
Have your physical memory end within a memory section. Easy via QEMU, just do a "-m 4000M". The remaining part of the last section has fake/wrong node/zone info.
Does this really matter though. If those pages are reserved then nobody will touch them regardless of their node/zone ids.
Hotplug memory. The node/zone gets resized such that PFN walkers might stumble over it.
The basic idea is to make sure that any initialized/"online" pfn belongs to exactly one node/zone and that the node/zone spans that PFN.
Yeah, this sounds like a good idea but what is the poper node for hole between two ranges associated with a different nodes/zones? This will always be a random number. We should have a clear way to tell "do not touch those pages" and PageReserved sounds like a good way to tell that.
Nobody should touch reserved pages, but I don't think we can ensure that.
Touching a reserved page which doesn't belong to you is a bug. Sure, we cannot enforce that rule by runtime checks. But incorrect/misleading zone/node association is the least of the problems when somebody already does that.
We can correctly set the zone links for the reserved pages for holes in the middle of a zone based on the architecture constraints and with only the holes in the beginning/end of the memory will be not spanned by any node/zone which in practice does not seem to be a problem as the VM_BUG_ON in set_pfnblock_flags_mask() never triggered on pfn 0.
I really fail to see what you mean by correct zone/node for a memory range which is not associated with any real node.
I believe that any improvement in memory map consistency is a step forward.
I do agree, but we are talking about a subtle bug (VM_BUG_ON) which would be better off with a simplistic fix first. You can work on consistency improvements on top of that.
problems with that. The VM_BUG_ON blowing up is really bad, but as said above we can simply make it less offensive in the presence of reserved pages, as those shouldn't reach that path AFAICS normally.
Andrea tried working around it via PG_reserved pages and it resulted in quite some ugly code. Andrea also noted that we cannot rely on any random page walker to do the right thing when it comes to messed-up node/zone info.
I am sorry, I haven't followed previous discussions. Has the removal of the VM_BUG_ON been considered as an immediate workaround?
It was never discussed, but I'm not sure it's a good idea.
Judging by the commit message that introduced the VM_BUG_ON (commit 86051ca5eaf5 ("mm: fix usemap initialization")), there was yet another inconsistency in the memory map that required special care.
Can we actually explore that path before adding yet additional complexity and potentially a very involved fix for a subtle problem?
Mel, who is the author of this code, might help us out. I have to say I do not see the point of the VM_BUG_ON other than better debuggability. If there is a real inconsistency problem as a result then we should be handling that situation for non-debugging kernels as well.
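To make the invariant David describes above concrete - every initialized pfn covered by exactly one node/zone whose links in struct page actually span it - here is a minimal debug-style sketch; the function name and the choice to warn rather than assert are assumptions for illustration, not part of any posted patch.

#include <linux/memblock.h>
#include <linux/mm.h>
#include <linux/mmzone.h>

/*
 * Sketch only: walk all pfns registered in memblock.memory and check
 * that the node/zone links stored in their struct pages actually span
 * them.  Not taken from the patches under discussion.
 */
static void __init check_memmap_consistency(void)
{
	unsigned long start_pfn, end_pfn, pfn;
	int i, nid;

	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
			struct page *page = pfn_to_page(pfn);

			/* The zone link must span this pfn ... */
			WARN_ON_ONCE(!zone_spans_pfn(page_zone(page), pfn));
			/* ... and the node link must match memblock's idea. */
			WARN_ON_ONCE(nid != NUMA_NO_NODE &&
				     page_to_nid(page) != nid);
		}
	}
}

Note that pfns in holes are exactly the ones such a walk never visits, which is why their zone/node links can stay stale in the first place.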
On 15.02.21 10:00, Michal Hocko wrote:
On Sun 14-02-21 20:00:16, Mike Rapoport wrote:
On Fri, Feb 12, 2021 at 02:18:20PM +0100, Michal Hocko wrote:
On Fri 12-02-21 11:42:15, David Hildenbrand wrote:
On 12.02.21 11:33, Michal Hocko wrote:
[...]
I have to digest this but my first impression is that this is more heavyweight than it needs to be. Pfn walkers should normally obey the node range at least. The first pfn is usually excluded but I haven't seen real
We've seen examples where this is not sufficient. Simple example:
Have your physical memory end within a memory section. Easy via QEMU, just do a "-m 4000M". The remaining part of the last section has fake/wrong node/zone info.
Does this really matter though. If those pages are reserved then nobody will touch them regardless of their node/zone ids.
Hotplug memory. The node/zone gets resized such that PFN walkers might stumble over it.
The basic idea is to make sure that any initialized/"online" pfn belongs to exactly one node/zone and that the node/zone spans that PFN.
Yeah, this sounds like a good idea, but what is the proper node for a hole between two ranges associated with different nodes/zones? This will always be a random number. We should have a clear way to tell "do not touch those pages" and PageReserved sounds like a good way to tell that.
Nobody should touch reserved pages, but I don't think we can ensure that.
Touching a reserved page which doesn't belong to you is a bug. Sure we cannot enforce that rule by runtime checks. But incorrect/misleading zone/node association is the least of the problem when somebody already does that.
We can correctly set the zone links for the reserved pages in holes in the middle of a zone based on the architecture constraints, and then only the holes at the beginning/end of the memory will not be spanned by any node/zone. In practice that does not seem to be a problem, as the VM_BUG_ON in set_pfnblock_flags_mask() never triggered on pfn 0.
I really fail to see what you mean by correct zone/node for a memory range which is not associated with any real node.
I believe that any improvement in memory map consistency is a step forward.
I do agree, but we are talking about a subtle bug (VM_BUG_ON) which would be better off with a simplistic fix first. You can work on consistency improvements on top of that.
problems with that. The VM_BUG_ON blowing up is really bad, but as said above we can simply make it less offensive in the presence of reserved pages, as those shouldn't reach that path AFAICS normally.
Andrea tried working around it via PG_reserved pages and it resulted in quite some ugly code. Andrea also noted that we cannot rely on any random page walker to do the right thing when it comes to messed-up node/zone info.
I am sorry, I haven't followed previous discussions. Has the removal of the VM_BUG_ON been considered as an immediate workaround?
It was never discussed, but I'm not sure it's a good idea.
Judging by the commit message that introduced the VM_BUG_ON (commit 86051ca5eaf5 ("mm: fix usemap initialization")), there was yet another inconsistency in the memory map that required special care.
Can we actually explore that path before adding yet additional complexity and potentially a very involved fix for a subtle problem?
Mel, who is the author of this code, might help us out. I have to say I do not see the point of the VM_BUG_ON other than better debuggability. If there is a real inconsistency problem as a result then we should be handling that situation for non-debugging kernels as well.
I have no time to summarize; you can find the complete discussion (also involving Mel) at
https://lkml.kernel.org/r/20201121194506.13464-1-aarcange@redhat.com
On Mon, Feb 15, 2021 at 10:00:31AM +0100, Michal Hocko wrote:
On Sun 14-02-21 20:00:16, Mike Rapoport wrote:
On Fri, Feb 12, 2021 at 02:18:20PM +0100, Michal Hocko wrote:
We can correctly set the zone links for the reserved pages in holes in the middle of a zone based on the architecture constraints, and then only the holes at the beginning/end of the memory will not be spanned by any node/zone. In practice that does not seem to be a problem, as the VM_BUG_ON in set_pfnblock_flags_mask() never triggered on pfn 0.
I really fail to see what you mean by correct zone/node for a memory range which is not associated with any real node.
We know the architectural zone constraints, so we can always have a 1:1 match from pfn to zone. The node indeed will be a guess.
I am sorry, I haven't followed previous discussions. Has the removal of the VM_BUG_ON been considered as an immediate workaround?
It was never discussed, but I'm not sure it's a good idea.
Judging by the commit message that introduced the VM_BUG_ON (commit 86051ca5eaf5 ("mm: fix usemap initialization")), there was yet another inconsistency in the memory map that required special care.
Can we actually explore that path before adding yet additional complexity and potentially a very involved fix for a subtle problem?
This patch was intended as a fix for the inconsistency of the memory map that is the root cause of triggering this VM_BUG_ON and other corner case problems.
The previous version [1] is less involved as it does not extend node/zone spans.
[1] https://lore.kernel.org/lkml/20210130221035.4169-3-rppt@kernel.org
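For illustration of the "1:1 match from pfn to zone" argument above, here is a rough sketch of deriving a zone purely from the architectural zone limits (the array stands for the max_zone_pfn information an arch passes to free_area_init()). The helper name is made up for this sketch, and, as Michal points out further down, this only works for zones with fixed architectural boundaries such as ZONE_DMA/ZONE_DMA32.

#include <linux/mmzone.h>

/*
 * Sketch: pick a zone index for a pfn in a hole using only the
 * architectural zone limits.  The node still has to be guessed.
 * Hypothetical helper, not part of the posted patch.
 */
static int __init zone_idx_for_hole_pfn(unsigned long pfn,
					const unsigned long *max_zone_pfn)
{
	int zid;

	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
		if (max_zone_pfn[zid] && pfn < max_zone_pfn[zid])
			return zid;
	}

	/* Above every architectural limit: fall back to the last zone. */
	return MAX_NR_ZONES - 1;
}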
On Mon 15-02-21 23:24:40, Mike Rapoport wrote:
On Mon, Feb 15, 2021 at 10:00:31AM +0100, Michal Hocko wrote:
On Sun 14-02-21 20:00:16, Mike Rapoport wrote:
On Fri, Feb 12, 2021 at 02:18:20PM +0100, Michal Hocko wrote:
We can correctly set the zone links for the reserved pages in holes in the middle of a zone based on the architecture constraints, and then only the holes at the beginning/end of the memory will not be spanned by any node/zone. In practice that does not seem to be a problem, as the VM_BUG_ON in set_pfnblock_flags_mask() never triggered on pfn 0.
I really fail to see what you mean by correct zone/node for a memory range which is not associated with any real node.
We know the architectural zone constraints, so we can always have a 1:1 match from pfn to zone. The node indeed will be a guess.
That is true only for some zones. Also we do require those to be correct when the memory is managed by the page allocator. I believe we can live with incorrect zones when they are in holes.
I am sorry, I haven't followed previous discussions. Has the removal of the VM_BUG_ON been considered as an immediate workaround?
It was never discussed, but I'm not sure it's a good idea.
Judging by the commit message that introduced the VM_BUG_ON (commit 86051ca5eaf5 ("mm: fix usemap initialization")), there was yet another inconsistency in the memory map that required special care.
Can we actually explore that path before adding yet additional complexity and potentially a very involved fix for a subtle problem?
This patch was intended as a fix for the inconsistency of the memory map that is the root cause of triggering this VM_BUG_ON and other corner case problems.
The previous version [1] is less involved as it does not extend node/zone spans.
I do understand that. And I am not objecting to the patch. I have to confess I haven't digested it yet. Any changes to early memory initialization have turned out to be subtle and corner cases only pop up later. This is almost impossible to review just by reading the code. That's why I am asking whether we want to address the specific VM_BUG_ON first with something much less tricky and actually reviewable. And that's why I am asking whether dropping the bug_on itself is safe to do and use as a hot fix which should be easier to backport.
Long term, I am definitely supporting any change which will lead to a fully initialized state. Whatever that means. One option would be to simply never allow partial page blocks or even memory sections. This would waste some memory but from what I have seen so far this would be quite a small amount on very rare setups. So it might turn out to be a much easier and more maintainable way forward.
[1] https://lore.kernel.org/lkml/20210130221035.4169-3-rppt@kernel.org
Sincerely yours, Mike.
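As a minimal sketch of the "never allow partial memory sections" option Michal mentions above: memblock already has a trimming helper (x86 calls it with PAGE_SIZE from its e820 setup), so one conceivable, purely illustrative direction would be to trim to section granularity instead, giving up the unaligned head/tail of each bank. The call-site name below is made up for this sketch.

#include <linux/memblock.h>
#include <linux/mmzone.h>

/*
 * Illustrative only: clip memblock.memory to SECTION_SIZE granularity
 * so that no memory section is ever partially populated.  The
 * unaligned head/tail of every bank is simply wasted.
 */
void __init trim_memory_to_sections(void)
{
	memblock_trim_memory(1ULL << SECTION_SIZE_BITS);
}

Whether wasting up to a section's worth of memory per bank is acceptable is exactly the trade-off Michal refers to.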
On Tue, Feb 16, 2021 at 09:33:20AM +0100, Michal Hocko wrote:
On Mon 15-02-21 23:24:40, Mike Rapoport wrote:
On Mon, Feb 15, 2021 at 10:00:31AM +0100, Michal Hocko wrote:
On Sun 14-02-21 20:00:16, Mike Rapoport wrote:
On Fri, Feb 12, 2021 at 02:18:20PM +0100, Michal Hocko wrote:
We can correctly set the zone links for the reserved pages in holes in the middle of a zone based on the architecture constraints, and then only the holes at the beginning/end of the memory will not be spanned by any node/zone. In practice that does not seem to be a problem, as the VM_BUG_ON in set_pfnblock_flags_mask() never triggered on pfn 0.
I really fail to see what you mean by correct zone/node for a memory range which is not associated with any real node.
We know the architectural zone constraints, so we can always have a 1:1 match from pfn to zone. The node indeed will be a guess.
That is true only for some zones.
Hmm, and when is it not true?
Also we do require those to be correct when the memory is managed by the page allocator. I believe we can live with incorrect zones when they are in holes.
Note that the holes Andrea reported in the first place are not ranges that are not populated, but rather ranges that the arch didn't report as usable, i.e. there is physical memory and it perfectly fits into an existing node and zone.
I am sorry, I haven't followed previous discussions. Has the removal of the VM_BUG_ON been considered as an immediate workaround?
It was never discussed, but I'm not sure it's a good idea.
Judging by the commit message that introduced the VM_BUG_ON (commit 86051ca5eaf5 ("mm: fix usemap initialization")), there was yet another inconsistency in the memory map that required special care.
Can we actually explore that path before adding yet additional complexity and potentially a very involved fix for a subtle problem?
This patch was intended as a fix for the inconsistency of the memory map that is the root cause of triggering this VM_BUG_ON and other corner case problems.
The previous version [1] is less involved as it does not extend node/zone spans.
I do understand that. And I am not objecting to the patch. I have to confess I haven't digested it yet. Any changes to early memory initialization have turned out to be subtle and corner cases only pop up later. This is almost impossible to review just by reading the code. That's why I am asking whether we want to address the specific VM_BUG_ON first with something much less tricky and actually reviewable. And that's why I am asking whether dropping the bug_on itself is safe to do and use as a hot fix which should be easier to backport.
I can't say I'm familiar enough with migration and compaction code to say if it's ok to remove that bug_on. It does point to inconsistency in the memmap, but probably it's not important.
On Tue 16-02-21 13:01:54, Mike Rapoport wrote:
On Tue, Feb 16, 2021 at 09:33:20AM +0100, Michal Hocko wrote:
On Mon 15-02-21 23:24:40, Mike Rapoport wrote:
On Mon, Feb 15, 2021 at 10:00:31AM +0100, Michal Hocko wrote:
On Sun 14-02-21 20:00:16, Mike Rapoport wrote:
On Fri, Feb 12, 2021 at 02:18:20PM +0100, Michal Hocko wrote:
We can correctly set the zone links for the reserved pages in holes in the middle of a zone based on the architecture constraints, and then only the holes at the beginning/end of the memory will not be spanned by any node/zone. In practice that does not seem to be a problem, as the VM_BUG_ON in set_pfnblock_flags_mask() never triggered on pfn 0.
I really fail to see what you mean by correct zone/node for a memory range which is not associated with any real node.
We know the architectural zone constraints, so we can always have a 1:1 match from pfn to zone. The node indeed will be a guess.
That is true only for some zones.
Hmm, and when is it not true?
ZONE_MOVABLE, ZONE_NORMAL and ZONE_DEVICE.
Also we do require those to be correct when the memory is managed by the page allocator. I believe we can live with incorrect zones when they are in holes.
Note that the holes Andrea reported in the first place are not ranges that are not populated, but rather ranges that the arch didn't report as usable, i.e. there is physical memory and it perfectly fits into an existing node and zone.
Except those are not usable so they have no clear meaning, right?
I am sorry, I haven't followed previous discussions. Has the removal of the VM_BUG_ON been considered as an immediate workaround?
It was never discussed, but I'm not sure it's a good idea.
Judging by the commit message that introduced the VM_BUG_ON (commit 86051ca5eaf5 ("mm: fix usemap initialization")), there was yet another inconsistency in the memory map that required special care.
Can we actually explore that path before adding yet additional complexity and potentially a very involved fix for a subtle problem?
This patch was intended as a fix for the inconsistency of the memory map that is the root cause of triggering this VM_BUG_ON and other corner case problems.
The previous version [1] is less involved as it does not extend node/zone spans.
I do understand that. And I am not objecting to the patch. I have to confess I haven't digested it yet. Any changes to early memory initialization have turned out to be subtle and corner cases only pop up later. This is almost impossible to review just by reading the code. That's why I am asking whether we want to address the specific VM_BUG_ON first with something much less tricky and actually reviewable. And that's why I am asking whether dropping the bug_on itself is safe to do and use as a hot fix which should be easier to backport.
I can't say I'm familiar enough with migration and compaction code to say if it's ok to remove that bug_on. It does point to inconsistency in the memmap, but probably it's not important.
Yeah, I am not really clear on that myself TBH. On one hand, this cannot be really critical because it is a conditional assert in debugging mode. Most production users do not enable it, so they wouldn't learn about that inconsistency and maybe never notice any difference. This has led me to think of this as something mostly focused on debugging. If that is an incorrect assumption then the VM_BUG_ON should be changed to something else - either a graceful failure or a real BUG_ON if that is really worth it.
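A sketch of the "graceful failure" alternative mentioned above, i.e. an unconditional check that warns and lets the caller bail out instead of a DEBUG_VM-only assert. This is written as a standalone helper with an assumed name, not as the actual upstream set_pfnblock_flags_mask() signature.

#include <linux/mm.h>
#include <linux/mmzone.h>

/*
 * Hypothetical helper: callers about to touch the pageblock bitmap
 * could use this instead of VM_BUG_ON_PAGE() and simply return when
 * the zone does not span the pfn.
 */
static bool pfnblock_pfn_sane(struct zone *zone, unsigned long pfn)
{
	if (likely(zone_spans_pfn(zone, pfn)))
		return true;

	WARN_ONCE(1, "pfn %lx is outside zone %s, skipping pageblock update\n",
		  pfn, zone->name);
	return false;
}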
On 2/16/21 12:01 PM, Mike Rapoport wrote:
I do understand that. And I am not objecting to the patch. I have to confess I haven't digested it yet. Any changes to early memory initialization have turned out to be subtle and corner cases only pop up later. This is almost impossible to review just by reading the code. That's why I am asking whether we want to address the specific VM_BUG_ON first with something much less tricky and actually reviewable. And that's why I am asking whether dropping the bug_on itself is safe to do and use as a hot fix which should be easier to backport.
I can't say I'm familiar enough with migration and compaction code to say if it's ok to remove that bug_on. It does point to inconsistency in the memmap, but probably it's not important.
On closer look, removing the VM_BUG_ON_PAGE() in set_pfnblock_flags_mask() is not safe. If we violate the zone_spans_pfn condition, it means we will write outside of the pageblock bitmap for the zone, and corrupt something. Actually, a similar thing can happen in __get_pfnblock_flags_mask() where there's no VM_BUG_ON, but there we can't corrupt memory. But we could theoretically fault due to accessing some unmapped range?
So the checks would have to become unconditional (i.e. present even without DEBUG_VM) and return instead of causing a BUG. Or we could go back one level and add some checks to fast_isolate_around() to detect a page from a zone that doesn't match cc->zone. The question is whether there is other code that will break if a page_zone() suddenly changes e.g. in the middle of the pageblock - __pageblock_pfn_to_page() assumes that if the first and last page are from the same zone, so are all pages in between, and the rest relies on that. But maybe if Andrea's fast_isolate_around() issue is fixed, that's enough for a stable backport.
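To spell out why the out-of-zone write corrupts memory (as clarified below, this applies to the !CONFIG_SPARSEMEM layout, where the pageblock bitmap is a per-zone allocation sized for the zone's own span): a simplified sketch of the bit-index calculation, not the literal upstream code, shows how a pfn outside the zone span lands outside the zone's bitmap.

#include <linux/mmzone.h>
#include <linux/pageblock-flags.h>

/*
 * Simplified sketch of the !CONFIG_SPARSEMEM index calculation.
 * A pfn below zone_start_pfn makes the subtraction wrap around, and a
 * pfn past the zone span indexes beyond the bitmap -- either way the
 * later bitmap store lands outside zone->pageblock_flags.
 */
static unsigned long pageblock_bitidx(struct zone *zone, unsigned long pfn)
{
	unsigned long offset;

	offset = pfn - round_down(zone->zone_start_pfn, pageblock_nr_pages);
	return (offset >> pageblock_order) * NR_PAGEBLOCK_BITS;
}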
On 2/16/21 1:34 PM, Vlastimil Babka wrote:
On 2/16/21 12:01 PM, Mike Rapoport wrote:
I do understand that. And I am not objecting to the patch. I have to confess I haven't digested it yet. Any changes to early memory initialization have turned out to be subtle and corner cases only pop up later. This is almost impossible to review just by reading the code. That's why I am asking whether we want to address the specific VM_BUG_ON first with something much less tricky and actually reviewable. And that's why I am asking whether dropping the bug_on itself is safe to do and use as a hot fix which should be easier to backport.
I can't say I'm familiar enough with migration and compaction code to say if it's ok to remove that bug_on. It does point to inconsistency in the memmap, but probably it's not important.
On closer look, removing the VM_BUG_ON_PAGE() in set_pfnblock_flags_mask() is not safe. If we violate the zone_spans_pfn condition, it means we will write outside of the pageblock bitmap for the zone, and corrupt something.
Clarification: this is true only for !CONFIG_SPARSEMEM, which is unlikely in practice to produce the configurations that trigger this issue. So we can remove the VM_BUG_ON_PAGE().
Actually, a similar thing can happen in __get_pfnblock_flags_mask() where there's no VM_BUG_ON, but there we can't corrupt memory. But we could theoretically fault due to accessing some unmapped range?
So the checks would have to become unconditional (i.e. present even without DEBUG_VM) and return instead of causing a BUG. Or we could go back one level and add some checks to fast_isolate_around() to detect a page from a zone that doesn't match cc->zone. The question is whether there is other code that will break if a page_zone() suddenly changes e.g. in the middle of the pageblock - __pageblock_pfn_to_page() assumes that if the first and last page are from the same zone, so are all pages in between, and the rest relies on that. But maybe if Andrea's fast_isolate_around() issue is fixed, that's enough for a stable backport.
On Tue 16-02-21 13:34:56, Vlastimil Babka wrote:
On 2/16/21 12:01 PM, Mike Rapoport wrote:
I do understand that. And I am not objecting to the patch. I have to confess I haven't digested it yet. Any changes to early memory initialization have turned out to be subtle and corner cases only pop up later. This is almost impossible to review just by reading the code. That's why I am asking whether we want to address the specific VM_BUG_ON first with something much less tricky and actually reviewable. And that's why I am asking whether dropping the bug_on itself is safe to do and use as a hot fix which should be easier to backport.
I can't say I'm familiar enough with migration and compaction code to say if it's ok to remove that bug_on. It does point to inconsistency in the memmap, but probably it's not important.
On closer look, removing the VM_BUG_ON_PAGE() in set_pfnblock_flags_mask() is not safe. If we violate the zone_spans_pfn condition, it means we will write outside of the pageblock bitmap for the zone, and corrupt something.
Isn't it enough that at least some pfn from the pageblock belongs to the zone in order to have the bitmap allocated for the whole page block (even if it partially belongs to a different zone)?
Actually, a similar thing can happen in __get_pfnblock_flags_mask() where there's no VM_BUG_ON, but there we can't corrupt memory. But we could theoretically fault due to accessing some unmapped range?
So the checks would have to become unconditional (i.e. present even without DEBUG_VM) and return instead of causing a BUG. Or we could go back one level and add some checks to fast_isolate_around() to detect a page from a zone that doesn't match cc->zone.
Thanks for looking deeper into that. This sounds like a much more targeted fix to me.
The question is whether there is other code that will break if a page_zone() suddenly changes e.g. in the middle of the pageblock - __pageblock_pfn_to_page() assumes that if the first and last page are from the same zone, so are all pages in between, and the rest relies on that. But maybe if Andrea's fast_isolate_around() issue is fixed, that's enough for a stable backport.
There might be some other cases but I think it would be better to have a single fix for this particular issue and have it fixed properly and only then build something more robust on top.
On 2/16/21 2:11 PM, Michal Hocko wrote:
On Tue 16-02-21 13:34:56, Vlastimil Babka wrote:
On 2/16/21 12:01 PM, Mike Rapoport wrote:
I do understand that. And I am not objecting to the patch. I have to confess I haven't digested it yet. Any changes to early memory initialization have turned out to be subtle and corner cases only pop up later. This is almost impossible to review just by reading the code. That's why I am asking whether we want to address the specific VM_BUG_ON first with something much less tricky and actually reviewable. And that's why I am asking whether dropping the bug_on itself is safe to do and use as a hot fix which should be easier to backport.
I can't say I'm familiar enough with migration and compaction code to say if it's ok to remove that bug_on. It does point to inconsistency in the memmap, but probably it's not important.
On closer look, removing the VM_BUG_ON_PAGE() in set_pfnblock_flags_mask() is not safe. If we violate the zone_spans_pfn condition, it means we will write outside of the pageblock bitmap for the zone, and corrupt something.
Isn't it enough that at least some pfn from the pageblock belongs to the zone in order to have the bitmap allocated for the whole page block (even if it partially belongs to a different zone)?
Actually, a similar thing can happen in __get_pfnblock_flags_mask() where there's no VM_BUG_ON, but there we can't corrupt memory. But we could theoretically fault due to accessing some unmapped range?
So the checks would have to become unconditional (i.e. present even without DEBUG_VM) and return instead of causing a BUG. Or we could go back one level and add some checks to fast_isolate_around() to detect a page from a zone that doesn't match cc->zone.
Thanks for looking deeper into that. This sounds like a much more targeted fix to me.
So, Andrea, could you please check if this fixes the original fast_isolate_around() issue for you? With the VM_BUG_ON not removed, DEBUG_VM enabled, no changes to struct page initialization... It relies on pageblock_pfn_to_page() like the rest of the compaction code.
Thanks!
----8<----
From f5c8d7bc77d2ec0b4cfec44820ce6f602fdb3a86 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka vbabka@suse.cz Date: Tue, 16 Feb 2021 17:32:34 +0100 Subject: [PATCH] mm, compaction: make fast_isolate_around() robust against pfns from a wrong zone
TBD
Signed-off-by: Vlastimil Babka vbabka@suse.cz --- mm/compaction.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c index 190ccdaa6c19..b75645e4678d 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1288,7 +1288,7 @@ static void fast_isolate_around(struct compact_control *cc, unsigned long pfn, unsigned long nr_isolated) { unsigned long start_pfn, end_pfn; - struct page *page = pfn_to_page(pfn); + struct page *page;
/* Do not search around if there are enough pages already */ if (cc->nr_freepages >= cc->nr_migratepages) @@ -1300,7 +1300,11 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn, unsigned long
/* Pageblock boundaries */ start_pfn = pageblock_start_pfn(pfn); - end_pfn = min(pageblock_end_pfn(pfn), zone_end_pfn(cc->zone)) - 1; + end_pfn = min(pageblock_end_pfn(pfn), zone_end_pfn(cc->zone)); + + page = pageblock_pfn_to_page(start_pfn, end_pfn, cc->zone); + if (!page) + return;
/* Scan before */ if (start_pfn != pfn) { @@ -1486,7 +1490,7 @@ fast_isolate_freepages(struct compact_control *cc) }
cc->total_free_scanned += nr_scanned; - if (!page) + if (!page || page_zone(page) != cc->zone) return cc->free_pfn;
low_pfn = page_to_pfn(page);
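For readers less familiar with the compaction internals, the guard the patch relies on can be sketched as follows. The helper name is made up; pageblock_pfn_to_page() itself is the existing function the rest of compaction already uses, and it returns NULL unless the block's boundary pages are valid and belong to the given zone.

#include <linux/mmzone.h>
#include "internal.h"	/* pageblock_start_pfn(), pageblock_pfn_to_page() */

/*
 * Sketch: before scanning around a pfn, make sure its pageblock really
 * belongs to the zone being compacted.  This is essentially the check
 * the hunk above adds to fast_isolate_around().
 */
static bool pageblock_in_zone(struct zone *zone, unsigned long pfn)
{
	unsigned long start_pfn = pageblock_start_pfn(pfn);
	unsigned long end_pfn = min(pageblock_end_pfn(pfn),
				    zone_end_pfn(zone));

	return pageblock_pfn_to_page(start_pfn, end_pfn, zone) != NULL;
}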
Hi Vlastimil,
On Tue, Feb 16, 2021 at 05:39:12PM +0100, Vlastimil Babka wrote:
So, Andrea could you please check if this fixes the original fast_isolate_around() issue for you? With the VM_BUG_ON not removed, DEBUG_VM enabled, no changes to struct page initialization... It relies on pageblock_pfn_to_page as the rest of the compaction code.
Pardon my ignorance of compaction internals, but does this mean that with your patch we'll never call set_pfnblock_flags_mask() for a pfn in a hole?
Thanks!
----8<---- From f5c8d7bc77d2ec0b4cfec44820ce6f602fdb3a86 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka vbabka@suse.cz Date: Tue, 16 Feb 2021 17:32:34 +0100 Subject: [PATCH] mm, compaction: make fast_isolate_around() robust against pfns from a wrong zone
TBD
Signed-off-by: Vlastimil Babka vbabka@suse.cz
mm/compaction.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c index 190ccdaa6c19..b75645e4678d 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1288,7 +1288,7 @@ static void fast_isolate_around(struct compact_control *cc, unsigned long pfn, unsigned long nr_isolated) { unsigned long start_pfn, end_pfn;
- struct page *page = pfn_to_page(pfn);
+ struct page *page;
/* Do not search around if there are enough pages already */ if (cc->nr_freepages >= cc->nr_migratepages) @@ -1300,7 +1300,11 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn, unsigned long /* Pageblock boundaries */ start_pfn = pageblock_start_pfn(pfn);
- end_pfn = min(pageblock_end_pfn(pfn), zone_end_pfn(cc->zone)) - 1;
+ end_pfn = min(pageblock_end_pfn(pfn), zone_end_pfn(cc->zone));
+
+ page = pageblock_pfn_to_page(start_pfn, end_pfn, cc->zone);
+ if (!page)
+	return;
/* Scan before */ if (start_pfn != pfn) { @@ -1486,7 +1490,7 @@ fast_isolate_freepages(struct compact_control *cc) } cc->total_free_scanned += nr_scanned;
- if (!page)
+ if (!page || page_zone(page) != cc->zone)
	return cc->free_pfn;
low_pfn = page_to_pfn(page);
-- 2.30.0
On 2/16/21 6:49 PM, Mike Rapoport wrote:
Hi Vlastimil,
On Tue, Feb 16, 2021 at 05:39:12PM +0100, Vlastimil Babka wrote:
So, Andrea could you please check if this fixes the original fast_isolate_around() issue for you? With the VM_BUG_ON not removed, DEBUG_VM enabled, no changes to struct page initialization... It relies on pageblock_pfn_to_page as the rest of the compaction code.
Pardon my ignorance of compaction internals, but does this mean that with your patch we'll never call set_pfnblock_flags_mask() for a pfn in a hole?
No, it doesn't mean that kind of guarantee. But we will not call it anymore (if my patch is correct) from a path which we currently know is doing that and triggering the VM_BUG_ON. So that's a targeted fix that matches the stable backport criteria. It doesn't contradict your patch as a way to improve mainline; I still agree it's best long-term if we initialize the struct pages without such surprises. But I also agree with Michal that there's a risk of replacing one corner case with another and thus we shouldn't do that as a stable fix.