 
On Mon, Dec 14, 2020 at 12:18:07PM +0100, David Hildenbrand wrote:
On 14.12.20 12:12, Mike Rapoport wrote:
On Mon, Dec 14, 2020 at 11:11:35AM +0100, David Hildenbrand wrote:
On 09.12.20 22:43, Mike Rapoport wrote:
From: Mike Rapoport <rppt@linux.ibm.com>
memblock does not require the reserved memory ranges to be a subset of memblock.memory.
As a result, there may be reserved pages that are not in the range of any zone or node, because zone and node boundaries are detected based on memblock.memory, and pages that are only present in memblock.reserved are not taken into account during zone/node size detection.
Make sure that all ranges in memblock.reserved are added to memblock.memory before calculating node and zone boundaries.
Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 include/linux/memblock.h |  1 +
 mm/memblock.c            | 24 ++++++++++++++++++++++++
 mm/page_alloc.c          |  7 +++++++
 3 files changed, 32 insertions(+)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index ef131255cedc..e64dae2dd1ce 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -120,6 +120,7 @@ int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
 unsigned long memblock_free_all(void);
 void reset_node_managed_pages(pg_data_t *pgdat);
 void reset_all_zones_managed_pages(void);
+void memblock_enforce_memory_reserved_overlap(void);
 
 /* Low level functions */
 void __next_mem_range(u64 *idx, int nid, enum memblock_flags flags,
diff --git a/mm/memblock.c b/mm/memblock.c
index b68ee86788af..9277aca642b2 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1857,6 +1857,30 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)
 	}
 }
 
+/**
+ * memblock_enforce_memory_reserved_overlap - make sure every range in
+ * @memblock.reserved is covered by @memblock.memory
+ *
+ * The data in @memblock.memory is used to detect zone and node boundaries
+ * during initialization of the memory map and the page allocator. Make
+ * sure that every memory range present in @memblock.reserved is also added
+ * to @memblock.memory even if the architecture specific memory
+ * initialization failed to do so
+ */
+void __init memblock_enforce_memory_reserved_overlap(void)
+{
+	phys_addr_t start, end;
+	int nid;
+	u64 i;
+
+	__for_each_mem_range(i, &memblock.reserved, &memblock.memory,
+			     NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, &nid) {
+		pr_warn("memblock: reserved range [%pa-%pa] is not in memory\n",
+			&start, &end);
+		memblock_add_node(start, (end - start), nid);
+	}
+}
+
 void __init_memblock memblock_set_current_limit(phys_addr_t limit)
 {
 	memblock.current_limit = limit;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eaa227a479e4..dbc57dbbacd8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7436,6 +7436,13 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 	memset(arch_zone_highest_possible_pfn, 0,
 				sizeof(arch_zone_highest_possible_pfn));
 
+	/*
+	 * Some architectures (e.g. x86) have reserved pages outside of
+	 * memblock.memory. Make sure these pages are taken into account
+	 * when detecting zone and node boundaries
+	 */
+	memblock_enforce_memory_reserved_overlap();
+
 	start_pfn = find_min_pfn_with_active_regions();
 	descending = arch_has_descending_max_zone_pfns();
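To make the iteration in the patch easier to follow, here is a rough user-space model of what the loop walks over: the parts of each reserved range that no memory range covers. This is only a sketch and not the kernel code. `struct range` and `reserved_not_in_memory()` are made-up names for illustration, and it assumes both arrays are sorted by start address with no overlaps inside either array, which is how memblock keeps its region arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open physical range [start, end); a stand-in for a memblock region. */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Hypothetical helper: write to out[] every part of reserved[] that is not
 * covered by any entry of memory[]. Both inputs must be sorted by start and
 * must not overlap internally. Returns the number of entries written.
 */
static size_t reserved_not_in_memory(const struct range *reserved, size_t nr_res,
				     const struct range *memory, size_t nr_mem,
				     struct range *out, size_t max_out)
{
	size_t n = 0, m = 0;

	for (size_t r = 0; r < nr_res; r++) {
		uint64_t pos = reserved[r].start;
		uint64_t end = reserved[r].end;

		/* Skip memory regions that end at or before this reserved range. */
		while (m < nr_mem && memory[m].end <= pos)
			m++;

		for (size_t k = m; pos < end; k++) {
			/* Is there another memory region starting inside the range? */
			int have_mem = k < nr_mem && memory[k].start < end;
			uint64_t gap_end = have_mem ? memory[k].start : end;

			/* [pos, gap_end) is reserved but not in memory. */
			if (pos < gap_end) {
				assert(n < max_out);
				out[n++] = (struct range){ pos, gap_end };
			}

			/* Continue after the memory region, or stop at the end. */
			pos = have_mem ? memory[k].end : end;
		}
	}

	return n;
}
```

In the patch, each range found this way is what gets the pr_warn() and the memblock_add_node() call; the model above only reports the ranges and does not add them anywhere.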
CCing Dan.
This implies that any memory that is E820_TYPE_SOFT_RESERVED that was reserved via memblock_reserve() will be added via memblock_add_node() as well, resulting in all such memory getting a memmap allocated right when booting up, right?
IIRC, there are use cases where that is absolutely not desired.
Hmm, if this is the case we need an entirely different solution to ensure that we don't have partial pageblocks in a zone and we have all the memory map initialized to a known state.
Am I missing something? (@Dan?)
BTW, @Dan, why did you need to memblock_reserve(E820_TYPE_SOFT_RESERVED) without memblock_add()ing it?
I suspect to cover cases where it might partially span memory sections (or even sub-sections). Maybe we should focus on initializing that part only - meaning, not adding all memory to .memory but only !section aligned pieces.
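One way to read the "only !section aligned pieces" idea is as a split of each reserved-but-not-in-memory range into an unaligned head, a fully section-aligned middle, and an unaligned tail, where only the head and tail would need to be handled. The following is a minimal sketch under that reading, not a proposal for the actual implementation. `split_by_section()` and the struct names are hypothetical, and SECTION_SIZE_BITS is assumed to be 27 (128 MiB sections, the x86-64 value); other configurations differ.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 128 MiB memory sections, as on x86-64. */
#define SECTION_SIZE_BITS	27
#define SECTION_SIZE		(UINT64_C(1) << SECTION_SIZE_BITS)

/* Half-open physical range [start, end); empty when start == end. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Hypothetical result type: partial head, aligned middle, partial tail. */
struct split {
	struct range head;
	struct range middle;
	struct range tail;
};

static uint64_t section_down(uint64_t addr)
{
	return addr & ~(SECTION_SIZE - 1);
}

/* Note: does not guard against overflow near the top of the address space. */
static uint64_t section_up(uint64_t addr)
{
	return section_down(addr + SECTION_SIZE - 1);
}

static struct split split_by_section(struct range r)
{
	struct split s;
	uint64_t first = section_up(r.start);	/* first boundary >= start */
	uint64_t last = section_down(r.end);	/* last boundary <= end */

	assert(r.start <= r.end);

	if (first > last) {
		/* The range sits inside one section and touches no boundary. */
		s.head = r;
		s.middle = (struct range){ r.end, r.end };
		s.tail = (struct range){ r.end, r.end };
		return s;
	}

	s.head = (struct range){ r.start, first };
	s.middle = (struct range){ first, last };
	s.tail = (struct range){ last, r.end };
	return s;
}
```

With such a split, the middle part would stay out of memblock.memory and so would not get a memmap at boot, while head and tail are the pieces that share a section with other memory.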
We had that information left in the memblock data structure with the previous implementation in -mm (before adding all memblock.reserved to memblock.memory). To avoid destroying that information we'll need a new flag for each range that is not originally in memblock.memory:
=== What you suggest would require adding extra information to flag which ranges must not have a direct mapping, but that information is already in memblock today, for each range in memblock.reserved but not in memblock.memory. Or did I misunderstand how that no-direct-map detail works? ===
I guess I was too optimistic that this was already implemented, thanks for noticing.
For the record, I didn't have time to test the new implementation yet. Since I'm running the "hack" on all machines, things have been stable on v5.9. I'm actually curious if the hack would also fail to boot on the CI system or not; that would help localize the issue to the implicit memblock_add at least. The memblock debug output won't give us a direct reproducer, but we can try to generate one by reproducing the same e820 map in seabios.
Andrea