On Tue, Dec 08, 2020 at 06:13:47PM -0500, Andrea Arcangeli wrote:
On Tue, Dec 08, 2020 at 11:46:14PM +0200, Mike Rapoport wrote:
Sorry, I was not clear. The penalty here is not the traversal of the memblock.reserved array. The penalty is the redundant initialization of struct page for the ranges memblock.reserved describes.
But that's the fix... How can you avoid removing the PagePoison for every pfn_valid pfn in memblock.reserved, when the crash at https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org is in __SetPageReserved of reserve_bootmem_region?
With your fix the pages in memblock.reserved that intersect with memblock.memory will be initialized twice: the first time in for_each_mem_pfn_range() and the second time in for_each_res_pfn_range().
For instance, if the user specified hugetlb_cma=nnG, the loop over memblock.reserved will cause a redundant initialization pass over the pages in that nnG.
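Roughly, a fragment illustrating the concern (not the actual patch; I'm assuming for_each_res_pfn_range() mirrors the signature of for_each_mem_pfn_range(), and i/start/end/nid/pfn/zone are the usual locals):

	/* first pass: everything in memblock.memory */
	for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid)
		for (pfn = start; pfn < end; pfn++)
			__init_single_page(pfn_to_page(pfn), pfn, zone, nid);

	/* second pass: memblock.reserved, touching the CMA pages again */
	for_each_res_pfn_range(i, MAX_NUMNODES, &start, &end, &nid)
		for (pfn = start; pfn < end; pfn++)
			__init_single_page(pfn_to_page(pfn), pfn, zone, nid);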
And if that didn't happen, then I'm left wondering whether free_low_memory_core_early would crash again with hugetlb_cma=nnG specified, not just at pfn 0 anymore, even with the simple fix applied.
Where does PagePoison get removed from the CMA range, or is __SetPageReserved somehow not called on the CMA reserved range with the simple fix applied?
The CMA range would be in both memblock.memory and memblock.reserved, so the pages for it will be initialized in for_each_mem_pfn_range().
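In other words (placeholder variables, just to illustrate how the same physical range ends up in both lists):

	/* arch code registers the populated DRAM bank ... */
	memblock_add(bank_base, bank_size);
	/* ... and the hugetlb_cma carve-out inside that bank gets reserved */
	memblock_reserve(cma_base, cma_size);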
Regardless of the particular implementation, the generic code cannot detect the physical memory layout; it is up to the architecture to detect which memory banks are populated and provide this information to the generic code.
So we'll always have what you call "dependencies on the caller" here. The good thing is that we can fix "the caller" when we find it's wrong.
The caller that massages the e820 BIOS table is still software. Software can be moved from arch code to common code. There's no asm or arch dependency in there.
In my view the caller only needs to know what e820 is and send down the raw info... it's a 1:1 API translation; the caller's massaging of the data can all go in the callee and benefit all callers.
Well, the translation between e820 and memblock could have been 1:1, but it is not. I think e820, or more broadly x86 memory initialization, could benefit from a more straightforward translation to memblock. For instance, I don't see why pfn 0 should be marked as reserved in both e820 and memblock, and why some of the e820 reserved areas never make it to memblock.reserved.
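For the sake of illustration, a raw 1:1 pass could look roughly like this (a sketch only, not what e820__memblock_setup() does today, which filters and special-cases several entry types):

	for (i = 0; i < e820_table->nr_entries; i++) {
		struct e820_entry *entry = &e820_table->entries[i];

		if (entry->type == E820_TYPE_RAM)
			memblock_add(entry->addr, entry->size);
		else
			memblock_reserve(entry->addr, entry->size);
	}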
The benefit is that once the callee does all the validation, any arch that makes a mistake will be told, boot will not fail, and it'll be more likely to boot stably even in the presence of inconsistent hardware tables.
That's true. Having the generic code more robust will benefit everybody.
I guess before worrying about pfn 1 not being added to memblock.reserved, I'd worry about pfn 0 itself not being added to memblock.reserved.
My point was that with a slightly different e820 table we may again hit a corner case that we didn't anticipate.
pfn 0 is in memblock.reserved by coincidence and not by design, and using memblock.reserved to detect zone/node boundaries is not reliable.
Moreover, if you look at the usage of for_each_mem_pfn_range() in page_alloc.c, it very much looks like it is presumed to iterate over *all* physical memory, including the mysterious pfn 0.
All the complex patch does is guarantee that all the reserved pages can get a "right" zoneid.
Then we certainly should consider rounding it down to a pageblock boundary in case pfn 0 isn't reserved.
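i.e. something along these lines (untested, just to make the idea concrete):

	/*
	 * Keep the zone start aligned to a pageblock even if pfn 0
	 * doesn't show up in memblock.reserved.
	 */
	zone_start_pfn = round_down(zone_start_pfn, pageblock_nr_pages);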
In fact, if you add all memblock.reserved ranges to memblock.memory, you'll obtain exactly the same result in the zone_start_pfn calculation as with the complex patch, and the exact same issue of PagePoison being left on pfn 0 will happen if pfn 0 isn't added to memblock.memory with { memblock_add; memblock_reserve }.
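i.e. the arch would need both calls for page 0 (a sketch of what that pair expands to):

	/* make pfn 0 part of a zone and keep it reserved */
	memblock_add(0, PAGE_SIZE);
	memblock_reserve(0, PAGE_SIZE);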
Still, with the complex patch applied, if something goes wrong DEBUG_VM=y will make it reproducible; with the simple fix we return to non-reproducible land.
I still think that the complex patch is going in the wrong direction. We cannot rely on memblock.reserved to calculate zone and node spans.
This:
	/*
	 * reserved regions must be included so that their page
	 * structure can be part of a zone and obtain a valid zoneid
	 * before __SetPageReserved().
	 */
	return min(PHYS_PFN(memblock_start_of_DRAM()),
		   PHYS_PFN(memblock.reserved.regions[0].base));
is definitely a hack that worked because "the caller" has this:
	/*
	 * Make sure page 0 is always reserved because on systems with
	 * L1TF its contents can be leaked to user processes.
	 */
	memblock_reserve(0, PAGE_SIZE);
So, what would have happened if L1TF had not yet been disclosed? ;-)
It's actually a fine thing to delete the above memblock_reserve(0, PAGE_SIZE) for all testing, so we can move on to the remaining issues.
There are a few more places in arch/x86/kernel/setup.c that memblock_reserve() one or several pages starting from address 0 :-)
The end result of the "hack" is precisely what you obtain if you add all of memblock.reserved to memblock.memory, except it doesn't require adding memblock.reserved to memblock.memory.
I still do not agree that there can be a minimum taken between memblock_start_of_DRAM() and anything else. I think it's semantically wrong.
And I really dislike the addition of memblock.reserved traversals, mostly because of the areas that actually overlap.
However, I do agree that adding the non-overlapping parts of memblock.reserved to memblock.memory will make the generic code more robust. I just don't want to do this implicitly by calling memblock_add() from memblock_reserve(). I think the cleaner way is to join them just before free_area_init() starts zone/node size detection.
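Roughly something like the sketch below, with a hypothetical helper called right before free_area_init() (the name and placement are made up; memblock_add() already copes with ranges that are partially or fully covered):

static void __init memblock_add_reserved_to_memory(void)
{
	unsigned long i;

	/*
	 * Make every reserved range also known as "memory" so its
	 * struct pages end up with a proper zone/node; already covered
	 * parts are simply merged by memblock_add() itself.
	 */
	for (i = 0; i < memblock.reserved.cnt; i++) {
		struct memblock_region *r = &memblock.reserved.regions[i];

		memblock_add(r->base, r->size);
	}
}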
Thanks, Andrea