On Tue, Dec 08, 2020 at 06:13:47PM -0500, Andrea Arcangeli wrote:
On Tue, Dec 08, 2020 at 11:46:14PM +0200, Mike Rapoport wrote:
Sorry, I was not clear. The penalty here is not the traversal of the memblock.reserved array. The penalty is the redundant initialization of struct page for the ranges memblock.reserved describes.
But that's the fix... How can you avoid removing the PagePoison for every pfn_valid pfn in memblock.reserved, when the crash at https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org is in __SetPageReserved of reserve_bootmem_region?
With your fix the pages in memblock.reserved that intersect with memblock.memory will be initialized twice: the first time in for_each_mem_pfn_range() and the second time in for_each_res_pfn_range().
For instance, if the user specified hugetlb_cma=nnG, the loop over memblock.reserved will cause a redundant initialization pass over the pages in that nnG.
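Roughly, a fragment illustrating the concern (not the actual patch; I'm assuming for_each_res_pfn_range() mirrors the signature of for_each_mem_pfn_range(), and i/start/end/nid/pfn/zone are the usual locals):

	/* first pass: everything in memblock.memory */
	for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid)
		for (pfn = start; pfn < end; pfn++)
			__init_single_page(pfn_to_page(pfn), pfn, zone, nid);

	/* second pass: memblock.reserved, touching the CMA pages again */
	for_each_res_pfn_range(i, MAX_NUMNODES, &start, &end, &nid)
		for (pfn = start; pfn < end; pfn++)
			__init_single_page(pfn_to_page(pfn), pfn, zone, nid);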
And if that didn't happen, then I'm left wondering whether free_low_memory_core_early would crash again with hugetlb_cma=nnG specified, not just at pfn 0 anymore, even with the simple fix applied.
Where does PagePoison get removed from the CMA range, or is __SetPageReserved somehow not called on the CMA reserved range with the simple fix applied?
The CMA range would be in both memblock.memory and memblock.reserved, so the pages for it will be initialized in for_each_mem_pfn_range().
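In other words (placeholder variables, just to illustrate how the same physical range ends up in both lists):

	/* arch code registers the populated DRAM bank ... */
	memblock_add(bank_base, bank_size);
	/* ... and the hugetlb_cma carve-out inside that bank gets reserved */
	memblock_reserve(cma_base, cma_size);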
Regardless of the particular implementation, the generic code cannot detect the physical memory layout; it is up to the architecture to detect which memory banks are populated and provide this information to the generic code.
So we'll always have what you call "dependencies on the caller" here. The good thing is that we can fix "the caller" when we find it's wrong.
The caller that massages the e820 BIOS table is still software. Software can be moved from arch code to common code. There's no asm or arch dependency in there.
In my view the caller only needs to know what e820 is and send down the raw info... it's a 1:1 API translation; the caller's massaging of the data can all go in the callee and benefit all callers.
Well, the translation between e820 and memblock could have been 1:1, but it is not. I think e820, or more broadly x86 memory initialization, could benefit from a more straightforward translation to memblock. For instance, I don't see why pfn 0 should be marked as reserved in both e820 and memblock, and why some of the e820 reserved areas never make it to memblock.reserved.
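For the sake of illustration, a raw 1:1 pass could look roughly like this (a sketch only, not what e820__memblock_setup() does today, which filters and special-cases several entry types):

	for (i = 0; i < e820_table->nr_entries; i++) {
		struct e820_entry *entry = &e820_table->entries[i];

		if (entry->type == E820_TYPE_RAM)
			memblock_add(entry->addr, entry->size);
		else
			memblock_reserve(entry->addr, entry->size);
	}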
The benefit is that once the callee does all the validation, any arch that makes a mistake will be told, boot will not fail, and it'll be more likely to boot stably even in the presence of inconsistent hardware tables.
That's true. Having the generic code more robust will benefit everybody.
I guess before worrying about pfn 1 not being added to memblock.reserved, I'd worry about pfn 0 itself not being added to memblock.reserved.
My point was that with a slightly different e820 table we may again hit a corner case that we didn't anticipate.
pfn 0 is in memblock.reserved by coincidence and not by design, and using memblock.reserved to detect zone/node boundaries is not reliable.
Moreover, if you look at the usage of for_each_mem_pfn_range() in page_alloc.c, it very much looks like it is presumed to iterate over *all* physical memory, including the mysterious pfn 0.
All the complex patch does is guarantee that all the reserved pages can get a "right" zoneid.
Then we certainly should consider rounding it down to a pageblock boundary in case pfn 0 isn't reserved.
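i.e. something along these lines (untested, just to make the idea concrete):

	/*
	 * Keep the zone start aligned to a pageblock even if pfn 0
	 * doesn't show up in memblock.reserved.
	 */
	zone_start_pfn = round_down(zone_start_pfn, pageblock_nr_pages);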
In fact, if you add all memblock.reserved ranges to memblock.memory, you'll obtain exactly the same result in the zone_start_pfn calculation as with the complex patch, and the exact same issue of PagePoison being left on pfn 0 will happen if pfn 0 isn't added to memblock.memory with { memblock_add; memblock_reserve }.
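i.e. the arch would need both calls for page 0 (a sketch of what that pair expands to):

	/* make pfn 0 part of a zone and keep it reserved */
	memblock_add(0, PAGE_SIZE);
	memblock_reserve(0, PAGE_SIZE);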
Still, with the complex patch applied, if something goes wrong DEBUG_VM=y will make it reproducible; with the simple fix we return to non-reproducible land.
I still think that the complex patch is going in the wrong direction. We cannot rely on memblock.reserved to calculate zone and node spans.
This:
	/*
	 * reserved regions must be included so that their page
	 * structure can be part of a zone and obtain a valid zoneid
	 * before __SetPageReserved().
	 */
	return min(PHYS_PFN(memblock_start_of_DRAM()),
		   PHYS_PFN(memblock.reserved.regions[0].base));
is definitely a hack that worked because "the caller" has this:
	/*
	 * Make sure page 0 is always reserved because on systems with
	 * L1TF its contents can be leaked to user processes.
	 */
	memblock_reserve(0, PAGE_SIZE);
So, what would have happened if L1TF had not yet been disclosed? ;-)
It's actually a fine thing to delete the above memblock_reserve(0, PAGE_SIZE) for all testing, so we can move on to the remaining issues.
There are a few more places in arch/x86/kernel/setup.c that memblock_reserve() one or several pages starting from address 0 :-)
The end result of the "hack" is precisely what you obtain if you add all of memblock.reserved to memblock.memory, except it doesn't require adding memblock.reserved to memblock.memory.
I still do not agree that there can be a minimum taken between memblock_start_of_DRAM() and anything else. I think it's semantically wrong.
And I really dislike the addition of memblock.reserved traversals, mostly because of the areas that actually overlap.
However, I do agree that adding the non-overlapping parts of memblock.reserved to memblock.memory will make the generic code more robust. I just don't want to do this implicitly by calling memblock_add() from memblock_reserve(). I think the cleaner way is to join them just before free_area_init() starts zone/node size detection.
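Roughly something like the sketch below, with a hypothetical helper called right before free_area_init() (the name and placement are made up; memblock_add() already copes with ranges that are partially or fully covered):

static void __init memblock_add_reserved_to_memory(void)
{
	unsigned long i;

	/*
	 * Make every reserved range also known as "memory" so its
	 * struct pages end up with a proper zone/node; already covered
	 * parts are simply merged by memblock_add() itself.
	 */
	for (i = 0; i < memblock.reserved.cnt; i++) {
		struct memblock_region *r = &memblock.reserved.regions[i];

		memblock_add(r->base, r->size);
	}
}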
Thanks, Andrea