On 2/16/21 1:34 PM, Vlastimil Babka wrote:
On 2/16/21 12:01 PM, Mike Rapoport wrote:
I do understand that. And I am not objecting to the patch. I have to confess I haven't digested it yet. Any changes to early memory initialization have turned out to be subtle, and corner cases only pop up later. This is almost impossible to review just by reading the code. That's why I am asking whether we want to address the specific VM_BUG_ON first with something much less tricky and actually reviewable. And that's why I am asking whether dropping the bug_on itself is safe to do and usable as a hot fix, which should be easier to backport.
I can't say I'm familiar enough with the migration and compaction code to say if it's ok to remove that bug_on. It does point to an inconsistency in the memmap, but probably it's not important.
On closer look, removing the VM_BUG_ON_PAGE() in set_pfnblock_flags_mask() is not safe. If we violate the zone_spans_pfn condition, it means we will write outside of the pageblock bitmap for the zone, and corrupt something. Actually
Clarification: this is true only for !CONFIG_SPARSEMEM, which is unlikely in practice to produce the configurations that trigger this issue. So we can remove the VM_BUG_ON_PAGE().
a similar thing can happen in __get_pfnblock_flags_mask(), where there's no VM_BUG_ON, but there we can't corrupt memory. But we could theoretically fault due to accessing some unmapped range?
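To illustrate the !CONFIG_SPARSEMEM concern, here is a minimal userspace sketch of how the pageblock bit index relates to the size of the zone's bitmap. It is only a model based on my reading of the flat memory case: struct zone_model, usemap_bits() and bitidx_in_bounds() are hypothetical names, the pageblock order of 9 is an assumption, and the rounding of the bitmap up to whole unsigned longs is left out.

```c
#include <stdbool.h>

/* Assumed values for illustration; the real ones depend on the config. */
#define PAGEBLOCK_ORDER    9UL
#define PAGEBLOCK_NR_PAGES (1UL << PAGEBLOCK_ORDER)
#define NR_PAGEBLOCK_BITS  4UL

/* Hypothetical, stripped-down stand-in for struct zone. */
struct zone_model {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
};

/* Same condition that the VM_BUG_ON_PAGE() asserts. */
static bool zone_spans_pfn(const struct zone_model *z, unsigned long pfn)
{
	return pfn >= z->zone_start_pfn &&
	       pfn < z->zone_start_pfn + z->spanned_pages;
}

/* Number of bits in the zone's pageblock bitmap (no rounding to longs). */
static unsigned long usemap_bits(const struct zone_model *z)
{
	unsigned long offset = z->zone_start_pfn & (PAGEBLOCK_NR_PAGES - 1);
	unsigned long pages = z->spanned_pages + offset;
	unsigned long blocks =
		(pages + PAGEBLOCK_NR_PAGES - 1) >> PAGEBLOCK_ORDER;

	return blocks * NR_PAGEBLOCK_BITS;
}

/* Flat-memory style index: relative to the pageblock-aligned zone start. */
static unsigned long pfn_to_bitidx(const struct zone_model *z,
				   unsigned long pfn)
{
	unsigned long base = z->zone_start_pfn & ~(PAGEBLOCK_NR_PAGES - 1);

	/* Wraps around to a huge value if pfn is below the zone start. */
	return ((pfn - base) >> PAGEBLOCK_ORDER) * NR_PAGEBLOCK_BITS;
}

/* True if the computed index still lands inside the zone's bitmap. */
static bool bitidx_in_bounds(const struct zone_model *z, unsigned long pfn)
{
	return pfn_to_bitidx(z, pfn) < usemap_bits(z);
}
```

In this model, a zone starting at pfn 4096 and spanning 2048 pages has a 16-bit bitmap. Pfn 6144 fails zone_spans_pfn() and maps to bit index 16, one past the end, and a pfn below the zone start wraps around to an index far outside the bitmap.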
So the checks would have to become unconditional (i.e. also for !DEBUG_VM) and return instead of causing a BUG. Or we could go back one level and add some checks to fast_isolate_around() to detect a page from a zone that doesn't match cc->zone. The question is whether there is other code that will break if page_zone() suddenly changes, e.g. in the middle of the pageblock. __pageblock_pfn_to_page() assumes that if the first and last pages are from the same zone, so are all pages in between, and the rest relies on that. But maybe if Andrea's fast_isolate_around() issue is fixed, that's enough for a stable backport.
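To make the "first and last page" assumption concrete, here is a small userspace sketch. It is not kernel code: the zone_of[] array is a hypothetical stand-in for page_zone() on each pfn, and both helper names are made up. Callers are assumed to pass start < end.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Models the shortcut: only the first and the last page of [start, end)
 * are checked against the expected zone.
 */
static bool ends_in_zone(const int *zone_of, size_t start, size_t end,
			 int zone)
{
	return zone_of[start] == zone && zone_of[end - 1] == zone;
}

/* Full scan: every page in [start, end) must belong to the zone. */
static bool all_in_zone(const int *zone_of, size_t start, size_t end,
			int zone)
{
	for (size_t i = start; i < end; i++)
		if (zone_of[i] != zone)
			return false;
	return true;
}
```

With an inconsistent layout such as { 1, 1, 1, 0, 0, 1, 1, 1 } and expected zone 1, ends_in_zone() accepts the range while all_in_zone() rejects it. That is the kind of case where a page handed to fast_isolate_around() might not match cc->zone even though the pageblock passed the end-point check.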