On Sun 14-02-21 20:00:16, Mike Rapoport wrote:
On Fri, Feb 12, 2021 at 02:18:20PM +0100, Michal Hocko wrote:
On Fri 12-02-21 11:42:15, David Hildenbrand wrote:
On 12.02.21 11:33, Michal Hocko wrote:
[...]
I have to digest this but my first impression is that this is more heavy weight than it needs to. Pfn walkers should normally obey node range at least. The first pfn is usually excluded but I haven't seen real
We've seen examples where this is not sufficient. Simple example:
Have your physical memory end within a memory section. Easy via QEMU, just do a "-m 4000M". The remaining part of the last section has fake/wrong node/zone info.
Does this really matter though. If those pages are reserved then nobody will touch them regardless of their node/zone ids.
Hotplug memory. The node/zone gets resized such that PFN walkers might stumble over it.
The basic idea is to make sure that any initialized/"online" pfn belongs to exactly one node/zone and that the node/zone spans that PFN.
Yeah, this sounds like a good idea but what is the poper node for hole between two ranges associated with a different nodes/zones? This will always be a random number. We should have a clear way to tell "do not touch those pages" and PageReserved sounds like a good way to tell that.
Nobody should touch reserved pages, but I don't think we can ensure that.
Touching a reserved page which doesn't belong to you is a bug. Sure we cannot enforce that rule by runtime checks. But incorrect/misleading zone/node association is the least of the problem when somebody already does that.
We can correctly set the zone links for the reserved pages for holes in the middle of a zone based on the architecture constraints and with only the holes in the beginning/end of the memory will be not spanned by any node/zone which in practice does not seem to be a problem as the VM_BUG_ON in set_pfnblock_flags_mask() never triggered on pfn 0.
I really fail to see what you mean by correct zone/node for a memory range which is not associated with any real node.
I believe that any improvement in memory map consistency is a step forward.
I do agree but we are talking about a subtle bug (VM_BUG_ON) which would be better of with a simplistic fix first. You can work on consistency improvements on top of that.
problems with that. The VM_BUG_ON blowing up is really bad but as said above we can simply make it less offensive in presence of reserved pages as those shouldn't reach that path AFAICS normally.
Andrea tried tried working around if via PG_reserved pages and it resulted in quite some ugly code. Andrea also noted that we cannot rely on any random page walker to do the right think when it comes to messed up node/zone info.
I am sorry, I haven't followed previous discussions. Has the removal of the VM_BUG_ON been considered as an immediate workaround?
It was never discussed, but I'm not sure it's a good idea.
Judging by the commit message that introduced the VM_BUG_ON (commit 86051ca5eaf5 ("mm: fix usemap initialization")) there was yet another inconsistency in the memory map that required a special care.
Can we actually explore that path before adding yet additional complexity and potentially a very involved fix for a subtle problem?
Mel who is author of this code might help us out. I have to say I do not see the point for the VM_BUG_ON other than a better debuggability. If there is a real incosistency problem as a result then we should be handling that situation for non debugging kernels as well.