On 10.09.20 14:36, Laurent Dufour wrote:
Le 10/09/2020 à 14:00, David Hildenbrand a écrit :
On 10.09.20 13:35, Laurent Dufour wrote:
Le 10/09/2020 à 13:12, Michal Hocko a écrit :
On Thu 10-09-20 09:51:39, Laurent Dufour wrote:
Le 10/09/2020 à 09:23, Michal Hocko a écrit :
On Wed 09-09-20 18:07:15, Laurent Dufour wrote: > Le 09/09/2020 à 12:59, Michal Hocko a écrit : >> On Wed 09-09-20 11:21:58, Laurent Dufour wrote: [...] >>> For the point a, using the enum allows to know in >>> register_mem_sect_under_node() if the link operation is due to a hotplug >>> operation or done at boot time. >> >> Yes, but let me repeat. We have a mess here and different paths check >> for the very same condition by different ways. We need to unify those. > > What are you suggesting to unify these checks (using a MP_* enum as > suggested by David, something else)?
We do have system_state check spread at different places. I would use this one and wrap it behind a helper. Or have I missed any reason why that wouldn't work for this case?
That would not work in that case because memory can be hot-added at the SYSTEM_SCHEDULING system state and the regular memory is also registered at that system state too. So system state is not enough to discriminate between the both.
If that is really the case all other places need a fix as well. Btw. could you be more specific about memory hotplug during early boot? How that happens? I am only aware of https://lkml.kernel.org/r/20200818110046.6664-1-osalvador@suse.de and that doesn't happen as early as SYSTEM_SCHEDULING.
That points has been raised by David, quoting him here:
IIRC, ACPI can hotadd memory while SCHEDULING, this patch would break that.
Ccing Oscar, I think he mentioned recently that this is the case with ACPI.
Oscar told that he need to investigate further on that.
On my side I can't get these ACPI "early" hot-plug operations to happen so I can't check that.
If this is clear that ACPI memory hotplug doesn't happen at SYSTEM_SCHEDULING, the patch I proposed at first is enough to fix the issue.
Booting a qemu guest with 4 coldplugged DIMMs gives me:
:/root# dmesg | grep link_mem [ 0.302247] link_mem_sections() during 1 [ 0.445086] link_mem_sections() during 1 [ 0.445766] link_mem_sections() during 1 [ 0.446749] link_mem_sections() during 1 [ 0.447746] link_mem_sections() during 1
So AFAICs everything happens during SYSTEM_SCHEDULING - boot memory and ACPI (cold)plug.
To make forward progress with this, relying on the system_state is obviously not sufficient.
- We have to fix this instance and the instance directly in
get_nid_for_pfn() by passing in the context (I once had a patch to clean that up, to not have two state checks, but it got lost somewhere).
- The "system_state < SYSTEM_RUNNING" check in
register_memory_resource() is correct. Actual memory hotplug after boot is not impacted. (I remember we discussed this exact behavior back then)
- build_all_zonelists() should work as expected, called from
start_kernel() before sched_init().
I'm bit confused now. Since hotplug operation is happening at SYSTEM_SCHEDULING like the regular memory registration, would it be enough to add a parameter to register_mem_sect_under_node() (reworking the memmap_context enum)? That way the check is not based on the system state but on the calling path.
That would have been my suggestion to definitely fix it - maybe Michal/Oscar have a better suggestion know that we know what's going on.