On 28/07/25 4:01 pm, Dev Jain wrote:
Memory hotunplug is done under the hotplug lock, while the ptdump walk is done under init_mm.mmap_lock. There is thus no synchronization between ptdump and hotunplug, and they can run concurrently. During hotunplug, free_empty_tables() is ultimately called to free up the pagetables. The following race can happen, where x denotes the level of the pagetable:
CPU1                                    CPU2

free_empty_pxd_table                    ptdump_walk_pgd()
                                        Get p(x+1)d table from pxd entry
pxd_clear
free_hotplug_pgtable_page(p(x+1)dp)
                                        Still using the p(x+1)d table
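For reference, here is a simplified sketch of the freeing side at the pmd level, loosely following the structure of free_empty_pmd_table() in arch/arm64/mm/mmu.c (the emptiness scan, range checks and upper-level recursion are elided):

static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
				 unsigned long end, unsigned long floor,
				 unsigned long ceiling)
{
	pmd_t *pmdp = pmd_offset(pudp, 0UL);

	/* ... confirm that every entry in pmdp[] is pmd_none() ... */

	pud_clear(pudp);			/* detach the pmd table */
	__flush_tlb_kernel_pgtable(addr);
	free_hotplug_pgtable_page(virt_to_page(pmdp));
	/*
	 * A concurrent ptdump walker that read the pud entry before
	 * pud_clear() is still dereferencing the page freed above.
	 */
}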
which leads to a use-after-free.
To solve this, we need to synchronize ptdump_walk_pgd() with free_hotplug_pgtable_page() in such a way that ptdump never takes a reference on a freed pagetable.
Since this race is very unlikely to happen in practice, we do not want to penalize other code paths taking the init_mm mmap_lock. Therefore, we use static keys: ptdump enables the static key, and upon observing that, the free_empty_pxd_table() functions get an mmap_read_lock/unlock sequence patched in. A code comment explains in detail how a combination of the acquire semantics of static_branch_enable() and the barriers in __flush_tlb_kernel_pgtable() ensures that ptdump can never get hold of the address of a freed pagetable: either ptdump blocks the table freeing path by write-locking the mmap_lock, or it observes the nullity of the pxd entry and therefore never accesses the isolated p(x+1)d pagetable.
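To make the scheme concrete, a minimal sketch of both sides follows; the key name ptdump_lock_key is hypothetical, and the real patch may place the lock/unlock pair and the key flips differently:

/* Hypothetical key name. */
DEFINE_STATIC_KEY_FALSE(ptdump_lock_key);

/* ptdump side: flip the key on, then walk under the write lock. */
	static_branch_enable(&ptdump_lock_key);
	/* st set up as in arm64's ptdump_walk();
	 * ptdump_walk_pgd() takes mmap_write_lock(&init_mm) internally */
	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
	static_branch_disable(&ptdump_lock_key);

/* Freeing side, shown at the pmd level; the other levels would gain
 * the same sequence. */
	pud_clear(pudp);
	__flush_tlb_kernel_pgtable(addr);

	if (static_branch_unlikely(&ptdump_lock_key)) {
		/*
		 * Either ptdump already holds init_mm.mmap_lock for
		 * write, and this empty critical section blocks until
		 * its walk is done, or ptdump has not started walking
		 * yet, and the barriers above guarantee it will see
		 * the cleared pud and never reach pmdp.
		 */
		mmap_read_lock(&init_mm);
		mmap_read_unlock(&init_mm);
	}
	free_hotplug_pgtable_page(virt_to_page(pmdp));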
This bug was found by code inspection, as a result of working on [1].
Cc: stable@vger.kernel.org
Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove")
Signed-off-by: Dev Jain <dev.jain@arm.com>
Immediately after posting: I guess the first objection is going to be, why not just wrap free_empty_tables() in an unconditional mmap_read_lock/unlock. Memory offlining is clearly not a hot path, so always taking the read lock should be fine.
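If it helps the discussion, that simpler alternative would look something like the below, with the read lock bracketing the call in __remove_pgd_mapping() (the exact placement is illustrative):

static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
{
	unsigned long end = start + size;

	WARN_ON(pgdir != init_mm.pgd);
	WARN_ON((start < PAGE_OFFSET) || (end > PAGE_END));

	unmap_hotplug_range(start, end, false, NULL);

	/* Excludes ptdump, which write-locks init_mm.mmap_lock. */
	mmap_read_lock(&init_mm);
	free_empty_tables(start, end, PAGE_OFFSET, PAGE_END);
	mmap_read_unlock(&init_mm);
}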