Memory hotunplug is done under the hotplug lock and ptdump walk is done under the init_mm.mmap_lock. Therefore, ptdump and hotunplug can run simultaneously without any synchronization. During hotunplug, free_empty_tables() is ultimately called to free up the pagetables. The following race can happen, where x denotes the level of the pagetable:
CPU1                                     CPU2
free_empty_pxd_table                     ptdump_walk_pgd()
                                         Get p(x+1)d table from pxd entry
pxd_clear
free_hotplug_pgtable_page(p(x+1)dp)
                                         Still using the p(x+1)d table
which leads to a use-after-free.
To solve this, we need to synchronize ptdump_walk_pgd() with free_hotplug_pgtable_page() in such a way that ptdump never takes a reference on a freed pagetable.
Since this race is very unlikely to happen in practice, we do not want to penalize other code paths taking the init_mm mmap_lock. Therefore, we use static keys. ptdump will enable the static key - upon observing that, the free_empty_pxd_table() functions will have an mmap_read_lock/unlock sequence patched in. A code comment explains in detail how a combination of the acquire semantics of static_branch_enable() and the barriers in __flush_tlb_kernel_pgtable() ensures that ptdump will never get hold of the address of a freed pagetable - either ptdump will block the table-freeing path by write-locking the mmap_lock, or the nullity of the pxd entry will be observed by ptdump, leaving it no access to the isolated p(x+1)d pagetable.
This bug was found by code inspection, as a result of working on [1].

[1] https://lore.kernel.org/all/20250723161827.15802-1-dev.jain@arm.com/
Cc: stable@vger.kernel.org
Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove")
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
Rebased on Linux 6.16.
 arch/arm64/include/asm/ptdump.h |  2 ++
 arch/arm64/mm/mmu.c             | 61 +++++++++++++++++++++++++++++++++
 arch/arm64/mm/ptdump.c          | 11 ++++--
 3 files changed, 72 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
index fded5358641f..4760168cbd6e 100644
--- a/arch/arm64/include/asm/ptdump.h
+++ b/arch/arm64/include/asm/ptdump.h
@@ -7,6 +7,8 @@
 
 #include <linux/ptdump.h>
 
+DECLARE_STATIC_KEY_FALSE(arm64_ptdump_key);
+
 #ifdef CONFIG_PTDUMP
 
 #include <linux/mm_types.h>
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 00ab1d648db6..d2feef270880 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -46,6 +46,8 @@
 #define NO_CONT_MAPPINGS	BIT(1)
 #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
 
+DEFINE_STATIC_KEY_FALSE(arm64_ptdump_key);
+
 enum pgtable_type {
 	TABLE_PTE,
 	TABLE_PMD,
@@ -1002,6 +1004,61 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end,
 	} while (addr = next, addr < end);
 }
+/*
+ * Our objective is to prevent ptdump from reading a pagetable which has
+ * been freed. Assume that ptdump_walk_pgd() (call this thread T1)
+ * executes completely on CPU1 and free_hotplug_pgtable_page() (call this
+ * thread T2) executes completely on CPU2. Let the region sandwiched by the
+ * mmap_write_lock/unlock in T1 be called CS (the critical section).
+ *
+ * Claim: The CS of T1 will never operate on a freed pagetable.
+ *
+ * Proof:
+ *
+ * Case 1: The static branch is visible to T2.
+ *
+ * Case 1 (a): T1 acquires the lock before T2 can.
+ * T2 will block until T1 drops the lock, so free_hotplug_pgtable_page() will
+ * only be executed after T1 exits CS.
+ *
+ * Case 1 (b): T2 acquires the lock before T1 can.
+ * The acquire semantics of mmap_read_lock() ensure that an empty pagetable
+ * entry (via pxd_clear()) is visible to T1 before T1 can enter CS, therefore
+ * it is impossible for the CS to get hold of the isolated level + 1 pagetable.
+ *
+ * Case 2: The static branch is not visible to T2.
+ *
+ * Since static_branch_enable() and mmap_write_lock() (via smp_mb()) have
+ * acquire semantics, it implies that the static branch will be visible to
+ * all CPUs before T1 can enter CS. The static branch not being visible to
+ * T2 therefore implies that T1 has not yet entered CS .... (i)
+ *
+ * The sequence of barriers via __flush_tlb_kernel_pgtable() in T2
+ * implies that if the invisibility of the static branch has been
+ * observed by T2 (i.e. static_branch_unlikely() is observed as false),
+ * then all CPUs will have observed an empty pagetable entry via
+ * pxd_clear() ... (ii)
+ *
+ * Combining (i) and (ii), we conclude that T1 observes an empty pagetable
+ * entry before entering CS => it is impossible for the CS to get hold of
+ * the isolated level + 1 pagetable. Q.E.D.
+ *
+ * We have proven that the claim is true on the assumption that
+ * there is no context switch for T1 and T2. Note that the reasoning
+ * of the proof uses barriers operating on the inner shareable domain,
+ * which means that they will affect all CPUs, and also a context switch
+ * will insert extra barriers into the code paths => the claim will
+ * stand true even if we drop the assumption.
+ */
+static void synchronize_with_ptdump(void)
+{
+	if (!static_branch_unlikely(&arm64_ptdump_key))
+		return;
+
+	mmap_read_lock(&init_mm);
+	mmap_read_unlock(&init_mm);
+}
+
 static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
 				 unsigned long end, unsigned long floor,
 				 unsigned long ceiling)
@@ -1036,6 +1093,7 @@ static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
 	pmd_clear(pmdp);
 	__flush_tlb_kernel_pgtable(start);
+	synchronize_with_ptdump();
 	free_hotplug_pgtable_page(virt_to_page(ptep));
 }
@@ -1076,6 +1134,7 @@ static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
 
 	pud_clear(pudp);
 	__flush_tlb_kernel_pgtable(start);
+	synchronize_with_ptdump();
 	free_hotplug_pgtable_page(virt_to_page(pmdp));
 }
@@ -1116,6 +1175,7 @@ static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
 
 	p4d_clear(p4dp);
 	__flush_tlb_kernel_pgtable(start);
+	synchronize_with_ptdump();
 	free_hotplug_pgtable_page(virt_to_page(pudp));
 }
@@ -1156,6 +1216,7 @@ static void free_empty_p4d_table(pgd_t *pgdp, unsigned long addr,
 
 	pgd_clear(pgdp);
 	__flush_tlb_kernel_pgtable(start);
+	synchronize_with_ptdump();
 	free_hotplug_pgtable_page(virt_to_page(p4dp));
 }
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 421a5de806c6..d543c9f8ffa8 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -283,6 +283,13 @@ void note_page_flush(struct ptdump_state *pt_st)
 	note_page(pt_st, 0, -1, pte_val(pte_zero));
 }
+static void arm64_ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
+{
+	static_branch_enable(&arm64_ptdump_key);
+	ptdump_walk_pgd(st, mm, NULL);
+	static_branch_disable(&arm64_ptdump_key);
+}
+
 void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
 {
 	unsigned long end = ~0UL;
@@ -311,7 +318,7 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
 		}
 	};
-	ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
+	arm64_ptdump_walk_pgd(&st.ptdump, info->mm);
 }
 static void __init ptdump_initialize(void)
@@ -353,7 +360,7 @@ bool ptdump_check_wx(void)
 		}
 	};
-	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
+	arm64_ptdump_walk_pgd(&st.ptdump, &init_mm);
 	if (st.wx_pages || st.uxn_pages) {
 		pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",
On 28/07/25 4:01 pm, Dev Jain wrote:
> Memory hotunplug is done under the hotplug lock and ptdump walk is done under the init_mm.mmap_lock. Therefore, ptdump and hotunplug can run simultaneously without any synchronization. During hotunplug, free_empty_tables() is ultimately called to free up the pagetables. The following race can happen, where x denotes the level of the pagetable:
>
> CPU1                                     CPU2
> free_empty_pxd_table                     ptdump_walk_pgd()
>                                          Get p(x+1)d table from pxd entry
> pxd_clear
> free_hotplug_pgtable_page(p(x+1)dp)
>                                          Still using the p(x+1)d table
>
> which leads to a use-after-free.
>
> To solve this, we need to synchronize ptdump_walk_pgd() with free_hotplug_pgtable_page() in such a way that ptdump never takes a reference on a freed pagetable.
>
> Since this race is very unlikely to happen in practice, we do not want to penalize other code paths taking the init_mm mmap_lock. Therefore, we use static keys. ptdump will enable the static key - upon observing that, the free_empty_pxd_table() functions will have an mmap_read_lock/unlock sequence patched in. A code comment explains in detail how a combination of the acquire semantics of static_branch_enable() and the barriers in __flush_tlb_kernel_pgtable() ensures that ptdump will never get hold of the address of a freed pagetable - either ptdump will block the table-freeing path by write-locking the mmap_lock, or the nullity of the pxd entry will be observed by ptdump, leaving it no access to the isolated p(x+1)d pagetable.
>
> This bug was found by code inspection, as a result of working on [1].
>
> Cc: stable@vger.kernel.org
> Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove")
> Signed-off-by: Dev Jain <dev.jain@arm.com>
Immediately after posting, I guess the first objection which is going to come is: why not just nest free_empty_tables() within mmap_read_lock/unlock, something like the sketch below? Memory offlining obviously should not be a hot path, so taking the read lock should be fine I guess.
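Untested sketch of that alternative; the call-site name (__remove_pgd_mapping) and its exact arguments are from memory and may not match the current tree:

static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
{
	unsigned long end = start + size;

	unmap_hotplug_range(start, end, false, NULL);

	/*
	 * ptdump_walk_pgd() holds init_mm.mmap_lock for write, so holding
	 * it for read across the table freeing keeps ptdump and the
	 * freeing path mutually excluded, without a static key.
	 */
	mmap_read_lock(&init_mm);
	free_empty_tables(start, end, PAGE_OFFSET, PAGE_END);
	mmap_read_unlock(&init_mm);
}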
On 28/07/2025 11:31, Dev Jain wrote:
> Memory hotunplug is done under the hotplug lock and ptdump walk is done under the init_mm.mmap_lock. Therefore, ptdump and hotunplug can run simultaneously without any synchronization. During hotunplug, free_empty_tables() is ultimately called to free up the pagetables. The following race can happen, where x denotes the level of the pagetable:
>
> CPU1                                     CPU2
> free_empty_pxd_table                     ptdump_walk_pgd()
>                                          Get p(x+1)d table from pxd entry
> pxd_clear
> free_hotplug_pgtable_page(p(x+1)dp)
>                                          Still using the p(x+1)d table
>
> which leads to a use-after-free.
I'm not sure I understand this. ptdump_show() protects against this with get_online_mems()/put_online_mems(), doesn't it? There are 2 paths that call ptdump_walk_pgd(). This protects one of them. The other is ptdump_check_wx(); I thought you (or Anshuman?) had a patch in flight to fix that with [get|put]_online_mems() too?
Sorry if my memory is failing me here...
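For reference, the existing protection I have in mind looks roughly like this; this is from memory of arch/arm64/mm/ptdump_debugfs.c, so the details may be slightly off:

static int ptdump_show(struct seq_file *m, void *v)
{
	struct ptdump_info *info = m->private;

	get_online_mems();	/* blocks memory hot-remove while we walk */
	ptdump_walk(m, info);
	put_online_mems();

	return 0;
}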
On 28/07/25 4:43 pm, Ryan Roberts wrote:
> On 28/07/2025 11:31, Dev Jain wrote:
>> Memory hotunplug is done under the hotplug lock and ptdump walk is done under the init_mm.mmap_lock. Therefore, ptdump and hotunplug can run simultaneously without any synchronization. During hotunplug, free_empty_tables() is ultimately called to free up the pagetables. The following race can happen, where x denotes the level of the pagetable:
>>
>> CPU1                                     CPU2
>> free_empty_pxd_table                     ptdump_walk_pgd()
>>                                          Get p(x+1)d table from pxd entry
>> pxd_clear
>> free_hotplug_pgtable_page(p(x+1)dp)
>>                                          Still using the p(x+1)d table
>>
>> which leads to a use-after-free.
>
> I'm not sure I understand this. ptdump_show() protects against this with get_online_mems()/put_online_mems(), doesn't it? There are 2 paths that call ptdump_walk_pgd(). This protects one of them. The other is ptdump_check_wx(); I thought you (or Anshuman?) had a patch in flight to fix that with [get|put]_online_mems() too?
>
> Sorry if my memory is failing me here...
Nope, I think I just had a use-after-free in my memory so I came up with this patch :) Because of the recent work with ptdump, I was so concentrated on ptdump_walk_pgd() that I didn't even bother looking up the call chain. And I even forgot we had these [get|put]_online_mems() patches recently.
Sorry for the noise, it must have been incredibly confusing to see this patch :(
On 28/07/2025 12:31, Dev Jain wrote:
> On 28/07/25 4:43 pm, Ryan Roberts wrote:
>> On 28/07/2025 11:31, Dev Jain wrote:
>>> Memory hotunplug is done under the hotplug lock and ptdump walk is done under the init_mm.mmap_lock. Therefore, ptdump and hotunplug can run simultaneously without any synchronization. During hotunplug, free_empty_tables() is ultimately called to free up the pagetables. The following race can happen, where x denotes the level of the pagetable:
>>>
>>> CPU1                                     CPU2
>>> free_empty_pxd_table                     ptdump_walk_pgd()
>>>                                          Get p(x+1)d table from pxd entry
>>> pxd_clear
>>> free_hotplug_pgtable_page(p(x+1)dp)
>>>                                          Still using the p(x+1)d table
>>>
>>> which leads to a use-after-free.
>>
>> I'm not sure I understand this. ptdump_show() protects against this with get_online_mems()/put_online_mems(), doesn't it? There are 2 paths that call ptdump_walk_pgd(). This protects one of them. The other is ptdump_check_wx(); I thought you (or Anshuman?) had a patch in flight to fix that with [get|put]_online_mems() too?
>>
>> Sorry if my memory is failing me here...
>
> Nope, I think I just had a use-after-free in my memory so I came up with this patch :) Because of the recent work with ptdump, I was so concentrated on ptdump_walk_pgd() that I didn't even bother looking up the call chain. And I even forgot we had these [get|put]_online_mems() patches recently.
I just checked; Anshuman's fix is in mm-stable, so I guess it'll be in v6.17-rc1.
That's the patch: https://lore.kernel.org/linux-arm-kernel/20250620052427.2092093-1-anshuman.k...
> Sorry for the noise, it must have been incredibly confusing to see this patch :(