Hello.
We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix, which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory and 196 cores, where we mmap 1.2TB of shared memory and fork dozens or hundreds of times, we see execution times increase by a factor of 4. The reproducer is at the end of the email.
Comparing a kernel without this patch to a kernel with this patch applied, when spawning 1000 children we see these execution times:
Patched kernel:

$ time make stress
...
real	0m11.275s
user	0m0.177s
sys	0m23.905s

Original kernel:

$ time make stress
...
real	0m2.475s
user	0m1.398s
sys	0m2.501s
The patch in question: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
My observation/assumption is:
- each child touches 100 random pages and despawns
- on each despawn, `huge_pmd_unshare()` is called
- each call to `huge_pmd_unshare()` synchronizes all threads using `tlb_remove_table_sync_one()`, leading to the regression
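For reference, the call path we assume to be involved on each child exit is roughly this (our reading of the code, simplified):

  exit_mmap()
    unmap_vmas()
      __unmap_hugepage_range()
        huge_pmd_unshare()
          tlb_remove_table_sync_one()   /* synchronous IPI to all cores */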
I'm happy to provide more information.
Thank you,

Stanislav Uschakow
=== Reproducer ===
Setup:
#!/bin/bash
echo "Setting up hugepages for reproduction..."

# hugepages (1.2TB / 2MB = 614400 pages)
REQUIRED_PAGES=614400

# Check current hugepage allocation
CURRENT_PAGES=$(cat /proc/sys/vm/nr_hugepages)
echo "Current hugepages: $CURRENT_PAGES"

if [ "$CURRENT_PAGES" -lt "$REQUIRED_PAGES" ]; then
    echo "Allocating $REQUIRED_PAGES hugepages..."
    echo $REQUIRED_PAGES | sudo tee /proc/sys/vm/nr_hugepages

    ALLOCATED=$(cat /proc/sys/vm/nr_hugepages)
    echo "Allocated hugepages: $ALLOCATED"
    if [ "$ALLOCATED" -lt "$REQUIRED_PAGES" ]; then
        echo "Warning: Could not allocate all required hugepages"
        echo "Available: $ALLOCATED, Required: $REQUIRED_PAGES"
    fi
fi

echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

echo -e "\nHugepage information:"
cat /proc/meminfo | grep -i huge

echo -e "\nSetup complete. You can now run the reproduction test."
Makefile:
CXX = gcc
CXXFLAGS = -O2 -Wall
TARGET = hugepage_repro
SOURCE = hugepage_repro.c

$(TARGET): $(SOURCE)
	$(CXX) $(CXXFLAGS) -o $(TARGET) $(SOURCE)

clean:
	rm -f $(TARGET)

setup:
	chmod +x setup_hugepages.sh
	./setup_hugepages.sh

test: $(TARGET)
	./$(TARGET) 20 3

stress: $(TARGET)
	./$(TARGET) 1000 1

.PHONY: clean setup test stress
hugepage_repro.c:
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <stdio.h>

#define HUGEPAGE_SIZE (2 * 1024 * 1024)              // 2MB
#define TOTAL_SIZE    (1200ULL * 1024 * 1024 * 1024) // 1.2TB
#define NUM_HUGEPAGES (TOTAL_SIZE / HUGEPAGE_SIZE)

void* create_hugepage_mapping() {
    void* addr = mmap(NULL, TOTAL_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap hugepages failed");
        exit(1);
    }
    return addr;
}

void touch_random_pages(void* addr, int num_touches) {
    char* base = (char*)addr;
    for (int i = 0; i < num_touches; ++i) {
        size_t offset = (rand() % NUM_HUGEPAGES) * HUGEPAGE_SIZE;
        volatile char val = base[offset];
        (void)val;
    }
}

void child_process(void* shared_mem, int child_id) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    touch_random_pages(shared_mem, 100);
    clock_gettime(CLOCK_MONOTONIC, &end);
    long duration = (end.tv_sec - start.tv_sec) * 1000000 +
                    (end.tv_nsec - start.tv_nsec) / 1000;
    printf("Child %d completed in %ld μs\n", child_id, duration);
}

int main(int argc, char* argv[]) {
    int num_processes = argc > 1 ? atoi(argv[1]) : 50;
    int iterations = argc > 2 ? atoi(argv[2]) : 5;

    printf("Creating %lluGB hugepage mapping...\n",
           TOTAL_SIZE / (1024 * 1024 * 1024));
    void* shared_mem = create_hugepage_mapping();

    for (int iter = 0; iter < iterations; ++iter) {
        printf("\nIteration %d: Forking %d processes\n", iter + 1, num_processes);
        pid_t children[num_processes];
        struct timespec iter_start, iter_end;
        clock_gettime(CLOCK_MONOTONIC, &iter_start);

        for (int i = 0; i < num_processes; ++i) {
            pid_t pid = fork();
            if (pid == 0) {
                child_process(shared_mem, i);
                exit(0);
            } else if (pid > 0) {
                children[i] = pid;
            }
        }

        for (int i = 0; i < num_processes; ++i)
            waitpid(children[i], NULL, 0);

        clock_gettime(CLOCK_MONOTONIC, &iter_end);
        long iter_duration = (iter_end.tv_sec - iter_start.tv_sec) * 1000 +
                             (iter_end.tv_nsec - iter_start.tv_nsec) / 1000000;
        printf("Iteration completed in %ld ms\n", iter_duration);
    }

    munmap(shared_mem, TOTAL_SIZE);
    return 0;
}
Hi!
On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav suschako@amazon.de wrote:
We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix, which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory and 196 cores, where we mmap 1.2TB of shared memory and fork dozens or hundreds of times, we see execution times increase by a factor of 4. The reproducer is at the end of the email.
Yeah, every 1G virtual address range you unshare on unmap will do an extra synchronous IPI broadcast to all CPU cores, so it's not very surprising that doing this would be a bit slow on a machine with 196 cores.
My observation/assumption is:
- each child touches 100 random pages and despawns
- on each despawn, `huge_pmd_unshare()` is called
- each call to `huge_pmd_unshare()` synchronizes all threads using `tlb_remove_table_sync_one()`, leading to the regression
Yeah, makes sense that that'd be slow.
There are probably several ways this could be optimized - like maybe changing tlb_remove_table_sync_one() to rely on the MM's cpumask (though that would require thinking about whether this interacts with remote MM access somehow), or batching the refcount drops for hugetlb shared page tables through something like struct mmu_gather, or doing something special for the unmap path, or changing the semantics of hugetlb page tables such that they can never turn into normal page tables again. However, I'm not planning to work on optimizing this.
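For illustration only, the first option might look vaguely like this untested sketch (tlb_remove_table_sync_one_mm() is a made-up name; tlb_remove_table_smp_sync() is the existing empty IPI handler in mm/mmu_gather.c):

	void tlb_remove_table_sync_one_mm(struct mm_struct *mm)
	{
		/*
		 * Only IPI the CPUs that have run this MM instead of
		 * broadcasting to every core; wait=true keeps it synchronous
		 * like smp_call_function(..., 1). Whether mm_cpumask() is a
		 * safe bound here (versus remote MM access) is exactly the
		 * open question mentioned above.
		 */
		on_each_cpu_mask(mm_cpumask(mm), tlb_remove_table_smp_sync,
				 NULL, true);
	}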
On 01.09.25 12:58, Jann Horn wrote:
Hi!
On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav suschako@amazon.de wrote:
We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix, which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory and 196 cores, where we mmap 1.2TB of shared memory and fork dozens or hundreds of times, we see execution times increase by a factor of 4. The reproducer is at the end of the email.
Yeah, every 1G virtual address range you unshare on unmap will do an extra synchronous IPI broadcast to all CPU cores, so it's not very surprising that doing this would be a bit slow on a machine with 196 cores.
What is the use case for this extreme usage of fork() in that context? Is it just something people noticed and it's suboptimal, or is this a real problem for some use cases?
Hi David,
On Mon, Sep 1, 2025 at 1:26 PM David Hildenbrand david@redhat.com wrote:
On 01.09.25 12:58, Jann Horn wrote:
Hi!
On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav suschako@amazon.de wrote:
We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix, which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory and 196 cores, where we mmap 1.2TB of shared memory and fork dozens or hundreds of times, we see execution times increase by a factor of 4. The reproducer is at the end of the email.
Yeah, every 1G virtual address range you unshare on unmap will do an extra synchronous IPI broadcast to all CPU cores, so it's not very surprising that doing this would be a bit slow on a machine with 196 cores.
What is the use case for this extreme usage of fork() in that context? Is it just something people noticed and it's suboptimal, or is this a real problem for some use cases?
Yes, we have customers reporting huge performance regressions on their workloads. I don't know the software architecture or the actual use case for their application, though. An execution time increase of at least a factor of 4 is noticeable even with few fork() calls on those machines.
-- Cheers
David / dhildenb
Thanks
Stanislav
On Sep 1, 2025, at 4:26 AM, David Hildenbrand david@redhat.com wrote:
On 01.09.25 12:58, Jann Horn wrote:
Hi!

On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav suschako@amazon.de wrote:
We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix, which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory and 196 cores, where we mmap 1.2TB of shared memory and fork dozens or hundreds of times, we see execution times increase by a factor of 4. The reproducer is at the end of the email.
Yeah, every 1G virtual address range you unshare on unmap will do an extra synchronous IPI broadcast to all CPU cores, so it's not very surprising that doing this would be a bit slow on a machine with 196 cores.
What is the use case for this extreme usage of fork() in that context? Is it just something people noticed and it's suboptimal, or is this a real problem for some use cases?
Our DB team is reporting performance issues due to this change. While running TPCC, the database times out and shuts down (crashes). This is seen when there is a large number of processes (thousands) involved. It is not so prominent with a smaller number of processes.
Backing out this change addresses the problem.
-Prakash
-- Cheers
David / dhildenb
On 09.10.25 00:54, Prakash Sangappa wrote:
On Sep 1, 2025, at 4:26 AM, David Hildenbrand david@redhat.com wrote:
On 01.09.25 12:58, Jann Horn wrote:
Hi!

On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav suschako@amazon.de wrote:
We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix, which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory and 196 cores, where we mmap 1.2TB of shared memory and fork dozens or hundreds of times, we see execution times increase by a factor of 4. The reproducer is at the end of the email.
Yeah, every 1G virtual address range you unshare on unmap will do an extra synchronous IPI broadcast to all CPU cores, so it's not very surprising that doing this would be a bit slow on a machine with 196 cores.
What is the use case for this extreme usage of fork() in that context? Is it just something people noticed and it's suboptimal, or is this a real problem for some use cases?
Our DB team is reporting performance issues due to this change. While running TPCC, the database times out and shuts down (crashes). This is seen when there is a large number of processes (thousands) involved. It is not so prominent with a smaller number of processes.
Backing out this change addresses the problem.
I suspect the timeouts are due to fork() taking longer, and there is no kernel crash etc, right?
On Oct 9, 2025, at 12:23 AM, David Hildenbrand david@redhat.com wrote:
On 09.10.25 00:54, Prakash Sangappa wrote:
On Sep 1, 2025, at 4:26 AM, David Hildenbrand david@redhat.com wrote:
On 01.09.25 12:58, Jann Horn wrote:
Hi!

On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav suschako@amazon.de wrote:
We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix, which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory and 196 cores, where we mmap 1.2TB of shared memory and fork dozens or hundreds of times, we see execution times increase by a factor of 4. The reproducer is at the end of the email.
Yeah, every 1G virtual address range you unshare on unmap will do an extra synchronous IPI broadcast to all CPU cores, so it's not very surprising that doing this would be a bit slow on a machine with 196 cores.
What is the use case for this extreme usage of fork() in that context? Is it just something people noticed and it's suboptimal, or is this a real problem for some use cases?
Our DB team is reporting performance issues due to this change. While running TPCC, the database times out and shuts down (crashes). This is seen when there is a large number of processes (thousands) involved. It is not so prominent with a smaller number of processes.

Backing out this change addresses the problem.
I suspect the timeouts are due to fork() taking longer, and there is no kernel crash etc, right?
That is correct, there is no kernel crash.

-Prakash
-- Cheers
David / dhildenb
On 01.09.25 12:58, Jann Horn wrote:
Hi!
On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav suschako@amazon.de wrote:
We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix, which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory and 196 cores, where we mmap 1.2TB of shared memory and fork dozens or hundreds of times, we see execution times increase by a factor of 4. The reproducer is at the end of the email.
Yeah, every 1G virtual address range you unshare on unmap will do an extra synchronous IPI broadcast to all CPU cores, so it's not very surprising that doing this would be a bit slow on a machine with 196 cores.
My observation/assumption is:
- each child touches 100 random pages and despawns
- on each despawn, `huge_pmd_unshare()` is called
- each call to `huge_pmd_unshare()` synchronizes all threads using `tlb_remove_table_sync_one()`, leading to the regression
Yeah, makes sense that that'd be slow.
There are probably several ways this could be optimized - like maybe changing tlb_remove_table_sync_one() to rely on the MM's cpumask (though that would require thinking about whether this interacts with remote MM access somehow), or batching the refcount drops for hugetlb shared page tables through something like struct mmu_gather, or doing something special for the unmap path, or changing the semantics of hugetlb page tables such that they can never turn into normal page tables again. However, I'm not planning to work on optimizing this.
I'm currently looking at the fix and what sticks out is "Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one()".
(I don't understand how the page table can be used for "normal, non-hugetlb". I could only see how it is used for the remaining hugetlb user, but that's a different question)
How does the fix work when an architecture does not issue IPIs for TLB shootdown? To handle gup-fast on these architectures, we use RCU.
So I'm wondering whether we use RCU somehow.
But note that in gup_fast_pte_range(), we are validating whether the PMD changed:
	if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
	    unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}
So in case the page table got reused in the meantime, we should just back off and be fine, right?
On 09.10.25 09:40, David Hildenbrand wrote:
On 01.09.25 12:58, Jann Horn wrote:
Hi!
On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav suschako@amazon.de wrote:
We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix, which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory and 196 cores, where we mmap 1.2TB of shared memory and fork dozens or hundreds of times, we see execution times increase by a factor of 4. The reproducer is at the end of the email.
Yeah, every 1G virtual address range you unshare on unmap will do an extra synchronous IPI broadcast to all CPU cores, so it's not very surprising that doing this would be a bit slow on a machine with 196 cores.
My observation/assumption is:
- each child touches 100 random pages and despawns
- on each despawn, `huge_pmd_unshare()` is called
- each call to `huge_pmd_unshare()` synchronizes all threads using `tlb_remove_table_sync_one()`, leading to the regression
Yeah, makes sense that that'd be slow.
There are probably several ways this could be optimized - like maybe changing tlb_remove_table_sync_one() to rely on the MM's cpumask (though that would require thinking about whether this interacts with remote MM access somehow), or batching the refcount drops for hugetlb shared page tables through something like struct mmu_gather, or doing something special for the unmap path, or changing the semantics of hugetlb page tables such that they can never turn into normal page tables again. However, I'm not planning to work on optimizing this.
I'm currently looking at the fix and what sticks out is "Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one()".
(I don't understand how the page table can be used for "normal, non-hugetlb". I could only see how it is used for the remaining hugetlb user, but that's a different question)
How does the fix work when an architecture does not issue IPIs for TLB shootdown? To handle gup-fast on these architectures, we use RCU.
So I'm wondering whether we use RCU somehow.
But note that in gup_fast_pte_range(), we are validating whether the PMD changed:
	if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
	    unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}
So in case the page table got reused in the meantime, we should just back off and be fine, right?
Wrong page table level. We'd have to check when processing a PMD leaf whether the PUD changed as well.
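As a sketch (untested; orig_pud/pudp would have to be threaded down from gup_fast_pud_range()):

	/* Also back off if the PUD entry we walked through changed,
	 * e.g. because the PMD table it pointed at was unshared. */
	if (unlikely(pud_val(orig_pud) != pud_val(READ_ONCE(*pudp))) ||
	    unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
	    unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}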
On Thu, Oct 09, 2025 at 09:40:34AM +0200, David Hildenbrand wrote:
On 01.09.25 12:58, Jann Horn wrote:
Hi!
On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav suschako@amazon.de wrote:
We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix, which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory and 196 cores, where we mmap 1.2TB of shared memory and fork dozens or hundreds of times, we see execution times increase by a factor of 4. The reproducer is at the end of the email.
Yeah, every 1G virtual address range you unshare on unmap will do an extra synchronous IPI broadcast to all CPU cores, so it's not very surprising that doing this would be a bit slow on a machine with 196 cores.
My observation/assumption is:
- each child touches 100 random pages and despawns
- on each despawn, `huge_pmd_unshare()` is called
- each call to `huge_pmd_unshare()` synchronizes all threads using `tlb_remove_table_sync_one()`, leading to the regression
Yeah, makes sense that that'd be slow.
There are probably several ways this could be optimized - like maybe changing tlb_remove_table_sync_one() to rely on the MM's cpumask (though that would require thinking about whether this interacts with remote MM access somehow), or batching the refcount drops for hugetlb shared page tables through something like struct mmu_gather, or doing something special for the unmap path, or changing the semantics of hugetlb page tables such that they can never turn into normal page tables again. However, I'm not planning to work on optimizing this.
I'm currently looking at the fix and what sticks out is "Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one()".
(I don't understand how the page table can be used for "normal, non-hugetlb". I could only see how it is used for the remaining hugetlb user, but that's a different question)
Right, this surely is related only to hugetlb PTS, otherwise the refcount shouldn't be a factor, no?
How does the fix work when an architecture does not issue IPIs for TLB shootdown? To handle gup-fast on these architectures, we use RCU.
So I'm wondering whether we use RCU somehow.
Presumably you mean whether we _can_ use RCU somehow?
But note that in gup_fast_pte_range(), we are validating whether the PMD changed:
	if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
	    unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}
Right and as per the comment there:
	/*
	 * ...
	 * For THP collapse, it's a bit more complicated because GUP-fast may be
	 * walking a pgtable page that is being freed (pte is still valid but pmd
	 * can be cleared already). To avoid race in such condition, we need to
	 * also check pmd here to make sure pmd doesn't change (corresponds to
	 * pmdp_collapse_flush() in the THP collapse code path).
	 * ...
	 */
So if this can correctly handle a cleared PMD entry in the teardown case, surely it can handle it in this case also?
So in case the page table got reused in the meantime, we should just back off and be fine, right?
Yeah seems to be the case to me.
-- Cheers
David / dhildenb
So it seems like you have a proposal here - could you send a patch so we can assess it please? :)
I'm guessing we need only consider the 'remaining user' case for hugetlb PTS, right? And perhaps stabilise via RCU somehow?
Cheers, Lorenzo
(I don't understand how the page table can be used for "normal, non-hugetlb". I could only see how it is used for the remaining hugetlb user, but that's a different question)
Right, this surely is related only to hugetlb PTS, otherwise the refcount shouldn't be a factor, no?
The example from Jann is scary. But I think it checks out.
How does the fix work when an architecture does not issue IPIs for TLB shootdown? To handle gup-fast on these architectures, we use RCU.
So I'm wondering whether we use RCU somehow.
Presumably you mean whether we _can_ use RCU somehow?
No, whether there is an implied RCU sync before the page table gets reused, see my reply to Jann.
But note that in gup_fast_pte_range(), we are validating whether the PMD changed:
	if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
	    unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}
Right and as per the comment there:
	/*
	 * ...
	 * For THP collapse, it's a bit more complicated because GUP-fast may be
	 * walking a pgtable page that is being freed (pte is still valid but pmd
	 * can be cleared already). To avoid race in such condition, we need to
	 * also check pmd here to make sure pmd doesn't change (corresponds to
	 * pmdp_collapse_flush() in the THP collapse code path).
	 * ...
	 */
So if this can correctly handle a cleared PMD entry in the teardown case, surely it can handle it in this case also?
Right.
But see my other mail: on architectures that don't free page tables with RCU, we still need the IPI, so that is nasty.
So in case the page table got reused in the meantime, we should just back off and be fine, right?
Yeah seems to be the case to me.
-- Cheers
David / dhildenb
So it seems like you have a proposal here - could you send a patch so we can assess it please? :)
It's a bit tricky, I think I have to discuss with Jann some more first. But right now my understanding is that Jann's fix might not have taken care of arches without the IPI sync -- I might be wrong.
On Thu, Oct 9, 2025 at 9:40 AM David Hildenbrand david@redhat.com wrote:
On 01.09.25 12:58, Jann Horn wrote:
Hi!
On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav suschako@amazon.de wrote:
We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix, which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory and 196 cores, where we mmap 1.2TB of shared memory and fork dozens or hundreds of times, we see execution times increase by a factor of 4. The reproducer is at the end of the email.
Yeah, every 1G virtual address range you unshare on unmap will do an extra synchronous IPI broadcast to all CPU cores, so it's not very surprising that doing this would be a bit slow on a machine with 196 cores.
My observation/assumption is:
- each child touches 100 random pages and despawns
- on each despawn, `huge_pmd_unshare()` is called
- each call to `huge_pmd_unshare()` synchronizes all threads using `tlb_remove_table_sync_one()`, leading to the regression
Yeah, makes sense that that'd be slow.
There are probably several ways this could be optimized - like maybe changing tlb_remove_table_sync_one() to rely on the MM's cpumask (though that would require thinking about whether this interacts with remote MM access somehow), or batching the refcount drops for hugetlb shared page tables through something like struct mmu_gather, or doing something special for the unmap path, or changing the semantics of hugetlb page tables such that they can never turn into normal page tables again. However, I'm not planning to work on optimizing this.
I'm currently looking at the fix and what sticks out is "Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one()".
(I don't understand how the page table can be used for "normal, non-hugetlb". I could only see how it is used for the remaining hugetlb user, but that's a different question)
If I remember correctly: When a hugetlb shared page table drops to refcount 1, it turns into a normal page table. If you then afterwards split the hugetlb VMA, unmap one half of it, and place a new unrelated VMA in its place, the same page table will be reused for PTEs of this new unrelated VMA.
So the scenario would be:
1. Initially, we have a hugetlb shared page table covering 1G of address space which maps hugetlb 2M pages, which is used by two hugetlb VMAs in different processes (processes P1 and P2).
2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and walks down through the PUD entry that points to the shared page table, then when it reaches the loop in gup_fast_pmd_range() gets interrupted for a while by an NMI or preempted by the hypervisor or something.
3. P2 removes its VMA, and the hugetlb shared page table effectively becomes a normal page table in P1.
4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary), leaving two VMAs VMA1 and VMA2.
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for example an anonymous private VMA.
6. P1 populates VMA3 with page table entries.
7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now uses the new PMD/PTE entries created for VMA3.
How does the fix work when an architecture does not issue IPIs for TLB shootdown? To handle gup-fast on these architectures, we use RCU.
gup-fast disables interrupts, which synchronizes against both RCU and IPI.
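That is, roughly this part of gup_fast() in mm/gup.c:

	local_irq_save(flags);
	/* the lockless walk; neither an IPI-based nor an RCU-based page
	 * table free can complete while IRQs are off on this CPU */
	gup_fast_pgd_range(start, end, gup_flags, pages, &nr_pinned);
	local_irq_restore(flags);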
So I'm wondering whether we use RCU somehow.
But note that in gup_fast_pte_range(), we are validating whether the PMD changed:
	if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
	    unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}
So in case the page table got reused in the meantime, we should just back off and be fine, right?
The shared page table is mapped with a PUD entry, and we don't check whether the PUD entry changed here.
I'm currently looking at the fix and what sticks out is "Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one()".
(I don't understand how the page table can be used for "normal, non-hugetlb". I could only see how it is used for the remaining hugetlb user, but that's a different question)
If I remember correctly: When a hugetlb shared page table drops to refcount 1, it turns into a normal page table. If you then afterwards split the hugetlb VMA, unmap one half of it, and place a new unrelated VMA in its place, the same page table will be reused for PTEs of this new unrelated VMA.
That makes sense.
So the scenario would be:
1. Initially, we have a hugetlb shared page table covering 1G of address space which maps hugetlb 2M pages, which is used by two hugetlb VMAs in different processes (processes P1 and P2).
2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and walks down through the PUD entry that points to the shared page table, then when it reaches the loop in gup_fast_pmd_range() gets interrupted for a while by an NMI or preempted by the hypervisor or something.
3. P2 removes its VMA, and the hugetlb shared page table effectively becomes a normal page table in P1.
4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary), leaving two VMAs VMA1 and VMA2.
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for example an anonymous private VMA.
6. P1 populates VMA3 with page table entries.
7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now uses the new PMD/PTE entries created for VMA3.
Yeah, sounds possible. And nasty.
How does the fix work when an architecture does not issue IPIs for TLB shootdown? To handle gup-fast on these architectures, we use RCU.
gup-fast disables interrupts, which synchronizes against both RCU and IPI.
Right, but RCU is only used to prevent walking a page table that has been freed+reused in the meantime (preventing us from dereferencing garbage entries).
It does not prevent walking the now-unshared page table that has been modified by the other process.
For that, we need the back-off described below. IIRC we implemented that in the PMD case for khugepaged.
Or is there somewhere a guaranteed RCU sync before the shared page table gets reused?
So I'm wondering whether we use RCU somehow.
But note that in gup_fast_pte_range(), we are validating whether the PMD changed:
	if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
	    unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}
So in case the page table got reused in the meantime, we should just back off and be fine, right?
The shared page table is mapped with a PUD entry, and we don't check whether the PUD entry changed here.
Yes, see my follow-up mail, that's what we'd have to add.
On an arch without IPI, page tables will be freed with RCU and it just works. We walk the wrong page table, realize that the PUD changed and back off.
On an arch with IPI it's tricky: if we don't issue the IPI you added, we might still back off once we check that the PUD entry didn't change, but I'm afraid nothing would stop us from walking the previous page table that was freed in the meantime, containing garbage.
An easy fix would be to never reuse a page table that was once shared?
On Thu, Oct 16, 2025 at 9:10 PM David Hildenbrand david@redhat.com wrote:
I'm currently looking at the fix and what sticks out is "Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one()".
(I don't understand how the page table can be used for "normal, non-hugetlb". I could only see how it is used for the remaining hugetlb user, but that's a different question)
If I remember correctly: When a hugetlb shared page table drops to refcount 1, it turns into a normal page table. If you then afterwards split the hugetlb VMA, unmap one half of it, and place a new unrelated VMA in its place, the same page table will be reused for PTEs of this new unrelated VMA.
That makes sense.
So the scenario would be:
1. Initially, we have a hugetlb shared page table covering 1G of address space which maps hugetlb 2M pages, which is used by two hugetlb VMAs in different processes (processes P1 and P2).
2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and walks down through the PUD entry that points to the shared page table, then when it reaches the loop in gup_fast_pmd_range() gets interrupted for a while by an NMI or preempted by the hypervisor or something.
3. P2 removes its VMA, and the hugetlb shared page table effectively becomes a normal page table in P1.
4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary), leaving two VMAs VMA1 and VMA2.
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for example an anonymous private VMA.
6. P1 populates VMA3 with page table entries.
7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now uses the new PMD/PTE entries created for VMA3.
Yeah, sounds possible. And nasty.
How does the fix work when an architecture does not issue IPIs for TLB shootdown? To handle gup-fast on these architectures, we use RCU.
gup-fast disables interrupts, which synchronizes against both RCU and IPI.
Right, but RCU is only used to prevent walking a page table that has been freed+reused in the meantime (preventing us from dereferencing garbage entries).
It does not prevent walking the now-unshared page table that has been modified by the other process.
Hm, I'm a bit lost... which page table walk implementation are you worried about that accesses page tables purely with RCU? I believe all page table walks should be happening either with interrupts off (in gup_fast()) or under the protection of higher-level locks; in particular, hugetlb page walks take an extra hugetlb specific lock (for hugetlb VMAs that are eligible for page table sharing, that is the rw_sema in hugetlb_vma_lock).
Regarding gup_fast():
In the case where CONFIG_MMU_GATHER_RCU_TABLE_FREE is defined, the fix commit 1013af4f585f uses a synchronous IPI with tlb_remove_table_sync_one() to wait for any concurrent GUP-fast software page table walks, and some time after the call to huge_pmd_unshare() we will do a TLB flush that synchronizes against hardware page table walks.
In the case where CONFIG_MMU_GATHER_RCU_TABLE_FREE is not defined, I believe the expectation is that the TLB flush implicitly does an IPI which synchronizes against both software and hardware page table walks.
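Abbreviated, the fixed path in huge_pmd_unshare() looks roughly like:

	pud_clear(pud);
	/*
	 * Once our caller drops the rmap lock, some other process might be
	 * using this page table as a normal, non-hugetlb page table.
	 * Wait for pending gup_fast() walks in other threads to finish
	 * before letting that happen.
	 */
	tlb_remove_table_sync_one();
	mm_dec_nr_pmds(mm);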
On 16.10.25 21:26, Jann Horn wrote:
On Thu, Oct 16, 2025 at 9:10 PM David Hildenbrand david@redhat.com wrote:
I'm currently looking at the fix and what sticks out is "Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one()".
(I don't understand how the page table can be used for "normal, non-hugetlb". I could only see how it is used for the remaining hugetlb user, but that's a different question)
If I remember correctly: When a hugetlb shared page table drops to refcount 1, it turns into a normal page table. If you then afterwards split the hugetlb VMA, unmap one half of it, and place a new unrelated VMA in its place, the same page table will be reused for PTEs of this new unrelated VMA.
That makes sense.
So the scenario would be:
1. Initially, we have a hugetlb shared page table covering 1G of address space which maps hugetlb 2M pages, which is used by two hugetlb VMAs in different processes (processes P1 and P2).
2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and walks down through the PUD entry that points to the shared page table, then when it reaches the loop in gup_fast_pmd_range() gets interrupted for a while by an NMI or preempted by the hypervisor or something.
3. P2 removes its VMA, and the hugetlb shared page table effectively becomes a normal page table in P1.
4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary), leaving two VMAs VMA1 and VMA2.
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for example an anonymous private VMA.
6. P1 populates VMA3 with page table entries.
7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now uses the new PMD/PTE entries created for VMA3.
Yeah, sounds possible. And nasty.
How does the fix work when an architecture does not issue IPIs for TLB shootdown? To handle gup-fast on these architectures, we use RCU.
gup-fast disables interrupts, which synchronizes against both RCU and IPI.
Right, but RCU is only used to prevent walking a page table that has been freed+reused in the meantime (preventing us from dereferencing garbage entries).
It does not prevent walking the now-unshared page table that has been modified by the other process.
Hm, I'm a bit lost... which page table walk implementation are you worried about that accesses page tables purely with RCU? I believe all page table walks should be happening either with interrupts off (in gup_fast()) or under the protection of higher-level locks; in particular, hugetlb page walks take an extra hugetlb specific lock (for hugetlb VMAs that are eligible for page table sharing, that is the rw_sema in hugetlb_vma_lock).
I'm only concerned about gup-fast, but your comment below explains why your fix works as it triggers an IPI in any case, not just during the TLB flush.
Sorry for missing that detail.
Regarding gup_fast():
In the case where CONFIG_MMU_GATHER_RCU_TABLE_FREE is defined, the fix commit 1013af4f585f uses a synchronous IPI with tlb_remove_table_sync_one() to wait for any concurrent GUP-fast software page table walks, and some time after the call to huge_pmd_unshare() we will do a TLB flush that synchronizes against hardware page table walks.
Right, so we definitely issue an IPI.
In the case where CONFIG_MMU_GATHER_RCU_TABLE_FREE is not defined, I believe the expectation is that the TLB flush implicitly does an IPI which synchronizes against both software and hardware page table walks.
Yes, that's what I had in mind, not an explicit sync.
So the big question is whether we could avoid this IPI on every unsharing.
Assuming we would ever reuse a page table that was shared, we'd have to do this IPI only before freeing the page table, I guess, or free the page table through RCU.
On Thu, Oct 16, 2025 at 9:45 PM David Hildenbrand david@redhat.com wrote:
On 16.10.25 21:26, Jann Horn wrote:
On Thu, Oct 16, 2025 at 9:10 PM David Hildenbrand david@redhat.com wrote:
I'm currently looking at the fix and what sticks out is "Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one()".
(I don't understand how the page table can be used for "normal, non-hugetlb". I could only see how it is used for the remaining hugetlb user, but that's a different question)
If I remember correctly: When a hugetlb shared page table drops to refcount 1, it turns into a normal page table. If you then afterwards split the hugetlb VMA, unmap one half of it, and place a new unrelated VMA in its place, the same page table will be reused for PTEs of this new unrelated VMA.
That makes sense.
So the scenario would be:
1. Initially, we have a hugetlb shared page table covering 1G of address space which maps hugetlb 2M pages, which is used by two hugetlb VMAs in different processes (processes P1 and P2).
2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and walks down through the PUD entry that points to the shared page table, then when it reaches the loop in gup_fast_pmd_range() gets interrupted for a while by an NMI or preempted by the hypervisor or something.
3. P2 removes its VMA, and the hugetlb shared page table effectively becomes a normal page table in P1.
4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary), leaving two VMAs VMA1 and VMA2.
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for example an anonymous private VMA.
6. P1 populates VMA3 with page table entries.
7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now uses the new PMD/PTE entries created for VMA3.
Yeah, sounds possible. And nasty.
How does the fix work when an architecture does not issue IPIs for TLB shootdown? To handle gup-fast on these architectures, we use RCU.
gup-fast disables interrupts, which synchronizes against both RCU and IPI.
Right, but RCU is only used to prevent walking a page table that has been freed+reused in the meantime (preventing us from dereferencing garbage entries).
It does not prevent walking the now-unshared page table that has been modified by the other process.
Hm, I'm a bit lost... which page table walk implementation are you worried about that accesses page tables purely with RCU? I believe all page table walks should be happening either with interrupts off (in gup_fast()) or under the protection of higher-level locks; in particular, hugetlb page walks take an extra hugetlb specific lock (for hugetlb VMAs that are eligible for page table sharing, that is the rw_sema in hugetlb_vma_lock).
I'm only concerned about gup-fast, but your comment below explains why your fix works as it triggers an IPI in any case, not just during the TLB flush.
Sorry for missing that detail.
Regarding gup_fast():
In the case where CONFIG_MMU_GATHER_RCU_TABLE_FREE is defined, the fix commit 1013af4f585f uses a synchronous IPI with tlb_remove_table_sync_one() to wait for any concurrent GUP-fast software page table walks, and some time after the call to huge_pmd_unshare() we will do a TLB flush that synchronizes against hardware page table walks.
Right, so we definitely issue an IPI.
In the case where CONFIG_MMU_GATHER_RCU_TABLE_FREE is not defined, I believe the expectation is that the TLB flush implicitly does an IPI which synchronizes against both software and hardware page table walks.
Yes, that's what I had in mind, not an explicit sync.
So the big question is whether we could avoid this IPI on every unsharing.
Assuming we would ever reuse a page table that was shared, we'd have to do this IPI only before freeing the page table, I guess, or free the page table through RCU.
Yeah, that would make things a lot neater. Prevent hugetlb shared page tables from ever being reused for normal mappings, perhaps by changing huge_pmd_unshare() so that if the page table has a share count of 1, we zap it instead of doing nothing. (Though that has to be restricted to shared hugetlb mappings, which are the ones eligible for page table sharing.)
I thiiiink doing it at huge_pmd_unshare() would probably be enough to prevent formerly-shared page tables from being reused for new stuff, but I haven't looked in detail.
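In pseudocode, something like this in huge_pmd_unshare() (untested; the helper names here are made up):

	if (vma->vm_flags & VM_MAYSHARE &&
	    pmd_table_share_count(ptep) == 1) {	/* we are the last sharer */
		/*
		 * Zap and free the once-shared PMD table instead of letting
		 * it live on as a normal page table, so it can never be
		 * reused for unrelated mappings.
		 */
		pud_clear(pud);
		free_unshared_pmd_table(mm, ptep);	/* made-up helper */
	}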
Jann,
Please bear with my questions below, I want to get a good mental model of this. :)
Thanks!
On Thu, Oct 16, 2025 at 08:44:57PM +0200, Jann Horn wrote:
On Thu, Oct 9, 2025 at 9:40 AM David Hildenbrand david@redhat.com wrote:
On 01.09.25 12:58, Jann Horn wrote:
Hi!
On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav suschako@amazon.de wrote:
We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix, which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory and 196 cores, where we mmap 1.2TB of shared memory and fork dozens or hundreds of times, we see execution times increase by a factor of 4. The reproducer is at the end of the email.
Yeah, every 1G virtual address range you unshare on unmap will do an extra synchronous IPI broadcast to all CPU cores, so it's not very surprising that doing this would be a bit slow on a machine with 196 cores.
My observation/assumption is:
- each child touches 100 random pages and despawns
- on each despawn, `huge_pmd_unshare()` is called
- each call to `huge_pmd_unshare()` synchronizes all threads using `tlb_remove_table_sync_one()`, leading to the regression
Yeah, makes sense that that'd be slow.
There are probably several ways this could be optimized - like maybe changing tlb_remove_table_sync_one() to rely on the MM's cpumask (though that would require thinking about whether this interacts with remote MM access somehow), or batching the refcount drops for hugetlb shared page tables through something like struct mmu_gather, or doing something special for the unmap path, or changing the semantics of hugetlb page tables such that they can never turn into normal page tables again. However, I'm not planning to work on optimizing this.
I'm currently looking at the fix and what sticks out is "Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one()".
(I don't understand how the page table can be used for "normal, non-hugetlb". I could only see how it is used for the remaining hugetlb user, but that's a different question)
If I remember correctly: When a hugetlb shared page table drops to refcount 1, it turns into a normal page table. If you then afterwards split the hugetlb VMA, unmap one half of it, and place a new unrelated VMA in its place, the same page table will be reused for PTEs of this new unrelated VMA.
So the scenario would be:
1. Initially, we have a hugetlb shared page table covering 1G of address space which maps hugetlb 2M pages, which is used by two hugetlb VMAs in different processes (processes P1 and P2).
2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and walks down through the PUD entry that points to the shared page table, then when it reaches the loop in gup_fast_pmd_range() gets interrupted for a while by an NMI or preempted by the hypervisor or something.
3. P2 removes its VMA, and the hugetlb shared page table effectively becomes a normal page table in P1.
This is a bit confusing, are we talking about 2 threads in P2 on different CPUs?
P2/T1 on CPU A is doing the gup_fast() walk, P2/T2 on CPU B is simultaneously 'removing' this VMA?
Because surely the interrupts being disabled on CPU A means that ordinary preemption won't happen right?
By remove, what do you mean? Unmap? But won't this result in a TLB flush synced by IPI that is stalled by P2's CPU having interrupts disabled?
Or is it removed in the sense of hugetlb? As in something that invokes huge_pmd_unshare()?
But I guess this doesn't matter as the page table teardown will succeed, just the final tlb_finish_mmu() will stall.
And I guess GUP fast is trying to protect against the clear down by checking pmd != *pmdp.
4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary), leaving two VMAs VMA1 and VMA2.
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for example an anonymous private VMA.
Hmm, can it though?
P1 mmap write lock will be held, and VMA lock will be held too for VMA1,
In vms_complete_munmap_vmas(), vms_clear_ptes() will stall on tlb_finish_mmu() for IPI-synced architectures, and in that case the unmap won't finish and the mmap write lock won't be released, so nobody can map a new VMA yet, can they?
6. P1 populates VMA3 with page table entries.
ofc this requires the mmap/vma write lock above to be released first.
7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now uses the new PMD/PTE entries created for VMA3.
How does the fix work when an architecture does not issue IPIs for TLB shootdown? To handle gup-fast on these architectures, we use RCU.
gup-fast disables interrupts, which synchronizes against both RCU and IPI.
So I'm wondering whether we use RCU somehow.
But note that in gup_fast_pte_range(), we are validating whether the PMD changed:
	if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
	    unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}
So in case the page table got reused in the meantime, we should just back off and be fine, right?
The shared page table is mapped with a PUD entry, and we don't check whether the PUD entry changed here.
Could we simply put a PUD check in there sensibly?
Cheers, Lorenzo
On Mon, Oct 20, 2025 at 5:01 PM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Thu, Oct 16, 2025 at 08:44:57PM +0200, Jann Horn wrote:
On Thu, Oct 9, 2025 at 9:40 AM David Hildenbrand david@redhat.com wrote:
On 01.09.25 12:58, Jann Horn wrote:
Hi!
On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav suschako@amazon.de wrote:
We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix, which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory and 196 cores, where we mmap 1.2TB of shared memory and fork dozens or hundreds of times, we see execution times increase by a factor of 4. The reproducer is at the end of the email.
Yeah, every 1G virtual address range you unshare on unmap will do an extra synchronous IPI broadcast to all CPU cores, so it's not very surprising that doing this would be a bit slow on a machine with 196 cores.
My observation/assumption is:
- each child touches 100 random pages and despawns
- on each despawn, `huge_pmd_unshare()` is called
- each call to `huge_pmd_unshare()` synchronizes all threads using `tlb_remove_table_sync_one()`, leading to the regression
Yeah, makes sense that that'd be slow.
There are probably several ways this could be optimized - like maybe changing tlb_remove_table_sync_one() to rely on the MM's cpumask (though that would require thinking about whether this interacts with remote MM access somehow), or batching the refcount drops for hugetlb shared page tables through something like struct mmu_gather, or doing something special for the unmap path, or changing the semantics of hugetlb page tables such that they can never turn into normal page tables again. However, I'm not planning to work on optimizing this.
I'm currently looking at the fix and what sticks out is "Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one()".
(I don't understand how the page table can be used for "normal, non-hugetlb". I could only see how it is used for the remaining hugetlb user, but that's a different question)
If I remember correctly: When a hugetlb shared page table drops to refcount 1, it turns into a normal page table. If you then afterwards split the hugetlb VMA, unmap one half of it, and place a new unrelated VMA in its place, the same page table will be reused for PTEs of this new unrelated VMA.
So the scenario would be:
1. Initially, we have a hugetlb shared page table covering 1G of address space which maps hugetlb 2M pages, which is used by two hugetlb VMAs in different processes (processes P1 and P2).
2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and walks down through the PUD entry that points to the shared page table, then when it reaches the loop in gup_fast_pmd_range() gets interrupted for a while by an NMI or preempted by the hypervisor or something.
3. P2 removes its VMA, and the hugetlb shared page table effectively becomes a normal page table in P1.
This is a bit confusing, are we talking about 2 threads in P2 on different CPUs?
P2/T1 on CPU A is doing the gup_fast() walk, P2/T2 on CPU B is simultaneously 'removing' this VMA?
Ah, yes.
Because surely the interrupts being disabled on CPU A means that ordinary preemption won't happen right?
Yeah.
By remove, what do you mean? Unmap? But won't this result in a TLB flush synced by IPI that is stalled by P2's CPU having interrupts disabled?
The case I had in mind is munmap(). This is only an issue on platforms where TLB flushes can be done without IPI. That includes:
- KVM guests on x86 (where TLB flush IPIs can be elided if the target vCPU has been preempted by the host, in which case the host promises to do a TLB flush on guest re-entry)
- modern AMD CPUs with INVLPGB
- arm64
That is the whole point of tlb_remove_table_sync_one() - it forces an IPI on architectures where TLB flush doesn't guarantee an IPI.
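Its implementation is basically just (mm/mmu_gather.c, roughly):

	static void tlb_remove_table_smp_sync(void *arg)
	{
		/* Simply deliver the interrupt */
	}

	void tlb_remove_table_sync_one(void)
	{
		/*
		 * Synchronous broadcast: once this returns, no CPU can still
		 * be inside an IRQs-off gup_fast() walk that started before
		 * the call.
		 */
		smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
	}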
(The config option "CONFIG_MMU_GATHER_RCU_TABLE_FREE", which is only needed on architectures that don't guarantee that an IPI is involved in TLB flushing, is set on the major architectures nowadays - unconditionally on x86 and arm64, and in SMP builds of 32-bit arm.)
Or is it removed in the sense of hugetlb? As in something that invokes huge_pmd_unshare()?
I think that could also trigger it, though I wasn't thinking of that case.
But I guess this doesn't matter as the page table teardown will succeed, just the final tlb_finish_mmu() will stall.
And I guess GUP fast is trying to protect against the clear down by checking pmd != *pmdp.
The pmd recheck is done because of THP, IIRC because THP can deposit and reuse page tables without following the normal page table life cycle.
4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary), leaving two VMAs VMA1 and VMA2.
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for example an anonymous private VMA.
Hmm, can it though?
P1 mmap write lock will be held, and VMA lock will be held too for VMA1,
In vms_complete_munmap_vmas(), vms_clear_ptes() will stall on tlb_finish_mmu() for IPI-synced architectures, and in that case the unmap won't finish and the mmap write lock won't be released, so nobody can map a new VMA yet, can they?
Yeah, I think it can't happen on configurations that always use IPI for TLB synchronization. My patch also doesn't change anything on those architectures - tlb_remove_table_sync_one() is a no-op on architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE.
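(From include/asm-generic/tlb.h, roughly:

	#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
	extern void tlb_remove_table_sync_one(void);
	#else
	static inline void tlb_remove_table_sync_one(void) { }
	#endif

so on those configurations the call compiles away entirely.)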
6. P1 populates VMA3 with page table entries.
ofc this requires the mmap/vma write lock above to be released first.
7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now uses the new PMD/PTE entries created for VMA3.
How does the fix work when an architecture does not issue IPIs for TLB shootdown? To handle gup-fast on these architectures, we use RCU.
gup-fast disables interrupts, which synchronizes against both RCU and IPI.
So I'm wondering whether we use RCU somehow.
But note that in gup_fast_pte_range(), we are validating whether the PMD changed:
	if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
	    unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}
So in case the page table got reused in the meantime, we should just back off and be fine, right?
The shared page table is mapped with a PUD entry, and we don't check whether the PUD entry changed here.
Could we simply put a PUD check in there sensibly?
Uuuh... maybe? But I'm not sure if there is a good way to express the safety rules after that change any more nicely than we can do with the current safety rules; it feels like we're just tacking on an increasing number of special cases. As I understand it, the current rules are something like:

- Freeing a page table needs RCU delay or IPI to synchronize against gup_fast().
- Randomly moving page tables to different locations (which khugepaged does) is specially allowed only for PTE tables, thanks to the PMD entry recheck.
- mremap() is kind of a weird case because it can also move PMD tables without locking, but that's fine because nothing in the region covered by the source virtual address range can be part of a VMA other than the VMA being moved, so userspace has no legitimate reason to access it.
On Mon, Oct 20, 2025 at 05:33:22PM +0200, Jann Horn wrote:
On Mon, Oct 20, 2025 at 5:01 PM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Thu, Oct 16, 2025 at 08:44:57PM +0200, Jann Horn wrote:
On Thu, Oct 9, 2025 at 9:40 AM David Hildenbrand david@redhat.com wrote:
On 01.09.25 12:58, Jann Horn wrote:
Hi!
On Fri, Aug 29, 2025 at 4:30 PM Uschakow, Stanislav suschako@amazon.de wrote:
We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix, which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory and 196 cores, where we mmap 1.2TB of shared memory and fork dozens or hundreds of times, we see execution times increase by a factor of 4. The reproducer is at the end of the email.
Yeah, every 1G virtual address range you unshare on unmap will do an extra synchronous IPI broadcast to all CPU cores, so it's not very surprising that doing this would be a bit slow on a machine with 196 cores.
My observation/assumption is:
- each child touches 100 random pages and despawns
- on each despawn, `huge_pmd_unshare()` is called
- each call to `huge_pmd_unshare()` synchronizes all threads using `tlb_remove_table_sync_one()`, leading to the regression
Yeah, makes sense that that'd be slow.
There are probably several ways this could be optimized - like maybe changing tlb_remove_table_sync_one() to rely on the MM's cpumask (though that would require thinking about whether this interacts with remote MM access somehow), or batching the refcount drops for hugetlb shared page tables through something like struct mmu_gather, or doing something special for the unmap path, or changing the semantics of hugetlb page tables such that they can never turn into normal page tables again. However, I'm not planning to work on optimizing this.
I'm currently looking at the fix and what sticks out is "Fix it with an explicit broadcast IPI through tlb_remove_table_sync_one()".
(I don't understand how the page table can be used for "normal, non-hugetlb" purposes. I could only see how it is used by the remaining hugetlb user, but that's a different question.)
If I remember correctly: When a hugetlb shared page table drops to refcount 1, it turns into a normal page table. If you then afterwards split the hugetlb VMA, unmap one half of it, and place a new unrelated VMA in its place, the same page table will be reused for PTEs of this new unrelated VMA.
So the scenario would be:
1. Initially, we have a hugetlb shared page table covering 1G of address space which maps hugetlb 2M pages, which is used by two hugetlb VMAs in different processes (processes P1 and P2).
2. A thread in P2 begins a gup_fast() walk in the hugetlb region, and walks down through the PUD entry that points to the shared page table, then when it reaches the loop in gup_fast_pmd_range() gets interrupted for a while by an NMI or preempted by the hypervisor or something.
3. P2 removes its VMA, and the hugetlb shared page table effectively becomes a normal page table in P1.
This is a bit confusing - are we talking about 2 threads in P2 on different CPUs?
P2/T1 on CPU A is doing the gup_fast() walk, P2/T2 on CPU B is simultaneously 'removing' this VMA?
Ah, yes.
Thanks
Because surely the interrupts being disabled on CPU A means that ordinary preemption won't happen, right?
Yeah.
By 'remove', what do you mean? Unmap? But won't this result in a TLB flush synced by IPI that is stalled by P2's CPU having interrupts disabled?
The case I had in mind is munmap(). This is only an issue on platforms where TLB flushes can be done without IPI. That includes:
- KVM guests on x86 (where TLB flush IPIs can be elided if the target vCPU has been preempted by the host, in which case the host promises to do a TLB flush on guest re-entry)
- modern AMD CPUs with INVLPGB
- arm64
That is the whole point of tlb_remove_table_sync_one() - it forces an IPI on architectures where TLB flush doesn't guarantee an IPI.
Right.
(The config option "CONFIG_MMU_GATHER_RCU_TABLE_FREE", which is only needed on architectures that don't guarantee that an IPI is involved in TLB flushing, is set on the major architectures nowadays - unconditionally on x86 and arm64, and in SMP builds of 32-bit arm.)
Yes.
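(For reference, the function itself is tiny: in mm/mmu_gather.c it is, if I remember correctly, just a synchronous IPI broadcast whose handler does nothing, plus a no-op stub when CONFIG_MMU_GATHER_RCU_TABLE_FREE is off.)

static void tlb_remove_table_smp_sync(void *arg)
{
	/* Simply deliver the interrupt */
}

void tlb_remove_table_sync_one(void)
{
	/*
	 * This isn't an RCU grace period and hence the page-tables cannot
	 * be assumed to be actually RCU-freed.
	 *
	 * It is however sufficient for software page-table walkers that
	 * rely on IRQ disabling.
	 */
	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
}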
Or is it removed in the sense of hugetlb? As in something that invokes huge_pmd_unshare()?
I think that could also trigger it, though I wasn't thinking of that case.
But I guess this doesn't matter as the page table teardown will succeed, just the final tlb_finish_mmu() will stall.
And I guess GUP fast is trying to protect against the clear down by checking pmd != *pmdp.
The pmd recheck is done because of THP, IIRC because THP can deposit and reuse page tables without following the normal page table life cycle.
Right.
4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary), leaving two VMAs VMA1 and VMA2.
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for example an anonymous private VMA.
Hmm, can it though?
P1 mmap write lock will be held, and VMA lock will be held too for VMA1,
In vms_complete_munmap_vmas(), vms_clear_ptes() will stall on tlb_finish_mmu() for IPI-synced architectures, and in that case the unmap won't finish and the mmap write lock won't be released, so nobody can map a new VMA yet, can they?
Yeah, I think it can't happen on configurations that always use IPI for TLB synchronization. My patch also doesn't change anything on those architectures - tlb_remove_table_sync_one() is a no-op on architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE.
Hmm but in that case wouldn't:
tlb_finish_mmu()
-> tlb_flush_mmu()
-> tlb_flush_mmu_free()
-> tlb_table_flush()
-> tlb_remove_table()
-> __tlb_remove_table_one()
-> tlb_remove_table_sync_one()
prevent the unmapping on non-IPI architectures, thereby mitigating the issue?
Also, doesn't CONFIG_MMU_GATHER_RCU_TABLE_FREE imply that RCU is being used for page table teardown, whose grace period cannot elapse until gup_fast() finishes - and wouldn't that also mitigate the issue?
Why is a tlb_remove_table_sync_one() needed in huge_pmd_unshare()?
It seems you're predicating the issue on an unmap happening without waiting for GUP fast, but it seems that it always will?
Am I missing something here?
6. P1 populates VMA3 with page table entries.
Of course, this requires the mmap/VMA write lock above to be released first.
7. The gup_fast() walk in P2 continues, and gup_fast_pmd_range() now uses the new PMD/PTE entries created for VMA3.
How does the fix work when an architecture does not issue IPIs for TLB shootdown? To handle gup-fast on these architectures, we use RCU.
gup-fast disables interrupts, which synchronizes against both RCU and IPI.
So I'm wondering whether we use RCU somehow.
But note that in gup_fast_pte_range(), we are validating whether the PMD changed:
if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
    unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
        gup_put_folio(folio, 1, flags);
        goto pte_unmap;
}
So in case the page table got reused in the meantime, we should just back off and be fine, right?
The shared page table is mapped with a PUD entry, and we don't check whether the PUD entry changed here.
Could we simply put a PUD check in there sensibly?
Uuuh... maybe? But I'm not sure if there is a good way to express the safety rules after that change any more nicely than we can do with the current safety rules; it feels like we're just tacking on an increasing number of special cases. As I understand it, the current rules are something like:
Yeah, David covered this in the other sub-thread - not really viable, I guess :)
- Freeing a page table needs RCU delay or IPI to synchronize against gup_fast().
- Randomly moving page tables to different locations (which khugepaged does) is specially allowed only for PTE tables, thanks to the PMD entry recheck.
- mremap() is kind of a weird case because it can also move PMD tables without locking, but that's fine because nothing in the region covered by the source virtual address range can be part of a VMA other than the VMA being moved, so userspace has no legitimate reason to access it.
I will need to document these somewhere :)
Cheers, Lorenzo
On Fri, Oct 24, 2025 at 2:25 PM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Mon, Oct 20, 2025 at 05:33:22PM +0200, Jann Horn wrote:
On Mon, Oct 20, 2025 at 5:01 PM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Thu, Oct 16, 2025 at 08:44:57PM +0200, Jann Horn wrote:
4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary), leaving two VMAs VMA1 and VMA2.
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for example an anonymous private VMA.
Hmm, can it though?
P1 mmap write lock will be held, and VMA lock will be held too for VMA1,
In vms_complete_munmap_vmas(), vms_clear_ptes() will stall on tlb_finish_mmu() for IPI-synced architectures, and in that case the unmap won't finish and the mmap write lock won't be released, so nobody can map a new VMA yet, can they?
Yeah, I think it can't happen on configurations that always use IPI for TLB synchronization. My patch also doesn't change anything on those architectures - tlb_remove_table_sync_one() is a no-op on architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE.
Hmm but in that case wouldn't:
tlb_finish_mmu()
-> tlb_flush_mmu()
-> tlb_flush_mmu_free()
-> tlb_table_flush()
And then from there we call tlb_remove_table_free(), which does a call_rcu() to tlb_remove_table_rcu(), which will asynchronously run later and do __tlb_remove_table_free(), which does __tlb_remove_table()?
-> tlb_remove_table()
I don't see any way we end up in tlb_remove_table() from here. tlb_remove_table() is a much higher-level function, we end up there from something like pte_free_tlb(). I think you mixed up tlb_remove_table_free and tlb_remove_table.
-> __tlb_remove_table_one()
Heh, I think you made the same mistake as Linus made years ago when he was looking at tlb_remove_table(). In that function, the call to tlb_remove_table_one() leading to __tlb_remove_table_one() **is a slowpath only taken when memory allocation fails** - it's a fallback from the normal path that queues up batch items in (*batch)->tables[] (and occasionally calls tlb_table_flush() when it runs out of space in there).
-> tlb_remove_table_sync_one()
prevent the unmapping on non-IPI architectures, thereby mitigating the issue?
Also, doesn't CONFIG_MMU_GATHER_RCU_TABLE_FREE imply that RCU is being used for page table teardown, whose grace period cannot elapse until gup_fast() finishes - and wouldn't that also mitigate the issue?
I'm not sure I understand your point. CONFIG_MMU_GATHER_RCU_TABLE_FREE implies that "Semi RCU" is used to protect page table *freeing*, but page table freeing is irrelevant to this bug, and there is no RCU delay involved in dropping a reference on a shared hugetlb page table. "Semi RCU" is not used to protect against page table *reuse* at a different address by THP. Also, as explained in the big comment block in mm/mmu_gather.c, "Semi RCU" doesn't mean RCU is definitely used - when memory allocations fail, the __tlb_remove_table_one() fallback path, when used on !PT_RECLAIM, will fall back to an IPI broadcast followed by directly freeing the page table. RCU is just used as the more polite way to do something equivalent to an IPI broadcast (RCU will wait for other cores to go through regions where they _could_ receive an IPI as part of RCU-sched).
But also: At which point would you expect any page table to actually be freed, triggering any of this logic? When unmapping VMA1 in step 5, I think there might not be any page tables that exist and are fully covered by VMA1 (or its adjacent free space, if there is any) so that they are eligible to be freed.
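To make that fastpath/slowpath split concrete, tlb_remove_table() in mm/mmu_gather.c looks roughly like this from memory (the PT_RECLAIM variants differ in detail):

void tlb_remove_table(struct mmu_gather *tlb, void *table)
{
	struct mmu_table_batch **batch = &tlb->batch;

	if (*batch == NULL) {
		/* Normal path: grab a page to batch tables on. */
		*batch = (struct mmu_table_batch *)
			__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
		if (*batch == NULL) {
			/*
			 * Allocation failed: this is the slowpath that goes
			 * through tlb_remove_table_one(), i.e. an IPI
			 * broadcast followed by freeing the table directly.
			 */
			tlb_table_invalidate(tlb);
			tlb_remove_table_one(table);
			return;
		}
		(*batch)->nr = 0;
	}

	(*batch)->tables[(*batch)->nr++] = table;
	if ((*batch)->nr == MAX_TABLE_BATCH)
		tlb_table_flush(tlb);	/* frees the batch via call_rcu() */
}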
Why is a tlb_remove_table_sync_one() needed in huge_pmd_unshare()?
Because nothing else on that path is guaranteed to send any IPIs before the page table becomes reusable in another process.
On Fri, Oct 24, 2025 at 08:22:15PM +0200, Jann Horn wrote:
On Fri, Oct 24, 2025 at 2:25 PM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Mon, Oct 20, 2025 at 05:33:22PM +0200, Jann Horn wrote:
On Mon, Oct 20, 2025 at 5:01 PM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Thu, Oct 16, 2025 at 08:44:57PM +0200, Jann Horn wrote:
4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary), leaving two VMAs VMA1 and VMA2.
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for example an anonymous private VMA.
Hmm, can it though?
P1 mmap write lock will be held, and VMA lock will be held too for VMA1,
In vms_complete_munmap_vmas(), vms_clear_ptes() will stall on tlb_finish_mmu() for IPI-synced architectures, and in that case the unmap won't finish and the mmap write lock won't be released, so nobody can map a new VMA yet, can they?
Yeah, I think it can't happen on configurations that always use IPI for TLB synchronization. My patch also doesn't change anything on those architectures - tlb_remove_table_sync_one() is a no-op on architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE.
Hmm but in that case wouldn't:
tlb_finish_mmu()
-> tlb_flush_mmu()
-> tlb_flush_mmu_free()
-> tlb_table_flush()
And then from there we call tlb_remove_table_free(), which does a call_rcu() to tlb_remove_table_rcu(), which will asynchronously run later and do __tlb_remove_table_free(), which does __tlb_remove_table()?
Yeah my bad!
-> tlb_remove_table()
I don't see any way we end up in tlb_remove_table() from here. tlb_remove_table() is a much higher-level function, we end up there from something like pte_free_tlb(). I think you mixed up tlb_remove_table_free and tlb_remove_table.
Yeah sorry my mistake you're right!
-> __tlb_remove_table_one()
Heh, I think you made the same mistake as Linus made years ago when he was looking at tlb_remove_table(). In that function, the call to tlb_remove_table_one() leading to __tlb_remove_table_one() **is a slowpath only taken when memory allocation fails** - it's a fallback from the normal path that queues up batch items in (*batch)->tables[] (and occasionally calls tlb_table_flush() when it runs out of space in there).
At least in good company ;)
-> tlb_remove_table_sync_one()
prevent the unmapping on non-IPI architectures, thereby mitigating the issue?
Also, doesn't CONFIG_MMU_GATHER_RCU_TABLE_FREE imply that RCU is being used for page table teardown, whose grace period cannot elapse until gup_fast() finishes - and wouldn't that also mitigate the issue?
I'm not sure I understand your point. CONFIG_MMU_GATHER_RCU_TABLE_FREE implies that "Semi RCU" is used to protect page table *freeing*, but page table freeing is irrelevant to this bug, and there is no RCU delay involved in dropping a reference on a shared hugetlb page table.
It's this step:
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for example an anonymous private VMA.
But see below, I have had the 'aha' moment... this is really horrible.
Sigh hugetlb...
"Semi RCU" is not used to protect against page table *reuse* at a different address by THP. Also, as explained in the big comment block in m/mmu_gather.c, "Semi RCU" doesn't mean RCU is definitely used - when memory allocations fail, the __tlb_remove_table_one() fallback path, when used on !PT_RECLAIM, will fall back to an IPI broadcast followed by directly freeing the page table. RCU is just used as the more polite way to do something equivalent to an IPI broadcast (RCU will wait for other cores to go through regions where they _could_ receive an IPI as part of RCU-sched).
I guess for IPI we're OK, as _any_ of the TLB flushing will cause a shootdown and thus delay GUP-fast.
Are there any scenarios where the shootdown wouldn't happen even for the IPI case?
But also: At which point would you expect any page table to actually be freed, triggering any of this logic? When unmapping VMA1 in step 5, I think there might not be any page tables that exist and are fully covered by VMA1 (or its adjacent free space, if there is any) so that they are eligible to be freed.
Hmmm yeah, OK, now I see - the PMD table would remain in place throughout; we don't actually need to free anything. That's the crux of this, isn't it... yikes.
"Initially, we have a hugetlb shared page table covering 1G of address space which maps hugetlb 2M pages, which is used by two hugetlb VMAs in different processes (processes P1 and P2)."
"Then P1 splits the hugetlb VMA in the middle (at a 2M boundary), leaving two VMAs VMA1 and VMA2."
So the 1 GB region would have to be aligned, and (xxx = PUD entry, y = VMA1 entries, z = VMA2 entries):
  PUD
|-----|
\     \
/     /
\     \      PMD
/     /    |-----|
| xxx |--->| y1  |
/     /    | y2  |
\     \    | ... |
/     /    |y255 |
\     \    |y256 |
|-----|    | z1  |
           | z2  |
           | ... |
           |z255 |
           |z256 |
           |-----|
So the hugetlb page sharing stuff defeats all assumptions and checks... sigh.
Why is a tlb_remove_table_sync_one() needed in huge_pmd_unshare()?
Because nothing else on that path is guaranteed to send any IPIs before the page table becomes reusable in another process.
I feel that David's suggestion of just disallowing the use of shared page tables like this (I mean really does it actually come up that much?) is the right one then.
I wonder whether we shouldn't just free the PMD after it becomes unshared? It's kind of crazy to think we'll allow a reuse like this, it's asking for trouble.
Moving on to another point:
One point here I'd like to raise - this seems like a 'just so' scenario. I'm not saying we shouldn't fix it, but we're paying a _very heavy_ penalty here for a scenario that really does require some unusual things to happen in GUP fast and an _extremely_ tight and specific window in which to do it.
Plus isn't it going to be difficult to mediate exactly when an unshare will happen?
Since you can't preempt and IRQs are disabled, even getting the scenario to happen is surely very difficult: you really have to have some form of (para?)virtualisation preemption or an NMI, which would have to be very long-lasting (the operations you mention in P2 are hardly small ones) - and that seems very unlikely for an attacker to be able to achieve.
So my question is - would it be reasonable to consider this at the very least a vanishingly small, 'paranoid' fixup? I think it's telling you couldn't come up with a repro, and you are usually very good at that :)
Another question, perhaps a silly one: what is the attack scenario here? I'm not so familiar with hugetlb page table sharing, but is it in any way feasible that you'd access another process's mappings? If not, the attack scenario is that you end up accidentally accessing some other part of the process's memory (which doesn't seem so bad, right?).
Thanks, and sorry for all the questions, but I really want to make sure I understand what's going on here (and can potentially extract some of this into documentation later! :)
Cheers, Lorenzo
On Fri, Oct 24, 2025 at 9:03 PM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Fri, Oct 24, 2025 at 08:22:15PM +0200, Jann Horn wrote:
On Fri, Oct 24, 2025 at 2:25 PM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Mon, Oct 20, 2025 at 05:33:22PM +0200, Jann Horn wrote:
On Mon, Oct 20, 2025 at 5:01 PM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Thu, Oct 16, 2025 at 08:44:57PM +0200, Jann Horn wrote:
4. Then P1 splits the hugetlb VMA in the middle (at a 2M boundary), leaving two VMAs VMA1 and VMA2.
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for example an anonymous private VMA.
Hmm, can it though?
P1 mmap write lock will be held, and VMA lock will be held too for VMA1,
In vms_complete_munmap_vmas(), vms_clear_ptes() will stall on tlb_finish_mmu() for IPI-synced architectures, and in that case the unmap won't finish and the mmap write lock won't be released, so nobody can map a new VMA yet, can they?
Yeah, I think it can't happen on configurations that always use IPI for TLB synchronization. My patch also doesn't change anything on those architectures - tlb_remove_table_sync_one() is a no-op on architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE.
Hmm but in that case wouldn't:
tlb_finish_mmu()
-> tlb_flush_mmu()
-> tlb_flush_mmu_free()
-> tlb_table_flush()
And then from there we call tlb_remove_table_free(), which does a call_rcu() to tlb_remove_table_rcu(), which will asynchronously run later and do __tlb_remove_table_free(), which does __tlb_remove_table()?
Yeah my bad!
-> tlb_remove_table()
I don't see any way we end up in tlb_remove_table() from here. tlb_remove_table() is a much higher-level function, we end up there from something like pte_free_tlb(). I think you mixed up tlb_remove_table_free and tlb_remove_table.
Yeah sorry my mistake you're right!
-> __tlb_remove_table_one()
Heh, I think you made the same mistake as Linus made years ago when he was looking at tlb_remove_table(). In that function, the call to tlb_remove_table_one() leading to __tlb_remove_table_one() **is a slowpath only taken when memory allocation fails** - it's a fallback from the normal path that queues up batch items in (*batch)->tables[] (and occasionally calls tlb_table_flush() when it runs out of space in there).
At least in good company ;)
-> tlb_remove_table_sync_one()
prevent the unmapping on non-IPI architectures, thereby mitigating the issue?
Also, doesn't CONFIG_MMU_GATHER_RCU_TABLE_FREE imply that RCU is being used for page table teardown, whose grace period cannot elapse until gup_fast() finishes - and wouldn't that also mitigate the issue?
I'm not sure I understand your point. CONFIG_MMU_GATHER_RCU_TABLE_FREE implies that "Semi RCU" is used to protect page table *freeing*, but page table freeing is irrelevant to this bug, and there is no RCU delay involved in dropping a reference on a shared hugetlb page table.
It's this step:
5. P1 unmaps VMA1, and creates a new VMA (VMA3) in its place, for example an anonymous private VMA.
But see below, I have had the 'aha' moment... this is really horrible.
Sigh hugetlb...
"Semi RCU" is not used to protect against page table *reuse* at a different address by THP. Also, as explained in the big comment block in m/mmu_gather.c, "Semi RCU" doesn't mean RCU is definitely used - when memory allocations fail, the __tlb_remove_table_one() fallback path, when used on !PT_RECLAIM, will fall back to an IPI broadcast followed by directly freeing the page table. RCU is just used as the more polite way to do something equivalent to an IPI broadcast (RCU will wait for other cores to go through regions where they _could_ receive an IPI as part of RCU-sched).
I guess for IPI we're OK, as _any_ of the TLB flushing will cause a shootdown and thus delay GUP-fast.
Are there any scenarios where the shootdown wouldn't happen even for the IPI case?
But also: At which point would you expect any page table to actually be freed, triggering any of this logic? When unmapping VMA1 in step 5, I think there might not be any page tables that exist and are fully covered by VMA1 (or its adjacent free space, if there is any) so that they are eligible to be freed.
Hmmm yeah, OK, now I see - the PMD table would remain in place throughout; we don't actually need to free anything. That's the crux of this, isn't it... yikes.
"Initially, we have a hugetlb shared page table covering 1G of address space which maps hugetlb 2M pages, which is used by two hugetlb VMAs in different processes (processes P1 and P2)."
"Then P1 splits the hugetlb VMA in the middle (at a 2M boundary), leaving two VMAs VMA1 and VMA2."
So the 1 GB region would have to be aligned, and (xxx = PUD entry, y = VMA1 entries, z = VMA2 entries):
  PUD
|-----|
\     \
/     /
\     \      PMD
/     /    |-----|
| xxx |--->| y1  |
/     /    | y2  |
\     \    | ... |
/     /    |y255 |
\     \    |y256 |
|-----|    | z1  |
           | z2  |
           | ... |
           |z255 |
           |z256 |
           |-----|

So the hugetlb page sharing stuff defeats all assumptions and checks... sigh.
Why is a tlb_remove_table_sync_one() needed in huge_pmd_unshare()?
Because nothing else on that path is guaranteed to send any IPIs before the page table becomes reusable in another process.
I feel that David's suggestion of just disallowing the use of shared page tables like this (I mean really does it actually come up that much?) is the right one then.
Yeah, I also like that suggestion.
I wonder whether we shouldn't just free the PMD after it becomes unshared? It's kind of crazy to think we'll allow a reuse like this, it's asking for trouble.
Moving on to another point:
One point here I'd like to raise - this seems like a 'just so' scenario. I'm not saying we shouldn't fix it, but we're paying a _very heavy_ penalty here for a scenario that really does require some unusual things to happen in GUP fast and an _extremely_ tight and specific window in which to do it.
Yes.
Plus isn't it going to be difficult to mediate exactly when an unshare will happen?
Since you can't preempt and IRQs are disabled, even getting the scenario to happen is surely very difficult: you really have to have some form of (para?)virtualisation preemption or an NMI, which would have to be very long-lasting (the operations you mention in P2 are hardly small ones) - and that seems very unlikely for an attacker to be able to achieve.
Yeah, I think it would have to be something like a hypervisor rescheduling to another vCPU, or potentially it could happen if someone is doing kernel performance profiling with perf_event_open() (which might do stuff like copying large amounts of userspace stack memory from NMI context depending on runtime configuration).
So my question is - would it be reasonable to consider this at the very least a vanishingly small, 'paranoid' fixup? I think it's telling you couldn't come up with a repro, and you are usually very good at that :)
I mean, how hard this is to hit probably partly depends on what choices hypervisors make about vCPU scheduling. And it would probably also be easier to hit for an attacker with CAP_PERFMON, though that's true of many bugs.
But yeah, it's not the kind of bug I would choose to target if I wanted to write an exploit and had a larger selection of bugs to choose from.
Another question, perhaps a silly one: what is the attack scenario here? I'm not so familiar with hugetlb page table sharing, but is it in any way feasible that you'd access another process's mappings? If not, the attack scenario is that you end up accidentally accessing some other part of the process's memory (which doesn't seem so bad, right?).
I think the impact would be P2 being able to read/write unrelated data in P1. Though with the way things are currently implemented, I think that requires P1 to do this weird unmap of half of a hugetlb mapping.
We're also playing with fire because if P2 is walking page tables of P1 while P1 is concurrently freeing page tables, normal TLB flush IPIs issued by P1 wouldn't be sent to P2. I think that's not exploitable in the current implementation because CONFIG_MMU_GATHER_RCU_TABLE_FREE unconditionally either frees page tables through RCU or does IPI broadcasts sent to the whole system, but it is scary because sensible-looking optimizations could turn this into a user-to-kernel privilege escalation bug. For example, if we decided that in cases where we already did an IPI-based TLB flush, or in cases where we are single-threaded, we don't need to free page tables with Semi-RCU delay to synchronize against gup_fast().
Thanks, and sorry for all the questions, but I really want to make sure I understand what's going on here (and can potentially extract some of this into documentation later! :)
On Fri, Oct 24, 2025 at 09:43:43PM +0200, Jann Horn wrote:
So my question is - would it be reasonable to consider this at the very least a vanishingly small, 'paranoid' fixup? I think it's telling you couldn't come up with a repro, and you are usually very good at that :)
I mean, how hard this is to hit probably partly depends on what choices hypervisors make about vCPU scheduling. And it would probably also be easier to hit for an attacker with CAP_PERFMON, though that's true of many bugs.
But yeah, it's not the kind of bug I would choose to target if I wanted to write an exploit and had a larger selection of bugs to choose from.
Another question, perhaps a silly one: what is the attack scenario here? I'm not so familiar with hugetlb page table sharing, but is it in any way feasible that you'd access another process's mappings? If not, the attack scenario is that you end up accidentally accessing some other part of the process's memory (which doesn't seem so bad, right?).
I think the impact would be P2 being able to read/write unrelated data in P1. Though with the way things are currently implemented, I think that requires P1 to do this weird unmap of half of a hugetlb mapping.
We're also playing with fire because if P2 is walking page tables of P1 while P1 is concurrently freeing page tables, normal TLB flush IPIs issued by P1 wouldn't be sent to P2. I think that's not exploitable in the current implementation because CONFIG_MMU_GATHER_RCU_TABLE_FREE unconditionally either frees page tables through RCU or does IPI broadcasts sent to the whole system, but it is scary because sensible-looking optimizations could turn this into a user-to-kernel privilege escalation bug. For example, if we decided that in cases where we already did an IPI-based TLB flush, or in cases where we are single-threaded, we don't need to free page tables with Semi-RCU delay to synchronize against gup_fast().
Would it therefore be reasonable to say that this is more of a preventative measure against future kernel changes (which otherwise seem reasonable) which might lead to exploitable bugs, rather than being a practically exploitable bug in itself?
On Fri, Oct 24, 2025 at 9:59 PM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Fri, Oct 24, 2025 at 09:43:43PM +0200, Jann Horn wrote:
So my question is - would it be reasonable to consider this at the very least a vanishingly small, 'paranoid' fixup? I think it's telling you couldn't come up with a repro, and you are usually very good at that :)
I mean, how hard this is to hit probably partly depends on what choices hypervisors make about vCPU scheduling. And it would probably also be easier to hit for an attacker with CAP_PERFMON, though that's true of many bugs.
But yeah, it's not the kind of bug I would choose to target if I wanted to write an exploit and had a larger selection of bugs to choose from.
Another question, perhaps a silly one: what is the attack scenario here? I'm not so familiar with hugetlb page table sharing, but is it in any way feasible that you'd access another process's mappings? If not, the attack scenario is that you end up accidentally accessing some other part of the process's memory (which doesn't seem so bad, right?).
I think the impact would be P2 being able to read/write unrelated data in P1. Though with the way things are currently implemented, I think that requires P1 to do this weird unmap of half of a hugetlb mapping.
We're also playing with fire because if P2 is walking page tables of P1 while P1 is concurrently freeing page tables, normal TLB flush IPIs issued by P1 wouldn't be sent to P2. I think that's not exploitable in the current implementation because CONFIG_MMU_GATHER_RCU_TABLE_FREE unconditionally either frees page tables through RCU or does IPI broadcasts sent to the whole system, but it is scary because sensible-looking optimizations could turn this into a user-to-kernel privilege escalation bug. For example, if we decided that in cases where we already did an IPI-based TLB flush, or in cases where we are single-threaded, we don't need to free page tables with Semi-RCU delay to synchronize against gup_fast().
Would it therefore be reasonable to say that this is more of a preventative measure against future kernel changes (which otherwise seem reasonable) which might lead to exploitable bugs, rather than being a practically exploitable bug in itself?
I would say it is a security fix for theoretical userspace that either intentionally partially unmaps hugetlb mappings (which would probably be weird), or maps and partially unmaps attacker-supplied file descriptors (without necessarily expecting them to be hugetlb). (I know of userspace that mmap()s file descriptors coming from untrusted code, though I don't know examples that would then partially unmap them.) Admittedly there is some perfectionism involved here on my part. In particular, it irks me to make qualitative distinctions between bugs based on how hard their timing requirements are to hit.
But yes, a large part of my motivation for writing this patch was to prevent reasonable future changes to the rest of MM from making this a worse bug.
Why is a tlb_remove_table_sync_one() needed in huge_pmd_unshare()?
Because nothing else on that path is guaranteed to send any IPIs before the page table becomes reusable in another process.
I feel that David's suggestion of just disallowing the use of shared page tables like this (I mean really does it actually come up that much?) is the right one then.
Yeah, I also like that suggestion.
I started hacking on this (only found a bit of time this week), and in essence, we'll be using the mmu_gather when unsharing to collect the pages and handle the TLB flushing etc.
(TLB flushing in that hugetlb area is a mess)
It almost looks like a cleanup.
Having said that, it will take a bit longer to finish, and of course I first have to test it to see if it even works.
But it looks doable. :)
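Very roughly, and purely as a hypothetical sketch of that direction rather than the actual patch (huge_pmd_unshare_gather() and tlb_remove_table_put_ref() are made-up names for illustration), it would be something like:

/*
 * Idea: instead of an immediate IPI in huge_pmd_unshare(), drop our
 * reference on the shared PMD table through the caller's mmu_gather,
 * so that the reference drop - and any possible reuse of the table -
 * is deferred through the same semi-RCU machinery that already
 * protects normal page table freeing against gup_fast().
 */
static int huge_pmd_unshare_gather(struct mmu_gather *tlb,
				   struct mm_struct *mm, unsigned long addr,
				   pud_t *pud, pte_t *ptep)
{
	if (page_count(virt_to_page(ptep)) == 1)
		return 0;			/* not shared */

	pud_clear(pud);
	/* Make sure the whole 1G range covered by the table gets flushed. */
	tlb_flush_pud_range(tlb, addr & PUD_MASK, PUD_SIZE);
	mm_dec_nr_pmds(mm);
	/*
	 * Hypothetical helper: like tlb_remove_table(), but drops a
	 * reference instead of unconditionally freeing, since another
	 * process may still be using the shared table.
	 */
	tlb_remove_table_put_ref(tlb, virt_to_page(ptep));
	return 1;
}

The point being that any reuse of the table would then be gated on the same grace period that gup_fast() walkers already synchronize against, so no synchronous broadcast is needed on every unshare.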
On Wed, Oct 29, 2025 at 05:19:54PM +0100, David Hildenbrand wrote:
Why is a tlb_remove_table_sync_one() needed in huge_pmd_unshare()?
Because nothing else on that path is guaranteed to send any IPIs before the page table becomes reusable in another process.
I feel that David's suggestion of just disallowing the use of shared page tables like this (I mean really does it actually come up that much?) is the right one then.
Yeah, I also like that suggestion.
I started hacking on this (only found a bit of time this week), and in essence, we'll be using the mmu_gather when unsharing to collect the pages and handle the TLB flushing etc.
(TLB flushing in that hugetlb area is a mess)
It almost looks like a cleanup.
Having said that, it will take a bit longer to finish, and of course I first have to test it to see if it even works.
But it looks doable. :)
Ohhhh nice :)
I look forward to it!
-- Cheers
David / dhildenb
Cheers, Lorenzo
So in case the page table got reused in the meantime, we should just back off and be fine, right?
The shared page table is mapped with a PUD entry, and we don't check whether the PUD entry changed here.
Could we simply put a PUD check in there sensibly?
A PUD check would only work if we are guaranteed that the page table will not get freed in the meantime; otherwise we might be walking garbage, trying to interpret garbage as PMDs, etc.
That would require RCU freeing of page tables, which we are not guaranteed to have IIRC.
The easiest approach is probably to simply never reuse shared page tables.
If there is consensus on that I can try to see if I can make it fly easily.
On Mon, Oct 20, 2025 at 07:18:18PM +0200, David Hildenbrand wrote:
So in case the page table got reused in the meantime, we should just back off and be fine, right?
The shared page table is mapped with a PUD entry, and we don't check whether the PUD entry changed here.
Could we simply put a PUD check in there sensibly?
A PUD check would only work if we are guaranteed that the page table will not get freed in the meantime; otherwise we might be walking garbage, trying to interpret garbage as PMDs, etc.
That would require RCU freeing of page tables, which we are not guaranteed to have IIRC.
Ack. Yeah, Suren is working on this :) but I don't know when that'll land.
The easiest approach is probably to simply never reuse shared page tables.
If there is consensus on that I can try to see if I can make it fly easily.
That'd be good - if it's viable, I'd like to see you put something forward :)
-- Cheers
David / dhildenb
Cheers, Lorenzo