On 8/15/25 02:16, Baolu Lu wrote:
On 8/8/2025 10:57 AM, Tian, Kevin wrote:
pud_free_pmd_page() ... for (i = 0; i < PTRS_PER_PMD; i++) { if (!pmd_none(pmd_sv[i])) { pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); pte_free_kernel(&init_mm, pte); } }
free_page((unsigned long)pmd_sv);
Otherwise the risk still exists if the pmd page is repurposed before the pte work is scheduled.
You're right that freeing high-level page table pages also requires an IOTLB flush before the pages are freed. But I question the practical risk of the race given the extremely small time window.
I hear that Linux is gaining popularity these days. There might even be dozens of users! Given that large scale of dozens (or even hundreds??) of users, I would suggest exercising some care. The race might be small but it only needs to happen once to cause chaos.
Seriously, though... A race is a race. Preemption or interrupts or SMIs or VMExits or a million other things can cause a "small time window" to become a big time window.
Even perceived small races need to be fixed.
If this is a real concern, a potential mitigation would be to clear the U/S bits in all page table entries for kernel address space? But I am not confident in making that change at this time as I am unsure of the side effects it might cause.
That doesn't do any good. I even went as far as double-checking months ago with the IOMMU hardware folks to confirm the actual implementation. I'm really surprised this is being brought up again.
another observation - pte_free_kernel is not used in remove_pagetable () and __change_page_attr(). Is it straightforward to put it in those paths or do we need duplicate some deferring logic there?
The remove_pagetable() function is called in the path where memory is hot-removed from the system, right?
No. Not right.
This is in the vmalloc() code: the side of things that _creates_ mappings for new allocations, not tears them down.