On 8/15/25 22:31, Dave Hansen wrote:
On 8/15/25 02:16, Baolu Lu wrote:
On 8/8/2025 10:57 AM, Tian, Kevin wrote:
pud_free_pmd_page() ... for (i = 0; i < PTRS_PER_PMD; i++) { if (!pmd_none(pmd_sv[i])) { pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); pte_free_kernel(&init_mm, pte); } }
free_page((unsigned long)pmd_sv);
Otherwise the risk still exists if the pmd page is repurposed before the pte work is scheduled.
You're right that freeing high-level page table pages also requires an IOTLB flush before the pages are freed. But I question the practical risk of the race given the extremely small time window.
I hear that Linux is gaining popularity these days. There might even be dozens of users! Given that large scale of dozens (or even hundreds??) of users, I would suggest exercising some care. The race might be small but it only needs to happen once to cause chaos.
Seriously, though... A race is a race. Preemption or interrupts or SMIs or VMExits or a million other things can cause a "small time window" to become a big time window.
Even perceived small races need to be fixed.
Yes, agreed.
If this is a real concern, a potential mitigation would be to clear the U/S bits in all page table entries for kernel address space? But I am not confident in making that change at this time as I am unsure of the side effects it might cause.
That doesn't do any good. I even went as far as double-checking months ago with the IOMMU hardware folks to confirm the actual implementation. I'm really surprised this is being brought up again.
It should not be brought up again. Sorry about it.
I was thinking about deferring pte page table page free plus clearing U/S bit. That's also not desirable anyway, so let's ignore it.
another observation - pte_free_kernel is not used in remove_pagetable () and __change_page_attr(). Is it straightforward to put it in those paths or do we need duplicate some deferring logic there?
The remove_pagetable() function is called in the path where memory is hot-removed from the system, right?
No. Not right.
This is in the vmalloc() code: the side of things that _creates_ mappings for new allocations, not tears them down.
Yeah, let me look into it.
Thanks, baolu