On Mon, 09 Dec 2024 13:04:43 +0100 Valentin Schneider <vschneid@redhat.com> wrote:
On 05/12/24 18:31, Petr Tesarik wrote:
On Thu, 21 Nov 2024 16:30:16 +0100 Peter Zijlstra <peterz@infradead.org> wrote:
On Thu, Nov 21, 2024 at 07:07:44AM -0800, Dave Hansen wrote:
On 11/21/24 03:12, Peter Zijlstra wrote:
I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag, and it correctly uses the non-deferrable flush_tlb_kernel_range().
I always forget what we use global pages for, dhansen might know, but let me try and have a look.
I *think* we only have GLOBAL on kernel text, and that only sometimes.
I think you're remembering how _PAGE_GLOBAL gets used when KPTI is in play.
Yah, I suppose I am. That was the last time I had a good look at this stuff :-)
Ignoring KPTI for a sec... We use _PAGE_GLOBAL for all kernel mappings. Before PCIDs, global mappings let the kernel TLB entries live across CR3 writes. When PCIDs are in play, global mappings let two different ASIDs share TLB entries.
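To make the semantics above concrete, here is a toy user-space model of a TLB. It is only a sketch: it is not kernel code, every name in it (tlb_insert, tlb_hit, model_cr3_write, model_cr4_pge_toggle) is hypothetical, and model_cr3_write() assumes the no-PCID case where a CR3 write drops all non-global entries.

```c
#include <stdbool.h>
#include <stddef.h>

#define TLB_SIZE 8

/* Toy TLB entry; "global" stands in for _PAGE_GLOBAL. */
struct tlb_entry {
	bool valid;
	bool global;
	unsigned int pcid;	/* address-space tag */
	unsigned long vpn;	/* virtual page number */
};

static struct tlb_entry tlb[TLB_SIZE];

/* Hypothetical helper: fill one slot of the toy TLB. */
void tlb_insert(size_t slot, unsigned long vpn, unsigned int pcid, bool global)
{
	tlb[slot] = (struct tlb_entry){
		.valid = true, .global = global, .pcid = pcid, .vpn = vpn,
	};
}

/* A hit needs a matching page and either a global entry or a matching PCID. */
bool tlb_hit(unsigned long vpn, unsigned int pcid)
{
	for (size_t i = 0; i < TLB_SIZE; i++) {
		if (tlb[i].valid && tlb[i].vpn == vpn &&
		    (tlb[i].global || tlb[i].pcid == pcid))
			return true;
	}
	return false;
}

/* Assumed no-PCID behaviour: a CR3 write drops every non-global entry. */
void model_cr3_write(void)
{
	for (size_t i = 0; i < TLB_SIZE; i++) {
		if (!tlb[i].global)
			tlb[i].valid = false;
	}
}

/* Toggling CR4.PGE drops everything, global entries included. */
void model_cr4_pge_toggle(void)
{
	for (size_t i = 0; i < TLB_SIZE; i++)
		tlb[i].valid = false;
}
```

In this model a global entry inserted under one PCID still hits under another and survives model_cr3_write(); only model_cr4_pge_toggle() gets rid of it.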
Hurmph.. bah. That means we do need that horrible CR4 dance :/
In general, yes.
But I wonder what exactly was the original scenario encountered by Valentin. I mean, if TLB entry invalidations were necessary to sync changes to kernel text after flipping a static branch, then it might be less overhead to make a list of affected pages and call INVLPG on them.
AFAIK there is currently no IPI function for doing that, but we could add one, provided the list of invalidated global pages is reasonably short, of course.
Valentin, do you happen to know?
So from my experimentation (hackbench + kernel compilation on housekeeping CPUs, dummy while(1) userspace loop on isolated CPUs), the TLB flushes only occurred from vunmap() - mainly from all the hackbench threads coming and going.
Static branch updates only seem to trigger the sync_core() IPI, at least on x86.
Thank you, this is helpful.
So, these allocations span more than tlb_single_page_flush_ceiling pages (default 33). Is THP enabled? If yes, we could possibly get below that threshold by improving flushing of huge pages (cf. footnote [1] in Documentation/arch/x86/tlb.rst).
OTOH even though a series of INVLPG may reduce subsequent TLB misses, it will not exactly improve latency, so it would go against the main goal of this whole patch series.
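For reference, the ceiling check I have in mind can be sketched in user space roughly like this. It is only loosely modeled on the range-flush decision, the helper names are made up, and the huge-page case assumes (as the footnote cited above suggests) that a single INVLPG could cover a whole 2 MiB page.

```c
#include <stdbool.h>

#define PAGE_SHIFT	12	/* 4 KiB base pages */
#define PMD_SHIFT	21	/* 2 MiB huge pages */

/* Default value quoted in this thread; a tunable in the real kernel. */
unsigned long tlb_single_page_flush_ceiling = 33;

/*
 * Hypothetical helper: true when the range is too large for per-page
 * invalidation and the model falls back to a full flush.
 */
bool wants_full_flush(unsigned long start, unsigned long end)
{
	return (end - start) > (tlb_single_page_flush_ceiling << PAGE_SHIFT);
}

/*
 * Hypothetical helper: number of INVLPGs needed to cover the range if
 * one INVLPG invalidates one page of size (1 << page_shift).
 */
unsigned long invlpg_count(unsigned long start, unsigned long end,
			   unsigned int page_shift)
{
	unsigned long size = 1UL << page_shift;

	return (end - start + size - 1) >> page_shift;
}
```

With these numbers a 2 MiB range counts as 512 base pages and lands far above the ceiling, while invlpg_count(start, end, PMD_SHIFT) gives 1 for the same range. That is the gap better huge-page flushing could close, at the cost of the per-INVLPG latency mentioned above.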
Hmmm... I see, the CR4 dance is the best solution after all. :-|
Petr T