On Tue, 10 Dec 2024 14:53:36 +0100
Valentin Schneider <vschneid@redhat.com> wrote:

> On 09/12/24 15:42, Petr Tesarik wrote:
> > On Mon, 9 Dec 2024 13:12:49 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Mon, Dec 09, 2024 at 01:04:43PM +0100, Valentin Schneider wrote:
> > > > > But I wonder what exactly was the original scenario encountered by Valentin. I mean, if TLB entry invalidations were necessary to sync changes to kernel text after flipping a static branch, then it might be less overhead to make a list of the affected pages and call INVLPG on them.
> > >
> > > No; TLB is not involved with text patching (on x86).
> >
> > Valentin, do you happen to know?
>
> So from my experimentation (hackbench + kernel compilation on housekeeping CPUs, dummy while(1) userspace loop on isolated CPUs), the TLB flushes only occurred from vunmap() - mainly from all the hackbench threads coming and going.
>
> > > Right, we have virtually mapped stacks.
> >
> > Wait... Are you talking about the kernel stack? But that's only 4 pages (or 8 pages with KASAN), so that should be easily handled with INVLPG. No CR4 dances are needed for that.
> >
> > What am I missing?
>
> So the gist of the IPI deferral thing is to coalesce IPI callbacks into a single flag value that is read & acted on upon kernel entry. Freeing a task's kernel stack is not the only thing that can issue a vunmap(), so
Thank you for confirming it's not the kernel stack. Peter's remark left me a little confused.
> instead of tracking all the pages affected by the unmap (which is potentially an ever-growing memory leak as long as no kernel entry happens on the isolated CPUs), we just flush everything.
Yes, this makes some sense. There is no way to avoid the cost, of course; we can only defer it to a "more suitable" point in time, and the current low-latency requirements make kernel entry a better point than an IPI. It is at least more predictable (as long as device interrupts are routed to other CPUs).
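Just so we are on the same page, below is how I picture the coalescing scheme. It is a minimal sketch of my understanding only; all identifiers (deferred_ipi_work, do_deferred_ipi_work() and so on) are invented for illustration and are NOT taken from your series:

#include <linux/atomic.h>
#include <linux/bits.h>
#include <linux/percpu.h>
#include <asm/tlbflush.h>

#define DEFERRED_TLB_FLUSH	BIT(0)

/* One flag word per CPU; remote callers OR in work bits instead of
 * sending an IPI to an isolated CPU. */
static DEFINE_PER_CPU(atomic_t, deferred_ipi_work);

/* Sender side: coalesce the request into the per-CPU flag. */
static void defer_tlb_flush_ipi(int cpu)
{
	atomic_or(DEFERRED_TLB_FLUSH, per_cpu_ptr(&deferred_ipi_work, cpu));
}

/* Target CPU, early in the user->kernel transition. */
static void do_deferred_ipi_work(void)
{
	int work = atomic_xchg(this_cpu_ptr(&deferred_ipi_work), 0);

	if (work & DEFERRED_TLB_FLUSH)
		__flush_tlb_all();	/* no page list kept: flush everything */
}

If that matches the series, then I agree that an all-or-nothing flush is the natural fit for a single flag bit.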
I have looked into ways to reduce the number of TLB misses _after_ flushing the TLB. FWIW, if we decide to track the to-be-flushed pages, we only need an array of at most tlb_single_page_flush_ceiling entries; if more pages are unmapped, flushing the entire TLB is believed to be cheaper anyway. That is, I merely suggest reusing the same logic that is already implemented by flush_tlb_kernel_range().
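To illustrate, here is a minimal, untested sketch of that bookkeeping, shown for a single CPU to keep it short (the real thing would have to be per-CPU and synchronize the remote writer with the local reader). Only tlb_single_page_flush_ceiling, flush_tlb_one_kernel() and __flush_tlb_all() are real kernel symbols; everything else is invented for illustration:

#include <linux/limits.h>
#include <asm/tlbflush.h>

#define DEFER_CEILING	33		/* default tlb_single_page_flush_ceiling */
#define DEFER_ALL	UINT_MAX	/* sentinel: flush everything */

static unsigned long deferred_pages[DEFER_CEILING];
static unsigned int deferred_nr;

/* Remote side: record one unmapped page instead of sending an IPI. */
static void defer_page_flush(unsigned long addr)
{
	if (deferred_nr < DEFER_CEILING)
		deferred_pages[deferred_nr++] = addr;
	else
		deferred_nr = DEFER_ALL;	/* list full: stop tracking */
}

/* Local side, on kernel entry: replay whatever was deferred. */
static void flush_deferred_pages(void)
{
	unsigned int i;

	if (deferred_nr == DEFER_ALL) {
		__flush_tlb_all();	/* past the ceiling, full flush is cheaper */
	} else {
		for (i = 0; i < deferred_nr; i++)
			flush_tlb_one_kernel(deferred_pages[i]);
	}
	deferred_nr = 0;
}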
Anyway, since there is no easy trick here, let's leave this discussion for a later optimization round. I definitely do not want to block progress on this patch series.
Thanks for all your input!
Petr T