On Tue, 10 Dec 2024 14:53:36 +0100
Valentin Schneider <vschneid@redhat.com> wrote:

> On 09/12/24 15:42, Petr Tesarik wrote:
> > On Mon, 9 Dec 2024 13:12:49 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Mon, Dec 09, 2024 at 01:04:43PM +0100, Valentin Schneider wrote:
> > > > > But I wonder what exactly was the original scenario encountered by Valentin. I mean, if TLB entry invalidations were necessary to sync changes to kernel text after flipping a static branch, then it might be less overhead to make a list of the affected pages and call INVLPG on them.
> > >
> > > No; TLB is not involved with text patching (on x86).
> >
> > Valentin, do you happen to know?
>
> So from my experimentation (hackbench + kernel compilation on housekeeping CPUs, dummy while(1) userspace loop on isolated CPUs), the TLB flushes only occurred from vunmap() - mainly from all the hackbench threads coming and going.
>
> > > Right, we have virtually mapped stacks.
> >
> > Wait... Are you talking about the kernel stack? But that's only 4 pages (or 8 pages with KASAN), so that should be easily handled with INVLPG. No CR4 dances are needed for that.
> >
> > What am I missing?
>
> So the gist of the IPI deferral thing is to coalesce IPI callbacks into a single flag value that is read & acted on upon kernel entry. Freeing a task's kernel stack is not the only thing that can issue a vunmap(), so
Thank you for confirming it's not the kernel stack. Peter's remark left me a little confused.
> instead of tracking all the pages affected by the unmap (which is potentially an ever-growing memory leak as long as no kernel entry happens on the isolated CPUs), we just flush everything.
Yes, this makes some sense. There is no way to avoid the cost, of course; we can only defer it to a "more suitable" point in time, and the current low-latency requirements make kernel entry a better point than an IPI. It is at least more predictable (as long as device interrupts are routed to other CPUs).
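Just so we are on the same page, below is how I picture the coalescing scheme. It is a minimal sketch of my understanding only; all identifiers (deferred_ipi_work, do_deferred_ipi_work() and so on) are invented for illustration and are NOT taken from your series:

#include <linux/atomic.h>
#include <linux/bits.h>
#include <linux/percpu.h>
#include <asm/tlbflush.h>

#define DEFERRED_TLB_FLUSH	BIT(0)

/* One flag word per CPU; remote callers OR in work bits instead of
 * sending an IPI to an isolated CPU. */
static DEFINE_PER_CPU(atomic_t, deferred_ipi_work);

/* Sender side: coalesce the request into the per-CPU flag. */
static void defer_tlb_flush_ipi(int cpu)
{
	atomic_or(DEFERRED_TLB_FLUSH, per_cpu_ptr(&deferred_ipi_work, cpu));
}

/* Target CPU, early in the user->kernel transition. */
static void do_deferred_ipi_work(void)
{
	int work = atomic_xchg(this_cpu_ptr(&deferred_ipi_work), 0);

	if (work & DEFERRED_TLB_FLUSH)
		__flush_tlb_all();	/* no page list kept: flush everything */
}

If that matches the series, then I agree that an all-or-nothing flush is the natural fit for a single flag bit.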
I have looked into ways to reduce the number of TLB misses _after_ flushing the TLB. FWIW, if we decide to track the to-be-flushed pages, we only need an array of at most tlb_single_page_flush_ceiling entries; if more pages are unmapped, flushing the entire TLB is believed to be cheaper anyway. That is, I merely suggest reusing the same logic that is already implemented by flush_tlb_kernel_range().
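To illustrate, here is a minimal, untested sketch of that bookkeeping, shown for a single CPU to keep it short (the real thing would have to be per-CPU and synchronize the remote writer with the local reader). Only tlb_single_page_flush_ceiling, flush_tlb_one_kernel() and __flush_tlb_all() are real kernel symbols; everything else is invented for illustration:

#include <linux/limits.h>
#include <asm/tlbflush.h>

#define DEFER_CEILING	33		/* default tlb_single_page_flush_ceiling */
#define DEFER_ALL	UINT_MAX	/* sentinel: flush everything */

static unsigned long deferred_pages[DEFER_CEILING];
static unsigned int deferred_nr;

/* Remote side: record one unmapped page instead of sending an IPI. */
static void defer_page_flush(unsigned long addr)
{
	if (deferred_nr < DEFER_CEILING)
		deferred_pages[deferred_nr++] = addr;
	else
		deferred_nr = DEFER_ALL;	/* list full: stop tracking */
}

/* Local side, on kernel entry: replay whatever was deferred. */
static void flush_deferred_pages(void)
{
	unsigned int i;

	if (deferred_nr == DEFER_ALL) {
		__flush_tlb_all();	/* past the ceiling, full flush is cheaper */
	} else {
		for (i = 0; i < deferred_nr; i++)
			flush_tlb_one_kernel(deferred_pages[i]);
	}
	deferred_nr = 0;
}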
Anyway, since there is no easy trick here, let's leave this discussion for a later optimization round. I definitely do not want to block progress on this patch series.
Thanks for all your input!
Petr T