On 20/01/25 12:15, Uladzislau Rezki wrote:
On Fri, Jan 17, 2025 at 06:00:30PM +0100, Valentin Schneider wrote:
On 17/01/25 17:11, Uladzislau Rezki wrote:
On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
On 14/01/25 19:16, Jann Horn wrote:
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
vunmap() operations issued from housekeeping CPUs are a relatively common source of interference for isolated NOHZ_FULL CPUs, which get hit by the resulting flush_tlb_kernel_range() IPIs.
Given that CPUs executing in userspace do not access data in the vmalloc range, these IPIs could be deferred until their next kernel entry.
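A minimal sketch of the shape of this (the helper ct_set_cpu_work() and the CT_WORK_TLB_FLUSH flag are illustrative assumptions, not the series' actual API): the flush path flags CPUs that are currently executing in userspace for a deferred flush, and only IPIs the rest.

/*
 * Minimal sketch of the deferral idea; ct_set_cpu_work() and
 * CT_WORK_TLB_FLUSH are hypothetical names for "queue deferred
 * work on a CPU that is currently in userspace".
 */
static void do_flush_tlb_all(void *info)
{
        __flush_tlb_all();
}

static void flush_tlb_kernel_range_deferrable(unsigned long start,
                                              unsigned long end)
{
        cpumask_var_t to_ipi;
        int cpu;

        if (!zalloc_cpumask_var(&to_ipi, GFP_ATOMIC)) {
                flush_tlb_kernel_range(start, end);
                return;
        }

        for_each_online_cpu(cpu) {
                /*
                 * A CPU executing in userspace cannot access the
                 * vmalloc range; flag it so the flush runs on its
                 * next kernel entry rather than IPIing it now.
                 */
                if (!ct_set_cpu_work(cpu, CT_WORK_TLB_FLUSH))
                        cpumask_set_cpu(cpu, to_ipi);
        }

        on_each_cpu_mask(to_ipi, do_flush_tlb_all, NULL, 1);
        free_cpumask_var(to_ipi);
}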
Deferral vs early entry danger zone
This requires a guarantee that nothing in the vmalloc range can be vunmap'd and then accessed in early entry code.
In other words, it needs a guarantee that no vmalloc allocations that have been created in the vmalloc region while the CPU was idle can then be accessed during early entry, right?
I'm not sure if that would be a problem (not an mm expert, please do correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't deferred anyway.
So after vmapping something, I wouldn't expect isolated CPUs to have invalid TLB entries for the newly vmapped page.
However, upon vunmap'ing something, the TLB flush is deferred, and thus stale TLB entries can and will remain on isolated CPUs, up until they execute the deferred flush themselves (IOW for the entire duration of the "danger zone").
Does that make sense?
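To make the vunmap side concrete, the entry-side counterpart of such a deferred flush would, roughly speaking, look like this (again using the illustrative CT_WORK_TLB_FLUSH flag from the sketch above, not a real API):

/*
 * Runs early on kernel entry, before any vmalloc'd data can be
 * accessed, so any stale TLB entries left by a deferred vunmap
 * flush are gone by the time they could matter.
 */
static void ct_work_flush(unsigned long work)
{
        if (work & CT_WORK_TLB_FLUSH)
                flush_tlb_all();
}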
Probably I am missing something and need to have a look at your patches, but how do you guarantee that no one maps the same area that you defer the TLB flush for?
That's the cool part: I don't :')
Indeed, sounds unsafe :) Then we simply should not free such areas.
For deferring instruction-patching IPIs, I (well, Josh really) managed to get instrumentation to back me up and catch any problematic area.
I looked into getting something similar for vmalloc region access in .noinstr code, but I didn't get anywhere. I even tried using emulated watchpoints on QEMU to watch the whole vmalloc range, but that went about as well as you could expect.
That left me with staring at code. AFAICT the only vmap'd thing that is accessed during early entry is the task stack (CONFIG_VMAP_STACK), which itself cannot be freed until the task exits, and thus can't be subject to invalidation when a task is entering kernelspace.
If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).
As noted before, we defer flushing for vmalloc. We have a lazy threshold which can be exposed (if you need it) over sysfs for tuning. So we can add it.
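For reference, the current threshold heuristic in mm/vmalloc.c is roughly the following (a simplified sketch of the in-tree code); a sysfs knob would scale this value:

/*
 * Simplified form of the lazy-flush threshold: vunmap'd areas
 * accumulate until their total size crosses this many pages, at
 * which point the deferred TLB flush is issued over all of them.
 */
static unsigned long lazy_max_pages(void)
{
        unsigned int log = fls(num_online_cpus());

        return log * (32UL * 1024 * 1024 / PAGE_SIZE);
}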
In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a single userspace application that will never enter the kernel, unless forced to by some interference (e.g. IPI sent from a housekeeping CPU).
Increasing the lazy threshold would unfortunately only delay the interference - housekeeping CPUs are free to run whatever, and so they will eventually cause the lazy threshold to be hit and IPI all the CPUs, including the isolated/NOHZ_FULL ones.
I was thinking maybe we could subdivide the vmap space into two regions with their own thresholds, but a task may allocate/vmap stuff while on a HK CPU and then be moved to an isolated CPU afterwards. And I still don't have any strong guarantee about what accesses an isolated CPU can do in its early entry code :(
-- Uladzislau Rezki