Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:
64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
smp_call_function_many_cond+0x1
smp_call_function+0x39
on_each_cpu+0x2a
flush_tlb_kernel_range+0x7b
__purge_vmap_area_lazy+0x70
_vm_unmap_aliases.part.42+0xdf
change_page_attr_set_clr+0x16a
set_memory_ro+0x26
bpf_int_jit_compile+0x2f9
bpf_prog_select_runtime+0xc6
bpf_prepare_filter+0x523
sk_attach_filter+0x13
sock_setsockopt+0x92c
__sys_setsockopt+0x16a
__x64_sys_setsockopt+0x20
do_syscall_64+0x87
entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, providing it can be executed "early
enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.
Deferral approach
=================
Storing each and every callback, like a secondary call_single_queue turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.
Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
What about IPIs whose callback take a parameter, you may ask?
Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.
This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
Kernel entry vs execution of the deferred operation
===================================================
This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].
There is a non-zero length of code that is executed upon kernel entry before the
deferred operation can be itself executed (before we start getting into
context_tracking.c proper), i.e.:
idtentry_func_foo() <--- we're in the kernel
irqentry_enter()
enter_from_user_mode()
__ct_user_exit()
ct_kernel_enter_state()
ct_work_flush() <--- deferred operation is executed here
This means one must take extra care to what can happen in the early entry code,
and that <bad things> cannot happen. For instance, we really don't want to hit
instructions that have been modified by a remote text_poke() while we're on our
way to execute a deferred sync_core(). Patches doing the actual deferral have
more detail on this.
Where are we at with this whole thing?
======================================
Dave has been incredibly helpful wrt figuring out what would and wouldn't
(mostly that) be safe to do for deferring kernel range TLB flush IPIs, see [5].
Long story short, there are ugly things I can still do to (safely) defer the TLB
flush IPIs, but it's going to be a long session of pulling my own hair out, and
I got plenty so I won't be done for a while.
In the meantime, I think everything leading up to deferring text poke IPIs is
sane-ish and could get in. I'm not the biggest fan of adding an API with a
single user, but hey, I've been working on this for "a little while" now and
I'll still need to get the other IPIs sorted out.
TL;DR: Text patching IPI deferral LGTM so here it is for now, I'm still working
on the TLB flush thing.
Patches
=======
o Patches 1-2 are standalone objtool cleanups.
o Patches 3-4 add an RCU testing feature.
o Patches 5-6 add infrastructure for annotating static keys and static calls
that may be used in noinstr code (courtesy of Josh).
o Patches 7-20 use said annotations on relevant keys / calls.
o Patch 21 enforces proper usage of said annotations (courtesy of Josh).
o Patches 22-23 deal with detecting NOINSTR text in modules
o Patches 24-25 add the actual IPI deferral faff
Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v5
Testing
=======
Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL10 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
$ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
rteval --onlyload --loads-cpulist=$HK_CPUS \
--hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 3 hours.
v6.14
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
93 callback=generic_smp_call_function_single_interrupt+0x0
22 callback=nohz_full_kick_func+0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
1456 func=do_flush_tlb_all
78 func=do_sync_core
33 func=nohz_full_kick_func
26 func=do_kernel_range_flush
v6.14 + patches
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
86 callback=generic_smp_call_function_single_interrupt+0x0
41 callback=nohz_full_kick_func+0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
1378 func=do_flush_tlb_all
33 func=nohz_full_kick_func
So the TLB flush is still there driving most of the IPIs, but at least the
instruction patching IPIs are gone. With kernel TLB flushes deferred, there are
no IPIs sent to isolated CPUs in that 3hr window, but as stated above that still
needs some more work.
Also note that tlb_remove_table_smp_sync() showed up during testing of v3, and
has gone as mysteriously as it showed up. Yair had a series adressing this [6]
which per these results would be worth revisiting.
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o All of the folks who attended various (too many?) talks about this and
provided precious feedback.
o The mm folks for pointing out what I can and can't do with TLB flushes
Links
=====
[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: http://lore.kernel.org/r/eef09bdc-7546-462b-9ac0-661a44d2ceae@intel.com
[6]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/
Revisions
=========
v4 -> v5
++++++++
o Rebased onto v6.15-rc3
o Collected Reviewed-by
o Annotated a few more static keys
o Added proper checking of noinstr sections that are in loadable code such as
KVM early entry (Sean Christopherson)
o Switched to checking for CT_RCU_WATCHING instead of CT_STATE_KERNEL or
CT_STATE_IDLE, which means deferral is now behaving sanely for IRQ/NMI
entry from idle (thanks to Frederic!)
o Ditched the vmap TLB flush deferral (for now)
RFCv3 -> v4
+++++++++++
o Rebased onto v6.13-rc6
o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)
o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups
o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ
RFCv2 -> RFCv3
++++++++++++++
o Rebased onto v6.12-rc6
o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
rcutorture case for a low-size counter (Paul)
o Fixed flush_tlb_kernel_range_deferrable() definition
Josh Poimboeuf (3):
jump_label: Add annotations for validating noinstr usage
static_call: Add read-only-after-init static calls
objtool: Add noinstr validation for static branches/calls
Valentin Schneider (22):
objtool: Make validate_call() recognize indirect calls to pv_ops[]
objtool: Flesh out warning related to pv_ops[] calls
rcu: Add a small-width RCU watching counter debug option
rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
x86/idle: Mark x86_idle static call as __ro_after_init
x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
sched/clock: Mark sched_clock_running key as __ro_after_init
KVM: VMX: Mark __kvm_is_using_evmcs static key as __ro_after_init
x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr
sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
KVM: VMX: Mark vmx_l1d_should flush and vmx_l1d_flush_cond keys as
allowed in .noinstr
stackleack: Mark stack_erasing_bypass key as allowed in .noinstr
module: Remove outdated comment about text_size
module: Add MOD_NOINSTR_TEXT mem_type
context-tracking: Introduce work deferral infrastructure
context_tracking,x86: Defer kernel text patching IPIs
arch/Kconfig | 9 ++
arch/arm/kernel/paravirt.c | 2 +-
arch/arm64/kernel/paravirt.c | 2 +-
arch/loongarch/kernel/paravirt.c | 2 +-
arch/riscv/kernel/paravirt.c | 2 +-
arch/x86/Kconfig | 1 +
arch/x86/events/amd/brs.c | 2 +-
arch/x86/include/asm/context_tracking_work.h | 18 +++
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/kernel/alternative.c | 39 ++++++-
arch/x86/kernel/cpu/bugs.c | 2 +-
arch/x86/kernel/kprobes/core.c | 4 +-
arch/x86/kernel/kprobes/opt.c | 4 +-
arch/x86/kernel/module.c | 2 +-
arch/x86/kernel/paravirt.c | 4 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 11 +-
arch/x86/kvm/vmx/vmx_onhyperv.c | 2 +-
include/asm-generic/sections.h | 15 +++
include/linux/context_tracking.h | 21 ++++
include/linux/context_tracking_state.h | 54 +++++++--
include/linux/context_tracking_work.h | 26 +++++
include/linux/jump_label.h | 30 ++++-
include/linux/module.h | 6 +-
include/linux/objtool.h | 7 ++
include/linux/static_call.h | 19 ++++
kernel/context_tracking.c | 69 +++++++++++-
kernel/kprobes.c | 8 +-
kernel/module/main.c | 85 ++++++++++----
kernel/rcu/Kconfig.debug | 15 +++
kernel/sched/clock.c | 7 +-
kernel/stackleak.c | 6 +-
kernel/time/Kconfig | 5 +
tools/objtool/Documentation/objtool.txt | 34 ++++++
tools/objtool/check.c | 106 +++++++++++++++---
tools/objtool/include/objtool/check.h | 1 +
tools/objtool/include/objtool/elf.h | 1 +
tools/objtool/include/objtool/special.h | 1 +
tools/objtool/special.c | 15 ++-
.../selftests/rcutorture/configs/rcu/TREE04 | 1 +
40 files changed, 557 insertions(+), 84 deletions(-)
create mode 100644 arch/x86/include/asm/context_tracking_work.h
create mode 100644 include/linux/context_tracking_work.h
--
2.49.0
Nolibc is useful for selftests as the test programs can be very small,
and compiled with just a kernel crosscompiler, without userspace support.
Currently nolibc is only usable with kselftest.h, not the more
convenient to use kselftest_harness.h
This series provides this compatibility by adding new features to nolibc
and removing the usage of problematic features from the harness.
The first half of the series are changes to the harness, the second one
are for nolibc. Both parts are very independent and should go through
different trees.
The last patch is not meant to be applied and serves as test that
everything works together correctly.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v3:
- Send patches to correct kselftest harness maintainers
- Move harness selftest to dedicated directory
- Add harness selftest to MAINTAINERS
- Integrate harness selftest cleanup with the selftest framework
- Consistently use "kselftest harness" in commit messages
- Properly propagate kselftest harness failure
- Link to v2: https://lore.kernel.org/r/20250407-nolibc-kselftest-harness-v2-0-f8812f76e9…
Changes in v2:
- Rebase unto v6.15-rc1
- Rename internal nolibc symbols
- Handle edge case of waitpid(INT_MIN) == ESRCH
- Fix arm configurations for final testing patch
- Clean up global getopt.h variable declarations
- Add Acks from Willy
- Link to v1: https://lore.kernel.org/r/20250304-nolibc-kselftest-harness-v1-0-adca7cd231…
---
Thomas Weißschuh (32):
selftests: harness: Add kselftest harness selftest
selftests: harness: Use C89 comment style
selftests: harness: Ignore unused variant argument warning
selftests: harness: Mark functions without prototypes static
selftests: harness: Remove inline qualifier for wrappers
selftests: harness: Remove dependency on libatomic
selftests: harness: Implement test timeouts through pidfd
selftests: harness: Don't set setup_completed for fixtureless tests
selftests: harness: Always provide "self" and "variant"
selftests: harness: Move teardown conditional into test metadata
selftests: harness: Add teardown callback to test metadata
selftests: harness: Stop using setjmp()/longjmp()
selftests: harness: Guard includes on nolibc
tools/nolibc: handle intmax_t/uintmax_t in printf
tools/nolibc: use intmax definitions from compiler
tools/nolibc: use pselect6_time64 if available
tools/nolibc: use ppoll_time64 if available
tools/nolibc: add tolower() and toupper()
tools/nolibc: add _exit()
tools/nolibc: add setpgrp()
tools/nolibc: implement waitpid() in terms of waitid()
Revert "selftests/nolibc: use waitid() over waitpid()"
tools/nolibc: add dprintf() and vdprintf()
tools/nolibc: add getopt()
tools/nolibc: allow different write callbacks in printf
tools/nolibc: allow limiting of printf destination size
tools/nolibc: add snprintf() and friends
selftests/nolibc: use snprintf() for printf tests
selftests/nolibc: rename vfprintf test suite
selftests/nolibc: add test for snprintf() truncation
tools/nolibc: implement width padding in printf()
HACK: selftests/nolibc: demonstrate usage of the kselftest harness
MAINTAINERS | 1 +
tools/include/nolibc/Makefile | 1 +
tools/include/nolibc/getopt.h | 101 ++
tools/include/nolibc/nolibc.h | 1 +
tools/include/nolibc/stdint.h | 4 +-
tools/include/nolibc/stdio.h | 127 +-
tools/include/nolibc/string.h | 17 +
tools/include/nolibc/sys.h | 105 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/kselftest_harness.h | 181 +-
.../testing/selftests/kselftest_harness/.gitignore | 2 +
tools/testing/selftests/kselftest_harness/Makefile | 7 +
.../selftests/kselftest_harness/harness-selftest.c | 129 ++
.../kselftest_harness/harness-selftest.expected | 62 +
.../kselftest_harness/harness-selftest.sh | 13 +
tools/testing/selftests/nolibc/Makefile | 13 +-
tools/testing/selftests/nolibc/harness-selftest.c | 1 +
tools/testing/selftests/nolibc/nolibc-test.c | 1729 +-------------------
tools/testing/selftests/nolibc/run-tests.sh | 2 +-
19 files changed, 637 insertions(+), 1860 deletions(-)
---
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
change-id: 20250130-nolibc-kselftest-harness-8b2c8cac43bf
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Cong reported a warning when running ./test_sockmp:
https://lore.kernel.org/bpf/aAmIi0vlycHtbXeb@pop-os.localdomain/T/#t
------------[ cut here ]------------
WARNING: CPU: 1 PID: 40 at net/ipv4/af_inet.c inet_sock_destruct+0x173/0x1d5
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
Workqueue: events sk_psock_destroy
RIP: 0010:inet_sock_destruct+0x173/0x1d5
RSP: 0018:ffff8880085cfc18 EFLAGS: 00010202
RAX: 1ffff11003dbfc00 RBX: ffff88801edfe3e8 RCX: ffffffff822f5af4
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88801edfe16c
RBP: ffff88801edfe184 R08: ffffed1003dbfc31 R09: 0000000000000000
R10: ffffffff822f5ab7 R11: ffff88801edfe187 R12: ffff88801edfdec0
R13: ffff888020376ac0 R14: ffff888020376ac0 R15: ffff888020376a60
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556365155830 CR3: 000000001d6aa000 CR4: 0000000000350ef0
Call Trace:
<TASK>
__sk_destruct+0x46/0x222
sk_psock_destroy+0x22f/0x242
process_one_work+0x504/0x8a8
? process_one_work+0x39d/0x8a8
? __pfx_process_one_work+0x10/0x10
? worker_thread+0x44/0x2ae
? __list_add_valid_or_report+0x83/0xea
? srso_return_thunk+0x5/0x5f
? __list_add+0x45/0x52
process_scheduled_works+0x73/0x82
worker_thread+0x1ce/0x2ae
When we specify apply_bytes, we divide the msg into multiple segments,
each with a length of 'send', and every time we send this part of the data
using tcp_bpf_sendmsg_redir(), we use sk_msg_return_zero() to uncharge the
memory of the specified 'send' size.
However, if the first segment of data fails to send, for example, the
peer's buffer is full, we need to release all of the msg. When releasing
the msg, we haven't uncharged the memory of the subsequent segments.
This modification does not make significant logical changes, but only
fills in the missing uncharge places.
This issue has existed all along, until it was exposed after we added the
apply test in test_sockmap:
commit 3448ad23b34e ("selftests/bpf: Add apply_bytes test to test_txmsg_redir_wait_sndmem in test_sockmap")
Jiayuan Chen (2):
ktls, sockmap: Fix missing uncharge operation
selftests/bpf: Add test to cover sockmap with ktls
net/tls/tls_sw.c | 7 ++
.../selftests/bpf/prog_tests/sockmap_ktls.c | 76 +++++++++++++++++++
.../selftests/bpf/progs/test_sockmap_ktls.c | 10 +++
3 files changed, 93 insertions(+)
--
2.47.1
ksft runner sends 2 SIGTERMs in a row if a test runs out of time.
Handle this in a similar way we handle SIGINT - cleanup and stop
running further tests.
Because we get 2 signals we need a bit of logic to ignore
the subsequent one, they come immediately one after the other
(due to commit 9616cb34b08e ("kselftest/runner.sh: Propagate SIGTERM
to runner child")).
This change makes sure we run cleanup (scheduled defer()s)
and also print a stack trace on SIGTERM, which doesn't happen
by default. Tests occasionally hang in NIPA and it's impossible
to tell what they are waiting from or doing.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
v2:
- remove declaration at the global scope
v1: https://lore.kernel.org/20250425151757.1652517-1-kuba@kernel.org
CC: petrm(a)nvidia.com
CC: willemb(a)google.com
CC: sdf(a)fomichev.me
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/lib/py/ksft.py | 25 +++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/lib/py/ksft.py b/tools/testing/selftests/net/lib/py/ksft.py
index 3cfad0fd4570..1b815768bf8a 100644
--- a/tools/testing/selftests/net/lib/py/ksft.py
+++ b/tools/testing/selftests/net/lib/py/ksft.py
@@ -3,6 +3,7 @@
import builtins
import functools
import inspect
+import signal
import sys
import time
import traceback
@@ -26,6 +27,10 @@ KSFT_DISRUPTIVE = True
pass
+class KsftTerminate(KeyboardInterrupt):
+ pass
+
+
def ksft_pr(*objs, **kwargs):
print("#", *objs, **kwargs)
@@ -193,6 +198,17 @@ KSFT_DISRUPTIVE = True
return env
+def _ksft_intr(signum, frame):
+ # ksft runner.sh sends 2 SIGTERMs in a row on a timeout
+ # if we don't ignore the second one it will stop us from handling cleanup
+ global term_cnt
+ term_cnt += 1
+ if term_cnt == 1:
+ raise KsftTerminate()
+ else:
+ ksft_pr(f"Ignoring SIGTERM (cnt: {term_cnt}), already exiting...")
+
+
def ksft_run(cases=None, globs=None, case_pfx=None, args=()):
cases = cases or []
@@ -205,6 +221,10 @@ KSFT_DISRUPTIVE = True
cases.append(value)
break
+ global term_cnt
+ term_cnt = 0
+ prev_sigterm = signal.signal(signal.SIGTERM, _ksft_intr)
+
totals = {"pass": 0, "fail": 0, "skip": 0, "xfail": 0}
print("TAP version 13")
@@ -229,11 +249,12 @@ KSFT_DISRUPTIVE = True
cnt_key = 'xfail'
except BaseException as e:
stop |= isinstance(e, KeyboardInterrupt)
+ stop |= isinstance(e, KsftTerminate)
tb = traceback.format_exc()
for line in tb.strip().split('\n'):
ksft_pr("Exception|", line)
if stop:
- ksft_pr("Stopping tests due to KeyboardInterrupt.")
+ ksft_pr(f"Stopping tests due to {type(e).__name__}.")
KSFT_RESULT = False
cnt_key = 'fail'
@@ -248,6 +269,8 @@ KSFT_DISRUPTIVE = True
if stop:
break
+ signal.signal(signal.SIGTERM, prev_sigterm)
+
print(
f"# Totals: pass:{totals['pass']} fail:{totals['fail']} xfail:{totals['xfail']} xpass:0 skip:{totals['skip']} error:0"
)
--
2.49.0
Ensure the following prerequisites before executing the test:
1. 'socat' is installed on the remote host.
2. Python version supports socket.SO_INCOMING_CPU (available since v3.11).
Skip the test if either prerequisite is not met.
Reviewed-by: Nimrod Oren <noren(a)nvidia.com>
Signed-off-by: Gal Pressman <gal(a)nvidia.com>
---
Changelog -
v1->v2: https://lore.kernel.org/netdev/20250317123149.364565-1-gal@nvidia.com/
* Use require_cmd() helper (Jakub).
---
tools/testing/selftests/drivers/net/hw/rss_input_xfrm.py | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/hw/rss_input_xfrm.py b/tools/testing/selftests/drivers/net/hw/rss_input_xfrm.py
index 53bb08cc29ec..f439c434ba36 100755
--- a/tools/testing/selftests/drivers/net/hw/rss_input_xfrm.py
+++ b/tools/testing/selftests/drivers/net/hw/rss_input_xfrm.py
@@ -32,6 +32,11 @@ def test_rss_input_xfrm(cfg, ipver):
if multiprocessing.cpu_count() < 2:
raise KsftSkipEx("Need at least two CPUs to test symmetric RSS hash")
+ cfg.require_cmd("socat", remote=True)
+
+ if not hasattr(socket, "SO_INCOMING_CPU"):
+ raise KsftSkipEx("socket.SO_INCOMING_CPU was added in Python 3.11")
+
input_xfrm = cfg.ethnl.rss_get(
{'header': {'dev-name': cfg.ifname}}).get('input_xfrm')
--
2.40.1
This series is a follow-up to [1], which adds mTHP support to khugepaged.
mTHP khugepaged support is a "loose" dependency for the sysfs/sysctl
configs to make sense. Without it global="defer" and mTHP="inherit" case
is "undefined" behavior.
We've seen cases were customers switching from RHEL7 to RHEL8 see a
significant increase in the memory footprint for the same workloads.
Through our investigations we found that a large contributing factor to
the increase in RSS was an increase in THP usage.
For workloads like MySQL, or when using allocators like jemalloc, it is
often recommended to set /transparent_hugepages/enabled=never. This is
in part due to performance degradations and increased memory waste.
This series introduces enabled=defer, this setting acts as a middle
ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
page fault handler will act normally, making a hugepage if possible. If
the allocation is not MADV_HUGEPAGE, then the page fault handler will
default to the base size allocation. The caveat is that khugepaged can
still operate on pages that are not MADV_HUGEPAGE.
This allows for three things... one, applications specifically designed to
use hugepages will get them, and two, applications that don't use
hugepages can still benefit from them without aggressively inserting
THPs at every possible chance. This curbs the memory waste, and defers
the use of hugepages to khugepaged. Khugepaged can then scan the memory
for eligible collapsing. Lastly there is the added benefit for those who
want THPs but experience higher latency PFs. Now you can get base page
performance at the PF handler and Hugepage performance for those mappings
after they collapse.
Admins may want to lower max_ptes_none, if not, khugepaged may
aggressively collapse single allocations into hugepages.
TESTING:
- Built for x86_64, aarch64, ppc64le, and s390x
- selftests mm
- In [1] I provided a script [2] that has multiple access patterns
- lots of general use.
- redis testing. This test was my original case for the defer mode. What I
was able to prove was that THP=always leads to increased max_latency
cases; hence why it is recommended to disable THPs for redis servers.
However with 'defer' we dont have the max_latency spikes and can still
get the system to utilize THPs. I further tested this with the mTHP
defer setting and found that redis (and probably other jmalloc users)
can utilize THPs via defer (+mTHP defer) without a large latency
penalty and some potential gains. I uploaded some mmtest results
here[3] which compares:
stock+thp=never
stock+(m)thp=always
khugepaged-mthp + defer (max_ptes_none=64)
The results show that (m)THPs can cause some throughput regression in
some cases, but also has gains in other cases. The mTHP+defer results
have more gains and less losses over the (m)THP=always case.
V5 Changes:
- rebased dependent series
- added reviewed-by tag on 2/4
V4 Changes:
- Minor Documentation fixes
- rebased the dependent series [1] onto mm-unstable
commit 0e68b850b1d3 ("vmalloc: use atomic_long_add_return_relaxed()")
V3 Changes:
- Combined the documentation commits into one, and moved a section to the
khugepaged mthp patchset
V2 Changes:
- base changes on mTHP khugepaged support
- Fix selftests parsing issue
- add mTHP defer option
- add mTHP defer Documentation
[1] - https://lore.kernel.org/lkml/20250428181218.85925-1-npache@redhat.com/
[2] - https://gitlab.com/npache/khugepaged_mthp_test
[3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.h…
Nico Pache (4):
mm: defer THP insertion to khugepaged
mm: document (m)THP defer usage
khugepaged: add defer option to mTHP options
selftests: mm: add defer to thp setting parser
Documentation/admin-guide/mm/transhuge.rst | 31 +++++++---
include/linux/huge_mm.h | 18 +++++-
mm/huge_memory.c | 69 +++++++++++++++++++---
mm/khugepaged.c | 8 +--
tools/testing/selftests/mm/thp_settings.c | 1 +
tools/testing/selftests/mm/thp_settings.h | 1 +
6 files changed, 106 insertions(+), 22 deletions(-)
--
2.48.1
In the current implementation if the program is dev-bound to a specific
device, it will not be possible to perform XDP_REDIRECT into a DEVMAP or
CPUMAP even if the program is running in the driver NAPI context.
Fix the issue introducing __bpf_prog_map_compatible utility routine in
order to avoid bpf_prog_is_dev_bound() during the XDP program load.
Continue forbidding to attach a dev-bound program to XDP maps.
---
Changes in v3:
- move seltest changes in a dedicated patch
- Link to v2: https://lore.kernel.org/r/20250423-xdp-prog-bound-fix-v2-1-51742a5dfbce@ker…
Changes in v2:
- Introduce __bpf_prog_map_compatible() utility routine in order to skip
bpf_prog_is_dev_bound check in bpf_check_tail_call()
- Extend xdp_metadata selftest
- Link to v1: https://lore.kernel.org/r/20250422-xdp-prog-bound-fix-v1-1-0b581fa186fe@ker…
---
Lorenzo Bianconi (2):
bpf: Allow XDP dev-bound programs to perform XDP_REDIRECT into maps
selftests/bpf: xdp_metadata: check XDP_REDIRCT support for dev-bound progs
kernel/bpf/core.c | 27 +++++++++++++---------
.../selftests/bpf/prog_tests/xdp_metadata.c | 22 +++++++++++++++++-
tools/testing/selftests/bpf/progs/xdp_metadata.c | 13 +++++++++++
3 files changed, 50 insertions(+), 12 deletions(-)
---
base-commit: 91dbac4076537b464639953c055c460d2bdfc7ea
change-id: 20250422-xdp-prog-bound-fix-9f30f3e134aa
Best regards,
--
Lorenzo Bianconi <lorenzo(a)kernel.org>
This series fixes misaligned access handling when in non interruptible
context by reenabling interrupts when possible. A previous commit
changed raw_copy_from_user() with copy_from_user() which enables page
faulting and thus can sleep. While correct, a warning is now triggered
due to being called in an invalid context (sleeping in
non-interruptible). This series fixes that problem by factorizing
misaligned load/store entry in a single function than reenables
interrupt if the interrupted context had interrupts enabled.
In order for misaligned handling problems to be caught sooner, add a
kselftest for all the currently supported instructions .
Note: these commits were actually part of another larger series for
misaligned request delegation but was split since it isn't directly
required.
Clément Léger (5):
riscv: misaligned: factorize trap handling
riscv: misaligned: enable IRQs while handling misaligned accesses
riscv: misaligned: use get_user() instead of __get_user()
Documentation/sysctl: add riscv to unaligned-trap supported archs
selftests: riscv: add misaligned access testing
Documentation/admin-guide/sysctl/kernel.rst | 4 +-
arch/riscv/kernel/traps.c | 57 ++--
arch/riscv/kernel/traps_misaligned.c | 2 +-
.../selftests/riscv/misaligned/.gitignore | 1 +
.../selftests/riscv/misaligned/Makefile | 12 +
.../selftests/riscv/misaligned/common.S | 33 +++
.../testing/selftests/riscv/misaligned/fpu.S | 180 +++++++++++++
tools/testing/selftests/riscv/misaligned/gp.S | 103 +++++++
.../selftests/riscv/misaligned/misaligned.c | 254 ++++++++++++++++++
9 files changed, 614 insertions(+), 32 deletions(-)
create mode 100644 tools/testing/selftests/riscv/misaligned/.gitignore
create mode 100644 tools/testing/selftests/riscv/misaligned/Makefile
create mode 100644 tools/testing/selftests/riscv/misaligned/common.S
create mode 100644 tools/testing/selftests/riscv/misaligned/fpu.S
create mode 100644 tools/testing/selftests/riscv/misaligned/gp.S
create mode 100644 tools/testing/selftests/riscv/misaligned/misaligned.c
--
2.49.0
This series improves the following tests.
1. Get-reg-list : Adds vector support
2. SBI PMU test : Distinguish between different types of illegal exception
The first patch is just helper patch that adds stval support during
exception handling.
Signed-off-by: Atish Patra <atishp(a)rivosinc.com>
---
Changes in v3:
- Dropped the redundant macros and rv32 specific csr details.
- Changed to vcpu_get_reg from __vcpu_get_reg based on suggestion from Drew.
- Added RB tags from Drew.
- Link to v2: https://lore.kernel.org/r/20250429-kvm_selftest_improve-v2-0-51713f91e04a@r…
Changes in v2:
- Rebased on top of Linux 6.15-rc4
- Changed from ex_regs to pt_regs based on Drew's suggestion.
- Dropped Anup's review on PATCH1 as it is significantly changed from last review.
- Moved the instruction decoding macros to a common header file.
- Improved the vector reg list test as per the feedback.
- Link to v1: https://lore.kernel.org/r/20250324-kvm_selftest_improve-v1-0-583620219d4f@r…
---
Atish Patra (3):
KVM: riscv: selftests: Align the trap information wiht pt_regs
KVM: riscv: selftests: Decode stval to identify exact exception type
KVM: riscv: selftests: Add vector extension tests
.../selftests/kvm/include/riscv/processor.h | 23 +++-
tools/testing/selftests/kvm/lib/riscv/handlers.S | 139 +++++++++++----------
tools/testing/selftests/kvm/lib/riscv/processor.c | 2 +-
tools/testing/selftests/kvm/riscv/arch_timer.c | 2 +-
tools/testing/selftests/kvm/riscv/ebreak_test.c | 2 +-
tools/testing/selftests/kvm/riscv/get-reg-list.c | 132 +++++++++++++++++++
tools/testing/selftests/kvm/riscv/sbi_pmu_test.c | 24 +++-
7 files changed, 247 insertions(+), 77 deletions(-)
---
base-commit: f15d97df5afae16f40ecef942031235d1c6ba14f
change-id: 20250324-kvm_selftest_improve-9bedb9f0a6d3
--
Regards,
Atish patra