Hello,
This patchset builds upon the code at
https://lore.kernel.org/lkml/20230718234512.1690985-1-seanjc@google.com/T/.
This code is available at
https://github.com/googleprodkernel/linux-cc/tree/kvm-gmem-link-migrate-rfc….
In guest_mem v11, a split file/inode model was proposed, where memslot
bindings belong to the file and pages belong to the inode. This model
lends itself well to having different VMs use separate files pointing
to the same inode.
This RFC proposes an ioctl, KVM_LINK_GUEST_MEMFD, that takes a VM and
a gmem fd, and returns another gmem fd referencing a different file
and associated with that VM. This RFC also includes an update to
KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM to migrate memory context
(slot->arch.lpage_info and kvm->mem_attr_array) from source to
destination vm, intra-host.
Intended usage of the two ioctls:
1. Source VM’s fd is passed to destination VM via unix sockets
2. Destination VM uses new ioctl KVM_LINK_GUEST_MEMFD to link source
VM’s fd to a new fd.
3. Destination VM will pass new fds to KVM_SET_USER_MEMORY_REGION,
which will bind the new file, pointing to the same inode that the
source VM’s file points to, to memslots
4. Use KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM to move kvm->mem_attr_array
and slot->arch.lpage_info to the destination VM.
5. Run the destination VM as per normal
Some other approaches considered were:
+ Using the linkat() syscall, but that requires a mount/directory for
a source fd to be linked to
+ Using the dup() syscall, but that only duplicates the fd, and both
fds point to the same file
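The distinction between "same file" and "same inode" can be observed from userspace with an ordinary temporary file. The sketch below is only an analogy and does not use guest_memfd or KVM_LINK_GUEST_MEMFD: it assumes a regular file under /tmp, and the helper names (same_inode, same_open_file, fd_model_demo) are made up for illustration. dup() yields a second fd on the same open file, so the file offset is shared, while a second open() yields a new file on the same inode.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* True if both fds resolve to the same inode (same backing object). */
static bool same_inode(int a, int b)
{
	struct stat sa, sb;

	assert(a >= 0 && b >= 0);
	if (fstat(a, &sa) != 0 || fstat(b, &sb) != 0)
		return false;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/*
 * True if both fds share one open file ("struct file" in kernel terms),
 * detected by checking that a seek on one fd moves the other's offset.
 */
static bool same_open_file(int a, int b)
{
	assert(a >= 0 && b >= 0);
	if (lseek(a, 0, SEEK_SET) < 0 || lseek(b, 0, SEEK_SET) < 0)
		return false;
	if (lseek(a, 1, SEEK_SET) < 0)
		return false;
	return lseek(b, 0, SEEK_CUR) == 1;
}

/*
 * Illustrative demo on a temporary regular file. Returns 0 if dup()
 * shares the file while a second open() creates a new file on the same
 * inode, 1 if not, and -1 on setup failure.
 */
static int fd_model_demo(void)
{
	char path[] = "/tmp/fd-model-XXXXXX";
	int fd = mkstemp(path);
	int dup_fd, new_fd, ret = -1;

	if (fd < 0)
		return -1;
	dup_fd = dup(fd);
	new_fd = open(path, O_RDWR);
	unlink(path);	/* the inode stays alive while fds are open */

	if (dup_fd >= 0 && new_fd >= 0)
		ret = (same_inode(fd, dup_fd) && same_open_file(fd, dup_fd) &&
		       same_inode(fd, new_fd) && !same_open_file(fd, new_fd)) ? 0 : 1;

	if (new_fd >= 0)
		close(new_fd);
	if (dup_fd >= 0)
		close(dup_fd);
	close(fd);
	return ret;
}
```

Calling fd_model_demo() from a main() is enough to try it. The point of the analogy is that per-file state (here the offset, in the RFC the memslot bindings) is not shared between two files, even though both refer to the same inode and therefore the same pages.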
---
Ackerley Tng (11):
KVM: guest_mem: Refactor out kvm_gmem_alloc_file()
KVM: guest_mem: Add ioctl KVM_LINK_GUEST_MEMFD
KVM: selftests: Add tests for KVM_LINK_GUEST_MEMFD ioctl
KVM: selftests: Test transferring private memory to another VM
KVM: x86: Refactor sev's flag migration_in_progress to kvm struct
KVM: x86: Refactor common code out of sev.c
KVM: x86: Refactor common migration preparation code out of
sev_vm_move_enc_context_from
KVM: x86: Let moving encryption context be configurable
KVM: x86: Handle moving of memory context for intra-host migration
KVM: selftests: Generalize migration functions from
sev_migrate_tests.c
KVM: selftests: Add tests for migration of private mem
arch/x86/include/asm/kvm_host.h | 4 +-
arch/x86/kvm/svm/sev.c | 85 ++-----
arch/x86/kvm/svm/svm.h | 3 +-
arch/x86/kvm/x86.c | 221 +++++++++++++++++-
arch/x86/kvm/x86.h | 6 +
include/linux/kvm_host.h | 18 ++
include/uapi/linux/kvm.h | 8 +
tools/testing/selftests/kvm/Makefile | 1 +
.../testing/selftests/kvm/guest_memfd_test.c | 42 ++++
.../selftests/kvm/include/kvm_util_base.h | 31 +++
.../kvm/x86_64/private_mem_migrate_tests.c | 93 ++++++++
.../selftests/kvm/x86_64/sev_migrate_tests.c | 48 ++--
virt/kvm/guest_mem.c | 151 ++++++++++--
virt/kvm/kvm_main.c | 10 +
virt/kvm/kvm_mm.h | 7 +
15 files changed, 596 insertions(+), 132 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_migrate_tests.c
--
2.41.0.640.ga95def55d0-goog
'realpath' is not always available; fall back to 'readlink -f' if it is
not available. They seem to work equally well in this context.
Signed-off-by: Yosry Ahmed <yosry.ahmed(a)linux.dev>
---
tools/testing/selftests/run_kselftest.sh | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/run_kselftest.sh b/tools/testing/selftests/run_kselftest.sh
index 50e03eefe7ac7..0443beacf3621 100755
--- a/tools/testing/selftests/run_kselftest.sh
+++ b/tools/testing/selftests/run_kselftest.sh
@@ -3,7 +3,14 @@
#
# Run installed kselftest tests.
#
-BASE_DIR=$(realpath $(dirname $0))
+
+# Fallback to readlink if realpath is not available
+if which realpath > /dev/null; then
+ BASE_DIR=$(realpath $(dirname $0))
+else
+ BASE_DIR=$(readlink -f $(dirname $0))
+fi
+
cd $BASE_DIR
TESTS="$BASE_DIR"/kselftest-list.txt
if [ ! -r "$TESTS" ] ; then
--
2.49.0.rc1.451.g8f38331e32-goog
v13: https://lore.kernel.org/netdev/20250425204743.617260-1-almasrymina@google.c…
===
Changelog:
- Fix unneeded error label pointed out by Christoph, and addressed
nitpick.
v12: https://lore.kernel.org/netdev/20250423031117.907681-1-almasrymina@google.c…
====
No changes in v12, just restored the selftests patch I accidentally dropped in
v11
v11: https://lore.kernel.org/netdev/20250423031117.907681-1-almasrymina@google.c…
====
Addressed a couple of nits and collected Acked-by from Harshitha
(thanks!)
v10: https://lore.kernel.org/netdev/20250417231540.2780723-1-almasrymina@google.…
====
Addressed comments following conversations with Pavel, Stan, and
Harshitha. Thank you guys for the reviews again. Overall minor changes:
Changelog:
- Check for !niov->pp in io_zcrx_recv_frag, just in case we end up with
a TX niov in that path (Pavel).
- Fix locking case in !netif_device_present (Jakub/Stan).
v9: https://lore.kernel.org/netdev/20250415224756.152002-1-almasrymina@google.c…
===
Changelog:
- Use priv->bindings list instead of sock_bindings_list. This was missed
during the rebase as the bindings have been updated to use
priv->bindings recently (thanks Stan!)
v8: https://lore.kernel.org/netdev/20250308214045.1160445-1-almasrymina@google.…
===
Only address minor comments on V7
Changelog:
- Use netdev locking instead of rtnl_locking to match rx path.
- Now that iouring zcrx is in net-next, use NET_IOV_IOURING instead of
NET_IOV_UNSPECIFIED.
- Post send binding to net_devmem_dmabuf_bindings after it's been fully
initialized (Stan).
v7: https://lore.kernel.org/netdev/20250227041209.2031104-1-almasrymina@google.…
===
Changelog:
- Check the dmabuf net_iov binding belongs to the device the TX is going
out on. (Jakub)
- Provide detailed inspection of callsites of
__skb_frag_ref/skb_page_unref in patch 2's changelog (Jakub)
v6: https://lore.kernel.org/netdev/20250222191517.743530-1-almasrymina@google.c…
===
v6 has no major changes. Addressed a few issues from Paolo and David,
and collected Acks from Stan. Thank you everyone for the review!
Changes:
- retain behavior to process MSG_FASTOPEN even if the provided cmsg is
invalid (Paolo).
- Rework the freeing of tx_vec slightly (it now has its own err label).
(Paolo).
- Squash the commit that makes dmabuf unbinding scheduled work into the
same one which implements the TX path so we don't run into future
errors on bisecting (Paolo).
- Fix/add comments to explain how dmabuf binding refcounting works
(David).
v5: https://lore.kernel.org/netdev/20250220020914.895431-1-almasrymina@google.c…
===
v5 has no major changes; it clears up the relatively minor issues
pointed out in v4, and rebases the series on top of net-next to
resolve the conflict with a patch that raced to the tree. It also
collects the review tags from v4.
Changes:
- Rebase to net-next
- Fix issues in selftest (Stan).
- Address comments in the devmem and netmem driver docs (Stan and Bagas)
- Fix zerocopy_fill_skb_from_devmem return error code (Stan).
v4: https://lore.kernel.org/netdev/20250203223916.1064540-1-almasrymina@google.…
===
v4 mainly addresses the critical driver support issue surfaced in v3 by
Paolo and Stan. Drivers aiming to support netmem_tx should make sure not
to pass the netmem dma-addrs to the dma-mapping APIs, as these dma-addrs
may come from dma-bufs.
Additionally other feedback from v3 is addressed.
Major changes:
- Add helpers to handle netmem dma-addrs. Add GVE support for
netmem_tx.
- Fix binding->tx_vec not being freed on error paths during the
tx binding.
- Add a minimal devmem_tx test to devmem.py.
- Clean up everything obsolete from the cover letter (Paolo).
v3: https://patchwork.kernel.org/project/netdevbpf/list/?series=929401&state=*
===
Address minor comments from RFCv2 and fix a few build warnings and
ynl-regen issues. No major changes.
RFC v2: https://patchwork.kernel.org/project/netdevbpf/list/?series=920056&state=*
=======
RFC v2 addresses much of the feedback from RFC v1. I plan on sending
something close to this as net-next reopens, sending it slightly early
to get feedback if any.
Major changes:
--------------
- much improved UAPI as suggested by Stan. We now interpret the iov_base
of the passed in iov from userspace as the offset into the dmabuf to
send from. This removes the need to set iov.iov_base = NULL which may
be confusing to users, and enables us to send multiple iovs in the
same sendmsg() call. ncdevmem and the docs show a sample use of that.
- Removed the duplicate dmabuf iov_iter in binding->iov_iter. I think
this is a good improvement as it was confusing to keep track of
2 iterators for the same sendmsg, and mistracking both iterators
caused a couple of bugs reported in the last iteration that are now
resolved with this streamlining.
- Improved test coverage in ncdevmem. Now multiple sendmsg() are tested,
and sending multiple iovs in the same sendmsg() is tested.
- Fixed issue where dmabuf unmapping was happening in invalid context
(Stan).
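As a rough illustration of the iov_base change described above, the sketch below builds an iovec array whose iov_base fields carry byte offsets into the bound dma-buf rather than userspace pointers. The helpers dmabuf_chunk() and iov_total() are hypothetical and not part of the series; the control message that names the dma-buf binding and the sendmsg() flags are covered by the documentation included with the series and are deliberately left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Hypothetical helper, not part of the series: describe `len` bytes
 * starting at byte offset `off` inside the bound dma-buf.
 */
static struct iovec dmabuf_chunk(size_t off, size_t len)
{
	struct iovec iov;

	assert(len > 0);
	iov.iov_base = (void *)(uintptr_t)off;	/* an offset, not a pointer */
	iov.iov_len = len;
	return iov;
}

/* Total number of payload bytes described by an iovec array. */
static size_t iov_total(const struct iovec *iov, size_t n)
{
	size_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += iov[i].iov_len;
	return total;
}
```

An array such as { dmabuf_chunk(0, 4096), dmabuf_chunk(8192, 1024) } would then be placed in msg_iov to send two disjoint regions of the dma-buf in a single sendmsg() call. Offset 0 is a valid chunk start, which is why the earlier iov_base = NULL convention is no longer needed.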
====================================================================
The TX path had been dropped from the Device Memory TCP patch series
post RFCv1 [1], to make that series slightly easier to review. This
series rebases the implementation of the TX path on top of the
net_iov/netmem framework agreed upon and merged. The motivation for
the feature is thoroughly described in the docs & cover letter of the
original proposal, so I don't repeat the lengthy descriptions here, but
they are available in [1].
Full outline on usage of the TX path is detailed in the documentation
included with this series.
Test example is available via the kselftest included in the series as well.
The series is relatively small, as the TX path for this feature largely
piggybacks on the existing MSG_ZEROCOPY implementation.
Patch Overview:
---------------
1. Documentation & tests to give high level overview of the feature
being added.
1. Add netmem refcounting needed for the TX path.
2. Devmem TX netlink API.
3. Devmem TX net stack implementation.
4. Make dma-buf unbinding scheduled work to handle TX cases where it gets
freed from contexts where we can't sleep.
5. Add devmem TX documentation.
6. Add scaffolding enabling driver support for netmem_tx. Add helpers, driver
feature flag, and docs to enable drivers to declare netmem_tx support.
7. Guard netmem_tx against being enabled against drivers that don't
support it.
8. Add devmem_tx selftests. Add TX path to ncdevmem and add a test to
devmem.py.
Testing:
--------
Testing is very similar to devmem TCP RX path. The ncdevmem test used
for the RX path is now augmented with client functionality to test TX
path.
* Test Setup:
Kernel: net-next with this RFC and memory provider API cherry-picked
locally.
Hardware: Google Cloud A3 VMs.
NIC: GVE with header split & RSS & flow steering support.
Performance results are not included with this version, unfortunately.
I'm having issues running the dma-buf exporter driver against the
upstream kernel on my test setup. The issues are specific to that
dma-buf exporter and do not affect this patch series. I plan to follow
up this series with perf fixes if the tests point to issues once they're
up and running.
Special thanks to Stan who took a stab at rebasing the TX implementation
on top of the netmem/net_iov framework merged. Parts of his proposal [2]
that are reused as-is are forked off into their own patches to give full
credit.
[1] https://lore.kernel.org/netdev/20240909054318.1809580-1-almasrymina@google.…
[2] https://lore.kernel.org/netdev/20240913150913.1280238-2-sdf@fomichev.me/T/#…
Cc: sdf(a)fomichev.me
Cc: asml.silence(a)gmail.com
Cc: dw(a)davidwei.uk
Cc: Jamal Hadi Salim <jhs(a)mojatatu.com>
Cc: Victor Nogueira <victor(a)mojatatu.com>
Cc: Pedro Tammela <pctammela(a)mojatatu.com>
Cc: Samiullah Khawaja <skhawaja(a)google.com>
Cc: Kuniyuki Iwashima <kuniyu(a)amazon.com>
Mina Almasry (8):
netmem: add niov->type attribute to distinguish different net_iov
types
net: add get_netmem/put_netmem support
net: devmem: Implement TX path
net: add devmem TCP TX documentation
net: enable driver support for netmem TX
gve: add netmem TX support to GVE DQO-RDA mode
net: check for driver support in netmem TX
selftests: ncdevmem: Implement devmem TCP TX
Stanislav Fomichev (1):
net: devmem: TCP tx netlink api
Documentation/netlink/specs/netdev.yaml | 12 +
Documentation/networking/devmem.rst | 150 ++++++++-
.../networking/net_cachelines/net_device.rst | 1 +
Documentation/networking/netdev-features.rst | 5 +
Documentation/networking/netmem.rst | 23 +-
drivers/net/ethernet/google/gve/gve_main.c | 3 +
drivers/net/ethernet/google/gve/gve_tx_dqo.c | 8 +-
include/linux/netdevice.h | 2 +
include/linux/skbuff.h | 17 +-
include/linux/skbuff_ref.h | 4 +-
include/net/netmem.h | 34 +-
include/net/sock.h | 1 +
include/uapi/linux/netdev.h | 1 +
io_uring/zcrx.c | 3 +-
net/core/datagram.c | 48 ++-
net/core/dev.c | 34 +-
net/core/devmem.c | 131 ++++++--
net/core/devmem.h | 83 ++++-
net/core/netdev-genl-gen.c | 13 +
net/core/netdev-genl-gen.h | 1 +
net/core/netdev-genl.c | 80 ++++-
net/core/skbuff.c | 48 ++-
net/core/sock.c | 6 +
net/ipv4/ip_output.c | 3 +-
net/ipv4/tcp.c | 50 ++-
net/ipv6/ip6_output.c | 3 +-
net/vmw_vsock/virtio_transport_common.c | 5 +-
tools/include/uapi/linux/netdev.h | 1 +
.../selftests/drivers/net/hw/devmem.py | 26 +-
.../selftests/drivers/net/hw/ncdevmem.c | 300 +++++++++++++++++-
30 files changed, 1008 insertions(+), 88 deletions(-)
base-commit: 0d15a26b247d25cd012134bf8825128fedb15cc9
--
2.49.0.901.g37484f566f-goog
Fix some more minor issues in ublk selftests.
The first patch is from
https://lore.kernel.org/linux-block/20250423-ublk_selftests-v1-0-7d060e260e…
with a modification requested by Jens. The others are new.
Signed-off-by: Uday Shankar <ushankar(a)purestorage.com>
---
Changes in v2:
- Use a test-specific WERROR flag instead of reusing CONFIG_WERROR from
the kernel build for deciding whether or not to use -Werror for the
kublk build. The default behavior is to use -Werror (Ming Lei)
- Link to v1: https://lore.kernel.org/r/20250428-ublk_selftests-v1-0-5795f7b00cda@puresto…
---
Uday Shankar (3):
selftests: ublk: kublk: build with -Werror iff WERROR!=0
selftests: ublk: make test_generic_06 silent on success
selftests: ublk: kublk: fix include path
tools/testing/selftests/ublk/Makefile | 6 +++++-
tools/testing/selftests/ublk/kublk.h | 1 -
tools/testing/selftests/ublk/test_generic_06.sh | 2 +-
3 files changed, 6 insertions(+), 3 deletions(-)
---
base-commit: 53ec1abce79c986dc59e59d0c60d00088bcdf32a
change-id: 20250428-ublk_selftests-983240d3a325
Best regards,
--
Uday Shankar <ushankar(a)purestorage.com>
Fix some more minor issues in ublk selftests.
The first patch is from
https://lore.kernel.org/linux-block/20250423-ublk_selftests-v1-0-7d060e260e…
with a modification requested by Jens. The others are new.
Signed-off-by: Uday Shankar <ushankar(a)purestorage.com>
---
Uday Shankar (3):
selftests: ublk: kublk: build with -Werror iff CONFIG_WERROR=y
selftests: ublk: make test_generic_06 silent on success
selftests: ublk: kublk: fix include path
tools/testing/selftests/ublk/Makefile | 4 +++-
tools/testing/selftests/ublk/kublk.h | 1 -
tools/testing/selftests/ublk/test_generic_06.sh | 2 +-
3 files changed, 4 insertions(+), 3 deletions(-)
---
base-commit: 53ec1abce79c986dc59e59d0c60d00088bcdf32a
change-id: 20250428-ublk_selftests-983240d3a325
Best regards,
--
Uday Shankar <ushankar(a)purestorage.com>
After a long delay I'm posting the next iteration of the lockless
/proc/pid/maps reading patchset. Differences from v2 [1]:
- Add a set of tests concurrently modifying address space and checking for
correct reading results;
- Use new mmap_lock_speculate_xxx APIs for concurrent change detection and
retries;
- Add lockless PROCMAP_QUERY execution support;
The new tests are designed to check for any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps might have
inconsistent data because the file is read page-by-page with mmap_lock
being dropped between the pages. Such tearing is expected and userspace
is supposed to deal with that possibility. An example of user-visible
inconsistency can be that the same vma is printed twice: once before
it was modified and then after the modifications. For example if vma was
extended, it might be found and reported twice. What is not expected is
to see a gap where there should have been a vma both before and after
modification. This patchset increases the chances of such tearing,
therefore it's even more important now to test for unexpected
inconsistencies.
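The difference between expected and unexpected tearing can be made concrete with a small checker. This is a simplified sketch and not the code of the selftests in this series: it assumes the reader has already parsed the /proc/pid/maps lines into address ranges in the order they were printed, and the names vma_range and covered_without_gap are invented for the example. A vma that shows up twice is tolerated, a hole inside a region that was mapped throughout is reported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vma_range {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

/*
 * Walk the ranges in the order they were read and check that `want`
 * is covered from start to end. Duplicate or overlapping entries are
 * accepted (expected tearing); a hole is not (unexpected tearing).
 */
static bool covered_without_gap(const struct vma_range *seen, size_t n,
				struct vma_range want)
{
	unsigned long pos = want.start;

	for (size_t i = 0; i < n && pos < want.end; i++) {
		assert(seen[i].start < seen[i].end);
		if (seen[i].end <= pos)
			continue;	/* below the cursor or a stale duplicate */
		if (seen[i].start > pos)
			return false;	/* hole: nothing reported at pos */
		pos = seen[i].end;
	}
	return pos >= want.end;
}
```

For example, the sequence [0x1000, 0x3000), [0x1000, 0x4000) passes for a wanted range of [0x1000, 0x4000) because the second entry is just the grown vma reported again, while [0x1000, 0x2000), [0x3000, 0x4000) fails because nothing covers 0x2000.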
Thanks to Paul McKenney who developed a benchmark to test performance
of concurrent reads and updates, we also have data on performance
benefits:
The test has a pair of processes scanning /proc/PID/maps, and another
process unmapping and remapping 4K pages from a 128MB range of anonymous
memory. At the end of each 10-second run, the latency of each mmap()
or munmap() operation is measured, and for each run the maximum and mean
latency is printed. (Yes, the map/unmap process is started first, its
PID is passed to the scanners, and then the map/unmap process waits until
both scanners are running before starting its timed test. The scanners
keep scanning until the specified /proc/PID/maps file disappears.)
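For readers who want a feel for the writer side of such a test, here is a minimal single-process sketch. It is not the benchmark described above: there are no concurrent /proc/PID/maps scanners, the run is bounded by an iteration count rather than a 10-second window, and the names lat_stats and remap_latency are assumptions made for this example. It only shows how per-operation mmap()/munmap() latency over single pages of an anonymous region might be collected.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

struct lat_stats {
	uint64_t max_ns;	/* slowest single mmap()/munmap() call */
	uint64_t mean_ns;	/* average over all timed calls */
	uint64_t ops;		/* number of timed calls */
};

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Unmap and remap one page at a time inside an anonymous region of
 * `region` bytes, timing each call. The system page size is used
 * (4K on the setup described above). Returns 0 on success.
 */
static int remap_latency(size_t region, unsigned int iters,
			 struct lat_stats *out)
{
	const int prot = PROT_READ | PROT_WRITE;
	const int flags = MAP_PRIVATE | MAP_ANONYMOUS;
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t npages = region / page;
	uint64_t total = 0, t0, t1, t2;
	int ret = 0;
	char *base;

	assert(npages > 0 && iters > 0);
	base = mmap(NULL, region, prot, flags, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	out->max_ns = 0;
	out->ops = 0;
	for (unsigned int i = 0; i < iters; i++) {
		char *p = base + (size_t)(i % npages) * page;

		t0 = now_ns();
		if (munmap(p, page) != 0) {
			ret = -1;
			break;
		}
		t1 = now_ns();
		if (mmap(p, page, prot, flags | MAP_FIXED, -1, 0) == MAP_FAILED) {
			ret = -1;
			break;
		}
		t2 = now_ns();

		if (t1 - t0 > out->max_ns)
			out->max_ns = t1 - t0;
		if (t2 - t1 > out->max_ns)
			out->max_ns = t2 - t1;
		total += t2 - t0;
		out->ops += 2;
	}
	out->mean_ns = out->ops ? total / out->ops : 0;
	munmap(base, region);
	return ret;
}
```

A call such as remap_latency(128UL << 20, 100000, &stats) would exercise a 128MB range. To approximate the contention the benchmark measures, one or more other processes would have to read /proc/PID/maps of this process in a loop while it runs.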
In summary, with stock mm, 78% of the runs had maximum latencies in
excess of 0.5 milliseconds, with more than half of the runs' latencies
exceeding a full millisecond. In contrast, 98% of the runs with Suren's
patch series applied had maximum latencies of less than 0.5 milliseconds.
From a median-performance viewpoint, Suren's series also looks good,
with stock mm weighing in at 13 microseconds and Suren's series at 10
microseconds, better than a 20% improvement.
[1] https://lore.kernel.org/all/20240123231014.3801041-1-surenb@google.com/
Suren Baghdasaryan (8):
selftests/proc: add /proc/pid/maps tearing from vma split test
selftests/proc: extend /proc/pid/maps tearing test to include vma
resizing
selftests/proc: extend /proc/pid/maps tearing test to include vma
remapping
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
selftests/proc: add verbose more for tests to facilitate debugging
mm: make vm_area_struct anon_name field RCU-safe
mm/maps: read proc/pid/maps under RCU
mm/maps: execute PROCMAP_QUERY ioctl under RCU
fs/proc/internal.h | 6 +
fs/proc/task_mmu.c | 233 +++++-
include/linux/mm_inline.h | 28 +-
include/linux/mm_types.h | 3 +-
mm/madvise.c | 30 +-
tools/testing/selftests/proc/proc-pid-vm.c | 793 ++++++++++++++++++++-
6 files changed, 1061 insertions(+), 32 deletions(-)
base-commit: 79f35c4125a9a3fd98efeed4cce1cd7ce5311a44
--
2.49.0.805.g082f7c87e0-goog
Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:
64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
smp_call_function_many_cond+0x1
smp_call_function+0x39
on_each_cpu+0x2a
flush_tlb_kernel_range+0x7b
__purge_vmap_area_lazy+0x70
_vm_unmap_aliases.part.42+0xdf
change_page_attr_set_clr+0x16a
set_memory_ro+0x26
bpf_int_jit_compile+0x2f9
bpf_prog_select_runtime+0xc6
bpf_prepare_filter+0x523
sk_attach_filter+0x13
sock_setsockopt+0x92c
__sys_setsockopt+0x16a
__x64_sys_setsockopt+0x20
do_syscall_64+0x87
entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, providing it can be executed "early
enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.
Deferral approach
=================
Storing each and every callback, like a secondary call_single_queue turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.
Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
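The coalescing idea can be modelled in a few lines of userspace C. This is a toy sketch under stated assumptions, not the series' implementation: the work types, the pending_work variable and the function names are invented here, the state is a single global rather than per-CPU context tracking state, and C11 atomics stand in for the kernel's primitives. Each work type owns one bit, so deferring the same type any number of times costs no memory and results in one callback run at the next "kernel entry".

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical work types; one bit each in the pending mask. */
enum deferred_work {
	WORK_SYNC_CORE,
	WORK_TLB_FLUSH,
	WORK_MAX,
};

static atomic_uint pending_work;		/* per-CPU in a real kernel */
static unsigned int run_count[WORK_MAX];	/* for demonstration only */

static void do_sync_core(void) { run_count[WORK_SYNC_CORE]++; }
static void do_tlb_flush(void) { run_count[WORK_TLB_FLUSH]++; }

/* Mapping of work type to callback, consulted on "kernel entry". */
static void (*const work_fn[WORK_MAX])(void) = {
	[WORK_SYNC_CORE] = do_sync_core,
	[WORK_TLB_FLUSH] = do_tlb_flush,
};

/* Sender side: instead of sending an IPI, mark the work as pending. */
static void defer_work(enum deferred_work w)
{
	assert(w < WORK_MAX);
	atomic_fetch_or(&pending_work, 1u << w);
}

/* Target side: run each pending work type once, early on entry. */
static void flush_deferred_work(void)
{
	unsigned int pending = atomic_exchange(&pending_work, 0);

	for (int w = 0; w < WORK_MAX; w++)
		if (pending & (1u << w))
			work_fn[w]();
}
```

Deferring WORK_SYNC_CORE three times followed by one flush_deferred_work() call runs do_sync_core() once, and a second flush runs nothing. That property is what avoids the unbounded queue discussed above.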
What about IPIs whose callbacks take a parameter, you may ask?
Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.
This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
Kernel entry vs execution of the deferred operation
===================================================
This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].
There is a non-zero length of code that is executed upon kernel entry before the
deferred operation can be itself executed (before we start getting into
context_tracking.c proper), i.e.:
idtentry_func_foo() <--- we're in the kernel
irqentry_enter()
enter_from_user_mode()
__ct_user_exit()
ct_kernel_enter_state()
ct_work_flush() <--- deferred operation is executed here
This means one must take extra care about what can happen in the early entry
code, and ensure that <bad things> cannot happen. For instance, we really don't want to hit
instructions that have been modified by a remote text_poke() while we're on our
way to execute a deferred sync_core(). Patches doing the actual deferral have
more detail on this.
Where are we at with this whole thing?
======================================
Dave has been incredibly helpful wrt figuring out what would and wouldn't
(mostly that) be safe to do for deferring kernel range TLB flush IPIs, see [5].
Long story short, there are ugly things I can still do to (safely) defer the TLB
flush IPIs, but it's going to be a long session of pulling my own hair out, and
I got plenty so I won't be done for a while.
In the meantime, I think everything leading up to deferring text poke IPIs is
sane-ish and could get in. I'm not the biggest fan of adding an API with a
single user, but hey, I've been working on this for "a little while" now and
I'll still need to get the other IPIs sorted out.
TL;DR: Text patching IPI deferral LGTM so here it is for now, I'm still working
on the TLB flush thing.
Patches
=======
o Patches 1-2 are standalone objtool cleanups.
o Patches 3-4 add an RCU testing feature.
o Patches 5-6 add infrastructure for annotating static keys and static calls
that may be used in noinstr code (courtesy of Josh).
o Patches 7-20 use said annotations on relevant keys / calls.
o Patch 21 enforces proper usage of said annotations (courtesy of Josh).
o Patches 22-23 deal with detecting NOINSTR text in modules
o Patches 24-25 add the actual IPI deferral faff
Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v5
Testing
=======
Xeon E5-2699 system with SMT off, NOHZ_FULL, isolated CPUs.
RHEL10 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
$ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
rteval --onlyload --loads-cpulist=$HK_CPUS \
--hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 3 hours.
v6.14
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
93 callback=generic_smp_call_function_single_interrupt+0x0
22 callback=nohz_full_kick_func+0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
1456 func=do_flush_tlb_all
78 func=do_sync_core
33 func=nohz_full_kick_func
26 func=do_kernel_range_flush
v6.14 + patches
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
86 callback=generic_smp_call_function_single_interrupt+0x0
41 callback=nohz_full_kick_func+0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
1378 func=do_flush_tlb_all
33 func=nohz_full_kick_func
So the TLB flush is still there driving most of the IPIs, but at least the
instruction patching IPIs are gone. With kernel TLB flushes deferred, there are
no IPIs sent to isolated CPUs in that 3hr window, but as stated above that still
needs some more work.
Also note that tlb_remove_table_smp_sync() showed up during testing of v3, and
has gone as mysteriously as it showed up. Yair had a series addressing this [6]
which per these results would be worth revisiting.
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o All of the folks who attended various (too many?) talks about this and
provided precious feedback.
o The mm folks for pointing out what I can and can't do with TLB flushes
Links
=====
[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: http://lore.kernel.org/r/eef09bdc-7546-462b-9ac0-661a44d2ceae@intel.com
[6]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/
Revisions
=========
v4 -> v5
++++++++
o Rebased onto v6.15-rc3
o Collected Reviewed-by
o Annotated a few more static keys
o Added proper checking of noinstr sections that are in loadable code such as
KVM early entry (Sean Christopherson)
o Switched to checking for CT_RCU_WATCHING instead of CT_STATE_KERNEL or
CT_STATE_IDLE, which means deferral is now behaving sanely for IRQ/NMI
entry from idle (thanks to Frederic!)
o Ditched the vmap TLB flush deferral (for now)
RFCv3 -> v4
+++++++++++
o Rebased onto v6.13-rc6
o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)
o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups
o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ
RFCv2 -> RFCv3
++++++++++++++
o Rebased onto v6.12-rc6
o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
rcutorture case for a low-size counter (Paul)
o Fixed flush_tlb_kernel_range_deferrable() definition
Josh Poimboeuf (3):
jump_label: Add annotations for validating noinstr usage
static_call: Add read-only-after-init static calls
objtool: Add noinstr validation for static branches/calls
Valentin Schneider (22):
objtool: Make validate_call() recognize indirect calls to pv_ops[]
objtool: Flesh out warning related to pv_ops[] calls
rcu: Add a small-width RCU watching counter debug option
rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
x86/idle: Mark x86_idle static call as __ro_after_init
x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
sched/clock: Mark sched_clock_running key as __ro_after_init
KVM: VMX: Mark __kvm_is_using_evmcs static key as __ro_after_init
x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr
sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
KVM: VMX: Mark vmx_l1d_should flush and vmx_l1d_flush_cond keys as
allowed in .noinstr
stackleack: Mark stack_erasing_bypass key as allowed in .noinstr
module: Remove outdated comment about text_size
module: Add MOD_NOINSTR_TEXT mem_type
context-tracking: Introduce work deferral infrastructure
context_tracking,x86: Defer kernel text patching IPIs
arch/Kconfig | 9 ++
arch/arm/kernel/paravirt.c | 2 +-
arch/arm64/kernel/paravirt.c | 2 +-
arch/loongarch/kernel/paravirt.c | 2 +-
arch/riscv/kernel/paravirt.c | 2 +-
arch/x86/Kconfig | 1 +
arch/x86/events/amd/brs.c | 2 +-
arch/x86/include/asm/context_tracking_work.h | 18 +++
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/kernel/alternative.c | 39 ++++++-
arch/x86/kernel/cpu/bugs.c | 2 +-
arch/x86/kernel/kprobes/core.c | 4 +-
arch/x86/kernel/kprobes/opt.c | 4 +-
arch/x86/kernel/module.c | 2 +-
arch/x86/kernel/paravirt.c | 4 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 11 +-
arch/x86/kvm/vmx/vmx_onhyperv.c | 2 +-
include/asm-generic/sections.h | 15 +++
include/linux/context_tracking.h | 21 ++++
include/linux/context_tracking_state.h | 54 +++++++--
include/linux/context_tracking_work.h | 26 +++++
include/linux/jump_label.h | 30 ++++-
include/linux/module.h | 6 +-
include/linux/objtool.h | 7 ++
include/linux/static_call.h | 19 ++++
kernel/context_tracking.c | 69 +++++++++++-
kernel/kprobes.c | 8 +-
kernel/module/main.c | 85 ++++++++++----
kernel/rcu/Kconfig.debug | 15 +++
kernel/sched/clock.c | 7 +-
kernel/stackleak.c | 6 +-
kernel/time/Kconfig | 5 +
tools/objtool/Documentation/objtool.txt | 34 ++++++
tools/objtool/check.c | 106 +++++++++++++++---
tools/objtool/include/objtool/check.h | 1 +
tools/objtool/include/objtool/elf.h | 1 +
tools/objtool/include/objtool/special.h | 1 +
tools/objtool/special.c | 15 ++-
.../selftests/rcutorture/configs/rcu/TREE04 | 1 +
40 files changed, 557 insertions(+), 84 deletions(-)
create mode 100644 arch/x86/include/asm/context_tracking_work.h
create mode 100644 include/linux/context_tracking_work.h
--
2.49.0
Nolibc is useful for selftests as the test programs can be very small,
and compiled with just a kernel crosscompiler, without userspace support.
Currently nolibc is only usable with kselftest.h, not the more
convenient kselftest_harness.h.
This series provides this compatibility by adding new features to nolibc
and removing the usage of problematic features from the harness.
The first half of the series changes the harness; the second half is
for nolibc. The two parts are independent and should go through
different trees.
The last patch is not meant to be applied and serves as a test that
everything works together correctly.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v3:
- Send patches to correct kselftest harness maintainers
- Move harness selftest to dedicated directory
- Add harness selftest to MAINTAINERS
- Integrate harness selftest cleanup with the selftest framework
- Consistently use "kselftest harness" in commit messages
- Properly propagate kselftest harness failure
- Link to v2: https://lore.kernel.org/r/20250407-nolibc-kselftest-harness-v2-0-f8812f76e9…
Changes in v2:
- Rebase onto v6.15-rc1
- Rename internal nolibc symbols
- Handle edge case of waitpid(INT_MIN) == ESRCH
- Fix arm configurations for final testing patch
- Clean up global getopt.h variable declarations
- Add Acks from Willy
- Link to v1: https://lore.kernel.org/r/20250304-nolibc-kselftest-harness-v1-0-adca7cd231…
---
Thomas Weißschuh (32):
selftests: harness: Add kselftest harness selftest
selftests: harness: Use C89 comment style
selftests: harness: Ignore unused variant argument warning
selftests: harness: Mark functions without prototypes static
selftests: harness: Remove inline qualifier for wrappers
selftests: harness: Remove dependency on libatomic
selftests: harness: Implement test timeouts through pidfd
selftests: harness: Don't set setup_completed for fixtureless tests
selftests: harness: Always provide "self" and "variant"
selftests: harness: Move teardown conditional into test metadata
selftests: harness: Add teardown callback to test metadata
selftests: harness: Stop using setjmp()/longjmp()
selftests: harness: Guard includes on nolibc
tools/nolibc: handle intmax_t/uintmax_t in printf
tools/nolibc: use intmax definitions from compiler
tools/nolibc: use pselect6_time64 if available
tools/nolibc: use ppoll_time64 if available
tools/nolibc: add tolower() and toupper()
tools/nolibc: add _exit()
tools/nolibc: add setpgrp()
tools/nolibc: implement waitpid() in terms of waitid()
Revert "selftests/nolibc: use waitid() over waitpid()"
tools/nolibc: add dprintf() and vdprintf()
tools/nolibc: add getopt()
tools/nolibc: allow different write callbacks in printf
tools/nolibc: allow limiting of printf destination size
tools/nolibc: add snprintf() and friends
selftests/nolibc: use snprintf() for printf tests
selftests/nolibc: rename vfprintf test suite
selftests/nolibc: add test for snprintf() truncation
tools/nolibc: implement width padding in printf()
HACK: selftests/nolibc: demonstrate usage of the kselftest harness
MAINTAINERS | 1 +
tools/include/nolibc/Makefile | 1 +
tools/include/nolibc/getopt.h | 101 ++
tools/include/nolibc/nolibc.h | 1 +
tools/include/nolibc/stdint.h | 4 +-
tools/include/nolibc/stdio.h | 127 +-
tools/include/nolibc/string.h | 17 +
tools/include/nolibc/sys.h | 105 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/kselftest_harness.h | 181 +-
.../testing/selftests/kselftest_harness/.gitignore | 2 +
tools/testing/selftests/kselftest_harness/Makefile | 7 +
.../selftests/kselftest_harness/harness-selftest.c | 129 ++
.../kselftest_harness/harness-selftest.expected | 62 +
.../kselftest_harness/harness-selftest.sh | 13 +
tools/testing/selftests/nolibc/Makefile | 13 +-
tools/testing/selftests/nolibc/harness-selftest.c | 1 +
tools/testing/selftests/nolibc/nolibc-test.c | 1729 +-------------------
tools/testing/selftests/nolibc/run-tests.sh | 2 +-
19 files changed, 637 insertions(+), 1860 deletions(-)
---
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
change-id: 20250130-nolibc-kselftest-harness-8b2c8cac43bf
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>