Kselftest.h declares many variadic functions that can print some
formatted message while also executing selftest logic. These
declarations don't have any compiler mechanism to verify if passed
arguments are valid in comparison with format specifiers used in
printf() calls.
Attribute addition can make debugging easier, the code more consistent
and prevent mismatched or missing variables.
Add a __printf() macro that validates types of variables passed to the
format string. The macro is similarly used in other tools in the kernel.
Add __printf() attributes to function definitions inside kselftest.h that
use printing.
Adding the __printf() macro exposes some mismatches in format strings
across different selftests.
Fix the mismatched format specifiers in multiple tests.
Series is based on kselftests next branch.
Changelog v4:
- Fix patch 1/8 subject typo.
- Add Reinette's reviewed-by tags.
- Rebased onto updated kselftests next branch.
Changelog v3:
- Changed git signature from Wieczor-Retman Maciej to Maciej
Wieczor-Retman.
- Added one review tag.
- Rebased onto updated kselftests next branch.
Changelog v2:
- Add review and fixes tags to patches.
- Add two patches with mismatch fixes.
- Fix missed attribute in selftests/kvm. (Andrew)
- Fix previously missed issues in selftests/mm (Ilpo)
[v3] https://lore.kernel.org/all/cover.1695373131.git.maciej.wieczor-retman@inte…
[v2] https://lore.kernel.org/all/cover.1693829810.git.maciej.wieczor-retman@inte…
[v1] https://lore.kernel.org/all/cover.1693216959.git.maciej.wieczor-retman@inte…
Maciej Wieczor-Retman (8):
selftests: Add printf attribute to kselftest prints
selftests/cachestat: Fix print_cachestat format
selftests/openat2: Fix wrong format specifier
selftests/pidfd: Fix ksft print formats
selftests/sigaltstack: Fix wrong format specifier
selftests/kvm: Replace attribute with macro
selftests/mm: Substitute attribute with a macro
selftests/resctrl: Fix wrong format specifier
.../selftests/cachestat/test_cachestat.c | 2 +-
tools/testing/selftests/kselftest.h | 18 ++++++++++--------
.../testing/selftests/kvm/include/test_util.h | 8 ++++----
tools/testing/selftests/mm/mremap_test.c | 2 +-
tools/testing/selftests/mm/pkey-helpers.h | 2 +-
tools/testing/selftests/openat2/openat2_test.c | 2 +-
.../selftests/pidfd/pidfd_fdinfo_test.c | 2 +-
tools/testing/selftests/pidfd/pidfd_test.c | 12 ++++++------
tools/testing/selftests/resctrl/cache.c | 2 +-
tools/testing/selftests/sigaltstack/sas.c | 2 +-
10 files changed, 27 insertions(+), 25 deletions(-)
base-commit: f1020c687153609f246f3314db5b74821025c185
--
2.42.0
PASID (Process Address Space ID) is a PCIe extension to tag the DMA
transactions out of a physical device, and most modern IOMMU hardware
have supported PASID granular address translation. So a PASID-capable
devices can be attached to multiple hwpts (a.k.a. domains), each attachment
is tagged with a PASID.
This series first adds a missing iommu API to replace domain for a pasid,
then adds iommufd APIs for device drivers to attach/replace/detach pasid
to/from hwpt per userspace's request, and adds selftest to validate the
iommufd APIs.
pasid attach/replace is mandatory on Intel VT-d given the PASID table
locates in the physical address space hence must be managed by the kernel,
both for supporting vSVA and coming SIOV. But it's optional on ARM/AMD
which allow configuring the PASID/CD table either in host physical address space
or nested on top of an GPA address space. This series only add VT-d support
as the minimal requirement.
Complete code can be found in below link:
https://github.com/yiliu1765/iommufd/tree/iommufd_pasid
Regards,
Yi Liu
Kevin Tian (1):
iommufd: Support attach/replace hwpt per pasid
Lu Baolu (2):
iommu: Introduce a replace API for device pasid
iommu/vt-d: Add set_dev_pasid callback for nested domain
Yi Liu (5):
iommufd: replace attach_fn with a structure
iommufd/selftest: Add set_dev_pasid and remove_dev_pasid in mock iommu
iommufd/selftest: Add a helper to get test device
iommufd/selftest: Add test ops to test pasid attach/detach
iommufd/selftest: Add coverage for iommufd pasid attach/detach
drivers/iommu/intel/nested.c | 47 +++++
drivers/iommu/iommu-priv.h | 2 +
drivers/iommu/iommu.c | 73 ++++++--
drivers/iommu/iommufd/Makefile | 1 +
drivers/iommu/iommufd/device.c | 42 +++--
drivers/iommu/iommufd/iommufd_private.h | 16 ++
drivers/iommu/iommufd/iommufd_test.h | 24 +++
drivers/iommu/iommufd/pasid.c | 152 ++++++++++++++++
drivers/iommu/iommufd/selftest.c | 158 ++++++++++++++--
include/linux/iommufd.h | 6 +
tools/testing/selftests/iommu/iommufd.c | 172 ++++++++++++++++++
.../selftests/iommu/iommufd_fail_nth.c | 28 ++-
tools/testing/selftests/iommu/iommufd_utils.h | 78 ++++++++
13 files changed, 756 insertions(+), 43 deletions(-)
create mode 100644 drivers/iommu/iommufd/pasid.c
--
2.34.1
On Sun, Oct 8, 2023 at 12:22 AM Akihiko Odaki <akihiko.odaki(a)daynix.com> wrote:
>
> virtio-net have two usage of hashes: one is RSS and another is hash
> reporting. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
>
> Another approach is to use eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF.
>
> Introduce the code to compute hashes to the kernel in order to overcome
> thse challenges. An alternative solution is to extend the eBPF steering
> program so that it will be able to report to the userspace, but it makes
> little sense to allow to implement different hashing algorithms with
> eBPF since the hash value reported by virtio-net is strictly defined by
> the specification.
>
> The hash value already stored in sk_buff is not used and computed
> independently since it may have been computed in a way not conformant
> with the specification.
>
> Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
> @@ -2116,31 +2172,49 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> }
>
> if (vnet_hdr_sz) {
> - struct virtio_net_hdr gso;
> + union {
> + struct virtio_net_hdr hdr;
> + struct virtio_net_hdr_v1_hash v1_hash_hdr;
> + } hdr;
> + int ret;
>
> if (iov_iter_count(iter) < vnet_hdr_sz)
> return -EINVAL;
>
> - if (virtio_net_hdr_from_skb(skb, &gso,
> - tun_is_little_endian(tun), true,
> - vlan_hlen)) {
> + if ((READ_ONCE(tun->vnet_hash.flags) & TUN_VNET_HASH_REPORT) &&
> + vnet_hdr_sz >= sizeof(hdr.v1_hash_hdr) &&
> + skb->tun_vnet_hash) {
Isn't vnet_hdr_sz guaranteed to be >= hdr.v1_hash_hdr, by virtue of
the set hash ioctl failing otherwise?
Such checks should be limited to control path where possible
> + vnet_hdr_content_sz = sizeof(hdr.v1_hash_hdr);
> + ret = virtio_net_hdr_v1_hash_from_skb(skb,
> + &hdr.v1_hash_hdr,
> + true,
> + vlan_hlen,
> + &vnet_hash);
> + } else {
> + vnet_hdr_content_sz = sizeof(hdr.hdr);
> + ret = virtio_net_hdr_from_skb(skb, &hdr.hdr,
> + tun_is_little_endian(tun),
> + true, vlan_hlen);
> + }
> +
Fix four issues with resctrl selftests.
The signal handling fix became necessary after the mount/umount fixes
and the uninitialized member bug was discovered during the review.
The other two came up when I ran resctrl selftests across the server
fleet in our lab to validate the upcoming CAT test rewrite (the rewrite
is not part of this series).
These are developed and should apply cleanly at least on top the
benchmark cleanup series (might apply cleanly also w/o the benchmark
series, I didn't test).
v4:
- Use func(void) for functions taking no arguments
- Correct Fixes tag formatting
v3:
- Add fix to uninitialized sa_flags
- Handle ksft_exit_fail_msg() in per test functions
- Make signal handler register fails to also exit
- Improve changelogs
v2:
- Include patch to move _GNU_SOURCE to Makefile to allow normal #include
placement
- Rework the signal register/unregister into patch to use helpers
- Fixed incorrect function parameter description
- Use return !!res to avoid confusing implicit boolean conversion
- Improve MBA/MBM success bound patch's changelog
- Tweak Cc: stable dependencies (make it a chain).
Ilpo Järvinen (7):
selftests/resctrl: Fix uninitialized .sa_flags
selftests/resctrl: Extend signal handler coverage to unmount on
receiving signal
selftests/resctrl: Remove duplicate feature check from CMT test
selftests/resctrl: Move _GNU_SOURCE define into Makefile
selftests/resctrl: Refactor feature check to use resource and feature
name
selftests/resctrl: Fix feature checks
selftests/resctrl: Reduce failures due to outliers in MBA/MBM tests
tools/testing/selftests/resctrl/Makefile | 2 +-
tools/testing/selftests/resctrl/cat_test.c | 8 --
tools/testing/selftests/resctrl/cmt_test.c | 3 -
tools/testing/selftests/resctrl/mba_test.c | 2 +-
tools/testing/selftests/resctrl/mbm_test.c | 2 +-
tools/testing/selftests/resctrl/resctrl.h | 7 +-
.../testing/selftests/resctrl/resctrl_tests.c | 82 ++++++++++++-------
tools/testing/selftests/resctrl/resctrl_val.c | 26 +++---
tools/testing/selftests/resctrl/resctrlfs.c | 69 ++++++----------
9 files changed, 97 insertions(+), 104 deletions(-)
--
2.30.2
Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:
64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
smp_call_function_many_cond+0x1
smp_call_function+0x39
on_each_cpu+0x2a
flush_tlb_kernel_range+0x7b
__purge_vmap_area_lazy+0x70
_vm_unmap_aliases.part.42+0xdf
change_page_attr_set_clr+0x16a
set_memory_ro+0x26
bpf_int_jit_compile+0x2f9
bpf_prog_select_runtime+0xc6
bpf_prepare_filter+0x523
sk_attach_filter+0x13
sock_setsockopt+0x92c
__sys_setsockopt+0x16a
__x64_sys_setsockopt+0x20
do_syscall_64+0x87
entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, providing it can be executed "early
enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.
Deferral approach
=================
Storing each and every callback, like a secondary call_single_queue turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.
Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
What about IPIs whose callback take a parameter, you may ask?
Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.
This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
Kernel entry vs execution of the deferred operation
===================================================
There is a non-zero length of code that is executed upon kernel entry before the
deferred operation can be itself executed (i.e. before we start getting into
context_tracking.c proper).
This means one must take extra care to what can happen in the early entry code,
and that <bad things> cannot happen. For instance, we really don't want to hit
instructions that have been modified by a remote text_poke() while we're on our
way to execute a deferred sync_core().
Patches
=======
o Patches 1-9 have been submitted separately and are included for the sake of
testing
o Patches 10-14 focus on having objtool detect problematic static key usage in
early entry
o Patch 15 adds the infrastructure for IPI deferral.
o Patches 16-17 add some RCU testing infrastructure
o Patch 18 adds text_poke() IPI deferral.
o Patches 19-20 add vunmap() flush_tlb_kernel_range() IPI deferral
These ones I'm a lot less confident about, mostly due to lacking
instrumentation/verification.
The actual deferred callback is also incomplete as it's not properly noinstr:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x19: call to native_write_cr4() leaves .noinstr.text section
and it doesn't support PARAVIRT - it's going to need a pv_ops.mmu entry, but I
have *no idea* what a sane implementation would be for Xen so I haven't
touched that yet.
Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v2
Testing
=======
Note: this is a different machine than used for v1, because that machine decided
to act difficult.
Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL9 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
$ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
rteval --onlyload --loads-cpulist=$HK_CPUS \
--hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 30 minutes.
v6.5-rc1 (+ cpumask filtering patches):
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
338 callback=generic_smp_call_function_single_interrupt+0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
9207 func=do_flush_tlb_all
1116 func=do_sync_core
62 func=do_kernel_range_flush
3 func=nohz_full_kick_func
v6.5-rc1 + patches:
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
2 callback=generic_smp_call_function_single_interrupt+0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
2 func=nohz_full_kick_func
The incriminating IPIs are all gone, but note that on the machine I used to test
v1 there were still some do_flush_tlb_all() IPIs caused by
pcpu_balance_workfn(), since only vmalloc is affected by the deferral
mechanism.
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for his guidance regarding objtool and hinting at the
.data..ro_after_init section.
Links
=====
[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
Revisions
=========
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
rcutorture case for a low-size counter (Paul)
The new TREE11 case with a 2-bit dynticks counter seems to pass when ran
against this series.
o Fixed flush_tlb_kernel_range_deferrable() definition
Peter Zijlstra (1):
jump_label,module: Don't alloc static_key_mod for __ro_after_init keys
Valentin Schneider (19):
tracing/filters: Dynamically allocate filter_pred.regex
tracing/filters: Enable filtering a cpumask field by another cpumask
tracing/filters: Enable filtering a scalar field by a cpumask
tracing/filters: Enable filtering the CPU common field by a cpumask
tracing/filters: Optimise cpumask vs cpumask filtering when user mask
is a single CPU
tracing/filters: Optimise scalar vs cpumask filtering when the user
mask is a single CPU
tracing/filters: Optimise CPU vs cpumask filtering when the user mask
is a single CPU
tracing/filters: Further optimise scalar vs cpumask comparison
tracing/filters: Document cpumask filtering
objtool: Flesh out warning related to pv_ops[] calls
objtool: Warn about non __ro_after_init static key usage in .noinstr
context_tracking: Make context_tracking_key __ro_after_init
x86/kvm: Make kvm_async_pf_enabled __ro_after_init
context-tracking: Introduce work deferral infrastructure
rcu: Make RCU dynticks counter size configurable
rcutorture: Add a test config to torture test low RCU_DYNTICKS width
context_tracking,x86: Defer kernel text patching IPIs
context_tracking,x86: Add infrastructure to defer kernel TLBI
x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL
CPUs
Documentation/trace/events.rst | 14 +
arch/Kconfig | 9 +
arch/x86/Kconfig | 1 +
arch/x86/include/asm/context_tracking_work.h | 20 ++
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/include/asm/tlbflush.h | 2 +
arch/x86/kernel/alternative.c | 24 +-
arch/x86/kernel/kprobes/core.c | 4 +-
arch/x86/kernel/kprobes/opt.c | 4 +-
arch/x86/kernel/kvm.c | 2 +-
arch/x86/kernel/module.c | 2 +-
arch/x86/mm/tlb.c | 40 ++-
include/asm-generic/sections.h | 5 +
include/linux/context_tracking.h | 26 ++
include/linux/context_tracking_state.h | 65 +++-
include/linux/context_tracking_work.h | 28 ++
include/linux/jump_label.h | 1 +
include/linux/trace_events.h | 1 +
init/main.c | 1 +
kernel/context_tracking.c | 53 ++-
kernel/jump_label.c | 49 +++
kernel/rcu/Kconfig | 33 ++
kernel/time/Kconfig | 5 +
kernel/trace/trace_events_filter.c | 302 ++++++++++++++++--
mm/vmalloc.c | 19 +-
tools/objtool/check.c | 22 +-
tools/objtool/include/objtool/check.h | 1 +
tools/objtool/include/objtool/special.h | 2 +
tools/objtool/special.c | 3 +
.../selftests/rcutorture/configs/rcu/TREE11 | 19 ++
.../rcutorture/configs/rcu/TREE11.boot | 1 +
31 files changed, 695 insertions(+), 64 deletions(-)
create mode 100644 arch/x86/include/asm/context_tracking_work.h
create mode 100644 include/linux/context_tracking_work.h
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
--
2.31.1
Currently the bpf selftests are skipped by default, so is someone would
like to run the tests one would need to run:
$ make TARGETS=bpf SKIP_TARGETS="" kselftest
To overwrite the SKIP_TARGETS that defines bpf by default. Also,
following the BPF instructions[1], to run the bpf selftests one would
need to enter in the tools/testing/selftests/bpf/ directory, and then
run make, which is not the standard way to run selftests per it's
documentation.
For the reasons above stop mentioning bpf in the kselftests as examples
of how to run a test suite.
[1]: Documentation/bpf/bpf_devel_QA.rst
Signed-off-by: Marcos Paulo de Souza <mpdesouza(a)suse.com>
---
Documentation/dev-tools/kselftest.rst | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/Documentation/dev-tools/kselftest.rst b/Documentation/dev-tools/kselftest.rst
index deede972f254..ab376b316c36 100644
--- a/Documentation/dev-tools/kselftest.rst
+++ b/Documentation/dev-tools/kselftest.rst
@@ -112,7 +112,7 @@ You can specify multiple tests to skip::
You can also specify a restricted list of tests to run together with a
dedicated skiplist::
- $ make TARGETS="bpf breakpoints size timers" SKIP_TARGETS=bpf kselftest
+ $ make TARGETS="breakpoints size timers" SKIP_TARGETS=size kselftest
See the top-level tools/testing/selftests/Makefile for the list of all
possible targets.
@@ -165,7 +165,7 @@ To see the list of available tests, the `-l` option can be used::
The `-c` option can be used to run all the tests from a test collection, or
the `-t` option for specific single tests. Either can be used multiple times::
- $ ./run_kselftest.sh -c bpf -c seccomp -t timers:posix_timers -t timer:nanosleep
+ $ ./run_kselftest.sh -c size -c seccomp -t timers:posix_timers -t timer:nanosleep
For other features see the script usage output, seen with the `-h` option.
@@ -210,7 +210,7 @@ option is supported, such as::
tests by using variables specified in `Running a subset of selftests`_
section::
- $ make -C tools/testing/selftests gen_tar TARGETS="bpf" FORMAT=.xz
+ $ make -C tools/testing/selftests gen_tar TARGETS="size" FORMAT=.xz
.. _tar's auto-compress: https://www.gnu.org/software/tar/manual/html_node/gzip.html#auto_002dcompre…
--
2.42.0