February 2025 - Linux-kselftest-mirror

[PATCH bpf-next v1 0/3] bpf: Fix use-after-free of sockmap

by Jiayuan Chen

1. Issue

Syzkaller reported this issue [1].

2. Reproduce

The issue can be reproduced with the test_sockmap_with_close_on_write() test I provided in selftests. To make it reproduce 100% of the time, also apply the following patch (sleep after checking sock):

'''
static void sk_psock_verdict_data_ready(struct sock *sk)
{
	.......
	if (unlikely(!sock))
		return;
+	if (!strcmp("test_progs", current->comm)) {
+		printk("sleep 2s to wait socket freed\n");
+		mdelay(2000);
+		printk("sleep end\n");
+	}
	ops = READ_ONCE(sock->ops);
	if (!ops || !ops->read_skb)
		return;
}
'''

Then run './test_progs -v sockmap_basic'. If the kernel has KASAN enabled [2], you will see the following warning:

'''
BUG: KASAN: slab-use-after-free in sk_psock_verdict_data_ready+0x29b/0x2d0
Read of size 8 at addr ffff88813a777020 by task test_progs/47055

Tainted: [O]=OOT_MODULE
Call Trace:
 <TASK>
 dump_stack_lvl+0x53/0x70
 print_address_description.constprop.0+0x30/0x420
 ? sk_psock_verdict_data_ready+0x29b/0x2d0
 print_report+0xb7/0x270
 ? sk_psock_verdict_data_ready+0x29b/0x2d0
 ? kasan_addr_to_slab+0xd/0xa0
 ? sk_psock_verdict_data_ready+0x29b/0x2d0
 kasan_report+0xca/0x100
 ? sk_psock_verdict_data_ready+0x29b/0x2d0
 sk_psock_verdict_data_ready+0x29b/0x2d0
 unix_stream_sendmsg+0x4a6/0xa40
 ? __pfx_unix_stream_sendmsg+0x10/0x10
 ? fdget+0x2c1/0x3a0
 __sys_sendto+0x39c/0x410
'''

3. Reason

'''
CPU0                                         CPU1
unix_stream_sendmsg(sk):
  other = unix_peer(sk)
  other->sk_data_ready(other):
    socket *sock = sk->sk_socket
    if (unlikely(!sock))
        return;
                                             close(other):
                                               ...
                                               other->close()
                                               free(socket)
    READ_ONCE(sock->ops)
    ^
    use 'sock' after free
'''

For TCP, UDP, or other protocols, we have already performed rcu_read_lock() when the network stack receives packets in ip_input.c:

'''
ip_local_deliver_finish():
    rcu_read_lock()
    ip_protocol_deliver_rcu()
        xxx_rcv
    rcu_read_unlock()
'''

However, for Unix sockets, sk_data_ready is called directly from the process context without rcu_read_lock() protection.

4. Solution

Because 'struct socket' is released using call_rcu(), we add rcu_read_{un}lock() at the entrance and exit of our sk_data_ready. This does not add overhead, at least for TCP and UDP, since they already run inside a relatively large RCU critical section.

Of course, we can also add a custom callback for Unix sockets and call rcu_read_lock() before calling _verdict_data_ready like this:

'''
if (sk_is_unix(sk))
    sk->sk_data_ready = sk_psock_verdict_data_ready_rcu;
else
    sk->sk_data_ready = sk_psock_verdict_data_ready;

sk_psock_verdict_data_ready_rcu():
    rcu_read_lock()
    sk_psock_verdict_data_ready()
    rcu_read_unlock()
'''

However, this would add too many branches, and it is not appropriate to distinguish network protocols in skmsg.c.

[1] https://syzkaller.appspot.com/bug?extid=dd90a702f518e0eac072
[2] https://syzkaller.appspot.com/text?tag=KernelConfig&x=1362a5aee630ff34

Jiayuan Chen (3):
  bpf, sockmap: avoid using sk_socket after free
  selftests/bpf: Add socketpair to create_pair to support unix socket
  selftests/bpf: Add edge case tests for sockmap

 net/core/skmsg.c                              | 18 ++++--
 .../selftests/bpf/prog_tests/socket_helpers.h | 13 ++++-
 .../selftests/bpf/prog_tests/sockmap_basic.c  | 57 +++++++++++++++++++
 3 files changed, 82 insertions(+), 6 deletions(-)

-- 
2.47.1
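
The race in section 3 and the effect of the proposed locking can be modelled in user space. The following is a toy, single-threaded Python simulation, not kernel code: `ToyRcu`, `Socket`, `free_socket` and `data_ready` are hypothetical names, and "freeing" is modelled by setting a flag. It relies only on what the cover letter states, namely that 'struct socket' is released through call_rcu(), so a reader inside rcu_read_lock() should not observe the free.

```python
class Socket:
    """Stand-in for struct socket; 'freed' models the memory being released."""

    def __init__(self):
        self.ops = "unix_stream_ops"
        self.freed = False


class ToyRcu:
    """Toy model: callbacks queued with call_rcu() run only when no reader is active."""

    def __init__(self):
        self.readers = 0
        self.pending = []

    def read_lock(self):
        self.readers += 1

    def read_unlock(self):
        self.readers -= 1
        self._run_callbacks()

    def call_rcu(self, callback):
        self.pending.append(callback)
        self._run_callbacks()

    def _run_callbacks(self):
        # A grace period can only end once every reader has left its critical section.
        if self.readers == 0:
            callbacks, self.pending = self.pending, []
            for callback in callbacks:
                callback()


def free_socket(sock):
    sock.freed = True


def data_ready(rcu, sock, use_rcu):
    """Replay the CPU0/CPU1 interleaving; return True if a freed socket was read."""
    if use_rcu:
        rcu.read_lock()
    # CPU0: 'sock' has been fetched and has passed the NULL check.
    # CPU1: close() runs now and releases the socket through call_rcu().
    rcu.call_rcu(lambda: free_socket(sock))
    # CPU0: READ_ONCE(sock->ops)
    used_after_free = sock.freed
    if use_rcu:
        rcu.read_unlock()
    return used_after_free
```

Calling `data_ready(ToyRcu(), Socket(), use_rcu=False)` reports a read of a freed socket, while `use_rcu=True` defers the free until the reader has finished. The model ignores real concurrency and grace-period timing.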


[PATCH v2 0/3] selftests/net: deflake GRO tests and fix return value and output

by Kevin Krakauer

The GRO selftests can flake and have some confusing behavior. These changes make the output and return value of GRO behave as expected, then deflake the tests.

v2:
- Split into multiple commits.
- Reduced napi_defer_hard_irqs to 1.
- Reduced gro_flush_timeout to 100us.
- Fixed comment that wasn't updated.

v1: https://lore.kernel.org/netdev/20250218164555.1955400-1-krakauer@google.com/

Kevin Krakauer (3):
  selftests/net: have `gro.sh -t` return a correct exit code
  selftests/net: only print passing message in GRO tests when tests pass
  selftests/net: deflake GRO tests

 tools/testing/selftests/net/gro.c         | 8 +++++---
 tools/testing/selftests/net/gro.sh        | 7 ++++---
 tools/testing/selftests/net/setup_veth.sh | 3 ++-
 3 files changed, 11 insertions(+), 7 deletions(-)

-- 
2.48.1.658.g4767266eb4-goog


[PATCH][next] KVM: selftests: Fix spelling mistake "avaialable" -> "available"

by Colin Ian King

There is a spelling mistake in a ksft_test_result_skip message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com>
---
 tools/testing/selftests/kvm/s390/cpumodel_subfuncs_test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/s390/cpumodel_subfuncs_test.c b/tools/testing/selftests/kvm/s390/cpumodel_subfuncs_test.c
index 27255880dabd..aded795d42be 100644
--- a/tools/testing/selftests/kvm/s390/cpumodel_subfuncs_test.c
+++ b/tools/testing/selftests/kvm/s390/cpumodel_subfuncs_test.c
@@ -291,7 +291,7 @@ int main(int argc, char *argv[])
 			ksft_test_result_pass("%s\n", testlist[idx].subfunc_name);
 			free(array);
 		} else {
-			ksft_test_result_skip("%s feature is not avaialable\n",
+			ksft_test_result_skip("%s feature is not available\n",
 					      testlist[idx].subfunc_name);
 		}
 	}
-- 
2.47.2


[PATCH v2 0/7] iommu: Add MSI mapping support with nested SMMU (Part-1 core)

by Nicolin Chen

[ Background ]
On ARM GIC systems and others, the target address of the MSI is translated by the IOMMU. For GIC, the MSI address page is called "ITS" page. When the IOMMU is disabled, the MSI address is programmed to the physical location of the GIC ITS page (e.g. 0x20200000). When the IOMMU is enabled, the ITS page is behind the IOMMU, so the MSI address is programmed to an allocated IO virtual address (a.k.a IOVA), e.g. 0xFFFF0000, which must be mapped to the physical ITS page:
  IOVA (0xFFFF0000) ===> PA (0x20200000)
When a 2-stage translation is enabled, the IOVA will still be used to program the MSI address, though the mappings will be in two stages:
  IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000)
(IPA stands for Intermediate Physical Address).

If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA, the IOVA is dynamically allocated from the top of the IOVA space. If attached to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough device), the IOVA is fixed to an MSI window reported by the IOMMU driver via IOMMU_RESV_SW_MSI, which is hardwired to MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.

So far, this IOMMU_RESV_SW_MSI works well as the kernel is entirely in charge of the IOMMU translation (1-stage translation), since the IOVA for the ITS page is fixed and known by the kernel. However, with a virtual machine enabling a nested IOMMU translation (2-stage), a guest kernel directly controls the stage-1 translation with an IOMMU_DOMAIN_DMA, mapping a vITS page (at an IPA 0x80900000) onto its own IOVA space (e.g. 0xEEEE0000). Then, the host kernel can't know that guest-level IOVA to program the MSI address.

There have been two approaches to solve this problem:
1. Create an identity mapping in the stage-1. VMM could insert a few RMRs (Reserved Memory Regions) in guest's IORT. Then the guest kernel would fetch these RMR entries from the IORT and create an IOMMU_RESV_DIRECT region per iommu group for a direct mapping. Eventually, the mappings would look like:
     IOVA (0x8000000) === IPA (0x8000000) ===> 0x20200000
   This requires an IOMMUFD ioctl for kernel and VMM to agree on the IPA.
2. Forward the guest-level MSI IOVA captured by VMM to the host-level GIC driver, to program the correct MSI IOVA. Forward the VMM-defined vITS page location (IPA) to the kernel for the stage-2 mapping. Eventually:
     IOVA (0xFFFF0000) ===> IPA (0x80900000) ===> PA (0x20200000)
   This requires a VFIO ioctl (for IOVA) and an IOMMUFD ioctl (for IPA).

Worth mentioning that when Eric Auger was working on the same topic with the VFIO iommu uAPI, he had a solution for approach (2) first, and then switched to approach (1), suggested by Jean-Philippe for the reduction of complexity.

Approach (1) basically feels like the existing VFIO passthrough that has a 1-stage mapping for the unmanaged domain, yet only by shifting the MSI mapping from stage 1 (no-viommu case) to stage 2 (has-viommu case). So, it could reuse the existing IOMMU_RESV_SW_MSI piece, by sharing the same idea of "VMM leaving everything to the kernel".

Approach (2) is an ideal solution, yet it requires additional effort for the kernel to be aware of the stage-1 gIOVAs and the stage-2 IPAs for vITS page(s), which demands that the VMM closely cooperate.
 * It also brings some complicated use cases to the table where the host or/and guest system(s) has/have multiple ITS pages.

[ Execution ]
Though these two approaches feel very different on the surface, they can share some underlying common infrastructure. Currently, only one pair of sw_msi functions (prepare/compose) is provided by dma-iommu for irqchip drivers to directly use. There could be different versions of functions from different domain owners: for existing VFIO passthrough cases and in-kernel DMA domain cases, reuse the existing dma-iommu's version of sw_msi functions; for nested translation use cases, there can be another version of sw_msi functions to handle mapping and msi_msg(s) differently.

As a part-1 series, this refactors the core infrastructure:
 - Get rid of the duplication in the "compose" function
 - Introduce a function pointer for the previously "prepare" function
 - Allow different domain owners to set their own "sw_msi" implementations
 - Implement an iommufd_sw_msi function to additionally support non-nested use cases and also prepare for a nested translation use case using the approach (1)

[ Future Plan ]
Part-2 will add support of approach (1), i.e. RMR solution:
 - Add a pair of IOMMUFD options for a SW_MSI window for kernel and VMM to agree on (for approach 1)
Part-3 and beyond will continue the effort of supporting approach (2), i.e. a complete vITS-to-pITS mapping:
 - Map the physical ITS page (potentially via IOMMUFD_CMD_IOAS_MAP_MSI)
 - Convey the IOVAs per-irq (potentially via VFIO_IRQ_SET_ACTION_PREPARE)

---

This is a joint effort that includes Jason's rework in irq/iommu/iommufd base level and my additional patches on top of that for new uAPIs.

This series is on github:
https://github.com/nicolinc/iommufd/commits/iommufd_msi_p1-v2

For testing with nested SMMU (approach 1):
https://github.com/nicolinc/iommufd/commits/wip/iommufd_msi_p2-v2
Pairing QEMU branch for testing (approach 1):
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_msi_p2-v2-rmr

Changelog
v2
 * Split the iommufd ioctl for approach (1) out of this part-1
 * Rebase on Jason's for-next tree (6.14-rc2) for two iommufd patches
 * Update commit logs in two irqchip patches to make narrative clearer
 * Keep iommu_dma_compose_msi_msg() in PATCH-1 as a small cleaner step
 * Improve with some coding style changes: kdoc and 100-char wrapping
v1
 https://lore.kernel.org/kvm/cover.1739005085.git.nicolinc@nvidia.com/
 * Rebase on v6.14-rc1 and iommufd_attach_handle-v1 series
   https://lore.kernel.org/all/cover.1738645017.git.nicolinc@nvidia.com/
 * Correct typos
 * Replace set_bit with __set_bit
 * Use a common helper to get iommufd_handle
 * Add kdoc for iommu_msi_iova/iommu_msi_page_shift
 * Rename msi_msg_set_msi_addr() to msi_msg_set_addr()
 * Update selftest for a better coverage for the new options
 * Change IOMMU_OPTION_SW_MSI_START/SIZE to be per-idev and properly check against device's reserved region list
RFCv2
 https://lore.kernel.org/kvm/cover.1736550979.git.nicolinc@nvidia.com/
 * Rebase on v6.13-rc6
 * Drop all the irq/pci patches and rework the compose function instead
 * Add a new sw_msi op to iommu_domain for a per type implementation and let iommufd core have its own implementation to support both approaches
 * Add RMR-solution (approach 1) support since it is straightforward and has been used in some out-of-tree projects widely
RFCv1
 https://lore.kernel.org/kvm/cover.1731130093.git.nicolinc@nvidia.com/

Thanks!
Nicolin

Jason Gunthorpe (5):
  genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie
  genirq/msi: Refactor iommu_dma_compose_msi_msg()
  iommu: Make iommu_dma_prepare_msi() into a generic operation
  irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by irqchips that need it
  iommufd: Implement sw_msi support natively

Nicolin Chen (2):
  iommu: Turn fault_data to iommufd private pointer
  iommu: Turn iova_cookie to dma-iommu private pointer

 drivers/iommu/Kconfig                   |   1 -
 drivers/irqchip/Kconfig                 |   4 +
 kernel/irq/Kconfig                      |   1 +
 drivers/iommu/iommufd/iommufd_private.h |  23 +++-
 include/linux/iommu.h                   |  58 +++++----
 include/linux/msi.h                     |  55 +++++---
 drivers/iommu/dma-iommu.c               |  63 +++-------
 drivers/iommu/iommu.c                   |  29 +++++
 drivers/iommu/iommufd/device.c          | 160 ++++++++++++++++++++----
 drivers/iommu/iommufd/fault.c           |   2 +-
 drivers/iommu/iommufd/hw_pagetable.c    |   5 +-
 drivers/iommu/iommufd/main.c            |   9 ++
 drivers/irqchip/irq-gic-v2m.c           |   5 +-
 drivers/irqchip/irq-gic-v3-its.c        |  13 +-
 drivers/irqchip/irq-gic-v3-mbi.c        |  12 +-
 drivers/irqchip/irq-ls-scfg-msi.c       |   5 +-
 16 files changed, 309 insertions(+), 136 deletions(-)

base-commit: dc10ba25d43f433ad5d9e8e6be4f4d2bb3cd9ddb
prerequisite-patch-id: 0000000000000000000000000000000000000000
-- 
2.43.0
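
The one-stage and two-stage mappings from the Background section can be worked through with the example addresses given in the cover letter. This is an illustrative Python sketch only: the dict-based "page tables", the `translate` and `msi_target` helpers, and the 4 KiB page size are assumptions made for the example and do not correspond to kernel data structures.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate(table, addr):
    """Look up the page containing addr and keep the offset within the page."""
    page = addr & ~PAGE_MASK
    return table[page] | (addr & PAGE_MASK)


def msi_target(iova, stage1, stage2):
    """Address that an MSI write to 'iova' finally reaches after both stages."""
    return translate(stage2, translate(stage1, iova))


# Approach (2): the guest picks its own IOVA for the vITS page in stage 1,
# and the host maps the vITS IPA to the physical ITS page in stage 2.
nested_stage1 = {0xFFFF0000: 0x80900000}
nested_stage2 = {0x80900000: 0x20200000}

# Approach (1): stage 1 is an identity mapping of the fixed MSI window,
# so only stage 2 has to point at the physical ITS page.
rmr_stage1 = {0x8000000: 0x8000000}
rmr_stage2 = {0x8000000: 0x20200000}
```

Both `msi_target(0xFFFF0000, nested_stage1, nested_stage2)` and `msi_target(0x8000000, rmr_stage1, rmr_stage2)` resolve to 0x20200000, the example physical ITS page.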


[PATCH] bpf/selftests: test_select_reuseport_kern: remove unused header

by Alexis Lothoré (eBPF Foundation)

test_select_reuseport_kern.c is currently including <stdlib.h>, but it does not use any definition from there.

Remove stdlib.h inclusion from test_select_reuseport_kern.c

Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore(a)bootlin.com>
---
I stumbled upon this specific header include while trying to build selftests on the current bpf-next_base branch, which ended with this error:

  [...]
  CLNG-BPF [test_progs-cpuv4] test_select_reuseport_kern.bpf.o
  In file included from progs/test_select_reuseport_kern.c:4:
  /usr/include/bits/floatn.h:83:52: error: unsupported machine mode '__TC__'
     83 | typedef _Complex float __cfloat128 __attribute__ ((__mode__ (__TC__)));
        |                                                    ^
  /usr/include/bits/floatn.h:97:9: error: __float128 is not supported on this target
     97 | typedef __float128 _Float128;

The exact error (unknown TC mode) is likely rather due to some issues in my local build, in which I am actually cross-compiling selftests (for ARM64 from a x86_64 host, but not through vmtests.sh), and I still have to sort out some other issues. But I guess it is not really right anyway to include stdlib.h from an ebpf program, especially if it is not used, so I am still proposing this small change.
---
 tools/testing/selftests/bpf/progs/test_select_reuseport_kern.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/progs/test_select_reuseport_kern.c b/tools/testing/selftests/bpf/progs/test_select_reuseport_kern.c
index 5eb25c6ad75b1a9c61f22e978d817d3dc88b3a2f..a5be3267dbb01372c84bb468e3a48eae69ac5329 100644
--- a/tools/testing/selftests/bpf/progs/test_select_reuseport_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_select_reuseport_kern.c
@@ -1,7 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2018 Facebook */
 
-#include <stdlib.h>
 #include <linux/in.h>
 #include <linux/ip.h>
 #include <linux/ipv6.h>

---
base-commit: 072c40912477ebac2ef98cd0b1532ba9bebda20a
change-id: 20250227-remove_wrong_header-02d288d64204

Best regards,
-- 
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


[PATCH v3 0/2 RESEND] update kselftest framework to check for required configs

by Siddharth Menon

Currently, kselftests does not have a generalised mechanism to skip compilation and run tests when required kernel configuration options are disabled.

This patch series addresses this issue by checking whether all required configs from selftest/<test>/config are enabled in the current kernel.

Siddharth Menon (2):
  selftests: Introduce script to validate required dependencies
  selftests/lib.mk: Introduce check to validate required dependencies

 .../testing/selftests/check_kselftest_deps.pl | 170 ++++++++++++++++++
 tools/testing/selftests/lib.mk                |  15 +-
 2 files changed, 183 insertions(+), 2 deletions(-)
 create mode 100755 tools/testing/selftests/check_kselftest_deps.pl

-- 
2.48.1
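
The check described above, comparing a selftest's required options against the running kernel's configuration, could look roughly like the following. This is a hedged Python sketch of the idea, not the check_kselftest_deps.pl script from the series (whose contents are not shown here): `parse_config` and `missing_configs` are hypothetical helpers, and the rule that a `=y` or `=m` requirement is met by either `y` or `m` is an assumption of the sketch.

```python
def parse_config(text):
    """Return {option: value} for lines of the form CONFIG_FOO=value."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Lines such as '# CONFIG_FOO is not set' start with '#' and are skipped.
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def missing_configs(required_text, kernel_text):
    """List required options that the kernel config does not enable.

    Assumption: a requirement of =y or =m is met when the kernel has the
    option set to either y or m; other kinds of requirement are ignored.
    """
    required = parse_config(required_text)
    kernel = parse_config(kernel_text)
    return sorted(
        name
        for name, value in required.items()
        if value in ("y", "m") and kernel.get(name) not in ("y", "m")
    )
```

A caller would read the test's config file and the kernel configuration (for example from /proc/config.gz or /boot/config-$(uname -r), where available), and skip the test when the returned list is not empty.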


[RFC PATCH 0/9] bpf: Mitigate Spectre v1 using speculation barriers

by Luis Gerhorst

This improves the expressiveness of unprivileged BPF by inserting speculation barriers instead of rejecting the programs.

The approach was presented at LPC'24:
  https://lpc.events/event/18/contributions/1954/ ("Mitigating Spectre-PHT using Speculation Barriers in Linux eBPF")
and RAID'24:
  https://arxiv.org/pdf/2405.00078 ("VeriFence: Lightweight and Precise Spectre Defenses for Untrusted Linux Kernel Extensions")

Goal of this RFC is to get feedback on the approach and the structuring into commits.

TODOs to be fixed for final version:
* actually emit arm64 barrier
* fix unexpected_load_success from test_progs for "bpf: Fall back to nospec for sanitization-failures"
* use bpf-next as base commit

Luis Gerhorst (9):
  bpf/arm64: Unset bypass_spec_v4() instead of ignoring BPF_NOSPEC
  bpf: Refactor do_check() if/else into do_check_insn()
  bpf: Return EFAULT on misconfigurations
  bpf: Return EFAULT on internal errors
  bpf: Fall back to nospec if v1 verification fails
  bpf: Allow nospec-protected var-offset stack access
  bpf: Refactor push_stack to return error code
  bpf: Fall back to nospec for sanitization-failures
  bpf: Cut speculative path verification short

 arch/arm64/net/bpf_jit_comp.c                 |  10 +-
 include/linux/bpf.h                           |  14 +-
 include/linux/bpf_verifier.h                  |   3 +-
 kernel/bpf/core.c                             |  17 +-
 kernel/bpf/verifier.c                         | 832 ++++++++++--------
 .../selftests/bpf/progs/verifier_and.c        |   3 +-
 .../selftests/bpf/progs/verifier_bounds.c     |  30 +-
 .../selftests/bpf/progs/verifier_movsx.c      |   6 +-
 .../selftests/bpf/progs/verifier_unpriv.c     |   3 +-
 .../bpf/progs/verifier_value_ptr_arith.c      |  11 +-
 10 files changed, 520 insertions(+), 409 deletions(-)

base-commit: d082ecbc71e9e0bf49883ee4afd435a77a5101b6
-- 
2.48.1


[PATCH] KVM: selftests: access_tracking_perf_test: add option to skip the sanity check

by Maxim Levitsky

Add an option to skip sanity check of number of still idle pages, and force it on, in case hypervisor or NUMA balancing is detected.

Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com>
---
 .../selftests/kvm/access_tracking_perf_test.c | 23 +++++++++++++++++--
 .../testing/selftests/kvm/include/test_util.h |  1 +
 tools/testing/selftests/kvm/lib/test_util.c   | 22 ++++++++++++++++++
 3 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..eafaecf086c4 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -65,6 +65,8 @@ static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
 /* Whether to overlap the regions of memory vCPUs access. */
 static bool overlap_memory_access;
 
+static bool skip_sanity_check;
+
 struct test_params {
 	/* The backing source for the region of memory. */
 	enum vm_mem_backing_src_type backing_src;
@@ -185,7 +187,7 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
 	 */
 	if (still_idle >= pages / 10) {
 #ifdef __x86_64__
-		TEST_ASSERT(this_cpu_has(X86_FEATURE_HYPERVISOR),
+		TEST_ASSERT(skip_sanity_check,
 			    "vCPU%d: Too many pages still idle (%lu out of %lu)",
 			    vcpu_idx, still_idle, pages);
 #endif
@@ -342,6 +344,8 @@ static void help(char *name)
 	printf(" -v: specify the number of vCPUs to run.\n");
 	printf(" -o: Overlap guest memory accesses instead of partitioning\n"
 	       " them into a separate region of memory for each vCPU.\n");
+	printf(" -u: Skip check that after dirtying the guest memory, most (90%%) of\n"
+	       "it is reported as dirty again");
 	backing_src_help("-s");
 	puts("");
 	exit(0);
@@ -359,7 +363,7 @@ int main(int argc, char *argv[])
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "hm:b:v:os:")) != -1) {
+	while ((opt = getopt(argc, argv, "hm:b:v:os:u")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
@@ -376,6 +380,9 @@ int main(int argc, char *argv[])
 		case 's':
 			params.backing_src = parse_backing_src_type(optarg);
 			break;
+		case 'u':
+			skip_sanity_check = true;
+			break;
 		case 'h':
 		default:
 			help(argv[0]);
@@ -386,6 +393,18 @@ int main(int argc, char *argv[])
 	page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
 	__TEST_REQUIRE(page_idle_fd >= 0,
 		       "CONFIG_IDLE_PAGE_TRACKING is not enabled");
+
+
+	if (skip_sanity_check == false) {
+		if (this_cpu_has(X86_FEATURE_HYPERVISOR)) {
+			printf("Skipping idle page count sanity check, because the test is run nested\n");
+			skip_sanity_check = true;
+		} else if (is_numa_balancing_enabled()) {
+			printf("Skipping idle page count sanity check, because NUMA balance is enabled\n");
+			skip_sanity_check = true;
+		}
+	}
+
 	close(page_idle_fd);
 
 	for_each_guest_mode(run_test, &params);
diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index 3e473058849f..1bc9b0a92427 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -153,6 +153,7 @@ bool is_backing_src_hugetlb(uint32_t i);
 void backing_src_help(const char *flag);
 enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name);
 long get_run_delay(void);
+bool is_numa_balancing_enabled(void);
 
 /*
  * Whether or not the given source type is shared memory (as opposed to
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
index 8ed0b74ae837..1271863613fa 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -163,6 +163,28 @@ size_t get_trans_hugepagesz(void)
 	return size;
 }
 
+
+bool is_numa_balancing_enabled(void)
+{
+	int ret;
+	int val;
+	struct stat statbuf;
+	FILE *f;
+
+	ret = stat("/proc/sys/kernel/numa_balancing", &statbuf);
+	TEST_ASSERT(ret == 0 || (ret == -1 && errno == ENOENT),
+		    "Error in stating /proc/sys/kernel/numa_balancing");
+
+	if (ret != 0)
+		return false;
+
+	f = fopen("/proc/sys/kernel/numa_balancing", "r");
+	ret = fscanf(f, "%d", &val);
+
+	TEST_ASSERT(val == 0 || val == 1, "Unexpected value in /proc/sys/kernel/numa_balancing");
+	return val == 1;
+}
+
 size_t get_def_hugetlb_pagesz(void)
 {
 	char buf[64];
-- 
2.26.3
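
The detection step in the patch above (treat a missing /proc/sys/kernel/numa_balancing as "disabled", otherwise read its value) can be sketched outside the selftest harness. This is a minimal Python illustration, not part of the patch: the `path` parameter exists only so the function can be exercised against a temporary file, and treating any non-zero value as "enabled" is an assumption of the sketch, whereas the C helper asserts that the value is 0 or 1.

```python
def is_numa_balancing_enabled(path="/proc/sys/kernel/numa_balancing"):
    """Return True if the sysctl file exists and holds a non-zero integer."""
    try:
        with open(path) as f:
            text = f.read().strip()
    except FileNotFoundError:
        # No such sysctl on this kernel: report NUMA balancing as disabled.
        return False
    # Assumption: any non-zero value counts as enabled.
    return int(text) != 0
```

Opening the file inside a `with` block also closes it on every path, which the C version in the patch does not do explicitly.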


[PATCH net v2] selftests: drv-net: Check if combined-count exists

by Joe Damato

Some drivers, like tg3, do not set combined-count:

$ ethtool -l enp4s0f1
Channel parameters for enp4s0f1:
Pre-set maximums:
RX:		4
TX:		4
Other:		n/a
Combined:	n/a
Current hardware settings:
RX:		4
TX:		1
Other:		n/a
Combined:	n/a

In the case where combined-count is not set, the ethtool netlink code in the kernel elides the value and the code in the test:

  netnl.channels_get(...)

With a tg3 device, the returned dictionary looks like:

{'header': {'dev-index': 3, 'dev-name': 'enp4s0f1'},
 'rx-max': 4,
 'rx-count': 4,
 'tx-max': 4,
 'tx-count': 1}

Note that the key 'combined-count' is missing. As a result of this missing key the test raises an exception:

 # Exception|     if channels['combined-count'] == 0:
 # Exception|        ~~~~~~~~^^^^^^^^^^^^^^^^^^
 # Exception| KeyError: 'combined-count'

Change the test to check if 'combined-count' is a key in the dictionary first and if not assume that this means the driver has separate RX and TX queues.

With this change, the test now passes successfully on tg3 and mlx5 (which does have a 'combined-count').

Fixes: 1cf270424218 ("net: selftest: add test for netdev netlink queue-get API")
Signed-off-by: Joe Damato <jdamato(a)fastly.com>
---
v2:
  - Simplify logic and reduce indentation as suggested by David Wei. Retested on both tg3 and mlx5 and test passes as expected.

v1: https://lore.kernel.org/lkml/20250225181455.224309-1-jdamato@fastly.com/

 tools/testing/selftests/drivers/net/queues.py | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/queues.py b/tools/testing/selftests/drivers/net/queues.py
index 38303da957ee..8a518905a9f9 100755
--- a/tools/testing/selftests/drivers/net/queues.py
+++ b/tools/testing/selftests/drivers/net/queues.py
@@ -45,10 +45,9 @@ def addremove_queues(cfg, nl) -> None:
 
     netnl = EthtoolFamily()
     channels = netnl.channels_get({'header': {'dev-index': cfg.ifindex}})
-    if channels['combined-count'] == 0:
-        rx_type = 'rx'
-    else:
-        rx_type = 'combined'
+    rx_type = 'rx'
+    if channels.get('combined-count', 0) > 0:
+        rx_type = 'combined'
 
     expected = curr_queues - 1
     cmd(f"ethtool -L {cfg.dev['ifname']} {rx_type} {expected}", timeout=10)

base-commit: 8d52da23b6c68a0f6bad83959ebb61a2cf623c4e
-- 
2.43.0
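
The behaviour of the fixed check can be tried in isolation with the two kinds of reply described in the commit message. The helper name `pick_rx_type` is hypothetical and used only for this sketch; the logic mirrors the `channels.get('combined-count', 0) > 0` test from the diff.

```python
def pick_rx_type(channels):
    """Choose the 'ethtool -L' channel keyword for a channels_get() reply."""
    # A missing 'combined-count' is treated like 0: separate RX and TX queues.
    if channels.get("combined-count", 0) > 0:
        return "combined"
    return "rx"


# Reply shape quoted in the commit message for a tg3 device (no 'combined-count').
tg3_reply = {
    "header": {"dev-index": 3, "dev-name": "enp4s0f1"},
    "rx-max": 4,
    "rx-count": 4,
    "tx-max": 4,
    "tx-count": 1,
}
```

For `tg3_reply` the helper selects 'rx' instead of raising KeyError, and a reply with a positive 'combined-count' still selects 'combined'.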


[PATCH v11 3/3] selftests/rseq: Add test for mm_cid compaction

by Gabriele Monaco

A task in the kernel (task_mm_cid_work) runs somewhat periodically to compact the mm_cid for each process. Add a test to validate that it runs correctly and timely.

The test spawns 1 thread pinned to each CPU, then each thread, including the main one, runs in short bursts for some time. During this period, the mm_cids should be spanning all numbers between 0 and nproc.

At the end of this phase, a thread with high enough mm_cid (>= nproc/2) is selected to be the new leader, all other threads terminate.

After some time, the only running thread should see 0 as mm_cid, if that doesn't happen, the compaction mechanism didn't work and the test fails.

The test never fails if only 1 core is available, in which case, we cannot test anything as the only available mm_cid is 0.

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Signed-off-by: Gabriele Monaco <gmonaco(a)redhat.com>
---
 tools/testing/selftests/rseq/.gitignore       |   1 +
 tools/testing/selftests/rseq/Makefile         |   2 +-
 .../selftests/rseq/mm_cid_compaction_test.c   | 200 ++++++++++++++++++
 3 files changed, 202 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/rseq/mm_cid_compaction_test.c

diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
index 16496de5f6ce4..2c89f97e4f737 100644
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -3,6 +3,7 @@ basic_percpu_ops_test
 basic_percpu_ops_mm_cid_test
 basic_test
 basic_rseq_op_test
+mm_cid_compaction_test
 param_test
 param_test_benchmark
 param_test_compare_twice
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 5a3432fceb586..ce1b38f46a355 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -16,7 +16,7 @@ OVERRIDE_TARGETS = 1
 
 TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
 		param_test_benchmark param_test_compare_twice param_test_mm_cid \
-		param_test_mm_cid_benchmark param_test_mm_cid_compare_twice
+		param_test_mm_cid_benchmark param_test_mm_cid_compare_twice mm_cid_compaction_test
 
 TEST_GEN_PROGS_EXTENDED = librseq.so
 
diff --git a/tools/testing/selftests/rseq/mm_cid_compaction_test.c b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
new file mode 100644
index 0000000000000..7ddde3b657dd6
--- /dev/null
+++ b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
@@ -0,0 +1,200 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stddef.h>
+
+#include "../kselftest.h"
+#include "rseq.h"
+
+#define VERBOSE 0
+#define printf_verbose(fmt, ...)                    \
+	do {                                        \
+		if (VERBOSE)                        \
+			printf(fmt, ##__VA_ARGS__); \
+	} while (0)
+
+/* 0.5 s */
+#define RUNNER_PERIOD 500000
+/* Number of runs before we terminate or get the token */
+#define THREAD_RUNS 5
+
+/*
+ * Number of times we check that the mm_cid were compacted.
+ * Checks are repeated every RUNNER_PERIOD.
+ */
+#define MM_CID_COMPACT_TIMEOUT 10
+
+struct thread_args {
+	int cpu;
+	int num_cpus;
+	pthread_mutex_t *token;
+	pthread_barrier_t *barrier;
+	pthread_t *tinfo;
+	struct thread_args *args_head;
+};
+
+static void __noreturn *thread_runner(void *arg)
+{
+	struct thread_args *args = arg;
+	int i, ret, curr_mm_cid;
+	cpu_set_t cpumask;
+
+	CPU_ZERO(&cpumask);
+	CPU_SET(args->cpu, &cpumask);
+	ret = pthread_setaffinity_np(pthread_self(), sizeof(cpumask), &cpumask);
+	if (ret) {
+		errno = ret;
+		perror("Error: failed to set affinity");
+		abort();
+	}
+	pthread_barrier_wait(args->barrier);
+
+	for (i = 0; i < THREAD_RUNS; i++)
+		usleep(RUNNER_PERIOD);
+	curr_mm_cid = rseq_current_mm_cid();
+	/*
+	 * We select one thread with high enough mm_cid to be the new leader.
+	 * All other threads (including the main thread) will terminate.
+	 * After some time, the mm_cid of the only remaining thread should
+	 * converge to 0, if not, the test fails.
+	 */
+	if (curr_mm_cid >= args->num_cpus / 2 &&
+	    !pthread_mutex_trylock(args->token)) {
+		printf_verbose(
+			"cpu%d has mm_cid=%d and will be the new leader.\n",
+			sched_getcpu(), curr_mm_cid);
+		for (i = 0; i < args->num_cpus; i++) {
+			if (args->tinfo[i] == pthread_self())
+				continue;
+			ret = pthread_join(args->tinfo[i], NULL);
+			if (ret) {
+				errno = ret;
+				perror("Error: failed to join thread");
+				abort();
+			}
+		}
+		pthread_barrier_destroy(args->barrier);
+		free(args->tinfo);
+		free(args->token);
+		free(args->barrier);
+		free(args->args_head);
+
+		for (i = 0; i < MM_CID_COMPACT_TIMEOUT; i++) {
+			curr_mm_cid = rseq_current_mm_cid();
+			printf_verbose("run %d: mm_cid=%d on cpu%d.\n", i,
+				       curr_mm_cid, sched_getcpu());
+			if (curr_mm_cid == 0)
+				exit(EXIT_SUCCESS);
+			usleep(RUNNER_PERIOD);
+		}
+		exit(EXIT_FAILURE);
+	}
+	printf_verbose("cpu%d has mm_cid=%d and is going to terminate.\n",
+		       sched_getcpu(), curr_mm_cid);
+	pthread_exit(NULL);
+}
+
+int test_mm_cid_compaction(void)
+{
+	cpu_set_t affinity;
+	int i, j, ret = 0, num_threads;
+	pthread_t *tinfo;
+	pthread_mutex_t *token;
+	pthread_barrier_t *barrier;
+	struct thread_args *args;
+
+	sched_getaffinity(0, sizeof(affinity), &affinity);
+	num_threads = CPU_COUNT(&affinity);
+	tinfo = calloc(num_threads, sizeof(*tinfo));
+	if (!tinfo) {
+		perror("Error: failed to allocate tinfo");
+		return -1;
+	}
+	args = calloc(num_threads, sizeof(*args));
+	if (!args) {
+		perror("Error: failed to allocate args");
+		ret = -1;
+		goto out_free_tinfo;
+	}
+	token = malloc(sizeof(*token));
+	if (!token) {
+		perror("Error: failed to allocate token");
+		ret = -1;
+		goto out_free_args;
+	}
+	barrier = malloc(sizeof(*barrier));
+	if (!barrier) {
+		perror("Error: failed to allocate barrier");
+		ret = -1;
+		goto out_free_token;
+	}
+	if (num_threads == 1) {
+		fprintf(stderr, "Cannot test on a single cpu. "
+				"Skipping mm_cid_compaction test.\n");
+		/* only skipping the test, this is not a failure */
+		goto out_free_barrier;
+	}
+	pthread_mutex_init(token, NULL);
+	ret = pthread_barrier_init(barrier, NULL, num_threads);
+	if (ret) {
+		errno = ret;
+		perror("Error: failed to initialise barrier");
+		goto out_free_barrier;
+	}
+	for (i = 0, j = 0; i < CPU_SETSIZE && j < num_threads; i++) {
+		if (!CPU_ISSET(i, &affinity))
+			continue;
+		args[j].num_cpus = num_threads;
+		args[j].tinfo = tinfo;
+		args[j].token = token;
+		args[j].barrier = barrier;
+		args[j].cpu = i;
+		args[j].args_head = args;
+		if (!j) {
+			/* The first thread is the main one */
+			tinfo[0] = pthread_self();
+			++j;
+			continue;
+		}
+		ret = pthread_create(&tinfo[j], NULL, thread_runner, &args[j]);
+		if (ret) {
+			errno = ret;
+			perror("Error: failed to create thread");
+			abort();
+		}
+		++j;
+	}
+	printf_verbose("Started %d threads.\n", num_threads);
+
+	/* Also main thread will terminate if it is not selected as leader */
+	thread_runner(&args[0]);
+
+	/* only reached in case of errors */
+out_free_barrier:
+	free(barrier);
+out_free_token:
+	free(token);
+out_free_args:
+	free(args);
+out_free_tinfo:
+	free(tinfo);
+
+	return ret;
+}
+
+int main(int argc, char **argv)
+{
+	if (!rseq_mm_cid_available()) {
+		fprintf(stderr, "Error: rseq_mm_cid unavailable\n");
+		return -1;
+	}
+	if (test_mm_cid_compaction())
+		return -1;
+	return 0;
+}
-- 
2.48.1

