June 2025 - Linux-kselftest-mirror

[PATCH v2 0/3] replace `allow(...)` lints with `expect(...)`

by Onur Özkan

Changes in v2:
  - Removed lints are not replaced with `expect` in the first diff.
  - Removals are done in separate diffs for each.

The `#[allow(clippy::non_send_fields_in_send_ty)]` removal was tested on 1.81 and clippy was still happy with it. I couldn't test it on 1.78 because when I go below 1.81 `menuconfig` no longer shows the Rust option. And any manual changes I make to `.config` are immediately reverted on `make` invocations.

Onur Özkan (3):
  replace `#[allow(...)]` with `#[expect(...)]`
  rust: remove `#[allow(clippy::unnecessary_cast)]`
  rust: remove `#[allow(clippy::non_send_fields_in_send_ty)]`

 drivers/gpu/nova-core/regs.rs       | 2 +-
 rust/compiler_builtins.rs           | 2 +-
 rust/kernel/alloc/allocator_test.rs | 2 +-
 rust/kernel/cpufreq.rs              | 1 -
 rust/kernel/devres.rs               | 2 +-
 rust/kernel/driver.rs               | 2 +-
 rust/kernel/drm/ioctl.rs            | 8 ++++----
 rust/kernel/error.rs                | 3 +--
 rust/kernel/init.rs                 | 6 +++---
 rust/kernel/kunit.rs                | 2 +-
 rust/kernel/opp.rs                  | 4 ++--
 rust/kernel/types.rs                | 2 +-
 rust/macros/helpers.rs              | 2 +-
 13 files changed, 18 insertions(+), 20 deletions(-)

--
2.50.0

2 months, 1 week


[PATCH 0/2] Possible TTY privilege escalation in TIOCSTI ioctl

by Abhinav Saxena via B4 Relay

This patch series was initially sent to security(a)k.o; resending it in public. I might follow up with a test series that addresses similar issues with TIOCLINUX.

===============

The TIOCSTI ioctl uses capable(CAP_SYS_ADMIN) for access control, which checks the current process's credentials. However, it doesn't validate against the file opener's credentials stored in file->f_cred.

This creates a potential security issue where an unprivileged process can open a TTY fd and pass it to a privileged process via SCM_RIGHTS. The privileged process may then inadvertently grant access based on its elevated privileges rather than the original opener's credentials.

Background
==========

As noted in previous discussion, while CONFIG_LEGACY_TIOCSTI can restrict TIOCSTI usage, it is enabled by default in most distributions. Even when CONFIG_LEGACY_TIOCSTI=n, processes with CAP_SYS_ADMIN can still use TIOCSTI according to the Kconfig documentation.

Additionally, CONFIG_LEGACY_TIOCSTI controls the default value for the dev.tty.legacy_tiocsti sysctl, which remains runtime-configurable. This means the described attack vector could work on systems even with CONFIG_LEGACY_TIOCSTI=n, particularly on Ubuntu 24.04 where it's "restricted" but still functional.

Solution Approach
=================

This series addresses the issue through SELinux LSM integration rather than modifying core TTY credential checking to avoid potential compatibility issues with existing userspace. The enhancement adds proper current task and file credential capability validation in SELinux's selinux_file_ioctl() hook specifically for TIOCSTI operations.

Testing
=======

All patches have been validated using:
- scripts/checkpatch.pl --strict (0 errors, 0 warnings)
- Functional testing on kernel v6.16-rc2
- File descriptor passing security test scenarios
- SELinux policy enforcement testing

The fd_passing_security test demonstrates the security concern. To verify, disable legacy TIOCSTI and run the test:

$ echo "0" | sudo tee /proc/sys/dev/tty/legacy_tiocsti
$ sudo ./tools/testing/selftests/tty/tty_tiocsti_test -t fd_passing_security

Patch Overview
==============

PATCH 1/2: selftests/tty: add TIOCSTI test suite
  Comprehensive test suite demonstrating the issue and fix validation

PATCH 2/2: selinux: add capability checks for TIOCSTI ioctl
  Core security enhancement via SELinux LSM hook

References
==========

- tty_ioctl(4) - documents TIOCSTI ioctl and capability requirements
- commit 83efeeeb3d04 ("tty: Allow TIOCSTI to be disabled")
- Documentation/security/credentials.rst
- https://github.com/KSPP/linux/issues/156
- https://lore.kernel.org/linux-hardening/Y0m9l52AKmw6Yxi1@hostpad/
- drivers/tty/Kconfig

Configuration References:
[1] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/dri…
[2] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/dri…
[3] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/dri…

To: Shuah Khan <shuah(a)kernel.org>
To: Nathan Chancellor <nathan(a)kernel.org>
To: Nick Desaulniers <nick.desaulniers+lkml(a)gmail.com>
To: Bill Wendling <morbo(a)google.com>
To: Justin Stitt <justinstitt(a)google.com>
To: Paul Moore <paul(a)paul-moore.com>
To: Stephen Smalley <stephen.smalley.work(a)gmail.com>
To: Ondrej Mosnacek <omosnace(a)redhat.com>
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: llvm(a)lists.linux.dev
Cc: selinux(a)vger.kernel.org
Signed-off-by: Abhinav Saxena <xandfury(a)gmail.com>
---
Abhinav Saxena (2):
  selftests/tty: add TIOCSTI test suite
  selinux: add capability checks for TIOCSTI ioctl

 security/selinux/hooks.c                       |   6 +
 tools/testing/selftests/tty/Makefile           |   6 +-
 tools/testing/selftests/tty/config             |   1 +
 tools/testing/selftests/tty/tty_tiocsti_test.c | 421 +++++++++++++++++++++++++
 4 files changed, 433 insertions(+), 1 deletion(-)
---
base-commit: 5adb635077d1b4bd65b183022775a59a378a9c00
change-id: 20250618-toicsti-bug-7822b8e94a32

Best regards,
--
Abhinav Saxena <xandfury(a)gmail.com>
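
The descriptor-passing step that the cover letter describes can be illustrated with a minimal sketch. This is not the selftest from the series: it assumes Python 3.9 or later (for socket.send_fds() and socket.recv_fds(), which wrap SCM_RIGHTS), uses a pipe in place of a TTY, and lets one process play both sender and receiver. The helper names pass_fd() and same_open_file() are hypothetical.

```python
import os
import socket


def pass_fd(fd: int) -> int:
    """Send fd over an AF_UNIX socket pair and return the received descriptor."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with sender, receiver:
        # send_fds() attaches the descriptor as SCM_RIGHTS ancillary data;
        # at least one byte of ordinary payload has to accompany it.
        socket.send_fds(sender, [b"x"], [fd])
        _data, fds, _flags, _addr = socket.recv_fds(receiver, 1, 1)
    return fds[0]


def same_open_file(fd_a: int, fd_b: int) -> bool:
    """Return True if both descriptors refer to the same device and inode."""
    a, b = os.fstat(fd_a), os.fstat(fd_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```

The receiver ends up with a new descriptor number for the same open file, so whatever was recorded when the sender opened it travels with the descriptor. A check that looks only at the calling process never consults that recorded state, which is the gap the series is concerned with.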

2 months, 1 week


[PATCH RFT v17 0/8] fork: Support shadow stacks in clone3()

by Mark Brown

The kernel has recently added support for shadow stacks, currently x86 only using their CET feature but both arm64 and RISC-V have equivalent features (GCS and Zicfiss respectively), I am actively working on GCS[1]. With shadow stacks the hardware maintains an additional stack containing only the return addresses for branch instructions which is not generally writeable by userspace and ensures that any returns are to the recorded addresses. This provides some protection against ROP attacks and makes it easier to collect call stacks. These shadow stacks are allocated in the address space of the userspace process.

Our API for shadow stacks does not currently offer userspace any flexibility for managing the allocation of shadow stacks for newly created threads, instead the kernel allocates a new shadow stack with the same size as the normal stack whenever a thread is created with the feature enabled. The stacks allocated in this way are freed by the kernel when the thread exits or shadow stacks are disabled for the thread. This lack of flexibility and control isn't ideal, in the vast majority of cases the shadow stack will be over allocated and the implicit allocation and deallocation is not consistent with other interfaces. As far as I can tell the interface is done in this manner mainly because the shadow stack patches were in development since before clone3() was implemented.

Since clone3() is readily extensible let's add support for specifying a shadow stack when creating a new thread or process, keeping the current implicit allocation behaviour if one is not specified either with clone3() or through the use of clone(). The user must provide a shadow stack pointer, this must point to memory mapped for use as a shadow stack by map_shadow_stack() with an architecture specified shadow stack token at the top of the stack.

Yuri Khrustalev has raised questions from the libc side regarding discoverability of extended clone3() structure sizes[2], this seems like a general issue with clone3(). There was a suggestion to add a hwcap on arm64 which isn't ideal but is doable there, though architecture specific mechanisms would also be needed for x86 (and RISC-V if its support gets merged before this does).

Please note that the x86 portions of this code are build tested only, I don't appear to have a system that can run CET available to me.

[1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87…
[2] https://lore.kernel.org/r/aCs65ccRQtJBnZ_5@arm.com

Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v17:
- Rebase onto v6.16-rc1.
- Link to v16: https://lore.kernel.org/r/20250416-clone3-shadow-stack-v16-0-2ffc9ca3917b@k…

Changes in v16:
- Rebase onto v6.15-rc2.
- Roll in fixes from x86 testing from Rick Edgecombe.
- Rework so that the argument is shadow_stack_token.
- Link to v15: https://lore.kernel.org/r/20250408-clone3-shadow-stack-v15-0-3fa245c6e3be@k…

Changes in v15:
- Rebase onto v6.15-rc1.
- Link to v14: https://lore.kernel.org/r/20250206-clone3-shadow-stack-v14-0-805b53af73b9@k…

Changes in v14:
- Rebase onto v6.14-rc1.
- Link to v13: https://lore.kernel.org/r/20241203-clone3-shadow-stack-v13-0-93b89a81a5ed@k…

Changes in v13:
- Rebase onto v6.13-rc1.
- Link to v12: https://lore.kernel.org/r/20241031-clone3-shadow-stack-v12-0-7183eb8bee17@k…

Changes in v12:
- Add the regular prctl() to the userspace API document since arm64 support is queued in -next.
- Link to v11: https://lore.kernel.org/r/20241005-clone3-shadow-stack-v11-0-2a6a2bd6d651@k…

Changes in v11:
- Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and integrate arm64 support.
- Rework the interface to specify a shadow stack pointer rather than a base and size like we do for the regular stack.
- Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@k…

Changes in v10:
- Integrate fixes & improvements for the x86 implementation from Rick Edgecombe.
- Require that the shadow stack be VM_WRITE.
- Require that the shadow stack base and size be sizeof(void *) aligned.
- Clean up trailing newline.
- Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@ke…

Changes in v9:
- Pull token validation earlier and report problems with an error return to parent rather than signal delivery to the child.
- Verify that the top of the supplied shadow stack is VM_SHADOW_STACK.
- Rework token validation to only do the page mapping once.
- Drop no longer needed support for testing for signals in selftest.
- Fix typo in comments.
- Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@ke…

Changes in v8:
- Fix token verification with user specified shadow stack.
- Don't track user managed shadow stacks for child processes.
- Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@ke…

Changes in v7:
- Rebase onto v6.11-rc1.
- Typo fixes.
- Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@ke…

Changes in v6:
- Rebase onto v6.10-rc3.
- Ensure we don't try to free the parent shadow stack in error paths of x86 arch code.
- Spelling fixes in userspace API document.
- Additional cleanups and improvements to the clone3() tests to support the shadow stack tests.
- Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@ke…

Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke…

Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke…

Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke…

Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke…

---
Mark Brown (8):
  arm64/gcs: Return a success value from gcs_alloc_thread_stack()
  Documentation: userspace-api: Add shadow stack API documentation
  selftests: Provide helper header for shadow stack testing
  fork: Add shadow stack support to clone3()
  selftests/clone3: Remove redundant flushes of output streams
  selftests/clone3: Factor more of main loop into test_clone3()
  selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
  selftests/clone3: Test shadow stack support

 Documentation/userspace-api/index.rst             |   1 +
 Documentation/userspace-api/shadow_stack.rst      |  44 +++++
 arch/arm64/include/asm/gcs.h                      |   8 +-
 arch/arm64/kernel/process.c                       |   8 +-
 arch/arm64/mm/gcs.c                               |  61 +++++-
 arch/x86/include/asm/shstk.h                      |  11 +-
 arch/x86/kernel/process.c                         |   2 +-
 arch/x86/kernel/shstk.c                           |  57 +++++-
 include/asm-generic/cacheflush.h                  |  11 ++
 include/linux/sched/task.h                        |  17 ++
 include/uapi/linux/sched.h                        |   9 +-
 kernel/fork.c                                     |  96 +++++++--
 tools/testing/selftests/clone3/clone3.c           | 226 ++++++++++++++++++----
 tools/testing/selftests/clone3/clone3_selftests.h |  65 ++++++-
 tools/testing/selftests/ksft_shstk.h              |  98 ++++++++++
 15 files changed, 633 insertions(+), 81 deletions(-)
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20231019-clone3-shadow-stack-15d40d2bf536

Best regards,
--
Mark Brown <broonie(a)kernel.org>

2 months, 1 week


Re: [PATCH v2] selftests: cachestat: Refactor test to remove duplication

by Nhat Pham

On Tue, Jun 24, 2025 at 9:58 PM Suresh Chandrappa <suresh.k.chandrappa(a)gmail.com> wrote:
>
> Hi @nphamcs
> Can you please check the modified change

Is this supposed to be on top of the earlier patch you sent out? In that case, you should send both together as a patch series.

>
> Thanks
> Suresh K C
>
>
> On Wed, 11 Jun 2025, 23:39 Suresh K C, <suresh.k.chandrappa(a)gmail.com> wrote:
>>
>> From: Suresh K C <suresh.k.chandrappa(a)gmail.com>
>>
>> Refactored the mmap and shmem test logic into a common function
>> to reduce code duplication and improve maintainability
>>
>> Changes in v2:
>> Refactored mmap and shmem tests into a common function
>> Renamed test function to run_cachestat_test()
>> Removed test for /proc/cpuinfo as a general /proc test case already exists
>>
>> Signed-off-by: Suresh K C <suresh.k.chandrappa(a)gmail.com>
>> ---
>> .../selftests/cachestat/test_cachestat.c | 97 ++++++-------------
>> 1 file changed, 30 insertions(+), 67 deletions(-)
>>
>> diff --git a/tools/testing/selftests/cachestat/test_cachestat.c b/tools/testing/selftests/cachestat/test_cachestat.c
>> index 81e7f6dd2279..7c2f64175943 100644
>> --- a/tools/testing/selftests/cachestat/test_cachestat.c
>> +++ b/tools/testing/selftests/cachestat/test_cachestat.c
>> @@ -22,7 +22,7 @@
>>
>>  static const char * const dev_files[] = {
>>      "/dev/zero", "/dev/null", "/dev/urandom",
>> -    "/proc/version","/proc/cpuinfo","/proc"
>> +    "/proc/version","/proc"

So you removed one file that you added in an earlier patch, right? Then why bother adding it in the first place...?

Can you either:

1. Send the two patches together as a series. In the first patch, do not add /proc/cpuinfo.

or

2. Squash them into a single patch.

I'll let you decide if this is worth it.

>>  };
>>
>>  void print_cachestat(struct cachestat *cs)
>> @@ -33,6 +33,11 @@ void print_cachestat(struct cachestat *cs)
>>          cs->nr_evicted, cs->nr_recently_evicted);
>>  }
>>
>> +enum file_type {
>> +    FILE_MMAP,
>> +    FILE_SHMEM
>> +};
>> +
>>  bool write_exactly(int fd, size_t filesize)
>>  {
>>      int random_fd = open("/dev/urandom", O_RDONLY);
>> @@ -202,66 +207,8 @@ static int test_cachestat(const char *filename, bool write_random, bool create,
>>      return ret;
>>  }
>>
>> -bool test_cachestat_mmap(void){
>> -
>> -    size_t PS = sysconf(_SC_PAGESIZE);
>> -    size_t filesize = PS * 512 * 2;;
>> -    int syscall_ret;
>> -    size_t compute_len = PS * 512;
>> -    struct cachestat_range cs_range = { PS, compute_len };
>> -    char *filename = "tmpshmcstat";
>> -    unsigned long num_pages = compute_len / PS;
>> -    struct cachestat cs;
>> -    bool ret = true;
>> -    int fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0666);
>> -    if (fd < 0) {
>> -        ksft_print_msg("Unable to create mmap file.\n");
>> -        ret = false;
>> -        goto out;
>> -    }
>> -    if (ftruncate(fd, filesize)) {
>> -        ksft_print_msg("Unable to truncate mmap file.\n");
>> -        ret = false;
>> -        goto close_fd;
>> -    }
>> -    if (!write_exactly(fd, filesize)) {
>> -        ksft_print_msg("Unable to write to mmap file.\n");
>> -        ret = false;
>> -        goto close_fd;
>> -    }
>> -    char *map = mmap(NULL, filesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>> -    if (map == MAP_FAILED) {
>> -        ksft_print_msg("mmap failed.\n");
>> -        ret = false;
>> -        goto close_fd;
>> -    }
>> -
>> -    for (int i = 0; i < filesize; i++) {
>> -        map[i] = 'A';
>> -    }
>> -    map[filesize - 1] = 'X';
>> -
>> -    syscall_ret = syscall(__NR_cachestat, fd, &cs_range, &cs, 0);
>> -
>> -    if (syscall_ret) {
>> -        ksft_print_msg("Cachestat returned non-zero.\n");
>> -        ret = false;
>> -    } else {
>> -        print_cachestat(&cs);
>> -        if (cs.nr_cache + cs.nr_evicted != num_pages) {
>> -            ksft_print_msg("Total number of cached and evicted pages is off.\n");
>> -            ret = false;
>> -        }
>> -    }
>> -
>> -close_fd:
>> -    close(fd);
>> -    unlink(filename);
>> -out:
>> -    return ret;
>> -}
>>
>> -bool test_cachestat_shmem(void)
>> +bool run_cachestat_test(enum file_type type)

Can you just call this function test_cachestat(enum file_type type) ?

>>  {
>>      size_t PS = sysconf(_SC_PAGESIZE);
>>      size_t filesize = PS * 512 * 2; /* 2 2MB huge pages */
>> @@ -271,27 +218,43 @@ bool test_cachestat_shmem(void)
>>      char *filename = "tmpshmcstat";
>>      struct cachestat cs;
>>      bool ret = true;
>> +    int fd;
>>      unsigned long num_pages = compute_len / PS;
>> -    int fd = shm_open(filename, O_CREAT | O_RDWR, 0600);
>> +    if (type == FILE_SHMEM)
>> +        fd = shm_open(filename, O_CREAT | O_RDWR, 0600);
>> +    else
>> +        fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0666);
>>
>>      if (fd < 0) {
>> -        ksft_print_msg("Unable to create shmem file.\n");
>> +        ksft_print_msg("Unable to create file.\n");
>>          ret = false;
>>          goto out;
>>      }
>>
>>      if (ftruncate(fd, filesize)) {
>> -        ksft_print_msg("Unable to truncate shmem file.\n");
>> +        ksft_print_msg("Unable to truncate file.\n");
>>          ret = false;
>>          goto close_fd;
>>      }
>>
>>      if (!write_exactly(fd, filesize)) {
>> -        ksft_print_msg("Unable to write to shmem file.\n");
>> +        ksft_print_msg("Unable to write to file.\n");
>>          ret = false;
>>          goto close_fd;
>>      }
>>
>> +    if (type == FILE_MMAP){
>> +        char *map = mmap(NULL, filesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>> +        if (map == MAP_FAILED) {
>> +            ksft_print_msg("mmap failed.\n");
>> +            ret = false;
>> +            goto close_fd;
>> +        }
>> +        for (int i = 0; i < filesize; i++) {
>> +            map[i] = 'A';
>> +        }
>> +        map[filesize - 1] = 'X';
>> +    }
>>      syscall_ret = syscall(__NR_cachestat, fd, &cs_range, &cs, 0);
>>
>>      if (syscall_ret) {
>> @@ -333,7 +296,7 @@ int main(void)
>>          ret = 1;
>>      }
>>
>> -    for (int i = 0; i < 6; i++) {
>> +    for (int i = 0; i < 5; i++) {
>>          const char *dev_filename = dev_files[i];
>>
>>          if (test_cachestat(dev_filename, false, false, false,
>> @@ -367,14 +330,14 @@ int main(void)
>>          break;
>>      }
>>
>> -    if (test_cachestat_shmem())
>> +    if (run_cachestat_test(FILE_SHMEM))
>>          ksft_test_result_pass("cachestat works with a shmem file\n");
>>      else {
>>          ksft_test_result_fail("cachestat fails with a shmem file\n");
>>          ret = 1;
>>      }
>>
>> -    if (test_cachestat_mmap())
>> +    if (run_cachestat_test(FILE_MMAP))
>>          ksft_test_result_pass("cachestat works with a mmap file\n");
>>      else {
>>          ksft_test_result_fail("cachestat fails with a mmap file\n");
>> --
>> 2.43.0
>>

2 months, 2 weeks


[PATCH net-next v2 0/4] selftest: net: Add selftest for netpoll

by Breno Leitao

I am submitting a new selftest for the netpoll subsystem specifically targeting the case where the RX is polling in the TX path, a case for which we have no test in the tree today.

This is done when netpoll_poll_dev() is called, and this test creates a scenario where that is likely.

The test does the following:

1) Configuring a single RX/TX queue to increase contention on the interface.
2) Generating background traffic to saturate the network, mimicking real-world congestion.
3) Sending netconsole messages to trigger netpoll polling and monitor its behavior.
4) Using dynamic netconsole targets via configfs, with the ability to delete and recreate targets during the test.
5) Running bpftrace in parallel to verify that netpoll_poll_dev() is called when expected. If it is called, then the test passes, otherwise the test is marked as skipped.

In order to achieve it, I stole Jakub's bpftrace helper from [1], and made some small changes that I found useful when using the helper.

So, this patchset basically contains:

1) The code stolen from Jakub
2) Improvements on bpftrace() helper
3) The selftest itself

Link: https://lore.kernel.org/all/20250421222827.283737-22-kuba@kernel.org/ [1]

---
Changes in v1 (from RFC):
- Toggle the netconsole interfaces up and down after 5 iterations.
- Moved the traffic check under DEBUG (Willem de Bruijn).
- Bumped the iterations to 20 given it runs faster now.
- Link to the RFC: https://lore.kernel.org/r/20250612-netpoll_test-v1-1-4774fd95933f@debian.org

---
Changes in v2:
- Stole Jakub's helper to run bpftrace
- Removed the DEBUG option and moved logs to logging
- Change the code to have a higher chance of calling netpoll_poll_dev(). In my current configuration, it is hitting multiple times during the test.
- Save and restore TX/RX queue size (Jakub)
- Link to v1: https://lore.kernel.org/r/20250620-netpoll_test-v1-1-5068832f72fc@debian.org

---
Breno Leitao (3):
  selftests: drv-net: Improve bpftrace utility error handling
  selftests: drv-net: Strip '@' prefix from bpftrace map keys
  selftests: net: add netpoll basic functionality test

Jakub Kicinski (1):
  selftests: drv-net: add helper/wrapper for bpftrace

 tools/testing/selftests/drivers/net/Makefile       |   1 +
 .../testing/selftests/drivers/net/netpoll_basic.py | 344 +++++++++++++++++++++
 tools/testing/selftests/net/lib/py/utils.py        |  38 +++
 3 files changed, 383 insertions(+)
---
base-commit: eb4c27edb4d8dbfbdcc7bc03e0394a0fab8af7d5
change-id: 20250612-netpoll_test-a1324d2057c8

Best regards,
--
Breno Leitao <leitao(a)debian.org>
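
The pass-or-skip rule in step 5, together with the '@' stripping named in the shortlog, could look roughly like the sketch below. It is not the helper from the series: parse_bpftrace_maps() and verdict() are hypothetical names, the map name "hits" is made up, and the input is assumed to be bpftrace's line-delimited JSON output with one object per line and maps reported under a "data" key.

```python
import json


def parse_bpftrace_maps(output: str) -> dict:
    """Collect map values from JSON-lines output, dropping the '@' key prefix."""
    maps = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Only "map" records carry counters; other record types are ignored.
        if record.get("type") != "map":
            continue
        for key, value in record.get("data", {}).items():
            name = key[1:] if key.startswith("@") else key
            maps[name] = value
    return maps


def verdict(maps: dict, key: str = "hits") -> str:
    """Return "pass" if the probe fired at least once, otherwise "skip"."""
    return "pass" if maps.get(key, 0) > 0 else "skip"
```

Reporting "skip" rather than "fail" when the counter stays at zero matches the cover letter: whether netpoll_poll_dev() runs depends on timing and load, so a quiet run says nothing about correctness.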

2 months, 2 weeks


[PATCH 0/5] sysctl: Remove last two ctl_tables from the kern_table array

by Joel Granados

This is the last series to relocate sysctl tables from kernel/sysctl.c into their respective subsystems. After the move of two ctl_tables (uevent_helper & overflow{uid,gid}), five remain. They either handle variables defined within sysctl.c or serve as a common place for variables that are defined in different architectures. These five will not be moved.

Note that this series includes two auxiliary changes: removal of an unused variable and a Nix-based rework of the sysctl.sh test script.

By decentralizing sysctl registrations, subsystem maintainers regain control over their sysctl interfaces, improving maintainability and reducing the likelihood of merge conflicts. All this is made possible by the work done to reduce the ctl_table memory footprint in commit d7a76ec87195 ("sysctl: Remove check for sentinel element in ctl_table arrays").

A few comments on the process:

1. If you prefer to merge this through a non-sysctl tree, please let me know so I can avoid conflicts in linux-next.
2. Apologies if you were copied by mistake; let me know if you'd like to be removed.
3. This series builds on [1], so please rebase accordingly for clean application.
4. Testing done by running sysctl selftests on x86_64 and 0-day.

Comments/Suggestions greatly appreciated

[1] https://lore.kernel.org/20250509-jag-mv_ctltables_iter2-v1-0-d0ad83f5f4c3@k…

Signed-off-by: Joel Granados <joel.granados(a)kernel.org>
---
Joel Granados (5):
  sysctl: Nixify sysctl.sh
  sysctl: Removed unused variable
  uevent: mv uevent_helper into kobject_uevent.c
  kernel/sys.c: Move overflow{uid,gid} sysctl into kernel/sys.c
  sysctl: rename kern_table -> sysctl_subsys_table

 include/linux/sysctl.h                   |  1 -
 kernel/sys.c                             | 29 +++++++++++++++++++
 kernel/sysctl.c                          | 49 +++++++-------------------------
 lib/kobject_uevent.c                     | 20 +++++++++++++
 tools/testing/selftests/sysctl/sysctl.sh |  2 +-
 5 files changed, 61 insertions(+), 40 deletions(-)
---
base-commit: 501dd0fbc76bcae57902ea000d9c6ccd9d5f226e
change-id: 20250627-jag-sysctl-823adf5732be

Best regards,
--
Joel Granados <joel.granados(a)kernel.org>

2 months, 2 weeks


[PATCH net-next v11 0/8] Support rate management on traffic classes in devlink and mlx5

by Mark Bloch

V11:
- Refactored the devlink code to accept relative TC bandwidth share values instead of percentages.
- Updated documentation to clarify that values are interpreted as relative shares.
- Refactored the logic in mlx5 to support proportional scaling for tc-bw values.
- Switched to `nlmsg_for_each_attr_type()` for cleaner attribute parsing.
- Added a hardware selftest to validate TC bandwidth behavior.
- Refactored esw_qos_is_node_empty for readability.

V10:
- Added netdevsim selftest for tc-bw ops.
- Dropped header: field as it’s unnecessary for local constants in devlink.yaml.

V9:
- Defined DEVLINK_RATE_TCS_MAX as 8 in uapi/linux/devlink.h.
- Replaced IEEE_8021QAZ_MAX_TCS with DEVLINK_RATE_TCS_MAX throughout the code.
- Updated devlink-rate-tc-index-max spec to reference the correct UAPI header.

V8:
- Limit line width to 80 characters in mlx5 changes instead of 100.
- Increase the scheduling node levels to support TC arbitration.
- Ensure parent nodes are set correctly in all code paths that extend the hierarchy depth for TC arbitration.
- Extended the cover letter with the ongoing discussion on devlink-rate and net-shapers.
- Extended the cover letter with the Netdev talk link on this series.

V7:
- Fixed disabling tc-bw on leaf nodes that did not have tc-bw configured.
- Fixed an issue where tc-bw was disabled on a node with assigned vports, ensuring that vport->qos.sched_node->parent is correctly updated with the cloned node.
- Declared a constant for the maximum allowed Traffic Class index in devlink rate.
- Added a range check to validate rate-tc-index.
- Added documentation for the tc-bw argument.
- Add a validation check to ensure that the total bandwidth assigned to all traffic classes sums to 100.

V6:
- Addressed comments on devlink patch #3.
- Removed first 4 IFC patches, to be pulled from mlx5-next.

V5:
- Fix warning in devlink_nl_rate_tc_bw_set().
- Fix target branch of patch #4.

V4:
- Renamed the nested attribute for traffic class bandwidth to DEVLINK_ATTR_RATE_TC_BWS.
- Changed the order of the attributes in `devlink.h`.
- Refactored the initialization tc-bw array in devlink_nl_rate_tc_bw_set().
- Added extack messages to provide clear feedback on issues with tc-bw arguments.
- Updated `rate-tc-bws` to support a multi-attr set, where each attribute includes an index and the corresponding bandwidth for that traffic class.
- Handled the issue where the user could provide DEVLINK_ATTR_RATE_TC_BWS with duplicate indices.
- Provided ynl examples in patch [1/5] commit message.
- Take IFC patches to beginning of the series, targeted for mlx5-next.

V3:
- Dropped rate-tc-index, using tc-bw array index instead.
- Renamed rate-bw to rate-tc-bw.
- Documented what the rate-tc-bw represents and added a range check for validation.
- Introduced devlink_nl_rate_tc_bw_set() to parse and set the TC bandwidth values.
- Updated the user API in the commit message of patch 1/6 to ensure bandwidths sum equals 100.
- Fixed missing filling of rate-parent in devlink_nl_rate_fill().

V2:
- Included <linux/dcbnl.h> in devlink.h to resolve missing IEEE_8021QAZ_MAX_TCS definition.
- Refactored the rate-tc-bw attribute structure to use a separate rate-tc-index.
- Updated patch 2/6 title.

This patch series extends the devlink-rate API to support traffic class (TC) bandwidth management, enabling more granular control over traffic shaping and rate limiting across multiple TCs. The API now allows users to specify bandwidth proportions for different traffic classes in a single command. This is particularly useful for managing Enhanced Transmission Selection (ETS) for groups of Virtual Functions (VFs), allowing precise bandwidth allocation across traffic classes.

Additionally the series refines the QoS handling in net/mlx5 to support TC arbitration and bandwidth management on vports and rate nodes.

Discussions on traffic class shaping in net-shapers began in V5 [1], where we discussed with maintainers whether net-shapers should support traffic classes and how this could be implemented.

Later, after further conversations with Paolo Abeni and Simon Horman, Cosmin provided an update [2], confirming that net-shapers' tree-based hierarchy aligns well with traffic classes when treated as distinct subsets of netdev queues. Since mlx5 enforces a 1:1 mapping between TX queues and traffic classes, this approach seems feasible, though some open questions remain regarding queue reconfiguration and certain mlx5 scheduling behaviors.

Building on that discussion, Cosmin has now shared a concrete implementation plan on the netdev mailing list [3]. The plan, developed in collaboration with Paolo and Simon, outlines how net-shapers can be extended to support the same use cases currently covered by devlink-rate, with the eventual goal of aligning both and simplifying the shaping infrastructure in the kernel.

This work was presented at Netdev 0x19 in Zagreb [4]. There we presented how TC scheduling is enforced in mlx5 hardware, which led to discussions on the mailing list.

A summary of how things work:

Classification means labeling a packet with a traffic class based on the packet's DSCP or VLAN PCP field, then treating packets with different traffic classes differently during transmit processing.

In a virtualized setup, VFs are untrusted and do not control classification or shaping. Classification is done by the hardware using a prio-to-TC mapping set by the hypervisor. VFs only select which send queue to use and are expected to respect the classification logic by sending each traffic class on its dedicated queue. As stated in the net-shapers plan [3], each transmit queue should carry only a single traffic class. Mixing classes in a single queue can lead to HOL blocking.

In the mlx5 implementation, if the queue used does not match the classified traffic class, the hardware moves the queue to the correct TC scheduler. This movement is not a reclassification; it’s a necessary enforcement step to ensure traffic class isolation is maintained.

Extend devlink-rate API to support rate management on TCs:
- devlink: Extend the devlink rate API to support traffic class bandwidth management

Introduce a no-op implementation:
- net/mlx5: Add no-op implementation for setting tc-bw on rate objects

Add support for enabling and disabling TC QoS on vports and nodes:
- net/mlx5: Add support for setting tc-bw on nodes
- net/mlx5: Add traffic class scheduling support for vport QoS

Support for setting tc-bw on rate objects:
- net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw

[1] https://lore.kernel.org/netdev/20241204220931.254964-1-tariqt@nvidia.com/
[2] https://lore.kernel.org/netdev/67df1a562614b553dcab043f347a0d7c5393ff83.cam…
[3] https://lore.kernel.org/netdev/d9831d0c940a7b77419abe7c7330e822bbfd1cfb.cam…
[4] https://netdevconf.info/0x19/sessions/talk/optimizing-bandwidth-allocation-…

Carolina Jubran (8):
  netlink: introduce type-checking attribute iteration for nlmsg
  devlink: Extend devlink rate API with traffic classes bandwidth management
  selftest: netdevsim: Add devlink rate tc-bw test
  net/mlx5: Add no-op implementation for setting tc-bw on rate objects
  net/mlx5: Add support for setting tc-bw on nodes
  net/mlx5: Add traffic class scheduling support for vport QoS
  net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw
  selftests: drv-net: Add test for devlink-rate traffic class bandwidth distribution

 Documentation/netlink/specs/devlink.yaml      |   32 +-
 .../networking/devlink/devlink-port.rst       |    8 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c |    2 +
 .../net/ethernet/mellanox/mlx5/core/esw/qos.c | 1037 ++++++++++++++++-
 .../net/ethernet/mellanox/mlx5/core/esw/qos.h |    8 +
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |   14 +-
 drivers/net/netdevsim/dev.c                   |   43 +
 drivers/net/netdevsim/netdevsim.h             |    1 +
 drivers/net/vxlan/vxlan_vnifilter.c           |   13 +-
 fs/nfsd/nfsctl.c                              |   36 +-
 include/net/devlink.h                         |    8 +
 include/net/netlink.h                         |   14 +
 include/uapi/linux/devlink.h                  |    9 +
 net/devlink/netlink_gen.c                     |   15 +-
 net/devlink/netlink_gen.h                     |    1 +
 net/devlink/rate.c                            |  129 ++
 .../drivers/net/hw/devlink_rate_tc_bw.py      |  466 ++++++++
 .../drivers/net/netdevsim/devlink.sh          |   51 +
 .../testing/selftests/net/lib/py/__init__.py  |    2 +-
 tools/testing/selftests/net/lib/py/ynl.py     |    5 +
 20 files changed, 1823 insertions(+), 71 deletions(-)
 create mode 100755 tools/testing/selftests/drivers/net/hw/devlink_rate_tc_bw.py

base-commit: 8dacfd92dbefee829ca555a860e86108fdd1d55b
--
2.34.1


[PATCH net-next] selftests: forwarding: lib: Split setup_wait()

by Petr Machata

setup_wait() takes an optional argument and then is called from the top level of the test script. That confuses shellcheck, which thinks that maybe the intention is to pass $1 of the script to the function, which is never the case. To avoid having to annotate every single new test with a SC disable, split the function in two: one that takes a mandatory argument, and one that takes no argument at all.

Convert the two existing users of that optional argument, both in Spectrum resource selftest, to use the new form. Clean up vxlan_bridge_1q_mc_ul.sh to not pass a now-unused argument.

Signed-off-by: Petr Machata <petrm(a)nvidia.com>
---

Notes:
    CC: Shuah Khan <shuah(a)kernel.org>
    CC: Matthieu Baerts <matttbe(a)kernel.org>
    CC: linux-kselftest(a)vger.kernel.org

 .../drivers/net/mlxsw/spectrum-2/resource_scale.sh       | 2 +-
 .../drivers/net/mlxsw/spectrum/resource_scale.sh         | 2 +-
 tools/testing/selftests/net/forwarding/lib.sh            | 9 +++++++--
 .../selftests/net/forwarding/vxlan_bridge_1q_mc_ul.sh    | 2 +-
 4 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/mlxsw/spectrum-2/resource_scale.sh b/tools/testing/selftests/drivers/net/mlxsw/spectrum-2/resource_scale.sh
index 899b6892603f..d7505b933aef 100755
--- a/tools/testing/selftests/drivers/net/mlxsw/spectrum-2/resource_scale.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/spectrum-2/resource_scale.sh
@@ -51,7 +51,7 @@ for current_test in ${TESTS:-$ALL_TESTS}; do
 		fi

 		${current_test}_setup_prepare
-		setup_wait $num_netifs
+		setup_wait_n $num_netifs
 		# Update target in case occupancy of a certain resource changed
 		# following the test setup.
 		target=$(${current_test}_get_target "$should_fail")
diff --git a/tools/testing/selftests/drivers/net/mlxsw/spectrum/resource_scale.sh b/tools/testing/selftests/drivers/net/mlxsw/spectrum/resource_scale.sh
index 482ebb744eba..7b98cdd0580d 100755
--- a/tools/testing/selftests/drivers/net/mlxsw/spectrum/resource_scale.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/spectrum/resource_scale.sh
@@ -55,7 +55,7 @@ for current_test in ${TESTS:-$ALL_TESTS}; do
 			continue
 		fi
 		${current_test}_setup_prepare
-		setup_wait $num_netifs
+		setup_wait_n $num_netifs
 		# Update target in case occupancy of a certain resource
 		# changed following the test setup.
 		target=$(${current_test}_get_target "$should_fail")
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 83ee6a07e072..9308b2f77fed 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -526,9 +526,9 @@ setup_wait_dev_with_timeout()
 	return 1
 }

-setup_wait()
+setup_wait_n()
 {
-	local num_netifs=${1:-$NUM_NETIFS}
+	local num_netifs=$1; shift
 	local i

 	for ((i = 1; i <= num_netifs; ++i)); do
@@ -539,6 +539,11 @@ setup_wait()
 	sleep $WAIT_TIME
 }

+setup_wait()
+{
+	setup_wait_n "$NUM_NETIFS"
+}
+
 wait_for_dev()
 {
 	local dev=$1; shift
diff --git a/tools/testing/selftests/net/forwarding/vxlan_bridge_1q_mc_ul.sh b/tools/testing/selftests/net/forwarding/vxlan_bridge_1q_mc_ul.sh
index 7ec58b6b1128..462db0b603e7 100755
--- a/tools/testing/selftests/net/forwarding/vxlan_bridge_1q_mc_ul.sh
+++ b/tools/testing/selftests/net/forwarding/vxlan_bridge_1q_mc_ul.sh
@@ -765,7 +765,7 @@ ipv6_mcroute_fdb_sep_rx()
 trap cleanup EXIT

 setup_prepare
-setup_wait "$NUM_NETIFS"
+setup_wait
 tests_run

 exit "$EXIT_STATUS"
--
2.49.0


[PATCH] selftests: futex: define SYS_futex on 32-bit architectures with 64-bit time_t

by Ben Zong-You Xie

glibc does not define SYS_futex for 32-bit architectures using 64-bit time_t e.g. riscv32, therefore this test fails to compile since it does not find SYS_futex in C library headers. Define SYS_futex as SYS_futex_time64 in this situation to ensure successful compilation and compatibility.

Signed-off-by: Ben Zong-You Xie <ben717(a)andestech.com>
Signed-off-by: Cynthia Huang <cynthia(a)andestech.com>
---
 tools/testing/selftests/futex/include/futextest.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/tools/testing/selftests/futex/include/futextest.h b/tools/testing/selftests/futex/include/futextest.h
index ddbcfc9b7bac..7a5fd1d5355e 100644
--- a/tools/testing/selftests/futex/include/futextest.h
+++ b/tools/testing/selftests/futex/include/futextest.h
@@ -47,6 +47,17 @@ typedef volatile u_int32_t futex_t;
 					 FUTEX_PRIVATE_FLAG)
 #endif

+/*
+ * SYS_futex is expected from system C library, in glibc some 32-bit
+ * architectures (e.g. RV32) are using 64-bit time_t, therefore it doesn't have
+ * SYS_futex defined but just SYS_futex_time64. Define SYS_futex as
+ * SYS_futex_time64 in this situation to ensure the compilation and the
+ * compatibility.
+ */
+#if !defined(SYS_futex) && defined(SYS_futex_time64)
+#define SYS_futex SYS_futex_time64
+#endif
+
 /**
  * futex() - SYS_futex syscall wrapper
  * @uaddr: address of first futex
--
2.34.1
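For readers who want to see the fallback in isolation, here is a minimal sketch of the same preprocessor guard, placed in front of a small wake-only wrapper. The wrapper name demo_futex_wake() is hypothetical and not part of the patch; only the #if/#define guard is taken from it. The sketch assumes a Linux system with the kernel UAPI headers installed.

```c
#include <linux/futex.h>   /* FUTEX_WAKE */
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>   /* SYS_futex and/or SYS_futex_time64 */
#include <unistd.h>        /* syscall() */

/*
 * Same guard as in the patch: if the C library only exposes the
 * time64 variant, alias SYS_futex to it.
 */
#if !defined(SYS_futex) && defined(SYS_futex_time64)
#define SYS_futex SYS_futex_time64
#endif

/*
 * Hypothetical helper for illustration only: wake up to nr_wake
 * waiters blocked on uaddr. FUTEX_WAKE takes no timeout, so the
 * remaining arguments are unused.
 */
static long demo_futex_wake(uint32_t *uaddr, int nr_wake)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAKE, nr_wake, NULL, NULL, 0);
}
```

Calling demo_futex_wake() on a word that nobody is waiting on should return 0 (no waiters woken). On libcs that already define SYS_futex, the guard is a no-op.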


[PATCH] selftests/futex: Convert 32bit timespec struct to 64bit version for 32bit compatibility mode

by Terry Tritton

Futex_waitv can not accept old_timespec32 struct, so userspace should convert it from 32bit to 64bit before syscall in 32bit compatible mode.

This fix is based off [1]

Link: https://lore.kernel.org/all/20231203235117.29677-1-wegao@suse.com/ [1]

Signed-off-by: Terry Tritton <terry.tritton(a)linaro.org>
Signed-off-by: Wei Gao <wegao(a)suse.com>
---
The original patch is for an identically named file and function in ltp and we need the same fix in kselftest. The patch is near identical with only a slight change to `syscall` instead of `tst_syscall`.

Is the way I have tagged this appropriate?

 .../testing/selftests/futex/include/futex2test.h  | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h
index ea79662405bc..6780e51eb2d6 100644
--- a/tools/testing/selftests/futex/include/futex2test.h
+++ b/tools/testing/selftests/futex/include/futex2test.h
@@ -55,6 +55,13 @@ struct futex32_numa {
 	futex_t numa;
 };

+#if !defined(__LP64__)
+struct timespec64 {
+	int64_t tv_sec;
+	int64_t tv_nsec;
+};
+#endif
+
 /**
  * futex_waitv - Wait at multiple futexes, wake on any
  * @waiters: Array of waiters
@@ -65,7 +72,15 @@ struct futex32_numa {
 static inline int futex_waitv(volatile struct futex_waitv *waiters, unsigned long nr_waiters,
 			      unsigned long flags, struct timespec *timo, clockid_t clockid)
 {
+#if !defined(__LP64__)
+	struct timespec64 timo64 = {0};
+
+	timo64.tv_sec = timo->tv_sec;
+	timo64.tv_nsec = timo->tv_nsec;
+	return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, &timo64, clockid);
+#else
 	return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid);
+#endif
 }

 /*
--
2.39.5
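The conversion step in the patch can be looked at on its own. The sketch below is not the selftest code: struct demo_timespec64 and demo_to_timespec64() are hypothetical names that mirror the layout the patch calls timespec64, and the helper performs the conversion unconditionally rather than only under !defined(__LP64__), so it can be exercised on any ABI.

```c
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical stand-in for the patch's struct timespec64: both
 * fields are 64 bits wide regardless of the userspace ABI.
 */
struct demo_timespec64 {
	int64_t tv_sec;
	int64_t tv_nsec;
};

/* The layout is expected to be two packed 64-bit fields. */
_Static_assert(sizeof(struct demo_timespec64) == 16,
	       "demo_timespec64 must be 16 bytes");

/*
 * Copy a native struct timespec into the fixed 64-bit layout.
 * On ILP32 targets with 32-bit time_t this widens both fields;
 * on LP64 targets it is a plain copy.
 */
static struct demo_timespec64 demo_to_timespec64(const struct timespec *ts)
{
	struct demo_timespec64 out = {0};

	out.tv_sec = ts->tv_sec;
	out.tv_nsec = ts->tv_nsec;
	return out;
}
```

A caller would pass the address of the converted struct as the timeout argument in place of the native struct timespec pointer, which is what the patched futex_waitv() wrapper does on non-LP64 builds.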

