Linux-kselftest-mirror

linux-kselftest-mirror@lists.linaro.org


[PATCH v8 00/15] Consolidate iommu page table implementations (AMD)

by Jason Gunthorpe

[Joerg, can you put this and vtd in linux-next please. The vtd series is still good at v3 thanks]

Currently each of the iommu page table formats duplicates all of the logic to maintain the page table and perform map/unmap/etc operations. There are several different versions of the algorithms between all the different formats. The io-pgtable system provides an interface to help isolate the page table code from the iommu driver, but doesn't provide tools to implement the common algorithms.

This makes it very hard to improve the state of the pagetable code under the iommu domains as any proposed improvement needs to alter a large number of different driver code paths. Combined with a lack of software based testing this makes improvement in this area very hard.

iommufd wants several new page table operations:
 - More efficient map/unmap operations, using iommufd's batching logic
 - unmap that returns the physical addresses into a batch as it progresses
 - cut that allows splitting areas so large pages can have holes poked in them dynamically (ie guestmemfd hitless shared/private transitions)
 - More aggressive freeing of table memory to avoid waste
 - Fragmenting large pages so that dirty tracking can be more granular
 - Reassembling large pages so that VMs can run at full IO performance in migration/dirty tracking error flows
 - KHO integration for kernel live upgrade

Together these are algorithmically complex enough to be a very significant task to go and implement in all the page table formats we support. Just the "server" focused drivers use almost all the formats (ARMv8 S1&S2 / x86 PAE / AMDv1 / VT-d SS / RISCV)

Instead of doing the duplicated work, this series takes the first step to consolidate the algorithms into one place. In spirit it is similar to the work Christoph did a few years back to pull the redundant get_user_pages() implementations out of the arch code into core MM. This unlocked a great deal of improvement in that space in the following years.
I would like to see the same benefit in iommu as well.

My first RFC showed a bigger picture with almost all formats and more algorithms. This series reorganizes that to be narrowly focused on just enough to convert the AMD driver to use the new mechanism.

kunit tests are provided that allow good testing of the algorithms and all formats on x86, nothing is arch specific.

AMD is one of the simpler options as the HW is quite uniform with few different options/bugs while still requiring the complicated contiguous pages support. The HW also has a very simple range based invalidation approach that is easy to implement.

The AMD v1 and AMD v2 page table formats are implemented bit for bit identical to the current code, tested using a compare kunit test that checks against the io-pgtable version (on github, see below).

Updating the AMD driver to replace the io-pgtable layer with the new stuff is fairly straightforward now. The layering is fixed up in the new version so that all the invalidation goes through function pointers.

Several small fixing patches have come out of this as I've been fixing the problems that the test suite uncovers in the current code, and implementing the fixed version in iommupt.

On performance, there is a quite wide variety of implementation designs across all the drivers.
Looking at some key performance across the main formats:

iommu_map():
   pgsz  ,avg new,old ns, min new,old ns  , min % (+ve is better)
     2^12,     53,66    ,      51,63      ,  19.19 (AMDv1)
 256*2^12,    386,1909  ,     367,1795    ,  79.79
 256*2^21,    362,1633  ,     355,1556    ,  77.77
     2^12,     56,62    ,      52,59      ,  11.11 (AMDv2)
 256*2^12,    405,1355  ,     357,1292    ,  72.72
 256*2^21,    393,1160  ,     358,1114    ,  67.67
     2^12,     55,65    ,      53,62      ,  14.14 (VT-d second stage)
 256*2^12,    391,518   ,     332,512     ,  35.35
 256*2^21,    383,635   ,     336,624     ,  46.46
     2^12,     57,65    ,      55,63      ,  12.12 (ARM 64 bit)
 256*2^12,    380,389   ,     361,369     ,   2.02
 256*2^21,    358,419   ,     345,400     ,  13.13

iommu_unmap():
   pgsz  ,avg new,old ns, min new,old ns  , min % (+ve is better)
     2^12,     69,88    ,      65,85      ,  23.23 (AMDv1)
 256*2^12,    353,6498  ,     331,6029    ,  94.94
 256*2^21,    373,6014  ,     360,5706    ,  93.93
     2^12,     71,72    ,      66,69      ,   4.04 (AMDv2)
 256*2^12,    228,891   ,     206,871     ,  76.76
 256*2^21,    254,721   ,     245,711     ,  65.65
     2^12,     69,87    ,      65,82      ,  20.20 (VT-d second stage)
 256*2^12,    210,321   ,     200,315     ,  36.36
 256*2^21,    255,349   ,     238,342     ,  30.30
     2^12,     72,77    ,      68,74      ,   8.08 (ARM 64 bit)
 256*2^12,    521,357   ,     447,346     , -29.29
 256*2^21,    489,358   ,     433,345     , -25.25

 * Above numbers include additional patches to remove the iommu_pgsize() overheads. gcc 13.3.0, i7-12700

This version provides fairly consistent performance across formats. ARM unmap performance is quite different because this version supports contiguous pages and uses a very different algorithm for unmapping. Though why it is so much worse compared to AMDv1 I haven't figured out yet.

The per-format commits include a more detailed chart.
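The message does not define how the "min %" column is computed. A minimal sketch, assuming it is the relative reduction in minimum latency, (old - new) / old, truncated to a whole percent: under that assumption the result matches the integer part of each posted row (only the integer part is compared here). The name `min_pct_improvement` is hypothetical and is not part of the series or its benchmark.

```c
#include <assert.h>

/*
 * Hypothetical helper, not from the series: percent improvement of the
 * new implementation over the old one, assuming the metric is
 * (old - new) / old. Positive means the new code is faster.
 * C integer division truncates toward zero, so 19.04 becomes 19 and
 * -29.19 becomes -29.
 */
long min_pct_improvement(long new_ns, long old_ns)
{
	assert(old_ns > 0); /* a zero baseline has no meaningful ratio */
	return (old_ns - new_ns) * 100 / old_ns;
}
```

For the AMDv1 4K iommu_map() row (minimum 51 ns new, 63 ns old) this evaluates to 19, and for the ARM 256*2^12 iommu_unmap() row (447 ns new, 346 ns old) to -29.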
There is a second branch:

   https://github.com/jgunthorpe/linux/commits/iommu_pt_all

Containing supporting work and future steps:
 - ARM short descriptor (32 bit), ARM long descriptor (64 bit) formats
 - RISCV format and RISCV conversion
     https://github.com/jgunthorpe/linux/commits/iommu_pt_riscv
 - Support for a DMA incoherent HW page table walker
 - VT-d second stage format and VT-d conversion
     https://github.com/jgunthorpe/linux/commits/iommu_pt_vtd
 - DART v1 & v2 format
 - Draft of an iommufd 'cut' operation to break down huge pages
 - A compare test that checks the iommupt formats against the iopgtable interface, including updating AMD to have a working iopgtable and patches to make VT-d have an iopgtable for testing.
 - A performance test to micro-benchmark map and unmap against iopgtable

My strategy is to go one by one for the drivers:
 - AMD driver conversion
 - RISCV page table and driver
 - Intel VT-d driver and VTDSS page table
 - Flushing improvements for RISCV
 - ARM SMMUv3

And concurrently work on the algorithm side:
 - debugfs content dump, like VT-d has
 - Cut support
 - Increase/Decrease page size support
 - map/unmap batching
 - KHO

As we make more algorithm improvements the value to convert the drivers increases.
This is on github: https://github.com/jgunthorpe/linux/commits/iommu_pt

v8:
 - Remove unused to_amdv1pt/common_to_amdv1pt/to_x86_64_pt/common_to_x86_64_pt
 - Fix 32 bit udiv compile failure in the kunit
v7: https://patch.msgid.link/r/0-v7-ab019a8791e2+175b8-iommu_pt_jgg@nvidia.com
 - Rebase to v6.18-rc2
 - Improve comments and documentation
 - Add a few missed __sme_sets() for AMD CC
 - Rename
     pt_iommu_flush_ops -> pt_iommu_driver_ops
     VT-D -> VT-d
     pt_clear_entry -> pt_clear_entries
     pt_entry_write_is_dirty -> pt_entry_is_write_dirty
     pt_entry_set_write_clean -> pt_entry_make_write_clean
 - Tidy some of the map flow into a new function do_map()
 - Fix ffz64()
v6: https://patch.msgid.link/r/0-v6-0fb54a1d9850+36b-iommu_pt_jgg@nvidia.com
 - Improve comments and documentation
 - Rename
     pt_entry_oa_full -> pt_entry_oa_exact
     pt_has_system_page -> pt_has_system_page_size
     pt_max_output_address_lg2 -> pt_max_oa_lg2
     log2_f*() -> vaf* / oaf* / f*_t
     pt_item_fully_covered -> pt_entry_fully_covered
 - Fix missed constant propagation causing division
 - Consolidate debugging checks to pt_check_install_leaf_args()
 - Change collect->ignore_mapped to check_mapped
 - Shuffle some hunks around to more appropriate patches
 - Two new mini kunit tests
v5: https://patch.msgid.link/r/0-v5-116c4948af3d+68091-iommu_pt_jgg@nvidia.com
 - Text grammar updates and kdoc fixes
v4: https://patch.msgid.link/r/0-v4-0d6a6726a372+18959-iommu_pt_jgg@nvidia.com
 - Rebase on v6.16-rc3
 - Integrate the HATS/HATDis changes
 - Remove 'default n' from kconfig
 - Remove unused 'PT_FIXED_TOP_LEVEL'
 - Improve comments and documentation
 - Fix some compile warnings from kbuild robots
v3: https://patch.msgid.link/r/0-v3-a93aab628dbc+521-iommu_pt_jgg@nvidia.com
 - Rebase on v6.16-rc2
 - s/PT_ENTRY_WORD_SIZE/PT_ITEM_WORD_SIZE/s to follow the language better
 - Comment and documentation updates
 - Add PT_TOP_PHYS_MASK to help manage alignment restrictions on the top pointer
 - Add missed force_aperture = true
 - Make pt_iommu_deinit() take care of the not-yet-inited error case internally as AMD/RISCV/VTD all shared this logic
 - Change gather_range() into gather_range_pages() so it also deals with the page list. This makes the following cache flushing series simpler
 - Fix missed update of unmap->unmapped in some error cases
 - Change clear_contig() to order the gather more logically
 - Remove goto from the error handling in __map_range_leaf()
 - s/log2_/oalog2_/ in places where the argument is an oaddr_t
 - Pass the pts to pt_table_install64/32()
 - Do not use SIGN_EXTEND for the AMDv2 page table because of Vasant's information on how PASID 0 works.
v2: https://patch.msgid.link/r/0-v2-5c26bde5c22d+58b-iommu_pt_jgg@nvidia.com
 - AMD driver only, many code changes
RFC: https://lore.kernel.org/all/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/

Cc: Michael Roth <michael.roth(a)amd.com>
Cc: Alexey Kardashevskiy <aik(a)amd.com>
Cc: Pasha Tatashin <pasha.tatashin(a)soleen.com>
Cc: James Gowans <jgowans(a)amazon.com>
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>

Alejandro Jimenez (1):
  iommu/amd: Use the generic iommu page table

Jason Gunthorpe (14):
  genpt: Generic Page Table base API
  genpt: Add Documentation/ files
  iommupt: Add the basic structure of the iommu implementation
  iommupt: Add the AMD IOMMU v1 page table format
  iommupt: Add iova_to_phys op
  iommupt: Add unmap_pages op
  iommupt: Add map_pages op
  iommupt: Add read_and_clear_dirty op
  iommupt: Add a kunit test for Generic Page Table
  iommupt: Add a mock pagetable format for iommufd selftest to use
  iommufd: Change the selftest to use iommupt instead of xarray
  iommupt: Add the x86 64 bit page table format
  iommu/amd: Remove AMD io_pgtable support
  iommupt: Add a kunit test for the IOMMU implementation

 .clang-format | 1 +
 Documentation/driver-api/generic_pt.rst | 142 ++
 Documentation/driver-api/index.rst | 1 +
 drivers/iommu/Kconfig | 2 +
 drivers/iommu/Makefile | 1 +
 drivers/iommu/amd/Kconfig | 5 +-
 drivers/iommu/amd/Makefile | 2 +-
 drivers/iommu/amd/amd_iommu.h | 1 -
 drivers/iommu/amd/amd_iommu_types.h | 110 +-
 drivers/iommu/amd/io_pgtable.c | 577 --------
 drivers/iommu/amd/io_pgtable_v2.c | 370 ------
 drivers/iommu/amd/iommu.c | 538 ++++----
 drivers/iommu/generic_pt/.kunitconfig | 13 +
 drivers/iommu/generic_pt/Kconfig | 68 +
 drivers/iommu/generic_pt/fmt/Makefile | 26 +
 drivers/iommu/generic_pt/fmt/amdv1.h | 411 ++++++
 drivers/iommu/generic_pt/fmt/defs_amdv1.h | 21 +
 drivers/iommu/generic_pt/fmt/defs_x86_64.h | 21 +
 drivers/iommu/generic_pt/fmt/iommu_amdv1.c | 15 +
 drivers/iommu/generic_pt/fmt/iommu_mock.c | 10 +
 drivers/iommu/generic_pt/fmt/iommu_template.h | 48 +
 drivers/iommu/generic_pt/fmt/iommu_x86_64.c | 11 +
 drivers/iommu/generic_pt/fmt/x86_64.h | 255 ++++
 drivers/iommu/generic_pt/iommu_pt.h | 1162 +++++++++++++++++
 drivers/iommu/generic_pt/kunit_generic_pt.h | 713 ++++++++++
 drivers/iommu/generic_pt/kunit_iommu.h | 183 +++
 drivers/iommu/generic_pt/kunit_iommu_pt.h | 487 +++++++
 drivers/iommu/generic_pt/pt_common.h | 358 +++++
 drivers/iommu/generic_pt/pt_defs.h | 329 +++++
 drivers/iommu/generic_pt/pt_fmt_defaults.h | 233 ++++
 drivers/iommu/generic_pt/pt_iter.h | 636 +++++++++
 drivers/iommu/generic_pt/pt_log2.h | 122 ++
 drivers/iommu/io-pgtable.c | 4 -
 drivers/iommu/iommufd/Kconfig | 1 +
 drivers/iommu/iommufd/iommufd_test.h | 11 +-
 drivers/iommu/iommufd/selftest.c | 438 +++----
 include/linux/generic_pt/common.h | 167 +++
 include/linux/generic_pt/iommu.h | 271 ++++
 include/linux/io-pgtable.h | 2 -
 include/linux/irqchip/riscv-imsic.h | 3 +-
 tools/testing/selftests/iommu/iommufd.c | 60 +-
 tools/testing/selftests/iommu/iommufd_utils.h | 12 +
 42 files changed, 6229 insertions(+), 1612 deletions(-)
 create mode 100644 Documentation/driver-api/generic_pt.rst
 delete mode 100644 drivers/iommu/amd/io_pgtable.c
 delete mode 100644 drivers/iommu/amd/io_pgtable_v2.c
 create mode 100644 drivers/iommu/generic_pt/.kunitconfig
 create mode 100644 drivers/iommu/generic_pt/Kconfig
 create mode 100644 drivers/iommu/generic_pt/fmt/Makefile
 create mode 100644 drivers/iommu/generic_pt/fmt/amdv1.h
 create mode 100644 drivers/iommu/generic_pt/fmt/defs_amdv1.h
 create mode 100644 drivers/iommu/generic_pt/fmt/defs_x86_64.h
 create mode 100644 drivers/iommu/generic_pt/fmt/iommu_amdv1.c
 create mode 100644 drivers/iommu/generic_pt/fmt/iommu_mock.c
 create mode 100644 drivers/iommu/generic_pt/fmt/iommu_template.h
 create mode 100644 drivers/iommu/generic_pt/fmt/iommu_x86_64.c
 create mode 100644 drivers/iommu/generic_pt/fmt/x86_64.h
 create mode 100644 drivers/iommu/generic_pt/iommu_pt.h
 create mode 100644 drivers/iommu/generic_pt/kunit_generic_pt.h
 create mode 100644 drivers/iommu/generic_pt/kunit_iommu.h
 create mode 100644 drivers/iommu/generic_pt/kunit_iommu_pt.h
 create mode 100644 drivers/iommu/generic_pt/pt_common.h
 create mode 100644 drivers/iommu/generic_pt/pt_defs.h
 create mode 100644 drivers/iommu/generic_pt/pt_fmt_defaults.h
 create mode 100644 drivers/iommu/generic_pt/pt_iter.h
 create mode 100644 drivers/iommu/generic_pt/pt_log2.h
 create mode 100644 include/linux/generic_pt/common.h
 create mode 100644 include/linux/generic_pt/iommu.h

base-commit: 8440410283bb5533b676574211f31f030a18011b
-- 
2.43.0

3 weeks, 4 days

[PATCH] selftests/ftrace: Add test dependency

by Thibault Ferrante

test_duplicates is missing a runtime dependency, which leads to test failures on kernels with specific configurations.

Signed-off-by: Thibault Ferrante <thibault.ferrante(a)canonical.com>
---
 .../testing/selftests/ftrace/test.d/dynevent/test_duplicates.tc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/test_duplicates.tc b/tools/testing/selftests/ftrace/test.d/dynevent/test_duplicates.tc
index d3a79da215c8..0b5e4543e70b 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/test_duplicates.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/test_duplicates.tc
@@ -1,7 +1,7 @@
 #!/bin/sh
 # SPDX-License-Identifier: GPL-2.0
 # description: Generic dynamic event - check if duplicate events are caught
-# requires: dynamic_events "e[:[<group>/][<event>]] <attached-group>.<attached-event> [<args>]":README
+# requires: dynamic_events events/syscalls/sys_enter_openat "e[:[<group>/][<event>]] <attached-group>.<attached-event> [<args>]":README
 
 echo 0 > events/enable
 
-- 
2.39.2

3 weeks, 5 days

[PATCH v23 0/8] fork: Support shadow stacks in clone3()

by Mark Brown

At this point I think everyone on the kernel side is happy with this but there were some questions from the glibc side about the value of controlling the shadow stack placement and size, especially with the current inability to reuse the shadow stack for an exited thread. With support for reuse it would be possible to have a cache of shadow stacks as is currently supported for the normal stack.

Since the discussion petered out I'm resending this in order to give people something to work with while prototyping. It should be possible to prototype any potential kernel features to help build out shadow stack support in userspace by enabling shadow stack writes; as suggested by Rick Edgecombe, this may end up being required anyway for supporting more exotic scenarios. On all current architectures with the feature, writes to the shadow stack require specific instructions so there are still security benefits even with writes enabled.

I did send a change implementing a feature writing a token on thread exit to allow reuse:

   https://lore.kernel.org/r/20250921-arm64-gcs-exit-token-v1-0-45cf64e648d5@k…

but wasn't planning to refresh it without some indication from the userspace side that that'd be useful.

Non-process cover letter:

The kernel has added support for shadow stacks, currently x86 only using their CET feature but both arm64 and RISC-V have equivalent features (GCS and Zicfiss respectively), I am actively working on GCS[1]. With shadow stacks the hardware maintains an additional stack containing only the return addresses for branch instructions which is not generally writeable by userspace and ensures that any returns are to the recorded addresses. This provides some protection against ROP attacks and makes it easier to collect call stacks. These shadow stacks are allocated in the address space of the userspace process.

Our API for shadow stacks does not currently offer userspace any flexibility for managing the allocation of shadow stacks for newly created threads, instead the kernel allocates a new shadow stack with the same size as the normal stack whenever a thread is created with the feature enabled. The stacks allocated in this way are freed by the kernel when the thread exits or shadow stacks are disabled for the thread. This lack of flexibility and control isn't ideal, in the vast majority of cases the shadow stack will be over allocated and the implicit allocation and deallocation is not consistent with other interfaces. As far as I can tell the interface is done in this manner mainly because the shadow stack patches were in development since before clone3() was implemented.

Since clone3() is readily extensible let's add support for specifying a shadow stack when creating a new thread or process, keeping the current implicit allocation behaviour if one is not specified either with clone3() or through the use of clone(). The user must provide a shadow stack pointer, this must point to memory mapped for use as a shadow stack by map_shadow_stack() with an architecture specified shadow stack token at the top of the stack.

Yuri Khrustalev has raised questions from the libc side regarding discoverability of extended clone3() structure sizes[2], this seems like a general issue with clone3(). There was a suggestion to add a hwcap on arm64 which isn't ideal but is doable there, though architecture specific mechanisms would also be needed for x86 (and RISC-V if its support gets merged before this does). The idea has, however, had strong pushback from the architecture maintainers and it is possible to detect support for this in clone3() by attempting a call with a misaligned shadow stack pointer specified so no hwcap has been added.

[1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87…
[2] https://lore.kernel.org/r/aCs65ccRQtJBnZ_5@arm.com

Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v23:
- Rebase onto v6.19-rc1.
- Link to v22: https://lore.kernel.org/r/20251015-clone3-shadow-stack-v22-0-a8c8da011427@k…

Changes in v22:
- Rebase onto v6.18-rc1.
- Cover letter updates.
- Link to v21: https://lore.kernel.org/r/20250916-clone3-shadow-stack-v21-0-910493527013@k…

Changes in v21:
- Rebase onto https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git kernel-6.18.clone3
- Rename shadow_stack_token to shstk_token, since it's a simple rename I've kept the acks and reviews but I dropped the tested-bys just to be safe.
- Link to v20: https://lore.kernel.org/r/20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@k…

Changes in v20:
- Comment fixes and clarifications in x86 arch_shstk_validate_clone() from Rick Edgecombe.
- Spelling fix in documentation.
- Link to v19: https://lore.kernel.org/r/20250819-clone3-shadow-stack-v19-0-bc957075479b@k…

Changes in v19:
- Rebase onto v6.17-rc1.
- Link to v18: https://lore.kernel.org/r/20250702-clone3-shadow-stack-v18-0-7965d2b694db@k…

Changes in v18:
- Rebase onto v6.16-rc3.
- Thanks to pointers from Yuri Khrustalev this version has been tested on x86 so I have removed the RFT tag.
- Clarify clone3_shadow_stack_valid() comment about the Kconfig check.
- Remove redundant GCSB DSYNCs in arm64 code.
- Fix token validation on x86.
- Link to v17: https://lore.kernel.org/r/20250609-clone3-shadow-stack-v17-0-8840ed97ff6f@k…

Changes in v17:
- Rebase onto v6.16-rc1.
- Link to v16: https://lore.kernel.org/r/20250416-clone3-shadow-stack-v16-0-2ffc9ca3917b@k…

Changes in v16:
- Rebase onto v6.15-rc2.
- Roll in fixes from x86 testing from Rick Edgecombe.
- Rework so that the argument is shadow_stack_token.
- Link to v15: https://lore.kernel.org/r/20250408-clone3-shadow-stack-v15-0-3fa245c6e3be@k…

Changes in v15:
- Rebase onto v6.15-rc1.
- Link to v14: https://lore.kernel.org/r/20250206-clone3-shadow-stack-v14-0-805b53af73b9@k…

Changes in v14:
- Rebase onto v6.14-rc1.
- Link to v13: https://lore.kernel.org/r/20241203-clone3-shadow-stack-v13-0-93b89a81a5ed@k…

Changes in v13:
- Rebase onto v6.13-rc1.
- Link to v12: https://lore.kernel.org/r/20241031-clone3-shadow-stack-v12-0-7183eb8bee17@k…

Changes in v12:
- Add the regular prctl() to the userspace API document since arm64 support is queued in -next.
- Link to v11: https://lore.kernel.org/r/20241005-clone3-shadow-stack-v11-0-2a6a2bd6d651@k…

Changes in v11:
- Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and integrate arm64 support.
- Rework the interface to specify a shadow stack pointer rather than a base and size like we do for the regular stack.
- Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@k…

Changes in v10:
- Integrate fixes & improvements for the x86 implementation from Rick Edgecombe.
- Require that the shadow stack be VM_WRITE.
- Require that the shadow stack base and size be sizeof(void *) aligned.
- Clean up trailing newline.
- Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@ke…

Changes in v9:
- Pull token validation earlier and report problems with an error return to parent rather than signal delivery to the child.
- Verify that the top of the supplied shadow stack is VM_SHADOW_STACK.
- Rework token validation to only do the page mapping once.
- Drop no longer needed support for testing for signals in selftest.
- Fix typo in comments.
- Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@ke…

Changes in v8:
- Fix token verification with user specified shadow stack.
- Don't track user managed shadow stacks for child processes.
- Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@ke…

Changes in v7:
- Rebase onto v6.11-rc1.
- Typo fixes.
- Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@ke…

Changes in v6:
- Rebase onto v6.10-rc3.
- Ensure we don't try to free the parent shadow stack in error paths of x86 arch code.
- Spelling fixes in userspace API document.
- Additional cleanups and improvements to the clone3() tests to support the shadow stack tests.
- Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@ke…

Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke…

Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke…

Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke…

Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke…

---
Mark Brown (8):
      arm64/gcs: Return a success value from gcs_alloc_thread_stack()
      Documentation: userspace-api: Add shadow stack API documentation
      selftests: Provide helper header for shadow stack testing
      fork: Add shadow stack support to clone3()
      selftests/clone3: Remove redundant flushes of output streams
      selftests/clone3: Factor more of main loop into test_clone3()
      selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
      selftests/clone3: Test shadow stack support

 Documentation/userspace-api/index.rst | 1 +
 Documentation/userspace-api/shadow_stack.rst | 44 +++++
 arch/arm64/include/asm/gcs.h | 8 +-
 arch/arm64/kernel/process.c | 8 +-
 arch/arm64/mm/gcs.c | 55 +++++-
 arch/x86/include/asm/shstk.h | 11 +-
 arch/x86/kernel/process.c | 2 +-
 arch/x86/kernel/shstk.c | 53 ++++-
 include/asm-generic/cacheflush.h | 11 ++
 include/linux/sched/task.h | 17 ++
 include/uapi/linux/sched.h | 9 +-
 kernel/fork.c | 93 +++++++--
 tools/testing/selftests/clone3/clone3.c | 226 ++++++++++++++++++----
 tools/testing/selftests/clone3/clone3_selftests.h | 65 ++++++-
 tools/testing/selftests/ksft_shstk.h | 98 ++++++++++
 15 files changed, 620 insertions(+), 81 deletions(-)
---
base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
change-id: 20231019-clone3-shadow-stack-15d40d2bf536

Best regards,
-- 
Mark Brown <broonie(a)kernel.org>

3 weeks, 5 days

[PATCH v2] selftests/filesystems: Assume that TIOCGPTPEER is defined

by Mark Brown

The devpts_pts selftest has an ifdef in case an architecture does not define TIOCGPTPEER, but the handling for this is broken since we need errno to be set to EINVAL in order to skip the test as we should. Given that this ioctl() has been defined since v4.15 we may as well just assume it's there rather than write handling code which will probably never get used.

Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Rebase onto v6.19-rc1.
- Link to v1: https://patch.msgid.link/20251126-selftests-filesystems-devpts-tiocgptpeer-…
---
 tools/testing/selftests/filesystems/devpts_pts.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/tools/testing/selftests/filesystems/devpts_pts.c b/tools/testing/selftests/filesystems/devpts_pts.c
index 54fea349204e..950e8b7f675b 100644
--- a/tools/testing/selftests/filesystems/devpts_pts.c
+++ b/tools/testing/selftests/filesystems/devpts_pts.c
@@ -100,7 +100,7 @@ static int resolve_procfd_symlink(int fd, char *buf, size_t buflen)
 static int do_tiocgptpeer(char *ptmx, char *expected_procfd_contents)
 {
 	int ret;
-	int master = -1, slave = -1, fret = -1;
+	int master = -1, slave, fret = -1;
 
 	master = open(ptmx, O_RDWR | O_NOCTTY | O_CLOEXEC);
 	if (master < 0) {
@@ -119,9 +119,7 @@ static int do_tiocgptpeer(char *ptmx, char *expected_procfd_contents)
 		goto do_cleanup;
 	}
 
-#ifdef TIOCGPTPEER
 	slave = ioctl(master, TIOCGPTPEER, O_RDWR | O_NOCTTY | O_CLOEXEC);
-#endif
 	if (slave < 0) {
 		if (errno == EINVAL) {
 			fprintf(stderr, "TIOCGPTPEER is not supported. "

---
base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
change-id: 20251126-selftests-filesystems-devpts-tiocgptpeer-fbd30e579859

Best regards,
-- 
Mark Brown <broonie(a)kernel.org>
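For readers unfamiliar with the ioctl this test exercises, here is a minimal userspace sketch of TIOCGPTPEER: it opens a pty master, unlocks it, and asks the kernel for a file descriptor to the peer end without resolving a path. The helper name `open_pty_peer` and its error handling are illustrative and are not taken from devpts_pts.c; the sketch assumes a Linux system whose headers define TIOCGPTPEER.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Illustrative helper (not from the selftest): open the pty master at
 * ptmx_path and return a file descriptor for its peer obtained with
 * TIOCGPTPEER. On success *master_out holds the master fd. On failure
 * returns -1 with errno describing the first error and leaves
 * *master_out untouched.
 */
int open_pty_peer(const char *ptmx_path, int *master_out)
{
	int master, peer, saved_errno;

	assert(ptmx_path && master_out);

	master = open(ptmx_path, O_RDWR | O_NOCTTY | O_CLOEXEC);
	if (master < 0)
		return -1;

	/* Unlock the pty before asking for the peer. */
	if (unlockpt(master) < 0)
		goto fail;

	/* The third argument carries open(2)-style flags for the new fd. */
	peer = ioctl(master, TIOCGPTPEER, O_RDWR | O_NOCTTY | O_CLOEXEC);
	if (peer < 0)
		goto fail;

	*master_out = master;
	return peer;

fail:
	saved_errno = errno;	/* close() may overwrite errno */
	close(master);
	errno = saved_errno;
	return -1;
}
```

A caller would typically pass "/dev/ptmx" and treat a -1 return with errno set to EINVAL as "ioctl not supported", which is the condition the selftest uses to skip.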

3 weeks, 5 days

[PATCH v2 0/9] KVM: x86: Improve the handling of debug exceptions during instruction emulation

by Hou Wenlong

During my testing, I found that guest debugging with 'DR6.BD' does not work in instruction emulation, as the current code only considers the guest's DR7. Upon reviewing the code, I also observed that the checks for the userspace guest debugging feature and the guest's own debugging feature are repeated in different places during instruction emulation, but the overall logic is the same. If guest debug is enabled, it needs to exit to userspace; otherwise, a #DB exception needs to be injected into the guest. Therefore, as suggested by Jiangshan Lai, some cleanup has been done for #DB handling in instruction emulation in this patchset. A new function named 'kvm_inject_emulated_db()' is introduced to consolidate all the checking logic. Moreover, I hope we can make the #DB interception path use the same function as well.

Additionally, when I looked into the single-step #DB handling in instruction emulation, I noticed that the interrupt shadow is toggled, but it is not considered in the single-step #DB injection. This oversight causes VM entry to fail on VMX (due to pending debug exceptions state checking).

As pointed out by Sean, fault-like code #DBs can be coincident with trap-like single-step #DBs at the instruction boundary on the hardware. However it is difficult to emulate this in the emulator, as kvm_vcpu_check_code_breakpoint() is called at the start of the next instruction while the single-step #DB for the previous instruction has already been injected.

v1->v2:
- cleanup in inject_emulated_exception().
- rename 'set_pending_dbg' callback as 'refresh_pending_dbg_exceptions'.
- fold refresh_pending_dbg_exceptions() call into kvm_vcpu_do_singlestep().
- Split the change to move up kvm_set_rflags() into a single patch.
- Move the #DB and IRQ handler registration after guest debug testcases.

Hou Wenlong (9):
  KVM: x86: Capture "struct x86_exception" in inject_emulated_exception()
  KVM: x86: Set guest DR6 by kvm_queue_exception_p() in instruction emulation
  KVM: x86: Check guest debug in DR access instruction emulation
  KVM: x86: Only check effective code breakpoint in emulation
  KVM: x86: Consolidate KVM_GUESTDBG_SINGLESTEP check into the kvm_inject_emulated_db()
  KVM: x86: Move kvm_set_rflags() up before kvm_vcpu_do_singlestep()
  KVM: VMX: Refresh 'PENDING_DBG_EXCEPTIONS.BS' bit during instruction emulation
  KVM: selftests: Verify guest debug DR7.GD checking during instruction emulation
  KVM: selftests: Verify 'BS' bit checking in pending debug exception during VM entry

 arch/x86/include/asm/kvm-x86-ops.h | 1 +
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/emulate.c | 14 +--
 arch/x86/kvm/kvm_emulate.h | 7 +-
 arch/x86/kvm/vmx/main.c | 9 ++
 arch/x86/kvm/vmx/vmx.c | 15 ++-
 arch/x86/kvm/vmx/x86_ops.h | 1 +
 arch/x86/kvm/x86.c | 116 ++++++++++--------
 arch/x86/kvm/x86.h | 7 ++
 .../selftests/kvm/include/x86/processor.h | 3 +-
 tools/testing/selftests/kvm/x86/debug_regs.c | 72 ++++++++++-
 11 files changed, 178 insertions(+), 68 deletions(-)

base-commit: 5d3e2d9ba9ed68576c70c127e4f7446d896f2af2
-- 
2.31.1

3 weeks, 5 days

[PATCH 0/7] KVM: x86: Improve the handling of debug exceptions during instruction emulation

by Hou Wenlong

During my testing, I found that guest debugging with 'DR6.BD' does not work in instruction emulation, as the current code only considers the guest's DR7. Upon reviewing the code, I also observed that the checks for the userspace guest debugging feature and the guest's own debugging feature are repeated in different places during instruction emulation, but the overall logic is the same. If guest debugging is enabled, it needs to exit to userspace; otherwise, a #DB exception needs to be injected into the guest. Therefore, as suggested by Jiangshan Lai, some cleanup has been done for #DB handling in instruction emulation in this patchset. A new function named 'kvm_inject_emulated_db()' is introduced to consolidate all the checking logic. Moreover, I hope we can make the #DB interception path use the same function as well.

Additionally, when I looked into the single-step #DB handling in instruction emulation, I noticed that the interrupt shadow is toggled, but it is not considered in the single-step #DB injection. This oversight causes VM entry to fail on VMX (due to pending debug exceptions checking) or breaks the 'MOV SS' suppressed #DB. For the latter, I have kept the behavior for now in my patchset, as I need some suggestions.

Hou Wenlong (7):
  KVM: x86: Set guest DR6 by kvm_queue_exception_p() in instruction emulation
  KVM: x86: Check guest debug in DR access instruction emulation
  KVM: x86: Only check effective code breakpoint in emulation
  KVM: x86: Consolidate KVM_GUESTDBG_SINGLESTEP check into the kvm_inject_emulated_db()
  KVM: VMX: Set 'BS' bit in pending debug exceptions during instruction emulation
  KVM: selftests: Verify guest debug DR7.GD checking during instruction emulation
  KVM: selftests: Verify 'BS' bit checking in pending debug exception during VM entry

 arch/x86/include/asm/kvm-x86-ops.h | 1 +
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/emulate.c | 14 +--
 arch/x86/kvm/kvm_emulate.h | 7 +-
 arch/x86/kvm/vmx/main.c | 9 ++
 arch/x86/kvm/vmx/vmx.c | 14 ++-
 arch/x86/kvm/vmx/x86_ops.h | 1 +
 arch/x86/kvm/x86.c | 109 +++++++++++-------
 arch/x86/kvm/x86.h | 7 ++
 .../selftests/kvm/include/x86/processor.h | 3 +-
 tools/testing/selftests/kvm/x86/debug_regs.c | 64 +++++++++-
 11 files changed, 167 insertions(+), 63 deletions(-)

base-commit: ecbcc2461839e848970468b44db32282e5059925
-- 
2.31.1

3 weeks, 5 days

[PATCH v3 0/3] selftests/filelock: Make output more kselftestish

by Mark Brown

This series makes the output from the ofdlocks test a bit easier for tooling to work with, and also ignores the generated file while we're here. Signed-off-by: Mark Brown <broonie(a)kernel.org> --- Changes in v3: - Rebase onto v6.19-rc1. - Link to v2: https://lore.kernel.org/r/20251015-selftest-filelock-ktap-v2-0-f5fd21b75c3a… Changes in v2: - Rebase onto v6.18-rc1. - Link to v1: https://lore.kernel.org/r/20250818-selftest-filelock-ktap-v1-0-d41af77f1396… --- Mark Brown (3): kselftest/filelock: Use ksft_perror() kselftest/filelock: Report each test in oftlocks separately kselftest/filelock: Add a .gitignore file tools/testing/selftests/filelock/.gitignore | 1 + tools/testing/selftests/filelock/ofdlocks.c | 94 +++++++++++++---------------- 2 files changed, 42 insertions(+), 53 deletions(-) --- base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8 change-id: 20250604-selftest-filelock-ktap-f2ae998a0de0 Best regards, -- Mark Brown <broonie(a)kernel.org>

3 weeks, 5 days

Issue in parsing of tests that use KUNIT_CASE_PARAM

by Ojaswin Mujoo

Hello, While writing some Kunit tests for ext4 filesystem, I'm encountering an issue in the way we display the diagnostic logs upon failures, when using KUNIT_CASE_PARAM() to write the tests. This can be observed by patching fs/ext4/mballoc-test.c to fail and print one of the params: --- a/fs/ext4/mballoc-test.c +++ b/fs/ext4/mballoc-test.c @@ -350,6 +350,8 @@ static int mbt_kunit_init(struct kunit *test) struct super_block *sb; int ret; + KUNIT_FAIL(test, "Failed: blocksize_bits=%d", layout->blocksize_bits); + sb = mbt_ext4_alloc_super_block(); if (sb == NULL) return -ENOMEM; With the above change, we can observe the following output (snipped): [18:50:25] ============== ext4_mballoc_test (7 subtests) ============== [18:50:25] ================= test_new_blocks_simple ================== [18:50:25] [FAILED] block_bits=10 cluster_bits=3 blocks_per_group=8192 group_count=4 desc_size=64 [18:50:25] # test_new_blocks_simple: EXPECTATION FAILED at fs/ext4/mballoc-test.c:364 [18:50:25] Failed: blocksize_bits=12 [18:50:25] [FAILED] block_bits=12 cluster_bits=3 blocks_per_group=8192 group_count=4 desc_size=64 [18:50:25] # test_new_blocks_simple: EXPECTATION FAILED at fs/ext4/mballoc-test.c:364 [18:50:25] Failed: blocksize_bits=16 [18:50:25] [FAILED] block_bits=16 cluster_bits=3 blocks_per_group=8192 group_count=4 desc_size=64 [18:50:25] # test_new_blocks_simple: EXPECTATION FAILED at fs/ext4/mballoc-test.c:364 [18:50:25] Failed: blocksize_bits=10 [18:50:25] # test_new_blocks_simple: pass:0 fail:3 skip:0 total:3 [18:50:25] ============= [FAILED] test_new_blocks_simple ============== <snip> Note that the diagnostic logs don't show up correctly. Ideally they should be before test result but here the first [FAILED] test has no logs printed above whereas the last "Failed: blocksize_bits=10" print comes after the last subtest, when it actually corresponds to the first subtest. 
The KTAP file itself seems to have diagnostic logs in the right place: KTAP version 1 1..2 KTAP version 1 # Subtest: ext4_mballoc_test # module: ext4 1..7 KTAP version 1 # Subtest: test_new_blocks_simple # test_new_blocks_simple: EXPECTATION FAILED at fs/ext4/mballoc-test.c:364 Failed: blocksize_bits=10 not ok 1 block_bits=10 cluster_bits=3 blocks_per_group=8192 group_count=4 desc_size=64 # test_new_blocks_simple: EXPECTATION FAILED at fs/ext4/mballoc-test.c:364 Failed: blocksize_bits=12 not ok 2 block_bits=12 cluster_bits=3 blocks_per_group=8192 group_count=4 desc_size=64 # test_new_blocks_simple: EXPECTATION FAILED at fs/ext4/mballoc-test.c:364 Failed: blocksize_bits=16 not ok 3 block_bits=16 cluster_bits=3 blocks_per_group=8192 group_count=4 desc_size=64 # test_new_blocks_simple: pass:0 fail:3 skip:0 total:3 not ok 1 test_new_blocks_simple <snip> By tracing the kunit_parser.py script, I could see the issue here is in the parsing of the "Subtest: test_new_blocks_simple". We end up associating everything below the subtest till "not ok 1 block_bits=10..." as diagnostic logs of the subtest, while these logs actually belong to the first of the 3 subtests under this test. I tried to figure out a way to fix the parsing but fixing one thing broke something else. I'm starting to think that the issue is that there are 3 subtests under test_new_blocks_simple (array of 3 params passed to KUNIT_CASE_PARAM), but instead of creating 3 structured subtests, the KTAP output dumps the results of all 3 directly under subtest:test_new_blocks_simple, which makes it tricky to determine where the diagnostic logs/attributes of test_new_blocks_simple end and those of its children begin. I'm not very familiar with the KUnit framework so I thought I'd reach out here for some pointers. I can dedicate some time to fixing this but I'd like to know if this is something we need to somehow fix in parsing or during generation of the KTAP file itself? Any pointers would be appreciated. Thanks, Ojaswin
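To make the expected attribution concrete, here is a minimal sketch (not the actual kunit_parser.py logic) that attaches each diagnostic line to the next result line that follows it, which is the association the report above argues for. The function name attribute_diagnostics, the SAMPLE list and the header-skipping rules are assumptions made for illustration; the sample lines are taken from the KTAP output quoted in the report.

```python
import re

# A KTAP result line: "ok N name" or "not ok N name".
RESULT = re.compile(r"^(ok|not ok) (\d+) (.*)$")
# A KTAP plan line such as "1..7".
PLAN = re.compile(r"^\d+\.\.\d+$")


def attribute_diagnostics(lines):
    """Return [(result_name, [diagnostics])], attaching every pending
    diagnostic line to the first result line that follows it.

    Hypothetical helper for illustration only; it ignores nesting depth.
    """
    results = []
    pending = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Header lines are structure, not diagnostics.
        if line.startswith("KTAP version") or line.startswith("# Subtest:"):
            continue
        if PLAN.match(line):
            continue
        match = RESULT.match(line)
        if match:
            results.append((match.group(3), pending))
            pending = []
        else:
            pending.append(line)
    return results


LOC = "# test_new_blocks_simple: EXPECTATION FAILED at fs/ext4/mballoc-test.c:364"
TAIL = "cluster_bits=3 blocks_per_group=8192 group_count=4 desc_size=64"

SAMPLE = [
    "KTAP version 1",
    "# Subtest: test_new_blocks_simple",
    LOC,
    "Failed: blocksize_bits=10",
    "not ok 1 block_bits=10 " + TAIL,
    LOC,
    "Failed: blocksize_bits=12",
    "not ok 2 block_bits=12 " + TAIL,
    LOC,
    "Failed: blocksize_bits=16",
    "not ok 3 block_bits=16 " + TAIL,
    "# test_new_blocks_simple: pass:0 fail:3 skip:0 total:3",
    "not ok 1 test_new_blocks_simple",
]

RESULTS = attribute_diagnostics(SAMPLE)
```

This deliberately sidesteps the hard part raised above: with the parameter results emitted directly under the "# Subtest:" header, a line right after the header could belong either to the parent test or to its first parameter, and this sketch simply always picks the next result line.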

3 weeks, 5 days

[PATCH][v2] watchdog: softlockup: panic when lockup duration exceeds N thresholds

by lirongqing

From: Li RongQing <lirongqing(a)baidu.com> The softlockup_panic sysctl is currently a binary option: panic immediately or never panic on soft lockups. Panicking on any soft lockup, regardless of duration, can be overly aggressive for brief stalls that may be caused by legitimate operations. Conversely, never panicking may allow severe system hangs to persist undetected. Extend softlockup_panic to accept an integer threshold, allowing the kernel to panic only when the normalized lockup duration exceeds N watchdog threshold periods. This provides finer-grained control to distinguish between transient delays and persistent system failures. The accepted values are: - 0: Don't panic (unchanged) - 1: Panic when duration >= 1 * threshold (20s default, original behavior) - N > 1: Panic when duration >= N * threshold (e.g., 2 = 40s, 3 = 60s.) The original behavior is preserved for values 0 and 1, maintaining full backward compatibility while allowing systems to tolerate brief lockups while still catching severe, persistent hangs. 
Signed-off-by: Li RongQing <lirongqing(a)baidu.com> Cc: Eduard Zingerman <eddyz87(a)gmail.com> Cc: Hao Luo <haoluo(a)google.com> Cc: Jiri Olsa <jolsa(a)kernel.org> Cc: John Fastabend <john.fastabend(a)gmail.com> Cc: KP Singh <kpsingh(a)kernel.org> Cc: Lance Yang <lance.yang(a)linux.dev> Cc: Martin KaFai Lau <martin.lau(a)linux.dev> Cc: Nicholas Piggin <npiggin(a)gmail.com> Cc: Song Liu <song(a)kernel.org> Cc: Stanislav Fomichev <sdf(a)fomichev.me> Cc: Yonghong Song <yonghong.song(a)linux.dev> Cc: Andrew Morton <akpm(a)linux-foundation.org> --- Diff with v1: add a temp variable thresh_count; change config to 0 in kernel/configs/debug.config Documentation/admin-guide/kernel-parameters.txt | 10 +++++----- arch/arm/configs/aspeed_g5_defconfig | 2 +- arch/arm/configs/pxa3xx_defconfig | 2 +- arch/openrisc/configs/or1klitex_defconfig | 2 +- arch/powerpc/configs/skiroot_defconfig | 2 +- drivers/gpu/drm/ci/arm.config | 2 +- drivers/gpu/drm/ci/arm64.config | 2 +- drivers/gpu/drm/ci/x86_64.config | 2 +- kernel/configs/debug.config | 2 +- kernel/watchdog.c | 10 ++++++---- lib/Kconfig.debug | 13 +++++++------ tools/testing/selftests/bpf/config | 2 +- tools/testing/selftests/wireguard/qemu/kernel.config | 2 +- 13 files changed, 28 insertions(+), 25 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index a8d0afd..27c5f96 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -6934,12 +6934,12 @@ Kernel parameters softlockup_panic= [KNL] Should the soft-lockup detector generate panics. - Format: 0 | 1 + Format: <int> - A value of 1 instructs the soft-lockup detector - to panic the machine when a soft-lockup occurs. 
It is - also controlled by the kernel.softlockup_panic sysctl - and CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC, which is the + A value of non-zero instructs the soft-lockup detector + to panic the machine when a soft-lockup duration exceeds + N thresholds. It is also controlled by the kernel.softlockup_panic + sysctl and CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC, which is the respective build-time switch to that functionality. softlockup_all_cpu_backtrace= diff --git a/arch/arm/configs/aspeed_g5_defconfig b/arch/arm/configs/aspeed_g5_defconfig index 2e6ea13..ec558e5 100644 --- a/arch/arm/configs/aspeed_g5_defconfig +++ b/arch/arm/configs/aspeed_g5_defconfig @@ -306,7 +306,7 @@ CONFIG_SCHED_STACK_END_CHECK=y CONFIG_PANIC_ON_OOPS=y CONFIG_PANIC_TIMEOUT=-1 CONFIG_SOFTLOCKUP_DETECTOR=y -CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y +CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=1 CONFIG_BOOTPARAM_HUNG_TASK_PANIC=1 CONFIG_WQ_WATCHDOG=y # CONFIG_SCHED_DEBUG is not set diff --git a/arch/arm/configs/pxa3xx_defconfig b/arch/arm/configs/pxa3xx_defconfig index 07d422f..fb272e3 100644 --- a/arch/arm/configs/pxa3xx_defconfig +++ b/arch/arm/configs/pxa3xx_defconfig @@ -100,7 +100,7 @@ CONFIG_PRINTK_TIME=y CONFIG_DEBUG_KERNEL=y CONFIG_MAGIC_SYSRQ=y CONFIG_DEBUG_SHIRQ=y -CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y +CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=1 # CONFIG_SCHED_DEBUG is not set CONFIG_DEBUG_SPINLOCK=y CONFIG_DEBUG_SPINLOCK_SLEEP=y diff --git a/arch/openrisc/configs/or1klitex_defconfig b/arch/openrisc/configs/or1klitex_defconfig index fb1eb9a..984b0e3 100644 --- a/arch/openrisc/configs/or1klitex_defconfig +++ b/arch/openrisc/configs/or1klitex_defconfig @@ -52,5 +52,5 @@ CONFIG_LSM="lockdown,yama,loadpin,safesetid,integrity,bpf" CONFIG_PRINTK_TIME=y CONFIG_PANIC_ON_OOPS=y CONFIG_SOFTLOCKUP_DETECTOR=y -CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y +CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=1 CONFIG_BUG_ON_DATA_CORRUPTION=y diff --git a/arch/powerpc/configs/skiroot_defconfig b/arch/powerpc/configs/skiroot_defconfig index 2b71a6d..a4114fc 100644 
--- a/arch/powerpc/configs/skiroot_defconfig +++ b/arch/powerpc/configs/skiroot_defconfig @@ -289,7 +289,7 @@ CONFIG_SCHED_STACK_END_CHECK=y CONFIG_DEBUG_STACKOVERFLOW=y CONFIG_PANIC_ON_OOPS=y CONFIG_SOFTLOCKUP_DETECTOR=y -CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y +CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=1 CONFIG_HARDLOCKUP_DETECTOR=y CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y CONFIG_WQ_WATCHDOG=y diff --git a/drivers/gpu/drm/ci/arm.config b/drivers/gpu/drm/ci/arm.config index 411e814..d7c5167 100644 --- a/drivers/gpu/drm/ci/arm.config +++ b/drivers/gpu/drm/ci/arm.config @@ -52,7 +52,7 @@ CONFIG_TMPFS=y CONFIG_PROVE_LOCKING=n CONFIG_DEBUG_LOCKDEP=n CONFIG_SOFTLOCKUP_DETECTOR=n -CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=n +CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=0 CONFIG_FW_LOADER_COMPRESS=y diff --git a/drivers/gpu/drm/ci/arm64.config b/drivers/gpu/drm/ci/arm64.config index fddfbd4..ea0e307 100644 --- a/drivers/gpu/drm/ci/arm64.config +++ b/drivers/gpu/drm/ci/arm64.config @@ -161,7 +161,7 @@ CONFIG_TMPFS=y CONFIG_PROVE_LOCKING=n CONFIG_DEBUG_LOCKDEP=n CONFIG_SOFTLOCKUP_DETECTOR=y -CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y +CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=1 CONFIG_DETECT_HUNG_TASK=y diff --git a/drivers/gpu/drm/ci/x86_64.config b/drivers/gpu/drm/ci/x86_64.config index 8eaba388..7ac98a7 100644 --- a/drivers/gpu/drm/ci/x86_64.config +++ b/drivers/gpu/drm/ci/x86_64.config @@ -47,7 +47,7 @@ CONFIG_TMPFS=y CONFIG_PROVE_LOCKING=n CONFIG_DEBUG_LOCKDEP=n CONFIG_SOFTLOCKUP_DETECTOR=y -CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y +CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=1 CONFIG_DETECT_HUNG_TASK=y diff --git a/kernel/configs/debug.config b/kernel/configs/debug.config index 9f6ab7d..774702591 100644 --- a/kernel/configs/debug.config +++ b/kernel/configs/debug.config @@ -84,7 +84,7 @@ CONFIG_SLUB_DEBUG_ON=y # Debug Oops, Lockups and Hangs # CONFIG_BOOTPARAM_HUNG_TASK_PANIC=0 -# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set +CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=0 CONFIG_DEBUG_ATOMIC_SLEEP=y CONFIG_DETECT_HUNG_TASK=y 
CONFIG_PANIC_ON_OOPS=y diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 0685e3a..8168e0d 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -363,7 +363,7 @@ static struct cpumask watchdog_allowed_mask __read_mostly; /* Global variables, exported for sysctl */ unsigned int __read_mostly softlockup_panic = - IS_ENABLED(CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC); + CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC; static bool softlockup_initialized __read_mostly; static u64 __read_mostly sample_period; @@ -774,8 +774,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) { unsigned long touch_ts, period_ts, now; struct pt_regs *regs = get_irq_regs(); - int duration; int softlockup_all_cpu_backtrace; + int duration, thresh_count; unsigned long flags; if (!watchdog_enabled) @@ -879,7 +879,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK); sys_info(softlockup_si_mask & ~SYS_INFO_ALL_BT); - if (softlockup_panic) + thresh_count = duration / get_softlockup_thresh(); + + if (softlockup_panic && thresh_count >= softlockup_panic) panic("softlockup: hung tasks"); } @@ -1228,7 +1230,7 @@ static const struct ctl_table watchdog_sysctls[] = { .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ZERO, - .extra2 = SYSCTL_ONE, + .extra2 = SYSCTL_INT_MAX, }, { .procname = "softlockup_sys_info", diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index ba36939..17a7a77 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1110,13 +1110,14 @@ config SOFTLOCKUP_DETECTOR_INTR_STORM the CPU stats and the interrupt counts during the "soft lockups". 
config BOOTPARAM_SOFTLOCKUP_PANIC - bool "Panic (Reboot) On Soft Lockups" + int "Panic (Reboot) On Soft Lockups" depends on SOFTLOCKUP_DETECTOR + default 0 help - Say Y here to enable the kernel to panic on "soft lockups", - which are bugs that cause the kernel to loop in kernel - mode for more than 20 seconds (configurable using the watchdog_thresh - sysctl), without giving other tasks a chance to run. + Set to a non-zero value N to enable the kernel to panic on "soft + lockups", which are bugs that cause the kernel to loop in kernel + mode for more than (N * 20 seconds) (configurable using the + watchdog_thresh sysctl), without giving other tasks a chance to run. The panic can be used in combination with panic_timeout, to cause the system to reboot automatically after a @@ -1124,7 +1125,7 @@ config BOOTPARAM_SOFTLOCKUP_PANIC high-availability systems that have uptime guarantees and where a lockup must be resolved ASAP. - Say N if unsure. + Say 0 if unsure. config HAVE_HARDLOCKUP_DETECTOR_BUDDY bool diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config index 558839e..2485538 100644 --- a/tools/testing/selftests/bpf/config +++ b/tools/testing/selftests/bpf/config @@ -1,6 +1,6 @@ CONFIG_BLK_DEV_LOOP=y CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y -CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y +CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=1 CONFIG_BPF=y CONFIG_BPF_EVENTS=y CONFIG_BPF_JIT=y diff --git a/tools/testing/selftests/wireguard/qemu/kernel.config b/tools/testing/selftests/wireguard/qemu/kernel.config index 0504c11..bb89d2d 100644 --- a/tools/testing/selftests/wireguard/qemu/kernel.config +++ b/tools/testing/selftests/wireguard/qemu/kernel.config @@ -80,7 +80,7 @@ CONFIG_HARDLOCKUP_DETECTOR=y CONFIG_WQ_WATCHDOG=y CONFIG_DETECT_HUNG_TASK=y CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y -CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y +CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=1 CONFIG_BOOTPARAM_HUNG_TASK_PANIC=1 CONFIG_PANIC_TIMEOUT=-1 CONFIG_STACKTRACE=y -- 2.9.4
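The threshold semantics described in the commit message can be modeled in a few lines. This is a userspace sketch of the patched check in watchdog_timer_fn(), not kernel code: the function name should_panic and its parameters are assumptions for illustration, and softlockup_thresh stands in for the value returned by get_softlockup_thresh() (20 seconds by default, per the description above).

```python
def should_panic(duration: int, softlockup_thresh: int, softlockup_panic: int) -> bool:
    """Model of the patched decision: panic once the lockup has lasted at
    least softlockup_panic whole threshold periods.

    duration and softlockup_thresh are in seconds; softlockup_panic is the
    sysctl value (0 disables the panic).
    """
    # Integer division, mirroring "duration / get_softlockup_thresh()" in C.
    thresh_count = duration // softlockup_thresh
    return softlockup_panic != 0 and thresh_count >= softlockup_panic


# Behavior at the default 20 s threshold for a few sysctl settings.
EXAMPLES = {
    (25, 0): should_panic(25, 20, 0),  # disabled
    (25, 1): should_panic(25, 20, 1),  # original behavior
    (25, 2): should_panic(25, 20, 2),  # below 2 * 20 s
    (40, 2): should_panic(40, 20, 2),  # reached 2 * 20 s
}
```

In the kernel this check is only reached after a soft lockup has already been reported, so thresh_count is at least 1 there and a setting of 1 keeps the pre-patch behavior.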

3 weeks, 5 days

[PATCHSET v11 sched_ext/for-6.20] Add a deadline server for sched_ext tasks

by Andrea Righi

sched_ext tasks can be starved by long-running RT tasks, especially since RT throttling was replaced by deadline servers to boost only SCHED_NORMAL tasks. Several users in the community have reported issues with RT stalling sched_ext tasks. This is fairly common on distributions or environments where applications like video compositors, audio services, etc. run as RT tasks by default. Example trace (showing a per-CPU kthread stalled due to the sway Wayland compositor running as an RT task): runnable task stall (kworker/0:0[106377] failed to run for 5.043s) ... CPU 0 : nr_run=3 flags=0xd cpu_rel=0 ops_qseq=20646200 pnt_seq=45388738 curr=sway[994] class=rt_sched_class R kworker/0:0[106377] -5043ms scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=0/0 sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=0 slice=20000000 cpus=01 This is often perceived as a bug in the BPF schedulers, but in reality they can't do much: RT tasks run outside their control and can potentially consume 100% of the CPU bandwidth. Fix this by adding a sched_ext deadline server, so that sched_ext tasks are also boosted and do not suffer starvation. Two kselftests are also provided to verify that the starvation is fixed and that the bandwidth allocation is correct. 
== Design == - The EXT server is initialized at boot time and remains configured throughout the system's lifetime - It starts automatically when the first sched_ext task is enqueued (rq->scx.nr_running == 1) - The server's pick function (ext_server_pick_task) always selects sched_ext tasks when active - Runtime accounting happens in update_curr_scx() during task execution and update_curr_idle() when idle - Bandwidth accounting includes both fair and ext servers in root domain calculations - A debugfs interface (/sys/kernel/debug/sched/ext_server/) allows runtime tuning of server parameters == Highlights in this version == As discussed at the sched_ext microconference at LPC Tokyo, the plan is to start with a simpler approach, avoiding automatically creating or tearing down the EXT server bandwidth reservation when a BPF scheduler is loaded or unloaded. Instead, the reservation is kept permanently active. This significantly simplifies the logic while still addressing the starvation issue. Any fine-tuning of the bandwidth reservation is delegated to the system administrator, who can adjust it via the debugfs interface. In the future, a more suitable interface can be introduced and automatic removal of the reservation when the BPF scheduler is unloaded can be revisited. 
This patchset is also available in the following git branch: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-dl-server Changes in v11: - do not create/remove the bandwidth reservation for the ext server when a BPF scheduler is loaded/unloaded, but keep the reservation bandwidth always active - change rt_stall kselftest to validate both FAIR and EXT DL servers - Link to v10: https://lore.kernel.org/all/20250903095008.162049-1-arighi@nvidia.com/ Changes in v10: - reordered patches to better isolate sched_ext changes vs sched/deadline changes (Andrea Righi) - define ext_server only with CONFIG_SCHED_CLASS_EXT=y (Andrea Righi) - add WARN_ON_ONCE(!cpus) check in dl_server_apply_params() (Andrea Righi) - wait for inactive_task_timer to fire before removing the bandwidth reservation (Juri Lelli) - remove explicit dl_server_stop() in dequeue_task_scx() to reduce timer reprogramming overhead (Juri Lelli) - do not restart pick_task() when invoked by the dl_server (Tejun Heo) - rename rq_dl_server to dl_server (Peter Zijlstra) - fixed a missing dl_server start in dl_server_on() (Christian Loehle) - add a comment to the rt_stall selftest to better explain the 4% threshold (Emil Tsalapatis) - Link to v9: https://lore.kernel.org/all/20251017093214.70029-1-arighi@nvidia.com/ Changes in v9: - Drop the ->balance() logic as its functionality is now integrated into ->pick_task(), allowing dl_server to call pick_task_scx() directly - Link to v8: https://lore.kernel.org/all/20250903095008.162049-1-arighi@nvidia.com/ Changes in v8: - Add tj's patch to de-couple balance and pick_task and avoid changing sched/core callbacks to propagate @rf - Simplify dl_se->dl_server check (suggested by PeterZ) - Small coding style fixes in the kselftests - Link to v7: https://lore.kernel.org/all/20250809184800.129831-1-joelagnelf@nvidia.com/ Changes in v7: - Rebased to Linus master - Link to v6: https://lore.kernel.org/all/20250702232944.3221001-1-joelagnelf@nvidia.com/ Changes in v6: 
- Added Acks to few patches - Fixes to few nits suggested by Tejun - Link to v5: https://lore.kernel.org/all/20250620203234.3349930-1-joelagnelf@nvidia.com/ Changes in v5: - Added a kselftest (total_bw) to sched_ext to verify bandwidth values from debugfs - Address comment from Andrea about redundant rq clock invalidation - Link to v4: https://lore.kernel.org/all/20250617200523.1261231-1-joelagnelf@nvidia.com/ Changes in v4: - Fixed issues with hotplugged CPUs having their DL server bandwidth altered due to loading SCX - Fixed other issues - Rebased on Linus master - All sched_ext kselftests reliably pass now, also verified that the total_bw in debugfs (CONFIG_SCHED_DEBUG) is conserved with these patches - Link to v3: https://lore.kernel.org/all/20250613051734.4023260-1-joelagnelf@nvidia.com/ Changes in v3: - Removed code duplication in debugfs. Made ext interface separate - Fixed issue where rq_lock_irqsave was not used in the relinquish patch - Fixed running bw accounting issue in dl_server_remove_params - Link to v2: https://lore.kernel.org/all/20250602180110.816225-1-joelagnelf@nvidia.com/ Changes in v2: - Fixed a hang related to using rq_lock instead of rq_lock_irqsave - Added support to remove BW of DL servers when they are switched to/from EXT - Link to v1: https://lore.kernel.org/all/20250315022158.2354454-1-joelagnelf@nvidia.com/ Andrea Righi (2): sched_ext: Add a DL server for sched_ext tasks selftests/sched_ext: Add test for sched_ext dl_server Joel Fernandes (5): sched/deadline: Clear the defer params sched/debug: Fix updating of ppos on server write ops sched/debug: Stop and start server based on if it was active sched/debug: Add support to change sched_ext server params selftests/sched_ext: Add test for DL server total_bw consistency kernel/sched/core.c | 6 + kernel/sched/deadline.c | 87 +++++-- kernel/sched/debug.c | 171 +++++++++++--- kernel/sched/ext.c | 42 ++++ kernel/sched/idle.c | 3 + kernel/sched/sched.h | 2 + kernel/sched/topology.c | 5 + 
tools/testing/selftests/sched_ext/Makefile | 2 + tools/testing/selftests/sched_ext/rt_stall.bpf.c | 23 ++ tools/testing/selftests/sched_ext/rt_stall.c | 240 +++++++++++++++++++ tools/testing/selftests/sched_ext/total_bw.c | 281 +++++++++++++++++++++++ 11 files changed, 811 insertions(+), 51 deletions(-) create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c create mode 100644 tools/testing/selftests/sched_ext/total_bw.c

3 weeks, 5 days
