November 2025 - Linux-kselftest-mirror

[PATCH v3 0/3] KVM ARM64 pre_fault_memory

by Jack Thomson

From: Jack Thomson <jackabt(a)amazon.com>

This patch series adds ARM64 support for the KVM_PRE_FAULT_MEMORY
feature, which was previously only available on x86 [1]. This allows us
to reduce the number of stage-2 faults during execution. This is of
benefit in post-copy migration scenarios, particularly in memory
intensive applications, where we are experiencing high latencies due to
the stage-2 faults.

Patch Overview:

 - The first patch adds support for the KVM_PRE_FAULT_MEMORY ioctl on
   arm64.
 - The second patch updates the pre_fault_memory_test to support arm64.
 - The last patch extends the pre_fault_memory_test to cover different
   vm memory backings.

=== Changes Since v2 [2] ===

 - Update fault info synthesize value. Thanks Suzuki
 - Remove change to selftests for unaligned mmap allocations. Thanks
   Sean

[1]: https://lore.kernel.org/kvm/20240710174031.312055-1-pbonzini@redhat.com
[2]: https://lore.kernel.org/linux-arm-kernel/20251013151502.6679-1-jackabt.amaz…

Jack Thomson (3):
  KVM: arm64: Add pre_fault_memory implementation
  KVM: selftests: Enable pre_fault_memory_test for arm64
  KVM: selftests: Add option for different backing in pre-fault tests

 Documentation/virt/kvm/api.rst | 3 +-
 arch/arm64/kvm/Kconfig | 1 +
 arch/arm64/kvm/arm.c | 1 +
 arch/arm64/kvm/mmu.c | 73 +++++++++++-
 tools/testing/selftests/kvm/Makefile.kvm | 1 +
 .../selftests/kvm/pre_fault_memory_test.c | 110 ++++++++++++++----
 6 files changed, 159 insertions(+), 30 deletions(-)

base-commit: 8a4821412cf2c1429fffa07c012dd150f2edf78c
--
2.43.0
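
For readers unfamiliar with the ioctl, the sketch below shows one way a VMM might drive KVM_PRE_FAULT_MEMORY over a guest physical range. It is a minimal illustration, not code from this series. It assumes the behaviour described in the KVM API documentation: the kernel updates gpa and size to the remaining range, and EINTR/EAGAIN mean "retry". The struct is a local mirror of the uapi struct kvm_pre_fault_memory, and pre_fault_fn, pre_fault_all and fake_pre_fault are hypothetical names; in real code the callback would wrap ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &range).

```c
#include <errno.h>
#include <stdint.h>

/* Local mirror of the uapi struct kvm_pre_fault_memory (<linux/kvm.h>). */
struct pre_fault_range {
	uint64_t gpa;
	uint64_t size;
	uint64_t flags;
	uint64_t padding[5];
};

/*
 * Hypothetical hook. A real implementation would wrap
 * ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, range) and return its result
 * (0 on progress, -1 with errno set on failure).
 */
typedef int (*pre_fault_fn)(void *ctx, struct pre_fault_range *range);

/* Keep issuing the request until the whole range has been processed. */
int pre_fault_all(pre_fault_fn fn, void *ctx, uint64_t gpa, uint64_t size)
{
	struct pre_fault_range range = { .gpa = gpa, .size = size };

	while (range.size > 0) {
		/* Assumption: the callee advances gpa/size on partial progress. */
		if (fn(ctx, &range) < 0 && errno != EINTR && errno != EAGAIN)
			return -errno;
	}
	return 0;
}

/*
 * Stand-in backend for experimenting without /dev/kvm: the first call is
 * "interrupted" with no progress, later calls each map one 4 KiB page.
 * ctx points to a uint64_t call counter.
 */
int fake_pre_fault(void *ctx, struct pre_fault_range *range)
{
	uint64_t *calls = ctx;
	uint64_t step = range->size < 4096 ? range->size : 4096;

	if ((*calls)++ == 0) {
		errno = EINTR;
		return -1;
	}
	range->gpa += step;
	range->size -= step;
	return 0;
}
```

With the stand-in backend, pre-faulting three pages takes four calls: one interrupted attempt followed by one call per page.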

6 days, 15 hours

[PATCH v7 00/12] Direct Map Removal Support for guest_memfd

by Patrick Roy

From: Patrick Roy <roypat(a)amazon.co.uk>

[ based on kvm/next ]

Unmapping virtual machine guest memory from the host kernel's direct map
is a successful mitigation against Spectre-style transient execution
issues: If the kernel page tables do not contain entries pointing to
guest memory, then any attempted speculative read through the direct map
will necessarily be blocked by the MMU before any observable
microarchitectural side-effects happen. This means that Spectre-gadgets
and similar cannot be used to target virtual machine memory. Roughly 60%
of speculative execution issues fall into this category [1, Table 1].

This patch series extends guest_memfd with the ability to remove its
memory from the host kernel's direct map, to be able to attain the above
protection for KVM guests running inside guest_memfd.

Additionally, a Firecracker branch with support for these VMs can be
found on GitHub [2].

For more details, please refer to the v5 cover letter [v5]. No
substantial changes in design have taken place since.

=== Changes Since v6 ===

 - Drop patch for passing struct address_space to ->free_folio(), due to
   possible races with freeing of the address_space. (Hugh)
 - Stop using PG_uptodate / gmem preparedness tracking to keep track of
   direct map state. Instead, use the lowest bit of folio->private.
   (Mike, David)
 - Do direct map removal when establishing mapping of gmem folio instead
   of at allocation time, due to impossibility of handling direct map
   removal errors in kvm_gmem_populate(). (Patrick)
 - Do TLB flushes after direct map removal, and provide a module
   parameter to opt out from them, and a new patch to export
   flush_tlb_kernel_range() to KVM. (Will)

[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
[2]: https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hidi…
[RFCv1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/
[RFCv2]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/
[RFCv3]: https://lore.kernel.org/kvm/20241030134912.515725-1-roypat@amazon.co.uk/
[v4]: https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/
[v5]: https://lore.kernel.org/kvm/20250828093902.2719-1-roypat@amazon.co.uk/
[v6]: https://lore.kernel.org/kvm/20250912091708.17502-1-roypat@amazon.co.uk/

Patrick Roy (12):
  arch: export set_direct_map_valid_noflush to KVM module
  x86/tlb: export flush_tlb_kernel_range to KVM module
  mm: introduce AS_NO_DIRECT_MAP
  KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate
  KVM: guest_memfd: Add flag to remove from direct map
  KVM: guest_memfd: add module param for disabling TLB flushing
  KVM: selftests: load elf via bounce buffer
  KVM: selftests: set KVM_MEM_GUEST_MEMFD in vm_mem_add() if guest_memfd != -1
  KVM: selftests: Add guest_memfd based vm_mem_backing_src_types
  KVM: selftests: cover GUEST_MEMFD_FLAG_NO_DIRECT_MAP in existing selftests
  KVM: selftests: stuff vm_mem_backing_src_type into vm_shape
  KVM: selftests: Test guest execution from direct map removed gmem

 Documentation/virt/kvm/api.rst | 5 ++
 arch/arm64/include/asm/kvm_host.h | 12 ++++
 arch/arm64/mm/pageattr.c | 1 +
 arch/loongarch/mm/pageattr.c | 1 +
 arch/riscv/mm/pageattr.c | 1 +
 arch/s390/mm/pageattr.c | 1 +
 arch/x86/include/asm/tlbflush.h | 3 +-
 arch/x86/mm/pat/set_memory.c | 1 +
 arch/x86/mm/tlb.c | 1 +
 include/linux/kvm_host.h | 9 +++
 include/linux/pagemap.h | 16 +++++
 include/linux/secretmem.h | 18 -----
 include/uapi/linux/kvm.h | 2 +
 lib/buildid.c | 4 +-
 mm/gup.c | 19 ++----
 mm/mlock.c | 2 +-
 mm/secretmem.c | 8 +--
 .../testing/selftests/kvm/guest_memfd_test.c | 2 +
 .../testing/selftests/kvm/include/kvm_util.h | 37 ++++++++---
 .../testing/selftests/kvm/include/test_util.h | 8 +++
 tools/testing/selftests/kvm/lib/elf.c | 8 +--
 tools/testing/selftests/kvm/lib/io.c | 23 +++++++
 tools/testing/selftests/kvm/lib/kvm_util.c | 61 +++++++++--------
 tools/testing/selftests/kvm/lib/test_util.c | 8 +++
 tools/testing/selftests/kvm/lib/x86/sev.c | 1 +
 .../selftests/kvm/pre_fault_memory_test.c | 1 +
 .../selftests/kvm/set_memory_region_test.c | 50 ++++++++++++--
 .../kvm/x86/private_mem_conversions_test.c | 7 +-
 virt/kvm/guest_memfd.c | 66 +++++++++++++++++--
 virt/kvm/kvm_main.c | 8 +++
 30 files changed, 290 insertions(+), 94 deletions(-)

base-commit: a6ad54137af92535cfe32e19e5f3bc1bb7dbd383
--
2.51.0
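
The v7 changelog above mentions tracking direct map state in the lowest bit of folio->private. The sketch below illustrates that general pointer-tagging technique in plain userspace C. It is not the code from the series: struct folio_model, DIRECT_MAP_REMOVED_BIT and the helper names are all hypothetical, and it assumes the value stored in the field is at least 2-byte aligned so that bit 0 is otherwise unused.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct folio; only the field used here. */
struct folio_model {
	void *private;
};

/* Hypothetical flag name: bit 0 set means "removed from the direct map". */
#define DIRECT_MAP_REMOVED_BIT ((uintptr_t)1)

static inline bool folio_model_direct_map_removed(const struct folio_model *f)
{
	return ((uintptr_t)f->private & DIRECT_MAP_REMOVED_BIT) != 0;
}

static inline void folio_model_mark_removed(struct folio_model *f)
{
	f->private = (void *)((uintptr_t)f->private | DIRECT_MAP_REMOVED_BIT);
}

static inline void folio_model_mark_restored(struct folio_model *f)
{
	f->private = (void *)((uintptr_t)f->private & ~DIRECT_MAP_REMOVED_BIT);
}

/* Recover whatever else is stored in the field, with the tag bit masked off. */
static inline void *folio_model_payload(const struct folio_model *f)
{
	return (void *)((uintptr_t)f->private & ~DIRECT_MAP_REMOVED_BIT);
}
```

Keeping the state in a spare pointer bit means no extra per-folio storage is needed, at the cost of every reader having to mask the bit before using the field.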

6 days, 15 hours

[PATCH v3 0/5] mm, kvm: add guest_memfd support for uffd minor faults

by Mike Rapoport

From: "Mike Rapoport (Microsoft)" <rppt(a)kernel.org>

Hi,

These patches allow guest_memfd to notify userspace about minor page
faults using userfaultfd and let userspace resolve these page faults
using UFFDIO_CONTINUE.

To allow UFFDIO_CONTINUE outside of the core mm I added a
get_folio_noalloc() callback to vm_ops that allows an address space
backing a VMA to return a folio that exists in its page cache (patch 2).

In order for guest_memfd to notify userspace about page faults, there is
a new VM_FAULT_UFFD_MINOR that a ->fault() handler can return to inform
the page fault handler that it needs to call handle_userfault() to
complete the fault (patch 3).

Patch 4 plumbs these new goodies into guest_memfd.

This series is the minimal change I've been able to come up with to
allow integration of guest_memfd with uffd and while refactoring uffd
and making mfill_atomic() flow more linear would have been a nice
improvement, it's way out of the scope of enabling uffd with
guest_memfd.

v3 changes:
* rename ->get_folio() to ->get_folio_noalloc()
* fix build errors reported by kbuild
* pull handling of UFFD_MINOR out of hotpath in __do_fault()
* update guest_memfd changes so its ->fault() and ->get_folio_noalloc()
  follow the same semantics as shmem and hugetlb.
* s/MISSING/MINOR/g in changelogs
* added review tags

v2: https://lore.kernel.org/all/20251125183840.2368510-1-rppt@kernel.org
* rename ->get_shared_folio() to ->get_folio()
* hardwire VM_FAULT_UFFD_MINOR to 0 when CONFIG_USERFAULTFD=n

v1: https://patch.msgid.link/20251123102707.559422-1-rppt@kernel.org
* Introduce VM_FAULT_UFFD_MINOR to avoid exporting handle_userfault()
* Simplify vma_can_mfill_atomic()
* Rename get_pagecache_folio() to get_shared_folio() and use inode
  instead of vma as its argument

rfc: https://patch.msgid.link/20251117114631.2029447-1-rppt@kernel.org

Mike Rapoport (Microsoft) (4):
  userfaultfd: move vma_can_userfault out of line
  userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE
  mm: introduce VM_FAULT_UFFD_MINOR fault reason
  guest_memfd: add support for userfaultfd minor mode

Nikita Kalyazin (1):
  KVM: selftests: test userfaultfd minor for guest_memfd

 include/linux/mm.h | 9 ++
 include/linux/mm_types.h | 10 +-
 include/linux/userfaultfd_k.h | 36 +------
 mm/memory.c | 5 +-
 mm/shmem.c | 20 +++-
 mm/userfaultfd.c | 80 ++++++++++++---
 .../testing/selftests/kvm/guest_memfd_test.c | 97 +++++++++++++++++++
 virt/kvm/guest_memfd.c | 33 ++++++-
 8 files changed, 236 insertions(+), 54 deletions(-)

base-commit: ac3fd01e4c1efce8f2c054cdeb2ddd2fc0fb150d
--
2.51.0

1 week

[PATCH v23 00/28] riscv control-flow integrity for usermode

by Deepak Gupta via B4 Relay

v23: fixed some of the "CHECK:" reported on checkpatch --strict.
Accepted Joel's suggestion for kselftest's Makefile.
CONFIG_RISCV_USER_CFI is enabled when zicfiss, zicfilp and
fcf-protection are all present in toolchain

v22: fixing build error due to -march=zicfiss being picked in gcc-13
and above but not actually doing any codegen or recognizing instruction
for zicfiss. Change in v22 adds a dependence on the
`-fcf-protection=full` compiler flag to ensure that toolchain has
support and only then will CONFIG_RISCV_USER_CFI be visible in
menuconfig.

v21: fixed build errors.

Basics and overview
===================

Software with larger attack surfaces (e.g. network facing apps like
databases, browsers or apps relying on browser runtimes) suffers from
memory corruption issues which can be utilized by attackers to bend
control flow of the program to eventually gain control (by making their
payload executable). Attackers are able to perform such attacks by
leveraging call-sites which rely on indirect calls or return sites which
rely on obtaining return address from stack memory.

To mitigate such attacks, risc-v extension zicfilp enforces that all
indirect calls must land on a landing pad instruction `lpad` else cpu
will raise software check exception (a new cpu exception cause code on
riscv).
Similarly for return flow, risc-v extension zicfiss extends architecture
with

- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack
  and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if comparison above
  was a mismatch
- Protection mechanism using which shadow stack is not writeable
  via regular store instructions

More information and details can be found at extensions github repo [1].

Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in
Intel CET [3] and branch target identification (BTI) [4] on arm.
Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded
control stack (GCS) [6] which are very similar to risc-v's zicfiss
shadow stack.

x86 and arm64 support for user mode shadow stack is already in mainline.

Kernel awareness for user control flow integrity
================================================

This series picks up Samuel Holland's envcfg changes [2] as well. So if
those are being applied independently, they should be removed from this
series.

Enabling:

In order to maintain compatibility and not break anything in user mode,
kernel doesn't enable control flow integrity cpu extensions on binary by
default. Instead, it exposes a prctl interface to enable, disable and
lock the shadow stack or landing pad feature for a task. This allows
userspace (loader) to enumerate if all objects in its address space are
compiled with shadow stack and landing pad support and accordingly
enable the feature. Additionally if a subsequent `dlopen` happens on a
library, user mode can take a decision again to disable the feature (if
incoming library is not compiled with support) OR terminate the task (if
user mode policy is strict to have all objects in address space to be
compiled with control flow integrity cpu feature). prctl to enable
shadow stack results in allocating shadow stack from virtual memory and
activating for user address space. x86 and arm64 are also following same
direction due to similar reason(s).

clone/fork:

On clone and fork, cfi state for task is inherited by child. Shadow
stack is part of virtual memory and is a writeable memory from kernel
perspective (writeable via a restricted set of instructions aka shadow
stack instructions). Thus kernel changes ensure that this memory is
converted into read-only when fork/clone happens and COWed when fault is
taken due to sspush, sspopchk or ssamoswap. In case `CLONE_VM` is
specified and shadow stack is to be enabled, kernel will automatically
allocate a shadow stack for that clone call.
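
As an aid to reading the zicfiss description in the overview, the following is a small software model of the push / pop-and-check behaviour attributed there to `sspush` and `sspopchk`. It is only an illustration in plain C, not kernel or hardware code: struct shstk_model, SHSTK_DEPTH and the function names are hypothetical, and a `false` return from shstk_popchk() stands in for the software check exception.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SHSTK_DEPTH 64	/* hypothetical capacity of the model */

struct shstk_model {
	uintptr_t slots[SHSTK_DEPTH];
	size_t top;	/* number of live entries */
};

/* Models `sspush`: record the return address on the shadow stack. */
static inline bool shstk_push(struct shstk_model *s, uintptr_t ra)
{
	if (s->top == SHSTK_DEPTH)
		return false;	/* model limit, not an architectural rule */
	s->slots[s->top++] = ra;
	return true;
}

/*
 * Models `sspopchk`: pop the saved return address and compare it with
 * the one read back from the regular stack. A false return corresponds
 * to the software check exception. An empty shadow stack is simply
 * treated as a failure in this model.
 */
static inline bool shstk_popchk(struct shstk_model *s, uintptr_t ra_from_stack)
{
	if (s->top == 0)
		return false;
	return s->slots[--s->top] == ra_from_stack;
}
```

A return address overwritten on the regular stack no longer matches the copy on the shadow stack, so the check fails before control is transferred.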

map_shadow_stack:

x86 introduced `map_shadow_stack` system call to allow user space to
explicitly map shadow stack memory in its address space. It is useful to
allocate shadow stacks for different contexts managed by a single thread
(green threads or contexts). risc-v implements this system call as well.

signal management:

If shadow stack is enabled for a task, kernel performs an asynchronous
control flow diversion to deliver the signal and eventually expects
userspace to issue sigreturn so that original execution can be resumed.
Even though resume context is prepared by kernel, it is in user space
memory and is subject to memory corruption, and corruption bugs can be
utilized by an attacker in this race window to perform arbitrary
sigreturn and eventually bypass cfi mechanism. Another issue is how to
ensure that cfi related state on sigcontext area is not trampled by
legacy apps or apps compiled with old kernel headers.

In order to mitigate control-flow hijacking, kernel prepares a token and
places it on shadow stack before signal delivery and places address of
token in sigcontext structure. During sigreturn, kernel obtains address
of token from sigcontext structure, reads token from shadow stack and
validates it and only then allows sigreturn to succeed. Compatibility
issue is solved by adopting dynamic sigcontext management introduced for
vector extension. This series re-factors the code a little bit to make
future sigcontext management easy (as proposed by Andy Chiu from SiFive)

config and compilation:

Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting
this config option picks the kernel support for user control flow
integrity. This option is presented only if toolchain has shadow stack
and landing pad support, and is on purpose guarded by toolchain support.
Reason being that eventually vDSO also needs to be compiled in with
shadow stack and landing pad support.
vDSO compile patches are not included as of now because landing pad
labeling scheme is yet to settle for usermode runtime.

To get more information on kernel interactions with respect to
zicfilp and zicfiss, patch series adds documentation for
`zicfilp` and `zicfiss` in following:
Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst

How to test this series
=======================

Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)

Qemu
----
Get the latest qemu
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)

Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic

Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain
supports it.
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig $ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) Running ------- Modify your qemu command to have: -bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin -cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true References ========== [1] - https://github.com/riscv/riscv-cfi [2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c… [3] - https://lwn.net/Articles/889475/ [4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific… [5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i… [6] - https://lwn.net/Articles/940403/ To: Thomas Gleixner <tglx(a)linutronix.de> To: Ingo Molnar <mingo(a)redhat.com> To: Borislav Petkov <bp(a)alien8.de> To: Dave Hansen <dave.hansen(a)linux.intel.com> To: x86(a)kernel.org To: H. Peter Anvin <hpa(a)zytor.com> To: Andrew Morton <akpm(a)linux-foundation.org> To: Liam R. 
Howlett <Liam.Howlett(a)oracle.com> To: Vlastimil Babka <vbabka(a)suse.cz> To: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> To: Paul Walmsley <paul.walmsley(a)sifive.com> To: Palmer Dabbelt <palmer(a)dabbelt.com> To: Albert Ou <aou(a)eecs.berkeley.edu> To: Conor Dooley <conor(a)kernel.org> To: Rob Herring <robh(a)kernel.org> To: Krzysztof Kozlowski <krzk+dt(a)kernel.org> To: Arnd Bergmann <arnd(a)arndb.de> To: Christian Brauner <brauner(a)kernel.org> To: Peter Zijlstra <peterz(a)infradead.org> To: Oleg Nesterov <oleg(a)redhat.com> To: Eric Biederman <ebiederm(a)xmission.com> To: Kees Cook <kees(a)kernel.org> To: Jonathan Corbet <corbet(a)lwn.net> To: Shuah Khan <shuah(a)kernel.org> To: Jann Horn <jannh(a)google.com> To: Conor Dooley <conor+dt(a)kernel.org> To: Miguel Ojeda <ojeda(a)kernel.org> To: Alex Gaynor <alex.gaynor(a)gmail.com> To: Boqun Feng <boqun.feng(a)gmail.com> To: Gary Guo <gary(a)garyguo.net> To: Björn Roy Baron <bjorn3_gh(a)protonmail.com> To: Benno Lossin <benno.lossin(a)proton.me> To: Andreas Hindborg <a.hindborg(a)kernel.org> To: Alice Ryhl <aliceryhl(a)google.com> To: Trevor Gross <tmgross(a)umich.edu> Cc: linux-kernel(a)vger.kernel.org Cc: linux-fsdevel(a)vger.kernel.org Cc: linux-mm(a)kvack.org Cc: linux-riscv(a)lists.infradead.org Cc: devicetree(a)vger.kernel.org Cc: linux-arch(a)vger.kernel.org Cc: linux-doc(a)vger.kernel.org Cc: linux-kselftest(a)vger.kernel.org Cc: alistair.francis(a)wdc.com Cc: richard.henderson(a)linaro.org Cc: jim.shu(a)sifive.com Cc: andybnac(a)gmail.com Cc: kito.cheng(a)sifive.com Cc: charlie(a)rivosinc.com Cc: atishp(a)rivosinc.com Cc: evan(a)rivosinc.com Cc: cleger(a)rivosinc.com Cc: alexghiti(a)rivosinc.com Cc: samitolvanen(a)google.com Cc: broonie(a)kernel.org Cc: rick.p.edgecombe(a)intel.com Cc: rust-for-linux(a)vger.kernel.org changelog --------- v23: - fixed some of the "CHECK:" reported on checkpatch --strict. - Accepted Joel's suggestion for kselftest's Makefile. 
- CONFIG_RISCV_USER_CFI is enabled when zicfiss, zicfilp and
  fcf-protection are all present in toolchain

v22:
- CONFIG_RISCV_USER_CFI was by default "n". With dual vdso support it is
  default "y" (if toolchain supports it). Fixing build error due to
  "-march=zicfiss" being picked in gcc-13 partially. gcc-13 only
  recognizes the flag but does not actually do any codegen or recognize
  instructions for zicfiss. Change in v22 adds a dependence on the
  `-fcf-protection=full` compiler flag to ensure that toolchain has
  support and only then will CONFIG_RISCV_USER_CFI be visible in
  menuconfig.
- picked up tags and some cosmetic changes in commit message for dual
  vdso patch.

v21:
- Fixing build errors due to changes in arch/riscv/include/asm/vdso.h
  Using #ifdef instead of IS_ENABLED in arch/riscv/include/asm/vdso.h
  vdso-cfi-offsets.h should be included only when CONFIG_RISCV_USER_CFI
  is selected.

v20:
- rebased on v6.18-rc1.
- Added two vDSO support. If `CONFIG_RISCV_USER_CFI` is selected
  two vDSOs are compiled (one for hardware prior to RVA23 and one
  for RVA23 onwards). Kernel exposes RVA23 vDSO if hardware/cpu
  implements zimop else exposes existing vDSO to userspace.
- default selection for `CONFIG_RISCV_USER_CFI` is "Yes".
- replaced "__ASSEMBLY__" with "__ASSEMBLER__"

v19:
- riscv_nousercfi was `int`. changed it to unsigned long.
  Thanks to Alex Ghiti for reporting it. It was a bug.
- ELP is cleared on trap entry only when CONFIG_64BIT.
- restore ssp back on return to usermode was being done
  before `riscv_v_context_nesting_end` on trap exit path.
  If kernel shadow stack were enabled this would result in
  kernel operating on user shadow stack and panic (as I found
  in my testing of kcfi patch series). So fixed that.

v18:
- rebased on 6.16-rc1
- uprobe handling clears ELP in sstatus image in pt_regs
- vdso was missing shadow stack elf note for object files.
  added that. Additional asm file for vdso needed the elf marker
  flag.
  toolchain should complain if `-fcf-protection=full` and
  marker is missing for object generated from asm file. Asked
  toolchain folks to fix this. Although no reason to gate the merge
  on that.
- Split up compile options for march and fcf-protection in vdso
  Makefile
- CONFIG_RISCV_USER_CFI option is moved under "Kernel features" menu
  Added `arch/riscv/configs/hardening.config` fragment which selects
  CONFIG_RISCV_USER_CFI

v17:
- fixed warnings due to empty macros in usercfi.h (reported by alexg)
- fixed prefixes in commit titles reported by alexg
- took below uprobe with fcfi v2 patch from Zong Li and squashed it with
  "riscv/traps: Introduce software check exception and uprobe handling"
  https://lore.kernel.org/all/20250604093403.10916-1-zong.li@sifive.com/

v16:
- If FWFT is not implemented or returns error for shadow stack
  activation, then no_usercfi is set to disable shadow stack. Although
  this should be picked up by extension validation and activation. Fixed
  this bug for zicfilp and zicfiss both. Thanks to Charlie Jenkins for
  reporting this.
- If toolchain doesn't support cfi, cfi kselftest shouldn't build.
  Suggested by Charlie Jenkins.
- Default for CONFIG_RISCV_USER_CFI is set to no. Charlie/Atish
  suggested to keep it off till we have more hardware availability with
  RVA23 profile and zimop/zcmop implemented. Else this will start
  breaking people's workflow
- Includes the fix if "!RV64 and !SBI" then definitions for FWFT in
  asm-offsets.c error.

v15:
- Toolchain has been updated to include `-fcf-protection` flag. This
  exists for x86 as well. Updated kernel patches to compile vDSO and
  selftest to compile with `fcf-protection=full` flag.
- selecting CONFIG_RISCV_USERCFI selects CONFIG_RISCV_SBI.
- Patch to enable shadow stack for kernel wasn't hidden behind
  CONFIG_RISCV_USERCFI and CONFIG_RISCV_SBI. fixed that.

v14:
- rebased on top of palmer/sbi-v3. Thus dropped clement's FWFT patches
  Updated RISCV_ISA_EXT_XXXX in hwcap and hwprobe constants.
- Took Radim's suggestions on bitfields.
- Placed cfi_state at the end of thread_info block so that current
  situation is not disturbed with respect to member fields of
  thread_info in single cacheline.

v13:
- cpu_supports_shadow_stack/cpu_supports_indirect_br_lp_instr uses
  riscv_has_extension_unlikely()
- uses nops(count) to create nop slide
- RISCV_ACQUIRE_BARRIER is not needed in `amo_user_shstk`. Removed it
- changed ternaries to simply use implicit casting to convert to bool.
- kernel command line allows to disable zicfilp and zicfiss
  independently. updated kernel-parameters.txt.
- ptrace user abi for cfi uses bitmasks instead of bitfields. Added
  ptrace kselftest.
- cosmetic and grammatical changes to documentation.

v12:
- It seems like I had accidentally squashed arch agnostic indirect
  branch tracking prctl and riscv implementation of those prctls. Split
  them again.
- set_shstk_status/set_indir_lp_status perform CSR writes only when CPU
  support is available. As suggested by Zong Li.
- Some minor clean up in kselftests as suggested by Zong Li.

v11:
- patch "arch/riscv: compile vdso with landing pad" was unconditionally
  selecting `_zicfilp` for vDSO compile. fixed that. Changed `lpad 1` to
  `lpad 0`.

v10:
- dropped "mm: helper `is_shadow_stack_vma` to check shadow stack vma".
  This patch is not that interesting to this patch series for risc-v.
  There are instances in arch directories where VM_SHADOW_STACK flag is
  anyways used. Dropping this patch to expedite merging in riscv tree.
- Took suggestions from `Clement` on "riscv: zicfiss / zicfilp
  enumeration" to validate presence of cfi based on config.
- Added a patch for vDSO to have `lpad 0`. I had omitted this earlier to
  make sure we add single vdso object with cfi enabled. But a vdso
  object with scheme of zero labeled landing pad is least common
  denominator and should work with all objects of zero labeled as well
  as function-signature labeled objects.

v9:
- rebased on master (39a803b754d5 fix braino in "9p: fix ->rename_sem
  exclusion")
- dropped "mm: Introduce ARCH_HAS_USER_SHADOW_STACK" (master has it from
  arm64/gcs)
- dropped "prctl: arch-agnostic prctl for shadow stack" (master has it
  from arm64/gcs)

v8:
- rebased on palmer/for-next
- dropped samuel holland's `envcfg` context switch patches.
  they are in palmer/for-next

v7:
- Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv"
  Instead using `deactivate_mm` flow to clean up.
  see here for more context
  https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.…
- Changed the header include in `kselftest`. Hopefully this fixes
  compile issue faced by Zong Li at SiFive.
- Cleaned up an orphaned change to `mm/mmap.c` in below patch
  "riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE"
- Lock interfaces for shadow stack and indirect branch tracking expect
  arg == 0. Any future evolution of this interface should accordingly
  define how arg should be setup.
- `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use
  helper `is_shadow_stack_vma`.
- Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv…

v6:
- Picked up Samuel Holland's changes as is with `envcfg` placed in
  `thread` instead of `thread_info`
- fixed unaligned newline escapes in kselftest
- cleaned up messages in kselftest and included test output in commit
  message
- fixed a bug in clone path reported by Zong Li
- fixed a build issue if CONFIG_RISCV_ISA_V is not selected
  (this was introduced due to re-factoring signal context
  management code)

v5:
- rebased on v6.12-rc1
- Fixed schema related issues in device tree file
- Fixed some of the documentation related issues in zicfilp/ss.rst
  (style issues and added index)
- added `SHADOW_STACK_SET_MARKER` so that implementation can define base
  of shadow stack.
- Fixed warnings on definitions added in usercfi.h when
  CONFIG_RISCV_USER_CFI is not selected.
- Adopted context header based signal handling as proposed by Andy Chiu - Added support for enabling kernel mode access to shadow stack using FWFT (https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…) - Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv… (Note: I had an issue in my workflow due to which version number wasn't picked up correctly while sending out patches) v4: - rebased on 6.11-rc6 - envcfg: Converged with Samuel Holland's patches for envcfg management on per- thread basis. - vma_is_shadow_stack is renamed to is_vma_shadow_stack - picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch - signal context: using extended context management to maintain compatibility. - fixed `-Wmissing-prototypes` compiler warnings for prctl functions - Documentation fixes and amending typos. - Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/ v3: - envcfg logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been picked on per task basis, even though CPU didn't implement it. Fixed in this series. - dt-bindings As suggested, split into separate commit. fixed the messaging that spec is in public review - arch_is_shadow_stack change arch_is_shadow_stack changed to vma_is_shadow_stack - hwprobe zicfiss / zicfilp if present will get enumerated in hwprobe - selftests As suggested, added object and binary filenames to .gitignore Selftest binary anyways need to be compiled with cfi enabled compiler which will make sure that landing pad and shadow stack are enabled. Thus removed separate enable/disable tests. Cleaned up tests a bit. - Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/ v2: - Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow integrity for user mode programs can be compiled in the kernel. 
- Enabling of control flow integrity for user programs is left to user runtime - This patch series introduces arch agnostic `prctls` to enable shadow stack and indirect branch tracking. And implements them on riscv. --- Changes in v23: - Link to v22: https://lore.kernel.org/r/20251023-v5_user_cfi_series-v22-0-1935270f7636@ri… Changes in v22: - Link to v21: https://lore.kernel.org/r/20251015-v5_user_cfi_series-v21-0-6a07856e90e7@ri… Changes in v21: - Link to v20: https://lore.kernel.org/r/20251013-v5_user_cfi_series-v20-0-b9de4be9912e@ri… Changes in v20: - Link to v19: https://lore.kernel.org/r/20250731-v5_user_cfi_series-v19-0-09b468d7beab@ri… Changes in v19: - Link to v18: https://lore.kernel.org/r/20250711-v5_user_cfi_series-v18-0-a8ee62f9f38e@ri… Changes in v18: - Link to v17: https://lore.kernel.org/r/20250604-v5_user_cfi_series-v17-0-4565c2cf869f@ri… Changes in v17: - Link to v16: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-0-64f61a35eee7@ri… Changes in v16: - Link to v15: https://lore.kernel.org/r/20250502-v5_user_cfi_series-v15-0-914966471885@ri… Changes in v15: - changelog posted just below cover letter - Link to v14: https://lore.kernel.org/r/20250429-v5_user_cfi_series-v14-0-5239410d012a@ri… Changes in v14: - changelog posted just below cover letter - Link to v13: https://lore.kernel.org/r/20250424-v5_user_cfi_series-v13-0-971437de586a@ri… Changes in v13: - changelog posted just below cover letter - Link to v12: https://lore.kernel.org/r/20250314-v5_user_cfi_series-v12-0-e51202b53138@ri… Changes in v12: - changelog posted just below cover letter - Link to v11: https://lore.kernel.org/r/20250310-v5_user_cfi_series-v11-0-86b36cbfb910@ri… Changes in v11: - changelog posted just below cover letter - Link to v10: https://lore.kernel.org/r/20250210-v5_user_cfi_series-v10-0-163dcfa31c60@ri… --- Andy Chiu (1): riscv: signal: abstract header saving for setup_sigcontext Deepak Gupta (26): mm: VM_SHADOW_STACK definition for riscv dt-bindings: riscv: 
zicfilp and zicfiss in dt-bindings (extensions.yaml) riscv: zicfiss / zicfilp enumeration riscv: zicfiss / zicfilp extension csr and bit definitions riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE riscv/mm: manufacture shadow stack pte riscv/mm: teach pte_mkwrite to manufacture shadow stack PTEs riscv/mm: write protect and shadow stack riscv/mm: Implement map_shadow_stack() syscall riscv/shstk: If needed allocate a new shadow stack on clone riscv: Implements arch agnostic shadow stack prctls prctl: arch-agnostic prctl for indirect branch tracking riscv: Implements arch agnostic indirect branch tracking prctls riscv/traps: Introduce software check exception and uprobe handling riscv/signal: save and restore of shadow stack for signal riscv/kernel: update __show_regs to print shadow stack register riscv/ptrace: riscv cfi status and state via ptrace and in core files riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe riscv: kernel command line option to opt out of user cfi riscv: enable kernel access to shadow stack memory via FWFT sbi call arch/riscv: dual vdso creation logic and select vdso based on hw riscv: create a config for shadow stack and landing pad instr support riscv: Documentation for landing pad / indirect branch tracking riscv: Documentation for shadow stack on riscv kselftest/riscv: kselftest for user mode cfi Jim Shu (1): arch/riscv: compile vdso with landing pad and shadow stack note Documentation/admin-guide/kernel-parameters.txt | 8 + Documentation/arch/riscv/index.rst | 2 + Documentation/arch/riscv/zicfilp.rst | 115 +++++ Documentation/arch/riscv/zicfiss.rst | 179 +++++++ .../devicetree/bindings/riscv/extensions.yaml | 14 + arch/riscv/Kconfig | 22 + arch/riscv/Makefile | 8 +- arch/riscv/configs/hardening.config | 4 + arch/riscv/include/asm/asm-prototypes.h | 1 + arch/riscv/include/asm/assembler.h | 44 ++ arch/riscv/include/asm/cpufeature.h | 12 + 
arch/riscv/include/asm/csr.h | 16 + arch/riscv/include/asm/entry-common.h | 2 + arch/riscv/include/asm/hwcap.h | 2 + arch/riscv/include/asm/mman.h | 26 + arch/riscv/include/asm/mmu_context.h | 7 + arch/riscv/include/asm/pgtable.h | 30 +- arch/riscv/include/asm/processor.h | 1 + arch/riscv/include/asm/thread_info.h | 3 + arch/riscv/include/asm/usercfi.h | 95 ++++ arch/riscv/include/asm/vdso.h | 13 +- arch/riscv/include/asm/vector.h | 3 + arch/riscv/include/uapi/asm/hwprobe.h | 2 + arch/riscv/include/uapi/asm/ptrace.h | 34 ++ arch/riscv/include/uapi/asm/sigcontext.h | 1 + arch/riscv/kernel/Makefile | 2 + arch/riscv/kernel/asm-offsets.c | 10 + arch/riscv/kernel/cpufeature.c | 27 + arch/riscv/kernel/entry.S | 38 ++ arch/riscv/kernel/head.S | 27 + arch/riscv/kernel/process.c | 27 +- arch/riscv/kernel/ptrace.c | 95 ++++ arch/riscv/kernel/signal.c | 148 +++++- arch/riscv/kernel/sys_hwprobe.c | 2 + arch/riscv/kernel/sys_riscv.c | 10 + arch/riscv/kernel/traps.c | 54 ++ arch/riscv/kernel/usercfi.c | 545 +++++++++++++++++++++ arch/riscv/kernel/vdso.c | 7 + arch/riscv/kernel/vdso/Makefile | 40 +- arch/riscv/kernel/vdso/flush_icache.S | 4 + arch/riscv/kernel/vdso/gen_vdso_offsets.sh | 4 +- arch/riscv/kernel/vdso/getcpu.S | 4 + arch/riscv/kernel/vdso/note.S | 3 + arch/riscv/kernel/vdso/rt_sigreturn.S | 4 + arch/riscv/kernel/vdso/sys_hwprobe.S | 4 + arch/riscv/kernel/vdso/vgetrandom-chacha.S | 5 +- arch/riscv/kernel/vdso_cfi/Makefile | 25 + arch/riscv/kernel/vdso_cfi/vdso-cfi.S | 11 + arch/riscv/mm/init.c | 2 +- arch/riscv/mm/pgtable.c | 16 + include/linux/cpu.h | 4 + include/linux/mm.h | 7 + include/uapi/linux/elf.h | 2 + include/uapi/linux/prctl.h | 27 + kernel/sys.c | 30 ++ tools/testing/selftests/riscv/Makefile | 2 +- tools/testing/selftests/riscv/cfi/.gitignore | 2 + tools/testing/selftests/riscv/cfi/Makefile | 23 + tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 82 ++++ tools/testing/selftests/riscv/cfi/cfitests.c | 173 +++++++ 
tools/testing/selftests/riscv/cfi/shadowstack.c | 385 +++++++++++++++ tools/testing/selftests/riscv/cfi/shadowstack.h | 27 + 62 files changed, 2481 insertions(+), 41 deletions(-) --- base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787 change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2 -- - debug

1 week

7
53
0 0

[PATCH v3] selftests: cgroup: make test_memcg_sock robust against delayed sock stats

by Guopeng Zhang

test_memcg_sock() currently requires that memory.stat's "sock " counter is exactly zero immediately after the TCP server exits. On a busy system this assumption is too strict: - Socket memory may be freed with a small delay (e.g. RCU callbacks). - memcg statistics are updated asynchronously via the rstat flushing worker, so the "sock " value in memory.stat can stay non-zero for a short period of time even after all socket memory has been uncharged. As a result, test_memcg_sock() can intermittently fail even though socket memory accounting is working correctly. Make the test more robust by polling memory.stat for the "sock " counter and allowing it some time to drop to zero instead of checking it only once. The timeout is set to 3 seconds to cover the periodic rstat flush interval (FLUSH_TIME = 2*HZ by default) plus some scheduling slack. If the counter does not become zero within the timeout, the test still fails as before. On my test system, running test_memcontrol 50 times produced: - Before this patch: 6/50 runs passed. - After this patch: 50/50 runs passed. Suggested-by: Lance Yang <lance.yang(a)linux.dev> Reviewed-by: Lance Yang <lance.yang(a)linux.dev> Signed-off-by: Guopeng Zhang <zhangguopeng(a)kylinos.cn> --- v3: - Move MEMCG_SOCKSTAT_WAIT_* defines after the #include block as suggested. v2: - Mention the periodic rstat flush interval (FLUSH_TIME = 2*HZ) in the comment and clarify the rationale for the 3s timeout. - Replace the hard-coded retry count and wait interval with macros to avoid magic numbers and make the 3s timeout calculation explicit. 
--- .../selftests/cgroup/test_memcontrol.c | 30 ++++++++++++++++++- 1 file changed, 29 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c index 4e1647568c5b..8ff7286fc80b 100644 --- a/tools/testing/selftests/cgroup/test_memcontrol.c +++ b/tools/testing/selftests/cgroup/test_memcontrol.c @@ -21,6 +21,9 @@ #include "kselftest.h" #include "cgroup_util.h" +#define MEMCG_SOCKSTAT_WAIT_RETRIES 30 /* 3s total */ +#define MEMCG_SOCKSTAT_WAIT_INTERVAL_US (100 * 1000) /* 100 ms */ + static bool has_localevents; static bool has_recursiveprot; @@ -1384,6 +1387,8 @@ static int test_memcg_sock(const char *root) int bind_retries = 5, ret = KSFT_FAIL, pid, err; unsigned short port; char *memcg; + long sock_post = -1; + int i; memcg = cg_name(root, "memcg_test"); if (!memcg) @@ -1432,7 +1437,30 @@ static int test_memcg_sock(const char *root) if (cg_read_long(memcg, "memory.current") < 0) goto cleanup; - if (cg_read_key_long(memcg, "memory.stat", "sock ")) + /* + * memory.stat is updated asynchronously via the memcg rstat + * flushing worker, which runs periodically (every 2 seconds, + * see FLUSH_TIME). On a busy system, the "sock " counter may + * stay non-zero for a short period of time after the TCP + * connection is closed and all socket memory has been + * uncharged. + * + * Poll memory.stat for up to 3 seconds (~FLUSH_TIME plus some + * scheduling slack) and require that the "sock " counter + * eventually drops to zero. + */ + for (i = 0; i < MEMCG_SOCKSTAT_WAIT_RETRIES; i++) { + sock_post = cg_read_key_long(memcg, "memory.stat", "sock "); + if (sock_post < 0) + goto cleanup; + + if (!sock_post) + break; + + usleep(MEMCG_SOCKSTAT_WAIT_INTERVAL_US); + } + + if (sock_post) goto cleanup; ret = KSFT_PASS; -- 2.25.1

1 week

2
4
0 0

[PATCH v7 0/3] statmount: accept fd as a parameter

by Bhavik Sachdev

We would like to add support for checkpointing/restoring file descriptors open on these "unmounted" mounts to CRIU (Checkpoint/Restore in Userspace) [1]. Currently, we have no way to get mount info for these "unmounted" mounts since they do not appear in /proc/<pid>/mountinfo and statmount does not work on them, since they do not belong to any mount namespace. This patch helps us by providing a way to get mountinfo for these "unmounted" mounts by using a fd on the mount. Changes from v6 [2] to v7: * Add kselftests for STATMOUNT_BY_FD flag. * Instead of renaming mnt_id_req.mnt_ns_fd to mnt_id_req.fd introduce a union so struct mnt_id_req looks like this: struct mnt_id_req { __u32 size; union { __u32 mnt_ns_fd; __u32 mnt_fd; }; __u64 mnt_id; __u64 param; __u64 mnt_ns_id; }; * In case of STATMOUNT_BY_FD grab mnt_ns inside of do_statmount(), since we get mnt_ns from mnt, which should happen under namespace lock. * Remove the modifications made to grab_requested_mnt_ns; those were never needed. Changes from v5 [3] to v6: * Instead of returning "[unmounted]" as the mount point for "unmounted" mounts, we unset the STATMOUNT_MNT_POINT flag in statmount.mask. * Instead of returning 0 as the mnt_ns_id for "unmounted" mounts, we unset the STATMOUNT_MNT_NS_ID flag in statmount.mask. * Added comment in `do_statmount` clarifying that the caller sets s->mnt in case of STATMOUNT_BY_FD. * In `do_statmount` move the mnt_ns_id and mnt_ns_empty() check just before lookup_mnt_in_ns(). * We took another look at the capability checks for getting information for "unmounted" mounts using an fd and decided to remove them for the following reasons: - All fs related information is available via fstatfs() without any capability check. - Mount information is also available via /proc/pid/mountinfo (without any capability check). - Given that we have access to a fd on the mount which tells us that we had access to the mount at some point (or someone that had access gave us the fd).
So, we should be able to access mount info. Changes from v4 [4] to v5: * Check only for s->root.mnt to be NULL instead of checking for both s->root.mnt and s->root.dentry (I did not find a case where only one of them would be NULL). * Only allow system root (CAP_SYS_ADMIN in init_user_ns) to call statmount() on fd's on "unmounted" mounts. We (mostly Pavel) spent some time thinking about how our previous approach (of checking the opener's file credentials) caused problems. Please take a look at the linked pictures; they describe everything more clearly. Case 1: A fd is on a normal mount (Link to Picture: [5]) Consider a situation where we have two processes P1 and P2 and a file F1. F1 is opened on mount ns M1 by P1. P1 is nested inside user namespaces U1 and U2. P2 is also in U1. P2 is also in a pid namespace and mount namespace separate from M1. P1 sends F1 to P2 (using a unix socket). But P2 is unable to call statmount() on F1 because it is in a separate pid and mount namespace. This is good and expected. Case 2: A fd is on a "unmounted" mount (Link to Picture: [6]) Consider a similar situation as Case 1. But now F1 is on a mount that has been "unmounted". Now, since we used the opener's credentials to check for permissions, P2 ends up having the ability to call statmount() and get mount info for this "unmounted" mount. Hence, it is better to restrict the ability to call statmount() on fds on "unmounted" mounts to system root only (there could also be other cases than the one described above). Changes from v3 [7] to v4: * Change the string returned when there is no mountpoint to be "[unmounted]" instead of "[detached]". * Remove the new DEFINE_FREE put_file and use the one already present in include/linux/file.h (fput) [8]. * Inside listmount consistently pass 0 in flags to copy_mnt_id_req and prepare_klistmount()->grab_requested_mnt_ns() and remove flags from the prepare_klistmount prototype. * If STATMOUNT_BY_FD is set, check for mnt_ns_id == 0 && mnt_id == 0.
Changes from v2 [9] to v3: * Rename STATMOUNT_FD flag to STATMOUNT_BY_FD. * Fixed UAF bug caused by the reference to fd_mount being bound by scope of CLASS(fd_raw, f)(kreq.fd) by using fget_raw instead. * Reused @spare parameter in mnt_id_req instead of adding new fields to the struct. Changes from v1 [10] to v2: v1 of this patchset, took a different approach and introduced a new umount_mnt_ns, to which "unmounted" mounts would be moved to (instead of their namespace being NULL) thus allowing them to be still available via statmount. Introducing umount_mnt_ns complicated namespace locking and modified performance sensitive code [11] and it was agreed upon that fd-based statmount would be better. This code is also available on github [12]. [1]: https://github.com/checkpoint-restore/criu/pull/2754 [2]: https://lore.kernel.org/all/20251118084836.2114503-1-b.sachdev1904@gmail.co… [3]: https://lore.kernel.org/criu/20251109053921.1320977-2-b.sachdev1904@gmail.c… [4]: https://lore.kernel.org/all/20251029052037.506273-2-b.sachdev1904@gmail.com/ [5]: https://github.com/bsach64/linux/blob/statmount-fd-v5/fd_on_normal_mount.png [6]: https://github.com/bsach64/linux/blob/statmount-fd-v5/file_on_unmounted_mou… [7]: https://lore.kernel.org/all/20251024181443.786363-1-b.sachdev1904@gmail.com/ [8]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/inc… [9]: https://lore.kernel.org/linux-fsdevel/20251011124753.1820802-1-b.sachdev190… [10]: https://lore.kernel.org/linux-fsdevel/20251002125422.203598-1-b.sachdev1904… [11]: https://lore.kernel.org/linux-fsdevel/7e4d9eb5-6dde-4c59-8ee3-358233f082d0@… [12]: https://github.com/bsach64/linux/tree/statmount-fd-v7 Bhavik Sachdev (3): statmount: permission check should return EPERM statmount: accept fd as a parameter selftests: statmount: tests for STATMOUNT_BY_FD fs/namespace.c | 102 ++++--- include/uapi/linux/mount.h | 10 +- .../filesystems/statmount/statmount.h | 15 +- .../filesystems/statmount/statmount_test.c | 261 
+++++++++++++++++- .../filesystems/statmount/statmount_test_ns.c | 101 ++++++- 5 files changed, 430 insertions(+), 59 deletions(-) -- 2.52.0

1 week

4
6
0 0

[PATCH 00/21] vfio/pci: Base support to preserve a VFIO device file across Live Update

by David Matlack

This series adds the base support to preserve a VFIO device file across a Live Update. "Base support" means that this allows userspace to safely preserve a VFIO device file with LIVEUPDATE_SESSION_PRESERVE_FD and retrieve a preserved VFIO device file with LIVEUPDATE_SESSION_RETRIEVE_FD, but the device itself is not preserved in a fully running state across Live Update. This series unblocks 2 parallel but related streams of work: - iommufd preservation across Live Update. This work spans iommufd, the IOMMU subsystem, and IOMMU drivers [1] - Preservation of VFIO device state across Live Update (config space, BAR addresses, power state, SR-IOV state, etc.). This work spans both VFIO and the core PCI subsystem. While we need all of the above to fully preserve a VFIO device across a Live Update without disrupting the workload on the device, this series aims to be functional and safe enough to merge as the first incremental step toward that goal. Areas for Discussion -------------------- BDF Stability across Live Update The PCI support for tracking preserved devices across a Live Update to prevent auto-probing relies on PCI segment numbers and BDFs remaining stable. For now I have disallowed VFs, as the BDFs assigned to VFs can vary depending on how the kernel chooses to allocate bus numbers. For non-VFs I am wondering if there is anything more needed to ensure BDF stability across Live Update. While we would like to support many different systems and configurations in due time (including preserving VFs), I'd like to keep this first series constrained to simple use-cases. FLB Locking I don't see a way to properly synchronize pci_flb_finish() with pci_liveupdate_incoming_is_preserved() since the incoming FLB mutex is dropped by liveupdate_flb_get_incoming() when it returns the pointer to the object, and taking pci_flb_incoming_lock in pci_flb_finish() could result in a deadlock due to reversing the lock ordering.
FLB Retrieving The first patch of this series includes a fix to prevent an FLB from being retrieved again after it is finished. I am wondering if this is the right approach or if subsystems are expected to stop calling liveupdate_flb_get_incoming() after an FLB is finished. Testing ------- The patches at the end of this series provide comprehensive selftests for the new code added by this series. The selftests have been validated in both a VM environment using a virtio-net PCIe device, and in a baremetal environment on an Intel EMR server with an Intel DSA device. Here is an example of how to run the new selftests: vfio_pci_liveupdate_uapi_test: $ tools/testing/selftests/vfio/scripts/setup.sh 0000:00:04.0 $ tools/testing/selftests/vfio/vfio_pci_liveupdate_uapi_test 0000:00:04.0 $ tools/testing/selftests/vfio/scripts/cleanup.sh vfio_pci_liveupdate_kexec_test: $ tools/testing/selftests/vfio/scripts/setup.sh 0000:00:04.0 $ tools/testing/selftests/vfio/vfio_pci_liveupdate_kexec_test --stage 1 0000:00:04.0 $ kexec [...] # NOTE: distro-dependent $ tools/testing/selftests/vfio/scripts/setup.sh 0000:00:04.0 $ tools/testing/selftests/vfio/vfio_pci_liveupdate_kexec_test --stage 2 0000:00:04.0 $ tools/testing/selftests/vfio/scripts/cleanup.sh Dependencies ------------ This series was constructed on top of several in-flight series and on top of mm-nonmm-unstable [2].
+-- This series | +-- [PATCH v2 00/18] vfio: selftests: Support for multi-device tests | https://lore.kernel.org/kvm/20251112192232.442761-1-dmatlack@google.com/ | +-- [PATCH v3 0/4] vfio: selftests: update DMA mapping tests to use queried IOVA ranges | https://lore.kernel.org/kvm/20251111-iova-ranges-v3-0-7960244642c5@fb.com/ | +-- [PATCH v8 0/2] Live Update: File-Lifecycle-Bound (FLB) State | https://lore.kernel.org/linux-mm/20251125225006.3722394-1-pasha.tatashin@so… | +-- [PATCH v8 00/18] Live Update Orchestrator | https://lore.kernel.org/linux-mm/20251125165850.3389713-1-pasha.tatashin@so… | To simplify checking out the code, this series can be found on GitHub: https://github.com/dmatlack/linux/tree/liveupdate/vfio/cdev/v1 Changelog --------- v1: - Rebase series on top of LUOv8 and VFIO selftests improvements - Drop commits to preserve config space fields across Live Update. These changes require changes to the PCI layer. For example, preserving rbars could lead to an inconsistent device state until device BARs addresses are preserved across Live Update. - Drop commits to preserve Bus Master Enable on the device. There's no reason to preserve this until iommufd preservation is fully working. Furthermore, preserving Bus Master Enable could lead to memory corruption if the device is bound to the default identity-map domain after Live Update. - Drop commits to preserve saved PCI state. This work is not needed until we are ready to preserve the device's config space, and requires more thought to make the PCI state data layout ABI-friendly. - Add support to skip auto-probing devices that are preserved by VFIO to avoid them getting bound to a different driver by the next kernel. - Restrict device preservation further (no VFs, no intel-graphics). - Various refactoring and small edits to improve readability and eliminate code duplication.
rfc: https://lore.kernel.org/kvm/20251018000713.677779-1-vipinsh@google.com/ Cc: Saeed Mahameed <saeedm(a)nvidia.com> Cc: Adithya Jayachandran <ajayachandra(a)nvidia.com> Cc: Jason Gunthorpe <jgg(a)nvidia.com> Cc: Parav Pandit <parav(a)nvidia.com> Cc: Leon Romanovsky <leonro(a)nvidia.com> Cc: William Tu <witu(a)nvidia.com> Cc: Jacob Pan <jacob.pan(a)linux.microsoft.com> Cc: Lukas Wunner <lukas(a)wunner.de> Cc: Pasha Tatashin <pasha.tatashin(a)soleen.com> Cc: Mike Rapoport <rppt(a)kernel.org> Cc: Pratyush Yadav <pratyush(a)kernel.org> Cc: Samiullah Khawaja <skhawaja(a)google.com> Cc: Chris Li <chrisl(a)kernel.org> Cc: Josh Hilke <jrhilke(a)google.com> Cc: David Rientjes <rientjes(a)google.com> [1] https://lore.kernel.org/linux-iommu/20250928190624.3735830-1-skhawaja@googl… [2] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/log/?h=mm-nonmm… David Matlack (12): liveupdate: luo_flb: Prevent retrieve() after finish() PCI: Add API to track PCI devices preserved across Live Update PCI: Require driver_override for incoming Live Update preserved devices vfio/pci: Notify PCI subsystem about devices preserved across Live Update vfio: Enforce preserved devices are retrieved via LIVEUPDATE_SESSION_RETRIEVE_FD vfio/pci: Store Live Update state in struct vfio_pci_core_device vfio: selftests: Add Makefile support for TEST_GEN_PROGS_EXTENDED vfio: selftests: Add vfio_pci_liveupdate_uapi_test vfio: selftests: Expose iommu_modes to tests vfio: selftests: Expose low-level helper routines for setting up struct vfio_pci_device vfio: selftests: Verify that opening VFIO device fails during Live Update vfio: selftests: Add continuous DMA to vfio_pci_liveupdate_kexec_test Vipin Sharma (9): vfio/pci: Register a file handler with Live Update Orchestrator vfio/pci: Preserve vfio-pci device files across Live Update vfio/pci: Retrieve preserved device files after Live Update vfio/pci: Skip reset of preserved device after Live Update selftests/liveupdate: Move luo_test_utils.* into a 
reusable library selftests/liveupdate: Add helpers to preserve/retrieve FDs vfio: selftests: Build liveupdate library in VFIO selftests vfio: selftests: Initialize vfio_pci_device using a VFIO cdev FD vfio: selftests: Add vfio_pci_liveupdate_kexec_test MAINTAINERS | 1 + drivers/pci/Makefile | 1 + drivers/pci/liveupdate.c | 248 ++++++++++++++++ drivers/pci/pci-driver.c | 12 +- drivers/vfio/device_cdev.c | 25 +- drivers/vfio/group.c | 9 + drivers/vfio/pci/Makefile | 1 + drivers/vfio/pci/vfio_pci.c | 11 +- drivers/vfio/pci/vfio_pci_core.c | 23 +- drivers/vfio/pci/vfio_pci_liveupdate.c | 278 ++++++++++++++++++ drivers/vfio/pci/vfio_pci_priv.h | 16 + drivers/vfio/vfio.h | 13 - drivers/vfio/vfio_main.c | 22 +- include/linux/kho/abi/pci.h | 53 ++++ include/linux/kho/abi/vfio_pci.h | 45 +++ include/linux/liveupdate.h | 3 + include/linux/pci.h | 38 +++ include/linux/vfio.h | 51 ++++ include/linux/vfio_pci_core.h | 7 + kernel/liveupdate/luo_flb.c | 4 + tools/testing/selftests/liveupdate/.gitignore | 1 + tools/testing/selftests/liveupdate/Makefile | 14 +- .../include/libliveupdate.h} | 11 +- .../selftests/liveupdate/lib/libliveupdate.mk | 20 ++ .../{luo_test_utils.c => lib/liveupdate.c} | 43 ++- .../selftests/liveupdate/luo_kexec_simple.c | 2 +- .../selftests/liveupdate/luo_multi_session.c | 2 +- tools/testing/selftests/vfio/Makefile | 23 +- .../vfio/lib/include/libvfio/iommu.h | 2 + .../lib/include/libvfio/vfio_pci_device.h | 8 + tools/testing/selftests/vfio/lib/iommu.c | 4 +- .../selftests/vfio/lib/vfio_pci_device.c | 60 +++- .../vfio/vfio_pci_liveupdate_kexec_test.c | 255 ++++++++++++++++ .../vfio/vfio_pci_liveupdate_uapi_test.c | 93 ++++++ 34 files changed, 1313 insertions(+), 86 deletions(-) create mode 100644 drivers/pci/liveupdate.c create mode 100644 drivers/vfio/pci/vfio_pci_liveupdate.c create mode 100644 include/linux/kho/abi/pci.h create mode 100644 include/linux/kho/abi/vfio_pci.h rename tools/testing/selftests/liveupdate/{luo_test_utils.h => 
lib/include/libliveupdate.h} (80%) create mode 100644 tools/testing/selftests/liveupdate/lib/libliveupdate.mk rename tools/testing/selftests/liveupdate/{luo_test_utils.c => lib/liveupdate.c} (89%) create mode 100644 tools/testing/selftests/vfio/vfio_pci_liveupdate_kexec_test.c create mode 100644 tools/testing/selftests/vfio/vfio_pci_liveupdate_uapi_test.c -- 2.52.0.487.g5c8c507ade-goog

1 week

9
57
0 0

[PATCH] selftests: gpio: correctly check the type of /dev/gpiochipX

by Guixin Liu

/dev/gpiochipX is a character device, not a directory, so we should use "test -c" instead of "test -d" to check for its existence; otherwise, the current check will not work correctly when /dev/gpiochipX is left over (e.g., as a stale device node). Signed-off-by: Guixin Liu <kanie(a)linux.alibaba.com> --- tools/testing/selftests/gpio/gpio-aggregator.sh | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/gpio/gpio-aggregator.sh b/tools/testing/selftests/gpio/gpio-aggregator.sh index 9b6f80ad9f8a..13d8e1607571 100755 --- a/tools/testing/selftests/gpio/gpio-aggregator.sh +++ b/tools/testing/selftests/gpio/gpio-aggregator.sh @@ -351,7 +351,7 @@ test "$(agg_get_chip_num_lines _sysfs.0)" = "1" || fail "number of lines is not test "$(agg_get_line_name _sysfs.0 0)" = "" || fail "line name is unset" echo "$(agg_configfs_dev_name _sysfs.0)" > "$SYSFS_AGG_DIR/delete_device" test -d $CONFIGFS_AGG_DIR/_sysfs.0 && fail "_sysfs.0 unexpectedly remains" -test -d /dev/${CHIPNAME} && fail "/dev/${CHIPNAME} unexpectedly remains" +test -c /dev/${CHIPNAME} && fail "/dev/${CHIPNAME} unexpectedly remains" echo "1.2.2. Complex creation/deletion" echo "chip0bank0_0 chip1_bank1 10-11" > "$SYSFS_AGG_DIR/new_device" @@ -365,7 +365,7 @@ test "$(agg_get_line_name _sysfs.0 1)" = "" || fail "line name is unset" test "$(agg_get_line_name _sysfs.0 2)" = "" || fail "line name is unset" echo "$(agg_configfs_dev_name _sysfs.0)" > "$SYSFS_AGG_DIR/delete_device" test -d $CONFIGFS_AGG_DIR/_sysfs.0 && fail "_sysfs.0 unexpectedly remains" -test -d /dev/${CHIPNAME} && fail "/dev/${CHIPNAME} unexpectedly remains" +test -c /dev/${CHIPNAME} && fail "/dev/${CHIPNAME} unexpectedly remains" echo "1.2.3. 
Asynchronous creation with deferred probe" sim_disable_chip chip0 @@ -382,7 +382,7 @@ test "$(agg_get_chip_num_lines _sysfs.0)" = "1" || fail "number of lines is not test "$(agg_get_line_name _sysfs.0 0)" = "" || fail "line name unexpectedly set" echo "$(agg_configfs_dev_name _sysfs.0)" > "$SYSFS_AGG_DIR/delete_device" test -d $CONFIGFS_AGG_DIR/_sysfs.0 && fail "_sysfs.0 unexpectedly remains" -test -d /dev/${CHIPNAME} && fail "/dev/${CHIPNAME} unexpectedly remains" +test -c /dev/${CHIPNAME} && fail "/dev/${CHIPNAME} unexpectedly remains" echo "1.2.4. Can't instantiate a chip with invalid configuration" echo "xyz 0" > "$SYSFS_AGG_DIR/new_device" -- 2.43.0

1 week, 1 day

1
1
0 0

[PATCH v2 00/13] tools/nolibc: always use 64-bit time-related types

by Thomas Weißschuh

nolibc currently uses 32-bit types for various APIs. These are problematic as their reduced value range can lead to truncated values. Intended for 6.19. Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net> --- Changes in v2: - Drop already applied ino_t and off_t patches. - Also handle 'struct timeval'. - Make the progression of the series a bit clearer. - Add compatibility assertions. - Link to v1: https://lore.kernel.org/r/20251029-nolibc-uapi-types-v1-0-e79de3b215d8@weis… --- Thomas Weißschuh (13): tools/nolibc/poll: use kernel types for system call invocations tools/nolibc/poll: drop __NR_poll fallback tools/nolibc/select: drop non-pselect based implementations tools/nolibc/time: drop invocation of gettimeofday system call tools/nolibc: prefer explicit 64-bit time-related system calls tools/nolibc/gettimeofday: avoid libgcc 64-bit divisions tools/nolibc/select: avoid libgcc 64-bit multiplications tools/nolibc: use custom structs timespec and timeval tools/nolibc: always use 64-bit time types selftests/nolibc: test compatibility of nolibc and kernel time types tools/nolibc: remove time conversions tools/nolibc: add __nolibc_static_assert() selftests/nolibc: add static assertions around time types handling tools/include/nolibc/arch-s390.h | 3 + tools/include/nolibc/compiler.h | 2 + tools/include/nolibc/poll.h | 14 ++-- tools/include/nolibc/std.h | 2 +- tools/include/nolibc/sys/select.h | 25 ++----- tools/include/nolibc/sys/time.h | 6 +- tools/include/nolibc/sys/timerfd.h | 32 +++------ tools/include/nolibc/time.h | 102 +++++++++------------------ tools/include/nolibc/types.h | 17 ++++- tools/testing/selftests/nolibc/nolibc-test.c | 27 +++++++ 10 files changed, 107 insertions(+), 123 deletions(-) --- base-commit: 586e8d5137dfcddfccca44c3b992b92d2be79347 change-id: 20251001-nolibc-uapi-types-1c072d10fcc7 Best regards, -- Thomas Weißschuh <linux(a)weissschuh.net>

1 week, 1 day

3
23
0 0

[PATCH] selftests/seccomp: fix pointer type mismatch in UPROBE test

by Nirbhay Sharma

Fix compilation error in UPROBE_setup caused by pointer type mismatch in ternary expression. The probed_uretprobe and probed_uprobe function pointers have different type attributes (__attribute__((nocf_check))), which causes the conditional operator to fail with: seccomp_bpf.c:5175:74: error: pointer type mismatch in conditional expression [-Wincompatible-pointer-types] Cast both function pointers to 'const void *' to match the expected parameter type of get_uprobe_offset(), resolving the type mismatch while preserving the function selection logic. Signed-off-by: Nirbhay Sharma <nirbhay.lkd(a)gmail.com> --- tools/testing/selftests/seccomp/seccomp_bpf.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c index 874f17763536..e13ffe18ef95 100644 --- a/tools/testing/selftests/seccomp/seccomp_bpf.c +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c @@ -5172,7 +5172,8 @@ FIXTURE_SETUP(UPROBE) ASSERT_GE(bit, 0); } - offset = get_uprobe_offset(variant->uretprobe ? probed_uretprobe : probed_uprobe); + offset = get_uprobe_offset(variant->uretprobe ? + (const void *)probed_uretprobe : (const void *)probed_uprobe); ASSERT_GE(offset, 0); if (variant->uretprobe) -- 2.48.1

1 week, 1 day

4
8
0 0
