October 2024 - Linux-kselftest-mirror

[PATCH 1/2] selftest/mm: Fix typo in virtual_address_range

by Chunyan Zhang

The function name should be *hint* address, so correct it. Signed-off-by: Chunyan Zhang <zhangchunyan(a)iscas.ac.cn> --- tools/testing/selftests/mm/virtual_address_range.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c index 4e4c1e311247..2a2b69e91950 100644 --- a/tools/testing/selftests/mm/virtual_address_range.c +++ b/tools/testing/selftests/mm/virtual_address_range.c @@ -64,7 +64,7 @@ #define NR_CHUNKS_HIGH NR_CHUNKS_384TB #endif -static char *hind_addr(void) +static char *hint_addr(void) { int bits = HIGH_ADDR_SHIFT + rand() % (63 - HIGH_ADDR_SHIFT); @@ -185,7 +185,7 @@ int main(int argc, char *argv[]) } for (i = 0; i < NR_CHUNKS_HIGH; i++) { - hint = hind_addr(); + hint = hint_addr(); hptr[i] = mmap(hint, MAP_CHUNK_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); -- 2.34.1

1 year

4
5
0 0

[PATCH 00/15] KVM: x86: Introduce new ioctl KVM_TRANSLATE2

by Nikolas Wipper

This series introduces a new ioctl KVM_TRANSLATE2, which expands on KVM_TRANSLATE. It is required to implement Hyper-V's HvTranslateVirtualAddress hyper-call as part of the ongoing effort to emulate HyperV's Virtual Secure Mode (VSM) within KVM and QEMU. The hyper- call requires several new KVM APIs, one of which is KVM_TRANSLATE2, which implements the core functionality of the hyper-call. The rest of the required functionality will be implemented in subsequent series. Other than translating guest virtual addresses, the ioctl allows the caller to control whether the access and dirty bits are set during the page walk. It also allows specifying an access mode instead of returning viable access modes, which enables setting the bits up to the level that caused a failure. Additionally, the ioctl provides more information about why the page walk failed, and which page table is responsible. This functionality is not available within KVM_TRANSLATE, and can't be added without breaking backwards compatiblity, thus a new ioctl is required. The ioctl was designed to facilitate as many other use cases as possible apart from VSM. The error codes were intentionally chosen to be broad enough to avoid exposing architecture specific details. Even though HvTranslateVirtualAddress only really needs one flag to set the accessed and dirty bits whenever possible, that was split into several flags so that future users can chose more gradually when these bits should be set. Furthermore, as much information as possible is provided to the caller. The patch series includes selftests for the ioctl, as well as fuzzy testing on random garbage guest page table entries. All previously passing KVM selftests and KVM unit tests still pass. Series overview: - 1: Document the new ioctl - 2-11: Update the page walker in preparation - 12-14: Implement the ioctl - 15: Implement testing This series, alongside the series by Nicolas Saenz Julienne [1] introducing the core building blocks for VSM and the accompanying QEMU implementation [2], is capable of booting Windows Server 2019. Both series are also available on GitHub [3]. [1] https://lore.kernel.org/linux-hyperv/20240609154945.55332-1-nsaenz@amazon.c… [2] https://github.com/vianpl/qemu/tree/vsm/next [3] https://github.com/vianpl/linux/tree/vsm/next Best, Nikolas Nikolas Wipper (15): KVM: Add API documentation for KVM_TRANSLATE2 KVM: x86/mmu: Abort page walk if permission checks fail KVM: x86/mmu: Introduce exception flag for unmapped GPAs KVM: x86/mmu: Store GPA in exception if applicable KVM: x86/mmu: Introduce flags parameter to page walker KVM: x86/mmu: Implement PWALK_SET_ACCESSED in page walker KVM: x86/mmu: Implement PWALK_SET_DIRTY in page walker KVM: x86/mmu: Implement PWALK_FORCE_SET_ACCESSED in page walker KVM: x86/mmu: Introduce status parameter to page walker KVM: x86/mmu: Implement PWALK_STATUS_READ_ONLY_PTE_GPA in page walker KVM: x86: Introduce generic gva to gpa translation function KVM: Introduce KVM_TRANSLATE2 KVM: Add KVM_TRANSLATE2 stub KVM: x86: Implement KVM_TRANSLATE2 KVM: selftests: Add test for KVM_TRANSLATE2 Documentation/virt/kvm/api.rst | 131 ++++++++ arch/x86/include/asm/kvm_host.h | 18 +- arch/x86/kvm/hyperv.c | 3 +- arch/x86/kvm/kvm_emulate.h | 8 + arch/x86/kvm/mmu.h | 10 +- arch/x86/kvm/mmu/mmu.c | 7 +- arch/x86/kvm/mmu/paging_tmpl.h | 80 +++-- arch/x86/kvm/x86.c | 123 ++++++- include/linux/kvm_host.h | 6 + include/uapi/linux/kvm.h | 33 ++ tools/testing/selftests/kvm/Makefile | 1 + .../selftests/kvm/x86_64/kvm_translate2.c | 310 ++++++++++++++++++ virt/kvm/kvm_main.c | 41 +++ 13 files changed, 724 insertions(+), 47 deletions(-) create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_translate2.c -- 2.40.1 Amazon Web Services Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B Sitz: Berlin Ust-ID: DE 365 538 597

1 year

2
18
0 0

[PATCH v2] selftests/vDSO: support DT_GNU_HASH

by Fangrui Song

glibc added support for DT_GNU_HASH in 2006 and DT_HASH has been obsoleted for more than one decade in many Linux distributions. Many vDSOs support DT_GNU_HASH. This patch adds selftests support. Signed-off-by: Fangrui Song <maskray(a)google.com> Tested-by: Xi Ruoyao <xry111(a)xry111.site> -- Changes from v1: * fix style of a multi-line comment. ignore false positive suggestions from checkpath.pl: `ELF(Word) *` --- tools/testing/selftests/vDSO/parse_vdso.c | 105 ++++++++++++++++------ 1 file changed, 79 insertions(+), 26 deletions(-) diff --git a/tools/testing/selftests/vDSO/parse_vdso.c b/tools/testing/selftests/vDSO/parse_vdso.c index 4ae417372e9e..dbc946dee4b1 100644 --- a/tools/testing/selftests/vDSO/parse_vdso.c +++ b/tools/testing/selftests/vDSO/parse_vdso.c @@ -47,6 +47,7 @@ static struct vdso_info /* Symbol table */ ELF(Sym) *symtab; const char *symstrings; + ELF(Word) *gnu_hash; ELF(Word) *bucket, *chain; ELF(Word) nbucket, nchain; @@ -75,6 +76,16 @@ static unsigned long elf_hash(const char *name) return h; } +static uint32_t gnu_hash(const char *name) +{ + const unsigned char *s = (void *)name; + uint32_t h = 5381; + + for (; *s; s++) + h += h * 32 + *s; + return h; +} + void vdso_init_from_sysinfo_ehdr(uintptr_t base) { size_t i; @@ -117,6 +128,7 @@ void vdso_init_from_sysinfo_ehdr(uintptr_t base) */ ELF(Word) *hash = 0; vdso_info.symstrings = 0; + vdso_info.gnu_hash = 0; vdso_info.symtab = 0; vdso_info.versym = 0; vdso_info.verdef = 0; @@ -137,6 +149,11 @@ void vdso_init_from_sysinfo_ehdr(uintptr_t base) ((uintptr_t)dyn[i].d_un.d_ptr + vdso_info.load_offset); break; + case DT_GNU_HASH: + vdso_info.gnu_hash = + (ELF(Word) *)((uintptr_t)dyn[i].d_un.d_ptr + + vdso_info.load_offset); + break; case DT_VERSYM: vdso_info.versym = (ELF(Versym) *) ((uintptr_t)dyn[i].d_un.d_ptr @@ -149,17 +166,26 @@ void vdso_init_from_sysinfo_ehdr(uintptr_t base) break; } } - if (!vdso_info.symstrings || !vdso_info.symtab || !hash) + if (!vdso_info.symstrings || !vdso_info.symtab || + (!hash && !vdso_info.gnu_hash)) return; /* Failed */ if (!vdso_info.verdef) vdso_info.versym = 0; /* Parse the hash table header. */ - vdso_info.nbucket = hash[0]; - vdso_info.nchain = hash[1]; - vdso_info.bucket = &hash[2]; - vdso_info.chain = &hash[vdso_info.nbucket + 2]; + if (vdso_info.gnu_hash) { + vdso_info.nbucket = vdso_info.gnu_hash[0]; + /* The bucket array is located after the header (4 uint32) and the bloom + * filter (size_t array of gnu_hash[2] elements). */ + vdso_info.bucket = vdso_info.gnu_hash + 4 + + sizeof(size_t) / 4 * vdso_info.gnu_hash[2]; + } else { + vdso_info.nbucket = hash[0]; + vdso_info.nchain = hash[1]; + vdso_info.bucket = &hash[2]; + vdso_info.chain = &hash[vdso_info.nbucket + 2]; + } /* That's all we need. */ vdso_info.valid = true; @@ -203,6 +229,26 @@ static bool vdso_match_version(ELF(Versym) ver, && !strcmp(name, vdso_info.symstrings + aux->vda_name); } +static bool check_sym(ELF(Sym) *sym, ELF(Word) i, const char *name, + const char *version, unsigned long ver_hash) +{ + /* Check for a defined global or weak function w/ right name. */ + if (ELF64_ST_TYPE(sym->st_info) != STT_FUNC) + return false; + if (ELF64_ST_BIND(sym->st_info) != STB_GLOBAL && + ELF64_ST_BIND(sym->st_info) != STB_WEAK) + return false; + if (strcmp(name, vdso_info.symstrings + sym->st_name)) + return false; + + /* Check symbol version. */ + if (vdso_info.versym && + !vdso_match_version(vdso_info.versym[i], version, ver_hash)) + return false; + + return true; +} + void *vdso_sym(const char *version, const char *name) { unsigned long ver_hash; @@ -210,29 +256,36 @@ void *vdso_sym(const char *version, const char *name) return 0; ver_hash = elf_hash(version); - ELF(Word) chain = vdso_info.bucket[elf_hash(name) % vdso_info.nbucket]; + ELF(Word) i; - for (; chain != STN_UNDEF; chain = vdso_info.chain[chain]) { - ELF(Sym) *sym = &vdso_info.symtab[chain]; + if (vdso_info.gnu_hash) { + uint32_t h1 = gnu_hash(name), h2, *hashval; - /* Check for a defined global or weak function w/ right name. */ - if (ELF64_ST_TYPE(sym->st_info) != STT_FUNC) - continue; - if (ELF64_ST_BIND(sym->st_info) != STB_GLOBAL && - ELF64_ST_BIND(sym->st_info) != STB_WEAK) - continue; - if (sym->st_shndx == SHN_UNDEF) - continue; - if (strcmp(name, vdso_info.symstrings + sym->st_name)) - continue; - - /* Check symbol version. */ - if (vdso_info.versym - && !vdso_match_version(vdso_info.versym[chain], - version, ver_hash)) - continue; - - return (void *)(vdso_info.load_offset + sym->st_value); + i = vdso_info.bucket[h1 % vdso_info.nbucket]; + if (i == 0) + return 0; + h1 |= 1; + hashval = vdso_info.bucket + vdso_info.nbucket + + (i - vdso_info.gnu_hash[1]); + for (;; i++) { + ELF(Sym) *sym = &vdso_info.symtab[i]; + h2 = *hashval++; + if (h1 == (h2 | 1) && + check_sym(sym, i, name, version, ver_hash)) + return (void *)(vdso_info.load_offset + + sym->st_value); + if (h2 & 1) + break; + } + } else { + i = vdso_info.bucket[elf_hash(name) % vdso_info.nbucket]; + for (; i; i = vdso_info.chain[i]) { + ELF(Sym) *sym = &vdso_info.symtab[i]; + if (sym->st_shndx != SHN_UNDEF && + check_sym(sym, i, name, version, ver_hash)) + return (void *)(vdso_info.load_offset + + sym->st_value); + } } return 0; -- 2.46.0.662.g92d0881bb0-goog

1 year

4
6
0 0

[PATCH v7 00/32] riscv control-flow integrity for usermode

by Deepak Gupta

Basics and overview =================== Software with larger attack surfaces (e.g. network facing apps like databases, browsers or apps relying on browser runtimes) suffer from memory corruption issues which can be utilized by attackers to bend control flow of the program to eventually gain control (by making their payload executable). Attackers are able to perform such attacks by leveraging call-sites which rely on indirect calls or return sites which rely on obtaining return address from stack memory. To mitigate such attacks, risc-v extension zicfilp enforces that all indirect calls must land on a landing pad instruction `lpad` else cpu will raise software check exception (a new cpu exception cause code on riscv). Similarly for return flow, risc-v extension zicfiss extends architecture with - `sspush` instruction to push return address on a shadow stack - `sspopchk` instruction to pop return address from shadow stack and compare with input operand (i.e. return address on stack) - `sspopchk` to raise software check exception if comparision above was a mismatch - Protection mechanism using which shadow stack is not writeable via regular store instructions More information an details can be found at extensions github repo [1]. Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel CET [3] and branch target identification (BTI) [4] on arm. Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack. x86 already supports shadow stack for user mode and arm64 support for GCS in usermode [7] is in -next. Kernel awareness for user control flow integrity ================================================ This series picks up Samuel Holland's envcfg changes [2] as well. So if those are being applied independently, they should be removed from this series. Enabling: In order to maintain compatibility and not break anything in user mode, kernel doesn't enable control flow integrity cpu extensions on binary by default. Instead exposes a prctl interface to enable, disable and lock the shadow stack or landing pad feature for a task. This allows userspace (loader) to enumerate if all objects in its address space are compiled with shadow stack and landing pad support and accordingly enable the feature. Additionally if a subsequent `dlopen` happens on a library, user mode can take a decision again to disable the feature (if incoming library is not compiled with support) OR terminate the task (if user mode policy is strict to have all objects in address space to be compiled with control flow integirty cpu feature). prctl to enable shadow stack results in allocating shadow stack from virtual memory and activating for user address space. x86 and arm64 are also following same direction due to similar reason(s). clone/fork: On clone and fork, cfi state for task is inherited by child. Shadow stack is part of virtual memory and is a writeable memory from kernel perspective (writeable via a restricted set of instructions aka shadow stack instructions) Thus kernel changes ensure that this memory is converted into read-only when fork/clone happens and COWed when fault is taken due to sspush, sspopchk or ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled, kernel will automatically allocate a shadow stack for that clone call. map_shadow_stack: x86 introduced `map_shadow_stack` system call to allow user space to explicitly map shadow stack memory in its address space. It is useful to allocate shadow for different contexts managed by a single thread (green threads or contexts) risc-v implements this system call as well. signal management: If shadow stack is enabled for a task, kernel performs an asynchronous control flow diversion to deliver the signal and eventually expects userspace to issue sigreturn so that original execution can be resumed. Even though resume context is prepared by kernel, it is in user space memory and is subject to memory corruption and corruption bugs can be utilized by attacker in this race window to perform arbitrary sigreturn and eventually bypass cfi mechanism. Another issue is how to ensure that cfi related state on sigcontext area is not trampled by legacy apps or apps compiled with old kernel headers. In order to mitigate control-flow hijacting, kernel prepares a token and place it on shadow stack before signal delivery and places address of token in sigcontext structure. During sigreturn, kernel obtains address of token from sigcontext struture, reads token from shadow stack and validates it and only then allow sigreturn to succeed. Compatiblity issue is solved by adopting dynamic sigcontext management introduced for vector extension. This series re-factor the code little bit to allow future sigcontext management easy (as proposed by Andy Chiu from SiFive) config and compilation: Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this config option picks the kernel support for user control flow integrity. This optin is presented only if toolchain has shadow stack and landing pad support. And is on purpose guarded by toolchain support. Reason being that eventually vDSO also needs to be compiled in with shadow stack and landing pad support. vDSO compile patches are not included as of now because landing pad labeling scheme is yet to settle for usermode runtime. To get more information on kernel interactions with respect to zicfilp and zicfiss, patch series adds documentation for `zicfilp` and `zicfiss` in following: Documentation/arch/riscv/zicfiss.rst Documentation/arch/riscv/zicfilp.rst How to test this series ======================= Toolchain --------- $ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev $ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static" $ make -j$(nproc) Qemu ---- $ git clone git@github.com:deepak0414/qemu.git -b zicfilp_zicfiss_ratified_master_july11 $ cd qemu $ mkdir build $ cd build $ ../configure --target-list=riscv64-softmmu $ make -j$(nproc) Opensbi ------- $ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi $ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic Linux ----- Running defconfig is fine. CFI is enabled by default if the toolchain supports it. $ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig $ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) Branch where user cfi enabling patches are maintained https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1 In case you're building your own rootfs using toolchain, please make sure you pick following patch to ensure that vDSO compiled with lpad and shadow stack. "arch/riscv: compile vdso with landing pad" Running ------- Modify your qemu command to have: -bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin -cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true vDSO related Opens (in the flux) ================================= I am listing these opens for laying out plan and what to expect in future patch sets. And of course for the sake of discussion. Shadow stack and landing pad enabling in vDSO ---------------------------------------------- vDSO must have shadow stack and landing pad support compiled in for task to have shadow stack and landing pad support. This patch series doesn't enable that (yet). Enabling shadow stack support in vDSO should be straight forward (intend to do that in next versions of patch set). Enabling landing pad support in vDSO requires some collaboration with toolchain folks to follow a single label scheme for all object binaries. This is necessary to ensure that all indirect call-sites are setting correct label and target landing pads are decorated with same label scheme. How many vDSOs --------------- Shadow stack instructions are carved out of zimop (may be operations) and if CPU doesn't implement zimop, they're illegal instructions. Kernel could be running on a CPU which may or may not implement zimop. And thus kernel will have to carry 2 different vDSOs and expose the appropriate one depending on whether CPU implements zimop or not. References ========== [1] - https://github.com/riscv/riscv-cfi [2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c… [3] - https://lwn.net/Articles/889475/ [4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific… [5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i… [6] - https://lwn.net/Articles/940403/ [7] - https://lore.kernel.org/all/20241001-arm64-gcs-v13-0-222b78d87eee@kernel.or… --- changelog --------- v7: - Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv" Instead using `deactivate_mm` flow to clean up. see here for more context https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.… - Changed the header include in `kselftest`. Hopefully this fixes compile issue faced by Zong Li at SiFive. - Cleaned up an orphaned change to `mm/mmap.c` in below patch "riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE" - Lock interfaces for shadow stack and indirect branch tracking expect arg == 0 Any future evolution of this interface should accordingly define how arg should be setup. - `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use helper `is_shadow_stack_vma`. - Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv… v6: - Picked up Samuel Holland's changes as is with `envcfg` placed in `thread` instead of `thread_info` - fixed unaligned newline escapes in kselftest - cleaned up messages in kselftest and included test output in commit message - fixed a bug in clone path reported by Zong Li - fixed a build issue if CONFIG_RISCV_ISA_V is not selected (this was introduced due to re-factoring signal context management code) v5: - rebased on v6.12-rc1 - Fixed schema related issues in device tree file - Fixed some of the documentation related issues in zicfilp/ss.rst (style issues and added index) - added `SHADOW_STACK_SET_MARKER` so that implementation can define base of shadow stack. - Fixed warnings on definitions added in usercfi.h when CONFIG_RISCV_USER_CFI is not selected. - Adopted context header based signal handling as proposed by Andy Chiu - Added support for enabling kernel mode access to shadow stack using FWFT (https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…) - Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv… (Note: I had an issue in my workflow due to which version number wasn't picked up correctly while sending out patches) v4: - rebased on 6.11-rc6 - envcfg: Converged with Samuel Holland's patches for envcfg management on per- thread basis. - vma_is_shadow_stack is renamed to is_vma_shadow_stack - picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch - signal context: using extended context management to maintain compatibility. - fixed `-Wmissing-prototypes` compiler warnings for prctl functions - Documentation fixes and amending typos. - Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/ v3: - envcfg logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been picked on per task basis, even though CPU didn't implement it. Fixed in this series. - dt-bindings As suggested, split into separate commit. fixed the messaging that spec is in public review - arch_is_shadow_stack change arch_is_shadow_stack changed to vma_is_shadow_stack - hwprobe zicfiss / zicfilp if present will get enumerated in hwprobe - selftests As suggested, added object and binary filenames to .gitignore Selftest binary anyways need to be compiled with cfi enabled compiler which will make sure that landing pad and shadow stack are enabled. Thus removed separate enable/disable tests. Cleaned up tests a bit. - Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/ v2: - Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow integrity for user mode programs can be compiled in the kernel. - Enabling of control flow integrity for user programs is left to user runtime - This patch series introduces arch agnostic `prctls` to enable shadow stack and indirect branch tracking. And implements them on riscv. --- Andy Chiu (1): riscv: signal: abstract header saving for setup_sigcontext Clément Léger (1): riscv: Add Firmware Feature SBI extensions definitions Deepak Gupta (25): mm: helper `is_shadow_stack_vma` to check shadow stack vma dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml) riscv: zicfiss / zicfilp enumeration riscv: zicfiss / zicfilp extension csr and bit definitions riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE riscv mm: manufacture shadow stack pte riscv mmu: teach pte_mkwrite to manufacture shadow stack PTEs riscv mmu: write protect and shadow stack riscv/mm: Implement map_shadow_stack() syscall riscv/shstk: If needed allocate a new shadow stack on clone prctl: arch-agnostic prctl for indirect branch tracking riscv: Implements arch agnostic shadow stack prctls riscv: Implements arch agnostic indirect branch tracking prctls riscv/traps: Introduce software check exception riscv/signal: save and restore of shadow stack for signal riscv/kernel: update __show_regs to print shadow stack register riscv/ptrace: riscv cfi status and state via ptrace and in core files riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe riscv: enable kernel access to shadow stack memory via FWFT sbi call riscv: kernel command line option to opt out of user cfi riscv: create a config for shadow stack and landing pad instr support riscv: Documentation for landing pad / indirect branch tracking riscv: Documentation for shadow stack on riscv kselftest/riscv: kselftest for user mode cfi Mark Brown (2): mm: Introduce ARCH_HAS_USER_SHADOW_STACK prctl: arch-agnostic prctl for shadow stack Samuel Holland (3): riscv: Enable cbo.zero only when all harts support Zicboz riscv: Add support for per-thread envcfg CSR values riscv: Call riscv_user_isa_enable() only on the boot hart Documentation/arch/riscv/index.rst | 2 + Documentation/arch/riscv/zicfilp.rst | 115 +++++ Documentation/arch/riscv/zicfiss.rst | 176 +++++++ .../devicetree/bindings/riscv/extensions.yaml | 14 + arch/riscv/Kconfig | 20 + arch/riscv/include/asm/asm-prototypes.h | 1 + arch/riscv/include/asm/cpufeature.h | 15 +- arch/riscv/include/asm/csr.h | 16 + arch/riscv/include/asm/entry-common.h | 2 + arch/riscv/include/asm/hwcap.h | 2 + arch/riscv/include/asm/mman.h | 24 + arch/riscv/include/asm/mmu_context.h | 7 + arch/riscv/include/asm/pgtable.h | 30 +- arch/riscv/include/asm/processor.h | 3 + arch/riscv/include/asm/sbi.h | 27 ++ arch/riscv/include/asm/switch_to.h | 8 + arch/riscv/include/asm/thread_info.h | 3 + arch/riscv/include/asm/usercfi.h | 89 ++++ arch/riscv/include/asm/vector.h | 3 + arch/riscv/include/uapi/asm/hwprobe.h | 2 + arch/riscv/include/uapi/asm/ptrace.h | 22 + arch/riscv/include/uapi/asm/sigcontext.h | 1 + arch/riscv/kernel/Makefile | 2 + arch/riscv/kernel/asm-offsets.c | 8 + arch/riscv/kernel/cpufeature.c | 13 +- arch/riscv/kernel/entry.S | 31 +- arch/riscv/kernel/head.S | 12 + arch/riscv/kernel/process.c | 26 +- arch/riscv/kernel/ptrace.c | 83 ++++ arch/riscv/kernel/signal.c | 140 +++++- arch/riscv/kernel/smpboot.c | 2 - arch/riscv/kernel/suspend.c | 4 +- arch/riscv/kernel/sys_hwprobe.c | 2 + arch/riscv/kernel/sys_riscv.c | 10 + arch/riscv/kernel/traps.c | 42 ++ arch/riscv/kernel/usercfi.c | 526 +++++++++++++++++++++ arch/riscv/mm/init.c | 2 +- arch/riscv/mm/pgtable.c | 17 + arch/x86/Kconfig | 1 + fs/proc/task_mmu.c | 2 +- include/linux/cpu.h | 4 + include/linux/mm.h | 5 +- include/uapi/asm-generic/mman.h | 4 + include/uapi/linux/elf.h | 1 + include/uapi/linux/prctl.h | 48 ++ kernel/sys.c | 60 +++ mm/Kconfig | 6 + mm/gup.c | 2 +- mm/mmap.c | 2 +- mm/vma.h | 10 +- tools/testing/selftests/riscv/Makefile | 2 +- tools/testing/selftests/riscv/cfi/.gitignore | 3 + tools/testing/selftests/riscv/cfi/Makefile | 10 + tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 84 ++++ tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 78 +++ tools/testing/selftests/riscv/cfi/shadowstack.c | 373 +++++++++++++++ tools/testing/selftests/riscv/cfi/shadowstack.h | 37 ++ 57 files changed, 2191 insertions(+), 43 deletions(-) --- base-commit: 7d9923ee3960bdbfaa7f3a4e0ac2364e770c46ff change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2 -- - debug

1 year

2
33
0 0

[PATCH v4] memfd: `MFD_NOEXEC_SEAL` should not imply `MFD_ALLOW_SEALING`

by Barnabás Pőcze

`MFD_NOEXEC_SEAL` should remove the executable bits and set `F_SEAL_EXEC` to prevent further modifications to the executable bits as per the comment in the uapi header file: not executable and sealed to prevent changing to executable However, commit 105ff5339f498a ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC") that introduced this feature made it so that `MFD_NOEXEC_SEAL` unsets `F_SEAL_SEAL`, essentially acting as a superset of `MFD_ALLOW_SEALING`. Nothing implies that it should be so, and indeed up until the second version of the of the patchset[0] that introduced `MFD_EXEC` and `MFD_NOEXEC_SEAL`, `F_SEAL_SEAL` was not removed, however, it was changed in the third revision of the patchset[1] without a clear explanation. This behaviour is surprising for application developers, there is no documentation that would reveal that `MFD_NOEXEC_SEAL` has the additional effect of `MFD_ALLOW_SEALING`. Additionally, combined with `vm.memfd_noexec=2` it has the effect of making all memfds initially sealable. So do not remove `F_SEAL_SEAL` when `MFD_NOEXEC_SEAL` is requested, thereby returning to the pre-Linux 6.3 behaviour of only allowing sealing when `MFD_ALLOW_SEALING` is specified. Now, this is technically a uapi break. However, the damage is expected to be minimal. To trigger user visible change, a program has to do the following steps: - create memfd: - with `MFD_NOEXEC_SEAL`, - without `MFD_ALLOW_SEALING`; - try to add seals / check the seals. But that seems unlikely to happen intentionally since this change essentially reverts the kernel's behaviour to that of Linux <6.3, so if a program worked correctly on those older kernels, it will likely work correctly after this change. I have used Debian Code Search and GitHub to try to find potential breakages, and I could only find a single one. dbus-broker's memfd_create() wrapper is aware of this implicit `MFD_ALLOW_SEALING` behaviour, and tries to work around it[2]. This workaround will break. Luckily, this only affects the test suite, it does not affect the normal operations of dbus-broker. There is a PR with a fix[3]. I also carried out a smoke test by building a kernel with this change and booting an Arch Linux system into GNOME and Plasma sessions. There was also a previous attempt to address this peculiarity by introducing a new flag[4]. [0]: https://lore.kernel.org/lkml/20220805222126.142525-3-jeffxu@google.com/ [1]: https://lore.kernel.org/lkml/20221202013404.163143-3-jeffxu@google.com/ [2]: https://github.com/bus1/dbus-broker/blob/9eb0b7e5826fc76cad7b025bc46f267d4a… [3]: https://github.com/bus1/dbus-broker/pull/366 [4]: https://lore.kernel.org/lkml/20230714114753.170814-1-david@readahead.eu/ Cc: stable(a)vger.kernel.org Signed-off-by: Barnabás Pőcze <pobrn(a)protonmail.com> --- * v3: https://lore.kernel.org/linux-mm/20240611231409.3899809-1-jeffxu@chromium.o… * v2: https://lore.kernel.org/linux-mm/20240524033933.135049-1-jeffxu@google.com/ * v1: https://lore.kernel.org/linux-mm/20240513191544.94754-1-pobrn@protonmail.co… This fourth version returns to removing the inconsistency as opposed to documenting its existence, with the same code change as v1 but with a somewhat extended commit message. This is sent because I believe it is worth at least a try; it can be easily reverted if bigger application breakages are discovered than initially imagined. --- mm/memfd.c | 9 ++++----- tools/testing/selftests/memfd/memfd_test.c | 2 +- 2 files changed, 5 insertions(+), 6 deletions(-) diff --git a/mm/memfd.c b/mm/memfd.c index 7d8d3ab3fa37..8b7f6afee21d 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -356,12 +356,11 @@ SYSCALL_DEFINE2(memfd_create, inode->i_mode &= ~0111; file_seals = memfd_file_seals_ptr(file); - if (file_seals) { - *file_seals &= ~F_SEAL_SEAL; + if (file_seals) *file_seals |= F_SEAL_EXEC; - } - } else if (flags & MFD_ALLOW_SEALING) { - /* MFD_EXEC and MFD_ALLOW_SEALING are set */ + } + + if (flags & MFD_ALLOW_SEALING) { file_seals = memfd_file_seals_ptr(file); if (file_seals) *file_seals &= ~F_SEAL_SEAL; diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index 95af2d78fd31..7b78329f65b6 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -1151,7 +1151,7 @@ static void test_noexec_seal(void) mfd_def_size, MFD_CLOEXEC | MFD_NOEXEC_SEAL); mfd_assert_mode(fd, 0666); - mfd_assert_has_seals(fd, F_SEAL_EXEC); + mfd_assert_has_seals(fd, F_SEAL_SEAL | F_SEAL_EXEC); mfd_fail_chmod(fd, 0777); close(fd); } -- 2.45.2

1 year

4
5
0 0

[PATCH v2 net-next] wireguard: allowedips: Add WGALLOWEDIP_F_REMOVE_ME flag

by Jordan Rife

With the current API the only way to remove an allowed IP is to completely rebuild the allowed IPs set for a peer using WGPEER_F_REPLACE_ALLOWEDIPS. In other words, if my current configuration is such that a peer has allowed IP IPs 192.168.0.2 and 192.168.0.3 and I want to remove 192.168.0.2 the actual transition looks like this. [192.168.0.2, 192.168.0.3] <-- Initial state [] <-- Step 1: Allowed IPs removed for peer [192.168.0.3] <-- Step 2: Allowed IPs added back for peer This is true even if the allowed IP list is small and the update does not need to be batched into multiple WG_CMD_SET_DEVICE requests, as the removal and subsequent addition of IPs is non-atomic within a single request. Consequently, wg_allowedips_lookup_dst and wg_allowedips_lookup_src may return NULL while reconfiguring a peer even for packets bound for IPs a user did not intend to remove leading to unintended interruptions in connectivity. This presents in userspace as failed calls to sendto and sendmsg for UDP sockets. In my case, I ran netperf while repeatedly reconfiguring the allowed IPs for a peer with wg. /usr/local/bin/netperf -H 10.102.73.72 -l 10m -t UDP_STREAM -- -R 1 -m 1024 send_data: data send error: No route to host (errno 113) netperf: send_omni: send_data failed: No route to host While this may not be of particular concern for environments where peers and allowed IPs are mostly static, Cilium manages peers and allowed IPs in a dynamic environment where peers (i.e. Kubernetes nodes) and allowed IPs (i.e. Pods running on those nodes) can frequently change. Cilium must continually keep its WireGuard device's configuration in sync with its cluster state leading to unnecessary churn and packet drops. This patch introduces a new flag called WGALLOWEDIP_F_REMOVE_ME which in the same way that WGPEER_F_REMOVE_ME allows a user to remove a single peer from a WireGuard device's configuration allows a user to remove an IP from a peer's set of allowed IPs. This has two benefits. First, it allows systems such as Cilium to avoid introducing connectivity blips while reconfiguring a WireGuard device. Second, it allows us to more efficiently keep the device's configuration in sync with the cluster state, as we no longer need to do frequent rebuilds of the allowed IPs list for each peer. Instead, the device's configuration can be incrementally updated. This patch also bumps WG_GENL_VERSION which can be used by clients to detect whether or not their system supports the WGALLOWEDIP_F_REMOVE_ME flag. ======= Changes ======= v1->v2 ------ * Fixed some Sparse warnings Signed-off-by: Jordan Rife <jrife(a)google.com> --- drivers/net/wireguard/allowedips.c | 103 ++++++++++---- drivers/net/wireguard/allowedips.h | 4 + drivers/net/wireguard/netlink.c | 45 +++++-- drivers/net/wireguard/selftest/allowedips.c | 30 +++++ include/uapi/linux/wireguard.h | 11 +- tools/testing/selftests/wireguard/Makefile | 18 +++ tools/testing/selftests/wireguard/netns.sh | 38 ++++++ tools/testing/selftests/wireguard/remove-ip.c | 126 ++++++++++++++++++ 8 files changed, 333 insertions(+), 42 deletions(-) create mode 100644 tools/testing/selftests/wireguard/Makefile create mode 100644 tools/testing/selftests/wireguard/remove-ip.c diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c index 4b8528206cc8a..ff52259dd8d81 100644 --- a/drivers/net/wireguard/allowedips.c +++ b/drivers/net/wireguard/allowedips.c @@ -249,6 +249,56 @@ static int add(struct allowedips_node __rcu **trie, u8 bits, const u8 *key, return 0; } +static void _remove(struct allowedips_node *node, struct mutex *lock) +{ + struct allowedips_node *child, **parent_bit, *parent; + bool free_parent; + + list_del_init(&node->peer_list); + RCU_INIT_POINTER(node->peer, NULL); + if (node->bit[0] && node->bit[1]) + return; + child = rcu_dereference_protected(node->bit[!rcu_access_pointer(node->bit[0])], + lockdep_is_held(lock)); + if (child) + child->parent_bit_packed = node->parent_bit_packed; + parent_bit = (struct allowedips_node **)(node->parent_bit_packed & ~3UL); + *parent_bit = child; + parent = (void *)parent_bit - + offsetof(struct allowedips_node, bit[node->parent_bit_packed & 1]); + free_parent = !rcu_access_pointer(node->bit[0]) && + !rcu_access_pointer(node->bit[1]) && + (node->parent_bit_packed & 3) <= 1 && + !rcu_access_pointer(parent->peer); + if (free_parent) + child = rcu_dereference_protected(parent->bit[!(node->parent_bit_packed & 1)], + lockdep_is_held(lock)); + call_rcu(&node->rcu, node_free_rcu); + if (!free_parent) + return; + if (child) + child->parent_bit_packed = parent->parent_bit_packed; + *(struct allowedips_node **)(parent->parent_bit_packed & ~3UL) = child; + call_rcu(&parent->rcu, node_free_rcu); +} + +static int remove(struct allowedips_node __rcu **trie, u8 bits, const u8 *key, + u8 cidr, struct wg_peer *peer, struct mutex *lock) +{ + struct allowedips_node *node; + + if (unlikely(cidr > bits || !peer)) + return -EINVAL; + if (!rcu_access_pointer(*trie) || + !node_placement(*trie, key, cidr, bits, &node, lock) || + peer != rcu_access_pointer(node->peer)) + return 0; + + _remove(node, lock); + + return 0; +} + void wg_allowedips_init(struct allowedips *table) { table->root4 = table->root6 = NULL; @@ -300,43 +350,38 @@ int wg_allowedips_insert_v6(struct allowedips *table, const struct in6_addr *ip, return add(&table->root6, 128, key, cidr, peer, lock); } +int wg_allowedips_remove_v4(struct allowedips *table, const struct in_addr *ip, + u8 cidr, struct wg_peer *peer, struct mutex *lock) +{ + /* Aligned so it can be passed to fls */ + u8 key[4] __aligned(__alignof(u32)); + + ++table->seq; + swap_endian(key, (const u8 *)ip, 32); + return remove(&table->root4, 32, key, cidr, peer, lock); +} + +int wg_allowedips_remove_v6(struct allowedips *table, const struct in6_addr *ip, + u8 cidr, struct wg_peer *peer, struct mutex *lock) +{ + /* Aligned so it can be passed to fls64 */ + u8 key[16] __aligned(__alignof(u64)); + + ++table->seq; + swap_endian(key, (const u8 *)ip, 128); + return remove(&table->root6, 128, key, cidr, peer, lock); +} + void wg_allowedips_remove_by_peer(struct allowedips *table, struct wg_peer *peer, struct mutex *lock) { - struct allowedips_node *node, *child, **parent_bit, *parent, *tmp; - bool free_parent; + struct allowedips_node *node, *tmp; if (list_empty(&peer->allowedips_list)) return; ++table->seq; list_for_each_entry_safe(node, tmp, &peer->allowedips_list, peer_list) { - list_del_init(&node->peer_list); - RCU_INIT_POINTER(node->peer, NULL); - if (node->bit[0] && node->bit[1]) - continue; - child = rcu_dereference_protected(node->bit[!rcu_access_pointer(node->bit[0])], - lockdep_is_held(lock)); - if (child) - child->parent_bit_packed = node->parent_bit_packed; - parent_bit = (struct allowedips_node **)(node->parent_bit_packed & ~3UL); - *parent_bit = child; - parent = (void *)parent_bit - - offsetof(struct allowedips_node, bit[node->parent_bit_packed & 1]); - free_parent = !rcu_access_pointer(node->bit[0]) && - !rcu_access_pointer(node->bit[1]) && - (node->parent_bit_packed & 3) <= 1 && - !rcu_access_pointer(parent->peer); - if (free_parent) - child = rcu_dereference_protected( - parent->bit[!(node->parent_bit_packed & 1)], - lockdep_is_held(lock)); - call_rcu(&node->rcu, node_free_rcu); - if (!free_parent) - continue; - if (child) - child->parent_bit_packed = parent->parent_bit_packed; - *(struct allowedips_node **)(parent->parent_bit_packed & ~3UL) = child; - call_rcu(&parent->rcu, node_free_rcu); + _remove(node, lock); } } diff --git a/drivers/net/wireguard/allowedips.h b/drivers/net/wireguard/allowedips.h index 2346c797eb4d8..931958cb6e100 100644 --- a/drivers/net/wireguard/allowedips.h +++ b/drivers/net/wireguard/allowedips.h @@ -38,6 +38,10 @@ int wg_allowedips_insert_v4(struct allowedips *table, const struct in_addr *ip, u8 cidr, struct wg_peer *peer, struct mutex *lock); int wg_allowedips_insert_v6(struct allowedips *table, const struct in6_addr *ip, u8 cidr, struct wg_peer *peer, struct mutex *lock); +int wg_allowedips_remove_v4(struct allowedips *table, const struct in_addr *ip, + u8 cidr, struct wg_peer *peer, struct mutex *lock); +int wg_allowedips_remove_v6(struct allowedips *table, const struct in6_addr *ip, + u8 cidr, struct wg_peer *peer, struct mutex *lock); void wg_allowedips_remove_by_peer(struct allowedips *table, struct wg_peer *peer, struct mutex *lock); /* The ip input pointer should be __aligned(__alignof(u64))) */ diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c index f7055180ba4aa..5f2a8553ab43d 100644 --- a/drivers/net/wireguard/netlink.c +++ b/drivers/net/wireguard/netlink.c @@ -46,7 +46,8 @@ static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = { static const struct nla_policy allowedip_policy[WGALLOWEDIP_A_MAX + 1] = { [WGALLOWEDIP_A_FAMILY] = { .type = NLA_U16 }, [WGALLOWEDIP_A_IPADDR] = NLA_POLICY_MIN_LEN(sizeof(struct in_addr)), - [WGALLOWEDIP_A_CIDR_MASK] = { .type = NLA_U8 } + [WGALLOWEDIP_A_CIDR_MASK] = { .type = NLA_U8 }, + [WGALLOWEDIP_A_FLAGS] = { .type = NLA_U32 } }; static struct wg_device *lookup_interface(struct nlattr **attrs, @@ -329,6 +330,7 @@ static int set_port(struct wg_device *wg, u16 port) static int set_allowedip(struct wg_peer *peer, struct nlattr **attrs) { int ret = -EINVAL; + u32 flags = 0; u16 family; u8 cidr; @@ -337,19 +339,38 @@ static int set_allowedip(struct wg_peer *peer, struct nlattr **attrs) return ret; family = nla_get_u16(attrs[WGALLOWEDIP_A_FAMILY]); cidr = nla_get_u8(attrs[WGALLOWEDIP_A_CIDR_MASK]); + if (attrs[WGALLOWEDIP_A_FLAGS]) + flags = nla_get_u32(attrs[WGALLOWEDIP_A_FLAGS]); if (family == AF_INET && cidr <= 32 && - nla_len(attrs[WGALLOWEDIP_A_IPADDR]) == sizeof(struct in_addr)) - ret = wg_allowedips_insert_v4( - &peer->device->peer_allowedips, - nla_data(attrs[WGALLOWEDIP_A_IPADDR]), cidr, peer, - &peer->device->device_update_lock); - else if (family == AF_INET6 && cidr <= 128 && - nla_len(attrs[WGALLOWEDIP_A_IPADDR]) == sizeof(struct in6_addr)) - ret = wg_allowedips_insert_v6( - &peer->device->peer_allowedips, - nla_data(attrs[WGALLOWEDIP_A_IPADDR]), cidr, peer, - &peer->device->device_update_lock); + nla_len(attrs[WGALLOWEDIP_A_IPADDR]) == sizeof(struct in_addr)) { + if (flags & WGALLOWEDIP_F_REMOVE_ME) + ret = wg_allowedips_remove_v4(&peer->device->peer_allowedips, + nla_data(attrs[WGALLOWEDIP_A_IPADDR]), + cidr, + peer, + &peer->device->device_update_lock); + else + ret = wg_allowedips_insert_v4(&peer->device->peer_allowedips, + nla_data(attrs[WGALLOWEDIP_A_IPADDR]), + cidr, + peer, + &peer->device->device_update_lock); + } else if (family == AF_INET6 && cidr <= 128 && + nla_len(attrs[WGALLOWEDIP_A_IPADDR]) == sizeof(struct in6_addr)) { + if (flags & WGALLOWEDIP_F_REMOVE_ME) + ret = wg_allowedips_remove_v6(&peer->device->peer_allowedips, + nla_data(attrs[WGALLOWEDIP_A_IPADDR]), + cidr, + peer, + &peer->device->device_update_lock); + else + ret = wg_allowedips_insert_v6(&peer->device->peer_allowedips, + nla_data(attrs[WGALLOWEDIP_A_IPADDR]), + cidr, + peer, + &peer->device->device_update_lock); + } return ret; } diff --git a/drivers/net/wireguard/selftest/allowedips.c b/drivers/net/wireguard/selftest/allowedips.c index 3d1f64ff2e122..9f6458a889e96 100644 --- a/drivers/net/wireguard/selftest/allowedips.c +++ b/drivers/net/wireguard/selftest/allowedips.c @@ -461,6 +461,10 @@ static __init struct wg_peer *init_peer(void) wg_allowedips_insert_v##version(&t, ip##version(ipa, ipb, ipc, ipd), \ cidr, mem, &mutex) +#define remove(version, mem, ipa, ipb, ipc, ipd, cidr) \ + wg_allowedips_remove_v##version(&t, ip##version(ipa, ipb, ipc, ipd), \ + cidr, mem, &mutex) + #define maybe_fail() do { \ ++i; \ if (!_s) { \ @@ -586,6 +590,32 @@ bool __init wg_allowedips_selftest(void) test_negative(4, a, 192, 0, 0, 0); test_negative(4, a, 255, 0, 0, 0); + insert(4, a, 1, 0, 0, 0, 32); + insert(4, a, 192, 0, 0, 0, 24); + insert(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef, 128); + insert(6, a, 0x24446800, 0xf0e40800, 0xeeaebeef, 0, 98); + test(4, a, 1, 0, 0, 0); + test(4, a, 192, 0, 0, 1); + test(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef); + test(6, a, 0x24446800, 0xf0e40800, 0xeeaebeef, 0x10101010); + /* Must be an exact match to remove */ + remove(4, a, 192, 0, 0, 0, 32); + test(4, a, 192, 0, 0, 1); + remove(4, a, 192, 0, 0, 0, 24); + test_negative(4, a, 192, 0, 0, 1); + remove(4, a, 1, 0, 0, 0, 32); + test_negative(4, a, 1, 0, 0, 0); + /* Must be an exact match to remove */ + remove(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef, 96); + test(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef); + remove(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef, 128); + test_negative(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef); + /* Must match the peer to remove */ + remove(6, b, 0x24446800, 0xf0e40800, 0xeeaebeef, 0, 98); + test(6, a, 0x24446800, 0xf0e40800, 0xeeaebeef, 0x10101010); + remove(6, a, 0x24446800, 0xf0e40800, 0xeeaebeef, 0, 98); + test_negative(6, a, 0x24446800, 0xf0e40800, 0xeeaebeef, 0x10101010); + wg_allowedips_free(&t, &mutex); wg_allowedips_init(&t); insert(4, a, 192, 168, 0, 0, 16); diff --git a/include/uapi/linux/wireguard.h b/include/uapi/linux/wireguard.h index ae88be14c9478..e219194cb9f5a 100644 --- a/include/uapi/linux/wireguard.h +++ b/include/uapi/linux/wireguard.h @@ -101,6 +101,10 @@ * WGALLOWEDIP_A_FAMILY: NLA_U16 * WGALLOWEDIP_A_IPADDR: struct in_addr or struct in6_addr * WGALLOWEDIP_A_CIDR_MASK: NLA_U8 + * WGALLOWEDIP_A_FLAGS: NLA_U32, WGALLOWEDIP_F_REMOVE_ME if + * the specified IP should be removed, + * otherwise this IP will be added if + * it is not already present. * 0: NLA_NESTED * ... * 0: NLA_NESTED @@ -132,7 +136,7 @@ #define _WG_UAPI_WIREGUARD_H #define WG_GENL_NAME "wireguard" -#define WG_GENL_VERSION 1 +#define WG_GENL_VERSION 2 #define WG_KEY_LEN 32 @@ -184,11 +188,16 @@ enum wgpeer_attribute { }; #define WGPEER_A_MAX (__WGPEER_A_LAST - 1) +enum wgallowedip_flag { + WGALLOWEDIP_F_REMOVE_ME = 1U << 0, + __WGALLOWEDIP_F_ALL = WGALLOWEDIP_F_REMOVE_ME +}; enum wgallowedip_attribute { WGALLOWEDIP_A_UNSPEC, WGALLOWEDIP_A_FAMILY, WGALLOWEDIP_A_IPADDR, WGALLOWEDIP_A_CIDR_MASK, + WGALLOWEDIP_A_FLAGS, __WGALLOWEDIP_A_LAST }; #define WGALLOWEDIP_A_MAX (__WGALLOWEDIP_A_LAST - 1) diff --git a/tools/testing/selftests/wireguard/Makefile b/tools/testing/selftests/wireguard/Makefile new file mode 100644 index 0000000000000..4f4db54f89cb3 --- /dev/null +++ b/tools/testing/selftests/wireguard/Makefile @@ -0,0 +1,18 @@ +# SPDX-License-Identifier: GPL-2.0 +# +# Note: To build this you must install libnl-3 and libnl-genl-3 development +# packages. +remove-ip: + gcc -I/usr/include/libnl3 \ + -I../../../../usr/include \ + remove-ip.c \ + -o remove-ip \ + -lnl-genl-3 \ + -lnl-3 + +.PHONY: all +all: remove-ip + +.PHONY: clean +clean: + rm remove-ip diff --git a/tools/testing/selftests/wireguard/netns.sh b/tools/testing/selftests/wireguard/netns.sh index 405ff262ca93d..70058d6ebbe85 100755 --- a/tools/testing/selftests/wireguard/netns.sh +++ b/tools/testing/selftests/wireguard/netns.sh @@ -28,6 +28,7 @@ exec 3>&1 export LANG=C export WG_HIDE_KEYS=never NPROC=( /sys/devices/system/cpu/cpu+([0-9]) ); NPROC=${#NPROC[@]} +SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd ) netns0="wg-test-$$-0" netns1="wg-test-$$-1" netns2="wg-test-$$-2" @@ -610,6 +611,43 @@ n0 wg set wg0 peer "$pub2" allowed-ips "$allowedips" } < <(n0 wg show wg0 allowed-ips) ip0 link del wg0 +# Test IP removal +allowedips=( ) +for i in {1..197}; do + allowedips+=( 192.168.0.$i ) + allowedips+=( abcd::$i ) +done +saved_ifs="$IFS" +IFS=, +allowedips="${allowedips[*]}" +IFS="$saved_ifs" +ip0 link add wg0 type wireguard +n0 wg set wg0 peer "$pub1" allowed-ips "$allowedips" +pub1_hex=$(echo "$pub1" | base64 -d | xxd -p -c 50) +n0 $SCRIPT_DIR/remove-ip wg0 "$pub1_hex" 4 192.168.0.1 +n0 $SCRIPT_DIR/remove-ip wg0 "$pub1_hex" 4 192.168.0.20 +n0 $SCRIPT_DIR/remove-ip wg0 "$pub1_hex" 4 192.168.0.100 +n0 $SCRIPT_DIR/remove-ip wg0 "$pub1_hex" 6 abcd::1 +n0 $SCRIPT_DIR/remove-ip wg0 "$pub1_hex" 6 abcd::20 +n0 $SCRIPT_DIR/remove-ip wg0 "$pub1_hex" 6 abcd::100 +n0 wg show wg0 allowed-ips +{ + read -r pub allowedips + [[ $pub == "$pub1" ]] + i=0 + for ip in $allowedips; do + [[ "$ip" != "192.168.0.1" ]] + [[ "$ip" != "192.168.0.20" ]] + [[ "$ip" != "192.168.0.100" ]] + [[ "$ip" != "abcd::1" ]] + [[ "$ip" != "abcd::20" ]] + [[ "$ip" != "abcd::100" ]] + ((++i)) + done + ((i == 388)) +} < <(n0 wg show wg0 allowed-ips) +ip0 link del wg0 + ! n0 wg show doesnotexist || false ip0 link add wg0 type wireguard diff --git a/tools/testing/selftests/wireguard/remove-ip.c b/tools/testing/selftests/wireguard/remove-ip.c new file mode 100644 index 0000000000000..242f922d99b56 --- /dev/null +++ b/tools/testing/selftests/wireguard/remove-ip.c @@ -0,0 +1,126 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/wireguard.h> +#include <sys/socket.h> +#include <netinet/in.h> +#include <arpa/inet.h> +#include <netlink/socket.h> +#include <netlink/netlink.h> +#include <netlink/genl/ctrl.h> +#include <netlink/genl/genl.h> +#include <netlink/genl/family.h> + +#define CURVE25519_KEY_SIZE 32 + +const char *usage = "Usage: remove-ip INTERFACE_NAME PEER_PUBLIC_KEY_HEX IP_VERSION IP"; + +char h2b(char c) +{ + if ('0' <= c && c <= '9') + return c - '0'; + else if ('a' <= c && c <= 'f') + return 10 + (c - 'a'); + + return -1; +} + +int parse_key(const char *raw, unsigned char key[CURVE25519_KEY_SIZE]) +{ + int ret = 0; + int i; + + for (i = 0; i < CURVE25519_KEY_SIZE; i++) { + char h, l; + + h = h2b(raw[0]); + if (h < 0) + return -1; + + l = h2b(raw[1]); + if (l < 0) + return -1; + + key[i] = (h << 4) | l; + raw += 2; + } + + return 0; +} + +int main(int argc, char **argv) +{ + unsigned char addr[sizeof(struct in6_addr)]; + unsigned char pub_key[CURVE25519_KEY_SIZE]; + struct nl_sock *sock; + struct nl_msg *msg; + int addr_len; + int family; + int cidr; + int af; + + if (argc < 5) { + printf("Not enough arguments.\n\n%s\n", usage); + return -1; + } + + if (parse_key(argv[2], pub_key)) { + printf("Could not parse public key\n"); + return -1; + } + + switch (argv[3][0]) { + case '4': + af = AF_INET; + addr_len = sizeof(struct in_addr); + cidr = 32; + break; + case '6': + af = AF_INET6; + addr_len = sizeof(struct in6_addr); + cidr = 128; + break; + default: + printf("Invalid IP version\n"); + return -1; + } + + if (inet_pton(af, argv[4], &addr) <= 0) { + printf("Could not parse IP address\n"); + return -1; + } + + sock = nl_socket_alloc(); + genl_connect(sock); + family = genl_ctrl_resolve(sock, WG_GENL_NAME); + msg = nlmsg_alloc(); + genlmsg_put(msg, NL_AUTO_PID, NL_AUTO_SEQ, family, 0, NLM_F_ECHO, + WG_CMD_SET_DEVICE, WG_GENL_VERSION); + nla_put_string(msg, WGDEVICE_A_IFNAME, argv[1]); + + struct nlattr *peers = nla_nest_start(msg, WGDEVICE_A_PEERS); + struct nlattr *peer0 = nla_nest_start(msg, 0); + + nla_put(msg, WGPEER_A_PUBLIC_KEY, CURVE25519_KEY_SIZE, pub_key); + + struct nlattr *allowed_ips = nla_nest_start(msg, WGPEER_A_ALLOWEDIPS); + struct nlattr *allowed_ip0 = nla_nest_start(msg, 0); + + nla_put_u16(msg, WGALLOWEDIP_A_FAMILY, af); + nla_put(msg, WGALLOWEDIP_A_IPADDR, addr_len, &addr); + nla_put_u8(msg, WGALLOWEDIP_A_CIDR_MASK, cidr); + nla_put_u32(msg, WGALLOWEDIP_A_FLAGS, WGALLOWEDIP_F_REMOVE_ME); + nla_nest_end(msg, allowed_ip0); + nla_nest_end(msg, allowed_ips); + nla_nest_end(msg, peer0); + nla_nest_end(msg, peers); + + int err = nl_send_sync(sock, msg); + + if (err < 0) { + char message[256]; + + nl_perror(err, message); + printf("An error occurred: %d - %s\n", err, message); + } + + return err; +} -- 2.46.0.598.g6f2099f65c-goog

1 year

3
7
0 0

[PATCH net-next v11 00/23] Introducing OpenVPN Data Channel Offload

by Antonio Quartulli

Notable changes from v10: * extended commit message of 23/23 with brief description of the output * Link to v10: https://lore.kernel.org/r/20241025-b4-ovpn-v10-0-b87530777be7@openvpn.net Please note that some patches were already reviewed by Andre Lunn, Donald Hunter and Shuah Khan. They have retained the Reviewed-by tag since no major code modification has happened since the review. The latest code can also be found at: https://github.com/OpenVPN/linux-kernel-ovpn Thanks a lot! Best Regards, Antonio Quartulli OpenVPN Inc. --- Antonio Quartulli (23): netlink: add NLA_POLICY_MAX_LEN macro net: introduce OpenVPN Data Channel Offload (ovpn) ovpn: add basic netlink support ovpn: add basic interface creation/destruction/management routines ovpn: keep carrier always on ovpn: introduce the ovpn_peer object ovpn: introduce the ovpn_socket object ovpn: implement basic TX path (UDP) ovpn: implement basic RX path (UDP) ovpn: implement packet processing ovpn: store tunnel and transport statistics ovpn: implement TCP transport ovpn: implement multi-peer support ovpn: implement peer lookup logic ovpn: implement keepalive mechanism ovpn: add support for updating local UDP endpoint ovpn: add support for peer floating ovpn: implement peer add/get/dump/delete via netlink ovpn: implement key add/get/del/swap via netlink ovpn: kill key and notify userspace in case of IV exhaustion ovpn: notify userspace when a peer is deleted ovpn: add basic ethtool support testing/selftests: add test tool and scripts for ovpn module Documentation/netlink/specs/ovpn.yaml | 362 +++ MAINTAINERS | 11 + drivers/net/Kconfig | 14 + drivers/net/Makefile | 1 + drivers/net/ovpn/Makefile | 22 + drivers/net/ovpn/bind.c | 54 + drivers/net/ovpn/bind.h | 117 + drivers/net/ovpn/crypto.c | 214 ++ drivers/net/ovpn/crypto.h | 145 ++ drivers/net/ovpn/crypto_aead.c | 386 ++++ drivers/net/ovpn/crypto_aead.h | 33 + drivers/net/ovpn/io.c | 462 ++++ drivers/net/ovpn/io.h | 25 + drivers/net/ovpn/main.c | 337 +++ drivers/net/ovpn/main.h | 24 + drivers/net/ovpn/netlink-gen.c | 212 ++ drivers/net/ovpn/netlink-gen.h | 41 + drivers/net/ovpn/netlink.c | 1135 ++++++++++ drivers/net/ovpn/netlink.h | 18 + drivers/net/ovpn/ovpnstruct.h | 61 + drivers/net/ovpn/packet.h | 40 + drivers/net/ovpn/peer.c | 1201 ++++++++++ drivers/net/ovpn/peer.h | 165 ++ drivers/net/ovpn/pktid.c | 130 ++ drivers/net/ovpn/pktid.h | 87 + drivers/net/ovpn/proto.h | 104 + drivers/net/ovpn/skb.h | 56 + drivers/net/ovpn/socket.c | 178 ++ drivers/net/ovpn/socket.h | 55 + drivers/net/ovpn/stats.c | 21 + drivers/net/ovpn/stats.h | 47 + drivers/net/ovpn/tcp.c | 506 +++++ drivers/net/ovpn/tcp.h | 44 + drivers/net/ovpn/udp.c | 406 ++++ drivers/net/ovpn/udp.h | 26 + include/net/netlink.h | 1 + include/uapi/linux/if_link.h | 15 + include/uapi/linux/ovpn.h | 109 + include/uapi/linux/udp.h | 1 + tools/net/ynl/ynl-gen-c.py | 4 +- tools/testing/selftests/Makefile | 1 + tools/testing/selftests/net/ovpn/.gitignore | 2 + tools/testing/selftests/net/ovpn/Makefile | 17 + tools/testing/selftests/net/ovpn/config | 10 + tools/testing/selftests/net/ovpn/data64.key | 5 + tools/testing/selftests/net/ovpn/ovpn-cli.c | 2370 ++++++++++++++++++++ tools/testing/selftests/net/ovpn/tcp_peers.txt | 5 + .../testing/selftests/net/ovpn/test-chachapoly.sh | 9 + tools/testing/selftests/net/ovpn/test-float.sh | 9 + tools/testing/selftests/net/ovpn/test-tcp.sh | 9 + tools/testing/selftests/net/ovpn/test.sh | 183 ++ tools/testing/selftests/net/ovpn/udp_peers.txt | 5 + 52 files changed, 9494 insertions(+), 1 deletion(-) --- base-commit: ab101c553bc1f76a839163d1dc0d1e715ad6bb4e change-id: 20241002-b4-ovpn-eeee35c694a2 Best regards, -- Antonio Quartulli <antonio(a)openvpn.net>

1 year

5
154
0 0

[PATCH] selftest: drivers: Add support to check duplicate hwirq

by Joseph Jang

Validate there are no duplicate hwirq from the irq debug file system /sys/kernel/debug/irq/irqs/* per chip name. One example log show 2 duplicated hwirq in the irq debug file system. $ sudo cat /sys/kernel/debug/irq/irqs/163 handler: handle_fasteoi_irq device: 0019:00:00.0 <SNIP> node: 1 affinity: 72-143 effectiv: 76 domain: irqchip@0x0000100022040000-3 hwirq: 0xc8000000 chip: ITS-MSI flags: 0x20 $ sudo cat /sys/kernel/debug/irq/irqs/174 handler: handle_fasteoi_irq device: 0039:00:00.0 <SNIP> node: 3 affinity: 216-287 effectiv: 221 domain: irqchip@0x0000300022040000-3 hwirq: 0xc8000000 chip: ITS-MSI flags: 0x20 The irq-check.sh can help to collect hwirq and chip name from /sys/kernel/debug/irq/irqs/* and print error log when find duplicate hwirq per chip name. Kernel patch ("PCI/MSI: Fix MSI hwirq truncation") [1] fix above issue. [1]: https://lore.kernel.org/all/20240115135649.708536-1-vidyas@nvidia.com/ Signed-off-by: Joseph Jang <jjang(a)nvidia.com> Reviewed-by: Matthew R. Ochs <mochs(a)nvidia.com> --- tools/testing/selftests/drivers/irq/Makefile | 5 +++ tools/testing/selftests/drivers/irq/config | 2 + .../selftests/drivers/irq/irq-check.sh | 39 +++++++++++++++++++ 3 files changed, 46 insertions(+) create mode 100644 tools/testing/selftests/drivers/irq/Makefile create mode 100644 tools/testing/selftests/drivers/irq/config create mode 100755 tools/testing/selftests/drivers/irq/irq-check.sh diff --git a/tools/testing/selftests/drivers/irq/Makefile b/tools/testing/selftests/drivers/irq/Makefile new file mode 100644 index 000000000000..d6998017c861 --- /dev/null +++ b/tools/testing/selftests/drivers/irq/Makefile @@ -0,0 +1,5 @@ +# SPDX-License-Identifier: GPL-2.0 + +TEST_PROGS := irq-check.sh + +include ../../lib.mk diff --git a/tools/testing/selftests/drivers/irq/config b/tools/testing/selftests/drivers/irq/config new file mode 100644 index 000000000000..a53d3b713728 --- /dev/null +++ b/tools/testing/selftests/drivers/irq/config @@ -0,0 +1,2 @@ +CONFIG_GENERIC_IRQ_DEBUGFS=y +CONFIG_GENERIC_IRQ_INJECTION=y diff --git a/tools/testing/selftests/drivers/irq/irq-check.sh b/tools/testing/selftests/drivers/irq/irq-check.sh new file mode 100755 index 000000000000..e784777043a1 --- /dev/null +++ b/tools/testing/selftests/drivers/irq/irq-check.sh @@ -0,0 +1,39 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +# This script need root permission +uid=$(id -u) +if [ $uid -ne 0 ]; then + echo "SKIP: Must be run as root" + exit 4 +fi + +# Ensure debugfs is mounted +mount -t debugfs nodev /sys/kernel/debug 2>/dev/null +if [ ! -d "/sys/kernel/debug/irq/irqs" ]; then + echo "SKIP: irq debugfs not found" + exit 4 +fi + +# Traverse the irq debug file system directory to collect chip_name and hwirq +hwirq_list=$(for irq_file in /sys/kernel/debug/irq/irqs/*; do + # Read chip name and hwirq from the irq_file + chip_name=$(cat "$irq_file" | grep -m 1 'chip:' | awk '{print $2}') + hwirq=$(cat "$irq_file" | grep -m 1 'hwirq:' | awk '{print $2}' ) + + if [ -z "$chip_name" ] || [ -z "$hwirq" ]; then + continue + fi + + echo "$chip_name $hwirq" +done) + +dup_hwirq_list=$(echo "$hwirq_list" | sort | uniq -cd) + +if [ -n "$dup_hwirq_list" ]; then + echo "ERROR: Found duplicate hwirq" + echo "$dup_hwirq_list" + exit 1 +fi + +exit 0 -- 2.34.1

1 year

4
6
0 0

[PATCH RFT v12 0/8] fork: Support shadow stacks in clone3()

by Mark Brown

The kernel has recently added support for shadow stacks, currently x86 only using their CET feature but both arm64 and RISC-V have equivalent features (GCS and Zicfiss respectively), I am actively working on GCS[1]. With shadow stacks the hardware maintains an additional stack containing only the return addresses for branch instructions which is not generally writeable by userspace and ensures that any returns are to the recorded addresses. This provides some protection against ROP attacks and making it easier to collect call stacks. These shadow stacks are allocated in the address space of the userspace process. Our API for shadow stacks does not currently offer userspace any flexiblity for managing the allocation of shadow stacks for newly created threads, instead the kernel allocates a new shadow stack with the same size as the normal stack whenever a thread is created with the feature enabled. The stacks allocated in this way are freed by the kernel when the thread exits or shadow stacks are disabled for the thread. This lack of flexibility and control isn't ideal, in the vast majority of cases the shadow stack will be over allocated and the implicit allocation and deallocation is not consistent with other interfaces. As far as I can tell the interface is done in this manner mainly because the shadow stack patches were in development since before clone3() was implemented. Since clone3() is readily extensible let's add support for specifying a shadow stack when creating a new thread or process, keeping the current implicit allocation behaviour if one is not specified either with clone3() or through the use of clone(). The user must provide a shadow stack pointer, this must point to memory mapped for use as a shadow stackby map_shadow_stack() with an architecture specified shadow stack token at the top of the stack. Please note that the x86 portions of this code are build tested only, I don't appear to have a system that can run CET available to me. [1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87… Signed-off-by: Mark Brown <broonie(a)kernel.org> --- Changes in v12: - Add the regular prctl() to the userspace API document since arm64 support is queued in -next. - Link to v11: https://lore.kernel.org/r/20241005-clone3-shadow-stack-v11-0-2a6a2bd6d651@k… Changes in v11: - Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and integrate arm64 support. - Rework the interface to specify a shadow stack pointer rather than a base and size like we do for the regular stack. - Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@k… Changes in v10: - Integrate fixes & improvements for the x86 implementation from Rick Edgecombe. - Require that the shadow stack be VM_WRITE. - Require that the shadow stack base and size be sizeof(void *) aligned. - Clean up trailing newline. - Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@ke… Changes in v9: - Pull token validation earlier and report problems with an error return to parent rather than signal delivery to the child. - Verify that the top of the supplied shadow stack is VM_SHADOW_STACK. - Rework token validation to only do the page mapping once. - Drop no longer needed support for testing for signals in selftest. - Fix typo in comments. - Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@ke… Changes in v8: - Fix token verification with user specified shadow stack. - Don't track user managed shadow stacks for child processes. - Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@ke… Changes in v7: - Rebase onto v6.11-rc1. - Typo fixes. - Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@ke… Changes in v6: - Rebase onto v6.10-rc3. - Ensure we don't try to free the parent shadow stack in error paths of x86 arch code. - Spelling fixes in userspace API document. - Additional cleanups and improvements to the clone3() tests to support the shadow stack tests. - Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@ke… Changes in v5: - Rebase onto v6.8-rc2. - Rework ABI to have the user allocate the shadow stack memory with map_shadow_stack() and a token. - Force inlining of the x86 shadow stack enablement. - Move shadow stack enablement out into a shared header for reuse by other tests. - Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke… Changes in v4: - Formatting changes. - Use a define for minimum shadow stack size and move some basic validation to fork.c. - Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke… Changes in v3: - Rebase onto v6.7-rc2. - Remove stale shadow_stack in internal kargs. - If a shadow stack is specified unconditionally use it regardless of CLONE_ parameters. - Force enable shadow stacks in the selftest. - Update changelogs for RISC-V feature rename. - Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke… Changes in v2: - Rebase onto v6.7-rc1. - Remove ability to provide preallocated shadow stack, just specify the desired size. - Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke… --- Mark Brown (8): arm64/gcs: Return a success value from gcs_alloc_thread_stack() Documentation: userspace-api: Add shadow stack API documentation selftests: Provide helper header for shadow stack testing fork: Add shadow stack support to clone3() selftests/clone3: Remove redundant flushes of output streams selftests/clone3: Factor more of main loop into test_clone3() selftests/clone3: Allow tests to flag if -E2BIG is a valid error code selftests/clone3: Test shadow stack support Documentation/userspace-api/index.rst | 1 + Documentation/userspace-api/shadow_stack.rst | 42 ++++ arch/arm64/include/asm/gcs.h | 8 +- arch/arm64/kernel/process.c | 8 +- arch/arm64/mm/gcs.c | 62 +++++- arch/x86/include/asm/shstk.h | 11 +- arch/x86/kernel/process.c | 2 +- arch/x86/kernel/shstk.c | 57 +++++- include/asm-generic/cacheflush.h | 11 ++ include/linux/sched/task.h | 17 ++ include/uapi/linux/sched.h | 10 +- kernel/fork.c | 96 +++++++-- tools/testing/selftests/clone3/clone3.c | 226 ++++++++++++++++++---- tools/testing/selftests/clone3/clone3_selftests.h | 65 ++++++- tools/testing/selftests/ksft_shstk.h | 98 ++++++++++ 15 files changed, 633 insertions(+), 81 deletions(-) --- base-commit: d17cd7b7cc92d37ee8b2df8f975fc859a261f4dc change-id: 20231019-clone3-shadow-stack-15d40d2bf536 Best regards, -- Mark Brown <broonie(a)kernel.org>

1 year

4
14
0 0

[PATCH 1/2] exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case

by Tycho Andersen

From: Tycho Andersen <tandersen(a)netflix.com> Zbigniew mentioned at Linux Plumber's that systemd is interested in switching to execveat() for service execution, but can't, because the contents of /proc/pid/comm are the file descriptor which was used, instead of the path to the binary. This makes the output of tools like top and ps useless, especially in a world where most fds are opened CLOEXEC so the number is truly meaningless. Change exec path to fix up /proc/pid/comm in the case where we have allocated one of these synthetic paths in bprm_init(). This way the actual exec machinery is unchanged, but cosmetically the comm looks reasonable to admins investigating things. Signed-off-by: Tycho Andersen <tandersen(a)netflix.com> Suggested-by: Zbigniew Jędrzejewski-Szmek <zbyszek(a)in.waw.pl> CC: Aleksa Sarai <cyphar(a)cyphar.com> Link: https://github.com/uapi-group/kernel-features#set-comm-field-before-exec --- v2: * drop the flag, everyone :) * change the rendered value to f_path.dentry->d_name.name instead of argv[0], Eric v3: * fix up subject line, Eric v4: * switch to no flag, always rewrite approach, with some cleanup suggested by Kees --- fs/exec.c | 36 +++++++++++++++++++++++++++++++++++- include/linux/binfmts.h | 1 + 2 files changed, 36 insertions(+), 1 deletion(-) diff --git a/fs/exec.c b/fs/exec.c index 6c53920795c2..3b559f598c74 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1347,7 +1347,16 @@ int begin_new_exec(struct linux_binprm * bprm) set_dumpable(current->mm, SUID_DUMP_USER); perf_event_exec(); - __set_task_comm(me, kbasename(bprm->filename), true); + + /* + * If argv0 was set, alloc_bprm() made up a path that will + * probably not be useful to admins running ps or similar. + * Let's fix it up to be something reasonable. + */ + if (bprm->argv0) + __set_task_comm(me, kbasename(bprm->argv0), true); + else + __set_task_comm(me, kbasename(bprm->filename), true); /* An exec changes our domain. We are no longer part of the thread group */ @@ -1497,9 +1506,28 @@ static void free_bprm(struct linux_binprm *bprm) if (bprm->interp != bprm->filename) kfree(bprm->interp); kfree(bprm->fdpath); + kfree(bprm->argv0); kfree(bprm); } +static int bprm_add_fixup_comm(struct linux_binprm *bprm, + struct user_arg_ptr argv) +{ + const char __user *p = get_user_arg_ptr(argv, 0); + + /* + * If p == NULL, let's just fall back to fdpath. + */ + if (!p) + return 0; + + bprm->argv0 = strndup_user(p, MAX_ARG_STRLEN); + if (bprm->argv0) + return 0; + + return -EFAULT; +} + static struct linux_binprm *alloc_bprm(int fd, struct filename *filename, int flags) { struct linux_binprm *bprm; @@ -1906,6 +1934,12 @@ static int do_execveat_common(int fd, struct filename *filename, goto out_ret; } + if (unlikely(bprm->fdpath)) { + retval = bprm_add_fixup_comm(bprm, argv); + if (retval != 0) + goto out_free; + } + retval = count(argv, MAX_ARG_STRINGS); if (retval == 0) pr_warn_once("process '%s' launched '%s' with NULL argv: empty string added\n", diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index e6c00e860951..bab5121a746b 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -55,6 +55,7 @@ struct linux_binprm { of the time same as filename, but could be different for binfmt_{misc,script} */ const char *fdpath; /* generated filename for execveat */ + const char *argv0; /* argv0 from execveat */ unsigned interp_flags; int execfd; /* File descriptor of the executable */ unsigned long loader, exec; base-commit: c1e939a21eb111a6d6067b38e8e04b8809b64c4e -- 2.34.1

1 year

5
8
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-kselftest-mirror October 2024