Linux-kselftest-mirror

linux-kselftest-mirror@lists.linaro.org


[PATCH v4 21/22] x86/fpu/xstate: Support dynamic user state in the signal handling path

by Chang S. Bae

Entering a signal handler, the kernel saves xstate in signal frame. The
dynamic user state is better to be saved only when used. fpu->state_mask
can help to exclude unused states.

Returning from a signal handler, XRSTOR re-initializes the excluded state
components.

Add a test case to verify in the signal handler that the signal frame
excludes AMX data when the signaled thread has initialized AMX state.

Signed-off-by: Chang S. Bae <chang.seok.bae(a)intel.com>
Reviewed-by: Len Brown <len.brown(a)intel.com>
Cc: x86(a)kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
---
Changes from v3:
* Removed 'no functional changes' in the changelog. (Borislav Petkov)

Changes from v1:
* Made it revertable (moved close to the end of the series).
* Included the test case.
---
 arch/x86/include/asm/fpu/internal.h |  2 +-
 tools/testing/selftests/x86/amx.c   | 66 +++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index c467312d38d8..090eb5bb277b 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -354,7 +354,7 @@ static inline void copy_kernel_to_xregs(struct xregs_state *xstate, u64 mask)
  */
 static inline int copy_xregs_to_user(struct xregs_state __user *buf)
 {
-	u64 mask = xfeatures_mask_user();
+	u64 mask = current->thread.fpu.state_mask;
 	u32 lmask = mask;
 	u32 hmask = mask >> 32;
 	int err;
diff --git a/tools/testing/selftests/x86/amx.c b/tools/testing/selftests/x86/amx.c
index f4ecdfd27ae9..a7386b886532 100644
--- a/tools/testing/selftests/x86/amx.c
+++ b/tools/testing/selftests/x86/amx.c
@@ -650,6 +650,71 @@ static void test_ptrace(void)
 	test_tile_state_write(ptracee_loads_tiles);
 }
 
+/* Signal handling test */
+
+static int sigtrapped;
+struct tile_data sig_tiles, sighdl_tiles;
+
+static void handle_sigtrap(int sig, siginfo_t *info, void *ctx_void)
+{
+	ucontext_t *uctxt = (ucontext_t *)ctx_void;
+	struct xsave_data xdata;
+	struct tile_config cfg;
+	struct tile_data tiles;
+	u64 header;
+
+	header = __get_xsave_xstate_bv((void *)uctxt->uc_mcontext.fpregs);
+
+	if (header & (1 << XFEATURE_XTILE_DATA))
+		printf("[FAIL]\ttile data was written in sigframe\n");
+	else
+		printf("[OK]\ttile data was skipped in sigframe\n");
+
+	set_tilecfg(&cfg);
+	load_tilecfg(&cfg);
+	init_xdata(&xdata);
+
+	make_tiles(&tiles);
+	copy_tiles_to_xdata(&xdata, &tiles);
+	restore_xdata(&xdata);
+
+	save_xdata(&xdata);
+	if (compare_xdata_tiles(&xdata, &tiles))
+		err(1, "tile load file");
+
+	printf("\tsignal handler: load tile data\n");
+
+	sigtrapped = sig;
+}
+
+static void test_signal_handling(void)
+{
+	struct xsave_data xdata = { 0 };
+	struct tile_data tiles = { 0 };
+
+	sethandler(SIGTRAP, handle_sigtrap, 0);
+	sigtrapped = 0;
+
+	printf("[RUN]\tCheck tile state management in handling signal\n");
+
+	printf("\tbefore signal: initial tile data state\n");
+
+	raise(SIGTRAP);
+
+	if (sigtrapped == 0)
+		err(1, "sigtrap");
+
+	save_xdata(&xdata);
+	if (compare_xdata_tiles(&xdata, &tiles)) {
+		printf("[FAIL]\ttile data was not loaded at sigreturn\n");
+		nerrs++;
+	} else {
+		printf("[OK]\ttile data was re-initialized at sigreturn\n");
+	}
+
+	clearhandler(SIGTRAP);
+}
+
 int main(void)
 {
 	/* Check hardware availability at first */
@@ -672,6 +737,7 @@ int main(void)
 	test_fork();
 	test_context_switch();
 	test_ptrace();
+	test_signal_handling();
 
 	return nerrs ? 1 : 0;
 }
--
2.17.1

4 years, 10 months
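
The check that handle_sigtrap() makes on the signal frame in the patch above can be tried outside the kernel tree. The following is a minimal sketch and not part of the patch: xsave_feature_present() is a hypothetical helper name, and the sketch assumes only the standard XSAVE layout, in which the 64-bit XSTATE_BV field sits at byte offset 512 and bit 18 marks AMX tile data.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define XSAVE_HDR_OFFSET	512	/* the legacy FXSAVE region precedes the header */
#define XFEATURE_XTILE_DATA	18

/* Hypothetical helper: true if 'feature' is marked present in XSTATE_BV. */
static bool xsave_feature_present(const void *xsave_area, unsigned int feature)
{
	uint64_t xstate_bv;

	/* memcpy avoids alignment and aliasing assumptions about the buffer */
	memcpy(&xstate_bv, (const uint8_t *)xsave_area + XSAVE_HDR_OFFSET,
	       sizeof(xstate_bv));

	/* 64-bit shift, so the same helper also works for features above bit 31 */
	return xstate_bv & (UINT64_C(1) << feature);
}
```

In an SA_SIGINFO handler on x86-64 Linux, the area to pass would be uctxt->uc_mcontext.fpregs, as the test case does. With the kernel change applied, the expectation stated in the changelog is that the bit is clear when the signaled thread still has its tile data in the initial state.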

[PATCH v4 20/22] selftest/x86/amx: Include test cases for the AMX state management

by Chang S. Bae

This selftest exercises the kernel's behavior not to inherit AMX state and
the ability to switch the context by verifying that they retain unique
data between multiple threads.

Also, ptrace() is used to insert AMX state into existing threads -- both
before and after the existing thread has initialized its AMX state.

Collect the test cases of validating those operations together, as they
share some common setup for the AMX state.

These test cases do not depend on AMX compiler support, as they employ
userspace-XSAVE directly to access AMX state.

Signed-off-by: Chang S. Bae <chang.seok.bae(a)intel.com>
Reviewed-by: Len Brown <len.brown(a)intel.com>
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
---
Changes from v2:
* Updated the test messages and the changelog as tile data is not inherited
  to a child anymore.
* Removed bytecode for the instructions already supported by binutils.
* Changed to check the XSAVE availability in a reliable way.

Changes from v1:
* Removed signal testing code
---
 tools/testing/selftests/x86/Makefile |   2 +-
 tools/testing/selftests/x86/amx.c    | 677 +++++++++++++++++++++++++++
 2 files changed, 678 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/amx.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 333980375bc7..2f7feb03867b 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -17,7 +17,7 @@ TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap
 TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
-TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering
+TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering amx
 # Some selftests require 32bit support enabled also on 64bit systems
 TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall
 
diff --git a/tools/testing/selftests/x86/amx.c b/tools/testing/selftests/x86/amx.c
new file mode 100644
index 000000000000..f4ecdfd27ae9
--- /dev/null
+++ b/tools/testing/selftests/x86/amx.c
@@ -0,0 +1,677 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <err.h>
+#include <elf.h>
+#include <pthread.h>
+#include <sched.h>
+#include <setjmp.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <time.h>
+#include <malloc.h>
+#include <unistd.h>
+#include <ucontext.h>
+
+#include <linux/futex.h>
+
+#include <sys/ipc.h>
+#include <sys/mman.h>
+#include <sys/ptrace.h>
+#include <sys/shm.h>
+#include <sys/signal.h>
+#include <sys/syscall.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/uio.h>
+#include <sys/ucontext.h>
+
+#include <x86intrin.h>
+
+#ifndef __x86_64__
+# error This test is 64-bit only
+#endif
+
+typedef uint8_t u8;
+typedef uint16_t u16;
+typedef uint32_t u32;
+typedef uint64_t u64;
+
+#define PAGE_SIZE			(1 << 12)
+
+#define NUM_TILES			8
+#define TILE_SIZE			1024
+#define XSAVE_SIZE			((NUM_TILES * TILE_SIZE) + PAGE_SIZE)
+
+struct xsave_data {
+	u8 area[XSAVE_SIZE];
+} __attribute__((aligned(64)));
+
+/* Tile configuration associated: */
+#define MAX_TILES			16
+#define RESERVED_BYTES			14
+
+struct tile_config {
+	u8  palette_id;
+	u8  start_row;
+	u8  reserved[RESERVED_BYTES];
+	u16 colsb[MAX_TILES];
+	u8  rows[MAX_TILES];
+};
+
+struct tile_data {
+	u8 data[NUM_TILES * TILE_SIZE];
+};
+
+static inline u64 __xgetbv(u32 index)
+{
+	u32 eax, edx;
+
+	asm volatile("xgetbv;"
+		     : "=a" (eax), "=d" (edx)
+		     : "c" (index));
+	return eax + ((u64)edx << 32);
+}
+
+static inline void __cpuid(u32 *eax, u32 *ebx, u32 *ecx, u32 *edx)
+{
+	asm volatile("cpuid;"
+		     : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
+		     : "0" (*eax), "2" (*ecx));
+}
+
+/* Load tile configuration */
+static inline void __ldtilecfg(void *cfg)
+{
+	asm volatile(".byte 0xc4,0xe2,0x78,0x49,0x00"
+		     : : "a"(cfg));
+}
+
+/* Load tile data to %tmm0 register only */
+static inline void __tileloadd(void *tile)
+{
+	asm volatile(".byte 0xc4,0xe2,0x7b,0x4b,0x04,0x10"
+		     : : "a"(tile), "d"(0));
+}
+
+/* Save extended states */
+static inline void __xsave(void *buffer, u32 lo, u32 hi)
+{
+	asm volatile("xsave (%%rdi)"
+		     : : "D" (buffer), "a" (lo), "d" (hi)
+		     : "memory");
+}
+
+/* Restore extended states */
+static inline void __xrstor(void *buffer, u32 lo, u32 hi)
+{
+	asm volatile("xrstor (%%rdi)"
+		     : : "D" (buffer), "a" (lo), "d" (hi));
+}
+
+/* Release tile states to init values */
+static inline void __tilerelease(void)
+{
+	asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0" ::);
+}
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
+		       int flags)
+{
+	struct sigaction sa;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_sigaction = handler;
+	sa.sa_flags = SA_SIGINFO | flags;
+	sigemptyset(&sa.sa_mask);
+	if (sigaction(sig, &sa, 0))
+		err(1, "sigaction");
+}
+
+static void clearhandler(int sig)
+{
+	struct sigaction sa;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_handler = SIG_DFL;
+	sigemptyset(&sa.sa_mask);
+	if (sigaction(sig, &sa, 0))
+		err(1, "sigaction");
+}
+
+/* Hardware info check: */
+
+static jmp_buf jmpbuf;
+static bool xsave_disabled;
+
+static void handle_sigill(int sig, siginfo_t *si, void *ctx_void)
+{
+	xsave_disabled = true;
+	siglongjmp(jmpbuf, 1);
+}
+
+#define XFEATURE_XTILE_CFG		17
+#define XFEATURE_XTILE_DATA		18
+#define XFEATURE_MASK_XTILE		((1 << XFEATURE_XTILE_DATA) | \
+					 (1 << XFEATURE_XTILE_CFG))
+
+static inline bool check_xsave_supports_xtile(void)
+{
+	bool supported = false;
+
+	sethandler(SIGILL, handle_sigill, 0);
+
+	if (!sigsetjmp(jmpbuf, 1))
+		supported = __xgetbv(0) & XFEATURE_MASK_XTILE;
+
+	clearhandler(SIGILL);
+	return supported;
+}
+
+struct xtile_hwinfo {
+	struct {
+		u16 bytes_per_tile;
+		u16 bytes_per_row;
+		u16 max_names;
+		u16 max_rows;
+	} spec;
+
+	struct {
+		u32 offset;
+		u32 size;
+	} xsave;
+};
+
+static struct xtile_hwinfo xtile;
+
+static bool __enum_xtile_config(void)
+{
+	u32 eax, ebx, ecx, edx;
+	u16 bytes_per_tile;
+	bool valid = false;
+
+#define TILE_CPUID			0x1d
+#define TILE_PALETTE_CPUID_SUBLEAVE	0x1
+
+	eax = TILE_CPUID;
+	ecx = TILE_PALETTE_CPUID_SUBLEAVE;
+
+	__cpuid(&eax, &ebx, &ecx, &edx);
+	if (!eax || !ebx || !ecx)
+		return valid;
+
+	xtile.spec.max_names = ebx >> 16;
+	if (xtile.spec.max_names < NUM_TILES)
+		return valid;
+
+	bytes_per_tile = eax >> 16;
+	if (bytes_per_tile < TILE_SIZE)
+		return valid;
+
+	xtile.spec.bytes_per_row = ebx;
+	xtile.spec.max_rows = ecx;
+	valid = true;
+
+	return valid;
+}
+
+static bool __enum_xsave_tile(void)
+{
+	u32 eax, ebx, ecx, edx;
+	bool valid = false;
+
+#define XSTATE_CPUID			0xd
+#define XSTATE_USER_STATE_SUBLEAVE	0x0
+
+	eax = XSTATE_CPUID;
+	ecx = XFEATURE_XTILE_DATA;
+
+	__cpuid(&eax, &ebx, &ecx, &edx);
+	if (!eax || !ebx)
+		return valid;
+
+	xtile.xsave.offset = ebx;
+	xtile.xsave.size = eax;
+	valid = true;
+
+	return valid;
+}
+
+static bool __check_xsave_size(void)
+{
+	u32 eax, ebx, ecx, edx;
+	bool valid = false;
+
+	eax = XSTATE_CPUID;
+	ecx = XSTATE_USER_STATE_SUBLEAVE;
+
+	__cpuid(&eax, &ebx, &ecx, &edx);
+	if (ebx && ebx <= XSAVE_SIZE)
+		valid = true;
+
+	return valid;
+}
+
+/*
+ * Check the hardware-provided tile state info and cross-check it with the
+ * hard-coded values: XSAVE_SIZE, NUM_TILES, and TILE_SIZE.
+ */
+static int check_xtile_hwinfo(void)
+{
+	bool success = false;
+
+	if (!__check_xsave_size())
+		return success;
+
+	if (!__enum_xsave_tile())
+		return success;
+
+	if (!__enum_xtile_config())
+		return success;
+
+	if (sizeof(struct tile_data) >= xtile.xsave.size)
+		success = true;
+
+	return success;
+}
+
+/* The helpers for managing XSAVE buffer and tile states: */
+
+/* Use the uncompacted format without 'init optimization' */
+static void save_xdata(void *data)
+{
+	__xsave(data, -1, -1);
+}
+
+static void restore_xdata(void *data)
+{
+	__xrstor(data, -1, -1);
+}
+
+static inline u64 __get_xsave_xstate_bv(void *data)
+{
+#define XSAVE_HDR_OFFSET	512
+	return *(u64 *)(data + XSAVE_HDR_OFFSET);
+}
+
+static void set_tilecfg(struct tile_config *cfg)
+{
+	int i;
+
+	memset(cfg, 0, sizeof(*cfg));
+	/* The first implementation has one significant palette with id 1 */
+	cfg->palette_id = 1;
+	for (i = 0; i < xtile.spec.max_names; i++) {
+		cfg->colsb[i] = xtile.spec.bytes_per_row;
+		cfg->rows[i] = xtile.spec.max_rows;
+	}
+}
+
+static void load_tilecfg(struct tile_config *cfg)
+{
+	__ldtilecfg(cfg);
+}
+
+static void make_tiles(void *tiles)
+{
+	u32 iterations = xtile.xsave.size / sizeof(u32);
+	static u32 value = 1;
+	u32 *ptr = tiles;
+	int i;
+
+	for (i = 0, ptr = tiles; i < iterations; i++, ptr++)
+		*ptr = value;
+	value++;
+}
+
+/*
+ * Initialize the XSAVE buffer:
+ *
+ * Make sure tile configuration loaded already. Load limited tile data (%tmm0 only)
+ * and save all the states. XSAVE buffer is ready to complete tile data.
+ */
+static void init_xdata(void *data)
+{
+	struct tile_data tiles;
+
+	make_tiles(&tiles);
+	__tileloadd(&tiles);
+	__xsave(data, -1, -1);
+}
+
+static inline void *__get_xsave_tile_data_addr(void *data)
+{
+	return data + xtile.xsave.offset;
+}
+
+static void copy_tiles_to_xdata(void *xdata, void *tiles)
+{
+	void *dst = __get_xsave_tile_data_addr(xdata);
+
+	memcpy(dst, tiles, xtile.xsave.size);
+}
+
+static int compare_xdata_tiles(void *xdata, void *tiles)
+{
+	void *tile_data = __get_xsave_tile_data_addr(xdata);
+
+	if (memcmp(tile_data, tiles, xtile.xsave.size))
+		return 1;
+
+	return 0;
+}
+
+static int nerrs, errs;
+
+/* Testing tile data inheritance */
+
+static void test_tile_data_inheritance(void)
+{
+	struct xsave_data xdata;
+	struct tile_data tiles;
+	struct tile_config cfg;
+	pid_t child;
+	int status;
+
+	set_tilecfg(&cfg);
+	load_tilecfg(&cfg);
+	init_xdata(&xdata);
+
+	make_tiles(&tiles);
+	copy_tiles_to_xdata(&xdata, &tiles);
+	restore_xdata(&xdata);
+
+	errs = 0;
+
+	child = fork();
+	if (child < 0)
+		err(1, "fork");
+
+	if (child == 0) {
+		memset(&xdata, 0, sizeof(xdata));
+		save_xdata(&xdata);
+		if (compare_xdata_tiles(&xdata, &tiles)) {
+			printf("[OK]\tchild didn't inherit tile data at fork()\n");
+		} else {
+			printf("[FAIL]\tchild inherited tile data at fork()\n");
+			nerrs++;
+		}
+		_exit(0);
+	}
+	wait(&status);
+}
+
+static void test_fork(void)
+{
+	pid_t child;
+	int status;
+
+	child = fork();
+	if (child < 0)
+		err(1, "fork");
+
+	if (child == 0) {
+		test_tile_data_inheritance();
+		_exit(0);
+	}
+
+	wait(&status);
+}
+
+/* Context switching test */
+
+#define ITERATIONS			10
+#define NUM_THREADS			5
+
+struct futex_info {
+	int current;
+	int next;
+	int *futex;
+};
+
+static inline void command_wait(struct futex_info *info, int value)
+{
+	do {
+		sched_yield();
+	} while (syscall(SYS_futex, info->futex, FUTEX_WAIT, value, 0, 0, 0));
+}
+
+static inline void command_wake(struct futex_info *info, int value)
+{
+	do {
+		*info->futex = value;
+		while (!syscall(SYS_futex, info->futex, FUTEX_WAKE, 1, 0, 0, 0))
+			sched_yield();
+	} while (0);
+}
+
+static inline int get_iterative_value(int id)
+{
+	return ((id << 1) & ~0x1);
+}
+
+static inline int get_endpoint_value(int id)
+{
+	return ((id << 1) | 0x1);
+}
+
+static void *check_tiles(void *info)
+{
+	struct futex_info *finfo = (struct futex_info *)info;
+	struct xsave_data xdata;
+	struct tile_data tiles;
+	struct tile_config cfg;
+	int i;
+
+	set_tilecfg(&cfg);
+	load_tilecfg(&cfg);
+	init_xdata(&xdata);
+
+	make_tiles(&tiles);
+	copy_tiles_to_xdata(&xdata, &tiles);
+	restore_xdata(&xdata);
+
+	for (i = 0; i < ITERATIONS; i++) {
+		command_wait(finfo, get_iterative_value(finfo->current));
+
+		memset(&xdata, 0, sizeof(xdata));
+		save_xdata(&xdata);
+		errs += compare_xdata_tiles(&xdata, &tiles);
+
+		make_tiles(&tiles);
+		copy_tiles_to_xdata(&xdata, &tiles);
+		restore_xdata(&xdata);
+
+		command_wake(finfo, get_iterative_value(finfo->next));
+	}
+
+	command_wait(finfo, get_endpoint_value(finfo->current));
+	__tilerelease();
+	return NULL;
+}
+
+static int create_children(int num, struct futex_info *finfo)
+{
+	const int shm_id = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0666);
+	int *futex = shmat(shm_id, NULL, 0);
+	pthread_t thread;
+	int i;
+
+	for (i = 0; i < num; i++) {
+		finfo[i].futex = futex;
+		finfo[i].current = i + 1;
+		finfo[i].next = (i + 2) % (num + 1);
+
+		if (pthread_create(&thread, NULL, check_tiles, &finfo[i])) {
+			err(1, "pthread_create");
+			return 1;
+		}
+	}
+	return 0;
+}
+
+static void test_context_switch(void)
+{
+	struct futex_info *finfo;
+	cpu_set_t cpuset;
+	int i;
+
+	printf("[RUN]\t%u context switches of tile states in %d threads\n",
+	       ITERATIONS * NUM_THREADS, NUM_THREADS);
+
+	errs = 0;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(0, &cpuset);
+	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0)
+		err(1, "sched_setaffinity to CPU 0");
+
+	finfo = malloc(sizeof(*finfo) * NUM_THREADS);
+
+	if (create_children(NUM_THREADS, finfo))
+		return;
+
+	for (i = 0; i < ITERATIONS; i++) {
+		command_wake(finfo, get_iterative_value(1));
+		command_wait(finfo, get_iterative_value(0));
+	}
+
+	for (i = 1; i <= NUM_THREADS; i++)
+		command_wake(finfo, get_endpoint_value(i));
+
+	if (errs) {
+		printf("[FAIL]\t%u incorrect tile states\n", errs);
+		nerrs += errs;
+		return;
+	}
+
+	printf("[OK]\tall tile states are correct\n");
+}
+
+/* Ptrace test */
+
+static inline long get_tile_state(pid_t child, struct iovec *iov)
+{
+	return ptrace(PTRACE_GETREGSET, child, (u32)NT_X86_XSTATE, iov);
+}
+
+static inline long set_tile_state(pid_t child, struct iovec *iov)
+{
+	return ptrace(PTRACE_SETREGSET, child, (u32)NT_X86_XSTATE, iov);
+}
+
+static int write_tile_state(bool load_tile, pid_t child)
+{
+	struct xsave_data xdata;
+	struct tile_data tiles;
+	struct iovec iov;
+
+	iov.iov_base = &xdata;
+	iov.iov_len = sizeof(xdata);
+
+	if (get_tile_state(child, &iov))
+		err(1, "PTRACE_GETREGSET");
+
+	make_tiles(&tiles);
+	copy_tiles_to_xdata(&xdata, &tiles);
+	if (set_tile_state(child, &iov))
+		err(1, "PTRACE_SETREGSET");
+
+	memset(&xdata, 0, sizeof(xdata));
+	if (get_tile_state(child, &iov))
+		err(1, "PTRACE_GETREGSET");
+
+	if (!load_tile)
+		memset(&tiles, 0, sizeof(tiles));
+
+	return compare_xdata_tiles(&xdata, &tiles);
+}
+
+static void test_tile_state_write(bool load_tile)
+{
+	pid_t child;
+	int status;
+
+	child = fork();
+	if (child < 0)
+		err(1, "fork");
+
+	if (child == 0) {
+		printf("[RUN]\tPtrace-induced tile state write, ");
+		printf("%s tile data loaded\n", load_tile ? "with" : "without");
+
+		if (ptrace(PTRACE_TRACEME, 0, NULL, NULL))
+			err(1, "PTRACE_TRACEME");
+
+		if (load_tile) {
+			struct tile_config cfg;
+			struct tile_data tiles;
+
+			set_tilecfg(&cfg);
+			load_tilecfg(&cfg);
+			make_tiles(&tiles);
+			/* Load only %tmm0 but inducing the #NM */
+			__tileloadd(&tiles);
+		}
+
+		raise(SIGTRAP);
+		_exit(0);
+	}
+
+	do {
+		wait(&status);
+	} while (WSTOPSIG(status) != SIGTRAP);
+
+	errs = write_tile_state(load_tile, child);
+	if (errs) {
+		nerrs++;
+		printf("[FAIL]\t%s write\n", load_tile ? "incorrect" : "unexpected");
+	} else {
+		printf("[OK]\t%s write\n", load_tile ? "correct" : "no");
+	}
+
+	ptrace(PTRACE_DETACH, child, NULL, NULL);
+	wait(&status);
+}
+
+static void test_ptrace(void)
+{
+	bool ptracee_loads_tiles;
+
+	ptracee_loads_tiles = true;
+	test_tile_state_write(ptracee_loads_tiles);
+
+	ptracee_loads_tiles = false;
+	test_tile_state_write(ptracee_loads_tiles);
+}
+
+int main(void)
+{
+	/* Check hardware availability at first */
+
+	if (!check_xsave_supports_xtile()) {
+		if (xsave_disabled)
+			printf("XSAVE disabled.\n");
+		else
+			printf("Tile data not available.\n");
+		return 0;
+	}
+
+	if (!check_xtile_hwinfo()) {
+		printf("Available tile state size is insufficient to test.\n");
+		return 0;
+	}
+
+	nerrs = 0;
+
+	test_fork();
+	test_context_switch();
+	test_ptrace();
+
+	return nerrs ? 1 : 0;
+}
--
2.17.1

4 years, 10 months
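
The 64-byte tile configuration block that set_tilecfg() fills in the selftest above can be looked at in isolation. The version below is an illustration only: fill_palette1_cfg() and the P1_* constants are hypothetical names, and the sketch hard-codes the palette 1 geometry (8 tiles, 16 rows of 64 bytes each) as an assumption, whereas the selftest reads those values from CPUID leaf 0x1d.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_TILES	16
#define RESERVED_BYTES	14

/* Same field order as struct tile_config in the selftest. */
struct tile_config {
	uint8_t  palette_id;
	uint8_t  start_row;
	uint8_t  reserved[RESERVED_BYTES];
	uint16_t colsb[MAX_TILES];	/* bytes per row, one entry per tile */
	uint8_t  rows[MAX_TILES];	/* number of rows, one entry per tile */
};

/* Assumed palette 1 geometry; the selftest enumerates it via CPUID instead. */
enum { P1_TILES = 8, P1_ROWS = 16, P1_BYTES_PER_ROW = 64 };

/* Hypothetical helper: configure every palette 1 tile at its full size. */
static void fill_palette1_cfg(struct tile_config *cfg)
{
	memset(cfg, 0, sizeof(*cfg));
	cfg->palette_id = 1;
	for (int i = 0; i < P1_TILES; i++) {
		cfg->colsb[i] = P1_BYTES_PER_ROW;
		cfg->rows[i] = P1_ROWS;
	}
}
```

With these assumed values each tile holds 16 * 64 = 1024 bytes, which matches TILE_SIZE in the selftest, and the eight tiles together match sizeof(struct tile_data).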

[PATCH v28 00/12] Landlock LSM

by Mickaël Salaün

Hi,

This patch series fixes a corner-case with non-overlapping access rights
coming from different layers. This is now handled in a generic way and
verified with new tests. A stricter check is enforced for
landlock_add_rule(2) to forbid useless rules. Finally, the previous
landlock_enforce_ruleset_self(2) is renamed to landlock_restrict_self(2),
which is more consistent.

The SLOC count is 1314 for security/landlock/ and 2484 for
tools/testing/selftest/landlock/ . Test coverage for security/landlock/ is
94.7% of lines. The code not covered only deals with internal kernel
errors (e.g. memory allocation) and race conditions. This series is being
fuzzed by syzkaller, and patches are on their way:
https://github.com/google/syzkaller/pull/2380

The compiled documentation is available here:
https://landlock.io/linux-doc/landlock-v28/userspace-api/landlock.html

This series can be applied on top of v5.11-rc6 . This can be tested with
CONFIG_SECURITY_LANDLOCK, CONFIG_SAMPLE_LANDLOCK and by prepending
"landlock," to CONFIG_LSM. This patch series can be found in a Git
repository here:
https://github.com/landlock-lsm/linux/commits/landlock-v28
This patch series seems ready for upstream and I would really appreciate
final reviews.


# Landlock LSM

The goal of Landlock is to enable to restrict ambient rights (e.g. global
filesystem access) for a set of processes. Because Landlock is a stackable
LSM [1], it makes possible to create safe security sandboxes as new
security layers in addition to the existing system-wide access-controls.
This kind of sandbox is expected to help mitigate the security impact of
bugs or unexpected/malicious behaviors in user-space applications.
Landlock empowers any process, including unprivileged ones, to securely
restrict themselves.

Landlock is inspired by seccomp-bpf but instead of filtering syscalls and
their raw arguments, a Landlock rule can restrict the use of kernel
objects like file hierarchies, according to the kernel semantic. Landlock
also takes inspiration from other OS sandbox mechanisms: XNU Sandbox,
FreeBSD Capsicum or OpenBSD Pledge/Unveil.

In this current form, Landlock misses some access-control features. This
enables to minimize this patch series and ease review. This series still
addresses multiple use cases, especially with the combined use of
seccomp-bpf: applications with built-in sandboxing, init systems, security
sandbox tools and security-oriented APIs [2].

Previous version:
https://lore.kernel.org/lkml/20210121205119.793296-1-mic@digikod.net/

[1] https://lore.kernel.org/lkml/50db058a-7dde-441b-a7f9-f6837fe8b69f@schaufler…
[2] https://lore.kernel.org/lkml/f646e1c7-33cf-333f-070c-0a40ad0468cd@digikod.n…

Casey Schaufler (1):
  LSM: Infrastructure management of the superblock

Mickaël Salaün (11):
  landlock: Add object management
  landlock: Add ruleset and domain management
  landlock: Set up the security framework and manage credentials
  landlock: Add ptrace restrictions
  fs,security: Add sb_delete hook
  landlock: Support filesystem access-control
  landlock: Add syscall implementations
  arch: Wire up Landlock syscalls
  selftests/landlock: Add user space tests
  samples/landlock: Add a sandbox manager example
  landlock: Add user and kernel documentation

 Documentation/security/index.rst | 1 +
 Documentation/security/landlock.rst | 79 +
 Documentation/userspace-api/index.rst | 1 +
 Documentation/userspace-api/landlock.rst | 307 ++
 MAINTAINERS | 15 +
 arch/Kconfig | 7 +
 arch/alpha/kernel/syscalls/syscall.tbl | 3 +
 arch/arm/tools/syscall.tbl | 3 +
 arch/arm64/include/asm/unistd.h | 2 +-
 arch/arm64/include/asm/unistd32.h | 6 +
 arch/ia64/kernel/syscalls/syscall.tbl | 3 +
 arch/m68k/kernel/syscalls/syscall.tbl | 3 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 3 +
 arch/mips/kernel/syscalls/syscall_n32.tbl | 3 +
 arch/mips/kernel/syscalls/syscall_n64.tbl | 3 +
 arch/mips/kernel/syscalls/syscall_o32.tbl | 3 +
 arch/parisc/kernel/syscalls/syscall.tbl | 3 +
 arch/powerpc/kernel/syscalls/syscall.tbl | 3 +
 arch/s390/kernel/syscalls/syscall.tbl | 3 +
 arch/sh/kernel/syscalls/syscall.tbl | 3 +
 arch/sparc/kernel/syscalls/syscall.tbl | 3 +
 arch/um/Kconfig | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl | 3 +
 arch/x86/entry/syscalls/syscall_64.tbl | 3 +
 arch/xtensa/kernel/syscalls/syscall.tbl | 3 +
 fs/super.c | 1 +
 include/linux/lsm_hook_defs.h | 1 +
 include/linux/lsm_hooks.h | 3 +
 include/linux/security.h | 4 +
 include/linux/syscalls.h | 7 +
 include/uapi/asm-generic/unistd.h | 8 +-
 include/uapi/linux/landlock.h | 128 +
 kernel/sys_ni.c | 5 +
 samples/Kconfig | 7 +
 samples/Makefile | 1 +
 samples/landlock/.gitignore | 1 +
 samples/landlock/Makefile | 13 +
 samples/landlock/sandboxer.c | 238 ++
 security/Kconfig | 11 +-
 security/Makefile | 2 +
 security/landlock/Kconfig | 21 +
 security/landlock/Makefile | 4 +
 security/landlock/common.h | 20 +
 security/landlock/cred.c | 46 +
 security/landlock/cred.h | 58 +
 security/landlock/fs.c | 627 ++++
 security/landlock/fs.h | 56 +
 security/landlock/limits.h | 21 +
 security/landlock/object.c | 67 +
 security/landlock/object.h | 91 +
 security/landlock/ptrace.c | 120 +
 security/landlock/ptrace.h | 14 +
 security/landlock/ruleset.c | 473 +++
 security/landlock/ruleset.h | 165 +
 security/landlock/setup.c | 40 +
 security/landlock/setup.h | 18 +
 security/landlock/syscalls.c | 444 +++
 security/security.c | 51 +-
 security/selinux/hooks.c | 58 +-
 security/selinux/include/objsec.h | 6 +
 security/selinux/ss/services.c | 3 +-
 security/smack/smack.h | 6 +
 security/smack/smack_lsm.c | 35 +-
 tools/testing/selftests/Makefile | 1 +
 tools/testing/selftests/landlock/.gitignore | 2 +
 tools/testing/selftests/landlock/Makefile | 24 +
 tools/testing/selftests/landlock/base_test.c | 219 ++
 tools/testing/selftests/landlock/common.h | 169 ++
 tools/testing/selftests/landlock/config | 6 +
 tools/testing/selftests/landlock/fs_test.c | 2664 +++++++++++++++++
 .../testing/selftests/landlock/ptrace_test.c | 314 ++
 tools/testing/selftests/landlock/true.c | 5 +
 72 files changed, 6668 insertions(+), 77 deletions(-)
 create mode 100644 Documentation/security/landlock.rst
 create mode 100644 Documentation/userspace-api/landlock.rst
 create mode 100644 include/uapi/linux/landlock.h
 create mode 100644 samples/landlock/.gitignore
 create mode 100644 samples/landlock/Makefile
 create mode 100644 samples/landlock/sandboxer.c
 create mode 100644 security/landlock/Kconfig
 create mode 100644 security/landlock/Makefile
 create mode 100644 security/landlock/common.h
 create mode 100644 security/landlock/cred.c
 create mode 100644 security/landlock/cred.h
 create mode 100644 security/landlock/fs.c
 create mode 100644 security/landlock/fs.h
 create mode 100644 security/landlock/limits.h
 create mode 100644 security/landlock/object.c
 create mode 100644 security/landlock/object.h
 create mode 100644 security/landlock/ptrace.c
 create mode 100644 security/landlock/ptrace.h
 create mode 100644 security/landlock/ruleset.c
 create mode 100644 security/landlock/ruleset.h
 create mode 100644 security/landlock/setup.c
 create mode 100644 security/landlock/setup.h
 create mode 100644 security/landlock/syscalls.c
 create mode 100644 tools/testing/selftests/landlock/.gitignore
 create mode 100644 tools/testing/selftests/landlock/Makefile
 create mode 100644 tools/testing/selftests/landlock/base_test.c
 create mode 100644 tools/testing/selftests/landlock/common.h
 create mode 100644 tools/testing/selftests/landlock/config
 create mode 100644 tools/testing/selftests/landlock/fs_test.c
 create mode 100644 tools/testing/selftests/landlock/ptrace_test.c
 create mode 100644 tools/testing/selftests/landlock/true.c


base-commit: 1048ba83fb1c00cd24172e23e8263972f6b5d9ac
--
2.30.0

4 years, 10 months

[RFC PATCH 00/13] Add futex2 syscalls

by André Almeida

Hi,

This patch series introduces the futex2 syscalls.

* What happened to the current futex()?

For some years now, developers have been trying to add new features to
futex, but maintainers have been reluctant to accept them, given the
multiplexed interface full of legacy features and tricky to do big
changes. Some problems that people tried to address with patchsets are:
NUMA-awareness[0], smaller sized futexes[1], wait on multiple futexes[2].
NUMA, for instance, just doesn't fit the current API in a reasonable way.
Considering that, it's not possible to merge new features into the current
futex.

 ** The NUMA problem

 At the current implementation, all futex kernel side infrastructure is
 stored on a single node. Given that, all futex() calls issued by
 processors that aren't located on that node will have a memory access
 penalty when doing it.

 ** The 32bit sized futex problem

 Embedded systems or anything with memory constraints would benefit from
 using smaller sizes for the futex userspace integer. Also, a mutex
 implementation can be done using just three values, so 8 bits is enough
 for various scenarios.

 ** The wait on multiple problem

 The use case lies in the Wine implementation of the Windows NT interface
 WaitMultipleObjects. This Windows API function allows a thread to sleep
 waiting on the first of a set of event sources (mutexes, timers, signal,
 console input, etc) to signal. Considering this is a primitive
 synchronization operation for Windows applications, being able to quickly
 signal events on the producer side, and quickly go to sleep on the
 consumer side is essential for good performance of those running over Wine.
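
The remark that a mutex needs only three values can be made concrete with an 8-bit state word. The following is a minimal sketch under stated assumptions, not code from this series: mutex8_t, mutex8_lock() and mutex8_unlock() are hypothetical names, the three states follow the well-known unlocked / locked / locked-with-waiters scheme, and sched_yield() stands in for the kernel wait, since the existing futex() only operates on 32-bit words.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* The three values a one-byte mutex word can take. */
enum { UNLOCKED = 0, LOCKED = 1, CONTENDED = 2 };

typedef struct {
	_Atomic uint8_t state;
} mutex8_t;

static void mutex8_lock(mutex8_t *m)
{
	uint8_t expected = UNLOCKED;

	/* Fast path: uncontended acquire, UNLOCKED -> LOCKED. */
	if (atomic_compare_exchange_strong(&m->state, &expected, LOCKED))
		return;

	/*
	 * Slow path: advertise a waiter by storing CONTENDED. If the previous
	 * value was UNLOCKED, this thread now owns the lock. Otherwise it
	 * yields; an 8-bit futex wait on 'state == CONTENDED' would go here.
	 */
	while (atomic_exchange(&m->state, CONTENDED) != UNLOCKED)
		sched_yield();
}

/* Returns 1 if a waiter may be sleeping and a wake-up would be needed. */
static int mutex8_unlock(mutex8_t *m)
{
	return atomic_exchange(&m->state, UNLOCKED) == CONTENDED;
}
```

A caller would pair mutex8_unlock() returning 1 with a wake operation on the same byte; that wake is the part an 8-bit futex would have to provide.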
[0] https://lore.kernel.org/lkml/20160505204230.932454245@linutronix.de/
[1] https://lore.kernel.org/lkml/20191221155659.3159-2-malteskarupke@web.de/
[2] https://lore.kernel.org/lkml/20200213214525.183689-1-andrealmeid@collabora.…

* The solution

As proposed by Peter Zijlstra and Florian Weimer[3], a new interface is
required to solve this, which must be designed with those features in
mind. futex2() is that interface. As opposed to the current multiplexed
interface, the new one should have one syscall per operation. This will
allow the maintainability of the API if it gets extended, and will help
users with type checking of arguments.

In particular, the new interface is extended to support the ability to
wait on any of a list of futexes at a time, which could be seen as a
vectored extension of the FUTEX_WAIT semantics.

[3] https://lore.kernel.org/lkml/20200303120050.GC2596@hirez.programming.kicks-…

* The interface

The new interface can be seen in detail in the following patches, but
this is a high level summary of what the interface can do:

 - Supports wake/wait semantics, as in futex()
 - Supports requeue operations, similar to FUTEX_CMP_REQUEUE, but with
   individual flags for each address
 - Supports waiting for a vector of futexes, using a new syscall named
   futex_waitv()
 - Supports variable sized futexes (8bits, 16bits and 32bits)
 - Supports NUMA-awareness operations, where the user can specify which
   memory node they would like to operate on

* Implementation

The internal implementation follows a similar design to the original
futex. Given that we want to replicate the same external behavior of
current futex, this should be somewhat expected. For some functions, like
the init and the code to get a shared key, I literally copied code and
comments from kernel/futex.c. I decided to do so instead of exposing the
original function as a public function since in that way we can freely
modify our implementation if required, without any impact on old futex.
Also, the comments precisely describe the details and corner cases of the
implementation.

Each patch contains a brief description of implementation, but patch 6
"docs: locking: futex2: Add documentation" adds a more complete document
about it.

* The patchset

This patchset can be also found at my git tree:

https://gitlab.collabora.com/tonyk/linux/-/tree/futex2

 - Patch 1: Implements wait/wake, and the basic foundations of futex2

 - Patches 2-4: Implement the remaining features (shared, waitv, requeue).

 - Patch 5: Adds the x86_x32 ABI handling. I kept it in a separate patch
   since I'm not sure if x86_x32 is still a thing, or if it should return
   -ENOSYS.

 - Patch 6: Add a documentation file which details the interface and the
   internal implementation.

 - Patches 7-13: Selftests for all operations along with perf support for
   futex2.

 - Patch 14: While working on porting glibc for futex2, I found out that
   there's a futex_wake() call at the user thread exit path, if that
   thread was created with clone(..., CLONE_CHILD_SETTID, ...). In order
   to make pthreads work with futex2, it was required to add this patch.
   Note that this is more a proof-of-concept of what we will need to do in
   future, rather than part of the interface and shouldn't be merged as it
   is.

* Testing:

This patchset provides selftests for each operation and their flags.
Along with that, the following work was done:

 ** Stability

 To stress the interface in "real world scenarios":

 - glibc[4]: nptl's low level locking was modified to use futex2 API
   (except for robust and PI things). All relevant nptl/ tests passed.

 - Wine[5]: Proton/Wine was modified in order to use futex2() for the
   emulation of Windows NT sync mechanisms based on futex, called "fsync".
   Triple-A games with huge CPU loads and tons of parallel jobs worked as
   expected when compared with the previous FUTEX_WAIT_MULTIPLE
   implementation at futex(). Some games issue 42k futex2() calls per
   second.

 - Full GNU/Linux distro: I installed the modified glibc in my host
   machine, so all pthread programs would use futex2(). After tweaking
   systemd[6] to allow futex2() calls at seccomp, everything worked as
   expected (web browsers do some syscall sandboxing and need some
   configuration as well).

 - perf: The perf benchmarks tests can also be used to stress the
   interface, and they can be found in this patchset.

 ** Performance

 - For comparing futex() and futex2() performance, I used the artificial
   benchmarks implemented at perf (wake, wake-parallel, hash and requeue).
   The setup was 200 runs for each test and using 8, 80, 800, 8000 for the
   number of threads. Note that for this test, I'm not using patch 14
   ("kernel: Enable waitpid() for futex2"), for reasons explained at "The
   patchset" section.

 - For the first three ones, I measured an average of 4% gain in
   performance. This is not a big step, but it shows that the new
   interface is at least comparable in performance with the current one.

 - For requeue, I measured an average of 21% decrease in performance
   compared to the original futex implementation. This is expected given
   the new design with individual flags. The performance trade-offs are
   explained at patch 4 ("futex2: Implement requeue operation").

[4] https://gitlab.collabora.com/tonyk/glibc/-/tree/futex2
[5] https://gitlab.collabora.com/tonyk/wine/-/tree/proton_5.13
[6] https://gitlab.collabora.com/tonyk/systemd

* FAQ

 ** "Where's the code for NUMA and FUTEX_8/16?"

 The current code is already complex enough to take some time for review,
 so I believe it's better to split that work out to a future iteration of
 this patchset. Besides that, this RFC is the core part of the
 infrastructure, and the following features will not pose big design
 changes to it, the work will be more about wiring up the flags and
 modifying some functions.

 ** "And what about FUTEX_64?"

 By supporting 64 bit futexes, the kernel structure for futex would need
 to have a 64 bit field for the value, and that could defeat one of the
 purposes of having different sized futexes in the first place: supporting
 smaller ones to decrease memory usage. This might be something that could
 be disabled for 32bit archs (and even for CONFIG_BASE_SMALL).

 Which use case would benefit from FUTEX_64? Is it worth the trade-offs?

 ** "Where's the PI/robust stuff?"

 As said by Peter Zijlstra at [3], all those new features are related to
 the "simple" futex interface, that doesn't use PI or robust. Do we want
 to have this complexity at futex2() and if so, should it be part of this
 patchset or can it be future work?

Thanks,
	André

André Almeida (13):
  futex2: Implement wait and wake functions
  futex2: Add support for shared futexes
  futex2: Implement vectorized wait
  futex2: Implement requeue operation
  futex2: Add compatibility entry point for x86_x32 ABI
  docs: locking: futex2: Add documentation
  selftests: futex2: Add wake/wait test
  selftests: futex2: Add timeout test
  selftests: futex2: Add wouldblock test
  selftests: futex2: Add waitv test
  selftests: futex2: Add requeue test
  perf bench: Add futex2 benchmark tests
  kernel: Enable waitpid() for futex2

 Documentation/locking/futex2.rst | 198 +++
 Documentation/locking/index.rst | 1 +
 MAINTAINERS | 2 +-
 arch/arm/tools/syscall.tbl | 4 +
 arch/arm64/include/asm/unistd.h | 2 +-
 arch/arm64/include/asm/unistd32.h | 4 +
 arch/x86/entry/syscalls/syscall_32.tbl | 4 +
 arch/x86/entry/syscalls/syscall_64.tbl | 4 +
 fs/inode.c | 1 +
 include/linux/compat.h | 23 +
 include/linux/fs.h | 1 +
 include/linux/syscalls.h | 18 +
 include/uapi/asm-generic/unistd.h | 14 +-
 include/uapi/linux/futex.h | 56 +
 init/Kconfig | 7 +
 kernel/Makefile | 1 +
 kernel/fork.c | 2 +
 kernel/futex2.c | 1255 +++++++++++++++++
 kernel/sys_ni.c | 6 +
 tools/arch/x86/include/asm/unistd_64.h | 12 +
 tools/include/uapi/asm-generic/unistd.h | 11 +-
 .../arch/x86/entry/syscalls/syscall_64.tbl | 3 +
tools/perf/bench/bench.h | 4 + tools/perf/bench/futex-hash.c | 24 +- tools/perf/bench/futex-requeue.c | 57 +- tools/perf/bench/futex-wake-parallel.c | 41 +- tools/perf/bench/futex-wake.c | 37 +- tools/perf/bench/futex.h | 47 + tools/perf/builtin-bench.c | 18 +- .../selftests/futex/functional/.gitignore | 3 + .../selftests/futex/functional/Makefile | 8 +- .../futex/functional/futex2_requeue.c | 164 +++ .../selftests/futex/functional/futex2_wait.c | 209 +++ .../selftests/futex/functional/futex2_waitv.c | 157 +++ .../futex/functional/futex_wait_timeout.c | 58 +- .../futex/functional/futex_wait_wouldblock.c | 33 +- .../testing/selftests/futex/functional/run.sh | 6 + .../selftests/futex/include/futex2test.h | 121 ++ 38 files changed, 2563 insertions(+), 53 deletions(-) create mode 100644 Documentation/locking/futex2.rst create mode 100644 kernel/futex2.c create mode 100644 tools/testing/selftests/futex/functional/futex2_requeue.c create mode 100644 tools/testing/selftests/futex/functional/futex2_wait.c create mode 100644 tools/testing/selftests/futex/functional/futex2_waitv.c create mode 100644 tools/testing/selftests/futex/include/futex2test.h -- 2.30.1

4 years, 10 months
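The cover letter above says futex2 keeps the wake/wait semantics of futex() and ships timeout and wouldblock selftests. As a rough illustration of those semantics, here is a minimal sketch that uses the existing futex() syscall, not futex2: the futex2 syscall numbers and argument layouts are not shown in this cover letter, so nothing below should be read as the futex2 API. The helper names (futex, futex_wait_mismatch_errno, futex_wait_timeout_errno, futex_wake_no_waiters) are made up for this sketch.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Thin wrapper (hypothetical name): glibc exports no futex() function. */
static long futex(uint32_t *uaddr, int op, uint32_t val,
		  const struct timespec *timeout)
{
	return syscall(SYS_futex, uaddr, op, val, timeout, NULL, 0);
}

/*
 * "wouldblock" case: FUTEX_WAIT only sleeps if *uaddr still equals val.
 * Here the word holds 1 but we claim to expect 0, so the call fails
 * immediately. Returns the errno that was observed (0 on success).
 */
int futex_wait_mismatch_errno(void)
{
	uint32_t word = 1;

	errno = 0;
	if (futex(&word, FUTEX_WAIT_PRIVATE, 0, NULL) == 0)
		return 0;
	return errno;
}

/*
 * "timeout" case: the value matches, nobody ever wakes us, and the
 * 1 ms relative timeout expires. Returns the errno that was observed.
 */
int futex_wait_timeout_errno(void)
{
	uint32_t word = 0;
	struct timespec timeout = { .tv_sec = 0, .tv_nsec = 1000000 };

	errno = 0;
	if (futex(&word, FUTEX_WAIT_PRIVATE, 0, &timeout) == 0)
		return 0;
	return errno;
}

/* FUTEX_WAKE returns how many waiters were woken; none are queued here. */
long futex_wake_no_waiters(void)
{
	uint32_t word = 0;

	return futex(&word, FUTEX_WAKE_PRIVATE, 1, NULL);
}
```

On a typical Linux system the first helper reports EAGAIN and the second ETIMEDOUT, which is the behavior the wouldblock and timeout selftests named in the shortlog are meant to cover for the new interface.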

[PATCH] selftests: timers: set-timer-lat: remove unneeded semicolon

by Yang Li

Eliminate the following coccicheck warning:
./tools/testing/selftests/timers/set-timer-lat.c:83:2-3: Unneeded semicolon
./tools/testing/selftests/timers/nsleep-lat.c:75:2-3: Unneeded semicolon
./tools/testing/selftests/timers/nanosleep.c:75:2-3: Unneeded semicolon
./tools/testing/selftests/timers/inconsistency-check.c:75:2-3: Unneeded semicolon
./tools/testing/selftests/timers/alarmtimer-suspend.c:82:2-3: Unneeded semicolon

Reported-by: Abaci Robot <abaci(a)linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee(a)linux.alibaba.com>
---
 tools/testing/selftests/timers/alarmtimer-suspend.c  | 2 +-
 tools/testing/selftests/timers/inconsistency-check.c | 2 +-
 tools/testing/selftests/timers/nanosleep.c           | 2 +-
 tools/testing/selftests/timers/nsleep-lat.c          | 2 +-
 tools/testing/selftests/timers/set-timer-lat.c       | 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/timers/alarmtimer-suspend.c b/tools/testing/selftests/timers/alarmtimer-suspend.c
index 4da09db..54da4b08 100644
--- a/tools/testing/selftests/timers/alarmtimer-suspend.c
+++ b/tools/testing/selftests/timers/alarmtimer-suspend.c
@@ -79,7 +79,7 @@ char *clockstring(int clockid)
 		return "CLOCK_BOOTTIME_ALARM";
 	case CLOCK_TAI:
 		return "CLOCK_TAI";
-	};
+	}
 	return "UNKNOWN_CLOCKID";
 }
 
diff --git a/tools/testing/selftests/timers/inconsistency-check.c b/tools/testing/selftests/timers/inconsistency-check.c
index 022d3ff..e6756d9 100644
--- a/tools/testing/selftests/timers/inconsistency-check.c
+++ b/tools/testing/selftests/timers/inconsistency-check.c
@@ -72,7 +72,7 @@ char *clockstring(int clockid)
 		return "CLOCK_BOOTTIME_ALARM";
 	case CLOCK_TAI:
 		return "CLOCK_TAI";
-	};
+	}
 	return "UNKNOWN_CLOCKID";
 }
 
diff --git a/tools/testing/selftests/timers/nanosleep.c b/tools/testing/selftests/timers/nanosleep.c
index 71b5441..433a096 100644
--- a/tools/testing/selftests/timers/nanosleep.c
+++ b/tools/testing/selftests/timers/nanosleep.c
@@ -72,7 +72,7 @@ char *clockstring(int clockid)
 		return "CLOCK_BOOTTIME_ALARM";
 	case CLOCK_TAI:
 		return "CLOCK_TAI";
-	};
+	}
 	return "UNKNOWN_CLOCKID";
 }
 
diff --git a/tools/testing/selftests/timers/nsleep-lat.c b/tools/testing/selftests/timers/nsleep-lat.c
index eb3e79e..a7ca982 100644
--- a/tools/testing/selftests/timers/nsleep-lat.c
+++ b/tools/testing/selftests/timers/nsleep-lat.c
@@ -72,7 +72,7 @@ char *clockstring(int clockid)
 		return "CLOCK_BOOTTIME_ALARM";
 	case CLOCK_TAI:
 		return "CLOCK_TAI";
-	};
+	}
 	return "UNKNOWN_CLOCKID";
 }
 
diff --git a/tools/testing/selftests/timers/set-timer-lat.c b/tools/testing/selftests/timers/set-timer-lat.c
index 50da454..d60bbca 100644
--- a/tools/testing/selftests/timers/set-timer-lat.c
+++ b/tools/testing/selftests/timers/set-timer-lat.c
@@ -80,7 +80,7 @@ char *clockstring(int clockid)
 		return "CLOCK_BOOTTIME_ALARM";
 	case CLOCK_TAI:
 		return "CLOCK_TAI";
-	};
+	}
 	return "UNKNOWN_CLOCKID";
 }
 
--
1.8.3.1

4 years, 10 months
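The patch above only shows the tail of each clockstring() helper. A reduced sketch of the pattern it touches might look like the following; the case list is shortened for illustration and is an assumption, not a copy of the selftest sources. A ';' after the closing brace of a switch is a separate empty statement: legal C, but dead weight, which is what coccicheck flags.

```c
#define _GNU_SOURCE
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11	/* Linux clock id, for libcs that do not define it */
#endif

/* Map a clock id to a printable name (shortened case list). */
const char *clockstring(int clockid)
{
	switch (clockid) {
	case CLOCK_REALTIME:
		return "CLOCK_REALTIME";
	case CLOCK_MONOTONIC:
		return "CLOCK_MONOTONIC";
	case CLOCK_TAI:
		return "CLOCK_TAI";
	}	/* no ';' here: it would only add an empty statement */
	return "UNKNOWN_CLOCKID";
}
```

Behavior is identical with or without the semicolon, which is why the patch is a pure cleanup.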

[PATCH v3 0/2] kunit: fail tests on UBSAN errors

by Daniel Latypov

v1 by Uriel is here: [1].
Since it's been a while, I've dropped the Reviewed-By's.

It depended on commit 83c4e7a0363b ("KUnit: KASAN Integration") which hadn't been merged yet, so that caused some kerfuffle with applying them previously and the series was reverted.

This revives the series but makes the kunit_fail_current_test() function take a format string and log the file and line number of the failing code, addressing Alan Maguire's comments on the previous version.

As a result, the patch that makes UBSAN errors fail tests was tweaked slightly to include an error message.

v2 -> v3:
  Fix kunit_fail_current_test() so it works w/ CONFIG_KUNIT=m
  s/_/__ on the helper func to match others in test.c

[1] https://lore.kernel.org/linux-kselftest/20200806174326.3577537-1-urielguaja…

Uriel Guajardo (2):
  kunit: support failure from dynamic analysis tools
  kunit: ubsan integration

 include/kunit/test-bug.h | 30 ++++++++++++++++++++++++++++++
 lib/kunit/test.c         | 37 +++++++++++++++++++++++++++++++++----
 lib/ubsan.c              |  3 +++
 3 files changed, 66 insertions(+), 4 deletions(-)
 create mode 100644 include/kunit/test-bug.h

base-commit: 1e0d27fce010b0a4a9e595506b6ede75934c31be
--
2.30.0.478.g8a0d178c01-goog

4 years, 10 months
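The cover letter above describes a failure helper that takes a format string and records the file and line of the failing code. The kernel implementation is not shown here, so the following is only a userspace sketch of that general pattern: a macro captures __FILE__ and __LINE__ and forwards a printf-style message. All names (fail_current_test, fail_current_test_at, test_failed, last_failure) are hypothetical and are not the KUnit API.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-ins for per-test state. */
int test_failed;
char last_failure[256];

/* Record a failure as "<file>:<line>: <formatted message>". */
void fail_current_test_at(const char *file, int line, const char *fmt, ...)
{
	va_list args;
	int n;

	test_failed = 1;
	n = snprintf(last_failure, sizeof(last_failure), "%s:%d: ", file, line);
	if (n < 0 || (size_t)n >= sizeof(last_failure))
		return;	/* prefix alone filled the buffer */

	va_start(args, fmt);
	vsnprintf(last_failure + n, sizeof(last_failure) - (size_t)n, fmt, args);
	va_end(args);
}

/*
 * The macro supplies the caller's location. ##__VA_ARGS__ is a GNU
 * extension (supported by GCC and Clang) that allows zero extra arguments.
 */
#define fail_current_test(fmt, ...) \
	fail_current_test_at(__FILE__, __LINE__, fmt, ##__VA_ARGS__)
```

A sanitizer hook could then call something like fail_current_test("UBSAN: %s", reason) from its report path, so the running test is marked failed with a message that points at the call site.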

epoll: different edge-triggered behavior between pipe and socketpair

by fruggeri＠arista.com

pipe() and socketpair() have different behavior wrt edge-triggered read epoll, in that no event is generated when data is written into a non-empty pipe, but an event is generated if socketpair() is used instead.

This simple modification of the epoll2 testlet from tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c (it just adds a second write) shows the different behavior. The testlet passes with pipe() but fails with socketpair() with 5.10. They both fail with 4.19.

Is it fair to assume that 5.10 pipe's behavior is the correct one?

Thanks,
Francesco Ruggeri

/*
 *          t0
 *           | (ew)
 *          e0
 *           | (et)
 *          s0
 */
TEST(epoll2)
{
	int efd;
	int sfd[2];
	struct epoll_event e;

	ASSERT_EQ(socketpair(AF_UNIX, SOCK_STREAM, 0, sfd), 0);
	//ASSERT_EQ(pipe(sfd), 0);

	efd = epoll_create(1);
	ASSERT_GE(efd, 0);

	e.events = EPOLLIN | EPOLLET;
	ASSERT_EQ(epoll_ctl(efd, EPOLL_CTL_ADD, sfd[0], &e), 0);

	ASSERT_EQ(write(sfd[1], "w", 1), 1);
	EXPECT_EQ(epoll_wait(efd, &e, 1, 0), 1);
	ASSERT_EQ(write(sfd[1], "w", 1), 1);
	EXPECT_EQ(epoll_wait(efd, &e, 1, 0), 0);

	close(efd);
	close(sfd[0]);
	close(sfd[1]);
}

4 years, 10 months
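The testlet above depends on the kselftest harness macros. For trying the same sequence outside the selftest tree, a standalone sketch could look like this; the function name et_second_write_events is made up here, and the result is kernel-dependent, so no expected value is asserted in the code itself.

```c
#define _GNU_SOURCE
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Write twice to an fd pair watched with EPOLLIN | EPOLLET and return what
 * the second zero-timeout epoll_wait() reports (0 or 1), or -1 if setup
 * failed or the first write produced no event.
 */
int et_second_write_events(int use_pipe)
{
	struct epoll_event e = { .events = EPOLLIN | EPOLLET };
	int fds[2], efd, first, second = -1;

	if (use_pipe ? pipe(fds) : socketpair(AF_UNIX, SOCK_STREAM, 0, fds))
		return -1;

	efd = epoll_create1(0);
	if (efd < 0)
		goto out_fds;

	e.data.fd = fds[0];
	if (epoll_ctl(efd, EPOLL_CTL_ADD, fds[0], &e))
		goto out;

	/* First write: the read end goes from empty to readable. */
	if (write(fds[1], "w", 1) != 1)
		goto out;
	first = epoll_wait(efd, &e, 1, 0);
	if (first != 1)
		goto out;

	/* Second write into the still non-empty buffer. */
	if (write(fds[1], "w", 1) != 1)
		goto out;
	second = epoll_wait(efd, &e, 1, 0);
out:
	close(efd);
out_fds:
	close(fds[0]);
	close(fds[1]);
	return second;
}
```

Going by the report above, a 5.10 kernel would be expected to return 0 for the pipe case and 1 for the socketpair case, while 4.19 would return 1 for both; other kernel versions may differ.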

[PATCH v3 0/5] Some optimizations related to sgx

by Tianjia Zhang

This series is a set of SGX-related optimizations; each patch is independent of the others. Because the second and third patches would conflict with each other if sent separately, these patches are put together in one series.

---
v3 changes:
  * split free_cnt count and spin lock optimization into two patches

v2 changes:
  * review suggested changes

Tianjia Zhang (5):
  selftests/x86: Simplify the code to get vdso base address in sgx
  x86/sgx: Optimize the locking range in sgx_sanitize_section()
  x86/sgx: Optimize the free_cnt count in sgx_epc_section
  x86/sgx: Allows ioctl PROVISION to execute before CREATE
  x86/sgx: Remove redundant if conditions in sgx_encl_create

 arch/x86/kernel/cpu/sgx/driver.c   |  1 +
 arch/x86/kernel/cpu/sgx/ioctl.c    |  9 +++++----
 arch/x86/kernel/cpu/sgx/main.c     | 13 +++++--------
 tools/testing/selftests/sgx/main.c | 24 ++++--------------------
 4 files changed, 15 insertions(+), 32 deletions(-)

--
2.19.1.3.ge56e4f7

4 years, 10 months
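The first patch in the series above is titled "Simplify the code to get vdso base address in sgx", but the cover letter does not show how. One compact way a userspace program can obtain the vDSO base on Linux is getauxval(AT_SYSINFO_EHDR); whether the patch uses this call is an assumption, and the helper names below are made up for the sketch.

```c
#include <elf.h>
#include <stdint.h>
#include <string.h>
#include <sys/auxv.h>

/* Return the vDSO load address from the auxiliary vector, or NULL. */
const void *vdso_base(void)
{
	return (const void *)(uintptr_t)getauxval(AT_SYSINFO_EHDR);
}

/* Sanity check: a mapped vDSO starts with the ELF magic bytes. */
int vdso_looks_like_elf(const void *base)
{
	return base != NULL && memcmp(base, ELFMAG, SELFMAG) == 0;
}
```

getauxval() returns 0 when the requested entry is absent, so callers should treat a NULL base as "no vDSO found" rather than dereferencing it.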

[PATCH] selftests: kvm: add hardware_disable test

by Marc Orr

From: Ignacio Alvarado <ikalvarado(a)google.com>

This test launches 512 VMs in serial and kills them after a random amount of time.

The test was originally written to exercise KVM user notifiers in the context of 1650b4ebc99d:
- KVM: Disable irq while unregistering user notifier
- https://lore.kernel.org/kvm/CACXrx53vkO=HKfwWwk+fVpvxcNjPrYmtDZ10qWxFvVX_PT…

Recently, this test piqued my interest because it proved useful for AMD SNP in exercising the "in-use" pages, described in APM section 15.36.12, "Running SNP-Active Virtual Machines".

To run the test, first compile:
$ make "CPPFLAGS=-static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive" \
    -C tools/testing/selftests/kvm/

Then, copy the test over to a machine with the kernel and run:
$ ./hardware_disable_test

Signed-off-by: Ignacio Alvarado <ikalvarado(a)google.com>
Signed-off-by: Marc Orr <marcorr(a)google.com>
---
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/hardware_disable_test.c     | 165 ++++++++++++++++++
 3 files changed, 167 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/hardware_disable_test.c

diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index ce8f4ad39684..d631e111441a 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -28,6 +28,7 @@
 /demand_paging_test
 /dirty_log_test
 /dirty_log_perf_test
+/hardware_disable_test
 /kvm_create_max_vcpus
 /set_memory_region_test
 /steal_time
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index fe41c6a0fa67..c1c403d878f6 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -62,6 +62,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/tsc_msrs_test
 TEST_GEN_PROGS_x86_64 += demand_paging_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
 TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
+TEST_GEN_PROGS_x86_64 += hardware_disable_test
 TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
 TEST_GEN_PROGS_x86_64 += set_memory_region_test
 TEST_GEN_PROGS_x86_64 += steal_time
diff --git a/tools/testing/selftests/kvm/hardware_disable_test.c b/tools/testing/selftests/kvm/hardware_disable_test.c
new file mode 100644
index 000000000000..2f2eeb8a1d86
--- /dev/null
+++ b/tools/testing/selftests/kvm/hardware_disable_test.c
@@ -0,0 +1,165 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * This test is intended to reproduce a crash that happens when
+ * kvm_arch_hardware_disable is called and it attempts to unregister the user
+ * return notifiers.
+ */
+
+#define _GNU_SOURCE
+
+#include <fcntl.h>
+#include <pthread.h>
+#include <semaphore.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/wait.h>
+
+#include <test_util.h>
+
+#include "kvm_util.h"
+
+#define VCPU_NUM 4
+#define SLEEPING_THREAD_NUM (1 << 4)
+#define FORK_NUM (1ULL << 9)
+#define DELAY_US_MAX 2000
+#define GUEST_CODE_PIO_PORT 4
+
+sem_t *sem;
+
+/* Arguments for the pthreads */
+struct payload {
+	struct kvm_vm *vm;
+	uint32_t index;
+};
+
+static void guest_code(void)
+{
+	for (;;)
+		;  /* Some busy work */
+	printf("Should not be reached.\n");
+}
+
+static void *run_vcpu(void *arg)
+{
+	struct payload *payload = (struct payload *)arg;
+	struct kvm_run *state = vcpu_state(payload->vm, payload->index);
+
+	vcpu_run(payload->vm, payload->index);
+
+	TEST_ASSERT(false, "%s: exited with reason %d: %s\n",
+		    __func__, state->exit_reason,
+		    exit_reason_str(state->exit_reason));
+	pthread_exit(NULL);
+}
+
+static void *sleeping_thread(void *arg)
+{
+	int fd;
+
+	while (true) {
+		fd = open("/dev/null", O_RDWR);
+		close(fd);
+	}
+	TEST_ASSERT(false, "%s: exited\n", __func__);
+	pthread_exit(NULL);
+}
+
+static inline void check_create_thread(pthread_t *thread, pthread_attr_t *attr,
+				       void *(*f)(void *), void *arg)
+{
+	int r;
+
+	r = pthread_create(thread, attr, f, arg);
+	TEST_ASSERT(r == 0, "%s: failed to create thread", __func__);
+}
+
+static inline void check_set_affinity(pthread_t thread, cpu_set_t *cpu_set)
+{
+	int r;
+
+	r = pthread_setaffinity_np(thread, sizeof(cpu_set_t), cpu_set);
+	TEST_ASSERT(r == 0, "%s: failed set affinity", __func__);
+}
+
+static inline void check_join(pthread_t thread, void **retval)
+{
+	int r;
+
+	r = pthread_join(thread, retval);
+	TEST_ASSERT(r == 0, "%s: failed to join thread", __func__);
+}
+
+static void run_test(uint32_t run)
+{
+	struct kvm_vm *vm;
+	cpu_set_t cpu_set;
+	pthread_t threads[VCPU_NUM];
+	pthread_t throw_away;
+	struct payload payloads[VCPU_NUM];
+	void *b;
+	uint32_t i, j;
+
+	CPU_ZERO(&cpu_set);
+	for (i = 0; i < VCPU_NUM; i++)
+		CPU_SET(i, &cpu_set);
+
+	vm = vm_create(VM_MODE_DEFAULT, DEFAULT_GUEST_PHY_PAGES, O_RDWR);
+	kvm_vm_elf_load(vm, program_invocation_name, 0, 0);
+	vm_create_irqchip(vm);
+
+	fprintf(stderr, "%s: [%d] start vcpus\n", __func__, run);
+	for (i = 0; i < VCPU_NUM; ++i) {
+		vm_vcpu_add_default(vm, i, guest_code);
+		payloads[i].vm = vm;
+		payloads[i].index = i;
+
+		check_create_thread(&threads[i], NULL, run_vcpu,
+				    (void *)&payloads[i]);
+		check_set_affinity(threads[i], &cpu_set);
+
+		for (j = 0; j < SLEEPING_THREAD_NUM; ++j) {
+			check_create_thread(&throw_away, NULL, sleeping_thread,
+					    (void *)NULL);
+			check_set_affinity(throw_away, &cpu_set);
+		}
+	}
+	fprintf(stderr, "%s: [%d] all threads launched\n", __func__, run);
+	sem_post(sem);
+	for (i = 0; i < VCPU_NUM; ++i)
+		check_join(threads[i], &b);
+	/* Should not be reached */
+	TEST_ASSERT(false, "%s: [%d] child escaped the ninja\n", __func__, run);
+}
+
+int main(int argc, char **argv)
+{
+	uint32_t i;
+	int s, r;
+	pid_t pid;
+
+	sem = sem_open("vm_sem", O_CREAT | O_EXCL, 0644, 0);
+	sem_unlink("vm_sem");
+
+	for (i = 0; i < FORK_NUM; ++i) {
+		pid = fork();
+		TEST_ASSERT(pid >= 0, "%s: unable to fork", __func__);
+		if (pid == 0)
+			run_test(i); /* This function always exits */
+
+		fprintf(stderr, "%s: [%d] waiting semaphore\n", __func__, i);
+		sem_wait(sem);
+		r = (rand() % DELAY_US_MAX) + 1;
+		fprintf(stderr, "%s: [%d] waiting %dus\n", __func__, i, r);
+		usleep(r);
+		r = waitpid(pid, &s, WNOHANG);
+		TEST_ASSERT(r != pid,
+			    "%s: [%d] child exited unexpectedly status: [%d]",
+			    __func__, i, s);
+		fprintf(stderr, "%s: [%d] killing child\n", __func__, i);
+		kill(pid, SIGKILL);
+	}
+
+	sem_destroy(sem);
+	exit(0);
+}
--
2.30.0.478.g8a0d178c01-goog

4 years, 11 months
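The parent loop in the test above follows a recognizable pattern: fork a child, wait until it reports that it is fully started, sleep a random short time, confirm with waitpid(WNOHANG) that it is still alive, then SIGKILL it. A KVM-free sketch of one iteration is below. It deliberately swaps the named semaphore for a pipe to avoid depending on /dev/shm and the semaphore library, and unlike the posted test it reaps the child; spawn_wait_kill is a made-up name.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the child was still running when checked and then died from
 * SIGKILL, 0 if it had already exited on its own, -1 on setup errors.
 */
int spawn_wait_kill(unsigned int delay_us_max)
{
	int ready[2], status = 0;
	char byte;
	pid_t pid, r;

	if (delay_us_max == 0 || pipe(ready))
		return -1;

	pid = fork();
	if (pid < 0) {
		close(ready[0]);
		close(ready[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: report readiness, then idle until killed. */
		close(ready[0]);
		if (write(ready[1], "r", 1) != 1)
			_exit(1);
		for (;;)
			pause();
	}

	/* Parent: block until the child says it is up (sem_wait analogue). */
	close(ready[1]);
	if (read(ready[0], &byte, 1) != 1) {
		close(ready[0]);
		kill(pid, SIGKILL);
		waitpid(pid, &status, 0);
		return -1;
	}
	close(ready[0]);

	usleep((rand() % delay_us_max) + 1);

	/* WNOHANG: 0 means the child exists and has not changed state. */
	r = waitpid(pid, &status, WNOHANG);
	if (r != 0)
		return r == pid ? 0 : -1;

	kill(pid, SIGKILL);
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}
```

Calling spawn_wait_kill(2000) mirrors one pass of the loop with the same DELAY_US_MAX as the test. Checking the WNOHANG result before sending the signal avoids signalling a PID that has already been reaped.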

[PATCH v11 00/14] prohibit pinning pages in ZONE_MOVABLE

by Pavel Tatashin

Changelog
---------
v11
- Another build fix reported by robot on i386: moved is_pinnable_page() below set_page_section() in linux/mm.h

v10
- Fixed !CONFIG_MMU compiler issues by adding is_zero_pfn() stub.

v9
- Renamed gpf_to_alloc_flags() to gfp_to_alloc_flags_cma(); thanks Lecopzer Chen for noticing.
- Fixed warning reported by scripts/checkpatch.pl: "Logical continuations should be on the previous line"

v8
- Added reviewed-by's from John Hubbard
- Fixed subjects for selftests patches
- Moved zero page check inside is_pinnable_page() as requested by Jason Gunthorpe.

v7
- Added reviewed-by's
- Fixed a compile bug on non-mmu builds reported by robot

v6
  Small update, but I wanted to send it out quicker, as it removes a controversial patch and replaces it with something sane.
- Removed forcing FOLL_WRITE for longterm gup, instead added a patch to skip zero pages during migration.
- Added reviewed-by's and minor log changes.

v5
- Added the following patches to the beginning of series, which are fixes to the other existing problems with CMA migration code:
	mm/gup: check every subpage of a compound page during isolation
	mm/gup: return an error on migration failure
	mm/gup: check for isolation errors
  also at the beginning of series
	mm/gup: do not allow zero page for pinned pages
- remove .gfp_mask/.reclaim_idx changes from mm/vmscan.c
- update movable zone header comment in patch 8 instead of patch 3, fix the comment
- Added acked, sign-offs
- Updated commit logs based on feedback
- Addressed issues reported by Michal and Jason.
- Remove:
	#define PINNABLE_MIGRATE_MAX	10
	#define PINNABLE_ISOLATE_MAX	100
  Instead: fail on the first migration failure, and retry isolation forever as their failures are transient.
- In the selftest, addressed some of the comments from John Hubbard, updated commit logs, and added comments. Renamed gup->flags with gup->test_flags.

v4
- Address page migration comments. New patch:
	mm/gup: limit number of gup migration failures, honor failures
  Implements a limit on the number of retries for migration failures, and also checks for isolation failures.
  Added a test case into gup_test to verify that pages are never long-term pinned in a movable zone, and also added tests to fault both in kernel and in userland.

v3
- Merged with linux-next, which contains clean-up patch from Jason, therefore this series is reduced by two patches which did the same thing.

v2
- Addressed all review comments
- Added Reviewed-by's.
- Renamed PF_MEMALLOC_NOMOVABLE to PF_MEMALLOC_PIN
- Added is_pinnable_page() to check if page can be longterm pinned
- Fixed gup fast path by checking is_in_pinnable_zone()
- rename cma_page_list to movable_page_list
- add an admin-guide note about handling pinned pages in ZONE_MOVABLE, updated caveat about pinned pages from linux/mmzone.h
- Move current_gfp_context() to fast-path

---------
When a page is pinned it cannot be moved, and its physical address stays the same until the page is unpinned.

This is useful functionality that allows userland to implement DMA access. For example, it is used by vfio in vfio_pin_pages().

However, this functionality breaks the memory hotplug/hotremove assumption that pages in ZONE_MOVABLE can always be migrated.

This patch series fixes this issue by forcing new allocations during page pinning to omit ZONE_MOVABLE, and also by migrating any existing pages out of ZONE_MOVABLE during pinning.

It uses the same scheme that is currently used by CMA, and extends the functionality to all allocations.

For more information read the discussion [1] about this problem.
[1] https://lore.kernel.org/lkml/CA+CK2bBffHBxjmb9jmSKacm0fJMinyt3Nhk8Nx6iudcQS…

Previous versions:
v1 https://lore.kernel.org/lkml/20201202052330.474592-1-pasha.tatashin@soleen.…
v2 https://lore.kernel.org/lkml/20201210004335.64634-1-pasha.tatashin@soleen.c…
v3 https://lore.kernel.org/lkml/20201211202140.396852-1-pasha.tatashin@soleen.…
v4 https://lore.kernel.org/lkml/20201217185243.3288048-1-pasha.tatashin@soleen…
v5 https://lore.kernel.org/lkml/20210119043920.155044-1-pasha.tatashin@soleen.…
v6 https://lore.kernel.org/lkml/20210120014333.222547-1-pasha.tatashin@soleen.…
v7 https://lore.kernel.org/lkml/20210122033748.924330-1-pasha.tatashin@soleen.…
v8 https://lore.kernel.org/lkml/20210125194751.1275316-1-pasha.tatashin@soleen…
v9 https://lore.kernel.org/lkml/20210201153827.444374-1-pasha.tatashin@soleen.…
v10 https://lore.kernel.org/lkml/20210211162427.618913-1-pasha.tatashin@soleen.…

Pavel Tatashin (14):
  mm/gup: don't pin migrated cma pages in movable zone
  mm/gup: check every subpage of a compound page during isolation
  mm/gup: return an error on migration failure
  mm/gup: check for isolation errors
  mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN
  mm: apply per-task gfp constraints in fast path
  mm: honor PF_MEMALLOC_PIN for all movable pages
  mm/gup: do not migrate zero page
  mm/gup: migrate pinned pages out of movable zone
  memory-hotplug.rst: add a note about ZONE_MOVABLE and page pinning
  mm/gup: change index type to long as it counts pages
  mm/gup: longterm pin migration cleanup
  selftests/vm: gup_test: fix test flag
  selftests/vm: gup_test: test faulting in kernel, and verify pinnable pages

 .../admin-guide/mm/memory-hotplug.rst | 9 +
 include/linux/migrate.h | 1 +
 include/linux/mm.h | 19 ++
 include/linux/mmzone.h | 13 +-
 include/linux/pgtable.h | 12 ++
 include/linux/sched.h | 2 +-
 include/linux/sched/mm.h | 27 +--
 include/trace/events/migrate.h | 3 +-
 mm/gup.c | 174 ++++++++----------
 mm/gup_test.c | 29 +--
 mm/gup_test.h | 3 +-
 mm/hugetlb.c | 4 +-
 mm/page_alloc.c | 33 ++--
 tools/testing/selftests/vm/gup_test.c | 36 +++-
 14 files changed, 208 insertions(+), 157 deletions(-)

--
2.25.1

4 years, 11 months
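The cover letter above describes two rules: allocations made while pinning must avoid ZONE_MOVABLE, and pages already in ZONE_MOVABLE must be migrated before a long-term pin. The kernel code is not reproduced in this post, so the following is only a userspace toy model of those two rules. Every name and bit value in it (TOY_GFP_MOVABLE, TOY_PF_MEMALLOC_PIN, toy_gfp_context, toy_needs_migration_before_pin) is invented for illustration and does not match real kernel definitions.

```c
#include <stdbool.h>

/* Toy flag bits: NOT the kernel's values. */
#define TOY_GFP_KERNEL		0x1u
#define TOY_GFP_MOVABLE		0x2u
#define TOY_PF_MEMALLOC_PIN	0x1u

enum toy_zone { TOY_ZONE_NORMAL, TOY_ZONE_MOVABLE };

/*
 * Rule 1: while a task is inside a "pinning" scope, strip the movable
 * bit so new allocations cannot land in the movable zone.
 */
unsigned int toy_gfp_context(unsigned int task_flags, unsigned int gfp)
{
	if (task_flags & TOY_PF_MEMALLOC_PIN)
		gfp &= ~TOY_GFP_MOVABLE;
	return gfp;
}

/*
 * Rule 2: a page that already sits in the movable zone has to be moved
 * elsewhere before it may be pinned long term.
 */
bool toy_needs_migration_before_pin(enum toy_zone zone)
{
	return zone == TOY_ZONE_MOVABLE;
}
```

The model leaves out details the changelog mentions, such as the special handling of the zero page and CMA pages.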
