April 2019 - Linux-kselftest-mirror

[PATCH v14 00/17] arm64: untag user pointers passed to the kernel

by Andrey Konovalov

=== Overview

arm64 has a feature called Top Byte Ignore, which allows embedding pointer tags into the top byte of each pointer. Userspace programs (such as HWASan, a memory debugging tool [1]) might use this feature and pass tagged user pointers to the kernel through syscalls or other interfaces.

Right now the kernel is already able to handle user faults with tagged pointers, due to these patches:

1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a tagged pointer")
2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged pointers")
3. 276e9327 ("arm64: entry: improve data abort handling of tagged pointers")

This patchset extends tagged pointer support to syscall arguments.

As per the proposed ABI change [3], tagged pointers are only allowed to be passed to syscalls when they point to memory ranges obtained by anonymous mmap() or sbrk() (see the patchset [3] for more details).

For non-memory syscalls this is done by untagging user pointers when the kernel performs pointer checking to find out whether the pointer comes from userspace (most notably in access_ok). The untagging is done only when the pointer is being checked; the tag is preserved as the pointer makes its way through the kernel and stays tagged when the kernel dereferences the pointer when performing user memory accesses.

Memory syscalls (mmap, mprotect, etc.) don't do user memory accesses but rather deal with memory ranges, and untagged pointers are better suited to describe memory ranges internally. Thus for memory syscalls we untag pointers completely when they enter the kernel.

=== Other approaches

One of the alternative approaches to untagging that was considered is to completely strip the pointer tag as the pointer enters the kernel with some kind of a syscall wrapper, but that won't work with the countless number of different ioctl calls. With this approach we would need a custom wrapper for each ioctl variation, which doesn't seem practical.

An alternative approach to untagging pointers in memory syscalls prologues is to instead allow tagged pointers to be passed to find_vma() (and other vma related functions) and untag them there. Unfortunately, a lot of find_vma() callers then compare or subtract the returned vma start and end fields against the pointer that was being searched. Thus this approach would still require changing all find_vma() callers.

=== Testing

The following testing approaches have been taken to find potential issues with user pointer untagging:

1. Static testing (with sparse [2] and separately with a custom static analyzer based on Clang) to track casts of __user pointers to integer types to find places where untagging needs to be done.
2. Static testing with grep to find parts of the kernel that call find_vma() (and other similar functions) or directly compare against vm_start/vm_end fields of vma.
3. Static testing with grep to find parts of the kernel that compare user pointers with TASK_SIZE or other similar consts and macros.
4. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running a modified syzkaller version that passes tagged pointers to the kernel.

Based on the results of the testing the required patches have been added to the patchset.

=== Notes

This patchset is meant to be merged together with "arm64 relaxed ABI" [3].

This patchset is a prerequisite for ARM's memory tagging hardware feature support [4].

This patchset has been merged into the Pixel 2 & 3 kernel trees and is now being used to enable testing of Pixel phones with HWASan.

Thanks!

[1] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
[2] https://github.com/lucvoo/sparse-dev/commit/5f960cb10f56ec2017c128ef9d16060…
[3] https://lkml.org/lkml/2019/3/18/819
[4] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architectur…

Changes in v14:
- Moved untagging for most memory syscalls to an arm64 specific implementation, instead of doing that in the common code.
- Dropped "net, arm64: untag user pointers in tcp_zerocopy_receive", since the provided user pointers don't come from an anonymous map and thus are not covered by this ABI relaxation.
- Dropped "kernel, arm64: untag user pointers in prctl_set_mm*".
- Moved untagging from __check_mem_type() to tee_shm_register().
- Updated untagging for the amdgpu and radeon drivers to cover the MMU notifier, as suggested by Felix.
- Since this ABI relaxation doesn't actually allow tagged instruction pointers, dropped the following patches:
  - Dropped "tracing, arm64: untag user pointers in seq_print_user_ip".
  - Dropped "uprobes, arm64: untag user pointers in find_active_uprobe".
  - Dropped "bpf, arm64: untag user pointers in stack_map_get_build_id_offset".
- Rebased onto 5.1-rc7 (37624b58).

Changes in v13:
- Simplified untagging in tcp_zerocopy_receive().
- Looked at find_vma() callers in drivers/, which allowed identifying a few other places where untagging is needed.
- Added patch "mm, arm64: untag user pointers in get_vaddr_frames".
- Added patch "drm/amdgpu, arm64: untag user pointers in amdgpu_ttm_tt_get_user_pages".
- Added patch "drm/radeon, arm64: untag user pointers in radeon_ttm_tt_pin_userptr".
- Added patch "IB/mlx4, arm64: untag user pointers in mlx4_get_umem_mr".
- Added patch "media/v4l2-core, arm64: untag user pointers in videobuf_dma_contig_user_get".
- Added patch "tee/optee, arm64: untag user pointers in check_mem_type".
- Added patch "vfio/type1, arm64: untag user pointers".

Changes in v12:
- Changed untagging in tcp_zerocopy_receive() to also untag zc->address.
- Fixed untagging in prctl_set_mm* to only untag pointers for vma lookups and validity checks, but leave them as is for actual user space accesses.
- Updated the link to the v2 of the "arm64 relaxed ABI" patchset [3].
- Dropped the documentation patch, as the "arm64 relaxed ABI" patchset [3] handles that.

Changes in v11:
- Added "uprobes, arm64: untag user pointers in find_active_uprobe" patch.
- Added "bpf, arm64: untag user pointers in stack_map_get_build_id_offset" patch.
- Fixed "tracing, arm64: untag user pointers in seq_print_user_ip" to correctly perform subtraction with a tagged addr.
- Moved untagged_addr() from SYSCALL_DEFINE3(mprotect) and SYSCALL_DEFINE4(pkey_mprotect) to do_mprotect_pkey().
- Moved untagged_addr() definition for other arches from include/linux/memory.h to include/linux/mm.h.
- Changed untagging in strn*_user() to perform userspace accesses through tagged pointers.
- Updated the documentation to mention that passing tagged pointers to memory syscalls is allowed.
- Updated the test to use malloc'ed memory instead of stack memory.

Changes in v10:
- Added "mm, arm64: untag user pointers passed to memory syscalls" back.
- New patch "fs, arm64: untag user pointers in fs/userfaultfd.c".
- New patch "net, arm64: untag user pointers in tcp_zerocopy_receive".
- New patch "kernel, arm64: untag user pointers in prctl_set_mm*".
- New patch "tracing, arm64: untag user pointers in seq_print_user_ip".

Changes in v9:
- Rebased onto 4.20-rc6.
- Used u64 instead of __u64 in type casts in the untagged_addr macro for arm64.
- Added braces around (addr) in the untagged_addr macro for other arches.

Changes in v8:
- Rebased onto 65102238 (4.20-rc1).
- Added a note to the cover letter on why syscall wrappers/shims that untag user pointers won't work.
- Added a note to the cover letter that this patchset has been merged into the Pixel 2 kernel tree.
- Documentation fixes, in particular added a list of syscalls that don't support tagged user pointers.

Changes in v7:
- Rebased onto 17b57b18 (4.19-rc6).
- Dropped the "arm64: untag user address in __do_user_fault" patch, since the existing patches already handle user faults properly.
- Dropped the "usb, arm64: untag user addresses in devio" patch, since the passed pointer must come from a vma and therefore be untagged.
- Dropped the "arm64: annotate user pointers casts detected by sparse" patch (see the discussion in the replies to v6 of this patchset).
- Added more context to the cover letter.
- Updated Documentation/arm64/tagged-pointers.txt.

Changes in v6:
- Added annotations for user pointer casts found by sparse.
- Rebased onto 050cdc6c (4.19-rc1+).

Changes in v5:
- Added 3 new patches that add untagging to places found with static analysis.
- Rebased onto 44c929e1 (4.18-rc8).

Changes in v4:
- Added a selftest for checking that passing tagged pointers to the kernel succeeds.
- Rebased onto 81e97f013 (4.18-rc1+).

Changes in v3:
- Rebased onto e5c51f30 (4.17-rc6+).
- Added linux-arch@ to the list of recipients.

Changes in v2:
- Rebased onto 2d618bdf (4.17-rc3+).
- Removed excessive untagging in gup.c.
- Removed untagging pointers returned from __uaccess_mask_ptr.

Changes in v1:
- Rebased onto 4.17-rc1.

Changes in RFC v2:
- Added "#ifndef untagged_addr..." fallback in linux/uaccess.h instead of defining it for each arch individually.
- Updated Documentation/arm64/tagged-pointers.txt.
- Dropped "mm, arm64: untag user addresses in memory syscalls".
- Rebased onto 3eb2ce82 (4.16-rc7).

Signed-off-by: Andrey Konovalov <andreyknvl(a)google.com>

Andrey Konovalov (17):
  uaccess: add untagged_addr definition for other arches
  arm64: untag user pointers in access_ok and __uaccess_mask_ptr
  lib, arm64: untag user pointers in strn*_user
  mm: add ksys_ wrappers to memory syscalls
  arms64: untag user pointers passed to memory syscalls
  mm: untag user pointers in do_pages_move
  mm, arm64: untag user pointers in mm/gup.c
  mm, arm64: untag user pointers in get_vaddr_frames
  fs, arm64: untag user pointers in copy_mount_options
  fs, arm64: untag user pointers in fs/userfaultfd.c
  drm/amdgpu, arm64: untag user pointers
  drm/radeon, arm64: untag user pointers
  IB/mlx4, arm64: untag user pointers in mlx4_get_umem_mr
  media/v4l2-core, arm64: untag user pointers in videobuf_dma_contig_user_get
  tee, arm64: untag user pointers in tee_shm_register
  vfio/type1, arm64: untag user pointers in vaddr_get_pfn
  selftests, arm64: add a selftest for passing tagged pointers to kernel

 arch/arm64/include/asm/uaccess.h              |  10 +-
 arch/arm64/kernel/sys.c                       | 128 ++++++++++++++++-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c       |   2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   2 +-
 drivers/gpu/drm/radeon/radeon_gem.c           |   2 +
 drivers/gpu/drm/radeon/radeon_ttm.c           |   2 +-
 drivers/infiniband/hw/mlx4/mr.c               |   7 +-
 drivers/media/v4l2-core/videobuf-dma-contig.c |   9 +-
 drivers/tee/tee_shm.c                         |   1 +
 drivers/vfio/vfio_iommu_type1.c               |   2 +
 fs/namespace.c                                |   2 +-
 fs/userfaultfd.c                              |   5 +
 include/linux/mm.h                            |   4 +
 include/linux/syscalls.h                      |  22 +++
 ipc/shm.c                                     |   7 +-
 lib/strncpy_from_user.c                       |   3 +-
 lib/strnlen_user.c                            |   3 +-
 mm/frame_vector.c                             |   2 +
 mm/gup.c                                      |   4 +
 mm/madvise.c                                  | 129 +++++++++---------
 mm/mempolicy.c                                |  21 ++-
 mm/migrate.c                                  |   1 +
 mm/mincore.c                                  |  57 ++++----
 mm/mlock.c                                    |  20 ++-
 mm/mmap.c                                     |  30 +++-
 mm/mprotect.c                                 |   6 +-
 mm/mremap.c                                   |  27 ++--
 mm/msync.c                                    |  35 +++--
 tools/testing/selftests/arm64/.gitignore      |   1 +
 tools/testing/selftests/arm64/Makefile        |  11 ++
 .../testing/selftests/arm64/run_tags_test.sh  |  12 ++
 tools/testing/selftests/arm64/tags_test.c     |  21 +++
 33 files changed, 431 insertions(+), 159 deletions(-)
 create mode 100644 tools/testing/selftests/arm64/.gitignore
 create mode 100644 tools/testing/selftests/arm64/Makefile
 create mode 100755 tools/testing/selftests/arm64/run_tags_test.sh
 create mode 100644 tools/testing/selftests/arm64/tags_test.c

-- 
2.21.0.593.g511ec345e18-goog
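
As a rough illustration of the check-time untagging described in the overview of this cover letter, the sketch below models it in plain userspace C. It assumes the arm64 convention that the tag occupies bits 63:56 of a pointer and that bit 55 distinguishes kernel from user addresses. The helper names (sketch_set_tag, sketch_untag, sketch_has_tag, sketch_range_contains) are made up for this example and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_TAG_SHIFT 56
#define SKETCH_TAG_MASK  (UINT64_C(0xff) << SKETCH_TAG_SHIFT)

/* Hypothetical helper: place an 8-bit tag in the top byte of an address. */
static uint64_t sketch_set_tag(uint64_t addr, uint8_t tag)
{
	return (addr & ~SKETCH_TAG_MASK) | ((uint64_t)tag << SKETCH_TAG_SHIFT);
}

/*
 * Hypothetical helper: untag by sign-extending from bit 55. User addresses
 * (bit 55 clear) get a zero top byte; kernel addresses (bit 55 set) get 0xff.
 */
static uint64_t sketch_untag(uint64_t addr)
{
	if (addr & (UINT64_C(1) << 55))
		return addr | SKETCH_TAG_MASK;
	return addr & ~SKETCH_TAG_MASK;
}

/* Hypothetical helper: true if untagging would change the address. */
static bool sketch_has_tag(uint64_t addr)
{
	return sketch_untag(addr) != addr;
}

/*
 * Range check on [start, end) that untags only for the comparison. The
 * caller's pointer is left untouched, so later accesses still carry the tag.
 */
static bool sketch_range_contains(uint64_t start, uint64_t end, uint64_t addr)
{
	uint64_t untagged = sketch_untag(addr);

	assert(start <= end);
	return untagged >= start && untagged < end;
}
```

Comparing the raw tagged value against the range bounds would fail for any non-zero tag, which is the find_vma() caller problem mentioned under "Other approaches".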


[PATCH v7 resend 1/2] Provide in-kernel headers to make extending kernel easier

by Joel Fernandes (Google)

Introduce in-kernel headers which are made available as an archive through proc (/proc/kheaders.tar.xz file). This archive makes it possible to run eBPF and other tracing programs that need to extend the kernel for tracing purposes without any dependency on the file system having headers.

A GitHub PR has been sent for the corresponding BCC patch at: https://github.com/iovisor/bcc/pull/2312

On Android and embedded systems, it is common to switch kernels but not have kernel headers available on the file system. Further, once a different kernel is booted, any headers stored on the file system will no longer be useful. This issue is well known even to distros. By storing the headers as a compressed archive within the kernel, we can avoid these issues that have been a hindrance for a long time.

The best way to use this feature is by building it in. Several users need this: when they switch debug kernels, they do not want to update the filesystem or worry about where to store the headers on it. However, the feature is also buildable as a module in case the user does not want it to be part of the kernel image. This makes it possible to load and unload the headers from memory on demand. A tracing program can load the module, do its operations, and then unload the module to save kernel memory. The total memory needed is 3.3MB.

By having the archive available at a fixed location independent of filesystem dependencies and conventions, all debugging tools can directly refer to the fixed location for the archive, without being concerned with where the headers live on a typical filesystem, which significantly simplifies tooling that needs kernel headers.

The code to read the headers is based on /proc/config.gz code and uses the same technique to embed the headers.

Other approaches were discussed such as having an in-memory mountable filesystem, but that has drawbacks such as requiring an in-kernel xz decompressor which we don't have today, and requiring usage of 42 MB of kernel memory to host the decompressed headers at any time. Also this approach is simpler than such approaches.

Reviewed-by: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
---
(Just a resend with Masahiro's Reviewed-by tag added)

v6 -> v7:
- Minor nits from Masahiro Yamada are addressed.

v5 -> v6: (Masahiro Yamada suggestions mostly)
- Dropped support for module building.
- Rebuild archive if script changes.
- Move archive file list to script.
- Move build script to kernel directory.

v4 -> v5:
(v4 was Tested-by the following folks)
Tested-by: qais.yousef(a)arm.com
Tested-by: dietmar.eggemann(a)arm.com
Tested-by: linux(a)manojrajarao.com
(Thanks to Masahiro Yamada for several excellent suggestions)
- used incbin instead of bin2c (Masahiro had a similar idea)
- added module.lds if ia64 otherwise ia64 may fail to build.
- added clean-files rule to Makefile
- removed strip-comments script and doing it inline
- added set -e to header generated to die on errors
- fixed a minor issue where find command was noisy.
- removed unneeded tar.xz rule from kernel/.gitignore
- added Tested-by tags from ARM folks.

Changes since v3:
- Blank tar was being generated because of a one-line change I forgot to push. It is updated now.
- Added module.lds since arm64 needs it to build modules.

Changes since v2: (Thanks to Masahiro Yamada for several excellent suggestions)
- Added support for out of tree builds.
- Added incremental build support bringing down build time of incremental builds from 50 seconds to 5 seconds.
- Fixed various small nits / cleanups.
- clean ups to kheaders.c pointed by Alexey Dobriyan.
- Fixed MODULE_LICENSE in test module and kheaders.c
- Dropped Module.symvers from archive due to circular dependency.

Changes since v1:
- removed IKH_EXTRA variable, not needed (Masahiro Yamada)
- small fix ups to selftest
- added target to main Makefile etc
- added MODULE_LICENSE to test module
- made selftest more quiet

Changes since RFC:
Both changes bring size down to 3.8MB:
- use xz for compression
- strip comments except SPDX lines
- Call out the module name in Kconfig
- Also added selftests in second patch to ensure headers are always working.

Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>

 init/Kconfig           | 10 +++++
 kernel/.gitignore      |  1 +
 kernel/Makefile        | 10 +++++
 kernel/gen_ikh_data.sh | 89 ++++++++++++++++++++++++++++++++++++++++++
 kernel/kheaders.c      | 74 +++++++++++++++++++++++++++++++++++
 5 files changed, 184 insertions(+)
 create mode 100755 kernel/gen_ikh_data.sh
 create mode 100644 kernel/kheaders.c

diff --git a/init/Kconfig b/init/Kconfig
index 4592bf7997c0..47c0db6e63a5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -580,6 +580,16 @@ config IKCONFIG_PROC
 	  This option enables access to the kernel configuration file
 	  through /proc/config.gz.
 
+config IKHEADERS_PROC
+	tristate "Enable kernel header artifacts through /proc/kheaders.tar.xz"
+	depends on PROC_FS
+	help
+	  This option enables access to the kernel header and other artifacts that
+	  are generated during the build process. These can be used to build eBPF
+	  tracing programs, or similar programs. If you build the headers as a
+	  module, a module called kheaders.ko is built which can be loaded on-demand
+	  to get access to the headers.
+
 config LOG_BUF_SHIFT
 	int "Kernel log buffer size (16 => 64KB, 17 => 128KB)"
 	range 12 25
diff --git a/kernel/.gitignore b/kernel/.gitignore
index 6e699100872f..34d1e77ee9df 100644
--- a/kernel/.gitignore
+++ b/kernel/.gitignore
@@ -1,5 +1,6 @@
 #
 # Generated files
 #
+kheaders.md5
 timeconst.h
 hz.bc
diff --git a/kernel/Makefile b/kernel/Makefile
index 6c57e78817da..12399614c350 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -70,6 +70,7 @@ obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
 obj-$(CONFIG_PID_NS) += pid_namespace.o
 obj-$(CONFIG_IKCONFIG) += configs.o
+obj-$(CONFIG_IKHEADERS_PROC) += kheaders.o
 obj-$(CONFIG_SMP) += stop_machine.o
 obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o
 obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
@@ -121,3 +122,12 @@ $(obj)/configs.o: $(obj)/config_data.gz
 targets += config_data.gz
 $(obj)/config_data.gz: $(KCONFIG_CONFIG) FORCE
 	$(call if_changed,gzip)
+
+$(obj)/kheaders.o: $(obj)/kheaders_data.tar.xz
+
+quiet_cmd_genikh = CHK $(obj)/kheaders_data.tar.xz
+cmd_genikh = $(srctree)/kernel/gen_ikh_data.sh $@
+$(obj)/kheaders_data.tar.xz: FORCE
+	$(call cmd,genikh)
+
+clean-files := kheaders_data.tar.xz kheaders.md5
diff --git a/kernel/gen_ikh_data.sh b/kernel/gen_ikh_data.sh
new file mode 100755
index 000000000000..591a94f7b387
--- /dev/null
+++ b/kernel/gen_ikh_data.sh
@@ -0,0 +1,89 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# This script generates an archive consisting of kernel headers
+# for CONFIG_IKHEADERS_PROC.
+set -e
+spath="$(dirname "$(readlink -f "$0")")"
+kroot="$spath/.."
+outdir="$(pwd)"
+tarfile=$1
+cpio_dir=$outdir/$tarfile.tmp
+
+# Script filename relative to the kernel source root
+# We add it to the archive because it is small and any changes
+# to this script will also cause a rebuild of the archive.
+sfile="$(realpath --relative-to $kroot "$(readlink -f "$0")")"
+
+src_file_list="
+include/
+arch/$SRCARCH/include/
+$sfile
+"
+
+obj_file_list="
+include/
+arch/$SRCARCH/include/
+"
+
+# Support incremental builds by skipping archive generation
+# if timestamps of files being archived are not changed.
+
+# This block is useful for debugging the incremental builds.
+# Uncomment it for debugging.
+# iter=1
+# if [ ! -f /tmp/iter ]; then echo 1 > /tmp/iter;
+# else; iter=$(($(cat /tmp/iter) + 1)); fi
+# find $src_file_list -type f | xargs ls -lR > /tmp/src-ls-$iter
+# find $obj_file_list -type f | xargs ls -lR > /tmp/obj-ls-$iter
+
+# include/generated/compile.h is ignored because it is touched even when none
+# of the source files changed. This causes pointless regeneration, so let us
+# ignore them for md5 calculation.
+pushd $kroot > /dev/null
+src_files_md5="$(find $src_file_list -type f |
+	grep -v "include/generated/compile.h" |
+	xargs ls -lR | md5sum | cut -d ' ' -f1)"
+popd > /dev/null
+obj_files_md5="$(find $obj_file_list -type f |
+	grep -v "include/generated/compile.h" |
+	xargs ls -lR | md5sum | cut -d ' ' -f1)"
+
+if [ -f $tarfile ]; then tarfile_md5="$(md5sum $tarfile | cut -d ' ' -f1)"; fi
+if [ -f kernel/kheaders.md5 ] &&
+	[ "$(cat kernel/kheaders.md5|head -1)" == "$src_files_md5" ] &&
+	[ "$(cat kernel/kheaders.md5|head -2|tail -1)" == "$obj_files_md5" ] &&
+	[ "$(cat kernel/kheaders.md5|tail -1)" == "$tarfile_md5" ]; then
+		exit
+fi
+
+if [ "${quiet}" != "silent_" ]; then
+	echo " GEN $tarfile"
+fi
+
+rm -rf $cpio_dir
+mkdir $cpio_dir
+
+pushd $kroot > /dev/null
+for f in $src_file_list;
+	do find "$f" ! -name "*.cmd" ! -name ".*";
+done | cpio --quiet -pd $cpio_dir
+popd > /dev/null
+
+# The second CPIO can complain if files already exist which can
+# happen with out of tree builds. Just silence CPIO for now.
+for f in $obj_file_list;
+	do find "$f" ! -name "*.cmd" ! -name ".*";
+done | cpio --quiet -pd $cpio_dir >/dev/null 2>&1
+
+# Remove comments except SDPX lines
+find $cpio_dir -type f -print0 |
+	xargs -0 -P8 -n1 perl -pi -e 'BEGIN {undef $/;}; s/\/\*((?!SPDX).)*?\*\///smg;'
+
+tar -Jcf $tarfile -C $cpio_dir/ . > /dev/null
+
+echo "$src_files_md5" > kernel/kheaders.md5
+echo "$obj_files_md5" >> kernel/kheaders.md5
+echo "$(md5sum $tarfile | cut -d ' ' -f1)" >> kernel/kheaders.md5
+
+rm -rf $cpio_dir
diff --git a/kernel/kheaders.c b/kernel/kheaders.c
new file mode 100644
index 000000000000..70ae6052920d
--- /dev/null
+++ b/kernel/kheaders.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Provide kernel headers useful to build tracing programs
+ * such as for running eBPF tracing tools.
+ *
+ * (Borrowed code from kernel/configs.c)
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/init.h>
+#include <linux/uaccess.h>
+
+/*
+ * Define kernel_headers_data and kernel_headers_data_end, within which the
+ * compressed kernel headers are stored. The file is first compressed with xz.
+ */
+
+asm (
+" .pushsection .rodata, \"a\" \n"
+" .global kernel_headers_data \n"
+"kernel_headers_data: \n"
+" .incbin \"kernel/kheaders_data.tar.xz\" \n"
+" .global kernel_headers_data_end \n"
+"kernel_headers_data_end: \n"
+" .popsection \n"
+);
+
+extern char kernel_headers_data;
+extern char kernel_headers_data_end;
+
+static ssize_t
+ikheaders_read_current(struct file *file, char __user *buf,
+		       size_t len, loff_t *offset)
+{
+	return simple_read_from_buffer(buf, len, offset,
+				       &kernel_headers_data,
+				       &kernel_headers_data_end -
+				       &kernel_headers_data);
+}
+
+static const struct file_operations ikheaders_file_ops = {
+	.read = ikheaders_read_current,
+	.llseek = default_llseek,
+};
+
+static int __init ikheaders_init(void)
+{
+	struct proc_dir_entry *entry;
+
+	/* create the current headers file */
+	entry = proc_create("kheaders.tar.xz", S_IRUGO, NULL,
+			    &ikheaders_file_ops);
+	if (!entry)
+		return -ENOMEM;
+
+	proc_set_size(entry,
+		      &kernel_headers_data_end -
+		      &kernel_headers_data);
+	return 0;
+}
+
+static void __exit ikheaders_cleanup(void)
+{
+	remove_proc_entry("kheaders.tar.xz", NULL);
+}
+
+module_init(ikheaders_init);
+module_exit(ikheaders_cleanup);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Joel Fernandes");
+MODULE_DESCRIPTION("Echo the kernel header artifacts used to build the kernel");
-- 
2.21.0.593.g511ec345e18-goog
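
The read handler in kheaders.c above hands the start address and length of the embedded archive to simple_read_from_buffer(). The userspace sketch below models, under simplifying assumptions, the offset bookkeeping such a helper performs: memcpy() stands in for copy_to_user(), errors are reduced to a -1 return, and sketch_read_from_buffer is a hypothetical name rather than a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace model of a bounded, offset-tracking read:
 * copy at most `count` bytes starting at *ppos out of a buffer holding
 * `available` bytes, then advance *ppos by the number of bytes copied.
 * Returns the number of bytes copied, 0 at end of data, -1 on a bad offset.
 */
static long sketch_read_from_buffer(void *to, size_t count, long long *ppos,
				    const void *from, size_t available)
{
	long long pos = *ppos;

	assert(to && from);
	if (pos < 0)
		return -1;
	if ((unsigned long long)pos >= available || count == 0)
		return 0;
	/* Clamp the request so it never runs past the end of the data. */
	if (count > available - (size_t)pos)
		count = available - (size_t)pos;
	memcpy(to, (const char *)from + pos, count);
	*ppos = pos + (long long)count;
	return (long)count;
}
```

Because the offset is clamped and advanced on every call, a reader such as cat or tar can pull the archive in arbitrary chunk sizes and sees a plain end-of-file once the offset reaches the archive length.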


[PATCH] rcutorture: Tweak kvm options

by Sebastian Andrzej Siewior

In one of my rcutorture tests the TSC clocksource got marked unstable due to a large difference in the TSC value. I'm not sure if the guest ran for a long time with disabled interrupts or if the host was very busy and didn't schedule the guest for some time.

I took a look at the qemu/KVM options and decided to update the options:
- Use kvm{32|64} as CPU. We could probably use `host' (like ARM does) for maximum available features but since we don't run any userland I'm not sure if it makes any difference.

- Drop the "noapic" option, enable TSC deadline timer. There is no history explaining why the APIC was disabled, and I see no reason for it. The deadline timer is probably "nicer".

- Additional config options. They ensure that the kernel knows that it runs as a kvm guest and can use virt devices like the kvm-clock as clocksource. The kvm-clock was the main motivation here.

- I didn't add a random HW device. It would make the random device ready earlier (now it doesn't complete the initialisation at all) but I doubt that there is any need for this.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
---
 tools/testing/selftests/rcutorture/bin/functions.sh | 13 ++++++++++++-
 .../selftests/rcutorture/configs/rcu/CFcommon       |  4 ++++
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/rcutorture/bin/functions.sh b/tools/testing/selftests/rcutorture/bin/functions.sh
index 6bcb8b5b2ff22..be3c5c73d7e79 100644
--- a/tools/testing/selftests/rcutorture/bin/functions.sh
+++ b/tools/testing/selftests/rcutorture/bin/functions.sh
@@ -172,7 +172,7 @@ identify_qemu_append () {
 	local console=ttyS0
 	case "$1" in
 	qemu-system-x86_64|qemu-system-i386)
-		echo noapic selinux=0 initcall_debug debug
+		echo selinux=0 initcall_debug debug
 		;;
 	qemu-system-aarch64)
 		console=ttyAMA0
@@ -191,8 +191,19 @@ identify_qemu_append () {
 # Output arguments for qemu arguments based on the TORTURE_QEMU_MAC
 # and TORTURE_QEMU_INTERACTIVE environment variables.
 identify_qemu_args () {
+	local KVM_CPU=""
+	case "$1" in
+	qemu-system-x86_64)
+		KVM_CPU=kvm64
+		;;
+	qemu-system-i386)
+		KVM_CPU=kvm32
+		;;
+	esac
 	case "$1" in
 	qemu-system-x86_64|qemu-system-i386)
+		echo -machine q35,accel=kvm
+		echo -cpu ${KVM_CPU},x2apic=on,tsc-deadline=on,hypervisor=on,tsc_adjust=on
 		;;
 	qemu-system-aarch64)
 		echo -machine virt,gic-version=host -cpu host
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/CFcommon b/tools/testing/selftests/rcutorture/configs/rcu/CFcommon
index d2d2a86139db1..322d5d40443cd 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/CFcommon
+++ b/tools/testing/selftests/rcutorture/configs/rcu/CFcommon
@@ -1,2 +1,6 @@
 CONFIG_RCU_TORTURE_TEST=y
 CONFIG_PRINTK_TIME=y
+CONFIG_HYPERVISOR_GUEST=y
+CONFIG_PARAVIRT=y
+CONFIG_PARAVIRT_SPINLOCKS=y
+CONFIG_KVM_GUEST=y
-- 
2.20.1


Re: [PATCH 3/4] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation

by Andy Lutomirski

On Mon, Apr 29, 2019 at 12:13 PM Linus Torvalds <torvalds(a)linux-foundation.org> wrote:
>
> On Mon, Apr 29, 2019, 12:02 Linus Torvalds <torvalds(a)linux-foundation.org> wrote:
>>
>> If nmi were to break it, it would be a cpu bug.
>
> Side note: we *already* depend on sti shadow working in other parts of the kernel, namely sti->iret.
>

Where? STI; IRET would be nuts.

Before:

commit 4214a16b02971c60960afd675d03544e109e0d75
Author: Andy Lutomirski <luto(a)kernel.org>
Date:   Thu Apr 2 17:12:12 2015 -0700

    x86/asm/entry/64/compat: Use SYSRETL to return from compat mode SYSENTER

we did sti; sysexit, but, when we discussed this, I don't recall anyone speaking up in favor of the safety of the old code.

Not to mention that the crash we'll get if we get an NMI and a rescheduling interrupt in this path will be very, very hard to debug.


[PATCH v1 1/2] Add polling support to pidfd

by Joel Fernandes (Google)

pidfds are file descriptors referring to a process, created with the CLONE_PIDFD clone(2) flag. Android low memory killer (LMK) needs pidfd polling support to replace code that currently checks for the existence of /proc/pid to learn that a process that was signalled to be killed has died, which is both racy and slow. The pidfd poll approach is race-free, and also allows the LMK to do other things (such as polling on other fds) while waiting for the process being killed to die.

It prevents a situation where a PID is reused between when LMK sends a kill signal and when it checks for existence of the PID, in which case the wrong process would be checked for existence.

In this patch, we follow the same existing mechanism in the kernel used when the parent of the task group is to be notified (do_notify_parent). This is when the tasks waiting on a poll of pidfd are also awakened.

We have decided to include the waitqueue in struct pid for the following reasons:
1. The wait queue has to survive for the lifetime of the poll. Including it in task_struct would not be an option in this case because the task can be reaped and destroyed before the poll returns.

2. Including the waitqueue in struct pid means that during de_thread(), the new thread group leader automatically gets the new waitqueue/pid even though its task_struct is different.

Appropriate test cases are added in the second patch to provide coverage of all the cases the patch is handling.

Andy had a similar patch [1] in the past which was a good reference; however, this patch tries to properly handle different situations related to thread group existence, and how/where it notifies. It also solves other bugs (waitqueue lifetime). Daniel had a similar patch [2] recently which this patch supersedes.

[1] https://lore.kernel.org/patchwork/patch/345098/
[2] https://lore.kernel.org/lkml/20181029175322.189042-1-dancol@google.com/

Cc: luto(a)amacapital.net
Cc: rostedt(a)goodmis.org
Cc: dancol(a)google.com
Cc: sspatil(a)google.com
Cc: christian(a)brauner.io
Cc: jannh(a)google.com
Cc: surenb(a)google.com
Cc: timmurray(a)google.com
Cc: Jonathan Kowalski <bl0pbl33p(a)gmail.com>
Cc: torvalds(a)linux-foundation.org
Cc: kernel-team(a)android.com
Co-developed-by: Daniel Colascione <dancol(a)google.com>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
---
RFC -> v1:
* Based on CLONE_PIDFD patches: https://lwn.net/Articles/786244/
* Updated selftests.
* Renamed poll wake function to do_notify_pidfd.
* Removed depending on EXIT flags
* Removed POLLERR flag since semantics are controversial and
  we don't have usecases for it right now (later we can add if there's a
  need for it).

 include/linux/pid.h |  3 +++
 kernel/fork.c       | 33 +++++++++++++++++++++++++++++++++
 kernel/pid.c        |  2 ++
 kernel/signal.c     | 14 ++++++++++++++
 4 files changed, 52 insertions(+)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3c8ef5a199ca..1484db6ca8d1 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -3,6 +3,7 @@
 #define _LINUX_PID_H
 
 #include <linux/rculist.h>
+#include <linux/wait.h>
 
 enum pid_type
 {
@@ -60,6 +61,8 @@ struct pid
 	unsigned int level;
 	/* lists of tasks that use this pid */
 	struct hlist_head tasks[PIDTYPE_MAX];
+	/* wait queue for pidfd notifications */
+	wait_queue_head_t wait_pidfd;
 	struct rcu_head rcu;
 	struct upid numbers[1];
 };
diff --git a/kernel/fork.c b/kernel/fork.c
index 5525837ed80e..fb3b614f6456 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1685,8 +1685,41 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
 }
 #endif
 
+static unsigned int pidfd_poll(struct file *file, struct poll_table_struct *pts)
+{
+	struct task_struct *task;
+	struct pid *pid;
+	int poll_flags = 0;
+
+	/*
+	 * tasklist_lock must be held because to avoid racing with
+	 * changes in exit_state and wake up. Basically to avoid:
+	 *
+	 * P0: read exit_state = 0
+	 * P1: write exit_state = EXIT_DEAD
+	 * P1: Do a wake up - wq is empty, so do nothing
+	 * P0: Queue for polling - wait forever.
+	 */
+	read_lock(&tasklist_lock);
+	pid = file->private_data;
+	task = pid_task(pid, PIDTYPE_PID);
+	WARN_ON_ONCE(task && !thread_group_leader(task));
+
+	if (!task || (task->exit_state && thread_group_empty(task)))
+		poll_flags = POLLIN | POLLRDNORM;
+
+	if (!poll_flags)
+		poll_wait(file, &pid->wait_pidfd, pts);
+
+	read_unlock(&tasklist_lock);
+
+	return poll_flags;
+}
+
+
 const struct file_operations pidfd_fops = {
 	.release = pidfd_release,
+	.poll = pidfd_poll,
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo = pidfd_show_fdinfo,
 #endif
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..5c90c239242f 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -214,6 +214,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	for (type = 0; type < PIDTYPE_MAX; ++type)
 		INIT_HLIST_HEAD(&pid->tasks[type]);
 
+	init_waitqueue_head(&pid->wait_pidfd);
+
 	upid = pid->numbers + ns->level;
 	spin_lock_irq(&pidmap_lock);
 	if (!(ns->pid_allocated & PIDNS_ADDING))
diff --git a/kernel/signal.c b/kernel/signal.c
index 1581140f2d99..16e7718316e5 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1800,6 +1800,17 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
 	return ret;
 }
 
+static void do_notify_pidfd(struct task_struct *task)
+{
+	struct pid *pid;
+
+	lockdep_assert_held(&tasklist_lock);
+
+	pid = get_task_pid(task, PIDTYPE_PID);
+	wake_up_all(&pid->wait_pidfd);
+	put_pid(pid);
+}
+
 /*
  * Let a parent know about the death of a child.
  * For a stopped/continued status change, use do_notify_parent_cldstop instead.
@@ -1823,6 +1834,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	BUG_ON(!tsk->ptrace &&
 	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
 
+	/* Wake up all pidfd waiters */
+	do_notify_pidfd(tsk);
+
 	if (sig != SIGCHLD) {
 		/*
 		 * This is only possible if parent == real_parent.
-- 
2.21.0.593.g511ec345e18-goog
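
For readers wondering how userspace would consume the new .poll handler, here is a minimal sketch, not part of the patch. It assumes the caller already holds a pidfd obtained elsewhere (for instance through the CLONE_PIDFD work this patch is based on). The helper name pidfd_wait_exit and its return convention are hypothetical; only poll() and the POLLIN readiness reported by pidfd_poll() above are taken as given.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>

/*
 * Hypothetical helper, not part of the patch: block until the process
 * behind pidfd has exited, as reported by POLLIN on the pidfd.
 *
 * Returns 1 if POLLIN was reported, 0 if timeout_ms elapsed first, and
 * -1 if poll() failed or reported some other condition (e.g. POLLNVAL).
 * A timeout_ms of -1 waits without a time limit.
 */
static int pidfd_wait_exit(int pidfd, int timeout_ms)
{
	struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
	int ret;

	assert(timeout_ms >= -1);

	/* Retry if a signal interrupts the wait (this restarts the timeout). */
	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	if (ret <= 0)
		return ret;	/* 0: timed out, -1: poll() failed */

	return (pfd.revents & POLLIN) ? 1 : -1;
}
```

A caller would typically invoke pidfd_wait_exit(pidfd, -1) and, once it returns 1, go on to collect the exit status by whatever means it already uses. Because the readiness condition is level-triggered in pidfd_poll(), a process that has already exited before the call is reported immediately rather than waited on.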


[PATCH 0/4] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation

by Nicolai Stange

Hi,

this series is the result of the discussion of the RFC patch found at [1].

The goal is to make x86's ftrace_int3_handler() not simply skip over the
trapping instruction, as this is problematic in the context of the live
patching consistency model. For details, cf. the commit message of [3/4]
("x86/ftrace: make ftrace_int3_handler() not to skip fops invocation").

Everything is based on v5.1-rc6; please let me know in case you want me to
rebase on something else.

For x86_64, the live patching selftest added in [4/4] succeeds with this
series applied and fails without it. On 32 bits I only compile-tested.

checkpatch reports warnings about
- an overlong line in assembly -- I chose to ignore that
- MAINTAINERS perhaps needing updates due to the new files
  arch/x86/kernel/ftrace_int3_stubs.S and
  tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh.
  As the existing arch/x86/kernel/ftrace_{32,64}.S haven't got an
  explicit entry either, this one is probably Ok? The selftest
  definitely is.

Changes to the RFC patch:
- s/trampoline/stub/ to avoid confusion with the ftrace_ops' trampolines,
- use a fixed size stack kept in struct thread_info for passing the
  (adjusted) ->ip values from ftrace_int3_handler() to the stubs,
- provide one stub for each of the two possible jump targets and hardcode
  those,
- add the live patching selftest.

Thanks,

Nicolai

Nicolai Stange (4):
  x86/thread_info: introduce ->ftrace_int3_stack member
  ftrace: drop 'static' qualifier from ftrace_ops_list_func()
  x86/ftrace: make ftrace_int3_handler() not to skip fops invocation
  selftests/livepatch: add "ftrace a live patched function" test

 arch/x86/include/asm/thread_info.h                 | 11 +++
 arch/x86/kernel/Makefile                           |  1 +
 arch/x86/kernel/asm-offsets.c                      |  8 +++
 arch/x86/kernel/ftrace.c                           | 79 +++++++++++++++++++---
 arch/x86/kernel/ftrace_int3_stubs.S                | 61 +++++++++++++++++
 kernel/trace/ftrace.c                              |  8 +--
 tools/testing/selftests/livepatch/Makefile         |  3 +-
 .../livepatch/test-livepatch-vs-ftrace.sh          | 44 ++++++++++++
 8 files changed, 199 insertions(+), 16 deletions(-)
 create mode 100644 arch/x86/kernel/ftrace_int3_stubs.S
 create mode 100755 tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh

-- 
2.13.7
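
The "fixed size stack" mentioned in the changes list can be pictured with a small userspace sketch. This is an illustration only, not the code from [1/4]: the type name, field names, and the depth of 4 are all hypothetical, and the reason for using a stack rather than a single slot (so that nested traps each get their own entry) is an assumption not stated in the cover letter.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed depth for illustration; the series may use a different value. */
#define INT3_STACK_DEPTH 4

/* Hypothetical stand-in for a per-thread stack of saved ->ip values. */
struct int3_stack {
	size_t depth;				/* number of slots in use */
	unsigned long slots[INT3_STACK_DEPTH];	/* saved ->ip values */
};

/* Handler side: record an ip. Returns 0 on success, -1 if the stack is full. */
static int int3_stack_push(struct int3_stack *s, unsigned long ip)
{
	assert(s != NULL);

	if (s->depth >= INT3_STACK_DEPTH)
		return -1;
	s->slots[s->depth++] = ip;
	return 0;
}

/* Stub side: fetch the most recent ip. Returns 0 on success, -1 if empty. */
static int int3_stack_pop(struct int3_stack *s, unsigned long *ip)
{
	assert(s != NULL && ip != NULL);

	if (s->depth == 0)
		return -1;
	*ip = s->slots[--s->depth];
	return 0;
}
```

The last-in, first-out order means the most recently pushed value is the first one handed back, which is what a handler-to-stub handoff needs if traps can nest. The fixed array avoids any allocation on the trap path.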


Re: [PATCH 3/4] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation

by Andy Lutomirski

On Mon, Apr 29, 2019 at 11:53 AM Linus Torvalds
<torvalds(a)linux-foundation.org> wrote:
>
> On Mon, Apr 29, 2019, 11:42 Andy Lutomirski <luto(a)kernel.org> wrote:
>>
>> I'm less than 100% convinced about this argument. Sure, an NMI right
>> there won't cause a problem. But an NMI followed by an interrupt will
>> kill us if preemption is on. I can think of three solutions:
>
> No, because either the sti shadow disables nmi too (that's the case on
> some CPUs at least) or the iret from nmi does.
>
> Otherwise you could never trust the whole sti shadow thing - and it very
> much is part of the architecture.
>

Is this documented somewhere? And do you actually believe that this
is true under KVM, Hyper-V, etc? As I recall, Andrew Cooper dug in to
the way that VMX dealt with this stuff and concluded that the SDM was
blatantly wrong in many cases, which leads me to believe that Xen
HVM/PVH is the *only* hypervisor that gets it right.

Steven's point about batched updates is quite valid, though. My
personal favorite solution to this whole mess is to rework the whole
thing so that the int3 handler simply returns and retries and to
replace the sync_core() broadcast with an SMI broadcast. I don't know
whether this will actually work on real CPUs and on VMs and whether
it's going to crash various BIOSes out there.


Re: [PATCH 3/4] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation

by Linus Torvalds

On Mon, Apr 29, 2019 at 12:02 PM Linus Torvalds
<torvalds(a)linux-foundation.org> wrote:
>
> If nmi were to break it, it would be a cpu bug. I'm pretty sure I've
> seen the "shadow stops even nmi" documented for some uarch, but as
> mentioned it's not necessarily the only way to guarantee the shadow.

In fact, the documentation is simply the official Intel instruction
docs for "STI":

    The IF flag and the STI and CLI instructions do not prohibit the
    generation of exceptions and NMI interrupts. NMI interrupts (and
    SMIs) may be blocked for one macroinstruction following an STI.

note the "may be blocked". As mentioned, that's just one option for
not having NMI break the STI shadow guarantee, but it's clearly one
that Intel has done at times, and clearly even documents as having
done so.

There is absolutely no question that the sti shadow is real, and that
people have depended on it for _decades_. It would be a horrible
errata if the shadow can just be made to go away by randomly getting
an NMI or SMI.

              Linus


Re: [PATCH 3/4] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation

by Steven Rostedt

On Mon, 29 Apr 2019 11:59:04 -0700
Linus Torvalds <torvalds(a)linux-foundation.org> wrote:

> I really don't care. Just do what I suggested, and if you have numbers to
> show problems, then maybe I'll care.
>

Are you suggesting that I rewrite the code to do it one function at a
time? This has always been batch mode. This is not something new. The
function tracer has been around longer than the text poke code.

> Right now you're just making excuses for this. I described the solution
> months ago, now I've written a patch, if that's not good enough then we can
> just skip this all entirely.
>
> Honestly, if you need to rewrite tens of thousands of calls, maybe you're
> doing something wrong?
>

 # cd /sys/kernel/debug/tracing
 # cat available_filter_functions | wc -l
 45856
 # cat enabled_functions | wc -l
 0
 # echo function > current_tracer
 # cat enabled_functions | wc -l
 45856

There, I just enabled 45,856 function call sites in one shot!

How else do you want to update them? Every function in the kernel has a
nop, that turns into a call to the ftrace_handler, if I add another
user of that code, it will change each one as well.

-- Steve


[PATCH for 5.2 12/12] rseq/selftests: add -no-integrated-as for clang

by Mathieu Desnoyers

Ongoing work for asm goto support from clang requires the
-no-integrated-as compiler flag.

This compiler flag is present in the toplevel kernel Makefile,
but is not replicated for selftests. Add it specifically for
the rseq selftest which requires asm goto.

Link: https://reviews.llvm.org/D56571
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
CC: Nick Desaulniers <ndesaulniers(a)google.com>
CC: Thomas Gleixner <tglx(a)linutronix.de>
CC: Joel Fernandes <joelaf(a)google.com>
CC: Peter Zijlstra <peterz(a)infradead.org>
CC: Catalin Marinas <catalin.marinas(a)arm.com>
CC: Dave Watson <davejwatson(a)fb.com>
CC: Will Deacon <will.deacon(a)arm.com>
CC: Shuah Khan <shuah(a)kernel.org>
CC: Andi Kleen <andi(a)firstfloor.org>
CC: linux-kselftest(a)vger.kernel.org
CC: "H . Peter Anvin" <hpa(a)zytor.com>
CC: Chris Lameter <cl(a)linux.com>
CC: Russell King <linux(a)arm.linux.org.uk>
CC: Michael Kerrisk <mtk.manpages(a)gmail.com>
CC: "Paul E . McKenney" <paulmck(a)linux.vnet.ibm.com>
CC: Paul Turner <pjt(a)google.com>
CC: Boqun Feng <boqun.feng(a)gmail.com>
CC: Josh Triplett <josh(a)joshtriplett.org>
CC: Steven Rostedt <rostedt(a)goodmis.org>
CC: Ben Maurer <bmaurer(a)fb.com>
CC: linux-api(a)vger.kernel.org
CC: Andy Lutomirski <luto(a)amacapital.net>
CC: Andrew Morton <akpm(a)linux-foundation.org>
CC: Linus Torvalds <torvalds(a)linux-foundation.org>
---
 tools/testing/selftests/rseq/Makefile | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index c30c52e1d0d2..d6469535630a 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -1,5 +1,11 @@
 # SPDX-License-Identifier: GPL-2.0+ OR MIT
-CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/ -L./ -Wl,-rpath=./
+
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
+CLANG_FLAGS += -no-integrated-as
+endif
+
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/ -L./ -Wl,-rpath=./ \
+	  $(CLANG_FLAGS)
 LDLIBS += -lpthread
 
 # Own dependencies because we only want to build against 1st prerequisite, but
-- 
2.11.0
