September 2023 - Linux-kselftest-mirror

[PATCH v5 00/11] Add Intel VT-d nested translation

by Yi Liu

This is to add Intel VT-d nested translation based on IOMMUFD nesting infrastructure. As the iommufd nesting infrastructure series[1], iommu core supports new ops to report iommu hardware information, allocate domains with user data and invalidate stage-1 IOTLB when there is mapping changed in stage-1 page table. The data required in the three paths are vendor-specific, so 1) IOMMU_HWPT_TYPE_VTD_S1 is defined for the Intel VT-d stage-1 page table, it will be used in the stage-1 domain allocation and IOTLB syncing path. struct iommu_hwpt_vtd_s1 is defined to pass user_data for the Intel VT-d stage-1 domain allocation. struct iommu_hwpt_vtd_s1_invalidate is defined to pass the data for the Intel VT-d stage-1 IOTLB invalidation. 2) IOMMU_HW_INFO_TYPE_INTEL_VTD and struct iommu_hw_info_vtd are defined to report iommu hardware information for Intel VT-d. With above IOMMUFD extensions, the intel iommu driver implements the three paths to support nested translation. The first Intel platform supporting nested translation is Sapphire Rapids which, unfortunately, has a hardware errata [2] requiring special treatment. This errata happens when a stage-1 page table page (either level) is located in a stage-2 read-only region. In that case the IOMMU hardware may ignore the stage-2 RO permission and still set the A/D bit in stage-1 page table entries during page table walking. A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report this errata to userspace. With that restriction the user should either disable nested translation to favor RO stage-2 mappings or ensure no RO stage-2 mapping to enable nested translation. Intel-iommu driver is armed with necessary checks to prevent such mix in patch12 of this series. Qemu currently does add RO mappings though. The vfio agent in Qemu simply maps all valid regions in the GPA address space which certainly includes RO regions e.g. vbios. In reality we don't know a usage relying on DMA reads from the BIOS region. Hence finding a way to skip RO regions (e.g. via a discard manager) in Qemu might be an acceptable tradeoff. The actual change needs more discussion in Qemu community. For now we just hacked Qemu to test. Complete code can be found in [3], corresponding QEMU could can be found in [4]. [1] https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.… [2] https://www.intel.com/content/www/us/en/content-details/772415/content-deta… [3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting [4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1 Change log: v5: - Add Kevin's r-b for patch 2, 3 ,5 8, 10 - Drop enforce_cache_coherency callback from the nested type domain ops (Kevin) - Remove duplicate agaw check in patch 04 (Kevin) - Remove duplicate domain_update_iommu_cap() in patch 06 (Kevin) - Check parent's force_snooping to set pgsnp in the pasid entry (Kevin) - uapi data structure check (Kevin) - Simplify the errata handling as user can allocate nested parent domain v4: https://lore.kernel.org/linux-iommu/20230724111335.107427-1-yi.l.liu@intel.… - Remove ascii art tables (Jason) - Drop EMT (Tina, Jason) - Drop MTS and related definitions (Kevin) - Rename macro IOMMU_VTD_PGTBL_ to IOMMU_VTD_S1_ (Kevin) - Rename struct iommu_hwpt_intel_vtd_ to iommu_hwpt_vtd_ (Kevin) - Rename struct iommu_hwpt_intel_vtd to iommu_hwpt_vtd_s1 (Kevin) - Put the vendor specific hwpt alloc data structure before enuma iommu_hwpt_type (Kevin) - Do not trim the higher page levels of S2 domain in nested domain attachment as the S2 domain may have been used independently. (Kevin) - Remove the first-stage pgd check against the maximum address of s2_domain as hw can check it anyhow. It makes sense to check every pfns used in the stage-1 page table. But it cannot make it. So just leave it to hw. (Kevin) - Split the iotlb flush part into an order of uapi, helper and callback implementation (Kevin) - Change the policy of VT-d nesting errata, disallow RO mapping once a domain is used as parent domain of a nested domain. This removes the nested_users counting. (Kevin) - Minor fix for "make htmldocs" v3: https://lore.kernel.org/linux-iommu/20230511145110.27707-1-yi.l.liu@intel.c… - Further split the patches into an order of adding helpers for nested domain, iotlb flush, nested domain attachment and nested domain allocation callback, then report the hw_info to userspace. - Add batch support in cache invalidation from userspace - Disallow nested translation usage if RO mappings exists in stage-2 domain due to errata on readonly mappings on Sapphire Rapids platform. v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.… - The iommufd infrastructure is split to be separate series. v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.c… Regards, Yi Liu Lu Baolu (5): iommu/vt-d: Extend dmar_domain to support nested domain iommu/vt-d: Add helper for nested domain allocation iommu/vt-d: Add helper to setup pasid nested translation iommu/vt-d: Add nested domain allocation iommu/vt-d: Disallow read-only mappings to nest parent domain Yi Liu (6): iommufd: Add data structure for Intel VT-d stage-1 domain allocation iommu/vt-d: Make domain attach helpers to be extern iommu/vt-d: Set the nested domain to a device iommufd: Add data structure for Intel VT-d stage-1 cache invalidation iommu/vt-d: Make iotlb flush helpers to be extern iommu/vt-d: Add iotlb flush for nested domain drivers/iommu/intel/Makefile | 2 +- drivers/iommu/intel/iommu.c | 60 +++++++++---- drivers/iommu/intel/iommu.h | 51 +++++++++-- drivers/iommu/intel/nested.c | 162 +++++++++++++++++++++++++++++++++++ drivers/iommu/intel/pasid.c | 125 +++++++++++++++++++++++++++ drivers/iommu/intel/pasid.h | 2 + include/uapi/linux/iommufd.h | 76 +++++++++++++++- 7 files changed, 452 insertions(+), 26 deletions(-) create mode 100644 drivers/iommu/intel/nested.c -- 2.34.1

1 year, 9 months

3
25
0 0

[PATCH] selftests/powerpc: Fix emit_tests to work with run_kselftest.sh

by Michael Ellerman

In order to use run_kselftest.sh the list of tests must be emitted to populate kselftest-list.txt. The powerpc Makefile is written to use EMIT_TESTS. But support for EMIT_TESTS was dropped in commit d4e59a536f50 ("selftests: Use runner.sh for emit targets"). Although prior to that commit a548de0fe8e1 ("selftests: lib.mk: add test execute bit check to EMIT_TESTS") had already broken run_kselftest.sh for powerpc due to the executable check using the wrong path. It can be fixed by replacing the EMIT_TESTS definitions with actual emit_tests rules in the powerpc Makefiles. This makes run_kselftest.sh able to run powerpc tests: $ cd linux $ export ARCH=powerpc $ export CROSS_COMPILE=powerpc64le-linux-gnu- $ make headers $ make -j -C tools/testing/selftests install $ grep -c "^powerpc" tools/testing/selftests/kselftest_install/kselftest-list.txt 182 Fixes: d4e59a536f50 ("selftests: Use runner.sh for emit targets") Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au> --- tools/testing/selftests/powerpc/Makefile | 7 +++---- tools/testing/selftests/powerpc/pmu/Makefile | 11 ++++++----- 2 files changed, 9 insertions(+), 9 deletions(-) I'll plan to merge this via the powerpc tree. cheers diff --git a/tools/testing/selftests/powerpc/Makefile b/tools/testing/selftests/powerpc/Makefile index 49f2ad1793fd..7ea42fa02eab 100644 --- a/tools/testing/selftests/powerpc/Makefile +++ b/tools/testing/selftests/powerpc/Makefile @@ -59,12 +59,11 @@ override define INSTALL_RULE done; endef -override define EMIT_TESTS +emit_tests: +@for TARGET in $(SUB_DIRS); do \ BUILD_TARGET=$(OUTPUT)/$$TARGET; \ - $(MAKE) OUTPUT=$$BUILD_TARGET -s -C $$TARGET emit_tests;\ + $(MAKE) OUTPUT=$$BUILD_TARGET -s -C $$TARGET $@;\ done; -endef override define CLEAN +@for TARGET in $(SUB_DIRS); do \ @@ -77,4 +76,4 @@ endef tags: find . -name '*.c' -o -name '*.h' | xargs ctags -.PHONY: tags $(SUB_DIRS) +.PHONY: tags $(SUB_DIRS) emit_tests diff --git a/tools/testing/selftests/powerpc/pmu/Makefile b/tools/testing/selftests/powerpc/pmu/Makefile index 2b95e44d20ff..a284fa874a9f 100644 --- a/tools/testing/selftests/powerpc/pmu/Makefile +++ b/tools/testing/selftests/powerpc/pmu/Makefile @@ -30,13 +30,14 @@ override define RUN_TESTS +TARGET=event_code_tests; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET run_tests endef -DEFAULT_EMIT_TESTS := $(EMIT_TESTS) -override define EMIT_TESTS - $(DEFAULT_EMIT_TESTS) +emit_tests: + for TEST in $(TEST_GEN_PROGS); do \ + BASENAME_TEST=`basename $$TEST`; \ + echo "$(COLLECTION):$$BASENAME_TEST"; \ + done +TARGET=ebb; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -s -C $$TARGET emit_tests +TARGET=sampling_tests; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -s -C $$TARGET emit_tests +TARGET=event_code_tests; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -s -C $$TARGET emit_tests -endef DEFAULT_INSTALL_RULE := $(INSTALL_RULE) override define INSTALL_RULE @@ -64,4 +65,4 @@ sampling_tests: event_code_tests: TARGET=$@; BUILD_TARGET=$$OUTPUT/$$TARGET; mkdir -p $$BUILD_TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -k -C $$TARGET all -.PHONY: all run_tests ebb sampling_tests event_code_tests +.PHONY: all run_tests ebb sampling_tests event_code_tests emit_tests -- 2.41.0

1 year, 9 months

2
2
0 0

[PATCH 0/5] KVM: x86: Fix breakage in KVM_SET_XSAVE's ABI

by Sean Christopherson

Rework how KVM limits guest-unsupported xfeatures to effectively hide only when saving state for userspace (KVM_GET_XSAVE), i.e. to let userspace load all host-supported xfeatures (via KVM_SET_XSAVE) irrespective of what features have been exposed to the guest. The effect on KVM_SET_XSAVE was knowingly done by commit ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0"): As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2. Peventing userspace from doing stupid things is usually a good idea, but in this case restricting KVM_SET_XSAVE actually exacerbated the problem that commit ad856280ddea was fixing. As reported by Tyler, rejecting KVM_SET_XSAVE for guest-unsupported xfeatures breaks live migration from a kernel without commit ad856280ddea, to a kernel with ad856280ddea. I.e. from a kernel that saves guest-unsupported xfeatures to a kernel that doesn't allow loading guest-unuspported xfeatures. To make matters even worse, QEMU doesn't terminate if KVM_SET_XSAVE fails, and so the end result is that the live migration results (possibly silent) guest data corruption instead of a failed migration. Patch 1 refactors the FPU code to let KVM pass in a mask of which xfeatures to save, patch 2 fixes KVM by passing in guest_supported_xcr0 instead of modifying user_xfeatures directly. Patches 3-5 are regression tests. I have no objection if anyone wants patches 1 and 2 squashed together, I split them purely to make review easier. Note, this doesn't fix the scenario where a guest is migrated from a "bad" to a "good" kernel and the target host doesn't support the over-saved set of xfeatures. I don't see a way to safely handle that in the kernel without an opt-in, which more or less defeats the purpose of handling it in KVM. Sean Christopherson (5): x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer KVM: x86: Constrain guest-supported xfeatures only at KVM_GET_XSAVE{2} KVM: selftests: Touch relevant XSAVE state in guest for state test KVM: selftests: Load XSAVE state into untouched vCPU during state test KVM: selftests: Force load all supported XSAVE state in state test arch/x86/include/asm/fpu/api.h | 3 +- arch/x86/kernel/fpu/core.c | 5 +- arch/x86/kernel/fpu/xstate.c | 12 +- arch/x86/kernel/fpu/xstate.h | 3 +- arch/x86/kvm/cpuid.c | 8 -- arch/x86/kvm/x86.c | 37 +++--- .../selftests/kvm/include/x86_64/processor.h | 23 ++++ .../testing/selftests/kvm/x86_64/state_test.c | 110 +++++++++++++++++- 8 files changed, 168 insertions(+), 33 deletions(-) base-commit: 5804c19b80bf625c6a9925317f845e497434d6d3 -- 2.42.0.582.g8ccd20d70d-goog

1 year, 9 months

5
13
0 0

[RFC v3 0/2] CPU-Idle latency selftest framework

by Aboorva Devarajan

Changelog: v2 -> v3 * Minimal code refactoring * Rebased on v6.6-rc1 RFC v1: https://lore.kernel.org/all/20210611124154.56427-1-psampat@linux.ibm.com/ RFC v2: https://lore.kernel.org/all/20230828061530.126588-2-aboorvad@linux.vnet.ibm… Other related RFC: https://lore.kernel.org/all/20210430082804.38018-1-psampat@linux.ibm.com/ Userspace selftest: https://lkml.org/lkml/2020/9/2/356 ---- A kernel module + userspace driver to estimate the wakeup latency caused by going into stop states. The motivation behind this program is to find significant deviations behind advertised latency and residency values. The patchset measures latencies for two kinds of events. IPIs and Timers As this is a software-only mechanism, there will be additional latencies of the kernel-firmware-hardware interactions. To account for that, the program also measures a baseline latency on a 100 percent loaded CPU and the latencies achieved must be in view relative to that. To achieve this, we introduce a kernel module and expose its control knobs through the debugfs interface that the selftests can engage with. The kernel module provides the following interfaces within /sys/kernel/debug/powerpc/latency_test/ for, IPI test: ipi_cpu_dest = Destination CPU for the IPI ipi_cpu_src = Origin of the IPI ipi_latency_ns = Measured latency time in ns Timeout test: timeout_cpu_src = CPU on which the timer to be queued timeout_expected_ns = Timer duration timeout_diff_ns = Difference of actual duration vs expected timer Sample output is as follows: # --IPI Latency Test--- # Baseline Avg IPI latency(ns): 2720 # Observed Avg IPI latency(ns) - State snooze: 2565 # Observed Avg IPI latency(ns) - State stop0_lite: 3856 # Observed Avg IPI latency(ns) - State stop0: 3670 # Observed Avg IPI latency(ns) - State stop1: 3872 # Observed Avg IPI latency(ns) - State stop2: 17421 # Observed Avg IPI latency(ns) - State stop4: 1003922 # Observed Avg IPI latency(ns) - State stop5: 1058870 # # --Timeout Latency Test-- # Baseline Avg timeout diff(ns): 1435 # Observed Avg timeout diff(ns) - State snooze: 1709 # Observed Avg timeout diff(ns) - State stop0_lite: 2028 # Observed Avg timeout diff(ns) - State stop0: 1954 # Observed Avg timeout diff(ns) - State stop1: 1895 # Observed Avg timeout diff(ns) - State stop2: 14556 # Observed Avg timeout diff(ns) - State stop4: 873988 # Observed Avg timeout diff(ns) - State stop5: 959137 Aboorva Devarajan (2): powerpc/cpuidle: cpuidle wakeup latency based on IPI and timer events powerpc/selftest: Add support for cpuidle latency measurement arch/powerpc/Kconfig.debug | 10 + arch/powerpc/kernel/Makefile | 1 + arch/powerpc/kernel/test_cpuidle_latency.c | 154 ++++++ tools/testing/selftests/powerpc/Makefile | 1 + .../powerpc/cpuidle_latency/.gitignore | 2 + .../powerpc/cpuidle_latency/Makefile | 6 + .../cpuidle_latency/cpuidle_latency.sh | 443 ++++++++++++++++++ .../powerpc/cpuidle_latency/settings | 1 + 8 files changed, 618 insertions(+) create mode 100644 arch/powerpc/kernel/test_cpuidle_latency.c create mode 100644 tools/testing/selftests/powerpc/cpuidle_latency/.gitignore create mode 100644 tools/testing/selftests/powerpc/cpuidle_latency/Makefile create mode 100755 tools/testing/selftests/powerpc/cpuidle_latency/cpuidle_latency.sh create mode 100644 tools/testing/selftests/powerpc/cpuidle_latency/settings -- 2.25.1

1 year, 9 months

2
7
0 0

[PATCH v6] selftests/clone3: Fix broken test under !CONFIG_TIME_NS

by Tiezhu Yang

When execute the following command to test clone3 under !CONFIG_TIME_NS: # make headers && cd tools/testing/selftests/clone3 && make && ./clone3 we can see the following error info: # [7538] Trying clone3() with flags 0x80 (size 0) # Invalid argument - Failed to create new process # [7538] clone3() with flags says: -22 expected 0 not ok 18 [7538] Result (-22) is different than expected (0) ... # Totals: pass:18 fail:1 xfail:0 xpass:0 skip:0 error:0 This is because if CONFIG_TIME_NS is not set, but the flag CLONE_NEWTIME (0x80) is used to clone a time namespace, it will return -EINVAL in copy_time_ns(). If kernel does not support CONFIG_TIME_NS, /proc/self/ns/time will be not exist, and then we should skip clone3() test with CLONE_NEWTIME. With this patch under !CONFIG_TIME_NS: # make headers && cd tools/testing/selftests/clone3 && make && ./clone3 ... # Time namespaces are not supported ok 18 # SKIP Skipping clone3() with CLONE_NEWTIME ... # Totals: pass:18 fail:0 xfail:0 xpass:0 skip:1 error:0 Fixes: 515bddf0ec41 ("selftests/clone3: test clone3 with CLONE_NEWTIME") Suggested-by: Thomas Gleixner <tglx(a)linutronix.de> Signed-off-by: Tiezhu Yang <yangtiezhu(a)loongson.cn> --- v6: Rebase on 6.5-rc1 and update the commit message tools/testing/selftests/clone3/clone3.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c index e60cf4d..1c61e3c 100644 --- a/tools/testing/selftests/clone3/clone3.c +++ b/tools/testing/selftests/clone3/clone3.c @@ -196,7 +196,12 @@ int main(int argc, char *argv[]) CLONE3_ARGS_NO_TEST); /* Do a clone3() in a new time namespace */ - test_clone3(CLONE_NEWTIME, 0, 0, CLONE3_ARGS_NO_TEST); + if (access("/proc/self/ns/time", F_OK) == 0) { + test_clone3(CLONE_NEWTIME, 0, 0, CLONE3_ARGS_NO_TEST); + } else { + ksft_print_msg("Time namespaces are not supported\n"); + ksft_test_result_skip("Skipping clone3() with CLONE_NEWTIME\n"); + } /* Do a clone3() with exit signal (SIGCHLD) in flags */ test_clone3(SIGCHLD, 0, -EINVAL, CLONE3_ARGS_NO_TEST); -- 2.1.0

1 year, 9 months

1
2
0 0

[PATCH v2 0/2] Modify vDSO selftests

by Tiezhu Yang

v2: Rebase on 6.5-rc1 and update the commit message Tiezhu Yang (2): selftests/vDSO: Add support for LoongArch selftests/vDSO: Get version and name for all archs tools/testing/selftests/vDSO/vdso_config.h | 6 ++++- tools/testing/selftests/vDSO/vdso_test_getcpu.c | 16 +++++-------- .../selftests/vDSO/vdso_test_gettimeofday.c | 26 ++++++---------------- 3 files changed, 18 insertions(+), 30 deletions(-) -- 2.1.0

1 year, 9 months

1
4
0 0

[PATCH v1 00/20] Permission Overlay Extension

by Joey Gouly

Hi all, This series implements the Permission Overlay Extension introduced in 2022 VMSA enhancements [1]. It is based on v6.6-rc3. The Permission Overlay Extension allows to constrain permissions on memory regions. This can be used from userspace (EL0) without a system call or TLB invalidation. POE is used to implement the Memory Protection Keys [2] Linux syscall. The first few patches add the basic framework, then the PKEYS interface is implemented, and then the selftests are made to work on arm64. There was discussion about what the 'default' protection key value should be, I used disallow-all (apart from pkey 0), which matches what x86 does. Patch 15 contains a call to cpus_have_const_cap(), which I couldn't avoid until Mark's patch to re-order when the alternatives were applied [3] is committed. The KVM part isn't tested yet. I have tested the modified protection_keys test on x86_64 [4], but not PPC. Hopefully I have CC'd everyone correctly. Thanks, Joey Joey Gouly (20): arm64/sysreg: add system register POR_EL{0,1} arm64/sysreg: update CPACR_EL1 register arm64: cpufeature: add Permission Overlay Extension cpucap arm64: disable trapping of POR_EL0 to EL2 arm64: context switch POR_EL0 register KVM: arm64: Save/restore POE registers arm64: enable the Permission Overlay Extension for EL0 arm64: add POIndex defines arm64: define VM_PKEY_BIT* for arm64 arm64: mask out POIndex when modifying a PTE arm64: enable ARCH_HAS_PKEYS on arm64 arm64: handle PKEY/POE faults arm64: stop using generic mm_hooks.h arm64: implement PKEYS support arm64: add POE signal support arm64: enable PKEY support for CPUs with S1POE arm64: enable POE and PIE to coexist kselftest/arm64: move get_header() selftests: mm: move fpregs printing selftests: mm: make protection_keys test work on arm64 Documentation/arch/arm64/elf_hwcaps.rst | 3 + arch/arm64/Kconfig | 1 + arch/arm64/include/asm/el2_setup.h | 10 +- arch/arm64/include/asm/hwcap.h | 1 + arch/arm64/include/asm/kvm_host.h | 4 + arch/arm64/include/asm/mman.h | 6 +- arch/arm64/include/asm/mmu.h | 2 + arch/arm64/include/asm/mmu_context.h | 51 ++++++- arch/arm64/include/asm/pgtable-hwdef.h | 10 ++ arch/arm64/include/asm/pgtable-prot.h | 8 +- arch/arm64/include/asm/pgtable.h | 28 +++- arch/arm64/include/asm/pkeys.h | 110 ++++++++++++++ arch/arm64/include/asm/por.h | 33 +++++ arch/arm64/include/asm/processor.h | 1 + arch/arm64/include/asm/sysreg.h | 16 ++ arch/arm64/include/asm/traps.h | 1 + arch/arm64/include/uapi/asm/hwcap.h | 1 + arch/arm64/include/uapi/asm/sigcontext.h | 7 + arch/arm64/kernel/cpufeature.c | 15 ++ arch/arm64/kernel/cpuinfo.c | 1 + arch/arm64/kernel/process.c | 16 ++ arch/arm64/kernel/signal.c | 51 +++++++ arch/arm64/kernel/traps.c | 12 +- arch/arm64/kvm/sys_regs.c | 2 + arch/arm64/mm/fault.c | 44 +++++- arch/arm64/mm/mmap.c | 7 + arch/arm64/mm/mmu.c | 38 +++++ arch/arm64/tools/cpucaps | 1 + arch/arm64/tools/sysreg | 11 +- fs/proc/task_mmu.c | 2 + include/linux/mm.h | 11 +- .../arm64/signal/testcases/testcases.c | 23 --- .../arm64/signal/testcases/testcases.h | 26 +++- tools/testing/selftests/mm/Makefile | 2 +- tools/testing/selftests/mm/pkey-arm64.h | 138 ++++++++++++++++++ tools/testing/selftests/mm/pkey-helpers.h | 8 + tools/testing/selftests/mm/pkey-powerpc.h | 3 + tools/testing/selftests/mm/pkey-x86.h | 4 + tools/testing/selftests/mm/protection_keys.c | 29 ++-- 39 files changed, 685 insertions(+), 52 deletions(-) create mode 100644 arch/arm64/include/asm/pkeys.h create mode 100644 arch/arm64/include/asm/por.h create mode 100644 tools/testing/selftests/mm/pkey-arm64.h -- 2.25.1

1 year, 9 months

6
40
0 0

[PATCH] selftests/x86/lam: Zero out buffer for readlink()

by Binbin Wu

Zero out the buffer for readlink() since readlink() does not append a terminating null byte to the buffer. Fixes: 833c12ce0f430 ("selftests/x86/lam: Add inherit test cases for linear-address masking") Signed-off-by: Binbin Wu <binbin.wu(a)linux.intel.com> --- tools/testing/selftests/x86/lam.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c index eb0e46905bf9..9f06942a8e25 100644 --- a/tools/testing/selftests/x86/lam.c +++ b/tools/testing/selftests/x86/lam.c @@ -680,7 +680,7 @@ static int handle_execve(struct testcases *test) perror("Fork failed."); ret = 1; } else if (pid == 0) { - char path[PATH_MAX]; + char path[PATH_MAX] = {0}; /* Set LAM mode in parent process */ if (set_lam(lam) != 0) base-commit: ce9ecca0238b140b88f43859b211c9fdfd8e5b70 -- 2.25.1

1 year, 9 months

2
4
0 0

[RFC 0/8] iommufd support pasid attach/replace

by Yi Liu

PASID (Process Address Space ID) is a PCIe extension to tag the DMA transactions out of a physical device, and most modern IOMMU hardware have supported PASID granular address translation. So a PASID-capable devices can be attached to multiple hwpts (a.k.a. domains), each attachment is tagged with a PASID. This series first adds a missing iommu API to replace domain for a pasid, then adds iommufd APIs for device drivers to attach/replace/detach pasid to/from hwpt per userspace's request, and adds selftest to validate the iommufd APIs. pasid attach/replace is mandatory on Intel VT-d given the PASID table locates in the physical address space hence must be managed by the kernel, both for supporting vSVA and coming SIOV. But it's optional on ARM/AMD which allow configuring the PASID/CD table either in host physical address space or nested on top of an GPA address space. This series only add VT-d support as the minimal requirement. Complete code can be found in below link: https://github.com/yiliu1765/iommufd/tree/iommufd_pasid Regards, Yi Liu Kevin Tian (1): iommufd: Support attach/replace hwpt per pasid Lu Baolu (2): iommu: Introduce a replace API for device pasid iommu/vt-d: Add set_dev_pasid callback for nested domain Yi Liu (5): iommufd: replace attach_fn with a structure iommufd/selftest: Add set_dev_pasid and remove_dev_pasid in mock iommu iommufd/selftest: Add a helper to get test device iommufd/selftest: Add test ops to test pasid attach/detach iommufd/selftest: Add coverage for iommufd pasid attach/detach drivers/iommu/intel/nested.c | 47 +++++ drivers/iommu/iommu-priv.h | 2 + drivers/iommu/iommu.c | 73 ++++++-- drivers/iommu/iommufd/Makefile | 1 + drivers/iommu/iommufd/device.c | 42 +++-- drivers/iommu/iommufd/iommufd_private.h | 16 ++ drivers/iommu/iommufd/iommufd_test.h | 24 +++ drivers/iommu/iommufd/pasid.c | 152 ++++++++++++++++ drivers/iommu/iommufd/selftest.c | 158 ++++++++++++++-- include/linux/iommufd.h | 6 + tools/testing/selftests/iommu/iommufd.c | 172 ++++++++++++++++++ .../selftests/iommu/iommufd_fail_nth.c | 28 ++- tools/testing/selftests/iommu/iommufd_utils.h | 78 ++++++++ 13 files changed, 756 insertions(+), 43 deletions(-) create mode 100644 drivers/iommu/iommufd/pasid.c -- 2.34.1

1 year, 9 months

4
20
0 0

[RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition

by Valentin Schneider

Context ======= We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a pure-userspace application get regularly interrupted by IPIs sent from housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs leading to various on_each_cpu() calls, e.g.: 64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush) smp_call_function_many_cond+0x1 smp_call_function+0x39 on_each_cpu+0x2a flush_tlb_kernel_range+0x7b __purge_vmap_area_lazy+0x70 _vm_unmap_aliases.part.42+0xdf change_page_attr_set_clr+0x16a set_memory_ro+0x26 bpf_int_jit_compile+0x2f9 bpf_prog_select_runtime+0xc6 bpf_prepare_filter+0x523 sk_attach_filter+0x13 sock_setsockopt+0x92c __sys_setsockopt+0x16a __x64_sys_setsockopt+0x20 do_syscall_64+0x87 entry_SYSCALL_64_after_hwframe+0x65 The heart of this series is the thought that while we cannot remove NOHZ_FULL CPUs from the list of CPUs targeted by these IPIs, they may not have to execute the callbacks immediately. Anything that only affects kernelspace can wait until the next user->kernel transition, providing it can be executed "early enough" in the entry code. The original implementation is from Peter [1]. Nicolas then added kernel TLB invalidation deferral to that [2], and I picked it up from there. Deferral approach ================= Storing each and every callback, like a secondary call_single_queue turned out to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in userspace for as long as possible - no signal of any form would be sent when deferring an IPI. This means that any form of queuing for deferred callbacks would end up as a convoluted memory leak. Deferred IPIs must thus be coalesced, which this series achieves by assigning IPIs a "type" and having a mapping of IPI type to callback, leveraged upon kernel entry. What about IPIs whose callback take a parameter, you may ask? Peter suggested during OSPM23 [3] that since on_each_cpu() targets housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or housekeeping-CPU-local state to "reconstruct" the data that would have been sent via the IPI. This series does not affect any IPI callback that requires an argument, but the approach would remain the same (one coalescable callback executed on kernel entry). Kernel entry vs execution of the deferred operation =================================================== There is a non-zero length of code that is executed upon kernel entry before the deferred operation can be itself executed (i.e. before we start getting into context_tracking.c proper). This means one must take extra care to what can happen in the early entry code, and that <bad things> cannot happen. For instance, we really don't want to hit instructions that have been modified by a remote text_poke() while we're on our way to execute a deferred sync_core(). Patches ======= o Patches 1-9 have been submitted separately and are included for the sake of testing o Patches 10-14 focus on having objtool detect problematic static key usage in early entry o Patch 15 adds the infrastructure for IPI deferral. o Patches 16-17 add some RCU testing infrastructure o Patch 18 adds text_poke() IPI deferral. o Patches 19-20 add vunmap() flush_tlb_kernel_range() IPI deferral These ones I'm a lot less confident about, mostly due to lacking instrumentation/verification. The actual deferred callback is also incomplete as it's not properly noinstr: vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x19: call to native_write_cr4() leaves .noinstr.text section and it doesn't support PARAVIRT - it's going to need a pv_ops.mmu entry, but I have *no idea* what a sane implementation would be for Xen so I haven't touched that yet. Patches are also available at: https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v2 Testing ======= Note: this is a different machine than used for v1, because that machine decided to act difficult. Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs. RHEL9 userspace. Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is: $ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \ -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \ -e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \ rteval --onlyload --loads-cpulist=$HK_CPUS \ --hackbench-runlowmem=True --duration=$DURATION This only records IPIs sent to isolated CPUs, so any event there is interference (with a bit of fuzz at the start/end of the workload when spawning the processes). All tests were done with a duration of 30 minutes. v6.5-rc1 (+ cpumask filtering patches): # This is the actual IPI count $ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr 338 callback=generic_smp_call_function_single_interrupt+0x0 # These are the different CSD's that caused IPIs $ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr 9207 func=do_flush_tlb_all 1116 func=do_sync_core 62 func=do_kernel_range_flush 3 func=nohz_full_kick_func v6.5-rc1 + patches: # This is the actual IPI count $ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr 2 callback=generic_smp_call_function_single_interrupt+0x0 # These are the different CSD's that caused IPIs $ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr 2 func=nohz_full_kick_func The incriminating IPIs are all gone, but note that on the machine I used to test v1 there were still some do_flush_tlb_all() IPIs caused by pcpu_balance_workfn(), since only vmalloc is affected by the deferral mechanism. Acknowledgements ================ Special thanks to: o Clark Williams for listening to my ramblings about this and throwing ideas my way o Josh Poimboeuf for his guidance regarding objtool and hinting at the .data..ro_after_init section. Links ===== [1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/ [2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip [3]: https://youtu.be/0vjE6fjoVVE Revisions ========= RFCv1 -> RFCv2 ++++++++++++++ o Rebased onto v6.5-rc1 o Updated the trace filter patches (Steven) o Fixed __ro_after_init keys used in modules (Peter) o Dropped the extra context_tracking atomic, squashed the new bits in the existing .state field (Peter, Frederic) o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an rcutorture case for a low-size counter (Paul) The new TREE11 case with a 2-bit dynticks counter seems to pass when ran against this series. o Fixed flush_tlb_kernel_range_deferrable() definition Peter Zijlstra (1): jump_label,module: Don't alloc static_key_mod for __ro_after_init keys Valentin Schneider (19): tracing/filters: Dynamically allocate filter_pred.regex tracing/filters: Enable filtering a cpumask field by another cpumask tracing/filters: Enable filtering a scalar field by a cpumask tracing/filters: Enable filtering the CPU common field by a cpumask tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU tracing/filters: Further optimise scalar vs cpumask comparison tracing/filters: Document cpumask filtering objtool: Flesh out warning related to pv_ops[] calls objtool: Warn about non __ro_after_init static key usage in .noinstr context_tracking: Make context_tracking_key __ro_after_init x86/kvm: Make kvm_async_pf_enabled __ro_after_init context-tracking: Introduce work deferral infrastructure rcu: Make RCU dynticks counter size configurable rcutorture: Add a test config to torture test low RCU_DYNTICKS width context_tracking,x86: Defer kernel text patching IPIs context_tracking,x86: Add infrastructure to defer kernel TLBI x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Documentation/trace/events.rst | 14 + arch/Kconfig | 9 + arch/x86/Kconfig | 1 + arch/x86/include/asm/context_tracking_work.h | 20 ++ arch/x86/include/asm/text-patching.h | 1 + arch/x86/include/asm/tlbflush.h | 2 + arch/x86/kernel/alternative.c | 24 +- arch/x86/kernel/kprobes/core.c | 4 +- arch/x86/kernel/kprobes/opt.c | 4 +- arch/x86/kernel/kvm.c | 2 +- arch/x86/kernel/module.c | 2 +- arch/x86/mm/tlb.c | 40 ++- include/asm-generic/sections.h | 5 + include/linux/context_tracking.h | 26 ++ include/linux/context_tracking_state.h | 65 +++- include/linux/context_tracking_work.h | 28 ++ include/linux/jump_label.h | 1 + include/linux/trace_events.h | 1 + init/main.c | 1 + kernel/context_tracking.c | 53 ++- kernel/jump_label.c | 49 +++ kernel/rcu/Kconfig | 33 ++ kernel/time/Kconfig | 5 + kernel/trace/trace_events_filter.c | 302 ++++++++++++++++-- mm/vmalloc.c | 19 +- tools/objtool/check.c | 22 +- tools/objtool/include/objtool/check.h | 1 + tools/objtool/include/objtool/special.h | 2 + tools/objtool/special.c | 3 + .../selftests/rcutorture/configs/rcu/TREE11 | 19 ++ .../rcutorture/configs/rcu/TREE11.boot | 1 + 31 files changed, 695 insertions(+), 64 deletions(-) create mode 100644 arch/x86/include/asm/context_tracking_work.h create mode 100644 include/linux/context_tracking_work.h create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot -- 2.31.1

1 year, 9 months

12
75
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-kselftest-mirror September 2023