There's been a bunch of off-list discussions about this, including at
Plumbers. The original plan was to do something involving providing an
ISA string to userspace, but ISA strings just aren't sufficient for a
stable ABI any more: in order to parse an ISA string users need the
version of the specifications that the string is written to, the version
of each extension (sometimes at a finer granularity than the RISC-V
releases/versions encode), and the expected use case for the ISA string
(ie, is it a U-mode or M-mode string). That's a lot of complexity to
try and keep ABI compatible and it's probably going to continue to grow,
as even if there's no more complexity in the specifications we'll have
to deal with the various ISA string parsing oddities that end up all
over userspace.
Instead this patch set takes a very different approach and provides a set
of key/value pairs that encode various bits about the system. The big
advantage here is that we can clearly define what these mean so we can
ensure ABI stability, but it also allows us to encode information that's
unlikely to ever appear in an ISA string (see the misaligned access
performance, for example). The resulting interface looks a lot like
what arm64 and x86 do, and will hopefully fit well into something like
ACPI in the future.
The actual user interface is a syscall, with a vDSO function in front of
it. The vDSO function can answer some queries without a syscall at all,
and falls back to the syscall for cases it doesn't have answers to.
Currently we prepopulate it with an array of answers for all keys and
a CPU set of "all CPUs". This can be adjusted as necessary to provide
fast answers to the most common queries.
An example series in glibc exposing this syscall and using it in an
ifunc selector for memcpy can be found at [1]. I'm about to send a v2
of that series out that incorporates the vDSO function.
I was asked about the performance delta between this and something like
sysfs. I created a small test program [2] and ran it on a riscv64 qemu
instance. Doing each operation 100000 times and dividing, these
operations take the following amount of time:
- open()+read()+close() of /sys/kernel/cpu_byteorder: 114us
- access("/sys/kernel/cpu_byteorder", R_OK): 69us
- riscv_hwprobe() vDSO and syscall: 13us
- riscv_hwprobe() vDSO with no syscall: 0.07us
These numbers get farther apart if we query multiple keys, as sysfs will
scale linearly with the number of keys, where the dedicated syscall
stays the same. To frame these numbers, I also did a tight
fork/exec/wait loop, which I measured as 23ms. So doing 4
open/read/close operations is a delta of about 2%, versus a single vDSO
call is a delta of 0.0003%.
This being qemu rather than real hardware, the numbers
themselves are somewhat inaccurate, though the relative orders of
magnitude are probably good enough.
[1] https://public-inbox.org/libc-alpha/20230206194819.1679472-1-evan@rivosinc.…
[2] https://pastebin.com/x84NEKaS
Changes in v3:
- Updated copyright date in cpufeature.h
- Fixed typo in cpufeature.h comment (Conor)
- Refactored functions so that kernel mode can query too, in
preparation for the vDSO data population.
- Changed the vendor/arch/imp IDs to return a value of -1 on mismatch
rather than failing the whole call.
- Const cpumask pointer in hwprobe_mid()
- Embellished documentation WRT cpu_set and the returned values.
- Renamed hwprobe_mid() to hwprobe_arch_id() (Conor)
- Fixed machine ID doc warnings, changed elements to c:macro:.
- Completed dangling unistd.h comment (Conor)
- Fixed line breaks and minor logic optimization (Conor).
- Use riscv_cached_mxxxid() (Conor)
- Refactored base ISA behavior probe to allow kernel probing as well,
in prep for vDSO data initialization.
- Fixed doc warnings in IMA text list, use :c:macro:.
- Added | to description: to make dt-checker happy.
- Have hwprobe_misaligned return int instead of long.
- Constify cpumask pointer in hwprobe_misaligned()
- Fix warnings in _PERF_O list documentation, use :c:macro:.
- Move include cpufeature.h to misaligned patch.
- Fix documentation mismatch for RISCV_HWPROBE_KEY_CPUPERF_0 (Conor)
- Use for_each_possible_cpu() instead of NR_CPUS (Conor)
- Break early in misaligned access iteration (Conor)
- Increase MISALIGNED_MASK from 2 bits to 3 for possible UNSUPPORTED future
value (Conor)
- Introduced vDSO function
Changes in v2:
- Factored the move of struct riscv_cpuinfo to its own header
- Changed the interface to look more like poll(). Rather than supplying
key_offset and getting back an array of values with numerically
contiguous keys, have the user pre-fill the key members of the array,
and the kernel will fill in the corresponding values. For any key it
doesn't recognize, it will set the key of that element to -1. This
allows usermode to quickly ask for exactly the elements it cares
about, and not get bogged down in a back and forth about newer keys
that older kernels might not recognize. In other words, the kernel
can communicate that it doesn't recognize some of the keys while
still providing the data for the keys it does know.
- Added a shortcut to the cpuset parameters that if a size of 0 and
NULL is provided for the CPU set, the kernel will use a cpu mask of
all online CPUs. This is convenient because I suspect most callers
will only want to act on a feature if it's supported on all CPUs, and
it's a headache to dynamically allocate an array of all 1s, not to
mention a waste to have the kernel loop over all of the offline bits.
- Fixed logic error in if(of_property_read_string...) that caused crash
- Include cpufeature.h in cpufeature.h to avoid undeclared variable
warning.
- Added a _MASK define
- Fix random checkpatch complaints
- Updated the selftests to the new API and added some more.
- Fixed indentation, comments in .S, and general checkpatch complaints.
Evan Green (6):
RISC-V: Move struct riscv_cpuinfo to new header
RISC-V: Add a syscall for HW probing
RISC-V: hwprobe: Add support for RISCV_HWPROBE_BASE_BEHAVIOR_IMA
RISC-V: hwprobe: Support probing of misaligned access performance
selftests: Test the new RISC-V hwprobe interface
RISC-V: Add hwprobe vDSO function and data
Palmer Dabbelt (1):
dt-bindings: Add RISC-V misaligned access performance
.../devicetree/bindings/riscv/cpus.yaml | 15 ++
Documentation/riscv/hwprobe.rst | 74 ++++++
Documentation/riscv/index.rst | 1 +
arch/riscv/Kconfig | 1 +
arch/riscv/include/asm/cpufeature.h | 23 ++
arch/riscv/include/asm/hwprobe.h | 13 +
arch/riscv/include/asm/smp.h | 11 +
arch/riscv/include/asm/syscall.h | 3 +
arch/riscv/include/asm/vdso/data.h | 17 ++
arch/riscv/include/uapi/asm/hwprobe.h | 36 +++
arch/riscv/include/uapi/asm/unistd.h | 9 +
arch/riscv/kernel/cpu.c | 11 +-
arch/riscv/kernel/cpufeature.c | 31 ++-
arch/riscv/kernel/sys_riscv.c | 222 +++++++++++++++++-
arch/riscv/kernel/vdso/Makefile | 2 +
arch/riscv/kernel/vdso/hwprobe.c | 47 ++++
arch/riscv/kernel/vdso/sys_hwprobe.S | 15 ++
arch/riscv/kernel/vdso/vdso.lds.S | 1 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/riscv/Makefile | 58 +++++
.../testing/selftests/riscv/hwprobe/Makefile | 10 +
.../testing/selftests/riscv/hwprobe/hwprobe.c | 89 +++++++
.../selftests/riscv/hwprobe/sys_hwprobe.S | 12 +
23 files changed, 692 insertions(+), 10 deletions(-)
create mode 100644 Documentation/riscv/hwprobe.rst
create mode 100644 arch/riscv/include/asm/cpufeature.h
create mode 100644 arch/riscv/include/asm/hwprobe.h
create mode 100644 arch/riscv/include/asm/vdso/data.h
create mode 100644 arch/riscv/include/uapi/asm/hwprobe.h
create mode 100644 arch/riscv/kernel/vdso/hwprobe.c
create mode 100644 arch/riscv/kernel/vdso/sys_hwprobe.S
create mode 100644 tools/testing/selftests/riscv/Makefile
create mode 100644 tools/testing/selftests/riscv/hwprobe/Makefile
create mode 100644 tools/testing/selftests/riscv/hwprobe/hwprobe.c
create mode 100644 tools/testing/selftests/riscv/hwprobe/sys_hwprobe.S
--
2.25.1
So far KSM can only be enabled by calling madvise for memory regions. What is
required to enable KSM for more workloads is to enable / disable it at the
process / cgroup level.
Use case:
The madvise call is not available in the programming language. An example for
this are programs with forked workloads using a garbage collected language without
pointers. In such a language madvise cannot be made available.
In addition the addresses of objects get moved around as they are garbage
collected. KSM sharing needs to be enabled "from the outside" for these type of
workloads.
Experiments with using UKSM have shown a capacity increase of around 20%.
1. New options for prctl system command
This patch series adds two new options to the prctl system call. The first
one allows to enable KSM at the process level and the second one to query the
setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a cgroup
and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate over all
the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be inherited by
the new child process.
In addition when KSM is disabled for a process, KSM will be disabled for the
VMA's where KSM has been enabled.
3. Add general_profit metric
The general_profit metric of KSM is specified in the documentation, but not
calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm.
4. Add more metrics to ksm_stat
This adds the process profit and ksm type metric to /proc/<pid>/ksm_stat.
5. Add more tests to ksm_tests
This adds an option to specify the merge type to the ksm_tests. This allows to
test madvise and prctl KSM. It also adds a new option to query if prctl KSM has
been enabled. It adds a fork test to verify that the KSM process setting is
inherited by client processes.
Changes:
- V2:
- Added use cases to the cover letter
- Removed the tracing patch from the patch series and posted it as an
individual patch
- Refreshed repo
Stefan Roesch (19):
mm: add new flag to enable ksm per process
mm: add flag to __ksm_enter
mm: add flag to __ksm_exit call
mm: invoke madvise for all vmas in scan_get_next_rmap_item
mm: support disabling of ksm for a process
mm: add new prctl option to get and set ksm for a process
mm: split off pages_volatile function
mm: expose general_profit metric
docs: document general_profit sysfs knob
mm: calculate ksm process profit metric
mm: add ksm_merge_type() function
mm: expose ksm process profit metric in ksm_stat
mm: expose ksm merge type in ksm_stat
docs: document new procfs ksm knobs
tools: add new prctl flags to prctl in tools dir
selftests/vm: add KSM prctl merge test
selftests/vm: add KSM get merge type test
selftests/vm: add KSM fork test
selftests/vm: add two functions for debugging merge outcome
Documentation/ABI/testing/sysfs-kernel-mm-ksm | 8 +
Documentation/admin-guide/mm/ksm.rst | 8 +-
fs/proc/base.c | 5 +
include/linux/ksm.h | 19 +-
include/linux/sched/coredump.h | 1 +
include/uapi/linux/prctl.h | 2 +
kernel/sys.c | 29 ++
mm/ksm.c | 114 +++++++-
tools/include/uapi/linux/prctl.h | 2 +
tools/testing/selftests/mm/Makefile | 3 +-
tools/testing/selftests/mm/ksm_tests.c | 254 +++++++++++++++---
11 files changed, 389 insertions(+), 56 deletions(-)
base-commit: 234a68e24b120b98875a8b6e17a9dead277be16a
--
2.30.2
Stack protectors need support from libc.
This support is not provided by nolibc which leads to compiler errors
when stack protectors are enabled by default in a compiler:
CC nolibc-test
/usr/bin/ld: /tmp/ccqbHEPk.o: in function `stat':
nolibc-test.c:(.text+0x1d1): undefined reference to `__stack_chk_fail'
/usr/bin/ld: /tmp/ccqbHEPk.o: in function `poll.constprop.0':
nolibc-test.c:(.text+0x37b): undefined reference to `__stack_chk_fail'
/usr/bin/ld: /tmp/ccqbHEPk.o: in function `vfprintf.constprop.0':
nolibc-test.c:(.text+0x712): undefined reference to `__stack_chk_fail'
/usr/bin/ld: /tmp/ccqbHEPk.o: in function `pad_spc.constprop.0':
nolibc-test.c:(.text+0x80d): undefined reference to `__stack_chk_fail'
/usr/bin/ld: /tmp/ccqbHEPk.o: in function `printf':
nolibc-test.c:(.text+0x8c4): undefined reference to `__stack_chk_fail'
/usr/bin/ld: /tmp/ccqbHEPk.o:nolibc-test.c:(.text+0x12d4): more undefined references to `__stack_chk_fail' follow
collect2: error: ld returned 1 exit status
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
tools/testing/selftests/nolibc/Makefile | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/nolibc/Makefile b/tools/testing/selftests/nolibc/Makefile
index 22f1e1d73fa8..ec724e445b5a 100644
--- a/tools/testing/selftests/nolibc/Makefile
+++ b/tools/testing/selftests/nolibc/Makefile
@@ -1,6 +1,8 @@
# SPDX-License-Identifier: GPL-2.0
# Makefile for nolibc tests
include ../../../scripts/Makefile.include
+# We need this for the "cc-option" macro.
+include ../../../build/Build.include
# we're in ".../tools/testing/selftests/nolibc"
ifeq ($(srctree),)
@@ -63,6 +65,7 @@ Q=@
endif
CFLAGS ?= -Os -fno-ident -fno-asynchronous-unwind-tables
+CFLAGS += $(call cc-option,-fno-stack-protector)
LDFLAGS := -s
help:
---
base-commit: 3f0b0903fde584a7398f82fc00bf4f8138610b87
change-id: 20230221-nolibc-no-stack-protector-5b54d2be4c61
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
From: Jiri Pirko <jiri(a)nvidia.com>
When devlink instance is put into network namespace and that network
namespace gets deleted, devlink instance is moved back into init_ns.
This is done as a part of cleanup_net() routine. Since cleanup_net()
is called asynchronously from workqueue, there is no guarantee that
the devlink instance move is done after "ip netns del" returns.
So fix this race by making sure that the devlink instance is present
before any other operation.
Reported-by: Amir Tzin <amirtz(a)nvidia.com>
Fixes: b74c37fd35a2 ("selftests: netdevsim: add tests for devlink reload with resources")
Signed-off-by: Jiri Pirko <jiri(a)nvidia.com>
---
.../selftests/drivers/net/netdevsim/devlink.sh | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
index a08c02abde12..7f7d20f22207 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
@@ -17,6 +17,18 @@ SYSFS_NET_DIR=/sys/bus/netdevsim/devices/$DEV_NAME/net/
DEBUGFS_DIR=/sys/kernel/debug/netdevsim/$DEV_NAME/
DL_HANDLE=netdevsim/$DEV_NAME
+wait_for_devlink()
+{
+ "$@" | grep -q $DL_HANDLE
+}
+
+devlink_wait()
+{
+ local timeout=$1
+
+ busywait "$timeout" wait_for_devlink devlink dev
+}
+
fw_flash_test()
{
RET=0
@@ -256,6 +268,9 @@ netns_reload_test()
ip netns del testns2
ip netns del testns1
+ # Wait until netns async cleanup is done.
+ devlink_wait 2000
+
log_test "netns reload test"
}
@@ -348,6 +363,9 @@ resource_test()
ip netns del testns2
ip netns del testns1
+ # Wait until netns async cleanup is done.
+ devlink_wait 2000
+
log_test "resource test"
}
--
2.39.0
Usually when a subtest is executed, setup and cleanup functions
are linearly called at the beginning and end of it.
In some of them, `set -e` is used before executing commands.
If one of the commands returns a non zero code, the whole script exists
without cleaning up the resources allocated at setup.
This can affect the next tests that use the same resources,
leading to a chain of failures.
To be consistent with other tests, calling cleanup function when the
script exists fixes the issue.
Steps to reproduce it:
1. Build with CONFIG_IP_ROUTE_MULTIPATH disabled.
2. Run net kselftest suite
3. fib_tests:fib_unreg_multipath_test fails when executing
`ip -netns ns1 route add 203.0.113.0/24 nexthop via 198.51.100.2 dev
dummy0 nexthop via 192.0.2.2 dev dummy1` because
CONFIG_IP_ROUTE_MULTIPATH is disabled.
This results in resources allocated during setup (e.g namespace ns1)
not being cleaned up.
4. When icmp.sh tries to create namespace ns1 during its setup, it fails
with the following error:
Cannot create namespace file "/run/netns/ns1": File exists
Roxana Nicolescu (1):
selftest: fib_tests: Always cleanup before exit
tools/testing/selftests/net/fib_tests.sh | 2 ++
1 file changed, 2 insertions(+)
--
2.34.1
[
RESEND, get rid of: Linux x86-64 Mailing List <linux-x86_64(a)vger.kernel.org>
from the CC list. It bounces and makes noise. Sorry for the wrong address.
]
Hi,
This is an RFC v8. Based on the x86/cpu branch in the tip tree.
The 'syscall' instruction on the Intel FRED architecture does not
clobber %rcx and %r11. This behavior leads to an assertion failure in
the sysret_rip selftest because it asserts %r11 = %rflags.
In the previous discussion, we agreed that there are two cases for
'syscall':
A) 'syscall' in a FRED system preserves %rcx and %r11.
B) 'syscall' in a non-FRED system sets %rcx=%rip and %r11=%rflags.
This series fixes the selftest. Make it work on the Intel FRED
architecture. Also, add more tests to ensure the syscall behavior is
consistent. It must always be (A) or always be (B). Not a mix of them.
See the previous discussion here:
https://lore.kernel.org/lkml/5d4ad3e3-034f-c7da-d141-9c001c2343af@intel.com
## Changelog revision
v8:
- Stop using "+r"(rsp) to avoid the red zone problem because it
generates the wrong Assembly code (Ammar).
See: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108799
- Update commit message (Ammar).
v7:
- Fix comment, REGS_ERROR no longer exists in the enum (Ammar).
- Update commit message (Ammar).
v6:
- Move the check-regs assertion in sigusr1() to check_regs_result() (HPA).
- Add a new test just like sigusr1(), but don't modify REG_RCX and
REG_R11. This is used to test SYSRET behavior consistency (HPA).
v5:
- Fix do_syscall() return value (Ammar).
v4:
- Fix the assertion condition inside the SIGUSR1 handler (Xin Li).
- Explain the purpose of patch #2 in the commit message (HPA).
- Update commit message (Ammar).
- Repeat test_syscall_rcx_r11_consistent() 32 times to be more sure
that the result is really consistent (Ammar).
v3:
- Test that we don't get a mix of REGS_SAVED and REGS_SYSRET, which
is a major part of the point (HPA).
v2:
- Use "+r"(rsp) as the right way to avoid redzone problems per
Andrew's comment (HPA).
Co-developed-by: H. Peter Anvin (Intel) <hpa(a)zytor.com>
Signed-off-by: H. Peter Anvin (Intel) <hpa(a)zytor.com>
Signed-off-by: Ammar Faizi <ammarfaizi2(a)gnuweeb.org>
---
Ammar Faizi (3):
selftests/x86: sysret_rip: Handle syscall on the Intel FRED architecture
selftests/x86: sysret_rip: Add more tests to verify the 'syscall' behavior
selftests/x86: sysret_rip: Test SYSRET with a signal handler
tools/testing/selftests/x86/sysret_rip.c | 169 +++++++++++++++++++++--
1 file changed, 160 insertions(+), 9 deletions(-)
base-commit: e067248949e3de7fbeae812b0ccbbee7a401e7aa
--
Ammar Faizi
Hi,
This is an RFC v8. Based on the x86/cpu branch in the tip tree.
The 'syscall' instruction on the Intel FRED architecture does not
clobber %rcx and %r11. This behavior leads to an assertion failure in
the sysret_rip selftest because it asserts %r11 = %rflags.
In the previous discussion, we agreed that there are two cases for
'syscall':
A) 'syscall' in a FRED system preserves %rcx and %r11.
B) 'syscall' in a non-FRED system sets %rcx=%rip and %r11=%rflags.
This series fixes the selftest. Make it work on the Intel FRED
architecture. Also, add more tests to ensure the syscall behavior is
consistent. It must always be (A) or always be (B). Not a mix of them.
See the previous discussion here:
https://lore.kernel.org/lkml/5d4ad3e3-034f-c7da-d141-9c001c2343af@intel.com
## Changelog revision
v8:
- Stop using "+r"(rsp) to avoid the red zone problem because it
generates the wrong Assembly code (Ammar).
See: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108799
- Update commit message (Ammar).
v7:
- Fix comment, REGS_ERROR no longer exists in the enum (Ammar).
- Update commit message (Ammar).
v6:
- Move the check-regs assertion in sigusr1() to check_regs_result() (HPA).
- Add a new test just like sigusr1(), but don't modify REG_RCX and
REG_R11. This is used to test SYSRET behavior consistency (HPA).
v5:
- Fix do_syscall() return value (Ammar).
v4:
- Fix the assertion condition inside the SIGUSR1 handler (Xin Li).
- Explain the purpose of patch #2 in the commit message (HPA).
- Update commit message (Ammar).
- Repeat test_syscall_rcx_r11_consistent() 32 times to be more sure
that the result is really consistent (Ammar).
v3:
- Test that we don't get a mix of REGS_SAVED and REGS_SYSRET, which
is a major part of the point (HPA).
v2:
- Use "+r"(rsp) as the right way to avoid redzone problems per
Andrew's comment (HPA).
Co-developed-by: H. Peter Anvin (Intel) <hpa(a)zytor.com>
Signed-off-by: H. Peter Anvin (Intel) <hpa(a)zytor.com>
Signed-off-by: Ammar Faizi <ammarfaizi2(a)gnuweeb.org>
---
Ammar Faizi (3):
selftests/x86: sysret_rip: Handle syscall on the Intel FRED architecture
selftests/x86: sysret_rip: Add more tests to verify the 'syscall' behavior
selftests/x86: sysret_rip: Test SYSRET with a signal handler
tools/testing/selftests/x86/sysret_rip.c | 169 +++++++++++++++++++++--
1 file changed, 160 insertions(+), 9 deletions(-)
base-commit: e067248949e3de7fbeae812b0ccbbee7a401e7aa
--
Ammar Faizi
AMX architecture involves several entities such as xstate, XCR0,
IA32_XFD. This series add several missing checks on top of the existing
amx_test.
v1 -> v2:
- Add a working xstate data structure suggested by seanjc.
- Split the checking of CR0.TS from the checking of XFD.
- Fix all the issues pointed by in review.
v1:
https://lore.kernel.org/all/20230110185823.1856951-1-mizhang@google.com/
Mingwei Zhang (7):
KVM: selftests: x86: Fix an error in comment of amx_test
KVM: selftests: x86: Add a working xstate data structure
KVM: selftests: x86: Add check of CR0.TS in the #NM handler in
amx_test
KVM: selftests: Add the XFD check to IA32_XFD in #NM handler
KVM: selftests: Fix the checks to XFD_ERR using and operation
KVM: selftests: x86: Enable checking on xcomp_bv in amx_test
KVM: selftests: x86: Repeat the checking of xheader when
IA32_XFD[XTILEDATA] is set in amx_test
.../selftests/kvm/include/x86_64/processor.h | 12 ++++
tools/testing/selftests/kvm/x86_64/amx_test.c | 59 ++++++++++---------
2 files changed, 43 insertions(+), 28 deletions(-)
--
2.39.1.581.gbfd45094c4-goog