This series attempts to provide a simple way for BPF programs (and in
future other consumers) to utilize BPF Type Format (BTF) information
to display kernel data structures in-kernel. The use case this
functionality is applied to here is to support a snprintf()-like
helper to copy a BTF representation of kernel data to a string,
and a BPF seq file helper to display BTF data for an iterator.
There is already support in kernel/bpf/btf.c for "show" functionality;
the changes here generalize that support from seq-file specific
verifier display to the more generic case and add another specific
use case; rather than seq_printf()ing the show data, it is copied
to a supplied string using a snprintf()-like function. Other future
consumers of the show functionality could include a bpf_printk_btf()
function which printk()ed the data instead. Oops messaging in
particular would be an interesting application for such functionality.
The above potential use case hints at a potential reply to
a reasonable objection that such typed display should be
solved by tracing programs, where the in-kernel tracing records
data and the userspace program prints it out. While this
is certainly the recommended approach for most cases, I
believe having an in-kernel mechanism would be valuable
also. Critically in BPF programs it greatly simplifies
debugging and tracing of such data to invoking a simple
helper.
One challenge raised in an earlier iteration of this work -
where the BTF printing was implemented as a printk() format
specifier - was that the amount of data printed per
printk() was large, and other format specifiers were far
simpler. Here we sidestep that concern by printing
components of the BTF representation as we go for the
seq file case, and in the string case the snprintf()-like
operation is intended to be a basis for perf event or
ringbuf output. The reasons for avoiding bpf_trace_printk
are that
1. bpf_trace_printk() strings are restricted in size and
cannot display anything beyond trivial data structures; and
2. bpf_trace_printk() is for debugging purposes only.
As Alexei suggested, a bpf_trace_puts() helper could solve
this in the future but it still would be limited by the
1000 byte limit for traced strings.
Default output for an sk_buff looks like this (zeroed fields
are omitted):
(struct sk_buff){
.transport_header = (__u16)65535,
.mac_header = (__u16)65535,
.end = (sk_buff_data_t)192,
.head = (unsigned char *)0x000000007524fd8b,
.data = (unsigned char *)0x000000007524fd8b,
.truesize = (unsigned int)768,
.users = (refcount_t){
.refs = (atomic_t){
.counter = (int)1,
},
},
}
Flags can modify aspects of output format; see patch 3
for more details.
Changes since v5:
- Moved btf print prepare into patch 3, type show seq
with flags into patch 2 (Alexei, patches 2,3)
- Fixed build bot warnings around static declarations
and printf attributes
- Renamed functions to snprintf_btf/seq_printf_btf
(Alexei, patches 3-6)
Changes since v4:
- Changed approach from a BPF trace event-centric design to one
utilizing a snprintf()-like helper and an iter helper (Alexei,
patches 3,5)
- Added tests to verify BTF output (patch 4)
- Added support to tests for verifying BTF type_id-based display
as well as type name via __builtin_btf_type_id (Andrii, patch 4).
- Augmented task iter tests to cover the BTF-based seq helper.
Because a task_struct's BTF-based representation would overflow
the PAGE_SIZE limit on iterator data, the "struct fs_struct"
(task->fs) is displayed for each task instead (Alexei, patch 6).
Changes since v3:
- Moved to RFC since the approach is different (and bpf-next is
closed)
- Rather than using a printk() format specifier as the means
of invoking BTF-enabled display, a dedicated BPF helper is
used. This solves the issue of printk() having to output
large amounts of data using a complex mechanism such as
BTF traversal, but still provides a way for the display of
such data to be achieved via BPF programs. Future work could
include a bpf_printk_btf() function to invoke display via
printk() where the elements of a data structure are printk()ed
one at a time. Thanks to Petr Mladek, Andy Shevchenko and
Rasmus Villemoes who took time to look at the earlier printk()
format-specifier-focused version of this and provided feedback
clarifying the problems with that approach.
- Added trace id to the bpf_trace_printk events as a means of
separating output from standard bpf_trace_printk() events,
ensuring it can be easily parsed by the reader.
- Added bpf_trace_btf() helper tests which do simple verification
of the various display options.
Changes since v2:
- Alexei and Yonghong suggested it would be good to use
probe_kernel_read() on to-be-shown data to ensure safety
during operation. Safe copy via probe_kernel_read() to a
buffer object in "struct btf_show" is used to support
this. A few different approaches were explored
including dynamic allocation and per-cpu buffers. The
downside of dynamic allocation is that it would be done
during BPF program execution for bpf_trace_printk()s using
%pT format specifiers. The problem with per-cpu buffers
is we'd have to manage preemption and since the display
of an object occurs over an extended period and in printk
context where we'd rather not change preemption status,
it seemed tricky to manage buffer safety while considering
preemption. The approach of utilizing stack buffer space
via the "struct btf_show" seemed like the simplest approach.
The stack size of the associated functions which have a
"struct btf_show" on their stack to support show operation
(btf_type_snprintf_show() and btf_type_seq_show()) stays
under 500 bytes. The compromise here is the safe buffer we
use is small - 256 bytes - and as a result multiple
probe_kernel_read()s are needed for larger objects. Most
objects of interest are smaller than this (e.g.
"struct sk_buff" is 224 bytes), and while task_struct is a
notable exception at ~8K, performance is not the priority for
BTF-based display. (Alexei and Yonghong, patch 2).
- safe buffer use is the default behaviour (and is mandatory
for BPF) but unsafe display - meaning no safe copy is done
and we operate on the object itself - is supported via a
'u' option.
- pointers are prefixed with 0x for clarity (Alexei, patch 2)
- added additional comments and explanations around BTF show
code, especially around determining whether objects such
zeroed. Also tried to comment safe object scheme used. (Yonghong,
patch 2)
- added late_initcall() to initialize vmlinux BTF so that it would
not have to be initialized during printk operation (Alexei,
patch 5)
- removed CONFIG_BTF_PRINTF config option as it is not needed;
CONFIG_DEBUG_INFO_BTF can be used to gate test behaviour and
determining behaviour of type-based printk can be done via
retrieval of BTF data; if it's not there BTF was unavailable
or broken (Alexei, patches 4,6)
- fix bpf_trace_printk test to use vmlinux.h and globals via
skeleton infrastructure, removing need for perf events
(Andrii, patch 8)
Changes since v1:
- changed format to be more drgn-like, rendering indented type info
along with type names by default (Alexei)
- zeroed values are omitted (Arnaldo) by default unless the '0'
modifier is specified (Alexei)
- added an option to print pointer values without obfuscation.
The reason to do this is the sysctls controlling pointer display
are likely to be irrelevant in many if not most tracing contexts.
Some questions on this in the outstanding questions section below...
- reworked printk format specifer so that we no longer rely on format
%pT<type> but instead use a struct * which contains type information
(Rasmus). This simplifies the printk parsing, makes use more dynamic
and also allows specification by BTF id as well as name.
- removed incorrect patch which tried to fix dereferencing of resolved
BTF info for vmlinux; instead we skip modifiers for the relevant
case (array element type determination) (Alexei).
- fixed issues with negative snprintf format length (Rasmus)
- added test cases for various data structure formats; base types,
typedefs, structs, etc.
- tests now iterate through all typedef, enum, struct and unions
defined for vmlinux BTF and render a version of the target dummy
value which is either all zeros or all 0xff values; the idea is this
exercises the "skip if zero" and "print everything" cases.
- added support in BPF for using the %pT format specifier in
bpf_trace_printk()
- added BPF tests which ensure %pT format specifier use works (Alexei).
Alan Maguire (6):
bpf: provide function to get vmlinux BTF information
bpf: move to generic BTF show support, apply it to seq files/strings
bpf: add bpf_snprintf_btf helper
selftests/bpf: add bpf_snprintf_btf helper tests
bpf: add bpf_seq_printf_btf helper
selftests/bpf: add test for bpf_seq_printf_btf helper
include/linux/bpf.h | 3 +
include/linux/btf.h | 39 +
include/uapi/linux/bpf.h | 78 ++
kernel/bpf/btf.c | 980 ++++++++++++++++++---
kernel/bpf/core.c | 2 +
kernel/bpf/helpers.c | 4 +
kernel/bpf/verifier.c | 18 +-
kernel/trace/bpf_trace.c | 134 +++
scripts/bpf_helpers_doc.py | 2 +
tools/include/uapi/linux/bpf.h | 78 ++
tools/testing/selftests/bpf/prog_tests/bpf_iter.c | 66 ++
.../selftests/bpf/prog_tests/snprintf_btf.c | 54 ++
.../selftests/bpf/progs/bpf_iter_task_btf.c | 49 ++
.../selftests/bpf/progs/netif_receive_skb.c | 260 ++++++
14 files changed, 1659 insertions(+), 108 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/snprintf_btf.c
create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_task_btf.c
create mode 100644 tools/testing/selftests/bpf/progs/netif_receive_skb.c
--
1.8.3.1
From: Thomas Gleixner <tglx(a)linutronix.de>
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be
removed. Cleanup the leftovers before doing so.
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: "Paul E. McKenney" <paulmck(a)kernel.org>
Cc: Josh Triplett <josh(a)joshtriplett.org>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Lai Jiangshan <jiangshanlai(a)gmail.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: rcu(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
---
tools/testing/selftests/rcutorture/configs/rcu/SRCU-t | 1 -
tools/testing/selftests/rcutorture/configs/rcu/SRCU-u | 1 -
tools/testing/selftests/rcutorture/configs/rcu/TINY01 | 1 -
tools/testing/selftests/rcutorture/doc/TINY_RCU.txt | 5 ++---
tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt | 1 -
tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h | 1 -
6 files changed, 2 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-t b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-t
index 6c78022..553cf65 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-t
+++ b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-t
@@ -7,4 +7,3 @@ CONFIG_RCU_TRACE=n
CONFIG_DEBUG_LOCK_ALLOC=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_DEBUG_ATOMIC_SLEEP=y
-#CHECK#CONFIG_PREEMPT_COUNT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-u b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-u
index c15ada8..99563da 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-u
+++ b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-u
@@ -7,4 +7,3 @@ CONFIG_RCU_TRACE=n
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
-CONFIG_PREEMPT_COUNT=n
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TINY01 b/tools/testing/selftests/rcutorture/configs/rcu/TINY01
index 6db705e..9b22b8e 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TINY01
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TINY01
@@ -10,4 +10,3 @@ CONFIG_RCU_TRACE=n
#CHECK#CONFIG_RCU_STALL_COMMON=n
CONFIG_DEBUG_LOCK_ALLOC=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
-CONFIG_PREEMPT_COUNT=n
diff --git a/tools/testing/selftests/rcutorture/doc/TINY_RCU.txt b/tools/testing/selftests/rcutorture/doc/TINY_RCU.txt
index a75b169..d30cedf 100644
--- a/tools/testing/selftests/rcutorture/doc/TINY_RCU.txt
+++ b/tools/testing/selftests/rcutorture/doc/TINY_RCU.txt
@@ -3,11 +3,10 @@ This document gives a brief rationale for the TINY_RCU test cases.
Kconfig Parameters:
-CONFIG_DEBUG_LOCK_ALLOC -- Do all three and none of the three.
-CONFIG_PREEMPT_COUNT
+CONFIG_DEBUG_LOCK_ALLOC -- Do both and none of the two.
CONFIG_RCU_TRACE
-The theory here is that randconfig testing will hit the other six possible
+The theory here is that randconfig testing will hit the other two possible
combinations of these parameters.
diff --git a/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt b/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt
index 1b96d68..cfdd48f 100644
--- a/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt
+++ b/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt
@@ -43,7 +43,6 @@ CONFIG_64BIT
Used only to check CONFIG_RCU_FANOUT value, inspection suffices.
-CONFIG_PREEMPT_COUNT
CONFIG_PREEMPT_RCU
Redundant with CONFIG_PREEMPT, ignore.
diff --git a/tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h b/tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h
index 283d710..d0d485d 100644
--- a/tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h
+++ b/tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h
@@ -8,7 +8,6 @@
#undef CONFIG_HOTPLUG_CPU
#undef CONFIG_MODULES
#undef CONFIG_NO_HZ_FULL_SYSIDLE
-#undef CONFIG_PREEMPT_COUNT
#undef CONFIG_PREEMPT_RCU
#undef CONFIG_PROVE_RCU
#undef CONFIG_RCU_NOCB_CPU
--
2.9.5
Hi!
I really like Hangbin Liu's intent[1] but I think we need to be a little
more clean about the implementation. This extracts run_kselftest.sh from
the Makefile so it can actually be changed without embeds, etc. Instead,
generate the test list into a text file. Everything gets much simpler.
:)
And in patch 2, I add back Hangin Liu's new options (with some extra
added) with knowledge of "collections" (i.e. Makefile TARGETS) and
subtests. This should work really well with LAVA too, which needs to
manipulate the lists of tests being run.
Thoughts?
-Kees
[1] https://lore.kernel.org/lkml/20200914022227.437143-1-liuhangbin@gmail.com/
Kees Cook (2):
selftests: Extract run_kselftest.sh and generate stand-alone test list
selftests/run_kselftest.sh: Make each test individually selectable
tools/testing/selftests/Makefile | 26 ++-----
tools/testing/selftests/lib.mk | 5 +-
tools/testing/selftests/run_kselftest.sh | 89 ++++++++++++++++++++++++
3 files changed, 98 insertions(+), 22 deletions(-)
create mode 100755 tools/testing/selftests/run_kselftest.sh
--
2.25.1
Right now .kunitconfig and the build dir are automatically created if
the build dir does not exists; however, if the build dir is present and
.kunitconfig is not, kunit_tool will crash.
Fix this by checking for both the build dir as well as the .kunitconfig.
NOTE: This depends on commit 5578d008d9e0 ("kunit: tool: fix running
kunit_tool from outside kernel tree")
Link: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git/c…
Signed-off-by: Brendan Higgins <brendanhiggins(a)google.com>
---
tools/testing/kunit/kunit.py | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/tools/testing/kunit/kunit.py b/tools/testing/kunit/kunit.py
index e2caf4e24ecb2..8ab17e21a3578 100755
--- a/tools/testing/kunit/kunit.py
+++ b/tools/testing/kunit/kunit.py
@@ -243,6 +243,8 @@ def main(argv, linux=None):
if cli_args.subcommand == 'run':
if not os.path.exists(cli_args.build_dir):
os.mkdir(cli_args.build_dir)
+
+ if not os.path.exists(kunit_kernel.kunitconfig_path):
create_default_kunitconfig()
if not linux:
@@ -258,10 +260,12 @@ def main(argv, linux=None):
if result.status != KunitStatus.SUCCESS:
sys.exit(1)
elif cli_args.subcommand == 'config':
- if cli_args.build_dir:
- if not os.path.exists(cli_args.build_dir):
- os.mkdir(cli_args.build_dir)
- create_default_kunitconfig()
+ if cli_args.build_dir and (
+ not os.path.exists(cli_args.build_dir)):
+ os.mkdir(cli_args.build_dir)
+
+ if not os.path.exists(kunit_kernel.kunitconfig_path):
+ create_default_kunitconfig()
if not linux:
linux = kunit_kernel.LinuxSourceTree()
base-commit: d96fe1a5485fa978a6e3690adc4dbe4d20b5baa4
--
2.28.0.681.g6f77f65b4e-goog
Changes since v2:
- added struct xfrm_translator as API to register xfrm_compat.ko with
xfrm_state.ko. This allows compilation of translator as a loadable
module
- fixed indention and collected reviewed-by (Johannes Berg)
- moved boilerplate from commit messages into cover-letter (Steffen
Klassert)
- found on KASAN build and fixed non-initialised stack variable usage
in the translator
The resulting v2/v3 diff can be found here:
https://gist.github.com/0x7f454c46/8f68311dfa1f240959fdbe7c77ed2259
Patches as a .git branch:
https://github.com/0x7f454c46/linux/tree/xfrm-compat-v3
Changes since v1:
- reworked patches set to use translator
- separated the compat layer into xfrm_compat.c,
compiled under XFRM_USER_COMPAT config
- 32-bit messages now being sent in frag_list (like wext-core does)
- instead of __packed add compat_u64 members in compat structures
- selftest reworked to kselftest lib API
- added netlink dump testing to the selftest
XFRM is disabled for compatible users because of the UABI difference.
The difference is in structures paddings and in the result the size
of netlink messages differ.
Possibility for compatible application to manage xfrm tunnels was
disabled by: the commmit 19d7df69fdb2 ("xfrm: Refuse to insert 32 bit
userspace socket policies on 64 bit systems") and the commit 74005991b78a
("xfrm: Do not parse 32bits compiled xfrm netlink msg on 64bits host").
This is my second attempt to resolve the xfrm/compat problem by adding
the 64=>32 and 32=>64 bit translators those non-visibly to a user
provide translation between compatible user and kernel.
Previous attempt was to interrupt the message ABI according to a syscall
by xfrm_user, which resulted in over-complicated code [1].
Florian Westphal provided the idea of translator and some draft patches
in the discussion. In these patches, his idea is reused and some of his
initial code is also present.
There were a couple of attempts to solve xfrm compat problem:
https://lkml.org/lkml/2017/1/20/733https://patchwork.ozlabs.org/patch/44600/http://netdev.vger.kernel.narkive.com/2Gesykj6/patch-net-next-xfrm-correctl…
All the discussions end in the conclusion that xfrm should have a full
compatible layer to correctly work with 32-bit applications on 64-bit
kernels:
https://lkml.org/lkml/2017/1/23/413https://patchwork.ozlabs.org/patch/433279/
In some recent lkml discussion, Linus said that it's worth to fix this
problem and not giving people an excuse to stay on 32-bit kernel:
https://lkml.org/lkml/2018/2/13/752
There is also an selftest for ipsec tunnels.
It doesn't depend on any library and compat version can be easy
build with: make CFLAGS=-m32 net/ipsec
[1]: https://lkml.kernel.org/r/20180726023144.31066-1-dima@arista.com
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: Florian Westphal <fw(a)strlen.de>
Cc: Herbert Xu <herbert(a)gondor.apana.org.au>
Cc: Jakub Kicinski <kuba(a)kernel.org>
Cc: Johannes Berg <johannes(a)sipsolutions.net>
Cc: Steffen Klassert <steffen.klassert(a)secunet.com>
Cc: Stephen Suryaputra <ssuryaextr(a)gmail.com>
Cc: Dmitry Safonov <0x7f454c46(a)gmail.com>
Cc: netdev(a)vger.kernel.org
Dmitry Safonov (7):
xfrm: Provide API to register translator module
xfrm/compat: Add 64=>32-bit messages translator
xfrm/compat: Attach xfrm dumps to 64=>32 bit translator
netlink/compat: Append NLMSG_DONE/extack to frag_list
xfrm/compat: Add 32=>64-bit messages translator
xfrm/compat: Translate 32-bit user_policy from sockptr
selftest/net/xfrm: Add test for ipsec tunnel
MAINTAINERS | 1 +
include/net/xfrm.h | 33 +
net/netlink/af_netlink.c | 47 +-
net/xfrm/Kconfig | 11 +
net/xfrm/Makefile | 1 +
net/xfrm/xfrm_compat.c | 625 +++++++
net/xfrm/xfrm_state.c | 77 +-
net/xfrm/xfrm_user.c | 110 +-
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/ipsec.c | 2195 ++++++++++++++++++++++++
11 files changed, 3066 insertions(+), 36 deletions(-)
create mode 100644 net/xfrm/xfrm_compat.c
create mode 100644 tools/testing/selftests/net/ipsec.c
base-commit: ba4f184e126b751d1bffad5897f263108befc780
--
2.28.0
Hi,
The v6 of this patch series include only the type change requested by
Andy on the vdso patch, but since v5 included some bigger changes, I'm
documenting them in this cover letter as well.
Please note this applies on top of Linus tree, and it succeeds seccomp
and syscall user dispatch selftests.
v5 cover letter
--------------
This is v5 of Syscall User Dispatch. It has some big changes in
comparison to v4.
First of all, it allows the vdso trampoline code for architectures that
support it. This is exposed through an arch hook. It also addresses
the concern about what happens when a bad selector is provided, instead
of SIGSEGV, we fail with SIGSYS, which is more debug-able.
Another major change is that it is now based on top of Gleixner's common
syscall entry work, and is supposed to only be used by that code.
Therefore, the entry symbol is not exported outside of kernel/entry/ code.
The biggest change in this version is the attempt to avoid using one of
the final TIF flags on x86 32 bit, without increasing the size of that
variable to 64 bit. My expectation is that, with this work, plus the
removal of TIF_IA32, TIF_X32 and TIF_FORCE_TF, we might be able to avoid
changing this field to 64 bits at all. Instead, this follows the
suggestion by Andy to have a generic TIF flag for SECCOMP and this
mechanism, and use another field to decide which one is enabled. The
code for this is not complex, so it seems like a viable approach.
Finally, this version adds some documentation to the feature.
Kees, I dropped your reviewed-by on patch 5, given the amount of
changes.
Thanks,
Previous submissions are archived at:
RFC/v1: https://lkml.org/lkml/2020/7/8/96
v2: https://lkml.org/lkml/2020/7/9/17
v3: https://lkml.org/lkml/2020/7/12/4
v4: https://www.spinics.net/lists/linux-kselftest/msg16377.html
v5: https://lkml.org/lkml/2020/8/10/1320
Gabriel Krisman Bertazi (9):
kernel: Support TIF_SYSCALL_INTERCEPT flag
kernel: entry: Support TIF_SYSCAL_INTERCEPT on common entry code
x86: vdso: Expose sigreturn address on vdso to the kernel
signal: Expose SYS_USER_DISPATCH si_code type
kernel: Implement selective syscall userspace redirection
kernel: entry: Support Syscall User Dispatch for common syscall entry
x86: Enable Syscall User Dispatch
selftests: Add kselftest for syscall user dispatch
doc: Document Syscall User Dispatch
.../admin-guide/syscall-user-dispatch.rst | 87 ++++++
arch/Kconfig | 21 ++
arch/x86/Kconfig | 1 +
arch/x86/entry/vdso/vdso2c.c | 2 +
arch/x86/entry/vdso/vdso32/sigreturn.S | 2 +
arch/x86/entry/vdso/vma.c | 15 +
arch/x86/include/asm/elf.h | 1 +
arch/x86/include/asm/thread_info.h | 4 +-
arch/x86/include/asm/vdso.h | 2 +
arch/x86/kernel/signal_compat.c | 2 +-
fs/exec.c | 8 +
include/linux/entry-common.h | 6 +-
include/linux/sched.h | 8 +-
include/linux/seccomp.h | 20 +-
include/linux/syscall_intercept.h | 71 +++++
include/linux/syscall_user_dispatch.h | 29 ++
include/uapi/asm-generic/siginfo.h | 3 +-
include/uapi/linux/prctl.h | 5 +
kernel/entry/Makefile | 1 +
kernel/entry/common.c | 32 +-
kernel/entry/common.h | 15 +
kernel/entry/syscall_user_dispatch.c | 101 ++++++
kernel/fork.c | 10 +-
kernel/seccomp.c | 7 +-
kernel/sys.c | 5 +
tools/testing/selftests/Makefile | 1 +
.../syscall_user_dispatch/.gitignore | 2 +
.../selftests/syscall_user_dispatch/Makefile | 9 +
.../selftests/syscall_user_dispatch/config | 1 +
.../syscall_user_dispatch.c | 292 ++++++++++++++++++
30 files changed, 744 insertions(+), 19 deletions(-)
create mode 100644 Documentation/admin-guide/syscall-user-dispatch.rst
create mode 100644 include/linux/syscall_intercept.h
create mode 100644 include/linux/syscall_user_dispatch.h
create mode 100644 kernel/entry/common.h
create mode 100644 kernel/entry/syscall_user_dispatch.c
create mode 100644 tools/testing/selftests/syscall_user_dispatch/.gitignore
create mode 100644 tools/testing/selftests/syscall_user_dispatch/Makefile
create mode 100644 tools/testing/selftests/syscall_user_dispatch/config
create mode 100644 tools/testing/selftests/syscall_user_dispatch/syscall_user_dispatch.c
--
2.28.0
This patchset is based on Google-internal RSEQ
work done by Paul Turner and Andrew Hunter.
When working with per-CPU RSEQ-based memory allocations,
it is sometimes important to make sure that a global
memory location is no longer accessed from RSEQ critical
sections. For example, there can be two per-CPU lists,
one is "active" and accessed per-CPU, while another one
is inactive and worked on asynchronously "off CPU" (e.g.
garbage collection is performed). Then at some point
the two lists are swapped, and a fast RCU-like mechanism
is required to make sure that the previously active
list is no longer accessed.
This patch introduces such a mechanism: in short,
membarrier() syscall issues an IPI to a CPU, restarting
a potentially active RSEQ critical section on the CPU.
v1->v2:
- removed the ability to IPI all CPUs in a single sycall;
- use task->mm rather than task->group_leader to identify
tasks belonging to the same process.
v2->v3:
- re-added the ability to IPI all CPUs in a single syscall;
- integrated with membarrier_private_expedited() to
make sure only CPUs running tasks with the same mm as
the current task are interrupted;
- also added MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ;
- flags in membarrier_private_expedited are never actually
bit flags but always distinct values (i.e. never two flags
are combined), so I modified bit testing to full equation
comparison for simplicity (otherwise the code needs to
work when several bits are set, for example).
v3->v4:
- added the third parameter to membarrier syscall: @cpu_id:
if @flags == MEMBARRIER_CMD_FLAG_CPU, then @cpu_id indicates
the cpu on which RSEQ CS should be restarted.
v4->v5:
- added @cpu_id parameter to sys_membarrier in syscalls.h.
v5->v6:
- made membarrier_private_expedited more efficient in a
single-cpu case;
- a couple of minor refactorings.
v6->v7:
- made @flags an unsigned int in sys_membarrier;
- a couple of minor refactorings.
v7->v8:
- replaced BUG_ON with WARN_ON_ONCE in membarrier.c.
The second patch in the patchset adds a selftest
of this feature.
Signed-off-by: Peter Oskolkov <posk(a)google.com>
---
include/linux/sched/mm.h | 3 +
include/linux/syscalls.h | 2 +-
include/uapi/linux/membarrier.h | 26 ++++++
kernel/sched/membarrier.c | 136 +++++++++++++++++++++++++-------
4 files changed, 136 insertions(+), 31 deletions(-)
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index f889e332912f..15bfb06f2884 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -348,10 +348,13 @@ enum {
MEMBARRIER_STATE_GLOBAL_EXPEDITED = (1U << 3),
MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY = (1U << 4),
MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE = (1U << 5),
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_RSEQ_READY = (1U << 6),
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_RSEQ = (1U << 7),
};
enum {
MEMBARRIER_FLAG_SYNC_CORE = (1U << 0),
+ MEMBARRIER_FLAG_RSEQ = (1U << 1),
};
#ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 75ac7f8ae93c..466c993e52bf 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -974,7 +974,7 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
const char __user *const __user *argv,
const char __user *const __user *envp, int flags);
asmlinkage long sys_userfaultfd(int flags);
-asmlinkage long sys_membarrier(int cmd, int flags);
+asmlinkage long sys_membarrier(int cmd, int flags, int cpu_id);
asmlinkage long sys_mlock2(unsigned long start, size_t len, int flags);
asmlinkage long sys_copy_file_range(int fd_in, loff_t __user *off_in,
int fd_out, loff_t __user *off_out,
diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index 5891d7614c8c..737605897f36 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -114,6 +114,26 @@
* If this command is not implemented by an
* architecture, -EINVAL is returned.
* Returns 0 on success.
+ * @MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ:
+ * Ensure the caller thread, upon return from
+ * system call, that all its running thread
+ * siblings have any currently running rseq
+ * critical sections restarted if @flags
+ * parameter is 0; if @flags parameter is
+ * MEMBARRIER_CMD_FLAG_CPU,
+ * then this operation is performed only
+ * on CPU indicated by @cpu_id. If this command is
+ * not implemented by an architecture, -EINVAL
+ * is returned. A process needs to register its
+ * intent to use the private expedited rseq
+ * command prior to using it, otherwise
+ * this command returns -EPERM.
+ * @MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ:
+ * Register the process intent to use
+ * MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ.
+ * If this command is not implemented by an
+ * architecture, -EINVAL is returned.
+ * Returns 0 on success.
* @MEMBARRIER_CMD_SHARED:
* Alias to MEMBARRIER_CMD_GLOBAL. Provided for
* header backward compatibility.
@@ -131,9 +151,15 @@ enum membarrier_cmd {
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED = (1 << 4),
MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE = (1 << 5),
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE = (1 << 6),
+ MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ = (1 << 7),
+ MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ = (1 << 8),
/* Alias for header backward compatibility. */
MEMBARRIER_CMD_SHARED = MEMBARRIER_CMD_GLOBAL,
};
+enum membarrier_cmd_flag {
+ MEMBARRIER_CMD_FLAG_CPU = (1 << 0),
+};
+
#endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 168479a7d61b..e23e74d52db5 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -18,6 +18,14 @@
#define MEMBARRIER_PRIVATE_EXPEDITED_SYNC_CORE_BITMASK 0
#endif
+#ifdef CONFIG_RSEQ
+#define MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ_BITMASK \
+ (MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ \
+ | MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ_BITMASK)
+#else
+#define MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ_BITMASK 0
+#endif
+
#define MEMBARRIER_CMD_BITMASK \
(MEMBARRIER_CMD_GLOBAL | MEMBARRIER_CMD_GLOBAL_EXPEDITED \
| MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED \
@@ -30,6 +38,11 @@ static void ipi_mb(void *info)
smp_mb(); /* IPIs should be serializing but paranoid. */
}
+static void ipi_rseq(void *info)
+{
+ rseq_preempt(current);
+}
+
static void ipi_sync_rq_state(void *info)
{
struct mm_struct *mm = (struct mm_struct *) info;
@@ -129,19 +142,27 @@ static int membarrier_global_expedited(void)
return 0;
}
-static int membarrier_private_expedited(int flags)
+static int membarrier_private_expedited(int flags, int cpu_id)
{
- int cpu;
cpumask_var_t tmpmask;
struct mm_struct *mm = current->mm;
+ smp_call_func_t ipi_func = ipi_mb;
- if (flags & MEMBARRIER_FLAG_SYNC_CORE) {
+ if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
return -EINVAL;
if (!(atomic_read(&mm->membarrier_state) &
MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
return -EPERM;
+ } else if (flags == MEMBARRIER_FLAG_RSEQ) {
+ if (!IS_ENABLED(CONFIG_RSEQ))
+ return -EINVAL;
+ if (!(atomic_read(&mm->membarrier_state) &
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_RSEQ_READY))
+ return -EPERM;
+ ipi_func = ipi_rseq;
} else {
+ WARN_ON_ONCE(flags);
if (!(atomic_read(&mm->membarrier_state) &
MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY))
return -EPERM;
@@ -156,35 +177,59 @@ static int membarrier_private_expedited(int flags)
*/
smp_mb(); /* system call entry is not a mb. */
- if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
+ if (cpu_id < 0 && !zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
return -ENOMEM;
cpus_read_lock();
- rcu_read_lock();
- for_each_online_cpu(cpu) {
+
+ if (cpu_id >= 0) {
struct task_struct *p;
- /*
- * Skipping the current CPU is OK even through we can be
- * migrated at any point. The current CPU, at the point
- * where we read raw_smp_processor_id(), is ensured to
- * be in program order with respect to the caller
- * thread. Therefore, we can skip this CPU from the
- * iteration.
- */
- if (cpu == raw_smp_processor_id())
- continue;
- p = rcu_dereference(cpu_rq(cpu)->curr);
- if (p && p->mm == mm)
- __cpumask_set_cpu(cpu, tmpmask);
+ if (cpu_id >= nr_cpu_ids || !cpu_online(cpu_id))
+ goto out;
+ if (cpu_id == raw_smp_processor_id())
+ goto out;
+ rcu_read_lock();
+ p = rcu_dereference(cpu_rq(cpu_id)->curr);
+ if (!p || p->mm != mm) {
+ rcu_read_unlock();
+ goto out;
+ }
+ rcu_read_unlock();
+ } else {
+ int cpu;
+
+ rcu_read_lock();
+ for_each_online_cpu(cpu) {
+ struct task_struct *p;
+
+ /*
+ * Skipping the current CPU is OK even through we can be
+ * migrated at any point. The current CPU, at the point
+ * where we read raw_smp_processor_id(), is ensured to
+ * be in program order with respect to the caller
+ * thread. Therefore, we can skip this CPU from the
+ * iteration.
+ */
+ if (cpu == raw_smp_processor_id())
+ continue;
+ p = rcu_dereference(cpu_rq(cpu)->curr);
+ if (p && p->mm == mm)
+ __cpumask_set_cpu(cpu, tmpmask);
+ }
+ rcu_read_unlock();
}
- rcu_read_unlock();
preempt_disable();
- smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
+ if (cpu_id >= 0)
+ smp_call_function_single(cpu_id, ipi_func, NULL, 1);
+ else
+ smp_call_function_many(tmpmask, ipi_func, NULL, 1);
preempt_enable();
- free_cpumask_var(tmpmask);
+out:
+ if (cpu_id < 0)
+ free_cpumask_var(tmpmask);
cpus_read_unlock();
/*
@@ -283,11 +328,18 @@ static int membarrier_register_private_expedited(int flags)
set_state = MEMBARRIER_STATE_PRIVATE_EXPEDITED,
ret;
- if (flags & MEMBARRIER_FLAG_SYNC_CORE) {
+ if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
return -EINVAL;
ready_state =
MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY;
+ } else if (flags == MEMBARRIER_FLAG_RSEQ) {
+ if (!IS_ENABLED(CONFIG_RSEQ))
+ return -EINVAL;
+ ready_state =
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_RSEQ_READY;
+ } else {
+ WARN_ON_ONCE(flags);
}
/*
@@ -299,6 +351,8 @@ static int membarrier_register_private_expedited(int flags)
return 0;
if (flags & MEMBARRIER_FLAG_SYNC_CORE)
set_state |= MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE;
+ if (flags & MEMBARRIER_FLAG_RSEQ)
+ set_state |= MEMBARRIER_STATE_PRIVATE_EXPEDITED_RSEQ;
atomic_or(set_state, &mm->membarrier_state);
ret = sync_runqueues_membarrier_state(mm);
if (ret)
@@ -310,8 +364,15 @@ static int membarrier_register_private_expedited(int flags)
/**
* sys_membarrier - issue memory barriers on a set of threads
- * @cmd: Takes command values defined in enum membarrier_cmd.
- * @flags: Currently needs to be 0. For future extensions.
+ * @cmd: Takes command values defined in enum membarrier_cmd.
+ * @flags: Currently needs to be 0 for all commands other than
+ * MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ: in the latter
+ * case it can be MEMBARRIER_CMD_FLAG_CPU, indicating that @cpu_id
+ * contains the CPU on which to interrupt (= restart)
+ * the RSEQ critical section.
+ * @cpu_id: if @flags == MEMBARRIER_CMD_FLAG_CPU, indicates the cpu on which
+ * RSEQ CS should be interrupted (@cmd must be
+ * MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ).
*
* If this system call is not implemented, -ENOSYS is returned. If the
* command specified does not exist, not available on the running
@@ -337,10 +398,21 @@ static int membarrier_register_private_expedited(int flags)
* smp_mb() X O O
* sys_membarrier() O O O
*/
-SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
+SYSCALL_DEFINE3(membarrier, int, cmd, unsigned int, flags, int, cpu_id)
{
- if (unlikely(flags))
- return -EINVAL;
+ switch (cmd) {
+ case MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ:
+ if (unlikely(flags && flags != MEMBARRIER_CMD_FLAG_CPU))
+ return -EINVAL;
+ break;
+ default:
+ if (unlikely(flags))
+ return -EINVAL;
+ }
+
+ if (!(flags & MEMBARRIER_CMD_FLAG_CPU))
+ cpu_id = -1;
+
switch (cmd) {
case MEMBARRIER_CMD_QUERY:
{
@@ -362,13 +434,17 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
case MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED:
return membarrier_register_global_expedited();
case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
- return membarrier_private_expedited(0);
+ return membarrier_private_expedited(0, cpu_id);
case MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED:
return membarrier_register_private_expedited(0);
case MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE:
- return membarrier_private_expedited(MEMBARRIER_FLAG_SYNC_CORE);
+ return membarrier_private_expedited(MEMBARRIER_FLAG_SYNC_CORE, cpu_id);
case MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE:
return membarrier_register_private_expedited(MEMBARRIER_FLAG_SYNC_CORE);
+ case MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ:
+ return membarrier_private_expedited(MEMBARRIER_FLAG_RSEQ, cpu_id);
+ case MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ:
+ return membarrier_register_private_expedited(MEMBARRIER_FLAG_RSEQ);
default:
return -EINVAL;
}
--
2.28.0.709.gb0816b6eb0-goog