Hi, Willy
This patchset mainly allows speed up the nolibc test with a minimal
kernel config.
As the nolibc supported architectures become more and more, the 'run'
test with DEFCONFIG may cost several hours, which is not friendly to
develop testing and even for release testing, so, smaller kernel configs
may be required, and firstly, we should let nolibc-test work with less
kernel config options, this patchset aims to this goal.
This patchset mainly remove the dependency from procfs, tmpfs, net and
memfd_create, many failures have been fixed up.
When CONFIG_TMPFS and CONFIG_SHMEM are disabled, kernel will provide a
ramfs based tmpfs (mm/shmem.c), it will be used as a choice to fix up
some failures and also allow skip less tests.
Besides, it also adds musl support, improves glibc support and fixes up
a kernel cmdline passing use case.
This is based on the dev.2023.06.14a branch of linux-rcu [1], all of the
supported architectures are tested (with local minimal configs, [5]
pasted the one for i386) without failures:
arch/board | result
------------|------------
arm/vexpress-a9 | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/arm-vexpress-a9-nolibc-test.log
aarch64/virt | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/aarch64-virt-nolibc-test.log
ppc/g3beige | not supported
i386/pc | 136 test(s) passed, 3 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/i386-pc-nolibc-test.log
x86_64/pc | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/x86_64-pc-nolibc-test.log
mipsel/malta | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/mipsel-malta-nolibc-test.log
loongarch64/virt | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/loongarch64-virt-nolibc-test.log
riscv64/virt | 136 test(s) passed, 3 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/riscv64-virt-nolibc-test.log
riscv32/virt | no test log found
s390x/s390-ccw-virtio | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/s390x-s390-ccw-virtio-nolibc-test.log
Notes:
* The skipped ones are -fstackprotector, chmod_self and chown_self
The -fstackprotector skip is due to gcc version.
chmod_self and chmod_self skips are due to procfs not enabled
* ppc/g3beige support is added locally, but not added in this patchset
will send ppc support as a new patchset, it depends on v2 test
report patchset [3] and the v5 rv32 support, require changes on
Makefile
* riscv32/virt support is still in review, see v5 rv32 support [4]
This patchset doesn't depends on any of my other nolibc patch series,
but the new rmdir() routine added in this patchset may be requird to
apply the __sysret() from our v4 syscall helper series [2] after that
series being merged, currently, we use the old method to let it compile
without any dependency.
Here explains all of the patches:
* selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
The above 3 patches adds musl compile support and improve glibc support.
It is able to build and run nolibc-test with musl libc now, but there
are some failures/skips due to the musl its own issues/requirements:
$ sudo ./nolibc-test | grep -E 'FAIL|SKIP'
8 sbrk = 1 ENOMEM [FAIL]
9 brk = -1 ENOMEM [FAIL]
46 limit_int_fast16_min = -2147483648 [FAIL]
47 limit_int_fast16_max = 2147483647 [FAIL]
49 limit_int_fast32_min = -2147483648 [FAIL]
50 limit_int_fast32_max = 2147483647 [FAIL]
0 -fstackprotector not supported [SKIPPED]
musl disabled sbrk and brk for some conflicts with its malloc and the
fast version of int types are defined in 32bit, which differs from nolibc
and glibc. musl reserved the sbrk(0) to allow get current brk, we
added a test for this in the v4 __sysret() helper series [2].
* selftests/nolibc: fix up kernel parameters support
kernel cmdline allows pass two types of parameters, one is without
'=', another is with '=', the first one is passed as init arguments,
the sencond one is passed as init environment variables.
Our nolibc-test prefer arguments to environment variables, this not
work when users add such parameters in the kernel cmdline:
noapic NOLIBC_TEST=syscall
So, this patch will verify the setting from arguments at first, if it
is no valid, will try the environment variables instead.
* selftests/nolibc: stat_timestamps: remove procfs dependency
Use '/' instead of /proc/self, or we can add a 'has_proc' condition
for this test case, but it is not that necessary to skip the whole
stat_timestamps tests for such a subtest binding to /proc/self.
Welcome suggestion from Thomas.
* tools/nolibc: add rmdir() support
selftests/nolibc: add a new rmdir() test case
rmdir() routine and test case are added for the coming requirement.
Note, if the __sysret() patchset [2] is applied before us, this patch
should be rebased on it and apply the __sysret() helper.
* selftests/nolibc: fix up failures when there is no procfs
call rmdir() to remove /proc completely to rework the checking of
/proc, before, the existing of /proc not means the procfs is really
mounted.
* selftests/nolibc: rename proc variable to has_proc
selftests/nolibc: rename euid0 variable to is_root
align with the has_gettid, has_xxx variables.
* selftests/nolibc: prepare tmpfs and hugetlbfs
selftests/nolibc: rename chmod_net to chmod_good
selftests/nolibc: link_cross: support tmpfs
selftests/nolibc: rename chroot_exe to chroot_file
use file from /tmp instead of file from /proc when there is no procfs
this avoid skipping the chmod_net, link_cross, chroot_exe tests
* selftests/nolibc: vfprintf: silence memfd_create() warning
selftests/nolibc: vfprintf: skip if neither tmpfs nor hugetlbfs
selftests/nolibc: vfprintf: support tmpfs and hugetlbfs
memfd_create from kernel >= v6.2 forcely warn on missing
MFD_NOEXEC_SEAL flag, the first one silence it with such flag, for
older kernels, use 0 flag as before.
since memfd_create() depends on TMPFS or HUGETLBFS, the second one
skip the whole vfprintf instead of simply fail if memfd_create() not
work.
the 3rd one futher try the ramfs based tmpfs even when memfd_create()
not work.
At last, let's simply discuss about the configs, I have prepared minimal
configs for all of the nolibc supported architectures but not sure where
should we put them, what about tools/testing/selftests/nolibc/configs ?
Thanks!
Best regards,
Zhangjin
---
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/
[2]: https://lore.kernel.org/linux-riscv/cover.1687187451.git.falcon@tinylab.org/
[3]: https://lore.kernel.org/lkml/cover.1687156559.git.falcon@tinylab.org/
[4]: https://lore.kernel.org/linux-riscv/cover.1687176996.git.falcon@tinylab.org/
[5]: https://pastebin.com/5jq0Vxbz
Zhangjin Wu (17):
selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
selftests/nolibc: fix up kernel parameters support
selftests/nolibc: stat_timestamps: remove procfs dependency
tools/nolibc: add rmdir() support
selftests/nolibc: add a new rmdir() test case
selftests/nolibc: fix up failures when there is no procfs
selftests/nolibc: rename proc variable to has_proc
selftests/nolibc: rename euid0 variable to is_root
selftests/nolibc: prepare tmpfs and hugetlbfs
selftests/nolibc: rename chmod_net to chmod_good
selftests/nolibc: link_cross: support tmpfs
selftests/nolibc: rename chroot_exe to chroot_file
selftests/nolibc: vfprintf: silence memfd_create() warning
selftests/nolibc: vfprintf: skip if neither tmpfs nor hugetlbfs
selftests/nolibc: vfprintf: support tmpfs and hugetlbfs
tools/include/nolibc/sys.h | 28 ++++
tools/testing/selftests/nolibc/nolibc-test.c | 132 +++++++++++++++----
2 files changed, 138 insertions(+), 22 deletions(-)
--
2.25.1
From: Jeff Xu <jeffxu(a)google.com>
Since Linux introduced the memfd feature, memfd have always had their
execute bit set, and the memfd_create() syscall doesn't allow setting
it differently.
However, in a secure by default system, such as ChromeOS, (where all
executables should come from the rootfs, which is protected by Verified
boot), this executable nature of memfd opens a door for NoExec bypass
and enables “confused deputy attack”. E.g, in VRP bug [1]: cros_vm
process created a memfd to share the content with an external process,
however the memfd is overwritten and used for executing arbitrary code
and root escalation. [2] lists more VRP in this kind.
On the other hand, executable memfd has its legit use, runc uses memfd’s
seal and executable feature to copy the contents of the binary then
execute them, for such system, we need a solution to differentiate runc's
use of executable memfds and an attacker's [3].
To address those above, this set of patches add following:
1> Let memfd_create() set X bit at creation time.
2> Let memfd to be sealed for modifying X bit.
3> A new pid namespace sysctl: vm.memfd_noexec to control the behavior of
X bit.For example, if a container has vm.memfd_noexec=2, then
memfd_create() without MFD_NOEXEC_SEAL will be rejected.
4> A new security hook in memfd_create(). This make it possible to a new
LSM, which rejects or allows executable memfd based on its security policy.
Change history:
v8:
- Update ref bug in cover letter.
- Add Reviewed-by field.
- Remove security hook (security_memfd_create) patch, which will have
its own patch set in future.
v7:
- patch 2/6: remove #ifdef and MAX_PATH (memfd_test.c).
- patch 3/6: check capability (CAP_SYS_ADMIN) from userns instead of
global ns (pid_sysctl.h). Add a tab (pid_namespace.h).
- patch 5/6: remove #ifdef (memfd_test.c)
- patch 6/6: remove unneeded security_move_mount(security.c).
v6:https://lore.kernel.org/lkml/20221206150233.1963717-1-jeffxu@google.com/
- Address comment and move "#ifdef CONFIG_" from .c file to pid_sysctl.h
v5:https://lore.kernel.org/lkml/20221206152358.1966099-1-jeffxu@google.com/
- Pass vm.memfd_noexec from current ns to child ns.
- Fix build issue detected by kernel test robot.
- Add missing security.c
v3:https://lore.kernel.org/lkml/20221202013404.163143-1-jeffxu@google.com/
- Address API design comments in v2.
- Let memfd_create() to set X bit at creation time.
- A new pid namespace sysctl: vm.memfd_noexec to control behavior of X bit.
- A new security hook in memfd_create().
v2:https://lore.kernel.org/lkml/20220805222126.142525-1-jeffxu@google.com/
- address comments in V1.
- add sysctl (vm.mfd_noexec) to set the default file permissions of
memfd_create to be non-executable.
v1:https://lwn.net/Articles/890096/
[1] https://crbug.com/1305267
[2] https://bugs.chromium.org/p/chromium/issues/list?q=type%3Dbug-security%20me…
[3] https://lwn.net/Articles/781013/
Daniel Verkamp (2):
mm/memfd: add F_SEAL_EXEC
selftests/memfd: add tests for F_SEAL_EXEC
Jeff Xu (3):
mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC
mm/memfd: Add write seals when apply SEAL_EXEC to executable memfd
selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC
include/linux/pid_namespace.h | 19 ++
include/uapi/linux/fcntl.h | 1 +
include/uapi/linux/memfd.h | 4 +
kernel/pid_namespace.c | 5 +
kernel/pid_sysctl.h | 59 ++++
mm/memfd.c | 56 +++-
mm/shmem.c | 6 +
tools/testing/selftests/memfd/fuse_test.c | 1 +
tools/testing/selftests/memfd/memfd_test.c | 341 ++++++++++++++++++++-
9 files changed, 489 insertions(+), 3 deletions(-)
create mode 100644 kernel/pid_sysctl.h
base-commit: eb7081409f94a9a8608593d0fb63a1aa3d6f95d8
--
2.39.0.rc1.256.g54fd8350bd-goog
From: sunliming <sunliming(a)kylinos.cn>
[ Upstream commit ba470eebc2f6c2f704872955a715b9555328e7d0 ]
User processes register name_args for events. If the same name but different
args event are registered. The trace outputs of second event are printed
as the first event. This is incorrect.
Return EADDRINUSE back to the user process if the same name but different args
event has being registered.
Link: https://lore.kernel.org/linux-trace-kernel/20230529032100.286534-1-sunlimin…
Signed-off-by: sunliming <sunliming(a)kylinos.cn>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Acked-by: Beau Belgrave <beaub(a)linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/trace/trace_events_user.c | 36 +++++++++++++++----
.../selftests/user_events/ftrace_test.c | 6 ++++
2 files changed, 36 insertions(+), 6 deletions(-)
diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index 625cab4b9d945..774d146c2c2ca 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -1274,6 +1274,8 @@ static int user_event_parse(struct user_event_group *group, char *name,
int index;
u32 key;
struct user_event *user;
+ int argc = 0;
+ char **argv;
/* Prevent dyn_event from racing */
mutex_lock(&event_mutex);
@@ -1281,13 +1283,35 @@ static int user_event_parse(struct user_event_group *group, char *name,
mutex_unlock(&event_mutex);
if (user) {
- *newuser = user;
- /*
- * Name is allocated by caller, free it since it already exists.
- * Caller only worries about failure cases for freeing.
- */
- kfree(name);
+ if (args) {
+ argv = argv_split(GFP_KERNEL, args, &argc);
+ if (!argv) {
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ ret = user_fields_match(user, argc, (const char **)argv);
+ argv_free(argv);
+
+ } else
+ ret = list_empty(&user->fields);
+
+ if (ret) {
+ *newuser = user;
+ /*
+ * Name is allocated by caller, free it since it already exists.
+ * Caller only worries about failure cases for freeing.
+ */
+ kfree(name);
+ } else {
+ ret = -EADDRINUSE;
+ goto error;
+ }
+
return 0;
+error:
+ refcount_dec(&user->refcnt);
+ return ret;
}
index = find_first_zero_bit(group->page_bitmap, MAX_EVENTS);
diff --git a/tools/testing/selftests/user_events/ftrace_test.c b/tools/testing/selftests/user_events/ftrace_test.c
index 1bc26e6476fc3..df0e776c2cc1b 100644
--- a/tools/testing/selftests/user_events/ftrace_test.c
+++ b/tools/testing/selftests/user_events/ftrace_test.c
@@ -209,6 +209,12 @@ TEST_F(user, register_events) {
ASSERT_EQ(0, reg.write_index);
ASSERT_NE(0, reg.status_bit);
+ /* Multiple registers to same name but different args should fail */
+ reg.enable_bit = 29;
+ reg.name_args = (__u64)"__test_event u32 field1;";
+ ASSERT_EQ(-1, ioctl(self->data_fd, DIAG_IOCSREG, ®));
+ ASSERT_EQ(EADDRINUSE, errno);
+
/* Ensure disabled */
self->enable_fd = open(enable_file, O_RDWR);
ASSERT_NE(-1, self->enable_fd);
--
2.39.2
From: sunliming <sunliming(a)kylinos.cn>
[ Upstream commit ba470eebc2f6c2f704872955a715b9555328e7d0 ]
User processes register name_args for events. If the same name but different
args event are registered. The trace outputs of second event are printed
as the first event. This is incorrect.
Return EADDRINUSE back to the user process if the same name but different args
event has being registered.
Link: https://lore.kernel.org/linux-trace-kernel/20230529032100.286534-1-sunlimin…
Signed-off-by: sunliming <sunliming(a)kylinos.cn>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Acked-by: Beau Belgrave <beaub(a)linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/trace/trace_events_user.c | 36 +++++++++++++++----
.../selftests/user_events/ftrace_test.c | 6 ++++
2 files changed, 36 insertions(+), 6 deletions(-)
diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index 625cab4b9d945..774d146c2c2ca 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -1274,6 +1274,8 @@ static int user_event_parse(struct user_event_group *group, char *name,
int index;
u32 key;
struct user_event *user;
+ int argc = 0;
+ char **argv;
/* Prevent dyn_event from racing */
mutex_lock(&event_mutex);
@@ -1281,13 +1283,35 @@ static int user_event_parse(struct user_event_group *group, char *name,
mutex_unlock(&event_mutex);
if (user) {
- *newuser = user;
- /*
- * Name is allocated by caller, free it since it already exists.
- * Caller only worries about failure cases for freeing.
- */
- kfree(name);
+ if (args) {
+ argv = argv_split(GFP_KERNEL, args, &argc);
+ if (!argv) {
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ ret = user_fields_match(user, argc, (const char **)argv);
+ argv_free(argv);
+
+ } else
+ ret = list_empty(&user->fields);
+
+ if (ret) {
+ *newuser = user;
+ /*
+ * Name is allocated by caller, free it since it already exists.
+ * Caller only worries about failure cases for freeing.
+ */
+ kfree(name);
+ } else {
+ ret = -EADDRINUSE;
+ goto error;
+ }
+
return 0;
+error:
+ refcount_dec(&user->refcnt);
+ return ret;
}
index = find_first_zero_bit(group->page_bitmap, MAX_EVENTS);
diff --git a/tools/testing/selftests/user_events/ftrace_test.c b/tools/testing/selftests/user_events/ftrace_test.c
index 1bc26e6476fc3..df0e776c2cc1b 100644
--- a/tools/testing/selftests/user_events/ftrace_test.c
+++ b/tools/testing/selftests/user_events/ftrace_test.c
@@ -209,6 +209,12 @@ TEST_F(user, register_events) {
ASSERT_EQ(0, reg.write_index);
ASSERT_NE(0, reg.status_bit);
+ /* Multiple registers to same name but different args should fail */
+ reg.enable_bit = 29;
+ reg.name_args = (__u64)"__test_event u32 field1;";
+ ASSERT_EQ(-1, ioctl(self->data_fd, DIAG_IOCSREG, ®));
+ ASSERT_EQ(EADDRINUSE, errno);
+
/* Ensure disabled */
self->enable_fd = open(enable_file, O_RDWR);
ASSERT_NE(-1, self->enable_fd);
--
2.39.2
=== Context ===
In the context of a middlebox, fragmented packets are tricky to handle.
The full 5-tuple of a packet is often only available in the first
fragment which makes enforcing consistent policy difficult. There are
really only two stateless options, neither of which are very nice:
1. Enforce policy on first fragment and accept all subsequent fragments.
This works but may let in certain attacks or allow data exfiltration.
2. Enforce policy on first fragment and drop all subsequent fragments.
This does not really work b/c some protocols may rely on
fragmentation. For example, DNS may rely on oversized UDP packets for
large responses.
So stateful tracking is the only sane option. RFC 8900 [0] calls this
out as well in section 6.3:
Middleboxes [...] should process IP fragments in a manner that is
consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
must maintain state in order to achieve this goal.
=== BPF related bits ===
Policy has traditionally been enforced from XDP/TC hooks. Both hooks
run before kernel reassembly facilities. However, with the new
BPF_PROG_TYPE_NETFILTER, we can rather easily hook into existing
netfilter reassembly infra.
The basic idea is we bump a refcnt on the netfilter defrag module and
then run the bpf prog after the defrag module runs. This allows bpf
progs to transparently see full, reassembled packets. The nice thing
about this is that progs don't have to carry around logic to detect
fragments.
=== Patchset details ===
There was an earlier attempt at providing defrag via kfuncs [1]. The
feedback was that we could end up doing too much stuff in prog execution
context (like sending ICMP error replies). However, I think there are
still some outstanding discussion w.r.t. performance when it comes to
netfilter vs the previous approach. I'll schedule some time during
office hours for this.
Patches 1 & 2 are stolenfrom Florian. Hopefully he doesn't mind. There
were some outstanding comments on the v2 [2] but it doesn't look like a
v3 was ever submitted. I've addressed the comments and put them in this
patchset cuz I needed them.
Finally, the new selftest seems to be a little flaky. I'm not quite
sure why the server will fail to `recvfrom()` occassionaly. I'm fairly
sure it's a timing related issue with creating veths. I'll keep
debugging but I didn't want that to hold up discussion on this patchset.
[0]: https://datatracker.ietf.org/doc/html/rfc8900
[1]: https://lore.kernel.org/bpf/cover.1677526810.git.dxu@dxuuu.xyz/
[2]: https://lore.kernel.org/bpf/20230525110100.8212-1-fw@strlen.de/
Daniel Xu (7):
tools: libbpf: add netfilter link attach helper
selftests/bpf: Add bpf_program__attach_netfilter helper test
netfilter: defrag: Add glue hooks for enabling/disabling defrag
netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
bpf: selftests: Support not connecting client socket
bpf: selftests: Support custom type and proto for client sockets
bpf: selftests: Add defrag selftests
include/linux/netfilter.h | 12 +
include/uapi/linux/bpf.h | 5 +
net/ipv4/netfilter/nf_defrag_ipv4.c | 8 +
net/ipv6/netfilter/nf_defrag_ipv6_hooks.c | 10 +
net/netfilter/core.c | 6 +
net/netfilter/nf_bpf_link.c | 108 ++++++-
tools/include/uapi/linux/bpf.h | 5 +
tools/lib/bpf/bpf.c | 8 +
tools/lib/bpf/bpf.h | 6 +
tools/lib/bpf/libbpf.c | 47 +++
tools/lib/bpf/libbpf.h | 15 +
tools/lib/bpf/libbpf.map | 1 +
tools/testing/selftests/bpf/Makefile | 4 +-
.../selftests/bpf/generate_udp_fragments.py | 90 ++++++
.../selftests/bpf/ip_check_defrag_frags.h | 57 ++++
tools/testing/selftests/bpf/network_helpers.c | 26 +-
tools/testing/selftests/bpf/network_helpers.h | 3 +
.../bpf/prog_tests/ip_check_defrag.c | 282 ++++++++++++++++++
.../bpf/prog_tests/netfilter_basic.c | 78 +++++
.../selftests/bpf/progs/ip_check_defrag.c | 104 +++++++
.../bpf/progs/test_netfilter_link_attach.c | 14 +
21 files changed, 868 insertions(+), 21 deletions(-)
create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/netfilter_basic.c
create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c
create mode 100644 tools/testing/selftests/bpf/progs/test_netfilter_link_attach.c
--
2.40.1
Dzień dobry,
zapoznałem się z Państwa ofertą i z przyjemnością przyznaję, że przyciąga uwagę i zachęca do dalszych rozmów.
Pomyślałem, że może mógłbym mieć swój wkład w Państwa rozwój i pomóc dotrzeć z tą ofertą do większego grona odbiorców. Pozycjonuję strony www, dzięki czemu generują świetny ruch w sieci.
Możemy porozmawiać w najbliższym czasie?
Pozdrawiam
Adam Charachuta
Nested translation is a hardware feature that is supported by many modern
IOMMU hardwares. It has two stages (stage-1, stage-2) address translation
to get access to the physical address. stage-1 translation table is owned
by userspace (e.g. by a guest OS), while stage-2 is owned by kernel. Changes
to stage-1 translation table should be followed by an IOTLB invalidation.
Take Intel VT-d as an example, the stage-1 translation table is I/O page
table. As the below diagram shows, guest I/O page table pointer in GPA
(guest physical address) is passed to host and be used to perform the stage-1
address translation. Along with it, modifications to present mappings in the
guest I/O page table should be followed with an IOTLB invalidation.
.-------------. .---------------------------.
| vIOMMU | | Guest I/O page table |
| | '---------------------------'
.----------------/
| PASID Entry |--- PASID cache flush --+
'-------------' |
| | V
| | I/O page table pointer in GPA
'-------------'
Guest
------| Shadow |--------------------------|--------
v v v
Host
.-------------. .------------------------.
| pIOMMU | | FS for GIOVA->GPA |
| | '------------------------'
.----------------/ |
| PASID Entry | V (Nested xlate)
'----------------\.----------------------------------.
| | | SS for GPA->HPA, unmanaged domain|
| | '----------------------------------'
'-------------'
Where:
- FS = First stage page tables
- SS = Second stage page tables
<Intel VT-d Nested translation>
In IOMMUFD, all the translation tables are tracked by hw_pagetable (hwpt)
and each has an iommu_domain allocated from iommu driver. So in this series
hw_pagetable and iommu_domain means the same thing if no special note.
IOMMUFD has already supported allocating hw_pagetable that is linked with
an IOAS. However, nesting requires IOMMUFD to allow allocating hw_pagetable
with driver specific parameters and interface to sync stage-1 IOTLB as user
owns the stage-1 translation table.
This series is based on the iommu hw info reporting series [1]. It first
introduces new iommu op for allocating domains with user data and the op
for syncing stage-1 IOTLB, and then extend the IOMMUFD internal infrastructure
to accept user_data and parent hwpt, then relay the data to iommu core to
allocate iommu_domain. After it, extend the ioctl IOMMU_HWPT_ALLOC to accept
user data and stage-2 hwpt ID to allocate hwpt. Along with it, ioctl
IOMMU_HWPT_INVALIDATE is added to invalidate stage-1 IOTLB. This is needed
for user-managed hwpts. Selftest is added as well to cover the new ioctls.
Complete code can be found in [2], QEMU could can be found in [3].
At last, this is a team work together with Nicolin Chen, Lu Baolu. Thanks
them for the help. ^_^. Look forward to your feedbacks.
base-commit: cf905391237ded2331388e75adb5afbabeddc852
[1] https://lore.kernel.org/linux-iommu/20230511143024.19542-1-yi.l.liu@intel.c…
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[3] https://github.com/yiliu1765/qemu/tree/wip/iommufd_rfcv4.mig.reset.v4_var3%…
Change log:
v2:
- Add union iommu_domain_user_data to include all user data structures to avoid
passing void * in kernel APIs.
- Add iommu op to return user data length for user domain allocation
- Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
- Store the invalidation data length in iommu_domain_ops::cache_invalidate_user_data_len
- Convert cache_invalidate_user op to be int instead of void
- Remove @data_type in struct iommu_hwpt_invalidate
- Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08 of v1
v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.…
Thanks,
Yi Liu
Lu Baolu (2):
iommu: Add new iommu op to create domains owned by userspace
iommu: Add nested domain support
Nicolin Chen (5):
iommufd/hw_pagetable: Do not populate user-managed hw_pagetables
iommufd/selftest: Add domain_alloc_user() support in iommu mock
iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with user data
iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
Yi Liu (4):
iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation
iommufd: Pass parent hwpt and user_data to
iommufd_hw_pagetable_alloc()
iommufd: IOMMU_HWPT_ALLOC allocation with user data
iommufd: Add IOMMU_HWPT_INVALIDATE
drivers/iommu/iommufd/device.c | 2 +-
drivers/iommu/iommufd/hw_pagetable.c | 191 +++++++++++++++++-
drivers/iommu/iommufd/iommufd_private.h | 16 +-
drivers/iommu/iommufd/iommufd_test.h | 30 +++
drivers/iommu/iommufd/main.c | 5 +-
drivers/iommu/iommufd/selftest.c | 119 ++++++++++-
include/linux/iommu.h | 36 ++++
include/uapi/linux/iommufd.h | 58 +++++-
tools/testing/selftests/iommu/iommufd.c | 126 +++++++++++-
tools/testing/selftests/iommu/iommufd_utils.h | 70 +++++++
10 files changed, 629 insertions(+), 24 deletions(-)
--
2.34.1
Make sv39 the default address space for mmap as some applications
currently depend on this assumption. The RISC-V specification enforces
that bits outside of the virtual address range are not used, so
restricting the size of the default address space as such should be
temporary. A hint address passed to mmap will cause the largest address
space that fits entirely into the hint to be used. If the hint is less
than or equal to 1<<38, a 39-bit address will be used. After an address
space is completely full, the next smallest address space will be used.
Documentation is also added to the RISC-V virtual memory section to explain
these changes.
Charlie Jenkins (2):
RISC-V: mm: Restrict address space for sv39,sv48,sv57
RISC-V: mm: Update documentation and include test
Documentation/riscv/vm-layout.rst | 20 ++++++++
arch/riscv/include/asm/elf.h | 2 +-
arch/riscv/include/asm/pgtable.h | 21 ++++++--
arch/riscv/include/asm/processor.h | 41 +++++++++++++---
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/mm/Makefile | 22 +++++++++
.../selftests/riscv/mm/testcases/mmap.c | 49 +++++++++++++++++++
7 files changed, 144 insertions(+), 13 deletions(-)
create mode 100644 tools/testing/selftests/riscv/mm/Makefile
create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap.c
base-commit: eef509789cecdce895020682192d32e8bac790e8
--
2.34.1
Hi folks,
This series implements the functionality of delivering IO page faults to
user space through the IOMMUFD framework. The use case is nested
translation, where modern IOMMU hardware supports two-stage translation
tables. The second-stage translation table is managed by the host VMM
while the first-stage translation table is owned by the user space.
Hence, any IO page fault that occurs on the first-stage page table
should be delivered to the user space and handled there. The user space
should respond the page fault handling result to the device top-down
through the IOMMUFD response uAPI.
User space indicates its capablity of handling IO page faults by setting
a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD
will then setup its infrastructure for page fault delivery. Together
with the iopf-capable flag, user space should also provide an eventfd
where it will listen on any down-top page fault messages.
On a successful return of the allocation of iopf-capable HWPT, a fault
fd will be returned. User space can open and read fault messages from it
once the eventfd is signaled.
Besides the overall design, I'd like to hear comments about below
designs:
- The IOMMUFD fault message format. It is very similar to that in
uapi/linux/iommu which has been discussed before and partially used by
the IOMMU SVA implementation. I'd like to get more comments on the
format when it comes to IOMMUFD.
- The timeout value for the pending page fault messages. Ideally we
should determine the timeout value from the device configuration, but
I failed to find any statement in the PCI specification (version 6.x).
A default 100 milliseconds is selected in the implementation, but it
leave the room for grow the code for per-device setting.
This series is only for review comment purpose. I used IOMMUFD selftest
to verify the hwpt allocation, attach/detach and replace. But I didn't
get a chance to run it with real hardware yet. I will do more test in
the subsequent versions when I am confident that I am heading on the
right way.
This series is based on the latest implementation of the nested
translation under discussion. The whole series and related patches are
available on gitbub:
https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-…
Best regards,
baolu
Lu Baolu (17):
iommu: Move iommu fault data to linux/iommu.h
iommu: Support asynchronous I/O page fault response
iommu: Add helper to set iopf handler for domain
iommu: Pass device parameter to iopf handler
iommu: Split IO page fault handling from SVA
iommu: Add iommu page fault cookie helpers
iommufd: Add iommu page fault data
iommufd: IO page fault delivery initialization and release
iommufd: Add iommufd hwpt iopf handler
iommufd: Add IOMMU_HWPT_ALLOC_FLAGS_USER_PASID_TABLE for hwpt_alloc
iommufd: Deliver fault messages to user space
iommufd: Add io page fault response support
iommufd: Add a timer for each iommufd fault data
iommufd: Drain all pending faults when destroying hwpt
iommufd: Allow new hwpt_alloc flags
iommufd/selftest: Add IOPF feature for mock devices
iommufd/selftest: Cover iopf-capable nested hwpt
include/linux/iommu.h | 175 +++++++++-
drivers/iommu/{iommu-sva.h => io-pgfault.h} | 25 +-
drivers/iommu/iommu-priv.h | 3 +
drivers/iommu/iommufd/iommufd_private.h | 32 ++
include/uapi/linux/iommu.h | 161 ---------
include/uapi/linux/iommufd.h | 73 +++-
tools/testing/selftests/iommu/iommufd_utils.h | 20 +-
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 2 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +-
drivers/iommu/intel/iommu.c | 2 +-
drivers/iommu/intel/svm.c | 2 +-
drivers/iommu/io-pgfault.c | 7 +-
drivers/iommu/iommu-sva.c | 4 +-
drivers/iommu/iommu.c | 50 ++-
drivers/iommu/iommufd/device.c | 64 +++-
drivers/iommu/iommufd/hw_pagetable.c | 318 +++++++++++++++++-
drivers/iommu/iommufd/main.c | 3 +
drivers/iommu/iommufd/selftest.c | 71 ++++
tools/testing/selftests/iommu/iommufd.c | 17 +-
MAINTAINERS | 1 -
drivers/iommu/Kconfig | 4 +
drivers/iommu/Makefile | 3 +-
drivers/iommu/intel/Kconfig | 1 +
23 files changed, 837 insertions(+), 203 deletions(-)
rename drivers/iommu/{iommu-sva.h => io-pgfault.h} (71%)
delete mode 100644 include/uapi/linux/iommu.h
--
2.34.1
When we collect a signal context with one of the SME modes enabled we will
have enabled that mode behind the compiler and libc's back so they may
issue some instructions not valid in streaming mode, causing spurious
failures.
For the code prior to issuing the BRK to trigger signal handling we need to
stay in streaming mode if we were already there since that's a part of the
signal context the caller is trying to collect. Unfortunately this code
includes a memset() which is likely to be heavily optimised and is likely
to use FP instructions incompatible with streaming mode. We can avoid this
happening by open coding the memset(), inserting a volatile assembly
statement to avoid the compiler recognising what's being done and doing
something in optimisation. This code is not performance critical so the
inefficiency should not be an issue.
After collecting the context we can simply exit streaming mode, avoiding
these issues. Use a full SMSTOP for safety to prevent any issues appearing
with ZA.
Reported-by: Will Deacon <will(a)kernel.org>
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
.../selftests/arm64/signal/test_signals_utils.h | 28 +++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/arm64/signal/test_signals_utils.h b/tools/testing/selftests/arm64/signal/test_signals_utils.h
index 222093f51b67..db28409fd44b 100644
--- a/tools/testing/selftests/arm64/signal/test_signals_utils.h
+++ b/tools/testing/selftests/arm64/signal/test_signals_utils.h
@@ -60,13 +60,28 @@ static __always_inline bool get_current_context(struct tdescr *td,
size_t dest_sz)
{
static volatile bool seen_already;
+ int i;
+ char *uc = (char *)dest_uc;
assert(td && dest_uc);
/* it's a genuine invocation..reinit */
seen_already = 0;
td->live_uc_valid = 0;
td->live_sz = dest_sz;
- memset(dest_uc, 0x00, td->live_sz);
+
+ /*
+ * This is a memset() but we don't want the compiler to
+ * optimise it into either instructions or a library call
+ * which might be incompatible with streaming mode.
+ */
+ for (i = 0; i < td->live_sz; i++) {
+ asm volatile("nop"
+ : "+m" (*dest_uc)
+ :
+ : "memory");
+ uc[i] = 0;
+ }
+
td->live_uc = dest_uc;
/*
* Grab ucontext_t triggering a SIGTRAP.
@@ -103,6 +118,17 @@ static __always_inline bool get_current_context(struct tdescr *td,
:
: "memory");
+ /*
+ * If we were grabbing a streaming mode context then we may
+ * have entered streaming mode behind the system's back and
+ * libc or compiler generated code might decide to do
+ * something invalid in streaming mode, or potentially even
+ * the state of ZA. Issue a SMSTOP to exit both now we have
+ * grabbed the state.
+ */
+ if (td->feats_supported & FEAT_SME)
+ asm volatile("msr S0_3_C4_C6_3, xzr");
+
/*
* If we get here with seen_already==1 it implies the td->live_uc
* context has been used to get back here....this probably means
---
base-commit: 6995e2de6891c724bfeb2db33d7b87775f913ad1
change-id: 20230628-arm64-signal-memcpy-fix-7de3b3c8fa10
Best regards,
--
Mark Brown <broonie(a)kernel.org>