From: Xiaoyan Li <lixiaoyan(a)google.com>
When compound pages are enabled, although the mm layer still
returns an array of page pointers, a subset (or all) of them
may have the same page head since a max 180kb skb can span 2
hugepages if it is on the boundary, be a mix of pages and 1 hugepage,
or fit completely in a hugepage. Instead of referencing page head
on all page pointers, use page length arithmetic to only call page
head when referencing a known different page head to avoid touching
a cold cacheline.
Tested:
See next patch with changes to tcp_mmap
Correntess:
On a pair of separate hosts as send with MSG_ZEROCOPY will
force a copy on tx if using loopback alone, check that the SHA
on the message sent is equivalent to checksum on the message received,
since the current program already checks for the length.
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
./tcp_mmap -s -z
./tcp_mmap -H $DADDR -z
SHA256 is correct
received 2 MB (100 % mmap'ed) in 0.005914 s, 2.83686 Gbit
cpu usage user:0.001984 sys:0.000963, 1473.5 usec per MB, 10 c-switches
Performance:
Run neper between adjacent hosts with the same config
tcp_stream -Z --skip-rx-copy -6 -T 20 -F 1000 --stime-use-proc --test-length=30
Before patch: stime_end=37.670000
After patch: stime_end=30.310000
Signed-off-by: Coco Li <lixiaoyan(a)google.com>
---
net/core/datagram.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/net/core/datagram.c b/net/core/datagram.c
index e4ff2db40c98..5662dff3d381 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -622,12 +622,12 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
frag = skb_shinfo(skb)->nr_frags;
while (length && iov_iter_count(from)) {
+ struct page *head, *last_head = NULL;
struct page *pages[MAX_SKB_FRAGS];
- struct page *last_head = NULL;
+ int refs, order, n = 0;
size_t start;
ssize_t copied;
unsigned long truesize;
- int refs, n = 0;
if (frag == MAX_SKB_FRAGS)
return -EMSGSIZE;
@@ -650,9 +650,17 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
} else {
refcount_add(truesize, &skb->sk->sk_wmem_alloc);
}
+
+ head = compound_head(pages[n]);
+ order = compound_order(head);
+
for (refs = 0; copied != 0; start = 0) {
int size = min_t(int, copied, PAGE_SIZE - start);
- struct page *head = compound_head(pages[n]);
+
+ if (pages[n] - head > (1UL << order) - 1) {
+ head = compound_head(pages[n]);
+ order = compound_order(head);
+ }
start += (pages[n] - head) << PAGE_SHIFT;
copied -= size;
--
2.40.0.rc1.284.g88254d51c5-goog
Currently, anonymous PTE-mapped THPs cannot be collapsed in-place:
collapsing (e.g., via MADV_COLLAPSE) implies allocating a fresh THP and
mapping that new THP via a PMD: as it's a fresh anon THP, it will get the
exclusive flag set on the head page and everybody is happy.
However, if the kernel would ever support in-place collapse of anonymous
THPs (replacing a page table mapping each sub-page of a THP via PTEs with a
single PMD mapping the complete THP), exclusivity information stored for
each sub-page would have to be collapsed accordingly:
(1) All PTEs map !exclusive anon sub-pages: the in-place collapsed THP
must not not have the exclusive flag set on the head page mapped by
the PMD. This is the easiest case to handle ("simply don't set any
exclusive flags").
(2) All PTEs map exclusive anon sub-pages: when collapsing, we have to
clear the exclusive flag from all tail pages and only leave the
exclusive flag set for the head page. Otherwise, fork() after
collapse would not clear the exclusive flags from the tail pages
and we'd be in trouble once PTE-mapping the shared THP when writing
to shared tail pages that still have the exclusive flag set. This
would effectively revert what the PTE-mapping code does when
propagating the exclusive flag to all sub-pages.
(3) PTEs map a mixture of exclusive and !exclusive anon sub-pages (can
happen e.g., due to MADV_DONTFORK before fork()). We must not
collapse the THP in-place, otherwise bad things may happen:
the exclusive flags of sub-pages would get ignored and the
exclusive flag of the head page would get used instead.
Now that we have MADV_COLLAPSE in place to trigger collapsing a THP,
let's add some test cases that would bail out early, if we'd
voluntarily/accidantially unlock in-place collapse for anon THPs and
forget about taking proper care of exclusive flags.
Running the test on a kernel with MADV_COLLAPSE support:
# [INFO] Anonymous THP tests
# [RUN] Basic COW after fork() when collapsing before fork()
ok 169 No leak from parent into child
# [RUN] Basic COW after fork() when collapsing after fork() (fully shared)
ok 170 # SKIP MADV_COLLAPSE failed: Invalid argument
# [RUN] Basic COW after fork() when collapsing after fork() (lower shared)
ok 171 No leak from parent into child
# [RUN] Basic COW after fork() when collapsing after fork() (upper shared)
ok 172 No leak from parent into child
For now, MADV_COLLAPSE always seems to fail if all PTEs map shared
sub-pages.
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: Zach O'Keefe <zokeefe(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Signed-off-by: David Hildenbrand <david(a)redhat.com>
---
A patch from Hugh made me explore the wonderful world of in-place collapse
of THP, and I was briefly concerned that it would apply to anon THP as
well. After thinking about it a bit, I decided to add test cases, to better
be safe than sorry in any case, and to document how PG_anon_exclusive is to
be handled in that case.
---
tools/testing/selftests/vm/cow.c | 228 +++++++++++++++++++++++++++++++
1 file changed, 228 insertions(+)
diff --git a/tools/testing/selftests/vm/cow.c b/tools/testing/selftests/vm/cow.c
index 26f6ea3079e2..16216d893d96 100644
--- a/tools/testing/selftests/vm/cow.c
+++ b/tools/testing/selftests/vm/cow.c
@@ -30,6 +30,10 @@
#include "../kselftest.h"
#include "vm_util.h"
+#ifndef MADV_COLLAPSE
+#define MADV_COLLAPSE 25
+#endif
+
static size_t pagesize;
static int pagemap_fd;
static size_t thpsize;
@@ -1178,6 +1182,228 @@ static int tests_per_anon_test_case(void)
return tests;
}
+enum anon_thp_collapse_test {
+ ANON_THP_COLLAPSE_UNSHARED,
+ ANON_THP_COLLAPSE_FULLY_SHARED,
+ ANON_THP_COLLAPSE_LOWER_SHARED,
+ ANON_THP_COLLAPSE_UPPER_SHARED,
+};
+
+static void do_test_anon_thp_collapse(char *mem, size_t size,
+ enum anon_thp_collapse_test test)
+{
+ struct comm_pipes comm_pipes;
+ char buf;
+ int ret;
+
+ ret = setup_comm_pipes(&comm_pipes);
+ if (ret) {
+ ksft_test_result_fail("pipe() failed\n");
+ return;
+ }
+
+ /*
+ * Trigger PTE-mapping the THP by temporarily mapping a single subpage
+ * R/O, such that we can try collapsing it later.
+ */
+ ret = mprotect(mem + pagesize, pagesize, PROT_READ);
+ if (ret) {
+ ksft_test_result_fail("mprotect() failed\n");
+ goto close_comm_pipes;
+ }
+ ret = mprotect(mem + pagesize, pagesize, PROT_READ | PROT_WRITE);
+ if (ret) {
+ ksft_test_result_fail("mprotect() failed\n");
+ goto close_comm_pipes;
+ }
+
+ switch (test) {
+ case ANON_THP_COLLAPSE_UNSHARED:
+ /* Collapse before actually COW-sharing the page. */
+ ret = madvise(mem, size, MADV_COLLAPSE);
+ if (ret) {
+ ksft_test_result_skip("MADV_COLLAPSE failed: %s\n",
+ strerror(errno));
+ goto close_comm_pipes;
+ }
+ break;
+ case ANON_THP_COLLAPSE_FULLY_SHARED:
+ /* COW-share the full PTE-mapped THP. */
+ break;
+ case ANON_THP_COLLAPSE_LOWER_SHARED:
+ /* Don't COW-share the upper part of the THP. */
+ ret = madvise(mem + size / 2, size / 2, MADV_DONTFORK);
+ if (ret) {
+ ksft_test_result_fail("MADV_DONTFORK failed\n");
+ goto close_comm_pipes;
+ }
+ break;
+ case ANON_THP_COLLAPSE_UPPER_SHARED:
+ /* Don't COW-share the lower part of the THP. */
+ ret = madvise(mem, size / 2, MADV_DONTFORK);
+ if (ret) {
+ ksft_test_result_fail("MADV_DONTFORK failed\n");
+ goto close_comm_pipes;
+ }
+ break;
+ default:
+ assert(false);
+ }
+
+ ret = fork();
+ if (ret < 0) {
+ ksft_test_result_fail("fork() failed\n");
+ goto close_comm_pipes;
+ } else if (!ret) {
+ switch (test) {
+ case ANON_THP_COLLAPSE_UNSHARED:
+ case ANON_THP_COLLAPSE_FULLY_SHARED:
+ exit(child_memcmp_fn(mem, size, &comm_pipes));
+ break;
+ case ANON_THP_COLLAPSE_LOWER_SHARED:
+ exit(child_memcmp_fn(mem, size / 2, &comm_pipes));
+ break;
+ case ANON_THP_COLLAPSE_UPPER_SHARED:
+ exit(child_memcmp_fn(mem + size / 2, size / 2,
+ &comm_pipes));
+ break;
+ default:
+ assert(false);
+ }
+ }
+
+ while (read(comm_pipes.child_ready[0], &buf, 1) != 1)
+ ;
+
+ switch (test) {
+ case ANON_THP_COLLAPSE_UNSHARED:
+ break;
+ case ANON_THP_COLLAPSE_UPPER_SHARED:
+ case ANON_THP_COLLAPSE_LOWER_SHARED:
+ /*
+ * Revert MADV_DONTFORK such that we merge the VMAs and are
+ * able to actually collapse.
+ */
+ ret = madvise(mem, size, MADV_DOFORK);
+ if (ret) {
+ ksft_test_result_fail("MADV_DOFORK failed\n");
+ write(comm_pipes.parent_ready[1], "0", 1);
+ wait(&ret);
+ goto close_comm_pipes;
+ }
+ /* FALLTHROUGH */
+ case ANON_THP_COLLAPSE_FULLY_SHARED:
+ /* Collapse before anyone modified the COW-shared page. */
+ ret = madvise(mem, size, MADV_COLLAPSE);
+ if (ret) {
+ ksft_test_result_skip("MADV_COLLAPSE failed: %s\n",
+ strerror(errno));
+ write(comm_pipes.parent_ready[1], "0", 1);
+ wait(&ret);
+ goto close_comm_pipes;
+ }
+ break;
+ default:
+ assert(false);
+ }
+
+ /* Modify the page. */
+ memset(mem, 0xff, size);
+ write(comm_pipes.parent_ready[1], "0", 1);
+
+ wait(&ret);
+ if (WIFEXITED(ret))
+ ret = WEXITSTATUS(ret);
+ else
+ ret = -EINVAL;
+
+ ksft_test_result(!ret, "No leak from parent into child\n");
+close_comm_pipes:
+ close_comm_pipes(&comm_pipes);
+}
+
+static void test_anon_thp_collapse_unshared(char *mem, size_t size)
+{
+ do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_UNSHARED);
+}
+
+static void test_anon_thp_collapse_fully_shared(char *mem, size_t size)
+{
+ do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_FULLY_SHARED);
+}
+
+static void test_anon_thp_collapse_lower_shared(char *mem, size_t size)
+{
+ do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_LOWER_SHARED);
+}
+
+static void test_anon_thp_collapse_upper_shared(char *mem, size_t size)
+{
+ do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_UPPER_SHARED);
+}
+
+/*
+ * Test cases that are specific to anonymous THP: pages in private mappings
+ * that may get shared via COW during fork().
+ */
+static const struct test_case anon_thp_test_cases[] = {
+ /*
+ * Basic COW test for fork() without any GUP when collapsing a THP
+ * before fork().
+ *
+ * Re-mapping a PTE-mapped anon THP using a single PMD ("in-place
+ * collapse") might easily get COW handling wrong when not collapsing
+ * exclusivity information properly.
+ */
+ {
+ "Basic COW after fork() when collapsing before fork()",
+ test_anon_thp_collapse_unshared,
+ },
+ /* Basic COW test, but collapse after COW-sharing a full THP. */
+ {
+ "Basic COW after fork() when collapsing after fork() (fully shared)",
+ test_anon_thp_collapse_fully_shared,
+ },
+ /*
+ * Basic COW test, but collapse after COW-sharing the lower half of a
+ * THP.
+ */
+ {
+ "Basic COW after fork() when collapsing after fork() (lower shared)",
+ test_anon_thp_collapse_lower_shared,
+ },
+ /*
+ * Basic COW test, but collapse after COW-sharing the upper half of a
+ * THP.
+ */
+ {
+ "Basic COW after fork() when collapsing after fork() (upper shared)",
+ test_anon_thp_collapse_upper_shared,
+ },
+};
+
+static void run_anon_thp_test_cases(void)
+{
+ int i;
+
+ if (!thpsize)
+ return;
+
+ ksft_print_msg("[INFO] Anonymous THP tests\n");
+
+ for (i = 0; i < ARRAY_SIZE(anon_thp_test_cases); i++) {
+ struct test_case const *test_case = &anon_thp_test_cases[i];
+
+ ksft_print_msg("[RUN] %s\n", test_case->desc);
+ do_run_with_thp(test_case->fn, THP_RUN_PMD);
+ }
+}
+
+static int tests_per_anon_thp_test_case(void)
+{
+ return thpsize ? 1 : 0;
+}
+
typedef void (*non_anon_test_fn)(char *mem, const char *smem, size_t size);
static void test_cow(char *mem, const char *smem, size_t size)
@@ -1518,6 +1744,7 @@ int main(int argc, char **argv)
ksft_print_header();
ksft_set_plan(ARRAY_SIZE(anon_test_cases) * tests_per_anon_test_case() +
+ ARRAY_SIZE(anon_thp_test_cases) * tests_per_anon_thp_test_case() +
ARRAY_SIZE(non_anon_test_cases) * tests_per_non_anon_test_case());
gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
@@ -1526,6 +1753,7 @@ int main(int argc, char **argv)
ksft_exit_fail_msg("opening pagemap failed\n");
run_anon_test_cases();
+ run_anon_thp_test_cases();
run_non_anon_test_cases();
err = ksft_get_fail_cnt();
--
2.39.0
There's been a bunch of off-list discussions about this, including at
Plumbers. The original plan was to do something involving providing an
ISA string to userspace, but ISA strings just aren't sufficient for a
stable ABI any more: in order to parse an ISA string users need the
version of the specifications that the string is written to, the version
of each extension (sometimes at a finer granularity than the RISC-V
releases/versions encode), and the expected use case for the ISA string
(ie, is it a U-mode or M-mode string). That's a lot of complexity to
try and keep ABI compatible and it's probably going to continue to grow,
as even if there's no more complexity in the specifications we'll have
to deal with the various ISA string parsing oddities that end up all
over userspace.
Instead this patch set takes a very different approach and provides a set
of key/value pairs that encode various bits about the system. The big
advantage here is that we can clearly define what these mean so we can
ensure ABI stability, but it also allows us to encode information that's
unlikely to ever appear in an ISA string (see the misaligned access
performance, for example). The resulting interface looks a lot like
what arm64 and x86 do, and will hopefully fit well into something like
ACPI in the future.
The actual user interface is a syscall, with a vDSO function in front of
it. The vDSO function can answer some queries without a syscall at all,
and falls back to the syscall for cases it doesn't have answers to.
Currently we prepopulate it with an array of answers for all keys and
a CPU set of "all CPUs". This can be adjusted as necessary to provide
fast answers to the most common queries.
An example series in glibc exposing this syscall and using it in an
ifunc selector for memcpy can be found at [1]. I'm about to send a v2
of that series out that incorporates the vDSO function.
I was asked about the performance delta between this and something like
sysfs. I created a small test program [2] and ran it on a Nezha D1
Allwinner board. Doing each operation 100000 times and dividing, these
operations take the following amount of time:
- open()+read()+close() of /sys/kernel/cpu_byteorder: 3.8us
- access("/sys/kernel/cpu_byteorder", R_OK): 1.3us
- riscv_hwprobe() vDSO and syscall: .0094us
- riscv_hwprobe() vDSO with no syscall: 0.0091us
These numbers get farther apart if we query multiple keys, as sysfs will
scale linearly with the number of keys, where the dedicated syscall
stays the same. To frame these numbers, I also did a tight
fork/exec/wait loop, which I measured as 4.8ms. So doing 4
open/read/close operations is a delta of about 0.3%, versus a single vDSO
call is a delta of essentially zero.
[1] https://public-inbox.org/libc-alpha/20230206194819.1679472-1-evan@rivosinc.…
[2] https://pastebin.com/x84NEKaS
Changes in v4:
- Used real types in syscall prototypes (Arnd)
- Fixed static line break in do_riscv_hwprobe() (Conor)
- Added newlines between documentation lists (Conor)
- Crispen up size types to size_t, and cpu indices to int (Joe)
- Fix copy_from_user() return logic bug (found via kselftests!)
- Add __user to SYSCALL_DEFINE() to fix warning
- More newlines in BASE_BEHAVIOR_IMA documentation (Conor)
- Add newlines to CPUPERF_0 documentation (Conor)
- Add UNSUPPORTED value (Conor)
- Switched from DT to alternatives-based probing (Rob)
- Crispen up cpu index type to always be int (Conor)
- Fixed selftests commit description, no more tiny libc (Mark Brown)
- Fixed selftest syscall prototype types to match v4.
- Added a prototype to fix -Wmissing-prototype warning (lkp(a)intel.com)
- Fixed rv32 build failure (lkp(a)intel.com)
- Make vdso prototype match syscall types update
Changes in v3:
- Updated copyright date in cpufeature.h
- Fixed typo in cpufeature.h comment (Conor)
- Refactored functions so that kernel mode can query too, in
preparation for the vDSO data population.
- Changed the vendor/arch/imp IDs to return a value of -1 on mismatch
rather than failing the whole call.
- Const cpumask pointer in hwprobe_mid()
- Embellished documentation WRT cpu_set and the returned values.
- Renamed hwprobe_mid() to hwprobe_arch_id() (Conor)
- Fixed machine ID doc warnings, changed elements to c:macro:.
- Completed dangling unistd.h comment (Conor)
- Fixed line breaks and minor logic optimization (Conor).
- Use riscv_cached_mxxxid() (Conor)
- Refactored base ISA behavior probe to allow kernel probing as well,
in prep for vDSO data initialization.
- Fixed doc warnings in IMA text list, use :c:macro:.
- Have hwprobe_misaligned return int instead of long.
- Constify cpumask pointer in hwprobe_misaligned()
- Fix warnings in _PERF_O list documentation, use :c:macro:.
- Move include cpufeature.h to misaligned patch.
- Fix documentation mismatch for RISCV_HWPROBE_KEY_CPUPERF_0 (Conor)
- Use for_each_possible_cpu() instead of NR_CPUS (Conor)
- Break early in misaligned access iteration (Conor)
- Increase MISALIGNED_MASK from 2 bits to 3 for possible UNSUPPORTED future
value (Conor)
- Introduced vDSO function
Changes in v2:
- Factored the move of struct riscv_cpuinfo to its own header
- Changed the interface to look more like poll(). Rather than supplying
key_offset and getting back an array of values with numerically
contiguous keys, have the user pre-fill the key members of the array,
and the kernel will fill in the corresponding values. For any key it
doesn't recognize, it will set the key of that element to -1. This
allows usermode to quickly ask for exactly the elements it cares
about, and not get bogged down in a back and forth about newer keys
that older kernels might not recognize. In other words, the kernel
can communicate that it doesn't recognize some of the keys while
still providing the data for the keys it does know.
- Added a shortcut to the cpuset parameters that if a size of 0 and
NULL is provided for the CPU set, the kernel will use a cpu mask of
all online CPUs. This is convenient because I suspect most callers
will only want to act on a feature if it's supported on all CPUs, and
it's a headache to dynamically allocate an array of all 1s, not to
mention a waste to have the kernel loop over all of the offline bits.
- Fixed logic error in if(of_property_read_string...) that caused crash
- Include cpufeature.h in cpufeature.h to avoid undeclared variable
warning.
- Added a _MASK define
- Fix random checkpatch complaints
- Updated the selftests to the new API and added some more.
- Fixed indentation, comments in .S, and general checkpatch complaints.
Evan Green (6):
RISC-V: Move struct riscv_cpuinfo to new header
RISC-V: Add a syscall for HW probing
RISC-V: hwprobe: Add support for RISCV_HWPROBE_BASE_BEHAVIOR_IMA
RISC-V: hwprobe: Support probing of misaligned access performance
selftests: Test the new RISC-V hwprobe interface
RISC-V: Add hwprobe vDSO function and data
Documentation/riscv/hwprobe.rst | 86 +++++++
Documentation/riscv/index.rst | 1 +
arch/riscv/Kconfig | 1 +
arch/riscv/errata/thead/errata.c | 9 +
arch/riscv/include/asm/alternative.h | 5 +
arch/riscv/include/asm/cpufeature.h | 23 ++
arch/riscv/include/asm/hwprobe.h | 13 +
arch/riscv/include/asm/syscall.h | 4 +
arch/riscv/include/asm/vdso/data.h | 17 ++
arch/riscv/include/asm/vdso/gettimeofday.h | 8 +
arch/riscv/include/uapi/asm/hwprobe.h | 37 +++
arch/riscv/include/uapi/asm/unistd.h | 9 +
arch/riscv/kernel/alternative.c | 19 ++
arch/riscv/kernel/cpu.c | 8 +-
arch/riscv/kernel/cpufeature.c | 3 +
arch/riscv/kernel/smpboot.c | 1 +
arch/riscv/kernel/sys_riscv.c | 225 +++++++++++++++++-
arch/riscv/kernel/vdso.c | 6 -
arch/riscv/kernel/vdso/Makefile | 2 +
arch/riscv/kernel/vdso/hwprobe.c | 52 ++++
arch/riscv/kernel/vdso/sys_hwprobe.S | 15 ++
arch/riscv/kernel/vdso/vdso.lds.S | 1 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/riscv/Makefile | 58 +++++
.../testing/selftests/riscv/hwprobe/Makefile | 10 +
.../testing/selftests/riscv/hwprobe/hwprobe.c | 90 +++++++
.../selftests/riscv/hwprobe/sys_hwprobe.S | 12 +
27 files changed, 703 insertions(+), 13 deletions(-)
create mode 100644 Documentation/riscv/hwprobe.rst
create mode 100644 arch/riscv/include/asm/cpufeature.h
create mode 100644 arch/riscv/include/asm/hwprobe.h
create mode 100644 arch/riscv/include/asm/vdso/data.h
create mode 100644 arch/riscv/include/uapi/asm/hwprobe.h
create mode 100644 arch/riscv/kernel/vdso/hwprobe.c
create mode 100644 arch/riscv/kernel/vdso/sys_hwprobe.S
create mode 100644 tools/testing/selftests/riscv/Makefile
create mode 100644 tools/testing/selftests/riscv/hwprobe/Makefile
create mode 100644 tools/testing/selftests/riscv/hwprobe/hwprobe.c
create mode 100644 tools/testing/selftests/riscv/hwprobe/sys_hwprobe.S
--
2.25.1
This patch series adds unit tests for the clk fixed rate basic type and
the clk registration functions that use struct clk_parent_data. To get
there, we add support for loading device tree overlays onto the live DTB
along with probing platform drivers to bind to device nodes in the
overlays. With this series, we're able to exercise some of the code in
the common clk framework that uses devicetree lookups to find parents
and the fixed rate clk code that scans device tree directly and creates
clks. Please review.
I Cced everyone to all the patches so they get the full context. I'm
hoping I can take the whole pile through the clk tree as they almost all
depend on each other.
Changes from v1 (https://lore.kernel.org/r/20230302013822.1808711-1-sboyd@kernel.org):
* Don't depend on UML, use unittest data approach to attach nodes
* Introduce overlay loading API for KUnit
* Move platform_device KUnit code to drivers/base/test
* Use #define macros for constants shared between unit tests and
overlays
* Settle on "test" as a vendor prefix
* Make KUnit wrappers have "_kunit" postfix
Stephen Boyd (11):
of: Load KUnit DTB from of_core_init()
of: Add test managed wrappers for of_overlay_apply()/of_node_put()
dt-bindings: vendor-prefixes: Add "test" vendor for KUnit and friends
dt-bindings: test: Add KUnit empty node binding
of: Add a KUnit test for overlays and test managed APIs
platform: Add test managed platform_device/driver APIs
dt-bindings: kunit: Add fixed rate clk consumer test
clk: Add test managed clk provider/consumer APIs
clk: Add KUnit tests for clk fixed rate basic type
dt-bindings: clk: Add KUnit clk_parent_data test
clk: Add KUnit tests for clks registered with struct clk_parent_data
.../clock/test,clk-kunit-parent-data.yaml | 47 ++
.../kunit/test,clk-kunit-fixed-rate.yaml | 35 ++
.../bindings/test/test,kunit-empty.yaml | 30 ++
.../devicetree/bindings/vendor-prefixes.yaml | 2 +
drivers/base/test/Makefile | 3 +
drivers/base/test/platform_kunit-test.c | 108 +++++
drivers/base/test/platform_kunit.c | 186 +++++++
drivers/clk/.kunitconfig | 3 +
drivers/clk/Kconfig | 7 +
drivers/clk/Makefile | 9 +-
drivers/clk/clk-fixed-rate_test.c | 299 ++++++++++++
drivers/clk/clk-fixed-rate_test.h | 8 +
drivers/clk/clk_kunit.c | 219 +++++++++
drivers/clk/clk_parent_data_test.h | 10 +
drivers/clk/clk_test.c | 459 +++++++++++++++++-
drivers/clk/kunit_clk_fixed_rate_test.dtso | 19 +
drivers/clk/kunit_clk_parent_data_test.dtso | 28 ++
drivers/of/.kunitconfig | 5 +
drivers/of/Kconfig | 23 +
drivers/of/Makefile | 7 +
drivers/of/base.c | 182 +++++++
drivers/of/kunit.dtso | 10 +
drivers/of/kunit_overlay_test.dtso | 9 +
drivers/of/of_kunit.c | 123 +++++
drivers/of/of_private.h | 6 +
drivers/of/of_test.c | 43 ++
drivers/of/overlay_test.c | 107 ++++
drivers/of/unittest.c | 101 +---
include/kunit/clk.h | 28 ++
include/kunit/of.h | 90 ++++
include/kunit/platform_device.h | 15 +
31 files changed, 2119 insertions(+), 102 deletions(-)
create mode 100644 Documentation/devicetree/bindings/clock/test,clk-kunit-parent-data.yaml
create mode 100644 Documentation/devicetree/bindings/kunit/test,clk-kunit-fixed-rate.yaml
create mode 100644 Documentation/devicetree/bindings/test/test,kunit-empty.yaml
create mode 100644 drivers/base/test/platform_kunit-test.c
create mode 100644 drivers/base/test/platform_kunit.c
create mode 100644 drivers/clk/clk-fixed-rate_test.c
create mode 100644 drivers/clk/clk-fixed-rate_test.h
create mode 100644 drivers/clk/clk_kunit.c
create mode 100644 drivers/clk/clk_parent_data_test.h
create mode 100644 drivers/clk/kunit_clk_fixed_rate_test.dtso
create mode 100644 drivers/clk/kunit_clk_parent_data_test.dtso
create mode 100644 drivers/of/.kunitconfig
create mode 100644 drivers/of/kunit.dtso
create mode 100644 drivers/of/kunit_overlay_test.dtso
create mode 100644 drivers/of/of_kunit.c
create mode 100644 drivers/of/of_test.c
create mode 100644 drivers/of/overlay_test.c
create mode 100644 include/kunit/clk.h
create mode 100644 include/kunit/of.h
create mode 100644 include/kunit/platform_device.h
base-commit: fe15c26ee26efa11741a7b632e9f23b01aca4cc6
--
https://git.kernel.org/pub/scm/linux/kernel/git/clk/linux.git/https://git.kernel.org/pub/scm/linux/kernel/git/sboyd/spmi.git
Hi all,
I don't know if this is expected result, so I am filing the bug report.
Reports like this from tools/testing/selftests/net/tls:
# RUN tls.13_sm4_ccm.sendfile ...
# tls.c:323:sendfile:Expected ret (-1) == 0 (0)
# sendfile: Test terminated by assertion
# FAIL tls.13_sm4_ccm.sendfile
not ok 251 tls.13_sm4_ccm.sendfile
# RUN tls.13_sm4_ccm.send_then_sendfile ...
# tls.c:323:send_then_sendfile:Expected ret (-1) == 0 (0)
# send_then_sendfile: Test terminated by assertion
# FAIL tls.13_sm4_ccm.send_then_sendfile
not ok 252 tls.13_sm4_ccm.send_then_sendfile
# RUN tls.13_sm4_ccm.multi_chunk_sendfile ...
# tls.c:323:multi_chunk_sendfile:Expected ret (-1) == 0 (0)
# multi_chunk_sendfile: Test terminated by assertion
# FAIL tls.13_sm4_ccm.multi_chunk_sendfile
not ok 253 tls.13_sm4_ccm.multi_chunk_sendfile
Apparently, all are connected with sm4 hash ccm.
(Please find the complete report attached in tls-6.3-rc3-1.log)
The rest of the failed tests is as follows from this command:
[marvin@pc-mtodorov linux_torvalds]$ grep -v '^#' ../kselftest-6.3-rc3-1.log | grep "not ok"
not ok 2 selftests: alsa: pcm-test # TIMEOUT 45 seconds
not ok 1 selftests: drivers/net/bonding: bond-arp-interval-causes-panic.sh # exit=1
not ok 2 selftests: drivers/net/bonding: bond-break-lacpdu-tx.sh # exit=1
not ok 1 selftests: filesystems/binderfs: binderfs_test # exit=1
not ok 1 selftests: ftrace: ftracetest # exit=1
not ok 1 selftests: gpio: gpio-mockup.sh # exit=1
not ok 1 selftests: intel_pstate: run.sh # TIMEOUT 45 seconds
not ok 1 selftests: iommu: iommufd # exit=1
not ok 26 selftests: kvm: vmx_preemption_timer_test # exit=254
not ok 1 selftests: landlock: fs_test # exit=1
not ok 1 selftests: mincore: mincore_selftest # exit=1
not ok 2 selftests: mqueue: mq_perf_tests # TIMEOUT 45 seconds
not ok 1 selftests: nci: nci_dev # exit=1
not ok 6 selftests: net: tls # exit=1
not ok 12 selftests: net: run_netsocktests # exit=1
not ok 28 selftests: net: udpgro_bench.sh # exit=255
not ok 29 selftests: net: udpgro.sh # exit=255
not ok 36 selftests: net: fcnal-test.sh # TIMEOUT 1500 seconds
not ok 37 selftests: net: l2tp.sh # exit=2
not ok 45 selftests: net: icmp_redirect.sh # exit=1
not ok 49 selftests: net: txtimestamp.sh # exit=1
not ok 54 selftests: net: vrf_route_leaking.sh # exit=1
not ok 58 selftests: net: udpgro_fwd.sh # exit=1
not ok 59 selftests: net: udpgro_frglist.sh # exit=255
not ok 60 selftests: net: veth.sh # exit=1
not ok 67 selftests: net: srv6_end_dt46_l3vpn_test.sh # exit=1
not ok 68 selftests: net: srv6_end_dt4_l3vpn_test.sh # exit=1
not ok 82 selftests: net: rps_default_mask.sh # exit=1
not ok 85 selftests: net: test_ingress_egress_chaining.sh # exit=1
not ok 1 selftests: net/hsr: hsr_ping.sh # TIMEOUT 45 seconds
not ok 3 selftests: net/mptcp: mptcp_join.sh # exit=1
not ok 3 selftests: netfilter: nft_nat.sh # exit=1
not ok 5 selftests: netfilter: conntrack_icmp_related.sh # exit=1
not ok 8 selftests: netfilter: nft_concat_range.sh # exit=1
not ok 14 selftests: netfilter: conntrack_tcp_unreplied.sh # exit=1
not ok 15 selftests: netfilter: conntrack_vrf.sh # exit=1
not ok 15 selftests: proc: read # exit=134
not ok 1 selftests: pstore: pstore_tests # exit=1
not ok 3 selftests: ptrace: vmaccess # exit=1
not ok 1 selftests: rlimits: rlimits-per-userns # exit=1
not ok 1 selftests: sgx: test_sgx # exit=1
not ok 2 selftests: splice: short_splice_read.sh # exit=3
not ok 1 selftests: tdx: tdx_guest_test # exit=1
not ok 3 selftests: mm: split_huge_page_test # exit=1
not ok 5 selftests: mm: mdwe_test # exit=1
[marvin@pc-mtodorov linux_torvalds]$
The environment is AlmaLinux 8.7 running 6.3-rc3 vanilla kernel with
MGLRU, KMEMLEAK and CONFIG_DEBUG_KOBJECT=y enabled.
Hw := LENOVO_MT_10TX_BU_Lenovo_FM_V530S-07ICB
In case you are interested to debug this, I am available for additional
diagnostics.
As 45 bug reports might overwhelm me due to the overhead of bug submission,
I will probably submit a bug or two at a time.
Best regards,
Mirsad
--
Mirsad Goran Todorovac
Sistem inženjer
Grafički fakultet | Akademija likovnih umjetnosti
Sveučilište u Zagrebu
System engineer
Faculty of Graphic Arts | Academy of Fine Arts
University of Zagreb, Republic of Croatia
This is the basic functionality for iommufd to support
iommufd_device_replace() and IOMMU_HWPT_ALLOC for physical devices.
iommufd_device_replace() allows changing the HWPT associated with the
device to a new IOAS or HWPT. Replace does this in way that failure leaves
things unchanged, and utilizes the iommu iommu_group_replace_domain() API
to allow the iommu driver to perform an optional non-disruptive change.
IOMMU_HWPT_ALLOC allows HWPTs to be explicitly allocated by the user and
used by attach or replace. At this point it isn't very useful since the
HWPT is the same as the automatically managed HWPT from the IOAS. However
a following series will allow userspace to customize the created HWPT.
The implementation is complicated because we have to introduce some
per-iommu_group memory in iommufd and redo how we think about multi-device
groups to be more explicit. This solves all the locking problems in the
prior attempts.
This series is infrastructure work for the following series which:
- Add replace for attach
- Expose replace through VFIO APIs
- Implement driver parameters for HWPT creation (nesting)
Once review of this is complete I will keep it on a side branch and
accumulate the following series when they are ready so we can have a
stable base and make more incremental progress. When we have all the parts
together to get a full implementation it can go to Linus.
I have this on github:
https://github.com/jgunthorpe/linux/commits/iommufd_hwpt
Jason Gunthorpe (12):
iommufd: Move isolated msi enforcement to iommufd_device_bind()
iommufd: Add iommufd_group
iommufd: Replace the hwpt->devices list with iommufd_group
iommufd: Use the iommufd_group to avoid duplicate reserved groups and
msi setup
iommufd: Make sw_msi_start a group global
iommufd: Move putting a hwpt to a helper function
iommufd: Add enforced_cache_coherency to iommufd_hw_pagetable_alloc()
iommufd: Add iommufd_device_replace()
iommufd: Make destroy_rwsem use a lock class per object type
iommufd: Add IOMMU_HWPT_ALLOC
iommufd/selftest: Return the real idev id from selftest mock_domain
iommufd/selftest: Add a selftest for IOMMU_HWPT_ALLOC
Nicolin Chen (2):
iommu: Introduce a new iommu_group_replace_domain() API
iommufd/selftest: Test iommufd_device_replace()
drivers/iommu/iommu-priv.h | 10 +
drivers/iommu/iommu.c | 30 ++
drivers/iommu/iommufd/device.c | 482 +++++++++++++-----
drivers/iommu/iommufd/hw_pagetable.c | 96 +++-
drivers/iommu/iommufd/io_pagetable.c | 5 +-
drivers/iommu/iommufd/iommufd_private.h | 44 +-
drivers/iommu/iommufd/iommufd_test.h | 7 +
drivers/iommu/iommufd/main.c | 17 +-
drivers/iommu/iommufd/selftest.c | 40 ++
include/linux/iommufd.h | 1 +
include/uapi/linux/iommufd.h | 26 +
tools/testing/selftests/iommu/iommufd.c | 64 ++-
.../selftests/iommu/iommufd_fail_nth.c | 57 ++-
tools/testing/selftests/iommu/iommufd_utils.h | 59 ++-
14 files changed, 758 insertions(+), 180 deletions(-)
create mode 100644 drivers/iommu/iommu-priv.h
base-commit: ac395958f9156733246b5bc3a481c6d38c321a7a
--
2.39.1
On Sat, 11 Mar 2023 at 07:21, Stephen Boyd <sboyd(a)kernel.org> wrote:
>
> Quoting David Gow (2023-03-02 23:15:35)
> > On Thu, 2 Mar 2023 at 09:38, Stephen Boyd <sboyd(a)kernel.org> wrote:
> > >
> > > Unit tests are more ergonomic and simpler to understand if they don't
> > > have to hoist a bunch of code into the test harness init and exit
> > > functions. Add some test managed wrappers for the clk APIs so that clk
> > > unit tests can write more code in the actual test and less code in the
> > > harness.
> > >
> > > Only add APIs that are used for now. More wrappers can be added in the
> > > future as necessary.
> > >
> > > Cc: Brendan Higgins <brendan.higgins(a)linux.dev>
> > > Cc: David Gow <davidgow(a)google.com>
> > > Signed-off-by: Stephen Boyd <sboyd(a)kernel.org>
> > > ---
> >
> > Looks good, modulo bikeshedding below.
>
> Cool!
>
> > >
> > > diff --git a/drivers/clk/Makefile b/drivers/clk/Makefile
> > > index e3ca0d058a25..7efce649b0d3 100644
> > > --- a/drivers/clk/Makefile
> > > +++ b/drivers/clk/Makefile
> > > @@ -17,6 +17,11 @@ ifeq ($(CONFIG_OF), y)
> > > obj-$(CONFIG_COMMON_CLK) += clk-conf.o
> > > endif
> > >
> > > +# KUnit specific helpers
> > > +ifeq ($(CONFIG_COMMON_CLK), y)
> > > +obj-$(CONFIG_KUNIT) += clk-kunit.o
> >
> > Do we want to compile these in whenever KUnit is enabled, or only when
> > we're building clk tests specifically? I suspect this would be served
> > better by being under a CLK_KUNIT config option, which all of the
> > tests then depend on. (Whether that's the existing
> > CONFIG_CLK_KUNIT_TEST, and all of the clk tests live under the same
> > config option, or a separate parent option would be up to you).
>
> I was thinking of building it in with whatever mode CONFIG_KUNIT is
> built as. If this is a module because CONFIG_KUNIT=m, then unit tests
> would depend on that, and this would be a module as well. modprobe would
> know that some unit test module depends on symbols provided by
> clk-kunit.ko and thus load clk-kunit.ko first.
>
Personally, I'd rather have this behind CONFIG_CLK_KUNIT_TEST if
possible, if only to avoid needlessly building these if someone just
wants to test some other subsystem (but needs CONFIG_COMMON_CLK
enabled anyway). I doubt it'd be a problem in practice in this case,
but we definitely want to keep build (and hence iteration) times down
as much as possible, so it's probably good practice to keep all tests
behind at least some sort of "test this subsystem" option.
> >
> > Equally, this could be a bit interesting if CONFIG_KUNIT=m. Given
> > CONFIG_COMMON_CLK=y, this would end up as a clk-kunit module, no?
>
> Yes, that is the intent.
>
> >
> > > +endif
> > > +
> > > # hardware specific clock types
> > > # please keep this section sorted lexicographically by file path name
> > > obj-$(CONFIG_COMMON_CLK_APPLE_NCO) += clk-apple-nco.o
> > > diff --git a/drivers/clk/clk-kunit.c b/drivers/clk/clk-kunit.c
> > > new file mode 100644
> > > index 000000000000..78d85b3a7a4a
> > > --- /dev/null
> > > +++ b/drivers/clk/clk-kunit.c
> > > @@ -0,0 +1,204 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * KUnit helpers for clk tests
> > > + */
> > > +#include <linux/clk.h>
> > > +#include <linux/clk-provider.h>
> > > +#include <linux/err.h>
> > > +#include <linux/kernel.h>
> > > +#include <linux/slab.h>
> > > +
> > > +#include <kunit/resource.h>
> > > +
> > > +#include "clk-kunit.h"
> > > +
> > > +static void kunit_clk_disable_unprepare(struct kunit_resource *res)
> >
> > We need to decide on the naming scheme of these, and in particular if
> > they should be kunit_clk or clk_kunit (or something else).
> >
> > I'd lean to clk_kunit, if only to match DRM's KUnit helpers being
> > drm_kunit_helper better, and so that these are more tightly bound to
> > the subsystem being tested.
> > (i.e., so I don't have to scroll through every subsystem's helpers
> > when autocompleting kunit_).
>
> Ok, got it. I was trying to match kunit_kzalloc() style. It makes it
> easy to slap the 'kunit_' prefix on existing auto-completed function
> names like kzalloc() or clk_prepare_enable().
Yeah: my rule of thumb at the moment is to keep the kunit_ prefix for
things which are generic across the whole kernel (and tend to be
implemented in lib/kunit), and to use suffixes or infixes (whichever
works best) for things which are subsystem-specific.
> I wasn't aware of drm_kunit_helper. That's a mouthful! We don't call it
> slab_kunit_helper_kzalloc(). Maybe to satisfy all conditions it should
> be:
>
> clk_prepare_enable_kunit()
>
> so that kunit_ autocomplete doesn't have a big scroll list, and clk
> subsystem autocompletes, and we know it is kunit specific.
Sounds good to me.
Cheers,
-- David