This patch series introduces the UFFDIO_MOVE feature to userfaultfd, which
has long been implemented and maintained by Andrea in his local tree [1],
but was not upstreamed due to a lack of use cases where this approach would
be better than allocating a new page and copying the contents. Previous
upstreaming attempts can be found at [6] and [7].
UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
needs pages to be allocated [2]. However, with UFFDIO_MOVE, if pages are
available (in userspace) for recycling, as is usually the case in heap
compaction algorithms, then we can avoid the page allocation and memcpy
(done by UFFDIO_COPY). Also, since the pages are recycled in the
userspace, we avoid the need to release (via madvise) the pages back to
the kernel [3].
We see an over 40% reduction (on a Google Pixel 6 device) in the compacting
thread's completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was
measured using a benchmark that emulates a heap compaction implementation
using userfaultfd (to allow concurrent accesses by application threads).
More details of the usecase are explained in [3].
Furthermore, UFFDIO_MOVE enables moving swapped-out pages without
touching them within the same vma. Today this can only be done by
mremap(), which, however, forces the vma to be split.
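To make the uAPI concrete, below is a minimal sketch (not part of this
series) of how a compaction thread might invoke the new ioctl. The
uffdio_move fields follow the uAPI introduced here; the error handling
and retry policy are illustrative assumptions only:

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/*
 * Move an already-populated (recycled) source page onto the faulting
 * destination address, avoiding the allocation and memcpy that
 * UFFDIO_COPY would perform.
 */
static int uffd_move(int uffd, unsigned long src, unsigned long dst,
                     unsigned long len)
{
        struct uffdio_move move = {
                .src  = src,
                .dst  = dst,
                .len  = len,
                .mode = 0,      /* or UFFDIO_MOVE_MODE_DONTWAKE */
        };

        if (ioctl(uffd, UFFDIO_MOVE, &move) < 0)
                return -1;      /* e.g. EAGAIN: caller may retry */

        /* move.move reports how many bytes were actually moved */
        return move.move == (long)len ? 0 : -1;
}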
TODOs for follow-up improvements:
- cross-mm support. Known differences from single-mm and missing pieces:
- memcg recharging (might need to isolate pages in the process)
- mm counters
- cross-mm deposit table moves
- cross-mm test
- document the address space where src and dest reside in struct
uffdio_move
- TLB flush batching. Will require extensive changes to PTL locking in
move_pages_pte(). OTOH that might let us reuse parts of mremap code.
Changes since v5 [10]:
- added logic to split large folios in move_pages_pte(),
per David Hildenbrand
- added check for PAE before split_huge_pmd() to avoid the split if the
move operation can't be done
- replaced calls to default_huge_page_size() with read_pmd_pagesize() in
uffd_move_pmd test, per David Hildenbrand
- fixed the condition in uffd_move_test_common() checking if area
alignment is needed
Changes since v4 [9]:
- added Acked-by in patch 1, per Peter Xu
- added description for ctx, mm and mode parameters of move_pages(),
per kernel test robot
- added Reviewed-by's, per Peter Xu and Axel Rasmussen
- removed unused operations in uffd_test_case_ops
- refactored uffd-unit-test changes to avoid using global variables and
handle pmd moves without page size overrides, per Peter Xu
Changes since v3 [8]:
- changed retry path in folio_lock_anon_vma_read() to unlock and then
relock RCU, per Peter Xu
- removed cross-mm support from initial patchset, per David Hildenbrand
- replaced BUG_ONs with VM_WARN_ON or WARN_ON_ONCE, per David Hildenbrand
- added missing cache flushing, per Lokesh Gidra and Peter Xu
- updated manpage text in the patch description, per Peter Xu
- renamed internal functions from "remap" to "move", per Peter Xu
- added mmap_changing check after taking mmap_lock, per Peter Xu
- changed uffd context check to ensure dst_mm is registered onto uffd we
are operating on, per Peter Xu and David Hildenbrand
- changed to non-maybe variants of maybe*_mkwrite(), per David Hildenbrand
- fixed warning for CONFIG_TRANSPARENT_HUGEPAGE=n, per kernel test robot
- comments cleanup, per David Hildenbrand and Peter Xu
- checks for VM_IO,VM_PFNMAP,VM_HUGETLB,..., per David Hildenbrand
- prevent moving pinned pages, per Peter Xu
- changed uffd tests to call uffd_test_ctx_clear() at the end of the test
run instead of at the beginning of the next run
- added support for testcase-specific ops
- added test for moving PMD-aligned blocks
Changes since v2 [5]:
- renamed UFFDIO_REMAP to UFFDIO_MOVE, per David Hildenbrand
- rebase over mm-unstable to use folio_move_anon_rmap(),
per David Hildenbrand
- added text for manpage explaining DONTFORK and KSM requirements for this
feature, per David Hildenbrand
- check for anon_vma changes in the fast path of folio_lock_anon_vma_read,
per Peter Xu
- updated the title and description of the first patch,
per David Hildenbrand
- updating comments in folio_lock_anon_vma_read() explaining the need for
anon_vma checks, per David Hildenbrand
- changed all mapcount checks to PageAnonExclusive, per Jann Horn and
David Hildenbrand
- changed counters in remap_swap_pte() from MM_ANONPAGES to MM_SWAPENTS,
per Jann Horn
- added a check for PTE change after folio is locked in remap_pages_pte(),
per Jann Horn
- added handling of PMD migration entries and bailout when pmd_devmap(),
per Jann Horn
- added checks to ensure both src and dst VMAs are writable, per Peter Xu
- added UFFD_FEATURE_MOVE, per Peter Xu
- removed obsolete comments, per Peter Xu
- renamed remap_anon_pte to remap_present_pte, per Peter Xu
- added a comment for folio_get_anon_vma() explaining the need for
anon_vma checks, per Peter Xu
- changed error handling in remap_pages() to make it more clear,
per Peter Xu
- changed EFAULT to EAGAIN to retry when a hugepage appears or disappears
from under us, per Peter Xu
- added links to previous upstreaming attempts, per David Hildenbrand
Changes since v1 [4]:
- add mmget_not_zero in userfaultfd_remap, per Jann Horn
- removed extern from function definitions, per Matthew Wilcox
- converted to folios in remap_pages_huge_pmd, per Matthew Wilcox
- use PageAnonExclusive in remap_pages_huge_pmd, per David Hildenbrand
- handle pgtable transfers between MMs, per Jann Horn
- ignore concurrent A/D pte bit changes, per Jann Horn
- split functions into smaller units, per David Hildenbrand
- test for folio_test_large in remap_anon_pte, per Matthew Wilcox
- use pte_swp_exclusive for swapcount check, per David Hildenbrand
- eliminated use of mmu_notifier_invalidate_range_start_nonblock,
per Jann Horn
- simplified THP alignment checks, per Jann Horn
- refactored the loop inside remap_pages, per Jann Horn
- additional clarifying comments, per Jann Horn
Main changes since Andrea's last version [1]:
- Trivial translations from page to folio, mmap_sem to mmap_lock
- Replace pmd_trans_unstable() with pte_offset_map_nolock() and handle its
possible failure
- Move pte mapping into remap_pages_pte to allow for retries when the source
page or anon_vma is contended. Since pte_offset_map_nolock() starts an RCU
read section, we can't block anymore after mapping a pte, so we have to
unmap the ptes, do the locking and retry.
- Add and use anon_vma_trylock_write() to avoid blocking while in RCU
read section.
- Accommodate changes in mmu_notifier_range_init() API, switch to
mmu_notifier_invalidate_range_start_nonblock() to avoid blocking while in
RCU read section.
- Open-code the now-removed __swp_swapcount()
- Replace pmd_read_atomic() with pmdp_get_lockless()
- Add new selftest for UFFDIO_MOVE
[1] https://gitlab.com/aarcange/aa/-/commit/2aec7aea56b10438a3881a20a411aa4b1fc…
[2] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redha…
[3] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyj…
[4] https://lore.kernel.org/all/20230914152620.2743033-1-surenb@google.com/
[5] https://lore.kernel.org/all/20230923013148.1390521-1-surenb@google.com/
[6] https://lore.kernel.org/all/1425575884-2574-21-git-send-email-aarcange@redh…
[7] https://lore.kernel.org/all/cover.1547251023.git.blake.caldwell@colorado.ed…
[8] https://lore.kernel.org/all/20231009064230.2952396-1-surenb@google.com/
[9] https://lore.kernel.org/all/20231028003819.652322-1-surenb@google.com/
[10] https://lore.kernel.org/all/20231121171643.3719880-1-surenb@google.com/
Andrea Arcangeli (2):
mm/rmap: support move to different root anon_vma in
folio_move_anon_rmap()
userfaultfd: UFFDIO_MOVE uABI
Suren Baghdasaryan (3):
selftests/mm: call uffd_test_ctx_clear at the end of the test
selftests/mm: add uffd_test_case_ops to allow test case-specific
operations
selftests/mm: add UFFDIO_MOVE ioctl test
Documentation/admin-guide/mm/userfaultfd.rst | 3 +
fs/userfaultfd.c | 72 +++
include/linux/rmap.h | 5 +
include/linux/userfaultfd_k.h | 11 +
include/uapi/linux/userfaultfd.h | 29 +-
mm/huge_memory.c | 122 ++++
mm/khugepaged.c | 3 +
mm/rmap.c | 30 +
mm/userfaultfd.c | 614 +++++++++++++++++++
tools/testing/selftests/mm/uffd-common.c | 39 +-
tools/testing/selftests/mm/uffd-common.h | 9 +
tools/testing/selftests/mm/uffd-stress.c | 5 +-
tools/testing/selftests/mm/uffd-unit-tests.c | 192 ++++++
13 files changed, 1130 insertions(+), 4 deletions(-)
--
2.43.0.rc2.451.g8631bc7472-goog
The guard regions feature was initially implemented to support anonymous
mappings only, excluding shmem.
This was done so as to introduce the feature carefully and incrementally
and to be conservative when considering the various caveats and corner
cases that apply to file-backed mappings but not to anonymous ones.
Now that this feature has landed in 6.13, it is time to revisit it and to
extend this functionality to file-backed and shmem mappings.
In order to make this maximally useful, and since one may map file-backed
mappings read-only (for instance ELF images), we also remove the
restriction on read-only mappings and permit the establishment of guard
regions in any non-hugetlb, non-mlock()'d mapping.
It is safe to permit the establishment of guard regions in read-only
mappings because the guard regions only reduce access to the mapping, and
when removed simply reinstate the existing attributes of the underlying
VMA, meaning no access violations can occur.
While the change in kernel code introduced in this series is small, the
majority of the effort here is spent in extending the testing to assert
that the feature works correctly across numerous file-backed mapping
scenarios.
Every single guard region self-test performed against anonymous memory
(which is relevant and not anon-only) has now been updated to also be
performed against shmem and a mapping of a file in the working directory.
This confirms that all cases also function correctly for file-backed guard
regions.
In addition a number of other tests are added for specific file-backed
mapping scenarios.
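To make the semantics concrete, here is a minimal sketch of installing
and removing a guard region in a read-only, file-backed mapping. It
assumes headers that define the MADV_GUARD_INSTALL/MADV_GUARD_REMOVE
advice values introduced with the feature; error handling is elided:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static int guard_middle_page(const char *path, size_t pgsz)
{
        int fd = open(path, O_RDONLY);
        char *p;

        if (fd < 0)
                return -1;
        p = mmap(NULL, 3 * pgsz, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);
        if (p == MAP_FAILED)
                return -1;

        /* Accesses to [p + pgsz, p + 2 * pgsz) now raise SIGSEGV... */
        if (madvise(p + pgsz, pgsz, MADV_GUARD_INSTALL))
                return -1;

        /* ...and removal simply reinstates the VMA's attributes. */
        return madvise(p + pgsz, pgsz, MADV_GUARD_REMOVE);
}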
There are a number of other concerns that one might have with regard to
guard regions, addressed below:
Readahead
~~~~~~~~~
Readahead is a process through which the page cache is populated on the
assumption that sequential reads will occur, thus amortising I/O. Through
clever use of the PG_readahead folio flag, established during a major fault
and checked upon a minor fault, it also provides for asynchronous I/O to
occur as data is processed, reducing I/O stalls as data is faulted in.
Guard regions do not alter this mechanism, which operates at the folio and
fault level, but do of course prevent the faulting in of folios that would
otherwise be mapped.
In the instance of a major fault prior to a guard region, synchronous
readahead will still occur, including the population of folios in the page
cache to which the guard regions, within the mapping in question, prevent
access.
In addition, if PG_readahead is placed in a folio that is now inaccessible,
this will prevent asynchronous readahead from occurring as it would
otherwise do.
However, there are mechanisms for heuristically resetting this within
readahead regardless, which will 'recover' correct readahead behaviour.
Readahead presumes sequential data access; the presence of a guard region
clearly indicates that, at least within the guard region, no such
sequential access will occur, as it cannot occur there.
So this should have very little impact on any real workload. The far more
important point is whether readahead causes incorrect or inappropriate
mapping of ranges disallowed by the presence of guard regions - this is not
the case, as readahead does not 'pre-fault' memory in this fashion.
At any rate, any mechanism which would attempt to do so would hit the usual
page fault paths, which correctly handle PTE markers as with anonymous
mappings.
Fault-Around
~~~~~~~~~~~~
The fault-around logic, in a similar vein to readahead, attempts to improve
efficiency with regard to file-backed memory mappings, however it differs
in that it does not try to fetch folios into the page cache that are about
to be accessed, but rather pre-maps a range of folios around the faulting
address.
Because guard regions make use of PTE markers, this case is relatively
trivial, as it is already handled - see filemap_map_folio_range() and
filemap_map_order0_folio() - in both instances, the solution is simply to
keep the established page table mappings and let the fault handler take
care of the PTE markers, as per the comment:
/*
* NOTE: If there're PTE markers, we'll leave them to be
* handled in the specific fault path, and it'll prohibit
* the fault-around logic.
*/
This works, as establishing guard regions results in page table mappings
with PTE markers, and clearing them removes them.
Truncation
~~~~~~~~~~
File truncation will not eliminate existing guard regions, as the
truncation operation will ultimately zap the range via
unmap_mapping_range(), which specifically excludes PTE markers.
Zapping
~~~~~~~
Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
which specifically deals with guard entries, leaving them intact except in
instances such as process teardown or munmap() where they need to be
removed.
Reclaim
~~~~~~~
When reclaim is performed on file-backed folios, it ultimately invokes
try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
will ultimately abort the operation for the guard region mapping. If large,
then check_pte() will determine that this is a non-device private
entry/device-exclusive entry 'swap' PTE and thus abort the operation in
that instance.
Therefore, no odd things happen in the instance of reclaim being attempted
upon a file-backed guard region.
Hole Punching
~~~~~~~~~~~~~
This updates the page cache and ultimately invokes unmap_mapping_range(),
which explicitly leaves PTE markers in place.
Because the establishment of guard regions zapped any existing mappings to
file-backed folios, once the guard regions are removed then the
hole-punched region will be faulted in as usual and everything will behave
as expected.
Lorenzo Stoakes (4):
mm: allow guard regions in file-backed and read-only mappings
selftests/mm: rename guard-pages to guard-regions
tools/selftests: expand all guard region tests to file-backed
tools/selftests: add file/shmem-backed mapping guard region tests
mm/madvise.c | 8 +-
tools/testing/selftests/mm/.gitignore | 2 +-
tools/testing/selftests/mm/Makefile | 2 +-
.../mm/{guard-pages.c => guard-regions.c} | 921 ++++++++++++++++--
4 files changed, 821 insertions(+), 112 deletions(-)
rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
--
2.48.1
Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing
time inconsistencies.
Lei tracked down that this was being caused by the adjustment
tk->tkr_mono.xtime_nsec -= offset;
which is made to compensate for the unaccumulated cycles in
offset when the mult value is adjusted forward, so that
the non-_COARSE clockids don't see inconsistencies.
However, the _COARSE clockids don't use the mult*offset value
in their calculations, so this subtraction can cause the
_COARSE clock ids to jump back a bit.
Now, by design, this negative adjustment should be fine, because
the logic run from timekeeping_adjust() is done after we
accumulate approx mult*interval_cycles into xtime_nsec.
The accumulated (mult*interval_cycles) will be larger than the
(mult_adj*offset) value subtracted from xtime_nsec, and both
operations are done together under the tk_core.lock, so the net
change to xtime_nsec should always be positive.
However, do_adjtimex() calls into timekeeping_advance() as well,
since we want to apply the ntp freq adjustment immediately.
In this case, we don't return early when the offset is smaller
than interval_cycles, so we don't end up accumulating any time
into xtime_nsec. But we do go on to call timekeeping_adjust(),
which modifies the mult value, and subtracts from xtime_nsec
to correct for the new mult value.
Here, because we did not accumulate anything, we have a window
where the _COARSE clockids, which don't utilize the mult*offset
value, can see an inconsistency.
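For reference, the reported inconsistency can be observed from
userspace with a simple monotonicity check on the coarse clock while
NTP frequency adjustments are being applied; a hedged sketch:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        struct timespec prev, now;

        clock_gettime(CLOCK_MONOTONIC_COARSE, &prev);
        for (;;) {
                clock_gettime(CLOCK_MONOTONIC_COARSE, &now);
                if (now.tv_sec < prev.tv_sec ||
                    (now.tv_sec == prev.tv_sec && now.tv_nsec < prev.tv_nsec))
                        printf("backwards: %ld.%09ld -> %ld.%09ld\n",
                               (long)prev.tv_sec, prev.tv_nsec,
                               (long)now.tv_sec, now.tv_nsec);
                prev = now;
                usleep(1000);   /* the coarse clock only advances per tick */
        }
}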
So to fix this, rework the timekeeping_advance() logic a bit
so that when we are called from do_adjtimex(), we call
timekeeping_forward(), to first accumulate the sub-interval
time into xtime_nsec. Then with no unaccumulated cycles in
offset, we can do the mult adjustment without worry of the
subtraction having an impact.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: linux-kselftest@vger.kernel.org
Cc: kernel-team@android.com
Cc: Lei Chen <lei.chen@smartx.com>
Fixes: da15cfdae033 ("time: Introduce CLOCK_REALTIME_COARSE")
Reported-by: Lei Chen <lei.chen@smartx.com>
Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
Diagnosed-by: Thomas Gleixner <tglx@linutronix.de>
Additional-fixes-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <jstultz@google.com>
---
v2: Include fixes from Thomas, dropping the unnecessary clock_set
setting, and instead clearing ntp_error, along with some other
minor tweaks.
---
kernel/time/timekeeping.c | 94 ++++++++++++++++++++++++++++-----------
1 file changed, 69 insertions(+), 25 deletions(-)
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 1e67d076f1955..929846b8b45ab 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -682,20 +682,19 @@ static void timekeeping_update_from_shadow(struct tk_data *tkd, unsigned int act
}
/**
- * timekeeping_forward_now - update clock to the current time
+ * timekeeping_forward - update clock to given cycle now value
* @tk: Pointer to the timekeeper to update
+ * @cycle_now: Current clocksource read value
*
* Forward the current clock to update its state since the last call to
* update_wall_time(). This is useful before significant clock changes,
* as it avoids having to deal with this time offset explicitly.
*/
-static void timekeeping_forward_now(struct timekeeper *tk)
+static void timekeeping_forward(struct timekeeper *tk, u64 cycle_now)
{
- u64 cycle_now, delta;
+ u64 delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
+ tk->tkr_mono.clock->max_raw_delta);
- cycle_now = tk_clock_read(&tk->tkr_mono);
- delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
- tk->tkr_mono.clock->max_raw_delta);
tk->tkr_mono.cycle_last = cycle_now;
tk->tkr_raw.cycle_last = cycle_now;
@@ -710,6 +709,21 @@ static void timekeeping_forward_now(struct timekeeper *tk)
}
}
+/**
+ * timekeeping_forward_now - update clock to the current time
+ * @tk: Pointer to the timekeeper to update
+ *
+ * Forward the current clock to update its state since the last call to
+ * update_wall_time(). This is useful before significant clock changes,
+ * as it avoids having to deal with this time offset explicitly.
+ */
+static void timekeeping_forward_now(struct timekeeper *tk)
+{
+ u64 cycle_now = tk_clock_read(&tk->tkr_mono);
+
+ timekeeping_forward(tk, cycle_now);
+}
+
/**
* ktime_get_real_ts64 - Returns the time of day in a timespec64.
* @ts: pointer to the timespec to be set
@@ -2151,6 +2165,54 @@ static u64 logarithmic_accumulation(struct timekeeper *tk, u64 offset,
return offset;
}
+static u64 timekeeping_accumulate(struct timekeeper *tk, u64 offset,
+ enum timekeeping_adv_mode mode,
+ unsigned int *clock_set)
+{
+ int shift = 0, maxshift;
+
+ /*
+ * TK_ADV_FREQ indicates that adjtimex(2) directly set the
+ * frequency or the tick length.
+ *
+ * Accumulate the offset, so that the new multiplier starts from
+ * now. This is required as otherwise for offsets, which are
+ * smaller than tk::cycle_interval, timekeeping_adjust() could set
+ * xtime_nsec backwards, which subsequently causes time going
+ * backwards in the coarse time getters. But even for the case
+ * where offset is greater than tk::cycle_interval the periodic
+ * accumulation does not have much value.
+ *
+ * Also reset tk::ntp_error as it does not make sense to keep the
+ * old accumulated error around in this case.
+ */
+ if (mode == TK_ADV_FREQ) {
+ timekeeping_forward(tk, tk->tkr_mono.cycle_last + offset);
+ tk->ntp_error = 0;
+ return 0;
+ }
+
+ /*
+ * With NO_HZ we may have to accumulate many cycle_intervals
+ * (think "ticks") worth of time at once. To do this efficiently,
+ * we calculate the largest doubling multiple of cycle_intervals
+ * that is smaller than the offset. We then accumulate that
+ * chunk in one go, and then try to consume the next smaller
+ * doubled multiple.
+ */
+ shift = ilog2(offset) - ilog2(tk->cycle_interval);
+ shift = max(0, shift);
+ /* Bound shift to one less than what overflows tick_length */
+ maxshift = (64 - (ilog2(ntp_tick_length()) + 1)) - 1;
+ shift = min(shift, maxshift);
+ while (offset >= tk->cycle_interval) {
+ offset = logarithmic_accumulation(tk, offset, shift, clock_set);
+ if (offset < tk->cycle_interval << shift)
+ shift--;
+ }
+ return offset;
+}
+
/*
* timekeeping_advance - Updates the timekeeper to the current time and
* current NTP tick length
@@ -2160,7 +2222,6 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
struct timekeeper *tk = &tk_core.shadow_timekeeper;
struct timekeeper *real_tk = &tk_core.timekeeper;
unsigned int clock_set = 0;
- int shift = 0, maxshift;
u64 offset;
guard(raw_spinlock_irqsave)(&tk_core.lock);
@@ -2177,24 +2238,7 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)
return false;
- /*
- * With NO_HZ we may have to accumulate many cycle_intervals
- * (think "ticks") worth of time at once. To do this efficiently,
- * we calculate the largest doubling multiple of cycle_intervals
- * that is smaller than the offset. We then accumulate that
- * chunk in one go, and then try to consume the next smaller
- * doubled multiple.
- */
- shift = ilog2(offset) - ilog2(tk->cycle_interval);
- shift = max(0, shift);
- /* Bound shift to one less than what overflows tick_length */
- maxshift = (64 - (ilog2(ntp_tick_length())+1)) - 1;
- shift = min(shift, maxshift);
- while (offset >= tk->cycle_interval) {
- offset = logarithmic_accumulation(tk, offset, shift, &clock_set);
- if (offset < tk->cycle_interval<<shift)
- shift--;
- }
+ offset = timekeeping_accumulate(tk, offset, mode, &clock_set);
/* Adjust the multiplier to correct NTP error */
timekeeping_adjust(tk, offset);
--
2.49.0.395.g12beb8f557-goog
As part of LKFT’s re-validation of known issues, we have observed that
the selftests: cgroup suite is consistently failing across almost all
LKFT-supported devices due to:
- Test timeouts (45 seconds limit reached)
- OOM-killer invocation
## Key Questions for Discussion:
- Would it be beneficial to increase the test timeout to ~180 seconds
to allow sufficient execution time? (a settings-file sketch follows below)
- Should we enhance logging to explicitly print failure reasons when a
test fails?
- Are there any missing dependencies that could be causing these failures?
Note: The required selftests/cgroup/config options were included in
LKFT's build and test plans.
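If raising the timeout is acceptable, the usual kselftest mechanism is
a per-suite 'settings' file read by the runner; a hedged sketch (path
assumed, the cgroup suite does not currently ship one):

# tools/testing/selftests/cgroup/settings
timeout=180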
## Devices Affected:
The following DUTs consistently experience these failures:
- dragonboard-410c (arm64)
- dragonboard-845c (arm64)
- e850-96 (arm64)
- juno-r2 (arm64)
- qemu-arm64 (arm64)
- qemu-armv7
- qemu-x86_64
- rk3399-rock-pi-4b (arm64)
- x15 (arm)
- x86_64
Regression Analysis:
- New regression? No (these failures have been observed for months/years).
- Reproducibility? Yes, the failures occur consistently.
- Test suite affected? selftests: cgroup (timeouts and OOM-related failures).
Test regression: selftests: cgroup fails with timeouts and oom-killer invocations
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
## Test log:
# selftests: cgroup: test_cpu
# ok 1 test_cpucg_subtree_control
# ok 2 test_cpucg_stats
# ok 3 test_cpucg_nice
# not ok 4 test_cpucg_weight_overprovisioned
# ok 5 test_cpucg_weight_underprovisioned
# ok 6 test_cpucg_nested_weight_overprovisioned
# ok 7 test_cpucg_nested_weight_underprovisioned
#
not ok 2 selftests: cgroup: test_cpu # TIMEOUT 45 seconds
<trim>
# selftests: cgroup: test_freezer
# ok 1 test_cgfreezer_simple
# ok 2 test_cgfreezer_tree
# ok 3 test_cgfreezer_forkbomb
# ok 4 test_cgfreezer_mkdir
# ok 5 test_cgfreezer_rmdir
# ok 6 test_cgfreezer_migrate
# Cgroup /sys/fs/cgroup/cg_test_ptrace isn't frozen
# not ok 7 test_cgfreezer_ptrace
# ok 8 test_cgfreezer_stopped
# ok 9 test_cgfreezer_ptraced
# ok 10 test_cgfreezer_vfork
not ok 4 selftests: cgroup: test_freezer # exit=1
<trim>
selftests: cgroup: test_kmem
#
not ok 7 selftests: cgroup: test_kmem # TIMEOUT 45 seconds
<trim>
# selftests: cgroup: test_memcontrol
# ok 1 test_memcg_subtree_control
# not ok 2 test_memcg_current_peak
# not ok 3 test_memcg_min
# not ok 4 test_memcg_low
# not ok 5 test_memcg_high
# ok 6 test_memcg_high_sync
[ 270.699078] test_memcontrol invoked oom-killer:
gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 270.699921] CPU: 1 UID: 0 PID: 946 Comm: test_memcontrol Not
tainted 6.14.0-rc5-next-20250303 #1
[ 270.699930] Hardware name: Radxa ROCK Pi 4B (DT)
<trim>
[ 270.729527] Memory cgroup out of memory: Killed process 946
(test_memcontrol) total-vm:104840kB, anon-rss:30596kB,
file-rss:1056kB, shmem-rss:0kB, UID:0 pgtables:104kB oom_score_adj:0
# not ok 7 test_memcg_max
# not ok 8 test_memcg_reclaim
<trim>
not ok 8 selftests: cgroup: test_memcontrol # exit=1
## Source
* Kernel version: 6.14.0-rc5-next-20250303
* Git tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
* Git sha: cd3215bbcb9d4321def93fea6cfad4d5b42b9d1d
* Git describe: 6.14.0-rc5-next-20250303
* Project details:
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250303/
## Test data
* Test log: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250303/te…
* Test history:
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250303/te…
* Test details:
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250303/te…
* Test logs rock pi:
https://lkft.validation.linaro.org/scheduler/job/8148789#L1774
* Test logs x86: https://lkft.validation.linaro.org/scheduler/job/8148731#L1948
--
Linaro LKFT
https://lkft.linaro.org
FW_CS_DSP gets enabled whenever KUNIT is enabled, because the test
selects it. The test should instead depend on whether the feature is
enabled. Fix this by moving FW_CS_DSP to the depends on clause, and set
CONFIG_FW_CS_DSP=y in the kunit tooling config.
Fixes: dd0b6b1f29b9 ("firmware: cs_dsp: Add KUnit testing of bin file download")
Signed-off-by: Nico Pache <npache@redhat.com>
---
drivers/firmware/cirrus/Kconfig | 3 +--
tools/testing/kunit/configs/all_tests.config | 2 ++
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/firmware/cirrus/Kconfig b/drivers/firmware/cirrus/Kconfig
index 0a883091259a..989568ab5712 100644
--- a/drivers/firmware/cirrus/Kconfig
+++ b/drivers/firmware/cirrus/Kconfig
@@ -11,9 +11,8 @@ config FW_CS_DSP_KUNIT_TEST_UTILS
config FW_CS_DSP_KUNIT_TEST
tristate "KUnit tests for Cirrus Logic cs_dsp" if !KUNIT_ALL_TESTS
- depends on KUNIT && REGMAP
+ depends on KUNIT && REGMAP && FW_CS_DSP
default KUNIT_ALL_TESTS
- select FW_CS_DSP
select FW_CS_DSP_KUNIT_TEST_UTILS
help
This builds KUnit tests for cs_dsp.
diff --git a/tools/testing/kunit/configs/all_tests.config b/tools/testing/kunit/configs/all_tests.config
index b0049be00c70..96c6b4aca87d 100644
--- a/tools/testing/kunit/configs/all_tests.config
+++ b/tools/testing/kunit/configs/all_tests.config
@@ -49,3 +49,5 @@ CONFIG_SOUND=y
CONFIG_SND=y
CONFIG_SND_SOC=y
CONFIG_SND_SOC_TOPOLOGY_BUILD=y
+
+CONFIG_FW_CS_DSP=y
\ No newline at end of file
--
2.48.1
The u8 range-check test, originally introduced to sysctl-test.c by commit
b5ffbd139688 ("sysctl: move the extra1/2 boundary check of u8 to
sysctl_check_table_array"), has been shown to lead to a panic under
certain conditions related to a dangling registration.
This series moves the u8 test to lib/test_sysctl.c where the
registration calls are kept and correctly removed on module exit. An
additional 0012 test is added to selftests/sysctl/sysctl.sh in order to
visualize the registration calls done in test_sysctl.c.
Very much related to adding tests to sysctl, the last two patches of
this series reduce the places that need to be changed when tests are
added by managing the initialization and closing of sysctl tables with a
for loop.
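For illustration, a hedged sketch of that for-loop pattern; all helper
and array names here are hypothetical, not the actual identifiers used
in lib/test_sysctl.c:

#include <linux/module.h>
#include <linux/sysctl.h>

/* Stubs standing in for real per-test registration helpers. */
static int setup_basic_table(void) { return 0; }
static int setup_empty_dir(void) { return 0; }

static int (*const sysctl_test_setups[])(void) = {
        setup_basic_table,
        setup_empty_dir,
};

/* Headers stashed by the setup functions for teardown. */
static struct ctl_table_header *ctl_headers[8];

static int __init test_sysctl_init(void)
{
        int i, err = 0;

        /* Adding a test now means adding one array entry only. */
        for (i = 0; i < ARRAY_SIZE(sysctl_test_setups); i++) {
                err = sysctl_test_setups[i]();
                if (err)
                        break;
        }
        return err;
}

static void __exit test_sysctl_exit(void)
{
        int i;

        for (i = 0; i < ARRAY_SIZE(ctl_headers); i++)
                if (ctl_headers[i])
                        unregister_sysctl_table(ctl_headers[i]);
}

module_init(test_sysctl_init);
module_exit(test_sysctl_exit);
MODULE_LICENSE("GPL");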
Comments are greatly appreciated
Signed-off-by: Joel Granados <joel.granados@kernel.org>
---
Joel Granados (4):
sysctl: move u8 register test to lib/test_sysctl.c
sysctl: Add 0012 to test the u8 range check
sysctl: call sysctl tests with a for loop
sysctl: Close test ctl_headers with a for loop
kernel/sysctl-test.c | 49 ------------
lib/test_sysctl.c | 133 +++++++++++++++++++++----------
tools/testing/selftests/sysctl/sysctl.sh | 30 +++++++
3 files changed, 122 insertions(+), 90 deletions(-)
---
base-commit: 7eb172143d5508b4da468ed59ee857c6e5e01da6
change-id: 20250321-jag-test_extra_val-40954050a1f6
Best regards,
--
Joel Granados <joel.granados@kernel.org>
This series takes care of a few bugs and missing features with the aim of
improving the test coverage of sockmap/sockhash.
The last patch is a create_pair() rewrite making use of
__attribute__((cleanup)) to handle socket fd lifetimes.
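For readers unfamiliar with the technique, a minimal standalone sketch
of the __attribute__((cleanup)) pattern (not the actual selftest code):

#include <sys/socket.h>
#include <unistd.h>

static void close_fd(int *fd)
{
        if (*fd >= 0)
                close(*fd);
}

static int do_echo_roundtrip(void)
{
        /*
         * close_fd() runs automatically whenever fd goes out of
         * scope, on every return path, removing the need for
         * explicit close() calls or goto-based unwinding.
         */
        int fd __attribute__((cleanup(close_fd))) =
                socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
                return -1;
        /* ... connect, send/recv, assertions ... */
        return 0;
}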
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
Changes in v2:
- Rebase on bpf-next (Jakub)
- Use cleanup helpers from kernel's cleanup.h (Jakub)
- Fix subject of patch 3, rephrase patch 4, use correct prefix
- Link to v1: https://lore.kernel.org/r/20240724-sockmap-selftest-fixes-v1-0-46165d224712…
Changes in v1:
- No declarations in function body (Jakub)
- Don't touch output arguments until function succeeds (Jakub)
- Link to v0: https://lore.kernel.org/netdev/027fdb41-ee11-4be0-a493-22f28a1abd7c@rbox.co/
---
Michal Luczaj (6):
selftests/bpf: Support more socket types in create_pair()
selftests/bpf: Socket pair creation, cleanups
selftests/bpf: Simplify inet_socketpair() and vsock_socketpair_connectible()
selftests/bpf: Honour the sotype of af_unix redir tests
selftests/bpf: Exercise SOCK_STREAM unix_inet_redir_to_connected()
selftests/bpf: Introduce __attribute__((cleanup)) in create_pair()
.../selftests/bpf/prog_tests/sockmap_basic.c | 28 ++--
.../selftests/bpf/prog_tests/sockmap_helpers.h | 149 ++++++++++++++-------
.../selftests/bpf/prog_tests/sockmap_listen.c | 117 ++--------------
3 files changed, 124 insertions(+), 170 deletions(-)
---
base-commit: 92cc2456e9775dc4333fb4aa430763ae4ac2f2d9
change-id: 20240729-selftest-sockmap-fixes-bcca996e143b
Best regards,
--
Michal Luczaj <mhal@rbox.co>
We can reproduce the issue using the existing test program
'./test_sockmap --ktls', or use the selftest I provided, which will cause
a panic:
------------[ cut here ]------------
kernel BUG at lib/iov_iter.c:629!
PKRU: 55555554
Call Trace:
<TASK>
? die+0x36/0x90
? do_trap+0xdd/0x100
? iov_iter_revert+0x178/0x180
? iov_iter_revert+0x178/0x180
? do_error_trap+0x7d/0x110
? iov_iter_revert+0x178/0x180
? exc_invalid_op+0x50/0x70
? iov_iter_revert+0x178/0x180
? asm_exc_invalid_op+0x1a/0x20
? iov_iter_revert+0x178/0x180
? iov_iter_revert+0x5c/0x180
tls_sw_sendmsg_locked.isra.0+0x794/0x840
tls_sw_sendmsg+0x52/0x80
? inet_sendmsg+0x1f/0x70
__sys_sendto+0x1cd/0x200
? find_held_lock+0x2b/0x80
? syscall_trace_enter+0x140/0x270
? __lock_release.isra.0+0x5e/0x170
? find_held_lock+0x2b/0x80
? syscall_trace_enter+0x140/0x270
? lockdep_hardirqs_on_prepare+0xda/0x190
? ktime_get_coarse_real_ts64+0xc2/0xd0
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x90/0x170
1. It looks like the issue started occurring after BPF was introduced to
ktls; the later addition of assertions to iov_iter then turned it into a
panic. If my Fixes tag is incorrect, please help me correct it.
2. I made minimal changes for now; they are enough to make ktls work
correctly.
---
v1->v2: Added more content to the commit message
https://lore.kernel.org/all/20250123171552.57345-1-mrpre@163.com/#r
---
Jiayuan Chen (2):
bpf: fix ktls panic with sockmap
selftests/bpf: add ktls selftest
net/tls/tls_sw.c | 8 +-
.../selftests/bpf/prog_tests/sockmap_ktls.c | 174 +++++++++++++++++-
.../selftests/bpf/progs/test_sockmap_ktls.c | 26 +++
3 files changed, 205 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_ktls.c
--
2.47.1