The vIOMMU object is designed to represent a slice of an IOMMU HW for its
virtualization features shared with or passed to user space (a VM mostly)
in a way of HW acceleration. This extended the HWPT-based design for more
advanced virtualization feature.
HW QUEUE introduced by this series as a part of the vIOMMU infrastructure
represents a HW accelerated queue/buffer for VM to use exclusively, e.g.
- NVIDIA's Virtual Command Queue
- AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer
each of which allows its IOMMU HW to directly access a queue memory owned
by a guest VM and allows a guest OS to control the HW queue direclty, to
avoid VM Exit overheads to improve the performance.
Introduce IOMMUFD_OBJ_HW_QUEUE and its pairing IOMMUFD_CMD_HW_QUEUE_ALLOC
allowing VMM to forward the IOMMU-specific queue info, such as queue base
address, size, and etc.
Meanwhile, a guest-owned queue needs the guest kernel to control the queue
by reading/writing its consumer and producer indexes, via MMIO acceses to
the hardware MMIO registers. Introduce an mmap infrastructure for iommufd
to support passing through a piece of MMIO region from the host physical
address space to the guest physical address space. The mmap info (offset/
length) used by an mmap syscall must be pre-allocated and returned to the
user space via an output driver-data during an IOMMUFD_CMD_HW_QUEUE_ALLOC
call. Thus, it requires a driver-specific user data support in the vIOMMU
allocation flow.
As a real-world use case, this series implements a HW QUEUE support in the
tegra241-cmdqv driver for VCMDQs on NVIDIA Grace CPU. In another word, it
is also the Tegra CMDQV series Part-2 (user-space support), reworked from
Previous RFCv1:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This enables the HW accelerated feature for NVIDIA Grace CPU. Compared to
the standard SMMUv3 operating in the nested translation mode trapping CMDQ
for TLBI and ATC_INV commands, this gives a huge performance improvement:
70% to 90% reductions of invalidation time were measured by various DMA
unmap tests running in a guest OS.
// Unmap latencies from "dma_map_benchmark -g @granule -t @threads",
// by toggling "/sys/kernel/debug/iommu/tegra241_cmdqv/bypass_vcmdq"
@granule | @threads | bypass_vcmdq=1 | bypass_vcmdq=0
4KB 1 35.7 us 5.3 us
16KB 1 41.8 us 6.8 us
64KB 1 68.9 us 9.9 us
128KB 1 109.0 us 12.6 us
256KB 1 187.1 us 18.0 us
4KB 2 96.9 us 6.8 us
16KB 2 97.8 us 7.5 us
64KB 2 151.5 us 10.7 us
128KB 2 257.8 us 12.7 us
256KB 2 443.0 us 17.9 us
This is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_hw_queue-v6
Paring QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_hw_queue-v6
Changelog
v6
* Rebase on iommufd_hw_queue-prep-v2
* Add Reviewed-by from Kevin and Jason
* [iommufd] Update kdocs and notes
* [iommufd] Drop redundant pages[i] check
* [iommufd] Allow nesting_parent_iova to be 0
* [iommufd] Add iommufd_hw_queue_alloc_phys()
* [iommufd] Revise iommufd_viommu_alloc/destroy_mmap APIs
* [iommufd] Move destroy ops to vdevice/hw_queue structures
* [iommufd] Add union in hw_info struct to share out_data_type field
* [iommufd] Replace iopt_pin/unpin_pages() with internal access APIs
* [iommufd] Replace vdevice_alloc with vdevice_size and vdevice_init
* [iommufd] Replace hw_queue_alloc with get_hw_queue_size/hw_queue_init
* [iommufd] Replace IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA with init_phys
* [smmu] Drop arm_smmu_domain_ipa_to_pa
* [smmu] Update arm_smmu_impl_ops changes for vsmmu_init
* [tegra] Add a vdev_to_vsid macro
* [tegra] Add lvcmdq_mutex to protect multi queues
* [tegra] Drop duplicated kcalloc for vintf->lvcmdqs (memory leak)
v5
https://lore.kernel.org/all/cover.1747537752.git.nicolinc@nvidia.com/
* Rebase on v6.15-rc6
* Add Reviewed-by from Jason and Kevin
* Correct typos in kdoc and update commit logs
* [iommufd] Add a cosmetic fix
* [iommufd] Drop unused num_pfns
* [iommufd] Drop unnecessary check
* [iommufd] Reorder patch sequence
* [iommufd] Use io_remap_pfn_range()
* [iommufd] Use success oriented flow
* [iommufd] Fix max_npages calculation
* [iommufd] Add more selftest coverage
* [iommufd] Drop redundant static_assert
* [iommufd] Fix mmap pfn range validation
* [iommufd] Reject unmap on pinned iovas
* [iommufd] Drop redundant vm_flags_set()
* [iommufd] Drop iommufd_struct_destroy()
* [iommufd] Drop redundant queue iova test
* [iommufd] Use "mmio_addr" and "mmio_pfn"
* [iommufd] Rename to "nesting_parent_iova"
* [iommufd] Make iopt_pin_pages call option
* [iommufd] Add ictx comparison in depend()
* [iommufd] Add iommufd_object_alloc_ucmd()
* [iommufd] Move kcalloc() after validations
* [iommufd] Replace ictx setting with WARN_ON
* [iommufd] Make hw_info's type bidirectional
* [smmu] Add supported_vsmmu_type in impl_ops
* [smmu] Drop impl report in smmu vendor struct
* [tegra] Add IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV
* [tegra] Replace "number of VINTFs" with a note
* [tegra] Drop the redundant lvcmdq pointer setting
* [tegra] Flag IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA
* [tegra] Use "vintf_alloc_vsid" for vdevice_alloc op
v4
https://lore.kernel.org/all/cover.1746757630.git.nicolinc@nvidia.com/
* Rebase on v6.15-rc5
* Add Reviewed-by from Vasant
* Rename "vQUEUE" to "HW QUEUE"
* Use "offset" and "length" for all mmap-related variables
* [iommufd] Use u64 for guest PA
* [iommufd] Fix typo in uAPI doc
* [iommufd] Rename immap_id to offset
* [iommufd] Drop the partial-size mmap support
* [iommufd] Do not replace WARN_ON with WARN_ON_ONCE
* [iommufd] Use "u64 base_addr" for queue base address
* [iommufd] Use u64 base_pfn/num_pfns for immap structure
* [iommufd] Correct the size passed in to mtree_alloc_range()
* [iommufd] Add IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA to viommu_ops
v3
https://lore.kernel.org/all/cover.1746139811.git.nicolinc@nvidia.com/
* Add Reviewed-by from Baolu, Pranjal, and Alok
* Revise kdocs, uAPI docs, and commit logs
* Rename "vCMDQ" back to "vQUEUE" for AMD cases
* [tegra] Add tegra241_vcmdq_hw_flush_timeout()
* [tegra] Rename vsmmu_alloc to alloc_vintf_user
* [tegra] Use writel for SID replacement registers
* [tegra] Move mmap removal call to vsmmu_destroy op
* [tegra] Fix revert in tegra241_vintf_alloc_lvcmdq_user()
* [iommufd] Replace "& ~PAGE_MASK" with PAGE_ALIGNED()
* [iommufd] Add an object-type "owner" to immap structure
* [iommufd] Drop the ictx input in the new for-driver APIs
* [iommufd] Add iommufd_vma_ops to keep track of mmap lifecycle
* [iommufd] Add viommu-based iommufd_viommu_alloc/destroy_mmap helpers
* [iommufd] Rename iommufd_ctx_alloc/free_mmap to
_iommufd_alloc/destroy_mmap
v2
https://lore.kernel.org/all/cover.1745646960.git.nicolinc@nvidia.com/
* Add Reviewed-by from Jason
* [smmu] Fix vsmmu initial value
* [smmu] Support impl for hw_info
* [tegra] Rename "slot" to "vsid"
* [tegra] Update kdocs and commit logs
* [tegra] Map/unmap LVCMDQ dynamically
* [tegra] Refcount the previous LVCMDQ
* [tegra] Return -EEXIST if LVCMDQ exists
* [tegra] Simplify VINTF cleanup routine
* [tegra] Use vmid and s2_domain in vsmmu
* [tegra] Rename "mmap_pgoff" to "immap_id"
* [tegra] Add more addr and length validation
* [iommufd] Add more narrative to mmap's kdoc
* [iommufd] Add iommufd_struct_depend/undepend()
* [iommufd] Rename vcmdq_free op to vcmdq_destroy
* [iommufd] Fix bug in iommu_copy_struct_to_user()
* [iommufd] Drop is_io from iommufd_ctx_alloc_mmap()
* [iommufd] Test the queue memory for its contiguity
* [iommufd] Return -ENXIO if address or length fails
* [iommufd] Do not change @min_last in mock_viommu_alloc()
* [iommufd] Generalize TEGRA241_VCMDQ data in core structure
* [iommufd] Add selftest coverage for IOMMUFD_CMD_VCMDQ_ALLOC
* [iommufd] Add iopt_pin_pages() to prevent queue memory from unmapping
v1
https://lore.kernel.org/all/cover.1744353300.git.nicolinc@nvidia.com/
Thanks
Nicolin
Nicolin Chen (25):
iommu: Add iommu_copy_struct_to_user helper
iommu: Pass in a driver-level user data structure to viommu_init op
iommufd/viommu: Allow driver-specific user data for a vIOMMU object
iommufd/selftest: Support user_data in mock_viommu_alloc
iommufd/selftest: Add coverage for viommu data
iommufd/access: Allow access->ops to be NULL for internal use
iommufd/access: Add internal APIs for HW queue to use
iommufd/viommu: Add driver-defined vDEVICE support
iommufd/viommu: Introduce IOMMUFD_OBJ_HW_QUEUE and its related struct
iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl
iommufd/driver: Add iommufd_hw_queue_depend/undepend() helpers
iommufd/selftest: Add coverage for IOMMUFD_CMD_HW_QUEUE_ALLOC
iommufd: Add mmap interface
iommufd/selftest: Add coverage for the new mmap interface
Documentation: userspace-api: iommufd: Update HW QUEUE
iommu: Allow an input type in hw_info op
iommufd: Allow an input data_type via iommu_hw_info
iommufd/selftest: Update hw_info coverage for an input data_type
iommu/arm-smmu-v3-iommufd: Add vsmmu_size/type and vsmmu_init impl ops
iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops
iommu/tegra241-cmdqv: Use request_threaded_irq
iommu/tegra241-cmdqv: Simplify deinit flow in
tegra241_cmdqv_remove_vintf()
iommu/tegra241-cmdqv: Do not statically map LVCMDQs
iommu/tegra241-cmdqv: Add user-space use support
iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 16 +-
drivers/iommu/iommufd/iommufd_private.h | 35 +-
drivers/iommu/iommufd/iommufd_test.h | 20 +
include/linux/iommu.h | 49 +-
include/linux/iommufd.h | 156 ++++++
include/uapi/linux/iommufd.h | 145 +++++-
tools/testing/selftests/iommu/iommufd_utils.h | 89 +++-
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 25 +-
.../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 481 +++++++++++++++++-
drivers/iommu/intel/iommu.c | 4 +
drivers/iommu/iommufd/device.c | 84 ++-
drivers/iommu/iommufd/driver.c | 79 +++
drivers/iommu/iommufd/io_pagetable.c | 17 +-
drivers/iommu/iommufd/main.c | 70 +++
drivers/iommu/iommufd/selftest.c | 150 +++++-
drivers/iommu/iommufd/viommu.c | 218 +++++++-
tools/testing/selftests/iommu/iommufd.c | 143 +++++-
.../selftests/iommu/iommufd_fail_nth.c | 15 +-
Documentation/userspace-api/iommufd.rst | 12 +
19 files changed, 1708 insertions(+), 100 deletions(-)
--
2.43.0
Hi,
I noticed that only the RSEQ membarrier command allows specifying a
specific cpu. I have a (extremely toy) lib I was playing around with [1] and
noticed this and that being able to specify the cpu would be useful to
me.
I'm by no means an expert in this code though - and so could be
totally missing something.
Additionally this seems really difficult to actually test. I added a
self test that just proces "some" interrupts are being sent, but
nothing more than that. I don't know what else can be done there.
Patch 1 is the main change
Patch 2 is the self-test. Which maybe you don't want at all.
[1]: https://github.com/DylanZA/rseqlock/commit/be7bc7214fd5aacec47e26126118f8bb…
Thanks,
Dylan
Dylan Yudaken (2):
membarrier: allow cpu_id to be set on more commands
membarrier: self test for cpu specific calls
kernel/sched/membarrier.c | 6 +-
tools/testing/selftests/membarrier/.gitignore | 1 +
tools/testing/selftests/membarrier/Makefile | 3 +-
.../membarrier/membarrier_test_expedited.c | 135 ++++++++++++++++++
.../membarrier/membarrier_test_impl.h | 5 +
5 files changed, 148 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/membarrier/membarrier_test_expedited.c
base-commit: ee88bddf7f2f5d1f1da87dd7bedc734048b70e88
--
2.49.0
To accommodate varying hardware performance and use cases,
the default kunit test case timeout (currently 300 seconds)
is now configurable. Users can adjust the timeout by
either setting the 'timeout' module parameter or the
KUNIT_DEFAULT_TIMEOUT Kconfig option to their desired
timeout in seconds.
Signed-off-by: Marie Zhussupova <marievic(a)google.com>
---
lib/kunit/Kconfig | 13 +++++++++++++
lib/kunit/test.c | 15 ++++++++-------
2 files changed, 21 insertions(+), 7 deletions(-)
diff --git a/lib/kunit/Kconfig b/lib/kunit/Kconfig
index a97897edd964..c10ede4b1d22 100644
--- a/lib/kunit/Kconfig
+++ b/lib/kunit/Kconfig
@@ -93,4 +93,17 @@ config KUNIT_AUTORUN_ENABLED
In most cases this should be left as Y. Only if additional opt-in
behavior is needed should this be set to N.
+config KUNIT_DEFAULT_TIMEOUT
+ int "Default value of the timeout module parameter"
+ default 300
+ help
+ Sets the default timeout, in seconds, for Kunit test cases. This value
+ is further multiplied by a factor determined by the assigned speed
+ setting: 1x for `DEFAULT`, 3x for `KUNIT_SPEED_SLOW`, and 12x for
+ `KUNIT_SPEED_VERY_SLOW`. This allows slower tests on slower machines
+ sufficient time to complete.
+
+ If unsure, the default timeout of 300 seconds is suitable for most
+ cases.
+
endif # KUNIT
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index 002121675605..f3c6b11f12b8 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -69,6 +69,13 @@ static bool enable_param;
module_param_named(enable, enable_param, bool, 0);
MODULE_PARM_DESC(enable, "Enable KUnit tests");
+/*
+ * Configure the base timeout.
+ */
+static unsigned long kunit_base_timeout = CONFIG_KUNIT_DEFAULT_TIMEOUT;
+module_param_named(timeout, kunit_base_timeout, ulong, 0644);
+MODULE_PARM_DESC(timeout, "Set the base timeout for Kunit test cases");
+
/*
* KUnit statistic mode:
* 0 - disabled
@@ -393,12 +400,6 @@ static int kunit_timeout_mult(enum kunit_speed speed)
static unsigned long kunit_test_timeout(struct kunit_suite *suite, struct kunit_case *test_case)
{
int mult = 1;
- /*
- * TODO: Make the default (base) timeout configurable, so that users with
- * particularly slow or fast machines can successfully run tests, while
- * still taking advantage of the relative speed.
- */
- unsigned long default_timeout = 300;
/*
* The default test timeout is 300 seconds and will be adjusted by mult
@@ -409,7 +410,7 @@ static unsigned long kunit_test_timeout(struct kunit_suite *suite, struct kunit_
mult = kunit_timeout_mult(suite->attr.speed);
if (test_case->attr.speed != KUNIT_SPEED_UNSET)
mult = kunit_timeout_mult(test_case->attr.speed);
- return mult * default_timeout * msecs_to_jiffies(MSEC_PER_SEC);
+ return mult * kunit_base_timeout * msecs_to_jiffies(MSEC_PER_SEC);
}
--
2.50.0.rc2.761.g2dc52ea45b-goog
A few selftest harness changes being merged to v6.16, which exposed some
bugs and vulnerabilities in the iommufd selftest code. Fix them properly.
Note that the patch fixing the build warnings at mfd is not ideal, as it
has possibly hit some corner case in the gcc:
https://lore.kernel.org/all/aEi8DV+ReF3v3Rlf@nvidia.com/
This is on github:
https://github.com/nicolinc/iommufd/commits/iommufd_selftest_fixes-v6.16
Changelog:
v2
* Add "Reviewed-by" from Jason
* Only use kfree() in the teardown()
* Add an mmap_buffer_size for readability
v1
https://lore.kernel.org/all/cover.1750049883.git.nicolinc@nvidia.com/
Thanks
Nicolin
Nicolin Chen (4):
iommufd/selftest: Fix iommufd_dirty_tracking with large hugepage sizes
iommufd/selftest: Add missing close(mfd) in memfd_mmap()
iommufd/selftest: Add asserts testing global mfd
iommufd/selftest: Fix build warnings due to uninitialized mfd
tools/testing/selftests/iommu/iommufd_utils.h | 9 ++++-
tools/testing/selftests/iommu/iommufd.c | 40 ++++++++++++++-----
2 files changed, 36 insertions(+), 13 deletions(-)
--
2.43.0
This patch series fixes some of the false positives in generic
mm selftests and skips tests that cannot run correctly due to
missing features or system limitations.
Please let us know if you have any feedback.
Thanks,
Aboorva
Aboorva Devarajan (2):
selftests/mm: Fix child process exit codes in KSM tests
selftests/mm: Mark thuge-gen as skipped if shmmax is too small or no
1G pages
Donet Tom (4):
mm/selftests: Fix virtual_address_range test issues.
selftest/mm: Fix ksm_funtional_test failures
selftests/mm : fix test_prctl_fork_exec failure
mm/selftests: Fix split_huge_page_test failure on systems with 64KB
page size
.../selftests/mm/ksm_functional_tests.c | 24 +++++++++++++------
.../selftests/mm/split_huge_page_test.c | 23 ++++++++++++++----
tools/testing/selftests/mm/thuge-gen.c | 11 +++++----
.../selftests/mm/virtual_address_range.c | 14 +++--------
4 files changed, 45 insertions(+), 27 deletions(-)
--
2.43.5
Add support for SuperH/"sh" to nolibc.
Only sh4 is tested for now.
This is only tested on QEMU so far.
Additional testing would be very welcome.
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Thomas Weißschuh (3):
selftests/nolibc: fix EXTRACONFIG variables ordering
selftests/nolibc: use file driver for QEMU serial
tools/nolibc: add support for SuperH
tools/include/nolibc/arch-sh.h | 162 ++++++++++++++++++++++++++++
tools/include/nolibc/arch.h | 2 +
tools/testing/selftests/nolibc/Makefile | 15 ++-
tools/testing/selftests/nolibc/run-tests.sh | 3 +-
4 files changed, 177 insertions(+), 5 deletions(-)
---
base-commit: 6275a61db2f0586b8a5d651dfc7b4aacf9d0b2d6
change-id: 20250528-nolibc-sh-8b4e3bb8efcb
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
Reading /proc/pid/maps requires read-locking mmap_lock which prevents any
other task from concurrently modifying the address space. This guarantees
coherent reporting of virtual address ranges, however it can block
important updates from happening. Oftentimes /proc/pid/maps readers are
low priority monitoring tasks and them blocking high priority tasks
results in priority inversion.
Locking the entire address space is required to present fully coherent
picture of the address space, however even current implementation does not
strictly guarantee that by outputting vmas in page-size chunks and
dropping mmap_lock in between each chunk. Address space modifications are
possible while mmap_lock is dropped and userspace reading the content is
expected to deal with possible concurrent address space modifications.
Considering these relaxed rules, holding mmap_lock is not strictly needed
as long as we can guarantee that a concurrently modified vma is reported
either in its original form or after it was modified.
This patchset switches from holding mmap_lock while reading /proc/pid/maps
to taking per-vma locks as we walk the vma tree. This reduces the
contention with tasks modifying the address space because they would have
to contend for the same vma as opposed to the entire address space. Same
is done for PROCMAP_QUERY ioctl which locks only the vma that fell into
the requested range instead of the entire address space. Previous version
of this patchset [1] tried to perform /proc/pid/maps reading under RCU,
however its implementation is quite complex and the results are worse than
the new version because it still relied on mmap_lock speculation which
retries if any part of the address space gets modified. New implementaion
is both simpler and results in less contention. Note that similar approach
would not work for /proc/pid/smaps reading as it also walks the page table
and that's not RCU-safe.
Paul McKenney's designed a test [2] to measure mmap/munmap latencies while
concurrently reading /proc/pid/maps. The test has a pair of processes
scanning /proc/PID/maps, and another process unmapping and remapping 4K
pages from a 128MB range of anonymous memory. At the end of each 10
second run, the latency of each mmap() or munmap() operation is measured,
and for each run the maximum and mean latency is printed. The map/unmap
process is started first, its PID is passed to the scanners, and then the
map/unmap process waits until both scanners are running before starting
its timed test. The scanners keep scanning until the specified
/proc/PID/maps file disappears. This test registered close to 10x
improvement in update latencies:
Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.011 0.008 0.455
0.011 0.008 0.472
0.011 0.008 0.535
0.011 0.009 0.545
...
0.011 0.014 2.875
0.011 0.014 2.913
0.011 0.014 3.007
0.011 0.015 3.018
After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.006 0.005 0.036
0.006 0.005 0.039
0.006 0.005 0.039
0.006 0.005 0.039
...
0.006 0.006 0.403
0.006 0.006 0.474
0.006 0.006 0.479
0.006 0.006 0.498
The patchset also adds a number of tests to check for /proc/pid/maps data
coherency. They are designed to detect any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps might have
inconsistent data because the file is read page-by-page with mmap_lock
being dropped between the pages. An example of user-visible inconsistency
can be that the same vma is printed twice: once before it was modified and
then after the modifications. For example if vma was extended, it might be
found and reported twice. What is not expected is to see a gap where there
should have been a vma both before and after modification. This patchset
increases the chances of such tearing, therefore it's even more important
now to test for unexpected inconsistencies.
In [3] Lorenzo identified the following possible vma merging/splitting
scenarios:
Merges with changes to existing vmas:
1 Merge both - mapping a vma over another one and between two vmas which
can be merged after this replacement;
2. Merge left full - mapping a vma at the end of an existing one and
completely over its right neighbor;
3. Merge left partial - mapping a vma at the end of an existing one and
partially over its right neighbor;
4. Merge right full - mapping a vma before the start of an existing one
and completely over its left neighbor;
5. Merge right partial - mapping a vma before the start of an existing one
and partially over its left neighbor;
Merges without changes to existing vmas:
6. Merge both - mapping a vma into a gap between two vmas which can be
merged after the insertion;
7. Merge left - mapping a vma at the end of an existing one;
8. Merge right - mapping a vma before the start end of an existing one;
Splits
9. Split with new vma at the lower address;
10. Split with new vma at the higher address;
If such merges or splits happen concurrently with the /proc/maps reading
we might report a vma twice, once before the modification and once after
it is modified:
Case 1 might report overwritten and previous vma along with the final
merged vma;
Case 2 might report previous and the final merged vma;
Case 3 might cause us to retry once we detect the temporary gap caused by
shrinking of the right neighbor;
Case 4 might report overritten and the final merged vma;
Case 5 might cause us to retry once we detect the temporary gap caused by
shrinking of the left neighbor;
Case 6 might report previous vma and the gap along with the final marged
vma;
Case 7 might report previous and the final merged vma;
Case 8 might report the original gap and the final merged vma covering the
gap;
Case 9 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma start;
Case 10 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma end;
In all these cases the retry mechanism prevents us from reporting possible
temporary gaps.
Changes from v4 [4]:
- refactored trylock_vma() and other locking parts into mmap_lock.c, per
Lorenzo
- renamed {lock|unlock}_content() into {lock|unlock}_vma_range(), per
Lorenzo
- added clarifying comments for sentinels, per Lorenzo
- introduced is_sentinel_pos() helper function
- fixed position reset logic when last_addr is a sentinel, per Lorenzo
- added Acked-by to the last patch, per Andrii Nakryiko
[1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test
[3] https://lore.kernel.org/all/e1863f40-39ab-4e5b-984a-c48765ffde1c@lucifer.lo…
[4] https://lore.kernel.org/all/20250604231151.799834-1-surenb@google.com/
Suren Baghdasaryan (7):
selftests/proc: add /proc/pid/maps tearing from vma split test
selftests/proc: extend /proc/pid/maps tearing test to include vma
resizing
selftests/proc: extend /proc/pid/maps tearing test to include vma
remapping
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
selftests/proc: add verbose more for tests to facilitate debugging
mm/maps: read proc/pid/maps under per-vma lock
mm/maps: execute PROCMAP_QUERY ioctl under per-vma locks
fs/proc/internal.h | 5 +
fs/proc/task_mmu.c | 179 ++++-
include/linux/mmap_lock.h | 11 +
mm/mmap_lock.c | 88 +++
tools/testing/selftests/proc/proc-pid-vm.c | 793 ++++++++++++++++++++-
5 files changed, 1053 insertions(+), 23 deletions(-)
base-commit: 0b2a863368fb0cf674b40925c55dc8898c5a33af
--
2.50.0.714.g196bf9f422-goog