FEAT_LSFE is optional from v9.5 and adds new instructions for atomic
memory operations with floating point values. We have no immediate use
for it in the kernel, so provide a hwcap so userspace can discover it,
and allow the ID register field to be exposed to KVM guests.
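Hwcap selftests typically cross-check a hwcap against the hardware by executing an instruction and watching for SIGILL. The following is a minimal, portable sketch of that probe technique. It uses sigsetjmp()/siglongjmp() rather than reproducing the selftest's own handler, and the helper names are hypothetical. The two demo callbacks stand in for an arm64 inline-asm LSFE instruction, whose encoding is not reproduced here.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>

static sigjmp_buf probe_env;

/* SIGILL handler: jump back to the probe point instead of returning
 * to (and re-executing) the faulting instruction. */
static void probe_sigill_handler(int sig)
{
	(void)sig;
	siglongjmp(probe_env, 1);
}

/*
 * Run fn() and report whether it completed (true) or raised SIGILL
 * (false). The previous SIGILL disposition is restored afterwards.
 */
static bool probe_insn(void (*fn)(void))
{
	struct sigaction sa, old;
	bool ok;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = probe_sigill_handler;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGILL, &sa, &old) != 0)
		return false;

	/* savemask=1 so the SIGILL block taken during the handler is undone. */
	if (sigsetjmp(probe_env, 1) == 0) {
		fn();
		ok = true;
	} else {
		ok = false;
	}

	sigaction(SIGILL, &old, NULL);
	return ok;
}

/* Hypothetical stand-ins for probing a real instruction. */
static void probe_supported(void) { }
static void probe_unsupported(void) { raise(SIGILL); }
```

On real hardware, the callback would contain the single instruction under test, and the result would be compared with what the hwcap reports.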
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v4:
- Rebase onto arm64/for-next/cpufeature; note that both patches have
build dependencies on this.
- Drop unneeded cc clobber in hwcap.
- Use STRFADD as the instruction probed in hwcap.
- Link to v3: https://lore.kernel.org/r/20250818-arm64-lsfe-v3-0-af6f4d66eb39@kernel.org
Changes in v3:
- Rebase onto v6.17-rc1.
- Link to v2: https://lore.kernel.org/r/20250703-arm64-lsfe-v2-0-eced80999cb4@kernel.org
Changes in v2:
- Fix result of vi dropping in hwcap test.
- Link to v1: https://lore.kernel.org/r/20250627-arm64-lsfe-v1-0-68351c4bf741@kernel.org
---
Mark Brown (2):
KVM: arm64: Expose FEAT_LSFE to guests
kselftest/arm64: Add lsfe to the hwcaps test
arch/arm64/kvm/sys_regs.c | 4 +++-
tools/testing/selftests/arm64/abi/hwcap.c | 21 +++++++++++++++++++++
2 files changed, 24 insertions(+), 1 deletion(-)
---
base-commit: 220928e52cb03d223b3acad3888baf0687486d21
change-id: 20250625-arm64-lsfe-0810cf98adc2
Best regards,
--
Mark Brown <broonie(a)kernel.org>
[Lots of changes in comments thanks to Randy]
Currently each of the iommu page table formats duplicates all of the logic
to maintain the page table and perform map/unmap/etc operations. There are
several different versions of the algorithms across the different
formats. The io-pgtable system provides an interface to help isolate the
page table code from the iommu driver, but doesn't provide tools to
implement the common algorithms.
This makes it very hard to improve the state of the page table code under
the iommu domains, as any proposed improvement needs to alter a large
number of different driver code paths. Combined with a lack of
software-based testing, this makes progress in this area very difficult.
iommufd wants several new page table operations:
- More efficient map/unmap operations, using iommufd's batching logic
- unmap that returns the physical addresses into a batch as it progresses
- cut that allows splitting areas so large pages can have holes
poked in them dynamically (i.e. guestmemfd hitless shared/private
transitions)
- More aggressive freeing of table memory to avoid waste
- Fragmenting large pages so that dirty tracking can be more granular
- Reassembling large pages so that VMs can run at full IO performance
in migration/dirty tracking error flows
- KHO integration for kernel live upgrade
Together these are algorithmically complex enough that implementing them
in all the page table formats we support would be a very significant
task. The "server"-focused drivers alone use almost all the formats
(ARMv8 S1&S2 / x86 PAE / AMDv1 / VT-D SS / RISCV).
Instead of doing the duplicated work, this series takes the first step to
consolidate the algorithms into one place. In spirit it is similar to the
work Christoph did a few years back to pull the redundant get_user_pages()
implementations out of the arch code into core MM. This unlocked a great
deal of improvement in that space in the following years. I would like to
see the same benefit in iommu as well.
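To illustrate the kind of logic that can be written once and parameterized per format, here is a minimal sketch of the index arithmetic for a radix page table. The struct and function names are hypothetical and are not the genpt API. The example parameters (4 KiB granule, 512-entry tables, i.e. 9 index bits per level) are x86-64 style, which gives the 2^12 and 2^21 page sizes used in the benchmarks.

```c
#include <stdint.h>

/* Hypothetical per-format description of a radix page table. */
struct pt_fmt_desc {
	unsigned int granule_shift;  /* log2 of the smallest page size */
	unsigned int bits_per_level; /* log2 of the entries per table */
	unsigned int levels;         /* number of table levels */
};

/* log2 of the size one entry at 'level' maps (level 0 = leaf tables). */
static unsigned int pt_level_shift(const struct pt_fmt_desc *fmt,
				   unsigned int level)
{
	return fmt->granule_shift + level * fmt->bits_per_level;
}

/* Index of the entry covering 'iova' in the table at 'level'. */
static unsigned int pt_index(const struct pt_fmt_desc *fmt, uint64_t iova,
			     unsigned int level)
{
	uint64_t mask = (1ULL << fmt->bits_per_level) - 1;

	return (unsigned int)((iova >> pt_level_shift(fmt, level)) & mask);
}

/* Number of IOVA bits the whole table can translate. */
static unsigned int pt_va_bits(const struct pt_fmt_desc *fmt)
{
	return pt_level_shift(fmt, fmt->levels);
}
```

For example, with { 12, 9, 4 }, IOVA 0x40201000 selects index 1 at levels 0, 1 and 2, and the table covers 48 bits. In a shared design along these lines, each format would mainly supply its entry encoding and decoding, while the walking logic stays common.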
My first RFC showed a bigger picture with almost all formats and more
algorithms. This series reorganizes that to be narrowly focused on just
enough to convert the AMD driver to use the new mechanism.
kunit tests are provided that allow good testing of the algorithms and all
formats on x86; nothing is arch specific.
AMD is one of the simpler options as the HW is quite uniform with few
different options/bugs while still requiring the complicated contiguous
pages support. The HW also has a very simple range based invalidation
approach that is easy to implement.
The AMD v1 and AMD v2 page table formats are implemented bit for bit
identical to the current code, tested using a compare kunit test that
checks against the io-pgtable version (on github, see below).
Updating the AMD driver to replace the io-pgtable layer with the new stuff
is fairly straightforward now. The layering is fixed up in the new version
so that all the invalidation goes through function pointers.
Several small fix patches have come out of this, as I've been fixing the
problems that the test suite uncovers in the current code and
implementing the fixed version in iommupt.
On performance, there is quite a wide variety of implementation designs
across the drivers. Here are some key measurements across the main
formats:
iommu_map():
    pgsz   avg new,old ns   min new,old ns   min % (+ve is better)
    2^12   53,66            51,63             19 (AMDv1)
256*2^12   386,1909         367,1795          79
256*2^21   362,1633         355,1556          77
    2^12   56,62            52,59             11 (AMDv2)
256*2^12   405,1355         357,1292          72
256*2^21   393,1160         358,1114          67
    2^12   55,65            53,62             14 (VTD second stage)
256*2^12   391,518          332,512           35
256*2^21   383,635          336,624           46
    2^12   57,65            55,63             12 (ARM 64 bit)
256*2^12   380,389          361,369            2
256*2^21   358,419          345,400           13
iommu_unmap():
    pgsz   avg new,old ns   min new,old ns   min % (+ve is better)
    2^12   69,88            65,85             23 (AMDv1)
256*2^12   353,6498         331,6029          94
256*2^21   373,6014         360,5706          93
    2^12   71,72            66,69              4 (AMDv2)
256*2^12   228,891          206,871           76
256*2^21   254,721          245,711           65
    2^12   69,87            65,82             20 (VTD second stage)
256*2^12   210,321          200,315           36
256*2^21   255,349          238,342           30
    2^12   72,77            68,74              8 (ARM 64 bit)
256*2^12   521,357          447,346          -29
256*2^21   489,358          433,345          -25
* The numbers above include additional patches that remove the
iommu_pgsize() overheads. Built with gcc 13.3.0, run on an i7-12700.
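The integer part of each min % figure matches (old - new) / old computed on the min timings and truncated toward zero. For the first AMDv1 iommu_map() row, (63 - 51) / 63 is about 19%. A tiny sketch of that arithmetic, with a hypothetical helper name:

```c
/*
 * Percentage improvement of 'new_ns' over 'old_ns', truncated toward
 * zero (C integer division). Positive means the new code is faster.
 */
static long pct_improvement(long new_ns, long old_ns)
{
	return (old_ns - new_ns) * 100 / old_ns;
}
```

Negative results, such as those in the ARM unmap rows, mean the new implementation is slower.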
This version provides fairly consistent performance across formats. ARM
unmap performance is quite different because this version supports
contiguous pages and uses a very different algorithm for unmapping,
though I haven't yet figured out why it is so much worse than AMDv1.
The per-format commits include a more detailed chart.
There is a second branch with supporting work and future steps:
https://github.com/jgunthorpe/linux/commits/iommu_pt_all
It contains:
- ARM short descriptor (32 bit), ARM long descriptor (64 bit) formats
- RISCV format and RISCV conversion
https://github.com/jgunthorpe/linux/commits/iommu_pt_riscv
- Support for a DMA incoherent HW page table walker
- VT-D second stage format and VT-D conversion
https://github.com/jgunthorpe/linux/commits/iommu_pt_vtd
- DART v1 & v2 format
- Draft of an iommufd 'cut' operation to break down huge pages
- A compare test that checks the iommupt formats against the iopgtable
interface, including updating AMD to have a working iopgtable and patches
to make VT-D have an iopgtable for testing.
- A performance test to micro-benchmark map and unmap against iopgtable
My strategy is to go one by one for the drivers:
- AMD driver conversion
- RISCV page table and driver
- Intel VT-D driver and VTDSS page table
- Flushing improvements for RISCV
- ARM SMMUv3
And concurrently work on the algorithm side:
- debugfs content dump, like VT-D has
- Cut support
- Increase/Decrease page size support
- map/unmap batching
- KHO
As we make more algorithm improvements, the value of converting the
drivers increases.
This is on github: https://github.com/jgunthorpe/linux/commits/iommu_pt
v4:
- Text grammar updates and kdoc fixes
v3: https://patch.msgid.link/r/0-v4-0d6a6726a372+18959-iommu_pt_jgg@nvidia.com
- Rebase on v6.16-rc3
- Integrate the HATS/HATDis changes
- Remove 'default n' from kconfig
- Remove unused 'PT_FIXED_TOP_LEVEL'
- Improve comments and documentation
- Fix some compile warnings from kbuild robots
v2: https://patch.msgid.link/r/0-v3-a93aab628dbc+521-iommu_pt_jgg@nvidia.com
- Rebase on v6.16-rc2
- s/PT_ENTRY_WORD_SIZE/PT_ITEM_WORD_SIZE/s to follow the language better
- Comment and documentation updates
- Add PT_TOP_PHYS_MASK to help manage alignment restrictions on the top
pointer
- Add missed force_aperture = true
- Make pt_iommu_deinit() take care of the not-yet-inited error case
internally as AMD/RISCV/VTD all shared this logic
- Change gather_range() into gather_range_pages() so it also deals with
the page list. This makes the following cache flushing series simpler
- Fix missed update of unmap->unmapped in some error cases
- Change clear_contig() to order the gather more logically
- Remove goto from the error handling in __map_range_leaf()
- s/log2_/oalog2_/ in places where the argument is an oaddr_t
- Pass the pts to pt_table_install64/32()
- Do not use SIGN_EXTEND for the AMDv2 page table because of Vasant's
information on how PASID 0 works.
v1: https://patch.msgid.link/r/0-v2-5c26bde5c22d+58b-iommu_pt_jgg@nvidia.com
- AMD driver only, many code changes
RFC: https://lore.kernel.org/all/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/
Alejandro Jimenez (1):
iommu/amd: Use the generic iommu page table
Jason Gunthorpe (14):
genpt: Generic Page Table base API
genpt: Add Documentation/ files
iommupt: Add the basic structure of the iommu implementation
iommupt: Add the AMD IOMMU v1 page table format
iommupt: Add iova_to_phys op
iommupt: Add unmap_pages op
iommupt: Add map_pages op
iommupt: Add read_and_clear_dirty op
iommupt: Add a kunit test for Generic Page Table
iommupt: Add a mock pagetable format for iommufd selftest to use
iommufd: Change the selftest to use iommupt instead of xarray
iommupt: Add the x86 64 bit page table format
iommu/amd: Remove AMD io_pgtable support
iommupt: Add a kunit test for the IOMMU implementation
.clang-format | 1 +
Documentation/driver-api/generic_pt.rst | 140 ++
Documentation/driver-api/index.rst | 1 +
drivers/iommu/Kconfig | 2 +
drivers/iommu/Makefile | 1 +
drivers/iommu/amd/Kconfig | 5 +-
drivers/iommu/amd/Makefile | 2 +-
drivers/iommu/amd/amd_iommu.h | 1 -
drivers/iommu/amd/amd_iommu_types.h | 109 +-
drivers/iommu/amd/io_pgtable.c | 560 --------
drivers/iommu/amd/io_pgtable_v2.c | 370 ------
drivers/iommu/amd/iommu.c | 538 ++++----
drivers/iommu/generic_pt/.kunitconfig | 13 +
drivers/iommu/generic_pt/Kconfig | 67 +
drivers/iommu/generic_pt/fmt/Makefile | 26 +
drivers/iommu/generic_pt/fmt/amdv1.h | 409 ++++++
drivers/iommu/generic_pt/fmt/defs_amdv1.h | 21 +
drivers/iommu/generic_pt/fmt/defs_x86_64.h | 21 +
drivers/iommu/generic_pt/fmt/iommu_amdv1.c | 15 +
drivers/iommu/generic_pt/fmt/iommu_mock.c | 10 +
drivers/iommu/generic_pt/fmt/iommu_template.h | 48 +
drivers/iommu/generic_pt/fmt/iommu_x86_64.c | 11 +
drivers/iommu/generic_pt/fmt/x86_64.h | 248 ++++
drivers/iommu/generic_pt/iommu_pt.h | 1149 +++++++++++++++++
drivers/iommu/generic_pt/kunit_generic_pt.h | 717 ++++++++++
drivers/iommu/generic_pt/kunit_iommu.h | 183 +++
drivers/iommu/generic_pt/kunit_iommu_pt.h | 451 +++++++
drivers/iommu/generic_pt/pt_common.h | 355 +++++
drivers/iommu/generic_pt/pt_defs.h | 323 +++++
drivers/iommu/generic_pt/pt_fmt_defaults.h | 193 +++
drivers/iommu/generic_pt/pt_iter.h | 636 +++++++++
drivers/iommu/generic_pt/pt_log2.h | 130 ++
drivers/iommu/io-pgtable.c | 4 -
drivers/iommu/iommufd/Kconfig | 1 +
drivers/iommu/iommufd/iommufd_test.h | 11 +-
drivers/iommu/iommufd/selftest.c | 438 +++----
include/linux/generic_pt/common.h | 166 +++
include/linux/generic_pt/iommu.h | 270 ++++
include/linux/io-pgtable.h | 2 -
tools/testing/selftests/iommu/iommufd.c | 60 +-
tools/testing/selftests/iommu/iommufd_utils.h | 12 +
41 files changed, 6128 insertions(+), 1592 deletions(-)
create mode 100644 Documentation/driver-api/generic_pt.rst
delete mode 100644 drivers/iommu/amd/io_pgtable.c
delete mode 100644 drivers/iommu/amd/io_pgtable_v2.c
create mode 100644 drivers/iommu/generic_pt/.kunitconfig
create mode 100644 drivers/iommu/generic_pt/Kconfig
create mode 100644 drivers/iommu/generic_pt/fmt/Makefile
create mode 100644 drivers/iommu/generic_pt/fmt/amdv1.h
create mode 100644 drivers/iommu/generic_pt/fmt/defs_amdv1.h
create mode 100644 drivers/iommu/generic_pt/fmt/defs_x86_64.h
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_amdv1.c
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_mock.c
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_template.h
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_x86_64.c
create mode 100644 drivers/iommu/generic_pt/fmt/x86_64.h
create mode 100644 drivers/iommu/generic_pt/iommu_pt.h
create mode 100644 drivers/iommu/generic_pt/kunit_generic_pt.h
create mode 100644 drivers/iommu/generic_pt/kunit_iommu.h
create mode 100644 drivers/iommu/generic_pt/kunit_iommu_pt.h
create mode 100644 drivers/iommu/generic_pt/pt_common.h
create mode 100644 drivers/iommu/generic_pt/pt_defs.h
create mode 100644 drivers/iommu/generic_pt/pt_fmt_defaults.h
create mode 100644 drivers/iommu/generic_pt/pt_iter.h
create mode 100644 drivers/iommu/generic_pt/pt_log2.h
create mode 100644 include/linux/generic_pt/common.h
create mode 100644 include/linux/generic_pt/iommu.h
base-commit: 8da0d63bd5726ff656bfa1eacb45d6f5cce65616
--
2.43.0
This patch simplifies kublk's implementation of the feature list
command, fixes a bug where a feature was missing, and adds a test to
ensure that similar bugs do not happen in the future.
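The following is a minimal sketch of a table-driven feature-name map with a completeness check, in the spirit of the feat_map changes. It is not kublk's actual code. The DEMO_F_* flags and the helper names are hypothetical stand-ins for the UBLK_F_* definitions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical feature flags standing in for UBLK_F_* values. */
#define DEMO_F_FOO (1ULL << 0)
#define DEMO_F_BAR (1ULL << 1)
#define DEMO_F_BAZ (1ULL << 2)

struct feat_name {
	uint64_t flag;
	const char *name;
};

/* Stringify the macro name so the printed name cannot drift from it. */
#define FEAT_NAME(f) { f, #f }

static const struct feat_name feat_map[] = {
	FEAT_NAME(DEMO_F_FOO),
	FEAT_NAME(DEMO_F_BAR),
	FEAT_NAME(DEMO_F_BAZ),
};

#define FEAT_MAP_LEN (sizeof(feat_map) / sizeof(feat_map[0]))

/* Name for a single flag, or NULL if the map has no entry for it. */
static const char *feat_lookup(uint64_t flag)
{
	for (size_t i = 0; i < FEAT_MAP_LEN; i++)
		if (feat_map[i].flag == flag)
			return feat_map[i].name;
	return NULL;
}

/*
 * Bits of 'supported' (e.g. what the kernel reports) that have no name
 * in feat_map. A completeness test would fail if this is non-zero.
 */
static uint64_t feat_unnamed(uint64_t supported)
{
	for (size_t i = 0; i < FEAT_MAP_LEN; i++)
		supported &= ~feat_map[i].flag;
	return supported;
}
```

Deriving the name from the macro itself removes one source of typos. The mask check turns "a feature was missing" into a test failure rather than a silently unnamed bit.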
Signed-off-by: Uday Shankar <ushankar(a)purestorage.com>
---
Changes in v2:
- Add log lines to new test in failure case, to tell the user how to fix
the test, and to indicate that the failure is expected when running
an old test suite against a new kernel (Ming Lei)
- Link to v1: https://lore.kernel.org/r/20250916-ublk_features-v1-0-52014be9cde5@purestor…
---
Uday Shankar (3):
selftests: ublk: kublk: simplify feat_map definition
selftests: ublk: kublk: add UBLK_F_BUF_REG_OFF_DAEMON to feat_map
selftests: ublk: add test to verify that feat_map is complete
tools/testing/selftests/ublk/Makefile | 1 +
tools/testing/selftests/ublk/kublk.c | 32 +++++++++++++------------
tools/testing/selftests/ublk/test_generic_13.sh | 20 ++++++++++++++++
3 files changed, 38 insertions(+), 15 deletions(-)
---
base-commit: da7b97ba0d219a14a83e9cc93f98b53939f12944
change-id: 20250916-ublk_features-07af4e321e5a
Best regards,
--
Uday Shankar <ushankar(a)purestorage.com>
From: Dylan Yudaken <dyudaken(a)gmail.com>
Add a .gitignore for the test case build object.
Signed-off-by: Dylan Yudaken <dyudaken(a)gmail.com>
Signed-off-by: Sohil Mehta <sohil.mehta(a)intel.com>
Reviewed-by: Simon Horman <horms(a)kernel.org>
---
The test binary shows up as noise in git status. The patch to fix that
seems to have fallen through the cracks, so I'm sending another revision
with an expanded Cc list.
v2:
- Pick up the review tag
v1: https://lore.kernel.org/all/20250623232549.3263273-1-dyudaken@gmail.com/
---
tools/testing/selftests/kexec/.gitignore | 2 ++
1 file changed, 2 insertions(+)
create mode 100644 tools/testing/selftests/kexec/.gitignore
diff --git a/tools/testing/selftests/kexec/.gitignore b/tools/testing/selftests/kexec/.gitignore
new file mode 100644
index 000000000000..5f3d9e089ae8
--- /dev/null
+++ b/tools/testing/selftests/kexec/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+test_kexec_jump
--
2.43.0
There is a spelling mistake in a test message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com>
---
tools/testing/selftests/futex/functional/futex_numa_mpol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/futex/functional/futex_numa_mpol.c b/tools/testing/selftests/futex/functional/futex_numa_mpol.c
index 722427fe90bf..3a71ab93db72 100644
--- a/tools/testing/selftests/futex/functional/futex_numa_mpol.c
+++ b/tools/testing/selftests/futex/functional/futex_numa_mpol.c
@@ -206,7 +206,7 @@ int main(int argc, char *argv[])
ksft_print_msg("Memory back to RW\n");
test_futex(futex_ptr, 0);
- ksft_test_result_pass("futex2 memory boundarie tests passed\n");
+ ksft_test_result_pass("futex2 memory boundary tests passed\n");
/* MPOL test. Does not work as expected */
#ifdef LIBNUMA_VER_SUFFICIENT
--
2.51.0
I've removed the RFC tag from this version of the series, but the items
that I'm looking for feedback on remain the same:
- The userspace ABI, in particular:
- The vector length used for the SVE registers, access to the SVE
registers and access to ZA and (if available) ZT0 depending on
the current state of PSTATE.{SM,ZA}.
- The use of a single finalisation for both SVE and SME.
- The addition of a control for enabling fine grained traps in a similar
manner to FGU but without the UNDEF. I'm not sure this is desired
at all, and at present it requires symmetric read and write traps like
FGU. That seemed like it might be desirable from an implementation
point of view, but we already have one case where we enable an
asymmetric trap (for ARM64_WORKAROUND_AMPERE_AC03_CPU_38), and it
seems generally useful to be able to enable traps asymmetrically.
This series implements support for SME use in non-protected KVM guests.
Much of this is very similar to SVE; the main additional challenge that
SME presents is that it introduces a new vector length, similar to the
SVE vector length, and two new controls which change the registers seen
by guests:
- PSTATE.ZA enables the ZA matrix register and, if SME2 is supported,
the ZT0 LUT register.
- PSTATE.SM enables streaming mode, a new floating point mode which
uses the SVE register set with the separately configured SME vector
length. In streaming mode, implementation of the FFR register is
optional.
It is also permitted to build systems which support SME without SVE; in
this case, when not in streaming mode, no SVE registers or instructions
are available. Further, there is no requirement that the sets of vector
lengths supported by SVE and SME in a system overlap at all, and having
no overlap is expected to be a common situation in practical systems.
Since there is a new vector length to configure, we introduce a new
feature parallel to the existing SVE one, with a new pseudo register for
the streaming mode vector length. Because streaming mode overlaps with
SVE, rather than finalising SME as a separate feature we use the
existing SVE finalisation to also finalise SME; a new define,
KVM_ARM_VCPU_VEC, is provided to help make user code clearer. Finalising
SVE and SME separately would complicate register access, since
finalising SVE makes the SVE registers writeable by userspace and doing
multiple finalisations results in an error being reported. Dealing with
a state where the SVE registers are writeable because one of SVE or SME
has been finalised, but may still have their VL changed when the other
is finalised, seems like needless complexity with minimal practical
utility. It seems clearer to express directly in the ABI that only one
finalisation can be done.
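Below is a toy model of the single-finalisation rule. Both the SVE and SME vector lengths may be configured up to the one finalisation call, after which further configuration, or a second finalisation, fails. The struct and function names, and the use of -EPERM as the error, are assumptions for illustration and are not KVM's implementation.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-vCPU vector configuration. */
struct toy_vcpu_vec {
	unsigned int sve_vl;	/* bytes; 0 if SVE is not enabled */
	unsigned int sme_vl;	/* bytes; 0 if SME is not enabled */
	bool finalized;
};

/* Configure one vector length; only allowed before finalisation. */
static int toy_set_vl(struct toy_vcpu_vec *v, bool sme, unsigned int vl)
{
	if (v->finalized)
		return -EPERM;
	if (sme)
		v->sme_vl = vl;
	else
		v->sve_vl = vl;
	return 0;
}

/* A single finalisation covers both SVE and SME; a repeat call fails. */
static int toy_finalize_vec(struct toy_vcpu_vec *v)
{
	if (v->finalized)
		return -EPERM;
	v->finalized = true;
	return 0;
}
```

Because both lengths are frozen at the same moment, there is no intermediate state in which one set of registers is writeable while the other vector length can still change.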
Access to the floating point registers follows the architecture:
- When both SVE and SME are present:
- If PSTATE.SM == 0 the vector length used for the Z and P registers
is the SVE vector length.
- If PSTATE.SM == 1 the vector length used for the Z and P registers
is the SME vector length.
- If only SME is present:
- If PSTATE.SM == 0 the Z and P registers are inaccessible and the
floating point state is accessed via the encodings for the V registers.
- If PSTATE.SM == 1 the vector length used for the Z and P registers
is the SME vector length.
- The SME specific ZA and ZT0 registers are only accessible if SVCR.ZA is 1.
The VMM must understand this; in particular, when loading state, SVCR
should be configured before other state. Note that while the
architecture refers to PSTATE.SM and PSTATE.ZA, these PSTATE bits are
not preserved in SPSR_ELx; they are only accessible via SVCR.
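The register access rules can be summarised as a small decision function. This is a sketch under stated assumptions: the function names are hypothetical, vector lengths are plain byte counts, SVCR.SM and SVCR.ZA are taken as bits 0 and 1 of SVCR, and ZT0 is treated as also requiring SME2.

```c
#include <stdbool.h>
#include <stdint.h>

#define SVCR_SM (UINT64_C(1) << 0)	/* streaming mode enabled */
#define SVCR_ZA (UINT64_C(1) << 1)	/* ZA storage enabled */

/*
 * Vector length (in bytes) used for the Z and P registers, or 0 if
 * they are inaccessible and only the V registers can be used.
 */
static unsigned int zp_vl(bool has_sve, bool has_sme, uint64_t svcr,
			  unsigned int sve_vl, unsigned int sme_vl)
{
	if (has_sme && (svcr & SVCR_SM))
		return sme_vl;	/* streaming mode uses the SME VL */
	if (has_sve)
		return sve_vl;	/* non-streaming with SVE present */
	return 0;		/* SME-only system outside streaming mode */
}

/* ZA is accessible only when SVCR.ZA is set. */
static bool za_accessible(bool has_sme, uint64_t svcr)
{
	return has_sme && (svcr & SVCR_ZA);
}

/* ZT0 additionally requires SME2. */
static bool zt0_accessible(bool has_sme2, uint64_t svcr)
{
	return has_sme2 && (svcr & SVCR_ZA);
}
```

Since every result depends on SVCR, a VMM restoring state needs SVCR in place before it interprets or writes the Z, P, ZA or ZT0 state, which is why SVCR is configured first.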
There are a large number of subfeatures for SME, most of which only
offer additional instructions but some of which (SME2 and FA64) add
architectural state. These are configured via the ID registers as per
usual.
Protected KVM is supported, with the implementation maintaining the
existing restriction that the hypervisor will refuse to run if streaming
mode or ZA is enabled. This both simplifies the code and avoids the need
to allocate storage for host ZA and ZT0 state; there seems to be little
practical use case for supporting this, and the memory usage would be
non-trivial.
The new KVM_ARM_VCPU_VEC feature and ZA and ZT0 registers have not been
added to the get-reg-list selftest, the idea of supporting additional
features there without restructuring the program to generate all
possible feature combinations has been rejected. I will post a separate
series which does that restructuring.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v8:
- Small fixes in ABI documentation.
- Link to v7: https://lore.kernel.org/r/20250822-kvm-arm64-sme-v7-0-7a65d82b8b10@kernel.o…
Changes in v7:
- Rebase onto v6.17-rc1.
- Handle SMIDR_EL1 as a VM wide ID register and use this in feat_sme_smps().
- Expose affinity fields in SMIDR_EL1.
- Remove SMPRI_EL1 from vcpu_sysreg, the value is always 0 currently.
- Prevent userspace writes to SMPRIMAP_EL2.
- Link to v6: https://lore.kernel.org/r/20250625-kvm-arm64-sme-v6-0-114cff4ffe04@kernel.o…
Changes in v6:
- Rebase onto v6.16-rc3.
- Link to v5: https://lore.kernel.org/r/20250417-kvm-arm64-sme-v5-0-f469a2d5f574@kernel.o…
Changes in v5:
- Rebase onto v6.15-rc2.
- Add pKVM guest support.
- Always restore SVCR.
- Link to v4: https://lore.kernel.org/r/20250214-kvm-arm64-sme-v4-0-d64a681adcc2@kernel.o…
Changes in v4:
- Rebase onto v6.14-rc2 and Mark Rutland's fixes.
- Expose SME to nested guests.
- Additional cleanups and test fixes following on from the rebase.
- Flush register state on VMM PSTATE.{SM,ZA}.
- Link to v3: https://lore.kernel.org/r/20241220-kvm-arm64-sme-v3-0-05b018c1ffeb@kernel.o…
Changes in v3:
- Rebase onto v6.12-rc2.
- Link to v2: https://lore.kernel.org/r/20231222-kvm-arm64-sme-v2-0-da226cb180bb@kernel.o…
Changes in v2:
- Rebase onto v6.7-rc3.
- Configure subfeatures based on host system only.
- Complete nVHE support.
- There was some snafu with sending v1 out; it didn't make it to the
lists, but in case it hit people's inboxes I'm sending this as v2.
---
Mark Brown (29):
arm64/sysreg: Update SMIDR_EL1 to DDI0601 2025-06
arm64/fpsimd: Update FA64 and ZT0 enables when loading SME state
arm64/fpsimd: Decide to save ZT0 and streaming mode FFR at bind time
arm64/fpsimd: Check enable bit for FA64 when saving EFI state
arm64/fpsimd: Determine maximum virtualisable SME vector length
KVM: arm64: Introduce non-UNDEF FGT control
KVM: arm64: Pay attention to FFR parameter in SVE save and load
KVM: arm64: Pull ctxt_has_ helpers to start of sysreg-sr.h
KVM: arm64: Move SVE state access macros after feature test macros
KVM: arm64: Rename SVE finalization constants to be more general
KVM: arm64: Document the KVM ABI for SME
KVM: arm64: Define internal features for SME
KVM: arm64: Rename sve_state_reg_region
KVM: arm64: Store vector lengths in an array
KVM: arm64: Implement SME vector length configuration
KVM: arm64: Support SME control registers
KVM: arm64: Support TPIDR2_EL0
KVM: arm64: Support SME identification registers for guests
KVM: arm64: Support SME priority registers
KVM: arm64: Provide assembly for SME register access
KVM: arm64: Support userspace access to streaming mode Z and P registers
KVM: arm64: Flush register state on writes to SVCR.SM and SVCR.ZA
KVM: arm64: Expose SME specific state to userspace
KVM: arm64: Context switch SME state for guests
KVM: arm64: Handle SME exceptions
KVM: arm64: Expose SME to nested guests
KVM: arm64: Provide interface for configuring and enabling SME for guests
KVM: arm64: selftests: Add SME system registers to get-reg-list
KVM: arm64: selftests: Add SME to set_id_regs test
Documentation/virt/kvm/api.rst | 115 ++++++++---
arch/arm64/include/asm/fpsimd.h | 26 +++
arch/arm64/include/asm/kvm_emulate.h | 6 +
arch/arm64/include/asm/kvm_host.h | 169 ++++++++++++---
arch/arm64/include/asm/kvm_hyp.h | 5 +-
arch/arm64/include/asm/kvm_pkvm.h | 2 +-
arch/arm64/include/asm/vncr_mapping.h | 2 +
arch/arm64/include/uapi/asm/kvm.h | 33 +++
arch/arm64/kernel/cpufeature.c | 2 -
arch/arm64/kernel/fpsimd.c | 89 ++++----
arch/arm64/kvm/arm.c | 10 +
arch/arm64/kvm/config.c | 8 +-
arch/arm64/kvm/fpsimd.c | 28 ++-
arch/arm64/kvm/guest.c | 252 ++++++++++++++++++++---
arch/arm64/kvm/handle_exit.c | 14 ++
arch/arm64/kvm/hyp/fpsimd.S | 28 ++-
arch/arm64/kvm/hyp/include/hyp/switch.h | 175 ++++++++++++++--
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 110 ++++++----
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 86 ++++++--
arch/arm64/kvm/hyp/nvhe/pkvm.c | 85 ++++++--
arch/arm64/kvm/hyp/nvhe/switch.c | 4 +-
arch/arm64/kvm/hyp/nvhe/sys_regs.c | 6 +
arch/arm64/kvm/hyp/vhe/switch.c | 17 +-
arch/arm64/kvm/hyp/vhe/sysreg-sr.c | 7 +
arch/arm64/kvm/nested.c | 3 +-
arch/arm64/kvm/reset.c | 156 ++++++++++----
arch/arm64/kvm/sys_regs.c | 141 ++++++++++++-
arch/arm64/tools/sysreg | 8 +-
include/uapi/linux/kvm.h | 1 +
tools/testing/selftests/kvm/arm64/get-reg-list.c | 15 +-
tools/testing/selftests/kvm/arm64/set_id_regs.c | 27 ++-
31 files changed, 1327 insertions(+), 303 deletions(-)
---
base-commit: 062b3e4a1f880f104a8d4b90b767788786aa7b78
change-id: 20230301-kvm-arm64-sme-06a1246d3636
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Syzkaller found this: fput() runs the release from a work queue, so the
refcount remains elevated during abort. This is tricky, so move more of
the file handling into the core code.
Add a WARN_ON to catch things like this more reliably without relying on
KASAN.
Update the fail_nth test to succeed on 6.17 kernels.
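Here is a simplified, single-threaded model of the kind of check the second patch adds. An object that is aborted while someone else still holds a reference (for example, a deferred file release that has not run yet) gets flagged. The names are hypothetical, and a counter stands in for WARN_ON(); this is not iommufd's code.

```c
#include <stdbool.h>

/* Hypothetical object with a simple reference count. */
struct toy_obj {
	int refcount;	/* 1 == only the aborting path holds it */
	bool freed;
};

static int toy_warnings;	/* stands in for WARN_ON() reports */

/* Like WARN_ON(): record the report and return the condition. */
static bool toy_warn_on(bool cond)
{
	if (cond)
		toy_warnings++;
	return cond;
}

/* Abort a half-created object: it should hold the last reference. */
static void toy_abort(struct toy_obj *obj)
{
	toy_warn_on(obj->refcount != 1);
	obj->refcount = 0;
	obj->freed = true;
}
```

The value of such a check is that it fires deterministically on the elevated count, even when the subsequent use-after-free would not be caught without a sanitizer.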
Jason Gunthorpe (3):
iommufd: Fix race during abort for file descriptors
iommufd: WARN if an object is aborted with an elevated refcount
iommufd/selftest: Update the fail_nth limit
drivers/iommu/iommufd/device.c | 3 +-
drivers/iommu/iommufd/eventq.c | 9 +----
drivers/iommu/iommufd/iommufd_private.h | 3 +-
drivers/iommu/iommufd/main.c | 39 +++++++++++++++++--
.../selftests/iommu/iommufd_fail_nth.c | 2 +-
5 files changed, 42 insertions(+), 14 deletions(-)
base-commit: 1046d40b0e78d2cd63f6183629699b629b21f877
--
2.43.0
Mshare is a developing feature proposed by Anthony Yznaga and Khalid Aziz
that enables sharing of PTEs across processes. The V3 patch set has been
posted for review:
https://lore.kernel.org/linux-mm/20250820010415.699353-1-anthony.yznaga@ora…
This patch set adds selftests to exercise and demonstrate basic
functionality of mshare.
The initial tests use open, ioctl, and mmap syscalls to establish a shared
memory mapping between two processes and verify the expected behavior.
Additional tests are included to check interoperability with swap and
Transparent Huge Pages.
Future work will extend coverage to other use cases such as integration
with KVM and more advanced scenarios.
This series is intended to be applied on top of mshare V3, which is
based on mm-new (2025-08-15).
-----------------
V1->V2:
- Based on mshare V3, which is based on mm-new as of 2025-08-15
- (Fix) For test cases in basic.c, change to use a small chunk of
memory (4K/8K for normal pages, 2M/4M for hugetlb pages) to
ensure these tests can run on any server or device.
- (Fix) For the hugetlb, swap and THP test cases, add tips on
configuring the corresponding settings.
- (Fix) Add memory to the .gitignore file once it exists
- (Fix) Correct the changelog of the THP test case: mshare supports
THP only when the user configures shmem_enabled as always
V1:
https://lore.kernel.org/all/20250825145719.29455-1-linyongting@bytedance.co…
Yongting Lin (8):
mshare: Add selftests
mshare: selftests: Adding config fragments
mshare: selftests: Add some helper functions for mshare filesystem
mshare: selftests: Add test case shared memory
mshare: selftests: Add test case ioctl unmap
mshare: selftests: Add some helper functions for configuring and
retrieving cgroup
mshare: selftests: Add test case to demonstrate the swapping of mshare
memory
mshare: selftests: Add test case to demonstrate that mshare partly
supports THP
tools/testing/selftests/mshare/.gitignore | 4 +
tools/testing/selftests/mshare/Makefile | 7 +
tools/testing/selftests/mshare/basic.c | 109 ++++++++++
tools/testing/selftests/mshare/config | 1 +
tools/testing/selftests/mshare/memory.c | 89 ++++++++
tools/testing/selftests/mshare/util.c | 254 ++++++++++++++++++++++
6 files changed, 464 insertions(+)
create mode 100644 tools/testing/selftests/mshare/.gitignore
create mode 100644 tools/testing/selftests/mshare/Makefile
create mode 100644 tools/testing/selftests/mshare/basic.c
create mode 100644 tools/testing/selftests/mshare/config
create mode 100644 tools/testing/selftests/mshare/memory.c
create mode 100644 tools/testing/selftests/mshare/util.c
--
2.20.1
Hi everyone,
This patchset introduces a new BPF program type that allows overriding
a tracepoint probe function registered via register_trace_*.
Motivation
----------
Tracepoint probe functions registered via register_trace_* in the kernel
cannot be dynamically modified, changing a probe function requires recompiling
the kernel and rebooting. Nor can BPF programs change an existing
probe function.
Overiding tracepoint supports a way to apply patches into kernel quickly
(such as applying security ones), through predefined static tracepoints,
without waiting for upstream integration.
This patchset demonstrates the way to override probe functions by BPF program.
Overview
--------
This patchset adds BPF_PROG_TYPE_RAW_TRACEPOINT_OVERRIDE program type.
When this type of BPF program attaches, it overrides the target tracepoint
probe function.
It also adds a new struct type, "tracepoint_func_snapshot", which extends
the tracepoint structure. It records the original probe function
registered by the kernel when a BPF program is attached, so that the
probe can be restored from it after detachment.
Critical steps
--------------
1. Attach: Attach programs via the raw_tracepoint_open syscall.
2. Override:
(a) Locate the target probe by `probe_name`.
(b) Override target probe with the BPF program.
(c) Save the BPF program and target probe function into "tracepoint_func_snapshot".
3. Restore: When the BPF program is detached, automatically restore
the original probe function from the previously saved snapshot.
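The following user-space sketch walks through the locate/override/restore steps, using a plain function pointer in place of a registered tracepoint probe. struct probe_slot, struct probe_snapshot and the helper names are hypothetical. In the series itself, the saved state lives in tracepoint_func_snapshot in the kernel. The sketch also ignores the synchronisation needed when swapping a live probe.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef int (*probe_fn)(int arg);

/* Hypothetical stand-in for a registered tracepoint probe. */
struct probe_slot {
	const char *name;
	probe_fn fn;
};

/* Hypothetical snapshot of what was installed before the override. */
struct probe_snapshot {
	struct probe_slot *slot;	/* NULL when nothing is overridden */
	probe_fn original;
	probe_fn override;
};

/* Step 2(a): locate the target probe by name. */
static struct probe_slot *probe_find(struct probe_slot *slots, size_t n,
				     const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(slots[i].name, name) == 0)
			return &slots[i];
	return NULL;
}

/* Steps 2(b)-(c): install the override and save the original. */
static bool probe_override(struct probe_slot *slot, probe_fn prog,
			   struct probe_snapshot *snap)
{
	if (snap->slot)		/* already overridden */
		return false;
	snap->slot = slot;
	snap->original = slot->fn;
	snap->override = prog;
	slot->fn = prog;
	return true;
}

/* Step 3: on detach, put the original probe back. */
static void probe_restore(struct probe_snapshot *snap)
{
	if (!snap->slot)
		return;
	snap->slot->fn = snap->original;
	snap->slot = NULL;
}

/* Hypothetical probes: the kernel's original and the "BPF" override. */
static int kernel_probe(int arg) { return arg + 1; }
static int bpf_prog(int arg) { return arg * 10; }
```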
Future work
-----------
This patchset is intended as a first step toward supporting BPF programs
that can override tracepoint probes. The current implementation may not yet
cover all use cases or handle every corner case.
I welcome feedback and suggestions from the community, and will continue to
refine and improve the design based on comments and real-world requirements.
Thanks!
Fuyu
Fuyu Zhao (3):
bpf: Introduce BPF_PROG_TYPE_RAW_TRACEPOINT_OVERRIDE
libbpf: Add support for BPF_PROG_TYPE_RAW_TRACEPOINT_OVERRIDE
selftests/bpf: Add selftest for "raw_tp.o"
include/linux/bpf_types.h | 2 +
include/linux/trace_events.h | 9 +
include/linux/tracepoint-defs.h | 6 +
include/linux/tracepoint.h | 3 +
include/uapi/linux/bpf.h | 2 +
kernel/bpf/syscall.c | 35 +++-
kernel/trace/bpf_trace.c | 31 +++
kernel/tracepoint.c | 190 +++++++++++++++++-
tools/include/uapi/linux/bpf.h | 2 +
tools/lib/bpf/bpf.c | 1 +
tools/lib/bpf/bpf.h | 3 +-
tools/lib/bpf/libbpf.c | 27 ++-
tools/lib/bpf/libbpf.h | 3 +-
.../bpf/prog_tests/raw_tp_override_test_run.c | 23 +++
.../bpf/progs/test_raw_tp_override_test_run.c | 20 ++
.../selftests/bpf/test_kmods/bpf_testmod.c | 7 +
16 files changed, 352 insertions(+), 12 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/raw_tp_override_test_run.c
create mode 100644 tools/testing/selftests/bpf/progs/test_raw_tp_override_test_run.c
--
2.43.0
For a while now we have supported file handles for pidfds. This has
proven to be very useful.
Extend the concept to cover namespaces as well. After this patchset it
is possible to encode and decode namespace file handles using the
common name_to_handle_at() and open_by_handle_at() APIs.
Namespace file descriptors can already be derived from pidfds, which
means they aren't subject to overmount protection bugs. IOW, it's
irrelevant whether the caller has access to an appropriate
/proc/<pid>/ns/ directory, as they could always just derive the
namespace from a pidfd already.
It has the same advantage as pidfds: it's possible to refer to a
namespace reliably, for the lifetime of the system, without pinning any
resources, and to compare namespaces.
Permission checking is kept simple. If the caller is located in the
namespace the file handle refers to, they are able to open it; otherwise
they must hold privilege over the owning namespace of the relevant
namespace.
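Below is a toy model of the permission rule. The structs and helpers are hypothetical. "Holds privilege over" is simplified here to mean that the caller has the capability and the owning user namespace is the caller's own user namespace or a descendant of it. Real kernel capability checks involve more detail than this. Comparing namespaces by a plain id relies on the shared id space described in the next paragraph.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user namespace with a link to its parent. */
struct toy_userns {
	const struct toy_userns *parent;
};

/* Hypothetical namespace: a unique id and its owning user namespace. */
struct toy_ns {
	unsigned long long id;
	const struct toy_userns *owner;
};

/* Hypothetical caller credentials. */
struct toy_cred {
	unsigned long long cur_ns_id;	/* id of the caller's namespace */
	const struct toy_userns *userns;
	bool has_cap;			/* capability held in 'userns' */
};

/* True if 'ns' is 'anc' or a descendant of it. */
static bool userns_within(const struct toy_userns *ns,
			  const struct toy_userns *anc)
{
	for (; ns; ns = ns->parent)
		if (ns == anc)
			return true;
	return false;
}

static bool may_open_ns_handle(const struct toy_cred *cred,
			       const struct toy_ns *target)
{
	/* Callers inside the target namespace may always open it. */
	if (cred->cur_ns_id == target->id)
		return true;
	/* Otherwise they need privilege over the owning user namespace. */
	return cred->has_cap && userns_within(target->owner, cred->userns);
}
```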
Both the network namespace and the mount namespace already have an
associated cookie that isn't recycled and is fully exposed to userspace.
Move this into ns_common and use the same id space for all namespaces so
they can trivially and reliably be compared.
There's more coming based on the iterator infrastructure, but the series
is already large enough and focuses on file handles.
Extensive selftests included.
Signed-off-by: Christian Brauner <brauner(a)kernel.org>
---
Changes in v2:
- Address various review comments.
- Use a common NS_GET_ID ioctl() instead of individual ioctls.
- Link to v1: https://lore.kernel.org/20250910-work-namespace-v1-0-4dd56e7359d8@kernel.org
---
Christian Brauner (33):
pidfs: validate extensible ioctls
nsfs: drop tautological ioctl() check
nsfs: validate extensible ioctls
block: use extensible_ioctl_valid()
ns: move to_ns_common() to ns_common.h
nsfs: add nsfs.h header
ns: uniformly initialize ns_common
cgroup: use ns_common_init()
ipc: use ns_common_init()
mnt: use ns_common_init()
net: use ns_common_init()
pid: use ns_common_init()
time: use ns_common_init()
user: use ns_common_init()
uts: use ns_common_init()
ns: remove ns_alloc_inum()
nstree: make iterator generic
mnt: support ns lookup
cgroup: support ns lookup
ipc: support ns lookup
net: support ns lookup
pid: support ns lookup
time: support ns lookup
user: support ns lookup
uts: support ns lookup
ns: add to_<type>_ns() to respective headers
nsfs: add current_in_namespace()
nsfs: support file handles
nsfs: support exhaustive file handles
nsfs: add missing id retrieval support
tools: update nsfs.h uapi header
selftests/namespaces: add identifier selftests
selftests/namespaces: add file handle selftests
block/blk-integrity.c | 8 +-
fs/fhandle.c | 6 +
fs/internal.h | 1 +
fs/mount.h | 10 +-
fs/namespace.c | 156 +--
fs/nsfs.c | 201 ++-
fs/pidfs.c | 2 +-
include/linux/cgroup.h | 5 +
include/linux/exportfs.h | 6 +
include/linux/fs.h | 14 +
include/linux/ipc_namespace.h | 5 +
include/linux/ns_common.h | 29 +
include/linux/nsfs.h | 40 +
include/linux/nsproxy.h | 11 -
include/linux/nstree.h | 89 ++
include/linux/pid_namespace.h | 5 +
include/linux/proc_ns.h | 32 +-
include/linux/time_namespace.h | 9 +
include/linux/user_namespace.h | 5 +
include/linux/utsname.h | 5 +
include/net/net_namespace.h | 6 +
include/uapi/linux/fcntl.h | 1 +
include/uapi/linux/nsfs.h | 15 +-
init/main.c | 2 +
ipc/msgutil.c | 1 +
ipc/namespace.c | 12 +-
ipc/shm.c | 2 +
kernel/Makefile | 2 +-
kernel/cgroup/cgroup.c | 2 +
kernel/cgroup/namespace.c | 24 +-
kernel/nstree.c | 233 ++++
kernel/pid_namespace.c | 13 +-
kernel/time/namespace.c | 23 +-
kernel/user_namespace.c | 17 +-
kernel/utsname.c | 28 +-
net/core/net_namespace.c | 59 +-
tools/include/uapi/linux/nsfs.h | 17 +-
tools/testing/selftests/namespaces/.gitignore | 2 +
tools/testing/selftests/namespaces/Makefile | 7 +
tools/testing/selftests/namespaces/config | 7 +
.../selftests/namespaces/file_handle_test.c | 1429 ++++++++++++++++++++
tools/testing/selftests/namespaces/nsid_test.c | 986 ++++++++++++++
42 files changed, 3257 insertions(+), 270 deletions(-)
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20250905-work-namespace-c68826dda0d4