Recent change in how get_user() handles pointers [1] has a specific case
for LAM. It assigns a different bitmask that's later used to check
whether a pointer comes from userland in get_user().
While currently commented out (until LASS [2] is merged into the kernel)
it's worth making changes to the LAM selftest ahead of time.
Modify cpu_has_la57() so it provides current paging level information
instead of the cpuid one.
Add test case to LAM that utilizes a ioctl (FIOASYNC) syscall which uses
get_user() in its implementation. Execute the syscall with differently
tagged pointers to verify that valid user pointers are passing through
and invalid kernel/non-canonical pointers are not.
Also to avoid unhelpful test failures add a check in main() to skip
running tests if LAM was not compiled into the kernel.
Code was tested on a Sierra Forest Xeon machine that's LAM capable. The
test was ran without issues with both the LAM lines from [1] untouched
and commented out. The test was also ran without issues with LAM_SUP
both enabled and disabled.
4/5 level pagetables code paths were also successfully tested in Simics
on a 5-level capable machine.
[1] https://lore.kernel.org/all/20241024013214.129639-1-torvalds@linux-foundati…
[2] https://lore.kernel.org/all/20241028160917.1380714-1-alexander.shishkin@lin…
Maciej Wieczor-Retman (3):
selftests/lam: Move cpu_has_la57() to use cpuinfo flag
selftests/lam: Skip test if LAM is disabled
selftests/lam: Test get_user() LAM pointer handling
tools/testing/selftests/x86/lam.c | 120 ++++++++++++++++++++++++++++--
1 file changed, 115 insertions(+), 5 deletions(-)
--
2.47.1
Here is a series from Geliang, adding mptcp_subflow bpf_iter support.
We are working on extending MPTCP with BPF, e.g. to control the path
manager -- in charge of the creation, deletion, and announcements of
subflows (paths) -- and the packet scheduler -- in charge of selecting
which available path the next data will be sent to. These extensions
need to iterate over the list of subflows attached to an MPTCP
connection, and do some specific actions via some new kfunc that will be
added later on.
This preparation work is split in different patches:
- Patch 1: extend bpf_skc_to_mptcp_sock() to be called with msk.
- Patch 2: allow using skc_to_mptcp_sock() in CGroup sockopt hooks.
- Patch 3: register some "basic" MPTCP kfunc.
- Patch 4: add mptcp_subflow bpf_iter support. Note that previous
versions of this single patch have already been shared to the
BPF mailing list. The changelog has been kept with a comment,
but the version number has been reset to avoid confusions.
- Patch 5: add kfunc to make sure the msk is valid
- Patch 6: add more MPTCP endpoints in the selftests, in order to create
more than 2 subflows.
- Patch 7: add a very simple test validating mptcp_subflow bpf_iter
support. This test could be written without the new bpf_iter,
but it is there only to make sure this specific feature works
as expected.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Changes in v2:
- Patches 1-2: new ones.
- Patch 3: remove two kfunc, more restrictions. (Martin)
- Patch 4: add BUILD_BUG_ON(), more restrictions. (Martin)
- Patch 7: adaptations due to modifications in patches 1-4.
- Link to v1: https://lore.kernel.org/r/20241108-bpf-next-net-mptcp-bpf_iter-subflows-v1-…
---
Geliang Tang (7):
bpf: Extend bpf_skc_to_mptcp_sock to MPTCP sock
bpf: Allow use of skc_to_mptcp_sock in cg_sockopt
bpf: Register mptcp common kfunc set
bpf: Add mptcp_subflow bpf_iter
bpf: Acquire and release mptcp socket
selftests/bpf: More endpoints for endpoint_init
selftests/bpf: Add mptcp_subflow bpf_iter subtest
include/net/mptcp.h | 4 +-
kernel/bpf/cgroup.c | 2 +
net/core/filter.c | 2 +-
net/mptcp/bpf.c | 113 +++++++++++++++++-
tools/testing/selftests/bpf/bpf_experimental.h | 8 ++
tools/testing/selftests/bpf/prog_tests/mptcp.c | 129 ++++++++++++++++++++-
tools/testing/selftests/bpf/progs/mptcp_bpf.h | 9 ++
.../testing/selftests/bpf/progs/mptcp_bpf_iters.c | 63 ++++++++++
8 files changed, 318 insertions(+), 12 deletions(-)
---
base-commit: dad704ebe38642cd405e15b9c51263356391355c
change-id: 20241108-bpf-next-net-mptcp-bpf_iter-subflows-027f6d87770e
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Hi all,
Does anyone know what the 'stty sane' invocation in kunit.py is about?
The other day I ran into an issue when running it via watchexec[1]. At
the time I believed that it was there to clean up after the firmware
that QEMU runs potentially messed up the terminal.
However, I just realised I'm not sure if that makes sense - stty is
about setting terminal settings via ioctl. I don't think QEMU or its
guests are messing up the terminal with ioctls, they're just writing
funny control characters.
What's going on here? I guess one of:
1. Terminal is messed up with ctrl chars but ioctls are the
easiest/only way to reliably clean it up.
2. Nobody thought about this unimportant detail so hard before and
there's no particular rationale in place here.
3. I made bad assumptions about why the `stty sane` is there.
If it's 1 or 2 I wonder if there's an alternative way to clean up
without getting the SIGTTOU issue.
Or, maybe it doesn't matter and the fact that this was ever a problem
is just a bug in watchexec (maybe you can tell I haven't actually
taken the time to research the SIGTTOU thing properly). But thought
I'd raise it in case this points to issues people might have using
kunit.py in CI.
[1] https://github.com/watchexec/watchexec/issues/874
[2] https://gist.github.com/bjackman/27fd9980d87c5556c20e67a6ed891500
Given the time of year and point in the release cycle this is an RFC
series, there's a few areas where I'm particularly expecting that people
might have feedback:
- The userspace ABI, in particular:
- The vector length used for the SVE registers, access to the SVE
registers and access to ZA and (if available) ZT0 depending on
the current state of PSTATE.{SM,ZA}.
- The use of a single finalisation for both SVE and SME.
- The addition of control for enabling fine grained traps in a similar
manner to FGU but without the UNDEF, I'm not clear if this is desired
at all and at present this requires symmetric read and write traps like
FGU. That seemed like it might be desired from an implementation
point of view but we already have one case where we enable an
asymmetric trap (for ARM64_WORKAROUND_AMPERE_AC03_CPU_38) and it
seems generally useful to enable asymmetrically.
There is some nested virtualisation support in the code but it is not
enabled or complete, this will be completed before the RFC tag is
removed. I am anticipating having a vastly better test environment soon
which will make this much easier to complete and there is no SME
specific ABI for nested virtualisation.
This series implements support for SME use in non-protected KVM guests.
Much of this is very similar to SVE, the main additional challenge that
SME presents is that it introduces a new vector length similar to the
SVE vector length and two new controls which change the registers seen
by guests:
- PSTATE.ZA enables the ZA matrix register and, if SME2 is supported,
the ZT0 LUT register.
- PSTATE.SM enables streaming mode, a new floating point mode which
uses the SVE register set with the separately configured SME vector
length. In streaming mode implementation of the FFR register is
optional.
It is also permitted to build systems which support SME without SVE, in
this case when not in streaming mode no SVE registers or instructions
are available. Further, there is no requirement that there be any
overlap in the set of vector lengths supported by SVE and SME in a
system, this is expected to be a common situation in practical systems.
Since there is a new vector length to configure we introduce a new
feature parallel to the existing SVE one with a new pseudo register for
the streaming mode vector length. Due to the overlap with SVE caused by
streaming mode rather than finalising SME as a separate feature we use
the existing SVE finalisation to also finalise SME, a new define
KVM_ARM_VCPU_VEC is provided to help make user code clearer. Finalising
SVE and SME separately would introduce complication with register access
since finalising SVE makes the SVE regsiters writeable by userspace and
doing multiple finalisations results in an error being reported.
Dealing with a state where the SVE registers are writeable due to one of
SVE or SME being finalised but may have their VL changed by the other
being finalised seems like needless complexity with minimal practical
utility, it seems clearer to just express directly that only one
finalisation can be done in the ABI.
Access to the floating point registers follows the architecture:
- When both SVE and SME are present:
- If PSTATE.SM == 0 the vector length used for the Z and P registers
is the SVE vector length.
- If PSTATE.SM == 1 the vector length used for the Z and P registers
is the SME vector length.
- If only SME is present:
- If PSTATE.SM == 0 the Z and P registers are inaccessible and the
floating point state accessed via the encodings for the V registers.
- If PSTATE.SM == 1 the vector length used for the Z and P registers
- The SME specific ZA and ZT0 registers are only accessible if SVCR.ZA is 1.
The VMM must understand this, in particular when loading state SVCR
should be configured before other state.
There are a large number of subfeatures for SME, most of which only
offer additional instructions but some of which (SME2 and FA64) add
architectural state. These are configured via the ID registers as per
usual.
The new KVM_ARM_VCPU_VEC feature and ZA and ZT0 registers have not been
added to the get-reg-list selftest, the idea of supporting additional
features there without restructuring the program to generate all
possible feature combinations has been rejected. I will post a separate
series which does that restructuring.
No support is present for protected guests, this is expected to be added
separately.
The series is based on Fuad's series:
https://lore.kernel.org/r/20241216105057.579031-1-tabba@google.com/
It will need a rebase on:
https://lore.kernel.org/r/20241219173351.1123087-1-maz@kernel.org
(as will Fuad's.)
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v3:
- Rebase onto v6.12-rc2.
- Link to v2: https://lore.kernel.org/r/20231222-kvm-arm64-sme-v2-0-da226cb180bb@kernel.o…
Changes in v2:
- Rebase onto v6.7-rc3.
- Configure subfeatures based on host system only.
- Complete nVHE support.
- There was some snafu with sending v1 out, it didn't make it to the
lists but in case it hit people's inboxes I'm sending as v2.
---
Mark Brown (27):
arm64/fpsimd: Update FA64 and ZT0 enables when loading SME state
arm64/fpsimd: Decide to save ZT0 and streaming mode FFR at bind time
arm64/fpsimd: Check enable bit for FA64 when saving EFI state
arm64/fpsimd: Determine maximum virtualisable SME vector length
KVM: arm64: Introduce non-UNDEF FGT control
KVM: arm64: Pull ctxt_has_ helpers to start of sysreg-sr.h
KVM: arm64: Convert cpacr_clear_set() to a static inline
KVM: arm64: Move SVE state access macros after feature test macros
KVM: arm64: Factor SVE guest exit handling out into a function
KVM: arm64: Rename SVE finalization constants to be more general
KVM: arm64: Document the KVM ABI for SME
KVM: arm64: Define internal features for SME
KVM: arm64: Rename sve_state_reg_region
KVM: arm64: Store vector lengths in an array
KVM: arm64: Implement SME vector length configuration
KVM: arm64: Add definitions for SME control register
KVM: arm64: Support TPIDR2_EL0
KVM: arm64: Support SMIDR_EL1 for guests
KVM: arm64: Support SME priority registers
KVM: arm64: Provide assembly for SME state restore
KVM: arm64: Support Z and P registers in streaming mode
KVM: arm64: Expose SME specific state to userspace
KVM: arm64: Context switch SME state for normal guests
KVM: arm64: Handle SME exceptions
KVM: arm64: Provide interface for configuring and enabling SME for guests
KVM: arm64: selftests: Add SME system registers to get-reg-list
KVM: arm64: selftests: Add SME to set_id_regs test
Documentation/virt/kvm/api.rst | 117 ++++++---
arch/arm64/include/asm/fpsimd.h | 22 ++
arch/arm64/include/asm/kvm_emulate.h | 37 ++-
arch/arm64/include/asm/kvm_host.h | 135 ++++++++---
arch/arm64/include/asm/kvm_hyp.h | 4 +-
arch/arm64/include/asm/kvm_pkvm.h | 2 +-
arch/arm64/include/asm/vncr_mapping.h | 2 +
arch/arm64/include/uapi/asm/kvm.h | 33 +++
arch/arm64/kernel/cpufeature.c | 2 -
arch/arm64/kernel/fpsimd.c | 86 ++++---
arch/arm64/kvm/arm.c | 10 +
arch/arm64/kvm/fpsimd.c | 156 +++++++-----
arch/arm64/kvm/guest.c | 262 ++++++++++++++++++---
arch/arm64/kvm/handle_exit.c | 14 ++
arch/arm64/kvm/hyp/fpsimd.S | 16 ++
arch/arm64/kvm/hyp/include/hyp/switch.h | 104 ++++++--
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 47 ++--
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 17 +-
arch/arm64/kvm/hyp/nvhe/pkvm.c | 4 +-
arch/arm64/kvm/hyp/nvhe/switch.c | 11 +-
arch/arm64/kvm/hyp/vhe/switch.c | 21 +-
arch/arm64/kvm/reset.c | 154 +++++++++---
arch/arm64/kvm/sys_regs.c | 118 +++++++++-
include/uapi/linux/kvm.h | 1 +
tools/testing/selftests/kvm/aarch64/get-reg-list.c | 32 ++-
tools/testing/selftests/kvm/aarch64/set_id_regs.c | 29 ++-
26 files changed, 1117 insertions(+), 319 deletions(-)
---
base-commit: e32a80927434907f973f38a88cd19d7e51991d24
change-id: 20230301-kvm-arm64-sme-06a1246d3636
prerequisite-message-id: 20241216105057.579031-1-tabba(a)google.com
prerequisite-patch-id: 10a23279fc1aa942c363d66df0e95414342b614b
prerequisite-patch-id: 670db72b1987d2591e23db072fd27db7f65ffb0f
prerequisite-patch-id: c6bc6f799cebe5010bf3d734eb06e39d5dfab0d6
prerequisite-patch-id: 5555cde0b025483c2318d006a0324fd95bd06268
prerequisite-patch-id: a73738d5bbc5e694c92b7a5654f78eb79ed23c09
prerequisite-patch-id: 6194857db22ccaefe13e88b3155b6e761c9b7692
prerequisite-patch-id: 5dca3992c2ffa5bf2edb45f68be45edfae9b41b3
prerequisite-patch-id: b048e799d816c9c6750ed4f264fd38cb6e31f968
prerequisite-patch-id: 07fea6c2207f8cd2d35d4c171a97d28397db9a79
prerequisite-patch-id: f330e82665af9f223e838511bd4a95faad56e3ac
prerequisite-patch-id: 060a6061eaedb7fd02c18e898bfd9652c991b9af
prerequisite-patch-id: fc31d9f0e7812a8f962876fdb311414122895389
prerequisite-patch-id: ae675f63215a211c42a497789ee5e092fd461279
prerequisite-patch-id: ff3c533043a1fa3a13827ea5c70459b228aa95ee
prerequisite-patch-id: de489d2d73f49d74b75c628828a6b56dbac751e2
prerequisite-patch-id: 92f4a1249e3a1ff32eb16c25af56930762c5697d
prerequisite-patch-id: ac1248b4e10dce15672e02b366a359d634297877
Best regards,
--
Mark Brown <broonie(a)kernel.org>
From: Roberto Sassu <roberto.sassu(a)huawei.com>
Integrity detection and protection has long been a desirable feature, to
reach a large user base and mitigate the risk of flaws in the software
and attacks.
However, while solutions exist, they struggle to reach a large user base,
due to requiring higher than desired constraints on performance,
flexibility and configurability, that only security conscious people are
willing to accept.
For example, IMA measurement requires the target platform to collect
integrity measurements, and to protect them with the TPM, which introduces
a noticeable overhead (up to 10x slower in a microbenchmark) on frequently
used system calls, like the open().
IMA Appraisal currently requires individual files to be signed and
verified, and Linux distributions to rebuild all packages to include file
signatures (this approach has been adopted from Fedora 39+). Like a TPM,
also signature verification introduces a significant overhead, especially
if it is used to check the integrity of many files.
This is where the new Integrity Digest Cache comes into play, it offers
additional support for new and existing integrity solutions, to make
them faster and easier to deploy.
The Integrity Digest Cache can help IMA to reduce the number of TPM
operations and to make them happen in a deterministic way. If IMA knows
that a file comes from a Linux distribution, it can measure files in a
different way: measure the list of digests coming from the distribution
(e.g. RPM package headers), and subsequently measure a file if it is not
found in that list.
The performance improvement comes at the cost of IMA not reporting which
files from installed packages were accessed, and in which temporal
sequence. This approach might not be suitable for all use cases.
The Integrity Digest Cache can also help IMA for appraisal. IMA can simply
lookup the calculated digest of an accessed file in the list of digests
extracted from package headers, after verifying the header signature. It is
sufficient to verify only one signature for all files in the package, as
opposed to verifying a signature for each file.
The same approach can be followed by other LSMs, such as Integrity Policy
Enforcement (IPE), and BPF LSM.
The Integrity Digest Cache is not tied to a specific package format. The
kernel supports a TLV-based digest list format. More can be added through
third-party kernel modules. The TLV parser has been verified for memory
safety with the Frama-C static analyzer. The version with the Frama-C
assertions is available here:
https://github.com/robertosassu/rpm-formal/blob/main/validate_tlv.c
Integrating the Integrity Digest Cache in IMA brings significant
performance improvements: up to 67% and 79% for measurement respectively in
sequential and parallel file reads; up to 65% and 43% for appraisal
respectively in sequential and parallel file reads.
The performance can be further enhanced by using fsverity digests instead
of conventional file digests, which would make IMA verify only the portion
of the file to be read. However, at the moment, fsverity digests are not
included in RPM packages. In this case, once rpm is extended to include
them, Linux distributions still have to rebuild their packages.
The Integrity Digest Cache can support both digest types, so that the
functionality is immediately available without waiting for Linux
distributions to do the transition.
This patch set only includes the patches necessary to extract digests from
a TLV-based data format, and exposes an API for LSMs to query them. A
separate patch set will be provided to integrate it in IMA.
This patch set and the follow-up IMA integration can be tested by following
the instructions at:
https://github.com/linux-integrity/digest-cache-tools
This patch set applies on top of:
https://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git/l…
with commit 08ae3e5f5fc8 ("integrity: Use static_assert() to check struct
sizes").
Changelog
v5:
- Remove the RPM parser and selftests (suggested by Linus)
- Return digest cache pointer from digest_cache_lookup()
- Export new Parser API, and allow registration of third-party digest list
parsers (suggested by Mimi)
- Reduce sizes in TLV format and remove TLV header (suggested by Jani
Nikula)
- Introduce new DIGEST_LIST_NUM_ENTRIES TLV field
- Pass file descriptor instead of dentry in digest_cache_get() to properly
detect potential deadlocks
- Introduce digest_cache_opened_fd() to tell lockdep when it is safe to
nest a mutex if digest_cache_get() is called with that mutex held
- Add new patch to introduce ksys_finit_module()
- Make the TLV parser as configurable (Y/N/m) with Kconfig (suggested by
Mimi)
- Don't store the path structure in the digest cache and pass it between
creation and initialization of the digest cache
- Remove digest_cache_dir_update_dig_user() and keep the digest cache
retrieved during digest_cache_get()
- Fail with an error pointer in digest_cache_dir_lookup_digest() if the
current and passed directory digest cache don't match, or the digest
cache was reset
- Handle num_digest = 0 in digest_cache_htable_init()
- Accept -EOPNOTSUPP error in digest_cache_new()
- Implement inode_free_security_rcu LSM hook instead of inode_free_security
- Move reservation of file descriptor security blob inside the #ifdef in
init_ima_lsm()
- Add test file_reset_again to check the error pointer returned by
digest_cache_lookup()
- Remove TLV_FAILURE_HDR_LEN TLV error test
- Add missing MODULE_DESCRIPTION in kselftest kernel module (suggested by
Jeff Johnson)
- Replace dentry_open() with kernel_file_open() in populate.c and dir.c
- Skip affected tests when CONFIG_DYNAMIC_FTRACE_WITH_ARGS=n
v4:
- Rename digest_cache LSM to Integrity Digest Cache (suggested by Paul
Moore)
- Update documentation
- Remove forward declaration of struct digest_cache in
include/linux/digest_cache.h (suggested by Jarkko)
- Add DIGEST_CACHE_FREE digest cache event for notification
- Remove digest_cache_found_t typedef and use uintptr_t instead
- Add header callback in TLV parser and unexport tlv_parse_hdr() and
tlv_parse_data()
- Plug the Integrity Digest Cache into the 'ima' LSM
- Switch from constructor to zeroing the object cache
- Remove notifier and detect digest cache changes by comparing pointers
- Rename digest_cache_dir_create() to digest_cache_dir_add_entries()
- Introduce digest_cache_dir_create() to create and initialize a directory
digest cache
- Introduce digest_cache_dir_update_dig_user() to update dig_user with a
file digest cache on positive digest lookup
- Use up to date directory digest cache, to take into account possible
inode eviction for the old ones
- Introduce digest_cache_dir_prefetch() to prefetch digest lists
- Adjust component name in debug messages (suggested by Jarkko)
- Add FILE_PREFETCH and FILE_READ digest cache flags, remove RESET_USER
- Reintroduce spin lock for digest cache verification data (needed for the
selftests)
- Get inode and file descriptor security blob offsets from outside (IMA)
- Avoid user-after-free in digest_cache_unref() by decrementing the ref.
count after printing the debug message
- Check for digest list lookup loops also for the parent directory
- Put and clear dig_owner directly in digest_cache_reset_clear_owner()
- Move digest cache initialization code from digest_cache_create() to
digest_cache_init()
- Hold the digest list path until the digest cache is initialized (to avoid
premature inode eviction)
- Avoid race condition on setting DIR_PREFETCH in the directory digest
cache
- Introduce digest_cache_dir_prefetch() and do it between digest cache
creation and initialization (to avoid lock inversion)
- Avoid unnecessary length check in digest_list_parse_rpm()
- Declare arrays of strings in tlv parser as static
- Emit reset for parent directory on directory entry modification
- Rename digest_cache_reset_owner() to digest_cache_reset_clear_owner()
and digest_cache_reset_user() to digest_cache_clear_user()
- Execute digest_cache_file_release() either if FMODE_WRITE or
FMODE_CREATED are set in the file descriptor f_mode
- Determine in digest_cache_verif_set() which gfp flag to use depending on
verifier ID
- Update selftests
v3:
- Rewrite documentation, and remove the installation instructions since
they are now included in the README of digest-cache-tools
- Add digest cache event notifier
- Drop digest_cache_was_reset(), and send instead to asynchronous
notifications
- Fix digest_cache LSM Kconfig style issues (suggested by Randy Dunlap)
- Propagate digest cache reset to directory entries
- Destroy per directory entry mutex
- Introduce RESET_USER bit, to clear the dig_user pointer on
set/removexattr
- Replace 'file content' with 'file data' (suggested by Mimi)
- Introduce per digest cache mutex and replace verif_data_lock spinlock
- Track changes of security.digest_list xattr
- Stop tracking file_open and use file_release instead also for file writes
- Add error messages in digest_cache_create()
- Load/unload testing kernel module automatically during execution of test
- Add tests for digest cache event notifier
- Add test for ftruncate()
- Remove DIGEST_CACHE_RESET_PREFETCH_BUF command in test and clear the
buffer on read instead
v2:
- Include the TLV parser in this patch set (from user asymmetric keys and
signatures)
- Move from IMA and make an independent LSM
- Remove IMA-specific stuff from this patch set
- Add per algorithm hash table
- Expect all digest lists to be in the same directory and allow changing
the default directory
- Support digest lookup on directories, when there is no
security.digest_list xattr
- Add seq num to digest list file name, to impose ordering on directory
iteration
- Add a new data type DIGEST_LIST_ENTRY_DATA for the nested data in the
tlv digest list format
- Add the concept of verification data attached to digest caches
- Add the reset mechanism to track changes on digest lists and directory
containing the digest lists
- Add kernel selftests
v1:
- Add documentation in Documentation/security/integrity-digest-cache.rst
- Pass the mask of IMA actions to digest_cache_alloc()
- Add a reference count to the digest cache
- Remove the path parameter from digest_cache_get(), and rely on the
reference count to avoid the digest cache disappearing while being used
- Rename the dentry_to_check parameter of digest_cache_get() to dentry
- Rename digest_cache_get() to digest_cache_new() and add
digest_cache_get() to set the digest cache in the iint of the inode for
which the digest cache was requested
- Add dig_owner and dig_user to the iint, to distinguish from which inode
the digest cache was created from, and which is using it; consequently it
makes the digest cache usable to measure/appraise other digest caches
(support not yet enabled)
- Add dig_owner_mutex and dig_user_mutex to serialize accesses to dig_owner
and dig_user until they are initialized
- Enforce strong synchronization and make the contenders wait until
dig_owner and dig_user are assigned to the iint the first time
- Move checking IMA actions on the digest list earlier, and fail if no
action were performed (digest cache not usable)
- Remove digest_cache_put(), not needed anymore with the introduction of
the reference count
- Fail immediately in digest_cache_lookup() if the digest algorithm is
not set in the digest cache
- Use 64 bit mask for IMA actions on the digest list instead of 8 bit
- Return NULL in the inline version of digest_cache_get()
- Use list_add_tail() instead of list_add() in the iterator
- Copy the digest list path to a separate buffer in digest_cache_iter_dir()
- Use digest list parsers verified with Frama-C
- Explicitly disable (for now) the possibility in the IMA policy to use the
digest cache to measure/appraise other digest lists
- Replace exit(<value>) with return <value> in manage_digest_lists.c
Roberto Sassu (15):
lib: Add TLV parser
module: Introduce ksys_finit_module()
integrity: Introduce the Integrity Digest Cache
digest_cache: Initialize digest caches
digest_cache: Add securityfs interface
digest_cache: Add hash tables and operations
digest_cache: Allow registration of digest list parsers
digest_cache: Parse tlv digest lists
digest_cache: Populate the digest cache from a digest list
digest_cache: Add management of verification data
digest_cache: Add support for directories
digest cache: Prefetch digest lists if requested
digest_cache: Reset digest cache on file/directory change
selftests/digest_cache: Add selftests for the Integrity Digest Cache
docs: Add documentation of the Integrity Digest Cache
Documentation/security/digest_cache.rst | 850 ++++++++++++++++++
Documentation/security/index.rst | 1 +
MAINTAINERS | 10 +
include/linux/digest_cache.h | 131 +++
include/linux/kernel_read_file.h | 1 +
include/linux/syscalls.h | 10 +
include/linux/tlv_parser.h | 32 +
include/uapi/linux/tlv_digest_list.h | 47 +
include/uapi/linux/tlv_parser.h | 41 +
include/uapi/linux/xattr.h | 6 +
kernel/module/main.c | 43 +-
lib/Kconfig | 3 +
lib/Makefile | 2 +
lib/tlv_parser.c | 87 ++
lib/tlv_parser.h | 18 +
security/integrity/Kconfig | 1 +
security/integrity/Makefile | 1 +
security/integrity/digest_cache/Kconfig | 43 +
security/integrity/digest_cache/Makefile | 11 +
security/integrity/digest_cache/dir.c | 400 +++++++++
security/integrity/digest_cache/htable.c | 260 ++++++
security/integrity/digest_cache/internal.h | 283 ++++++
security/integrity/digest_cache/main.c | 597 ++++++++++++
security/integrity/digest_cache/modsig.c | 66 ++
security/integrity/digest_cache/parsers.c | 257 ++++++
security/integrity/digest_cache/parsers/tlv.c | 341 +++++++
security/integrity/digest_cache/populate.c | 104 +++
security/integrity/digest_cache/reset.c | 227 +++++
security/integrity/digest_cache/secfs.c | 104 +++
security/integrity/digest_cache/verif.c | 135 +++
security/integrity/ima/ima.h | 1 +
security/integrity/ima/ima_fs.c | 6 +
security/integrity/ima/ima_main.c | 10 +-
tools/testing/selftests/Makefile | 1 +
.../testing/selftests/digest_cache/.gitignore | 3 +
tools/testing/selftests/digest_cache/Makefile | 24 +
.../testing/selftests/digest_cache/all_test.c | 769 ++++++++++++++++
tools/testing/selftests/digest_cache/common.c | 78 ++
tools/testing/selftests/digest_cache/common.h | 93 ++
.../selftests/digest_cache/common_user.c | 33 +
.../selftests/digest_cache/common_user.h | 15 +
tools/testing/selftests/digest_cache/config | 2 +
.../selftests/digest_cache/generators.c | 130 +++
.../selftests/digest_cache/generators.h | 16 +
.../selftests/digest_cache/testmod/Makefile | 16 +
.../selftests/digest_cache/testmod/kern.c | 551 ++++++++++++
46 files changed, 5849 insertions(+), 11 deletions(-)
create mode 100644 Documentation/security/digest_cache.rst
create mode 100644 include/linux/digest_cache.h
create mode 100644 include/linux/tlv_parser.h
create mode 100644 include/uapi/linux/tlv_digest_list.h
create mode 100644 include/uapi/linux/tlv_parser.h
create mode 100644 lib/tlv_parser.c
create mode 100644 lib/tlv_parser.h
create mode 100644 security/integrity/digest_cache/Kconfig
create mode 100644 security/integrity/digest_cache/Makefile
create mode 100644 security/integrity/digest_cache/dir.c
create mode 100644 security/integrity/digest_cache/htable.c
create mode 100644 security/integrity/digest_cache/internal.h
create mode 100644 security/integrity/digest_cache/main.c
create mode 100644 security/integrity/digest_cache/modsig.c
create mode 100644 security/integrity/digest_cache/parsers.c
create mode 100644 security/integrity/digest_cache/parsers/tlv.c
create mode 100644 security/integrity/digest_cache/populate.c
create mode 100644 security/integrity/digest_cache/reset.c
create mode 100644 security/integrity/digest_cache/secfs.c
create mode 100644 security/integrity/digest_cache/verif.c
create mode 100644 tools/testing/selftests/digest_cache/.gitignore
create mode 100644 tools/testing/selftests/digest_cache/Makefile
create mode 100644 tools/testing/selftests/digest_cache/all_test.c
create mode 100644 tools/testing/selftests/digest_cache/common.c
create mode 100644 tools/testing/selftests/digest_cache/common.h
create mode 100644 tools/testing/selftests/digest_cache/common_user.c
create mode 100644 tools/testing/selftests/digest_cache/common_user.h
create mode 100644 tools/testing/selftests/digest_cache/config
create mode 100644 tools/testing/selftests/digest_cache/generators.c
create mode 100644 tools/testing/selftests/digest_cache/generators.h
create mode 100644 tools/testing/selftests/digest_cache/testmod/Makefile
create mode 100644 tools/testing/selftests/digest_cache/testmod/kern.c
--
2.47.0.118.gfd3785337b
Extend pmu_counters_test to AMD CPUs.
As the AMD PMU is quite different from Intel with different events and
feature sets, this series introduces a new code path to test it,
specifically focusing on the core counters including the
PerfCtrExtCore and PerfMonV2 features. Northbridge counters and cache
counters exist, but are not as important and can be deferred to a
later series.
The first patch is a bug fix that could be submitted separately.
The series has been tested on both Intel and AMD machines, but I have
not found an AMD machine old enough to lack PerfCtrExtCore. I have
made efforts that no part of the code has any dependency on its
presence.
I am aware of similar work in this direction done by Jinrong Liang
[1]. He told me he is not working on it currently and I am not
intruding by making my own submission.
[1] https://lore.kernel.org/kvm/20231121115457.76269-1-cloudliang@tencent.com/
v2:
* Test all combinations of VM setup rather than only the maximum
allowed by hardware
* Add fixes tag to bug fix in patch 1
* Refine some names
v1:
https://lore.kernel.org/kvm/20240813164244.751597-1-coltonlewis@google.com/
Colton Lewis (6):
KVM: x86: selftests: Fix typos in macro variable use
KVM: x86: selftests: Define AMD PMU CPUID leaves
KVM: x86: selftests: Set up AMD VM in pmu_counters_test
KVM: x86: selftests: Test read/write core counters
KVM: x86: selftests: Test core events
KVM: x86: selftests: Test PerfMonV2
.../selftests/kvm/include/x86_64/processor.h | 7 +
.../selftests/kvm/x86_64/pmu_counters_test.c | 304 ++++++++++++++++--
2 files changed, 277 insertions(+), 34 deletions(-)
base-commit: da3ea35007d0af457a0afc87e84fddaebc4e0b63
--
2.46.0.662.g92d0881bb0-goog
xtheadvector is a custom extension that is based upon riscv vector
version 0.7.1 [1]. All of the vector routines have been modified to
support this alternative vector version based upon whether xtheadvector
was determined to be supported at boot.
vlenb is not supported on the existing xtheadvector hardware, so a
devicetree property thead,vlenb is added to provide the vlenb to Linux.
There is a new hwprobe key RISCV_HWPROBE_KEY_VENDOR_EXT_THEAD_0 that is
used to request which thead vendor extensions are supported on the
current platform. This allows future vendors to allocate hwprobe keys
for their vendor.
Support for xtheadvector is also added to the vector kselftests.
Signed-off-by: Charlie Jenkins <charlie(a)rivosinc.com>
[1] https://github.com/T-head-Semi/thead-extension-spec/blob/95358cb2cca9489361…
---
This series is a continuation of a different series that was fragmented
into two other series in an attempt to get part of it merged in the 6.10
merge window. The split-off series did not get merged due to a NAK on
the series that added the generic riscv,vlenb devicetree entry. This
series has converted riscv,vlenb to thead,vlenb to remedy this issue.
The original series is titled "riscv: Support vendor extensions and
xtheadvector" [3].
The series titled "riscv: Extend cpufeature.c to detect vendor
extensions" is still under development and this series is based on that
series! [4]
I have tested this with an Allwinner Nezha board. I used SkiffOS [1] to
manage building the image, but upgraded the U-Boot version to Samuel
Holland's more up-to-date version [2] and changed out the device tree
used by U-Boot with the device trees that are present in upstream linux
and this series. Thank you Samuel for all of the work you did to make
this task possible.
[1] https://github.com/skiffos/SkiffOS/tree/master/configs/allwinner/nezha
[2] https://github.com/smaeul/u-boot/commit/2e89b706f5c956a70c989cd31665f1429e9…
[3] https://lore.kernel.org/all/20240503-dev-charlie-support_thead_vector_6_9-v…
[4] https://lore.kernel.org/lkml/20240719-support_vendor_extensions-v3-4-0af758…
---
Changes in v10:
- In DT probing disable vector with new function to clear vendor
extension bits for xtheadvector
- Add ghostwrite mitigations for c9xx CPUs. This disables xtheadvector
unless mitigations=off is set as a kernel boot arg
- Link to v9: https://lore.kernel.org/r/20240806-xtheadvector-v9-0-62a56d2da5d0@rivosinc.…
Changes in v9:
- Rebase onto palmer's for-next
- Fix sparse error in arch/riscv/kernel/vendor_extensions/thead.c
- Fix maybe-uninitialized warning in arch/riscv/include/asm/vendor_extensions/vendor_hwprobe.h
- Wrap some long lines
- Link to v8: https://lore.kernel.org/r/20240724-xtheadvector-v8-0-cf043168e137@rivosinc.…
Changes in v8:
- Rebase onto palmer's for-next
- Link to v7: https://lore.kernel.org/r/20240724-xtheadvector-v7-0-b741910ada3e@rivosinc.…
Changes in v7:
- Add defs for has_xtheadvector_no_alternatives() and has_xtheadvector()
when vector disabled. (Palmer)
- Link to v6: https://lore.kernel.org/r/20240722-xtheadvector-v6-0-c9af0130fa00@rivosinc.…
Changes in v6:
- Fix return type of is_vector_supported()/is_xthead_supported() to be bool
- Link to v5: https://lore.kernel.org/r/20240719-xtheadvector-v5-0-4b485fc7d55f@rivosinc.…
Changes in v5:
- Rebase on for-next
- Link to v4: https://lore.kernel.org/r/20240702-xtheadvector-v4-0-2bad6820db11@rivosinc.…
Changes in v4:
- Replace inline asm with C (Samuel)
- Rename VCSRs to CSRs (Samuel)
- Replace .insn directives with .4byte directives
- Link to v3: https://lore.kernel.org/r/20240619-xtheadvector-v3-0-bff39eb9668e@rivosinc.…
Changes in v3:
- Add back Heiko's signed-off-by (Conor)
- Mark RISCV_HWPROBE_KEY_VENDOR_EXT_THEAD_0 as a bitmask
- Link to v2: https://lore.kernel.org/r/20240610-xtheadvector-v2-0-97a48613ad64@rivosinc.…
Changes in v2:
- Removed extraneous references to "riscv,vlenb" (Jess)
- Moved declaration of "thead,vlenb" into cpus.yaml and added
restriction that it's only applicable to thead cores (Conor)
- Check CONFIG_RISCV_ISA_XTHEADVECTOR instead of CONFIG_RISCV_ISA_V for
thead,vlenb (Jess)
- Fix naming of hwprobe variables (Evan)
- Link to v1: https://lore.kernel.org/r/20240609-xtheadvector-v1-0-3fe591d7f109@rivosinc.…
---
Charlie Jenkins (13):
dt-bindings: riscv: Add xtheadvector ISA extension description
dt-bindings: cpus: add a thead vlen register length property
riscv: dts: allwinner: Add xtheadvector to the D1/D1s devicetree
riscv: Add thead and xtheadvector as a vendor extension
riscv: vector: Use vlenb from DT for thead
riscv: csr: Add CSR encodings for CSR_VXRM/CSR_VXSAT
riscv: Add xtheadvector instruction definitions
riscv: vector: Support xtheadvector save/restore
riscv: hwprobe: Add thead vendor extension probing
riscv: hwprobe: Document thead vendor extensions and xtheadvector extension
selftests: riscv: Fix vector tests
selftests: riscv: Support xtheadvector in vector tests
riscv: Add ghostwrite vulnerability
Heiko Stuebner (1):
RISC-V: define the elements of the VCSR vector CSR
Documentation/arch/riscv/hwprobe.rst | 10 +
Documentation/devicetree/bindings/riscv/cpus.yaml | 19 ++
.../devicetree/bindings/riscv/extensions.yaml | 10 +
arch/riscv/Kconfig.errata | 11 +
arch/riscv/Kconfig.vendor | 26 ++
arch/riscv/boot/dts/allwinner/sun20i-d1s.dtsi | 3 +-
arch/riscv/errata/thead/errata.c | 28 ++
arch/riscv/include/asm/bugs.h | 22 ++
arch/riscv/include/asm/cpufeature.h | 2 +
arch/riscv/include/asm/csr.h | 15 +
arch/riscv/include/asm/errata_list.h | 3 +-
arch/riscv/include/asm/hwprobe.h | 3 +-
arch/riscv/include/asm/switch_to.h | 2 +-
arch/riscv/include/asm/vector.h | 225 +++++++++++----
arch/riscv/include/asm/vendor_extensions/thead.h | 48 ++++
.../include/asm/vendor_extensions/thead_hwprobe.h | 19 ++
.../include/asm/vendor_extensions/vendor_hwprobe.h | 37 +++
arch/riscv/include/uapi/asm/hwprobe.h | 3 +-
arch/riscv/include/uapi/asm/vendor/thead.h | 3 +
arch/riscv/kernel/Makefile | 2 +
arch/riscv/kernel/bugs.c | 55 ++++
arch/riscv/kernel/cpufeature.c | 58 +++-
arch/riscv/kernel/kernel_mode_vector.c | 8 +-
arch/riscv/kernel/process.c | 4 +-
arch/riscv/kernel/signal.c | 6 +-
arch/riscv/kernel/sys_hwprobe.c | 5 +
arch/riscv/kernel/vector.c | 24 +-
arch/riscv/kernel/vendor_extensions.c | 10 +
arch/riscv/kernel/vendor_extensions/Makefile | 2 +
arch/riscv/kernel/vendor_extensions/thead.c | 29 ++
.../riscv/kernel/vendor_extensions/thead_hwprobe.c | 19 ++
drivers/base/cpu.c | 3 +
include/linux/cpu.h | 1 +
tools/testing/selftests/riscv/vector/.gitignore | 3 +-
tools/testing/selftests/riscv/vector/Makefile | 17 +-
.../selftests/riscv/vector/v_exec_initval_nolibc.c | 94 +++++++
tools/testing/selftests/riscv/vector/v_helpers.c | 68 +++++
tools/testing/selftests/riscv/vector/v_helpers.h | 8 +
tools/testing/selftests/riscv/vector/v_initval.c | 22 ++
.../selftests/riscv/vector/v_initval_nolibc.c | 68 -----
.../selftests/riscv/vector/vstate_exec_nolibc.c | 20 +-
.../testing/selftests/riscv/vector/vstate_prctl.c | 305 +++++++++++++--------
42 files changed, 1048 insertions(+), 272 deletions(-)
---
base-commit: 0e3f3649d44bf1b388a7613ade14c29cbdedf075
change-id: 20240530-xtheadvector-833d3d17b423
--
- Charlie
Hello,
This patchset is our exploration of how to support 1G pages in guest_memfd, and
how the pages will be used in Confidential VMs.
The patchset covers:
+ How to get 1G pages
+ Allowing mmap() of guest_memfd to userspace so that both private and shared
memory can use the same physical pages
+ Splitting and reconstructing pages to support conversions and mmap()
+ How the VM, userspace and guest_memfd interact to support conversions
+ Selftests to test all the above
+ Selftests also demonstrate the conversion flow between VM, userspace and
guest_memfd.
Why 1G pages in guest memfd?
Bring guest_memfd to performance and memory savings parity with VMs that are
backed by HugeTLBfs.
+ Performance is improved with 1G pages by more TLB hits and faster page walks
on TLB misses.
+ Memory savings from 1G pages comes from HugeTLB Vmemmap Optimization (HVO).
Options for 1G page support:
1. HugeTLB
2. Contiguous Memory Allocator (CMA)
3. Other suggestions are welcome!
Comparison between options:
1. HugeTLB
+ Refactor HugeTLB to separate allocator from the rest of HugeTLB
+ Pro: Graceful transition for VMs backed with HugeTLB to guest_memfd
+ Near term: Allows co-tenancy of HugeTLB and guest_memfd backed VMs
+ Pro: Can provide iterative steps toward new future allocator
+ Unexplored: Managing userspace-visible changes
+ e.g. HugeTLB's free_hugepages will decrease if HugeTLB is used,
but not when future allocator is used
2. CMA
+ Port some HugeTLB features to be applied on CMA
+ Pro: Clean slate
What would refactoring HugeTLB involve?
(Some refactoring was done in this RFC, more can be done.)
1. Broadly involves separating the HugeTLB allocator from the rest of HugeTLB
+ Brings more modularity to HugeTLB
+ No functionality change intended
+ Likely step towards HugeTLB's integration into core-mm
2. guest_memfd will use just the allocator component of HugeTLB, not including
the complex parts of HugeTLB like
+ Userspace reservations (resv_map)
+ Shared PMD mappings
+ Special page walkers
What features would need to be ported to CMA?
+ Improved allocation guarantees
+ Per NUMA node pool of huge pages
+ Subpools per guest_memfd
+ Memory savings
+ Something like HugeTLB Vmemmap Optimization
+ Configuration/reporting features
+ Configuration of number of pages available (and per NUMA node) at and
after host boot
+ Reporting of memory usage/availability statistics at runtime
HugeTLB was picked as the source of 1G pages for this RFC because it allows a
graceful transition, and retains memory savings from HVO.
To illustrate this, if a host machine uses HugeTLBfs to back VMs, and a
confidential VM were to be scheduled on that host, some HugeTLBfs pages would
have to be given up and returned to CMA for guest_memfd pages to be rebuilt from
that memory. This requires memory to be reserved for HVO to be removed and
reapplied on the new guest_memfd memory. This not only slows down memory
allocation but also trims the benefits of HVO. Memory would have to be reserved
on the host to facilitate these transitions.
Improving how guest_memfd uses the allocator in a future revision of this RFC:
To provide an easier transition away from HugeTLB, guest_memfd's use of HugeTLB
should be limited to these allocator functions:
+ reserve(node, page_size, num_pages) => opaque handle
+ Used when a guest_memfd inode is created to reserve memory from backend
allocator
+ allocate(handle, mempolicy, page_size) => folio
+ To allocate a folio from guest_memfd's reservation
+ split(handle, folio, target_page_size) => void
+ To take a huge folio, and split it to smaller folios, restore to filemap
+ reconstruct(handle, first_folio, nr_pages) => void
+ To take a folio, and reconstruct a huge folio out of nr_pages from the
first_folio
+ free(handle, folio) => void
+ To return folio to guest_memfd's reservation
+ error(handle, folio) => void
+ To handle memory errors
+ unreserve(handle) => void
+ To return guest_memfd's reservation to allocator backend
Userspace should only provide a page size when creating a guest_memfd and should
not have to specify HugeTLB.
Overview of patches:
+ Patches 01-12
+ Many small changes to HugeTLB, mostly to separate HugeTLBfs concepts from
HugeTLB, and to expose HugeTLB functions.
+ Patches 13-16
+ Letting guest_memfd use HugeTLB
+ Creation of each guest_memfd reserves pages from HugeTLB's global hstate
and puts it into the guest_memfd inode's subpool
+ Each folio allocation takes a page from the guest_memfd inode's subpool
+ Patches 17-21
+ Selftests for new HugeTLB features in guest_memfd
+ Patches 22-24
+ More small changes on the HugeTLB side to expose functions needed by
guest_memfd
+ Patch 25:
+ Uses the newly available functions from patches 22-24 to split HugeTLB
pages. In this patch, HugeTLB folios are always split to 4K before any
usage, private or shared.
+ Patches 26-28
+ Allow mmap() in guest_memfd and faulting in shared pages
+ Patch 29
+ Enables conversion between private/shared pages
+ Patch 30
+ Required to zero folios after conversions to avoid leaking initialized
kernel memory
+ Patch 31-38
+ Add selftests to test mapping pages to userspace, guest/host memory
sharing and update conversions tests
+ Patch 33 illustrates the conversion flow between VM/userspace/guest_memfd
+ Patch 39
+ Dynamically split and reconstruct HugeTLB pages instead of always
splitting before use. All earlier selftests are expected to still pass.
TODOs:
+ Add logic to wait for safe_refcount [1]
+ Look into lazy splitting/reconstruction of pages
+ Currently, when the KVM_SET_MEMORY_ATTRIBUTES is invoked, not only is the
mem_attr_array and faultability updated, the pages in the requested range
are also split/reconstructed as necessary. We want to look into delaying
splitting/reconstruction to fault time.
+ Solve race between folios being faulted in and being truncated
+ When running private_mem_conversions_test with more than 1 vCPU, a folio
getting truncated may get faulted in by another process, causing elevated
mapcounts when the folio is freed (VM_BUG_ON_FOLIO).
+ Add intermediate splits (1G should first split to 2M and not split directly to
4K)
+ Use guest's lock instead of hugetlb_lock
+ Use multi-index xarray/replace xarray with some other data struct for
faultability flag
+ Refactor HugeTLB better, present generic allocator interface
Please let us know your thoughts on:
+ HugeTLB as the choice of transitional allocator backend
+ Refactoring HugeTLB to provide generic allocator interface
+ Shared/private conversion flow
+ Requiring user to request kernel to unmap pages from userspace using
madvise(MADV_DONTNEED)
+ Failing conversion on elevated mapcounts/pincounts/refcounts
+ Process of splitting/reconstructing page
+ Anything else!
[1] https://lore.kernel.org/all/20240829-guest-memfd-lib-v2-0-b9afc1ff3656@quic…
Ackerley Tng (37):
mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma()
mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv()
mm: hugetlb: Remove unnecessary check for avoid_reserve
mm: mempolicy: Refactor out policy_node_nodemask()
mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to
interpret mempolicy instead of vma
mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol
mm: hugetlb: Refactor out hugetlb_alloc_folio
mm: truncate: Expose preparation steps for truncate_inode_pages_final
mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
mm: hugetlb: Add option to create new subpool without using surplus
mm: hugetlb: Expose hugetlb_acct_memory()
mm: hugetlb: Move and expose hugetlb_zero_partial_page()
KVM: guest_memfd: Make guest mem use guest mem inodes instead of
anonymous inodes
KVM: guest_memfd: hugetlb: initialization and cleanup
KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd
KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
KVM: selftests: Support various types of backing sources for private
memory
KVM: selftests: Update test for various private memory backing source
types
KVM: selftests: Add private_mem_conversions_test.sh
KVM: selftests: Test that guest_memfd usage is reported via hugetlb
mm: hugetlb: Expose vmemmap optimization functions
mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages
mm: hugetlb: Add functions to add/move/remove from hugetlb lists
KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
KVM: guest_memfd: Allow mmapping guest_memfd files
KVM: guest_memfd: Use vm_type to determine default faultability
KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl
KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
KVM: selftests: Allow vm_set_memory_attributes to be used without
asserting return value of 0
KVM: selftests: Test using guest_memfd memory from userspace
KVM: selftests: Test guest_memfd memory sharing between guest and host
KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able
guest_memfd
KVM: selftests: Test that pinned pages block KVM from setting memory
attributes to PRIVATE
KVM: selftests: Refactor vm_mem_add to be more flexible
KVM: selftests: Add helper to perform madvise by memslots
KVM: selftests: Update private_mem_conversions_test for mmap()able
guest_memfd
Vishal Annapurve (2):
KVM: guest_memfd: Split HugeTLB pages for guest_memfd use
KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
fs/hugetlbfs/inode.c | 35 +-
include/linux/hugetlb.h | 54 +-
include/linux/kvm_host.h | 1 +
include/linux/mempolicy.h | 2 +
include/linux/mm.h | 1 +
include/uapi/linux/kvm.h | 26 +
include/uapi/linux/magic.h | 1 +
mm/hugetlb.c | 346 ++--
mm/hugetlb_vmemmap.h | 11 -
mm/mempolicy.c | 36 +-
mm/truncate.c | 26 +-
tools/include/linux/kernel.h | 4 +-
tools/testing/selftests/kvm/Makefile | 3 +
.../kvm/guest_memfd_hugetlb_reporting_test.c | 222 +++
.../selftests/kvm/guest_memfd_pin_test.c | 104 ++
.../selftests/kvm/guest_memfd_sharing_test.c | 160 ++
.../testing/selftests/kvm/guest_memfd_test.c | 238 ++-
.../testing/selftests/kvm/include/kvm_util.h | 45 +-
.../testing/selftests/kvm/include/test_util.h | 18 +
tools/testing/selftests/kvm/lib/kvm_util.c | 443 +++--
tools/testing/selftests/kvm/lib/test_util.c | 99 ++
.../kvm/x86_64/private_mem_conversions_test.c | 158 +-
.../x86_64/private_mem_conversions_test.sh | 91 +
.../kvm/x86_64/private_mem_kvm_exits_test.c | 11 +-
virt/kvm/guest_memfd.c | 1563 ++++++++++++++++-
virt/kvm/kvm_main.c | 17 +
virt/kvm/kvm_mm.h | 16 +
27 files changed, 3288 insertions(+), 443 deletions(-)
create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
create mode 100644 tools/testing/selftests/kvm/guest_memfd_pin_test.c
create mode 100644 tools/testing/selftests/kvm/guest_memfd_sharing_test.c
create mode 100755 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh
--
2.46.0.598.g6f2099f65c-goog
This is something that I've been thinking about for a while. We had a
discussion at LPC 2020 about this[1] but the proposals suggested there
never materialised.
In short, it is quite difficult for userspace to detect the feature
capability of syscalls at runtime. This is something a lot of programs
want to do, but they are forced to create elaborate scenarios to try to
figure out if a feature is supported without causing damage to the
system. For the vast majority of cases, each individual feature also
needs to be tested individually (because syscall results are
all-or-nothing), so testing even a single syscall's feature set can
easily inflate the startup time of programs.
This patchset implements the fairly minimal design I proposed in this
talk[2] and in some old LKML threads (though I can't find the exact
references ATM). The general flow looks like:
1. Userspace will indicate to the kernel that a syscall should a be
no-op by setting the top bit of the extensible struct size argument.
We will almost certainly never support exabyte sized structs, so the
top bits are free for us to use as makeshift flag bits. This is
preferable to using the per-syscall flag field inside the structure
because seccomp can easily detect the bit in the flag and allow the
probe or forcefully return -EEXTSYS_NOOP.
2. The kernel will then fill the provided structure with every valid
bit pattern that the current kernel understands.
For flags or other bitflag-like fields, this is the set of valid
flags or bits. For pointer fields or fields that take an arbitrary
value, the field has every bit set (0xFF... to fill the field) to
indicate that any value is valid in the field.
3. The syscall then returns -EEXTSYS_NOOP which is an errno that will
only ever be used for this purpose (so userspace can be sure that
the request succeeded).
On older kernels, the syscall will return a different error (usually
-E2BIG or -EFAULT) and userspace can do their old-fashioned checks.
4. Userspace can then check which flags and fields are supported by
looking at the fields in the returned structure. Flags are checked
by doing an AND with the flags field, and field support can checked
by comparing to 0. In principle you could just AND the entire
structure if you wanted to do this check generically without caring
about the structure contents (this is what libraries might consider
doing).
Userspace can even find out the internal kernel structure size by
passing a PAGE_SIZE buffer and seeing how many bytes are non-zero.
As with copy_struct_from_user(), this is designed to be forward- and
backwards- compatible.
This allows programas to get a one-shot understanding of what features a
syscall supports without having to do any elaborate setups or tricks to
detect support for destructive features. Flags can simply be ANDed to
check if they are in the supported set, and fields can just be checked
to see if they are non-zero.
This patchset is IMHO the simplest way we can add the ability to
introspect the feature set of extensible struct (copy_struct_from_user)
syscalls. It doesn't preclude the chance of a more generic mechanism
being added later.
The intended way of using this interface to get feature information
looks something like the following (imagine that openat2 has gained a
new field and a new flag in the future):
static bool openat2_no_automount_supported;
static bool openat2_cwd_fd_supported;
int check_openat2_support(void)
{
int err;
struct open_how how = {};
err = openat2(AT_FDCWD, ".", &how, CHECK_FIELDS | sizeof(how));
assert(err < 0);
switch (errno) {
case EFAULT: case E2BIG:
/* Old kernel... */
check_support_the_old_way();
break;
case EEXTSYS_NOOP:
openat2_no_automount_supported = (how.flags & RESOLVE_NO_AUTOMOUNT);
openat2_cwd_fd_supported = (how.cwd_fd != 0);
break;
}
}
This series adds CHECK_FIELDS support for the following extensible
struct syscalls, as they are quite likely to grow flags in the near
future:
* openat2
* clone3
* mount_setattr
[1]: https://lwn.net/Articles/830666/
[2]: https://youtu.be/ggD-eb3yPVs
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Changes in v3:
- Fix copy_struct_to_user() return values in case of clear_user() failure.
- v2: <https://lore.kernel.org/r/20240906-extensible-structs-check_fields-v2-0-0f4…>
Changes in v2:
- Add CHECK_FIELDS support to mount_setattr(2).
- Fix build failure on architectures with custom errno values.
- Rework selftests to use the tools/ uAPI headers rather than custom
defining EEXTSYS_NOOP.
- Make sure we return -EINVAL and -E2BIG for invalid sizes even if
CHECK_FIELDS is set, and add some tests for that.
- v1: <https://lore.kernel.org/r/20240902-extensible-structs-check_fields-v1-0-545…>
---
Aleksa Sarai (10):
uaccess: add copy_struct_to_user helper
sched_getattr: port to copy_struct_to_user
openat2: explicitly return -E2BIG for (usize > PAGE_SIZE)
openat2: add CHECK_FIELDS flag to usize argument
selftests: openat2: add 0xFF poisoned data after misaligned struct
selftests: openat2: add CHECK_FIELDS selftests
clone3: add CHECK_FIELDS flag to usize argument
selftests: clone3: add CHECK_FIELDS selftests
mount_setattr: add CHECK_FIELDS flag to usize argument
selftests: mount_setattr: add CHECK_FIELDS selftest
arch/alpha/include/uapi/asm/errno.h | 3 +
arch/mips/include/uapi/asm/errno.h | 3 +
arch/parisc/include/uapi/asm/errno.h | 3 +
arch/sparc/include/uapi/asm/errno.h | 3 +
fs/namespace.c | 17 ++
fs/open.c | 18 ++
include/linux/uaccess.h | 97 ++++++++
include/uapi/asm-generic/errno.h | 3 +
include/uapi/linux/openat2.h | 2 +
kernel/fork.c | 30 ++-
kernel/sched/syscalls.c | 42 +---
tools/arch/alpha/include/uapi/asm/errno.h | 3 +
tools/arch/mips/include/uapi/asm/errno.h | 3 +
tools/arch/parisc/include/uapi/asm/errno.h | 3 +
tools/arch/sparc/include/uapi/asm/errno.h | 3 +
tools/include/uapi/asm-generic/errno.h | 3 +
tools/include/uapi/asm-generic/posix_types.h | 101 ++++++++
tools/testing/selftests/clone3/.gitignore | 1 +
tools/testing/selftests/clone3/Makefile | 4 +-
.../testing/selftests/clone3/clone3_check_fields.c | 264 +++++++++++++++++++++
tools/testing/selftests/mount_setattr/Makefile | 2 +-
.../selftests/mount_setattr/mount_setattr_test.c | 53 ++++-
tools/testing/selftests/openat2/Makefile | 2 +
tools/testing/selftests/openat2/openat2_test.c | 165 ++++++++++++-
24 files changed, 777 insertions(+), 51 deletions(-)
---
base-commit: 98f7e32f20d28ec452afb208f9cffc08448a2652
change-id: 20240803-extensible-structs-check_fields-a47e94cef691
Best regards,
--
Aleksa Sarai <cyphar(a)cyphar.com>