The kernel has recently added support for shadow stacks, currently
x86 only using their CET feature but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively), I am actively
working on GCS[1]. With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses. This provides some
protection against ROP attacks …
[View More]and making it easier to collect call
stacks. These shadow stacks are allocated in the address space of the
userspace process.
Our API for shadow stacks does not currently offer userspace any
flexiblity for managing the allocation of shadow stacks for newly
created threads, instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled. The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread. This lack of flexibility and control isn't ideal, in the vast
majority of cases the shadow stack will be over allocated and the
implicit allocation and deallocation is not consistent with other
interfaces. As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.
Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process, keeping the current
implicit allocation behaviour if one is not specified either with
clone3() or through the use of clone(). The user must provide a shadow
stack pointer, this must point to memory mapped for use as a shadow
stackby map_shadow_stack() with an architecture specified shadow stack
token at the top of the stack.
Please note that the x86 portions of this code are build tested only, I
don't appear to have a system that can run CET available to me.
[1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v12:
- Add the regular prctl() to the userspace API document since arm64
support is queued in -next.
- Link to v11: https://lore.kernel.org/r/20241005-clone3-shadow-stack-v11-0-2a6a2bd6d651@k…
Changes in v11:
- Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and
integrate arm64 support.
- Rework the interface to specify a shadow stack pointer rather than a
base and size like we do for the regular stack.
- Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@k…
Changes in v10:
- Integrate fixes & improvements for the x86 implementation from Rick
Edgecombe.
- Require that the shadow stack be VM_WRITE.
- Require that the shadow stack base and size be sizeof(void *) aligned.
- Clean up trailing newline.
- Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@ke…
Changes in v9:
- Pull token validation earlier and report problems with an error return
to parent rather than signal delivery to the child.
- Verify that the top of the supplied shadow stack is VM_SHADOW_STACK.
- Rework token validation to only do the page mapping once.
- Drop no longer needed support for testing for signals in selftest.
- Fix typo in comments.
- Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@ke…
Changes in v8:
- Fix token verification with user specified shadow stack.
- Don't track user managed shadow stacks for child processes.
- Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@ke…
Changes in v7:
- Rebase onto v6.11-rc1.
- Typo fixes.
- Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@ke…
Changes in v6:
- Rebase onto v6.10-rc3.
- Ensure we don't try to free the parent shadow stack in error paths of
x86 arch code.
- Spelling fixes in userspace API document.
- Additional cleanups and improvements to the clone3() tests to support
the shadow stack tests.
- Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@ke…
Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with
map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by
other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke…
Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic
validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke…
Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke…
Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke…
---
Mark Brown (8):
arm64/gcs: Return a success value from gcs_alloc_thread_stack()
Documentation: userspace-api: Add shadow stack API documentation
selftests: Provide helper header for shadow stack testing
fork: Add shadow stack support to clone3()
selftests/clone3: Remove redundant flushes of output streams
selftests/clone3: Factor more of main loop into test_clone3()
selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
selftests/clone3: Test shadow stack support
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/shadow_stack.rst | 42 ++++
arch/arm64/include/asm/gcs.h | 8 +-
arch/arm64/kernel/process.c | 8 +-
arch/arm64/mm/gcs.c | 62 +++++-
arch/x86/include/asm/shstk.h | 11 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 57 +++++-
include/asm-generic/cacheflush.h | 11 ++
include/linux/sched/task.h | 17 ++
include/uapi/linux/sched.h | 10 +-
kernel/fork.c | 96 +++++++--
tools/testing/selftests/clone3/clone3.c | 226 ++++++++++++++++++----
tools/testing/selftests/clone3/clone3_selftests.h | 65 ++++++-
tools/testing/selftests/ksft_shstk.h | 98 ++++++++++
15 files changed, 633 insertions(+), 81 deletions(-)
---
base-commit: d17cd7b7cc92d37ee8b2df8f975fc859a261f4dc
change-id: 20231019-clone3-shadow-stack-15d40d2bf536
Best regards,
--
Mark Brown <broonie(a)kernel.org>
[View Less]
From: Tycho Andersen <tandersen(a)netflix.com>
Zbigniew mentioned at Linux Plumber's that systemd is interested in
switching to execveat() for service execution, but can't, because the
contents of /proc/pid/comm are the file descriptor which was used,
instead of the path to the binary. This makes the output of tools like
top and ps useless, especially in a world where most fds are opened
CLOEXEC so the number is truly meaningless.
Change exec path to fix up /proc/pid/comm in the case …
[View More]where we have
allocated one of these synthetic paths in bprm_init(). This way the actual
exec machinery is unchanged, but cosmetically the comm looks reasonable to
admins investigating things.
Signed-off-by: Tycho Andersen <tandersen(a)netflix.com>
Suggested-by: Zbigniew Jędrzejewski-Szmek <zbyszek(a)in.waw.pl>
CC: Aleksa Sarai <cyphar(a)cyphar.com>
Link: https://github.com/uapi-group/kernel-features#set-comm-field-before-exec
---
v2: * drop the flag, everyone :)
* change the rendered value to f_path.dentry->d_name.name instead of
argv[0], Eric
v3: * fix up subject line, Eric
v4: * switch to no flag, always rewrite approach, with some cleanup
suggested by Kees
---
fs/exec.c | 36 +++++++++++++++++++++++++++++++++++-
include/linux/binfmts.h | 1 +
2 files changed, 36 insertions(+), 1 deletion(-)
diff --git a/fs/exec.c b/fs/exec.c
index 6c53920795c2..3b559f598c74 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1347,7 +1347,16 @@ int begin_new_exec(struct linux_binprm * bprm)
set_dumpable(current->mm, SUID_DUMP_USER);
perf_event_exec();
- __set_task_comm(me, kbasename(bprm->filename), true);
+
+ /*
+ * If argv0 was set, alloc_bprm() made up a path that will
+ * probably not be useful to admins running ps or similar.
+ * Let's fix it up to be something reasonable.
+ */
+ if (bprm->argv0)
+ __set_task_comm(me, kbasename(bprm->argv0), true);
+ else
+ __set_task_comm(me, kbasename(bprm->filename), true);
/* An exec changes our domain. We are no longer part of the thread
group */
@@ -1497,9 +1506,28 @@ static void free_bprm(struct linux_binprm *bprm)
if (bprm->interp != bprm->filename)
kfree(bprm->interp);
kfree(bprm->fdpath);
+ kfree(bprm->argv0);
kfree(bprm);
}
+static int bprm_add_fixup_comm(struct linux_binprm *bprm,
+ struct user_arg_ptr argv)
+{
+ const char __user *p = get_user_arg_ptr(argv, 0);
+
+ /*
+ * If p == NULL, let's just fall back to fdpath.
+ */
+ if (!p)
+ return 0;
+
+ bprm->argv0 = strndup_user(p, MAX_ARG_STRLEN);
+ if (bprm->argv0)
+ return 0;
+
+ return -EFAULT;
+}
+
static struct linux_binprm *alloc_bprm(int fd, struct filename *filename, int flags)
{
struct linux_binprm *bprm;
@@ -1906,6 +1934,12 @@ static int do_execveat_common(int fd, struct filename *filename,
goto out_ret;
}
+ if (unlikely(bprm->fdpath)) {
+ retval = bprm_add_fixup_comm(bprm, argv);
+ if (retval != 0)
+ goto out_free;
+ }
+
retval = count(argv, MAX_ARG_STRINGS);
if (retval == 0)
pr_warn_once("process '%s' launched '%s' with NULL argv: empty string added\n",
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index e6c00e860951..bab5121a746b 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -55,6 +55,7 @@ struct linux_binprm {
of the time same as filename, but could be
different for binfmt_{misc,script} */
const char *fdpath; /* generated filename for execveat */
+ const char *argv0; /* argv0 from execveat */
unsigned interp_flags;
int execfd; /* File descriptor of the executable */
unsigned long loader, exec;
base-commit: c1e939a21eb111a6d6067b38e8e04b8809b64c4e
--
2.34.1
[View Less]
Hi,
At Linux Plumbers, a few dozen of us gathered together to discuss how
to expose what tests subsystem maintainers would like to run for every
patch submitted or when CI runs tests. We agreed on a mock up of a
yaml template to start gathering info. The yaml file could be
temporarily stored on kernelci.org until a more permanent home could
be found. Attached is a template to start the conversation.
Longer story.
The current problem is CI systems are not unanimous about what tests
they …
[View More]run on submitted patches or git branches. This makes it
difficult to figure out why a test failed or how to reproduce.
Further, it isn't always clear what tests a normal contributor should
run before posting patches.
It has been long communicated that the tests LTP, xfstest and/or
kselftests should be the tests to run. However, not all maintainers
use those tests for their subsystems. I am hoping to either capture
those tests or find ways to convince them to add their tests to the
preferred locations.
The goal is for a given subsystem (defined in MAINTAINERS), define a
set of tests that should be run for any contributions to that
subsystem. The hope is the collective CI results can be triaged
collectively (because they are related) and even have the numerous
flakes waived collectively (same reason) improving the ability to
find and debug new test failures. Because the tests and process are
known, having a human help debug any failures becomes easier.
The plan is to put together a minimal yaml template that gets us going
(even if it is not optimized yet) and aim for about a dozen or so
subsystems. At that point we should have enough feedback to promote
this more seriously and talk optimizations.
Feedback encouraged.
Cheers,
Don
---
# List of tests by subsystem
#
# Tests should adhere to KTAP definitions for results
#
# Description of section entries
#
# maintainer: test maintainer - name <email>
# list: mailing list for discussion
# version: stable version of the test
# dependency: necessary distro package for testing
# test:
# path: internal git path or url to fetch from
# cmd: command to run; ability to run locally
# param: additional param necessary to run test
# hardware: hardware necessary for validation
#
# Subsystems (alphabetical)
KUNIT TEST:
maintainer:
- name: name1
email: email1
- name: name2
email: email2
list:
version:
dependency:
- dep1
- dep2
test:
- path: tools/testing/kunit
cmd:
param:
- path:
cmd:
param:
hardware: none
[View Less]
This series is a follow-up to Joey's Permission Overlay Extension (POE)
series [1] that recently landed on mainline. The goal is to improve the
way we handle the register that governs which pkeys/POIndex are
accessible (POR_EL0) during signal delivery. As things stand, we may
unexpectedly fail to write the signal frame on the stack because POR_EL0
is not reset before the uaccess operations. See patch 3 for more details
and the main changes this series brings.
A similar series landed recently …
[View More]for x86/MPK [2]; the present series
aims at aligning arm64 with x86. Worth noting: once the signal frame is
written, POR_EL0 is still set to POR_EL0_INIT, granting access to pkey 0
only. This means that a program that sets up an alternate signal stack
with a non-zero pkey will need some assembly trampoline to set POR_EL0
before invoking the real signal handler, as discussed here [3]. This is
not ideal, but it makes experimentation with pkeys in signal handlers
possible while waiting for a potential interface to control the pkey
state when delivering a signal. See Pierre's reply [4] for more
information about use-cases and a potential interface.
The x86 series also added kselftests to ensure that no spurious SIGSEGV
occurs during signal delivery regardless of which pkey is accessible at
the point where the signal is delivered. This series adapts those
kselftests to allow running them on arm64 (patch 4-5).
Finally patch 2 is a clean-up following feedback on Joey's series [5].
I have tested this series on arm64 and x86_64 (booting and running the
protection_keys and pkey_sighandler_tests mm kselftests).
v1..v2:
* In setup_rt_frame(), ensured that POR_EL0 is reset to its original
value if we fail to deliver the signal (addresses Catalin's concern [6]).
* Renamed *unpriv_access* to *user_access* in patch 3 (suggestion from
Dave).
* Made what patch 1-2 do explicit in the commit message body (suggestion
from Dave).
- Kevin
[1] https://lore.kernel.org/linux-arm-kernel/20240822151113.1479789-1-joey.goul…
[2] https://lore.kernel.org/lkml/20240802061318.2140081-1-aruna.ramakrishna@ora…
[3] https://lore.kernel.org/lkml/CABi2SkWxNkP2O7ipkP67WKz0-LV33e5brReevTTtba6oK…
[4] https://lore.kernel.org/linux-arm-kernel/87plns8owh.fsf@arm.com/
[5] https://lore.kernel.org/linux-arm-kernel/20241015114116.GA19334@willie-the-…
[6] https://lore.kernel.org/linux-arm-kernel/Zw6D2waVyIwYE7wd@arm.com/
Cc: akpm(a)linux-foundation.org
Cc: anshuman.khandual(a)arm.com
Cc: aruna.ramakrishna(a)oracle.com
Cc: broonie(a)kernel.org
Cc: catalin.marinas(a)arm.com
Cc: dave.hansen(a)linux.intel.com
Cc: dave.martin(a)arm.com
Cc: jeffxu(a)chromium.org
Cc: joey.gouly(a)arm.com
Cc: pierre.langlois(a)arm.com
Cc: shuah(a)kernel.org
Cc: sroettger(a)google.com
Cc: will(a)kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: x86(a)kernel.org
Kevin Brodsky (5):
arm64: signal: Remove unused macro
arm64: signal: Remove unnecessary check when saving POE state
arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
selftests/mm: Use generic pkey register manipulation
selftests/mm: Enable pkey_sighandler_tests on arm64
arch/arm64/kernel/signal.c | 95 +++++++++++++---
tools/testing/selftests/mm/Makefile | 8 +-
tools/testing/selftests/mm/pkey-arm64.h | 1 +
tools/testing/selftests/mm/pkey-x86.h | 2 +
.../selftests/mm/pkey_sighandler_tests.c | 101 +++++++++++++-----
5 files changed, 162 insertions(+), 45 deletions(-)
--
2.43.0
[View Less]
Unmapping virtual machine guest memory from the host kernel's direct map
is a successful mitigation against Spectre-style transient execution
issues: If the kernel page tables do not contain entries pointing to
guest memory, then any attempted speculative read through the direct map
will necessarily be blocked by the MMU before any observable
microarchitectural side-effects happen. This means that Spectre-gadgets
and similar cannot be used to target virtual machine memory. Roughly 60%
of …
[View More]speculative execution issues fall into this category [1, Table 1].
This patch series extends guest_memfd with the ability to remove its
memory from the host kernel's direct map, to be able to attain the above
protection for KVM guests running inside guest_memfd.
=== Changes to v2 ===
- Handle direct map removal for physically contiguous pages in arch code
(Mike R.)
- Track the direct map state in guest_memfd itself instead of at the
folio level, to prepare for huge pages support (Sean C.)
- Allow configuring direct map state of not-yet faulted in memory
(Vishal A.)
- Pay attention to alignment in ftrace structs (Steven R.)
Most significantly, I've reduced the patch series to focus only on
direct map removal for guest_memfd for now, leaving the whole "how to do
non-CoCo VMs in guest_memfd" for later. If this separation is
acceptable, then I think I can drop the RFC tag in the next revision
(I've mainly kept it here because I'm not entirely sure what to do with
patches 3 and 4).
=== Implementation ===
This patch series introduces a new flag to the KVM_CREATE_GUEST_MEMFD
that causes guest_memfd to remove its pages from the host kernel's
direct map immediately after population/preparation. It also adds
infrastructure for tracking the direct map state of all gmem folios
inside the guest_memfd inode. Storing this information in the inode has
the advantage that the code is ready for future hugepages extensions,
where only removing/reinserting direct map entries for sub-ranges of a
huge folio is a valid usecase, and it allows pre-configuring the direct
map state of not-yet faulted in parts of memory (for example, when the
VMM is receiving a RX virtio buffer from the guest).
=== Summary ===
Patch 1 (from Mike Rapoport) adds arch APIs for manipulating the direct
map for ranges of physically contiguous pages, which are used by
guest_memfd in follow up patches. Patch 2 adds the
KVM_GMEM_NO_DIRECT_MAP flag and the logic for configuring direct map
state of freshly prepared folios. Patches 3 and 4 mainly serve an
illustrative purpose, to show how the framework from patch 2 can be
extended with routines for runtime direct map manipulation. Patches 5
and 6 deal with documentation and self-tests respectively.
[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
[RFC v1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/
[RFC v2]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/
Mike Rapoport (Microsoft) (1):
arch: introduce set_direct_map_valid_noflush()
Patrick Roy (5):
kvm: gmem: add flag to remove memory from kernel direct map
kvm: gmem: implement direct map manipulation routines
kvm: gmem: add trace point for direct map state changes
kvm: document KVM_GMEM_NO_DIRECT_MAP flag
kvm: selftests: run gmem tests with KVM_GMEM_NO_DIRECT_MAP set
Documentation/virt/kvm/api.rst | 14 +
arch/arm64/include/asm/set_memory.h | 1 +
arch/arm64/mm/pageattr.c | 10 +
arch/loongarch/include/asm/set_memory.h | 1 +
arch/loongarch/mm/pageattr.c | 21 ++
arch/riscv/include/asm/set_memory.h | 1 +
arch/riscv/mm/pageattr.c | 15 +
arch/s390/include/asm/set_memory.h | 1 +
arch/s390/mm/pageattr.c | 11 +
arch/x86/include/asm/set_memory.h | 1 +
arch/x86/mm/pat/set_memory.c | 8 +
include/linux/set_memory.h | 6 +
include/trace/events/kvm.h | 22 ++
include/uapi/linux/kvm.h | 2 +
.../testing/selftests/kvm/guest_memfd_test.c | 2 +-
.../kvm/x86_64/private_mem_conversions_test.c | 7 +-
virt/kvm/guest_memfd.c | 280 +++++++++++++++++-
17 files changed, 384 insertions(+), 19 deletions(-)
base-commit: 5cb1659f412041e4780f2e8ee49b2e03728a2ba6
--
2.47.0
[View Less]
To be able to switch VMware products running on Linux to KVM some minor
changes are required to let KVM run/resume unmodified VMware guests.
First allow enabling of the VMware backdoor via an api. Currently the
setting of the VMware backdoor is limited to kernel boot parameters,
which forces all VM's running on a host to either run with or without
the VMware backdoor. Add a simple cap to allow enabling of the VMware
backdoor on a per VM basis. The default for that setting remains the
kvm.…
[View More]enable_vmware_backdoor boot parameter (which is false by default)
and can be changed on a per-vm basis via the KVM_CAP_X86_VMWARE_BACKDOOR
cap.
Second add a cap to forward hypercalls to userspace. I know that in
general that's frowned upon but VMwre guests send quite a few hypercalls
from userspace and it would be both impractical and largelly impossible
to handle all in the kernel. The change is trivial and I'd be maintaining
this code so I hope it's not a big deal.
The third commit just adds a self-test for the "forward VMware hypercalls
to userspace" functionality.
Cc: Doug Covelli <doug.covelli(a)broadcom.com>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Jonathan Corbet <corbet(a)lwn.net>
Cc: Sean Christopherson <seanjc(a)google.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: x86(a)kernel.org
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Namhyung Kim <namhyung(a)kernel.org>
Cc: Arnaldo Carvalho de Melo <acme(a)redhat.com>
Cc: Isaku Yamahata <isaku.yamahata(a)intel.com>
Cc: Joel Stanley <joel(a)jms.id.au>
Cc: Zack Rusin <zack.rusin(a)broadcom.com>
Cc: kvm(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Zack Rusin (3):
KVM: x86: Allow enabling of the vmware backdoor via a cap
KVM: x86: Add support for VMware guest specific hypercalls
KVM: selftests: x86: Add a test for KVM_CAP_X86_VMWARE_HYPERCALL
Documentation/virt/kvm/api.rst | 56 ++++++++-
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/emulate.c | 5 +-
arch/x86/kvm/svm/svm.c | 6 +-
arch/x86/kvm/vmx/vmx.c | 4 +-
arch/x86/kvm/x86.c | 47 ++++++++
arch/x86/kvm/x86.h | 7 +-
include/uapi/linux/kvm.h | 2 +
tools/include/uapi/linux/kvm.h | 2 +
tools/testing/selftests/kvm/Makefile | 1 +
.../kvm/x86_64/vmware_hypercall_test.c | 108 ++++++++++++++++++
11 files changed, 227 insertions(+), 13 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/vmware_hypercall_test.c
--
2.43.0
[View Less]