Attacks against vulnerable userspace applications with the purpose to break
ASLR or bypass canaries traditionally use some level of brute force with
the help of the fork system call. This is possible since when creating a
new process using fork its memory contents are the same as those of the
parent process (the process that called the fork system call). So, the
attacker can test the memory infinite times to find the correct memory
values or the correct memory addresses without worrying about crashing the
application.
Based on the above scenario it would be nice to have this detected and
mitigated, and this is the goal of this patch serie. Specifically the
following attacks are expected to be detected:
1.- Launching (fork()/exec()) a setuid/setgid process repeatedly until a
desirable memory layout is got (e.g. Stack Clash).
2.- Connecting to an exec()ing network daemon (e.g. xinetd) repeatedly
until a desirable memory layout is got (e.g. what CTFs do for simple
network service).
3.- Launching processes without exec() (e.g. Android Zygote) and exposing
state to attack a sibling.
4.- Connecting to a fork()ing network daemon (e.g. apache) repeatedly until
the previously shared memory layout of all the other children is
exposed (e.g. kind of related to HeartBleed).
In each case, a privilege boundary has been crossed:
Case 1: setuid/setgid process
Case 2: network to local
Case 3: privilege changes
Case 4: network to local
So, what will really be detected are fork/exec brute force attacks that
cross any of the commented bounds.
The implementation details and comparison against other existing
implementations can be found in the "Documentation" patch.
It is important to mention that this version has changed the method used to
track the information related to the application crashes. Prior this
version, a pointer per process (in the task_struct structure) held a
reference to the shared statistical data. Or in other words, these stats
were shared by all the fork hierarchy processes. But this has an important
drawback: a brute force attack that happens through the execve system call
losts the faults info since these statistics are freed when the fork
hierarchy disappears. So, the solution adopted in the v6 version was to use
an upper fork hierarchy to track the info for this attack type. But, as
Valdis Kletnieks pointed out during this discussion [1], this method can
be easily bypassed using a double exec (well, this was the method used in
the kselftest to avoid the detection ;) ). So, in this version, to track
all the statistical data (info related with application crashes), the
extended attributes feature for the executable files are used. The xattr is
also used to mark the executables as "not allowed" when an attack is
detected. Then, the execve system call rely on this flag to avoid following
executions of this file.
[1] https://lore.kernel.org/kernelnewbies/20210330173459.GA3163@ubuntu/
Moreover, I think this solves another problem pointed out by Andi Kleen
during the v5 review [2] related to the possibility that a supervisor
respawns processes killed by the Brute LSM. He suggested adding some way so
a supervisor can know that a process has been killed by Brute and then
decide to respawn or not. So, now, the supervisor can read the brute xattr
of one executable and know if it is blocked by Brute and why (using the
statistical data).
[2] https://lore.kernel.org/kernel-hardening/878s78dnrm.fsf@linux.intel.com/
Knowing all this information I will explain now the different patches:
The 1/7 patch defines a new LSM hook to get the fatal signal of a task.
This will be useful during the attack detection phase.
The 2/7 patch defines a new LSM and the necessary sysctl attributes to fine
tuning the attack detection.
The 3/7 patch detects a fork/exec brute force attack and narrows the
possible cases taken into account the privilege boundary crossing.
The 4/7 patch mitigates a brute force attack.
The 5/7 patch adds self-tests to validate the Brute LSM expectations.
The 6/7 patch adds the documentation to explain this implementation.
The 7/7 patch updates the maintainers file.
This patch serie is a task of the KSPP [3] and can also be accessed from my
github tree [4] in the "brute_v7" branch.
[3] https://github.com/KSPP/linux/issues/39
[4] https://github.com/johwood/linux/
When I ran the "checkpatch" script I got the following errors, but I think
they are false positives as I follow the same coding style for the others
extended attributes suffixes.
----------------------------------------------------------------------------
../patches/brute_v7/v7-0003-security-brute-Detect-a-brute-force-attack.patch
----------------------------------------------------------------------------
ERROR: Macros with complex values should be enclosed in parentheses
89: FILE: include/uapi/linux/xattr.h:80:
+#define XATTR_NAME_BRUTE XATTR_SECURITY_PREFIX XATTR_BRUTE_SUFFIX
-----------------------------------------------------------------------------
../patches/brute_v7/v7-0005-selftests-brute-Add-tests-for-the-Brute-LSM.patch
-----------------------------------------------------------------------------
ERROR: Macros with complex values should be enclosed in parentheses
100: FILE: tools/testing/selftests/brute/rmxattr.c:18:
+#define XATTR_NAME_BRUTE XATTR_SECURITY_PREFIX XATTR_BRUTE_SUFFIX
When I ran the "kernel-doc" script with the following parameters:
./scripts/kernel-doc --none -v security/brute/brute.c
I got the following warning:
security/brute/brute.c:65: warning: contents before sections
But I don't understand why it is complaining. Could it be a false positive?
The previous versions can be found in:
RFC
https://lore.kernel.org/kernel-hardening/20200910202107.3799376-1-keescook@…
Version 2
https://lore.kernel.org/kernel-hardening/20201025134540.3770-1-john.wood@gm…
Version 3
https://lore.kernel.org/lkml/20210221154919.68050-1-john.wood@gmx.com/
Version 4
https://lore.kernel.org/lkml/20210227150956.6022-1-john.wood@gmx.com/
Version 5
https://lore.kernel.org/kernel-hardening/20210227153013.6747-1-john.wood@gm…
Version 6
https://lore.kernel.org/kernel-hardening/20210307113031.11671-1-john.wood@g…
Changelog RFC -> v2
-------------------
- Rename this feature with a more suitable name (Jann Horn, Kees Cook).
- Convert the code to an LSM (Kees Cook).
- Add locking to avoid data races (Jann Horn).
- Add a new LSM hook to get the fatal signal of a task (Jann Horn, Kees
Cook).
- Add the last crashes timestamps list to avoid false positives in the
attack detection (Jann Horn).
- Use "period" instead of "rate" (Jann Horn).
- Other minor changes suggested (Jann Horn, Kees Cook).
Changelog v2 -> v3
------------------
- Compute the application crash period on an on-going basis (Kees Cook).
- Detect a brute force attack through the execve system call (Kees Cook).
- Detect an slow brute force attack (Randy Dunlap).
- Fine tuning the detection taken into account privilege boundary crossing
(Kees Cook).
- Taken into account only fatal signals delivered by the kernel (Kees
Cook).
- Remove the sysctl attributes to fine tuning the detection (Kees Cook).
- Remove the prctls to allow per process enabling/disabling (Kees Cook).
- Improve the documentation (Kees Cook).
- Fix some typos in the documentation (Randy Dunlap).
- Add self-test to validate the expectations (Kees Cook).
Changelog v3 -> v4
------------------
- Fix all the warnings shown by the tool "scripts/kernel-doc" (Randy
Dunlap).
Changelog v4 -> v5
------------------
- Fix some typos (Randy Dunlap).
Changelog v5 -> v6
------------------
- Fix a reported deadlock (kernel test robot).
- Add high level details to the documentation (Andi Kleen).
Changelog v6 -> v7
------------------
- Add the "Reviewed-by:" tag to the first patch.
- Rearrange the brute LSM between lockdown and yama (Kees Cook).
- Split subdir and obj in security/Makefile (Kees Cook).
- Reduce the number of header files included (Kees Cook).
- Print the pid when an attack is detected (Kees Cook).
- Use the socket_accept LSM hook instead of socket_sock_rcv_skb hook to
avoid running a hook on every incoming network packet (Kees Cook).
- Update the documentation and fix it to render it properly (Jonathan
Corbet).
- Manage correctly an exec brute force attack avoiding the bypass (Valdis
Kletnieks).
- Other minor changes and cleanups.
Any constructive comments are welcome.
Thanks in advance.
John Wood (7):
security: Add LSM hook at the point where a task gets a fatal signal
security/brute: Define a LSM and add sysctl attributes
security/brute: Detect a brute force attack
security/brute: Mitigate a brute force attack
selftests/brute: Add tests for the Brute LSM
Documentation: Add documentation for the Brute LSM
MAINTAINERS: Add a new entry for the Brute LSM
Documentation/admin-guide/LSM/Brute.rst | 334 +++++++++++
Documentation/admin-guide/LSM/index.rst | 1 +
MAINTAINERS | 7 +
include/linux/lsm_hook_defs.h | 1 +
include/linux/lsm_hooks.h | 4 +
include/linux/security.h | 4 +
include/uapi/linux/xattr.h | 3 +
kernel/signal.c | 1 +
security/Kconfig | 11 +-
security/Makefile | 2 +
security/brute/Kconfig | 15 +
security/brute/Makefile | 2 +
security/brute/brute.c | 716 +++++++++++++++++++++++
security/security.c | 5 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/brute/.gitignore | 2 +
tools/testing/selftests/brute/Makefile | 5 +
tools/testing/selftests/brute/config | 1 +
tools/testing/selftests/brute/rmxattr.c | 34 ++
tools/testing/selftests/brute/test.c | 507 ++++++++++++++++
tools/testing/selftests/brute/test.sh | 256 ++++++++
21 files changed, 1907 insertions(+), 5 deletions(-)
create mode 100644 Documentation/admin-guide/LSM/Brute.rst
create mode 100644 security/brute/Kconfig
create mode 100644 security/brute/Makefile
create mode 100644 security/brute/brute.c
create mode 100644 tools/testing/selftests/brute/.gitignore
create mode 100644 tools/testing/selftests/brute/Makefile
create mode 100644 tools/testing/selftests/brute/config
create mode 100644 tools/testing/selftests/brute/rmxattr.c
create mode 100644 tools/testing/selftests/brute/test.c
create mode 100755 tools/testing/selftests/brute/test.sh
--
2.25.1
TL;DR: Add support to kunit_tool to dispatch tests via QEMU. Also add
support to immediately shutdown a kernel after running KUnit tests.
Background
----------
KUnit has supported running on all architectures for quite some time;
however, kunit_tool - the script commonly used to invoke KUnit tests -
has only fully supported KUnit run on UML. Its functionality has been
broken up for some time to separate the configure, build, run, and parse
phases making it possible to be used in part on other architectures to a
small extent. Nevertheless, kunit_tool has not supported running tests
on other architectures.
What this patchset does
-----------------------
This patchset introduces first class support to kunit_tool for KUnit to
be run on many popular architectures via QEMU. It does this by adding
two new flags: `--arch` and `--cross_compile`.
`--arch` allows an architecture to be specified by the name the
architecture is given in `arch/`. It uses the specified architecture to
select a minimal amount of Kconfigs and QEMU configs needed for the
architecture to run in QEMU and provide a console from which KTAP
results can be scraped.
`--cross_compile` allows a toolchain prefix to be specified to make
similar to how `CROSS_COMPILE` is used.
Additionally, this patchset revives the previously considered "kunit:
tool: add support for QEMU"[1] patchs. The motivation for this new
kernel command line flags, `kunit_shutdown`, is to better support
running KUnit tests inside of QEMU. For most popular architectures, QEMU
can be made to terminate when the Linux kernel that is being run is
reboted, halted, or powered off. As Kees pointed out in a previous
discussion[2], it is possible to make a kernel initrd that can reboot
the kernel immediately, doing this for every architecture would likely
be infeasible. Instead, just having an option for the kernel to shutdown
when it is done with testing seems a lot simpler, especially since it is
an option which would only available in testing configurations of the
kernel anyway.
Changes since last revision
---------------------------
Mostly fixed lots of minor issues suggested/poited out by David and
Daniel. Also reworked how qemu_configs are loaded: Now each config is in
its own Python file and is loaded dynamically. Given the number of
improvements that address the biggest concerns I had in the last RFC, I
decided to promote this to a normal patch set.
What discussion remains for this patchset?
------------------------------------------
I am still hoping to see some discussion regarding the kunit_shutdown
patch: I want to make sure with the added context of QEMU running under
kunit_tool that this is now a reasonable approach. Nevertheless, I am
pretty happy with this patchset as is, and I did not get any negative
feedback on the previous revision, so I think we can probably just move
forward as is.
Brendan Higgins (3):
Documentation: Add kunit_shutdown to kernel-parameters.txt
kunit: tool: add support for QEMU
Documentation: kunit: document support for QEMU in kunit_tool
David Gow (1):
kunit: Add 'kunit_shutdown' option
.../admin-guide/kernel-parameters.txt | 8 +
Documentation/dev-tools/kunit/usage.rst | 37 +++-
lib/kunit/executor.c | 20 ++
tools/testing/kunit/kunit.py | 57 +++++-
tools/testing/kunit/kunit_config.py | 7 +-
tools/testing/kunit/kunit_kernel.py | 172 +++++++++++++++---
tools/testing/kunit/kunit_parser.py | 2 +-
tools/testing/kunit/kunit_tool_test.py | 18 +-
tools/testing/kunit/qemu_config.py | 17 ++
tools/testing/kunit/qemu_configs/alpha.py | 10 +
tools/testing/kunit/qemu_configs/arm.py | 13 ++
tools/testing/kunit/qemu_configs/arm64.py | 12 ++
tools/testing/kunit/qemu_configs/i386.py | 10 +
tools/testing/kunit/qemu_configs/powerpc.py | 12 ++
tools/testing/kunit/qemu_configs/riscv.py | 31 ++++
tools/testing/kunit/qemu_configs/s390.py | 14 ++
tools/testing/kunit/qemu_configs/sparc.py | 10 +
tools/testing/kunit/qemu_configs/x86_64.py | 10 +
18 files changed, 411 insertions(+), 49 deletions(-)
create mode 100644 tools/testing/kunit/qemu_config.py
create mode 100644 tools/testing/kunit/qemu_configs/alpha.py
create mode 100644 tools/testing/kunit/qemu_configs/arm.py
create mode 100644 tools/testing/kunit/qemu_configs/arm64.py
create mode 100644 tools/testing/kunit/qemu_configs/i386.py
create mode 100644 tools/testing/kunit/qemu_configs/powerpc.py
create mode 100644 tools/testing/kunit/qemu_configs/riscv.py
create mode 100644 tools/testing/kunit/qemu_configs/s390.py
create mode 100644 tools/testing/kunit/qemu_configs/sparc.py
create mode 100644 tools/testing/kunit/qemu_configs/x86_64.py
base-commit: 38182162b50aa4e970e5997df0a0c4288147a153
--
2.31.1.607.g51e8a6a459-goog
This patchset provides a file descriptor for every VM and VCPU to read
KVM statistics data in binary format.
It is meant to provide a lightweight, flexible, scalable and efficient
lock-free solution for user space telemetry applications to pull the
statistics data periodically for large scale systems. The pulling
frequency could be as high as a few times per second.
In this patchset, every statistics data are treated to have some
attributes as below:
* architecture dependent or common
* VM statistics data or VCPU statistics data
* type: cumulative, instantaneous,
* unit: none for simple counter, nanosecond, microsecond,
millisecond, second, Byte, KiByte, MiByte, GiByte. Clock Cycles
Since no lock/synchronization is used, the consistency between all
the statistics data is not guaranteed. That means not all statistics
data are read out at the exact same time, since the statistics date
are still being updated by KVM subsystems while they are read out.
---
* v4 -> v5
- Rebase to kvm/queue, commit a4345a7cecfb ("Merge tag
'kvmarm-fixes-5.13-1'")
- Change maximum stats name length to 48
- Replace VM_STATS_COMMON/VCPU_STATS_COMMON macros with stats
descriptor definition macros.
- Fixed some errors/warnings reported by checkpatch.pl
* v3 -> v4
- Rebase to kvm/queue, commit 9f242010c3b4 ("KVM: avoid "deadlock"
between install_new_memslots and MMU notifier")
- Use C-stype comments in the whole patch
- Fix wrong count for x86 VCPU stats descriptors
- Fix KVM stats data size counting and validity check in selftest
* v2 -> v3
- Rebase to kvm/queue, commit edf408f5257b ("KVM: avoid "deadlock"
between install_new_memslots and MMU notifier")
- Resolve some nitpicks about format
* v1 -> v2
- Use ARRAY_SIZE to count the number of stats descriptors
- Fix missing `size` field initialization in macro STATS_DESC
[1] https://lore.kernel.org/kvm/20210402224359.2297157-1-jingzhangos@google.com
[2] https://lore.kernel.org/kvm/20210415151741.1607806-1-jingzhangos@google.com
[3] https://lore.kernel.org/kvm/20210423181727.596466-1-jingzhangos@google.com
[4] https://lore.kernel.org/kvm/20210429203740.1935629-1-jingzhangos@google.com
---
Jing Zhang (4):
KVM: stats: Separate common stats from architecture specific ones
KVM: stats: Add fd-based API to read binary stats data
KVM: stats: Add documentation for statistics data binary interface
KVM: selftests: Add selftest for KVM statistics data binary interface
Documentation/virt/kvm/api.rst | 171 ++++++++
arch/arm64/include/asm/kvm_host.h | 9 +-
arch/arm64/kvm/guest.c | 38 +-
arch/mips/include/asm/kvm_host.h | 9 +-
arch/mips/kvm/mips.c | 64 ++-
arch/powerpc/include/asm/kvm_host.h | 9 +-
arch/powerpc/kvm/book3s.c | 64 ++-
arch/powerpc/kvm/book3s_hv.c | 12 +-
arch/powerpc/kvm/book3s_pr.c | 2 +-
arch/powerpc/kvm/book3s_pr_papr.c | 2 +-
arch/powerpc/kvm/booke.c | 59 ++-
arch/s390/include/asm/kvm_host.h | 9 +-
arch/s390/kvm/kvm-s390.c | 129 +++++-
arch/x86/include/asm/kvm_host.h | 9 +-
arch/x86/kvm/x86.c | 67 +++-
include/linux/kvm_host.h | 136 ++++++-
include/linux/kvm_types.h | 12 +
include/uapi/linux/kvm.h | 50 +++
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 3 +
.../testing/selftests/kvm/include/kvm_util.h | 3 +
.../selftests/kvm/kvm_bin_form_stats.c | 379 ++++++++++++++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 12 +
virt/kvm/kvm_main.c | 237 ++++++++++-
24 files changed, 1396 insertions(+), 90 deletions(-)
create mode 100644 tools/testing/selftests/kvm/kvm_bin_form_stats.c
base-commit: a4345a7cecfb91ae78cd43d26b0c6a956420761a
--
2.31.1.751.gd2f1c929bd-goog
When there is no devlink device, the following command will return:
$ devlink -j dev show
{dev:{}}
This will cause IndexError when trying to access the first element
in dev of this json dataset. Use the kselftest framework skip code
to skip this test in this case.
Example output with this change:
# selftests: net: devlink_port_split.py
# no devlink device was found, test skipped
ok 7 selftests: net: devlink_port_split.py # SKIP
Link: https://bugs.launchpad.net/bugs/1928889
Signed-off-by: Po-Hsu Lin <po-hsu.lin(a)canonical.com>
---
tools/testing/selftests/net/devlink_port_split.py | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/devlink_port_split.py b/tools/testing/selftests/net/devlink_port_split.py
index 834066d..2b5d6ff 100755
--- a/tools/testing/selftests/net/devlink_port_split.py
+++ b/tools/testing/selftests/net/devlink_port_split.py
@@ -18,6 +18,8 @@ import sys
#
+# Kselftest framework requirement - SKIP code is 4
+KSFT_SKIP=4
Port = collections.namedtuple('Port', 'bus_info name')
@@ -239,7 +241,11 @@ def main(cmdline=None):
assert stderr == ""
devs = json.loads(stdout)['dev']
- dev = list(devs.keys())[0]
+ if devs:
+ dev = list(devs.keys())[0]
+ else:
+ print("no devlink device was found, test skipped")
+ sys.exit(KSFT_SKIP)
cmd = "devlink dev show %s" % dev
stdout, stderr = run_command(cmd)
--
2.7.4
Building the nci test suite produces a binary, nci_dev, that git then
tries to track. Add a .gitignore file to tell git to ignore this binary.
Signed-off-by: David Matlack <dmatlack(a)google.com>
---
tools/testing/selftests/nci/.gitignore | 1 +
1 file changed, 1 insertion(+)
create mode 100644 tools/testing/selftests/nci/.gitignore
diff --git a/tools/testing/selftests/nci/.gitignore b/tools/testing/selftests/nci/.gitignore
new file mode 100644
index 000000000000..448eeb4590fc
--- /dev/null
+++ b/tools/testing/selftests/nci/.gitignore
@@ -0,0 +1 @@
+/nci_dev
--
2.31.1.751.gd2f1c929bd-goog
> -----Original Message-----
> From: Po-Hsu Lin <po-hsu.lin(a)canonical.com>
> Sent: Thursday, May 20, 2021 1:50 PM
> To: linux-kernel(a)vger.kernel.org; linux-kselftest(a)vger.kernel.org; netdev(a)vger.kernel.org
> Cc: po-hsu.lin(a)canonical.com; shuah(a)kernel.org; kuba(a)kernel.org; davem(a)davemloft.net; skhan(a)linuxfoundation.org
> Subject: [PATCH] selftests: net: devlink_port_split.py: skip the test if no devlink device
>
> When there is no devlink device, the following command will return:
> $ devlink -j dev show
> {dev:{}}
>
> This will cause IndexError when trying to access the first element in dev of this json dataset. Use the kselftest framework skip code to
> skip this test in this case.
>
> Example output with this change:
> # selftests: net: devlink_port_split.py
> # no devlink device was found, test skipped
> ok 7 selftests: net: devlink_port_split.py # SKIP
>
> Link: https://bugs.launchpad.net/bugs/1928889
> Signed-off-by: Po-Hsu Lin <po-hsu.lin(a)canonical.com>
Reviewed-by: Danielle Ratson <danieller(a)nvidia.com>
If a signed number field starts with a '-' the field width must be > 1,
or unlimited, to allow at least one digit after the '-'.
This patch adds a check for this. If a signed field starts with '-'
and field_width == 1 the scanf will quit.
It is ok for a signed number field to have a field width of 1 if it
starts with a digit. In that case the single digit can be converted.
Signed-off-by: Richard Fitzgerald <rf(a)opensource.cirrus.com>
Reviewed-by: Petr Mladek <pmladek(a)suse.com>
Acked-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
---
lib/vsprintf.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 41ddc353ebb8..f78651e9b030 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -3466,8 +3466,12 @@ int vsscanf(const char *buf, const char *fmt, va_list args)
str = skip_spaces(str);
digit = *str;
- if (is_sign && digit == '-')
+ if (is_sign && digit == '-') {
+ if (field_width == 1)
+ break;
+
digit = *(str + 1);
+ }
if (!digit
|| (base == 16 && !isxdigit(digit))
--
2.20.1
From: Mike Rapoport <rppt(a)linux.ibm.com>
Hi,
@Andrew, this is based on v5.13-rc1, I can rebase whatever way you prefer.
This is an implementation of "secret" mappings backed by a file descriptor.
The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call The desired protection mode for the
memory is configured using flags parameter of the system call. The mmap()
of the file descriptor created with memfd_secret() will create a "secret"
memory mapping. The pages in that mapping will be marked as not present in
the direct map and will be present only in the page table of the owning mm.
Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants
mappings.
It's designed to provide the following protections:
* Enhanced protection (in conjunction with all the other in-kernel
attack prevention systems) against ROP attacks. Seceretmem makes "simple"
ROP insufficient to perform exfiltration, which increases the required
complexity of the attack. Along with other protections like the kernel
stack size limit and address space layout randomization which make finding
gadgets is really hard, absence of any in-kernel primitive for accessing
secret memory means the one gadget ROP attack can't work. Since the only
way to access secret memory is to reconstruct the missing mapping entry,
the attacker has to recover the physical page and insert a PTE pointing to
it in the kernel and then retrieve the contents. That takes at least three
gadgets which is a level of difficulty beyond most standard attacks.
* Prevent cross-process secret userspace memory exposures. Once the secret
memory is allocated, the user can't accidentally pass it into the kernel to
be transmitted somewhere. The secreremem pages cannot be accessed via the
direct map and they are disallowed in GUP.
* Harden against exploited kernel flaws. In order to access secretmem, a
kernel-side attack would need to either walk the page tables and create new
ones, or spawn a new privileged uiserspace process to perform secrets
exfiltration using ptrace.
In the future the secret mappings may be used as a mean to protect guest memory
in a virtual machine host.
For demonstration of secret memory usage we've created a userspace library
https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloade…
that does two things: the first is act as a preloader for openssl to
redirect all the OPENSSL_malloc calls to secret memory meaning any secret
keys get automatically protected this way and the other thing it does is
expose the API to the user who needs it. We anticipate that a lot of the
use cases would be like the openssl one: many toolkits that deal with
secret keys already have special handling for the memory to try to give
them greater protection, so this would simply be pluggable into the
toolkits without any need for user application modification.
Hiding secret memory mappings behind an anonymous file allows usage of
the page cache for tracking pages allocated for the "secret" mappings as
well as using address_space_operations for e.g. page migration callbacks.
The anonymous file may be also used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
ABIs in the future.
Removing of the pages from the direct map may cause its fragmentation on
architectures that use large pages to map the physical memory which affects
the system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.
In addition, there is also a long term goal to improve management of the
direct map.
[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux…
v19:
* block /dev/mem mmap access, per David
* disallow mmap/mprotect with PROT_EXEC, per Kees
* simplify return in page_is_secretmem(), per Matthew
* use unsigned int for syscall falgs, per Yury
v18: https://lore.kernel.org/lkml/20210303162209.8609-1-rppt@kernel.org
* rebase on v5.12-rc1
* merge kfence fix into the original patch
* massage commit message of the patch introducing the memfd_secret syscall
v17: https://lore.kernel.org/lkml/20210208084920.2884-1-rppt@kernel.org
* Remove pool of large pages backing secretmem allocations, per Michal Hocko
* Add secretmem pages to unevictable LRU, per Michal Hocko
* Use GFP_HIGHUSER as secretmem mapping mask, per Michal Hocko
* Make secretmem an opt-in feature that is disabled by default
v16: https://lore.kernel.org/lkml/20210121122723.3446-1-rppt@kernel.org
* Fix memory leak intorduced in v15
* Clean the data left from previous page user before handing the page to
the userspace
v15: https://lore.kernel.org/lkml/20210120180612.1058-1-rppt@kernel.org
* Add riscv/Kconfig update to disable set_memory operations for nommu
builds (patch 3)
* Update the code around add_to_page_cache() per Matthew's comments
(patches 6,7)
* Add fixups for build/checkpatch errors discovered by CI systems
Older history:
v14: https://lore.kernel.org/lkml/20201203062949.5484-1-rppt@kernel.org
v13: https://lore.kernel.org/lkml/20201201074559.27742-1-rppt@kernel.org
v12: https://lore.kernel.org/lkml/20201125092208.12544-1-rppt@kernel.org
v11: https://lore.kernel.org/lkml/20201124092556.12009-1-rppt@kernel.org
v10: https://lore.kernel.org/lkml/20201123095432.5860-1-rppt@kernel.org
v9: https://lore.kernel.org/lkml/20201117162932.13649-1-rppt@kernel.org
v8: https://lore.kernel.org/lkml/20201110151444.20662-1-rppt@kernel.org
v7: https://lore.kernel.org/lkml/20201026083752.13267-1-rppt@kernel.org
v6: https://lore.kernel.org/lkml/20200924132904.1391-1-rppt@kernel.org
v5: https://lore.kernel.org/lkml/20200916073539.3552-1-rppt@kernel.org
v4: https://lore.kernel.org/lkml/20200818141554.13945-1-rppt@kernel.org
v3: https://lore.kernel.org/lkml/20200804095035.18778-1-rppt@kernel.org
v2: https://lore.kernel.org/lkml/20200727162935.31714-1-rppt@kernel.org
v1: https://lore.kernel.org/lkml/20200720092435.17469-1-rppt@kernel.org
rfc-v2: https://lore.kernel.org/lkml/20200706172051.19465-1-rppt@kernel.org/
rfc-v1: https://lore.kernel.org/lkml/20200130162340.GA14232@rapoport-lnx/
rfc-v0: https://lore.kernel.org/lkml/1572171452-7958-1-git-send-email-rppt@kernel.o…
Mike Rapoport (8):
mmap: make mlock_future_check() global
riscv/Kconfig: make direct map manipulation options depend on MMU
set_memory: allow set_direct_map_*_noflush() for multiple pages
set_memory: allow querying whether set_direct_map_*() is actually enabled
mm: introduce memfd_secret system call to create "secret" memory areas
PM: hibernate: disable when there are active secretmem users
arch, mm: wire up memfd_secret system call where relevant
secretmem: test: add basic selftest for memfd_secret(2)
arch/arm64/include/asm/Kbuild | 1 -
arch/arm64/include/asm/cacheflush.h | 6 -
arch/arm64/include/asm/kfence.h | 2 +-
arch/arm64/include/asm/set_memory.h | 17 ++
arch/arm64/include/uapi/asm/unistd.h | 1 +
arch/arm64/kernel/machine_kexec.c | 1 +
arch/arm64/mm/mmu.c | 6 +-
arch/arm64/mm/pageattr.c | 23 +-
arch/riscv/Kconfig | 4 +-
arch/riscv/include/asm/set_memory.h | 4 +-
arch/riscv/include/asm/unistd.h | 1 +
arch/riscv/mm/pageattr.c | 8 +-
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/set_memory.h | 4 +-
arch/x86/mm/pat/set_memory.c | 8 +-
drivers/char/mem.c | 4 +
include/linux/secretmem.h | 54 ++++
include/linux/set_memory.h | 16 +-
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 7 +-
include/uapi/linux/magic.h | 1 +
kernel/power/hibernate.c | 5 +-
kernel/power/snapshot.c | 4 +-
kernel/sys_ni.c | 2 +
mm/Kconfig | 4 +
mm/Makefile | 1 +
mm/gup.c | 12 +
mm/internal.h | 3 +
mm/mlock.c | 3 +-
mm/mmap.c | 5 +-
mm/secretmem.c | 254 +++++++++++++++++++
mm/vmalloc.c | 5 +-
scripts/checksyscalls.sh | 4 +
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 3 +-
tools/testing/selftests/vm/memfd_secret.c | 296 ++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests.sh | 17 ++
38 files changed, 744 insertions(+), 46 deletions(-)
create mode 100644 arch/arm64/include/asm/set_memory.h
create mode 100644 include/linux/secretmem.h
create mode 100644 mm/secretmem.c
create mode 100644 tools/testing/selftests/vm/memfd_secret.c
base-commit: 6efb943b8616ec53a5e444193dccf1af9ad627b5
--
2.28.0