This patch series introduces LANDLOCK_SCOPE_MEMFD_EXEC, a new Landlock
scoping mechanism that restricts execution of anonymous memory file
descriptors (memfd) created via memfd_create(2). This addresses security
gaps where processes can bypass W^X policies and execute arbitrary code
through anonymous memory objects.
Closes: https://github.com/landlock-lsm/linux/issues/37
SECURITY PROBLEM
================
Current Landlock filesystem restrictions do not cover memfd objects,
allowing processes to:
1. Write-to-execute bypass: Create a writable memfd, inject code,
then execute it via mmap(PROT_EXEC) or direct execve()
2. Anonymous execution: Execute code without touching the filesystem via
execve("/proc/self/fd/N") where N is a memfd descriptor (see the
sketch below)
3. Cross-domain access violations: Pass memfd between processes to
bypass domain restrictions
These scenarios can occur in sandboxed environments where filesystem
access is restricted but memfd creation remains possible.
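As an illustration of scenario 2, a minimal userspace sketch (a
hypothetical reproducer, not part of this series; a real attacker would
write a complete ELF image into the memfd):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	char path[64];
	char *const argv[] = { "payload", NULL };
	int fd = memfd_create("payload", MFD_CLOEXEC);

	if (fd < 0)
		return 1;
	/* An attacker would write a complete ELF image here (elided). */
	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	execve(path, argv, NULL);	/* no filesystem object is ever touched */
	perror("execve");
	return 1;
}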
IMPLEMENTATION
==============
The implementation adds hierarchical execution control through domain
scoping:
Core Components:
- is_memfd_file(): Reliable memfd detection via the "memfd:" dentry
  prefix (see the sketch after this list)
- domain_is_scoped(): Cross-domain hierarchy checking (moved to domain.c)
- LSM hooks: mmap_file, file_mprotect, bprm_creds_for_exec
- Creation-time restrictions: hook_file_alloc_security
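A minimal sketch of that detection as I understand it (illustrative,
not the literal patch code):

/* memfd_create() names the backing dentry "memfd:<name>", so the
 * prefix is sufficient to identify memfd-backed files. */
static bool is_memfd_file(const struct file *file)
{
	const struct dentry *dentry = file ? file->f_path.dentry : NULL;

	return dentry &&
	       !strncmp((const char *)dentry->d_name.name, "memfd:", 6);
}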
Security Matrix:
Execution decisions follow domain hierarchy rules preventing both
same-domain bypass attempts and cross-domain access violations while
preserving legitimate hierarchical access patterns.
Domain Hierarchy with LANDLOCK_SCOPE_MEMFD_EXEC:
================================================
Root (no domain) - No restrictions
|
+-- Domain A [SCOPE_MEMFD_EXEC] Layer 1
| +-- memfd_A (tagged with Domain A as creator)
| |
| +-- Domain A1 (child) [NO SCOPE] Layer 2
| | +-- Inherits Layer 1 restrictions from parent
| | +-- memfd_A1 (can create, inherits restrictions)
| | +-- Domain A1a [SCOPE_MEMFD_EXEC] Layer 3
| | +-- memfd_A1a (tagged with Domain A1a)
| |
| +-- Domain A2 (child) [SCOPE_MEMFD_EXEC] Layer 2
| +-- memfd_A2 (tagged with Domain A2 as creator)
| +-- CANNOT access memfd_A1 (different subtree)
|
+-- Domain B [SCOPE_MEMFD_EXEC] Layer 1
+-- memfd_B (tagged with Domain B as creator)
+-- CANNOT access ANY memfd from Domain A subtree
Execution Decision Matrix:
==========================
Executor ->  |  A  | A1  | A1a | A2  |  B  | Root
Creator      |     |     |     |     |     |
-------------|-----|-----|-----|-----|-----|-----
Domain A     |  X  |  X  |  X  |  X  |  X  |  Y
Domain A1    |  Y  |  X  |  X  |  X  |  X  |  Y
Domain A1a   |  Y  |  Y  |  X  |  X  |  X  |  Y
Domain A2    |  Y  |  X  |  X  |  X  |  X  |  Y
Domain B     |  X  |  X  |  X  |  X  |  X  |  Y
Root         |  Y  |  Y  |  Y  |  Y  |  Y  |  Y
Legend: Y = Execution allowed, X = Execution denied
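For context, userspace opts into the scope the same way as the existing
LANDLOCK_SCOPE_* flags; a hypothetical usage sketch (the flag value
itself comes from the updated UAPI header):

#define _GNU_SOURCE
#include <linux/landlock.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	const struct landlock_ruleset_attr attr = {
		.scoped = LANDLOCK_SCOPE_MEMFD_EXEC,	/* added by this series */
	};
	int ruleset_fd = syscall(SYS_landlock_create_ruleset,
				 &attr, sizeof(attr), 0);

	if (ruleset_fd < 0)
		return 1;
	/* Mandatory before landlock_restrict_self(). */
	prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
	if (syscall(SYS_landlock_restrict_self, ruleset_fd, 0))
		return 1;
	close(ruleset_fd);
	/* memfds created from here on are tagged with this domain and
	 * subject to the decision matrix above. */
	return 0;
}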
Scenarios Covered:
- Direct mmap(PROT_EXEC) on memfd files
- Two-stage mmap(PROT_READ) + mprotect(PROT_EXEC) bypass attempts
  (reproducer sketched below)
- execve("/proc/self/fd/N") anonymous execution
- execveat() and fexecve() file descriptor execution
- Cross-process memfd inheritance and IPC passing
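The two-stage case is the subtle one; a hypothetical reproducer
(illustrative only) of the pattern the file_mprotect hook is meant to
deny:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const unsigned char ret_insn = 0xc3;	/* x86-64 "ret" */
	int fd = memfd_create("probe", MFD_CLOEXEC);
	void *map;

	if (fd < 0 || ftruncate(fd, 4096) < 0)
		return 1;
	write(fd, &ret_insn, 1);

	/* Stage 1: a PROT_READ mapping looks harmless on its own. */
	map = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	/* Stage 2: flipping the mapping executable afterwards is what
	 * the file_mprotect hook must catch. */
	if (mprotect(map, 4096, PROT_READ | PROT_EXEC) == 0)
		puts("PROT_EXEC granted - W^X bypassed");
	else
		puts("PROT_EXEC denied");
	return 0;
}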
TESTING
=======
All patches have been validated with:
- scripts/checkpatch.pl --strict (clean)
- Selftests covering same-domain restrictions, cross-domain
hierarchy enforcement, and regular file isolation
- KUnit tests for memfd detection edge cases
DISCLAIMER
==========
My understanding of Landlock scoping semantics may be incomplete; this
implementation reflects my reading of the available documentation and
code. I welcome feedback and corrections regarding the scoping logic
and domain hierarchy enforcement.
Signed-off-by: Abhinav Saxena <xandfury@gmail.com>
---
Abhinav Saxena (4):
landlock: add LANDLOCK_SCOPE_MEMFD_EXEC scope
landlock: implement memfd detection
landlock: add memfd exec LSM hooks and scoping
selftests/landlock: add memfd execution tests
include/uapi/linux/landlock.h | 5 +
security/landlock/.kunitconfig | 1 +
security/landlock/audit.c | 4 +
security/landlock/audit.h | 1 +
security/landlock/cred.c | 14 -
security/landlock/domain.c | 67 ++++
security/landlock/domain.h | 4 +
security/landlock/fs.c | 405 ++++++++++++++++++++-
security/landlock/limits.h | 2 +-
security/landlock/task.c | 67 ----
.../selftests/landlock/scoped_memfd_exec_test.c | 325 +++++++++++++++++
11 files changed, 812 insertions(+), 83 deletions(-)
---
base-commit: 5b74b2eff1eeefe43584e5b7b348c8cd3b723d38
change-id: 20250716-memfd-exec-ac0d582018c3
Best regards,
--
Abhinav Saxena <xandfury@gmail.com>
Currently the vDSO selftests use the time-related types from libc.
This works on glibc by chance today but will break with other libc
implementations or on distributions which switch to 64-bit times
everywhere.
The kernel's UAPI headers provide the proper types to use with the vDSO
(and raw syscalls) but are not necessarily compatible with libc types.
Introduce new headers that make the UAPI types compatible with the
libc. The series also contains some related cleanups.
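To make that concrete, a sketch of the kind of definitions such a
header can carry (vdso_time_t is named in the v2 changelog below; the
rest is illustrative):

/* Fixed-ABI types for calling the vDSO, independent of the libc. */
#include <asm/posix_types.h>	/* __kernel_old_time_t */
#include <linux/time_types.h>	/* struct __kernel_timespec */

typedef __kernel_old_time_t vdso_time_t;

/* Callers then pass kernel UAPI structures straight to the vDSO,
 * e.g. a clock_gettime symbol taking struct __kernel_timespec *,
 * with no dependency on the libc's time_t width. */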
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
Changes in v2:
- Use __kernel_old_time_t in vdso_time_t.
- Add vdso_syscalls.h.
- Add a test for the time() function.
- Validate return value of syscall(clock_getres) in vdso_test_abi.
- Link to v1: https://lore.kernel.org/r/20251111-vdso-test-types-v1-0-03b31f88c659@linutr…
---
Thomas Weißschuh (14):
Revert "selftests: vDSO: parse_vdso: Use UAPI headers instead of libc headers"
selftests: vDSO: Introduce vdso_types.h
selftests: vDSO: Introduce vdso_syscalls.h
selftests: vDSO: vdso_test_gettimeofday: Remove nolibc checks
selftests: vDSO: vdso_test_gettimeofday: Use types from vdso_types.h
selftests: vDSO: vdso_test_abi: Use types from vdso_types.h
selftests: vDSO: vdso_test_abi: Validate return value of syscall(clock_getres)
selftests: vDSO: vdso_test_abi: Use system call wrappers from vdso_syscalls.h
selftests: vDSO: vdso_test_correctness: Drop SYS_getcpu fallbacks
selftests: vDSO: vdso_test_correctness: Make ts_leq() and tv_leq() more generic
selftests: vDSO: vdso_test_correctness: Use types from vdso_types.h
selftests: vDSO: vdso_test_correctness: Use system call wrappers from vdso_syscalls.h
selftests: vDSO: vdso_test_correctness: Use facilities from parse_vdso.c
selftests: vDSO: vdso_test_correctness: Add a test for time()
tools/testing/selftests/vDSO/Makefile | 6 +-
tools/testing/selftests/vDSO/parse_vdso.c | 3 +-
tools/testing/selftests/vDSO/vdso_syscalls.h | 93 ++++++++++
tools/testing/selftests/vDSO/vdso_test_abi.c | 46 +++--
.../testing/selftests/vDSO/vdso_test_correctness.c | 190 +++++++++++----------
.../selftests/vDSO/vdso_test_gettimeofday.c | 9 +-
tools/testing/selftests/vDSO/vdso_types.h | 70 ++++++++
7 files changed, 285 insertions(+), 132 deletions(-)
---
base-commit: 1b2eb8c1324859864f4aa79dc3cfbb2f7ef5c524
change-id: 20251110-vdso-test-types-68ce0c712b79
Best regards,
--
Thomas Weißschuh <thomas.weissschuh@linutronix.de>
test_memcg_sock() currently requires that memory.stat's "sock " counter
is exactly zero immediately after the TCP server exits. On a busy system
this assumption is too strict:
- Socket memory may be freed with a small delay (e.g. RCU callbacks).
- memcg statistics are updated asynchronously via the rstat flushing
worker, so the "sock " value in memory.stat can stay non-zero for a
short period of time even after all socket memory has been uncharged.
As a result, test_memcg_sock() can intermittently fail even though socket
memory accounting is working correctly.
Make the test more robust by polling memory.stat for the "sock "
counter and allowing it some time to drop to zero instead of checking
it only once. The timeout is set to 3 seconds to cover the periodic
rstat flush interval (FLUSH_TIME = 2*HZ by default) plus some
scheduling slack. If the counter does not become zero within the
timeout, the test still fails as before.
On my test system, running test_memcontrol 50 times produced:
- Before this patch: 6/50 runs passed.
- After this patch: 50/50 runs passed.
Suggested-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
---
v2:
- Mention the periodic rstat flush interval (FLUSH_TIME = 2*HZ) in
the comment and clarify the rationale for the 3s timeout.
- Replace the hard-coded retry count and wait interval with macros
to avoid magic numbers and make the 3s timeout calculation explicit.
---
.../selftests/cgroup/test_memcontrol.c | 30 ++++++++++++++++++-
1 file changed, 29 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 4e1647568c5b..7bea656658a2 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -24,6 +24,9 @@
 static bool has_localevents;
 static bool has_recursiveprot;
 
+#define MEMCG_SOCKSTAT_WAIT_RETRIES 30 /* 3s total */
+#define MEMCG_SOCKSTAT_WAIT_INTERVAL_US (100 * 1000) /* 100 ms */
+
 int get_temp_fd(void)
 {
 	return open(".", O_TMPFILE | O_RDWR | O_EXCL);
@@ -1384,6 +1387,8 @@ static int test_memcg_sock(const char *root)
 	int bind_retries = 5, ret = KSFT_FAIL, pid, err;
 	unsigned short port;
 	char *memcg;
+	long sock_post = -1;
+	int i;
 
 	memcg = cg_name(root, "memcg_test");
 	if (!memcg)
@@ -1432,7 +1437,30 @@ static int test_memcg_sock(const char *root)
 	if (cg_read_long(memcg, "memory.current") < 0)
 		goto cleanup;
 
-	if (cg_read_key_long(memcg, "memory.stat", "sock "))
+	/*
+	 * memory.stat is updated asynchronously via the memcg rstat
+	 * flushing worker, which runs periodically (every 2 seconds,
+	 * see FLUSH_TIME). On a busy system, the "sock " counter may
+	 * stay non-zero for a short period of time after the TCP
+	 * connection is closed and all socket memory has been
+	 * uncharged.
+	 *
+	 * Poll memory.stat for up to 3 seconds (~FLUSH_TIME plus some
+	 * scheduling slack) and require that the "sock " counter
+	 * eventually drops to zero.
+	 */
+	for (i = 0; i < MEMCG_SOCKSTAT_WAIT_RETRIES; i++) {
+		sock_post = cg_read_key_long(memcg, "memory.stat", "sock ");
+		if (sock_post < 0)
+			goto cleanup;
+
+		if (!sock_post)
+			break;
+
+		usleep(MEMCG_SOCKSTAT_WAIT_INTERVAL_US);
+	}
+
+	if (sock_post)
 		goto cleanup;
 
 	ret = KSFT_PASS;
--
2.25.1