From: Jeff Xu jeffxu@chromium.org
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type.
Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case.
The new mseal() is an architecture independent syscall, and with following signature:
mseal(void addr, size_t len, unsigned long types, unsigned long flags)
addr/len: memory range. Must be continuous/allocated memory, or else mseal() will fail and no VMA is updated. For details on acceptable arguments, please refer to documentation patch (mseal.rst) of this patch set. Those are also fully covered by the selftest.
types: bit mask to specify the sealing types. MM_SEAL_BASE MM_SEAL_PROT_PKEY MM_SEAL_DISCARD_RO_ANON MM_SEAL_SEAL
The MM_SEAL_BASE: The base package includes the features common to all VMA sealing types. It prevents sealed VMAs from: 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Move or expand a different vma into the current location, via mremap(). 3> Modifying sealed VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA.
We consider the MM_SEAL_BASE feature, on which other sealing features will depend. For instance, it probably does not make sense to seal PROT_PKEY without sealing the BASE, and the kernel will implicitly add SEAL_BASE for SEAL_PROT_PKEY.
The MM_SEAL_PROT_PKEY: Seal PROT and PKEY of the address range, i.e. mprotect() and pkey_mprotect() will be denied if the memory is sealed with MM_SEAL_PROT_PKEY.
The MM_SEAL_DISCARD_RO_ANON Certain types of madvise() operations are destructive [6], such as MADV_DONTNEED, which can effectively alter region contents by discarding pages, especially when memory is anonymous. This blocks such operations for anonymous memory which is not writable to the user.
The MM_SEAL_SEAL MM_SEAL_SEAL denies adding a new seal for an VMA. This is similar to F_SEAL_SEAL in fcntl.
The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this API.
Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable, the lifetime of those mappings are tied to the lifetime of the process.
Chrome wants to seal two large address space reservations that are managed by different allocators. The memory is mapped RW- and RWX respectively but write access to it is restricted using pkeys (or in the future ARM permission overlay extensions). The lifetime of those mappings are not tied to the lifetime of the process, therefore, while the memory is sealed, the allocators still need to free or discard the unused memory. For example, with madvise(DONTNEED).
However, always allowing madvise(DONTNEED) on this range poses a security risk. For example if a jump instruction crosses a page boundary and the second page gets discarded, it will overwrite the target bytes with zeros and change the control flow. Checking write-permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity.
Although the initial version of this patch series is targeting the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections.
-------------------------------------------------------------------- Change history: =============== V3: - Abandon per-syscall approach, (Suggested by Linus Torvalds). - Organize sealing types around their functionality, such as MM_SEAL_BASE, MM_SEAL_PROT_PKEY. - Extend the scope of sealing from calls originated in userspace to both kernel and userspace. (Suggested by Linus Torvalds) - Add seal type support in mmap(). (Suggested by Pedro Falcato) - Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent destructive operations of madvise. (Suggested by Jann Horn and Stephen Röttger) - Make sealed VMAs mergeable. (Suggested by Jann Horn) - Add MAP_SEALABLE to mmap() (Detail see new discussions) - Add documentation - mseal.rst
Work in progress: ================= - update man page for mseal() and mmap()
Open discussions: ================= Several open discussions from V1/V2 were not incorporated into V3. I believe these are worth more discussion for future versions of this patch series.
1> mseal() vs mimmutable() mseal(bitmasks for multiple seal types) BASE + PROT_PKEY+ MM_SEAL_DISCARD_RO_ANON Apply PROT_PKEY implies BASE, same for DISCARD_RO_ANON
mimmutable() (openBSD) This is equal to SEAL_BASE + SEAL_PROT_PKEY in mseal() Plus allowing downgrade from W=>NW (OpenBSD) Doesn’t have MM_SEAL_DISCARD_RO_ANON
mimmutable() is designed for memory sealing in libc, and mseal() is designed for both Chrome browser and libc.
For the two memory ranges that Chrome browser wants to seal, as discussed previously, the allocator still needs to free or discard memory for the sealed memory. For performance reasons, we have explored two solutions in the past: first, using PKEY-based munmap() [7], and second, separating SEAL_MPROTECT (v1 of this patch set). Recently, we have experimented with an alternative approach that allows us to remove the separation of SEAL_MPROTECT. For those two memory ranges, Chrome browser will use BASE + PROT_PKEY + DISCARD_RO_ANON for all its sealing needs.
In the case of libc, the .text segment can be sealed with the BASE and PROT_PKEY, and the RO data segments can be sealed with the BASE + PROT_PKEY + DISCARD_RO_ANON.
From a flexibility standpoint, separating BASE out could be beneficial for future extensions of sealing features. For instance, applications might desire downgradable "prot" permissions (X=>NX, W=>NW, R=>NR), which would conflict with SEAL_PROT_PKEY.
The more sealing features integrated into a single sealing type, the fewer applications can utilize these features. For example, some applications might programmatically require DISCARD_RO_ANON memory, which would render such VMA unsuitable for sealing.
I'd like to get the community's input on this. From Chrome's perspective, the separation isn't as important anymore, at least in the short term. However, I prefer the multiple bits approach because it's more extensible in the long term.
2> mseal() vs mprotect() vs madvise() for setting the seal.
mprotect(): Using prot field, but prot supports unset. It's workable, i.e. let applications carry the sealing type and set in all subsequent calls to mprotect(), but it feels like this is an extra thing to care about.
madvise(): uses enum, multiple sealing types might require multiple roundtrips.
IMO: sealing is a major departure from other memory syscalls because it takes away capabilities. The other memory APIs add behaviors or change attributes, but sealing does the opposite: it reduces capabilities. The name of the syscall, mseal(), can help emphasize the "taking away" part.
My second choice would be madvise().
3> Other:
There is also a topic about ptrace/, /proc/self/mem, Userfaultfd, which I think can be followed up using v1 thread, where it has the most context.
New discussions topics: ======================= During the development of V3, I had new questions and thoughts and wished to discuss.
1> shm/aio From reading the code, it seems to me that aio/shm can mmap/munmap maps on behalf of userspace, e.g. ksys_shmdt() in shm.c. The lifetime of those mapping are not tied to the lifetime of the process. If those memories are sealed from userspace, then unmap will fail. This isn’t a huge problem, since the memory will eventually be freed at exit or exec. However, it feels like the solution is not complete, because of the leaks in VMA address space during the lifetime of the process. There could be two solutions to address this, which I will discuss later.
2> Brk (heap/stack) Currently, userspace applications can seal parts of the heap by calling malloc() and mseal(). This raises the question of what the expected behavior is when sealing the heap is attempted.
let's assume following calls from user space:
ptr = malloc(size); mprotect(ptr, size, RO); mseal(ptr, size, SEAL_PROT_PKEY); free(ptr);
Technically, before mseal() is added, the user can change the protection of the heap by calling mprotect(RO). As long as the user changes the protection back to RW before free(), the memory can be reused.
Adding mseal() into picture, however, the heap is then sealed partially, user can still free it, but the memory remains to be RO, and the result of brk-shrink is nondeterministic, depending on if munmap() will try to free the sealed memory.(brk uses munmap to shrink the heap).
3> Above two cases led to the third topic: There are two options to address the problem mentioned above. Option 1: A “MAP_SEALABLE” flag in mmap(). If a map is created without this flag, the mseal() operation will fail. Applications that are not concerned with sealing will expect their behavior to be unchanged. For those that are concerned, adding a flag at mmap time to opt in is not difficult. For the short term, this solves problems 1 and 2 above. The memory in shm/aio/brk will not have the MAP_SEALABLE flag at mmap(), and the same is true for the heap.
Option 2: Add MM_SEAL_SEAL during mmap() It is possible to defensively set MM_SEAL_SEAL for the selected mappings at creation time. Specifically, we can find all the mmaps that we do not want to seal, and add the MM_SEAL_SEAL flag in the mmap() call. The difference between MAP_SEALABLE and MM_SEAL_SEAL is that the first option starts from a small size and incrementally increases, while the second option is the opposite.
In my opinion, MAP_SEALABLE is the preferred option. Only a limited set of mappings need to be sealed, and these are typically created by the runtime. For the few dedicated applications that manage their own mappings, such as Chrome, adding an extra flag at mmap() is not a difficult task. It is also a safer option in terms of regression risk. This is the option included in this version.
4> I think it might be possible to seal the stack or other special mappings created at runtime (vdso, vsyscall, vvar). This means we can enforce and seal W^X for certain types of application. For instance, the stack is typically used in read-write mode, but in some cases, it can become executable. To defend against unintented addition of executable bit to stack, we could let the application to seal it.
Sealing the heap (for adding X) requires special handling, since the heap can shrink, and shrink is implemented through munmap().
Indeed, it might be possible that all virtual memory accessible to user space, regardless of its usage pattern, could be sealed. However, this would require additional research and development work.
------------------------------------------------------------------------ v2: Use _BITUL to define MM_SEAL_XX type. Use unsigned long for seal type in sys_mseal() and other functions. Remove internal VM_SEAL_XX type and convert_user_seal_type(). Remove MM_ACTION_XX type. Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask. Add more comments in code. Add a detailed commit message. https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/
v1: https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/
---------------------------------------------------------------- [1] https://kernelnewbies.org/Linux_2_6_8 [2] https://v8.dev/blog/control-flow-integrity [3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9... [4] https://man.openbsd.org/mimmutable.2 [5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgea... [6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfU... [7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/
Jeff Xu (11): mseal: Add mseal syscall. mseal: Wire up mseal syscall mseal: add can_modify_mm and can_modify_vma mseal: add MM_SEAL_BASE mseal: add MM_SEAL_PROT_PKEY mseal: add sealing support for mmap mseal: make sealed VMA mergeable. mseal: add MM_SEAL_DISCARD_RO_ANON mseal: add MAP_SEALABLE to mmap() selftest mm/mseal memory sealing mseal:add documentation
Documentation/userspace-api/mseal.rst | 189 ++ arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 + arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/mips/kernel/vdso.c | 10 +- arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + fs/userfaultfd.c | 8 +- include/linux/mm.h | 178 +- include/linux/mm_types.h | 8 + include/linux/syscalls.h | 2 + include/uapi/asm-generic/mman-common.h | 16 + include/uapi/asm-generic/unistd.h | 5 +- include/uapi/linux/mman.h | 5 + kernel/sys_ni.c | 1 + mm/Kconfig | 9 + mm/Makefile | 1 + mm/madvise.c | 14 +- mm/mempolicy.c | 2 +- mm/mlock.c | 2 +- mm/mmap.c | 77 +- mm/mprotect.c | 12 +- mm/mremap.c | 44 +- mm/mseal.c | 376 ++++ tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/config | 1 + tools/testing/selftests/mm/mseal_test.c | 2141 +++++++++++++++++++ 41 files changed, 3091 insertions(+), 32 deletions(-) create mode 100644 Documentation/userspace-api/mseal.rst create mode 100644 mm/mseal.c create mode 100644 tools/testing/selftests/mm/mseal_test.c
From: Jeff Xu jeffxu@chromium.org
The new mseal() is an architecture independent syscall, and with following signature:
mseal(void addr, size_t len, unsigned long types, unsigned long flags)
addr/len: memory range. Must be continuous/allocated memory, or else mseal() will fail and no VMA is updated. For details on acceptable arguments, please refer to comments in mseal.c. Those are also covered by the selftest.
This CL adds three sealing types. MM_SEAL_BASE MM_SEAL_PROT_PKEY MM_SEAL_SEAL
The MM_SEAL_BASE: The base package includes the features common to all VMA sealing types. It prevents sealed VMAs from: 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Move or expand a different vma into the current location, via mremap(). 3> Modifying sealed VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA.
We consider the MM_SEAL_BASE feature, on which other sealing features will depend. For instance, it probably does not make sense to seal PROT_PKEY without sealing the BASE, and the kernel will implicitly add SEAL_BASE for SEAL_PROT_PKEY. (If the application wants to relax this in future, we could use the “flags” field in mseal() to overwrite this the behavior of implicitly adding SEAL_BASE.)
The MM_SEAL_PROT_PKEY: Seal PROT and PKEY of the address range, in other words, mprotect() and pkey_mprotect() will be denied if the memory is sealed with MM_SEAL_PROT_PKEY.
The MM_SEAL_SEAL MM_SEAL_SEAL denies adding a new seal for an VMA. The kernel will remember which seal types are applied, and the application doesn’t need to repeat all existing seal types in the next mseal(). Once a seal type is applied, it can’t be unsealed. Call mseal() on an existing seal type is a no-action, not a failure.
Data structure: Internally, the vm_area_struct adds a new field, vm_seals, to store the bit masks. The vm_seals field is added because the existing vm_flags field is full in 32-bit CPUs. The vm_seals field can be merged into vm_flags in the future if the size of vm_flags is ever expanded.
TODO: Sealed VMA won't merge with other VMA in this patch, merging support will be added in later patch.
Signed-off-by: Jeff Xu jeffxu@chromium.org --- include/linux/mm.h | 45 ++++++- include/linux/mm_types.h | 7 ++ include/linux/syscalls.h | 2 + include/uapi/linux/mman.h | 4 + kernel/sys_ni.c | 1 + mm/Kconfig | 9 ++ mm/Makefile | 1 + mm/mmap.c | 3 + mm/mseal.c | 257 ++++++++++++++++++++++++++++++++++++++ 9 files changed, 328 insertions(+), 1 deletion(-) create mode 100644 mm/mseal.c
diff --git a/include/linux/mm.h b/include/linux/mm.h index 19fc73b02c9f..3d1120570de5 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -30,6 +30,7 @@ #include <linux/kasan.h> #include <linux/memremap.h> #include <linux/slab.h> +#include <uapi/linux/mman.h>
struct mempolicy; struct anon_vma; @@ -257,9 +258,17 @@ extern struct rw_semaphore nommu_region_sem; extern unsigned int kobjsize(const void *objp); #endif
+/* + * MM_SEAL_ALL is all supported flags in mseal(). + */ +#define MM_SEAL_ALL ( \ + MM_SEAL_SEAL | \ + MM_SEAL_BASE | \ + MM_SEAL_PROT_PKEY) + /* * vm_flags in vm_area_struct, see mm_types.h. - * When changing, update also include/trace/events/mmflags.h + * When changing, update also include/trace/events/mmflags.h. */ #define VM_NONE 0x00000000
@@ -3308,6 +3317,40 @@ static inline void mm_populate(unsigned long addr, unsigned long len) static inline void mm_populate(unsigned long addr, unsigned long len) {} #endif
+#ifdef CONFIG_MSEAL +static inline bool check_vma_seals_mergeable(unsigned long vm_seals) +{ + /* + * Set sealed VMA not mergeable with another VMA for now. + * This will be changed in later commit to make sealed + * VMA also mergeable. + */ + if (vm_seals & MM_SEAL_ALL) + return false; + + return true; +} + +/* + * return the valid sealing (after mask). + */ +static inline unsigned long vma_seals(struct vm_area_struct *vma) +{ + return (vma->vm_seals & MM_SEAL_ALL); +} + +#else +static inline bool check_vma_seals_mergeable(unsigned long vm_seals1) +{ + return true; +} + +static inline unsigned long vma_seals(struct vm_area_struct *vma) +{ + return 0; +} +#endif + /* These take the mm semaphore themselves */ extern int __must_check vm_brk(unsigned long, unsigned long); extern int __must_check vm_brk_flags(unsigned long, unsigned long, unsigned long); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 589f31ef2e84..052799173c86 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -687,6 +687,13 @@ struct vm_area_struct { struct vma_numab_state *numab_state; /* NUMA Balancing state */ #endif struct vm_userfaultfd_ctx vm_userfaultfd_ctx; +#ifdef CONFIG_MSEAL + /* + * bit masks for seal. + * need this since vm_flags is full. + */ + unsigned long vm_seals; /* seal flags, see mm.h. */ +#endif } __randomize_layout;
#ifdef CONFIG_SCHED_MM_CID diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 0901af60d971..b1c766b74765 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -812,6 +812,8 @@ asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags); asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, unsigned long prot, unsigned long pgoff, unsigned long flags); +asmlinkage long sys_mseal(unsigned long start, size_t len, unsigned long types, + unsigned long flags); asmlinkage long sys_mbind(unsigned long start, unsigned long len, unsigned long mode, const unsigned long __user *nmask, diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h index a246e11988d5..f561652886c4 100644 --- a/include/uapi/linux/mman.h +++ b/include/uapi/linux/mman.h @@ -55,4 +55,8 @@ struct cachestat { __u64 nr_recently_evicted; };
+#define MM_SEAL_SEAL _BITUL(0) +#define MM_SEAL_BASE _BITUL(1) +#define MM_SEAL_PROT_PKEY _BITUL(2) + #endif /* _UAPI_LINUX_MMAN_H */ diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 9db51ea373b0..716d64df522d 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -195,6 +195,7 @@ COND_SYSCALL(migrate_pages); COND_SYSCALL(move_pages); COND_SYSCALL(set_mempolicy_home_node); COND_SYSCALL(cachestat); +COND_SYSCALL(mseal);
COND_SYSCALL(perf_event_open); COND_SYSCALL(accept4); diff --git a/mm/Kconfig b/mm/Kconfig index 264a2df5ecf5..63972d476d19 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1258,6 +1258,15 @@ config LOCK_MM_AND_FIND_VMA bool depends on !STACK_GROWSUP
+config MSEAL + default n + bool "Enable mseal() system call" + depends on MMU + help + Enable the virtual memory sealing. + This feature allows sealing each virtual memory area separately with + multiple sealing types. + source "mm/damon/Kconfig"
endmenu diff --git a/mm/Makefile b/mm/Makefile index ec65984e2ade..643d8518dac0 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -120,6 +120,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o obj-$(CONFIG_SECRETMEM) += secretmem.o +obj-$(CONFIG_MSEAL) += mseal.o obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o diff --git a/mm/mmap.c b/mm/mmap.c index 9e018d8dd7d6..42462c2a0c35 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -740,6 +740,9 @@ static inline bool is_mergeable_vma(struct vm_area_struct *vma, return false; if (!anon_vma_name_eq(anon_vma_name(vma), anon_name)) return false; + if (!check_vma_seals_mergeable(vma_seals(vma))) + return false; + return true; }
diff --git a/mm/mseal.c b/mm/mseal.c new file mode 100644 index 000000000000..13bbe9ef5883 --- /dev/null +++ b/mm/mseal.c @@ -0,0 +1,257 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Implement mseal() syscall. + * + * Copyright (c) 2023 Google, Inc. + * + * Author: Jeff Xu jeffxu@chromium.org + */ + +#include <linux/mman.h> +#include <linux/mm.h> +#include <linux/syscalls.h> +#include <linux/sched.h> +#include "internal.h" + +static bool can_do_mseal(unsigned long types, unsigned long flags) +{ + /* check types is a valid bitmap. */ + if (types & ~MM_SEAL_ALL) + return false; + + /* flags isn't used for now. */ + if (flags) + return false; + + return true; +} + +/* + * Check if a seal type can be added to VMA. + */ +static bool can_add_vma_seals(struct vm_area_struct *vma, unsigned long newSeals) +{ + /* When SEAL_MSEAL is set, reject if a new type of seal is added. */ + if ((vma->vm_seals & MM_SEAL_SEAL) && + (newSeals & ~(vma_seals(vma)))) + return false; + + return true; +} + +static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma, + struct vm_area_struct **prev, unsigned long start, + unsigned long end, unsigned long addtypes) +{ + int ret = 0; + + if (addtypes & ~(vma_seals(vma))) { + /* + * Handle split at start and end. + * For now sealed VMA doesn't merge with other VMAs. + * This will be updated in later commit to make + * sealed VMA also mergeable. + */ + if (start != vma->vm_start) { + ret = split_vma(vmi, vma, start, 1); + if (ret) + goto out; + } + + if (end != vma->vm_end) { + ret = split_vma(vmi, vma, end, 0); + if (ret) + goto out; + } + + vma->vm_seals |= addtypes; + } + +out: + *prev = vma; + return ret; +} + +/* + * Check for do_mseal: + * 1> start is part of a valid vma. + * 2> end is part of a valid vma. + * 3> No gap (unallocated address) between start and end. + * 4> requested seal type can be added in given address range. + */ +static int check_mm_seal(unsigned long start, unsigned long end, + unsigned long newtypes) +{ + struct vm_area_struct *vma; + unsigned long nstart = start; + + VMA_ITERATOR(vmi, current->mm, start); + + /* going through each vma to check. */ + for_each_vma_range(vmi, vma, end) { + if (vma->vm_start > nstart) + /* unallocated memory found. */ + return -ENOMEM; + + if (!can_add_vma_seals(vma, newtypes)) + return -EACCES; + + if (vma->vm_end >= end) + return 0; + + nstart = vma->vm_end; + } + + return -ENOMEM; +} + +/* + * Apply sealing. + */ +static int apply_mm_seal(unsigned long start, unsigned long end, + unsigned long newtypes) +{ + unsigned long nstart, nend; + struct vm_area_struct *vma, *prev = NULL; + struct vma_iterator vmi; + int error = 0; + + vma_iter_init(&vmi, current->mm, start); + vma = vma_find(&vmi, end); + + prev = vma_prev(&vmi); + if (start > vma->vm_start) + prev = vma; + + nstart = start; + + /* going through each vma to update. */ + for_each_vma_range(vmi, vma, end) { + nend = vma->vm_end; + if (nend > end) + nend = end; + + error = mseal_fixup(&vmi, vma, &prev, nstart, nend, newtypes); + if (error) + break; + + nstart = vma->vm_end; + } + + return error; +} + +/* + * mseal(2) seals the VM's meta data from + * selected syscalls. + * + * addr/len: VM address range. + * + * The address range by addr/len must meet: + * start (addr) must be in a valid VMA. + * end (addr + len) must be in a valid VMA. + * no gap (unallocated memory) between start and end. + * start (addr) must be page aligned. + * + * len: len will be page aligned implicitly. + * + * types: bit mask for sealed syscalls. + * MM_SEAL_BASE: prevent VMA from: + * 1> Unmapping, moving to another location, and shrinking + * the size, via munmap() and mremap(), can leave an empty + * space, therefore can be replaced with a VMA with a new + * set of attributes. + * 2> Move or expand a different vma into the current location, + * via mremap(). + * 3> Modifying sealed VMA via mmap(MAP_FIXED). + * 4> Size expansion, via mremap(), does not appear to pose any + * specific risks to sealed VMAs. It is included anyway because + * the use case is unclear. In any case, users can rely on + * merging to expand a sealed VMA. + * + * The MM_SEAL_PROT_PKEY: + * Seal PROT and PKEY of the address range, in other words, + * mprotect() and pkey_mprotect() will be denied if the memory is + * sealed with MM_SEAL_PROT_PKEY. + * + * The MM_SEAL_SEAL + * MM_SEAL_SEAL denies adding a new seal for an VMA. + * + * The kernel will remember which seal types are applied, and the + * application doesn’t need to repeat all existing seal types in + * the next mseal(). Once a seal type is applied, it can’t be + * unsealed. Call mseal() on an existing seal type is a no-action, + * not a failure. + * + * flags: reserved. + * + * return values: + * zero: success. + * -EINVAL: + * invalid seal type. + * invalid input flags. + * addr is not page aligned. + * addr + len overflow. + * -ENOMEM: + * addr is not a valid address (not allocated). + * end (addr + len) is not a valid address. + * a gap (unallocated memory) between start and end. + * -EACCES: + * MM_SEAL_SEAL is set, adding a new seal is rejected. + * + * Note: + * user can call mseal(2) multiple times to add new seal types. + * adding an already added seal type is a no-action (no error). + * adding a new seal type after MM_SEAL_SEAL will be rejected. + * unseal() or removing a seal type is not supported. + */ +static int do_mseal(unsigned long start, size_t len_in, unsigned long types, + unsigned long flags) +{ + int ret = 0; + unsigned long end; + struct mm_struct *mm = current->mm; + size_t len; + + /* MM_SEAL_BASE is set when other seal types are set. */ + if (types & MM_SEAL_PROT_PKEY) + types |= MM_SEAL_BASE; + + if (!can_do_mseal(types, flags)) + return -EINVAL; + + start = untagged_addr(start); + if (!PAGE_ALIGNED(start)) + return -EINVAL; + + len = PAGE_ALIGN(len_in); + /* Check to see whether len was rounded up from small -ve to zero. */ + if (len_in && !len) + return -EINVAL; + + end = start + len; + if (end < start) + return -EINVAL; + + if (end == start) + return 0; + + if (mmap_write_lock_killable(mm)) + return -EINTR; + + ret = check_mm_seal(start, end, types); + if (ret) + goto out; + + ret = apply_mm_seal(start, end, types); + +out: + mmap_write_unlock(current->mm); + return ret; +} + +SYSCALL_DEFINE4(mseal, unsigned long, start, size_t, len, unsigned long, types, unsigned long, + flags) +{ + return do_mseal(start, len, types, flags); +}
On Tue, Dec 12, 2023 at 11:16:55PM +0000, jeffxu@chromium.org wrote:
+config MSEAL
- default n
Minor nit, "n" is always the default, no need to call it out here.
- bool "Enable mseal() system call"
- depends on MMU
- help
Enable the virtual memory sealing.
This feature allows sealing each virtual memory area separately with
multiple sealing types.
You might want to include more documentation as to what this is for, otherwise distros / users will not know if they need to enable this or not.
thanks,
greg k-h
From: Jeff Xu jeffxu@chromium.org
Wire up mseal syscall for all architectures.
Signed-off-by: Jeff Xu jeffxu@chromium.org --- arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 ++ arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + include/uapi/asm-generic/unistd.h | 5 ++++- 19 files changed, 23 insertions(+), 2 deletions(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index b68f1f56b836..4de33b969009 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -496,3 +496,4 @@ 564 common futex_wake sys_futex_wake 565 common futex_wait sys_futex_wait 566 common futex_requeue sys_futex_requeue +567 common mseal sys_mseal diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index 93d0d46cbb15..dacea023bb88 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -469,3 +469,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common mseal sys_mseal diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h index 531effca5f1f..298313d2e0af 100644 --- a/arch/arm64/include/asm/unistd.h +++ b/arch/arm64/include/asm/unistd.h @@ -39,7 +39,7 @@ #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls 457 +#define __NR_compat_syscalls 458 #endif
#define __ARCH_WANT_SYS_CLONE diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h index c453291154fd..015c80b14206 100644 --- a/arch/arm64/include/asm/unistd32.h +++ b/arch/arm64/include/asm/unistd32.h @@ -917,6 +917,8 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake) __SYSCALL(__NR_futex_wait, sys_futex_wait) #define __NR_futex_requeue 456 __SYSCALL(__NR_futex_requeue, sys_futex_requeue) +#define __NR_mseal 457 +__SYSCALL(__NR_mseal, sys_mseal)
/* * Please add new compat syscalls above this comment and update diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl index 81375ea78288..e8b40451693d 100644 --- a/arch/ia64/kernel/syscalls/syscall.tbl +++ b/arch/ia64/kernel/syscalls/syscall.tbl @@ -376,3 +376,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common mseal sys_mseal diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index f7f997a88bab..0da4a4dc1737 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -455,3 +455,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common mseal sys_mseal diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index 2967ec26b978..ca8572222783 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -461,3 +461,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common mseal sys_mseal diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index 383abb1713f4..4fd33623b7e8 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -394,3 +394,4 @@ 454 n32 futex_wake sys_futex_wake 455 n32 futex_wait sys_futex_wait 456 n32 futex_requeue sys_futex_requeue +457 n32 mseal sys_mseal diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl index c9bd09ba905f..aaa6382781e0 100644 --- a/arch/mips/kernel/syscalls/syscall_n64.tbl +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl @@ -370,3 +370,4 @@ 454 n64 futex_wake sys_futex_wake 455 n64 futex_wait sys_futex_wait 456 n64 futex_requeue sys_futex_requeue +457 n64 mseal sys_mseal diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl index ba5ef6cea97a..bbdd6f151224 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -443,3 +443,4 @@ 454 o32 futex_wake sys_futex_wake 455 o32 futex_wait sys_futex_wait 456 o32 futex_requeue sys_futex_requeue +457 o32 mseal sys_mseal diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index 9f0f6df55361..8dda80555c7c 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -454,3 +454,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common mseal sys_mseal diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index 26fc41904266..d0aa97a669bc 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -542,3 +542,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common mseal sys_mseal diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl index 31be90b241f7..228f100f8565 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -458,3 +458,4 @@ 454 common futex_wake sys_futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue sys_futex_requeue +457 common mseal sys_mseal sys_mseal diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl index 4bc5d488ab17..cf08ea4a7539 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -458,3 +458,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common mseal sys_mseal diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index 8404c8e50394..30796f78bdc2 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -501,3 +501,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common mseal sys_mseal diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 31c48bc2c3d8..c4163b904714 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -460,3 +460,4 @@ 454 i386 futex_wake sys_futex_wake 455 i386 futex_wait sys_futex_wait 456 i386 futex_requeue sys_futex_requeue +457 i386 mseal sys_mseal diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index a577bb27c16d..47fbc6ac0267 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -378,6 +378,7 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common mseal sys_mseal
# # Due to a historical design error, certain syscalls are numbered differently diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index dd71ecce8b86..fe5f562f6493 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -426,3 +426,4 @@ 454 common futex_wake sys_futex_wake 455 common futex_wait sys_futex_wait 456 common futex_requeue sys_futex_requeue +457 common mseal sys_mseal diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index d9e9cd13e577..1678245d8a2b 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -829,8 +829,11 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait) #define __NR_futex_requeue 456 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
+#define __NR_mseal 457 +__SYSCALL(__NR_mseal, sys_mseal) + #undef __NR_syscalls -#define __NR_syscalls 457 +#define __NR_syscalls 458
/* * 32 bit systems traditionally used different
From: Jeff Xu jeffxu@chromium.org
Two utilities to be used later.
can_modify_mm: checks sealing flags for given memory range.
can_modify_vma: checks sealing flags for given vma.
Signed-off-by: Jeff Xu jeffxu@chromium.org --- include/linux/mm.h | 18 ++++++++++++++++++ mm/mseal.c | 38 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 56 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 3d1120570de5..2435acc1f44f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3339,6 +3339,12 @@ static inline unsigned long vma_seals(struct vm_area_struct *vma) return (vma->vm_seals & MM_SEAL_ALL); }
+extern bool can_modify_mm(struct mm_struct *mm, unsigned long start, + unsigned long end, unsigned long checkSeals); + +extern bool can_modify_vma(struct vm_area_struct *vma, + unsigned long checkSeals); + #else static inline bool check_vma_seals_mergeable(unsigned long vm_seals1) { @@ -3349,6 +3355,18 @@ static inline unsigned long vma_seals(struct vm_area_struct *vma) { return 0; } + +static inline bool can_modify_mm(struct mm_struct *mm, unsigned long start, + unsigned long end, unsigned long checkSeals) +{ + return true; +} + +static inline bool can_modify_vma(struct vm_area_struct *vma, + unsigned long checkSeals) +{ + return true; +} #endif
/* These take the mm semaphore themselves */ diff --git a/mm/mseal.c b/mm/mseal.c index 13bbe9ef5883..d12aa628ebdc 100644 --- a/mm/mseal.c +++ b/mm/mseal.c @@ -26,6 +26,44 @@ static bool can_do_mseal(unsigned long types, unsigned long flags) return true; }
+/* + * check if a vma is sealed for modification. + * return true, if modification is allowed. + */ +bool can_modify_vma(struct vm_area_struct *vma, + unsigned long checkSeals) +{ + if (checkSeals & vma_seals(vma)) + return false; + + return true; +} + +/* + * Check if the vmas of a memory range are allowed to be modified. + * the memory ranger can have a gap (unallocated memory). + * return true, if it is allowed. + */ +bool can_modify_mm(struct mm_struct *mm, unsigned long start, unsigned long end, + unsigned long checkSeals) +{ + struct vm_area_struct *vma; + + VMA_ITERATOR(vmi, mm, start); + + if (!checkSeals) + return true; + + /* going through each vma to check. */ + for_each_vma_range(vmi, vma, end) { + if (!can_modify_vma(vma, checkSeals)) + return false; + } + + /* Allow by default. */ + return true; +} + /* * Check if a seal type can be added to VMA. */
From: Jeff Xu jeffxu@chromium.org
The base package includes the features common to all VMA sealing types. It prevents sealed VMAs from: 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Move or expand a different vma into the current location, via mremap(). 3> Modifying sealed VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA.
We consider the MM_SEAL_BASE feature, on which other sealing features will depend. For instance, it probably does not make sense to seal PROT_PKEY without sealing the BASE, and the kernel will implicitly add SEAL_BASE for SEAL_PROT_PKEY. (If the application wants to relax this in future, we could use the flags field in mseal() to overwrite this the behavior of implicitly adding SEAL_BASE.)
Signed-off-by: Jeff Xu jeffxu@chromium.org --- mm/mmap.c | 23 +++++++++++++++++++++++ mm/mremap.c | 42 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 65 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c index 42462c2a0c35..dbc557bd460c 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1259,6 +1259,13 @@ unsigned long do_mmap(struct file *file, unsigned long addr, return -EEXIST; }
+ /* + * Check if the address range is sealed for do_mmap(). + * can_modify_mm assumes we have acquired the lock on MM. + */ + if (!can_modify_mm(mm, addr, addr + len, MM_SEAL_BASE)) + return -EACCES; + if (prot == PROT_EXEC) { pkey = execute_only_pkey(mm); if (pkey < 0) @@ -2632,6 +2639,14 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, if (end == start) return -EINVAL;
+ /* + * Check if memory is sealed before arch_unmap. + * Prevent unmapping a sealed VMA. + * can_modify_mm assumes we have acquired the lock on MM. + */ + if (!can_modify_mm(mm, start, end, MM_SEAL_BASE)) + return -EACCES; + /* arch_unmap() might do unmaps itself. */ arch_unmap(mm, start, end);
@@ -3053,6 +3068,14 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, { struct mm_struct *mm = vma->vm_mm;
+ /* + * Check if memory is sealed before arch_unmap. + * Prevent unmapping a sealed VMA. + * can_modify_mm assumes we have acquired the lock on MM. + */ + if (!can_modify_mm(mm, start, end, MM_SEAL_BASE)) + return -EACCES; + arch_unmap(mm, start, end); return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock); } diff --git a/mm/mremap.c b/mm/mremap.c index 382e81c33fc4..ff7429bfbbe1 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -835,7 +835,35 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len, if ((mm->map_count + 2) >= sysctl_max_map_count - 3) return -ENOMEM;
+ /* + * In mremap_to() which moves a VMA to another address. + * Check if src address is sealed, if so, reject. + * In other words, prevent a sealed VMA being moved to + * another address. + * + * Place can_modify_mm here because mremap_to() + * does its own checking for address range, and we only + * check the sealing after passing those checks. + * can_modify_mm assumes we have acquired the lock on MM. + */ + if (!can_modify_mm(mm, addr, addr + old_len, MM_SEAL_BASE)) + return -EACCES; + if (flags & MREMAP_FIXED) { + /* + * In mremap_to() which moves a VMA to another address. + * Check if dst address is sealed, if so, reject. + * In other words, prevent moving a vma to a sealed VMA. + * + * Place can_modify_mm here because mremap_to() does its + * own checking for address, and we only check the sealing + * after passing those checks. + * can_modify_mm assumes we have acquired the lock on MM. + */ + if (!can_modify_mm(mm, new_addr, new_addr + new_len, + MM_SEAL_BASE)) + return -EACCES; + ret = do_munmap(mm, new_addr, new_len, uf_unmap_early); if (ret) goto out; @@ -994,6 +1022,20 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, goto out; }
+ /* + * This is shrink/expand case (not mremap_to()) + * Check if src address is sealed, if so, reject. + * In other words, prevent shrinking or expanding a sealed VMA. + * + * Place can_modify_mm here so we can keep the logic related to + * shrink/expand together. Perhaps we can extract below to be its + * own function in future. + */ + if (!can_modify_mm(mm, addr, addr + old_len, MM_SEAL_BASE)) { + ret = -EACCES; + goto out; + } + /* * Always allow a shrinking remap: that just unmaps * the unnecessary pages..
From: Jeff Xu jeffxu@chromium.org
Seal PROT and PKEY of the address range, in other words, mprotect() and pkey_mprotect() will be denied if the memory is sealed with MM_SEAL_PROT_PKEY.
Signed-off-by: Jeff Xu jeffxu@chromium.org --- mm/mprotect.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/mm/mprotect.c b/mm/mprotect.c index b94fbb45d5c7..1527188b1e92 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -32,6 +32,7 @@ #include <linux/sched/sysctl.h> #include <linux/userfaultfd_k.h> #include <linux/memory-tiers.h> +#include <uapi/linux/mman.h> #include <asm/cacheflush.h> #include <asm/mmu_context.h> #include <asm/tlbflush.h> @@ -753,6 +754,15 @@ static int do_mprotect_pkey(unsigned long start, size_t len, } }
+ /* + * checking if PROT and PKEY is sealed. + * can_modify_mm assumes we have acquired the lock on MM. + */ + if (!can_modify_mm(current->mm, start, end, MM_SEAL_PROT_PKEY)) { + error = -EACCES; + goto out; + } + prev = vma_prev(&vmi); if (start > vma->vm_start) prev = vma;
From: Jeff Xu jeffxu@chromium.org
Allow mmap() to set the sealing type when creating a mapping. This is useful for optimization because it avoids having to make two system calls: one for mmap() and one for mseal(). With this change, mmap() can take an input that specifies the sealing type, so only one system call is needed.
This patch uses the "prot" field of mmap() to set the sealing. Three sealing types are added to match with MM_SEAL_xyz in mseal(). PROT_SEAL_SEAL PROT_SEAL_BASE PROT_SEAL_PROT_PKEY
We also thought about using MAP_SEAL_xyz, which is a field in the mmap() function called "flags". However, this field is more about the type of mapping, such as MAP_FIXED_NOREPLACE. The "prot" field seems more appropriate for our case.
It's worth noting that even though the sealing type is set via the "prot" field in mmap(), we don't require it to be set in the "prot" field in later mprotect() call. This is unlike the PROT_READ, PROT_WRITE, PROT_EXEC bits, e.g. if PROT_WRITE is not set in mprotect(), it means that the region is not writable. In other words, if you set PROT_SEAL_PROT_PKEY in mmap(), you don't need to set it in mprotect(). In fact, with the current approach, mseal() is used to set sealing on existing VMA.
Signed-off-by: Jeff Xu jeffxu@chromium.org Suggested-by: Pedro Falcato pedro.falcato@gmail.com --- arch/mips/kernel/vdso.c | 10 +++- include/linux/mm.h | 63 +++++++++++++++++++++++++- include/uapi/asm-generic/mman-common.h | 13 ++++++ mm/mmap.c | 25 ++++++++-- 4 files changed, 105 insertions(+), 6 deletions(-)
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c index f6d40e43f108..6d1103d36af1 100644 --- a/arch/mips/kernel/vdso.c +++ b/arch/mips/kernel/vdso.c @@ -98,11 +98,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) return -EINTR;
if (IS_ENABLED(CONFIG_MIPS_FP_SUPPORT)) { - /* Map delay slot emulation page */ + /* + * Map delay slot emulation page. + * + * Note: passing vm_seals = 0 + * Don't support sealing for vdso for now. + * This might change when we add sealing support for vdso. + */ base = mmap_region(NULL, STACK_TOP, PAGE_SIZE, VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC, - 0, NULL); + 0, NULL, 0); if (IS_ERR_VALUE(base)) { ret = base; goto out; diff --git a/include/linux/mm.h b/include/linux/mm.h index 2435acc1f44f..5d3ee79f1438 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -266,6 +266,15 @@ extern unsigned int kobjsize(const void *objp); MM_SEAL_BASE | \ MM_SEAL_PROT_PKEY)
+/* + * PROT_SEAL_ALL is all supported flags in mmap(). + * See include/uapi/asm-generic/mman-common.h. + */ +#define PROT_SEAL_ALL ( \ + PROT_SEAL_SEAL | \ + PROT_SEAL_BASE | \ + PROT_SEAL_PROT_PKEY) + /* * vm_flags in vm_area_struct, see mm_types.h. * When changing, update also include/trace/events/mmflags.h. @@ -3290,7 +3299,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
extern unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, - struct list_head *uf); + struct list_head *uf, unsigned long vm_seals); extern unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate, @@ -3339,12 +3348,47 @@ static inline unsigned long vma_seals(struct vm_area_struct *vma) return (vma->vm_seals & MM_SEAL_ALL); }
+static inline void update_vma_seals(struct vm_area_struct *vma, unsigned long vm_seals) +{ + vma->vm_seals |= vm_seals; +} + extern bool can_modify_mm(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned long checkSeals);
extern bool can_modify_vma(struct vm_area_struct *vma, unsigned long checkSeals);
+/* + * Convert prot field of mmap to vm_seals type. + */ +static inline unsigned long convert_mmap_seals(unsigned long prot) +{ + unsigned long seals = 0; + + /* + * set SEAL_PROT_PKEY implies SEAL_BASE. + */ + if (prot & PROT_SEAL_PROT_PKEY) + prot |= PROT_SEAL_BASE; + + /* + * The seal bits start from bit 26 of the "prot" field of mmap. + * see comments in include/uapi/asm-generic/mman-common.h. + */ + seals = (prot & PROT_SEAL_ALL) >> PROT_SEAL_BIT_BEGIN; + return seals; +} + +/* + * check input sealing type from the "prot" field of mmap(). + * for CONFIG_MSEAL case, this always return 0 (successful). + */ +static inline int check_mmap_seals(unsigned long prot, unsigned long *vm_seals) +{ + *vm_seals = convert_mmap_seals(prot); + return 0; +} #else static inline bool check_vma_seals_mergeable(unsigned long vm_seals1) { @@ -3367,6 +3411,23 @@ static inline bool can_modify_vma(struct vm_area_struct *vma, { return true; } + +static inline void update_vma_seals(struct vm_area_struct *vma, unsigned long vm_seals) +{ +} + +/* + * check input sealing type from the "prot" field of mmap(). + * For not CONFIG_MSEAL, if SEAL flag is set, it will return failure. + */ +static inline int check_mmap_seals(unsigned long prot, unsigned long *vm_seals) +{ + if (prot & PROT_SEAL_ALL) + return -EINVAL; + + return 0; +} + #endif
/* These take the mm semaphore themselves */ diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index 6ce1f1ceb432..f07ad9e70b3a 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -17,6 +17,19 @@ #define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ #define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+/* + * The PROT_SEAL_XX defines memory sealings flags in the prot argument + * of mmap(). The bits currently take consecutive bits and match + * the same sequence as MM_SEAL_XX type, this allows convert_mmap_seals() + * to convert prot to MM_SEAL_XX type using bit operations. + * The include/uapi/linux/mman.h header file defines the MM_SEAL_XX type, + * which is used by the mseal() system call. + */ +#define PROT_SEAL_BIT_BEGIN 26 +#define PROT_SEAL_SEAL _BITUL(PROT_SEAL_BIT_BEGIN) /* 0x04000000 seal seal */ +#define PROT_SEAL_BASE _BITUL(PROT_SEAL_BIT_BEGIN + 1) /* 0x08000000 base for all sealing types */ +#define PROT_SEAL_PROT_PKEY _BITUL(PROT_SEAL_BIT_BEGIN + 2) /* 0x10000000 seal prot and pkey */ + /* 0x01 - 0x03 are defined in linux/mman.h */ #define MAP_TYPE 0x0f /* Mask for type of mapping */ #define MAP_FIXED 0x10 /* Interpret addr exactly */ diff --git a/mm/mmap.c b/mm/mmap.c index dbc557bd460c..3e1bf5a131b0 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1211,6 +1211,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, { struct mm_struct *mm = current->mm; int pkey = 0; + unsigned long vm_seals = 0;
*populate = 0;
@@ -1231,6 +1232,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr, if (flags & MAP_FIXED_NOREPLACE) flags |= MAP_FIXED;
+ if (check_mmap_seals(prot, &vm_seals) < 0) + return -EINVAL; + if (!(flags & MAP_FIXED)) addr = round_hint_to_min(addr);
@@ -1381,7 +1385,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, vm_flags |= VM_NORESERVE; }
- addr = mmap_region(file, addr, len, vm_flags, pgoff, uf); + addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, vm_seals); if (!IS_ERR_VALUE(addr) && ((vm_flags & VM_LOCKED) || (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE)) @@ -2679,7 +2683,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, - struct list_head *uf) + struct list_head *uf, unsigned long vm_seals) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma = NULL; @@ -2723,7 +2727,13 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
next = vma_next(&vmi); prev = vma_prev(&vmi); - if (vm_flags & VM_SPECIAL) { + /* + * For now, sealed VMA doesn't merge with other VMA, + * Will change this in later commit when we make sealed VMA + * also mergeable. + */ + if ((vm_flags & VM_SPECIAL) || + (vm_seals & MM_SEAL_ALL)) { if (prev) vma_iter_next_range(&vmi); goto cannot_expand; @@ -2781,6 +2791,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vma->vm_page_prot = vm_get_page_prot(vm_flags); vma->vm_pgoff = pgoff;
+ update_vma_seals(vma, vm_seals); + if (file) { if (vm_flags & VM_SHARED) { error = mapping_map_writable(file->f_mapping); @@ -2992,6 +3004,13 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, if (pgoff + (size >> PAGE_SHIFT) < pgoff) return ret;
+ /* + * Do not support sealing in remap_file_page. + * sealing is set via mmap() and mseal(). + */ + if (prot & PROT_SEAL_ALL) + return ret; + if (mmap_write_lock_killable(mm)) return -EINTR;
From: Jeff Xu jeffxu@chromium.org
Add merge/split handling for mlock/madvice/mprotect/mmap case. Make sealed VMA mergeable with adjacent VMAs.
This is so that we don't run out of VMAs, i.e. there is a max number of VMA per process.
Signed-off-by: Jeff Xu jeffxu@chromium.org Suggested-by: Jann Horn jannh@google.com --- fs/userfaultfd.c | 8 +++++--- include/linux/mm.h | 31 +++++++++++++------------------ mm/madvise.c | 2 +- mm/mempolicy.c | 2 +- mm/mlock.c | 2 +- mm/mmap.c | 44 +++++++++++++++++++++----------------------- mm/mprotect.c | 2 +- mm/mremap.c | 2 +- mm/mseal.c | 23 ++++++++++++++++++----- 9 files changed, 62 insertions(+), 54 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 56eaae9dac1a..8ebee7c1c6cf 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -926,7 +926,8 @@ static int userfaultfd_release(struct inode *inode, struct file *file) new_flags, vma->anon_vma, vma->vm_file, vma->vm_pgoff, vma_policy(vma), - NULL_VM_UFFD_CTX, anon_vma_name(vma)); + NULL_VM_UFFD_CTX, anon_vma_name(vma), + vma_seals(vma)); if (prev) { vma = prev; } else { @@ -1483,7 +1484,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma), ((struct vm_userfaultfd_ctx){ ctx }), - anon_vma_name(vma)); + anon_vma_name(vma), vma_seals(vma)); if (prev) { /* vma_merge() invalidated the mas */ vma = prev; @@ -1668,7 +1669,8 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags, vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma), - NULL_VM_UFFD_CTX, anon_vma_name(vma)); + NULL_VM_UFFD_CTX, anon_vma_name(vma), + vma_seals(vma)); if (prev) { vma = prev; goto next; diff --git a/include/linux/mm.h b/include/linux/mm.h index 5d3ee79f1438..1f162bb5b38d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3243,7 +3243,7 @@ extern struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *, struct vm_area_struct *prev, unsigned long addr, unsigned long end, unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t, struct mempolicy *, struct vm_userfaultfd_ctx, - struct anon_vma_name *); + struct anon_vma_name *, unsigned long vm_seals); extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *); extern int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *, unsigned long addr, int new_below); @@ -3327,19 +3327,6 @@ static inline void mm_populate(unsigned long addr, unsigned long len) {} #endif
#ifdef CONFIG_MSEAL -static inline bool check_vma_seals_mergeable(unsigned long vm_seals) -{ - /* - * Set sealed VMA not mergeable with another VMA for now. - * This will be changed in later commit to make sealed - * VMA also mergeable. - */ - if (vm_seals & MM_SEAL_ALL) - return false; - - return true; -} - /* * return the valid sealing (after mask). */ @@ -3353,6 +3340,14 @@ static inline void update_vma_seals(struct vm_area_struct *vma, unsigned long vm vma->vm_seals |= vm_seals; }
+static inline bool check_vma_seals_mergeable(unsigned long vm_seals1, unsigned long vm_seals2) +{ + if ((vm_seals1 & MM_SEAL_ALL) != (vm_seals2 & MM_SEAL_ALL)) + return false; + + return true; +} + extern bool can_modify_mm(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned long checkSeals);
@@ -3390,14 +3385,14 @@ static inline int check_mmap_seals(unsigned long prot, unsigned long *vm_seals) return 0; } #else -static inline bool check_vma_seals_mergeable(unsigned long vm_seals1) +static inline unsigned long vma_seals(struct vm_area_struct *vma) { - return true; + return 0; }
-static inline unsigned long vma_seals(struct vm_area_struct *vma) +static inline bool check_vma_seals_mergeable(unsigned long vm_seals1, unsigned long vm_seals2) { - return 0; + return true; }
static inline bool can_modify_mm(struct mm_struct *mm, unsigned long start, diff --git a/mm/madvise.c b/mm/madvise.c index 4dded5d27e7e..e2d219a4b6ef 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -152,7 +152,7 @@ static int madvise_update_vma(struct vm_area_struct *vma, pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); *prev = vma_merge(&vmi, mm, *prev, start, end, new_flags, vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma), - vma->vm_userfaultfd_ctx, anon_name); + vma->vm_userfaultfd_ctx, anon_name, vma_seals(vma)); if (*prev) { vma = *prev; goto success; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index e52e3a0b8f2e..e70b69c64564 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -836,7 +836,7 @@ static int mbind_range(struct vma_iterator *vmi, struct vm_area_struct *vma, pgoff = vma->vm_pgoff + ((vmstart - vma->vm_start) >> PAGE_SHIFT); merged = vma_merge(vmi, vma->vm_mm, *prev, vmstart, vmend, vma->vm_flags, vma->anon_vma, vma->vm_file, pgoff, new_pol, - vma->vm_userfaultfd_ctx, anon_vma_name(vma)); + vma->vm_userfaultfd_ctx, anon_vma_name(vma), vma_seals(vma)); if (merged) { *prev = merged; return vma_replace_policy(merged, new_pol); diff --git a/mm/mlock.c b/mm/mlock.c index 06bdfab83b58..b537a2cbd337 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -428,7 +428,7 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma, pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); *prev = vma_merge(vmi, mm, *prev, start, end, newflags, vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma), - vma->vm_userfaultfd_ctx, anon_vma_name(vma)); + vma->vm_userfaultfd_ctx, anon_vma_name(vma), vma_seals(vma)); if (*prev) { vma = *prev; goto success; diff --git a/mm/mmap.c b/mm/mmap.c index 3e1bf5a131b0..6da8d83f2e66 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -720,7 +720,8 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, static inline bool is_mergeable_vma(struct vm_area_struct *vma, struct file *file, unsigned long vm_flags, struct vm_userfaultfd_ctx vm_userfaultfd_ctx, - struct anon_vma_name *anon_name, bool may_remove_vma) + struct anon_vma_name *anon_name, bool may_remove_vma, + unsigned long vm_seals) { /* * VM_SOFTDIRTY should not prevent from VMA merging, if we @@ -740,7 +741,7 @@ static inline bool is_mergeable_vma(struct vm_area_struct *vma, return false; if (!anon_vma_name_eq(anon_vma_name(vma), anon_name)) return false; - if (!check_vma_seals_mergeable(vma_seals(vma))) + if (!check_vma_seals_mergeable(vma_seals(vma), vm_seals)) return false;
return true; @@ -776,9 +777,10 @@ static bool can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags, struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff, struct vm_userfaultfd_ctx vm_userfaultfd_ctx, - struct anon_vma_name *anon_name) + struct anon_vma_name *anon_name, unsigned long vm_seals) { - if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name, true) && + if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, + anon_name, true, vm_seals) && is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) { if (vma->vm_pgoff == vm_pgoff) return true; @@ -799,9 +801,10 @@ static bool can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags, struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff, struct vm_userfaultfd_ctx vm_userfaultfd_ctx, - struct anon_vma_name *anon_name) + struct anon_vma_name *anon_name, unsigned long vm_seals) { - if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name, false) && + if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, + anon_name, false, vm_seals) && is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) { pgoff_t vm_pglen; vm_pglen = vma_pages(vma); @@ -869,7 +872,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, struct anon_vma *anon_vma, struct file *file, pgoff_t pgoff, struct mempolicy *policy, struct vm_userfaultfd_ctx vm_userfaultfd_ctx, - struct anon_vma_name *anon_name) + struct anon_vma_name *anon_name, unsigned long vm_seals) { struct vm_area_struct *curr, *next, *res; struct vm_area_struct *vma, *adjust, *remove, *remove2; @@ -908,7 +911,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, /* Can we merge the predecessor? */ if (addr == prev->vm_end && mpol_equal(vma_policy(prev), policy) && can_vma_merge_after(prev, vm_flags, anon_vma, file, - pgoff, vm_userfaultfd_ctx, anon_name)) { + pgoff, vm_userfaultfd_ctx, anon_name, vm_seals)) { merge_prev = true; vma_prev(vmi); } @@ -917,7 +920,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, /* Can we merge the successor? */ if (next && mpol_equal(policy, vma_policy(next)) && can_vma_merge_before(next, vm_flags, anon_vma, file, pgoff+pglen, - vm_userfaultfd_ctx, anon_name)) { + vm_userfaultfd_ctx, anon_name, vm_seals)) { merge_next = true; }
@@ -2727,13 +2730,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
next = vma_next(&vmi); prev = vma_prev(&vmi); - /* - * For now, sealed VMA doesn't merge with other VMA, - * Will change this in later commit when we make sealed VMA - * also mergeable. - */ - if ((vm_flags & VM_SPECIAL) || - (vm_seals & MM_SEAL_ALL)) { + + if (vm_flags & VM_SPECIAL) { if (prev) vma_iter_next_range(&vmi); goto cannot_expand; @@ -2743,7 +2741,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, /* Check next */ if (next && next->vm_start == end && !vma_policy(next) && can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen, - NULL_VM_UFFD_CTX, NULL)) { + NULL_VM_UFFD_CTX, NULL, vm_seals)) { merge_end = next->vm_end; vma = next; vm_pgoff = next->vm_pgoff - pglen; @@ -2752,9 +2750,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr, /* Check prev */ if (prev && prev->vm_end == addr && !vma_policy(prev) && (vma ? can_vma_merge_after(prev, vm_flags, vma->anon_vma, file, - pgoff, vma->vm_userfaultfd_ctx, NULL) : + pgoff, vma->vm_userfaultfd_ctx, NULL, vm_seals) : can_vma_merge_after(prev, vm_flags, NULL, file, pgoff, - NULL_VM_UFFD_CTX, NULL))) { + NULL_VM_UFFD_CTX, NULL, vm_seals))) { merge_start = prev->vm_start; vma = prev; vm_pgoff = prev->vm_pgoff; @@ -2822,7 +2820,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, merge = vma_merge(&vmi, mm, prev, vma->vm_start, vma->vm_end, vma->vm_flags, NULL, vma->vm_file, vma->vm_pgoff, NULL, - NULL_VM_UFFD_CTX, NULL); + NULL_VM_UFFD_CTX, NULL, vma_seals(vma)); if (merge) { /* * ->mmap() can change vma->vm_file and fput @@ -3130,14 +3128,14 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT)) return -ENOMEM; - /* * Expand the existing vma if possible; Note that singular lists do not * occur after forking, so the expand will only happen on new VMAs. */ if (vma && vma->vm_end == addr && !vma_policy(vma) && can_vma_merge_after(vma, flags, NULL, NULL, - addr >> PAGE_SHIFT, NULL_VM_UFFD_CTX, NULL)) { + addr >> PAGE_SHIFT, NULL_VM_UFFD_CTX, NULL, + vma_seals(vma))) { vma_iter_config(vmi, vma->vm_start, addr + len); if (vma_iter_prealloc(vmi, vma)) goto unacct_fail; @@ -3380,7 +3378,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
new_vma = vma_merge(&vmi, mm, prev, addr, addr + len, vma->vm_flags, vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma), - vma->vm_userfaultfd_ctx, anon_vma_name(vma)); + vma->vm_userfaultfd_ctx, anon_vma_name(vma), vma_seals(vma)); if (new_vma) { /* * Source vma may have been merged into new_vma diff --git a/mm/mprotect.c b/mm/mprotect.c index 1527188b1e92..a4c90e71607b 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -632,7 +632,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb, pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); *pprev = vma_merge(vmi, mm, *pprev, start, end, newflags, vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma), - vma->vm_userfaultfd_ctx, anon_vma_name(vma)); + vma->vm_userfaultfd_ctx, anon_vma_name(vma), vma_seals(vma)); if (*pprev) { vma = *pprev; VM_WARN_ON((vma->vm_flags ^ newflags) & ~VM_SOFTDIRTY); diff --git a/mm/mremap.c b/mm/mremap.c index ff7429bfbbe1..357efd6b48b9 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -1098,7 +1098,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, vma = vma_merge(&vmi, mm, vma, extension_start, extension_end, vma->vm_flags, vma->anon_vma, vma->vm_file, extension_pgoff, vma_policy(vma), - vma->vm_userfaultfd_ctx, anon_vma_name(vma)); + vma->vm_userfaultfd_ctx, anon_vma_name(vma), vma_seals(vma)); if (!vma) { vm_unacct_memory(pages); ret = -ENOMEM; diff --git a/mm/mseal.c b/mm/mseal.c index d12aa628ebdc..3b90dce7d20e 100644 --- a/mm/mseal.c +++ b/mm/mseal.c @@ -7,8 +7,10 @@ * Author: Jeff Xu jeffxu@chromium.org */
+#include <linux/mempolicy.h> #include <linux/mman.h> #include <linux/mm.h> +#include <linux/mm_inline.h> #include <linux/syscalls.h> #include <linux/sched.h> #include "internal.h" @@ -81,14 +83,25 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, unsigned long addtypes) { + pgoff_t pgoff; int ret = 0; + unsigned long newtypes = vma_seals(vma) | addtypes; + + if (newtypes != vma_seals(vma)) { + /* + * Attempt to merge with prev and next vma. + */ + pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); + *prev = vma_merge(vmi, vma->vm_mm, *prev, start, end, vma->vm_flags, + vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma), + vma->vm_userfaultfd_ctx, anon_vma_name(vma), newtypes); + if (*prev) { + vma = *prev; + goto out; + }
- if (addtypes & ~(vma_seals(vma))) { /* * Handle split at start and end. - * For now sealed VMA doesn't merge with other VMAs. - * This will be updated in later commit to make - * sealed VMA also mergeable. */ if (start != vma->vm_start) { ret = split_vma(vmi, vma, start, 1); @@ -102,7 +115,7 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma, goto out; }
- vma->vm_seals |= addtypes; + vma->vm_seals = newtypes; }
out:
From: Jeff Xu jeffxu@chromium.org
Certain types of madvise() operations are destructive, such as MADV_DONTNEED, which can effectively alter region contents by discarding pages, especially when memory is anonymous. This blocks such operations for anonymous memory which is not writable to the user.
The MM_SEAL_DISCARD_RO_ANON blocks such operations if users don't have access to the memory, and the memory is anonymous memory.
We do not think such sealing is useful for file-backed mapping because it should repopulate the memory contents from the underlying mapped file. We also do not think it is useful if the user can write to the memory because then the attacker can also write.
Signed-off-by: Jeff Xu jeffxu@chromium.org Suggested-by: Jann Horn jannh@google.com Suggested-by: Stephen Röttger sroettger@google.com --- include/linux/mm.h | 19 +++++-- include/uapi/asm-generic/mman-common.h | 2 + include/uapi/linux/mman.h | 1 + mm/madvise.c | 12 +++++ mm/mseal.c | 73 ++++++++++++++++++++++++-- 5 files changed, 98 insertions(+), 9 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 1f162bb5b38d..50dda474acc2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -264,7 +264,8 @@ extern unsigned int kobjsize(const void *objp); #define MM_SEAL_ALL ( \ MM_SEAL_SEAL | \ MM_SEAL_BASE | \ - MM_SEAL_PROT_PKEY) + MM_SEAL_PROT_PKEY | \ + MM_SEAL_DISCARD_RO_ANON)
/* * PROT_SEAL_ALL is all supported flags in mmap(). @@ -273,7 +274,8 @@ extern unsigned int kobjsize(const void *objp); #define PROT_SEAL_ALL ( \ PROT_SEAL_SEAL | \ PROT_SEAL_BASE | \ - PROT_SEAL_PROT_PKEY) + PROT_SEAL_PROT_PKEY | \ + PROT_SEAL_DISCARD_RO_ANON)
/* * vm_flags in vm_area_struct, see mm_types.h. @@ -3354,6 +3356,9 @@ extern bool can_modify_mm(struct mm_struct *mm, unsigned long start, extern bool can_modify_vma(struct vm_area_struct *vma, unsigned long checkSeals);
+extern bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start, + unsigned long end, int behavior); + /* * Convert prot field of mmap to vm_seals type. */ @@ -3362,9 +3367,9 @@ static inline unsigned long convert_mmap_seals(unsigned long prot) unsigned long seals = 0;
/* - * set SEAL_PROT_PKEY implies SEAL_BASE. + * set SEAL_PROT_PKEY or SEAL_DISCARD_RO_ANON implies SEAL_BASE. */ - if (prot & PROT_SEAL_PROT_PKEY) + if (prot & (PROT_SEAL_PROT_PKEY | PROT_SEAL_DISCARD_RO_ANON)) prot |= PROT_SEAL_BASE;
/* @@ -3407,6 +3412,12 @@ static inline bool can_modify_vma(struct vm_area_struct *vma, return true; }
+static inline bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start, + unsigned long end, int behavior) +{ + return true; +} + static inline void update_vma_seals(struct vm_area_struct *vma, unsigned long vm_seals) { } diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index f07ad9e70b3a..bf503962409a 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -29,6 +29,8 @@ #define PROT_SEAL_SEAL _BITUL(PROT_SEAL_BIT_BEGIN) /* 0x04000000 seal seal */ #define PROT_SEAL_BASE _BITUL(PROT_SEAL_BIT_BEGIN + 1) /* 0x08000000 base for all sealing types */ #define PROT_SEAL_PROT_PKEY _BITUL(PROT_SEAL_BIT_BEGIN + 2) /* 0x10000000 seal prot and pkey */ +/* seal destructive madvise for non-writeable anonymous memory. */ +#define PROT_SEAL_DISCARD_RO_ANON _BITUL(PROT_SEAL_BIT_BEGIN + 3) /* 0x20000000 */
/* 0x01 - 0x03 are defined in linux/mman.h */ #define MAP_TYPE 0x0f /* Mask for type of mapping */ diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h index f561652886c4..3872cc118c8a 100644 --- a/include/uapi/linux/mman.h +++ b/include/uapi/linux/mman.h @@ -58,5 +58,6 @@ struct cachestat { #define MM_SEAL_SEAL _BITUL(0) #define MM_SEAL_BASE _BITUL(1) #define MM_SEAL_PROT_PKEY _BITUL(2) +#define MM_SEAL_DISCARD_RO_ANON _BITUL(3)
#endif /* _UAPI_LINUX_MMAN_H */ diff --git a/mm/madvise.c b/mm/madvise.c index e2d219a4b6ef..ff038e323779 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1403,6 +1403,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, * -EIO - an I/O error occurred while paging in data. * -EBADF - map exists, but area maps something that isn't a file. * -EAGAIN - a kernel resource was temporarily unavailable. + * -EACCES - memory is sealed. */ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior) { @@ -1446,10 +1447,21 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh start = untagged_addr_remote(mm, start); end = start + len;
+ /* + * Check if the address range is sealed for do_madvise(). + * can_modify_mm_madv assumes we have acquired the lock on MM. + */ + if (!can_modify_mm_madv(mm, start, end, behavior)) { + error = -EACCES; + goto out; + } + blk_start_plug(&plug); error = madvise_walk_vmas(mm, start, end, behavior, madvise_vma_behavior); blk_finish_plug(&plug); + +out: if (write) mmap_write_unlock(mm); else diff --git a/mm/mseal.c b/mm/mseal.c index 3b90dce7d20e..294f48d33db6 100644 --- a/mm/mseal.c +++ b/mm/mseal.c @@ -11,6 +11,7 @@ #include <linux/mman.h> #include <linux/mm.h> #include <linux/mm_inline.h> +#include <linux/mmu_context.h> #include <linux/syscalls.h> #include <linux/sched.h> #include "internal.h" @@ -66,6 +67,55 @@ bool can_modify_mm(struct mm_struct *mm, unsigned long start, unsigned long end, return true; }
+static bool is_madv_discard(int behavior) +{ + return behavior & + (MADV_FREE | MADV_DONTNEED | MADV_DONTNEED_LOCKED | + MADV_REMOVE | MADV_DONTFORK | MADV_WIPEONFORK); +} + +static bool is_ro_anon(struct vm_area_struct *vma) +{ + /* check anonymous mapping. */ + if (vma->vm_file || vma->vm_flags & VM_SHARED) + return false; + + /* + * check for non-writable: + * PROT=RO or PKRU is not writeable. + */ + if (!(vma->vm_flags & VM_WRITE) || + !arch_vma_access_permitted(vma, true, false, false)) + return true; + + return false; +} + +/* + * Check if the vmas of a memory range are allowed to be modified by madvise. + * the memory ranger can have a gap (unallocated memory). + * return true, if it is allowed. + */ +bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start, unsigned long end, + int behavior) +{ + struct vm_area_struct *vma; + + VMA_ITERATOR(vmi, mm, start); + + if (!is_madv_discard(behavior)) + return true; + + /* going through each vma to check. */ + for_each_vma_range(vmi, vma, end) + if (is_ro_anon(vma) && !can_modify_vma( + vma, MM_SEAL_DISCARD_RO_ANON)) + return false; + + /* Allow by default. */ + return true; +} + /* * Check if a seal type can be added to VMA. */ @@ -76,6 +126,12 @@ static bool can_add_vma_seals(struct vm_area_struct *vma, unsigned long newSeals (newSeals & ~(vma_seals(vma)))) return false;
+ /* + * For simplicity, we allow adding all sealing types during mmap or mseal. + * The actual sealing check will happen later during particular action. + * E.g. For MM_SEAL_DISCARD_RO_ANON, we always allow adding it, at the + * time madvice() call, we will check if the sealing condition isn't met. + */ return true; }
@@ -225,15 +281,22 @@ static int apply_mm_seal(unsigned long start, unsigned long end, * mprotect() and pkey_mprotect() will be denied if the memory is * sealed with MM_SEAL_PROT_PKEY. * - * The MM_SEAL_SEAL - * MM_SEAL_SEAL denies adding a new seal for an VMA. - * * The kernel will remember which seal types are applied, and the * application doesn’t need to repeat all existing seal types in * the next mseal(). Once a seal type is applied, it can’t be * unsealed. Call mseal() on an existing seal type is a no-action, * not a failure. * + * MM_SEAL_DISCARD_RO_ANON: block some destructive madvice() + * behavior, such as MADV_DONTNEED, which can effectively + * alter gegion contents by discarding pages, block such + * operation if users don't have write access to the memory, and + * the memory is anonymous memory. + * Setting this implies MM_SEAL_BASE is also set. + * + * The MM_SEAL_SEAL + * MM_SEAL_SEAL denies adding a new seal for an VMA. + * * flags: reserved. * * return values: @@ -264,8 +327,8 @@ static int do_mseal(unsigned long start, size_t len_in, unsigned long types, struct mm_struct *mm = current->mm; size_t len;
- /* MM_SEAL_BASE is set when other seal types are set. */ - if (types & MM_SEAL_PROT_PKEY) + /* MM_SEAL_BASE is set when other seal types are set */ + if (types & (MM_SEAL_PROT_PKEY | MM_SEAL_DISCARD_RO_ANON)) types |= MM_SEAL_BASE;
if (!can_do_mseal(types, flags))
From: Jeff Xu jeffxu@chromium.org
The MAP_SEALABLE flag is added to the flags field of mmap(). When present, it marks the map as sealable. A map created without MAP_SEALABLE will not support sealing; In other words, mseal() will fail for such a map.
Applications that don't care about sealing will expect their behavior unchanged. For those that need sealing support, opt-in by adding MAP_SEALABLE when creating the map.
Signed-off-by: Jeff Xu jeffxu@chromium.org --- include/linux/mm.h | 52 ++++++++++++++++++++++++-- include/linux/mm_types.h | 1 + include/uapi/asm-generic/mman-common.h | 1 + mm/mmap.c | 2 +- mm/mseal.c | 7 +++- 5 files changed, 57 insertions(+), 6 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 50dda474acc2..6f5dba9fbe21 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -267,6 +267,17 @@ extern unsigned int kobjsize(const void *objp); MM_SEAL_PROT_PKEY | \ MM_SEAL_DISCARD_RO_ANON)
+/* define VM_SEALABLE in vm_seals of vm_area_struct. */ +#define VM_SEALABLE _BITUL(31) + +/* + * VM_SEALS_BITS_ALL marks the bits used for + * sealing in vm_seals of vm_area_structure. + */ +#define VM_SEALS_BITS_ALL ( \ + MM_SEAL_ALL | \ + VM_SEALABLE) + /* * PROT_SEAL_ALL is all supported flags in mmap(). * See include/uapi/asm-generic/mman-common.h. @@ -3330,9 +3341,17 @@ static inline void mm_populate(unsigned long addr, unsigned long len) {}
#ifdef CONFIG_MSEAL /* - * return the valid sealing (after mask). + * return the valid sealing (after mask), this includes sealable bit. */ static inline unsigned long vma_seals(struct vm_area_struct *vma) +{ + return (vma->vm_seals & VM_SEALS_BITS_ALL); +} + +/* + * return the enabled sealing type (after mask), without sealable bit. + */ +static inline unsigned long vma_enabled_seals(struct vm_area_struct *vma) { return (vma->vm_seals & MM_SEAL_ALL); } @@ -3342,9 +3361,14 @@ static inline void update_vma_seals(struct vm_area_struct *vma, unsigned long vm vma->vm_seals |= vm_seals; }
+static inline bool is_vma_sealable(struct vm_area_struct *vma) +{ + return vma->vm_seals & VM_SEALABLE; +} + static inline bool check_vma_seals_mergeable(unsigned long vm_seals1, unsigned long vm_seals2) { - if ((vm_seals1 & MM_SEAL_ALL) != (vm_seals2 & MM_SEAL_ALL)) + if ((vm_seals1 & VM_SEALS_BITS_ALL) != (vm_seals2 & VM_SEALS_BITS_ALL)) return false;
return true; @@ -3384,9 +3408,15 @@ static inline unsigned long convert_mmap_seals(unsigned long prot) * check input sealing type from the "prot" field of mmap(). * for CONFIG_MSEAL case, this always return 0 (successful). */ -static inline int check_mmap_seals(unsigned long prot, unsigned long *vm_seals) +static inline int check_mmap_seals(unsigned long prot, unsigned long *vm_seals, + unsigned long flags) { *vm_seals = convert_mmap_seals(prot); + if (*vm_seals) + /* setting one of MM_SEAL_XX means the map is sealable. */ + *vm_seals |= VM_SEALABLE; + else + *vm_seals |= (flags & MAP_SEALABLE) ? VM_SEALABLE:0; return 0; } #else @@ -3395,6 +3425,16 @@ static inline unsigned long vma_seals(struct vm_area_struct *vma) return 0; }
+static inline unsigned long vma_enabled_seals(struct vm_area_struct *vma) +{ + return 0; +} + +static inline bool is_vma_sealable(struct vm_area_struct *vma) +{ + return false; +} + static inline bool check_vma_seals_mergeable(unsigned long vm_seals1, unsigned long vm_seals2) { return true; @@ -3426,11 +3466,15 @@ static inline void update_vma_seals(struct vm_area_struct *vma, unsigned long vm * check input sealing type from the "prot" field of mmap(). * For not CONFIG_MSEAL, if SEAL flag is set, it will return failure. */ -static inline int check_mmap_seals(unsigned long prot, unsigned long *vm_seals) +static inline int check_mmap_seals(unsigned long prot, unsigned long *vm_seals, + unsigned long flags) { if (prot & PROT_SEAL_ALL) return -EINVAL;
+ if (flags & MAP_SEALABLE) + return -EINVAL; + return 0; }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 052799173c86..c9b04c545f39 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -691,6 +691,7 @@ struct vm_area_struct { /* * bit masks for seal. * need this since vm_flags is full. + * We could merge this into vm_flags if vm_flags ever get expanded. */ unsigned long vm_seals; /* seal flags, see mm.h. */ #endif diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index bf503962409a..57ef4507c00b 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -47,6 +47,7 @@
#define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be * uninitialized */ +#define MAP_SEALABLE 0x8000000 /* map is sealable. */
/* * Flags for mlock diff --git a/mm/mmap.c b/mm/mmap.c index 6da8d83f2e66..6e35e2070060 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1235,7 +1235,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, if (flags & MAP_FIXED_NOREPLACE) flags |= MAP_FIXED;
- if (check_mmap_seals(prot, &vm_seals) < 0) + if (check_mmap_seals(prot, &vm_seals, flags) < 0) return -EINVAL;
if (!(flags & MAP_FIXED)) diff --git a/mm/mseal.c b/mm/mseal.c index 294f48d33db6..5d4cf71b497e 100644 --- a/mm/mseal.c +++ b/mm/mseal.c @@ -121,9 +121,13 @@ bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start, unsigned long */ static bool can_add_vma_seals(struct vm_area_struct *vma, unsigned long newSeals) { + /* if map is not sealable, reject. */ + if (!is_vma_sealable(vma)) + return false; + /* When SEAL_MSEAL is set, reject if a new type of seal is added. */ if ((vma->vm_seals & MM_SEAL_SEAL) && - (newSeals & ~(vma_seals(vma)))) + (newSeals & ~(vma_enabled_seals(vma)))) return false;
/* @@ -185,6 +189,7 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma, * 2> end is part of a valid vma. * 3> No gap (unallocated address) between start and end. * 4> requested seal type can be added in given address range. + * 5> map is sealable. */ static int check_mm_seal(unsigned long start, unsigned long end, unsigned long newtypes)
From: Jeff Xu jeffxu@chromium.org
selftest for memory sealing change in mmap() and mseal().
Signed-off-by: Jeff Xu jeffxu@chromium.org --- tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/config | 1 + tools/testing/selftests/mm/mseal_test.c | 2141 +++++++++++++++++++++++ 4 files changed, 2144 insertions(+) create mode 100644 tools/testing/selftests/mm/mseal_test.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index cdc9ce4426b9..f0f22a649985 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -43,3 +43,4 @@ mdwe_test gup_longterm mkdirty va_high_addr_switch +mseal_test diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 6a9fc5693145..0c086cecc093 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -59,6 +59,7 @@ TEST_GEN_FILES += mlock2-tests TEST_GEN_FILES += mrelease_test TEST_GEN_FILES += mremap_dontunmap TEST_GEN_FILES += mremap_test +TEST_GEN_FILES += mseal_test TEST_GEN_FILES += on-fault-limit TEST_GEN_FILES += thuge-gen TEST_GEN_FILES += transhuge-stress diff --git a/tools/testing/selftests/mm/config b/tools/testing/selftests/mm/config index be087c4bc396..cf2b8780e9b1 100644 --- a/tools/testing/selftests/mm/config +++ b/tools/testing/selftests/mm/config @@ -6,3 +6,4 @@ CONFIG_TEST_HMM=m CONFIG_GUP_TEST=y CONFIG_TRANSPARENT_HUGEPAGE=y CONFIG_MEM_SOFT_DIRTY=y +CONFIG_MSEAL=y diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c new file mode 100644 index 000000000000..0692485d8b3c --- /dev/null +++ b/tools/testing/selftests/mm/mseal_test.c @@ -0,0 +1,2141 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE +#include <sys/mman.h> +#include <stdint.h> +#include <unistd.h> +#include <string.h> +#include <sys/time.h> +#include <sys/resource.h> +#include <stdbool.h> +#include "../kselftest.h" +#include <syscall.h> +#include <errno.h> +#include <stdio.h> +#include <stdlib.h> +#include <assert.h> +#include <fcntl.h> +#include <assert.h> +#include <sys/ioctl.h> +#include <sys/vfs.h> +#include <sys/stat.h> + +/* + * need those definition for manually build using gcc. + * gcc -I ../../../../usr/include -DDEBUG -O3 -DDEBUG -O3 mseal_test.c -o mseal_test + */ +#ifndef MM_SEAL_SEAL +#define MM_SEAL_SEAL 0x1 +#endif + +#ifndef MM_SEAL_BASE +#define MM_SEAL_BASE 0x2 +#endif + +#ifndef MM_SEAL_PROT_PKEY +#define MM_SEAL_PROT_PKEY 0x4 +#endif + +#ifndef MM_SEAL_DISCARD_RO_ANON +#define MM_SEAL_DISCARD_RO_ANON 0x8 +#endif + +#ifndef MAP_SEALABLE +#define MAP_SEALABLE 0x8000000 +#endif + +#ifndef PROT_SEAL_SEAL +#define PROT_SEAL_SEAL 0x04000000 +#endif + +#ifndef PROT_SEAL_BASE +#define PROT_SEAL_BASE 0x08000000 +#endif + +#ifndef PROT_SEAL_PROT_PKEY +#define PROT_SEAL_PROT_PKEY 0x10000000 +#endif + +#ifndef PROT_SEAL_DISCARD_RO_ANON +#define PROT_SEAL_DISCARD_RO_ANON 0x20000000 +#endif + +#ifndef PKEY_DISABLE_ACCESS +# define PKEY_DISABLE_ACCESS 0x1 +#endif + +#ifndef PKEY_DISABLE_WRITE +# define PKEY_DISABLE_WRITE 0x2 +#endif + +#ifndef PKEY_BITS_PER_KEY +#define PKEY_BITS_PER_PKEY 2 +#endif + +#ifndef PKEY_MASK +#define PKEY_MASK (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE) +#endif + +#ifndef DEBUG +#define LOG_TEST_ENTER() {} +#else +#define LOG_TEST_ENTER() {ksft_print_msg("%s\n", __func__); } +#endif + +#ifndef u64 +#define u64 unsigned long long +#endif + +/* + * define sys_xyx to call syscall directly. + */ +static int sys_mseal(void *start, size_t len, int types) +{ + int sret; + + errno = 0; + sret = syscall(__NR_mseal, start, len, types, 0); + return sret; +} + +int sys_mprotect(void *ptr, size_t size, unsigned long prot) +{ + int sret; + + errno = 0; + sret = syscall(SYS_mprotect, ptr, size, prot); + return sret; +} + +int sys_mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot, + unsigned long pkey) +{ + int sret; + + errno = 0; + sret = syscall(__NR_pkey_mprotect, ptr, size, orig_prot, pkey); + return sret; +} + +int sys_munmap(void *ptr, size_t size) +{ + int sret; + + errno = 0; + sret = syscall(SYS_munmap, ptr, size); + return sret; +} + +static int sys_madvise(void *start, size_t len, int types) +{ + int sret; + + errno = 0; + sret = syscall(__NR_madvise, start, len, types); + return sret; +} + +int sys_pkey_alloc(unsigned long flags, unsigned long init_val) +{ + int ret = syscall(SYS_pkey_alloc, flags, init_val); + return ret; +} + +static inline unsigned int __read_pkey_reg(void) +{ + unsigned int eax, edx; + unsigned int ecx = 0; + unsigned int pkey_reg; + + asm volatile(".byte 0x0f,0x01,0xee\n\t" + : "=a" (eax), "=d" (edx) + : "c" (ecx)); + pkey_reg = eax; + return pkey_reg; +} + +static inline void __write_pkey_reg(u64 pkey_reg) +{ + unsigned int eax = pkey_reg; + unsigned int ecx = 0; + unsigned int edx = 0; + + asm volatile(".byte 0x0f,0x01,0xef\n\t" + : : "a" (eax), "c" (ecx), "d" (edx)); + assert(pkey_reg == __read_pkey_reg()); +} + +static inline unsigned long pkey_bit_position(int pkey) +{ + return pkey * PKEY_BITS_PER_PKEY; +} + +static inline u64 set_pkey_bits(u64 reg, int pkey, u64 flags) +{ + unsigned long shift = pkey_bit_position(pkey); + /* mask out bits from pkey in old value */ + reg &= ~((u64)PKEY_MASK << shift); + /* OR in new bits for pkey */ + reg |= (flags & PKEY_MASK) << shift; + return reg; +} + +static inline void set_pkey(int pkey, unsigned long pkey_value) +{ + unsigned long mask = (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE); + u64 new_pkey_reg; + + assert(!(pkey_value & ~mask)); + new_pkey_reg = set_pkey_bits(__read_pkey_reg(), pkey, pkey_value); + __write_pkey_reg(new_pkey_reg); +} + +void setup_single_address(int size, void **ptrOut) +{ + void *ptr; + + ptr = mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0); + assert(ptr != (void *)-1); + *ptrOut = ptr; +} + +void setup_single_address_sealable(int size, void **ptrOut, bool sealable) +{ + void *ptr; + unsigned long mapflags = MAP_ANONYMOUS | MAP_PRIVATE; + + if (sealable) + mapflags |= MAP_SEALABLE; + + ptr = mmap(NULL, size, PROT_READ, mapflags, -1, 0); + assert(ptr != (void *)-1); + *ptrOut = ptr; +} + +void setup_single_address_rw_sealable(int size, void **ptrOut, bool sealable) +{ + void *ptr; + unsigned long mapflags = MAP_ANONYMOUS | MAP_PRIVATE; + + if (sealable) + mapflags |= MAP_SEALABLE; + + ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, mapflags, -1, 0); + assert(ptr != (void *)-1); + *ptrOut = ptr; +} + +void clean_single_address(void *ptr, int size) +{ + int ret; + + ret = munmap(ptr, size); + assert(!ret); +} + +void seal_mprotect_single_address(void *ptr, int size) +{ + int ret; + + ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY); + assert(!ret); +} + +void seal_discard_ro_anon_single_address(void *ptr, int size) +{ + int ret; + + ret = sys_mseal(ptr, size, MM_SEAL_DISCARD_RO_ANON); + assert(!ret); +} + +static void test_seal_addseals(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + /* adding seal one by one */ + + ret = sys_mseal(ptr, size, MM_SEAL_BASE); + assert(!ret); + ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY); + assert(!ret); + ret = sys_mseal(ptr, size, MM_SEAL_SEAL); + assert(!ret); +} + +static void test_seal_addseals_combined(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY); + assert(!ret); + + /* adding multiple seals */ + ret = sys_mseal(ptr, size, + MM_SEAL_PROT_PKEY | MM_SEAL_BASE| + MM_SEAL_SEAL); + assert(!ret); + + /* not adding more seal type, so ok. */ + ret = sys_mseal(ptr, size, MM_SEAL_BASE); + assert(!ret); + + /* not adding more seal type, so ok. */ + ret = sys_mseal(ptr, size, MM_SEAL_SEAL); + assert(!ret); +} + +static void test_seal_addseals_reject(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + ret = sys_mseal(ptr, size, MM_SEAL_BASE | MM_SEAL_SEAL); + assert(!ret); + + /* MM_SEAL_SEAL is set, so not allow new seal type . */ + ret = sys_mseal(ptr, size, + MM_SEAL_PROT_PKEY | MM_SEAL_BASE | MM_SEAL_SEAL); + assert(ret < 0); +} + +static void test_seal_unmapped_start(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + // munmap 2 pages from ptr. + ret = sys_munmap(ptr, 2 * page_size); + assert(!ret); + + // mprotect will fail because 2 pages from ptr are unmapped. + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + assert(ret < 0); + + // mseal will fail because 2 pages from ptr are unmapped. + ret = sys_mseal(ptr, size, MM_SEAL_SEAL); + assert(ret < 0); + + ret = sys_mseal(ptr + 2 * page_size, 2 * page_size, MM_SEAL_SEAL); + assert(!ret); +} + +static void test_seal_unmapped_middle(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + // munmap 2 pages from ptr + page. + ret = sys_munmap(ptr + page_size, 2 * page_size); + assert(!ret); + + // mprotect will fail, since size is 4 pages. + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + assert(ret < 0); + + // mseal will fail as well. + ret = sys_mseal(ptr, size, MM_SEAL_SEAL); + assert(ret < 0); + + /* we still can add seal to the first page and last page*/ + ret = sys_mseal(ptr, page_size, MM_SEAL_SEAL | MM_SEAL_PROT_PKEY); + assert(!ret); + + ret = sys_mseal(ptr + 3 * page_size, page_size, + MM_SEAL_SEAL | MM_SEAL_PROT_PKEY); + assert(!ret); +} + +static void test_seal_unmapped_end(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + // unmap last 2 pages. + ret = sys_munmap(ptr + 2 * page_size, 2 * page_size); + assert(!ret); + + //mprotect will fail since last 2 pages are unmapped. + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + assert(ret < 0); + + //mseal will fail as well. + ret = sys_mseal(ptr, size, MM_SEAL_SEAL); + assert(ret < 0); + + /* The first 2 pages is not sealed, and can add seals */ + ret = sys_mseal(ptr, 2 * page_size, MM_SEAL_SEAL | MM_SEAL_PROT_PKEY); + assert(!ret); +} + +static void test_seal_multiple_vmas(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + // use mprotect to split the vma into 3. + ret = sys_mprotect(ptr + page_size, 2 * page_size, + PROT_READ | PROT_WRITE); + assert(!ret); + + // mprotect will get applied to all 4 pages - 3 VMAs. + ret = sys_mprotect(ptr, size, PROT_READ); + assert(!ret); + + // use mprotect to split the vma into 3. + ret = sys_mprotect(ptr + page_size, 2 * page_size, + PROT_READ | PROT_WRITE); + assert(!ret); + + // mseal get applied to all 4 pages - 3 VMAs. + ret = sys_mseal(ptr, size, MM_SEAL_SEAL); + assert(!ret); + + // verify additional seal type will fail after MM_SEAL_SEAL set. + ret = sys_mseal(ptr, page_size, MM_SEAL_SEAL | MM_SEAL_PROT_PKEY); + assert(ret < 0); + + ret = sys_mseal(ptr + page_size, 2 * page_size, + MM_SEAL_SEAL | MM_SEAL_PROT_PKEY); + assert(ret < 0); + + ret = sys_mseal(ptr + 3 * page_size, page_size, + MM_SEAL_SEAL | MM_SEAL_PROT_PKEY); + assert(ret < 0); +} + +static void test_seal_split_start(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + /* use mprotect to split at middle */ + ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE); + assert(!ret); + + /* seal the first page, this will split the VMA */ + ret = sys_mseal(ptr, page_size, MM_SEAL_SEAL); + assert(!ret); + + /* can't add seal to the first page */ + ret = sys_mseal(ptr, page_size, MM_SEAL_SEAL | MM_SEAL_PROT_PKEY); + assert(ret < 0); + + /* add seal to the remain 3 pages */ + ret = sys_mseal(ptr + page_size, 3 * page_size, + MM_SEAL_SEAL | MM_SEAL_PROT_PKEY); + assert(!ret); +} + +static void test_seal_split_end(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + /* use mprotect to split at middle */ + ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE); + assert(!ret); + + /* seal the last page */ + ret = sys_mseal(ptr + 3 * page_size, page_size, MM_SEAL_SEAL); + assert(!ret); + + /* adding seal to the last page is rejected. */ + ret = sys_mseal(ptr + 3 * page_size, page_size, + MM_SEAL_SEAL | MM_SEAL_PROT_PKEY); + assert(ret < 0); + + /* Adding seals to the first 3 pages */ + ret = sys_mseal(ptr, 3 * page_size, MM_SEAL_SEAL | MM_SEAL_PROT_PKEY); + assert(!ret); +} + +static void test_seal_invalid_input(void) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(8 * page_size, &ptr); + clean_single_address(ptr + 4 * page_size, 4 * page_size); + + /* invalid flag */ + ret = sys_mseal(ptr, size, 0x20); + assert(ret < 0); + + ret = sys_mseal(ptr, size, 0x31); + assert(ret < 0); + + ret = sys_mseal(ptr, size, 0x3F); + assert(ret < 0); + + /* unaligned address */ + ret = sys_mseal(ptr + 1, 2 * page_size, MM_SEAL_SEAL); + assert(ret < 0); + + /* length too big */ + ret = sys_mseal(ptr, 5 * page_size, MM_SEAL_SEAL); + assert(ret < 0); + + /* start is not in a valid VMA */ + ret = sys_mseal(ptr - page_size, 5 * page_size, MM_SEAL_SEAL); + assert(ret < 0); +} + +static void test_seal_zero_length(void) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + ret = sys_mprotect(ptr, 0, PROT_READ | PROT_WRITE); + assert(!ret); + + /* seal 0 length will be OK, same as mprotect */ + ret = sys_mseal(ptr, 0, MM_SEAL_PROT_PKEY); + assert(!ret); + + // verify the 4 pages are not sealed by previous call. + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + assert(!ret); +} + +static void test_seal_twice(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY); + assert(!ret); + + // apply the same seal will be OK. idempotent. + ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY); + assert(!ret); + + ret = sys_mseal(ptr, size, + MM_SEAL_PROT_PKEY | MM_SEAL_BASE | + MM_SEAL_SEAL); + assert(!ret); + + ret = sys_mseal(ptr, size, + MM_SEAL_PROT_PKEY | MM_SEAL_BASE | + MM_SEAL_SEAL); + assert(!ret); + + ret = sys_mseal(ptr, size, MM_SEAL_SEAL); + assert(!ret); +} + +static void test_seal_mprotect(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + if (seal) + seal_mprotect_single_address(ptr, size); + + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); +} + +static void test_seal_start_mprotect(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + if (seal) + seal_mprotect_single_address(ptr, page_size); + + // the first page is sealed. + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); + + // pages after the first page is not sealed. + ret = sys_mprotect(ptr + page_size, page_size * 3, + PROT_READ | PROT_WRITE); + assert(!ret); +} + +static void test_seal_end_mprotect(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + if (seal) + seal_mprotect_single_address(ptr + page_size, 3 * page_size); + + /* first page is not sealed */ + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + assert(!ret); + + /* last 3 page are sealed */ + ret = sys_mprotect(ptr + page_size, page_size * 3, + PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); +} + +static void test_seal_mprotect_unalign_len(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + if (seal) + seal_mprotect_single_address(ptr, page_size * 2 - 1); + + // 2 pages are sealed. + ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); + + ret = sys_mprotect(ptr + page_size * 2, page_size, + PROT_READ | PROT_WRITE); + assert(!ret); +} + +static void test_seal_mprotect_unalign_len_variant_2(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + if (seal) + seal_mprotect_single_address(ptr, page_size * 2 + 1); + + // 3 pages are sealed. + ret = sys_mprotect(ptr, page_size * 3, PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); + + ret = sys_mprotect(ptr + page_size * 3, page_size, + PROT_READ | PROT_WRITE); + assert(!ret); +} + +static void test_seal_mprotect_two_vma(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + /* use mprotect to split */ + ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE); + assert(!ret); + + if (seal) + seal_mprotect_single_address(ptr, page_size * 4); + + ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); + + ret = sys_mprotect(ptr + page_size * 2, page_size * 2, + PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); +} + +static void test_seal_mprotect_two_vma_with_split(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + // use mprotect to split as two vma. + ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE); + assert(!ret); + + // mseal can apply across 2 vma, also split them. + if (seal) + seal_mprotect_single_address(ptr + page_size, page_size * 2); + + // the first page is not sealed. + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + assert(!ret); + + // the second page is sealed. + ret = sys_mprotect(ptr + page_size, page_size, PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); + + // the third page is sealed. + ret = sys_mprotect(ptr + 2 * page_size, page_size, + PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); + + // the fouth page is not sealed. + ret = sys_mprotect(ptr + 3 * page_size, page_size, + PROT_READ | PROT_WRITE); + assert(!ret); +} + +static void test_seal_mprotect_partial_mprotect(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + // seal one page. + if (seal) + seal_mprotect_single_address(ptr, page_size); + + // mprotect first 2 page will fail, since the first page are sealed. + ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); +} + +static void test_seal_mprotect_two_vma_with_gap(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + // use mprotect to split. + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + assert(!ret); + + // use mprotect to split. + ret = sys_mprotect(ptr + 3 * page_size, page_size, + PROT_READ | PROT_WRITE); + assert(!ret); + + // use munmap to free two pages in the middle + ret = sys_munmap(ptr + page_size, 2 * page_size); + assert(!ret); + + // mprotect will fail, because there is a gap in the address. + // notes, internally mprotect still updated the first page. + ret = sys_mprotect(ptr, 4 * page_size, PROT_READ); + assert(ret < 0); + + // mseal will fail as well. + ret = sys_mseal(ptr, 4 * page_size, MM_SEAL_PROT_PKEY); + assert(ret < 0); + + // unlike mprotect, the first page is not sealed. + ret = sys_mprotect(ptr, page_size, PROT_READ); + assert(ret == 0); + + // the last page is not sealed. + ret = sys_mprotect(ptr + 3 * page_size, page_size, PROT_READ); + assert(ret == 0); +} + +static void test_seal_mprotect_split(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + //use mprotect to split. + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + assert(!ret); + + //seal all 4 pages. + if (seal) { + ret = sys_mseal(ptr, 4 * page_size, MM_SEAL_PROT_PKEY); + assert(!ret); + } + + //madvice is OK. + ret = sys_madvise(ptr, page_size * 2, MADV_WILLNEED); + assert(!ret); + + //mprotect is sealed. + ret = sys_mprotect(ptr, 2 * page_size, PROT_READ); + if (seal) + assert(ret < 0); + else + assert(!ret); + + + ret = sys_mprotect(ptr + 2 * page_size, 2 * page_size, PROT_READ); + if (seal) + assert(ret < 0); + else + assert(!ret); +} + +static void test_seal_mprotect_merge(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + // use mprotect to split one page. + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + assert(!ret); + + // seal first two pages. + if (seal) { + ret = sys_mseal(ptr, 2 * page_size, MM_SEAL_PROT_PKEY); + assert(!ret); + } + + ret = sys_madvise(ptr, page_size, MADV_WILLNEED); + assert(!ret); + + // 2 pages are sealed. + ret = sys_mprotect(ptr, 2 * page_size, PROT_READ); + if (seal) + assert(ret < 0); + else + assert(!ret); + + // last 2 pages are not sealed. + ret = sys_mprotect(ptr + 2 * page_size, 2 * page_size, PROT_READ); + assert(ret == 0); +} + +static void test_seal_munmap(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_BASE); + assert(!ret); + } + + // 4 pages are sealed. + ret = sys_munmap(ptr, size); + if (seal) + assert(ret < 0); + else + assert(!ret); +} + +/* + * allocate 4 pages, + * use mprotect to split it as two VMAs + * seal the whole range + * munmap will fail on both + */ +static void test_seal_munmap_two_vma(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + /* use mprotect to split */ + ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE); + assert(!ret); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_BASE); + assert(!ret); + } + + ret = sys_munmap(ptr, page_size * 2); + if (seal) + assert(ret < 0); + else + assert(!ret); + + ret = sys_munmap(ptr + page_size, page_size * 2); + if (seal) + assert(ret < 0); + else + assert(!ret); +} + +/* + * allocate a VMA with 4 pages. + * munmap the middle 2 pages. + * seal the whole 4 pages, will fail. + * note: one of the pages are sealed + * munmap the first page will be OK. + * munmap the last page will be OK. + */ +static void test_seal_munmap_vma_with_gap(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + ret = sys_munmap(ptr + page_size, page_size * 2); + assert(!ret); + + if (seal) { + // can't have gap in the middle. + ret = sys_mseal(ptr, size, MM_SEAL_BASE); + assert(ret < 0); + } + + ret = sys_munmap(ptr, page_size); + assert(!ret); + + ret = sys_munmap(ptr + page_size * 2, page_size); + assert(!ret); + + ret = sys_munmap(ptr, size); + assert(!ret); +} + +static void test_munmap_start_freed(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + // unmap the first page. + ret = sys_munmap(ptr, page_size); + assert(!ret); + + // seal the last 3 pages. + if (seal) { + ret = sys_mseal(ptr + page_size, 3 * page_size, MM_SEAL_BASE); + assert(!ret); + } + + // unmap from the first page. + ret = sys_munmap(ptr, size); + if (seal) { + assert(ret < 0); + + // use mprotect to verify page is not unmapped. + ret = sys_mprotect(ptr + page_size, 3 * page_size, PROT_READ); + assert(!ret); + } else + // note: this will be OK, even the first page is + // already unmapped. + assert(!ret); +} + +static void test_munmap_end_freed(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + // unmap last page. + ret = sys_munmap(ptr + page_size * 3, page_size); + assert(!ret); + + // seal the first 3 pages. + if (seal) { + ret = sys_mseal(ptr, 3 * page_size, MM_SEAL_BASE); + assert(!ret); + } + + // unmap all pages. + ret = sys_munmap(ptr, size); + if (seal) { + assert(ret < 0); + + // use mprotect to verify page is not unmapped. + ret = sys_mprotect(ptr, 3 * page_size, PROT_READ); + assert(!ret); + } else + assert(!ret); +} + +static void test_munmap_middle_freed(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + // unmap 2 pages in the middle. + ret = sys_munmap(ptr + page_size, page_size * 2); + assert(!ret); + + // seal the first page. + if (seal) { + ret = sys_mseal(ptr, page_size, MM_SEAL_BASE); + assert(!ret); + } + + // munmap all 4 pages. + ret = sys_munmap(ptr, size); + if (seal) { + assert(ret < 0); + + // use mprotect to verify page is not unmapped. + ret = sys_mprotect(ptr, page_size, PROT_READ); + assert(!ret); + } else + assert(!ret); +} + +void test_seal_mremap_shrink(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_BASE); + assert(!ret); + } + + // shrink from 4 pages to 2 pages. + ret2 = mremap(ptr, size, 2 * page_size, 0, 0); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else { + assert(ret2 != MAP_FAILED); + + } +} + +void test_seal_mremap_expand(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + // ummap last 2 pages. + ret = sys_munmap(ptr + 2 * page_size, 2 * page_size); + assert(!ret); + + if (seal) { + ret = sys_mseal(ptr, 2 * page_size, MM_SEAL_BASE); + assert(!ret); + } + + // expand from 2 page to 4 pages. + ret2 = mremap(ptr, 2 * page_size, 4 * page_size, 0, 0); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else { + assert(ret2 == ptr); + + } +} + +void test_seal_mremap_move(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr, *newPtr; + unsigned long page_size = getpagesize(); + unsigned long size = page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + setup_single_address(size, &newPtr); + clean_single_address(newPtr, size); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_BASE); + assert(!ret); + } + + // move from ptr to fixed address. + ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, newPtr); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else { + assert(ret2 != MAP_FAILED); + + } +} + +void test_seal_mmap_overwrite_prot(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_BASE); + assert(!ret); + } + + // use mmap to change protection. + ret2 = mmap(ptr, size, PROT_NONE, + MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else + assert(ret2 == ptr); +} + +void test_seal_mmap_expand(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 12 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + // ummap last 4 pages. + ret = sys_munmap(ptr + 8 * page_size, 4 * page_size); + assert(!ret); + + if (seal) { + ret = sys_mseal(ptr, 8 * page_size, MM_SEAL_BASE); + assert(!ret); + } + + // use mmap to expand. + ret2 = mmap(ptr, size, PROT_READ, + MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else + assert(ret2 == ptr); +} + +void test_seal_mmap_shrink(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 12 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_BASE); + assert(!ret); + } + + // use mmap to shrink. + ret2 = mmap(ptr, 8 * page_size, PROT_READ, + MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else + assert(ret2 == ptr); +} + +void test_seal_mremap_shrink_fixed(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + void *newAddr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + setup_single_address(size, &newAddr); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_BASE); + assert(!ret); + } + + // mremap to move and shrink to fixed address + ret2 = mremap(ptr, size, 2 * page_size, MREMAP_MAYMOVE | MREMAP_FIXED, + newAddr); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else + assert(ret2 == newAddr); +} + +void test_seal_mremap_expand_fixed(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + void *newAddr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(page_size, &ptr); + setup_single_address(size, &newAddr); + + if (seal) { + ret = sys_mseal(newAddr, size, MM_SEAL_BASE); + assert(!ret); + } + + // mremap to move and expand to fixed address + ret2 = mremap(ptr, page_size, size, MREMAP_MAYMOVE | MREMAP_FIXED, + newAddr); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else + assert(ret2 == newAddr); +} + +void test_seal_mremap_move_fixed(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + void *newAddr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + setup_single_address(size, &newAddr); + + if (seal) { + ret = sys_mseal(newAddr, size, MM_SEAL_BASE); + assert(!ret); + } + + // mremap to move to fixed address + ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, newAddr); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else + assert(ret2 == newAddr); +} + +void test_seal_mremap_move_fixed_zero(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + void *newAddr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_BASE); + assert(!ret); + } + + /* + * MREMAP_FIXED can move the mapping to zero address + */ + ret2 = mremap(ptr, size, 2 * page_size, MREMAP_MAYMOVE | MREMAP_FIXED, + 0); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else { + assert(ret2 == 0); + + } +} + +void test_seal_mremap_move_dontunmap(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + void *newAddr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_BASE); + assert(!ret); + } + + // mremap to move, and don't unmap src addr. + ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP, 0); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else { + assert(ret2 != MAP_FAILED); + + } +} + +void test_seal_mremap_move_dontunmap_anyaddr(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + void *newAddr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_BASE); + assert(!ret); + } + + /* + * The 0xdeaddead should not have effect on dest addr + * when MREMAP_DONTUNMAP is set. + */ + ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP, + 0xdeaddead); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else { + assert(ret2 != MAP_FAILED); + assert((long)ret2 != 0xdeaddead); + + } +} + +unsigned long get_vma_size(void *addr) +{ + FILE *maps; + char line[256]; + int size = 0; + uintptr_t addr_start, addr_end; + + maps = fopen("/proc/self/maps", "r"); + if (!maps) + return 0; + + while (fgets(line, sizeof(line), maps)) { + if (sscanf(line, "%lx-%lx", &addr_start, &addr_end) == 2) { + if (addr_start == (uintptr_t) addr) { + size = addr_end - addr_start; + break; + } + } + } + fclose(maps); + return size; +} + +void test_seal_mmap_seal_base(void) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + ptr = mmap(NULL, size, PROT_READ | PROT_SEAL_BASE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + assert(ptr != (void *)-1); + + ret = sys_munmap(ptr, size); + assert(ret < 0); + + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + assert(!ret); + + ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY); + assert(!ret); + + ret = sys_mprotect(ptr, size, PROT_READ); + assert(ret < 0); +} + +void test_seal_mmap_seal_mprotect(void) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + ptr = mmap(NULL, size, PROT_READ | PROT_SEAL_PROT_PKEY, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + assert(ptr != (void *)-1); + + ret = sys_munmap(ptr, size); + assert(ret < 0); + + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + assert(ret < 0); +} + +void test_seal_mmap_seal_mseal(void) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + ptr = mmap(NULL, size, PROT_READ | PROT_SEAL_SEAL, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + assert(ptr != (void *)-1); + + ret = sys_mseal(ptr, size, MM_SEAL_BASE); + assert(ret < 0); + + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + assert(!ret); + + ret = sys_munmap(ptr, size); + assert(!ret); +} + +void test_seal_merge_and_split(void) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size; + int ret; + void *ret2; + + // (24 RO) + setup_single_address(24 * page_size, &ptr); + + // use mprotect(NONE) to set out boundary + // (1 NONE) (22 RO) (1 NONE) + ret = sys_mprotect(ptr, page_size, PROT_NONE); + assert(!ret); + ret = sys_mprotect(ptr + 23 * page_size, page_size, PROT_NONE); + assert(!ret); + size = get_vma_size(ptr + page_size); + assert(size == 22 * page_size); + + // use mseal to split from beginning + // (1 NONE) (1 RO_SBASE) (21 RO) (1 NONE) + ret = sys_mseal(ptr + page_size, page_size, MM_SEAL_BASE); + assert(!ret); + size = get_vma_size(ptr + page_size); + assert(size == page_size); + size = get_vma_size(ptr + 2 * page_size); + assert(size == 21 * page_size); + + // use mseal to split from the end. + // (1 NONE) (1 RO_SBASE) (20 RO) (1 RO_SBASE) (1 NONE) + ret = sys_mseal(ptr + 22 * page_size, page_size, MM_SEAL_BASE); + assert(!ret); + size = get_vma_size(ptr + 22 * page_size); + assert(size == page_size); + size = get_vma_size(ptr + 2 * page_size); + assert(size == 20 * page_size); + + // merge with prev. + // (1 NONE) (2 RO_SBASE) (19 RO) (1 RO_SBASE) (1 NONE) + ret = sys_mseal(ptr + 2 * page_size, page_size, MM_SEAL_BASE); + assert(!ret); + size = get_vma_size(ptr + page_size); + assert(size == 2 * page_size); + + // merge with after. + // (1 NONE) (2 RO_SBASE) (18 RO) (2 RO_SBASES) (1 NONE) + ret = sys_mseal(ptr + 21 * page_size, page_size, MM_SEAL_BASE); + assert(!ret); + size = get_vma_size(ptr + 21 * page_size); + assert(size == 2 * page_size); + + // split from prev + // (1 NONE) (1 RO_SBASE) (2RO_SPROT) (17 RO) (2 RO_SBASES) (1 NONE) + ret = sys_mseal(ptr + 2 * page_size, 2 * page_size, MM_SEAL_PROT_PKEY); + assert(!ret); + size = get_vma_size(ptr + 2 * page_size); + assert(size == 2 * page_size); + ret = sys_munmap(ptr + page_size, page_size); + assert(ret < 0); + ret = sys_mprotect(ptr + 2 * page_size, page_size, PROT_NONE); + assert(ret < 0); + + // split from next + // (1 NONE) (1 RO_SBASE) (2 RO_SPROT) (16 RO) (2 RO_SPROT) (1 RO_SBASES) (1 NONE) + ret = sys_mseal(ptr + 20 * page_size, 2 * page_size, MM_SEAL_PROT_PKEY); + assert(!ret); + size = get_vma_size(ptr + 20 * page_size); + assert(size == 2 * page_size); + + // merge from middle of prev and middle of next. + // (1 NONE) (1 RW_SBASE) (20 RO_SPROT) (1 RW_SBASES) (1 NONE) + ret = sys_mseal(ptr + 3 * page_size, 18 * page_size, MM_SEAL_PROT_PKEY); + assert(!ret); + size = get_vma_size(ptr + 2 * page_size); + assert(size == 20 * page_size); + + size = get_vma_size(ptr + 22 * page_size); + assert(size == page_size); + + size = get_vma_size(ptr + 23 * page_size); + assert(size == page_size); + + // Add split using SEAL_ALL + // (1 NONE) (1 RW_SBASE) (1 RO_SALL) (18 RO_SPROT) (1 RO_SALL) (1 RW_SBASES) (1 NONE) + ret = sys_mseal(ptr + 2 * page_size, page_size, + MM_SEAL_PROT_PKEY | MM_SEAL_DISCARD_RO_ANON); + assert(!ret); + size = get_vma_size(ptr + 2 * page_size); + assert(size == 1 * page_size); + + ret = sys_mseal(ptr + 21 * page_size, page_size, + MM_SEAL_PROT_PKEY | MM_SEAL_DISCARD_RO_ANON); + assert(!ret); + size = get_vma_size(ptr + 21 * page_size); + assert(size == 1 * page_size); + + // add a new seal type, and merge with next + // (1 NONE) (2 RO_SALL) (18 RO_SPROT) (2 RO_SALL) (1 NONE) + ret = sys_mprotect(ptr + page_size, page_size, PROT_READ); + assert(!ret); + ret = sys_mseal(ptr + page_size, page_size, MM_SEAL_PROT_PKEY); + assert(!ret); + ret = sys_mseal(ptr + page_size, page_size, MM_SEAL_DISCARD_RO_ANON); + assert(!ret); + size = get_vma_size(ptr + page_size); + assert(size == 2 * page_size); + + ret = sys_mprotect(ptr + 22 * page_size, page_size, PROT_READ); + assert(!ret); + ret = sys_mseal(ptr + 22 * page_size, page_size, MM_SEAL_PROT_PKEY); + assert(!ret); + ret = sys_mseal(ptr + 22 * page_size, page_size, MM_SEAL_DISCARD_RO_ANON); + assert(!ret); + size = get_vma_size(ptr + page_size); + assert(size == 2 * page_size); +} + +void test_seal_mmap_merge(void) +{ + LOG_TEST_ENTER(); + + void *ptr, *ptr2; + unsigned long page_size = getpagesize(); + unsigned long size; + int ret; + void *ret2; + + // (24 RO) + setup_single_address(24 * page_size, &ptr); + + // use mprotect(NONE) to set out boundary + // (1 NONE) (22 RO) (1 NONE) + ret = sys_mprotect(ptr, page_size, PROT_NONE); + assert(!ret); + ret = sys_mprotect(ptr + 23 * page_size, page_size, PROT_NONE); + assert(!ret); + size = get_vma_size(ptr + page_size); + assert(size == 22 * page_size); + + // use munmap to free 2 segment of memory. + // (1 NONE) (1 free) (20 RO) (1 free) (1 NONE) + ret = sys_munmap(ptr + page_size, page_size); + assert(!ret); + + ret = sys_munmap(ptr + 22 * page_size, page_size); + assert(!ret); + + // apply seal to the middle + // (1 NONE) (1 free) (20 RO_SBASE) (1 free) (1 NONE) + ret = sys_mseal(ptr + 2 * page_size, 20 * page_size, MM_SEAL_BASE); + assert(!ret); + size = get_vma_size(ptr + 2 * page_size); + assert(size == 20 * page_size); + + // allocate a mapping at beginning, and make sure it merges. + // (1 NONE) (21 RO_SBASE) (1 free) (1 NONE) + ptr2 = mmap(ptr + page_size, page_size, PROT_READ | PROT_SEAL_BASE, + MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + assert(ptr != (void *)-1); + size = get_vma_size(ptr + page_size); + assert(size == 21 * page_size); + + // allocate a mapping at end, and make sure it merges. + // (1 NONE) (22 RO_SBASE) (1 NONE) + ptr2 = mmap(ptr + 22 * page_size, page_size, PROT_READ | PROT_SEAL_BASE, + MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + assert(ptr != (void *)-1); + size = get_vma_size(ptr + page_size); + assert(size == 22 * page_size); +} + +static void test_not_sealable(void) +{ + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + ptr = mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + assert(ptr != (void *)-1); + + ret = sys_mseal(ptr, size, MM_SEAL_SEAL); + assert(ret < 0); +} + +static void test_merge_sealable(void) +{ + int ret; + void *ptr, *ptr2; + unsigned long page_size = getpagesize(); + unsigned long size; + + // (24 RO) + setup_single_address(24 * page_size, &ptr); + + // use mprotect(NONE) to set out boundary + // (1 NONE) (22 RO) (1 NONE) + ret = sys_mprotect(ptr, page_size, PROT_NONE); + assert(!ret); + ret = sys_mprotect(ptr + 23 * page_size, page_size, PROT_NONE); + assert(!ret); + size = get_vma_size(ptr + page_size); + assert(size == 22 * page_size); + + // (1 NONE) (RO) (4 free) (17 RO) (1 NONE) + ret = sys_munmap(ptr + 2 * page_size, 4 * page_size); + assert(!ret); + size = get_vma_size(ptr + page_size); + assert(size == 1 * page_size); + size = get_vma_size(ptr + 6 * page_size); + assert(size == 17 * page_size); + + // (1 NONE) (RO) (1 free) (2 RO) (1 free) (17 RO) (1 NONE) + ptr2 = mmap(ptr + 3 * page_size, 2 * page_size, PROT_READ, + MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0); + size = get_vma_size(ptr + 3 * page_size); + assert(size == 2 * page_size); + + // (1 NONE) (RO) (1 free) (20 RO) (1 NONE) + ptr2 = mmap(ptr + 5 * page_size, 1 * page_size, PROT_READ, + MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0); + assert(ptr2 != (void *)-1); + size = get_vma_size(ptr + 3 * page_size); + assert(size == 20 * page_size); + + // (1 NONE) (RO) (1 free) (19 RO) (1 RO_SB) (1 NONE) + ret = sys_mseal(ptr + 22 * page_size, page_size, MM_SEAL_BASE); + assert(!ret); + + // (1 NONE) (RO) (not sealable) (19 RO) (1 RO_SB) (1 NONE) + ptr2 = mmap(ptr + 2 * page_size, page_size, PROT_READ, + MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + assert(ptr2 != (void *)-1); + size = get_vma_size(ptr + page_size); + assert(size == page_size); + size = get_vma_size(ptr + 2 * page_size); + assert(size == page_size); +} + +static void test_seal_discard_ro_anon_on_rw(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address_rw_sealable(size, &ptr, seal); + assert(ptr != (void *)-1); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_DISCARD_RO_ANON); + assert(!ret); + } + + // sealing doesn't take effect on RW memory. + ret = sys_madvise(ptr, size, MADV_DONTNEED); + assert(!ret); + + // base seal still apply. + ret = sys_munmap(ptr, size); + if (seal) + assert(ret < 0); + else + assert(!ret); +} + +static void test_seal_discard_ro_anon_on_pkey(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + int pkey; + + setup_single_address_rw_sealable(size, &ptr, seal); + assert(ptr != (void *)-1); + + pkey = sys_pkey_alloc(0, 0); + assert(pkey > 0); + + ret = sys_mprotect_pkey((void *)ptr, size, PROT_READ | PROT_WRITE, pkey); + assert(!ret); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_DISCARD_RO_ANON); + assert(!ret); + } + + // sealing doesn't take effect if PKRU allow write. + set_pkey(pkey, 0); + ret = sys_madvise(ptr, size, MADV_DONTNEED); + assert(!ret); + + // sealing will take effect if PKRU deny write. + set_pkey(pkey, PKEY_DISABLE_WRITE); + ret = sys_madvise(ptr, size, MADV_DONTNEED); + if (seal) + assert(ret < 0); + else + assert(!ret); + + // base seal still apply. + ret = sys_munmap(ptr, size); + if (seal) + assert(ret < 0); + else + assert(!ret); +} + +static void test_seal_discard_ro_anon_on_filebacked(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + int fd; + unsigned long mapflags = MAP_PRIVATE; + + if (seal) + mapflags |= MAP_SEALABLE; + + fd = memfd_create("test", 0); + assert(fd > 0); + + ret = fallocate(fd, 0, 0, size); + assert(!ret); + + ptr = mmap(NULL, size, PROT_READ, mapflags, fd, 0); + assert(ptr != MAP_FAILED); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_DISCARD_RO_ANON); + assert(!ret); + } + + // sealing doesn't apply for file backed mapping. + ret = sys_madvise(ptr, size, MADV_DONTNEED); + assert(!ret); + + ret = sys_munmap(ptr, size); + if (seal) + assert(ret < 0); + else + assert(!ret); + close(fd); +} + +static void test_seal_discard_ro_anon_on_shared(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + unsigned long mapflags = MAP_ANONYMOUS | MAP_SHARED; + + if (seal) + mapflags |= MAP_SEALABLE; + + ptr = mmap(NULL, size, PROT_READ, mapflags, -1, 0); + assert(ptr != (void *)-1); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_DISCARD_RO_ANON); + assert(!ret); + } + + // sealing doesn't apply for shared mapping. + ret = sys_madvise(ptr, size, MADV_DONTNEED); + assert(!ret); + + ret = sys_munmap(ptr, size); + if (seal) + assert(ret < 0); + else + assert(!ret); +} + +static void test_seal_discard_ro_anon_invalid_shared(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + int fd; + + fd = open("/proc/self/maps", O_RDONLY); + ptr = mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, fd, 0); + assert(ptr != (void *)-1); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_DISCARD_RO_ANON); + assert(!ret); + } + + ret = sys_madvise(ptr, size, MADV_DONTNEED); + assert(!ret); + + ret = sys_munmap(ptr, size); + assert(ret < 0); + close(fd); +} + +static void test_seal_discard_ro_anon(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + if (seal) + seal_discard_ro_anon_single_address(ptr, size); + + ret = sys_madvise(ptr, size, MADV_DONTNEED); + if (seal) + assert(ret < 0); + else + assert(!ret); + + ret = sys_munmap(ptr, size); + if (seal) + assert(ret < 0); + else + assert(!ret); +} + +static void test_mmap_seal_discard_ro_anon(void) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + ptr = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_SEAL_DISCARD_RO_ANON, + MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + assert(ptr != (void *)-1); + + ret = sys_mprotect(ptr, size, PROT_READ); + assert(!ret); + + ret = sys_madvise(ptr, size, MADV_DONTNEED); + assert(ret < 0); + + ret = sys_munmap(ptr, size); + assert(ret < 0); +} + +bool seal_support(void) +{ + void *ptr; + unsigned long page_size = getpagesize(); + + ptr = mmap(NULL, page_size, PROT_READ | PROT_SEAL_BASE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + if (ptr == (void *) -1) + return false; + return true; +} + +bool pkey_supported(void) +{ + int pkey = sys_pkey_alloc(0, 0); + + if (pkey > 0) + return true; + return false; +} + +int main(int argc, char **argv) +{ + bool test_seal = seal_support(); + + if (!test_seal) { + ksft_print_msg("%s CONFIG_MSEAL might be disabled, skip test\n", __func__); + return 0; + } + + test_seal_invalid_input(); + test_seal_addseals(); + test_seal_addseals_combined(); + test_seal_addseals_reject(); + test_seal_unmapped_start(); + test_seal_unmapped_middle(); + test_seal_unmapped_end(); + test_seal_multiple_vmas(); + test_seal_split_start(); + test_seal_split_end(); + + test_seal_zero_length(); + test_seal_twice(); + + test_seal_mprotect(false); + test_seal_mprotect(true); + + test_seal_start_mprotect(false); + test_seal_start_mprotect(true); + + test_seal_end_mprotect(false); + test_seal_end_mprotect(true); + + test_seal_mprotect_unalign_len(false); + test_seal_mprotect_unalign_len(true); + + test_seal_mprotect_unalign_len_variant_2(false); + test_seal_mprotect_unalign_len_variant_2(true); + + test_seal_mprotect_two_vma(false); + test_seal_mprotect_two_vma(true); + + test_seal_mprotect_two_vma_with_split(false); + test_seal_mprotect_two_vma_with_split(true); + + test_seal_mprotect_partial_mprotect(false); + test_seal_mprotect_partial_mprotect(true); + + test_seal_mprotect_two_vma_with_gap(false); + test_seal_mprotect_two_vma_with_gap(true); + + test_seal_mprotect_merge(false); + test_seal_mprotect_merge(true); + + test_seal_mprotect_split(false); + test_seal_mprotect_split(true); + + test_seal_munmap(false); + test_seal_munmap(true); + test_seal_munmap_two_vma(false); + test_seal_munmap_two_vma(true); + test_seal_munmap_vma_with_gap(false); + test_seal_munmap_vma_with_gap(true); + + test_munmap_start_freed(false); + test_munmap_start_freed(true); + test_munmap_middle_freed(false); + test_munmap_middle_freed(true); + test_munmap_end_freed(false); + test_munmap_end_freed(true); + + test_seal_mremap_shrink(false); + test_seal_mremap_shrink(true); + test_seal_mremap_expand(false); + test_seal_mremap_expand(true); + test_seal_mremap_move(false); + test_seal_mremap_move(true); + + test_seal_mremap_shrink_fixed(false); + test_seal_mremap_shrink_fixed(true); + test_seal_mremap_expand_fixed(false); + test_seal_mremap_expand_fixed(true); + test_seal_mremap_move_fixed(false); + test_seal_mremap_move_fixed(true); + test_seal_mremap_move_dontunmap(false); + test_seal_mremap_move_dontunmap(true); + test_seal_mremap_move_fixed_zero(false); + test_seal_mremap_move_fixed_zero(true); + test_seal_mremap_move_dontunmap_anyaddr(false); + test_seal_mremap_move_dontunmap_anyaddr(true); + test_seal_discard_ro_anon(false); + test_seal_discard_ro_anon(true); + test_seal_discard_ro_anon_on_rw(false); + test_seal_discard_ro_anon_on_rw(true); + test_seal_discard_ro_anon_on_shared(false); + test_seal_discard_ro_anon_on_shared(true); + test_seal_discard_ro_anon_on_filebacked(false); + test_seal_discard_ro_anon_on_filebacked(true); + test_seal_mmap_overwrite_prot(false); + test_seal_mmap_overwrite_prot(true); + test_seal_mmap_expand(false); + test_seal_mmap_expand(true); + test_seal_mmap_shrink(false); + test_seal_mmap_shrink(true); + + test_seal_mmap_seal_base(); + test_seal_mmap_seal_mprotect(); + test_seal_mmap_seal_mseal(); + test_mmap_seal_discard_ro_anon(); + test_seal_merge_and_split(); + test_seal_mmap_merge(); + + test_not_sealable(); + test_merge_sealable(); + + if (pkey_supported()) { + test_seal_discard_ro_anon_on_pkey(false); + test_seal_discard_ro_anon_on_pkey(true); + } + + ksft_print_msg("Done\n"); + return 0; +}
I wasn't CC-ed on the patch even though I'd reviewed the earlier revision.
On 12/13/23 4:17 AM, jeffxu@chromium.org wrote:
From: Jeff Xu jeffxu@chromium.org
selftest for memory sealing change in mmap() and mseal().
Signed-off-by: Jeff Xu jeffxu@chromium.org
tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/config | 1 + tools/testing/selftests/mm/mseal_test.c | 2141 +++++++++++++++++++++++ 4 files changed, 2144 insertions(+) create mode 100644 tools/testing/selftests/mm/mseal_test.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index cdc9ce4426b9..f0f22a649985 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -43,3 +43,4 @@ mdwe_test gup_longterm mkdirty va_high_addr_switch +mseal_test diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 6a9fc5693145..0c086cecc093 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -59,6 +59,7 @@ TEST_GEN_FILES += mlock2-tests TEST_GEN_FILES += mrelease_test TEST_GEN_FILES += mremap_dontunmap TEST_GEN_FILES += mremap_test +TEST_GEN_FILES += mseal_test TEST_GEN_FILES += on-fault-limit TEST_GEN_FILES += thuge-gen TEST_GEN_FILES += transhuge-stress diff --git a/tools/testing/selftests/mm/config b/tools/testing/selftests/mm/config index be087c4bc396..cf2b8780e9b1 100644 --- a/tools/testing/selftests/mm/config +++ b/tools/testing/selftests/mm/config @@ -6,3 +6,4 @@ CONFIG_TEST_HMM=m CONFIG_GUP_TEST=y CONFIG_TRANSPARENT_HUGEPAGE=y CONFIG_MEM_SOFT_DIRTY=y +CONFIG_MSEAL=y diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c new file mode 100644 index 000000000000..0692485d8b3c --- /dev/null +++ b/tools/testing/selftests/mm/mseal_test.c @@ -0,0 +1,2141 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE +#include <sys/mman.h> +#include <stdint.h> +#include <unistd.h> +#include <string.h> +#include <sys/time.h> +#include <sys/resource.h> +#include <stdbool.h> +#include "../kselftest.h" +#include <syscall.h> +#include <errno.h> +#include <stdio.h> +#include <stdlib.h> +#include <assert.h> +#include <fcntl.h> +#include <assert.h> +#include <sys/ioctl.h> +#include <sys/vfs.h> +#include <sys/stat.h>
+/*
- need those definition for manually build using gcc.
- gcc -I ../../../../usr/include -DDEBUG -O3 -DDEBUG -O3 mseal_test.c -o mseal_test
- */
+#ifndef MM_SEAL_SEAL +#define MM_SEAL_SEAL 0x1 +#endif
+#ifndef MM_SEAL_BASE +#define MM_SEAL_BASE 0x2 +#endif
+#ifndef MM_SEAL_PROT_PKEY +#define MM_SEAL_PROT_PKEY 0x4 +#endif
+#ifndef MM_SEAL_DISCARD_RO_ANON +#define MM_SEAL_DISCARD_RO_ANON 0x8 +#endif
+#ifndef MAP_SEALABLE +#define MAP_SEALABLE 0x8000000 +#endif
+#ifndef PROT_SEAL_SEAL +#define PROT_SEAL_SEAL 0x04000000 +#endif
+#ifndef PROT_SEAL_BASE +#define PROT_SEAL_BASE 0x08000000 +#endif
+#ifndef PROT_SEAL_PROT_PKEY +#define PROT_SEAL_PROT_PKEY 0x10000000 +#endif
+#ifndef PROT_SEAL_DISCARD_RO_ANON +#define PROT_SEAL_DISCARD_RO_ANON 0x20000000 +#endif
+#ifndef PKEY_DISABLE_ACCESS +# define PKEY_DISABLE_ACCESS 0x1 +#endif
+#ifndef PKEY_DISABLE_WRITE +# define PKEY_DISABLE_WRITE 0x2 +#endif
+#ifndef PKEY_BITS_PER_KEY +#define PKEY_BITS_PER_PKEY 2 +#endif
+#ifndef PKEY_MASK +#define PKEY_MASK (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE) +#endif
+#ifndef DEBUG +#define LOG_TEST_ENTER() {} +#else +#define LOG_TEST_ENTER() {ksft_print_msg("%s\n", __func__); } +#endif
+#ifndef u64 +#define u64 unsigned long long +#endif
+/*
- define sys_xyx to call syscall directly.
- */
+static int sys_mseal(void *start, size_t len, int types) +{
- int sret;
- errno = 0;
- sret = syscall(__NR_mseal, start, len, types, 0);
- return sret;
+}
+int sys_mprotect(void *ptr, size_t size, unsigned long prot) +{
- int sret;
- errno = 0;
- sret = syscall(SYS_mprotect, ptr, size, prot);
- return sret;
+}
+int sys_mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot,
unsigned long pkey)
+{
- int sret;
- errno = 0;
- sret = syscall(__NR_pkey_mprotect, ptr, size, orig_prot, pkey);
- return sret;
+}
+int sys_munmap(void *ptr, size_t size) +{
- int sret;
- errno = 0;
- sret = syscall(SYS_munmap, ptr, size);
- return sret;
+}
+static int sys_madvise(void *start, size_t len, int types) +{
- int sret;
- errno = 0;
- sret = syscall(__NR_madvise, start, len, types);
- return sret;
+}
+int sys_pkey_alloc(unsigned long flags, unsigned long init_val) +{
- int ret = syscall(SYS_pkey_alloc, flags, init_val);
Add empty line here.
- return ret;
+}
+static inline unsigned int __read_pkey_reg(void) +{
- unsigned int eax, edx;
- unsigned int ecx = 0;
- unsigned int pkey_reg;
- asm volatile(".byte 0x0f,0x01,0xee\n\t"
: "=a" (eax), "=d" (edx)
: "c" (ecx));
- pkey_reg = eax;
- return pkey_reg;
+}
+static inline void __write_pkey_reg(u64 pkey_reg) +{
- unsigned int eax = pkey_reg;
- unsigned int ecx = 0;
- unsigned int edx = 0;
- asm volatile(".byte 0x0f,0x01,0xef\n\t"
: : "a" (eax), "c" (ecx), "d" (edx));
- assert(pkey_reg == __read_pkey_reg());
+}
+static inline unsigned long pkey_bit_position(int pkey) +{
- return pkey * PKEY_BITS_PER_PKEY;
+}
+static inline u64 set_pkey_bits(u64 reg, int pkey, u64 flags) +{
- unsigned long shift = pkey_bit_position(pkey);
- /* mask out bits from pkey in old value */
- reg &= ~((u64)PKEY_MASK << shift);
- /* OR in new bits for pkey */
- reg |= (flags & PKEY_MASK) << shift;
- return reg;
+}
+static inline void set_pkey(int pkey, unsigned long pkey_value) +{
- unsigned long mask = (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE);
- u64 new_pkey_reg;
- assert(!(pkey_value & ~mask));
- new_pkey_reg = set_pkey_bits(__read_pkey_reg(), pkey, pkey_value);
- __write_pkey_reg(new_pkey_reg);
+}
+void setup_single_address(int size, void **ptrOut) +{
- void *ptr;
- ptr = mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0);
- assert(ptr != (void *)-1);
- *ptrOut = ptr;
+}
+void setup_single_address_sealable(int size, void **ptrOut, bool sealable) +{
- void *ptr;
- unsigned long mapflags = MAP_ANONYMOUS | MAP_PRIVATE;
- if (sealable)
mapflags |= MAP_SEALABLE;
- ptr = mmap(NULL, size, PROT_READ, mapflags, -1, 0);
- assert(ptr != (void *)-1);
- *ptrOut = ptr;
+}
+void setup_single_address_rw_sealable(int size, void **ptrOut, bool sealable) +{
- void *ptr;
- unsigned long mapflags = MAP_ANONYMOUS | MAP_PRIVATE;
- if (sealable)
mapflags |= MAP_SEALABLE;
- ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, mapflags, -1, 0);
- assert(ptr != (void *)-1);
- *ptrOut = ptr;
+}
+void clean_single_address(void *ptr, int size) +{
- int ret;
- ret = munmap(ptr, size);
- assert(!ret);
+}
+void seal_mprotect_single_address(void *ptr, int size) +{
- int ret;
- ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY);
- assert(!ret);
+}
+void seal_discard_ro_anon_single_address(void *ptr, int size) +{
- int ret;
- ret = sys_mseal(ptr, size, MM_SEAL_DISCARD_RO_ANON);
- assert(!ret);
+}
+static void test_seal_addseals(void) +{
- LOG_TEST_ENTER();
- int ret;
- void *ptr;
- unsigned long page_size = getpagesize();
- unsigned long size = 4 * page_size;
- setup_single_address(size, &ptr);
- /* adding seal one by one */
- ret = sys_mseal(ptr, size, MM_SEAL_BASE);
- assert(!ret);
- ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY);
- assert(!ret);
- ret = sys_mseal(ptr, size, MM_SEAL_SEAL);
- assert(!ret);
+}
+static void test_seal_addseals_combined(void) +{
- LOG_TEST_ENTER();
- int ret;
- void *ptr;
- unsigned long page_size = getpagesize();
- unsigned long size = 4 * page_size;
- setup_single_address(size, &ptr);
- ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY);
- assert(!ret);
- /* adding multiple seals */
- ret = sys_mseal(ptr, size,
MM_SEAL_PROT_PKEY | MM_SEAL_BASE|
MM_SEAL_SEAL);
- assert(!ret);
- /* not adding more seal type, so ok. */
- ret = sys_mseal(ptr, size, MM_SEAL_BASE);
- assert(!ret);
- /* not adding more seal type, so ok. */
- ret = sys_mseal(ptr, size, MM_SEAL_SEAL);
- assert(!ret);
+}
+static void test_seal_addseals_reject(void) +{
- LOG_TEST_ENTER();
- int ret;
- void *ptr;
- unsigned long page_size = getpagesize();
- unsigned long size = 4 * page_size;
- setup_single_address(size, &ptr);
- ret = sys_mseal(ptr, size, MM_SEAL_BASE | MM_SEAL_SEAL);
- assert(!ret);
- /* MM_SEAL_SEAL is set, so not allow new seal type . */
- ret = sys_mseal(ptr, size,
MM_SEAL_PROT_PKEY | MM_SEAL_BASE | MM_SEAL_SEAL);
- assert(ret < 0);
+}
+static void test_seal_unmapped_start(void) +{
- LOG_TEST_ENTER();
- int ret;
- void *ptr;
- unsigned long page_size = getpagesize();
- unsigned long size = 4 * page_size;
- setup_single_address(size, &ptr);
- // munmap 2 pages from ptr.
Don't use different commenting styles in one file. Use /* */ for comments.
- ret = sys_munmap(ptr, 2 * page_size);
- assert(!ret);
- // mprotect will fail because 2 pages from ptr are unmapped.
- ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
- assert(ret < 0);
- // mseal will fail because 2 pages from ptr are unmapped.
- ret = sys_mseal(ptr, size, MM_SEAL_SEAL);
- assert(ret < 0);
- ret = sys_mseal(ptr + 2 * page_size, 2 * page_size, MM_SEAL_SEAL);
- assert(!ret);
+}
From: Jeff Xu jeffxu@chromium.org
Add documentation for mseal().
Signed-off-by: Jeff Xu jeffxu@chromium.org --- Documentation/userspace-api/mseal.rst | 189 ++++++++++++++++++++++++++ 1 file changed, 189 insertions(+) create mode 100644 Documentation/userspace-api/mseal.rst
diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst new file mode 100644 index 000000000000..651c618d0664 --- /dev/null +++ b/Documentation/userspace-api/mseal.rst @@ -0,0 +1,189 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +Introduction of mseal +===================== + +:Author: Jeff Xu jeffxu@chromium.org + +Modern CPUs support memory permissions such as RW and NX bits. The memory +permission feature improves security stance on memory corruption bugs, i.e. +the attacker can’t just write to arbitrary memory and point the code to it, +the memory has to be marked with X bit, or else an exception will happen. + +Memory sealing additionally protects the mapping itself against +modifications. This is useful to mitigate memory corruption issues where a +corrupted pointer is passed to a memory management system. For example, +such an attacker primitive can break control-flow integrity guarantees +since read-only memory that is supposed to be trusted can become writable +or .text pages can get remapped. Memory sealing can automatically be +applied by the runtime loader to seal .text and .rodata pages and +applications can additionally seal security critical data at runtime. + +A similar feature already exists in the XNU kernel with the +VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2]. + +User API +======== +Two system calls are involved in virtual memory sealing, ``mseal()`` and ``mmap()``. + +``mseal()`` +----------- + +The ``mseal()`` is an architecture independent syscall, and with following +signature: + +``int mseal(void addr, size_t len, unsigned long types, unsigned long flags)`` + +**addr/len**: virtual memory address range. + +The address range set by ``addr``/``len`` must meet: + - start (addr) must be in a valid VMA. + - end (addr + len) must be in a valid VMA. + - no gap (unallocated memory) between start and end. + - start (addr) must be page aligned. + +The ``len`` will be paged aligned implicitly by kernel. + +**types**: bit mask to specify the sealing types, they are: + +- The ``MM_SEAL_BASE``: Prevent VMA from: + + Unmapping, moving to another location, and shrinking the size, + via munmap() and mremap(), can leave an empty space, therefore + can be replaced with a VMA with a new set of attributes. + + Move or expand a different vma into the current location, + via mremap(). + + Modifying sealed VMA via mmap(MAP_FIXED). + + Size expansion, via mremap(), does not appear to pose any + specific risks to sealed VMAs. It is included anyway because + the use case is unclear. In any case, users can rely on + merging to expand a sealed VMA. + + We consider the MM_SEAL_BASE feature, on which other sealing + features will depend. For instance, it probably does not make sense + to seal PROT_PKEY without sealing the BASE, and the kernel will + implicitly add SEAL_BASE for SEAL_PROT_PKEY. (If the application + wants to relax this in future, we could use the “flags” field in + mseal() to overwrite this the behavior.) + +- The ``MM_SEAL_PROT_PKEY``: + + Seal PROT and PKEY of the address range, in other words, + mprotect() and pkey_mprotect() will be denied if the memory is + sealed with MM_SEAL_PROT_PKEY. + +- The ``MM_SEAL_DISCARD_RO_ANON``: + + Certain types of madvise() operations are destructive [3], such + as MADV_DONTNEED, which can effectively alter region contents by + discarding pages, especially when memory is anonymous. This blocks + such operations for anonymous memory which is not writable to the + user. + +- The ``MM_SEAL_SEAL`` + Denies adding a new seal. + +**flags**: reserved for future use. + +**return values**: + +- ``0``: + - Success. + +- ``-EINVAL``: + - Invalid seal type. + - Invalid input flags. + - Start address is not page aligned. + - Address range (``addr`` + ``len``) overflow. + +- ``-ENOMEM``: + - ``addr`` is not a valid address (not allocated). + - End address (``addr`` + ``len``) is not a valid address. + - A gap (unallocated memory) between start and end. + +- ``-EACCES``: + - ``MM_SEAL_SEAL`` is set, adding a new seal is not allowed. + - Address range is not sealable, e.g. ``MAP_SEALABLE`` is not + set during ``mmap()``. + +**Note**: + +- User can call mseal(2) multiple times to add new seal types. +- Adding an already added seal type is a no-action (no error). +- unseal() or removing a seal type is not supported. +- In case of error return, one can expect the memory range is unchanged. + +``mmap()`` +---------- +``void *mmap(void* addr, size_t length, int prot, int flags, int fd, +off_t offset);`` + +We made two changes (``prot`` and ``flags``) to ``mmap()`` related to +memory sealing. + +**prot**: + +- ``PROT_SEAL_SEAL`` +- ``PROT_SEAL_BASE`` +- ``PROT_SEAL_PROT_PKEY`` +- ``PROT_SEAL_DISCARD_RO_ANON`` + +Allow ``mmap()`` to set the sealing type when creating a mapping. This is +useful for optimization because it avoids having to make two system +calls: one for ``mmap()`` and one for ``mseal()``. + +It's worth noting that even though the sealing type is set via the +``prot`` field in ``mmap()``, we don't require it to be set in the ``prot`` +field in later ``mprotect()`` call. This is unlike the ``PROT_READ``, +``PROT_WRITE``, ``PROT_EXEC`` bits, e.g. if ``PROT_WRITE`` is not set in +``mprotect()``, it means that the region is not writable. + +**flags** +The ``MAP_SEALABLE`` flag is added to the ``flags`` field of ``mmap()``. +When present, it marks the map as sealable. A map created +without ``MAP_SEALABLE`` will not support sealing; In other words, +``mseal()`` will fail for such a map. + +Applications that don't care about sealing will expect their +behavior unchanged. For those that need sealing support, opt-in +by adding ``MAP_SEALABLE`` when creating the map. + +Use Case: +========= +- glibc: + The dymamic linker, during loading ELF executables, can apply sealing to + to non-writeable memory segments. + +- Chrome browser: protect some security sensitive data-structures. + +Additional notes: +================= +As Jann Horn pointed out in [3], there are still a few ways to write +to RO memory, which is, in a way, by design. Those are not covered by +``mseal()``. If applications want to block such cases, sandboxer +(such as seccomp, LSM, etc) might be considered. + +Those cases are: + +- Write to read-only memory through ``/proc/self/mem`` interface. + +- Write to read-only memory through ``ptrace`` (such as ``PTRACE_POKETEXT``). + +- ``userfaultfd()``. + +The idea that inspired this patch comes from Stephen Röttger’s work in V8 +CFI [4].Chrome browser in ChromeOS will be the first user of this API. + +Reference: +========== +[1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9... + +[2] https://man.openbsd.org/mimmutable.2 + +[3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfU... + +[4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgea...
On Tue, 12 Dec 2023 at 15:17, jeffxu@chromium.org wrote:
+**types**: bit mask to specify the sealing types, they are:
I really want a real-life use-case for more than one bit of "don't modify".
IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?
Or when would you *ever* say "seal this area, but mprotect()" is ok.
IOW, I want to know why we don't just do the BSD immutable thing, and why we need this multi-level sealing thing.
Linus
On Tue, Dec 12, 2023 at 4:39 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, 12 Dec 2023 at 15:17, jeffxu@chromium.org wrote:
+**types**: bit mask to specify the sealing types, they are:
I really want a real-life use-case for more than one bit of "don't modify".
For the real-life use case question, Stephen Röttger and I put description in the cover letter as well as the open discussion section (mseal() vs immutable()) of patch 0/11. Perhaps you are looking for more details in chrome usage of the API, e.g. code-wise ?
IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?
The MADV_DONTNEED is OK for file-backed mapping. As state in man page of madvise: [1]
"subsequent accesses of pages in the range will succeed, but will result in either repopulating the memory contents from the up-to-date contents of the underlying mapped file"
Or when would you *ever* say "seal this area, but mprotect()" is ok.
The fact that openBSD allows RW=>RO transaction, as in its man page [2]
" At present, mprotect(2) may reduce permissions on immutable pages marked PROT_READ | PROT_WRITE to the less permissive PROT_READ."
suggests application might desire multiple ways to seal the "PROT" bits.
E.g. Applications that wants a full lockdown of PROT and PKEY might use SEAL_PROT_PKEY (Chrome case and implemented in this patch)
Application that desires RW=>RO transaction, might implement SEAL_PROT_DOWNGRADEABLE, or specifically allow RW=>RO. (not implemented but can be added in future as extension if needed.)
IOW, I want to know why we don't just do the BSD immutable thing, and why we need this multi-level sealing thing.
The details are discussed in mseal() vs immutable()) of the cover letter (patch 0/11)
In short, BSD's immutable is designed specific for libc case, and Chrome case is just different (e.g. the lifetime of those mappings and requirement of free/discard unused memory).
Single bit vs multi-bits are still up for discussion. If there are strong opinions on the multiple-bits approach, (and no objection on applying MM_SEAL_DISCARD_RO_ANON to the .text part during libc dynamic loading, which has no effect anyway because it is file backed.), we could combine all three bits into one. A side note is that we could not add something such as SEAL_PROT_DOWNGRADEABLE later, since pkey_mprotect is sealed.
I'm open to one bit approach. If we took that approach, We might consider the following:
mseal() or mseal(flags), flags are reserved for future use.
I appreciate a direction on this.
[1] https://man7.org/linux/man-pages/man2/madvise.2.html [2] https://man.openbsd.org/mimmutable.2
-Jeff
Linus
Jeff Xu jeffxu@google.com wrote:
Or when would you *ever* say "seal this area, but mprotect()" is ok.
The fact that openBSD allows RW=>RO transaction, as in its man page [2]
" At present, mprotect(2) may reduce permissions on immutable pages marked PROT_READ | PROT_WRITE to the less permissive PROT_READ."
Let me explain this.
We encountered two places that needed this less-permission-transition.
Both of these problems were found in either .data or bss, which the kernel makes immutable by default. The OpenBSD kernel makes those regions immutable BY DEFAULT, and there is no way to turn that off.
One was in our libc malloc, which after initialization, wants to protect a control data structure from being written in the future.
The other was in chrome v8, for the v8_flags variable, this is similarily mprotected to less permission after initialization to avoid tampering (because it's an amazing relative-address located control gadget).
We introduced a different mechanism to solve these problem.
So we added a new ELF section which annotates objects you need to be MUTABLE. If these are .data or .bss, they are placed in the MUTABLE region annotated with the following Program Header:
Program Headers: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align OPENBSD_MUTABLE 0x0e9000 0x00000000000ec000 0x00000000000ec000 0x001000 0x001000 RW 0x1000
associated with this Section Header
[20] .openbsd.mutable PROGBITS 00000000000ec000 0e9000 001000 00 WA 0 0 4096
(It is vaguely similar to RELRO).
You can place objects there using the a compiler __attribute__((section declaration, like this example from our libc/malloc.c code
static union { struct malloc_readonly mopts; u_char _pad[MALLOC_PAGESIZE]; } malloc_readonly __attribute__((aligned(MALLOC_PAGESIZE))) __attribute__((section(".openbsd.mutable")));
During startup the code can set the protection and then the immutability of the object correctly.
Since we have no purpose left for this permission reduction semantic upon immutable mappings, we may be deleting that behaviour in the future. I wrote that code, because I needed it to make progress with some difficult pieces of code. But we found a better way.
On Wed, 13 Dec 2023 at 16:36, Jeff Xu jeffxu@google.com wrote:
IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?
The MADV_DONTNEED is OK for file-backed mapping.
Right. It makes no semantic difference. So there's no point to it.
My point was that you added this magic flag for "not ok for RO anon mapping".
It's such a *completely* random flag, that I go "that's just crazy random - make sealing _always_ disallow that case".
So what I object to in this series is basically random small details that should just eb part of the basic act of sealing.
I think sealing should just mean "you can't do any operations that have semantic meaning for the mapping, because it is SEALED".
So I think sealing should automatically mean "can't do MADV_DONTNEED on anon memory", because that's basically equivalent to a munmap/remap operation.
I also think that sealing should just automatically mean "can't do mprotect any more".
And yes, the OpenBSD semantics of "immutable" apparently allowed reducing permissions, but even the openbsd man-page seems to think that was a bug, so we should just not allow it. And the openbsd case seems to be because of how they made certain things immutable by default, which is different from what this mseal() thing is.
End result: I'd really like to make the thing conceptually simpler, rather than add all those random (*very* random in case of MADV_DONTNEED) special cases.
Is there any actual practical example of why you'd want a half-sealed thing?
And no, I didn't read the pdf that was attached. If it can't just be explained in plain language, it's not an explanation.
I'd love for "sealed" to be just a single bit in the vm_flags things that we already have. Not a config option. Not some complicated thing that is hard to explain. A simple "I have set up this mapping, you can't change it any more".
And if it cannot be that kind of thing, I want to have clear and obvious examples of why it can't be that simple thing.
Not a pdf file that describes some google-chrome design. Something down-to-earth and practical (and not a "we might want this in the future" thing either).
IOW, what is wrong with "THIS VMA SETUP CANNOT BE CHANGED ANY MORE"?
Nothing less, but also nothing more. No random odd bits that need explaining.
Linus
On Thu, Dec 14, 2023 at 2:31 AM Linus Torvalds torvalds@linux-foundation.org wrote:
On Wed, 13 Dec 2023 at 16:36, Jeff Xu jeffxu@google.com wrote:
IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?
The MADV_DONTNEED is OK for file-backed mapping.
Right. It makes no semantic difference. So there's no point to it.
My point was that you added this magic flag for "not ok for RO anon mapping".
It's such a *completely* random flag, that I go "that's just crazy random - make sealing _always_ disallow that case".
So what I object to in this series is basically random small details that should just eb part of the basic act of sealing.
I think sealing should just mean "you can't do any operations that have semantic meaning for the mapping, because it is SEALED".
So I think sealing should automatically mean "can't do MADV_DONTNEED on anon memory", because that's basically equivalent to a munmap/remap operation.
In Chrome, we have a use case to allow MADV_DONTNEED on sealed memory. We have a pkey-tagged heap and code region for JIT code. The regions are writable by page permissions, but we use the pkey to control write access. These regions are mmapped at process startup and we want to seal them to ensure that the pkey and page permissions can't change. Since these regions are used for dynamic allocations, we still need a way to release unneeded resources, i.e. madvise(DONTNEED) unused pages on free().
AIUI, the madvise(DONTNEED) should effectively only change the content of anonymous pages, i.e. it's similar to a memset(0) in that case. That's why we added this special case: if you want to madvise(DONTNEED) an anonymous page, you should have write permissions to the page.
In our allocator, on free we can then release resources via: * allow pkey writes * madvise(DONTNEED) * disallow pkey writes
On Thu, Dec 14, 2023 at 6:07 PM Stephen Röttger sroettger@google.com wrote:
On Thu, Dec 14, 2023 at 2:31 AM Linus Torvalds torvalds@linux-foundation.org wrote:
On Wed, 13 Dec 2023 at 16:36, Jeff Xu jeffxu@google.com wrote:
IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?
The MADV_DONTNEED is OK for file-backed mapping.
Right. It makes no semantic difference. So there's no point to it.
My point was that you added this magic flag for "not ok for RO anon mapping".
It's such a *completely* random flag, that I go "that's just crazy random - make sealing _always_ disallow that case".
So what I object to in this series is basically random small details that should just eb part of the basic act of sealing.
I think sealing should just mean "you can't do any operations that have semantic meaning for the mapping, because it is SEALED".
So I think sealing should automatically mean "can't do MADV_DONTNEED on anon memory", because that's basically equivalent to a munmap/remap operation.
In Chrome, we have a use case to allow MADV_DONTNEED on sealed memory.
I don't want to be that guy (*believe me*), but what if there was a way to attach BPF programs to mm's? Such that you could handle 'seal failures' in BPF, and thus allow for this sort of weird semantics? e.g: madvise(MADV_DONTNEED) on a sealed region fails, kernel invokes the BPF program (that chrome loaded), BPF program sees it was a MADV_DONTNEED and allows it to proceed.
It requires BPF but sounds like a good compromise in order to not get an ugly API?
On Thu, 14 Dec 2023 at 10:07, Stephen Röttger sroettger@google.com wrote:
AIUI, the madvise(DONTNEED) should effectively only change the content of anonymous pages, i.e. it's similar to a memset(0) in that case. That's why we added this special case: if you want to madvise(DONTNEED) an anonymous page, you should have write permissions to the page.
Hmm. I actually would be happier if we just made that change in general. Maybe even without sealing, but I agree that it *definitely* makes sense in general as a sealing thing.
IOW, just saying
"madvise(DONTNEED) needs write permissions to an anonymous mapping when sealed"
makes 100% sense to me. Having a separate _flag_ to give sensible semantics is just odd.
IOW, what I really want is exactly that "sensible semantics, not random flags".
Particularly for new system calls with fairly specialized use, I think it's very important that the semantics are sensible on a conceptual level, and that we do not add system calls that are based on "random implementation issue of the day".
Yes, yes, then as we have to maintain things long-term, and we hit some compatibility issue, at *THAT* point we'll end up facing nasty "we had an implementation that had these semantics in practice, so now we're stuck with it", but when introducing a new system call, let's try really hard to start off from those kinds of random things.
Wouldn't it be lovely if we can just come up with a sane set of "this is what it means to seal a vma", and enumerate those, and make those sane conceptual rules be the initial definition. By all means have a "flags" argument for future cases when we figure out there was something wrong or the notion needed to be extended, but if we already *start* with random extensions, I feel there's something wrong with the whole concept.
So I would really wish for the first version of
mseal(start, len, flags);
to have "flags=0" be the one and only case we actually handle initially, and only add a single PROT_SEAL flag to mmap() that says "create this mapping already pre-sealed".
Strive very hard to make sealing be a single VM_SEALED flag in the vma->vm_flags that we already have, just admit that none of this matters on 32-bit architectures, so that VM_SEALED can just use one of the high flags that we have several free of (and that pkeys already depends on), and make this a standard feature with no #ifdef's.
Can chrome live with that? And what would the required semantics be? I'll start the list:
- you can't unmap or remap in any way (including over-mapping)
- you can't change protections (but with architecture support like pkey, you can obviously change the protections indirectly with PKRU etc)
- you can't do VM operations that change data without the area being writable (so the DONTNEED case - maybe there are others)
- anything else?
Wouldn't it be lovely to have just a single notion of sealing that is well-documented and makes sense, and doesn't require people to worry about odd special cases?
And yes, we'd have the 'flags' argument for future special cases, and hope really hard that it's never needed.
Linus
On Thu, Dec 14, 2023 at 12:14 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, 14 Dec 2023 at 10:07, Stephen Röttger sroettger@google.com wrote:
AIUI, the madvise(DONTNEED) should effectively only change the content of anonymous pages, i.e. it's similar to a memset(0) in that case. That's why we added this special case: if you want to madvise(DONTNEED) an anonymous page, you should have write permissions to the page.
Hmm. I actually would be happier if we just made that change in general. Maybe even without sealing, but I agree that it *definitely* makes sense in general as a sealing thing.
IOW, just saying
"madvise(DONTNEED) needs write permissions to an anonymous mapping when sealed"
makes 100% sense to me. Having a separate _flag_ to give sensible semantics is just odd.
IOW, what I really want is exactly that "sensible semantics, not random flags".
Particularly for new system calls with fairly specialized use, I think it's very important that the semantics are sensible on a conceptual level, and that we do not add system calls that are based on "random implementation issue of the day".
Yes, yes, then as we have to maintain things long-term, and we hit some compatibility issue, at *THAT* point we'll end up facing nasty "we had an implementation that had these semantics in practice, so now we're stuck with it", but when introducing a new system call, let's try really hard to start off from those kinds of random things.
Wouldn't it be lovely if we can just come up with a sane set of "this is what it means to seal a vma", and enumerate those, and make those sane conceptual rules be the initial definition. By all means have a "flags" argument for future cases when we figure out there was something wrong or the notion needed to be extended, but if we already *start* with random extensions, I feel there's something wrong with the whole concept.
So I would really wish for the first version of
mseal(start, len, flags);
to have "flags=0" be the one and only case we actually handle initially, and only add a single PROT_SEAL flag to mmap() that says "create this mapping already pre-sealed".
Strive very hard to make sealing be a single VM_SEALED flag in the vma->vm_flags that we already have, just admit that none of this matters on 32-bit architectures, so that VM_SEALED can just use one of the high flags that we have several free of (and that pkeys already depends on), and make this a standard feature with no #ifdef's.
Can chrome live with that? And what would the required semantics be? I'll start the list:
you can't unmap or remap in any way (including over-mapping)
you can't change protections (but with architecture support like
pkey, you can obviously change the protections indirectly with PKRU etc)
- you can't do VM operations that change data without the area being
writable (so the DONTNEED case - maybe there are others)
- anything else?
Wouldn't it be lovely to have just a single notion of sealing that is well-documented and makes sense, and doesn't require people to worry about odd special cases?
And yes, we'd have the 'flags' argument for future special cases, and hope really hard that it's never needed.
Yes, those inputs make a lot of sense ! I will start the next version. In the meantime, if there are more comments, please continue to post, I appreciate those to make the API better and simple to use.
-Jeff
Linus
Some notes about compatibility between mimmutable(2) and mseal().
This morning, the "RW -> R demotion" code in mimmutable(2) was removed. As described previously, that was a development crutch to solved a problem but we found a better way with a new ELF section which is available at compile time with __attribute__((section(".openbsd.mutable"))). Which works great.
I am syncronizing the madvise / msync behaviour further, we will be compatible. I have worried about madvise / msync for a long time, and audited vast amounts of the software ecosystem to come to a conclusion we can be more strict, but I never acted upon it.
BTW, on OpenBSD and probably other related BSD operating systems, MADV_DONTNEED is non-destructive. However we have a destructive operation called MADV_FREE. msync() MS_INVALIDATE is also destructive. But all of these operations will now be prohibited, to syncronize the error return value situation.
There is an one large difference remainig between mimmutable() and mseal(), which is how other system calls behave.
We return EPERM for failures in all the system calls that fail upon immutable memory (since Oct 2022).
You are returning EACESS.
Before it is too late, do you want to reconsider that return value, or do you have a justification for the choice?
I think this remains the blocker which would prevent software from doing
#define mimmutable(addr, len) mseal(addr, len, 0)
On Sat, 20 Jan 2024 at 07:23, Theo de Raadt deraadt@openbsd.org wrote:
There is an one large difference remainig between mimmutable() and mseal(), which is how other system calls behave.
We return EPERM for failures in all the system calls that fail upon immutable memory (since Oct 2022).
You are returning EACESS.
Before it is too late, do you want to reconsider that return value, or do you have a justification for the choice?
I don't think there's any real reason for the difference.
Jeff - mind changing the EACESS to EPERM, and we'll have something that is more-or-less compatible between Linux and OpenBSD?
Linus
Linus Torvalds torvalds@linux-foundation.org wrote:
On Sat, 20 Jan 2024 at 07:23, Theo de Raadt deraadt@openbsd.org wrote:
There is an one large difference remainig between mimmutable() and mseal(), which is how other system calls behave.
We return EPERM for failures in all the system calls that fail upon immutable memory (since Oct 2022).
You are returning EACESS.
Before it is too late, do you want to reconsider that return value, or do you have a justification for the choice?
I don't think there's any real reason for the difference.
Jeff - mind changing the EACESS to EPERM, and we'll have something that is more-or-less compatible between Linux and OpenBSD?
(I tried to remember why I chose EPERM, replaying the view from the German castle during kernel compiles...)
In mmap, EACCESS already means something.
[EACCES] The flag PROT_READ was specified as part of the prot parameter and fd was not open for reading. The flags MAP_SHARED and PROT_WRITE were specified as part of the flags and prot parameters and fd was not open for writing.
In mprotect, the situation is similar
[EACCES] The process does not have sufficient access to the underlying memory object to provide the requested protection.
immutable isn't an aspect of the underlying object, but an aspect of the mapping.
Anyways, it is common for one errno value to have multiple causes.
But this error-aliasing can make it harder to figure things out when studying a "system call trace" a program, and I strongly believe in keeping systems are simple as possible.
For all the memory mapping control operations, EPERM was available and unambiguous.
On Sat, Jan 20, 2024 at 8:40 AM Linus Torvalds torvalds@linux-foundation.org wrote:
On Sat, 20 Jan 2024 at 07:23, Theo de Raadt deraadt@openbsd.org wrote:
There is an one large difference remainig between mimmutable() and mseal(), which is how other system calls behave.
We return EPERM for failures in all the system calls that fail upon immutable memory (since Oct 2022).
You are returning EACESS.
Before it is too late, do you want to reconsider that return value, or do you have a justification for the choice?
I don't think there's any real reason for the difference.
Jeff - mind changing the EACESS to EPERM, and we'll have something that is more-or-less compatible between Linux and OpenBSD?
Sounds Good. I will make the necessary changes in the next version.
-Jeff
Linus
Jeff Xu jeffxu@chromium.org wrote:
Jeff - mind changing the EACESS to EPERM, and we'll have something that is more-or-less compatible between Linux and OpenBSD?
Sounds Good. I will make the necessary changes in the next version.
Thanks! That is so awesome!
On the OpenBSD side, I am close to landing our madvise / msync changes.
Then we are mostly in sync.
It was on my radar for a year, but delayed because I was ponderingn blocking the destructive madvise / msync ops on regular non-writeable pages. These ops remain a page-zero gadget against regular (mutable) readonly pages, and it bothers me. I've heard rumour this has been used in a nasty way, and I think the sloppily defined semantics could use a strict modernization.
Jeff Xu jeffxu@google.com wrote:
In short, BSD's immutable is designed specific for libc case, and Chrome case is just different (e.g. the lifetime of those mappings and requirement of free/discard unused memory).
That is not true. During the mimmutable design I took the entire software ecosystem into consideration. Not just libc. That is either uncharitable or uninformed.
In OpenBSD, pretty much the only thing which calls mimmutable() is the shared library linker, which does so on all possible regions of all DSO objects, not just libc.
For example, chrome loads 96 libraries, and all their text/data/bss/etc are immutable. All the static address space is immutable. It's the same for all other programs running in OpenBSD -- only transient heap and mmap spaces remain permission mutable.
It is not just libc.
What you are trying to do here with chrome is bring some sort of soft-immutable management to regions of memory, so that trusted parts of chrome can still change the permissions, but untrusted / gadgetry parts of chrome cannot change the permissions. That's a very different thing than what I set out to do with mimmutable(). I'm not aware of any other piece of software that needs this. I still can't wrap my head around the assurance model of the design.
Maybe it is time to stop comparing mseal() to mimmutable().
Also, maybe this proposal should be using the name chromesyscall() instead -- then it could be extended indefinitely in the future...
linux-kselftest-mirror@lists.linaro.org