On Mon, Feb 08, 2021 at 11:49:22AM +0100, Michal Hocko wrote:
On Mon 08-02-21 10:49:17, Mike Rapoport wrote:
From: Mike Rapoport <rppt@linux.ibm.com>
Introduce the "memfd_secret" system call, which creates memory areas that are visible only in the context of the owning process: they are mapped neither into other processes nor into the kernel page tables.
The secretmem feature is off by default and the user must explicitly enable it at boot time.
Once secretmem is enabled, the user will be able to create a file descriptor using the memfd_secret() system call. The memory areas created by mmap() calls from this file descriptor will be unmapped from the kernel direct map and will be mapped only in the page table of the owning mm.
Is this really true? I guess you meant to say that the memory will be visible only via page tables to anybody who can mmap the respective file descriptor. There is nothing like an owning mm as the fd is inherently a shareable resource and the ownership becomes a very vague and hard to define term.
Hmm, it seems I've been dragging this paragraph from the very first mmap(MAP_EXCLUSIVE) rfc and nobody (including myself) noticed the inconsistency.
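For readers following along, the flow described above (memfd_secret() to get an fd, then mmap() on it) might look roughly like the sketch below from userspace. This is not taken from the patch. The helper name secretmem_roundtrip() is made up for illustration, the fallback syscall number 447 is an assumption for headers that do not define SYS_memfd_secret, and the flags argument is simply passed as 0. On a kernel without the syscall, or with secretmem not enabled, the helper just reports failure.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: fall back to 447 when the libc headers predate the syscall. */
#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447
#endif

/*
 * Hypothetical helper, for illustration only.
 * Returns 0 if a secretmem mapping of `len` bytes could be created,
 * written and read back; returns -1 if any step failed (for example
 * because the syscall is missing or secretmem is not enabled).
 */
static int secretmem_roundtrip(size_t len)
{
	/* There is no libc wrapper assumed here, so use syscall(2). */
	int fd = (int)syscall(SYS_memfd_secret, 0);
	if (fd < 0)
		return -1;

	/* The fd starts out empty; give it a size before mapping it. */
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}

	/* Map the fd as a shared mapping; anyone holding the fd can do this. */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}

	/* Touch the memory from userspace and read it back. */
	memset(p, 0x5a, len);
	int ok = ((unsigned char *)p)[0] == 0x5a;

	munmap(p, len);
	close(fd);
	return ok ? 0 : -1;
}
```

A caller would pass a page-aligned length, e.g. secretmem_roundtrip((size_t)sysconf(_SC_PAGESIZE)), and treat -1 as "secretmem not usable here". Note that the mapping is reachable by any process that obtains the fd, which is the point Michal raises above.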
The file descriptor based memory has several advantages over the "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). It paves the way for VMMs to remove the secret memory range from the process;
I do not understand how it helps to remove the memory from the process, since the interface explicitly allows adding memory that is removed from all other processes via the direct map.
The current implementation does not help to remove the memory from the process, but using fd-backed memory seems a better interface to remove guest memory from host mappings than mmap. As Andy nicely put it:
"Getting fd-backed memory into a guest will take some possibly major work in the kernel, but getting vma-backed memory into a guest without mapping it in the host user address space seems much, much worse."
As the secret memory implementation is not an extension of tmpfs or hugetlbfs, using a dedicated system call rather than hooking new functionality into memfd_create(2) emphasises that memfd_secret(2) has different semantics and allows better upwards compatibility.
What is this supposed to mean? What are the differences?
Well, the phrasing could be better indeed. It is supposed to mean that they differ in the semantics behind the file descriptor: memfd_create implements sealing for shmem and hugetlbfs while memfd_secret implements memory hidden from the kernel.
The secretmem mappings are locked in memory so they cannot exceed RLIMIT_MEMLOCK. Since these mappings are already locked, an attempt to mlock() a secretmem range will fail and mlockall() will ignore secretmem mappings.
What about munlock?
Isn't this implied? ;-) I'll add a sentence about it.
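Since secretmem mappings count against RLIMIT_MEMLOCK, a userspace caller may want to check the limit before sizing a mapping. A minimal sketch, assuming only the standard getrlimit(2) interface; the helper name fits_memlock_limit() is hypothetical, and it compares against the soft limit only, without accounting for memory the process has already locked.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/resource.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns true if `len` bytes do not exceed the soft RLIMIT_MEMLOCK.
 * It does not subtract memory that is already locked by the process,
 * so it is an upper-bound check, not a guarantee that mmap() succeeds.
 */
static bool fits_memlock_limit(size_t len)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return false;

	/* No limit configured: any size passes this check. */
	if (rl.rlim_cur == RLIM_INFINITY)
		return true;

	return (rlim_t)len <= rl.rlim_cur;
}
```

A request that fails this check would be expected to fail at mmap() time as well, so the check is mainly useful for producing a clearer error message.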