This change introduces a way to check if an fd points to a memfd's original open fd (the one created by memfd_create).
We encountered an issue with migrating memfds in CRIU (checkpoint restore in userspace - it migrates running processes between machines). Imagine a scenario: 1. Create a memfd. By default it's open with O_RDWR and yet one can exec() to it (unlike with regular files, where one would get ETXTBSY). 2. Reopen that memfd with O_RDWR via /proc/self/fd/<fd>.
Now those 2 fds are indistinguishable from userspace. You can't exec() to either of them (since the reopen incremented inode->i_writecount) and their /proc/self/fdinfo/ are exactly the same. Unfortunately they are not the same. If you close the second one, the first one becomes exec()able again. If you close the first one, the other doesn't become exec()able. Therefore during migration it does matter which is recreated first and which is reopened but there is no way for CRIU to tell which was first.
--- Changes since v1 at [1]: - Rewrote it from fcntl to ioctl. This was requested by filesystems maintainer.
Links: [1] https://lore.kernel.org/all/20230831203647.558079-1-mclapinski@google.com/
Michal Clapinski (2): mm/memfd: add ioctl(MEMFD_CHECK_IF_ORIGINAL) selftests: test ioctl(MEMFD_CHECK_IF_ORIGINAL)
.../userspace-api/ioctl/ioctl-number.rst | 1 + fs/hugetlbfs/inode.c | 9 ++++++ include/linux/memfd.h | 12 +++++++ mm/memfd.c | 9 ++++++ mm/shmem.c | 9 ++++++ tools/testing/selftests/memfd/memfd_test.c | 32 +++++++++++++++++++ 6 files changed, 72 insertions(+)
Add a way to check if an fd points to the memfd's original open fd (the one created by memfd_create). Useful because only the original open fd can be both writable and executable.
Signed-off-by: Michal Clapinski mclapinski@google.com --- Documentation/userspace-api/ioctl/ioctl-number.rst | 1 + fs/hugetlbfs/inode.c | 9 +++++++++ include/linux/memfd.h | 12 ++++++++++++ mm/memfd.c | 9 +++++++++ mm/shmem.c | 9 +++++++++ 5 files changed, 40 insertions(+)
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst index 4ea5b837399a..9a0782116ac2 100644 --- a/Documentation/userspace-api/ioctl/ioctl-number.rst +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst @@ -355,6 +355,7 @@ Code Seq# Include File Comments 0xB6 all linux/fpga-dfl.h 0xB7 all uapi/linux/remoteproc_cdev.h mailto:linux-remoteproc@vger.kernel.org 0xB7 all uapi/linux/nsfs.h <mailto:Andrei Vagin avagin@openvz.org> +0xB8 00 linux/memfd.h 0xC0 00-0F linux/usb/iowarrior.h 0xCA 00-0F uapi/misc/cxl.h 0xCA 10-2F uapi/misc/ocxl.h diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 316c4cebd3f3..89ff46f7ac54 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -35,6 +35,7 @@ #include <linux/magic.h> #include <linux/migrate.h> #include <linux/uio.h> +#include <linux/memfd.h>
#include <linux/uaccess.h> #include <linux/sched/mm.h> @@ -1324,6 +1325,12 @@ static void init_once(void *foo) inode_init_once(&ei->vfs_inode); }
+static long hugetlbfs_file_ioctl(struct file *file, unsigned int cmd, + unsigned long arg) +{ + return memfd_ioctl(file, cmd, arg); +} + const struct file_operations hugetlbfs_file_operations = { .read_iter = hugetlbfs_read_iter, .mmap = hugetlbfs_file_mmap, @@ -1331,6 +1338,8 @@ const struct file_operations hugetlbfs_file_operations = { .get_unmapped_area = hugetlb_get_unmapped_area, .llseek = default_llseek, .fallocate = hugetlbfs_fallocate, + .unlocked_ioctl = hugetlbfs_file_ioctl, + .compat_ioctl = hugetlbfs_file_ioctl, };
static const struct inode_operations hugetlbfs_dir_inode_operations = { diff --git a/include/linux/memfd.h b/include/linux/memfd.h index e7abf6fa4c52..50f512624c92 100644 --- a/include/linux/memfd.h +++ b/include/linux/memfd.h @@ -3,14 +3,26 @@ #define __LINUX_MEMFD_H
#include <linux/file.h> +#include <linux/ioctl.h>
#ifdef CONFIG_MEMFD_CREATE extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg); +extern long memfd_ioctl(struct file *file, unsigned int cmd, unsigned int arg); #else static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a) { return -EINVAL; } +static inline long memfd_ioctl(struct file *f, unsigned int c, unsigned int a) +{ + return -EINVAL; +} #endif
+/* + * Return 1 if the memfd is original (i.e. was created by memfd_create, + * not reopened), 0 otherwise. + */ +#define MEMFD_CHECK_IF_ORIGINAL _IOR(0xB8, 0, int) + #endif /* __LINUX_MEMFD_H */ diff --git a/mm/memfd.c b/mm/memfd.c index 1cad1904fc26..06bcb970c387 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -262,6 +262,15 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg) return error; }
+long memfd_ioctl(struct file *file, unsigned int cmd, unsigned int arg) +{ + if (cmd == MEMFD_CHECK_IF_ORIGINAL) + return (file->f_mode & FMODE_WRITE) && + !(file->f_mode & FMODE_WRITER); + + return -EINVAL; +} + #define MFD_NAME_PREFIX "memfd:" #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN) diff --git a/mm/shmem.c b/mm/shmem.c index 02e62fccc80d..347fcba15fb7 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -79,6 +79,7 @@ static struct vfsmount *shm_mnt; #include <linux/rmap.h> #include <linux/uuid.h> #include <linux/quotaops.h> +#include <linux/memfd.h>
#include <linux/uaccess.h>
@@ -4459,6 +4460,12 @@ const struct address_space_operations shmem_aops = { }; EXPORT_SYMBOL(shmem_aops);
+static long shmem_file_ioctl(struct file *file, unsigned int cmd, + unsigned long arg) +{ + return memfd_ioctl(file, cmd, arg); +} + static const struct file_operations shmem_file_operations = { .mmap = shmem_mmap, .open = shmem_file_open, @@ -4471,6 +4478,8 @@ static const struct file_operations shmem_file_operations = { .splice_read = shmem_file_splice_read, .splice_write = iter_file_splice_write, .fallocate = shmem_fallocate, + .unlocked_ioctl = shmem_file_ioctl, + .compat_ioctl = shmem_file_ioctl, #endif };
Hi Michal,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Michal-Clapinski/mm-memfd-add... base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20230908175738.41895-2-mclapinski%40google.com patch subject: [PATCH v2 1/2] mm/memfd: add ioctl(MEMFD_CHECK_IF_ORIGINAL) config: riscv-allnoconfig (https://download.01.org/0day-ci/archive/20230909/202309090301.rMwXPz1I-lkp@i...) compiler: riscv64-linux-gcc (GCC) 13.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230909/202309090301.rMwXPz1I-lkp@i...)
If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot lkp@intel.com | Closes: https://lore.kernel.org/oe-kbuild-all/202309090301.rMwXPz1I-lkp@intel.com/
All warnings (new ones prefixed by >>):
mm/shmem.c:4480:13: warning: 'shmem_file_ioctl' defined but not used [-Wunused-function]
4480 | static long shmem_file_ioctl(struct file *file, unsigned int cmd, | ^~~~~~~~~~~~~~~~
vim +/shmem_file_ioctl +4480 mm/shmem.c
4479
4480 static long shmem_file_ioctl(struct file *file, unsigned int cmd,
4481 unsigned long arg) 4482 { 4483 return memfd_ioctl(file, cmd, arg); 4484 } 4485
Signed-off-by: Michal Clapinski mclapinski@google.com --- tools/testing/selftests/memfd/memfd_test.c | 32 ++++++++++++++++++++++ 1 file changed, 32 insertions(+)
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index 3df008677239..1a702af6e01a 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -13,6 +13,7 @@ #include <stdlib.h> #include <signal.h> #include <string.h> +#include <sys/ioctl.h> #include <sys/mman.h> #include <sys/stat.h> #include <sys/syscall.h> @@ -39,6 +40,10 @@
#define MFD_NOEXEC_SEAL 0x0008U
+#ifndef MEMFD_CHECK_IF_ORIGINAL +#define MEMFD_CHECK_IF_ORIGINAL _IOR(0xB8, 0, int) +#endif + /* * Default is not to test hugetlbfs */ @@ -1567,6 +1572,31 @@ static void test_share_fork(char *banner, char *b_suffix) close(fd); }
+static void test_ioctl_check_original(void) +{ + int fd, fd2; + + printf("%s IOCTL-CHECK-ORIGINAL\n", memfd_str); + fd = sys_memfd_create("kern_memfd_check_original", 0); + if (fd < 0) { + printf("memfd_create failed: %m\n"); + abort(); + } + if (ioctl(fd, MEMFD_CHECK_IF_ORIGINAL) != 1) { + printf("ioctl(MEMFD_CHECK_IF_ORIGINAL) failed\n"); + abort(); + } + + fd2 = mfd_assert_reopen_fd(fd); + if (ioctl(fd2, MEMFD_CHECK_IF_ORIGINAL) != 0) { + printf("ioctl(MEMFD_CHECK_IF_ORIGINAL) failed\n"); + abort(); + } + + close(fd); + close(fd2); +} + int main(int argc, char **argv) { pid_t pid; @@ -1609,6 +1639,8 @@ int main(int argc, char **argv) test_share_open("SHARE-OPEN", ""); test_share_fork("SHARE-FORK", "");
+ test_ioctl_check_original(); + /* Run test-suite in a multi-threaded environment with a shared * file-table. */ pid = spawn_idle_thread(CLONE_FILES | CLONE_FS | CLONE_VM);
Michal Clapinski mclapinski@google.com writes:
This change introduces a way to check if an fd points to a memfd's original open fd (the one created by memfd_create).
We encountered an issue with migrating memfds in CRIU (checkpoint restore in userspace - it migrates running processes between machines). Imagine a scenario:
- Create a memfd. By default it's open with O_RDWR and yet one can
exec() to it (unlike with regular files, where one would get ETXTBSY). 2. Reopen that memfd with O_RDWR via /proc/self/fd/<fd>.
Now those 2 fds are indistinguishable from userspace. You can't exec() to either of them (since the reopen incremented inode->i_writecount) and their /proc/self/fdinfo/ are exactly the same. Unfortunately they are not the same. If you close the second one, the first one becomes exec()able again. If you close the first one, the other doesn't become exec()able. Therefore during migration it does matter which is recreated first and which is reopened but there is no way for CRIU to tell which was first.
So please bear with me...I'll confess that I don't fully understand the situation here, so this is probably a dumb question.
It seems like you are adding this "original open" test as a way of working around a quirk with the behavior of subsequent opens. I don't *think* that this is part of the intended, documented behavior of memfds, it's just something that happens. You're exposing an artifact of the current implementation.
Given that the two file descriptors are otherwise indistinguishable, might a better fix be to make them indistinguishable in this regard as well? Is there a good reason why the second fd doesn't become exec()able in this scenario and, if not, perhaps that behavior could be changed instead?
Thanks,
jon
On Fri, Sep 8, 2023 at 10:34 PM Jonathan Corbet corbet@lwn.net wrote:
Michal Clapinski mclapinski@google.com writes:
This change introduces a way to check if an fd points to a memfd's original open fd (the one created by memfd_create).
We encountered an issue with migrating memfds in CRIU (checkpoint restore in userspace - it migrates running processes between machines). Imagine a scenario:
- Create a memfd. By default it's open with O_RDWR and yet one can
exec() to it (unlike with regular files, where one would get ETXTBSY). 2. Reopen that memfd with O_RDWR via /proc/self/fd/<fd>.
Now those 2 fds are indistinguishable from userspace. You can't exec() to either of them (since the reopen incremented inode->i_writecount) and their /proc/self/fdinfo/ are exactly the same. Unfortunately they are not the same. If you close the second one, the first one becomes exec()able again. If you close the first one, the other doesn't become exec()able. Therefore during migration it does matter which is recreated first and which is reopened but there is no way for CRIU to tell which was first.
So please bear with me...I'll confess that I don't fully understand the situation here, so this is probably a dumb question.
It seems like you are adding this "original open" test as a way of working around a quirk with the behavior of subsequent opens. I don't *think* that this is part of the intended, documented behavior of memfds, it's just something that happens. You're exposing an artifact of the current implementation.
I don't know if the exec()ability of the original memfd was intended, let alone the non-exec()ability of subsequent opens. But otherwise yes.
Given that the two file descriptors are otherwise indistinguishable, might a better fix be to make them indistinguishable in this regard as well? Is there a good reason why the second fd doesn't become exec()able in this scenario and, if not, perhaps that behavior could be changed instead?
It probably could be changed, yes. But I'm worried that would be broadening the bug that is the exec()ability of memfds. AFAIK no other fd that is opened as writable can be exec()ed. If maintainers would prefer this, I could do this.
Michał Cłapiński mclapinski@google.com writes:
On Fri, Sep 8, 2023 at 10:34 PM Jonathan Corbet corbet@lwn.net wrote:
Given that the two file descriptors are otherwise indistinguishable, might a better fix be to make them indistinguishable in this regard as well? Is there a good reason why the second fd doesn't become exec()able in this scenario and, if not, perhaps that behavior could be changed instead?
It probably could be changed, yes. But I'm worried that would be broadening the bug that is the exec()ability of memfds. AFAIK no other fd that is opened as writable can be exec()ed. If maintainers would prefer this, I could do this.
I'm not convinced that perpetuating the behavior and adding an ioctl() workaround would be better than that; it seems to me that consistency would be better. But I don't have any real say in that matter, of course; I'm curious what others think.
Thanks,
jon
linux-kselftest-mirror@lists.linaro.org