Currently pivot_root() doesn't work on the real rootfs because it cannot be unmounted. Userspace has to do a recursive removal of the initramfs contents manually before continuing the boot.
Really all we want from the real rootfs is to serve as the parent mount for anything that is actually useful such as the tmpfs or ramfs for initramfs unpacking or the rootfs itself. There's no need for the real rootfs to actually be anything meaningful or useful. Add an immutable rootfs that can be selected via the "immutable_rootfs" kernel command line option.
The kernel will mount a tmpfs/ramfs on top of it, unpack the initramfs and fire up userspace which mounts the rootfs and can then just do:
chdir(rootfs); pivot_root(".", "."); umount2(".", MNT_DETACH);
and be done with it. (Ofc, userspace can also choose to retain the initramfs contents by using something like pivot_root(".", "/initramfs") without unmounting it.)
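For illustration, the three-call sequence above could be wrapped roughly as follows. This is a minimal sketch, not code from this series: the helper name switch_to_rootfs() is hypothetical, and it assumes the caller is privileged and that new_root is already a mount point. glibc provides no pivot_root() wrapper, so the raw syscall is used.

```c
#define _GNU_SOURCE
#include <sys/mount.h>    /* umount2(), MNT_DETACH */
#include <sys/syscall.h>  /* SYS_pivot_root */
#include <unistd.h>       /* chdir(), syscall() */

/*
 * Hypothetical helper (not part of this series): make the filesystem
 * mounted at new_root the root mount and drop the old root.
 * Returns 0 on success, -1 with errno set on failure.
 */
int switch_to_rootfs(const char *new_root)
{
	if (chdir(new_root) < 0)
		return -1;

	/* Stack the old root on top of the new root at the same location. */
	if (syscall(SYS_pivot_root, ".", ".") < 0)
		return -1;

	/* "." now resolves to the stacked old root; lazily detach it. */
	if (umount2(".", MNT_DETACH) < 0)
		return -1;

	return 0;
}
```

Every step reports failure to the caller instead of aborting, so an init implementation can decide for itself how to recover.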
Technically this also means that the rootfs mount in unprivileged namespaces doesn't need to become MNT_LOCKED anymore as it's guaranteed that the immutable rootfs remains permanently empty so there cannot be anything revealed by unmounting the covering mount.
In the future this will also allow us to create completely empty mount namespaces without the risk of leaking anything.
systemd already handles this all correctly as it tries to pivot_root() first and falls back to MS_MOVE only when that fails.
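The try-then-fall-back pattern described above can be sketched as follows. This is not systemd's actual code: the helper name switch_root_with_fallback() is hypothetical, and the fallback shown (MS_MOVE onto "/" followed by chroot()) is the conventional switch_root sequence, assumed here for illustration.

```c
#define _GNU_SOURCE
#include <stddef.h>       /* NULL */
#include <sys/mount.h>    /* mount(), umount2(), MS_MOVE, MNT_DETACH */
#include <sys/syscall.h>  /* SYS_pivot_root */
#include <unistd.h>       /* chdir(), chroot(), syscall() */

/*
 * Hypothetical helper: prefer pivot_root() and fall back to MS_MOVE
 * only when pivot_root() fails.
 * Returns 0 on success, -1 with errno set on failure.
 */
int switch_root_with_fallback(const char *new_root)
{
	if (chdir(new_root) < 0)
		return -1;

	/* Preferred path: pivot, then lazily detach the old root. */
	if (syscall(SYS_pivot_root, ".", ".") == 0)
		return umount2(".", MNT_DETACH);

	/* Fallback path: move the new root over "/" and chroot into it. */
	if (mount(".", "/", NULL, MS_MOVE, NULL) < 0)
		return -1;
	if (chroot(".") < 0)
		return -1;

	return chdir("/");
}
```

With an immutable rootfs the first branch is expected to succeed, so the MS_MOVE path would only be taken on kernels booted without the option.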
This goes back to various discussions in previous years and an LPC 2024 presentation about this very topic.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (3):
      fs: ensure that internal tmpfs mount gets mount id zero
      fs: add init_pivot_root()
      fs: add immutable rootfs

 fs/Makefile                   |   2 +-
 fs/init.c                     |  17 ++++
 fs/internal.h                 |   1 +
 fs/mount.h                    |   1 +
 fs/namespace.c                | 181 +++++++++++++++++++++++++++++-------------
 fs/rootfs.c                   |  65 +++++++++++++++
 include/linux/init_syscalls.h |   1 +
 include/uapi/linux/magic.h    |   1 +
 init/do_mounts.c              |  13 ++-
 init/do_mounts.h              |   1 +
 10 files changed, 223 insertions(+), 60 deletions(-)
---
base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
change-id: 20260102-work-immutable-rootfs-b5f23e0f5a27
and the rootfs gets mount id one as it always has. Before we actually mount the rootfs we create an internal tmpfs mount which has mount id zero but is never exposed anywhere. Continue that "tradition".
Fixes: 7f9bfafc5f49 ("fs: use xarray for old mount id")
Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c58674a20cad..8b082b1de7f3 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -221,7 +221,7 @@ static int mnt_alloc_id(struct mount *mnt)
 	int res;

 	xa_lock(&mnt_id_xa);
-	res = __xa_alloc(&mnt_id_xa, &mnt->mnt_id, mnt, XA_LIMIT(1, INT_MAX), GFP_KERNEL);
+	res = __xa_alloc(&mnt_id_xa, &mnt->mnt_id, mnt, xa_limit_31b, GFP_KERNEL);
 	if (!res)
 		mnt->mnt_id_unique = ++mnt_id_ctr;
 	xa_unlock(&mnt_id_xa);