On 2025-07-24, Christian Brauner <brauner@kernel.org> wrote:
On Wed, Jul 23, 2025 at 09:18:53AM +1000, Aleksa Sarai wrote:
/proc has historically had very opaque semantics about PID namespaces, which is a little unfortunate for container runtimes and other programs that deal with switching namespaces very often. One common issue is that of converting between PIDs in the process's namespace and PIDs in the namespace of /proc.
In principle, it is possible to do this today by opening a pidfd with pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will contain a PID value translated to the pid namespace associated with that procfs superblock). However, allocating a new file for each PID to be converted is less than ideal for programs that may need to scan procfs, and it is generally useful for userspace to be able to finally get this information from procfs.
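For reference, the pidfd-based translation described above can be sketched in userspace roughly as follows. This is only an illustration: the helper name `pid_in_procfs_pidns` is hypothetical, it assumes the procfs of interest is mounted at /proc, and it relies on the `Pid:` field that pidfd fdinfo exposes. The fallback syscall number 434 is an assumption for headers that predate pidfd_open(2).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: generic syscall number */
#endif

/*
 * Translate `pid` (as seen by the caller) into the pid namespace of the
 * procfs mounted at /proc by opening a temporary pidfd and parsing the
 * "Pid:" line of its fdinfo. Returns the translated PID, or -1 on error.
 */
static pid_t pid_in_procfs_pidns(pid_t pid)
{
	char path[64], line[256];
	long translated = -1;
	FILE *f;
	int pidfd;

	assert(pid > 0);

	/* This is the per-PID file allocation the patch wants to avoid. */
	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;

	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);
	f = fopen(path, "r");
	if (f) {
		while (fgets(line, sizeof(line), f)) {
			/* Only the "Pid:" line matches; "NSpid:" does not. */
			if (sscanf(line, "Pid: %ld", &translated) == 1)
				break;
		}
		fclose(f);
	}
	close(pidfd);
	return (pid_t)translated;
}
```

Calling `pid_in_procfs_pidns(getpid())` through a /proc that belongs to the caller's own pid namespace should simply return the caller's PID; the interesting case is a /proc mounted from a different namespace.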
So, add a new API for this in the form of an ioctl(2) you can call on the root directory of procfs. The returned file descriptor will have O_CLOEXEC set. This acts as a sister feature to the new "pidns" mount option, finally allowing userspace full control of the pid namespaces associated with procfs instances.
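From userspace, calling the proposed ioctl might look like the sketch below. The helper name `procfs_get_pidns_fd` is hypothetical, and the ioctl number is copied from this patch and defined locally on the assumption that installed uapi headers do not carry it yet. On a kernel without the patch the ioctl is expected to fail, and the helper just reports the errno.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Values taken from the patch under review; not in released headers. */
#ifndef PROCFS_IOCTL_MAGIC
#define PROCFS_IOCTL_MAGIC 'f'
#endif
#ifndef PROCFS_GET_PID_NAMESPACE
#define PROCFS_GET_PID_NAMESPACE _IO(PROCFS_IOCTL_MAGIC, 1)
#endif

/*
 * Open the root directory of a procfs instance and ask for a handle to
 * the pid namespace it uses. Returns a namespace file descriptor on
 * success, or a negative errno value on failure.
 */
static int procfs_get_pidns_fd(const char *procfs_root)
{
	int rootfd, nsfd, saved_errno;

	assert(procfs_root != NULL);

	rootfd = open(procfs_root, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	if (rootfd < 0)
		return -errno;

	nsfd = ioctl(rootfd, PROCFS_GET_PID_NAMESPACE);
	saved_errno = errno;
	close(rootfd);

	return nsfd < 0 ? -saved_errno : nsfd;
}
```

The returned descriptor could then be passed to setns(2), which applies its own permission checks, or compared against /proc/self/ns/pid with fstat(2).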
The permission model for this is a bit looser than that of the "pidns" mount option, but this is mainly because /proc/1/ns/pid provides the same information, so as long as you have access to that magic-link (or something equivalently reasonable such as privileges with CAP_SYS_ADMIN or being in an ancestor pid namespace) it makes sense to allow userspace to grab a handle. setns(2) will still have its own permission checks, so being able to open a pidns handle doesn't really provide too many other capabilities.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
 Documentation/filesystems/proc.rst |  4 +++
 fs/proc/root.c                     | 54 ++++++++++++++++++++++++++++++++++++--
 include/uapi/linux/fs.h            |  3 +++
 3 files changed, 59 insertions(+), 2 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index c520b9f8a3fd..506383273c9d 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2398,6 +2398,10 @@ pidns= specifies a pid namespace (either as a string path to something like
 will be used by the procfs instance when translating pids. By default, procfs
 will use the calling process's active pid namespace.
+Processes can check which pid namespace is used by a procfs instance by using
+the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs
+instance.
 Chapter 5: Filesystem behavior
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 057c8a125c6e..548a57ec2152 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -23,8 +23,10 @@
 #include <linux/cred.h>
 #include <linux/magic.h>
 #include <linux/slab.h>
+#include <linux/ptrace.h>
 #include "internal.h"
+#include "../internal.h"
 struct proc_fs_context {
 	struct pid_namespace	*pid_ns;
@@ -418,15 +420,63 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
 	return proc_pid_readdir(file, ctx);
 }
+static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+	switch (cmd) {
+#ifdef CONFIG_PID_NS
+	case PROCFS_GET_PID_NAMESPACE: {
+		struct pid_namespace *active = task_active_pid_ns(current);
+		struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb);
+		bool can_access_pidns = false;
+		/*
+		 * If we are in an ancestor of the pidns, or have join
+		 * privileges (CAP_SYS_ADMIN), then it makes sense that we
+		 * would be able to grab a handle to the pidns.
+		 *
+		 * Otherwise, if there is a root process, then being able to
+		 * access /proc/$pid/ns/pid is equivalent to this ioctl and so
+		 * we should probably match the permission model. For empty
+		 * namespaces it seems unlikely for there to be a downside to
+		 * allowing unprivileged users to open a handle to it (setns
+		 * will fail for unprivileged users anyway).
+		 */
+		can_access_pidns = pidns_is_ancestor(ns, active) ||
+				   ns_capable(ns->user_ns, CAP_SYS_ADMIN);
This seems to imply that if @ns is a descendant of @active, the caller holds privileges over it. Is that actually always true?
IOW, why is the check different from the previous pidns= mount option check? I would've expected:
ns_capable(_no_audit)(ns->user_ns) && pidns_is_ancestor(ns, active)
and then the ptrace check as a fallback.
That would mirror pidns_install(), and I did think about it. The primary (mostly handwave-y) reasoning I had for making it less strict was that:
* If you are in an ancestor pidns, then you can already see those processes in your own /proc. In theory that means that you will be able to access /proc/$pid/ns/pid for at least some subprocess there (even if some subprocesses have SUID_DUMP_DISABLE, that flag is cleared on ).
Though hypothetically if they are all running as a different user, this does not apply (and you could create scenarios where a child pidns is owned by a userns that you do not have privileges over -- if you deal with setuid binaries). Maybe that risk means we should just combine them, I'm not sure.
* If you have CAP_SYS_ADMIN permissions over the pidns, it seems strange to disallow access even if it is not in an ancestor namespace. This is distinct from pidns_install(), where you want to ensure you cannot escape to a parent pid namespace; this ioctl is about getting a handle to do other operations (i.e. NS_GET_{P,TG}ID_*_PIDNS).
Maybe they should be combined to match pidns_install(), but then I would expect the ptrace_may_access() check to apply to all processes in the pidns to make it less restrictive, which is not something you can practically do (and there is a higher chance that pid1 will have SUID_DUMP_DISABLE than some random subprocess, which almost certainly will not be SUID_DUMP_DISABLE).
Fundamentally, I guess I'm still trying to see what the risk is of allowing a process to get a handle to a pidns that they have some kind of privilege over (whether it's CAP_SYS_ADMIN, or by the virtue of being able to see and address all processes in the namespace, or by being able to open /proc/$pidns_pid1/ns/pid anyway) but cannot join.
Then again, maybe the fact that it is kind of strange to explain is enough of a reason to just make it simpler...
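To make the difference between the two permission models concrete, here is a small standalone model of both checks. It is not kernel code: the struct and function names are hypothetical, each boolean stands in for the kernel predicate named in its comment, and the posted check is modelled according to the intent described in the patch's code comment. It also assumes the suggested variant keeps the same ptrace fallback (including allowing access to namespaces with no pid1).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel predicates used in the patch. */
struct pidns_access {
	bool in_ancestor;     /* pidns_is_ancestor(ns, active) */
	bool capable;         /* ns_capable(ns->user_ns, CAP_SYS_ADMIN) */
	bool has_pid1;        /* ns->child_reaper != NULL */
	bool can_ptrace_pid1; /* ptrace_may_access(pid1, PTRACE_MODE_READ_FSCREDS) */
};

/* Fallback shared by both variants: empty namespaces are allowed. */
static bool ptrace_fallback(const struct pidns_access *a)
{
	return !a->has_pid1 || a->can_ptrace_pid1;
}

/* As posted: ancestor OR capable, then the ptrace fallback. */
static bool check_as_posted(const struct pidns_access *a)
{
	return a->in_ancestor || a->capable || ptrace_fallback(a);
}

/* As suggested in review: capable AND ancestor, then the fallback. */
static bool check_as_suggested(const struct pidns_access *a)
{
	return (a->capable && a->in_ancestor) || ptrace_fallback(a);
}

/* Enumerate all 16 input combinations and count where the two differ. */
static int count_disagreements(void)
{
	int n = 0;

	for (unsigned int bits = 0; bits < 16; bits++) {
		struct pidns_access a = {
			.in_ancestor = bits & 1,
			.capable = bits & 2,
			.has_pid1 = bits & 4,
			.can_ptrace_pid1 = bits & 8,
		};

		/* The suggested check is never more permissive. */
		assert(!check_as_suggested(&a) || check_as_posted(&a));
		if (check_as_posted(&a) != check_as_suggested(&a))
			n++;
	}
	return n;
}
```

Under these assumptions the two models only disagree when pid1 exists, cannot be ptraced, and exactly one of "ancestor" or "capable" holds, which are the two scenarios discussed in the bullet points above.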
+		if (!can_access_pidns) {
+			bool cannot_ptrace_pid1 = false;
+			read_lock(&tasklist_lock);
+			if (ns->child_reaper)
+				cannot_ptrace_pid1 = !ptrace_may_access(ns->child_reaper,
+									PTRACE_MODE_READ_FSCREDS);
+			read_unlock(&tasklist_lock);
+			can_access_pidns = !cannot_ptrace_pid1;
+		}
+		if (!can_access_pidns)
+			return -EPERM;
+		/* open_namespace() unconditionally consumes the reference. */
+		get_pid_ns(ns);
+		return open_namespace(to_ns_common(ns));
+	}
+#endif /* CONFIG_PID_NS */
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
 /*
  * The root /proc directory is special, as it has the
  * <pid> directories. Thus we don't use the generic
  * directory handling functions for that..
  */
 static const struct file_operations proc_root_operations = {
-	.read		 = generic_read_dir,
-	.iterate_shared	 = proc_root_readdir,
+	.read		= generic_read_dir,
+	.iterate_shared	= proc_root_readdir,
 	.llseek		= generic_file_llseek,
+	.unlocked_ioctl = proc_root_ioctl,
+	.compat_ioctl   = compat_ptr_ioctl,
 };
 /*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 0bd678a4a10e..aa642cb48feb 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -437,6 +437,9 @@ typedef int __bitwise __kernel_rwf_t;
 #define PROCFS_IOCTL_MAGIC 'f'
+/* procfs root ioctls */
+#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 1)
 /* Pagemap ioctl */
 #define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
-- 
2.50.0