On Mon, Sep 15, 2025 at 03:25:20PM +0200, Jan Kara wrote:
On Fri 12-09-25 13:52:51, Christian Brauner wrote:
A while ago we added support for file handles to pidfs so pidfds can be encoded and decoded as file handles. Userspace has adopted this quickly and it's proven very useful. Implement file handles for namespaces as well.
A process is not always able to open /proc/self/ns/. That requires procfs to be mounted and for /proc/self/ or /proc/self/ns/ to not be overmounted. However, userspace can always derive a namespace fd from a pidfd. And that always works for a task's own namespace.
There's no need to introduce unnecessary behavioral differences between /proc/self/ns/ fds, pidfd-derived namespace fds, and file-handle-derived namespace fds. So namespace file handles are always decodable if the caller is located in the namespace the file handle refers to.
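The decode rule described above can be condensed into a small userspace model. This is only a sketch: `struct decode_ctx`, its fields, and `may_decode()` are hypothetical names made up for illustration, and the booleans stand in for the kernel's current_in_namespace(), ns_capable(owning_ns, CAP_SYS_ADMIN), and child_reaper checks quoted further down.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inputs; each flag stands in for one kernel-side check. */
struct decode_ctx {
	bool caller_in_ns;              /* caller lives in the target namespace */
	bool caller_has_admin_in_owner; /* CAP_SYS_ADMIN in the owning user namespace */
	bool is_pid_ns;                 /* target is a PID namespace */
	bool pid_ns_has_reaper;         /* PID namespace still has a child reaper */
};

/* Returns 0 if the handle may be decoded, -EPERM otherwise. */
static int may_decode(const struct decode_ctx *c)
{
	assert(c != NULL);

	/* Outside the namespace: privilege over the owning user namespace is required. */
	if (!c->caller_in_ns)
		return c->caller_has_admin_in_owner ? 0 : -EPERM;

	/* Inside a PID namespace without a child reaper: refuse. */
	if (c->is_pid_ns && !c->pid_ns_has_reaper)
		return -EPERM;

	/* Inside the namespace: always decodable. */
	return 0;
}
```

In this model a caller inside its own namespace needs no capability at all, which matches the behaviour of /proc/self/ns/ and pidfd-derived namespace fds.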
This also allows a task to e.g., store a set of file handles to its namespaces in a file on-disk so it can verify when it gets rexeced that they're still valid and so on. This is akin to the pidfd use-case.
Or just plainly for namespace comparison reasons where a file handle to the task's own namespace can be easily compared against others.
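For illustration, a userspace comparison could look roughly like the sketch below. The helper names `encode_handle()` and `handles_equal()` are hypothetical. The sketch assumes a kernel with this series applied (otherwise name_to_handle_at() on a namespace fd is expected to fail) and assumes that the same namespace always encodes to the same handle bytes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Encode a file handle for an already-open fd (e.g. a namespace fd).
 * Returns a malloc()ed handle the caller must free(), or NULL on failure.
 */
static struct file_handle *encode_handle(int fd)
{
	struct file_handle *fh;
	int mount_id;

	fh = calloc(1, sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh)
		return NULL;

	fh->handle_bytes = MAX_HANDLE_SZ;
	/* AT_EMPTY_PATH with "" makes the call operate on fd itself. */
	if (name_to_handle_at(fd, "", fh, &mount_id, AT_EMPTY_PATH) < 0) {
		free(fh);
		return NULL;
	}
	return fh;
}

/* Two handles match if type, length, and opaque payload all agree. */
static bool handles_equal(const struct file_handle *a,
			  const struct file_handle *b)
{
	assert(a != NULL && b != NULL);

	return a->handle_type == b->handle_type &&
	       a->handle_bytes == b->handle_bytes &&
	       memcmp(a->f_handle, b->f_handle, a->handle_bytes) == 0;
}
```

A task could encode the handle for one of its own namespace fds once, store the bytes, and later call handles_equal() against a freshly encoded handle to check whether it is still in the same namespace.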
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
...
+	switch (ns->ops->type) {
+#ifdef CONFIG_CGROUPS
+	case CLONE_NEWCGROUP:
+		if (!current_in_namespace(to_cg_ns(ns)))
+			owning_ns = to_cg_ns(ns)->user_ns;
+		break;
+#endif
+#ifdef CONFIG_IPC_NS
+	case CLONE_NEWIPC:
+		if (!current_in_namespace(to_ipc_ns(ns)))
+			owning_ns = to_ipc_ns(ns)->user_ns;
+		break;
+#endif
+	case CLONE_NEWNS:
+		if (!current_in_namespace(to_mnt_ns(ns)))
+			owning_ns = to_mnt_ns(ns)->user_ns;
+		break;
+#ifdef CONFIG_NET_NS
+	case CLONE_NEWNET:
+		if (!current_in_namespace(to_net_ns(ns)))
+			owning_ns = to_net_ns(ns)->user_ns;
+		break;
+#endif
+#ifdef CONFIG_PID_NS
+	case CLONE_NEWPID:
+		if (!current_in_namespace(to_pid_ns(ns))) {
+			owning_ns = to_pid_ns(ns)->user_ns;
+		} else if (!READ_ONCE(to_pid_ns(ns)->child_reaper)) {
+			ns->ops->put(ns);
+			return ERR_PTR(-EPERM);
+		}
+		break;
+#endif
+#ifdef CONFIG_TIME_NS
+	case CLONE_NEWTIME:
+		if (!current_in_namespace(to_time_ns(ns)))
+			owning_ns = to_time_ns(ns)->user_ns;
+		break;
+#endif
+#ifdef CONFIG_USER_NS
+	case CLONE_NEWUSER:
+		if (!current_in_namespace(to_user_ns(ns)))
+			owning_ns = to_user_ns(ns);
+		break;
+#endif
+#ifdef CONFIG_UTS_NS
+	case CLONE_NEWUTS:
+		if (!current_in_namespace(to_uts_ns(ns)))
+			owning_ns = to_uts_ns(ns)->user_ns;
+		break;
+#endif
Frankly, switches like these are asking for more _Generic usage ;) But ok for now.
+	default:
+		return ERR_PTR(-EOPNOTSUPP);
+	}
+	if (owning_ns && !ns_capable(owning_ns, CAP_SYS_ADMIN)) {
+		ns->ops->put(ns);
+		return ERR_PTR(-EPERM);
+	}
+	/* path_from_stashed() unconditionally consumes the reference. */
+	ret = path_from_stashed(&ns->stashed, nsfs_mnt, ns, &path);
+	if (ret)
+		return ERR_PTR(ret);
+	return no_free_ptr(path.dentry);
Ugh, so IMO this is very subtle because we declare
struct path path __free(path_put)
but then do no_free_ptr(path.dentry). I really had to look up the implementation of no_free_ptr() to check whether we are leaking the mnt reference here or not (we are not). But that seems like an implementation detail we'd better not rely on? Wouldn't
	return dget(path.dentry);
be much clearer (and slightly less efficient, I know, but who cares)?
Fine by me as well!
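For readers following along, the two return styles discussed above can be compared in a userspace model. This is a sketch under stated assumptions, not the kernel code: the `mock_*` types and helpers and the `steal_ptr()` macro are hypothetical stand-ins (steal_ptr() is a simplified analogue of no_free_ptr()), and it relies on the GCC/Clang cleanup attribute and statement expressions. Only reference counts are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for dentry, vfsmount and struct path. */
struct mock_dentry { int refs; };
struct mock_mount  { int refs; };
struct mock_path   { struct mock_mount *mnt; struct mock_dentry *dentry; };

static struct mock_dentry *mock_dget(struct mock_dentry *d)
{
	if (d)
		d->refs++;
	return d;
}

static void mock_dput(struct mock_dentry *d)
{
	if (d) {
		assert(d->refs > 0);
		d->refs--;
	}
}

static void mock_mntput(struct mock_mount *m)
{
	if (m) {
		assert(m->refs > 0);
		m->refs--;
	}
}

/* Scope-exit handler: drops whatever references the path still holds. */
static void mock_path_put(struct mock_path *p)
{
	mock_dput(p->dentry);
	mock_mntput(p->mnt);
}

/* Simplified analogue of no_free_ptr(): hand out the pointer, NULL the source. */
#define steal_ptr(p) ({ __typeof__(p) __val = (p); (p) = NULL; __val; })

/* Variant 1: steal the dentry; cleanup then sees dentry == NULL and only drops mnt. */
static struct mock_dentry *lookup_steal(struct mock_mount *m, struct mock_dentry *d)
{
	struct mock_path path __attribute__((cleanup(mock_path_put))) = { 0 };

	/* Pretend a lookup filled the path and took one reference on each member. */
	path.mnt = m;
	m->refs++;
	path.dentry = mock_dget(d);

	return steal_ptr(path.dentry);
}

/* Variant 2: take an extra dentry reference; cleanup drops the path's own pair. */
static struct mock_dentry *lookup_dget(struct mock_mount *m, struct mock_dentry *d)
{
	struct mock_path path __attribute__((cleanup(mock_path_put))) = { 0 };

	path.mnt = m;
	m->refs++;
	path.dentry = mock_dget(d);

	return mock_dget(path.dentry);
}
```

In this model both variants leave the caller holding exactly one new dentry reference and leave the mount count unchanged. The dget() form pays one extra increment and decrement but does not depend on the cleanup handler tolerating a NULL member.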