On Fri, Oct 25, 2024 at 01:50:12PM +0100, Pedro Falcato wrote:
On Fri, Oct 25, 2024 at 10:41 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
It is useful to be able to utilise the pidfd mechanism to reference the current thread or process (from a userland point of view - thread group leader from the kernel's point of view).
Therefore introduce PIDFD_SELF_THREAD to refer to the current thread, and PIDFD_SELF_THREAD_GROUP to refer to the current thread group leader.
For convenience and to avoid confusion from userland's perspective we alias these:
PIDFD_SELF is an alias for PIDFD_SELF_THREAD - This is nearly always what the user will want to use, as they would find it surprising if, for instance, fds were unshare()'d and a subsequent pidfd_getfd() call then failed.
PIDFD_SELF_PROCESS is an alias for PIDFD_SELF_THREAD_GROUP - Most users have no concept of thread groups or what a thread group leader is, and from userland's perspective and nomenclature this is what userland considers to be a process.
Due to the refactoring of the central __pidfd_get_pid() function we can implement this functionality centrally, making these sentinels usable in most functionality which utilises pidfds.
We need to explicitly adjust kernel_waitid_prepare() to permit this (though it wouldn't really make sense to use this there, we provide the ability for consistency).
We explicitly disallow use of this in setns(), which would otherwise have required explicit custom handling, as it doesn't make sense to set the current calling thread to join the namespace of itself.
As the callers of pidfd_get_pid() expect an increased reference count on the pid we do so in the self case, reducing churn and avoiding any breakage from existing logic which decrements this reference count.
This change implicitly provides PIDFD_SELF_* support in the waitid(P_PIDFD, ...), process_madvise(), process_mrelease(), pidfd_send_signal(), and pidfd_getfd() system calls.
Things such as polling a pidfd and general fd operations are not supported; this strictly provides the sentinel for APIs which explicitly accept a pidfd.
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
 include/linux/pid.h        |  8 ++++--
 include/uapi/linux/pidfd.h | 15 +++++++++++
 kernel/exit.c              |  3 ++-
 kernel/nsproxy.c           |  1 +
 kernel/pid.c               | 51 ++++++++++++++++++++++++--------------
 5 files changed, 57 insertions(+), 21 deletions(-)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index d466890e1b35..3b2ac7567a88 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -78,11 +78,15 @@ struct file;
  * __pidfd_get_pid() - Retrieve a pid associated with the specified pidfd.
  * @pidfd:      The pidfd whose pid we want, or the fd of a /proc/<pid> file if
- *              @alloc_proc is also set.
+ *              @alloc_proc is also set, or PIDFD_SELF_* to refer to the current
+ *              thread or thread group leader.
  * @allow_proc: If set, then an fd of a /proc/<pid> file can be passed instead
  *              of a pidfd, and this will be used to determine the pid.
  * @flags:      Output variable, if non-NULL, then the file->f_flags of the
- *              pidfd will be set here.
+ *              pidfd will be set here or If PIDFD_SELF_THREAD is set, this is
+ *              set to PIDFD_THREAD, otherwise if PIDFD_SELF_THREAD_GROUP then
+ *              this is set to zero.
  * Returns:     If successful, the pid associated with the pidfd, otherwise an
  *              error.
diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
index 565fc0629fff..0ca2ebf906fd 100644
--- a/include/uapi/linux/pidfd.h
+++ b/include/uapi/linux/pidfd.h
@@ -29,4 +29,19 @@
 #define PIDFD_GET_USER_NAMESPACE              _IO(PIDFS_IOCTL_MAGIC, 9)
 #define PIDFD_GET_UTS_NAMESPACE               _IO(PIDFS_IOCTL_MAGIC, 10)

+/*
+ * Special sentinel values which can be used to refer to the current thread or
+ * thread group leader (which from a userland perspective is the process).
+ */
+#define PIDFD_SELF		PIDFD_SELF_THREAD
+#define PIDFD_SELF_PROCESS	PIDFD_SELF_THREAD_GROUP
+#define PIDFD_SELF_THREAD		-100 /* Current thread. */
This conflicts with AT_FDCWD, might be worth changing?
+#define PIDFD_SELF_THREAD_GROUP -200 /* Current thread group leader. */
We might want to pick some range outside of the negative errno space (-4096 IIRC), since we have plenty of values to pick from (2^31 at least).
This is entirely up to Christian, I used the values he suggested in review. But I agree we should probably find one that doesn't conflict and is outside that range.
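Both concerns raised above (the clash with AT_FDCWD and the overlap with the negative errno range) can be checked mechanically. What follows is only a sketch under stated assumptions: sentinel_is_safe() and SKETCH_MAX_ERRNO are hypothetical names, the latter mirroring the kernel's MAX_ERRNO of 4095 from include/linux/err.h, and the AT_FDCWD fallback is there only in case the libc headers do not expose it.

```c
#include <fcntl.h>    /* AT_FDCWD */
#include <stdbool.h>

/* Fallback only; Linux defines AT_FDCWD as -100. */
#ifndef AT_FDCWD
#define AT_FDCWD (-100)
#endif

/* Local mirror of the kernel's MAX_ERRNO (an assumption of this sketch). */
#define SKETCH_MAX_ERRNO 4095

/* Hypothetical helper: true if 'v' is usable as a pidfd sentinel value. */
static bool sentinel_is_safe(int v)
{
	if (v >= 0)                    /* would look like a real fd */
		return false;
	if (v == AT_FDCWD)             /* already has a meaning in *at() calls */
		return false;
	if (v >= -SKETCH_MAX_ERRNO)    /* falls inside the -errno range */
		return false;
	return true;
}
```

With these rules both values in this revision (-100 and -200) are rejected, while anything at or below -4096 passes.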
+static inline int pidfd_is_self_sentinel(pid_t pid)
+{
+	return pid == PIDFD_SELF_THREAD || pid == PIDFD_SELF_THREAD_GROUP;
+}
Do we want this in the uapi header? Even if this is useful, it might come with several drawbacks such as breaking scripts that parse kernel headers (and a quick git grep suggests we do have static inlines in headers, but in rather obscure ones) and breaking C89:
<source>:8:8: error: unknown type name 'inline'
    8 | static inline int pidfd_is_self_sentinel(pid_t pid)
:)
It doesn't really make sense to put it anywhere else I don't think.
I'm not sure 'support compilers that don't know what inline is' is a requirement for UAPI. Nor do I suspect people using such strict ansi-c89 compilers will be importing linux/pidfd.h... :)
Also:
[~/kerndev/kernels/mm/include/uapi/linux]$ ag inline | wc -l
382
I mean yeah 'obscure' or not it seems this is an acceptable thing to do :)
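On the C89 point specifically, one possible way to keep such a helper usable under -std=c89 is the GNU `__inline__` spelling, which GCC and Clang accept in every language mode. This is a sketch only, assuming a GCC-compatible compiler; the sketch_/SKETCH_ names are hypothetical so they cannot clash with the real header, and the values are the ones from this revision of the patch:

```c
#include <sys/types.h>   /* pid_t */

/* Values as posted in this revision of the patch; they may well change. */
#define SKETCH_PIDFD_SELF_THREAD        (-100)
#define SKETCH_PIDFD_SELF_THREAD_GROUP  (-200)

/*
 * '__inline__' is a GNU alternate keyword that stays available with
 * -std=c89 / -ansi, where plain 'inline' is not recognised.
 */
static __inline__ int sketch_pidfd_is_self_sentinel(pid_t pid)
{
	return pid == SKETCH_PIDFD_SELF_THREAD ||
	       pid == SKETCH_PIDFD_SELF_THREAD_GROUP;
}
```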
 #endif /* _UAPI_LINUX_PIDFD_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index 619f0014c33b..3eb20f8252ee 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -71,6 +71,7 @@
 #include <linux/user_events.h>
 #include <linux/uaccess.h>

+#include <uapi/linux/pidfd.h>
 #include <uapi/linux/wait.h>

 #include <asm/unistd.h>
@@ -1739,7 +1740,7 @@ int kernel_waitid_prepare(struct wait_opts *wo, int which, pid_t upid,
 		break;
 	case P_PIDFD:
 		type = PIDTYPE_PID;
-		if (upid < 0)
+		if (upid < 0 && !pidfd_is_self_sentinel(upid))
 			return -EINVAL;

 		pid = pidfd_get_pid(upid, &f_flags);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index dc952c3b05af..d239f7eeaa1f 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -550,6 +550,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, flags)
 	struct nsset nsset = {};
 	int err = 0;

+	/* If fd is PIDFD_SELF_*, implicitly fail here, as invalid. */
 	if (!fd_file(f))
 		return -EBADF;
diff --git a/kernel/pid.c b/kernel/pid.c
index 94c97559e5c5..8742157b36f8 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -535,33 +535,48 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
 }
 EXPORT_SYMBOL_GPL(find_ge_pid);

+static struct pid *pidfd_get_pid_self(unsigned int pidfd, unsigned int *flags)
+{
+	bool is_thread = pidfd == PIDFD_SELF_THREAD;
+	enum pid_type type = is_thread ? PIDTYPE_PID : PIDTYPE_TGID;
+	struct pid *pid = *task_pid_ptr(current, type);
+
+	/* The caller expects an elevated reference count. */
+	get_pid(pid);
It would be really really nice to avoid the get here, but I imagine it'll take some refactoring around put_pid's?
I cover this in the commit message and have addressed it on review already, but to risk repeating myself :)
Yes it'd be nice, but then you would have to make sure you _always_ unpinned correctly _everywhere_ from here on in, and it makes the behaviour different for these self modes.
You'd need to change how everyone everywhere puts and... yeah. It's not a big deal to do a useless ref inc here I don't think, eliminates a class of bug, and importantly it keeps behaviour identical to if you do a self-pidfd in the 'manual' way.
I equally dislike this aspect, but doing it this way also enables us to implement this in this one place and get self pidfd support 'for free' everywhere.
So I think RoI-wise this is a better proposition than the alternative.
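The trade-off can be illustrated with a toy userspace model. This is not kernel code: toy_pid, toy_get(), toy_put(), toy_lookup() and toy_caller() are hypothetical stand-ins for struct pid, get_pid(), put_pid(), pidfd_get_pid() and an arbitrary existing caller. Because the self path also takes a reference, the caller's unconditional put stays balanced whichever path was taken:

```c
#include <assert.h>
#include <stdbool.h>

struct toy_pid { int count; };

static struct toy_pid self_pid  = { 1 };   /* reference held by "current" */
static struct toy_pid other_pid = { 1 };   /* reference held by another task */

static struct toy_pid *toy_get(struct toy_pid *p)
{
	p->count++;
	return p;
}

static void toy_put(struct toy_pid *p)
{
	assert(p->count > 0);   /* an unbalanced put would trip this */
	p->count--;
}

/* Both the "self" and the regular path return with an elevated count. */
static struct toy_pid *toy_lookup(bool self)
{
	return toy_get(self ? &self_pid : &other_pid);
}

/* The caller puts unconditionally; it never needs to know which path ran. */
static int toy_caller(bool self)
{
	struct toy_pid *p = toy_lookup(self);
	int seen = p->count;    /* count while the caller holds its reference */

	toy_put(p);
	return seen;
}
```

Skipping toy_get() on the self path would instead force every caller to special-case its put, which is the class of bug the extra reference avoids.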
+	return pid;
+}
struct pid *__pidfd_get_pid(unsigned int pidfd, bool allow_proc, unsigned int *flags) {
struct pid *pid;
struct fd f = fdget(pidfd);
struct file *file = fd_file(f);
if (pidfd_is_self_sentinel(pidfd)) {
return pidfd_get_pid_self(pidfd, flags);
} else {
Skipping the else here might make the rest of the code more legible (since the sentinel branch returns anyway...).
This is so we can declare types for the other branch without having to figure out how to assign the struct fd sensibly.
Normally I'm a big fan of the if (!...) { return ... } guard pattern, but it's because of the 'types first' requirement of kernel code that I do this here.
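To make the two shapes concrete, here is a toy sketch (all names hypothetical: TOY_SELF stands in for the sentinel, toy_handle for struct fd, toy_open() for fdget()). With declarations kept at the top of a block, the if/else form lets the handle be declared and initialised in one place, while the guard form has to hoist an uninitialised declaration and assign it later:

```c
#define TOY_SELF (-100)              /* hypothetical sentinel value */

struct toy_handle { int fd; };       /* stand-in for struct fd */

static struct toy_handle toy_open(int fd)   /* stand-in for fdget() */
{
	struct toy_handle h = { fd };

	return h;
}

/* if/else form: the handle is declared and initialised inside the else block. */
static int toy_resolve(int fd)
{
	if (fd == TOY_SELF) {
		return 0;                /* self case: no lookup needed */
	} else {
		struct toy_handle h = toy_open(fd);

		return h.fd + 1;
	}
}

/* Guard form: the declaration is hoisted and only assigned after the guard. */
static int toy_resolve_guard(int fd)
{
	struct toy_handle h;

	if (fd == TOY_SELF)
		return 0;

	h = toy_open(fd);
	return h.fd + 1;
}
```

Both functions behave identically; the difference is purely where the declaration lives.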
-- Pedro