During set*id() which cred->ucounts to charge the the current process
to is not known until after set_cred_ucounts. So move the
RLIMIT_NPROC checking into a new helper flag_nproc_exceeded and call
flag_nproc_exceeded after set_cred_ucounts.
This is very much an arbitrary subset of the places where we currently
change the RLIMIT_NPROC accounting, designed to preserve the existing
logic.
Fixing the existing logic will be the subject of another series of
changes.
Cc: stable(a)vger.kernel.org
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
---
kernel/sys.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/kernel/sys.c b/kernel/sys.c
index 8dd938a3d2bf..97dc9e5d6bf9 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -472,6 +472,16 @@ static int set_user(struct cred *new)
if (!new_user)
return -EAGAIN;
+ free_uid(new->user);
+ new->user = new_user;
+ return 0;
+}
+
+static void flag_nproc_exceeded(struct cred *new)
+{
+ if (new->ucounts == current_ucounts())
+ return;
+
/*
* We don't fail in case of NPROC limit excess here because too many
* poorly written programs don't check set*uid() return code, assuming
@@ -480,14 +490,10 @@ static int set_user(struct cred *new)
* failure to the execve() stage.
*/
if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
- new_user != INIT_USER)
+ new->user != INIT_USER)
current->flags |= PF_NPROC_EXCEEDED;
else
current->flags &= ~PF_NPROC_EXCEEDED;
-
- free_uid(new->user);
- new->user = new_user;
- return 0;
}
/*
@@ -562,6 +568,7 @@ long __sys_setreuid(uid_t ruid, uid_t euid)
if (retval < 0)
goto error;
+ flag_nproc_exceeded(new);
return commit_creds(new);
error:
@@ -624,6 +631,7 @@ long __sys_setuid(uid_t uid)
if (retval < 0)
goto error;
+ flag_nproc_exceeded(new);
return commit_creds(new);
error:
@@ -703,6 +711,7 @@ long __sys_setresuid(uid_t ruid, uid_t euid, uid_t suid)
if (retval < 0)
goto error;
+ flag_nproc_exceeded(new);
return commit_creds(new);
error:
--
2.29.2
Michal Koutný <mkoutny(a)suse.com> wrote:
> Tasks are associated to multiple users at once. Historically and as per
> setrlimit(2) RLIMIT_NPROC is enforce based on real user ID.
>
> The commit 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
> made the accounting structure "indexed" by euid and hence potentially
> account tasks differently.
>
> The effective user ID may be different e.g. for setuid programs but
> those are exec'd into already existing task (i.e. below limit), so
> different accounting is moot.
>
> Some special setresuid(2) users may notice the difference, justifying
> this fix.
I looked at cred->ucount and it is only used for rlimit operations
that were previously stored in cred->user. Making the fact
cred->ucount can refer to a different user from cred->user a bug,
affecting all uses of cred->ulimit not just RLIMIT_NPROC.
Fix set_cred_ucounts to always use the real uid not the effective uid.
Further simplify set_cred_ucounts by noticing that set_cred_ucounts
somehow retained a draft version of the check to see if alloc_ucounts
was needed that checks the new->user and new->user_ns against the
current_real_cred(). Remove that draft version of the check.
All that matters for setting the cred->ucounts are the user_ns and uid
fields in the cred.
Cc: stable(a)vger.kernel.org
Link: https://lkml.kernel.org/r/20220207121800.5079-4-mkoutny@suse.com
Reported-by: Michal Koutný <mkoutny(a)suse.com>
Reviewed-by: Michal Koutný <mkoutny(a)suse.com>
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
---
kernel/cred.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/kernel/cred.c b/kernel/cred.c
index 473d17c431f3..933155c96922 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -665,21 +665,16 @@ EXPORT_SYMBOL(cred_fscmp);
int set_cred_ucounts(struct cred *new)
{
- struct task_struct *task = current;
- const struct cred *old = task->real_cred;
struct ucounts *new_ucounts, *old_ucounts = new->ucounts;
- if (new->user == old->user && new->user_ns == old->user_ns)
- return 0;
-
/*
* This optimization is needed because alloc_ucounts() uses locks
* for table lookups.
*/
- if (old_ucounts->ns == new->user_ns && uid_eq(old_ucounts->uid, new->euid))
+ if (old_ucounts->ns == new->user_ns && uid_eq(old_ucounts->uid, new->uid))
return 0;
- if (!(new_ucounts = alloc_ucounts(new->user_ns, new->euid)))
+ if (!(new_ucounts = alloc_ucounts(new->user_ns, new->uid)))
return -EAGAIN;
new->ucounts = new_ucounts;
--
2.29.2
Michal Koutný <mkoutny(a)suse.com> wrote:
> It was reported that v5.14 behaves differently when enforcing
> RLIMIT_NPROC limit, namely, it allows one more task than previously.
> This is consequence of the commit 21d1c5e386bc ("Reimplement
> RLIMIT_NPROC on top of ucounts") that missed the sharpness of
> equality in the forking path.
This can be fixed either by fixing the test or by moving the increment
to be before the test. Fix it my moving copy_creds which contains
the increment before is_ucounts_overlimit.
In the case of CLONE_NEWUSER the ucounts in the task_cred changes.
The function is_ucounts_overlimit needs to use the final version of
the ucounts for the new process. Which means moving the
is_ucounts_overlimit test after copy_creds is necessary.
Both the test in fork and the test in set_user were semantically
changed when the code moved to ucounts. The change of the test in
fork was bad because it was before the increment. The test in
set_user was wrong and the change to ucounts fixed it. So this
fix only restores the old behavior in one lcation not two.
Link: https://lkml.kernel.org/r/20220204181144.24462-1-mkoutny@suse.com
Cc: stable(a)vger.kernel.org
Reported-by: Michal Koutný <mkoutny(a)suse.com>
Reviewed-by: Michal Koutný <mkoutny(a)suse.com>
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
---
kernel/fork.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index d75a528f7b21..17d8a8c85e3b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2021,18 +2021,18 @@ static __latent_entropy struct task_struct *copy_process(
#ifdef CONFIG_PROVE_LOCKING
DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
#endif
+ retval = copy_creds(p, clone_flags);
+ if (retval < 0)
+ goto bad_fork_free;
+
retval = -EAGAIN;
if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
if (p->real_cred->user != INIT_USER &&
!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
- goto bad_fork_free;
+ goto bad_fork_cleanup_count;
}
current->flags &= ~PF_NPROC_EXCEEDED;
- retval = copy_creds(p, clone_flags);
- if (retval < 0)
- goto bad_fork_free;
-
/*
* If multiple threads are within copy_process(), then this check
* triggers too late. This doesn't hurt, the check is only there
--
2.29.2
In [1] Michal Koutný <mkoutny(a)suse.com> reported that it does not make
sense to unconditionally exempt the INIT_USER during fork and exec
from RLIMIT_NPROC and then to impose a limit on that same user with
is_ucounts_overlimit. So I looked into why the exeception was added.
commit 909cc4ae86f3 ("[PATCH] Fix two bugs with process limits (RLIMIT_NPROC)") says:
> If a setuid process swaps it's real and effective uids and then
> forks, the fork fails if the new realuid has more processes than
> the original process was limited to. This is particularly a
> problem if a user with a process limit (e.g. 256) runs a
> setuid-root program which does setuid() + fork() (e.g. lprng) while
> root already has more than 256 process (which is quite possible).
>
> The root problem here is that a limit which should be a per-user
> limit is being implemented as a per-process limit with per-process
> (e.g. CAP_SYS_RESOURCE) controls. Being a per-user limit, it
> should be that the root-user can over-ride it, not just some
> process with CAP_SYS_RESOURCE.
>
> This patch adds a test to ignore process limits if the real user is root.
With the moving of the limits from per-user to per-user-per-user_ns it
is clear that testing a user_struct is no longer the proper test and
the test should be a test against the ucounts struct to match the
rest of the logic change that was made.
With RLIMIT_NPROC not being enforced for the global root user anywhere
else should it be enforced in is_ucounts_overlimit for a user
namespace created by the global root user?
Since this is limited to just the global root user, and RLIMIT_NPROC
is already ignored for that user I am going to vote no.
This change does imply that it becomes possible to limit all users in
a user namespace but to not enforce the rlimits on the root user or
anyone with CAP_SYS_ADMIN and CAP_SYS_RESOURCE in the user namespace.
It is not clear to me why any of those exceptions exist so I figure
we should until this is actually a problem for someone before
we relax the permission checks here.
Cc: stable(a)vger.kernel.org
Reported-by: Michal Koutný <mkoutny(a)suse.com>
[1] https://lkml.kernel.org/r/20220207121800.5079-5-mkoutny@suse.com
History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
---
fs/exec.c | 2 +-
kernel/fork.c | 2 +-
kernel/user_namespace.c | 2 ++
3 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 1e7f757cbc2c..01c8c7bae9b4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1881,7 +1881,7 @@ static int do_execveat_common(int fd, struct filename *filename,
*/
if ((current->flags & PF_NPROC_CHECK) &&
is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
- (current_user() != INIT_USER) &&
+ (current_ucounts() != &init_ucounts) &&
!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) {
retval = -EAGAIN;
goto out_ret;
diff --git a/kernel/fork.c b/kernel/fork.c
index 2b6a28a86325..6f62d37f3650 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2027,7 +2027,7 @@ static __latent_entropy struct task_struct *copy_process(
retval = -EAGAIN;
if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
- if (p->real_cred->user != INIT_USER &&
+ if ((task_ucounts(p) != &init_ucounts) &&
!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
goto bad_fork_cleanup_count;
}
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 6b2e3ca7ee99..f0c04073403d 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -123,6 +123,8 @@ int create_user_ns(struct cred *new)
ns->ucount_max[i] = INT_MAX;
}
set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
+ if (new->ucounts == &init_ucounts)
+ set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, RLIMIT_INFINITY);
set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));
--
2.29.2
Michal Koutný <mkoutny(a)suse.com> wrote:
> The check is currently against the current->cred but since those are
> going to change and we want to check RLIMIT_NPROC condition after the
> switch, supply the capability check with the new cred.
> But since we're checking new_user being INIT_USER any new cred's
> capability-based allowance may be redundant when the check fails and the
> alternative solution would be revert of the commit 2863643fb8b9
> ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")
As of commit 2863643fb8b9 ("set_user: add capability check when
rlimit(RLIMIT_NPROC) exceeds") setting the flag to see if execve
should check RLIMIT_NPROC is buggy, as it tests the capabilites from
before the credential change and not aftwards.
As of commit 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of
ucounts") examining the rlimit is buggy as cred->ucounts has not yet
been properly set in the new credential.
Make the code correct and more robust moving the test to see if
execve() needs to test RLIMIT_NPROC into commit_creds, and defer all
of the rest of the logic into execve() itself.
As the flag only indicateds that RLIMIT_NPROC should be checked
in execve rename it from PF_NPROC_EXCEEDED to PF_NPROC_CHECK.
Cc: stable(a)vger.kernel.org
Link: https://lkml.kernel.org/r/20220207121800.5079-2-mkoutny@suse.com
Link: https://lkml.kernel.org/r/20220207121800.5079-3-mkoutny@suse.com
Reported-by: Michal Koutný <mkoutny(a)suse.com>
Fixes: 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
---
fs/exec.c | 15 ++++++++-------
include/linux/sched.h | 2 +-
kernel/cred.c | 13 +++++++++----
kernel/fork.c | 2 +-
kernel/sys.c | 14 --------------
5 files changed, 19 insertions(+), 27 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..1e7f757cbc2c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1875,20 +1875,21 @@ static int do_execveat_common(int fd, struct filename *filename,
return PTR_ERR(filename);
/*
- * We move the actual failure in case of RLIMIT_NPROC excess from
- * set*uid() to execve() because too many poorly written programs
- * don't check setuid() return code. Here we additionally recheck
- * whether NPROC limit is still exceeded.
+ * After calling set*uid() is RLIMT_NPROC exceeded?
+ * This can not be checked in set*uid() because too many programs don't
+ * check the setuid() return code.
*/
- if ((current->flags & PF_NPROC_EXCEEDED) &&
- is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
+ if ((current->flags & PF_NPROC_CHECK) &&
+ is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
+ (current_user() != INIT_USER) &&
+ !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) {
retval = -EAGAIN;
goto out_ret;
}
/* We're below the limit (still or again), so we don't want to make
* further execve() calls fail. */
- current->flags &= ~PF_NPROC_EXCEEDED;
+ current->flags &= ~PF_NPROC_CHECK;
bprm = alloc_bprm(fd, filename);
if (IS_ERR(bprm)) {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75ba8aa60248..6605a262a6be 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1678,7 +1678,7 @@ extern struct pid *cad_pid;
#define PF_DUMPCORE 0x00000200 /* Dumped core */
#define PF_SIGNALED 0x00000400 /* Killed by a signal */
#define PF_MEMALLOC 0x00000800 /* Allocating memory */
-#define PF_NPROC_EXCEEDED 0x00001000 /* set_user() noticed that RLIMIT_NPROC was exceeded */
+#define PF_NPROC_CHECK 0x00001000 /* Check in execve if RLIMIT_NPROC was exceeded */
#define PF_USED_MATH 0x00002000 /* If unset the fpu must be initialized before use */
#define PF_NOFREEZE 0x00008000 /* This thread should not be frozen */
#define PF_FROZEN 0x00010000 /* Frozen for system suspend */
diff --git a/kernel/cred.c b/kernel/cred.c
index 933155c96922..229cff081167 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -490,13 +490,18 @@ int commit_creds(struct cred *new)
if (!gid_eq(new->fsgid, old->fsgid))
key_fsgid_changed(new);
- /* do it
- * RLIMIT_NPROC limits on user->processes have already been checked
- * in set_user().
+ /*
+ * Remember if the NPROC limit may be exceeded. The set*uid() functions
+ * can not fail if the NPROC limit is exceeded as too many programs
+ * don't check the return code. Instead enforce the NPROC limit for
+ * programs doing set*uid()+execve by harmlessly defering the failure
+ * to the execve() stage.
*/
alter_cred_subscribers(new, 2);
- if (new->user != old->user || new->user_ns != old->user_ns)
+ if (new->user != old->user || new->user_ns != old->user_ns) {
inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1);
+ task->flags |= PF_NPROC_CHECK;
+ }
rcu_assign_pointer(task->real_cred, new);
rcu_assign_pointer(task->cred, new);
if (new->user != old->user || new->user_ns != old->user_ns)
diff --git a/kernel/fork.c b/kernel/fork.c
index 17d8a8c85e3b..2b6a28a86325 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2031,7 +2031,7 @@ static __latent_entropy struct task_struct *copy_process(
!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
goto bad_fork_cleanup_count;
}
- current->flags &= ~PF_NPROC_EXCEEDED;
+ current->flags &= ~PF_NPROC_CHECK;
/*
* If multiple threads are within copy_process(), then this check
diff --git a/kernel/sys.c b/kernel/sys.c
index ecc4cf019242..b1ed21d79f3b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -472,20 +472,6 @@ static int set_user(struct cred *new)
if (!new_user)
return -EAGAIN;
- /*
- * We don't fail in case of NPROC limit excess here because too many
- * poorly written programs don't check set*uid() return code, assuming
- * it never fails if called by root. We may still enforce NPROC limit
- * for programs doing set*uid()+execve() by harmlessly deferring the
- * failure to the execve() stage.
- */
- if (is_ucounts_overlimit(new->ucounts, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)) &&
- new_user != INIT_USER &&
- !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
- current->flags |= PF_NPROC_EXCEEDED;
- else
- current->flags &= ~PF_NPROC_EXCEEDED;
-
free_uid(new->user);
new->user = new_user;
return 0;
--
2.29.2
Michal Koutný <mkoutny(a)suse.com> wrote:
> It was reported that v5.14 behaves differently when enforcing
> RLIMIT_NPROC limit, namely, it allows one more task than previously.
> This is consequence of the commit 21d1c5e386bc ("Reimplement
> RLIMIT_NPROC on top of ucounts") that missed the sharpness of
> equality in the forking path.
This can be fixed either by fixing the test or by moving the increment
to be before the test. Fix it my moving copy_creds which contains
the increment before is_ucounts_overlimit.
This is necessary so that in the case of CLONE_NEWUSER the new credential
with the new ucounts is available of is_ucounts_overlimit to test.
Both the test in fork and the test in set_user were semantically
changed when the code moved to ucounts. The change of the test in
fork was bad because it was before the increment. The test in
set_user was wrong and the change to ucounts fixed it. So this
fix is only restore the old behavior in one lcatio not two.
Link: https://lkml.kernel.org/r/20220204181144.24462-1-mkoutny@suse.com
Cc: stable(a)vger.kernel.org
Reported-by: Michal Koutný <mkoutny(a)suse.com>
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
---
kernel/fork.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index d75a528f7b21..17d8a8c85e3b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2021,18 +2021,18 @@ static __latent_entropy struct task_struct *copy_process(
#ifdef CONFIG_PROVE_LOCKING
DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
#endif
+ retval = copy_creds(p, clone_flags);
+ if (retval < 0)
+ goto bad_fork_free;
+
retval = -EAGAIN;
if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
if (p->real_cred->user != INIT_USER &&
!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
- goto bad_fork_free;
+ goto bad_fork_cleanup_count;
}
current->flags &= ~PF_NPROC_EXCEEDED;
- retval = copy_creds(p, clone_flags);
- if (retval < 0)
- goto bad_fork_free;
-
/*
* If multiple threads are within copy_process(), then this check
* triggers too late. This doesn't hurt, the check is only there
--
2.29.2