Re: [PATCH v2 1/4] capabilities: Add user namespace capabilities

10 Jun 2024

On Sun, Jun 09, 2024 at 03:43:34AM -0700, Jonathan Calmels wrote:
...
Attackers often rely on user namespaces to get elevated (yet confined)
privileges in order to target specific subsystems (e.g. [1]). Distributions
have been pretty adamant that they need a way to configure these; most of
them carry out-of-tree patches to do so, or plainly refuse to enable them.
As a result, there have been multiple efforts over the years to introduce
various knobs to control and/or disable user namespaces (e.g. [2][3][4]).

While we acknowledge that there are already ways to control the creation of
such namespaces (the most recent being an LSM hook), there are inherent
issues with these approaches. Preventing user namespace creation is not
fine-grained enough and, in some cases, is incompatible with various
userspace expectations (e.g. container runtimes, browser sandboxing, service
isolation).

This patch addresses these limitations by introducing an additional
capability set used to restrict the permissions granted when creating user
namespaces. This way, processes can apply the principle of least privilege
by configuring only the capabilities they need for their namespaces.

For compatibility reasons, processes always start with a full userns
capability set.

On namespace creation, the userns capability set (pU) is assigned to the
new effective (pE), permitted (pP) and bounding set (X) of the task:
pU = pE = pP = X

The userns capability set obeys the invariant that no bit can ever be set
if it is not already part of the task's bounding set. This ensures that
no namespace can ever gain more privileges than its predecessors.

Additionally, if a task is not privileged over CAP_SETPCAP, setting any bit
in the userns set requires its corresponding bit to be set in the permitted
set. This effectively mimics the inheritable set rules and means that, by
default, only root in the user namespace can regain userns capabilities
previously dropped:
p’U = (pE & CAP_SETPCAP) ? X : (X & pP)

Note that since userns capabilities are strictly hierarchical, policies can
be enforced at various levels (e.g. init, pam_cap) and inherited by every
child namespace.

Here is a sample program that can be used to verify the functionality:
/*
 * Test program that drops CAP_SYS_RAWIO from subsequent user namespaces.
 *
 * ./cap_userns_test unshare -r grep Cap /proc/self/status
 * CapInh: 0000000000000000
 * CapPrm: 000001fffffdffff
 * CapEff: 000001fffffdffff
 * CapBnd: 000001fffffdffff
 * CapAmb: 0000000000000000
 * CapUNs: 000001fffffdffff
 */
...
...
+#ifdef CONFIG_USER_NS
+	case PR_CAP_USERNS:
+		if (arg2 == PR_CAP_USERNS_CLEAR_ALL) {
+			if (arg3 | arg4 | arg5)
+				return -EINVAL;
+			new = prepare_creds();
+			if (!new)
+				return -ENOMEM;
+			cap_clear(new->cap_userns);
+			return commit_creds(new);
+		}
+		if (((!cap_valid(arg3)) | arg4 | arg5))
+			return -EINVAL;
+		if (arg2 == PR_CAP_USERNS_IS_SET)
+			return !!cap_raised(current_cred()->cap_userns, arg3);
+		if (arg2 != PR_CAP_USERNS_RAISE && arg2 != PR_CAP_USERNS_LOWER)
+			return -EINVAL;
+		if (arg2 == PR_CAP_USERNS_RAISE && !cap_uns_is_raiseable(arg3))
+			return -EPERM;
+		new = prepare_creds();
+		if (!new)
+			return -ENOMEM;
+		if (arg2 == PR_CAP_USERNS_RAISE)
+			cap_raise(new->cap_userns, arg3);
+		else
+			cap_lower(new->cap_userns, arg3);

Now, one thing that does occur to me here is that there is a
very mild form of sendmail-capabilities vulnerability that
could happen here.  Unpriv user joe can drop CAP_SYS_ADMIN
from cap_userns, then run a setuid-root program which starts
a container which expects CAP_SYS_ADMIN.  This could be a
shared container, and so joe could be breaking expected
behavior there.
I *think* we want to say we don't care about this case, but
if we did, I suppose we could say that the normal cap raise
rules on setuid should apply to cap_userns?
