This patch series introduces a new user namespace capability set, as well as some plumbing around it (i.e. sysctl, secbit, lsm support).
First patch goes over the motivations for this as well as prior art.
In summary, while user namespaces are a great success today in that they avoid running a lot of code as root, they also expand the attack surface of the kernel substantially which is often abused by attackers. Methods exist to limit the creation of such namespaces [1], however, application developers often need to assume that user namespaces are available for various tasks such as sandboxing. Thus, instead of restricting the creation of user namespaces, we offer ways for userspace to limit the capabilities granted to them.
Why a new capability set and not something specific to the userns (e.g. ioctl_ns)?
1. We can't really expect userspace to patch every single callsite and opt-in this new security mechanism.
2. We don't necessarily want policies enforced at said callsites. For example a service like systemd-machined or a PAM session need to be able to place restrictions on any namespace spawned under it.
3. We would need to come up with inheritance rules, querying capabilities, etc. At this point we're just reinventing capability sets.
4. We can easily define interactions between capability sets, thus helping with adoption (patch 2 is an example of this)
Some examples of how this could be leveraged in userspace:
- Prevent user from getting CAP_NET_ADMIN in user namespaces under SSH: echo "auth optional pam_cap.so" >> /etc/pam.d/sshd echo "!cap_net_admin $USER" >> /etc/security/capability.conf capsh --secbits=$((1 << 8)) -- -c /usr/sbin/sshd
- Prevent containers from ever getting CAP_DAC_OVERRIDE: systemd-run -p CapabilityBoundingSet=~CAP_DAC_OVERRIDE \ -p SecureBits=userns-strict-caps \ /usr/bin/dockerd systemd-run -p UserNSCapabilities=~CAP_DAC_OVERRIDE \ /usr/bin/incusd
- Kernel could be vulnerable to CAP_SYS_RAWIO exploits, prevent it: sysctl -w cap_bound_userns_mask=0x1fffffdffff
- Drop CAP_SYS_ADMIN for this shell and all the user namespaces below it: bwrap --unshare-user --cap-drop CAP_SYS_ADMIN /bin/sh
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--- Changes since v1: - Add documentation - Change commit wording - Cleanup various aspects of the code based on feedback - Add new CAP_SYS_CONTROL capability for sysctl check - Add BPF-LSM support for modifying userns capabilities --- Jonathan Calmels (4): capabilities: Add user namespace capabilities capabilities: Add securebit to restrict userns caps capabilities: Add sysctl to mask off userns caps bpf,lsm: Allow editing capabilities in BPF-LSM hooks
Documentation/filesystems/proc.rst | 1 + Documentation/security/credentials.rst | 6 ++ fs/proc/array.c | 9 +++ include/linux/cred.h | 3 + include/linux/lsm_hook_defs.h | 2 +- include/linux/securebits.h | 1 + include/linux/security.h | 4 +- include/linux/user_namespace.h | 7 ++ include/uapi/linux/capability.h | 6 +- include/uapi/linux/prctl.h | 7 ++ include/uapi/linux/securebits.h | 11 ++- kernel/bpf/bpf_lsm.c | 55 +++++++++++++ kernel/cred.c | 3 + kernel/sysctl.c | 10 +++ kernel/umh.c | 15 ++++ kernel/user_namespace.c | 80 +++++++++++++++++-- security/apparmor/lsm.c | 2 +- security/commoncap.c | 62 +++++++++++++- security/keys/process_keys.c | 3 + security/security.c | 6 +- security/selinux/hooks.c | 2 +- security/selinux/include/classmap.h | 5 +- .../selftests/bpf/prog_tests/deny_namespace.c | 12 ++- .../selftests/bpf/progs/test_deny_namespace.c | 7 +- 24 files changed, 291 insertions(+), 28 deletions(-)
Attackers often rely on user namespaces to get elevated (yet confined) privileges in order to target specific subsystems (e.g. [1]). Distributions have been pretty adamant that they need a way to configure these, most of them carry out-of-tree patches to do so, or plainly refuse to enable them. As a result, there have been multiple efforts over the years to introduce various knobs to control and/or disable user namespaces (e.g. [2][3][4]).
While we acknowledge that there are already ways to control the creation of such namespaces (the most recent being a LSM hook), there are inherent issues with these approaches. Preventing the user namespace creation is not fine-grained enough, and in some cases, incompatible with various userspace expectations (e.g. container runtimes, browser sandboxing, service isolation)
This patch addresses these limitations by introducing an additional capability set used to restrict the permissions granted when creating user namespaces. This way, processes can apply the principle of least privilege by configuring only the capabilities they need for their namespaces.
For compatibility reasons, processes always start with a full userns capability set.
On namespace creation, the userns capability set (pU) is assigned to the new effective (pE), permitted (pP) and bounding set (X) of the task:
pU = pE = pP = X
The userns capability set obeys the invariant that no bit can ever be set if it is not already part of the task’s bounding set. This ensures that no namespace can ever gain more privileges than its predecessors. Additionally, if a task is not privileged over CAP_SETPCAP, setting any bit in the userns set requires its corresponding bit to be set in the permitted set. This effectively mimics the inheritable set rules and means that, by default, only root in the user namespace can regain userns capabilities previously dropped:
p’U = (pE & CAP_SETPCAP) ? X : (X & pP)
Note that since userns capabilities are strictly hierarchical, policies can be enforced at various levels (e.g. init, pam_cap) and inherited by every child namespace.
Here is a sample program that can be used to verify the functionality:
/* * Test program that drops CAP_SYS_RAWIO from subsequent user namespaces. * * ./cap_userns_test unshare -r grep Cap /proc/self/status * CapInh: 0000000000000000 * CapPrm: 000001fffffdffff * CapEff: 000001fffffdffff * CapBnd: 000001fffffdffff * CapAmb: 0000000000000000 * CapUNs: 000001fffffdffff */
int main(int argc, char *argv[]) { if (prctl(PR_CAP_USERNS, PR_CAP_USERNS_LOWER, CAP_SYS_RAWIO, 0, 0) < 0) err(1, "cannot drop userns cap");
execvp(argv[1], argv + 1); err(1, "cannot exec"); }
[1] https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.ht... [2] https://lore.kernel.org/lkml/1453502345-30416-1-git-send-email-keescook@chro... [3] https://lore.kernel.org/lkml/20220815162028.926858-1-fred@cloudflare.com [4] https://lore.kernel.org/containers/168547265011.24337.4306067683997517082-0@...
Signed-off-by: Jonathan Calmels jcalmels@3xx0.net --- Documentation/filesystems/proc.rst | 1 + Documentation/security/credentials.rst | 6 +++ fs/proc/array.c | 9 ++++ include/linux/cred.h | 3 ++ include/uapi/linux/prctl.h | 7 +++ kernel/cred.c | 3 ++ kernel/umh.c | 15 +++++++ kernel/user_namespace.c | 12 +++-- security/commoncap.c | 62 ++++++++++++++++++++++++-- security/keys/process_keys.c | 3 ++ 10 files changed, 111 insertions(+), 10 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 7c3a565ffbef..b5de4eaf1b7b 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -294,6 +294,7 @@ It's slow but very precise. CapEff bitmap of effective capabilities CapBnd bitmap of capabilities bounding set CapAmb bitmap of ambient capabilities + CapUns bitmap of user namespace capabilities NoNewPrivs no_new_privs, like prctl(PR_GET_NO_NEW_PRIV, ...) Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...) Speculation_Store_Bypass speculative store bypass mitigation status diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst index 357328d566c8..7ee904237023 100644 --- a/Documentation/security/credentials.rst +++ b/Documentation/security/credentials.rst @@ -148,6 +148,7 @@ The Linux kernel supports the following types of credentials: - Set of permitted capabilities - Set of inheritable capabilities - Set of effective capabilities + - Set of user namespace capabilities - Capability bounding set
These are only carried by tasks. They indicate superior capabilities @@ -170,6 +171,11 @@ The Linux kernel supports the following types of credentials: ``execve()``, especially when a binary is executed that will execute as UID 0.
+ The user namespace set limits the capabilities granted to user namespaces. + It defines what capabilities will be available in the other sets after + creating a new user namespace, such as when calling ``clone()`` or + ``unshare()`` with ``CLONE_NEWUSER``. + 3. Secure management flags (securebits).
These are only carried by tasks. These govern the way the above diff --git a/fs/proc/array.c b/fs/proc/array.c index 34a47fb0c57f..364e8bb19f9d 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -313,6 +313,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p) const struct cred *cred; kernel_cap_t cap_inheritable, cap_permitted, cap_effective, cap_bset, cap_ambient; +#ifdef CONFIG_USER_NS + kernel_cap_t cap_userns; +#endif
rcu_read_lock(); cred = __task_cred(p); @@ -321,6 +324,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p) cap_effective = cred->cap_effective; cap_bset = cred->cap_bset; cap_ambient = cred->cap_ambient; +#ifdef CONFIG_USER_NS + cap_userns = cred->cap_userns; +#endif rcu_read_unlock();
render_cap_t(m, "CapInh:\t", &cap_inheritable); @@ -328,6 +334,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p) render_cap_t(m, "CapEff:\t", &cap_effective); render_cap_t(m, "CapBnd:\t", &cap_bset); render_cap_t(m, "CapAmb:\t", &cap_ambient); +#ifdef CONFIG_USER_NS + render_cap_t(m, "CapUNs:\t", &cap_userns); +#endif }
static inline void task_seccomp(struct seq_file *m, struct task_struct *p) diff --git a/include/linux/cred.h b/include/linux/cred.h index 2976f534a7a3..adab0031443e 100644 --- a/include/linux/cred.h +++ b/include/linux/cred.h @@ -124,6 +124,9 @@ struct cred { kernel_cap_t cap_effective; /* caps we can actually use */ kernel_cap_t cap_bset; /* capability bounding set */ kernel_cap_t cap_ambient; /* Ambient capability set */ +#ifdef CONFIG_USER_NS + kernel_cap_t cap_userns; /* User namespace capability set */ +#endif #ifdef CONFIG_KEYS unsigned char jit_keyring; /* default keyring to attach requested * keys to */ diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 35791791a879..b58325ebdc9e 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -198,6 +198,13 @@ struct prctl_mm_map { # define PR_CAP_AMBIENT_LOWER 3 # define PR_CAP_AMBIENT_CLEAR_ALL 4
+/* Control the userns capability set */ +#define PR_CAP_USERNS 48 +# define PR_CAP_USERNS_IS_SET 1 +# define PR_CAP_USERNS_RAISE 2 +# define PR_CAP_USERNS_LOWER 3 +# define PR_CAP_USERNS_CLEAR_ALL 4 + /* arm64 Scalable Vector Extension controls */ /* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */ #define PR_SVE_SET_VL 50 /* set task vector length */ diff --git a/kernel/cred.c b/kernel/cred.c index 075cfa7c896f..9912c6f3bc6b 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -56,6 +56,9 @@ struct cred init_cred = { .cap_permitted = CAP_FULL_SET, .cap_effective = CAP_FULL_SET, .cap_bset = CAP_FULL_SET, +#ifdef CONFIG_USER_NS + .cap_userns = CAP_FULL_SET, +#endif .user = INIT_USER, .user_ns = &init_user_ns, .group_info = &init_groups, diff --git a/kernel/umh.c b/kernel/umh.c index 598b3ffe1522..0a5a9cf10d83 100644 --- a/kernel/umh.c +++ b/kernel/umh.c @@ -32,6 +32,9 @@
#include <trace/events/module.h>
+#ifdef CONFIG_USER_NS +static kernel_cap_t usermodehelper_userns = CAP_FULL_SET; +#endif static kernel_cap_t usermodehelper_bset = CAP_FULL_SET; static kernel_cap_t usermodehelper_inheritable = CAP_FULL_SET; static DEFINE_SPINLOCK(umh_sysctl_lock); @@ -94,6 +97,9 @@ static int call_usermodehelper_exec_async(void *data) new->cap_bset = cap_intersect(usermodehelper_bset, new->cap_bset); new->cap_inheritable = cap_intersect(usermodehelper_inheritable, new->cap_inheritable); +#ifdef CONFIG_USER_NS + new->cap_userns = cap_intersect(usermodehelper_userns, new->cap_userns); +#endif spin_unlock(&umh_sysctl_lock);
if (sub_info->init) { @@ -560,6 +566,15 @@ static struct ctl_table usermodehelper_table[] = { .mode = 0600, .proc_handler = proc_cap_handler, }, +#ifdef CONFIG_USER_NS + { + .procname = "userns", + .data = &usermodehelper_userns, + .maxlen = 2 * sizeof(unsigned long), + .mode = 0600, + .proc_handler = proc_cap_handler, + }, +#endif };
static int __init init_umh_sysctls(void) diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 0b0b95418b16..7e624607330b 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -42,15 +42,13 @@ static void dec_user_namespaces(struct ucounts *ucounts)
static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) { - /* Start with the same capabilities as init but useless for doing - * anything as the capabilities are bound to the new user namespace. - */ - cred->securebits = SECUREBITS_DEFAULT; + /* Start with the capabilities defined in the userns set. */ + cred->cap_bset = cred->cap_userns; + cred->cap_permitted = cred->cap_userns; + cred->cap_effective = cred->cap_userns; cred->cap_inheritable = CAP_EMPTY_SET; - cred->cap_permitted = CAP_FULL_SET; - cred->cap_effective = CAP_FULL_SET; cred->cap_ambient = CAP_EMPTY_SET; - cred->cap_bset = CAP_FULL_SET; + cred->securebits = SECUREBITS_DEFAULT; #ifdef CONFIG_KEYS key_put(cred->request_key_auth); cred->request_key_auth = NULL; diff --git a/security/commoncap.c b/security/commoncap.c index 162d96b3a676..59fafbfcfc5e 100644 --- a/security/commoncap.c +++ b/security/commoncap.c @@ -214,10 +214,10 @@ int cap_capget(const struct task_struct *target, kernel_cap_t *effective, }
/* - * Determine whether the inheritable capabilities are limited to the old + * Determine whether the capabilities are limited to the old * permitted set. Returns 1 if they are limited, 0 if they are not. */ -static inline int cap_inh_is_capped(void) +static inline int cap_is_capped(void) { /* they are so limited unless the current task has the CAP_SETPCAP * capability @@ -228,6 +228,29 @@ static inline int cap_inh_is_capped(void) return 1; }
+/* + * Determine whether a userns capability can be raised. + * Returns 1 if it can, 0 otherwise. + */ +#ifdef CONFIG_USER_NS +static inline int cap_uns_is_raiseable(unsigned long cap) +{ + if (!!cap_raised(current_cred()->cap_userns, cap)) + return 1; + + /* + * A capability cannot be raised unless the current task has it in + * its bounding set and, without CAP_SETPCAP, its permitted set. + */ + if (!cap_raised(current_cred()->cap_bset, cap)) + return 0; + if (cap_is_capped() && !cap_raised(current_cred()->cap_permitted, cap)) + return 0; + + return 1; +} +#endif + /** * cap_capset - Validate and apply proposed changes to current's capabilities * @new: The proposed new credentials; alterations should be made here @@ -246,7 +269,7 @@ int cap_capset(struct cred *new, const kernel_cap_t *inheritable, const kernel_cap_t *permitted) { - if (cap_inh_is_capped() && + if (cap_is_capped() && !cap_issubset(*inheritable, cap_combine(old->cap_inheritable, old->cap_permitted))) @@ -1382,6 +1405,39 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3, return commit_creds(new); }
+#ifdef CONFIG_USER_NS + case PR_CAP_USERNS: + if (arg2 == PR_CAP_USERNS_CLEAR_ALL) { + if (arg3 | arg4 | arg5) + return -EINVAL; + + new = prepare_creds(); + if (!new) + return -ENOMEM; + cap_clear(new->cap_userns); + return commit_creds(new); + } + + if (((!cap_valid(arg3)) | arg4 | arg5)) + return -EINVAL; + + if (arg2 == PR_CAP_USERNS_IS_SET) + return !!cap_raised(current_cred()->cap_userns, arg3); + if (arg2 != PR_CAP_USERNS_RAISE && arg2 != PR_CAP_USERNS_LOWER) + return -EINVAL; + if (arg2 == PR_CAP_USERNS_RAISE && !cap_uns_is_raiseable(arg3)) + return -EPERM; + + new = prepare_creds(); + if (!new) + return -ENOMEM; + if (arg2 == PR_CAP_USERNS_RAISE) + cap_raise(new->cap_userns, arg3); + else + cap_lower(new->cap_userns, arg3); + return commit_creds(new); +#endif + default: /* No functionality available - continue with default */ return -ENOSYS; diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c index b5d5333ab330..e3670d815435 100644 --- a/security/keys/process_keys.c +++ b/security/keys/process_keys.c @@ -944,6 +944,9 @@ void key_change_session_keyring(struct callback_head *twork) new->cap_effective = old->cap_effective; new->cap_ambient = old->cap_ambient; new->cap_bset = old->cap_bset; +#ifdef CONFIG_USER_NS + new->cap_userns = old->cap_userns; +#endif
new->jit_keyring = old->jit_keyring; new->thread_keyring = key_get(old->thread_keyring);
On Sun, Jun 09, 2024 at 03:43:34AM -0700, Jonathan Calmels wrote:
(Adding amorgan as he doesn't seem to be on cc list)
Attackers often rely on user namespaces to get elevated (yet confined) privileges in order to target specific subsystems (e.g. [1]). Distributions
I'd modify this to say "in order to target *bugs* in specific subsystems" :)
have been pretty adamant that they need a way to configure these, most of them carry out-of-tree patches to do so, or plainly refuse to enable them. As a result, there have been multiple efforts over the years to introduce various knobs to control and/or disable user namespaces (e.g. [2][3][4]).
While we acknowledge that there are already ways to control the creation of such namespaces (the most recent being a LSM hook), there are inherent issues with these approaches. Preventing the user namespace creation is not fine-grained enough, and in some cases, incompatible with various userspace expectations (e.g. container runtimes, browser sandboxing, service isolation)
This patch addresses these limitations by introducing an additional capability set used to restrict the permissions granted when creating user namespaces. This way, processes can apply the principle of least privilege by configuring only the capabilities they need for their namespaces.
I think this is precisely the right thing to do. In each of these cases, there is a kernel bug in code which is only reachable with some CAP_X, regardless of which user namespace the CAP_X is targeted towards. So there's really no better way to resist this (apart from not having the bugs in the first place) than to allow CAP_X to be denied in unprivileged user namespaces.
For compatibility reasons, processes always start with a full userns capability set.
On namespace creation, the userns capability set (pU) is assigned to the new effective (pE), permitted (pP) and bounding set (X) of the task:
pU = pE = pP = X
The userns capability set obeys the invariant that no bit can ever be set if it is not already part of the task’s bounding set. This ensures that no namespace can ever gain more privileges than its predecessors. Additionally, if a task is not privileged over CAP_SETPCAP, setting any bit in the userns set requires its corresponding bit to be set in the permitted set. This effectively mimics the inheritable set rules and means that, by default, only root in the user namespace can regain userns capabilities previously dropped:
Something about this last sentence feels wrong, but I'm not sure what the best alternative would be. As is, though, it makes it sound as though root in the userns can always regain previously dropped capabilities, but that's not true if dropped in ancestor ns, or if root also dropped the bits from its bounding set (right?).
p’U = (pE & CAP_SETPCAP) ? X : (X & pP)
Note that since userns capabilities are strictly hierarchical, policies can be enforced at various levels (e.g. init, pam_cap) and inherited by every child namespace.
Here is a sample program that can be used to verify the functionality:
/*
- Test program that drops CAP_SYS_RAWIO from subsequent user namespaces.
- ./cap_userns_test unshare -r grep Cap /proc/self/status
- CapInh: 0000000000000000
- CapPrm: 000001fffffdffff
- CapEff: 000001fffffdffff
- CapBnd: 000001fffffdffff
- CapAmb: 0000000000000000
- CapUNs: 000001fffffdffff
*/
int main(int argc, char *argv[]) { if (prctl(PR_CAP_USERNS, PR_CAP_USERNS_LOWER, CAP_SYS_RAWIO, 0, 0) < 0) err(1, "cannot drop userns cap");
execvp(argv[1], argv + 1); err(1, "cannot exec");
}
[1] https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.ht... [2] https://lore.kernel.org/lkml/1453502345-30416-1-git-send-email-keescook@chro... [3] https://lore.kernel.org/lkml/20220815162028.926858-1-fred@cloudflare.com [4] https://lore.kernel.org/containers/168547265011.24337.4306067683997517082-0@...
Signed-off-by: Jonathan Calmels jcalmels@3xx0.net
Thanks.
Reviewed-by: Serge Hallyn serge@hallyn.com
Documentation/filesystems/proc.rst | 1 + Documentation/security/credentials.rst | 6 +++ fs/proc/array.c | 9 ++++ include/linux/cred.h | 3 ++ include/uapi/linux/prctl.h | 7 +++ kernel/cred.c | 3 ++ kernel/umh.c | 15 +++++++ kernel/user_namespace.c | 12 +++-- security/commoncap.c | 62 ++++++++++++++++++++++++-- security/keys/process_keys.c | 3 ++ 10 files changed, 111 insertions(+), 10 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 7c3a565ffbef..b5de4eaf1b7b 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -294,6 +294,7 @@ It's slow but very precise. CapEff bitmap of effective capabilities CapBnd bitmap of capabilities bounding set CapAmb bitmap of ambient capabilities
- CapUns bitmap of user namespace capabilities NoNewPrivs no_new_privs, like prctl(PR_GET_NO_NEW_PRIV, ...) Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...) Speculation_Store_Bypass speculative store bypass mitigation status
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst index 357328d566c8..7ee904237023 100644 --- a/Documentation/security/credentials.rst +++ b/Documentation/security/credentials.rst @@ -148,6 +148,7 @@ The Linux kernel supports the following types of credentials:
- Set of permitted capabilities
- Set of inheritable capabilities
- Set of effective capabilities
- Set of user namespace capabilities
- Capability bounding set
These are only carried by tasks. They indicate superior capabilities @@ -170,6 +171,11 @@ The Linux kernel supports the following types of credentials: ``execve()``, especially when a binary is executed that will execute as UID 0.
The user namespace set limits the capabilities granted to user namespaces.
It defines what capabilities will be available in the other sets after
creating a new user namespace, such as when calling ``clone()`` or
``unshare()`` with ``CLONE_NEWUSER``.
- Secure management flags (securebits).
These are only carried by tasks. These govern the way the above diff --git a/fs/proc/array.c b/fs/proc/array.c index 34a47fb0c57f..364e8bb19f9d 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -313,6 +313,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p) const struct cred *cred; kernel_cap_t cap_inheritable, cap_permitted, cap_effective, cap_bset, cap_ambient; +#ifdef CONFIG_USER_NS
- kernel_cap_t cap_userns;
+#endif rcu_read_lock(); cred = __task_cred(p); @@ -321,6 +324,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p) cap_effective = cred->cap_effective; cap_bset = cred->cap_bset; cap_ambient = cred->cap_ambient; +#ifdef CONFIG_USER_NS
- cap_userns = cred->cap_userns;
+#endif rcu_read_unlock(); render_cap_t(m, "CapInh:\t", &cap_inheritable); @@ -328,6 +334,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p) render_cap_t(m, "CapEff:\t", &cap_effective); render_cap_t(m, "CapBnd:\t", &cap_bset); render_cap_t(m, "CapAmb:\t", &cap_ambient); +#ifdef CONFIG_USER_NS
- render_cap_t(m, "CapUNs:\t", &cap_userns);
+#endif } static inline void task_seccomp(struct seq_file *m, struct task_struct *p) diff --git a/include/linux/cred.h b/include/linux/cred.h index 2976f534a7a3..adab0031443e 100644 --- a/include/linux/cred.h +++ b/include/linux/cred.h @@ -124,6 +124,9 @@ struct cred { kernel_cap_t cap_effective; /* caps we can actually use */ kernel_cap_t cap_bset; /* capability bounding set */ kernel_cap_t cap_ambient; /* Ambient capability set */ +#ifdef CONFIG_USER_NS
- kernel_cap_t cap_userns; /* User namespace capability set */
+#endif #ifdef CONFIG_KEYS unsigned char jit_keyring; /* default keyring to attach requested * keys to */ diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 35791791a879..b58325ebdc9e 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -198,6 +198,13 @@ struct prctl_mm_map { # define PR_CAP_AMBIENT_LOWER 3 # define PR_CAP_AMBIENT_CLEAR_ALL 4 +/* Control the userns capability set */ +#define PR_CAP_USERNS 48 +# define PR_CAP_USERNS_IS_SET 1 +# define PR_CAP_USERNS_RAISE 2 +# define PR_CAP_USERNS_LOWER 3 +# define PR_CAP_USERNS_CLEAR_ALL 4
/* arm64 Scalable Vector Extension controls */ /* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */ #define PR_SVE_SET_VL 50 /* set task vector length */ diff --git a/kernel/cred.c b/kernel/cred.c index 075cfa7c896f..9912c6f3bc6b 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -56,6 +56,9 @@ struct cred init_cred = { .cap_permitted = CAP_FULL_SET, .cap_effective = CAP_FULL_SET, .cap_bset = CAP_FULL_SET, +#ifdef CONFIG_USER_NS
- .cap_userns = CAP_FULL_SET,
+#endif .user = INIT_USER, .user_ns = &init_user_ns, .group_info = &init_groups, diff --git a/kernel/umh.c b/kernel/umh.c index 598b3ffe1522..0a5a9cf10d83 100644 --- a/kernel/umh.c +++ b/kernel/umh.c @@ -32,6 +32,9 @@ #include <trace/events/module.h> +#ifdef CONFIG_USER_NS +static kernel_cap_t usermodehelper_userns = CAP_FULL_SET; +#endif static kernel_cap_t usermodehelper_bset = CAP_FULL_SET; static kernel_cap_t usermodehelper_inheritable = CAP_FULL_SET; static DEFINE_SPINLOCK(umh_sysctl_lock); @@ -94,6 +97,9 @@ static int call_usermodehelper_exec_async(void *data) new->cap_bset = cap_intersect(usermodehelper_bset, new->cap_bset); new->cap_inheritable = cap_intersect(usermodehelper_inheritable, new->cap_inheritable); +#ifdef CONFIG_USER_NS
- new->cap_userns = cap_intersect(usermodehelper_userns, new->cap_userns);
+#endif spin_unlock(&umh_sysctl_lock); if (sub_info->init) { @@ -560,6 +566,15 @@ static struct ctl_table usermodehelper_table[] = { .mode = 0600, .proc_handler = proc_cap_handler, }, +#ifdef CONFIG_USER_NS
- {
.procname = "userns",
.data = &usermodehelper_userns,
.maxlen = 2 * sizeof(unsigned long),
.mode = 0600,
.proc_handler = proc_cap_handler,
- },
+#endif }; static int __init init_umh_sysctls(void) diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 0b0b95418b16..7e624607330b 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -42,15 +42,13 @@ static void dec_user_namespaces(struct ucounts *ucounts) static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) {
- /* Start with the same capabilities as init but useless for doing
* anything as the capabilities are bound to the new user namespace.
*/
- cred->securebits = SECUREBITS_DEFAULT;
- /* Start with the capabilities defined in the userns set. */
- cred->cap_bset = cred->cap_userns;
- cred->cap_permitted = cred->cap_userns;
- cred->cap_effective = cred->cap_userns; cred->cap_inheritable = CAP_EMPTY_SET;
- cred->cap_permitted = CAP_FULL_SET;
- cred->cap_effective = CAP_FULL_SET; cred->cap_ambient = CAP_EMPTY_SET;
- cred->cap_bset = CAP_FULL_SET;
- cred->securebits = SECUREBITS_DEFAULT;
#ifdef CONFIG_KEYS key_put(cred->request_key_auth); cred->request_key_auth = NULL; diff --git a/security/commoncap.c b/security/commoncap.c index 162d96b3a676..59fafbfcfc5e 100644 --- a/security/commoncap.c +++ b/security/commoncap.c @@ -214,10 +214,10 @@ int cap_capget(const struct task_struct *target, kernel_cap_t *effective, } /*
- Determine whether the inheritable capabilities are limited to the old
*/
- Determine whether the capabilities are limited to the old
- permitted set. Returns 1 if they are limited, 0 if they are not.
-static inline int cap_inh_is_capped(void) +static inline int cap_is_capped(void) { /* they are so limited unless the current task has the CAP_SETPCAP * capability @@ -228,6 +228,29 @@ static inline int cap_inh_is_capped(void) return 1; } +/*
- Determine whether a userns capability can be raised.
- Returns 1 if it can, 0 otherwise.
- */
+#ifdef CONFIG_USER_NS +static inline int cap_uns_is_raiseable(unsigned long cap) +{
- if (!!cap_raised(current_cred()->cap_userns, cap))
return 1;
- /*
* A capability cannot be raised unless the current task has it in
* its bounding set and, without CAP_SETPCAP, its permitted set.
*/
- if (!cap_raised(current_cred()->cap_bset, cap))
return 0;
- if (cap_is_capped() && !cap_raised(current_cred()->cap_permitted, cap))
return 0;
- return 1;
+} +#endif
/**
- cap_capset - Validate and apply proposed changes to current's capabilities
- @new: The proposed new credentials; alterations should be made here
@@ -246,7 +269,7 @@ int cap_capset(struct cred *new, const kernel_cap_t *inheritable, const kernel_cap_t *permitted) {
- if (cap_inh_is_capped() &&
- if (cap_is_capped() && !cap_issubset(*inheritable, cap_combine(old->cap_inheritable, old->cap_permitted)))
@@ -1382,6 +1405,39 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3, return commit_creds(new); } +#ifdef CONFIG_USER_NS
- case PR_CAP_USERNS:
if (arg2 == PR_CAP_USERNS_CLEAR_ALL) {
if (arg3 | arg4 | arg5)
return -EINVAL;
new = prepare_creds();
if (!new)
return -ENOMEM;
cap_clear(new->cap_userns);
return commit_creds(new);
}
if (((!cap_valid(arg3)) | arg4 | arg5))
return -EINVAL;
if (arg2 == PR_CAP_USERNS_IS_SET)
return !!cap_raised(current_cred()->cap_userns, arg3);
if (arg2 != PR_CAP_USERNS_RAISE && arg2 != PR_CAP_USERNS_LOWER)
return -EINVAL;
if (arg2 == PR_CAP_USERNS_RAISE && !cap_uns_is_raiseable(arg3))
return -EPERM;
new = prepare_creds();
if (!new)
return -ENOMEM;
if (arg2 == PR_CAP_USERNS_RAISE)
cap_raise(new->cap_userns, arg3);
else
cap_lower(new->cap_userns, arg3);
return commit_creds(new);
+#endif
- default: /* No functionality available - continue with default */ return -ENOSYS;
diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c index b5d5333ab330..e3670d815435 100644 --- a/security/keys/process_keys.c +++ b/security/keys/process_keys.c @@ -944,6 +944,9 @@ void key_change_session_keyring(struct callback_head *twork) new->cap_effective = old->cap_effective; new->cap_ambient = old->cap_ambient; new->cap_bset = old->cap_bset; +#ifdef CONFIG_USER_NS
- new->cap_userns = old->cap_userns;
+#endif new->jit_keyring = old->jit_keyring; new->thread_keyring = key_get(old->thread_keyring); -- 2.45.2
On Sun, Jun 09, 2024 at 08:50:24PM GMT, Serge E. Hallyn wrote:
On Sun, Jun 09, 2024 at 03:43:34AM -0700, Jonathan Calmels wrote:
Attackers often rely on user namespaces to get elevated (yet confined) privileges in order to target specific subsystems (e.g. [1]). Distributions
I'd modify this to say "in order to target *bugs* in specific subsystems" :)
Ack
This effectively mimics the inheritable set rules and means that, by default, only root in the user namespace can regain userns capabilities previously dropped:
Something about this last sentence feels wrong, but I'm not sure what the best alternative would be. As is, though, it makes it sound as though root in the userns can always regain previously dropped capabilities, but that's not true if dropped in ancestor ns, or if root also dropped the bits from its bounding set (right?).
Right, the wording is a little bit confusing here I admit. What I meant to say is that if a cap is dropped in a *given* namespace, then it can only be regained by root there. But yes, caps can never be regained from ancestors ns. I'll try to rephrase it.
BTW, this is rather strict, but I think that's what we want right, something simple? Alternative would be to have a new cap masked off by default, but if granted to a userns, allows you to regain ancestors caps.
On Mon, Jun 10, 2024 at 01:47:13AM -0700, Jonathan Calmels wrote:
On Sun, Jun 09, 2024 at 08:50:24PM GMT, Serge E. Hallyn wrote:
On Sun, Jun 09, 2024 at 03:43:34AM -0700, Jonathan Calmels wrote:
Attackers often rely on user namespaces to get elevated (yet confined) privileges in order to target specific subsystems (e.g. [1]). Distributions
I'd modify this to say "in order to target *bugs* in specific subsystems" :)
Ack
This effectively mimics the inheritable set rules and means that, by default, only root in the user namespace can regain userns capabilities previously dropped:
Something about this last sentence feels wrong, but I'm not sure what the best alternative would be. As is, though, it makes it sound as though root in the userns can always regain previously dropped capabilities, but that's not true if dropped in ancestor ns, or if root also dropped the bits from its bounding set (right?).
Right, the wording is a little bit confusing here I admit. What I meant to say is that if a cap is dropped in a *given* namespace, then it can only be regained by root there. But yes, caps can never be regained from ancestors ns. I'll try to rephrase it.
BTW, this is rather strict, but I think that's what we want right,
Yes,
something simple? Alternative would be to have a new cap masked off by default, but if granted to a userns, allows you to regain ancestors caps.
we absolutely do not want to allow regaining caps dropped in an ancestor namespace.
thanks, -serge
On Sun, Jun 09, 2024 at 03:43:34AM -0700, Jonathan Calmels wrote:
Attackers often rely on user namespaces to get elevated (yet confined) privileges in order to target specific subsystems (e.g. [1]). Distributions have been pretty adamant that they need a way to configure these, most of them carry out-of-tree patches to do so, or plainly refuse to enable them. As a result, there have been multiple efforts over the years to introduce various knobs to control and/or disable user namespaces (e.g. [2][3][4]).
While we acknowledge that there are already ways to control the creation of such namespaces (the most recent being a LSM hook), there are inherent issues with these approaches. Preventing the user namespace creation is not fine-grained enough, and in some cases, incompatible with various userspace expectations (e.g. container runtimes, browser sandboxing, service isolation)
This patch addresses these limitations by introducing an additional capability set used to restrict the permissions granted when creating user namespaces. This way, processes can apply the principle of least privilege by configuring only the capabilities they need for their namespaces.
For compatibility reasons, processes always start with a full userns capability set.
On namespace creation, the userns capability set (pU) is assigned to the new effective (pE), permitted (pP) and bounding set (X) of the task:
pU = pE = pP = X
The userns capability set obeys the invariant that no bit can ever be set if it is not already part of the task’s bounding set. This ensures that no namespace can ever gain more privileges than its predecessors. Additionally, if a task is not privileged over CAP_SETPCAP, setting any bit in the userns set requires its corresponding bit to be set in the permitted set. This effectively mimics the inheritable set rules and means that, by default, only root in the user namespace can regain userns capabilities previously dropped:
p’U = (pE & CAP_SETPCAP) ? X : (X & pP)
Note that since userns capabilities are strictly hierarchical, policies can be enforced at various levels (e.g. init, pam_cap) and inherited by every child namespace.
Here is a sample program that can be used to verify the functionality:
/*
- Test program that drops CAP_SYS_RAWIO from subsequent user namespaces.
- ./cap_userns_test unshare -r grep Cap /proc/self/status
- CapInh: 0000000000000000
- CapPrm: 000001fffffdffff
- CapEff: 000001fffffdffff
- CapBnd: 000001fffffdffff
- CapAmb: 0000000000000000
- CapUNs: 000001fffffdffff
*/
...
+#ifdef CONFIG_USER_NS
- case PR_CAP_USERNS:
if (arg2 == PR_CAP_USERNS_CLEAR_ALL) {
if (arg3 | arg4 | arg5)
return -EINVAL;
new = prepare_creds();
if (!new)
return -ENOMEM;
cap_clear(new->cap_userns);
return commit_creds(new);
}
if (((!cap_valid(arg3)) | arg4 | arg5))
return -EINVAL;
if (arg2 == PR_CAP_USERNS_IS_SET)
return !!cap_raised(current_cred()->cap_userns, arg3);
if (arg2 != PR_CAP_USERNS_RAISE && arg2 != PR_CAP_USERNS_LOWER)
return -EINVAL;
if (arg2 == PR_CAP_USERNS_RAISE && !cap_uns_is_raiseable(arg3))
return -EPERM;
new = prepare_creds();
if (!new)
return -ENOMEM;
if (arg2 == PR_CAP_USERNS_RAISE)
cap_raise(new->cap_userns, arg3);
else
cap_lower(new->cap_userns, arg3);
Now, one thing that does occur to me here is that there is a very mild form of sendmail-capabilities vulnerability that could happen here. Unpriv user joe can drop CAP_SYS_ADMIN from cap_userns, then run a setuid-root program which starts a container which expects CAP_SYS_ADMIN. This could be a shared container, and so joe could be breaking expected behavior there.
I *think* we want to say we don't care about this case, but if we did, I suppose we could say that the normal cap raise rules on setuid should apply to cap_userns?
On Mon, Jun 10, 2024 at 08:00:57AM GMT, Serge E. Hallyn wrote:
Now, one thing that does occur to me here is that there is a very mild form of sendmail-capabilities vulnerability that could happen here. Unpriv user joe can drop CAP_SYS_ADMIN from cap_userns, then run a setuid-root program which starts a container which expects CAP_SYS_ADMIN. This could be a shared container, and so joe could be breaking expected behavior there.
I *think* we want to say we don't care about this case, but if we did, I suppose we could say that the normal cap raise rules on setuid should apply to cap_userns?
Right, good catch. If we do want to fix it, we could just check for setuid no? Or do we want to follow the normal root inheritance rules too? Essentially something like this:
pU' = is_suid(root) ? X : pU
On Tue, Jun 11, 2024 at 01:20:40AM -0700, Jonathan Calmels wrote:
On Mon, Jun 10, 2024 at 08:00:57AM GMT, Serge E. Hallyn wrote:
Now, one thing that does occur to me here is that there is a very mild form of sendmail-capabilities vulnerability that could happen here. Unpriv user joe can drop CAP_SYS_ADMIN from cap_userns, then run a setuid-root program which starts a container which expects CAP_SYS_ADMIN. This could be a shared container, and so joe could be breaking expected behavior there.
I *think* we want to say we don't care about this case, but if we did, I suppose we could say that the normal cap raise rules on setuid should apply to cap_userns?
Right, good catch. If we do want to fix it, we could just check for setuid no? Or do we want to follow the normal root inheritance rules too? Essentially something like this:
pU' = is_suid(root) ? X : pU
Yeah, I think that makes sense. Thanks.
-serge
This patch adds a new capability security bit designed to constrain a task’s userns capability set to its bounding set. The reason for this is twofold:
- This serves as a quick and easy way to lock down a set of capabilities for a task, thus ensuring that any namespace it creates will never be more privileged than itself is. - This helps userspace transition to more secure defaults by not requiring specific logic for the userns capability set, or libcap support.
Example:
# capsh --secbits=$((1 << 8)) --drop=cap_sys_rawio -- \ -c 'unshare -r grep Cap /proc/self/status' CapInh: 0000000000000000 CapPrm: 000001fffffdffff CapEff: 000001fffffdffff CapBnd: 000001fffffdffff CapAmb: 0000000000000000 CapUNs: 000001fffffdffff
Signed-off-by: Jonathan Calmels jcalmels@3xx0.net --- include/linux/securebits.h | 1 + include/uapi/linux/securebits.h | 11 ++++++++++- kernel/user_namespace.c | 5 +++++ 3 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/include/linux/securebits.h b/include/linux/securebits.h index 656528673983..5f9d85cd69c3 100644 --- a/include/linux/securebits.h +++ b/include/linux/securebits.h @@ -5,4 +5,5 @@ #include <uapi/linux/securebits.h>
#define issecure(X) (issecure_mask(X) & current_cred_xxx(securebits)) +#define iscredsecure(cred, X) (issecure_mask(X) & cred->securebits) #endif /* !_LINUX_SECUREBITS_H */ diff --git a/include/uapi/linux/securebits.h b/include/uapi/linux/securebits.h index d6d98877ff1a..2da3f4be4531 100644 --- a/include/uapi/linux/securebits.h +++ b/include/uapi/linux/securebits.h @@ -52,10 +52,19 @@ #define SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED \ (issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE_LOCKED))
+/* When set, user namespace capabilities are restricted to their parent's bounding set. */ +#define SECURE_USERNS_STRICT_CAPS 8 +#define SECURE_USERNS_STRICT_CAPS_LOCKED 9 /* make bit-8 immutable */ + +#define SECBIT_USERNS_STRICT_CAPS (issecure_mask(SECURE_USERNS_STRICT_CAPS)) +#define SECBIT_USERNS_STRICT_CAPS_LOCKED \ + (issecure_mask(SECURE_USERNS_STRICT_CAPS_LOCKED)) + #define SECURE_ALL_BITS (issecure_mask(SECURE_NOROOT) | \ issecure_mask(SECURE_NO_SETUID_FIXUP) | \ issecure_mask(SECURE_KEEP_CAPS) | \ - issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE)) + issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE) | \ + issecure_mask(SECURE_USERNS_STRICT_CAPS)) #define SECURE_ALL_LOCKS (SECURE_ALL_BITS << 1)
#endif /* _UAPI_LINUX_SECUREBITS_H */ diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 7e624607330b..53848e2b68cd 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -10,6 +10,7 @@ #include <linux/cred.h> #include <linux/securebits.h> #include <linux/security.h> +#include <linux/capability.h> #include <linux/keyctl.h> #include <linux/key-type.h> #include <keys/user-type.h> @@ -42,6 +43,10 @@ static void dec_user_namespaces(struct ucounts *ucounts)
static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) { + /* Limit userns capabilities to our parent's bounding set. */ + if (iscredsecure(cred, SECURE_USERNS_STRICT_CAPS)) + cred->cap_userns = cap_intersect(cred->cap_userns, cred->cap_bset); + /* Start with the capabilities defined in the userns set. */ cred->cap_bset = cred->cap_userns; cred->cap_permitted = cred->cap_userns;
On Sun, Jun 09, 2024 at 03:43:35AM -0700, Jonathan Calmels wrote:
This patch adds a new capability security bit designed to constrain a task’s userns capability set to its bounding set. The reason for this is twofold:
- This serves as a quick and easy way to lock down a set of capabilities for a task, thus ensuring that any namespace it creates will never be more privileged than itself is.
- This helps userspace transition to more secure defaults by not requiring specific logic for the userns capability set, or libcap support.
Example:
# capsh --secbits=$((1 << 8)) --drop=cap_sys_rawio -- \ -c 'unshare -r grep Cap /proc/self/status' CapInh: 0000000000000000 CapPrm: 000001fffffdffff CapEff: 000001fffffdffff CapBnd: 000001fffffdffff CapAmb: 0000000000000000 CapUNs: 000001fffffdffff
But you are not (that I can see, in this or the previous patch) keeping SECURE_USERNS_STRICT_CAPS in securebits on the next level unshare. Though I think it's ok, because by then both cap_userns and cap_bset are reduced and cap_userns can't be expanded. (Sorry, just thinking aloud here)
Signed-off-by: Jonathan Calmels jcalmels@3xx0.net
include/linux/securebits.h | 1 + include/uapi/linux/securebits.h | 11 ++++++++++- kernel/user_namespace.c | 5 +++++ 3 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/include/linux/securebits.h b/include/linux/securebits.h index 656528673983..5f9d85cd69c3 100644 --- a/include/linux/securebits.h +++ b/include/linux/securebits.h @@ -5,4 +5,5 @@ #include <uapi/linux/securebits.h> #define issecure(X) (issecure_mask(X) & current_cred_xxx(securebits)) +#define iscredsecure(cred, X) (issecure_mask(X) & cred->securebits) #endif /* !_LINUX_SECUREBITS_H */ diff --git a/include/uapi/linux/securebits.h b/include/uapi/linux/securebits.h index d6d98877ff1a..2da3f4be4531 100644 --- a/include/uapi/linux/securebits.h +++ b/include/uapi/linux/securebits.h @@ -52,10 +52,19 @@ #define SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED \ (issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE_LOCKED)) +/* When set, user namespace capabilities are restricted to their parent's bounding set. */ +#define SECURE_USERNS_STRICT_CAPS 8 +#define SECURE_USERNS_STRICT_CAPS_LOCKED 9 /* make bit-8 immutable */
+#define SECBIT_USERNS_STRICT_CAPS (issecure_mask(SECURE_USERNS_STRICT_CAPS)) +#define SECBIT_USERNS_STRICT_CAPS_LOCKED \
(issecure_mask(SECURE_USERNS_STRICT_CAPS_LOCKED))
#define SECURE_ALL_BITS (issecure_mask(SECURE_NOROOT) | \ issecure_mask(SECURE_NO_SETUID_FIXUP) | \ issecure_mask(SECURE_KEEP_CAPS) | \
issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE))
issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE) | \
issecure_mask(SECURE_USERNS_STRICT_CAPS))
#define SECURE_ALL_LOCKS (SECURE_ALL_BITS << 1) #endif /* _UAPI_LINUX_SECUREBITS_H */ diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 7e624607330b..53848e2b68cd 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -10,6 +10,7 @@ #include <linux/cred.h> #include <linux/securebits.h> #include <linux/security.h> +#include <linux/capability.h> #include <linux/keyctl.h> #include <linux/key-type.h> #include <keys/user-type.h> @@ -42,6 +43,10 @@ static void dec_user_namespaces(struct ucounts *ucounts) static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) {
- /* Limit userns capabilities to our parent's bounding set. */
In the case of userns_install(), it will be the target user namespace creator's bounding set, right? Not "our parent's"?
- if (iscredsecure(cred, SECURE_USERNS_STRICT_CAPS))
cred->cap_userns = cap_intersect(cred->cap_userns, cred->cap_bset);
- /* Start with the capabilities defined in the userns set. */ cred->cap_bset = cred->cap_userns; cred->cap_permitted = cred->cap_userns;
-- 2.45.2
On Sun, Jun 09, 2024 at 09:33:01PM GMT, Serge E. Hallyn wrote:
On Sun, Jun 09, 2024 at 03:43:35AM -0700, Jonathan Calmels wrote:
This patch adds a new capability security bit designed to constrain a task’s userns capability set to its bounding set. The reason for this is twofold:
- This serves as a quick and easy way to lock down a set of capabilities for a task, thus ensuring that any namespace it creates will never be more privileged than itself is.
- This helps userspace transition to more secure defaults by not requiring specific logic for the userns capability set, or libcap support.
Example:
# capsh --secbits=$((1 << 8)) --drop=cap_sys_rawio -- \ -c 'unshare -r grep Cap /proc/self/status' CapInh: 0000000000000000 CapPrm: 000001fffffdffff CapEff: 000001fffffdffff CapBnd: 000001fffffdffff CapAmb: 0000000000000000 CapUNs: 000001fffffdffff
But you are not (that I can see, in this or the previous patch) keeping SECURE_USERNS_STRICT_CAPS in securebits on the next level unshare. Though I think it's ok, because by then both cap_userns and cap_bset are reduced and cap_userns can't be expanded. (Sorry, just thinking aloud here)
Right this is safe to reset, but maybe we do keep it if the secbit is locked? This is kind of a special case compared to the other bits.
- /* Limit userns capabilities to our parent's bounding set. */
In the case of userns_install(), it will be the target user namespace creator's bounding set, right? Not "our parent's"?
Good point, I should reword this comment.
On Mon, Jun 10, 2024 at 02:46:06AM -0700, Jonathan Calmels wrote:
On Sun, Jun 09, 2024 at 09:33:01PM GMT, Serge E. Hallyn wrote:
On Sun, Jun 09, 2024 at 03:43:35AM -0700, Jonathan Calmels wrote:
This patch adds a new capability security bit designed to constrain a task’s userns capability set to its bounding set. The reason for this is twofold:
- This serves as a quick and easy way to lock down a set of capabilities for a task, thus ensuring that any namespace it creates will never be more privileged than itself is.
- This helps userspace transition to more secure defaults by not requiring specific logic for the userns capability set, or libcap support.
Example:
# capsh --secbits=$((1 << 8)) --drop=cap_sys_rawio -- \ -c 'unshare -r grep Cap /proc/self/status' CapInh: 0000000000000000 CapPrm: 000001fffffdffff CapEff: 000001fffffdffff CapBnd: 000001fffffdffff CapAmb: 0000000000000000 CapUNs: 000001fffffdffff
But you are not (that I can see, in this or the previous patch) keeping SECURE_USERNS_STRICT_CAPS in securebits on the next level unshare. Though I think it's ok, because by then both cap_userns and cap_bset are reduced and cap_userns can't be expanded. (Sorry, just thinking aloud here)
Right this is safe to reset, but maybe we do keep it if the secbit is locked? This is kind of a special case compared to the other bits.
I don't think it would be worth the extra complication in the secbits code, and it's semantically very different from the cap_userns.
- /* Limit userns capabilities to our parent's bounding set. */
In the case of userns_install(), it will be the target user namespace creator's bounding set, right? Not "our parent's"?
Good point, I should reword this comment.
On Sun, 2024-06-09 at 03:43 -0700, Jonathan Calmels wrote:
This patch adds a new capability security bit designed to constrain a
nit: if you think of it "This patch adds" could be just "add", right? :-)
Also name the exact thing/symbol/whatever here. This is not a HBO series.
task’s userns capability set to its bounding set. The reason for this is twofold:
- This serves as a quick and easy way to lock down a set of
capabilities for a task, thus ensuring that any namespace it creates will never be more privileged than itself is.
- This helps userspace transition to more secure defaults by not
requiring specific logic for the userns capability set, or libcap support.
Example:
# capsh --secbits=$((1 << 8)) --drop=cap_sys_rawio -- \ -c 'unshare -r grep Cap /proc/self/status' CapInh: 0000000000000000 CapPrm: 000001fffffdffff CapEff: 000001fffffdffff CapBnd: 000001fffffdffff CapAmb: 0000000000000000 CapUNs: 000001fffffdffff
Signed-off-by: Jonathan Calmels jcalmels@3xx0.net
include/linux/securebits.h | 1 + include/uapi/linux/securebits.h | 11 ++++++++++- kernel/user_namespace.c | 5 +++++ 3 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/include/linux/securebits.h b/include/linux/securebits.h index 656528673983..5f9d85cd69c3 100644 --- a/include/linux/securebits.h +++ b/include/linux/securebits.h @@ -5,4 +5,5 @@ #include <uapi/linux/securebits.h> #define issecure(X) (issecure_mask(X) & current_cred_xxx(securebits)) +#define iscredsecure(cred, X) (issecure_mask(X) & cred-
securebits)
#endif /* !_LINUX_SECUREBITS_H */ diff --git a/include/uapi/linux/securebits.h b/include/uapi/linux/securebits.h index d6d98877ff1a..2da3f4be4531 100644 --- a/include/uapi/linux/securebits.h +++ b/include/uapi/linux/securebits.h @@ -52,10 +52,19 @@ #define SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED \ (issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE_L OCKED)) +/* When set, user namespace capabilities are restricted to their parent's bounding set. */ +#define SECURE_USERNS_STRICT_CAPS 8 +#define SECURE_USERNS_STRICT_CAPS_LOCKED 9 /* make
bit-8 immutable */
+#define SECBIT_USERNS_STRICT_CAPS (issecure_mask(SECURE_USERNS_STRICT_CAPS)) +#define SECBIT_USERNS_STRICT_CAPS_LOCKED \
(issecure_mask(SECURE_USERNS_STRICT_CAPS_LOC
KED))
#define SECURE_ALL_BITS (issecure_mask(SECURE_NOROOT) | \ issecure_mask(SECURE_NO_SETUID_FIXUP) | \ issecure_mask(SECURE_KEEP_CAPS) | \
issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE))
issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE) | \
spurious new lines in the diff
please as first priority aim for absolute minimal diff or at least do grow diff proactively like this.
If we really think after that, that we need some "extras" to the patch set, then we decide that. These only take energy away from reviewers.
issecure_mask(SECURE_USERNS_STRICT_CAPS)) #define SECURE_ALL_LOCKS (SECURE_ALL_BITS << 1) #endif /* _UAPI_LINUX_SECUREBITS_H */ diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 7e624607330b..53848e2b68cd 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -10,6 +10,7 @@ #include <linux/cred.h> #include <linux/securebits.h> #include <linux/security.h> +#include <linux/capability.h> #include <linux/keyctl.h> #include <linux/key-type.h> #include <keys/user-type.h> @@ -42,6 +43,10 @@ static void dec_user_namespaces(struct ucounts *ucounts) static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) {
- /* Limit userns capabilities to our parent's bounding set.
*/
- if (iscredsecure(cred, SECURE_USERNS_STRICT_CAPS))
cred->cap_userns = cap_intersect(cred->cap_userns,
cred->cap_bset);
/* Start with the capabilities defined in the userns set. */ cred->cap_bset = cred->cap_userns; cred->cap_permitted = cred->cap_userns;
Going for 4 week holiday starting for next week so focus in on nits but since this is something to do access control:
1. Please go surgical with the diff's because this type of patches also require a surgical review. Now reviewing this like riding on a bumpy road with a car of which suspension mechanics is broken ;-)
Hope you grab my argument here. I only want to look at the problem and solution for that not random stuff..
BR, Jarkko
On Fri, 2024-06-28 at 17:43 +0300, Jarkko Sakkinen wrote:
On Sun, 2024-06-09 at 03:43 -0700, Jonathan Calmels wrote:
This patch adds a new capability security bit designed to constrain a
nit: if you think of it "This patch adds" could be just "add", right? :-)
Also name the exact thing/symbol/whatever here. This is not a HBO series.
task’s userns capability set to its bounding set. The reason for this is twofold:
- This serves as a quick and easy way to lock down a set of
capabilities for a task, thus ensuring that any namespace it creates will never be more privileged than itself is.
- This helps userspace transition to more secure defaults by not
requiring specific logic for the userns capability set, or libcap support.
Example:
# capsh --secbits=$((1 << 8)) --drop=cap_sys_rawio -- \ -c 'unshare -r grep Cap /proc/self/status' CapInh: 0000000000000000 CapPrm: 000001fffffdffff CapEff: 000001fffffdffff CapBnd: 000001fffffdffff CapAmb: 0000000000000000 CapUNs: 000001fffffdffff
Signed-off-by: Jonathan Calmels jcalmels@3xx0.net
include/linux/securebits.h | 1 + include/uapi/linux/securebits.h | 11 ++++++++++- kernel/user_namespace.c | 5 +++++ 3 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/include/linux/securebits.h b/include/linux/securebits.h index 656528673983..5f9d85cd69c3 100644 --- a/include/linux/securebits.h +++ b/include/linux/securebits.h @@ -5,4 +5,5 @@ #include <uapi/linux/securebits.h> #define issecure(X) (issecure_mask(X) & current_cred_xxx(securebits)) +#define iscredsecure(cred, X) (issecure_mask(X) & cred-
securebits)
#endif /* !_LINUX_SECUREBITS_H */ diff --git a/include/uapi/linux/securebits.h b/include/uapi/linux/securebits.h index d6d98877ff1a..2da3f4be4531 100644 --- a/include/uapi/linux/securebits.h +++ b/include/uapi/linux/securebits.h @@ -52,10 +52,19 @@ #define SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED \ (issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE _L OCKED)) +/* When set, user namespace capabilities are restricted to their parent's bounding set. */ +#define SECURE_USERNS_STRICT_CAPS 8 +#define SECURE_USERNS_STRICT_CAPS_LOCKED 9 /* make
bit-8 immutable */
+#define SECBIT_USERNS_STRICT_CAPS (issecure_mask(SECURE_USERNS_STRICT_CAPS)) +#define SECBIT_USERNS_STRICT_CAPS_LOCKED \
(issecure_mask(SECURE_USERNS_STRICT_CAPS_L
OC KED))
#define SECURE_ALL_BITS (issecure_mask(SECURE_NOROOT) | \ issecure_mask(SECURE_NO_SETUID_FIXUP) | \ issecure_mask(SECURE_KEEP_CAPS) | \
issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE))
issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE) | \
spurious new lines in the diff
please as first priority aim for absolute minimal diff or at least do grow diff proactively like this.
If we really think after that, that we need some "extras" to the patch set, then we decide that. These only take energy away from reviewers.
issecure_mask(SECURE_USERNS_STRICT_CAPS)) #define SECURE_ALL_LOCKS (SECURE_ALL_BITS << 1) #endif /* _UAPI_LINUX_SECUREBITS_H */ diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 7e624607330b..53848e2b68cd 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -10,6 +10,7 @@ #include <linux/cred.h> #include <linux/securebits.h> #include <linux/security.h> +#include <linux/capability.h> #include <linux/keyctl.h> #include <linux/key-type.h> #include <keys/user-type.h> @@ -42,6 +43,10 @@ static void dec_user_namespaces(struct ucounts *ucounts) static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) {
- /* Limit userns capabilities to our parent's bounding set.
*/
- if (iscredsecure(cred, SECURE_USERNS_STRICT_CAPS))
cred->cap_userns = cap_intersect(cred->cap_userns,
cred->cap_bset);
/* Start with the capabilities defined in the userns set. */ cred->cap_bset = cred->cap_userns; cred->cap_permitted = cred->cap_userns;
Going for 4 week holiday starting for next week so focus in on nits but since this is something to do access control:
- Please go surgical with the diff's because this type of patches
also require a surgical review. Now reviewing this like riding on a bumpy road with a car of which suspension mechanics is broken ;-)
Hope you grab my argument here. I only want to look at the problem and solution for that not random stuff..
I skip the other patches because of my eager to get on holiday but my instinct tells me that at least some of this feedback applies to all of the patches.
So put your solution in sight, not clean ups.
BR, Jarkko
This patch adds a new system-wide userns capability mask designed to mask off capabilities in user namespaces.
This mask is controlled through a sysctl and can be set early in the boot process or on the kernel command line to exclude known capabilities from ever being gained in namespaces. Once set, it can be further restricted to exert dynamic policies on the system (e.g. ward off a potential exploit).
Changing this mask requires privileges in the initial user namespace over the newly introduced CAP_SYS_CONTROL.
Example:
# sysctl -qw kernel.cap_userns_mask=0x1fffffdffff && \ unshare -r grep Cap /proc/self/status CapInh: 0000000000000000 CapPrm: 000001fffffdffff CapEff: 000001fffffdffff CapBnd: 000001fffffdffff CapAmb: 0000000000000000 CapUNs: 000001fffffdffff
Signed-off-by: Jonathan Calmels jcalmels@3xx0.net --- include/linux/user_namespace.h | 7 ++++ include/uapi/linux/capability.h | 6 ++- kernel/sysctl.c | 10 +++++ kernel/user_namespace.c | 63 +++++++++++++++++++++++++++++ security/selinux/include/classmap.h | 5 ++- 5 files changed, 88 insertions(+), 3 deletions(-)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index 6030a8235617..d958d4819608 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -2,6 +2,7 @@ #ifndef _LINUX_USER_NAMESPACE_H #define _LINUX_USER_NAMESPACE_H
+#include <linux/capability.h> #include <linux/kref.h> #include <linux/nsproxy.h> #include <linux/ns_common.h> @@ -14,6 +15,12 @@ #define UID_GID_MAP_MAX_BASE_EXTENTS 5 #define UID_GID_MAP_MAX_EXTENTS 340
+#ifdef CONFIG_SYSCTL +extern kernel_cap_t cap_userns_mask; +int cap_userns_sysctl_handler(const struct ctl_table *table, int write, + void *buffer, size_t *lenp, loff_t *ppos); +#endif + struct uid_gid_extent { u32 first; u32 lower_first; diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h index 5bb906098697..e2c5e4bb2eb0 100644 --- a/include/uapi/linux/capability.h +++ b/include/uapi/linux/capability.h @@ -418,7 +418,11 @@ struct vfs_ns_cap_data {
#define CAP_CHECKPOINT_RESTORE 40
-#define CAP_LAST_CAP CAP_CHECKPOINT_RESTORE +/* Allow setting the system userns capability mask. */ + +#define CAP_SYS_CONTROL 41 + +#define CAP_LAST_CAP CAP_SYS_CONTROL
#define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c index e0b917328cf9..95b27a92c63c 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -62,6 +62,7 @@ #include <linux/sched/sysctl.h> #include <linux/mount.h> #include <linux/userfaultfd_k.h> +#include <linux/user_namespace.h> #include <linux/pid.h>
#include "../lib/kstrtox.h" @@ -1846,6 +1847,15 @@ static struct ctl_table kern_table[] = { .mode = 0444, .proc_handler = proc_dointvec, }, +#ifdef CONFIG_USER_NS + { + .procname = "cap_userns_mask", + .data = &cap_userns_mask, + .maxlen = sizeof(kernel_cap_t), + .mode = 0644, + .proc_handler = cap_userns_sysctl_handler, + }, +#endif #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86) { .procname = "unknown_nmi_panic", diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 53848e2b68cd..e513d87ed102 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -26,6 +26,63 @@ static struct kmem_cache *user_ns_cachep __ro_after_init; static DEFINE_MUTEX(userns_state_mutex);
+#ifdef CONFIG_SYSCTL +static DEFINE_SPINLOCK(cap_userns_lock); +kernel_cap_t cap_userns_mask = CAP_FULL_SET; + +int cap_userns_sysctl_handler(const struct ctl_table *table, int write, + void *buffer, size_t *lenp, loff_t *ppos) +{ + struct ctl_table t; + unsigned long mask_array[2]; + kernel_cap_t new_mask, *mask; + int err; + + if (write && !capable(CAP_SYS_CONTROL)) + return -EPERM; + + /* + * convert from the global kernel_cap_t to the ulong array to print to + * userspace if this is a read. + * + * capabilities are exposed as one 64-bit value or two 32-bit values + * depending on the architecture + */ + mask = table->data; + spin_lock(&cap_userns_lock); + mask_array[0] = (unsigned long) mask->val; + if (BITS_PER_LONG != 64) + mask_array[1] = mask->val >> BITS_PER_LONG; + spin_unlock(&cap_userns_lock); + + t = *table; + t.data = &mask_array; + + /* + * actually read or write and array of ulongs from userspace. Remember + * these are least significant bits first + */ + err = proc_doulongvec_minmax(&t, write, buffer, lenp, ppos); + if (err < 0) + return err; + + new_mask.val = mask_array[0]; + if (BITS_PER_LONG != 64) + new_mask.val += (u64)mask_array[1] << BITS_PER_LONG; + + /* + * Drop everything not in the new_mask (but don't add things) + */ + if (write) { + spin_lock(&cap_userns_lock); + *mask = cap_intersect(*mask, new_mask); + spin_unlock(&cap_userns_lock); + } + + return 0; +} +#endif + static bool new_idmap_permitted(const struct file *file, struct user_namespace *ns, int cap_setid, struct uid_gid_map *map); @@ -46,6 +103,12 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) /* Limit userns capabilities to our parent's bounding set. */ if (iscredsecure(cred, SECURE_USERNS_STRICT_CAPS)) cred->cap_userns = cap_intersect(cred->cap_userns, cred->cap_bset); +#ifdef CONFIG_SYSCTL + /* Mask off userns capabilities that are not permitted by the system-wide mask. */ + spin_lock(&cap_userns_lock); + cred->cap_userns = cap_intersect(cred->cap_userns, cap_userns_mask); + spin_unlock(&cap_userns_lock); +#endif
/* Start with the capabilities defined in the userns set. */ cred->cap_bset = cred->cap_userns; diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h index 7229c9bf6c27..8f3ede7aac92 100644 --- a/security/selinux/include/classmap.h +++ b/security/selinux/include/classmap.h @@ -34,9 +34,10 @@
#define COMMON_CAP2_PERMS \ "mac_override", "mac_admin", "syslog", "wake_alarm", "block_suspend", \ - "audit_read", "perfmon", "bpf", "checkpoint_restore" + "audit_read", "perfmon", "bpf", "checkpoint_restore", \ + "sys_control"
-#if CAP_LAST_CAP > CAP_CHECKPOINT_RESTORE +#if CAP_LAST_CAP > CAP_SYS_CONTROL #error New capability defined, please update COMMON_CAP2_PERMS. #endif
This patch allows modifying the various capabilities of the struct cred in BPF-LSM hooks. More specifically, the userns_create hook called prior to creating a new user namespace.
With the introduction of userns capabilities, this effectively provides a simple way for LSMs to control the capabilities granted to a user namespace and all its descendants.
Update the selftests accordingly by dropping CAP_SYS_ADMIN in namespaces and checking the resulting task's bounding set.
Signed-off-by: Jonathan Calmels jcalmels@3xx0.net --- include/linux/lsm_hook_defs.h | 2 +- include/linux/security.h | 4 +- kernel/bpf/bpf_lsm.c | 55 +++++++++++++++++++ security/apparmor/lsm.c | 2 +- security/security.c | 6 +- security/selinux/hooks.c | 2 +- .../selftests/bpf/prog_tests/deny_namespace.c | 12 ++-- .../selftests/bpf/progs/test_deny_namespace.c | 7 ++- 8 files changed, 76 insertions(+), 14 deletions(-)
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h index f804b76cde44..58d6d8f2511f 100644 --- a/include/linux/lsm_hook_defs.h +++ b/include/linux/lsm_hook_defs.h @@ -250,7 +250,7 @@ LSM_HOOK(int, -ENOSYS, task_prctl, int option, unsigned long arg2, unsigned long arg3, unsigned long arg4, unsigned long arg5) LSM_HOOK(void, LSM_RET_VOID, task_to_inode, struct task_struct *p, struct inode *inode) -LSM_HOOK(int, 0, userns_create, const struct cred *cred) +LSM_HOOK(int, 0, userns_create, struct cred *cred) LSM_HOOK(int, 0, ipc_permission, struct kern_ipc_perm *ipcp, short flag) LSM_HOOK(void, LSM_RET_VOID, ipc_getsecid, struct kern_ipc_perm *ipcp, u32 *secid) diff --git a/include/linux/security.h b/include/linux/security.h index 21cf70346b33..ffb1b0dd2aef 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -465,7 +465,7 @@ int security_task_kill(struct task_struct *p, struct kernel_siginfo *info, int security_task_prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4, unsigned long arg5); void security_task_to_inode(struct task_struct *p, struct inode *inode); -int security_create_user_ns(const struct cred *cred); +int security_create_user_ns(struct cred *cred); int security_ipc_permission(struct kern_ipc_perm *ipcp, short flag); void security_ipc_getsecid(struct kern_ipc_perm *ipcp, u32 *secid); int security_msg_msg_alloc(struct msg_msg *msg); @@ -1294,7 +1294,7 @@ static inline int security_task_prctl(int option, unsigned long arg2, static inline void security_task_to_inode(struct task_struct *p, struct inode *inode) { }
-static inline int security_create_user_ns(const struct cred *cred) +static inline int security_create_user_ns(struct cred *cred) { return 0; } diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c index 68240c3c6e7d..6edba93ff883 100644 --- a/kernel/bpf/bpf_lsm.c +++ b/kernel/bpf/bpf_lsm.c @@ -382,10 +382,65 @@ bool bpf_lsm_is_trusted(const struct bpf_prog *prog) return !btf_id_set_contains(&untrusted_lsm_hooks, prog->aux->attach_btf_id); }
+static int bpf_lsm_btf_struct_access(struct bpf_verifier_log *log, + const struct bpf_reg_state *reg, + int off, int size) +{ + const struct btf_type *cred; + const struct btf_type *t; + s32 type_id; + size_t end; + + type_id = btf_find_by_name_kind(reg->btf, "cred", BTF_KIND_STRUCT); + if (type_id < 0) + return -EINVAL; + + t = btf_type_by_id(reg->btf, reg->btf_id); + cred = btf_type_by_id(reg->btf, type_id); + if (t != cred) { + bpf_log(log, "only read is supported\n"); + return -EACCES; + } + + switch (off) { + case offsetof(struct cred, cap_inheritable): + end = offsetofend(struct cred, cap_inheritable); + break; + case offsetof(struct cred, cap_permitted): + end = offsetofend(struct cred, cap_permitted); + break; + case offsetof(struct cred, cap_effective): + end = offsetofend(struct cred, cap_effective); + break; + case offsetof(struct cred, cap_bset): + end = offsetofend(struct cred, cap_bset); + break; + case offsetof(struct cred, cap_ambient): + end = offsetofend(struct cred, cap_ambient); + break; + case offsetof(struct cred, cap_userns): + end = offsetofend(struct cred, cap_userns); + break; + default: + bpf_log(log, "no write support to cred at off %d\n", off); + return -EACCES; + } + + if (off + size > end) { + bpf_log(log, + "write access at off %d with size %d beyond the member of cred ended at %zu\n", + off, size, end); + return -EACCES; + } + + return 0; +} + const struct bpf_prog_ops lsm_prog_ops = { };
const struct bpf_verifier_ops lsm_verifier_ops = { .get_func_proto = bpf_lsm_func_proto, .is_valid_access = btf_ctx_access, + .btf_struct_access = bpf_lsm_btf_struct_access, }; diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c index 6239777090c4..310c9fa3d4b4 100644 --- a/security/apparmor/lsm.c +++ b/security/apparmor/lsm.c @@ -1036,7 +1036,7 @@ static int apparmor_task_kill(struct task_struct *target, struct kernel_siginfo return error; }
-static int apparmor_userns_create(const struct cred *cred) +static int apparmor_userns_create(struct cred *cred) { struct aa_label *label; struct aa_profile *profile; diff --git a/security/security.c b/security/security.c index e5da848c50b9..83cf2025c58e 100644 --- a/security/security.c +++ b/security/security.c @@ -3558,14 +3558,14 @@ void security_task_to_inode(struct task_struct *p, struct inode *inode) }
/** - * security_create_user_ns() - Check if creating a new userns is allowed + * security_create_user_ns() - Review permissions prior to userns creation * @cred: prepared creds * - * Check permission prior to creating a new user namespace. + * Check and/or modify permissions prior to creating a new user namespace. * * Return: Returns 0 if successful, otherwise < 0 error code. */ -int security_create_user_ns(const struct cred *cred) +int security_create_user_ns(struct cred *cred) { return call_int_hook(userns_create, cred); } diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index 7eed331e90f0..28deb9510d8e 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -4263,7 +4263,7 @@ static void selinux_task_to_inode(struct task_struct *p, spin_unlock(&isec->lock); }
-static int selinux_userns_create(const struct cred *cred) +static int selinux_userns_create(struct cred *cred) { u32 sid = current_sid();
diff --git a/tools/testing/selftests/bpf/prog_tests/deny_namespace.c b/tools/testing/selftests/bpf/prog_tests/deny_namespace.c index 1bc6241b755b..1500578f9a30 100644 --- a/tools/testing/selftests/bpf/prog_tests/deny_namespace.c +++ b/tools/testing/selftests/bpf/prog_tests/deny_namespace.c @@ -5,6 +5,8 @@ #include <sched.h> #include "cap_helpers.h" #include <stdio.h> +#include <stdbool.h> +#include <sys/prctl.h>
static int wait_for_pid(pid_t pid) { @@ -29,7 +31,7 @@ static int wait_for_pid(pid_t pid) * positive return value -> userns creation failed * 0 -> userns creation succeeded */ -static int create_user_ns(void) +static int create_user_ns(bool bpf) { pid_t pid;
@@ -40,6 +42,8 @@ static int create_user_ns(void) if (pid == 0) { if (unshare(CLONE_NEWUSER)) _exit(EXIT_FAILURE); + if (bpf && prctl(PR_CAPBSET_READ, CAP_SYS_ADMIN)) + _exit(EXIT_FAILURE); _exit(EXIT_SUCCESS); }
@@ -53,11 +57,11 @@ static void test_userns_create_bpf(void)
cap_enable_effective(cap_mask, &old_caps);
- ASSERT_OK(create_user_ns(), "priv new user ns"); + ASSERT_OK(create_user_ns(true), "priv new user ns");
cap_disable_effective(cap_mask, &old_caps);
- ASSERT_EQ(create_user_ns(), EPERM, "unpriv new user ns"); + ASSERT_EQ(create_user_ns(true), EPERM, "unpriv new user ns");
if (cap_mask & old_caps) cap_enable_effective(cap_mask, NULL); @@ -70,7 +74,7 @@ static void test_unpriv_userns_create_no_bpf(void)
cap_disable_effective(cap_mask, &old_caps);
- ASSERT_OK(create_user_ns(), "no-bpf unpriv new user ns"); + ASSERT_OK(create_user_ns(false), "no-bpf unpriv new user ns");
if (cap_mask & old_caps) cap_enable_effective(cap_mask, NULL); diff --git a/tools/testing/selftests/bpf/progs/test_deny_namespace.c b/tools/testing/selftests/bpf/progs/test_deny_namespace.c index e96b901a733c..051906f80f4c 100644 --- a/tools/testing/selftests/bpf/progs/test_deny_namespace.c +++ b/tools/testing/selftests/bpf/progs/test_deny_namespace.c @@ -9,12 +9,13 @@ typedef struct { unsigned long long val; } kernel_cap_t;
struct cred { kernel_cap_t cap_effective; + kernel_cap_t cap_userns; } __attribute__((preserve_access_index));
char _license[] SEC("license") = "GPL";
SEC("lsm.s/userns_create") -int BPF_PROG(test_userns_create, const struct cred *cred, int ret) +int BPF_PROG(test_userns_create, struct cred *cred, int ret) { kernel_cap_t caps = cred->cap_effective; __u64 cap_mask = 1ULL << CAP_SYS_ADMIN; @@ -23,8 +24,10 @@ int BPF_PROG(test_userns_create, const struct cred *cred, int ret) return 0;
ret = -EPERM; - if (caps.val & cap_mask) + if (caps.val & cap_mask) { + cred->cap_userns.val &= ~cap_mask; return 0; + }
return -EPERM; }
On Sun, Jun 9, 2024 at 6:40 AM Jonathan Calmels jcalmels@3xx0.net wrote:
This patch allows modifying the various capabilities of the struct cred in BPF-LSM hooks. More specifically, the userns_create hook called prior to creating a new user namespace.
With the introduction of userns capabilities, this effectively provides a simple way for LSMs to control the capabilities granted to a user namespace and all its descendants.
Update the selftests accordingly by dropping CAP_SYS_ADMIN in namespaces and checking the resulting task's bounding set.
Signed-off-by: Jonathan Calmels jcalmels@3xx0.net
include/linux/lsm_hook_defs.h | 2 +- include/linux/security.h | 4 +- kernel/bpf/bpf_lsm.c | 55 +++++++++++++++++++ security/apparmor/lsm.c | 2 +- security/security.c | 6 +- security/selinux/hooks.c | 2 +- .../selftests/bpf/prog_tests/deny_namespace.c | 12 ++-- .../selftests/bpf/progs/test_deny_namespace.c | 7 ++- 8 files changed, 76 insertions(+), 14 deletions(-)
I'm not sure we want to go down the path of a LSM modifying the POSIX capabilities of a task, other than the capabilities/commoncap LSM. It sets a bad precedent and could further complicate issues around LSM ordering.
-- paul-moore.com
On Sun, Jun 09, 2024 at 08:18:48PM GMT, Paul Moore wrote:
On Sun, Jun 9, 2024 at 6:40 AM Jonathan Calmels jcalmels@3xx0.net wrote:
This patch allows modifying the various capabilities of the struct cred in BPF-LSM hooks. More specifically, the userns_create hook called prior to creating a new user namespace.
With the introduction of userns capabilities, this effectively provides a simple way for LSMs to control the capabilities granted to a user namespace and all its descendants.
Update the selftests accordingly by dropping CAP_SYS_ADMIN in namespaces and checking the resulting task's bounding set.
Signed-off-by: Jonathan Calmels jcalmels@3xx0.net
include/linux/lsm_hook_defs.h | 2 +- include/linux/security.h | 4 +- kernel/bpf/bpf_lsm.c | 55 +++++++++++++++++++ security/apparmor/lsm.c | 2 +- security/security.c | 6 +- security/selinux/hooks.c | 2 +- .../selftests/bpf/prog_tests/deny_namespace.c | 12 ++-- .../selftests/bpf/progs/test_deny_namespace.c | 7 ++- 8 files changed, 76 insertions(+), 14 deletions(-)
I'm not sure we want to go down the path of a LSM modifying the POSIX capabilities of a task, other than the capabilities/commoncap LSM. It sets a bad precedent and could further complicate issues around LSM ordering.
Well unless I'm misunderstanding, this does allow modifying the capabilities/commoncap LSM through BTF. The reason for allowing `userns_create` to be modified is that it is functionally very similar to `cred_prepare` in that it operates with new creds (but specific to user namespaces because of reasons detailed in [1]).
There were some concerns in previous threads that the userns caps by themselves wouldn't be granular enough, hence the LSM integration. Ubuntu for example, currently has to resort to a hardcoded profile transition to achieve this [2].
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... [2] https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/noble/com...
On 6/11/24 01:09, Jonathan Calmels wrote:
On Sun, Jun 09, 2024 at 08:18:48PM GMT, Paul Moore wrote:
On Sun, Jun 9, 2024 at 6:40 AM Jonathan Calmels jcalmels@3xx0.net wrote:
This patch allows modifying the various capabilities of the struct cred in BPF-LSM hooks. More specifically, the userns_create hook called prior to creating a new user namespace.
With the introduction of userns capabilities, this effectively provides a simple way for LSMs to control the capabilities granted to a user namespace and all its descendants.
Update the selftests accordingly by dropping CAP_SYS_ADMIN in namespaces and checking the resulting task's bounding set.
Signed-off-by: Jonathan Calmels jcalmels@3xx0.net
include/linux/lsm_hook_defs.h | 2 +- include/linux/security.h | 4 +- kernel/bpf/bpf_lsm.c | 55 +++++++++++++++++++ security/apparmor/lsm.c | 2 +- security/security.c | 6 +- security/selinux/hooks.c | 2 +- .../selftests/bpf/prog_tests/deny_namespace.c | 12 ++-- .../selftests/bpf/progs/test_deny_namespace.c | 7 ++- 8 files changed, 76 insertions(+), 14 deletions(-)
I'm not sure we want to go down the path of a LSM modifying the POSIX capabilities of a task, other than the capabilities/commoncap LSM. It sets a bad precedent and could further complicate issues around LSM ordering.
Well unless I'm misunderstanding, this does allow modifying the capabilities/commoncap LSM through BTF. The reason for allowing `userns_create` to be modified is that it is functionally very similar to `cred_prepare` in that it operates with new creds (but specific to user namespaces because of reasons detailed in [1]).
yes
There were some concerns in previous threads that the userns caps by themselves wouldn't be granular enough, hence the LSM integration.
Ubuntu for example, currently has to resort to a hardcoded profile transition to achieve this [2].
The hard coded profile transition, is because the more generic solution as part of policy just wasn't ready. The hard coding will go away before it is upstreamed.
But yes, updating the cred really is necessary for the flexibility needed whether it is modifying the POSIX capabilities of the task or the LSM modifying its own security blob.
I do share some of Paul's concerns about the LSM modifying the POSIX capabilities of the task, but also thing the LSM here needs to be able to modify its own blob.
I have a very similar patch I was planning on posting once the work to fix the hard coded profile transition is done.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... [2] https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/noble/com...
On Tue, Jun 11, 2024 at 6:32 AM John Johansen john.johansen@canonical.com wrote:
On 6/11/24 01:09, Jonathan Calmels wrote:
On Sun, Jun 09, 2024 at 08:18:48PM GMT, Paul Moore wrote:
On Sun, Jun 9, 2024 at 6:40 AM Jonathan Calmels jcalmels@3xx0.net wrote:
This patch allows modifying the various capabilities of the struct cred in BPF-LSM hooks. More specifically, the userns_create hook called prior to creating a new user namespace.
With the introduction of userns capabilities, this effectively provides a simple way for LSMs to control the capabilities granted to a user namespace and all its descendants.
Update the selftests accordingly by dropping CAP_SYS_ADMIN in namespaces and checking the resulting task's bounding set.
Signed-off-by: Jonathan Calmels jcalmels@3xx0.net
include/linux/lsm_hook_defs.h | 2 +- include/linux/security.h | 4 +- kernel/bpf/bpf_lsm.c | 55 +++++++++++++++++++ security/apparmor/lsm.c | 2 +- security/security.c | 6 +- security/selinux/hooks.c | 2 +- .../selftests/bpf/prog_tests/deny_namespace.c | 12 ++-- .../selftests/bpf/progs/test_deny_namespace.c | 7 ++- 8 files changed, 76 insertions(+), 14 deletions(-)
I'm not sure we want to go down the path of a LSM modifying the POSIX capabilities of a task, other than the capabilities/commoncap LSM. It sets a bad precedent and could further complicate issues around LSM ordering.
Well unless I'm misunderstanding, this does allow modifying the capabilities/commoncap LSM through BTF. The reason for allowing `userns_create` to be modified is that it is functionally very similar to `cred_prepare` in that it operates with new creds (but specific to user namespaces because of reasons detailed in [1]).
yes
There were some concerns in previous threads that the userns caps by themselves wouldn't be granular enough, hence the LSM integration.
Ubuntu for example, currently has to resort to a hardcoded profile transition to achieve this [2].
The hard coded profile transition, is because the more generic solution as part of policy just wasn't ready. The hard coding will go away before it is upstreamed.
But yes, updating the cred really is necessary for the flexibility needed whether it is modifying the POSIX capabilities of the task or the LSM modifying its own security blob.
I do share some of Paul's concerns about the LSM modifying the POSIX capabilities of the task, but also thing the LSM here needs to be able to modify its own blob.
To be clear, this isn't about a generic LSM needing to update its own blob (LSM state), it is about the BPF LSM updating the capability sets. While we obviously must support a LSM updating its own state, I'm currently of the opinion that allowing one LSM to update the state of another LSM is only going to lead to problems. We wouldn't want to allow Smack to update AppArmor state, and from my current perspective allowing the BPF LSM to update the capability state is no different.
It's also important to keep in mind that if we allow one LSM to do something, we need to allow all LSMs to do something. If we allow multiple LSMs to manipulate the capability sets, how do we reconcile differences in the desired capability state? Does that resolution change depending on what LSMs are enabled at build time? Enabled at boot? Similarly, what about custom LSM ordering?
What about those LSMs that use a task's capabilities as an input to an access control decision? If those LSMs allow an access based on a given capability set only to have a LSM later in the ordering modify that capability set to something which would have resulted in an access denial, do we risk a security regression?
Our current approach to handling multiple LSMs is that each LSM is limited to modifying its own state, and I'm pretty confident that we stick to this model if we have any hope of preserving the sanity of the LSM layer as a whole. If you want to modify the capability set you need to do so within the confines of the capability LSM and/or modify the other related kernel subsystems (which I'm guessing will likely necessitate a change in the LSMs, but that avenue is very unclear if such an option even exists).
On Tue, Jun 11, 2024 at 03:01:01PM GMT, Paul Moore wrote:
On Tue, Jun 11, 2024 at 6:32 AM John Johansen john.johansen@canonical.com wrote:
On 6/11/24 01:09, Jonathan Calmels wrote:
On Sun, Jun 09, 2024 at 08:18:48PM GMT, Paul Moore wrote:
On Sun, Jun 9, 2024 at 6:40 AM Jonathan Calmels jcalmels@3xx0.net wrote:
This patch allows modifying the various capabilities of the struct cred in BPF-LSM hooks. More specifically, the userns_create hook called prior to creating a new user namespace.
With the introduction of userns capabilities, this effectively provides a simple way for LSMs to control the capabilities granted to a user namespace and all its descendants.
Update the selftests accordingly by dropping CAP_SYS_ADMIN in namespaces and checking the resulting task's bounding set.
Signed-off-by: Jonathan Calmels jcalmels@3xx0.net
include/linux/lsm_hook_defs.h | 2 +- include/linux/security.h | 4 +- kernel/bpf/bpf_lsm.c | 55 +++++++++++++++++++ security/apparmor/lsm.c | 2 +- security/security.c | 6 +- security/selinux/hooks.c | 2 +- .../selftests/bpf/prog_tests/deny_namespace.c | 12 ++-- .../selftests/bpf/progs/test_deny_namespace.c | 7 ++- 8 files changed, 76 insertions(+), 14 deletions(-)
I'm not sure we want to go down the path of a LSM modifying the POSIX capabilities of a task, other than the capabilities/commoncap LSM. It sets a bad precedent and could further complicate issues around LSM ordering.
Well unless I'm misunderstanding, this does allow modifying the capabilities/commoncap LSM through BTF. The reason for allowing `userns_create` to be modified is that it is functionally very similar to `cred_prepare` in that it operates with new creds (but specific to user namespaces because of reasons detailed in [1]).
yes
There were some concerns in previous threads that the userns caps by themselves wouldn't be granular enough, hence the LSM integration.
Ubuntu for example, currently has to resort to a hardcoded profile transition to achieve this [2].
The hard coded profile transition, is because the more generic solution as part of policy just wasn't ready. The hard coding will go away before it is upstreamed.
But yes, updating the cred really is necessary for the flexibility needed whether it is modifying the POSIX capabilities of the task or the LSM modifying its own security blob.
I do share some of Paul's concerns about the LSM modifying the POSIX capabilities of the task, but also thing the LSM here needs to be able to modify its own blob.
To be clear, this isn't about a generic LSM needing to update its own blob (LSM state), it is about the BPF LSM updating the capability sets. While we obviously must support a LSM updating its own state, I'm currently of the opinion that allowing one LSM to update the state of another LSM is only going to lead to problems. We wouldn't want to allow Smack to update AppArmor state, and from my current perspective allowing the BPF LSM to update the capability state is no different.
It's also important to keep in mind that if we allow one LSM to do something, we need to allow all LSMs to do something. If we allow multiple LSMs to manipulate the capability sets, how do we reconcile differences in the desired capability state? Does that resolution change depending on what LSMs are enabled at build time? Enabled at boot? Similarly, what about custom LSM ordering?
What about those LSMs that use a task's capabilities as an input to an access control decision? If those LSMs allow an access based on a given capability set only to have a LSM later in the ordering modify that capability set to something which would have resulted in an access denial, do we risk a security regression?
I understand the concerns, what I fail to understand however, is how is it any different from say the `cred_prepare` hook today?
Our current approach to handling multiple LSMs is that each LSM is limited to modifying its own state, and I'm pretty confident that we stick to this model if we have any hope of preserving the sanity of the LSM layer as a whole. If you want to modify the capability set you need to do so within the confines of the capability LSM and/or modify the other related kernel subsystems (which I'm guessing will likely necessitate a change in the LSMs, but that avenue is very unclear if such an option even exists).
What do you mean by "within the confines of the capability LSM" here?
Arguably, if we do want fine-grained userns policies, we need LSMs to influence the userns capset at some point. Regardless of how or where we do this, it will always be subject to some sort of ordering. We could come up with some rules to limit surprises (e.g. caps can only be dropped, only possible through some hooks, etc), but at the end of the day, something needs to have the final word when it comes to deciding what the creds should be.
On Tue, Jun 11, 2024 at 6:15 PM Jonathan Calmels jcalmels@3xx0.net wrote:
On Tue, Jun 11, 2024 at 03:01:01PM GMT, Paul Moore wrote:
On Tue, Jun 11, 2024 at 6:32 AM John Johansen john.johansen@canonical.com wrote:
On 6/11/24 01:09, Jonathan Calmels wrote:
On Sun, Jun 09, 2024 at 08:18:48PM GMT, Paul Moore wrote:
On Sun, Jun 9, 2024 at 6:40 AM Jonathan Calmels jcalmels@3xx0.net wrote:
This patch allows modifying the various capabilities of the struct cred in BPF-LSM hooks. More specifically, the userns_create hook called prior to creating a new user namespace.
With the introduction of userns capabilities, this effectively provides a simple way for LSMs to control the capabilities granted to a user namespace and all its descendants.
Update the selftests accordingly by dropping CAP_SYS_ADMIN in namespaces and checking the resulting task's bounding set.
Signed-off-by: Jonathan Calmels jcalmels@3xx0.net
include/linux/lsm_hook_defs.h | 2 +- include/linux/security.h | 4 +- kernel/bpf/bpf_lsm.c | 55 +++++++++++++++++++ security/apparmor/lsm.c | 2 +- security/security.c | 6 +- security/selinux/hooks.c | 2 +- .../selftests/bpf/prog_tests/deny_namespace.c | 12 ++-- .../selftests/bpf/progs/test_deny_namespace.c | 7 ++- 8 files changed, 76 insertions(+), 14 deletions(-)
I'm not sure we want to go down the path of a LSM modifying the POSIX capabilities of a task, other than the capabilities/commoncap LSM. It sets a bad precedent and could further complicate issues around LSM ordering.
Well unless I'm misunderstanding, this does allow modifying the capabilities/commoncap LSM through BTF. The reason for allowing `userns_create` to be modified is that it is functionally very similar to `cred_prepare` in that it operates with new creds (but specific to user namespaces because of reasons detailed in [1]).
yes
There were some concerns in previous threads that the userns caps by themselves wouldn't be granular enough, hence the LSM integration.
Ubuntu for example, currently has to resort to a hardcoded profile transition to achieve this [2].
The hard coded profile transition, is because the more generic solution as part of policy just wasn't ready. The hard coding will go away before it is upstreamed.
But yes, updating the cred really is necessary for the flexibility needed whether it is modifying the POSIX capabilities of the task or the LSM modifying its own security blob.
I do share some of Paul's concerns about the LSM modifying the POSIX capabilities of the task, but also thing the LSM here needs to be able to modify its own blob.
To be clear, this isn't about a generic LSM needing to update its own blob (LSM state), it is about the BPF LSM updating the capability sets. While we obviously must support a LSM updating its own state, I'm currently of the opinion that allowing one LSM to update the state of another LSM is only going to lead to problems. We wouldn't want to allow Smack to update AppArmor state, and from my current perspective allowing the BPF LSM to update the capability state is no different.
It's also important to keep in mind that if we allow one LSM to do something, we need to allow all LSMs to do something. If we allow multiple LSMs to manipulate the capability sets, how do we reconcile differences in the desired capability state? Does that resolution change depending on what LSMs are enabled at build time? Enabled at boot? Similarly, what about custom LSM ordering?
What about those LSMs that use a task's capabilities as an input to an access control decision? If those LSMs allow an access based on a given capability set only to have a LSM later in the ordering modify that capability set to something which would have resulted in an access denial, do we risk a security regression?
I understand the concerns, what I fail to understand however, is how is it any different from say the `cred_prepare` hook today?
The existing cred_prepare hooks only operate on their own small portion of the cred::security blob. What you are proposing would be the BPF LSM operating on the capability sets that it does not "own" (they belong to the capability LSM).
If you see that as a minor difference, please understand that if you skip past that you have all the issues I mentioned in my previous message to deal with.
Our current approach to handling multiple LSMs is that each LSM is limited to modifying its own state, and I'm pretty confident that we stick to this model if we have any hope of preserving the sanity of the LSM layer as a whole. If you want to modify the capability set you need to do so within the confines of the capability LSM and/or modify the other related kernel subsystems (which I'm guessing will likely necessitate a change in the LSMs, but that avenue is very unclear if such an option even exists).
What do you mean by "within the confines of the capability LSM" here?
Basically security/commoncap.c. One could make a lot of arguments about if it is, or isn't, a LSM, but commoncap.c registers LSM hooks which is pretty much the definition of a LSM from an implementation point of view.
Arguably, if we do want fine-grained userns policies, we need LSMs to influence the userns capset at some point.
One could always use, or develop, a LSM that offers additional controls around exercising capabilities. There are currently four in-tree LSMs, including the capabilities LSM, which supply a security_capable() hook that is used by the capability-based access controls in the kernel; all of these hook implementations work together within the LSM framework and provide an additional level of control/granularity beyond the existing capabilities.
On Tue, Jun 11, 2024 at 06:38:31PM GMT, Paul Moore wrote:
On Tue, Jun 11, 2024 at 6:15 PM Jonathan Calmels jcalmels@3xx0.net wrote:
On Tue, Jun 11, 2024 at 03:01:01PM GMT, Paul Moore wrote:
On Tue, Jun 11, 2024 at 6:32 AM John Johansen john.johansen@canonical.com wrote:
On 6/11/24 01:09, Jonathan Calmels wrote:
On Sun, Jun 09, 2024 at 08:18:48PM GMT, Paul Moore wrote:
On Sun, Jun 9, 2024 at 6:40 AM Jonathan Calmels jcalmels@3xx0.net wrote: > > This patch allows modifying the various capabilities of the struct cred > in BPF-LSM hooks. More specifically, the userns_create hook called > prior to creating a new user namespace. > > With the introduction of userns capabilities, this effectively provides > a simple way for LSMs to control the capabilities granted to a user > namespace and all its descendants. > > Update the selftests accordingly by dropping CAP_SYS_ADMIN in > namespaces and checking the resulting task's bounding set. > > Signed-off-by: Jonathan Calmels jcalmels@3xx0.net > --- > include/linux/lsm_hook_defs.h | 2 +- > include/linux/security.h | 4 +- > kernel/bpf/bpf_lsm.c | 55 +++++++++++++++++++ > security/apparmor/lsm.c | 2 +- > security/security.c | 6 +- > security/selinux/hooks.c | 2 +- > .../selftests/bpf/prog_tests/deny_namespace.c | 12 ++-- > .../selftests/bpf/progs/test_deny_namespace.c | 7 ++- > 8 files changed, 76 insertions(+), 14 deletions(-)
I'm not sure we want to go down the path of a LSM modifying the POSIX capabilities of a task, other than the capabilities/commoncap LSM. It sets a bad precedent and could further complicate issues around LSM ordering.
Well unless I'm misunderstanding, this does allow modifying the capabilities/commoncap LSM through BTF. The reason for allowing `userns_create` to be modified is that it is functionally very similar to `cred_prepare` in that it operates with new creds (but specific to user namespaces because of reasons detailed in [1]).
yes
There were some concerns in previous threads that the userns caps by themselves wouldn't be granular enough, hence the LSM integration.
Ubuntu for example, currently has to resort to a hardcoded profile transition to achieve this [2].
The hard coded profile transition, is because the more generic solution as part of policy just wasn't ready. The hard coding will go away before it is upstreamed.
But yes, updating the cred really is necessary for the flexibility needed whether it is modifying the POSIX capabilities of the task or the LSM modifying its own security blob.
I do share some of Paul's concerns about the LSM modifying the POSIX capabilities of the task, but also thing the LSM here needs to be able to modify its own blob.
To be clear, this isn't about a generic LSM needing to update its own blob (LSM state), it is about the BPF LSM updating the capability sets. While we obviously must support a LSM updating its own state, I'm currently of the opinion that allowing one LSM to update the state of another LSM is only going to lead to problems. We wouldn't want to allow Smack to update AppArmor state, and from my current perspective allowing the BPF LSM to update the capability state is no different.
It's also important to keep in mind that if we allow one LSM to do something, we need to allow all LSMs to do something. If we allow multiple LSMs to manipulate the capability sets, how do we reconcile differences in the desired capability state? Does that resolution change depending on what LSMs are enabled at build time? Enabled at boot? Similarly, what about custom LSM ordering?
What about those LSMs that use a task's capabilities as an input to an access control decision? If those LSMs allow an access based on a given capability set only to have a LSM later in the ordering modify that capability set to something which would have resulted in an access denial, do we risk a security regression?
I understand the concerns, what I fail to understand however, is how is it any different from say the `cred_prepare` hook today?
The existing cred_prepare hooks only operate on their own small portion of the cred::security blob. What you are proposing would be the BPF LSM operating on the capability sets that it does not "own" (they belong to the capability LSM).
If you see that as a minor difference, please understand that if you skip past that you have all the issues I mentioned in my previous message to deal with.
Our current approach to handling multiple LSMs is that each LSM is limited to modifying its own state, and I'm pretty confident that we stick to this model if we have any hope of preserving the sanity of the LSM layer as a whole. If you want to modify the capability set you need to do so within the confines of the capability LSM and/or modify the other related kernel subsystems (which I'm guessing will likely necessitate a change in the LSMs, but that avenue is very unclear if such an option even exists).
What do you mean by "within the confines of the capability LSM" here?
Basically security/commoncap.c. One could make a lot of arguments about if it is, or isn't, a LSM, but commoncap.c registers LSM hooks which is pretty much the definition of a LSM from an implementation point of view.
Yes, hence the proposal to give it more fine-grained controls than what's currently available. But to your point, unlike the others, its own state (i.e. capsets) is shared, so this gets questionable.
Arguably, if we do want fine-grained userns policies, we need LSMs to influence the userns capset at some point.
One could always use, or develop, a LSM that offers additional controls around exercising capabilities. There are currently four in-tree LSMs, including the capabilities LSM, which supply a security_capable() hook that is used by the capability-based access controls in the kernel; all of these hook implementations work together within the LSM framework and provide an additional level of control/granularity beyond the existing capabilities.
Right, but the idea was to have a simple and easy way to reuse/trigger as much of the commoncap one as possible from BPF. If we're saying we need to reimplement and/or use a whole new framework, then there is little value.
TBH, I don't feel strongly about this, which is why it is absent from v1. However, as John pointed out, we should at least be able to modify the blob if we want flexible userns caps policies down the road.
On Wed, Jun 12, 2024 at 4:15 AM Jonathan Calmels jcalmels@3xx0.net wrote:
On Tue, Jun 11, 2024 at 06:38:31PM GMT, Paul Moore wrote:
On Tue, Jun 11, 2024 at 6:15 PM Jonathan Calmels jcalmels@3xx0.net wrote:
...
Arguably, if we do want fine-grained userns policies, we need LSMs to influence the userns capset at some point.
One could always use, or develop, a LSM that offers additional controls around exercising capabilities. There are currently four in-tree LSMs, including the capabilities LSM, which supply a security_capable() hook that is used by the capability-based access controls in the kernel; all of these hook implementations work together within the LSM framework and provide an additional level of control/granularity beyond the existing capabilities.
Right, but the idea was to have a simple and easy way to reuse/trigger as much of the commoncap one as possible from BPF. If we're saying we need to reimplement and/or use a whole new framework, then there is little value.
I can appreciate how allowing direct manipulation of capability bits from a BPF LSM looks attractive, but my hope is that our discussion here revealed that as you look deeper into making it work there are a number of pitfalls which prevent this from being a safe option for generalized systems.
TBH, I don't feel strongly about this, which is why it is absent from v1. However, as John pointed out, we should at least be able to modify the blob if we want flexible userns caps policies down the road.
As discussed in this thread, there are existing ways to provide fine grained control over exercising capabilities that can be safely used within the LSM framework. I don't want to speak to what John is envisioning, but he should be aware of these mechanisms, and if I recall he did voice a level of concern about the same worries I mentioned.
I'm happy to discuss ways in which we can adjust the LSM hooks/layer to support different approaches to capability controls, but one LSM directly manipulating the state of another is going to be a no vote from me.
On 6/12/24 10:29, Paul Moore wrote:
On Wed, Jun 12, 2024 at 4:15 AM Jonathan Calmels jcalmels@3xx0.net wrote:
On Tue, Jun 11, 2024 at 06:38:31PM GMT, Paul Moore wrote:
On Tue, Jun 11, 2024 at 6:15 PM Jonathan Calmels jcalmels@3xx0.net wrote:
...
Arguably, if we do want fine-grained userns policies, we need LSMs to influence the userns capset at some point.
One could always use, or develop, a LSM that offers additional controls around exercising capabilities. There are currently four in-tree LSMs, including the capabilities LSM, which supply a security_capable() hook that is used by the capability-based access controls in the kernel; all of these hook implementations work together within the LSM framework and provide an additional level of control/granularity beyond the existing capabilities.
Right, but the idea was to have a simple and easy way to reuse/trigger as much of the commoncap one as possible from BPF. If we're saying we need to reimplement and/or use a whole new framework, then there is little value.
I can appreciate how allowing direct manipulation of capability bits from a BPF LSM looks attractive, but my hope is that our discussion here revealed that as you look deeper into making it work there are a number of pitfalls which prevent this from being a safe option for generalized systems.
TBH, I don't feel strongly about this, which is why it is absent from v1. However, as John pointed out, we should at least be able to modify the blob if we want flexible userns caps policies down the road.
As discussed in this thread, there are existing ways to provide fine grained control over exercising capabilities that can be safely used within the LSM framework. I don't want to speak to what John is envisioning, but he should be aware of these mechanisms, and if I recall he did voice a level of concern about the same worries I mentioned.
sorry, I should have been more clear. I envision LSMs being able to update their own state in the userns hook.
Basically the portion of the patch that removes const from the userns hook.
An LSM updating the capset is worrysome for all the reasons you pointed out, and I think a few more. I haven't had a chance to really look at v2 yet, so I didn't want to speak directly on the bpf part of the patch without first giving a good once over.
I'm happy to discuss ways in which we can adjust the LSM hooks/layer to support different approaches to capability controls, but one LSM directly manipulating the state of another is going to be a no vote from me.
I might not be as hard no as Paul here, I am always willing to listen to arguments, but it would have to be a really good argument to modify the capset, when there are multiple LSMs in play on a system.
On Wed, Jun 12, 2024 at 08:54:28PM GMT, John Johansen wrote:
On 6/12/24 10:29, Paul Moore wrote:
On Wed, Jun 12, 2024 at 4:15 AM Jonathan Calmels jcalmels@3xx0.net wrote:
On Tue, Jun 11, 2024 at 06:38:31PM GMT, Paul Moore wrote:
On Tue, Jun 11, 2024 at 6:15 PM Jonathan Calmels jcalmels@3xx0.net wrote:
...
Arguably, if we do want fine-grained userns policies, we need LSMs to influence the userns capset at some point.
One could always use, or develop, a LSM that offers additional controls around exercising capabilities. There are currently four in-tree LSMs, including the capabilities LSM, which supply a security_capable() hook that is used by the capability-based access controls in the kernel; all of these hook implementations work together within the LSM framework and provide an additional level of control/granularity beyond the existing capabilities.
Right, but the idea was to have a simple and easy way to reuse/trigger as much of the commoncap one as possible from BPF. If we're saying we need to reimplement and/or use a whole new framework, then there is little value.
I can appreciate how allowing direct manipulation of capability bits from a BPF LSM looks attractive, but my hope is that our discussion here revealed that as you look deeper into making it work there are a number of pitfalls which prevent this from being a safe option for generalized systems.
TBH, I don't feel strongly about this, which is why it is absent from v1. However, as John pointed out, we should at least be able to modify the blob if we want flexible userns caps policies down the road.
As discussed in this thread, there are existing ways to provide fine grained control over exercising capabilities that can be safely used within the LSM framework. I don't want to speak to what John is envisioning, but he should be aware of these mechanisms, and if I recall he did voice a level of concern about the same worries I mentioned.
sorry, I should have been more clear. I envision LSMs being able to update their own state in the userns hook.
Basically the portion of the patch that removes const from the userns hook.
Yes, pretty sure we'll need this regardless.
An LSM updating the capset is worrysome for all the reasons you pointed out, and I think a few more. I haven't had a chance to really look at v2 yet, so I didn't want to speak directly on the bpf part of the patch without first giving a good once over.
I'm happy to discuss ways in which we can adjust the LSM hooks/layer to support different approaches to capability controls, but one LSM directly manipulating the state of another is going to be a no vote from me.
I might not be as hard no as Paul here, I am always willing to listen to arguments, but it would have to be a really good argument to modify the capset, when there are multiple LSMs in play on a system.
The way I see it, it's more about enhancing the capability LSM with BPF hooks and have it modify its own state dynamically, not so much crosstalk between two distinct LSM frameworks (say one where the BPF LSM implements a lot of things like capable()).
In this context and with enough safeguards (say we only allow dropping caps) this could be a net positive. Sure, ordering could come into play in very specific scenarios, but at this point I would expect the admin/LSM author to be conscious about it.
If we think there is no way we can come up with something that's safe enough, and that the risks outweigh the benefits, fine by me, we can drop this patch from the series.
On Thu, Jun 13, 2024 at 4:45 AM Jonathan Calmels jcalmels@3xx0.net wrote:
On Wed, Jun 12, 2024 at 08:54:28PM GMT, John Johansen wrote:
On 6/12/24 10:29, Paul Moore wrote:
On Wed, Jun 12, 2024 at 4:15 AM Jonathan Calmels jcalmels@3xx0.net wrote:
On Tue, Jun 11, 2024 at 06:38:31PM GMT, Paul Moore wrote:
On Tue, Jun 11, 2024 at 6:15 PM Jonathan Calmels jcalmels@3xx0.net wrote:
...
Arguably, if we do want fine-grained userns policies, we need LSMs to influence the userns capset at some point.
One could always use, or develop, a LSM that offers additional controls around exercising capabilities. There are currently four in-tree LSMs, including the capabilities LSM, which supply a security_capable() hook that is used by the capability-based access controls in the kernel; all of these hook implementations work together within the LSM framework and provide an additional level of control/granularity beyond the existing capabilities.
Right, but the idea was to have a simple and easy way to reuse/trigger as much of the commoncap one as possible from BPF. If we're saying we need to reimplement and/or use a whole new framework, then there is little value.
I can appreciate how allowing direct manipulation of capability bits from a BPF LSM looks attractive, but my hope is that our discussion here revealed that as you look deeper into making it work there are a number of pitfalls which prevent this from being a safe option for generalized systems.
TBH, I don't feel strongly about this, which is why it is absent from v1. However, as John pointed out, we should at least be able to modify the blob if we want flexible userns caps policies down the road.
As discussed in this thread, there are existing ways to provide fine grained control over exercising capabilities that can be safely used within the LSM framework. I don't want to speak to what John is envisioning, but he should be aware of these mechanisms, and if I recall he did voice a level of concern about the same worries I mentioned.
sorry, I should have been more clear. I envision LSMs being able to update their own state in the userns hook.
Basically the portion of the patch that removes const from the userns hook.
Yes, pretty sure we'll need this regardless.
An LSM updating the capset is worrysome for all the reasons you pointed out, and I think a few more. I haven't had a chance to really look at v2 yet, so I didn't want to speak directly on the bpf part of the patch without first giving a good once over.
I'm happy to discuss ways in which we can adjust the LSM hooks/layer to support different approaches to capability controls, but one LSM directly manipulating the state of another is going to be a no vote from me.
I might not be as hard no as Paul here, I am always willing to listen to arguments, but it would have to be a really good argument to modify the capset, when there are multiple LSMs in play on a system.
The way I see it, it's more about enhancing the capability LSM with BPF hooks and have it modify its own state dynamically, not so much crosstalk between two distinct LSM frameworks (say one where the BPF LSM implements a lot of things like capable()).
As I mentioned previously, if you want to do something with the capability sets you simply need to do it within the confines of security/commoncap.c. If you're really set on the "MUST BE BPF!" way of life, and you can convince Serge (capabilities maintainer) that it would be a good idea, you could propose a dedicated BPF hook within the capabilities LSM. I'm not sure how wise that would be, but it would resolve a lot of the LSM ordering/stacking issues that we've discussed.
If we think there is no way we can come up with something that's safe enough, and that the risks outweigh the benefits, fine by me, we can drop this patch from the series.
To be clear, this patch is not acceptable at this point in time. With the understanding that I haven't looked that closely at the rest of the patchset, it looks fairly well contained to the capabilities code which means it is largely up to Serge, not me.
I will mention that you should update the audit code to recognize the new capability set, look at kernel/auditsc.c for more information.
On Thu, Jun 13, 2024 at 01:50:29AM -0700, Jonathan Calmels wrote:
On Wed, Jun 12, 2024 at 08:54:28PM GMT, John Johansen wrote:
On 6/12/24 10:29, Paul Moore wrote:
On Wed, Jun 12, 2024 at 4:15 AM Jonathan Calmels jcalmels@3xx0.net wrote:
On Tue, Jun 11, 2024 at 06:38:31PM GMT, Paul Moore wrote:
On Tue, Jun 11, 2024 at 6:15 PM Jonathan Calmels jcalmels@3xx0.net wrote:
...
Arguably, if we do want fine-grained userns policies, we need LSMs to influence the userns capset at some point.
One could always use, or develop, a LSM that offers additional controls around exercising capabilities. There are currently four in-tree LSMs, including the capabilities LSM, which supply a security_capable() hook that is used by the capability-based access controls in the kernel; all of these hook implementations work together within the LSM framework and provide an additional level of control/granularity beyond the existing capabilities.
Right, but the idea was to have a simple and easy way to reuse/trigger as much of the commoncap one as possible from BPF. If we're saying we need to reimplement and/or use a whole new framework, then there is little value.
I can appreciate how allowing direct manipulation of capability bits from a BPF LSM looks attractive, but my hope is that our discussion here revealed that as you look deeper into making it work there are a number of pitfalls which prevent this from being a safe option for generalized systems.
TBH, I don't feel strongly about this, which is why it is absent from v1. However, as John pointed out, we should at least be able to modify the blob if we want flexible userns caps policies down the road.
As discussed in this thread, there are existing ways to provide fine grained control over exercising capabilities that can be safely used within the LSM framework. I don't want to speak to what John is envisioning, but he should be aware of these mechanisms, and if I recall he did voice a level of concern about the same worries I mentioned.
sorry, I should have been more clear. I envision LSMs being able to update their own state in the userns hook.
Basically the portion of the patch that removes const from the userns hook.
Yes, pretty sure we'll need this regardless.
An LSM updating the capset is worrysome for all the reasons you pointed out, and I think a few more. I haven't had a chance to really look at v2 yet, so I didn't want to speak directly on the bpf part of the patch without first giving a good once over.
I'm happy to discuss ways in which we can adjust the LSM hooks/layer to support different approaches to capability controls, but one LSM directly manipulating the state of another is going to be a no vote from me.
I might not be as hard no as Paul here, I am always willing to listen to arguments, but it would have to be a really good argument to modify the capset, when there are multiple LSMs in play on a system.
The way I see it, it's more about enhancing the capability LSM with BPF hooks and have it modify its own state dynamically, not so much crosstalk between two distinct LSM frameworks (say one where the BPF LSM implements a lot of things like capable()).
In this context and with enough safeguards (say we only allow dropping caps) this could be a net positive. Sure, ordering could come into play in very specific scenarios, but at this point I would expect the admin/LSM author to be conscious about it.
If we think there is no way we can come up with something that's safe enough, and that the risks outweigh the benefits, fine by me, we can drop this patch from the series.
I think pursuing patches 1-3 now, and punting on 4 until later, would be great.
On Wed, Jun 12, 2024 at 08:54:28PM -0700, John Johansen wrote:
Good morning, I hope the day is going well for everyone.
On 6/12/24 10:29, Paul Moore wrote:
On Wed, Jun 12, 2024 at 4:15???AM Jonathan Calmels jcalmels@3xx0.net wrote:
On Tue, Jun 11, 2024 at 06:38:31PM GMT, Paul Moore wrote:
On Tue, Jun 11, 2024 at 6:15???PM Jonathan Calmels jcalmels@3xx0.net wrote:
...
Arguably, if we do want fine-grained userns policies, we need LSMs to influence the userns capset at some point.
One could always use, or develop, a LSM that offers additional controls around exercising capabilities. There are currently four in-tree LSMs, including the capabilities LSM, which supply a security_capable() hook that is used by the capability-based access controls in the kernel; all of these hook implementations work together within the LSM framework and provide an additional level of control/granularity beyond the existing capabilities.
Right, but the idea was to have a simple and easy way to reuse/trigger as much of the commoncap one as possible from BPF. If we're saying we need to reimplement and/or use a whole new framework, then there is little value.
I can appreciate how allowing direct manipulation of capability bits from a BPF LSM looks attractive, but my hope is that our discussion here revealed that as you look deeper into making it work there are a number of pitfalls which prevent this from being a safe option for generalized systems.
TBH, I don't feel strongly about this, which is why it is absent from v1. However, as John pointed out, we should at least be able to modify the blob if we want flexible userns caps policies down the road.
As discussed in this thread, there are existing ways to provide fine grained control over exercising capabilities that can be safely used within the LSM framework. I don't want to speak to what John is envisioning, but he should be aware of these mechanisms, and if I recall he did voice a level of concern about the same worries I mentioned.
sorry, I should have been more clear. I envision LSMs being able to update their own state in the userns hook.
Basically the portion of the patch that removes const from the userns hook.
An LSM updating the capset is worrysome for all the reasons you pointed out, and I think a few more. I haven't had a chance to really look at v2 yet, so I didn't want to speak directly on the bpf part of the patch without first giving a good once over.
I'm happy to discuss ways in which we can adjust the LSM hooks/layer to support different approaches to capability controls, but one LSM directly manipulating the state of another is going to be a no vote from me.
I might not be as hard no as Paul here, I am always willing to listen to arguments, but it would have to be a really good argument to modify the capset, when there are multiple LSMs in play on a system.
Putting my pragmatic operations hat on, it isn't just the impact on multiple LSM's.
The security vendors, CrowdStrike's Falcon comes immediately to mind, are installing BPF hooks as part of their agent systems.
Given that the issue of signing BPF programs is still an open question, allowing the ability of a BPF program to modify the security capabilities of a process opens the door to supply chain attacks that would seem unbounded in scope.
On the other side of the fence, installing a BPF program is a privileged operation. If a decision is made to allow that kind of privilege on a system the argument can be made that you get to keep both pieces.
Of course that needs to be paired against the fact that system's administrators are not given any choice as to the wisdom of that type of permission being afforded to security applications.
Best wishes for a productive remainder of the week.
As always, Dr. Greg
The Quixote Project - Flailing at the Travails of Cybersecurity https://github.com/Quixote-Project
On Wed, Jun 12, 2024 at 11:54 PM John Johansen john.johansen@canonical.com wrote:
On 6/12/24 10:29, Paul Moore wrote:
On Wed, Jun 12, 2024 at 4:15 AM Jonathan Calmels jcalmels@3xx0.net wrote:
On Tue, Jun 11, 2024 at 06:38:31PM GMT, Paul Moore wrote:
On Tue, Jun 11, 2024 at 6:15 PM Jonathan Calmels jcalmels@3xx0.net wrote:
...
Arguably, if we do want fine-grained userns policies, we need LSMs to influence the userns capset at some point.
One could always use, or develop, a LSM that offers additional controls around exercising capabilities. There are currently four in-tree LSMs, including the capabilities LSM, which supply a security_capable() hook that is used by the capability-based access controls in the kernel; all of these hook implementations work together within the LSM framework and provide an additional level of control/granularity beyond the existing capabilities.
Right, but the idea was to have a simple and easy way to reuse/trigger as much of the commoncap one as possible from BPF. If we're saying we need to reimplement and/or use a whole new framework, then there is little value.
I can appreciate how allowing direct manipulation of capability bits from a BPF LSM looks attractive, but my hope is that our discussion here revealed that as you look deeper into making it work there are a number of pitfalls which prevent this from being a safe option for generalized systems.
TBH, I don't feel strongly about this, which is why it is absent from v1. However, as John pointed out, we should at least be able to modify the blob if we want flexible userns caps policies down the road.
As discussed in this thread, there are existing ways to provide fine grained control over exercising capabilities that can be safely used within the LSM framework. I don't want to speak to what John is envisioning, but he should be aware of these mechanisms, and if I recall he did voice a level of concern about the same worries I mentioned.
sorry, I should have been more clear. I envision LSMs being able to update their own state in the userns hook.
Ah, okay, yes, that seems reasonable; although like any other change, until we have an in-tree user we should just leave it as-is.
On Sun, Jun 09, 2024 at 03:43:33AM -0700, Jonathan Calmels wrote:
This patch series introduces a new user namespace capability set, as well as some plumbing around it (i.e. sysctl, secbit, lsm support).
First patch goes over the motivations for this as well as prior art.
In summary, while user namespaces are a great success today in that they avoid running a lot of code as root, they also expand the attack surface of the kernel substantially which is often abused by attackers. Methods exist to limit the creation of such namespaces [1], however, application developers often need to assume that user namespaces are available for various tasks such as sandboxing. Thus, instead of restricting the creation of user namespaces, we offer ways for userspace to limit the capabilities granted to them.
Why a new capability set and not something specific to the userns (e.g. ioctl_ns)?
1. We can't really expect userspace to patch every single callsite and opt-in this new security mechanism. 2. We don't necessarily want policies enforced at said callsites. For example a service like systemd-machined or a PAM session need to be able to place restrictions on any namespace spawned under it. 3. We would need to come up with inheritance rules, querying capabilities, etc. At this point we're just reinventing capability sets. 4. We can easily define interactions between capability sets, thus helping with adoption (patch 2 is an example of this)
Some examples of how this could be leveraged in userspace:
- Prevent user from getting CAP_NET_ADMIN in user namespaces under SSH: echo "auth optional pam_cap.so" >> /etc/pam.d/sshd echo "!cap_net_admin $USER" >> /etc/security/capability.conf capsh --secbits=$((1 << 8)) -- -c /usr/sbin/sshd - Prevent containers from ever getting CAP_DAC_OVERRIDE: systemd-run -p CapabilityBoundingSet=~CAP_DAC_OVERRIDE \ -p SecureBits=userns-strict-caps \ /usr/bin/dockerd systemd-run -p UserNSCapabilities=~CAP_DAC_OVERRIDE \ /usr/bin/incusd - Kernel could be vulnerable to CAP_SYS_RAWIO exploits, prevent it: sysctl -w cap_bound_userns_mask=0x1fffffdffff - Drop CAP_SYS_ADMIN for this shell and all the user namespaces below it: bwrap --unshare-user --cap-drop CAP_SYS_ADMIN /bin/sh
Where are the tests for this patchset? I see you updated the bpf tests for the bpf lsm bits, but there's nothing to validate this new behavior or exercise the new ioctl you've added. Thanks,
Josef
On Mon, Jun 10, 2024 at 04:12:27PM GMT, Josef Bacik wrote:
Where are the tests for this patchset? I see you updated the bpf tests for the bpf lsm bits, but there's nothing to validate this new behavior or exercise the new ioctl you've added. Thanks,
Apologies, I haven't had much time to spend on it so I prioritized the rest. But yes, we should certainly update the capabilities selftests once we agreed on the different behaviors.
linux-kselftest-mirror@lists.linaro.org