On Tue, Aug 1, 2023 at 6:05 PM Aleksa Sarai cyphar@cyphar.com wrote:
This thread is getting longer with different topics, I will try to respond with trimmed interleaved replies [1] There are 3 topics (logging/'migration/ratcheting), this response will be regarding ratcheting.
[1] https://www.kernel.org/doc/html/latest/process/submitting-patches.html?highl...
- The ratcheting mechanism for vm.memfd_noexec doesn't make sense as a security mechanism because a CAP_SYS_ADMIN capable user can create executable binaries in a hidden tmpfs very easily, not to mention the many other things they can do.
By further limiting CAP_SYS_ADMIN, an attacker can't modify this sysctl even after compromising some system service with high privilege, YAMA has the same approach for ptrace_scope=3
Personally, I also think this behaviour from YAMA is a little goofy too, but given that it only locks the most extreme setting and there is no way to get around the most extreme setting, I guess it makes some sense (not to mention it's an LSM and so there is an argument that it should be possible to lock out privileged users from modifying it). There are many other security sysctls, and very few have this behaviour because it doesn't make much sense in most cases.
In addition, this sysctl is pid_name spaced, this means child pid_namespace will alway have the same or stricter security setting than its parent, this allows admin to maintain a tree like view. If we allow the child pid namespace to elevate its setting, then the system-wide setting is no longer meaningful.
"no longer meaningful" is too strong of a statement imho. It is still useful for constraining non-root processes and presumably ChromeOS disallows random processes to do CLONE_NEWUSER (otherwise the protection of this sysctl is pointless) so in practice for ChromeOS there is no change in the attack surface.
(FWIW, I think tying this to the user namespace would've made more sense since this is about privilege restrictions, but that ship has sailed.)
The reason that this sysctl is a PID namespace is that I hope a container and host can have different sysctl values, e.g. host will allow runc's use of X mfd, while a container doesn't want X mfd. . To clarify what you meant, do you mean this: when a container is in its own pid_namespace, and has "=2", the programs inside the container can still use CLONE_NEWUSER to break out "=2" ?
With the current implementation, this is not possible. My point was that even if it were possible to lower the sysctl, ChromeOS presumably already blocks the operations that a user would be able to use to create a memfd (an unprivileged user cannot CLONE_NEWPID to modify the sysctl without CLONE_NEWUSER, which is presumably blocked on ChromeOS due to the other security concerns).
The code sample shared in this patch set indicates that the attacker already has the ability of creating tmpfs and executing complex steps, at that point, it doesn't matter if the code execution is from memfd or not. For a safe by default system such as ChromeOS, attackers won't easily run arbitrary code, memfd is one of the open doors for that, so we are disabling executable memfd in ChromeOS. In other words: if an attacker can already execute the arbitrary code as sample given in ChromeOS, without using executable memfd, then memfd is no longer the thing we need to worry about, the arbitrary code execution is already achieved by the attacker. Even though I use ChromeOS as an example, I think the same type of threat model applies to any system that wants to disable executable memfd entirely.
I understand the threat model this sysctl is blocking, my point is that blocking CAP_SYS_ADMIN from modifying the setting doesn't make sense from that threat model. An attacker that manages to trick some process into creating a memfd with an executable payload is not going to be able to change the sysctl setting (unless there's a confused deputy with CAP_SYS_ADMIN, in which case you have much bigger issues).
It is the reverse. An attacker that manages to trick some CAP_SYSADMIN processes into changing this sysctl value (i.e. lower the setting to 0 if no ratcheting), will be able to continue to use mfd as part of the attack chain. In chromeOS, an attacker that can change sysctl might not necessarily gain full arbitrary code execution already. As I mentioned previously, the main threat model here is to prevent arbitrary code execution through mfd. If an attacker already gains arbitrary code execution, at that point, we no longer worry about mfd.
If an attacker can trick a privileged process into writing to arbitrary sysctls, the system has much bigger issues than arbitrary (presumably unprivileged) code execution. On the other hand, requiring you to reboot a server due to a misconfigured sysctl *is* broken.
Again, at the very least, not even allowing capable(CAP_SYS_ADMIN) to change the setting is actually broken.
If a CAP_SYS_ADMIN-capable user wants to change the sysctl, blocking it doesn't add any security because that process could create a memfd-like fd to execute without issues. What practical attack does this ratcheting mechanism protect against? (This is a question you can answer with the YAMA sysctl, but not this one AFAICS.)
But even if you feel that allowing this in child user namespaces is unsafe or undesirable, it's absolutely necessary that capable(CAP_SYS_ADMIN) should be able to un-brick the running system by changing the sysctl. The alternative is that you need to reboot your server in order to un-set a sysctl that broke some application you run.
Also, by the same token, this ratcheting mechanism doesn't make sense with =1 *at all* because it could break programs in a way that would require a reboot but it's not a "security setting" (and the YAMA sysctl mentioned only locks the sysctl at the highest setting).
I think a system should use "=0" when it is unsure about its program's need or not need executable memfd. Technically, it is not that this sysctl breaks the user, but the admin made the mistake to set the wrong sysctl value, and an admin should know what they are doing for a sysctl. Yes. rebooting increases the steps to undo the mistake, but that could be an incentive for the admin to fully test its programs before turning on this sysctl - and avoid unexpected runtime errors.
I don't think this stance is really acceptable -- if an admin that has privileges to load kernel modules is not able to disable a sysctl that can break working programs without rebooting there is
When this sysctl was first proposed a few years ago (when kernel folks found out that runc was using executable memfds), my understanding is that the long-term goal was to switch programs to have non-executable-memfds by default on most distributions. Making it impossible for an admin to lower the sysctl value flies in the face of this goal.
At the very least, being unable to lower the sysctl from =1 to =0 is just broken (even if you use the yama example -- yama only locks the sysctl at highest possible setting, not on lower settings). But in my view, having this sysctl ratchet at all doesn't make sense.
To reiterate/summarize the current mechanism for vm.memfd_noexec
1> It is a pid namespace sysctl, init ns and child pid ns can have different setting values. 2> child pid ns inherits parent's pid ns's sysctl at the time of fork. 3> There are 3 values for the sysctl, each higher value is more restrictive than the lower one. Once set, doesn't allow downgrading.
It can be used as following: 1> init ns: vm.memfd_noexec = 2 (at boot time) Not allow executable memfd for the entire system, including its containers.
2> init ns: vm.memfd_noexec = 0 or 1 container (child init namespace) vm.memfd_noexec = 2. The host allows runc's usage of executable memfd during container creation. Inside the container, executable memfd is not allowed.
The inherence + not allow downgrading is to reason with how vm.memfd_noexec is applied in the process tree. Without it, essentially we are losing the hierarchy view across the process tree and a process can evaluate its capability by modifying the setting. I think that is a less secure approach I would not prefer.
Thanks
-Jeff