On 2023-08-02, Jeff Xu <jeffxu@chromium.org> wrote:
- vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls because it will make it far too difficult to ever migrate. Instead it should imply MFD_NOEXEC_SEAL.
Though the purpose of memfd_noexec=2 is not to help with migration, but to disable creation of executable memfds for the current system/pid namespace. During the migration, vm.memfd_noexec=1 helps as a temporary measure by overwriting the flags for unmigrated user code.
My point is that the current behaviour for =2 means that nobody other than *maybe* ChromeOS will ever be able to use it because it requires auditing every program on the system. In fact, it's possible even ChromeOS will run into issues given that one of the arguments made for the nosymfollow mount option was that auditing all of ChromeOS to replace every open with RESOLVE_NO_SYMLINKS would be too much effort[1] (which I agreed with). Maybe this is less of an issue with memfd_create(2) (which is much newer than open(2)) but it still seems like a lot of busy work when the =1 behaviour is entirely sane even in the strict threat model that =2 is trying to protect against.
It can also be a container (one that has all of its memfd_create callers migrated to the new API).
If ChromeOS would struggle to rewrite all of the libraries they use, containers are in even worse shape -- most container users don't have a complete list of every package installed in a container, let alone the ability to audit whether they pass a (no-op) flag to memfd_create(2) in every codepath.
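To make concrete what "migrating" a single call site involves, here is a minimal sketch. It assumes the flag values from <linux/memfd.h>; the helper name create_noexec_memfd and its injectable create parameter are hypothetical and exist only so the fallback path can be exercised without a real kernel.

```python
import errno
import os

# Flag values as defined in <linux/memfd.h>, spelled out here rather than
# relying on the os module to expose them.
MFD_CLOEXEC = 0x0001
MFD_NOEXEC_SEAL = 0x0008  # added in Linux 6.3


def create_noexec_memfd(name, create=None):
    """Create a non-executable memfd, retrying old-style on older kernels.

    Hypothetical helper for illustration; `create` defaults to
    os.memfd_create and can be replaced by a stand-in.
    """
    if create is None:
        create = os.memfd_create  # Linux-only, so resolved lazily
    try:
        # New-style call: explicitly ask for a non-executable, sealed memfd.
        return create(name, MFD_CLOEXEC | MFD_NOEXEC_SEAL)
    except OSError as exc:
        if exc.errno != errno.EINVAL:
            raise
        # Kernels that predate the flag reject it with EINVAL, so fall
        # back to the old-style call without it.
        return create(name, MFD_CLOEXEC)
```

The change per call site is small, but it has to be made (or at least audited) in every library that calls memfd_create(2), which is the cost being argued about here.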
One option I considered previously was that "=2" would do overwrite+block, and "=3" would just block. But then I worry that applications won't have motivation to ever change their existing code, and the setting will forever stay at "=2", making "=3" even less likely to ever be used system wide.
What is the downside of overwriting? Backwards-compatibility is a very important part of Linux -- being able to use old programs without having to modify them is incredibly important. Yes, this behaviour is opt-in -- but I don't see the point of making opting in more difficult than necessary. Surely overwrite+block provides the security guarantee you need from the threat model -- otherwise nobody will be able to use block because you never know if one library will call memfd_create() "incorrectly" without the new flags.
If you want to block syscalls that don't explicitly pass NOEXEC_SEAL, there are several tools for doing this (both seccomp and LSM hooks).
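The difference between the two candidate behaviours for =2 can be written down as a small model. This is a simplified sketch of the semantics as described in this thread, not kernel code: the function name, the overwrite_in_2 parameter, and the choice of Python exceptions to stand in for syscall errors are all assumptions made for illustration.

```python
# Flag values as defined in <linux/memfd.h>.
MFD_NOEXEC_SEAL = 0x0008
MFD_EXEC = 0x0010


def effective_flags(flags, sysctl, overwrite_in_2):
    """Return the flags a memfd would end up with, or raise on rejection.

    sysctl is the vm.memfd_noexec value (0, 1 or 2). overwrite_in_2 picks
    between the two =2 behaviours under discussion: True models
    overwrite+block, False models block-only.
    """
    if sysctl not in (0, 1, 2):
        raise ValueError("unknown vm.memfd_noexec value")
    if flags & MFD_EXEC and flags & MFD_NOEXEC_SEAL:
        raise ValueError("MFD_EXEC and MFD_NOEXEC_SEAL are mutually exclusive")

    if not flags & (MFD_EXEC | MFD_NOEXEC_SEAL):
        # Old-style call: the caller passed neither flag.
        if sysctl == 0:
            flags |= MFD_EXEC
        elif sysctl == 1 or overwrite_in_2:
            flags |= MFD_NOEXEC_SEAL
        else:
            raise PermissionError("old-style call rejected by block-only =2")

    if sysctl == 2 and not flags & MFD_NOEXEC_SEAL:
        # Both =2 variants refuse to create an executable memfd.
        raise PermissionError("executable memfd rejected by =2")
    return flags
```

In this model an explicit MFD_EXEC request is refused under either variant of =2, so the guarantee "no executable memfds" holds in both. The only observable difference is whether an unmigrated caller passing neither flag gets a non-executable memfd or an error.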
Additional functionality/features should be implemented through security hook and LSM, not sysctl, I think.
This issue with =2 cannot be fixed in an LSM. (On the other hand, you could implement either =2 behaviour with an LSM using =1, and the current strict =2 behaviour could be implemented purely with seccomp.)
By migration, I mean a system that is not fully migrated; such a system should just use "=0" or "=1". Additional features can be implemented in SELinux/Landlock/other LSM by a motivated dev, e.g. if a system wants to limit executable memfds to specific programs or fully disable them. "=2" is for a system/container that is fully migrated; in that case, SELinux/Landlock/LSM can do the same, but sysctl provides a convenient alternative. Yes, seccomp provides a similar mechanism. Indeed, combining "=1" and seccomp (blocking MFD_EXEC) will overwrite + block executable memfds, which is essentially what you want, iiuc. However, I do not wish to have this implemented in the kernel, because I want the kernel to get out of the business of "overwriting" eventually.
See my above comments -- "overwriting" is perfectly acceptable to me. There's also no way to "get out of the business of overwriting" -- Linux has strict backwards compatibility requirements.
I agree, if we weigh the short-term goal of letting user space applications do the minimum, then having a 4-state sysctl (or 2 sysctls, one controlling overwrite and one disabling/enabling executable memfds) will do. But with that approach, I'm afraid that in some version of the future (say in 20 years), most applications will stay with the old API style of memfd_create, not setting the NX bit. With the current approach, it might seem less convenient, but I hope it offers a bit of incentive for applications to migrate their code towards the new API, explicitly setting the NX bit. I understand this hope is questionable, and we might still end up in the same place in 20 years, but at least I tried :-). I will leave this decision to the maintainers when you supply patches for that, and I wouldn't feel bad either way; there are valid reasons on both sides.
People will not switch =2 on if it has the possibility of breaking existing programs that are doing nothing wrong by not passing a noop flag.
In 20 years at best you would have =1 in widespread use because the rewriting behaviour is what users expect of kernel uAPIs. They expect old programs to work without modifying them if they aren't doing anything wrong. A uAPI knob that requires every userspace program to change before you can safely enable it (especially because it ratchets in a way that makes it dangerous to enable on production machines) will simply not be used.
If the goal is to get programs to update (which it seems it is), having a knob that nobody will turn on doesn't help. Doing proper warning logging is the way to get userspace to switch -- userspace usually notices when their programs trigger warnings in dmesg.
To supplement, there are two other ways to get what you want: 1> use seccomp to block MFD_EXEC, and leave the setting at 1.
I made this point in an earlier mail.
However my point is that =2 is not an acceptable uAPI, and if you want something that looks like =2 you can implement that with seccomp too!
In fact, the key difference is that you cannot implement the rewriting easily in seccomp -- you would need to install a seccomp_notify monitor that does nothing but rewrite syscall arguments. This would be equivalent to running the entire system under GDB to work around a uAPI flaw.
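As an illustration of why blocking is easy in seccomp and rewriting is not, below is a sketch of a classic-BPF filter that refuses memfd_create() calls carrying MFD_EXEC. To keep it self-contained, the filter is evaluated by a tiny pure-Python interpreter rather than installed in a kernel. Assumptions: the x86_64 syscall number, a little-endian struct seccomp_data layout, an arbitrary choice of EACCES as the returned error, and no architecture check (a real filter should have one). The names FILTER, seccomp_data and run_filter are hypothetical.

```python
import struct

# Classic BPF opcode pieces (values from <linux/bpf_common.h>).
BPF_LD, BPF_JMP, BPF_RET = 0x00, 0x05, 0x06
BPF_W, BPF_ABS, BPF_K = 0x00, 0x20, 0x00
BPF_JEQ, BPF_JSET = 0x10, 0x40

# Seccomp return actions (values from <linux/seccomp.h>).
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000

EACCES = 13
MFD_EXEC = 0x0010
NR_MEMFD_CREATE = 319        # assumption: x86_64 syscall number
AUDIT_ARCH_X86_64 = 0xC000003E

OFF_NR = 0                   # offsetof(struct seccomp_data, nr)
OFF_ARG1_LO = 24             # low 32 bits of args[1] (little-endian)

# Each entry is (code, jt, jf, k), mirroring struct sock_filter.
FILTER = [
    (BPF_LD | BPF_W | BPF_ABS, 0, 0, OFF_NR),
    (BPF_JMP | BPF_JEQ | BPF_K, 0, 3, NR_MEMFD_CREATE),  # other syscall: allow
    (BPF_LD | BPF_W | BPF_ABS, 0, 0, OFF_ARG1_LO),       # load the flags argument
    (BPF_JMP | BPF_JSET | BPF_K, 0, 1, MFD_EXEC),        # MFD_EXEC bit set?
    (BPF_RET | BPF_K, 0, 0, SECCOMP_RET_ERRNO | EACCES),
    (BPF_RET | BPF_K, 0, 0, SECCOMP_RET_ALLOW),
]


def seccomp_data(nr, args, arch=AUDIT_ARCH_X86_64):
    """Pack nr, arch, instruction pointer and six args like the kernel does."""
    args = list(args) + [0] * (6 - len(args))
    return struct.pack("<iIQ6Q", nr, arch, 0, *args)


def run_filter(prog, data):
    """Evaluate the handful of cBPF instructions used by FILTER."""
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD | BPF_W | BPF_ABS:
            acc = struct.unpack_from("<I", data, k)[0]
        elif code == BPF_JMP | BPF_JEQ | BPF_K:
            pc += jt if acc == k else jf
        elif code == BPF_JMP | BPF_JSET | BPF_K:
            pc += jt if acc & k else jf
        elif code == BPF_RET | BPF_K:
            return k  # a verdict is the only thing a filter can produce
        else:
            raise ValueError("instruction not supported by this sketch")
```

The program can inspect the flags and return a verdict, but nothing in it can change the flags the kernel will see. Combined with =1, an old-style call passes the filter and the kernel applies the non-executable default, while an explicit MFD_EXEC is refused.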
2> implement the blocking using a security hook and LSM, imo, which is probably the most common way to deal with this type of request (block something).
The issue is not the blocking, it's the rewriting.