- vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls because it will make it far too difficult to ever migrate. Instead it should imply MFD_NOEXEC_SEAL.
But the purpose of memfd_noexec=2 is not to help with migration; it is to disable creation of executable memfds for the current system/pid namespace. During migration, vm.memfd_noexec=1 helps by overwriting the flags for unmigrated user code as a temporary measure.
My point is that the current behaviour for =2 means that nobody other than *maybe* ChromeOS will ever be able to use it because it requires auditing every program on the system. In fact, it's possible even ChromeOS will run into issues given that one of the arguments made for the nosymfollow mount option was that auditing all of ChromeOS to replace every open with RESOLVE_NO_SYMLINKS would be too much effort[1] (which I agreed with). Maybe this is less of an issue with memfd_create(2) (which is much newer than open(2)) but it still seems like a lot of busy work when the =1 behaviour is entirely sane even in the strict threat model that =2 is trying to protect against.
It can also be a container (one that has had all of its memfd_create calls migrated to the new API).
If ChromeOS would struggle to rewrite all of the libraries they use, containers are in even worse shape -- most container users don't have a complete list of every package installed in a container, let alone the ability to audit whether they pass a (no-op) flag to memfd_create(2) in every codepath.
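For concreteness, this is roughly what a "migrated" call site looks like. It is a minimal sketch, not code from any audited program: the helper name create_noexec_memfd is hypothetical, the flag value is copied from <linux/memfd.h>, and the EINVAL fallback assumes an older kernel that rejects flags it does not know about.

```python
import errno
import os

# Value from <linux/memfd.h>; defined here because the os module is not
# assumed to export it.
MFD_NOEXEC_SEAL = 0x0008


def create_noexec_memfd(name):
    """Hypothetical helper: request a non-executable memfd explicitly."""
    try:
        return os.memfd_create(name, os.MFD_CLOEXEC | MFD_NOEXEC_SEAL)
    except OSError as exc:
        if exc.errno != errno.EINVAL:
            raise
        # Assumption: EINVAL means the kernel predates MFD_NOEXEC_SEAL,
        # so retry with the old-style call.
        return os.memfd_create(name, os.MFD_CLOEXEC)
```

Every memfd_create(2) call in every library would need the equivalent of this change, which is the audit burden described above.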
One option I considered previously was that "=2" would do overwrite+block, and "=3" would just block. But then I worry that applications won't have any motivation to ever change their existing code, and the setting will stay at "=2" forever, making it even less likely that "=3" could ever be used system-wide.
What is the downside of overwriting? Backwards-compatibility is a very important part of Linux -- being able to use old programs without having to modify them is incredibly important. Yes, this behaviour is opt-in -- but I don't see the point of making opting in more difficult than necessary. Surely overwrite+block provides the security guarantee you need from the threat model -- otherwise nobody will be able to use block because you never know if one library will call memfd_create() "incorrectly" without the new flags.
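The two candidate meanings of "=2" can be written down as a small model. This is only a sketch of the semantics as described in this thread, not kernel code: effective_flags, Rejected and overwrite_in_2 are names made up for illustration, the errno returned on rejection is deliberately not modelled, and the case where both flags are passed is ignored.

```python
# Flag values from <linux/memfd.h>.
MFD_NOEXEC_SEAL = 0x0008
MFD_EXEC = 0x0010


class Rejected(Exception):
    """The modelled kernel refuses the call (errno not modelled)."""


def effective_flags(flags, mode, overwrite_in_2=False):
    """Model how vm.memfd_noexec=<mode> treats a memfd_create() call.

    overwrite_in_2=False: strict "=2", flagless calls are rejected.
    overwrite_in_2=True:  overwrite+block, flagless calls are treated as
                          in "=1" and only executable memfds are rejected.
    """
    if not flags & (MFD_EXEC | MFD_NOEXEC_SEAL):
        # Old-style call: neither flag was passed by user space.
        if mode == 0:
            flags |= MFD_EXEC
        elif mode == 1 or overwrite_in_2:
            flags |= MFD_NOEXEC_SEAL
        else:
            raise Rejected("old-style call under strict =2")
    if mode == 2 and not flags & MFD_NOEXEC_SEAL:
        raise Rejected("executable memfd under =2")
    return flags
```

In this model an explicit MFD_EXEC is refused under either variant of "=2"; the only difference is whether an unmodified old-style caller keeps working.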
If you want to block syscalls that don't explicitly pass NOEXEC_SEAL, there are several tools for doing this (both seccomp and LSM hooks).
Additional functionality/features should be implemented through security hooks and an LSM, not a sysctl, I think.
This issue with =2 cannot be fixed in an LSM. (On the other hand, you could implement either =2 behaviour with an LSM using =1, and the current strict =2 behaviour could be implemented purely with seccomp.)
By migration, I mean a system that is not fully migrated; such a system should just use "=0" or "=1". Additional features can be implemented in SELinux/Landlock/another LSM by a motivated dev, e.g. if a system wants to limit executable memfds to specific programs or disable them entirely. "=2" is for a system/container that is fully migrated; in that case SELinux/Landlock/an LSM can do the same, but the sysctl provides a convenient alternative. Yes, seccomp provides a similar mechanism. Indeed, combining "=1" with seccomp (blocking MFD_EXEC) will overwrite + block executable memfds, which is essentially what you want, iiuc. However, I do not wish to have this implemented in the kernel, because I want the kernel to get out of the business of "overwriting" eventually.
See my above comments -- "overwriting" is perfectly acceptable to me. There's also no way to "get out of the business of overwriting" -- Linux has strict backwards compatibility requirements.
I agree: if we weigh the short-term goal of letting user space applications do the minimum, then a 4-state sysctl (or 2 sysctls, one controlling overwrite and one enabling/disabling executable memfds) will do. But with that approach, I'm afraid that in some version of the future (say in 20 years), most applications will still be calling memfd_create in the old API style, not setting the NX bit. The current approach might seem less convenient, but I hope it offers a bit of incentive for applications to migrate their code to the new API and explicitly set the NX bit. I understand this hope is questionable, and we might still end up in the same place in 20 years, but at least I tried :-). I will leave this decision to the maintainers when you supply patches for that, and I wouldn't feel bad either way; there are valid reasons on both sides.
To supplement, there are two other ways to get what you want: 1> use seccomp to block MFD_EXEC and leave the setting at 1; 2> implement the blocking using a security hook and an LSM, which imo is probably the most common way to deal with this type of request (blocking something). I admit those two ways are less convenient, from user space's perspective, than having the sysctl do everything.
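As a rough illustration of option 1>, a classic-BPF seccomp program that refuses memfd_create(MFD_EXEC) could look like the sketch below. Several things here are assumptions for illustration only: the syscall number is the x86_64 one, the flags offset assumes a little-endian struct seccomp_data, the architecture check a real filter needs is omitted, EACCES is an arbitrary choice of errno, and FILTER, pack_filter and simulate are made-up names. The sketch builds the program and evaluates it offline; it does not install it.

```python
import struct

# Classic BPF opcode pieces, values from <linux/bpf_common.h>.
BPF_LD, BPF_JMP, BPF_RET = 0x00, 0x05, 0x06
BPF_W, BPF_ABS, BPF_K = 0x00, 0x20, 0x00
BPF_JEQ, BPF_JSET = 0x10, 0x40

SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000
EACCES = 13
MFD_EXEC = 0x0010

NR_MEMFD_CREATE = 319  # assumption: x86_64 syscall table only
OFF_NR = 0             # offsetof(struct seccomp_data, nr)
OFF_FLAGS = 24         # low 32 bits of args[1], little-endian only

# Each instruction is (code, jt, jf, k), matching struct sock_filter.
FILTER = [
    (BPF_LD | BPF_W | BPF_ABS, 0, 0, OFF_NR),
    # Not memfd_create: skip three instructions to the ALLOW return.
    (BPF_JMP | BPF_JEQ | BPF_K, 0, 3, NR_MEMFD_CREATE),
    (BPF_LD | BPF_W | BPF_ABS, 0, 0, OFF_FLAGS),
    # MFD_EXEC set: fall through to ERRNO, otherwise skip to ALLOW.
    (BPF_JMP | BPF_JSET | BPF_K, 0, 1, MFD_EXEC),
    (BPF_RET | BPF_K, 0, 0, SECCOMP_RET_ERRNO | EACCES),
    (BPF_RET | BPF_K, 0, 0, SECCOMP_RET_ALLOW),
]


def pack_filter(insns):
    """Serialise as an array of struct sock_filter {u16 code; u8 jt, jf; u32 k}."""
    return b"".join(struct.pack("=HBBI", *insn) for insn in insns)


def simulate(insns, nr, flags):
    """Evaluate the four opcodes used above against a fake seccomp_data."""
    # struct seccomp_data: int nr; u32 arch; u64 ip; u64 args[6]
    data = struct.pack("<iIQ6Q", nr, 0, 0, 0, flags, 0, 0, 0, 0)
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = insns[pc]
        pc += 1
        if code == BPF_LD | BPF_W | BPF_ABS:
            acc = struct.unpack_from("<I", data, k)[0]
        elif code == BPF_JMP | BPF_JEQ | BPF_K:
            pc += jt if acc == k else jf
        elif code == BPF_JMP | BPF_JSET | BPF_K:
            pc += jt if acc & k else jf
        elif code == BPF_RET | BPF_K:
            return k
        else:
            raise ValueError("opcode not handled by this sketch")
```

Because seccomp looks at the flags as user space passed them, an old-style call with no flags gets through the filter and is then turned into a sealed, non-executable memfd by "=1". Testing for the absence of MFD_NOEXEC_SEAL instead would approximate the strict behaviour.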
Thanks
-Jeff