Re: [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec

2 Aug 2023

On 2023-08-02, Jeff Xu <jeffxu@chromium.org> wrote:
...

vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls
because it will make it far too difficult to ever migrate. Instead it
should imply MFD_EXEC.

Though the purpose of memfd_noexec=2 is not to help with migration,
but to disable creation of executable memfds for the current system/pid
namespace.
During the migration, vm.memfd_noexec=1 helps by overwriting the flags
of unmigrated user code as a temporary measure.
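
For readers following along, a minimal sketch of how a program could check
which setting is active in its own pid namespace. The sysctl path is the
procfs form of vm.memfd_noexec; the helper name read_memfd_noexec() is
made up for illustration, and the file is simply absent on kernels
without the feature.

```c
#include <stdio.h>

/*
 * Hypothetical helper: returns the value of vm.memfd_noexec as seen by
 * the calling process, or -1 if the sysctl cannot be read (for example
 * on a kernel that predates it).
 */
int read_memfd_noexec(void)
{
    FILE *f = fopen("/proc/sys/vm/memfd_noexec", "r");
    int value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```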
My point is that the current behaviour for =2 means that nobody other
than *maybe* ChromeOS will ever be able to use it because it requires
auditing every program on the system. In fact, it's possible even
ChromeOS will run into issues given that one of the arguments made for
the nosymfollow mount option was that auditing all of ChromeOS to
replace every open with RESOLVE_NO_SYMLINKS would be too much effort[1]
(which I agreed with). Maybe this is less of an issue with
memfd_create(2) (which is much newer than open(2)) but it still seems
like a lot of busy work when the =1 behaviour is entirely sane even in
the strict threat model that =2 is trying to protect against.
It can also be a container (that has all memfd_create calls migrated to the new API)
If ChromeOS would struggle to rewrite all of the libraries they use,
containers are in even worse shape -- most container users don't have a
complete list of every package installed in a container, let alone the
ability to audit whether they pass a (no-op) flag to memfd_create(2) in
every codepath.
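
To make concrete what "migrated" means for a single call site, here is a
rough sketch of a new-style call. It assumes a glibc that provides the
memfd_create() wrapper; the fallback #define uses the value from the
kernel uapi header, the helper name create_noexec_memfd() is
hypothetical, and the EINVAL retry for kernels that do not know the flag
is a design choice made for this example only.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/mman.h>

#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U /* value from include/uapi/linux/memfd.h */
#endif

/*
 * Hypothetical helper: create a memfd that explicitly asks for the
 * non-executable, exec-sealed behaviour. On kernels that reject the
 * unknown flag with EINVAL, retry as an old-style call.
 */
int create_noexec_memfd(const char *name)
{
    int fd = memfd_create(name, MFD_CLOEXEC | MFD_NOEXEC_SEAL);

    if (fd < 0 && errno == EINVAL)
        fd = memfd_create(name, MFD_CLOEXEC);
    return fd;
}
```

Every library in the container that calls memfd_create(2) would need a
change along these lines before the strict =2 behaviour is safe to enable.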
...
One option I considered previously was that "=2" would do overwrite+block,
and "=3" would just block. But then I worry that applications won't have
any motivation to ever change their existing code, and the setting will
forever stay at "=2", making it even less likely that "=3" is ever used
system-wide.
What is the downside of overwriting? Backwards-compatibility is a very
important part of Linux -- being able to use old programs without having
to modify them is incredibly important. Yes, this behaviour is opt-in --
but I don't see the point of making opting in more difficult than
necessary. Surely overwrite+block provides the security guarantee you
need from the threat model -- otherwise nobody will be able to use block
because you never know if one library will call memfd_create()
"incorrectly" without the new flags.
...
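
The behaviours being compared above can be written down as a small
user-space model. This is not kernel code: the enum names are invented
for this sketch, the flag values are the ones from the uapi header, the
-EACCES return for a blocked call is an assumption, and the "block"
modes are modelled by their stated intent (no executable memfds).

```c
#include <errno.h>

#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif
#ifndef MFD_EXEC
#define MFD_EXEC 0x0010U
#endif

/* Hypothetical names for the policies discussed in this thread. */
enum memfd_policy {
    POLICY_EXEC_DEFAULT,    /* "=0": old-style calls get MFD_EXEC */
    POLICY_OVERWRITE,       /* "=1": old-style calls get MFD_NOEXEC_SEAL */
    POLICY_BLOCK,           /* strict "=2": old-style calls are rejected */
    POLICY_OVERWRITE_BLOCK, /* proposal: overwrite old-style, reject MFD_EXEC */
};

/* Returns 0 and stores the effective flags, or a negative errno. */
int apply_memfd_policy(enum memfd_policy policy, unsigned int flags,
                       unsigned int *effective)
{
    unsigned int exec_bits = flags & (MFD_EXEC | MFD_NOEXEC_SEAL);

    /* Asking for both behaviours at once is contradictory. */
    if (exec_bits == (MFD_EXEC | MFD_NOEXEC_SEAL))
        return -EINVAL;

    /* Old-style call: neither flag was passed. */
    if (exec_bits == 0) {
        switch (policy) {
        case POLICY_EXEC_DEFAULT:
            flags |= MFD_EXEC;
            break;
        case POLICY_OVERWRITE:
        case POLICY_OVERWRITE_BLOCK:
            flags |= MFD_NOEXEC_SEAL;
            break;
        case POLICY_BLOCK:
            return -EACCES;
        }
    }

    /* Both "block" variants refuse executable memfds. */
    if ((flags & MFD_EXEC) &&
        (policy == POLICY_BLOCK || policy == POLICY_OVERWRITE_BLOCK))
        return -EACCES;

    *effective = flags;
    return 0;
}
```

In this model the only difference between the two block variants is the
old-style call: POLICY_BLOCK fails it, POLICY_OVERWRITE_BLOCK lets it
through as a non-executable memfd.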
If you want to block syscalls that don't explicitly pass NOEXEC_SEAL,
there are several tools for doing this (both seccomp and LSM hooks).
...
Additional functionality/features should be implemented through
security hook and LSM, not sysctl, I think.
This issue with =2 cannot be fixed in an LSM. (On the other hand, you
could implement either =2 behaviour with an LSM using =1, and the
current strict =2 behaviour could be implemented purely with seccomp.)
By migration, I mean a system that is not fully migrated; such a
system should just use "=0" or "=1". Additional features can be
implemented in SELinux/Landlock/other LSM by a motivated dev, e.g. if
a system wants to limit executable memfd to specific programs or fully
disable it.
"=2" is for a system/container that is fully migrated. In that case,
SELinux/Landlock/LSM can do the same, but sysctl provides a convenient
alternative.
Yes, seccomp provides a similar mechanism. Indeed, combining "=1" and
seccomp (block MFD_EXEC), it will overwrite + block X mfd, which is
essentially what you want, iiuc. However, I do not wish to have this
implemented in the kernel, because I want the kernel to get
out of the business of "overwriting" eventually.
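
As an illustration of the "=1 plus seccomp" combination mentioned here, a
minimal classic-BPF sketch that fails memfd_create(2) whenever MFD_EXEC
is set. Assumptions: little-endian layout (only the low 32 bits of the
flags argument are read), no architecture check (a real filter should
validate seccomp_data.arch first), EACCES as the error code, and a
hypothetical function name install_memfd_exec_filter().

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#ifndef MFD_EXEC
#define MFD_EXEC 0x0010U /* value from include/uapi/linux/memfd.h */
#endif

static struct sock_filter memfd_exec_filter[] = {
    /* [0] Load the syscall number. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [1] Not memfd_create? Jump to the ALLOW at [5]. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_memfd_create, 0, 3),
    /* [2] Load the low 32 bits of the flags argument (args[1]). */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, args[1])),
    /* [3] MFD_EXEC set: fall through to [4]; otherwise jump to [5]. */
    BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, MFD_EXEC, 0, 1),
    /* [4] Fail the syscall with EACCES. */
    BPF_STMT(BPF_RET | BPF_K,
             SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
    /* [5] Everything else is allowed. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Returns 0 on success, -1 with errno set if the filter is refused. */
int install_memfd_exec_filter(void)
{
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(memfd_exec_filter) /
                                sizeof(memfd_exec_filter[0])),
        .filter = memfd_exec_filter,
    };

    /* Needed so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Note that the filter only blocks; the overwriting of old-style calls in
this setup still comes from the "=1" sysctl, not from seccomp.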
See my above comments -- "overwriting" is perfectly acceptable to me.
There's also no way to "get out of the business of overwriting" -- Linux
has strict backwards compatibility requirements.
I agree, if we weigh the short-term goal of letting user space
applications do the minimum, then having a 4-state sysctl (or 2 sysctls,
one controlling overwrite, one disabling/enabling executable memfd) will do.
But with that approach, I'm afraid that in one version of the future (say
in 20 years), most applications will stay with the old memfd_create API
style, not setting the NX bit. The current approach might seem
less convenient, but I hope it offers a bit of incentive to make
applications migrate their code towards the new API, explicitly
setting the NX bit. I understand this hope is questionable, and we might
still end up in the same place in 20 years, but at least I tried :-). I will
leave this decision to the maintainers when you supply patches for that,
and I wouldn't feel bad either way; there are valid reasons on both sides.
People will not switch =2 on if it has the possibility of breaking
existing programs that are doing nothing wrong by not passing a noop
flag.
In 20 years at best you would have =1 in widespread use because the
rewriting behaviour is what users expect of kernel uAPIs. They expect
old programs to work without modifying them if they aren't doing
anything wrong. A uAPI knob that requires every userspace program to
change before you can safely enable it (especially because it ratchets
in a way that makes it dangerous to enable on production machines) will
simply not be used.
If the goal is to get programs to update (which it seems it is), having
a knob that nobody will turn on doesn't help. Doing proper warning
logging is the way to get userspace to switch -- userspace usually
notices when their programs trigger warnings in dmesg.
...
To supplement, there are two other ways to get what you want:
1> seccomp to block MFD_EXEC, leaving the setting at 1.
I made this point in an earlier mail.
However, my point is that =2 is not an acceptable uAPI, and if you want
something that looks like =2 you can implement that with seccomp
too!
In fact, the key difference is that you cannot implement the
rewriting easily in seccomp -- you would need to install a
seccomp_notify monitor that does nothing but rewrite syscall arguments.
This would be equivalent to running the entire system under GDB to work
around a uAPI flaw.
...
2> implement the blocking using a security hook and LSM, imo, which is
probably the most common way to deal with this type of request (block
something).
The issue is not the blocking, it's the rewriting.
-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/
