Re: [PATCH v8 0/4] Introduce mseal

1 Feb 2024

      * Jeff Xu jeffxu@chromium.org [240131 20:27]:
...
On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett
Liam.Howlett@oracle.com wrote:
...
Please add me to the Cc list of these patches.
Ok.
...

jeffxu@chromium.org jeffxu@chromium.org [240131 12:50]:

...
From: Jeff Xu jeffxu@chromium.org
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and point the code to it. The
memory must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself is against modifications of the selected seal type.
... The v8 cut Jonathan's email discussion [1] off and
instead of
replying there, I'm going to add my question here.
The best plan to ensure it is a general safety measure for all of linux
is to work with the community before it lands upstream.  It's much
harder to change functionality provided to users after it is upstream.
I'm happy to hear google is super excited about sharing this, but so
far, the community isn't as excited.
It seems Theo has a lot of experience trying to add a feature very close
to what you are doing and has real data on how this went [2].  Can we
see if there is a solution that is, at least, different enough from what
he tried to do for a shot of success?  Do we have anyone in the
toolchain groups that sees this working well?  If this means Stephen
needs to do something, can we get that to happen please?
For Theo's input from OpenBSD's perspective;
IIUC: as today, the mseal-Linux and mimmutable-OpenBSD has the same
scope on what operations to seal, e.g. considering the progress made
on both sides since the beginning of the RFC:

mseal(Linux): dropped "multiple-bit" approach.
mimmutable(OpenBSD): Dropped "downgradable"; Added madvise(DONOTNEED).

The difference is in mmap(), i.e.

mseal(Linux): support of PROT_SEAL in mmap().
mseal(Linux): use of MAP_SEALABLE in mmap().

I considered Theo's inputs from OpenBSD's perspective regarding the
difference, and I wasn't convinced that Linux should remove these. In
my view, those are two different kernels code, and the difference in
Linux is not added without reasons (for MAP_SEALABLE, there is a note
in the documentation section with details).
I would love to hear more from Linux developers on this.
Linus said it was really important to get the semantics correct, but you
took his (unfinished) list and kept going.  I think there are some
unanswered questions and that's frustrating some people as you may not
be valuing the experience they have in this area.
You dropped the RFC from the topic and incremented the version numbering
on the patch set. I thought it was customary to restart counting after
the RFC was complete?  Maybe I'm wrong, but it seemed a bit odd to see
that happen.  The documentation also implies there are still questions
to be answered, so it seems this is still an RFC in some ways?
I'd like to talk about the design some more.
Having to opt-in to allowing mseal will probably not work well.
Initial library mappings happen in one huge chunk then it's cut up into
smaller VMAs, at least that's what I see with my maple tree tracing.  If
you opt-in, then the entire library will have to opt-in and so the
'discourage inadvertent sealing' argument is not very strong.
It also makes a somewhat messy tracking of inheritance of the attribute
across splitting, MAP_FIXED replacement, vma_move, vma_copy.  I think
most of this is forced on the user?
It makes your call less flexible, it means you have to hope that the VMA
origin was blessed before you decide you want to mseal it.
What if you want to ensure the library mapped by a parent or on launch
is mseal'ed?
What about the initial relocated VMA (expand/shrink of VMA)?
Creating something as "non-sealable" is pointless.  If you don't want it
sealed, then don't mseal() that region.
If your use case doesn't need it, then can we please drop the opt-in
behaviour and just have all VMAs treated the same?
If it does need it, can you explain why?
The glibc relocation/fixup will then work.  glibc could mseal once it is
complete - or an application could bypass glibc support and use the
feature itself.
If we proceed to remove the MAP_SEALABLE flag to mmap, then we have the
heap/stack concerns.  We can either let people shoot their own feet off
or try to protect them.
Right now, you seem to be trying to protect them.  Keeping with that, I
guess we could either get the kernel to mark those VMAs or tell some
other way?  I'd suggest a range, but people do very strange things with
these special VMAs [1].  I don't think you can predict enough crazy
actions to make a difference in trying to protect people.
There are far fewer VMAs that should not be allowed to be mseal'ed than
should be, and the kernel creates those so it seems logical to only let
the kernel opt-out on those ones.
I'd rather just let people shoot themselves and return an error.
I also hope it reduces the complexity of this code while increasing the
flexibility of the feature.  As stated before, we remove the dependency
of needing support from the initial loader.
Merging VMAs
I can see this going Very Bad with brk + mseal.  But, again, if someone
decides to mseal these VMAs then they should expect Bad Things to
happen (or maybe they know what they are doing even in some complex
situation?)
vma_merge() can also expand a VMA.  I think this is okay as it checks
for the same flags, so you will allow VMA expansion of two (or three)
vma areas to become one.  Is this okay in your model?
...
...
I mean, you specifically state that this is a 'very specific
requirement' in your cover letter.  Does this mean even other browsers
have no use for it?
No, I don’t mean “other browsers have no use for it”.
About specific requirements from Chrome, that refers to "The lifetime
of those mappings are not tied to the lifetime of the process, which
is not the case of libc" as in the cover letter. This addition to the
cover letter was made in V3, thus, it might be beneficial to provide
additional context to help answer the question.
This patch series begins with multiple-bit approaches (v1,v2,v3), the
rationale for this is that I am uncertain if Chrome's specific needs
are common enough for other use cases.  Consequently, I am unable to
make this decision myself without input from the community. To
accommodate this, multiple bits are selected initially due to their
adaptability.
Since V1, after hearing from the community, Chrome has changed its
design (no longer relying on separating out mprotect), and Linus
acknowledged the defect of madvise(DONOTNEED) [1]. With those inputs,
today mseal() has a simple design that:

meet Chrome's specific needs.

How many VMAs will chrome have that are mseal'ed?  Is this a common
operation?
PROT_SEAL seems like an extra flag we could drop.  I don't expect we'll
be sealing enough VMAs that a hand full of extra syscalls would make a
difference?
...

meet Libc's needs.

What needs of libc are you referring to?  I'm looking through the
version changelog and I guess you mean return EPERM?
...

Chrome's specific need doesn't interfere with Libc's.

[1] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0...
Linus said he'd be happier if we made the change in general.
...
...
I am very concerned this feature will land and have to be maintained by
the core mm people for the one user it was specifically targeting.
See above. This feature is not specifically targeting Chrome.
...
Can we also get some benchmarking on the impact of this feature?  I
believe my answer in v7 removed the worst offender, but since there is
no benchmarking we really are guessing (educated or not, hard data would
help).  We still have an extra loop in madvise, mprotect_pkey, mremap_to
(and mreamp syscall?).
Yes. There is an extra loop in mmap(FIXED), munmap(),
madvise(DONOTNEED), mremap(), to emulate the VMAs for the given
address range. I suspect the impact would be low, but having some hard
data would be good. I will see what I can find to assist the perf
testing. If you have a specific test suite in mind, I can also try it.
You should look at mmtests [2]. But since you are adding loops across
VMA ranges, you need to test loops across several ranges of VMAs.  That
is, it would be good to see what happens on 1, 3, 6, 12, 24 VMAs, or
some subset of small and large numbers to get an idea of complexity we
are adding.  My hope is that the looping will be cache-hot in the maple
tree and have minimum effect.
In my personal testing, I've seen munmap often do a single VMA, or 3, or
more rare 7 on x86_64.  There should be some good starting points in
mmtests for the common operations.
[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/m...
[2] https://github.com/gormanm/mmtests
Thanks,
Liam

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH v8 0/4] Introduce mseal