On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
* Jeff Xu <jeffxu@chromium.org> [240131 20:27]:
On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
Having to opt-in to allowing mseal will probably not work well.
I'm leaving the opt-in discussion in Linus's thread.
Initial library mappings happen in one huge chunk, which is then cut up into smaller VMAs; at least that's what I see with my maple tree tracing. If you opt in, then the entire library will have to opt in, so the 'discourage inadvertent sealing' argument is not very strong.
Regarding "The initial library mappings happen in one huge chunk then it is cut up into smaller VMAS", this is not a problem.
Take ELF loading (fs/binfmt_elf.c) as an example: there are only a few places that specify what type of memory is to be allocated, e.g. MAP_PRIVATE, MAP_FIXED_NOREPLACE, and we can add MAP_SEALABLE at those places. If glibc does additional splitting of the memory range using mprotect(), then MAP_SEALABLE is automatically preserved across the split. If glibc uses mmap(MAP_FIXED), then it should use mmap(MAP_FIXED|MAP_SEALABLE).
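To make that concrete, here is a minimal sketch of what the loader-side call could look like, assuming the MAP_SEALABLE flag from this series (the flag value below is a placeholder, not a mainline constant):

#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_SEALABLE
#define MAP_SEALABLE 0x8000000 /* placeholder; real value comes from this series */
#endif

/* Map the whole library image in one sealable chunk. */
static void *map_image(size_t len)
{
        void *base = mmap(NULL, len, PROT_READ,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_SEALABLE, -1, 0);
        if (base == MAP_FAILED)
                return NULL;
        /*
         * Later mprotect() splits of this range keep the sealable
         * attribute; only a fresh mmap(MAP_FIXED) over part of it
         * would need to pass MAP_SEALABLE again.
         */
        return base;
}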
It also makes for somewhat messy tracking of the attribute's inheritance across splitting, MAP_FIXED replacement, vma_move, and vma_copy. I think most of this is forced on the user?
The inheritance is the same as for other VMA flags.
It makes your call less flexible: you have to hope that the VMA's origin was blessed before you decide you want to mseal it.
What if you want to ensure the library mapped by a parent or on launch is mseal'ed?
What about the initial relocated VMA (expand/shrink of VMA)?
Creating something as "non-sealable" is pointless. If you don't want it sealed, then don't mseal() that region.
If your use case doesn't need it, then can we please drop the opt-in behaviour and just have all VMAs treated the same?
If it does need it, can you explain why?
The glibc relocation/fixup will then work. glibc could mseal once it is complete - or an application could bypass glibc support and use the feature itself.
Yes. That is the idea.
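A minimal sketch of that flow, assuming the mseal() syscall from this series (the syscall number below is a placeholder taken from the series' uapi headers; flags must currently be 0):

#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_mseal
#define __NR_mseal 462 /* placeholder; use the number this series assigns */
#endif

static int seal_range(void *addr, size_t len)
{
        return syscall(__NR_mseal, addr, len, 0UL);
}

/*
 * After the loader finishes relocation/RELRO processing, e.g.:
 *
 *     mprotect(relro_start, relro_len, PROT_READ);
 *     seal_range(relro_start, relro_len);
 *
 * further mprotect/munmap/mremap on that range fail with EPERM.
 */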
If we proceed to remove the MAP_SEALABLE flag from mmap, then we have the heap/stack concerns. We can either let people shoot their own feet off or try to protect them.
Right now, you seem to be trying to protect them. Keeping with that, I guess we could either get the kernel to mark those VMAs or identify them some other way? I'd suggest a range, but people do very strange things with these special VMAs [1]. I don't think you can predict enough crazy actions to make a difference in trying to protect people.
There are far fewer VMAs that should not be allowed to be mseal'ed than should be, and the kernel creates those, so it seems logical to only let the kernel opt out on those ones.
I'd rather just let people shoot themselves and return an error.
I also hope it reduces the complexity of this code while increasing the flexibility of the feature. As stated before, we remove the dependency on support from the initial loader.
Merging VMAs: I can see this going Very Bad with brk + mseal. But, again, if someone decides to mseal these VMAs then they should expect Bad Things to happen (or maybe they know what they are doing, even in some complex situation?)
vma_merge() can also expand a VMA. I think this is okay as it checks for the same flags, so you will allow VMA expansion of two (or three) vma areas to become one. Is this okay in your model?
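For what it's worth, the merge behaviour is easy to observe from userspace. A hedged sketch (the single-VMA outcome is the kernel's usual policy for unfaulted anonymous mappings with identical flags, not a guarantee):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t pg = 4096;

        /* Two adjacent anonymous mappings with identical flags... */
        char *a = mmap(NULL, 2 * pg, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        mmap(a + 2 * pg, pg, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

        /* ...usually show up as one 3-page VMA in /proc/self/maps. */
        getchar(); /* pause here and inspect /proc/<pid>/maps */
        return 0;
}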
I mean, you specifically state that this is a 'very specific requirement' in your cover letter. Does this mean even other browsers have no use for it?
No, I don’t mean “other browsers have no use for it”.
As for the specific requirement from Chrome, that refers to "the lifetime of those mappings is not tied to the lifetime of the process, which is not the case for libc", as stated in the cover letter. That sentence was added to the cover letter in V3, so some additional context might help answer the question.
This patch series began with a multiple-bit approach (v1, v2, v3). The rationale is that I was uncertain whether Chrome's specific needs were common enough to cover other use cases, and I could not make that call myself without input from the community; multiple bits were chosen initially for their adaptability.
Since V1, after hearing from the community, Chrome has changed its design (it no longer relies on separating out mprotect), and Linus acknowledged the defect of madvise(MADV_DONTNEED) [1]. With those inputs, mseal() today has a simple design that:
- meets Chrome's specific needs.
How many VMAs will Chrome have that are mseal'ed? Is this a common operation?
PROT_SEAL seems like an extra flag we could drop. I don't expect we'll be sealing enough VMAs that a handful of extra syscalls would make a difference?
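For comparison, a hedged sketch of the two paths (PROT_SEAL and the syscall number are placeholders from this series, not mainline constants):

#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef PROT_SEAL
#define PROT_SEAL 0x04000000 /* placeholder; value comes from this series */
#endif
#ifndef __NR_mseal
#define __NR_mseal 462 /* placeholder */
#endif

int main(void)
{
        size_t len = 4096;

        /* One syscall, sealing at map time with the proposed PROT_SEAL: */
        void *p = mmap(NULL, len, PROT_READ | PROT_SEAL,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Two syscalls, if PROT_SEAL is dropped: */
        void *q = mmap(NULL, len, PROT_READ,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        syscall(__NR_mseal, q, len, 0UL);

        (void)p;
        return 0;
}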
- meets libc's needs.
What needs of libc are you referring to? I'm looking through the version changelog and I guess you mean return EPERM?
I meant libc sealing the RO parts of the ELF binary; the lifetime of that memory is tied to the lifetime of the process.
- Chrome's specific need doesn't interfere with libc's.
[1] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0...
Linus said he'd be happier if we made the change in general.
I am very concerned this feature will land and have to be maintained by the core mm people for the one user it was specifically targeting.
See above. This feature is not specifically targeting Chrome.
Can we also get some benchmarking on the impact of this feature? I believe my answer in v7 removed the worst offender, but since there is no benchmarking we really are guessing (educated or not, hard data would help). We still have an extra loop in madvise, mprotect_pkey, mremap_to (and the mremap syscall?).
Yes. There is an extra loop in mmap(FIXED), munmap(), madvise(MADV_DONTNEED), and mremap(), to enumerate the VMAs in the given address range. I suspect the impact would be low, but having some hard data would be good. I will see what I can find to assist with the perf testing. If you have a specific test suite in mind, I can also try it.
You should look at mmtests [2]. But since you are adding loops across VMA ranges, you need to test loops across several ranges of VMAs. That is, it would be good to see what happens on 1, 3, 6, 12, 24 VMAs, or some subset of small and large counts, to get an idea of the complexity we are adding. My hope is that the looping will be cache-hot in the maple tree and have minimal effect.
In my own testing, I've seen munmap operate on a single VMA most often, sometimes 3, and more rarely 7 on x86_64. There should be some good starting points in mmtests for the common operations.
Thanks. Will do.
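As a starting point, here is a minimal sketch of the loop-scaling measurement suggested above (the repetition count and the choice of madvise(MADV_NORMAL) as the spanning operation are illustrative):

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
        const size_t pg = 4096;
        const int nr_vmas = 24; /* try 1, 3, 6, 12, 24 */
        const int reps = 100000;
        char *base = mmap(NULL, nr_vmas * pg, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Alternate protections so neighbours cannot merge: nr_vmas VMAs. */
        for (int i = 0; i < nr_vmas; i += 2)
                mprotect(base + i * pg, pg, PROT_READ);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; r++)
                madvise(base, nr_vmas * pg, MADV_NORMAL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%.1f ns/call across %d VMAs\n",
               ((t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_nsec - t0.tv_nsec)) / reps, nr_vmas);
        return 0;
}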
[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/m...
[2] https://github.com/gormanm/mmtests
Thanks, Liam