* Andrii Nakryiko andrii.nakryiko@gmail.com [250423 18:06]:
On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan surenb@google.com wrote:
On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko andrii.nakryiko@gmail.com wrote:
On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan surenb@google.com wrote:
With maple_tree supporting vma tree traversal under RCU, and the vma and its important members being RCU-safe, /proc/pid/maps can be read under RCU without the need to read-lock mmap_lock. However, vma content can change from under us, therefore we make a copy of the vma and pin the pointer fields used when generating the output (currently only vm_file and anon_name). Afterwards we check for concurrent address space modifications, wait for them to end, and retry. While we take the mmap_lock for reading during such contention, we do that only momentarily, to record the new mm_wr_seq counter value. This change is designed to reduce mmap_lock contention.
This is probably a stupid question, but why do we need to take a lock just to record this counter? uprobes get away without taking mmap_lock even for reads, and still record this seq counter, and then detect whether there were any modifications in between. Why does this change need the more heavy-weight mmap_read_lock to do speculative reads?
Not a stupid question. mmap_read_lock() is used to wait for the writer to finish what it's doing; then we continue by recording a new sequence counter value and calling mmap_read_unlock(). This is what get_vma_snapshot() does. But your question made me realize that we can optimize m_start() further by not taking mmap_read_lock at all. Instead of taking mmap_read_lock and then doing drop_mmap_lock(), we can try mmap_lock_speculate_try_begin() and, only if it fails, do the same dance we do in get_vma_snapshot(). I think that should work.
Ok, yeah, it would be great to avoid taking a lock in a common case!
We can check this counter once per 4k block and maintain the same 'tearing' that exists today, instead of per-vma. Not that anyone said they had an issue with changing it, but since we're on this road anyway, I thought I'd point out where we could end up.
I am concerned about livelock in either scenario, but I haven't looked too deeply into this pattern.
I also don't love (as usual) the lack of ensured forward progress.
It seems like we have four cases for the vm area state now:
1. we want to read a stable vma or set of vmas (per-vma locking)
2. we want to read a stable mm state for reading (the very short named mmap_lock_speculate_try_begin)
3. we ensure a stable vma/mm state for reading (mmap read lock)
4. we are writing - get out of my way (mmap write lock)
Cheers, Liam