On Tue, Apr 29, 2025 at 08:54:58PM +0200, Jann Horn wrote:
On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan surenb@google.com wrote:
On Tue, Apr 29, 2025 at 10:21 AM Jann Horn jannh@google.com wrote:
Hi!
(I just noticed that I incorrectly assumed that VMAs use kfree_rcu (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I forgot all about that...)
Does this fact affect your previous comments? Just want to make sure I'm not missing something...
When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing down a VMA, and using get_file_rcu() for the lockless lookup, I did not realize that you could actually also race with all the other places that set ->vm_file, like __mmap_new_file_vma() and so on; and I did not think about whether any of those code paths might leave a VMA with a dangling ->vm_file pointer.
I guess maybe that means you really do need to do the lookup from the copied data, as you did in your patch; and that might require calling get_file_active() on the copied ->vm_file pointer (instead of get_file_rcu()), even though I think that is not really how get_file_active() is supposed to be used (it's supposed to be used when you know the original file hasn't been freed yet). Really what
I think it's fine for get_file_active() to be used in this way. That ->vm_file pointer usage should get a fat comment above it explaining how what you're doing is safe.
you'd want for that is basically a raw __get_file_rcu(), but that is static and I think Christian wouldn't want to expose more of these internals outside VFS...
Yeah, no. I don't want that to be usable outside of that file.
(In that case, all the stuff below about get_file_rcu() would be moot.)
Or you could pepper WRITE_ONCE() over all the places that write ->vm_file, and ensure that ->vm_file is always NULLed before its reference is dropped... but that seems a bit more ugly to me.
On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan surenb@google.com wrote:
On Tue, Apr 29, 2025 at 8:40 AM Jann Horn jannh@google.com wrote:
On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan surenb@google.com wrote:
With maple_tree supporting vma tree traversal under RCU and vma and its important members being RCU-safe, /proc/pid/maps can be read under RCU and without the need to read-lock mmap_lock. However vma content can change from under us, therefore we make a copy of the vma and we pin pointer fields used when generating the output (currently only vm_file and anon_name). Afterwards we check for concurrent address space modifications, wait for them to end and retry. While we take the mmap_lock for reading during such contention, we do that momentarily only to record new mm_wr_seq counter. This change is designed to reduce mmap_lock contention and prevent a process reading /proc/pid/maps files (often a low priority task, such as monitoring/data collection services) from blocking address space updates.
[...]
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index b9e4fbbdf6e6..f9d50a61167c 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c
[...]
+/*
- Take VMA snapshot and pin vm_file and anon_name as they are used by
- show_map_vma.
- */
+static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma) +{
struct vm_area_struct *copy = &priv->vma_copy;
int ret = -EAGAIN;
memcpy(copy, vma, sizeof(*vma));
if (copy->vm_file && !get_file_rcu(©->vm_file))
goto out;
I think this uses get_file_rcu() in a different way than intended.
As I understand it, get_file_rcu() is supposed to be called on a pointer which always points to a file with a non-zero refcount (except when it is NULL). That's why it takes a file** instead of a file* - if it observes a zero refcount, it assumes that the pointer must have been updated in the meantime, and retries. Calling get_file_rcu() on a pointer that points to a file with zero refcount, which I think can happen with this patch, will cause an endless loop. (Just as background: For other usecases, get_file_rcu() is supposed to still behave nicely and not spuriously return NULL when the file* is concurrently updated to point to another file*; that's what that loop is for.)
Ah, I see. I wasn't aware of this subtlety. I think this is fixable by checking the return value of get_file_rcu() and retrying speculation if it changed.
I think you could probably still end up looping endlessly in get_file_rcu().
(Just to be clear: What I meant here is that get_file_rcu() loops *internally*; get_file_rcu() is not guaranteed to ever return if the pointed-to file has a zero refcount.)
By "retrying speculation" I meant it in the sense of get_vma_snapshot() retry when it takes the mmap_read_lock and then does mmap_lock_speculate_try_begin to restart speculation. I'm also thinking about Liam's concern of guaranteeing forward progress for the reader. Thinking maybe I should not drop mmap_read_lock immediately on contention but generate some output (one vma or one page worth of vmas) before dropping mmap_read_lock and proceeding with speculation.
Hm, yeah, I guess you need that for forward progress...
(If my understanding is correct, maybe we should document that more explicitly...)
Good point. I'll add comments for get_file_rcu() as a separate patch.
Also, I think you are introducing an implicit assumption that remove_vma() does not NULL out the ->vm_file pointer (because that could cause tearing and could theoretically lead to a torn pointer being accessed here).
One alternative might be to change the paths that drop references to vma->vm_file (search for vma_close to find them) such that they first NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that, maybe with a new helper like this:
static void vma_fput(struct vm_area_struct *vma) { struct file *file = vma->vm_file;
if (file) { WRITE_ONCE(vma->vm_file, NULL); fput(file); } }
Then on the lockless lookup path you could use get_file_rcu() on the ->vm_file pointer _of the original VMA_, and store the returned file* into copy->vm_file.
Ack. Except for storing the return value of get_file_rcu(). I think once we detect that get_file_rcu() returns a different file we should bail out and retry. The change in file is an indication that the vma got changed from under us, so whatever we have is stale.
What does "different file" mean here - what file* would you compare the returned one against?
Inside get_vma_snapshot() I would pass the original vma->vm_file to get_file_rcu() and check if it returns the same value. If the value got changed we jump to /* Address space got modified, vma might be stale. Re-lock and retry. */ section. That should work, right?
Where do you get an "original vma->vm_file" from?
To be clear, get_file_rcu(p) returns one of the values that *p had while get_file_rcu(p) is running.