On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan surenb@google.com wrote:
On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko andrii.nakryiko@gmail.com wrote:
On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan surenb@google.com wrote:
With maple_tree supporting vma tree traversal under RCU, and with the vma and its important members being RCU-safe, /proc/pid/maps can be read under RCU without read-locking mmap_lock. However, vma content can change from under us, so we make a copy of the vma and pin the pointer fields used when generating the output (currently only vm_file and anon_name). Afterwards we check for concurrent address space modifications, wait for them to end, and retry. While we do take the mmap_lock for reading during such contention, we take it only momentarily to record the new mm_wr_seq counter value. This change is designed to reduce
This is probably a stupid question, but why do we need to take a lock just to record this counter? uprobes get away without taking mmap_lock even for reads, and still record this seq counter. And then detect whether there were any modifications in between. Why does this change need more heavy-weight mmap_read_lock to do speculative reads?
Not a stupid question. mmap_read_lock() is used to wait for the writer to finish what it's doing and then we continue by recording a new sequence counter value and call mmap_read_unlock. This is what get_vma_snapshot() does. But your question made me realize that we can optimize m_start() further by not taking mmap_read_lock at all. Instead of taking mmap_read_lock then doing drop_mmap_lock() we can try mmap_lock_speculate_try_begin() and only if it fails do the same dance we do in the get_vma_snapshot(). I think that should work.
Ok, yeah, it would be great to avoid taking a lock in a common case!
mmap_lock contention and prevent a process reading /proc/pid/maps files (often a low-priority task, such as monitoring/data collection services) from blocking address space updates. Note that this change has a userspace-visible disadvantage: it allows for sub-page data tearing, as opposed to the previous mechanism where data tearing could happen only between pages of generated output data. Since current userspace considers data tearing between pages to be acceptable, we assume it will be able to handle sub-page data tearing as well.
Signed-off-by: Suren Baghdasaryan surenb@google.com
 fs/proc/internal.h        |   6 ++
 fs/proc/task_mmu.c        | 170 ++++++++++++++++++++++++++++++++++----
 include/linux/mm_inline.h |  18 ++++
 3 files changed, 177 insertions(+), 17 deletions(-)
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 96122e91c645..6e1169c1f4df 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -379,6 +379,12 @@ struct proc_maps_private {
 	struct task_struct *task;
 	struct mm_struct *mm;
 	struct vma_iterator iter;
+	bool mmap_locked;
+	loff_t last_pos;
+#ifdef CONFIG_PER_VMA_LOCK
+	unsigned int mm_wr_seq;
+	struct vm_area_struct vma_copy;
+#endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *task_mempolicy;
 #endif
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b9e4fbbdf6e6..f9d50a61167c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -127,13 +127,130 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
 }
 #endif
-static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
-					   loff_t *ppos)
+#ifdef CONFIG_PER_VMA_LOCK
+static const struct seq_operations proc_pid_maps_op;
+
+/*
+ * Take VMA snapshot and pin vm_file and anon_name as they are used by
+ * show_map_vma.
+ */
+static int get_vma_snapshot(struct proc_maps_private *priv,
+			    struct vm_area_struct *vma)
+{
+	struct vm_area_struct *copy = &priv->vma_copy;
+	int ret = -EAGAIN;
+
+	memcpy(copy, vma, sizeof(*vma));
+	if (copy->vm_file && !get_file_rcu(&copy->vm_file))
+		goto out;
+
+	if (!anon_vma_name_get_if_valid(copy))
+		goto put_file;
Given vm_file and anon_vma_name are both RCU-protected, if we take rcu_read_lock() at m_start/m_next before fetching VMA, we shouldn't even need getting/putting them, no?
Yeah, anon_vma_name indeed looks safe without pinning, but vm_file uses a SLAB_TYPESAFE_BY_RCU cache, so the pointer might still be valid while pointing to a different object, even though the RCU grace period has not passed. Under my assumption that seq_file can't roll back once show_map() is done, I need a completely stable vma at the time show_map() executes, so that it does not change from under us while we are generating the output. OTOH, if we can indeed roll back while generating seq_file output, then show_map() could output a potentially invalid vma, then check for vma changes and, when detected, roll back the seq_file and retry.
I feel like I'm missing some important limitations, but I don't think they are spelled out explicitly...
+	if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
+		return 0;
+
+	/* Address space got modified, vma might be stale. Re-lock and retry. */
+	rcu_read_unlock();
Another question I have is whether we really need to copy vma into priv->vma_copy to have a stable snapshot? Can't we just speculate like with uprobes under assumption that data doesn't change. And once we are done producing text output, confirm that speculation was successful, and if not - retry?
We'd need an API for seq_file to rollback to previous good known location for that, but that seems straightforward enough to do, no? Just remember buffer position before speculation, write data, check for no mm modifications, and if something changed, rollback seq file to last position.
From looking at seq_read_iter() I think for a rollback we would have to reset seq_file.count and seq_file.index to their previous states. At this location: https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L272 if show_map() returns negative value m->count will indeed be rolled back but not seq_file.index. Also returning negative value at https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L230 would be interpreted as a hard error... So, I'll need to spend some time in this code to get the answer about rollback. Thanks for the review!
Yeah, seq_file is a glorified wrapper around a memory buffer, essentially. And at the lowest level, this transaction-like API would basically just return seq->count before we start writing anything, and rollback will just accept a new count to set to seq->count, if we need to rollback.
Logistically this all needs to be factored into the whole seq_file callbacks thing, of course, especially if "transaction" will be started in m_start/m_next, while it can be "aborted" in m_show... So that's what would need careful consideration.
But you can end up with faster and cleaner implementation, as we discussed above, so worth giving it a try, IMO.
+	ret = mmap_read_lock_killable(priv->mm);
+	if (!ret) {
+		/* mmap_lock_speculate_try_begin() succeeds when holding mmap_read_lock */
+		mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
+		mmap_read_unlock(priv->mm);
+		ret = -EAGAIN;
+	}
+
+	rcu_read_lock();
+	anon_vma_name_put_if_valid(copy);
+put_file:
+	if (copy->vm_file)
+		fput(copy->vm_file);
+out:
+	return ret;
+}
+
+static void put_vma_snapshot(struct proc_maps_private *priv)
+{
+	struct vm_area_struct *vma = &priv->vma_copy;
+
+	anon_vma_name_put_if_valid(vma);
+	if (vma->vm_file)
+		fput(vma->vm_file);
+}
[...]