On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan surenb@google.com wrote:
On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko andrii.nakryiko@gmail.com wrote:
On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan surenb@google.com wrote:
With maple_tree supporting vma tree traversal under RCU, and with the vma and its important members being RCU-safe, /proc/pid/maps can be read under RCU without read-locking mmap_lock. However, vma content can change from under us, so we make a copy of the vma and pin the pointer fields used when generating the output (currently only vm_file and anon_name). Afterwards we check for concurrent address space modifications, wait for them to end, and retry. While we do take the mmap_lock for reading during such contention, we take it only momentarily to record the new mm_wr_seq counter value. This change is designed to reduce
This is probably a stupid question, but why do we need to take a lock just to record this counter? uprobes get away without taking mmap_lock even for reads, and still record this seq counter. And then detect whether there were any modifications in between. Why does this change need more heavy-weight mmap_read_lock to do speculative reads?
Not a stupid question. mmap_read_lock() is used to wait for the writer to finish what it's doing and then we continue by recording a new sequence counter value and call mmap_read_unlock. This is what get_vma_snapshot() does. But your question made me realize that we can optimize m_start() further by not taking mmap_read_lock at all. Instead of taking mmap_read_lock then doing drop_mmap_lock() we can try mmap_lock_speculate_try_begin() and only if it fails do the same dance we do in the get_vma_snapshot(). I think that should work.
Ok, yeah, it would be great to avoid taking a lock in a common case!
mmap_lock contention and prevent a process reading /proc/pid/maps files (often a low-priority task, such as monitoring/data collection services) from blocking address space updates. Note that this change has a userspace-visible disadvantage: it allows for sub-page data tearing, as opposed to the previous mechanism where data tearing could happen only between pages of generated output data. Since current userspace considers data tearing between pages to be acceptable, we assume it will be able to handle sub-page data tearing as well.
Signed-off-by: Suren Baghdasaryan surenb@google.com
 fs/proc/internal.h        |   6 ++
 fs/proc/task_mmu.c        | 170 ++++++++++++++++++++++++++++++++++----
 include/linux/mm_inline.h |  18 ++++
 3 files changed, 177 insertions(+), 17 deletions(-)
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 96122e91c645..6e1169c1f4df 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -379,6 +379,12 @@ struct proc_maps_private {
 	struct task_struct *task;
 	struct mm_struct *mm;
 	struct vma_iterator iter;
+	bool mmap_locked;
+	loff_t last_pos;
+#ifdef CONFIG_PER_VMA_LOCK
+	unsigned int mm_wr_seq;
+	struct vm_area_struct vma_copy;
+#endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *task_mempolicy;
 #endif
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b9e4fbbdf6e6..f9d50a61167c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -127,13 +127,130 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
 }
 #endif
-static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
-					   loff_t *ppos)
+#ifdef CONFIG_PER_VMA_LOCK
+static const struct seq_operations proc_pid_maps_op;
+
+/*
+ * Take VMA snapshot and pin vm_file and anon_name as they are used by
+ * show_map_vma.
+ */
+static int get_vma_snapshot(struct proc_maps_private *priv,
+			    struct vm_area_struct *vma)
+{
+	struct vm_area_struct *copy = &priv->vma_copy;
+	int ret = -EAGAIN;
+
+	memcpy(copy, vma, sizeof(*vma));
+	if (copy->vm_file && !get_file_rcu(&copy->vm_file))
+		goto out;
+
+	if (!anon_vma_name_get_if_valid(copy))
+		goto put_file;
Given vm_file and anon_vma_name are both RCU-protected, if we take rcu_read_lock() at m_start/m_next before fetching VMA, we shouldn't even need getting/putting them, no?
Yeah, anon_vma_name indeed looks safe without pinning, but vm_file uses a SLAB_TYPESAFE_BY_RCU cache, so the pointer might still be valid while pointing to a different object, even though the RCU grace period has not passed. Under my assumption that seq_file can't roll back once show_map() is done, I need a completely stable vma at the time show_map() executes, so that it does not change from under us while we are generating the output. OTOH, if we can indeed roll back while generating seq_file output, then show_map() could output a potentially invalid vma, then check for vma changes and, when detected, roll back the seq_file and retry.
I feel like I'm missing some important limitations, but I don't think they are spelled out explicitly...
+	if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
+		return 0;
+
+	/* Address space got modified, vma might be stale. Re-lock and retry. */
+	rcu_read_unlock();
Another question I have is whether we really need to copy vma into priv->vma_copy to have a stable snapshot? Can't we just speculate like with uprobes under assumption that data doesn't change. And once we are done producing text output, confirm that speculation was successful, and if not - retry?
We'd need an API for seq_file to rollback to previous good known location for that, but that seems straightforward enough to do, no? Just remember buffer position before speculation, write data, check for no mm modifications, and if something changed, rollback seq file to last position.
From looking at seq_read_iter() I think for a rollback we would have to reset seq_file.count and seq_file.index to their previous states. At this location: https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L272 if show_map() returns negative value m->count will indeed be rolled back but not seq_file.index. Also returning negative value at https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L230 would be interpreted as a hard error... So, I'll need to spend some time in this code to get the answer about rollback. Thanks for the review!
Yeah, seq_file is a glorified wrapper around a memory buffer, essentially. And at the lowest level, this transaction-like API would basically just return seq->count before we start writing anything, and rollback will just accept a new count to set to seq->count, if we need to rollback.
Logistically this all needs to be factored into the whole seq_file callbacks thing, of course, especially if "transaction" will be started in m_start/m_next, while it can be "aborted" in m_show... So that's what would need careful consideration.
But you can end up with faster and cleaner implementation, as we discussed above, so worth giving it a try, IMO.
+	ret = mmap_read_lock_killable(priv->mm);
+	if (!ret) {
+		/* mmap_lock_speculate_try_begin() succeeds when holding mmap_read_lock */
+		mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
+		mmap_read_unlock(priv->mm);
+		ret = -EAGAIN;
+	}
+
+	rcu_read_lock();
+	anon_vma_name_put_if_valid(copy);
+put_file:
+	if (copy->vm_file)
+		fput(copy->vm_file);
+out:
+	return ret;
+}
+
+static void put_vma_snapshot(struct proc_maps_private *priv)
+{
+	struct vm_area_struct *vma = &priv->vma_copy;
+
+	anon_vma_name_put_if_valid(vma);
+	if (vma->vm_file)
+		fput(vma->vm_file);
+}
[...]