On Mon, 31 Jul 2023 at 10:12, Suren Baghdasaryan surenb@google.com wrote:
walk_page_vma(vma, &subpage_walk_ops, NULL);
walk_page_vma(vma, &subpage_walk_ops, true, NULL);
Rather than add a new argument to the walk_page_*() functions, I really think you should just add the locking rule to the 'const struct mm_walk_ops' structure.
The locking rule goes along with the rules for what you are actually doing, after all. Plus it would actually make it all much more legible when it's not just some random "true/false" argument, but a an actual
.write_lock = 1
in the ops definition.
Yes, yes, that might mean that some ops might need duplicating in case you really have a walk that sometimes takes the lock, and sometimes doesn't, but that is odd to begin with.
The only such case I found from a quick look was the very strange queue_pages_range() case. Is it really true that do_mbind() needs the write-lock, but do_migrate_pages() does not?
And if they really are that different maybe they should have different walk_ops?
Maybe there were other cases that I didn't notice.
error = walk_page_range(current->mm, start, end,
&prot_none_walk_ops, &new_pgprot);
&prot_none_walk_ops, true, &new_pgprot);
This looks odd. You're adding vma locking to a place that didn't do it before.
Yes, the mmap semaphore is held for writing, but this particular walk doesn't need it as far as I can tell.
In fact, this feels like that walker should maybe *verify* that it's held for writing, but not try to write it again?
Maybe the "lock_vma" flag should be a tri-state:
- lock for reading (no-op per vma), verify that the mmap sem is held for reading
- lock for reading (no-op per vma), but with mmap sem held for writing (this kind of "check before doing changes" walker)
- lock for writing (with mmap sem obviously needs to be held for writing)
mmap_assert_locked(walk.mm);
if (lock_vma)
vma_start_write(vma);
So I think this should also be tightened up, and something like
switch (ops->locking) { case WRLOCK: vma_start_write(vma); fallthrough; case WRLOCK_VERIFY: mmap_assert_write_locked(mm); break; case RDLOCK: mmap_assert_locked(walk.mm); }
because we shouldn't have a 'vma_start_write()' without holding the mmap sem for *writing*, and the above would also allow that mprotect_fixup() "walk to see if we can merge, verify that it was already locked" thing.
Hmm?
NOTE! The above names are just completely made up. I dcon't think it should actually be some "WRLOCK" enum. There are probably much better names. Take the above as a "maybe something kind of in this direction" rather than "do it exactly like this".
Linus