On 18.12.21 00:20, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 2:43 PM David Hildenbrand david@redhat.com wrote:
The pages stay PageAnon(). swap-backed pages simply set a bit IIRC. mapcount still applies.
Our code-base is too large for me to remember all the details, but if we still end up having PageAnon for swapbacked pages, then mapcount can increase from another process faulting in a pte with that swap entry.
"Our code-base is too large for me to remember all the details". I second that.
You might have a valid point with the mapcount regarding concurrent swapin in the current code. I'll have to think further about whether it could be a problem and whether it can be handled without heavy synchronization (I think the concern is that gup unsharing could miss doing an unshare when seeing mapcount == 1, because it doesn't detect other page sharers that are expressed not in the mapcount but via the swap code).
Do you have any other concerns about the semantics/stability of the following points (as discussed, fork() is not the issue because it can be handled via write_protect_seq or something comparable; handling per-process thingies is not the problem):
a) Using PageAnon(): It cannot possibly change in the pagefault path or in the gup-fast-only path (otherwise there would be use-after-free already).
b) Using PageKsm(): It cannot possibly change in the pagefault path or in the gup-fast path (otherwise there would be use-after-free already).
c) Using mapcount: It cannot possibly change in the way we care about or cannot detect (mapcount going from == 1 to > 1 concurrently) in the pagefault path or in the gup-fast path due to fork().
Your point for c) is that we might currently not handle swap correctly. Any other concerns, especially regarding the mapcount, or is that it?
IIUC, any GUP approach to detect necessary unsharing would at least require a check for a) and b). What we're arguing about is c).
And mmap_sem doesn't protect against that. Again, page_lock() does.
And taking the page lock was a big performance issue.
One of the reasons that new COW handling is so nice is that you can do things like
	if (!trylock_page(page))
		goto copy;
exactly because in the a/b world order, the copy case is always safe.
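To make the trylock argument concrete, here is a minimal userspace sketch of that decision. It is a toy model, not kernel code: struct toy_page, toy_trylock_page(), toy_unlock_page() and toy_wp_fault() are hypothetical names made up for illustration, and the refcount field only stands in for page_count().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model (assumption); NOT the kernel's struct page. */
struct toy_page {
	int refcount;	/* stands in for page_count() */
	bool locked;	/* stands in for the page lock */
};

enum toy_cow_action { TOY_REUSE, TOY_COPY };

/* Take the lock only if it is free; never wait. */
static bool toy_trylock_page(struct toy_page *p)
{
	if (p->locked)
		return false;
	p->locked = true;
	return true;
}

static void toy_unlock_page(struct toy_page *p)
{
	assert(p->locked);
	p->locked = false;
}

/* Write-fault decision: reuse the page in place or copy it. */
static enum toy_cow_action toy_wp_fault(struct toy_page *p)
{
	enum toy_cow_action action = TOY_COPY;

	/* Contended lock: just copy, which is always safe under (a)/(b). */
	if (!toy_trylock_page(p))
		return TOY_COPY;

	/* Only a page with a single reference may be reused in place. */
	if (p->refcount == 1)
		action = TOY_REUSE;

	toy_unlock_page(p);
	return action;
}
```

The point of the sketch is that the fallback on lock contention is the same action as the fallback on an elevated refcount, so no path ever has to wait for the lock.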
In your model, as far as I can tell, you leave the page read-only and a subsequent COW fault _can_ happen, which means that now the subsequent COW needs to be very, very careful, because if it ever copies a page that was GUP'ed, you just broke the rules.
So COWing too much is a bug (because it breaks the page from the GUP), but COWing too little is an even worse problem (because it means that now the GUP user can see data it shouldn't have seen).
Good summary, I'll extend below.
Our old code literally COWed too little. It's why all those changes happened in the first place.
Let's see if we can agree on some things to get a common understanding.
What can happen with COW is:
1) Missed COW
We miss a COW, therefore someone has access to a wrong page.
This is the security issue as in patch #11. The security issue documented in [1].
2) Unnecessary COW
We do a COW, but there are no other valid users, so it's just overhead + noise.
The performance issue documented in section 5 in [1].
3) Wrong COW
We do a COW but there are other valid users (-> GUP).
The memory corruption issue documented in section 2 and 3 in [1].
Most notably, the io_uring reproducer, which races with the page_maybe_dma_pinned() check in current code, can trigger this easily, and exactly this issue is what gives me nightmares. [2]
Does that make sense? If we agree on the above, then here is how the currently discussed approaches differ:
page_count != 1:
* 1) cannot happen
* 2) can happen easily (speculative references due to pagecache, migration, daemon, pagevec, ...)
* 3) can happen in the current code
mapcount > 1:
* 1) your concern is that this can happen due to concurrent swapin
* 2) cannot happen.
* 3) your concern is that this can happen due to concurrent swapin
If we can agree on that, I can see why you dislike mapcount, can you see why I dislike page_count?
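As a rough illustration of the comparison above, the following toy model classifies what each check does for a few page states. Everything in it is an assumption made for the sketch: the struct, the toy_* names, and in particular the simplified accounting in toy_page_count() (one reference per mapping, per speculative reference, per GUP pin, plus one for the swapcache when swap sharers exist). It reproduces only some rows of the comparison and says nothing about the races themselves.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_outcome {
	TOY_OK,
	TOY_MISSED_COW,		/* case 1) */
	TOY_UNNECESSARY_COW,	/* case 2) */
	TOY_WRONG_COW		/* case 3) */
};

/* Toy state of one anon page as seen by the faulting process (assumption). */
struct toy_anon_page {
	int mapcount;		/* page table mappings of the page */
	int spec_refs;		/* speculative references (pagecache, migration, ...) */
	int gup_pins;		/* GUP references that must keep seeing this page */
	int swap_sharers;	/* other processes holding only a swap entry */
};

/* Simplified reference accounting; the real rules are more involved. */
static int toy_page_count(const struct toy_anon_page *p)
{
	assert(p->mapcount >= 0 && p->spec_refs >= 0);
	assert(p->gup_pins >= 0 && p->swap_sharers >= 0);
	return p->mapcount + p->spec_refs + p->gup_pins +
	       (p->swap_sharers > 0 ? 1 : 0);
}

/* The two COW checks under discussion. */
static bool toy_copy_by_page_count(const struct toy_anon_page *p)
{
	return toy_page_count(p) != 1;
}

static bool toy_copy_by_mapcount(const struct toy_anon_page *p)
{
	return p->mapcount > 1;
}

/* Classify the result of a copy/no-copy decision for a given state. */
static enum toy_outcome toy_classify(const struct toy_anon_page *p, bool copied)
{
	/* "Active" sharing via mappings or "inactive" sharing via swap. */
	bool shared = p->mapcount > 1 || p->swap_sharers > 0;

	if (!copied)
		return shared ? TOY_MISSED_COW : TOY_OK;
	if (p->gup_pins > 0)
		return TOY_WRONG_COW;
	return shared ? TOY_OK : TOY_UNNECESSARY_COW;
}
```

In this model, an exclusive page with one speculative reference is copied unnecessarily by the page_count check, an exclusive page with a GUP pin is copied wrongly by it, and a page with a swap sharer is missed by the mapcount check.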
Ideally we'd really have a fast and reliable check for "is this page shared and could get used by multiple processes -- either multiple processes are already mapping it R/O or could map it via the swap R/O later".
This is why I'm pushing that whole story line of
(1) COW is based purely on refcounting, because that's the only thing that obviously can never COW too little.
I am completely missing how 2) or 3) could *ever* be handled properly for page_count != 1. 3) is obviously more important and gives me nightmares.
And that's what I'm trying to communicate the whole time: page_count is absolutely fragile, because anything that results in a page getting mapped R/O into a page table can trigger 3). And as [2] proves, that can even happen with *swap*.
(see how we're running into the same swap issues with both approaches? Stupid swap :) )
(2) GUP pre-COWs (the thing I called the "(a)" rule earlier) and then makes sure to not mark pinned pages COW again (that "(b)" rule).
and here "don't use page_mapcount()" really is about that (1).
You do seem to have kept (1) in that your COW rules don't seem to change (but maybe I missed it), but because your GUP-vs-COW semantics are very different indeed, I'm not at all convinced about (2).
Oh yes, sorry, not in the context of this series. The point is that the current page_count != 1 covers mapcount > 1, so we can adjust that separately later.
You mentioned "design", so let's assume we have a nice function:
/*
 * Check if an anon page is shared or exclusively used by a single
 * process: if shared, the page is shared by multiple processes either
 * mapping the page R/O ("active sharing") or having swap entries that
 * could result in the page getting mapped R/O ("inactive sharing").
 *
 * This function is safe to be called under mmap_lock in read/write mode
 * because it prevents concurrent fork() sharing the page.
 * This function is safe to be called from gup-fast-only in IRQ context,
 * as it detects concurrent fork() sharing the page
 */
bool page_anon_shared();
Can we agree that that would be a suitable function for (1) and (2) instead of using either the page_count or the mapcount directly? (yes, how to actually make it reliable due to swapin is to be discussed, but it might be a problem worth solving if that's the way to go)
For hugetlb, this would really have to use the mapcount as explained (after all, fortunately there is no swap ...).
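To sketch how the lockless (gup-fast) side of such a function could detect a concurrent fork(), here is a userspace toy using a sequence counter in the spirit of write_protect_seq. This is only an illustration under stated assumptions: struct toy_mm, struct toy_shared_page and all toy_* functions are hypothetical, C11 atomics stand in for the kernel's seqcount primitives, and the swap_sharers field assumes the open "reliable with swapin" question has some answer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-ins (assumptions); not the kernel's mm_struct or struct page. */
struct toy_mm {
	atomic_uint write_protect_seq;	/* odd while a fork() is sharing pages */
};

struct toy_shared_page {
	atomic_int mapcount;		/* "active sharing" when > 1 */
	atomic_int swap_sharers;	/* "inactive sharing" when > 0 */
};

/* fork() side: bump the counter before and after write-protecting. */
static void toy_fork_begin(struct toy_mm *mm)
{
	atomic_fetch_add(&mm->write_protect_seq, 1);
}

static void toy_fork_end(struct toy_mm *mm)
{
	unsigned int old = atomic_fetch_add(&mm->write_protect_seq, 1);

	assert(old & 1);	/* must pair with toy_fork_begin() */
	(void)old;
}

/*
 * Lockless check. Returns true if *shared holds a stable answer, false if
 * a fork() was or might have been in progress and the caller has to fall
 * back to a slow path that takes the lock.
 */
static bool toy_page_anon_shared_fast(struct toy_mm *mm,
				      struct toy_shared_page *page,
				      bool *shared)
{
	unsigned int seq = atomic_load(&mm->write_protect_seq);

	if (seq & 1)
		return false;

	*shared = atomic_load(&page->mapcount) > 1 ||
		  atomic_load(&page->swap_sharers) > 0;

	/* A changed counter means fork() raced with the check above. */
	return atomic_load(&mm->write_protect_seq) == seq;
}
```

The design choice shown here is that the fast path never blocks: it either returns a stable answer or reports that it cannot, leaving the slow path to sort things out.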
[1] https://lore.kernel.org/all/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com/
[2] https://gitlab.com/aarcange/kernel-testcases-for-v5.11/-/blob/main/io_uring_...