On 12/18/21 22:02, Nadav Amit wrote:
On Dec 18, 2021, at 4:35 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
(I have only ever seen the kernel side of uffd, not the actual user side, so I'm not sure about the use patterns).
I use it in a very fine granularity, and I suspect QEMU and CRIU do so too.
That said, your suggestion of a shadow sw page table bit thing would also work. And it would solve some problems we have in core areas (notably "page_special()" which right now has that ARCH_HAS_PTE_SPECIAL thing).
It would make it really easy to have that "this page table entry is pinned" flag too.
I found my old messy code for the software-PTE thing.
I see that eventually I decided to hold a pointer to the “extra PTEs” of each page in the PMD-page-struct. [ I also implemented the 2-adjacent pages approach but this code is long gone. ]
My rationale was that:
- It does not bound you to have the same size for PTE and “extra-PTE”
- The PMD-page struct is anyhow hot (since you acquired the PTL)
- Allocating “extra-PTE” dynamically does not require to rewire the page-tables, which requires a TLB flush.
I think there is a place to hold a pointer in the PMD-page-struct (_pt_pad_1, we just need to keep the lowest bit clear so the kernel won’t mistaken it to be a compound page).
I still don’t know what exactly you have in mind for making use out of it for the COW issue. Keeping a pin-count (which requires internal API changes for unpin_user_page() and friends?) or having “was ever pinned” sticky bit? And then changing page_needs_cow_for_dma() to look at the PTE so copy_present_pte() would break the COW eagerly?
Anyhow, I can clean it up and send (although it is rather simple and I ignored many thing, such as THP, remap, etc), but I am not sure I have the time now to fully address the COW problem. I will wait for Monday for David’s response.
Hi Nadav,
A couple of thoughts about this part of the design:
a) The PMD-page-struct approach won't help as much, because (assuming that we're using it in an attempt to get a true, perfect pin count), you are combining the pin counts of a PMD's worth of pages. OTOH...maybe that actually *is* OK, assuming you don't overflow--except that you can only answer the "is it dma-pinned?" question at a PMD level. That's a contradiction of your stated desire above to have very granular control.
Also, because of not having bit 0 available in page._pt_pad_1, I think the count would have to be implemented as adding and subtracting 2, instead of 1 (in order to keep the value even), further reducing the counter range.
b) If, instead, you used your older 2-adjacent pages approach, then Linus' comment makes more sense here: we could use the additional struct page to hold an exact pin count, per page. That way, you can actually build a wrapper function such as:
page_really_is_dma_pinned()
...and/or simply get a stronger "maybe" for page_maybe_dma_pinned().
Furthermore, this direction is extensible and supports solving other "I am out of space in struct page" problems, at the cost of more pages, of course.
As an aside, I find it instructive that we're talking about this approach, instead of extending struct page. The lesson I'm taking away is: allocating more space for some cases (2x pages) is better than having *all* struct pages be larger than they are now.
Anyway, the pin count implementation would look somewhat like the existing hpage_pincount, which similarly has ample space for a separate, exact pin count. In other words, this sort of thing (mostly-pseudo code):
diff --git a/include/linux/mm.h b/include/linux/mm.h index a7e4a9e7d807..646761388025 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -938,6 +938,16 @@ static inline bool hpage_pincount_available(struct page *page) return PageCompound(page) && compound_order(page) > 1; }
+static inline bool shadow_page_pincount_available(struct page *page) +{ + /* + * TODO: Nadav: connect this up with the shadow page table + * implementation, and return an appropriate answer. + */ + + return false; // hardcoded for now, for compile testing +} + static inline int head_compound_pincount(struct page *head) { return atomic_read(compound_pincount_ptr(head)); @@ -950,6 +960,13 @@ static inline int compound_pincount(struct page *page) return head_compound_pincount(page); }
+static inline int shadow_page_pincount(struct page *page) +{ + VM_BUG_ON_PAGE(!shadow_page_pincount_available(page), page); + + return atomic_read(shadow_page_pincount_ptr(page)); +} + static inline void set_compound_order(struct page *page, unsigned int order) { page[1].compound_order = order; @@ -1326,6 +1343,9 @@ static inline bool page_maybe_dma_pinned(struct page *page) if (hpage_pincount_available(page)) return compound_pincount(page) > 0;
+ if (shadow_page_pincount_available(page)) + return shadow_page_pincount(page) > 0; + /* * page_ref_count() is signed. If that refcount overflows, then * page_ref_count() returns a negative value, and callers will avoid
c) The "was it ever pinned" sticky bit is not a workable concept, at the struct page level. A counter is required, in order to allow pages to live out their normal lives to their fullest potential. The only time we even temporarily got away with this kind of stickiness was at a higher level, and only per-process, not per-physical-page. Processes come and go, but the struct pages are more or less forever, so once you mark one sticky like this, it's out of play.
thanks,