On Sat, Dec 18, 2021 at 4:19 PM Nadav Amit <namit@vmware.com> wrote:
> I have always felt that the PTE software-bits limit is very artificial. We can just allocate two adjacent pages when needed, one for PTEs and one for extra software bits. A software bit in the PTE can indicate that “extra software bits” are relevant (to save cache-misses), and a bit in the PTEs' page-struct can indicate whether there is an adjacent “extra software bits” page.
Hmm. That doesn't sound very bad, no. And it would be nice to have more software bits (and have them portably).
> I don’t think that I am following. The write-protection of UFFD means that the userspace wants to intervene before anything else (including COW).
The point I was making (badly) is that UFFD_WP is only needed for the case where the pte isn't already non-writable for other reasons.
> UFFD_WP indications are recorded per PTE (i.e., not VMA).
The changing of those bits is basically a bastardized 'mprotect()', and does already require the vma to be marked VM_UFFD_WP.
And the way you set (or clear) the bits is with a range operation. It really could have been done with mprotect(), and with actual explicit vma bits.
The fact that it now uses the page table bit is rather random. I think it would actually be cleaner to make that userfaultfd_writeprotect truly *be* a vma range.
Right now it's kind of "half this, half that".
Of course, it's possible that because of this situation, some users do a lot of fine-grained VM_UFFD_WP setting, and they kind of expect to not have issues with lots of vma fragments. So practical concerns may have made the implementation set in stone.
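To make the fragmentation concern concrete, here is a toy userspace model, not kernel code. It assumes that write-protect state is tracked per page and that, if the same state were instead a vma flag, adjacent ranges with identical flags would merge into a single vma. The names (toy_mm, toy_wp_range, toy_vma_count) are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 64   /* size of the toy address range, in pages */

/* Per-page "write-protected" state, the way a per-PTE bit records it. */
struct toy_mm {
    bool wp[TOY_NPAGES];
};

/* Range operation: set or clear write-protect on pages [start, end). */
static void toy_wp_range(struct toy_mm *mm, size_t start, size_t end, bool wp)
{
    for (size_t i = start; i < end && i < TOY_NPAGES; i++)
        mm->wp[i] = wp;
}

/*
 * How many vmas would be needed if the same state lived in vma flags:
 * one vma per maximal run of pages with identical state (assumes that
 * neighbouring vmas with equal flags are merged).
 */
static size_t toy_vma_count(const struct toy_mm *mm)
{
    size_t n = 1;

    for (size_t i = 1; i < TOY_NPAGES; i++)
        if (mm->wp[i] != mm->wp[i - 1])
            n++;
    return n;
}
```

In this model a single protected range in the middle of the mapping costs three vmas, while protecting every other page costs one vma per page. With per-PTE bits the vma count stays at one regardless of the pattern, which is the practical argument for the current scheme.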
(I have only ever seen the kernel side of uffd, not the actual user side, so I'm not sure about the use patterns).
That said, your suggestion of a shadow sw page table bit thing would also work. And it would solve some problems we have in core areas (notably "page_special()" which right now has that ARCH_HAS_PTE_SPECIAL thing).
It would make it really easy to have that "this page table entry is pinned" flag too.
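A rough userspace sketch of the shadow-page idea, under stated assumptions: the page table is one 4 KiB page of 64-bit entries, the extra software bits live in the page directly after it, and one in-PTE bit says whether the shadow word needs to be read at all. Everything here (PTE_SW_EXTRA and its bit position, struct pt_page, the helper names) is hypothetical and does not correspond to existing kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PT_PAGE_SIZE   4096u
#define PT_NR_ENTRIES  (PT_PAGE_SIZE / sizeof(uint64_t))  /* 512 */

/* Hypothetical in-PTE software bit: "the shadow word is relevant". */
#define PTE_SW_EXTRA   (UINT64_C(1) << 58)

/* Stand-in for the page-table page's struct page. */
struct pt_page {
    uint64_t *ptes;        /* the PTE page, optionally followed by a shadow page */
    bool      has_shadow;  /* "adjacent extra-bits page exists" flag */
};

static int pt_page_alloc(struct pt_page *pt, bool with_shadow)
{
    size_t size = with_shadow ? 2 * PT_PAGE_SIZE : PT_PAGE_SIZE;

    pt->ptes = aligned_alloc(PT_PAGE_SIZE, size);
    if (!pt->ptes)
        return -1;
    memset(pt->ptes, 0, size);
    pt->has_shadow = with_shadow;
    return 0;
}

static void pt_page_free(struct pt_page *pt)
{
    free(pt->ptes);
    pt->ptes = NULL;
    pt->has_shadow = false;
}

/* The shadow word for entry idx sits exactly one page after the PTE. */
static uint64_t *pte_shadow(struct pt_page *pt, size_t idx)
{
    assert(idx < PT_NR_ENTRIES);
    return pt->has_shadow ? &pt->ptes[idx + PT_NR_ENTRIES] : NULL;
}

/* Set extra software bits; fails if this table has no shadow page. */
static int pte_set_extra(struct pt_page *pt, size_t idx, uint64_t bits)
{
    uint64_t *shadow = pte_shadow(pt, idx);

    if (!shadow)
        return -1;
    *shadow |= bits;
    pt->ptes[idx] |= PTE_SW_EXTRA;
    return 0;
}

/* Read extra bits; only touches the shadow page if the PTE says so. */
static uint64_t pte_get_extra(struct pt_page *pt, size_t idx)
{
    uint64_t *shadow;

    assert(idx < PT_NR_ENTRIES);
    if (!(pt->ptes[idx] & PTE_SW_EXTRA))
        return 0;  /* common case: no second cache line touched */
    shadow = pte_shadow(pt, idx);
    return shadow ? *shadow : 0;
}
```

The in-PTE bit is what keeps the common case cheap: a reader that finds it clear never loads the shadow word. The has_shadow flag lets tables that need no extra bits stay at one page.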
Linus