On 01/12/2025 18:35, Peter Xu wrote:
On Mon, Dec 01, 2025 at 04:48:22PM +0000, Nikita Kalyazin wrote:
I believe I found the precise point where we convinced ourselves that minor support was sufficient: [1]. If we no longer find that reasoning valid, then implementing MISSING is indeed the only option.
Now after I re-read the discussion, I may have made a wrong statement there, sorry. I could have got slightly confused on when the write() syscall can be involved.
I agree that if you want to get an event on a cache miss with the current uffd definitions, and pre-population is forbidden, then a MISSING trap is required. That holds with or without UFFDIO_COPY being available.
Do I understand it right that UFFDIO_COPY is not allowed in your case, but only write()?
No, UFFDIO_COPY would work perfectly fine. We will still use write() whenever we resolve stage-2 faults, as they aren't visible to UFFD. When a userfault occurs at an offset that already has a page in the cache, we will have to keep using UFFDIO_CONTINUE, so it looks like both will be required:
- user mapping major fault -> UFFDIO_COPY (fills the cache and sets up userspace PT)
- user mapping minor fault -> UFFDIO_CONTINUE (only sets up userspace PT)
- stage-2 fault -> write() (only fills the cache)
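
To make the three cases above concrete, here is a rough sketch of the dispatch as a pure decision function. It is not taken from any existing VMM or kernel code: the enum names and pick_resolution() are made up for illustration, and the actual UFFDIO_COPY / UFFDIO_CONTINUE ioctls and the write() call are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: where the fault was observed. */
enum fault_source {
    FAULT_USER_MAPPING, /* reported through userfaultfd */
    FAULT_STAGE2,       /* stage-2 fault, not visible to userfaultfd */
};

/* Hypothetical: which mechanism resolves the fault. */
enum resolution {
    RESOLVE_UFFDIO_COPY,     /* fills the cache and sets up userspace PT */
    RESOLVE_UFFDIO_CONTINUE, /* only sets up userspace PT */
    RESOLVE_WRITE,           /* only fills the cache */
};

/*
 * Map a fault to its resolution, following the list above.
 * page_in_cache distinguishes a minor fault (page already in the cache)
 * from a major fault (no page yet). It is ignored for stage-2 faults,
 * since the list does not distinguish those cases.
 */
static enum resolution pick_resolution(enum fault_source src, bool page_in_cache)
{
    if (src == FAULT_STAGE2)
        return RESOLVE_WRITE;

    assert(src == FAULT_USER_MAPPING);
    return page_in_cache ? RESOLVE_UFFDIO_CONTINUE : RESOLVE_UFFDIO_COPY;
}
```

In a real handler, the user-mapping branch would presumably be driven by the flags in the userfault message, and each return value would translate into the corresponding ioctl or syscall.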
One way to work around this might be to introduce a new UFFD_FEATURE bit that allows the MINOR registration to trap all pgtable faults, which would change the MINOR fault semantics.
This would work equally well for us. I suppose these MINOR+MAJOR semantics would be more intrusive from the API point of view, though.
That will need some further thought. Meanwhile, we may also want to make sure the old shmem/hugetlbfs semantics are kept (e.g. they should fail MINOR registration if the new feature bit is enabled in the ctx somehow, or the codebase should support them properly).
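
As a very rough sketch of the first option (failing MINOR registration on the old backends when the new bit is set), the check might look something like the following. Nothing here is existing kernel code: the feature bit has no name yet, so trap_all_feature, enum backing and minor_register_rejected() are all placeholders.

```c
#include <stdbool.h>

/* Hypothetical: backing type of the range being registered. */
enum backing {
    BACKING_SHMEM,
    BACKING_HUGETLBFS,
    BACKING_OTHER, /* any backend that would implement the new semantics */
};

/*
 * Return true if a MINOR registration should be rejected.
 * trap_all_feature stands for the proposed (not yet existing)
 * UFFD_FEATURE bit being enabled in the ctx. With the bit clear,
 * nothing changes, so this check never rejects.
 */
static bool minor_register_rejected(bool trap_all_feature, enum backing b)
{
    if (!trap_all_feature)
        return false; /* old semantics untouched */

    /* Old backends do not implement the new semantics in this sketch. */
    return b == BACKING_SHMEM || b == BACKING_HUGETLBFS;
}
```

The alternative mentioned above, supporting the new semantics properly for shmem/hugetlbfs, would make this check unnecessary.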
Thanks,
-- Peter Xu