On Fri, May 19, 2023 at 9:20 AM Peter Xu peterx@redhat.com wrote:
Hi, Jiaqi,
On Fri, May 19, 2023 at 08:04:09AM -0700, Jiaqi Yan wrote:
I don't think CAP_ADMIN is something we can work around: a VMM must be a good citizen to avoid introducing any vulnerability to the host or guest.
On the other hand, "Userfaults allow the implementation of on-demand paging from userland and more generally they allow userland to take control of various memory page faults, something otherwise only the kernel code could do." [3]. I am not familiar with the UFFD internals, but our use case seems to match what UFFD wants to provide: without affecting the whole world, give a specific userspace (without CAP_ADMIN) the ability to handle page faults (indirectly emulate a HWPOISON page (in my mind I treat it as SetHWPOISON(page) + TestHWPOISON(page) operation in kernel's PF code)). So is it fair to say what Axel provided here is "provide !ADMIN somehow"?
Userfault keywords on "user", IMHO. We don't strictly need userfault to resolve anything regarding CAP_ADMIN problems. MADV_DONTNEED also dosn't need CAP_ADMIN, same to any new madvise() if we want to make it useful for injecting poisoned ptes with !ADMIN and limit it within current->mm.
But I think you're right that userfaultfd always tried to avoid having ADMIN and keep everything within its own scope of permissions.
So again, totally no objection on make it uffd specific for now if you guys are all happy with it, but just to be clear that it's (to me) mostly for avoiding another WAKE, and afaics that's not really for solving the ADMIN issue here.
How about this plan:
Since the concrete use case we have (postcopy live migration) is UFFD-specific, let's leave it as a UFFDIO_* operation for now.
If in the future we come up with a non-UFFD use case, we can add a new MADV_* which does this operation at that point. From my perspective they could even share most of the same implementation code.
I don't think it's a big problem keeping the UFFDIO_* version too at that point, because it still provides some (perhaps small) value:
- Combines the operation + waking into one syscall - It allows us to support additional UFFD flags which modify / extend the operation in UFFD-specific ways, if we want to add those in the future
Seem reasonable?
If so, I'll send a v2 with documentation updates.
Thanks,
-- Peter Xu