On Dec 18, 2021, at 10:42 AM, Jason Gunthorpe jgg@nvidia.com wrote:
On Fri, Dec 17, 2021 at 07:38:39PM -0800, Linus Torvalds wrote:
On Fri, Dec 17, 2021 at 7:30 PM Nadav Amit namit@vmware.com wrote:
In such a case, I do think it makes sense to fail uffd-wp (when page_count() > 1), and in a prototype I am working on I do something like that.
Ack. If uffd-wp finds a page that is pinned, just skip it as not write-protectable.
Because some of the pinners might be writing to it, of course - just not through the page tables.
That doesn't address the qemu use case though. The RDMA pin is the 'coherent r/o pin' we discussed before, which requires that the pages remain un-write-protected and the HW DMA is read only.
The VFIO pin will enable dirty page tracking in the system IOMMU so it gets the same effect from qemu's perspective as the CPU WP is doing.
In these operations every single page of the guest will be pinned, so "skip it" just means userfaultfd wp doesn't work at all.
Qemu needs some solution to be able to dirty track the CPU memory for migration.
My bad. I misunderstood the scenario.
Yes, I guess that you pin the pages early for RDMA registration, which is also something you may do for io_uring buffers. This would render userfaultfd unusable.
I do not see how it can be solved without custom, potentially complicated logic, which the page_count() approach wants to avoid.
The only thing I can think of is requiring the pinned regions to be first madvise’d with MADV_DONTFORK and not COW’ing in such a case. This would break existing code, though.
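
For reference, a minimal userspace sketch of what such a requirement might look like, assuming a page-aligned anonymous buffer that is about to be registered for RDMA or io_uring. The helper name alloc_dontfork_buffer() is hypothetical; mmap(), madvise() and MADV_DONTFORK are the existing Linux interfaces. It shows only the madvise step, not the pinning itself or any kernel-side change.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not from the thread): allocate an anonymous,
 * page-aligned buffer and mark it MADV_DONTFORK before it is handed to
 * anything that pins its pages. Returns NULL on failure.
 */
void *alloc_dontfork_buffer(size_t len)
{
    void *buf;

    if (len == 0)
        return NULL;

    buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    /* Exclude the range from any child created by a later fork(). */
    if (madvise(buf, len, MADV_DONTFORK) != 0) {
        munmap(buf, len);
        return NULL;
    }

    return buf;
}
```

With MADV_DONTFORK the range is simply not present in a child after fork(), so there is no parent/child sharing left to COW for those pages. Applications that expect the child to see the buffer are the ones that would break.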