On Tue, Oct 22, 2024 at 08:39:34AM -0700, Sean Christopherson wrote:
On Tue, Oct 22, 2024, Yosry Ahmed wrote:
Even if we don't want mlock() to err in this case, shouldn't we just do nothing?
Ideally, yes.
Agreed. There's no sense in having this count against the NR_MLOCK stats, for example.
I see a lot of checks at the beginning of mlock_fixup() to check whether we should operate on the vma, perhaps we should also check for these KVM vmas?
Definitely not. KVM may be doing something unexpected, but the VMA certainly isn't unique enough to warrant mm/ needing dedicated handling.
Focusing on KVM is likely a waste of time. There are probably other subsystems and/or drivers that .mmap() kernel allocated memory in the same way. Odds are good KVM is just the messenger, because syzkaller knows how to beat on KVM. And even if there aren't any other existing cases, nothing would prevent them from coming along in the future.
They all need to be fixed. How to do that is not an answer I have at this point. Ideally we can fix them without changing them all immediately (but they will all need to be fixed eventually because pages will no longer have a refcount and so get_page() will need to go away ...)
Trying to or maybe set VM_SPECIAL in kvm_vcpu_mmap()? I am not sure tbh, but this doesn't seem right.
Agreed. VM_DONTEXPAND is the only VM_SPECIAL flag that is remotely appropriate, but setting VM_DONTEXPAND could theoretically break userspace, and other than preventing mlock(), there is no reason why the VMA can't be expanded. I doubt any userspace VMM is actually remapping and expanding a vCPU mapping, but trying to fudge around this outside of core mm/ feels kludgy and has the potential to turn into a game of whack-a-mole.
Actually, VM_PFNMAP is probably ideal. We're not really mapping pages here (I mean, they are pages, but they're not filesystem pages or anonymous pages ... there's no rmap to them). We're mapping blobs of memory whose refcount is controlled by the vma that maps them. We don't particularly want to be able to splice() this memory, or do RDMA to it. We probably do want gdb to be able to read it (... yes?) which might be a complication with a PFNMAP VMA.
We've given a lot of flexibility to device drivers about how they implement mmap() and I think that's now getting in the way of some important improvements. I want to see a simpler way of providing the same functionality, and I'm not quite there yet.