 
On Wed, Oct 15, 2025, Gregory Price wrote:
On Fri, Sep 26, 2025 at 12:36:27PM -0700, Sean Christopherson via Linux-f2fs-devel wrote:
static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
					     unsigned long addr, pgoff_t *pgoff)
{
	*pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);

	return __kvm_gmem_get_policy(GMEM_I(file_inode(vma->vm_file)), *pgoff);
Argh!!!!! This breaks the selftest because do_get_mempolicy() very specifically falls back to the default_policy, NOT to the current task's policy. That is *exactly* the type of subtle detail that needs to be commented, because there's no way some random KVM developer is going to know that returning NULL here is important with respect to get_mempolicy() ABI.
do_get_mempolicy() was designed to be accessed by the syscall, not as an in-kernel ABI.
Ya, by "get_mempolicy() ABI" I meant the uABI for the get_mempolicy syscall.
get_task_policy also returns the default policy if there's nothing there, because that's what applies.
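To make the distinction concrete, here is a toy userspace model of the two fallbacks being discussed. It is not kernel code, and every toy_* name is made up for illustration. The query path stands in for the get_mempolicy() address lookup, which falls back to the default policy when the VMA lookup returns NULL; the allocation path stands in for get_vma_policy(), which tries the task policy first.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's real definitions. */
enum toy_mode { TOY_DEFAULT, TOY_BIND };

struct toy_policy { enum toy_mode mode; };
struct toy_task { struct toy_policy *mempolicy; };

static struct toy_policy toy_default_policy = { TOY_DEFAULT };

/* Example task that has bound itself to some nodes (hypothetical). */
static struct toy_policy toy_bind_policy = { TOY_BIND };
static const struct toy_task toy_bound_task = { &toy_bind_policy };

/* Query path: a NULL VMA policy is reported as the default policy. */
static struct toy_policy *toy_query_policy(struct toy_policy *vma_pol)
{
	return vma_pol ? vma_pol : &toy_default_policy;
}

/* Allocation path: a NULL VMA policy falls back to the task policy first. */
static struct toy_policy *toy_alloc_policy(struct toy_policy *vma_pol,
					   const struct toy_task *task)
{
	if (vma_pol)
		return vma_pol;
	return task->mempolicy ? task->mempolicy : &toy_default_policy;
}
```

In this model, toy_query_policy(NULL) yields the default policy while toy_alloc_policy(NULL, &toy_bound_task) yields the task's bind policy. If the per-VMA hook substituted the task policy itself instead of returning NULL, the query path would report the bind policy, which is the kind of user-visible difference the selftest tripped over.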
I have dangerous questions:
Not dangerous at all, I find them very helpful!
why is __kvm_gmem_get_policy() using mpol_shared_policy_lookup() instead of get_vma_policy()?
With the disclaimer that I haven't followed the gory details of this series super closely, my understanding is...
Because the VMA is a means to an end, and we want the policy to persist even if the VMA goes away.
With guest_memfd, KVM effectively inverts the standard MMU model. Instead of mm/ being the primary MMU and KVM being a secondary MMU, guest_memfd is the primary MMU and any VMAs are secondary (mostly; it's probably more like 1a and 1b). This allows KVM to map guest_memfd memory into a guest without a VMA, or with more permissions than are granted to host userspace, e.g. guest_memfd memory could be writable by the guest, but read-only for userspace.
But we still want to support things like mbind() so that userspace can ensure guest_memfd allocations align with the vNUMA topology presented to the guest, or are bound to the NUMA node where the VM will run. We considered adding equivalent file-based syscalls, e.g. fbind(), but IIRC the consensus was that doing so was unnecessary (and potentially messy?) since we were planning on eventually adding mmap() support to guest_memfd anyways.
get_vma_policy does this all for you
I assume that doesn't work if the intent is for new VMAs to pick up the existing policy from guest_memfd? And more importantly, guest_memfd needs to hook ->set_policy so that changes through e.g. mbind() persist beyond the lifetime of the VMA.
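As a rough illustration of the "policy outlives the VMA" point, here is a toy userspace sketch in which the policy is stored on an inode-like object and the VMA only forwards to it. All toy_* names are hypothetical, and the model keeps one policy per file, whereas the real shared policy is tracked per range of page indices.

```c
#include <assert.h>
#include <stddef.h>

struct toy_policy { int node; };

/* Stand-in for the guest_memfd inode: it owns the policy. */
struct toy_inode { const struct toy_policy *policy; };

/* Stand-in for a VMA mapping that inode; it holds no policy of its own. */
struct toy_vma { struct toy_inode *inode; };

/* Models a ->set_policy hook: the change lands on the inode. */
static void toy_set_policy(struct toy_vma *vma, const struct toy_policy *pol)
{
	vma->inode->policy = pol;
}

/* Models a ->get_policy hook: any VMA of the inode sees the same policy. */
static const struct toy_policy *toy_get_policy(const struct toy_vma *vma)
{
	return vma->inode->policy;
}

/* Policy seen by a second mapping after the first mapping is gone. */
static const struct toy_policy *toy_policy_after_remap(const struct toy_policy *pol)
{
	struct toy_inode inode = { NULL };

	{
		struct toy_vma first = { &inode };
		toy_set_policy(&first, pol);	/* mbind() via the first mapping */
	}					/* first mapping is unmapped */

	struct toy_vma second = { &inode };
	return toy_get_policy(&second);
}

static const struct toy_policy toy_node1 = { 1 };
```

Here toy_policy_after_remap(&toy_node1) returns &toy_node1 even though the mapping that set it no longer exists, and a file that never had a policy set still reports NULL.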
struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
				 unsigned long addr, int order, pgoff_t *ilx)
{
	struct mempolicy *pol;

	pol = __get_vma_policy(vma, addr, ilx);
	if (!pol)
		pol = get_task_policy(current);
	if (pol->mode == MPOL_INTERLEAVE ||
	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
		*ilx += vma->vm_pgoff >> order;
		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
	}
	return pol;
}
Of course you still have the same issue: get_task_policy will return the default, because that's what applies.
do_get_mempolicy just seems like the completely incorrect interface to be using here.
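For anyone following the index arithmetic in the two functions quoted above, here is a small userspace sketch of it with made-up numbers. It assumes 4 KiB pages, and the toy_* names and the example VMA are purely illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

struct toy_vma {
	uint64_t vm_start;	/* first address mapped by the VMA */
	uint64_t vm_pgoff;	/* file offset of vm_start, in pages */
};

/* File page index backing addr, as in the quoted kvm_gmem_get_policy(). */
static uint64_t toy_file_index(const struct toy_vma *vma, uint64_t addr)
{
	return vma->vm_pgoff + ((addr - vma->vm_start) >> TOY_PAGE_SHIFT);
}

/* Amount the quoted get_vma_policy() adds to *ilx for interleave modes. */
static uint64_t toy_interleave_bias(const struct toy_vma *vma, uint64_t addr,
				    int order)
{
	uint64_t ilx = 0;

	ilx += vma->vm_pgoff >> order;
	ilx += (addr - vma->vm_start) >> (TOY_PAGE_SHIFT + order);
	return ilx;
}

/* Hypothetical mapping that starts 16 pages into the file. */
static const struct toy_vma toy_example_vma = { 0x7f0000000000ULL, 16 };
```

For an address five pages into this mapping, toy_file_index() gives 16 + 5 = 21. The interleave bias is also 21 at order 0, and at order 2 it is (16 >> 2) + (20480 >> 14) = 4 + 1 = 5.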