Alex, here is a change to vaddr_get_pfn() that we discussed in this thread: https://lists.nongnu.org/archive/html/qemu-devel/2018-01/msg07117.html
Namely, drop support for passing Filesystem-DAX mappings through to guests. Perhaps in the future we can create some para-virtualized passthrough interface to coordinate guest-DMA vs host-filesystem operations. For now, this needs to be disabled to preserve data integrity and to guarantee forward progress of filesystem operations.
If you want to take this through your tree please grab the other dax fixups as well. Otherwise, let me know and I'll take the lot through the nvdimm tree.
---
Dan Williams (3):
      dax: fix S_DAX definition
      dax: short circuit vma_is_fsdax() in the CONFIG_FS_DAX=n case
      vfio: disable filesystem-dax page pinning

 drivers/vfio/vfio_iommu_type1.c | 18 +++++++++++++++---
 include/linux/fs.h              |  4 +++-
 2 files changed, 18 insertions(+), 4 deletions(-)
Make sure S_DAX is defined in the CONFIG_FS_DAX=n + CONFIG_DEV_DAX=y case. Otherwise vma_is_dax() may incorrectly return false in the Device-DAX case.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/fs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 511fbaabf624..a3329258ff5c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1850,7 +1850,7 @@ struct super_operations {
 #define S_IMA		1024	/* Inode has an associated IMA struct */
 #define S_AUTOMOUNT	2048	/* Automount/referral quasi-directory */
 #define S_NOSEC		4096	/* no suid or xattr security attributes */
-#ifdef CONFIG_FS_DAX
+#if IS_ENABLED(CONFIG_FS_DAX) || IS_ENABLED(CONFIG_DEV_DAX)
 #define S_DAX		8192	/* Direct Access, avoiding the page cache */
 #else
 #define S_DAX		0	/* Make all the DAX code disappear */
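To make the effect of the old guard concrete, here is a minimal userspace sketch, not kernel code. It assumes the DAX test boils down to masking an inode's flags word with S_DAX. The names MODEL_FS_DAX, MODEL_DEV_DAX, struct model_inode, and the *_old / *_new helpers are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for CONFIG_FS_DAX=n and CONFIG_DEV_DAX=y. */
#define MODEL_FS_DAX  0
#define MODEL_DEV_DAX 1

/* Old guard: S_DAX is non-zero only when filesystem-DAX is built in. */
#if MODEL_FS_DAX
#define S_DAX_OLD 8192
#else
#define S_DAX_OLD 0
#endif

/* New guard: either DAX flavour keeps S_DAX non-zero. */
#if MODEL_FS_DAX || MODEL_DEV_DAX
#define S_DAX_NEW 8192
#else
#define S_DAX_NEW 0
#endif

/* Hypothetical, pared-down inode that carries only the flags word. */
struct model_inode {
	unsigned int i_flags;
};

/* A driver marking its inode as DAX; under the old guard this ORs in 0. */
static inline void mark_dax_old(struct model_inode *inode)
{
	inode->i_flags |= S_DAX_OLD;
}

static inline void mark_dax_new(struct model_inode *inode)
{
	inode->i_flags |= S_DAX_NEW;
}

/* The DAX test is a mask against the same constant. */
static inline bool is_dax_old(const struct model_inode *inode)
{
	return (inode->i_flags & S_DAX_OLD) != 0;
}

static inline bool is_dax_new(const struct model_inode *inode)
{
	return (inode->i_flags & S_DAX_NEW) != 0;
}
```

With the modelled configuration, an inode marked through mark_dax_old() is never reported as DAX, while the same sequence through the *_new helpers is. That mirrors the "may incorrectly return false" case described in the changelog.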
Filesystem-DAX is incompatible with 'longterm' page pinning. Without page cache indirection a DAX mapping maps filesystem blocks directly. This means that the filesystem must not modify a file's block map while any page in a mapping is pinned. In order to prevent the situation of userspace holding off filesystem operations indefinitely, disallow 'longterm' Filesystem-DAX mappings.
RDMA has the same conflict and the plan there is to add a 'with lease' mechanism to allow the kernel to notify userspace that the mapping is being torn down for block-map maintenance. Perhaps something similar can be put in place for vfio.
Note that xfs and ext4 still report:
"DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
...at mount time, and resolving the dax-dma-vs-truncate problem is one of the last hurdles to remove that designation.
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e30e29ae4819..45657e2b1ff7 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
+	struct vm_area_struct *vmas[1];
 	int ret;
 
 	if (mm == current->mm) {
-		ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
-					  page);
+		ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
+					      page, vmas);
 	} else {
 		unsigned int flags = 0;
 
@@ -351,7 +352,18 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 
 		down_read(&mm->mmap_sem);
 		ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
-					    NULL, NULL);
+					    vmas, NULL);
+		/*
+		 * The lifetime of a vaddr_get_pfn() page pin is
+		 * userspace-controlled. In the fs-dax case this could
+		 * lead to indefinite stalls in filesystem operations.
+		 * Disallow attempts to pin fs-dax pages via this
+		 * interface.
+		 */
+		if (ret > 0 && vma_is_fsdax(vmas[0])) {
+			ret = -EOPNOTSUPP;
+			put_page(page[0]);
+		}
 		up_read(&mm->mmap_sem);
 	}
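The pin-then-reject pattern in the remote branch can be modelled in userspace. This is only a sketch under stated assumptions: struct model_vma, struct model_page, model_vma_is_fsdax() and model_pin_page() are hypothetical stand-ins, and the model assumes a filesystem-DAX mapping is a DAX mapping that is not backed by a device-dax character device.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vm_area_struct. */
struct model_vma {
	bool dax;	/* mapping bypasses the page cache */
	bool chardev;	/* backed by a character device, i.e. device-dax */
};

/* Hypothetical stand-in for struct page; only the pin count matters here. */
struct model_page {
	int refcount;
};

/* Assumed classification: DAX, but not a device-dax character device. */
static inline bool model_vma_is_fsdax(const struct model_vma *vma)
{
	return vma->dax && !vma->chardev;
}

/*
 * Take the pin first, then drop it again if the mapping is fs-dax.
 * Returns 1 (one page pinned) or -EOPNOTSUPP, like the patched branch.
 */
static inline int model_pin_page(const struct model_vma *vma,
				 struct model_page *page)
{
	int ret;

	assert(vma != NULL && page != NULL);

	page->refcount++;	/* stands in for a successful get_user_pages_remote() */
	ret = 1;

	if (ret > 0 && model_vma_is_fsdax(vma)) {
		ret = -EOPNOTSUPP;
		page->refcount--;	/* stands in for put_page() */
	}
	return ret;
}
```

The point of the ordering is that the caller never sees a pinned fs-dax page: the reference is released before the error is returned, so a failed call leaves the pin count unchanged.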
On 02/04/18 15:05 -0800, Dan Williams wrote:
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index e30e29ae4819..45657e2b1ff7 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  {
>  	struct page *page[1];
>  	struct vm_area_struct *vma;
> +	struct vm_area_struct *vmas[1];
>  	int ret;
>
>  	if (mm == current->mm) {
> -		ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
> -					  page);
> +		ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
> +					      page, vmas);
vmas is not used subsequently if this branch is taken, so can we use NULL here?
Thanks, Haozhong
On Sun, Feb 4, 2018 at 7:46 PM, Haozhong Zhang haozhong.zhang@intel.com wrote:
On 02/04/18 15:05 -0800, Dan Williams wrote:
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index e30e29ae4819..45657e2b1ff7 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>>  {
>>  	struct page *page[1];
>>  	struct vm_area_struct *vma;
>> +	struct vm_area_struct *vmas[1];
>>  	int ret;
>>
>>  	if (mm == current->mm) {
>> -		ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
>> -					  page);
>> +		ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
>> +					      page, vmas);
>
> vmas is not used subsequently if this branch is taken, so can we use NULL here?
I'd rather go the other way and refactor this a bit further to skip the find_vma_intersection() below since get_user_pages() already does that work.
On Sun, 04 Feb 2018 15:05:30 -0800 Dan Williams dan.j.williams@intel.com wrote:
> Filesystem-DAX is incompatible with 'longterm' page pinning. Without page cache indirection a DAX mapping maps filesystem blocks directly. This means that the filesystem must not modify a file's block map while any page in a mapping is pinned. In order to prevent the situation of userspace holding off filesystem operations indefinitely, disallow 'longterm' Filesystem-DAX mappings.
>
> RDMA has the same conflict and the plan there is to add a 'with lease' mechanism to allow the kernel to notify userspace that the mapping is being torn down for block-map maintenance. Perhaps something similar can be put in place for vfio.
>
> Note that xfs and ext4 still report:
>
>    "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
>
> ...at mount time, and resolving the dax-dma-vs-truncate problem is one of the last hurdles to remove that designation.
>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: kvm@vger.kernel.org
> Cc: stable@vger.kernel.org
> Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
> Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
This isn't without some expense: a vfio mapping and unmapping unit test incurs a ~1.5% increase in system time from losing access to gup_fast(). Also, I think tce_iommu_use_page() is going to have the same problem; it provides the same sort of functionality for a different vfio IOMMU backend. Please take this through your tree and I'll add a todo list item to see how we might improve this.

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Thanks, Alex
On Mon, Feb 5, 2018 at 1:44 PM, Alex Williamson alex.williamson@redhat.com wrote:
> This isn't without some expense: a vfio mapping and unmapping unit test incurs a ~1.5% increase in system time from losing access to gup_fast(). Also, I think tce_iommu_use_page() is going to have the same problem; it provides the same sort of functionality for a different vfio IOMMU backend. Please take this through your tree and I'll add a todo list item to see how we might improve this.
>
> Acked-by: Alex Williamson <alex.williamson@redhat.com>
Thanks Alex.
Hi Dan,
Besides this patch series, are there other patches needed to make vma_is_fsdax() work with device-dax?

I applied this patch series on the libnvdimm-for-next branch of the nvdimm tree (ee95f4059a83), and found that it also breaks device-dax mappings with vfio. This can be reproduced with the following steps:
1. Attach PCI device at BDF 0000:03:10.2 to vfio-pci.
   # modprobe vfio-pci
   # lspci -n -s 0000:03:10.2
   03:10.2 0200: 8086:1515 (rev 01)
   # echo 0000:03:10.2 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
   # echo 8086:1515 > /sys/bus/pci/drivers/vfio-pci/new_id

2. Use RAM to emulate NVDIMM and create a device-dax device /dev/dax0.0
   # cat /proc/iomem
   ...
   100000000-2ffffffff : Persistent Memory (legacy)
     100000000-2ffffffff : namespace0.0
   ...

   # ndctl create-namespace -f -e namespace0.0 -m dax
   {
     "dev":"namespace0.0",
     "mode":"dax",
     "size":8453619712,
     "uuid":"e1db00bc-f830-4f1b-ac18-091ae7df4f93",
     "daxdevs":[
       {
         "chardev":"dax0.0",
         "size":8453619712
       }
     ]
   }

3. Create a VM with assigned PCI device in step 1 and the device-dax device in step 2.
   # qemu-system-x86_64 -machine pc,accel=kvm,nvdimm=on -smp host \
       -m 4G,slots=32,maxmem=128G \
       -drive file=VM_DISK_IMG.img,format=raw,if=virtio \
       -object memory-backend-file,id=nv_be1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M \
       -device nvdimm,id=nv1,memdev=nv_be1 \
       -device ioh3420,id=root.0,slot=4 \
       -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6

It then fails with the following QEMU error messages:
   qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6: VFIO_MAP_DMA: -95
   qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6: vfio_dma_map(0x5643804a92c0, 0x140000000, 0xffe00000, 0x7f2ed5200000) = -95 (Operation not supported)
   qemu-system-x86_64: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:03:10.2,id=nic1,bus=pci.0,addr=0x6: vfio error: 0000:03:10.2: failed to setup container for group 52: memory listener initialization failed for container: Operation not supported
I added the following debug message after the get_user_pages_longterm() call in this patch:

    if (vmas[0] && vma_is_dax(vmas[0]))
        printk(KERN_DEBUG "%s: longterm failed for pfn 0x%lx, ret %d\n",
               __func__, page_to_pfn(page[0]), ret);

It shows that get_user_pages_longterm() returns -EOPNOTSUPP on the first device-dax page mapping.
Haozhong
On Mon, Feb 5, 2018 at 11:53 PM, Haozhong Zhang haozhong.zhang@intel.com wrote:
Thanks for that thorough debug, I'll take a look today.
linux-stable-mirror@lists.linaro.org