Changes since v1 [1]:
* Fix the detection of device-dax file instances in vma_is_fsdax(). (Haozhong, Gerd)
* Fix compile breakage in the FS_DAX=n and DEV_DAX=y case. (0day robot)
[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-February/014046.html
---
The vfio interface, like RDMA, wants to set up long-term (indefinite) pins of the pages backing an address range so that a guest or userspace driver can perform DMA to the physical address. Given that this pinning may lead to filesystem operations deadlocking in the filesystem-dax case, the pinning request needs to be rejected.
The longer-term fix for vfio, RDMA, and any other long-term pin user is to provide a 'pin with lease' mechanism. Similar to the leases that are held for pNFS RDMA layouts, this userspace lease gives the kernel a way to notify userspace that the block layout of the file is changing and the kernel is revoking access to pinned pages.
---
Dan Williams (5):
  dax: fix vma_is_fsdax() helper
  dax: fix dax_mapping() definition in the FS_DAX=n + DEV_DAX=y case
  dax: fix S_DAX definition
  dax: short circuit vma_is_fsdax() in the CONFIG_FS_DAX=n case
  vfio: disable filesystem-dax page pinning

 drivers/vfio/vfio_iommu_type1.c | 18 +++++++++++++++---
 include/linux/dax.h             |  9 ++++++---
 include/linux/fs.h              |  6 ++++--
 3 files changed, 25 insertions(+), 8 deletions(-)
Gerd reports that ->i_mode may contain other bits besides S_IFCHR. Use S_ISCHR() instead. Otherwise, get_user_pages_longterm() may fail on device-dax instances when those are meant to be explicitly allowed.
Fixes: 2bb6d2837083 ("mm: introduce get_user_pages_longterm")
Cc: <stable@vger.kernel.org>
Reported-by: Gerd Rausch <gerd.rausch@oracle.com>
Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/fs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2a815560fda0..79c413985305 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3198,7 +3198,7 @@ static inline bool vma_is_fsdax(struct vm_area_struct *vma)
 	if (!vma_is_dax(vma))
 		return false;
 	inode = file_inode(vma->vm_file);
-	if (inode->i_mode == S_IFCHR)
+	if (S_ISCHR(inode->i_mode))
 		return false; /* device-dax */
 	return true;
 }
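The difference between the two checks can be reproduced in userspace. The following is a minimal sketch, assuming a POSIX <sys/stat.h>; the function names are hypothetical and are not part of the patch.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>

/* Pre-fix style check: true only when no bits besides S_IFCHR are set. */
bool is_chr_by_equality(mode_t mode)
{
	return mode == S_IFCHR;
}

/* Post-fix style check: S_ISCHR() masks with S_IFMT before comparing. */
bool is_chr_by_macro(mode_t mode)
{
	return S_ISCHR(mode);
}

/* Usage: a character device node that also carries 0600 permission bits. */
void example_chr_mode(void)
{
	mode_t mode = S_IFCHR | 0600;

	assert(!is_chr_by_equality(mode)); /* permission bits defeat '==' */
	assert(is_chr_by_macro(mode));     /* file type is still detected */
}
```

Any mode that carries permission bits in addition to the file type fails the equality test, which is the situation Gerd reported.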
An address_space will only have dax exceptional entries when FS_DAX is enabled. The current reliance on S_DAX causes compile failures when S_DAX is defined for DEV_DAX, but FS_DAX is disabled. Make dax_mapping() always return false when FS_DAX is disabled so that mm/truncate.c drops its link-time dependencies on fs/dax.c.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/dax.h | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0185ecdae135..62e8cf7eb566 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -107,6 +107,10 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int length);
+static inline bool dax_mapping(struct address_space *mapping)
+{
+	return mapping->host && IS_DAX(mapping->host);
+}
 #else
 static inline int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
@@ -114,12 +118,11 @@ static inline int __dax_zero_page_range(struct block_device *bdev,
 {
 	return -ENXIO;
 }
-#endif
-
 static inline bool dax_mapping(struct address_space *mapping)
 {
-	return mapping->host && IS_DAX(mapping->host);
+	return false;
 }
+#endif

 struct writeback_control;
 int dax_writeback_mapping_range(struct address_space *mapping,
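The shape of this change, a real helper when the feature is built in and a constant-false stub otherwise, can be sketched outside the kernel. Everything below is a hypothetical stand-in: DEMO_FS_DAX plays the role of the config option, and the demo_ structs are not kernel types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct inode and struct address_space. */
struct demo_inode { bool is_dax; };
struct demo_mapping { struct demo_inode *host; };

#ifdef DEMO_FS_DAX
/* Feature built in: consult the host inode. */
static inline bool demo_dax_mapping(const struct demo_mapping *mapping)
{
	return mapping->host && mapping->host->is_dax;
}
#else
/* Feature compiled out: constant false, so callers' DAX branches fold away. */
static inline bool demo_dax_mapping(const struct demo_mapping *mapping)
{
	(void)mapping;
	return false;
}
#endif

/* Usage: without DEMO_FS_DAX, even a DAX-flagged inode reports false. */
static inline void example_dax_mapping(void)
{
	struct demo_inode inode = { .is_dax = true };
	struct demo_mapping mapping = { .host = &inode };

#ifndef DEMO_FS_DAX
	assert(!demo_dax_mapping(&mapping));
#else
	assert(demo_dax_mapping(&mapping));
#endif
}
```

Because the stub is a constant, a caller guarded by it no longer references symbols that exist only when the feature is built, which is the link-time dependency the commit message describes.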
On Thu 22-02-18 23:17:51, Dan Williams wrote:
An address_space will only have dax exceptional entries when FS_DAX is enabled. The current reliance on S_DAX causes compile failures when S_DAX is defined for DEV_DAX, but FS_DAX is disabled. Make dax_mapping() always return false when FS_DAX is disabled so that mm/truncate.c drops its link-time dependencies on fs/dax.c.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
Make sure S_DAX is defined in the CONFIG_FS_DAX=n + CONFIG_DEV_DAX=y case. Otherwise vma_is_dax() may incorrectly return false in the Device-DAX case.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/fs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 79c413985305..b2fa9b4c1e51 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1859,7 +1859,7 @@ struct super_operations {
 #define S_IMA		1024	/* Inode has an associated IMA struct */
 #define S_AUTOMOUNT	2048	/* Automount/referral quasi-directory */
 #define S_NOSEC		4096	/* no suid or xattr security attributes */
-#ifdef CONFIG_FS_DAX
+#if IS_ENABLED(CONFIG_FS_DAX) || IS_ENABLED(CONFIG_DEV_DAX)
 #define S_DAX		8192	/* Direct Access, avoiding the page cache */
 #else
 #define S_DAX		0	/* Make all the DAX code disappear */
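Why a zero S_DAX hides Device-DAX inodes can be shown with plain bit arithmetic. This is a sketch under the assumption that a DAX check reduces to masking the inode flags with S_DAX; demo_is_dax() is a hypothetical helper that takes the mask as a parameter so both definitions can be compared in one program.

```c
#include <assert.h>
#include <stdbool.h>

/* The two values S_DAX can take in the hunk above. */
#define DEMO_S_DAX_ENABLED  8192u
#define DEMO_S_DAX_DISABLED 0u

/* Hypothetical model of a flag test: mask the inode flags with S_DAX. */
bool demo_is_dax(unsigned int i_flags, unsigned int s_dax)
{
	return (i_flags & s_dax) != 0;
}

/* Usage: an inode whose flags carry the DAX bit. */
void example_s_dax(void)
{
	unsigned int i_flags = 8192u;

	/* With S_DAX defined as 0, the mask can never match. */
	assert(!demo_is_dax(i_flags, DEMO_S_DAX_DISABLED));
	/* With S_DAX defined as 8192, the same inode is detected. */
	assert(demo_is_dax(i_flags, DEMO_S_DAX_ENABLED));
}
```

With the old #ifdef, a FS_DAX=n + DEV_DAX=y build selects the zero definition, so any check built on the mask reports false regardless of the inode.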
On Thu 22-02-18 23:17:56, Dan Williams wrote:
Make sure S_DAX is defined in the CONFIG_FS_DAX=n + CONFIG_DEV_DAX=y case. Otherwise vma_is_dax() may incorrectly return false in the Device-DAX case.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
Filesystem-DAX is incompatible with 'longterm' page pinning. Without page cache indirection a DAX mapping maps filesystem blocks directly. This means that the filesystem must not modify a file's block map while any page in a mapping is pinned. To prevent userspace from holding off filesystem operations indefinitely, disallow 'longterm' Filesystem-DAX mappings.
RDMA has the same conflict and the plan there is to add a 'with lease' mechanism to allow the kernel to notify userspace that the mapping is being torn down for block-map maintenance. Perhaps something similar can be put in place for vfio.
Note that xfs and ext4 still report:
"DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
...at mount time, and resolving the dax-dma-vs-truncate problem is one of the last hurdles to remove that designation.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <kvm@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e30e29ae4819..45657e2b1ff7 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
+	struct vm_area_struct *vmas[1];
 	int ret;

 	if (mm == current->mm) {
-		ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
-					  page);
+		ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
+					      page, vmas);
 	} else {
 		unsigned int flags = 0;

@@ -351,7 +352,18 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,

 		down_read(&mm->mmap_sem);
 		ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
-					    NULL, NULL);
+					    vmas, NULL);
+		/*
+		 * The lifetime of a vaddr_get_pfn() page pin is
+		 * userspace-controlled. In the fs-dax case this could
+		 * lead to indefinite stalls in filesystem operations.
+		 * Disallow attempts to pin fs-dax pages via this
+		 * interface.
+		 */
+		if (ret > 0 && vma_is_fsdax(vmas[0])) {
+			ret = -EOPNOTSUPP;
+			put_page(page[0]);
+		}
 		up_read(&mm->mmap_sem);
 	}
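The pin-then-reject sequence in the second hunk can be modelled in userspace. The sketch below is illustrative only: the demo_ types and the integer refcount are hypothetical stand-ins, not the kernel's struct page or struct vm_area_struct, and the increment merely represents a successful one-page pin.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these are kernel types. */
struct demo_page { int refcount; };
struct demo_vma { bool is_dax; bool is_chardev; };

/* Same decision order as vma_is_fsdax(): DAX, but not a device-dax chardev. */
bool demo_vma_is_fsdax(const struct demo_vma *vma)
{
	if (!vma->is_dax)
		return false;
	if (vma->is_chardev)
		return false; /* device-dax */
	return true;
}

/*
 * Pin one page, then undo the pin and fail if the vma is fs-dax.
 * Returns 1 (pages pinned) on success or -EOPNOTSUPP on rejection.
 */
int demo_pin_longterm(struct demo_page *page, const struct demo_vma *vma)
{
	int ret;

	page->refcount++;	/* stands in for the page pin */
	ret = 1;
	if (ret > 0 && demo_vma_is_fsdax(vma)) {
		ret = -EOPNOTSUPP;
		page->refcount--;	/* stands in for put_page() */
	}
	return ret;
}

/* Usage: fs-dax is rejected with no reference left behind. */
void example_pin(void)
{
	struct demo_page page = { .refcount = 0 };
	struct demo_vma fsdax = { .is_dax = true, .is_chardev = false };

	assert(demo_pin_longterm(&page, &fsdax) == -EOPNOTSUPP);
	assert(page.refcount == 0);
}
```

The point of the ordering is that the reference taken by the pin is dropped before the error is returned, so a rejected request does not leak a pin.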
On 02/22/18 23:17 -0800, Dan Williams wrote:
Changes since v1 [1]:
* Fix the detection of device-dax file instances in vma_is_fsdax(). (Haozhong, Gerd)
* Fix compile breakage in the FS_DAX=n and DEV_DAX=y case. (0day robot)
The vfio interface, like RDMA, wants to set up long-term (indefinite) pins of the pages backing an address range so that a guest or userspace driver can perform DMA to the physical address. Given that this pinning may lead to filesystem operations deadlocking in the filesystem-dax case, the pinning request needs to be rejected.
The longer-term fix for vfio, RDMA, and any other long-term pin user is to provide a 'pin with lease' mechanism. Similar to the leases that are held for pNFS RDMA layouts, this userspace lease gives the kernel a way to notify userspace that the block layout of the file is changing and the kernel is revoking access to pinned pages.
Dan Williams (5):
  dax: fix vma_is_fsdax() helper
  dax: fix dax_mapping() definition in the FS_DAX=n + DEV_DAX=y case
  dax: fix S_DAX definition
  dax: short circuit vma_is_fsdax() in the CONFIG_FS_DAX=n case
  vfio: disable filesystem-dax page pinning

 drivers/vfio/vfio_iommu_type1.c | 18 +++++++++++++++---
 include/linux/dax.h             |  9 ++++++---
 include/linux/fs.h              |  6 ++++--
 3 files changed, 25 insertions(+), 8 deletions(-)
Tested on QEMU with vfio passthrough, using fs-dax and device-dax in turn as the vNVDIMM backend. QEMU fails in the fs-dax case as expected, and the device-dax case now works normally.
Tested-by: Haozhong Zhang <haozhong.zhang@intel.com>
On Fri, Feb 23, 2018 at 12:55 AM, Haozhong Zhang <haozhong.zhang@intel.com> wrote:
Tested on QEMU with vfio passthrough, using fs-dax and device-dax in turn as the vNVDIMM backend. QEMU fails in the fs-dax case as expected, and the device-dax case now works normally.
Tested-by: Haozhong Zhang <haozhong.zhang@intel.com>
Thank you!