Changes since v2 [1]:
* Fix yet more compile breakage in the FS_DAX=n and DEV_DAX=y case. (0day robot)
[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-February/014046.html
---
The vfio interface, like RDMA, wants to set up long-term (indefinite) pins of the pages backing an address range so that a guest or userspace driver can perform DMA to those pages by physical address. Given that this pinning may lead to filesystem operations deadlocking in the filesystem-dax case, the pinning request needs to be rejected.
The longer-term fix for vfio, RDMA, and any other long-term pin user is to provide a 'pin with lease' mechanism. Similar to the leases that are held for pNFS RDMA layouts, this userspace lease gives the kernel a way to notify userspace that the block layout of the file is changing and the kernel is revoking access to pinned pages.
---
Dan Williams (6):
      dax: fix vma_is_fsdax() helper
      dax: fix dax_mapping() definition in the FS_DAX=n + DEV_DAX=y case
      xfs, dax: introduce IS_FSDAX()
      dax: fix S_DAX definition
      dax: short circuit vma_is_fsdax() in the CONFIG_FS_DAX=n case
      vfio: disable filesystem-dax page pinning

 drivers/vfio/vfio_iommu_type1.c |   18 +++++++++++++++---
 fs/xfs/xfs_file.c               |   14 +++++++-------
 fs/xfs/xfs_ioctl.c              |    4 ++--
 fs/xfs/xfs_iomap.c              |    6 +++---
 fs/xfs/xfs_reflink.c            |    2 +-
 include/linux/dax.h             |    9 ++++++---
 include/linux/fs.h              |    8 ++++++--
 7 files changed, 40 insertions(+), 21 deletions(-)
Gerd reports that ->i_mode may contain other bits besides S_IFCHR. Use S_ISCHR() instead. Otherwise, get_user_pages_longterm() may fail on device-dax instances when those are meant to be explicitly allowed.
Fixes: 2bb6d2837083 ("mm: introduce get_user_pages_longterm")
Cc: stable@vger.kernel.org
Reported-by: Gerd Rausch gerd.rausch@oracle.com
Acked-by: Jane Chu jane.chu@oracle.com
Reported-by: Haozhong Zhang haozhong.zhang@intel.com
Signed-off-by: Dan Williams dan.j.williams@intel.com
---
 include/linux/fs.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2a815560fda0..79c413985305 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3198,7 +3198,7 @@ static inline bool vma_is_fsdax(struct vm_area_struct *vma)
 	if (!vma_is_dax(vma))
 		return false;
 	inode = file_inode(vma->vm_file);
-	if (inode->i_mode == S_IFCHR)
+	if (S_ISCHR(inode->i_mode))
 		return false; /* device-dax */
 	return true;
 }
On Fri 23-02-18 16:43:11, Dan Williams wrote:
Gerd reports that ->i_mode may contain other bits besides S_IFCHR. Use S_ISCHR() instead. Otherwise, get_user_pages_longterm() may fail on device-dax instances when those are meant to be explicitly allowed.
Fixes: 2bb6d2837083 ("mm: introduce get_user_pages_longterm") Cc: stable@vger.kernel.org Reported-by: Gerd Rausch gerd.rausch@oracle.com Acked-by: Jane Chu jane.chu@oracle.com Reported-by: Haozhong Zhang haozhong.zhang@intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com
I wonder how I didn't notice this when reading the original patch. Anyway the fix looks good. You can add:
Reviewed-by: Jan Kara jack@suse.cz
Honza
An address_space will only have dax exceptional entries when FS_DAX is enabled. The current reliance on S_DAX causes compile failures when S_DAX is defined for DEV_DAX, but FS_DAX is disabled. Make dax_mapping() always return false in the FS_DAX=n case so that mm/truncate.c drops its link time dependencies on fs/dax.c.
Cc: Alexander Viro viro@zeniv.linux.org.uk
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig hch@lst.de
Cc: Jan Kara jack@suse.cz
Cc: stable@vger.kernel.org
Reported-by: kbuild test robot fengguang.wu@intel.com
Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
Signed-off-by: Dan Williams dan.j.williams@intel.com
---
 include/linux/dax.h |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0185ecdae135..62e8cf7eb566 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -107,6 +107,10 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int length);
+static inline bool dax_mapping(struct address_space *mapping)
+{
+	return mapping->host && IS_DAX(mapping->host);
+}
 #else
 static inline int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
@@ -114,12 +118,11 @@ static inline int __dax_zero_page_range(struct block_device *bdev,
 {
 	return -ENXIO;
 }
-#endif
-
 static inline bool dax_mapping(struct address_space *mapping)
 {
-	return mapping->host && IS_DAX(mapping->host);
+	return false;
 }
+#endif

 struct writeback_control;
 int dax_writeback_mapping_range(struct address_space *mapping,
Given that S_DAX is non-zero in the FS_DAX=n + DEV_DAX=y case, another mechanism besides the plain IS_DAX() check is needed to compile out dead filesystem-dax code paths. Without IS_FSDAX() xfs will fail at link time with:

    ERROR: "dax_finish_sync_fault" [fs/xfs/xfs.ko] undefined!
    ERROR: "dax_iomap_fault" [fs/xfs/xfs.ko] undefined!
    ERROR: "dax_iomap_rw" [fs/xfs/xfs.ko] undefined!
This compile failure was previously hidden by the fact that S_DAX was erroneously defined to '0' in the FS_DAX=n + DEV_DAX=y case.
Cc: "Darrick J. Wong" darrick.wong@oracle.com
Cc: linux-xfs@vger.kernel.org
Cc: stable@vger.kernel.org
Reported-by: kbuild test robot fengguang.wu@intel.com
Signed-off-by: Dan Williams dan.j.williams@intel.com
---
 fs/xfs/xfs_file.c    |   14 +++++++-------
 fs/xfs/xfs_ioctl.c   |    4 ++--
 fs/xfs/xfs_iomap.c   |    6 +++---
 fs/xfs/xfs_reflink.c |    2 +-
 include/linux/fs.h   |    2 ++
 5 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 9ea08326f876..46a098b90fd0 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -288,7 +288,7 @@ xfs_file_read_iter(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;

-	if (IS_DAX(inode))
+	if (IS_FSDAX(inode))
 		ret = xfs_file_dax_read(iocb, to);
 	else if (iocb->ki_flags & IOCB_DIRECT)
 		ret = xfs_file_dio_aio_read(iocb, to);
@@ -726,7 +726,7 @@ xfs_file_write_iter(
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return -EIO;

-	if (IS_DAX(inode))
+	if (IS_FSDAX(inode))
 		ret = xfs_file_dax_write(iocb, from);
 	else if (iocb->ki_flags & IOCB_DIRECT) {
 		/*
@@ -1045,7 +1045,7 @@ __xfs_filemap_fault(
 	}

 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	if (IS_DAX(inode)) {
+	if (IS_FSDAX(inode)) {
 		pfn_t pfn;

 		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &xfs_iomap_ops);
@@ -1070,7 +1070,7 @@ xfs_filemap_fault(
 {
 	/* DAX can shortcut the normal fault path on write faults! */
 	return __xfs_filemap_fault(vmf, PE_SIZE_PTE,
-			IS_DAX(file_inode(vmf->vma->vm_file)) &&
+			IS_FSDAX(file_inode(vmf->vma->vm_file)) &&
 			(vmf->flags & FAULT_FLAG_WRITE));
 }

@@ -1079,7 +1079,7 @@ xfs_filemap_huge_fault(
 	struct vm_fault		*vmf,
 	enum page_entry_size	pe_size)
 {
-	if (!IS_DAX(file_inode(vmf->vma->vm_file)))
+	if (!IS_FSDAX(file_inode(vmf->vma->vm_file)))
 		return VM_FAULT_FALLBACK;

 	/* DAX can shortcut the normal fault path on write faults! */
@@ -1124,12 +1124,12 @@ xfs_file_mmap(
 	 * We don't support synchronous mappings for non-DAX files. At least
 	 * until someone comes with a sensible use case.
 	 */
-	if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
+	if (!IS_FSDAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
 		return -EOPNOTSUPP;

 	file_accessed(filp);
 	vma->vm_ops = &xfs_file_vm_ops;
-	if (IS_DAX(file_inode(filp)))
+	if (IS_FSDAX(file_inode(filp)))
 		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
 	return 0;
 }
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 89fb1eb80aae..234279ff66ce 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1108,9 +1108,9 @@ xfs_ioctl_setattr_dax_invalidate(
 	}

 	/* If the DAX state is not changing, we have nothing to do here. */
-	if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_DAX(inode))
+	if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_FSDAX(inode))
 		return 0;
-	if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
+	if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_FSDAX(inode))
 		return 0;

 	/* lock, flush and invalidate mapping in preparation for flag change */
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 66e1edbfb2b2..cf794d429aec 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -241,7 +241,7 @@ xfs_iomap_write_direct(
 	 * the reserve block pool for bmbt block allocation if there is no space
 	 * left but we need to do unwritten extent conversion.
 	 */
-	if (IS_DAX(VFS_I(ip))) {
+	if (IS_FSDAX(VFS_I(ip))) {
 		bmapi_flags = XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO;
 		if (imap->br_state == XFS_EXT_UNWRITTEN) {
 			tflags |= XFS_TRANS_RESERVE;
@@ -952,7 +952,7 @@ static inline bool imap_needs_alloc(struct inode *inode,
 	return !nimaps ||
 		imap->br_startblock == HOLESTARTBLOCK ||
 		imap->br_startblock == DELAYSTARTBLOCK ||
-		(IS_DAX(inode) && imap->br_state == XFS_EXT_UNWRITTEN);
+		(IS_FSDAX(inode) && imap->br_state == XFS_EXT_UNWRITTEN);
 }

 static inline bool need_excl_ilock(struct xfs_inode *ip, unsigned flags)
@@ -988,7 +988,7 @@ xfs_file_iomap_begin(
 		return -EIO;

 	if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
-			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
+			!IS_FSDAX(inode) && !xfs_get_extsz_hint(ip)) {
 		/* Reserve delalloc blocks for regular writeback. */
 		return xfs_file_iomap_begin_delay(inode, offset, length, iomap);
 	}
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 270246943a06..a126e00e05e3 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1351,7 +1351,7 @@ xfs_reflink_remap_range(
 		goto out_unlock;

 	/* Don't share DAX file data for now. */
-	if (IS_DAX(inode_in) || IS_DAX(inode_out))
+	if (IS_FSDAX(inode_in) || IS_FSDAX(inode_out))
 		goto out_unlock;

 	ret = vfs_clone_file_prep_inodes(inode_in, pos_in, inode_out, pos_out,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 79c413985305..a4310a95011b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1909,6 +1909,8 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
 #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
 				 (inode)->i_rdev == WHITEOUT_DEV)

+#define IS_FSDAX(inode) (IS_ENABLED(CONFIG_FS_DAX) && IS_DAX(inode))
+
 static inline bool HAS_UNMAPPED_ID(struct inode *inode)
 {
 	return !uid_valid(inode->i_uid) || !gid_valid(inode->i_gid);
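The effect of gating the flag test on a compile-time constant can be modelled outside the kernel. This is only a sketch under stated assumptions: DEMO_FS_DAX stands in for IS_ENABLED(CONFIG_FS_DAX), and every DEMO_/demo_ name is hypothetical. It models the FS_DAX=n + DEV_DAX=y configuration, where the flag bit is non-zero but the filesystem-dax branch must never be taken.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; these are not the kernel's definitions. */
#define DEMO_S_DAX	8192u
#define DEMO_FS_DAX	0	/* models FS_DAX=n while DEV_DAX=y */

struct demo_inode {
	unsigned int i_flags;
};

/* Plain flag test: true for any inode carrying the DAX bit. */
#define DEMO_IS_DAX(inode)	(((inode)->i_flags & DEMO_S_DAX) != 0)
/* Gated test: constant-false whenever DEMO_FS_DAX is 0. */
#define DEMO_IS_FSDAX(inode)	(DEMO_FS_DAX && DEMO_IS_DAX(inode))

enum demo_path { DEMO_PATH_DAX, DEMO_PATH_BUFFERED };

/* Pick an I/O path the way the converted xfs call sites do. */
static enum demo_path demo_read_path(const struct demo_inode *inode)
{
	if (DEMO_IS_FSDAX(inode))
		return DEMO_PATH_DAX;
	return DEMO_PATH_BUFFERED;
}
```

Because the gated condition is a constant zero here, an optimizing compiler is free to discard the DAX branch entirely, which is what lets the references to fs/dax.c symbols disappear.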
On Fri 23-02-18 16:43:27, Dan Williams wrote:
Given that S_DAX is non-zero in the FS_DAX=n + DEV_DAX=y case, another mechanism besides the plain IS_DAX() check is needed to compile out dead filesystem-dax code paths. Without IS_FSDAX() xfs will fail at link time with:

    ERROR: "dax_finish_sync_fault" [fs/xfs/xfs.ko] undefined!
    ERROR: "dax_iomap_fault" [fs/xfs/xfs.ko] undefined!
    ERROR: "dax_iomap_rw" [fs/xfs/xfs.ko] undefined!
This compile failure was previously hidden by the fact that S_DAX was erroneously defined to '0' in the FS_DAX=n + DEV_DAX=y case.
Cc: "Darrick J. Wong" darrick.wong@oracle.com Cc: linux-xfs@vger.kernel.org Cc: stable@vger.kernel.org Reported-by: kbuild test robot fengguang.wu@intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com
As much as I appreciate that relying on compiler to optimize out dead branches results in nicer looking code this is an example where it backfires. Also having IS_DAX() and IS_FSDAX() doing almost the same, just not exactly the same, is IMHO a recipe for confusion (e.g. a casual reader could think why does ext4 get away with using IS_DAX while XFS has to use IS_FSDAX?). So I'd just prefer to handle this as is usual in other kernel areas - define empty stubs for all exported functions when CONFIG_FS_DAX is not enabled. That way code can stay without ugly ifdefs and we don't have to bother with IS_FSDAX vs IS_DAX distinction in filesystem code. Thoughts?
Honza
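The stub approach suggested above could look roughly like the following. This is a hedged sketch, not the kernel's headers: DEMO_FS_DAX, demo_dax_iomap_rw() and demo_file_read() are hypothetical names, and the -ENXIO return value is borrowed from the existing __dax_zero_page_range() stub rather than from any agreed design.

```c
#include <errno.h>

#ifdef DEMO_FS_DAX
/* Real implementation would be provided by the DAX core. */
int demo_dax_iomap_rw(int len);
#else
/* Stub used when the feature is compiled out: callers still build and
 * link, and simply see an error if the path is ever reached. */
static inline int demo_dax_iomap_rw(int len)
{
	(void)len;
	return -ENXIO;
}
#endif

/* A caller needs no ifdefs and no second IS_*() helper. */
static int demo_file_read(int is_dax, int len)
{
	if (is_dax)
		return demo_dax_iomap_rw(len);
	return len;	/* pretend the buffered path read everything */
}
```

The trade-off is that the dead branch stays in the caller's source and is resolved at link time by the stub, instead of being folded away by a constant condition.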
On Mon, Feb 26, 2018 at 2:06 AM, Jan Kara jack@suse.cz wrote:
On Fri 23-02-18 16:43:27, Dan Williams wrote:
Given that S_DAX is non-zero in the FS_DAX=n + DEV_DAX=y case, another mechanism besides the plain IS_DAX() check is needed to compile out dead filesystem-dax code paths. Without IS_FSDAX() xfs will fail at link time with:

    ERROR: "dax_finish_sync_fault" [fs/xfs/xfs.ko] undefined!
    ERROR: "dax_iomap_fault" [fs/xfs/xfs.ko] undefined!
    ERROR: "dax_iomap_rw" [fs/xfs/xfs.ko] undefined!
This compile failure was previously hidden by the fact that S_DAX was erroneously defined to '0' in the FS_DAX=n + DEV_DAX=y case.
Cc: "Darrick J. Wong" darrick.wong@oracle.com Cc: linux-xfs@vger.kernel.org Cc: stable@vger.kernel.org Reported-by: kbuild test robot fengguang.wu@intel.com Signed-off-by: Dan Williams dan.j.williams@intel.com
As much as I appreciate that relying on compiler to optimize out dead branches results in nicer looking code this is an example where it backfires. Also having IS_DAX() and IS_FSDAX() doing almost the same, just not exactly the same, is IMHO a recipe for confusion (e.g. a casual reader could think why does ext4 get away with using IS_DAX while XFS has to use IS_FSDAX?). So I'd just prefer to handle this as is usual in other kernel areas - define empty stubs for all exported functions when CONFIG_FS_DAX is not enabled. That way code can stay without ugly ifdefs and we don't have to bother with IS_FSDAX vs IS_DAX distinction in filesystem code. Thoughts?
I think my patch is incomplete either way, because the current IS_DAX() usages handle more than just compiling out calls to fs/dax.c symbols. I.e. even if there were stubs for all fs/dax.c call-outs, there are still local usages of the helper. Let's kill IS_DAX() and only have IS_FSDAX() and IS_DEVDAX() with the S_ISCHR() check. Any issues with that?
Make sure S_DAX is defined in the CONFIG_FS_DAX=n + CONFIG_DEV_DAX=y case. Otherwise vma_is_dax() may incorrectly return false in the Device-DAX case.
Cc: Alexander Viro viro@zeniv.linux.org.uk
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig hch@lst.de
Cc: Jan Kara jack@suse.cz
Cc: stable@vger.kernel.org
Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
Signed-off-by: Dan Williams dan.j.williams@intel.com
---
 include/linux/fs.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index a4310a95011b..7418341578a3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1859,7 +1859,7 @@ struct super_operations {
 #define S_IMA		1024	/* Inode has an associated IMA struct */
 #define S_AUTOMOUNT	2048	/* Automount/referral quasi-directory */
 #define S_NOSEC		4096	/* no suid or xattr security attributes */
-#ifdef CONFIG_FS_DAX
+#if IS_ENABLED(CONFIG_FS_DAX) || IS_ENABLED(CONFIG_DEV_DAX)
 #define S_DAX		8192	/* Direct Access, avoiding the page cache */
 #else
 #define S_DAX		0	/* Make all the DAX code disappear */
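Why a zero-valued flag makes every DAX test fail can be shown with a few lines of C. This is an illustrative sketch only; demo_flag_roundtrip() is a hypothetical helper that sets a flag bit and then tests it, which is roughly what marking an inode and later checking it amounts to.

```c
#include <stdbool.h>

/* Set the given flag value on an empty flags word, then test for it. */
static bool demo_flag_roundtrip(unsigned int s_dax)
{
	unsigned int i_flags = 0;

	i_flags |= s_dax;		/* marking the inode as DAX */
	return (i_flags & s_dax) != 0;	/* an IS_DAX()-style check */
}
```

Passing 0 models the old FS_DAX=n definition: the OR is a no-op and the AND is always zero, so a device-dax inode can never be recognized. Passing 8192 models the definition after this patch.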
Filesystem-DAX is incompatible with 'longterm' page pinning. Without page cache indirection a DAX mapping maps filesystem blocks directly. This means that the filesystem must not modify a file's block map while any page in a mapping is pinned. In order to prevent userspace from holding off filesystem operations indefinitely, disallow 'longterm' Filesystem-DAX mappings.
RDMA has the same conflict and the plan there is to add a 'with lease' mechanism to allow the kernel to notify userspace that the mapping is being torn down for block-map maintenance. Perhaps something similar can be put in place for vfio.
Note that xfs and ext4 still report:
"DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
...at mount time, and resolving the dax-dma-vs-truncate problem is one of the last hurdles to remove that designation.
Acked-by: Alex Williamson alex.williamson@redhat.com
Cc: Michal Hocko mhocko@suse.com
Cc: Christoph Hellwig hch@lst.de
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Reported-by: Haozhong Zhang haozhong.zhang@intel.com
Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
Signed-off-by: Dan Williams dan.j.williams@intel.com
---
 drivers/vfio/vfio_iommu_type1.c |   18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e30e29ae4819..45657e2b1ff7 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
+	struct vm_area_struct *vmas[1];
 	int ret;

 	if (mm == current->mm) {
-		ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
-					  page);
+		ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
+					      page, vmas);
 	} else {
 		unsigned int flags = 0;

@@ -351,7 +352,18 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,

 		down_read(&mm->mmap_sem);
 		ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
-					    NULL, NULL);
+					    vmas, NULL);
+		/*
+		 * The lifetime of a vaddr_get_pfn() page pin is
+		 * userspace-controlled. In the fs-dax case this could
+		 * lead to indefinite stalls in filesystem operations.
+		 * Disallow attempts to pin fs-dax pages via this
+		 * interface.
+		 */
+		if (ret > 0 && vma_is_fsdax(vmas[0])) {
+			ret = -EOPNOTSUPP;
+			put_page(page[0]);
+		}
 		up_read(&mm->mmap_sem);
 	}
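The remote-mm branch follows a pin, check, release-on-reject pattern. The sketch below models that pattern with a plain reference count; struct demo_page and demo_pin_page() are hypothetical and only mimic the control flow, not get_user_pages_remote() or put_page() themselves.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical page model: a reference count plus a flag saying whether
 * the backing mapping is filesystem-dax. */
struct demo_page {
	int	refcount;
	bool	fsdax;
};

/* Take a reference first, then reject fs-dax pages and drop the
 * reference again so a refused pin leaves no state behind. */
static int demo_pin_page(struct demo_page *page)
{
	page->refcount++;		/* stand-in for the pin */
	if (page->fsdax) {
		page->refcount--;	/* stand-in for put_page() */
		return -EOPNOTSUPP;
	}
	return 1;			/* one page pinned */
}
```

The important property is that the error path is balanced: after a rejected request the reference count is the same as before the call.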