Greg,
The recent history of copy_file_range() API is somewhat convoluted. The API changes are documented in copy_file_range(2) man page. I've just posted a man page update patch [1] to fix some wrong kernel version references in the man page.
The problem is that it took many kernel releases to get reports on the regression from v5.3 and yet more releases (v5.12..v5.19) to work on the solution and get it merged.
This situation leads to confusion among users as can be seen in [2]. You've already picked the patch [1/2] to 5.15.y and I sent you a request to pick patch [2/2] (from v6.1) as well.
Following are backports of the two patches to 5.10.y, which I verified with the relevant test in fstests.
Backport to 5.4.y is more challenging because Darrick did a lot of cleanup in that area between v5.4..v5.10, so I did not do it.
Thanks, Amir.
[1] https://lore.kernel.org/linux-fsdevel/20221213120834.948163-1-amir73il@gmail... [2] https://bugzilla.kernel.org/show_bug.cgi?id=216800
Amir Goldstein (2): vfs: fix copy_file_range() regression in cross-fs copies vfs: fix copy_file_range() averts filesystem freeze protection
fs/nfsd/vfs.c | 8 ++++- fs/read_write.c | 90 +++++++++++++++++++++++++++++++++--------------------- include/linux/fs.h | 8 +++++ 3 files changed, 70 insertions(+), 36 deletions(-)
commit 868f9f2f8e004bfe0d3935b1976f625b2924893b upstream.
[backport comments for pre v5.15: - This commit has a bug fixed by commit 10bc8e4af659 ("vfs: fix copy_file_range() averts filesystem freeze protection") - ksmbd mentions are irrelevant - ksmbd hunks were dropped ]
A regression has been reported by Nicolas Boichat, found while using the copy_file_range syscall to copy a tracefs file.
Before commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the kernel would return -EXDEV to userspace when trying to copy a file across different filesystems. After this commit, the syscall doesn't fail anymore and instead returns zero (zero bytes copied), as this file's content is generated on-the-fly and thus reports a size of zero.
Another regression has been reported by He Zhe - the assertion of WARN_ON_ONCE(ret == -EOPNOTSUPP) can be triggered from userspace when copying from a sysfs file whose read operation may return -EOPNOTSUPP.
Since we do not have test coverage for copy_file_range() between any two types of filesystems, the best way to avoid these sort of issues in the future is for the kernel to be more picky about filesystems that are allowed to do copy_file_range().
This patch restores some cross-filesystem copy restrictions that existed prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices"), namely, cross-sb copy is not allowed for filesystems that do not implement ->copy_file_range().
Filesystems that do implement ->copy_file_range() have full control of the result - if this method returns an error, the error is returned to the user. Before this change this was only true for fs that did not implement the ->remap_file_range() operation (i.e. nfsv3).
Filesystems that do not implement ->copy_file_range() still fall-back to the generic_copy_file_range() implementation when the copy is within the same sb. This helps the kernel can maintain a more consistent story about which filesystems support copy_file_range().
nfsd and ksmbd servers are modified to fall-back to the generic_copy_file_range() implementation in case vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV, which preserves behavior of server-side-copy.
fall-back to generic_copy_file_range() is not implemented for the smb operation FSCTL_DUPLICATE_EXTENTS_TO_FILE, which is arguably a correct change of behavior.
Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drinkcat@chro... Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=ZmpsG47C... Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa... Link: https://lore.kernel.org/linux-fsdevel/20210630161320.29006-1-lhenriques@suse... Reported-by: Nicolas Boichat drinkcat@chromium.org Reported-by: kernel test robot oliver.sang@intel.com Signed-off-by: Luis Henriques lhenriques@suse.de Fixes: 64bf5ff58dff ("vfs: no fallback for ->copy_file_range") Link: https://lore.kernel.org/linux-fsdevel/20f17f64-88cb-4e80-07c1-85cb96c83619@w... Reported-by: He Zhe zhe.he@windriver.com Tested-by: Namjae Jeon linkinjeon@kernel.org Tested-by: Luis Henriques lhenriques@suse.de Signed-off-by: Amir Goldstein amir73il@gmail.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=216800 Signed-off-by: Amir Goldstein amir73il@gmail.com --- fs/nfsd/vfs.c | 8 +++++- fs/read_write.c | 77 ++++++++++++++++++++++++++++++++------------------------- 2 files changed, 51 insertions(+), 34 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index d8c5b67f06f4..9ecaf9481a37 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, u64 src_pos, ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst, u64 dst_pos, u64 count) { + ssize_t ret;
/* * Limit copy to 4MB to prevent indefinitely blocking an nfsd @@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst, * limit like this and pipeline multiple COPY requests. */ count = min_t(u64, count, 1 << 22); - return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0); + ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0); + + if (ret == -EOPNOTSUPP || ret == -EXDEV) + ret = generic_copy_file_range(src, src_pos, dst, dst_pos, + count, 0); + return ret; }
__be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp, diff --git a/fs/read_write.c b/fs/read_write.c index b5fb534cebe6..e1e83d6c21e6 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1413,28 +1413,6 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in, } EXPORT_SYMBOL(generic_copy_file_range);
-static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, - size_t len, unsigned int flags) -{ - /* - * Although we now allow filesystems to handle cross sb copy, passing - * a file of the wrong filesystem type to filesystem driver can result - * in an attempt to dereference the wrong type of ->private_data, so - * avoid doing that until we really have a good reason. NFS defines - * several different file_system_type structures, but they all end up - * using the same ->copy_file_range() function pointer. - */ - if (file_out->f_op->copy_file_range && - file_out->f_op->copy_file_range == file_in->f_op->copy_file_range) - return file_out->f_op->copy_file_range(file_in, pos_in, - file_out, pos_out, - len, flags); - - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, - flags); -} - /* * Performs necessary checks before doing a file copy * @@ -1456,6 +1434,24 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in, if (ret) return ret;
+ /* + * We allow some filesystems to handle cross sb copy, but passing + * a file of the wrong filesystem type to filesystem driver can result + * in an attempt to dereference the wrong type of ->private_data, so + * avoid doing that until we really have a good reason. + * + * nfs and cifs define several different file_system_type structures + * and several different sets of file_operations, but they all end up + * using the same ->copy_file_range() function pointer. + */ + if (file_out->f_op->copy_file_range) { + if (file_in->f_op->copy_file_range != + file_out->f_op->copy_file_range) + return -EXDEV; + } else if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) { + return -EXDEV; + } + /* Don't touch certain kinds of inodes */ if (IS_IMMUTABLE(inode_out)) return -EPERM; @@ -1521,26 +1517,41 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, file_start_write(file_out);
/* - * Try cloning first, this is supported by more file systems, and - * more efficient if both clone and copy are supported (e.g. NFS). + * Cloning is supported by more file systems, so we implement copy on + * same sb using clone, but for filesystems where both clone and copy + * are supported (e.g. nfs,cifs), we only call the copy method. */ + if (file_out->f_op->copy_file_range) { + ret = file_out->f_op->copy_file_range(file_in, pos_in, + file_out, pos_out, + len, flags); + goto done; + } + if (file_in->f_op->remap_file_range && file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) { - loff_t cloned; - - cloned = file_in->f_op->remap_file_range(file_in, pos_in, + ret = file_in->f_op->remap_file_range(file_in, pos_in, file_out, pos_out, min_t(loff_t, MAX_RW_COUNT, len), REMAP_FILE_CAN_SHORTEN); - if (cloned > 0) { - ret = cloned; + if (ret > 0) goto done; - } }
- ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len, - flags); - WARN_ON_ONCE(ret == -EOPNOTSUPP); + /* + * We can get here for same sb copy of filesystems that do not implement + * ->copy_file_range() in case filesystem does not support clone or in + * case filesystem supports clone but rejected the clone request (e.g. + * because it was not block aligned). + * + * In both cases, fall back to kernel copy so we are able to maintain a + * consistent story about which filesystems support copy_file_range() + * and which filesystems do not, that will allow userspace tools to + * make consistent desicions w.r.t using copy_file_range(). + */ + ret = generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, + flags); + done: if (ret > 0) { fsnotify_access(file_in);
commit 10bc8e4af65946b727728d7479c028742321b60a upstream.
[backport comments for pre v5.15: - ksmbd mentions are irrelevant - ksmbd hunks were dropped - sb_write_started() is missing - assert was dropped ]
Commit 868f9f2f8e00 ("vfs: fix copy_file_range() regression in cross-fs copies") removed fallback to generic_copy_file_range() for cross-fs cases inside vfs_copy_file_range().
To preserve behavior of nfsd and ksmbd server-side-copy, the fallback to generic_copy_file_range() was added in nfsd and ksmbd code, but that call is missing sb_start_write(), fsnotify hooks and more.
Ideally, nfsd and ksmbd would pass a flag to vfs_copy_file_range() that will take care of the fallback, but that code would be subtle and we got vfs_copy_file_range() logic wrong too many times already.
Instead, add a flag to explicitly request vfs_copy_file_range() to perform only generic_copy_file_range() and let nfsd and ksmbd use this flag only in the fallback path.
This choise keeps the logic changes to minimum in the non-nfsd/ksmbd code paths to reduce the risk of further regressions.
Fixes: 868f9f2f8e00 ("vfs: fix copy_file_range() regression in cross-fs copies") Tested-by: Namjae Jeon linkinjeon@kernel.org Tested-by: Luis Henriques lhenriques@suse.de Signed-off-by: Amir Goldstein amir73il@gmail.com Signed-off-by: Al Viro viro@zeniv.linux.org.uk Signed-off-by: Amir Goldstein amir73il@gmail.com --- fs/nfsd/vfs.c | 4 ++-- fs/read_write.c | 17 +++++++++++++---- include/linux/fs.h | 8 ++++++++ 3 files changed, 23 insertions(+), 6 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 9ecaf9481a37..e6bdc16aaf05 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -582,8 +582,8 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst, ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
if (ret == -EOPNOTSUPP || ret == -EXDEV) - ret = generic_copy_file_range(src, src_pos, dst, dst_pos, - count, 0); + ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, + COPY_FILE_SPLICE); return ret; }
diff --git a/fs/read_write.c b/fs/read_write.c index e1e83d6c21e6..ce74f3e0a346 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1444,7 +1444,9 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in, * and several different sets of file_operations, but they all end up * using the same ->copy_file_range() function pointer. */ - if (file_out->f_op->copy_file_range) { + if (flags & COPY_FILE_SPLICE) { + /* cross sb splice is allowed */ + } else if (file_out->f_op->copy_file_range) { if (file_in->f_op->copy_file_range != file_out->f_op->copy_file_range) return -EXDEV; @@ -1494,8 +1496,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, size_t len, unsigned int flags) { ssize_t ret; + bool splice = flags & COPY_FILE_SPLICE;
- if (flags != 0) + if (flags & ~COPY_FILE_SPLICE) return -EINVAL;
ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len, @@ -1521,14 +1524,14 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, * same sb using clone, but for filesystems where both clone and copy * are supported (e.g. nfs,cifs), we only call the copy method. */ - if (file_out->f_op->copy_file_range) { + if (!splice && file_out->f_op->copy_file_range) { ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out, len, flags); goto done; }
- if (file_in->f_op->remap_file_range && + if (!splice && file_in->f_op->remap_file_range && file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) { ret = file_in->f_op->remap_file_range(file_in, pos_in, file_out, pos_out, @@ -1548,6 +1551,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, * consistent story about which filesystems support copy_file_range() * and which filesystems do not, that will allow userspace tools to * make consistent desicions w.r.t using copy_file_range(). + * + * We also get here if caller (e.g. nfsd) requested COPY_FILE_SPLICE. */ ret = generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, flags); @@ -1602,6 +1607,10 @@ SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in, pos_out = f_out.file->f_pos; }
+ ret = -EINVAL; + if (flags != 0) + goto out; + ret = vfs_copy_file_range(f_in.file, pos_in, f_out.file, pos_out, len, flags); if (ret > 0) { diff --git a/include/linux/fs.h b/include/linux/fs.h index 8bde32cf9711..acf8dd6288f7 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1817,6 +1817,14 @@ struct dir_context { */ #define REMAP_FILE_ADVISORY (REMAP_FILE_CAN_SHORTEN)
+/* + * These flags control the behavior of vfs_copy_file_range(). + * They are not available to the user via syscall. + * + * COPY_FILE_SPLICE: call splice direct instead of fs clone/copy ops + */ +#define COPY_FILE_SPLICE (1 << 0) + struct iov_iter;
struct file_operations {
On Tue, Dec 13, 2022 at 03:13:39PM +0200, Amir Goldstein wrote:
Greg,
The recent history of copy_file_range() API is somewhat convoluted. The API changes are documented in copy_file_range(2) man page. I've just posted a man page update patch [1] to fix some wrong kernel version references in the man page.
The problem is that it took many kernel releases to get reports on the regression from v5.3 and yet more releases (v5.12..v5.19) to work on the solution and get it merged.
This situation leads to confusion among users as can be seen in [2]. You've already picked the patch [1/2] to 5.15.y and I sent you a request to pick patch [2/2] (from v6.1) as well.
Following are backports of the two patches to 5.10.y, which I verified with the relevant test in fstests.
All now queued up, thanks.
greg k-h
linux-stable-mirror@lists.linaro.org