From: Dmitry Mastykin <mastichi@gmail.com>
[ Upstream commit aad11473f8f4be3df86461081ce35ec5b145ba68 ]
We leak nfs_fattr and nfs4_label every time we set a security xattr.
Signed-off-by: Dmitry Mastykin <mastichi@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/nfs/nfs4proc.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 3a816c4a6d5e2..a691fa10b3e95 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -6289,6 +6289,7 @@ nfs4_set_security_label(struct inode *inode, const void *buf, size_t buflen)
 	if (status == 0)
 		nfs_setsecurity(inode, fattr);
 
+	nfs_free_fattr(fattr);
 	return status;
 }
 #endif	/* CONFIG_NFS_V4_SECURITY_LABEL */
From: Sagi Grimberg <sagi@grimberg.me>
[ Upstream commit 134d0b3f2440cdddd12fc3444c9c0f62331ce6fc ]
There is an inherent race where a symlink file may have been overridden (by a different client) between lookup and readlink, resulting in a spurious EIO error returned to userspace. Fix this by propagating back ESTALE errors, so that the VFS will retry the lookup/get_link (similar to nfs4_file_open) at least once.
Cc: Dan Aloni <dan.aloni@vastdata.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/nfs/symlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/nfs/symlink.c b/fs/nfs/symlink.c
index 0e27a2e4e68b8..13818129d268f 100644
--- a/fs/nfs/symlink.c
+++ b/fs/nfs/symlink.c
@@ -41,7 +41,7 @@ static int nfs_symlink_filler(struct file *file, struct folio *folio)
 error:
 	folio_set_error(folio);
 	folio_unlock(folio);
-	return -EIO;
+	return error;
 }
 
 static const char *nfs_get_link(struct dentry *dentry,
From: Jan Kara <jack@suse.cz>
[ Upstream commit a527c3ba41c4c61e2069bfce4091e5515f06a8dd ]
When we are doing WB_SYNC_ALL writeback, nfs submits write requests with NFS_FILE_SYNC flag to the server (which then generally treats it as an O_SYNC write). This helps to reduce latency for single requests but when submitting more requests, additional fsyncs on the server side hurt latency. NFS generally avoids this additional overhead by not setting NFS_FILE_SYNC if desc->pg_moreio is set.
However, this logic doesn't always work. When we do random 4k writes to a huge file and then call fsync(2), each page writeback is sent with NFS_FILE_SYNC: after preparing one page for writeback we start writing back the next one, and nfs_do_writepage() calls nfs_pageio_cond_complete(), which finds that the page is not contiguous with the previously prepared IO and submits it *without* setting desc->pg_moreio. Hence NFS_FILE_SYNC is used, resulting in poor performance.
Fix the problem by setting desc->pg_moreio in nfs_pageio_cond_complete() before submitting outstanding IO. This improves throughput of fsync-after-random-writes on my test SSD from ~70MB/s to ~250MB/s.
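As a rough illustration only (not part of the patch), a minimal userspace sketch of the workload described above; the mount path, file size and write count are made-up values:

/* Sketch of the fsync-after-random-writes workload: non-contiguous 4k
 * writes to a large file on an NFS mount, followed by fsync(2). */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	char buf[4096] = { 0 };
	off_t size = 1024 * 1024 * 1024;	/* 1 GiB, arbitrary */
	int fd = open("/mnt/nfs/bigfile", O_RDWR | O_CREAT, 0644);

	if (fd < 0 || ftruncate(fd, size))
		return 1;

	srand(42);
	for (int i = 0; i < 4096; i++) {
		/* non-contiguous pages hit the nfs_pageio_cond_complete() case */
		off_t off = ((off_t)rand() % (size / 4096)) * 4096;
		if (pwrite(fd, buf, sizeof(buf), off) != sizeof(buf))
			return 1;
	}
	/* WB_SYNC_ALL writeback: each request used to go out NFS_FILE_SYNC */
	return fsync(fd) ? 1 : 0;
}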
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/nfs/pagelist.c | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 6efb5068c116e..040b6b79c75e5 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -1545,6 +1545,11 @@ void nfs_pageio_cond_complete(struct nfs_pageio_descriptor *desc, pgoff_t index)
 				continue;
 		} else if (index == prev->wb_index + 1)
 			continue;
+		/*
+		 * We will submit more requests after these. Indicate
+		 * this to the underlying layers.
+		 */
+		desc->pg_moreio = 1;
 		nfs_pageio_complete(desc);
 		break;
 	}
From: Scott Mayhew <smayhew@redhat.com>
[ Upstream commit 0c8c7c559740d2d8b66048162af6c4dba8f0c88c ]
This is a slight variation on a patch previously proposed by Neil Brown that never got merged.
Prior to commit 5ceb9d7fdaaf ("NFS: Refactor nfs_lookup_revalidate()"), any error from nfs_lookup_verify_inode() other than -ESTALE would result in nfs_lookup_revalidate() returning that error (-ESTALE is mapped to zero).
Since that commit, all errors result in nfs_lookup_revalidate() returning zero, resulting in dentries being invalidated where they previously were not (particularly in the case of -ERESTARTSYS).
Fix it by passing the actual error code to nfs_lookup_revalidate_done(), and leaving the decision on whether to map the error code to zero or one to nfs_lookup_revalidate_done().
A simple reproducer is to run the following python code in a subdirectory of an NFS mount (not in the root of the NFS mount):
---8<---
import os
import multiprocessing
import time

if __name__=="__main__":
    multiprocessing.set_start_method("spawn")

    count = 0
    while True:
        try:
            os.getcwd()
            pool = multiprocessing.Pool(10)
            pool.close()
            pool.terminate()
            count += 1
        except Exception as e:
            print(f"Failed after {count} iterations")
            print(e)
            break
---8<---
Prior to commit 5ceb9d7fdaaf, the above code would run indefinitely. After commit 5ceb9d7fdaaf, it fails almost immediately with -ENOENT.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/nfs/dir.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index bdd6cb33a3708..375c08fdcf2f3 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1625,7 +1625,16 @@ nfs_lookup_revalidate_done(struct inode *dir, struct dentry *dentry,
 	switch (error) {
 	case 1:
 		break;
-	case 0:
+	case -ETIMEDOUT:
+		if (inode && (IS_ROOT(dentry) ||
+			      NFS_SERVER(inode)->flags & NFS_MOUNT_SOFTREVAL))
+			error = 1;
+		break;
+	case -ESTALE:
+	case -ENOENT:
+		error = 0;
+		fallthrough;
+	default:
 		/*
 		 * We can't d_drop the root of a disconnected tree:
 		 * its d_hash is on the s_anon list and d_drop() would hide
@@ -1680,18 +1689,8 @@ static int nfs_lookup_revalidate_dentry(struct inode *dir,
 
 	dir_verifier = nfs_save_change_attribute(dir);
 	ret = NFS_PROTO(dir)->lookup(dir, dentry, fhandle, fattr);
-	if (ret < 0) {
-		switch (ret) {
-		case -ESTALE:
-		case -ENOENT:
-			ret = 0;
-			break;
-		case -ETIMEDOUT:
-			if (NFS_SERVER(inode)->flags & NFS_MOUNT_SOFTREVAL)
-				ret = 1;
-		}
+	if (ret < 0)
 		goto out;
-	}
 
 	/* Request help from readdirplus */
 	nfs_lookup_advise_force_readdirplus(dir, flags);
@@ -1735,7 +1734,7 @@ nfs_do_lookup_revalidate(struct inode *dir, struct dentry *dentry,
 			 unsigned int flags)
 {
 	struct inode *inode;
-	int error;
+	int error = 0;
 
 	nfs_inc_stats(dir, NFSIOS_DENTRYREVALIDATE);
 	inode = d_inode(dentry);
@@ -1780,7 +1779,7 @@ nfs_do_lookup_revalidate(struct inode *dir, struct dentry *dentry,
 out_bad:
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
-	return nfs_lookup_revalidate_done(dir, dentry, inode, 0);
+	return nfs_lookup_revalidate_done(dir, dentry, inode, error);
 }
 
 static int
From: Baokun Li <libaokun1@huawei.com>
[ Upstream commit a26dc49df37e996876f50a0210039b2d211fdd6f ]
This prevents malicious processes from completing random copen/cread requests and crashing the system. Added checks are listed below:
* Generic: copen can only complete open requests, and cread can only complete read requests.
* For copen, ondemand_id must not be 0, because this indicates that the request has not been read by the daemon.
* For cread, the object corresponding to fd and req should be the same.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20240522114308.2402121-7-libaokun@huaweicloud.com
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/cachefiles/ondemand.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)
diff --git a/fs/cachefiles/ondemand.c b/fs/cachefiles/ondemand.c
index 89f118d68d125..6f815e7c50867 100644
--- a/fs/cachefiles/ondemand.c
+++ b/fs/cachefiles/ondemand.c
@@ -97,12 +97,12 @@ static loff_t cachefiles_ondemand_fd_llseek(struct file *filp, loff_t pos,
 }
 
 static long cachefiles_ondemand_fd_ioctl(struct file *filp, unsigned int ioctl,
-					 unsigned long arg)
+					 unsigned long id)
 {
 	struct cachefiles_object *object = filp->private_data;
 	struct cachefiles_cache *cache = object->volume->cache;
 	struct cachefiles_req *req;
-	unsigned long id;
+	XA_STATE(xas, &cache->reqs, id);
 
 	if (ioctl != CACHEFILES_IOC_READ_COMPLETE)
 		return -EINVAL;
@@ -110,10 +110,15 @@ static long cachefiles_ondemand_fd_ioctl(struct file *filp, unsigned int ioctl,
 	if (!test_bit(CACHEFILES_ONDEMAND_MODE, &cache->flags))
 		return -EOPNOTSUPP;
 
-	id = arg;
-	req = xa_erase(&cache->reqs, id);
-	if (!req)
+	xa_lock(&cache->reqs);
+	req = xas_load(&xas);
+	if (!req || req->msg.opcode != CACHEFILES_OP_READ ||
+	    req->object != object) {
+		xa_unlock(&cache->reqs);
 		return -EINVAL;
+	}
+	xas_store(&xas, NULL);
+	xa_unlock(&cache->reqs);
 
 	trace_cachefiles_ondemand_cread(object, id);
 	complete(&req->done);
@@ -142,6 +147,7 @@ int cachefiles_ondemand_copen(struct cachefiles_cache *cache, char *args)
 	unsigned long id;
 	long size;
 	int ret;
+	XA_STATE(xas, &cache->reqs, 0);
 
 	if (!test_bit(CACHEFILES_ONDEMAND_MODE, &cache->flags))
 		return -EOPNOTSUPP;
@@ -165,9 +171,16 @@ int cachefiles_ondemand_copen(struct cachefiles_cache *cache, char *args)
 	if (ret)
 		return ret;
 
-	req = xa_erase(&cache->reqs, id);
-	if (!req)
+	xa_lock(&cache->reqs);
+	xas.xa_index = id;
+	req = xas_load(&xas);
+	if (!req || req->msg.opcode != CACHEFILES_OP_OPEN ||
+	    !req->object->ondemand->ondemand_id) {
+		xa_unlock(&cache->reqs);
 		return -EINVAL;
+	}
+	xas_store(&xas, NULL);
+	xa_unlock(&cache->reqs);
 
 	/* fail OPEN request if copen format is invalid */
 	ret = kstrtol(psize, 0, &size);
From: Zizhi Wo <wozizhi@huawei.com>
[ Upstream commit 4f8703fb3482f92edcfd31661857b16fec89c2c0 ]
If copen is maliciously called from user mode, it may delete the request corresponding to a random id, even though that request may not have been read yet.
Note that when the object is set to reopen, the open request in the above case completes with the object still in the reopen state. As a result, the request corresponding to this object is always skipped in the select_req function, so the read request is never completed and blocks other processes.
Fix this issue by simply setting the object to close if its id < 0 in copen.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20240522114308.2402121-11-libaokun@huaweicloud.com
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/cachefiles/ondemand.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/cachefiles/ondemand.c b/fs/cachefiles/ondemand.c
index 6f815e7c50867..922cab1a314b2 100644
--- a/fs/cachefiles/ondemand.c
+++ b/fs/cachefiles/ondemand.c
@@ -182,6 +182,7 @@ int cachefiles_ondemand_copen(struct cachefiles_cache *cache, char *args)
 	xas_store(&xas, NULL);
 	xa_unlock(&cache->reqs);
 
+	info = req->object->ondemand;
 	/* fail OPEN request if copen format is invalid */
 	ret = kstrtol(psize, 0, &size);
 	if (ret) {
@@ -201,7 +202,6 @@ int cachefiles_ondemand_copen(struct cachefiles_cache *cache, char *args)
 		goto out;
 	}
 
-	info = req->object->ondemand;
 	spin_lock(&info->lock);
 	/*
	 * The anonymous fd was closed before copen ? Fail the request.
@@ -241,6 +241,11 @@ int cachefiles_ondemand_copen(struct cachefiles_cache *cache, char *args)
 	wake_up_all(&cache->daemon_pollwq);
 
 out:
+	spin_lock(&info->lock);
+	/* Need to set object close to avoid reopen status continuing */
+	if (info->ondemand_id == CACHEFILES_ONDEMAND_ID_CLOSED)
+		cachefiles_ondemand_set_object_close(req->object);
+	spin_unlock(&info->lock);
 	complete(&req->done);
 	return ret;
 }
From: Baokun Li <libaokun1@huawei.com>
[ Upstream commit bc9dde6155464e906e630a0a5c17a4cab241ffbb ]
Replacing wait_for_completion() with wait_for_completion_killable() in cachefiles_ondemand_send_req() allows us to kill processes that would otherwise trigger a hung_task if the daemon is abnormal.
But for now only CACHEFILES_OP_READ is made killable, because OP_CLOSE and OP_OPEN are initiated from kworker context and signals are prohibited in those kworkers.
Note that when the req in xas changes, i.e. xas_load(&xas) != req, it means that a process will complete the current request soon, so wait again for the request to be completed.
In addition, add the cachefiles_ondemand_finish_req() helper function to simplify the code.
Suggested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20240522114308.2402121-13-libaokun@huaweicloud.com
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/cachefiles/ondemand.c | 40 ++++++++++++++++++++++++++++------------
 1 file changed, 28 insertions(+), 12 deletions(-)
diff --git a/fs/cachefiles/ondemand.c b/fs/cachefiles/ondemand.c
index 922cab1a314b2..58bd80956c5a6 100644
--- a/fs/cachefiles/ondemand.c
+++ b/fs/cachefiles/ondemand.c
@@ -380,6 +380,20 @@ static struct cachefiles_req *cachefiles_ondemand_select_req(struct xa_state *xa
 	return NULL;
 }
 
+static inline bool cachefiles_ondemand_finish_req(struct cachefiles_req *req,
+						  struct xa_state *xas, int err)
+{
+	if (unlikely(!xas || !req))
+		return false;
+
+	if (xa_cmpxchg(xas->xa, xas->xa_index, req, NULL, 0) != req)
+		return false;
+
+	req->error = err;
+	complete(&req->done);
+	return true;
+}
+
 ssize_t cachefiles_ondemand_daemon_read(struct cachefiles_cache *cache,
 					char __user *_buffer, size_t buflen)
 {
@@ -443,16 +457,8 @@ ssize_t cachefiles_ondemand_daemon_read(struct cachefiles_cache *cache,
 out:
 	cachefiles_put_object(req->object, cachefiles_obj_put_read_req);
 	/* Remove error request and CLOSE request has no reply */
-	if (ret || msg->opcode == CACHEFILES_OP_CLOSE) {
-		xas_reset(&xas);
-		xas_lock(&xas);
-		if (xas_load(&xas) == req) {
-			req->error = ret;
-			complete(&req->done);
-			xas_store(&xas, NULL);
-		}
-		xas_unlock(&xas);
-	}
+	if (ret || msg->opcode == CACHEFILES_OP_CLOSE)
+		cachefiles_ondemand_finish_req(req, &xas, ret);
 	cachefiles_req_put(req);
 	return ret ? ret : n;
 }
@@ -544,8 +550,18 @@ static int cachefiles_ondemand_send_req(struct cachefiles_object *object,
 		goto out;
 
 	wake_up_all(&cache->daemon_pollwq);
-	wait_for_completion(&req->done);
-	ret = req->error;
+wait:
+	ret = wait_for_completion_killable(&req->done);
+	if (!ret) {
+		ret = req->error;
+	} else {
+		ret = -EINTR;
+		if (!cachefiles_ondemand_finish_req(req, &xas, ret)) {
+			/* Someone will complete it soon. */
+			cpu_relax();
+			goto wait;
+		}
+	}
 	cachefiles_req_put(req);
 	return ret;
 out:
From: Yuntao Wang <yuntao.wang@linux.dev>
[ Upstream commit ed8c7fbdfe117abbef81f65428ba263118ef298a ]
The maximum possible return value of find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) is maxbit. This return value, multiplied by BITS_PER_LONG, gives the value of bitbit, which can never be greater than maxfd; at most it can equal maxfd. So the following check, 'if (bitbit > maxfd)', will never be true.
Moreover, when bitbit equals maxfd, there are no unused fds and the function can return directly.
Fix this check.
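To make the arithmetic concrete, here is a small standalone sketch with assumed values (BITS_PER_LONG == 64 and a 128-fd table; illustration only, not kernel code):

/* Worked example of the bound: max_fds is always a multiple of
 * BITS_PER_LONG, so the largest possible bitbit equals maxfd. */
#include <stdio.h>

#define BITS_PER_LONG 64

int main(void)
{
	unsigned int maxfd = 128;			/* fdt->max_fds */
	unsigned int maxbit = maxfd / BITS_PER_LONG;	/* 2 */

	/*
	 * find_next_zero_bit() returns at most 'maxbit' (when every word
	 * in full_fds_bits is full), so the largest bitbit can get is:
	 */
	unsigned int bitbit = maxbit * BITS_PER_LONG;	/* 128 == maxfd */

	printf("bitbit > maxfd:  %d (never true)\n", bitbit > maxfd);
	printf("bitbit >= maxfd: %d (means no unused fds)\n", bitbit >= maxfd);
	return 0;
}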
Signed-off-by: Yuntao Wang <yuntao.wang@linux.dev>
Link: https://lore.kernel.org/r/20240529160656.209352-1-yuntao.wang@linux.dev
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/file.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index 3b683b9101d84..005841dd35977 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -481,12 +481,12 @@ struct files_struct init_files = {
 
 static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
 {
-	unsigned int maxfd = fdt->max_fds;
+	unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
 	unsigned int maxbit = maxfd / BITS_PER_LONG;
 	unsigned int bitbit = start / BITS_PER_LONG;
 
 	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
-	if (bitbit > maxfd)
+	if (bitbit >= maxfd)
 		return maxfd;
 	if (bitbit > start)
 		start = bitbit;
From: Alex Williamson <alex.williamson@redhat.com>
[ Upstream commit b7c5e64fecfa88764791679cca4786ac65de739e ]
By linking all the device fds we provide to userspace to an address space through a new pseudo fs, we can use tools like unmap_mapping_range() to zap all vmas associated with a device.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240530045236.1005864-2-alex.williamson@redhat.co...
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/vfio/device_cdev.c |  7 ++++++
 drivers/vfio/group.c       |  7 ++++++
 drivers/vfio/vfio_main.c   | 44 ++++++++++++++++++++++++++++++++++++++
 include/linux/vfio.h       |  1 +
 4 files changed, 59 insertions(+)
diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
index e75da0a70d1f8..bb1817bd4ff31 100644
--- a/drivers/vfio/device_cdev.c
+++ b/drivers/vfio/device_cdev.c
@@ -39,6 +39,13 @@ int vfio_device_fops_cdev_open(struct inode *inode, struct file *filep)
 
 	filep->private_data = df;
 
+	/*
+	 * Use the pseudo fs inode on the device to link all mmaps
+	 * to the same address space, allowing us to unmap all vmas
+	 * associated to this device using unmap_mapping_range().
+	 */
+	filep->f_mapping = device->inode->i_mapping;
+
 	return 0;
 
 err_put_registration:
diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
index 610a429c61912..ded364588d297 100644
--- a/drivers/vfio/group.c
+++ b/drivers/vfio/group.c
@@ -286,6 +286,13 @@ static struct file *vfio_device_open_file(struct vfio_device *device)
 	 */
 	filep->f_mode |= (FMODE_PREAD | FMODE_PWRITE);
 
+	/*
+	 * Use the pseudo fs inode on the device to link all mmaps
+	 * to the same address space, allowing us to unmap all vmas
+	 * associated to this device using unmap_mapping_range().
+	 */
+	filep->f_mapping = device->inode->i_mapping;
+
 	if (device->group->type == VFIO_NO_IOMMU)
 		dev_warn(device->dev, "vfio-noiommu device opened by user "
 			 "(%s:%d)\n", current->comm, task_pid_nr(current));
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index e97d796a54fba..a5a62d9d963f7 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -22,8 +22,10 @@
 #include <linux/list.h>
 #include <linux/miscdevice.h>
 #include <linux/module.h>
+#include <linux/mount.h>
 #include <linux/mutex.h>
 #include <linux/pci.h>
+#include <linux/pseudo_fs.h>
 #include <linux/rwsem.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
@@ -43,9 +45,13 @@
 #define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
 #define DRIVER_DESC	"VFIO - User Level meta-driver"
 
+#define VFIO_MAGIC 0x5646494f /* "VFIO" */
+
 static struct vfio {
 	struct class			*device_class;
 	struct ida			device_ida;
+	struct vfsmount			*vfs_mount;
+	int				fs_count;
 } vfio;
 
 #ifdef CONFIG_VFIO_NOIOMMU
@@ -186,6 +192,8 @@ static void vfio_device_release(struct device *dev)
 	if (device->ops->release)
 		device->ops->release(device);
 
+	iput(device->inode);
+	simple_release_fs(&vfio.vfs_mount, &vfio.fs_count);
 	kvfree(device);
 }
 
@@ -228,6 +236,34 @@ struct vfio_device *_vfio_alloc_device(size_t size, struct device *dev,
 }
 EXPORT_SYMBOL_GPL(_vfio_alloc_device);
 
+static int vfio_fs_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, VFIO_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type vfio_fs_type = {
+	.name = "vfio",
+	.owner = THIS_MODULE,
+	.init_fs_context = vfio_fs_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
+
+static struct inode *vfio_fs_inode_new(void)
+{
+	struct inode *inode;
+	int ret;
+
+	ret = simple_pin_fs(&vfio_fs_type, &vfio.vfs_mount, &vfio.fs_count);
+	if (ret)
+		return ERR_PTR(ret);
+
+	inode = alloc_anon_inode(vfio.vfs_mount->mnt_sb);
+	if (IS_ERR(inode))
+		simple_release_fs(&vfio.vfs_mount, &vfio.fs_count);
+
+	return inode;
+}
+
 /*
  * Initialize a vfio_device so it can be registered to vfio core.
  */
@@ -246,6 +282,11 @@ static int vfio_init_device(struct vfio_device *device, struct device *dev,
 	init_completion(&device->comp);
 	device->dev = dev;
 	device->ops = ops;
+	device->inode = vfio_fs_inode_new();
+	if (IS_ERR(device->inode)) {
+		ret = PTR_ERR(device->inode);
+		goto out_inode;
+	}
 
 	if (ops->init) {
 		ret = ops->init(device);
@@ -260,6 +301,9 @@ static int vfio_init_device(struct vfio_device *device, struct device *dev,
 	return 0;
 
 out_uninit:
+	iput(device->inode);
+	simple_release_fs(&vfio.vfs_mount, &vfio.fs_count);
+out_inode:
 	vfio_release_device_set(device);
 	ida_free(&vfio.device_ida, device->index);
 	return ret;
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 8b1a298204091..000a6cab2d318 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -64,6 +64,7 @@ struct vfio_device {
 	struct completion comp;
 	struct iommufd_access *iommufd_access;
 	void (*put_kvm)(struct kvm *kvm);
+	struct inode *inode;
 #if IS_ENABLED(CONFIG_IOMMUFD)
 	struct iommufd_device *iommufd_device;
 	u8 iommufd_attached:1;
From: Alex Williamson <alex.williamson@redhat.com>
[ Upstream commit aac6db75a9fc2c7a6f73e152df8f15101dda38e6 ]
With the vfio device fd tied to the address space of the pseudo fs inode, we can use the mm to track all vmas that might be mmap'ing device BARs, which removes our vma_list and all the complicated lock ordering necessary to manually zap each related vma.
Note that we can no longer store the pfn in vm_pgoff if we want to use unmap_mapping_range() to zap a selective portion of the device fd corresponding to BAR mappings.
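For reference, a small standalone sketch of that offset encoding (VFIO_PCI_OFFSET_SHIFT is 40 in vfio_pci_core.h; the PAGE_SHIFT of 12 and the BAR index are assumed values, and this mirrors what the new vma_to_pfn() in the hunk below decodes):

/* Sketch of the vfio-pci mmap offset encoding: the region index lives in
 * the high bits of the file offset, so it survives in vm_pgoff and the
 * raw pfn no longer needs to be stored there. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT		12	/* assumed 4K pages */
#define VFIO_PCI_OFFSET_SHIFT	40

int main(void)
{
	uint64_t index = 2;	/* e.g. BAR2 */
	uint64_t offset = index << VFIO_PCI_OFFSET_SHIFT;
	uint64_t vm_pgoff = offset >> PAGE_SHIFT;

	/* decode, as vma_to_pfn() does */
	uint64_t dec_index = vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
	uint64_t dec_pgoff = vm_pgoff &
			     ((1ULL << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);

	printf("index=%llu pgoff-within-bar=%llu\n",
	       (unsigned long long)dec_index, (unsigned long long)dec_pgoff);
	return 0;
}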
This also converts our mmap fault handler to use vmf_insert_pfn() because we no longer have a vma_list to avoid the concurrency problem with io_remap_pfn_range(). The goal is to eventually use the vm_ops huge_fault handler to avoid the additional faulting overhead, but vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first.
Also, Jason notes that a race exists between unmap_mapping_range() and the fops mmap callback if we were to call io_remap_pfn_range() to populate the vma on mmap. Specifically, mmap_region() does call_mmap() before it does vma_link_file() which gives a window where the vma is populated but invisible to unmap_mapping_range().
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240530045236.1005864-3-alex.williamson@redhat.co...
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/vfio/pci/vfio_pci_core.c | 264 +++++++------------------------
 include/linux/vfio_pci_core.h    |   2 -
 2 files changed, 55 insertions(+), 211 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index d94d61b92c1ac..2baf4dfac3f43 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1587,100 +1587,20 @@ ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *bu
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_write);
 
-/* Return 1 on zap and vma_lock acquired, 0 on contention (only with @try) */
-static int vfio_pci_zap_and_vma_lock(struct vfio_pci_core_device *vdev, bool try)
+static void vfio_pci_zap_bars(struct vfio_pci_core_device *vdev)
 {
-	struct vfio_pci_mmap_vma *mmap_vma, *tmp;
+	struct vfio_device *core_vdev = &vdev->vdev;
+	loff_t start = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX);
+	loff_t end = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX);
+	loff_t len = end - start;
 
-	/*
-	 * Lock ordering:
-	 * vma_lock is nested under mmap_lock for vm_ops callback paths.
-	 * The memory_lock semaphore is used by both code paths calling
-	 * into this function to zap vmas and the vm_ops.fault callback
-	 * to protect the memory enable state of the device.
-	 *
-	 * When zapping vmas we need to maintain the mmap_lock => vma_lock
-	 * ordering, which requires using vma_lock to walk vma_list to
-	 * acquire an mm, then dropping vma_lock to get the mmap_lock and
-	 * reacquiring vma_lock.  This logic is derived from similar
-	 * requirements in uverbs_user_mmap_disassociate().
-	 *
-	 * mmap_lock must always be the top-level lock when it is taken.
-	 * Therefore we can only hold the memory_lock write lock when
-	 * vma_list is empty, as we'd need to take mmap_lock to clear
-	 * entries.  vma_list can only be guaranteed empty when holding
-	 * vma_lock, thus memory_lock is nested under vma_lock.
-	 *
-	 * This enables the vm_ops.fault callback to acquire vma_lock,
-	 * followed by memory_lock read lock, while already holding
-	 * mmap_lock without risk of deadlock.
-	 */
-	while (1) {
-		struct mm_struct *mm = NULL;
-
-		if (try) {
-			if (!mutex_trylock(&vdev->vma_lock))
-				return 0;
-		} else {
-			mutex_lock(&vdev->vma_lock);
-		}
-		while (!list_empty(&vdev->vma_list)) {
-			mmap_vma = list_first_entry(&vdev->vma_list,
-						    struct vfio_pci_mmap_vma,
-						    vma_next);
-			mm = mmap_vma->vma->vm_mm;
-			if (mmget_not_zero(mm))
-				break;
-
-			list_del(&mmap_vma->vma_next);
-			kfree(mmap_vma);
-			mm = NULL;
-		}
-		if (!mm)
-			return 1;
-		mutex_unlock(&vdev->vma_lock);
-
-		if (try) {
-			if (!mmap_read_trylock(mm)) {
-				mmput(mm);
-				return 0;
-			}
-		} else {
-			mmap_read_lock(mm);
-		}
-		if (try) {
-			if (!mutex_trylock(&vdev->vma_lock)) {
-				mmap_read_unlock(mm);
-				mmput(mm);
-				return 0;
-			}
-		} else {
-			mutex_lock(&vdev->vma_lock);
-		}
-		list_for_each_entry_safe(mmap_vma, tmp,
-					 &vdev->vma_list, vma_next) {
-			struct vm_area_struct *vma = mmap_vma->vma;
-
-			if (vma->vm_mm != mm)
-				continue;
-
-			list_del(&mmap_vma->vma_next);
-			kfree(mmap_vma);
-
-			zap_vma_ptes(vma, vma->vm_start,
-				     vma->vm_end - vma->vm_start);
-		}
-		mutex_unlock(&vdev->vma_lock);
-		mmap_read_unlock(mm);
-		mmput(mm);
-	}
+	unmap_mapping_range(core_vdev->inode->i_mapping, start, len, true);
 }
 
 void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device *vdev)
 {
-	vfio_pci_zap_and_vma_lock(vdev, false);
 	down_write(&vdev->memory_lock);
-	mutex_unlock(&vdev->vma_lock);
+	vfio_pci_zap_bars(vdev);
 }
 
 u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev)
@@ -1702,99 +1622,41 @@ void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev, u16 c
 	up_write(&vdev->memory_lock);
 }
 
-/* Caller holds vma_lock */
-static int __vfio_pci_add_vma(struct vfio_pci_core_device *vdev,
-			      struct vm_area_struct *vma)
-{
-	struct vfio_pci_mmap_vma *mmap_vma;
-
-	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL_ACCOUNT);
-	if (!mmap_vma)
-		return -ENOMEM;
-
-	mmap_vma->vma = vma;
-	list_add(&mmap_vma->vma_next, &vdev->vma_list);
-
-	return 0;
-}
-
-/*
- * Zap mmaps on open so that we can fault them in on access and therefore
- * our vma_list only tracks mappings accessed since last zap.
- */
-static void vfio_pci_mmap_open(struct vm_area_struct *vma)
-{
-	zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
-}
-
-static void vfio_pci_mmap_close(struct vm_area_struct *vma)
+static unsigned long vma_to_pfn(struct vm_area_struct *vma)
 {
 	struct vfio_pci_core_device *vdev = vma->vm_private_data;
-	struct vfio_pci_mmap_vma *mmap_vma;
+	int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+	u64 pgoff;
 
-	mutex_lock(&vdev->vma_lock);
-	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
-		if (mmap_vma->vma == vma) {
-			list_del(&mmap_vma->vma_next);
-			kfree(mmap_vma);
-			break;
-		}
-	}
-	mutex_unlock(&vdev->vma_lock);
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+	return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
 }
 
 static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct vfio_pci_core_device *vdev = vma->vm_private_data;
-	struct vfio_pci_mmap_vma *mmap_vma;
-	vm_fault_t ret = VM_FAULT_NOPAGE;
+	unsigned long pfn, pgoff = vmf->pgoff - vma->vm_pgoff;
+	vm_fault_t ret = VM_FAULT_SIGBUS;
 
-	mutex_lock(&vdev->vma_lock);
-	down_read(&vdev->memory_lock);
+	pfn = vma_to_pfn(vma);
 
-	/*
-	 * Memory region cannot be accessed if the low power feature is engaged
-	 * or memory access is disabled.
-	 */
-	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev)) {
-		ret = VM_FAULT_SIGBUS;
-		goto up_out;
-	}
+	down_read(&vdev->memory_lock);
 
-	/*
-	 * We populate the whole vma on fault, so we need to test whether
-	 * the vma has already been mapped, such as for concurrent faults
-	 * to the same vma.  io_remap_pfn_range() will trigger a BUG_ON if
-	 * we ask it to fill the same range again.
-	 */
-	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
-		if (mmap_vma->vma == vma)
-			goto up_out;
-	}
+	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
+		goto out_disabled;
 
-	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       vma->vm_end - vma->vm_start,
-			       vma->vm_page_prot)) {
-		ret = VM_FAULT_SIGBUS;
-		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
-		goto up_out;
-	}
+	ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
 
-	if (__vfio_pci_add_vma(vdev, vma)) {
-		ret = VM_FAULT_OOM;
-		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
-	}
-
-up_out:
+out_disabled:
 	up_read(&vdev->memory_lock);
-	mutex_unlock(&vdev->vma_lock);
+
 	return ret;
 }
 
 static const struct vm_operations_struct vfio_pci_mmap_ops = {
-	.open = vfio_pci_mmap_open,
-	.close = vfio_pci_mmap_close,
 	.fault = vfio_pci_mmap_fault,
 };
 
@@ -1857,11 +1719,12 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
 
 	vma->vm_private_data = vdev;
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
 
 	/*
-	 * See remap_pfn_range(), called from vfio_pci_fault() but we can't
-	 * change vm_flags within the fault handler.  Set them now.
+	 * Set vm_flags now, they should not be changed in the fault handler.
+	 * We want the same flags and page protection (decrypted above) as
+	 * io_remap_pfn_range() would set.
 	 *
 	 * VM_ALLOW_ANY_UNCACHED: The VMA flag is implemented for ARM64,
 	 * allowing KVM stage 2 device mapping attributes to use Normal-NC
@@ -2179,8 +2042,6 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
 	mutex_init(&vdev->ioeventfds_lock);
 	INIT_LIST_HEAD(&vdev->dummy_resources_list);
 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
-	mutex_init(&vdev->vma_lock);
-	INIT_LIST_HEAD(&vdev->vma_list);
 	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
 	init_rwsem(&vdev->memory_lock);
 	xa_init(&vdev->ctx);
@@ -2196,7 +2057,6 @@ void vfio_pci_core_release_dev(struct vfio_device *core_vdev)
 
 	mutex_destroy(&vdev->igate);
 	mutex_destroy(&vdev->ioeventfds_lock);
-	mutex_destroy(&vdev->vma_lock);
 	kfree(vdev->region);
 	kfree(vdev->pm_save);
 }
@@ -2474,26 +2334,15 @@ static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
 	return ret;
 }
 
-/*
- * We need to get memory_lock for each device, but devices can share mmap_lock,
- * therefore we need to zap and hold the vma_lock for each device, and only then
- * get each memory_lock.
- */
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 				      struct vfio_pci_group_info *groups,
 				      struct iommufd_ctx *iommufd_ctx)
 {
-	struct vfio_pci_core_device *cur_mem;
-	struct vfio_pci_core_device *cur_vma;
-	struct vfio_pci_core_device *cur;
+	struct vfio_pci_core_device *vdev;
 	struct pci_dev *pdev;
-	bool is_mem = true;
 	int ret;
 
 	mutex_lock(&dev_set->lock);
-	cur_mem = list_first_entry(&dev_set->device_list,
-				   struct vfio_pci_core_device,
-				   vdev.dev_set_list);
 
 	pdev = vfio_pci_dev_set_resettable(dev_set);
 	if (!pdev) {
@@ -2510,7 +2359,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 	if (ret)
 		goto err_unlock;
 
-	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
+	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) {
 		bool owned;
 
 		/*
@@ -2534,38 +2383,38 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		 * Otherwise, reset is not allowed.
 		 */
 		if (iommufd_ctx) {
-			int devid = vfio_iommufd_get_dev_id(&cur_vma->vdev,
+			int devid = vfio_iommufd_get_dev_id(&vdev->vdev,
 							    iommufd_ctx);
 
 			owned = (devid > 0 || devid == -ENOENT);
 		} else {
-			owned = vfio_dev_in_groups(&cur_vma->vdev, groups);
+			owned = vfio_dev_in_groups(&vdev->vdev, groups);
 		}
 
 		if (!owned) {
 			ret = -EINVAL;
-			goto err_undo;
+			break;
 		}
 
 		/*
-		 * Locking multiple devices is prone to deadlock, runaway and
-		 * unwind if we hit contention.
+		 * Take the memory write lock for each device and zap BAR
+		 * mappings to prevent the user accessing the device while in
+		 * reset.  Locking multiple devices is prone to deadlock,
+		 * runaway and unwind if we hit contention.
 		 */
-		if (!vfio_pci_zap_and_vma_lock(cur_vma, true)) {
+		if (!down_write_trylock(&vdev->memory_lock)) {
 			ret = -EBUSY;
-			goto err_undo;
+			break;
 		}
+
+		vfio_pci_zap_bars(vdev);
 	}
-	cur_vma = NULL;
 
-	list_for_each_entry(cur_mem, &dev_set->device_list, vdev.dev_set_list) {
-		if (!down_write_trylock(&cur_mem->memory_lock)) {
-			ret = -EBUSY;
-			goto err_undo;
-		}
-		mutex_unlock(&cur_mem->vma_lock);
+	if (!list_entry_is_head(vdev,
+				&dev_set->device_list, vdev.dev_set_list)) {
+		vdev = list_prev_entry(vdev, vdev.dev_set_list);
+		goto err_undo;
 	}
-	cur_mem = NULL;
 
 	/*
	 * The pci_reset_bus() will reset all the devices in the bus.
@@ -2576,25 +2425,22 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 	 * cause the PCI config space reset without restoring the original
 	 * state (saved locally in 'vdev->pm_save').
 	 */
-	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
-		vfio_pci_set_power_state(cur, PCI_D0);
+	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
+		vfio_pci_set_power_state(vdev, PCI_D0);
 
 	ret = pci_reset_bus(pdev);
 
+	vdev = list_last_entry(&dev_set->device_list,
+			       struct vfio_pci_core_device, vdev.dev_set_list);
+
 err_undo:
-	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
-		if (cur == cur_mem)
-			is_mem = false;
-		if (cur == cur_vma)
-			break;
-		if (is_mem)
-			up_write(&cur->memory_lock);
-		else
-			mutex_unlock(&cur->vma_lock);
-	}
+	list_for_each_entry_from_reverse(vdev, &dev_set->device_list,
					 vdev.dev_set_list)
+		up_write(&vdev->memory_lock);
+
+	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
+		pm_runtime_put(&vdev->pdev->dev);
 
-	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
-		pm_runtime_put(&cur->pdev->dev);
 err_unlock:
 	mutex_unlock(&dev_set->lock);
 	return ret;
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index a2c8b8bba7119..f87067438ed48 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -93,8 +93,6 @@ struct vfio_pci_core_device {
 	struct list_head	sriov_pfs_item;
 	struct vfio_pci_core_device	*sriov_pf_core_dev;
 	struct notifier_block	nb;
-	struct mutex		vma_lock;
-	struct list_head	vma_list;
 	struct rw_semaphore	memory_lock;
 };
From: Alexander Usyskin <alexander.usyskin@intel.com>
[ Upstream commit 1db5322b7e6b58e1b304ce69a50e9dca798ca95b ]
Change level for the "not connected" client message in the write callback from error to debug.
The MEI driver currently disconnects all clients upon system suspend. This behavior is by design and user-space applications with open connections before the suspend are expected to handle errors upon resume, by reopening their handles, reconnecting, and retrying their operations.
However, the current driver implementation logs an error message every time a write operation is attempted on a disconnected client. Since this is a normal and expected flow after system resume, logging it as an error can be misleading.
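A sketch of the recovery flow such an application might implement (the device node and the 16 client-UUID bytes are placeholders; only IOCTL_MEI_CONNECT_CLIENT and the ENODEV convention come from the driver's uapi):

/* Userspace side of the design described above: a write on a stale
 * handle fails with ENODEV after resume, so reopen, reconnect, retry. */
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/mei.h>

static int mei_reconnect(int *fd, const unsigned char uuid[16])
{
	struct mei_connect_client_data data;

	close(*fd);
	*fd = open("/dev/mei0", O_RDWR);	/* placeholder node */
	if (*fd < 0)
		return -1;

	memset(&data, 0, sizeof(data));
	memcpy(&data.in_client_uuid, uuid, 16);
	return ioctl(*fd, IOCTL_MEI_CONNECT_CLIENT, &data);
}

static ssize_t mei_write_retry(int *fd, const unsigned char uuid[16],
			       const void *buf, size_t len)
{
	ssize_t n = write(*fd, buf, len);

	/* ENODEV: the connection was torn down across suspend/resume */
	if (n < 0 && errno == ENODEV && !mei_reconnect(fd, uuid))
		n = write(*fd, buf, len);
	return n;
}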
Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Link: https://lore.kernel.org/r/20240530091415.725247-1-tomas.winkler@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/misc/mei/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/misc/mei/main.c b/drivers/misc/mei/main.c
index 79e6f3c1341fe..40c3fe26f76df 100644
--- a/drivers/misc/mei/main.c
+++ b/drivers/misc/mei/main.c
@@ -329,7 +329,7 @@ static ssize_t mei_write(struct file *file, const char __user *ubuf,
 	}
 
 	if (!mei_cl_is_connected(cl)) {
-		cl_err(dev, cl, "is not connected");
+		cl_dbg(dev, cl, "is not connected");
 		rets = -ENODEV;
 		goto out;
 	}
From: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
[ Upstream commit 73fedc31fed38cb6039fd8a7efea1774143b68b0 ]
As described in the added code comment, a reference to .exit.text is ok for drivers registered via module_platform_driver_probe(). Make this explicit to prevent the following section mismatch warning
WARNING: modpost: drivers/parport/parport_amiga: section mismatch in reference: amiga_parallel_driver+0x8 (section: .data) -> amiga_parallel_remove (section: .exit.text)
that triggers on an allmodconfig W=1 build.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/20240513075206.2337310-2-u.kleine-koenig@pengutron...
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/parport/parport_amiga.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/parport/parport_amiga.c b/drivers/parport/parport_amiga.c
index e6dc857aac3fe..e06c7b2aac5c4 100644
--- a/drivers/parport/parport_amiga.c
+++ b/drivers/parport/parport_amiga.c
@@ -229,7 +229,13 @@ static void __exit amiga_parallel_remove(struct platform_device *pdev)
 	parport_put_port(port);
 }
 
-static struct platform_driver amiga_parallel_driver = {
+/*
+ * amiga_parallel_remove() lives in .exit.text. For drivers registered via
+ * module_platform_driver_probe() this is ok because they cannot get unbound at
+ * runtime. So mark the driver struct with __refdata to prevent modpost
+ * triggering a section mismatch warning.
+ */
+static struct platform_driver amiga_parallel_driver __refdata = {
 	.remove_new = __exit_p(amiga_parallel_remove),
 	.driver = {
 		.name = "amiga-parallel",
From: "Ritesh Harjani (IBM)" ritesh.list@gmail.com
[ Upstream commit f5ceb1bbc98c69536d4673a97315e8427e67de1b ]
If the extent spans the block that contains i_size, we need to handle both halves separately so that we properly zero data in the page cache for blocks that are entirely outside of i_size. But this is needed only when i_size falls within the current folio under processing. "orig_pos + length > isize" can be true for every folio when the mapped extent length is greater than the folio size, which causes plen to be trimmed for every folio instead of only the last one.
So use orig_plen for checking if "orig_pos + orig_plen > isize".
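For example (made-up numbers): with 16k folios, i_size = 20k and a single 1M extent mapping the whole file, the old check fires for every folio, since both 0 + 1M and 16k + 1M exceed 20k, even though i_size is not inside the first folio [0, 16k). With orig_plen (at most 16k per folio), only the second folio, where 16k + 16k > 20k and i_size actually falls within the folio, takes the zeroing path.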
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/a32e5f9a4fcfdb99077300c4020ed7ae61d6e0f9.171506705...
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/iomap/buffered-io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4ac6c8c403c26..248e615270ff7 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -241,6 +241,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 	unsigned block_size = (1 << block_bits);
 	size_t poff = offset_in_folio(folio, *pos);
 	size_t plen = min_t(loff_t, folio_size(folio) - poff, length);
+	size_t orig_plen = plen;
 	unsigned first = poff >> block_bits;
 	unsigned last = (poff + plen - 1) >> block_bits;
 
@@ -277,7 +278,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 	 * handle both halves separately so that we properly zero data in the
 	 * page cache for blocks that are entirely outside of i_size.
 	 */
-	if (orig_pos <= isize && orig_pos + length > isize) {
+	if (orig_pos <= isize && orig_pos + orig_plen > isize) {
 		unsigned end = offset_in_folio(folio, isize - 1) >> block_bits;
 
 		if (first <= end && last > end)
From: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
[ Upstream commit 1f3512cdf8299f9edaea9046d53ea324a7730bab ]
The core in platform_driver_register() already sets .owner, so the driver does not need to. Whatever is set here will be overwritten anyway when the main driver calls platform_driver_register().
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/gpu/drm/exynos/exynos_dp.c | 1 -
 1 file changed, 1 deletion(-)
diff --git a/drivers/gpu/drm/exynos/exynos_dp.c b/drivers/gpu/drm/exynos/exynos_dp.c
index f48c4343f4690..3e6d4c6aa877e 100644
--- a/drivers/gpu/drm/exynos/exynos_dp.c
+++ b/drivers/gpu/drm/exynos/exynos_dp.c
@@ -285,7 +285,6 @@ struct platform_driver dp_driver = {
 	.remove_new = exynos_dp_remove,
 	.driver = {
 		.name = "exynos-dp",
-		.owner = THIS_MODULE,
 		.pm = pm_ptr(&exynos_dp_pm_ops),
 		.of_match_table = exynos_dp_match,
 	},
From: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
[ Upstream commit f74fb5df429ebc6a614dc5aa9e44d7194d402e5a ]
Similar to the other Aya Neo devices, this one again features a portrait screen, here with a native resolution of 1600x2560.
Signed-off-by: Tobias Jakobi <tjakobi@math.uni-bielefeld.de>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240310220401.895591-1-tjakob...
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/gpu/drm/drm_panel_orientation_quirks.c | 6 ++++++
 1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/drm_panel_orientation_quirks.c b/drivers/gpu/drm/drm_panel_orientation_quirks.c
index aa93129c3397e..2166208a961d6 100644
--- a/drivers/gpu/drm/drm_panel_orientation_quirks.c
+++ b/drivers/gpu/drm/drm_panel_orientation_quirks.c
@@ -202,6 +202,12 @@ static const struct dmi_system_id orientation_data[] = {
 		  DMI_MATCH(DMI_BOARD_NAME, "NEXT"),
 		},
 		.driver_data = (void *)&lcd800x1280_rightside_up,
+	}, {	/* AYA NEO KUN */
+		.matches = {
+		  DMI_EXACT_MATCH(DMI_BOARD_VENDOR, "AYANEO"),
+		  DMI_MATCH(DMI_BOARD_NAME, "KUN"),
+		},
+		.driver_data = (void *)&lcd1600x2560_rightside_up,
 	}, {	/* Chuwi HiBook (CWI514) */
 		.matches = {
 			DMI_MATCH(DMI_BOARD_VENDOR, "Hampoo"),
From: Douglas Anderson <dianders@chromium.org>
[ Upstream commit 0320ca14c6fb68ad19aa72e55a1a21c061b2946b ]
Based on grepping through the source code, this driver appears to be missing a call to drm_atomic_helper_shutdown() at system shutdown time. This is important because drm_atomic_helper_shutdown() will cause panels to get disabled cleanly which may be important for their power sequencing. Future changes will remove any custom powering off in individual panel drivers so the DRM drivers need to start getting this right.
The fact that we should call drm_atomic_helper_shutdown() in the case of OS shutdown comes straight out of the kernel doc "driver instance overview" in drm_drv.c.
[geert: shmob_drm_remove() already calls drm_atomic_helper_shutdown]
Suggested-by: Maxime Ripard <mripard@kernel.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20230901164111.RFT.15.Iaf638a1d4c8b3c307a6192efabb...
[geert: s/drm_helper_force_disable_all/drm_atomic_helper_shutdown/]
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Reviewed-by: Sui Jingfeng <sui.jingfeng@linux.dev>
Signed-off-by: Maxime Ripard <mripard@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/17c6a5a668e5975f871b77fb1fca67...
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/gpu/drm/renesas/shmobile/shmob_drm_drv.c | 8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/drivers/gpu/drm/renesas/shmobile/shmob_drm_drv.c b/drivers/gpu/drm/renesas/shmobile/shmob_drm_drv.c
index e83c3e52251de..0250d5f00bf10 100644
--- a/drivers/gpu/drm/renesas/shmobile/shmob_drm_drv.c
+++ b/drivers/gpu/drm/renesas/shmobile/shmob_drm_drv.c
@@ -171,6 +171,13 @@ static void shmob_drm_remove(struct platform_device *pdev)
 	drm_kms_helper_poll_fini(ddev);
 }
 
+static void shmob_drm_shutdown(struct platform_device *pdev)
+{
+	struct shmob_drm_device *sdev = platform_get_drvdata(pdev);
+
+	drm_atomic_helper_shutdown(&sdev->ddev);
+}
+
 static int shmob_drm_probe(struct platform_device *pdev)
 {
 	struct shmob_drm_platform_data *pdata = pdev->dev.platform_data;
@@ -273,6 +280,7 @@ static const struct of_device_id shmob_drm_of_table[] __maybe_unused = {
 static struct platform_driver shmob_drm_platform_driver = {
 	.probe		= shmob_drm_probe,
 	.remove_new	= shmob_drm_remove,
+	.shutdown	= shmob_drm_shutdown,
 	.driver		= {
 		.name	= "shmob-drm",
 		.of_match_table = of_match_ptr(shmob_drm_of_table),
From: Douglas Anderson <dianders@chromium.org>
[ Upstream commit c38896ca6318c2df20bbe6c8e3f633e071fda910 ]
Based on grepping through the source code, this driver appears to be missing a call to drm_atomic_helper_shutdown() at system shutdown time. Among other things, this means that a panel in use won't be cleanly powered off at system shutdown time.
The fact that we should call drm_atomic_helper_shutdown() in the case of OS shutdown/restart comes straight out of the kernel doc "driver instance overview" in drm_drv.c.
This driver uses the component model, and shutdown happens in the base driver. The "drvdata" for this driver will always be valid if shutdown() is called, and as of commit 2a073968289d ("drm/atomic-helper: drm_atomic_helper_shutdown(NULL) should be a noop") we don't need to confirm that "drm" is non-NULL.
Suggested-by: Maxime Ripard <mripard@kernel.org>
Reviewed-by: Maxime Ripard <mripard@kernel.org>
Reviewed-by: Fei Shao <fshao@chromium.org>
Tested-by: Fei Shao <fshao@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Maxime Ripard <mripard@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240611102744.v2.1.I2b014f90a...
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/gpu/drm/mediatek/mtk_drm_drv.c | 8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/drivers/gpu/drm/mediatek/mtk_drm_drv.c b/drivers/gpu/drm/mediatek/mtk_drm_drv.c
index 74832c2130921..0b570e194079a 100644
--- a/drivers/gpu/drm/mediatek/mtk_drm_drv.c
+++ b/drivers/gpu/drm/mediatek/mtk_drm_drv.c
@@ -950,6 +950,13 @@ static void mtk_drm_remove(struct platform_device *pdev)
 		of_node_put(private->comp_node[i]);
 }
 
+static void mtk_drm_shutdown(struct platform_device *pdev)
+{
+	struct mtk_drm_private *private = platform_get_drvdata(pdev);
+
+	drm_atomic_helper_shutdown(private->drm);
+}
+
 static int mtk_drm_sys_prepare(struct device *dev)
 {
 	struct mtk_drm_private *private = dev_get_drvdata(dev);
@@ -981,6 +988,7 @@ static const struct dev_pm_ops mtk_drm_pm_ops = {
 static struct platform_driver mtk_drm_platform_driver = {
 	.probe	= mtk_drm_probe,
 	.remove_new = mtk_drm_remove,
+	.shutdown = mtk_drm_shutdown,
 	.driver	= {
 		.name	= "mediatek-drm",
 		.pm     = &mtk_drm_pm_ops,
From: Chunguang Xu <chunguang.xu@shopee.com>
[ Upstream commit e5d574ab37f5f2e7937405613d9b1a724811e5ad ]
If a discard request needs to be retried, and the retry fails before a new special payload is added, a double free will result. Clear RQF_SPECIAL_PAYLOAD when the request is cleaned up.
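(Reconstructing the sequence: nvme_setup_discard() attaches the discard page as a special payload and sets RQF_SPECIAL_PAYLOAD; nvme_cleanup_cmd() frees that payload; if the command is then retried and fails before a fresh payload is set up, a second cleanup would still see the stale flag and free the page again.)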
Signed-off-by: Chunguang Xu <chunguang.xu@shopee.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/nvme/host/core.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d513fd27589df..36f30594b671f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -998,6 +998,7 @@ void nvme_cleanup_cmd(struct request *req)
 			clear_bit_unlock(0, &ctrl->discard_page_busy);
 		else
 			kfree(bvec_virt(&req->special_vec));
+		req->rq_flags &= ~RQF_SPECIAL_PAYLOAD;
 	}
 }
 EXPORT_SYMBOL_GPL(nvme_cleanup_cmd);
From: Daniel Wagner <dwagner@suse.de>
[ Upstream commit cd0c1b8e045a8d2785342b385cb2684d9b48e426 ]
The spec doesn't mandate that the first two double words (aka results) of the completion queue entry be set to 0 when they are not used (not specified). The target implementation does return 0 for TCP and FC, but not for RDMA.
Let's make RDMA behave the same by explicitly initializing the result field. This prevents leaking any data from the stack.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/nvme/target/core.c             | 1 +
 drivers/nvme/target/fabrics-cmd-auth.c | 3 ---
 drivers/nvme/target/fabrics-cmd.c      | 6 ------
 3 files changed, 1 insertion(+), 9 deletions(-)
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 2fde22323622e..29d324b97f8c3 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -948,6 +948,7 @@ bool nvmet_req_init(struct nvmet_req *req, struct nvmet_cq *cq,
 	req->metadata_sg_cnt = 0;
 	req->transfer_len = 0;
 	req->metadata_len = 0;
+	req->cqe->result.u64 = 0;
 	req->cqe->status = 0;
 	req->cqe->sq_head = 0;
 	req->ns = NULL;
diff --git a/drivers/nvme/target/fabrics-cmd-auth.c b/drivers/nvme/target/fabrics-cmd-auth.c
index eb7785be0ca77..ee76491e8b12c 100644
--- a/drivers/nvme/target/fabrics-cmd-auth.c
+++ b/drivers/nvme/target/fabrics-cmd-auth.c
@@ -332,7 +332,6 @@ void nvmet_execute_auth_send(struct nvmet_req *req)
 	pr_debug("%s: ctrl %d qid %d nvme status %x error loc %d\n",
 		 __func__, ctrl->cntlid, req->sq->qid,
 		 status, req->error_loc);
-	req->cqe->result.u64 = 0;
 	if (req->sq->dhchap_step != NVME_AUTH_DHCHAP_MESSAGE_SUCCESS2 &&
 	    req->sq->dhchap_step != NVME_AUTH_DHCHAP_MESSAGE_FAILURE2) {
 		unsigned long auth_expire_secs = ctrl->kato ? ctrl->kato : 120;
@@ -515,8 +514,6 @@ void nvmet_execute_auth_receive(struct nvmet_req *req)
 	status = nvmet_copy_to_sgl(req, 0, d, al);
 	kfree(d);
 done:
-	req->cqe->result.u64 = 0;
-
 	if (req->sq->dhchap_step == NVME_AUTH_DHCHAP_MESSAGE_SUCCESS2)
 		nvmet_auth_sq_free(req->sq);
 	else if (req->sq->dhchap_step == NVME_AUTH_DHCHAP_MESSAGE_FAILURE1) {
diff --git a/drivers/nvme/target/fabrics-cmd.c b/drivers/nvme/target/fabrics-cmd.c
index b23f4cf840bd5..f6714453b8bb3 100644
--- a/drivers/nvme/target/fabrics-cmd.c
+++ b/drivers/nvme/target/fabrics-cmd.c
@@ -226,9 +226,6 @@ static void nvmet_execute_admin_connect(struct nvmet_req *req)
 	if (status)
 		goto out;
 
-	/* zero out initial completion result, assign values as needed */
-	req->cqe->result.u32 = 0;
-
 	if (c->recfmt != 0) {
 		pr_warn("invalid connect version (%d).\n",
 			le16_to_cpu(c->recfmt));
@@ -304,9 +301,6 @@ static void nvmet_execute_io_connect(struct nvmet_req *req)
 	if (status)
 		goto out;
 
-	/* zero out initial completion result, assign values as needed */
-	req->cqe->result.u32 = 0;
-
 	if (c->recfmt != 0) {
 		pr_warn("invalid connect version (%d).\n",
 			le16_to_cpu(c->recfmt));
From: Alex Williamson <alex.williamson@redhat.com>
[ Upstream commit d71a989cf5d961989c273093cdff2550acdde314 ]
In order to improve performance of typical scenarios we can try to insert the entire vma on fault. This accelerates typical cases, such as when the MMIO region is DMA mapped by QEMU. The vfio_iommu_type1 driver will fault in the entire DMA mapped range through fixup_user_fault().
In synthetic testing, this improves the time required to walk a PCI BAR mapping from userspace by roughly 1/3rd.
This is likely an interim solution until vmf_insert_pfn_{pmd,pud}() gain support for pfnmaps.
Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/Zl6XdUkt%2FzMMGOLF@yzhao56-desk.sh.intel.com/
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20240607035213.2054226-1-alex.williamson@redhat.co...
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/vfio/pci/vfio_pci_core.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 2baf4dfac3f43..9eaf10a8f134b 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1639,6 +1639,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct vfio_pci_core_device *vdev = vma->vm_private_data;
 	unsigned long pfn, pgoff = vmf->pgoff - vma->vm_pgoff;
+	unsigned long addr = vma->vm_start;
 	vm_fault_t ret = VM_FAULT_SIGBUS;
 
 	pfn = vma_to_pfn(vma);
@@ -1646,11 +1647,25 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 	down_read(&vdev->memory_lock);
 
 	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
-		goto out_disabled;
+		goto out_unlock;
 
 	ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
+	if (ret & VM_FAULT_ERROR)
+		goto out_unlock;
 
-out_disabled:
+	/*
+	 * Pre-fault the remainder of the vma, abort further insertions and
+	 * supress error if fault is encountered during pre-fault.
+	 */
+	for (; addr < vma->vm_end; addr += PAGE_SIZE, pfn++) {
+		if (addr == vmf->address)
+			continue;
+
+		if (vmf_insert_pfn(vma, addr, pfn) & VM_FAULT_ERROR)
+			break;
+	}
+
+out_unlock:
 	up_read(&vdev->memory_lock);
 
 	return ret;
From: Cyril Hrubis <chrubis@suse.cz>
[ Upstream commit 5f75e081ab5cbfbe7aca2112a802e69576ee9778 ]
If fallocate is implemented but zero and discard operations are not supported by the filesystem the backing file is on, we keep filling dmesg with errors from blk_mq_end_request(), since each time we call fallocate() on the loop device the EOPNOTSUPP error from lo_fallocate() ends up propagated into the block layer. In the end the syscall succeeds anyway, since blkdev_issue_zeroout() falls back to writing zeroes, which makes the errors even more misleading and confusing.
How to reproduce:
1. make sure /tmp is mounted as tmpfs
2. dd if=/dev/zero of=/tmp/disk.img bs=1M count=100
3. losetup /dev/loop0 /tmp/disk.img
4. mkfs.ext2 /dev/loop0
5. dmesg | tail
[710690.898214] operation not supported error, dev loop0, sector 204672 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.898279] operation not supported error, dev loop0, sector 522 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.898603] operation not supported error, dev loop0, sector 16906 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.898917] operation not supported error, dev loop0, sector 32774 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.899218] operation not supported error, dev loop0, sector 49674 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.899484] operation not supported error, dev loop0, sector 65542 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.899743] operation not supported error, dev loop0, sector 82442 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.900015] operation not supported error, dev loop0, sector 98310 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.900276] operation not supported error, dev loop0, sector 115210 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
[710690.900546] operation not supported error, dev loop0, sector 131078 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0
This patch changes lo_fallocate() to clear the flags for zero and discard operations if we get EOPNOTSUPP from the backing file's fallocate callback; that way we at least stop spewing errors after the first unsuccessful try.
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240613163817.22640-1-chrubis@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/block/loop.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 28a95fd366fea..95a468eaa7013 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -302,6 +302,21 @@ static int lo_read_simple(struct loop_device *lo, struct request *rq,
 	return 0;
 }
 
+static void loop_clear_limits(struct loop_device *lo, int mode)
+{
+	struct queue_limits lim = queue_limits_start_update(lo->lo_queue);
+
+	if (mode & FALLOC_FL_ZERO_RANGE)
+		lim.max_write_zeroes_sectors = 0;
+
+	if (mode & FALLOC_FL_PUNCH_HOLE) {
+		lim.max_hw_discard_sectors = 0;
+		lim.discard_granularity = 0;
+	}
+
+	queue_limits_commit_update(lo->lo_queue, &lim);
+}
+
 static int lo_fallocate(struct loop_device *lo, struct request *rq, loff_t pos,
 			int mode)
 {
@@ -320,6 +335,14 @@ static int lo_fallocate(struct loop_device *lo, struct request *rq, loff_t pos,
 	ret = file->f_op->fallocate(file, mode, pos, blk_rq_bytes(rq));
 	if (unlikely(ret && ret != -EINVAL && ret != -EOPNOTSUPP))
 		return -EIO;
+
+	/*
+	 * We initially configure the limits in a hope that fallocate is
+	 * supported and clear them here if that turns out not to be true.
+	 */
+	if (unlikely(ret == -EOPNOTSUPP))
+		loop_clear_limits(lo, mode);
+
 	return ret;
 }