September 2024 - Linux-stable-mirror

Re: Patch "userfaultfd: fix checks for huge PMDs" has been added to the 6.1-stable tree

by Jann Horn

On Mon, Sep 9, 2024 at 2:48 PM Sasha Levin <sashal(a)kernel.org> wrote: > This is a note to let you know that I've just added the patch titled > > userfaultfd: fix checks for huge PMDs > > to the 6.1-stable tree which can be found at: > http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum… Thanks for the backport! Are you also doing the backport for older trees, or should someone else take care of that?

1 year, 3 months

2
1
0 0

[tip: core/debugobjects] debugobjects: Fix conditions in fill_pool()

by tip-bot2 for Zhen Lei

The following commit has been merged into the core/debugobjects branch of tip: Commit-ID: 684d28feb8546d1e9597aa363c3bfcf52fe250b7 Gitweb: https://git.kernel.org/tip/684d28feb8546d1e9597aa363c3bfcf52fe250b7 Author: Zhen Lei <thunder.leizhen(a)huawei.com> AuthorDate: Wed, 04 Sep 2024 21:39:40 +08:00 Committer: Thomas Gleixner <tglx(a)linutronix.de> CommitterDate: Mon, 09 Sep 2024 16:40:25 +02:00 debugobjects: Fix conditions in fill_pool() fill_pool() uses 'obj_pool_min_free' to decide whether objects should be handed back to the kmem cache. But 'obj_pool_min_free' records the lowest historical value of the number of objects in the object pool and not the minimum number of objects which should be kept in the pool. Use 'debug_objects_pool_min_level' instead, which holds the minimum number which was scaled to the number of CPUs at boot time. [ tglx: Massage change log ] Fixes: d26bf5056fc0 ("debugobjects: Reduce number of pool_lock acquisitions in fill_pool()") Fixes: 36c4ead6f6df ("debugobjects: Add global free list and the counter") Signed-off-by: Zhen Lei <thunder.leizhen(a)huawei.com> Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de> Cc: stable(a)vger.kernel.org Link: https://lore.kernel.org/all/20240904133944.2124-3-thunder.leizhen@huawei.com --- lib/debugobjects.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/lib/debugobjects.c b/lib/debugobjects.c index 7226fdb..6329a86 100644 --- a/lib/debugobjects.c +++ b/lib/debugobjects.c @@ -142,13 +142,14 @@ static void fill_pool(void) * READ_ONCE()s pair with the WRITE_ONCE()s in pool_lock critical * sections. */ - while (READ_ONCE(obj_nr_tofree) && (READ_ONCE(obj_pool_free) < obj_pool_min_free)) { + while (READ_ONCE(obj_nr_tofree) && + READ_ONCE(obj_pool_free) < debug_objects_pool_min_level) { raw_spin_lock_irqsave(&pool_lock, flags); /* * Recheck with the lock held as the worker thread might have * won the race and freed the global free list already. */ - while (obj_nr_tofree && (obj_pool_free < obj_pool_min_free)) { + while (obj_nr_tofree && (obj_pool_free < debug_objects_pool_min_level)) { obj = hlist_entry(obj_to_free.first, typeof(*obj), node); hlist_del(&obj->node); WRITE_ONCE(obj_nr_tofree, obj_nr_tofree - 1);

1 year, 3 months

1
0
0 0

[PATCH] soundwire: stream: fix programming slave ports for non-continous port maps

by Krzysztof Kozlowski

Two bitmasks in 'struct sdw_slave_prop' - 'source_ports' and 'sink_ports' - define which ports to program in sdw_program_slave_port_params(). The masks are used to get the appropriate data port properties ('struct sdw_get_slave_dpn_prop') from an array. Bitmasks can be non-continuous or can start from index different than 0, thus when looking for matching port property for given port, we must iterate over mask bits, not from 0 up to number of ports. This fixes allocation and programming slave ports, when a source or sink masks start from further index. Fixes: f8101c74aa54 ("soundwire: Add Master and Slave port programming") Cc: <stable(a)vger.kernel.org> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski(a)linaro.org> --- drivers/soundwire/stream.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/soundwire/stream.c b/drivers/soundwire/stream.c index 7aa4900dcf31..f275143d7b18 100644 --- a/drivers/soundwire/stream.c +++ b/drivers/soundwire/stream.c @@ -1291,18 +1291,18 @@ struct sdw_dpn_prop *sdw_get_slave_dpn_prop(struct sdw_slave *slave, unsigned int port_num) { struct sdw_dpn_prop *dpn_prop; - u8 num_ports; + unsigned long mask; int i; if (direction == SDW_DATA_DIR_TX) { - num_ports = hweight32(slave->prop.source_ports); + mask = slave->prop.source_ports; dpn_prop = slave->prop.src_dpn_prop; } else { - num_ports = hweight32(slave->prop.sink_ports); + mask = slave->prop.sink_ports; dpn_prop = slave->prop.sink_dpn_prop; } - for (i = 0; i < num_ports; i++) { + for_each_set_bit(i, &mask, 32) { if (dpn_prop[i].num == port_num) return &dpn_prop[i]; } -- 2.43.0

1 year, 3 months

6
15
0 0

[PATCH v2] ice: Fix improper handling of refcount in ice_sriov_set_msix_vec_count()

by Gui-Dong Han

This patch addresses an issue with improper reference count handling in the ice_sriov_set_msix_vec_count() function. First, the function calls ice_get_vf_by_id(), which increments the reference count of the vf pointer. If the subsequent call to ice_get_vf_vsi() fails, the function currently returns an error without decrementing the reference count of the vf pointer, leading to a reference count leak. The correct behavior, as implemented in this patch, is to decrement the reference count using ice_put_vf(vf) before returning an error when vsi is NULL. Second, the function calls ice_sriov_get_irqs(), which sets vf->first_vector_idx. If this call returns a negative value, indicating an error, the function returns an error without decrementing the reference count of the vf pointer, resulting in another reference count leak. The patch addresses this by adding a call to ice_put_vf(vf) before returning an error when vf->first_vector_idx < 0. This bug was identified by an experimental static analysis tool developed by our team. The tool specializes in analyzing reference count operations and identifying potential mismanagement of reference counts. In this case, the tool flagged the missing decrement operation as a potential issue, leading to this patch. Fixes: 4035c72dc1ba ("ice: reconfig host after changing MSI-X on VF") Fixes: 4d38cb44bd32 ("ice: manage VFs MSI-X using resource tracking") Cc: stable(a)vger.kernel.org Signed-off-by: Gui-Dong Han <hanguidong02(a)outlook.com> --- v2: * In this patch v2, an additional resource leak was addressed when vf->first_vector_idx < 0. The issue is now fixed by adding ice_put_vf(vf) before returning an error. Thanks to Simon Horman for identifying this additional leak scenario. --- drivers/net/ethernet/intel/ice/ice_sriov.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_sriov.c b/drivers/net/ethernet/intel/ice/ice_sriov.c index 55ef33208456..fbf18ac97875 100644 --- a/drivers/net/ethernet/intel/ice/ice_sriov.c +++ b/drivers/net/ethernet/intel/ice/ice_sriov.c @@ -1096,8 +1096,10 @@ int ice_sriov_set_msix_vec_count(struct pci_dev *vf_dev, int msix_vec_count) return -ENOENT; vsi = ice_get_vf_vsi(vf); - if (!vsi) + if (!vsi) { + ice_put_vf(vf); return -ENOENT; + } prev_msix = vf->num_msix; prev_queues = vf->num_vf_qs; @@ -1142,8 +1144,10 @@ int ice_sriov_set_msix_vec_count(struct pci_dev *vf_dev, int msix_vec_count) vf->num_msix = prev_msix; vf->num_vf_qs = prev_queues; vf->first_vector_idx = ice_sriov_get_irqs(pf, vf->num_msix); - if (vf->first_vector_idx < 0) + if (vf->first_vector_idx < 0) { + ice_put_vf(vf); return -EINVAL; + } if (needs_rebuild) { ice_vf_reconfig_vsi(vf); -- 2.25.1

1 year, 3 months

5
5
0 0

[RFC 1/4] drm/sched: Add locking to drm_sched_entity_modify_sched

by Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com> Without the locking amdgpu currently can race amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(), leading to the latter accesing potentially inconsitent entity->sched_list and entity->num_sched_list pair. The comment on drm_sched_entity_modify_sched() however says: """ * Note that this must be called under the same common lock for @entity as * drm_sched_job_arm() and drm_sched_entity_push_job(), or the driver needs to * guarantee through some other means that this is never called while new jobs * can be pushed to @entity. """ It is unclear if that is referring to this race or something else. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com> Fixes: b37aced31eb0 ("drm/scheduler: implement a function to modify sched list") Cc: Christian König <christian.koenig(a)amd.com> Cc: Alex Deucher <alexander.deucher(a)amd.com> Cc: Luben Tuikov <ltuikov89(a)gmail.com> Cc: Matthew Brost <matthew.brost(a)intel.com> Cc: David Airlie <airlied(a)gmail.com> Cc: Daniel Vetter <daniel(a)ffwll.ch> Cc: dri-devel(a)lists.freedesktop.org Cc: <stable(a)vger.kernel.org> # v5.7+ --- drivers/gpu/drm/scheduler/sched_entity.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c index 58c8161289fe..ae8be30472cd 100644 --- a/drivers/gpu/drm/scheduler/sched_entity.c +++ b/drivers/gpu/drm/scheduler/sched_entity.c @@ -133,8 +133,10 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, { WARN_ON(!num_sched_list || !sched_list); + spin_lock(&entity->rq_lock); entity->sched_list = sched_list; entity->num_sched_list = num_sched_list; + spin_unlock(&entity->rq_lock); } EXPORT_SYMBOL(drm_sched_entity_modify_sched); -- 2.46.0

1 year, 3 months

4
9
0 0

FAILED: patch "[PATCH] memcg: protect concurrent access to mem_cgroup_idr" failed to apply to 5.15-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 5.15-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to <stable(a)vger.kernel.org>. To reproduce the conflict and resubmit, you may use the following commands: git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y git checkout FETCH_HEAD git cherry-pick -x 9972605a238339b85bd16b084eed5f18414d22db # <resolve conflicts, build, test, etc.> git commit -s git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024081211-owl-snowdrop-d2aa@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^.. Possible dependencies: 9972605a2383 ("memcg: protect concurrent access to mem_cgroup_idr") 6f0df8e16eb5 ("memcontrol: ensure memcg acquired by id is properly set up") e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") 7348cc91821b ("mm: multi-gen LRU: remove aging fairness safeguard") a579086c99ed ("mm: multi-gen LRU: remove eviction fairness safeguard") adb8213014b2 ("mm: memcg: fix stale protection of reclaim target memcg") 57e9cc50f4dd ("mm: vmscan: split khugepaged stats from direct reclaim stats") e4fea72b1438 ("mglru: mm/vmscan.c: fix imprecise comments") d396def5d86d ("memcg: rearrange code") 410f8e82689e ("memcg: extract memcg_vmstats from struct mem_cgroup") d6c3af7d8a2b ("mm: multi-gen LRU: debugfs interface") 1332a809d95a ("mm: multi-gen LRU: thrashing prevention") 354ed5974429 ("mm: multi-gen LRU: kill switch") f76c83378851 ("mm: multi-gen LRU: optimize multiple memcgs") bd74fdaea146 ("mm: multi-gen LRU: support page table walks") 018ee47f1489 ("mm: multi-gen LRU: exploit locality in rmap") ac35a4902374 ("mm: multi-gen LRU: minimal implementation") ec1c86b25f4b ("mm: multi-gen LRU: groundwork") f1e1a7be4718 ("mm/vmscan.c: refactor shrink_node()") d3629af59f41 ("mm/vmscan: make the annotations of refaults code at the right place") thanks, greg k-h ------------------ original commit in Linus's tree ------------------ From 9972605a238339b85bd16b084eed5f18414d22db Mon Sep 17 00:00:00 2001 From: Shakeel Butt <shakeel.butt(a)linux.dev> Date: Fri, 2 Aug 2024 16:58:22 -0700 Subject: [PATCH] memcg: protect concurrent access to mem_cgroup_idr Commit 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs") decoupled the memcg IDs from the CSS ID space to fix the cgroup creation failures. It introduced IDR to maintain the memcg ID space. The IDR depends on external synchronization mechanisms for modifications. For the mem_cgroup_idr, the idr_alloc() and idr_replace() happen within css callback and thus are protected through cgroup_mutex from concurrent modifications. However idr_remove() for mem_cgroup_idr was not protected against concurrency and can be run concurrently for different memcgs when they hit their refcnt to zero. Fix that. We have been seeing list_lru based kernel crashes at a low frequency in our fleet for a long time. These crashes were in different part of list_lru code including list_lru_add(), list_lru_del() and reparenting code. Upon further inspection, it looked like for a given object (dentry and inode), the super_block's list_lru didn't have list_lru_one for the memcg of that object. The initial suspicions were either the object is not allocated through kmem_cache_alloc_lru() or somehow memcg_list_lru_alloc() failed to allocate list_lru_one() for a memcg but returned success. No evidence were found for these cases. Looking more deeply, we started seeing situations where valid memcg's id is not present in mem_cgroup_idr and in some cases multiple valid memcgs have same id and mem_cgroup_idr is pointing to one of them. So, the most reasonable explanation is that these situations can happen due to race between multiple idr_remove() calls or race between idr_alloc()/idr_replace() and idr_remove(). These races are causing multiple memcgs to acquire the same ID and then offlining of one of them would cleanup list_lrus on the system for all of them. Later access from other memcgs to the list_lru cause crashes due to missing list_lru_one. Link: https://lkml.kernel.org/r/20240802235822.1830976-1-shakeel.butt@linux.dev Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs") Signed-off-by: Shakeel Butt <shakeel.butt(a)linux.dev> Acked-by: Muchun Song <muchun.song(a)linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin(a)linux.dev> Acked-by: Johannes Weiner <hannes(a)cmpxchg.org> Cc: Michal Hocko <mhocko(a)suse.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 960371788687..f29157288b7d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3386,11 +3386,28 @@ static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg) #define MEM_CGROUP_ID_MAX ((1UL << MEM_CGROUP_ID_SHIFT) - 1) static DEFINE_IDR(mem_cgroup_idr); +static DEFINE_SPINLOCK(memcg_idr_lock); + +static int mem_cgroup_alloc_id(void) +{ + int ret; + + idr_preload(GFP_KERNEL); + spin_lock(&memcg_idr_lock); + ret = idr_alloc(&mem_cgroup_idr, NULL, 1, MEM_CGROUP_ID_MAX + 1, + GFP_NOWAIT); + spin_unlock(&memcg_idr_lock); + idr_preload_end(); + return ret; +} static void mem_cgroup_id_remove(struct mem_cgroup *memcg) { if (memcg->id.id > 0) { + spin_lock(&memcg_idr_lock); idr_remove(&mem_cgroup_idr, memcg->id.id); + spin_unlock(&memcg_idr_lock); + memcg->id.id = 0; } } @@ -3524,8 +3541,7 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent) if (!memcg) return ERR_PTR(error); - memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL, - 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL); + memcg->id.id = mem_cgroup_alloc_id(); if (memcg->id.id < 0) { error = memcg->id.id; goto fail; @@ -3667,7 +3683,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css) * publish it here at the end of onlining. This matches the * regular ID destruction during offlining. */ + spin_lock(&memcg_idr_lock); idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); + spin_unlock(&memcg_idr_lock); return 0; offline_kmem:

1 year, 3 months

2
1
0 0

FAILED: patch "[PATCH] memcg: protect concurrent access to mem_cgroup_idr" failed to apply to 6.1-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 6.1-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to <stable(a)vger.kernel.org>. To reproduce the conflict and resubmit, you may use the following commands: git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y git checkout FETCH_HEAD git cherry-pick -x 9972605a238339b85bd16b084eed5f18414d22db # <resolve conflicts, build, test, etc.> git commit -s git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024081259-plow-freezing-a93e@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^.. Possible dependencies: 9972605a2383 ("memcg: protect concurrent access to mem_cgroup_idr") 6f0df8e16eb5 ("memcontrol: ensure memcg acquired by id is properly set up") e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") 7348cc91821b ("mm: multi-gen LRU: remove aging fairness safeguard") a579086c99ed ("mm: multi-gen LRU: remove eviction fairness safeguard") adb8213014b2 ("mm: memcg: fix stale protection of reclaim target memcg") 57e9cc50f4dd ("mm: vmscan: split khugepaged stats from direct reclaim stats") thanks, greg k-h ------------------ original commit in Linus's tree ------------------ From 9972605a238339b85bd16b084eed5f18414d22db Mon Sep 17 00:00:00 2001 From: Shakeel Butt <shakeel.butt(a)linux.dev> Date: Fri, 2 Aug 2024 16:58:22 -0700 Subject: [PATCH] memcg: protect concurrent access to mem_cgroup_idr Commit 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs") decoupled the memcg IDs from the CSS ID space to fix the cgroup creation failures. It introduced IDR to maintain the memcg ID space. The IDR depends on external synchronization mechanisms for modifications. For the mem_cgroup_idr, the idr_alloc() and idr_replace() happen within css callback and thus are protected through cgroup_mutex from concurrent modifications. However idr_remove() for mem_cgroup_idr was not protected against concurrency and can be run concurrently for different memcgs when they hit their refcnt to zero. Fix that. We have been seeing list_lru based kernel crashes at a low frequency in our fleet for a long time. These crashes were in different part of list_lru code including list_lru_add(), list_lru_del() and reparenting code. Upon further inspection, it looked like for a given object (dentry and inode), the super_block's list_lru didn't have list_lru_one for the memcg of that object. The initial suspicions were either the object is not allocated through kmem_cache_alloc_lru() or somehow memcg_list_lru_alloc() failed to allocate list_lru_one() for a memcg but returned success. No evidence were found for these cases. Looking more deeply, we started seeing situations where valid memcg's id is not present in mem_cgroup_idr and in some cases multiple valid memcgs have same id and mem_cgroup_idr is pointing to one of them. So, the most reasonable explanation is that these situations can happen due to race between multiple idr_remove() calls or race between idr_alloc()/idr_replace() and idr_remove(). These races are causing multiple memcgs to acquire the same ID and then offlining of one of them would cleanup list_lrus on the system for all of them. Later access from other memcgs to the list_lru cause crashes due to missing list_lru_one. Link: https://lkml.kernel.org/r/20240802235822.1830976-1-shakeel.butt@linux.dev Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs") Signed-off-by: Shakeel Butt <shakeel.butt(a)linux.dev> Acked-by: Muchun Song <muchun.song(a)linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin(a)linux.dev> Acked-by: Johannes Weiner <hannes(a)cmpxchg.org> Cc: Michal Hocko <mhocko(a)suse.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 960371788687..f29157288b7d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3386,11 +3386,28 @@ static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg) #define MEM_CGROUP_ID_MAX ((1UL << MEM_CGROUP_ID_SHIFT) - 1) static DEFINE_IDR(mem_cgroup_idr); +static DEFINE_SPINLOCK(memcg_idr_lock); + +static int mem_cgroup_alloc_id(void) +{ + int ret; + + idr_preload(GFP_KERNEL); + spin_lock(&memcg_idr_lock); + ret = idr_alloc(&mem_cgroup_idr, NULL, 1, MEM_CGROUP_ID_MAX + 1, + GFP_NOWAIT); + spin_unlock(&memcg_idr_lock); + idr_preload_end(); + return ret; +} static void mem_cgroup_id_remove(struct mem_cgroup *memcg) { if (memcg->id.id > 0) { + spin_lock(&memcg_idr_lock); idr_remove(&mem_cgroup_idr, memcg->id.id); + spin_unlock(&memcg_idr_lock); + memcg->id.id = 0; } } @@ -3524,8 +3541,7 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent) if (!memcg) return ERR_PTR(error); - memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL, - 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL); + memcg->id.id = mem_cgroup_alloc_id(); if (memcg->id.id < 0) { error = memcg->id.id; goto fail; @@ -3667,7 +3683,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css) * publish it here at the end of onlining. This matches the * regular ID destruction during offlining. */ + spin_lock(&memcg_idr_lock); idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); + spin_unlock(&memcg_idr_lock); return 0; offline_kmem:

1 year, 3 months

2
1
0 0

[PATCH for 6.1 stable] btrfs: fix race between direct IO write and fsync when using same fd

by fdmanana＠kernel.org

From: Filipe Manana <fdmanana(a)suse.com> commit cd9253c23aedd61eb5ff11f37a36247cd46faf86 upstream. If we have 2 threads that are using the same file descriptor and one of them is doing direct IO writes while the other is doing fsync, we have a race where we can end up either: 1) Attempt a fsync without holding the inode's lock, triggering an assertion failures when assertions are enabled; 2) Do an invalid memory access from the fsync task because the file private points to memory allocated on stack by the direct IO task and it may be used by the fsync task after the stack was destroyed. The race happens like this: 1) A user space program opens a file descriptor with O_DIRECT; 2) The program spawns 2 threads using libpthread for example; 3) One of the threads uses the file descriptor to do direct IO writes, while the other calls fsync using the same file descriptor. 4) Call task A the thread doing direct IO writes and task B the thread doing fsyncs; 5) Task A does a direct IO write, and at btrfs_direct_write() sets the file's private to an on stack allocated private with the member 'fsync_skip_inode_lock' set to true; 6) Task B enters btrfs_sync_file() and sees that there's a private structure associated to the file which has 'fsync_skip_inode_lock' set to true, so it skips locking the inode's vfs lock; 7) Task A completes the direct IO write, and resets the file's private to NULL since it had no prior private and our private was stack allocated. Then it unlocks the inode's vfs lock; 8) Task B enters btrfs_get_ordered_extents_for_logging(), then the assertion that checks the inode's vfs lock is held fails, since task B never locked it and task A has already unlocked it. The stack trace produced is the following: Aug 21 11:46:43 kerberos kernel: assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983 Aug 21 11:46:43 kerberos kernel: ------------[ cut here ]------------ Aug 21 11:46:43 kerberos kernel: kernel BUG at fs/btrfs/ordered-data.c:983! Aug 21 11:46:43 kerberos kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI Aug 21 11:46:43 kerberos kernel: CPU: 9 PID: 5072 Comm: worker Tainted: G U OE 6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8 Aug 21 11:46:43 kerberos kernel: Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020 Aug 21 11:46:43 kerberos kernel: RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs] Aug 21 11:46:43 kerberos kernel: Code: 50 d6 86 c0 e8 (...) Aug 21 11:46:43 kerberos kernel: RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246 Aug 21 11:46:43 kerberos kernel: RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000 Aug 21 11:46:43 kerberos kernel: RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800 Aug 21 11:46:43 kerberos kernel: RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38 Aug 21 11:46:43 kerberos kernel: R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800 Aug 21 11:46:43 kerberos kernel: R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000 Aug 21 11:46:43 kerberos kernel: FS: 00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000 Aug 21 11:46:43 kerberos kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Aug 21 11:46:43 kerberos kernel: CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0 Aug 21 11:46:43 kerberos kernel: Call Trace: Aug 21 11:46:43 kerberos kernel: <TASK> Aug 21 11:46:43 kerberos kernel: ? __die_body.cold+0x14/0x24 Aug 21 11:46:43 kerberos kernel: ? die+0x2e/0x50 Aug 21 11:46:43 kerberos kernel: ? do_trap+0xca/0x110 Aug 21 11:46:43 kerberos kernel: ? do_error_trap+0x6a/0x90 Aug 21 11:46:43 kerberos kernel: ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] Aug 21 11:46:43 kerberos kernel: ? exc_invalid_op+0x50/0x70 Aug 21 11:46:43 kerberos kernel: ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] Aug 21 11:46:43 kerberos kernel: ? asm_exc_invalid_op+0x1a/0x20 Aug 21 11:46:43 kerberos kernel: ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] Aug 21 11:46:43 kerberos kernel: ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] Aug 21 11:46:43 kerberos kernel: btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] Aug 21 11:46:43 kerberos kernel: ? __seccomp_filter+0x31d/0x4f0 Aug 21 11:46:43 kerberos kernel: __x64_sys_fdatasync+0x4f/0x90 Aug 21 11:46:43 kerberos kernel: do_syscall_64+0x82/0x160 Aug 21 11:46:43 kerberos kernel: ? do_futex+0xcb/0x190 Aug 21 11:46:43 kerberos kernel: ? __x64_sys_futex+0x10e/0x1d0 Aug 21 11:46:43 kerberos kernel: ? switch_fpu_return+0x4f/0xd0 Aug 21 11:46:43 kerberos kernel: ? syscall_exit_to_user_mode+0x72/0x220 Aug 21 11:46:43 kerberos kernel: ? do_syscall_64+0x8e/0x160 Aug 21 11:46:43 kerberos kernel: ? syscall_exit_to_user_mode+0x72/0x220 Aug 21 11:46:43 kerberos kernel: ? do_syscall_64+0x8e/0x160 Aug 21 11:46:43 kerberos kernel: ? syscall_exit_to_user_mode+0x72/0x220 Aug 21 11:46:43 kerberos kernel: ? do_syscall_64+0x8e/0x160 Aug 21 11:46:43 kerberos kernel: ? syscall_exit_to_user_mode+0x72/0x220 Aug 21 11:46:43 kerberos kernel: ? do_syscall_64+0x8e/0x160 Aug 21 11:46:43 kerberos kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e Another problem here is if task B grabs the private pointer and then uses it after task A has finished, since the private was allocated in the stack of trask A, it results in some invalid memory access with a hard to predict result. This issue, triggering the assertion, was observed with QEMU workloads by two users in the Link tags below. Fix this by not relying on a file's private to pass information to fsync that it should skip locking the inode and instead pass this information through a special value stored in current->journal_info. This is safe because in the relevant section of the direct IO write path we are not holding a transaction handle, so current->journal_info is NULL. The following C program triggers the issue: $ cat repro.c /* Get the O_DIRECT definition. */ #ifndef _GNU_SOURCE #define _GNU_SOURCE #endif #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <stdint.h> #include <fcntl.h> #include <errno.h> #include <string.h> #include <pthread.h> static int fd; static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset) { while (count > 0) { ssize_t ret; ret = pwrite(fd, buf, count, offset); if (ret < 0) { if (errno == EINTR) continue; return ret; } count -= ret; buf += ret; } return 0; } static void *fsync_loop(void *arg) { while (1) { int ret; ret = fsync(fd); if (ret != 0) { perror("Fsync failed"); exit(6); } } } int main(int argc, char *argv[]) { long pagesize; void *write_buf; pthread_t fsyncer; int ret; if (argc != 2) { fprintf(stderr, "Use: %s <file path>\n", argv[0]); return 1; } fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666); if (fd == -1) { perror("Failed to open/create file"); return 1; } pagesize = sysconf(_SC_PAGE_SIZE); if (pagesize == -1) { perror("Failed to get page size"); return 2; } ret = posix_memalign(&write_buf, pagesize, pagesize); if (ret) { perror("Failed to allocate buffer"); return 3; } ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL); if (ret != 0) { fprintf(stderr, "Failed to create writer thread: %d\n", ret); return 4; } while (1) { ret = do_write(fd, write_buf, pagesize, 0); if (ret != 0) { perror("Write failed"); exit(5); } } return 0; } $ mkfs.btrfs -f /dev/sdi $ mount /dev/sdi /mnt/sdi $ timeout 10 ./repro /mnt/sdi/foo Usually the race is triggered within less than 1 second. A test case for fstests will follow soon. Reported-by: Paulo Dias <paulo.miguel.dias(a)gmail.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187 Reported-by: Andreas Jahn <jahn-andi(a)web.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199 Reported-by: syzbot+4704b3cc972bd76024f1(a)syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/ Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write") CC: stable(a)vger.kernel.org # 5.15+ Reviewed-by: Josef Bacik <josef(a)toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana(a)suse.com> Reviewed-by: David Sterba <dsterba(a)suse.com> Signed-off-by: David Sterba <dsterba(a)suse.com> --- fs/btrfs/ctree.h | 1 - fs/btrfs/file.c | 25 ++++++++++--------------- fs/btrfs/transaction.h | 6 ++++++ 3 files changed, 16 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 853b1f96b1fd..cca1acf2e037 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1553,7 +1553,6 @@ struct btrfs_drop_extents_args { struct btrfs_file_private { void *filldir_buf; u64 last_index; - bool fsync_skip_inode_lock; }; diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index e23d178f9778..c8231677c79e 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1534,13 +1534,6 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from) if (IS_ERR_OR_NULL(dio)) { err = PTR_ERR_OR_ZERO(dio); } else { - struct btrfs_file_private stack_private = { 0 }; - struct btrfs_file_private *private; - const bool have_private = (file->private_data != NULL); - - if (!have_private) - file->private_data = &stack_private; - /* * If we have a synchoronous write, we must make sure the fsync * triggered by the iomap_dio_complete() call below doesn't @@ -1549,13 +1542,10 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from) * partial writes due to the input buffer (or parts of it) not * being already faulted in. */ - private = file->private_data; - private->fsync_skip_inode_lock = true; + ASSERT(current->journal_info == NULL); + current->journal_info = BTRFS_TRANS_DIO_WRITE_STUB; err = iomap_dio_complete(dio); - private->fsync_skip_inode_lock = false; - - if (!have_private) - file->private_data = NULL; + current->journal_info = NULL; } /* No increment (+=) because iomap returns a cumulative value. */ @@ -1795,7 +1785,6 @@ static inline bool skip_inode_logging(const struct btrfs_log_ctx *ctx) */ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) { - struct btrfs_file_private *private = file->private_data; struct dentry *dentry = file_dentry(file); struct inode *inode = d_inode(dentry); struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); @@ -1805,7 +1794,13 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) int ret = 0, err; u64 len; bool full_sync; - const bool skip_ilock = (private ? private->fsync_skip_inode_lock : false); + bool skip_ilock = false; + + if (current->journal_info == BTRFS_TRANS_DIO_WRITE_STUB) { + skip_ilock = true; + current->journal_info = NULL; + lockdep_assert_held(&inode->i_rwsem); + } trace_btrfs_sync_file(file, datasync); diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index 970ff316069d..8b88446df36d 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -11,6 +11,12 @@ #include "delayed-ref.h" #include "ctree.h" +/* + * Signal that a direct IO write is in progress, to avoid deadlock for sync + * direct IO writes when fsync is called during the direct IO write path. + */ +#define BTRFS_TRANS_DIO_WRITE_STUB ((void *) 1) + enum btrfs_trans_state { TRANS_STATE_RUNNING, TRANS_STATE_COMMIT_START, -- 2.43.0

1 year, 3 months

1
0
0 0

[PATCH for 6.6 stable] btrfs: fix race between direct IO write and fsync when using same fd

by fdmanana＠kernel.org

From: Filipe Manana <fdmanana(a)suse.com> commit cd9253c23aedd61eb5ff11f37a36247cd46faf86 upstream. If we have 2 threads that are using the same file descriptor and one of them is doing direct IO writes while the other is doing fsync, we have a race where we can end up either: 1) Attempt a fsync without holding the inode's lock, triggering an assertion failures when assertions are enabled; 2) Do an invalid memory access from the fsync task because the file private points to memory allocated on stack by the direct IO task and it may be used by the fsync task after the stack was destroyed. The race happens like this: 1) A user space program opens a file descriptor with O_DIRECT; 2) The program spawns 2 threads using libpthread for example; 3) One of the threads uses the file descriptor to do direct IO writes, while the other calls fsync using the same file descriptor. 4) Call task A the thread doing direct IO writes and task B the thread doing fsyncs; 5) Task A does a direct IO write, and at btrfs_direct_write() sets the file's private to an on stack allocated private with the member 'fsync_skip_inode_lock' set to true; 6) Task B enters btrfs_sync_file() and sees that there's a private structure associated to the file which has 'fsync_skip_inode_lock' set to true, so it skips locking the inode's vfs lock; 7) Task A completes the direct IO write, and resets the file's private to NULL since it had no prior private and our private was stack allocated. Then it unlocks the inode's vfs lock; 8) Task B enters btrfs_get_ordered_extents_for_logging(), then the assertion that checks the inode's vfs lock is held fails, since task B never locked it and task A has already unlocked it. The stack trace produced is the following: Aug 21 11:46:43 kerberos kernel: assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983 Aug 21 11:46:43 kerberos kernel: ------------[ cut here ]------------ Aug 21 11:46:43 kerberos kernel: kernel BUG at fs/btrfs/ordered-data.c:983! Aug 21 11:46:43 kerberos kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI Aug 21 11:46:43 kerberos kernel: CPU: 9 PID: 5072 Comm: worker Tainted: G U OE 6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8 Aug 21 11:46:43 kerberos kernel: Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020 Aug 21 11:46:43 kerberos kernel: RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs] Aug 21 11:46:43 kerberos kernel: Code: 50 d6 86 c0 e8 (...) Aug 21 11:46:43 kerberos kernel: RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246 Aug 21 11:46:43 kerberos kernel: RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000 Aug 21 11:46:43 kerberos kernel: RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800 Aug 21 11:46:43 kerberos kernel: RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38 Aug 21 11:46:43 kerberos kernel: R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800 Aug 21 11:46:43 kerberos kernel: R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000 Aug 21 11:46:43 kerberos kernel: FS: 00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000 Aug 21 11:46:43 kerberos kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Aug 21 11:46:43 kerberos kernel: CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0 Aug 21 11:46:43 kerberos kernel: Call Trace: Aug 21 11:46:43 kerberos kernel: <TASK> Aug 21 11:46:43 kerberos kernel: ? __die_body.cold+0x14/0x24 Aug 21 11:46:43 kerberos kernel: ? die+0x2e/0x50 Aug 21 11:46:43 kerberos kernel: ? do_trap+0xca/0x110 Aug 21 11:46:43 kerberos kernel: ? do_error_trap+0x6a/0x90 Aug 21 11:46:43 kerberos kernel: ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] Aug 21 11:46:43 kerberos kernel: ? exc_invalid_op+0x50/0x70 Aug 21 11:46:43 kerberos kernel: ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] Aug 21 11:46:43 kerberos kernel: ? asm_exc_invalid_op+0x1a/0x20 Aug 21 11:46:43 kerberos kernel: ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] Aug 21 11:46:43 kerberos kernel: ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] Aug 21 11:46:43 kerberos kernel: btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] Aug 21 11:46:43 kerberos kernel: ? __seccomp_filter+0x31d/0x4f0 Aug 21 11:46:43 kerberos kernel: __x64_sys_fdatasync+0x4f/0x90 Aug 21 11:46:43 kerberos kernel: do_syscall_64+0x82/0x160 Aug 21 11:46:43 kerberos kernel: ? do_futex+0xcb/0x190 Aug 21 11:46:43 kerberos kernel: ? __x64_sys_futex+0x10e/0x1d0 Aug 21 11:46:43 kerberos kernel: ? switch_fpu_return+0x4f/0xd0 Aug 21 11:46:43 kerberos kernel: ? syscall_exit_to_user_mode+0x72/0x220 Aug 21 11:46:43 kerberos kernel: ? do_syscall_64+0x8e/0x160 Aug 21 11:46:43 kerberos kernel: ? syscall_exit_to_user_mode+0x72/0x220 Aug 21 11:46:43 kerberos kernel: ? do_syscall_64+0x8e/0x160 Aug 21 11:46:43 kerberos kernel: ? syscall_exit_to_user_mode+0x72/0x220 Aug 21 11:46:43 kerberos kernel: ? do_syscall_64+0x8e/0x160 Aug 21 11:46:43 kerberos kernel: ? syscall_exit_to_user_mode+0x72/0x220 Aug 21 11:46:43 kerberos kernel: ? do_syscall_64+0x8e/0x160 Aug 21 11:46:43 kerberos kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e Another problem here is if task B grabs the private pointer and then uses it after task A has finished, since the private was allocated in the stack of trask A, it results in some invalid memory access with a hard to predict result. This issue, triggering the assertion, was observed with QEMU workloads by two users in the Link tags below. Fix this by not relying on a file's private to pass information to fsync that it should skip locking the inode and instead pass this information through a special value stored in current->journal_info. This is safe because in the relevant section of the direct IO write path we are not holding a transaction handle, so current->journal_info is NULL. The following C program triggers the issue: $ cat repro.c /* Get the O_DIRECT definition. */ #ifndef _GNU_SOURCE #define _GNU_SOURCE #endif #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <stdint.h> #include <fcntl.h> #include <errno.h> #include <string.h> #include <pthread.h> static int fd; static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset) { while (count > 0) { ssize_t ret; ret = pwrite(fd, buf, count, offset); if (ret < 0) { if (errno == EINTR) continue; return ret; } count -= ret; buf += ret; } return 0; } static void *fsync_loop(void *arg) { while (1) { int ret; ret = fsync(fd); if (ret != 0) { perror("Fsync failed"); exit(6); } } } int main(int argc, char *argv[]) { long pagesize; void *write_buf; pthread_t fsyncer; int ret; if (argc != 2) { fprintf(stderr, "Use: %s <file path>\n", argv[0]); return 1; } fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666); if (fd == -1) { perror("Failed to open/create file"); return 1; } pagesize = sysconf(_SC_PAGE_SIZE); if (pagesize == -1) { perror("Failed to get page size"); return 2; } ret = posix_memalign(&write_buf, pagesize, pagesize); if (ret) { perror("Failed to allocate buffer"); return 3; } ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL); if (ret != 0) { fprintf(stderr, "Failed to create writer thread: %d\n", ret); return 4; } while (1) { ret = do_write(fd, write_buf, pagesize, 0); if (ret != 0) { perror("Write failed"); exit(5); } } return 0; } $ mkfs.btrfs -f /dev/sdi $ mount /dev/sdi /mnt/sdi $ timeout 10 ./repro /mnt/sdi/foo Usually the race is triggered within less than 1 second. A test case for fstests will follow soon. Reported-by: Paulo Dias <paulo.miguel.dias(a)gmail.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187 Reported-by: Andreas Jahn <jahn-andi(a)web.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199 Reported-by: syzbot+4704b3cc972bd76024f1(a)syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/ Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write") CC: stable(a)vger.kernel.org # 5.15+ Reviewed-by: Josef Bacik <josef(a)toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana(a)suse.com> Reviewed-by: David Sterba <dsterba(a)suse.com> Signed-off-by: David Sterba <dsterba(a)suse.com> --- fs/btrfs/ctree.h | 1 - fs/btrfs/file.c | 25 ++++++++++--------------- fs/btrfs/transaction.h | 6 ++++++ 3 files changed, 16 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 86c7f8ce1715..06333a74d6c4 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -445,7 +445,6 @@ struct btrfs_file_private { void *filldir_buf; u64 last_index; struct extent_state *llseek_cached_state; - bool fsync_skip_inode_lock; }; static inline u32 BTRFS_LEAF_DATA_SIZE(const struct btrfs_fs_info *info) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 952cf145c629..15fd8c00f4c0 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1543,13 +1543,6 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from) if (IS_ERR_OR_NULL(dio)) { err = PTR_ERR_OR_ZERO(dio); } else { - struct btrfs_file_private stack_private = { 0 }; - struct btrfs_file_private *private; - const bool have_private = (file->private_data != NULL); - - if (!have_private) - file->private_data = &stack_private; - /* * If we have a synchoronous write, we must make sure the fsync * triggered by the iomap_dio_complete() call below doesn't @@ -1558,13 +1551,10 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from) * partial writes due to the input buffer (or parts of it) not * being already faulted in. */ - private = file->private_data; - private->fsync_skip_inode_lock = true; + ASSERT(current->journal_info == NULL); + current->journal_info = BTRFS_TRANS_DIO_WRITE_STUB; err = iomap_dio_complete(dio); - private->fsync_skip_inode_lock = false; - - if (!have_private) - file->private_data = NULL; + current->journal_info = NULL; } /* No increment (+=) because iomap returns a cumulative value. */ @@ -1796,7 +1786,6 @@ static inline bool skip_inode_logging(const struct btrfs_log_ctx *ctx) */ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) { - struct btrfs_file_private *private = file->private_data; struct dentry *dentry = file_dentry(file); struct inode *inode = d_inode(dentry); struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); @@ -1806,7 +1795,13 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) int ret = 0, err; u64 len; bool full_sync; - const bool skip_ilock = (private ? private->fsync_skip_inode_lock : false); + bool skip_ilock = false; + + if (current->journal_info == BTRFS_TRANS_DIO_WRITE_STUB) { + skip_ilock = true; + current->journal_info = NULL; + lockdep_assert_held(&inode->i_rwsem); + } trace_btrfs_sync_file(file, datasync); diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index 238a0ab85df9..7623db359881 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -12,6 +12,12 @@ #include "ctree.h" #include "misc.h" +/* + * Signal that a direct IO write is in progress, to avoid deadlock for sync + * direct IO writes when fsync is called during the direct IO write path. + */ +#define BTRFS_TRANS_DIO_WRITE_STUB ((void *) 1) + /* Radix-tree tag for roots that are part of the trasaction. */ #define BTRFS_ROOT_TRANS_TAG 0 -- 2.43.0

1 year, 3 months

1
0
0 0

[PATCH for 6.10 stable] btrfs: fix race between direct IO write and fsync when using same fd

by fdmanana＠kernel.org

From: Filipe Manana <fdmanana(a)suse.com> commit cd9253c23aedd61eb5ff11f37a36247cd46faf86 upstream. If we have 2 threads that are using the same file descriptor and one of them is doing direct IO writes while the other is doing fsync, we have a race where we can end up either: 1) Attempt a fsync without holding the inode's lock, triggering an assertion failures when assertions are enabled; 2) Do an invalid memory access from the fsync task because the file private points to memory allocated on stack by the direct IO task and it may be used by the fsync task after the stack was destroyed. The race happens like this: 1) A user space program opens a file descriptor with O_DIRECT; 2) The program spawns 2 threads using libpthread for example; 3) One of the threads uses the file descriptor to do direct IO writes, while the other calls fsync using the same file descriptor. 4) Call task A the thread doing direct IO writes and task B the thread doing fsyncs; 5) Task A does a direct IO write, and at btrfs_direct_write() sets the file's private to an on stack allocated private with the member 'fsync_skip_inode_lock' set to true; 6) Task B enters btrfs_sync_file() and sees that there's a private structure associated to the file which has 'fsync_skip_inode_lock' set to true, so it skips locking the inode's vfs lock; 7) Task A completes the direct IO write, and resets the file's private to NULL since it had no prior private and our private was stack allocated. Then it unlocks the inode's vfs lock; 8) Task B enters btrfs_get_ordered_extents_for_logging(), then the assertion that checks the inode's vfs lock is held fails, since task B never locked it and task A has already unlocked it. The stack trace produced is the following: Aug 21 11:46:43 kerberos kernel: assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983 Aug 21 11:46:43 kerberos kernel: ------------[ cut here ]------------ Aug 21 11:46:43 kerberos kernel: kernel BUG at fs/btrfs/ordered-data.c:983! Aug 21 11:46:43 kerberos kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI Aug 21 11:46:43 kerberos kernel: CPU: 9 PID: 5072 Comm: worker Tainted: G U OE 6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8 Aug 21 11:46:43 kerberos kernel: Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020 Aug 21 11:46:43 kerberos kernel: RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs] Aug 21 11:46:43 kerberos kernel: Code: 50 d6 86 c0 e8 (...) Aug 21 11:46:43 kerberos kernel: RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246 Aug 21 11:46:43 kerberos kernel: RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000 Aug 21 11:46:43 kerberos kernel: RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800 Aug 21 11:46:43 kerberos kernel: RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38 Aug 21 11:46:43 kerberos kernel: R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800 Aug 21 11:46:43 kerberos kernel: R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000 Aug 21 11:46:43 kerberos kernel: FS: 00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000 Aug 21 11:46:43 kerberos kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Aug 21 11:46:43 kerberos kernel: CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0 Aug 21 11:46:43 kerberos kernel: Call Trace: Aug 21 11:46:43 kerberos kernel: <TASK> Aug 21 11:46:43 kerberos kernel: ? __die_body.cold+0x14/0x24 Aug 21 11:46:43 kerberos kernel: ? die+0x2e/0x50 Aug 21 11:46:43 kerberos kernel: ? do_trap+0xca/0x110 Aug 21 11:46:43 kerberos kernel: ? do_error_trap+0x6a/0x90 Aug 21 11:46:43 kerberos kernel: ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] Aug 21 11:46:43 kerberos kernel: ? exc_invalid_op+0x50/0x70 Aug 21 11:46:43 kerberos kernel: ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] Aug 21 11:46:43 kerberos kernel: ? asm_exc_invalid_op+0x1a/0x20 Aug 21 11:46:43 kerberos kernel: ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] Aug 21 11:46:43 kerberos kernel: ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] Aug 21 11:46:43 kerberos kernel: btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] Aug 21 11:46:43 kerberos kernel: ? __seccomp_filter+0x31d/0x4f0 Aug 21 11:46:43 kerberos kernel: __x64_sys_fdatasync+0x4f/0x90 Aug 21 11:46:43 kerberos kernel: do_syscall_64+0x82/0x160 Aug 21 11:46:43 kerberos kernel: ? do_futex+0xcb/0x190 Aug 21 11:46:43 kerberos kernel: ? __x64_sys_futex+0x10e/0x1d0 Aug 21 11:46:43 kerberos kernel: ? switch_fpu_return+0x4f/0xd0 Aug 21 11:46:43 kerberos kernel: ? syscall_exit_to_user_mode+0x72/0x220 Aug 21 11:46:43 kerberos kernel: ? do_syscall_64+0x8e/0x160 Aug 21 11:46:43 kerberos kernel: ? syscall_exit_to_user_mode+0x72/0x220 Aug 21 11:46:43 kerberos kernel: ? do_syscall_64+0x8e/0x160 Aug 21 11:46:43 kerberos kernel: ? syscall_exit_to_user_mode+0x72/0x220 Aug 21 11:46:43 kerberos kernel: ? do_syscall_64+0x8e/0x160 Aug 21 11:46:43 kerberos kernel: ? syscall_exit_to_user_mode+0x72/0x220 Aug 21 11:46:43 kerberos kernel: ? do_syscall_64+0x8e/0x160 Aug 21 11:46:43 kerberos kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e Another problem here is if task B grabs the private pointer and then uses it after task A has finished, since the private was allocated in the stack of trask A, it results in some invalid memory access with a hard to predict result. This issue, triggering the assertion, was observed with QEMU workloads by two users in the Link tags below. Fix this by not relying on a file's private to pass information to fsync that it should skip locking the inode and instead pass this information through a special value stored in current->journal_info. This is safe because in the relevant section of the direct IO write path we are not holding a transaction handle, so current->journal_info is NULL. The following C program triggers the issue: $ cat repro.c /* Get the O_DIRECT definition. */ #ifndef _GNU_SOURCE #define _GNU_SOURCE #endif #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <stdint.h> #include <fcntl.h> #include <errno.h> #include <string.h> #include <pthread.h> static int fd; static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset) { while (count > 0) { ssize_t ret; ret = pwrite(fd, buf, count, offset); if (ret < 0) { if (errno == EINTR) continue; return ret; } count -= ret; buf += ret; } return 0; } static void *fsync_loop(void *arg) { while (1) { int ret; ret = fsync(fd); if (ret != 0) { perror("Fsync failed"); exit(6); } } } int main(int argc, char *argv[]) { long pagesize; void *write_buf; pthread_t fsyncer; int ret; if (argc != 2) { fprintf(stderr, "Use: %s <file path>\n", argv[0]); return 1; } fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666); if (fd == -1) { perror("Failed to open/create file"); return 1; } pagesize = sysconf(_SC_PAGE_SIZE); if (pagesize == -1) { perror("Failed to get page size"); return 2; } ret = posix_memalign(&write_buf, pagesize, pagesize); if (ret) { perror("Failed to allocate buffer"); return 3; } ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL); if (ret != 0) { fprintf(stderr, "Failed to create writer thread: %d\n", ret); return 4; } while (1) { ret = do_write(fd, write_buf, pagesize, 0); if (ret != 0) { perror("Write failed"); exit(5); } } return 0; } $ mkfs.btrfs -f /dev/sdi $ mount /dev/sdi /mnt/sdi $ timeout 10 ./repro /mnt/sdi/foo Usually the race is triggered within less than 1 second. A test case for fstests will follow soon. Reported-by: Paulo Dias <paulo.miguel.dias(a)gmail.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187 Reported-by: Andreas Jahn <jahn-andi(a)web.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199 Reported-by: syzbot+4704b3cc972bd76024f1(a)syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/ Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write") CC: stable(a)vger.kernel.org # 5.15+ Reviewed-by: Josef Bacik <josef(a)toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana(a)suse.com> Reviewed-by: David Sterba <dsterba(a)suse.com> Signed-off-by: David Sterba <dsterba(a)suse.com> --- fs/btrfs/ctree.h | 1 - fs/btrfs/file.c | 25 ++++++++++--------------- fs/btrfs/transaction.h | 6 ++++++ 3 files changed, 16 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index a56209d275c1..b2e4b30b8fae 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -457,7 +457,6 @@ struct btrfs_file_private { void *filldir_buf; u64 last_index; struct extent_state *llseek_cached_state; - bool fsync_skip_inode_lock; }; static inline u32 BTRFS_LEAF_DATA_SIZE(const struct btrfs_fs_info *info) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index ca434f0cd27f..66dfee873906 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1558,13 +1558,6 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from) if (IS_ERR_OR_NULL(dio)) { ret = PTR_ERR_OR_ZERO(dio); } else { - struct btrfs_file_private stack_private = { 0 }; - struct btrfs_file_private *private; - const bool have_private = (file->private_data != NULL); - - if (!have_private) - file->private_data = &stack_private; - /* * If we have a synchoronous write, we must make sure the fsync * triggered by the iomap_dio_complete() call below doesn't @@ -1573,13 +1566,10 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from) * partial writes due to the input buffer (or parts of it) not * being already faulted in. */ - private = file->private_data; - private->fsync_skip_inode_lock = true; + ASSERT(current->journal_info == NULL); + current->journal_info = BTRFS_TRANS_DIO_WRITE_STUB; ret = iomap_dio_complete(dio); - private->fsync_skip_inode_lock = false; - - if (!have_private) - file->private_data = NULL; + current->journal_info = NULL; } /* No increment (+=) because iomap returns a cumulative value. */ @@ -1811,7 +1801,6 @@ static inline bool skip_inode_logging(const struct btrfs_log_ctx *ctx) */ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) { - struct btrfs_file_private *private = file->private_data; struct dentry *dentry = file_dentry(file); struct inode *inode = d_inode(dentry); struct btrfs_fs_info *fs_info = inode_to_fs_info(inode); @@ -1821,7 +1810,13 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) int ret = 0, err; u64 len; bool full_sync; - const bool skip_ilock = (private ? private->fsync_skip_inode_lock : false); + bool skip_ilock = false; + + if (current->journal_info == BTRFS_TRANS_DIO_WRITE_STUB) { + skip_ilock = true; + current->journal_info = NULL; + lockdep_assert_held(&inode->i_rwsem); + } trace_btrfs_sync_file(file, datasync); diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index 4e451ab173b1..62ec85f4b777 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -27,6 +27,12 @@ struct btrfs_root_item; struct btrfs_root; struct btrfs_path; +/* + * Signal that a direct IO write is in progress, to avoid deadlock for sync + * direct IO writes when fsync is called during the direct IO write path. + */ +#define BTRFS_TRANS_DIO_WRITE_STUB ((void *) 1) + /* Radix-tree tag for roots that are part of the trasaction. */ #define BTRFS_ROOT_TRANS_TAG 0 -- 2.43.0

1 year, 3 months

1
0
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-stable-mirror September 2024