BugLink: https://bugs.launchpad.net/bugs/1775165
[Note to upstream] I understand that this patch is a little long for -stable, but this patch series fixes a real issue, seen by real users, is testable, and is made up from upstream commits. Please consider it.
[Impact]
When userspace tasks which are processing fanotify permission events act incorrectly, the fsnotify_mark_srcu SRCU is held indefinitely which causes the whole notification subsystem to hang.
This has been seen in production, and it can also be seen when running the Linux Test Project testsuite, specifically fanotify07.
[Fix]
Instead of holding the SRCU lock while waiting for userspace to respond, which may never happen, or not in the order we are expecting, we drop the fsnotify_mark_srcu SRCU lock before waiting for userspace response, and then reacquire the lock again when userspace responds.
The fixes are from a series of upstream commits:
05f0e38724e8449184acd8fbf0473ee5a07adc6c (cherry-pick) 9385a84d7e1f658bb2d96ab798393e4b16268aaa (backport) abc77577a669f424c5d0c185b9994f2621c52aa4 (backport)
The following are upstream commits necessary for the fixes to function:
35e481761cdc688dbee0ef552a13f49af8eba6cc (backport) 0918f1c309b86301605650c836ddd2021d311ae2 (cherry-pick)
[Testcase]
You can reproduce the problem pretty quickly with the Linux Test Project:
Steps (with root): 1. sudo apt-get install git xfsprogs -y 2. git clone --depth=1 https://github.com/linux-test-project/ltp.git 3. cd ltp 4. make autotools 5. ./configure 6. make; make install 7. cd /opt/ltp 8. echo -e "fanotify07 fanotify07 \nfanotify08 fanotify08" > /tmp/jobs 9. ./runltp -f /tmp/jobs
On a stock Xenial kernel, the system will hang, and the testcase will look like:
<<<test_start>>> tag=fanotify07 stime=1554326200 cmdline="fanotify07 " contacts="" analysis=exit <<<test_output>>> tst_test.c:1096: INFO: Timeout per run is 0h 05m 00s Test timeouted, sending SIGKILL! Test timeouted, sending SIGKILL! Test timeouted, sending SIGKILL! Test timeouted, sending SIGKILL! Test timeouted, sending SIGKILL! Test timeouted, sending SIGKILL! Test timeouted, sending SIGKILL! Test timeouted, sending SIGKILL! Test timeouted, sending SIGKILL! Test timeouted, sending SIGKILL! Test timeouted, sending SIGKILL! Cannot kill test processes! Congratulation, likely test hit a kernel bug. Exitting uncleanly... <<<execution_status>>> initiation_status="ok" duration=350 termination_type=exited termination_id=1 corefile=no cutime=0 cstime=0 <<<test_end>>>
Looking at dmesg, we see the following call stack
[ 790.772792] LTP: starting fanotify07 (fanotify07 ) [ 960.140455] INFO: task fsnotify_mark:36 blocked for more than 120 seconds. [ 960.140867] Not tainted 4.4.0-142-generic #168-Ubuntu [ 960.141185] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 960.141498] fsnotify_mark D ffff8800b6703c98 0 36 2 0x00000000 [ 960.141516] ffff8800b6703c98 ffff88013a558a00 ffff8800b7797000 ffff8800b66f8000 [ 960.141524] ffff8800b6704000 7fffffffffffffff ffff8800b6703de0 ffff8800b66f8000 [ 960.141528] 0000000000000000 ffff8800b6703cb0 ffffffff8185cb45 ffff8800b6703de8 [ 960.141532] Call Trace: [ 960.141580] [<ffffffff8185cb45>] schedule+0x35/0x80 [ 960.141588] [<ffffffff818600f4>] schedule_timeout+0x1b4/0x270 [ 960.141617] [<ffffffff810f57ac>] ? mod_timer+0x10c/0x240 [ 960.141621] [<ffffffff8185c60d>] ? __schedule+0x30d/0x810 [ 960.141625] [<ffffffff8185d652>] wait_for_completion+0xb2/0x190 [ 960.141636] [<ffffffff810b1f10>] ? wake_up_q+0x70/0x70 [ 960.141641] [<ffffffff810eb140>] __synchronize_srcu+0x100/0x1a0 [ 960.141645] [<ffffffff810ea400>] ? trace_raw_output_rcu_utilization+0x60/0x60 [ 960.141664] [<ffffffff81260870>] ? fsnotify_put_mark+0x40/0x40 [ 960.141669] [<ffffffff810eb204>] synchronize_srcu+0x24/0x30 [ 960.141672] [<ffffffff812608f4>] fsnotify_mark_destroy+0x84/0x130 [ 960.141680] [<ffffffff810ca000>] ? wake_atomic_t_function+0x60/0x60 [ 960.141691] [<ffffffff810a6227>] kthread+0xe7/0x100 [ 960.141694] [<ffffffff8185c601>] ? __schedule+0x301/0x810 [ 960.141699] [<ffffffff810a6140>] ? kthread_create_on_node+0x1e0/0x1e0 [ 960.141703] [<ffffffff818618e5>] ret_from_fork+0x55/0x80 [ 960.141706] [<ffffffff810a6140>] ? kthread_create_on_node+0x1e0/0x1e0
The vanilla 4.4 kernel also shows the same call stack.
On a patched kernel, the test will pass successfully, and there will be no messages in dmesg.
[Regression Potential]
This makes modifications to how locking is performed in fsnotify / fanotify and there may be some cause for regression. Running all fanotify Linux Test Project tests shows that there are no extra failures caused by the patches, and instead fewer failures are seen due to the bugfix.
Running the entire Linux Test Project testsuite actually works and runs to completion, something which doesn't happen in a unpatched kernel since it will hang on the fanotify07 test.
The patches are taken from upstream, and all necessary commits have been taken into account, so I am happy with the potential risks and that testing has been completed.
Jan Kara (4): fsnotify: avoid spurious EMFILE errors from inotify_init() fsnotify: Provide framework for dropping SRCU lock in ->handle_event fsnotify: Pass fsnotify_iter_info into handle_event handler fanotify: Release SRCU lock when waiting for userspace response
Jeff Layton (1): fsnotify: turn fsnotify reaper thread into a workqueue job
fs/notify/dnotify/dnotify.c | 3 +- fs/notify/fanotify/fanotify.c | 20 ++- fs/notify/fsnotify.c | 19 ++- fs/notify/fsnotify.h | 13 ++ fs/notify/group.c | 18 ++- fs/notify/inotify/inotify.h | 3 +- fs/notify/inotify/inotify_fsnotify.c | 3 +- fs/notify/inotify/inotify_user.c | 2 +- fs/notify/mark.c | 194 +++++++++++++++++++++------ include/linux/fsnotify_backend.h | 10 +- kernel/audit_fsnotify.c | 3 +- kernel/audit_tree.c | 3 +- kernel/audit_watch.c | 3 +- 13 files changed, 230 insertions(+), 64 deletions(-)
From: Jeff Layton jlayton@poochiereds.net
commit 0918f1c309b86301605650c836ddd2021d311ae2 upstream.
We don't require a dedicated thread for fsnotify cleanup. Switch it over to a workqueue job instead that runs on the system_unbound_wq.
In the interest of not thrashing the queued job too often when there are a lot of marks being removed, we delay the reaper job slightly when queueing it, to allow several to gather on the list.
Signed-off-by: Jeff Layton jeff.layton@primarydata.com Tested-by: Eryu Guan guaneryu@gmail.com Reviewed-by: Jan Kara jack@suse.cz Cc: Eric Paris eparis@parisplace.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [mruffell: cherry picked] Signed-off-by: Matthew Ruffell matthew.ruffell@canonical.com --- fs/notify/mark.c | 49 ++++++++++++++++++------------------------------ 1 file changed, 18 insertions(+), 31 deletions(-)
diff --git a/fs/notify/mark.c b/fs/notify/mark.c index fc0df4442f7b..7115c5d7d373 100644 --- a/fs/notify/mark.c +++ b/fs/notify/mark.c @@ -91,10 +91,14 @@ #include <linux/fsnotify_backend.h> #include "fsnotify.h"
+#define FSNOTIFY_REAPER_DELAY (1) /* 1 jiffy */ + struct srcu_struct fsnotify_mark_srcu; static DEFINE_SPINLOCK(destroy_lock); static LIST_HEAD(destroy_list); -static DECLARE_WAIT_QUEUE_HEAD(destroy_waitq); + +static void fsnotify_mark_destroy(struct work_struct *work); +static DECLARE_DELAYED_WORK(reaper_work, fsnotify_mark_destroy);
void fsnotify_get_mark(struct fsnotify_mark *mark) { @@ -189,7 +193,8 @@ void fsnotify_free_mark(struct fsnotify_mark *mark) spin_lock(&destroy_lock); list_add(&mark->g_list, &destroy_list); spin_unlock(&destroy_lock); - wake_up(&destroy_waitq); + queue_delayed_work(system_unbound_wq, &reaper_work, + FSNOTIFY_REAPER_DELAY);
/* * Some groups like to know that marks are being freed. This is a @@ -388,7 +393,8 @@ err: spin_lock(&destroy_lock); list_add(&mark->g_list, &destroy_list); spin_unlock(&destroy_lock); - wake_up(&destroy_waitq); + queue_delayed_work(system_unbound_wq, &reaper_work, + FSNOTIFY_REAPER_DELAY);
return ret; } @@ -493,39 +499,20 @@ void fsnotify_init_mark(struct fsnotify_mark *mark, mark->free_mark = free_mark; }
-static int fsnotify_mark_destroy(void *ignored) +static void fsnotify_mark_destroy(struct work_struct *work) { struct fsnotify_mark *mark, *next; struct list_head private_destroy_list;
- for (;;) { - spin_lock(&destroy_lock); - /* exchange the list head */ - list_replace_init(&destroy_list, &private_destroy_list); - spin_unlock(&destroy_lock); - - synchronize_srcu(&fsnotify_mark_srcu); + spin_lock(&destroy_lock); + /* exchange the list head */ + list_replace_init(&destroy_list, &private_destroy_list); + spin_unlock(&destroy_lock);
- list_for_each_entry_safe(mark, next, &private_destroy_list, g_list) { - list_del_init(&mark->g_list); - fsnotify_put_mark(mark); - } + synchronize_srcu(&fsnotify_mark_srcu);
- wait_event_interruptible(destroy_waitq, !list_empty(&destroy_list)); + list_for_each_entry_safe(mark, next, &private_destroy_list, g_list) { + list_del_init(&mark->g_list); + fsnotify_put_mark(mark); } - - return 0; -} - -static int __init fsnotify_mark_init(void) -{ - struct task_struct *thread; - - thread = kthread_run(fsnotify_mark_destroy, NULL, - "fsnotify_mark"); - if (IS_ERR(thread)) - panic("unable to start fsnotify mark destruction thread."); - - return 0; } -device_initcall(fsnotify_mark_init);
From: Jan Kara jack@suse.cz
commit 35e481761cdc688dbee0ef552a13f49af8eba6cc upstream.
Inotify instance is destroyed when all references to it are dropped. That not only means that the corresponding file descriptor needs to be closed but also that all corresponding instance marks are freed (as each mark holds a reference to the inotify instance). However marks are freed only after SRCU period ends which can take some time and thus if user rapidly creates and frees inotify instances, number of existing inotify instances can exceed max_user_instances limit although from user point of view there is always at most one existing instance. Thus inotify_init() returns EMFILE error which is hard to justify from user point of view. This problem is exposed by LTP inotify06 testcase on some machines.
We fix the problem by making sure all group marks are properly freed while destroying inotify instance. We wait for SRCU period to end in that path anyway since we have to make sure there is no event being added to the instance while we are tearing down the instance. So it takes only some plumbing to allow for marks to be destroyed in that path as well and not from a dedicated work item.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Jan Kara jack@suse.cz Reported-by: Xiaoguang Wang wangxg.fnst@cn.fujitsu.com Tested-by: Xiaoguang Wang wangxg.fnst@cn.fujitsu.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org [mruffell: backport: adjust layout of fsnotify_destroy_group()] Signed-off-by: Matthew Ruffell matthew.ruffell@canonical.com --- fs/notify/fsnotify.h | 7 +++ fs/notify/group.c | 17 +++++-- fs/notify/mark.c | 78 +++++++++++++++++++++++++------- include/linux/fsnotify_backend.h | 2 - 4 files changed, 81 insertions(+), 23 deletions(-)
diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h index b44c68a857e7..0a3bc2cf192c 100644 --- a/fs/notify/fsnotify.h +++ b/fs/notify/fsnotify.h @@ -56,6 +56,13 @@ static inline void fsnotify_clear_marks_by_mount(struct vfsmount *mnt) fsnotify_destroy_marks(&real_mount(mnt)->mnt_fsnotify_marks, &mnt->mnt_root->d_lock); } +/* prepare for freeing all marks associated with given group */ +extern void fsnotify_detach_group_marks(struct fsnotify_group *group); +/* + * wait for fsnotify_mark_srcu period to end and free all marks in destroy_list + */ +extern void fsnotify_mark_destroy_list(void); + /* * update the dentry->d_flags of all of inode's children to indicate if inode cares * about events that happen to its children. diff --git a/fs/notify/group.c b/fs/notify/group.c index 18eb30c6bd8f..b47f7cfdcaa4 100644 --- a/fs/notify/group.c +++ b/fs/notify/group.c @@ -66,12 +66,21 @@ void fsnotify_destroy_group(struct fsnotify_group *group) */ fsnotify_group_stop_queueing(group);
- /* clear all inode marks for this group */ - fsnotify_clear_marks_by_group(group); + /* clear all inode marks for this group, attach them to destroy_list */ + fsnotify_detach_group_marks(group);
- synchronize_srcu(&fsnotify_mark_srcu); + /* + * Wait for fsnotify_mark_srcu period to end and free all marks in + * destroy_list + */ + fsnotify_mark_destroy_list();
- /* clear the notification queue of all events */ + /* + * Since we have waited for fsnotify_mark_srcu in + * fsnotify_mark_destroy_list() there can be no outstanding event + * notification against this group. So clearing the notification queue + * of all events is reliable now. + */ fsnotify_flush_notify(group);
/* diff --git a/fs/notify/mark.c b/fs/notify/mark.c index 7115c5d7d373..d3fea0bd89e2 100644 --- a/fs/notify/mark.c +++ b/fs/notify/mark.c @@ -97,8 +97,8 @@ struct srcu_struct fsnotify_mark_srcu; static DEFINE_SPINLOCK(destroy_lock); static LIST_HEAD(destroy_list);
-static void fsnotify_mark_destroy(struct work_struct *work); -static DECLARE_DELAYED_WORK(reaper_work, fsnotify_mark_destroy); +static void fsnotify_mark_destroy_workfn(struct work_struct *work); +static DECLARE_DELAYED_WORK(reaper_work, fsnotify_mark_destroy_workfn);
void fsnotify_get_mark(struct fsnotify_mark *mark) { @@ -173,11 +173,15 @@ void fsnotify_detach_mark(struct fsnotify_mark *mark) }
/* - * Free fsnotify mark. The freeing is actually happening from a kthread which - * first waits for srcu period end. Caller must have a reference to the mark - * or be protected by fsnotify_mark_srcu. + * Prepare mark for freeing and add it to the list of marks prepared for + * freeing. The actual freeing must happen after SRCU period ends and the + * caller is responsible for this. + * + * The function returns true if the mark was added to the list of marks for + * freeing. The function returns false if someone else has already called + * __fsnotify_free_mark() for the mark. */ -void fsnotify_free_mark(struct fsnotify_mark *mark) +static bool __fsnotify_free_mark(struct fsnotify_mark *mark) { struct fsnotify_group *group = mark->group;
@@ -185,17 +189,11 @@ void fsnotify_free_mark(struct fsnotify_mark *mark) /* something else already called this function on this mark */ if (!(mark->flags & FSNOTIFY_MARK_FLAG_ALIVE)) { spin_unlock(&mark->lock); - return; + return false; } mark->flags &= ~FSNOTIFY_MARK_FLAG_ALIVE; spin_unlock(&mark->lock);
- spin_lock(&destroy_lock); - list_add(&mark->g_list, &destroy_list); - spin_unlock(&destroy_lock); - queue_delayed_work(system_unbound_wq, &reaper_work, - FSNOTIFY_REAPER_DELAY); - /* * Some groups like to know that marks are being freed. This is a * callback to the group function to let it know that this mark @@ -203,6 +201,25 @@ void fsnotify_free_mark(struct fsnotify_mark *mark) */ if (group->ops->freeing_mark) group->ops->freeing_mark(mark, group); + + spin_lock(&destroy_lock); + list_add(&mark->g_list, &destroy_list); + spin_unlock(&destroy_lock); + + return true; +} + +/* + * Free fsnotify mark. The freeing is actually happening from a workqueue which + * first waits for srcu period end. Caller must have a reference to the mark + * or be protected by fsnotify_mark_srcu. + */ +void fsnotify_free_mark(struct fsnotify_mark *mark) +{ + if (__fsnotify_free_mark(mark)) { + queue_delayed_work(system_unbound_wq, &reaper_work, + FSNOTIFY_REAPER_DELAY); + } }
void fsnotify_destroy_mark(struct fsnotify_mark *mark, @@ -468,11 +485,29 @@ void fsnotify_clear_marks_by_group_flags(struct fsnotify_group *group, }
/* - * Given a group, destroy all of the marks associated with that group. + * Given a group, prepare for freeing all the marks associated with that group. + * The marks are attached to the list of marks prepared for destruction, the + * caller is responsible for freeing marks in that list after SRCU period has + * ended. */ -void fsnotify_clear_marks_by_group(struct fsnotify_group *group) +void fsnotify_detach_group_marks(struct fsnotify_group *group) { - fsnotify_clear_marks_by_group_flags(group, (unsigned int)-1); + struct fsnotify_mark *mark; + + while (1) { + mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING); + if (list_empty(&group->marks_list)) { + mutex_unlock(&group->mark_mutex); + break; + } + mark = list_first_entry(&group->marks_list, + struct fsnotify_mark, g_list); + fsnotify_get_mark(mark); + fsnotify_detach_mark(mark); + mutex_unlock(&group->mark_mutex); + __fsnotify_free_mark(mark); + fsnotify_put_mark(mark); + } }
void fsnotify_duplicate_mark(struct fsnotify_mark *new, struct fsnotify_mark *old) @@ -499,7 +534,11 @@ void fsnotify_init_mark(struct fsnotify_mark *mark, mark->free_mark = free_mark; }
-static void fsnotify_mark_destroy(struct work_struct *work) +/* + * Destroy all marks in destroy_list, waits for SRCU period to finish before + * actually freeing marks. + */ +void fsnotify_mark_destroy_list(void) { struct fsnotify_mark *mark, *next; struct list_head private_destroy_list; @@ -516,3 +555,8 @@ static void fsnotify_mark_destroy(struct work_struct *work) fsnotify_put_mark(mark); } } + +static void fsnotify_mark_destroy_workfn(struct work_struct *work) +{ + fsnotify_mark_destroy_list(); +} diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h index 850d8822e8ff..c611724ff16b 100644 --- a/include/linux/fsnotify_backend.h +++ b/include/linux/fsnotify_backend.h @@ -364,8 +364,6 @@ extern void fsnotify_clear_vfsmount_marks_by_group(struct fsnotify_group *group) extern void fsnotify_clear_inode_marks_by_group(struct fsnotify_group *group); /* run all the marks in a group, and clear all of the marks where mark->flags & flags is true*/ extern void fsnotify_clear_marks_by_group_flags(struct fsnotify_group *group, unsigned int flags); -/* run all the marks in a group, and flag them to be freed */ -extern void fsnotify_clear_marks_by_group(struct fsnotify_group *group); extern void fsnotify_get_mark(struct fsnotify_mark *mark); extern void fsnotify_put_mark(struct fsnotify_mark *mark); extern void fsnotify_unmount_inodes(struct super_block *sb);
From: Jan Kara jack@suse.cz
commit abc77577a669f424c5d0c185b9994f2621c52aa4 upstream.
fanotify wants to drop fsnotify_mark_srcu lock when waiting for response from userspace so that the whole notification subsystem is not blocked during that time. This patch provides a framework for safely getting mark reference for a mark found in the object list which pins the mark in that list. We can then drop fsnotify_mark_srcu, wait for userspace response and then safely continue iteration of the object list once we reaquire fsnotify_mark_srcu.
Reviewed-by: Miklos Szeredi mszeredi@redhat.com Reviewed-by: Amir Goldstein amir73il@gmail.com Signed-off-by: Jan Kara jack@suse.cz [mruffell: backport: realign file fs/notify/mark.c] Signed-off-by: Matthew Ruffell matthew.ruffell@canonical.com --- fs/notify/fsnotify.h | 6 +++ fs/notify/group.c | 1 + fs/notify/mark.c | 83 +++++++++++++++++++++++++++++++- include/linux/fsnotify_backend.h | 5 ++ 4 files changed, 94 insertions(+), 1 deletion(-)
diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h index 0a3bc2cf192c..0ad0eb9f2e14 100644 --- a/fs/notify/fsnotify.h +++ b/fs/notify/fsnotify.h @@ -8,6 +8,12 @@
#include "../mount.h"
+struct fsnotify_iter_info { + struct fsnotify_mark *inode_mark; + struct fsnotify_mark *vfsmount_mark; + int srcu_idx; +}; + /* destroy all events sitting in this groups notification queue */ extern void fsnotify_flush_notify(struct fsnotify_group *group);
diff --git a/fs/notify/group.c b/fs/notify/group.c index b47f7cfdcaa4..4c63b148835f 100644 --- a/fs/notify/group.c +++ b/fs/notify/group.c @@ -124,6 +124,7 @@ struct fsnotify_group *fsnotify_alloc_group(const struct fsnotify_ops *ops) /* set to 0 when there a no external references to this group */ atomic_set(&group->refcnt, 1); atomic_set(&group->num_marks, 0); + atomic_set(&group->user_waits, 0);
mutex_init(&group->notification_mutex); INIT_LIST_HEAD(&group->notification_list); diff --git a/fs/notify/mark.c b/fs/notify/mark.c index d3fea0bd89e2..d3005d95d530 100644 --- a/fs/notify/mark.c +++ b/fs/notify/mark.c @@ -105,6 +105,16 @@ void fsnotify_get_mark(struct fsnotify_mark *mark) atomic_inc(&mark->refcnt); }
+/* + * Get mark reference when we found the mark via lockless traversal of object + * list. Mark can be already removed from the list by now and on its way to be + * destroyed once SRCU period ends. + */ +static bool fsnotify_get_mark_safe(struct fsnotify_mark *mark) +{ + return atomic_inc_not_zero(&mark->refcnt); +} + void fsnotify_put_mark(struct fsnotify_mark *mark) { if (atomic_dec_and_test(&mark->refcnt)) { @@ -125,6 +135,72 @@ u32 fsnotify_recalc_mask(struct hlist_head *head) return new_mask; }
+bool fsnotify_prepare_user_wait(struct fsnotify_iter_info *iter_info) +{ + struct fsnotify_group *group; + + if (WARN_ON_ONCE(!iter_info->inode_mark && !iter_info->vfsmount_mark)) + return false; + + if (iter_info->inode_mark) + group = iter_info->inode_mark->group; + else + group = iter_info->vfsmount_mark->group; + + /* + * Since acquisition of mark reference is an atomic op as well, we can + * be sure this inc is seen before any effect of refcount increment. + */ + atomic_inc(&group->user_waits); + + if (iter_info->inode_mark) { + /* This can fail if mark is being removed */ + if (!fsnotify_get_mark_safe(iter_info->inode_mark)) + goto out_wait; + } + if (iter_info->vfsmount_mark) { + if (!fsnotify_get_mark_safe(iter_info->vfsmount_mark)) + goto out_inode; + } + + /* + * Now that both marks are pinned by refcount in the inode / vfsmount + * lists, we can drop SRCU lock, and safely resume the list iteration + * once userspace returns. + */ + srcu_read_unlock(&fsnotify_mark_srcu, iter_info->srcu_idx); + + return true; +out_inode: + if (iter_info->inode_mark) + fsnotify_put_mark(iter_info->inode_mark); +out_wait: + if (atomic_dec_and_test(&group->user_waits) && group->shutdown) + wake_up(&group->notification_waitq); + return false; +} + +void fsnotify_finish_user_wait(struct fsnotify_iter_info *iter_info) +{ + struct fsnotify_group *group = NULL; + + iter_info->srcu_idx = srcu_read_lock(&fsnotify_mark_srcu); + if (iter_info->inode_mark) { + group = iter_info->inode_mark->group; + fsnotify_put_mark(iter_info->inode_mark); + } + if (iter_info->vfsmount_mark) { + group = iter_info->vfsmount_mark->group; + fsnotify_put_mark(iter_info->vfsmount_mark); + } + /* + * We abuse notification_waitq on group shutdown for waiting for all + * marks pinned when waiting for userspace. + */ + if (atomic_dec_and_test(&group->user_waits) && group->shutdown) + wake_up(&group->notification_waitq); +} + /* * Remove mark from inode / vfsmount list, group list, drop inode reference * if we got one. @@ -161,7 +237,6 @@ void fsnotify_detach_mark(struct fsnotify_mark *mark) * __fsnotify_parent() lazily when next event happens on one of our * children. */ - list_del_init(&mark->g_list);
spin_unlock(&mark->lock); @@ -508,6 +583,12 @@ void fsnotify_detach_group_marks(struct fsnotify_group *group) __fsnotify_free_mark(mark); fsnotify_put_mark(mark); } + /* + * Some marks can still be pinned when waiting for response from + * userspace. Wait for those now. fsnotify_prepare_user_wait() will + * not succeed now so this wait is race-free. + */ + wait_event(group->notification_waitq, !atomic_read(&group->user_waits)); }
void fsnotify_duplicate_mark(struct fsnotify_mark *new, struct fsnotify_mark *old) diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h index c611724ff16b..c7c5ea590d54 100644 --- a/include/linux/fsnotify_backend.h +++ b/include/linux/fsnotify_backend.h @@ -79,6 +79,7 @@ struct fsnotify_event; struct fsnotify_mark; struct fsnotify_event_private_data; struct fsnotify_fname; +struct fsnotify_iter_info;
/* * Each group much define these ops. The fsnotify infrastructure will call @@ -162,6 +163,8 @@ struct fsnotify_group { struct fsnotify_event *overflow_event; /* Event we queue when the * notification list is too * full */ + atomic_t user_waits; /* Number of tasks waiting for user + * response */
/* groups can define private fields here or use the void *private */ union { @@ -367,6 +370,8 @@ extern void fsnotify_clear_marks_by_group_flags(struct fsnotify_group *group, un extern void fsnotify_get_mark(struct fsnotify_mark *mark); extern void fsnotify_put_mark(struct fsnotify_mark *mark); extern void fsnotify_unmount_inodes(struct super_block *sb); +extern void fsnotify_finish_user_wait(struct fsnotify_iter_info *iter_info); +extern bool fsnotify_prepare_user_wait(struct fsnotify_iter_info *iter_info);
/* put here because inotify does some weird stuff when destroying watches */ extern void fsnotify_init_event(struct fsnotify_event *event,
From: Jan Kara jack@suse.cz
commit 9385a84d7e1f658bb2d96ab798393e4b16268aaa upstream.
Pass fsnotify_iter_info into ->handle_event() handler so that it can release and reacquire SRCU lock via fsnotify_prepare_user_wait() and fsnotify_finish_user_wait() functions. These functions also make sure current marks are appropriately pinned so that iteration protected by srcu in fsnotify() stays safe.
Reviewed-by: Miklos Szeredi mszeredi@redhat.com Reviewed-by: Amir Goldstein amir73il@gmail.com Signed-off-by: Jan Kara jack@suse.cz [mruffell: backport: removing const keyword and minor realignment] Signed-off-by: Matthew Ruffell matthew.ruffell@canonical.com --- fs/notify/dnotify/dnotify.c | 3 ++- fs/notify/fanotify/fanotify.c | 3 ++- fs/notify/fsnotify.c | 19 +++++++++++++------ fs/notify/inotify/inotify.h | 3 ++- fs/notify/inotify/inotify_fsnotify.c | 3 ++- fs/notify/inotify/inotify_user.c | 2 +- include/linux/fsnotify_backend.h | 3 ++- kernel/audit_fsnotify.c | 3 ++- kernel/audit_tree.c | 3 ++- kernel/audit_watch.c | 3 ++- 10 files changed, 30 insertions(+), 15 deletions(-)
diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c index 6faaf710e563..264bfd99a694 100644 --- a/fs/notify/dnotify/dnotify.c +++ b/fs/notify/dnotify/dnotify.c @@ -86,7 +86,8 @@ static int dnotify_handle_event(struct fsnotify_group *group, struct fsnotify_mark *inode_mark, struct fsnotify_mark *vfsmount_mark, u32 mask, void *data, int data_type, - const unsigned char *file_name, u32 cookie) + const unsigned char *file_name, u32 cookie, + struct fsnotify_iter_info *iter_info) { struct dnotify_mark *dn_mark; struct dnotify_struct *dn; diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c index 8a459b179183..4944956cdbd9 100644 --- a/fs/notify/fanotify/fanotify.c +++ b/fs/notify/fanotify/fanotify.c @@ -174,7 +174,8 @@ static int fanotify_handle_event(struct fsnotify_group *group, struct fsnotify_mark *inode_mark, struct fsnotify_mark *fanotify_mark, u32 mask, void *data, int data_type, - const unsigned char *file_name, u32 cookie) + const unsigned char *file_name, u32 cookie, + struct fsnotify_iter_info *iter_info) { int ret = 0; struct fanotify_event_info *event; diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c index a64adc2fced9..19c75b446314 100644 --- a/fs/notify/fsnotify.c +++ b/fs/notify/fsnotify.c @@ -131,7 +131,8 @@ static int send_to_group(struct inode *to_tell, struct fsnotify_mark *vfsmount_mark, __u32 mask, void *data, int data_is, u32 cookie, - const unsigned char *file_name) + const unsigned char *file_name, + struct fsnotify_iter_info *iter_info) { struct fsnotify_group *group = NULL; __u32 inode_test_mask = 0; @@ -182,7 +183,7 @@ static int send_to_group(struct inode *to_tell,
return group->ops->handle_event(group, to_tell, inode_mark, vfsmount_mark, mask, data, data_is, - file_name, cookie); + file_name, cookie, iter_info); }
/* @@ -197,8 +198,9 @@ int fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is, struct hlist_node *inode_node = NULL, *vfsmount_node = NULL; struct fsnotify_mark *inode_mark = NULL, *vfsmount_mark = NULL; struct fsnotify_group *inode_group, *vfsmount_group; + struct fsnotify_iter_info iter_info; struct mount *mnt; - int idx, ret = 0; + int ret = 0; /* global tests shouldn't care about events on child only the specific event */ __u32 test_mask = (mask & ~FS_EVENT_ON_CHILD);
@@ -227,7 +229,7 @@ int fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is, !(mnt && test_mask & mnt->mnt_fsnotify_mask)) return 0;
- idx = srcu_read_lock(&fsnotify_mark_srcu); + iter_info.srcu_idx = srcu_read_lock(&fsnotify_mark_srcu);
if ((mask & FS_MODIFY) || (test_mask & to_tell->i_fsnotify_mask)) @@ -276,8 +278,13 @@ int fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is, vfsmount_mark = NULL; } } + + iter_info.inode_mark = inode_mark; + iter_info.vfsmount_mark = vfsmount_mark; + ret = send_to_group(to_tell, inode_mark, vfsmount_mark, mask, - data, data_is, cookie, file_name); + data, data_is, cookie, file_name, + &iter_info);
if (ret && (mask & ALL_FSNOTIFY_PERM_EVENTS)) goto out; @@ -291,7 +298,7 @@ int fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is, } ret = 0; out: - srcu_read_unlock(&fsnotify_mark_srcu, idx); + srcu_read_unlock(&fsnotify_mark_srcu, iter_info.srcu_idx);
return ret; } diff --git a/fs/notify/inotify/inotify.h b/fs/notify/inotify/inotify.h index ed855ef6f077..726b06b303b8 100644 --- a/fs/notify/inotify/inotify.h +++ b/fs/notify/inotify/inotify.h @@ -27,6 +27,7 @@ extern int inotify_handle_event(struct fsnotify_group *group, struct fsnotify_mark *inode_mark, struct fsnotify_mark *vfsmount_mark, u32 mask, void *data, int data_type, - const unsigned char *file_name, u32 cookie); + const unsigned char *file_name, u32 cookie, + struct fsnotify_iter_info *iter_info);
extern const struct fsnotify_ops inotify_fsnotify_ops; diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c index 2cd900c2c737..79a5f06b9100 100644 --- a/fs/notify/inotify/inotify_fsnotify.c +++ b/fs/notify/inotify/inotify_fsnotify.c @@ -67,7 +67,8 @@ int inotify_handle_event(struct fsnotify_group *group, struct fsnotify_mark *inode_mark, struct fsnotify_mark *vfsmount_mark, u32 mask, void *data, int data_type, - const unsigned char *file_name, u32 cookie) + const unsigned char *file_name, u32 cookie, + struct fsnotify_iter_info *iter_info) { struct inotify_inode_mark *i_mark; struct inotify_event_info *event; diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c index b8d08d0d0a4d..6cea8b2131a3 100644 --- a/fs/notify/inotify/inotify_user.c +++ b/fs/notify/inotify/inotify_user.c @@ -494,7 +494,7 @@ void inotify_ignored_and_remove_idr(struct fsnotify_mark *fsn_mark,
/* Queue ignore event for the watch */ inotify_handle_event(group, NULL, fsn_mark, NULL, FS_IN_IGNORED, - NULL, FSNOTIFY_EVENT_NONE, NULL, 0); + NULL, FSNOTIFY_EVENT_NONE, NULL, 0, NULL);
i_mark = container_of(fsn_mark, struct inotify_inode_mark, fsn_mark); /* remove this mark from the idr */ diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h index c7c5ea590d54..ddc13584cbe2 100644 --- a/include/linux/fsnotify_backend.h +++ b/include/linux/fsnotify_backend.h @@ -98,7 +98,8 @@ struct fsnotify_ops { struct fsnotify_mark *inode_mark, struct fsnotify_mark *vfsmount_mark, u32 mask, void *data, int data_type, - const unsigned char *file_name, u32 cookie); + const unsigned char *file_name, u32 cookie, + struct fsnotify_iter_info *iter_info); void (*free_group_priv)(struct fsnotify_group *group); void (*freeing_mark)(struct fsnotify_mark *mark, struct fsnotify_group *group); void (*free_event)(struct fsnotify_event *event); diff --git a/kernel/audit_fsnotify.c b/kernel/audit_fsnotify.c index 27c6046c2c3d..94aa9995f41a 100644 --- a/kernel/audit_fsnotify.c +++ b/kernel/audit_fsnotify.c @@ -169,7 +169,8 @@ static int audit_mark_handle_event(struct fsnotify_group *group, struct fsnotify_mark *inode_mark, struct fsnotify_mark *vfsmount_mark, u32 mask, void *data, int data_type, - const unsigned char *dname, u32 cookie) + const unsigned char *dname, u32 cookie, + struct fsnotify_iter_info *iter_info) { struct audit_fsnotify_mark *audit_mark; struct inode *inode = NULL; diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index 5efe9b299a12..9443b7fd6d90 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -951,7 +951,8 @@ static int audit_tree_handle_event(struct fsnotify_group *group, struct fsnotify_mark *inode_mark, struct fsnotify_mark *vfsmount_mark, u32 mask, void *data, int data_type, - const unsigned char *file_name, u32 cookie) + const unsigned char *file_name, u32 cookie, + struct fsnotify_iter_info *iter_info) { return 0; } diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c index f45a9a5d3e47..40fb562ca404 100644 --- a/kernel/audit_watch.c +++ b/kernel/audit_watch.c @@ -485,7 +485,8 @@ static int audit_watch_handle_event(struct fsnotify_group *group, struct fsnotify_mark *inode_mark, struct fsnotify_mark *vfsmount_mark, u32 mask, void *data, int data_type, - const unsigned char *dname, u32 cookie) + const unsigned char *dname, u32 cookie, + struct fsnotify_iter_info *iter_info) { struct inode *inode; struct audit_parent *parent;
From: Jan Kara jack@suse.cz
commit 05f0e38724e8449184acd8fbf0473ee5a07adc6c upstream.
When userspace task processing fanotify permission events screws up and does not respond, fsnotify_mark_srcu SRCU is held indefinitely which causes further hangs in the whole notification subsystem. Although we cannot easily solve the problem of operations blocked waiting for response from userspace, we can at least somewhat localize the damage by dropping SRCU lock before waiting for userspace response and reacquiring it when userspace responds.
Reviewed-by: Miklos Szeredi mszeredi@redhat.com Reviewed-by: Amir Goldstein amir73il@gmail.com Signed-off-by: Jan Kara jack@suse.cz [mruffell: cherry picked] Signed-off-by: Matthew Ruffell matthew.ruffell@canonical.com --- fs/notify/fanotify/fanotify.c | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c index 4944956cdbd9..eeb5cc1f6978 100644 --- a/fs/notify/fanotify/fanotify.c +++ b/fs/notify/fanotify/fanotify.c @@ -61,14 +61,26 @@ static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS static int fanotify_get_response(struct fsnotify_group *group, - struct fanotify_perm_event_info *event) + struct fanotify_perm_event_info *event, + struct fsnotify_iter_info *iter_info) { int ret;
pr_debug("%s: group=%p event=%p\n", __func__, group, event);
+ /* + * fsnotify_prepare_user_wait() fails if we race with mark deletion. + * Just let the operation pass in that case. + */ + if (!fsnotify_prepare_user_wait(iter_info)) { + event->response = FAN_ALLOW; + goto out; + } + wait_event(group->fanotify_data.access_waitq, event->response);
+ fsnotify_finish_user_wait(iter_info); +out: /* userspace responded, convert to something usable */ switch (event->response) { case FAN_ALLOW: @@ -216,7 +228,8 @@ static int fanotify_handle_event(struct fsnotify_group *group,
#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS if (mask & FAN_ALL_PERM_EVENTS) { - ret = fanotify_get_response(group, FANOTIFY_PE(fsn_event)); + ret = fanotify_get_response(group, FANOTIFY_PE(fsn_event), + iter_info); fsnotify_destroy_event(group, fsn_event); } #endif
On Wed, Apr 10, 2019 at 04:54:51PM +1200, Matthew Ruffell wrote:
BugLink: https://bugs.launchpad.net/bugs/1775165
[Note to upstream] I understand that this patch is a little long for -stable, but this patch series fixes a real issue, seen by real users, is testable, and is made up from upstream commits. Please consider it.
[Impact]
When userspace tasks which are processing fanotify permission events act incorrectly, the fsnotify_mark_srcu SRCU is held indefinitely which causes the whole notification subsystem to hang.
This has been seen in production, and it can also be seen when running the Linux Test Project testsuite, specifically fanotify07.
[Fix]
Instead of holding the SRCU lock while waiting for userspace to respond, which may never happen, or not in the order we are expecting, we drop the fsnotify_mark_srcu SRCU lock before waiting for userspace response, and then reacquire the lock again when userspace responds.
The fixes are from a series of upstream commits:
05f0e38724e8449184acd8fbf0473ee5a07adc6c (cherry-pick) 9385a84d7e1f658bb2d96ab798393e4b16268aaa (backport) abc77577a669f424c5d0c185b9994f2621c52aa4 (backport)
The following are upstream commits necessary for the fixes to function:
35e481761cdc688dbee0ef552a13f49af8eba6cc (backport) 0918f1c309b86301605650c836ddd2021d311ae2 (cherry-pick)
This would also make sense for 4.9, right? I don't want to fix 4.4 without fixing 4.9 as well.
-- Thanks, Sasha
Apologies for the previous HTML email, my bad.
Yes, it does make sense for inclusion in 4.9.
I went and tested 4.9, saw the same hang, and backported the relevant patches. 4.9 requires only 3 of the 5 needed for 4.4.
I sent the patchset to the list, you should see them soon.
Let me know if there's anything else you need.
Thanks, Matthew
On 11/04/19 4:44 AM, Sasha Levin wrote:
On Wed, Apr 10, 2019 at 04:54:51PM +1200, Matthew Ruffell wrote:
BugLink: https://bugs.launchpad.net/bugs/1775165
[Note to upstream] I understand that this patch is a little long for -stable, but this patch series fixes a real issue, seen by real users, is testable, and is made up from upstream commits. Please consider it.
[Impact]
When userspace tasks which are processing fanotify permission events act incorrectly, the fsnotify_mark_srcu SRCU is held indefinitely which causes the whole notification subsystem to hang.
This has been seen in production, and it can also be seen when running the Linux Test Project testsuite, specifically fanotify07.
[Fix]
Instead of holding the SRCU lock while waiting for userspace to respond, which may never happen, or not in the order we are expecting, we drop the fsnotify_mark_srcu SRCU lock before waiting for userspace response, and then reacquire the lock again when userspace responds.
The fixes are from a series of upstream commits:
05f0e38724e8449184acd8fbf0473ee5a07adc6c (cherry-pick) 9385a84d7e1f658bb2d96ab798393e4b16268aaa (backport) abc77577a669f424c5d0c185b9994f2621c52aa4 (backport)
The following are upstream commits necessary for the fixes to function:
35e481761cdc688dbee0ef552a13f49af8eba6cc (backport) 0918f1c309b86301605650c836ddd2021d311ae2 (cherry-pick)
This would also make sense for 4.9, right? I don't want to fix 4.4 without fixing 4.9 as well.
-- Thanks, Sasha
linux-stable-mirror@lists.linaro.org