Syzbot found a corrupted list bug scenario that can be triggered from cgroup_subtree_control_write(cgrp). The reproduces writes to cgroup.subtree_control file, which invokes: cgroup_apply_control_enable()->css_create()->css_populate_dir(), which then fails with a fault injected -ENOMEM. In such scenario the css_killed_work_fn will be en-queued via cgroup_apply_control_disable(cgrp)->kill_css(css), and bail out to cgroup_kn_unlock(). Then cgroup_kn_unlock() will call: cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue css_release_work_fn for the same css instance, causing a list_add corruption bug, as can be seen in the syzkaller report [1].
Fix this by synchronizing the css ref_kill and css_release jobs. css_release() function will check if the css_killed_work_fn() has been scheduled for the css and only en-queue the css_release_work_fn() if css_killed_work_fn wasn't already en-queued. Otherwise css_release() will set the CSS_REL_LATER flag for that css. This will cause the css_release_work_fn() work to be executed after css_killed_work_fn() is finished.
Two scc flags have been introduced to implement this serialization mechanizm:
* CSS_KILL_ENQED, which will be set when css_killed_work_fn() is en-queued, and * CSS_REL_LATER, which, if set, will cause the css_release_work_fn() to be scheduled after the css_killed_work_fn is finished.
There is also a new lock, which will protect the integrity of the css flags.
[1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465db...
Cc: Tejun Heo tj@kernel.org Cc: Michal Koutny mkoutny@suse.com Cc: Zefan Li lizefan.x@bytedance.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Christian Brauner brauner@kernel.org Cc: Alexei Starovoitov ast@kernel.org Cc: Daniel Borkmann daniel@iogearbox.net Cc: Andrii Nakryiko andrii@kernel.org Cc: Martin KaFai Lau kafai@fb.com Cc: Song Liu songliubraving@fb.com Cc: Yonghong Song yhs@fb.com Cc: John Fastabend john.fastabend@gmail.com Cc: KP Singh kpsingh@kernel.org Cc: cgroups@vger.kernel.org Cc: netdev@vger.kernel.org Cc: bpf@vger.kernel.org Cc: stable@vger.kernel.org Cc: linux-kernel@vger.kernel.org
Reported-and-tested-by: syzbot+e42ae441c3b10acf9e9d@syzkaller.appspotmail.com Fixes: 8f36aaec9c92 ("cgroup: Use rcu_work instead of explicit rcu and work item") Signed-off-by: Tadeusz Struk tadeusz.struk@linaro.org --- include/linux/cgroup-defs.h | 4 ++++ kernel/cgroup/cgroup.c | 35 ++++++++++++++++++++++++++++++++--- 2 files changed, 36 insertions(+), 3 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 1bfcfb1af352..8dc8b4edb242 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -53,6 +53,8 @@ enum { CSS_RELEASED = (1 << 2), /* refcnt reached zero, released */ CSS_VISIBLE = (1 << 3), /* css is visible to userland */ CSS_DYING = (1 << 4), /* css is dying */ + CSS_KILL_ENQED = (1 << 5), /* kill work enqueued for the css */ + CSS_REL_LATER = (1 << 6), /* release needs to be done after kill */ };
/* bits in struct cgroup flags field */ @@ -162,6 +164,8 @@ struct cgroup_subsys_state { */ int id;
+ /* lock to protect flags */ + spinlock_t lock; unsigned int flags;
/* diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 1779ccddb734..a0ceead4b390 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -5210,8 +5210,23 @@ static void css_release(struct percpu_ref *ref) struct cgroup_subsys_state *css = container_of(ref, struct cgroup_subsys_state, refcnt);
- INIT_WORK(&css->destroy_work, css_release_work_fn); - queue_work(cgroup_destroy_wq, &css->destroy_work); + spin_lock_bh(&css->lock); + + /* + * Check if the css_killed_work_fn work has been scheduled for this + * css and enqueue css_release_work_fn only if it wasn't. + * Otherwise set the CSS_REL_LATER flag, which will cause + * release to be enqueued after css_killed_work_fn is finished. + * This is to prevent list corruption by en-queuing two instance + * of the same work struct on the same WQ, namely cgroup_destroy_wq. + */ + if (!(css->flags & CSS_KILL_ENQED)) { + INIT_WORK(&css->destroy_work, css_release_work_fn); + queue_work(cgroup_destroy_wq, &css->destroy_work); + } else { + css->flags |= CSS_REL_LATER; + } + spin_unlock_bh(&css->lock); }
static void init_and_link_css(struct cgroup_subsys_state *css, @@ -5230,6 +5245,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css, INIT_LIST_HEAD(&css->rstat_css_node); css->serial_nr = css_serial_nr_next++; atomic_set(&css->online_cnt, 0); + spin_lock_init(&css->lock);
if (cgroup_parent(cgrp)) { css->parent = cgroup_css(cgroup_parent(cgrp), ss); @@ -5545,10 +5561,12 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode) */ static void css_killed_work_fn(struct work_struct *work) { - struct cgroup_subsys_state *css = + struct cgroup_subsys_state *css_killed, *css = container_of(work, struct cgroup_subsys_state, destroy_work);
mutex_lock(&cgroup_mutex); + css_killed = css; + css_killed->flags &= ~CSS_KILL_ENQED;
do { offline_css(css); @@ -5557,6 +5575,14 @@ static void css_killed_work_fn(struct work_struct *work) css = css->parent; } while (css && atomic_dec_and_test(&css->online_cnt));
+ spin_lock_bh(&css->lock); + if (css_killed->flags & CSS_REL_LATER) { + /* If css_release work was delayed for the css enqueue it now. */ + INIT_WORK(&css_killed->destroy_work, css_release_work_fn); + queue_work(cgroup_destroy_wq, &css_killed->destroy_work); + css_killed->flags &= ~CSS_REL_LATER; + } + spin_unlock_bh(&css->lock); mutex_unlock(&cgroup_mutex); }
@@ -5566,10 +5592,13 @@ static void css_killed_ref_fn(struct percpu_ref *ref) struct cgroup_subsys_state *css = container_of(ref, struct cgroup_subsys_state, refcnt);
+ spin_lock_bh(&css->lock); if (atomic_dec_and_test(&css->online_cnt)) { + css->flags |= CSS_KILL_ENQED; INIT_WORK(&css->destroy_work, css_killed_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work); } + spin_unlock_bh(&css->lock); }
/**
On 6/3/22 10:34, Tadeusz Struk wrote:
Syzbot found a corrupted list bug scenario that can be triggered from cgroup_subtree_control_write(cgrp). The reproduces writes to cgroup.subtree_control file, which invokes: cgroup_apply_control_enable()->css_create()->css_populate_dir(), which then fails with a fault injected -ENOMEM. In such scenario the css_killed_work_fn will be en-queued via cgroup_apply_control_disable(cgrp)->kill_css(css), and bail out to cgroup_kn_unlock(). Then cgroup_kn_unlock() will call: cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue css_release_work_fn for the same css instance, causing a list_add corruption bug, as can be seen in the syzkaller report [1].
Fix this by synchronizing the css ref_kill and css_release jobs. css_release() function will check if the css_killed_work_fn() has been scheduled for the css and only en-queue the css_release_work_fn() if css_killed_work_fn wasn't already en-queued. Otherwise css_release() will set the CSS_REL_LATER flag for that css. This will cause the css_release_work_fn() work to be executed after css_killed_work_fn() is finished.
Two scc flags have been introduced to implement this serialization mechanizm:
- CSS_KILL_ENQED, which will be set when css_killed_work_fn() is en-queued, and
- CSS_REL_LATER, which, if set, will cause the css_release_work_fn() to be scheduled after the css_killed_work_fn is finished.
There is also a new lock, which will protect the integrity of the css flags.
[1]https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465db...
Cc: Tejun Heotj@kernel.org Cc: Michal Koutnymkoutny@suse.com Cc: Zefan Lilizefan.x@bytedance.com Cc: Johannes Weinerhannes@cmpxchg.org Cc: Christian Braunerbrauner@kernel.org Cc: Alexei Starovoitovast@kernel.org Cc: Daniel Borkmanndaniel@iogearbox.net Cc: Andrii Nakryikoandrii@kernel.org Cc: Martin KaFai Laukafai@fb.com Cc: Song Liusongliubraving@fb.com Cc: Yonghong Songyhs@fb.com Cc: John Fastabendjohn.fastabend@gmail.com Cc: KP Singhkpsingh@kernel.org Cc:cgroups@vger.kernel.org Cc:netdev@vger.kernel.org Cc:bpf@vger.kernel.org Cc:stable@vger.kernel.org Cc:linux-kernel@vger.kernel.org
Reported-and-tested-by:syzbot+e42ae441c3b10acf9e9d@syzkaller.appspotmail.com Fixes: 8f36aaec9c92 ("cgroup: Use rcu_work instead of explicit rcu and work item") Signed-off-by: Tadeusz Struktadeusz.struk@linaro.org
I just spotted an issue with this. I'm holding invalid lock in css_killed_work_fn(). I will follow up with a v2 of the patch soon.
Syzbot found a corrupted list bug scenario that can be triggered from cgroup_subtree_control_write(cgrp). The reproduces writes to cgroup.subtree_control file, which invokes: cgroup_apply_control_enable()->css_create()->css_populate_dir(), which then fails with a fault injected -ENOMEM. In such scenario the css_killed_work_fn will be en-queued via cgroup_apply_control_disable(cgrp)->kill_css(css), and bail out to cgroup_kn_unlock(). Then cgroup_kn_unlock() will call: cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue css_release_work_fn for the same css instance, causing a list_add corruption bug, as can be seen in the syzkaller report [1].
Fix this by synchronizing the css ref_kill and css_release jobs. css_release() function will check if the css_killed_work_fn() has been scheduled for the css and only en-queue the css_release_work_fn() if css_killed_work_fn wasn't already en-queued. Otherwise css_release() will set the CSS_REL_LATER flag for that css. This will cause the css_release_work_fn() work to be executed after css_killed_work_fn() is finished.
Two scc flags have been introduced to implement this serialization mechanizm:
* CSS_KILL_ENQED, which will be set when css_killed_work_fn() is en-queued, and * CSS_REL_LATER, which, if set, will cause the css_release_work_fn() to be scheduled after the css_killed_work_fn is finished.
There is also a new lock, which will protect the integrity of the css flags.
[1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465db...
Cc: Tejun Heo tj@kernel.org Cc: Michal Koutnymkoutny@suse.com Cc: Zefan Li lizefan.x@bytedance.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Christian Brauner brauner@kernel.org Cc: Alexei Starovoitov ast@kernel.org Cc: Daniel Borkmann daniel@iogearbox.net Cc: Andrii Nakryiko andrii@kernel.org Cc: Martin KaFai Lau kafai@fb.com Cc: Song Liu songliubraving@fb.com Cc: Yonghong Song yhs@fb.com Cc: John Fastabend john.fastabend@gmail.com Cc: KP Singh kpsingh@kernel.org Cc: cgroups@vger.kernel.org Cc: netdev@vger.kernel.org Cc: bpf@vger.kernel.org Cc: stable@vger.kernel.org Cc: linux-kernel@vger.kernel.org
Reported-and-tested-by: syzbot+e42ae441c3b10acf9e9d@syzkaller.appspotmail.com Fixes: 8f36aaec9c92 ("cgroup: Use rcu_work instead of explicit rcu and work item") Signed-off-by: Tadeusz Struk tadeusz.struk@linaro.org --- v2: Use correct lock in css_killed_work_fn() --- include/linux/cgroup-defs.h | 4 ++++ kernel/cgroup/cgroup.c | 35 ++++++++++++++++++++++++++++++++--- 2 files changed, 36 insertions(+), 3 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 1bfcfb1af352..8dc8b4edb242 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -53,6 +53,8 @@ enum { CSS_RELEASED = (1 << 2), /* refcnt reached zero, released */ CSS_VISIBLE = (1 << 3), /* css is visible to userland */ CSS_DYING = (1 << 4), /* css is dying */ + CSS_KILL_ENQED = (1 << 5), /* kill work enqueued for the css */ + CSS_REL_LATER = (1 << 6), /* release needs to be done after kill */ };
/* bits in struct cgroup flags field */ @@ -162,6 +164,8 @@ struct cgroup_subsys_state { */ int id;
+ /* lock to protect flags */ + spinlock_t lock; unsigned int flags;
/* diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 1779ccddb734..b1bbd438d426 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -5210,8 +5210,23 @@ static void css_release(struct percpu_ref *ref) struct cgroup_subsys_state *css = container_of(ref, struct cgroup_subsys_state, refcnt);
- INIT_WORK(&css->destroy_work, css_release_work_fn); - queue_work(cgroup_destroy_wq, &css->destroy_work); + spin_lock_bh(&css->lock); + + /* + * Check if the css_killed_work_fn work has been scheduled for this + * css and enqueue css_release_work_fn only if it wasn't. + * Otherwise set the CSS_REL_LATER flag, which will cause + * release to be enqueued after css_killed_work_fn is finished. + * This is to prevent list corruption by en-queuing two instance + * of the same work struct on the same WQ, namely cgroup_destroy_wq. + */ + if (!(css->flags & CSS_KILL_ENQED)) { + INIT_WORK(&css->destroy_work, css_release_work_fn); + queue_work(cgroup_destroy_wq, &css->destroy_work); + } else { + css->flags |= CSS_REL_LATER; + } + spin_unlock_bh(&css->lock); }
static void init_and_link_css(struct cgroup_subsys_state *css, @@ -5230,6 +5245,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css, INIT_LIST_HEAD(&css->rstat_css_node); css->serial_nr = css_serial_nr_next++; atomic_set(&css->online_cnt, 0); + spin_lock_init(&css->lock);
if (cgroup_parent(cgrp)) { css->parent = cgroup_css(cgroup_parent(cgrp), ss); @@ -5545,10 +5561,12 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode) */ static void css_killed_work_fn(struct work_struct *work) { - struct cgroup_subsys_state *css = + struct cgroup_subsys_state *css_killed, *css = container_of(work, struct cgroup_subsys_state, destroy_work);
mutex_lock(&cgroup_mutex); + css_killed = css; + css_killed->flags &= ~CSS_KILL_ENQED;
do { offline_css(css); @@ -5557,6 +5575,14 @@ static void css_killed_work_fn(struct work_struct *work) css = css->parent; } while (css && atomic_dec_and_test(&css->online_cnt));
+ spin_lock_bh(&css_killed->lock); + if (css_killed->flags & CSS_REL_LATER) { + /* If css_release work was delayed for the css enqueue it now. */ + INIT_WORK(&css_killed->destroy_work, css_release_work_fn); + queue_work(cgroup_destroy_wq, &css_killed->destroy_work); + css_killed->flags &= ~CSS_REL_LATER; + } + spin_unlock_bh(&css_killed->lock); mutex_unlock(&cgroup_mutex); }
@@ -5566,10 +5592,13 @@ static void css_killed_ref_fn(struct percpu_ref *ref) struct cgroup_subsys_state *css = container_of(ref, struct cgroup_subsys_state, refcnt);
+ spin_lock_bh(&css->lock); if (atomic_dec_and_test(&css->online_cnt)) { + css->flags |= CSS_KILL_ENQED; INIT_WORK(&css->destroy_work, css_killed_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work); } + spin_unlock_bh(&css->lock); }
/**
On 6/3/22 11:13, Tadeusz Struk wrote:
Syzbot found a corrupted list bug scenario that can be triggered from cgroup_subtree_control_write(cgrp). The reproduces writes to cgroup.subtree_control file, which invokes: cgroup_apply_control_enable()->css_create()->css_populate_dir(), which then fails with a fault injected -ENOMEM. In such scenario the css_killed_work_fn will be en-queued via cgroup_apply_control_disable(cgrp)->kill_css(css), and bail out to cgroup_kn_unlock(). Then cgroup_kn_unlock() will call: cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue css_release_work_fn for the same css instance, causing a list_add corruption bug, as can be seen in the syzkaller report [1].
Fix this by synchronizing the css ref_kill and css_release jobs. css_release() function will check if the css_killed_work_fn() has been scheduled for the css and only en-queue the css_release_work_fn() if css_killed_work_fn wasn't already en-queued. Otherwise css_release() will set the CSS_REL_LATER flag for that css. This will cause the css_release_work_fn() work to be executed after css_killed_work_fn() is finished.
Two scc flags have been introduced to implement this serialization mechanizm:
- CSS_KILL_ENQED, which will be set when css_killed_work_fn() is en-queued, and
- CSS_REL_LATER, which, if set, will cause the css_release_work_fn() to be scheduled after the css_killed_work_fn is finished.
There is also a new lock, which will protect the integrity of the css flags.
[1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465db...
This also fixes a similar, cgroup related list corrupt issue: https://syzkaller.appspot.com/bug?id=3c7ff113ccb695e839b859da3fc481c36eb1cfd...
Hello.
On Fri, Jun 03, 2022 at 11:13:21AM -0700, Tadeusz Struk tadeusz.struk@linaro.org wrote:
In such scenario the css_killed_work_fn will be en-queued via cgroup_apply_control_disable(cgrp)->kill_css(css), and bail out to cgroup_kn_unlock(). Then cgroup_kn_unlock() will call: cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue css_release_work_fn for the same css instance, causing a list_add corruption bug, as can be seen in the syzkaller report [1].
This hypothesis doesn't add up to me (I am sorry).
The kill_css(css) would be a css associated with a subsys (css.ss != NULL) whereas css_put(&cgrp->self) is a different css just for the cgroup (css.ss == NULL).
Michal
On 6/6/22 05:39, Michal Koutný wrote:
On Fri, Jun 03, 2022 at 11:13:21AM -0700, Tadeusz Struktadeusz.struk@linaro.org wrote:
In such scenario the css_killed_work_fn will be en-queued via cgroup_apply_control_disable(cgrp)->kill_css(css), and bail out to cgroup_kn_unlock(). Then cgroup_kn_unlock() will call: cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue css_release_work_fn for the same css instance, causing a list_add corruption bug, as can be seen in the syzkaller report [1].
This hypothesis doesn't add up to me (I am sorry).
The kill_css(css) would be a css associated with a subsys (css.ss != NULL) whereas css_put(&cgrp->self) is a different css just for the cgroup (css.ss == NULL).
Yes, you are right. I couldn't figure it out where the extra css_put() is called from, and the only place that fitted into my theory was from the cgroup_kn_unlock() in cgroup_apply_control_disable(). After some more debugging I can see that, as you said, the cgrp->self is a different css. The offending _put() is actually called by the percpu_ref_kill_and_confirm(), as it not only calls the passed confirm_kill percpu_ref_func_t, but also it puts the refcnt iself. Because the cgroup_apply_control_disable() will loop for_each_live_descendant, and call css_kill() on all css'es, and css_killed_work_fn() will also loop and call css_put() on all parents, the css_release() will be called on the first parent prematurely, causing the BUG(). What I think should be done to balance put/get is to call css_get() for all the parents in kill_css():
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index c1e1a5c34e77..3ca61325bc4e 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -5527,6 +5527,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref) */ static void kill_css(struct cgroup_subsys_state *css) { + struct cgroup_subsys_state *_css = css; + lockdep_assert_held(&cgroup_mutex);
if (css->flags & CSS_DYING) @@ -5541,10 +5543,13 @@ static void kill_css(struct cgroup_subsys_state *css) css_clear_dir(css);
/* - * Killing would put the base ref, but we need to keep it alive - * until after ->css_offline(). + * Killing would put the base ref, but we need to keep it alive, + * and all its parents, until after ->css_offline(). */ - css_get(css); + do { + css_get(_css); + _css = _css->parent; + } while (_css && atomic_read(&_css->online_cnt));
/* * cgroup core guarantees that, by the time ->css_offline() is
This will be then "reverted" in css_killed_work_fn() Please let me know if it makes sense to you. I'm still testing it, but syzbot is very slow today.
Greeting,
FYI, we noticed the following commit (built with gcc-11):
commit: 3c87862ca13147416d900cf82ca56bb2f23910bf ("[PATCH v2] cgroup: serialize css kill and release paths") url: https://github.com/intel-lab-lkp/linux/commits/Tadeusz-Struk/cgroup-serializ... base: https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git for-next patch link: https://lore.kernel.org/netdev/20220603181321.443716-1-tadeusz.struk@linaro....
in testcase: boot
on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
If you fix the issue, kindly add following tag Reported-by: kernel test robot oliver.sang@intel.com
[ 55.821003][ C1] WARNING: CPU: 1 PID: 1 at kernel/softirq.c:363 __local_bh_enable_ip (kernel/softirq.c:363) [ 55.822745][ C1] Modules linked in: fuse ip_tables [ 55.823837][ C1] CPU: 1 PID: 1 Comm: systemd Not tainted 5.18.0-rc5-00048-g3c87862ca131 #1 [ 55.825505][ C1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014 [ 55.827516][ C1] RIP: 0010:__local_bh_enable_ip (kernel/softirq.c:363) [ 55.828671][ C1] Code: 47 65 8b 05 b1 d9 a8 47 a9 00 ff ff 00 74 30 65 ff 0d a3 d9 a8 47 e8 9e c6 39 00 fb 5b 5d c3 65 8b 05 8f e2 a8 47 85 c0 75 b0 <0f> 0b eb ac e8 c6 c7 39 00 eb ad 48 89 ef e8 3c 89 16 00 eb b6 65 All code ======== 0: 47 rex.RXB 1: 65 8b 05 b1 d9 a8 47 mov %gs:0x47a8d9b1(%rip),%eax # 0x47a8d9b9 8: a9 00 ff ff 00 test $0xffff00,%eax d: 74 30 je 0x3f f: 65 ff 0d a3 d9 a8 47 decl %gs:0x47a8d9a3(%rip) # 0x47a8d9b9 16: e8 9e c6 39 00 callq 0x39c6b9 1b: fb sti 1c: 5b pop %rbx 1d: 5d pop %rbp 1e: c3 retq 1f: 65 8b 05 8f e2 a8 47 mov %gs:0x47a8e28f(%rip),%eax # 0x47a8e2b5 26: 85 c0 test %eax,%eax 28: 75 b0 jne 0xffffffffffffffda 2a:* 0f 0b ud2 <-- trapping instruction 2c: eb ac jmp 0xffffffffffffffda 2e: e8 c6 c7 39 00 callq 0x39c7f9 33: eb ad jmp 0xffffffffffffffe2 35: 48 89 ef mov %rbp,%rdi 38: e8 3c 89 16 00 callq 0x168979 3d: eb b6 jmp 0xfffffffffffffff5 3f: 65 gs
Code starting with the faulting instruction =========================================== 0: 0f 0b ud2 2: eb ac jmp 0xffffffffffffffb0 4: e8 c6 c7 39 00 callq 0x39c7cf 9: eb ad jmp 0xffffffffffffffb8 b: 48 89 ef mov %rbp,%rdi e: e8 3c 89 16 00 callq 0x16894f 13: eb b6 jmp 0xffffffffffffffcb 15: 65 gs [ 55.832215][ C1] RSP: 0018:ffffc90000178d60 EFLAGS: 00010046 [ 55.833390][ C1] RAX: 0000000000000000 RBX: 0000000000000201 RCX: 1ffffffff7a3ca41 [ 55.834990][ C1] RDX: 0000000000000000 RSI: 0000000000000201 RDI: ffffffffb88332da [ 55.836576][ C1] RBP: ffffffffb88332da R08: 0000000000000000 R09: ffff8881c0af245b [ 55.838123][ C1] R10: ffffed103815e48b R11: 0000000000000001 R12: 0000000000000008 [ 55.839742][ C1] R13: ffff8881a593a000 R14: dffffc0000000000 R15: ffff8881c0af2418 [ 55.841328][ C1] FS: 00007fd6571cc900(0000) GS:ffff88839d500000(0000) knlGS:0000000000000000 [ 55.843084][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 55.844302][ C1] CR2: 00007fd51c623028 CR3: 00000001bfc72000 CR4: 00000000000406e0 [ 55.845905][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 55.847510][ C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 55.849058][ C1] Call Trace: [ 55.849769][ C1] <IRQ> [ 55.850460][ C1] put_css_set_locked (include/linux/percpu-refcount.h:335 include/linux/percpu-refcount.h:351 include/linux/cgroup.h:404 kernel/cgroup/cgroup.c:971 kernel/cgroup/cgroup.c:955) [ 55.851549][ C1] cgroup_free (include/linux/spinlock.h:404 kernel/cgroup/cgroup-internal.h:212 kernel/cgroup/cgroup-internal.h:198 kernel/cgroup/cgroup.c:6480) [ 55.852470][ C1] __put_task_struct (kernel/fork.c:842) [ 55.853436][ C1] rcu_do_batch (include/linux/rcupdate.h:273 kernel/rcu/tree.c:2537) [ 55.854361][ C1] ? lock_is_held_type (kernel/locking/lockdep.c:5382 kernel/locking/lockdep.c:5684) [ 55.855400][ C1] ? rcu_implicit_dynticks_qs (kernel/rcu/tree.c:2474) [ 55.856581][ C1] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4493) [ 55.857836][ C1] ? note_gp_changes (arch/x86/include/asm/irqflags.h:45 (discriminator 1) arch/x86/include/asm/irqflags.h:80 (discriminator 1) arch/x86/include/asm/irqflags.h:138 (discriminator 1) kernel/rcu/tree.c:1698 (discriminator 1)) [ 55.858828][ C1] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 (discriminator 22)) [ 55.859889][ C1] rcu_core (kernel/rcu/tree.c:2788) [ 55.860786][ C1] __do_softirq (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/irq.h:142 kernel/softirq.c:559) [ 55.861747][ C1] __irq_exit_rcu (kernel/softirq.c:432 kernel/softirq.c:637) [ 55.862673][ C1] irq_exit_rcu (kernel/softirq.c:651) [ 55.863601][ C1] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1097 (discriminator 14)) [ 55.864709][ C1] </IRQ> [ 55.865399][ C1] <TASK> [ 55.866087][ C1] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:645) [ 55.867251][ C1] RIP: 0010:walk_component (fs/namei.c:2027) [ 55.868365][ C1] Code: c7 43 08 00 00 00 00 48 8b 84 24 80 00 00 00 65 48 2b 04 25 28 00 00 00 0f 85 99 03 00 00 48 81 c4 88 00 00 00 4c 89 e0 5b 5d <41> 5c 41 5d 41 5e 41 5f c3 48 b8 00 00 00 00 00 fc ff df 48 8d 7d All code ======== 0: c7 43 08 00 00 00 00 movl $0x0,0x8(%rbx) 7: 48 8b 84 24 80 00 00 mov 0x80(%rsp),%rax e: 00 f: 65 48 2b 04 25 28 00 sub %gs:0x28,%rax 16: 00 00 18: 0f 85 99 03 00 00 jne 0x3b7 1e: 48 81 c4 88 00 00 00 add $0x88,%rsp 25: 4c 89 e0 mov %r12,%rax 28: 5b pop %rbx 29: 5d pop %rbp 2a:* 41 5c pop %r12 <-- trapping instruction 2c: 41 5d pop %r13 2e: 41 5e pop %r14 30: 41 5f pop %r15 32: c3 retq 33: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax 3a: fc ff df 3d: 48 rex.W 3e: 8d .byte 0x8d 3f: 7d .byte 0x7d
Code starting with the faulting instruction =========================================== 0: 41 5c pop %r12 2: 41 5d pop %r13 4: 41 5e pop %r14 6: 41 5f pop %r15 8: c3 retq 9: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax 10: fc ff df 13: 48 rex.W 14: 8d .byte 0x8d 15: 7d .byte 0x7d [ 55.871899][ C1] RSP: 0018:ffffc9000001f988 EFLAGS: 00000286 [ 55.873081][ C1] RAX: 0000000000000000 RBX: 000000063dbe831e RCX: ffffc9000001f690 [ 55.874697][ C1] RDX: 0000000000000000 RSI: ffffffffbc798f20 RDI: ffffc9000001fbe0 [ 55.876281][ C1] RBP: 0000000000000006 R08: ffff888100310dd8 R09: ffffffffbd1de4e7 [ 55.877803][ C1] R10: ffffed102003531b R11: 0000000000000001 R12: 0000000000000000 [ 55.879386][ C1] R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000 [ 55.881044][ C1] link_path_walk+0x563/0xc80 [ 55.882258][ C1] ? path_init (fs/namei.c:2411) [ 55.883192][ C1] ? __raw_spin_lock_init (kernel/locking/spinlock_debug.c:26) [ 55.884260][ C1] ? open_last_lookups (fs/namei.c:2265) [ 55.885290][ C1] ? debug_mutex_init (kernel/locking/mutex-debug.c:89) [ 55.886305][ C1] ? __alloc_file (fs/file_table.c:153) [ 55.887298][ C1] path_openat (fs/namei.c:3605) [ 55.888247][ C1] ? __lock_acquire (kernel/locking/lockdep.c:5029) [ 55.889229][ C1] ? path_lookupat (fs/namei.c:3591) [ 55.890215][ C1] do_filp_open (fs/namei.c:3636) [ 55.893219][ C1] ? alloc_fd (fs/file.c:555 (discriminator 10)) [ 55.894154][ C1] ? may_open_dev (fs/namei.c:3630) [ 55.895168][ C1] ? __lock_release (kernel/locking/lockdep.c:5317) [ 55.896219][ C1] ? lock_is_held_type (kernel/locking/lockdep.c:5382 kernel/locking/lockdep.c:5684) [ 55.897250][ C1] ? alloc_fd (fs/file.c:555 (discriminator 10)) [ 55.898122][ C1] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:85 include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:186) [ 55.899114][ C1] ? alloc_fd (fs/file.c:555 (discriminator 10)) [ 55.900031][ C1] ? getname_flags (fs/namei.c:204) [ 55.901118][ C1] do_sys_openat2 (fs/open.c:1213) [ 55.902081][ C1] ? rcu_nocb_cb_kthread (kernel/rcu/tree_nocb.h:382) [ 55.903138][ C1] ? mntput_no_expire (fs/namespace.c:1208) [ 55.904198][ C1] ? lock_is_held_type (kernel/locking/lockdep.c:5382 kernel/locking/lockdep.c:5684) [ 55.905231][ C1] ? build_open_flags (fs/open.c:1199) [ 55.906242][ C1] ? call_rcu (arch/x86/include/asm/irqflags.h:45 arch/x86/include/asm/irqflags.h:80 arch/x86/include/asm/irqflags.h:138 kernel/rcu/tree.c:3109) [ 55.907155][ C1] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 (discriminator 22)) [ 55.908186][ C1] ? call_rcu (arch/x86/include/asm/irqflags.h:45 arch/x86/include/asm/irqflags.h:80 arch/x86/include/asm/irqflags.h:138 kernel/rcu/tree.c:3109) [ 55.909104][ C1] __x64_sys_openat (fs/open.c:1240) [ 55.910087][ C1] ? __ia32_compat_sys_open (fs/open.c:1240) [ 55.911242][ C1] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4501) [ 55.912481][ C1] ? syscall_enter_from_user_mode (arch/x86/include/asm/irqflags.h:45 arch/x86/include/asm/irqflags.h:80 kernel/entry/common.c:109) [ 55.913602][ C1] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 (discriminator 22)) [ 55.914638][ C1] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) [ 55.915572][ C1] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4501) [ 55.916880][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87) [ 55.917809][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87) [ 55.918770][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87) [ 55.919737][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87) [ 55.920694][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87)
To reproduce:
# build kernel cd linux cp config-5.18.0-rc5-00048-g3c87862ca131 .config make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install cd <mod-install-dir> find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz
git clone https://github.com/intel/lkp-tests.git cd lkp-tests bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email
# if come across any failure that blocks the test, # please remove ~/.lkp and /lkp dir to run from a clean state.
On 6/9/22 01:56, kernel test robot wrote:
Greeting,
FYI, we noticed the following commit (built with gcc-11):
commit: 3c87862ca13147416d900cf82ca56bb2f23910bf ("[PATCH v2] cgroup: serialize css kill and release paths") url:https://github.com/intel-lab-lkp/linux/commits/Tadeusz-Struk/cgroup-serializ... base:https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git for-next patch link:https://lore.kernel.org/netdev/20220603181321.443716-1-tadeusz.struk@linaro....
in testcase: boot
on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
If you fix the issue, kindly add following tag Reported-by: kernel test robotoliver.sang@intel.com
[ 55.821003][ C1] WARNING: CPU: 1 PID: 1 at kernel/softirq.c:363 __local_bh_enable_ip (kernel/softirq.c:363)
Looks like that will need to be spin_lock_irq(&css->lock) instead of spin_lock_bh(&css->lock) I can respin the patch, but I would like to request some feedback on it first.
Tejun, Michal Are you interested in fixing this at syzbot issue all? Do you have any more feedback on this?
Hello Tadeusz.
On Thu, Jun 09, 2022 at 07:30:41AM -0700, Tadeusz Struk tadeusz.struk@linaro.org wrote:
Are you interested in fixing this at syzbot issue all?
The (original) syzbot report is conditioned by allocation failure that's unlikely under normal conditions (AFAIU). Hence I don't treat it extra high urgent. OTOH, it's interesting and it points to some disparity worth fixing -- so I try helping (as time permits, so far I can only run the reproducers via the syzbot).
Do you have any more feedback on this?
Ad the patch v2 with spinlock per css -- that looks like an overkill to me, I didn't look deeper into it.
Ad the in-thread patch with ancestry css_get(), the ->parent ref: - is inc'd in init_and_link_css(), - is dec'd in css_free_rwork_fn() and thanks to ->online_cnt offlining is ordered (child before parent).
Where does your patch dec these ancestry references? (Or why would they be missing in the first place?)
Thanks for digging into this! Michal
linux-stable-mirror@lists.linaro.org