In the cache-miss code path of a cached device, if a proper location
from the internal B+ tree is matched for a cache-miss range,
cached_dev_cache_miss() will be called from cache_lookup_fn() in the
following code block,
[code block 1]
526 unsigned int sectors = KEY_INODE(k) == s->iop.inode
527 ? min_t(uint64_t, INT_MAX,
528 KEY_START(k) - bio->bi_iter.bi_sector)
529 : INT_MAX;
530 int ret = s->d->cache_miss(b, s, bio, sectors);
Here s->d->cache_miss() is the callback function pointer initialized to
cached_dev_cache_miss(), and the last parameter 'sectors' is an
important hint for calculating the size of the read request sent to the
backing device for the missing cache data.
The current calculation in the above code block may generate an
oversized value of 'sectors', which consequently may trigger two
different kernel panics, by BUG() or BUG_ON(), as listed below,
1) BUG_ON() inside bch_btree_insert_key(),
[code block 2]
886 BUG_ON(b->ops->is_extents && !KEY_SIZE(k));
2) BUG() inside biovec_slab(),
[code block 3]
51 default:
52 BUG();
53 return NULL;
Both of the above panics originate from cached_dev_cache_miss() being
called with the oversized parameter 'sectors'.
Inside cached_dev_cache_miss(), the parameter 'sectors' is used to
calculate the size of the data read from the backing device for the
cache miss. This size is stored in s->insert_bio_sectors by the
following line of code,
[code block 4]
909 s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada);
Then the actual key to be inserted into the internal B+ tree is
generated and stored in s->iop.replace_key by the following lines of
code,
[code block 5]
911 s->iop.replace_key = KEY(s->iop.inode,
912 bio->bi_iter.bi_sector + s->insert_bio_sectors,
913 s->insert_bio_sectors);
The oversized parameter 'sectors' may trigger panic 1), the BUG_ON() in
code block 2, via the key generated in the above code block.
And the bio sent to the backing device for the missing data is
allocated with a size hint from s->insert_bio_sectors by the following
lines of code,
[code block 6]
926 cache_bio = bio_alloc_bioset(GFP_NOWAIT,
927 DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS),
928 &dc->disk.bio_split);
The oversized parameter 'sectors' may trigger panic 2), the BUG() in
code block 3, via the allocation in the above code block.
Now let me explain how the panics happen with the oversized 'sectors'.
In code block 5, replace_key is generated by macro KEY(). From the
definition of macro KEY(),
[code block 7]
71 #define KEY(inode, offset, size) \
72 ((struct bkey) { \
73 .high = (1ULL << 63) | ((__u64) (size) << 20) | (inode), \
74 .low = (offset) \
75 })
Here 'size' is a 16-bit field embedded in the 64-bit member 'high' of
struct bkey. But in code block 1, "KEY_START(k) - bio->bi_iter.bi_sector"
may very well be larger than (1 << 16) - 1, which makes the bkey size
calculation in code block 5 overflow. In one bug report the value of
parameter 'sectors' is 131072 (= 1 << 17); the overflowed 'sectors'
results in an overflowed s->insert_bio_sectors in code block 4, which
then makes the size field of s->iop.replace_key 0 in code block 5. Then
the 0-sized s->iop.replace_key is inserted into the internal B+ tree as
the cache-miss check key (a special key to detect and avoid a race
between a normal write request and a cache-miss read request) as,
[code block 8]
915 ret = bch_btree_insert_check_key(b, &s->op, &s->iop.replace_key);
Then the 0-sized s->iop.replace_key, passed as the 3rd parameter,
triggers the bkey size check BUG_ON() in code block 2 and causes
kernel panic 1).
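To make the truncation concrete, below is a minimal user-space sketch
that mirrors the KEY() macro from code block 7. The KEY_SIZE()
extraction here is an assumption modeled on the 20-bit shift in that
macro and on KEY_SIZE_BITS being 16; it is not copied from the bcache
source.

#include <stdint.h>
#include <stdio.h>

#define KEY_SIZE_BITS	16
/* Mirrors code block 7: 'size' lands at bits 20..35 of 'high'. */
#define KEY_HIGH(inode, size) \
	((1ULL << 63) | ((uint64_t)(size) << 20) | (inode))
/* Assumed extraction: read back the 16 bits starting at bit 20. */
#define KEY_SIZE(high) \
	(((high) >> 20) & ((1ULL << KEY_SIZE_BITS) - 1))

int main(void)
{
	/* 131072 = 1 << 17, the value from the bug report */
	uint64_t high = KEY_HIGH(0, 131072);

	/* Bit 17 of 'size' falls outside the 16-bit field, so the
	 * stored size reads back as 0, which is exactly what the
	 * BUG_ON() in code block 2 checks for.
	 */
	printf("stored size = %llu\n", (unsigned long long)KEY_SIZE(high));
	return 0;
}

Running this prints "stored size = 0", matching the 0-sized
replace_key described above.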
The other kernel panic comes from code block 6, caused by an oversized
s->insert_bio_sectors value from code block 4,
min(sectors, bio_sectors(bio) + reada)
There are two possibilities for an oversized result,
- bio_sectors(bio) is valid, but bio_sectors(bio) + reada is oversized.
- sectors < bio_sectors(bio) + reada, but sectors is oversized.
From a bug report, the result of "DIV_ROUND_UP(s->insert_bio_sectors,
PAGE_SECTORS)" in code block 6 can be 344, 282, 946, 342 and many
other values which are larger than BIO_MAX_VECS (i.e. 256). When
bio_alloc_bioset() is called with such a larger-than-256 value as the
2nd parameter, the value is eventually passed to biovec_slab() as
parameter 'nr_vecs' along the following code path,
bio_alloc_bioset() ==> bvec_alloc() ==> biovec_slab()
Because parameter 'nr_vecs' is larger than 256, the panic by BUG() in
code block 3 is triggered inside biovec_slab().
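For the same reported value, the bvecs arithmetic can be checked with a
small user-space sketch; PAGE_SECTORS == 8 (4 KiB pages, 512-byte
sectors) and the 256 limit are assumptions for illustration, not taken
from kernel headers:

#include <stdio.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
#define PAGE_SECTORS		8	/* assumes 4 KiB pages */
#define BIO_MAX_VECS		256

int main(void)
{
	/* the oversized s->insert_bio_sectors from the report */
	unsigned int insert_bio_sectors = 131072;
	unsigned int nr_vecs = DIV_ROUND_UP(insert_bio_sectors, PAGE_SECTORS);

	/* 16384 bvecs requested; anything above 256 falls through to
	 * the default case in biovec_slab() and hits BUG().
	 */
	printf("nr_vecs = %u (limit %u)\n", nr_vecs, BIO_MAX_VECS);
	return 0;
}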
From the above analysis, we know that the 4th parameter 'sectors'
passed into cached_dev_cache_miss() may cause overflows in code blocks
5 and 6, and finally kernel panics via code blocks 2 and 3. And if the
result of bio_sectors(bio) + reada exceeds the valid bvecs number, it
may also trigger the kernel panic in code block 3 from code block 6.
In this patch, the above two panics are avoided by the following
changes,
- If DIV_ROUND_UP(bio_sectors(bio) + reada, PAGE_SECTORS) exceeds the
maximum bvecs count, reduce reada to make sure the DIV_ROUND_UP()
result won't generate an oversized s->insert_bio_sectors, which would
pass an invalid bvecs number to cache_bio.
- If sectors exceeds the maximum bkey size, clamp sectors to the
maximum valid bkey size.
With the above changes, the size value in the KEY() macro in code
block 5 will always be in the valid range. Likewise, in code block 6
the nr_iovecs parameter of bio_alloc_bioset(), calculated by
DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS), will always be a
valid bvecs number. Now neither panic can happen anymore.
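As a quick sanity check of the fixed arithmetic, the following
user-space sketch replays both clamps; the input values and the
256-segment cap of bio_max_segs() are assumptions for illustration:

#include <stdio.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
#define PAGE_SECTORS		8	/* assumes 4 KiB pages */
#define KEY_SIZE_BITS		16
#define MAX_SEGS		256	/* assumed bio_max_segs() cap */

int main(void)
{
	/* hypothetical request: oversized sectors, large readahead */
	unsigned int sectors = 131072, bio_sectors = 64, reada = 4096;
	unsigned int nr_bvecs = DIV_ROUND_UP(bio_sectors + reada, PAGE_SECTORS);

	/* first change: shrink reada so the bvecs count fits */
	if (nr_bvecs > MAX_SEGS)
		reada = MAX_SEGS * PAGE_SECTORS - bio_sectors;

	/* second change: clamp sectors to the maximum bkey size */
	if (sectors > (1 << KEY_SIZE_BITS) - 1)
		sectors = (1 << KEY_SIZE_BITS) - 1;

	unsigned int insert = sectors < bio_sectors + reada ?
			      sectors : bio_sectors + reada;

	/* prints insert_bio_sectors=2048 nr_bvecs=256: a nonzero bkey
	 * size and a valid bvecs number, so neither panic can trigger
	 */
	printf("insert_bio_sectors=%u nr_bvecs=%u\n",
	       insert, DIV_ROUND_UP(insert, PAGE_SECTORS));
	return 0;
}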
The current problematic code can be partially found since Linux
v5.13-rc1; therefore all maintained stable kernels should try to apply
this fix.
Reported-by: Diego Ercolani <diego.ercolani(a)gmail.com>
Reported-by: Jan Szubiak <jan.szubiak(a)linuxpolska.pl>
Reported-by: Marco Rebhan <me(a)dblsaiko.net>
Reported-by: Matthias Ferdinand <bcache(a)mfedv.net>
Reported-by: Thorsten Knabe <linux(a)thorsten-knabe.de>
Reported-by: Victor Westerhuis <victor(a)westerhu.is>
Reported-by: Vojtech Pavlik <vojtech(a)suse.cz>
Signed-off-by: Coly Li <colyli(a)suse.de>
Cc: stable(a)vger.kernel.org
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Kent Overstreet <kent.overstreet(a)gmail.com>
Cc: Takashi Iwai <tiwai(a)suse.com>
---
Changelog:
v4, do not access BIO_MAX_VECS directly; reduce the reada value to
avoid an oversized bvecs number, following a hint from Christoph Hellwig.
v3, fix typo in v2.
v2, fix the bypass bio size calculation in v1.
v1, the initial version
drivers/md/bcache/request.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 29c231758293..054948f037ed 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -883,6 +883,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
unsigned int reada = 0;
struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
struct bio *miss, *cache_bio;
+ unsigned int nr_bvecs, max_segs;
s->cache_missed = 1;
@@ -899,6 +900,24 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
get_capacity(bio->bi_bdev->bd_disk) -
bio_end_sector(bio));
+ /*
+ * If "bio_sectors(bio) + reada" may causes an oversized bio bvecs
+ * number, reada size must be deducted to make sure the following
+ * calculated s->insert_bio_sectors won't cause oversized bvecs number
+ * to cache_bio.
+ */
+ nr_bvecs = DIV_ROUND_UP(bio_sectors(bio) + reada, PAGE_SECTORS);
+ max_segs = bio_max_segs(nr_bvecs);
+ if (nr_bvecs > max_segs)
+ reada = max_segs * PAGE_SECTORS - bio_sectors(bio);
+
+ /*
+ * Make sure sectors won't exceed (1 << KEY_SIZE_BITS) - 1, which is
+ * the maximum bkey size in units of sectors. Then s->insert_bio_sectors
+ * will always stay within a valid bkey size range.
+ */
+ if (sectors > ((1 << KEY_SIZE_BITS) - 1))
+ sectors = (1 << KEY_SIZE_BITS) - 1;
s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada);
s->iop.replace_key = KEY(s->iop.inode,
--
2.26.2
Hi Greg,
Can you please cherry-pick this to LTS:
commit bf53fc6b5f415cddc7118091cb8fd6a211b2320d
Author: Jan Kratochvil <jan.kratochvil(a)redhat.com>
Date: Fri Dec 4 09:17:02 2020 -0300
perf unwind: Fix separate debug info files when using elfutils' libdw's unwinder
-Tommi
The patch titled
Subject: kthread: fix kthread_mod_delayed_work vs kthread_cancel_delayed_work_sync race
has been added to the -mm tree. Its filename is
kthread-fix-kthread_mod_delayed_work-vs-kthread_cancel_delayed_work_sync-race.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/kthread-fix-kthread_mod_delayed_w…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/kthread-fix-kthread_mod_delayed_w…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Martin Liu <liumartin(a)google.com>
Subject: kthread: fix kthread_mod_delayed_work vs kthread_cancel_delayed_work_sync race
We encountered a system hang issue while running tests. The call stack
is as follows,
schedule+0x80/0x100
schedule_timeout+0x48/0x138
wait_for_common+0xa4/0x134
wait_for_completion+0x1c/0x2c
kthread_flush_work+0x114/0x1cc
kthread_cancel_work_sync.llvm.16514401384283632983+0xe8/0x144
kthread_cancel_delayed_work_sync+0x18/0x2c
xxxx_pm_notify+0xb0/0xd8
blocking_notifier_call_chain_robust+0x80/0x194
pm_notifier_call_chain_robust+0x28/0x4c
suspend_prepare+0x40/0x260
enter_state+0x80/0x3f4
pm_suspend+0x60/0xdc
state_store+0x108/0x144
kobj_attr_store+0x38/0x88
sysfs_kf_write+0x64/0xc0
kernfs_fop_write_iter+0x108/0x1d0
vfs_write+0x2f4/0x368
ksys_write+0x7c/0xec
When we started investigating, we found a race between
kthread_mod_delayed_work and kthread_cancel_delayed_work_sync. The
result of the race can be reproduced simply with a
kthread_mod_delayed_work followed by a kthread_flush_work call.

The thing is, __kthread_cancel_work releases the worker spinlock taken
by kthread_mod_delayed_work, which opens a race window for
kthread_cancel_delayed_work_sync to change the canceling count (used to
prevent the dwork from being requeued) before kthread_flush_work is
called. However, we don't check the canceling count after returning
from __kthread_cancel_work, and we then insert the dwork into the
worker. As a result, the following kthread_flush_work inserts the flush
work at the tail of the dwork, which sits on the worker's
delayed_work_list. Therefore the flush work will never get moved to the
worker's work_list to be executed. Finally, kthread_cancel_delayed_work_sync
will NOT be able to complete and waits forever. The code sequence
diagram is as follows,
Thread A                                Thread B

kthread_mod_delayed_work
 spin_lock
 __kthread_cancel_work
  canceling = 1
  spin_unlock
                                        kthread_cancel_delayed_work_sync
                                         spin_lock
                                         kthread_cancel_work
                                          canceling = 2
                                         spin_unlock
                                         del_timer_sync
 spin_lock
 canceling = 1       // canceling count was updated by
                     // Thread B in the meantime
 queue_delayed_work  // dwork is put into the worker's
                     // delayed_work_list without checking
                     // the canceling count
 spin_unlock
                                        kthread_flush_work
                                         spin_lock
                                         insert flush work   // at the tail of the dwork,
                                                             // which is on the worker's
                                                             // delayed_work_list
                                         spin_unlock
                                         wait_for_completion // Thread B stuck here: the
                                                             // flush work will never
                                                             // get executed
The canceling count can change in __kthread_cancel_work because the
spinlock is released and regained in between; let's check the count
again before we queue the delayed work to avoid the race.
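As a rough illustration of the reproducer mentioned above (a sketch
only; the worker, the work item, and the two thread bodies below are
hypothetical, not taken from the report):

#include <linux/kthread.h>
#include <linux/jiffies.h>

static struct kthread_worker *worker;		/* hypothetical */
static struct kthread_delayed_work dwork;	/* hypothetical */

/* Thread A: requeue the delayed work. __kthread_cancel_work() inside
 * drops and retakes the worker lock, opening the race window shown in
 * the diagram above.
 */
static void thread_a(void)
{
	kthread_mod_delayed_work(worker, &dwork, msecs_to_jiffies(10));
}

/* Thread B: cancel-and-flush concurrently. If Thread A requeues the
 * dwork without rechecking work->canceling, the flush work is chained
 * behind an entry stuck on the delayed_work_list and this call never
 * returns.
 */
static void thread_b(void)
{
	kthread_cancel_delayed_work_sync(&dwork);
}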
Link: https://lkml.kernel.org/r/20210513065458.941403-1-liumartin@google.com
Fixes: 37be45d49dec2 ("kthread: allow to cancel kthread work")
Signed-off-by: Martin Liu <liumartin(a)google.com>
Tested-by: David Chao <davidchao(a)google.com>
Reviewed-by: Petr Mladek <pmladek(a)suse.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: "Paul E. McKenney" <paulmck(a)linux.vnet.ibm.com>
Cc: Josh Triplett <josh(a)joshtriplett.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Jiri Kosina <jkosina(a)suse.cz>
Cc: Borislav Petkov <bp(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.cz>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Nathan Chancellor <nathan(a)kernel.org>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: <jenhaochen(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/kthread.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
--- a/kernel/kthread.c~kthread-fix-kthread_mod_delayed_work-vs-kthread_cancel_delayed_work_sync-race
+++ a/kernel/kthread.c
@@ -1181,6 +1181,19 @@ bool kthread_mod_delayed_work(struct kth
goto out;
ret = __kthread_cancel_work(work, true, &flags);
+
+ /*
+ * Canceling could run in parallel from kthread_cancel_delayed_work_sync
+ * and change work's canceling count while the spinlock is released and
+ * regained in __kthread_cancel_work, so we need to check the count
+ * again. Otherwise, we might incorrectly queue the dwork and cause the
+ * cancel_delayed_work_sync thread to wait for the flush dwork endlessly.
+ */
+ if (work->canceling) {
+ ret = false;
+ goto out;
+ }
+
fast_queue:
__kthread_queue_delayed_work(worker, dwork, delay);
out:
_
Patches currently in -mm which might be from liumartin(a)google.com are
kthread-fix-kthread_mod_delayed_work-vs-kthread_cancel_delayed_work_sync-race.patch
If fuse_fill_super_submount() returns an error, the error path
triggers a crash:
[ 26.206673] BUG: kernel NULL pointer dereference, address: 0000000000000000
[...]
[ 26.226362] RIP: 0010:__list_del_entry_valid+0x25/0x90
[...]
[ 26.247938] Call Trace:
[ 26.248300] fuse_mount_remove+0x2c/0x70 [fuse]
[ 26.248892] virtio_kill_sb+0x22/0x160 [virtiofs]
[ 26.249487] deactivate_locked_super+0x36/0xa0
[ 26.250077] fuse_dentry_automount+0x178/0x1a0 [fuse]
The crash happens because fuse_mount_remove() assumes that the FUSE
mount was already added to the list under the FUSE connection, but this
is only done after fuse_fill_super_submount() has returned success.
This means that until fuse_fill_super_submount() has returned success,
the FUSE mount isn't actually owned by the superblock. We should thus
reclaim ownership by clearing sb->s_fs_info, which will skip the call
to fuse_mount_remove(), and perform a rollback, as virtio_fs_get_tree()
already does for the root sb.
Fixes: bf109c64040f ("fuse: implement crossmounts")
Cc: mreitz(a)redhat.com
Cc: stable(a)vger.kernel.org # v5.10+
Signed-off-by: Greg Kurz <groug(a)kaod.org>
---
fs/fuse/dir.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 1b6c001a7dd1..01559061cbfb 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -339,8 +339,12 @@ static struct vfsmount *fuse_dentry_automount(struct path *path)
/* Initialize superblock, making @mp_fi its root */
err = fuse_fill_super_submount(sb, mp_fi);
- if (err)
+ if (err) {
+ fuse_conn_put(fc);
+ kfree(fm);
+ sb->s_fs_info = NULL;
goto out_put_sb;
+ }
sb->s_flags |= SB_ACTIVE;
fsc->root = dget(sb->s_root);
--
2.31.1
On Wed, 26 May 2021 22:18:31 +0800, Zenghui Yu wrote:
> Commit 26778aaa134a ("KVM: arm64: Commit pending PC adjustemnts before
> returning to userspace") fixed the PC updating issue by forcing an explicit
> synchronisation of the exception state on vcpu exit to userspace.
>
> However, we forgot to take into account the case where immediate_exit is
> set by userspace and KVM_RUN will exit immediately. Fix it by resolving all
> pending PC updates before returning to userspace.
>
> [...]
Applied to fixes, thanks!
[1/1] KVM: arm64: Resolve all pending PC updates before immediate exit
commit: e3e880bb1518eb10a4b4bb4344ed614d6856f190
with:
Fixes: 26778aaa134a ("KVM: arm64: Commit pending PC adjustemnts before returning to userspace")
Cc: stable(a)vger.kernel.org # 5.11
Cheers,
M.
--
Without deviation from the norm, progress is not possible.
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d1d90dd27254c44d087ad3f8b5b3e4fff0571f45 Mon Sep 17 00:00:00 2001
From: Jack Pham <jackp(a)codeaurora.org>
Date: Wed, 28 Apr 2021 02:01:10 -0700
Subject: [PATCH] usb: dwc3: gadget: Enable suspend events
commit 72704f876f50 ("dwc3: gadget: Implement the suspend entry event
handler") introduced (nearly 5 years ago!) an interrupt handler for
U3/L1-L2 suspend events. The problem is that these events aren't
currently enabled in the DEVTEN register so the handler is never
even invoked. Fix this simply by enabling the corresponding bit
in dwc3_gadget_enable_irq() using the same revision check as found
in the handler.
Fixes: 72704f876f50 ("dwc3: gadget: Implement the suspend entry event handler")
Acked-by: Felipe Balbi <balbi(a)kernel.org>
Signed-off-by: Jack Pham <jackp(a)codeaurora.org>
Cc: stable <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20210428090111.3370-1-jackp@codeaurora.org
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index dd80e5ca8c78..cab3a9184068 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -2323,6 +2323,10 @@ static void dwc3_gadget_enable_irq(struct dwc3 *dwc)
if (DWC3_VER_IS_PRIOR(DWC3, 250A))
reg |= DWC3_DEVTEN_ULSTCNGEN;
+ /* On 2.30a and above this bit enables U3/L2-L1 Suspend Events */
+ if (!DWC3_VER_IS_PRIOR(DWC3, 230A))
+ reg |= DWC3_DEVTEN_EOPFEN;
+
dwc3_writel(dwc->regs, DWC3_DEVTEN, reg);
}