The patch titled
Subject: squashfs: fix divide error in calculate_skip()
has been removed from the -mm tree. Its filename was
squashfs-fix-divide-error-in-calculate_skip.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Phillip Lougher <phillip(a)squashfs.org.uk>
Subject: squashfs: fix divide error in calculate_skip()
Sysbot has reported a "divide error" which has been identified as being
caused by a corrupted file_size value within the file inode. This value
has been corrupted to a much larger value than expected.
Calculate_skip() is passed i_size_read(inode) >> msblk->block_log. Due to
the file_size value corruption this overflows the int argument/variable in
that function, leading to the divide error.
This patch changes the function to use u64. This will accommodate any
unexpectedly large values due to corruption.
The value returned from calculate_skip() is clamped to be never more than
SQUASHFS_CACHED_BLKS - 1, or 7. So file_size corruption does not lead to
an unexpectedly large return result here.
Link: https://lkml.kernel.org/r/20210507152618.9447-1-phillip@squashfs.org.uk
Signed-off-by: Phillip Lougher <phillip(a)squashfs.org.uk>
Reported-by: <syzbot+e8f781243ce16ac2f962(a)syzkaller.appspotmail.com>
Reported-by: <syzbot+7b98870d4fec9447b951(a)syzkaller.appspotmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/squashfs/file.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/fs/squashfs/file.c~squashfs-fix-divide-error-in-calculate_skip
+++ a/fs/squashfs/file.c
@@ -211,11 +211,11 @@ failure:
* If the skip factor is limited in this way then the file will use multiple
* slots.
*/
-static inline int calculate_skip(int blocks)
+static inline int calculate_skip(u64 blocks)
{
- int skip = blocks / ((SQUASHFS_META_ENTRIES + 1)
+ u64 skip = blocks / ((SQUASHFS_META_ENTRIES + 1)
* SQUASHFS_META_INDEXES);
- return min(SQUASHFS_CACHED_BLKS - 1, skip + 1);
+ return min((u64) SQUASHFS_CACHED_BLKS - 1, skip + 1);
}
_
Patches currently in -mm which might be from phillip(a)squashfs.org.uk are
The patch titled
Subject: mm/hugetlb: fix cow where page writable in child
has been removed from the -mm tree. Its filename was
mm-hugetlb-fix-cow-where-page-writtable-in-child.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Peter Xu <peterx(a)redhat.com>
Subject: mm/hugetlb: fix cow where page writtable in child
When rework early cow of pinned hugetlb pages, we moved huge_ptep_get()
upper but overlooked a side effect that the huge_ptep_get() will fetch the
pte after wr-protection. After moving it upwards, we need explicit
wr-protect of child pte or we will keep the write bit set in the child
process, which could cause data corrution where the child can write to the
original page directly.
This issue can also be exposed by "memfd_test hugetlbfs" kselftest.
Link: https://lkml.kernel.org/r/20210503234356.9097-3-peterx@redhat.com
Fixes: 4eae4efa2c299 ("hugetlb: do early cow when page pinned on src mm")
Signed-off-by: Peter Xu <peterx(a)redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Joel Fernandes (Google) <joel(a)joelfernandes.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 1 +
1 file changed, 1 insertion(+)
--- a/mm/hugetlb.c~mm-hugetlb-fix-cow-where-page-writtable-in-child
+++ a/mm/hugetlb.c
@@ -4056,6 +4056,7 @@ again:
* See Documentation/vm/mmu_notifier.rst
*/
huge_ptep_set_wrprotect(src, addr, src_pte);
+ entry = huge_pte_wrprotect(entry);
}
page_dup_rmap(ptepage, true);
_
Patches currently in -mm which might be from peterx(a)redhat.com are
mm-gup_benchmark-support-threading.patch
mm-gup-pack-has_pinned-in-mmf_has_pinned-fix.patch
userfaultfd-selftests-use-user-mode-only.patch
userfaultfd-selftests-remove-the-time-check-on-delayed-uffd.patch
userfaultfd-selftests-dropping-verify-check-in-locking_thread.patch
userfaultfd-selftests-only-dump-counts-if-mode-enabled.patch
userfaultfd-selftests-unify-error-handling.patch
mm-thp-simplify-copying-of-huge-zero-page-pmd-when-fork.patch
mm-userfaultfd-fix-uffd-wp-special-cases-for-fork.patch
mm-userfaultfd-fix-a-few-thp-pmd-missing-uffd-wp-bit.patch
mm-userfaultfd-fail-uffd-wp-registeration-if-not-supported.patch
mm-pagemap-export-uffd-wp-protection-information.patch
userfaultfd-selftests-add-pagemap-uffd-wp-test.patch
The patch titled
Subject: mm/hugetlb: fix F_SEAL_FUTURE_WRITE
has been removed from the -mm tree. Its filename was
mm-hugetlb-fix-f_seal_future_write.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Peter Xu <peterx(a)redhat.com>
Subject: mm/hugetlb: fix F_SEAL_FUTURE_WRITE
Patch series "mm/hugetlb: Fix issues on file sealing and fork", v2.
Hugh reported issue with F_SEAL_FUTURE_WRITE not applied correctly to
hugetlbfs, which I can easily verify using the memfd_test program, which
seems that the program is hardly run with hugetlbfs pages (as by default
shmem).
Meanwhile I found another probably even more severe issue on that hugetlb
fork won't wr-protect child cow pages, so child can potentially write to
parent private pages. Patch 2 addresses that.
After this series applied, "memfd_test hugetlbfs" should start to pass.
This patch (of 2):
F_SEAL_FUTURE_WRITE is missing for hugetlb starting from the first day.
There is a test program for that and it fails constantly.
$ ./memfd_test hugetlbfs
memfd-hugetlb: CREATE
memfd-hugetlb: BASIC
memfd-hugetlb: SEAL-WRITE
memfd-hugetlb: SEAL-FUTURE-WRITE
mmap() didn't fail as expected
Aborted (core dumped)
I think it's probably because no one is really running the hugetlbfs test.
Fix it by checking FUTURE_WRITE also in hugetlbfs_file_mmap() as what we
do in shmem_mmap(). Generalize a helper for that.
Link: https://lkml.kernel.org/r/20210503234356.9097-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210503234356.9097-2-peterx@redhat.com
Fixes: ab3948f58ff84 ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd")
Signed-off-by: Peter Xu <peterx(a)redhat.com>
Reported-by: Hugh Dickins <hughd(a)google.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Joel Fernandes (Google) <joel(a)joelfernandes.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/hugetlbfs/inode.c | 5 +++++
include/linux/mm.h | 32 ++++++++++++++++++++++++++++++++
mm/shmem.c | 22 ++++------------------
3 files changed, 41 insertions(+), 18 deletions(-)
--- a/fs/hugetlbfs/inode.c~mm-hugetlb-fix-f_seal_future_write
+++ a/fs/hugetlbfs/inode.c
@@ -131,6 +131,7 @@ static void huge_pagevec_release(struct
static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
{
struct inode *inode = file_inode(file);
+ struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
loff_t len, vma_len;
int ret;
struct hstate *h = hstate_file(file);
@@ -146,6 +147,10 @@ static int hugetlbfs_file_mmap(struct fi
vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
vma->vm_ops = &hugetlb_vm_ops;
+ ret = seal_check_future_write(info->seals, vma);
+ if (ret)
+ return ret;
+
/*
* page based offset in vm_pgoff could be sufficiently large to
* overflow a loff_t when converted to byte offset. This can
--- a/include/linux/mm.h~mm-hugetlb-fix-f_seal_future_write
+++ a/include/linux/mm.h
@@ -3216,5 +3216,37 @@ void mem_dump_obj(void *object);
static inline void mem_dump_obj(void *object) {}
#endif
+/**
+ * seal_check_future_write - Check for F_SEAL_FUTURE_WRITE flag and handle it
+ * @seals: the seals to check
+ * @vma: the vma to operate on
+ *
+ * Check whether F_SEAL_FUTURE_WRITE is set; if so, do proper check/handling on
+ * the vma flags. Return 0 if check pass, or <0 for errors.
+ */
+static inline int seal_check_future_write(int seals, struct vm_area_struct *vma)
+{
+ if (seals & F_SEAL_FUTURE_WRITE) {
+ /*
+ * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
+ * "future write" seal active.
+ */
+ if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
+ return -EPERM;
+
+ /*
+ * Since an F_SEAL_FUTURE_WRITE sealed memfd can be mapped as
+ * MAP_SHARED and read-only, take care to not allow mprotect to
+ * revert protections on such mappings. Do this only for shared
+ * mappings. For private mappings, don't need to mask
+ * VM_MAYWRITE as we still want them to be COW-writable.
+ */
+ if (vma->vm_flags & VM_SHARED)
+ vma->vm_flags &= ~(VM_MAYWRITE);
+ }
+
+ return 0;
+}
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
--- a/mm/shmem.c~mm-hugetlb-fix-f_seal_future_write
+++ a/mm/shmem.c
@@ -2258,25 +2258,11 @@ out_nomem:
static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
{
struct shmem_inode_info *info = SHMEM_I(file_inode(file));
+ int ret;
- if (info->seals & F_SEAL_FUTURE_WRITE) {
- /*
- * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
- * "future write" seal active.
- */
- if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
- return -EPERM;
-
- /*
- * Since an F_SEAL_FUTURE_WRITE sealed memfd can be mapped as
- * MAP_SHARED and read-only, take care to not allow mprotect to
- * revert protections on such mappings. Do this only for shared
- * mappings. For private mappings, don't need to mask
- * VM_MAYWRITE as we still want them to be COW-writable.
- */
- if (vma->vm_flags & VM_SHARED)
- vma->vm_flags &= ~(VM_MAYWRITE);
- }
+ ret = seal_check_future_write(info->seals, vma);
+ if (ret)
+ return ret;
/* arm64 - allow memory tagging on RAM-based files */
vma->vm_flags |= VM_MTE_ALLOWED;
_
Patches currently in -mm which might be from peterx(a)redhat.com are
mm-gup_benchmark-support-threading.patch
mm-gup-pack-has_pinned-in-mmf_has_pinned-fix.patch
userfaultfd-selftests-use-user-mode-only.patch
userfaultfd-selftests-remove-the-time-check-on-delayed-uffd.patch
userfaultfd-selftests-dropping-verify-check-in-locking_thread.patch
userfaultfd-selftests-only-dump-counts-if-mode-enabled.patch
userfaultfd-selftests-unify-error-handling.patch
mm-thp-simplify-copying-of-huge-zero-page-pmd-when-fork.patch
mm-userfaultfd-fix-uffd-wp-special-cases-for-fork.patch
mm-userfaultfd-fix-a-few-thp-pmd-missing-uffd-wp-bit.patch
mm-userfaultfd-fail-uffd-wp-registeration-if-not-supported.patch
mm-pagemap-export-uffd-wp-protection-information.patch
userfaultfd-selftests-add-pagemap-uffd-wp-test.patch
The legacy PIC on the AMD variant of the Microsoft Surface Laptop 4 has
some problems on boot. For some reason it consistently does not respond
on the first try, requiring a couple more tries before it finally
responds.
This currently leads to the PIC not being properly recognized, which
prevents interrupt handling down the line. Ultimately, this also leads
to the pinctrl-amd driver failing to probe due to platform_get_irq()
returning -EINVAL for its base IRQ. That, in turn, means that several
interrupts are not available and device drivers relying on those will
defer probing indefinitely, as querying those interrupts returns
-EPROBE_DEFER.
Add a quirk table and a retry-loop to work around that.
Also switch to pr_info() due to complaints by checkpatch and add a
pr_fmt() definition for completeness.
Cc: <stable(a)vger.kernel.org> # 5.10+
Co-developed-by: Sachi King <nakato(a)nakato.io>
Signed-off-by: Sachi King <nakato(a)nakato.io>
Signed-off-by: Maximilian Luz <luzmaximilian(a)gmail.com>
---
arch/x86/kernel/i8259.c | 51 +++++++++++++++++++++++++++++++++++++----
1 file changed, 46 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/i8259.c b/arch/x86/kernel/i8259.c
index 282b4ee1339f..0da757c6b292 100644
--- a/arch/x86/kernel/i8259.c
+++ b/arch/x86/kernel/i8259.c
@@ -1,4 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) "i8259: " fmt
+
#include <linux/linkage.h>
#include <linux/errno.h>
#include <linux/signal.h>
@@ -16,6 +19,7 @@
#include <linux/io.h>
#include <linux/delay.h>
#include <linux/pgtable.h>
+#include <linux/dmi.h>
#include <linux/atomic.h>
#include <asm/timer.h>
@@ -298,11 +302,39 @@ static void unmask_8259A(void)
raw_spin_unlock_irqrestore(&i8259A_lock, flags);
}
+/*
+ * DMI table to identify devices with quirky probe behavior. See comment in
+ * probe_8259A() for more details.
+ */
+static const struct dmi_system_id retry_probe_quirk_table[] = {
+ {
+ .ident = "Microsoft Surface Laptop 4 (AMD)",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"),
+ DMI_MATCH(DMI_PRODUCT_SKU, "Surface_Laptop_4_1952:1953")
+ },
+ },
+ {}
+};
+
static int probe_8259A(void)
{
unsigned long flags;
unsigned char probe_val = ~(1 << PIC_CASCADE_IR);
unsigned char new_val;
+ unsigned int i, imax = 1;
+
+ /*
+ * Some systems have a legacy PIC that doesn't immediately respond
+ * after boot. We know it's there, we know it should respond and is
+ * required for proper interrupt handling later on, so let's try a
+ * couple of times.
+ */
+ if (dmi_check_system(retry_probe_quirk_table)) {
+ pr_warn("system with broken legacy PIC detected, re-trying multiple times if necessary\n");
+ imax = 10;
+ }
+
/*
* Check to see if we have a PIC.
* Mask all except the cascade and read
@@ -312,15 +344,24 @@ static int probe_8259A(void)
*/
raw_spin_lock_irqsave(&i8259A_lock, flags);
- outb(0xff, PIC_SLAVE_IMR); /* mask all of 8259A-2 */
- outb(probe_val, PIC_MASTER_IMR);
- new_val = inb(PIC_MASTER_IMR);
- if (new_val != probe_val) {
- printk(KERN_INFO "Using NULL legacy PIC\n");
+ for (i = 0; i < imax; i++) {
+ outb(0xff, PIC_SLAVE_IMR); /* mask all of 8259A-2 */
+ outb(probe_val, PIC_MASTER_IMR);
+ new_val = inb(PIC_MASTER_IMR);
+ if (new_val == probe_val)
+ break;
+ }
+
+ if (i == imax) {
+ pr_info("using NULL legacy PIC\n");
legacy_pic = &null_legacy_pic;
}
raw_spin_unlock_irqrestore(&i8259A_lock, flags);
+
+ if (imax > 1 && i < imax)
+ pr_info("got legacy PIC after %d tries\n", i + 1);
+
return nr_legacy_irqs();
}
--
2.31.1
From: Wanpeng Li <wanpengli(a)tencent.com>
Commit 66570e966dd9 (kvm: x86: only provide PV features if enabled in guest's
CPUID) avoids to access pv tlb shootdown host side logic when this pv feature
is not exposed to guest, however, kvm_steal_time.preempted not only leveraged
by pv tlb shootdown logic but also mitigate the lock holder preemption issue.
>From guest's point of view, vCPU is always preempted since we lose the reset
of kvm_steal_time.preempted before vmentry if pv tlb shootdown feature is not
exposed. This patch fixes it by clearing kvm_steal_time.preempted before
vmentry.
Fixes: 66570e966dd9 (kvm: x86: only provide PV features if enabled in guest's CPUID)
Reviewed-by: Sean Christopherson <seanjc(a)google.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Wanpeng Li <wanpengli(a)tencent.com>
---
v1 -> v2:
* add curly braces
arch/x86/kvm/x86.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dfb7c320581f..bed7b5348c0e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3105,6 +3105,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
st->preempted & KVM_VCPU_FLUSH_TLB);
if (xchg(&st->preempted, 0) & KVM_VCPU_FLUSH_TLB)
kvm_vcpu_flush_tlb_guest(vcpu);
+ } else {
+ st->preempted = 0;
}
vcpu->arch.st.preempted = 0;
--
2.25.1
Sometimes kernel is trying to probe Fingerprint MCU (FPMCU) when it
hasn't initialized SPI yet. This can happen because FPMCU is restarted
during system boot and kernel can send message in short window
eg. between sysjump to RW and SPI initialization.
Cc: <stable(a)vger.kernel.org> # 4.4+
Signed-off-by: Patryk Duda <pdk(a)semihalf.com>
---
Fingerprint MCU is rebooted during system startup by AP firmware (coreboot).
During cold boot kernel can query FPMCU in a window just after jump to RW
section of firmware but before SPI is initialized. The window was
shortened to <1ms, but it can't be eliminated completly.
Communication with FPMCU (and all devices based on EC) is bi-directional.
When kernel sends message, EC will send EC_SPI* status codes. When EC is
not able to process command one of bytes will be eg. EC_SPI_NOT_READY.
This mechanism won't work when SPI is not initailized on EC side. In fact,
buffer is filled with 0xFF bytes, so from kernel perspective device is not
responding. To avoid this problem, we can query device once again. We are
already waiting EC_MSG_DEADLINE_MS for response, so we can send command
immediately.
Best regards,
Patryk
drivers/platform/chrome/cros_ec_proto.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/platform/chrome/cros_ec_proto.c b/drivers/platform/chrome/cros_ec_proto.c
index aa7f7aa77297..3384631d21e2 100644
--- a/drivers/platform/chrome/cros_ec_proto.c
+++ b/drivers/platform/chrome/cros_ec_proto.c
@@ -279,6 +279,18 @@ static int cros_ec_host_command_proto_query(struct cros_ec_device *ec_dev,
msg->insize = sizeof(struct ec_response_get_protocol_info);
ret = send_command(ec_dev, msg);
+ /*
+ * Send command once again when timeout occurred.
+ * Fingerprint MCU (FPMCU) is restarted during system boot which
+ * introduces small window in which FPMCU won't respond for any
+ * messages sent by kernel. There is no need to wait before next
+ * attempt because we waited at least EC_MSG_DEADLINE_MS.
+ */
+ if (ret == -ETIMEDOUT) {
+ dev_warn(ec_dev->dev,
+ "Timeout to get response from EC. Retrying.\n");
+ ret = send_command(ec_dev, msg);
+ }
if (ret < 0) {
dev_dbg(ec_dev->dev,
--
2.31.1.751.gd2f1c929bd-goog
From: Coly Li <colyli(a)suse.de>
In the cache missing code path of cached device, if a proper location
from the internal B+ tree is matched for a cache miss range, function
cached_dev_cache_miss() will be called in cache_lookup_fn() in the
following code block,
[code block 1]
526 unsigned int sectors = KEY_INODE(k) == s->iop.inode
527 ? min_t(uint64_t, INT_MAX,
528 KEY_START(k) - bio->bi_iter.bi_sector)
529 : INT_MAX;
530 int ret = s->d->cache_miss(b, s, bio, sectors);
Here s->d->cache_miss() is the call backfunction pointer initialized as
cached_dev_cache_miss(), the last parameter 'sectors' is an important
hint to calculate the size of read request to backing device of the
missing cache data.
Current calculation in above code block may generate oversized value of
'sectors', which consequently may trigger 2 different potential kernel
panics by BUG() or BUG_ON() as listed below,
1) BUG_ON() inside bch_btree_insert_key(),
[code block 2]
886 BUG_ON(b->ops->is_extents && !KEY_SIZE(k));
2) BUG() inside biovec_slab(),
[code block 3]
51 default:
52 BUG();
53 return NULL;
All the above panics are original from cached_dev_cache_miss() by the
oversized parameter 'sectors'.
Inside cached_dev_cache_miss(), parameter 'sectors' is used to calculate
the size of data read from backing device for the cache missing. This
size is stored in s->insert_bio_sectors by the following lines of code,
[code block 4]
909 s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada);
Then the actual key inserting to the internal B+ tree is generated and
stored in s->iop.replace_key by the following lines of code,
[code block 5]
911 s->iop.replace_key = KEY(s->iop.inode,
912 bio->bi_iter.bi_sector + s->insert_bio_sectors,
913 s->insert_bio_sectors);
The oversized parameter 'sectors' may trigger panic 1) by BUG_ON() from
the above code block.
And the bio sending to backing device for the missing data is allocated
with hint from s->insert_bio_sectors by the following lines of code,
[code block 6]
926 cache_bio = bio_alloc_bioset(GFP_NOWAIT,
927 DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS),
928 &dc->disk.bio_split);
The oversized parameter 'sectors' may trigger panic 2) by BUG() from the
agove code block.
Now let me explain how the panics happen with the oversized 'sectors'.
In code block 5, replace_key is generated by macro KEY(). From the
definition of macro KEY(),
[code block 7]
71 #define KEY(inode, offset, size) \
72 ((struct bkey) { \
73 .high = (1ULL << 63) | ((__u64) (size) << 20) | (inode), \
74 .low = (offset) \
75 })
Here 'size' is 16bits width embedded in 64bits member 'high' of struct
bkey. But in code block 1, if "KEY_START(k) - bio->bi_iter.bi_sector" is
very probably to be larger than (1<<16) - 1, which makes the bkey size
calculation in code block 5 is overflowed. In one bug report the value
of parameter 'sectors' is 131072 (= 1 << 17), the overflowed 'sectors'
results the overflowed s->insert_bio_sectors in code block 4, then makes
size field of s->iop.replace_key to be 0 in code block 5. Then the 0-
sized s->iop.replace_key is inserted into the internal B+ tree as cache
missing check key (a special key to detect and avoid a racing between
normal write request and cache missing read request) as,
[code block 8]
915 ret = bch_btree_insert_check_key(b, &s->op, &s->iop.replace_key);
Then the 0-sized s->iop.replace_key as 3rd parameter triggers the bkey
size check BUG_ON() in code block 2, and causes the kernel panic 1).
Another kernel panic is from code block 6, is from the oversized value
s->insert_bio_sectors resulted by the oversized 'sectors'. From a bug
report the result of "DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS)"
from code block 6 can be 344, 282, 946, 342 and many other values which
larther than BIO_MAX_VECS (a.k.a 256). When calling bio_alloc_bioset()
with such larger-than-256 value as the 2nd parameter, this value will
eventually be sent to biovec_slab() as parameter 'nr_vecs' in following
code path,
bio_alloc_bioset() ==> bvec_alloc() ==> biovec_slab()
Because parameter 'nr_vecs' is larger-than-256 value, the panic by BUG()
in code block 3 is triggered inside biovec_slab().
>From the above analysis, we know that the 4th parameter 'sector' sent
into cached_dev_cache_miss() may cause overflow in code block 5 and 6,
and finally cause kernel panic in code block 2 and 3.
Therefore inside cached_dev_cache_miss() before parameter 'sector' is
used to calculate s->insert_bio_sectors in code block4, there should be
an value overflow check on 'sector' and fix its value when necessary.
- To avoid overflow in code block 5, the maximum 'sectors' value should
be equal or less than (1 << KEY_SIZE_BITS) - 1.
- To avoid overflow in code block 6, the maximum 'sectors' value should
be euqal or less than BIO_MAX_VECS * PAGE_SECTORS.
Considering the kernel page size can be variable, a reasonable maximum
limitation of 'sectors' should be the smaller one of the values
"(1 << KEY_SIZE_BITS) - 1" and "BIO_MAX_VECS * PAGE_SECTORS".
In this patch, a local variable inside cached_dev_cache_miss() is added
as,
max_cache_miss_size = min_t(uint32_t,
(1 << KEY_SIZE_BITS) - 1, BIO_MAX_VECS * PAGE_SECTORS);
Then the following code check and fix parameter 'sectors' as,
if (sectors > max_cache_miss_size)
sectors = max_cache_miss_size;
Now inside cached_dev_cache_miss(), the calculated 'sectors' value sent
into code block 5 and 6 will not trigger neither of the above kernel
panics anymore.
Current problmatic code can be partially found since Linux v5.13-rc1,
therefore all maintained stable kernels should try to apply this fix.
Reported-by: Diego Ercolani <diego.ercolani(a)gmail.com>
Reported-by: Jan Szubiak <jan.szubiak(a)linuxpolska.pl>
Reported-by: Marco Rebhan <me(a)dblsaiko.net>
Reported-by: Matthias Ferdinand <bcache(a)mfedv.net>
Reported-by: Thorsten Knabe <linux(a)thorsten-knabe.de>
Reported-by: Victor Westerhuis <victor(a)westerhu.is>
Reported-by: Vojtech Pavlik <vojtech(a)suse.cz>
Signed-off-by: Coly Li <colyli(a)suse.de>
Cc: stable(a)vger.kernel.org
Cc: Takashi Iwai <tiwai(a)suse.com>
Cc: Kent Overstreet <kent.overstreet(a)gmail.com>
---
Changlog
v2, fix the bypass bio size calculation in v1.
v1, the initial version
drivers/md/bcache/request.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 29c231758293..ba1612b00b9f 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -883,6 +883,7 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
unsigned int reada = 0;
struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
struct bio *miss, *cache_bio;
+ unsigned int max_miss_size;
s->cache_missed = 1;
@@ -899,6 +900,25 @@ static int cached_dev_cache_miss(struct btree *b, struct search *s,
get_capacity(bio->bi_bdev->bd_disk) -
bio_end_sector(bio));
+ /*
+ * Make sure sectors won't exceed two size limitations,
+ * - The bkey maximum size
+ * Size field in the bkey is 16 bits, the maximum permitted
+ * value is (1 << KEY_SIZE_BITS) - 1, in unit of sector.
+ * - The bio io vecs maximum number
+ * BIO_MAX_VECS is the maximum permitted io vecs number of a
+ * bio, any larger value will result a BUG() complain in bio
+ * layer code. When maximum size of each io vector is a page,
+ * BIO_MAX_VECS * PAGE_SECTORS is the maximum permitted value
+ * in unit of sectors.
+ * Then we are sure there is no overflow for key size of
+ * s->iop.replace_key and bio io vecs number of cache_bio.
+ */
+ max_cache_miss_size = min_t(uint32_t,
+ (1 << KEY_SIZE_BITS) - 1, BIO_MAX_VECS * PAGE_SECTORS);
+ if (sectors > max_cache_miss_size)
+ sectors = max_cache_miss_size;
+
s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada);
s->iop.replace_key = KEY(s->iop.inode,
--
2.26.2