The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 353050be4c19e102178ccc05988101887c25ae53 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <daniel(a)iogearbox.net>
Date: Tue, 9 Nov 2021 18:48:08 +0000
Subject: [PATCH] bpf: Fix toctou on read-only map's constant scalar tracking
Commit a23740ec43ba ("bpf: Track contents of read-only maps as scalars") checks
whether maps are read-only both from the BPF program side and the user space
side, and then, given their content is constant, reads out their data via
map->ops->map_direct_value_addr(). That data is subsequently used as the known
scalar value for the register, that is, the register is marked via
__mark_reg_known() with the read value at verification time. Before
a23740ec43ba, the register content was marked as an unknown scalar, so the
verifier could not make any assumptions about the map content.
The current implementation, however, is prone to a TOCTOU race: the value read
as a known scalar for the register is not guaranteed to be exactly the same at
a later point when the program is executed, and as such the assumptions the
verifier made about the program become invalid, which can cause issues such as
OOB access, etc.
While the BPF_F_RDONLY_PROG map flag is always fixed and required to be
specified at map creation time, the map->frozen property is initially set to
false given the map value still needs to be populated, e.g. for global data
sections. Once that is complete, the loader "freezes" the map from user space
such that no subsequent updates/deletes are possible anymore. After a
successful BPF_MAP_FREEZE cmd return, this one-time freeze trigger cannot be
undone for the rest of the map's lifetime: any new BPF_* cmd calls which would
update/delete map entries are rejected with -EPERM since map_get_sys_perms()
removes the FMODE_CAN_WRITE permission. Note, however, that this guarantee only
covers new calls; map update/delete operations that are already pending must
still be allowed to complete. This corner case is not an issue for loaders
since they create and prepare such a program-private map in successive steps.
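For reference, the loader sequence described above boils down to something
like the following minimal user-space sketch. This is illustrative only and
assumes a recent libbpf providing bpf_map_create(), bpf_map_update_elem() and
bpf_map_freeze(); the map name and value are made up:

  #include <unistd.h>
  #include <linux/bpf.h>
  #include <bpf/bpf.h>

  /* Sketch: create a prog-side read-only map, populate it from user space,
   * then freeze it so the verifier may treat its contents as constant.
   */
  static int make_constant_map(void)
  {
          LIBBPF_OPTS(bpf_map_create_opts, opts,
                      .map_flags = BPF_F_RDONLY_PROG); /* fixed at creation */
          __u32 key = 0, value = 42;
          int fd;

          fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "const_map",
                              sizeof(key), sizeof(value), 1, &opts);
          if (fd < 0)
                  return fd;

          /* map->frozen is still false here, so the value can be populated */
          if (bpf_map_update_elem(fd, &key, &value, BPF_ANY))
                  goto err;

          /* one-way trigger: further syscall-side writes now get -EPERM */
          if (bpf_map_freeze(fd))
                  goto err;

          /* BPF_PROG_LOAD of a program reading the map would follow */
          return fd;
  err:
          close(fd);
          return -1;
  }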
However, a malicious user is able to trigger this TOCTOU race in two different
ways: i) via userfaultfd, and ii) via batched updates. For i), userfaultfd is
used to widen the race window so that map_update_elem() can modify the
contents of the map after map_freeze() and bpf_prog_load() were executed. This
works because userfaultfd halts the parallel thread which triggered a
map_update_elem() at the point where the key/value is copied from the user
buffer; that call has already passed the FMODE_CAN_WRITE permission test given
that, at that time, the map was not "frozen". Then, the main thread performs
the map_freeze() and bpf_prog_load(), and once that has completed successfully,
the other thread is woken up to complete the pending map_update_elem() which
then changes the map content. For ii), the idea of the batched update is
similar: when there is a large number of updates to be processed, the race
window between the two operations grows. It is therefore possible in practice
to modify the contents of the map after executing map_freeze() and
bpf_prog_load().
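The batched flavor ii) essentially comes down to the ordering sketched below,
where one thread starts a large BPF_MAP_UPDATE_BATCH that already passed its
FMODE_CAN_WRITE check while another thread performs map_freeze() and
bpf_prog_load() before the batch has finished. This is a rough sketch, not the
actual reproducer; it assumes libbpf's bpf_map_update_batch() wrapper, and
names and sizes are made up:

  #include <bpf/bpf.h>

  #define NR_ELEMS 4096

  static int map_fd;  /* map with NR_ELEMS entries, set up as sketched above */

  /* Meant to run from a second thread (e.g. via pthread_create()): a
   * long-running write phase racing against the freeze. On a fixed kernel
   * this keeps map->writecnt elevated, so a concurrent BPF_MAP_FREEZE now
   * fails with -EBUSY instead of racing.
   */
  static void *batch_writer(void *arg)
  {
          static __u32 keys[NR_ELEMS], vals[NR_ELEMS];
          __u32 count = NR_ELEMS;
          LIBBPF_OPTS(bpf_map_batch_opts, bopts);
          __u32 i;

          (void)arg;
          for (i = 0; i < NR_ELEMS; i++) {
                  keys[i] = i;
                  vals[i] = 0xdead;
          }
          bpf_map_update_batch(map_fd, keys, vals, &count, &bopts);
          return NULL;
  }

  /* The main thread would concurrently do:
   *        bpf_map_freeze(map_fd);
   *        ... BPF_PROG_LOAD of a program reading the map ...
   * On pre-fix kernels the pending batch can still change the map content
   * after both calls have succeeded.
   */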
One way to fix both i) and ii) at the same time is to expand the use of the
map's map->writecnt. The latter was introduced in fc9702273e2e ("bpf: Add mmap()
support for BPF_MAP_TYPE_ARRAY") and further refined in 1f6cb19be2e2 ("bpf:
Prevent re-mmap()'ing BPF map as writable for initially r/o mapping") with
the rationale to make a writable mmap()'ing of a map mutually exclusive with
read-only freezing. The counter indicates writable mmap() mappings and then
prevents/fails the freeze operation. Its semantics can be expanded beyond
just mmap() by generally indicating ongoing write phases. This would essentially
span any parallel regular and batched flavor of update/delete operation and
then also have map_freeze() fail with -EBUSY. For the check_mem_access() in
the verifier we expand upon the bpf_map_is_rdonly() check, ensuring via a
bpf_map_write_active() test that all pending writes have completed. Only once
map->frozen is set and bpf_map_write_active() indicates a map->writecnt of 0
are we really guaranteed that the map's data can be used as known constants.
If map->frozen is set but pending writes are still in the process of being
completed, we fall back to marking that register as an unknown scalar so we
don't end up making assumptions about it. With this, both TOCTOU reproducers
from i) and ii) are fixed.
Note that map->writecnt has been converted into an atomic64 in the fix in
order to avoid a double freeze_mutex mutex_{un,}lock() pair when updating
map->writecnt in the various map update/delete BPF_* cmd flavors. Spanning
the freeze_mutex over entire map update/delete operations on the syscall side
is not possible since it would cause everything to be serialized.
Similarly, something like synchronize_rcu() after setting map->frozen to wait
for update/deletes to complete is not possible either since it would also
have to span the user copy which can sleep. On the libbpf side, this won't
break d66562fba1ce ("libbpf: Add BPF object skeleton support") as the
anonymous mmap()-ed "map initialization image" is remapped as a BPF map-backed
mmap()-ed memory where for .rodata it's non-writable.
Fixes: a23740ec43ba ("bpf: Track contents of read-only maps as scalars")
Reported-by: w1tcher.bupt(a)gmail.com
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Acked-by: Andrii Nakryiko <andrii(a)kernel.org>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f715e8863f4d..e7a163a3146b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -193,7 +193,7 @@ struct bpf_map {
atomic64_t usercnt;
struct work_struct work;
struct mutex freeze_mutex;
- u64 writecnt; /* writable mmap cnt; protected by freeze_mutex */
+ atomic64_t writecnt;
};
static inline bool map_value_has_spin_lock(const struct bpf_map *map)
@@ -1419,6 +1419,7 @@ void bpf_map_put(struct bpf_map *map);
void *bpf_map_area_alloc(u64 size, int numa_node);
void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
void bpf_map_area_free(void *base);
+bool bpf_map_write_active(const struct bpf_map *map);
void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
int generic_map_lookup_batch(struct bpf_map *map,
const union bpf_attr *attr,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 50f96ea4452a..1033ee8c0caf 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -132,6 +132,21 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
return map;
}
+static void bpf_map_write_active_inc(struct bpf_map *map)
+{
+ atomic64_inc(&map->writecnt);
+}
+
+static void bpf_map_write_active_dec(struct bpf_map *map)
+{
+ atomic64_dec(&map->writecnt);
+}
+
+bool bpf_map_write_active(const struct bpf_map *map)
+{
+ return atomic64_read(&map->writecnt) != 0;
+}
+
static u32 bpf_map_value_size(const struct bpf_map *map)
{
if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
@@ -601,11 +616,8 @@ static void bpf_map_mmap_open(struct vm_area_struct *vma)
{
struct bpf_map *map = vma->vm_file->private_data;
- if (vma->vm_flags & VM_MAYWRITE) {
- mutex_lock(&map->freeze_mutex);
- map->writecnt++;
- mutex_unlock(&map->freeze_mutex);
- }
+ if (vma->vm_flags & VM_MAYWRITE)
+ bpf_map_write_active_inc(map);
}
/* called for all unmapped memory region (including initial) */
@@ -613,11 +625,8 @@ static void bpf_map_mmap_close(struct vm_area_struct *vma)
{
struct bpf_map *map = vma->vm_file->private_data;
- if (vma->vm_flags & VM_MAYWRITE) {
- mutex_lock(&map->freeze_mutex);
- map->writecnt--;
- mutex_unlock(&map->freeze_mutex);
- }
+ if (vma->vm_flags & VM_MAYWRITE)
+ bpf_map_write_active_dec(map);
}
static const struct vm_operations_struct bpf_map_default_vmops = {
@@ -668,7 +677,7 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
goto out;
if (vma->vm_flags & VM_MAYWRITE)
- map->writecnt++;
+ bpf_map_write_active_inc(map);
out:
mutex_unlock(&map->freeze_mutex);
return err;
@@ -1139,6 +1148,7 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
map = __bpf_map_get(f);
if (IS_ERR(map))
return PTR_ERR(map);
+ bpf_map_write_active_inc(map);
if (!(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
err = -EPERM;
goto err_put;
@@ -1174,6 +1184,7 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
free_key:
kvfree(key);
err_put:
+ bpf_map_write_active_dec(map);
fdput(f);
return err;
}
@@ -1196,6 +1207,7 @@ static int map_delete_elem(union bpf_attr *attr)
map = __bpf_map_get(f);
if (IS_ERR(map))
return PTR_ERR(map);
+ bpf_map_write_active_inc(map);
if (!(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
err = -EPERM;
goto err_put;
@@ -1226,6 +1238,7 @@ static int map_delete_elem(union bpf_attr *attr)
out:
kvfree(key);
err_put:
+ bpf_map_write_active_dec(map);
fdput(f);
return err;
}
@@ -1533,6 +1546,7 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr)
map = __bpf_map_get(f);
if (IS_ERR(map))
return PTR_ERR(map);
+ bpf_map_write_active_inc(map);
if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ) ||
!(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
err = -EPERM;
@@ -1597,6 +1611,7 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr)
free_key:
kvfree(key);
err_put:
+ bpf_map_write_active_dec(map);
fdput(f);
return err;
}
@@ -1624,8 +1639,7 @@ static int map_freeze(const union bpf_attr *attr)
}
mutex_lock(&map->freeze_mutex);
-
- if (map->writecnt) {
+ if (bpf_map_write_active(map)) {
err = -EBUSY;
goto err_put;
}
@@ -4171,6 +4185,9 @@ static int bpf_map_do_batch(const union bpf_attr *attr,
union bpf_attr __user *uattr,
int cmd)
{
+ bool has_read = cmd == BPF_MAP_LOOKUP_BATCH ||
+ cmd == BPF_MAP_LOOKUP_AND_DELETE_BATCH;
+ bool has_write = cmd != BPF_MAP_LOOKUP_BATCH;
struct bpf_map *map;
int err, ufd;
struct fd f;
@@ -4183,16 +4200,13 @@ static int bpf_map_do_batch(const union bpf_attr *attr,
map = __bpf_map_get(f);
if (IS_ERR(map))
return PTR_ERR(map);
-
- if ((cmd == BPF_MAP_LOOKUP_BATCH ||
- cmd == BPF_MAP_LOOKUP_AND_DELETE_BATCH) &&
- !(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
+ if (has_write)
+ bpf_map_write_active_inc(map);
+ if (has_read && !(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
err = -EPERM;
goto err_put;
}
-
- if (cmd != BPF_MAP_LOOKUP_BATCH &&
- !(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
+ if (has_write && !(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
err = -EPERM;
goto err_put;
}
@@ -4205,8 +4219,9 @@ static int bpf_map_do_batch(const union bpf_attr *attr,
BPF_DO_BATCH(map->ops->map_update_batch);
else
BPF_DO_BATCH(map->ops->map_delete_batch);
-
err_put:
+ if (has_write)
+ bpf_map_write_active_dec(map);
fdput(f);
return err;
}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 65d2f93b7030..50efda51515b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4056,7 +4056,22 @@ static void coerce_reg_to_size(struct bpf_reg_state *reg, int size)
static bool bpf_map_is_rdonly(const struct bpf_map *map)
{
- return (map->map_flags & BPF_F_RDONLY_PROG) && map->frozen;
+ /* A map is considered read-only if the following condition are true:
+ *
+ * 1) BPF program side cannot change any of the map content. The
+ * BPF_F_RDONLY_PROG flag is throughout the lifetime of a map
+ * and was set at map creation time.
+ * 2) The map value(s) have been initialized from user space by a
+ * loader and then "frozen", such that no new map update/delete
+ * operations from syscall side are possible for the rest of
+ * the map's lifetime from that point onwards.
+ * 3) Any parallel/pending map update/delete operations from syscall
+ * side have been completed. Only after that point, it's safe to
+ * assume that map value(s) are immutable.
+ */
+ return (map->map_flags & BPF_F_RDONLY_PROG) &&
+ READ_ONCE(map->frozen) &&
+ !bpf_map_write_active(map);
}
static int bpf_map_direct_read(struct bpf_map *map, int off, int size, u64 *val)
Commit 6dce5aa59e0b ("PCI: xgene: Use inbound resources for setup")
broke PCI support on XGene. The cause is that the IB resources are now sorted
in address order instead of being in DT dma-ranges order. The result is that
the inbound registers used for each region are swapped. I don't
know the details about this h/w, but it appears that the IB region 0
registers can't handle a size greater than 4GB. In any case, limiting
the size for region 0 is enough to get back to the original assignment
of dma-ranges to regions.
Reported-by: Stéphane Graber <stgraber(a)ubuntu.com>
Fixes: 6dce5aa59e0b ("PCI: xgene: Use inbound resources for setup")
Link: https://lore.kernel.org/all/CA+enf=v9rY_xnZML01oEgKLmvY1NGBUUhnSJaETmXtDtXf…
Cc: stable(a)vger.kernel.org # v5.5+
Signed-off-by: Rob Herring <robh(a)kernel.org>
---
drivers/pci/controller/pci-xgene.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/pci/controller/pci-xgene.c b/drivers/pci/controller/pci-xgene.c
index 56d0d50338c8..d83dbd977418 100644
--- a/drivers/pci/controller/pci-xgene.c
+++ b/drivers/pci/controller/pci-xgene.c
@@ -465,7 +465,7 @@ static int xgene_pcie_select_ib_reg(u8 *ib_reg_mask, u64 size)
return 1;
}
- if ((size > SZ_1K) && (size < SZ_1T) && !(*ib_reg_mask & (1 << 0))) {
+ if ((size > SZ_1K) && (size < SZ_4G) && !(*ib_reg_mask & (1 << 0))) {
*ib_reg_mask |= (1 << 0);
return 0;
}
--
2.32.0
On 2021-07-30 12:35, Anders Roxell wrote:
> From: Robin Murphy <robin.murphy(a)arm.com>
>
>> Now that PCI inbound window restrictions are handled generically between
>> the of_pci resource parsing and the IOMMU layer, and described in the
>> Juno DT, we can finally enable the PCIe SMMU without the risk of DMA
>> mappings inadvertently allocating unusable addresses.
>>
>> Similarly, the relevant support for IOMMU mappings for peripheral
>> transfers has been hooked up in the pl330 driver for ages, so we can
>> happily enable the DMA SMMU without that breaking anything either.
>>
>> Signed-off-by: Robin Murphy <robin.murphy(a)arm.com>
>
> When we build a kernel with 64k page size and run the ltp syscalls we
> sporadically see a kernel crash while doing a mkfs on a connected SATA
> drive. This is happening every third test run on any juno-r2 device in
> the lab with the same kernel image (stable-rc 5.13.y, mainline and next)
> with gcc-11.
Hmm, I guess 64K pages might make a difference in that we'll chew
through IOVA space a lot faster with small mappings...
I'll have to try to reproduce this locally, since the interesting thing
would be knowing what DMA address it was trying to use that went wrong,
but IOMMU tracepoints and/or dma-debug are going to generate a crazy
amount of data to sift through and try to correlate - having done it
before it's not something I'd readily ask someone else to do for me :)
On a hunch, though, does it make any difference if you remove the first
entry from the PCIe "dma-ranges" (the 0x2c1c0000 one)?
Robin.
> Here is a snippet of the boot log [1]:
>
> + mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SDSSDA120G_165192443611
> mke2fs 1.43.8 (1-Jan-2018)
> Discarding device blocks: 4096/29305200
> [ 55.344291] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> frozen
> [ 55.351423] ata1.00: irq_stat 0x00020002, failed to transmit command
> FIS
> [ 55.358205] ata1.00: failed command: DATA SET MANAGEMENT
> [ 55.363561] ata1.00: cmd 06/01:01:00:00:00/00:00:00:00:00/a0 tag 12
> dma 512 out
> [ 55.363561] res ec/ff:00:00:00:00/00:00:00:00:ec/00 Emask
> 0x12 (ATA bus error)
> [ 55.378955] ata1.00: status: { Busy }
> [ 55.382658] ata1.00: error: { ICRC UNC AMNF IDNF ABRT }
> [ 55.387947] ata1: hard resetting link
> [ 55.391653] ata1: controller in dubious state, performing PORT_RST
> [ 57.588447] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
> [ 57.613471] ata1.00: configured for UDMA/100
> [ 57.617866] ata1.00: device reported invalid CHS sector 0
> [ 57.623397] ata1: EH complete
>
>
> When we revert this patch we don't see any issue.
>
> Reported-by: Linux Kernel Functional Testing <lkft(a)linaro.org>
>
> Cheers,
> Anders
> [1]
> https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-5.13.y/build/v5.13…
>
Some boot partitions on the Samsung 4GB KLM4G1YE4C "4YMD1R" and "M4G1YC"
cards appear broken when accessed randomly. CMD6 to switch back to the main
partition randomly stalls after CMD18 access to boot partition 1, and the
card never comes back online. The accesses to the boot partitions work
several times before this happens, but eventually access hangs while
initializing the card.
Some problematic eMMC cards are found in the Samsung GT-S7710 (Skomer)
and SGH-I407 (Kyle) mobile phones.
I tried using only single blocks with CMD17 on the boot partitions with the
result that it crashed even faster.
After a bit of root cause analysis it turns out that these old eMMC cards
probably cannot do hardware busy detection (monitoring DAT0) properly.
The card survives on older kernels, but this is because recent kernels have
added busy detection handling for the SoC used in these phones, exposing
the issue.
Construct a quirk that makes the MMC core avoid using the ->card_busy()
callback if the card is listed with MMC_QUIRK_BROKEN_HW_BUSY_DETECT and
register the known problematic cards. The core changes are pretty
straightforward, with an inline helper to check if we can use hardware
busy detection.
On the MMCI host we have to counter the fact that if the host was able to
use busy detect, it would be used unsolicited in the command IRQ callback.
Rewrite this so that MMCI will not attempt to use hardware busy detection
in the command IRQ until:
- A card is attached to the host and
- We know that the card can handle this busy detection
I have glanced over the ->card_busy() callbacks on some other hosts and
they seem to mostly read a register reflecting the value of DAT0, which
works fine with the quirk in this patch. However, if the error appears
on other hosts, they might need additional fixes.
After applying this patch, the main partition can be accessed and mounted
without problems on Samsung GT-S7710 and SGH-I407.
Fixes: cb0335b778c7 ("mmc: mmci: add busy_complete callback")
Cc: stable(a)vger.kernel.org
Cc: phone-devel(a)vger.kernel.org
Cc: Ludovic Barre <ludovic.barre(a)st.com>
Cc: Stephan Gerhold <stephan(a)gerhold.net>
Reported-by: newbyte(a)disroot.org
Signed-off-by: Linus Walleij <linus.walleij(a)linaro.org>
---
ChangeLog v2->v3:
- Rebase on v5.14-rc1
- Reword the commit message slightly.
ChangeLog v1->v2:
- Rewrite to reflect the actual problem of broken busy detection.
---
drivers/mmc/core/core.c | 8 ++++----
drivers/mmc/core/core.h | 17 +++++++++++++++++
drivers/mmc/core/mmc_ops.c | 4 ++--
drivers/mmc/core/quirks.h | 21 +++++++++++++++++++++
drivers/mmc/host/mmci.c | 22 ++++++++++++++++++++--
include/linux/mmc/card.h | 1 +
6 files changed, 65 insertions(+), 8 deletions(-)
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 95fedcf56e4a..e08dd9ea3d46 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -232,7 +232,7 @@ static void __mmc_start_request(struct mmc_host *host, struct mmc_request *mrq)
* And bypass I/O abort, reset and bus suspend operations.
*/
if (sdio_is_io_busy(mrq->cmd->opcode, mrq->cmd->arg) &&
- host->ops->card_busy) {
+ mmc_hw_busy_detect(host)) {
int tries = 500; /* Wait aprox 500ms at maximum */
while (host->ops->card_busy(host) && --tries)
@@ -1200,7 +1200,7 @@ int mmc_set_uhs_voltage(struct mmc_host *host, u32 ocr)
*/
if (!host->ops->start_signal_voltage_switch)
return -EPERM;
- if (!host->ops->card_busy)
+ if (!mmc_hw_busy_detect(host))
pr_warn("%s: cannot verify signal voltage switch\n",
mmc_hostname(host));
@@ -1220,7 +1220,7 @@ int mmc_set_uhs_voltage(struct mmc_host *host, u32 ocr)
* after the response of cmd11, but wait 1 ms to be sure
*/
mmc_delay(1);
- if (host->ops->card_busy && !host->ops->card_busy(host)) {
+ if (mmc_hw_busy_detect(host) && !host->ops->card_busy(host)) {
err = -EAGAIN;
goto power_cycle;
}
@@ -1241,7 +1241,7 @@ int mmc_set_uhs_voltage(struct mmc_host *host, u32 ocr)
* Failure to switch is indicated by the card holding
* dat[0:3] low
*/
- if (host->ops->card_busy && host->ops->card_busy(host))
+ if (mmc_hw_busy_detect(host) && host->ops->card_busy(host))
err = -EAGAIN;
power_cycle:
diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h
index 0c4de2030b3f..6a5619eed4a6 100644
--- a/drivers/mmc/core/core.h
+++ b/drivers/mmc/core/core.h
@@ -181,4 +181,21 @@ static inline int mmc_flush_cache(struct mmc_host *host)
return 0;
}
+/**
+ * mmc_hw_busy_detect() - Can we use hw busy detection?
+ * @host: the host in question
+ */
+static inline bool mmc_hw_busy_detect(struct mmc_host *host)
+{
+ struct mmc_card *card = host->card;
+ bool has_ops;
+ bool able = true;
+
+ has_ops = (host->ops->card_busy != NULL);
+ if (card)
+ able = !(card->quirks & MMC_QUIRK_BROKEN_HW_BUSY_DETECT);
+
+ return (has_ops && able);
+}
+
#endif
diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index 973756ed4016..546fc799a8e5 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -435,7 +435,7 @@ static int mmc_busy_cb(void *cb_data, bool *busy)
u32 status = 0;
int err;
- if (host->ops->card_busy) {
+ if (mmc_hw_busy_detect(host)) {
*busy = host->ops->card_busy(host);
return 0;
}
@@ -597,7 +597,7 @@ int __mmc_switch(struct mmc_card *card, u8 set, u8 index, u8 value,
* when it's not allowed to poll by using CMD13, then we need to rely on
* waiting the stated timeout to be sufficient.
*/
- if (!send_status && !host->ops->card_busy) {
+ if (!send_status && !mmc_hw_busy_detect(host)) {
mmc_delay(timeout_ms);
goto out_tim;
}
diff --git a/drivers/mmc/core/quirks.h b/drivers/mmc/core/quirks.h
index d68e6e513a4f..8da6526f0eb0 100644
--- a/drivers/mmc/core/quirks.h
+++ b/drivers/mmc/core/quirks.h
@@ -99,6 +99,27 @@ static const struct mmc_fixup __maybe_unused mmc_blk_fixups[] = {
MMC_FIXUP("V10016", CID_MANFID_KINGSTON, CID_OEMID_ANY, add_quirk_mmc,
MMC_QUIRK_TRIM_BROKEN),
+ /*
+ * Some older Samsung eMMCs have broken hardware busy detection.
+ * Enabling this feature in the host controller can make the card
+ * accesses lock up completely.
+ */
+ MMC_FIXUP("4YMD1R", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+ MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+ /* Samsung KLMxGxxE4x eMMCs from 2012: 4, 8, 16, 32 and 64 GB */
+ MMC_FIXUP("M4G1YC", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+ MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+ MMC_FIXUP("M8G1WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+ MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+ MMC_FIXUP("MAG2WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+ MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+ MMC_FIXUP("MBG4WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+ MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+ MMC_FIXUP("MAG2WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+ MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+ MMC_FIXUP("MCG8WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+ MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+
END_FIXUP
};
diff --git a/drivers/mmc/host/mmci.c b/drivers/mmc/host/mmci.c
index 984d35055156..3046917b2b67 100644
--- a/drivers/mmc/host/mmci.c
+++ b/drivers/mmc/host/mmci.c
@@ -347,6 +347,24 @@ static int mmci_card_busy(struct mmc_host *mmc)
return busy;
}
+/* Use this if the MMCI variant AND the card supports it */
+static bool mmci_use_busy_detect(struct mmci_host *host)
+{
+ struct mmc_card *card = host->mmc->card;
+
+ if (!host->variant->busy_detect)
+ return false;
+
+ /* We don't allow this until we know that the card can handle it */
+ if (!card)
+ return false;
+
+ if (card->quirks & MMC_QUIRK_BROKEN_HW_BUSY_DETECT)
+ return false;
+
+ return true;
+}
+
static void mmci_reg_delay(struct mmci_host *host)
{
/*
@@ -1381,7 +1399,7 @@ mmci_cmd_irq(struct mmci_host *host, struct mmc_command *cmd,
return;
/* Handle busy detection on DAT0 if the variant supports it. */
- if (busy_resp && host->variant->busy_detect)
+ if (busy_resp && mmci_use_busy_detect(host))
if (!host->ops->busy_complete(host, status, err_msk))
return;
@@ -1725,7 +1743,7 @@ static void mmci_set_max_busy_timeout(struct mmc_host *mmc)
struct mmci_host *host = mmc_priv(mmc);
u32 max_busy_timeout = 0;
- if (!host->variant->busy_detect)
+ if (!mmci_use_busy_detect(host))
return;
if (host->variant->busy_timeout && mmc->actual_clock)
diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
index 74e6c0624d27..525a39951c6d 100644
--- a/include/linux/mmc/card.h
+++ b/include/linux/mmc/card.h
@@ -280,6 +280,7 @@ struct mmc_card {
/* for byte mode */
#define MMC_QUIRK_NONSTD_SDIO (1<<2) /* non-standard SDIO card attached */
/* (missing CIA registers) */
+#define MMC_QUIRK_BROKEN_HW_BUSY_DETECT (1<<3) /* Disable hardware busy detection on DAT0 */
#define MMC_QUIRK_NONSTD_FUNC_IF (1<<4) /* SDIO card has nonstd function interfaces */
#define MMC_QUIRK_DISABLE_CD (1<<5) /* disconnect CD/DAT[3] resistor */
#define MMC_QUIRK_INAND_CMD38 (1<<6) /* iNAND devices have broken CMD38 */
--
2.31.1
On 11/24/21 8:28 AM, Jens Axboe wrote:
> On 11/23/21 8:27 PM, Daniel Black wrote:
>> On Mon, Nov 15, 2021 at 7:55 AM Jens Axboe <axboe(a)kernel.dk> wrote:
>>>
>>> On 11/14/21 1:33 PM, Daniel Black wrote:
>>>> On Fri, Nov 12, 2021 at 10:44 AM Jens Axboe <axboe(a)kernel.dk> wrote:
>>>>>
>>>>> Alright, give this one a go if you can. Against -git, but will apply to
>>>>> 5.15 as well.
>>>>
>>>>
>>>> Works. Thank you very much.
>>>>
>>>> https://jira.mariadb.org/browse/MDEV-26674?page=com.atlassian.jira.plugin.s…
>>>>
>>>> Tested-by: Marko Mäkelä <marko.makela(a)mariadb.com>
>>>
>>> The patch is already upstream (and in the 5.15 stable queue), and I
>>> provided 5.14 patches too.
>>
>> Jens,
>>
>> I'm getting the same reproducer on 5.14.20
>> (https://bugzilla.redhat.com/show_bug.cgi?id=2018882#c3) though the
>> backport change logs indicate 5.14.19 has the patch.
>>
>> Anything missing?
>
> We might also need another patch that isn't in stable, I'm attaching
> it here. Any chance you can run 5.14.20/21 with this applied? If not,
> I'll do some sanity checking here and push it to -stable.
Looks good to me - Greg, would you mind queueing this up for
5.14-stable?
--
Jens Axboe
Note: this applies to 5.10 stable only. It doesn't trigger on anything
above 5.10 as the code there has been substantially reworked. This also
doesn't apply to any stable kernel below 5.10 afaict.
Syzbot found a bug: KASAN: invalid-free in io_dismantle_req
https://syzkaller.appspot.com/bug?id=123d9a852fc88ba573ffcb2dbcf4f9576c3b05…
The test submits a bunch of io_uring writes and exits, which then triggers
uring_task_cancel() and io_put_identity(), which in some corner cases,
tries to free a static identity. This causes a panic as shown in the
trace below:
BUG: KASAN: double-free or invalid-free in kfree+0xd5/0x310
CPU: 0 PID: 4618 Comm: repro Not tainted 5.10.76-05281-g4944ec82ebb9-dirty #17
Call Trace:
dump_stack_lvl+0x1b2/0x21b
print_address_description+0x8d/0x3b0
kasan_report_invalid_free+0x58/0x130
____kasan_slab_free+0x14b/0x170
__kasan_slab_free+0x11/0x20
slab_free_freelist_hook+0xcc/0x1a0
kfree+0xd5/0x310
io_dismantle_req+0x9b0/0xd90
io_do_iopoll+0x13a4/0x23e0
io_iopoll_try_reap_events+0x116/0x290
io_uring_cancel_task_requests+0x197d/0x1ee0
io_uring_flush+0x170/0x6d0
filp_close+0xb0/0x150
put_files_struct+0x1d4/0x350
exit_files+0x80/0xa0
do_exit+0x6d9/0x2390
do_group_exit+0x16a/0x2d0
get_signal+0x133e/0x1f80
arch_do_signal+0x7b/0x610
exit_to_user_mode_prepare+0xaa/0xe0
syscall_exit_to_user_mode+0x24/0x40
do_syscall_64+0x3d/0x70
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Allocated by task 4611:
____kasan_kmalloc+0xcd/0x100
__kasan_kmalloc+0x9/0x10
kmem_cache_alloc_trace+0x208/0x390
io_uring_alloc_task_context+0x57/0x550
io_uring_add_task_file+0x1f7/0x290
io_uring_create+0x2195/0x3490
__x64_sys_io_uring_setup+0x1bf/0x280
do_syscall_64+0x31/0x70
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The buggy address belongs to the object at ffff88810732b500
which belongs to the cache kmalloc-192 of size 192
The buggy address is located 88 bytes inside of
192-byte region [ffff88810732b500, ffff88810732b5c0)
Kernel panic - not syncing: panic_on_warn set ...
This issue was bisected to this commit:
commit 186725a80c4e ("io_uring: fix skipping disabling sqo on exec")
Simply reverting the offending commit doesn't work as it hits some
other, related issues like:
/* sqo_dead check is for when this happens after cancellation */
WARN_ON_ONCE(ctx->sqo_task == current && !ctx->sqo_dead &&
!xa_load(&tctx->xa, (unsigned long)file));
------------[ cut here ]------------
WARNING: CPU: 1 PID: 5622 at fs/io_uring.c:8960 io_uring_flush+0x5bc/0x6d0
Modules linked in:
CPU: 1 PID: 5622 Comm: repro Not tainted 5.10.76-05281-g4944ec82ebb9-dirty #16
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-6.fc35 04/01/2014
RIP: 0010:io_uring_flush+0x5bc/0x6d0
Call Trace:
filp_close+0xb0/0x150
put_files_struct+0x1d4/0x350
reset_files_struct+0x88/0xa0
bprm_execve+0x7f2/0x9f0
do_execveat_common+0x46f/0x5d0
__x64_sys_execve+0x92/0xb0
do_syscall_64+0x31/0x70
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Changing __io_uring_task_cancel() to call io_disable_sqo_submit() directly,
as the comment suggests, only when __io_uring_files_cancel() was not executed,
seems to fix the issue.
Cc: Jens Axboe <axboe(a)kernel.dk>
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Cc: <io-uring(a)vger.kernel.org>
Cc: <linux-fsdevel(a)vger.kernel.org>
Cc: <linux-kernel(a)vger.kernel.org>
Cc: <stable(a)vger.kernel.org>
Reported-by: syzbot+6055980d041c8ac23307(a)syzkaller.appspotmail.com
Signed-off-by: Tadeusz Struk <tadeusz.struk(a)linaro.org>
---
fs/io_uring.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0736487165da..fcf9ffe9b209 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -8882,20 +8882,18 @@ void __io_uring_task_cancel(void)
struct io_uring_task *tctx = current->io_uring;
DEFINE_WAIT(wait);
s64 inflight;
+ int canceled = 0;
/* make sure overflow events are dropped */
atomic_inc(&tctx->in_idle);
- /* trigger io_disable_sqo_submit() */
- if (tctx->sqpoll)
- __io_uring_files_cancel(NULL);
-
do {
/* read completions before cancelations */
inflight = tctx_inflight(tctx);
if (!inflight)
break;
__io_uring_files_cancel(NULL);
+ canceled = 1;
prepare_to_wait(&tctx->wait, &wait, TASK_UNINTERRUPTIBLE);
@@ -8909,6 +8907,21 @@ void __io_uring_task_cancel(void)
finish_wait(&tctx->wait, &wait);
} while (1);
+ /*
+ * trigger io_disable_sqo_submit()
+ * if not already done by __io_uring_files_cancel()
+ */
+ if (tctx->sqpoll && !canceled) {
+ struct file *file;
+ unsigned long index;
+
+ xa_for_each(&tctx->xa, index, file) {
+ struct io_ring_ctx *ctx = file->private_data;
+
+ io_disable_sqo_submit(ctx);
+ }
+ }
+
atomic_dec(&tctx->in_idle);
io_uring_remove_task_files(tctx);
--
2.33.1