This is a note to let you know that I've just added the patch titled
serial: pl011: Fix lockdep splat when handling magic-sysrq interrupt
to my tty git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git
in the tty-next branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will also be merged in the next major kernel release
during the merge window.
If you have any questions about this process, please let me know.
From 534cf755d9df99e214ddbe26b91cd4d81d2603e2 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed, 30 Sep 2020 13:04:32 +0100
Subject: serial: pl011: Fix lockdep splat when handling magic-sysrq interrupt
Issuing a magic-sysrq via the PL011 causes the following lockdep splat,
which is easily reproducible under QEMU:
| sysrq: Changing Loglevel
| sysrq: Loglevel set to 9
|
| ======================================================
| WARNING: possible circular locking dependency detected
| 5.9.0-rc7 #1 Not tainted
| ------------------------------------------------------
| systemd-journal/138 is trying to acquire lock:
| ffffab133ad950c0 (console_owner){-.-.}-{0:0}, at: console_lock_spinning_enable+0x34/0x70
|
| but task is already holding lock:
| ffff0001fd47b098 (&port_lock_key){-.-.}-{2:2}, at: pl011_int+0x40/0x488
|
| which lock already depends on the new lock.
[...]
| Possible unsafe locking scenario:
|
| CPU0 CPU1
| ---- ----
| lock(&port_lock_key);
| lock(console_owner);
| lock(&port_lock_key);
| lock(console_owner);
|
| *** DEADLOCK ***
The issue is that CPU0 takes 'port_lock' on the irq path in pl011_int()
before taking 'console_owner' on the printk() path, whereas CPU1 takes
the two locks in the opposite order on the printk() path because it sets
"console_owner" prior to calling into the actual console driver.
Fix this in the same way as the msm-serial driver, by dropping 'port_lock'
before handling the sysrq character.
Cc: <stable@vger.kernel.org> # 4.19+
Cc: Russell King <linux@armlinux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/r/20200811101313.GA6970@willie-the-truck
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Will Deacon <will@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20200930120432.16551-1-will@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
drivers/tty/serial/amba-pl011.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c
index 67498594d7d7..87dc3fc15694 100644
--- a/drivers/tty/serial/amba-pl011.c
+++ b/drivers/tty/serial/amba-pl011.c
@@ -308,8 +308,9 @@ static void pl011_write(unsigned int val, const struct uart_amba_port *uap,
*/
static int pl011_fifo_to_tty(struct uart_amba_port *uap)
{
- u16 status;
unsigned int ch, flag, fifotaken;
+ int sysrq;
+ u16 status;
for (fifotaken = 0; fifotaken != 256; fifotaken++) {
status = pl011_read(uap, REG_FR);
@@ -344,10 +345,12 @@ static int pl011_fifo_to_tty(struct uart_amba_port *uap)
flag = TTY_FRAME;
}
- if (uart_handle_sysrq_char(&uap->port, ch & 255))
- continue;
+ spin_unlock(&uap->port.lock);
+ sysrq = uart_handle_sysrq_char(&uap->port, ch & 255);
+ spin_lock(&uap->port.lock);
- uart_insert_char(&uap->port, ch, UART011_DR_OE, ch, flag);
+ if (!sysrq)
+ uart_insert_char(&uap->port, ch, UART011_DR_OE, ch, flag);
}
return fifotaken;
--
2.28.0
This is a note to let you know that I've just added the patch titled
tty: serial: fsl_lpuart: fix lpuart32_poll_get_char
to my tty git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git
in the tty-next branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will also be merged in the next major kernel release
during the merge window.
If you have any questions about this process, please let me know.
From 29788ab1d2bf26c130de8f44f9553ee78a27e8d5 Mon Sep 17 00:00:00 2001
From: Peng Fan <peng.fan@nxp.com>
Date: Tue, 29 Sep 2020 17:55:09 +0800
Subject: tty: serial: fsl_lpuart: fix lpuart32_poll_get_char
The watermark is set to 1, and RDRF is only asserted once the FIFO level
exceeds the watermark, so with the original logic two chars have to be
input before one can be read. With the new logic, which checks the RX
FIFO count directly, the char can always be read as soon as there is
data in the FIFO.
Suggested-by: Fugang Duan <fugang.duan@nxp.com>
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Link: https://lore.kernel.org/r/20200929095509.21680-1-peng.fan@nxp.com
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
drivers/tty/serial/fsl_lpuart.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/tty/serial/fsl_lpuart.c b/drivers/tty/serial/fsl_lpuart.c
index 645bbb24b433..590c901a1fc5 100644
--- a/drivers/tty/serial/fsl_lpuart.c
+++ b/drivers/tty/serial/fsl_lpuart.c
@@ -680,7 +680,7 @@ static void lpuart32_poll_put_char(struct uart_port *port, unsigned char c)
static int lpuart32_poll_get_char(struct uart_port *port)
{
- if (!(lpuart32_read(port, UARTSTAT) & UARTSTAT_RDRF))
+ if (!(lpuart32_read(port, UARTWATER) >> UARTWATER_RXCNT_OFF))
return NO_POLL_CHAR;
return lpuart32_read(port, UARTDATA);
--
2.28.0
This is a note to let you know that I've just added the patch titled
tty: serial: lpuart: fix lpuart32_write usage
to my tty git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git
in the tty-next branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will also be merged in the next major kernel release
during the merge window.
If you have any questions about this process, please let me know.
From 9ea40db477c024dcb87321b7f32bd6ea12130ac2 Mon Sep 17 00:00:00 2001
From: Peng Fan <peng.fan@nxp.com>
Date: Tue, 29 Sep 2020 17:19:20 +0800
Subject: tty: serial: lpuart: fix lpuart32_write usage
The 2nd and 3rd parameters of lpuart32_write() were passed in the wrong
order, causing a kernel abort when debugging with kgdb.
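For reference, and as implied by the corrected calls in the diff below,
the lpuart32_write() helper takes the value to write before the register
offset (prototype sketched from the driver; exact types are assumptions):

	static void lpuart32_write(struct uart_port *port, u32 val, unsigned int off);

With the arguments swapped, every call wrote the register offset as data
into whatever register the intended value happened to select.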
Fixes: 1da17d7cf8e2 ("tty: serial: fsl_lpuart: Use appropriate lpuart32_* I/O funcs")
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Link: https://lore.kernel.org/r/20200929091920.22612-1-peng.fan@nxp.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
drivers/tty/serial/fsl_lpuart.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/drivers/tty/serial/fsl_lpuart.c b/drivers/tty/serial/fsl_lpuart.c
index 5a5a22d77841..645bbb24b433 100644
--- a/drivers/tty/serial/fsl_lpuart.c
+++ b/drivers/tty/serial/fsl_lpuart.c
@@ -649,26 +649,24 @@ static int lpuart32_poll_init(struct uart_port *port)
spin_lock_irqsave(&sport->port.lock, flags);
/* Disable Rx & Tx */
- lpuart32_write(&sport->port, UARTCTRL, 0);
+ lpuart32_write(&sport->port, 0, UARTCTRL);
temp = lpuart32_read(&sport->port, UARTFIFO);
/* Enable Rx and Tx FIFO */
- lpuart32_write(&sport->port, UARTFIFO,
- temp | UARTFIFO_RXFE | UARTFIFO_TXFE);
+ lpuart32_write(&sport->port, temp | UARTFIFO_RXFE | UARTFIFO_TXFE, UARTFIFO);
/* flush Tx and Rx FIFO */
- lpuart32_write(&sport->port, UARTFIFO,
- UARTFIFO_TXFLUSH | UARTFIFO_RXFLUSH);
+ lpuart32_write(&sport->port, UARTFIFO_TXFLUSH | UARTFIFO_RXFLUSH, UARTFIFO);
/* explicitly clear RDRF */
if (lpuart32_read(&sport->port, UARTSTAT) & UARTSTAT_RDRF) {
lpuart32_read(&sport->port, UARTDATA);
- lpuart32_write(&sport->port, UARTFIFO, UARTFIFO_RXUF);
+ lpuart32_write(&sport->port, UARTFIFO_RXUF, UARTFIFO);
}
/* Enable Rx and Tx */
- lpuart32_write(&sport->port, UARTCTRL, UARTCTRL_RE | UARTCTRL_TE);
+ lpuart32_write(&sport->port, UARTCTRL_RE | UARTCTRL_TE, UARTCTRL);
spin_unlock_irqrestore(&sport->port.lock, flags);
return 0;
@@ -677,7 +675,7 @@ static int lpuart32_poll_init(struct uart_port *port)
static void lpuart32_poll_put_char(struct uart_port *port, unsigned char c)
{
lpuart32_wait_bit_set(port, UARTSTAT, UARTSTAT_TDRE);
- lpuart32_write(port, UARTDATA, c);
+ lpuart32_write(port, c, UARTDATA);
}
static int lpuart32_poll_get_char(struct uart_port *port)
--
2.28.0
This is a note to let you know that I've just added the patch titled
serial: qcom_geni_serial: To correct QUP Version detection logic
to my tty git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git
in the tty-next branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will also be merged in the next major kernel release
during the merge window.
If you have any questions about this process, please let me know.
From c9ca43d42ed8d5fd635d327a664ed1d8579eb2af Mon Sep 17 00:00:00 2001
From: Paras Sharma <parashar@codeaurora.org>
Date: Wed, 30 Sep 2020 11:35:26 +0530
Subject: serial: qcom_geni_serial: To correct QUP Version detection logic
For QUP IP versions 2.5 and above, the oversampling rate is halved from
32 to 16.
Commit ce734600545f ("tty: serial: qcom_geni_serial: Update the
oversampling rate") was applied to handle this scenario, but its check
fails to classify QUP version 3.0 into the correct group (2.5 and
above), since it requires both the major version to be >= 2 and the
minor version to be >= 5.
As a result, the Serial Engine clocks are not configured properly for
the baud rate, and garbage data is sampled into the FIFOs from the line.
So fix the logic to detect all QUP versions 2.5 and above.
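As an illustration, assuming the packed version encoding implied by
QUP_SE_VERSION_2_5 below (major in the top bits, then minor, then step),
the two checks evaluate for QUP v3.0 as:

	/* illustrative: ver for QUP v3.0 would be 0x30000000 */
	old: MAJOR(ver) >= 2 && MINOR(ver) >= 5  ->  3 >= 2 && 0 >= 5        ->  false
	new: ver >= QUP_SE_VERSION_2_5           ->  0x30000000 >= 0x20050000 ->  true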
Fixes: ce734600545f ("tty: serial: qcom_geni_serial: Update the oversampling rate")
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Paras Sharma <parashar@codeaurora.org>
Reviewed-by: Akash Asthana <akashast@codeaurora.org>
Link: https://lore.kernel.org/r/1601445926-23673-1-git-send-email-parashar@codeau…
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
drivers/tty/serial/qcom_geni_serial.c | 2 +-
include/linux/qcom-geni-se.h | 3 +++
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index 8da2eb2d1090..291649f02821 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -1000,7 +1000,7 @@ static void qcom_geni_serial_set_termios(struct uart_port *uport,
sampling_rate = UART_OVERSAMPLING;
/* Sampling rate is halved for IP versions >= 2.5 */
ver = geni_se_get_qup_hw_version(&port->se);
- if (GENI_SE_VERSION_MAJOR(ver) >= 2 && GENI_SE_VERSION_MINOR(ver) >= 5)
+ if (ver >= QUP_SE_VERSION_2_5)
sampling_rate /= 2;
clk_rate = get_clk_div_rate(baud, sampling_rate, &clk_div);
diff --git a/include/linux/qcom-geni-se.h b/include/linux/qcom-geni-se.h
index 8f385fbe5a0e..1c31f26ccc7a 100644
--- a/include/linux/qcom-geni-se.h
+++ b/include/linux/qcom-geni-se.h
@@ -248,6 +248,9 @@ struct geni_se {
#define GENI_SE_VERSION_MINOR(ver) ((ver & HW_VER_MINOR_MASK) >> HW_VER_MINOR_SHFT)
#define GENI_SE_VERSION_STEP(ver) (ver & HW_VER_STEP_MASK)
+/* QUP SE VERSION value for major number 2 and minor number 5 */
+#define QUP_SE_VERSION_2_5 0x20050000
+
/*
* Define bandwidth thresholds that cause the underlying Core 2X interconnect
* clock to run at the named frequency. These baseline values are recommended
--
2.28.0
Commit a10674bf2406 ("tcp: detecting the misuse of .sendpage for Slab
objects") adds the check for slab pages, but pages that have no
page_count are still missing from the check.
The network layer's sendpage method is not designed to send pages with
a page_count of 0 either, so both PageSlab() and page_count() should be
checked for the page being sent. This is exactly what sendpage_ok()
does.
This patch uses sendpage_ok() in do_tcp_sendpages() to detect a misused
.sendpage and make the code more robust.
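For reference, the helper (introduced by a later patch in this series,
shown further below) reduces to exactly these two checks:

	static inline bool sendpage_ok(struct page *page)
	{
		return !PageSlab(page) && page_count(page) >= 1;
	}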
Fixes: a10674bf2406 ("tcp: detecting the misuse of .sendpage for Slab objects")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: stable@vger.kernel.org
---
net/ipv4/tcp.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 31f3b858db81..2135ee7c806d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -970,7 +970,8 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
if (IS_ENABLED(CONFIG_DEBUG_VM) &&
- WARN_ONCE(PageSlab(page), "page must not be a Slab one"))
+ WARN_ONCE(!sendpage_ok(page),
+ "page must not be a Slab one and have page_count > 0"))
return -EINVAL;
/* Wait for a connection to finish. One exception is TCP Fast Open
--
2.26.2
Currently nvme_tcp_try_send_data() doesn't use kernel_sendpage() to
send slab pages. But pages allocated by __get_free_pages() without
__GFP_COMP, which also have a refcount of 0, are still sent to the
remote end by kernel_sendpage(), which is problematic.
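A minimal sketch of the problematic allocation (illustrative only; the
order and flags here are assumptions, not taken from the driver):

	/* higher-order, non-compound allocation: 4 contiguous pages */
	unsigned long buf = __get_free_pages(GFP_KERNEL, 2);
	struct page *first = virt_to_page((void *)buf);

	/*
	 * Without __GFP_COMP the tail pages after 'first' have
	 * page_count() == 0, so the get_page()/put_page() done on the
	 * sendpage path can free pages that are still in use, while
	 * sock_no_sendpage() (which copies the data) remains safe.
	 */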
The newly introduced helper sendpage_ok() checks both the PageSlab tag
and the page_count counter, and returns true if the page is OK to be
sent by kernel_sendpage().
This patch fixes the page checking in nvme_tcp_try_send_data() with
sendpage_ok(): if sendpage_ok() returns true, send the page by
kernel_sendpage(); otherwise use sock_no_sendpage() to handle it.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Vlastimil Babka <vbabka@suse.com>
Cc: stable@vger.kernel.org
---
drivers/nvme/host/tcp.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 8f4f29f18b8c..d6a3e1487354 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -913,12 +913,11 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
else
flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
- /* can't zcopy slab pages */
- if (unlikely(PageSlab(page))) {
- ret = sock_no_sendpage(queue->sock, page, offset, len,
+ if (sendpage_ok(page)) {
+ ret = kernel_sendpage(queue->sock, page, offset, len,
flags);
} else {
- ret = kernel_sendpage(queue->sock, page, offset, len,
+ ret = sock_no_sendpage(queue->sock, page, offset, len,
flags);
}
if (ret <= 0)
--
2.26.2
The original problem came from the nvme-over-tcp code, which mistakenly
used kernel_sendpage() to send pages allocated by __get_free_pages()
without the __GFP_COMP flag. Such pages have no refcount (page_count is
0) on their tail pages, and sending them by kernel_sendpage() may
trigger a kernel panic from a corrupted kernel heap, because these pages
are incorrectly freed in the network stack as page_count 0 pages.
This patch introduces a helper, sendpage_ok(), which returns true if the
page to check:
- is not a slab page: PageSlab(page) is false.
- has a page refcount: page_count(page) is not zero.
Any driver that wants to send a page to a remote end by
kernel_sendpage() may use this helper to check whether the page is OK.
If the helper does not return true, the driver should fall back to a
non-sendpage method (e.g. sock_no_sendpage()) to handle the page.
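The intended calling pattern, mirroring the nvme-tcp fix earlier in this
series (sketched with generic sock/page/offset/len/flags arguments):

	if (sendpage_ok(page))
		ret = kernel_sendpage(sock, page, offset, len, flags);
	else
		ret = sock_no_sendpage(sock, page, offset, len, flags);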
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Vlastimil Babka <vbabka@suse.com>
Cc: stable@vger.kernel.org
---
include/linux/net.h | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/include/linux/net.h b/include/linux/net.h
index d48ff1180879..ae713c851342 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -21,6 +21,7 @@
#include <linux/rcupdate.h>
#include <linux/once.h>
#include <linux/fs.h>
+#include <linux/mm.h>
#include <linux/sockptr.h>
#include <uapi/linux/net.h>
@@ -286,6 +287,21 @@ do { \
#define net_get_random_once_wait(buf, nbytes) \
get_random_once_wait((buf), (nbytes))
+/*
+ * E.g. XFS meta- & log-data is in slab pages, or bcache meta
+ * data pages, or other high order pages allocated by
+ * __get_free_pages() without __GFP_COMP, which have a page_count
+ * of 0 and/or have PageSlab() set. We cannot use send_page for
+ * those, as that does get_page(); put_page(); and would cause
+ * either a VM_BUG directly, or __page_cache_release a page that
+ * would actually still be referenced by someone, leading to some
+ * obscure delayed Oops somewhere else.
+ */
+static inline bool sendpage_ok(struct page *page)
+{
+ return !PageSlab(page) && page_count(page) >= 1;
+}
+
int kernel_sendmsg(struct socket *sock, struct msghdr *msg, struct kvec *vec,
size_t num, size_t len);
int kernel_sendmsg_locked(struct sock *sk, struct msghdr *msg,
--
2.26.2
Although btrfs_calc_avail_data_space() tries to estimate how many data
chunks can be allocated, the estimation is far from perfect:
- Metadata over-commit is not considered at all
- Chunk allocation doesn't take RAID5/6 into consideration
This patch changes btrfs_calc_avail_data_space() to use the
pre-calculated per-profile available space.
This provides the following benefit:
- Accurate unallocated data space estimation
  It's as accurate as the chunk allocator, and can handle RAID5/6 and
  the newly introduced RAID1C3/C4.
Metadata over-commit is still not taken into consideration, as
over-commit only happens when we have enough unallocated space, and in
most cases we won't use that much metadata space anyway.
We also keep the existing 0-available-space check, to prevent reporting
a too optimistic f_bavail result.
Since we keep the old lock-free design, statfs should not experience
any extra delay.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/super.c | 131 +++--------------------------------------------
1 file changed, 7 insertions(+), 124 deletions(-)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 25967ecaaf0a..355e4f6a2fd4 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2016,124 +2016,6 @@ static inline void btrfs_descending_sort_devices(
btrfs_cmp_device_free_bytes, NULL);
}
-/*
- * The helper to calc the free space on the devices that can be used to store
- * file data.
- */
-static inline int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
- u64 *free_bytes)
-{
- struct btrfs_device_info *devices_info;
- struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
- struct btrfs_device *device;
- u64 type;
- u64 avail_space;
- u64 min_stripe_size;
- int num_stripes = 1;
- int i = 0, nr_devices;
- const struct btrfs_raid_attr *rattr;
-
- /*
- * We aren't under the device list lock, so this is racy-ish, but good
- * enough for our purposes.
- */
- nr_devices = fs_info->fs_devices->open_devices;
- if (!nr_devices) {
- smp_mb();
- nr_devices = fs_info->fs_devices->open_devices;
- ASSERT(nr_devices);
- if (!nr_devices) {
- *free_bytes = 0;
- return 0;
- }
- }
-
- devices_info = kmalloc_array(nr_devices, sizeof(*devices_info),
- GFP_KERNEL);
- if (!devices_info)
- return -ENOMEM;
-
- /* calc min stripe number for data space allocation */
- type = btrfs_data_alloc_profile(fs_info);
- rattr = &btrfs_raid_array[btrfs_bg_flags_to_raid_index(type)];
-
- if (type & BTRFS_BLOCK_GROUP_RAID0)
- num_stripes = nr_devices;
- else if (type & BTRFS_BLOCK_GROUP_RAID1)
- num_stripes = 2;
- else if (type & BTRFS_BLOCK_GROUP_RAID1C3)
- num_stripes = 3;
- else if (type & BTRFS_BLOCK_GROUP_RAID1C4)
- num_stripes = 4;
- else if (type & BTRFS_BLOCK_GROUP_RAID10)
- num_stripes = 4;
-
- /* Adjust for more than 1 stripe per device */
- min_stripe_size = rattr->dev_stripes * BTRFS_STRIPE_LEN;
-
- rcu_read_lock();
- list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
- if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
- &device->dev_state) ||
- !device->bdev ||
- test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
- continue;
-
- if (i >= nr_devices)
- break;
-
- avail_space = device->total_bytes - device->bytes_used;
-
- /* align with stripe_len */
- avail_space = rounddown(avail_space, BTRFS_STRIPE_LEN);
-
- /*
- * In order to avoid overwriting the superblock on the drive,
- * btrfs starts at an offset of at least 1MB when doing chunk
- * allocation.
- *
- * This ensures we have at least min_stripe_size free space
- * after excluding 1MB.
- */
- if (avail_space <= SZ_1M + min_stripe_size)
- continue;
-
- avail_space -= SZ_1M;
-
- devices_info[i].dev = device;
- devices_info[i].max_avail = avail_space;
-
- i++;
- }
- rcu_read_unlock();
-
- nr_devices = i;
-
- btrfs_descending_sort_devices(devices_info, nr_devices);
-
- i = nr_devices - 1;
- avail_space = 0;
- while (nr_devices >= rattr->devs_min) {
- num_stripes = min(num_stripes, nr_devices);
-
- if (devices_info[i].max_avail >= min_stripe_size) {
- int j;
- u64 alloc_size;
-
- avail_space += devices_info[i].max_avail * num_stripes;
- alloc_size = devices_info[i].max_avail;
- for (j = i + 1 - num_stripes; j <= i; j++)
- devices_info[j].max_avail -= alloc_size;
- }
- i--;
- nr_devices--;
- }
-
- kfree(devices_info);
- *free_bytes = avail_space;
- return 0;
-}
-
/*
* Calculate numbers for 'df', pessimistic in case of mixed raid profiles.
*
@@ -2150,6 +2032,7 @@ static inline int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
{
struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb);
+ struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
struct btrfs_super_block *disk_super = fs_info->super_copy;
struct btrfs_space_info *found;
u64 total_used = 0;
@@ -2159,7 +2042,7 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
__be32 *fsid = (__be32 *)fs_info->fs_devices->fsid;
unsigned factor = 1;
struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
- int ret;
+ enum btrfs_raid_types data_type;
u64 thresh = 0;
int mixed = 0;
@@ -2208,11 +2091,11 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
buf->f_bfree = 0;
spin_unlock(&block_rsv->lock);
- buf->f_bavail = div_u64(total_free_data, factor);
- ret = btrfs_calc_avail_data_space(fs_info, &total_free_data);
- if (ret)
- return ret;
- buf->f_bavail += div_u64(total_free_data, factor);
+ data_type = btrfs_bg_flags_to_raid_index(
+ btrfs_data_alloc_profile(fs_info));
+
+ buf->f_bavail = total_free_data +
+ atomic64_read(&fs_devices->per_profile_avail[data_type]);
buf->f_bavail = buf->f_bavail >> bits;
/*
--
2.28.0
For the following disk layout, can_overcommit() can cause false
confidence in the available space:

  devid 1 unallocated: 1T
  devid 2 unallocated: 10T
  metadata type: RAID1

Since can_overcommit() simply applies a factor to the unallocated space
to calculate the allocatable metadata chunk size, it believes we still
have 5.5T for metadata chunks, while in truth we only have 1T available
for metadata chunks.
This can lead to ENOSPC at run_delalloc_range() and cause a transaction
abort.
Since the factor-based calculation can't distinguish RAID1/RAID10 from
DUP at all, we need proper chunk-allocator-level awareness to do such an
estimation.
Thankfully, we already have the per-profile available space calculated;
just use that facility to avoid such false confidence.
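A worked comparison for the layout above (RAID1 has a factor of 2, as
only half the raw space is usable):

	factor-based (old):      avail = (1T + 10T) / 2 = 5.5T
	chunk-allocator-aware:   each RAID1 chunk needs one stripe on two
	                         distinct devices, so allocation stops once
	                         devid 1 is exhausted: avail = 1T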
CC: stable@vger.kernel.org # 5.4+
Reported-by: Marc Lehmann <schmorp@schmorp.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/space-info.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 64b6e1d44f47..4bb4e3c3531f 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -336,25 +336,21 @@ static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
struct btrfs_space_info *space_info,
enum btrfs_reserve_flush_enum flush)
{
+ enum btrfs_raid_types index;
u64 profile;
u64 avail;
- int factor;
if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
profile = btrfs_system_alloc_profile(fs_info);
else
profile = btrfs_metadata_alloc_profile(fs_info);
- avail = atomic64_read(&fs_info->free_chunk_space);
-
/*
- * If we have dup, raid1 or raid10 then only half of the free
- * space is actually usable. For raid56, the space info used
- * doesn't include the parity drive, so we don't have to
- * change the math
+ * Grab avail space from per-profile array which should be as accurate
+ * as chunk allocator.
*/
- factor = btrfs_bg_type_to_factor(profile);
- avail = div_u64(avail, factor);
+ index = btrfs_bg_flags_to_raid_index(profile);
+ avail = atomic64_read(&fs_info->fs_devices->per_profile_avail[index]);
/*
* If we aren't flushing all things, let us overcommit up to
--
2.28.0