commit 5f58d783fd7823b2c2d5954d1126e702f94bfc4c upstream
We have this check to prevent devices that may have disappeared and
re-appeared with an older generation (such as a replace source device)
from being added back to an fs_devices. This makes sense: we don't
want stale disks in our file system. However, for single disks this
doesn't really make sense.
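For reference, the check in question looks roughly like this (a
simplified sketch of the generation comparison in device_list_add()
in fs/btrfs/volumes.c, not the exact upstream code):

  if (!fs_devices->opened && found_transid < device->generation) {
          /*
           * The device re-appeared with an older superblock
           * generation than the one already cached for this fsid;
           * reject it so a stale disk cannot rejoin the filesystem.
           */
          return ERR_PTR(-EEXIST);
  }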
I've seen this in testing, but I was provided a reproducer from a
project that builds btrfs images on loopback devices. The loopback
device gets cached with the new generation, and then if it is re-used to
generate a new file system we'll fail to mount it because the new fs is
"older" than what we have in cache.
Fix this by freeing the cache when closing the device for a single
device filesystem. This ensures that the device path passed to the
mount command is scanned successfully during the next mount.
CC: stable@vger.kernel.org # 5.10+
Reported-by: Daan De Meyer <daandemeyer@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
This patch has already been submitted for the LTS stable trees 5.10 and above.
fs/btrfs/volumes.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 548de841cee5..dacaea61c2f7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -354,6 +354,7 @@ void btrfs_free_device(struct btrfs_device *device)
static void free_fs_devices(struct btrfs_fs_devices *fs_devices)
{
struct btrfs_device *device;
+
WARN_ON(fs_devices->opened);
while (!list_empty(&fs_devices->devices)) {
device = list_entry(fs_devices->devices.next,
@@ -1401,6 +1402,17 @@ int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
if (!fs_devices->opened) {
seed_devices = fs_devices->seed;
fs_devices->seed = NULL;
+
+ /*
+ * If the struct btrfs_fs_devices is not assembled with any
+ * other device, it can be re-initialized during the next mount
+ * without needing the device-scan step. Therefore, it can be
+ * fully freed.
+ */
+ if (fs_devices->num_devices == 1) {
+ list_del(&fs_devices->fs_list);
+ free_fs_devices(fs_devices);
+ }
}
mutex_unlock(&uuid_mutex);
--
2.31.1
This patch series removes reader optimistic spinning in kernel 5.10 to
improve MongoDB performance. Performance measurements (the average
overall throughput in ops/sec over 10 runs) use MongoDB 5.0.11 and the
YCSB [1] microbenchmark with workloadA [2] on AWS EC2
m5.4xlarge/m6g.4xlarge (16-vCPU, 64 GiB memory) instances with a 512 GB
EBS IO1 disk provisioned at 5000 IOPS, running MongoDB and the YCSB
load generator on 2 separate instances, with recordcount=25000000 and
operationcount=10000000, to see the impact of these changes:
Before - v5.10.165 kernel on Amazon Linux 2
After - v5.10.165 kernel with reader optimistic spinning disabled on
Amazon Linux 2
| Arch | Instance Type | Before | After |
|---------+---------------+---------+---------|
| x86_64 | m5.4xlarge | 37365.4 | 42373.9 |
|---------+---------------+---------+---------|
| aarch64 | m6g.4xlarge | 33823.1 | 43113.7 |
|---------+---------------+---------+---------|
It can be seen that MongoDB throughput improves by around 13% on
x86_64 and 27% on aarch64 after disabling reader optimistic spinning.
These patches apply to 5.10 with no conflicts, so we wonder if it's
possible to backport them to stable 5.10.
[1] https://github.com/brianfrankcooper/YCSB/releases/download/0.17.0/ycsb-0.17…
[2] https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloada
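For context, here is a minimal kernel-style sketch (illustrative
names, not code from the series) of the rwsem usage pattern these
patches affect; after the series, a reader that finds the semaphore
writer-owned goes to sleep on the wait queue instead of optimistically
spinning on the owner:

  #include <linux/rwsem.h>

  static DECLARE_RWSEM(example_rwsem);    /* hypothetical lock */

  /* Read side: without reader optimistic spinning, contended readers
   * sleep rather than burn CPU spinning on a writer owner. */
  static void example_read(void)
  {
          down_read(&example_rwsem);
          /* read-side critical section */
          up_read(&example_rwsem);
  }

  /* Write side: writer optimistic spinning is unchanged. */
  static void example_write(void)
  {
          down_write(&example_rwsem);
          /* write-side critical section */
          up_write(&example_rwsem);
  }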
Thanks,
Shaoying
Peter Zijlstra (3):
locking/rwsem: Better collate rwsem_read_trylock()
locking/rwsem: Introduce rwsem_write_trylock()
locking/rwsem: Fold __down_{read,write}*()
Waiman Long (4):
locking/rwsem: Pass the current atomic count to
rwsem_down_read_slowpath()
locking/rwsem: Prevent potential lock starvation
locking/rwsem: Enable reader optimistic lock stealing
locking/rwsem: Remove reader optimistic spinning
kernel/locking/lock_events_list.h | 6 +-
kernel/locking/rwsem.c | 359 +++++++++---------------------
2 files changed, 106 insertions(+), 259 deletions(-)
--
2.38.1
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
Possible dependencies:
ec76d0c2da5c ("vmxnet3: move rss code block under eop descriptor")
bdeed8b0958c ("vmxnet3: Record queue number to incoming packets")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ec76d0c2da5c6dfb6a33f1545cc15997013923da Mon Sep 17 00:00:00 2001
From: Ronak Doshi <doshir@vmware.com>
Date: Wed, 8 Feb 2023 14:38:59 -0800
Subject: [PATCH] vmxnet3: move rss code block under eop descriptor
Commit b3973bb40041 ("vmxnet3: set correct hash type based on
rss information") added hashType information into the skb. However,
the rssType field is only populated for the eop descriptor. This can
lead to incorrect reporting of the hashType for packets which use
multiple rx descriptors. Multiple rx descriptors are used for jumbo
frame or LRO packets, which can hit this issue.
This patch moves the RSS code block under the eop descriptor.
Cc: stable@vger.kernel.org
Fixes: b3973bb40041 ("vmxnet3: set correct hash type based on rss information")
Signed-off-by: Ronak Doshi <doshir@vmware.com>
Acked-by: Peng Li <lpeng@vmware.com>
Acked-by: Guolin Yang <gyang@vmware.com>
Link: https://lore.kernel.org/r/20230208223900.5794-1-doshir@vmware.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index 56267c327f0b..682987040ea8 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -1546,31 +1546,6 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
rxd->len = rbi->len;
}
-#ifdef VMXNET3_RSS
- if (rcd->rssType != VMXNET3_RCD_RSS_TYPE_NONE &&
- (adapter->netdev->features & NETIF_F_RXHASH)) {
- enum pkt_hash_types hash_type;
-
- switch (rcd->rssType) {
- case VMXNET3_RCD_RSS_TYPE_IPV4:
- case VMXNET3_RCD_RSS_TYPE_IPV6:
- hash_type = PKT_HASH_TYPE_L3;
- break;
- case VMXNET3_RCD_RSS_TYPE_TCPIPV4:
- case VMXNET3_RCD_RSS_TYPE_TCPIPV6:
- case VMXNET3_RCD_RSS_TYPE_UDPIPV4:
- case VMXNET3_RCD_RSS_TYPE_UDPIPV6:
- hash_type = PKT_HASH_TYPE_L4;
- break;
- default:
- hash_type = PKT_HASH_TYPE_L3;
- break;
- }
- skb_set_hash(ctx->skb,
- le32_to_cpu(rcd->rssHash),
- hash_type);
- }
-#endif
skb_record_rx_queue(ctx->skb, rq->qid);
skb_put(ctx->skb, rcd->len);
@@ -1653,6 +1628,31 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
u32 mtu = adapter->netdev->mtu;
skb->len += skb->data_len;
+#ifdef VMXNET3_RSS
+ if (rcd->rssType != VMXNET3_RCD_RSS_TYPE_NONE &&
+ (adapter->netdev->features & NETIF_F_RXHASH)) {
+ enum pkt_hash_types hash_type;
+
+ switch (rcd->rssType) {
+ case VMXNET3_RCD_RSS_TYPE_IPV4:
+ case VMXNET3_RCD_RSS_TYPE_IPV6:
+ hash_type = PKT_HASH_TYPE_L3;
+ break;
+ case VMXNET3_RCD_RSS_TYPE_TCPIPV4:
+ case VMXNET3_RCD_RSS_TYPE_TCPIPV6:
+ case VMXNET3_RCD_RSS_TYPE_UDPIPV4:
+ case VMXNET3_RCD_RSS_TYPE_UDPIPV6:
+ hash_type = PKT_HASH_TYPE_L4;
+ break;
+ default:
+ hash_type = PKT_HASH_TYPE_L3;
+ break;
+ }
+ skb_set_hash(skb,
+ le32_to_cpu(rcd->rssHash),
+ hash_type);
+ }
+#endif
vmxnet3_rx_csum(adapter, skb,
(union Vmxnet3_GenericDesc *)rcd);
skb->protocol = eth_type_trans(skb, adapter->netdev);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
Possible dependencies:
ce4d9a1ea35a ("of: reserved_mem: Have kmemleak ignore dynamically allocated reserved mem")
3ecc68349bba ("memblock: rename memblock_free to memblock_phys_free")
fa27717110ae ("memblock: drop memblock_free_early_nid() and memblock_free_early()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ce4d9a1ea35ac5429e822c4106cb2859d5c71f3e Mon Sep 17 00:00:00 2001
From: "Isaac J. Manjarres" <isaacmanjarres(a)google.com>
Date: Wed, 8 Feb 2023 15:20:00 -0800
Subject: [PATCH] of: reserved_mem: Have kmemleak ignore dynamically allocated
reserved mem
Patch series "Fix kmemleak crashes when scanning CMA regions", v2.
When trying to boot a device with an ARM64 kernel with the following
config options enabled:
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
CONFIG_DEBUG_KMEMLEAK=y
a crash is encountered when kmemleak starts to scan the list of gray
or allocated objects that it maintains. Upon closer inspection, it was
observed that these page-faults always occurred when kmemleak attempted
to scan a CMA region.
At the moment, kmemleak is made aware of CMA regions that are specified
through the devicetree to be dynamically allocated within a range of
addresses. However, kmemleak should not need to scan CMA regions or any
reserved memory region, as those regions can be used for DMA transfers
between drivers and peripherals, and thus wouldn't contain anything
useful for kmemleak.
Additionally, since CMA regions are unmapped from the kernel's address
space when they are freed to the buddy allocator at boot when
CONFIG_DEBUG_PAGEALLOC is enabled, kmemleak shouldn't attempt to access
those memory regions, as that will trigger a crash. Thus, kmemleak
should ignore all dynamically allocated reserved memory regions.
This patch (of 1):
Currently, kmemleak ignores dynamically allocated reserved memory regions
that don't have a kernel mapping. However, regions that do retain a
kernel mapping (e.g. CMA regions) do get scanned by kmemleak.
This is not ideal for two reasons:
1. kmemleak works by scanning memory regions for pointers to allocated
objects to determine if those objects have been leaked or not.
However, reserved memory regions can be used between drivers and
peripherals for DMA transfers, and thus, would not contain pointers to
allocated objects, making it unnecessary for kmemleak to scan these
reserved memory regions.
2. When CONFIG_DEBUG_PAGEALLOC is enabled, along with kmemleak, the
CMA reserved memory regions are unmapped from the kernel's address
space when they are freed to buddy at boot. These CMA reserved regions
are still tracked by kmemleak, however, and when kmemleak attempts to
scan them, a crash will happen, as accessing the CMA region will result
in a page-fault, since the regions are unmapped.
Thus, use kmemleak_ignore_phys() for all dynamically allocated reserved
memory regions, instead of only those that do not have a kernel mapping
associated with them.
Link: https://lkml.kernel.org/r/20230208232001.2052777-1-isaacmanjarres@google.com
Link: https://lkml.kernel.org/r/20230208232001.2052777-2-isaacmanjarres@google.com
Fixes: a7259df76702 ("memblock: make memblock_find_in_range method private")
Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Kirill A. Shutemov <kirill.shtuemov@linux.intel.com>
Cc: Nick Kossifidis <mick@ics.forth.gr>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Cc: Saravana Kannan <saravanak@google.com>
Cc: <stable@vger.kernel.org> [5.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c
index 65f3b02a0e4e..f90975e00446 100644
--- a/drivers/of/of_reserved_mem.c
+++ b/drivers/of/of_reserved_mem.c
@@ -48,9 +48,10 @@ static int __init early_init_dt_alloc_reserved_memory_arch(phys_addr_t size,
err = memblock_mark_nomap(base, size);
if (err)
memblock_phys_free(base, size);
- kmemleak_ignore_phys(base);
}
+ kmemleak_ignore_phys(base);
+
return err;
}
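For context, after this change the allocation helper looks roughly as
follows (a reconstructed sketch of
early_init_dt_alloc_reserved_memory_arch(), which may differ slightly
from the actual tree):

  static int __init early_init_dt_alloc_reserved_memory_arch(phys_addr_t size,
          phys_addr_t align, phys_addr_t start, phys_addr_t end, bool nomap,
          phys_addr_t *res_base)
  {
          phys_addr_t base;
          int err = 0;

          base = memblock_phys_alloc_range(size, align, start, end);
          if (!base)
                  return -ENOMEM;

          *res_base = base;
          if (nomap) {
                  err = memblock_mark_nomap(base, size);
                  if (err)
                          memblock_phys_free(base, size);
          }

          /* Ignore every dynamically allocated reserved region, not
           * just the nomap ones: kmemleak must never scan these. */
          kmemleak_ignore_phys(base);

          return err;
  }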
From: Paolo Abeni <pabeni@redhat.com>
[ Upstream commit d4e85922e3e7ef2071f91f65e61629b60f3a9cf4 ]
If the peer closes all the existing subflows for a given
mptcp socket and later the application closes it, the current
implementation lets it survive until the timewait timeout expires.
While the above is allowed by the protocol specification, it
consumes resources for almost no reason and additionally
causes sporadic self-test failures.
Let's move the mptcp socket to the TCP_CLOSE state when there are
no alive subflows at close time, so that the allocated resources
will be freed immediately.
Fixes: e16163b6e2b7 ("mptcp: refactor shutdown and close")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
---
Hi Greg, Sasha,
Here is one MPTCP patch backport that recently failed to apply to the
5.15 stable tree: it clears resources earlier if there are no more
reasons to keep MPTCP sockets alive.
I had a simple conflict because in v5.15 the context is a bit different
when iterating over the different subflows in __mptcp_close(), but the
idea is still the same: in this loop, a counter needs to be incremented.
---
net/mptcp/protocol.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 47f359dac247..5d05d85242bc 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2726,6 +2726,7 @@ static void mptcp_close(struct sock *sk, long timeout)
{
struct mptcp_subflow_context *subflow;
bool do_cancel_work = false;
+ int subflows_alive = 0;
lock_sock(sk);
sk->sk_shutdown = SHUTDOWN_MASK;
@@ -2747,11 +2748,19 @@ static void mptcp_close(struct sock *sk, long timeout)
struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
bool slow = lock_sock_fast_nested(ssk);
+ subflows_alive += ssk->sk_state != TCP_CLOSE;
+
sock_orphan(ssk);
unlock_sock_fast(ssk, slow);
}
sock_orphan(sk);
+ /* all the subflows are closed, only timeout can change the msk
+ * state, let's not keep resources busy for no reason
+ */
+ if (subflows_alive == 0)
+ inet_sk_state_store(sk, TCP_CLOSE);
+
sock_hold(sk);
pr_debug("msk=%p state=%d", sk, sk->sk_state);
if (sk->sk_state == TCP_CLOSE) {
---
base-commit: e2c1a934fd8e4288e7a32f4088ceaccf469eb74c
change-id: 20230214-upstream-stable-20230214-linux-5-15-94-rc1-mptcp-fixes-517feb25bd47
Best regards,
--
Matthieu Baerts <matthieu.baerts@tessares.net>