Hi,
While running the udp-flood test from stress-ng on Ampere Altra (Mt. Jade platform) I encountered a kernel panic caused by NULL pointer dereference within nf_conntrack.
The issue is present in the latest mainline (5.19-rc4), latest stable (5.18.8), as well as multiple older stable versions. The last working stable version I found was 5.15.40.
Through bisecting I've traced the issue back to mainline commit 719774377622bc4025d2a74f551b5dc2158c6c30 ("netfilter: conntrack: convert to refcount_t api"); on kernels from before this commit the test runs fine. As far as I can tell, this commit was included in stable with version 5.15.41, thus causing the regression compared to 5.15.40. It was included in the mainline with version 5.16.
The issue is very consistently reproducible as well: running this command resulted in the same kernel panic every time I tried it, on different kernels from after the change in question was merged.
stress-ng --udp-flood 0 -t 1m --metrics-brief --perf
The commit was not easily revertible so I can't say whether reverting it on the latest mainline would fix the problem or not.
Here's the decoded kernel panic in question for reference:
[ 869.372868] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 869.381708] Mem abort info:
[ 869.384490]   ESR = 0x0000000096000044
[ 869.388245]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 869.393681]   SET = 0, FnV = 0
[ 869.396723]   EA = 0, S1PTW = 0
[ 869.399859]   FSC = 0x04: level 0 translation fault
[ 869.404731] Data abort info:
[ 869.407606]   ISV = 0, ISS = 0x00000044
[ 869.411437]   CM = 0, WnR = 1
[ 869.414398] user pgtable: 4k pages, 48-bit VAs, pgdp=000008002f6b9000
[ 869.420834] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[ 869.427621] Internal error: Oops: 96000044 [#1] SMP
[ 869.432488] Modules linked in: sctp ip6_udp_tunnel udp_tunnel dccp_ipv4 dccp xt_conntrack xt_MASQUERADE xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc ipmi_devintf ipmi_msghandler overlay cppc_cpufreq drm ip_tables x_tables btrfs blake2b_generic libcrc32c xor xor_neon raid6_pq zstd_compress nvme mlx5_core crct10dif_ce nvme_core mlxfw
[ 869.466986] CPU: 13 PID: 100864 Comm: stress-ng-udp-f Not tainted 5.19.0-rc4-custom-kajpuc01-teo #15
[ 869.476107] Hardware name: WIWYNN Mt.Jade Server System B81.03001.0005/Mt.Jade Motherboard, BIOS 1.08.20220218 (SCP: 1.08.20220218) 2022/02/18
[ 869.488872] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 869.495821] pc : __nf_ct_delete_from_lists (/home/kajpuc01/linux/./include/linux/list_nulls.h:108) nf_conntrack
[ 869.502174] lr : __nf_ct_delete_from_lists (/home/kajpuc01/linux/net/netfilter/nf_conntrack_core.c:628) nf_conntrack
[ 869.508523] sp : ffff800055acb6d0
[ 869.511825] x29: ffff800055acb6d0 x28: ffffa1a42f9c8ac0 x27: 0000000000000000
[ 869.518949] x26: ffffa1a3f531c780 x25: 000000000000ef6d x24: ffff4005eddfc500
[ 869.526072] x23: ffff4005eddfc520 x22: ffff4005eddfc558 x21: ffffa1a42f9c8ac0
[ 869.533195] x20: 0000000000000000 x19: 0000000000023923 x18: 0000000000000000
[ 869.540319] x17: 0000000000000000 x16: ffffa1a42d8c3bb0 x15: 0000ffffffec5ff0
[ 869.547442] x14: 4e4e4e4e4e4e4e4e x13: 4e4e4e4e4e4e4e4e x12: 4e4e4e4e4e4e4e4e
[ 869.554565] x11: 0000000000000000 x10: 000000000100007f x9 : ffffa1a3f5304074
[ 869.561688] x8 : 0000000000000000 x7 : 00113ec6b282834e x6 : ffff800055acb6c8
[ 869.568811] x5 : dd7051c45787c585 x4 : abfc59e3b0a2b492 x3 : 0000000000000000
[ 869.575934] x2 : 0000000000000001 x1 : 0000000000000000 x0 : 000000000001dedb
[ 869.583058] Call trace:
[ 869.585492]  __nf_ct_delete_from_lists (/home/kajpuc01/linux/./include/linux/list_nulls.h:108) nf_conntrack
[ 869.591494]  nf_ct_delete (/home/kajpuc01/linux/./include/linux/bottom_half.h:33) nf_conntrack
[ 869.596368]  early_drop (/home/kajpuc01/linux/net/netfilter/nf_conntrack_core.c:1400) nf_conntrack
[ 869.601154]  __nf_conntrack_alloc (/home/kajpuc01/linux/net/netfilter/nf_conntrack_core.c:1610) nf_conntrack
[ 869.606808]  init_conntrack.isra.0 (/home/kajpuc01/linux/net/netfilter/nf_conntrack_core.c:1716) nf_conntrack
[ 869.612549]  nf_conntrack_in (/home/kajpuc01/linux/net/netfilter/nf_conntrack_core.c:1832) nf_conntrack
[ 869.617769]  ipv4_conntrack_local (/home/kajpuc01/linux/net/netfilter/nf_conntrack_proto.c:214) nf_conntrack
[ 869.623249]  nf_hook_slow (/home/kajpuc01/linux/net/netfilter/core.c:621)
[ 869.626814]  __ip_local_out (/home/kajpuc01/linux/net/ipv4/ip_output.c:118)
[ 869.630640]  ip_local_out (/home/kajpuc01/linux/net/ipv4/ip_output.c:125)
[ 869.634117]  ip_send_skb (/home/kajpuc01/linux/net/ipv4/ip_output.c:1572)
[ 869.637508]  udp_send_skb.isra.0 (/home/kajpuc01/linux/net/ipv4/udp.c:968)
[ 869.641767]  udp_sendmsg (/home/kajpuc01/linux/net/ipv4/udp.c:1254)
[ 869.645329]  inet_sendmsg (/home/kajpuc01/linux/net/ipv4/af_inet.c:821)
[ 869.648807]  sock_sendmsg (/home/kajpuc01/linux/net/socket.c:717)
[ 869.652284]  __sys_sendto (/home/kajpuc01/linux/net/socket.c:2119)
[ 869.655847]  __arm64_sys_sendto (/home/kajpuc01/linux/net/socket.c:2127)
[ 869.659845]  invoke_syscall (/home/kajpuc01/linux/./arch/arm64/include/asm/current.h:19)
[ 869.663584]  el0_svc_common.constprop.0 (/home/kajpuc01/linux/arch/arm64/kernel/syscall.c:75)
[ 869.668449]  do_el0_svc (/home/kajpuc01/linux/arch/arm64/kernel/syscall.c:207)
[ 869.671753]  el0_svc (/home/kajpuc01/linux/./arch/arm64/include/asm/daifflags.h:28)
[ 869.674796]  el0t_64_sync_handler (/home/kajpuc01/linux/arch/arm64/kernel/entry-common.c:643)
[ 869.678966]  el0t_64_sync (/home/kajpuc01/linux/arch/arm64/kernel/entry.S:581)
[ 869.682618] Code: 72001c1f 54fffc01 d503201f a9410700 (f9000020)
All code
========
   0:   72001c1f        tst     w0, #0xff
   4:   54fffc01        b.ne    0xffffffffffffff84  // b.any
   8:   d503201f        nop
   c:   a9410700        ldp     x0, x1, [x24, #16]
  10:*  f9000020        str     x0, [x1]            <-- trapping instruction

Code starting with the faulting instruction
===========================================
   0:   f9000020        str     x0, [x1]
[ 869.688700] ---[ end trace 0000000000000000 ]---
[ 869.693306] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 869.700168] SMP: stopping secondary CPUs
[ 869.704219] Kernel Offset: 0x21a4251b0000 from 0xffff800008000000
[ 869.710300] PHYS_OFFSET: 0x80000000
[ 869.713775] CPU features: 0x000,0042e015,19805c82
[ 869.718467] Memory Limit: none
[ 869.721509] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
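For reference, the faulting "str x0, [x1]" with x1 == 0 appears to correspond to the pprev write in __hlist_nulls_del() (list_nulls.h:108 in the decoded trace above), reached from __nf_ct_delete_from_lists(). Roughly the following helper, quoted from memory, so please double-check against the actual tree:

static inline void __hlist_nulls_del(struct hlist_nulls_node *n)
{
        struct hlist_nulls_node *next = n->next;
        struct hlist_nulls_node **pprev = n->pprev;     /* seemingly NULL here */

        WRITE_ONCE(*pprev, next);                       /* <-- the faulting store */
        if (!is_a_nulls(next))
                WRITE_ONCE(next->pprev, pprev);
}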
The contents of /proc/cpuinfo:
processor       : 0
BogoMIPS        : 50.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x3
CPU part        : 0xd0c
CPU revision    : 1
(The same type of CPU is repeated 160 times, only including one for brevity)
/proc/version:
Linux version 5.19.0-rc4-custom-kajpuc01-teo (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #37 SMP Thu Jun 30 14:53:25 UTC 2022
The distribution is Ubuntu 20.04.3 LTS; the architecture is aarch64.
Please let me know if I can provide any more details or try any more tests.
Regards, Kajetan
[TLDR: I'm adding this regression report to the list of tracked regressions; all text from me you find below is based on a few template paragraphs you might have encountered already in similar form.]
TWIMC: this mail is primarily sent for documentation purposes and for regzbot, my Linux kernel regression tracking bot. These mails usually contain '#forregzbot' in the subject, to make them easy to spot and filter.
Hi, this is your Linux kernel regression tracker.
On 01.07.22 13:11, Kajetan Puchalski wrote:
Hi,
While running the udp-flood test from stress-ng on Ampere Altra (Mt. Jade platform) I encountered a kernel panic caused by NULL pointer dereference within nf_conntrack.
The issue is present in the latest mainline (5.19-rc4), latest stable (5.18.8), as well as multiple older stable versions. The last working stable version I found was 5.15.40.
Through bisecting I've traced the issue back to mainline commit 719774377622bc4025d2a74f551b5dc2158c6c30 ("netfilter: conntrack: convert to refcount_t api"); on kernels from before this commit the test runs fine. As far as I can tell, this commit was included in stable with version 5.15.41, thus causing the regression compared to 5.15.40. It was included in the mainline with version 5.16.
FWIW, looks like it was merged for v5.17-rc1:

$ git describe --contains --tags 719774377622bc4025
v5.17-rc1~170^2~24^2~19
The issue is very consistently reproducible as well, running this command resulted in the same kernel panic every time I tried it on different kernels from after the change in question was merged.
stress-ng --udp-flood 0 -t 1m --metrics-brief --perf
The commit was not easily revertible so I can't say whether reverting it on the latest mainline would fix the problem or not.
[...]
The distribution is Ubuntu 20.04.3 LTS; the architecture is aarch64.
Please let me know if I can provide any more details or try any more tests.
Thanks for the report. To be sure the issue below doesn't fall through the cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression tracking bot:
#regzbot ^introduced 719774377622bc402
#regzbot title net: netfilter: stress-ng udp-flood causes kernel panic on Ampere Altra
This isn't a regression? This issue or a fix for it are already discussed somewhere else? It was fixed already? You want to clarify when the regression started to happen? Or point out I got the title or something else totally wrong? Then just reply -- ideally with also telling regzbot about it, as explained here: https://linux-regtracking.leemhuis.info/tracked-regression/
Reminder for developers: When fixing the issue, add 'Link:' tags pointing to the report (the mail this one replies to), as explained in the Linux kernel's documentation; the webpage above explains why this is important for tracked regressions.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
P.S.: As the Linux kernel's regression tracker I deal with a lot of reports and sometimes miss something important when writing mails like this. If that's the case here, don't hesitate to tell me in a public reply, it's in everyone's interest to set the public record straight.
Kajetan Puchalski <kajetan.puchalski@arm.com> wrote:
While running the udp-flood test from stress-ng on Ampere Altra (Mt. Jade platform) I encountered a kernel panic caused by NULL pointer dereference within nf_conntrack.
The issue is present in the latest mainline (5.19-rc4), latest stable (5.18.8), as well as multiple older stable versions. The last working stable version I found was 5.15.40.
Do I need a special setup for conntrack?
No crashes after more than one hour of stress-ng on
1. 4 core amd64 Fedora 5.17 kernel
2. 16 core amd64, linux stable 5.17.15
3. 12 core intel, Fedora 5.18 kernel
4. 3 core aarch64 vm, 5.18.7-200.fc36.aarch64
I used standard firewalld ruleset for all of these and manually tuned conntrack settings to make sure the early evict path (as per backtrace) gets exercised.
On Fri, Jul 01, 2022 at 10:01:10PM +0200, Florian Westphal wrote:
Kajetan Puchalski <kajetan.puchalski@arm.com> wrote:
While running the udp-flood test from stress-ng on Ampere Altra (Mt. Jade platform) I encountered a kernel panic caused by NULL pointer dereference within nf_conntrack.
The issue is present in the latest mainline (5.19-rc4), latest stable (5.18.8), as well as multiple older stable versions. The last working stable version I found was 5.15.40.
Do I need a special setup for conntrack?
I don't think there was any special setup involved; the config I started from was a generic distribution config and I didn't change any networking-specific options. In case that's helpful, here's the .config I used.
No crashes after more than one hour of stress-ng on
- 4 core amd64 Fedora 5.17 kernel
- 16 core amd64, linux stable 5.17.15
- 12 core intel, Fedora 5.18 kernel
- 3 core aarch64 vm, 5.18.7-200.fc36.aarch64
That would make sense; from further experiments I ran, it somehow seems to be related to the number of workers being spawned by stress-ng along with the CPUs/cores involved.
For instance, running the test with <=25 workers (--udp-flood 25 etc.) results in the test running fine for at least 15 minutes. Running the test with 30 workers results in a panic sometime before it hits the 15 minute mark. Based on observations, there seems to be a correlation between the number of workers and how quickly the panic occurs, i.e. with 30 it takes a few minutes, with 160 it consistently happens almost immediately. That also holds for various numbers of workers in between.
On the CPU/core side of things, the machine in question has two CPU sockets with 80 identical cores each. All the panics I've encountered happened when stress-ng was run directly and unbound. When I tried using hwloc-bind to bind the process to one of the CPU sockets, the test ran for 15 mins with 80 and 160 workers with no issues, no matter which socket it was bound to.
I.e. the specific circumstances under which it seems to occur are when the test is able to run across multiple CPU sockets with a large number of workers being spawned.
I used standard firewalld ruleset for all of these and manually tuned conntrack settings to make sure the early evict path (as per backtrace) gets exercised.
On Sat, Jul 02, 2022 at 12:08:46PM +0100, Kajetan Puchalski wrote:
On Fri, Jul 01, 2022 at 10:01:10PM +0200, Florian Westphal wrote:
Kajetan Puchalski <kajetan.puchalski@arm.com> wrote:
While running the udp-flood test from stress-ng on Ampere Altra (Mt. Jade platform) I encountered a kernel panic caused by NULL pointer dereference within nf_conntrack.
The issue is present in the latest mainline (5.19-rc4), latest stable (5.18.8), as well as multiple older stable versions. The last working stable version I found was 5.15.40.
Do I need a special setup for conntrack?
I don't think there was any special setup involved; the config I started from was a generic distribution config and I didn't change any networking-specific options. In case that's helpful, here's the .config I used.
No crashes after more than one hour of stress-ng on
- 4 core amd64 Fedora 5.17 kernel
- 16 core amd64, linux stable 5.17.15
- 12 core intel, Fedora 5.18 kernel
- 3 core aarch64 vm, 5.18.7-200.fc36.aarch64
That would make sense; from further experiments I ran, it somehow seems to be related to the number of workers being spawned by stress-ng along with the CPUs/cores involved.
For instance, running the test with <=25 workers (--udp-flood 25 etc.) results in the test running fine for at least 15 minutes.
Another point to keep in mind is that modern ARM processors (ARMv8.1 and above) have a more relaxed memory model than older ones (and x86), which can easily expose a missing barrier somewhere. I already faced this situation in the past, the first time I ran my code on Graviton2: it caused crashes that would never happen on A53/A72/A73 cores nor on x86.
ARMv8.1 SoCs are not yet widely available for end users like us. The A76 is only coming, and the A55 has now been available for a bit more than a year. So testing on regular ARM devices like the RPi etc. may not exhibit such differences.
Running the test with 30 workers results in a panic sometime before it hits the 15 minute mark. Based on observations, there seems to be a correlation between the number of workers and how quickly the panic occurs, i.e. with 30 it takes a few minutes, with 160 it consistently happens almost immediately. That also holds for various numbers of workers in between.
On the CPU/core side of things, the machine in question has two CPU sockets with 80 identical cores each. All the panics I've encountered happened when stress-ng was run directly and unbound. When I tried using hwloc-bind to bind the process to one of the CPU sockets, the test ran for 15 mins with 80 and 160 workers with no issues, no matter which socket it was bound to.
I.e. the specific circumstances under which it seems to occur are when the test is able to run across multiple CPU sockets with a large number of workers being spawned.
This could further fuel the possibility explained above.
Regards, Willy
Kajetan Puchalski <kajetan.puchalski@arm.com> wrote:
On Fri, Jul 01, 2022 at 10:01:10PM +0200, Florian Westphal wrote:
Kajetan Puchalski <kajetan.puchalski@arm.com> wrote:
While running the udp-flood test from stress-ng on Ampere Altra (Mt. Jade platform) I encountered a kernel panic caused by NULL pointer dereference within nf_conntrack.
The issue is present in the latest mainline (5.19-rc4), latest stable (5.18.8), as well as multiple older stable versions. The last working stable version I found was 5.15.40.
Do I need a special setup for conntrack?
I don't think there was any special setup involved; the config I started from was a generic distribution config and I didn't change any networking-specific options. In case that's helpful, here's the .config I used.
No crashes after more than one hour of stress-ng on
- 4 core amd64 Fedora 5.17 kernel
- 16 core amd64, linux stable 5.17.15
- 12 core intel, Fedora 5.18 kernel
- 3 core aarch64 vm, 5.18.7-200.fc36.aarch64
That would make sense; from further experiments I ran, it somehow seems to be related to the number of workers being spawned by stress-ng along with the CPUs/cores involved.
For instance, running the test with <=25 workers (--udp-flood 25 etc.) results in the test running fine for at least 15 minutes.
Ok. I will let it run for longer on the machines I have access to.
In the meantime, you could test the attached patch; it's a simple s/refcount_/atomic_/ in nf_conntrack.
If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you but works with the attached patch, someone who understands aarch64 memory ordering would have to look more closely at the refcount_XXX functions to see where they might differ from the atomic_ ones.
If it still crashes, please try the below hunk in addition, although I don't see how it would make a difference.
This is the one spot where the original conversion replaced atomic_inc() with refcount_set(). This is on allocation: the refcount is expected to be 0, so refcount_inc() would trigger a warning hinting at a use-after-free.
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1776,7 +1776,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
                __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);

        /* Now it is going to be associated with an sk_buff, set refcount to 1. */
-       atomic_set(&ct->ct_general.use, 1);
+       atomic_inc(&ct->ct_general.use);

        if (exp) {
                if (exp->expectfn)
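For anyone wondering why the refcount_t API objects to that, here is a rough userspace model of the check (an illustration only, not the real lib/refcount.c code): refcount_t treats an increment from 0 as a probable use-after-free, which is why the original conversion had to use refcount_set(..., 1) on the freshly allocated (count == 0) conntrack instead of an increment.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Rough model of refcount_inc_not_zero(): refuse to go 0 -> 1. */
bool model_refcount_inc_not_zero(_Atomic int *r)
{
        int old = atomic_load_explicit(r, memory_order_relaxed);

        do {
                if (old == 0)
                        return false;   /* the kernel would saturate and WARN */
        } while (!atomic_compare_exchange_weak_explicit(r, &old, old + 1,
                                                        memory_order_relaxed,
                                                        memory_order_relaxed));
        return true;
}

int main(void)
{
        _Atomic int use;

        atomic_init(&use, 0);                           /* freshly allocated object */
        if (!model_refcount_inc_not_zero(&use))         /* 0 -> 1 is rejected... */
                fprintf(stderr, "refcount_t: increment on 0; use-after-free?\n");
        atomic_store(&use, 1);                          /* ...hence the explicit set to 1 */
        return 0;
}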
On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
That would make sense; from further experiments I ran, it somehow seems to be related to the number of workers being spawned by stress-ng along with the CPUs/cores involved.
For instance, running the test with <=25 workers (--udp-flood 25 etc.) results in the test running fine for at least 15 minutes.
Ok. I will let it run for longer on the machines I have access to.
In the meantime, you could test the attached patch; it's a simple s/refcount_/atomic_/ in nf_conntrack.
If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you but works with the attached patch, someone who understands aarch64 memory ordering would have to look more closely at the refcount_XXX functions to see where they might differ from the atomic_ ones.
I can confirm that the patch seems to solve the issue. With it applied on top of the 5.19-rc5 tag, the test runs fine for at least 15 minutes, which was not the case before, so it looks like it is that aarch64 memory ordering problem.
If it still crashes, please try the below hunk in addition, although I don't see how it would make a difference.
This is the one spot where the original conversion replaced atomic_inc() with refcount_set(). This is on allocation: the refcount is expected to be 0, so refcount_inc() would trigger a warning hinting at a use-after-free.
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1776,7 +1776,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
                __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);

        /* Now it is going to be associated with an sk_buff, set refcount to 1. */
-       atomic_set(&ct->ct_general.use, 1);
+       atomic_inc(&ct->ct_general.use);

        if (exp) {
                if (exp->expectfn)
From 4234018dff486bdc30f4fe4625c8da1a8e30c2f6 Mon Sep 17 00:00:00 2001
From: Florian Westphal <fw@strlen.de>
Date: Sat, 2 Jul 2022 22:42:57 +0200
Subject: [PATCH 1/1] netfilter: conntrack: revert to atomic_t api

Just for testing.
---
 include/linux/netfilter/nf_conntrack_common.h |  6 +++---
 include/net/netfilter/nf_conntrack.h          |  2 +-
 net/netfilter/nf_conntrack_core.c             | 24 ++++++++++++------------
 net/netfilter/nf_conntrack_expect.c           |  2 +-
 net/netfilter/nf_conntrack_netlink.c          |  6 +++---
 net/netfilter/nf_conntrack_standalone.c       |  4 ++--
 net/netfilter/nf_flow_table_core.c            |  2 +-
 net/netfilter/nft_ct.c                        |  4 ++--
 net/netfilter/xt_CT.c                         |  2 +-
 9 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index 2770db2fa080..48a78944182d 100644
--- a/include/linux/netfilter/nf_conntrack_common.h
+++ b/include/linux/netfilter/nf_conntrack_common.h
@@ -25,7 +25,7 @@ struct ip_conntrack_stat {
 #define NFCT_PTRMASK   ~(NFCT_INFOMASK)

 struct nf_conntrack {
-       refcount_t use;
+       atomic_t use;
 };

 void nf_conntrack_destroy(struct nf_conntrack *nfct);
@@ -33,13 +33,13 @@ void nf_conntrack_destroy(struct nf_conntrack *nfct);
 /* like nf_ct_put, but without module dependency on nf_conntrack */
 static inline void nf_conntrack_put(struct nf_conntrack *nfct)
 {
-       if (nfct && refcount_dec_and_test(&nfct->use))
+       if (nfct && atomic_dec_and_test(&nfct->use))
                nf_conntrack_destroy(nfct);
 }

 static inline void nf_conntrack_get(struct nf_conntrack *nfct)
 {
        if (nfct)
-               refcount_inc(&nfct->use);
+               atomic_inc(&nfct->use);
 }

 #endif /* _NF_CONNTRACK_COMMON_H */
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index a32be8aa7ed2..9fab0c8835bb 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -180,7 +180,7 @@ void nf_ct_destroy(struct nf_conntrack *nfct);
 /* decrement reference count on a conntrack */
 static inline void nf_ct_put(struct nf_conn *ct)
 {
-       if (ct && refcount_dec_and_test(&ct->ct_general.use))
+       if (ct && atomic_dec_and_test(&ct->ct_general.use))
                nf_ct_destroy(&ct->ct_general);
 }

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 082a2fd8d85b..4469e49d78a7 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -554,7 +554,7 @@ struct nf_conn *nf_ct_tmpl_alloc(struct net *net,
        tmpl->status = IPS_TEMPLATE;
        write_pnet(&tmpl->ct_net, net);
        nf_ct_zone_add(tmpl, zone);
-       refcount_set(&tmpl->ct_general.use, 1);
+       atomic_set(&tmpl->ct_general.use, 1);

        return tmpl;
 }
@@ -586,7 +586,7 @@ void nf_ct_destroy(struct nf_conntrack *nfct)
        struct nf_conn *ct = (struct nf_conn *)nfct;

        pr_debug("%s(%p)\n", __func__, ct);
-       WARN_ON(refcount_read(&nfct->use) != 0);
+       WARN_ON(atomic_read(&nfct->use) != 0);

        if (unlikely(nf_ct_is_template(ct))) {
                nf_ct_tmpl_free(ct);
@@ -726,7 +726,7 @@ nf_ct_match(const struct nf_conn *ct1, const struct nf_conn *ct2)
 /* caller must hold rcu readlock and none of the nf_conntrack_locks */
 static void nf_ct_gc_expired(struct nf_conn *ct)
 {
-       if (!refcount_inc_not_zero(&ct->ct_general.use))
+       if (!atomic_inc_not_zero(&ct->ct_general.use))
                return;

        if (nf_ct_should_gc(ct))
@@ -794,7 +794,7 @@ __nf_conntrack_find_get(struct net *net, const struct nf_conntrack_zone *zone,
         * in, try to obtain a reference and re-check tuple
         */
        ct = nf_ct_tuplehash_to_ctrack(h);
-       if (likely(refcount_inc_not_zero(&ct->ct_general.use))) {
+       if (likely(atomic_inc_not_zero(&ct->ct_general.use))) {
                if (likely(nf_ct_key_equal(h, tuple, zone, net)))
                        goto found;
@@ -923,7 +923,7 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct)
        smp_wmb();
        /* The caller holds a reference to this object */
-       refcount_set(&ct->ct_general.use, 2);
+       atomic_set(&ct->ct_general.use, 2);
        __nf_conntrack_hash_insert(ct, hash, reply_hash);
        nf_conntrack_double_unlock(hash, reply_hash);
        NF_CT_STAT_INC(net, insert);
@@ -981,7 +981,7 @@ static void __nf_conntrack_insert_prepare(struct nf_conn *ct)
 {
        struct nf_conn_tstamp *tstamp;

-       refcount_inc(&ct->ct_general.use);
+       atomic_inc(&ct->ct_general.use);

        /* set conntrack timestamp, if enabled. */
        tstamp = nf_conn_tstamp_find(ct);
@@ -1384,7 +1384,7 @@ static unsigned int early_drop_list(struct net *net,
                    nf_ct_is_dying(tmp))
                        continue;

-               if (!refcount_inc_not_zero(&tmp->ct_general.use))
+               if (!atomic_inc_not_zero(&tmp->ct_general.use))
                        continue;

                /* kill only if still in same netns -- might have moved due to
@@ -1533,7 +1533,7 @@ static void gc_worker(struct work_struct *work)
                                continue;

                        /* need to take reference to avoid possible races */
-                       if (!refcount_inc_not_zero(&tmp->ct_general.use))
+                       if (!atomic_inc_not_zero(&tmp->ct_general.use))
                                continue;

                        if (gc_worker_skip_ct(tmp)) {
@@ -1640,7 +1640,7 @@ __nf_conntrack_alloc(struct net *net,
        /* Because we use RCU lookups, we set ct_general.use to zero before
         * this is inserted in any list.
         */
-       refcount_set(&ct->ct_general.use, 0);
+       atomic_set(&ct->ct_general.use, 0);
        return ct;
 out:
        atomic_dec(&cnet->count);
@@ -1665,7 +1665,7 @@ void nf_conntrack_free(struct nf_conn *ct)
        /* A freed object has refcnt == 0, that's
         * the golden rule for SLAB_TYPESAFE_BY_RCU
         */
-       WARN_ON(refcount_read(&ct->ct_general.use) != 0);
+       WARN_ON(atomic_read(&ct->ct_general.use) != 0);

        if (ct->status & IPS_SRC_NAT_DONE) {
                const struct nf_nat_hook *nat_hook;
@@ -1776,7 +1776,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
                __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);

        /* Now it is going to be associated with an sk_buff, set refcount to 1. */
-       refcount_set(&ct->ct_general.use, 1);
+       atomic_set(&ct->ct_general.use, 1);

        if (exp) {
                if (exp->expectfn)
@@ -2390,7 +2390,7 @@ get_next_corpse(int (*iter)(struct nf_conn *i, void *data),
        return NULL;
 found:
-       refcount_inc(&ct->ct_general.use);
+       atomic_inc(&ct->ct_general.use);
        spin_unlock(lockp);
        local_bh_enable();
        return ct;
diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c
index 96948e98ec53..84cb05eae410 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -208,7 +208,7 @@ nf_ct_find_expectation(struct net *net,
         * can be sure the ct cannot disappear underneath.
         */
        if (unlikely(nf_ct_is_dying(exp->master) ||
-                    !refcount_inc_not_zero(&exp->master->ct_general.use)))
+                    !atomic_inc_not_zero(&exp->master->ct_general.use)))
                return NULL;

        if (exp->flags & NF_CT_EXPECT_PERMANENT) {
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 722af5e309ba..d5de0e580e6c 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -514,7 +514,7 @@ static int ctnetlink_dump_id(struct sk_buff *skb, const struct nf_conn *ct)

 static int ctnetlink_dump_use(struct sk_buff *skb, const struct nf_conn *ct)
 {
-       if (nla_put_be32(skb, CTA_USE, htonl(refcount_read(&ct->ct_general.use))))
+       if (nla_put_be32(skb, CTA_USE, htonl(atomic_read(&ct->ct_general.use))))
                goto nla_put_failure;

        return 0;
@@ -1204,7 +1204,7 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
                ct = nf_ct_tuplehash_to_ctrack(h);
                if (nf_ct_is_expired(ct)) {
                        if (i < ARRAY_SIZE(nf_ct_evict) &&
-                           refcount_inc_not_zero(&ct->ct_general.use))
+                           atomic_inc_not_zero(&ct->ct_general.use))
                                nf_ct_evict[i++] = ct;
                        continue;
                }
@@ -1747,7 +1747,7 @@ static int ctnetlink_dump_one_entry(struct sk_buff *skb,
                                  NFNL_MSG_TYPE(cb->nlh->nlmsg_type),
                                  ct, dying, 0);
        if (res < 0) {
-               if (!refcount_inc_not_zero(&ct->ct_general.use))
+               if (!atomic_inc_not_zero(&ct->ct_general.use))
                        return 0;

                ctx->last = ct;
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 6ad7bbc90d38..badd3f219533 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -303,7 +303,7 @@ static int ct_seq_show(struct seq_file *s, void *v)
        int ret = 0;

        WARN_ON(!ct);
-       if (unlikely(!refcount_inc_not_zero(&ct->ct_general.use)))
+       if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use)))
                return 0;

        if (nf_ct_should_gc(ct)) {
@@ -370,7 +370,7 @@ static int ct_seq_show(struct seq_file *s, void *v)
        ct_show_zone(s, ct, NF_CT_DEFAULT_ZONE_DIR);
        ct_show_delta_time(s, ct);

-       seq_printf(s, "use=%u\n", refcount_read(&ct->ct_general.use));
+       seq_printf(s, "use=%u\n", atomic_read(&ct->ct_general.use));

        if (seq_has_overflowed(s))
                goto release;
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index f2def06d1070..8b3f91a60ba2 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -54,7 +54,7 @@ struct flow_offload *flow_offload_alloc(struct nf_conn *ct)
        struct flow_offload *flow;

        if (unlikely(nf_ct_is_dying(ct) ||
-           !refcount_inc_not_zero(&ct->ct_general.use)))
+           !atomic_inc_not_zero(&ct->ct_general.use)))
                return NULL;

        flow = kzalloc(sizeof(*flow), GFP_ATOMIC);
diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index d8e1614918a1..1b6ead61a8f1 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -260,8 +260,8 @@ static void nft_ct_set_zone_eval(const struct nft_expr *expr,
        ct = this_cpu_read(nft_ct_pcpu_template);

-       if (likely(refcount_read(&ct->ct_general.use) == 1)) {
-               refcount_inc(&ct->ct_general.use);
+       if (likely(atomic_read(&ct->ct_general.use) == 1)) {
+               atomic_inc(&ct->ct_general.use);
                nf_ct_zone_add(ct, &zone);
        } else {
                /* previous skb got queued to userspace, allocate temporary
diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c
index 267757b0392a..cf2f8c1d4fb5 100644
--- a/net/netfilter/xt_CT.c
+++ b/net/netfilter/xt_CT.c
@@ -24,7 +24,7 @@ static inline int xt_ct_target(struct sk_buff *skb, struct nf_conn *ct)
                return XT_CONTINUE;

        if (ct) {
-               refcount_inc(&ct->ct_general.use);
+               atomic_inc(&ct->ct_general.use);
                nf_ct_set(skb, ct, IP_CT_NEW);
        } else {
                nf_ct_set(skb, ct, IP_CT_UNTRACKED);
-- 
2.35.1
On Mon, Jul 04, 2022 at 10:22:24AM +0100, Kajetan Puchalski wrote:
On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
That would make sense; from further experiments I ran, it somehow seems to be related to the number of workers being spawned by stress-ng along with the CPUs/cores involved.
For instance, running the test with <=25 workers (--udp-flood 25 etc.) results in the test running fine for at least 15 minutes.
Ok. I will let it run for longer on the machines I have access to.
In the meantime, you could test the attached patch; it's a simple s/refcount_/atomic_/ in nf_conntrack.
If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you but works with the attached patch, someone who understands aarch64 memory ordering would have to look more closely at the refcount_XXX functions to see where they might differ from the atomic_ ones.
I can confirm that the patch seems to solve the issue. With it applied on top of the 5.19-rc5 tag, the test runs fine for at least 15 minutes, which was not the case before, so it looks like it is that aarch64 memory ordering problem.
I'm CCing some people who should be able to help with aarch64 memory ordering, maybe they could take a look.
[...]
On Mon, Jul 04, 2022 at 10:22:24AM +0100, Kajetan Puchalski wrote:
On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
That would make sense; from further experiments I ran, it somehow seems to be related to the number of workers being spawned by stress-ng along with the CPUs/cores involved.
For instance, running the test with <=25 workers (--udp-flood 25 etc.) results in the test running fine for at least 15 minutes.
Ok. I will let it run for longer on the machines I have access to.
In the meantime, you could test the attached patch; it's a simple s/refcount_/atomic_/ in nf_conntrack.
If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you but works with the attached patch, someone who understands aarch64 memory ordering would have to look more closely at the refcount_XXX functions to see where they might differ from the atomic_ ones.
I can confirm that the patch seems to solve the issue. With it applied on top of the 5.19-rc5 tag, the test runs fine for at least 15 minutes, which was not the case before, so it looks like it is that aarch64 memory ordering problem.
I'm CCing some people who should be able to help with aarch64 memory ordering, maybe they could take a look.
(re-sending due to a typo in CC, sorry for duplicate emails!)
[...]
On Tue, Jul 05, 2022 at 11:53:22AM +0100, Kajetan Puchalski wrote:
On Mon, Jul 04, 2022 at 10:22:24AM +0100, Kajetan Puchalski wrote:
On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
That would make sense; from further experiments I ran, it somehow seems to be related to the number of workers being spawned by stress-ng along with the CPUs/cores involved.
For instance, running the test with <=25 workers (--udp-flood 25 etc.) results in the test running fine for at least 15 minutes.
Ok. I will let it run for longer on the machines I have access to.
In the meantime, you could test the attached patch; it's a simple s/refcount_/atomic_/ in nf_conntrack.
If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you but works with the attached patch, someone who understands aarch64 memory ordering would have to look more closely at the refcount_XXX functions to see where they might differ from the atomic_ ones.
I can confirm that the patch seems to solve the issue. With it applied on top of the 5.19-rc5 tag, the test runs fine for at least 15 minutes, which was not the case before, so it looks like it is that aarch64 memory ordering problem.
I'm CCing some people who should be able to help with aarch64 memory ordering, maybe they could take a look.
(re-sending due to a typo in CC, sorry for duplicate emails!)
Sorry, but I have absolutely no context here. We have a handy document describing the differences between atomic_t and refcount_t:
Documentation/core-api/refcount-vs-atomic.rst
What else do you need to know?
Will
On Tue, Jul 05, 2022 at 11:57:49AM +0100, Will Deacon wrote:
On Tue, Jul 05, 2022 at 11:53:22AM +0100, Kajetan Puchalski wrote:
On Mon, Jul 04, 2022 at 10:22:24AM +0100, Kajetan Puchalski wrote:
On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
That would make sense; from further experiments I ran, it somehow seems to be related to the number of workers being spawned by stress-ng along with the CPUs/cores involved.
For instance, running the test with <=25 workers (--udp-flood 25 etc.) results in the test running fine for at least 15 minutes.
Ok. I will let it run for longer on the machines I have access to.
In the meantime, you could test the attached patch; it's a simple s/refcount_/atomic_/ in nf_conntrack.
If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you but works with the attached patch, someone who understands aarch64 memory ordering would have to look more closely at the refcount_XXX functions to see where they might differ from the atomic_ ones.
I can confirm that the patch seems to solve the issue. With it applied on top of the 5.19-rc5 tag, the test runs fine for at least 15 minutes, which was not the case before, so it looks like it is that aarch64 memory ordering problem.
I'm CCing some people who should be able to help with aarch64 memory ordering, maybe they could take a look.
(re-sending due to a typo in CC, sorry for duplicate emails!)
Sorry, but I have absolutely no context here. We have a handy document describing the differences between atomic_t and refcount_t:
Documentation/core-api/refcount-vs-atomic.rst
What else do you need to know?
Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622 ("netfilter: conntrack: convert to refcount_t api") which mean that you no longer have ordering to subsequent reads in the absence of an address dependency.
Will
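[Background for this point, per Documentation/core-api/refcount-vs-atomic.rst: atomic_inc_not_zero() is fully ordered, while refcount_inc_not_zero() only provides a control dependency that orders later stores on success, so later loads may be reordered. A minimal sketch of the affected pattern, with generic names rather than actual conntrack code:]

#include <linux/refcount.h>
#include <linux/bitops.h>

struct obj {
	refcount_t use;
	unsigned long status;		/* e.g. IPS_CONFIRMED_BIT lives here */
};

/* Try to take a reference, then inspect object state. */
static bool get_if_confirmed(struct obj *o, int confirmed_bit)
{
	/* With atomic_inc_not_zero() this RMW was fully ordered, so the
	 * status load below could not be observed before the increment.
	 */
	if (!refcount_inc_not_zero(&o->use))
		return false;

	/* With refcount_inc_not_zero() only later *stores* are ordered by
	 * the control dependency; the *load* of ->status needs an explicit
	 * barrier or it may observe stale, pre-initialisation contents.
	 */
	smp_rmb();

	return test_bit(confirmed_bit, &o->status);
}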
On Tue, Jul 05, 2022 at 12:07:25PM +0100, Will Deacon wrote:
On Tue, Jul 05, 2022 at 11:57:49AM +0100, Will Deacon wrote:
On Tue, Jul 05, 2022 at 11:53:22AM +0100, Kajetan Puchalski wrote:
On Mon, Jul 04, 2022 at 10:22:24AM +0100, Kajetan Puchalski wrote:
On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
That would make sense, from further experiments I ran it somehow seems to be related to the number of workers being spawned by stress-ng along with the CPUs/cores involved.
For instance, running the test with <=25 workers (--udp-flood 25 etc.) results in the test running fine for at least 15 minutes.
Ok. I will let it run for longer on the machines I have access to.
In the meantime, you could test the attached patch; it's a simple s/refcount_/atomic_/ in nf_conntrack.
If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you but works with the attached patch, someone who understands aarch64 memory ordering would have to look more closely at the refcount_XXX functions to see where they might differ from the atomic_ ones.
I can confirm that the patch seems to solve the issue. With it applied on top of the 5.19-rc5 tag the test runs fine for at least 15 minutes which was not the case before so it looks like it is that aarch64 memory ordering problem.
I'm CCing some people who should be able to help with aarch64 memory ordering, maybe they could take a look.
(re-sending due to a typo in CC, sorry for duplicate emails!)
Sorry, but I have absolutely no context here. We have a handy document describing the differences between atomic_t and refcount_t:
Documentation/core-api/refcount-vs-atomic.rst
What else do you need to know?
Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622 ("netfilter: conntrack: convert to refcount_t api") which mean that you no longer have ordering to subsequent reads in the absence of an address dependency.
I think the patch above needs auditing with the relaxed behaviour in mind, but for the specific crash reported here possibly something like the diff below?
Will
--->8
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 082a2fd8d85b..5ad9fcc84269 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1394,6 +1394,7 @@ static unsigned int early_drop_list(struct net *net,
 	 * already fired or someone else deleted it. Just drop ref
 	 * and move to next entry.
 	 */
+	smp_rmb();	/* XXX: Why? */
 	if (net_eq(nf_ct_net(tmp), net) &&
 	    nf_ct_is_confirmed(tmp) &&
 	    nf_ct_delete(tmp, 0, 0))
Sorry, but I have absolutely no context here. We have a handy document describing the differences between atomic_t and refcount_t:
Documentation/core-api/refcount-vs-atomic.rst
What else do you need to know?
Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622 ("netfilter: conntrack: convert to refcount_t api") which mean that you no longer have ordering to subsequent reads in the absence of an address dependency.
I think the patch above needs auditing with the relaxed behaviour in mind, but for the specific crash reported here possibly something like the diff below?
Will
--->8
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 082a2fd8d85b..5ad9fcc84269 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1394,6 +1394,7 @@ static unsigned int early_drop_list(struct net *net,
 	 * already fired or someone else deleted it. Just drop ref
 	 * and move to next entry.
 	 */
+	smp_rmb();	/* XXX: Why? */
 	if (net_eq(nf_ct_net(tmp), net) &&
 	    nf_ct_is_confirmed(tmp) &&
 	    nf_ct_delete(tmp, 0, 0))
With this patch applied the issue goes away as well. The test runs fine well beyond where it would crash previously so looks good, thanks!
On Tue, Jul 05, 2022 at 12:24:49PM +0100, Will Deacon wrote:
Sorry, but I have absolutely no context here. We have a handy document describing the differences between atomic_t and refcount_t:
Documentation/core-api/refcount-vs-atomic.rst
What else do you need to know?
Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622 ("netfilter: conntrack: convert to refcount_t api") which mean that you no longer have ordering to subsequent reads in the absence of an address dependency.
I think the patch above needs auditing with the relaxed behaviour in mind, but for the specific crash reported here possibly something like the diff below?
Will
--->8
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 082a2fd8d85b..5ad9fcc84269 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1394,6 +1394,7 @@ static unsigned int early_drop_list(struct net *net,
 	 * already fired or someone else deleted it. Just drop ref
 	 * and move to next entry.
 	 */
+	smp_rmb();	/* XXX: Why? */
 	if (net_eq(nf_ct_net(tmp), net) &&
 	    nf_ct_is_confirmed(tmp) &&
 	    nf_ct_delete(tmp, 0, 0))
Just to follow up, I think you're right, the patch in question should be audited further for other missing memory barrier issues. While this one smp_rmb() helps a lot, i.e. it lets the test run for at least an hour or two, an overnight 6-hour test still resulted in the same crash somewhere along the way, so it looks like it's not the only one that's needed.
Kajetan Puchalski kajetan.puchalski@arm.com wrote:
On Tue, Jul 05, 2022 at 12:24:49PM +0100, Will Deacon wrote:
Sorry, but I have absolutely no context here. We have a handy document describing the differences between atomic_t and refcount_t:
Documentation/core-api/refcount-vs-atomic.rst
What else do you need to know?
Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622 ("netfilter: conntrack: convert to refcount_t api") which mean that you no longer have ordering to subsequent reads in the absence of an address dependency.
I think the patch above needs auditing with the relaxed behaviour in mind, but for the specific crash reported here possibly something like the diff below?
Will
--->8
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 082a2fd8d85b..5ad9fcc84269 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1394,6 +1394,7 @@ static unsigned int early_drop_list(struct net *net,
 	 * already fired or someone else deleted it. Just drop ref
 	 * and move to next entry.
 	 */
+	smp_rmb();	/* XXX: Why? */
 	if (net_eq(nf_ct_net(tmp), net) &&
 	    nf_ct_is_confirmed(tmp) &&
 	    nf_ct_delete(tmp, 0, 0))
Just to follow up, I think you're right, the patch in question should be audited further for other missing memory barrier issues. While this one smp_rmb() helps a lot, i.e. it lets the test run for at least an hour or two, an overnight 6-hour test still resulted in the same crash somewhere along the way, so it looks like it's not the only one that's needed.
Yes, I don't think that refcount_inc_not_zero is usable as-is for conntrack. Here is a patch; I hope this will get things back to working order without a revert to the atomic_t API.
Subject: [nf] netfilter: conntrack: fix crash due to confirmed bit load reordering
Kajetan Puchalski reports crash on ARM, with backtrace of:
__nf_ct_delete_from_lists
nf_ct_delete
early_drop
__nf_conntrack_alloc
Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier. conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly' allocated object is still in use on another CPU:
   CPU1                                 CPU2
                                        encounters 'ct' during hlist walk
   delete_from_lists
   refcount drops to 0
   kmem_cache_free(ct);
   __nf_conntrack_alloc() // returns same object
                                        refcount_inc_not_zero(ct); /* might fail */

                                        /* If set, ct is public/in the hash table */
                                        test_bit(IPS_CONFIRMED_BIT, &ct->status);
In case CPU1 already set refcount back to 1, refcount_inc_not_zero() will succeed.
The expected possibilities for a CPU that obtained the object 'ct' (but no reference so far) are:
1. refcount_inc_not_zero() fails. CPU2 ignores the object and moves to the next entry in the list. This happens for objects that are about to be free'd, that have been free'd, or that have been reallocated by __nf_conntrack_alloc(), but where the refcount has not been increased back to 1 yet.
2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit in ct->status. If set, the object is public/in the table.
If not, the object must be skipped; CPU2 calls nf_ct_put() to un-do the refcount increment and moves to the next object.
Parallel deletion from the hlists is prevented by a 'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one cpu will do the unlink, the other one will only drop its reference count.
Because refcount_inc_not_zero is not a full barrier, CPU2 may try to delete an object that is not on any list:
1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
2. CONFIRMED test also successful (load was reordered or zeroing of ct->status not yet visible)
3. delete_from_lists unlinks entry not on the hlist, because IPS_DYING_BIT is 0 (already cleared).
2) is already wrong: CPU2 will handle a partially initialised object that is supposed to be private to CPU1.
This change adds smp_rmb() whenever refcount_inc_not_zero() was successful.
It also inserts a smp_wmb() before the refcount is set to 1 during allocation.
Because other CPU might still 'see' the object, refcount_set(1) "resurrects" the object, so we need to make sure that other CPUs will also observe the right contents. In particular, the CONFIRMED bit test must only pass once the object is fully initialised and either in the hash or about to be inserted (with locks held to delay possible unlink from early_drop or gc worker).
I did not change flow_offload_alloc(), as far as I can see it should call refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to the skb so its refcount should be >= 1 in all cases.
Reported-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Diagnosed-by: Will Deacon <will@kernel.org>
Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_conntrack.h |  3 +++
 net/netfilter/nf_conntrack_core.c    | 19 +++++++++++++++++++
 2 files changed, 22 insertions(+)
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index a32be8aa7ed2..3dc3646ffba2 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -300,6 +300,9 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
 /* use after obtaining a reference count */
 static inline bool nf_ct_should_gc(const struct nf_conn *ct)
 {
+	/* ->status and ->timeout loads must happen after refcount increase */
+	smp_rmb();
+
 	return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
 	       !nf_ct_is_dying(ct);
 }
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 082a2fd8d85b..072cabf1b296 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -795,6 +795,9 @@ __nf_conntrack_find_get(struct net *net, const struct nf_conntrack_zone *zone,
 	 */
 	ct = nf_ct_tuplehash_to_ctrack(h);
 	if (likely(refcount_inc_not_zero(&ct->ct_general.use))) {
+		/* re-check key after refcount */
+		smp_rmb();
+
 		if (likely(nf_ct_key_equal(h, tuple, zone, net)))
 			goto found;

@@ -1393,7 +1396,11 @@ static unsigned int early_drop_list(struct net *net,
 	 * We steal the timer reference. If that fails timer has
 	 * already fired or someone else deleted it. Just drop ref
 	 * and move to next entry.
+	 *
+	 * smp_rmb to ensure ->ct_net and ->status are loaded after
+	 * refcount increase.
 	 */
+	smp_rmb();
 	if (net_eq(nf_ct_net(tmp), net) &&
 	    nf_ct_is_confirmed(tmp) &&
 	    nf_ct_delete(tmp, 0, 0))
@@ -1536,6 +1543,8 @@ static void gc_worker(struct work_struct *work)
 		if (!refcount_inc_not_zero(&tmp->ct_general.use))
 			continue;

+		/* load ct->status after refcount */
+		smp_rmb();
 		if (gc_worker_skip_ct(tmp)) {
 			nf_ct_put(tmp);
 			continue;
@@ -1775,6 +1784,16 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
 	if (!exp)
 		__nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);

+	/* Other CPU might have obtained a pointer to this object before it was
+	 * released. Because refcount is 0, refcount_inc_not_zero() will fail.
+	 *
+	 * After refcount_set(1) it will succeed; ensure that zeroing of
+	 * ct->status and the correct ct->net pointer are visible; else other
+	 * core might observe CONFIRMED bit which means the entry is valid and
+	 * in the hash table, but its not (anymore).
+	 */
+	smp_wmb();
+
 	/* Now it is going to be associated with an sk_buff, set refcount to 1. */
 	refcount_set(&ct->ct_general.use, 1);
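To summarise the pairing the patch above sets up, here is a condensed, hypothetical sketch of the two sides; publish_ct() and grab_confirmed_ct() are made-up names, the real logic is spread across init_conntrack(), __nf_conntrack_find_get(), early_drop_list() and gc_worker():

/* Writer: fully initialise the SLAB_TYPESAFE_BY_RCU object, then make those
 * stores visible before the refcount "resurrects" it for concurrent readers.
 */
static void publish_ct(struct nf_conn *ct)
{
	/* ... zero ct->status, set ct->ct_net, tuples, etc. ... */
	smp_wmb();				/* order init before the store below */
	refcount_set(&ct->ct_general.use, 1);	/* may now pass inc_not_zero */
}

/* Reader: only trust ->status / ->ct_net after taking a reference AND after
 * a read barrier pairing with the smp_wmb() above.
 */
static bool grab_confirmed_ct(struct nf_conn *ct)
{
	if (!refcount_inc_not_zero(&ct->ct_general.use))
		return false;			/* object is being freed or reused */

	smp_rmb();				/* pairs with smp_wmb() in publish_ct() */

	if (!nf_ct_is_confirmed(ct)) {
		nf_ct_put(ct);			/* still private to another CPU, skip it */
		return false;
	}
	return true;
}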
On Wed, Jul 06, 2022 at 02:02:01PM +0200, Florian Westphal wrote:
Subject: [nf] netfilter: conntrack: fix crash due to confirmed bit load reordering
Kajetan Puchalski reports crash on ARM, with backtrace of:
__nf_ct_delete_from_lists
nf_ct_delete
early_drop
__nf_conntrack_alloc
Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier. conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly' allocated object is still in use on another CPU:
   CPU1                                 CPU2
                                        encounters 'ct' during hlist walk
   delete_from_lists
   refcount drops to 0
   kmem_cache_free(ct);
   __nf_conntrack_alloc() // returns same object
                                        refcount_inc_not_zero(ct); /* might fail */

                                        /* If set, ct is public/in the hash table */
                                        test_bit(IPS_CONFIRMED_BIT, &ct->status);
In case CPU1 already set refcount back to 1, refcount_inc_not_zero() will succeed.
The expected possibilities for a CPU that obtained the object 'ct' (but no reference so far) are:
1. refcount_inc_not_zero() fails. CPU2 ignores the object and moves to the next entry in the list. This happens for objects that are about to be free'd, that have been free'd, or that have been reallocated by __nf_conntrack_alloc(), but where the refcount has not been increased back to 1 yet.
2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit in ct->status. If set, the object is public/in the table.
If not, the object must be skipped; CPU2 calls nf_ct_put() to un-do the refcount increment and moves to the next object.
Parallel deletion from the hlists is prevented by a 'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one cpu will do the unlink, the other one will only drop its reference count.
Because refcount_inc_not_zero is not a full barrier, CPU2 may try to delete an object that is not on any list:
1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
2. CONFIRMED test also successful (load was reordered or zeroing of ct->status not yet visible)
3. delete_from_lists unlinks entry not on the hlist, because IPS_DYING_BIT is 0 (already cleared).
2) is already wrong: CPU2 will handle a partially initialised object that is supposed to be private to CPU1.
This change adds smp_rmb() whenever refcount_inc_not_zero() was successful.
It also inserts a smp_wmb() before the refcount is set to 1 during allocation.
Because other CPU might still 'see' the object, refcount_set(1) "resurrects" the object, so we need to make sure that other CPUs will also observe the right contents. In particular, the CONFIRMED bit test must only pass once the object is fully initialised and either in the hash or about to be inserted (with locks held to delay possible unlink from early_drop or gc worker).
I did not change flow_offload_alloc(), as far as I can see it should call refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to the skb so its refcount should be >= 1 in all cases.
Reported-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Diagnosed-by: Will Deacon <will@kernel.org>
Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_conntrack.h |  3 +++
 net/netfilter/nf_conntrack_core.c    | 19 +++++++++++++++++++
 2 files changed, 22 insertions(+)
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index a32be8aa7ed2..3dc3646ffba2 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -300,6 +300,9 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
 /* use after obtaining a reference count */
 static inline bool nf_ct_should_gc(const struct nf_conn *ct)
 {
+	/* ->status and ->timeout loads must happen after refcount increase */
+	smp_rmb();
+
 	return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
 	       !nf_ct_is_dying(ct);
 }
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 082a2fd8d85b..072cabf1b296 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -795,6 +795,9 @@ __nf_conntrack_find_get(struct net *net, const struct nf_conntrack_zone *zone,
 	 */
 	ct = nf_ct_tuplehash_to_ctrack(h);
 	if (likely(refcount_inc_not_zero(&ct->ct_general.use))) {
+		/* re-check key after refcount */
+		smp_rmb();
+
 		if (likely(nf_ct_key_equal(h, tuple, zone, net)))
 			goto found;

@@ -1393,7 +1396,11 @@ static unsigned int early_drop_list(struct net *net,
 	 * We steal the timer reference. If that fails timer has
 	 * already fired or someone else deleted it. Just drop ref
 	 * and move to next entry.
+	 *
+	 * smp_rmb to ensure ->ct_net and ->status are loaded after
+	 * refcount increase.
 	 */
+	smp_rmb();
 	if (net_eq(nf_ct_net(tmp), net) &&
 	    nf_ct_is_confirmed(tmp) &&
 	    nf_ct_delete(tmp, 0, 0))
@@ -1536,6 +1543,8 @@ static void gc_worker(struct work_struct *work)
 		if (!refcount_inc_not_zero(&tmp->ct_general.use))
 			continue;

+		/* load ct->status after refcount */
+		smp_rmb();
 		if (gc_worker_skip_ct(tmp)) {
 			nf_ct_put(tmp);
 			continue;
@@ -1775,6 +1784,16 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
 	if (!exp)
 		__nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);

+	/* Other CPU might have obtained a pointer to this object before it was
+	 * released. Because refcount is 0, refcount_inc_not_zero() will fail.
+	 *
+	 * After refcount_set(1) it will succeed; ensure that zeroing of
+	 * ct->status and the correct ct->net pointer are visible; else other
+	 * core might observe CONFIRMED bit which means the entry is valid and
+	 * in the hash table, but its not (anymore).
+	 */
+	smp_wmb();
+
 	/* Now it is going to be associated with an sk_buff, set refcount to 1. */
 	refcount_set(&ct->ct_general.use, 1);
FWIW, the old code, that used atomic_inc() instead of refcount_set(), would have had the exact same problem. There was no implied order vs the earlier stores.
On Wed, Jul 06, 2022 at 02:02:01PM +0200, Florian Westphal wrote:
Kajetan Puchalski kajetan.puchalski@arm.com wrote:
On Tue, Jul 05, 2022 at 12:24:49PM +0100, Will Deacon wrote:
Sorry, but I have absolutely no context here. We have a handy document describing the differences between atomic_t and refcount_t:
Documentation/core-api/refcount-vs-atomic.rst
What else do you need to know?
Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622 ("netfilter: conntrack: convert to refcount_t api") which mean that you no longer have ordering to subsequent reads in the absence of an address dependency.
I think the patch above needs auditing with the relaxed behaviour in mind, but for the specific crash reported here possibly something like the diff below?
Will
--->8
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 082a2fd8d85b..5ad9fcc84269 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1394,6 +1394,7 @@ static unsigned int early_drop_list(struct net *net,
 	 * already fired or someone else deleted it. Just drop ref
 	 * and move to next entry.
 	 */
+	smp_rmb();	/* XXX: Why? */
 	if (net_eq(nf_ct_net(tmp), net) &&
 	    nf_ct_is_confirmed(tmp) &&
 	    nf_ct_delete(tmp, 0, 0))
Just to follow up, I think you're right, the patch in question should be audited further for other missing memory barrier issues. While this one smp_rmb() helps a lot, i.e. it lets the test run for at least an hour or two, an overnight 6-hour test still resulted in the same crash somewhere along the way, so it looks like it's not the only one that's needed.
Yes, I don't think that refcount_inc_not_zero is usable as-is for conntrack. Here is a patch; I hope this will get things back to working order without a revert to the atomic_t API.
Subject: [nf] netfilter: conntrack: fix crash due to confirmed bit load reordering
Kajetan Puchalski reports crash on ARM, with backtrace of:
__nf_ct_delete_from_lists
nf_ct_delete
early_drop
__nf_conntrack_alloc
Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier. conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly' allocated object is still in use on another CPU:
   CPU1                                 CPU2
                                        encounters 'ct' during hlist walk
   delete_from_lists
   refcount drops to 0
   kmem_cache_free(ct);
   __nf_conntrack_alloc() // returns same object
                                        refcount_inc_not_zero(ct); /* might fail */

                                        /* If set, ct is public/in the hash table */
                                        test_bit(IPS_CONFIRMED_BIT, &ct->status);
In case CPU1 already set refcount back to 1, refcount_inc_not_zero() will succeed.
The expected possibilities for a CPU that obtained the object 'ct' (but no reference so far) are:
1. refcount_inc_not_zero() fails. CPU2 ignores the object and moves to the next entry in the list. This happens for objects that are about to be free'd, that have been free'd, or that have been reallocated by __nf_conntrack_alloc(), but where the refcount has not been increased back to 1 yet.
2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit in ct->status. If set, the object is public/in the table.
If not, the object must be skipped; CPU2 calls nf_ct_put() to un-do the refcount increment and moves to the next object.
Parallel deletion from the hlists is prevented by a 'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one cpu will do the unlink, the other one will only drop its reference count.
Because refcount_inc_not_zero is not a full barrier, CPU2 may try to delete an object that is not on any list:
1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
2. CONFIRMED test also successful (load was reordered or zeroing of ct->status not yet visible)
3. delete_from_lists unlinks entry not on the hlist, because IPS_DYING_BIT is 0 (already cleared).
2) is already wrong: CPU2 will handle a partially initialised object that is supposed to be private to CPU1.
This change adds smp_rmb() whenever refcount_inc_not_zero() was successful.
It also inserts a smp_wmb() before the refcount is set to 1 during allocation.
Because other CPU might still 'see' the object, refcount_set(1) "resurrects" the object, so we need to make sure that other CPUs will also observe the right contents. In particular, the CONFIRMED bit test must only pass once the object is fully initialised and either in the hash or about to be inserted (with locks held to delay possible unlink from early_drop or gc worker).
I did not change flow_offload_alloc(), as far as I can see it should call refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to the skb so its refcount should be >= 1 in all cases.
Reported-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Diagnosed-by: Will Deacon <will@kernel.org>
Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_conntrack.h |  3 +++
 net/netfilter/nf_conntrack_core.c    | 19 +++++++++++++++++++
 2 files changed, 22 insertions(+)
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index a32be8aa7ed2..3dc3646ffba2 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -300,6 +300,9 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
 /* use after obtaining a reference count */
 static inline bool nf_ct_should_gc(const struct nf_conn *ct)
 {
+	/* ->status and ->timeout loads must happen after refcount increase */
+	smp_rmb();
Sorry I didn't suggest this earlier, but if all of these smp_rmb()s are for upgrading the ordering from refcount_inc_not_zero() then you should use smp_acquire__after_ctrl_dep() instead. It's the same under the hood, but it illustrates what's going on a bit better.
@@ -1775,6 +1784,16 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
 	if (!exp)
 		__nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);

+	/* Other CPU might have obtained a pointer to this object before it was
+	 * released. Because refcount is 0, refcount_inc_not_zero() will fail.
+	 *
+	 * After refcount_set(1) it will succeed; ensure that zeroing of
+	 * ct->status and the correct ct->net pointer are visible; else other
+	 * core might observe CONFIRMED bit which means the entry is valid and
+	 * in the hash table, but its not (anymore).
+	 */
+	smp_wmb();
+
 	/* Now it is going to be associated with an sk_buff, set refcount to 1. */
 	refcount_set(&ct->ct_general.use, 1);
Ideally that refcount_set() would be a release, but this is definitely (ab)using refcount_t in a way that isn't anticipated by the API! It looks like a similar pattern exists in net/core/sock.c as well, so I wonder if it's worth extending the API.
Peter, what do you think?
Will
Will Deacon will@kernel.org wrote:
On Wed, Jul 06, 2022 at 02:02:01PM +0200, Florian Westphal wrote:
+	/* ->status and ->timeout loads must happen after refcount increase */
+	smp_rmb();
Sorry I didn't suggest this earlier, but if all of these smp_rmb()s are for upgrading the ordering from refcount_inc_not_zero() then you should use smp_acquire__after_ctrl_dep() instead. It's the same under the hood, but it illustrates what's going on a bit better.
Ok, I can replace it and send a v2, no problem.
On Wed, Jul 06, 2022 at 01:22:50PM +0100, Will Deacon wrote:
@@ -300,6 +300,9 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
 /* use after obtaining a reference count */
 static inline bool nf_ct_should_gc(const struct nf_conn *ct)
 {
+	/* ->status and ->timeout loads must happen after refcount increase */
+	smp_rmb();
Sorry I didn't suggest this earlier, but if all of these smp_rmb()s are for upgrading the ordering from refcount_inc_not_zero() then you should use smp_acquire__after_ctrl_dep() instead. It's the same under the hood, but it illustrates what's going on a bit better.
But in that case it had better also be near an actual condition, otherwise things become too murky for words :/
That is, why is this sprinkled all over instead of right after a successful refcount_inc_not_zero()?
Code like:
	if (!refcount_inc_not_zero())
		return;

	smp_acquire__after_ctrl_dep();
is fairly self-evident, whereas encountering an smp_acquire__after_ctrl_dep() in a different function, completely unrelated to any condition is quite crazy.
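[Applied to the early_drop path quoted earlier, that placement would look roughly like the sketch below; it combines Will's suggested primitive with Peter's preferred placement and is illustrative only, not a patch posted in this thread:]

	/* inside early_drop_list(), for each entry 'tmp' on the hlist */
	if (!refcount_inc_not_zero(&tmp->ct_general.use))
		continue;

	/* Upgrade the control dependency from the successful increment so
	 * the ->ct_net and ->status loads below cannot be hoisted above it.
	 * Same barrier as smp_rmb() under the hood, but the intent is
	 * documented right next to the condition it depends on.
	 */
	smp_acquire__after_ctrl_dep();

	if (net_eq(nf_ct_net(tmp), net) &&
	    nf_ct_is_confirmed(tmp) &&
	    nf_ct_delete(tmp, 0, 0))
		drops++;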
@@ -1775,6 +1784,16 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
 	if (!exp)
 		__nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);

+	/* Other CPU might have obtained a pointer to this object before it was
+	 * released. Because refcount is 0, refcount_inc_not_zero() will fail.
+	 *
+	 * After refcount_set(1) it will succeed; ensure that zeroing of
+	 * ct->status and the correct ct->net pointer are visible; else other
+	 * core might observe CONFIRMED bit which means the entry is valid and
+	 * in the hash table, but its not (anymore).
+	 */
+	smp_wmb();
+
 	/* Now it is going to be associated with an sk_buff, set refcount to 1. */
 	refcount_set(&ct->ct_general.use, 1);
Ideally that refcount_set() would be a release, but this is definitely (ab)using refcount_t in a way that isn't anticipated by the API! It looks like a similar pattern exists in net/core/sock.c as well, so I wonder if it's worth extending the API.
Peter, what do you think?
Bah; you have reminded me that I have a fairly sizable amount of refcount patches from when Linus complained about it last that don't seem to have gone anywhere :/
Anyway, I suppose we could do a refcount_set_release(), but it had better get a fairly big comment on how you're on your own if you use it.
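[A refcount_set_release() along those lines might look like the sketch below; the helper and its exact name did not exist at this point in the thread, so treat it as an assumption about what such an extension could be, together with the warning comment Peter asks for:]

/* Hypothetical API extension: like refcount_set(), but with release ordering
 * so that all initialising stores made before this call are visible to any
 * CPU that subsequently succeeds in refcount_inc_not_zero().
 *
 * Only meant for SLAB_TYPESAFE_BY_RCU-style "resurrection" of objects that
 * other CPUs may already hold a stale pointer to -- if you are not doing
 * that, you do not want this.
 */
static inline void refcount_set_release(refcount_t *r, int n)
{
	atomic_set_release(&r->refs, n);
}

[With such a helper, the init_conntrack() hunk above could drop the explicit smp_wmb() and write refcount_set_release(&ct->ct_general.use, 1) instead.]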