On Tue, Jul 05, 2022 at 11:57:49AM +0100, Will Deacon wrote:
On Tue, Jul 05, 2022 at 11:53:22AM +0100, Kajetan Puchalski wrote:
On Mon, Jul 04, 2022 at 10:22:24AM +0100, Kajetan Puchalski wrote:
On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
That would make sense, from further experiments I ran it somehow seems to be related to the number of workers being spawned by stress-ng along with the CPUs/cores involved.
For instance, running the test with <=25 workers (--udp-flood 25 etc.) results in the test running fine for at least 15 minutes.
Ok. I will let it run for longer on the machines I have access to.
In mean time, you could test attached patch, its simple s/refcount_/atomic_/ in nf_conntrack.
If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you but works with attached patch someone who understands aarch64 memory ordering would have to look more closely at refcount_XXX functions to see where they might differ from atomic_ ones.
I can confirm that the patch seems to solve the issue. With it applied on top of the 5.19-rc5 tag the test runs fine for at least 15 minutes which was not the case before so it looks like it is that aarch64 memory ordering problem.
I'm CCing some people who should be able to help with aarch64 memory ordering, maybe they could take a look.
(re-sending due to a typo in CC, sorry for duplicate emails!)
Sorry, but I have absolutely no context here. We have a handy document describing the differences between atomic_t and refcount_t:
Documentation/core-api/refcount-vs-atomic.rst
What else do you need to know?
Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622 ("netfilter: conntrack: convert to refcount_t api") which mean that you no longer have ordering to subsequent reads in the absence of an address dependency.
Will