Ouch, thanks for that explanation. I now get why and how this really is either a log-related or a more general dirty/writeback flushing issue. We definitely have some quite sub-optimal settings here, so I’m revisiting our dirty/writeback settings and XFS journal sizes. My understanding is that (in our situation) those should be sized based on the backing storage performance - so we’ll typically end up with fixed sizes, independent of the available RAM or volume size, that guarantee an upper bound on the worst-case flushing/processing time.
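Concretely, I’m thinking of switching from the ratio-based knobs to fixed byte-based limits, along these lines (the numbers are placeholders I still need to derive from the actual storage performance, not a recommendation):

$ sysctl -w vm.dirty_background_bytes=67108864   # start background writeback at 64 MiB
$ sysctl -w vm.dirty_bytes=268435456             # hard-throttle writers at 256 MiB
$ sysctl -w vm.dirty_expire_centisecs=1000       # consider dirty data eligible for writeback after 10 s
$ sysctl -w vm.dirty_writeback_centisecs=500     # wake the flusher threads every 5 s

(Setting the *_bytes variants disables the corresponding *_ratio knobs, so the limits become independent of RAM size.)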
However, I’m still a bit puzzled why this has escalated to user-visible incidents in the last weeks - we’ve been running with those sub-optimal settings for a very long time and haven’t had these issues before.
I took some time to educate myself further about the writeback behaviour and noticed the special relationship that keeps getting mentioned between writeback and cgroups - and cgroups are in play in our setup, so maybe that’s worth another look.
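To confirm that cgroup-aware writeback is actually in effect on these machines, I’ll check something like this (the paths are illustrative for our cgroup v2 layout):

$ stat -fc %T /sys/fs/cgroup                  # "cgroup2fs" => unified hierarchy
$ cat /sys/fs/cgroup/cgroup.subtree_control   # both "memory" and "io" need to be enabled for cgroup writeback
$ grep -E 'file_dirty|file_writeback' /sys/fs/cgroup/system.slice/*/memory.stat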
Whenever this issue hits us, the system is running one of two system management tasks: one performs garbage collection on the nix store (which typically contains multiple generations of all packages on the system), the other builds out a new system config. Those jobs run in separate cgroups and are throttled at 31 RW IOPS (12.5% of 250) and 188 RW IOPS (75% of 250) respectively. Those jobs aren’t time-critical at all, so we want to leave sufficient IOPS to not starve the actual workload.
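(Expressed as cgroup v2 io.max limits, that’s roughly the following - the device number and unit names here are illustrative, not our exact config:)

$ echo "253:0 riops=31 wiops=31"   > /sys/fs/cgroup/system.slice/nix-gc.service/io.max
$ echo "253:0 riops=188 wiops=188" > /sys/fs/cgroup/system.slice/rebuild.service/io.max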
Those jobs run relatively often without causing issues, but then they’re also lazy, so they often don’t have much to do. Still, even when they do real work they do not always trigger those issues. When they do, I can see a sudden spike in our telemetry for dirty pages (around 300-400 MiB typically) followed by a jump in writeback pages (which is typically a bit higher: around 200 MiB more than dirty pages).
So far so good - that’s expected behaviour: one of the jobs may be deleting a lot of files (so lots of metadata operations going through the journal), and the other typically extracts archives, so lots of new files of varying sizes.
Here’s why I think there might be a bug lurking (or I’m still missing something):
When the journal in a volume is relatively small (10 MiB) but the amount of data in writeback is around 450 MiB, this indicates regular file content being flushed. Yet I only see 3 MiB/s at around 100 IOPS (i.e. ~32 KiB average writes).
The 100 IOPS is a weird number here, especially as customer traffic is impacted and apparently completely stalled at that time. My impression from reading about the writeback mechanisms and from your description of journal flushes was that none of the flushing operations should cause a full stall until the caches or the journal are empty, but should instead let traffic trickle through at roughly the backing storage speed. Of course, if the system adds more traffic than is being written out, it would still starve.
(I’m guessing that the journal flush doesn’t go through the writeback cache. So maybe the journal isn’t really involved here at all.)
So, could it be that there is some (faulty) interaction between the cgroup throttling and the writeback and/or journal flush that causes a) the available IOPS of the backing storage not being fully used and b) other traffic being stalled that shouldn’t be?
How would I go about proving that?
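What I can think of so far is to watch the per-cgroup dirty/writeback accounting and the writeback tracepoints while a stall is happening, roughly like this (cgroup paths illustrative):

$ grep -E 'file_dirty|file_writeback' /sys/fs/cgroup/system.slice/*/memory.stat
$ cat /sys/fs/cgroup/system.slice/*/io.stat      # rios/wios should show who actually submits IO
$ trace-cmd record -e writeback:balance_dirty_pages -e writeback:writeback_written sleep 30

If the throttled cgroups own almost all of the dirty/writeback pages while the un-throttled workload is the one stalling in balance_dirty_pages, that would point at the interaction I’m suspecting.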
I tried again to replicate the worst case traffic patterns and ran
$ fio --rw=randwrite --name=synctest --bs=512 --direct=1 --numjobs=1 --ioengine=libaio --iodepth=8 --runtime=600 --fdatasync=8 --size=10m
which gave me consistent ~250 IOPS and finished in around 80 seconds.
I’m going to try to find a machine that exhibits the issues most often and will then a) apply fixed lower bounds for the writeback cache, b) remove the cgroup limits, and c) try both together.
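To judge the effect of each variant I’ll watch the IO pressure-stall information alongside our normal telemetry, globally and per cgroup (assuming PSI is enabled in our kernels; paths illustrative):

$ watch -n1 cat /proc/pressure/io
$ cat /sys/fs/cgroup/system.slice/io.pressure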
Thanks again, Christian
On 18. Jun 2025, at 23:54, Dave Chinner <david@fromorbit.com> wrote:
On Wed, Jun 18, 2025 at 03:01:26PM +0200, Christian Theune wrote:
[…] Yeah, I’m very wary of reporting these tracebacks as potential bugs because they can easily be just a hint at slow storage. My problem here is that I can’t point to anything that says the storage would have been slow.
I’ve gone through all metrics and logs on the KVM servers as well as the Ceph servers and they’ve been performing completely at baseline level regarding errors, queues, iops, latency.
I’ve done a measurement to try to emulate those accesses by running
$ fio --rw=randrw --name=synctest --bs=4k --direct=1 --numjobs=1 --ioengine=libaio --iodepth=1 --runtime=600 --write_barrier=1 --size=60m
I hope this is sufficiently comparable behaviour (maybe with a different read/write ratio instead of 0.5?) to what XFS log flushing does. This resulted in [1].
Not really comparable. "write-barrier" is not the option for data integrity - fdatasync=8 would be closer, but even that doesn't replicate the cache flush/fua write pattern of journal IOs.
Also, journals are write only and sequential, not random. The iodepth is typically 8 when fully busy, and the IO size ranges from single sector to logbsize (from /proc/mounts) depending on the workload.
Metadata writeback to clear the journal, OTOH, is larger random 4kB writes with no data integrity implied (the journal cache flush mechanism I mention above handles that).
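(A rough fio approximation of those two patterns, for illustration only - user-space fio cannot reproduce the cache flush/FUA behaviour of real journal writes, and 256k is just an assumed upper bound for logbsize:)

$ fio --name=journal-ish --rw=write --bsrange=512-256k --direct=1 --ioengine=libaio \
      --iodepth=8 --fdatasync=8 --size=512m --runtime=120 --time_based
$ fio --name=metawb-ish --rw=randwrite --bs=4k --direct=1 --ioengine=libaio \
      --iodepth=8 --size=512m --runtime=120 --time_based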
My interpretation of this measurement (and the VM showed no illness while this was running over 10 minutes): the VM is throttled at 250 IOPS and is reporting back after 10 minutes of 4k random writes with an average of exactly 250 IOPS. The latencies vary a bit, which could be due to the Qemu throttling. The max latency was 133ms, the average 2ms. This is on a 10g storage network with Ceph, which requires another network roundtrip for replication before ACKing a write.
Oh. Ceph.
Slow random write performance on ceph rbd devices seems to be a recurring problem - the filesystem is way faster than the storage, and by the time the filesystem throttles on journal space (even with small logs), the backlog of IO cannot be drained very quickly and so everything stalls.
IOWs, a hard cap of 250 random 4kB write IOPS is extremely slow and a likely cause of your random stall problems. When you have lots of dirty metadata in memory, then it can take minutes of IO for writeback to free journal space to allow progress to be made again.
For example, modify an inode (e.g. chmod), and that only takes ~250 bytes of journal space to record. So a 10MB journal can hold ~40,000 dirty inodes. If we assume that there is maximum locality in these inodes (i.e. best case writeback IO patterns) we have ~1300 x 16kB writes to perform to empty the journal.
Hence there's up to 10-15s of writeback IO to "unstall" a small journal. Worst case, it's one 16kB write per inode, and that means potential for multiple minutes of writeback to unstall a 10MB journal. Now consider a 64MB journal, and the potential stalls waiting on metadata writeback are much, much worse...
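Laying that arithmetic out (assuming ~32 inodes per 16kB inode cluster and the 250 IOPS cap):

  10MB journal / ~250 bytes per logged inode          ~= 40,000 dirty inodes
  best case:  40,000 inodes / 32 per 16kB write       ~= 1,300 writes ~= 5-13s at 100-250 IOPS
  worst case: 40,000 separate 16kB writes / 250 IOPS  ~= 160s of writeback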
These stalls will come and go with workload peaks. The filesystem can soak up some of the peak into memory and the journal, then it throttles to storage speed and that's where the hung task warnings fire. Once the storage catches up, everything goes back to "normal" latencies.
I’m out of ideas for now, I’ll keep thinking about this. If anyone has any pointers for further tests in any direction, I’m open to anything. ;)
Don't use hard throttling on IOPS rates. Measure IO rates over a longer period of time (e.g. rolling average over a minute) so that the storage can rapidly sink the short term IO peaks that occur when the workload runs the journal out of space and the fs throttles it whilst metadata writeback makes progress.
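For illustration, QEMU's block throttling can already express that kind of burst allowance - a sustained rate plus a bounded burst - along these lines (values are placeholders, not a recommendation):

-drive file=...,throttling.iops-total=250,throttling.iops-total-max=2000,throttling.iops-total-max-length=60

i.e. 250 IOPS sustained, but bursts of up to 2000 IOPS are allowed for up to 60 seconds.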
-Dave.
Dave Chinner <david@fromorbit.com>