On Wed, Jun 18, 2025 at 03:01:26PM +0200, Christian Theune wrote:
On 18. Jun 2025, at 00:28, Dave Chinner <david@fromorbit.com> wrote:
On Mon, Jun 16, 2025 at 12:09:21PM +0200, Christian Theune wrote:
Can you share the xfs_info of one of these filesystems? I'm curious about the FS geometry.
Sure:
# xfs_info /
meta-data=/dev/disk/by-label/root isize=512    agcount=21, agsize=655040 blks
         =                        sectsz=512   attr=2, projid32bit=1
         =                        crc=1        finobt=1, sparse=1, rmapbt=0
         =                        reflink=1    bigtime=1 inobtcount=1 nrext64=0
         =                        exchange=0
data     =                        bsize=4096   blocks=13106171, imaxpct=25
         =                        sunit=0      swidth=0 blks
naming   =version 2               bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =internal log            bsize=4096   blocks=16384, version=2
         =                        sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                    extsz=4096   blocks=0, rtextents=0
From the logs, it was /dev/vda1 that was getting hung up, so I'm going to assume the workload is hitting the root partition, not:
# xfs_info /tmp/
meta-data=/dev/vdb1              isize=512    agcount=8, agsize=229376 blks
... this one that has a small log.
IOWs, I don't think the log size is a contributing factor here.
The indication from the logs is that the system is hung up waiting on slow journal writes, e.g. there are processes hung waiting for transaction reservations (i.e. no journal space available). Journal space can't be freed because metadata writeback is trying to force the journal to stable storage (and that is blocked waiting for journal IO completion before it can issue more journal IO), so metadata writeback can't make progress, either.
I think part of the issue is that journal writes issue device cache flushes and FUA writes, both of which require written data to be on stable storage before returning.
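(You can watch this directly by tracing the virtual block device while the workload runs and looking at which writes carry the preflush/FUA flags in the trace output - the device name below is just an example:

# blktrace -d /dev/vda -o - | blkparse -i -

The journal writes show up there as the small sequential writes with the flush/FUA flags set, i.e. the ones that have to reach stable storage before they complete.)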
All this points to the storage backing these VMs being extremely slow at guaranteeing persistence of data, and eventually it can't keep up with the application making changes to the filesystem. When the journal IO latency gets high enough, you start to see things backing up and stall warnings appearing.
IOWs, this does not look like a filesystem issue from the information presented, just storage that can't keep up with the rate at which the filesystem can make modifications in memory. When the fs finally starts to throttle on the slow storage, that's when you notice just how slow the storage actually is...
[ Historical note: this is exactly the sort of thing we have seen for years with hardware RAID5/6 adapters with large amounts of NVRAM and random write workloads. They run as fast as NVRAM can sink the 4kB random writes, then when the NVRAM fills, they have to wait for hundreds of MB of cached 4kB random writes to be written to the RAID5/6 luns at 50-100 IOPS. This causes the exact same "filesystem is hung" symptoms as you are describing in this thread. ]
Yeah, I’m very wary of reporting these tracebacks as potential bugs because they can easily be just a symptom of slow storage. My problem here is that I can’t point to anything that says the storage actually was slow.
I’ve gone through all metrics and logs on the KVM servers as well as the Ceph servers, and they’ve been performing completely at baseline regarding errors, queues, IOPS and latency.
I’ve done a measurement to try to emulate those accesses by running
$ fio --rw=randrw --name=synctest --bs=4k --direct=1 --numjobs=1 --ioengine=libaio --iodepth=1 --runtime=600 --write_barrier=1 --size=60m
I hope this is sufficiently comparable behaviour (maybe with a different read/write ratio instead of 0.5?) to what XFS log flushing does. This resulted in [1].
Not really comparable. "write_barrier" is not the option for data integrity - fdatasync=8 would be closer, but even that doesn't replicate the cache flush/FUA write pattern of journal IOs.
Also, journals are write only and sequential, not random. The iodepth is typically 8 when fully busy, and the IO size ranges from single sector to logbsize (from /proc/mounts) depending on the workload.
Metadata writeback to clear the journal, OTOH, is largely random 4kB writes with no data integrity implied (the journal cache flush mechanism I mention above handles that).
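If you want a rough approximation to compare against, something like the two jobs below would be closer - one journal-ish (sequential, write-only, queue depth 8, IO sizes from a sector up to logbsize, with an fdatasync per write as a crude stand-in for the cache flush/FUA semantics), and one writeback-ish (random 4kB up to 16kB inode-cluster sized writes, no syncs). The parameters are illustrative only, not an exact model of what the log does:

$ fio --name=log-ish --rw=write --bsrange=512-32k --ioengine=libaio \
      --iodepth=8 --direct=1 --fdatasync=1 --size=512m \
      --runtime=600 --time_based

$ fio --name=writeback-ish --rw=randwrite --bsrange=4k-16k --ioengine=libaio \
      --iodepth=8 --direct=1 --size=512m --runtime=600 --time_based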
My interpretation of this measurement (and the VM showed no illness while this was running over 10 minutes): the VM is throttled at 250 IOPS and reports an average of exactly 250 IOPS after 10 minutes of 4k random writes. The latencies vary a bit, which could be due to the Qemu throttling. The max latency was 133ms, the average 2ms. This is on a 10g storage network with Ceph, which requires another network roundtrip for replication before ACKing a write.
Oh. Ceph.
Slow random write performance on ceph rbd devices seems to be a recurring problem - the filesystem is way faster than the storage, and by the time the filesystem throttles on journal space (even with small logs), the backlog of IO cannot be drained very quickly and so everything stalls.
IOWs, a hard cap of 250 random 4kB write IOPS is extremely slow and a likely cause of your random stall problems. When you have lots of dirty metadata in memory, then it can take minutes of IO for writeback to free journal space to allow progress to be made again.
For example, modify an inode (e.g. chmod), and that only takes ~250 bytes of journal space to record. So a 10MB journal can hold ~40,000 dirty inodes. If we assume that there is maximum locality in these inodes (i.e. best case writeback IO patterns, with ~32 inodes going into each 16kB inode cluster write), we have ~1300 x 16kB writes to perform to empty the journal.
Hence there's up to 10-15s of writeback IO to "unstall" a small journal. Worst case, it's one 16kB write per inode, and that means potential for multiple minutes of writeback to unstall a 10MB journal. Now consider a 64MB journal, and the potential stalls waiting on metadata writeback are much, much worse...
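Back of the envelope against your 250 IOPS cap (the numbers are rough):

  10MB journal / ~250 bytes per logged inode  ≈ 40,000 dirty inodes
  best case:  40,000 inodes / 32 per 16kB cluster ≈ 1,300 IOs
              1,300 IOs / 250 IOPS ≈ 5s of pure writeback IO
              (longer in practice, because journal and data IO are
              competing for the same IOPS budget)
  worst case: 40,000 x 16kB IOs / 250 IOPS ≈ 160s, i.e. multiple minutes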
These stalls will come and go with workload peaks. The filesystem can soak up some of the peak into memory and the journal, then it throttles to storage speed and that's where the hung task warnings fire. Once the storage catches up, everything goes back to "normal" latencies.
I’m out of ideas for now, I’ll keep thinking about this. If anyone has any pointers for further tests in any direction, I’m open to anything. ;)
Don't use hard throttling on IOPS rates. Measure IO rates over a longer period of time (e.g. rolling average over a minute) so that the storage can rapidly sink the short term IO peaks that occur when the workload runs the journal out of space and the fs throttles it whilst metadata writeback makes progress.
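If the throttle is applied via libvirt/qemu, the iotune burst knobs can express that kind of policy - a sustained rate plus a larger short-term burst rate for a bounded time. It's not exactly a rolling average, but it lets the storage sink the peaks. The numbers below are purely illustrative:

$ virsh blkdeviotune <domain> vda \
        --total-iops-sec 250 \
        --total-iops-sec-max 2000 \
        --total-iops-sec-max-length 60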
-Dave.