On Mon, Jun 16, 2025 at 12:09:21PM +0200, Christian Theune wrote:
Can you share the xfs_info of one of these filesystems? I'm curious about the FS geometry.
Sure:
# xfs_info /
meta-data=/dev/disk/by-label/root isize=512    agcount=21, agsize=655040 blks
         =                        sectsz=512   attr=2, projid32bit=1
         =                        crc=1        finobt=1, sparse=1, rmapbt=0
         =                        reflink=1    bigtime=1 inobtcount=1 nrext64=0
         =                        exchange=0
data     =                        bsize=4096   blocks=13106171, imaxpct=25
         =                        sunit=0      swidth=0 blks
naming   =version 2               bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =internal log            bsize=4096   blocks=16384, version=2
         =                        sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                    extsz=4096   blocks=0, rtextents=0
From the logs, it was /dev/vda1 that was getting hung up, so I'm going to assume the workload is hitting the root partition, not:
# xfs_info /tmp/
meta-data=/dev/vdb1              isize=512    agcount=8, agsize=229376 blks
... this one that has a small log.
IOWs, I don't think the log size is a contributing factor here.
The indication from the logs is that the system is hung up waiting on slow journal writes: there are processes hung waiting for transaction reservations (i.e. no journal space available). Journal space can't be freed because metadata writeback is trying to force the journal to stable storage, and that journal force is itself blocked waiting for journal IO completion before it can issue more journal IO, so metadata writeback can't make progress either.
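If you want to confirm where things are stuck when a stall hits (the hung task warnings in your logs should already contain much the same information), and assuming the magic SysRq interface is enabled on these VMs, dumping the blocked task backtraces is enough; tasks waiting on journal space show up sleeping in the XFS log reservation/force paths:

# sysctl -w kernel.sysrq=1
# echo w > /proc/sysrq-trigger
# dmesg | tail -200

The 'w' trigger writes a backtrace for every task currently in uninterruptible sleep to the kernel log.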
I think part of the issue is that journal writes issue device cache flushes and FUA writes, both of which require written data to be on stable storage before returning.
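A quick way to check that those flushes and FUA writes are actually being passed down to the host (rather than being dropped because the device advertises a write-through cache) is to look at the write cache mode of the guest block device:

# cat /sys/block/vda/queue/write_cache

If that reports "write back", the block layer sends the flush/FUA requests on to the virtio backend; if it reports "write through", they are stripped before they ever reach the host. Watching the device with something like "blktrace -d /dev/vda -o - | blkparse -i -" during a stall should also show the journal writes going out with the flush/FUA ('F') flags set in the RWBS field.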
All this points to the storage backing these VMs being extremely slow at guaranteeing persistence of data, and eventually it can't keep up with the application making changes to the filesystem. When the journal IO latency gets high enough, you start to see things backing up and stall warnings appearing.
IOWs, this does not look like a filesystem issue from the information presented, just storage that can't keep up with the rate at which the filesystem can make modifications in memory. When the fs finally starts to throttle on the slow storage, that's when you notice just how slow the storage actually is...
[ Historical note: this is exactly the sort of thing we have seen for years with hardware RAID5/6 adapters with large amounts of NVRAM and random write workloads. They run as fast as NVRAM can sink the 4kB random writes, then when the NVRAM fills, they have to wait for hundreds of MB of cached 4kB random writes to be written to the RAID5/6 luns at 50-100 IOPS. This causes the exact same "filesystem is hung" symptoms as you are describing in this thread. ]
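If you want to put a number on that, a simple fio job that forces every write to stable storage will show what the backing store can actually sustain once its caches stop absorbing the load (file path and size here are just examples, adjust to suit):

# fio --name=persistent-randwrite --filename=/var/tmp/fio.testfile \
      --size=1g --rw=randwrite --bs=4k --ioengine=psync \
      --fdatasync=1 --runtime=60 --time_based

The IOPS and completion latencies fio reports for that job are roughly what the journal and metadata writeback are limited to once the storage-side caches fill up.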
There have been a few improvements to log performance in Linux 6.9, though, but I can't tell if you have any of those improvements in your kernel. I'd suggest trying to run a newer upstream kernel, otherwise you'll get very limited support from the upstream community. If you can't, I'd suggest reporting this issue to your vendor, so they can track what you are/are not running in your current kernel.
Yeah, we’ve started upgrading selected/affected projects to 6.12, to see whether this improves things.
Keep in mind that if the problem is persistent write performance of the storage, upgrading the kernel will not make it go away. It may make it worse, because other optimisations we've made in the meantime could mean the journal fills faster and pushes into the persistent IO backlog latency issue sooner and more frequently...
-Dave.