On Wed, Jun 18, 2025 at 03:01:26PM +0200, Christian Theune wrote:
On 18. Jun 2025, at 00:28, Dave Chinner <david@fromorbit.com> wrote:
On Mon, Jun 16, 2025 at 12:09:21PM +0200, Christian Theune wrote:
Can you share the xfs_info of one of these filesystems? I'm curious about the FS geometry.
Sure:
# xfs_info /
meta-data=/dev/disk/by-label/root isize=512    agcount=21, agsize=655040 blks
         =                        sectsz=512   attr=2, projid32bit=1
         =                        crc=1        finobt=1, sparse=1, rmapbt=0
         =                        reflink=1    bigtime=1 inobtcount=1 nrext64=0
         =                        exchange=0
data     =                        bsize=4096   blocks=13106171, imaxpct=25
         =                        sunit=0      swidth=0 blks
naming   =version 2               bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =internal log            bsize=4096   blocks=16384, version=2
         =                        sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                    extsz=4096   blocks=0, rtextents=0
From the logs, it was /dev/vda1 that was getting hung up, so I'm going to assume the workload is hitting the root partition, not:
# xfs_info /tmp/
meta-data=/dev/vdb1              isize=512    agcount=8, agsize=229376 blks
... this one that has a small log.
IOWs, I don't think the log size is a contributing factor here.
The indication from the logs is that the system is hung up waiting on slow journal writes, e.g. there are processes hung waiting for transaction reservations (i.e. no journal space available). Journal space can't be freed because metadata writeback is trying to force the journal to stable storage (and that is blocked waiting for journal IO completion before it can issue more journal IO), so metadata writeback can't make progress, either.
I think part of the issue is that journal writes issue device cache flushes and FUA writes, both of which require written data to be on stable storage before returning.
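(You can watch this directly by tracing the virtual block device while the workload runs and looking at which writes carry the preflush/FUA flags in the trace output - the device name below is just an example:

# blktrace -d /dev/vda -o - | blkparse -i -

The journal writes show up there as the small sequential writes with the flush/FUA flags set, i.e. the ones that have to reach stable storage before they complete.)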
All this points to the storage backing these VMs being extremely slow at guaranteeing persistence of data, and eventually it can't keep up with the application making changes to the filesystem. When the journal IO latency gets high enough, you start to see things backing up and stall warnings appearing.
IOWs, this does not look like a filesystem issue from the information presented, just storage that can't keep up with the rate at which the filesystem can make modifications in memory. When the fs finally starts to throttle on the slow storage, that's when you notice just how slow the storage actually is...
[ Historical note: this is exactly the sort of thing we have seen for years with hardware RAID5/6 adapters with large amounts of NVRAM and random write workloads. They run as fast as NVRAM can sink the 4kB random writes, then when the NVRAM fills, they have to wait for hundreds of MB of cached 4kB random writes to be written to the RAID5/6 luns at 50-100 IOPS. This causes the exact same "filesystem is hung" symptoms as you are describing in this thread. ]
Yeah, I’m very wary of reporting these tracebacks as potential bugs because they can easily be just a symptom of slow storage. My problem here is that I can’t point to anything that says the storage actually was slow.
I’ve gone through all metrics and logs on the KVM servers as well as the Ceph servers, and they’ve been performing completely at baseline regarding errors, queues, IOPS and latency.
I’ve done a measurement to try to emulate those accesses by running
$ fio --rw=randrw --name=synctest --bs=4k --direct=1 --numjobs=1 --ioengine=libaio --iodepth=1 --runtime=600 --write_barrier=1 --size=60m
I hope this is sufficiently comparable behaviour (maybe with a different read/write ratio instead of 0.5?) to what XFS log flushing does. This resulted in [1].
Not really comparable. "write_barrier" is not the option for data integrity - fdatasync=8 would be closer, but even that doesn't replicate the cache flush/FUA write pattern of journal IOs.
Also, journals are write only and sequential, not random. The iodepth is typically 8 when fully busy, and the IO size ranges from single sector to logbsize (from /proc/mounts) depending on the workload.
Metadata writeback to clear the journal, OTOH, is largely random 4kB writes with no data integrity implied (the journal cache flush mechanism I mention above handles that).
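If you want a rough approximation to compare against, something like the two jobs below would be closer - one journal-ish (sequential, write-only, queue depth 8, IO sizes from a sector up to logbsize, with an fdatasync per write as a crude stand-in for the cache flush/FUA semantics), and one writeback-ish (random 4kB up to 16kB inode-cluster sized writes, no syncs). The parameters are illustrative only, not an exact model of what the log does:

$ fio --name=log-ish --rw=write --bsrange=512-32k --ioengine=libaio \
      --iodepth=8 --direct=1 --fdatasync=1 --size=512m \
      --runtime=600 --time_based

$ fio --name=writeback-ish --rw=randwrite --bsrange=4k-16k --ioengine=libaio \
      --iodepth=8 --direct=1 --size=512m --runtime=600 --time_based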
My interpretation of this measurement (and the VM showed no illness while this was running over 10 minutes): the VM is throttled at 250 IOPS and reports an average of exactly 250 IOPS after 10 minutes of 4k random writes. The latencies vary a bit, which could be due to the Qemu throttling. The max latency was 133ms, the average 2ms. This is on a 10g storage network with Ceph, which requires another network roundtrip for replication before ACKing a write.
Oh. Ceph.
Slow random write performance on ceph rbd devices seems to be a recurring problem - the filesystem is way faster than the storage, and by the time the filesystem throttles on journal space (even with small logs), the backlog of IO cannot be drained very quickly and so everything stalls.
IOWs, a hard cap of 250 random 4kB write IOPS is extremely slow and a likely cause of your random stall problems. When you have lots of dirty metadata in memory, then it can take minutes of IO for writeback to free journal space to allow progress to be made again.
For example, modify an inode (e.g. chmod), and that only takes ~250 bytes of journal space to record. So a 10MB journal can hold ~40,000 dirty inodes. If we assume that there is maximum locality in these inodes (i.e. best case writeback IO patterns, with ~32 inodes going into each 16kB inode cluster write), we have ~1300 x 16kB writes to perform to empty the journal.
Hence there's up to 10-15s of writeback IO to "unstall" a small journal. Worst case, it's one 16kB write per inode, and that means potential for multiple minutes of writeback to unstall a 10MB journal. Now consider a 64MB journal, and the potential stalls waiting on metadata writeback are much, much worse...
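Back of the envelope against your 250 IOPS cap (the numbers are rough):

  10MB journal / ~250 bytes per logged inode  ≈ 40,000 dirty inodes
  best case:  40,000 inodes / 32 per 16kB cluster ≈ 1,300 IOs
              1,300 IOs / 250 IOPS ≈ 5s of pure writeback IO
              (longer in practice, because journal and data IO are
              competing for the same IOPS budget)
  worst case: 40,000 x 16kB IOs / 250 IOPS ≈ 160s, i.e. multiple minutes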
These stalls will come and go with workload peaks. The filesystem can soak up some of the peak into memory and the journal, then it throttles to storage speed and that's where the hung task warnings fire. Once the storage catches up, everything goes back to "normal" latencies.
I’m out of ideas for now, I’ll keep thinking about this. If anyone has any pointers for further tests in any direction, I’m open to anything. ;)
Don't use hard throttling on IOPS rates. Measure IO rates over a longer period of time (e.g. rolling average over a minute) so that the storage can rapidly sink the short term IO peaks that occur when the workload runs the journal out of space and the fs throttles it whilst metadata writeback makes progress.
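If the throttle is applied via libvirt/qemu, the iotune burst knobs can express that kind of policy - a sustained rate plus a larger short-term burst rate for a bounded time. It's not exactly a rolling average, but it lets the storage sink the peaks. The numbers below are purely illustrative:

$ virsh blkdeviotune <domain> vda \
        --total-iops-sec 250 \
        --total-iops-sec-max 2000 \
        --total-iops-sec-max-length 60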
-Dave.