Ouch, thanks for that explanation. I now get why and how this really is either a log-related or a more general dirty/writeback flushing issue. We definitely have some quite sub-optimal settings here, so I’m revisiting our dirty/writeback settings and XFS journal sizes. My understanding is that (in our situation) those should be sized based on the backing storage performance - so we’ll typically end up with fixed sizes, independent of the available RAM or volume size, that guarantee an upper bound on the worst-case flushing/processing time.
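Concretely, I’m thinking of switching from the ratio-based knobs to fixed byte-based limits, along these lines (the numbers are placeholders I still need to derive from the actual storage performance, not a recommendation):

$ sysctl -w vm.dirty_background_bytes=67108864   # start background writeback at 64 MiB
$ sysctl -w vm.dirty_bytes=268435456             # hard-throttle writers at 256 MiB
$ sysctl -w vm.dirty_expire_centisecs=1000       # consider dirty data eligible for writeback after 10 s
$ sysctl -w vm.dirty_writeback_centisecs=500     # wake the flusher threads every 5 s

(Setting the *_bytes variants disables the corresponding *_ratio knobs, so the limits become independent of RAM size.)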
However, I’m still a bit puzzled why this has escalated to user-visible incidents in the last weeks - we’ve been running with those sub-optimal settings for a very long time and haven’t had these issues before.
I took some time to educate myself further about the writeback behaviour and noticed the special relationship that keeps getting mentioned between writeback and cgroups - and cgroups are in play in our setup, so maybe that’s worth another look.
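To confirm that cgroup-aware writeback is actually in effect on these machines, I’ll check something like this (the paths are illustrative for our cgroup v2 layout):

$ stat -fc %T /sys/fs/cgroup                  # "cgroup2fs" => unified hierarchy
$ cat /sys/fs/cgroup/cgroup.subtree_control   # both "memory" and "io" need to be enabled for cgroup writeback
$ grep -E 'file_dirty|file_writeback' /sys/fs/cgroup/system.slice/*/memory.stat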
Whenever this issue hits us, the system is running one of two system management tasks: one performs garbage collection on the nix store (which typically contains multiple generations of all packages on the system), the other builds out a new system config. Those jobs run in separate cgroups and are throttled at 31 RW IOPS (12.5% of 250) and 188 RW IOPS (75% of 250) respectively. Those jobs aren’t time-critical at all, so we want to leave sufficient IOPS to not starve the actual workload.
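(Expressed as cgroup v2 io.max limits, that’s roughly the following - the device number and unit names here are illustrative, not our exact config:)

$ echo "253:0 riops=31 wiops=31"   > /sys/fs/cgroup/system.slice/nix-gc.service/io.max
$ echo "253:0 riops=188 wiops=188" > /sys/fs/cgroup/system.slice/rebuild.service/io.max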
Those jobs run relatively often without causing issues, but then they’re also lazy, so they often don’t have much to do. Still, even when they do real work they do not always trigger those issues. When they do, I can see a sudden spike in our telemetry for dirty pages (around 300-400 MiB typically) followed by a jump in writeback pages (which is typically a bit higher: around 200 MiB more than dirty pages).
So far so good - that’s expected behaviour: one of the jobs may be deleting a lot of files (so lots of metadata operations going through the journal), and the other typically extracts archives, so lots of new files of varying sizes.
Here’s why I think there might be a bug lurking (or I’m still missing something):
When the journal in a volume is relatively small (10 MiB) but the amount of data in writeback is around 450 MiB, this indicates regular file content being flushed. Yet I only see 3 MiB/s at around 100 IOPS (i.e. ~32 KiB average writes).
The 100 IOPS is a weird number here, especially as customer traffic is impacted and apparently completely stalled at that time. My impression from reading about the writeback mechanisms and from your description of journal flushes was that none of the flushing operations should cause a full stall until the caches or the journal are empty, but should instead let traffic trickle through at roughly the backing storage speed. Of course, if the system adds more traffic than is being written out, it would still starve.
(I’m guessing that the journal flush doesn’t go through the writeback cache. So maybe the journal isn’t really involved here at all.)
So, could it be that there is some (faulty) interaction between the cgroup throttling and the writeback and/or journal flush that causes a) the available IOPS of the backing storage not being fully used and b) other traffic being stalled that shouldn’t be?
How would I go about proving that?
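What I can think of so far is to watch the per-cgroup dirty/writeback accounting and the writeback tracepoints while a stall is happening, roughly like this (cgroup paths illustrative):

$ grep -E 'file_dirty|file_writeback' /sys/fs/cgroup/system.slice/*/memory.stat
$ cat /sys/fs/cgroup/system.slice/*/io.stat      # rios/wios should show who actually submits IO
$ trace-cmd record -e writeback:balance_dirty_pages -e writeback:writeback_written sleep 30

If the throttled cgroups own almost all of the dirty/writeback pages while the un-throttled workload is the one stalling in balance_dirty_pages, that would point at the interaction I’m suspecting.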
I tried again to replicate the worst case traffic patterns and ran
$ fio --rw=randwrite --name=synctest --bs=512 --direct=1 --numjobs=1 --ioengine=libaio --iodepth=8 --runtime=600 --fdatasync=8 --size=10m
which gave me consistent ~250 IOPS and finished in around 80 seconds.
I’m going to try to find a machine that exhibits the issues most often and will then a) apply fixed lower bounds for the writeback cache, b) remove the cgroup limits, and c) try both together.
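To judge the effect of each variant I’ll watch the IO pressure-stall information alongside our normal telemetry, globally and per cgroup (assuming PSI is enabled in our kernels; paths illustrative):

$ watch -n1 cat /proc/pressure/io
$ cat /sys/fs/cgroup/system.slice/io.pressure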
Thanks again, Christian
On 18. Jun 2025, at 23:54, Dave Chinner <david@fromorbit.com> wrote:
On Wed, Jun 18, 2025 at 03:01:26PM +0200, Christian Theune wrote:
[…] Yeah, I’m very wary of reporting these tracebacks as potential bugs because they can easily be just a hint at slow storage. My problem here is that I can’t point to anything that says the storage would have been slow.
I’ve gone through all metrics and logs on the KVM servers as well as the Ceph servers and they’ve been performing completely at baseline level regarding errors, queues, iops, latency.
I’ve done a measurement to try to emulate those accesses by running
$ fio --rw=randrw --name=synctest --bs=4k --direct=1 --numjobs=1 --ioengine=libaio --iodepth=1 --runtime=600 --write_barrier=1 --size=60m
I hope this is sufficiently comparable behaviour (maybe with a different read/write ratio instead of 0.5?) to what XFS log flushing does. This resulted in [1].
Not really comparable. "write-barrier" is not the option for data integrity - fdatasync=8 would be closer, but even that doesn't replicate the cache flush/fua write pattern of journal IOs.
Also, journals are write only and sequential, not random. The iodepth is typically 8 when fully busy, and the IO size ranges from single sector to logbsize (from /proc/mounts) depending on the workload.
Metadata writeback to clear the journal, OTOH, is larger random 4kB writes with no data integrity implied (the journal cache flush mechanism I mention above handles that).
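(A rough fio approximation of those two patterns, for illustration only - user-space fio cannot reproduce the cache flush/FUA behaviour of real journal writes, and 256k is just an assumed upper bound for logbsize:)

$ fio --name=journal-ish --rw=write --bsrange=512-256k --direct=1 --ioengine=libaio \
      --iodepth=8 --fdatasync=8 --size=512m --runtime=120 --time_based
$ fio --name=metawb-ish --rw=randwrite --bs=4k --direct=1 --ioengine=libaio \
      --iodepth=8 --size=512m --runtime=120 --time_based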
My interpretation of this measurement (and the VM showed no illness while this was running over 10 minutes): the VM is throttled at 250 IOPS and is reporting back after 10 minutes of 4k random writes with an average of exactly 250 IOPS. The latencies vary a bit, which could be due to the Qemu throttling. The max latency was 133ms, the average 2ms. This is on a 10g storage network with Ceph, which requires another network roundtrip for replication before ACKing a write.
Oh. Ceph.
Slow random write performance on ceph rbd devices seems to be a recurring problem - the filesystem is way faster than the storage, and by the time the filesystem throttles on journal space (even with small logs), the backlog of IO cannot be drained very quickly and so everything stalls.
IOWs, a hard cap of 250 random 4kB write IOPS is extremely slow and a likely cause of your random stall problems. When you have lots of dirty metadata in memory, then it can take minutes of IO for writeback to free journal space to allow progress to be made again.
For example, modify an inode (e.g. chmod), and that only takes ~250 bytes of journal space to record. So a 10MB journal can hold ~40,000 dirty inodes. If we assume that there is maximum locality in these inodes (i.e. best case writeback IO patterns) we have ~1300 x 16kB writes to perform to empty the journal.
Hence there's up to 10-15s of writeback IO to "unstall" a small journal. Worst case, it's one 16kB write per inode, and that means potential for multiple minutes of writeback to unstall a 10MB journal. Now consider a 64MB journal, and the potential stalls waiting on metadata writeback are much, much worse...
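Laying that arithmetic out (assuming ~32 inodes per 16kB inode cluster and the 250 IOPS cap):

  10MB journal / ~250 bytes per logged inode          ~= 40,000 dirty inodes
  best case:  40,000 inodes / 32 per 16kB write       ~= 1,300 writes ~= 5-13s at 100-250 IOPS
  worst case: 40,000 separate 16kB writes / 250 IOPS  ~= 160s of writeback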
These stalls will come and go with workload peaks. The filesystem can soak up some of the peak into memory and the journal, then it throttles to storage speed and that's where the hung task warnings fire. Once the storage catches up, everything goes back to "normal" latencies.
I’m out of ideas for now, I’ll keep thinking about this. If anyone has any pointers for further tests in any direction, I’m open to anything. ;)
Don't use hard throttling on IOPS rates. Measure IO rates over a longer period of time (e.g. rolling average over a minute) so that the storage can rapidly sink the short term IO peaks that occur when the workload runs the journal out of space and the fs throttles it whilst metadata writeback makes progress.
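For illustration, QEMU's block throttling can already express that kind of burst allowance - a sustained rate plus a bounded burst - along these lines (values are placeholders, not a recommendation):

-drive file=...,throttling.iops-total=250,throttling.iops-total-max=2000,throttling.iops-total-max-length=60

i.e. 250 IOPS sustained, but bursts of up to 2000 IOPS are allowed for up to 60 seconds.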
-Dave.
Dave Chinner <david@fromorbit.com>