Hi Jakub, Willem,
On 01/08/2025 20:16, Jakub Kicinski wrote:
> We keep seeing flakes on packetdrill on debug kernels, while non-debug kernels are stable, not a single flake in 200 runs. Time to give up, debug kernels appear to suffer from 10msec latency spikes and any timing-sensitive test is bound to flake.
Thank you for the patch!
Another solution might be to increase the tolerance, but I don't think it would fix all issues. I quickly looked at the last 100 runs, and I think most failures could be fixed by a higher tolerance, e.g.
# tcp_ooo-before-and-after-accept.pkt:19: timing error: expected inbound packet at 0.101619 sec but happened at 0.115894 sec; tolerance 0.014000 sec
(Only 0.275 ms above the limit!)
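For reference, the overshoot can be recomputed from such a line. Below is a minimal Python sketch, assuming the message format is exactly the one quoted above; the helper name overshoot_usecs and the SAMPLE constant are hypothetical and not part of packetdrill or the selftests.

```python
import re
from decimal import Decimal

# Sample line copied from the failure quoted above.
SAMPLE = (
    "# tcp_ooo-before-and-after-accept.pkt:19: timing error: "
    "expected inbound packet at 0.101619 sec but happened at 0.115894 sec; "
    "tolerance 0.014000 sec"
)

# Assumption: timing errors always follow the wording of the quoted line.
TIMING_RE = re.compile(
    r"expected \w+ packet at (?P<expected>\d+\.\d+) sec "
    r"but happened at (?P<actual>\d+\.\d+) sec; "
    r"tolerance (?P<tolerance>\d+\.\d+) sec"
)


def to_usecs(seconds):
    """Convert a decimal string in seconds to integer microseconds."""
    return int(Decimal(seconds) * 1_000_000)


def overshoot_usecs(line):
    """Return how far beyond the tolerance the packet was, in microseconds.

    Returns None when the line is not a timing error.
    """
    m = TIMING_RE.search(line)
    if m is None:
        return None
    delta = abs(to_usecs(m["actual"]) - to_usecs(m["expected"]))
    return delta - to_usecs(m["tolerance"])


if __name__ == "__main__":
    print(overshoot_usecs(SAMPLE), "usec above the tolerance")
```

Running this over the failures of recent runs would show how many of them fall within a given higher tolerance, and how many would not be fixed by it.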
On MPTCP, we used to have a very high tolerance with debug kernels (>0.5s) when public CIs were very limited in terms of CPU resources. I guess a tolerance of 0.1s would be enough, but for these MPTCP packetdrill tests, I set the tolerance to 0.2s with a debug kernel, just to be on the safe side.
Still, I think increasing the tolerance would not fix all issues. On the MPTCP side, the latency introduced by debug kernels caused unexpected retransmissions because the RTO was too low. I took the time to make sure packets were always injected with enough delay, but with the TCP packetdrill tests here, that is possibly not enough, judging by some recent errors, e.g.
tcp_zerocopy_batch.pkt:26: error handling packet: live packet payload: expected 4000 bytes vs actual 5000 bytes
In the end, and as previously mentioned, these adaptations for debug kernels are perhaps not worth it: in this environment, it is probably enough to ignore packetdrill results and focus on kernel warnings.
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Cheers,
Matt