On Sun, 23 Feb 2025 07:19:49 -0800 Kevin Krakauer wrote:
Thanks for the review! I'll split this up. Do you think it's better as two patchsets -- one for stability/deflaking, one for return value and output cleanup -- or as a single patchset with several commits?
Should be fine either way, they will both end up in net-next. One patchset may be easier to merge, as we can't CI-test two conflicting series on the list.
To be clear - are you running this over veth or a real device?
Over a veth.
Set the device's napi_defer_hard_irqs to 50 so that GRO is less likely to immediately flush. This already happened in setup_loopback.sh, but wasn't added to setup_veth.sh. This accounts for most of the reduction in flakiness.
That doesn't make intuitive sense to me. If we already defer flushes why do we need to also defer IRQs?
Yep, the behavior here is weird. I ran `gro.sh -t large` 1000 times with each of the following setups (all inside strace to increase flakiness):
- gro_flush_timeout=1ms, napi_defer_hard_irqs=0 --> failed to GRO 29 times
- gro_flush_timeout=5ms, napi_defer_hard_irqs=0 --> failed to GRO 45 times
- gro_flush_timeout=50ms, napi_defer_hard_irqs=0 --> failed to GRO 35 times
- gro_flush_timeout=1ms, napi_defer_hard_irqs=1 --> failed to GRO 0 times
- gro_flush_timeout=1ms, napi_defer_hard_irqs=50 --> failed to GRO 0 times
napi_defer_hard_irqs is clearly having an effect. And deferring once is enough. I believe that deferring IRQs prevents anything else from causing a GRO flush before gro_flush_timeout expires. While waiting for the timeout to expire, an incoming packet can cause napi_complete_done and thus napi_gro_flush to run. Outgoing packets from the veth can also cause this: veth_xmit calls __veth_xdp_flush, which only actually does anything when IRQs are enabled.
So napi_defer_hard_irqs=1 seems sufficient to allow the full gro_flush_timeout to expire before flushing GRO.
With msec-long deferrals we'll flush due to jiffies change. At least that explains a bit. Could you maybe try lower timeouts than 1msec? Previously we'd just keep partially-completed packets in GRO for up to 1msec, now we'll delay all packet processing for 1msec, that's a lot.