Quoting Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
That library does not enable UDP_GRO. You do not have any UDP based tunnel devices (besides vxlan) configured, either, right?
The configuration is really minimal by now; I also took the bonding out of the equation. We have systemd configure "en*" with mDNS and DHCP enabled and that's it. The problem remains.
I also found new hardware on my desk today (some Intel SoC), showing exactly the same symptoms. So it really has nothing to do with the hardware.
It is also unlikely that the device has either of NETIF_F_GRO_FRAGLIST or NETIF_F_GRO_UDP_FWD configured. This can be checked with `ethtool -K $DEV`, shown as "rx-gro-list" and "rx-udp-gro-forwarding", respectively.
The full output of "ethtool -k enp5s0" from that SoC:
Features for enp5s0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: on
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: on
tx-gso-list: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: on
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
rx-udp-gro-forwarding: off
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]
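With this many features in the listing, it may be easier to pick out the GRO-related flags programmatically. The following is only a minimal sketch in Python, assuming the "ethtool -k" text has already been captured into a string; the function name parse_features and the shortened SAMPLE (a subset of the output above) are illustrative and not part of any existing tool.

```python
def parse_features(ethtool_output: str) -> dict[str, bool]:
    """Map each feature name in `ethtool -k` style output to True (on) or False (off)."""
    features = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        # Skip lines without "name: value", e.g. the "Features for <dev>:" header.
        if not sep or not value:
            continue
        state = value.split()[0]  # drops a trailing "[fixed]" marker, if any
        if state in ("on", "off"):
            features[name.strip()] = state == "on"
    return features


# Subset of the enp5s0 output quoted above, used here as sample input.
SAMPLE = """\
Features for enp5s0:
generic-receive-offload: on
rx-gro-hw: off [fixed]
rx-gro-list: off
rx-udp-gro-forwarding: off
"""

features = parse_features(SAMPLE)
for flag in ("generic-receive-offload", "rx-gro-list", "rx-udp-gro-forwarding"):
    print(f"{flag}: {'on' if features[flag] else 'off'}")
```

For the output above this reports plain GRO as on and both rx-gro-list and rx-udp-gro-forwarding as off.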
That's the only NIC on this board:
# ip l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:30:d6:24:99:67 brd ff:ff:ff:ff:ff:ff
One possible short-term workaround is to disable GRO.
Indeed, "ethtool -K enp5s0 gro off" fixes the problem, and calling it with "gro on" brings it back.
And to answer Paolo's questions from his mail to the list (@Paolo: I'm not subscribed, please also send to me directly so I don't miss your mail):
Could you please:
- tell how frequent is the pkt corruption, even a rough estimate of the
frequency.
# journalctl --since "5min ago" | grep "Packet corrupt" | wc -l
167
So there are 167 detected failures in 5 minutes, while the system is receiving at a moderate rate of about 900 pkts/s (according to Prometheus' node exporter at least, but that seems about right).
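To turn these two numbers into a rough frequency, here is a small back-of-the-envelope sketch in Python. It assumes that every corrupted packet produces exactly one "Packet corrupt" journal line and that the receive rate stayed near 900 pkts/s for the whole window; neither has been verified, and the function name corruption_stats is made up for illustration.

```python
def corruption_stats(failures: int, window_s: int, rate_pps: float) -> tuple[float, int]:
    """Return (failures per minute, N) where roughly 1 in N received packets is flagged."""
    total_packets = rate_pps * window_s          # packets received during the window
    per_minute = failures / window_s * 60        # logged failures per minute
    one_in = round(total_packets / failures)     # average packets per logged failure
    return per_minute, one_in


# Values from the measurement above: 167 log lines in 5 minutes at ~900 pkts/s.
per_minute, one_in = corruption_stats(failures=167, window_s=5 * 60, rate_pps=900)
print(f"{per_minute:.1f} failures/min, about 1 in {one_in} packets")
# prints "33.4 failures/min, about 1 in 1617 packets"
```

Under those assumptions this works out to roughly 0.06% of received packets.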
Next I'll try to capture some broken packets and reply in a separate mail; I'll have to figure out a good way to do this first.
Thanks for your help,
-Matthias