Paolo Abeni wrote:
On Tue, 2024-04-16 at 11:21 +0200, Paolo Abeni wrote:
On Fri, 2024-04-12 at 17:55 +0200, Richard Gobert wrote:
The {inet,ipv6}_gro_receive functions perform flush checks (ttl, flags, iph->id, ...) against all packets in a loop. These flush checks are currently used in all TCP flows and in some UDP flows in GRO.
These checks need to be done only once and only against the found p skb, since they only affect flush and not same_flow.
This patch leverages the previous commit in the series, which saves the correct network header offsets for both the outer and inner network headers, to perform these checks only once, in tcp_gro_receive and udp_gro_receive_segment. As a result, NAPI_GRO_CB(p)->flush is not used at all. In addition, the flush_id checks are more declarative and are contained in inet_gro_flush, removing the need for flush_id in napi_gro_cb.
This results in less parsing code for UDP flows and in flush checks that no longer run inside the loop for TCP flows.
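For illustration, the per-match check has roughly the following shape (a simplified sketch, not the exact patch code; the signature and the fixed-ID handling in the real helper may differ):

#include <linux/ip.h>
#include <net/gro.h>

/* Simplified sketch: compare the IPv4 headers of the new skb and of the
 * single matching packet p once, at transport level, instead of
 * accumulating these checks into NAPI_GRO_CB(p)->flush for every packet
 * in the GRO bucket loop.
 */
static inline int inet_gro_flush(const struct iphdr *iph,
				 const struct iphdr *iph2,
				 struct sk_buff *p, bool is_atomic)
{
	const u16 id = ntohs(iph->id);
	const u16 id2 = ntohs(iph2->id);
	int flush;

	/* ttl, tos and the DF bit must match for the packets to be merged */
	flush = (iph->ttl ^ iph2->ttl) | (iph->tos ^ iph2->tos) |
		(__force int)((iph->frag_off ^ iph2->frag_off) & htons(IP_DF));

	/* The IP ID either stays fixed (tolerated when DF is set) or must
	 * increment by the number of segments already aggregated in p.
	 */
	if (is_atomic)
		flush |= id ^ id2;
	else
		flush |= (u16)(id ^ (id2 + NAPI_GRO_CB(p)->count));

	return flush;
}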
To make sure the results are not within the noise range, I made netfilter drop all TCP packets and measured CPU performance in GRO (in this case GRO is responsible for about 50% of the CPU utilization).
L3 flush/flush_id checks are not relevant to UDP connections where skb_gro_receive_list is called. The only code change relevant to this flow is in inet_gro_receive. The rest of the code parsing this flow stays the same.
All concurrent connections tested use the same IP source and destination addresses.
perf top while replaying 64 concurrent IP/UDP connections (UDP fwd flow):

net-next:      3.03%  [kernel]  [k] inet_gro_receive
patch applied: 2.78%  [kernel]  [k] inet_gro_receive
Why are there no figures for udp_gro_receive_segment()/gro_network_flush() here?
Also, you should be able to observe a very high amount of CPU usage by GRO even with TCP on very high-speed links, keeping the BH/GRO on one CPU and the user-space/data copy on a different one (or using rx zero copy).
To be more explicit: I think at least the above figures are required, and I still fear the real gain in that case would range from zero to negative.
I wrote about this briefly in the commit message; sorry if I wasn't clear enough.
gro_network_flush is compiled inline into both udp_gro_receive_segment and tcp_gro_receive, and udp_gro_receive_segment is itself compiled inline into udp_gro_receive.
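For illustration, the shape is something like this (a sketch only; the helper sits in a header and the exact signatures in the patch differ):

#include <linux/ip.h>
#include <linux/ipv6.h>
#include <linux/skbuff.h>

/* Illustrative prototypes for the per-family checks; bodies not shown. */
int inet_gro_flush(const struct iphdr *iph, const struct iphdr *iph2,
		   struct sk_buff *p, bool is_atomic);
int ipv6_gro_flush(const struct ipv6hdr *iph, const struct ipv6hdr *iph2,
		   struct sk_buff *p);

/* Because the helper is static inline in a header, each caller
 * (tcp_gro_receive, udp_gro_receive_segment) gets its own inlined copy,
 * so there is no separate gro_network_flush symbol for perf to report:
 * its cost shows up inside the callers.
 */
static inline int gro_network_flush(const void *nh, const void *nh2,
				    struct sk_buff *p, bool is_atomic)
{
	const struct iphdr *iph = nh;

	if (iph->version == 4)
		return inet_gro_flush(nh, nh2, p, is_atomic);

	return ipv6_gro_flush(nh, nh2, p);
}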
The UDP numbers I posted are no longer relevant after Willem and Alexander's thread, from which we understood that flush and flush_id should be calculated for all UDP flows.
I can post new numbers for the UDP forwarding path after implementing the correct change. As for TCP, the numbers I posted stay the same.
Note that there is an increase in CPU utilization in tcp_gro_receive because of the inline compilation of gro_network_flush. The numbers make sense and show a performance improvement in the case I described, once both inet_gro_receive and tcp_gro_receive are accounted for.
If you can't do the TCP part of the testing, please provide at least the figures for a single UDP flow; that should give a better indication of the result we can expect with TCP.
Note that GRO is used mainly by TCP, and TCP packets with different src/dst ports will land in different GRO hash buckets because they have different RX hashes.
That will happen even for UDP, since at least some (most?) NICs include the UDP ports in the RX hash.
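For reference, the bucket selection in dev_gro_receive() is based on the RX hash, roughly as in this illustrative helper (not an actual kernel function):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Roughly how dev_gro_receive() picks the per-NAPI GRO list: the bucket is
 * derived from the RX hash, so flows differing in src/dst port normally
 * land in different buckets and each bucket's packet list stays short.
 */
static struct gro_list *gro_pick_bucket(struct napi_struct *napi,
					struct sk_buff *skb)
{
	u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);

	return &napi->gro_hash[bucket];
}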
Thanks,
Paolo