The cb fields network_offset and inner_network_offset are used instead of skb->network_header throughout GRO.
These fields are then leveraged in the next commit to remove flush_id state from napi_gro_cb, and stateful code in {ipv6,inet}_gro_receive which may be unnecessarily complicated due to encapsulation support in GRO. These fields are checked in L4 instead.
3rd patch adds tests for different flush_id flows in GRO.
v9 -> v10: - rebased on latest net-next - improved code readability in inet_gro_flush - v9: https://lore.kernel.org/all/20240507162349.130277-1-richardbgobert@gmail.com...
v8 -> v9: - rename skb_gro_network_offset to skb_gro_receive_network_offset for clarification - improved code readability - v8: https://lore.kernel.org/all/20240412155533.115507-1-richardbgobert@gmail.com...
v7 -> v8: - Remove network_header use in gro - Re-send commits after the dependent patch to net was applied - v7: https://lore.kernel.org/all/20240412155533.115507-1-richardbgobert@gmail.com...
v6 -> v7: - Moved bug fixes to a separate submission in net - Added UDP fwd benchmark - v6: https://lore.kernel.org/all/20240410153423.107381-1-richardbgobert@gmail.com...
v5 -> v6: - Write inner_network_offset in vxlan and geneve - Ignore is_atomic when DF=0 - v5: https://lore.kernel.org/all/20240408141720.98832-1-richardbgobert@gmail.com/
v4 -> v5: - Add 1st commit - flush id checks in udp_gro_receive segment which can be backported by itself - Add TCP measurements for the 5th commit - Add flush id tests to ensure flush id logic is preserved in GRO - Simplify gro_inet_flush by removing a branch - v4: https://lore.kernel.org/all/202420325182543.87683-1-richardbgobert@gmail.com...
v3 -> v4: - Fix code comment and commit message typos - v3: https://lore.kernel.org/all/f939c84a-2322-4393-a5b0-9b1e0be8ed8e@gmail.com/
v2 -> v3: - Use napi_gro_cb instead of skb->{offset} - v2: https://lore.kernel.org/all/2ce1600b-e733-448b-91ac-9d0ae2b866a4@gmail.com/
v1 -> v2: - Pass p_off in *_gro_complete to fix UDP bug - Remove more conditionals and memory fetches from inet_gro_flush - v1: https://lore.kernel.org/netdev/e1d22505-c5f8-4c02-a997-64248480338b@gmail.co...
Richard Gobert (3): net: gro: use cb instead of skb->network_header net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment selftests/net: add flush id selftests
include/net/gro.h | 85 +++++++++++++++--- net/core/gro.c | 3 - net/ipv4/af_inet.c | 45 +--------- net/ipv4/tcp_offload.c | 20 ++--- net/ipv4/udp_offload.c | 9 +- net/ipv6/ip6_offload.c | 16 +--- net/ipv6/tcpv6_offload.c | 3 +- tools/testing/selftests/net/gro.c | 138 ++++++++++++++++++++++++++++++ 8 files changed, 224 insertions(+), 95 deletions(-)
This patch converts references of skb->network_header to napi_gro_cb's network_offset and inner_network_offset.
Signed-off-by: Richard Gobert richardbgobert@gmail.com --- include/net/gro.h | 9 +++++++-- net/ipv4/af_inet.c | 4 ---- net/ipv4/tcp_offload.c | 3 ++- net/ipv6/ip6_offload.c | 5 ++--- net/ipv6/tcpv6_offload.c | 3 ++- 5 files changed, 13 insertions(+), 11 deletions(-)
diff --git a/include/net/gro.h b/include/net/gro.h index 5df8bf318197..cbc1b0aaf295 100644 --- a/include/net/gro.h +++ b/include/net/gro.h @@ -181,12 +181,17 @@ static inline void *skb_gro_header(struct sk_buff *skb, unsigned int hlen, return ptr; }
+static inline int skb_gro_receive_network_offset(const struct sk_buff *skb) +{ + return NAPI_GRO_CB(skb)->network_offsets[NAPI_GRO_CB(skb)->encap_mark]; +} + static inline void *skb_gro_network_header(const struct sk_buff *skb) { if (skb_gro_may_pull(skb, skb_gro_offset(skb))) - return skb_gro_header_fast(skb, skb_network_offset(skb)); + return skb_gro_header_fast(skb, skb_gro_receive_network_offset(skb));
- return skb_network_header(skb); + return skb->data + skb_gro_receive_network_offset(skb); }
static inline __wsum inet_gro_compute_pseudo(const struct sk_buff *skb, diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index a7bad18bc8b5..428196e1541f 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1569,10 +1569,6 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
NAPI_GRO_CB(skb)->is_atomic = !!(iph->frag_off & htons(IP_DF)); NAPI_GRO_CB(skb)->flush |= flush; - skb_set_network_header(skb, off); - /* The above will be needed by the transport layer if there is one - * immediately following this IP hdr. - */ NAPI_GRO_CB(skb)->inner_network_offset = off;
/* Note : No need to call skb_gro_postpull_rcsum() here, diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index c90704befd7b..2809667ac924 100644 --- a/net/ipv4/tcp_offload.c +++ b/net/ipv4/tcp_offload.c @@ -463,7 +463,8 @@ struct sk_buff *tcp4_gro_receive(struct list_head *head, struct sk_buff *skb)
INDIRECT_CALLABLE_SCOPE int tcp4_gro_complete(struct sk_buff *skb, int thoff) { - const struct iphdr *iph = ip_hdr(skb); + const u16 offset = NAPI_GRO_CB(skb)->network_offsets[skb->encapsulation]; + const struct iphdr *iph = (struct iphdr *)(skb->data + offset); struct tcphdr *th = tcp_hdr(skb);
if (unlikely(NAPI_GRO_CB(skb)->is_flist)) { diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c index c8b909a9904f..288c7c6ea50f 100644 --- a/net/ipv6/ip6_offload.c +++ b/net/ipv6/ip6_offload.c @@ -67,7 +67,7 @@ static int ipv6_gro_pull_exthdrs(struct sk_buff *skb, int off, int proto) off += len; }
- skb_gro_pull(skb, off - skb_network_offset(skb)); + skb_gro_pull(skb, off - skb_gro_receive_network_offset(skb)); return proto; }
@@ -236,7 +236,6 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head, if (unlikely(!iph)) goto out;
- skb_set_network_header(skb, off); NAPI_GRO_CB(skb)->inner_network_offset = off;
flush += ntohs(iph->payload_len) != skb->len - hlen; @@ -260,7 +259,7 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head, NAPI_GRO_CB(skb)->proto = proto;
flush--; - nlen = skb_network_header_len(skb); + nlen = skb_gro_offset(skb) - off;
list_for_each_entry(p, head, list) { const struct ipv6hdr *iph2; diff --git a/net/ipv6/tcpv6_offload.c b/net/ipv6/tcpv6_offload.c index d59f58cbd306..23971903e66d 100644 --- a/net/ipv6/tcpv6_offload.c +++ b/net/ipv6/tcpv6_offload.c @@ -72,7 +72,8 @@ struct sk_buff *tcp6_gro_receive(struct list_head *head, struct sk_buff *skb)
INDIRECT_CALLABLE_SCOPE int tcp6_gro_complete(struct sk_buff *skb, int thoff) { - const struct ipv6hdr *iph = ipv6_hdr(skb); + const u16 offset = NAPI_GRO_CB(skb)->network_offsets[skb->encapsulation]; + const struct ipv6hdr *iph = (struct ipv6hdr *)(skb->data + offset); struct tcphdr *th = tcp_hdr(skb);
if (unlikely(NAPI_GRO_CB(skb)->is_flist)) {
Richard Gobert wrote:
This patch converts references of skb->network_header to napi_gro_cb's network_offset and inner_network_offset.
Signed-off-by: Richard Gobert richardbgobert@gmail.com
Reviewed-by: Willem de Bruijn willemb@google.com
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags, iph->id, ...) against all packets in a loop. These flush checks are used in all merging UDP and TCP flows.
These checks need to be done only once and only against the found p skb, since they only affect flush and not same_flow.
This patch leverages correct network header offsets from the cb for both outer and inner network headers - allowing these checks to be done only once, in tcp_gro_receive and udp_gro_receive_segment. As a result, NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are more declarative and contained in inet_gro_flush, thus removing the need for flush_id in napi_gro_cb.
This results in less parsing code for non-loop flush tests for TCP and UDP flows.
To make sure results are not within noise range - I've made netfilter drop all TCP packets, and measured CPU performance in GRO (in this case GRO is responsible for about 50% of the CPU utilization).
perf top while replaying 64 parallel IP/TCP streams merging in GRO: (gro_receive_network_flush is compiled inline to tcp_gro_receive) net-next: 6.94% [kernel] [k] inet_gro_receive 3.02% [kernel] [k] tcp_gro_receive
patch applied: 4.27% [kernel] [k] tcp_gro_receive 4.22% [kernel] [k] inet_gro_receive
perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same results for any encapsulation, in this case inet_gro_receive is top offender in net-next) net-next: 10.09% [kernel] [k] inet_gro_receive 2.08% [kernel] [k] tcp_gro_receive
patch applied: 6.97% [kernel] [k] inet_gro_receive 3.68% [kernel] [k] tcp_gro_receive
Signed-off-by: Richard Gobert richardbgobert@gmail.com --- include/net/gro.h | 76 +++++++++++++++++++++++++++++++++++++----- net/core/gro.c | 3 -- net/ipv4/af_inet.c | 41 +---------------------- net/ipv4/tcp_offload.c | 17 ++-------- net/ipv4/udp_offload.c | 9 +---- net/ipv6/ip6_offload.c | 11 ------ 6 files changed, 73 insertions(+), 84 deletions(-)
diff --git a/include/net/gro.h b/include/net/gro.h index cbc1b0aaf295..f13634b1f4c1 100644 --- a/include/net/gro.h +++ b/include/net/gro.h @@ -36,15 +36,15 @@ struct napi_gro_cb { /* This is non-zero if the packet cannot be merged with the new skb. */ u16 flush;
- /* Save the IP ID here and check when we get to the transport layer */ - u16 flush_id; - /* Number of segments aggregated. */ u16 count;
/* Used in ipv6_gro_receive() and foo-over-udp and esp-in-udp */ u16 proto;
+ /* used to support CHECKSUM_COMPLETE for tunneling protocols */ + __wsum csum; + /* Used in napi_gro_cb::free */ #define NAPI_GRO_FREE 1 #define NAPI_GRO_FREE_STOLEN_HEAD 2 @@ -75,8 +75,8 @@ struct napi_gro_cb { /* Used in GRE, set in fou/gue_gro_receive */ u8 is_fou:1;
- /* Used to determine if flush_id can be ignored */ - u8 is_atomic:1; + /* Used to determine if ipid_offset can be ignored */ + u8 ip_fixedid:1;
/* Number of gro_receive callbacks this packet already went through */ u8 recursion_counter:4; @@ -85,9 +85,6 @@ struct napi_gro_cb { u8 is_flist:1; );
- /* used to support CHECKSUM_COMPLETE for tunneling protocols */ - __wsum csum; - /* L3 offsets */ union { struct { @@ -442,6 +439,69 @@ static inline __wsum ip6_gro_compute_pseudo(const struct sk_buff *skb, skb_gro_len(skb), proto, 0)); }
+static inline int inet_gro_flush(const struct iphdr *iph, const struct iphdr *iph2, + struct sk_buff *p, bool outer) +{ + const u32 id = ntohl(*(__be32 *)&iph->id); + const u32 id2 = ntohl(*(__be32 *)&iph2->id); + const u16 ipid_offset = (id >> 16) - (id2 >> 16); + const u16 count = NAPI_GRO_CB(p)->count; + const u32 df = id & IP_DF; + int flush; + + /* All fields must match except length and checksum. */ + flush = (iph->ttl ^ iph2->ttl) | (iph->tos ^ iph2->tos) | (df ^ (id2 & IP_DF)); + + if (flush | (outer && df)) + return flush; + + /* When we receive our second frame we can make a decision on if we + * continue this flow as an atomic flow with a fixed ID or if we use + * an incrementing ID. + */ + if (count == 1 && df && !ipid_offset) + NAPI_GRO_CB(p)->ip_fixedid = true; + + return ipid_offset ^ (count * !NAPI_GRO_CB(p)->ip_fixedid); +} + +static inline int ipv6_gro_flush(const struct ipv6hdr *iph, const struct ipv6hdr *iph2) +{ + /* Version:4<Traffic_Class:8><Flow_Label:20> */ + __be32 first_word = *(__be32 *)iph ^ *(__be32 *)iph2; + + /* Flush if Traffic Class fields are different. */ + return !!((first_word & htonl(0x0FF00000)) | + (__force __be32)(iph->hop_limit ^ iph2->hop_limit)); +} + +static inline int __gro_receive_network_flush(const void *th, const void *th2, + struct sk_buff *p, const u16 diff, + bool outer) +{ + const void *nh = th - diff; + const void *nh2 = th2 - diff; + + if (((struct iphdr *)nh)->version == 6) + return ipv6_gro_flush(nh, nh2); + else + return inet_gro_flush(nh, nh2, p, outer); +} + +static inline int gro_receive_network_flush(const void *th, const void *th2, + struct sk_buff *p) +{ + const bool encap_mark = NAPI_GRO_CB(p)->encap_mark; + int off = skb_transport_offset(p); + int flush; + + flush = __gro_receive_network_flush(th, th2, p, off - NAPI_GRO_CB(p)->network_offset, encap_mark); + if (encap_mark) + flush |= __gro_receive_network_flush(th, th2, p, off - NAPI_GRO_CB(p)->inner_network_offset, false); + + return flush; +} + int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb); int skb_gro_receive_list(struct sk_buff *p, struct sk_buff *skb);
diff --git a/net/core/gro.c b/net/core/gro.c index e2f84947fb74..b3b43de1a650 100644 --- a/net/core/gro.c +++ b/net/core/gro.c @@ -358,8 +358,6 @@ static void gro_list_prepare(const struct list_head *head, list_for_each_entry(p, head, list) { unsigned long diffs;
- NAPI_GRO_CB(p)->flush = 0; - if (hash != skb_get_hash_raw(p)) { NAPI_GRO_CB(p)->same_flow = 0; continue; @@ -499,7 +497,6 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff sizeof(u32))); /* Avoid slow unaligned acc */ *(u32 *)&NAPI_GRO_CB(skb)->zeroed = 0; NAPI_GRO_CB(skb)->flush = skb_has_frag_list(skb); - NAPI_GRO_CB(skb)->is_atomic = 1; NAPI_GRO_CB(skb)->count = 1; if (unlikely(skb_is_gso(skb))) { NAPI_GRO_CB(skb)->count = skb_shinfo(skb)->gso_segs; diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 428196e1541f..44564d009e95 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1482,7 +1482,6 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb) struct sk_buff *p; unsigned int hlen; unsigned int off; - unsigned int id; int flush = 1; int proto;
@@ -1508,13 +1507,10 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb) goto out;
NAPI_GRO_CB(skb)->proto = proto; - id = ntohl(*(__be32 *)&iph->id); - flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (id & ~IP_DF)); - id >>= 16; + flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (ntohl(*(__be32 *)&iph->id) & ~IP_DF));
list_for_each_entry(p, head, list) { struct iphdr *iph2; - u16 flush_id;
if (!NAPI_GRO_CB(p)->same_flow) continue; @@ -1531,43 +1527,8 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb) NAPI_GRO_CB(p)->same_flow = 0; continue; } - - /* All fields must match except length and checksum. */ - NAPI_GRO_CB(p)->flush |= - (iph->ttl ^ iph2->ttl) | - (iph->tos ^ iph2->tos) | - ((iph->frag_off ^ iph2->frag_off) & htons(IP_DF)); - - NAPI_GRO_CB(p)->flush |= flush; - - /* We need to store of the IP ID check to be included later - * when we can verify that this packet does in fact belong - * to a given flow. - */ - flush_id = (u16)(id - ntohs(iph2->id)); - - /* This bit of code makes it much easier for us to identify - * the cases where we are doing atomic vs non-atomic IP ID - * checks. Specifically an atomic check can return IP ID - * values 0 - 0xFFFF, while a non-atomic check can only - * return 0 or 0xFFFF. - */ - if (!NAPI_GRO_CB(p)->is_atomic || - !(iph->frag_off & htons(IP_DF))) { - flush_id ^= NAPI_GRO_CB(p)->count; - flush_id = flush_id ? 0xFFFF : 0; - } - - /* If the previous IP ID value was based on an atomic - * datagram we can overwrite the value and ignore it. - */ - if (NAPI_GRO_CB(skb)->is_atomic) - NAPI_GRO_CB(p)->flush_id = flush_id; - else - NAPI_GRO_CB(p)->flush_id |= flush_id; }
- NAPI_GRO_CB(skb)->is_atomic = !!(iph->frag_off & htons(IP_DF)); NAPI_GRO_CB(skb)->flush |= flush; NAPI_GRO_CB(skb)->inner_network_offset = off;
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index 2809667ac924..4b791e74529e 100644 --- a/net/ipv4/tcp_offload.c +++ b/net/ipv4/tcp_offload.c @@ -313,10 +313,8 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb, if (!p) goto out_check_final;
- /* Include the IP ID check below from the inner most IP hdr */ th2 = tcp_hdr(p); - flush = NAPI_GRO_CB(p)->flush; - flush |= (__force int)(flags & TCP_FLAG_CWR); + flush = (__force int)(flags & TCP_FLAG_CWR); flush |= (__force int)((flags ^ tcp_flag_word(th2)) & ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH)); flush |= (__force int)(th->ack_seq ^ th2->ack_seq); @@ -324,16 +322,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb, flush |= *(u32 *)((u8 *)th + i) ^ *(u32 *)((u8 *)th2 + i);
- /* When we receive our second frame we can made a decision on if we - * continue this flow as an atomic flow with a fixed ID or if we use - * an incrementing ID. - */ - if (NAPI_GRO_CB(p)->flush_id != 1 || - NAPI_GRO_CB(p)->count != 1 || - !NAPI_GRO_CB(p)->is_atomic) - flush |= NAPI_GRO_CB(p)->flush_id; - else - NAPI_GRO_CB(p)->is_atomic = false; + flush |= gro_receive_network_flush(th, th2, p);
mss = skb_shinfo(p)->gso_size;
@@ -480,7 +469,7 @@ INDIRECT_CALLABLE_SCOPE int tcp4_gro_complete(struct sk_buff *skb, int thoff) iph->daddr, 0);
skb_shinfo(skb)->gso_type |= SKB_GSO_TCPV4 | - (NAPI_GRO_CB(skb)->is_atomic * SKB_GSO_TCP_FIXEDID); + (NAPI_GRO_CB(skb)->ip_fixedid * SKB_GSO_TCP_FIXEDID);
tcp_gro_complete(skb); return 0; diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c index f6d3c442e260..59448a2dbf2c 100644 --- a/net/ipv4/udp_offload.c +++ b/net/ipv4/udp_offload.c @@ -478,14 +478,7 @@ static struct sk_buff *udp_gro_receive_segment(struct list_head *head, return p; }
- flush = NAPI_GRO_CB(p)->flush; - - if (NAPI_GRO_CB(p)->flush_id != 1 || - NAPI_GRO_CB(p)->count != 1 || - !NAPI_GRO_CB(p)->is_atomic) - flush |= NAPI_GRO_CB(p)->flush_id; - else - NAPI_GRO_CB(p)->is_atomic = false; + flush = gro_receive_network_flush(uh, uh2, p);
/* Terminate the flow on len mismatch or if it grow "too much". * Under small packet flood GRO count could elsewhere grow a lot diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c index 288c7c6ea50f..bd5aff97d8b1 100644 --- a/net/ipv6/ip6_offload.c +++ b/net/ipv6/ip6_offload.c @@ -290,19 +290,8 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head, nlen - sizeof(struct ipv6hdr))) goto not_same_flow; } - /* flush if Traffic Class fields are different */ - NAPI_GRO_CB(p)->flush |= !!((first_word & htonl(0x0FF00000)) | - (__force __be32)(iph->hop_limit ^ iph2->hop_limit)); - NAPI_GRO_CB(p)->flush |= flush; - - /* If the previous IP ID value was based on an atomic - * datagram we can overwrite the value and ignore it. - */ - if (NAPI_GRO_CB(skb)->is_atomic) - NAPI_GRO_CB(p)->flush_id = 0; }
- NAPI_GRO_CB(skb)->is_atomic = true; NAPI_GRO_CB(skb)->flush |= flush;
skb_gro_postpull_rcsum(skb, iph, nlen);
Richard Gobert wrote:
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags, iph->id, ...) against all packets in a loop. These flush checks are used in all merging UDP and TCP flows.
These checks need to be done only once and only against the found p skb, since they only affect flush and not same_flow.
This patch leverages correct network header offsets from the cb for both outer and inner network headers - allowing these checks to be done only once, in tcp_gro_receive and udp_gro_receive_segment. As a result, NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are more declarative and contained in inet_gro_flush, thus removing the need for flush_id in napi_gro_cb.
This results in less parsing code for non-loop flush tests for TCP and UDP flows.
To make sure results are not within noise range - I've made netfilter drop all TCP packets, and measured CPU performance in GRO (in this case GRO is responsible for about 50% of the CPU utilization).
perf top while replaying 64 parallel IP/TCP streams merging in GRO: (gro_receive_network_flush is compiled inline to tcp_gro_receive) net-next: 6.94% [kernel] [k] inet_gro_receive 3.02% [kernel] [k] tcp_gro_receive
patch applied: 4.27% [kernel] [k] tcp_gro_receive 4.22% [kernel] [k] inet_gro_receive
perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same results for any encapsulation, in this case inet_gro_receive is top offender in net-next) net-next: 10.09% [kernel] [k] inet_gro_receive 2.08% [kernel] [k] tcp_gro_receive
patch applied: 6.97% [kernel] [k] inet_gro_receive 3.68% [kernel] [k] tcp_gro_receive
Signed-off-by: Richard Gobert richardbgobert@gmail.com
Reviewed-by: Willem de Bruijn willemb@google.com
Hi Richard,
On Thu, May 9, 2024 at 9:09 PM Richard Gobert richardbgobert@gmail.com wrote:
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags, iph->id, ...) against all packets in a loop. These flush checks are used in all merging UDP and TCP flows.
These checks need to be done only once and only against the found p skb, since they only affect flush and not same_flow.
This patch leverages correct network header offsets from the cb for both outer and inner network headers - allowing these checks to be done only once, in tcp_gro_receive and udp_gro_receive_segment. As a result, NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are more declarative and contained in inet_gro_flush, thus removing the need for flush_id in napi_gro_cb.
This results in less parsing code for non-loop flush tests for TCP and UDP flows.
To make sure results are not within noise range - I've made netfilter drop all TCP packets, and measured CPU performance in GRO (in this case GRO is responsible for about 50% of the CPU utilization).
perf top while replaying 64 parallel IP/TCP streams merging in GRO: (gro_receive_network_flush is compiled inline to tcp_gro_receive) net-next: 6.94% [kernel] [k] inet_gro_receive 3.02% [kernel] [k] tcp_gro_receive
patch applied: 4.27% [kernel] [k] tcp_gro_receive 4.22% [kernel] [k] inet_gro_receive
perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same results for any encapsulation, in this case inet_gro_receive is top offender in net-next) net-next: 10.09% [kernel] [k] inet_gro_receive 2.08% [kernel] [k] tcp_gro_receive
patch applied: 6.97% [kernel] [k] inet_gro_receive 3.68% [kernel] [k] tcp_gro_receive
Signed-off-by: Richard Gobert richardbgobert@gmail.com
Thanks for your patch, which is now commit 4b0ebbca3e167976 ("net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment") in net-next/main (next-20240514).
noreply@ellerman.id.au reports build failures on m68k, e.g. http://kisskb.ellerman.id.au/kisskb/buildresult/15168903/
net/core/gro.c: In function ‘dev_gro_receive’: ././include/linux/compiler_types.h:460:38: error: call to ‘__compiletime_assert_654’ declared with attribute error: BUILD_BUG_ON failed: !IS_ALIGNED(offsetof(struct napi_gro_cb, zeroed), sizeof(u32))
--- a/include/net/gro.h +++ b/include/net/gro.h @@ -36,15 +36,15 @@ struct napi_gro_cb { /* This is non-zero if the packet cannot be merged with the new skb. */ u16 flush;
/* Save the IP ID here and check when we get to the transport layer */
u16 flush_id;
/* Number of segments aggregated. */ u16 count; /* Used in ipv6_gro_receive() and foo-over-udp and esp-in-udp */ u16 proto;
On most architectures, there is now a hole of 2 bytes here. However, on m68k the minimum alignment of __wsum (__u32) below is 2, hence there is no hole, breaking the assertion.
Probably you just want to make this explicit, by adding
u16 pad;
here.
/* used to support CHECKSUM_COMPLETE for tunneling protocols */
__wsum csum;
Gr{oetje,eeting}s,
Geert
-- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
On Tue, 14 May 2024 14:13:21 +0200 Geert Uytterhoeven wrote:
On Thu, May 9, 2024 at 9:09 PM Richard Gobert richardbgobert@gmail.com wrote:
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags, iph->id, ...) against all packets in a loop. These flush checks are used in all merging UDP and TCP flows.
These checks need to be done only once and only against the found p skb, since they only affect flush and not same_flow.
This patch leverages correct network header offsets from the cb for both outer and inner network headers - allowing these checks to be done only once, in tcp_gro_receive and udp_gro_receive_segment. As a result, NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are more declarative and contained in inet_gro_flush, thus removing the need for flush_id in napi_gro_cb.
This results in less parsing code for non-loop flush tests for TCP and UDP flows.
To make sure results are not within noise range - I've made netfilter drop all TCP packets, and measured CPU performance in GRO (in this case GRO is responsible for about 50% of the CPU utilization).
perf top while replaying 64 parallel IP/TCP streams merging in GRO: (gro_receive_network_flush is compiled inline to tcp_gro_receive) net-next: 6.94% [kernel] [k] inet_gro_receive 3.02% [kernel] [k] tcp_gro_receive
patch applied: 4.27% [kernel] [k] tcp_gro_receive 4.22% [kernel] [k] inet_gro_receive
perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same results for any encapsulation, in this case inet_gro_receive is top offender in net-next) net-next: 10.09% [kernel] [k] inet_gro_receive 2.08% [kernel] [k] tcp_gro_receive
patch applied: 6.97% [kernel] [k] inet_gro_receive 3.68% [kernel] [k] tcp_gro_receive
Signed-off-by: Richard Gobert richardbgobert@gmail.com
Thanks for your patch, which is now commit 4b0ebbca3e167976 ("net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment") in net-next/main (next-20240514).
noreply@ellerman.id.au reports build failures on m68k, e.g. http://kisskb.ellerman.id.au/kisskb/buildresult/15168903/
net/core/gro.c: In function ‘dev_gro_receive’: ././include/linux/compiler_types.h:460:38: error: call to
‘__compiletime_assert_654’ declared with attribute error: BUILD_BUG_ON failed: !IS_ALIGNED(offsetof(struct napi_gro_cb, zeroed), sizeof(u32))
Hi Richard, any chance of getting this fixed within the next 2 hours? I can't send the net-next PR if it doesn't build on one of the arches..
Jakub Kicinski wrote:
On Tue, 14 May 2024 14:13:21 +0200 Geert Uytterhoeven wrote:
On Thu, May 9, 2024 at 9:09 PM Richard Gobert richardbgobert@gmail.com wrote:
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags, iph->id, ...) against all packets in a loop. These flush checks are used in all merging UDP and TCP flows.
These checks need to be done only once and only against the found p skb, since they only affect flush and not same_flow.
This patch leverages correct network header offsets from the cb for both outer and inner network headers - allowing these checks to be done only once, in tcp_gro_receive and udp_gro_receive_segment. As a result, NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are more declarative and contained in inet_gro_flush, thus removing the need for flush_id in napi_gro_cb.
This results in less parsing code for non-loop flush tests for TCP and UDP flows.
To make sure results are not within noise range - I've made netfilter drop all TCP packets, and measured CPU performance in GRO (in this case GRO is responsible for about 50% of the CPU utilization).
perf top while replaying 64 parallel IP/TCP streams merging in GRO: (gro_receive_network_flush is compiled inline to tcp_gro_receive) net-next: 6.94% [kernel] [k] inet_gro_receive 3.02% [kernel] [k] tcp_gro_receive
patch applied: 4.27% [kernel] [k] tcp_gro_receive 4.22% [kernel] [k] inet_gro_receive
perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same results for any encapsulation, in this case inet_gro_receive is top offender in net-next) net-next: 10.09% [kernel] [k] inet_gro_receive 2.08% [kernel] [k] tcp_gro_receive
patch applied: 6.97% [kernel] [k] inet_gro_receive 3.68% [kernel] [k] tcp_gro_receive
Signed-off-by: Richard Gobert richardbgobert@gmail.com
Thanks for your patch, which is now commit 4b0ebbca3e167976 ("net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment") in net-next/main (next-20240514).
noreply@ellerman.id.au reports build failures on m68k, e.g. http://kisskb.ellerman.id.au/kisskb/buildresult/15168903/
net/core/gro.c: In function ‘dev_gro_receive’: ././include/linux/compiler_types.h:460:38: error: call to
‘__compiletime_assert_654’ declared with attribute error: BUILD_BUG_ON failed: !IS_ALIGNED(offsetof(struct napi_gro_cb, zeroed), sizeof(u32))
Hi Richard, any chance of getting this fixed within the next 2 hours? I can't send the net-next PR if it doesn't build on one of the arches..
Hi Jakub and Geert, I'm only seeing this mail now, sorry for the late response. I can fix this within the next two hours, would you prefer a standalone patch or should I add it to this patch series?
On Tue, 14 May 2024 17:56:44 +0200 Richard Gobert wrote:
Hi Richard, any chance of getting this fixed within the next 2 hours? I can't send the net-next PR if it doesn't build on one of the arches..
Hi Jakub and Geert, I'm only seeing this mail now, sorry for the late response. I can fix this within the next two hours, would you prefer a standalone patch or should I add it to this patch series?
Standalone patch, please!
Add 2 byte padding to napi_gro_cb struct to ensure zeroed member is aligned after flush_id member was removed in the original commit.
Fixes: 4b0ebbca3e16 ("net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment") Suggested-by: Geert Uytterhoeven geert@linux-m68k.org Signed-off-by: Richard Gobert richardbgobert@gmail.com --- include/net/gro.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/net/gro.h b/include/net/gro.h index f13634b1f4c1..b9b58c1f8d19 100644 --- a/include/net/gro.h +++ b/include/net/gro.h @@ -42,8 +42,7 @@ struct napi_gro_cb { /* Used in ipv6_gro_receive() and foo-over-udp and esp-in-udp */ u16 proto;
- /* used to support CHECKSUM_COMPLETE for tunneling protocols */ - __wsum csum; + u16 pad;
/* Used in napi_gro_cb::free */ #define NAPI_GRO_FREE 1 @@ -85,6 +84,9 @@ struct napi_gro_cb { u8 is_flist:1; );
+ /* used to support CHECKSUM_COMPLETE for tunneling protocols */ + __wsum csum; + /* L3 offsets */ union { struct {
Hello:
This patch was applied to netdev/net-next.git (main) by Jakub Kicinski kuba@kernel.org:
On Tue, 14 May 2024 19:06:15 +0200 you wrote:
Add 2 byte padding to napi_gro_cb struct to ensure zeroed member is aligned after flush_id member was removed in the original commit.
Fixes: 4b0ebbca3e16 ("net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment") Suggested-by: Geert Uytterhoeven geert@linux-m68k.org Signed-off-by: Richard Gobert richardbgobert@gmail.com
[...]
Here is the summary with links: - net: gro: fix napi_gro_cb zeroed alignment https://git.kernel.org/netdev/net-next/c/386f0cffae46
You are awesome, thank you!
Added flush id selftests to test different cases where DF flag is set or unset and id value changes in the following packets. All cases where the packets should coalesce or should not coalesce are tested.
Signed-off-by: Richard Gobert richardbgobert@gmail.com --- tools/testing/selftests/net/gro.c | 138 ++++++++++++++++++++++++++++++ 1 file changed, 138 insertions(+)
diff --git a/tools/testing/selftests/net/gro.c b/tools/testing/selftests/net/gro.c index 6038b96ecee8..b2184847e388 100644 --- a/tools/testing/selftests/net/gro.c +++ b/tools/testing/selftests/net/gro.c @@ -93,6 +93,7 @@ static bool tx_socket = true; static int tcp_offset = -1; static int total_hdr_len = -1; static int ethhdr_proto = -1; +static const int num_flush_id_cases = 6;
static void vlog(const char *fmt, ...) { @@ -620,6 +621,113 @@ static void add_ipv6_exthdr(void *buf, void *optpkt, __u8 exthdr_type, char *ext iph->payload_len = htons(ntohs(iph->payload_len) + MIN_EXTHDR_SIZE); }
+static void fix_ip4_checksum(struct iphdr *iph) +{ + iph->check = 0; + iph->check = checksum_fold(iph, sizeof(struct iphdr), 0); +} + +static void send_flush_id_case(int fd, struct sockaddr_ll *daddr, int tcase) +{ + static char buf1[MAX_HDR_LEN + PAYLOAD_LEN]; + static char buf2[MAX_HDR_LEN + PAYLOAD_LEN]; + static char buf3[MAX_HDR_LEN + PAYLOAD_LEN]; + bool send_three = false; + struct iphdr *iph1; + struct iphdr *iph2; + struct iphdr *iph3; + + iph1 = (struct iphdr *)(buf1 + ETH_HLEN); + iph2 = (struct iphdr *)(buf2 + ETH_HLEN); + iph3 = (struct iphdr *)(buf3 + ETH_HLEN); + + create_packet(buf1, 0, 0, PAYLOAD_LEN, 0); + create_packet(buf2, PAYLOAD_LEN, 0, PAYLOAD_LEN, 0); + create_packet(buf3, PAYLOAD_LEN * 2, 0, PAYLOAD_LEN, 0); + + switch (tcase) { + case 0: /* DF=1, Incrementing - should coalesce */ + iph1->frag_off |= htons(IP_DF); + iph1->id = htons(8); + + iph2->frag_off |= htons(IP_DF); + iph2->id = htons(9); + break; + + case 1: /* DF=1, Fixed - should coalesce */ + iph1->frag_off |= htons(IP_DF); + iph1->id = htons(8); + + iph2->frag_off |= htons(IP_DF); + iph2->id = htons(8); + break; + + case 2: /* DF=0, Incrementing - should coalesce */ + iph1->frag_off &= ~htons(IP_DF); + iph1->id = htons(8); + + iph2->frag_off &= ~htons(IP_DF); + iph2->id = htons(9); + break; + + case 3: /* DF=0, Fixed - should not coalesce */ + iph1->frag_off &= ~htons(IP_DF); + iph1->id = htons(8); + + iph2->frag_off &= ~htons(IP_DF); + iph2->id = htons(8); + break; + + case 4: /* DF=1, two packets incrementing, and one fixed - should + * coalesce only the first two packets + */ + iph1->frag_off |= htons(IP_DF); + iph1->id = htons(8); + + iph2->frag_off |= htons(IP_DF); + iph2->id = htons(9); + + iph3->frag_off |= htons(IP_DF); + iph3->id = htons(9); + send_three = true; + break; + + case 5: /* DF=1, two packets fixed, and one incrementing - should + * coalesce only the first two packets + */ + iph1->frag_off |= htons(IP_DF); + iph1->id = htons(8); + + iph2->frag_off |= htons(IP_DF); + iph2->id = htons(8); + + iph3->frag_off |= htons(IP_DF); + iph3->id = htons(9); + send_three = true; + break; + } + + fix_ip4_checksum(iph1); + fix_ip4_checksum(iph2); + write_packet(fd, buf1, total_hdr_len + PAYLOAD_LEN, daddr); + write_packet(fd, buf2, total_hdr_len + PAYLOAD_LEN, daddr); + + if (send_three) { + fix_ip4_checksum(iph3); + write_packet(fd, buf3, total_hdr_len + PAYLOAD_LEN, daddr); + } +} + +static void test_flush_id(int fd, struct sockaddr_ll *daddr, char *fin_pkt) +{ + for (int i = 0; i < num_flush_id_cases; i++) { + sleep(1); + send_flush_id_case(fd, daddr, i); + sleep(1); + write_packet(fd, fin_pkt, total_hdr_len, daddr); + } +} + static void send_ipv6_exthdr(int fd, struct sockaddr_ll *daddr, char *ext_data1, char *ext_data2) { static char buf[MAX_HDR_LEN + PAYLOAD_LEN]; @@ -938,6 +1046,8 @@ static void gro_sender(void) send_fragment4(txfd, &daddr); sleep(1); write_packet(txfd, fin_pkt, total_hdr_len, &daddr); + + test_flush_id(txfd, &daddr, fin_pkt); } else if (proto == PF_INET6) { sleep(1); send_fragment6(txfd, &daddr); @@ -1064,6 +1174,34 @@ static void gro_receiver(void)
printf("fragmented ip4 doesn't coalesce: "); check_recv_pkts(rxfd, correct_payload, 2); + + /* is_atomic checks */ + printf("DF=1, Incrementing - should coalesce: "); + correct_payload[0] = PAYLOAD_LEN * 2; + check_recv_pkts(rxfd, correct_payload, 1); + + printf("DF=1, Fixed - should coalesce: "); + correct_payload[0] = PAYLOAD_LEN * 2; + check_recv_pkts(rxfd, correct_payload, 1); + + printf("DF=0, Incrementing - should coalesce: "); + correct_payload[0] = PAYLOAD_LEN * 2; + check_recv_pkts(rxfd, correct_payload, 1); + + printf("DF=0, Fixed - should not coalesce: "); + correct_payload[0] = PAYLOAD_LEN; + correct_payload[1] = PAYLOAD_LEN; + check_recv_pkts(rxfd, correct_payload, 2); + + printf("DF=1, 2 Incrementing and one fixed - should coalesce only first 2 packets: "); + correct_payload[0] = PAYLOAD_LEN * 2; + correct_payload[1] = PAYLOAD_LEN; + check_recv_pkts(rxfd, correct_payload, 2); + + printf("DF=1, 2 Fixed and one incrementing - should coalesce only first 2 packets: "); + correct_payload[0] = PAYLOAD_LEN * 2; + correct_payload[1] = PAYLOAD_LEN; + check_recv_pkts(rxfd, correct_payload, 2); } else if (proto == PF_INET6) { /* GRO doesn't check for ipv6 hop limit when flushing. * Hence no corresponding test to the ipv4 case.
Richard Gobert wrote:
Added flush id selftests to test different cases where DF flag is set or unset and id value changes in the following packets. All cases where the packets should coalesce or should not coalesce are tested.
Signed-off-by: Richard Gobert richardbgobert@gmail.com
Reviewed-by: Willem de Bruijn willemb@google.com
Hello:
This series was applied to netdev/net-next.git (main) by Jakub Kicinski kuba@kernel.org:
On Thu, 9 May 2024 21:08:16 +0200 you wrote:
The cb fields network_offset and inner_network_offset are used instead of skb->network_header throughout GRO.
These fields are then leveraged in the next commit to remove flush_id state from napi_gro_cb, and stateful code in {ipv6,inet}_gro_receive which may be unnecessarily complicated due to encapsulation support in GRO. These fields are checked in L4 instead.
[...]
Here is the summary with links: - [net-next,v10,1/3] net: gro: use cb instead of skb->network_header https://git.kernel.org/netdev/net-next/c/186b1ea73ad8 - [net-next,v10,2/3] net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment https://git.kernel.org/netdev/net-next/c/4b0ebbca3e16 - [net-next,v10,3/3] selftests/net: add flush id selftests https://git.kernel.org/netdev/net-next/c/bc21faefbe58
You are awesome, thank you!
linux-kselftest-mirror@lists.linaro.org