-----Original Message----- From: Paolo Abeni pabeni@redhat.com Sent: Tuesday, June 17, 2025 10:03 AM To: Chia-Yu Chang (Nokia) chia-yu.chang@nokia-bell-labs.com; edumazet@google.com; linux-doc@vger.kernel.org; corbet@lwn.net; horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; andrew+netdev@lunn.ch; donald.hunter@gmail.com; ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; linux-kselftest@vger.kernel.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) koen.de_schepper@nokia-bell-labs.com; g.white@cablelabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com Cc: Olivier Tilmans (Nokia) olivier.tilmans@nokia.com Subject: Re: [PATCH v8 net-next 04/15] tcp: AccECN core
CAUTION: This is an external email. Please be very careful when clicking links or opening attachments. See the URL nok.it/ext for additional information.
On 6/10/25 2:53 PM, chia-yu.chang@nokia-bell-labs.com wrote:
From: Ilpo Järvinen ij@kernel.org
This change implements Accurate ECN without negotiation and AccECN Option (that will be added by later changes). Based on AccECN specifications: https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
Accurate ECN allows feeding back the number of CE (congestion experienced) marks accurately to the sender in contrast to RFC3168 ECN that can only signal one marks-seen-yes/no per RTT. Congestion control algorithms can take advantage of the accurate ECN information to fine-tune their congestion response to avoid drastic rate reduction when only mild congestion is encountered.
With Accurate ECN, tp->received_ce (r.cep in AccECN spec) keeps track of how many segments have arrived with a CE mark. Accurate ECN uses ACE field (ECE, CWR, AE) to communicate the value back to the sender which updates tp->delivered_ce (s.cep) based on the feedback. This signalling channel is lossy when ACE field overflow occurs.
Conservative strategy is selected here to deal with the ACE overflow, however, some strategies using the AccECN option later in the overall patchset mitigate against false overflows detected.
The ACE field values on the wire are offset by TCP_ACCECN_CEP_INIT_OFFSET. Delivered_ce/received_ce count the real CE marks rather than forcing all downstream users to adapt to the wire offset.
This patch uses the first 1-byte hole and the last 4-byte hole of the tcp_sock_write_txrx for 'received_ce_pending' and 'received_ce'. Also, the group size of tcp_sock_write_txrx is increased from 91 + 4 to 95 + 4 due to the new u32 received_ce member. Below are the trimmed pahole outcomes before and after this patch.
AFAICS 'received_ce' fills the existing 4 bytes hole, so tcp_sock_write_txrx size should be now (95 + 0), am I missreading something?
Hi Paolo,
First, thanks for the feedback. Regarding such "+ 4" is due to the byte alignemnt needed for ARM arch. You can see below tha pahole outputs BEFORE and AFTER this patch, in which a 4-byte hole is added before "tcp_clock_cache" to make it aligns from the multiple of 8. And the comment in net/ipv4/tcp.c also explain it: "32bit arches with 8byte alignment on u64 fields might need padding before tcp_clock_cache." Does it make sense to you?
[BEFORE PATCH] __u8 __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 1869 0 */ u8 nonagle:4; /* 1869: 0 1 */ u8 rate_app_limited:1; /* 1869: 4 1 */
/* XXX 3 bits hole, try to pack */ /* XXX 2 bytes hole, try to pack */
__be32 pred_flags; /* 1872 4 */
/* XXX 4 bytes hole, try to pack */
u64 tcp_clock_cache; /* 1880 8 */ u64 tcp_mstamp; /* 1888 8 */ u32 rcv_nxt; /* 1896 4 */ u32 snd_nxt; /* 1900 4 */ u32 snd_una; /* 1904 4 */ u32 window_clamp; /* 1908 4 */ u32 srtt_us; /* 1912 4 */ u32 packets_out; /* 1916 4 */ /* --- cacheline 30 boundary (1920 bytes) --- */ u32 snd_up; /* 1920 4 */ u32 delivered; /* 1924 4 */ u32 delivered_ce; /* 1928 4 */ u32 app_limited; /* 1932 4 */ u32 rcv_wnd; /* 1936 4 */ struct tcp_options_received rx_opt; /* 1940 24 */ __u8 __cacheline_group_end__tcp_sock_write_txrx[0]; /* 1964 0 */
[AFTER PATCH] __u8 __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 1869 0 */ u8 nonagle:4; /* 1869: 0 1 */ u8 rate_app_limited:1; /* 1869: 4 1 */
/* XXX 3 bits hole, try to pack */
/* Force alignment to the next boundary: */ u8 :0;
u8 received_ce_pending:4; /* 1870: 0 1 */ u8 unused2:4; /* 1870: 4 1 */
/* XXX 1 byte hole, try to pack */
__be32 pred_flags; /* 1872 4 */
/* XXX 4 bytes hole, try to pack */
u64 tcp_clock_cache; /* 1880 8 */ u64 tcp_mstamp; /* 1888 8 */ u32 rcv_nxt; /* 1896 4 */ u32 snd_nxt; /* 1900 4 */ u32 snd_una; /* 1904 4 */ u32 window_clamp; /* 1908 4 */ u32 srtt_us; /* 1912 4 */ u32 packets_out; /* 1916 4 */ /* --- cacheline 30 boundary (1920 bytes) --- */ u32 snd_up; /* 1920 4 */ u32 delivered; /* 1924 4 */ u32 delivered_ce; /* 1928 4 */ u32 received_ce; /* 1932 4 */ u32 app_limited; /* 1936 4 */ u32 rcv_wnd; /* 1940 4 */ struct tcp_options_received rx_opt; /* 1944 24 */ __u8 __cacheline_group_end__tcp_sock_write_txrx[0]; /* 1968 0 */
@@ -384,17 +387,16 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb) if (tcp_ca_needs_ecn(sk)) tcp_ca_event(sk, CA_EVENT_ECN_IS_CE);
if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR) &&
tcp_ecn_mode_rfc3168(tp)) { /* Better not delay acks, sender can have a very low cwnd */ tcp_enter_quickack_mode(sk, 2); tp->ecn_flags |= TCP_ECN_DEMAND_CWR; }
tp->ecn_flags |= TCP_ECN_SEEN;
It's not clear why you need to move this statement earlier in the code path even for ecn_mode_rfc3168(). Either a comment or
if (!tcp_ecn_mode_rfc3168(tp)) break;
a few lines aboved could help.
break; default: if (tcp_ca_needs_ecn(sk)) tcp_ca_event(sk, CA_EVENT_ECN_NO_CE);
tp->ecn_flags |= TCP_ECN_SEEN;
Same here.
Thanks,
Paolo
OK, will apply the above feedback in the next version and thanks.
BRs, Chia-Yu