Changes in v2: - Fixup for uninitilized md5sig_count stack variable (Oops! Kudos to kernel test robot lkp@intel.com) - Correct space damage, add a missing Fixes tag & reformat tcp_ulp_ops_size() (Kuniyuki Iwashima) - Take out patch for maximum attribute length, see (4) below. Going to send it later with the next TCP-AO-diag part (Kuniyuki Iwashima) - Link to v1: https://lore.kernel.org/r/20241106-tcp-md5-diag-prep-v1-0-d62debf3dded@gmail...
My original intent was to replace the last non-upstream Arista's TCP-AO piece. That is per-netns procfs seqfile which lists AO keys. In my view an acceptable upstream alternative would be TCP-AO-diag uAPI.
So, I started by looking and reviewing TCP-MD5-diag code. And straight away I saw a bunch of issues:
1. Similarly to TCP_MD5SIG_EXT, which doesn't check tcpm_flags for unknown flags and so being non-extendable setsockopt(), the same way tcp_diag_put_md5sig() dumps md5 keys in an array of tcp_diag_md5sig, which makes it ABI non-extendable structure as userspace can't tolerate any new members in it.
2. Inet-diag allocates netlink message for sockets in inet_diag_dump_one_icsk(), which uses a TCP-diag callback .idiag_get_aux_size(), that pre-calculates the needed space for TCP-diag related information. But as neither socket lock nor rcu_readlock() are held between allocation and the actual TCP info filling, the TCP-related space requirement may change before reaching tcp_diag_put_md5sig(). I.e., the number of TCP-MD5 keys on a socket. Thankfully, TCP-MD5-diag won't overwrite the skb, but will return EMSGSIZE, triggering WARN_ON() in inet_diag_dump_one_icsk().
3. Inet-diag "do" request* can create skb of any message required size. But "dump" request* the skb size, since d35c99ff77ec ("netlink: do not enter direct reclaim from netlink_dump()") is limited by 32 KB. Having in mind that sizeof(struct tcp_diag_md5sig) = 100 bytes, dumps for sockets that have more than 327 keys are going to fail (not counting other diag infos, which lower this limit futher). That is much lower than the number of TCP-MD5 keys that can be allocated on a socket with the current default optmem_max limit (128Kb).
So, then I went and written selftests for TCP-MD5-diag and besides confirming that (2) and (3) are not theoretical issues, I also discovered another issues, that I didn't notice on code inspection:
4. nlattr::nla_len is __u16, which limits the largest netlink attibute by 64Kb or by 655 tcp_diag_md5sig keys in the diag array. What happens de-facto is that the netlink attribute gets u16 overflow, breaking the userspace parsing - RTA_NEXT(), that should point to the next attribute, points into the middle of md5 keys array.
In this patch set issues (2) and (4) are addressed. (2) by not returning EMSGSIZE when the dump raced with modifying TCP-MD5 keys on a socket, but mark the dump inconsistent by setting NLM_F_DUMP_INTR nlmsg flag. Which changes uAPI in situations where previously kernel did WARN() and errored the dump. (4) by artificially limiting the maximum attribute size by U16_MAX - 1.
In order to remove the new limit from (4) solution, my plan is to convert the dump of TCP-MD5 keys from an array to NL_ATTR_TYPE_NESTED_ARRAY (or alike), which should also address (1). And for (3), it's needed to teach tcp-diag how-to remember not only socket on which previous recvmsg() stopped, but potentially TCP-MD5 key as well.
I plan in the next part of patch set address (3), (1) and the new limit for (4), together with adding new TCP-AO-diag.
* Terminology from Documentation/userspace-api/netlink/intro.rst
Signed-off-by: Dmitry Safonov 0x7f454c46@gmail.com --- Dmitry Safonov (5): net/diag: Do not race on dumping MD5 keys with adding new MD5 keys net/diag: Warn only once on EMSGSIZE net/diag: Pre-allocate optional info only if requested net/diag: Always pre-allocate tcp_ulp info net/netlink: Correct the comment on netlink message max cap
include/linux/inet_diag.h | 3 +- include/net/tcp.h | 1 - net/ipv4/inet_diag.c | 89 ++++++++++++++++++++++++++++++++++++++--------- net/ipv4/tcp_diag.c | 68 ++++++++++++++++++------------------ net/mptcp/diag.c | 20 ----------- net/netlink/af_netlink.c | 4 +-- net/tls/tls_main.c | 17 --------- 7 files changed, 110 insertions(+), 92 deletions(-) --- base-commit: f1b785f4c7870c42330b35522c2514e39a1e28e7 change-id: 20241106-tcp-md5-diag-prep-2f0dcf371d90
Best regards,
From: Dmitry Safonov 0x7f454c46@gmail.com
Inet-diag has two modes: (1) dumping information for a specific socket, for which kernel creates one netlink message with the information and (2) dumping information for multiple sockets (possibly with a filter), where for the reply kernel sends many messages, one for each matched socket.
Currently those two modes work differently as the information about a specific socket is never split between multiple messages. For (2), multi-socket dump for the reply kernel allocates up to 32Kb skb and fills that with as many socket dumps as possible. For (1), one-socket dump kernel pre-calculates the required space for the reply, allocates a new skb and nlmsg and only then starts filling the socket's details.
Preallocating the needed size quite makes sense as most of the details are fix-sized and provided for each socket, see inet_sk_attr_size(). But there's an exception: .idiag_get_aux_size() which is optional for a socket. This is provided only for TCP sockets by tcp_diag.
For TCP-MD5 it calculates the memory needed to fill an array of (struct tcp_diag_md5sig). The issue here is that the amount of keys may change in inet_diag_dump_one_icsk() between inet_sk_attr_size() and sk_diag_fill() calls. As the code expects fix-sized information on any socket, it considers sk_diag_fill() failures by -EMSGSIZE reason as a bug, resulting in such WARN_ON():
[] ------------[ cut here ]------------ [] WARNING: CPU: 7 PID: 17420 at net/ipv4/inet_diag.c:586 inet_diag_dump_one_icsk+0x3c8/0x420 [] CPU: 7 UID: 0 PID: 17420 Comm: diag_ipv4 Tainted: G W 6.11.0-rc6-00022-gc9fd7a9f9aca-dirty #2 [] pc : inet_diag_dump_one_icsk+0x3c8/0x420 [] lr : inet_diag_dump_one_icsk+0x1d4/0x420 [] sp : ffff8000aef87460 ... [] Call trace: [] inet_diag_dump_one_icsk+0x3c8/0x420 [] tcp_diag_dump_one+0xa0/0xf0 [] inet_diag_cmd_exact+0x234/0x278 [] inet_diag_handler_cmd+0x16c/0x288 [] sock_diag_rcv_msg+0x1a8/0x550 [] netlink_rcv_skb+0x198/0x378 [] sock_diag_rcv+0x20/0x48 [] netlink_unicast+0x400/0x6a8 [] netlink_sendmsg+0x654/0xa58 [] __sys_sendto+0x1ec/0x330 [] __arm64_sys_sendto+0xc8/0x168 ... [] ---[ end trace 0000000000000000 ]---
One way to solve it would be to grab lock_sock() in inet_diag_dump_one_icsk(), but that may be costly and bring new lock dependencies. The alternative is to call tcp_diag_put_md5sig() as the last attribute of the netlink message and calculate how much space left after all previous attributes filled and translate it into (struct tcp_diag_md5sig)-sized units. If it turns out that there's not enough space for all TCP-MD5 keys, mark the dump as inconsistent by setting NLM_F_DUMP_INTR flag. Userspace may figure out that dumping raced with the socket properties change and retry again.
Currently it may be unexpected by userspace that netlink message for one socket may be inconsistent, but I believe we're on a safe side from breaking userspace as previously dump would fail and an ugly WARN was produced in dmesg. IOW, it is a clear improvement.
This is not a theoretical issue: I've written a test and that reproduces the issue I suspected (the backtrace above).
Fixes: c03fa9bcacd9 ("tcp_diag: report TCP MD5 signing keys and addresses") Cc: stable@vger.kernel.org Signed-off-by: Dmitry Safonov 0x7f454c46@gmail.com --- include/linux/inet_diag.h | 3 ++- net/ipv4/inet_diag.c | 8 +++---- net/ipv4/tcp_diag.c | 55 ++++++++++++++++++++++++++++------------------- 3 files changed, 39 insertions(+), 27 deletions(-)
diff --git a/include/linux/inet_diag.h b/include/linux/inet_diag.h index a9033696b0aad36ab9abd47e4b68e272053019d7..cb2ba672eba131986d0432dd628fc42bbf800886 100644 --- a/include/linux/inet_diag.h +++ b/include/linux/inet_diag.h @@ -22,7 +22,8 @@ struct inet_diag_handler {
int (*idiag_get_aux)(struct sock *sk, bool net_admin, - struct sk_buff *skb); + struct sk_buff *skb, + struct nlmsghdr *nlh);
size_t (*idiag_get_aux_size)(struct sock *sk, bool net_admin); diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c index 67639309163d05c034fad80fc9a6096c3b79d42f..67b9cc4c0e47a596a4d588e793b7f13ee040a1e3 100644 --- a/net/ipv4/inet_diag.c +++ b/net/ipv4/inet_diag.c @@ -350,10 +350,6 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
handler->idiag_get_info(sk, r, info);
- if (ext & (1 << (INET_DIAG_INFO - 1)) && handler->idiag_get_aux) - if (handler->idiag_get_aux(sk, net_admin, skb) < 0) - goto errout; - if (sk->sk_state < TCP_TIME_WAIT) { union tcp_cc_info info; size_t sz = 0; @@ -368,6 +364,10 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk, goto errout; }
+ if (ext & (1 << (INET_DIAG_INFO - 1)) && handler->idiag_get_aux) + if (handler->idiag_get_aux(sk, net_admin, skb, nlh) < 0) + goto errout; + /* Keep it at the end for potential retry with a larger skb, * or else do best-effort fitting, which is only done for the * first_nlmsg. diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c index f428ecf9120f2f596e1d67db2b2a0d0d0e211905..d752dc5de3536303aeb075c10fbdc2c9fc417cd5 100644 --- a/net/ipv4/tcp_diag.c +++ b/net/ipv4/tcp_diag.c @@ -53,29 +53,39 @@ static void tcp_diag_md5sig_fill(struct tcp_diag_md5sig *info, }
static int tcp_diag_put_md5sig(struct sk_buff *skb, - const struct tcp_md5sig_info *md5sig) + const struct tcp_md5sig_info *md5sig, + struct nlmsghdr *nlh) { + size_t key_size = sizeof(struct tcp_diag_md5sig); + unsigned int attrlen, md5sig_count = 0; const struct tcp_md5sig_key *key; struct tcp_diag_md5sig *info; struct nlattr *attr; - int md5sig_count = 0;
+ /* + * Userspace doesn't like to see zero-filled key-values, so + * allocating too large attribute is bad. + */ hlist_for_each_entry_rcu(key, &md5sig->head, node) md5sig_count++; if (md5sig_count == 0) return 0;
- attr = nla_reserve(skb, INET_DIAG_MD5SIG, - md5sig_count * sizeof(struct tcp_diag_md5sig)); + attrlen = skb_availroom(skb) - NLA_HDRLEN; + md5sig_count = min(md5sig_count, attrlen / key_size); + attr = nla_reserve(skb, INET_DIAG_MD5SIG, md5sig_count * key_size); if (!attr) return -EMSGSIZE;
info = nla_data(attr); - memset(info, 0, md5sig_count * sizeof(struct tcp_diag_md5sig)); + memset(info, 0, md5sig_count * key_size); hlist_for_each_entry_rcu(key, &md5sig->head, node) { - tcp_diag_md5sig_fill(info++, key); - if (--md5sig_count == 0) + /* More keys on a socket than pre-allocated space available */ + if (md5sig_count-- == 0) { + nlh->nlmsg_flags |= NLM_F_DUMP_INTR; break; + } + tcp_diag_md5sig_fill(info++, key); }
return 0; @@ -110,25 +120,11 @@ static int tcp_diag_put_ulp(struct sk_buff *skb, struct sock *sk, }
static int tcp_diag_get_aux(struct sock *sk, bool net_admin, - struct sk_buff *skb) + struct sk_buff *skb, struct nlmsghdr *nlh) { struct inet_connection_sock *icsk = inet_csk(sk); int err = 0;
-#ifdef CONFIG_TCP_MD5SIG - if (net_admin) { - struct tcp_md5sig_info *md5sig; - - rcu_read_lock(); - md5sig = rcu_dereference(tcp_sk(sk)->md5sig_info); - if (md5sig) - err = tcp_diag_put_md5sig(skb, md5sig); - rcu_read_unlock(); - if (err < 0) - return err; - } -#endif - if (net_admin) { const struct tcp_ulp_ops *ulp_ops;
@@ -138,6 +134,21 @@ static int tcp_diag_get_aux(struct sock *sk, bool net_admin, if (err) return err; } + +#ifdef CONFIG_TCP_MD5SIG + if (net_admin) { + struct tcp_md5sig_info *md5sig; + + rcu_read_lock(); + md5sig = rcu_dereference(tcp_sk(sk)->md5sig_info); + if (md5sig) + err = tcp_diag_put_md5sig(skb, md5sig, nlh); + rcu_read_unlock(); + if (err < 0) + return err; + } +#endif + return 0; }
Hello:
This series was applied to netdev/net-next.git (main) by Jakub Kicinski kuba@kernel.org:
On Wed, 13 Nov 2024 18:46:39 +0000 you wrote:
Changes in v2:
- Fixup for uninitilized md5sig_count stack variable (Oops! Kudos to kernel test robot lkp@intel.com)
- Correct space damage, add a missing Fixes tag & reformat tcp_ulp_ops_size() (Kuniyuki Iwashima)
- Take out patch for maximum attribute length, see (4) below. Going to send it later with the next TCP-AO-diag part (Kuniyuki Iwashima)
- Link to v1: https://lore.kernel.org/r/20241106-tcp-md5-diag-prep-v1-0-d62debf3dded@gmail...
[...]
Here is the summary with links: - [net,v2,1/5] net/diag: Do not race on dumping MD5 keys with adding new MD5 keys (no matching commit) - [net,v2,2/5] net/diag: Warn only once on EMSGSIZE (no matching commit) - [net,v2,3/5] net/diag: Pre-allocate optional info only if requested (no matching commit) - [net,v2,4/5] net/diag: Always pre-allocate tcp_ulp info (no matching commit) - [net,v2,5/5] net/netlink: Correct the comment on netlink message max cap https://git.kernel.org/netdev/net-next/c/e51edeaf3506
You are awesome, thank you!
On Wed, 13 Nov 2024 18:46:39 +0000 Dmitry Safonov via B4 Relay wrote:
- Inet-diag allocates netlink message for sockets in inet_diag_dump_one_icsk(), which uses a TCP-diag callback .idiag_get_aux_size(), that pre-calculates the needed space for TCP-diag related information. But as neither socket lock nor rcu_readlock() are held between allocation and the actual TCP info filling, the TCP-related space requirement may change before reaching tcp_diag_put_md5sig(). I.e., the number of TCP-MD5 keys on a socket. Thankfully, TCP-MD5-diag won't overwrite the skb, but will return EMSGSIZE, triggering WARN_ON() in inet_diag_dump_one_icsk().
Hi Eric!
This was posted while you were away -- any thoughts or recommendation on how to address the required nl message size changing? Or other problems pointed out by Dmitry? My suggestion in the subthread is to re-dump with a fixed, large buffer on EMSGSIZE, but that's not super clean..
On Thu, Dec 5, 2024 at 2:13 AM Jakub Kicinski kuba@kernel.org wrote:
On Wed, 13 Nov 2024 18:46:39 +0000 Dmitry Safonov via B4 Relay wrote:
- Inet-diag allocates netlink message for sockets in inet_diag_dump_one_icsk(), which uses a TCP-diag callback .idiag_get_aux_size(), that pre-calculates the needed space for TCP-diag related information. But as neither socket lock nor rcu_readlock() are held between allocation and the actual TCP info filling, the TCP-related space requirement may change before reaching tcp_diag_put_md5sig(). I.e., the number of TCP-MD5 keys on a socket. Thankfully, TCP-MD5-diag won't overwrite the skb, but will return EMSGSIZE, triggering WARN_ON() in inet_diag_dump_one_icsk().
Hi Eric!
This was posted while you were away -- any thoughts or recommendation on how to address the required nl message size changing? Or other problems pointed out by Dmitry? My suggestion in the subthread is to re-dump with a fixed, large buffer on EMSGSIZE, but that's not super clean..
Hi Jakub
inet_diag_dump_one_icsk() could retry, doubling the size until the ~32768 byte limit is reached ?
Also, we could make sure inet_sk_attr_size() returns at least NLMSG_DEFAULT_SIZE, there is no point trying to save memory for a single skb in inet_diag_dump_one_icsk().
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c index 321acc8abf17e8c7d6a4e3326615123fff19deab..cd2e7fe9b090ea9127aebbba0faf2ef12c0f86a4 100644 --- a/net/ipv4/inet_diag.c +++ b/net/ipv4/inet_diag.c @@ -102,7 +102,7 @@ static size_t inet_sk_attr_size(struct sock *sk, bool net_admin) { const struct inet_diag_handler *handler; - size_t aux = 0; + size_t aux = 0, res;
rcu_read_lock(); handler = rcu_dereference(inet_diag_table[req->sdiag_protocol]); @@ -111,7 +111,7 @@ static size_t inet_sk_attr_size(struct sock *sk, aux = handler->idiag_get_aux_size(sk, net_admin); rcu_read_unlock();
- return nla_total_size(sizeof(struct tcp_info)) + res = nla_total_size(sizeof(struct tcp_info)) + nla_total_size(sizeof(struct inet_diag_msg)) + inet_diag_msg_attrs_size() + nla_total_size(sizeof(struct inet_diag_meminfo)) @@ -120,6 +120,7 @@ static size_t inet_sk_attr_size(struct sock *sk, + nla_total_size(sizeof(struct tcpvegas_info)) + aux + 64; + return max(res, NLMSG_DEFAULT_SIZE); }
int inet_diag_msg_attrs_fill(struct sock *sk, struct sk_buff *skb, @@ -570,6 +571,7 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo, bool net_admin = netlink_net_capable(in_skb, CAP_NET_ADMIN); struct net *net = sock_net(in_skb->sk); struct sk_buff *rep; + size_t attr_size; struct sock *sk; int err;
@@ -577,7 +579,9 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo, if (IS_ERR(sk)) return PTR_ERR(sk);
- rep = nlmsg_new(inet_sk_attr_size(sk, req, net_admin), GFP_KERNEL); + attr_size = inet_sk_attr_size(sk, req, net_admin); +retry: + rep = nlmsg_new(attr_size, GFP_KERNEL); if (!rep) { err = -ENOMEM; goto out; @@ -585,8 +589,14 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
err = sk_diag_fill(sk, rep, cb, req, 0, net_admin); if (err < 0) { - WARN_ON(err == -EMSGSIZE); nlmsg_free(rep); + if (err == -EMSGSIZE) { + attr_size <<= 1; + if (attr_size + NLMSG_HDRLEN <= SKB_WITH_OVERHEAD(32768)) { + cond_resched(); + goto retry; + } + } goto out; } err = nlmsg_unicast(net->diag_nlsk, rep, NETLINK_CB(in_skb).portid);
On Thu, 5 Dec 2024 10:09:02 +0100 Eric Dumazet wrote:
inet_diag_dump_one_icsk() could retry, doubling the size until the ~32768 byte limit is reached ?
Also, we could make sure inet_sk_attr_size() returns at least NLMSG_DEFAULT_SIZE, there is no point trying to save memory for a single skb in inet_diag_dump_one_icsk().
SGTM :)
Hi Jakub, Eric,
On Thu, 5 Dec 2024 at 09:09, Eric Dumazet edumazet@google.com wrote:
On Thu, Dec 5, 2024 at 2:13 AM Jakub Kicinski kuba@kernel.org wrote:
Hi Eric!
This was posted while you were away -- any thoughts or recommendation on how to address the required nl message size changing? Or other problems pointed out by Dmitry? My suggestion in the subthread is to re-dump with a fixed, large buffer on EMSGSIZE, but that's not super clean..
Hi Jakub
inet_diag_dump_one_icsk() could retry, doubling the size until the ~32768 byte limit is reached ?
Also, we could make sure inet_sk_attr_size() returns at least NLMSG_DEFAULT_SIZE, there is no point trying to save memory for a single skb in inet_diag_dump_one_icsk().
Starting from NLMSG_DEFAULT_SIZE sounds like a really sane idea! :-)
[..]
@@ -585,8 +589,14 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
err = sk_diag_fill(sk, rep, cb, req, 0, net_admin); if (err < 0) {
WARN_ON(err == -EMSGSIZE); nlmsg_free(rep);
if (err == -EMSGSIZE) {
attr_size <<= 1;
if (attr_size + NLMSG_HDRLEN <=
SKB_WITH_OVERHEAD(32768)) {
cond_resched();
goto retry;
}
} goto out; } err = nlmsg_unicast(net->diag_nlsk, rep, NETLINK_CB(in_skb).portid);
To my personal taste on larger than 327 md5 keys scale, I'd prefer to see "dump may be inconsistent, retry if you need consistency" than -EMSGSIZE fail, yet userspace potentially may use the errno as a "retry" signal.
Do you plan to re-send it as a proper patch? Or I can send it with my next patches for TCP-MD5-diag issues (1), (3), (4) and TCP-AO-diag.
Thanks, Dmitry
On Fri, Dec 6, 2024 at 3:49 AM Dmitry Safonov 0x7f454c46@gmail.com wrote:
Hi Jakub, Eric,
On Thu, 5 Dec 2024 at 09:09, Eric Dumazet edumazet@google.com wrote:
On Thu, Dec 5, 2024 at 2:13 AM Jakub Kicinski kuba@kernel.org wrote:
Hi Eric!
This was posted while you were away -- any thoughts or recommendation on how to address the required nl message size changing? Or other problems pointed out by Dmitry? My suggestion in the subthread is to re-dump with a fixed, large buffer on EMSGSIZE, but that's not super clean..
Hi Jakub
inet_diag_dump_one_icsk() could retry, doubling the size until the ~32768 byte limit is reached ?
Also, we could make sure inet_sk_attr_size() returns at least NLMSG_DEFAULT_SIZE, there is no point trying to save memory for a single skb in inet_diag_dump_one_icsk().
Starting from NLMSG_DEFAULT_SIZE sounds like a really sane idea! :-)
There is a consensus for this one, I will cook a patch with this part only.
[..]
@@ -585,8 +589,14 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
err = sk_diag_fill(sk, rep, cb, req, 0, net_admin); if (err < 0) {
WARN_ON(err == -EMSGSIZE); nlmsg_free(rep);
if (err == -EMSGSIZE) {
attr_size <<= 1;
if (attr_size + NLMSG_HDRLEN <=
SKB_WITH_OVERHEAD(32768)) {
cond_resched();
goto retry;
}
} goto out; } err = nlmsg_unicast(net->diag_nlsk, rep, NETLINK_CB(in_skb).portid);
To my personal taste on larger than 327 md5 keys scale, I'd prefer to see "dump may be inconsistent, retry if you need consistency" than -EMSGSIZE fail, yet userspace potentially may use the errno as a "retry" signal.
I do not yet understand this point. I will let you send a patch for further discussion.
Thanks.
On Fri, 6 Dec 2024 at 15:15, Eric Dumazet edumazet@google.com wrote:
On Fri, Dec 6, 2024 at 3:49 AM Dmitry Safonov 0x7f454c46@gmail.com wrote:
[..]
@@ -585,8 +589,14 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
err = sk_diag_fill(sk, rep, cb, req, 0, net_admin); if (err < 0) {
WARN_ON(err == -EMSGSIZE); nlmsg_free(rep);
if (err == -EMSGSIZE) {
attr_size <<= 1;
if (attr_size + NLMSG_HDRLEN <=
SKB_WITH_OVERHEAD(32768)) {
cond_resched();
goto retry;
}
} goto out; } err = nlmsg_unicast(net->diag_nlsk, rep, NETLINK_CB(in_skb).portid);
To my personal taste on larger than 327 md5 keys scale, I'd prefer to see "dump may be inconsistent, retry if you need consistency" than -EMSGSIZE fail, yet userspace potentially may use the errno as a "retry" signal.
I do not yet understand this point. I will let you send a patch for further discussion.
Let me explain my view. It's based on two points: (a) TCP-MD5/AO-diag interfaces are mostly used for debugging/investigating/monitoring by tools alike ss. Without a side-synchronisation, they can't be used by BGP or other tools/tests to make decisions as the socket is controlled by another process and the resulting dump may be incomplete, inconsistent or outdated. (b) The current default of optmem_max limit (128Kb) allows to allocate on a socket 655 TCP-AO keys and even more TCP-MD5 keys. Some of Arista's customers (I'd guess the same for other BGP users) have 1000 peers (for MD5 it's one key per peer on a listen socket, for AO might be higher).
I think the situation that's being addressed here is a race and potentially it's rare to hit (I have to run a reproducer in a loop to hit it). That's why in my view a re-try jump is too big of a hammer. And failing with -EMSGSIZE on 327+ keys scale sounds slighly worse than just marking the resulting dump as inconsistent and letting the user decide if he wants to re-run the dump or if the dump is "good enough" to get a sense of the situation. I would even say "the dump is inconsistent" has its value as a signal that the keys on a socket change right now, which may be useful.
Regarding the patch, my attempt was in this thread: https://lore.kernel.org/all/20241113-tcp-md5-diag-prep-v2-1-00a2a7feb1fa@gma...
However, I should note that I'm fine with either of the approaches (userspace to retry on EMSGSIZE or to get an inconsistent dump and decide what to do with that). I'm somewhat looking forward to switching to problems (1)/(3)/(4) from the cover-letter and adding TCP-AO-diag, rather than being stuck arguing about what's the best solution for quite a rare race :-)
Thanks, Dmitry
linux-stable-mirror@lists.linaro.org