On Fri, 6 Dec 2024 at 15:15, Eric Dumazet edumazet@google.com wrote:
On Fri, Dec 6, 2024 at 3:49 AM Dmitry Safonov 0x7f454c46@gmail.com wrote:
[..]
@@ -585,8 +589,14 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
err = sk_diag_fill(sk, rep, cb, req, 0, net_admin); if (err < 0) {
WARN_ON(err == -EMSGSIZE); nlmsg_free(rep);
if (err == -EMSGSIZE) {
attr_size <<= 1;
if (attr_size + NLMSG_HDRLEN <=
SKB_WITH_OVERHEAD(32768)) {
cond_resched();
goto retry;
}
} goto out; } err = nlmsg_unicast(net->diag_nlsk, rep, NETLINK_CB(in_skb).portid);
To my personal taste on larger than 327 md5 keys scale, I'd prefer to see "dump may be inconsistent, retry if you need consistency" than -EMSGSIZE fail, yet userspace potentially may use the errno as a "retry" signal.
I do not yet understand this point. I will let you send a patch for further discussion.
Let me explain my view. It's based on two points: (a) TCP-MD5/AO-diag interfaces are mostly used for debugging/investigating/monitoring by tools alike ss. Without a side-synchronisation, they can't be used by BGP or other tools/tests to make decisions as the socket is controlled by another process and the resulting dump may be incomplete, inconsistent or outdated. (b) The current default of optmem_max limit (128Kb) allows to allocate on a socket 655 TCP-AO keys and even more TCP-MD5 keys. Some of Arista's customers (I'd guess the same for other BGP users) have 1000 peers (for MD5 it's one key per peer on a listen socket, for AO might be higher).
I think the situation that's being addressed here is a race and potentially it's rare to hit (I have to run a reproducer in a loop to hit it). That's why in my view a re-try jump is too big of a hammer. And failing with -EMSGSIZE on 327+ keys scale sounds slighly worse than just marking the resulting dump as inconsistent and letting the user decide if he wants to re-run the dump or if the dump is "good enough" to get a sense of the situation. I would even say "the dump is inconsistent" has its value as a signal that the keys on a socket change right now, which may be useful.
Regarding the patch, my attempt was in this thread: https://lore.kernel.org/all/20241113-tcp-md5-diag-prep-v2-1-00a2a7feb1fa@gma...
However, I should note that I'm fine with either of the approaches (userspace to retry on EMSGSIZE or to get an inconsistent dump and decide what to do with that). I'm somewhat looking forward to switching to problems (1)/(3)/(4) from the cover-letter and adding TCP-AO-diag, rather than being stuck arguing about what's the best solution for quite a rare race :-)
Thanks, Dmitry