From: Christoph Paasch cpaasch@openai.com
[ Upstream commit b0ac6d3b56a2384db151696cfda2836a8a961b6d ]
When removing a nexthop, commit 90f33bffa382 ("nexthops: don't modify published nexthop groups") added a call to synchronize_rcu() (later changed to _net()) to make sure everyone sees the new nexthop-group before the rtnl-lock is released.
When one wants to delete a large number of groups and nexthops, it is fastest to first flush the groups (ip nexthop flush groups) and then flush the nexthops themselves (ip -6 nexthop flush). As that way the groups don't need to be rebalanced.
However, `ip -6 nexthop flush` will still take a long time if there is a very large number of nexthops because of the call to synchronize_net(). Now, if there are no more groups, there is no point in calling synchronize_net(). So, let's skip that entirely by checking if nh->grp_list is empty.
This gives us a nice speedup:
BEFORE: =======
$ time sudo ip -6 nexthop flush Dump was interrupted and may be inconsistent. Flushed 2097152 nexthops
real 1m45.345s user 0m0.001s sys 0m0.005s
$ time sudo ip -6 nexthop flush Dump was interrupted and may be inconsistent. Flushed 4194304 nexthops
real 3m10.430s user 0m0.002s sys 0m0.004s
AFTER: ======
$ time sudo ip -6 nexthop flush Dump was interrupted and may be inconsistent. Flushed 2097152 nexthops
real 0m17.545s user 0m0.003s sys 0m0.003s
$ time sudo ip -6 nexthop flush Dump was interrupted and may be inconsistent. Flushed 4194304 nexthops
real 0m35.823s user 0m0.002s sys 0m0.004s
Signed-off-by: Christoph Paasch cpaasch@openai.com Reviewed-by: Ido Schimmel idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov razor@blackwall.org Reviewed-by: Eric Dumazet edumazet@google.com Reviewed-by: David Ahern dsahern@kernel.org Link: https://patch.msgid.link/20250816-nexthop_dump-v2-2-491da3462118@openai.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org ---
LLM Generated explanations, may be completely bogus:
YES
Why this is a good stable backport - Fixes a real, user-facing regression in large-scale environments: deleting many nexthops pays an O(N) cost from an unnecessary RCU grace period per nexthop. The commit message shows dramatic runtime reductions (minutes → seconds) for “ip -6 nexthop flush” on millions of nexthops. This is an operational pain point, not a micro- optimization. - Minimal, localized change with no functional/architectural impact: it only short-circuits a barrier when there is provably nothing to synchronize. No API changes, no behavior changes when cleanup is actually needed. - Preserves correctness: the synchronize call was introduced to serialize readers after updating a published group array (commit 90f33bffa382). Skipping it is safe when there were no group updates.
Specific code and history analysis - Barrier origin and purpose: - 90f33bffa382 added a post-update grace period to “make sure all see the newly published array before releasing RTNL” by calling `synchronize_rcu()` (later became `synchronize_net()`). - See 90f33bffa382: net/ipv4/nexthop.c: the barrier was added after removing a nexthop from groups. - Current code path (pre-patch): - `remove_nexthop_from_groups()` iterates `nh->grp_list`, potentially updating group arrays via `remove_nh_grp_entry()`, then unconditionally calls `synchronize_net()`; net/ipv4/nexthop.c:2085 and net/ipv4/nexthop.c:2094. - This function runs for non-group nexthops during deletion; see call site in `__remove_nexthop()`: net/ipv4/nexthop.c:2166. The RTNL lock is held across deletion (rtnl lock in `rtm_del_nexthop()`); net/ipv4/nexthop.c:3310. - The patch’s exact change: - Adds an early return when there is nothing to remove: - New check: `if (list_empty(&nh->grp_list)) return;` - This prevents the unconditional `synchronize_net()` when `nh` belongs to no groups. - The loop and the barrier still run when there are entries to remove, preserving the original safety guarantee. - Why the early return is safe: - If `&nh->grp_list` is empty, no group arrays are modified; there is nothing to “publish” and thus no readers to wait out. The barrier is purely to serialize readers after `rcu_assign_pointer()` of a new group array (e.g., in `remove_nh_grp_entry()` which calls `rcu_assign_pointer(nhp->nh_grp, newg)`; net/ipv4/nexthop.c:around 2020). With no modifications, the barrier is a no-op, only adding latency. - Concurrency context is correct: group membership modifications happen under RTNL, and `remove_nexthop_from_groups()` is called under RTNL; `list_empty()` on `nh->grp_list` is consistent. The list head is always initialized (`INIT_LIST_HEAD(&nh->grp_list)`; net/ipv4/nexthop.c:542). - Other RCU barriers in the file that protect real publications remain intact (e.g., in group replacement, `synchronize_net()` remains; net/ipv4/nexthop.c:2291).
Stable policy considerations - Scope is tiny and self-contained (one function, one early return); no cross-subsystem impact. - Not a feature; it is a performance fix for a behavior introduced by an earlier change (90f33bffa382) that added unconditional grace periods even when nothing changed. - Risk of regression is very low: previously, the barrier was sometimes unnecessary. Now it remains when necessary and is skipped when provably unneeded. No change to notifier behavior or group update logic.
Practical backport notes - Older stable trees may have `synchronize_rcu()` instead of `synchronize_net()` at the end of `remove_nexthop_from_groups()`. The early return remains valid and safe regardless; adapt the barrier name to the tree’s version if needed. - The infrastructure used by the check (`nh->grp_list`) and usage context (RTNL held) are long-standing and present in stable kernels that have nexthop groups.
Conclusion - This change is a classic stable backport candidate: important user- visible improvement, minimal risk, no semantics change, and tightly scoped to the nexthop cleanup path.
net/ipv4/nexthop.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index 34137768e7f9a..15acfb74fd238 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -2087,6 +2087,12 @@ static void remove_nexthop_from_groups(struct net *net, struct nexthop *nh, { struct nh_grp_entry *nhge, *tmp;
+ /* If there is nothing to do, let's avoid the costly call to + * synchronize_net() + */ + if (list_empty(&nh->grp_list)) + return; + list_for_each_entry_safe(nhge, tmp, &nh->grp_list, nh_list) remove_nh_grp_entry(net, nhge, nlinfo);