[PATCH AUTOSEL 6.17-5.4] net: When removing nexthops, don't call synchronize_net if it is not necessary

25 Oct 2025

From: Christoph Paasch <cpaasch@openai.com>
[ Upstream commit b0ac6d3b56a2384db151696cfda2836a8a961b6d ]

When removing a nexthop, commit
90f33bffa382 ("nexthops: don't modify published nexthop groups") added a
call to synchronize_rcu() (later changed to _net()) to make sure
everyone sees the new nexthop-group before the rtnl-lock is released.

When one wants to delete a large number of groups and nexthops, it is
fastest to first flush the groups (ip nexthop flush groups) and then
flush the nexthops themselves (ip -6 nexthop flush), as that way the
groups don't need to be rebalanced.

However, `ip -6 nexthop flush` will still take a long time if there is
a very large number of nexthops because of the call to
synchronize_net(). Now, if there are no more groups, there is no point
in calling synchronize_net(). So, let's skip that entirely by checking
if nh->grp_list is empty.

This gives us a nice speedup:

BEFORE:
=======

$ time sudo ip -6 nexthop flush
Dump was interrupted and may be inconsistent.
Flushed 2097152 nexthops

real	1m45.345s
user	0m0.001s
sys	0m0.005s

$ time sudo ip -6 nexthop flush
Dump was interrupted and may be inconsistent.
Flushed 4194304 nexthops

real	3m10.430s
user	0m0.002s
sys	0m0.004s

AFTER:
======

$ time sudo ip -6 nexthop flush
Dump was interrupted and may be inconsistent.
Flushed 2097152 nexthops

real	0m17.545s
user	0m0.003s
sys	0m0.003s

$ time sudo ip -6 nexthop flush
Dump was interrupted and may be inconsistent.
Flushed 4194304 nexthops

real	0m35.823s
user	0m0.002s
sys	0m0.004s

Signed-off-by: Christoph Paasch <cpaasch@openai.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250816-nexthop_dump-v2-2-491da3462118@openai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
Why this is a good stable backport
- Fixes a real, user-facing regression in large-scale environments:
  every nexthop deletion waits for an RCU grace period even when it is
  unnecessary, so flushing N nexthops costs N grace periods. The commit
  message shows large runtime reductions for “ip -6 nexthop flush”
  (1m45s → 17.5s for about 2 million nexthops, 3m10s → 35.8s for about
  4 million). This is an operational pain point, not a
  micro-optimization.
- Minimal, localized change with no functional/architectural impact: it
  only short-circuits a barrier when there is provably nothing to
  synchronize. No API changes, no behavior changes when cleanup is
  actually needed.
- Preserves correctness: the synchronize call was introduced to
  serialize readers after updating a published group array (commit
  90f33bffa382). Skipping it is safe when there were no group updates.
Specific code and history analysis
- Barrier origin and purpose:
  - 90f33bffa382 added a post-update grace period to “make sure all see
    the newly published array before releasing RTNL” by calling
    `synchronize_rcu()` (later became `synchronize_net()`).
  - See 90f33bffa382: net/ipv4/nexthop.c: the barrier was added after
    removing a nexthop from groups.
- Current code path (pre-patch):
  - `remove_nexthop_from_groups()` iterates `nh->grp_list`, potentially
    updating group arrays via `remove_nh_grp_entry()`, then
    unconditionally calls `synchronize_net()`; net/ipv4/nexthop.c:2085
    and net/ipv4/nexthop.c:2094.
  - This function runs for non-group nexthops during deletion; see call
    site in `__remove_nexthop()`: net/ipv4/nexthop.c:2166. The RTNL lock
    is held across deletion (rtnl lock in `rtm_del_nexthop()`);
    net/ipv4/nexthop.c:3310.
- The patch’s exact change:
  - Adds an early return when there is nothing to remove:
    - New check: `if (list_empty(&nh->grp_list)) return;`
    - This prevents the unconditional `synchronize_net()` when `nh`
      belongs to no groups.
  - The loop and the barrier still run when there are entries to remove,
    preserving the original safety guarantee.
- Why the early return is safe:
  - If `nh->grp_list` is empty, no group arrays are modified; there is
    nothing to “publish” and thus no readers to wait out. The barrier
    exists purely to serialize readers after `rcu_assign_pointer()` of a
    new group array (e.g., in `remove_nh_grp_entry()`, which calls
    `rcu_assign_pointer(nhp->nh_grp, newg)`; around
    net/ipv4/nexthop.c:2020). With no modifications, the barrier
    protects nothing and only adds latency.
  - Concurrency context is correct: group membership modifications
    happen under RTNL, and `remove_nexthop_from_groups()` is called
    under RTNL; `list_empty()` on `nh->grp_list` is consistent. The list
    head is always initialized (`INIT_LIST_HEAD(&nh->grp_list)`;
    net/ipv4/nexthop.c:542).
  - Other RCU barriers in the file that protect real publications remain
    intact (e.g., in group replacement, `synchronize_net()` remains;
    net/ipv4/nexthop.c:2291).
Stable policy considerations
- Scope is tiny and self-contained (one function, one early return); no
  cross-subsystem impact.
- Not a feature; it is a performance fix for a behavior introduced by an
  earlier change (90f33bffa382) that added unconditional grace periods
  even when nothing changed.
- Risk of regression is very low: previously, the barrier was sometimes
  unnecessary. Now it remains when necessary and is skipped when
  provably unneeded. No change to notifier behavior or group update
  logic.
Practical backport notes
- Older stable trees may have `synchronize_rcu()` instead of
  `synchronize_net()` at the end of `remove_nexthop_from_groups()`. The
  early return remains valid and safe regardless; adapt the barrier name
  to the tree’s version if needed.
- The infrastructure used by the check (`nh->grp_list`) and usage
  context (RTNL held) are long-standing and present in stable kernels
  that have nexthop groups.
Conclusion
- This change is a classic stable backport candidate: important
  user-visible improvement, minimal risk, no semantics change, and
  tightly scoped to the nexthop cleanup path.
net/ipv4/nexthop.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 34137768e7f9a..15acfb74fd238 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -2087,6 +2087,12 @@ static void remove_nexthop_from_groups(struct net *net, struct nexthop *nh,
 {
 	struct nh_grp_entry *nhge, *tmp;
 
+	/* If there is nothing to do, let's avoid the costly call to
+	 * synchronize_net()
+	 */
+	if (list_empty(&nh->grp_list))
+		return;
+
 	list_for_each_entry_safe(nhge, tmp, &nh->grp_list, nh_list)
 		remove_nh_grp_entry(net, nhge, nlinfo);
 
-- 
2.51.0



    
