In order to select a nexthop for multipath routes, fib_select_multipath() is used with legacy nexthops and nexthop_select_path_hthr() is used with nexthop objects. Those two functions perform a validity test on the neighbor related to each nexthop but their logic is structured differently. This causes a divergence in behavior and nexthop_select_path_hthr() may return a nexthop that failed the neighbor validity test even if there was one that passed.
Refactor nexthop_select_path_hthr() to make it more similar to fib_select_multipath() and fix the problem mentioned above.
Benjamin Poirier (4): nexthop: Factor out hash threshold fdb nexthop selection nexthop: Factor out neighbor validity check nexthop: Do not return invalid nexthop object during multipath selection selftests: net: Add test cases for nexthop groups with invalid neighbors
net/ipv4/nexthop.c | 64 +++++++--- tools/testing/selftests/net/fib_nexthops.sh | 129 ++++++++++++++++++++ 2 files changed, 174 insertions(+), 19 deletions(-)
The loop in nexthop_select_path_hthr() includes code to check for neighbor validity. Since this does not apply to fdb nexthops, simplify the loop by moving the fdb nexthop selection to its own function.
Signed-off-by: Ido Schimmel idosch@nvidia.com Signed-off-by: Benjamin Poirier bpoirier@nvidia.com --- net/ipv4/nexthop.c | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index f95142e56da0..27089dea0ed0 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -1152,11 +1152,31 @@ static bool ipv4_good_nh(const struct fib_nh *nh) return !!(state & NUD_VALID); }
+static struct nexthop *nexthop_select_path_fdb(struct nh_group *nhg, int hash) +{ + int i; + + for (i = 0; i < nhg->num_nh; i++) { + struct nh_grp_entry *nhge = &nhg->nh_entries[i]; + + if (hash > atomic_read(&nhge->hthr.upper_bound)) + continue; + + return nhge->nh; + } + + WARN_ON_ONCE(1); + return NULL; +} + static struct nexthop *nexthop_select_path_hthr(struct nh_group *nhg, int hash) { struct nexthop *rc = NULL; int i;
+ if (nhg->fdb_nh) + return nexthop_select_path_fdb(nhg, hash); + for (i = 0; i < nhg->num_nh; ++i) { struct nh_grp_entry *nhge = &nhg->nh_entries[i]; struct nh_info *nhi; @@ -1165,8 +1185,6 @@ static struct nexthop *nexthop_select_path_hthr(struct nh_group *nhg, int hash) continue;
nhi = rcu_dereference(nhge->nh->nh_info); - if (nhi->fdb_nh) - return nhge->nh;
/* nexthops always check if it is good and does * not rely on a sysctl for this behavior
On 5/29/23 2:19 PM, Benjamin Poirier wrote:
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index f95142e56da0..27089dea0ed0 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -1152,11 +1152,31 @@ static bool ipv4_good_nh(const struct fib_nh *nh) return !!(state & NUD_VALID); } +static struct nexthop *nexthop_select_path_fdb(struct nh_group *nhg, int hash) +{
- int i;
- for (i = 0; i < nhg->num_nh; i++) {
struct nh_grp_entry *nhge = &nhg->nh_entries[i];
if (hash > atomic_read(&nhge->hthr.upper_bound))
continue;
return nhge->nh;
- }
- WARN_ON_ONCE(1);
I do not see how the stack is going to provide useful information; it should always be vxlan_xmit ... nexthop_select_path_fdb, right?
besides that: Reviewed-by: David Ahern dsahern@kernel.org
On 2023-05-30 08:57 -0600, David Ahern wrote:
On 5/29/23 2:19 PM, Benjamin Poirier wrote:
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index f95142e56da0..27089dea0ed0 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -1152,11 +1152,31 @@ static bool ipv4_good_nh(const struct fib_nh *nh) return !!(state & NUD_VALID); } +static struct nexthop *nexthop_select_path_fdb(struct nh_group *nhg, int hash) +{
- int i;
- for (i = 0; i < nhg->num_nh; i++) {
struct nh_grp_entry *nhge = &nhg->nh_entries[i];
if (hash > atomic_read(&nhge->hthr.upper_bound))
continue;
return nhge->nh;
- }
- WARN_ON_ONCE(1);
I do not see how the stack is going to provide useful information; it should always be vxlan_xmit ... nexthop_select_path_fdb, right?
Not always, it is also possible to have a resilient nhg with fdb nexthops. In that case, nexthop_select_path_fdb() is not called. In practice, I tried such a configuration and it does not work well. I have prepared a fix that I'll send after the current series has been dealt with.
Sorry for the long delay before my reply.
For legacy nexthops, there is fib_good_nh() to check the neighbor validity. In order to make the nexthop object code more similar to the legacy nexthop code, factor out the nexthop object neighbor validity check into its own function.
Signed-off-by: Ido Schimmel idosch@nvidia.com Signed-off-by: Benjamin Poirier bpoirier@nvidia.com --- net/ipv4/nexthop.c | 29 ++++++++++++++++------------- 1 file changed, 16 insertions(+), 13 deletions(-)
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index 27089dea0ed0..c12acbf39659 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -1152,6 +1152,20 @@ static bool ipv4_good_nh(const struct fib_nh *nh) return !!(state & NUD_VALID); }
+static bool nexthop_is_good_nh(const struct nexthop *nh) +{ + struct nh_info *nhi = rcu_dereference(nh->nh_info); + + switch (nhi->family) { + case AF_INET: + return ipv4_good_nh(&nhi->fib_nh); + case AF_INET6: + return ipv6_good_nh(&nhi->fib6_nh); + } + + return false; +} + static struct nexthop *nexthop_select_path_fdb(struct nh_group *nhg, int hash) { int i; @@ -1179,26 +1193,15 @@ static struct nexthop *nexthop_select_path_hthr(struct nh_group *nhg, int hash)
for (i = 0; i < nhg->num_nh; ++i) { struct nh_grp_entry *nhge = &nhg->nh_entries[i]; - struct nh_info *nhi;
if (hash > atomic_read(&nhge->hthr.upper_bound)) continue;
- nhi = rcu_dereference(nhge->nh->nh_info); - /* nexthops always check if it is good and does * not rely on a sysctl for this behavior */ - switch (nhi->family) { - case AF_INET: - if (ipv4_good_nh(&nhi->fib_nh)) - return nhge->nh; - break; - case AF_INET6: - if (ipv6_good_nh(&nhi->fib6_nh)) - return nhge->nh; - break; - } + if (nexthop_is_good_nh(nhge->nh)) + return nhge->nh;
if (!rc) rc = nhge->nh;
On 5/29/23 2:19 PM, Benjamin Poirier wrote:
For legacy nexthops, there is fib_good_nh() to check the neighbor validity. In order to make the nexthop object code more similar to the legacy nexthop code, factor out the nexthop object neighbor validity check into its own function.
Signed-off-by: Ido Schimmel idosch@nvidia.com Signed-off-by: Benjamin Poirier bpoirier@nvidia.com
net/ipv4/nexthop.c | 29 ++++++++++++++++------------- 1 file changed, 16 insertions(+), 13 deletions(-)
Reviewed-by: David Ahern dsahern@kernel.org
With legacy nexthops, when net.ipv4.fib_multipath_use_neigh is set, fib_select_multipath() will never set res->nhc to a nexthop that is not good (as per fib_good_nh()). OTOH, with nexthop objects, nexthop_select_path_hthr() may return a nexthop that failed the nexthop_is_good_nh() test even if there was one that passed. Refactor nexthop_select_path_hthr() to follow a selection logic more similar to fib_select_multipath().
The issue can be demonstrated with the following sequence of commands. The first block shows that things work as expected with legacy nexthops. The last sequence of `ip rou get` in the second block shows the problem case - some routes still use the .2 nexthop.
sysctl net.ipv4.fib_multipath_use_neigh=1 ip link add dummy1 up type dummy ip rou add 198.51.100.0/24 nexthop via 192.0.2.1 dev dummy1 onlink nexthop via 192.0.2.2 dev dummy1 onlink for i in {10..19}; do ip -o rou get 198.51.100.$i; done ip neigh add 192.0.2.1 dev dummy1 nud failed echo ".1 failed:" # results should not use .1 for i in {10..19}; do ip -o rou get 198.51.100.$i; done ip neigh del 192.0.2.1 dev dummy1 ip neigh add 192.0.2.2 dev dummy1 nud failed echo ".2 failed:" # results should not use .2 for i in {10..19}; do ip -o rou get 198.51.100.$i; done ip link del dummy1
ip link add dummy1 up type dummy ip nexthop add id 1 via 192.0.2.1 dev dummy1 onlink ip nexthop add id 2 via 192.0.2.2 dev dummy1 onlink ip nexthop add id 1001 group 1/2 ip rou add 198.51.100.0/24 nhid 1001 for i in {10..19}; do ip -o rou get 198.51.100.$i; done ip neigh add 192.0.2.1 dev dummy1 nud failed echo ".1 failed:" # results should not use .1 for i in {10..19}; do ip -o rou get 198.51.100.$i; done ip neigh del 192.0.2.1 dev dummy1 ip neigh add 192.0.2.2 dev dummy1 nud failed echo ".2 failed:" # results should not use .2 for i in {10..19}; do ip -o rou get 198.51.100.$i; done ip link del dummy1
Signed-off-by: Ido Schimmel idosch@nvidia.com Signed-off-by: Benjamin Poirier bpoirier@nvidia.com --- net/ipv4/nexthop.c | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index c12acbf39659..ca501ced04fb 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -1186,6 +1186,7 @@ static struct nexthop *nexthop_select_path_fdb(struct nh_group *nhg, int hash) static struct nexthop *nexthop_select_path_hthr(struct nh_group *nhg, int hash) { struct nexthop *rc = NULL; + bool first = false; int i;
if (nhg->fdb_nh) @@ -1194,20 +1195,24 @@ static struct nexthop *nexthop_select_path_hthr(struct nh_group *nhg, int hash) for (i = 0; i < nhg->num_nh; ++i) { struct nh_grp_entry *nhge = &nhg->nh_entries[i];
- if (hash > atomic_read(&nhge->hthr.upper_bound)) - continue; - /* nexthops always check if it is good and does * not rely on a sysctl for this behavior */ - if (nexthop_is_good_nh(nhge->nh)) - return nhge->nh; + if (!nexthop_is_good_nh(nhge->nh)) + continue;
- if (!rc) + if (!first) { rc = nhge->nh; + first = true; + } + + if (hash > atomic_read(&nhge->hthr.upper_bound)) + continue; + + return nhge->nh; }
- return rc; + return rc ? : nhg->nh_entries[0].nh; }
static struct nexthop *nexthop_select_path_res(struct nh_group *nhg, int hash)
On 5/29/23 2:19 PM, Benjamin Poirier wrote:
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index c12acbf39659..ca501ced04fb 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -1186,6 +1186,7 @@ static struct nexthop *nexthop_select_path_fdb(struct nh_group *nhg, int hash) static struct nexthop *nexthop_select_path_hthr(struct nh_group *nhg, int hash) { struct nexthop *rc = NULL;
- bool first = false; int i;
if (nhg->fdb_nh) @@ -1194,20 +1195,24 @@ static struct nexthop *nexthop_select_path_hthr(struct nh_group *nhg, int hash) for (i = 0; i < nhg->num_nh; ++i) { struct nh_grp_entry *nhge = &nhg->nh_entries[i];
if (hash > atomic_read(&nhge->hthr.upper_bound))
continue;
- /* nexthops always check if it is good and does
*/
- not rely on a sysctl for this behavior
if (nexthop_is_good_nh(nhge->nh))
return nhge->nh;
if (!nexthop_is_good_nh(nhge->nh))
continue;
if (!rc)
if (!first) {
Setting 'first' and 'rc' are equivalent, so 'first' is not needed. As I recall it was used in fib_select_multipath before the nexthop refactoring (eba618abacade) because nhsel == 0 is valid, so the loop could not rely on it.
rc = nhge->nh;
first = true;
}
if (hash > atomic_read(&nhge->hthr.upper_bound))
continue;
}return nhge->nh;
- return rc;
- return rc ? : nhg->nh_entries[0].nh;
} static struct nexthop *nexthop_select_path_res(struct nh_group *nhg, int hash)
On Tue, May 30, 2023 at 09:08:01AM -0600, David Ahern wrote:
On 5/29/23 2:19 PM, Benjamin Poirier wrote:
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index c12acbf39659..ca501ced04fb 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -1186,6 +1186,7 @@ static struct nexthop *nexthop_select_path_fdb(struct nh_group *nhg, int hash) static struct nexthop *nexthop_select_path_hthr(struct nh_group *nhg, int hash) { struct nexthop *rc = NULL;
- bool first = false; int i;
if (nhg->fdb_nh) @@ -1194,20 +1195,24 @@ static struct nexthop *nexthop_select_path_hthr(struct nh_group *nhg, int hash) for (i = 0; i < nhg->num_nh; ++i) { struct nh_grp_entry *nhge = &nhg->nh_entries[i];
if (hash > atomic_read(&nhge->hthr.upper_bound))
continue;
- /* nexthops always check if it is good and does
*/
- not rely on a sysctl for this behavior
if (nexthop_is_good_nh(nhge->nh))
return nhge->nh;
if (!nexthop_is_good_nh(nhge->nh))
continue;
if (!rc)
if (!first) {
Setting 'first' and 'rc' are equivalent, so 'first' is not needed.
Yea, looking at it again not sure what I was thinking...
Thanks for the review!
As I recall it was used in fib_select_multipath before the nexthop refactoring (eba618abacade) because nhsel == 0 is valid, so the loop could not rely on it.
rc = nhge->nh;
first = true;
}
if (hash > atomic_read(&nhge->hthr.upper_bound))
continue;
}return nhge->nh;
- return rc;
- return rc ? : nhg->nh_entries[0].nh;
} static struct nexthop *nexthop_select_path_res(struct nh_group *nhg, int hash)
Add test cases for hash threshold (multipath) nexthop groups with invalid neighbors. Check that a nexthop with invalid neighbor is not selected when there is another nexthop with a valid neighbor. Check that there is no crash when there is no nexthop with a valid neighbor.
The first test fails before the previous commit in this series.
Signed-off-by: Benjamin Poirier bpoirier@nvidia.com --- tools/testing/selftests/net/fib_nexthops.sh | 129 ++++++++++++++++++++ 1 file changed, 129 insertions(+)
diff --git a/tools/testing/selftests/net/fib_nexthops.sh b/tools/testing/selftests/net/fib_nexthops.sh index 0f5e88c8f4ff..54ec2b7b7b8c 100755 --- a/tools/testing/selftests/net/fib_nexthops.sh +++ b/tools/testing/selftests/net/fib_nexthops.sh @@ -29,6 +29,7 @@ IPV4_TESTS=" ipv4_large_res_grp ipv4_compat_mode ipv4_fdb_grp_fcnal + ipv4_mpath_select ipv4_torture ipv4_res_torture " @@ -42,6 +43,7 @@ IPV6_TESTS=" ipv6_large_res_grp ipv6_compat_mode ipv6_fdb_grp_fcnal + ipv6_mpath_select ipv6_torture ipv6_res_torture " @@ -370,6 +372,27 @@ check_large_res_grp() log_test $? 0 "Dump large (x$buckets) nexthop buckets" }
+get_route_dev() +{ + local pfx="$1" + local out + + if out=$($IP -j route get "$pfx" | jq -re ".[0].dev"); then + echo "$out" + fi +} + +check_route_dev() +{ + local pfx="$1" + local expected="$2" + local out + + out=$(get_route_dev "$pfx") + + check_output "$out" "$expected" +} + start_ip_monitor() { local mtype=$1 @@ -575,6 +598,112 @@ ipv4_fdb_grp_fcnal() $IP link del dev vx10 }
+ipv4_mpath_select() +{ + local rc dev match h addr + + echo + echo "IPv4 multipath selection" + echo "------------------------" + if [ ! -x "$(command -v jq)" ]; then + echo "SKIP: Could not run test; need jq tool" + return $ksft_skip + fi + + # Use status of existing neighbor entry when determining nexthop for + # multipath routes. + local -A gws + gws=([veth1]=172.16.1.2 [veth3]=172.16.2.2) + local -A other_dev + other_dev=([veth1]=veth3 [veth3]=veth1) + + run_cmd "$IP nexthop add id 1 via ${gws["veth1"]} dev veth1" + run_cmd "$IP nexthop add id 2 via ${gws["veth3"]} dev veth3" + run_cmd "$IP nexthop add id 1001 group 1/2" + run_cmd "$IP ro add 172.16.101.0/24 nhid 1001" + rc=0 + for dev in veth1 veth3; do + match=0 + for h in {1..254}; do + addr="172.16.101.$h" + if [ "$(get_route_dev "$addr")" = "$dev" ]; then + match=1 + break + fi + done + if (( match == 0 )); then + echo "SKIP: Did not find a route using device $dev" + return $ksft_skip + fi + run_cmd "$IP neigh add ${gws[$dev]} dev $dev nud failed" + if ! check_route_dev "$addr" "${other_dev[$dev]}"; then + rc=1 + break + fi + run_cmd "$IP neigh del ${gws[$dev]} dev $dev" + done + log_test $rc 0 "Use valid neighbor during multipath selection" + + run_cmd "$IP neigh add 172.16.1.2 dev veth1 nud incomplete" + run_cmd "$IP neigh add 172.16.2.2 dev veth3 nud incomplete" + run_cmd "$IP route get 172.16.101.1" + # if we did not crash, success + log_test $rc 0 "Multipath selection with no valid neighbor" +} + +ipv6_mpath_select() +{ + local rc dev match h addr + + echo + echo "IPv6 multipath selection" + echo "------------------------" + if [ ! -x "$(command -v jq)" ]; then + echo "SKIP: Could not run test; need jq tool" + return $ksft_skip + fi + + # Use status of existing neighbor entry when determining nexthop for + # multipath routes. + local -A gws + gws=([veth1]=2001:db8:91::2 [veth3]=2001:db8:92::2) + local -A other_dev + other_dev=([veth1]=veth3 [veth3]=veth1) + + run_cmd "$IP nexthop add id 1 via ${gws["veth1"]} dev veth1" + run_cmd "$IP nexthop add id 2 via ${gws["veth3"]} dev veth3" + run_cmd "$IP nexthop add id 1001 group 1/2" + run_cmd "$IP ro add 2001:db8:101::/64 nhid 1001" + rc=0 + for dev in veth1 veth3; do + match=0 + for h in {1..65535}; do + addr=$(printf "2001:db8:101::%x" $h) + if [ "$(get_route_dev "$addr")" = "$dev" ]; then + match=1 + break + fi + done + if (( match == 0 )); then + echo "SKIP: Did not find a route using device $dev" + return $ksft_skip + fi + run_cmd "$IP neigh add ${gws[$dev]} dev $dev nud failed" + if ! check_route_dev "$addr" "${other_dev[$dev]}"; then + rc=1 + break + fi + run_cmd "$IP neigh del ${gws[$dev]} dev $dev" + done + log_test $rc 0 "Use valid neighbor during multipath selection" + + run_cmd "$IP neigh add 2001:db8:91::2 dev veth1 nud incomplete" + run_cmd "$IP neigh add 2001:db8:92::2 dev veth3 nud incomplete" + run_cmd "$IP route get 2001:db8:101::1" + # if we did not crash, success + log_test $rc 0 "Multipath selection with no valid neighbor" +} + ################################################################################ # basic operations (add, delete, replace) on nexthops and nexthop groups #
On 5/29/23 2:19 PM, Benjamin Poirier wrote:
Add test cases for hash threshold (multipath) nexthop groups with invalid neighbors. Check that a nexthop with invalid neighbor is not selected when there is another nexthop with a valid neighbor. Check that there is no crash when there is no nexthop with a valid neighbor.
The first test fails before the previous commit in this series.
Signed-off-by: Benjamin Poirier bpoirier@nvidia.com
tools/testing/selftests/net/fib_nexthops.sh | 129 ++++++++++++++++++++ 1 file changed, 129 insertions(+)
Reviewed-by: David Ahern dsahern@kernel.org
linux-kselftest-mirror@lists.linaro.org