NOTE: I'm leaving Daynix Computing Ltd., for which I worked on this patch series, by the end of this month.
While net-next is closed, this is my last chance to send another version, so let me send the local changes now.
Please contact Yuri Benditovich, who is CCed on this email, for anything about this series.
virtio-net has two uses for hashes: one is RSS and the other is hash reporting. Conventionally, the hash calculation was done by the VMM. However, computing the hash after the queue has been chosen defeats the purpose of RSS.
Another approach is to use an eBPF steering program. This approach has a different downside: it cannot report the calculated hash due to the restrictive nature of eBPF.
Introduce code to compute hashes in the kernel in order to overcome these challenges.
An alternative solution is to extend the eBPF steering program so that it can report to userspace, but that is based on context rewrites, which are in feature freeze. We could adopt kfuncs, but they would not be UAPIs. We opt for ioctls to align with other relevant UAPIs (KVM and vhost_net).
The patches for QEMU to use this new feature were submitted as an RFC and are available at: https://patchew.org/QEMU/20250530-hash-v5-0-343d7d7a8200@daynix.com/
This work was presented at LPC 2024: https://lpc.events/event/18/contributions/1963/
V1 -> V2: Changed to introduce a new BPF program type.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
Changes in v12:
- Updated tools/testing/selftests/net/config.
- Split TUNSETVNETHASH.
- Link to v11: https://lore.kernel.org/r/20250317-rss-v11-0-4cacca92f31f@daynix.com

Changes in v11:
- Added the missing code to free vnet_hash in patch "tap: Introduce virtio-net hash feature".
- Link to v10: https://lore.kernel.org/r/20250313-rss-v10-0-3185d73a9af0@daynix.com

Changes in v10:
- Split common code and TUN/TAP-specific code into separate patches.
- Reverted a spurious style change in patch "tun: Introduce virtio-net hash feature".
- Added a comment explaining disable_ipv6 in tests.
- Used AF_PACKET for patch "selftest: tun: Add tests for virtio-net hashing". I also added the usage of FIXTURE_VARIANT() as the testing function now needs access to more variant-specific variables.
- Corrected the message of patch "selftest: tun: Add tests for virtio-net hashing"; it mentioned validation of configuration, but that is not in the scope of this patch.
- Expanded the description of patch "selftest: tun: Add tests for virtio-net hashing".
- Added patch "tun: Allow steering eBPF program to fall back".
- Changed to handle TUNGETVNETHASHCAP before taking the rtnl lock.
- Removed redundant tests for tun_vnet_ioctl().
- Added patch "selftest: tap: Add tests for virtio-net ioctls".
- Added a design explanation of ioctls for extensibility and migration.
- Removed a few branches in patch "vhost/net: Support VIRTIO_NET_F_HASH_REPORT".
- Link to v9: https://lore.kernel.org/r/20250307-rss-v9-0-df76624025eb@daynix.com

Changes in v9:
- Added a missing return statement in patch "tun: Introduce virtio-net hash feature".
- Link to v8: https://lore.kernel.org/r/20250306-rss-v8-0-7ab4f56ff423@daynix.com

Changes in v8:
- Disabled IPv6 to eliminate noise in tests.
- Added a branch in tap to avoid unnecessary dissection when hash reporting is disabled.
- Removed unnecessary rtnl_lock().
- Extracted code to handle new ioctls into separate functions to avoid adding extra NULL checks to the code handling other ioctls.
- Introduced a variable named "fd" to __tun_chr_ioctl().
- s/-/=/g in a patch message to avoid confusing Git.
- Link to v7: https://lore.kernel.org/r/20250228-rss-v7-0-844205cbbdd6@daynix.com

Changes in v7:
- Ensured to set hash_report to VIRTIO_NET_HASH_REPORT_NONE for VHOST_NET_F_VIRTIO_NET_HDR.
- s/4/sizeof(u32)/ in patch "virtio_net: Add functions for hashing".
- Added tap_skb_cb type.
- Rebased.
- Link to v6: https://lore.kernel.org/r/20250109-rss-v6-0-b1c90ad708f6@daynix.com

Changes in v6:
- Extracted changes to fill vnet header holes into another series.
- Squashed patches "skbuff: Introduce SKB_EXT_TUN_VNET_HASH", "tun: Introduce virtio-net hash reporting feature", and "tun: Introduce virtio-net RSS" into patch "tun: Introduce virtio-net hash feature".
- Dropped the RFC tag.
- Link to v5: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com

Changes in v5:
- Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
- Optimized the calculation of the hash value according to: https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324cab...
- Added patch "tun: Unify vnet implementation".
- Dropped patch "tap: Pad virtio header with zero".
- Added patch "selftest: tun: Test vnet ioctls without device".
- Reworked selftests to skip for older kernels.
- Documented the case when the underlying device is deleted and packets have queue_mapping set by TC.
- Reordered test harness arguments.
- Added code to handle fragmented packets.
- Link to v4: https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0e6c@daynix.com

Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and "tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured to free vnet_hash when destructing tun_struct.
- Link to v3: https://lore.kernel.org/r/20240915-rss-v3-0-c630015db082@daynix.com

Changes in v3:
- Reverted back to adding an ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into "tun: Introduce virtio-net hash reporting feature" and "tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: https://lore.kernel.org/r/20231015141644.260646-1-akihiko.odaki@daynix.com
---
Akihiko Odaki (10):
      virtio_net: Add functions for hashing
      net: flow_dissector: Export flow_keys_dissector_symmetric
      tun: Allow steering eBPF program to fall back
      tun: Add common virtio-net hash feature code
      tun: Introduce virtio-net hash feature
      tap: Introduce virtio-net hash feature
      selftest: tun: Test vnet ioctls without device
      selftest: tun: Add tests for virtio-net hashing
      selftest: tap: Add tests for virtio-net ioctls
      vhost/net: Support VIRTIO_NET_F_HASH_REPORT

 Documentation/networking/tuntap.rst  |   7 +
 drivers/net/Kconfig                  |   1 +
 drivers/net/ipvlan/ipvtap.c          |   2 +-
 drivers/net/macvtap.c                |   2 +-
 drivers/net/tap.c                    |  80 +++++-
 drivers/net/tun.c                    |  92 +++++--
 drivers/net/tun_vnet.h               | 165 +++++++++++-
 drivers/vhost/net.c                  |  68 ++---
 include/linux/if_tap.h               |   4 +-
 include/linux/skbuff.h               |   3 +
 include/linux/virtio_net.h           | 188 ++++++++++++++
 include/net/flow_dissector.h         |   1 +
 include/uapi/linux/if_tun.h          |  80 ++++++
 net/core/flow_dissector.c            |   3 +-
 net/core/skbuff.c                    |   4 +
 tools/testing/selftests/net/Makefile |   2 +-
 tools/testing/selftests/net/config   |   1 +
 tools/testing/selftests/net/tap.c    | 131 +++++++++-
 tools/testing/selftests/net/tun.c    | 485 ++++++++++++++++++++++++++++++++++-
 19 files changed, 1234 insertions(+), 85 deletions(-)
---
base-commit: 5cb8274d66c611b7889565c418a8158517810f9b
change-id: 20240403-rss-e737d89efa77
Best regards,
They are useful to implement VIRTIO_NET_F_RSS and VIRTIO_NET_F_HASH_REPORT.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Tested-by: Lei Yang <leiyang@redhat.com>
---
 include/linux/virtio_net.h | 188 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 188 insertions(+)
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 02a9f4dc594d..426f33b4b824 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -9,6 +9,194 @@
 #include <uapi/linux/tcp.h>
 #include <uapi/linux/virtio_net.h>
 
+struct virtio_net_hash {
+	u32 value;
+	u16 report;
+};
+
+struct virtio_net_toeplitz_state {
+	u32 hash;
+	const u32 *key;
+};
+
+#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
+					 VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
+					 VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
+					 VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
+					 VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
+					 VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
+
+#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
+
+static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
+{
+	while (len >= sizeof(*input)) {
+		*input = be32_to_cpu((__force __be32)*input);
+		input++;
+		len -= sizeof(*input);
+	}
+}
+
+static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state *state,
+					    const __be32 *input, size_t len)
+{
+	while (len >= sizeof(*input)) {
+		for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
+			u32 i = ffs(map);
+
+			state->hash ^= state->key[0] << (32 - i) |
+				       (u32)((u64)state->key[1] >> i);
+		}
+
+		state->key++;
+		input++;
+		len -= sizeof(*input);
+	}
+}
+
+static inline u8 virtio_net_hash_key_length(u32 types)
+{
+	size_t len = 0;
+
+	if (types & VIRTIO_NET_HASH_REPORT_IPv4)
+		len = max(len,
+			  sizeof(struct flow_dissector_key_ipv4_addrs));
+
+	if (types &
+	    (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
+		len = max(len,
+			  sizeof(struct flow_dissector_key_ipv4_addrs) +
+			  sizeof(struct flow_dissector_key_ports));
+
+	if (types & VIRTIO_NET_HASH_REPORT_IPv6)
+		len = max(len,
+			  sizeof(struct flow_dissector_key_ipv6_addrs));
+
+	if (types &
+	    (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
+		len = max(len,
+			  sizeof(struct flow_dissector_key_ipv6_addrs) +
+			  sizeof(struct flow_dissector_key_ports));
+
+	return len + sizeof(u32);
+}
+
+static inline u32 virtio_net_hash_report(u32 types,
+					 const struct flow_keys_basic *keys)
+{
+	switch (keys->basic.n_proto) {
+	case cpu_to_be16(ETH_P_IP):
+		if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
+			if (keys->basic.ip_proto == IPPROTO_TCP &&
+			    (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
+				return VIRTIO_NET_HASH_REPORT_TCPv4;
+
+			if (keys->basic.ip_proto == IPPROTO_UDP &&
+			    (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
+				return VIRTIO_NET_HASH_REPORT_UDPv4;
+		}
+
+		if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
+			return VIRTIO_NET_HASH_REPORT_IPv4;
+
+		return VIRTIO_NET_HASH_REPORT_NONE;
+
+	case cpu_to_be16(ETH_P_IPV6):
+		if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
+			if (keys->basic.ip_proto == IPPROTO_TCP &&
+			    (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
+				return VIRTIO_NET_HASH_REPORT_TCPv6;
+
+			if (keys->basic.ip_proto == IPPROTO_UDP &&
+			    (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
+				return VIRTIO_NET_HASH_REPORT_UDPv6;
+		}
+
+		if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
+			return VIRTIO_NET_HASH_REPORT_IPv6;
+
+		return VIRTIO_NET_HASH_REPORT_NONE;
+
+	default:
+		return VIRTIO_NET_HASH_REPORT_NONE;
+	}
+}
+
+static inline void virtio_net_hash_rss(const struct sk_buff *skb,
+				       u32 types, const u32 *key,
+				       struct virtio_net_hash *hash)
+{
+	struct virtio_net_toeplitz_state toeplitz_state = { .key = key };
+	struct flow_keys flow;
+	struct flow_keys_basic flow_basic;
+	u16 report;
+
+	if (!skb_flow_dissect_flow_keys(skb, &flow, 0)) {
+		hash->report = VIRTIO_NET_HASH_REPORT_NONE;
+		return;
+	}
+
+	flow_basic = (struct flow_keys_basic) {
+		.control = flow.control,
+		.basic = flow.basic
+	};
+
+	report = virtio_net_hash_report(types, &flow_basic);
+
+	switch (report) {
+	case VIRTIO_NET_HASH_REPORT_IPv4:
+		virtio_net_toeplitz_calc(&toeplitz_state,
+					 (__be32 *)&flow.addrs.v4addrs,
+					 sizeof(flow.addrs.v4addrs));
+		break;
+
+	case VIRTIO_NET_HASH_REPORT_TCPv4:
+		virtio_net_toeplitz_calc(&toeplitz_state,
+					 (__be32 *)&flow.addrs.v4addrs,
+					 sizeof(flow.addrs.v4addrs));
+		virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
+					 sizeof(flow.ports.ports));
+		break;
+
+	case VIRTIO_NET_HASH_REPORT_UDPv4:
+		virtio_net_toeplitz_calc(&toeplitz_state,
+					 (__be32 *)&flow.addrs.v4addrs,
+					 sizeof(flow.addrs.v4addrs));
+		virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
+					 sizeof(flow.ports.ports));
+		break;
+
+	case VIRTIO_NET_HASH_REPORT_IPv6:
+		virtio_net_toeplitz_calc(&toeplitz_state,
+					 (__be32 *)&flow.addrs.v6addrs,
+					 sizeof(flow.addrs.v6addrs));
+		break;
+
+	case VIRTIO_NET_HASH_REPORT_TCPv6:
+		virtio_net_toeplitz_calc(&toeplitz_state,
+					 (__be32 *)&flow.addrs.v6addrs,
+					 sizeof(flow.addrs.v6addrs));
+		virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
+					 sizeof(flow.ports.ports));
+		break;
+
+	case VIRTIO_NET_HASH_REPORT_UDPv6:
+		virtio_net_toeplitz_calc(&toeplitz_state,
+					 (__be32 *)&flow.addrs.v6addrs,
+					 sizeof(flow.addrs.v6addrs));
+		virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
+					 sizeof(flow.ports.ports));
+		break;
+
+	default:
+		hash->report = VIRTIO_NET_HASH_REPORT_NONE;
+		return;
+	}
+
+	hash->value = toeplitz_state.hash;
+	hash->report = report;
+}
+
 static inline bool virtio_net_hdr_match_proto(__be16 protocol, __u8 gso_type)
 {
 	switch (gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
default:
hash->report = VIRTIO_NET_HASH_REPORT_NONE;
return;
So I still think we need a comment here to explain why this is not an issue if the device can report HASH_XXX_EX. Or we need to add the support. Since this is the code from the driver side, I don't think we need to worry about the device implementation issues.
For the issue of the number of options, does the spec forbid fallback to VIRTIO_NET_HASH_REPORT_NONE? If not, we can do that.
Thanks
On 2025/06/03 12:19, Jason Wang wrote:
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
default:
hash->report = VIRTIO_NET_HASH_REPORT_NONE;
return;
So I still think we need a comment here to explain why this is not an issue if the device can report HASH_XXX_EX. Or we need to add the support. Since this is the code from the driver side, I don't think we need to worry about the device implementation issues.
This is on the device side, and it doesn't report HASH_TYPE_XXX_EX.
For the issue of the number of options, does the spec forbid fallback to VIRTIO_NET_HASH_REPORT_NONE? If not, we can do that.
5.1.6.4.3.4 "IPv6 packets with extension header" says:
If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6 header, the hash is calculated over the following fields:
- Home address from the home address option in the IPv6 destination options header. If the extension header is not present, use the Source IPv6 address.
- IPv6 address that is contained in the Routing-Header-Type-2 from the associated extension header. If the extension header is not present, use the Destination IPv6 address.
- Source TCP port
- Destination TCP port
Therefore, if VIRTIO_NET_HASH_TYPE_TCP_EX is set, the packet has a TCPv6 header, and a home address option is present in the IPv6 destination options header, the hash is calculated over the home address. If the hash is not calculated over the home address in such a case, the device contradicts this section and violates the spec. The same goes for the other HASH_TYPE_XXX_EX types and Routing-Header-Type-2.
Regards, Akihiko Odaki
On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/03 12:19, Jason Wang wrote:
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
Just to make sure we are on the same page. I meant:
1) If the hash is not calculated over the home address (in the case of the IPv6 destination options header), it can still report VIRTIO_NET_RSS_HASH_TYPE_IPv6. This is what you implemented in your series. So the device can simply fall back to e.g. TCPv6 if it can't understand all or part of the IPv6 options.
2) VIRTIO_NET_SUPPORTED_HASH_TYPES is not checked in tun_vnet_ioctl_sethash(), so userspace may set VIRTIO_NET_HASH_TYPE_TCP_EX regardless of what has been returned by tun_vnet_ioctl_gethashtypes(). In this case they won't get VIRTIO_NET_HASH_TYPE_TCP_EX.
3) Implementing part of the hash types might complicate the migration, or at least we need to describe the expectations of libvirt or other management layers in this case. For example, do we plan to have a dedicated QEMU command line like:
-device virtio-net-pci,hash_report=on,supported_hash_types=X,Y,Z?
Thanks
Regards, Akihiko Odaki
On 2025/06/04 10:18, Jason Wang wrote:
On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/03 12:19, Jason Wang wrote:
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
They are useful to implement VIRTIO_NET_F_RSS and VIRTIO_NET_F_HASH_REPORT.
Signed-off-by: Akihiko Odaki akihiko.odaki@daynix.com
Tested-by: Lei Yang leiyang@redhat.com
include/linux/virtio_net.h | 188 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 188 insertions(+)
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 02a9f4dc594d..426f33b4b824 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -9,6 +9,194 @@
 #include <uapi/linux/tcp.h>
 #include <uapi/linux/virtio_net.h>
+struct virtio_net_hash {
u32 value;
u16 report;
+};
+struct virtio_net_toeplitz_state {
u32 hash;
const u32 *key;
+};
+#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
+#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
+static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len) +{
while (len >= sizeof(*input)) {
*input = be32_to_cpu((__force __be32)*input);
input++;
len -= sizeof(*input);
}
+}
+static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state *state,
const __be32 *input, size_t len)
+{
while (len >= sizeof(*input)) {
for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
u32 i = ffs(map);
state->hash ^= state->key[0] << (32 - i) |
(u32)((u64)state->key[1] >> i);
}
state->key++;
input++;
len -= sizeof(*input);
}
+}
+static inline u8 virtio_net_hash_key_length(u32 types) +{
size_t len = 0;
if (types & VIRTIO_NET_HASH_REPORT_IPv4)
len = max(len,
sizeof(struct flow_dissector_key_ipv4_addrs));
if (types &
(VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
len = max(len,
sizeof(struct flow_dissector_key_ipv4_addrs) +
sizeof(struct flow_dissector_key_ports));
if (types & VIRTIO_NET_HASH_REPORT_IPv6)
len = max(len,
sizeof(struct flow_dissector_key_ipv6_addrs));
if (types &
(VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
len = max(len,
sizeof(struct flow_dissector_key_ipv6_addrs) +
sizeof(struct flow_dissector_key_ports));
return len + sizeof(u32);
+}
+static inline u32 virtio_net_hash_report(u32 types,
const struct flow_keys_basic *keys)
+{
switch (keys->basic.n_proto) {
case cpu_to_be16(ETH_P_IP):
if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
if (keys->basic.ip_proto == IPPROTO_TCP &&
(types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
return VIRTIO_NET_HASH_REPORT_TCPv4;
if (keys->basic.ip_proto == IPPROTO_UDP &&
(types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
return VIRTIO_NET_HASH_REPORT_UDPv4;
}
if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
return VIRTIO_NET_HASH_REPORT_IPv4;
return VIRTIO_NET_HASH_REPORT_NONE;
case cpu_to_be16(ETH_P_IPV6):
if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
if (keys->basic.ip_proto == IPPROTO_TCP &&
(types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
return VIRTIO_NET_HASH_REPORT_TCPv6;
if (keys->basic.ip_proto == IPPROTO_UDP &&
(types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
return VIRTIO_NET_HASH_REPORT_UDPv6;
}
if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
return VIRTIO_NET_HASH_REPORT_IPv6;
return VIRTIO_NET_HASH_REPORT_NONE;
default:
return VIRTIO_NET_HASH_REPORT_NONE;
}
+}
+static inline void virtio_net_hash_rss(const struct sk_buff *skb,
u32 types, const u32 *key,
struct virtio_net_hash *hash)
+{
struct virtio_net_toeplitz_state toeplitz_state = { .key = key };
struct flow_keys flow;
struct flow_keys_basic flow_basic;
u16 report;
if (!skb_flow_dissect_flow_keys(skb, &flow, 0)) {
hash->report = VIRTIO_NET_HASH_REPORT_NONE;
return;
}
flow_basic = (struct flow_keys_basic) {
.control = flow.control,
.basic = flow.basic
};
report = virtio_net_hash_report(types, &flow_basic);
switch (report) {
case VIRTIO_NET_HASH_REPORT_IPv4:
virtio_net_toeplitz_calc(&toeplitz_state,
(__be32 *)&flow.addrs.v4addrs,
sizeof(flow.addrs.v4addrs));
break;
case VIRTIO_NET_HASH_REPORT_TCPv4:
virtio_net_toeplitz_calc(&toeplitz_state,
(__be32 *)&flow.addrs.v4addrs,
sizeof(flow.addrs.v4addrs));
virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
sizeof(flow.ports.ports));
break;
case VIRTIO_NET_HASH_REPORT_UDPv4:
virtio_net_toeplitz_calc(&toeplitz_state,
(__be32 *)&flow.addrs.v4addrs,
sizeof(flow.addrs.v4addrs));
virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
sizeof(flow.ports.ports));
break;
case VIRTIO_NET_HASH_REPORT_IPv6:
virtio_net_toeplitz_calc(&toeplitz_state,
(__be32 *)&flow.addrs.v6addrs,
sizeof(flow.addrs.v6addrs));
break;
case VIRTIO_NET_HASH_REPORT_TCPv6:
virtio_net_toeplitz_calc(&toeplitz_state,
(__be32 *)&flow.addrs.v6addrs,
sizeof(flow.addrs.v6addrs));
virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
sizeof(flow.ports.ports));
break;
case VIRTIO_NET_HASH_REPORT_UDPv6:
virtio_net_toeplitz_calc(&toeplitz_state,
(__be32 *)&flow.addrs.v6addrs,
sizeof(flow.addrs.v6addrs));
virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports,
sizeof(flow.ports.ports));
break;
default:
hash->report = VIRTIO_NET_HASH_REPORT_NONE;
return;
So I still think we need a comment here to explain why this is not an issue if the device can report HASH_XXX_EX, or we need to add the support. Since this is code on the driver side, I don't think we need to worry about device implementation issues.
This is on the device side, and it doesn't report HASH_TYPE_XXX_EX.
For the issue of the number of options, does the spec forbid fallback to VIRTIO_NET_HASH_REPORT_NONE? If not, we can do that.
5.1.6.4.3.4 "IPv6 packets with extension header" says:
If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6 header, the hash is calculated over the following fields:
- Home address from the home address option in the IPv6 destination options header. If the extension header is not present, use the Source IPv6 address.
- IPv6 address that is contained in the Routing-Header-Type-2 from the associated extension header. If the extension header is not present, use the Destination IPv6 address.
- Source TCP port
- Destination TCP port
Therefore, if VIRTIO_NET_HASH_TYPE_TCP_EX is set, the packet has a TCPv6 header, and a home address option in the IPv6 destination options header is present, the hash is calculated over the home address. If the hash is not calculated over the home address in such a case, the device is contradicting this section and violating the spec. The same goes for the other HASH_TYPE_XXX_EX types and Routing-Header-Type-2.
Just to make sure we are on the same page. I meant:
- If the hash is not calculated over the home address (in the case of the IPv6 destination options header), it can still report VIRTIO_NET_RSS_HASH_TYPE_IPv6. This is what you implemented in your series. So the device can simply fall back to e.g. TCPv6 if it can't understand all or part of the IPv6 options.
The spec says it can fallback if "the extension header is not present", not if the device can't understand the extension header.
- VIRTIO_NET_SUPPORTED_HASH_TYPES is not checked in tun_vnet_ioctl_sethash(), so userspace may set VIRTIO_NET_HASH_TYPE_TCP_EX regardless of what has been returned by tun_vnet_ioctl_gethashtypes(). In this case they won't get VIRTIO_NET_HASH_TYPE_TCP_EX.
That's right. It's the responsibility of the userspace to set only the supported hash types.
- implementing part of the hash types might complicate the migration, or at least we need to describe the expectations of libvirt or other management layers in this case. For example, do we plan to have a dedicated QEMU command line like:
-device virtio-net-pci,hash_report=on,supported_hash_types=X,Y,Z?
I posted a patch series to implement such a command line for vDPA[1]. The patch series that wires this tuntap feature up[2] reuses the infrastructure so it doesn't bring additional complexity.
[1] https://lore.kernel.org/qemu-devel/20250530-vdpa-v1-0-5af4109b1c19@daynix.co... [2] https://lore.kernel.org/qemu-devel/20250530-hash-v5-0-343d7d7a8200@daynix.co...
Regards, Akihiko Odaki
On Wed, Jun 4, 2025 at 3:20 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/04 10:18, Jason Wang wrote:
On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/03 12:19, Jason Wang wrote:
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
[...]
So I still think we need a comment here to explain why this is not an issue if the device can report HASH_XXX_EX, or we need to add the support. Since this is code on the driver side, I don't think we need to worry about device implementation issues.
This is on the device side, and it doesn't report HASH_TYPE_XXX_EX.
For the issue of the number of options, does the spec forbid fallback to VIRTIO_NET_HASH_REPORT_NONE? If not, we can do that.
5.1.6.4.3.4 "IPv6 packets with extension header" says:
If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6 header, the hash is calculated over the following fields:
- Home address from the home address option in the IPv6 destination options header. If the extension header is not present, use the Source IPv6 address.
- IPv6 address that is contained in the Routing-Header-Type-2 from the associated extension header. If the extension header is not present, use the Destination IPv6 address.
- Source TCP port
- Destination TCP port
Therefore, if VIRTIO_NET_HASH_TYPE_TCP_EX is set, the packet has a TCPv6 header, and a home address option in the IPv6 destination options header is present, the hash is calculated over the home address. If the hash is not calculated over the home address in such a case, the device is contradicting this section and violating the spec. The same goes for the other HASH_TYPE_XXX_EX types and Routing-Header-Type-2.
Just to make sure we are on the same page. I meant:
- If the hash is not calculated over the home address (in the case of the IPv6 destination options header), it can still report VIRTIO_NET_RSS_HASH_TYPE_IPv6. This is what you implemented in your series. So the device can simply fall back to e.g. TCPv6 if it can't understand all or part of the IPv6 options.
The spec says it can fallback if "the extension header is not present", not if the device can't understand the extension header.
I don't think so:
1) the spec has a condition beforehand:
""" If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6 header, the hash is calculated over the following fields: ... If the extension header is not present ... """
So the device can choose not to set VIRTIO_NET_HASH_TYPE_TCP_EX, as the spec doesn't say the device MUST set VIRTIO_NET_HASH_TYPE_TCP_EX if ...
2) implementation-wise, since the device has limited resources, we can't expect it to parse an arbitrary number of IPv6 options
3) if 1) and 2) are not the case, we need to fix the spec; otherwise implementing a spec-compliant device is impractical
- VIRTIO_NET_SUPPORTED_HASH_TYPES is not checked in tun_vnet_ioctl_sethash(), so userspace may set VIRTIO_NET_HASH_TYPE_TCP_EX regardless of what has been returned by tun_vnet_ioctl_gethashtypes(). In this case they won't get VIRTIO_NET_HASH_TYPE_TCP_EX.
That's right. It's the responsibility of the userspace to set only the supported hash types.
Well, the kernel should filter out the unsupported ones to have a robust uAPI. Otherwise, we give a green light to buggy userspace, which will have unexpected results.
- implementing part of the hash types might complicate the migration, or at least we need to describe the expectations of libvirt or other management layers in this case. For example, do we plan to have a dedicated QEMU command line like:
-device virtio-net-pci,hash_report=on,supported_hash_types=X,Y,Z?
I posted a patch series to implement such a command line for vDPA[1]. The patch series that wires this tuntap feature up[2] reuses the infrastructure so it doesn't bring additional complexity.
[1] https://lore.kernel.org/qemu-devel/20250530-vdpa-v1-0-5af4109b1c19@daynix.co... [2] https://lore.kernel.org/qemu-devel/20250530-hash-v5-0-343d7d7a8200@daynix.co...
I meant, if we implement the full hash report feature, a single hash cmdline option is more than sufficient, and compatibility code can just turn it off when dealing with machine types. This is much simpler than:
1) having both hash and supported_hash_features
2) dealing with both hash and supported_hash_features in compatibility code
3) libvirt will be happy
For [1], it seems it introduces a per-hash-type option, which seems to be a burden to the management layer as it needs to learn a new option every time a new hash type is supported.
Thanks
Regards, Akihiko Odaki
On 2025/06/05 10:53, Jason Wang wrote:
On Wed, Jun 4, 2025 at 3:20 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/04 10:18, Jason Wang wrote:
On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/03 12:19, Jason Wang wrote:
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
[...]
So I still think we need a comment here to explain why this is not an issue if the device can report HASH_XXX_EX, or we need to add the support. Since this is code on the driver side, I don't think we need to worry about device implementation issues.
This is on the device side, and it doesn't report HASH_TYPE_XXX_EX.
For the issue of the number of options, does the spec forbid fallback to VIRTIO_NET_HASH_REPORT_NONE? If not, we can do that.
5.1.6.4.3.4 "IPv6 packets with extension header" says:
If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6 header, the hash is calculated over the following fields:
- Home address from the home address option in the IPv6 destination options header. If the extension header is not present, use the Source IPv6 address.
- IPv6 address that is contained in the Routing-Header-Type-2 from the associated extension header. If the extension header is not present, use the Destination IPv6 address.
- Source TCP port
- Destination TCP port
Therefore, if VIRTIO_NET_HASH_TYPE_TCP_EX is set, the packet has a TCPv6 header, and a home address option in the IPv6 destination options header is present, the hash is calculated over the home address. If the hash is not calculated over the home address in such a case, the device is contradicting this section and violating the spec. The same goes for the other HASH_TYPE_XXX_EX types and Routing-Header-Type-2.
Just to make sure we are on the same page. I meant:
- If the hash is not calculated over the home address (in the case of the IPv6 destination options header), it can still report VIRTIO_NET_RSS_HASH_TYPE_IPv6. This is what you implemented in your series. So the device can simply fall back to e.g. TCPv6 if it can't understand all or part of the IPv6 options.
The spec says it can fallback if "the extension header is not present", not if the device can't understand the extension header.
I don't think so:
- the spec has a condition beforehand:
""" If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6 header, the hash is calculated over the following fields: ... If the extension header is not present ... """
So the device can choose not to set VIRTIO_NET_HASH_TYPE_TCP_EX, as the spec doesn't say the device MUST set VIRTIO_NET_HASH_TYPE_TCP_EX if ...
- implementation-wise, since the device has limited resources, we can't expect it to parse an arbitrary number of IPv6 options
- if 1) and 2) are not the case, we need to fix the spec; otherwise implementing a spec-compliant device is impractical
The statement is preceded by the following:
The device calculates the hash on IPv4 packets according to ’Enabled hash types’ bitmask as follows:
The 'Enabled hash types' bitmask is specified by the device.
I think the spec needs amendment.
I wonder if there are any people interested in the feature, though. Looking at virtnet_set_hashflow() in drivers/net/virtio_net.c, the Linux driver does not let users configure HASH_TYPE_XXX_EX. I suppose Windows supports HASH_TYPE_XXX_EX, but those who care most about network performance would use Linux, so HASH_TYPE_XXX_EX support without Linux driver support may not be useful.
- VIRTIO_NET_SUPPORTED_HASH_TYPES is not checked in tun_vnet_ioctl_sethash(), so userspace may set VIRTIO_NET_HASH_TYPE_TCP_EX regardless of what has been returned by tun_vnet_ioctl_gethashtypes(). In this case they won't get VIRTIO_NET_HASH_TYPE_TCP_EX.
That's right. It's the responsibility of the userspace to set only the supported hash types.
Well, the kernel should filter out the unsupported one to have a robust uAPI. Otherwise, we give green light to the buggy userspace which will have unexpected results.
My reasoning was that it may be fine for some use cases other than VMs (e.g., DPDK); in such a use case, it is fine as long as the UAPI works on a best-effort basis.
For example, suppose a userspace program that processes TCP packets; the program can enable HASH_TYPE_IPv4, HASH_TYPE_TCPv4, HASH_TYPE_IPv6, and HASH_TYPE_TCPv6. Ideally the kernel should support all the hash types, but even if e.g. HASH_TYPE_TCPv6 is not available, it will fall back to HASH_TYPE_IPv6, which still does something useful and may be acceptable.
That said, for a use case that involves VM and implements virtio-net (e.g., QEMU), setting an unsupported hash type here is definitely a bug. Catching the bug may outweigh the extra trouble for other use cases.
- implementing part of the hash types might complicate the migration, or at least we need to describe the expectations of libvirt or other management layers in this case. For example, do we plan to have a dedicated QEMU command line like:
-device virtio-net-pci,hash_report=on,supported_hash_types=X,Y,Z?
I posted a patch series to implement such a command line for vDPA[1]. The patch series that wires this tuntap feature up[2] reuses the infrastructure so it doesn't bring additional complexity.
[1] https://lore.kernel.org/qemu-devel/20250530-vdpa-v1-0-5af4109b1c19@daynix.co... [2] https://lore.kernel.org/qemu-devel/20250530-hash-v5-0-343d7d7a8200@daynix.co...
I meant, if we implement the full hash report feature, a single hash cmdline option is more than sufficient, and compatibility code can just turn it off when dealing with machine types. This is much simpler than:
- having both hash and supported_hash_features
- dealing with both hash and supported_hash_features in compatibility code
- libvirt will be happy
For [1], it seems it introduces a per-hash-type option, which seems to be a burden to the management layer as it needs to learn a new option every time a new hash type is supported.
Even with the command line you proposed (supported_hash_types=X,Y,Z), it is still necessary to know the values the supported_hash_types property accepts (X,Y,Z), so I don't think it makes a difference.
The burden on the management layer is already present for features, so it is an existing problem (or a mere extension of it).
This problem was discussed in the following thread in the past, but no solution has been implemented yet, and solving it will probably be difficult. https://lore.kernel.org/qemu-devel/20230731223148.1002258-5-yuri.benditovich...
Regards, Akihiko Odaki
On Thu, Jun 5, 2025 at 3:58 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/05 10:53, Jason Wang wrote:
On Wed, Jun 4, 2025 at 3:20 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/04 10:18, Jason Wang wrote:
On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/03 12:19, Jason Wang wrote:
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
[...]
+ struct flow_keys_basic flow_basic; > + u16 report; > + > + if (!skb_flow_dissect_flow_keys(skb, &flow, 0)) { > + hash->report = VIRTIO_NET_HASH_REPORT_NONE; > + return; > + } > + > + flow_basic = (struct flow_keys_basic) { > + .control = flow.control, > + .basic = flow.basic > + }; > + > + report = virtio_net_hash_report(types, &flow_basic); > + > + switch (report) { > + case VIRTIO_NET_HASH_REPORT_IPv4: > + virtio_net_toeplitz_calc(&toeplitz_state, > + (__be32 *)&flow.addrs.v4addrs, > + sizeof(flow.addrs.v4addrs)); > + break; > + > + case VIRTIO_NET_HASH_REPORT_TCPv4: > + virtio_net_toeplitz_calc(&toeplitz_state, > + (__be32 *)&flow.addrs.v4addrs, > + sizeof(flow.addrs.v4addrs)); > + virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports, > + sizeof(flow.ports.ports)); > + break; > + > + case VIRTIO_NET_HASH_REPORT_UDPv4: > + virtio_net_toeplitz_calc(&toeplitz_state, > + (__be32 *)&flow.addrs.v4addrs, > + sizeof(flow.addrs.v4addrs)); > + virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports, > + sizeof(flow.ports.ports)); > + break; > + > + case VIRTIO_NET_HASH_REPORT_IPv6: > + virtio_net_toeplitz_calc(&toeplitz_state, > + (__be32 *)&flow.addrs.v6addrs, > + sizeof(flow.addrs.v6addrs)); > + break; > + > + case VIRTIO_NET_HASH_REPORT_TCPv6: > + virtio_net_toeplitz_calc(&toeplitz_state, > + (__be32 *)&flow.addrs.v6addrs, > + sizeof(flow.addrs.v6addrs)); > + virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports, > + sizeof(flow.ports.ports)); > + break; > + > + case VIRTIO_NET_HASH_REPORT_UDPv6: > + virtio_net_toeplitz_calc(&toeplitz_state, > + (__be32 *)&flow.addrs.v6addrs, > + sizeof(flow.addrs.v6addrs)); > + virtio_net_toeplitz_calc(&toeplitz_state, &flow.ports.ports, > + sizeof(flow.ports.ports)); > + break; > + > + default: > + hash->report = VIRTIO_NET_HASH_REPORT_NONE; > + return;
On 2025/06/06 9:48, Jason Wang wrote:
On Thu, Jun 5, 2025 at 3:58 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/05 10:53, Jason Wang wrote:
On Wed, Jun 4, 2025 at 3:20 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/04 10:18, Jason Wang wrote:
On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/03 12:19, Jason Wang wrote:

>> +	default:
>> +		hash->report = VIRTIO_NET_HASH_REPORT_NONE;
>> +		return;
>
> So I still think we need a comment here to explain why this is not an issue if the device can report HASH_XXX_EX. Or we need to add the support; since this is the code from the driver side, I don't think we need to worry about the device implementation issues.
This is on the device side, and it doesn't report HASH_TYPE_XXX_EX.
> > For the issue of the number of options, does the spec forbid fallback > to VIRTIO_NET_HASH_REPORT_NONE? If not, we can do that.
5.1.6.4.3.4 "IPv6 packets with extension header" says: > If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6 > header, the hash is calculated over the following fields: > - Home address from the home address option in the IPv6 destination > options header. If the extension header is not present, use the > Source IPv6 address. > - IPv6 address that is contained in the Routing-Header-Type-2 from the > associated extension header. If the extension header is not present, > use the Destination IPv6 address. > - Source TCP port > - Destination TCP port
Therefore, if VIRTIO_NET_HASH_TYPE_TCP_EX is set, the packet has a TCPv6 header, and a home address option is present in the IPv6 destination options header, the hash is calculated over the home address. If the hash is not calculated over the home address in such a case, the device is contradicting this section and violating the spec. The same goes for the other HASH_TYPE_XXX_EX types and Routing-Header-Type-2.
Just to make sure we are on the same page. I meant:
- If the hash is not calculated over the home address (in the case of the IPv6 destination options header), it can still report VIRTIO_NET_RSS_HASH_TYPE_IPv6. This is what you implemented in your series. So the device can simply fall back to e.g. TCPv6 if it can't understand all or part of the IPv6 options.
The spec says it can fall back if "the extension header is not present", not if the device can't understand the extension header.
I don't think so,
- spec had a condition beforehand:
""" If VIRTIO_NET_HASH_TYPE_TCP_EX is set and the packet has a TCPv6 header, the hash is calculated over the following fields: ... If the extension header is not present ... """
So the device can choose not to set VIRTIO_NET_HASH_TYPE_TCP_EX, as the spec doesn't say the device MUST set VIRTIO_NET_HASH_TYPE_TCP_EX if ...
- implementation-wise, since the device has limited resources, we can't expect it to parse an arbitrary number of IPv6 options
- if 1) and 2) are not the case, we need to fix the spec; otherwise implementing a spec-compliant device is impractical
The statement is preceded by the following:
The device calculates the hash on IPv4 packets according to ’Enabled hash types’ bitmask as follows:
The 'Enabled hash types' bitmask is specified by the device.
I think the spec needs amendment.
Michael, can you help to clarify here?
I wonder if there are any people interested in the feature though. Looking at virtnet_set_hashflow() in drivers/net/virtio_net.c, the Linux driver does not let users configure HASH_TYPE_XXX_EX. I suppose Windows supports HASH_TYPE_XXX_EX, but those who care most about network performance would use Linux, so HASH_TYPE_XXX_EX support without the Linux driver's support may not be useful.
It might be still interesting for example for the hardware virtio vendors to support windows etc.
I don't know if Windows needs them for e.g., device/driver certification so surveying Windows makes sense.
- VIRTIO_NET_SUPPORTED_HASH_TYPES is not checked against the hash types passed to tun_vnet_ioctl_sethash(), so userspace may set VIRTIO_NET_HASH_TYPE_TCP_EX regardless of what has been returned by tun_vnet_ioctl_gethashtypes(). In this case they won't get VIRTIO_NET_HASH_TYPE_TCP_EX.
That's right. It's the responsibility of the userspace to set only the supported hash types.
Well, the kernel should filter out the unsupported ones to have a robust uAPI. Otherwise, we give a green light to buggy userspace, which will have unexpected results.
My reasoning was that it may be fine for some use cases other than VMs (e.g., DPDK); in such a use case, it is fine as long as the UAPI works on a best-effort basis.
Best-effort might increase the chance of user-visible changes after migration.
It is a trade-off between catching a migration bug for VMM and making life a bit easier for userspace programs other than VMM.
For example, suppose a userspace program that processes TCP packets; the program can enable: HASH_TYPE_IPv4, HASH_TYPE_TCPv4, HASH_TYPE_IPv6, and HASH_TYPE_TCPv6. Ideally, the kernel should support all the hash types, but, even if e.g., HASH_TYPE_TCPv6 is not available,
For "available" did you mean it is not supported by the device?
it will fall back to HASH_TYPE_IPv6, which still does something good and may be acceptable.
This fallback is exactly the same as I said above, let VIRTIO_NET_HASH_TYPE_TCP_EX to fallback.
My point is that the implementation should either:
1) allow fallback so it can claim to support all hash types
or
2) not allow fallback so it can only support a part of the hash types
Here we're doing something in the middle, for example, allowing part of the types to fall back.
1) or the middle will make it unsuitable for VM because it violates the virtio spec. 2) makes sense though the trade-off I mentioned should be taken into consideration.
That said, for a use case that involves VM and implements virtio-net (e.g., QEMU), setting an unsupported hash type here is definitely a bug. Catching the bug may outweigh the extra trouble for other use cases.
- implementing only part of the hash types might complicate migration, or at least we need to describe the expectations of libvirt or other management layers in this case. For example, do we plan to have a dedicated QEMU command line like:
-device virtio-net-pci,hash_report=on,supported_hash_types=X,Y,Z?
I posted a patch series to implement such a command line for vDPA[1]. The patch series that wires this tuntap feature up[2] reuses the infrastructure so it doesn't bring additional complexity.
[1] https://lore.kernel.org/qemu-devel/20250530-vdpa-v1-0-5af4109b1c19@daynix.co... [2] https://lore.kernel.org/qemu-devel/20250530-hash-v5-0-343d7d7a8200@daynix.co...
I meant, if we implement the full hash report feature, a single hash cmdline option is more than sufficient, and compatibility code can just turn it off when dealing with machine types. This is much simpler than
- having both hash and supported_hash_features
- dealing with both hash and supported_hash_features in compatibility code
- libvirt will be happy
For [1], it seems it introduces a per-hash-type option; this seems to be a burden for the management layer as it needs to learn a new option every time a new hash type is supported.
Even with the command line you proposed (supported_hash_types=X,Y,Z), it is still necessary to know the values the supported_hash_types property accepts (X,Y,Z), so I don't think it makes a difference.
It could be a uint32_t.
The management layer will need to know what bits are accepted even with uint32_t.
The burden to the management layer is already present for features, so it is an existing problem (or its mere extension).
Yes, but since this feature is new it's better to try our best to avoid that.
This problem was discussed in the following thread in the past, but no solution is implemented yet, and probably solving it will be difficult. https://lore.kernel.org/qemu-devel/20230731223148.1002258-5-yuri.benditovich...
It's a similar issue but not the same, it looks more like a discussion on whether the fallback from vhost-net to qemu works for missing features etc.
Perhaps we may be able to do better since this feature is new as you say and we don't have to worry much about breaking change. I don't have an idea for that yet.
Regards, Akihiko Odaki
flow_keys_dissector_symmetric is useful to derive a symmetric hash and to know its source, such as IPv4, IPv6, TCP, and UDP.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Tested-by: Lei Yang <leiyang@redhat.com>
---
 include/net/flow_dissector.h | 1 +
 net/core/flow_dissector.c    | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index ced79dc8e856..d01c1ec77b7d 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -423,6 +423,7 @@ __be32 flow_get_u32_src(const struct flow_keys *flow);
 __be32 flow_get_u32_dst(const struct flow_keys *flow);
 
 extern struct flow_dissector flow_keys_dissector;
+extern struct flow_dissector flow_keys_dissector_symmetric;
 extern struct flow_dissector flow_keys_basic_dissector;
 
 /* struct flow_keys_digest:
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 9cd8de6bebb5..32c7ee31330c 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1862,7 +1862,8 @@ void make_flow_keys_digest(struct flow_keys_digest *digest,
 }
 EXPORT_SYMBOL(make_flow_keys_digest);
 
-static struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
+struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
+EXPORT_SYMBOL(flow_keys_dissector_symmetric);
 
 u32 __skb_get_hash_symmetric_net(const struct net *net, const struct sk_buff *skb)
 {
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
flow_keys_dissector_symmetric is useful to derive a symmetric hash and to know its source such as IPv4, IPv6, TCP, and UDP.
Signed-off-by: Akihiko Odaki akihiko.odaki@daynix.com Tested-by: Lei Yang leiyang@redhat.com
Acked-by: Jason Wang jasowang@redhat.com
Thanks
This clarifies that a steering eBPF program takes precedence over the other steering algorithms.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 Documentation/networking/tuntap.rst |  7 +++++++
 drivers/net/tun.c                   | 28 +++++++++++++++++-----------
 include/uapi/linux/if_tun.h         |  9 +++++++++
 3 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/Documentation/networking/tuntap.rst b/Documentation/networking/tuntap.rst
index 4d7087f727be..86b4ae8caa8a 100644
--- a/Documentation/networking/tuntap.rst
+++ b/Documentation/networking/tuntap.rst
@@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
       return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
   }
 
+3.4 Reference
+-------------
+
+``linux/if_tun.h`` defines the interface described below:
+
+.. kernel-doc:: include/uapi/linux/if_tun.h
+
 Universal TUN/TAP device driver Frequently Asked Question
 =========================================================
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d8f4d3e996a7..9133ab9ed3f5 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -476,21 +476,29 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
 	return txq;
 }
 
-static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
+static bool tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb,
+				  u16 *ret)
 {
 	struct tun_prog *prog;
 	u32 numqueues;
-	u16 ret = 0;
+	u32 prog_ret;
+
+	prog = rcu_dereference(tun->steering_prog);
+	if (!prog)
+		return false;
 
 	numqueues = READ_ONCE(tun->numqueues);
-	if (!numqueues)
-		return 0;
+	if (!numqueues) {
+		*ret = 0;
+		return true;
+	}
 
-	prog = rcu_dereference(tun->steering_prog);
-	if (prog)
-		ret = bpf_prog_run_clear_cb(prog->prog, skb);
+	prog_ret = bpf_prog_run_clear_cb(prog->prog, skb);
+	if (prog_ret == TUN_STEERINGEBPF_FALLBACK)
+		return false;
 
-	return ret % numqueues;
+	*ret = (u16)prog_ret % numqueues;
+	return true;
 }
 
 static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
@@ -500,9 +508,7 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
 	u16 ret;
 
 	rcu_read_lock();
-	if (rcu_dereference(tun->steering_prog))
-		ret = tun_ebpf_select_queue(tun, skb);
-	else
+	if (!tun_ebpf_select_queue(tun, skb, &ret))
 		ret = tun_automq_select_queue(tun, skb);
 	rcu_read_unlock();
 
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 287cdc81c939..980de74724fc 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -115,4 +115,13 @@ struct tun_filter {
 	__u8 addr[][ETH_ALEN];
 };
 
+/**
+ * define TUN_STEERINGEBPF_FALLBACK - A steering eBPF return value to fall back
+ *
+ * A steering eBPF program may return this value to fall back to the steering
+ * algorithm that should have been used if the program was not set. This allows
+ * selectively overriding the steering decision.
+ */
+#define TUN_STEERINGEBPF_FALLBACK -1
+
 #endif /* _UAPI__IF_TUN_H */
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
This clarifies a steering eBPF program takes precedence over the other steering algorithms.
Let's give an example of the use case for this.
Signed-off-by: Akihiko Odaki akihiko.odaki@daynix.com
+	prog_ret = bpf_prog_run_clear_cb(prog->prog, skb);
+	if (prog_ret == TUN_STEERINGEBPF_FALLBACK)
+		return false;
This seems to break the uAPI. So I think we need a new ioctl to enable the behaviour
return ret % numqueues;
*ret = (u16)prog_ret % numqueues;
return true;
}
@@ -500,9 +508,7 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
 	u16 ret;
 
 	rcu_read_lock();
-	if (rcu_dereference(tun->steering_prog))
-		ret = tun_ebpf_select_queue(tun, skb);
-	else
+	if (!tun_ebpf_select_queue(tun, skb, &ret))
 		ret = tun_automq_select_queue(tun, skb);
 	rcu_read_unlock();
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h index 287cdc81c939..980de74724fc 100644 --- a/include/uapi/linux/if_tun.h +++ b/include/uapi/linux/if_tun.h @@ -115,4 +115,13 @@ struct tun_filter { __u8 addr[][ETH_ALEN]; };
+/**
+ * define TUN_STEERINGEBPF_FALLBACK - A steering eBPF return value to fall back
+ *
+ * A steering eBPF program may return this value to fall back to the steering
+ * algorithm that should have been used if the program was not set. This allows
+ * selectively overriding the steering decision.
+ */
+#define TUN_STEERINGEBPF_FALLBACK -1
I'm not a native speaker, but considering that it works more like XDP_PASS, would it be better to use "TUN_STEERING_PASS"?
Thanks
#endif /* _UAPI__IF_TUN_H */
-- 2.49.0
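To make the proposed fallback semantics concrete, here is a minimal userspace model (not kernel code; `ebpf_select_queue` and `STEERING_FALLBACK` are illustrative stand-ins for tun_ebpf_select_queue() and TUN_STEERINGEBPF_FALLBACK) of how the steering program's return value either selects a queue or defers to automq:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for TUN_STEERINGEBPF_FALLBACK (-1 seen as u32). */
#define STEERING_FALLBACK 0xffffffffu

/*
 * Model of the proposed flow: returns false when the caller should fall
 * back to automatic steering, true when *queue has been chosen.
 */
static bool ebpf_select_queue(uint32_t prog_ret, uint32_t numqueues,
			      uint16_t *queue)
{
	if (!numqueues) {
		*queue = 0;
		return true;
	}

	if (prog_ret == STEERING_FALLBACK)
		return false;	/* defer to tun_automq_select_queue() */

	*queue = (uint16_t)prog_ret % numqueues;
	return true;
}
```

The caller then mirrors the patched tun_select_queue(): only if this returns false does it run the automatic steering path.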
On 2025/06/04 10:27, Jason Wang wrote:
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
[...]
This seems to break the uAPI. So I think we need a new ioctl to enable the behaviour
I assumed it is fine to repurpose one of the 32-bit values since a 32-bit integer is too big to specify a queue number, but it may not be fine. I don't have a concrete use case either.
Perhaps it is safer to note that TUNSETSTEERINGEBPF takes precedence over TUNSETVNETRSS to allow such an extension in the future (but without implementing one now).
[...]
I'm not a native speaker, but considering that it works more like XDP_PASS, would it be better to use "TUN_STEERING_PASS"?
That sounds indeed better to me.
Regards, Akihiko Odaki
On Wed, Jun 4, 2025 at 3:24 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/04 10:27, Jason Wang wrote:
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
[...]
This seems to break the uAPI. So I think we need a new ioctl to enable the behaviour
I assumed it is fine to repurpose one of the 32-bit integer values since 32-bit integer is too big to specify the queue number, but it may not be fine. I don't have a concrete use case either.
Perhaps it is safer to note that TUNSETSTEERINGEBPF takes precedence over TUNSETVNETRSS to allow such an extension in the future (but without implementing one now).
Yes, that's my original point. Let's start from something that is simpler and safer.
Thanks
Add common code required for the features being added to TUN and TAP. They will be enabled for each of them in the following patches.
Added Features
==============
Hash reporting
--------------
Allow the guest to reuse the hash value to make receive steering consistent between the host and guest, and to save hash computation.
Receive Side Scaling (RSS)
--------------------------
RSS is a receive steering algorithm that can be negotiated for use with virtio_net. Conventionally the hash calculation was done by the VMM. However, computing the hash after the queue has been chosen defeats the purpose of RSS.

Another approach is to use an eBPF steering program. This approach has another downside: it cannot report the calculated hash due to the restrictive nature of the eBPF steering program.

Introduce the code to perform RSS in the kernel in order to overcome these challenges. An alternative solution is to extend the eBPF steering program so that it can report the hash to userspace, but I didn't opt for it: extending the current eBPF steering program mechanism as-is relies on legacy context rewriting, and introducing kfunc-based eBPF would create a non-UAPI dependency, while the other relevant virtualization APIs such as KVM and vhost_net are UAPIs.
Added ioctls
============
They are designed with extensibility and VM migration compatibility in mind. This change only adds the implementation and does not expose the ioctls to userspace.
TUNGETVNETHASHTYPES
-------------------
This ioctl reports the supported hash types. It is useful to check whether a VM can be migrated to the current host.
TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
--------------------------------------------------------------------
These ioctls configure a steering algorithm and, if needed, hash reporting.
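As a sketch of the migration check described above: a VMM could compare the hash types a VM requires against what the destination host reports via TUNGETVNETHASHTYPES. The `can_migrate` helper below is hypothetical, and the ioctl call shown in the comment assumes the numbers proposed in this patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A VM configured with vm_hash_types can only be migrated to a host whose
 * TUNGETVNETHASHTYPES result covers all of those types.
 */
static bool can_migrate(uint32_t host_hash_types, uint32_t vm_hash_types)
{
	return (vm_hash_types & ~host_hash_types) == 0;
}

/*
 * Usage sketch (not run here; tun_fd is a tun device fd set up with
 * TUNSETIFF):
 *
 *	uint32_t host_types;
 *
 *	if (ioctl(tun_fd, TUNGETVNETHASHTYPES, &host_types) < 0)
 *		// handle error
 *	if (!can_migrate(host_types, vm_types))
 *		// refuse the migration
 */
```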
Signed-off-by: Akihiko Odaki akihiko.odaki@daynix.com Tested-by: Lei Yang leiyang@redhat.com --- drivers/net/tap.c | 10 ++- drivers/net/tun.c | 12 +++- drivers/net/tun_vnet.h | 165 +++++++++++++++++++++++++++++++++++++++++--- include/uapi/linux/if_tun.h | 71 +++++++++++++++++++ 4 files changed, 244 insertions(+), 14 deletions(-)
diff --git a/drivers/net/tap.c b/drivers/net/tap.c index d4ece538f1b2..25c60ff2d3f2 100644 --- a/drivers/net/tap.c +++ b/drivers/net/tap.c @@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q) sock_put(&q->sk); }
+static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb) +{ + return NULL; +} + /* * Select a queue based on the rxq of the device on which this packet * arrived. If the incoming device is not mq, calculate a flow hash @@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q, int total;
if (q->flags & IFF_VNET_HDR) { - struct virtio_net_hdr vnet_hdr; + struct virtio_net_hdr_v1_hash vnet_hdr;
vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
- ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr); + ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb, + tap_find_hash, &vnet_hdr); if (ret) return ret;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 9133ab9ed3f5..03d47799e9bd 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash) e->rps_rxhash = hash; }
+static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb) +{ + return NULL; +} + /* We try to identify a flow through its rxhash. The reason that * we do not check rxq no. is because some cards(e.g 82599), chooses * the rxq based on the txq where the last packet of the flow comes. As @@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun, ssize_t ret;
if (tun->flags & IFF_VNET_HDR) { - struct virtio_net_hdr gso = { 0 }; + struct virtio_net_hdr_v1_hash gso = { 0 };
vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz); ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso); @@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun, }
if (vnet_hdr_sz) { - struct virtio_net_hdr gso; + struct virtio_net_hdr_v1_hash gso;
- ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso); + ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev, + skb, tun_find_hash, &gso); if (ret) return ret;
diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h index 58b9ac7a5fc4..45d0533efc8d 100644 --- a/drivers/net/tun_vnet.h +++ b/drivers/net/tun_vnet.h @@ -6,6 +6,17 @@ #define TUN_VNET_LE 0x80000000 #define TUN_VNET_BE 0x40000000
+typedef struct virtio_net_hash *(*tun_vnet_hash_add)(struct sk_buff *); +typedef const struct virtio_net_hash *(*tun_vnet_hash_find)(const struct sk_buff *); + +struct tun_vnet_hash { + bool report; + bool rss; + struct tun_vnet_rss common; + u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE]; + u16 rss_indirection_table[]; +}; + static inline bool tun_vnet_legacy_is_little_endian(unsigned int flags) { bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) && @@ -107,6 +118,128 @@ static inline long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags, } }
+static inline long tun_vnet_ioctl_gethashtypes(u32 __user *argp) +{ + return put_user(VIRTIO_NET_SUPPORTED_HASH_TYPES, argp) ? -EFAULT : 0; +} + +static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash __rcu **hashp, + unsigned int cmd, + void __user *argp) +{ + struct tun_vnet_rss common; + struct tun_vnet_hash *hash; + size_t indirection_table_size; + size_t key_size; + size_t size; + + switch (cmd) { + case TUNSETVNETREPORTINGAUTOMQ: + if (get_user(common.hash_types, (u32 __user *)argp)) + return -EFAULT; + + if (common.hash_types) { + hash = kzalloc(sizeof(*hash), GFP_KERNEL); + if (!hash) + return -ENOMEM; + + hash->report = true; + hash->common.hash_types = common.hash_types; + } else { + hash = NULL; + } + break; + + case TUNSETVNETREPORTINGRSS: + case TUNSETVNETRSS: + if (copy_from_user(&common, argp, sizeof(common))) + return -EFAULT; + argp = (struct tun_vnet_rss __user *)argp + 1; + + indirection_table_size = ((size_t)common.indirection_table_mask + 1) * 2; + key_size = virtio_net_hash_key_length(common.hash_types); + size = struct_size(hash, rss_indirection_table, + (size_t)common.indirection_table_mask + 1); + + hash = kmalloc(size, GFP_KERNEL); + if (!hash) + return -ENOMEM; + + if (copy_from_user(hash->rss_indirection_table, + argp, indirection_table_size)) { + kfree(hash); + return -EFAULT; + } + argp = (u16 __user *)argp + common.indirection_table_mask + 1; + + if (copy_from_user(hash->rss_key, argp, key_size)) { + kfree(hash); + return -EFAULT; + } + + virtio_net_toeplitz_convert_key(hash->rss_key, key_size); + hash->report = cmd == TUNSETVNETREPORTINGRSS; + hash->rss = true; + hash->common = common; + break; + + default: + return -EINVAL; + } + + kfree_rcu_mightsleep(rcu_replace_pointer_rtnl(*hashp, hash)); + return 0; +} + +static inline void tun_vnet_hash_report(const struct tun_vnet_hash *hash, + struct sk_buff *skb, + const struct flow_keys_basic *keys, + u32 value, + tun_vnet_hash_add vnet_hash_add) +{ + struct virtio_net_hash 
*report; + + if (!hash || !hash->report) + return; + + report = vnet_hash_add(skb); + if (!report) + return; + + *report = (struct virtio_net_hash) { + .report = virtio_net_hash_report(hash->common.hash_types, keys), + .value = value + }; +} + +static inline u16 tun_vnet_rss_select_queue(u32 numqueues, + const struct tun_vnet_hash *hash, + struct sk_buff *skb, + tun_vnet_hash_add vnet_hash_add) +{ + struct virtio_net_hash *report; + struct virtio_net_hash ret; + u16 index; + + if (!numqueues) + return 0; + + virtio_net_hash_rss(skb, hash->common.hash_types, hash->rss_key, &ret); + + if (!ret.report) + return hash->common.unclassified_queue % numqueues; + + if (hash->report) { + report = vnet_hash_add(skb); + if (report) + *report = ret; + } + + index = ret.value & hash->common.indirection_table_mask; + + return hash->rss_indirection_table[index] % numqueues; +} + static inline int tun_vnet_hdr_get(int sz, unsigned int flags, struct iov_iter *from, struct virtio_net_hdr *hdr) @@ -135,15 +268,17 @@ static inline int tun_vnet_hdr_get(int sz, unsigned int flags, }
static inline int tun_vnet_hdr_put(int sz, struct iov_iter *iter, - const struct virtio_net_hdr *hdr) + const struct virtio_net_hdr_v1_hash *hdr) { + int content_sz = MIN(sizeof(*hdr), sz); + if (unlikely(iov_iter_count(iter) < sz)) return -EINVAL;
- if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr))) + if (unlikely(copy_to_iter(hdr, content_sz, iter) != content_sz)) return -EFAULT;
- if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr)) + if (iov_iter_zero(sz - content_sz, iter) != sz - content_sz) return -EFAULT;
return 0; @@ -155,26 +290,38 @@ static inline int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb, return virtio_net_hdr_to_skb(skb, hdr, tun_vnet_is_little_endian(flags)); }
-static inline int tun_vnet_hdr_from_skb(unsigned int flags, +static inline int tun_vnet_hdr_from_skb(int sz, unsigned int flags, const struct net_device *dev, const struct sk_buff *skb, - struct virtio_net_hdr *hdr) + tun_vnet_hash_find vnet_hash_find, + struct virtio_net_hdr_v1_hash *hdr) { int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0; + const struct virtio_net_hash *report = sz < sizeof(struct virtio_net_hdr_v1_hash) ? + NULL : vnet_hash_find(skb); + + *hdr = (struct virtio_net_hdr_v1_hash) { + .hash_report = VIRTIO_NET_HASH_REPORT_NONE + }; + + if (report) { + hdr->hash_value = cpu_to_le32(report->value); + hdr->hash_report = cpu_to_le16(report->report); + }
- if (virtio_net_hdr_from_skb(skb, hdr, + if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr, tun_vnet_is_little_endian(flags), true, vlan_hlen)) { struct skb_shared_info *sinfo = skb_shinfo(skb);
if (net_ratelimit()) { netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n", - sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->gso_size), - tun_vnet16_to_cpu(flags, hdr->hdr_len)); + sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->hdr.gso_size), + tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len)); print_hex_dump(KERN_ERR, "tun: ", DUMP_PREFIX_NONE, 16, 1, skb->head, - min(tun_vnet16_to_cpu(flags, hdr->hdr_len), 64), true); + min(tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len), 64), true); } WARN_ON_ONCE(1); return -EINVAL; diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h index 980de74724fc..fe4b984d3bbb 100644 --- a/include/uapi/linux/if_tun.h +++ b/include/uapi/linux/if_tun.h @@ -62,6 +62,62 @@ #define TUNSETCARRIER _IOW('T', 226, int) #define TUNGETDEVNETNS _IO('T', 227)
+/**
+ * define TUNGETVNETHASHTYPES - ioctl to get supported virtio_net hashing types
+ *
+ * The argument is a pointer to __u32 which will store the supported virtio_net
+ * hashing types.
+ */
+#define TUNGETVNETHASHTYPES _IOR('T', 228, __u32)
+
+/**
+ * define TUNSETVNETREPORTINGAUTOMQ - ioctl to enable automq with hash reporting
+ *
+ * Disable RSS and enable automatic receive steering with hash reporting.
+ *
+ * The argument is a pointer to __u32 that contains a bitmask of hash types
+ * allowed to be reported.
+ *
+ * This ioctl results in %EBADFD if the underlying device is deleted. It affects
+ * all queues attached to the same device.
+ *
+ * This ioctl currently has no effect on XDP packets and packets with
+ * queue_mapping set by TC.
+ */
+#define TUNSETVNETREPORTINGAUTOMQ _IOR('T', 229, __u32)
+
+/**
+ * define TUNSETVNETREPORTINGRSS - ioctl to enable RSS with hash reporting
+ *
+ * Disable automatic receive steering and enable RSS with hash reporting.
+ *
+ * This ioctl results in %EBADFD if the underlying device is deleted. It affects
+ * all queues attached to the same device.
+ *
+ * This ioctl currently has no effect on XDP packets and packets with
+ * queue_mapping set by TC.
+ */
+#define TUNSETVNETREPORTINGRSS _IOR('T', 230, struct tun_vnet_rss)
+
+/**
+ * define TUNSETVNETRSS - ioctl to enable RSS without hash reporting
+ *
+ * Disable automatic receive steering and enable RSS without hash reporting.
+ *
+ * The argument is a pointer to the compound of the following in order:
+ *
+ * 1. &struct tun_vnet_rss
+ * 2. Indirection table
+ * 3. Key
+ *
+ * This ioctl results in %EBADFD if the underlying device is deleted. It affects
+ * all queues attached to the same device.
+ *
+ * This ioctl currently has no effect on XDP packets and packets with
+ * queue_mapping set by TC.
+ */
+#define TUNSETVNETRSS _IOR('T', 231, struct tun_vnet_rss)
+
 /* TUNSETIFF ifr flags */
 #define IFF_TUN		0x0001
 #define IFF_TAP		0x0002
@@ -124,4 +180,19 @@ struct tun_filter {
  */
 #define TUN_STEERINGEBPF_FALLBACK	-1
+/**
+ * struct tun_vnet_rss - virtio_net RSS configuration
+ * @hash_types:
+ *	Bitmask of allowed hash types
+ * @indirection_table_mask:
+ *	Bitmask to be applied to the indirection table index
+ * @unclassified_queue:
+ *	The index of the queue to place unclassified packets in
+ */
+struct tun_vnet_rss {
+	__u32 hash_types;
+	__u16 indirection_table_mask;
+	__u16 unclassified_queue;
+};
+
 #endif /* _UAPI__IF_TUN_H */
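For illustration, the compound TUNSETVNETRSS argument (struct tun_vnet_rss, then the indirection table, then the key) has the following total size; `rss_arg_size` is a hypothetical userspace helper mirroring the proposed UAPI layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the proposed struct tun_vnet_rss. */
struct tun_vnet_rss {
	uint32_t hash_types;
	uint16_t indirection_table_mask;
	uint16_t unclassified_queue;
};

/*
 * Total size of the TUNSETVNETRSS argument buffer: the fixed header,
 * followed by (indirection_table_mask + 1) 16-bit table entries,
 * followed by the hash key.
 */
static size_t rss_arg_size(uint16_t indirection_table_mask, size_t key_len)
{
	return sizeof(struct tun_vnet_rss) +
	       ((size_t)indirection_table_mask + 1) * sizeof(uint16_t) +
	       key_len;
}
```

A VMM would allocate a buffer of this size, fill the three parts in order, and pass its address to the ioctl.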
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
[...]

	case TUNSETVNETREPORTINGAUTOMQ:
[...]
case TUNSETVNETREPORTINGRSS:
case TUNSETVNETRSS:
So the above three show unnecessary design redundancy as well as a burden for future extension. Why not simply have
1) TUNSETVNET_RSS 2) TUNSETVNET_HASH_REPORT ?
Which maps to
#define VIRTIO_NET_CTRL_MQ_RSS_CONFIG 1 (for configurable receive steering) #define VIRTIO_NET_CTRL_MQ_HASH_CONFIG 2 (for configurable hash calculation)
It would always be easier to start with what the spec has, or at least we need to explain, here or in the changelog, why we chose a different design.

One day we may have tun_select_queue_algorithm_x(); we shouldn't have to duplicate the ioctls once again like TUNSETVNETREPORTINGXYZ.
[...]
hash->report = cmd == TUNSETVNETREPORTINGRSS;
At least, if this is the only difference, why not simply encode it in the ioctl argument itself rather than having a very similar command?
[...]
- define TUNSETVNETREPORTINGRSS - ioctl to enable RSS with hash reporting
- Disable automatic receive steering and enable RSS with hash reporting.
This is unnecessary; e.g. one day we will have select_queue_xyz(), and we don't want to say "Disable automatic receive steering and xyz ..."
+ *
+ * This ioctl results in %EBADFD if the underlying device is deleted. It affects
+ * all queues attached to the same device.
+ *
+ * This ioctl currently has no effect on XDP packets and packets with
+ * queue_mapping set by TC.
+ */
+#define TUNSETVNETREPORTINGRSS _IOR('T', 230, struct tun_vnet_rss)
+/**
+ * define TUNSETVNETRSS - ioctl to enable RSS without hash reporting
+ *
+ * Disable automatic receive steering and enable RSS without hash reporting.
Same here.
+ *
+ * The argument is a pointer to the compound of the following in order:
+ *
+ * - &struct tun_vnet_rss
+ * - Indirection table
+ * - Key
+ *
+ * This ioctl results in %EBADFD if the underlying device is deleted. It affects
+ * all queues attached to the same device.
+ *
+ * This ioctl currently has no effect on XDP packets and packets with
+ * queue_mapping set by TC.
+ */
+#define TUNSETVNETRSS _IOR('T', 231, struct tun_vnet_rss)
 /* TUNSETIFF ifr flags */
 #define IFF_TUN 0x0001
 #define IFF_TAP 0x0002
@@ -124,4 +180,19 @@ struct tun_filter {
  */
 #define TUN_STEERINGEBPF_FALLBACK -1
+/**
+ * struct tun_vnet_rss - virtio_net RSS configuration
+ * @hash_types: Bitmask of allowed hash types
+ * @indirection_table_mask: Bitmask to be applied to the indirection table index
+ * @unclassified_queue: The index of the queue to place unclassified packets in
+ */
+struct tun_vnet_rss {
+	__u32 hash_types;
+	__u16 indirection_table_mask;
+	__u16 unclassified_queue;
+};
#endif /* _UAPI__IF_TUN_H */
-- 2.49.0
Thanks
On 2025/06/04 10:53, Jason Wang wrote:
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
Add common code required for the features being added to TUN and TAP. They will be enabled for each of them in following patches.
Added Features
Hash reporting
Allow the guest to reuse the hash value to make receive steering consistent between the host and guest, and to save hash computation.
Receive Side Scaling (RSS)
RSS is a receive steering algorithm that can be negotiated to use with virtio_net. Conventionally the hash calculation was done by the VMM. However, computing the hash after the queue was chosen defeats the purpose of RSS.
Another approach is to use eBPF steering program. This approach has another downside: it cannot report the calculated hash due to the restrictive nature of eBPF steering program.
Introduce the code to perform RSS to the kernel in order to overcome these challenges. An alternative solution is to extend the eBPF steering program so that it will be able to report to the userspace, but I didn't opt for it because the current mechanism of the eBPF steering program relies on legacy context rewriting, and introducing kfunc-based eBPF would result in a non-UAPI dependency while the other relevant virtualization APIs such as KVM and vhost_net are UAPIs.
Added ioctls
They are designed to be extensible and compatible with VM migration. This change only adds the implementation and does not expose them to the userspace.
TUNGETVNETHASHTYPES
This ioctl tells supported hash types. It is useful to check if a VM can be migrated to the current host.
TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
These ioctls configure a steering algorithm and, if needed, hash reporting.
Signed-off-by: Akihiko Odaki akihiko.odaki@daynix.com Tested-by: Lei Yang leiyang@redhat.com
drivers/net/tap.c | 10 ++- drivers/net/tun.c | 12 +++- drivers/net/tun_vnet.h | 165 +++++++++++++++++++++++++++++++++++++++++--- include/uapi/linux/if_tun.h | 71 +++++++++++++++++++ 4 files changed, 244 insertions(+), 14 deletions(-)
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index d4ece538f1b2..25c60ff2d3f2 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
 	sock_put(&q->sk);
 }

+static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
+{
+	return NULL;
+}
+
 /*
  * Select a queue based on the rxq of the device on which this packet
  * arrived. If the incoming device is not mq, calculate a flow hash
@@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q,
 	int total;

 	if (q->flags & IFF_VNET_HDR) {
-		struct virtio_net_hdr vnet_hdr;
+		struct virtio_net_hdr_v1_hash vnet_hdr;

 		vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);

-		ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
+		ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
+					    tap_find_hash, &vnet_hdr);
 		if (ret)
 			return ret;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9133ab9ed3f5..03d47799e9bd 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
 		e->rps_rxhash = hash;
 }

+static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
+{
+	return NULL;
+}
+
 /* We try to identify a flow through its rxhash. The reason that
  * we do not check rxq no. is because some cards(e.g 82599), chooses
  * the rxq based on the txq where the last packet of the flow comes. As
@@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
 	ssize_t ret;

 	if (tun->flags & IFF_VNET_HDR) {
-		struct virtio_net_hdr gso = { 0 };
+		struct virtio_net_hdr_v1_hash gso = { 0 };

 		vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
 		ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
@@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 	}

 	if (vnet_hdr_sz) {
-		struct virtio_net_hdr gso;
+		struct virtio_net_hdr_v1_hash gso;

-		ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
+		ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev,
+					    skb, tun_find_hash, &gso);
 		if (ret)
 			return ret;
diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
index 58b9ac7a5fc4..45d0533efc8d 100644
--- a/drivers/net/tun_vnet.h
+++ b/drivers/net/tun_vnet.h
@@ -6,6 +6,17 @@
 #define TUN_VNET_LE 0x80000000
 #define TUN_VNET_BE 0x40000000

+typedef struct virtio_net_hash *(*tun_vnet_hash_add)(struct sk_buff *);
+typedef const struct virtio_net_hash *(*tun_vnet_hash_find)(const struct sk_buff *);
+
+struct tun_vnet_hash {
+	bool report;
+	bool rss;
+	struct tun_vnet_rss common;
+	u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
+	u16 rss_indirection_table[];
+};
 static inline bool tun_vnet_legacy_is_little_endian(unsigned int flags)
 {
 	bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
@@ -107,6 +118,128 @@ static inline long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags,
 	}
 }

+static inline long tun_vnet_ioctl_gethashtypes(u32 __user *argp)
+{
+	return put_user(VIRTIO_NET_SUPPORTED_HASH_TYPES, argp) ? -EFAULT : 0;
+}
+static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash __rcu **hashp,
+					  unsigned int cmd,
+					  void __user *argp)
+{
+	struct tun_vnet_rss common;
+	struct tun_vnet_hash *hash;
+	size_t indirection_table_size;
+	size_t key_size;
+	size_t size;
+
+	switch (cmd) {
+	case TUNSETVNETREPORTINGAUTOMQ:
+		if (get_user(common.hash_types, (u32 __user *)argp))
+			return -EFAULT;
+
+		if (common.hash_types) {
+			hash = kzalloc(sizeof(*hash), GFP_KERNEL);
+			if (!hash)
+				return -ENOMEM;
+
+			hash->report = true;
+			hash->common.hash_types = common.hash_types;
+		} else {
+			hash = NULL;
+		}
+		break;
+	case TUNSETVNETREPORTINGRSS:
+	case TUNSETVNETRSS:
So the above three show unnecessary design redundancy as well as a burden for future extension. Why not simply have
- TUNSETVNET_RSS
- TUNSETVNET_HASH_REPORT
?
Which maps to
#define VIRTIO_NET_CTRL_MQ_RSS_CONFIG 1 (for configurable receive steering)
#define VIRTIO_NET_CTRL_MQ_HASH_CONFIG 2 (for configurable hash calculation)
It would always be easier to start with what the spec had, or at least we need to explain, here or in the changelog, why we chose a different design, to ease our life.
TUNSETVNETREPORTINGAUTOMQ maps to VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
TUNSETVNETREPORTINGRSS and TUNSETVNETRSS map to VIRTIO_NET_CTRL_MQ_RSS_CONFIG. We have two ioctls here because VIRTIO_NET_CTRL_MQ_RSS_CONFIG behaves differently depending on whether VIRTIO_NET_F_HASH_REPORT is negotiated or not; it also enables hash reporting if the feature is negotiated.
One day we would have tun_select_queue_algorithm_x() we don't have to duplicate the ioctls once again here like TUNSETVNETREPORTINGXYZ
5.1.6.5.6.4 Hash calculation says:
If VIRTIO_NET_F_HASH_REPORT was negotiated and the device uses automatic receive steering, the device MUST support a command to configure hash calculation parameters.
The driver provides parameters for hash calculation as follows:
class VIRTIO_NET_CTRL_MQ, command VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
VIRTIO_NET_CTRL_MQ_HASH_CONFIG is for automatic receive steering and not for RSS (or other steering algorithms if the spec gets any in the future).
+		if (copy_from_user(&common, argp, sizeof(common)))
+			return -EFAULT;
+
+		argp = (struct tun_vnet_rss __user *)argp + 1;
+		indirection_table_size = ((size_t)common.indirection_table_mask + 1) * 2;
+		key_size = virtio_net_hash_key_length(common.hash_types);
+		size = struct_size(hash, rss_indirection_table,
+				   (size_t)common.indirection_table_mask + 1);
+
+		hash = kmalloc(size, GFP_KERNEL);
+		if (!hash)
+			return -ENOMEM;
+
+		if (copy_from_user(hash->rss_indirection_table,
+				   argp, indirection_table_size)) {
+			kfree(hash);
+			return -EFAULT;
+		}
+
+		argp = (u16 __user *)argp + common.indirection_table_mask + 1;
+		if (copy_from_user(hash->rss_key, argp, key_size)) {
+			kfree(hash);
+			return -EFAULT;
+		}
+
+		virtio_net_toeplitz_convert_key(hash->rss_key, key_size);
+		hash->report = cmd == TUNSETVNETREPORTINGRSS;
At least, if this is the only difference, why not simply code this into the ioctl itself rather than having a very similar command?
It is what the previous version did. Either is fine I guess; the only practical difference is that the configuration struct is smaller with this approach.
+		hash->rss = true;
+		hash->common = common;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	kfree_rcu_mightsleep(rcu_replace_pointer_rtnl(*hashp, hash));
+	return 0;
+}
+static inline void tun_vnet_hash_report(const struct tun_vnet_hash *hash,
+					struct sk_buff *skb,
+					const struct flow_keys_basic *keys,
+					u32 value,
+					tun_vnet_hash_add vnet_hash_add)
+{
+	struct virtio_net_hash *report;
+
+	if (!hash || !hash->report)
+		return;
+
+	report = vnet_hash_add(skb);
+	if (!report)
+		return;
+
+	*report = (struct virtio_net_hash) {
+		.report = virtio_net_hash_report(hash->common.hash_types, keys),
+		.value = value
+	};
+}
+static inline u16 tun_vnet_rss_select_queue(u32 numqueues,
+					    const struct tun_vnet_hash *hash,
+					    struct sk_buff *skb,
+					    tun_vnet_hash_add vnet_hash_add)
+{
+	struct virtio_net_hash *report;
+	struct virtio_net_hash ret;
+	u16 index;
+
+	if (!numqueues)
+		return 0;
+
+	virtio_net_hash_rss(skb, hash->common.hash_types, hash->rss_key, &ret);
+
+	if (!ret.report)
+		return hash->common.unclassified_queue % numqueues;
+
+	if (hash->report) {
+		report = vnet_hash_add(skb);
+		if (report)
+			*report = ret;
+	}
+
+	index = ret.value & hash->common.indirection_table_mask;
+	return hash->rss_indirection_table[index] % numqueues;
+}
 static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
 				   struct iov_iter *from,
 				   struct virtio_net_hdr *hdr)
@@ -135,15 +268,17 @@ static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
 }

 static inline int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
-				   const struct virtio_net_hdr *hdr)
+				   const struct virtio_net_hdr_v1_hash *hdr)
 {
+	int content_sz = MIN(sizeof(*hdr), sz);
+
 	if (unlikely(iov_iter_count(iter) < sz))
 		return -EINVAL;

-	if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
+	if (unlikely(copy_to_iter(hdr, content_sz, iter) != content_sz))
 		return -EFAULT;

-	if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
+	if (iov_iter_zero(sz - content_sz, iter) != sz - content_sz)
 		return -EFAULT;

 	return 0;
@@ -155,26 +290,38 @@ static inline int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb,
 	return virtio_net_hdr_to_skb(skb, hdr, tun_vnet_is_little_endian(flags));
 }

-static inline int tun_vnet_hdr_from_skb(unsigned int flags,
+static inline int tun_vnet_hdr_from_skb(int sz, unsigned int flags,
 					const struct net_device *dev,
 					const struct sk_buff *skb,
-					struct virtio_net_hdr *hdr)
+					tun_vnet_hash_find vnet_hash_find,
+					struct virtio_net_hdr_v1_hash *hdr)
 {
 	int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
+	const struct virtio_net_hash *report = sz < sizeof(struct virtio_net_hdr_v1_hash) ?
+					       NULL : vnet_hash_find(skb);
+
+	*hdr = (struct virtio_net_hdr_v1_hash) {
+		.hash_report = VIRTIO_NET_HASH_REPORT_NONE
+	};
+
+	if (report) {
+		hdr->hash_value = cpu_to_le32(report->value);
+		hdr->hash_report = cpu_to_le16(report->report);
+	}

-	if (virtio_net_hdr_from_skb(skb, hdr,
+	if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr,
 				    tun_vnet_is_little_endian(flags), true,
 				    vlan_hlen)) {
 		struct skb_shared_info *sinfo = skb_shinfo(skb);

 		if (net_ratelimit()) {
 			netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
-				   sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->gso_size),
-				   tun_vnet16_to_cpu(flags, hdr->hdr_len));
+				   sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->hdr.gso_size),
+				   tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len));
 			print_hex_dump(KERN_ERR, "tun: ", DUMP_PREFIX_NONE, 16, 1, skb->head,
-				       min(tun_vnet16_to_cpu(flags, hdr->hdr_len), 64), true);
+				       min(tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len), 64), true);
 		}
 		WARN_ON_ONCE(1);
 		return -EINVAL;
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 980de74724fc..fe4b984d3bbb 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -62,6 +62,62 @@
 #define TUNSETCARRIER _IOW('T', 226, int)
 #define TUNGETDEVNETNS _IO('T', 227)

+/**
+ * define TUNGETVNETHASHTYPES - ioctl to get supported virtio_net hashing types
+ *
+ * The argument is a pointer to __u32 which will store the supported virtio_net
+ * hashing types.
+ */
+#define TUNGETVNETHASHTYPES _IOR('T', 228, __u32)
+/**
+ * define TUNSETVNETREPORTINGAUTOMQ - ioctl to enable automq with hash reporting
+ *
+ * Disable RSS and enable automatic receive steering with hash reporting.
+ *
+ * The argument is a pointer to __u32 that contains a bitmask of hash types
+ * allowed to be reported.
+ *
+ * This ioctl results in %EBADFD if the underlying device is deleted. It affects
+ * all queues attached to the same device.
+ *
+ * This ioctl currently has no effect on XDP packets and packets with
+ * queue_mapping set by TC.
+ */
+#define TUNSETVNETREPORTINGAUTOMQ _IOR('T', 229, __u32)
+/**
+ * define TUNSETVNETREPORTINGRSS - ioctl to enable RSS with hash reporting
+ *
+ * Disable automatic receive steering and enable RSS with hash reporting.
This is unnecessary, e.g. one day we will have select_queue_xyz(), and we don't want to say "Disable automatic receive steering and xyz ..."
It is still better to document this, as its behavior is somewhat complicated.
Concretely, this ioctl disables automatic receive steering but doesn't disable steering by eBPF, which is implied by TUN_STEERINGEBPF_FALLBACK.
Perhaps it could be rephrased to say it disables "the other receive steering mechanisms except eBPF".
Regards, Akihiko Odaki
+ *
+ * This ioctl results in %EBADFD if the underlying device is deleted. It affects
+ * all queues attached to the same device.
+ *
+ * This ioctl currently has no effect on XDP packets and packets with
+ * queue_mapping set by TC.
+ */
+#define TUNSETVNETREPORTINGRSS _IOR('T', 230, struct tun_vnet_rss)
+/**
+ * define TUNSETVNETRSS - ioctl to enable RSS without hash reporting
+ *
+ * Disable automatic receive steering and enable RSS without hash reporting.
Same here.
+ *
+ * The argument is a pointer to the compound of the following in order:
+ *
+ * - &struct tun_vnet_rss
+ * - Indirection table
+ * - Key
+ *
+ * This ioctl results in %EBADFD if the underlying device is deleted. It affects
+ * all queues attached to the same device.
+ *
+ * This ioctl currently has no effect on XDP packets and packets with
+ * queue_mapping set by TC.
+ */
+#define TUNSETVNETRSS _IOR('T', 231, struct tun_vnet_rss)
 /* TUNSETIFF ifr flags */
 #define IFF_TUN 0x0001
 #define IFF_TAP 0x0002
@@ -124,4 +180,19 @@ struct tun_filter {
  */
 #define TUN_STEERINGEBPF_FALLBACK -1

+/**
+ * struct tun_vnet_rss - virtio_net RSS configuration
+ * @hash_types: Bitmask of allowed hash types
+ * @indirection_table_mask: Bitmask to be applied to the indirection table index
+ * @unclassified_queue: The index of the queue to place unclassified packets in
+ */
+struct tun_vnet_rss {
+	__u32 hash_types;
+	__u16 indirection_table_mask;
+	__u16 unclassified_queue;
+};
+
 #endif /* _UAPI__IF_TUN_H */
-- 2.49.0
Thanks
On Wed, Jun 4, 2025 at 4:42 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/04 10:53, Jason Wang wrote:
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
Add common code required for the features being added to TUN and TAP. They will be enabled for each of them in following patches.
Added Features
Hash reporting
Allow the guest to reuse the hash value to make receive steering consistent between the host and guest, and to save hash computation.
Receive Side Scaling (RSS)
RSS is a receive steering algorithm that can be negotiated to use with virtio_net. Conventionally the hash calculation was done by the VMM. However, computing the hash after the queue was chosen defeats the purpose of RSS.
Another approach is to use eBPF steering program. This approach has another downside: it cannot report the calculated hash due to the restrictive nature of eBPF steering program.
Introduce the code to perform RSS to the kernel in order to overcome these challenges. An alternative solution is to extend the eBPF steering program so that it will be able to report to the userspace, but I didn't opt for it because the current mechanism of the eBPF steering program relies on legacy context rewriting, and introducing kfunc-based eBPF would result in a non-UAPI dependency while the other relevant virtualization APIs such as KVM and vhost_net are UAPIs.
Added ioctls
They are designed to be extensible and compatible with VM migration. This change only adds the implementation and does not expose them to the userspace.
TUNGETVNETHASHTYPES
This ioctl tells supported hash types. It is useful to check if a VM can be migrated to the current host.
TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
These ioctls configure a steering algorithm and, if needed, hash reporting.
Signed-off-by: Akihiko Odaki akihiko.odaki@daynix.com Tested-by: Lei Yang leiyang@redhat.com
drivers/net/tap.c | 10 ++- drivers/net/tun.c | 12 +++- drivers/net/tun_vnet.h | 165 +++++++++++++++++++++++++++++++++++++++++--- include/uapi/linux/if_tun.h | 71 +++++++++++++++++++ 4 files changed, 244 insertions(+), 14 deletions(-)
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index d4ece538f1b2..25c60ff2d3f2 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
 	sock_put(&q->sk);
 }

+static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
+{
+	return NULL;
+}
+
 /*
  * Select a queue based on the rxq of the device on which this packet
  * arrived. If the incoming device is not mq, calculate a flow hash
@@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q,
 	int total;

 	if (q->flags & IFF_VNET_HDR) {
-		struct virtio_net_hdr vnet_hdr;
+		struct virtio_net_hdr_v1_hash vnet_hdr;

 		vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);

-		ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
+		ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
+					    tap_find_hash, &vnet_hdr);
 		if (ret)
 			return ret;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9133ab9ed3f5..03d47799e9bd 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
 		e->rps_rxhash = hash;
 }

+static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
+{
+	return NULL;
+}
+
 /* We try to identify a flow through its rxhash. The reason that
  * we do not check rxq no. is because some cards(e.g 82599), chooses
  * the rxq based on the txq where the last packet of the flow comes. As
@@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
 	ssize_t ret;

 	if (tun->flags & IFF_VNET_HDR) {
-		struct virtio_net_hdr gso = { 0 };
+		struct virtio_net_hdr_v1_hash gso = { 0 };

 		vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
 		ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
@@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 	}

 	if (vnet_hdr_sz) {
-		struct virtio_net_hdr gso;
+		struct virtio_net_hdr_v1_hash gso;

-		ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
+		ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev,
+					    skb, tun_find_hash, &gso);
 		if (ret)
 			return ret;
diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
index 58b9ac7a5fc4..45d0533efc8d 100644
--- a/drivers/net/tun_vnet.h
+++ b/drivers/net/tun_vnet.h
@@ -6,6 +6,17 @@
 #define TUN_VNET_LE 0x80000000
 #define TUN_VNET_BE 0x40000000

+typedef struct virtio_net_hash *(*tun_vnet_hash_add)(struct sk_buff *);
+typedef const struct virtio_net_hash *(*tun_vnet_hash_find)(const struct sk_buff *);
+
+struct tun_vnet_hash {
+	bool report;
+	bool rss;
+	struct tun_vnet_rss common;
+	u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
+	u16 rss_indirection_table[];
+};
 static inline bool tun_vnet_legacy_is_little_endian(unsigned int flags)
 {
 	bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
@@ -107,6 +118,128 @@ static inline long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags,
 	}
 }

+static inline long tun_vnet_ioctl_gethashtypes(u32 __user *argp)
+{
+	return put_user(VIRTIO_NET_SUPPORTED_HASH_TYPES, argp) ? -EFAULT : 0;
+}
+static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash __rcu **hashp,
+					  unsigned int cmd,
+					  void __user *argp)
+{
+	struct tun_vnet_rss common;
+	struct tun_vnet_hash *hash;
+	size_t indirection_table_size;
+	size_t key_size;
+	size_t size;
+
+	switch (cmd) {
+	case TUNSETVNETREPORTINGAUTOMQ:
+		if (get_user(common.hash_types, (u32 __user *)argp))
+			return -EFAULT;
+
+		if (common.hash_types) {
+			hash = kzalloc(sizeof(*hash), GFP_KERNEL);
+			if (!hash)
+				return -ENOMEM;
+
+			hash->report = true;
+			hash->common.hash_types = common.hash_types;
+		} else {
+			hash = NULL;
+		}
+		break;
+	case TUNSETVNETREPORTINGRSS:
+	case TUNSETVNETRSS:
So the above three show unnecessary design redundancy as well as a burden for future extension. Why not simply have
- TUNSETVNET_RSS
- TUNSETVNET_HASH_REPORT
?
Which maps to
#define VIRTIO_NET_CTRL_MQ_RSS_CONFIG 1 (for configurable receive steering)
#define VIRTIO_NET_CTRL_MQ_HASH_CONFIG 2 (for configurable hash calculation)
It would always be easier to start with what the spec had, or at least we need to explain, here or in the changelog, why we chose a different design, to ease our life.
TUNSETVNETREPORTINGAUTOMQ maps to VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
It's not:
VIRTIO_NET_CTRL_MQ_HASH_CONFIG uses:
struct virtio_net_hash_config {
        le32 hash_types;
        le16 reserved[4];
        u8 hash_key_length;
        u8 hash_key_data[hash_key_length];
};
but TUNSETVNETREPORTINGAUTOMQ only accepts hash_types without others:
TUNSETVNETREPORTINGRSS and TUNSETVNETRSS map to VIRTIO_NET_CTRL_MQ_RSS_CONFIG.
I think we've already had a discussion about this.
Reusing the virtio-net uAPI is much better than having a tun-specific one, considering tun may need to support more virtio commands in the future. Or maybe it's time to introduce a transport for the virtio control virtqueue uAPI in tuntap to avoid inventing new uAPI endlessly.
What's more I see:
struct tun_vnet_rss {
        __u32 hash_types;
        __u16 indirection_table_mask;
        __u16 unclassified_queue;
};

struct tun_vnet_hash {
        bool report;
        bool rss;
        struct tun_vnet_rss common;
        u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
        u16 rss_indirection_table[];
};
As I pointed out in the past, let's just decouple the rss from hash, everything would be much simpler, or you need to explain why you couple this somewhere.
For example:
1) why is tun_vnet_hash not part of the uAPI but tun_vnet_rss is, or how could userspace know what kind of format it would use for TUNSETVNETREPORTINGRSS?
2) what are the advantages of embedding RSS-specific stuff into the hash report structure?
3) what are the advantages of not using the virtio-net uAPI?
More issues:
1) max_tx_vq is ignored, so do you expect the userspace to intercept this and switch to using TUNSETQUEUE? This seems like a burden as TUN can simply accept it and do the attaching/detaching by itself
2) the rx depends on the indirection table, so userspace needs to intercept the indirection and change the rx queue numbers accordingly
3) RSS allows an asymmetric number of TX/RX queues; this is not supported by TUN now, and no matter whether we decide to let userspace intercept max_tx_vq or not, we need to implement it first
Things would be much more simpler, if kernel can do 1) and 2).
We have two ioctls here because VIRTIO_NET_CTRL_MQ_RSS_CONFIG behaves differently depending on whether VIRTIO_NET_F_HASH_REPORT is negotiated or not;
It wouldn't be a problem if you do 1:1 mapping between virtio commands and TUN uAPI, otherwise it should have a bug somewhere.
it also enables hash reporting if the feature is negotiated.
Again, starting from virtio-net command is easier, a strong justification is needed to explain why we choose another for tun/tap.
One day we would have tun_select_queue_algorithm_x() we don't have to duplicate the ioctls once again here like TUNSETVNETREPORTINGXYZ
5.1.6.5.6.4 Hash calculation says:
If VIRTIO_NET_F_HASH_REPORT was negotiated and the device uses automatic receive steering, the device MUST support a command to configure hash calculation parameters.
The driver provides parameters for hash calculation as follows:
class VIRTIO_NET_CTRL_MQ, command VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
VIRTIO_NET_CTRL_MQ_HASH_CONFIG is for automatic receive steering and not for RSS (or other steering algorithms if the spec gets any in the future).
I'm not sure but the spec needs some tweaking. For example, I don't expect there would be a dedicated hash config command for flow filters in the future. I think the reason why spec says like this is that virtio-net only supports automatic receive steering.
Note that macvtap doesn't implement automatic receive steering.
+		if (copy_from_user(&common, argp, sizeof(common)))
+			return -EFAULT;
+
+		argp = (struct tun_vnet_rss __user *)argp + 1;
+		indirection_table_size = ((size_t)common.indirection_table_mask + 1) * 2;
+		key_size = virtio_net_hash_key_length(common.hash_types);
+		size = struct_size(hash, rss_indirection_table,
+				   (size_t)common.indirection_table_mask + 1);
+
+		hash = kmalloc(size, GFP_KERNEL);
+		if (!hash)
+			return -ENOMEM;
+
+		if (copy_from_user(hash->rss_indirection_table,
+				   argp, indirection_table_size)) {
+			kfree(hash);
+			return -EFAULT;
+		}
+
+		argp = (u16 __user *)argp + common.indirection_table_mask + 1;
+		if (copy_from_user(hash->rss_key, argp, key_size)) {
+			kfree(hash);
+			return -EFAULT;
+		}
+
+		virtio_net_toeplitz_convert_key(hash->rss_key, key_size);
+		hash->report = cmd == TUNSETVNETREPORTINGRSS;
At least, if this is the only difference, why not simply code this into the ioctl itself rather than having a very similar command?
It is what the previous version did. Either is fine I guess; the only practical difference is that the configuration struct is smaller with this approach.
+		hash->rss = true;
+		hash->common = common;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	kfree_rcu_mightsleep(rcu_replace_pointer_rtnl(*hashp, hash));
+	return 0;
+}
+static inline void tun_vnet_hash_report(const struct tun_vnet_hash *hash,
+					struct sk_buff *skb,
+					const struct flow_keys_basic *keys,
+					u32 value,
+					tun_vnet_hash_add vnet_hash_add)
+{
+	struct virtio_net_hash *report;
+
+	if (!hash || !hash->report)
+		return;
+
+	report = vnet_hash_add(skb);
+	if (!report)
+		return;
+
+	*report = (struct virtio_net_hash) {
+		.report = virtio_net_hash_report(hash->common.hash_types, keys),
+		.value = value
+	};
+}
+static inline u16 tun_vnet_rss_select_queue(u32 numqueues,
+					    const struct tun_vnet_hash *hash,
+					    struct sk_buff *skb,
+					    tun_vnet_hash_add vnet_hash_add)
+{
+	struct virtio_net_hash *report;
+	struct virtio_net_hash ret;
+	u16 index;
+
+	if (!numqueues)
+		return 0;
+
+	virtio_net_hash_rss(skb, hash->common.hash_types, hash->rss_key, &ret);
+
+	if (!ret.report)
+		return hash->common.unclassified_queue % numqueues;
+
+	if (hash->report) {
+		report = vnet_hash_add(skb);
+		if (report)
+			*report = ret;
+	}
+
+	index = ret.value & hash->common.indirection_table_mask;
+	return hash->rss_indirection_table[index] % numqueues;
+}
 static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
 				   struct iov_iter *from,
 				   struct virtio_net_hdr *hdr)
@@ -135,15 +268,17 @@ static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
 }

 static inline int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
-				   const struct virtio_net_hdr *hdr)
+				   const struct virtio_net_hdr_v1_hash *hdr)
 {
+	int content_sz = MIN(sizeof(*hdr), sz);
+
 	if (unlikely(iov_iter_count(iter) < sz))
 		return -EINVAL;

-	if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
+	if (unlikely(copy_to_iter(hdr, content_sz, iter) != content_sz))
 		return -EFAULT;

-	if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
+	if (iov_iter_zero(sz - content_sz, iter) != sz - content_sz)
 		return -EFAULT;

 	return 0;
@@ -155,26 +290,38 @@ static inline int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb,
 	return virtio_net_hdr_to_skb(skb, hdr, tun_vnet_is_little_endian(flags));
 }

-static inline int tun_vnet_hdr_from_skb(unsigned int flags,
+static inline int tun_vnet_hdr_from_skb(int sz, unsigned int flags,
 					const struct net_device *dev,
 					const struct sk_buff *skb,
-					struct virtio_net_hdr *hdr)
+					tun_vnet_hash_find vnet_hash_find,
+					struct virtio_net_hdr_v1_hash *hdr)
 {
 	int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
+	const struct virtio_net_hash *report = sz < sizeof(struct virtio_net_hdr_v1_hash) ?
+					       NULL : vnet_hash_find(skb);
+
+	*hdr = (struct virtio_net_hdr_v1_hash) {
+		.hash_report = VIRTIO_NET_HASH_REPORT_NONE
+	};
+
+	if (report) {
+		hdr->hash_value = cpu_to_le32(report->value);
+		hdr->hash_report = cpu_to_le16(report->report);
+	}

-	if (virtio_net_hdr_from_skb(skb, hdr,
+	if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr,
 				    tun_vnet_is_little_endian(flags), true,
 				    vlan_hlen)) {
 		struct skb_shared_info *sinfo = skb_shinfo(skb);

 		if (net_ratelimit()) {
 			netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
-				   sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->gso_size),
-				   tun_vnet16_to_cpu(flags, hdr->hdr_len));
+				   sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->hdr.gso_size),
+				   tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len));
 			print_hex_dump(KERN_ERR, "tun: ", DUMP_PREFIX_NONE, 16, 1, skb->head,
-				       min(tun_vnet16_to_cpu(flags, hdr->hdr_len), 64), true);
+				       min(tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len), 64), true);
 		}
 		WARN_ON_ONCE(1);
 		return -EINVAL;
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 980de74724fc..fe4b984d3bbb 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -62,6 +62,62 @@
 #define TUNSETCARRIER _IOW('T', 226, int)
 #define TUNGETDEVNETNS _IO('T', 227)

+/**
+ * define TUNGETVNETHASHTYPES - ioctl to get supported virtio_net hashing types
+ *
+ * The argument is a pointer to __u32 which will store the supported virtio_net
+ * hashing types.
+ */
+#define TUNGETVNETHASHTYPES _IOR('T', 228, __u32)
+/**
+ * define TUNSETVNETREPORTINGAUTOMQ - ioctl to enable automq with hash reporting
+ *
+ * Disable RSS and enable automatic receive steering with hash reporting.
+ *
+ * The argument is a pointer to __u32 that contains a bitmask of hash types
+ * allowed to be reported.
+ *
+ * This ioctl results in %EBADFD if the underlying device is deleted. It affects
+ * all queues attached to the same device.
+ *
+ * This ioctl currently has no effect on XDP packets and packets with
+ * queue_mapping set by TC.
+ */
+#define TUNSETVNETREPORTINGAUTOMQ _IOR('T', 229, __u32)
+/**
+ * define TUNSETVNETREPORTINGRSS - ioctl to enable RSS with hash reporting
+ *
+ * Disable automatic receive steering and enable RSS with hash reporting.
This is unnecessary, e.g one day will have select_queue_xyz(), we don't want to say "Disable automatic receive steering and xyz ..."
It is still better to document this, as its behavior is somewhat complicated.
This is a hint of uAPI design issue.
Concretely, this ioctl disables automatic receive steering but doesn't disable steering by eBPF, which is implied by TUN_STEERINGEBPF_FALLBACK.
It would be simpler:
1) not having TUN_STEERINGEBPF_FALLBACK
2) the steering algorithm depends on the last uAPI call
Perhaps it could be rephrased to say it disables "the other receive steering mechanisms except eBPF".
Regards, Akihiko Odaki
Thanks
On 2025/06/05 11:46, Jason Wang wrote:
On Wed, Jun 4, 2025 at 4:42 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/04 10:53, Jason Wang wrote:
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
Add common code required for the features being added to TUN and TAP. They will be enabled for each of them in following patches.
Added Features
Hash reporting
Allow the guest to reuse the hash value to make receive steering consistent between the host and guest, and to save hash computation.
Receive Side Scaling (RSS)
RSS is a receive steering algorithm that can be negotiated to use with virtio_net. Conventionally the hash calculation was done by the VMM. However, computing the hash after the queue was chosen defeats the purpose of RSS.
Another approach is to use an eBPF steering program. This approach has another downside: it cannot report the calculated hash due to the restrictive nature of the eBPF steering program.
Introduce the code to perform RSS to the kernel in order to overcome these challenges. An alternative solution is to extend the eBPF steering program so that it will be able to report to the userspace, but I didn't opt for it because extending the current mechanism of the eBPF steering program as-is relies on legacy context rewriting, and introducing kfunc-based eBPF would result in a non-UAPI dependency while the other relevant virtualization APIs such as KVM and vhost_net are UAPIs.
Added ioctls
They are designed with extensibility and VM migration compatibility in mind. This change only adds the implementation and does not expose them to the userspace.
TUNGETVNETHASHTYPES
This ioctl tells supported hash types. It is useful to check if a VM can be migrated to the current host.
TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
These ioctls configure a steering algorithm and, if needed, hash reporting.
Signed-off-by: Akihiko Odaki akihiko.odaki@daynix.com Tested-by: Lei Yang leiyang@redhat.com
drivers/net/tap.c | 10 ++- drivers/net/tun.c | 12 +++- drivers/net/tun_vnet.h | 165 +++++++++++++++++++++++++++++++++++++++++--- include/uapi/linux/if_tun.h | 71 +++++++++++++++++++ 4 files changed, 244 insertions(+), 14 deletions(-)
diff --git a/drivers/net/tap.c b/drivers/net/tap.c index d4ece538f1b2..25c60ff2d3f2 100644 --- a/drivers/net/tap.c +++ b/drivers/net/tap.c @@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q) sock_put(&q->sk); }
+static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb) +{
return NULL;
+}
- /*
- Select a queue based on the rxq of the device on which this packet
- arrived. If the incoming device is not mq, calculate a flow hash
@@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q, int total;
if (q->flags & IFF_VNET_HDR) {
struct virtio_net_hdr vnet_hdr;
struct virtio_net_hdr_v1_hash vnet_hdr; vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
tap_find_hash, &vnet_hdr); if (ret) return ret;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 9133ab9ed3f5..03d47799e9bd 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash) e->rps_rxhash = hash; }
+static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb) +{
return NULL;
+}
- /* We try to identify a flow through its rxhash. The reason that
- we do not check rxq no. is because some cards(e.g 82599), chooses
- the rxq based on the txq where the last packet of the flow comes. As
@@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun, ssize_t ret;
if (tun->flags & IFF_VNET_HDR) {
struct virtio_net_hdr gso = { 0 };
struct virtio_net_hdr_v1_hash gso = { 0 }; vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz); ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
@@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun, }
if (vnet_hdr_sz) {
struct virtio_net_hdr gso;
struct virtio_net_hdr_v1_hash gso;
ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev,
skb, tun_find_hash, &gso); if (ret) return ret;
diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h index 58b9ac7a5fc4..45d0533efc8d 100644 --- a/drivers/net/tun_vnet.h +++ b/drivers/net/tun_vnet.h @@ -6,6 +6,17 @@ #define TUN_VNET_LE 0x80000000 #define TUN_VNET_BE 0x40000000
+typedef struct virtio_net_hash *(*tun_vnet_hash_add)(struct sk_buff *); +typedef const struct virtio_net_hash *(*tun_vnet_hash_find)(const struct sk_buff *);
+struct tun_vnet_hash {
bool report;
bool rss;
struct tun_vnet_rss common;
u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
u16 rss_indirection_table[];
+};
- static inline bool tun_vnet_legacy_is_little_endian(unsigned int flags) { bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
@@ -107,6 +118,128 @@ static inline long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags, } }
+static inline long tun_vnet_ioctl_gethashtypes(u32 __user *argp) +{
return put_user(VIRTIO_NET_SUPPORTED_HASH_TYPES, argp) ? -EFAULT : 0;
+}
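TUNGETVNETHASHTYPES exists precisely for the migratability check described in the commit message: userspace reads the host's supported hash-type bitmask and verifies that the guest's negotiated types are a subset of it. A minimal sketch of the userspace-side check (the helper name and the hash-type defines here are illustrative, not part of this patch; in practice the bitmask would come from ioctl(fd, TUNGETVNETHASHTYPES, &types)):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hash type bits as in the virtio spec (assumed here for illustration). */
#define VIRTIO_NET_HASH_TYPE_IPv4  (1u << 0)
#define VIRTIO_NET_HASH_TYPE_TCPv4 (1u << 1)

/* Hypothetical helper: a VM can be migrated to this host only if every
 * hash type the guest negotiated is also in the host's supported bitmask
 * (as reported by TUNGETVNETHASHTYPES). */
static bool can_migrate(uint32_t host_supported, uint32_t guest_negotiated)
{
    return (guest_negotiated & ~host_supported) == 0;
}
```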
+static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash __rcu **hashp,
unsigned int cmd,
void __user *argp)
+{
struct tun_vnet_rss common;
struct tun_vnet_hash *hash;
size_t indirection_table_size;
size_t key_size;
size_t size;
switch (cmd) {
case TUNSETVNETREPORTINGAUTOMQ:
if (get_user(common.hash_types, (u32 __user *)argp))
return -EFAULT;
if (common.hash_types) {
hash = kzalloc(sizeof(*hash), GFP_KERNEL);
if (!hash)
return -ENOMEM;
hash->report = true;
hash->common.hash_types = common.hash_types;
} else {
hash = NULL;
}
break;
case TUNSETVNETREPORTINGRSS:
case TUNSETVNETRSS:
So the above three show unnecessary design redundancy as well as a burden for future extension. Why not simply have
- TUNSETVNET_RSS
- TUNSETVNET_HASH_REPORT
?
Which maps to
#define VIRTIO_NET_CTRL_MQ_RSS_CONFIG 1 (for configurable receive steering) #define VIRTIO_NET_CTRL_MQ_HASH_CONFIG 2 (for configurable hash calculation)
It is always easier to start with what the spec has, or at least we need to explain why we chose a different design, here or in the changelog, to ease our life.
TUNSETVNETREPORTINGAUTOMQ maps to VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
It's not:
VIRTIO_NET_CTRL_MQ_HASH_CONFIG uses:
struct virtio_net_hash_config { le32 hash_types; le16 reserved[4]; u8 hash_key_length; u8 hash_key_data[hash_key_length]; };
but TUNSETVNETREPORTINGAUTOMQ only accepts hash_types without others:
The others are not present because the spec doesn't specify what to do with them and the kernel doesn't use them either.
TUNSETVNETREPORTINGRSS and TUNSETVNETRSS map to VIRTIO_NET_CTRL_MQ_RSS_CONFIG.
I think we've already had a discussion about this.
Reusing virtio-net uAPI is much better instead of having a tun specific one considering tun may need to support more virtio commands in the future. Or maybe it's the time to introduce a transport for the virtio control virtqueue uAPI in tuntap to avoid inventing new uAPI endlessly.
What's more I see:
struct tun_vnet_rss { __u32 hash_types; __u16 indirection_table_mask; __u16 unclassified_queue; };
struct tun_vnet_hash { bool report; bool rss; struct tun_vnet_rss common; u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE]; u16 rss_indirection_table[]; };
As I pointed out in the past, let's just decouple the rss from hash, everything would be much simpler, or you need to explain why you couple this somewhere.
For example:
- why is the tun_vnet_hash not part of the uAPI but tun_vnet_rss, or
how could userspace know what kind of format it would use for TUNSETVNETREPORTINGRSS?
That was the previous version.
- what's the advantages of embedding rss specific stuff into hash
report structure
Because the hash types field in struct tun_vnet_rss is used by hash reporting too.
- what's the advantages of not using virtio-net uAPI
1. The use cases that don't involve VM will be simplified; programs for such a use case will not need to convert endian or to fill fields the kernel doesn't care about.
2. It aligns with existing UAPIs that operate in native endian and don't use virtio-net data structures, like TUNSETOFFLOAD and TUNSETVNETHDRSZ.
More issues:
1) max_tx_vq is ignored, so do you expect the userspace to intercept this and switch to using TUNSETQUEUE? This seems like a burden as TUN can simply accept it and do the attaching/detaching by itself
2) the rx depends on the indirection table, so userspace needs to intercept the indirection table and change the rx queue numbers accordingly
3) RSS allows asymmetric TX/RX queues; this is not supported by TUN now. No matter whether we decide to let userspace intercept max_tx_vq or not, we need to implement it first
Things would be much simpler if the kernel can do 1) and 2).
Attaching and detaching queues is not possible for the kernel because it doesn't know which file descriptors mapping to queues will be used by the userspace.
The following patch does 2) for QEMU: https://lore.kernel.org/qemu-devel/20250322-vq-v2-1-cee0aafe6404@daynix.com/
For 3), the patch for QEMU takes the maximum of TX and RX queue numbers to derive the number of queue pairs.
We have two ioctls here because VIRTIO_NET_CTRL_MQ_RSS_CONFIG behaves differently depending on whether VIRTIO_NET_F_HASH_REPORT is negotiated or not;
It wouldn't be a problem if you do 1:1 mapping between virtio commands and TUN uAPI, otherwise it should have a bug somewhere.
Speaking of 1:1 mapping, it is possible to map VIRTIO_NET_F_HASH_REPORT into another ioctl. It may help add another receive steering algorithm in the future by not requiring two ioctls (TUNSETVNETREPORTING_X and TUNSETVNET_X).
it also enables hash reporting if the feature is negotiated.
Again, starting from virtio-net command is easier, a strong justification is needed to explain why we choose another for tun/tap.
One day we would have tun_select_queue_algorithm_x() we don't have to duplicate the ioctls once again here like TUNSETVNETREPORTINGXYZ
5.1.6.5.6.4 Hash calculation says:
If VIRTIO_NET_F_HASH_REPORT was negotiated and the device uses automatic receive steering, the device MUST support a command to configure hash calculation parameters.
The driver provides parameters for hash calculation as follows:
class VIRTIO_NET_CTRL_MQ, command VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
VIRTIO_NET_CTRL_MQ_HASH_CONFIG is for automatic receive steering and not for RSS (or other steering algorithms if the spec gets any in the future).
I'm not sure but the spec needs some tweaking. For example, I don't expect there would be a dedicated hash config command for flow filters in the future. I think the reason why spec says like this is that virtio-net only supports automatic receive steering.
Note that macvtap doesn't implement automatic receive steering.
QEMU advertises VIRTIO_NET_F_CTRL_VQ for macvtap too, so it should have implemented it. I think QEMU with macvtap is still compliant with the spec.
"5.1.6.5.6 Automatic receive steering in multiqueue mode" says:
After the driver transmitted a packet of a flow on transmitqX, the device SHOULD cause incoming packets for that flow to be steered to receiveqX.
It is "SHOULD", so it is still compliant if the device doesn't properly respect the flow.
if (copy_from_user(&common, argp, sizeof(common)))
return -EFAULT;
argp = (struct tun_vnet_rss __user *)argp + 1;
indirection_table_size = ((size_t)common.indirection_table_mask + 1) * 2;
key_size = virtio_net_hash_key_length(common.hash_types);
size = struct_size(hash, rss_indirection_table,
(size_t)common.indirection_table_mask + 1);
hash = kmalloc(size, GFP_KERNEL);
if (!hash)
return -ENOMEM;
if (copy_from_user(hash->rss_indirection_table,
argp, indirection_table_size)) {
kfree(hash);
return -EFAULT;
}
argp = (u16 __user *)argp + common.indirection_table_mask + 1;
if (copy_from_user(hash->rss_key, argp, key_size)) {
kfree(hash);
return -EFAULT;
}
virtio_net_toeplitz_convert_key(hash->rss_key, key_size);
hash->report = cmd == TUNSETVNETREPORTINGRSS;
At least, if this is the only difference why not simply code this into the ioctl itself other than having a very similar command?
It is what the previous version did. Either is fine I guess; the only practical difference is that the configuration struct is smaller with this approach.
hash->rss = true;
hash->common = common;
break;
default:
return -EINVAL;
}
kfree_rcu_mightsleep(rcu_replace_pointer_rtnl(*hashp, hash));
return 0;
+}
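The copy_from_user() sequence above implies the wire layout of the TUNSETVNETRSS / TUNSETVNETREPORTINGRSS argument: the fixed struct tun_vnet_rss, immediately followed by (indirection_table_mask + 1) __u16 indirection table entries, then the hash key. A hedged userspace-side sketch of the size calculation (the helper is illustrative; in the kernel the key length comes from virtio_net_hash_key_length(), typically 40 bytes for Toeplitz):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the uAPI struct quoted in this thread. */
struct tun_vnet_rss {
    uint32_t hash_types;
    uint16_t indirection_table_mask;
    uint16_t unclassified_queue;
};

/* Total size of the TUNSETVNETRSS argument the kernel reads: the fixed
 * struct, then (mask + 1) u16 indirection table entries, then the key,
 * matching the three copy_from_user() calls above. */
static size_t tun_vnet_rss_arg_size(uint16_t indirection_table_mask,
                                    size_t key_size)
{
    return sizeof(struct tun_vnet_rss) +
           ((size_t)indirection_table_mask + 1) * sizeof(uint16_t) +
           key_size;
}
```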
+static inline void tun_vnet_hash_report(const struct tun_vnet_hash *hash,
struct sk_buff *skb,
const struct flow_keys_basic *keys,
u32 value,
tun_vnet_hash_add vnet_hash_add)
+{
struct virtio_net_hash *report;
if (!hash || !hash->report)
return;
report = vnet_hash_add(skb);
if (!report)
return;
*report = (struct virtio_net_hash) {
.report = virtio_net_hash_report(hash->common.hash_types, keys),
.value = value
};
+}
+static inline u16 tun_vnet_rss_select_queue(u32 numqueues,
const struct tun_vnet_hash *hash,
struct sk_buff *skb,
tun_vnet_hash_add vnet_hash_add)
+{
struct virtio_net_hash *report;
struct virtio_net_hash ret;
u16 index;
if (!numqueues)
return 0;
virtio_net_hash_rss(skb, hash->common.hash_types, hash->rss_key, &ret);
if (!ret.report)
return hash->common.unclassified_queue % numqueues;
if (hash->report) {
report = vnet_hash_add(skb);
if (report)
*report = ret;
}
index = ret.value & hash->common.indirection_table_mask;
return hash->rss_indirection_table[index] % numqueues;
+}
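tun_vnet_rss_select_queue() above boils down to a table lookup: mask the hash value into the indirection table and clamp the configured queue to the number of attached queues, falling back to unclassified_queue when no enabled hash type matched. A standalone sketch of that selection logic (simplified; the real code also fills in the hash report):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the RSS queue selection above. `classified` is nonzero when
 * the packet matched one of the enabled hash types. */
static uint16_t rss_select_queue(uint32_t hash_value, int classified,
                                 uint16_t indirection_table_mask,
                                 const uint16_t *indirection_table,
                                 uint16_t unclassified_queue,
                                 uint32_t numqueues)
{
    if (!numqueues)
        return 0;

    if (!classified)
        return unclassified_queue % numqueues;

    /* Mask into the table, then clamp to the attached queue count. */
    return indirection_table[hash_value & indirection_table_mask] % numqueues;
}
```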
- static inline int tun_vnet_hdr_get(int sz, unsigned int flags, struct iov_iter *from, struct virtio_net_hdr *hdr)
@@ -135,15 +268,17 @@ static inline int tun_vnet_hdr_get(int sz, unsigned int flags, }
 static inline int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
-				   const struct virtio_net_hdr *hdr)
+				   const struct virtio_net_hdr_v1_hash *hdr)
 {
+	int content_sz = MIN(sizeof(*hdr), sz);
+
 	if (unlikely(iov_iter_count(iter) < sz))
 		return -EINVAL;

-	if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
+	if (unlikely(copy_to_iter(hdr, content_sz, iter) != content_sz))
 		return -EFAULT;

-	if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
+	if (iov_iter_zero(sz - content_sz, iter) != sz - content_sz)
 		return -EFAULT;

 	return 0;
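The reworked tun_vnet_hdr_put() copies only min(sizeof(*hdr), sz) bytes of the header and zero-fills the remainder, so a guest that configured a vnet header shorter than struct virtio_net_hdr_v1_hash sees a truncated prefix, while a larger configured size is zero-padded rather than exposing stale bytes. A small sketch of that copy-then-zero-pad pattern (buffer sizes are arbitrary; this is an illustration, not the kernel code):

```c
#include <assert.h>
#include <string.h>

/* Fills dst[0..vnet_hdr_sz) from a header of hdr_sz bytes, then checks
 * that every byte past the copied content is zero. Returns 1 on success. */
static int put_hdr_pads_with_zero(int vnet_hdr_sz, int hdr_sz)
{
    char dst[64];
    char hdr[32];
    int content_sz = hdr_sz < vnet_hdr_sz ? hdr_sz : vnet_hdr_sz;
    int i;

    memset(hdr, 0xff, sizeof(hdr)); /* pretend header content */
    memset(dst, 0xaa, sizeof(dst)); /* stale buffer contents  */

    memcpy(dst, hdr, content_sz);                         /* copy_to_iter() */
    memset(dst + content_sz, 0, vnet_hdr_sz - content_sz); /* iov_iter_zero() */

    for (i = content_sz; i < vnet_hdr_sz; i++)
        if (dst[i] != 0)
            return 0;
    return 1;
}
```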
@@ -155,26 +290,38 @@ static inline int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb, return virtio_net_hdr_to_skb(skb, hdr, tun_vnet_is_little_endian(flags)); }
-static inline int tun_vnet_hdr_from_skb(unsigned int flags, +static inline int tun_vnet_hdr_from_skb(int sz, unsigned int flags, const struct net_device *dev, const struct sk_buff *skb,
struct virtio_net_hdr *hdr)
tun_vnet_hash_find vnet_hash_find,
{ int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;struct virtio_net_hdr_v1_hash *hdr)
const struct virtio_net_hash *report = sz < sizeof(struct virtio_net_hdr_v1_hash) ?
NULL : vnet_hash_find(skb);
*hdr = (struct virtio_net_hdr_v1_hash) {
.hash_report = VIRTIO_NET_HASH_REPORT_NONE
};
if (report) {
hdr->hash_value = cpu_to_le32(report->value);
hdr->hash_report = cpu_to_le16(report->report);
}
if (virtio_net_hdr_from_skb(skb, hdr,
if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr, tun_vnet_is_little_endian(flags), true, vlan_hlen)) { struct skb_shared_info *sinfo = skb_shinfo(skb); if (net_ratelimit()) { netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->gso_size),
tun_vnet16_to_cpu(flags, hdr->hdr_len));
sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->hdr.gso_size),
tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len)); print_hex_dump(KERN_ERR, "tun: ", DUMP_PREFIX_NONE, 16, 1, skb->head,
min(tun_vnet16_to_cpu(flags, hdr->hdr_len), 64), true);
min(tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len), 64), true); } WARN_ON_ONCE(1); return -EINVAL;
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h index 980de74724fc..fe4b984d3bbb 100644 --- a/include/uapi/linux/if_tun.h +++ b/include/uapi/linux/if_tun.h @@ -62,6 +62,62 @@ #define TUNSETCARRIER _IOW('T', 226, int) #define TUNGETDEVNETNS _IO('T', 227)
+/**
+ * define TUNGETVNETHASHTYPES - ioctl to get supported virtio_net hashing types
+ *
+ * The argument is a pointer to __u32 which will store the supported virtio_net
+ * hashing types.
+ */
+#define TUNGETVNETHASHTYPES _IOR('T', 228, __u32)
+/**
+ * define TUNSETVNETREPORTINGAUTOMQ - ioctl to enable automq with hash reporting
+ *
+ * Disable RSS and enable automatic receive steering with hash reporting.
+ *
+ * The argument is a pointer to __u32 that contains a bitmask of hash types
+ * allowed to be reported.
+ *
+ * This ioctl results in %EBADFD if the underlying device is deleted. It affects
+ * all queues attached to the same device.
+ *
+ * This ioctl currently has no effect on XDP packets and packets with
+ * queue_mapping set by TC.
+ */
+#define TUNSETVNETREPORTINGAUTOMQ _IOR('T', 229, __u32)
+/**
+ * define TUNSETVNETREPORTINGRSS - ioctl to enable RSS with hash reporting
+ *
+ * Disable automatic receive steering and enable RSS with hash reporting.
This is unnecessary; e.g. one day we will have select_queue_xyz(), and we don't want to say "Disable automatic receive steering and xyz ..."
It is still worth documenting, as its behavior is somewhat complicated.
This is a hint of a uAPI design issue.
Concretely, this ioctl disables automatic receive steering but doesn't disable steering by eBPF, which is implied by TUN_STEERINGEBPF_FALLBACK.
It would be simpler:
1) not having TUN_STEERINGEBPF_FALLBACK
2) the steering algorithm depends on the last uAPI call
What will TUNSETSTEERINGEBPF with NULL mean in that case? It currently switches the steering algorithm to automq.
Regards, Akihiko Odaki
On Thu, Jun 5, 2025 at 4:18 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/05 11:46, Jason Wang wrote:
On Wed, Jun 4, 2025 at 4:42 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/04 10:53, Jason Wang wrote:
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
Add common code required for the features being added to TUN and TAP. They will be enabled for each of them in following patches.
Added Features
Hash reporting
Allow the guest to reuse the hash value to make receive steering consistent between the host and guest, and to save hash computation.
Receive Side Scaling (RSS)
RSS is a receive steering algorithm that can be negotiated to use with virtio_net. Conventionally the hash calculation was done by the VMM. However, computing the hash after the queue was chosen defeats the purpose of RSS.
Another approach is to use an eBPF steering program. This approach has another downside: it cannot report the calculated hash due to the restrictive nature of the eBPF steering program.
Introduce the code to perform RSS to the kernel in order to overcome these challenges. An alternative solution is to extend the eBPF steering program so that it will be able to report to the userspace, but I didn't opt for it because extending the current mechanism of the eBPF steering program as-is relies on legacy context rewriting, and introducing kfunc-based eBPF would result in a non-UAPI dependency while the other relevant virtualization APIs such as KVM and vhost_net are UAPIs.
Added ioctls
They are designed with extensibility and VM migration compatibility in mind. This change only adds the implementation and does not expose them to the userspace.
TUNGETVNETHASHTYPES
This ioctl tells supported hash types. It is useful to check if a VM can be migrated to the current host.
TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
These ioctls configure a steering algorithm and, if needed, hash reporting.
Signed-off-by: Akihiko Odaki akihiko.odaki@daynix.com Tested-by: Lei Yang leiyang@redhat.com
drivers/net/tap.c | 10 ++- drivers/net/tun.c | 12 +++- drivers/net/tun_vnet.h | 165 +++++++++++++++++++++++++++++++++++++++++--- include/uapi/linux/if_tun.h | 71 +++++++++++++++++++ 4 files changed, 244 insertions(+), 14 deletions(-)
diff --git a/drivers/net/tap.c b/drivers/net/tap.c index d4ece538f1b2..25c60ff2d3f2 100644 --- a/drivers/net/tap.c +++ b/drivers/net/tap.c @@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q) sock_put(&q->sk); }
+static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb) +{
return NULL;
+}
- /*
- Select a queue based on the rxq of the device on which this packet
- arrived. If the incoming device is not mq, calculate a flow hash
@@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q, int total;
if (q->flags & IFF_VNET_HDR) {
struct virtio_net_hdr vnet_hdr;
struct virtio_net_hdr_v1_hash vnet_hdr; vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
tap_find_hash, &vnet_hdr); if (ret) return ret;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 9133ab9ed3f5..03d47799e9bd 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash) e->rps_rxhash = hash; }
+static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb) +{
return NULL;
+}
- /* We try to identify a flow through its rxhash. The reason that
- we do not check rxq no. is because some cards(e.g 82599), chooses
- the rxq based on the txq where the last packet of the flow comes. As
@@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun, ssize_t ret;
if (tun->flags & IFF_VNET_HDR) {
struct virtio_net_hdr gso = { 0 };
struct virtio_net_hdr_v1_hash gso = { 0 }; vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz); ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
@@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun, }
if (vnet_hdr_sz) {
struct virtio_net_hdr gso;
struct virtio_net_hdr_v1_hash gso;
ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev,
skb, tun_find_hash, &gso); if (ret) return ret;
diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h index 58b9ac7a5fc4..45d0533efc8d 100644 --- a/drivers/net/tun_vnet.h +++ b/drivers/net/tun_vnet.h @@ -6,6 +6,17 @@ #define TUN_VNET_LE 0x80000000 #define TUN_VNET_BE 0x40000000
+typedef struct virtio_net_hash *(*tun_vnet_hash_add)(struct sk_buff *); +typedef const struct virtio_net_hash *(*tun_vnet_hash_find)(const struct sk_buff *);
+struct tun_vnet_hash {
bool report;
bool rss;
struct tun_vnet_rss common;
u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
u16 rss_indirection_table[];
+};
- static inline bool tun_vnet_legacy_is_little_endian(unsigned int flags) { bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
@@ -107,6 +118,128 @@ static inline long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags, } }
+static inline long tun_vnet_ioctl_gethashtypes(u32 __user *argp) +{
return put_user(VIRTIO_NET_SUPPORTED_HASH_TYPES, argp) ? -EFAULT : 0;
+}
+static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash __rcu **hashp,
unsigned int cmd,
void __user *argp)
+{
struct tun_vnet_rss common;
struct tun_vnet_hash *hash;
size_t indirection_table_size;
size_t key_size;
size_t size;
switch (cmd) {
case TUNSETVNETREPORTINGAUTOMQ:
if (get_user(common.hash_types, (u32 __user *)argp))
return -EFAULT;
if (common.hash_types) {
hash = kzalloc(sizeof(*hash), GFP_KERNEL);
if (!hash)
return -ENOMEM;
hash->report = true;
hash->common.hash_types = common.hash_types;
} else {
hash = NULL;
}
break;
case TUNSETVNETREPORTINGRSS:
case TUNSETVNETRSS:
So the above three show unnecessary design redundancy as well as a burden for future extension. Why not simply have
- TUNSETVNET_RSS
- TUNSETVNET_HASH_REPORT
?
Which maps to
#define VIRTIO_NET_CTRL_MQ_RSS_CONFIG 1 (for configurable receive steering) #define VIRTIO_NET_CTRL_MQ_HASH_CONFIG 2 (for configurable hash calculation)
It is always easier to start with what the spec has, or at least we need to explain why we chose a different design, here or in the changelog, to ease our life.
TUNSETVNETREPORTINGAUTOMQ maps to VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
It's not:
VIRTIO_NET_CTRL_MQ_HASH_CONFIG uses:
struct virtio_net_hash_config { le32 hash_types; le16 reserved[4]; u8 hash_key_length; u8 hash_key_data[hash_key_length]; };
but TUNSETVNETREPORTINGAUTOMQ only accepts hash_types without others:
The others are not present because the spec doesn't specify what to do with them and the kernel doesn't use them either.
Did you mean the hash_key_length and hash_key_data? Note that we have drivers other than the Linux ones as well.
TUNSETVNETREPORTINGRSS and TUNSETVNETRSS map to VIRTIO_NET_CTRL_MQ_RSS_CONFIG.
I think we've already had a discussion about this.
Reusing virtio-net uAPI is much better instead of having a tun specific one considering tun may need to support more virtio commands in the future. Or maybe it's the time to introduce a transport for the virtio control virtqueue uAPI in tuntap to avoid inventing new uAPI endlessly.
What's more I see:
struct tun_vnet_rss { __u32 hash_types; __u16 indirection_table_mask; __u16 unclassified_queue; };
struct tun_vnet_hash { bool report; bool rss; struct tun_vnet_rss common; u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE]; u16 rss_indirection_table[]; };
As I pointed out in the past, let's just decouple the rss from hash, everything would be much simpler, or you need to explain why you couple this somewhere.
For example:
- why is the tun_vnet_hash not part of the uAPI but tun_vnet_rss, or
how could userspace know what kind of format it would use for TUNSETVNETREPORTINGRSS?
That was the previous version.
- what's the advantages of embedding rss specific stuff into hash
report structure
Because the hash types field in struct tun_vnet_rss is used by hash reporting too.
- what's the advantages of not using virtio-net uAPI
1. The use cases that don't involve VM will be simplified; programs for such a use case will not need to convert endian or to fill fields the kernel doesn't care about.
Well, virtio_net_hdr is used by packet socket as well. Considering the complexity of designing a new uAPI, it's still better.
Or maybe you can clarify which fields the kernel doesn't care about? In this case TUN/TAP is basically the device datapath; if some of the fields don't make sense, it's a bug in the spec.
2. It aligns with existing UAPIs that operate in native endian and don't use virtio-net data structures, like TUNSETOFFLOAD and TUNSETVNETHDRSZ.
For those two examples, it would be used by guests directly. This is different from RSS stuff.
With native endian, you need an endian conversion that converts le to native.
More issues:
1) max_tx_vq is ignored, so do you expect the userspace to intercept this and switch to using TUNSETQUEUE? This seems like a burden as TUN can simply accept it and do the attaching/detaching by itself
2) the rx depends on the indirection table, so userspace needs to intercept the indirection table and change the rx queue numbers accordingly
3) RSS allows asymmetric TX/RX queues; this is not supported by TUN now. No matter whether we decide to let userspace intercept max_tx_vq or not, we need to implement it first
Things would be much simpler if the kernel can do 1) and 2).
Attaching and detaching queues is not possible for the kernel because it doesn't know which file descriptors mapping to queues will be used by the userspace.
The kernel knows, tfile has a queue_index part.
The following patch does 2) for QEMU: https://lore.kernel.org/qemu-devel/20250322-vq-v2-1-cee0aafe6404@daynix.com/
See the below point; from the view of the kernel, it's still a queue pair, not asymmetric TX/RX queues.
For 3), the patch for QEMU takes the maximum of TX and RX queue numbers to derive the number of queue pairs.
We have two ioctls here because VIRTIO_NET_CTRL_MQ_RSS_CONFIG behaves differently depending on whether VIRTIO_NET_F_HASH_REPORT is negotiated or not;
It wouldn't be a problem if you do 1:1 mapping between virtio commands and TUN uAPI, otherwise it should have a bug somewhere.
Speaking of 1:1 mapping, it is possible to map VIRTIO_NET_F_HASH_REPORT into another ioctl. It may help add another receive steering algorithm in the future by not requiring two ioctls (TUNSETVNETREPORTING_X and TUNSETVNET_X).
Yes and as I pointed out, virtio_net_hash_config should not be specific to automq, it can work for other steering algorithm as well.
it also enables hash reporting if the feature is negotiated.
Again, starting from virtio-net command is easier, a strong justification is needed to explain why we choose another for tun/tap.
One day we would have tun_select_queue_algorithm_x() we don't have to duplicate the ioctls once again here like TUNSETVNETREPORTINGXYZ
5.1.6.5.6.4 Hash calculation says:
If VIRTIO_NET_F_HASH_REPORT was negotiated and the device uses automatic receive steering, the device MUST support a command to configure hash calculation parameters.
The driver provides parameters for hash calculation as follows:
class VIRTIO_NET_CTRL_MQ, command VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
VIRTIO_NET_CTRL_MQ_HASH_CONFIG is for automatic receive steering and not for RSS (or other steering algorithms if the spec gets any in the future).
I'm not sure but the spec needs some tweaking. For example, I don't expect there would be a dedicated hash config command for flow filters in the future. I think the reason why spec says like this is that virtio-net only supports automatic receive steering.
Note that macvtap doesn't implement automatic receive steering.
QEMU advertises VIRTIO_NET_F_CTRL_VQ for macvtap too, so it should have implemented it. I think QEMU with macvtap is still compliant with the spec.
Compliant, but automatic traffic steering is the best effort as well.
Nope, TUN/TAP implements a flow cache that can steer tx based on rx. Macvtap simply uses hash here.
"5.1.6.5.6 Automatic receive steering in multiqueue mode" says:
After the driver transmitted a packet of a flow on transmitqX, the device SHOULD cause incoming packets for that flow to be steered to receiveqX.
It is "SHOULD", so it is still compliant if the device doesn't properly respect the flow.
Yes, it's a quality-of-implementation issue; it's impractical to support correct steering for this device as resources are limited and malicious users can do SYN flood etc.
if (copy_from_user(&common, argp, sizeof(common)))
return -EFAULT;
argp = (struct tun_vnet_rss __user *)argp + 1;
indirection_table_size = ((size_t)common.indirection_table_mask + 1) * 2;
key_size = virtio_net_hash_key_length(common.hash_types);
size = struct_size(hash, rss_indirection_table,
(size_t)common.indirection_table_mask + 1);
hash = kmalloc(size, GFP_KERNEL);
if (!hash)
return -ENOMEM;
if (copy_from_user(hash->rss_indirection_table,
argp, indirection_table_size)) {
kfree(hash);
return -EFAULT;
}
argp = (u16 __user *)argp + common.indirection_table_mask + 1;
if (copy_from_user(hash->rss_key, argp, key_size)) {
kfree(hash);
return -EFAULT;
}
virtio_net_toeplitz_convert_key(hash->rss_key, key_size);
hash->report = cmd == TUNSETVNETREPORTINGRSS;
At least, if this is the only difference why not simply code this into the ioctl itself other than having a very similar command?
It is what the previous version did. Either is fine I guess; the only practical difference is that the configuration struct is smaller with this approach.
hash->rss = true;
hash->common = common;
break;
default:
return -EINVAL;
}
kfree_rcu_mightsleep(rcu_replace_pointer_rtnl(*hashp, hash));
return 0;
+}
+static inline void tun_vnet_hash_report(const struct tun_vnet_hash *hash,
struct sk_buff *skb,
const struct flow_keys_basic *keys,
u32 value,
tun_vnet_hash_add vnet_hash_add)
+{
struct virtio_net_hash *report;
if (!hash || !hash->report)
return;
report = vnet_hash_add(skb);
if (!report)
return;
*report = (struct virtio_net_hash) {
.report = virtio_net_hash_report(hash->common.hash_types, keys),
.value = value
};
+}
+static inline u16 tun_vnet_rss_select_queue(u32 numqueues,
const struct tun_vnet_hash *hash,
struct sk_buff *skb,
tun_vnet_hash_add vnet_hash_add)
+{
struct virtio_net_hash *report;
struct virtio_net_hash ret;
u16 index;
if (!numqueues)
return 0;
virtio_net_hash_rss(skb, hash->common.hash_types, hash->rss_key, &ret);
if (!ret.report)
return hash->common.unclassified_queue % numqueues;
if (hash->report) {
report = vnet_hash_add(skb);
if (report)
*report = ret;
}
index = ret.value & hash->common.indirection_table_mask;
return hash->rss_indirection_table[index] % numqueues;
+}
- static inline int tun_vnet_hdr_get(int sz, unsigned int flags, struct iov_iter *from, struct virtio_net_hdr *hdr)
@@ -135,15 +268,17 @@ static inline int tun_vnet_hdr_get(int sz, unsigned int flags, }
static inline int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
const struct virtio_net_hdr *hdr)
{const struct virtio_net_hdr_v1_hash *hdr)
int content_sz = MIN(sizeof(*hdr), sz);
if (unlikely(iov_iter_count(iter) < sz)) return -EINVAL;
if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
if (unlikely(copy_to_iter(hdr, content_sz, iter) != content_sz)) return -EFAULT;
if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
if (iov_iter_zero(sz - content_sz, iter) != sz - content_sz) return -EFAULT; return 0;
@@ -155,26 +290,38 @@ static inline int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb, return virtio_net_hdr_to_skb(skb, hdr, tun_vnet_is_little_endian(flags)); }
-static inline int tun_vnet_hdr_from_skb(unsigned int flags, +static inline int tun_vnet_hdr_from_skb(int sz, unsigned int flags, const struct net_device *dev, const struct sk_buff *skb,
struct virtio_net_hdr *hdr)
tun_vnet_hash_find vnet_hash_find,
{ int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;struct virtio_net_hdr_v1_hash *hdr)
const struct virtio_net_hash *report = sz < sizeof(struct virtio_net_hdr_v1_hash) ?
NULL : vnet_hash_find(skb);
*hdr = (struct virtio_net_hdr_v1_hash) {
.hash_report = VIRTIO_NET_HASH_REPORT_NONE
};
if (report) {
hdr->hash_value = cpu_to_le32(report->value);
hdr->hash_report = cpu_to_le16(report->report);
}
if (virtio_net_hdr_from_skb(skb, hdr,
if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr, tun_vnet_is_little_endian(flags), true, vlan_hlen)) { struct skb_shared_info *sinfo = skb_shinfo(skb); if (net_ratelimit()) { netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->gso_size),
tun_vnet16_to_cpu(flags, hdr->hdr_len));
sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->hdr.gso_size),
tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len)); print_hex_dump(KERN_ERR, "tun: ", DUMP_PREFIX_NONE, 16, 1, skb->head,
min(tun_vnet16_to_cpu(flags, hdr->hdr_len), 64), true);
min(tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len), 64), true); } WARN_ON_ONCE(1); return -EINVAL;
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h index 980de74724fc..fe4b984d3bbb 100644 --- a/include/uapi/linux/if_tun.h +++ b/include/uapi/linux/if_tun.h @@ -62,6 +62,62 @@ #define TUNSETCARRIER _IOW('T', 226, int) #define TUNGETDEVNETNS _IO('T', 227)
+/**
- define TUNGETVNETHASHTYPES - ioctl to get supported virtio_net hashing types
- The argument is a pointer to __u32 which will store the supported virtio_net
- hashing types.
- */
+#define TUNGETVNETHASHTYPES _IOR('T', 228, __u32)
+/**
- define TUNSETVNETREPORTINGAUTOMQ - ioctl to enable automq with hash reporting
- Disable RSS and enable automatic receive steering with hash reporting.
- The argument is a pointer to __u32 that contains a bitmask of hash types
- allowed to be reported.
- This ioctl results in %EBADFD if the underlying device is deleted. It affects
- all queues attached to the same device.
- This ioctl currently has no effect on XDP packets and packets with
- queue_mapping set by TC.
- */
+#define TUNSETVNETREPORTINGAUTOMQ _IOR('T', 229, __u32)
+/**
- define TUNSETVNETREPORTINGRSS - ioctl to enable RSS with hash reporting
- Disable automatic receive steering and enable RSS with hash reporting.
This is unnecessary, e.g one day will have select_queue_xyz(), we don't want to say "Disable automatic receive steering and xyz ..."
It is still something better to be documented as its behavior is somewhat complicated.
This is a hint of uAPI design issue.
Concretely, this ioctl disables automatic receive steering but doesn't disable steering by eBPF, which is implied by TUN_STEERINGEBPF_FALLBACK.
It would be simpler:
- not having TUN_STEERINGEBPF_FALLBACK
- the steering algorithm depends on the last uAPI call
What will TUNSETSTEERINGEBPF with NULL mean in that case? It currently switches the steering algorithm to automq.
A stackwise semantic then?
Thanks
Regards, Akihiko Odaki
On 2025/06/06 10:01, Jason Wang wrote:
On Thu, Jun 5, 2025 at 4:18 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/05 11:46, Jason Wang wrote:
On Wed, Jun 4, 2025 at 4:42 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
On 2025/06/04 10:53, Jason Wang wrote:
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki akihiko.odaki@daynix.com wrote:
Add common code required for the features being added to TUN and TAP. They will be enabled for each of them in following patches.
Added Features
Hash reporting
Allow the guest to reuse the hash value to make receive steering consistent between the host and guest, and to save hash computation.
Receive Side Scaling (RSS)
RSS is a receive steering algorithm that can be negotiated to use with virtio_net. Conventionally the hash calculation was done by the VMM. However, computing the hash after the queue was chosen defeats the purpose of RSS.
Another approach is to use eBPF steering program. This approach has another downside: it cannot report the calculated hash due to the restrictive nature of eBPF steering program.
Introduce the code to perform RSS to the kernel in order to overcome these challenges. An alternative solution is to extend the eBPF steering program so that it will be able to report to the userspace, but I didn't opt for it: the current mechanism of the eBPF steering program relies on legacy context rewriting, and introducing kfunc-based eBPF will result in a non-UAPI dependency while the other relevant virtualization APIs such as KVM and vhost_net are UAPIs.
Added ioctls
They are designed with extensibility and VM migration compatibility in mind. This change only adds the implementation and does not expose them to the userspace.
TUNGETVNETHASHTYPES
This ioctl reports the supported hash types. It is useful to check whether a VM can be migrated to the current host.
TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
These ioctls configure a steering algorithm and, if needed, hash reporting.
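As a concrete illustration of the migration check that TUNGETVNETHASHTYPES enables: on the VMM side it reduces to a bitmask subset test. This is a sketch of userspace policy, not code from this series, and the helper name is made up:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A VM configured with a set of required hash types can run on a host
 * iff that set is a subset of what the host reports via
 * TUNGETVNETHASHTYPES. */
static bool vnet_hash_types_compatible(uint32_t host_supported,
				       uint32_t guest_required)
{
	return (guest_required & ~host_supported) == 0;
}
```

A management layer would run this check against the destination host before starting a migration.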
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Tested-by: Lei Yang <leiyang@redhat.com>
---
 drivers/net/tap.c           |  10 ++-
 drivers/net/tun.c           |  12 +++-
 drivers/net/tun_vnet.h      | 165 +++++++++++++++++++++++++++++++++++++++++---
 include/uapi/linux/if_tun.h |  71 +++++++++++++++++++
 4 files changed, 244 insertions(+), 14 deletions(-)
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index d4ece538f1b2..25c60ff2d3f2 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
 	sock_put(&q->sk);
 }

+static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
+{
+	return NULL;
+}
+
 /*
  * Select a queue based on the rxq of the device on which this packet
  * arrived. If the incoming device is not mq, calculate a flow hash
@@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q,
 	int total;

 	if (q->flags & IFF_VNET_HDR) {
-		struct virtio_net_hdr vnet_hdr;
+		struct virtio_net_hdr_v1_hash vnet_hdr;

 		vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);

-		ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
+		ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
+					    tap_find_hash, &vnet_hdr);
 		if (ret)
 			return ret;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9133ab9ed3f5..03d47799e9bd 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
 	e->rps_rxhash = hash;
 }

+static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
+{
+	return NULL;
+}
+
 /* We try to identify a flow through its rxhash. The reason that
  * we do not check rxq no. is because some cards(e.g 82599), chooses
  * the rxq based on the txq where the last packet of the flow comes. As
@@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
 	ssize_t ret;

 	if (tun->flags & IFF_VNET_HDR) {
-		struct virtio_net_hdr gso = { 0 };
+		struct virtio_net_hdr_v1_hash gso = { 0 };

 		vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
 		ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
@@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 	}

 	if (vnet_hdr_sz) {
-		struct virtio_net_hdr gso;
+		struct virtio_net_hdr_v1_hash gso;

-		ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
+		ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev,
+					    skb, tun_find_hash, &gso);
 		if (ret)
 			return ret;
diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
index 58b9ac7a5fc4..45d0533efc8d 100644
--- a/drivers/net/tun_vnet.h
+++ b/drivers/net/tun_vnet.h
@@ -6,6 +6,17 @@
 #define TUN_VNET_LE 0x80000000
 #define TUN_VNET_BE 0x40000000

+typedef struct virtio_net_hash *(*tun_vnet_hash_add)(struct sk_buff *);
+typedef const struct virtio_net_hash *(*tun_vnet_hash_find)(const struct sk_buff *);
+
+struct tun_vnet_hash {
+	bool report;
+	bool rss;
+	struct tun_vnet_rss common;
+	u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
+	u16 rss_indirection_table[];
+};
+
 static inline bool tun_vnet_legacy_is_little_endian(unsigned int flags)
 {
 	bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
@@ -107,6 +118,128 @@ static inline long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags,
 	}
 }

+static inline long tun_vnet_ioctl_gethashtypes(u32 __user *argp)
+{
+	return put_user(VIRTIO_NET_SUPPORTED_HASH_TYPES, argp) ? -EFAULT : 0;
+}
+
+static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash __rcu **hashp,
+					  unsigned int cmd,
+					  void __user *argp)
+{
+	struct tun_vnet_rss common;
+	struct tun_vnet_hash *hash;
+	size_t indirection_table_size;
+	size_t key_size;
+	size_t size;
+
+	switch (cmd) {
+	case TUNSETVNETREPORTINGAUTOMQ:
+		if (get_user(common.hash_types, (u32 __user *)argp))
+			return -EFAULT;
+
+		if (common.hash_types) {
+			hash = kzalloc(sizeof(*hash), GFP_KERNEL);
+			if (!hash)
+				return -ENOMEM;
+
+			hash->report = true;
+			hash->common.hash_types = common.hash_types;
+		} else {
+			hash = NULL;
+		}
+		break;
+
+	case TUNSETVNETREPORTINGRSS:
+	case TUNSETVNETRSS:
So the above three show unnecessary design redundancy as well as a burden for future extension. Why not simply have
- TUNSETVNET_RSS
- TUNSETVNET_HASH_REPORT
?
Which maps to
#define VIRTIO_NET_CTRL_MQ_RSS_CONFIG 1 (for configurable receive steering)
#define VIRTIO_NET_CTRL_MQ_HASH_CONFIG 2 (for configurable hash calculation)
It would always be easier to start with what the spec had, or at least we need to explain why we chose a different design, here or in the changelog, to ease our life.
TUNSETVNETREPORTINGAUTOMQ maps to VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
It's not:
VIRTIO_NET_CTRL_MQ_HASH_CONFIG uses:
struct virtio_net_hash_config {
	le32 hash_types;
	le16 reserved[4];
	u8 hash_key_length;
	u8 hash_key_data[hash_key_length];
};
but TUNSETVNETREPORTINGAUTOMQ only accepts hash_types without others:
The others are not present because the spec doesn't specify what to do with them and the kernel doesn't use them either.
Did you mean the hash_key_length and hash_key_data? Note that we have drivers other than the Linux ones as well.
And reserved. Drivers can set these fields to whatever values; it is not specified how these fields should be used.
TUNSETVNETREPORTINGRSS and TUNSETVNETRSS map to VIRTIO_NET_CTRL_MQ_RSS_CONFIG.
I think we've already had a discussion about this.
Reusing the virtio-net uAPI is much better than having a tun-specific one, considering tun may need to support more virtio commands in the future. Or maybe it's time to introduce a transport for the virtio control virtqueue uAPI in tuntap to avoid inventing new uAPIs endlessly.
What's more I see:
struct tun_vnet_rss {
	__u32 hash_types;
	__u16 indirection_table_mask;
	__u16 unclassified_queue;
};

struct tun_vnet_hash {
	bool report;
	bool rss;
	struct tun_vnet_rss common;
	u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
	u16 rss_indirection_table[];
};
As I pointed out in the past, let's just decouple the RSS from the hash; everything would be much simpler. Otherwise you need to explain somewhere why you couple them.
For example:
- why is the tun_vnet_hash not part of the uAPI but tun_vnet_rss, or
how could userspace know what kind of format it would use for TUNSETVNETREPORTINGRSS?
That was the previous version.
- what's the advantages of embedding rss specific stuff into hash
report structure
Because the hash types field in struct tun_vnet_rss is used by hash reporting too.
- what's the advantages of not using virtio-net uAPI
- The use cases that don't involve a VM will be simplified; programs for such a use case will not need to convert endianness or to fill fields the kernel doesn't care about.
Well, virtio_net_hdr is used by packet socket as well. Considering the complexity of designing a new uAPI, it's still better.
This patch series also reuses the datapath, following the prior examples.
Or maybe you can clarify which field that kernel doesn't care about? In this case TUN/TAP is basically the device datapath, if some of the fields don't make sense, it's a bug of the spec.
reserved, hash_key_length, and hash_key_data.
- It aligns with existing UAPIs that operate in native endian and don't
use virtio-net data structures like TUNSETOFFLOAD and TUNSETVNETHDRSZ.
For those two examples, it would be used by guests directly. This is different from RSS stuff.
They are mediated by the VMM, which is no different from RSS.
With native endian, you need an endian conversion that converts le to native.
That's true, but QEMU does so anyway to validate the configuration, to attach/detach queues, and to share the data structures with userspace RSS implementations. I expect other VMMs will do so too.
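For reference, the conversion in question is a single little-endian decode per field. A VMM translating the hash_types field of the driver-supplied virtio_net_hash_config into the native-endian ioctl argument would do something like this (a sketch; the helper name is made up):

```c
#include <assert.h>
#include <stdint.h>

/* Decode a little-endian 32-bit field (e.g. hash_types from the
 * driver-supplied virtio_net_hash_config) into the host's native
 * endianness, as a VMM would before issuing a native-endian ioctl. */
static uint32_t le32_to_host(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

On a little-endian host this compiles down to a plain load, so the mediation cost is negligible either way.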
More issues:
1) max_tx_vq is ignored, so do you expect the userspace to intercept this and switch to using TUNSETQUEUE? This seems like a burden, as TUN can simply accept it and do the attaching/detaching by itself.
2) The rx depends on the indirection table, so userspace needs to intercept the indirection table and change the rx queue numbers accordingly.
3) RSS allows async TX/RX queues, which is not supported by TUN now; no matter whether we decide to let userspace intercept max_tx_vq or not, we need to implement it first.
Things would be much simpler if the kernel could do 1) and 2).
Attaching and detaching queues is not possible for the kernel because it doesn't know which file descriptors the userspace will use for the queues.
The kernel knows, tfile has a queue_index part.
queue_index is set with TUNSETQUEUE so we need the ioctl.
The following patch does 2) for QEMU: https://lore.kernel.org/qemu-devel/20250322-vq-v2-1-cee0aafe6404@daynix.com/
See the point below; from the view of the kernel, it's still a queue pair, not async TX/RX queues.
For 3), the patch for QEMU takes the maximum of TX and RX queue numbers to derive the number of queue pairs.
We have two ioctls here because VIRTIO_NET_CTRL_MQ_RSS_CONFIG behaves differently depending on whether VIRTIO_NET_F_HASH_REPORT is negotiated or not;
It wouldn't be a problem if you do a 1:1 mapping between virtio commands and the TUN uAPI; otherwise there should be a bug somewhere.
Speaking of 1:1 mapping, it is possible to map VIRTIO_NET_F_HASH_REPORT into another ioctl. It may help add another receive steering algorithm in the future by not requiring two ioctls (TUNSETVNETREPORTING_X and TUNSETVNET_X).
Yes and as I pointed out, virtio_net_hash_config should not be specific to automq, it can work for other steering algorithm as well.
That's not what the virtio spec says, so it will not be 1:1 mapping between virtio commands and TUN uAPI.
it also enables hash reporting if the feature is negotiated.
Again, starting from virtio-net command is easier, a strong justification is needed to explain why we choose another for tun/tap.
One day we may have tun_select_queue_algorithm_x(); we shouldn't have to duplicate the ioctls once again here, like TUNSETVNETREPORTINGXYZ.
5.1.6.5.6.4 Hash calculation says:
If VIRTIO_NET_F_HASH_REPORT was negotiated and the device uses automatic receive steering, the device MUST support a command to configure hash calculation parameters.
The driver provides parameters for hash calculation as follows:
class VIRTIO_NET_CTRL_MQ, command VIRTIO_NET_CTRL_MQ_HASH_CONFIG.
VIRTIO_NET_CTRL_MQ_HASH_CONFIG is for automatic receive steering and not for RSS (or other steering algorithms if the spec gets any in the future).
I'm not sure but the spec needs some tweaking. For example, I don't expect there would be a dedicated hash config command for flow filters in the future. I think the reason why spec says like this is that virtio-net only supports automatic receive steering.
Note that macvtap doesn't implement automatic receive steering.
QEMU advertises VIRTIO_NET_F_CTRL_VQ for macvtap too, so it should have implemented it. I think QEMU with macvtap is still compliant with the spec.
Compliant, but automatic traffic steering is the best effort as well.
Nope, TUN/TAP implements a flow cache that can steer tx based on rx. Macvtap simply uses hash here.
"5.1.6.5.6 Automatic receive steering in multiqueue mode" says:
After the driver transmitted a packet of a flow on transmitqX, the device SHOULD cause incoming packets for that flow to be steered to receiveqX.
It is "SHOULD", so it is still compliant if the device doesn't properly respect the flow.
Yes, it's a quality-of-implementation issue, or it's impractical to support correct steering for this device, as resources are limited and malicious users can do SYN floods etc.
+		if (copy_from_user(&common, argp, sizeof(common)))
+			return -EFAULT;
+
+		argp = (struct tun_vnet_rss __user *)argp + 1;
+		indirection_table_size = ((size_t)common.indirection_table_mask + 1) * 2;
+		key_size = virtio_net_hash_key_length(common.hash_types);
+		size = struct_size(hash, rss_indirection_table,
+				   (size_t)common.indirection_table_mask + 1);
+
+		hash = kmalloc(size, GFP_KERNEL);
+		if (!hash)
+			return -ENOMEM;
+
+		if (copy_from_user(hash->rss_indirection_table,
+				   argp, indirection_table_size)) {
+			kfree(hash);
+			return -EFAULT;
+		}
+		argp = (u16 __user *)argp + common.indirection_table_mask + 1;
+
+		if (copy_from_user(hash->rss_key, argp, key_size)) {
+			kfree(hash);
+			return -EFAULT;
+		}
+
+		virtio_net_toeplitz_convert_key(hash->rss_key, key_size);
+		hash->report = cmd == TUNSETVNETREPORTINGRSS;
At least, if this is the only difference, why not simply code this into the ioctl itself rather than having a very similar command?
It is what the previous version did. Either is fine, I guess; the only practical difference is that the configuration struct is smaller with this approach.
+		hash->rss = true;
+		hash->common = common;
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	kfree_rcu_mightsleep(rcu_replace_pointer_rtnl(*hashp, hash));
+	return 0;
+}
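The copy_from_user() sequence above implies a packed argument layout for TUNSETVNETREPORTINGRSS and TUNSETVNETRSS: a struct tun_vnet_rss header, immediately followed by indirection_table_mask + 1 __u16 indirection entries, then the hash key. A userspace sketch of building that buffer (the struct layout is taken from this series, which is not a released UAPI header yet; the helper is hypothetical):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Mirrors struct tun_vnet_rss from this series (assumed layout). */
struct tun_vnet_rss {
	uint32_t hash_types;
	uint16_t indirection_table_mask;
	uint16_t unclassified_queue;
};

/* Build the ioctl argument: header, then mask + 1 indirection entries,
 * then the Toeplitz key, all packed back to back. Caller frees. */
static void *build_rss_arg(const struct tun_vnet_rss *common,
			   const uint16_t *table, /* mask + 1 entries */
			   const uint8_t *key, size_t key_len,
			   size_t *out_len)
{
	size_t entries = (size_t)common->indirection_table_mask + 1;
	size_t len = sizeof(*common) + entries * sizeof(uint16_t) + key_len;
	uint8_t *buf = malloc(len);

	if (!buf)
		return NULL;

	memcpy(buf, common, sizeof(*common));
	memcpy(buf + sizeof(*common), table, entries * sizeof(uint16_t));
	memcpy(buf + sizeof(*common) + entries * sizeof(uint16_t), key, key_len);
	*out_len = len;
	return buf;
}
```

The resulting buffer would then be passed as the argp of the ioctl; the key length must match what the kernel derives from hash_types via virtio_net_hash_key_length().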
+static inline void tun_vnet_hash_report(const struct tun_vnet_hash *hash,
+					struct sk_buff *skb,
+					const struct flow_keys_basic *keys,
+					u32 value,
+					tun_vnet_hash_add vnet_hash_add)
+{
+	struct virtio_net_hash *report;
+
+	if (!hash || !hash->report)
+		return;
+
+	report = vnet_hash_add(skb);
+	if (!report)
+		return;
+
+	*report = (struct virtio_net_hash) {
+		.report = virtio_net_hash_report(hash->common.hash_types, keys),
+		.value = value
+	};
+}
+static inline u16 tun_vnet_rss_select_queue(u32 numqueues,
+					    const struct tun_vnet_hash *hash,
+					    struct sk_buff *skb,
+					    tun_vnet_hash_add vnet_hash_add)
+{
+	struct virtio_net_hash *report;
+	struct virtio_net_hash ret;
+	u16 index;
+
+	if (!numqueues)
+		return 0;
+
+	virtio_net_hash_rss(skb, hash->common.hash_types, hash->rss_key, &ret);
+
+	if (!ret.report)
+		return hash->common.unclassified_queue % numqueues;
+
+	if (hash->report) {
+		report = vnet_hash_add(skb);
+		if (report)
+			*report = ret;
+	}
+
+	index = ret.value & hash->common.indirection_table_mask;
+	return hash->rss_indirection_table[index] % numqueues;
+}
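virtio_net_hash_rss() above ultimately computes a Toeplitz hash over the flow tuple (the key having been pre-converted by virtio_net_toeplitz_convert_key()). For reference, here is a minimal userspace model of the plain byte-oriented Toeplitz hash, checked against the well-known Microsoft RSS verification vector; this is the standard algorithm that virtio-net specifies, not code from this series:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain Toeplitz: for each input bit that is set (MSB first), XOR in
 * the current leftmost 32 bits of the key, then slide the key window
 * by one bit. */
static uint32_t toeplitz_hash(const uint8_t key[40],
			      const uint8_t *data, size_t len)
{
	uint64_t window = 0;
	uint32_t hash = 0;
	size_t next = 8;

	for (size_t i = 0; i < 8; i++)
		window = (window << 8) | key[i]; /* top 64 key bits */

	for (size_t i = 0; i < len; i++) {
		for (int bit = 7; bit >= 0; bit--) {
			if (data[i] & (1u << bit))
				hash ^= (uint32_t)(window >> 32);
			window <<= 1;
		}
		if (next < 40)
			window |= key[next++]; /* refill the emptied low byte */
	}
	return hash;
}

/* Microsoft RSS verification key and IPv4-only test vector
 * (src 66.9.149.187, dst 161.142.100.80). */
static const uint8_t rss_test_key[40] = {
	0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
	0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
	0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
	0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
	0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa
};
static const uint8_t rss_test_input[8] = { 66, 9, 149, 187, 161, 142, 100, 80 };
```

The kernel-side implementation avoids the bit loop by pre-expanding the key, but the result is the same function; the indirection step then reduces to the masking and modulo seen in tun_vnet_rss_select_queue().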
 static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
 				   struct iov_iter *from,
 				   struct virtio_net_hdr *hdr)
@@ -135,15 +268,17 @@ static inline int tun_vnet_hdr_get(int sz, unsigned int flags,
 }
 static inline int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
-				   const struct virtio_net_hdr *hdr)
+				   const struct virtio_net_hdr_v1_hash *hdr)
 {
+	int content_sz = MIN(sizeof(*hdr), sz);
+
 	if (unlikely(iov_iter_count(iter) < sz))
 		return -EINVAL;

-	if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
+	if (unlikely(copy_to_iter(hdr, content_sz, iter) != content_sz))
 		return -EFAULT;

-	if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
+	if (iov_iter_zero(sz - content_sz, iter) != sz - content_sz)
 		return -EFAULT;

 	return 0;
@@ -155,26 +290,38 @@ static inline int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb,
 	return virtio_net_hdr_to_skb(skb, hdr, tun_vnet_is_little_endian(flags));
 }
-static inline int tun_vnet_hdr_from_skb(unsigned int flags,
+static inline int tun_vnet_hdr_from_skb(int sz, unsigned int flags,
 					const struct net_device *dev,
 					const struct sk_buff *skb,
-					struct virtio_net_hdr *hdr)
+					tun_vnet_hash_find vnet_hash_find,
+					struct virtio_net_hdr_v1_hash *hdr)
 {
 	int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
+	const struct virtio_net_hash *report = sz < sizeof(struct virtio_net_hdr_v1_hash) ?
+					       NULL : vnet_hash_find(skb);
+
+	*hdr = (struct virtio_net_hdr_v1_hash) {
+		.hash_report = VIRTIO_NET_HASH_REPORT_NONE
+	};
+
+	if (report) {
+		hdr->hash_value = cpu_to_le32(report->value);
+		hdr->hash_report = cpu_to_le16(report->report);
+	}

-	if (virtio_net_hdr_from_skb(skb, hdr,
+	if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr,
 				    tun_vnet_is_little_endian(flags), true,
 				    vlan_hlen)) {
 		struct skb_shared_info *sinfo = skb_shinfo(skb);

 		if (net_ratelimit()) {
 			netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
-				   sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->gso_size),
-				   tun_vnet16_to_cpu(flags, hdr->hdr_len));
+				   sinfo->gso_type, tun_vnet16_to_cpu(flags, hdr->hdr.gso_size),
+				   tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len));
 			print_hex_dump(KERN_ERR, "tun: ", DUMP_PREFIX_NONE, 16, 1,
 				       skb->head,
-				       min(tun_vnet16_to_cpu(flags, hdr->hdr_len), 64), true);
+				       min(tun_vnet16_to_cpu(flags, hdr->hdr.hdr_len), 64), true);
 		}
 		WARN_ON_ONCE(1);
 		return -EINVAL;
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 980de74724fc..fe4b984d3bbb 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -62,6 +62,62 @@
 #define TUNSETCARRIER _IOW('T', 226, int)
 #define TUNGETDEVNETNS _IO('T', 227)
+/**
+ * define TUNGETVNETHASHTYPES - ioctl to get supported virtio_net hashing types
+ *
+ * The argument is a pointer to __u32 which will store the supported virtio_net
+ * hashing types.
+ */
+#define TUNGETVNETHASHTYPES _IOR('T', 228, __u32)
+
+/**
+ * define TUNSETVNETREPORTINGAUTOMQ - ioctl to enable automq with hash reporting
+ *
+ * Disable RSS and enable automatic receive steering with hash reporting.
+ *
+ * The argument is a pointer to __u32 that contains a bitmask of hash types
+ * allowed to be reported.
+ *
+ * This ioctl results in %EBADFD if the underlying device is deleted. It affects
+ * all queues attached to the same device.
+ *
+ * This ioctl currently has no effect on XDP packets and packets with
+ * queue_mapping set by TC.
+ */
+#define TUNSETVNETREPORTINGAUTOMQ _IOR('T', 229, __u32)
+/**
+ * define TUNSETVNETREPORTINGRSS - ioctl to enable RSS with hash reporting
+ *
+ * Disable automatic receive steering and enable RSS with hash reporting.
This is unnecessary; e.g. one day we will have select_queue_xyz(), and we don't want to say "Disable automatic receive steering and xyz ...".
It is still something better to be documented as its behavior is somewhat complicated.
This is a hint of uAPI design issue.
Concretely, this ioctl disables automatic receive steering but doesn't disable steering by eBPF, which is implied by TUN_STEERINGEBPF_FALLBACK.
It would be simpler:
- not having TUN_STEERINGEBPF_FALLBACK
- the steering algorithm depends on the last uAPI call
What will TUNSETSTEERINGEBPF with NULL mean in that case? It currently switches the steering algorithm to automq.
A stackwise semantic then?
Can you clarify the semantics with an example of a set of ioctls? Perhaps it is an easy way to demonstrate an alternative design idea.
Regards, Akihiko Odaki
Add ioctls and storage required for the virtio-net hash feature to TUN.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 drivers/net/Kconfig    |  1 +
 drivers/net/tun.c      | 56 ++++++++++++++++++++++++++++++++++++++++++--------
 include/linux/skbuff.h |  3 +++
 net/core/skbuff.c      |  4 ++++
 4 files changed, 56 insertions(+), 8 deletions(-)
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 1fd5acdc73c6..aecfd244dd83 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -395,6 +395,7 @@ config TUN
 	tristate "Universal TUN/TAP device driver support"
 	depends on INET
 	select CRC32
+	select SKB_EXTENSIONS
 	help
 	  TUN/TAP provides packet reception and transmission for user space
 	  programs.  It can be viewed as a simple Point-to-Point or Ethernet
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 03d47799e9bd..0a34db248e03 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -209,6 +209,7 @@ struct tun_struct {
 	struct bpf_prog __rcu *xdp_prog;
 	struct tun_prog __rcu *steering_prog;
 	struct tun_prog __rcu *filter_prog;
+	struct tun_vnet_hash __rcu *vnet_hash;
 	struct ethtool_link_ksettings link_ksettings;
 	/* init args */
 	struct file *file;
@@ -451,9 +452,14 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
 	e->rps_rxhash = hash;
 }
+static struct virtio_net_hash *tun_add_hash(struct sk_buff *skb)
+{
+	return skb_ext_add(skb, SKB_EXT_TUN_VNET_HASH);
+}
+
 static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
 {
-	return NULL;
+	return skb_ext_find(skb, SKB_EXT_TUN_VNET_HASH);
 }
 /* We try to identify a flow through its rxhash. The reason that
@@ -462,14 +468,21 @@ static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
  * the userspace application move between processors, we may get a
  * different rxq no. here.
  */
-static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
+static u16 tun_automq_select_queue(struct tun_struct *tun,
+				   const struct tun_vnet_hash *vnet_hash,
+				   struct sk_buff *skb)
 {
+	struct flow_keys keys;
+	struct flow_keys_basic keys_basic;
 	struct tun_flow_entry *e;
 	u32 txq, numqueues;
numqueues = READ_ONCE(tun->numqueues);
-	txq = __skb_get_hash_symmetric(skb);
+	memset(&keys, 0, sizeof(keys));
+	skb_flow_dissect(skb, &flow_keys_dissector_symmetric, &keys, 0);
+
+	txq = flow_hash_from_keys(&keys);
 	e = tun_flow_find(&tun->flows[tun_hashfn(txq)], txq);
 	if (e) {
 		tun_flow_save_rps_rxhash(e, txq);
@@ -478,6 +491,13 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
 		txq = reciprocal_scale(txq, numqueues);
 	}

+	keys_basic = (struct flow_keys_basic) {
+		.control = keys.control,
+		.basic = keys.basic
+	};
+	tun_vnet_hash_report(vnet_hash, skb, &keys_basic,
+			     skb->l4_hash ? skb->hash : txq, tun_add_hash);
+
 	return txq;
 }
@@ -513,8 +533,15 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
 	u16 ret;

 	rcu_read_lock();
-	if (!tun_ebpf_select_queue(tun, skb, &ret))
-		ret = tun_automq_select_queue(tun, skb);
+	if (!tun_ebpf_select_queue(tun, skb, &ret)) {
+		struct tun_vnet_hash *vnet_hash = rcu_dereference(tun->vnet_hash);
+
+		if (vnet_hash && vnet_hash->rss)
+			ret = tun_vnet_rss_select_queue(READ_ONCE(tun->numqueues), vnet_hash,
+							skb, tun_add_hash);
+		else
+			ret = tun_automq_select_queue(tun, vnet_hash, skb);
+	}
 	rcu_read_unlock();
 	return ret;
@@ -2235,6 +2262,7 @@ static void tun_free_netdev(struct net_device *dev)
 	security_tun_dev_free_security(tun->security);
 	__tun_set_ebpf(tun, &tun->steering_prog, NULL);
 	__tun_set_ebpf(tun, &tun->filter_prog, NULL);
+	kfree_rcu_mightsleep(rcu_access_pointer(tun->vnet_hash));
 }
 static void tun_setup(struct net_device *dev)
@@ -3014,16 +3042,22 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 	} else {
 		memset(&ifr, 0, sizeof(ifr));
 	}
-	if (cmd == TUNGETFEATURES) {
+
+	switch (cmd) {
+	case TUNGETFEATURES:
 		/* Currently this just means: "what IFF flags are valid?".
 		 * This is needed because we never checked for invalid flags on
 		 * TUNSETIFF.
 		 */
 		return put_user(IFF_TUN | IFF_TAP | IFF_NO_CARRIER |
 				TUN_FEATURES, (unsigned int __user*)argp);
-	} else if (cmd == TUNSETQUEUE) {
+
+	case TUNSETQUEUE:
 		return tun_set_queue(file, &ifr);
-	} else if (cmd == SIOCGSKNS) {
+
+	case TUNGETVNETHASHTYPES:
+		return tun_vnet_ioctl_gethashtypes(argp);
+
+	case SIOCGSKNS:
 		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 		return open_related_ns(&net->ns, get_net_ns);
@@ -3264,6 +3298,12 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		ret = open_related_ns(&net->ns, get_net_ns);
 		break;
+	case TUNSETVNETREPORTINGAUTOMQ:
+	case TUNSETVNETREPORTINGRSS:
+	case TUNSETVNETRSS:
+		ret = tun_vnet_ioctl_sethash(&tun->vnet_hash, cmd, argp);
+		break;
+
 	default:
 		ret = tun_vnet_ioctl(&tun->vnet_hdr_sz, &tun->flags, cmd, argp);
 		break;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bb2b751d274a..cdd793f1c360 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4842,6 +4842,9 @@ enum skb_ext_id {
 #endif
 #if IS_ENABLED(CONFIG_MCTP_FLOWS)
 	SKB_EXT_MCTP,
+#endif
+#if IS_ENABLED(CONFIG_TUN)
+	SKB_EXT_TUN_VNET_HASH,
 #endif
 	SKB_EXT_NUM, /* must be last */
 };
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b1c81687e9d8..75d48217a20f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -64,6 +64,7 @@
 #include <linux/mpls.h>
 #include <linux/kcov.h>
 #include <linux/iov_iter.h>
+#include <linux/virtio_net.h>
 #include <net/protocol.h>
 #include <net/dst.h>
@@ -4969,6 +4970,9 @@ static const u8 skb_ext_type_len[] = {
 #if IS_ENABLED(CONFIG_MCTP_FLOWS)
 	[SKB_EXT_MCTP] = SKB_EXT_CHUNKSIZEOF(struct mctp_flow),
 #endif
+#if IS_ENABLED(CONFIG_TUN)
+	[SKB_EXT_TUN_VNET_HASH] = SKB_EXT_CHUNKSIZEOF(struct virtio_net_hash),
+#endif
 };
static __always_inline unsigned int skb_ext_total_length(void)
Add ioctls and storage required for the virtio-net hash feature to TAP.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
---
 drivers/net/ipvlan/ipvtap.c |  2 +-
 drivers/net/macvtap.c       |  2 +-
 drivers/net/tap.c           | 72 +++++++++++++++++++++++++++++++++++++++++----
 include/linux/if_tap.h      |  4 ++-
 4 files changed, 71 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ipvlan/ipvtap.c b/drivers/net/ipvlan/ipvtap.c
index 1afc4c47be73..305438abf7ae 100644
--- a/drivers/net/ipvlan/ipvtap.c
+++ b/drivers/net/ipvlan/ipvtap.c
@@ -114,7 +114,7 @@ static void ipvtap_dellink(struct net_device *dev,
 	struct ipvtap_dev *vlan = netdev_priv(dev);

 	netdev_rx_handler_unregister(dev);
-	tap_del_queues(&vlan->tap);
+	tap_del(&vlan->tap);
 	ipvlan_link_delete(dev, head);
 }
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 29a5929d48e5..e72144d05ef4 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -122,7 +122,7 @@ static void macvtap_dellink(struct net_device *dev,
 	struct macvtap_dev *vlantap = netdev_priv(dev);

 	netdev_rx_handler_unregister(dev);
-	tap_del_queues(&vlantap->tap);
+	tap_del(&vlantap->tap);
 	macvlan_dellink(dev, head);
 }
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 25c60ff2d3f2..15f056d7f632 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -49,6 +49,10 @@ struct major_info {
 	struct list_head next;
 };

+struct tap_skb_cb {
+	struct virtio_net_hash hash;
+};
+
 #define GOODCOPY_LEN 128
 static const struct proto_ops tap_socket_ops;
@@ -179,9 +183,20 @@ static void tap_put_queue(struct tap_queue *q)
 	sock_put(&q->sk);
 }
+static struct tap_skb_cb *tap_skb_cb(const struct sk_buff *skb)
+{
+	BUILD_BUG_ON(sizeof(skb->cb) < sizeof(struct tap_skb_cb));
+	return (struct tap_skb_cb *)skb->cb;
+}
+
+static struct virtio_net_hash *tap_add_hash(struct sk_buff *skb)
+{
+	return &tap_skb_cb(skb)->hash;
+}
+
 static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
 {
-	return NULL;
+	return &tap_skb_cb(skb)->hash;
 }
/* @@ -194,6 +209,7 @@ static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb) static struct tap_queue *tap_get_queue(struct tap_dev *tap, struct sk_buff *skb) { + struct flow_keys_basic keys_basic; struct tap_queue *queue = NULL; /* Access to taps array is protected by rcu, but access to numvtaps * isn't. Below we use it to lookup a queue, but treat it as a hint @@ -201,17 +217,47 @@ static struct tap_queue *tap_get_queue(struct tap_dev *tap, * racing against queue removal. */ int numvtaps = READ_ONCE(tap->numvtaps); + struct tun_vnet_hash *vnet_hash = rcu_dereference(tap->vnet_hash); __u32 rxq;
+ *tap_skb_cb(skb) = (struct tap_skb_cb) { + .hash = { .report = VIRTIO_NET_HASH_REPORT_NONE } + }; + if (!numvtaps) goto out;
if (numvtaps == 1) goto single;
+ if (vnet_hash) { + if (vnet_hash->rss) { + rxq = tun_vnet_rss_select_queue(numvtaps, vnet_hash, skb, tap_add_hash); + queue = rcu_dereference(tap->taps[rxq]); + goto out; + } + + if (!skb->l4_hash && !skb->sw_hash) { + struct flow_keys keys; + + skb_flow_dissect_flow_keys(skb, &keys, FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL); + rxq = flow_hash_from_keys(&keys); + keys_basic = (struct flow_keys_basic) { + .control = keys.control, + .basic = keys.basic + }; + } else { + skb_flow_dissect_flow_keys_basic(NULL, skb, &keys_basic, NULL, 0, 0, 0, + FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL); + rxq = skb->hash; + } + } else { + rxq = skb_get_hash(skb); + } + /* Check if we can use flow to select a queue */ - rxq = skb_get_hash(skb); if (rxq) { + tun_vnet_hash_report(vnet_hash, skb, &keys_basic, rxq, tap_add_hash); queue = rcu_dereference(tap->taps[rxq % numvtaps]); goto out; } @@ -234,10 +280,10 @@ static struct tap_queue *tap_get_queue(struct tap_dev *tap,
/* * The net_device is going away, give up the reference - * that it holds on all queues and safely set the pointer - * from the queues to NULL. + * that it holds on all queues, safely set the pointer + * from the queues to NULL, and free vnet_hash. */ -void tap_del_queues(struct tap_dev *tap) +void tap_del(struct tap_dev *tap) { struct tap_queue *q, *tmp;
@@ -254,8 +300,10 @@ void tap_del_queues(struct tap_dev *tap) BUG_ON(tap->numqueues); /* guarantee that any future tap_set_queue will fail */ tap->numvtaps = MAX_TAP_QUEUES; + + kfree_rcu_mightsleep(rtnl_dereference(tap->vnet_hash)); } -EXPORT_SYMBOL_GPL(tap_del_queues); +EXPORT_SYMBOL_GPL(tap_del);
rx_handler_result_t tap_handle_frame(struct sk_buff **pskb) { @@ -998,6 +1046,18 @@ static long tap_ioctl(struct file *file, unsigned int cmd, rtnl_unlock(); return ret;
+ case TUNGETVNETHASHTYPES: + return tun_vnet_ioctl_gethashtypes(argp); + + case TUNSETVNETREPORTINGAUTOMQ: + case TUNSETVNETREPORTINGRSS: + case TUNSETVNETRSS: + rtnl_lock(); + tap = rtnl_dereference(q->tap); + ret = tap ? tun_vnet_ioctl_sethash(&tap->vnet_hash, cmd, argp) : -EBADFD; + rtnl_unlock(); + return ret; + case SIOCGIFHWADDR: rtnl_lock(); tap = tap_get_tap_dev(q); diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h index 553552fa635c..6647a7a9e956 100644 --- a/include/linux/if_tap.h +++ b/include/linux/if_tap.h @@ -31,6 +31,7 @@ static inline struct ptr_ring *tap_get_ptr_ring(struct file *f) #define MAX_TAP_QUEUES 256
struct tap_queue; +struct tun_vnet_hash;
struct tap_dev { struct net_device *dev; @@ -43,6 +44,7 @@ struct tap_dev { int numqueues; netdev_features_t tap_features; int minor; + struct tun_vnet_hash __rcu *vnet_hash;
void (*update_features)(struct tap_dev *tap, netdev_features_t features); void (*count_tx_dropped)(struct tap_dev *tap); @@ -74,7 +76,7 @@ struct tap_queue { };
rx_handler_result_t tap_handle_frame(struct sk_buff **pskb); -void tap_del_queues(struct tap_dev *tap); +void tap_del(struct tap_dev *tap); int tap_get_minor(dev_t major, struct tap_dev *tap); void tap_free_minor(dev_t major, struct tap_dev *tap); int tap_queue_resize(struct tap_dev *tap);
Ensure that vnet ioctls result in EBADFD when the underlying device is deleted.
Signed-off-by: Akihiko Odaki akihiko.odaki@daynix.com Tested-by: Lei Yang leiyang@redhat.com --- tools/testing/selftests/net/tun.c | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+)
diff --git a/tools/testing/selftests/net/tun.c b/tools/testing/selftests/net/tun.c index fa83918b62d1..41747e1728a6 100644 --- a/tools/testing/selftests/net/tun.c +++ b/tools/testing/selftests/net/tun.c @@ -12,6 +12,7 @@ #include <linux/if_tun.h> #include <linux/netlink.h> #include <linux/rtnetlink.h> +#include <linux/virtio_net.h> #include <sys/ioctl.h> #include <sys/socket.h>
@@ -159,4 +160,42 @@ TEST_F(tun, reattach_close_delete) { EXPECT_EQ(tun_delete(self->ifname), 0); }
+FIXTURE(tun_deleted) +{ + char ifname[IFNAMSIZ]; + int fd; +}; + +FIXTURE_SETUP(tun_deleted) +{ + self->ifname[0] = 0; + self->fd = tun_alloc(self->ifname); + ASSERT_LE(0, self->fd); + + ASSERT_EQ(0, tun_delete(self->ifname)) + EXPECT_EQ(0, close(self->fd)); +} + +FIXTURE_TEARDOWN(tun_deleted) +{ + EXPECT_EQ(0, close(self->fd)); +} + +TEST_F(tun_deleted, getvnethdrsz) +{ + ASSERT_EQ(-1, ioctl(self->fd, TUNGETVNETHDRSZ)); + EXPECT_EQ(EBADFD, errno); +} + +TEST_F(tun_deleted, getvnethashtypes) +{ + uint32_t hash_types; + int ret = ioctl(self->fd, TUNGETVNETHASHTYPES, &hash_types); + + if (ret == -1 && errno == EBADFD) + SKIP(return, "TUNGETVNETHASHTYPES not supported"); + + EXPECT_FALSE(ret); +} + TEST_HARNESS_MAIN
The added tests confirm that tun can perform RSS for all supported hash types, both to select the receive queue and to report hash values.
Signed-off-by: Akihiko Odaki akihiko.odaki@daynix.com Tested-by: Lei Yang leiyang@redhat.com --- tools/testing/selftests/net/Makefile | 2 +- tools/testing/selftests/net/tun.c | 448 ++++++++++++++++++++++++++++++++++- 2 files changed, 440 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index 8f32b4f01aee..0e0c751a4691 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -124,6 +124,6 @@ $(OUTPUT)/reuseport_bpf_numa: LDLIBS += -lnuma $(OUTPUT)/tcp_mmap: LDLIBS += -lpthread -lcrypto $(OUTPUT)/tcp_inq: LDLIBS += -lpthread $(OUTPUT)/bind_bhash: LDLIBS += -lpthread -$(OUTPUT)/io_uring_zerocopy_tx: CFLAGS += -I../../../include/ +$(OUTPUT)/io_uring_zerocopy_tx $(OUTPUT)/tun: CFLAGS += -I../../../include/
include bpf.mk diff --git a/tools/testing/selftests/net/tun.c b/tools/testing/selftests/net/tun.c index 41747e1728a6..79ae65ae934a 100644 --- a/tools/testing/selftests/net/tun.c +++ b/tools/testing/selftests/net/tun.c @@ -2,22 +2,38 @@
#define _GNU_SOURCE
+#include <endian.h> #include <errno.h> #include <fcntl.h> +#include <sched.h> +#include <stddef.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> -#include <linux/if.h> +#include <net/if.h> +#include <netinet/ip.h> +#include <sys/ioctl.h> +#include <sys/socket.h> +#include <linux/compiler.h> +#include <linux/icmp.h> +#include <linux/if_arp.h> #include <linux/if_tun.h> +#include <linux/ipv6.h> #include <linux/netlink.h> #include <linux/rtnetlink.h> +#include <linux/sockios.h> +#include <linux/tcp.h> +#include <linux/udp.h> #include <linux/virtio_net.h> -#include <sys/ioctl.h> -#include <sys/socket.h>
#include "../kselftest_harness.h"
+#define TUN_HWADDR_SOURCE { 0x02, 0x00, 0x00, 0x00, 0x00, 0x00 } +#define TUN_HWADDR_DEST { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 } +#define TUN_IPADDR_SOURCE htonl((172 << 24) | (17 << 16) | 0) +#define TUN_IPADDR_DEST htonl((172 << 24) | (17 << 16) | 1) + static int tun_attach(int fd, char *dev) { struct ifreq ifr; @@ -40,7 +56,7 @@ static int tun_detach(int fd, char *dev) return ioctl(fd, TUNSETQUEUE, (void *) &ifr); }
-static int tun_alloc(char *dev) +static int tun_alloc(char *dev, short flags) { struct ifreq ifr; int fd, err; @@ -53,7 +69,8 @@ static int tun_alloc(char *dev)
memset(&ifr, 0, sizeof(ifr)); strcpy(ifr.ifr_name, dev); - ifr.ifr_flags = IFF_TAP | IFF_NAPI | IFF_MULTI_QUEUE; + ifr.ifr_flags = flags | IFF_TAP | IFF_NAPI | IFF_NO_PI | + IFF_MULTI_QUEUE;
err = ioctl(fd, TUNSETIFF, (void *) &ifr); if (err < 0) { @@ -65,6 +82,20 @@ static int tun_alloc(char *dev) return fd; }
+static bool tun_set_flags(int local_fd, const char *name, short flags) +{ + struct ifreq ifreq = { .ifr_flags = flags }; + + strcpy(ifreq.ifr_name, name); + + if (ioctl(local_fd, SIOCSIFFLAGS, &ifreq)) { + perror("SIOCSIFFLAGS"); + return false; + } + + return true; +} + static int tun_delete(char *dev) { struct { @@ -103,6 +134,107 @@ static int tun_delete(char *dev) return ret; }
+static uint32_t tun_sum(const void *buf, size_t len) +{ + const uint16_t *sbuf = buf; + uint32_t sum = 0; + + while (len > 1) { + sum += *sbuf++; + len -= 2; + } + + if (len) + sum += *(uint8_t *)sbuf; + + return sum; +} + +static uint16_t tun_build_ip_check(uint32_t sum) +{ + return ~((sum & 0xffff) + (sum >> 16)); +} + +static uint32_t tun_build_ip_pseudo_sum(const void *iphdr) +{ + uint16_t tot_len = ntohs(((struct iphdr *)iphdr)->tot_len); + + return tun_sum((char *)iphdr + offsetof(struct iphdr, saddr), 8) + + htons(((struct iphdr *)iphdr)->protocol) + + htons(tot_len - sizeof(struct iphdr)); +} + +static uint32_t tun_build_ipv6_pseudo_sum(const void *ipv6hdr) +{ + return tun_sum((char *)ipv6hdr + offsetof(struct ipv6hdr, saddr), 32) + + ((struct ipv6hdr *)ipv6hdr)->payload_len + + htons(((struct ipv6hdr *)ipv6hdr)->nexthdr); +} + +static void tun_build_iphdr(void *dest, uint16_t len, uint8_t protocol) +{ + struct iphdr iphdr = { + .ihl = sizeof(iphdr) / 4, + .version = 4, + .tot_len = htons(sizeof(iphdr) + len), + .ttl = 255, + .protocol = protocol, + .saddr = TUN_IPADDR_SOURCE, + .daddr = TUN_IPADDR_DEST + }; + + iphdr.check = tun_build_ip_check(tun_sum(&iphdr, sizeof(iphdr))); + memcpy(dest, &iphdr, sizeof(iphdr)); +} + +static void tun_build_ipv6hdr(void *dest, uint16_t len, uint8_t protocol) +{ + struct ipv6hdr ipv6hdr = { + .version = 6, + .payload_len = htons(len), + .nexthdr = protocol, + .saddr = { + .s6_addr32 = { + htonl(0xffff0000), 0, 0, TUN_IPADDR_SOURCE + } + }, + .daddr = { + .s6_addr32 = { + htonl(0xffff0000), 0, 0, TUN_IPADDR_DEST + } + }, + }; + + memcpy(dest, &ipv6hdr, sizeof(ipv6hdr)); +} + +static void tun_build_tcphdr(void *dest, uint32_t sum) +{ + struct tcphdr tcphdr = { + .source = htons(9), + .dest = htons(9), + .fin = 1, + .doff = sizeof(tcphdr) / 4, + }; + uint32_t tcp_sum = tun_sum(&tcphdr, sizeof(tcphdr)); + + tcphdr.check = tun_build_ip_check(sum + tcp_sum); + memcpy(dest, &tcphdr, sizeof(tcphdr)); +} + +static void 
tun_build_udphdr(void *dest, uint32_t sum) +{ + struct udphdr udphdr = { + .source = htons(9), + .dest = htons(9), + .len = htons(sizeof(udphdr)), + }; + uint32_t udp_sum = tun_sum(&udphdr, sizeof(udphdr)); + + udphdr.check = tun_build_ip_check(sum + udp_sum); + memcpy(dest, &udphdr, sizeof(udphdr)); +} + FIXTURE(tun) { char ifname[IFNAMSIZ]; @@ -113,10 +245,10 @@ FIXTURE_SETUP(tun) { memset(self->ifname, 0, sizeof(self->ifname));
- self->fd = tun_alloc(self->ifname); + self->fd = tun_alloc(self->ifname, 0); ASSERT_GE(self->fd, 0);
- self->fd2 = tun_alloc(self->ifname); + self->fd2 = tun_alloc(self->ifname, 0); ASSERT_GE(self->fd2, 0); }
@@ -169,7 +301,7 @@ FIXTURE(tun_deleted) FIXTURE_SETUP(tun_deleted) { self->ifname[0] = 0; - self->fd = tun_alloc(self->ifname); + self->fd = tun_alloc(self->ifname, 0); ASSERT_LE(0, self->fd);
ASSERT_EQ(0, tun_delete(self->ifname)) @@ -198,4 +330,302 @@ TEST_F(tun_deleted, getvnethashtypes) EXPECT_FALSE(ret); }
-TEST_HARNESS_MAIN +FIXTURE(tun_vnet_rss) +{ + int dest_fds[3]; + unsigned int dest_ifindex; + int source_fd; + char buffer[sizeof(struct ipv6hdr) + sizeof(struct tcphdr)]; + uint16_t len; +}; + +FIXTURE_VARIANT(tun_vnet_rss) +{ + uint16_t eth_p; + uint8_t ipproto; + uint8_t flags; + uint16_t hash_report; + uint32_t hash_value; +}; + +FIXTURE_VARIANT_ADD(tun_vnet_rss, unclassified) +{ + .eth_p = ETH_P_LOOPBACK +}; + +FIXTURE_VARIANT_ADD(tun_vnet_rss, ipv4) +{ + .eth_p = ETH_P_IP, + .ipproto = 253, + .hash_report = VIRTIO_NET_HASH_REPORT_IPv4, + .hash_value = 0x6e45d952 +}; + +FIXTURE_VARIANT_ADD(tun_vnet_rss, tcpv4) +{ + .eth_p = ETH_P_IP, + .ipproto = IPPROTO_TCP, + .hash_report = VIRTIO_NET_HASH_REPORT_TCPv4, + .hash_value = 0xfb63539a +}; + +FIXTURE_VARIANT_ADD(tun_vnet_rss, udpv4) +{ + .eth_p = ETH_P_IP, + .ipproto = IPPROTO_UDP, + .hash_report = VIRTIO_NET_HASH_REPORT_UDPv4, + .hash_value = 0xfb63539a +}; + +FIXTURE_VARIANT_ADD(tun_vnet_rss, ipv6) +{ + .eth_p = ETH_P_IPV6, + .ipproto = 253, + .hash_report = VIRTIO_NET_HASH_REPORT_IPv6, + .hash_value = 0xd6eb560f +}; + +FIXTURE_VARIANT_ADD(tun_vnet_rss, tcpv6) +{ + .eth_p = ETH_P_IPV6, + .ipproto = IPPROTO_TCP, + .hash_report = VIRTIO_NET_HASH_REPORT_TCPv6, + .hash_value = 0xc2b9f251 +}; + +FIXTURE_VARIANT_ADD(tun_vnet_rss, udpv6) +{ + .eth_p = ETH_P_IPV6, + .ipproto = IPPROTO_UDP, + .hash_report = VIRTIO_NET_HASH_REPORT_UDPv6, + .hash_value = 0xc2b9f251 +}; + +FIXTURE_SETUP(tun_vnet_rss) +{ + static const struct { + struct tun_vnet_rss hdr; + uint16_t indirection_table[2]; + uint8_t key[40]; + } vnet_rss = { + .hdr = { + .hash_types = VIRTIO_NET_RSS_HASH_TYPE_IPv4 | + VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | + VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | + VIRTIO_NET_RSS_HASH_TYPE_IPv6 | + VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | + VIRTIO_NET_RSS_HASH_TYPE_UDPv6, + .indirection_table_mask = 1, + .unclassified_queue = 5 + }, + .indirection_table = { 3, 4 }, + .key = { + 0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2, + 0x41, 0x67, 0x25, 
0x3d, 0x43, 0xa3, 0x8f, 0xb0, + 0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4, + 0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c, + 0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa + } + }; + + struct { + struct virtio_net_hdr_v1_hash vnet_hdr; + struct ethhdr ethhdr; + struct arphdr arphdr; + unsigned char sender_hwaddr[6]; + uint32_t sender_ipaddr; + unsigned char target_hwaddr[6]; + uint32_t target_ipaddr; + } __packed packet = { + .ethhdr = { + .h_source = TUN_HWADDR_SOURCE, + .h_dest = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff }, + .h_proto = htons(ETH_P_ARP) + }, + .arphdr = { + .ar_hrd = htons(ARPHRD_ETHER), + .ar_pro = htons(ETH_P_IP), + .ar_hln = ETH_ALEN, + .ar_pln = 4, + .ar_op = htons(ARPOP_REQUEST) + }, + .sender_hwaddr = TUN_HWADDR_DEST, + .sender_ipaddr = TUN_IPADDR_DEST, + .target_ipaddr = TUN_IPADDR_DEST + }; + + char dest_ifname[IFNAMSIZ] = ""; + int i, local_fd; + uint16_t ip_payload_len; + uint32_t hash_types; + uint32_t ip_sum; + + self->dest_fds[0] = tun_alloc(dest_ifname, IFF_VNET_HDR); + ASSERT_LE(0, self->dest_fds[0]) { + EXPECT_EQ(0, close(self->dest_fds[0])); + } + + i = ioctl(self->dest_fds[0], TUNGETVNETHASHTYPES, &hash_types); + if (i == -1 && errno == EINVAL) { + EXPECT_EQ(0, close(self->dest_fds[0])); + SKIP(return, "TUNGETVNETHASHTYPES not supported"); + } + + ASSERT_EQ(0, i) + EXPECT_EQ(0, close(self->dest_fds[0])); + + if ((hash_types & vnet_rss.hdr.hash_types) != vnet_rss.hdr.hash_types) { + EXPECT_EQ(0, close(self->dest_fds[0])); + SKIP(return, "Lacks some hash type support"); + } + + self->dest_ifindex = if_nametoindex(dest_ifname); + ASSERT_TRUE(self->dest_ifindex) + EXPECT_EQ(0, close(self->dest_fds[0])); + + i = sizeof(struct virtio_net_hdr_v1_hash); + ASSERT_EQ(0, ioctl(self->dest_fds[0], TUNSETVNETHDRSZ, &i)) + EXPECT_EQ(0, close(self->dest_fds[0])); + + i = 1; + ASSERT_EQ(0, ioctl(self->dest_fds[0], TUNSETVNETLE, &i)) + EXPECT_EQ(0, close(self->dest_fds[0])); + + local_fd = socket(AF_LOCAL, SOCK_STREAM, 0); + ASSERT_LE(0, 
local_fd) + EXPECT_EQ(0, close(self->dest_fds[0])); + + i = tun_set_flags(local_fd, dest_ifname, IFF_UP); + EXPECT_EQ(0, close(local_fd)); + ASSERT_TRUE(i) + EXPECT_EQ(0, close(self->dest_fds[0])); + + ASSERT_EQ(sizeof(packet), + write(self->dest_fds[0], &packet, sizeof(packet))) + EXPECT_EQ(0, close(self->dest_fds[0])); + + ASSERT_EQ(0, ioctl(self->dest_fds[0], TUNSETVNETREPORTINGRSS, &vnet_rss)) + EXPECT_EQ(0, close(self->dest_fds[0])); + + self->source_fd = socket(AF_PACKET, SOCK_DGRAM, 0); + ASSERT_LE(0, self->source_fd) + EXPECT_EQ(0, close(self->dest_fds[0])); + + for (i = 1; i < ARRAY_SIZE(self->dest_fds); i++) { + self->dest_fds[i] = tun_alloc(dest_ifname, IFF_VNET_HDR); + ASSERT_LE(0, self->dest_fds[i]) { + while (i) { + i--; + EXPECT_EQ(0, close(self->dest_fds[i])); + } + } + } + + if (variant->eth_p == ETH_P_IP || variant->eth_p == ETH_P_IPV6) { + switch (variant->ipproto) { + case IPPROTO_TCP: + ip_payload_len = sizeof(struct tcphdr); + break; + + case IPPROTO_UDP: + ip_payload_len = sizeof(struct udphdr); + break; + + default: + ip_payload_len = 0; + } + + switch (variant->eth_p) { + case ETH_P_IP: + tun_build_iphdr(self->buffer, ip_payload_len, variant->ipproto); + self->len = sizeof(struct iphdr); + ip_sum = tun_build_ip_pseudo_sum(self->buffer); + break; + + case ETH_P_IPV6: + tun_build_ipv6hdr(self->buffer, ip_payload_len, variant->ipproto); + self->len = sizeof(struct ipv6hdr); + ip_sum = tun_build_ipv6_pseudo_sum(self->buffer); + break; + } + + switch (variant->ipproto) { + case IPPROTO_TCP: + tun_build_tcphdr(self->buffer + self->len, ip_sum); + break; + + case IPPROTO_UDP: + tun_build_udphdr(self->buffer + self->len, ip_sum); + break; + } + + self->len += ip_payload_len; + } +} + +FIXTURE_TEARDOWN(tun_vnet_rss) +{ + EXPECT_EQ(0, close(self->source_fd)); + + for (size_t i = 0; i < ARRAY_SIZE(self->dest_fds); i++) + EXPECT_EQ(0, close(self->dest_fds[i])); +} + +TEST_F(tun_vnet_rss, rx) +{ + size_t len = sizeof(struct virtio_net_hdr_v1_hash) + 
ETH_HLEN + self->len; + struct { + struct virtio_net_hdr_v1_hash vnet_hdr; + struct ethhdr ethhdr; + char payload[sizeof(self->buffer)]; + } buffer; + struct virtio_net_hdr_v1_hash hdr = { + .hdr = { .flags = variant->flags }, + .hash_value = htole32(variant->hash_value), + .hash_report = htole16(variant->hash_report) + }; + int txq = variant->hash_report ? variant->hash_value & 1 : 2; + struct sockaddr_ll ll = { + .sll_family = AF_PACKET, + .sll_addr = TUN_HWADDR_DEST, + .sll_halen = ETH_ALEN, + .sll_ifindex = self->dest_ifindex, + .sll_protocol = htons(variant->eth_p), + }; + + EXPECT_EQ(self->len, + sendto(self->source_fd, self->buffer, self->len, 0, + (struct sockaddr *)&ll, sizeof(ll))); + EXPECT_EQ(len, read(self->dest_fds[txq], &buffer, len)); + ASSERT_FALSE(memcmp(&buffer, &hdr, sizeof(hdr))); + ASSERT_FALSE(memcmp(buffer.payload, self->buffer, self->len)); +} + +int main(int argc, char **argv) +{ + FILE *file; + + if (unshare(CLONE_NEWNET)) { + perror("unshare"); + return KSFT_FAIL; + } + + /* Disable IPv6 to eliminate IPv6 Neighbor Discovery messages. */ + file = fopen("/proc/sys/net/ipv6/conf/default/disable_ipv6", "w"); + if (file) { + if (fputc('1', file) != '1') { + perror("fputc"); + return KSFT_FAIL; + } + + if (fclose(file)) { + perror("fclose"); + return KSFT_FAIL; + } + } else if (errno != ENOENT) { + perror("fopen"); + return KSFT_FAIL; + } + + return test_harness_run(argc, argv); +}
These tests only verify that the ioctls are wired up to the implementation shared with tun; the implementation itself is already covered by the tun tests.
Signed-off-by: Akihiko Odaki akihiko.odaki@daynix.com --- tools/testing/selftests/net/config | 1 + tools/testing/selftests/net/tap.c | 131 ++++++++++++++++++++++++++++++++++++- 2 files changed, 129 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config index 61e5116987f3..00cb1e65b392 100644 --- a/tools/testing/selftests/net/config +++ b/tools/testing/selftests/net/config @@ -99,6 +99,7 @@ CONFIG_IPV6_IOAM6_LWTUNNEL=y CONFIG_CRYPTO_SM4_GENERIC=y CONFIG_AMT=m CONFIG_TUN=y +CONFIG_TUN_VNET_CROSS_LE=y CONFIG_VXLAN=m CONFIG_IP_SCTP=m CONFIG_NETFILTER_XT_MATCH_POLICY=m diff --git a/tools/testing/selftests/net/tap.c b/tools/testing/selftests/net/tap.c index 247c3b3ac1c9..0decbc338965 100644 --- a/tools/testing/selftests/net/tap.c +++ b/tools/testing/selftests/net/tap.c @@ -387,9 +387,6 @@ FIXTURE_TEARDOWN(tap) if (self->fd != -1) close(self->fd);
- ret = dev_delete(param_dev_tap_name); - EXPECT_EQ(ret, 0); - ret = dev_delete(param_dev_dummy_name); EXPECT_EQ(ret, 0); } @@ -431,4 +428,132 @@ TEST_F(tap, test_packet_crash_tap_invalid_eth_proto) ASSERT_EQ(errno, EINVAL); }
+TEST_F(tap, test_vnethdrsz) +{ + int sz = sizeof(struct virtio_net_hdr_v1_hash); + + ASSERT_FALSE(dev_delete(param_dev_tap_name)); + + ASSERT_FALSE(ioctl(self->fd, TUNSETVNETHDRSZ, &sz)); + sz = 0; + ASSERT_FALSE(ioctl(self->fd, TUNGETVNETHDRSZ, &sz)); + EXPECT_EQ(sizeof(struct virtio_net_hdr_v1_hash), sz); +} + +TEST_F(tap, test_vnetle) +{ + int le = 1; + + ASSERT_FALSE(dev_delete(param_dev_tap_name)); + + ASSERT_FALSE(ioctl(self->fd, TUNSETVNETLE, &le)); + le = 0; + ASSERT_FALSE(ioctl(self->fd, TUNGETVNETLE, &le)); + EXPECT_EQ(1, le); +} + +TEST_F(tap, test_vnetbe) +{ + int be = 1; + int ret; + + ASSERT_FALSE(dev_delete(param_dev_tap_name)); + + ret = ioctl(self->fd, TUNSETVNETBE, &be); + if (ret == -1 && errno == EINVAL) + SKIP(return, "TUNSETVNETBE not supported"); + + ASSERT_FALSE(ret); + be = 0; + ASSERT_FALSE(ioctl(self->fd, TUNGETVNETBE, &be)); + EXPECT_EQ(1, be); +} + +TEST_F(tap, test_getvnethashtypes) +{ + uint32_t hash_types; + int ret; + + ASSERT_FALSE(dev_delete(param_dev_tap_name)); + + ret = ioctl(self->fd, TUNGETVNETHASHTYPES, &hash_types); + if (ret == -1 && errno == EINVAL) + SKIP(return, "TUNGETVNETHASHTYPES not supported"); + + EXPECT_FALSE(ret); +} + +FIXTURE(tap_setvnethash) +{ + int fd; +}; + +FIXTURE_VARIANT(tap_setvnethash) +{ + unsigned int cmd; +}; + +FIXTURE_VARIANT_ADD(tap_setvnethash, reportingautomq) +{ + .cmd = TUNSETVNETREPORTINGAUTOMQ +}; + +FIXTURE_VARIANT_ADD(tap_setvnethash, reportingrss) +{ + .cmd = TUNSETVNETREPORTINGRSS +}; + +FIXTURE_VARIANT_ADD(tap_setvnethash, rss) +{ + .cmd = TUNSETVNETRSS +}; + +FIXTURE_SETUP(tap_setvnethash) +{ + int ret; + + ret = dev_create(param_dev_dummy_name, "dummy", NULL, NULL); + ASSERT_FALSE(ret); + + ret = dev_create(param_dev_tap_name, "macvtap", macvtap_fill_rtattr, + NULL); + ASSERT_FALSE(ret) + EXPECT_FALSE(dev_delete(param_dev_dummy_name)); + + self->fd = opentap(param_dev_tap_name); + ASSERT_LT(0, self->fd) + EXPECT_FALSE(dev_delete(param_dev_dummy_name)); +} + 
+FIXTURE_TEARDOWN(tap_setvnethash) +{ + EXPECT_FALSE(close(self->fd)); + EXPECT_FALSE(dev_delete(param_dev_dummy_name)); +} + +TEST_F(tap_setvnethash, test_alive) +{ + struct tun_vnet_rss rss = { .hash_types = 0 }; + int ret; + + ret = ioctl(self->fd, variant->cmd, &rss); + + if (ret == -1 && errno == EINVAL) + SKIP(return, "not supported"); + + EXPECT_FALSE(ret); +} + +TEST_F(tap_setvnethash, test_deleted) +{ + ASSERT_FALSE(dev_delete(param_dev_tap_name)); + + ASSERT_EQ(-1, ioctl(self->fd, variant->cmd)); + + if (errno == EINVAL) + SKIP(return, "not supported"); + + EXPECT_EQ(EBADFD, errno); +} + TEST_HARNESS_MAIN
VIRTIO_NET_F_HASH_REPORT allows reporting hash values calculated on the host. When VHOST_NET_F_VIRTIO_NET_HDR is employed, no hash values are reported (i.e., the hash_report member is always set to VIRTIO_NET_HASH_REPORT_NONE). Otherwise, the values provided by the underlying socket are reported.
VIRTIO_NET_F_HASH_REPORT requires VIRTIO_F_VERSION_1.
Signed-off-by: Akihiko Odaki akihiko.odaki@daynix.com Tested-by: Lei Yang leiyang@redhat.com --- drivers/vhost/net.c | 68 +++++++++++++++++++++++++++-------------------------- 1 file changed, 35 insertions(+), 33 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index b9b9e9d40951..fc5b43e43a06 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -73,6 +73,7 @@ enum { VHOST_NET_FEATURES = VHOST_FEATURES | (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) | (1ULL << VIRTIO_NET_F_MRG_RXBUF) | + (1ULL << VIRTIO_NET_F_HASH_REPORT) | (1ULL << VIRTIO_F_ACCESS_PLATFORM) | (1ULL << VIRTIO_F_RING_RESET) }; @@ -1097,10 +1098,6 @@ static void handle_rx(struct vhost_net *net) .msg_controllen = 0, .msg_flags = MSG_DONTWAIT, }; - struct virtio_net_hdr hdr = { - .flags = 0, - .gso_type = VIRTIO_NET_HDR_GSO_NONE - }; size_t total_len = 0; int err, mergeable; s16 headcount; @@ -1174,11 +1171,15 @@ static void handle_rx(struct vhost_net *net) /* We don't need to be notified again. */ iov_iter_init(&msg.msg_iter, ITER_DEST, vq->iov, in, vhost_len); fixup = msg.msg_iter; - if (unlikely((vhost_hlen))) { - /* We will supply the header ourselves - * TODO: support TSO. - */ - iov_iter_advance(&msg.msg_iter, vhost_hlen); + /* + * Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR + * TODO: support TSO. + */ + if (unlikely(vhost_hlen) && + iov_iter_zero(vhost_hlen, &msg.msg_iter) != vhost_hlen) { + vq_err(vq, "Unable to write vnet_hdr at addr %p\n", + vq->iov->iov_base); + goto out; } err = sock->ops->recvmsg(sock, &msg, sock_len, MSG_DONTWAIT | MSG_TRUNC); @@ -1191,30 +1192,24 @@ static void handle_rx(struct vhost_net *net) vhost_discard_vq_desc(vq, headcount); continue; } - /* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */ - if (unlikely(vhost_hlen)) { - if (copy_to_iter(&hdr, sizeof(hdr), - &fixup) != sizeof(hdr)) { - vq_err(vq, "Unable to write vnet_hdr " - "at addr %p\n", vq->iov->iov_base); - goto out; - } - } else { - /* Header came from socket; we'll need to patch - * ->num_buffers over if VIRTIO_NET_F_MRG_RXBUF - */ - iov_iter_advance(&fixup, sizeof(hdr)); - } /* TODO: Should check and handle checksum. */
+ /* + * We'll need to patch ->num_buffers over if + * VIRTIO_NET_F_MRG_RXBUF or VIRTIO_F_VERSION_1 + */ num_buffers = cpu_to_vhost16(vq, headcount); - if (likely(set_num_buffers) && - copy_to_iter(&num_buffers, sizeof num_buffers, - &fixup) != sizeof num_buffers) { - vq_err(vq, "Failed num_buffers write"); - vhost_discard_vq_desc(vq, headcount); - goto out; + if (likely(set_num_buffers)) { + iov_iter_advance(&fixup, offsetof(struct virtio_net_hdr_v1, num_buffers)); + + if (copy_to_iter(&num_buffers, sizeof(num_buffers), + &fixup) != sizeof(num_buffers)) { + vq_err(vq, "Failed num_buffers write"); + vhost_discard_vq_desc(vq, headcount); + goto out; + } } + nvq->done_idx += headcount; if (nvq->done_idx > VHOST_NET_BATCH) vhost_net_signal_used(nvq); @@ -1607,10 +1602,13 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features) size_t vhost_hlen, sock_hlen, hdr_len; int i;
- hdr_len = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | - (1ULL << VIRTIO_F_VERSION_1))) ? - sizeof(struct virtio_net_hdr_mrg_rxbuf) : - sizeof(struct virtio_net_hdr); + if (features & (1ULL << VIRTIO_NET_F_HASH_REPORT)) + hdr_len = sizeof(struct virtio_net_hdr_v1_hash); + else if (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | + (1ULL << VIRTIO_F_VERSION_1))) + hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf); + else + hdr_len = sizeof(struct virtio_net_hdr); if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) { /* vhost provides vnet_hdr */ vhost_hlen = hdr_len; @@ -1691,6 +1689,10 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl, return -EFAULT; if (features & ~VHOST_NET_FEATURES) return -EOPNOTSUPP; + if ((features & ((1ULL << VIRTIO_F_VERSION_1) | + (1ULL << VIRTIO_NET_F_HASH_REPORT))) == + (1ULL << VIRTIO_NET_F_HASH_REPORT)) + return -EINVAL; return vhost_net_set_features(n, features); case VHOST_GET_BACKEND_FEATURES: features = VHOST_NET_BACKEND_FEATURES;