January 2025 - Linux-kselftest-mirror

[PATCH RFC net-next v2 0/6] Device memory TCP TX

by Mina Almasry

RFC v2: https://patchwork.kernel.org/project/netdevbpf/list/?series=920056&state=* ======= RFC v2 addresses much of the feedback from RFC v1. I plan on sending something close to this as net-next reopens, sending it slightly early to get feedback if any. Major changes: -------------- - much improved UAPI as suggested by Stan. We now interpret the iov_base of the passed in iov from userspace as the offset into the dmabuf to send from. This removes the need to set iov.iov_base = NULL which may be confusing to users, and enables us to send multiple iovs in the same sendmsg() call. ncdevmem and the docs show a sample use of that. - Removed the duplicate dmabuf iov_iter in binding->iov_iter. I think this is good improvment as it was confusing to keep track of 2 iterators for the same sendmsg, and mistracking both iterators caused a couple of bugs reported in the last iteration that are now resolved with this streamlining. - Improved test coverage in ncdevmem. Now muliple sendmsg() are tested, and sending multiple iovs in the same sendmsg() is tested. - Fixed issue where dmabuf unmapping was happening in invalid context (Stan). ==================================================================== The TX path had been dropped from the Device Memory TCP patch series post RFCv1 [1], to make that series slightly easier to review. This series rebases the implementation of the TX path on top of the net_iov/netmem framework agreed upon and merged. The motivation for the feature is thoroughly described in the docs & cover letter of the original proposal, so I don't repeat the lengthy descriptions here, but they are available in [1]. Sending this series as RFC as the winder closure is immenient. I plan on reposting as non-RFC once the tree re-opens, addressing any feedback I receive in the meantime. Full outline on usage of the TX path is detailed in the documentation added in the first patch. Test example is available via the kselftest included in the series as well. The series is relatively small, as the TX path for this feature largely piggybacks on the existing MSG_ZEROCOPY implementation. Patch Overview: --------------- 1. Documentation & tests to give high level overview of the feature being added. 2. Add netmem refcounting needed for the TX path. 3. Devmem TX netlink API. 4. Devmem TX net stack implementation. Testing: -------- Testing is very similar to devmem TCP RX path. The ncdevmem test used for the RX path is now augemented with client functionality to test TX path. * Test Setup: Kernel: net-next with this RFC and memory provider API cherry-picked locally. Hardware: Google Cloud A3 VMs. NIC: GVE with header split & RSS & flow steering support. Performance results are not included with this version, unfortunately. I'm having issues running the dma-buf exporter driver against the upstream kernel on my test setup. The issues are specific to that dma-buf exporter and do not affect this patch series. I plan to follow up this series with perf fixes if the tests point to issues once they're up and running. Special thanks to Stan who took a stab at rebasing the TX implementation on top of the netmem/net_iov framework merged. Parts of his proposal [2] that are reused as-is are forked off into their own patches to give full credit. [1] https://lore.kernel.org/netdev/20240909054318.1809580-1-almasrymina@google.… [2] https://lore.kernel.org/netdev/20240913150913.1280238-2-sdf@fomichev.me/T/#… Cc: sdf(a)fomichev.me Cc: asml.silence(a)gmail.com Cc: dw(a)davidwei.uk Cc: Jamal Hadi Salim <jhs(a)mojatatu.com> Cc: Victor Nogueira <victor(a)mojatatu.com> Cc: Pedro Tammela <pctammela(a)mojatatu.com> Mina Almasry (5): net: add devmem TCP TX documentation selftests: ncdevmem: Implement devmem TCP TX net: add get_netmem/put_netmem support net: devmem: Implement TX path net: devmem: make dmabuf unbinding scheduled work Stanislav Fomichev (1): net: devmem: TCP tx netlink api Documentation/netlink/specs/netdev.yaml | 12 + Documentation/networking/devmem.rst | 144 ++++++++- include/linux/skbuff.h | 15 +- include/linux/skbuff_ref.h | 4 +- include/net/netmem.h | 3 + include/net/sock.h | 1 + include/uapi/linux/netdev.h | 1 + include/uapi/linux/uio.h | 6 +- net/core/datagram.c | 41 ++- net/core/devmem.c | 110 ++++++- net/core/devmem.h | 70 ++++- net/core/netdev-genl-gen.c | 13 + net/core/netdev-genl-gen.h | 1 + net/core/netdev-genl.c | 67 ++++- net/core/skbuff.c | 36 ++- net/core/sock.c | 8 + net/ipv4/tcp.c | 36 ++- net/vmw_vsock/virtio_transport_common.c | 3 +- tools/include/uapi/linux/netdev.h | 1 + .../selftests/drivers/net/hw/ncdevmem.c | 276 +++++++++++++++++- 20 files changed, 802 insertions(+), 46 deletions(-) -- 2.48.1.362.g079036d154-goog

9 months, 2 weeks

3
16
0 0

[PATCH RFC net-next v3 0/8] netconsole: Add support for CPU population

by Breno Leitao

The current implementation of netconsole sends all log messages in parallel, which can lead to an intermixed and interleaved output on the receiving side. This makes it challenging to demultiplex the messages and attribute them to their originating CPUs. As a result, users and developers often struggle to effectively analyze and debug the parallel log output received through netconsole. Example of a message got from produciton hosts: ------------[ cut here ]------------ ------------[ cut here ]------------ refcount_t: saturated; leaking memory. WARNING: CPU: 2 PID: 1613668 at lib/refcount.c:22 refcount_warn_saturate+0x5e/0xe0 refcount_t: addition on 0; use-after-free. WARNING: CPU: 26 PID: 4139916 at lib/refcount.c:25 refcount_warn_saturate+0x7d/0xe0 Modules linked in: bpf_preload(E) vhost_net(E) tun(E) vhost(E) This series of patches introduces a new feature to the netconsole subsystem that allows the automatic population of the CPU number in the userdata field for each log message. This enhancement provides several benefits: * Improved demultiplexing of parallel log output: When multiple CPUs are sending messages concurrently, the added CPU number in the userdata makes it easier to differentiate and attribute the messages to their originating CPUs. * Better visibility into message sources: The CPU number information gives users and developers more insight into which specific CPU a particular log message came from, which can be valuable for debugging and analysis. The changes in this series are as follows Patches:: Patch "consolidate send buffers into netconsole_target struct" ================================================= Move the static buffers to netconsole target, from static declaration in send_msg_no_fragmentation() and send_msg_fragmented(). Patch "netconsole: Rename userdata to extradata" ================================================= Create the a concept of extradata, which encompasses the concept of userdata and the upcoming sysdatao Sysdata is a new concept being added, which is basically fields that are populated by the kernel. At this time only the CPU#, but, there is a desire to add current task name, kernel release version, etc. Patch "netconsole: Helper to count number of used entries" =========================================================== Create a simple helper to count number of entries in extradata. I am separating this in a function since it will need to count userdata and sysdata. For instance, when the user adds an extra userdata, we need to check if there is space, counting the previous data entries (from userdata and cpu data) Patch "Introduce configfs helpers for sysdata features" ====================================================== Create the concept of sysdata feature in the netconsole target, and create the configfs helpers to enable the bit in nt->sysdata Patch "Include sysdata in extradata entry count" ================================================ Add the concept of sysdata when counting for available space in the buffer. This will protect users from creating new userdata/sysdata if there is no more space Patch "netconsole: add support for sysdata and CPU population" =============================================================== This is the core patch. Basically add a new option to enable automatic CPU number population in the netconsole userdata Provides a new "cpu_nr" sysfs attribute to control this feature Patch "netconsole: selftest: test CPU number auto-population" ============================================================= Expands the existing netconsole selftest to verify the CPU number auto-population functionality Ensures the received netconsole messages contain the expected "cpu=<CPU>" entry in the message. Test different permutation with userdata Patch "netconsole: docs: Add documentation for CPU number auto-population" ============================================================================= Updates the netconsole documentation to explain the new CPU number auto-population feature Provides instructions on how to enable and use the feature I believe these changes will be a valuable addition to the netconsole subsystem, enhancing its usefulness for kernel developers and users. Signed-off-by: Breno Leitao <leitao(a)debian.org> --- Changes in v3: - Moved the buffer into netconsole_target, avoiding static functions in the send path (Jakub). - Fix a documentation error (Randy Dunlap) - Created a function that handle all the extradata, consolidating it in a single place (Jakub) - Split the patch even more, trying to simplify the review. - Link to v2: https://lore.kernel.org/r/20250115-netcon_cpu-v2-0-95971b44dc56@debian.org Changes in v2: - Create the concept of extradata and sysdata. This will make the design easier to understand, and the code easier to read. * Basically extradata encompasses userdata and the new sysdata. Userdata originates from user, and sysdata originates in kernel. - Improved the test to send from a very specific CPU, which can be checked to be correct on the other side, as suggested by Jakub. - Fixed a bug where CPU # was populated at the wrong place - Link to v1: https://lore.kernel.org/r/20241113-netcon_cpu-v1-0-d187bf7c0321@debian.org --- Breno Leitao (8): netconsole: consolidate send buffers into netconsole_target struct netconsole: Rename userdata to extradata netconsole: Helper to count number of used entries netconsole: Introduce configfs helpers for sysdata features netconsole: Include sysdata in extradata entry count netconsole: add support for sysdata and CPU population netconsole: selftest: test for sysdata CPU netconsole: docs: Add documentation for CPU number auto-population Documentation/networking/netconsole.rst | 45 ++++ drivers/net/netconsole.c | 260 ++++++++++++++++----- tools/testing/selftests/drivers/net/Makefile | 1 + .../selftests/drivers/net/lib/sh/lib_netcons.sh | 17 ++ .../selftests/drivers/net/netcons_sysdata.sh | 167 +++++++++++++ 5 files changed, 426 insertions(+), 64 deletions(-) --- base-commit: fa10c9f5aa705deb43fd65623508074bee942764 change-id: 20241108-netcon_cpu-ce3917e88f4b Best regards, -- Breno Leitao <leitao(a)debian.org>

9 months, 3 weeks

3
22
0 0

[PATCH] kunit: tool: Use qboot on QEMU x86_64

by Brendan Jackman

As noted in [0], SeaBIOS (QEMU default) makes a mess of the terminal, qboot does not. It turns out this is actually useful with kunit.py, since the user is exposed to this issue if they set --raw_output=all. qboot is also faster than SeaBIOS, but it's is marginal for this usecase. [0] https://lore.kernel.org/all/CA+i-1C0wYb-gZ8Mwh3WSVpbk-LF-Uo+njVbASJPe1WXDUR… Both SeaBIOS and qboot are x86-specific. Signed-off-by: Brendan Jackman <jackmanb(a)google.com> --- tools/testing/kunit/qemu_configs/x86_64.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/tools/testing/kunit/qemu_configs/x86_64.py b/tools/testing/kunit/qemu_configs/x86_64.py index dc794907686304b325dbe180149169dd79bcd44f..4a6bf4e048f5b05c889e3b9b03046f14cc9b0bcc 100644 --- a/tools/testing/kunit/qemu_configs/x86_64.py +++ b/tools/testing/kunit/qemu_configs/x86_64.py @@ -7,4 +7,6 @@ CONFIG_SERIAL_8250_CONSOLE=y''', qemu_arch='x86_64', kernel_path='arch/x86/boot/bzImage', kernel_command_line='console=ttyS0', - extra_qemu_params=[]) + # qboot is faster than SeaBIOS and doesn't mess up + # the terminal. + extra_qemu_params=['-bios', 'qboot.rom']) --- base-commit: 8ea24baaaa869adeb39c6b9ce7542657a7251b56 change-id: 20250124-kunit-qboot-5201945f7e86 Best regards, -- Brendan Jackman <jackmanb(a)google.com>

9 months, 3 weeks

2
2
0 0

[PATCH bpf-next v2 0/7] bpf: Add probe_read_{kernel,user}_dynptr and copy_from_user_dynptr

by Levi Zim via B4 Relay

This series introduce the dynptr counterpart of the bpf_probe_read_{kernel,user} helpers and bpf_copy_from_user helper. These helpers are helpful for reading variable-length data from kernel memory into dynptr without going through an intermediate buffer. Link: https://lore.kernel.org/bpf/MEYP282MB2312CFCE5F7712FDE313215AC64D2@MEYP282M… Suggested-by: Andrii Nakryiko <andrii.nakryiko(a)gmail.com> Signed-off-by: Levi Zim <rsworktech(a)outlook.com> --- Changes in v2: - Add missing bpf-next prefix. I forgot it in the initial series. Sorry about that. - Link to v1: https://lore.kernel.org/r/20250125-bpf_dynptr_probe-v1-0-c3cb121f6951@outlo… --- Levi Zim (7): bpf: Implement bpf_probe_read_kernel_dynptr helper bpf: Implement bpf_probe_read_user_dynptr helper bpf: Implement bpf_copy_from_user_dynptr helper tools headers UAPI: Update tools's copy of bpf.h header selftests/bpf: probe_read_kernel_dynptr test selftests/bpf: probe_read_user_dynptr test selftests/bpf: copy_from_user_dynptr test include/linux/bpf.h | 3 + include/uapi/linux/bpf.h | 49 ++++++++++ kernel/bpf/helpers.c | 53 ++++++++++- kernel/trace/bpf_trace.c | 72 ++++++++++++++ tools/include/uapi/linux/bpf.h | 49 ++++++++++ tools/testing/selftests/bpf/prog_tests/dynptr.c | 45 ++++++++- tools/testing/selftests/bpf/progs/dynptr_success.c | 106 +++++++++++++++++++++ 7 files changed, 374 insertions(+), 3 deletions(-) --- base-commit: d0d106a2bd21499901299160744e5fe9f4c83ddb change-id: 20250124-bpf_dynptr_probe-ab483c554f1a Best regards, -- Levi Zim <rsworktech(a)outlook.com>

9 months, 3 weeks

6
20
0 0

[PATCH v2 net] udp: gso: do not drop small packets when PMTU reduces

by Yan Zhai

Commit 4094871db1d6 ("udp: only do GSO if # of segs > 1") avoided GSO for small packets. But the kernel currently dismisses GSO requests only after checking MTU/PMTU on gso_size. This means any packets, regardless of their payload sizes, could be dropped when PMTU becomes smaller than requested gso_size. We encountered this issue in production and it caused a reliability problem that new QUIC connection cannot be established before PMTU cache expired, while non GSO sockets still worked fine at the same time. Ideally, do not check any GSO related constraints when payload size is smaller than requested gso_size, and return EMSGSIZE instead of EINVAL on MTU/PMTU check failure to be more specific on the error cause. Fixes: 4094871db1d6 ("udp: only do GSO if # of segs > 1") Signed-off-by: Yan Zhai <yan(a)cloudflare.com> -- v1->v2: add a missing MTU check when fall back to no GSO mode suggested by Willem de Bruijn <willemdebruijn.kernel(a)gmail.com>; Fixed up commit message to be more precise. v1: https://lore.kernel.org/all/Z5cgWh%2F6bRQm9vVU@debian.debian/ --- net/ipv4/udp.c | 28 +++++++++++++++++++--------- net/ipv6/udp.c | 28 +++++++++++++++++++--------- tools/testing/selftests/net/udpgso.c | 14 ++++++++++++++ 3 files changed, 52 insertions(+), 18 deletions(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index c472c9a57cf6..0b5010238d05 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1141,9 +1141,20 @@ static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4, const int hlen = skb_network_header_len(skb) + sizeof(struct udphdr); + if (datalen <= cork->gso_size) { + /* + * check MTU again: it's skipped previously when + * gso_size != 0 + */ + if (hlen + datalen > cork->fragsize) { + kfree_skb(skb); + return -EMSGSIZE; + } + goto no_gso; + } if (hlen + cork->gso_size > cork->fragsize) { kfree_skb(skb); - return -EINVAL; + return -EMSGSIZE; } if (datalen > cork->gso_size * UDP_MAX_SEGMENTS) { kfree_skb(skb); @@ -1158,17 +1169,16 @@ static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4, return -EIO; } - if (datalen > cork->gso_size) { - skb_shinfo(skb)->gso_size = cork->gso_size; - skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4; - skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(datalen, - cork->gso_size); + skb_shinfo(skb)->gso_size = cork->gso_size; + skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4; + skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(datalen, + cork->gso_size); - /* Don't checksum the payload, skb will get segmented */ - goto csum_partial; - } + /* Don't checksum the payload, skb will get segmented */ + goto csum_partial; } +no_gso: if (is_udplite) /* UDP-Lite */ csum = udplite_csum(skb); diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 6671daa67f4f..d97befa7f80d 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -1389,9 +1389,20 @@ static int udp_v6_send_skb(struct sk_buff *skb, struct flowi6 *fl6, const int hlen = skb_network_header_len(skb) + sizeof(struct udphdr); + if (datalen <= cork->gso_size) { + /* + * check MTU again: it's skipped previously when + * gso_size != 0 + */ + if (hlen + datalen > cork->fragsize) { + kfree_skb(skb); + return -EMSGSIZE; + } + goto no_gso; + } if (hlen + cork->gso_size > cork->fragsize) { kfree_skb(skb); - return -EINVAL; + return -EMSGSIZE; } if (datalen > cork->gso_size * UDP_MAX_SEGMENTS) { kfree_skb(skb); @@ -1406,17 +1417,16 @@ static int udp_v6_send_skb(struct sk_buff *skb, struct flowi6 *fl6, return -EIO; } - if (datalen > cork->gso_size) { - skb_shinfo(skb)->gso_size = cork->gso_size; - skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4; - skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(datalen, - cork->gso_size); + skb_shinfo(skb)->gso_size = cork->gso_size; + skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4; + skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(datalen, + cork->gso_size); - /* Don't checksum the payload, skb will get segmented */ - goto csum_partial; - } + /* Don't checksum the payload, skb will get segmented */ + goto csum_partial; } +no_gso: if (is_udplite) csum = udplite_csum(skb); else if (udp_get_no_check6_tx(sk)) { /* UDP csum disabled */ diff --git a/tools/testing/selftests/net/udpgso.c b/tools/testing/selftests/net/udpgso.c index 3f2fca02fec5..fb73f1c331fb 100644 --- a/tools/testing/selftests/net/udpgso.c +++ b/tools/testing/selftests/net/udpgso.c @@ -102,6 +102,13 @@ struct testcase testcases_v4[] = { .gso_len = CONST_MSS_V4, .r_num_mss = 1, }, + { + /* datalen <= MSS < gso_len: will fall back to no GSO */ + .tlen = CONST_MSS_V4, + .gso_len = CONST_MSS_V4 + 1, + .r_num_mss = 0, + .r_len_last = CONST_MSS_V4, + }, { /* send a single MSS + 1B */ .tlen = CONST_MSS_V4 + 1, @@ -205,6 +212,13 @@ struct testcase testcases_v6[] = { .gso_len = CONST_MSS_V6, .r_num_mss = 1, }, + { + /* datalen <= MSS < gso_len: will fall back to no GSO */ + .tlen = CONST_MSS_V6, + .gso_len = CONST_MSS_V6 + 1, + .r_num_mss = 0, + .r_len_last = CONST_MSS_V6, + }, { /* send a single MSS + 1B */ .tlen = CONST_MSS_V6 + 1, -- 2.30.2

9 months, 3 weeks

2
2
0 0

[RFC net-next 0/2] netdevgenl: Add an xsk attribute to queues

by Joe Damato

Greetings: This is an attempt to followup on something Jakub asked me about [1], adding an xsk attribute to queues and more clearly documenting which queues are linked to NAPIs... But: 1. I couldn't pick a good "thing" to expose as "xsk", so I chose 0 or 1. Happy to take suggestions on what might be better to expose for the xsk queue attribute. 2. I create a silly C helper program to create an XDP socket in order to add a new test to queues.py. I'm not particularly good at python programming, so there's probably a better way to do this. Notably, python does not seem to have a socket.AF_XDP, so I needed the C helper to make a socket and bind it to a queue to perform the test. Tested this on my mlx5 machine and the test seems to pass. Happy to take any suggestions / feedback on this one; sorry in advance if I missed many obvious better ways to do things. Thanks, Joe [1]: https://lore.kernel.org/netdev/20250113143109.60afa59a@kernel.org/ Joe Damato (2): netdev-genl: Add an XSK attribute to queues selftests: drv-net: Test queue xsk attribute Documentation/netlink/specs/netdev.yaml | 10 ++- include/uapi/linux/netdev.h | 1 + net/core/netdev-genl.c | 6 ++ tools/include/uapi/linux/netdev.h | 1 + tools/testing/selftests/drivers/.gitignore | 1 + tools/testing/selftests/drivers/net/Makefile | 3 + tools/testing/selftests/drivers/net/queues.py | 32 ++++++- .../selftests/drivers/net/xdp_helper.c | 90 +++++++++++++++++++ 8 files changed, 141 insertions(+), 3 deletions(-) create mode 100644 tools/testing/selftests/drivers/net/xdp_helper.c base-commit: 0ad9617c78acbc71373fb341a6f75d4012b01d69 -- 2.25.1

9 months, 3 weeks

2
4
0 0

[PATCH v11 00/14] riscv: Add support for xtheadvector

by Charlie Jenkins

xtheadvector is a custom extension that is based upon riscv vector version 0.7.1 [1]. All of the vector routines have been modified to support this alternative vector version based upon whether xtheadvector was determined to be supported at boot. vlenb is not supported on the existing xtheadvector hardware, so a devicetree property thead,vlenb is added to provide the vlenb to Linux. There is a new hwprobe key RISCV_HWPROBE_KEY_VENDOR_EXT_THEAD_0 that is used to request which thead vendor extensions are supported on the current platform. This allows future vendors to allocate hwprobe keys for their vendor. Support for xtheadvector is also added to the vector kselftests. Signed-off-by: Charlie Jenkins <charlie(a)rivosinc.com> [1] https://github.com/T-head-Semi/thead-extension-spec/blob/95358cb2cca9489361… --- This series is a continuation of a different series that was fragmented into two other series in an attempt to get part of it merged in the 6.10 merge window. The split-off series did not get merged due to a NAK on the series that added the generic riscv,vlenb devicetree entry. This series has converted riscv,vlenb to thead,vlenb to remedy this issue. The original series is titled "riscv: Support vendor extensions and xtheadvector" [3]. I have tested this with an Allwinner Nezha board. I used SkiffOS [1] to manage building the image, but upgraded the U-Boot version to Samuel Holland's more up-to-date version [2] and changed out the device tree used by U-Boot with the device trees that are present in upstream linux and this series. Thank you Samuel for all of the work you did to make this task possible. [1] https://github.com/skiffos/SkiffOS/tree/master/configs/allwinner/nezha [2] https://github.com/smaeul/u-boot/commit/2e89b706f5c956a70c989cd31665f1429e9… [3] https://lore.kernel.org/all/20240503-dev-charlie-support_thead_vector_6_9-v… [4] https://lore.kernel.org/lkml/20240719-support_vendor_extensions-v3-4-0af758… --- Changes in v11: - Fix an issue where the mitigation was not being properly skipped when requested - Fix vstate_discard issue - Fix issue when -1 was passed into __riscv_isa_vendor_extension_available() - Remove some artifacts from being placed in the test directory - Link to v10: https://lore.kernel.org/r/20240911-xtheadvector-v10-0-8d3930091246@rivosinc… Changes in v10: - In DT probing disable vector with new function to clear vendor extension bits for xtheadvector - Add ghostwrite mitigations for c9xx CPUs. This disables xtheadvector unless mitigations=off is set as a kernel boot arg - Link to v9: https://lore.kernel.org/r/20240806-xtheadvector-v9-0-62a56d2da5d0@rivosinc.… Changes in v9: - Rebase onto palmer's for-next - Fix sparse error in arch/riscv/kernel/vendor_extensions/thead.c - Fix maybe-uninitialized warning in arch/riscv/include/asm/vendor_extensions/vendor_hwprobe.h - Wrap some long lines - Link to v8: https://lore.kernel.org/r/20240724-xtheadvector-v8-0-cf043168e137@rivosinc.… Changes in v8: - Rebase onto palmer's for-next - Link to v7: https://lore.kernel.org/r/20240724-xtheadvector-v7-0-b741910ada3e@rivosinc.… Changes in v7: - Add defs for has_xtheadvector_no_alternatives() and has_xtheadvector() when vector disabled. (Palmer) - Link to v6: https://lore.kernel.org/r/20240722-xtheadvector-v6-0-c9af0130fa00@rivosinc.… Changes in v6: - Fix return type of is_vector_supported()/is_xthead_supported() to be bool - Link to v5: https://lore.kernel.org/r/20240719-xtheadvector-v5-0-4b485fc7d55f@rivosinc.… Changes in v5: - Rebase on for-next - Link to v4: https://lore.kernel.org/r/20240702-xtheadvector-v4-0-2bad6820db11@rivosinc.… Changes in v4: - Replace inline asm with C (Samuel) - Rename VCSRs to CSRs (Samuel) - Replace .insn directives with .4byte directives - Link to v3: https://lore.kernel.org/r/20240619-xtheadvector-v3-0-bff39eb9668e@rivosinc.… Changes in v3: - Add back Heiko's signed-off-by (Conor) - Mark RISCV_HWPROBE_KEY_VENDOR_EXT_THEAD_0 as a bitmask - Link to v2: https://lore.kernel.org/r/20240610-xtheadvector-v2-0-97a48613ad64@rivosinc.… Changes in v2: - Removed extraneous references to "riscv,vlenb" (Jess) - Moved declaration of "thead,vlenb" into cpus.yaml and added restriction that it's only applicable to thead cores (Conor) - Check CONFIG_RISCV_ISA_XTHEADVECTOR instead of CONFIG_RISCV_ISA_V for thead,vlenb (Jess) - Fix naming of hwprobe variables (Evan) - Link to v1: https://lore.kernel.org/r/20240609-xtheadvector-v1-0-3fe591d7f109@rivosinc.… --- Charlie Jenkins (13): dt-bindings: riscv: Add xtheadvector ISA extension description dt-bindings: cpus: add a thead vlen register length property riscv: dts: allwinner: Add xtheadvector to the D1/D1s devicetree riscv: Add thead and xtheadvector as a vendor extension riscv: vector: Use vlenb from DT for thead riscv: csr: Add CSR encodings for CSR_VXRM/CSR_VXSAT riscv: Add xtheadvector instruction definitions riscv: vector: Support xtheadvector save/restore riscv: hwprobe: Add thead vendor extension probing riscv: hwprobe: Document thead vendor extensions and xtheadvector extension selftests: riscv: Fix vector tests selftests: riscv: Support xtheadvector in vector tests riscv: Add ghostwrite vulnerability Heiko Stuebner (1): RISC-V: define the elements of the VCSR vector CSR Documentation/arch/riscv/hwprobe.rst | 10 + Documentation/devicetree/bindings/riscv/cpus.yaml | 19 ++ .../devicetree/bindings/riscv/extensions.yaml | 10 + arch/riscv/Kconfig.errata | 11 + arch/riscv/Kconfig.vendor | 26 ++ arch/riscv/boot/dts/allwinner/sun20i-d1s.dtsi | 3 +- arch/riscv/errata/thead/errata.c | 28 ++ arch/riscv/include/asm/bugs.h | 22 ++ arch/riscv/include/asm/cpufeature.h | 2 + arch/riscv/include/asm/csr.h | 15 + arch/riscv/include/asm/errata_list.h | 3 +- arch/riscv/include/asm/hwprobe.h | 3 +- arch/riscv/include/asm/switch_to.h | 2 +- arch/riscv/include/asm/vector.h | 222 +++++++++++---- arch/riscv/include/asm/vendor_extensions/thead.h | 47 ++++ .../include/asm/vendor_extensions/thead_hwprobe.h | 19 ++ .../include/asm/vendor_extensions/vendor_hwprobe.h | 37 +++ arch/riscv/include/uapi/asm/hwprobe.h | 3 +- arch/riscv/include/uapi/asm/vendor/thead.h | 3 + arch/riscv/kernel/Makefile | 2 + arch/riscv/kernel/bugs.c | 60 ++++ arch/riscv/kernel/cpufeature.c | 59 +++- arch/riscv/kernel/kernel_mode_vector.c | 8 +- arch/riscv/kernel/process.c | 4 +- arch/riscv/kernel/signal.c | 6 +- arch/riscv/kernel/sys_hwprobe.c | 5 + arch/riscv/kernel/vector.c | 24 +- arch/riscv/kernel/vendor_extensions.c | 10 + arch/riscv/kernel/vendor_extensions/Makefile | 2 + arch/riscv/kernel/vendor_extensions/thead.c | 29 ++ .../riscv/kernel/vendor_extensions/thead_hwprobe.c | 19 ++ drivers/base/cpu.c | 3 + include/linux/cpu.h | 1 + tools/testing/selftests/riscv/vector/.gitignore | 3 +- tools/testing/selftests/riscv/vector/Makefile | 17 +- .../selftests/riscv/vector/v_exec_initval_nolibc.c | 94 +++++++ tools/testing/selftests/riscv/vector/v_helpers.c | 68 +++++ tools/testing/selftests/riscv/vector/v_helpers.h | 8 + tools/testing/selftests/riscv/vector/v_initval.c | 22 ++ .../selftests/riscv/vector/v_initval_nolibc.c | 68 ----- .../selftests/riscv/vector/vstate_exec_nolibc.c | 20 +- .../testing/selftests/riscv/vector/vstate_prctl.c | 305 +++++++++++++-------- 42 files changed, 1051 insertions(+), 271 deletions(-) --- base-commit: 0eb512779d642b21ced83778287a0f7a3ca8f2a1 change-id: 20240530-xtheadvector-833d3d17b423 -- - Charlie

9 months, 3 weeks

3
22
0 0

[PATCH bpf-next/net v2 0/7] bpf: Add mptcp_subflow bpf_iter support

by Matthieu Baerts (NGI0)

Here is a series from Geliang, adding mptcp_subflow bpf_iter support. We are working on extending MPTCP with BPF, e.g. to control the path manager -- in charge of the creation, deletion, and announcements of subflows (paths) -- and the packet scheduler -- in charge of selecting which available path the next data will be sent to. These extensions need to iterate over the list of subflows attached to an MPTCP connection, and do some specific actions via some new kfunc that will be added later on. This preparation work is split in different patches: - Patch 1: extend bpf_skc_to_mptcp_sock() to be called with msk. - Patch 2: allow using skc_to_mptcp_sock() in CGroup sockopt hooks. - Patch 3: register some "basic" MPTCP kfunc. - Patch 4: add mptcp_subflow bpf_iter support. Note that previous versions of this single patch have already been shared to the BPF mailing list. The changelog has been kept with a comment, but the version number has been reset to avoid confusions. - Patch 5: add kfunc to make sure the msk is valid - Patch 6: add more MPTCP endpoints in the selftests, in order to create more than 2 subflows. - Patch 7: add a very simple test validating mptcp_subflow bpf_iter support. This test could be written without the new bpf_iter, but it is there only to make sure this specific feature works as expected. Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org> --- Changes in v2: - Patches 1-2: new ones. - Patch 3: remove two kfunc, more restrictions. (Martin) - Patch 4: add BUILD_BUG_ON(), more restrictions. (Martin) - Patch 7: adaptations due to modifications in patches 1-4. - Link to v1: https://lore.kernel.org/r/20241108-bpf-next-net-mptcp-bpf_iter-subflows-v1-… --- Geliang Tang (7): bpf: Extend bpf_skc_to_mptcp_sock to MPTCP sock bpf: Allow use of skc_to_mptcp_sock in cg_sockopt bpf: Register mptcp common kfunc set bpf: Add mptcp_subflow bpf_iter bpf: Acquire and release mptcp socket selftests/bpf: More endpoints for endpoint_init selftests/bpf: Add mptcp_subflow bpf_iter subtest include/net/mptcp.h | 4 +- kernel/bpf/cgroup.c | 2 + net/core/filter.c | 2 +- net/mptcp/bpf.c | 113 +++++++++++++++++- tools/testing/selftests/bpf/bpf_experimental.h | 8 ++ tools/testing/selftests/bpf/prog_tests/mptcp.c | 129 ++++++++++++++++++++- tools/testing/selftests/bpf/progs/mptcp_bpf.h | 9 ++ .../testing/selftests/bpf/progs/mptcp_bpf_iters.c | 63 ++++++++++ 8 files changed, 318 insertions(+), 12 deletions(-) --- base-commit: dad704ebe38642cd405e15b9c51263356391355c change-id: 20241108-bpf-next-net-mptcp-bpf_iter-subflows-027f6d87770e Best regards, -- Matthieu Baerts (NGI0) <matttbe(a)kernel.org>

9 months, 3 weeks

3
13
0 0

[PATCH] udp: gso: fix MTU check for small packets

by Yan Zhai

Commit 4094871db1d6 ("udp: only do GSO if # of segs > 1") avoided GSO for small packets. But the kernel currently dismisses GSO requests only after checking MTU on gso_size. This means any packets, regardless of their payload sizes, would be dropped when MTU is smaller than requested gso_size. Meanwhile, EINVAL would be returned in this case, making it very misleading to debug. Ideally, do not check any GSO related constraints when payload size is smaller than requested gso_size, and return EMSGSIZE on MTU check failure consistently for all packets to ease debugging. Fixes: 4094871db1d6 ("udp: only do GSO if # of segs > 1") Signed-off-by: Yan Zhai <yan(a)cloudflare.com> --- net/ipv4/udp.c | 18 ++++++++---------- net/ipv6/udp.c | 18 ++++++++---------- tools/testing/selftests/net/udpgso.c | 14 ++++++++++++++ 3 files changed, 30 insertions(+), 20 deletions(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index c472c9a57cf6..9aed1b4a871f 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1137,13 +1137,13 @@ static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4, uh->len = htons(len); uh->check = 0; - if (cork->gso_size) { + if (cork->gso_size && datalen > cork->gso_size) { const int hlen = skb_network_header_len(skb) + sizeof(struct udphdr); if (hlen + cork->gso_size > cork->fragsize) { kfree_skb(skb); - return -EINVAL; + return -EMSGSIZE; } if (datalen > cork->gso_size * UDP_MAX_SEGMENTS) { kfree_skb(skb); @@ -1158,15 +1158,13 @@ static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4, return -EIO; } - if (datalen > cork->gso_size) { - skb_shinfo(skb)->gso_size = cork->gso_size; - skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4; - skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(datalen, - cork->gso_size); + skb_shinfo(skb)->gso_size = cork->gso_size; + skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4; + skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(datalen, + cork->gso_size); - /* Don't checksum the payload, skb will get segmented */ - goto csum_partial; - } + /* Don't checksum the payload, skb will get segmented */ + goto csum_partial; } if (is_udplite) /* UDP-Lite */ diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 6671daa67f4f..6cdc8ce4c6f9 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -1385,13 +1385,13 @@ static int udp_v6_send_skb(struct sk_buff *skb, struct flowi6 *fl6, uh->len = htons(len); uh->check = 0; - if (cork->gso_size) { + if (cork->gso_size && datalen > cork->gso_size) { const int hlen = skb_network_header_len(skb) + sizeof(struct udphdr); if (hlen + cork->gso_size > cork->fragsize) { kfree_skb(skb); - return -EINVAL; + return -EMSGSIZE; } if (datalen > cork->gso_size * UDP_MAX_SEGMENTS) { kfree_skb(skb); @@ -1406,15 +1406,13 @@ static int udp_v6_send_skb(struct sk_buff *skb, struct flowi6 *fl6, return -EIO; } - if (datalen > cork->gso_size) { - skb_shinfo(skb)->gso_size = cork->gso_size; - skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4; - skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(datalen, - cork->gso_size); + skb_shinfo(skb)->gso_size = cork->gso_size; + skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4; + skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(datalen, + cork->gso_size); - /* Don't checksum the payload, skb will get segmented */ - goto csum_partial; - } + /* Don't checksum the payload, skb will get segmented */ + goto csum_partial; } if (is_udplite) diff --git a/tools/testing/selftests/net/udpgso.c b/tools/testing/selftests/net/udpgso.c index 3f2fca02fec5..fb73f1c331fb 100644 --- a/tools/testing/selftests/net/udpgso.c +++ b/tools/testing/selftests/net/udpgso.c @@ -102,6 +102,13 @@ struct testcase testcases_v4[] = { .gso_len = CONST_MSS_V4, .r_num_mss = 1, }, + { + /* datalen <= MSS < gso_len: will fall back to no GSO */ + .tlen = CONST_MSS_V4, + .gso_len = CONST_MSS_V4 + 1, + .r_num_mss = 0, + .r_len_last = CONST_MSS_V4, + }, { /* send a single MSS + 1B */ .tlen = CONST_MSS_V4 + 1, @@ -205,6 +212,13 @@ struct testcase testcases_v6[] = { .gso_len = CONST_MSS_V6, .r_num_mss = 1, }, + { + /* datalen <= MSS < gso_len: will fall back to no GSO */ + .tlen = CONST_MSS_V6, + .gso_len = CONST_MSS_V6 + 1, + .r_num_mss = 0, + .r_len_last = CONST_MSS_V6, + }, { /* send a single MSS + 1B */ .tlen = CONST_MSS_V6 + 1, -- 2.30.2

9 months, 3 weeks

2
7
0 0

[PATCH v4 0/9] mm: workingset reporting

by Yuanchu Xie

This patch series provides workingset reporting of user pages in lruvecs, of which coldness can be tracked by accessed bits and fd references. However, the concept of workingset applies generically to all types of memory, which could be kernel slab caches, discardable userspace caches (databases), or CXL.mem. Therefore, data sources might come from slab shrinkers, device drivers, or the userspace. Another interesting idea might be hugepage workingset, so that we can measure the proportion of hugepages backing cold memory. However, with architectures like arm, there may be too many hugepage sizes leading to a combinatorial explosion when exporting stats to the userspace. Nonetheless, the kernel should provide a set of workingset interfaces that is generic enough to accommodate the various use cases, and extensible to potential future use cases. Use cases ========== Job scheduling On overcommitted hosts, workingset information improves efficiency and reliability by allowing the job scheduler to have better stats on the exact memory requirements of each job. This can manifest in efficiency by landing more jobs on the same host or NUMA node. On the other hand, the job scheduler can also ensure each node has a sufficient amount of memory and does not enter direct reclaim or the kernel OOM path. With workingset information and job priority, the userspace OOM killing or proactive reclaim policy can kick in before the system is under memory pressure. If the job shape is very different from the machine shape, knowing the workingset per-node can also help inform page allocation policies. Proactive reclaim Workingset information allows the a container manager to proactively reclaim memory while not impacting a job's performance. While PSI may provide a reactive measure of when a proactive reclaim has reclaimed too much, workingset reporting allows the policy to be more accurate and flexible. Ballooning (similar to proactive reclaim) The last patch of the series extends the virtio-balloon device to report the guest workingset. Balloon policies benefit from workingset to more precisely determine the size of the memory balloon. On end-user devices where memory is scarce and overcommitted, the balloon sizing in multiple VMs running on the same device can be orchestrated with workingset reports from each one. On the server side, workingset reporting allows the balloon controller to inflate the balloon without causing too much file cache to be reclaimed in the guest. Promotion/Demotion If different mechanisms are used for promition and demotion, workingset information can help connect the two and avoid pages being migrated back and forth. For example, given a promotion hot page threshold defined in reaccess distance of N seconds (promote pages accessed more often than every N seconds). The threshold N should be set so that ~80% (e.g.) of pages on the fast memory node passes the threshold. This calculation can be done with workingset reports. To be directly useful for promotion policies, the workingset report interfaces need to be extended to report hotness and gather hotness information from the devices[1]. [1] https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements… Sysfs and Cgroup Interfaces ========== The interfaces are detailed in the patches that introduce them. The main idea here is we break down the workingset per-node per-memcg into time intervals (ms), e.g. 1000 anon=137368 file=24530 20000 anon=34342 file=0 30000 anon=353232 file=333608 40000 anon=407198 file=206052 9223372036854775807 anon=4925624 file=892892 Implementation ========== The reporting of user pages is based off of MGLRU, and therefore requires CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations for a more fine-grained workingset report, but we can already gather a lot of data with just four generations. The workingset reporting mechanism is gated behind CONFIG_WORKINGSET_REPORT, and the aging thread is behind CONFIG_WORKINGSET_REPORT_AGING. Benchmarks ========== Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux compile and redis benchmarks from openbenchmarking.org. The policy and runner is referred to as WMO (Workload Memory Optimization). The results were based on v3 of the series, but v4 doesn't change the core of the working set reporting and just adds the ballooning counterpart. The timed Linux kernel compilation benchmark shows improvements in peak memory usage with a policy of "swap out all bytes colder than 10 seconds every 40 seconds". A swapfile is configured on SSD. -------------------------------------------- peak memory usage (with WMO): 4982.61328 MiB peak memory usage (control): 9569.1367 MiB peak memory reduction: 47.9% -------------------------------------------- Benchmark | Experimental |Control | Experimental_Std_Dev | Control_Std_Dev Timed Linux Kernel Compilation - allmodconfig (sec) | 708.486 (95.91%) | 679.499 (100%) | 0.6% | 0.1% -------------------------------------------- Seconds, fewer is better The redis benchmark shows employs the same policy: -------------------------------------------- peak memory usage (with WMO): 375.9023 MiB peak memory usage (control): 509.765 MiB peak memory reduction: 26% -------------------------------------------- Benchmark | Experimental | Control | Experimental_Std_Dev | Control_Std_Dev Redis - LPOP (Reqs/sec) | 2023130 (98.22%) | 2059849 (100%) | 1.2% | 2% Redis - SADD (Reqs/sec) | 2539662 (98.63%) | 2574811 (100%) | 2.3% | 1.4% Redis - LPUSH (Reqs/sec)| 2024880 (100%) | 2000884 (98.81%) | 1.1% | 0.8% Redis - GET (Reqs/sec) | 2835764 (100%) | 2763722 (97.46%) | 2.7% | 1.6% Redis - SET (Reqs/sec) | 2340723 (100%) | 2327372 (99.43%) | 2.4% | 1.8% -------------------------------------------- Reqs/sec, more is better The detailed report and benchmarking results are in Ghait's repo: https://github.com/miloudi98/WMO Changelog ========== Changes from PATCH v3 -> v4: - Added documentation for cgroup-v2 (Waiman Long) - Fixed types in documentation (Randy Dunlap) - Added implementation for the ballooning use case - Added detailed description of benchmark results (Andrew Morton) Changes from PATCH v2 -> v3: - Fixed typos in commit messages and documentation (Lance Yang, Randy Dunlap) - Split out the force_scan patch to be reviewed separately - Added benchmarks from Ghait Ouled Amar Ben Cheikh - Fixed reported compile error without CONFIG_MEMCG Changes from PATCH v1 -> v2: - Updated selftest to use ksft_test_result_code instead of switch-case (Muhammad Usama Anjum) - Included more use cases in the cover letter (Huang, Ying) - Added documentation for sysfs and memcg interfaces - Added an aging-specific struct lru_gen_mm_walk in struct pglist_data to avoid allocating for each lruvec. [v1] https://lore.kernel.org/linux-mm/20240504073011.4000534-1-yuanchu@google.co… [v2] https://lore.kernel.org/linux-mm/20240604020549.1017540-1-yuanchu@google.co… [v3] https://lore.kernel.org/linux-mm/20240813165619.748102-1-yuanchu@google.com/ Yuanchu Xie (9): mm: aggregate workingset information into histograms mm: use refresh interval to rate-limit workingset report aggregation mm: report workingset during memory pressure driven scanning mm: extend workingset reporting to memcgs mm: add kernel aging thread for workingset reporting selftest: test system-wide workingset reporting Docs/admin-guide/mm/workingset_report: document sysfs and memcg interfaces Docs/admin-guide/cgroup-v2: document workingset reporting virtio-balloon: add workingset reporting Documentation/admin-guide/cgroup-v2.rst | 35 + Documentation/admin-guide/mm/index.rst | 1 + .../admin-guide/mm/workingset_report.rst | 105 +++ drivers/base/node.c | 6 + drivers/virtio/virtio_balloon.c | 390 ++++++++++- include/linux/balloon_compaction.h | 1 + include/linux/memcontrol.h | 21 + include/linux/mmzone.h | 13 + include/linux/workingset_report.h | 167 +++++ include/uapi/linux/virtio_balloon.h | 30 + mm/Kconfig | 15 + mm/Makefile | 2 + mm/internal.h | 19 + mm/memcontrol.c | 162 ++++- mm/mm_init.c | 2 + mm/mmzone.c | 2 + mm/vmscan.c | 56 +- mm/workingset_report.c | 653 ++++++++++++++++++ mm/workingset_report_aging.c | 127 ++++ tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 3 + tools/testing/selftests/mm/run_vmtests.sh | 5 + .../testing/selftests/mm/workingset_report.c | 306 ++++++++ .../testing/selftests/mm/workingset_report.h | 39 ++ .../selftests/mm/workingset_report_test.c | 330 +++++++++ 25 files changed, 2482 insertions(+), 9 deletions(-) create mode 100644 Documentation/admin-guide/mm/workingset_report.rst create mode 100644 include/linux/workingset_report.h create mode 100644 mm/workingset_report.c create mode 100644 mm/workingset_report_aging.c create mode 100644 tools/testing/selftests/mm/workingset_report.c create mode 100644 tools/testing/selftests/mm/workingset_report.h create mode 100644 tools/testing/selftests/mm/workingset_report_test.c -- 2.47.0.338.g60cca15819-goog

9 months, 3 weeks

6
20
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-kselftest-mirror January 2025