From: Bobby Eshleman <bobbyeshleman(a)meta.com>
Update devmem.rst documentation to describe the autorelease netlink
attribute used during RX dmabuf binding.
The autorelease attribute is specified at bind-time via the netlink API
(NETDEV_CMD_BIND_RX) and controls what happens to outstanding tokens
when the socket closes.
Document the two token release modes (automatic vs manual), how to
configure the binding for autorelease, the perf benefits, new caveats
and restrictions, and the way the mode is enforced system-wide.
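For illustration only, the system-wide enforcement can be modelled in
userspace roughly as follows. This is a sketch, not the kernel code:
system_mode and bind_check_autorelease() are hypothetical names, and how
the mode is reset once all bindings are gone is not modelled here.

```c
#include <errno.h>

/* Hypothetical model: -1 means no binding has selected a mode yet. */
enum { MODE_UNSET = -1 };

static int system_mode = MODE_UNSET;

/*
 * Hypothetical check run at bind time. The first binding selects the
 * system-wide mode; later bindings must request the same mode or are
 * rejected with -EINVAL.
 */
int bind_check_autorelease(int autorelease)
{
	if (system_mode == MODE_UNSET) {
		system_mode = autorelease;
		return 0;
	}
	return system_mode == autorelease ? 0 : -EINVAL;
}
```

With this model, a first bind with autorelease=0 succeeds, a second bind
with autorelease=0 succeeds, and a bind with autorelease=1 fails.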
Signed-off-by: Bobby Eshleman <bobbyeshleman(a)meta.com>
---
Changes in v7:
- Document netlink instead of sockopt
- Mention system-wide locked to one mode
---
Documentation/networking/devmem.rst | 70 +++++++++++++++++++++++++++++++++++++
1 file changed, 70 insertions(+)
diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
index a6cd7236bfbd..67c63bc5a7ae 100644
--- a/Documentation/networking/devmem.rst
+++ b/Documentation/networking/devmem.rst
@@ -235,6 +235,76 @@ can be less than the tokens provided by the user in case of:
(a) an internal kernel leak bug.
(b) the user passed more than 1024 frags.
+
+Autorelease Control
+~~~~~~~~~~~~~~~~~~~
+
+The autorelease mode controls what happens to outstanding tokens (tokens not
+released via SO_DEVMEM_DONTNEED) when the socket closes. Autorelease is
+configured per-binding at binding creation time via the netlink API::
+
+ struct netdev_bind_rx_req *req;
+ struct netdev_bind_rx_rsp *rsp;
+ struct ynl_sock *ys;
+ struct ynl_error yerr;
+
+ ys = ynl_sock_create(&ynl_netdev_family, &yerr);
+
+ req = netdev_bind_rx_req_alloc();
+ netdev_bind_rx_req_set_ifindex(req, ifindex);
+ netdev_bind_rx_req_set_fd(req, dmabuf_fd);
+ netdev_bind_rx_req_set_autorelease(req, 0); /* 0 = manual, 1 = auto */
+ __netdev_bind_rx_req_set_queues(req, queues, n_queues);
+
+ rsp = netdev_bind_rx(ys, req);
+
+ dmabuf_id = rsp->id;
+
+When autorelease is disabled (0):
+
+- Outstanding tokens are NOT released when the socket closes
+- Outstanding tokens are only released when the dmabuf is unbound
+- Provides better performance by eliminating xarray overhead (~13% CPU reduction)
+- Kernel tracks tokens via atomic reference counters in net_iov structures
+
+When autorelease is enabled (1):
+
+- Outstanding tokens are automatically released when the socket closes
+- Backwards compatible behavior
+- Kernel tracks tokens in an xarray per socket
+
+The default is autorelease disabled.
+
+Important: In both modes, applications should call SO_DEVMEM_DONTNEED to
+return tokens as soon as they are done processing. The autorelease setting only
+affects what happens to tokens that are still outstanding when close() is called.
+
+The mode is enforced system-wide. Once a binding is created with a specific
+autorelease mode, all subsequent bindings system-wide must use the same mode.
+
+
+Performance Considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Disabling autorelease reduces CPU utilization by approximately 13% in RX
+workloads. That said, applications must ensure all tokens are released
+via SO_DEVMEM_DONTNEED before closing the socket, otherwise the backing pages
+will remain pinned until the dmabuf is unbound.
+
+
+Caveats
+~~~~~~~
+
+- Once a system-wide autorelease mode is selected (via the first binding),
+ all subsequent bindings must use the same mode. Attempts to create bindings
+ with a different mode will be rejected with -EINVAL.
+
+- Applications using manual release mode (autorelease=0) must ensure all tokens
+ are returned via SO_DEVMEM_DONTNEED before socket close to avoid resource
+ leaks during the lifetime of the dmabuf binding. Tokens not released before
+ close() will only be freed when the dmabuf is unbound.
+
+
TX Interface
============
--
2.47.3
We see the following failure a few times a week:
# RUN global.data_steal ...
# tls.c:3280:data_steal:Expected recv(cfd, buf2, sizeof(buf2), MSG_DONTWAIT) (10000) == -1 (-1)
# data_steal: Test failed
# FAIL global.data_steal
not ok 8 global.data_steal
The 10000 bytes read suggests that the child process did a recv()
of half of the data using the TLS ULP and we're now getting the
remaining half. The intent of the test is to get the child to
enter the _TCP_ recvmsg handler, so it needs to enter the syscall before
the parent installs the TLS recvmsg with setsockopt(SOL_TLS).
Instead of the 10msec sleep, send 1 byte of data and wait for the
child to consume it.
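The handshake can be tried in isolation with a sketch like the one
below. It is an assumption-laden stand-in for the selftest: it uses an
AF_UNIX socketpair instead of a TCP connection, involves no TLS at all,
and sync_child_into_recv() is a hypothetical helper name.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that blocks in recv(), prove that it
 * entered the syscall by watching a sync byte disappear, then release it.
 * Returns 0 on success, -1 on any failure.
 */
int sync_child_into_recv(void)
{
	char buf[64] = { 0 };
	char child_buf[64];
	size_t want = sizeof(buf) / 2 + 1;
	int sv[2], status;
	ssize_t n;
	pid_t pid;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;
	if (!pid) {
		/* Child: block until all 'want' bytes have arrived. */
		n = recv(sv[1], child_buf, want, MSG_WAITALL);
		_exit(n == (ssize_t)want ? 0 : 1);
	}

	/* Parent: send the sync byte, then poll until the child ate it. */
	if (send(sv[0], buf, 1, 0) != 1)
		return -1;
	do {
		usleep(500);
		n = recv(sv[1], buf, 1, MSG_PEEK | MSG_DONTWAIT);
	} while (n == 1);
	if (n != -1 || (errno != EAGAIN && errno != EWOULDBLOCK))
		return -1;

	/* The byte is gone, so the child has entered recv(). Release it. */
	if (send(sv[0], buf, want - 1, 0) != (ssize_t)(want - 1))
		return -1;
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	close(sv[0]);
	close(sv[1]);
	return WIFEXITED(status) && WEXITSTATUS(status) == 0 ? 0 : -1;
}
```

The point where the sketch sends the remaining bytes is where the real
test installs the TLS keys.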
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: sd(a)queasysnail.net
CC: shuah(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/tls.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/net/tls.c b/tools/testing/selftests/net/tls.c
index a4d16a460fbe..9e2ccea13d70 100644
--- a/tools/testing/selftests/net/tls.c
+++ b/tools/testing/selftests/net/tls.c
@@ -3260,17 +3260,25 @@ TEST(data_steal) {
ASSERT_EQ(setsockopt(cfd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")), 0);
/* Spawn a child and get it into the read wait path of the underlying
- * TCP socket.
+ * TCP socket (before kernel .recvmsg is replaced with the TLS one).
*/
pid = fork();
ASSERT_GE(pid, 0);
if (!pid) {
- EXPECT_EQ(recv(cfd, buf, sizeof(buf) / 2, MSG_WAITALL),
- sizeof(buf) / 2);
+ EXPECT_EQ(recv(cfd, buf, sizeof(buf) / 2 + 1, MSG_WAITALL),
+ sizeof(buf) / 2 + 1);
exit(!__test_passed(_metadata));
}
- usleep(10000);
+ /* Send a sync byte and poll until it's consumed to ensure
+ * the child is in recv() before we proceed to install TLS.
+ */
+ ASSERT_EQ(send(fd, buf, 1, 0), 1);
+ do {
+ usleep(500);
+ } while (recv(cfd, buf, 1, MSG_PEEK | MSG_DONTWAIT) == 1);
+ EXPECT_EQ(errno, EAGAIN);
+
ASSERT_EQ(setsockopt(fd, SOL_TLS, TLS_TX, &tls, tls.len), 0);
ASSERT_EQ(setsockopt(cfd, SOL_TLS, TLS_RX, &tls, tls.len), 0);
--
2.52.0
v3:
- Patch 2: Change the condition for calling reset_partition_data() to
(new_prs <= 0).
- Patch 4: Update commit log and code comment to clarify the change.
- Add a new patch 5 to move the empty cpus/mems check to
cpuset1_validate_change().
v2:
- Patch 1: additional comment
- Patch 2: simplify the conditions for triggering call to
compute_excpus().
- Patch 3: update description of cpuset.cpus.exclusive in cgroup-v2.rst
to reflect the new behavior and change the name of the new
cpus_excl_conflict() parameter to xcpus_changed.
- Patch 4: update description of cpuset.cpus.partition in cgroup-v2.rst
to clarify what exclusive CPUs will be used when a partition is
created.
This patch series is inspired by the cpuset patch sent by Sun Shaojie [1].
The idea is to avoid invalidating sibling partitions when there is a
cpuset.cpus conflict. However, this patch series does it in a slightly
different way to make its behavior more consistent with other cpuset
properties.
The first 3 patches are just some cleanup and minor bug fixes on
issues found during the investigation process. The last one is
the major patch that changes the way cpuset.cpus is being handled
during the partition creation process. Instead of invalidating sibling
partitions when there is a conflict, it will strip out the conflicting
exclusive CPUs and assign the remaining non-conflicting exclusive
CPUs to the new partition. If no CPU is left, the partition creation
process fails. It is similar to the idea that
cpuset.cpus.effective may only contain a subset of CPUs specified in
cpuset.cpus. So cpuset.cpus.exclusive.effective may contain only a
subset of cpuset.cpus when a partition is created without setting
cpuset.cpus.exclusive.
Even setting cpuset.cpus.exclusive instead of cpuset.cpus may not
guarantee all the requested CPUs can be granted if parent doesn't have
access to some of those exclusive CPUs. The difference is that conflicts
from siblings are not possible with cpuset.cpus.exclusive as long as it
can be set successfully.
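As a rough illustration of the new cpuset.cpus handling, the stripping
step can be modelled with plain bitmasks. This is a simplified sketch,
not the kernel implementation: cpu_mask_t, compute_partition_xcpus() and
its parameters are hypothetical, and the real code operates on struct
cpumask with more conditions.

```c
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpu_mask_t;

/*
 * Compute the exclusive CPUs granted to a new partition that only set
 * cpuset.cpus. CPUs already exclusive to siblings are stripped out, and
 * only CPUs the parent can hand out are kept.
 *
 * Returns 0 and fills *granted on success, or -1 if no CPU is left, in
 * which case partition creation fails.
 */
int compute_partition_xcpus(cpu_mask_t requested, cpu_mask_t parent_xcpus,
			    cpu_mask_t sibling_xcpus, cpu_mask_t *granted)
{
	cpu_mask_t avail = requested & parent_xcpus & ~sibling_xcpus;

	if (!avail)
		return -1;
	*granted = avail;
	return 0;
}
```

For example, requesting CPUs 0-3 while a sibling partition already owns
CPUs 0-1 grants CPUs 2-3 in this model. Requesting only CPUs 0-1 in the
same situation leaves nothing and fails.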
[1] https://lore.kernel.org/lkml/20251117015708.977585-1-sunshaojie@kylinos.cn/
Waiman Long (5):
cgroup/cpuset: Streamline rm_siblings_excl_cpus()
cgroup/cpuset: Consistently compute effective_xcpus in
update_cpumasks_hier()
cgroup/cpuset: Don't fail cpuset.cpus change in v2
cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus
conflict
cgroup/cpuset: Move the v1 empty cpus/mems check to
cpuset1_validate_change()
Documentation/admin-guide/cgroup-v2.rst | 40 +++--
kernel/cgroup/cpuset-internal.h | 12 ++
kernel/cgroup/cpuset-v1.c | 33 ++++
kernel/cgroup/cpuset.c | 163 ++++++------------
.../selftests/cgroup/test_cpuset_prs.sh | 29 +++-
5 files changed, 150 insertions(+), 127 deletions(-)
--
2.52.0
ksm_tests writes KSM sysfs knobs under /sys/kernel/mm/ksm, which requires
root privileges. When run unprivileged, it fails with permission errors
and reports FAIL, which is misleading.
Skip the test early when not run as root to avoid false failures.
Signed-off-by: Sun Jian <sun.jian.kdev(a)gmail.com>
---
tools/testing/selftests/mm/ksm_tests.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/mm/ksm_tests.c b/tools/testing/selftests/mm/ksm_tests.c
index a0b48b839d54..c22cd9c61711 100644
--- a/tools/testing/selftests/mm/ksm_tests.c
+++ b/tools/testing/selftests/mm/ksm_tests.c
@@ -766,6 +766,11 @@ int main(int argc, char *argv[])
bool merge_across_nodes = KSM_MERGE_ACROSS_NODES_DEFAULT;
long size_MB = 0;
+ if (geteuid() != 0) {
+ printf("# SKIP ksm_tests requires root privileges\n");
+ return KSFT_SKIP;
+ }
+
while ((opt = getopt(argc, argv, "dha:p:l:z:m:s:t:MUZNPCHD")) != -1) {
switch (opt) {
case 'a':
--
2.43.0