November 2025 - Linux-kselftest-mirror

[PATCH v3 0/1] cpuset: relax the overlap check for cgroup-v2

by Sun Shaojie

In cgroup v2, a mutual overlap check is currently required when at least one of two cpusets is exclusive. This check should be relaxed and limited to cases where both cpusets are exclusive.

This patch ensures that for sibling cpusets A1 (exclusive) and B1 (non-exclusive), a change to B1 cannot affect A1's exclusivity.

For example, assume a machine has 4 CPUs (0-3).

   root cgroup
     /    \
   A1      B1

Case 1:

 Table 1.1: Before applying the patch
 Step                                       | A1's prstate | B1's prstate |
 #1> echo "0-1" > A1/cpuset.cpus            | member       | member       |
 #2> echo "root" > A1/cpuset.cpus.partition | root         | member       |
 #3> echo "0" > B1/cpuset.cpus              | root invalid | member       |

After step #3, A1 changes from "root" to "root invalid" because its CPUs (0-1) overlap with those requested by B1 (0-3). However, B1 can actually use CPUs 2-3 (from B1's parent), so it would be more reasonable for A1 to remain "root".

 Table 1.2: After applying the patch
 Step                                       | A1's prstate | B1's prstate |
 #1> echo "0-1" > A1/cpuset.cpus            | member       | member       |
 #2> echo "root" > A1/cpuset.cpus.partition | root         | member       |
 #3> echo "0" > B1/cpuset.cpus              | root         | member       |

Case 2: (This situation remains unchanged from before)

 Table 2.1: Before applying the patch
 Step                                       | A1's prstate | B1's prstate |
 #1> echo "0-1" > A1/cpuset.cpus            | member       | member       |
 #3> echo "1-2" > B1/cpuset.cpus            | member       | member       |
 #2> echo "root" > A1/cpuset.cpus.partition | root invalid | member       |

 Table 2.2: After applying the patch
 Step                                       | A1's prstate | B1's prstate |
 #1> echo "0-1" > A1/cpuset.cpus            | member       | member       |
 #3> echo "1-2" > B1/cpuset.cpus            | member       | member       |
 #2> echo "root" > A1/cpuset.cpus.partition | root invalid | member       |

All other cases remain unaffected, for example cgroup v1, or cases where A1 and B1 are both exclusive or both non-exclusive.

---
v3 -> v4:
 - Adjust the test_cpuset_prs.sh test file to align with the current behavior.

v2 -> v3:
 - Ensure compliance with constraints such as cpuset.cpus.exclusive.
 - Link: https://lore.kernel.org/cgroups/20251113131434.606961-1-sunshaojie@kylinos.…

v1 -> v2:
 - Keeps the current cgroup v1 behavior unchanged
 - Link: https://lore.kernel.org/cgroups/c8e234f4-2c27-4753-8f39-8ae83197efd3@redhat…
---
 kernel/cgroup/cpuset-internal.h         |  3 ++
 kernel/cgroup/cpuset-v1.c               | 20 +++++++++
 kernel/cgroup/cpuset.c                  | 43 ++++++++++++++-----
 .../selftests/cgroup/test_cpuset_prs.sh |  5 ++-
 4 files changed, 58 insertions(+), 13 deletions(-)

--
2.25.1
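The relaxed rule from the first paragraph of this cover letter can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel implementation: `parse_cpulist`, `conflict_before`, and `conflict_after` are hypothetical names, and the model covers only the sibling-overlap predicate, not the full partition validity logic (so it does not reproduce Case 2).

```python
def parse_cpulist(text):
    """Parse a cpuset-style CPU list such as "0-1,3" into a set of ints."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def conflict_before(a_cpus, a_exclusive, b_cpus, b_exclusive):
    """Old rule (hypothetical model): overlap matters if at least one sibling is exclusive."""
    return (a_exclusive or b_exclusive) and bool(a_cpus & b_cpus)


def conflict_after(a_cpus, a_exclusive, b_cpus, b_exclusive):
    """Relaxed rule (hypothetical model): overlap matters only if both siblings are exclusive."""
    return (a_exclusive and b_exclusive) and bool(a_cpus & b_cpus)
```

With A1 = "0-1" (exclusive) and B1 = "0" (non-exclusive), as in step #3 of Case 1, the old predicate reports a conflict and the relaxed one does not, which matches the difference between Table 1.1 and Table 1.2.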


[PATCH v4 0/3] VMM can handle guest SEA via KVM_EXIT_ARM_SEA

by Jiaqi Yan

Problem
=======

When host APEI is unable to claim a synchronous external abort (SEA) during guest abort, today KVM directly injects an asynchronous SError into the VCPU then resumes it. The injected SError usually results in an unpleasant guest kernel panic.

One major source of guest SEAs is a VCPU consuming a recoverable uncorrected memory error (UER), which is not uncommon at all in modern datacenter servers with large amounts of physical memory. Although SError and guest panic are sufficient to stop the propagation of corrupted memory, there is room to recover from a UER in a more graceful manner.

Proposed Solution
=================

The idea is that we can replay the SEA to the faulting VCPU. If the memory error consumption or the fault that causes the SEA is not from the guest kernel, the blast radius can be limited to the poison-consuming guest process, while the VM can keep running.

In addition, instead of handling this under the hood without involving userspace, there are benefits to redirecting the SEA to the VMM:

- VM customers care about the disruptions caused by memory errors, and the VMM usually has the responsibility to start the process of notifying the customers of memory error events in their VMs. For example, some cloud providers emit a critical log in their observability UI [1] and provide a playbook for customers on how to mitigate disruptions to their workloads.

- VMM can protect against future memory error consumption by unmapping the poisoned pages from the stage-2 page table with KVM userfault [2], or by splitting the memslot that contains the poisoned pages.

- VMM can keep track of SEA events in the VM. When the VMM thinks the status of the host or the VM is bad enough, e.g. the number of distinct SEAs exceeds a threshold, it can restart the VM on another healthy host.

- Behavior parity with the x86 architecture. When a machine check exception (MCE) is caused by a VCPU, the kernel or KVM signals userspace with SIGBUS to let the VMM either recover from the MCE or terminate itself together with the VM. The prior RFC proposes to implement SIGBUS on arm64 as well, but Marc preferred a KVM exit over a signal [3]. However, implementation aside, returning SEA to the VMM is on par with returning MCE to the VMM.

Once SEA is redirected to the VMM, among other actions, the VMM is encouraged to inject external aborts into the faulting VCPU.

New UAPIs
=========

This patchset introduces the following userspace-visible changes to empower the VMM to control what happens for SEA on guest memory:

- KVM_CAP_ARM_SEA_TO_USER. While taking SEA, if userspace has enabled this new capability at VM creation, and the SEA is not owned by kernel allocated memory, instead of injecting SError, return KVM_EXIT_ARM_SEA to userspace.

- KVM_EXIT_ARM_SEA. This is the VM exit reason the VMM gets. The details about the SEA are provided in arm_sea as much as possible, including the sanitized ESR value at EL2 and the faulting guest virtual and physical addresses if available.

* From v3 [4]
  - Rebased on commit 3a8660878839 ("Linux 6.18-rc1").
  - In selftest, print a message if GVA or GPA expects to be valid.

* From v2 [5]:
  - Rebased on "[PATCH] KVM: arm64: nv: Handle SEAs due to VNCR redirection" [6] and kvmarm/next commit 7b8346bd9fce6 ("KVM: arm64: Don't attempt vLPI mappings when vPE allocation is disabled")
  - Took the host_owns_sea implementation from Oliver [7, 8].
  - Excluded the guest SEA injection patches.
  - Updated selftest.

* From v1 [9]:
  - Rebased on commit 4d62121ce9b5 ("KVM: arm64: vgic-debug: Avoid dereferencing NULL ITE pointer").
  - Sanitize ESR_EL2 before reporting it to userspace.
  - Do not do KVM_EXIT_ARM_SEA when SEA is caused by memory allocated to stage-2 translation table.

[1] https://cloud.google.com/solutions/sap/docs/manage-host-errors
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com
[3] https://lore.kernel.org/kvm/86pljbqqh0.wl-maz@kernel.org
[4] https://lore.kernel.org/kvmarm/20250731205844.1346839-1-jiaqiyan@google.com
[5] https://lore.kernel.org/kvm/20250604050902.3944054-1-jiaqiyan@google.com
[6] https://lore.kernel.org/kvmarm/20250729182342.3281742-1-oliver.upton@linux.…
[7] https://lore.kernel.org/kvm/aHFohmTb9qR_JG1E@linux.dev
[8] https://lore.kernel.org/kvm/aHK-DPufhLy5Dtuk@linux.dev
[9] https://lore.kernel.org/kvm/20250505161412.1926643-1-jiaqiyan@google.com

Jiaqi Yan (3):
  KVM: arm64: VM exit to userspace to handle SEA
  KVM: selftests: Test for KVM_EXIT_ARM_SEA
  Documentation: kvm: new UAPI for handling SEA

 Documentation/virt/kvm/api.rst                |  61 ++++
 arch/arm64/include/asm/kvm_host.h             |   2 +
 arch/arm64/kvm/arm.c                          |   5 +
 arch/arm64/kvm/mmu.c                          |  68 +++-
 include/uapi/linux/kvm.h                      |  10 +
 tools/arch/arm64/include/asm/esr.h            |   2 +
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../testing/selftests/kvm/arm64/sea_to_user.c | 331 ++++++++++++++++++
 tools/testing/selftests/kvm/lib/kvm_util.c    |   1 +
 9 files changed, 480 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/kvm/arm64/sea_to_user.c

--
2.51.0.760.g7b8bcc2412-goog
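The "keep track of SEA events" policy mentioned in this cover letter could look roughly like the following on the VMM side. This is a hedged sketch only: `SeaTracker` is a hypothetical helper, and treating "distinct SEAs" as distinct guest physical addresses is an assumption made for illustration, not something the series specifies.

```python
class SeaTracker:
    """Track distinct guest physical addresses reported with SEA exits.

    Hypothetical VMM-side helper; not part of the KVM UAPI.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.poisoned_gpas = set()

    def record(self, gpa):
        """Record one SEA; return True once distinct SEAs exceed the threshold."""
        self.poisoned_gpas.add(gpa)
        return len(self.poisoned_gpas) > self.threshold
```

A VMM could call `record()` each time it sees the new exit reason and start moving the VM to another host when it returns True; repeated faults on the same address do not count twice in this model.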


[PATCH v5 0/7] platform/chrome: Fix a possible UAF via revocable

by Tzung-Bi Shih

This is a follow-up series of [1]. It tries to fix a possible UAF in the fops of cros_ec_chardev after the underlying protocol device has gone, by using revocable.

The 1st patch introduces revocable, which is an implementation of ideas from the talk [2].

The 2nd and 3rd patches add test cases for revocable in Kunit and selftest.

The 4th patch converts existing protocol devices to resource providers of cros_ec_device.

The 5th - 7th are PoC patches showing the use case of "Replace file operations" below.

---

I came up with 2 possible usages of revocable.

1. Use primitive APIs

Use the primitive APIs of revocable directly. The file operations make sure the resources are available when using them.

This is what the series originally proposed [3][4]. Even though it has the finest grain for accessing the resources, it makes the user code verbose. Per feedback from the community, I'm looking for some subsystem level helpers so that user code can be simpler.

2. Replace file operations

Replace filp->f_op with revocable-aware wrappers. The wrappers make sure the resources are available in the file operations. The user code needs to provide a callback .try_access() to tell the wrappers where/how to *save* the pointers of resources.

Known drawbacks:
- The wrappers reserve the resources for all file operations even if they might be unused.
- The user code still needs to be revocable-aware.
- The whole file operation becomes an SRCU read-side critical section. Are there any functions that can't be called in the critical section? If there are, the file operations may not be aware of that.

See 5th - 7th patches for an example usage.

[1] https://lore.kernel.org/chrome-platform/20250721044456.2736300-6-tzungbi@ke…
[2] https://lpc.events/event/17/contributions/1627/
[3] https://lore.kernel.org/chrome-platform/20250912081718.3827390-5-tzungbi@ke…
[4] https://lore.kernel.org/chrome-platform/20250912081718.3827390-6-tzungbi@ke…

v5:
- Rebase onto next-20251015.
- Add more context about the PoC.
- Support multiple revocable providers in the PoC.

v4: https://lore.kernel.org/chrome-platform/20250923075302.591026-1-tzungbi@ker…
- Rebase onto next-20250922.
- Remove the 5th patch from v3.
- Add fops replacement PoC in 5th - 7th patches.

v3: https://lore.kernel.org/chrome-platform/20250912081718.3827390-1-tzungbi@ke…
- Rebase onto https://lore.kernel.org/chrome-platform/20250828083601.856083-1-tzungbi@ker… and next-20250912.
- The 4th patch changed accordingly.

v2: https://lore.kernel.org/chrome-platform/20250820081645.847919-1-tzungbi@ker…
- Rename "ref_proxy" -> "revocable".
- Add test cases in Kunit and selftest.

v1: https://lore.kernel.org/chrome-platform/20250814091020.1302888-1-tzungbi@ke…

Tzung-Bi Shih (7):
  revocable: Revocable resource management
  revocable: Add Kunit test cases
  selftests: revocable: Add kselftest cases
  platform/chrome: Protect cros_ec_device lifecycle with revocable
  revocable: Add fops replacement
  char: misc: Leverage revocable fops replacement
  platform/chrome: cros_ec_chardev: Secure cros_ec_device via revocable

 .../driver-api/driver-model/index.rst         |   1 +
 .../driver-api/driver-model/revocable.rst     |  87 +++++++
 MAINTAINERS                                   |   9 +
 drivers/base/Kconfig                          |   8 +
 drivers/base/Makefile                         |   5 +-
 drivers/base/revocable.c                      | 233 ++++++++++++++++++
 drivers/base/revocable_test.c                 | 110 +++++++++
 drivers/char/misc.c                           |   8 +
 drivers/platform/chrome/cros_ec.c             |   5 +
 drivers/platform/chrome/cros_ec_chardev.c     |  22 +-
 fs/Makefile                                   |   2 +-
 fs/fs_revocable.c                             | 154 ++++++++++++
 include/linux/fs.h                            |   2 +
 include/linux/fs_revocable.h                  |  21 ++
 include/linux/miscdevice.h                    |   4 +
 include/linux/platform_data/cros_ec_proto.h   |   4 +
 include/linux/revocable.h                     |  53 ++++
 tools/testing/selftests/Makefile              |   1 +
 .../selftests/drivers/base/revocable/Makefile |   7 +
 .../drivers/base/revocable/revocable_test.c   | 116 +++++++++
 .../drivers/base/revocable/test-revocable.sh  |  39 +++
 .../base/revocable/test_modules/Makefile      |  10 +
 .../revocable/test_modules/revocable_test.c   | 188 ++++++++++++++
 23 files changed, 1086 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/driver-api/driver-model/revocable.rst
 create mode 100644 drivers/base/revocable.c
 create mode 100644 drivers/base/revocable_test.c
 create mode 100644 fs/fs_revocable.c
 create mode 100644 include/linux/fs_revocable.h
 create mode 100644 include/linux/revocable.h
 create mode 100644 tools/testing/selftests/drivers/base/revocable/Makefile
 create mode 100644 tools/testing/selftests/drivers/base/revocable/revocable_test.c
 create mode 100755 tools/testing/selftests/drivers/base/revocable/test-revocable.sh
 create mode 100644 tools/testing/selftests/drivers/base/revocable/test_modules/Makefile
 create mode 100644 tools/testing/selftests/drivers/base/revocable/test_modules/revocable_test.c

--
2.51.0.788.g6d19910ace-goog
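The general idea behind a revocable resource, as described in this cover letter, can be modeled in a few lines of userspace code. This is an analogy under stated assumptions, not the kernel API from the series: the `Revocable` class and its methods are hypothetical, and a plain mutex stands in for SRCU, so unlike the real thing this model serializes readers.

```python
import threading
from contextlib import contextmanager


class Revocable:
    """Hold a resource that its provider can revoke at any time.

    Hypothetical model; a mutex replaces the SRCU read-side section.
    """

    def __init__(self, resource):
        self._resource = resource
        self._lock = threading.Lock()

    @contextmanager
    def try_access(self):
        """Yield the resource, or None if it has already been revoked."""
        with self._lock:
            yield self._resource

    def revoke(self):
        """Drop the resource after any in-flight accessor has finished."""
        with self._lock:
            self._resource = None
```

A consumer wraps each use in `with r.try_access() as res:` and must handle `res is None`, which mirrors the point in the cover letter that user code still needs to be revocable-aware.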


[PATCH 0/7] KVM: x86: Improve the handling of debug exceptions during instruction emulation

by Hou Wenlong

During my testing, I found that guest debugging with 'DR6.BD' does not work in instruction emulation, as the current code only considers the guest's DR7. Upon reviewing the code, I also observed that the checks for the userspace guest debugging feature and the guest's own debugging feature are repeated in different places during instruction emulation, but the overall logic is the same. If guest debugging is enabled, it needs to exit to userspace; otherwise, a #DB exception needs to be injected into the guest. Therefore, as suggested by Jiangshan Lai, some cleanup has been done for #DB handling in instruction emulation in this patchset. A new function named 'kvm_inject_emulated_db()' is introduced to consolidate all the checking logic. Moreover, I hope we can make the #DB interception path use the same function as well.

Additionally, when I looked into the single-step #DB handling in instruction emulation, I noticed that the interrupt shadow is toggled, but it is not considered in the single-step #DB injection. This oversight causes VM entry to fail on VMX (due to pending debug exceptions checking) or breaks the 'MOV SS' suppressed #DB. For the latter, I have kept the behavior for now in my patchset, as I need some suggestions.

Hou Wenlong (7):
  KVM: x86: Set guest DR6 by kvm_queue_exception_p() in instruction emulation
  KVM: x86: Check guest debug in DR access instruction emulation
  KVM: x86: Only check effective code breakpoint in emulation
  KVM: x86: Consolidate KVM_GUESTDBG_SINGLESTEP check into the kvm_inject_emulated_db()
  KVM: VMX: Set 'BS' bit in pending debug exceptions during instruction emulation
  KVM: selftests: Verify guest debug DR7.GD checking during instruction emulation
  KVM: selftests: Verify 'BS' bit checking in pending debug exception during VM entry

 arch/x86/include/asm/kvm-x86-ops.h            |   1 +
 arch/x86/include/asm/kvm_host.h               |   1 +
 arch/x86/kvm/emulate.c                        |  14 +--
 arch/x86/kvm/kvm_emulate.h                    |   7 +-
 arch/x86/kvm/vmx/main.c                       |   9 ++
 arch/x86/kvm/vmx/vmx.c                        |  14 ++-
 arch/x86/kvm/vmx/x86_ops.h                    |   1 +
 arch/x86/kvm/x86.c                            | 109 +++++++++++-------
 arch/x86/kvm/x86.h                            |   7 ++
 .../selftests/kvm/include/x86/processor.h     |   3 +-
 tools/testing/selftests/kvm/x86/debug_regs.c  |  64 +++++++++-
 11 files changed, 167 insertions(+), 63 deletions(-)

base-commit: ecbcc2461839e848970468b44db32282e5059925
--
2.31.1


[PATCH net-next v7 0/9] Add support for providers with large rx buffer

by Pavel Begunkov

Note: this series contains only the net/ bits and doesn't include the changes that should be merged separately and are posted separately. The full branch for convenience is at [1], and the patch is here:

https://lore.kernel.org/io-uring/7486ab32e99be1f614b3ef8d0e9bc77015b173f7.1…

Many modern NICs support configurable receive buffer lengths, and zcrx and memory providers can use buffers larger than 4K/PAGE_SIZE on x86 to improve performance. When paired with hw-gro, larger rx buffer sizes can drastically reduce the number of buffers traversing the stack and save a lot of processing time. It also allows giving users larger contiguous chunks of data. The idea was first floated around by Saeed during netdev conf 2024 and was asked about by a few folks.

Single stream benchmarks showed up to ~30% CPU util improvement. E.g. comparison for 4K vs 32K buffers using a 200Gbit NIC:

packets=23987040 (MB=2745098), rps=199559 (MB/s=22837)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    1.53    0.00   27.78    2.72    1.31   66.45    0.22
packets=24078368 (MB=2755550), rps=200319 (MB/s=22924)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    0.69    0.00    8.26   31.65    1.83   57.00    0.57

This series adds net infrastructure for memory providers configuring the size and implements it for bnxt. It's an opt-in feature for drivers: they should advertise support for the parameter in the qops and must check if the hardware supports the given size. It's limited to memory providers as it drastically simplifies implementation. It doesn't affect the fast path zcrx uAPI, and the size is defined in zcrx terms, which allows it to be flexible and adjusted in the future, see Patch 8 for details.

A liburing example can be found at [2]

full branch:
[1] https://github.com/isilence/linux.git zcrx/large-buffers-v7
Liburing example:
[2] https://github.com/isilence/liburing.git zcrx/rx-buf-len

v7: - Add xa_destroy
    - Rebase

v6: - Update docs and add a selftest

v5: https://lore.kernel.org/netdev/cover.1760440268.git.asml.silence@gmail.com/
    - Remove all unnecessary bits like configuration via netlink, and multi-stage queue configuration.

v4: https://lore.kernel.org/all/cover.1760364551.git.asml.silence@gmail.com/
    - Update fbnic qops
    - Propagate max buf len for hns3
    - Use configured buf size in __bnxt_alloc_rx_netmem
    - Minor stylistic changes

v3: https://lore.kernel.org/all/cover.1755499375.git.asml.silence@gmail.com/
    - Rebased, excluded zcrx specific patches
    - Set agg_size_fac to 1 on warning

v2: https://lore.kernel.org/all/cover.1754657711.git.asml.silence@gmail.com/
    - Add MAX_PAGE_ORDER check on pp init
    - Applied comments rewording
    - Adjust pp.max_len based on order
    - Patch up mlx5 queue callbacks after rebase
    - Minor ->queue_mgmt_ops refactoring
    - Rebased to account for both fill level and agg_size_fac
    - Pass providers buf length in struct pp_memory_provider_params and apply it in __netdev_queue_confi().
    - Use ->supported_ring_params to validate drivers support of set qcfg parameters.

Jakub Kicinski (1):
  eth: bnxt: adjust the fill level of agg queues with larger buffers

Pavel Begunkov (8):
  net: page pool: xa init with destroy on pp init
  net: page_pool: sanitise allocation order
  net: memzero mp params when closing a queue
  net: let pp memory provider to specify rx buf len
  eth: bnxt: store rx buffer size per queue
  eth: bnxt: allow providers to set rx buf size
  io_uring/zcrx: document area chunking parameter
  selftests: iou-zcrx: test large chunk sizes

 Documentation/networking/iou-zcrx.rst         |  20 +++
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 118 ++++++++++++++----
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |   2 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |   6 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h |   2 +-
 include/net/netdev_queues.h                   |   9 ++
 include/net/page_pool/types.h                 |   1 +
 net/core/netdev_rx_queue.c                    |  14 ++-
 net/core/page_pool.c                          |   4 +
 .../selftests/drivers/net/hw/iou-zcrx.c       |  72 +++++++++--
 .../selftests/drivers/net/hw/iou-zcrx.py      |  37 ++++++
 11 files changed, 236 insertions(+), 49 deletions(-)

--
2.52.0
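The claim in this cover letter that larger rx buffers reduce the number of buffers traversing the stack comes down to simple arithmetic, sketched here. The helper name `buffers_needed` and the 64 KiB aggregate size are assumptions chosen for illustration; they do not come from the series.

```python
def buffers_needed(payload_bytes, buf_len):
    """Number of rx buffers needed to hold payload_bytes (ceiling division)."""
    return -(-payload_bytes // buf_len)
```

Under this model, a 64 KiB aggregate needs 16 buffers at 4K but only 2 at 32K, so per-buffer processing costs are paid far less often.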


[PATCH net-next 0/4] (no cover subject)

by Breno Leitao

This patch series introduces a new configfs attribute that enables sending messages directly through netconsole without going through the kernel's logging infrastructure.

This feature allows users to send custom messages, alerts, or status updates directly to netconsole receivers by writing to /sys/kernel/config/netconsole/<target>/send_msg, without polluting kernel buffers and without sending the messages to the serial console, which could be slow.

At Meta this is currently used in two cases (through printk for now):

 a) When a new workload enters or leaves the machine.
 b) From time to time, as a "ping" to make sure the netconsole/machine is alive.

The implementation reuses the existing message transmission functions (send_msg_udp() and send_ext_msg_udp()) to handle both basic and extended message formats.

Regarding code organization, this version uses forward declarations for the send_msg_udp() and send_ext_msg_udp() functions rather than relocating them within the file. While forward declarations do add a small amount of redundancy, they avoid the larger churn that would result from moving entire function definitions.

---
Breno Leitao (4):
  netconsole: extract message fragmentation into send_msg_udp()
  netconsole: Add configfs attribute for direct message sending
  selftests/netconsole: Switch to configfs send_msg interface
  Documentation: netconsole: Document send_msg configfs attribute

 Documentation/networking/netconsole.rst            | 40 +++++++++++++++
 drivers/net/netconsole.c                           | 59 ++++++++++++++++++----
 .../selftests/drivers/net/netcons_sysdata.sh       |  2 +-
 3 files changed, 91 insertions(+), 10 deletions(-)
---
base-commit: ab084f0b8d6d2ee4b1c6a28f39a2a7430bdfa7f0
change-id: 20251127-netconsole_send_msg-89813956dc23

Best regards,
--
Breno Leitao <leitao(a)debian.org>
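Assuming the proposed send_msg attribute behaves like an ordinary write-only configfs file, a userspace "ping" could be sent roughly as follows. This is a sketch, not a tested recipe: the helper `send_netconsole_msg` and its `root` parameter are hypothetical, and only the path layout /sys/kernel/config/netconsole/<target>/send_msg is taken from the cover letter.

```python
import os

NETCONSOLE_ROOT = "/sys/kernel/config/netconsole"


def send_netconsole_msg(target, message, root=NETCONSOLE_ROOT):
    """Write message to <root>/<target>/send_msg and return the path used.

    Hypothetical helper; `root` is overridable so the function can be
    exercised against a scratch directory.
    """
    path = os.path.join(root, target, "send_msg")
    with open(path, "w") as attr:
        attr.write(message)
    return path
```

On a real system the target directory must already exist under configfs and the caller typically needs root privileges.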


[PATCH net-next v7 0/5] net: devmem: improve cpu cost of RX token management

by Bobby Eshleman

This series improves the CPU cost of RX token management by adding an attribute to NETDEV_CMD_BIND_RX that configures sockets using the binding to avoid the xarray allocator and instead use a per-binding niov array and a uref field in niov.

Improvement is ~13% cpu util per RX user thread.

Using kperf, the following results were observed:

Before:
	Average RX worker idle %: 13.13, flows 4, test runs 11
After:
	Average RX worker idle %: 26.32, flows 4, test runs 11

Two other approaches were tested, but with no improvement. Namely, 1) using a hashmap for tokens and 2) keeping an xarray of atomic counters but using RCU so that the hotpath could be mostly lockless. Neither of these approaches proved better than the simple array in terms of CPU.

The attribute NETDEV_A_DMABUF_AUTORELEASE is added to toggle the optimization. It is an optional attribute and defaults to 0 (i.e., optimization on).

To: David S. Miller <davem(a)davemloft.net>
To: Eric Dumazet <edumazet(a)google.com>
To: Jakub Kicinski <kuba(a)kernel.org>
To: Paolo Abeni <pabeni(a)redhat.com>
To: Simon Horman <horms(a)kernel.org>
To: Kuniyuki Iwashima <kuniyu(a)google.com>
To: Willem de Bruijn <willemb(a)google.com>
To: Neal Cardwell <ncardwell(a)google.com>
To: David Ahern <dsahern(a)kernel.org>
To: Mina Almasry <almasrymina(a)google.com>
To: Arnd Bergmann <arnd(a)arndb.de>
To: Jonathan Corbet <corbet(a)lwn.net>
To: Andrew Lunn <andrew+netdev(a)lunn.ch>
To: Shuah Khan <shuah(a)kernel.org>
Cc: Stanislav Fomichev <sdf(a)fomichev.me>
Cc: netdev(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-arch(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Signed-off-by: Bobby Eshleman <bobbyeshleman(a)meta.com>

Changes in v7:
- use netlink instead of sockopt (Stan)
- restrict system to only one mode, dmabuf bindings can not co-exist with different modes (Stan)
- use static branching to enforce single system-wide mode (Stan)
- Link to v6: https://lore.kernel.org/r/20251104-scratch-bobbyeshleman-devmem-tcp-token-u…

Changes in v6:
- renamed 'net: devmem: use niov array for token management' to refer to optionality of new config
- added documentation and tests
- make autorelease flag per-socket sockopt instead of binding field / sysctl
- many per-patch changes (see Changes sections per-patch)
- Link to v5: https://lore.kernel.org/r/20251023-scratch-bobbyeshleman-devmem-tcp-token-u…

Changes in v5:
- add sysctl to opt-out of performance benefit, back to old token release
- Link to v4: https://lore.kernel.org/all/20250926-scratch-bobbyeshleman-devmem-tcp-token…

Changes in v4:
- rebase to net-next
- Link to v3: https://lore.kernel.org/r/20250926-scratch-bobbyeshleman-devmem-tcp-token-u…

Changes in v3:
- make urefs per-binding instead of per-socket, reducing memory footprint
- fallback to cleaning up references in dmabuf unbind if socket leaked tokens
- drop ethtool patch
- Link to v2: https://lore.kernel.org/r/20250911-scratch-bobbyeshleman-devmem-tcp-token-u…

Changes in v2:
- net: ethtool: prevent user from breaking devmem single-binding rule (Mina)
- pre-assign niovs in binding->vec for RX case (Mina)
- remove WARNs on invalid user input (Mina)
- remove extraneous binding ref get (Mina)
- remove WARN for changed binding (Mina)
- always use GFP_ZERO for binding->vec (Mina)
- fix length of alloc for urefs
- use atomic_set(, 0) to initialize sk_user_frags.urefs
- Link to v1: https://lore.kernel.org/r/20250902-scratch-bobbyeshleman-devmem-tcp-token-u…

---
Bobby Eshleman (5):
  net: devmem: rename tx_vec to vec in dmabuf binding
  net: devmem: refactor sock_devmem_dontneed for autorelease split
  net: devmem: implement autorelease token management
  net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
  selftests: drv-net: devmem: add autorelease tests

 Documentation/netlink/specs/netdev.yaml           |  12 +++
 Documentation/networking/devmem.rst               |  70 +++++++++++++
 include/net/netmem.h                              |   1 +
 include/net/sock.h                                |   7 +-
 include/uapi/linux/netdev.h                       |   1 +
 net/core/devmem.c                                 | 121 ++++++++++++++++++----
 net/core/devmem.h                                 |  13 ++-
 net/core/netdev-genl-gen.c                        |   5 +-
 net/core/netdev-genl.c                            |  13 ++-
 net/core/sock.c                                   | 103 ++++++++++++++----
 net/ipv4/tcp.c                                    |  78 +++++++++++---
 net/ipv4/tcp_ipv4.c                               |  13 ++-
 net/ipv4/tcp_minisocks.c                          |   3 +-
 tools/include/uapi/linux/netdev.h                 |   1 +
 tools/testing/selftests/drivers/net/hw/devmem.py  |  22 +++-
 tools/testing/selftests/drivers/net/hw/ncdevmem.c |  19 ++--
 16 files changed, 401 insertions(+), 81 deletions(-)
---
base-commit: 4c52142904b33b41c3ff7ee58670b4e3b3bf1120
change-id: 20250829-scratch-bobbyeshleman-devmem-tcp-token-upstream-292be174d503

Best regards,
--
Bobby Eshleman <bobbyeshleman(a)meta.com>
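One way to read the "~13%" figure in this cover letter is as the change in average RX worker idle time between the two kperf runs, measured in percentage points. That reading is an assumption; the helper below (a hypothetical name) just makes the arithmetic explicit.

```python
def idle_gain_points(idle_before, idle_after):
    """Difference in idle percentage, in percentage points, rounded to 2 places."""
    return round(idle_after - idle_before, 2)
```

With the reported values of 13.13 and 26.32, the difference is 13.19 percentage points of idle time, which is consistent with the rounded figure quoted above.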


[PATCH kvm-next V11 0/7] Add NUMA mempolicy support for KVM guest-memfd

by Shivank Garg

This series introduces NUMA-aware memory placement support for KVM guests with guest_memfd memory backends. It builds upon Fuad Tabba's work (V17) that enabled host-mapping for guest_memfd memory [1] and can be applied directly on the KVM tree [2] (branch kvm-next, base commit: a6ad5413, Merge branch 'guest-memfd-mmap' into HEAD)

== Background ==

KVM's guest-memfd memory backend currently lacks support for NUMA policy enforcement, causing guest memory allocations to be distributed across host nodes according to the kernel's default behavior, irrespective of any policy specified by the VMM. This limitation arises because conventional userspace NUMA control mechanisms like mbind(2) don't work since the memory isn't directly mapped to userspace when allocations occur. Fuad's work [1] provides the necessary mmap capability, and this series leverages it to enable mbind(2).

== Implementation ==

This series implements proper NUMA policy support for guest-memfd by:

1. Adding mempolicy-aware allocation APIs to the filemap layer.
2. Introducing custom inodes (via a dedicated slab-allocated inode cache, kvm_gmem_inode_info) to store NUMA policy and metadata for guest memory.
3. Implementing get/set_policy vm_ops in guest_memfd to support NUMA policy.

With these changes, VMMs can now control guest memory placement by mapping the guest_memfd file descriptor and using mbind(2) to specify:
- Policy modes: default, bind, interleave, or preferred
- Host NUMA nodes: List of target nodes for memory allocation

These policies affect only future allocations and do not migrate existing memory. This matches mbind(2)'s default behavior, which affects only new allocations unless overridden with MPOL_MF_MOVE/MPOL_MF_MOVE_ALL flags (not supported for guest_memfd as it is unmovable by design).

== Upstream Plan ==

Phased approach as per David's guest_memfd extension overview [3] and community calls [4]:

Phase 1 (this series):
1. Focuses on shared guest_memfd support (non-CoCo VMs).
2. Builds on Fuad's host-mapping work [1].

Phase 2 (future work):
1. NUMA support for private guest_memfd (CoCo VMs).
2. Depends on SNP in-place conversion support [5].

This series provides a clean integration path for NUMA-aware memory management for guest_memfd and lays the groundwork for future confidential computing NUMA capabilities.

Thanks,
Shivank

== Changelog ==

- v1,v2: Extended the KVM_CREATE_GUEST_MEMFD IOCTL to pass mempolicy.
- v3: Introduced fbind() syscall for VMM memory-placement configuration.
- v4-v6: Current approach using shared_policy support and vm_ops (based on suggestions from David [6] and guest_memfd bi-weekly upstream call discussion [7]).
- v7: Use inodes to store NUMA policy instead of file [8].
- v8: Rebase on top of Fuad's V12: Host mmapping for guest_memfd memory.
- v9: Rebase on top of Fuad's V13 and incorporate review comments
- V10: Rebase on top of Fuad's V17. Use latest guest_memfd inode patch from Ackerley (with David's review comments). Use newer kmem_cache_create() API variant with arg parameter (Vlastimil)
- V11: Rebase on kvm-next, remove RFC tag, use Ackerley's latest patch and fix an RCU race bug during kvm module unload.

[1] https://lore.kernel.org/all/20250729225455.670324-1-seanjc@google.com
[2] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=next
[3] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com
[4] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAo…
[5] https://lore.kernel.org/all/20250613005400.3694904-1-michael.roth@amd.com
[6] https://lore.kernel.org/all/6fbef654-36e2-4be5-906e-2a648a845278@redhat.com
[7] https://lore.kernel.org/all/2b77e055-98ac-43a1-a7ad-9f9065d7f38f@amd.com
[8] https://lore.kernel.org/all/diqzbjumm167.fsf@ackerleytng-ctop.c.googlers.com

Ackerley Tng (1):
  KVM: guest_memfd: Use guest mem inodes instead of anonymous inodes

Matthew Wilcox (Oracle) (2):
  mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()
  mm/filemap: Extend __filemap_get_folio() to support NUMA memory policies

Shivank Garg (4):
  mm/mempolicy: Export memory policy symbols
  KVM: guest_memfd: Add slab-allocated inode cache
  KVM: guest_memfd: Enforce NUMA mempolicy using shared policy
  KVM: guest_memfd: selftests: Add tests for mmap and NUMA policy support

 fs/bcachefs/fs-io-buffered.c                  |   2 +-
 fs/btrfs/compression.c                        |   4 +-
 fs/btrfs/verity.c                             |   2 +-
 fs/erofs/zdata.c                              |   2 +-
 fs/f2fs/compress.c                            |   2 +-
 include/linux/pagemap.h                       |  18 +-
 include/uapi/linux/magic.h                    |   1 +
 mm/filemap.c                                  |  23 +-
 mm/mempolicy.c                                |   6 +
 mm/readahead.c                                |   2 +-
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../testing/selftests/kvm/guest_memfd_test.c  | 121 ++++++++
 virt/kvm/guest_memfd.c                        | 262 ++++++++++++++++--
 virt/kvm/kvm_main.c                           |   7 +-
 virt/kvm/kvm_mm.h                             |   9 +-
 15 files changed, 412 insertions(+), 50 deletions(-)

--
2.43.0

---
== Earlier Postings ==

v10: https://lore.kernel.org/all/20250811090605.16057-2-shivankg@amd.com
v9: https://lore.kernel.org/all/20250713174339.13981-2-shivankg@amd.com
v8: https://lore.kernel.org/all/20250618112935.7629-1-shivankg@amd.com
v7: https://lore.kernel.org/all/20250408112402.181574-1-shivankg@amd.com
v6: https://lore.kernel.org/all/20250226082549.6034-1-shivankg@amd.com
v5: https://lore.kernel.org/all/20250219101559.414878-1-shivankg@amd.com
v4: https://lore.kernel.org/all/20250210063227.41125-1-shivankg@amd.com
v3: https://lore.kernel.org/all/20241105164549.154700-1-shivankg@amd.com
v2: https://lore.kernel.org/all/20240919094438.10987-1-shivankg@amd.com
v1: https://lore.kernel.org/all/20240916165743.201087-1-shivankg@amd.com


[PATCH v12] exec: Fix dead-lock in de_thread with ptrace_attach

by Bernd Edlinger

This introduces signal->exec_bprm, which is used to fix the case when at least one of the sibling threads is traced, and therefore the trace process may dead-lock in ptrace_attach, but de_thread will need to wait for the tracer to continue execution. The solution is to detect this situation and allow ptrace_attach to continue by temporarily releasing the cred_guard_mutex, while de_thread() is still waiting for traced zombies to be eventually released by the tracer. In the case of the thread group leader we only have to wait for the thread to become a zombie, which may also need co-operation from the tracer due to PTRACE_O_TRACEEXIT. When a tracer wants to ptrace_attach a task that already is in execve, we simply retry the ptrace_may_access check while temporarily installing the new credentials and dumpability which are about to be used after execve completes. If the ptrace_attach happens on a thread that is a sibling-thread of the thread doing execve, it is sufficient to check against the old credentials, as this thread will be waited for, before the new credentials are installed. Other threads die quickly since the cred_guard_mutex is released, but a deadly signal is already pending. In case the mutex_lock_killable misses the signal, the non-zero current->signal->exec_bprm makes sure they release the mutex immediately and return with -ERESTARTNOINTR. This means there is no API change, unlike the previous version of this patch which was discussed here: https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.d… See tools/testing/selftests/ptrace/vmaccess.c for a test case that gets fixed by this change. Note that since the test case was originally designed to test the ptrace_attach returning an error in this situation, the test expectation needed to be adjusted, to allow the API to succeed at the first attempt. 
Signed-off-by: Bernd Edlinger <bernd.edlinger(a)hotmail.de>
---
 fs/exec.c                                 | 69 ++++++++++++++++-------
 fs/proc/base.c                            |  6 ++
 include/linux/cred.h                      |  1 +
 include/linux/sched/signal.h              | 18 ++++++
 kernel/cred.c                             | 28 +++++++--
 kernel/ptrace.c                           | 32 +++++++++++
 kernel/seccomp.c                          | 12 +++-
 tools/testing/selftests/ptrace/vmaccess.c | 23 +++++---
 8 files changed, 155 insertions(+), 34 deletions(-)

v10: Changes to previous version: make PTRACE_ATTACH return -EAGAIN, instead of execve returning -ERESTARTSYS. Added some lessons learned to the description.

v11: Check old and new credentials in PTRACE_ATTACH again without changing the API.

Note: I actually got one response from an automatic checker to the v11 patch,
https://lore.kernel.org/lkml/202107121344.wu68hEPF-lkp@intel.com/
which is complaining about:

>> kernel/ptrace.c:425:26: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct cred const *old_cred @@ got struct cred const [noderef] __rcu *real_cred @@

   417          struct linux_binprm *bprm = task->signal->exec_bprm;
   418          const struct cred *old_cred;
   419          struct mm_struct *old_mm;
   420
   421          retval = down_write_killable(&task->signal->exec_update_lock);
   422          if (retval)
   423                  goto unlock_creds;
   424          task_lock(task);
 > 425          old_cred = task->real_cred;

v12: Essentially identical to v11.
- Fixed a minor merge conflict in linux v5.17, and fixed the above mentioned nit by adding __rcu to the declaration.
- Re-tested the patch with all linux versions from v5.11 to v6.6

v10 was an alternative approach which did imply an API change. But I would prefer to avoid such an API change.

The difficult part is getting the right dumpability flags assigned before de_thread starts, hope you like this version. If not, the v10 is of course also acceptable.

Thanks
Bernd.
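The dumpability comparison that this patch factors out into is_dumpability_changed() can be modelled in plain C. What follows is a minimal userspace sketch, not the kernel code: `struct cred_model`, `cap_issubset_model()` and `dumpability_changed_model()` are hypothetical names, and capabilities are assumed to be a plain bitmask (the kernel's cred_cap_issubset() also takes user namespaces into account). Only the shape of the comparison is mirrored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's struct cred. */
struct cred_model {
	uint32_t euid, egid;
	uint32_t fsuid, fsgid;
	uint64_t cap_permitted;	/* assumption: capabilities as a plain bitmask */
};

/* Simplified subset test: true if every capability in subset is also in set. */
static bool cap_issubset_model(const struct cred_model *set,
			       const struct cred_model *subset)
{
	return (subset->cap_permitted & ~set->cap_permitted) == 0;
}

/*
 * Same shape as the check in the patch: any change of euid/egid/fsuid/fsgid,
 * or new capabilities that the old credentials did not have, counts as a
 * dumpability change.
 */
static bool dumpability_changed_model(const struct cred_model *old,
				      const struct cred_model *new_cred)
{
	assert(old != NULL && new_cred != NULL);

	return old->euid != new_cred->euid ||
	       old->egid != new_cred->egid ||
	       old->fsuid != new_cred->fsuid ||
	       old->fsgid != new_cred->fsgid ||
	       !cap_issubset_model(old, new_cred);
}
```

In this model, executing a setuid-root binary as an ordinary user reports a change, while dropping capabilities without changing any IDs does not.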
diff --git a/fs/exec.c b/fs/exec.c index 2f2b0acec4f0..902d3b230485 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1041,11 +1041,13 @@ static int exec_mmap(struct mm_struct *mm) return 0; } -static int de_thread(struct task_struct *tsk) +static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm) { struct signal_struct *sig = tsk->signal; struct sighand_struct *oldsighand = tsk->sighand; spinlock_t *lock = &oldsighand->siglock; + struct task_struct *t = tsk; + bool unsafe_execve_in_progress = false; if (thread_group_empty(tsk)) goto no_thread_group; @@ -1068,6 +1070,19 @@ static int de_thread(struct task_struct *tsk) if (!thread_group_leader(tsk)) sig->notify_count--; + while_each_thread(tsk, t) { + if (unlikely(t->ptrace) + && (t != tsk->group_leader || !t->exit_state)) + unsafe_execve_in_progress = true; + } + + if (unlikely(unsafe_execve_in_progress)) { + spin_unlock_irq(lock); + sig->exec_bprm = bprm; + mutex_unlock(&sig->cred_guard_mutex); + spin_lock_irq(lock); + } + while (sig->notify_count) { __set_current_state(TASK_KILLABLE); spin_unlock_irq(lock); @@ -1158,6 +1173,11 @@ static int de_thread(struct task_struct *tsk) release_task(leader); } + if (unlikely(unsafe_execve_in_progress)) { + mutex_lock(&sig->cred_guard_mutex); + sig->exec_bprm = NULL; + } + sig->group_exec_task = NULL; sig->notify_count = 0; @@ -1169,6 +1189,11 @@ static int de_thread(struct task_struct *tsk) return 0; killed: + if (unlikely(unsafe_execve_in_progress)) { + mutex_lock(&sig->cred_guard_mutex); + sig->exec_bprm = NULL; + } + /* protects against exit_notify() and __exit_signal() */ read_lock(&tasklist_lock); sig->group_exec_task = NULL; @@ -1253,6 +1278,24 @@ int begin_new_exec(struct linux_binprm * bprm) if (retval) return retval; + /* If the binary is not readable then enforce mm->dumpable=0 */ + would_dump(bprm, bprm->file); + if (bprm->have_execfd) + would_dump(bprm, bprm->executable); + + /* + * Figure out dumpability. 
Note that this checking only of current + * is wrong, but userspace depends on it. This should be testing + * bprm->secureexec instead. + */ + if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP || + is_dumpability_changed(current_cred(), bprm->cred) || + !(uid_eq(current_euid(), current_uid()) && + gid_eq(current_egid(), current_gid()))) + set_dumpable(bprm->mm, suid_dumpable); + else + set_dumpable(bprm->mm, SUID_DUMP_USER); + /* * Ensure all future errors are fatal. */ @@ -1261,7 +1304,7 @@ int begin_new_exec(struct linux_binprm * bprm) /* * Make this the only thread in the thread group. */ - retval = de_thread(me); + retval = de_thread(me, bprm); if (retval) goto out; @@ -1284,11 +1327,6 @@ int begin_new_exec(struct linux_binprm * bprm) if (retval) goto out; - /* If the binary is not readable then enforce mm->dumpable=0 */ - would_dump(bprm, bprm->file); - if (bprm->have_execfd) - would_dump(bprm, bprm->executable); - /* * Release all of the old mmap stuff */ @@ -1350,18 +1388,6 @@ int begin_new_exec(struct linux_binprm * bprm) me->sas_ss_sp = me->sas_ss_size = 0; - /* - * Figure out dumpability. Note that this checking only of current - * is wrong, but userspace depends on it. This should be testing - * bprm->secureexec instead. 
- */ - if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP || - !(uid_eq(current_euid(), current_uid()) && - gid_eq(current_egid(), current_gid()))) - set_dumpable(current->mm, suid_dumpable); - else - set_dumpable(current->mm, SUID_DUMP_USER); - perf_event_exec(); __set_task_comm(me, kbasename(bprm->filename), true); @@ -1480,6 +1506,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm) if (mutex_lock_interruptible(&current->signal->cred_guard_mutex)) return -ERESTARTNOINTR; + if (unlikely(current->signal->exec_bprm)) { + mutex_unlock(&current->signal->cred_guard_mutex); + return -ERESTARTNOINTR; + } + bprm->cred = prepare_exec_creds(); if (likely(bprm->cred)) return 0; diff --git a/fs/proc/base.c b/fs/proc/base.c index ffd54617c354..0da9adfadb48 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2788,6 +2788,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf, if (rv < 0) goto out_free; + if (unlikely(current->signal->exec_bprm)) { + mutex_unlock(&current->signal->cred_guard_mutex); + rv = -ERESTARTNOINTR; + goto out_free; + } + rv = security_setprocattr(PROC_I(inode)->op.lsm, file->f_path.dentry->d_name.name, page, count); diff --git a/include/linux/cred.h b/include/linux/cred.h index f923528d5cc4..b01e309f5686 100644 --- a/include/linux/cred.h +++ b/include/linux/cred.h @@ -159,6 +159,7 @@ extern const struct cred *get_task_cred(struct task_struct *); extern struct cred *cred_alloc_blank(void); extern struct cred *prepare_creds(void); extern struct cred *prepare_exec_creds(void); +extern bool is_dumpability_changed(const struct cred *, const struct cred *); extern int commit_creds(struct cred *); extern void abort_creds(struct cred *); extern const struct cred *override_creds(const struct cred *); diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 0014d3adaf84..14df7073a0a8 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -234,9 +234,27 @@ struct 
signal_struct { struct mm_struct *oom_mm; /* recorded mm when the thread group got * killed by the oom killer */ + struct linux_binprm *exec_bprm; /* Used to check ptrace_may_access + * against new credentials while + * de_thread is waiting for other + * traced threads to terminate. + * Set while de_thread is executing. + * The cred_guard_mutex is released + * after de_thread() has called + * zap_other_threads(), therefore + * a fatal signal is guaranteed to be + * already pending in the unlikely + * event, that + * current->signal->exec_bprm happens + * to be non-zero after the + * cred_guard_mutex was acquired. + */ + struct mutex cred_guard_mutex; /* guard against foreign influences on * credential calculations * (notably. ptrace) + * Held while execve runs, except when + * a sibling thread is being traced. * Deprecated do not use in new code. * Use exec_update_lock instead. */ diff --git a/kernel/cred.c b/kernel/cred.c index 98cb4eca23fb..586cb6c7cf6b 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -433,6 +433,28 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset) return false; } +/** + * is_dumpability_changed - Will changing creds from old to new + * affect the dumpability in commit_creds? + * + * Return: false - dumpability will not be changed in commit_creds. + * Return: true - dumpability will be changed to non-dumpable. 
+ * + * @old: The old credentials + * @new: The new credentials + */ +bool is_dumpability_changed(const struct cred *old, const struct cred *new) +{ + if (!uid_eq(old->euid, new->euid) || + !gid_eq(old->egid, new->egid) || + !uid_eq(old->fsuid, new->fsuid) || + !gid_eq(old->fsgid, new->fsgid) || + !cred_cap_issubset(old, new)) + return true; + + return false; +} + /** * commit_creds - Install new credentials upon the current task * @new: The credentials to be assigned @@ -467,11 +489,7 @@ int commit_creds(struct cred *new) get_cred(new); /* we will require a ref for the subj creds too */ /* dumpability changes */ - if (!uid_eq(old->euid, new->euid) || - !gid_eq(old->egid, new->egid) || - !uid_eq(old->fsuid, new->fsuid) || - !gid_eq(old->fsgid, new->fsgid) || - !cred_cap_issubset(old, new)) { + if (is_dumpability_changed(old, new)) { if (task->mm) set_dumpable(task->mm, suid_dumpable); task->pdeath_signal = 0; diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 443057bee87c..eb1c450bb7d7 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -20,6 +20,7 @@ #include <linux/pagemap.h> #include <linux/ptrace.h> #include <linux/security.h> +#include <linux/binfmts.h> #include <linux/signal.h> #include <linux/uio.h> #include <linux/audit.h> @@ -435,6 +436,28 @@ static int ptrace_attach(struct task_struct *task, long request, if (retval) goto unlock_creds; + if (unlikely(task->in_execve)) { + struct linux_binprm *bprm = task->signal->exec_bprm; + const struct cred __rcu *old_cred; + struct mm_struct *old_mm; + + retval = down_write_killable(&task->signal->exec_update_lock); + if (retval) + goto unlock_creds; + task_lock(task); + old_cred = task->real_cred; + old_mm = task->mm; + rcu_assign_pointer(task->real_cred, bprm->cred); + task->mm = bprm->mm; + retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS); + rcu_assign_pointer(task->real_cred, old_cred); + task->mm = old_mm; + task_unlock(task); + up_write(&task->signal->exec_update_lock); + if (retval) + goto 
unlock_creds; + } + write_lock_irq(&tasklist_lock); retval = -EPERM; if (unlikely(task->exit_state)) @@ -508,6 +531,14 @@ static int ptrace_traceme(void) { int ret = -EPERM; + if (mutex_lock_interruptible(&current->signal->cred_guard_mutex)) + return -ERESTARTNOINTR; + + if (unlikely(current->signal->exec_bprm)) { + mutex_unlock(&current->signal->cred_guard_mutex); + return -ERESTARTNOINTR; + } + write_lock_irq(&tasklist_lock); /* Are we already being traced? */ if (!current->ptrace) { @@ -523,6 +554,7 @@ static int ptrace_traceme(void) } } write_unlock_irq(&tasklist_lock); + mutex_unlock(&current->signal->cred_guard_mutex); return ret; } diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 255999ba9190..b29bbfa0b044 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -1955,9 +1955,15 @@ static long seccomp_set_mode_filter(unsigned int flags, * Make sure we cannot change seccomp or nnp state via TSYNC * while another thread is in the middle of calling exec. */ - if (flags & SECCOMP_FILTER_FLAG_TSYNC && - mutex_lock_killable(&current->signal->cred_guard_mutex)) - goto out_put_fd; + if (flags & SECCOMP_FILTER_FLAG_TSYNC) { + if (mutex_lock_killable(&current->signal->cred_guard_mutex)) + goto out_put_fd; + + if (unlikely(current->signal->exec_bprm)) { + mutex_unlock(&current->signal->cred_guard_mutex); + goto out_put_fd; + } + } spin_lock_irq(&current->sighand->siglock); diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c index 4db327b44586..3b7d81fb99bb 100644 --- a/tools/testing/selftests/ptrace/vmaccess.c +++ b/tools/testing/selftests/ptrace/vmaccess.c @@ -39,8 +39,15 @@ TEST(vmaccess) f = open(mm, O_RDONLY); ASSERT_GE(f, 0); close(f); - f = kill(pid, SIGCONT); - ASSERT_EQ(f, 0); + f = waitpid(-1, NULL, 0); + ASSERT_NE(f, -1); + ASSERT_NE(f, 0); + ASSERT_NE(f, pid); + f = waitpid(-1, NULL, 0); + ASSERT_EQ(f, pid); + f = waitpid(-1, NULL, 0); + ASSERT_EQ(f, -1); + ASSERT_EQ(errno, ECHILD); } TEST(attach) @@ -57,22 
+64,24 @@ TEST(attach) sleep(1); k = ptrace(PTRACE_ATTACH, pid, 0L, 0L); - ASSERT_EQ(errno, EAGAIN); - ASSERT_EQ(k, -1); + ASSERT_EQ(k, 0); k = waitpid(-1, &s, WNOHANG); ASSERT_NE(k, -1); ASSERT_NE(k, 0); ASSERT_NE(k, pid); ASSERT_EQ(WIFEXITED(s), 1); ASSERT_EQ(WEXITSTATUS(s), 0); - sleep(1); - k = ptrace(PTRACE_ATTACH, pid, 0L, 0L); + k = waitpid(-1, &s, 0); + ASSERT_EQ(k, pid); + ASSERT_EQ(WIFSTOPPED(s), 1); + ASSERT_EQ(WSTOPSIG(s), SIGTRAP); + k = ptrace(PTRACE_CONT, pid, 0L, 0L); ASSERT_EQ(k, 0); k = waitpid(-1, &s, 0); ASSERT_EQ(k, pid); ASSERT_EQ(WIFSTOPPED(s), 1); ASSERT_EQ(WSTOPSIG(s), SIGSTOP); - k = ptrace(PTRACE_DETACH, pid, 0L, 0L); + k = ptrace(PTRACE_CONT, pid, 0L, 0L); ASSERT_EQ(k, 0); k = waitpid(-1, &s, 0); ASSERT_EQ(k, pid); -- 2.39.2
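The guard that the diff adds to prepare_bprm_creds(), proc_pid_attr_write(), ptrace_traceme() and seccomp_set_mode_filter() follows one pattern: take cred_guard_mutex, then back out immediately if signal->exec_bprm is set. Below is a minimal userspace sketch of that pattern, assuming POSIX threads. `struct signal_model`, `guard_enter()`, `guard_exit()` and `MODEL_ERESTARTNOINTR` are hypothetical names used for illustration only; they are not kernel APIs.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel-internal restart code. */
#define MODEL_ERESTARTNOINTR 513

/* Hypothetical model of the two signal_struct fields involved. */
struct signal_model {
	pthread_mutex_t cred_guard_mutex;
	const void *exec_bprm;	/* non-NULL while the exec side has dropped the mutex */
};

/*
 * Returns 0 with the mutex held, or a negative restart code with the
 * mutex released when an exec is in its "mutex temporarily dropped" phase.
 */
static int guard_enter(struct signal_model *sig)
{
	assert(sig != NULL);

	if (pthread_mutex_lock(&sig->cred_guard_mutex) != 0)
		return -MODEL_ERESTARTNOINTR;

	if (sig->exec_bprm != NULL) {
		/* Do not proceed under a mutex the exec side expects to retake. */
		pthread_mutex_unlock(&sig->cred_guard_mutex);
		return -MODEL_ERESTARTNOINTR;
	}
	return 0;
}

/* Releases the mutex taken by a successful guard_enter(). */
static void guard_exit(struct signal_model *sig)
{
	pthread_mutex_unlock(&sig->cred_guard_mutex);
}
```

The point of the second check is that acquiring the mutex alone no longer proves that no execve is in flight, so every user of the mutex has to look at the flag as well.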


[PATCH v8 00/15] Consolidate iommu page table implementations (AMD)

by Jason Gunthorpe

[Joerg, can you put this and vtd in linux-next please. The vtd series is still good at v3 thanks]

Currently each of the iommu page table formats duplicates all of the logic to maintain the page table and perform map/unmap/etc operations. There are several different versions of the algorithms between all the different formats. The io-pgtable system provides an interface to help isolate the page table code from the iommu driver, but doesn't provide tools to implement the common algorithms.

This makes it very hard to improve the state of the pagetable code under the iommu domains as any proposed improvement needs to alter a large number of different driver code paths. Combined with a lack of software based testing this makes improvement in this area very hard.

iommufd wants several new page table operations:
 - More efficient map/unmap operations, using iommufd's batching logic
 - unmap that returns the physical addresses into a batch as it progresses
 - cut that allows splitting areas so large pages can have holes poked in them dynamically (ie guestmemfd hitless shared/private transitions)
 - More aggressive freeing of table memory to avoid waste
 - Fragmenting large pages so that dirty tracking can be more granular
 - Reassembling large pages so that VMs can run at full IO performance in migration/dirty tracking error flows
 - KHO integration for kernel live upgrade

Together these are algorithmically complex enough to be a very significant task to go and implement in all the page table formats we support. Just the "server" focused drivers use almost all the formats (ARMv8 S1&S2 / x86 PAE / AMDv1 / VT-d SS / RISCV)

Instead of doing the duplicated work, this series takes the first step to consolidate the algorithms into one place. In spirit it is similar to the work Christoph did a few years back to pull the redundant get_user_pages() implementations out of the arch code into core MM. This unlocked a great deal of improvement in that space in the following years.
I would like to see the same benefit in iommu as well.

My first RFC showed a bigger picture with almost all formats and more algorithms. This series reorganizes that to be narrowly focused on just enough to convert the AMD driver to use the new mechanism.

kunit tests are provided that allow good testing of the algorithms and all formats on x86, nothing is arch specific.

AMD is one of the simpler options as the HW is quite uniform with few different options/bugs while still requiring the complicated contiguous pages support. The HW also has a very simple range based invalidation approach that is easy to implement.

The AMD v1 and AMD v2 page table formats are implemented bit for bit identical to the current code, tested using a compare kunit test that checks against the io-pgtable version (on github, see below).

Updating the AMD driver to replace the io-pgtable layer with the new stuff is fairly straightforward now. The layering is fixed up in the new version so that all the invalidation goes through function pointers.

Several small fixing patches have come out of this as I've been fixing the problems that the test suite uncovers in the current code, and implementing the fixed version in iommupt.

On performance, there is a quite wide variety of implementation designs across all the drivers.
Looking at some key performance across the main formats:

iommu_map():
   pgsz  ,avg new,old ns, min new,old ns  , min % (+ve is better)
     2^12,     53,66    ,      51,63      ,  19.19 (AMDv1)
 256*2^12,    386,1909  ,     367,1795    ,  79.79
 256*2^21,    362,1633  ,     355,1556    ,  77.77
     2^12,     56,62    ,      52,59      ,  11.11 (AMDv2)
 256*2^12,    405,1355  ,     357,1292    ,  72.72
 256*2^21,    393,1160  ,     358,1114    ,  67.67
     2^12,     55,65    ,      53,62      ,  14.14 (VT-d second stage)
 256*2^12,    391,518   ,     332,512     ,  35.35
 256*2^21,    383,635   ,     336,624     ,  46.46
     2^12,     57,65    ,      55,63      ,  12.12 (ARM 64 bit)
 256*2^12,    380,389   ,     361,369     ,   2.02
 256*2^21,    358,419   ,     345,400     ,  13.13

iommu_unmap():
   pgsz  ,avg new,old ns, min new,old ns  , min % (+ve is better)
     2^12,     69,88    ,      65,85      ,  23.23 (AMDv1)
 256*2^12,    353,6498  ,     331,6029    ,  94.94
 256*2^21,    373,6014  ,     360,5706    ,  93.93
     2^12,     71,72    ,      66,69      ,   4.04 (AMDv2)
 256*2^12,    228,891   ,     206,871     ,  76.76
 256*2^21,    254,721   ,     245,711     ,  65.65
     2^12,     69,87    ,      65,82      ,  20.20 (VT-d second stage)
 256*2^12,    210,321   ,     200,315     ,  36.36
 256*2^21,    255,349   ,     238,342     ,  30.30
     2^12,     72,77    ,      68,74      ,   8.08 (ARM 64 bit)
 256*2^12,    521,357   ,     447,346     , -29.29
 256*2^21,    489,358   ,     433,345     , -25.25

  * Above numbers include additional patches to remove the iommu_pgsize() overheads. gcc 13.3.0, i7-12700

This version provides fairly consistent performance across formats. ARM unmap performance is quite different because this version supports contiguous pages and uses a very different algorithm for unmapping. Though why it is so much worse compared to AMDv1 I haven't figured out yet.

The per-format commits include a more detailed chart.
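The "min %" column can be cross-checked against the "min new,old ns" column. The sketch below assumes the column is the relative reduction (old - new) / old expressed as a truncated whole percent. That formula is inferred from the posted numbers, not stated by the author, and `improvement_pct()` is a hypothetical helper name. The integer part of every row in the two tables matches this calculation; the two digits after the decimal point in the posting appear to repeat the integer part, so only the integer part is compared here.

```c
#include <assert.h>

/*
 * Relative improvement of new_ns over old_ns as a whole percent,
 * truncated toward zero. Positive means the new code is faster.
 */
static long improvement_pct(long new_ns, long old_ns)
{
	assert(old_ns > 0);
	return (old_ns - new_ns) * 100 / old_ns;
}
```

For example, improvement_pct(51, 63) gives 19 for the AMDv1 2^12 map row, and improvement_pct(447, 346) gives -29 for the ARM 256*2^12 unmap row, where the new code is slower.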
There is a second branch:
   https://github.com/jgunthorpe/linux/commits/iommu_pt_all
Containing supporting work and future steps:
 - ARM short descriptor (32 bit), ARM long descriptor (64 bit) formats
 - RISCV format and RISCV conversion
   https://github.com/jgunthorpe/linux/commits/iommu_pt_riscv
 - Support for a DMA incoherent HW page table walker
 - VT-d second stage format and VT-d conversion
   https://github.com/jgunthorpe/linux/commits/iommu_pt_vtd
 - DART v1 & v2 format
 - Draft of an iommufd 'cut' operation to break down huge pages
 - A compare test that checks the iommupt formats against the iopgtable interface, including updating AMD to have a working iopgtable and patches to make VT-d have an iopgtable for testing.
 - A performance test to micro-benchmark map and unmap against iopgtable

My strategy is to go one by one for the drivers:
 - AMD driver conversion
 - RISCV page table and driver
 - Intel VT-d driver and VTDSS page table
 - Flushing improvements for RISCV
 - ARM SMMUv3

And concurrently work on the algorithm side:
 - debugfs content dump, like VT-d has
 - Cut support
 - Increase/Decrease page size support
 - map/unmap batching
 - KHO

As we make more algorithm improvements the value to convert the drivers increases.
This is on github: https://github.com/jgunthorpe/linux/commits/iommu_pt

v8:
 - Remove unused to_amdv1pt/common_to_amdv1pt/to_x86_64_pt/common_to_x86_64_pt
 - Fix 32 bit udiv compile failure in the kunit
v7: https://patch.msgid.link/r/0-v7-ab019a8791e2+175b8-iommu_pt_jgg@nvidia.com
 - Rebase to v6.18-rc2
 - Improve comments and documentation
 - Add a few missed __sme_sets() for AMD CC
 - Rename
     pt_iommu_flush_ops -> pt_iommu_driver_ops
     VT-D -> VT-d
     pt_clear_entry -> pt_clear_entries
     pt_entry_write_is_dirty -> pt_entry_is_write_dirty
     pt_entry_set_write_clean -> pt_entry_make_write_clean
 - Tidy some of the map flow into a new function do_map()
 - Fix ffz64()
v6: https://patch.msgid.link/r/0-v6-0fb54a1d9850+36b-iommu_pt_jgg@nvidia.com
 - Improve comments and documentation
 - Rename
     pt_entry_oa_full -> pt_entry_oa_exact
     pt_has_system_page -> pt_has_system_page_size
     pt_max_output_address_lg2 -> pt_max_oa_lg2
     log2_f*() -> vaf* / oaf* / f*_t
     pt_item_fully_covered -> pt_entry_fully_covered
 - Fix missed constant propagation causing division
 - Consolidate debugging checks to pt_check_install_leaf_args()
 - Change collect->ignore_mapped to check_mapped
 - Shuffle some hunks around to more appropriate patches
 - Two new mini kunit tests
v5: https://patch.msgid.link/r/0-v5-116c4948af3d+68091-iommu_pt_jgg@nvidia.com
 - Text grammar updates and kdoc fixes
v4: https://patch.msgid.link/r/0-v4-0d6a6726a372+18959-iommu_pt_jgg@nvidia.com
 - Rebase on v6.16-rc3
 - Integrate the HATS/HATDis changes
 - Remove 'default n' from kconfig
 - Remove unused 'PT_FIXED_TOP_LEVEL'
 - Improve comments and documentation
 - Fix some compile warnings from kbuild robots
v3: https://patch.msgid.link/r/0-v3-a93aab628dbc+521-iommu_pt_jgg@nvidia.com
 - Rebase on v6.16-rc2
 - s/PT_ENTRY_WORD_SIZE/PT_ITEM_WORD_SIZE/s to follow the language better
 - Comment and documentation updates
 - Add PT_TOP_PHYS_MASK to help manage alignment restrictions on the top pointer
 - Add missed force_aperture = true
 - Make pt_iommu_deinit() take care
   of the not-yet-inited error case internally as AMD/RISCV/VTD all shared this logic
 - Change gather_range() into gather_range_pages() so it also deals with the page list. This makes the following cache flushing series simpler
 - Fix missed update of unmap->unmapped in some error cases
 - Change clear_contig() to order the gather more logically
 - Remove goto from the error handling in __map_range_leaf()
 - s/log2_/oalog2_/ in places where the argument is an oaddr_t
 - Pass the pts to pt_table_install64/32()
 - Do not use SIGN_EXTEND for the AMDv2 page table because of Vasant's information on how PASID 0 works.
v2: https://patch.msgid.link/r/0-v2-5c26bde5c22d+58b-iommu_pt_jgg@nvidia.com
 - AMD driver only, many code changes
RFC: https://lore.kernel.org/all/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/

Cc: Michael Roth <michael.roth(a)amd.com>
Cc: Alexey Kardashevskiy <aik(a)amd.com>
Cc: Pasha Tatashin <pasha.tatashin(a)soleen.com>
Cc: James Gowans <jgowans(a)amazon.com>
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>

Alejandro Jimenez (1):
  iommu/amd: Use the generic iommu page table

Jason Gunthorpe (14):
  genpt: Generic Page Table base API
  genpt: Add Documentation/ files
  iommupt: Add the basic structure of the iommu implementation
  iommupt: Add the AMD IOMMU v1 page table format
  iommupt: Add iova_to_phys op
  iommupt: Add unmap_pages op
  iommupt: Add map_pages op
  iommupt: Add read_and_clear_dirty op
  iommupt: Add a kunit test for Generic Page Table
  iommupt: Add a mock pagetable format for iommufd selftest to use
  iommufd: Change the selftest to use iommupt instead of xarray
  iommupt: Add the x86 64 bit page table format
  iommu/amd: Remove AMD io_pgtable support
  iommupt: Add a kunit test for the IOMMU implementation

 .clang-format                                 |    1 +
 Documentation/driver-api/generic_pt.rst       |  142 ++
 Documentation/driver-api/index.rst            |    1 +
 drivers/iommu/Kconfig                         |    2 +
 drivers/iommu/Makefile                        |    1 +
 drivers/iommu/amd/Kconfig                     |    5 +-
 drivers/iommu/amd/Makefile                    |    2 +-
 drivers/iommu/amd/amd_iommu.h                 |    1 -
 drivers/iommu/amd/amd_iommu_types.h           |  110 +-
 drivers/iommu/amd/io_pgtable.c                |  577 --------
 drivers/iommu/amd/io_pgtable_v2.c             |  370 ------
 drivers/iommu/amd/iommu.c                     |  538 ++++----
 drivers/iommu/generic_pt/.kunitconfig         |   13 +
 drivers/iommu/generic_pt/Kconfig              |   68 +
 drivers/iommu/generic_pt/fmt/Makefile         |   26 +
 drivers/iommu/generic_pt/fmt/amdv1.h          |  411 ++++++
 drivers/iommu/generic_pt/fmt/defs_amdv1.h     |   21 +
 drivers/iommu/generic_pt/fmt/defs_x86_64.h    |   21 +
 drivers/iommu/generic_pt/fmt/iommu_amdv1.c    |   15 +
 drivers/iommu/generic_pt/fmt/iommu_mock.c     |   10 +
 drivers/iommu/generic_pt/fmt/iommu_template.h |   48 +
 drivers/iommu/generic_pt/fmt/iommu_x86_64.c   |   11 +
 drivers/iommu/generic_pt/fmt/x86_64.h         |  255 ++++
 drivers/iommu/generic_pt/iommu_pt.h           | 1162 +++++++++++++++++
 drivers/iommu/generic_pt/kunit_generic_pt.h   |  713 ++++++++++
 drivers/iommu/generic_pt/kunit_iommu.h        |  183 +++
 drivers/iommu/generic_pt/kunit_iommu_pt.h     |  487 +++++++
 drivers/iommu/generic_pt/pt_common.h          |  358 +++++
 drivers/iommu/generic_pt/pt_defs.h            |  329 +++++
 drivers/iommu/generic_pt/pt_fmt_defaults.h    |  233 ++++
 drivers/iommu/generic_pt/pt_iter.h            |  636 +++++++++
 drivers/iommu/generic_pt/pt_log2.h            |  122 ++
 drivers/iommu/io-pgtable.c                    |    4 -
 drivers/iommu/iommufd/Kconfig                 |    1 +
 drivers/iommu/iommufd/iommufd_test.h          |   11 +-
 drivers/iommu/iommufd/selftest.c              |  438 +++----
 include/linux/generic_pt/common.h             |  167 +++
 include/linux/generic_pt/iommu.h              |  271 ++++
 include/linux/io-pgtable.h                    |    2 -
 include/linux/irqchip/riscv-imsic.h           |    3 +-
 tools/testing/selftests/iommu/iommufd.c       |   60 +-
 tools/testing/selftests/iommu/iommufd_utils.h |   12 +
 42 files changed, 6229 insertions(+), 1612 deletions(-)
 create mode 100644 Documentation/driver-api/generic_pt.rst
 delete mode 100644 drivers/iommu/amd/io_pgtable.c
 delete mode 100644 drivers/iommu/amd/io_pgtable_v2.c
 create mode 100644 drivers/iommu/generic_pt/.kunitconfig
 create mode 100644 drivers/iommu/generic_pt/Kconfig
 create mode 100644 drivers/iommu/generic_pt/fmt/Makefile
 create mode 100644 drivers/iommu/generic_pt/fmt/amdv1.h
 create mode 100644 drivers/iommu/generic_pt/fmt/defs_amdv1.h
 create mode 100644 drivers/iommu/generic_pt/fmt/defs_x86_64.h
 create mode 100644 drivers/iommu/generic_pt/fmt/iommu_amdv1.c
 create mode 100644 drivers/iommu/generic_pt/fmt/iommu_mock.c
 create mode 100644 drivers/iommu/generic_pt/fmt/iommu_template.h
 create mode 100644 drivers/iommu/generic_pt/fmt/iommu_x86_64.c
 create mode 100644 drivers/iommu/generic_pt/fmt/x86_64.h
 create mode 100644 drivers/iommu/generic_pt/iommu_pt.h
 create mode 100644 drivers/iommu/generic_pt/kunit_generic_pt.h
 create mode 100644 drivers/iommu/generic_pt/kunit_iommu.h
 create mode 100644 drivers/iommu/generic_pt/kunit_iommu_pt.h
 create mode 100644 drivers/iommu/generic_pt/pt_common.h
 create mode 100644 drivers/iommu/generic_pt/pt_defs.h
 create mode 100644 drivers/iommu/generic_pt/pt_fmt_defaults.h
 create mode 100644 drivers/iommu/generic_pt/pt_iter.h
 create mode 100644 drivers/iommu/generic_pt/pt_log2.h
 create mode 100644 include/linux/generic_pt/common.h
 create mode 100644 include/linux/generic_pt/iommu.h

base-commit: 8440410283bb5533b676574211f31f030a18011b
--
2.43.0

