October 2024 - Linux-kselftest-mirror

[PATCH v3 0/4] Add support for the Bus Lock Threshold

by Manali Shukla

Misbehaving guests can cause bus locks to degrade the performance of a system. Non-WB (write-back) and misaligned locked RMW (read-modify-write) instructions are referred to as "bus locks" and require system wide synchronization among all processors to guarantee the atomicity. The bus locks can impose notable performance penalties for all processors within the system. Support for the Bus Lock Threshold is indicated by CPUID Fn8000_000A_EDX[29] BusLockThreshold=1, the VMCB provides a Bus Lock Threshold enable bit and an unsigned 16-bit Bus Lock Threshold count. VMCB intercept bit VMCB Offset Bits Function 14h 5 Intercept bus lock operations Bus lock threshold count VMCB Offset Bits Function 120h 15:0 Bus lock counter During VMRUN, the bus lock threshold count is fetched and stored in an internal count register. Prior to executing a bus lock within the guest, the processor verifies the count in the bus lock register. If the count is greater than zero, the processor executes the bus lock, reducing the count. However, if the count is zero, the bus lock operation is not performed, and instead, a Bus Lock Threshold #VMEXIT is triggered to transfer control to the Virtual Machine Monitor (VMM). A Bus Lock Threshold #VMEXIT is reported to the VMM with VMEXIT code 0xA5h, VMEXIT_BUSLOCK. EXITINFO1 and EXITINFO2 are set to 0 on a VMEXIT_BUSLOCK. On a #VMEXIT, the processor writes the current value of the Bus Lock Threshold Counter to the VMCB. More details about the Bus Lock Threshold feature can be found in AMD APM [1]. v2 -> v3 - Drop parch to add virt tag in /proc/cpuinfo. - Incorporated Tom's review comments. v1 -> v2 - Incorporated misc review comments from Sean. - Removed bus_lock_counter module parameter. - Set the value of bus_lock_counter to zero by default and reload the value by 1 in bus lock exit handler. - Add documentation for the behavioral difference for KVM_EXIT_BUS_LOCK. - Improved selftest for buslock to work on SVM and VMX. - Rewrite the commit messages. Patches are prepared on kvm-next/next (efbc6bd090f4). Testing done: - Added a selftest for the Bus Lock Threshold functionality. - The bus lock threshold selftest has been tested on both Intel and AMD platforms. - Tested the Bus Lock Threshold functionality on SEV, SEV-ES, SEV-SNP guests. - Tested the Bus Lock Threshold functionality on nested guests. v1: https://lore.kernel.org/kvm/20240709175145.9986-4-manali.shukla@amd.com/T/ v2: https://lore.kernel.org/kvm/20241001063413.687787-4-manali.shukla@amd.com/T/ [1]: AMD64 Architecture Programmer's Manual Pub. 24593, April 2024, Vol 2, 15.14.5 Bus Lock Threshold. https://bugzilla.kernel.org/attachment.cgi?id=306250 Manali Shukla (2): x86/cpufeatures: Add CPUID feature bit for the Bus Lock Threshold KVM: X86: Add documentation about behavioral difference for KVM_EXIT_BUS_LOCK Nikunj A Dadhania (2): KVM: SVM: Enable Bus lock threshold exit KVM: selftests: Add bus lock exit test Documentation/virt/kvm/api.rst | 4 + arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/svm.h | 5 +- arch/x86/include/uapi/asm/svm.h | 2 + arch/x86/kvm/svm/nested.c | 10 ++ arch/x86/kvm/svm/svm.c | 27 ++++ tools/testing/selftests/kvm/Makefile | 1 + .../selftests/kvm/x86_64/kvm_buslock_test.c | 130 ++++++++++++++++++ 8 files changed, 179 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_buslock_test.c base-commit: efbc6bd090f48ccf64f7a8dd5daea775821d57ec -- 2.34.1

1 year, 1 month

4
13
0 0

[PATCH net-next v2 0/4] Introduce VLAN support in HSR

by MD Danish Anwar

This series adds VLAN support to HSR framework. This series also adds VLAN support to HSR mode of ICSSG Ethernet driver. Changes from v1 to v2: *) Added patch 4/4 to add test script related to VLAN in HSR as asked by Lukasz Majewski <lukma(a)denx.de> v1 https://lore.kernel.org/all/20241004074715.791191-1-danishanwar@ti.com/ MD Danish Anwar (1): selftests: hsr: Add test for VLAN Murali Karicheri (1): net: hsr: Add VLAN CTAG filter support Ravi Gunasekaran (1): net: ti: icssg-prueth: Add VLAN support for HSR mode WingMan Kwok (1): net: hsr: Add VLAN support drivers/net/ethernet/ti/icssg/icssg_prueth.c | 45 +++++++++++- net/hsr/hsr_device.c | 76 ++++++++++++++++++-- net/hsr/hsr_forward.c | 19 +++-- tools/testing/selftests/net/hsr/config | 1 + tools/testing/selftests/net/hsr/hsr_ping.sh | 63 +++++++++++++++- 5 files changed, 191 insertions(+), 13 deletions(-) base-commit: 1bf70e6c3a5346966c25e0a1ff492945b25d3f80 -- 2.34.1

1 year, 1 month

4
10
0 0

[PATCH net v6] ipv6: Fix soft lockups in fib6_select_path under high next hop churn

by Omid Ehtemam-Haghighi

Soft lockups have been observed on a cluster of Linux-based edge routers located in a highly dynamic environment. Using the `bird` service, these routers continuously update BGP-advertised routes due to frequently changing nexthop destinations, while also managing significant IPv6 traffic. The lockups occur during the traversal of the multipath circular linked-list in the `fib6_select_path` function, particularly while iterating through the siblings in the list. The issue typically arises when the nodes of the linked list are unexpectedly deleted concurrently on a different core—indicated by their 'next' and 'previous' elements pointing back to the node itself and their reference count dropping to zero. This results in an infinite loop, leading to a soft lockup that triggers a system panic via the watchdog timer. Apply RCU primitives in the problematic code sections to resolve the issue. Where necessary, update the references to fib6_siblings to annotate or use the RCU APIs. Include a test script that reproduces the issue. The script periodically updates the routing table while generating a heavy load of outgoing IPv6 traffic through multiple iperf3 clients. It consistently induces infinite soft lockups within a couple of minutes. Kernel log: 0 [ffffbd13003e8d30] machine_kexec at ffffffff8ceaf3eb 1 [ffffbd13003e8d90] __crash_kexec at ffffffff8d0120e3 2 [ffffbd13003e8e58] panic at ffffffff8cef65d4 3 [ffffbd13003e8ed8] watchdog_timer_fn at ffffffff8d05cb03 4 [ffffbd13003e8f08] __hrtimer_run_queues at ffffffff8cfec62f 5 [ffffbd13003e8f70] hrtimer_interrupt at ffffffff8cfed756 6 [ffffbd13003e8fd0] __sysvec_apic_timer_interrupt at ffffffff8cea01af 7 [ffffbd13003e8ff0] sysvec_apic_timer_interrupt at ffffffff8df1b83d -- <IRQ stack> -- 8 [ffffbd13003d3708] asm_sysvec_apic_timer_interrupt at ffffffff8e000ecb [exception RIP: fib6_select_path+299] RIP: ffffffff8ddafe7b RSP: ffffbd13003d37b8 RFLAGS: 00000287 RAX: ffff975850b43600 RBX: ffff975850b40200 RCX: 0000000000000000 RDX: 000000003fffffff RSI: 0000000051d383e4 RDI: ffff975850b43618 RBP: ffffbd13003d3800 R8: 0000000000000000 R9: ffff975850b40200 R10: 0000000000000000 R11: 0000000000000000 R12: ffffbd13003d3830 R13: ffff975850b436a8 R14: ffff975850b43600 R15: 0000000000000007 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 9 [ffffbd13003d3808] ip6_pol_route at ffffffff8ddb030c 10 [ffffbd13003d3888] ip6_pol_route_input at ffffffff8ddb068c 11 [ffffbd13003d3898] fib6_rule_lookup at ffffffff8ddf02b5 12 [ffffbd13003d3928] ip6_route_input at ffffffff8ddb0f47 13 [ffffbd13003d3a18] ip6_rcv_finish_core.constprop.0 at ffffffff8dd950d0 14 [ffffbd13003d3a30] ip6_list_rcv_finish.constprop.0 at ffffffff8dd96274 15 [ffffbd13003d3a98] ip6_sublist_rcv at ffffffff8dd96474 16 [ffffbd13003d3af8] ipv6_list_rcv at ffffffff8dd96615 17 [ffffbd13003d3b60] __netif_receive_skb_list_core at ffffffff8dc16fec 18 [ffffbd13003d3be0] netif_receive_skb_list_internal at ffffffff8dc176b3 19 [ffffbd13003d3c50] napi_gro_receive at ffffffff8dc565b9 20 [ffffbd13003d3c80] ice_receive_skb at ffffffffc087e4f5 [ice] 21 [ffffbd13003d3c90] ice_clean_rx_irq at ffffffffc0881b80 [ice] 22 [ffffbd13003d3d20] ice_napi_poll at ffffffffc088232f [ice] 23 [ffffbd13003d3d80] __napi_poll at ffffffff8dc18000 24 [ffffbd13003d3db8] net_rx_action at ffffffff8dc18581 25 [ffffbd13003d3e40] __do_softirq at ffffffff8df352e9 26 [ffffbd13003d3eb0] run_ksoftirqd at ffffffff8ceffe47 27 [ffffbd13003d3ec0] smpboot_thread_fn at ffffffff8cf36a30 28 [ffffbd13003d3ee8] kthread at ffffffff8cf2b39f 29 [ffffbd13003d3f28] ret_from_fork at ffffffff8ce5fa64 30 [ffffbd13003d3f50] ret_from_fork_asm at ffffffff8ce03cbb Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table") Reported-by: Adrian Oliver <kernel(a)aoliver.ca> Signed-off-by: Omid Ehtemam-Haghighi <omid.ehtemamhaghighi(a)menlosecurity.com> Cc: David S. Miller <davem(a)davemloft.net> Cc: David Ahern <dsahern(a)gmail.com> Cc: Eric Dumazet <edumazet(a)google.com> Cc: Jakub Kicinski <kuba(a)kernel.org> Cc: Paolo Abeni <pabeni(a)redhat.com> Cc: Shuah Khan <shuah(a)kernel.org> Cc: Ido Schimmel <idosch(a)idosch.org> Cc: Kuniyuki Iwashima <kuniyu(a)amazon.com> Cc: Simon Horman <horms(a)kernel.org> Cc: netdev(a)vger.kernel.org Cc: linux-kselftest(a)vger.kernel.org Cc: linux-kernel(a)vger.kernel.org --- v5 -> v6: * Adjust the comment line lengths in the test script to a maximum of 80 characters * Change memory allocation in inet6_rt_notify from gfp_any() to GFP_ATOMIC for atomic allocation in non-blocking contexts, as suggested by Ido Schimmel * NOTE: I have executed the test script on both bare-metal servers and virtualized environments such as QEMU and vng. In the case of bare-metal, it consistently triggers a soft lockup in under a minute on unpatched kernels. For the virtualized environments, an unpatched kernel compiled with the Ubuntu 24.04 configuration also triggers a soft lockup, though it takes longer; however, it did not trigger a soft lockup on kernels compiled with configurations provided in: https://github.com/linux-netdev/nipa/wiki/How-to-run-netdev-selftests-CI-st… leading to potential false negatives in the test results. I am curious if this test can be executed on a bare-metal machine within a CI system, if such a setup exists, rather than in a virtualized environment. If that’s not possible, how can I apply a different kernel configuration, such as the one used in Ubuntu 24.04, for this test? Please advise. v4 -> v5: * Addressed review comments from Paolo Abeni. * Added additional clarifying comments in the test script. * Minor cleanup performed in the test script. v3 -> v4: * Added RCU primitives to rt6_fill_node(). I found that this function is typically called either with a table lock held or within rcu_read_lock/rcu_read_unlock pairs, except in the following call chain, where the protection is unclear: rt_fill_node() fib6_info_hw_flags_set() mlxsw_sp_fib6_offload_failed_flag_set() mlxsw_sp_router_fib6_event_work() The last function is initialized as a work item in mlxsw_sp_router_fib_event() and scheduled for deferred execution. I am unsure if the execution context of this work item is protected by any table lock or rcu_read_lock/rcu_read_unlock pair, so I have added the protection. Please let me know if this is redundant. * Other review comments addressed v2 -> v3: * Removed redundant rcu_read_lock()/rcu_read_unlock() pairs * Revised the test script based on Ido Schimmel's feedback * Updated the test script to ensure compatibility with the latest iperf3 version * Fixed new warnings generated with 'C=2' in the previous version * Other review comments addressed v1 -> v2: * list_del_rcu() is applied exclusively to legacy multipath code * All occurrences of fib6_siblings have been modified to utilize RCU APIs for annotation and usage. * Additionally, a test script for reproducing the reported issue is included --- net/ipv6/ip6_fib.c | 8 +- net/ipv6/route.c | 45 ++- tools/testing/selftests/net/Makefile | 1 + .../net/ipv6_route_update_soft_lockup.sh | 262 ++++++++++++++++++ 4 files changed, 297 insertions(+), 19 deletions(-) create mode 100755 tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index eb111d20615c..9a1c59275a10 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -1190,8 +1190,8 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt, while (sibling) { if (sibling->fib6_metric == rt->fib6_metric && rt6_qualify_for_ecmp(sibling)) { - list_add_tail(&rt->fib6_siblings, - &sibling->fib6_siblings); + list_add_tail_rcu(&rt->fib6_siblings, + &sibling->fib6_siblings); break; } sibling = rcu_dereference_protected(sibling->fib6_next, @@ -1252,7 +1252,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt, fib6_siblings) sibling->fib6_nsiblings--; rt->fib6_nsiblings = 0; - list_del_init(&rt->fib6_siblings); + list_del_rcu(&rt->fib6_siblings); rt6_multipath_rebalance(next_sibling); return err; } @@ -1970,7 +1970,7 @@ static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn, &rt->fib6_siblings, fib6_siblings) sibling->fib6_nsiblings--; rt->fib6_nsiblings = 0; - list_del_init(&rt->fib6_siblings); + list_del_rcu(&rt->fib6_siblings); rt6_multipath_rebalance(next_sibling); } diff --git a/net/ipv6/route.c b/net/ipv6/route.c index b4251915585f..b9b986bda943 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -413,8 +413,8 @@ void fib6_select_path(const struct net *net, struct fib6_result *res, struct flowi6 *fl6, int oif, bool have_oif_match, const struct sk_buff *skb, int strict) { - struct fib6_info *sibling, *next_sibling; struct fib6_info *match = res->f6i; + struct fib6_info *sibling; if (!match->nh && (!match->fib6_nsiblings || have_oif_match)) goto out; @@ -440,8 +440,8 @@ void fib6_select_path(const struct net *net, struct fib6_result *res, if (fl6->mp_hash <= atomic_read(&match->fib6_nh->fib_nh_upper_bound)) goto out; - list_for_each_entry_safe(sibling, next_sibling, &match->fib6_siblings, - fib6_siblings) { + list_for_each_entry_rcu(sibling, &match->fib6_siblings, + fib6_siblings) { const struct fib6_nh *nh = sibling->fib6_nh; int nh_upper_bound; @@ -5195,14 +5195,18 @@ static void ip6_route_mpath_notify(struct fib6_info *rt, * nexthop. Since sibling routes are always added at the end of * the list, find the first sibling of the last route appended */ + rcu_read_lock(); + if ((nlflags & NLM_F_APPEND) && rt_last && rt_last->fib6_nsiblings) { - rt = list_first_entry(&rt_last->fib6_siblings, - struct fib6_info, - fib6_siblings); + rt = list_first_or_null_rcu(&rt_last->fib6_siblings, + struct fib6_info, + fib6_siblings); } if (rt) inet6_rt_notify(RTM_NEWROUTE, rt, info, nlflags); + + rcu_read_unlock(); } static bool ip6_route_mpath_should_notify(const struct fib6_info *rt) @@ -5547,17 +5551,21 @@ static size_t rt6_nlmsg_size(struct fib6_info *f6i) nexthop_for_each_fib6_nh(f6i->nh, rt6_nh_nlmsg_size, &nexthop_len); } else { - struct fib6_info *sibling, *next_sibling; struct fib6_nh *nh = f6i->fib6_nh; + struct fib6_info *sibling; nexthop_len = 0; if (f6i->fib6_nsiblings) { rt6_nh_nlmsg_size(nh, &nexthop_len); - list_for_each_entry_safe(sibling, next_sibling, - &f6i->fib6_siblings, fib6_siblings) { + rcu_read_lock(); + + list_for_each_entry_rcu(sibling, &f6i->fib6_siblings, + fib6_siblings) { rt6_nh_nlmsg_size(sibling->fib6_nh, &nexthop_len); } + + rcu_read_unlock(); } nexthop_len += lwtunnel_get_encap_size(nh->fib_nh_lws); } @@ -5721,7 +5729,7 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb, lwtunnel_fill_encap(skb, dst->lwtstate, RTA_ENCAP, RTA_ENCAP_TYPE) < 0) goto nla_put_failure; } else if (rt->fib6_nsiblings) { - struct fib6_info *sibling, *next_sibling; + struct fib6_info *sibling; struct nlattr *mp; mp = nla_nest_start_noflag(skb, RTA_MULTIPATH); @@ -5733,14 +5741,21 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb, 0) < 0) goto nla_put_failure; - list_for_each_entry_safe(sibling, next_sibling, - &rt->fib6_siblings, fib6_siblings) { + rcu_read_lock(); + + list_for_each_entry_rcu(sibling, &rt->fib6_siblings, + fib6_siblings) { if (fib_add_nexthop(skb, &sibling->fib6_nh->nh_common, sibling->fib6_nh->fib_nh_weight, - AF_INET6, 0) < 0) + AF_INET6, 0) < 0) { + rcu_read_unlock(); + goto nla_put_failure; + } } + rcu_read_unlock(); + nla_nest_end(skb, mp); } else if (rt->nh) { if (nla_put_u32(skb, RTA_NH_ID, rt->nh->id)) @@ -6177,7 +6192,7 @@ void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info, err = -ENOBUFS; seq = info->nlh ? info->nlh->nlmsg_seq : 0; - skb = nlmsg_new(rt6_nlmsg_size(rt), gfp_any()); + skb = nlmsg_new(rt6_nlmsg_size(rt), GFP_ATOMIC); if (!skb) goto errout; @@ -6190,7 +6205,7 @@ void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info, goto errout; } rtnl_notify(skb, net, info->portid, RTNLGRP_IPV6_ROUTE, - info->nlh, gfp_any()); + info->nlh, GFP_ATOMIC); return; errout: rtnl_set_sk_err(net, RTNLGRP_IPV6_ROUTE, err); diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index 649f1fe0dc46..88cbac27aa10 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -96,6 +96,7 @@ TEST_PROGS += fdb_flush.sh TEST_PROGS += fq_band_pktlimit.sh TEST_PROGS += vlan_hw_filter.sh TEST_PROGS += bpf_offload.py +TEST_PROGS += ipv6_route_update_soft_lockup.sh # YNL files, must be before "include ..lib.mk" EXTRA_CLEAN += $(OUTPUT)/libynl.a diff --git a/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh b/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh new file mode 100755 index 000000000000..c61a7a8352cf --- /dev/null +++ b/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh @@ -0,0 +1,262 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# Testing for potential kernel soft lockup during IPv6 routing table +# refresh under heavy outgoing IPv6 traffic. If a kernel soft lockup +# occurs, a kernel panic will be triggered to prevent associated issues. +# +# +# Test Environment Layout +# +# ┌----------------┐ ┌----------------┐ +# | SOURCE_NS | | SINK_NS | +# | NAMESPACE | | NAMESPACE | +# |(iperf3 clients)| |(iperf3 servers)| +# | | | | +# | | | | +# | ┌-----------| nexthops |---------┐ | +# | |veth_source|<--------------------------------------->|veth_sink|<┐ | +# | └-----------|2001:0DB8:1::0:1/96 2001:0DB8:1::1:1/96 |---------┘ | | +# | | ^ 2001:0DB8:1::1:2/96 | | | +# | | . . | fwd | | +# | ┌---------┐ | . . | | | +# | | IPv6 | | . . | V | +# | | routing | | . 2001:0DB8:1::1:80/96| ┌-----┐ | +# | | table | | . | | lo | | +# | | nexthop | | . └--------┴-----┴-┘ +# | | update | | ............................> 2001:0DB8:2::1:1/128 +# | └-------- ┘ | +# └----------------┘ +# +# The test script sets up two network namespaces, source_ns and sink_ns, +# connected via a veth link. Within source_ns, it continuously updates the +# IPv6 routing table by flushing and inserting IPV6_NEXTHOP_ADDR_COUNT nexthop +# IPs destined for SINK_LOOPBACK_IP_ADDR in sink_ns. This refresh occurs at a +# rate of 1/ROUTING_TABLE_REFRESH_PERIOD per second for TEST_DURATION seconds. +# +# Simultaneously, multiple iperf3 clients within source_ns generate heavy +# outgoing IPv6 traffic. Each client is assigned a unique port number starting +# at 5000 and incrementing sequentially. Each client targets a unique iperf3 +# server running in sink_ns, connected to the SINK_LOOPBACK_IFACE interface +# using the same port number. +# +# The number of iperf3 servers and clients is set to half of the total +# available cores on each machine. +# +# NOTE: We have tested this script on machines with various CPU specifications, +# ranging from lower to higher performance as listed below. The test script +# effectively triggered a kernel soft lockup on machines running an unpatched +# kernel in under a minute: +# +# - 1x Intel Xeon E-2278G 8-Core Processor @ 3.40GHz +# - 1x Intel Xeon E-2378G Processor 8-Core @ 2.80GHz +# - 1x AMD EPYC 7401P 24-Core Processor @ 2.00GHz +# - 1x AMD EPYC 7402P 24-Core Processor @ 2.80GHz +# - 2x Intel Xeon Gold 5120 14-Core Processor @ 2.20GHz +# - 1x Ampere Altra Q80-30 80-Core Processor @ 3.00GHz +# - 2x Intel Xeon Gold 5120 14-Core Processor @ 2.20GHz +# - 2x Intel Xeon Silver 4214 24-Core Processor @ 2.20GHz +# - 1x AMD EPYC 7502P 32-Core @ 2.50GHz +# - 1x Intel Xeon Gold 6314U 32-Core Processor @ 2.30GHz +# - 2x Intel Xeon Gold 6338 32-Core Processor @ 2.00GHz +# +# On less performant machines, you may need to increase the TEST_DURATION +# parameter to enhance the likelihood of encountering a race condition leading +# to a kernel soft lockup and avoid a false negative result. +# +# NOTE: The test may not produce the expected result in virtualized +# environments (e.g., qemu) due to differences in timing and CPU handling, +# which can affect the conditions needed to trigger a soft lockup. + +source lib.sh +source net_helper.sh + +TEST_DURATION=300 +ROUTING_TABLE_REFRESH_PERIOD=0.01 + +IPERF3_BITRATE="300m" + + +IPV6_NEXTHOP_ADDR_COUNT="128" +IPV6_NEXTHOP_ADDR_MASK="96" +IPV6_NEXTHOP_PREFIX="2001:0DB8:1" + + +SOURCE_TEST_IFACE="veth_source" +SOURCE_TEST_IP_ADDR="2001:0DB8:1::0:1/96" + +SINK_TEST_IFACE="veth_sink" +# ${SINK_TEST_IFACE} is populated with the following range of IPv6 addresses: +# 2001:0DB8:1::1:1 to 2001:0DB8:1::1:${IPV6_NEXTHOP_ADDR_COUNT} +SINK_LOOPBACK_IFACE="lo" +SINK_LOOPBACK_IP_MASK="128" +SINK_LOOPBACK_IP_ADDR="2001:0DB8:2::1:1" + +nexthop_ip_list="" +termination_signal="" +kernel_softlokup_panic_prev_val="" + +terminate_ns_processes_by_pattern() { + local ns=$1 + local pattern=$2 + + for pid in $(ip netns pids ${ns}); do + [ -e /proc/$pid/cmdline ] && grep -qe "${pattern}" /proc/$pid/cmdline && kill -9 $pid + done +} + +cleanup() { + echo "info: cleaning up namespaces and terminating all processes within them..." + + + # Terminate iperf3 instances running in the source_ns. To avoid race + # conditions, first iterate over the PIDs and terminate those + # associated with the bash shells running the + # `while true; do iperf3 -c ...; done` loops. In a second iteration, + # terminate the individual `iperf3 -c ...` instances. + terminate_ns_processes_by_pattern ${source_ns} while + terminate_ns_processes_by_pattern ${source_ns} iperf3 + + # Repeat the same process for sink_ns + terminate_ns_processes_by_pattern ${sink_ns} while + terminate_ns_processes_by_pattern ${sink_ns} iperf3 + + # Check if any iperf3 instances are still running. This could happen + # if a core has entered an infinite loop and the timeout for detecting + # the soft lockup has not expired, but either the test interval has + # already elapsed or the test was terminated manually (e.g., with ^C) + for pid in $(ip netns pids ${source_ns}); do + if [ -e /proc/$pid/cmdline ] && grep -qe 'iperf3' /proc/$pid/cmdline; then + echo "FAIL: unable to terminate some iperf3 instances. Soft lockup is underway. A kernel panic is on the way!" + exit ${ksft_fail} + fi + done + + if [ "$termination_signal" == "SIGINT" ]; then + echo "SKIP: Termination due to ^C (SIGINT)" + elif [ "$termination_signal" == "SIGALRM" ]; then + echo "PASS: No kernel soft lockup occurred during this ${TEST_DURATION} second test" + fi + + cleanup_ns ${source_ns} ${sink_ns} + + sysctl -qw kernel.softlockup_panic=${kernel_softlokup_panic_prev_val} +} + +setup_prepare() { + setup_ns source_ns sink_ns + + ip -n ${source_ns} link add name ${SOURCE_TEST_IFACE} type veth peer name ${SINK_TEST_IFACE} netns ${sink_ns} + + # Setting up the Source namespace + ip -n ${source_ns} addr add ${SOURCE_TEST_IP_ADDR} dev ${SOURCE_TEST_IFACE} + ip -n ${source_ns} link set dev ${SOURCE_TEST_IFACE} qlen 10000 + ip -n ${source_ns} link set dev ${SOURCE_TEST_IFACE} up + ip netns exec ${source_ns} sysctl -qw net.ipv6.fib_multipath_hash_policy=1 + + # Setting up the Sink namespace + ip -n ${sink_ns} addr add ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK} dev ${SINK_LOOPBACK_IFACE} + ip -n ${sink_ns} link set dev ${SINK_LOOPBACK_IFACE} up + ip netns exec ${sink_ns} sysctl -qw net.ipv6.conf.${SINK_LOOPBACK_IFACE}.forwarding=1 + + ip -n ${sink_ns} link set ${SINK_TEST_IFACE} up + ip netns exec ${sink_ns} sysctl -qw net.ipv6.conf.${SINK_TEST_IFACE}.forwarding=1 + + + # Populate nexthop IPv6 addresses on the test interface in the sink_ns + echo "info: populating ${IPV6_NEXTHOP_ADDR_COUNT} IPv6 addresses on the ${SINK_TEST_IFACE} interface ..." + for IP in $(seq 1 ${IPV6_NEXTHOP_ADDR_COUNT}); do + ip -n ${sink_ns} addr add ${IPV6_NEXTHOP_PREFIX}::$(printf "1:%x" "${IP}")/${IPV6_NEXTHOP_ADDR_MASK} dev ${SINK_TEST_IFACE}; + done + + # Preparing list of nexthops + for IP in $(seq 1 ${IPV6_NEXTHOP_ADDR_COUNT}); do + nexthop_ip_list=$nexthop_ip_list" nexthop via ${IPV6_NEXTHOP_PREFIX}::$(printf "1:%x" $IP) dev ${SOURCE_TEST_IFACE} weight 1" + done +} + + +test_soft_lockup_during_routing_table_refresh() { + # Start num_of_iperf_servers iperf3 servers in the sink_ns namespace, + # each listening on ports starting at 5001 and incrementing + # sequentially. Since iperf3 instances may terminate unexpectedly, a + # while loop is used to automatically restart them in such cases. + echo "info: starting ${num_of_iperf_servers} iperf3 servers in the sink_ns namespace ..." + for i in $(seq 1 ${num_of_iperf_servers}); do + cmd="iperf3 --bind ${SINK_LOOPBACK_IP_ADDR} -s -p $(printf '5%03d' ${i}) --rcv-timeout 200 &>/dev/null" + ip netns exec ${sink_ns} bash -c "while true; do ${cmd}; done &" &>/dev/null + done + + # Wait for the iperf3 servers to be ready + for i in $(seq ${num_of_iperf_servers}); do + port=$(printf '5%03d' ${i}); + wait_local_port_listen ${sink_ns} ${port} tcp + done + + # Continuously refresh the routing table in the background within + # the source_ns namespace + ip netns exec ${source_ns} bash -c " + while \$(ip netns list | grep -q ${source_ns}); do + ip -6 route add ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK} ${nexthop_ip_list}; + sleep ${ROUTING_TABLE_REFRESH_PERIOD}; + ip -6 route delete ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK}; + done &" + + # Start num_of_iperf_servers iperf3 clients in the source_ns namespace, + # each sending TCP traffic on sequential ports starting at 5001. + # Since iperf3 instances may terminate unexpectedly (e.g., if the route + # to the server is deleted in the background during a route refresh), a + # while loop is used to automatically restart them in such cases. + echo "info: starting ${num_of_iperf_servers} iperf3 clients in the source_ns namespace ..." + for i in $(seq 1 ${num_of_iperf_servers}); do + cmd="iperf3 -c ${SINK_LOOPBACK_IP_ADDR} -p $(printf '5%03d' ${i}) --length 64 --bitrate ${IPERF3_BITRATE} -t 0 --connect-timeout 150 &>/dev/null" + ip netns exec ${source_ns} bash -c "while true; do ${cmd}; done &" &>/dev/null + done + + echo "info: IPv6 routing table is being updated at the rate of $(echo "1/${ROUTING_TABLE_REFRESH_PERIOD}" | bc)/s for ${TEST_DURATION} seconds ..." + echo "info: A kernel soft lockup, if detected, results in a kernel panic!" + + wait +} + +# Make sure 'iperf3' is installed, skip the test otherwise +if [ ! -x "$(command -v "iperf3")" ]; then + echo "SKIP: 'iperf3' is not installed. Skipping the test." + exit ${ksft_skip} +fi + +# Determine the number of cores on the machine +num_of_iperf_servers=$(( $(nproc)/2 )) + +# Check if we are running on a multi-core machine, skip the test otherwise +if [ "${num_of_iperf_servers}" -eq 0 ]; then + echo "SKIP: This test is not valid on a single core machine!" + exit ${ksft_skip} +fi + +# Since the kernel soft lockup we're testing causes at least one core to enter +# an infinite loop, destabilizing the host and likely affecting subsequent +# tests, we trigger a kernel panic instead of reporting a failure and +# continuing +kernel_softlokup_panic_prev_val=$(sysctl -n kernel.softlockup_panic) +sysctl -qw kernel.softlockup_panic=1 + +handle_sigint() { + termination_signal="SIGINT" + cleanup + exit ${ksft_skip} +} + +handle_sigalrm() { + termination_signal="SIGALRM" + cleanup + exit ${ksft_pass} +} + +trap handle_sigint SIGINT +trap handle_sigalrm SIGALRM + +(sleep ${TEST_DURATION} && kill -s SIGALRM $$)& + +setup_prepare +test_soft_lockup_during_routing_table_refresh -- 2.43.0

1 year, 1 month

3
3
0 0

[PATCH] KVM: selftests: memslot_perf_test: increase guest sync timeout

by Maxim Levitsky

When memslot_perf_test is run nested, first iteration of test_memslot_rw_loop testcase, sometimes takes more than 2 seconds due to build of shadow page tables. Following iterations are fast. To be on the safe side, bump the timeout to 10 seconds. Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com> --- tools/testing/selftests/kvm/memslot_perf_test.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/kvm/memslot_perf_test.c b/tools/testing/selftests/kvm/memslot_perf_test.c index 893366982f77b..e9189cbed4bb6 100644 --- a/tools/testing/selftests/kvm/memslot_perf_test.c +++ b/tools/testing/selftests/kvm/memslot_perf_test.c @@ -415,7 +415,7 @@ static bool _guest_should_exit(void) */ static noinline void host_perform_sync(struct sync_area *sync) { - alarm(2); + alarm(10); atomic_store_explicit(&sync->sync_flag, true, memory_order_release); while (atomic_load_explicit(&sync->sync_flag, memory_order_acquire)) -- 2.26.3

1 year, 1 month

3
3
0 0

[PATCH] kvm: selftest: fix noop test in guest_memfd_test.c

by Patrick Roy

The loop in test_create_guest_memfd_invalid that is supposed to test that nothing is accepted as a valid flag to KVM_CREATE_GUEST_MEMFD was initializing `flag` as 0 instead of BIT(0). This caused the loop to immediately exit instead of iterating over BIT(0), BIT(1), ... . Fixes: 8a89efd43423 ("KVM: selftests: Add basic selftest for guest_memfd()") Signed-off-by: Patrick Roy <roypat(a)amazon.co.uk> --- tools/testing/selftests/kvm/guest_memfd_test.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c index ba0c8e9960358..ce687f8d248fc 100644 --- a/tools/testing/selftests/kvm/guest_memfd_test.c +++ b/tools/testing/selftests/kvm/guest_memfd_test.c @@ -134,7 +134,7 @@ static void test_create_guest_memfd_invalid(struct kvm_vm *vm) size); } - for (flag = 0; flag; flag <<= 1) { + for (flag = BIT(0); flag; flag <<= 1) { fd = __vm_create_guest_memfd(vm, page_size, flag); TEST_ASSERT(fd == -1 && errno == EINVAL, "guest_memfd() with flag '0x%lx' should fail with EINVAL", -- 2.47.0

1 year, 1 month

4
4
0 0

[PATCH v3 0/9] SEV Kernel Selftests

by Pratik R. Sampat

This series primarily introduces SEV-SNP test for the kernel selftest framework. It tests boot, ioctl, pre fault, and fallocate in various combinations to exercise both positive and negative launch flow paths. Patch 1 - Adds a wrapper for the ioctl calls that decouple ioctl and asserts, which enables the use of negative test cases. No functional change intended. Patch 2 - Extend the sev smoke tests to use the SNP specific ioctl calls and sets up memory to boot a SNP guest VM Patch 3 - Adds SNP to shutdown testing Patch 4, 5 - Tests the ioctl path for SEV, SEV-ES and SNP Patch 6 - Adds support for SNP in KVM_SEV_INIT2 tests Patch 7,8,9 - Enable Prefault tests for SEV, SEV-ES and SNP The patchset is rebased on top of kvm-x86/next branch. v3: 1. Remove the assignments for the prefault and fallocate test type enums. 2. Fix error message for sev launch measure and finish. 3. Collect tested-by tags [Peter, Srikanth] v2: https://lore.kernel.org/kvm/20240816192310.117456-1-pratikrajesh.sampat@amd… 1. Add SMT parsing check to populate SNP policy flags 2. Extend Peter Gonda's shutdown test to include SNP 3. Introduce new tests for prefault which include exercising prefault, fallocate, hole-punch in various combinations. 4. Decouple ioctl patch reworked to introduce private variants of the the functions that call into the ioctl. Also reordered the patch for it to arrive first so that new APIs are not written right after their introduction. 5. General cleanups - adding comments, avoiding local booleans, better error message. Suggestions incorporated from Peter, Tom, and Sean. RFC: https://lore.kernel.org/kvm/20240710220540.188239-1-pratikrajesh.sampat@amd… Any feedback/review is highly appreciated! Michael Roth (2): KVM: selftests: Add interface to manually flag protected/encrypted ranges KVM: selftests: Add a CoCo-specific test for KVM_PRE_FAULT_MEMORY Pratik R. Sampat (7): KVM: selftests: Decouple SEV ioctls from asserts KVM: selftests: Add a basic SNP smoke test KVM: selftests: Add SNP to shutdown testing KVM: selftests: SEV IOCTL test KVM: selftests: SNP IOCTL test KVM: selftests: SEV-SNP test for KVM_SEV_INIT2 KVM: selftests: Interleave fallocate for KVM_PRE_FAULT_MEMORY tools/testing/selftests/kvm/Makefile | 1 + .../testing/selftests/kvm/include/kvm_util.h | 13 + .../selftests/kvm/include/x86_64/processor.h | 1 + .../selftests/kvm/include/x86_64/sev.h | 76 +++- tools/testing/selftests/kvm/lib/kvm_util.c | 53 ++- .../selftests/kvm/lib/x86_64/processor.c | 6 +- tools/testing/selftests/kvm/lib/x86_64/sev.c | 190 +++++++- .../kvm/x86_64/coco_pre_fault_memory_test.c | 421 ++++++++++++++++++ .../selftests/kvm/x86_64/sev_init2_tests.c | 13 + .../selftests/kvm/x86_64/sev_smoke_test.c | 297 +++++++++++- 10 files changed, 1023 insertions(+), 48 deletions(-) create mode 100644 tools/testing/selftests/kvm/x86_64/coco_pre_fault_memory_test.c -- 2.34.1

1 year, 1 month

2
27
0 0

[PATCH V4 00/15] selftests/resctrl: Support diverse platforms with MBM and MBA tests

by Reinette Chatre

Changes since V3: - V3: https://lore.kernel.org/all/cover.1729218182.git.reinette.chatre@intel.com/ - Rebased on HEAD 2a027d6bb660 of kselftest/next. - Fix empty string parsing issues pointed out by Ilpo. - Add Reviewed-by tags. - Please see individual patches for detailed changes. Changes since V2: - V2: https://lore.kernel.org/all/cover.1726164080.git.reinette.chatre@intel.com/ - Add fix to protect against buffer overflow when parsing text from sysfs files. - Add cleanup patch to address use of magic constants as pointed out by Ilpo. - Add Reviewed-by tags where received, except for "selftests/resctrl: Use cache size to determine "fill_buf" buffer size" that changed too much since receiving the Reviewed-by tag. - Please see individual patches for detailed changes. Changes since V1: - V1: https://lore.kernel.org/cover.1724970211.git.reinette.chatre@intel.com/ - V2 contains the same general solutions to stated problem as V1 but these are now preceded by more fixes (patches 1 to 5) and improved robustness (patches 6 to 9) to existing tests before the series gets back to solving the original problem with more confidence in patches 10 to 13. - The posibility of making "memflush = false" for CMT test was discussed during V1. Modifying this setting does not have a significant impact on the observed results that are already well within acceptable range and this version thus keeps original default. If performance was a goal it may be possible to do further experimentation where "memflush = false" could eliminate the need for the sleep(1) within the test wrapper, but improving the performance is not a goal of this work. - (New) Support what seems to be unintended ability for user space to provide parameters to "fill_buf" by making the parsing robust and only support changing parameters that are supported to be changed. Drop support for "write" operation since it has never been measured. - (New) Improve wraparound handling. (Ilpo) - (New) A couple of new fixes addressing issues discovered during development. - (Change from V1) To support fill_buf parameters provided by user space as well as test specific fill_buf parameters struct fill_buf_param is no longer just a member of struct resctrl_val_param, instead there could be at most two instances of struct fill_buf_param, the immutable parameters provided by user space and the parameters used by individual tests. (Ilpo) - Please see individual patches for detailed changes. V1 cover: The resctrl selftests for Memory Bandwidth Allocation (MBA) and Memory Bandwidth Monitoring (MBM) are failing on some (for example [1]) Emerald Rapids systems. The test failures result from the following two properties of these systems: 1) Emerald Rapids systems can have up to 320MB L3 cache. The resctrl MBA and MBM selftests measure memory traffic for which a hardcoded 250MB buffer has been sufficient so far. On platforms with L3 cache larger than the buffer, the buffer fits in the L3 cache and thus no/very little memory traffic is generated during the "memory bandwidth" tests. 2) Some platform features, for example RAS features or memory performance features that generate memory traffic may drive accesses that are counted differently by performance counters and MBM respectively, for instance generating "overhead" traffic which is not counted against any specific RMID. Until now these counting differences have always been "in the noise". On Emerald Rapids systems the maximum MBA throttling (10% memory bandwidth) throttles memory bandwidth to where memory accesses by these other platform features push the memory bandwidth difference between memory controller performance counters and resctrl (MBM) beyond the tests' hardcoded tolerance. Make the tests more robust against platform variations: 1) Let the buffer used by memory bandwidth tests be guided by the size of the L3 cache. 2) Larger buffers require longer initialization time before the buffer can be used to measurement. Rework the tests to ensure that buffer initialization is complete before measurements start. 3) Do not compare performance counters and MBM measurements at low bandwidth. The value of "low" is hardcoded to 750MiB based on measurements on Emerald Rapids, Sapphire Rapids, and Ice Lake systems. This limit is not applicable to AMD systems since it only applies to the MBA and MBM tests that are isolated to Intel. [1] https://ark.intel.com/content/www/us/en/ark/products/237261/intel-xeon-plat… Reinette Chatre (15): selftests/resctrl: Make functions only used in same file static selftests/resctrl: Print accurate buffer size as part of MBM results selftests/resctrl: Fix memory overflow due to unhandled wraparound selftests/resctrl: Protect against array overrun during iMC config parsing selftests/resctrl: Protect against array overflow when reading strings selftests/resctrl: Make wraparound handling obvious selftests/resctrl: Remove "once" parameter required to be false selftests/resctrl: Only support measured read operation selftests/resctrl: Remove unused measurement code selftests/resctrl: Make benchmark parameter passing robust selftests/resctrl: Ensure measurements skip initialization of default benchmark selftests/resctrl: Use cache size to determine "fill_buf" buffer size selftests/resctrl: Do not compare performance counters and resctrl at low bandwidth selftests/resctrl: Keep results from first test run selftests/resctrl: Replace magic constants used as array size tools/testing/selftests/resctrl/cmt_test.c | 37 +- tools/testing/selftests/resctrl/fill_buf.c | 45 +- tools/testing/selftests/resctrl/mba_test.c | 54 ++- tools/testing/selftests/resctrl/mbm_test.c | 37 +- tools/testing/selftests/resctrl/resctrl.h | 79 +++- .../testing/selftests/resctrl/resctrl_tests.c | 95 +++- tools/testing/selftests/resctrl/resctrl_val.c | 447 +++++------------- tools/testing/selftests/resctrl/resctrlfs.c | 19 +- 8 files changed, 354 insertions(+), 459 deletions(-) base-commit: 2a027d6bb66002c8e50e974676f932b33c5fce10 -- 2.47.0

1 year, 1 month

3
26
0 0

[PATCH v3 0/5] Improve arm64 pkeys handling in signal delivery

by Kevin Brodsky

This series is a follow-up to Joey's Permission Overlay Extension (POE) series [1] that recently landed on mainline. The goal is to improve the way we handle the register that governs which pkeys/POIndex are accessible (POR_EL0) during signal delivery. As things stand, we may unexpectedly fail to write the signal frame on the stack because POR_EL0 is not reset before the uaccess operations. See patch 1 for more details and the main changes this series brings. A similar series landed recently for x86/MPK [2]; the present series aims at aligning arm64 with x86. Worth noting: once the signal frame is written, POR_EL0 is still set to POR_EL0_INIT, granting access to pkey 0 only. This means that a program that sets up an alternate signal stack with a non-zero pkey will need some assembly trampoline to set POR_EL0 before invoking the real signal handler, as discussed here [3]. This is not ideal, but it makes experimentation with pkeys in signal handlers possible while waiting for a potential interface to control the pkey state when delivering a signal. See Pierre's reply [4] for more information about use-cases and a potential interface. The x86 series also added kselftests to ensure that no spurious SIGSEGV occurs during signal delivery regardless of which pkey is accessible at the point where the signal is delivered. This series adapts those kselftests to allow running them on arm64 (patch 4-5). There is a dependency on Yury's PKEY_UNRESTRICTED patch [7] for patch 4 specifically. Finally patch 2 is a clean-up following feedback on Joey's series [5]. I have tested this series on arm64 and x86_64 (booting and running the protection_keys and pkey_sighandler_tests mm kselftests). - Kevin --- v2..v3: * Reordered patches (patch 1 is now the main patch). * Patch 1: compute por_enable_all with an explicit loop, based on arch_max_pkey() (suggestion from Dave M). * Patch 4: improved naming, replaced global pkey reg value with inline helper, made use of Yury's PKEY_UNRESTRICTED macro [7] (suggestions from Dave H). v2: https://lore.kernel.org/linux-arm-kernel/20241023150511.3923558-1-kevin.bro… v1..v2: * In setup_rt_frame(), ensured that POR_EL0 is reset to its original value if we fail to deliver the signal (addresses Catalin's concern [6]). * Renamed *unpriv_access* to *user_access* in patch 3 (suggestion from Dave). * Made what patch 1-2 do explicit in the commit message body (suggestion from Dave). v1: https://lore.kernel.org/linux-arm-kernel/20241017133909.3837547-1-kevin.bro… [1] https://lore.kernel.org/linux-arm-kernel/20240822151113.1479789-1-joey.goul… [2] https://lore.kernel.org/lkml/20240802061318.2140081-1-aruna.ramakrishna@ora… [3] https://lore.kernel.org/lkml/CABi2SkWxNkP2O7ipkP67WKz0-LV33e5brReevTTtba6oK… [4] https://lore.kernel.org/linux-arm-kernel/87plns8owh.fsf@arm.com/ [5] https://lore.kernel.org/linux-arm-kernel/20241015114116.GA19334@willie-the-… [6] https://lore.kernel.org/linux-arm-kernel/Zw6D2waVyIwYE7wd@arm.com/ [7] https://lore.kernel.org/all/20241028090715.509527-2-yury.khrustalev@arm.com/ Cc: akpm(a)linux-foundation.org Cc: anshuman.khandual(a)arm.com Cc: aruna.ramakrishna(a)oracle.com Cc: broonie(a)kernel.org Cc: catalin.marinas(a)arm.com Cc: dave.hansen(a)linux.intel.com Cc: Dave.Martin(a)arm.com Cc: jeffxu(a)chromium.org Cc: joey.gouly(a)arm.com Cc: keith.lucas(a)oracle.com Cc: pierre.langlois(a)arm.com Cc: shuah(a)kernel.org Cc: sroettger(a)google.com Cc: tglx(a)linutronix.de Cc: will(a)kernel.org Cc: yury.khrustalev(a)arm.com Cc: linux-kselftest(a)vger.kernel.org Cc: x86(a)kernel.org Kevin Brodsky (5): arm64: signal: Improve POR_EL0 handling to avoid uaccess failures arm64: signal: Remove unnecessary check when saving POE state arm64: signal: Remove unused macro selftests/mm: Use generic pkey register manipulation selftests/mm: Enable pkey_sighandler_tests on arm64 arch/arm64/kernel/signal.c | 95 ++++++++++++--- tools/testing/selftests/mm/Makefile | 8 +- tools/testing/selftests/mm/pkey-arm64.h | 1 + tools/testing/selftests/mm/pkey-x86.h | 2 + .../selftests/mm/pkey_sighandler_tests.c | 115 ++++++++++++++---- 5 files changed, 176 insertions(+), 45 deletions(-) -- 2.43.0

1 year, 1 month

5
16
0 0

[PATCH v6 0/6] Add PSCI v1.3 SYSTEM_OFF2 support for hibernation

by David Woodhouse

The PSCI v1.3 spec (https://developer.arm.com/documentation/den0022) adds support for a SYSTEM_OFF2 function enabling a HIBERNATE_OFF state which is analogous to ACPI S4. This will allow hosting environments to determine that a guest is hibernated rather than just powered off, and ensure that they preserve the virtual environment appropriately to allow the guest to resume safely (or bump the hardware_signature in the FACS to trigger a clean reboot instead). This updates KVM to support advertising PSCI v1.3, and unconditionally enables the SYSTEM_OFF2 support when PSCI v1.3 is enabled. For the guest side, add a new SYS_OFF_MODE_POWER_OFF handler with higher priority than the EFI one, but which *only* triggers when there's a hibernation in progress. There are other ways to do this (see the commit message for more details) but this seemed like the simplest. Version 2 of the patch series splits out the psci.h definitions into a separate commit (a dependency for both the guest and KVM side), and adds definitions for the other new functions added in v1.3. It also moves the pKVM psci-relay support to a separate commit; although in arch/arm64/kvm that's actually about the *guest* side of SYSTEM_OFF2 (i.e. using it from the host kernel, relayed through nVHE). Version 3 dropped the KVM_CAP which allowed userspace to explicitly opt in to the new feature like with SYSTEM_SUSPEND, and makes it depend only on PSCI v1.3 being exposed to the guest. Version 4 is no longer RFC, as the PSCI v1.3 spec is finally published. Minor fixes from the last round of review, and an added KVM self test. Version 5 drops some of the changes which didn't make it to the final v1.3 spec, and cleans up a couple of places which still referred to it as 'alpha' or 'beta'. It also temporarily drops the guest-side patch to invoke SYSTEM_OFF2 for hibernation, pending confirmation that the final PSCI v1.3 spec just has a typo where it changed to saying that 0x1 should be passed to mean HIBERNATE_OFF, even though it's advertised as bit 0. That can be sent under separate cover, and perhaps should have been anyway. The change in question doesn't matter for any of the KVM patches, because we just treat SYSTEM_OFF2 like the existing SYSTEM_RESET2, setting a flag to indicate that it was a SYSTEM_OFF2 call, but not actually caring about the argument; that's for userspace to worry about. Version 6 is updated to revision F.b of the specification, allowing both 0x0 and 0x1 to indicate HIBERNATE_OFF in the SYSTEM_OFF2 call, as well as disallowing anything but zero in the second argument. The test is expanded to cover those variants, and to require PSCI v1.3 instead of skipping on older kernels. The final guest-side patch in the series is reinstated, using zero as the argument for compatibility with existing hypervisors in the field as permitted by revision F.b. allows for zero as well as 0x1 for HIBERNATE_OFF to accommodate the change in the final version of the spec, and expands the test to cover both forms as well as checking that a non-zero second argument is correctly rejected. David Woodhouse (6): firmware/psci: Add definitions for PSCI v1.3 specification KVM: arm64: Add PSCI v1.3 SYSTEM_OFF2 function for hibernation KVM: arm64: Add support for PSCI v1.2 and v1.3 KVM: selftests: Add test for PSCI SYSTEM_OFF2 KVM: arm64: nvhe: Pass through PSCI v1.3 SYSTEM_OFF2 call arm64: Use SYSTEM_OFF2 PSCI call to power off for hibernate Documentation/virt/kvm/api.rst | 11 +++ arch/arm64/include/uapi/asm/kvm.h | 6 ++ arch/arm64/kvm/hyp/nvhe/psci-relay.c | 2 + arch/arm64/kvm/hypercalls.c | 2 + arch/arm64/kvm/psci.c | 50 ++++++++++++- drivers/firmware/psci/psci.c | 42 +++++++++++ include/kvm/arm_psci.h | 4 +- include/uapi/linux/psci.h | 5 ++ kernel/power/hibernate.c | 5 +- tools/testing/selftests/kvm/aarch64/psci_test.c | 93 +++++++++++++++++++++++++ 10 files changed, 217 insertions(+), 3 deletions(-)

1 year, 1 month

6
28
0 0

[PATCH v3] selftests: clone3: Use the capget and capset syscall directly

by zhouyuhang

From: zhouyuhang <zhouyuhang(a)kylinos.cn> The libcap commit aca076443591 ("Make cap_t operations thread safe.") added a __u8 mutex at the beginning of the struct _cap_struct, it changes the offset of the members in the structure that breaks the assumption made in the "struct libcap" definition in clone3_cap_checkpoint_restore.c. This will cause the test case to fail with the following output: # RUN global.clone3_cap_checkpoint_restore ... # clone3() syscall supported # clone3_cap_checkpoint_restore.c:151:clone3_cap_checkpoint_restore:Child has PID 130508 cap_set_proc: Operation not permitted # clone3_cap_checkpoint_restore.c:160:clone3_cap_checkpoint_restore:Expected set_capability() (-1) == 0 (0) # clone3_cap_checkpoint_restore.c:161:clone3_cap_checkpoint_restore:Could not set CAP_CHECKPOINT_RESTORE # clone3_cap_checkpoint_restore: Test terminated by assertion # FAIL global.clone3_cap_checkpoint_restore Changing to using capget and capset syscall directly here can fix this error, just like what the commit 663af70aabb7 ("bpf: selftests: Add helpers to directly use the capget and capset syscall") does. Signed-off-by: zhouyuhang <zhouyuhang(a)kylinos.cn> --- .../clone3/clone3_cap_checkpoint_restore.c | 58 +++++++++---------- 1 file changed, 27 insertions(+), 31 deletions(-) diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c index 3c196fa86c99..8b61702bf721 100644 --- a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c +++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c @@ -27,6 +27,13 @@ #include "../kselftest_harness.h" #include "clone3_selftests.h" +/* + * Prevent not being defined in the header file + */ +#ifndef CAP_CHECKPOINT_RESTORE +#define CAP_CHECKPOINT_RESTORE 40 +#endif + static void child_exit(int ret) { fflush(stdout); @@ -87,47 +94,36 @@ static int test_clone3_set_tid(struct __test_metadata *_metadata, return ret; } -struct libcap { - struct __user_cap_header_struct hdr; - struct __user_cap_data_struct data[2]; -}; - static int set_capability(void) { - cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID }; - struct libcap *cap; - int ret = -1; - cap_t caps; - - caps = cap_get_proc(); - if (!caps) { - perror("cap_get_proc"); + struct __user_cap_data_struct data[2]; + struct __user_cap_header_struct hdr = { + .version = _LINUX_CAPABILITY_VERSION_3, + }; + __u32 cap0 = 1 << CAP_SETUID | 1 << CAP_SETGID; + __u32 cap1 = 1 << (CAP_CHECKPOINT_RESTORE - 32); + int ret; + + ret = capget(&hdr, data); + if (ret) { + perror("capget"); return -1; } /* Drop all capabilities */ - if (cap_clear(caps)) { - perror("cap_clear"); - goto out; - } - - cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET); - cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET); + memset(&data, 0, sizeof(data)); - cap = (struct libcap *) caps; + data[0].effective |= cap0; + data[0].permitted |= cap0; - /* 40 -> CAP_CHECKPOINT_RESTORE */ - cap->data[1].effective |= 1 << (40 - 32); - cap->data[1].permitted |= 1 << (40 - 32); + data[1].effective |= cap1; + data[1].permitted |= cap1; - if (cap_set_proc(caps)) { - perror("cap_set_proc"); - goto out; + ret = capset(&hdr, data); + if (ret) { + perror("capset"); + return -1; } - ret = 0; -out: - if (cap_free(caps)) - perror("cap_free"); return ret; } -- 2.27.0

1 year, 1 month

2
2
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-kselftest-mirror October 2024