As part 3 of the vIOMMU infrastructure series, this introduces a new vEVENTQ
object. The existing FAULT object already provides a nice queue-based
notification pathway to user space, so let vEVENTQ reuse that.
Mimicking the HWPT structure, add a common EVENTQ structure to support its
derivatives: IOMMUFD_OBJ_FAULT (existing) and IOMMUFD_OBJ_VEVENTQ (new).
An IOMMUFD_CMD_VEVENTQ_ALLOC is introduced to allocate vEVENTQ objects for
vIOMMUs. One vIOMMU can have multiple vEVENTQs of different types, but it
cannot have more than one vEVENTQ of the same type.
The forwarding part is fairly simple, but it might need to replace a physical
device ID with a virtual device ID in a driver-level event data structure.
So, this also adds some helpers for drivers to use.
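As a rough illustration of how a driver might use these helpers (a hedged
sketch only: the helper names come from the patches in this series, but the
exact signatures and the IOMMU_VEVENTQ_TYPE_* constant below are assumptions,
so check the actual patches):

#include <linux/iommufd.h>

/*
 * Sketch: in a driver's event handler, translate the physical device into
 * the VM-visible virtual device ID and forward the rewritten record to the
 * vIOMMU's vEVENTQ.
 */
static void example_forward_event(struct iommufd_viommu *viommu,
                                  struct device *dev, void *evt, size_t len)
{
        unsigned long vdev_id;

        /* Assumed signature; returns -ENOENT if dev has no vDEVICE */
        if (iommufd_viommu_get_vdev_id(viommu, dev, &vdev_id))
                return;         /* device not assigned to this vIOMMU */

        /* ... replace the physical ID inside evt with vdev_id here ... */

        /* Assumed signature; queues the record onto the matching vEVENTQ */
        iommufd_viommu_report_event(viommu, IOMMU_VEVENTQ_TYPE_ARM_SMMUV3,
                                    evt, len);
}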
As usual, this series comes with selftest coverage for this new ioctl and
with a real-world use case in the ARM SMMUv3 driver.
This is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_veventq-v7
Pairing QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_veventq-v7
Changelog
v7
* Rebase on Jason's for-next tree for latest fault.c
* Add Reviewed-by
* Update commit logs
* Add __reserved field sanity
* Skip kfree() on the static header
* Replace "bool on_list" with list_is_last()
* Use u32 for flags in iommufd_vevent_header
* Drop casting in iommufd_viommu_get_vdev_id()
* Update the bounding logic to veventq->sequence
* Add missing cpu_to_le64() around STRTAB_STE_1_MEV
* Reuse veventq->common.lock to fence sequence and num_events
* Rename overflow to lost_events and log it upon kmalloc failure
* Correct the error handling part in iommufd_veventq_deliver_fetch()
* Add an arm_smmu_clear_vmaster() to simplify identity/blocked domain
attach ops
* Add four additional event records to forward to the user space VM, and
update the uAPI doc
* Reuse the existing smmu->streams_mutex lock to fence master->vmaster
pointer, instead of adding a new rwsem
v6
https://lore.kernel.org/all/cover.1737754129.git.nicolinc@nvidia.com/
* Drop supports_veventq viommu op
* Split bug/cosmetics fixes out of the series
* Drop the blocking mutex around copy_to_user()
* Add veventq_depth in uAPI to limit vEVENTQ size
* Revise the documentation for a clear description
* Fix sparse warnings in arm_vmaster_report_event()
* Rework iommufd_viommu_get_vdev_id() to return -ENOENT vs. 0
* Allow Abort/Bypass STEs to allocate vEVENTQ and set STE.MEV for DoS
mitigations
v5
https://lore.kernel.org/all/cover.1736237481.git.nicolinc@nvidia.com/
* Add Reviewed-by from Baolu
* Reorder the OBJ list as well
* Fix alphabetical order after renaming in v4
* Add supports_veventq viommu op for vEVENTQ type validation
v4
https://lore.kernel.org/all/cover.1735933254.git.nicolinc@nvidia.com/
* Rename "vIRQ" to "vEVENTQ"
* Use flexible array in struct iommufd_vevent
* Add the new ioctl command to union ucmd_buffer
* Fix the alphabetical order in union ucmd_buffer too
* Rename _TYPE_NONE to _TYPE_DEFAULT aligning with vIOMMU naming
v3
https://lore.kernel.org/all/cover.1734477608.git.nicolinc@nvidia.com/
* Rebase on Will's for-joerg/arm-smmu/updates for arm_smmu_event series
* Add "Reviewed-by" lines from Kevin
* Fix typos in comments, kdocs, and jump tags
* Add a patch to sort struct iommufd_ioctl_op
* Update iommufd's userspace-api documentation
* Update uAPI kdoc to quote the SMMUv3 official spec
* Drop the unused workqueue in struct iommufd_virq
* Drop might_sleep() in iommufd_viommu_report_irq() helper
* Add missing "break" in iommufd_viommu_get_vdev_id() helper
* Shrink the scope of the vmaster's read lock in SMMUv3 driver
* Pass in two arguments to iommufd_eventq_virq_handler() helper
* Move "!ops || !ops->read" validation into iommufd_eventq_init()
* Move "fault->ictx = ictx" closer to iommufd_ctx_get(fault->ictx)
* Update commit message for arm_smmu_attach_prepare/commit_vmaster()
* Keep "iommufd_fault" as-is and rename "iommufd_eventq_virq" to just
"iommufd_virq"
v2
https://lore.kernel.org/all/cover.1733263737.git.nicolinc@nvidia.com/
* Rebase on v6.13-rc1
* Add IOPF and vIRQ in iommufd.rst (userspace-api)
* Add a proper locking in iommufd_event_virq_destroy
* Add iommufd_event_virq_abort with a lockdep_assert_held
* Rename "EVENT_*" to "EVENTQ_*" to describe the objects better
* Reorganize flows in iommufd_eventq_virq_alloc for abort() to work
* Add struct arm_smmu_vmaster to store vSID upon attaching to a nested
domain, calling a newly added iommufd_viommu_get_vdev_id helper
* Add an arm_vmaster_report_event helper in the arm-smmu-v3-iommufd file
to simplify the routine in arm_smmu_handle_evt() of the main driver
v1
https://lore.kernel.org/all/cover.1724777091.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Nicolin Chen (14):
iommufd/fault: Move two fault functions out of the header
iommufd/fault: Add an iommufd_fault_init() helper
iommufd: Abstract an iommufd_eventq from iommufd_fault
iommufd: Rename fault.c to eventq.c
iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
iommufd/viommu: Add iommufd_viommu_get_vdev_id helper
iommufd/viommu: Add iommufd_viommu_report_event helper
iommufd/selftest: Require vdev_id when attaching to a nested domain
iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_VEVENT for vEVENTQ
coverage
iommufd/selftest: Add IOMMU_VEVENTQ_ALLOC test coverage
Documentation: userspace-api: iommufd: Update FAULT and VEVENTQ
iommu/arm-smmu-v3: Introduce struct arm_smmu_vmaster
iommu/arm-smmu-v3: Report events that belong to devices attached to
vIOMMU
iommu/arm-smmu-v3: Set MEV bit in nested STE for DoS mitigations
drivers/iommu/iommufd/Makefile | 2 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 35 +
drivers/iommu/iommufd/iommufd_private.h | 135 +++-
drivers/iommu/iommufd/iommufd_test.h | 10 +
include/linux/iommufd.h | 23 +
include/uapi/linux/iommufd.h | 105 +++
tools/testing/selftests/iommu/iommufd_utils.h | 115 ++++
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 72 +++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 79 ++-
drivers/iommu/iommufd/driver.c | 72 +++
drivers/iommu/iommufd/eventq.c | 597 ++++++++++++++++++
drivers/iommu/iommufd/fault.c | 342 ----------
drivers/iommu/iommufd/hw_pagetable.c | 6 +-
drivers/iommu/iommufd/main.c | 7 +
drivers/iommu/iommufd/selftest.c | 54 ++
drivers/iommu/iommufd/viommu.c | 2 +
tools/testing/selftests/iommu/iommufd.c | 36 ++
.../selftests/iommu/iommufd_fail_nth.c | 7 +
Documentation/userspace-api/iommufd.rst | 17 +
19 files changed, 1308 insertions(+), 408 deletions(-)
create mode 100644 drivers/iommu/iommufd/eventq.c
delete mode 100644 drivers/iommu/iommufd/fault.c
base-commit: 598749522d4254afb33b8a6c1bea614a95896868
--
2.43.0
While taking a look at '[PATCH net] pktgen: Avoid out-of-range in
get_imix_entries' ([1]) and '[PATCH net v2] pktgen: Avoid out-of-bounds
access in get_imix_entries' ([2], [3]), and doing some tests and code review,
I detected that the /proc/net/pktgen/... parsing logic does not honour the
user-given buffer bounds (resulting in out-of-bounds access).
This can be observed e.g. by the following simple test (sometimes the
old/'longer' previous value is re-read from the buffer):
$ echo add_device lo@0 > /proc/net/pktgen/kpktgend_0
$ echo "min_pkt_size 12345" > /proc/net/pktgen/lo\@0 && grep min_pkt_size /proc/net/pktgen/lo\@0
Params: count 1000 min_pkt_size: 12345 max_pkt_size: 0
Result: OK: min_pkt_size=12345
$ echo -n "min_pkt_size 123" > /proc/net/pktgen/lo\@0 && grep min_pkt_size /proc/net/pktgen/lo\@0
Params: count 1000 min_pkt_size: 12345 max_pkt_size: 0
Result: OK: min_pkt_size=12345
$ echo "min_pkt_size 123" > /proc/net/pktgen/lo\@0 && grep min_pkt_size /proc/net/pktgen/lo\@0
Params: count 1000 min_pkt_size: 123 max_pkt_size: 0
Result: OK: min_pkt_size=123
So fix the out-of-bounds access (and some minor findings) and add a simple
proc_net_pktgen selftest...
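The common pattern behind these fixes is to bound every read by the number of
bytes the user actually supplied, in addition to the parser's own digit limit.
A simplified sketch of that pattern (illustrative only, not the literal pktgen
code; the function name and the omitted digit limit are simplifications):

#include <linux/uaccess.h>

/* Parse a decimal number without reading past the user-supplied count. */
static ssize_t parse_num_bounded(const char __user *user_buffer, size_t count,
                                 unsigned long *num)
{
        size_t i;
        char c;

        *num = 0;
        for (i = 0; i < count; i++) {
                if (get_user(c, &user_buffer[i]))
                        return -EFAULT;
                if (c < '0' || c > '9')
                        break;
                *num = *num * 10 + (c - '0');
        }
        return i;       /* number of characters consumed */
}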
The patch set was split into part I (now already applied to net-next):
- net: pktgen: replace ENOTSUPP with EOPNOTSUPP
- net: pktgen: enable 'param=value' parsing
- net: pktgen: fix hex32_arg parsing for short reads
- net: pktgen: fix 'rate 0' error handling (return -EINVAL)
- net: pktgen: fix 'ratep 0' error handling (return -EINVAL)
- net: pktgen: fix ctrl interface command parsing
- net: pktgen: fix access outside of user given buffer in pktgen_thread_write()
And part II (this one):
- net: pktgen: use defines for the various dec/hex number parsing digits lengths
- net: pktgen: fix mix of int/long
- net: pktgen: remove extra tmp variable (re-use len instead)
- net: pktgen: remove some superfluous variable initializing
- net: pktgen: fix mpls maximum labels list parsing
- net: pktgen: fix access outside of user given buffer in pktgen_if_write()
- net: pktgen: fix mpls reset parsing
- net: pktgen: remove all superfluous index assignments
- selftest: net: add proc_net_pktgen
Regards,
Peter
Changes v5 -> v6:
- add rev-by Simon Horman
- drop patch 'net: pktgen: use defines for the various dec/hex number
parsing digits lengths'
- adjust to dropped patch 'net: pktgen: use defines for the various
dec/hex number parsing digits lengths'
- net: pktgen: fix mix of int/long
- fix line break (suggested by Simon Horman)
Changes v4 -> v5:
- split up patchset into part i/ii (suggested by Simon Horman)
- add rev-by Simon Horman
- net: pktgen: align some variable declarations to the most common pattern
-> net: pktgen: fix mix of int/long
- instead of align to most common pattern (int) adjust all usages to
size_t for i and max and ssize_t for len and adjust function signatures
of hex32_arg(), count_trail_chars(), num_arg() and strn_len() accordingly
- respect reverse xmas tree order for local variable declarations (where
possible without too much code churn)
- update subject line and patch description
- dropped net: pktgen: hex32_arg/num_arg error out in case no characters are
available
- keep the behaviour that an empty hex/num arg is implicitly assumed to be zero
- dropped net: pktgen: num_arg error out in case no valid character is parsed
- keep the behaviour that an empty hex/num arg is implicitly assumed to be zero
- Change patch description ('Fixes:' -> 'Addresses the following:',
suggested by Simon Horman)
- net: pktgen: remove all superfluous index assignments
- new patch (suggested by Simon Horman)
- selftest: net: add proc_net_pktgen
- adapt to dropped patch 'net: pktgen: hex32_arg/num_arg error out in case
no characters are available'; an empty hex/num arg is now implicitly assumed
to be zero (instead of a failure)
Changes v3 -> v4:
- add rev-by Simon Horman
- new patch 'net: pktgen: use defines for the various dec/hex number parsing
digits lengths' (suggested by Simon Horman)
- replace C99 comment (suggested by Paolo Abeni)
- drop available characters check in strn_len() (suggested by Paolo Abeni)
- factored out patch 'net: pktgen: align some variable declarations to the
most common pattern' (suggested by Paolo Abeni)
- factored out patch 'net: pktgen: remove extra tmp variable (re-use len
instead)' (suggested by Paolo Abeni)
- factored out patch 'net: pktgen: remove some superfluous variable
initializing' (suggested by Paolo Abeni)
- factored out patch 'net: pktgen: fix mpls maximum labels list parsing'
(suggested by Paolo Abeni)
- factored out 'net: pktgen: hex32_arg/num_arg error out in case no
characters are available' (suggested by Paolo Abeni)
- factored out 'net: pktgen: num_arg error out in case no valid character
is parsed' (suggested by Paolo Abeni)
Changes v2 -> v3:
- new patch: 'net: pktgen: fix ctrl interface command parsing'
- new patch: 'net: pktgen: fix mpls reset parsing'
- tools/testing/selftests/net/proc_net_pktgen.c:
- fix typo in change description ('v1 -> v1' and tyop)
- rename some vars to better match usage
add_loopback_0 -> thr_cmd_add_loopback_0
rm_loopback_0 -> thr_cmd_rm_loopback_0
wrong_ctrl_cmd -> wrong_thr_cmd
legacy_ctrl_cmd -> legacy_thr_cmd
ctrl_fd -> thr_fd
- add ctrl interface tests
Changes v1 -> v2:
- new patch: 'net: pktgen: fix hex32_arg parsing for short reads'
- new patch: 'net: pktgen: fix 'rate 0' error handling (return -EINVAL)'
- new patch: 'net: pktgen: fix 'ratep 0' error handling (return -EINVAL)'
- net/core/pktgen.c: additional fixes for get_imix_entries() and get_labels()
- tools/testing/selftests/net/proc_net_pktgen.c:
- fix tyop not vs. nod (suggested by Jakub Kicinski)
- fix misaligned line (suggested by Jakub Kicinski)
- enable formerly commented out CONFIG_XFRM dependent test (command spi),
as CONFIG_XFRM is enabled via tools/testing/selftests/net/config
CONFIG_XFRM_INTERFACE/CONFIG_XFRM_USER (suggested by Jakub Kicinski)
- add CONFIG_NET_PKTGEN=m to tools/testing/selftests/net/config
(suggested by Jakub Kicinski)
- add modprobe pktgen to FIXTURE_SETUP() (suggested by Jakub Kicinski)
- fix some checkpatch warnings (Missing a blank line after declarations)
- shrink line length by re-naming some variables (command -> cmd,
device -> dev)
- add 'rate 0' testcase
- add 'ratep 0' testcase
[1] https://lore.kernel.org/netdev/20241006221221.3744995-1-artem.chernyshev@re…
[2] https://lore.kernel.org/netdev/20250109083039.14004-1-pchelkin@ispras.ru/
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
Peter Seiderer (8):
net: pktgen: fix mix of int/long
net: pktgen: remove extra tmp variable (re-use len instead)
net: pktgen: remove some superfluous variable initializing
net: pktgen: fix mpls maximum labels list parsing
net: pktgen: fix access outside of user given buffer in
pktgen_if_write()
net: pktgen: fix mpls reset parsing
net: pktgen: remove all superfluous index assignments
selftest: net: add proc_net_pktgen
net/core/pktgen.c | 288 ++++----
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/config | 1 +
tools/testing/selftests/net/proc_net_pktgen.c | 646 ++++++++++++++++++
4 files changed, 805 insertions(+), 131 deletions(-)
create mode 100644 tools/testing/selftests/net/proc_net_pktgen.c
--
2.48.1
The cited commit fixed a software GSO bug with VXLAN + IPSec in tunnel
mode. Unfortunately, it is slightly broader than necessary, as it also
severely affects performance for Geneve + IPSec transport mode over a
device capable of both HW GSO and IPSec crypto offload. In this case,
xfrm_output unnecessarily triggers software GSO instead of letting the
HW do it. In simple iperf3 tests over Geneve + IPSec transport mode over
a back-to-back pair of NICs with MTU 1500, the performance was observed
to be up to 6x worse when doing software GSO compared to leaving it to
the hardware.
This commit makes xfrm_output trigger software GSO in crypto offload cases
only for already-encapsulated packets in tunnel mode, as not doing so would
cause the inner tunnel skb->inner_network_header to be overwritten and break
software GSO for that packet later if the device turns out not to be capable
of HW GSO.
Taking a closer look at the conditions for the original bug, to better
understand the reasons for this change:
- vxlan_build_skb -> iptunnel_handle_offloads sets inner_protocol and
inner network header.
- then, udp_tunnel_xmit_skb -> ip_tunnel_xmit adds outer transport and
network headers.
- later in the xmit path, xfrm_output -> xfrm_outer_mode_output ->
xfrm4_prepare_output -> xfrm4_tunnel_encap_add overwrites the inner
network header with the one set in ip_tunnel_xmit before adding the
second outer header.
- __dev_queue_xmit -> validate_xmit_skb checks whether GSO segmentation
needs to happen based on dev features. In the original bug, the hw
couldn't segment the packets, so skb_gso_segment was invoked.
- deep in the .gso_segment callback machinery, __skb_udp_tunnel_segment
tries to use the wrong inner network header, expecting the one set in
iptunnel_handle_offloads but getting the one set by xfrm instead.
- a bit later, ipv6_gso_segment accesses the wrong memory based on that
wrong inner network header.
With the new change, the original bug (or similar ones) cannot happen
again, as xfrm will now trigger software GSO before applying a tunnel.
This concern doesn't exist in packet offload mode, when the HW adds
encapsulation headers. For the non-offloaded packets (crypto in SW),
software GSO is still done unconditionally in the else branch.
Reviewed-by: Dragos Tatulea <dtatulea(a)nvidia.com>
Reviewed-by: Yael Chemla <ychemla(a)nvidia.com>
Fixes: a204aef9fd77 ("xfrm: call xfrm_output_gso when inner_protocol is set in xfrm_output")
Signed-off-by: Cosmin Ratiu <cratiu(a)nvidia.com>
---
net/xfrm/xfrm_output.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index f7abd42c077d..42f1ca513879 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -758,7 +758,7 @@ int xfrm_output(struct sock *sk, struct sk_buff *skb)
skb->encapsulation = 1;
if (skb_is_gso(skb)) {
- if (skb->inner_protocol)
+ if (skb->inner_protocol && x->props.mode == XFRM_MODE_TUNNEL)
return xfrm_output_gso(net, sk, skb);
skb_shinfo(skb)->gso_type |= SKB_GSO_ESP;
--
2.45.0
This patch series adds default build support for the sched tests in
selftests.
The only test under sched is cs_prctl_test, which validates cookies when
core scheduling is in effect. This test fails on systems where core
scheduling is disabled. The series also modifies this behaviour to
gracefully skip the test on such systems (one possible runtime check is
sketched after the sample outputs below).
For example, such a test skip would look like:
TAP version 13
1..1
timeout set to 45
selftests: sched: cs_prctl_test
Checking for CONFIG_SCHED_CORE support
Core scheduling not enabled in kernel, hence skipping tests
ok 1 selftests: sched: cs_prctl_test # SKIP
and a successful run:
TAP version 13
1..1
timeout set to 45
selftests: sched: cs_prctl_test
Checking for CONFIG_SCHED_CORE support
CONFIG_SCHED_CORE=y
.
.
.
SUCCESS !!!
ok 1 selftests: sched: cs_prctl_test
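One way such a runtime check can be implemented is by probing the
PR_SCHED_CORE prctl, which returns EINVAL when CONFIG_SCHED_CORE is not set.
This is only a sketch of one possible approach; the actual patch may detect
CONFIG_SCHED_CORE differently:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <linux/prctl.h>

#define KSFT_SKIP 4     /* kselftest skip exit code */

static void skip_if_no_core_sched(void)
{
        unsigned long cookie = 0;

        /* PR_SCHED_CORE_GET fails with EINVAL when CONFIG_SCHED_CORE=n */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0 /* current */,
                  0 /* PIDTYPE_PID */, (unsigned long)&cookie) &&
            errno == EINVAL) {
                printf("Core scheduling not enabled in kernel, hence skipping tests\n");
                exit(KSFT_SKIP);
        }
}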
Signed-off-by: Sinadin Shan <sinadin.shan(a)oracle.com>
---
v2:
* Add patch to skip cs_prctl_test on core scheduling disabled systems
* v1 link: https://lore.kernel.org/all/20250219064658.449069-1-sinadin.shan@oracle.com
---
Sinadin Shan (2):
selftests: sched: add sched as a default selftest target
selftests: sched: skip cs_prctl_test for systems with core scheduling
disabled
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/sched/cs_prctl_test.c | 29 ++++++++++++++++++-
2 files changed, 29 insertions(+), 1 deletion(-)
--
2.43.5
> > > On Mon, 2025-02-10 at 16:06 -0800, Andrii Nakryiko wrote:
> > > > Tracking associated maps for a program is not necessary. As long as
> > > > the last BPF program using the BPF map is unloaded, the kernel will
> > > > automatically free not-anymore-referenced BPF map. Note that
> > > > bpf_object itself will keep FDs for BPF maps, so you'd need to make
> > > > sure to do bpf_object__close() to release those references.
> > > >
> > > > But if you are going to ask to re-create BPF maps next time BPF
> > > > program is loaded... Well, I'll say you are asking for a bit too
> > > > much,
> > > > tbh. If you want to be *that* sophisticated, it shouldn't be too
> > > > hard
> > > > for you to get all this information from BPF program's
> > > > instructions.
> > > >
> >
> > We really are that sophisticated (see below for more details). We could
> > scan program instructions, but we'd then tie our logic to BPF
> > implementation details and duplicate logic already present in libbpf
> > (https://elixir.bootlin.com/linux/v6.13.2/source…#L6087
> > ). Obviously this *can* be done but it's not at all ideal from an
> > application perspective.
> >
>
>
> I agree it's not ideal, but it's also not some complicated and
> bound-to-be-changed logic. What you point out in libbpf source code is
> a bit different thing, reality is much simpler. Only so-called ldimm64
> instruction (BPF_LD | BPF_IMM | BPF_DW opcode) can be referencing map
> FD, so analysing this is borderline trivial. And this is part of BPF
> ISA, so not going to change.
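(For reference, such a scan really is small; the following is a hedged sketch
using only the uapi instruction encoding, not code from libbpf or from our
tree.)

#include <linux/bpf.h>
#include <stddef.h>

/* Collect the map FDs referenced by a program's ldimm64 instructions. */
static size_t collect_map_fds(const struct bpf_insn *insns, size_t insn_cnt,
                              int *fds, size_t max_fds)
{
        size_t i, n = 0;

        for (i = 0; i < insn_cnt; i++) {
                const struct bpf_insn *insn = &insns[i];

                if (insn->code != (BPF_LD | BPF_IMM | BPF_DW))
                        continue;
                if (insn->src_reg == BPF_PSEUDO_MAP_FD ||
                    insn->src_reg == BPF_PSEUDO_MAP_VALUE) {
                        if (n < max_fds)
                                fds[n++] = insn->imm;
                }
                i++;    /* ldimm64 occupies two instruction slots */
        }
        return n;
}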
Our approach is to associate an array of maps with each BPF program as a
property that is initialised at the relocation stage.
So, we do not need to parse BPF program instructions; instead, we rely on
the recorded relocations. I think this is a more robust and cleaner solution,
with the advantage of keeping all the code in one place and at a higher level
of abstraction (the relocation table).
Mainline libbpf keeps an array of maps per bpf_object; we extended this by
adding an array of maps associated with each bpf_program.
For example, here is a code excerpt from our development branch that
associates a map with a bpf_program at the relocation phase:
insn[0].src_reg = BPF_PSEUDO_MAP_FD;
insn[0].imm = map->fd;
err = bpf_program__add_map(prog, map);
> > > > > >
> > > > bpf_object is the unit of coherence in libbpf, so I don't see us
> > > > refcounting maps between bpf_objects. Kernel is doing refcounting
> > > > based on FDs, so see if you can use that.
> > > >
> >
> > I can understand that. That said, I think if there's no logic across
> > objects, and bpf_object access is not thread-safe, it puts us into a
> > tough situation:
> > - Complex refcounting, code scanning, etc to keep consistency when
> > manipulating maps used by multiple programs.
> > - Parallel loading not being well-balanced, if we split programs across
> > objects.
> >
> > We could alternatively write our own custom loader, but then we’d have
> > to duplicate much of the useful logic that libbpf already implements:
> > skeleton generation, map/program association, embedding programs into
> > ELFs, loading logic and kernel probing, etc. We’d like some way to
> > handle dynamic/parallel loading without having to replicate all the
> > advantages libbpf grants us.
> >
>
>
> Yeah, I can understand that as well, but bpf_object's single-threaded
> design and the fact that bpf_object__load is kind of the final step
> where programs are loaded (or not) is pretty baked in. I don't see
> bpf_object becoming multi-threaded.
We understood this, but the current bpf_object design allowed us to use
it in a multithreaded environment with a minor modification to bpf_program
loading.
We understand that the design choice of libbpf being single-threaded
is unlikely to be reconsidered.
> Yeah, I can understand that as well, but bpf_object's single-threaded
> design and the fact that bpf_object__load is kind of the final step
> where programs are loaded (or not) is pretty baked in. I don't see
> bpf_object becoming multi-threaded. The dynamic program
> loading/unloading/loading again is something that I can't yet justify,
> tbh.
>
>
> So the best I can propose you is to use libbpf's skeleton and
> bpf_object concept for, effectively, ELF handling, relocations, all
> the preparations up to loading BPF programs. And after that you can
> take over loading and handling program lifetime outside of bpf_object.
>
>
> Dynamic map creation after bpf_object__load() I think is completely
> outside of the scope and you'll have to solve this problem for
> yourself. I would point out, though, that internally libbpf already
> switched to sort-of pre-creating stable FDs for maps before they are
> actually created in the kernel. So it's conceivable that we can have
> more granularity in bpf_object preparation. I.e., first step would be
> to parse ELF and handle relocations, prepare everything. After that we
> can have a step to create maps, and then another one to create
> programs. Usually people would do all that, but you can stop right
> before maps creation or before program creation, whatever fits your
> use case better.
>
>
> The key is that program instructions will be final and won't need
> adjustments regardless of maps actually being created or not. FDs, as
> I mentioned, are stable regardless.
We used this in our design, so we did not need to scan BPF program
instructions to fix up the map FDs referenced by the instructions of a
dynamically loaded bpf_program with dynamically created maps.
> >
> > The use case here is that our security monitoring agent leverages eBPF
> > as its foundational technology to gather telemetry from the kernel. As
> > part of that, we hook many different kernel subsystems (process,
> > memory, filesystem, network, etc), tying them together and tracking
> > with maps. So we legitimately have a very large number of programs all
> > doing different work. For products of this scale, it increases security
> > and performance to load this set of programs and their maps in an
> > optimized, parallel fashion and subsequently change the loaded set of
> > programs and maps dynamically without disturbing the rest of the
> > application.
>
>
> Yes, makes sense. You'll need to decide for yourself if it's actually
> more meaningful to split those 200 programs into independent
> bpf_objects by features, and be rigorous about sharing state (maps)
> through bpf_map__reuse_fd(), which would allow to parallelize loading
> within confines of existing libbpf APIs. Or you can be a bit more
> low-level with program loading outside of bpf_object API, as I
> described above.
Yes, this can be one of the ways to share BPF maps across multiple bpf_objects
and use existing libbpf for parallel BPF program loading, if we want to keep
full libbpf compatibility, but at the cost of complicating the design, as we
need to convert a single-bpf_object model into multiple bpf_objects with a new
layer that manages them.
In our case, since a bpf_program can map to multiple features, which can be
modified independently, and to achieve even load balancing across multiple
threads, it would probably end up being one bpf_program per bpf_object.
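For completeness, sharing one map between two bpf_objects through the existing
API looks roughly like this (a minimal sketch with error handling trimmed; it
assumes both objects define a map with the same name and that obj2 has not
been loaded yet):

#include <bpf/libbpf.h>

/* Make obj2 reuse a map that obj1 has already created in the kernel. */
static int share_map(struct bpf_object *obj1, struct bpf_object *obj2,
                     const char *map_name)
{
        struct bpf_map *m1 = bpf_object__find_map_by_name(obj1, map_name);
        struct bpf_map *m2 = bpf_object__find_map_by_name(obj2, map_name);

        if (!m1 || !m2)
                return -1;
        /* call after obj1 is loaded and before bpf_object__load(obj2) */
        return bpf_map__reuse_fd(m2, bpf_map__fd(m1));
}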
For testing the functionality of the vDSO, it is necessary to build
userspace programs for multiple different architectures.
It is additional work to acquire matching userspace cross-compilers with
full C libraries and then to build root images out of those.
The kernel tree already contains nolibc, a small, header-only C library.
By using it, it is possible to build userspace programs without any
additional dependencies.
For example, the kernel.org crosstools or a multi-target clang can be used
to build test programs for a multitude of architectures.
While nolibc is very limited, it is enough for many selftests.
With some minor adjustments it is possible to make parse_vdso.c
compatible with nolibc.
As an example, vdso_standalone_test_x86 is now built from the same C
code as the regular vdso_test_gettimeofday, while still being completely
standalone.
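For reference, the parse_vdso API that these tests share is small; a hedged
sketch of its typical use (assuming a libc environment for brevity, and the
helper names declared in tools/testing/selftests/vDSO/parse_vdso.h):

#include <sys/auxv.h>
#include <sys/time.h>
#include "parse_vdso.h"

typedef int (*gtod_t)(struct timeval *tv, struct timezone *tz);

int main(void)
{
        unsigned long sysinfo_ehdr = getauxval(AT_SYSINFO_EHDR);
        struct timeval tv;
        gtod_t gtod;

        if (!sysinfo_ehdr)
                return 1;       /* no vDSO mapped for this process */
        vdso_init_from_sysinfo_ehdr(sysinfo_ehdr);
        gtod = (gtod_t)vdso_sym("LINUX_2.6", "__vdso_gettimeofday");
        if (!gtod)
                return 1;       /* symbol or version not found */
        return gtod(&tv, NULL) ? 1 : 0;
}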
This should probably go through the kselftest tree.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Thomas Weißschuh (16):
MAINTAINERS: Add vDSO selftests
elf, uapi: Add definition for STN_UNDEF
elf, uapi: Add definition for DT_GNU_HASH
elf, uapi: Add definitions for VER_FLG_BASE and VER_FLG_WEAK
elf, uapi: Add type ElfXX_Versym
elf, uapi: Add types ElfXX_Verdef and ElfXX_Veraux
tools/include: Add uapi/linux/elf.h
selftests: Add headers target
selftests: vDSO: vdso_standalone_test_x86: Use vdso_init_form_sysinfo_ehdr
selftests: vDSO: parse_vdso: Drop vdso_init_from_auxv()
selftests: vDSO: parse_vdso: Use UAPI headers instead of libc headers
selftests: vDSO: parse_vdso: Test __SIZEOF_LONG__ instead of ULONG_MAX
selftests: vDSO: parse_vdso: Make compatible with nolibc
selftests: vDSO: vdso_test_gettimeofday: Clean up includes
selftests: vDSO: vdso_test_gettimeofday: Make compatible with nolibc
selftests: vDSO: vdso_standalone_test_x86: Switch to nolibc
MAINTAINERS | 1 +
include/uapi/linux/elf.h | 38 ++
tools/include/uapi/linux/elf.h | 524 +++++++++++++++++++++
tools/testing/selftests/lib.mk | 5 +-
tools/testing/selftests/vDSO/Makefile | 11 +-
tools/testing/selftests/vDSO/parse_vdso.c | 21 +-
tools/testing/selftests/vDSO/parse_vdso.h | 1 -
.../selftests/vDSO/vdso_standalone_test_x86.c | 143 +-----
.../selftests/vDSO/vdso_test_gettimeofday.c | 4 +-
9 files changed, 584 insertions(+), 164 deletions(-)
---
base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b
change-id: 20241017-parse_vdso-nolibc-e069baa7ff48
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
A task in the kernel (task_mm_cid_work) runs somewhat periodically to
compact the mm_cid for each process. Add a test to validate that it runs
correctly and in a timely manner.
The test spawns one thread pinned to each CPU, then each thread, including
the main one, runs in short bursts for some time. During this period, the
mm_cids should span all numbers between 0 and nproc.
At the end of this phase, a thread with a high enough mm_cid (>= nproc/2)
is selected to be the new leader, and all other threads terminate.
After some time, the only running thread should see 0 as its mm_cid; if that
doesn't happen, the compaction mechanism didn't work and the test fails.
The test never fails if only one core is available, in which case we cannot
test anything, as the only available mm_cid is 0.
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Signed-off-by: Gabriele Monaco <gmonaco(a)redhat.com>
---
tools/testing/selftests/rseq/.gitignore | 1 +
tools/testing/selftests/rseq/Makefile | 2 +-
.../selftests/rseq/mm_cid_compaction_test.c | 200 ++++++++++++++++++
3 files changed, 202 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/rseq/mm_cid_compaction_test.c
diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
index 16496de5f6ce4..2c89f97e4f737 100644
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -3,6 +3,7 @@ basic_percpu_ops_test
basic_percpu_ops_mm_cid_test
basic_test
basic_rseq_op_test
+mm_cid_compaction_test
param_test
param_test_benchmark
param_test_compare_twice
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 5a3432fceb586..ce1b38f46a355 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -16,7 +16,7 @@ OVERRIDE_TARGETS = 1
TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
param_test_benchmark param_test_compare_twice param_test_mm_cid \
- param_test_mm_cid_benchmark param_test_mm_cid_compare_twice
+ param_test_mm_cid_benchmark param_test_mm_cid_compare_twice mm_cid_compaction_test
TEST_GEN_PROGS_EXTENDED = librseq.so
diff --git a/tools/testing/selftests/rseq/mm_cid_compaction_test.c b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
new file mode 100644
index 0000000000000..7ddde3b657dd6
--- /dev/null
+++ b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
@@ -0,0 +1,200 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stddef.h>
+
+#include "../kselftest.h"
+#include "rseq.h"
+
+#define VERBOSE 0
+#define printf_verbose(fmt, ...) \
+ do { \
+ if (VERBOSE) \
+ printf(fmt, ##__VA_ARGS__); \
+ } while (0)
+
+/* 0.5 s */
+#define RUNNER_PERIOD 500000
+/* Number of runs before we terminate or get the token */
+#define THREAD_RUNS 5
+
+/*
+ * Number of times we check that the mm_cid were compacted.
+ * Checks are repeated every RUNNER_PERIOD.
+ */
+#define MM_CID_COMPACT_TIMEOUT 10
+
+struct thread_args {
+ int cpu;
+ int num_cpus;
+ pthread_mutex_t *token;
+ pthread_barrier_t *barrier;
+ pthread_t *tinfo;
+ struct thread_args *args_head;
+};
+
+static void __noreturn *thread_runner(void *arg)
+{
+ struct thread_args *args = arg;
+ int i, ret, curr_mm_cid;
+ cpu_set_t cpumask;
+
+ CPU_ZERO(&cpumask);
+ CPU_SET(args->cpu, &cpumask);
+ ret = pthread_setaffinity_np(pthread_self(), sizeof(cpumask), &cpumask);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to set affinity");
+ abort();
+ }
+ pthread_barrier_wait(args->barrier);
+
+ for (i = 0; i < THREAD_RUNS; i++)
+ usleep(RUNNER_PERIOD);
+ curr_mm_cid = rseq_current_mm_cid();
+ /*
+ * We select one thread with high enough mm_cid to be the new leader.
+ * All other threads (including the main thread) will terminate.
+ * After some time, the mm_cid of the only remaining thread should
+ * converge to 0, if not, the test fails.
+ */
+ if (curr_mm_cid >= args->num_cpus / 2 &&
+ !pthread_mutex_trylock(args->token)) {
+ printf_verbose(
+ "cpu%d has mm_cid=%d and will be the new leader.\n",
+ sched_getcpu(), curr_mm_cid);
+ for (i = 0; i < args->num_cpus; i++) {
+ if (args->tinfo[i] == pthread_self())
+ continue;
+ ret = pthread_join(args->tinfo[i], NULL);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to join thread");
+ abort();
+ }
+ }
+ pthread_barrier_destroy(args->barrier);
+ free(args->tinfo);
+ free(args->token);
+ free(args->barrier);
+ free(args->args_head);
+
+ for (i = 0; i < MM_CID_COMPACT_TIMEOUT; i++) {
+ curr_mm_cid = rseq_current_mm_cid();
+ printf_verbose("run %d: mm_cid=%d on cpu%d.\n", i,
+ curr_mm_cid, sched_getcpu());
+ if (curr_mm_cid == 0)
+ exit(EXIT_SUCCESS);
+ usleep(RUNNER_PERIOD);
+ }
+ exit(EXIT_FAILURE);
+ }
+ printf_verbose("cpu%d has mm_cid=%d and is going to terminate.\n",
+ sched_getcpu(), curr_mm_cid);
+ pthread_exit(NULL);
+}
+
+int test_mm_cid_compaction(void)
+{
+ cpu_set_t affinity;
+ int i, j, ret = 0, num_threads;
+ pthread_t *tinfo;
+ pthread_mutex_t *token;
+ pthread_barrier_t *barrier;
+ struct thread_args *args;
+
+ sched_getaffinity(0, sizeof(affinity), &affinity);
+ num_threads = CPU_COUNT(&affinity);
+ tinfo = calloc(num_threads, sizeof(*tinfo));
+ if (!tinfo) {
+ perror("Error: failed to allocate tinfo");
+ return -1;
+ }
+ args = calloc(num_threads, sizeof(*args));
+ if (!args) {
+ perror("Error: failed to allocate args");
+ ret = -1;
+ goto out_free_tinfo;
+ }
+ token = malloc(sizeof(*token));
+ if (!token) {
+ perror("Error: failed to allocate token");
+ ret = -1;
+ goto out_free_args;
+ }
+ barrier = malloc(sizeof(*barrier));
+ if (!barrier) {
+ perror("Error: failed to allocate barrier");
+ ret = -1;
+ goto out_free_token;
+ }
+ if (num_threads == 1) {
+ fprintf(stderr, "Cannot test on a single cpu. "
+ "Skipping mm_cid_compaction test.\n");
+ /* only skipping the test, this is not a failure */
+ goto out_free_barrier;
+ }
+ pthread_mutex_init(token, NULL);
+ ret = pthread_barrier_init(barrier, NULL, num_threads);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to initialise barrier");
+ goto out_free_barrier;
+ }
+ for (i = 0, j = 0; i < CPU_SETSIZE && j < num_threads; i++) {
+ if (!CPU_ISSET(i, &affinity))
+ continue;
+ args[j].num_cpus = num_threads;
+ args[j].tinfo = tinfo;
+ args[j].token = token;
+ args[j].barrier = barrier;
+ args[j].cpu = i;
+ args[j].args_head = args;
+ if (!j) {
+ /* The first thread is the main one */
+ tinfo[0] = pthread_self();
+ ++j;
+ continue;
+ }
+ ret = pthread_create(&tinfo[j], NULL, thread_runner, &args[j]);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to create thread");
+ abort();
+ }
+ ++j;
+ }
+ printf_verbose("Started %d threads.\n", num_threads);
+
+ /* Also main thread will terminate if it is not selected as leader */
+ thread_runner(&args[0]);
+
+ /* only reached in case of errors */
+out_free_barrier:
+ free(barrier);
+out_free_token:
+ free(token);
+out_free_args:
+ free(args);
+out_free_tinfo:
+ free(tinfo);
+
+ return ret;
+}
+
+int main(int argc, char **argv)
+{
+ if (!rseq_mm_cid_available()) {
+ fprintf(stderr, "Error: rseq_mm_cid unavailable\n");
+ return -1;
+ }
+ if (test_mm_cid_compaction())
+ return -1;
+ return 0;
+}
--
2.48.1