Hi,
First, I hope you and your relatives are doing well.
Normally, when a BPF ring buffer is full, producers can no longer write and
need to wait for the consumer to read some data.
As a consequence, calling bpf_ringbuf_reserve() from eBPF code returns NULL.
This contribution adds a new flag to make BPF ring buffer overwritable.
Perf ring buffers already implement an option to be overwritable. In order to
avoid data corruption, the data is written backward, see
commit 9ecda41acb97 ("perf/core: Add ::write_backward attribute to perf event").
This patch series reuses the same idea from perf ring buffers for BPF ring
buffers.
So, calling bpf_ringbuf_reserve() on an overwritable BPF ring buffer never
returns NULL.
As a consequence, the oldest data will be overwritten by the newest, so the
consumer will lose data.
Overwritable ring buffers are useful for BPF programs that are permanently
enabled but whose data is only read on demand, for example when a user asks to
investigate a problem. We would like to use this in the Traceloop project [1].
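For illustration, declaring such a map from a BPF program would look roughly
like the sketch below; BPF_F_RB_OVERWRITABLE is only a placeholder name here,
the real flag is defined in the first patch of this series.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 4096);
	/* Placeholder name for the new map_flags bit, see patch 1. */
	__uint(map_flags, BPF_F_RB_OVERWRITABLE);
} events SEC(".maps");

struct event {
	__u32 pid;
};

SEC("tracepoint/syscalls/sys_enter_execve")
int trace_execve(void *ctx)
{
	struct event *e;

	/* Never returns NULL on an overwritable buffer: the oldest data is
	 * overwritten instead.
	 */
	e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
	if (!e)	/* the verifier still requires this check */
		return 0;

	e->pid = bpf_get_current_pid_tgid() >> 32;
	bpf_ringbuf_submit(e, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";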
The selftest added in this series was run and validated in a VM:
you@vm# ./share/linux/tools/testing/selftests/bpf/test_progs -t ringbuf_over
Can't find bpf_testmod.ko kernel module: -2
WARNING! Selftests relying on bpf_testmod.ko will be skipped.
#135 ringbuf_over_writable:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
You can also test the libbpf implementation by using the last patch of this
series which should be applied to iovisor/bcc:
you@home$ cd /path/to/iovisor/bcc
you@home$ git am -3 v2-0005-for-test-purpose-only-Add-toy-to-play-with-BPF-ri.patch
you@home$ cd /path/to/linux/tools/lib/bpf
you@home$ make -j$(nproc)
you@home$ cp libbpf.a /path/to/iovisor/bcc/libbpf-tools/.output
you@home$ cd /path/to/iovisor/bcc/libbpf-tools/
you@home$ make -j toy
# Start your VM and copy toy executable inside it.
root@vm-amd64:~# ./share/toy &
[1] 287
root@vm-amd64:~# for i in {1..16}; do ls > /dev/null; done
16
15
14
13
12
11
10
9
root@vm-amd64:~# ls > /dev/null && ls > /dev/null
18
17
As you can see, the first eight events are overwritten.
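On the userspace side, a consumer built on the libbpf support added in patch 4
could look roughly like the sketch below; ring_buffer__new() and
ring_buffer__consume() are the existing libbpf APIs, and it is an assumption
here that no caller-side setup beyond the map flag is needed.

#include <bpf/libbpf.h>

static int handle_event(void *ctx, void *data, size_t size)
{
	/* data points to one record reserved by the BPF program. */
	return 0;
}

/* map_fd is the fd of the overwritable BPF_MAP_TYPE_RINGBUF map. */
static int dump_events(int map_fd)
{
	struct ring_buffer *rb;
	int err;

	rb = ring_buffer__new(map_fd, handle_event, NULL, NULL);
	if (!rb)
		return -1;

	/* Read whatever is currently in the buffer; on an overwritable
	 * buffer only the most recent records are still there.
	 */
	err = ring_buffer__consume(rb);

	ring_buffer__free(rb);
	return err;
}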
If you see any way to improve this contribution, feel free to share.
Changes since:
v1:
* Made producers write backward, like perf ring buffers, to avoid memory
corruption.
* Added libbpf implementation to consume all events available.
* Added selftest.
* Added documentation.
Francis Laniel (5):
bpf: Make ring buffer overwritable.
selftests: Add BPF overwritable ring buffer self tests.
docs/bpf: Add documentation for overwritable ring buffer.
libbpf: Add implementation to consume overwritable BPF ring buffer.
for test purpose only: Add toy to play with BPF ring.
...-only-Add-toy-to-play-with-BPF-ring-.patch | 147 ++++++++++++++++
Documentation/bpf/ringbuf.rst | 18 +-
include/uapi/linux/bpf.h | 3 +
kernel/bpf/ringbuf.c | 43 +++--
tools/include/uapi/linux/bpf.h | 3 +
tools/lib/bpf/ringbuf.c | 106 ++++++++++++
tools/testing/selftests/bpf/Makefile | 5 +-
.../bpf/prog_tests/ringbuf_overwritable.c | 158 ++++++++++++++++++
.../bpf/progs/test_ringbuf_overwritable.c | 61 +++++++
9 files changed, 531 insertions(+), 13 deletions(-)
create mode 100644 0001-for-test-purpose-only-Add-toy-to-play-with-BPF-ring-.patch
create mode 100644 tools/testing/selftests/bpf/prog_tests/ringbuf_overwritable.c
create mode 100644 tools/testing/selftests/bpf/progs/test_ringbuf_overwritable.c
Best regards and thank you in advance.
---
[1] https://github.com/kinvolk/traceloop
Traceloop was presented at LPC 2020 (https://lpc.events/event/7/contributions/667/)
--
2.25.1
QUIC requires end-to-end encryption of the data. The application usually
prepares the data in clear text, encrypts it and calls send(), which implies
multiple copies of the data before the packets hit the networking stack.
Similar to kTLS, QUIC kernel offload of cryptography reduces the memory
pressure by reducing the number of copies.
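For comparison, the existing kTLS Tx setup that this series takes as a model
looks roughly like the sketch below; this is today's TLS API, not the QUIC API
added by this series.

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <linux/tls.h>

#ifndef SOL_TLS
#define SOL_TLS 282
#endif

/* Enable kernel TLS Tx on an established TCP socket. */
static int enable_ktls_tx(int sock)
{
	struct tls12_crypto_info_aes_gcm_128 crypto_info = {
		.info.version = TLS_1_2_VERSION,
		.info.cipher_type = TLS_CIPHER_AES_GCM_128,
		/* key, iv, salt and rec_seq come from the TLS handshake. */
	};

	if (setsockopt(sock, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")))
		return -1;
	return setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info,
			  sizeof(crypto_info));
}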
The scope of kernel support is limited to symmetric cryptography,
leaving the handshake to the user space library. For QUIC in particular,
the application packets that require symmetric cryptography are the 1RTT
packets with short headers. The kernel will encrypt the application packets
on transmission and decrypt on receive. This series implements Tx only,
because in QUIC server applications Tx outweighs Rx by orders of
magnitude.
Supporting the combination of QUIC and GSO requires the application to
correctly place the data and the kernel to correctly slice it. The
encryption process appends an arbitrary number of bytes (tag) to the end
of the message to authenticate it. The GSO value should include this
overhead; the offload then subtracts the tag size to parse the input on Tx
before chunking and encrypting it.
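As a rough sketch of that accounting (the 16-byte AES-GCM tag and the use of
the standard UDP_SEGMENT socket option are assumptions here, not details taken
from this series):

#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/udp.h>		/* UDP_SEGMENT */

#define QUIC_TAG_LEN	16	/* assumed AEAD tag appended to each packet */

/* The GSO size handed to the kernel includes the tag overhead; the offload
 * subtracts QUIC_TAG_LEN again to slice the plaintext before encrypting each
 * chunk and appending its own tag.
 */
static int set_quic_gso_size(int sock, int payload_len)
{
	int gso_size = payload_len + QUIC_TAG_LEN;

	return setsockopt(sock, IPPROTO_UDP, UDP_SEGMENT, &gso_size,
			  sizeof(gso_size));
}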
With the kernel cryptography, the buffer copy operation is conjoined
with the encryption operation. The memory bandwidth is reduced by 5-8%.
When devices supporting QUIC encryption in hardware come to the market,
we will be able to free a further 7% of CPU utilization that is used today
for crypto operations.
Adel Abouchaev (6):
Documentation on QUIC kernel Tx crypto.
Define QUIC specific constants, control and data plane structures
Add UDP ULP operations, initialization and handling prototype
functions.
Implement QUIC offload functions
Add flow counters and Tx processing error counter
Add self tests for ULP operations, flow setup and crypto tests
Documentation/networking/index.rst | 1 +
Documentation/networking/quic.rst | 185 ++++
include/net/inet_sock.h | 2 +
include/net/netns/mib.h | 3 +
include/net/quic.h | 63 ++
include/net/snmp.h | 6 +
include/net/udp.h | 33 +
include/uapi/linux/quic.h | 60 +
include/uapi/linux/snmp.h | 9 +
include/uapi/linux/udp.h | 4 +
net/Kconfig | 1 +
net/Makefile | 1 +
net/ipv4/Makefile | 3 +-
net/ipv4/udp.c | 15 +
net/ipv4/udp_ulp.c | 192 ++++
net/quic/Kconfig | 16 +
net/quic/Makefile | 8 +
net/quic/quic_main.c | 1417 ++++++++++++++++++++++++
net/quic/quic_proc.c | 45 +
tools/testing/selftests/net/.gitignore | 4 +-
tools/testing/selftests/net/Makefile | 3 +-
tools/testing/selftests/net/quic.c | 1153 +++++++++++++++++++
tools/testing/selftests/net/quic.sh | 46 +
23 files changed, 3267 insertions(+), 3 deletions(-)
create mode 100644 Documentation/networking/quic.rst
create mode 100644 include/net/quic.h
create mode 100644 include/uapi/linux/quic.h
create mode 100644 net/ipv4/udp_ulp.c
create mode 100644 net/quic/Kconfig
create mode 100644 net/quic/Makefile
create mode 100644 net/quic/quic_main.c
create mode 100644 net/quic/quic_proc.c
create mode 100644 tools/testing/selftests/net/quic.c
create mode 100755 tools/testing/selftests/net/quic.sh
base-commit: fd78d07c7c35de260eb89f1be4a1e7487b8092ad
--
2.30.2
Page_idle uses {ptep/pmdp}_clear_young_notify which in turn calls
the mmu notifier callback ->clear_young(), which purposefully
does not flush the TLB.
When running the test in a nested guest, point 1. of the test
doc header is violated, because the KVM TLB is unbounded in size
and, since no flush is forced, KVM does not update the sptes'
accessed/idle bits, resulting in a guest assertion failure.
More precisely, only the first ACCESS_WRITE in run_test() actually
makes visible changes, because sptes are created and the accessed
bit is set to 1 (or the idle bit is 0). Then the first mark_memory_idle()
passes, since the accessed bit is still one, and sets all pages as idle
(or not accessed). When the next write is performed, the update
is not flushed, therefore idle is still 1 and the next mark_memory_idle()
fails.
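For reference, the page_idle interface involved works roughly as in the sketch
below; this illustrates /sys/kernel/mm/page_idle/bitmap and is not the exact
selftest code.

#include <stdint.h>
#include <unistd.h>

/* Each 64-bit word of the bitmap covers 64 PFNs; the fd is an O_RDWR handle
 * on /sys/kernel/mm/page_idle/bitmap.
 */
static int mark_pfn_idle(int page_idle_fd, uint64_t pfn)
{
	uint64_t bits = 1ULL << (pfn % 64);

	/* Setting the bit marks the page idle and clears the accessed bit
	 * through the clear_young notifier, without flushing the TLB.
	 */
	return pwrite(page_idle_fd, &bits, sizeof(bits),
		      (pfn / 64) * sizeof(bits)) == (ssize_t)sizeof(bits) ? 0 : -1;
}

static int pfn_still_idle(int page_idle_fd, uint64_t pfn)
{
	uint64_t bits;

	if (pread(page_idle_fd, &bits, sizeof(bits),
		  (pfn / 64) * sizeof(bits)) != (ssize_t)sizeof(bits))
		return -1;

	/* Bit still set means the page was not accessed since it was marked. */
	return !!(bits & (1ULL << (pfn % 64)));
}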
Signed-off-by: Emanuele Giuseppe Esposito <eesposit(a)redhat.com>
---
.../selftests/kvm/access_tracking_perf_test.c | 25 ++++++++++++-------
1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 1c2749b1481a..87b0bd5ebc65 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -31,8 +31,9 @@
* These limitations are worked around in this test by using a large enough
* region of memory for each vCPU such that the number of translations cached in
* the TLB and the number of pages held in pagevecs are a small fraction of the
- * overall workload. And if either of those conditions are not true this test
- * will fail rather than silently passing.
+ * overall workload. And if either of those conditions are not true (for example
+ * in nesting, where TLB size is unlimited) this test will print a warning
+ * rather than silently passing.
*/
#include <inttypes.h>
#include <limits.h>
@@ -172,17 +173,23 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
vcpu_idx, no_pfn, pages);
/*
- * Test that at least 90% of memory has been marked idle (the rest might
- * not be marked idle because the pages have not yet made it to an LRU
- * list or the translations are still cached in the TLB). 90% is
+ * Check that at least 90% of memory has been marked idle (the rest
+ * might not be marked idle because the pages have not yet made it to an
+ * LRU list or the translations are still cached in the TLB). 90% is
* arbitrary; high enough that we ensure most memory access went through
* access tracking but low enough as to not make the test too brittle
* over time and across architectures.
+ *
+ * Note that when run in nested virtualization, this check will trigger
+ * much more frequently because TLB size is unlimited and since no flush
+ * happens, many more pages are cached there and the guest won't see the
+ * "idle" bit cleared.
*/
- TEST_ASSERT(still_idle < pages / 10,
- "vCPU%d: Too many pages still idle (%"PRIu64 " out of %"
- PRIu64 ").\n",
- vcpu_idx, still_idle, pages);
+	if (still_idle >= pages / 10)
+		printf("WARNING: vCPU%d: Too many pages still idle (%" PRIu64
+		       " out of %" PRIu64 "), this will affect performance "
+		       "results.\n",
+		       vcpu_idx, still_idle, pages);
close(page_idle_fd);
close(pagemap_fd);
--
2.31.1
>
> On 26.09.22 15:03, Zhao Gongyi wrote:
> > Add checking for online_memory_expect_success()/
> > offline_memory_expect_success()/offline_memory_expect_fail(), or
> the
> > test would exit 0 although the functions return 1.
> >
> > Signed-off-by: Zhao Gongyi <zhaogongyi(a)huawei.com>
> > ---
> > .../selftests/memory-hotplug/mem-on-off-test.sh | 15
> ++++++++++++---
> > 1 file changed, 12 insertions(+), 3 deletions(-)
> >
> > diff --git a/tools/testing/selftests/memory-hotplug/mem-on-off-test.sh
> > b/tools/testing/selftests/memory-hotplug/mem-on-off-test.sh
> > index 46a97f318f58..3edda1f13f7b 100755
> > --- a/tools/testing/selftests/memory-hotplug/mem-on-off-test.sh
> > +++ b/tools/testing/selftests/memory-hotplug/mem-on-off-test.sh
> > @@ -266,7 +266,10 @@ done
> > #
> > echo $error >
> $NOTIFIER_ERR_INJECT_DIR/actions/MEM_GOING_ONLINE/error
> > for memory in `hotpluggable_offline_memory`; do
> > - online_memory_expect_fail $memory
> > + online_memory_expect_fail $memory || {
> > + echo "online memory $memory: unexpected success"
>
> The functions themself already print an error, isn't it sufficient to set
> retval=1?
Indeed, this function is only called once. There is no need to add the log info.
>
> > + retval=1
> > + }
> > done
> >
> > #
> > @@ -274,7 +277,10 @@ done
> > #
> > echo 0 >
> $NOTIFIER_ERR_INJECT_DIR/actions/MEM_GOING_ONLINE/error
> > for memory in `hotpluggable_offline_memory`; do
> > - online_memory_expect_success $memory
> > + online_memory_expect_success $memory || {
> > + echo "online memory $memory: unexpected fail"
> > + retval=1
> > + }
> > done
> >
> > #
> > @@ -283,7 +289,10 @@ done
> > echo $error >
> $NOTIFIER_ERR_INJECT_DIR/actions/MEM_GOING_OFFLINE/error
> > for memory in `hotpluggable_online_memory`; do
> > if [ $((RANDOM % 100)) -lt $ratio ]; then
> > - offline_memory_expect_fail $memory
> > + offline_memory_expect_fail $memory || {
> > + echo "offline memory $memory: unexpected success"
> > + retval=1
> > + }
>
> These functions return 0 if the result is as expected and 1 if the result is
> unexpected.
>
> ... but wouldn't we evaluate the right hand side only if the result is "0" --
> expected? I might be wrong.
>
Yes, if offline_memory_expect_fail's return value is not zero, then it will set 'retval=1'.
>
> Wouldn't it be simpler do it as in "Online all hot-pluggable memory again"
>
> if ! online_memory_expect_success $memory; then
> retval=1
> fi
>
> (similarly adjusting the function name)
I will send a new version of the patch to fix it.
Kind regards,
Gongyi
In test_sockmap.c, the test case sets the socket nonblocking first, and then
calls select() and recvmsg() to receive data.
If some error occurs, the nonblocking setting makes recvmsg() return
immediately rather than blocking forever.
However, the way fcntl() is called to set the socket nonblocking is wrong.
To make a socket nonblocking, we need to use
> fcntl(fd, F_SETFL, O_NONBLOCK);
rather than:
> fcntl(fd, O_NONBLOCK);
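As a side note, the usual read-modify-write idiom for file status flags looks
like the sketch below; the patch itself simply keeps passing the existing
fd_flags value.

#include <fcntl.h>
#include <stdio.h>

/* Set O_NONBLOCK without clobbering the other file status flags. */
static int set_nonblock(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
		perror("fcntl failed");
		return -1;
	}
	return 0;
}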
Signed-off-by: Qiao Ma <mqaio(a)linux.alibaba.com>
---
tools/testing/selftests/bpf/test_sockmap.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
index 0fbaccdc8861..abb4102f33b0 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -598,7 +598,12 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
struct timeval timeout;
fd_set w;
- fcntl(fd, fd_flags);
+ err = fcntl(fd, F_SETFL, fd_flags);
+ if (err < 0) {
+ perror("fcntl failed");
+ goto out_errno;
+ }
+
/* Account for pop bytes noting each iteration of apply will
* call msg_pop_data helper so we need to account for this
* by calculating the number of apply iterations. Note user
--
1.8.3.1