Currently, sk_lookup allows an ebpf program to run on the ingress socket
lookup path, and accept traffic not only on a range of addresses, but
also on a range of ports. At Cloudflare we use sk_lookup for two main
cases:
1. Sharing a single port between multiple services - i.e. two services
(or more) use disjoint IP ranges but share the same port;
2. Receiving traffic on all ports - i.e. a service which accepts traffic
on specific IP ranges but any port [1].
However, one main challenge we face while using sk_lookup for these use
cases is how to source return UDP traffic:
- On point 1. above, sometimes this range of addresses are not local
(i.e. there's no local routes for these in the server), which means we
need IP_TRANSPARENT set to be able to egress traffic from addresses
we've received traffic on (or simply IP_FREEBIND in the case of IPv6);
- And on point 2. above, allowing traffic to a range of ports means a
service could get traffic on multiple ports, but currently there's no
way to set the source UDP port egress traffic should be sourced from -
it's possible to receive the original destination port using the
IP_ORIGDSTADDR ancilliary message in recvmsg, but not set it in
sendmsg.
Both of these limitations can be worked around, but in a sub-optimal
way. Using IP_TRANSPARENT, for instance, requires special privileges.
And while one could use UDP connected sockets to send return traffic,
creating a connected socket for each different address a UDP traffic is
received on does have performance implications.
Given sk_lookup allows services to accept traffic on a range of
addresses or ports, it seems sensible to also allow return traffic to
proceed through as well, without needing extra configurations / set ups.
This patch sets out to fix both of this issues by:
1. Allowing users to set the src address/port egress traffic should be
sent from, when calling sendmsg();
2. Validating that this egress traffic comes from a socket that matches
an ingress socket in sk_lookup.
- If it does, traffic is allowed to proceed;
- Otherwise it falls back to the regular egress path.
The downsides to this is that this runs on the egress hot path, although
this work tries to minimise its impact by only performing the reverse
socket lookup when necessary (i.e. only when the src address/port are
modified). Further performance measurements are to be taken, but we're
reaching out early for feedback to see what the technical concerns are
and if we can address them.
[1] https://blog.cloudflare.com/how-we-built-spectrum/
Suggested-by: Jakub Sitnicki <jakub(a)cloudflare.com>
Signed-off-by: Tiago Lam <tiagolam(a)cloudflare.com>
---
Changes in v2:
- Amended commit messages and cover letter to make the intent and
implementation clearer (Willem de Bruijn);
- Fixed socket comparison by not using socket cookies and comparing them
directly (Eric Dumazet);
- Fixed misspellings and checkpatch.pl warnings on line lengths (Simon
Horman);
- Fixed usage of start_server_addr() and gcc compilation (Philo Lu);
- Link to v1: https://lore.kernel.org/r/20240913-reverse-sk-lookup-v1-0-e721ea003d4c@clou…
---
Tiago Lam (3):
ipv4: Support setting src port in sendmsg().
ipv6: Support setting src port in sendmsg().
bpf: Add sk_lookup test to use ORIGDSTADDR cmsg.
include/net/ip.h | 1 +
net/ipv4/ip_sockglue.c | 11 +++
net/ipv4/udp.c | 35 +++++++++-
net/ipv6/datagram.c | 79 ++++++++++++++++++++++
net/ipv6/udp.c | 8 ++-
tools/testing/selftests/bpf/prog_tests/sk_lookup.c | 67 ++++++++++++------
6 files changed, 176 insertions(+), 25 deletions(-)
---
base-commit: 6562a89739bbefddb5495c09aaab67c1c3756f36
change-id: 20240909-reverse-sk-lookup-f7bf36292bc4
Best regards,
--
Tiago Lam <tiagolam(a)cloudflare.com>
+CC linux-kselftest
-------
On 22/09/2024 17:16, Gabriela Bittencourt wrote:
> Hey all,
>
> We are making these changes as part of a KUnit Hackathon at LKCamp [1].
>
> This patch sets out to refactor fs/unicode/utf8-selftest.c to KUnit tests.
>
> The first commit is the refactoring itself from self test into KUnit, while
> the second one applies the naming style conventions.
>
> We appreciate any feedback and suggestions. :)
>
> [1] https://lkcamp.dev/about/
>
> Co-developed-by: Pedro Orlando <porlando(a)lkcamp.dev>
> Co-developed-by: Danilo Pereira <dpereira(a)lkcamp.dev>
> Signed-off-by: Pedro Orlando <porlando(a)lkcamp.dev>
> Signed-off-by: Danilo Pereira <dpereira(a)lkcamp.dev>
> Signed-off-by: Gabriela Bittencourt <gbittencourt(a)lkcamp.dev>
>
> Gabriela Bittencourt (2):
> unicode: kunit: refactor selftest to kunit tests
> unicode: kunit: change tests filename and path
>
> fs/unicode/Kconfig | 5 +-
> fs/unicode/Makefile | 2 +-
> fs/unicode/tests/.kunitconfig | 3 +
> .../{utf8-selftest.c => tests/utf8_kunit.c} | 152 ++++++++----------
> 4 files changed, 76 insertions(+), 86 deletions(-)
> create mode 100644 fs/unicode/tests/.kunitconfig
> rename fs/unicode/{utf8-selftest.c => tests/utf8_kunit.c} (63%)
>
We recently notice that the step_after_suspend_test would
fail on our plenty devices. The test believesit failed to
enter suspend state with
$ sudo ./step_after_suspend_test
TAP version 13
Bail out! Failed to enter Suspend state
However, in the kernel message, I indeed see the system get
suspended and then wake up later.
[611172.033108] PM: suspend entry (s2idle)
[611172.044940] Filesystems sync: 0.006 seconds
[611172.052254] Freezing user space processes
[611172.059319] Freezing user space processes completed (elapsed 0.001 seconds)
[611172.067920] OOM killer disabled.
[611172.072465] Freezing remaining freezable tasks
[611172.080332] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[611172.089724] printk: Suspending console(s) (use no_console_suspend to debug)
[611172.117126] serial 00:03: disabled
--- some other hardware get reconnected ---
[611203.136277] OOM killer enabled.
[611203.140637] Restarting tasks ...
[611203.141135] usb 1-8.1: USB disconnect, device number 7
[611203.141755] done.
[611203.155268] random: crng reseeded on system resumption
[611203.162059] PM: suspend exit
After investigation, I notice that for the code block
if (write(power_state_fd, "mem", strlen("mem")) != strlen("mem"))
ksft_exit_fail_msg("Failed to enter Suspend state\n");
The write will return -1 and errno is set to 16 (device busy).
It should be caused by the write function is not successfully returned
before the system suspend and the return value get messed when waking up.
As a result, It may be better to check the time passed of those few instructions
to determine whether the suspend is executed correctly for it is pretty hard to
execute those few lines for 4 seconds, or even more if it is not long enough.
Fixes: bfd092b8c2728 ("selftests: breakpoint: add step_after_suspend_test")
Reported-by: Sinadin Shan <sinadin.shan(a)oracle.com>
Signed-off-by: Yifei Liu <yifei.l.liu(a)oracle.com>
---
.../selftests/breakpoints/step_after_suspend_test.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/breakpoints/step_after_suspend_test.c b/tools/testing/selftests/breakpoints/step_after_suspend_test.c
index dfec31fb9b30d..d615f091e5bae 100644
--- a/tools/testing/selftests/breakpoints/step_after_suspend_test.c
+++ b/tools/testing/selftests/breakpoints/step_after_suspend_test.c
@@ -18,6 +18,7 @@
#include <sys/timerfd.h>
#include <sys/types.h>
#include <sys/wait.h>
+#include <time.h>
#include "../kselftest.h"
@@ -133,6 +134,7 @@ void suspend(void)
int timerfd;
int err;
struct itimerspec spec = {};
+ clock_t t;
if (getuid() != 0)
ksft_exit_skip("Please run the test as root - Exiting.\n");
@@ -152,8 +154,11 @@ void suspend(void)
if (err < 0)
ksft_exit_fail_msg("timerfd_settime() failed\n");
- if (write(power_state_fd, "mem", strlen("mem")) != strlen("mem"))
- ksft_exit_fail_msg("Failed to enter Suspend state\n");
+ t = clock();
+ write(power_state_fd, "mem", strlen("mem"));
+ t = clock()-t;
+ if ((int)(t) < 4)
+ ksft_exit_fail_msg("Failed to enter Suspend state %d\n",errno);
close(timerfd);
close(power_state_fd);
--
2.45.2
grep -rnIF "#define __NR_userfaultfd"
tools/include/uapi/asm-generic/unistd.h:681:#define __NR_userfaultfd 282
arch/x86/include/generated/uapi/asm/unistd_32.h:374:#define
__NR_userfaultfd 374
arch/x86/include/generated/uapi/asm/unistd_64.h:327:#define
__NR_userfaultfd 323
arch/x86/include/generated/uapi/asm/unistd_x32.h:282:#define
__NR_userfaultfd (__X32_SYSCALL_BIT + 323)
arch/arm/include/generated/uapi/asm/unistd-eabi.h:347:#define
__NR_userfaultfd (__NR_SYSCALL_BASE + 388)
arch/arm/include/generated/uapi/asm/unistd-oabi.h:359:#define
__NR_userfaultfd (__NR_SYSCALL_BASE + 388)
include/uapi/asm-generic/unistd.h:681:#define __NR_userfaultfd 282
The number is dependent on the architecture. The above data shows that:
x86 374
x86_64 323
The value of __NR_userfaultfd was changed to 282 when
asm-generic/unistd.h was included. It makes the test to fail every time
as the correct number of this syscall on x86_64 is 323. Fix the header
to asm/unistd.h.
Fixes: a5c6bc590094 ("selftests/mm: remove local __NR_* definitions")
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
tools/testing/selftests/mm/pagemap_ioctl.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/pagemap_ioctl.c b/tools/testing/selftests/mm/pagemap_ioctl.c
index fc90af2a97b80..bcc73b4e805c6 100644
--- a/tools/testing/selftests/mm/pagemap_ioctl.c
+++ b/tools/testing/selftests/mm/pagemap_ioctl.c
@@ -15,7 +15,7 @@
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <math.h>
-#include <asm-generic/unistd.h>
+#include <asm/unistd.h>
#include <pthread.h>
#include <sys/resource.h>
#include <assert.h>
--
2.39.2
With `long` mapped to `isize`, `size_t`/`__kernel_size_t` mapped to
usize and `char` mapped to `u8`, many of the existing casts are no
longer necessary.
Signed-off-by: Gary Guo <gary(a)garyguo.net>
---
rust/kernel/kunit.rs | 10 ++--------
rust/kernel/print.rs | 4 ++--
rust/kernel/str.rs | 6 +++---
rust/kernel/uaccess.rs | 27 +++++++--------------------
4 files changed, 14 insertions(+), 33 deletions(-)
diff --git a/rust/kernel/kunit.rs b/rust/kernel/kunit.rs
index 0ba77276ae7ef..766aeb1c6aea8 100644
--- a/rust/kernel/kunit.rs
+++ b/rust/kernel/kunit.rs
@@ -17,10 +17,7 @@ pub fn err(args: fmt::Arguments<'_>) {
// are passing.
#[cfg(CONFIG_PRINTK)]
unsafe {
- bindings::_printk(
- b"\x013%pA\0".as_ptr() as _,
- &args as *const _ as *const c_void,
- );
+ bindings::_printk(b"\x013%pA\0".as_ptr(), &args as *const _ as *const c_void);
}
}
@@ -33,10 +30,7 @@ pub fn info(args: fmt::Arguments<'_>) {
// are passing.
#[cfg(CONFIG_PRINTK)]
unsafe {
- bindings::_printk(
- b"\x016%pA\0".as_ptr() as _,
- &args as *const _ as *const c_void,
- );
+ bindings::_printk(b"\x016%pA\0".as_ptr(), &args as *const _ as *const c_void);
}
}
diff --git a/rust/kernel/print.rs b/rust/kernel/print.rs
index 508b0221256c9..90ae4f2568910 100644
--- a/rust/kernel/print.rs
+++ b/rust/kernel/print.rs
@@ -104,7 +104,7 @@ pub unsafe fn call_printk(
#[cfg(CONFIG_PRINTK)]
unsafe {
bindings::_printk(
- format_string.as_ptr() as _,
+ format_string.as_ptr(),
module_name.as_ptr(),
&args as *const _ as *const c_void,
);
@@ -125,7 +125,7 @@ pub fn call_printk_cont(args: fmt::Arguments<'_>) {
#[cfg(CONFIG_PRINTK)]
unsafe {
bindings::_printk(
- format_strings::CONT.as_ptr() as _,
+ format_strings::CONT.as_ptr(),
&args as *const _ as *const c_void,
);
}
diff --git a/rust/kernel/str.rs b/rust/kernel/str.rs
index 3980d37bd4b29..2d30bca079e37 100644
--- a/rust/kernel/str.rs
+++ b/rust/kernel/str.rs
@@ -190,7 +190,7 @@ pub unsafe fn from_char_ptr<'a>(ptr: *const crate::ffi::c_char) -> &'a Self {
// to a `NUL`-terminated C string.
let len = unsafe { bindings::strlen(ptr) } + 1;
// SAFETY: Lifetime guaranteed by the safety precondition.
- let bytes = unsafe { core::slice::from_raw_parts(ptr as _, len as _) };
+ let bytes = unsafe { core::slice::from_raw_parts(ptr as _, len) };
// SAFETY: As `len` is returned by `strlen`, `bytes` does not contain interior `NUL`.
// As we have added 1 to `len`, the last byte is known to be `NUL`.
unsafe { Self::from_bytes_with_nul_unchecked(bytes) }
@@ -249,7 +249,7 @@ pub unsafe fn from_bytes_with_nul_unchecked_mut(bytes: &mut [u8]) -> &mut CStr {
/// Returns a C pointer to the string.
#[inline]
pub const fn as_char_ptr(&self) -> *const crate::ffi::c_char {
- self.0.as_ptr() as _
+ self.0.as_ptr()
}
/// Convert the string to a byte slice without the trailing `NUL` byte.
@@ -817,7 +817,7 @@ pub fn try_from_fmt(args: fmt::Arguments<'_>) -> Result<Self, Error> {
// SAFETY: The buffer is valid for read because `f.bytes_written()` is bounded by `size`
// (which the minimum buffer size) and is non-zero (we wrote at least the `NUL` terminator)
// so `f.bytes_written() - 1` doesn't underflow.
- let ptr = unsafe { bindings::memchr(buf.as_ptr().cast(), 0, (f.bytes_written() - 1) as _) };
+ let ptr = unsafe { bindings::memchr(buf.as_ptr().cast(), 0, f.bytes_written() - 1) };
if !ptr.is_null() {
return Err(EINVAL);
}
diff --git a/rust/kernel/uaccess.rs b/rust/kernel/uaccess.rs
index c746a1f1bb5ad..eb72fbcf152a1 100644
--- a/rust/kernel/uaccess.rs
+++ b/rust/kernel/uaccess.rs
@@ -8,7 +8,7 @@
alloc::Flags,
bindings,
error::Result,
- ffi::{c_ulong, c_void},
+ ffi::c_void,
prelude::*,
types::{AsBytes, FromBytes},
};
@@ -227,13 +227,9 @@ pub fn read_raw(&mut self, out: &mut [MaybeUninit<u8>]) -> Result {
if len > self.length {
return Err(EFAULT);
}
- let Ok(len_ulong) = c_ulong::try_from(len) else {
- return Err(EFAULT);
- };
- // SAFETY: `out_ptr` points into a mutable slice of length `len_ulong`, so we may write
+ // SAFETY: `out_ptr` points into a mutable slice of length `len`, so we may write
// that many bytes to it.
- let res =
- unsafe { bindings::copy_from_user(out_ptr, self.ptr as *const c_void, len_ulong) };
+ let res = unsafe { bindings::copy_from_user(out_ptr, self.ptr as *const c_void, len) };
if res != 0 {
return Err(EFAULT);
}
@@ -262,9 +258,6 @@ pub fn read<T: FromBytes>(&mut self) -> Result<T> {
if len > self.length {
return Err(EFAULT);
}
- let Ok(len_ulong) = c_ulong::try_from(len) else {
- return Err(EFAULT);
- };
let mut out: MaybeUninit<T> = MaybeUninit::uninit();
// SAFETY: The local variable `out` is valid for writing `size_of::<T>()` bytes.
//
@@ -275,7 +268,7 @@ pub fn read<T: FromBytes>(&mut self) -> Result<T> {
bindings::_copy_from_user(
out.as_mut_ptr().cast::<c_void>(),
self.ptr as *const c_void,
- len_ulong,
+ len,
)
};
if res != 0 {
@@ -338,12 +331,9 @@ pub fn write_slice(&mut self, data: &[u8]) -> Result {
if len > self.length {
return Err(EFAULT);
}
- let Ok(len_ulong) = c_ulong::try_from(len) else {
- return Err(EFAULT);
- };
- // SAFETY: `data_ptr` points into an immutable slice of length `len_ulong`, so we may read
+ // SAFETY: `data_ptr` points into an immutable slice of length `len`, so we may read
// that many bytes from it.
- let res = unsafe { bindings::copy_to_user(self.ptr as *mut c_void, data_ptr, len_ulong) };
+ let res = unsafe { bindings::copy_to_user(self.ptr as *mut c_void, data_ptr, len) };
if res != 0 {
return Err(EFAULT);
}
@@ -362,9 +352,6 @@ pub fn write<T: AsBytes>(&mut self, value: &T) -> Result {
if len > self.length {
return Err(EFAULT);
}
- let Ok(len_ulong) = c_ulong::try_from(len) else {
- return Err(EFAULT);
- };
// SAFETY: The reference points to a value of type `T`, so it is valid for reading
// `size_of::<T>()` bytes.
//
@@ -375,7 +362,7 @@ pub fn write<T: AsBytes>(&mut self, value: &T) -> Result {
bindings::_copy_to_user(
self.ptr as *mut c_void,
(value as *const T).cast::<c_void>(),
- len_ulong,
+ len,
)
};
if res != 0 {
--
2.44.1
This patch allows progs to elide a null check on statically known map
lookup keys. In other words, if the verifier can statically prove that
the lookup will be in-bounds, allow the prog to drop the null check.
This is useful for two reasons:
1. Large numbers of nullness checks (especially when they cannot fail)
unnecessarily pushes prog towards BPF_COMPLEXITY_LIMIT_JMP_SEQ.
2. It forms a tighter contract between programmer and verifier.
For (1), bpftrace is starting to make heavier use of percpu scratch
maps. As a result, for user scripts with large number of unrolled loops,
we are starting to hit jump complexity verification errors. These
percpu lookups cannot fail anyways, as we only use static key values.
Eliding nullness probably results in less work for verifier as well.
For (2), percpu scratch maps are often used as a larger stack, as the
currrent stack is limited to 512 bytes. In these situations, it is
desirable for the programmer to express: "this lookup should never fail,
and if it does, it means I messed up the code". By omitting the null
check, the programmer can "ask" the verifier to double check the logic.
Changes from v1:
* Added a check for when R2 is not a ptr to stack
* Added a check for when stack is uninitialized (no stack slot yet)
* Fix spinlock reg id bumping
* Updated existing tests to account for null elision
* Added test case for when R2 can be both const and non-const
Daniel Xu (2):
bpf: verifier: Support eliding map lookup nullness
bpf: selftests: verifier: Add nullness elision tests
kernel/bpf/verifier.c | 64 ++++++-
tools/testing/selftests/bpf/progs/iters.c | 14 +-
.../selftests/bpf/progs/map_kptr_fail.c | 2 +-
.../bpf/progs/verifier_array_access.c | 166 ++++++++++++++++++
.../selftests/bpf/progs/verifier_map_in_map.c | 2 +-
.../testing/selftests/bpf/verifier/map_kptr.c | 2 +-
6 files changed, 239 insertions(+), 11 deletions(-)
--
2.46.0
Recently we committed a fix to allow processes to receive notifications for
non-zero exits via the process connector module. Commit is a4c9a56e6a2c.
However, for threads, when it does a pthread_exit(&exit_status) call, the
kernel is not aware of the exit status with which pthread_exit is called.
It is sent by child thread to the parent process, if it is waiting in
pthread_join(). Hence, for a thread exiting abnormally, kernel cannot
send notifications to any listening processes.
The exception to this is if the thread is sent a signal which it has not
handled, and dies along with it's process as a result; for eg. SIGSEGV or
SIGKILL. In this case, kernel is aware of the non-zero exit and sends a
notification for it.
For our use case, we cannot have parent wait in pthread_join, one of the
main reasons for this being that we do not want to track normal
pthread_exit(), which could be a very large number. We only want to be
notified of any abnormal exits. Hence, threads are created with
pthread_attr_t set to PTHREAD_CREATE_DETACHED.
To fix this problem, we add a new type PROC_CN_MCAST_NOTIFY to proc connector
API, which allows a thread to send it's exit status to kernel either when
it needs to call pthread_exit() with non-zero value to indicate some
error or from signal handler before pthread_exit().
Anjali Kulkarni (2):
connector/cn_proc: Handle threads for proc connector
connector/cn_proc: Selftest for threads case
drivers/connector/cn_proc.c | 11 ++-
include/linux/cn_proc.h | 5 +-
include/uapi/linux/cn_proc.h | 4 +-
kernel/exit.c | 5 +-
tools/testing/selftests/connector/Makefile | 23 ++++-
.../testing/selftests/connector/proc_filter.c | 5 +
tools/testing/selftests/connector/thread.c | 87 +++++++++++++++++
.../selftests/connector/thread_filter.c | 93 +++++++++++++++++++
8 files changed, 226 insertions(+), 7 deletions(-)
create mode 100644 tools/testing/selftests/connector/thread.c
create mode 100644 tools/testing/selftests/connector/thread_filter.c
--
2.45.2