virtio-net have two usage of hashes: one is RSS and another is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.
Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.
Introduce the code to compute hashes to the kernel in order to overcome
thse challenges.
An alternative solution is to extend the eBPF steering program so that it
will be able to report to the userspace, but it is based on context
rewrites, which is in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
and vhost_net).
The patches for QEMU to use this new feature was submitted as RFC and
is available at:
https://patchew.org/QEMU/20250313-hash-v4-0-c75c494b495e@daynix.com/
This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/
V1 -> V2:
Changed to introduce a new BPF program type.
Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
---
Changes in v10:
- Split common code and TUN/TAP-specific code into separate patches.
- Reverted a spurious style change in patch "tun: Introduce virtio-net
hash feature".
- Added a comment explaining disable_ipv6 in tests.
- Used AF_PACKET for patch "selftest: tun: Add tests for
virtio-net hashing". I also added the usage of FIXTURE_VARIANT() as
the testing function now needs access to more variant-specific
variables.
- Corrected the message of patch "selftest: tun: Add tests for
virtio-net hashing"; it mentioned validation of configuration but
it is not scope of this patch.
- Expanded the description of patch "selftest: tun: Add tests for
virtio-net hashing".
- Added patch "tun: Allow steering eBPF program to fall back".
- Changed to handle TUNGETVNETHASHCAP before taking the rtnl lock.
- Removed redundant tests for tun_vnet_ioctl().
- Added patch "selftest: tap: Add tests for virtio-net ioctls".
- Added a design explanation of ioctls for extensibility and migration.
- Removed a few branches in patch
"vhost/net: Support VIRTIO_NET_F_HASH_REPORT".
- Link to v9: https://lore.kernel.org/r/20250307-rss-v9-0-df76624025eb@daynix.com
Changes in v9:
- Added a missing return statement in patch
"tun: Introduce virtio-net hash feature".
- Link to v8: https://lore.kernel.org/r/20250306-rss-v8-0-7ab4f56ff423@daynix.com
Changes in v8:
- Disabled IPv6 to eliminate noises in tests.
- Added a branch in tap to avoid unnecessary dissection when hash
reporting is disabled.
- Removed unnecessary rtnl_lock().
- Extracted code to handle new ioctls into separate functions to avoid
adding extra NULL checks to the code handling other ioctls.
- Introduced variable named "fd" to __tun_chr_ioctl().
- s/-/=/g in a patch message to avoid confusing Git.
- Link to v7: https://lore.kernel.org/r/20250228-rss-v7-0-844205cbbdd6@daynix.com
Changes in v7:
- Ensured to set hash_report to VIRTIO_NET_HASH_REPORT_NONE for
VHOST_NET_F_VIRTIO_NET_HDR.
- s/4/sizeof(u32)/ in patch "virtio_net: Add functions for hashing".
- Added tap_skb_cb type.
- Rebased.
- Link to v6: https://lore.kernel.org/r/20250109-rss-v6-0-b1c90ad708f6@daynix.com
Changes in v6:
- Extracted changes to fill vnet header holes into another series.
- Squashed patches "skbuff: Introduce SKB_EXT_TUN_VNET_HASH", "tun:
Introduce virtio-net hash reporting feature", and "tun: Introduce
virtio-net RSS" into patch "tun: Introduce virtio-net hash feature".
- Dropped the RFC tag.
- Link to v5: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com
Changes in v5:
- Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
- Optimized the calculation of the hash value according to:
https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324ca…
- Added patch "tun: Unify vnet implementation".
- Dropped patch "tap: Pad virtio header with zero".
- Added patch "selftest: tun: Test vnet ioctls without device".
- Reworked selftests to skip for older kernels.
- Documented the case when the underlying device is deleted and packets
have queue_mapping set by TC.
- Reordered test harness arguments.
- Added code to handle fragmented packets.
- Link to v4: https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0e6c@daynix.com
Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and
"tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured to free vnet_hash when destructing tun_struct.
- Link to v3: https://lore.kernel.org/r/20240915-rss-v3-0-c630015db082@daynix.com
Changes in v3:
- Reverted back to add ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into
"tun: Introduce virtio-net hash reporting feature" and
"tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing
RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: https://lore.kernel.org/r/20231015141644.260646-1-akihiko.odaki@daynix.com
---
Akihiko Odaki (10):
virtio_net: Add functions for hashing
net: flow_dissector: Export flow_keys_dissector_symmetric
tun: Allow steering eBPF program to fall back
tun: Add common virtio-net hash feature code
tun: Introduce virtio-net hash feature
tap: Introduce virtio-net hash feature
selftest: tun: Test vnet ioctls without device
selftest: tun: Add tests for virtio-net hashing
selftest: tap: Add tests for virtio-net ioctls
vhost/net: Support VIRTIO_NET_F_HASH_REPORT
Documentation/networking/tuntap.rst | 7 +
drivers/net/Kconfig | 1 +
drivers/net/tap.c | 68 ++++-
drivers/net/tun.c | 90 +++++--
drivers/net/tun_vnet.h | 155 ++++++++++-
drivers/vhost/net.c | 68 ++---
include/linux/if_tap.h | 2 +
include/linux/skbuff.h | 3 +
include/linux/virtio_net.h | 188 ++++++++++++++
include/net/flow_dissector.h | 1 +
include/uapi/linux/if_tun.h | 82 ++++++
net/core/flow_dissector.c | 3 +-
net/core/skbuff.c | 4 +
tools/testing/selftests/net/Makefile | 2 +-
tools/testing/selftests/net/tap.c | 97 ++++++-
tools/testing/selftests/net/tun.c | 491 ++++++++++++++++++++++++++++++++++-
16 files changed, 1185 insertions(+), 77 deletions(-)
---
base-commit: dd83757f6e686a2188997cb58b5975f744bb7786
change-id: 20240403-rss-e737d89efa77
prerequisite-change-id: 20241230-tun-66e10a49b0c7:v6
prerequisite-patch-id: 871dc5f146fb6b0e3ec8612971a8e8190472c0fb
prerequisite-patch-id: 2797ed249d32590321f088373d4055ff3f430a0e
prerequisite-patch-id: ea3370c72d4904e2f0536ec76ba5d26784c0cede
prerequisite-patch-id: 837e4cf5d6b451424f9b1639455e83a260c4440d
prerequisite-patch-id: ea701076f57819e844f5a35efe5cbc5712d3080d
prerequisite-patch-id: 701646fb43ad04cc64dd2bf13c150ccbe6f828ce
prerequisite-patch-id: 53176dae0c003f5b6c114d43f936cf7140d31bb5
prerequisite-change-id: 20250116-buffers-96e14bf023fc:v2
prerequisite-patch-id: 25fd4f99d4236a05a5ef16ab79f3e85ee57e21cc
Best regards,
--
Akihiko Odaki <akihiko.odaki(a)daynix.com>
On Friday, 14 March 2025 05:14:30 CDT Su Hui wrote:
> On 2025/3/14 17:21, Dan Carpenter wrote:
> > On Fri, Mar 14, 2025 at 03:14:51PM +0800, Su Hui wrote:
> >> When 'manual=false' and 'signaled=true', then expected value when using
> >> NTSYNC_IOC_CREATE_EVENT should be greater than zero. Fix this typo error.
> >>
> >> Signed-off-by: Su Hui<suhui(a)nfschina.com>
> >> ---
> >> tools/testing/selftests/drivers/ntsync/ntsync.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/tools/testing/selftests/drivers/ntsync/ntsync.c b/tools/testing/selftests/drivers/ntsync/ntsync.c
> >> index 3aad311574c4..bfb6fad653d0 100644
> >> --- a/tools/testing/selftests/drivers/ntsync/ntsync.c
> >> +++ b/tools/testing/selftests/drivers/ntsync/ntsync.c
> >> @@ -968,7 +968,7 @@ TEST(wake_all)
> >> auto_event_args.manual = false;
> >> auto_event_args.signaled = true;
> >> objs[3] = ioctl(fd, NTSYNC_IOC_CREATE_EVENT, &auto_event_args);
> >> - EXPECT_EQ(0, objs[3]);
> >> + EXPECT_LE(0, objs[3]);
> > It's kind of weird how these macros put the constant on the left.
> > It returns an "fd" on success. So this look reasonable. It probably
> > won't return the zero fd so we could probably check EXPECT_LT()?
> Agreed, there are about 29 items that can be changed to EXPECT_LT().
> I can send a v2 patchset with this change if there is no more other
> suggestions.
I personally think it looks wrong to use EXPECT_LT(), but I'll certainly defer to a higher maintainer on this point.
Replacing all occurrences of `addr_of!(place)` with `&raw const place`, and
all occurrences of `addr_of_mut!(place)` with `&raw mut place`.
Utilizing the new feature will allow us to reduce macro complexity, and
improve consistency with existing reference syntax as `&raw const`, `&raw mut`
is very similar to `&`, `&mut` making it fit more naturally with other
existing code.
Suggested-by: Benno Lossin <benno.lossin(a)proton.me>
Link: https://github.com/Rust-for-Linux/linux/issues/1148
Signed-off-by: Antonio Hickey <contact(a)antoniohickey.com>
---
rust/kernel/block/mq/request.rs | 4 ++--
rust/kernel/faux.rs | 4 ++--
rust/kernel/fs/file.rs | 2 +-
rust/kernel/init.rs | 8 ++++----
rust/kernel/init/macros.rs | 28 +++++++++++++-------------
rust/kernel/jump_label.rs | 4 ++--
rust/kernel/kunit.rs | 4 ++--
rust/kernel/list.rs | 2 +-
rust/kernel/list/impl_list_item_mod.rs | 6 +++---
rust/kernel/net/phy.rs | 4 ++--
rust/kernel/pci.rs | 4 ++--
rust/kernel/platform.rs | 4 +---
rust/kernel/rbtree.rs | 22 ++++++++++----------
rust/kernel/sync/arc.rs | 2 +-
rust/kernel/task.rs | 4 ++--
rust/kernel/workqueue.rs | 8 ++++----
16 files changed, 54 insertions(+), 56 deletions(-)
diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
index 7943f43b9575..4a5b7ec914ef 100644
--- a/rust/kernel/block/mq/request.rs
+++ b/rust/kernel/block/mq/request.rs
@@ -12,7 +12,7 @@
};
use core::{
marker::PhantomData,
- ptr::{addr_of_mut, NonNull},
+ ptr::NonNull,
sync::atomic::{AtomicU64, Ordering},
};
@@ -187,7 +187,7 @@ pub(crate) fn refcount(&self) -> &AtomicU64 {
pub(crate) unsafe fn refcount_ptr(this: *mut Self) -> *mut AtomicU64 {
// SAFETY: Because of the safety requirements of this function, the
// field projection is safe.
- unsafe { addr_of_mut!((*this).refcount) }
+ unsafe { &raw mut (*this).refcount }
}
}
diff --git a/rust/kernel/faux.rs b/rust/kernel/faux.rs
index 5acc0c02d451..52ac554c1119 100644
--- a/rust/kernel/faux.rs
+++ b/rust/kernel/faux.rs
@@ -7,7 +7,7 @@
//! C header: [`include/linux/device/faux.h`]
use crate::{bindings, device, error::code::*, prelude::*};
-use core::ptr::{addr_of_mut, null, null_mut, NonNull};
+use core::ptr::{null, null_mut, NonNull};
/// The registration of a faux device.
///
@@ -45,7 +45,7 @@ impl AsRef<device::Device> for Registration {
fn as_ref(&self) -> &device::Device {
// SAFETY: The underlying `device` in `faux_device` is guaranteed by the C API to be
// a valid initialized `device`.
- unsafe { device::Device::as_ref(addr_of_mut!((*self.as_raw()).dev)) }
+ unsafe { device::Device::as_ref((&raw mut (*self.as_raw()).dev)) }
}
}
diff --git a/rust/kernel/fs/file.rs b/rust/kernel/fs/file.rs
index ed57e0137cdb..7ee4830b67f3 100644
--- a/rust/kernel/fs/file.rs
+++ b/rust/kernel/fs/file.rs
@@ -331,7 +331,7 @@ pub fn flags(&self) -> u32 {
// SAFETY: The file is valid because the shared reference guarantees a nonzero refcount.
//
// FIXME(read_once): Replace with `read_once` when available on the Rust side.
- unsafe { core::ptr::addr_of!((*self.as_ptr()).f_flags).read_volatile() }
+ unsafe { (&raw const (*self.as_ptr()).f_flags).read_volatile() }
}
}
diff --git a/rust/kernel/init.rs b/rust/kernel/init.rs
index 7fd1ea8265a5..a8fac6558671 100644
--- a/rust/kernel/init.rs
+++ b/rust/kernel/init.rs
@@ -122,7 +122,7 @@
//! ```rust
//! # #![expect(unreachable_pub, clippy::disallowed_names)]
//! use kernel::{init, types::Opaque};
-//! use core::{ptr::addr_of_mut, marker::PhantomPinned, pin::Pin};
+//! use core::{marker::PhantomPinned, pin::Pin};
//! # mod bindings {
//! # #![expect(non_camel_case_types)]
//! # #![expect(clippy::missing_safety_doc)]
@@ -159,7 +159,7 @@
//! unsafe {
//! init::pin_init_from_closure(move |slot: *mut Self| {
//! // `slot` contains uninit memory, avoid creating a reference.
-//! let foo = addr_of_mut!((*slot).foo);
+//! let foo = &raw mut (*slot).foo;
//!
//! // Initialize the `foo`
//! bindings::init_foo(Opaque::raw_get(foo));
@@ -541,7 +541,7 @@ macro_rules! stack_try_pin_init {
///
/// ```rust
/// # use kernel::{macros::{Zeroable, pin_data}, pin_init};
-/// # use core::{ptr::addr_of_mut, marker::PhantomPinned};
+/// # use core::marker::PhantomPinned;
/// #[pin_data]
/// #[derive(Zeroable)]
/// struct Buf {
@@ -554,7 +554,7 @@ macro_rules! stack_try_pin_init {
/// pin_init!(&this in Buf {
/// buf: [0; 64],
/// // SAFETY: TODO.
-/// ptr: unsafe { addr_of_mut!((*this.as_ptr()).buf).cast() },
+/// ptr: unsafe { &raw mut (*this.as_ptr()).buf.cast() },
/// pin: PhantomPinned,
/// });
/// pin_init!(Buf {
diff --git a/rust/kernel/init/macros.rs b/rust/kernel/init/macros.rs
index 1fd146a83241..af525fbb2f01 100644
--- a/rust/kernel/init/macros.rs
+++ b/rust/kernel/init/macros.rs
@@ -244,25 +244,25 @@
//! struct __InitOk;
//! // This is the expansion of `t,`, which is syntactic sugar for `t: t,`.
//! {
-//! unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).t), t) };
+//! unsafe { ::core::ptr::write(&raw mut (*slot).t, t) };
//! }
//! // Since initialization could fail later (not in this case, since the
//! // error type is `Infallible`) we will need to drop this field if there
//! // is an error later. This `DropGuard` will drop the field when it gets
//! // dropped and has not yet been forgotten.
//! let __t_guard = unsafe {
-//! ::pinned_init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).t))
+//! ::pinned_init::__internal::DropGuard::new(&raw mut (*slot).t)
//! };
//! // Expansion of `x: 0,`:
//! // Since this can be an arbitrary expression we cannot place it inside
//! // of the `unsafe` block, so we bind it here.
//! {
//! let x = 0;
-//! unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).x), x) };
+//! unsafe { ::core::ptr::write(&raw mut (*slot).x, x) };
//! }
//! // We again create a `DropGuard`.
//! let __x_guard = unsafe {
-//! ::kernel::init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).x))
+//! ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).x)
//! };
//! // Since initialization has successfully completed, we can now forget
//! // the guards. This is not `mem::forget`, since we only have
@@ -459,15 +459,15 @@
//! {
//! struct __InitOk;
//! {
-//! unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).a), a) };
+//! unsafe { ::core::ptr::write(&raw mut (*slot).a, a) };
//! }
//! let __a_guard = unsafe {
-//! ::kernel::init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).a))
+//! ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).a)
//! };
//! let init = Bar::new(36);
-//! unsafe { data.b(::core::addr_of_mut!((*slot).b), b)? };
+//! unsafe { data.b(&raw mut (*slot).b, b)? };
//! let __b_guard = unsafe {
-//! ::kernel::init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).b))
+//! ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).b)
//! };
//! ::core::mem::forget(__b_guard);
//! ::core::mem::forget(__a_guard);
@@ -1210,7 +1210,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
// SAFETY: `slot` is valid, because we are inside of an initializer closure, we
// return when an error/panic occurs.
// We also use the `data` to require the correct trait (`Init` or `PinInit`) for `$field`.
- unsafe { $data.$field(::core::ptr::addr_of_mut!((*$slot).$field), init)? };
+ unsafe { $data.$field(&raw mut (*$slot).$field, init)? };
// Create the drop guard:
//
// We rely on macro hygiene to make it impossible for users to access this local variable.
@@ -1218,7 +1218,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
::kernel::macros::paste! {
// SAFETY: We forget the guard later when initialization has succeeded.
let [< __ $field _guard >] = unsafe {
- $crate::init::__internal::DropGuard::new(::core::ptr::addr_of_mut!((*$slot).$field))
+ $crate::init::__internal::DropGuard::new(&raw mut (*$slot).$field)
};
$crate::__init_internal!(init_slot($use_data):
@@ -1241,7 +1241,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
//
// SAFETY: `slot` is valid, because we are inside of an initializer closure, we
// return when an error/panic occurs.
- unsafe { $crate::init::Init::__init(init, ::core::ptr::addr_of_mut!((*$slot).$field))? };
+ unsafe { $crate::init::Init::__init(init, &raw mut (*$slot).$field)? };
// Create the drop guard:
//
// We rely on macro hygiene to make it impossible for users to access this local variable.
@@ -1249,7 +1249,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
::kernel::macros::paste! {
// SAFETY: We forget the guard later when initialization has succeeded.
let [< __ $field _guard >] = unsafe {
- $crate::init::__internal::DropGuard::new(::core::ptr::addr_of_mut!((*$slot).$field))
+ $crate::init::__internal::DropGuard::new(&raw mut (*$slot).$field)
};
$crate::__init_internal!(init_slot():
@@ -1272,7 +1272,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
// Initialize the field.
//
// SAFETY: The memory at `slot` is uninitialized.
- unsafe { ::core::ptr::write(::core::ptr::addr_of_mut!((*$slot).$field), $field) };
+ unsafe { ::core::ptr::write(&raw mut (*$slot).$field, $field) };
}
// Create the drop guard:
//
@@ -1281,7 +1281,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
::kernel::macros::paste! {
// SAFETY: We forget the guard later when initialization has succeeded.
let [< __ $field _guard >] = unsafe {
- $crate::init::__internal::DropGuard::new(::core::ptr::addr_of_mut!((*$slot).$field))
+ $crate::init::__internal::DropGuard::new(&raw mut (*$slot).$field)
};
$crate::__init_internal!(init_slot($($use_data)?):
diff --git a/rust/kernel/jump_label.rs b/rust/kernel/jump_label.rs
index 4e974c768dbd..ca10abae0eee 100644
--- a/rust/kernel/jump_label.rs
+++ b/rust/kernel/jump_label.rs
@@ -20,8 +20,8 @@
#[macro_export]
macro_rules! static_branch_unlikely {
($key:path, $keytyp:ty, $field:ident) => {{
- let _key: *const $keytyp = ::core::ptr::addr_of!($key);
- let _key: *const $crate::bindings::static_key_false = ::core::ptr::addr_of!((*_key).$field);
+ let _key: *const $keytyp = &raw const $key;
+ let _key: *const $crate::bindings::static_key_false = &raw const (*_key).$field;
let _key: *const $crate::bindings::static_key = _key.cast();
#[cfg(not(CONFIG_JUMP_LABEL))]
diff --git a/rust/kernel/kunit.rs b/rust/kernel/kunit.rs
index 824da0e9738a..a17ef3b2e860 100644
--- a/rust/kernel/kunit.rs
+++ b/rust/kernel/kunit.rs
@@ -128,9 +128,9 @@ unsafe impl Sync for UnaryAssert {}
unsafe {
$crate::bindings::__kunit_do_failed_assertion(
kunit_test,
- core::ptr::addr_of!(LOCATION.0),
+ &raw const LOCATION.0,
$crate::bindings::kunit_assert_type_KUNIT_ASSERTION,
- core::ptr::addr_of!(ASSERTION.0.assert),
+ &raw const ASSERTION.0.assert,
Some($crate::bindings::kunit_unary_assert_format),
core::ptr::null(),
);
diff --git a/rust/kernel/list.rs b/rust/kernel/list.rs
index c0ed227b8a4f..e98f0820f002 100644
--- a/rust/kernel/list.rs
+++ b/rust/kernel/list.rs
@@ -176,7 +176,7 @@ pub fn new() -> impl PinInit<Self> {
#[inline]
unsafe fn fields(me: *mut Self) -> *mut ListLinksFields {
// SAFETY: The caller promises that the pointer is valid.
- unsafe { Opaque::raw_get(ptr::addr_of!((*me).inner)) }
+ unsafe { Opaque::raw_get(&raw const (*me).inner) }
}
/// # Safety
diff --git a/rust/kernel/list/impl_list_item_mod.rs b/rust/kernel/list/impl_list_item_mod.rs
index a0438537cee1..014b6713d59d 100644
--- a/rust/kernel/list/impl_list_item_mod.rs
+++ b/rust/kernel/list/impl_list_item_mod.rs
@@ -49,7 +49,7 @@ macro_rules! impl_has_list_links {
// SAFETY: The implementation of `raw_get_list_links` only compiles if the field has the
// right type.
//
- // The behavior of `raw_get_list_links` is not changed since the `addr_of_mut!` macro is
+ // The behavior of `raw_get_list_links` is not changed since the `&raw mut` op is
// equivalent to the pointer offset operation in the trait definition.
unsafe impl$(<$($implarg),*>)? $crate::list::HasListLinks$(<$id>)? for
$self $(<$($selfarg),*>)?
@@ -61,7 +61,7 @@ unsafe fn raw_get_list_links(ptr: *mut Self) -> *mut $crate::list::ListLinks$(<$
// SAFETY: The caller promises that the pointer is not dangling. We know that this
// expression doesn't follow any pointers, as the `offset_of!` invocation above
// would otherwise not compile.
- unsafe { ::core::ptr::addr_of_mut!((*ptr)$(.$field)*) }
+ unsafe { &raw mut (*ptr)$(.$field)* }
}
}
)*};
@@ -103,7 +103,7 @@ macro_rules! impl_has_list_links_self_ptr {
unsafe fn raw_get_list_links(ptr: *mut Self) -> *mut $crate::list::ListLinks$(<$id>)? {
// SAFETY: The caller promises that the pointer is not dangling.
let ptr: *mut $crate::list::ListLinksSelfPtr<$item_type $(, $id)?> =
- unsafe { ::core::ptr::addr_of_mut!((*ptr).$field) };
+ unsafe { &raw mut (*ptr).$field };
ptr.cast()
}
}
diff --git a/rust/kernel/net/phy.rs b/rust/kernel/net/phy.rs
index a59469c785e3..757db052cc09 100644
--- a/rust/kernel/net/phy.rs
+++ b/rust/kernel/net/phy.rs
@@ -7,7 +7,7 @@
//! C headers: [`include/linux/phy.h`](srctree/include/linux/phy.h).
use crate::{error::*, prelude::*, types::Opaque};
-use core::{marker::PhantomData, ptr::addr_of_mut};
+use core::marker::PhantomData;
pub mod reg;
@@ -285,7 +285,7 @@ impl AsRef<kernel::device::Device> for Device {
fn as_ref(&self) -> &kernel::device::Device {
let phydev = self.0.get();
// SAFETY: The struct invariant ensures that `mdio.dev` is valid.
- unsafe { kernel::device::Device::as_ref(addr_of_mut!((*phydev).mdio.dev)) }
+ unsafe { kernel::device::Device::as_ref(&raw mut (*phydev).mdio.dev) }
}
}
diff --git a/rust/kernel/pci.rs b/rust/kernel/pci.rs
index f7b2743828ae..6cb9ed1e7cbf 100644
--- a/rust/kernel/pci.rs
+++ b/rust/kernel/pci.rs
@@ -17,7 +17,7 @@
types::{ARef, ForeignOwnable, Opaque},
ThisModule,
};
-use core::{ops::Deref, ptr::addr_of_mut};
+use core::ops::Deref;
use kernel::prelude::*;
/// An adapter for the registration of PCI drivers.
@@ -60,7 +60,7 @@ extern "C" fn probe_callback(
) -> kernel::ffi::c_int {
// SAFETY: The PCI bus only ever calls the probe callback with a valid pointer to a
// `struct pci_dev`.
- let dev = unsafe { device::Device::get_device(addr_of_mut!((*pdev).dev)) };
+ let dev = unsafe { device::Device::get_device(&raw mut (*pdev).dev) };
// SAFETY: `dev` is guaranteed to be embedded in a valid `struct pci_dev` by the call
// above.
let mut pdev = unsafe { Device::from_dev(dev) };
diff --git a/rust/kernel/platform.rs b/rust/kernel/platform.rs
index 1297f5292ba9..344875ad7b82 100644
--- a/rust/kernel/platform.rs
+++ b/rust/kernel/platform.rs
@@ -14,8 +14,6 @@
ThisModule,
};
-use core::ptr::addr_of_mut;
-
/// An adapter for the registration of platform drivers.
pub struct Adapter<T: Driver>(T);
@@ -55,7 +53,7 @@ unsafe fn unregister(pdrv: &Opaque<Self::RegType>) {
impl<T: Driver + 'static> Adapter<T> {
extern "C" fn probe_callback(pdev: *mut bindings::platform_device) -> kernel::ffi::c_int {
// SAFETY: The platform bus only ever calls the probe callback with a valid `pdev`.
- let dev = unsafe { device::Device::get_device(addr_of_mut!((*pdev).dev)) };
+ let dev = unsafe { device::Device::get_device(&raw mut (*pdev).dev) };
// SAFETY: `dev` is guaranteed to be embedded in a valid `struct platform_device` by the
// call above.
let mut pdev = unsafe { Device::from_dev(dev) };
diff --git a/rust/kernel/rbtree.rs b/rust/kernel/rbtree.rs
index 1ea25c7092fb..b0ad35663cb0 100644
--- a/rust/kernel/rbtree.rs
+++ b/rust/kernel/rbtree.rs
@@ -11,7 +11,7 @@
cmp::{Ord, Ordering},
marker::PhantomData,
mem::MaybeUninit,
- ptr::{addr_of_mut, from_mut, NonNull},
+ ptr::{from_mut, NonNull},
};
/// A red-black tree with owned nodes.
@@ -238,7 +238,7 @@ pub fn values_mut(&mut self) -> impl Iterator<Item = &'_ mut V> {
/// Returns a cursor over the tree nodes, starting with the smallest key.
pub fn cursor_front(&mut self) -> Option<Cursor<'_, K, V>> {
- let root = addr_of_mut!(self.root);
+ let root = &raw mut self.root;
// SAFETY: `self.root` is always a valid root node
let current = unsafe { bindings::rb_first(root) };
NonNull::new(current).map(|current| {
@@ -253,7 +253,7 @@ pub fn cursor_front(&mut self) -> Option<Cursor<'_, K, V>> {
/// Returns a cursor over the tree nodes, starting with the largest key.
pub fn cursor_back(&mut self) -> Option<Cursor<'_, K, V>> {
- let root = addr_of_mut!(self.root);
+ let root = &raw mut self.root;
// SAFETY: `self.root` is always a valid root node
let current = unsafe { bindings::rb_last(root) };
NonNull::new(current).map(|current| {
@@ -459,7 +459,7 @@ pub fn cursor_lower_bound(&mut self, key: &K) -> Option<Cursor<'_, K, V>>
let best = best_match?;
// SAFETY: `best` is a non-null node so it is valid by the type invariants.
- let links = unsafe { addr_of_mut!((*best.as_ptr()).links) };
+ let links = unsafe { &raw mut (*best.as_ptr()).links };
NonNull::new(links).map(|current| {
// INVARIANT:
@@ -767,7 +767,7 @@ pub fn remove_current(self) -> (Option<Self>, RBTreeNode<K, V>) {
let node = RBTreeNode { node };
// SAFETY: The reference to the tree used to create the cursor outlives the cursor, so
// the tree cannot change. By the tree invariant, all nodes are valid.
- unsafe { bindings::rb_erase(&mut (*this).links, addr_of_mut!(self.tree.root)) };
+ unsafe { bindings::rb_erase(&mut (*this).links, &raw mut self.tree.root) };
let current = match (prev, next) {
(_, Some(next)) => next,
@@ -803,7 +803,7 @@ fn remove_neighbor(&mut self, direction: Direction) -> Option<RBTreeNode<K, V>>
let neighbor = neighbor.as_ptr();
// SAFETY: The reference to the tree used to create the cursor outlives the cursor, so
// the tree cannot change. By the tree invariant, all nodes are valid.
- unsafe { bindings::rb_erase(neighbor, addr_of_mut!(self.tree.root)) };
+ unsafe { bindings::rb_erase(neighbor, &raw mut self.tree.root) };
// SAFETY: By the type invariant of `Self`, all non-null `rb_node` pointers stored in `self`
// point to the links field of `Node<K, V>` objects.
let this = unsafe { container_of!(neighbor, Node<K, V>, links) }.cast_mut();
@@ -918,7 +918,7 @@ unsafe fn to_key_value_raw<'b>(node: NonNull<bindings::rb_node>) -> (&'b K, *mut
let k = unsafe { &(*this).key };
// SAFETY: The passed `node` is the current node or a non-null neighbor,
// thus `this` is valid by the type invariants.
- let v = unsafe { addr_of_mut!((*this).value) };
+ let v = unsafe { &raw mut (*this).value };
(k, v)
}
}
@@ -1027,7 +1027,7 @@ fn next(&mut self) -> Option<Self::Item> {
self.next = unsafe { bindings::rb_next(self.next) };
// SAFETY: By the same reasoning above, it is safe to dereference the node.
- Some(unsafe { (addr_of_mut!((*cur).key), addr_of_mut!((*cur).value)) })
+ Some(unsafe { (&raw mut (*cur).key, &raw mut (*cur).value) })
}
}
@@ -1170,7 +1170,7 @@ fn insert(self, node: RBTreeNode<K, V>) -> &'a mut V {
// SAFETY: `node` is valid at least until we call `Box::from_raw`, which only happens when
// the node is removed or replaced.
- let node_links = unsafe { addr_of_mut!((*node).links) };
+ let node_links = unsafe { &raw mut (*node).links };
// INVARIANT: We are linking in a new node, which is valid. It remains valid because we
// "forgot" it with `Box::into_raw`.
@@ -1178,7 +1178,7 @@ fn insert(self, node: RBTreeNode<K, V>) -> &'a mut V {
unsafe { bindings::rb_link_node(node_links, self.parent, self.child_field_of_parent) };
// SAFETY: All pointers are valid. `node` has just been inserted into the tree.
- unsafe { bindings::rb_insert_color(node_links, addr_of_mut!((*self.rbtree).root)) };
+ unsafe { bindings::rb_insert_color(node_links, &raw mut (*self.rbtree).root) };
// SAFETY: The node is valid until we remove it from the tree.
unsafe { &mut (*node).value }
@@ -1261,7 +1261,7 @@ fn replace(self, node: RBTreeNode<K, V>) -> RBTreeNode<K, V> {
// SAFETY: `node` is valid at least until we call `Box::from_raw`, which only happens when
// the node is removed or replaced.
- let new_node_links = unsafe { addr_of_mut!((*node).links) };
+ let new_node_links = unsafe { &raw mut (*node).links };
// SAFETY: This updates the pointers so that `new_node_links` is in the tree where
// `self.node_links` used to be.
diff --git a/rust/kernel/sync/arc.rs b/rust/kernel/sync/arc.rs
index 3cefda7a4372..81d8b0f84957 100644
--- a/rust/kernel/sync/arc.rs
+++ b/rust/kernel/sync/arc.rs
@@ -243,7 +243,7 @@ pub fn into_raw(self) -> *const T {
let ptr = self.ptr.as_ptr();
core::mem::forget(self);
// SAFETY: The pointer is valid.
- unsafe { core::ptr::addr_of!((*ptr).data) }
+ unsafe { &raw const (*ptr).data }
}
/// Recreates an [`Arc`] instance previously deconstructed via [`Arc::into_raw`].
diff --git a/rust/kernel/task.rs b/rust/kernel/task.rs
index 49012e711942..b2ac768eed23 100644
--- a/rust/kernel/task.rs
+++ b/rust/kernel/task.rs
@@ -257,7 +257,7 @@ pub fn as_ptr(&self) -> *mut bindings::task_struct {
pub fn group_leader(&self) -> &Task {
// SAFETY: The group leader of a task never changes after initialization, so reading this
// field is not a data race.
- let ptr = unsafe { *ptr::addr_of!((*self.as_ptr()).group_leader) };
+ let ptr = unsafe { *(&raw const (*self.as_ptr()).group_leader) };
// SAFETY: The lifetime of the returned task reference is tied to the lifetime of `self`,
// and given that a task has a reference to its group leader, we know it must be valid for
@@ -269,7 +269,7 @@ pub fn group_leader(&self) -> &Task {
pub fn pid(&self) -> Pid {
// SAFETY: The pid of a task never changes after initialization, so reading this field is
// not a data race.
- unsafe { *ptr::addr_of!((*self.as_ptr()).pid) }
+ unsafe { *(&raw const (*self.as_ptr()).pid) }
}
/// Returns the UID of the given task.
diff --git a/rust/kernel/workqueue.rs b/rust/kernel/workqueue.rs
index 0cd100d2aefb..34e8abb38974 100644
--- a/rust/kernel/workqueue.rs
+++ b/rust/kernel/workqueue.rs
@@ -401,9 +401,9 @@ pub fn new(name: &'static CStr, key: &'static LockClassKey) -> impl PinInit<Self
pub unsafe fn raw_get(ptr: *const Self) -> *mut bindings::work_struct {
// SAFETY: The caller promises that the pointer is aligned and not dangling.
//
- // A pointer cast would also be ok due to `#[repr(transparent)]`. We use `addr_of!` so that
- // the compiler does not complain that the `work` field is unused.
- unsafe { Opaque::raw_get(core::ptr::addr_of!((*ptr).work)) }
+ // A pointer cast would also be ok due to `#[repr(transparent)]`. We use `&raw const (*ptr).work`
+ // so that the compiler does not complain that the `work` field is unused.
+ unsafe { Opaque::raw_get(&raw const (*ptr).work) }
}
}
@@ -510,7 +510,7 @@ macro_rules! impl_has_work {
unsafe fn raw_get_work(ptr: *mut Self) -> *mut $crate::workqueue::Work<$work_type $(, $id)?> {
// SAFETY: The caller promises that the pointer is not dangling.
unsafe {
- ::core::ptr::addr_of_mut!((*ptr).$field)
+ &raw mut (*ptr).$field
}
}
}
--
2.48.1
There are four small fixes for ntsync test and doc. I divided these into
four different patches due to different types of errors. If one patch is
better, I can do it too.
Su Hui (4):
selftests: ntsync: fix the wrong condition in wake_all
selftests: ntsync: avoid possible overflow in 32-bit machine
selftests: ntsync: update config
docs: ntsync: update NTSYNC_IOC_*
Documentation/userspace-api/ntsync.rst | 18 +++++++++---------
tools/testing/selftests/drivers/ntsync/config | 2 +-
.../testing/selftests/drivers/ntsync/ntsync.c | 6 +++---
3 files changed, 13 insertions(+), 13 deletions(-)
--
2.30.2
I never had much luck running mm selftests so I spent a few hours
digging into why.
Looks like most of the reason is missing SKIP checks, so this series is
just adding a bunch of those that I found. I did not do anything like
all of them, just the ones I spotted in gup_longterm, gup_test, mmap,
userfaultfd and memfd_secret.
It's a bit unfortunate to have to skip those tests when ftruncate()
fails, but I don't have time to dig deep enough into it to actually make
them pass. I have observed the issue on 9pfs and heard rumours that NFS
has a similar problem.
I'm now able to run these test groups successfully:
- mmap
- gup_test
- compaction
- migration
- page_frag
- userfaultfd
Signed-off-by: Brendan Jackman <jackmanb(a)google.com>
---
Changes in v3:
- Added fix for userfaultfd tests.
- Dropped attempts to use sudo.
- Fixed garbage printf in uffd-stress.
(Added EXTRA_CFLAGS=-Werror FORCE_TARGETS=1 to my scripts to prevent
such errors happening again).
- Fixed missing newlines in ksft_test_result_skip() calls.
- Link to v2: https://lore.kernel.org/r/20250221-mm-selftests-v2-0-28c4d66383c5@google.com
Changes in v2 (Thanks to Dev for the reviews):
- Improve and cleanup some error messages
- Add some extra SKIPs
- Fix misnaming of nr_cpus variable in uffd tests
- Link to v1: https://lore.kernel.org/r/20250220-mm-selftests-v1-0-9bbf57d64463@google.com
---
Brendan Jackman (10):
selftests/mm: Report errno when things fail in gup_longterm
selftests/mm: Skip uffd-stress if userfaultfd not available
selftests/mm: Skip uffd-wp-mremap if userfaultfd not available
selftests/mm/uffd: Rename nr_cpus -> nr_threads
selftests/mm: Print some details when uffd-stress gets bad params
selftests/mm: Don't fail uffd-stress if too many CPUs
selftests/mm: Skip map_populate on weird filesystems
selftests/mm: Skip gup_longerm tests on weird filesystems
selftests/mm: Drop unnecessary sudo usage
selftests/mm: Ensure uffd-wp-mremap gets pages of each size
tools/testing/selftests/mm/gup_longterm.c | 45 ++++++++++++++++++----------
tools/testing/selftests/mm/map_populate.c | 7 +++++
tools/testing/selftests/mm/run_vmtests.sh | 25 ++++++++++++++--
tools/testing/selftests/mm/uffd-common.c | 8 ++---
tools/testing/selftests/mm/uffd-common.h | 2 +-
tools/testing/selftests/mm/uffd-stress.c | 42 ++++++++++++++++----------
tools/testing/selftests/mm/uffd-unit-tests.c | 2 +-
tools/testing/selftests/mm/uffd-wp-mremap.c | 5 +++-
8 files changed, 95 insertions(+), 41 deletions(-)
---
base-commit: 76544811c850a1f4c055aa182b513b7a843868ea
change-id: 20250220-mm-selftests-2d7d0542face
Best regards,
--
Brendan Jackman <jackmanb(a)google.com>
This series is built on top of the v3 write syscall support [1].
With James's KVM userfault [2], it is possible to handle stage-2 faults
in guest_memfd in userspace. However, KVM itself also triggers faults
in guest_memfd in some cases, for example: PV interfaces like kvmclock,
PV EOI and page table walking code when fetching the MMIO instruction on
x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
that KVM would be accessing those pages via userspace page tables. In
order for such faults to be handled in userspace, guest_memfd needs to
support userfaultfd.
This series proposes a limited support for userfaultfd in guest_memfd:
- userfaultfd support is conditional to `CONFIG_KVM_GMEM_SHARED_MEM`
(as is fault support in general)
- Only `page missing` event is currently supported
- Userspace is supposed to respond to the event with the `write`
syscall followed by `UFFDIO_CONTINUE` ioctl to unblock the faulting
process. Note that we can't use `UFFDIO_COPY` here because
userfaulfd code does not know how to prepare guest_memfd pages, eg
remove them from direct map [4].
Not included in this series:
- Proper interface for userfaultfd to recognise guest_memfd mappings
- Proper handling of truncation cases after locking the page
Request for comments:
- Is it a sensible workflow for guest_memfd to resolve a userfault
`page missing` event with `write` syscall + `UFFDIO_CONTINUE`? One
of the alternatives is teaching `UFFDIO_COPY` how to deal with
guest_memfd pages.
- What is a way forward to make userfaultfd code aware of guest_memfd?
I saw that Patrick hit a somewhat similar problem in [5] when trying
to use direct map manipulation functions in KVM and was pointed by
David at Elliot's guestmem library [6] that might include a shim for that.
Would the library be the right place to expose required interfaces like
`vma_is_gmem`?
Nikita
[1] https://lore.kernel.org/kvm/20250303130838.28812-1-kalyazin@amazon.com/T/
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/…
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAo…
[4] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/
[4] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/…
[5] https://lore.kernel.org/kvm/20241122-guestmem-library-v5-2-450e92951a15@qui…
Nikita Kalyazin (5):
KVM: guest_memfd: add kvm_gmem_vma_is_gmem
KVM: guest_memfd: add support for uffd missing
mm: userfaultfd: allow to register userfaultfd for guest_memfd
mm: userfaultfd: support continue for guest_memfd
KVM: selftests: add uffd missing test for guest_memfd
include/linux/userfaultfd_k.h | 9 ++
mm/userfaultfd.c | 23 ++++-
.../testing/selftests/kvm/guest_memfd_test.c | 88 +++++++++++++++++++
virt/kvm/guest_memfd.c | 17 +++-
virt/kvm/kvm_mm.h | 1 +
5 files changed, 136 insertions(+), 2 deletions(-)
base-commit: 592e7531753dc4b711f96cd1daf808fd493d3223
--
2.47.1
Basics and overview
===================
Software with larger attack surfaces (e.g. network facing apps like databases,
browsers or apps relying on browser runtimes) suffer from memory corruption
issues which can be utilized by attackers to bend control flow of the program
to eventually gain control (by making their payload executable). Attackers are
able to perform such attacks by leveraging call-sites which rely on indirect
calls or return sites which rely on obtaining return address from stack memory.
To mitigate such attacks, risc-v extension zicfilp enforces that all indirect
calls must land on a landing pad instruction `lpad` else cpu will raise software
check exception (a new cpu exception cause code on riscv).
Similarly for return flow, risc-v extension zicfiss extends architecture with
- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack
and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if comparision above
was a mismatch
- Protection mechanism using which shadow stack is not writeable via
regular store instructions
More information an details can be found at extensions github repo [1].
Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel
CET [3] and branch target identification (BTI) [4] on arm.
Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control
stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack.
x86 and arm64 support for user mode shadow stack is already in mainline.
Kernel awareness for user control flow integrity
================================================
This series picks up Samuel Holland's envcfg changes [2] as well. So if those are
being applied independently, they should be removed from this series.
Enabling:
In order to maintain compatibility and not break anything in user mode, kernel
doesn't enable control flow integrity cpu extensions on binary by default.
Instead exposes a prctl interface to enable, disable and lock the shadow stack
or landing pad feature for a task. This allows userspace (loader) to enumerate
if all objects in its address space are compiled with shadow stack and landing
pad support and accordingly enable the feature. Additionally if a subsequent
`dlopen` happens on a library, user mode can take a decision again to disable
the feature (if incoming library is not compiled with support) OR terminate the
task (if user mode policy is strict to have all objects in address space to be
compiled with control flow integirty cpu feature). prctl to enable shadow stack
results in allocating shadow stack from virtual memory and activating for user
address space. x86 and arm64 are also following same direction due to similar
reason(s).
clone/fork:
On clone and fork, cfi state for task is inherited by child. Shadow stack is
part of virtual memory and is a writeable memory from kernel perspective
(writeable via a restricted set of instructions aka shadow stack instructions)
Thus kernel changes ensure that this memory is converted into read-only when
fork/clone happens and COWed when fault is taken due to sspush, sspopchk or
ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled,
kernel will automatically allocate a shadow stack for that clone call.
map_shadow_stack:
x86 introduced `map_shadow_stack` system call to allow user space to explicitly
map shadow stack memory in its address space. It is useful to allocate shadow
for different contexts managed by a single thread (green threads or contexts)
risc-v implements this system call as well.
signal management:
If shadow stack is enabled for a task, kernel performs an asynchronous control
flow diversion to deliver the signal and eventually expects userspace to issue
sigreturn so that original execution can be resumed. Even though resume context
is prepared by kernel, it is in user space memory and is subject to memory
corruption and corruption bugs can be utilized by attacker in this race window
to perform arbitrary sigreturn and eventually bypass cfi mechanism.
Another issue is how to ensure that cfi related state on sigcontext area is not
trampled by legacy apps or apps compiled with old kernel headers.
In order to mitigate control-flow hijacting, kernel prepares a token and place
it on shadow stack before signal delivery and places address of token in
sigcontext structure. During sigreturn, kernel obtains address of token from
sigcontext struture, reads token from shadow stack and validates it and only
then allow sigreturn to succeed. Compatiblity issue is solved by adopting
dynamic sigcontext management introduced for vector extension. This series
re-factor the code little bit to allow future sigcontext management easy (as
proposed by Andy Chiu from SiFive)
config and compilation:
Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this
config option picks the kernel support for user control flow integrity. This
optin is presented only if toolchain has shadow stack and landing pad support.
And is on purpose guarded by toolchain support. Reason being that eventually
vDSO also needs to be compiled in with shadow stack and landing pad support.
vDSO compile patches are not included as of now because landing pad labeling
scheme is yet to settle for usermode runtime.
To get more information on kernel interactions with respect to
zicfilp and zicfiss, patch series adds documentation for
`zicfilp` and `zicfiss` in following:
Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst
How to test this series
=======================
Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)
Qemu
----
Get the lastest qemu
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)
Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic
Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain
supports it.
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc)
In case you're building your own rootfs using toolchain, please make sure you
pick following patch to ensure that vDSO compiled with lpad and shadow stack.
"arch/riscv: compile vdso with landing pad"
Branch where above patch can be picked
https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1
Running
-------
Modify your qemu command to have:
-bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin
-cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true
vDSO related Opens (in the flux)
=================================
I am listing these opens for laying out plan and what to expect in future
patch sets. And of course for the sake of discussion.
Shadow stack and landing pad enabling in vDSO
----------------------------------------------
vDSO must have shadow stack and landing pad support compiled in for task
to have shadow stack and landing pad support. This patch series doesn't
enable that (yet). Enabling shadow stack support in vDSO should be
straight forward (intend to do that in next versions of patch set). Enabling
landing pad support in vDSO requires some collaboration with toolchain folks
to follow a single label scheme for all object binaries. This is necessary to
ensure that all indirect call-sites are setting correct label and target landing
pads are decorated with same label scheme.
How many vDSOs
---------------
Shadow stack instructions are carved out of zimop (may be operations) and if CPU
doesn't implement zimop, they're illegal instructions. Kernel could be running on
a CPU which may or may not implement zimop. And thus kernel will have to carry 2
different vDSOs and expose the appropriate one depending on whether CPU implements
zimop or not.
References
==========
[1] - https://github.com/riscv/riscv-cfi
[2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c…
[3] - https://lwn.net/Articles/889475/
[4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific…
[5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i…
[6] - https://lwn.net/Articles/940403/
---
changelog
---------
v11:
- patch "arch/riscv: compile vdso with landing pad" was unconditionally
selecting `_zicfilp` for vDSO compile. fixed that. Changed `lpad 1` to
to `lpad 0`.
v10:
- dropped "mm: helper `is_shadow_stack_vma` to check shadow stack vma". This patch
is not that interesting to this patch series for risc-v. There are instances in
arch directories where VM_SHADOW_STACK flag is anyways used. Dropping this patch
to expedite merging in riscv tree.
- Took suggestions from `Clement` on "riscv: zicfiss / zicfilp enumeration" to
validate presence of cfi based on config.
- Added a patch for vDSO to have `lpad 0`. I had omitted this earlier to make sure
we add single vdso object with cfi enabled. But a vdso object with scheme of
zero labeled landing pad is least common denominator and should work with all
objects of zero labeled as well as function-signature labeled objects.
v9:
- rebased on master (39a803b754d5 fix braino in "9p: fix ->rename_sem exclusion")
- dropped "mm: Introduce ARCH_HAS_USER_SHADOW_STACK" (master has it from arm64/gcs)
- dropped "prctl: arch-agnostic prctl for shadow stack" (master has it from arm64/gcs)
v8:
- rebased on palmer/for-next
- dropped samuel holland's `envcfg` context switch patches.
they are in parlmer/for-next
v7:
- Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv"
Instead using `deactivate_mm` flow to clean up.
see here for more context
https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.…
- Changed the header include in `kselftest`. Hopefully this fixes compile
issue faced by Zong Li at SiFive.
- Cleaned up an orphaned change to `mm/mmap.c` in below patch
"riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE"
- Lock interfaces for shadow stack and indirect branch tracking expect arg == 0
Any future evolution of this interface should accordingly define how arg should
be setup.
- `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use helper
`is_shadow_stack_vma`.
- Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv…
v6:
- Picked up Samuel Holland's changes as is with `envcfg` placed in
`thread` instead of `thread_info`
- fixed unaligned newline escapes in kselftest
- cleaned up messages in kselftest and included test output in commit message
- fixed a bug in clone path reported by Zong Li
- fixed a build issue if CONFIG_RISCV_ISA_V is not selected
(this was introduced due to re-factoring signal context
management code)
v5:
- rebased on v6.12-rc1
- Fixed schema related issues in device tree file
- Fixed some of the documentation related issues in zicfilp/ss.rst
(style issues and added index)
- added `SHADOW_STACK_SET_MARKER` so that implementation can define base
of shadow stack.
- Fixed warnings on definitions added in usercfi.h when
CONFIG_RISCV_USER_CFI is not selected.
- Adopted context header based signal handling as proposed by Andy Chiu
- Added support for enabling kernel mode access to shadow stack using
FWFT
(https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…)
- Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv…
(Note: I had an issue in my workflow due to which version number wasn't
picked up correctly while sending out patches)
v4:
- rebased on 6.11-rc6
- envcfg: Converged with Samuel Holland's patches for envcfg management on per-
thread basis.
- vma_is_shadow_stack is renamed to is_vma_shadow_stack
- picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch
- signal context: using extended context management to maintain compatibility.
- fixed `-Wmissing-prototypes` compiler warnings for prctl functions
- Documentation fixes and amending typos.
- Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/
v3:
- envcfg
logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been
picked on per task basis, even though CPU didn't implement it. Fixed in
this series.
- dt-bindings
As suggested, split into separate commit. fixed the messaging that spec is
in public review
- arch_is_shadow_stack change
arch_is_shadow_stack changed to vma_is_shadow_stack
- hwprobe
zicfiss / zicfilp if present will get enumerated in hwprobe
- selftests
As suggested, added object and binary filenames to .gitignore
Selftest binary anyways need to be compiled with cfi enabled compiler which
will make sure that landing pad and shadow stack are enabled. Thus removed
separate enable/disable tests. Cleaned up tests a bit.
- Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
v2:
- Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow
integrity for user mode programs can be compiled in the kernel.
- Enabling of control flow integrity for user programs is left to user runtime
- This patch series introduces arch agnostic `prctls` to enable shadow stack
and indirect branch tracking. And implements them on riscv.
---
---
Changes in v11:
- EDITME: describe what is new in this series revision.
- EDITME: use bulletpoints and terse descriptions.
- Link to v10: https://lore.kernel.org/r/20250210-v5_user_cfi_series-v10-0-163dcfa31c60@ri…
---
Andy Chiu (1):
riscv: signal: abstract header saving for setup_sigcontext
Clément Léger (1):
riscv: Add Firmware Feature SBI extensions definitions
Deepak Gupta (24):
mm: VM_SHADOW_STACK definition for riscv
dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml)
riscv: zicfiss / zicfilp enumeration
riscv: zicfiss / zicfilp extension csr and bit definitions
riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit
riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE
riscv mm: manufacture shadow stack pte
riscv mmu: teach pte_mkwrite to manufacture shadow stack PTEs
riscv mmu: write protect and shadow stack
riscv/mm: Implement map_shadow_stack() syscall
riscv/shstk: If needed allocate a new shadow stack on clone
riscv: Implements arch agnostic shadow stack prctls
prctl: arch-agnostic prctl for indirect branch tracking
riscv/traps: Introduce software check exception
riscv/signal: save and restore of shadow stack for signal
riscv/kernel: update __show_regs to print shadow stack register
riscv/ptrace: riscv cfi status and state via ptrace and in core files
riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe
riscv: enable kernel access to shadow stack memory via FWFT sbi call
riscv: kernel command line option to opt out of user cfi
riscv: create a config for shadow stack and landing pad instr support
riscv: Documentation for landing pad / indirect branch tracking
riscv: Documentation for shadow stack on riscv
kselftest/riscv: kselftest for user mode cfi
Jim Shu (1):
arch/riscv: compile vdso with landing pad
Documentation/arch/riscv/index.rst | 2 +
Documentation/arch/riscv/zicfilp.rst | 115 +++++
Documentation/arch/riscv/zicfiss.rst | 176 +++++++
.../devicetree/bindings/riscv/extensions.yaml | 14 +
arch/riscv/Kconfig | 20 +
arch/riscv/Makefile | 7 +-
arch/riscv/include/asm/asm-prototypes.h | 1 +
arch/riscv/include/asm/assembler.h | 44 ++
arch/riscv/include/asm/cpufeature.h | 13 +
arch/riscv/include/asm/csr.h | 16 +
arch/riscv/include/asm/entry-common.h | 2 +
arch/riscv/include/asm/hwcap.h | 2 +
arch/riscv/include/asm/mman.h | 25 +
arch/riscv/include/asm/mmu_context.h | 7 +
arch/riscv/include/asm/pgtable.h | 30 +-
arch/riscv/include/asm/processor.h | 2 +
arch/riscv/include/asm/sbi.h | 26 +
arch/riscv/include/asm/thread_info.h | 3 +
arch/riscv/include/asm/usercfi.h | 89 ++++
arch/riscv/include/asm/vector.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 2 +
arch/riscv/include/uapi/asm/ptrace.h | 22 +
arch/riscv/include/uapi/asm/sigcontext.h | 1 +
arch/riscv/kernel/Makefile | 1 +
arch/riscv/kernel/asm-offsets.c | 8 +
arch/riscv/kernel/cpufeature.c | 13 +
arch/riscv/kernel/entry.S | 31 +-
arch/riscv/kernel/head.S | 12 +
arch/riscv/kernel/process.c | 26 +-
arch/riscv/kernel/ptrace.c | 83 ++++
arch/riscv/kernel/signal.c | 142 +++++-
arch/riscv/kernel/sys_hwprobe.c | 2 +
arch/riscv/kernel/sys_riscv.c | 10 +
arch/riscv/kernel/traps.c | 43 ++
arch/riscv/kernel/usercfi.c | 524 +++++++++++++++++++++
arch/riscv/kernel/vdso/Makefile | 12 +
arch/riscv/kernel/vdso/flush_icache.S | 4 +
arch/riscv/kernel/vdso/getcpu.S | 4 +
arch/riscv/kernel/vdso/rt_sigreturn.S | 4 +
arch/riscv/kernel/vdso/sys_hwprobe.S | 4 +
arch/riscv/mm/init.c | 2 +-
arch/riscv/mm/pgtable.c | 17 +
include/linux/cpu.h | 4 +
include/linux/mm.h | 7 +
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 27 ++
kernel/sys.c | 30 ++
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/cfi/.gitignore | 3 +
tools/testing/selftests/riscv/cfi/Makefile | 10 +
tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 84 ++++
tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 78 +++
tools/testing/selftests/riscv/cfi/shadowstack.c | 375 +++++++++++++++
tools/testing/selftests/riscv/cfi/shadowstack.h | 37 ++
54 files changed, 2193 insertions(+), 29 deletions(-)
---
base-commit: 39a803b754d5224a3522016b564113ee1e4091b2
change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2
--
- debug