Extend the XDP Tx metadata framework so that user can requests launch time
hardware offload, where the Ethernet device will schedule the packet for
transmission at a pre-determined time called launch time. The value of
launch time is communicated from user space to Ethernet driver via
launch_time field of struct xsk_tx_metadata.
Suggested-by: Stanislav Fomichev <sdf(a)google.com>
Signed-off-by: Song Yoong Siang <yoong.siang.song(a)intel.com>
---
Documentation/netlink/specs/netdev.yaml | 4 ++
Documentation/networking/xsk-tx-metadata.rst | 64 ++++++++++++++++++++
include/net/xdp_sock.h | 10 +++
include/net/xdp_sock_drv.h | 1 +
include/uapi/linux/if_xdp.h | 10 +++
include/uapi/linux/netdev.h | 3 +
net/core/netdev-genl.c | 2 +
net/xdp/xsk.c | 3 +
tools/include/uapi/linux/if_xdp.h | 10 +++
tools/include/uapi/linux/netdev.h | 3 +
10 files changed, 110 insertions(+)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index cbb544bd6c84..e59c8a14f7d1 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -70,6 +70,10 @@ definitions:
name: tx-checksum
doc:
L3 checksum HW offload is supported by the driver.
+ -
+ name: tx-launch-time
+ doc:
+ Launch time HW offload is supported by the driver.
-
name: queue-type
type: enum
diff --git a/Documentation/networking/xsk-tx-metadata.rst b/Documentation/networking/xsk-tx-metadata.rst
index e76b0cfc32f7..3cec089747ce 100644
--- a/Documentation/networking/xsk-tx-metadata.rst
+++ b/Documentation/networking/xsk-tx-metadata.rst
@@ -50,6 +50,10 @@ The flags field enables the particular offload:
checksum. ``csum_start`` specifies byte offset of where the checksumming
should start and ``csum_offset`` specifies byte offset where the
device should store the computed checksum.
+- ``XDP_TXMD_FLAGS_LAUNCH_TIME``: requests the device to schedule the
+ packet for transmission at a pre-determined time called launch time. The
+ value of launch time is indicated by ``launch_time`` field of
+ ``union xsk_tx_metadata``.
Besides the flags above, in order to trigger the offloads, the first
packet's ``struct xdp_desc`` descriptor should set ``XDP_TX_METADATA``
@@ -65,6 +69,65 @@ In this case, when running in ``XDK_COPY`` mode, the TX checksum
is calculated on the CPU. Do not enable this option in production because
it will negatively affect performance.
+Launch Time
+===========
+
+The value of the requested launch time should be based on the device's PTP
+Hardware Clock (PHC) to ensure accuracy. AF_XDP takes a different data path
+compared to the ETF queuing discipline, which organizes packets and delays
+their transmission. Instead, AF_XDP immediately hands off the packets to
+the device driver without rearranging their order or holding them prior to
+transmission. In scenarios where the launch time offload feature is
+disabled, the device driver is expected to disregard the launch time
+request. For correct interpretation and meaningful operation, the launch
+time should never be set to a value larger than the farthest programmable
+time in the future (the horizon). Different devices have different hardware
+limitations on the launch time offload feature.
+
+stmmac driver
+-------------
+
+For stmmac, TSO and launch time (TBS) features are mutually exclusive for
+each individual Tx Queue. By default, the driver configures Tx Queue 0 to
+support TSO and the rest of the Tx Queues to support TBS. The launch time
+hardware offload feature can be enabled or disabled by using the tc-etf
+command to call the driver's ndo_setup_tc() callback.
+
+The value of the launch time that is programmed in the Enhanced Normal
+Transmit Descriptors is a 32-bit value, where the most significant 8 bits
+represent the time in seconds and the remaining 24 bits represent the time
+in 256 ns increments. The programmed launch time is compared against the
+PTP time (bits[39:8]) and rolls over after 256 seconds. Therefore, the
+horizon of the launch time for dwmac4 and dwxlgmac2 is 128 seconds in the
+future.
+
+The stmmac driver maintains FIFO behavior and does not perform packet
+reordering. This means that a packet with a launch time request will block
+other packets in the same Tx Queue until it is transmitted.
+
+igc driver
+----------
+
+For igc, all four Tx Queues support the launch time feature. The launch
+time hardware offload feature can be enabled or disabled by using the
+tc-etf command to call the driver's ndo_setup_tc() callback. When entering
+TSN mode, the igc driver will reset the device and create a default Qbv
+schedule with a 1-second cycle time, with all Tx Queues open at all times.
+
+The value of the launch time that is programmed in the Advanced Transmit
+Context Descriptor is a relative offset to the starting time of the Qbv
+transmission window of the queue. The Frst flag of the descriptor can be
+set to schedule the packet for the next Qbv cycle. Therefore, the horizon
+of the launch time for i225 and i226 is the ending time of the next cycle
+of the Qbv transmission window of the queue. For example, when the Qbv
+cycle time is set to 1 second, the horizon of the launch time ranges
+from 1 second to 2 seconds, depending on where the Qbv cycle is currently
+running.
+
+The igc driver maintains FIFO behavior and does not perform packet
+reordering. This means that a packet with a launch time request will block
+other packets in the same Tx Queue until it is transmitted.
+
Querying Device Capabilities
============================
@@ -74,6 +137,7 @@ Refer to ``xsk-flags`` features bitmask in
- ``tx-timestamp``: device supports ``XDP_TXMD_FLAGS_TIMESTAMP``
- ``tx-checksum``: device supports ``XDP_TXMD_FLAGS_CHECKSUM``
+- ``tx-launch-time``: device supports ``XDP_TXMD_FLAGS_LAUNCH_TIME``
See ``tools/net/ynl/samples/netdev.c`` on how to query this information.
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index bfe625b55d55..a58ae7589d12 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -110,11 +110,16 @@ struct xdp_sock {
* indicates position where checksumming should start.
* csum_offset indicates position where checksum should be stored.
*
+ * void (*tmo_request_launch_time)(u64 launch_time, void *priv)
+ * Called when AF_XDP frame requested launch time HW offload support.
+ * launch_time indicates the PTP time at which the device can schedule the
+ * packet for transmission.
*/
struct xsk_tx_metadata_ops {
void (*tmo_request_timestamp)(void *priv);
u64 (*tmo_fill_timestamp)(void *priv);
void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
+ void (*tmo_request_launch_time)(u64 launch_time, void *priv);
};
#ifdef CONFIG_XDP_SOCKETS
@@ -162,6 +167,11 @@ static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
if (!meta)
return;
+ if (ops->tmo_request_launch_time)
+ if (meta->flags & XDP_TXMD_FLAGS_LAUNCH_TIME)
+ ops->tmo_request_launch_time(meta->request.launch_time,
+ priv);
+
if (ops->tmo_request_timestamp)
if (meta->flags & XDP_TXMD_FLAGS_TIMESTAMP)
ops->tmo_request_timestamp(priv);
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 40085afd9160..78af371bc002 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -198,6 +198,7 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
#define XDP_TXMD_FLAGS_VALID ( \
XDP_TXMD_FLAGS_TIMESTAMP | \
XDP_TXMD_FLAGS_CHECKSUM | \
+ XDP_TXMD_FLAGS_LAUNCH_TIME | \
0)
static inline bool xsk_buff_valid_tx_metadata(struct xsk_tx_metadata *meta)
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 42ec5ddaab8d..42869770776e 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -127,6 +127,12 @@ struct xdp_options {
*/
#define XDP_TXMD_FLAGS_CHECKSUM (1 << 1)
+/* Request launch time hardware offload. The device will schedule the packet for
+ * transmission at a pre-determined time called launch time. The value of
+ * launch time is communicated via launch_time field of struct xsk_tx_metadata.
+ */
+#define XDP_TXMD_FLAGS_LAUNCH_TIME (1 << 2)
+
/* AF_XDP offloads request. 'request' union member is consumed by the driver
* when the packet is being transmitted. 'completion' union member is
* filled by the driver when the transmit completion arrives.
@@ -142,6 +148,10 @@ struct xsk_tx_metadata {
__u16 csum_start;
/* Offset from csum_start where checksum should be stored. */
__u16 csum_offset;
+
+ /* XDP_TXMD_FLAGS_LAUNCH_TIME */
+ /* Launch time in nanosecond against the PTP HW Clock */
+ __u64 launch_time;
} request;
struct {
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index e4be227d3ad6..5ab85f4af009 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -59,10 +59,13 @@ enum netdev_xdp_rx_metadata {
* by the driver.
* @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
* driver.
+ * @NETDEV_XSK_FLAGS_LAUNCH_TIME: Launch Time HW offload is supported by the
+ * driver.
*/
enum netdev_xsk_flags {
NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
+ NETDEV_XSK_FLAGS_LAUNCH_TIME = 4,
};
enum netdev_queue_type {
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 9527dd46e4dc..e2515cf9190f 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -52,6 +52,8 @@ XDP_METADATA_KFUNC_xxx
xsk_features |= NETDEV_XSK_FLAGS_TX_TIMESTAMP;
if (netdev->xsk_tx_metadata_ops->tmo_request_checksum)
xsk_features |= NETDEV_XSK_FLAGS_TX_CHECKSUM;
+ if (netdev->xsk_tx_metadata_ops->tmo_request_launch_time)
+ xsk_features |= NETDEV_XSK_FLAGS_LAUNCH_TIME;
}
if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) ||
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 3fa70286c846..8feaa0e86f07 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -743,6 +743,9 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
goto free_err;
}
}
+
+ if (meta->flags & XDP_TXMD_FLAGS_LAUNCH_TIME)
+ skb->skb_mstamp_ns = meta->request.launch_time;
}
}
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index 2f082b01ff22..67719f8966c2 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -127,6 +127,12 @@ struct xdp_options {
*/
#define XDP_TXMD_FLAGS_CHECKSUM (1 << 1)
+/* Request launch time hardware offload. The device will schedule the packet for
+ * transmission at a pre-determined time called launch time. The value of
+ * launch time is communicated via launch_time field of struct xsk_tx_metadata.
+ */
+#define XDP_TXMD_FLAGS_LAUNCH_TIME (1 << 2)
+
/* AF_XDP offloads request. 'request' union member is consumed by the driver
* when the packet is being transmitted. 'completion' union member is
* filled by the driver when the transmit completion arrives.
@@ -142,6 +148,10 @@ struct xsk_tx_metadata {
__u16 csum_start;
/* Offset from csum_start where checksum should be stored. */
__u16 csum_offset;
+
+ /* XDP_TXMD_FLAGS_LAUNCH_TIME */
+ /* Launch time in nanosecond against the PTP HW Clock */
+ __u64 launch_time;
} request;
struct {
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index e4be227d3ad6..5ab85f4af009 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -59,10 +59,13 @@ enum netdev_xdp_rx_metadata {
* by the driver.
* @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
* driver.
+ * @NETDEV_XSK_FLAGS_LAUNCH_TIME: Launch Time HW offload is supported by the
+ * driver.
*/
enum netdev_xsk_flags {
NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
+ NETDEV_XSK_FLAGS_LAUNCH_TIME = 4,
};
enum netdev_queue_type {
--
2.34.1
The selftest started failing since commit e93d2521b27f
("x86/vdso: Split virtual clock pages into dedicated mapping")
was merged. While debugging I stumbled upon another bug and potential
cleanup.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Thomas Weißschuh (3):
selftests/mm: virtual_address_range: Fix error when CommitLimit < 1GiB
selftests/mm: virtual_address_range: Avoid reading VVAR mappings
selftests/mm: virtual_address_range: Dump to /dev/null
tools/testing/selftests/mm/virtual_address_range.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
---
base-commit: fbfd64d25c7af3b8695201ebc85efe90be28c5a3
change-id: 20250107-virtual_address_range-tests-95843766fa97
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Notable changes since v15:
* added IPV6 hack in Kconfig
* switched doc '|' operator to '>-' in yaml netlink spec
* added ovpn-mode doc to rt_link.yaml
* implemented rtnl_link_ops.fill_info
* removed ovpn_socket_detach() function because UDP and TCP detachment
is now happening in different moments
* reworked ovpn_socket lifetime:
** introduced ovpn_socket_release() that depending on transport proto
will take the right step towards releasing the socket (check large
comment on top of function for greater details)
** extended comments on various ovpn_socket* functions to ensure socket
lifecycle is clear
** implemented kref_put_lock() to allow UDP sockets to be detached while
holding socket lock
** acquired socket lock in ovpn_socket_new() to avoid race with detach
(point above)
** socket is now released upon peer removal (not upon peer free!)
* added convenient define OVPN_AAD_SIZE
* renamed AUTH_TAG_SIZE to OVPN_AUTH_TAG_SIZE
* s/dev_core_stats_rx_dropped_inc/dev_core_stats_tx_dropped_inc where
needed
* fixed some typos
* moved tcp_close() call outside of rcu_read_lock area
* moved ovpn_socket creation from ovpn_nl_peer_modify() to
ovpn_nl_peer_new_doit() to make smatch happy (ovpn_socket_new() may
have been called under spinlock, but it may sleep)
* added support for MSG_NOSIGNAL flag in TCP calls (required extending
the skb API)
* improved TCP proto/ops customization (required exporting
inet6_stream_ops)
* changed kselftest tool (ovpn-cli.c) to pass MSG_NOSIGNAL to TCP
send/recv calls.
The ovpn_socket lifecycle changes above address the race conditions
previously reported by Sabrina.
Hopefully all though nuts have been cracked at this point.
Please note that some patches were already reviewed by Andre Lunn,
Donald Hunter and Shuah Khan. They have retained the Reviewed-by tag
since no major code modification has happened since the review.
The latest code can also be found at:
https://github.com/OpenVPN/linux-kernel-ovpn
Thanks a lot!
Best Regards,
Antonio Quartulli
OpenVPN Inc.
---
Antonio Quartulli (26):
net: introduce OpenVPN Data Channel Offload (ovpn)
ovpn: add basic netlink support
ovpn: add basic interface creation/destruction/management routines
ovpn: keep carrier always on for MP interfaces
ovpn: introduce the ovpn_peer object
kref/refcount: implement kref_put_sock()
ovpn: introduce the ovpn_socket object
ovpn: implement basic TX path (UDP)
ovpn: implement basic RX path (UDP)
ovpn: implement packet processing
ovpn: store tunnel and transport statistics
ipv6: export inet6_stream_ops via EXPORT_SYMBOL_GPL
ovpn: implement TCP transport
skb: implement skb_send_sock_locked_with_flags()
ovpn: add support for MSG_NOSIGNAL in tcp_sendmsg
ovpn: implement multi-peer support
ovpn: implement peer lookup logic
ovpn: implement keepalive mechanism
ovpn: add support for updating local UDP endpoint
ovpn: add support for peer floating
ovpn: implement peer add/get/dump/delete via netlink
ovpn: implement key add/get/del/swap via netlink
ovpn: kill key and notify userspace in case of IV exhaustion
ovpn: notify userspace when a peer is deleted
ovpn: add basic ethtool support
testing/selftests: add test tool and scripts for ovpn module
Documentation/netlink/specs/ovpn.yaml | 372 +++
Documentation/netlink/specs/rt_link.yaml | 16 +
MAINTAINERS | 11 +
drivers/net/Kconfig | 15 +
drivers/net/Makefile | 1 +
drivers/net/ovpn/Makefile | 22 +
drivers/net/ovpn/bind.c | 55 +
drivers/net/ovpn/bind.h | 101 +
drivers/net/ovpn/crypto.c | 211 ++
drivers/net/ovpn/crypto.h | 145 ++
drivers/net/ovpn/crypto_aead.c | 382 ++++
drivers/net/ovpn/crypto_aead.h | 33 +
drivers/net/ovpn/io.c | 446 ++++
drivers/net/ovpn/io.h | 34 +
drivers/net/ovpn/main.c | 350 +++
drivers/net/ovpn/main.h | 14 +
drivers/net/ovpn/netlink-gen.c | 213 ++
drivers/net/ovpn/netlink-gen.h | 41 +
drivers/net/ovpn/netlink.c | 1178 ++++++++++
drivers/net/ovpn/netlink.h | 18 +
drivers/net/ovpn/ovpnstruct.h | 57 +
drivers/net/ovpn/peer.c | 1256 +++++++++++
drivers/net/ovpn/peer.h | 159 ++
drivers/net/ovpn/pktid.c | 129 ++
drivers/net/ovpn/pktid.h | 87 +
drivers/net/ovpn/proto.h | 118 +
drivers/net/ovpn/skb.h | 60 +
drivers/net/ovpn/socket.c | 237 ++
drivers/net/ovpn/socket.h | 45 +
drivers/net/ovpn/stats.c | 21 +
drivers/net/ovpn/stats.h | 47 +
drivers/net/ovpn/tcp.c | 567 +++++
drivers/net/ovpn/tcp.h | 33 +
drivers/net/ovpn/udp.c | 392 ++++
drivers/net/ovpn/udp.h | 23 +
include/linux/kref.h | 11 +
include/linux/refcount.h | 3 +
include/linux/skbuff.h | 2 +
include/uapi/linux/if_link.h | 15 +
include/uapi/linux/ovpn.h | 111 +
include/uapi/linux/udp.h | 1 +
lib/refcount.c | 32 +
net/core/skbuff.c | 18 +-
net/ipv6/af_inet6.c | 1 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/net/ovpn/.gitignore | 2 +
tools/testing/selftests/net/ovpn/Makefile | 17 +
tools/testing/selftests/net/ovpn/config | 10 +
tools/testing/selftests/net/ovpn/data64.key | 5 +
tools/testing/selftests/net/ovpn/ovpn-cli.c | 2366 ++++++++++++++++++++
tools/testing/selftests/net/ovpn/tcp_peers.txt | 5 +
.../testing/selftests/net/ovpn/test-chachapoly.sh | 9 +
tools/testing/selftests/net/ovpn/test-float.sh | 9 +
tools/testing/selftests/net/ovpn/test-tcp.sh | 9 +
tools/testing/selftests/net/ovpn/test.sh | 182 ++
tools/testing/selftests/net/ovpn/udp_peers.txt | 5 +
56 files changed, 9698 insertions(+), 5 deletions(-)
---
base-commit: 4b252f2dab2ebb654eebbb2aee980ab8373b2295
change-id: 20241002-b4-ovpn-eeee35c694a2
Best regards,
--
Antonio Quartulli <antonio(a)openvpn.net>
On 08.01.25 07:09, Dev Jain wrote:
>
> On 07/01/25 8:44 pm, Thomas Weißschuh wrote:
>> During the execution of validate_complete_va_space() a lot of memory is
>> on the VM subsystem. When running on a low memory subsystem an OOM may
>> be triggered, when writing to the dump file as the filesystem may also
>> require memory.
>>
>> On my test system with 1100MiB physical memory:
>>
>> Tasks state (memory values in pages):
>> [ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
>> [ 57] 0 57 34359215953 695 256 0 439 1064390656 0 0 virtual_address
>>
>> Out of memory: Killed process 57 (virtual_address) total-vm:137436863812kB, anon-rss:1024kB, file-rss:0kB, shmem-rss:1756kB, UID:0 pgtables:1039444kB oom_score_adj:0
>> <snip>
>> fault_in_iov_iter_readable+0x4a/0xd0
>> generic_perform_write+0x9c/0x280
>> shmem_file_write_iter+0x86/0x90
>> vfs_write+0x29c/0x480
>> ksys_write+0x6c/0xe0
>> do_syscall_64+0x9e/0x1a0
>> entry_SYSCALL_64_after_hwframe+0x77/0x7f
>>
>> Write the dumped data into /dev/null instead which does not require
>> additional memory during write(), making the code simpler as a
>> side-effect.
>>
>> Signed-off-by: Thomas Weißschuh<thomas.weissschuh(a)linutronix.de>
>> ---
>> tools/testing/selftests/mm/virtual_address_range.c | 6 ++----
>> 1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
>> index 484f82c7b7c871f82a7d9ec6d6c649f2ab1eb0cd..4042fd878acd702d23da2c3293292de33bd48143 100644
>> --- a/tools/testing/selftests/mm/virtual_address_range.c
>> +++ b/tools/testing/selftests/mm/virtual_address_range.c
>> @@ -103,10 +103,9 @@ static int validate_complete_va_space(void)
>> FILE *file;
>> int fd;
>>
>> - fd = open("va_dump", O_CREAT | O_WRONLY, 0600);
>> - unlink("va_dump");
>> + fd = open("/dev/null", O_WRONLY);
>> if (fd < 0) {
>> - ksft_test_result_skip("cannot create or open dump file\n");
>> + ksft_test_result_skip("cannot create or open /dev/null\n");
>> ksft_finished();
>> }
>> >> @@ -152,7 +151,6 @@ static int validate_complete_va_space(void)
>> while (start_addr + hop < end_addr) {
>> if (write(fd, (void *)(start_addr + hop), 1) != 1)
>> return 1;
>> - lseek(fd, 0, SEEK_SET);
>>
>> hop += MAP_CHUNK_SIZE;
>> }
>>
>
> The reason I had not used /dev/null was that write() was succeeding to /dev/null
> even from an address not in my VA space. I was puzzled about this behaviour of
> /dev/null and I chose to ignore it and just use a real file.
>
> To test this behaviour, run the following program:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/mman.h>
> intmain()
> {
> intfd;
> fd = open("va_dump", O_CREAT| O_WRONLY, 0600);
> unlink("va_dump");
> // fd = open("/dev/null", O_WRONLY);
> intret = munmap((void*)(1UL<< 30), 100);
> if(!ret)
> printf("munmap succeeded\n");
> intres = write(fd, (void*)(1UL<< 30), 1);
> if(res == 1)
> printf("write succeeded\n");
> return0;
> }
> The write will fail as expected, but if you comment out the va_dump
> lines and use /dev/null, the write will succeed.
What exactly do we want to achieve with the write? Verify that the
output of /proc/self/map is reasonable and we can actually resolve a
fault / map a page?
Why not access the memory directly+signal handler or using
/proc/self/mem, so you can avoid the temp file completely?
--
Cheers,
David / dhildenb
* Resending because I accidentally forgot to include Lorenzo in the
"to" list.
Android uses the ashmem driver [1] for creating shared memory regions
between processes. The ashmem driver exposes an ioctl command for
processes to restrict the permissions an ashmem buffer can be mapped
with.
Buffers are created with the ability to be mapped as readable, writable,
and executable. Processes remove the ability to map some ashmem buffers
as executable to ensure that those buffers cannot be used to inject
malicious code for another process to run. Other buffers retain their
ability to be mapped as executable, as these buffers can be used for
just-in-time (JIT) compilation. So there is a need to be able to remove
the ability to map a buffer as executable on a per-buffer basis.
Android is currently trying to migrate towards replacing its ashmem
driver usage with memfd. Part of the transition involved introducing a
library that serves to abstract away how shared memory regions are
allocated (i.e. ashmem vs memfd). This allows clients to use a single
interface for restricting how a buffer can be mapped without having to
worry about how it is handled for ashmem (through the ioctl
command mentioned earlier) or memfd (through file seals).
While memfd has support for preventing buffers from being mapped as
writable beyond a certain point in time (thanks to
F_SEAL_FUTURE_WRITE), it does not have a similar interface to prevent
buffers from being mapped as executable beyond a certain point.
However, that could be implemented as a file seal (F_SEAL_FUTURE_EXEC)
which works similarly to F_SEAL_FUTURE_WRITE.
F_SEAL_FUTURE_WRITE was chosen as a template for how this new seal
should behave, instead of F_SEAL_WRITE, for the following reasons:
1. Having the new seal behave like F_SEAL_FUTURE_WRITE matches the
behavior that was present with ashmem. This aids in seamlessly
transitioning clients away from ashmem to memfd.
2. Making the new seal behave like F_SEAL_WRITE would mean that no
mappings that could become executable in the future (i.e. via
mprotect()) can exist when the seal is applied. However, there are
known cases (e.g. CursorWindow [2]) where restrictions are applied
on how a buffer can be mapped after a mapping has already been made.
That mapping may have VM_MAYEXEC set, which would not allow the seal
to be applied successfully.
Therefore, the F_SEAL_FUTURE_EXEC seal was designed to have the same
semantics as F_SEAL_FUTURE_WRITE.
Note: this series depends on Lorenzo's work [3], [4], [5] from Andrew
Morton's mm-unstable branch [6], which reworks memfd's file seal checks,
allowing for newer file seals to be implemented in a cleaner fashion.
Changes from v1 ==> v2:
- Changed the return code to be -EPERM instead of -EACCES when
attempting to map an exec sealed file with PROT_EXEC to align
to mmap()'s man page. Thank you Kalesh Singh for spotting this!
- Rebased on top of Lorenzo's work to cleanup memfd file seal checks in
mmap() ([3], [4], and [5]). Thank you for this Lorenzo!
- Changed to deny PROT_EXEC mappings only if the mapping is shared,
instead of for both shared and private mappings, after discussing
this with Lorenzo.
Opens:
- Lorenzo brought up that this patch may negatively impact the usage of
MFD_NOEXEC_SCOPE_NOEXEC_ENFORCED [7]. However, it is not clear to me
why that is the case. At the moment, my intent is for the executable
permissions of the file to be disjoint from the ability to create
executable mappings.
Links:
[1] https://cs.android.com/android/kernel/superproject/+/common-android-mainlin…
[2] https://developer.android.com/reference/android/database/CursorWindow
[3] https://lore.kernel.org/all/cover.1732804776.git.lorenzo.stoakes@oracle.com/
[4] https://lkml.kernel.org/r/20241206212846.210835-1-lorenzo.stoakes@oracle.com
[5] https://lkml.kernel.org/r/7dee6c5d-480b-4c24-b98e-6fa47dbd8a23@lucifer.local
[6] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/?h=mm-unst…
[7] https://lore.kernel.org/all/3a53b154-1e46-45fb-a559-65afa7a8a788@lucifer.lo…
Links to previous versions:
v1: https://lore.kernel.org/all/20241206010930.3871336-1-isaacmanjarres@google.…
Isaac J. Manjarres (2):
mm/memfd: Add support for F_SEAL_FUTURE_EXEC to memfd
selftests/memfd: Add tests for F_SEAL_FUTURE_EXEC
include/uapi/linux/fcntl.h | 1 +
mm/memfd.c | 39 ++++++++++-
tools/testing/selftests/memfd/memfd_test.c | 79 ++++++++++++++++++++++
3 files changed, 118 insertions(+), 1 deletion(-)
--
2.47.1.613.gc27f4b7a9f-goog