The kselftests may be built in a couple of different ways:
make LLVM=1
make CC=clang
In order to handle both cases, set LLVM=1 if CC=clang. That way, the rest
of lib.mk, and any Makefiles that include lib.mk, can base decisions
solely on whether or not LLVM is set.
Then, build upon that to disable a pair of clang warnings that are
already silenced on gcc.
Doing it this way is much better than the piecemeal approach that I
started with in [1] and [2]. Thanks to Nathan Chancellor for the patch
reviews that led to this approach.
Changes since the first version:
1) Wrote a detailed explanation for suppressing two clang warnings, in
both a lib.mk comment and the commit description.
2) Added a Reviewed-by tag to the first patch.
[1] https://lore.kernel.org/20240527214704.300444-1-jhubbard@nvidia.com
[2] https://lore.kernel.org/20240527213641.299458-1-jhubbard@nvidia.com
John Hubbard (2):
selftests/lib.mk: handle both LLVM=1 and CC=clang builds
selftests/lib.mk: silence some clang warnings that gcc already ignores
tools/testing/selftests/lib.mk | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
base-commit: e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc
--
2.45.1
Commit 1b151e2435fc ("block: Remove special-casing of compound
pages") caused a change in behaviour when releasing the pages
if the buffer does not start at the beginning of the page. This
was because the calculation of the number of pages to release
was incorrect.
This was fixed by commit 38b43539d64b ("block: Fix page refcounts
for unaligned buffers in __bio_release_pages()").
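To illustrate why an unaligned start matters when counting pages, here is a minimal sketch of the arithmetic. It is not the kernel's code: `pages_spanned()` and `pages_naive()` are hypothetical helpers written only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of base pages touched by the byte range [offset, offset + length).
 * Hypothetical helper for illustration; length must be non-zero.
 */
static size_t pages_spanned(size_t offset, size_t length, size_t page_size)
{
	assert(page_size > 0 && length > 0);
	return (offset + length - 1) / page_size - offset / page_size + 1;
}

/* Naive count that ignores where the range starts inside its first page. */
static size_t pages_naive(size_t length, size_t page_size)
{
	assert(page_size > 0);
	return (length + page_size - 1) / page_size;
}
```

Assuming 4096-byte pages, a range that starts 2048 bytes into a page and is 12288 bytes long touches four pages, while the naive count gives three. For a page-aligned start of the same length, both helpers give three.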
We pin the user buffer during direct I/O writes. If this buffer is a
hugepage, bio_release_pages() will unpin it and decrement all references
and pin counts at ->bi_end_io. However, if any references to the hugepage
remain post-I/O, the hugepage will not be freed upon unmap, leading
to a memory leak.
This patch verifies that a hugepage, used as a user buffer for DIO
operations, is correctly freed upon unmapping, regardless of whether
the offsets are aligned or unaligned w.r.t page boundary.
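The check relies on comparing the free hugepage count before allocation and after unmapping. As a rough sketch of how such a count could be obtained, the helpers below parse the `HugePages_Free` field of `/proc/meminfo`. Both function names are hypothetical and this is not necessarily how the selftest's own helper is implemented.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the HugePages_Free value from text in /proc/meminfo format.
 * Returns -1 if the field is missing. Hypothetical helper, for illustration.
 */
static long parse_free_hugepages(const char *meminfo)
{
	static const char key[] = "HugePages_Free:";
	const char *p;
	long val;

	assert(meminfo != NULL);
	p = strstr(meminfo, key);
	if (!p)
		return -1;
	/* %ld skips the padding between the key and the number. */
	if (sscanf(p + sizeof(key) - 1, "%ld", &val) != 1)
		return -1;
	return val;
}

/* Read /proc/meminfo and return the free hugepage count, or -1 on error. */
static long read_free_hugepages(void)
{
	char buf[8192];
	size_t n;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_free_hugepages(buf);
}
```

A leak check would call `read_free_hugepages()` once before `mmap()` and once after `munmap()` and compare the two values.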
Test Result Fail Scenario (Without the fix)
--------------------------------------------------------
[]# ./hugetlb_dio
TAP version 13
1..4
No. Free pages before allocation : 7
No. Free pages after munmap : 7
ok 1 : Huge pages freed successfully !
No. Free pages before allocation : 7
No. Free pages after munmap : 7
ok 2 : Huge pages freed successfully !
No. Free pages before allocation : 7
No. Free pages after munmap : 7
ok 3 : Huge pages freed successfully !
No. Free pages before allocation : 7
No. Free pages after munmap : 6
not ok 4 : Huge pages not freed!
Totals: pass:3 fail:1 xfail:0 xpass:0 skip:0 error:0
Test Result PASS Scenario (With the fix)
---------------------------------------------------------
[]# ./hugetlb_dio
TAP version 13
1..4
No. Free pages before allocation : 7
No. Free pages after munmap : 7
ok 1 : Huge pages freed successfully !
No. Free pages before allocation : 7
No. Free pages after munmap : 7
ok 2 : Huge pages freed successfully !
No. Free pages before allocation : 7
No. Free pages after munmap : 7
ok 3 : Huge pages freed successfully !
No. Free pages before allocation : 7
No. Free pages after munmap : 7
ok 4 : Huge pages freed successfully !
Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0
V3:
- Fixed the build error when it is compiled with _FORTIFY_SOURCE.
V2:
- Addressed all review comments from Muhammad Usama Anjum
https://lore.kernel.org/all/20240604132801.23377-1-donettom@linux.ibm.com/
V1:
https://lore.kernel.org/all/20240523063905.3173-1-donettom@linux.ibm.com/#t
Signed-off-by: Donet Tom <donettom(a)linux.ibm.com>
Co-developed-by: Ritesh Harjani (IBM) <ritesh.list(a)gmail.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list(a)gmail.com>
---
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/hugetlb_dio.c | 118 +++++++++++++++++++++++
2 files changed, 119 insertions(+)
create mode 100644 tools/testing/selftests/mm/hugetlb_dio.c
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 3b49bc3d0a3b..a1748a4c7df1 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -73,6 +73,7 @@ TEST_GEN_FILES += ksm_functional_tests
TEST_GEN_FILES += mdwe_test
TEST_GEN_FILES += hugetlb_fault_after_madv
TEST_GEN_FILES += hugetlb_madv_vs_map
+TEST_GEN_FILES += hugetlb_dio
ifneq ($(ARCH),arm64)
TEST_GEN_FILES += soft-dirty
diff --git a/tools/testing/selftests/mm/hugetlb_dio.c b/tools/testing/selftests/mm/hugetlb_dio.c
new file mode 100644
index 000000000000..986f3b6c7f7b
--- /dev/null
+++ b/tools/testing/selftests/mm/hugetlb_dio.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This program tests for hugepage leaks after DIO writes to a file using a
+ * hugepage as the user buffer. During DIO, the user buffer is pinned and
+ * should be properly unpinned upon completion. This patch verifies that the
+ * kernel correctly unpins the buffer at DIO completion for both aligned and
+ * unaligned user buffer offsets (w.r.t page boundary), ensuring the hugepage
+ * is freed upon unmapping.
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <sys/stat.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/mman.h>
+#include "vm_util.h"
+#include "../kselftest.h"
+
+void run_dio_using_hugetlb(unsigned int start_off, unsigned int end_off)
+{
+ int fd;
+ char *buffer = NULL;
+ char *orig_buffer = NULL;
+ size_t h_pagesize = 0;
+ size_t writesize;
+ int free_hpage_b = 0;
+ int free_hpage_a = 0;
+ const int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB;
+ const int mmap_prot = PROT_READ | PROT_WRITE;
+
+ writesize = end_off - start_off;
+
+ /* Get the default huge page size */
+ h_pagesize = default_huge_page_size();
+ if (!h_pagesize)
+ ksft_exit_fail_msg("Unable to determine huge page size\n");
+
+ /* Open the file to DIO */
+ fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664);
+ if (fd < 0)
+ ksft_exit_fail_perror("Error opening file\n");
+
+ /* Get the free huge pages before allocation */
+ free_hpage_b = get_free_hugepages();
+ if (free_hpage_b == 0) {
+ close(fd);
+ ksft_exit_skip("No free hugepage, exiting!\n");
+ }
+
+ /* Allocate a hugetlb page */
+ orig_buffer = mmap(NULL, h_pagesize, mmap_prot, mmap_flags, -1, 0);
+ if (orig_buffer == MAP_FAILED) {
+ close(fd);
+ ksft_exit_fail_perror("Error mapping memory\n");
+ }
+ buffer = orig_buffer;
+ buffer += start_off;
+
+ memset(buffer, 'A', writesize);
+
+ /* Write the buffer to the file */
+ if (write(fd, buffer, writesize) != (writesize)) {
+ munmap(orig_buffer, h_pagesize);
+ close(fd);
+ ksft_exit_fail_perror("Error writing to file\n");
+ }
+
+ /* unmap the huge page */
+ munmap(orig_buffer, h_pagesize);
+ close(fd);
+
+ /* Get the free huge pages after unmap*/
+ free_hpage_a = get_free_hugepages();
+
+ /*
+ * If the no. of free hugepages before allocation and after unmap does
+ * not match - that means there could still be a page which is pinned.
+ */
+ if (free_hpage_a != free_hpage_b) {
+ ksft_print_msg("No. Free pages before allocation : %d\n", free_hpage_b);
+ ksft_print_msg("No. Free pages after munmap : %d\n", free_hpage_a);
+ ksft_test_result_fail(": Huge pages not freed!\n");
+ } else {
+ ksft_print_msg("No. Free pages before allocation : %d\n", free_hpage_b);
+ ksft_print_msg("No. Free pages after munmap : %d\n", free_hpage_a);
+ ksft_test_result_pass(": Huge pages freed successfully !\n");
+ }
+}
+
+int main(void)
+{
+ size_t pagesize = 0;
+
+ ksft_print_header();
+ ksft_set_plan(4);
+
+ /* Get base page size */
+ pagesize = psize();
+
+ /* start and end is aligned to pagesize */
+ run_dio_using_hugetlb(0, (pagesize * 3));
+
+ /* start is aligned but end is not aligned */
+ run_dio_using_hugetlb(0, (pagesize * 3) - (pagesize / 2));
+
+ /* start is unaligned and end is aligned */
+ run_dio_using_hugetlb(pagesize / 2, (pagesize * 3));
+
+ /* both start and end are unaligned */
+ run_dio_using_hugetlb(pagesize / 2, (pagesize * 3) + (pagesize / 2));
+
+ ksft_finished();
+}
+
--
2.43.0
From: Jeff Xu <jeffxu(a)google.com>
By default, memfd_create() creates a non-sealable MFD, unless the
MFD_ALLOW_SEALING flag is set.
When the MFD_NOEXEC_SEAL flag was initially introduced, the MFD created
with that flag is sealable, even though MFD_ALLOW_SEALING is not set.
This patch changes MFD_NOEXEC_SEAL to be non-sealable by default,
unless MFD_ALLOW_SEALING is explicitly set.
This is a backward-incompatible change. However, as MFD_NOEXEC_SEAL
is new, we expect that few applications rely on the nature of
MFD_NOEXEC_SEAL being sealable. In most cases, an application already
sets MFD_ALLOW_SEALING if it needs a sealable MFD.
Additionally, this enhances the usability of the pid namespace sysctl
vm.memfd_noexec. When vm.memfd_noexec equals 1 or 2, the kernel will
add MFD_NOEXEC_SEAL if memfd_create() does not specify MFD_EXEC or
MFD_NOEXEC_SEAL and, before this patch, the addition of MFD_NOEXEC_SEAL
makes the MFD sealable. This means that any application that does not
desire this behavior will be unable to utilize vm.memfd_noexec = 1 or 2
to migrate/enforce non-executable MFD. This adjustment ensures that
applications can anticipate that the sealable characteristic will
remain unmodified by vm.memfd_noexec.
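The interaction between the sysctl and the initial seals can be modelled in a few lines. This is a userspace sketch, not the kernel code. The `SK_`-prefixed constants mirror the uapi flag and seal values. The function names are hypothetical. Two things are assumptions: the MFD_EXEC default when vm.memfd_noexec is 0, and a new memfd starting out with F_SEAL_SEAL set. The rejection of MFD_EXEC under vm.memfd_noexec = 2 is not modelled.

```c
#include <assert.h>

/* Flag and seal values mirrored from the Linux uapi headers. */
#define SK_MFD_ALLOW_SEALING 0x0002U
#define SK_MFD_NOEXEC_SEAL   0x0008U
#define SK_MFD_EXEC          0x0010U
#define SK_F_SEAL_SEAL       0x0001U
#define SK_F_SEAL_EXEC       0x0020U

/*
 * Model of the default-flag step: if the caller chose neither MFD_EXEC nor
 * MFD_NOEXEC_SEAL, vm.memfd_noexec = 1 or 2 selects MFD_NOEXEC_SEAL.
 * The MFD_EXEC default for vm.memfd_noexec = 0 is an assumption here.
 */
static unsigned int apply_noexec_sysctl(unsigned int flags, int memfd_noexec)
{
	assert(memfd_noexec >= 0 && memfd_noexec <= 2);
	if (!(flags & (SK_MFD_EXEC | SK_MFD_NOEXEC_SEAL)))
		flags |= memfd_noexec >= 1 ? SK_MFD_NOEXEC_SEAL : SK_MFD_EXEC;
	return flags;
}

/*
 * Model of the initial seals with this series applied.
 * Assumes a freshly created memfd starts with F_SEAL_SEAL set.
 */
static unsigned int initial_seals(unsigned int flags)
{
	unsigned int seals = SK_F_SEAL_SEAL;

	if (flags & SK_MFD_NOEXEC_SEAL)
		seals |= SK_F_SEAL_EXEC;
	if (flags & SK_MFD_ALLOW_SEALING)
		seals &= ~SK_F_SEAL_SEAL;
	return seals;
}
```

In this model, whether F_SEAL_SEAL is cleared depends only on MFD_ALLOW_SEALING, so changing vm.memfd_noexec can add F_SEAL_EXEC but never changes whether the MFD is sealable.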
This patch was initially developed by Barnabás Pőcze, and Barnabás
used Debian Code Search and GitHub to try to find potential breakages
and could only find a single one. Dbus-broker's memfd_create() wrapper
is aware of this implicit `MFD_ALLOW_SEALING` behavior, and tries to
work around it [1]. This workaround will break. Luckily, this only
affects the test suite; it does not affect the normal operations of
dbus-broker. There is a PR with a fix [2]. In addition, David
Rheinsberg also raised a similar fix in [3].
[1]: https://github.com/bus1/dbus-broker/blob/9eb0b7e5826fc76cad7b025bc46f267d4a…
[2]: https://github.com/bus1/dbus-broker/pull/366
[3]: https://lore.kernel.org/lkml/20230714114753.170814-1-david@readahead.eu/
History
======
V2:
update commit message.
add testcase for vm.memfd_noexec
add documentation.
V1:
https://lore.kernel.org/lkml/20240513191544.94754-1-pobrn@protonmail.com/
Jeff Xu (2):
memfd: fix MFD_NOEXEC_SEAL to be non-sealable by default
memfd:add MEMFD_NOEXEC_SEAL documentation
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/mfd_noexec.rst | 90 ++++++++++++++++++++++
mm/memfd.c | 9 +--
tools/testing/selftests/memfd/memfd_test.c | 26 ++++++-
4 files changed, 120 insertions(+), 6 deletions(-)
create mode 100644 Documentation/userspace-api/mfd_noexec.rst
--
2.45.1.288.g0e0cd299f1-goog
`MFD_NOEXEC_SEAL` should remove the executable bits and set
`F_SEAL_EXEC` to prevent further modifications to the executable
bits as per the comment in the uapi header file:
not executable and sealed to prevent changing to executable
However, currently, it also unsets `F_SEAL_SEAL`, essentially
acting as a superset of `MFD_ALLOW_SEALING`. Nothing implies
that it should be so, and indeed up until the second version
of the patchset[0] that introduced `MFD_EXEC` and
`MFD_NOEXEC_SEAL`, `F_SEAL_SEAL` was not removed, however it
was changed in the third revision of the patchset[1] without
a clear explanation.
This behaviour is surprising for application developers;
there is no documentation that would reveal that `MFD_NOEXEC_SEAL`
has the additional effect of `MFD_ALLOW_SEALING`.
So do not remove `F_SEAL_SEAL` when `MFD_NOEXEC_SEAL` is requested.
This is technically an ABI break, but it seems very unlikely that an
application would depend on this behaviour (unless by accident).
[0]: https://lore.kernel.org/lkml/20220805222126.142525-3-jeffxu@google.com/
[1]: https://lore.kernel.org/lkml/20221202013404.163143-3-jeffxu@google.com/
Fixes: 105ff5339f498a ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Barnabás Pőcze <pobrn(a)protonmail.com>
---
Or did I miss the explanation as to why MFD_NOEXEC_SEAL should
imply MFD_ALLOW_SEALING? If so, please direct me to it and
sorry for the noise.
---
mm/memfd.c | 9 ++++-----
tools/testing/selftests/memfd/memfd_test.c | 2 +-
2 files changed, 5 insertions(+), 6 deletions(-)
diff --git a/mm/memfd.c b/mm/memfd.c
index 7d8d3ab3fa37..8b7f6afee21d 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -356,12 +356,11 @@ SYSCALL_DEFINE2(memfd_create,
inode->i_mode &= ~0111;
file_seals = memfd_file_seals_ptr(file);
- if (file_seals) {
- *file_seals &= ~F_SEAL_SEAL;
+ if (file_seals)
*file_seals |= F_SEAL_EXEC;
- }
- } else if (flags & MFD_ALLOW_SEALING) {
- /* MFD_EXEC and MFD_ALLOW_SEALING are set */
+ }
+
+ if (flags & MFD_ALLOW_SEALING) {
file_seals = memfd_file_seals_ptr(file);
if (file_seals)
*file_seals &= ~F_SEAL_SEAL;
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index 18f585684e20..b6a7ad68c3c1 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -1151,7 +1151,7 @@ static void test_noexec_seal(void)
mfd_def_size,
MFD_CLOEXEC | MFD_NOEXEC_SEAL);
mfd_assert_mode(fd, 0666);
- mfd_assert_has_seals(fd, F_SEAL_EXEC);
+ mfd_assert_has_seals(fd, F_SEAL_SEAL | F_SEAL_EXEC);
mfd_fail_chmod(fd, 0777);
close(fd);
}
--
2.45.0
In order to be able to save the current value of a sysctl without changing
it, split the relevant bit out of sysctl_set() into a new helper.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Ido Schimmel <idosch(a)nvidia.com>
---
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: linux-kselftest(a)vger.kernel.org
Notes:
v2:
- New patch.
tools/testing/selftests/net/forwarding/lib.sh | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index eabbdf00d8ca..9086d2015296 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -1134,12 +1134,19 @@ bridge_ageing_time_get()
}
declare -A SYSCTL_ORIG
+sysctl_save()
+{
+ local key=$1; shift
+
+ SYSCTL_ORIG[$key]=$(sysctl -n $key)
+}
+
sysctl_set()
{
local key=$1; shift
local value=$1; shift
- SYSCTL_ORIG[$key]=$(sysctl -n $key)
+ sysctl_save "$key"
sysctl -qw $key="$value"
}
--
2.45.0
This patch series is motivated by the following observation:
Raise a signal, jump to signal handler. The ucontext_t structure dumped
by kernel to userspace has a uc_sigmask field having the mask of blocked
signals. If you run a fresh minimalistic program doing this, this field
is empty, even if you block some signals while registering the handler
with sigaction().
Here is what the man-pages have to say:
sigaction(2): "sa_mask specifies a mask of signals which should be blocked
(i.e., added to the signal mask of the thread in which the signal handler
is invoked) during execution of the signal handler. In addition, the
signal which triggered the handler will be blocked, unless the SA_NODEFER
flag is used."
signal(7): Under "Execution of signal handlers", (1.3) implies:
"The thread's current signal mask is accessible via the ucontext_t
object that is pointed to by the third argument of the signal handler."
But, (1.4) states:
"Any signals specified in act->sa_mask when registering the handler with
sigprocmask(2) are added to the thread's signal mask. The signal being
delivered is also added to the signal mask, unless SA_NODEFER was
specified when registering the handler. These signals are thus blocked
while the handler executes."
There clearly is no distinction being made in the man pages between
"Thread's signal mask" and ucontext_t; this logically should imply
that a signal blocked by populating struct sigaction should be visible
in ucontext_t.
Here is what the kernel code does (for Aarch64):
do_signal() -> handle_signal() -> sigmask_to_save(), which returns
&current->blocked, is passed to setup_rt_frame() -> setup_sigframe() ->
__copy_to_user(). Hence, &current->blocked is copied to ucontext_t
exposed to userspace. Returning back to handle_signal(),
signal_setup_done() -> signal_delivered() -> sigorsets() and
set_current_blocked() are responsible for using information from
struct ksignal ksig, which was populated through the sigaction()
system call in kernel/signal.c:
copy_from_user(&new_sa.sa, act, sizeof(new_sa.sa)),
to update &current->blocked; hence, the set of blocked signals for the
current thread is updated AFTER the kernel dumps ucontext_t to
userspace.
I assume that the above is indeed the intended behaviour, because it
makes semantic sense: the signals blocked using sigaction() remain
blocked only for the duration of the handler, and not in the context
present before jumping to the handler. Nothing in the man-pages
confirms this, however. On that assumption, the series introduces a
test for mangling with uc_sigmask. I will send a separate series to
fix the man-pages.
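The observation can be reproduced with a small userspace probe. This is a minimal sketch and not the selftest from the series. The names `probe_handler()`, `run_probe()`, `in_ucontext` and `in_thread_mask` are hypothetical. It assumes that SIGUSR1 and SIGUSR2 are not already blocked in the calling thread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <ucontext.h>

/* Set by the handler: is SIGUSR2 a member of uc_sigmask? */
static volatile sig_atomic_t in_ucontext = -1;
/* Set by the handler: is SIGUSR2 blocked while the handler runs? */
static volatile sig_atomic_t in_thread_mask = -1;

static void probe_handler(int sig, siginfo_t *info, void *ctx)
{
	ucontext_t *uc = ctx;
	sigset_t cur;

	(void)sig;
	(void)info;
	in_ucontext = sigismember(&uc->uc_sigmask, SIGUSR2);
	/* Query, without changing, the mask in effect inside the handler. */
	sigprocmask(SIG_BLOCK, NULL, &cur);
	in_thread_mask = sigismember(&cur, SIGUSR2);
}

/* Install the handler with SIGUSR2 in sa_mask, then raise SIGUSR1. */
static int run_probe(void)
{
	struct sigaction sa = { 0 };

	sa.sa_sigaction = probe_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaddset(&sa.sa_mask, SIGUSR2);
	if (sigaction(SIGUSR1, &sa, NULL))
		return -1;
	return raise(SIGUSR1);
}
```

With the behaviour described above, `run_probe()` leaves `in_ucontext` at 0 and `in_thread_mask` at 1: SIGUSR2 is blocked while the handler executes, yet it is absent from the mask saved in ucontext_t.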
The proposed selftest has been tested out on Aarch32, Aarch64 and x86_64.
Dev Jain (2):
selftests: Rename sigaltstack to generic signal
selftests: Add a test mangling with uc_sigmask
tools/testing/selftests/Makefile | 2 +-
.../{sigaltstack => signal}/.gitignore | 3 +-
.../{sigaltstack => signal}/Makefile | 3 +-
.../current_stack_pointer.h | 0
.../selftests/signal/mangle_uc_sigmask.c | 141 ++++++++++++++++++
.../sas.c => signal/sigaltstack.c} | 0
6 files changed, 146 insertions(+), 3 deletions(-)
rename tools/testing/selftests/{sigaltstack => signal}/.gitignore (57%)
rename tools/testing/selftests/{sigaltstack => signal}/Makefile (53%)
rename tools/testing/selftests/{sigaltstack => signal}/current_stack_pointer.h (100%)
create mode 100644 tools/testing/selftests/signal/mangle_uc_sigmask.c
rename tools/testing/selftests/{sigaltstack/sas.c => signal/sigaltstack.c} (100%)
--
2.34.1
Hello,
We're pleased to announce the return of the Kernel Testing &
Dependability Micro-Conference at Linux Plumbers 2024:
https://lpc.events/event/18/contributions/1665/
You can already submit proposals by selecting the micro-conf in
the Track drop-down list:
https://lpc.events/login/?next=/event/18/abstracts/%23submit-abstract
Please note that the deadline for submissions is *Sunday 16th June*.
The event description contains a list of suggested topics
inherited from past editions. Is there anything in particular
you would like to see discussed this year?
Knowing people's interests helps with triaging proposals and
making the micro-conf as relevant as possible. See you there!
Thanks,
Guillaume & Shuah & Sasha