From: Geliang Tang <tanggeliang(a)kylinos.cn>
v6:
- add a fix for tls_sw_recvmsg().
v5:
- add a new patch "Check recv lengths in test_sockmap" instead of using
"continue" in msg_loop.
v4:
- address Martin's comments for v3. (thanks.)
- add Yonghong's "Acked-by" tags. (thanks.)
- update subject-prefix from "bpf-next" to "bpf".
Patch 1, v3 of "selftests/bpf: Add F_SETFL for fcntl":
- detect nonblock flag automatically, then test_sockmap can run in both
block and nonblock modes.
- use continue instead of again in v2.
Patch 2, fix for umount cgroup2 error.
Geliang Tang (2):
tls: wait for receiving next skb for sk_redirect
selftests/bpf: Add F_SETFL for fcntl in test_sockmap
net/tls/tls_sw.c | 2 ++
tools/testing/selftests/bpf/test_sockmap.c | 4 +++-
2 files changed, 5 insertions(+), 1 deletion(-)
--
2.43.0
PMU event filter test fails on zen4 architecture because of the
unavailability of family and model check for zen4 in use_amd_pmu().
use_amd_pmu() is added to detect architectures that supports event
select 0xc2 umask 0 as "retired branch instructions".
Model ranges in is_zen1(), is_zen2() and is_zen3() are used only for
sever SOCs, so they might not cover all the model ranges which supports
retired branch instructions.
X86_FEATURE_ZEN is a synthetic feature flag specifically added to
recognize all Zen generations by commit 232afb557835d ("x86/CPU/AMD: Add
X86_FEATURE_ZEN1"). init_amd_zen_common() uses family >= 0x17 check to
enable X86_FEATURE_ZEN.
Family 17h+ is where Zen and its successors start and that event 0xc2,0
is supported on all currently released F17h+ processors as branch
instruction retired and it is true going forward to maintain the
backward compatibility for the branch instruction retired.
Since X86_FEATURE_ZEN is not recognized in selftest framework, instead
of checking family and model value for all zen architecture, "family >=
0x17" check is added in use_amd_pmu().
Fixes: bef9a701f3eb ("selftests: kvm/x86: Add test for KVM_SET_PMU_EVENT_FILTER")
Suggested-by: Sandipan Das <sandipan.das(a)amd.com>
Signed-off-by: Manali Shukla <manali.shukla(a)amd.com>
---
.../kvm/x86_64/pmu_event_filter_test.c | 32 +++----------------
1 file changed, 5 insertions(+), 27 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86_64/pmu_event_filter_test.c b/tools/testing/selftests/kvm/x86_64/pmu_event_filter_test.c
index 26b3e7efe5dd..f65033fab0c0 100644
--- a/tools/testing/selftests/kvm/x86_64/pmu_event_filter_test.c
+++ b/tools/testing/selftests/kvm/x86_64/pmu_event_filter_test.c
@@ -353,38 +353,16 @@ static bool use_intel_pmu(void)
kvm_pmu_has(X86_PMU_FEATURE_BRANCH_INSNS_RETIRED);
}
-static bool is_zen1(uint32_t family, uint32_t model)
-{
- return family == 0x17 && model <= 0x0f;
-}
-
-static bool is_zen2(uint32_t family, uint32_t model)
-{
- return family == 0x17 && model >= 0x30 && model <= 0x3f;
-}
-
-static bool is_zen3(uint32_t family, uint32_t model)
-{
- return family == 0x19 && model <= 0x0f;
-}
-
/*
- * Determining AMD support for a PMU event requires consulting the AMD
- * PPR for the CPU or reference material derived therefrom. The AMD
- * test code herein has been verified to work on Zen1, Zen2, and Zen3.
- *
- * Feel free to add more AMD CPUs that are documented to support event
- * select 0xc2 umask 0 as "retired branch instructions."
+ * Family 17h+ is where Zen and its successors start and that event
+ * 0xc2,0 is supported on all currently released F17h+ processors as
+ * branch instruction retired and it is true going forward to maintain
+ * the backward compatibility for the branch instruction retired.
*/
static bool use_amd_pmu(void)
{
uint32_t family = kvm_cpu_family();
- uint32_t model = kvm_cpu_model();
-
- return host_cpu_is_amd &&
- (is_zen1(family, model) ||
- is_zen2(family, model) ||
- is_zen3(family, model));
+ return family >= 0x17;
}
/*
--
2.34.1
From: Jeff Xu <jeffxu(a)chromium.org>
When MFD_NOEXEC_SEAL was introduced, there was one big mistake: it
didn't have proper documentation. This led to a lot of confusion,
especially about whether or not memfd created with the MFD_NOEXEC_SEAL
flag is sealable. Before MFD_NOEXEC_SEAL, memfd had to explicitly set
MFD_ALLOW_SEALING to be sealable, so it's a fair question.
As one might have noticed, unlike other flags in memfd_create,
MFD_NOEXEC_SEAL is actually a combination of multiple flags. The idea
is to make it easier to use memfd in the most common way, which is
NOEXEC + F_SEAL_EXEC + MFD_ALLOW_SEALING. This works with sysctl
vm.noexec to help existing applications move to a more secure way of
using memfd.
Proposals have been made to put MFD_NOEXEC_SEAL non-sealable, unless
MFD_ALLOW_SEALING is set, to be consistent with other flags [1] [2],
Those are based on the viewpoint that each flag is an atomic unit,
which is a reasonable assumption. However, MFD_NOEXEC_SEAL was
designed with the intent of promoting the most secure method of using
memfd, therefore a combination of multiple functionalities into one
bit.
Furthermore, the MFD_NOEXEC_SEAL has been added for more than one
year, and multiple applications and distributions have backported and
utilized it. Altering ABI now presents a degree of risk and may lead
to disruption.
MFD_NOEXEC_SEAL is a new flag, and applications must change their code
to use it. There is no backward compatibility problem.
When sysctl vm.noexec == 1 or 2, applications that don't set
MFD_NOEXEC_SEAL or MFD_EXEC will get MFD_NOEXEC_SEAL memfd. And
old-application might break, that is by-design, in such a system
vm.noexec = 0 shall be used. Also no backward compatibility problem.
I propose to include this documentation patch to assist in clarifying
the semantics of MFD_NOEXEC_SEAL, thereby preventing any potential
future confusion.
This patch supersede previous patch which is trying different
direction [3], and please remove [2] from mm-unstable branch when
applying this patch.
Finally, I would like to express my gratitude to David Rheinsberg and
Barnabás Pőcze for initiating the discussion on the topic of sealability.
[1]
https://lore.kernel.org/lkml/20230714114753.170814-1-david@readahead.eu/
[2]
https://lore.kernel.org/lkml/20240513191544.94754-1-pobrn@protonmail.com/
[3]
https://lore.kernel.org/lkml/20240524033933.135049-1-jeffxu@google.com/
v3:
Additional Randy Dunlap' comments.
v2:
Update according to Randy Dunlap' comments.
https://lore.kernel.org/linux-mm/20240611034903.3456796-1-jeffxu@chromium.o…
v1:
https://lore.kernel.org/linux-mm/20240607203543.2151433-1-jeffxu@google.com/
Jeff Xu (1):
mm/memfd: add documentation for MFD_NOEXEC_SEAL MFD_EXEC
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/mfd_noexec.rst | 86 ++++++++++++++++++++++
2 files changed, 87 insertions(+)
create mode 100644 Documentation/userspace-api/mfd_noexec.rst
--
2.45.2.505.gda0bf45e8d-goog
From: Jeff Xu <jeffxu(a)chromium.org>
When MFD_NOEXEC_SEAL was introduced, there was one big mistake: it
didn't have proper documentation. This led to a lot of confusion,
especially about whether or not memfd created with the MFD_NOEXEC_SEAL
flag is sealable. Before MFD_NOEXEC_SEAL, memfd had to explicitly set
MFD_ALLOW_SEALING to be sealable, so it's a fair question.
As one might have noticed, unlike other flags in memfd_create,
MFD_NOEXEC_SEAL is actually a combination of multiple flags. The idea
is to make it easier to use memfd in the most common way, which is
NOEXEC + F_SEAL_EXEC + MFD_ALLOW_SEALING. This works with sysctl
vm.noexec to help existing applications move to a more secure way of
using memfd.
Proposals have been made to put MFD_NOEXEC_SEAL non-sealable, unless
MFD_ALLOW_SEALING is set, to be consistent with other flags [1] [2],
Those are based on the viewpoint that each flag is an atomic unit,
which is a reasonable assumption. However, MFD_NOEXEC_SEAL was
designed with the intent of promoting the most secure method of using
memfd, therefore a combination of multiple functionalities into one
bit.
Furthermore, the MFD_NOEXEC_SEAL has been added for more than one
year, and multiple applications and distributions have backported and
utilized it. Altering ABI now presents a degree of risk and may lead
to disruption.
MFD_NOEXEC_SEAL is a new flag, and applications must change their code
to use it. There is no backward compatibility problem.
When sysctl vm.noexec == 1 or 2, applications that don't set
MFD_NOEXEC_SEAL or MFD_EXEC will get MFD_NOEXEC_SEAL memfd. And
old-application might break, that is by-design, in such a system
vm.noexec = 0 shall be used. Also no backward compatibility problem.
I propose to include this documentation patch to assist in clarifying
the semantics of MFD_NOEXEC_SEAL, thereby preventing any potential
future confusion.
This patch supersede previous patch which is trying different
direction [3], and please remove [2] from mm-unstable branch when
applying this patch.
Finally, I would like to express my gratitude to David Rheinsberg and
Barnabás Pőcze for initiating the discussion on the topic of sealability.
[1]
https://lore.kernel.org/lkml/20230714114753.170814-1-david@readahead.eu/
[2]
https://lore.kernel.org/lkml/20240513191544.94754-1-pobrn@protonmail.com/
[3]
https://lore.kernel.org/lkml/20240524033933.135049-1-jeffxu@google.com/
v2:
Update according to Randy Dunlap' comments.
v1:
https://lore.kernel.org/linux-mm/20240607203543.2151433-1-jeffxu@google.com/
Jeff Xu (1):
mm/memfd: add documentation for MFD_NOEXEC_SEAL MFD_EXEC
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/mfd_noexec.rst | 86 ++++++++++++++++++++++
2 files changed, 87 insertions(+)
create mode 100644 Documentation/userspace-api/mfd_noexec.rst
--
2.45.2.505.gda0bf45e8d-goog
These two subsystems require very similar fixes, so I'm sending them
out together.
Changes since the first version:
1) Rebased onto Linux 6.10-rc1.
2) Added a Reviewed-by tag from Ryan Roberts. See [1] for that.
Related work: I've sent a separate fix that allows "make CC=clang" to
work in addition to "make LLVM=1" [2].
[1] https://lore.kernel.org/518dd1e3-e31a-41c3-b488-9b75a64b6c8a@arm.com
[2] https://lore.kernel.org/20240531183751.100541-2-jhubbard@nvidia.com
John Hubbard (2):
selftests/openat2: fix clang build failures: -static-libasan,
LOCAL_HDRS
selftests/fchmodat2: fix clang build failure due to -static-libasan
tools/testing/selftests/fchmodat2/Makefile | 11 ++++++++++-
tools/testing/selftests/openat2/Makefile | 14 ++++++++++++--
2 files changed, 22 insertions(+), 3 deletions(-)
base-commit: cc8ed4d0a8486c7472cd72ec3c19957e509dc68c
--
2.45.1
Correctable memory errors are very common on servers with large
amount of memory, and are corrected by ECC, but with two
pain points to users:
1. Correction usually happens on the fly and adds latency overhead
2. Not-fully-proved theory states excessive correctable memory
errors can develop into uncorrectable memory error.
Soft offline is kernel's additional solution for memory pages
having (excessive) corrected memory errors. Impacted page is migrated
to healthy page if it is in use, then the original page is discarded
for any future use.
The actual policy on whether (and when) to soft offline should be
maintained by userspace, especially in case of HugeTLB hugepages.
Soft-offline dissolves a hugepage, either in-use or free, into
chunks of 4K pages, reducing HugeTLB pool capacity by 1 hugepage.
If userspace has not acknowledged such behavior, it may be surprised
when later mmap hugepages MAP_FAILED due to lack of hugepages.
In addition, discarding the entire 1G memory page only because of
corrected memory errors sounds very costly and kernel better not
doing under the hood. But today there are at least 2 such cases:
1. GHES driver sees both GHES_SEV_CORRECTED and
CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER.
2. RAS Correctable Errors Collector counts correctable errors per
PFN and when the counter for a PFN reaches threshold
In both cases, userspace has no control of the soft offline performed
by kernel's memory failure recovery.
This patch series give userspace the control of soft-offlining
HugeTLB pages: kernel only soft offlines hugepage if userspace has
opt-ed in for that specific hugepage size, and exposed to userspace
by a new sysfs entry called softoffline_corrected_errors under
/sys/kernel/mm/hugepages/hugepages-${size}kB directory:
* When softoffline_corrected_errors=0, skip soft offlining for all
hugepages of size ${size}kB.
* When softoffline_corrected_errors=1, soft offline as before this
patch series.
By default softoffline_corrected_errors is 1.
This patch set is based at
commit a52b4f11a2e1 ("selftest mm/mseal read-only elf memory segment").
Jiaqi Yan (3):
mm/memory-failure: userspace controls soft-offlining hugetlb pages
selftest/mm: test softoffline_corrected_errors behaviors
docs: hugetlbpage.rst: add softoffline_corrected_errors
Documentation/admin-guide/mm/hugetlbpage.rst | 15 +-
include/linux/hugetlb.h | 17 ++
mm/hugetlb.c | 34 +++
mm/memory-failure.c | 7 +
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
.../selftests/mm/hugetlb-soft-offline.c | 262 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
8 files changed, 340 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/mm/hugetlb-soft-offline.c
--
2.45.1.288.g0e0cd299f1-goog