This patchset improves the expressiveness of unprivileged BPF by
inserting speculation barriers instead of rejecting the programs outright.
The approach was previously presented at LPC'24 [1] and RAID'24 [2].
To mitigate the Spectre v1 (PHT) vulnerability, the kernel rejects
potentially-dangerous unprivileged BPF programs as of
commit 9183671af6db ("bpf: Fix leakage under speculation on mispredicted
branches"). In [2], we have analyzed 364 object files from open source
projects (Linux Samples and Selftests, BCC, Loxilb, Cilium, libbpf
Examples, Parca, and Prevail) and found that this affects 31% to 54% of
programs.
To resolve this in the majority of cases, this patchset adds a fallback
for mitigating Spectre v1 using speculation barriers. The kernel still
optimistically attempts to verify all speculative paths but uses
speculation barriers against v1 when unsafe behavior is detected. This
allows more programs to be accepted without disabling the BPF Spectre
mitigations altogether (e.g., by booting with mitigations=off, which
makes cpu_mitigations_off() return true).
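As a concrete illustration, consider the classic Spectre v1 gadget
below, a hypothetical sketch in plain C rather than code from the
patchset:

  #define ARR_SIZE 16

  static unsigned char arr[ARR_SIZE];

  /*
   * If the bounds check is mispredicted, the load may execute
   * speculatively with an out-of-bounds idx. The verifier used to
   * reject unprivileged programs containing such unsafe speculative
   * paths; with this series, it emits a speculation barrier before
   * the load instead.
   */
  unsigned char gadget(unsigned long idx)
  {
          if (idx < ARR_SIZE)      /* bypassable under misprediction */
                  return arr[idx]; /* barrier inserted when unsafe */
          return 0;
  }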
In [1] we measured the overhead of this approach (together with the
upstream Spectre v4 mitigations) relative to having mitigations off. For
event tracing and stack-sampling profilers, we found that mitigations
increase BPF program execution time by 0% to 62%. For the Loxilb network
load balancer, we measured a 14% slowdown in SCTP performance but no
significant slowdown for TCP. This overhead only applies to programs
that were previously rejected.
I reran the expressiveness evaluation with v6.14 and made sure the main
results still match those from [1] and [2] (which used v6.5).
Main design decisions are:
* Do not use separate bytecode insns for v1 and v4 barriers. This
simplifies the verifier significantly; the only downside is that
performance on PowerPC is not as high as it could be.
* Allow archs to still disable the v1/v4 mitigations separately by
setting bpf_jit_bypass_spec_v1/v4(). This way, archs that are not
vulnerable (e.g., ARM64 for v4 in the kernel) gain the improved BPF
expressiveness/performance (a sketch of this hook follows the list).
* Do not remove the empty BPF_NOSPEC implementation for backends for
which it is unknown whether they are vulnerable to Spectre v1.
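To illustrate the second point, a rough sketch of the arch hook,
assuming the usual __weak-override pattern; the function name matches
the patches, but both bodies here are illustrative only:

  /* kernel/bpf/core.c: default, used unless an arch overrides it */
  bool __weak bpf_jit_bypass_spec_v1(void)
  {
          return false;
  }

  /* arch/<arch>/net/bpf_jit_comp.c: hypothetical override for an
   * arch that is not vulnerable to Spectre v1
   */
  bool bpf_jit_bypass_spec_v1(void)
  {
          return true;
  }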
[1] https://lpc.events/event/18/contributions/1954/ ("Mitigating
Spectre-PHT using Speculation Barriers in Linux eBPF")
[2] https://arxiv.org/pdf/2405.00078 ("VeriFence: Lightweight and
Precise Spectre Defenses for Untrusted Linux Kernel Extensions")
Changes:
* RFC -> v1:
- rebase to bpf-next-250313
- tests: mark expected successes/new errors
- add bpf_jit_bypass_spec_v1/v4() to avoid #ifdef in
bpf_bypass_spec_v1/v4()
- ensure that nospec with v1-support is implemented for archs for
which GCC supports speculation barriers, except for MIPS
- arm64: emit speculation barrier
- powerpc: change nospec to include v1 barrier
- discuss potential security (archs that do not impl. BPF nospec) and
performance (only PowerPC) regressions
RFC: https://lore.kernel.org/bpf/20250224203619.594724-1-luis.gerhorst@fau.de/
Luis Gerhorst (11):
bpf: Move insn if/else into do_check_insn()
bpf: Return -EFAULT on misconfigurations
bpf: Return -EFAULT on internal errors
bpf, arm64, powerpc: Add bpf_jit_bypass_spec_v1/v4()
bpf, arm64, powerpc: Change nospec to include v1 barrier
bpf: Rename sanitize_stack_spill to nospec_result
bpf: Fall back to nospec for Spectre v1
bpf: Allow nospec-protected var-offset stack access
bpf: Return PTR_ERR from push_stack()
bpf: Fall back to nospec for sanitization-failures
bpf: Fall back to nospec for spec path verification
arch/arm64/net/bpf_jit.h | 5 +
arch/arm64/net/bpf_jit_comp.c | 28 +-
arch/powerpc/net/bpf_jit_comp64.c | 79 +-
include/linux/bpf.h | 11 +-
include/linux/bpf_verifier.h | 3 +-
include/linux/filter.h | 2 +-
kernel/bpf/core.c | 32 +-
kernel/bpf/verifier.c | 723 ++++++++++--------
.../selftests/bpf/progs/verifier_and.c | 3 +-
.../selftests/bpf/progs/verifier_bounds.c | 35 +-
.../bpf/progs/verifier_bounds_deduction.c | 43 +-
.../selftests/bpf/progs/verifier_map_ptr.c | 12 +-
.../selftests/bpf/progs/verifier_movsx.c | 6 +-
.../selftests/bpf/progs/verifier_unpriv.c | 3 +-
.../bpf/progs/verifier_value_ptr_arith.c | 50 +-
.../selftests/bpf/verifier/dead_code.c | 3 +-
tools/testing/selftests/bpf/verifier/jmp32.c | 33 +-
tools/testing/selftests/bpf/verifier/jset.c | 10 +-
18 files changed, 630 insertions(+), 451 deletions(-)
base-commit: 46d38f489ef02175dcff1e03a849c226eb0729a6
--
2.48.1
This patch series extends the sev_init2 and the sev_smoke test to
exercise the SEV-SNP VM launch workflow.
Primarily, it introduces the architectural defines, adds their support
to the SEV library, and extends the tests to interact with the SEV-SNP
ioctl() wrappers.
Patch 1 - Do not advertise SNP on initialization failure
Patch 2 - SNP test for KVM_SEV_INIT2
Patch 3 - Add vmgexit helper (sketched below)
Patch 4 - Add SMT control interface helper
Patch 5 - Replace assert() with TEST_ASSERT_EQ()
Patch 6 - Introduce SEV+ VM type check
Patch 7 - SNP ioctl() plumbing for the SEV library
Patch 8 - Force set GUEST_MEMFD for SNP
Patch 9 - Cleanups of smoke test - Decouple policy from type
Patch 10 - SNP smoke test
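As a reference point for Patch 3: VMGEXIT is encoded as "rep; vmmcall",
so the helper can plausibly be a one-line inline like the sketch below
(illustrative; the patch itself is authoritative):

  static inline void vmgexit(void)
  {
          __asm__ __volatile__("rep; vmmcall");
  }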
The series is based on
git.kernel.org/pub/scm/virt/kvm/kvm.git next
v7..v8:
* Dropped exporting the SNP initialized API from ccp to KVM. Instead
call SNP_PLATFORM_STATUS within KVM to query the initialization. (Tom)
While it may be cheaper to query sev->snp_initialized from ccp, making
the SNP platform call within KVM does away with any dependencies.
v6..v7:
https://lore.kernel.org/kvm/20250221210200.244405-7-prsampat@amd.com/
Based on comments from Sean -
* Replaced FW check with sev->snp_initialized
* Dropped the patch which removes SEV+ KVM advertisement if INIT fails.
This should now be resolved by the combination of the patches [1,2]
from Ashish.
* Change vmgexit to an inline function
* Export SMT control parsing interface to kvm_util
Note: the hyperv_cpuid kselftest is only compile-tested
* Replace assert() with TEST_ASSERT_EQ() within SEV library
* Define KVM_SEV_PAGE_TYPE_INVALID for SEV call of encrypt_region()
* Parameterize encrypt_region() to include privatize_region()
* Deduplication of sev test calls between SEV,SEV-ES and SNP
* Removed FW version tests for SNP
* Included testing of SNP_POLICY_DBG
* Dropped most tags from patches that have been changed or indirectly
affected
[1] https://lore.kernel.org/all/d6d08c6b-9602-4f3d-92c2-8db6d50a1b92@amd.com
[2] https://lore.kernel.org/all/f78ddb64087df27e7bcb1ae0ab53f55aa0804fab.173922…
v5..v6:
https://lore.kernel.org/kvm/ab433246-e97c-495b-ab67-b0cb1721fb99@amd.com/
* Rename is_sev_platform_init to sev_fw_initialized (Nikunj)
* Rename KVM CPU feature X86_FEATURE_SNP to X86_FEATURE_SEV_SNP (Nikunj)
* Collected Tags from Nikunj, Pankaj, Srikanth.
v4..v5:
https://lore.kernel.org/kvm/8e7d8172-879e-4a28-8438-343b1c386ec9@amd.com/
* Introduced a check to disable advertising support for SEV, SEV-ES
and SNP when platform initialization fails (Nikunj)
* Remove the redundant SNP check within is_sev_vm() (Nikunj)
* Cleanup of the encrypt_region flow for better readability (Nikunj)
* Refactor paths to use the canonical $(ARCH) to rebase for kvm/next
v3..v4:
https://lore.kernel.org/kvm/20241114234104.128532-1-pratikrajesh.sampat@amd…
* Remove SNP FW API version check in the test and ensure the KVM
capability advertises the presence of the feature. Retain the minimum
version definitions to exercise these API versions in the smoke test
* Retained only the SNP smoke test and SNP_INIT2 test
* The SNP architectural defined merged with SNP_INIT2 test patch
* SNP shutdown merged with SNP smoke test patch
* Add SEV VM type check to abstract comparisons and reduce clutter
* Define a SNP default policy which sets bits based on the presence of
SMT
* Decouple privatization and encryption for it to be SNP agnostic
* Assert for only positive tests using vm_ioctl()
* Dropped tested-by tags
In summary: based on comments from Sean, I have primarily reduced the
scope of this patch series to focus on breaking down the SNP smoke test
patch (v3, patch 2) so that it first introduces SEV-SNP support and then
uses this interface to extend the sev_init2 and the sev_smoke tests.
The rest of the v3 patchset, which introduces the ioctl, pre-fault,
fallocate, and negative tests, will be reworked and reintroduced in a
future patch series once the issues discussed have been addressed.
v2..v3:
https://lore.kernel.org/kvm/20240905124107.6954-1-pratikrajesh.sampat@amd.c…
* Remove the assignments for the prefault and fallocate test type
enums.
* Fix error message for sev launch measure and finish.
* Collect tested-by tags [Peter, Srikanth]
Pratik R. Sampat (10):
KVM: SEV: Disable SEV-SNP support on initialization failure
KVM: selftests: SEV-SNP test for KVM_SEV_INIT2
KVM: selftests: Add vmgexit helper
KVM: selftests: Add SMT control state helper
KVM: selftests: Replace assert() with TEST_ASSERT_EQ()
KVM: selftests: Introduce SEV VM type check
KVM: selftests: Add library support for interacting with SNP
KVM: selftests: Force GUEST_MEMFD flag for SNP VM type
KVM: selftests: Abstractions for SEV to decouple policy from type
KVM: selftests: Add a basic SEV-SNP smoke test
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/svm/sev.c | 30 +++++-
tools/arch/x86/include/uapi/asm/kvm.h | 1 +
.../testing/selftests/kvm/include/kvm_util.h | 35 +++++++
.../selftests/kvm/include/x86/processor.h | 1 +
tools/testing/selftests/kvm/include/x86/sev.h | 42 ++++++++-
tools/testing/selftests/kvm/lib/kvm_util.c | 7 +-
.../testing/selftests/kvm/lib/x86/processor.c | 4 +-
tools/testing/selftests/kvm/lib/x86/sev.c | 93 +++++++++++++++++--
.../testing/selftests/kvm/x86/hyperv_cpuid.c | 19 ----
.../selftests/kvm/x86/sev_init2_tests.c | 13 +++
.../selftests/kvm/x86/sev_smoke_test.c | 75 +++++++++------
12 files changed, 261 insertions(+), 60 deletions(-)
--
2.43.0
This series is built on top of Fuad's v7 "mapping guest_memfd backed
memory at the host" [1].
With James's KVM userfault [2], it is possible to handle stage-2 faults
in guest_memfd in userspace. However, KVM itself also triggers faults
in guest_memfd in some cases, for example: PV interfaces like kvmclock,
PV EOI and page table walking code when fetching the MMIO instruction on
x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
that KVM would be accessing those pages via userspace page tables. In
order for such faults to be handled in userspace, guest_memfd needs to
support userfaultfd.
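From userspace, the intended flow is the standard minor-fault
userfaultfd sequence, now usable on a guest_memfd-backed mapping. A
hedged sketch follows (error handling elided; addr/len are assumed to
be a mapping of the guest_memfd established elsewhere):

  #include <fcntl.h>
  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int register_minor(void *addr, unsigned long len)
  {
          int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
          struct uffdio_api api = { .api = UFFD_API };
          struct uffdio_register reg = {
                  .range = { .start = (unsigned long)addr, .len = len },
                  .mode = UFFDIO_REGISTER_MODE_MINOR,
          };

          ioctl(uffd, UFFDIO_API, &api);
          ioctl(uffd, UFFDIO_REGISTER, &reg);
          return uffd;
  }

  /* After populating the page contents through the guest_memfd
   * itself, resolve the fault by installing the existing page:
   */
  static void resolve(int uffd, unsigned long addr, unsigned long psz)
  {
          struct uffdio_continue cont = {
                  .range = { .start = addr, .len = psz },
          };

          ioctl(uffd, UFFDIO_CONTINUE, &cont);
  }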
Changes since v1 [4]:
- James, Peter: implement a full minor trap instead of a hybrid
missing/minor trap
- James, Peter: to avoid shmem- and guest_memfd-specific code in the
UFFDIO_CONTINUE implementation make it generic by calling
vm_ops->fault()
While generalising the UFFDIO_CONTINUE implementation helped avoid
guest_memfd-specific code in mm/userfaultfd, userfaultfd still needs
access to KVM code to be able to verify the VMA type when handling
UFFDIO_REGISTER_MODE_MINOR, so I used a similar approach to what Fuad
did for now [5].
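The shape of the generic lookup, as a simplified kernel-side sketch
(illustrative only; the actual patch differs in its details): the
minor-fault path resolves the folio via the VMA's fault handler instead
of calling into shmem directly:

  static struct folio *continue_get_folio(struct vm_area_struct *dst_vma,
                                          unsigned long dst_addr)
  {
          struct vm_fault vmf = {
                  .vma = dst_vma,
                  .address = dst_addr,
                  .pgoff = linear_page_index(dst_vma, dst_addr),
          };
          vm_fault_t ret = dst_vma->vm_ops->fault(&vmf);

          if (ret & VM_FAULT_ERROR)
                  return ERR_PTR(-EFAULT);
          return page_folio(vmf.page);
  }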
In v1, Peter mentioned the potential for eliminating taking the folio
lock [6]. I did not implement that, but according to my testing, the
performance of shmem minor fault handling stayed the same after the
migration to calling vm_ops->fault() (tested on x86).
Before:
./demand_paging_test -u MINOR -s shmem
Random seed: 0x6b8b4567
Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
guest physical test memory: [0x3fffbffff000, 0x3ffffffff000)
Finished creating vCPUs and starting uffd threads
Started all vCPUs
All vCPU threads joined
Total guest execution time: 10.979277020s
Per-vcpu demand paging rate: 23876.253375 pgs/sec/vcpu
Overall demand paging rate: 23876.253375 pgs/sec
After:
./demand_paging_test -u MINOR -s shmem
Random seed: 0x6b8b4567
Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
guest physical test memory: [0x3fffbffff000, 0x3ffffffff000)
Finished creating vCPUs and starting uffd threads
Started all vCPUs
All vCPU threads joined
Total guest execution time: 10.978893504s
Per-vcpu demand paging rate: 23877.087423 pgs/sec/vcpu
Overall demand paging rate: 23877.087423 pgs/sec
Nikita
[1] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/…
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAo…
[4] https://lore.kernel.org/kvm/20250303133011.44095-1-kalyazin@amazon.com/T/
[5] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/#Z2…
[6] https://lore.kernel.org/kvm/20250303133011.44095-1-kalyazin@amazon.com/T/#m…
Nikita Kalyazin (5):
mm: userfaultfd: generic continue for non hugetlbfs
KVM: guest_memfd: add kvm_gmem_vma_is_gmem
mm: userfaultfd: allow to register continue for guest_memfd
KVM: guest_memfd: add support for userfaultfd minor
KVM: selftests: test userfaultfd minor for guest_memfd
include/linux/mm_types.h | 3 +
include/linux/userfaultfd_k.h | 13 ++-
mm/hugetlb.c | 2 +-
mm/shmem.c | 3 +-
mm/userfaultfd.c | 25 +++--
.../testing/selftests/kvm/guest_memfd_test.c | 94 +++++++++++++++++++
virt/kvm/guest_memfd.c | 15 +++
virt/kvm/kvm_mm.h | 1 +
8 files changed, 146 insertions(+), 10 deletions(-)
base-commit: 3cc51efc17a2c41a480eed36b31c1773936717e0
--
2.47.1
Vector registers are zero initialized by the kernel. Stop accepting
"all ones" as a clean value.
Note that the old check was not working as expected given that
value == 0xff
can be assumed to be always false by the compiler as value's range is
[-128, 127]. Both GCC (-Wtype-limits) and clang
(-Wtautological-constant-out-of-range-compare) warn about this.
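A standalone illustration of the tautological comparison (hypothetical
code; plain char is signed on RISC-V):

  #include <stdio.h>

  int main(void)
  {
          signed char value = -1;  /* bit pattern 0xff */

          /*
           * The constant 0xff is the int 255, outside a signed char's
           * range of [-128, 127], so after integer promotion the
           * comparison is always false and the compiler may delete
           * the branch entirely.
           */
          if (value == 0xff)
                  printf("unreachable on signed-char targets\n");
          else
                  printf("value = %d\n", value);
          return 0;
  }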
Reviewed-by: Charlie Jenkins <charlie(a)rivosinc.com>
Tested-by: Charlie Jenkins <charlie(a)rivosinc.com>
Signed-off-by: Ignacio Encinas <ignacio(a)iencinas.com>
---
Changes in v2:
Remove code that becomes useless now that the only "clean" value for
vector registers is 0.
- Link to v1: https://lore.kernel.org/r/20250305-fix-v_exec_initval_nolibc-v1-1-b87b60e43…
---
tools/testing/selftests/riscv/vector/v_exec_initval_nolibc.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/riscv/vector/v_exec_initval_nolibc.c b/tools/testing/selftests/riscv/vector/v_exec_initval_nolibc.c
index 35c0812e32de0c82a54f84bd52c4272507121e35..4dde05e45a04122b566cedc36d20b072413b00e2 100644
--- a/tools/testing/selftests/riscv/vector/v_exec_initval_nolibc.c
+++ b/tools/testing/selftests/riscv/vector/v_exec_initval_nolibc.c
@@ -6,7 +6,7 @@
* the values. To further ensure consistency, this file is compiled without
* libc and without auto-vectorization.
*
- * To be "clean" all values must be either all ones or all zeroes.
+ * To be "clean" all values must be all zeroes.
*/
#define __stringify_1(x...) #x
@@ -14,9 +14,8 @@
int main(int argc, char **argv)
{
- char prev_value = 0, value;
+ char value = 0;
unsigned long vl;
- int first = 1;
if (argc > 2 && strcmp(argv[2], "x"))
asm volatile (
@@ -44,14 +43,11 @@ int main(int argc, char **argv)
"vsrl.vi " __stringify(register) ", " __stringify(register) ", 8\n\t" \
".option pop\n\t" \
: "=r" (value)); \
- if (first) { \
- first = 0; \
- } else if (value != prev_value || !(value == 0x00 || value == 0xff)) { \
+ if (value != 0x00) { \
printf("Register " __stringify(register) \
" values not clean! value: %u\n", value); \
exit(-1); \
} \
- prev_value = value; \
} \
})
---
base-commit: 03d38806a902b36bf364cae8de6f1183c0a35a67
change-id: 20250301-fix-v_exec_initval_nolibc-498d976c372d
Best regards,
--
Ignacio Encinas <ignacio(a)iencinas.com>
This patch series fixes a number of bugs in the cpuset partition code
and improves remote partition handling. The test_cpuset_prs.sh selftest
is also enhanced to allow more vigorous remote partition testing.
Waiman Long (10):
cgroup/cpuset: Fix race between newly created partition and dying one
cgroup/cpuset: Fix incorrect isolated_cpus update in
update_parent_effective_cpumask()
cgroup/cpuset: Fix error handling in remote_partition_disable()
cgroup/cpuset: Remove remote_partition_check() & make
update_cpumasks_hier() handle remote partition
cgroup/cpuset: Don't allow creation of local partition over a remote
one
cgroup/cpuset: Code cleanup and comment update
cgroup/cpuset: Remove unneeded goto in sched_partition_write() and
rename it
selftest/cgroup: Update test_cpuset_prs.sh to use | as effective CPUs
and state separator
selftest/cgroup: Clean up and restructure test_cpuset_prs.sh
selftest/cgroup: Add a remote partition transition test to
test_cpuset_prs.sh
include/linux/cgroup-defs.h | 1 +
include/linux/cgroup.h | 2 +-
kernel/cgroup/cgroup.c | 6 +
kernel/cgroup/cpuset-internal.h | 1 +
kernel/cgroup/cpuset.c | 401 +++++++-----
.../selftests/cgroup/test_cpuset_prs.sh | 617 ++++++++++++------
6 files changed, 649 insertions(+), 379 deletions(-)
--
2.48.1
Cppcheck warning:

  int result is assigned to long long variable. If the variable is long
  long to avoid loss of information, then you have loss of information.
This patch changes the type of page_size from 'unsigned int' to
'unsigned long' instead of using ULL suffixes. Changing hpage_size to
'unsigned long' was considered, but since gethugepage() expects an int,
this change was avoided.
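To illustrate the class of bug with hypothetical values (assuming a
64-bit target where unsigned long is 64 bits):

  #include <stdio.h>

  int main(void)
  {
          unsigned int page_size_32 = 4096;   /* old type */
          unsigned long page_size_64 = 4096;  /* new type */
          int num_pages = 1 << 20;            /* 1M pages = 4 GiB */
          long long mem_size;

          /* The multiplication is done in 32-bit arithmetic and
           * wraps before the implicit widening to long long.
           */
          mem_size = num_pages * page_size_32;
          printf("wrapped: %lld\n", mem_size);  /* prints 0 */

          /* With unsigned long, the whole expression is computed in
           * 64-bit arithmetic and no information is lost.
           */
          mem_size = num_pages * page_size_64;
          printf("correct: %lld\n", mem_size);  /* 4294967296 */
          return 0;
  }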
Reported-by: David Binderman <dcb314(a)hotmail.com>
Closes: https://lore.kernel.org/all/AS8PR02MB10217315060BBFDB21F19643E9CA62@AS8PR02…
Signed-off-by: Siddarth G <siddarthsgml(a)gmail.com>
---
Changes since v2:
- v2 had a conflict with current mainline, so this is a fresh patch
tools/testing/selftests/mm/pagemap_ioctl.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/mm/pagemap_ioctl.c b/tools/testing/selftests/mm/pagemap_ioctl.c
index 57b4bba2b45f..fe5ae8b25ff6 100644
--- a/tools/testing/selftests/mm/pagemap_ioctl.c
+++ b/tools/testing/selftests/mm/pagemap_ioctl.c
@@ -34,7 +34,7 @@
#define PAGEMAP "/proc/self/pagemap"
int pagemap_fd;
int uffd;
-unsigned int page_size;
+unsigned long page_size;
unsigned int hpage_size;
const char *progname;
@@ -184,7 +184,7 @@ void *gethugetlb_mem(int size, int *shmid)
int userfaultfd_tests(void)
{
- int mem_size, vec_size, written, num_pages = 16;
+ long mem_size, vec_size, written, num_pages = 16;
char *mem, *vec;
mem_size = num_pages * page_size;
@@ -213,7 +213,7 @@ int userfaultfd_tests(void)
written = pagemap_ioctl(mem, mem_size, vec, 1, PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC,
vec_size - 2, PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
if (written < 0)
- ksft_exit_fail_msg("error %d %d %s\n", written, errno, strerror(errno));
+ ksft_exit_fail_msg("error %ld %d %s\n", written, errno, strerror(errno));
ksft_test_result(written == 0, "%s all new pages must not be written (dirty)\n", __func__);
@@ -995,7 +995,7 @@ int unmapped_region_tests(void)
{
void *start = (void *)0x10000000;
int written, len = 0x00040000;
- int vec_size = len / page_size;
+ long vec_size = len / page_size;
struct page_region *vec = malloc(sizeof(struct page_region) * vec_size);
/* 1. Get written pages */
@@ -1051,7 +1051,7 @@ static void test_simple(void)
int sanity_tests(void)
{
unsigned long long mem_size, vec_size;
- int ret, fd, i, buf_size;
+ long ret, fd, i, buf_size;
struct page_region *vec;
char *mem, *fmem;
struct stat sbuf;
@@ -1160,7 +1160,7 @@ int sanity_tests(void)
ret = stat(progname, &sbuf);
if (ret < 0)
- ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+ ksft_exit_fail_msg("error %ld %d %s\n", ret, errno, strerror(errno));
fmem = mmap(NULL, sbuf.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (fmem == MAP_FAILED)
--
2.43.0
There are currently two ways in which ublk server exit is detected by
ublk_drv:
1. uring_cmd cancellation. If there are any outstanding uring_cmds which
have not been completed to the ublk server when it exits, io_uring
calls the uring_cmd callback with a special cancellation flag as the
issuing task is exiting.
2. I/O timeout. This is needed in addition to the above to handle the
"saturated queue" case, when all I/Os for a given queue are in the
ublk server, and therefore there are no outstanding uring_cmds to
cancel when the ublk server exits.
There are a couple of issues with this approach:
- It is complex and inelegant to have two methods to detect the same
condition.
- The second method detects ublk server exit only after a long delay
(~30s, the default timeout assigned by the block layer). This delays
the nosrv behavior from kicking in, and with it any subsequent
recovery of the device.
The second issue is brought to light with the new test_generic_04. It
fails before this fix:
selftests: ublk: test_generic_04.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 30.0611 s, 0.0 kB/s
DEAD
dd took 31 seconds to exit (>= 5s tolerance)!
generic_04 : [FAIL]
Fix this by instead detecting and handling ublk server exit in the
character file release callback. This has several advantages:
- This one place can handle both saturated and unsaturated queues. Thus,
it replaces both preexisting methods of detecting ublk server exit.
- It runs quickly on ublk server exit - there is no 30s delay.
- It starts the process of removing task references in ublk_drv. This is
needed if we want to relax restrictions in the driver, such as the
requirement that each queue be served by only one thread.
There is also the disadvantage that the character file release callback
can be triggered by an intentional close of the file, which is a
significant behavior change. Preexisting ublk servers (libublksrv) are
dependent on the ability to open/close the file multiple times. To
address this, only transition to a nosrv state if the file is released
while the ublk device is live. This allows for programs to open/close
the file multiple times during setup. It is still a behavior change if a
ublk server decides to close/reopen the file while the device is LIVE
(i.e. while it is responsible for serving I/O), but that would be highly
unusual. This behavior is in line with what is done by FUSE, which is
very similar to ublk in that a userspace daemon is providing services
traditionally provided by the kernel.
With this change in, the new test (and all other selftests, and all
ublksrv tests) pass:
selftests: ublk: test_generic_04.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 0.0376731 s, 0.0 kB/s
DEAD
generic_04 : [PASS]
Signed-off-by: Uday Shankar <ushankar(a)purestorage.com>
---
Changes in v2:
- Leave null ublk selftests target untouched, instead create new
fault_inject target for injecting per-I/O delay (Ming Lei)
- Allow multiple open/close of ublk character device with some
restrictions
- Drop patches which made it in separately at https://lore.kernel.org/r/20250401-ublk_selftests-v1-1-98129c9bc8bb@puresto…
- Consolidate more nosrv logic in ublk character device release, and
associated code cleanup
- Link to v1: https://lore.kernel.org/r/20250325-ublk_timeout-v1-0-262f0121a7bd@purestora…
---
drivers/block/ublk_drv.c | 187 +++++++-----------------
tools/testing/selftests/ublk/Makefile | 4 +-
tools/testing/selftests/ublk/fault_inject.c | 58 ++++++++
tools/testing/selftests/ublk/kublk.c | 6 +-
tools/testing/selftests/ublk/kublk.h | 4 +
tools/testing/selftests/ublk/test_generic_04.sh | 43 ++++++
6 files changed, 165 insertions(+), 137 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 2fd05c1bd30b03343cb6f357f8c08dd92ff47af9..d06f8a9aa23f8b846928247fc9e29002c10a49e3 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -162,7 +162,6 @@ struct ublk_queue {
bool force_abort;
bool timeout;
- bool canceling;
bool fail_io; /* copy of dev->state == UBLK_S_DEV_FAIL_IO */
unsigned short nr_io_ready; /* how many ios setup */
spinlock_t cancel_lock;
@@ -199,8 +198,6 @@ struct ublk_device {
struct completion completion;
unsigned int nr_queues_ready;
unsigned int nr_privileged_daemon;
-
- struct work_struct nosrv_work;
};
/* header of ublk_params */
@@ -209,8 +206,9 @@ struct ublk_params_header {
__u32 types;
};
-static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue *ubq);
-
+static void ublk_stop_dev_unlocked(struct ublk_device *ub);
+static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
+static void __ublk_quiesce_dev(struct ublk_device *ub);
static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
struct ublk_queue *ubq, int tag, size_t offset);
static inline unsigned int ublk_req_build_flags(struct request *req);
@@ -1314,8 +1312,6 @@ static void ublk_queue_cmd_list(struct ublk_queue *ubq, struct rq_list *l)
static enum blk_eh_timer_return ublk_timeout(struct request *rq)
{
struct ublk_queue *ubq = rq->mq_hctx->driver_data;
- unsigned int nr_inflight = 0;
- int i;
if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
if (!ubq->timeout) {
@@ -1326,26 +1322,6 @@ static enum blk_eh_timer_return ublk_timeout(struct request *rq)
return BLK_EH_DONE;
}
- if (!ubq_daemon_is_dying(ubq))
- return BLK_EH_RESET_TIMER;
-
- for (i = 0; i < ubq->q_depth; i++) {
- struct ublk_io *io = &ubq->ios[i];
-
- if (!(io->flags & UBLK_IO_FLAG_ACTIVE))
- nr_inflight++;
- }
-
- /* cancelable uring_cmd can't help us if all commands are in-flight */
- if (nr_inflight == ubq->q_depth) {
- struct ublk_device *ub = ubq->dev;
-
- if (ublk_abort_requests(ub, ubq)) {
- schedule_work(&ub->nosrv_work);
- }
- return BLK_EH_DONE;
- }
-
return BLK_EH_RESET_TIMER;
}
@@ -1368,9 +1344,6 @@ static blk_status_t ublk_prep_req(struct ublk_queue *ubq, struct request *rq)
if (ublk_nosrv_should_queue_io(ubq) && unlikely(ubq->force_abort))
return BLK_STS_IOERR;
- if (unlikely(ubq->canceling))
- return BLK_STS_IOERR;
-
/* fill iod to slot in io cmd buffer */
res = ublk_setup_iod(ubq, rq);
if (unlikely(res != BLK_STS_OK))
@@ -1391,16 +1364,6 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
if (res != BLK_STS_OK)
return res;
- /*
- * ->canceling has to be handled after ->force_abort and ->fail_io
- * is dealt with, otherwise this request may not be failed in case
- * of recovery, and cause hang when deleting disk
- */
- if (unlikely(ubq->canceling)) {
- __ublk_abort_rq(ubq, rq);
- return BLK_STS_OK;
- }
-
ublk_queue_cmd(ubq, rq);
return BLK_STS_OK;
}
@@ -1461,8 +1424,52 @@ static int ublk_ch_open(struct inode *inode, struct file *filp)
static int ublk_ch_release(struct inode *inode, struct file *filp)
{
struct ublk_device *ub = filp->private_data;
+ int i;
+ mutex_lock(&ub->mutex);
+ /*
+ * If the device is not live, we will not transition to a nosrv
+ * state. This protects against:
+ * - accidental poking of the ublk character device
+ * - some ublk servers which may open/close the ublk character
+ * device during startup
+ */
+ if (ub->dev_info.state != UBLK_S_DEV_LIVE)
+ goto out;
+
+ /*
+ * Since we are releasing the ublk character file descriptor, we
+ * know that there cannot be any concurrent file-related
+ * activity (e.g. uring_cmds or reads/writes). However, I/O
+ * might still be getting dispatched. Quiesce that too so that
+ * we don't need to worry about anything concurrent
+ */
+ blk_mq_quiesce_queue(ub->ub_disk->queue);
+
+ /*
+ * Handle any requests outstanding to the ublk server
+ */
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
+ ublk_abort_queue(ub, ublk_get_queue(ub, i));
+
+ /*
+ * Transition the device to the nosrv state. What exactly this
+ * means depends on the recovery flags
+ */
+ if (ublk_nosrv_should_stop_dev(ub)) {
+ ublk_stop_dev_unlocked(ub);
+ } else if (ublk_nosrv_dev_should_queue_io(ub)) {
+ __ublk_quiesce_dev(ub);
+ } else {
+ ub->dev_info.state = UBLK_S_DEV_FAIL_IO;
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
+ ublk_get_queue(ub, i)->fail_io = true;
+ }
+
+ blk_mq_unquiesce_queue(ub->ub_disk->queue);
+out:
clear_bit(UB_STATE_OPEN, &ub->state);
+ mutex_unlock(&ub->mutex);
return 0;
}
@@ -1556,57 +1563,6 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
}
}
-/* Must be called when queue is frozen */
-static bool ublk_mark_queue_canceling(struct ublk_queue *ubq)
-{
- bool canceled;
-
- spin_lock(&ubq->cancel_lock);
- canceled = ubq->canceling;
- if (!canceled)
- ubq->canceling = true;
- spin_unlock(&ubq->cancel_lock);
-
- return canceled;
-}
-
-static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue *ubq)
-{
- bool was_canceled = ubq->canceling;
- struct gendisk *disk;
-
- if (was_canceled)
- return false;
-
- spin_lock(&ub->lock);
- disk = ub->ub_disk;
- if (disk)
- get_device(disk_to_dev(disk));
- spin_unlock(&ub->lock);
-
- /* Our disk has been dead */
- if (!disk)
- return false;
-
- /*
- * Now we are serialized with ublk_queue_rq()
- *
- * Make sure that ubq->canceling is set when queue is frozen,
- * because ublk_queue_rq() has to rely on this flag for avoiding to
- * touch completed uring_cmd
- */
- blk_mq_quiesce_queue(disk->queue);
- was_canceled = ublk_mark_queue_canceling(ubq);
- if (!was_canceled) {
- /* abort queue is for making forward progress */
- ublk_abort_queue(ub, ubq);
- }
- blk_mq_unquiesce_queue(disk->queue);
- put_device(disk_to_dev(disk));
-
- return !was_canceled;
-}
-
static void ublk_cancel_cmd(struct ublk_queue *ubq, struct ublk_io *io,
unsigned int issue_flags)
{
@@ -1635,8 +1591,6 @@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
struct ublk_queue *ubq = pdu->ubq;
struct task_struct *task;
- struct ublk_device *ub;
- bool need_schedule;
struct ublk_io *io;
if (WARN_ON_ONCE(!ubq))
@@ -1649,16 +1603,9 @@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
if (WARN_ON_ONCE(task && task != ubq->ubq_daemon))
return;
- ub = ubq->dev;
- need_schedule = ublk_abort_requests(ub, ubq);
-
io = &ubq->ios[pdu->tag];
WARN_ON_ONCE(io->cmd != cmd);
ublk_cancel_cmd(ubq, io, issue_flags);
-
- if (need_schedule) {
- schedule_work(&ub->nosrv_work);
- }
}
static inline bool ublk_queue_ready(struct ublk_queue *ubq)
@@ -1756,13 +1703,13 @@ static struct gendisk *ublk_detach_disk(struct ublk_device *ub)
return disk;
}
-static void ublk_stop_dev(struct ublk_device *ub)
+static void ublk_stop_dev_unlocked(struct ublk_device *ub)
+ __must_hold(&ub->mutex)
{
struct gendisk *disk;
- mutex_lock(&ub->mutex);
if (ub->dev_info.state == UBLK_S_DEV_DEAD)
- goto unlock;
+ return;
if (ublk_nosrv_dev_should_queue_io(ub)) {
if (ub->dev_info.state == UBLK_S_DEV_LIVE)
__ublk_quiesce_dev(ub);
@@ -1771,38 +1718,12 @@ static void ublk_stop_dev(struct ublk_device *ub)
del_gendisk(ub->ub_disk);
disk = ublk_detach_disk(ub);
put_disk(disk);
- unlock:
- mutex_unlock(&ub->mutex);
- ublk_cancel_dev(ub);
}
-static void ublk_nosrv_work(struct work_struct *work)
+static void ublk_stop_dev(struct ublk_device *ub)
{
- struct ublk_device *ub =
- container_of(work, struct ublk_device, nosrv_work);
- int i;
-
- if (ublk_nosrv_should_stop_dev(ub)) {
- ublk_stop_dev(ub);
- return;
- }
-
mutex_lock(&ub->mutex);
- if (ub->dev_info.state != UBLK_S_DEV_LIVE)
- goto unlock;
-
- if (ublk_nosrv_dev_should_queue_io(ub)) {
- __ublk_quiesce_dev(ub);
- } else {
- blk_mq_quiesce_queue(ub->ub_disk->queue);
- ub->dev_info.state = UBLK_S_DEV_FAIL_IO;
- for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
- ublk_get_queue(ub, i)->fail_io = true;
- }
- blk_mq_unquiesce_queue(ub->ub_disk->queue);
- }
-
- unlock:
+ ublk_stop_dev_unlocked(ub);
mutex_unlock(&ub->mutex);
ublk_cancel_dev(ub);
}
@@ -2388,7 +2309,6 @@ static void ublk_remove(struct ublk_device *ub)
bool unprivileged;
ublk_stop_dev(ub);
- cancel_work_sync(&ub->nosrv_work);
cdev_device_del(&ub->cdev, &ub->cdev_dev);
unprivileged = ub->dev_info.flags & UBLK_F_UNPRIVILEGED_DEV;
ublk_put_device(ub);
@@ -2675,7 +2595,6 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
goto out_unlock;
mutex_init(&ub->mutex);
spin_lock_init(&ub->lock);
- INIT_WORK(&ub->nosrv_work, ublk_nosrv_work);
ret = ublk_alloc_dev_number(ub, header->dev_id);
if (ret < 0)
@@ -2807,7 +2726,6 @@ static inline void ublk_ctrl_cmd_dump(struct io_uring_cmd *cmd)
static int ublk_ctrl_stop_dev(struct ublk_device *ub)
{
ublk_stop_dev(ub);
- cancel_work_sync(&ub->nosrv_work);
return 0;
}
@@ -2927,7 +2845,6 @@ static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
/* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */
ubq->ubq_daemon = NULL;
ubq->timeout = false;
- ubq->canceling = false;
for (i = 0; i < ubq->q_depth; i++) {
struct ublk_io *io = &ubq->ios[i];
diff --git a/tools/testing/selftests/ublk/Makefile b/tools/testing/selftests/ublk/Makefile
index c7781efea0f33c02f340f90f547d3a37c1d1b8a0..afee027cccdd1b8f13f1cb9a90a3348cd54b18bc 100644
--- a/tools/testing/selftests/ublk/Makefile
+++ b/tools/testing/selftests/ublk/Makefile
@@ -6,6 +6,7 @@ LDLIBS += -lpthread -lm -luring
TEST_PROGS := test_generic_01.sh
TEST_PROGS += test_generic_02.sh
TEST_PROGS += test_generic_03.sh
+TEST_PROGS += test_generic_04.sh
TEST_PROGS += test_null_01.sh
TEST_PROGS += test_null_02.sh
@@ -26,7 +27,8 @@ TEST_GEN_PROGS_EXTENDED = kublk
include ../lib.mk
-$(TEST_GEN_PROGS_EXTENDED): kublk.c null.c file_backed.c common.c stripe.c
+$(TEST_GEN_PROGS_EXTENDED): kublk.c null.c file_backed.c common.c stripe.c \
+ fault_inject.c
check:
shellcheck -x -f gcc *.sh
diff --git a/tools/testing/selftests/ublk/fault_inject.c b/tools/testing/selftests/ublk/fault_inject.c
new file mode 100644
index 0000000000000000000000000000000000000000..e92d01e88e478a23df987ebff2a997212b831d31
--- /dev/null
+++ b/tools/testing/selftests/ublk/fault_inject.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Fault injection ublk target. Hack this up however you like for
+ * testing specific behaviors of ublk_drv. Currently is a null target
+ * with a configurable delay before completing each I/O. This delay can
+ * be used to test ublk_drv's handling of I/O outstanding to the ublk
+ * server when it dies.
+ */
+
+#include "kublk.h"
+
+static int ublk_fault_inject_tgt_init(const struct dev_ctx *ctx, struct ublk_dev *dev)
+{
+ const struct ublksrv_ctrl_dev_info *info = &dev->dev_info;
+ unsigned long dev_size = 250UL << 30;
+
+ dev->tgt.dev_size = dev_size;
+ dev->tgt.params = (struct ublk_params) {
+ .types = UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DMA_ALIGN |
+ UBLK_PARAM_TYPE_SEGMENT,
+ .basic = {
+ .logical_bs_shift = 9,
+ .physical_bs_shift = 12,
+ .io_opt_shift = 12,
+ .io_min_shift = 9,
+ .max_sectors = info->max_io_buf_bytes >> 9,
+ .dev_sectors = dev_size >> 9,
+ },
+ .dma = {
+ .alignment = 4095,
+ },
+ .seg = {
+ .seg_boundary_mask = 4095,
+ .max_segment_size = 32 << 10,
+ .max_segments = 32,
+ },
+ };
+
+ dev->private_data = (void *)ctx->delay_us;
+ return 0;
+}
+
+static int ublk_fault_inject_queue_io(struct ublk_queue *q, int tag)
+{
+ const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag);
+
+ usleep((unsigned long)q->dev->private_data);
+
+ ublk_complete_io(q, tag, iod->nr_sectors << 9);
+ return 0;
+}
+
+const struct ublk_tgt_ops fault_inject_tgt_ops = {
+ .name = "fault_inject",
+ .init_tgt = ublk_fault_inject_tgt_init,
+ .queue_io = ublk_fault_inject_queue_io,
+};
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index 91c282bc767449a418cce7fc816dc8e9fc732d6a..0fbfa43864453471219703451271540d5dfef593 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -10,6 +10,7 @@ static const struct ublk_tgt_ops *tgt_ops_list[] = {
&null_tgt_ops,
&loop_tgt_ops,
&stripe_tgt_ops,
+ &fault_inject_tgt_ops,
};
static const struct ublk_tgt_ops *ublk_find_tgt(const char *name)
@@ -1041,7 +1042,7 @@ static int cmd_dev_get_features(void)
static int cmd_dev_help(char *exe)
{
- printf("%s add -t [null|loop] [-q nr_queues] [-d depth] [-n dev_id] [backfile1] [backfile2] ...\n", exe);
+ printf("%s add -t [null|loop|stripe|fault_inject] [-q nr_queues] [-d depth] [-n dev_id] [backfile1] [backfile2] ...\n", exe);
printf("\t default: nr_queues=2(max 4), depth=128(max 128), dev_id=-1(auto allocation)\n");
printf("%s del [-n dev_id] -a \n", exe);
printf("\t -a delete all devices -n delete specified device\n");
@@ -1064,6 +1065,7 @@ int main(int argc, char *argv[])
{ "zero_copy", 0, NULL, 'z' },
{ "foreground", 0, NULL, 0 },
{ "chunk_size", 1, NULL, 0 },
+ { "delay_us", 1, NULL, 0 },
{ 0, 0, 0, 0 }
};
int option_idx, opt;
@@ -1112,6 +1114,8 @@ int main(int argc, char *argv[])
ctx.fg = 1;
if (!strcmp(longopts[option_idx].name, "chunk_size"))
ctx.chunk_size = strtol(optarg, NULL, 10);
+ if (!strcmp(longopts[option_idx].name, "delay_us"))
+ ctx.delay_us = strtoul(optarg, NULL, 10);
}
}
diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index 760ff8ffb8107037a19a8fb7ab408818845e010d..3750e67727eed89991158add49d30615ea012dae 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -70,6 +70,9 @@ struct dev_ctx {
/* stripe */
unsigned int chunk_size;
+ /* fault_inject */
+ unsigned long delay_us;
+
int _evtfd;
};
@@ -357,6 +360,7 @@ static inline int ublk_queue_use_zc(const struct ublk_queue *q)
extern const struct ublk_tgt_ops null_tgt_ops;
extern const struct ublk_tgt_ops loop_tgt_ops;
extern const struct ublk_tgt_ops stripe_tgt_ops;
+extern const struct ublk_tgt_ops fault_inject_tgt_ops;
void backing_file_tgt_deinit(struct ublk_dev *dev);
int backing_file_tgt_init(struct ublk_dev *dev);
diff --git a/tools/testing/selftests/ublk/test_generic_04.sh b/tools/testing/selftests/ublk/test_generic_04.sh
new file mode 100755
index 0000000000000000000000000000000000000000..48af48164aa444d8ac6a58fef1743d2a16a56a14
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_generic_04.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+TID="generic_04"
+ERR_CODE=0
+
+_prep_test "fault_inject" "fast cleanup when all I/Os of one hctx are in server"
+
+# configure ublk server to sleep 2s before completing each I/O
+dev_id=$(_add_ublk_dev -t fault_inject -q 2 -d 1 --delay_us 2000000)
+_check_add_dev $TID $?
+
+echo "dev id is ${dev_id}"
+
+STARTTIME=${SECONDS}
+
+dd if=/dev/urandom of=/dev/ublkb${dev_id} oflag=direct bs=4k count=1 &
+dd_pid=$!
+
+__ublk_kill_daemon ${dev_id} "DEAD"
+
+wait $dd_pid
+dd_exitcode=$?
+
+ENDTIME=${SECONDS}
+ELAPSED=$(($ENDTIME - $STARTTIME))
+
+# assert that dd sees an error and exits quickly after ublk server is
+# killed. previously this relied on seeing an I/O timeout and so would
+# take ~30s
+if [ $dd_exitcode -eq 0 ]; then
+ echo "dd unexpectedly exited successfully!"
+ ERR_CODE=255
+fi
+if [ $ELAPSED -ge 5 ]; then
+ echo "dd took $ELAPSED seconds to exit (>= 5s tolerance)!"
+ ERR_CODE=255
+fi
+
+_cleanup_test "fault_inject"
+_show_result $TID $ERR_CODE
---
base-commit: 710e2c687a16b28a873a282517a85faf02a8b7cc
change-id: 20250325-ublk_timeout-b06b9b51c591
Best regards,
--
Uday Shankar <ushankar(a)purestorage.com>