On Wed, Jun 11, 2025 at 10:04:35AM +0200, Thomas Weißschuh wrote:
On Wed, Jun 11, 2025 at 12:05:25AM -0700, Nicolin Chen wrote:
On Tue, Jun 10, 2025 at 08:46:57PM -0300, Jason Gunthorpe wrote:
On Tue, Jun 10, 2025 at 11:48:44AM -0700, Nicolin Chen wrote:
On Tue, Jun 10, 2025 at 09:09:02AM -0300, Jason Gunthorpe wrote:
On Tue, Jun 10, 2025 at 01:38:22PM +0200, Thomas Weißschuh wrote:
> ------------------------------------------------------------------
> # RUN iommufd_dirty_tracking.domain_dirty128M_huge.enforce_dirty ...
> # enforce_dirty: Test terminated unexpectedly by signal 11
Sig 11 is weird..
On another note, the selftest should use the kselftest_harness.h ASSERT_*() macros instead of plain assert().
IIRC the kselftest stuff explodes if you try to use its assert functions within a fixture setup/teardown context.
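For context, a harness-native fixture looks roughly like this (a minimal sketch against kselftest_harness.h, not the actual iommufd fixture; /dev/null stands in for the device the real test opens):

#include <fcntl.h>
#include <unistd.h>

#include "../kselftest_harness.h"

FIXTURE(example)
{
	int fd;
};

FIXTURE_SETUP(example)
{
	self->fd = open("/dev/null", O_RDWR);
	/* harness-style check; a plain assert() here would only SIGABRT
	 * without any TAP-friendly reporting */
	ASSERT_GE(self->fd, 0);
}

FIXTURE_TEARDOWN(example)
{
	close(self->fd);
}

TEST_F(example, fd_is_valid)
{
	EXPECT_GE(self->fd, 0);
}

TEST_HARNESS_MAIN

On failure the ASSERT_*() macros print the failing expression into the TAP log (like the "Expected 0 (0) == 1 (1)" lines further below) instead of just raising SIGABRT.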
I also wasn't able to reproduce this (x86, Ubuntu 24 LTS). Maybe it is ARM specific; I think Nicolin is running on ARM..
Yes. And I was running with a 64KB page size. I just quickly retried with a 4KB page size (matching x86), and all the failed tests pass now.
That's a weird thing to be sensitive to. Can you get a backtrace from the crash? What function/line is crashing?
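If attaching gdb to the test child is awkward on that setup, one way to capture a backtrace is a SIGSEGV handler built on glibc's backtrace() (a sketch, not part of the thread's code; compile with -g -rdynamic so backtrace_symbols_fd() can resolve symbol names):

#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void segv_handler(int sig)
{
	void *frames[32];
	int n = backtrace(frames, 32);

	/* backtrace_symbols_fd() uses write(2), so it is usable
	 * from a signal handler */
	backtrace_symbols_fd(frames, n, STDERR_FILENO);
	_exit(128 + sig);
}

int main(void)
{
	struct sigaction sa = { .sa_handler = segv_handler };

	/* install this early in the test binary */
	sigaction(SIGSEGV, &sa, NULL);

	/* deliberate null-pointer write, just to demo the handler */
	*(volatile int *)0 = 0;
	return 0;
}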
I think I'm starting to see what's going on. Here the harness code has a parent process and a child process:
428-	/* _metadata and potentially self are shared with all forks. */ \
429:	child = fork(); \
430:	if (child == 0) { \
431-		fixture_name##_setup(_metadata, self, variant->data); \
432-		/* Let setup failure terminate early. */ \
433-		if (_metadata->exit_code) \
434-			_exit(0); \
435-		*_metadata->no_teardown = false; \
436-		fixture_name##_##test_name(_metadata, self, variant->data); \
437-		_metadata->teardown_fn(false, _metadata, self, variant->data); \
438-		_exit(0); \
439:	} else if (child < 0 || child != waitpid(child, &status, 0)) { \
440-		ksft_print_msg("ERROR SPAWNING TEST GRANDCHILD\n"); \
441-		_metadata->exit_code = KSFT_FAIL; \
442-	} \
443-	_metadata->teardown_fn(true, _metadata, self, variant->data); \
444-	munmap(_metadata->no_teardown, sizeof(*_metadata->no_teardown)); \
445-	_metadata->no_teardown = NULL; \
446-	if (self && fixture_name##_teardown_parent) \
447-		munmap(self, sizeof(*self)); \
448-	if (WIFEXITED(status)) { \
449-		if (WEXITSTATUS(status)) \
450-			_metadata->exit_code = WEXITSTATUS(status); \
451-	} else if (WIFSIGNALED(status)) { \
452-		/* Forward signal to __wait_for_test(). */ \
453-		kill(getpid(), WTERMSIG(status)); \
454-	} \
....
456-	static void wrapper_##fixture_name##_##test_name##_teardown( \
457-		bool in_parent, struct __test_metadata *_metadata, \
458-		void *self, const void *variant) \
459-	{ \
460-		if (fixture_name##_teardown_parent == in_parent && \
461-		    !__atomic_test_and_set(_metadata->no_teardown, __ATOMIC_RELAXED)) \
462-			fixture_name##_teardown(_metadata, self, variant); \
463-	} \
I found there is a race between those two processes, resulting in teardown() not getting invoked. When I added some ksft_print_msg() calls in between the lines to debug, those tests mostly passed, as teardown() did get invoked.
I think the reason those hugepage cases fail is simply that the huge version of setup() takes longer..
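To make the suspected window easier to reason about, here is a standalone reduction of the shared-flag scheme above (my sketch, not the harness code; sleep(1) stands in for the slow hugepage setup()):

#include <stdbool.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* shared across fork(), like _metadata->no_teardown */
	bool *no_teardown = mmap(NULL, sizeof(*no_teardown),
				 PROT_READ | PROT_WRITE,
				 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	pid_t child;

	if (no_teardown == MAP_FAILED)
		return 1;
	*no_teardown = true;		/* setup not finished yet */

	child = fork();
	if (child == 0) {
		sleep(1);		/* "setup()": huge variant is slower */
		*no_teardown = false;	/* setup done, teardown now needed */
		/* ... test body would run here ... */
		if (!__atomic_test_and_set(no_teardown, __ATOMIC_RELAXED))
			puts("child runs teardown");
		_exit(0);
	}
	/* parent: blocks until the child (setup + test) has exited */
	waitpid(child, NULL, 0);
	if (!__atomic_test_and_set(no_teardown, __ATOMIC_RELAXED))
		puts("parent runs teardown");
	munmap(no_teardown, sizeof(*no_teardown));
	return 0;
}

In this reduction, whichever side's test_and_set sees false runs teardown exactly once; if the child exits before clearing the flag (a setup failure), neither side tears down.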
Can you try to recreate this issue with changes to tools/testing/selftests/kselftest_harness/harness-selftest.c?
Hmm, I assume all 9 cases should pass? Mine only passes 4 on rc1, without any change (perhaps we should start from here?):
TAP version 13
1..9
# Starting 9 tests from 4 test cases.
# RUN global.standalone_pass ...
# harness-selftest.c:19:standalone_pass:before
# harness-selftest.c:23:standalone_pass:after
# OK global.standalone_pass
ok 1 global.standalone_pass
# RUN global.standalone_fail ...
# harness-selftest.c:27:standalone_fail:before
# harness-selftest.c:29:standalone_fail:Expected 0 (0) == 1 (1)
# harness-selftest.c:30:standalone_fail:Expected 0 (0) == 1 (1)
# standalone_fail: Test terminated by assertion
# FAIL global.standalone_fail
not ok 2 global.standalone_fail
# RUN global.signal_pass ...
# harness-selftest.c:35:signal_pass:before
# harness-selftest.c:37:signal_pass:after
# OK global.signal_pass
ok 3 global.signal_pass
# RUN global.signal_fail ...
# harness-selftest.c:42:signal_fail:before
# harness-selftest.c:43:signal_fail:Expected 0 (0) == 1 (1)
# signal_fail: Test terminated by assertion
# FAIL global.signal_fail
not ok 4 global.signal_fail
# RUN fixture.pass ...
# harness-selftest.c:53:pass:setup
# harness-selftest.c:62:pass:before
# harness-selftest.c:19:pass:before
# harness-selftest.c:23:pass:after
# harness-selftest.c:66:pass:after
# harness-selftest.c:58:pass:teardown same-process=1
# OK fixture.pass
ok 5 fixture.pass
# RUN fixture.fail ...
# harness-selftest.c:53:fail:setup
# harness-selftest.c:70:fail:before
# harness-selftest.c:71:fail:Expected 0 (0) == 1 (1)
# harness-selftest.c:58:fail:teardown same-process=1
# fail: Test terminated by assertion
# FAIL fixture.fail
not ok 6 fixture.fail
# RUN fixture.timeout ...
# harness-selftest.c:53:timeout:setup
# harness-selftest.c:77:timeout:before
# timeout: Test terminated by timeout
# FAIL fixture.timeout
not ok 7 fixture.timeout
# RUN fixture_parent.pass ...
# harness-selftest.c:87:pass:setup
# harness-selftest.c:96:pass:before
# harness-selftest.c:98:pass:after
# harness-selftest.c:92:pass:teardown same-process=0
# OK fixture_parent.pass
ok 8 fixture_parent.pass
# RUN fixture_setup_failure.pass ...
# harness-selftest.c:106:pass:setup
# harness-selftest.c:108:pass:Expected 0 (0) == 1 (1)
# pass: Test terminated by assertion
# FAIL fixture_setup_failure.pass
not ok 9 fixture_setup_failure.pass
# FAILED: 4 / 9 tests passed.
# Totals: pass:4 fail:5 xfail:0 xpass:0 skip:0 error:0
I haven't figured out a proper fix yet, but something smells bad:
- *no_teardown is set non-atomically, while both processes call __atomic_test_and_set()
Does this make a difference?
diff --git a/tools/testing/selftests/kselftest_harness.h b/tools/testing/selftests/kselftest_harness.h
index 2925e47db995..89fb37a21d9d 100644
--- a/tools/testing/selftests/kselftest_harness.h
+++ b/tools/testing/selftests/kselftest_harness.h
@@ -410,7 +410,7 @@
 	/* Makes sure there is only one teardown, even when child forks again. */ \
 	_metadata->no_teardown = mmap(NULL, sizeof(*_metadata->no_teardown), \
 		PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); \
-	*_metadata->no_teardown = true; \
+	__atomic_store_n(_metadata->no_teardown, true, __ATOMIC_SEQ_CST); \
 	if (sizeof(*self) > 0) { \
 		if (fixture_name##_teardown_parent) { \
 			self = mmap(NULL, sizeof(*self), PROT_READ | PROT_WRITE, \
@@ -429,7 +429,7 @@
 		/* Let setup failure terminate early. */ \
 		if (_metadata->exit_code) \
 			_exit(0); \
-		*_metadata->no_teardown = false; \
+		__atomic_store_n(_metadata->no_teardown, false, __ATOMIC_SEQ_CST); \
 		fixture_name##_##test_name(_metadata, self, variant->data); \
 		_metadata->teardown_fn(false, _metadata, self, variant->data); \
 		_exit(0); \
@@ -455,7 +455,7 @@
 		void *self, const void *variant) \
 	{ \
 		if (fixture_name##_teardown_parent == in_parent && \
-		    !__atomic_test_and_set(_metadata->no_teardown, __ATOMIC_RELAXED)) \
+		    !__atomic_test_and_set(_metadata->no_teardown, __ATOMIC_SEQ_CST)) \
			fixture_name##_teardown(_metadata, self, variant); \
 	} \
 	static struct __test_metadata *_##fixture_name##_##test_name##_object; \
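(For completeness, the one-shot contract both memory orders rely on is just the builtin's return value: __atomic_test_and_set() returns the previous value, so only the first caller sees false. A tiny standalone demo, not from the harness:)

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	bool flag = false;

	/* first caller: previous value was false -> it runs teardown */
	if (!__atomic_test_and_set(&flag, __ATOMIC_SEQ_CST))
		puts("first caller runs teardown");

	/* second caller: flag already set -> returns true, skipped */
	if (!__atomic_test_and_set(&flag, __ATOMIC_SEQ_CST))
		puts("never printed");

	return 0;
}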
Unfortunately, no :(
- parent doesn't seem to wait for the setup() to complete..
setup() is called in the child (L431) right before the testcase itself is called (L436). The parent waits for the child to exit (L439) before unmapping.
- when the parent runs faster than the child that is still running setup(), the parent unmaps no_teardown and sets it to NULL, then we get a UAF in the child, i.e. signal 11?
That should never happen, as waitpid() will block until the child running setup() and the testcase itself has exited.
Ah, maybe I was wrong about that narrative. But the results show that iommufd_dirty_tracking_teardown() was not called in the failed cases:
// I added a huge.c file to run only 4 cases on one variant
// And I added two fprintf to its FIXTURE_SETUP/TEARDOWN().
TAP version 13
1..4
# Starting 4 tests from 1 test cases.
# RUN iommufd_dirty_tracking.domain_dirty64M_huge.set_dirty_tracking ...
---------iommufd_dirty_tracking_setup
---------iommufd_dirty_tracking_teardown
# OK iommufd_dirty_tracking.domain_dirty64M_huge.set_dirty_tracking
ok 1 iommufd_dirty_tracking.domain_dirty64M_huge.set_dirty_tracking
# RUN iommufd_dirty_tracking.domain_dirty64M_huge.device_dirty_capability ...
---------iommufd_dirty_tracking_setup
# device_dirty_capability: Test terminated unexpectedly by signal 11
# FAIL iommufd_dirty_tracking.domain_dirty64M_huge.device_dirty_capability
not ok 2 iommufd_dirty_tracking.domain_dirty64M_huge.device_dirty_capability
# RUN iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap ...
---------iommufd_dirty_tracking_setup
# get_dirty_bitmap: Test terminated unexpectedly by signal 11
# FAIL iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap
not ok 3 iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap
# RUN iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap_no_clear ...
---------iommufd_dirty_tracking_setup
# get_dirty_bitmap_no_clear: Test terminated unexpectedly by signal 11
# FAIL iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap_no_clear
not ok 4 iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap_no_clear
# FAILED: 1 / 4 tests passed.
Does the issue also happen when you only execute a single testcase?

$ ./iommufd -r iommufd_dirty_tracking.domain_dirty128M_huge.enforce_dirty
$ sudo tools/testing/selftests/iommu/iommufd -r iommufd_dirty_tracking.domain_dirty128M.enforce_dirty
TAP version 13
1..1
# Starting 1 tests from 1 test cases.
# RUN iommufd_dirty_tracking.domain_dirty128M.enforce_dirty ...
# OK iommufd_dirty_tracking.domain_dirty128M.enforce_dirty
ok 1 iommufd_dirty_tracking.domain_dirty128M.enforce_dirty
# PASSED: 1 / 1 tests passed.
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
This one passes. It looks like the first hugepage case passes, but the following ones fail when they run sequentially..
Thanks
Nicolin