On Tue, Jun 10, 2025 at 08:46:57PM -0300, Jason Gunthorpe wrote:
On Tue, Jun 10, 2025 at 11:48:44AM -0700, Nicolin Chen wrote:
On Tue, Jun 10, 2025 at 09:09:02AM -0300, Jason Gunthorpe wrote:
On Tue, Jun 10, 2025 at 01:38:22PM +0200, Thomas Weißschuh wrote:
# RUN iommufd_dirty_tracking.domain_dirty128M_huge.enforce_dirty ... # enforce_dirty: Test terminated unexpectedly by signal 11
Sig 11 is weird..
On another note, the selftest should use the kselftest_harness' ASSERT_*() macros instead of plain assert().
IIRC the kselftest stuff explodes if you try to use it's assert functions within a fixture setup/teardown context.
I also wasn't able to reproduce this (x86 ubuntu 24 LTS OS) Maybe it is ARM specific, I think Nicolin is running on ARM..
Yes. And I was running with 64KB page size. I just quickly retried with 4KB page size (matching x86), and all failed tests pass now.
That's a weird thing to be sensitive too. Can you get a backtrace from the crash, what function/line is crashing?
I think I am getting what's going on. Here the harness code has a parent process and a child process:
-------------------------------------------------------------- 428- /* _metadata and potentially self are shared with all forks. */ \ 429: child = fork(); \ 430: if (child == 0) { \ 431- fixture_name##_setup(_metadata, self, variant->data); \ 432- /* Let setup failure terminate early. */ \ 433- if (_metadata->exit_code) \ 434- _exit(0); \ 435- *_metadata->no_teardown = false; \ 436- fixture_name##_##test_name(_metadata, self, variant->data); \ 437- _metadata->teardown_fn(false, _metadata, self, variant->data); \ 438- _exit(0); \ 439: } else if (child < 0 || child != waitpid(child, &status, 0)) { \ 440- ksft_print_msg("ERROR SPAWNING TEST GRANDCHILD\n"); \ 441- _metadata->exit_code = KSFT_FAIL; \ 442- } \ 443- _metadata->teardown_fn(true, _metadata, self, variant->data); \ 444- munmap(_metadata->no_teardown, sizeof(*_metadata->no_teardown)); \ 445- _metadata->no_teardown = NULL; \ 446- if (self && fixture_name##_teardown_parent) \ 447- munmap(self, sizeof(*self)); \ 448- if (WIFEXITED(status)) { \ 449- if (WEXITSTATUS(status)) \ 450- _metadata->exit_code = WEXITSTATUS(status); \ 451- } else if (WIFSIGNALED(status)) { \ 452- /* Forward signal to __wait_for_test(). */ \ 453- kill(getpid(), WTERMSIG(status)); \ 454- } \ .... 456- static void wrapper_##fixture_name##_##test_name##_teardown( \ 457- bool in_parent, struct __test_metadata *_metadata, \ 458- void *self, const void *variant) \ 459- { \ 460- if (fixture_name##_teardown_parent == in_parent && \ 461- !__atomic_test_and_set(_metadata->no_teardown, __ATOMIC_RELAXED)) \ 462- fixture_name##_teardown(_metadata, self, variant); \ 463- } \ --------------------------------------------------------------
I found there is a race between those two processes, resulting in the teardown() not getting invoked: I added some ksft_print_msg() calls in-between the lines to debug, those tests can pass mostly, as teardown() got invoked.
I think the reason why those huge page cases fail is just because the huge version of setup() takes longer time..
I haven't figured out a proper fix yet, but something smells bad: 1) *no_teardown is set non-atomically, while both processes calls __atomic_test_and_set() 2) parent doesn't seem to wait for the setup() to complete.. 3) when parent runs faster than the child that is still running setup(), the parent unmaps the no_teardown and set it to NULL, then UAF in the child, i.e. signal 11?
I think Thomas should have an idea with these info :)
Thanks Nicolin