On 09.03.24 20:12, Mirsad Todorovac wrote:
Hi,
A routine run of the selftests on net-next also gave this mm unit test error.
root@defiant:tools/testing/selftests/mm# ./uffd-unit-tests
Testing UFFDIO_API (with syscall)... done
Testing UFFDIO_API (with /dev/userfaultfd)... done
Testing register-ioctls on anon... done
Testing register-ioctls on shmem... done
Testing register-ioctls on shmem-private... done
Testing register-ioctls on hugetlb... skipped [reason: memory allocation failed]
Testing register-ioctls on hugetlb-private... skipped [reason: memory allocation failed]
Testing zeropage on anon... done
Testing zeropage on shmem... done
Testing zeropage on shmem-private... done
Testing zeropage on hugetlb... skipped [reason: memory allocation failed]
Testing zeropage on hugetlb-private... skipped [reason: memory allocation failed]
Testing move on anon... done
Testing move-pmd on anon... done
Testing move-pmd-split on anon... done
Testing wp-fork on anon... done
Testing wp-fork on shmem... done
Testing wp-fork on shmem-private... done
Testing wp-fork on hugetlb... skipped [reason: memory allocation failed]
Testing wp-fork on hugetlb-private... skipped [reason: memory allocation failed]
Testing wp-fork-with-event on anon... done
Testing wp-fork-with-event on shmem... done
Testing wp-fork-with-event on shmem-private... done
Testing wp-fork-with-event on hugetlb... skipped [reason: memory allocation failed]
Testing wp-fork-with-event on hugetlb-private... skipped [reason: memory allocation failed]
Testing wp-fork-pin on anon... done
Testing wp-fork-pin on shmem... done
Testing wp-fork-pin on shmem-private... done
Testing wp-fork-pin on hugetlb... skipped [reason: memory allocation failed]
Testing wp-fork-pin on hugetlb-private... skipped [reason: memory allocation failed]
Testing wp-fork-pin-with-event on anon... done
Testing wp-fork-pin-with-event on shmem... done
Testing wp-fork-pin-with-event on shmem-private... done
Testing wp-fork-pin-with-event on hugetlb... skipped [reason: memory allocation failed]
Testing wp-fork-pin-with-event on hugetlb-private... skipped [reason: memory allocation failed]
Testing wp-unpopulated on anon... done
Testing minor on shmem... done
Testing minor on hugetlb... skipped [reason: memory allocation failed]
Testing minor-wp on shmem... done
Testing minor-wp on hugetlb... skipped [reason: memory allocation failed]
Testing minor-collapse on shmem... done
Testing sigbus on anon... done
Testing sigbus on shmem... done
Testing sigbus on shmem-private... done
Testing sigbus on hugetlb... skipped [reason: memory allocation failed]
Testing sigbus on hugetlb-private... skipped [reason: memory allocation failed]
Testing sigbus-wp on anon... done
Testing sigbus-wp on shmem... done
Testing sigbus-wp on shmem-private... done
Testing sigbus-wp on hugetlb... skipped [reason: memory allocation failed]
Testing sigbus-wp on hugetlb-private... skipped [reason: memory allocation failed]
Testing events on anon... done
Testing events on shmem... done
Testing events on shmem-private... done
Testing events on hugetlb... skipped [reason: memory allocation failed]
Testing events on hugetlb-private... skipped [reason: memory allocation failed]
Testing events-wp on anon... done
Testing events-wp on shmem... done
Testing events-wp on shmem-private... done
Testing events-wp on hugetlb... skipped [reason: memory allocation failed]
Testing events-wp on hugetlb-private... skipped [reason: memory allocation failed]
Testing poison on anon... done
Testing poison on shmem... done
Testing poison on shmem-private... done
Testing poison on hugetlb... skipped [reason: memory allocation failed]
Testing poison on hugetlb-private... skipped [reason: memory allocation failed]
Userfaults unit tests: pass=42, skip=24, fail=0 (total=66)
root@defiant:tools/testing/selftests/mm# grep -i huge /proc/meminfo
It resulted in alarming errors in the syslog:
Mar 9 19:48:24 defiant kernel: [77187.055103] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 4631e000
Mar 9 19:48:24 defiant kernel: [77187.055132] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 46320000
Mar 9 19:48:24 defiant kernel: [77187.055160] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 46322000
Mar 9 19:48:24 defiant kernel: [77187.055189] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 46324000
Mar 9 19:48:24 defiant kernel: [77187.055218] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 46326000
Mar 9 19:48:24 defiant kernel: [77187.055250] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 46328000
Mar 9 19:48:24 defiant kernel: [77187.055278] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 4632a000
Mar 9 19:48:24 defiant kernel: [77187.055307] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 4632c000
Mar 9 19:48:24 defiant kernel: [77187.055336] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 4632e000
Mar 9 19:48:24 defiant kernel: [77187.055366] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 46330000
Mar 9 19:48:24 defiant kernel: [77187.055395] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 46332000
Mar 9 19:48:24 defiant kernel: [77187.055423] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 46334000
Mar 9 19:48:24 defiant kernel: [77187.055452] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 46336000
Mar 9 19:48:24 defiant kernel: [77187.055480] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 46338000
Mar 9 19:48:24 defiant kernel: [77187.055509] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 4633a000
Mar 9 19:48:24 defiant kernel: [77187.055538] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 4633c000
Mar 9 19:48:24 defiant kernel: [77187.055567] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 4633e000
Mar 9 19:48:24 defiant kernel: [77187.055597] MCE: Killing uffd-unit-tests:1321817 due to hardware memory corruption fault at 46340000
At this point, it could be a problem with my box's memory chips, or something with HUGETLB.
However, since the "classic" allocations were successful, the problem might be in huge pages, or, if I understood correctly, in the deliberate poisoning of pages?
Isn't that just the (expected) side effect of UFFDIO_POISON tests?
IOW, there is no problem here. We are poisoning virtual memory locations (not actual memory) and expect a SIGBUS on next access. While testing that, we receive these messages.
The "ugly" thing here seems to be that we can trigger repeated pr_err() from user space. There is no rate-limiting in place. Maybe UFFDIO_POISON requires root permissions so this cannot be exploited by unprivileged user space to flood the system log?
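To illustrate what rate-limiting would mean for the burst above (18 messages within roughly half a millisecond), here is a minimal user-space sketch of an interval/burst limiter. It only models the idea behind kernel helpers such as pr_err_ratelimited(); it is not the kernel implementation, and the struct and function names (ratelimit_sketch, ratelimit_sketch_init, ratelimit_sketch_allow) are hypothetical. The 5 second / 10 message values mentioned below mirror the kernel's default ratelimit settings, but treat them as illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model of an interval/burst rate limiter. */
struct ratelimit_sketch {
	uint64_t interval_ns;   /* length of one window */
	unsigned int burst;     /* messages allowed per window */
	uint64_t window_start;  /* timestamp at which the current window began */
	unsigned int printed;   /* messages let through in the current window */
	unsigned int missed;    /* messages suppressed so far (all windows) */
	bool started;           /* false until the first message is seen */
};

void ratelimit_sketch_init(struct ratelimit_sketch *rs,
			   uint64_t interval_ns, unsigned int burst)
{
	assert(rs != NULL);
	rs->interval_ns = interval_ns;
	rs->burst = burst;
	rs->window_start = 0;
	rs->printed = 0;
	rs->missed = 0;
	rs->started = false;
}

/*
 * Returns true if a message at time now_ns may be printed, false if it
 * should be suppressed. A new window opens on the first call and again
 * whenever a full interval has elapsed since the current window began.
 */
bool ratelimit_sketch_allow(struct ratelimit_sketch *rs, uint64_t now_ns)
{
	assert(rs != NULL);

	if (!rs->started || now_ns - rs->window_start >= rs->interval_ns) {
		rs->window_start = now_ns;
		rs->printed = 0;
		rs->started = true;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	rs->missed++;
	return false;
}
```

With an interval of 5 seconds and a burst of 10, feeding this sketch 18 messages spaced about 30 microseconds apart lets the first 10 through and suppresses the remaining 8; the next message is allowed again only once the 5 second window has expired.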
CCing Axel