On 23/01/2024 07:51, Muhammad Usama Anjum wrote:
On 1/22/24 2:59 PM, Ryan Roberts wrote:
+CATEGORY="hugetlb" run_test ./hugetlb-read-hwpoison
The addition of this test causes 2 later tests to fail with ENOMEM. I suspect it's a side-effect of marking the hugetlb pages as hwpoisoned (just a guess based on the test name!). Once a page is marked poisoned, is there a way to un-poison it? If not, I suspect that's why it wasn't part of the standard test script in the first place.
hugetlb-read-hwpoison failed, probably because the fix for this test hasn't been merged into the kernel yet. The other tests (uffd-stress) aren't failing on my end or in CI [1][2].
To be clear, hugetlb-read-hwpoison isn't failing for me; it's just causing the subsequent uffd-stress tests to fail. Both of those tests allocate hugetlb pages, so my guess is that since this test marks some hugetlb pages as poisoned, there are no longer enough left for the subsequent tests.
[1] https://lava.collabora.dev/scheduler/job/12577207#L3677
[2] https://lava.collabora.dev/scheduler/job/12577229#L4027
Maybe it's a configuration issue which is exposed now; not sure. Maybe hugetlb-read-hwpoison is changing some configuration and not restoring it.
Well yes - it's marking some hugetlb pages as HWPOISONED.
Maybe your system has a smaller number of hugetlb pages.
Yes, probably. What is hugetlb-read-hwpoison's requirement for the size and number of hugetlb pages? The run_vmtests.sh script allocates the required number of default-sized hugetlb pages before running any tests (I guess this value should be increased to cover hugetlb-read-hwpoison's requirements?).
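For reference, the pool that run_vmtests.sh sets up can be inspected through the stock procfs/sysfs interfaces. A minimal sketch (these paths are the standard kernel interfaces, but they may be absent on kernels built without hugetlb support, so each read is guarded):

```shell
# Default-sized hugetlb pool (total pages currently reserved).
[ -r /proc/sys/vm/nr_hugepages ] && cat /proc/sys/vm/nr_hugepages

# Per-size pools, one directory per supported hugepage size.
for pool in /sys/kernel/mm/hugepages/hugepages-*; do
  [ -r "$pool/nr_hugepages" ] || continue
  echo "${pool##*/}: $(cat "$pool"/nr_hugepages) pages"
done

# Growing the default pool at runtime needs root and enough
# unfragmented free memory, e.g.:
#   echo 64 > /proc/sys/vm/nr_hugepages
echo "query done"
```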
Additionally, our CI preallocates non-default sizes from the kernel command line at boot. Happy to increase these if you can tell me what the new requirement is:
I'm not sure about the exact number of hugetlb pages these tests need, but I specify hugepages=1000 and the tests work for me.
1000 hugepages @ 2M is ~2G, which is quite a big ask for small arm systems. And for big arm systems that use 64K base pages, the default hugepage size is 512M, so 1000 of those is ~500G, which is also quite a big ask. So I'd prefer not to make 1000 hugepages the requirement.
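The back-of-envelope arithmetic above can be checked quickly (sizes are the default hugepage sizes for 4K and 64K base pages on arm64, as stated above):

```shell
# Memory cost of hugepages=1000 at the two default hugepage sizes.
pages=1000

# 4K base pages: default hugepage size is 2M.
echo "$(( pages * 2 ))M"           # 2000M, i.e. ~2G

# 64K base pages: default hugepage size is 512M.
echo "$(( pages * 512 / 1024 ))G"  # 500G
```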
Looking at the test, I think it's using 8 default-sized hugepages. But supporting it properly is still complex, as the HWPOISON operation is destructive. I'll reply with more detail against the v2 patch.
I've sent v2 [1]. Would it be possible to run your CI on that and share the results before we merge it?
[1] https://lore.kernel.org/all/20240123073615.920324-1-usama.anjum@collabora.co...
hugepagesz=1G hugepages=0:2,1:2
hugepagesz=32M hugepages=0:2,1:2
default_hugepagesz=2M hugepages=0:64,1:64
hugepagesz=64K hugepages=0:2,1:2
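The hugepages=<node>:<count> syntax above preallocates per NUMA node at boot. The result can be verified through the standard per-node sysfs layout; a small sketch (the directories are absent on kernels without NUMA or hugetlb support, so each read is guarded):

```shell
# Print the boot-time hugepage allocation for every NUMA node and
# every hugepage size the kernel supports.
for node in /sys/devices/system/node/node*; do
  for pool in "$node"/hugepages/hugepages-*; do
    [ -r "$pool/nr_hugepages" ] || continue
    echo "${node##*/} ${pool##*/}: $(cat "$pool"/nr_hugepages)"
  done
done
echo "per-node check done"
```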
Thanks, Ryan