On Wed, Jun 11, 2025 at 08:51:17PM -0300, Jason Gunthorpe wrote:
On Wed, Jun 11, 2025 at 04:43:00PM -0700, Nicolin Chen wrote:
So, the test case sets an alignment with HUGEPAGE_SIZE=512MB while allocating buffer_size=64MB:

	rc = posix_memalign(&self->buffer, HUGEPAGE_SIZE, variant->buffer_size);
	vrc = mmap(self->buffer, variant->buffer_size, PROT_READ | PROT_WRITE,

This gives self->buffer a location that is 512MB aligned, but the mmap covers only part of one 512MB huge page.
On the other hand, _metadata->no_teardown was mmap()'d outside the range [self->buffer, self->buffer + 64MB), but within the range [self->buffer, self->buffer + 512MB).

E.g.
	_metadata->no_teardown = 0xfffbfc610000 // inside range2 below
	buffer=[0xfffbe0000000, fffbe4000000) // range1
	buffer=[0xfffbe0000000, fffc00000000) // range2

Then, the "vrc = mmap(..." overwrites the _metadata->no_teardown location to NULL.
The following change fixes it, though it feels odd that the buffer has to cover the entire huge page:

@@ -2024,3 +2027,4 @@ FIXTURE_SETUP(iommufd_dirty_tracking)
-	rc = posix_memalign(&self->buffer, HUGEPAGE_SIZE, variant->buffer_size);
+	rc = posix_memalign(&self->buffer, HUGEPAGE_SIZE,
+			    __ALIGN_KERNEL(variant->buffer_size, HUGEPAGE_SIZE));
 	if (rc || !self->buffer) {
Any thoughts?

This seems like something: variant->buffer_size should not be less than HUGEPAGE_SIZE. I guess that is possible on 64K ARM64.
But I still don't quite get it..
	rc = posix_memalign(&self->buffer, HUGEPAGE_SIZE, variant->buffer_size);

Should allocate buffer_size

	mmap_flags = MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED;
	mmap_flags |= MAP_HUGETLB | MAP_POPULATE;
	vrc = mmap(self->buffer, variant->buffer_size, PROT_READ | PROT_WRITE,
		   mmap_flags, -1, 0);
Should fail if buffer_size is not a multiple of HUGEPAGE_SIZE?
Yea, I think you are right. But..
It certainly shouldn't mmap past the provided buffer_size!!!
Are you seeing the above mmap succeed and also map beyond buffer -> buffer + buffer_size?
I think that would be a kernel bug in MAP_HUGETLB!
..I did some bpftrace:
ksys_mmap_pgoff() addr=ffff80000000, len=4000000
hugetlb_file_setup(): size=0x20000000
hugetlb_reserve_pages() from=0, to=1
hugetlb_reserve_pages() returned: ret=1
hugetlb_file_setup() returned: size=0x20000000 ret=-281471746619776
vm_mmap_pgoff() addr=ffff80000000, len=20000000
do_mmap() addr=ffff80000000, len=20000000
hugetlb_reserve_pages() from=0, to=1
hugetlb_reserve_pages() returned: ret=1
do_mmap() returned: addr=0xffff80000000 ret=ffff80000000, pop=20000000
vm_mmap_pgoff() returned: addr=0xffff80000000 ret=ffff80000000
ksys_mmap_pgoff() returned: addr=0xffff80000000 ret=ffff80000000
We can see the 64MB was rounded up to 512MB by ksys_mmap_pgoff() when being passed in to hugetlb_file_setup() at:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/m...
"
	len = ALIGN(len, huge_page_size(hs));
"
By looking at the comments here..:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/h...
"
/*
 * Note that size should be aligned to proper hugepage size in caller side,
 * otherwise hugetlb_reserve_pages reserves one less hugepages than intended.
 */
struct file *hugetlb_file_setup(const char *name, size_t size,
"
..I guess this function was supposed to fail the not-a-multiple case as you remarked? But it certainly can't do that, when that size passed in is already hugepage-aligned..
It feels like a kernel bug as you suspect :-/
And I just found one more weird thing...
In the iommufd.c selftest code, we have:
"static __attribute__((constructor)) void setup_sizes(void)"
where it does another posix_memalign/mmap pair, although this one doesn't set MAP_HUGETLB and shouldn't affect what comes next...
If I keep this code, the first hugepage test case can pass (64MB buffer_size; 512MB THP), but all the following cases will fail, as I reported here: https://lore.kernel.org/all/aEm6tuzy7WK12sMh@nvidia.com/
If I remove this code, the hugepage tests fail from the very first case with signal 11. But this time, it is not because the mmap() overwrites _metadata->no_teardown; it's because the mmap() call itself crashed...
And, in either a failed case (crashed) or a passed case, the top kernel function ksys_mmap_pgoff() returned successfully, which means it seemingly crashed inside the libc?
Thanks
Nicolin