On Thu, Jan 08, 2026 at 10:38:04AM -0400, Jason Gunthorpe wrote:
On Wed, Jan 07, 2026 at 07:36:44PM -0800, Alex Mastro wrote:
This was inspired by QEMU's hw/vfio/region.c which also does this rounding up of size to the next power of two [1].
I'm now realizing that's only necessary for regions with VFIO_REGION_INFO_CAP_SPARSE_MMAP where there are multiple mmaps per region, and each mmap's size is less than the size of the BAR. Here, since we're mapping the entire BAR which must be pow2, it shouldn't be necessary.
You only need to do this dance if you care about having large PTEs under the VMAs, so it is probably worth testing both scenarios.
Yep, makes sense. The test takes a long time to run without this, due to potentially faulting a 128G BAR region 4K at a time during VFIO_IOMMU_MAP_DMA.
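For a rough sense of scale, here is a minimal sketch of the arithmetic. It assumes one page-table entry (and at worst one fault) per mapping of the given size, which is a simplification; `nr_mappings` and the `SZ_*` macros are illustrative names, not taken from the test or the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

/*
 * Number of PTEs needed to cover bar_size bytes when every PTE maps
 * pte_size bytes. Assumes bar_size is a multiple of pte_size.
 */
static inline uint64_t nr_mappings(uint64_t bar_size, uint64_t pte_size)
{
    return bar_size / pte_size;
}
```

Under that assumption a 128G BAR needs 33554432 entries at 4K, 65536 at 2M, and 128 at 1G.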
I think QEMU's mmap alignment code falls short of its intent in the SPARSE_MMAP case. After a hole, the next mmap'able range could be at some arbitrary page-aligned offset into the region. It's not helpful to mmap a region offset which is at most 4K-aligned at a 1G-aligned vaddr.
I think to be optimal, QEMU should be attempting to align the vaddr for BAR mmaps such that
vaddr % {2M,1G} == region_offset % {2M,1G}
Would love someone to sanity check me on this. Kind of a diversion.
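The congruence condition above can be expressed as a small predicate. This is only a sketch: `congruent` and the `SZ_*` macros are hypothetical names, and the addresses in the usage note are made-up values chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

/*
 * True if vaddr and region_offset sit at the same position within a
 * pgsize-sized block, i.e. vaddr % pgsize == region_offset % pgsize.
 * Only then can a pgsize-sized PTE cover both the virtual range and the
 * corresponding part of the region.
 */
static inline bool congruent(uint64_t vaddr, uint64_t region_offset,
                             uint64_t pgsize)
{
    return (vaddr % pgsize) == (region_offset % pgsize);
}
```

For example, a range starting at region offset 0x1000 that is mapped at the 1G-aligned vaddr 0x7f0000000000 is not congruent modulo 2M or 1G, while mapping it at 0x7f0000001000 is congruent modulo both.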
What you write is correct. Ankit recently discovered this bug in qemu. It happens not just with SPARSE_MMAP but also when mmapping around the MSI-X hole.
Is my mental model broken? I thought MSI-X holes in a VFIO-exposed BAR region implied SPARSE_MMAP? I didn't think there was another way for the uapi to express hole-yness.
I also advocated for what you write here, that qemu should ensure:
vaddr % region_size == region_offset % region_size
Why region_size out of curiosity? Assuming perfect knowledge of kernel internals I would have expected something like this:
diff --git a/hw/vfio/region.c b/hw/vfio/region.c
index ca75ab1be4..1d8595e808 100644
--- a/hw/vfio/region.c
+++ b/hw/vfio/region.c
@@ -238,6 +238,18 @@ static void vfio_subregion_unmap(VFIORegion *region, int index)
     region->mmaps[index].mmap = NULL;
 }
 
+/*
+ * Return the next value greater than or equal to `input` such that
+ * (value % align) == offset.
+ */
+static size_t align_offset(size_t input, size_t offset, size_t align)
+{
+    size_t remainder = input % align;
+    size_t delta = (align + offset - remainder) % align;
+
+    return input + delta;
+}
+
 int vfio_region_mmap(VFIORegion *region)
 {
     int i, ret, prot = 0;
@@ -252,7 +264,11 @@ int vfio_region_mmap(VFIORegion *region)
     prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0;
 
     for (i = 0; i < region->nr_mmaps; i++) {
-        size_t align = MIN(1ULL << ctz64(region->mmaps[i].size), 1 * GiB);
+        size_t size = region->mmaps[i].size;
+        size_t offs = region->mmaps[i].offset;
+        size_t align = size >= GiB ? GiB :
+                       size >= 2 * MiB ? 2 * MiB :
+                       getpagesize();
         void *map_base, *map_align;
 
         /*
@@ -275,7 +291,7 @@ int vfio_region_mmap(VFIORegion *region)
 
         fd = vfio_device_get_region_fd(region->vbasedev, region->nr);
 
-        map_align = (void *)ROUND_UP((uintptr_t)map_base, (uintptr_t)align);
+        map_align = (void *)align_offset((size_t)map_base, offs % align, align);
         munmap(map_base, map_align - map_base);
         munmap(map_align + region->mmaps[i].size,
                align - (map_align - map_base));
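To make the idea in the diff easier to poke at outside of QEMU, here is a standalone sketch of the same over-reserve-then-trim approach using an anonymous PROT_NONE mapping. `align_offset()` is copied from the diff; `pick_align()` and `reserve_congruent()` are hypothetical helpers written for this illustration. It assumes `size` and `offs` are page-aligned and that `align` is a multiple of the page size.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define MiB ((size_t)1 << 20)
#define GiB ((size_t)1 << 30)

/* Next value >= input such that (value % align) == offset (from the diff). */
static inline size_t align_offset(size_t input, size_t offset, size_t align)
{
    size_t remainder = input % align;
    size_t delta = (align + offset - remainder) % align;

    return input + delta;
}

/* Hypothetical: largest of {1G, 2M, page size} that fits within size. */
static inline size_t pick_align(size_t size, size_t pagesize)
{
    return size >= GiB ? GiB : size >= 2 * MiB ? 2 * MiB : pagesize;
}

/*
 * Hypothetical: reserve size bytes of address space whose start satisfies
 * start % align == offs % align. Over-reserve by align, then trim both ends.
 * Returns NULL on failure.
 */
static inline void *reserve_congruent(size_t size, size_t offs, size_t align)
{
    size_t len = size + align;
    char *base = mmap(NULL, len, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *start;

    if (base == MAP_FAILED) {
        return NULL;
    }

    start = (char *)align_offset((size_t)base, offs % align, align);

    /* Drop the unused head (may be empty) and tail (always non-empty). */
    if (start > base) {
        munmap(base, (size_t)(start - base));
    }
    munmap(start + size, (size_t)((base + len) - (start + size)));

    return start;
}
```

With `offset == 0`, `align_offset()` degenerates to a plain round-up, so a range at region offset 0 gets the same placement as the existing ROUND_UP code would give for the same `align`.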