On Tue, Jan 07, 2025 at 03:39:56PM -0500, Peter Xu wrote:
Since commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed"), avoid_reserve was introduced for a special case of CoW on hugetlb private mappings, and only if the owner VMA is trying to allocate yet another hugetlb folio that is not reserved within the private vma reserved map.
Later on, in commit d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate"), alloc_huge_page() enforced to not consume any global reservation as long as avoid_reserve=true. This operation doesn't look correct, because even if it will enforce the allocation to not use global reservation at all, it will still try to take one reservation from the spool (if the subpool existed). Then since the spool reserved pages take from global reservation, it'll also take one reservation globally.
Logically it can cause global reservation to go wrong.
I wrote a reproducer below, trigger this special path, and every run of such program will cause global reservation count to increment by one, until it hits the number of free pages:
#define _GNU_SOURCE /* See feature_test_macros(7) */ #include <stdio.h> #include <fcntl.h> #include <errno.h> #include <unistd.h> #include <stdlib.h> #include <sys/mman.h>
#define MSIZE (2UL << 20)
int main(int argc, char *argv[]) { const char *path; int *buf; int fd, ret; pid_t child;
if (argc < 2) { printf("usage: %s <hugetlb_file>\n", argv[0]); return -1; } path = argv[1]; fd = open(path, O_RDWR | O_CREAT, 0666); if (fd < 0) { perror("open failed"); return -1; } ret = fallocate(fd, 0, 0, MSIZE); if (ret != 0) { perror("fallocate"); return -1; } buf = mmap(NULL, MSIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0); if (buf == MAP_FAILED) { perror("mmap() failed"); return -1; } /* Allocate a page */ *buf = 1; child = fork(); if (child == 0) { /* child doesn't need to do anything */ exit(0); } /* Trigger CoW from owner */ *buf = 2; munmap(buf, MSIZE); close(fd); unlink(path); return 0;
}
It can only reproduce with a sub-mount when there're reserved pages on the spool, like:
# sysctl vm.nr_hugepages=128 # mkdir ./hugetlb-pool # mount -t hugetlbfs -o min_size=8M,pagesize=2M none ./hugetlb-pool
Then run the reproducer on the mountpoint:
# ./reproducer ./hugetlb-pool/test
Fix it by taking the reservation from spool if available. In general, avoid_reserve is IMHO more about "avoid vma resv map", not spool's.
I copied stable, however I have no intention for backporting if it's not a clean cherry-pick, because private hugetlb mapping, and then fork() on top is too rare to hit.
Cc: linux-stable stable@vger.kernel.org Fixes: d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate") Reviewed-by: Ackerley Tng ackerleytng@google.com Tested-by: Ackerley Tng ackerleytng@google.com Signed-off-by: Peter Xu peterx@redhat.com
Reviewed-by: Oscar Salvador osalvador@suse.de