On Tue, Oct 22, 2019 at 11:41:57AM -0700, John Hubbard wrote:
On 10/22/19 10:14 AM, Jerome Glisse wrote:
On Mon, Oct 21, 2019 at 02:24:35PM -0700, John Hubbard wrote:
The MAP_HUGETLB ("-H" option) of gup_benchmark fails:
$ sudo ./gup_benchmark -H mmap: Invalid argument
This is because gup_benchmark.c is passing in a file descriptor to mmap(), but the fd came from opening up the /dev/zero file. This confuses the mmap syscall implementation, which thinks that, if the caller did not specify MAP_ANONYMOUS, then the file must be a huge page file. So it attempts to verify that the file really is a huge page file, as you can see here:
ksys_mmap_pgoff() { if (!(flags & MAP_ANONYMOUS)) { retval = -EINVAL; if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file))) goto out_fput; /* THIS IS WHERE WE END UP */
else if (flags & MAP_HUGETLB) { ...proceed normally, /dev/zero is ok here...
...and of course is_file_hugepages() returns "false" for the /dev/zero file.
The problem is that the user space program, gup_benchmark.c, really just wants anonymous memory here. The simplest way to get that is to pass MAP_ANONYMOUS whenever MAP_HUGETLB is specified, so that's what this patch does.
This looks wrong, MAP_HUGETLB should only be use to create vma for hugetlbfs. If you want anonymous private vma do not set the MAP_HUGETLB. If you want huge page inside your anonymous vma there is nothing to do at the mmap time, this is the job of the transparent huge page code (THP).
Not the point. Please look more closely at ksys_mmap_pgoff(). You'll see that, since 2009 (and probably earlier; 2009 is just when Hugh Dickens moved it over from util.c), this routine has had full support for using hugetlbfs automatically, via mmap.
It does that via hugetlb_file_setup():
unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, unsigned long fd, unsigned long pgoff) { ... if (!(flags & MAP_ANONYMOUS)) { ... } else if (flags & MAP_HUGETLB) { struct user_struct *user = NULL; struct hstate *hs;
hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK); if (!hs) return -EINVAL; len = ALIGN(len, huge_page_size(hs)); /* * VM_NORESERVE is used because the reservations will be * taken when vm_ops->mmap() is called * A dummy user value is used because we are not locking * memory so no accounting is necessary */ file = hugetlb_file_setup(HUGETLB_ANON_FILE, len, VM_NORESERVE, &user, HUGETLB_ANONHUGE_INODE, (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK); if (IS_ERR(file)) return PTR_ERR(file);
} ...
Also, there are 14 (!) other pre-existing examples of passing MAP_HUGETLB | MAP_ANONYMOUS to mmap, so I'm not exactly the first one to reach this understanding.
NAK as misleading
Ouch. But I think I'm actually leading correctly, rather than misleading. Can you prove me wrong? :)
So i was misslead by the file descriptor, passing a file descriptor and asking for anonymous always bugs me. But yeah the _linux_ kernel is happy to ignore the file argument if you set the anonymous flag. I guess the rules of passing -1 for fd when anonymous is just engrave in my brain.
Also i thought that the file was an argument of the test and thus that for huge you needed to pass a hugetlbfs' file.
Anyway my mistake, you are right, you can pass a file and ask for anonymous and hugetlb at the same time.
Reviewed-by: Jérôme Glisse jglisse@redhat.com