A vma with vm_pgoff large enough to overflow a loff_t type when converted to a byte offset can be passed via the remap_file_pages system call. The hugetlbfs mmap routine uses the byte offset to calculate reservations and file size.
A sequence such as:

  mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
  remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);

will result in the following when the task exits or the file is closed:

  kernel BUG at mm/hugetlb.c:749!
  Call Trace:
   hugetlbfs_evict_inode+0x2f/0x40
   evict+0xcb/0x190
   __dentry_kill+0xcb/0x150
   __fput+0x164/0x1e0
   task_work_run+0x84/0xa0
   exit_to_usermode_loop+0x7d/0x80
   do_syscall_64+0x18b/0x190
   entry_SYSCALL_64_after_hwframe+0x3d/0xa2
The overflowed pgoff value causes hugetlbfs to try to set up a mapping with a negative range (end < start) that leaves invalid state which causes the BUG.
The previous overflow fix to this code was incomplete and did not take the remap_file_pages system call into account.
Fixes: 045c7a3f53d9 ("hugetlbfs: fix offset overflow in hugetlbfs mmap")
Cc: stable@vger.kernel.org
Reported-by: Nic Losby <blurbdust@gmail.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
Changes in v2
* Use bitmask for overflow check as suggested by Yisheng Xie
* Add explicit (from > to) check when setting up reservations
* Cc stable
 fs/hugetlbfs/inode.c | 11 ++++++++---
 mm/hugetlb.c         |  6 ++++++
 2 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 8fe1b0aa2896..dafffa6affae 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -111,6 +111,7 @@ static void huge_pagevec_release(struct pagevec *pvec)
 static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct inode *inode = file_inode(file);
+	unsigned long ovfl_mask;
 	loff_t len, vma_len;
 	int ret;
 	struct hstate *h = hstate_file(file);
@@ -127,12 +128,16 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	vma->vm_ops = &hugetlb_vm_ops;
 
 	/*
-	 * Offset passed to mmap (before page shift) could have been
-	 * negative when represented as a (l)off_t.
+	 * page based offset in vm_pgoff could be sufficiently large to
+	 * overflow a (l)off_t when converted to byte offset.
 	 */
-	if (((loff_t)vma->vm_pgoff << PAGE_SHIFT) < 0)
+	ovfl_mask = (1UL << (PAGE_SHIFT + 1)) - 1;
+	ovfl_mask <<= ((sizeof(unsigned long) * BITS_PER_BYTE) -
+			(PAGE_SHIFT + 1));
+	if (vma->vm_pgoff & ovfl_mask)
 		return -EINVAL;
 
+	/* must be huge page aligned */
 	if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
 		return -EINVAL;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7c204e3d132b..8eeade0a0b7a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4374,6 +4374,12 @@ int hugetlb_reserve_pages(struct inode *inode,
 	struct resv_map *resv_map;
 	long gbl_reserve;
 
+	/* This should never happen */
+	if (from > to) {
+		VM_WARN(1, "%s called with a negative range\n", __func__);
+		return -EINVAL;
+	}
+
 	/*
 	 * Only apply hugepage reservation if asked.  At fault time, an
 	 * attempt will be made for VM_NORESERVE to allocate a page
On Thu, 8 Mar 2018 13:05:02 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> A vma with vm_pgoff large enough to overflow a loff_t type when
> converted to a byte offset can be passed via the remap_file_pages
> system call.  The hugetlbfs mmap routine uses the byte offset to
> calculate reservations and file size.
> 
> A sequence such as:
> 
>   mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
>   remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);
> 
> will result in the following when the task exits or the file is closed:
> 
>   kernel BUG at mm/hugetlb.c:749!
>   Call Trace:
>    hugetlbfs_evict_inode+0x2f/0x40
>    evict+0xcb/0x190
>    __dentry_kill+0xcb/0x150
>    __fput+0x164/0x1e0
>    task_work_run+0x84/0xa0
>    exit_to_usermode_loop+0x7d/0x80
>    do_syscall_64+0x18b/0x190
>    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> 
> The overflowed pgoff value causes hugetlbfs to try to set up a mapping
> with a negative range (end < start) that leaves invalid state which
> causes the BUG.
> 
> The previous overflow fix to this code was incomplete and did not take
> the remap_file_pages system call into account.
> 
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -111,6 +111,7 @@ static void huge_pagevec_release(struct pagevec *pvec)
>  static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>  	struct inode *inode = file_inode(file);
> +	unsigned long ovfl_mask;
>  	loff_t len, vma_len;
>  	int ret;
>  	struct hstate *h = hstate_file(file);
> @@ -127,12 +128,16 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>  	vma->vm_ops = &hugetlb_vm_ops;
> 
>  	/*
> -	 * Offset passed to mmap (before page shift) could have been
> -	 * negative when represented as a (l)off_t.
> +	 * page based offset in vm_pgoff could be sufficiently large to
> +	 * overflow a (l)off_t when converted to byte offset.
>  	 */
> -	if (((loff_t)vma->vm_pgoff << PAGE_SHIFT) < 0)
> +	ovfl_mask = (1UL << (PAGE_SHIFT + 1)) - 1;
> +	ovfl_mask <<= ((sizeof(unsigned long) * BITS_PER_BYTE) -
> +			(PAGE_SHIFT + 1));
That's a compile-time constant. The compiler will indeed generate an immediate load, but I think it would be better to make the code look more like we know that it's a constant, if you get what I mean. Something like
/*
 * If a pgoff_t is to be converted to a byte index, this is the max value it
 * can have to avoid overflow in that conversion.
 */
#define PGOFF_T_MAX <long string of crap>
And I bet that this constant could be used elsewhere - surely it's a very common thing to be checking for.
Also, the expression seems rather complicated. Why are we adding 1 to PAGE_SHIFT? Isn't there a logical way of using PAGE_MASK?
The resulting constant is 0xfff8000000000000 on 64-bit.  We could just
use something along the lines of

	1UL << (BITS_PER_LONG - PAGE_SHIFT - 1)
But why the -1? We should be able to handle a pgoff_t of 0xfff0000000000000 in this code?
Also, we later do
	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
	/* check for overflow */
	if (len < vma_len)
		return -EINVAL;
which is ungainly: even if we passed the PGOFF_T_MAX test, there can still be an overflow which we still must check for. Is that avoidable? Probably not...
On 03/08/2018 02:15 PM, Andrew Morton wrote:
> On Thu, 8 Mar 2018 13:05:02 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>> A vma with vm_pgoff large enough to overflow a loff_t type when
>> converted to a byte offset can be passed via the remap_file_pages
>> system call.  The hugetlbfs mmap routine uses the byte offset to
>> calculate reservations and file size.
>> 
>> A sequence such as:
>> 
>>   mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
>>   remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);
>> 
>> will result in the following when the task exits or the file is closed:
>> 
>>   kernel BUG at mm/hugetlb.c:749!
>>   Call Trace:
>>    hugetlbfs_evict_inode+0x2f/0x40
>>    evict+0xcb/0x190
>>    __dentry_kill+0xcb/0x150
>>    __fput+0x164/0x1e0
>>    task_work_run+0x84/0xa0
>>    exit_to_usermode_loop+0x7d/0x80
>>    do_syscall_64+0x18b/0x190
>>    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>> 
>> The overflowed pgoff value causes hugetlbfs to try to set up a mapping
>> with a negative range (end < start) that leaves invalid state which
>> causes the BUG.
>> 
>> The previous overflow fix to this code was incomplete and did not take
>> the remap_file_pages system call into account.
>> 
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -111,6 +111,7 @@ static void huge_pagevec_release(struct pagevec *pvec)
>>  static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>>  {
>>  	struct inode *inode = file_inode(file);
>> +	unsigned long ovfl_mask;
>>  	loff_t len, vma_len;
>>  	int ret;
>>  	struct hstate *h = hstate_file(file);
>> @@ -127,12 +128,16 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>>  	vma->vm_ops = &hugetlb_vm_ops;
>> 
>>  	/*
>> -	 * Offset passed to mmap (before page shift) could have been
>> -	 * negative when represented as a (l)off_t.
>> +	 * page based offset in vm_pgoff could be sufficiently large to
>> +	 * overflow a (l)off_t when converted to byte offset.
>>  	 */
>> -	if (((loff_t)vma->vm_pgoff << PAGE_SHIFT) < 0)
>> +	ovfl_mask = (1UL << (PAGE_SHIFT + 1)) - 1;
>> +	ovfl_mask <<= ((sizeof(unsigned long) * BITS_PER_BYTE) -
>> +			(PAGE_SHIFT + 1));
> That's a compile-time constant.  The compiler will indeed generate an
> immediate load, but I think it would be better to make the code look
> more like we know that it's a constant, if you get what I mean.
> Something like
> /*
>  * If a pgoff_t is to be converted to a byte index, this is the max value it
>  * can have to avoid overflow in that conversion.
>  */
> #define PGOFF_T_MAX <long string of crap>
Ok
> And I bet that this constant could be used elsewhere - surely it's a
> very common thing to be checking for.
> 
> Also, the expression seems rather complicated.  Why are we adding 1 to
> PAGE_SHIFT?  Isn't there a logical way of using PAGE_MASK?
The + 1 is there because this value will eventually be converted to a loff_t which is signed. So, we need to take that sign bit into account or we could end up with a negative value.
For PAGE_SHIFT == 12, PAGE_MASK is 0xfffffffffffff000.  Our target mask
is 0xfff8000000000000 (for the sign bit).  So, we could do:

	PAGE_MASK << (BITS_PER_LONG - (2 * PAGE_SHIFT) - 1)
This legacy hugetlbfs code may be a little different from other areas in
its use of loff_t.  When doing some previous work in this area, I did not
find enough common use to make this more general purpose.  See,
https://lkml.org/lkml/2017/4/12/793
> The resulting constant is 0xfff8000000000000 on 64-bit.  We could just
> use something along the lines of
> 
> 	1UL << (BITS_PER_LONG - PAGE_SHIFT - 1)
Ah yes, BITS_PER_LONG is better than (sizeof(unsigned long) * BITS_PER_BYTE).
> But why the -1?  We should be able to handle a pgoff_t of
> 0xfff0000000000000 in this code?
I'm not exactly sure what you are asking/suggesting here and in the line above. It is because of the conversion to a signed value that we have to go with 0xfff8000000000000 instead of 0xfff0000000000000.
Here are a couple options for computing the mask. I changed the name you suggested to make it more obvious that the mask is being used to check for loff_t overflow.
If we want to explicitly compute the mask as in code above:

#define PGOFF_LOFFT_MAX \
	(((1UL << (PAGE_SHIFT + 1)) - 1) << (BITS_PER_LONG - (PAGE_SHIFT + 1)))
Or, we use PAGE_MASK:

#define PGOFF_LOFFT_MAX \
	(PAGE_MASK << (BITS_PER_LONG - (2 * PAGE_SHIFT) - 1))
In either case, we need a big comment explaining the mask and how we have that extra bit +/- 1 because the offset will be converted to a signed value.
> Also, we later do
> 
> 	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> 	/* check for overflow */
> 	if (len < vma_len)
> 		return -EINVAL;
> 
> which is ungainly: even if we passed the PGOFF_T_MAX test, there can
> still be an overflow which we still must check for.  Is that avoidable?
> Probably not...
Yes, it is required. That check takes into account the length argument which is added to page offset. So, yes you can pass the first check and fail this one.
On Thu, 8 Mar 2018 15:37:57 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> Here are a couple options for computing the mask.  I changed the name
> you suggested to make it more obvious that the mask is being used to
> check for loff_t overflow.
> 
> If we want to explicitly compute the mask as in code above:
> 
> #define PGOFF_LOFFT_MAX \
> 	(((1UL << (PAGE_SHIFT + 1)) - 1) << (BITS_PER_LONG - (PAGE_SHIFT + 1)))
> 
> Or, we use PAGE_MASK:
> 
> #define PGOFF_LOFFT_MAX \
> 	(PAGE_MASK << (BITS_PER_LONG - (2 * PAGE_SHIFT) - 1))
Sounds good.
> In either case, we need a big comment explaining the mask and how we
> have that extra bit +/- 1 because the offset will be converted to a
> signed value.
Yup.
>> Also, we later do
>> 
>> 	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>> 	/* check for overflow */
>> 	if (len < vma_len)
>> 		return -EINVAL;
>> 
>> which is ungainly: even if we passed the PGOFF_T_MAX test, there can
>> still be an overflow which we still must check for.  Is that avoidable?
>> Probably not...
> 
> Yes, it is required.  That check takes into account the length argument
> which is added to page offset.  So, yes you can pass the first check and
> fail this one.
Well I was sort of wondering if both checks could be done in a single operation, but I guess not.