+cc Oscar - please loop him in on this as he's looking at refactoring and improving hugetlb as a whole! Thanks :)
Ricardo - thanks very much for this, TL;DR is - this is great work and you found something painful :)
To be clear and to quell too much concern, this case is one that won't realistically happen in practice - I see (afaict) your syzbot used fault injection to trigger this. This is because the only way it can happen is for vma_iter_prealloc() to fail, and that is really a 'too small to fail' allocation - in other words, direct reclaim will simply keep going until the allocation succeeds, even under extreme memory pressure.
So this patch is perhaps less urgent than it might seem (though we should address this, of course).
On Fri, May 23, 2025 at 09:56:18AM +0200, Ricardo Cañuelo Navarro wrote:
If, during a mremap() operation for a hugetlb-backed memory mapping, copy_vma() fails after the source vma has been duplicated and opened (ie. vma_link() fails), the error is handled by closing the new
OK so really it is _only_ when vma_link() fails?
These error paths really are a problem, as is vma_close(). I'm pretty sure we've actually seen similar issues with a miscounted reservation caused by vm_ops->close(); hugetlb seems really sensitive to this.
vma. This updates the hugetlbfs reservation counter of the reservation map which at this point is referenced by both the source vma and the new copy. As a result, once the new vma has been freed and copy_vma() returns, the reservation counter for the source vma will be incorrect.
OK so looking into the code I think I got it:
copy_vma_and_data() already does this:
	if (is_vm_hugetlb_page(vma))
		clear_vma_resv_huge_pages(vma);
But only if copy_vma() succeeds.
Even if the page table move fails, we reverse the operation and _still_ do this.
We do this _before_ unmapping the VMA later in the operation.
But now vma_close() has screwed things up in this one singular case. We need to fix this _prior to the vma_close()_ so it doesn't screw up this 'two VMAs referencing the same reservation' special case.
OK so your patch makes sense.
This patch addresses this corner case by clearing the hugetlb private page reservation reference for the new vma and decrementing the reference before closing the vma, so that vma_close() won't update the reservation counter.
The issue was reported by a private syzbot instance, see the error report log [1] and reproducer [2]. Possible duplicate of public syzbot report [3].
Ordinarily 'private syzbot instance' makes me nervous, but you've made your case here logically.
Signed-off-by: Ricardo Cañuelo Navarro <rcn@igalia.com>
Cc: stable@vger.kernel.org # 6.12+
Hm, do we have a Fixes?
This might be a massive pain to backport though due to a certain somebody initials 'LS' who did a big refactoring... :)
Why 6.12+? It seems this bug has been around for... a while.
THEN AGAIN and importantly - I'm fine with this, as this is a bug that I really don't believe can be hit in reality.
It is however vital we fix it so the error paths work correctly, especially if at a later date we change how allocation works and fail more etc.
It matters more for the future than the past, basically.
Link: https://people.igalia.com/rcn/kernel_logs/20250422__WARNING_in_page_counter_... [1]
Link: https://people.igalia.com/rcn/kernel_logs/20250422__WARNING_in_page_counter_... [2]
Link: https://lore.kernel.org/all/67000a50.050a0220.49194.048d.GAE@google.com/ [3]
Thanks for the links, though please also provide this information here, even if in succinct form. Commit messages are a permanent record, and these links (other than lore) are ephemeral.
Here it's a bit of a pain, however, as presumably the repro is fairly big.
So, can we please copy/paste the splat from [1] and drop that link; maybe just keep link [2], as it's not so important to have inline (I'm guessing this takes a while to repro so the fault injection hits the right point?), and of course keep [3].
 mm/vma.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/mm/vma.c b/mm/vma.c
index 839d12f02c885d3338d8d233583eb302d82bb80b..9d9f699ace977c9c869e5da5f88f12be183adcfb 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -1834,6 +1834,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	return new_vma;
 out_vma_link:
+	if (is_vm_hugetlb_page(new_vma))
+		clear_vma_resv_huge_pages(new_vma);
So,
Could you implement this slightly differently please? We're duplicating this code now, so I think this should be in its own function with a copious comment.
Something like:
static void fixup_hugetlb_reservations(struct vm_area_struct *vma)
{
	if (is_vm_hugetlb_page(vma))
		clear_vma_resv_huge_pages(vma);
}
And call this from here and also in copy_vma_and_data().
Could you also please update the comment in clear_vma_resv_huge_pages():
/*
 * Reset and decrement one ref on hugepage private reservation.
 * Called with mm->mmap_lock writer semaphore held.
 * This function should be only used by move_vma() and operate on
 * same sized vma. It should never come here with last ref on the
 * reservation.
 */
Drop the mention of the specific function (which is now wrong, but mentioning _any_ function is asking for bit rot anyway) and replace with something like 'This function should only be used by mremap and...'
 	vma_close(new_vma);
 	if (new_vma->vm_file)
base-commit: 94305e83eccb3120c921cd3a015cd74731140bac
change-id: 20250523-warning_in_page_counter_cancel-e8c71a6b4c88
Thanks!