If, during a mremap() operation for a hugetlb-backed memory mapping, copy_vma() fails after the source vma has been duplicated and opened (ie. vma_link() fails), the error is handled by closing the new vma. This updates the hugetlbfs reservation counter of the reservation map which at this point is referenced by both the source vma and the new copy. As a result, once the new vma has been freed and copy_vma() returns, the reservation counter for the source vma will be incorrect.
This patch addresses this corner case by clearing the hugetlb private page reservation reference for the new vma and decrementing the reference before closing the vma, so that vma_close() won't update the reservation counter.
The issue was reported by a private syzbot instance, see the error report log [1] and reproducer [2]. Possible duplicate of public syzbot report [3].
Signed-off-by: Ricardo Cañuelo Navarro rcn@igalia.com Cc: stable@vger.kernel.org # 6.12+ Link: https://people.igalia.com/rcn/kernel_logs/20250422__WARNING_in_page_counter_... [1] Link: https://people.igalia.com/rcn/kernel_logs/20250422__WARNING_in_page_counter_... [2] Link: https://lore.kernel.org/all/67000a50.050a0220.49194.048d.GAE@google.com/ [3] --- mm/vma.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/vma.c b/mm/vma.c index 839d12f02c885d3338d8d233583eb302d82bb80b..9d9f699ace977c9c869e5da5f88f12be183adcfb 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -1834,6 +1834,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, return new_vma;
out_vma_link: + if (is_vm_hugetlb_page(new_vma)) + clear_vma_resv_huge_pages(new_vma); vma_close(new_vma);
if (new_vma->vm_file)
--- base-commit: 94305e83eccb3120c921cd3a015cd74731140bac change-id: 20250523-warning_in_page_counter_cancel-e8c71a6b4c88
On Fri, May 23, 2025 at 09:56:18AM +0200, Ricardo Cañuelo Navarro wrote:
If, during a mremap() operation for a hugetlb-backed memory mapping, copy_vma() fails after the source vma has been duplicated and opened (ie. vma_link() fails), the error is handled by closing the new vma. This updates the hugetlbfs reservation counter of the reservation map which at this point is referenced by both the source vma and the new copy. As a result, once the new vma has been freed and copy_vma() returns, the reservation counter for the source vma will be incorrect.
This patch addresses this corner case by clearing the hugetlb private page reservation reference for the new vma and decrementing the reference before closing the vma, so that vma_close() won't update the reservation counter.
The issue was reported by a private syzbot instance, see the error report log [1] and reproducer [2]. Possible duplicate of public syzbot report [3].
Signed-off-by: Ricardo Cañuelo Navarro rcn@igalia.com Cc: stable@vger.kernel.org # 6.12+ Link: https://people.igalia.com/rcn/kernel_logs/20250422__WARNING_in_page_counter_... [1] Link: https://people.igalia.com/rcn/kernel_logs/20250422__WARNING_in_page_counter_... [2] Link: https://lore.kernel.org/all/67000a50.050a0220.49194.048d.GAE@google.com/ [3]
mm/vma.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/vma.c b/mm/vma.c index 839d12f02c885d3338d8d233583eb302d82bb80b..9d9f699ace977c9c869e5da5f88f12be183adcfb 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -1834,6 +1834,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, return new_vma; out_vma_link:
- if (is_vm_hugetlb_page(new_vma))
vma_close(new_vma);clear_vma_resv_huge_pages(new_vma);
if (new_vma->vm_file)
Sigh, I do not think Lorenzo will be happy about having yet another hugetlb check around vma code :-). Maybe this is good as is as a quick fix, but we really need to re-assest this situation . I will have a look once I managed to finish a couple of other things.
On Fri, May 23, 2025 at 10:54:24AM +0200, Oscar Salvador wrote:
On Fri, May 23, 2025 at 09:56:18AM +0200, Ricardo Cañuelo Navarro wrote:
If, during a mremap() operation for a hugetlb-backed memory mapping, copy_vma() fails after the source vma has been duplicated and opened (ie. vma_link() fails), the error is handled by closing the new vma. This updates the hugetlbfs reservation counter of the reservation map which at this point is referenced by both the source vma and the new copy. As a result, once the new vma has been freed and copy_vma() returns, the reservation counter for the source vma will be incorrect.
This patch addresses this corner case by clearing the hugetlb private page reservation reference for the new vma and decrementing the reference before closing the vma, so that vma_close() won't update the reservation counter.
The issue was reported by a private syzbot instance, see the error report log [1] and reproducer [2]. Possible duplicate of public syzbot report [3].
Signed-off-by: Ricardo Cañuelo Navarro rcn@igalia.com Cc: stable@vger.kernel.org # 6.12+ Link: https://people.igalia.com/rcn/kernel_logs/20250422__WARNING_in_page_counter_... [1] Link: https://people.igalia.com/rcn/kernel_logs/20250422__WARNING_in_page_counter_... [2] Link: https://lore.kernel.org/all/67000a50.050a0220.49194.048d.GAE@google.com/ [3]
mm/vma.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/vma.c b/mm/vma.c index 839d12f02c885d3338d8d233583eb302d82bb80b..9d9f699ace977c9c869e5da5f88f12be183adcfb 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -1834,6 +1834,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, return new_vma;
out_vma_link:
if (is_vm_hugetlb_page(new_vma))
clear_vma_resv_huge_pages(new_vma);
vma_close(new_vma);
if (new_vma->vm_file)
Sigh, I do not think Lorenzo will be happy about having yet another hugetlb check around vma code :-).
NAAAA... nah only joking ;)
But you do know me well sir, and indeed I hate doing this here.
Maybe this is good as is as a quick fix, but we really need to re-assest this situation
Yes. We really need to find a way to avoid arbitrarily putting in branches like:
if ([hugetlb/uffd/dax/blah blah]) some_seemingly_random_thing_that_is_an_implementation_detail();
. I will have a look once I managed to finish a couple of other things.
Thanks.
-- Oscar Salvador SUSE Labs
Let me have a look at this fix-wise as obviously that's a priority here and we might need to live with some ugliness for that reason.
But I want to make sure this is correct (I mean it kind of rings true).
Can you confirm at least on the theoretical side this makes sense? I was actually going to cc you if you weren't already (hadn't checked yet :)
Cheers, Lorenzo
+cc Oscar - please loop him in on this as he's looking at refactoring and improving hugetlb as a whole! Thanks :)
Ricardo - thanks very much for this, TL;DR is - this is great work and you found something painful :)
To be clear and to quell too much concern, this case is one that won't happen realistically in reality - I'm see (afaict) your syzbot did fault injection to get this. This is because the only way this could happen is for vma_iter_prealloc() to fail, and it's really subject to being a 'too small to fail' allocation, in other words direct reclaim will simply keep going until the allocation succeeds in this case even under extreme memory pressure.
So this patch is perhaps less urgent than it might seem (though we should address this, of course).
On Fri, May 23, 2025 at 09:56:18AM +0200, Ricardo Cañuelo Navarro wrote:
If, during a mremap() operation for a hugetlb-backed memory mapping, copy_vma() fails after the source vma has been duplicated and opened (ie. vma_link() fails), the error is handled by closing the new
OK so really it is _only_ when vma_link() fails?
This error paths really are a problem. As is vma_close(). I'm pretty sure we've seen similar issues actually with a miscounted reservation count caused by vm_ops->close(), hugetlb seems really sensitive to this.
vma. This updates the hugetlbfs reservation counter of the reservation map which at this point is referenced by both the source vma and the new copy. As a result, once the new vma has been freed and copy_vma() returns, the reservation counter for the source vma will be incorrect.
OK so looking into the code I think I got it:
copy_vma_and_data() already does this:
if (is_vm_hugetlb_page(vma)) clear_vma_resv_huge_pages(vma);
But only if copy_vma() succeeds.
Even if the page table move fails, we reverse the operation and _still_ do this.
We do this _before_ unmapping the VMA later in the operation.
But now vma_close() has screwed things up in this one singular case. We need to fix this _prior to the vma_close()_ so it doesn't screw up this 'two VMAs referencing the same reservation' special case.
OK so your patch makes sense.
This patch addresses this corner case by clearing the hugetlb private page reservation reference for the new vma and decrementing the reference before closing the vma, so that vma_close() won't update the reservation counter.
The issue was reported by a private syzbot instance, see the error report log [1] and reproducer [2]. Possible duplicate of public syzbot report [3].
Ordinarily 'private syzbot instance' makes me nervous, but you've made your case here logically.
Signed-off-by: Ricardo Cañuelo Navarro rcn@igalia.com Cc: stable@vger.kernel.org # 6.12+
Hm, do we have a Fixes?
This might be a massive pain to backport though due to a certain somebody initials 'LS' who did a big refactoring... :)
Why 6.12+? It seems this bug has been around for... a while.
THEN AGAIN and importantly - I'm fine with this, as this is a bug that I really don't believe can be hit in reality.
It is however vital we fix it so the error paths work correctly, especially if at a later date we change how allocation works and fail more etc.
It matters more for the future than the past, basically.
Link: https://people.igalia.com/rcn/kernel_logs/20250422__WARNING_in_page_counter_... [1] Link: https://people.igalia.com/rcn/kernel_logs/20250422__WARNING_in_page_counter_... [2] Link: https://lore.kernel.org/all/67000a50.050a0220.49194.048d.GAE@google.com/ [3]
Thanks for links, though it's better to please provide this information here even if in succinct form. This is because commit messages are a permanent record, and these links (other than lore) are ephemeral.
Here it's a bit of a pain, however, as presumably the repro is fairly big.
So, can we please copy/paste the splat from [1] and drop this link, maybe just keep link [2] as it's not so important (I'm guessing this takes a while to repro so the failure injection hits the right point?) and of course keep [3].
mm/vma.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/vma.c b/mm/vma.c index 839d12f02c885d3338d8d233583eb302d82bb80b..9d9f699ace977c9c869e5da5f88f12be183adcfb 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -1834,6 +1834,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, return new_vma;
out_vma_link:
- if (is_vm_hugetlb_page(new_vma))
clear_vma_resv_huge_pages(new_vma);
So,
Could you implement this slightly differently please? We're duplicating this code now, so I think this should be in its own function with a copious comment.
Something like:
static void fixup_hugetlb_reservations(struct vm_area_struct *vma) { if (is_vm_hugetlb_page(new_vma)) clear_vma_resv_huge_pages(new_vma); }
And call this from here and also in copy_vma_and_data().
Could you also please update the comment in clear_vma_resv_huge_pages():
/* * Reset and decrement one ref on hugepage private reservation. * Called with mm->mmap_lock writer semaphore held. * This function should be only used by move_vma() and operate on * same sized vma. It should never come here with last ref on the * reservation. */
Drop the mention of the specific function (which is now wrong, but mentioning _any_ function is asking for bit rot anyway) and replace with something like 'This function should only be used by mremap and...'
vma_close(new_vma);
if (new_vma->vm_file)
base-commit: 94305e83eccb3120c921cd3a015cd74731140bac change-id: 20250523-warning_in_page_counter_cancel-e8c71a6b4c88
Thanks!
On Fri, May 23, 2025 at 11:00:40AM +0100, Lorenzo Stoakes wrote: [snip]
mm/vma.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/vma.c b/mm/vma.c index 839d12f02c885d3338d8d233583eb302d82bb80b..9d9f699ace977c9c869e5da5f88f12be183adcfb 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -1834,6 +1834,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, return new_vma;
out_vma_link:
- if (is_vm_hugetlb_page(new_vma))
clear_vma_resv_huge_pages(new_vma);
So,
Could you implement this slightly differently please? We're duplicating this code now, so I think this should be in its own function with a copious comment.
Something like:
static void fixup_hugetlb_reservations(struct vm_area_struct *vma) { if (is_vm_hugetlb_page(new_vma)) clear_vma_resv_huge_pages(new_vma); }
And call this from here and also in copy_vma_and_data().
Sorry, I wasn't clear, Can you please ensure that you put a chonking big comment on this function too please? Something that explains the situation, e.g.:
/* * For hugetlb, mremap() is an odd edge case - while the VMA copying is * performed, we permit both the old and new VMAs to reference the same * reservation. * * We fix this up after the operation succeeds, or if a newly allocated VMA * is closed as a result of a failure to allocate memory. */
[snip]
Thanks!
Hi Lorenzo,
Thanks for the in-depth review! answers below:
On Fri, May 23 2025 at 11:00:40, Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
OK so really it is _only_ when vma_link() fails?
AFAICT yes, since copy_vma() only calls vma_close() if vma_link() fails. A failure in any of the other helpers in copy_vma() before it is handled by simply freeing the allocated resources.
Ordinarily 'private syzbot instance' makes me nervous, but you've made your case here logically.
I understand your qualms with that but, although that instance is mostly concerned with downstream code, in this case there's nothing unusual, as it was able to find the issue in mainline with a common reproducer. The closest public report I found was the one I linked in [3], although I couldn't reproduce the issue with the repro provided there.
Hm, do we have a Fixes?
I couldn't find a single commit to point as a "Fixes". The actual commit that introduces that close_vma() call there is 4080ef1579b2 ("mm: unconditionally close VMAs on error") although I wouldn't say that's the culprit. As you said, the problem with vma_close() seems to be more involved. If you want me to add that one in the "Fixes" tag so we can keep track of the context, let me know, that's fine by me.
Why 6.12+? It seems this bug has been around for... a while.
Because in stable versions lower than that (6.6) the code to patch is in mm/mmap.c instead, so I'd rather have this one merged first and then submit the appropriate backport for 6.6.
Thanks for links, though it's better to please provide this information here even if in succinct form. This is because commit messages are a permanent record, and these links (other than lore) are ephemeral.
True but, as you said, it's a bit of a pain to try to fit all the info in the commit message, and the repro will still be living somewhere else.
So, can we please copy/paste the splat from [1] and drop this link, maybe just keep link [2] as it's not so important (I'm guessing this takes a while to repro so the failure injection hits the right point?) and of course keep [3].
Sure, I'll make the changes for v2. FWIW, in my tests the repro could trigger this in a matter of seconds.
So,
Could you implement this slightly differently please? We're duplicating this code now, so I think this should be in its own function with a copious comment.
Something like:
static void fixup_hugetlb_reservations(struct vm_area_struct *vma) { if (is_vm_hugetlb_page(new_vma)) clear_vma_resv_huge_pages(new_vma); }
And call this from here and also in copy_vma_and_data().
Could you also please update the comment in clear_vma_resv_huge_pages():
/*
- Reset and decrement one ref on hugepage private reservation.
- Called with mm->mmap_lock writer semaphore held.
- This function should be only used by move_vma() and operate on
- same sized vma. It should never come here with last ref on the
- reservation.
*/
Drop the mention of the specific function (which is now wrong, but mentioning _any_ function is asking for bit rot anyway) and replace with something like 'This function should only be used by mremap and...'
Ack, thanks for the suggestions!
Cheers, Ricardo
On Fri, May 23, 2025 at 12:44:32PM +0200, Ricardo Cañuelo Navarro wrote:
Hi Lorenzo,
Thanks for the in-depth review! answers below:
On Fri, May 23 2025 at 11:00:40, Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
OK so really it is _only_ when vma_link() fails?
AFAICT yes, since copy_vma() only calls vma_close() if vma_link() fails. A failure in any of the other helpers in copy_vma() before it is handled by simply freeing the allocated resources.
Ordinarily 'private syzbot instance' makes me nervous, but you've made your case here logically.
I understand your qualms with that but, although that instance is mostly concerned with downstream code, in this case there's nothing unusual, as it was able to find the issue in mainline with a common reproducer. The closest public report I found was the one I linked in [3], although I couldn't reproduce the issue with the repro provided there.
Yeah as I said in this case we're good :)
The issue has really been instances where people are running modified copies that are giving what appear to be spurious reports.
This is not the case here!
Hm, do we have a Fixes?
I couldn't find a single commit to point as a "Fixes". The actual commit that introduces that close_vma() call there is 4080ef1579b2 ("mm: unconditionally close VMAs on error") although I wouldn't say that's the culprit. As you said, the problem with vma_close() seems to be more involved. If you want me to add that one in the "Fixes" tag so we can keep track of the context, let me know, that's fine by me.
Yeah, fair enough.
I don't think that commit is the culprit, as it still does essentially the same logic, it just also updates vma->vm_ops to prevent any risk of double-closing.
I suspect this has been a 'long term' bug, but one again that really is unlikely to be triggered in reality.
So probably no Fixes really makes sense here.
Why 6.12+? It seems this bug has been around for... a while.
Because in stable versions lower than that (6.6) the code to patch is in mm/mmap.c instead, so I'd rather have this one merged first and then submit the appropriate backport for 6.6.
You can backport everything manually. Stable side of things won't affect this being merged upstream.
Also you're going to (probably) hit merge conflicts anyway pre my refactoring of mremap.c.
So if you feel this should get fixed everywhere, then you can always do # >= 5.4.293 or something and fix things up as you go with manual backports.
But again, I'm not sure this is really worth backporting _at all_ or at least _that far back_ given how it is more or less impossible to hit in reality.
I think under the kind of memory pressure that would result in this bug (which I'm not sure can even actually happen, unless a fatal signal arose at the same time, perhaps), hugetlb reservation miscount would be the absolute least of your concern.
So my recommendation would really to avoid a backport here,
Thanks for links, though it's better to please provide this information here even if in succinct form. This is because commit messages are a permanent record, and these links (other than lore) are ephemeral.
True but, as you said, it's a bit of a pain to try to fit all the info in the commit message, and the repro will still be living somewhere else.
Right, but then we have more information. It's a trade-off obviously.
So, can we please copy/paste the splat from [1] and drop this link, maybe just keep link [2] as it's not so important (I'm guessing this takes a while to repro so the failure injection hits the right point?) and of course keep [3].
Sure, I'll make the changes for v2. FWIW, in my tests the repro could trigger this in a matter of seconds.
Interesting :) I can't repro in qemu at least. I may be misconfiguring this somehow, however.
So,
Could you implement this slightly differently please? We're duplicating this code now, so I think this should be in its own function with a copious comment.
Something like:
static void fixup_hugetlb_reservations(struct vm_area_struct *vma) { if (is_vm_hugetlb_page(new_vma)) clear_vma_resv_huge_pages(new_vma); }
And call this from here and also in copy_vma_and_data().
Could you also please update the comment in clear_vma_resv_huge_pages():
/*
- Reset and decrement one ref on hugepage private reservation.
- Called with mm->mmap_lock writer semaphore held.
- This function should be only used by move_vma() and operate on
- same sized vma. It should never come here with last ref on the
- reservation.
*/
Drop the mention of the specific function (which is now wrong, but mentioning _any_ function is asking for bit rot anyway) and replace with something like 'This function should only be used by mremap and...'
Ack, thanks for the suggestions!
Thanks again for reporting this :)
Cheers, Ricardo
linux-stable-mirror@lists.linaro.org