arm64 supports multiple huge_pte sizes. Some of the sizes are covered by a single pte entry at a particular level (PMD_SIZE, PUD_SIZE), and some are covered by multiple ptes at a particular level (CONT_PTE_SIZE, CONT_PMD_SIZE). So the function has to figure out the size from the huge_pte pointer. This was previously done by walking the pgtable to determine the level and by using the PTE_CONT bit to determine the number of ptes at the level.
But the PTE_CONT bit is only valid when the pte is present. For non-present pte values (e.g. markers, migration entries), the previous implementation was therefore erroneously determining the size. There is at least one known caller in core-mm, move_huge_pte(), which may call huge_ptep_get_and_clear() for a non-present pte. So we must be robust to this case. Additionally, the "regular" ptep_get_and_clear() is robust to being called for non-present ptes, so it makes sense to follow that behaviour.
Fix this by using the new sz parameter, which is now provided to the function. Additionally, when clearing each pte in a contig range, don't gather the access and dirty bits if the pte is not present.
An alternative approach that would not require API changes would be to store the PTE_CONT bit in a spare bit in the swap entry pte for the non-present case. But it felt cleaner to follow other APIs' lead and just pass in the size.
As an aside, PTE_CONT is bit 52, which corresponds to bit 40 in the swap entry offset field (layout of non-present pte). Since hugetlb is never swapped to disk, this field will only be populated for markers, which always set this bit to 0, and hwpoison swap entries, which set the offset field to a PFN; so it would only ever be 1 for a 52-bit PVA system where memory in that high half was poisoned (I think!). So in practice, this bit would almost always be zero for non-present ptes and we would only clear the first entry if it was actually a contiguous block. That's probably a less severe symptom than if it was always interpreted as 1 and cleared out potentially-present neighboring PTEs.
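To make the aliasing concrete, here is a minimal userspace sketch (not kernel code; the encoding is simplified and PTE_VALID/PTE_CONT are stand-ins for the real arm64 definitions in arch/arm64/include/asm/pgtable-hwdef.h):

        #include <stdio.h>
        #include <stdint.h>

        #define PTE_VALID       (1UL << 0)      /* present bit */
        #define PTE_CONT        (1UL << 52)     /* contig hint; only meaningful if present */

        static int pte_present(uint64_t pte) { return pte & PTE_VALID; }
        static int pte_cont(uint64_t pte)    { return !!(pte & PTE_CONT); }

        int main(void)
        {
                /* A non-present entry (e.g. a migration entry) whose payload
                 * happens to have bit 52 set. */
                uint64_t swp_pte = (1UL << 52) | (2UL << 2);

                /* The old logic consulted PTE_CONT regardless of presence, so
                 * it would wrongly treat this entry as part of a contig block. */
                printf("pte_cont() says contiguous? %d (bogus: not present, %d)\n",
                       pte_cont(swp_pte), pte_present(swp_pte));

                /* The fix avoids the pte bits entirely and derives ncontig
                 * from the sz argument instead. */
                return 0;
        }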
Cc: stable@vger.kernel.org
Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/mm/hugetlbpage.c | 40 ++++++++++++++++---------------------
 1 file changed, 17 insertions(+), 23 deletions(-)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 06db4649af91..614b2feddba2 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -163,24 +163,23 @@ static pte_t get_clear_contig(struct mm_struct *mm,
                               unsigned long pgsize,
                               unsigned long ncontig)
 {
-        pte_t orig_pte = __ptep_get(ptep);
-        unsigned long i;
-
-        for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
-                pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
-
-                /*
-                 * If HW_AFDBM is enabled, then the HW could turn on
-                 * the dirty or accessed bit for any page in the set,
-                 * so check them all.
-                 */
-                if (pte_dirty(pte))
-                        orig_pte = pte_mkdirty(orig_pte);
-
-                if (pte_young(pte))
-                        orig_pte = pte_mkyoung(orig_pte);
+        pte_t pte, tmp_pte;
+        bool present;
+
+        pte = __ptep_get_and_clear(mm, addr, ptep);
+        present = pte_present(pte);
+        while (--ncontig) {
+                ptep++;
+                addr += pgsize;
+                tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
+                if (present) {
+                        if (pte_dirty(tmp_pte))
+                                pte = pte_mkdirty(pte);
+                        if (pte_young(tmp_pte))
+                                pte = pte_mkyoung(pte);
+                }
         }
-        return orig_pte;
+        return pte;
 }

 static pte_t get_clear_contig_flush(struct mm_struct *mm,
@@ -401,13 +400,8 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
                               pte_t *ptep, unsigned long sz)
 {
         int ncontig;
         size_t pgsize;
-        pte_t orig_pte = __ptep_get(ptep);
-
-        if (!pte_cont(orig_pte))
-                return __ptep_get_and_clear(mm, addr, ptep);
-
-        ncontig = find_num_contig(mm, addr, ptep, &pgsize);

+        ncontig = num_contig_ptes(sz, &pgsize);
         return get_clear_contig(mm, addr, ptep, pgsize, ncontig);
 }
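For readability, here is the full get_clear_contig() as it stands with this patch applied (assembled from the hunks above; the addr/ptep parameter lines fall outside the hunk context and are inferred):

        static pte_t get_clear_contig(struct mm_struct *mm,
                                      unsigned long addr,
                                      pte_t *ptep,
                                      unsigned long pgsize,
                                      unsigned long ncontig)
        {
                pte_t pte, tmp_pte;
                bool present;

                /* Unconditionally read and clear the first (possibly only) entry. */
                pte = __ptep_get_and_clear(mm, addr, ptep);
                present = pte_present(pte);
                while (--ncontig) {
                        ptep++;
                        addr += pgsize;
                        tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
                        /* Only gather HW-set dirty/access bits for present ptes. */
                        if (present) {
                                if (pte_dirty(tmp_pte))
                                        pte = pte_mkdirty(pte);
                                if (pte_young(tmp_pte))
                                        pte = pte_mkyoung(pte);
                        }
                }
                return pte;
        }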
On 2/17/25 19:34, Ryan Roberts wrote:
arm64 supports multiple huge_pte sizes. Some of the sizes are covered by a single pte entry at a particular level (PMD_SIZE, PUD_SIZE), and some are covered by multiple ptes at a particular level (CONT_PTE_SIZE, CONT_PMD_SIZE). So the function has to figure out the size from the huge_pte pointer. This was previously done by walking the pgtable to determine the level and by using the PTE_CONT bit to determine the number of ptes at the level.
But the PTE_CONT bit is only valid when the pte is present. For non-present pte values (e.g. markers, migration entries), the previous implementation was therefore erroneously determining the size. There is at least one known caller in core-mm, move_huge_pte(), which may call huge_ptep_get_and_clear() for a non-present pte. So we must be robust to this case. Additionally, the "regular" ptep_get_and_clear() is robust to being called for non-present ptes, so it makes sense to follow that behaviour.
Fix this by using the new sz parameter, which is now provided to the function. Additionally, when clearing each pte in a contig range, don't gather the access and dirty bits if the pte is not present.
An alternative approach that would not require API changes would be to store the PTE_CONT bit in a spare bit in the swap entry pte for the non-present case. But it felt cleaner to follow other APIs' lead and just pass in the size.
As an aside, PTE_CONT is bit 52, which corresponds to bit 40 in the swap entry offset field (layout of non-present pte). Since hugetlb is never swapped to disk, this field will only be populated for markers, which always set this bit to 0, and hwpoison swap entries, which set the offset field to a PFN; so it would only ever be 1 for a 52-bit PVA system where memory in that high half was poisoned (I think!). So in practice, this bit would almost always be zero for non-present ptes and we would only clear the first entry if it was actually a contiguous block. That's probably a less severe symptom than if it was always interpreted as 1 and cleared out potentially-present neighboring PTEs.
Cc: stable@vger.kernel.org
Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
 arch/arm64/mm/hugetlbpage.c | 40 ++++++++++++++++---------------------
 1 file changed, 17 insertions(+), 23 deletions(-)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 06db4649af91..614b2feddba2 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -163,24 +163,23 @@ static pte_t get_clear_contig(struct mm_struct *mm,
                               unsigned long pgsize,
                               unsigned long ncontig)
 {
-        pte_t orig_pte = __ptep_get(ptep);
-        unsigned long i;
-
-        for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
-                pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
-
-                /*
-                 * If HW_AFDBM is enabled, then the HW could turn on
-                 * the dirty or accessed bit for any page in the set,
-                 * so check them all.
-                 */
-                if (pte_dirty(pte))
-                        orig_pte = pte_mkdirty(orig_pte);
-
-                if (pte_young(pte))
-                        orig_pte = pte_mkyoung(orig_pte);
+        pte_t pte, tmp_pte;
+        bool present;
+
+        pte = __ptep_get_and_clear(mm, addr, ptep);
+        present = pte_present(pte);
pte_present() need not be evaluated for standard huge pages at [PMD|PUD]_SIZE, e.g. when ncontig = 1 in the argument.
+        while (--ncontig) {
Should this be converted into a for loop instead, just to be in sync with other similar iterators in this file?
for (i = 1; i < ncontig; i++, addr += pgsize, ptep++) {
        tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
        if (present) {
                if (pte_dirty(tmp_pte))
                        pte = pte_mkdirty(pte);
                if (pte_young(tmp_pte))
                        pte = pte_mkyoung(pte);
        }
}
+                ptep++;
+                addr += pgsize;
+                tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
+                if (present) {
+                        if (pte_dirty(tmp_pte))
+                                pte = pte_mkdirty(pte);
+                        if (pte_young(tmp_pte))
+                                pte = pte_mkyoung(pte);
+                }
         }
-        return orig_pte;
+        return pte;
 }

 static pte_t get_clear_contig_flush(struct mm_struct *mm,
@@ -401,13 +400,8 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
                               pte_t *ptep, unsigned long sz)
 {
         int ncontig;
         size_t pgsize;
-        pte_t orig_pte = __ptep_get(ptep);
-
-        if (!pte_cont(orig_pte))
-                return __ptep_get_and_clear(mm, addr, ptep);
-
-        ncontig = find_num_contig(mm, addr, ptep, &pgsize);

+        ncontig = num_contig_ptes(sz, &pgsize);
         return get_clear_contig(mm, addr, ptep, pgsize, ncontig);
 }
On 19/02/2025 08:45, Anshuman Khandual wrote:
On 2/17/25 19:34, Ryan Roberts wrote:
arm64 supports multiple huge_pte sizes. Some of the sizes are covered by a single pte entry at a particular level (PMD_SIZE, PUD_SIZE), and some are covered by multiple ptes at a particular level (CONT_PTE_SIZE, CONT_PMD_SIZE). So the function has to figure out the size from the huge_pte pointer. This was previously done by walking the pgtable to determine the level and by using the PTE_CONT bit to determine the number of ptes at the level.
But the PTE_CONT bit is only valid when the pte is present. For non-present pte values (e.g. markers, migration entries), the previous implementation was therefore erroneously determining the size. There is at least one known caller in core-mm, move_huge_pte(), which may call huge_ptep_get_and_clear() for a non-present pte. So we must be robust to this case. Additionally, the "regular" ptep_get_and_clear() is robust to being called for non-present ptes, so it makes sense to follow that behaviour.
Fix this by using the new sz parameter, which is now provided to the function. Additionally, when clearing each pte in a contig range, don't gather the access and dirty bits if the pte is not present.
An alternative approach that would not require API changes would be to store the PTE_CONT bit in a spare bit in the swap entry pte for the non-present case. But it felt cleaner to follow other APIs' lead and just pass in the size.
As an aside, PTE_CONT is bit 52, which corresponds to bit 40 in the swap entry offset field (layout of non-present pte). Since hugetlb is never swapped to disk, this field will only be populated for markers, which always set this bit to 0, and hwpoison swap entries, which set the offset field to a PFN; so it would only ever be 1 for a 52-bit PVA system where memory in that high half was poisoned (I think!). So in practice, this bit would almost always be zero for non-present ptes and we would only clear the first entry if it was actually a contiguous block. That's probably a less severe symptom than if it was always interpreted as 1 and cleared out potentially-present neighboring PTEs.
Cc: stable@vger.kernel.org
Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
 arch/arm64/mm/hugetlbpage.c | 40 ++++++++++++++++---------------------
 1 file changed, 17 insertions(+), 23 deletions(-)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 06db4649af91..614b2feddba2 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -163,24 +163,23 @@ static pte_t get_clear_contig(struct mm_struct *mm,
                               unsigned long pgsize,
                               unsigned long ncontig)
 {
-        pte_t orig_pte = __ptep_get(ptep);
-        unsigned long i;
-
-        for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
-                pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
-
-                /*
-                 * If HW_AFDBM is enabled, then the HW could turn on
-                 * the dirty or accessed bit for any page in the set,
-                 * so check them all.
-                 */
-                if (pte_dirty(pte))
-                        orig_pte = pte_mkdirty(orig_pte);
-
-                if (pte_young(pte))
-                        orig_pte = pte_mkyoung(orig_pte);
+        pte_t pte, tmp_pte;
+        bool present;
+
+        pte = __ptep_get_and_clear(mm, addr, ptep);
+        present = pte_present(pte);
pte_present() need not be evaluated for standard huge pages at [PMD|PUD]_SIZE, e.g. when ncontig = 1 in the argument.
Sorry I'm not quite sure what you're suggesting here? Are you proposing that pte_present() should be moved into the loop so that we only actually call it when we are going to consume it? I'm happy to do that if it's the preference, but I thought it was neater to hoist it out of the loop.
+        while (--ncontig) {
Should this be converted into a for loop instead, just to be in sync with other similar iterators in this file?
for (i = 1; i < ncontig; i++, addr += pgsize, ptep++) {
        tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
        if (present) {
                if (pte_dirty(tmp_pte))
                        pte = pte_mkdirty(pte);
                if (pte_young(tmp_pte))
                        pte = pte_mkyoung(pte);
        }
}
I think the way you have written this it's incorrect. Let's say we have 16 ptes in the block. We want to iterate over the last 15 of them (we have already read pte 0). But you're iterating over the first 15, because you don't increment addr and ptep until after you've been around the loop the first time. So we would need to explicitly increment those 2 before entering the loop. But that is only necessary if ncontig > 1. Personally I think my approach is neater...
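For comparison, a sketch of what the for loop would need to look like to be correct (illustrative only):

        /* Entry 0 has already been read and cleared above, so step past it
         * first; otherwise iteration i would clear entry i - 1. These two
         * increments are wasted work when ncontig == 1. */
        ptep++;
        addr += pgsize;
        for (i = 1; i < ncontig; i++, addr += pgsize, ptep++) {
                tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
                if (present) {
                        if (pte_dirty(tmp_pte))
                                pte = pte_mkdirty(pte);
                        if (pte_young(tmp_pte))
                                pte = pte_mkyoung(pte);
                }
        }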
+                ptep++;
+                addr += pgsize;
+                tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
+                if (present) {
+                        if (pte_dirty(tmp_pte))
+                                pte = pte_mkdirty(pte);
+                        if (pte_young(tmp_pte))
+                                pte = pte_mkyoung(pte);
+                }
         }
-        return orig_pte;
+        return pte;
 }

 static pte_t get_clear_contig_flush(struct mm_struct *mm,
@@ -401,13 +400,8 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
                               pte_t *ptep, unsigned long sz)
 {
         int ncontig;
         size_t pgsize;
-        pte_t orig_pte = __ptep_get(ptep);
-
-        if (!pte_cont(orig_pte))
-                return __ptep_get_and_clear(mm, addr, ptep);
-
-        ncontig = find_num_contig(mm, addr, ptep, &pgsize);

+        ncontig = num_contig_ptes(sz, &pgsize);
         return get_clear_contig(mm, addr, ptep, pgsize, ncontig);
 }
On 2/19/25 14:28, Ryan Roberts wrote:
On 19/02/2025 08:45, Anshuman Khandual wrote:
On 2/17/25 19:34, Ryan Roberts wrote:
arm64 supports multiple huge_pte sizes. Some of the sizes are covered by a single pte entry at a particular level (PMD_SIZE, PUD_SIZE), and some are covered by multiple ptes at a particular level (CONT_PTE_SIZE, CONT_PMD_SIZE). So the function has to figure out the size from the huge_pte pointer. This was previously done by walking the pgtable to determine the level and by using the PTE_CONT bit to determine the number of ptes at the level.
But the PTE_CONT bit is only valid when the pte is present. For non-present pte values (e.g. markers, migration entries), the previous implementation was therefore erroniously determining the size. There is
typo - s/erroniously/erroneously ^^^^^^
at least one known caller in core-mm, move_huge_pte(), which may call huge_ptep_get_and_clear() for a non-present pte. So we must be robust to this case. Additionally, the "regular" ptep_get_and_clear() is robust to being called for non-present ptes, so it makes sense to follow that behaviour.
Fix this by using the new sz parameter, which is now provided to the function. Additionally, when clearing each pte in a contig range, don't gather the access and dirty bits if the pte is not present.
An alternative approach that would not require API changes would be to store the PTE_CONT bit in a spare bit in the swap entry pte for the non-present case. But it felt cleaner to follow other APIs' lead and just pass in the size.
As an aside, PTE_CONT is bit 52, which corresponds to bit 40 in the swap entry offset field (layout of non-present pte). Since hugetlb is never swapped to disk, this field will only be populated for markers, which always set this bit to 0, and hwpoison swap entries, which set the offset field to a PFN; so it would only ever be 1 for a 52-bit PVA system where memory in that high half was poisoned (I think!). So in practice, this bit would almost always be zero for non-present ptes and we would only clear the first entry if it was actually a contiguous block. That's probably a less severe symptom than if it was always interpretted as 1
typo - s/interpretted/interpreted ^^^^^^
and cleared out potentially-present neighboring PTEs.
Cc: stable@vger.kernel.org
Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
 arch/arm64/mm/hugetlbpage.c | 40 ++++++++++++++++---------------------
 1 file changed, 17 insertions(+), 23 deletions(-)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 06db4649af91..614b2feddba2 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -163,24 +163,23 @@ static pte_t get_clear_contig(struct mm_struct *mm,
                               unsigned long pgsize,
                               unsigned long ncontig)
 {
-        pte_t orig_pte = __ptep_get(ptep);
-        unsigned long i;
-
-        for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
-                pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
-
-                /*
-                 * If HW_AFDBM is enabled, then the HW could turn on
-                 * the dirty or accessed bit for any page in the set,
-                 * so check them all.
-                 */
-                if (pte_dirty(pte))
-                        orig_pte = pte_mkdirty(orig_pte);
-
-                if (pte_young(pte))
-                        orig_pte = pte_mkyoung(orig_pte);
+        pte_t pte, tmp_pte;
+        bool present;
+
+        pte = __ptep_get_and_clear(mm, addr, ptep);
+        present = pte_present(pte);
pte_present() need not be evaluated for standard huge pages at [PMD|PUD]_SIZE, e.g. when ncontig = 1 in the argument.
Sorry I'm not quite sure what you're suggesting here? Are you proposing that pte_present() should be moved into the loop so that we only actually call it when we are going to consume it? I'm happy to do that if it's the preference,
Right, pte_present() is only required for the cont huge pages but not for the normal huge pages.
but I thought it was neater to hoist it out of the loop.
Agreed, but where possible the pte_present() cost should be avoided for normal huge pages, where it is not required.
+        while (--ncontig) {
Should this be converted into a for loop instead, just to be in sync with other similar iterators in this file?
for (i = 1; i < ncontig; i++, addr += pgsize, ptep++) {
        tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
        if (present) {
                if (pte_dirty(tmp_pte))
                        pte = pte_mkdirty(pte);
                if (pte_young(tmp_pte))
                        pte = pte_mkyoung(pte);
        }
}
I think the way you have written this it's incorrect. Let's say we have 16 ptes in the block. We want to iterate over the last 15 of them (we have already read pte 0). But you're iterating over the first 15, because you don't increment addr and ptep until after you've been around the loop the first time. So we would need to explicitly increment those 2 before entering the loop. But that is only necessary if ncontig > 1. Personally I think my approach is neater...
Thinking about this again - just wondering, wouldn't a pte_present() check on each entry being cleared, along with (ncontig > 1), in this existing loop before transferring over the dirty and accessed bits also work as intended, with less code churn?
static pte_t get_clear_contig(struct mm_struct *mm,
                              unsigned long addr,
                              pte_t *ptep,
                              unsigned long pgsize,
                              unsigned long ncontig)
{
        pte_t orig_pte = __ptep_get(ptep);
        unsigned long i;

        for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
                pte_t pte = __ptep_get_and_clear(mm, addr, ptep);

                if (ncontig > 1 && !pte_present(pte))
                        continue;

                /*
                 * If HW_AFDBM is enabled, then the HW could turn on
                 * the dirty or accessed bit for any page in the set,
                 * so check them all.
                 */
                if (pte_dirty(pte))
                        orig_pte = pte_mkdirty(orig_pte);

                if (pte_young(pte))
                        orig_pte = pte_mkyoung(orig_pte);
        }
        return orig_pte;
}
* Normal huge pages

  - enters the for loop just once
  - clears the single entry
  - always transfers dirty and access bits
  - pte_present() does not matter as ncontig = 1

* Contig huge pages

  - enters the for loop ncontig times - for each sub page
  - clears all sub page entries
  - transfers dirty and access bits only when pte_present()
  - pte_present() is relevant as ncontig > 1
+                ptep++;
+                addr += pgsize;
+                tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
+                if (present) {
+                        if (pte_dirty(tmp_pte))
+                                pte = pte_mkdirty(pte);
+                        if (pte_young(tmp_pte))
+                                pte = pte_mkyoung(pte);
+                }
         }
-        return orig_pte;
+        return pte;
 }

 static pte_t get_clear_contig_flush(struct mm_struct *mm,
@@ -401,13 +400,8 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
                               pte_t *ptep, unsigned long sz)
 {
         int ncontig;
         size_t pgsize;
-        pte_t orig_pte = __ptep_get(ptep);
-
-        if (!pte_cont(orig_pte))
-                return __ptep_get_and_clear(mm, addr, ptep);
-
-        ncontig = find_num_contig(mm, addr, ptep, &pgsize);

+        ncontig = num_contig_ptes(sz, &pgsize);
         return get_clear_contig(mm, addr, ptep, pgsize, ncontig);
 }
On Thu, Feb 20, 2025 at 12:07:35PM +0530, Anshuman Khandual wrote:
On 2/19/25 14:28, Ryan Roberts wrote:
On 19/02/2025 08:45, Anshuman Khandual wrote:
On 2/17/25 19:34, Ryan Roberts wrote:
+        while (--ncontig) {
Should this be converted into a for loop instead, just to be in sync with other similar iterators in this file?
for (i = 1; i < ncontig; i++, addr += pgsize, ptep++) {
        tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
        if (present) {
                if (pte_dirty(tmp_pte))
                        pte = pte_mkdirty(pte);
                if (pte_young(tmp_pte))
                        pte = pte_mkyoung(pte);
        }
}
I think the way you have written this it's incorrect. Let's say we have 16 ptes in the block. We want to iterate over the last 15 of them (we have already read pte 0). But you're iterating over the first 15, because you don't increment addr and ptep until after you've been around the loop the first time. So we would need to explicitly increment those 2 before entering the loop. But that is only necessary if ncontig > 1. Personally I think my approach is neater...
Thinking about this again - just wondering, wouldn't a pte_present() check on each entry being cleared, along with (ncontig > 1), in this existing loop before transferring over the dirty and accessed bits also work as intended, with less code churn?
Shouldn't all the ptes in a contig block be either all present or all not-present? Is there any point in checking each individually?
On 21/02/2025 09:55, Catalin Marinas wrote:
On Thu, Feb 20, 2025 at 12:07:35PM +0530, Anshuman Khandual wrote:
On 2/19/25 14:28, Ryan Roberts wrote:
On 19/02/2025 08:45, Anshuman Khandual wrote:
On 2/17/25 19:34, Ryan Roberts wrote:
+        while (--ncontig) {
Should this be converted into a for loop instead, just to be in sync with other similar iterators in this file?
for (i = 1; i < ncontig; i++, addr += pgsize, ptep++) {
        tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
        if (present) {
                if (pte_dirty(tmp_pte))
                        pte = pte_mkdirty(pte);
                if (pte_young(tmp_pte))
                        pte = pte_mkyoung(pte);
        }
}
I think the way you have written this it's incorrect. Let's say we have 16 ptes in the block. We want to iterate over the last 15 of them (we have already read pte 0). But you're iterating over the first 15, because you don't increment addr and ptep until after you've been around the loop the first time. So we would need to explicitly increment those 2 before entering the loop. But that is only necessary if ncontig > 1. Personally I think my approach is neater...
Thinking about this again - just wondering, wouldn't a pte_present() check on each entry being cleared, along with (ncontig > 1), in this existing loop before transferring over the dirty and accessed bits also work as intended, with less code churn?
Shouldn't all the ptes in a contig block be either all present or all not-present? Is there any point in checking each individually?
I agree, that's why I was just testing it once outside the loop. I'm confident my code and Anshuman's alternative are both correct. So it's just a stylistic choice really. I'd prefer to leave it as is, but will change if there is consensus. I'm not sure I'm hearing that though.
Catalin, do you want me to respin the series to fix up the typos that Anshuman highlighted, or will you handle that?
Thanks,
Ryan
On Mon, Feb 17, 2025 at 02:04:15PM +0000, Ryan Roberts wrote:
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 06db4649af91..614b2feddba2 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -163,24 +163,23 @@ static pte_t get_clear_contig(struct mm_struct *mm,
                               unsigned long pgsize,
                               unsigned long ncontig)
 {
-        pte_t orig_pte = __ptep_get(ptep);
-        unsigned long i;
-
-        for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
-                pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
-
-                /*
-                 * If HW_AFDBM is enabled, then the HW could turn on
-                 * the dirty or accessed bit for any page in the set,
-                 * so check them all.
-                 */
-                if (pte_dirty(pte))
-                        orig_pte = pte_mkdirty(orig_pte);
-
-                if (pte_young(pte))
-                        orig_pte = pte_mkyoung(orig_pte);
+        pte_t pte, tmp_pte;
+        bool present;
+
+        pte = __ptep_get_and_clear(mm, addr, ptep);
+        present = pte_present(pte);
+        while (--ncontig) {
A 'for' loop may be more consistent with the rest of the file but I really don't mind the 'while' loop.
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
On Mon, Feb 17, 2025 at 02:04:15PM +0000, Ryan Roberts wrote:
arm64 supports multiple huge_pte sizes. Some of the sizes are covered by a single pte entry at a particular level (PMD_SIZE, PUD_SIZE), and some are covered by multiple ptes at a particular level (CONT_PTE_SIZE, CONT_PMD_SIZE). So the function has to figure out the size from the huge_pte pointer. This was previously done by walking the pgtable to determine the level and by using the PTE_CONT bit to determine the number of ptes at the level.
But the PTE_CONT bit is only valid when the pte is present. For non-present pte values (e.g. markers, migration entries), the previous implementation was therefore erroneously determining the size. There is at least one known caller in core-mm, move_huge_pte(), which may call huge_ptep_get_and_clear() for a non-present pte. So we must be robust to this case. Additionally, the "regular" ptep_get_and_clear() is robust to being called for non-present ptes, so it makes sense to follow that behaviour.
Fix this by using the new sz parameter, which is now provided to the function. Additionally, when clearing each pte in a contig range, don't gather the access and dirty bits if the pte is not present.
An alternative approach that would not require API changes would be to store the PTE_CONT bit in a spare bit in the swap entry pte for the non-present case. But it felt cleaner to follow other APIs' lead and just pass in the size.
As an aside, PTE_CONT is bit 52, which corresponds to bit 40 in the swap entry offset field (layout of non-present pte). Since hugetlb is never swapped to disk, this field will only be populated for markers, which always set this bit to 0, and hwpoison swap entries, which set the offset field to a PFN; so it would only ever be 1 for a 52-bit PVA system where memory in that high half was poisoned (I think!). So in practice, this bit would almost always be zero for non-present ptes and we would only clear the first entry if it was actually a contiguous block. That's probably a less severe symptom than if it was always interpreted as 1 and cleared out potentially-present neighboring PTEs.
Cc: stable@vger.kernel.org
Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
 arch/arm64/mm/hugetlbpage.c | 40 ++++++++++++++++---------------------
 1 file changed, 17 insertions(+), 23 deletions(-)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 06db4649af91..614b2feddba2 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -163,24 +163,23 @@ static pte_t get_clear_contig(struct mm_struct *mm,
                               unsigned long pgsize,
                               unsigned long ncontig)
 {
-        pte_t orig_pte = __ptep_get(ptep);
-        unsigned long i;
-
-        for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
-                pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
-
-                /*
-                 * If HW_AFDBM is enabled, then the HW could turn on
-                 * the dirty or accessed bit for any page in the set,
-                 * so check them all.
-                 */
-                if (pte_dirty(pte))
-                        orig_pte = pte_mkdirty(orig_pte);
-
-                if (pte_young(pte))
-                        orig_pte = pte_mkyoung(orig_pte);
+        pte_t pte, tmp_pte;
+        bool present;
+
+        pte = __ptep_get_and_clear(mm, addr, ptep);
+        present = pte_present(pte);
+        while (--ncontig) {
+                ptep++;
+                addr += pgsize;
+                tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
+                if (present) {
+                        if (pte_dirty(tmp_pte))
+                                pte = pte_mkdirty(pte);
+                        if (pte_young(tmp_pte))
+                                pte = pte_mkyoung(pte);
+                }
         }
nit: With the loop now structured like this, we really can't handle num_contig_ptes() returning 0 if it gets an unknown size. Granted, that really shouldn't happen, but perhaps it would be better to add a 'default' case with a WARN() to num_contig_ptes() and then add an early return here?
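To spell out the hazard (illustrative only; ncontig arrives in get_clear_contig() as an unsigned long):

        /* If num_contig_ptes() ever returned 0 for an unknown size, the new
         * loop shape would wrap rather than terminate: --0 becomes
         * ULONG_MAX, so the body would run ~2^64 times, clearing ptes far
         * beyond the contig block. */
        unsigned long ncontig = 0;

        while (--ncontig)
                ;       /* effectively endless */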
Will
On 21/02/2025 15:31, Will Deacon wrote:
On Mon, Feb 17, 2025 at 02:04:15PM +0000, Ryan Roberts wrote:
arm64 supports multiple huge_pte sizes. Some of the sizes are covered by a single pte entry at a particular level (PMD_SIZE, PUD_SIZE), and some are covered by multiple ptes at a particular level (CONT_PTE_SIZE, CONT_PMD_SIZE). So the function has to figure out the size from the huge_pte pointer. This was previously done by walking the pgtable to determine the level and by using the PTE_CONT bit to determine the number of ptes at the level.
But the PTE_CONT bit is only valid when the pte is present. For non-present pte values (e.g. markers, migration entries), the previous implementation was therefore erroneously determining the size. There is at least one known caller in core-mm, move_huge_pte(), which may call huge_ptep_get_and_clear() for a non-present pte. So we must be robust to this case. Additionally, the "regular" ptep_get_and_clear() is robust to being called for non-present ptes, so it makes sense to follow that behaviour.
Fix this by using the new sz parameter, which is now provided to the function. Additionally, when clearing each pte in a contig range, don't gather the access and dirty bits if the pte is not present.
An alternative approach that would not require API changes would be to store the PTE_CONT bit in a spare bit in the swap entry pte for the non-present case. But it felt cleaner to follow other APIs' lead and just pass in the size.
As an aside, PTE_CONT is bit 52, which corresponds to bit 40 in the swap entry offset field (layout of non-present pte). Since hugetlb is never swapped to disk, this field will only be populated for markers, which always set this bit to 0, and hwpoison swap entries, which set the offset field to a PFN; so it would only ever be 1 for a 52-bit PVA system where memory in that high half was poisoned (I think!). So in practice, this bit would almost always be zero for non-present ptes and we would only clear the first entry if it was actually a contiguous block. That's probably a less severe symptom than if it was always interpreted as 1 and cleared out potentially-present neighboring PTEs.
Cc: stable@vger.kernel.org
Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
 arch/arm64/mm/hugetlbpage.c | 40 ++++++++++++++++---------------------
 1 file changed, 17 insertions(+), 23 deletions(-)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 06db4649af91..614b2feddba2 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -163,24 +163,23 @@ static pte_t get_clear_contig(struct mm_struct *mm,
                               unsigned long pgsize,
                               unsigned long ncontig)
 {
-        pte_t orig_pte = __ptep_get(ptep);
-        unsigned long i;
-
-        for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
-                pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
-
-                /*
-                 * If HW_AFDBM is enabled, then the HW could turn on
-                 * the dirty or accessed bit for any page in the set,
-                 * so check them all.
-                 */
-                if (pte_dirty(pte))
-                        orig_pte = pte_mkdirty(orig_pte);
-
-                if (pte_young(pte))
-                        orig_pte = pte_mkyoung(orig_pte);
+        pte_t pte, tmp_pte;
+        bool present;
+
+        pte = __ptep_get_and_clear(mm, addr, ptep);
+        present = pte_present(pte);
+        while (--ncontig) {
+                ptep++;
+                addr += pgsize;
+                tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
+                if (present) {
+                        if (pte_dirty(tmp_pte))
+                                pte = pte_mkdirty(pte);
+                        if (pte_young(tmp_pte))
+                                pte = pte_mkyoung(pte);
+                }
         }
nit: With the loop now structured like this, we really can't handle num_contig_ptes() returning 0 if it gets an unknown size. Granted, that really shouldn't happen, but perhaps it would be better to add a 'default' case with a WARN() to num_contig_ptes() and then add an early return here?
Looking at other users of num_contig_ptes() it looks like huge_ptep_get() already assumes at least 1 pte (it calls __ptep_get() before calling num_contig_ptes()) and set_huge_pte_at() assumes 1 pte for the "present and non-contig" case. So num_contig_ptes() returning 0 is already not really consumed consistently.
How about we change the default num_contig_ptes() return value to 1 and add a warning if size is invalid:
---8<---
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 614b2feddba2..b3a7fafe8892 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -100,20 +100,11 @@ static int find_num_contig(struct mm_struct *mm, unsigned long addr,

 static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
 {
-        int contig_ptes = 0;
+        int contig_ptes = 1;

         *pgsize = size;

         switch (size) {
-#ifndef __PAGETABLE_PMD_FOLDED
-        case PUD_SIZE:
-                if (pud_sect_supported())
-                        contig_ptes = 1;
-                break;
-#endif
-        case PMD_SIZE:
-                contig_ptes = 1;
-                break;
         case CONT_PMD_SIZE:
                 *pgsize = PMD_SIZE;
                 contig_ptes = CONT_PMDS;
@@ -122,6 +113,8 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
                 *pgsize = PAGE_SIZE;
                 contig_ptes = CONT_PTES;
                 break;
+        default:
+                WARN_ON(!__hugetlb_valid_size(size));
         }

         return contig_ptes;
---8<---
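For clarity, the helper as it would read with that change applied (assembled from the diff above; __hugetlb_valid_size() is the existing helper in this file used by arch_hugetlb_valid_size()):

        static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
        {
                int contig_ptes = 1;

                *pgsize = size;

                switch (size) {
                case CONT_PMD_SIZE:
                        *pgsize = PMD_SIZE;
                        contig_ptes = CONT_PMDS;
                        break;
                case CONT_PTE_SIZE:
                        *pgsize = PAGE_SIZE;
                        contig_ptes = CONT_PTES;
                        break;
                default:
                        /* PMD_SIZE and (where supported) PUD_SIZE fall through
                         * to the default of 1; anything else is a caller bug. */
                        WARN_ON(!__hugetlb_valid_size(size));
                }

                return contig_ptes;
        }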
On Mon, Feb 24, 2025 at 12:11:19PM +0000, Ryan Roberts wrote:
On 21/02/2025 15:31, Will Deacon wrote:
On Mon, Feb 17, 2025 at 02:04:15PM +0000, Ryan Roberts wrote:
+        pte = __ptep_get_and_clear(mm, addr, ptep);
+        present = pte_present(pte);
+        while (--ncontig) {
+                ptep++;
+                addr += pgsize;
+                tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
+                if (present) {
+                        if (pte_dirty(tmp_pte))
+                                pte = pte_mkdirty(pte);
+                        if (pte_young(tmp_pte))
+                                pte = pte_mkyoung(pte);
+                }
         }
nit: With the loop now structured like this, we really can't handle num_contig_ptes() returning 0 if it gets an unknown size. Granted, that really shouldn't happen, but perhaps it would be better to add a 'default' case with a WARN() to num_contig_ptes() and then add an early return here?
Looking at other users of num_contig_ptes() it looks like huge_ptep_get() already assumes at least 1 pte (it calls __ptep_get() before calling num_contig_ptes()) and set_huge_pte_at() assumes 1 pte for the "present and non-contig" case. So num_contig_ptes() returning 0 is already not really consumed consistently.
How about we change the default num_contig_ptes() return value to 1 and add a warning if size is invalid:
Fine by me!
I assume you'll fold that in and send a new version, along with the typo fixes?
Cheers,
Will
On 25/02/2025 22:18, Will Deacon wrote:
On Mon, Feb 24, 2025 at 12:11:19PM +0000, Ryan Roberts wrote:
On 21/02/2025 15:31, Will Deacon wrote:
On Mon, Feb 17, 2025 at 02:04:15PM +0000, Ryan Roberts wrote:
+        pte = __ptep_get_and_clear(mm, addr, ptep);
+        present = pte_present(pte);
+        while (--ncontig) {
+                ptep++;
+                addr += pgsize;
+                tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
+                if (present) {
+                        if (pte_dirty(tmp_pte))
+                                pte = pte_mkdirty(pte);
+                        if (pte_young(tmp_pte))
+                                pte = pte_mkyoung(pte);
+                }
         }
nit: With the loop now structured like this, we really can't handle num_contig_ptes() returning 0 if it gets an unknown size. Granted, that really shouldn't happen, but perhaps it would be better to add a 'default' case with a WARN() to num_contig_ptes() and then add an early return here?
Looking at other users of num_contig_ptes() it looks like huge_ptep_get() already assumes at least 1 pte (it calls __ptep_get() before calling num_contig_ptes()) and set_huge_pte_at() assumes 1 pte for the "present and non-contig" case. So num_contig_ptes() returning 0 is already not really consumed consistently.
How about we change the default num_contig_ptes() return value to 1 and add a warning if size is invalid:
Fine by me!
I assume you'll fold that in and send a new version, along with the typo fixes?
Yep, I'll aim to post this today. I have a few review comments for s390 to add in too.
Cheers,
Will
Hi Ryan,
On Mon, Feb 17, 2025 at 3:04 PM Ryan Roberts ryan.roberts@arm.com wrote:
arm64 supports multiple huge_pte sizes. Some of the sizes are covered by a single pte entry at a particular level (PMD_SIZE, PUD_SIZE), and some are covered by multiple ptes at a particular level (CONT_PTE_SIZE, CONT_PMD_SIZE). So the function has to figure out the size from the huge_pte pointer. This was previously done by walking the pgtable to determine the level and by using the PTE_CONT bit to determine the number of ptes at the level.
But the PTE_CONT bit is only valid when the pte is present. For non-present pte values (e.g. markers, migration entries), the previous implementation was therefore erroneously determining the size. There is at least one known caller in core-mm, move_huge_pte(), which may call huge_ptep_get_and_clear() for a non-present pte. So we must be robust to this case. Additionally, the "regular" ptep_get_and_clear() is robust to being called for non-present ptes, so it makes sense to follow that behaviour.
Fix this by using the new sz parameter, which is now provided to the function. Additionally, when clearing each pte in a contig range, don't gather the access and dirty bits if the pte is not present.
An alternative approach that would not require API changes would be to store the PTE_CONT bit in a spare bit in the swap entry pte for the non-present case. But it felt cleaner to follow other APIs' lead and just pass in the size.
As an aside, PTE_CONT is bit 52, which corresponds to bit 40 in the swap entry offset field (layout of non-present pte). Since hugetlb is never swapped to disk, this field will only be populated for markers, which always set this bit to 0, and hwpoison swap entries, which set the offset field to a PFN; so it would only ever be 1 for a 52-bit PVA system where memory in that high half was poisoned (I think!). So in practice, this bit would almost always be zero for non-present ptes and we would only clear the first entry if it was actually a contiguous block. That's probably a less severe symptom than if it was always interpreted as 1 and cleared out potentially-present neighboring PTEs.
Cc: stable@vger.kernel.org
Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
 arch/arm64/mm/hugetlbpage.c | 40 ++++++++++++++++---------------------
 1 file changed, 17 insertions(+), 23 deletions(-)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 06db4649af91..614b2feddba2 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -163,24 +163,23 @@ static pte_t get_clear_contig(struct mm_struct *mm,
                               unsigned long pgsize,
                               unsigned long ncontig)
 {
-        pte_t orig_pte = __ptep_get(ptep);
-        unsigned long i;
-
-        for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
-                pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
-
-                /*
-                 * If HW_AFDBM is enabled, then the HW could turn on
-                 * the dirty or accessed bit for any page in the set,
-                 * so check them all.
-                 */
-                if (pte_dirty(pte))
-                        orig_pte = pte_mkdirty(orig_pte);
-
-                if (pte_young(pte))
-                        orig_pte = pte_mkyoung(orig_pte);
+        pte_t pte, tmp_pte;
+        bool present;
+
+        pte = __ptep_get_and_clear(mm, addr, ptep);
+        present = pte_present(pte);
+        while (--ncontig) {
+                ptep++;
+                addr += pgsize;
+                tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
+                if (present) {
+                        if (pte_dirty(tmp_pte))
+                                pte = pte_mkdirty(pte);
+                        if (pte_young(tmp_pte))
+                                pte = pte_mkyoung(pte);
+                }
         }
-        return orig_pte;
+        return pte;
 }

 static pte_t get_clear_contig_flush(struct mm_struct *mm,
@@ -401,13 +400,8 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
                               pte_t *ptep, unsigned long sz)
 {
         int ncontig;
         size_t pgsize;
-        pte_t orig_pte = __ptep_get(ptep);
-
-        if (!pte_cont(orig_pte))
-                return __ptep_get_and_clear(mm, addr, ptep);
-
-        ncontig = find_num_contig(mm, addr, ptep, &pgsize);

+        ncontig = num_contig_ptes(sz, &pgsize);
         return get_clear_contig(mm, addr, ptep, pgsize, ncontig);
 }
--
2.43.0
Thanks for ccing me on this fix; we have the same issue (in huge_pte_clear() too) in riscv. I'll come up with a fix and wait for you guys to merge this.
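For anyone following along, riscv's contig analogue is the Svnapot extension, and its helper had the same shape (paraphrased from memory, not verbatim; see arch/riscv/mm/hugetlbpage.c for the real code):

        pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
                                      unsigned long addr, pte_t *ptep)
        {
                pte_t orig_pte = ptep_get(ptep);
                int pte_num;

                /* Like PTE_CONT, the NAPOT bit is only meaningful for present
                 * ptes, so deriving pte_num from it has the same flaw. */
                if (!pte_napot(orig_pte))
                        return ptep_get_and_clear(mm, addr, ptep);

                pte_num = napot_pte_num(napot_cont_order(orig_pte));

                return get_clear_contig(mm, addr, ptep, pte_num);
        }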
That only reinforces my idea of merging the riscv and arm64 implementations :)
Thanks again,
Alex