Pulling in David, who may be able to advise...
On 01/05/2024 12:35, Ryan Roberts wrote:
Zi Yan, I'm hoping you might have some input on the below...
On 30/04/2024 14:31, Ryan Roberts wrote:
__split_huge_pmd_locked() can be called for a present THP, devmap or (non-present) migration entry. It calls pmdp_invalidate() unconditionally on the pmdp and only determines if it is present or not based on the returned old pmd.
But arm64's pmd_mkinvalid(), called by pmdp_invalidate(), unconditionally sets the PMD_PRESENT_INVALID flag, which causes future pmd_present() calls to return true - even for a swap pmd. Therefore any lockless pgtable walker could see the migration entry pmd in this state and start interpretting the fields (e.g. pmd_pfn()) as if it were present, leading to BadThings (TM). GUP-fast appears to be one such lockless pgtable walker.
While the obvious fix is for core-mm to avoid such calls for non-present pmds (pmdp_invalidate() will also issue TLBI which is not necessary for this case either), all other arches that implement pmd_mkinvalid() do it in such a way that it is robust to being called with a non-present pmd.
OK the plot thickens; The tests I wrote to check that pmd_mkinvalid() is safe for swap entries fails on x86_64. See below...
So it is simpler and safer to make arm64 robust too. This approach means we can even add tests to debug_vm_pgtable.c to validate the required behaviour.
This is a theoretical bug found during code review. I don't have any test case to trigger it in practice.
Cc: stable@vger.kernel.org Fixes: 53fa117bb33c ("arm64/mm: Enable THP migration") Signed-off-by: Ryan Roberts ryan.roberts@arm.com
Hi all,
v1 of this fix [1] took the approach of fixing core-mm to never call pmdp_invalidate() on a non-present pmd. But Zi Yan highlighted that only arm64 suffers this problem; all other arches are robust. So his suggestion was to instead make arm64 robust in the same way and add tests to validate it. Despite my stated reservations in the context of the v1 discussion, having thought on it for a bit, I now agree with Zi Yan. Hence this post.
Andrew has v1 in mm-unstable at the moment, so probably the best thing to do is remove it from there and have this go in through the arm64 tree? Assuming there is agreement that this approach is right one.
This applies on top of v6.9-rc5. Passes all the mm selftests on arm64.
[1] https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.c...
Thanks, Ryan
arch/arm64/include/asm/pgtable.h | 12 +++++-- mm/debug_vm_pgtable.c | 61 ++++++++++++++++++++++++++++++++ 2 files changed, 71 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index afdd56d26ad7..7d580271a46d 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -511,8 +511,16 @@ static inline int pmd_trans_huge(pmd_t pmd)
static inline pmd_t pmd_mkinvalid(pmd_t pmd) {
- pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
- pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
/*
* If not valid then either we are already present-invalid or we are
* not-present (i.e. none or swap entry). We must not convert
* not-present to present-invalid. Unbelievably, the core-mm may call
* pmd_mkinvalid() for a swap entry and all other arches can handle it.
*/
if (pmd_valid(pmd)) {
pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
}
return pmd;
} diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 65c19025da3d..7e9c387d06b0 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -956,6 +956,65 @@ static void __init hugetlb_basic_tests(struct pgtable_debug_args *args) { } #endif /* CONFIG_HUGETLB_PAGE */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#if !defined(__HAVE_ARCH_PMDP_INVALIDATE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION) +static void __init swp_pmd_mkinvalid_tests(struct pgtable_debug_args *args) +{
Printing various values at different locations in this function for debug:
- unsigned long max_swap_offset;
- swp_entry_t swp_set, swp_clear, swp_convert;
- pmd_t pmd_set, pmd_clear;
- /*
* See generic_max_swapfile_size(): probe the maximum offset, then
* create swap entry will all possible bits set and a swap entry will
* all bits clear.
*/
- max_swap_offset = swp_offset(pmd_to_swp_entry(swp_entry_to_pmd(swp_entry(0, ~0UL))));
- swp_set = swp_entry((1 << MAX_SWAPFILES_SHIFT) - 1, max_swap_offset);
- swp_clear = swp_entry(0, 0);
- /* Convert to pmd. */
- pmd_set = swp_entry_to_pmd(swp_set);
- pmd_clear = swp_entry_to_pmd(swp_clear);
[ 0.702163] debug_vm_pgtable: [swp_pmd_mkinvalid_tests ]: valid: pmd_set=f800000000000000, pmd_clear=7fffffffffffe00
- /*
* Sanity check that the pmds are not-present, not-huge and swap entry
* is recoverable without corruption.
*/
- WARN_ON(pmd_present(pmd_set));
- WARN_ON(pmd_trans_huge(pmd_set));
- swp_convert = pmd_to_swp_entry(pmd_set);
- WARN_ON(swp_type(swp_set) != swp_type(swp_convert));
- WARN_ON(swp_offset(swp_set) != swp_offset(swp_convert));
- WARN_ON(pmd_present(pmd_clear));
- WARN_ON(pmd_trans_huge(pmd_clear));
- swp_convert = pmd_to_swp_entry(pmd_clear);
- WARN_ON(swp_type(swp_clear) != swp_type(swp_convert));
- WARN_ON(swp_offset(swp_clear) != swp_offset(swp_convert));
- /* Now invalidate the pmd. */
- pmd_set = pmd_mkinvalid(pmd_set);
- pmd_clear = pmd_mkinvalid(pmd_clear);
[ 0.704452] debug_vm_pgtable: [swp_pmd_mkinvalid_tests ]: invalid: pmd_set=f800000000000000, pmd_clear=7ffffffffe00e00
- /*
* Since its a swap pmd, invalidation should effectively be a noop and
* the checks we already did should give the same answer. Check the
* invalidation didn't corrupt any fields.
*/
- WARN_ON(pmd_present(pmd_set));
- WARN_ON(pmd_trans_huge(pmd_set));
- swp_convert = pmd_to_swp_entry(pmd_set);
[ 0.706461] debug_vm_pgtable: [swp_pmd_mkinvalid_tests ]: set: swp=7c03ffffffffffff (1f, 3ffffffffffff), convert=7c03ffffffffffff (1f, 3ffffffffffff)
- WARN_ON(swp_type(swp_set) != swp_type(swp_convert));
- WARN_ON(swp_offset(swp_set) != swp_offset(swp_convert));
- WARN_ON(pmd_present(pmd_clear));
- WARN_ON(pmd_trans_huge(pmd_clear));
- swp_convert = pmd_to_swp_entry(pmd_clear);
[ 0.708841] debug_vm_pgtable: [swp_pmd_mkinvalid_tests ]: clear: swp=0 (0, 0), convert=ff8 (0, ff8)
- WARN_ON(swp_type(swp_clear) != swp_type(swp_convert));
- WARN_ON(swp_offset(swp_clear) != swp_offset(swp_convert));
This line fails on x86_64.
The logs show that the offset is indeed being corrupted by pmd_mkinvalid(); 0 -> 0xff8.
I think this is due to x86's pmd_mkinvalid() assuming the pmd is present; pmd_flags() and pmd_pfn() do all sorts of weird and wonderful things.
So does this take us full circle? Are we now back to modifying the core-mm to never call pmd_mkinvalid() on a non-present entry? If so, then I guess we should remove the arm64 fix from for-next/fixes.
+} +#else +static void __init swp_pmd_mkinvalid_tests(struct pgtable_debug_args *args) { } +#endif /* !__HAVE_ARCH_PMDP_INVALIDATE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
static void __init pmd_thp_tests(struct pgtable_debug_args *args) { pmd_t pmd; @@ -982,6 +1041,8 @@ static void __init pmd_thp_tests(struct pgtable_debug_args *args) WARN_ON(!pmd_trans_huge(pmd_mkinvalid(pmd_mkhuge(pmd)))); WARN_ON(!pmd_present(pmd_mkinvalid(pmd_mkhuge(pmd)))); #endif /* __HAVE_ARCH_PMDP_INVALIDATE */
- swp_pmd_mkinvalid_tests(args);
}
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
2.25.1