In early boot, Linux creates identity virtual->physical address mappings so that it can enable the MMU before full memory management is ready. To ensure that physical memory is available to back these structures, vmlinux.lds reserves space (and defines marker symbols) in the middle of the kernel image. However, because these regions are defined outside of PROGBITS sections, they aren't pre-initialized -- at least as far as ELF is concerned.
In the typical case, this isn't actually a problem: the boot image is prepared with objcopy, which zero-fills the gaps, so these structures are incidentally zero-initialized (an all-zeroes entry is considered absent, so zero-initialization is appropriate).
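(The "considered absent" property is architectural: bit 0 of an arm64 translation table descriptor is the valid bit, so a zero-filled entry decodes as a translation fault rather than as a junk mapping. A minimal sketch of that decoding; PTE_VALID here is written out locally rather than taken from kernel headers:)

#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of an arm64 translation table descriptor is the valid bit. */
#define PTE_VALID (UINT64_C(1) << 0)

/* A zero-filled entry has the valid bit clear, so the table walker
 * reports the range as unmapped instead of following garbage. */
static bool desc_is_present(uint64_t desc)
{
	return desc & PTE_VALID;
}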
However, that is just a happy accident: the `vmlinux` ELF output authoritatively represents the state of memory at entry. If the ELF says a region of memory isn't initialized, we must treat it as uninitialized. Indeed, certain bootloaders (e.g. Broadcom CFE) ingest the ELF directly -- sidestepping the objcopy-produced image entirely -- and therefore do not initialize the gaps. This results in the early boot code crashing when it attempts to create identity mappings.
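(For illustration, a sketch of how a section-driven loader can end up skipping these gaps -- an assumption about the general shape of such loaders, not CFE's actual code: PROGBITS contents are copied, NOBITS is zeroed, and address ranges claimed by no section are never written:)

#include <elf.h>
#include <string.h>

/* Hypothetical section-driven boot loader: only bytes that some section
 * claims ever get written. Space reserved in vmlinux.lds with a bare
 * '. += PAGE_SIZE' belongs to no section, so it keeps whatever the RAM
 * happened to contain. */
static void load_section(const Elf64_Shdr *sh, const void *file, void *ram)
{
	unsigned char *dst = (unsigned char *)ram + sh->sh_addr;

	if (!(sh->sh_flags & SHF_ALLOC))
		return;                      /* not part of the memory image */
	if (sh->sh_type == SHT_PROGBITS)
		memcpy(dst, (const unsigned char *)file + sh->sh_offset, sh->sh_size);
	else if (sh->sh_type == SHT_NOBITS)
		memset(dst, 0, sh->sh_size); /* .bss-style zero fill */
}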
Therefore, add boot-time zero-initialization for the following:

- __pi_init_idmap_pg_dir..__pi_init_idmap_pg_end
- idmap_pg_dir
- reserved_pg_dir
- tramp_pg_dir # Already done, but this patch corrects the size
Note, swapper_pg_dir is already initialized (by copy from idmap_pg_dir) before use, so this patch does not need to address it.
Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
 arch/arm64/kernel/head.S | 12 ++++++++++++
 arch/arm64/mm/mmu.c      |  3 ++-
 2 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index ca04b338cb0d..0c3be11d0006 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -86,6 +86,18 @@ SYM_CODE_START(primary_entry)
 	bl	record_mmu_state
 	bl	preserve_boot_args
 
+	adrp	x0, reserved_pg_dir
+	add	x1, x0, #PAGE_SIZE
+0:	str	xzr, [x0], 8
+	cmp	x0, x1
+	b.lo	0b
+
+	adrp	x0, __pi_init_idmap_pg_dir
+	adrp	x1, __pi_init_idmap_pg_end
+1:	str	xzr, [x0], 8
+	cmp	x0, x1
+	b.lo	1b
+
 	adrp	x1, early_init_stack
 	mov	sp, x1
 	mov	x29, xzr
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 34e5d78af076..aaf823565a65 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -761,7 +761,7 @@ static int __init map_entry_trampoline(void)
 	pgprot_val(prot) &= ~PTE_NG;
 
 	/* Map only the text into the trampoline page table */
-	memset(tramp_pg_dir, 0, PGD_SIZE);
+	memset(tramp_pg_dir, 0, PAGE_SIZE);
 	__create_pgd_mapping(tramp_pg_dir, pa_start, TRAMP_VALIAS,
 			     entry_tramp_text_size(), prot,
 			     pgd_pgtable_alloc_init_mm, NO_BLOCK_MAPPINGS);
@@ -806,6 +806,7 @@ static void __init create_idmap(void)
 	u64 end = __pa_symbol(__idmap_text_end);
 	u64 ptep = __pa_symbol(idmap_ptes);
 
+	memset(idmap_pg_dir, 0, PAGE_SIZE);
 	__pi_map_range(&ptep, start, end, start, PAGE_KERNEL_ROX,
 		       IDMAP_ROOT_LEVEL, (pte_t *)idmap_pg_dir, false,
 		       __phys_to_virt(ptep) - ptep);
-- 
2.49.1
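(For readers skimming the A64 above, the two head.S loops are roughly the following C -- illustrative only, not part of the patch; the PAGE_SIZE stand-in and function name are ours, while the extern symbols are the kernel's linker-provided markers:)

#define PAGE_SIZE 4096UL /* stand-in; the kernel's value depends on the granule */

extern unsigned long reserved_pg_dir[];
extern unsigned long __pi_init_idmap_pg_dir[], __pi_init_idmap_pg_end[];

static void zero_boot_pg_dirs(void)
{
	unsigned long *p;

	/* Loop 0: clear the single reserved_pg_dir page, 8 bytes per store. */
	for (p = reserved_pg_dir; p < reserved_pg_dir + PAGE_SIZE / sizeof(*p); p++)
		*p = 0;

	/* Loop 1: clear everything between the init_idmap marker symbols. */
	for (p = __pi_init_idmap_pg_dir; p < __pi_init_idmap_pg_end; p++)
		*p = 0;
}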
Hi Sam,
On Fri, 22 Aug 2025 at 14:15, Sam Edwards <cfsworks@gmail.com> wrote:
> [...]
>
> Therefore, add boot-time zero-initialization for the following:
>
> - __pi_init_idmap_pg_dir..__pi_init_idmap_pg_end
> - idmap_pg_dir
> - reserved_pg_dir
I don't think this is the right approach.
If the ELF representation is inaccurate, it should be fixed, and this should be achievable without impacting the binary image at all.
> - tramp_pg_dir # Already done, but this patch corrects the size
What is wrong with the size?
On Sat, Aug 23, 2025 at 3:25 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> Hi Sam,
>
> On Fri, 22 Aug 2025 at 14:15, Sam Edwards <cfsworks@gmail.com> wrote:
> > [...]
>
> I don't think this is the right approach.
>
> If the ELF representation is inaccurate, it should be fixed, and this should be achievable without impacting the binary image at all.
Hi Ard,
I don't believe I can declare the ELF output "inaccurate" per se, since it's the linker's final determination about the state of memory at kernel entry -- including which regions are not the loader's responsibility to initialize (and should therefore be initialized at runtime, e.g. .bss). But, I think I understand your meaning: you would prefer consistent load-time zero-initialization over run-time. I'm open to that approach if that's the consensus here, but it will make `vmlinux` dozens of KBs larger (even though it keeps `Image` the same size).
> > - tramp_pg_dir # Already done, but this patch corrects the size
>
> What is wrong with the size?
On higher-VA_BITS targets, that memset is overflowing by writing PGD_SIZE bytes despite tramp_pg_dir being only PAGE_SIZE bytes in size. My understanding is that only userspace (TTBR0) PGDs are PGD_SIZE and kernelspace (TTBR1) PGDs like the trampoline mapping are always PAGE_SIZE. Please correct me if I'm wrong; I might be misled by how vmlinux.lds.S is making space for those PGDs. :)
(If you'd like, I can break that one-line change out as a separate patch to apply immediately? It seems like a more critical concern than everything else here.)
Best,
Sam
On Sun, 24 Aug 2025 at 09:56, Sam Edwards <cfsworks@gmail.com> wrote:
> On Sat, Aug 23, 2025 at 3:25 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > [...]
> >
> > If the ELF representation is inaccurate, it should be fixed, and this should be achievable without impacting the binary image at all.
>
> Hi Ard,
>
> I don't believe I can declare the ELF output "inaccurate" per se, since it's the linker's final determination about the state of memory at kernel entry -- including which regions are not the loader's responsibility to initialize (and should therefore be initialized at runtime, e.g. .bss). But, I think I understand your meaning: you would prefer consistent load-time zero-initialization over run-time. I'm open to that approach if that's the consensus here, but it will make `vmlinux` dozens of KBs larger (even though it keeps `Image` the same size).
Indeed, I'd like the ELF representation to be such that only the tail end of the image needs explicit clearing. A bit of bloat of vmlinux is tolerable IMO.
Note that your fix is not complete: stores to memory done with the MMU and caches disabled need to be invalidated from the D-caches too, or stale clean lines could shadow the stored data. This is precisely the reason why manipulation of memory should be limited to the bare minimum until the ID map is enabled in the MMU.
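(To make the hazard concrete, a sketch of the kind of maintenance that would be needed -- illustrative only; the function name is ours, and real kernel code uses the routines in arch/arm64/mm/cache.S and derives the line size from CTR_EL0 instead of assuming 64 bytes:)

/* Invalidate [start, end) from the D-caches to the point of coherency,
 * so no stale clean line can shadow data stored with the MMU/caches off. */
static inline void inval_dcache_range(unsigned long start, unsigned long end)
{
	const unsigned long line = 64;	/* assumption: 64-byte cache lines */
	unsigned long p;

	for (p = start & ~(line - 1); p < end; p += line)
		asm volatile("dc ivac, %0" : : "r"(p) : "memory");
	asm volatile("dsb sy" : : : "memory");	/* wait for completion */
}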
> > > - tramp_pg_dir # Already done, but this patch corrects the size
> >
> > What is wrong with the size?
>
> On higher-VA_BITS targets, that memset is overflowing by writing PGD_SIZE bytes despite tramp_pg_dir being only PAGE_SIZE bytes in size.
Under which conditions would PGD_SIZE assume a value greater than PAGE_SIZE?
Note that at stage 1, arm64 does not support page table concatenation, and so the root page table is never larger than a page.
On Sat, Aug 23, 2025 at 5:29 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> [...]
> Indeed, I'd like the ELF representation to be such that only the tail end of the image needs explicit clearing. A bit of bloat of vmlinux is tolerable IMO.
Since the explicit clearing region already includes the entirety of __pi_init_pg_dir, would it make sense if I instead move the other pg_dir items (except __pi_init_idmap_pg_dir) inside that region too, both to keep them all grouped and to ensure that they're all cleared in one go? I'd still need to handle __pi_init_idmap_pg_dir, and it would mean that reserved_pg_dir is first installed in TTBR1_EL1 a few cycles before being zeroed, but beyond those two drawbacks it sounds simpler to me, reduces the image size by a few pages, and meets the "only clear the tail end" goal.
> Note that your fix is not complete: stores to memory done with the MMU and caches disabled need to be invalidated from the D-caches too, or stale clean lines could shadow the stored data. This is precisely the reason why manipulation of memory should be limited to the bare minimum until the ID map is enabled in the MMU.
ACK. ARM64 caches are one of those things I understand in principle, but I'm still learning all of the gotchas. I appreciate that you shared this insight despite rejecting the overall approach!
> > > > - tramp_pg_dir # Already done, but this patch corrects the size
> > >
> > > What is wrong with the size?
> >
> > On higher-VA_BITS targets, that memset is overflowing by writing PGD_SIZE bytes despite tramp_pg_dir being only PAGE_SIZE bytes in size.
>
> Under which conditions would PGD_SIZE assume a value greater than PAGE_SIZE?
I might be doing my math wrong, but wouldn't 52-bit VA with 4K granules and 5 levels result in this?
- Each PTE represents 4K of virtual memory, so covers VA bits [11:0] (this is level 3)
- Each PMD has 512 PTEs, the index of which covers VA bits [20:12] (this is level 2)
- Each PUD references 512 PMDs, the index covering VA [29:21] (this is level 1)
- Each P4D references 512 PUDs, indexed by VA [38:30] (this is level 0)

The PGD, at level -1, therefore has to cover VA bits [51:39], which means it has a 13-bit index: 8192 entries of 8 bytes each would make it 16 pages in size.
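(As a cross-check of the walk above -- a standalone sketch of ours, not from the thread, assuming the conventional 4K-granule scheme in which every level below the root indexes 9 VA bits on top of the 12-bit page offset:)

#include <stdio.h>

int main(void)
{
	const int page_shift = 12;	/* 4K granule */
	const int bits_per_level = 9;	/* 512 eight-byte entries per 4K table */
	const int va_bits = 52, levels = 5;
	int low = page_shift;
	int l;

	for (l = 3; l > 3 - levels; l--) {	/* level 3 down to level -1 */
		int hi = (l == 3 - levels + 1) ? va_bits - 1
					       : low + bits_per_level - 1;
		printf("level %2d: index covers VA [%d:%d] -> %d entries\n",
		       l, hi, low, 1 << (hi - low + 1));
		low = hi + 1;
	}
	return 0;
}

(Under those assumptions the root index works out to VA [51:48], i.e. 16 entries of 8 bytes, because level 0 already consumes bits [47:39].)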
> Note that at stage 1, arm64 does not support page table concatenation, and so the root page table is never larger than a page.
Doesn't PGD_SIZE refer to the size used for userspace PGDs after the boot progresses beyond stage 1? (What do you mean by "never" here? "Under no circumstances is it larger than a page at stage 1"? Or "during the entire lifecycle of the system, there is no time at which it's larger than a page"?)
Thanks for your time and attention to this,
Sam