Some systems have ACPI tables that don't include everything that needs to be mapped for a successful kexec. These systems rely on identity maps that include the full gigabyte surrounding any smaller region requested for kexec success. Without this, they fail to kexec and end up doing a full firmware reboot.
So, reduce the use of GB pages only on systems where this is known to be necessary (specifically, UV systems).
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Fixes: d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
Reported-by: Pavin Joseph <me@pavinjoseph.com>
Closes: https://lore.kernel.org/all/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@pavinjoseph...
Tested-by: Pavin Joseph <me@pavinjoseph.com>
Tested-by: Eric Hagberg <ehagberg@gmail.com>
Tested-by: Sarah Brofeldt <srhb@dbc.dk>
---
 arch/x86/include/asm/init.h        |  1 +
 arch/x86/kernel/machine_kexec_64.c |  3 +++
 arch/x86/mm/ident_map.c            | 13 +++++++------
 3 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
index cc9ccf61b6bd..4ae843e8fefb 100644
--- a/arch/x86/include/asm/init.h
+++ b/arch/x86/include/asm/init.h
@@ -10,6 +10,7 @@ struct x86_mapping_info {
 	unsigned long page_flag;	 /* page flag for PMD or PUD entry */
 	unsigned long offset;		 /* ident mapping offset */
 	bool direct_gbpages;		 /* PUD level 1GB page support */
+	bool direct_gbpages_always;	 /* use 1GB pages exclusively */
 	unsigned long kernpg_flag;	 /* kernel pagetable flag override */
 };

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index b180d8e497c3..1e1c6633bbec 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -28,6 +28,7 @@
 #include <asm/setup.h>
 #include <asm/set_memory.h>
 #include <asm/cpu.h>
+#include <asm/uv/uv.h>

 #ifdef CONFIG_ACPI
 /*
@@ -212,6 +213,8 @@ static int init_pgtable(struct kimage *image, unsigned long start_pgtable)

 	if (direct_gbpages)
 		info.direct_gbpages = true;
+	if (!is_uv_system())
+		info.direct_gbpages_always = true;

 	for (i = 0; i < nr_pfn_mapped; i++) {
 		mstart = pfn_mapped[i].start << PAGE_SHIFT;
diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
index a204a332c71f..8039498b9713 100644
--- a/arch/x86/mm/ident_map.c
+++ b/arch/x86/mm/ident_map.c
@@ -39,12 +39,13 @@ static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page,
 		/* Is using a gbpage allowed? */
 		use_gbpage = info->direct_gbpages;

-		/* Don't use gbpage if it maps more than the requested region. */
-		/* at the begining: */
-		use_gbpage &= ((addr & ~PUD_MASK) == 0);
-		/* ... or at the end: */
-		use_gbpage &= ((next & ~PUD_MASK) == 0);
-
+		if (!info->direct_gbpages_always) {
+			/* Don't use gbpage if it maps more than the requested region. */
+			/* at the beginning: */
+			use_gbpage &= ((addr & ~PUD_MASK) == 0);
+			/* ... or at the end: */
+			use_gbpage &= ((next & ~PUD_MASK) == 0);
+		}
 		/* Never overwrite existing mappings */
 		use_gbpage &= !pud_present(*pud);
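For illustration, here is a small standalone program (not kernel code; the PUD_* macros are redefined locally and the sample addresses are made up) showing what the check above decides. In "exact" mode a gbpage is only used when the requested range covers the whole 1 GiB; with direct_gbpages_always set, the old "sloppy" behavior is kept and the gbpage is used anyway:

/* Standalone sketch of the gbpage decision; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

#define PUD_SHIFT	30			/* a PUD entry maps 1 GiB */
#define PUD_SIZE	(1UL << PUD_SHIFT)
#define PUD_MASK	(~(PUD_SIZE - 1))

static bool use_gbpage(unsigned long addr, unsigned long next,
		       bool direct_gbpages, bool direct_gbpages_always)
{
	bool ok = direct_gbpages;

	if (!direct_gbpages_always) {
		/* only use a gbpage if the request covers the whole 1 GiB */
		ok &= ((addr & ~PUD_MASK) == 0);	/* aligned at the start */
		ok &= ((next & ~PUD_MASK) == 0);	/* ... and at the end */
	}
	return ok;
}

int main(void)
{
	/* hypothetical request: 16 MiB starting 512 MiB into a gbpage */
	unsigned long addr = 0x60000000UL;
	unsigned long next = 0x61000000UL;

	printf("exact : use gbpage = %d\n", use_gbpage(addr, next, true, false));
	printf("sloppy: use gbpage = %d\n", use_gbpage(addr, next, true, true));
	return 0;
}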
Hi,
Thanks for your patch.
FYI: kernel test robot notices the stable kernel rule is not satisfied.
The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html#opti...
Rule: add the tag "Cc: stable@vger.kernel.org" in the sign-off area to have the patch automatically included in the stable tree.
Subject: [PATCH] x86/mm/ident_map: Use full gbpages in identity maps except on UV platform.
Link: https://lore.kernel.org/stable/20240322162135.3984233-1-steve.wahl%40hpe.com
On 3/22/24 09:21, Steve Wahl wrote:
Some systems have ACPI tables that don't include everything that needs to be mapped for a successful kexec. These systems rely on identity maps that include the full gigabyte surrounding any smaller region requested for kexec success. Without this, they fail to kexec and end up doing a full firmware reboot.
So, reduce the use of GB pages only on systems where this is known to be necessary (specifically, UV systems).
Isn't this called "buggy firmware"?
I'd much rather add synthetic entries to the memory maps that have this information than hack around it by assuming that things are within a gigabyte.
Dave Hansen dave.hansen@intel.com writes:
On 3/22/24 09:21, Steve Wahl wrote:
Some systems have ACPI tables that don't include everything that needs to be mapped for a successful kexec. These systems rely on identity maps that include the full gigabyte surrounding any smaller region requested for kexec success. Without this, they fail to kexec and end up doing a full firmware reboot.
So, reduce the use of GB pages only on systems where this is known to be necessary (specifically, UV systems).
Isn't this called "buggy firmware"?
I'd much rather add synthetic entries to the memory maps that have this information than hack around it by assuming that things are within a gigabyte.
So this change is a partial revert of a change that broke kexec in existing configurations. To fix a regression that breaks kexec.
I don't have enough information to know which systems broke, so I don't know the difficulty of fixing them.
Dave, do you know what synthetic entries need to be added to the memory maps to fix the regression? If not, we should go with this, and if we care enough we can add the synthetic entries later.
Eric
On 3/22/24 10:31, Eric W. Biederman wrote:
I'd much rather add synthetic entries to the memory maps that have this information than hack around it by assuming that things are within a gigabyte.
So this change is a partial revert of a change that broke kexec in existing configurations. To fix a regression that breaks kexec.
Let's back up for a second:
* Mapping extra memory on UV systems causes halts[1]
* Mapping extra memory on UV systems breaks kexec (this thread)
So we're in a pickle. I understand your concern for kexec. But I'm concerned that fixing the kexec problem will re-expose us to the [1] problem.
Steve, can you explain a bit why this patch doesn't re-expose the kernel to the [1] bug?
1. https://lore.kernel.org/all/20240126164841.170866-1-steve.wahl@hpe.com/
On Fri, Mar 22, 2024 at 10:43:36AM -0700, Dave Hansen wrote:
On 3/22/24 10:40, Dave Hansen wrote:
- Mapping extra memory on UV systems causes halts[1]
- Mapping extra memory on UV systems breaks kexec (this thread)
Sorry, I said that second one backwards:
- _Not_ mapping extra memory on UV systems breaks kexec
Not quite. This is:
* _Not_ mapping extra memory on _non_ UV systems breaks kexec.
Thanks,
--> Steve
On Fri, Mar 22, 2024 at 10:40:37AM -0700, Dave Hansen wrote:
On 3/22/24 10:31, Eric W. Biederman wrote:
I'd much rather add synthetic entries to the memory maps that have this information than hack around it by assuming that things are within a gigabyte.
So this change is a partial revert of a change that broke kexec in existing configurations. To fix a regression that breaks kexec.
Hi, Dave!
Let's back up for a second:
- Mapping extra memory on UV systems causes halts[1]
- Mapping extra memory on UV systems breaks kexec (this thread)
These are the same. The most reliable way to create the problem[1] on UV is a kexec to a kdump kernel, because of the typical placement of the kdump kernel active region with respect to the reserved addresses that cause the halts. (The distros we typically run place the crashkernel just below the highest reserved region, where a gbpage can include both.)
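To put made-up numbers on that (the addresses are purely illustrative, not an actual UV memory layout): suppose a node's reserved area occupies [0x7f0000000, 0x800000000) and the crashkernel is placed just below it at [0x7d0000000, 0x7f0000000). Both ranges fall inside the same 1 GiB-aligned span [0x7c0000000, 0x800000000), so mapping the crashkernel with a 1 GiB page also identity-maps the reserved area, and speculative accesses through that mapping are what trigger the halts. Mapping the crashkernel with exact 2 MiB pages leaves the reserved area out of the page tables entirely.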
What you didn't state here is the third bullet that this patch addresses.
* Neglecting to map extra memory on some (firmware buggy?) non-UV systems breaks kexec.
So we're in a pickle. I understand your concern for kexec. But I'm concerned that fixing the kexec problem will re-expose us to the [1] problem.
Steve, can you explain a bit why this patch doesn't re-expose the kernel to the [1] bug?
This patch still has UV systems avoid gbpages that go far outside actual requested regions, but allows the full gb pages on other systems. On UV systems, the new gbpage algorithm is followed. On non-UV systems, gbpages are allowed even for requests that don't cover a complete gbpage -- essentially the former algorithm but using the new code.
Hope that makes sense.
I would probably consider this buggy firmware, but got enough reports of this regression (from Pavin Joseph, Eric Hagberg, and Sarah Brofeldt, all of whom tested the patch to see if it cured the regression) that it seemed everyone would want it fixed quickly and point fingers later.
In the private debugging exchanges with Pavin, I got some printks of the regions that were mapped, and in one exchange hard-coded the regions not covered on his particular system back into the table; there were four regions left out. I added all four in one patch. I could have dug in further to diagnose which of the missing regions were actually necessary to get kexec to succeed, but I couldn't see what I would do with that information once I had it, as I don't see a way to generalize this to other platforms exhibiting the problem.
Thanks,
--> Steve
On 3/22/24 09:21, Steve Wahl wrote:
Some systems have ACPI tables that don't include everything that needs to be mapped for a successful kexec. These systems rely on identity maps that include the full gigabyte surrounding any smaller region requested for kexec success. Without this, they fail to kexec and end up doing a full firmware reboot.
I'm still missing something here. Which ACPI tables are we talking about? What don't they map? I normally don't think of ACPI _tables_ as "mapping" things.
It seems like there's a theory that some ACPI table isn't mapped, but looking through the discussion so far I don't see a smoking gun. Let's say the kernel has a bug and the kernel was actively not mapping something that it should have mapped. The oversized 1GB mappings made the bug harder to hit. If that's the case, we'll just be adding a hack which papers over the bug instead of fixing it properly.
I'm kind of leaning to say that we should just revert d794734c9bbf and have the UV folks go back to the nogbpages until we get this properly sorted.
@@ -10,6 +10,7 @@ struct x86_mapping_info {
 	unsigned long page_flag;	 /* page flag for PMD or PUD entry */
 	unsigned long offset;		 /* ident mapping offset */
 	bool direct_gbpages;		 /* PUD level 1GB page support */
+	bool direct_gbpages_always;	 /* use 1GB pages exclusively */
 	unsigned long kernpg_flag;	 /* kernel pagetable flag override */
 };
But let's at least talk about this patch in case we decide to go forward with it. We've really got two things:
1. Can the system use gbpages in the first place?
2. Do the gbpages need to be exact (UV) or sloppy (everything else)?
I wouldn't refer to this at all as "always" use gbpages. It's really a be-sloppy-and-paper-over-bugs mode. They might be kernel bugs or firmware bugs, but they're bugs _somewhere_ right?
Dave Hansen dave.hansen@intel.com writes:
On 3/22/24 09:21, Steve Wahl wrote:
Some systems have ACPI tables that don't include everything that needs to be mapped for a successful kexec. These systems rely on identity maps that include the full gigabyte surrounding any smaller region requested for kexec success. Without this, they fail to kexec and end up doing a full firmware reboot.
I'm still missing something here. Which ACPI tables are we talking about? What don't they map? I normally don't think of ACPI _tables_ as "mapping" things.
Either E820 or ACPI lists which areas of memory are present in a machine. Those tables are used to build the identity memory mappings.
Those identity mapped page tables not built with GB pages cause kexec to fail for at least 3 people. Presumably because something using those page tables accesses memory that is not mapped.
It seems like there's a theory that some ACPI table isn't mapped, but looking through the discussion so far I don't see a smoking gun. Let's say the kernel has a bug and the kernel was actively not mapping something that it should have mapped. The oversized 1GB mappings made the bug harder to hit. If that's the case, we'll just be adding a hack which papers over the bug instead of fixing it properly.
I'm kind of leaning to say that we should just revert d794734c9bbf and have the UV folks go back to the nogbpages until we get this properly sorted.
That is exactly what this patch does. It reverts the change except on UV systems.
@@ -10,6 +10,7 @@ struct x86_mapping_info {
 	unsigned long page_flag;	 /* page flag for PMD or PUD entry */
 	unsigned long offset;		 /* ident mapping offset */
 	bool direct_gbpages;		 /* PUD level 1GB page support */
+	bool direct_gbpages_always;	 /* use 1GB pages exclusively */
 	unsigned long kernpg_flag;	 /* kernel pagetable flag override */
 };
But let's at least talk about this patch in case we decide to go forward with it. We've really got two things:
- Can the system use gbpages in the first place?
- Do the gbpages need to be exact (UV) or sloppy (everything else)?
I wouldn't refer to this at all as "always" use gbpages. It's really a be-sloppy-and-paper-over-bugs mode. They might be kernel bugs or firmware bugs, but they're bugs _somewhere_ right?
Is it?
As far as I can tell the UV mode is a "be exact and avoid cpu bugs" mode.
My sense is that using GB pages for everything (when we want an identity mapping) should be much cheaper TLB wise, so we probably want to use GB pages for everything if we can.
Personally I'd rather turn off the page tables entirely for kexec, but that is not an option on x86_64.
Eric
On 3/23/24 21:45, Eric W. Biederman wrote:
Dave Hansen dave.hansen@intel.com writes:
On 3/22/24 09:21, Steve Wahl wrote:
Some systems have ACPI tables that don't include everything that needs to be mapped for a successful kexec. These systems rely on identity maps that include the full gigabyte surrounding any smaller region requested for kexec success. Without this, they fail to kexec and end up doing a full firmware reboot.
I'm still missing something here. Which ACPI tables are we talking about? What don't they map? I normally don't think of ACPI _tables_ as "mapping" things.
Either E820 or ACPI lists which areas of memory are present in a machine. Those tables are used to build the identity memory mappings.
Those identity mapped page tables not built with GB pages cause kexec to fail for at least 3 people. Presumably because something using those page tables accesses memory that is not mapped.
But why is it not mapped? Are the firmware-provided memory maps inaccurate? Or did the kernel read those maps and then forget to map something?
Using GB pages could paper over either class of bug.
It seems like there's a theory that some ACPI table isn't mapped, but looking through the discussion so far I don't see a smoking gun. Let's say the kernel has a bug and the kernel was actively not mapping something that it should have mapped. The oversized 1GB mappings made the bug harder to hit. If that's the case, we'll just be adding a hack which papers over the bug instead of fixing it properly.
I'm kind of leaning to say that we should just revert d794734c9bbf and have the UV folks go back to the nogbpages until we get this properly sorted.
That is exactly what this patch does. It reverts the change except on UV systems.
Maybe it's splitting hairs, but I see a difference between reverting the _commit_ and adding new code that tries to revert the commit's behavior.
I think reverting the commit is more conservative and that's what I was referring to.
@@ -10,6 +10,7 @@ struct x86_mapping_info {
 	unsigned long page_flag;	 /* page flag for PMD or PUD entry */
 	unsigned long offset;		 /* ident mapping offset */
 	bool direct_gbpages;		 /* PUD level 1GB page support */
+	bool direct_gbpages_always;	 /* use 1GB pages exclusively */
 	unsigned long kernpg_flag;	 /* kernel pagetable flag override */
 };
But let's at least talk about this patch in case we decide to go forward with it. We've really got two things:
- Can the system use gbpages in the first place?
- Do the gbpages need to be exact (UV) or sloppy (everything else)?
I wouldn't refer to this at all as "always" use gbpages. It's really a be-sloppy-and-paper-over-bugs mode. They might be kernel bugs or firmware bugs, but they're bugs _somewhere_ right?
Is it?
As far as I can tell the UV mode is a "be exact and avoid cpu bugs" mode.
The fact is that there are parts of the physical address space that have read side effects. If you want to have them mapped, you need to use a mapping type where speculative accesses won't occur (like UC).
I don't really think these are CPU bugs. They're just a fact of life.
My sense is that using GB pages for everything (when we want an identity mapping) should be much cheaper TLB wise, so we probably want to use GB pages for everything if we can.
Sure. But the "if we can" situation is where the physical address space is uniform underneath that GB page.
It's not at all uncommon to have those goofy, undesirable read side-effects. We've had several issues around them over the years. You really can't just map random physical memory and hope for the best.
That means that you are limited to mapping memory that you *know* is uniform, like "all RAM" or "all PMEM".
I understand the original has been reverted, and that my way forward is likely to combine the original patch and the fix into a combined patch (possibly a set). I still think some of this conversation will be valuable to the creation of that patch, so I am continuing to reply. More below.
On Fri, Mar 22, 2024 at 04:29:27PM -0700, Dave Hansen wrote:
On 3/22/24 09:21, Steve Wahl wrote:
Some systems have ACPI tables that don't include everything that needs to be mapped for a successful kexec. These systems rely on identity maps that include the full gigabyte surrounding any smaller region requested for kexec success. Without this, they fail to kexec and end up doing a full firmware reboot.
I'm still missing something here. Which ACPI tables are we talking about? What don't they map? I normally don't think of ACPI _tables_ as "mapping" things.
I'm referring to the memory areas that are mapped in machine_kexec_64.c in the function map_acpi_tables. These appear to be e820 table entries that have a type of E820_TYPE_ACPI, which the kernel marks as IORES_DESC_ACPI_TABLES, or a type of E820_TYPE_NVS, which the kernel marks as IORES_DESC_ACPI_NV_STORAGE.
The name of the function that maps them is why I refer to them as ACPI tables. Sorry if that is inaccurate.
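For anyone who wants to see which regions those are on a given box, the resource names the kernel gives them show up in /proc/iomem as "ACPI Tables" and "ACPI Non-volatile Storage", so a quick user-space check (sketch below; run it as root, otherwise the addresses read back as zeros) is just to grep for those entries:

/* Print the iomem entries that map_acpi_tables() would identity-map. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/iomem", "r");
	char line[256];

	if (!f) {
		perror("/proc/iomem");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, "ACPI Tables") ||
		    strstr(line, "ACPI Non-volatile Storage"))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}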
It seems like there's a theory that some ACPI table isn't mapped, but looking through the discussion so far I don't see a smoking gun.
I think I'm saying more that the ACPI table doesn't list everything that needs to be mapped, not that the table itself isn't mapped. Not sure if that changes your picture or not.
My debugging exchanges with Pavin showed that the regions mapped within the map_acpi_tables function were the ones that left uncovered holes in the identity map if you don't overshoot what's requested by using full gbpages for everything.
For his system only, I manually added hardcoded regions corresponding to the holes that got left by using 2M pages instead of GB pages, and kexec succeeded.
Having the list of holes not covered (IIRC, four of them), I could have pursued which particular holes cause kexec to fail, but I did not because I couldn't think of a way to make use of that information. Even knowing which additional addresses need coverage for this particular machine, I have no way of knowing what is in those regions, nor how to generalize to what is needed on other machines.
Let's say the kernel has a bug and the kernel was actively not mapping something that it should have mapped. The oversized 1GB mappings made the bug harder to hit. If that's the case, we'll just be adding a hack which papers over the bug instead of fixing it properly.
I hope you agree that by reverting, we have now papered over that bug, just in a different way.
If a patch that leaves this papered over except for UV systems won't be acceptable -- that's what I intend to do, just combining my two patches -- please let me know what my way forward should be.
I'm kind of leaning to say that we should just revert d794734c9bbf and have the UV folks go back to the nogbpages until we get this properly sorted.
Being larger-memory systems (for example, 32-socket Sapphire Rapids systems with a full set of RAM on each socket), UV systems probably suffer the most from needing an extra 4K page per GiB to create the identity map.
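For a rough sense of scale (32 TiB here is just an example figure, not a particular machine): with 2 MiB mappings, each GiB of identity map needs its own 4 KiB PMD page, so 32 TiB of RAM costs 32 * 1024 * 4 KiB = 128 MiB of page tables, on top of the PUD/PGD pages. With 1 GiB mappings the same span only needs one 4 KiB PUD page per 512 GiB, i.e. 64 * 4 KiB = 256 KiB.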
@@ -10,6 +10,7 @@ struct x86_mapping_info {
 	unsigned long page_flag;	 /* page flag for PMD or PUD entry */
 	unsigned long offset;		 /* ident mapping offset */
 	bool direct_gbpages;		 /* PUD level 1GB page support */
+	bool direct_gbpages_always;	 /* use 1GB pages exclusively */
 	unsigned long kernpg_flag;	 /* kernel pagetable flag override */
 };
But let's at least talk about this patch in case we decide to go forward with it. We've really got two things:
- Can the system use gbpages in the first place?
- Do the gbpages need to be exact (UV) or sloppy (everything else)?
I wouldn't refer to this at all as "always" use gbpages. It's really a be-sloppy-and-paper-over-bugs mode. They might be kernel bugs or firmware bugs, but they're bugs _somewhere_ right?
Do you have a concise suggestion of what you'd call it? I could use *_sloppy if you'd like, but I don't care much for the way that reads.
Thanks for your time,
--> Steve Wahl
* Steve Wahl steve.wahl@hpe.com wrote:
Some systems have ACPI tables that don't include everything that needs to be mapped for a successful kexec. These systems rely on identity maps that include the full gigabyte surrounding any smaller region requested for kexec success. Without this, they fail to kexec and end up doing a full firmware reboot.
So, reduce the use of GB pages only on systems where this is known to be necessary (specifically, UV systems).
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Fixes: d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
Reported-by: Pavin Joseph <me@pavinjoseph.com>
Sigh, why was d794734c9bbf marked for a -stable backport? The commit never explains ...
If it's broken, it should be reverted - instead of trying to partially revert and then maybe break some other systems.
When there's boot breakage with new patches, we back out the bad patch and re-try in 99.9% of the cases.
Thanks,
Ingo
On Sun, Mar 24, 2024 at 11:31:39AM +0100, Ingo Molnar wrote:
- Steve Wahl steve.wahl@hpe.com wrote:
Some systems have ACPI tables that don't include everything that needs to be mapped for a successful kexec. These systems rely on identity maps that include the full gigabyte surrounding any smaller region requested for kexec success. Without this, they fail to kexec and end up doing a full firmware reboot.
So, reduce the use of GB pages only on systems where this is known to be necessary (specifically, UV systems).
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Fixes: d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
Reported-by: Pavin Joseph <me@pavinjoseph.com>
Sigh, why was d794734c9bbf marked for a -stable backport? The commit never explains ...
I will try to explain, since Steve is offline. That commit fixes a legitimate bug where more address range is mapped (1G) than the requested address range. The fix avoids the issue of the cpu speculatively loading beyond the requested range, which includes speculative loads from reserved memory. That is why it was marked for -stable.
If it's broken, it should be reverted - instead of trying to partially revert and then maybe break some other systems.
Three people reported that mapping only the correct address range caused problems on their platforms. https://lore.kernel.org/all/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@pavinjoseph... Steve and several people helped debug the issue. The commit itself looks correct, but the correct behavior causes some side effect on a few platforms. Some memory ends up not being mapped, but it is not clear if that is due to some other bug, such as the BIOS not accurately providing the right memory map or some other kernel code path not mapping what it should. The 1G mapping covers up that type of issue.
Steve's second patch was to not break those platforms while leaving the fix in place on the platform where the original mapping problem was detected (the UV platform).
When there's boot breakage with new patches, we back out the bad patch and re-try in 99.9% of the cases.
Steve can certainly merge his two patches and resubmit, to replace the reverted original patch. He should be on in the morning to speak for himself.
Thanks
* Russ Anderson rja@hpe.com wrote:
On Sun, Mar 24, 2024 at 11:31:39AM +0100, Ingo Molnar wrote:
- Steve Wahl steve.wahl@hpe.com wrote:
Some systems have ACPI tables that don't include everything that needs to be mapped for a successful kexec. These systems rely on identity maps that include the full gigabyte surrounding any smaller region requested for kexec success. Without this, they fail to kexec and end up doing a full firmware reboot.
So, reduce the use of GB pages only on systems where this is known to be necessary (specifically, UV systems).
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Fixes: d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
Reported-by: Pavin Joseph <me@pavinjoseph.com>
Sigh, why was d794734c9bbf marked for a -stable backport? The commit never explains ...
I will try to explain, since Steve is offline. That commit fixes a legitimate bug where more address range is mapped (1G) than the requested address range.
If a change regresses on certain machines then it's not a bug fix anymore, it's a regression. End of story.
The fix avoids the issue of the cpu speculatively loading beyond the requested range, which includes speculative loads from reserved memory. That is why it was marked for -stable.
And this regression is why more complicated fixes in this area should not be forwarded to -stable before they have been merged upstream and exposed a bit more. Please keep that in mind for future iterations.
If it's broken, it should be reverted - instead of trying to partially revert and then maybe break some other systems.
Three people reported that mapping only the correct address range caused problems on their platforms. https://lore.kernel.org/all/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@pavinjoseph... Steve and several people helped debug the issue. The commit itself looks correct but the correct behavior causes some side effect on a few platforms.
That's all fine and the effort is much appreciated - but we should not try to whitewash a regression: if there's a couple of reports in such a short time already, then the regression is significant.
Anyway, I've reverted this in tip:x86/urgent:
c567f2948f57 Revert "x86/mm/ident_map: Use gbpages only where full GB page should be mapped."
we can iterate from there again. Please post future patches against that tree.
Note that this is just the regular development process: regressions happen, and this is how we handle them a lot of the time in this area - we back out the breakage, then try again.
Some memory ends up not being mapped, but it is not clear if that is due to some other bug, such as the BIOS not accurately providing the right memory map or some other kernel code path not mapping what it should. The 1G mapping covers up that type of issue.
Steve's second patch was to not break those platforms while leaving the fix in place on the platform where the original mapping problem was detected (the UV platform).
When there's boot breakage with new patches, we back out the bad patch and re-try in 99.9% of the cases.
Steve can certainly merge his two patches and resubmit, to replace the reverted original patch. He should be on in the morning to speak for himself.
Thank you!
Ingo
On Mon, Mar 25, 2024 at 6:58 AM Ingo Molnar mingo@kernel.org wrote:
Anyway, I've reverted this in tip:x86/urgent:
c567f2948f57 Revert "x86/mm/ident_map: Use gbpages only where full GB page should be mapped."
I see that this hasn't been reverted in the longterm branches it made it into already (6.1.x and 6.6.x, for example) - is it expected to be reverted there as well? I'd think it should be, until this is all sorted out.
On Fri, Apr 05, 2024 at 09:13:36AM -0400, Eric Hagberg wrote:
On Mon, Mar 25, 2024 at 6:58 AM Ingo Molnar mingo@kernel.org wrote:
Anyway, I've reverted this in tip:x86/urgent:
c567f2948f57 Revert "x86/mm/ident_map: Use gbpages only where full GB page should be mapped."
I see that this hasn't been reverted in the longterm branches it made it into already (6.1.x and 6.6.x, for example) - is it expected to be reverted there as well? I'd think it should be, until this is all sorted out.
The revert is queued up for the next round of stable updates.
thanks,
greg k-h
Russ Anderson rja@hpe.com writes:
On Sun, Mar 24, 2024 at 11:31:39AM +0100, Ingo Molnar wrote:
- Steve Wahl steve.wahl@hpe.com wrote:
Some systems have ACPI tables that don't include everything that needs to be mapped for a successful kexec. These systems rely on identity maps that include the full gigabyte surrounding any smaller region requested for kexec success. Without this, they fail to kexec and end up doing a full firmware reboot.
So, reduce the use of GB pages only on systems where this is known to be necessary (specifically, UV systems).
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Fixes: d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
Reported-by: Pavin Joseph <me@pavinjoseph.com>
Sigh, why was d794734c9bbf marked for a -stable backport? The commit never explains ...
I will try to explain, since Steve is offline. That commit fixes a legitimate bug where more address range is mapped (1G) than the requested address range. The fix avoids the issue of the cpu speculatively loading beyond the requested range, which includes speculative loads from reserved memory. That is why it was marked for -stable.
To call that a bug presumes that the memory type range registers were not set up properly by the boot firmware.
I think I saw something indicating that the existence of memory type range registers is changing/has changed in recent cpus, but historically it has been the job of the memory type range registers to ensure that the attributes of specific addresses are correct.
The memory attributes should guide the speculation.
To depend upon page tables to ensure the attributes are correct would presumably require a cpu that does not have support for disabling page tables in 32bit mode and does not have 16bit mode.
On older systems (I haven't looked lately) I have seen all kinds of oddities in the descriptions of memory. Like not describing the memory at address 0 where the real mode IDT lives. So I am not at all certain any firmware information can be depended upon or reasonably expected to be complete. For a while there was no concept of firmware memory areas, so on some older systems it was actually required for there to be gaps in the description of memory provided to the system, so that operating systems would not touch memory used by the firmware.
Which definitely means in the case of kexec there are legitimate reasons to access memory areas that are well known but have not always been described by the boot firmware. So the assertion that it is necessarily a firmware bug to not describe all of memory is at least historically incorrect on x86_64.
There may be different requirements for the kexec identity map and the ordinary kernel boot memory map, and that is something we can reasonably explore as we look at solutions.
Some memory ends up not being mapped, but it is not clear if that is due to some other bug, such as the BIOS not accurately providing the right memory map or some other kernel code path not mapping what it should.
The 1G mapping covers up that type of issue.
I have seen this assertion repeated several times, and at least historically on x86_64 it is most definitely false. The E820 map, which was the primary information source for a long time, could not describe all of memory, so depending upon it to be complete is erroneous.
When there's boot breakage with new patches, we back out the bad patch and re-try in 99.9% of the cases.
Steve can certainly merge his two patches and resubmit, to replace the reverted original patch. He should be on in the morning to speak for himself.
I am going to push back and suggest that this is perhaps a bug in the HPE UV systems firmware not setting up the cpus memory type range registers correctly.
Unless those systems are using new fangled cpus that don't have 16bit and 32bit support, and don't implement memory type range registers, I don't see how something that only affects HPE UV systems could be anything except an HPE UV specific bug.
Eric
On Mon, Mar 25, 2024 at 10:04:41AM -0500, Eric W. Biederman wrote:
Russ Anderson rja@hpe.com writes:
Steve can certainly merge his two patches and resubmit, to replace the reverted original patch. He should be on in the morning to speak for himself.
I am going to push back and suggest that this is perhaps a bug in the HPE UV systems firmware not setting up the cpus memory type range registers correctly.
Unless those systems are using new fangled cpus that don't have 16bit and 32bit support, and don't implement memory type range registers, I don't see how something that only affects HPE UV systems could be anything except an HPE UV specific bug.
Eric,
I took the time to communicate with others in the company who know this stuff better than I do before replying on this.
One of the problems with using the MTRRs for this is that there are simply not enough of them. The MTRRs size/alignment requirements mean that more than one entry would be required per reserved region, and we need one reserved region per socket on systems that currently can go up to 32 sockets. (In case you would think to ask, the reserved regions also cannot be made contiguous.)
So MTRRs will not work to keep speculation out of our reserved memory regions.
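To make the counting argument concrete (the region below is hypothetical, not HPE's actual reserved-region layout): variable MTRRs must be power-of-two sized and naturally aligned, so one arbitrarily placed reserved region decomposes into several entries, and with one such region per socket on a 32-socket system the handful of variable MTRRs a CPU provides is exhausted many times over. A small program to do the decomposition:

/* Count the naturally aligned, power-of-two chunks (i.e. variable MTRR
 * entries) needed to cover one arbitrarily placed region.
 */
#include <stdio.h>

static unsigned long lowbit(unsigned long x)
{
	return x & -x;			/* largest power of two dividing x */
}

static unsigned long highbit(unsigned long x)
{
	unsigned long b = 1;

	while (b <= x / 2)
		b <<= 1;		/* largest power of two <= x */
	return b;
}

int main(void)
{
	/* hypothetical: 768 MiB reserved, starting on a 64 MiB boundary */
	unsigned long start = 0x7cc000000UL;
	unsigned long len = 0x30000000UL;
	int entries = 0;

	while (len) {
		unsigned long chunk = lowbit(start);

		if (chunk > highbit(len))
			chunk = highbit(len);
		printf("entry %d: base %#lx size %#lx\n", entries, start, chunk);
		start += chunk;
		len -= chunk;
		entries++;
	}
	printf("%d MTRR entries for one region\n", entries);
	return 0;
}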
Let me know if you need more information from us on this.
Thanks.
--> Steve Wahl
Steve Wahl steve.wahl@hpe.com writes:
On Mon, Mar 25, 2024 at 10:04:41AM -0500, Eric W. Biederman wrote:
Russ Anderson rja@hpe.com writes:
Steve can certainly merge his two patches and resubmit, to replace the reverted original patch. He should be on in the morning to speak for himself.
I am going to push back and suggest that this is perhaps a bug in the HPE UV systems firmware not setting up the cpus memory type range registers correctly.
Unless those systems are using new fangled cpus that don't have 16bit and 32bit support, and don't implement memory type range registers, I don't see how something that only affects HPE UV systems could be anything except an HPE UV specific bug.
Eric,
I took the time to communicate with others in the company who know this stuff better than I do before replying on this.
One of the problems with using the MTRRs for this is that there are simply not enough of them. The MTRRs size/alignment requirements mean that more than one entry would be required per reserved region, and we need one reserved region per socket on systems that currently can go up to 32 sockets. (In case you would think to ask, the reserved regions also cannot be made contiguous.)
So MTRRs will not work to keep speculation out of our reserved memory regions.
Let me know if you need more information from us on this.
Thanks for this.
Do you know if there are enough MTRRs for the first 4GB?
I am curious if kexec should even consider going into 32bit mode without page tables or even into 16bit mode on such a system. Or if such a system will always require using page tables.
If you don't have enough MTRRs on a big NUMA system, I think it is perfectly understandable to need to use the page tables.
Please include the fact that splitting GB pages is necessary because of a lack of MTRRs in the change description.
Given that it is the lack of MTRRs on a large NUMA system that makes the change necessary, this goes from a pure bug fix to a change to accommodate systems without enough MTRRs.
That information makes it more understandable why older systems (at least in the case of kexec) might not be OK with the change. On older systems the MTRRs are sufficient, so they can use fewer page table entries, allowing for the use of larger TLB entries.
Eric
On Wed, Mar 27, 2024 at 07:57:52AM -0500, Eric W. Biederman wrote:
Steve Wahl steve.wahl@hpe.com writes:
On Mon, Mar 25, 2024 at 10:04:41AM -0500, Eric W. Biederman wrote:
Russ Anderson rja@hpe.com writes:
Steve can certainly merge his two patches and resubmit, to replace the reverted original patch. He should be on in the morning to speak for himself.
I am going to push back and suggest that this is perhaps a bug in the HPE UV systems firmware not setting up the cpus memory type range registers correctly.
Unless those systems are using new fangled cpus that don't have 16bit and 32bit support, and don't implement memory type range registers, I don't see how something that only affects HPE UV systems could be anything except an HPE UV specific bug.
Eric,
I took the time to communicate with others in the company who know this stuff better than I do before replying on this.
One of the problems with using the MTRRs for this is that there are simply not enough of them. The MTRRs size/alignment requirements mean that more than one entry would be required per reserved region, and we need one reserved region per socket on systems that currently can go up to 32 sockets. (In case you would think to ask, the reserved regions also cannot be made contiguous.)
So MTRRs will not work to keep speculation out of our reserved memory regions.
Let me know if you need more information from us on this.
Thanks for this.
Do you know if there are enough MTRRs for the first 4GB?
I don't personally know all the details of how BIOS chooses to place things, but I suspect that might be true. The restricted spaces usually end up at the end of the address range for a particular node, and 4GB would be in the early part of node 0. If the conversation develops further along these lines, I can find out more definitively.
I am curious if kexec should even consider going into 32bit mode without page tables or even into 16bit mode on such a system. Or if such a system will always require using page tables.
Unless I'm mistaken, wouldn't that put a pretty heavy restriction on where the kdump kernel could be located? Or the target region for KASLR?
If you don't have enough MTRRs on a big NUMA system, I think it is perfectly understandable to need to use the page tables.
Please include the fact that splitting GB pages is necessary because of a lack of MTRRs in the change description.
OK.
Given that it is the lack of MTRRs on a large NUMA system that makes the change necessary, this goes from a pure bug fix to a change to accommodate systems without enough MTRRs.
That information makes it more understandable why older systems (at least in the case of kexec) might not be OK with the change. On older systems the MTRRs are sufficient, so they can use fewer page table entries, allowing for the use of larger TLB entries.
That last paragraph doesn't match what I think is happening.
At least from my point of view, that some systems aren't OK with the change has nothing to do with MTRRs or TLB page size. They simply require the extra "slop" of GB pages, implicitly adding a full GB of space around any smaller space requested by map_acpi_tables().
The systems that failed with my original change also failed on earlier kernels when nogbpages was added to the kernel command line. That creates the identity map using 2M pages for everything, with no GB page "slop". I'm pretty sure these systems will continue to fail with "nogbpages" enabled.
For one debug-kernel cycle on Pavin's system I added in hard-coded requests to explicitly add back in the areas that not being sloppy had excluded, and that brought kexec back to functioning; which further proves my point.
I wanted to be sure you understood this in case it has any effect on what you think should be done.
--> Steve
Steve Wahl steve.wahl@hpe.com writes:
On Wed, Mar 27, 2024 at 07:57:52AM -0500, Eric W. Biederman wrote:
Steve Wahl steve.wahl@hpe.com writes:
On Mon, Mar 25, 2024 at 10:04:41AM -0500, Eric W. Biederman wrote:
Russ Anderson rja@hpe.com writes:
Steve can certainly merge his two patches and resubmit, to replace the reverted original patch. He should be on in the morning to speak for himself.
I am going to push back and suggest that this is perhaps a bug in the HPE UV systems firmware not setting up the cpus memory type range registers correctly.
Unless those systems are using new fangled cpus that don't have 16bit and 32bit support, and don't implement memory type range registers, I don't see how something that only affects HPE UV systems could be anything except an HPE UV specific bug.
Eric,
I took the time to communicate with others in the company who know this stuff better than I do before replying on this.
One of the problems with using the MTRRs for this is that there are simply not enough of them. The MTRRs size/alignment requirements mean that more than one entry would be required per reserved region, and we need one reserved region per socket on systems that currently can go up to 32 sockets. (In case you would think to ask, the reserved regions also cannot be made contiguous.)
So MTRRs will not work to keep speculation out of our reserved memory regions.
Let me know if you need more information from us on this.
Thanks for this.
Do you know if there are enough MTRRs for the first 4GB?
I don't personally know all the details of how BIOS chooses to place things, but I suspect that might be true. The restricted spaces usually end up at the end of the address range for a particular node, and 4GB would be in the early part of node 0. If the conversation develops further along these lines, I can find out more definitively.
I am curious if kexec should even consider going into 32bit mode without page tables or even into 16bit mode on such a system. Or if such a system will always require using page tables.
Unless I'm mistaken, wouldn't that put a pretty heavy restriction on where the kdump kernel could be located?
If you are coming from 64bit EFI it adds restrictions.
Most of my experience involves systems using a real mode BIOS and folks thought I was strange for wanting to be able to load the kernel above 4GB.
Having that experience, I am stuck wondering how all of the weird backwards compatibility cases are going to work. Hmm.
There is one concrete case where it matters that I think still exists.
x86_64 processors start up in 16bit real mode, then have to transition through 32bit protected mode, before transitioning to 64bit protected mode. Only in 64bit protected mode are page tables enabled.
All this happens during early kernel startup when the bootstrap processor sends STARTUP IPIs to all of the secondary processors.
The startup IPI lets you pick where in the first 1MiB the secondary processors will start.
Assuming there isn't a new processor startup sequence on your cpus, speculation before the processor loads its first page table is a legitimate concern.
Or the target region for KASLR?
As I recall the kernel is limited to the last 2GB of the virtual address space, as parts of the instruction set assume sign-extended 32-bit addressing.
If you don't have enough MTRRs on a big NUMA system, I think it is perfectly understandable to need to use the page tables.
Please include the fact that splitting GB pages is necessary because of a lack of MTRRs in the change description.
OK.
Given that it is the lack of MTRRs on a large NUMA system that makes the change necessary, this goes from a pure bug fix to a change to accommodate systems without enough MTRRs.
That information makes it more understandable why older systems (at least in the case of kexec) might not be OK with the change. On older systems the MTRRs are sufficient, so they can use fewer page table entries, allowing for the use of larger TLB entries.
That last paragraph doesn't match what I think is happening.
At least from my point of view, that some systems aren't OK with the change has nothing to do with MTRRs or TLB page size. They simply require the extra "slop" of GB pages, implicitly adding a full GB of space around any smaller space requested by map_acpi_tables().
The systems that failed with my original change also failed on earlier kernels when nogbpages was added to the kernel command line. That creates the identity map using 2M pages for everything, with no GB page "slop". I'm pretty sure these systems will continue to fail with "nogbpages" enabled.
For one debug-kernel cycle on Pavin's system I added in hard-coded requests to explicitly add back in the areas that not being sloppy had excluded, and that brought kexec back to functioning; which further proves my point.
I wanted to be sure you understood this in case it has any effect on what you think should be done.
Sort of.
What kexec wants of an identity mapped page table really is to simulate disabling paging altogether. There isn't enough memory in most systems to identity map the entire 48bit or 52bit physical address space so some compromises have to be made. I seem to recall only mapping up to maxpfn, and using 1GB pages when I originally wrote the code. It was later refactored to share the identity map page table building code with the rest of the kernel.
When you changed the page tables not to map everything, strictly speaking you created an ABI break of the kexec ABI.
Which is a long way of saying it isn't being sloppy it is deliberate, and that the problem from my perspective is that things have become too fine grained, too optimized.
Pavin's testing definitely proves the issue was not mapping enough pages; it is nice that we have that confirmation.
From my perspective the entire reason for wanting to be fine grained and precise in the kernel memory map is because the UV systems don't have enough MTRRs. So you have to depend upon the cache-ability attributes for specific addresses of memory coming from the page tables instead of from the MTRRs.
If you had enough MTRRs, defining the page tables to be precisely what is necessary would simply be an exercise in reducing kernel performance, because it is more efficient, in both page table size and TLB usage, to use 1GB pages instead of whatever smaller pages you have to use for oddball regions.
For systems without enough MTRRs the small performance hit in paging performance is the necessary trade off.
At least that is my perspective. Does that make sense?
Eric
On Thu, Mar 28, 2024 at 12:05:02AM -0500, Eric W. Biederman wrote:
Steve Wahl steve.wahl@hpe.com writes:
On Wed, Mar 27, 2024 at 07:57:52AM -0500, Eric W. Biederman wrote:
Steve Wahl steve.wahl@hpe.com writes:
On Mon, Mar 25, 2024 at 10:04:41AM -0500, Eric W. Biederman wrote:
Russ Anderson rja@hpe.com writes:
Steve can certainly merge his two patches and resubmit, to replace the reverted original patch. He should be on in the morning to speak for himself.
I am going to push back and suggest that this is perhaps a bug in the HPE UV systems firmware not setting up the cpus memory type range registers correctly.
Unless those systems are using new fangled cpus that don't have 16bit and 32bit support, and don't implement memory type range registers, I don't see how something that only affects HPE UV systems could be anything except an HPE UV specific bug.
Eric,
I took the time to communicate with others in the company who know this stuff better than I do before replying on this.
One of the problems with using the MTRRs for this is that there are simply not enough of them. The MTRRs size/alignment requirements mean that more than one entry would be required per reserved region, and we need one reserved region per socket on systems that currently can go up to 32 sockets. (In case you would think to ask, the reserved regions also cannot be made contiguous.)
So MTRRs will not work to keep speculation out of our reserved memory regions.
Let me know if you need more information from us on this.
Thanks for this.
Do you know if there are enough MTRRs for the first 4GB?
I don't personally know all the details of how BIOS chooses to place things, but I suspect that might be true. The restricted spaces usually end up at the end of the address range for a particular node, and 4GB would be in the early part of node 0. If the conversation develops further along these lines, I can find out more definitively.
I am curious if kexec should even consider going into 32bit mode without page tables or even into 16bit mode on such a system. Or if such a system will always require using page tables.
Unless I'm mistaken, wouldn't that put a pretty heavy restriction on where the kdump kernel could be located?
If you are coming from 64bit EFI it adds restrictions.
We are. :-)
Most of my experience involves systems using a real mode BIOS and folks thought I was strange for wanting to be able to load the kernel above 4GB.
Having that experience, I am stuck wondering how all of the weird backwards compatibility cases are going to work. Hmm.
There is one concrete case where it matters that I think still exists.
x86_64 processors start up in 16bit real mode, then have to transition through 32bit protected mode, before transitioning to 64bit protected mode. Only in 64bit protected mode are page tables enabled.
All this happens during early kernel startup when the bootstrap processor sends STARTUP IPIs to all of the secondary processors.
The startup IPI lets you pick where in the first 1MiB the secondary processors will start.
Assuming there isn't a new processor startup sequence on your cpus, speculation before the processor loads its first page table is a legitimate concern.
I believe the reserved memory that is problematic is at the end of each socket's (NUMA node's) address space. You have to get to 64 bit execution before you can reach addresses outside of the first 4GB of space, I think. External hardware uses this RAM; the processors are not to access it at all. MTRRs don't exactly have an entry type to match this, at least from the document skimming I've done. (I have a limited understanding, but I think this reserved space is used by our hardware to keep track of cache line ownership for the rest of the RAM, so letting any other entity take even a read claim on these addresses is a problem, in a catch-22 or circular reference sort of way.)
Or the target region for KASLR?
As I recall the kernel is limited to the last 2GB of the virtual address space, as parts of the instruction set assume sign-extended 32-bit addressing.
From what I recall, KASLR varies both the virtual and physical addresses, and it's the physical that's of concern here.
arch/x86/boot/compressed/kaslr.c: "In theory, KASLR can put the kernel anywhere in the range of [16M, MAXMEM) on 64-bit..."
I had to make a change in that area a few years ago for similar reasons:
1869dbe87cb94d x86/boot/64: Round memory hole size up to next PMD page
If you don't have enough MTRRs on a big NUMA system, I think it is perfectly understandable to need to use the page tables.
Please include the fact that splitting GB pages is necessary because of a lack of MTRRs in the change description.
OK.
Given that it is the lack of MTRRs on a large NUMA system that makes the change necessary, this goes from a pure bug fix to a change to accommodate systems without enough MTRRs.
That information makes it more understandable why older systems (at least in the case of kexec) might not be OK with the change. On older systems the MTRRs are sufficient, so they can use fewer page table entries, allowing for the use of larger TLB entries.
That last paragraph doesn't match what I think is happening.
At least from my point of view, that some systems aren't OK with the change has nothing to do with MTRRs or TLB page size. They simply require the extra "slop" of GB pages, implicitly adding a full GB of space around any smaller space requested by map_acpi_tables().
The systems that failed with my original change also failed on earlier kernels when nogbpages was added to the kernel command line. That creates the identity map using 2M pages for everything, with no GB page "slop". I'm pretty sure these systems will continue to fail with "nogbpages" enabled.
For one debug-kernel cycle on Pavin's system I added in hard-coded requests to explicitly add back in the areas that not being sloppy had excluded, and that brought kexec back to functioning; which further proves my point.
I wanted to be sure you understood this in case it has any effect on what you think should be done.
Sort of.
What kexec wants of an identity mapped page table really is to simulate disabling paging altogether. There isn't enough memory in most systems to identity map the entire 48bit or 52bit physical address space so some compromises have to be made. I seem to recall only mapping up to maxpfn, and using 1GB pages when I originally wrote the code. It was later refactored to share the identity map page table building code with the rest of the kernel.
When you changed the page tables not to map everything, strictly speaking you created an ABI break of the kexec ABI.
Which is a long way of saying it isn't being sloppy it is deliberate, and that the problem from my perspective is that things have become too fine grained, too optimized.
Pavin's testing definitely proves the issue was not mapping enough pages; it is nice that we have that confirmation.
From my perspective the entire reason for wanting to be fine grained and precise in the kernel memory map is because the UV systems don't have enough MTRRs. So you have to depend upon the cache-ability attributes for specific addresses of memory coming from the page tables instead of from the MTRRs.
It would be more accurate to say we depend upon the addresses not being listed in the page tables at all. We'd be OK with mapped but not accessed, if it weren't for processor speculation. There's no "no access" setting within the existing MTRR definitions, though there may be a setting that would rein in processor speculation enough to make do.
If you had enough MTRRs, defining the page tables to be precisely what is necessary would simply be an exercise in reducing kernel performance, because it is more efficient, in both page table size and TLB usage, to use 1GB pages instead of whatever smaller pages you have to use for oddball regions.
For systems without enough MTRRs the small performance hit in paging performance is the necessary trade off.
At least that is my perspective. Does that make sense?
I think I'm beginning to get your perspective. From your point of view, is kexec failing with "nogbpages" set a bug? My point of view is it likely is. I think your view would say it isn't?
--> Steve
Steve Wahl steve.wahl@hpe.com writes:
On Thu, Mar 28, 2024 at 12:05:02AM -0500, Eric W. Biederman wrote:
From my perspective the entire reason for wanting to be fine grained and precise in the kernel memory map is because the UV systems don't have enough MTRRs. So you have to depend upon the cache-ability attributes for specific addresses of memory coming from the page tables instead of from the MTRRs.
It would be more accurate to say we depend upon the addresses not being listed in the page tables at all. We'd be OK with mapped but not accessed, if it weren't for processor speculation. There's no "no access" setting within the existing MTRR definitions, though there may be a setting that would rein in processor speculation enough to make do.
The uncached setting and the write-combining settings that are used for I/O are required to disable speculation for any regions so marked. Any reads or writes to a memory mapped I/O region can result in hardware processing it as a command, which as I understand it is exactly the problem with UV systems.
Frankly not mapping an I/O region (in an identity mapped page table) instead of properly mapping it as it would need to be mapped for performing I/O seems like a bit of a bug.
If you had enough MTRRs, defining the page tables to be precisely what is necessary would simply be an exercise in reducing kernel performance, because it is more efficient, in both page table size and TLB usage, to use 1GB pages instead of whatever smaller pages you have to use for oddball regions.
For systems without enough MTRRs the small performance hit in paging performance is the necessary trade off.
At least that is my perspective. Does that make sense?
I think I'm beginning to get your perspective. From your point of view, is kexec failing with "nogbpages" set a bug? My point of view is it likely is. I think your view would say it isn't?
I would say it is a bug.
Part of the bug is someone yet again taking something simple that kexec is doing and reworking it to use generic code, then changing the generic code to do something different from what kexec needs and then being surprised that kexec stops working.
The interface kexec wants to provide to whatever is being loaded is not having to think about page tables until that software is up far enough to enable their own page tables.
People being clever and enabling just enough pages in the page tables to work, based upon the results of some buggy (they are always buggy; some are just less so than others) boot-up firmware, is where I get concerned.
Said another way the point is to build an identity mapped page table. Skipping some parts of the physical<->virtual identity because we seem to think no one will use it is likely a bug.
I really don't see any point in putting holes in such a page table for any address below the highest address that is good for something. Given that on some systems the MTRRs are insufficient to do their job, it definitely makes sense to not enable caching on areas that we don't think are memory.
Eric
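A very rough sketch of what "map everything below the top of memory, but keep caching off for anything e820 does not call RAM" might look like follows; the helper name and the policy are assumptions, not a tested patch, and flipping _PAGE_PCD/_PAGE_PWT directly is only a crude stand-in for going through the proper PAT machinery.

#include <linux/types.h>
#include <asm/e820/api.h>
#include <asm/init.h>
#include <asm/pgtable_types.h>

static int ident_map_chunk(struct x86_mapping_info *info, pgd_t *pgd,
			   unsigned long start, unsigned long end)
{
	if (e820__mapped_all(start, end, E820_TYPE_RAM))
		info->page_flag = __PAGE_KERNEL_LARGE;	/* normal write-back RAM */
	else
		info->page_flag = __PAGE_KERNEL_LARGE | _PAGE_PCD | _PAGE_PWT;

	return kernel_ident_mapping_init(info, pgd, start, end);
}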
On Sat, Mar 30, 2024 at 10:46:21PM -0500, Eric W. Biederman wrote:
Steve Wahl steve.wahl@hpe.com writes:
On Thu, Mar 28, 2024 at 12:05:02AM -0500, Eric W. Biederman wrote:
From my perspective, the entire reason for wanting to be fine-grained and precise in the kernel memory map is that the UV systems don't have enough MTRRs. So you have to depend upon the cacheability attributes for specific addresses of memory coming from the page tables instead of from the MTRRs.
It would be more accurate to say we depend upon the addresses not being listed in the page tables at all. We'd be OK with mapped but not accessed, if it weren't for processor speculation. There's no "no access" setting within the existing MTRR definitions, though there may be a setting that would rein in processor speculation enough to make do.
The uncached setting and the write-combining settings that are used for I/O are required to disable speculation for any regions so marked. Any read or write to a memory-mapped I/O region can result in the hardware processing it as a command, which, as I understand it, is exactly the problem with UV systems.
Frankly, not mapping an I/O region (in an identity-mapped page table), instead of properly mapping it as it would need to be mapped for performing I/O, seems like a bit of a bug.
If you had enough MTRRs, defining the page tables to be precisely what is necessary would simply be an exercise in reducing kernel performance, because it is more efficient, in both page table size and TLB usage, to use 1GB pages instead of whatever smaller pages you have to use for oddball regions.
For systems without enough MTRRs, the small hit in paging performance is the necessary trade-off.
At least that is my perspective. Does that make sense?
I think I'm beginning to get your perspective. From your point of view, is kexec failing with "nogbpages" set a bug? My point of view is that it likely is; I think your view would say it isn't?
I would say it is a bug.
Part of the bug is someone yet again taking something simple that kexec is doing and reworking it to use generic code, then changing the generic code to do something different from what kexec needs and then being surprised that kexec stops working.
The interface kexec wants to provide to whatever is being loaded is not having to think about page tables until that software is up far enough to enable their own page tables.
People being clever and enabling just enough pages in the page tables to work, based upon the results of some buggy (they are always buggy; some are just less so than others) boot-up firmware, is where I get concerned.
Said another way the point is to build an identity mapped page table. Skipping some parts of the physical<->virtual identity because we seem to think no one will use it is likely a bug.
Hmm. I would think what's needed for kexec is to create, as nearly as possible, identical conditions to what the BIOS / bootloader provides when jumping to the kernel entry point. Whatever agreements are set on entry to the kernel, kexec needs to match.
And I think you want a completely identity-mapped table to match those entry-point requirements; that's why, on other platforms, the condition is that the MMU is turned off.
From that point of view, it does make sense to special case UV systems for this. The restricted areas we're talking about are not in the map when the bootloader is started on the UV platform.
I really don't see any point in putting holes in such a page table for any address below the highest address that is good for something. Given that on some systems the MTRRs are insufficient to do their job, it definitely makes sense to not enable caching on areas that we don't think are memory.
Well, on the UV platform, these addresses are *not* good for something, at least from the point of view of any processor or any I/O device (they are not allowed to appear in any DMA or PCI bus-master transaction, either). A hardware ASIC is using this portion of local RAM to hold some tables that are too large to put directly on the ASIC. Things turn ugly if anyone else tries to access these addresses.
In another message, Pavin thanked you for your work on kexec. I'd like to express my appreciation also. In my current job, I'm mostly focused on its use for kdump kernels. I've been dealing with kernel crash dumps since running Unix on i386 machines, and I always had to deal with "OK, but what if the kernel state gets corrupt enough that the disk driver won't work, or the network if you're trying to do a remote dump." The use of kexec to start a fresh instance of the kernel is an excellent way to solve that problem, in my opinion. And a couple of jobs ago we were able to use it to restart a SAN switch after a software upgrade, without needing to stop forwarding traffic, which wouldn't have been possible without kexec.
Thanks,
--> Steve Wahl
On 4/1/24 08:15, Steve Wahl wrote:
From that point of view, it does make sense to special case UV systems for this. The restricted areas we're talking about are not in the map when the bootloader is started on the UV platform.
Just to be clear what I'm looking for here: special-casing UV systems is theoretically OK. What I don't like is tying that to whether GB pages are used or not.
It would be much nicer to have specific, precise information about what UV needs done. For instance, do we know where the special address range is? Is it fixed? If so, I'd much rather have code that says: "Whoa, don't map this range with *any* identity map page tables" versus something targeted specifically at gbpages.
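One possible shape of that suggestion, sketched under the assumption that the e820 table is the right source of truth; the helper name and the exact policy of refusing on any E820_TYPE_RESERVED overlap are made up here, not a tested patch. e820__mapped_any() is an existing interface.

#include <linux/types.h>
#include <asm/e820/api.h>
#include <asm/pgtable.h>

/*
 * Only allow a 1GB mapping if the full gigabyte around @addr contains
 * nothing the firmware marked reserved.
 */
static bool gbpage_avoids_reserved(unsigned long addr)
{
	unsigned long gb_start = addr & PUD_MASK;

	return !e820__mapped_any(gb_start, gb_start + PUD_SIZE,
				 E820_TYPE_RESERVED);
}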
On Mon, Apr 01, 2024 at 11:03:20AM -0700, Dave Hansen wrote:
On 4/1/24 08:15, Steve Wahl wrote:
From that point of view, it does make sense to special case UV systems for this. The restricted areas we're talking about are not in the map when the bootloader is started on the UV platform.
Just to be clear what I'm looking for here: special-casing UV systems is theoretically OK. What I don't like is tying that to whether GB pages are used or not.
It would be much nicer to have specific, precise information about what UV needs done. For instance, do we know where the special address range is? Is it fixed? If so, I'd much rather have code that says: "Whoa, don't map this range with *any* identity map page tables" versus something targeted specifically at gbpages.
The area is not fixed. There is some reserved memory in each NUMA node; the address and amount vary depending on the amount of memory in the system.
We've kept the memory ranges marked as reserved in the e820 memory tables, and our experience is that the kernel respects that in all other respects. Even the creation of the identity maps, on the surface, respects those areas being listed as reserved, but using GB pages "swings wide" and includes the reserved areas in the identity map.
So, I don't fully understand your hesitation in being more selective in the use of gbpages on UV, or what I might suggest in its place.
Here's partial kernel output showing the memory map, from a randomly selected system in case it helps illustrate it for you:
[Tue Mar 21 09:40:00 2023] BIOS-provided physical RAM map:
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000000100000-0x0000000068285fff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000068286000-0x0000000068286fff] ACPI NVS
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000068287000-0x0000000068baefff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000068baf000-0x0000000068bb1fff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000068bb2000-0x0000000068c3ffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000068c40000-0x000000006a33ffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000000006a340000-0x000000006d5fefff] ACPI NVS
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000000006d5ff000-0x000000006fffefff] ACPI data
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000000006ffff000-0x000000006fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000070000000-0x000000008fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x00000000f8000000-0x00000000fbffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x00000000fe010000-0x00000000fe010fff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000000100000000-0x0000003f7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000003f80000000-0x000000407fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000004080000000-0x0000007f7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000007f80000000-0x000000807fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000008080000000-0x000000bf7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000000bf80000000-0x000000c07fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000000c080000000-0x000000ff7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000000ff80000000-0x000001007fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000010080000000-0x0000013f7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000013f80000000-0x000001407fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000014080000000-0x0000017f7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000017f80000000-0x000001807fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000018080000000-0x000001bf7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000001bf80000000-0x000001c07fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000001c080000000-0x000001ff7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000001ff80000000-0x000002007fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000020080000000-0x0000023f7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000023f80000000-0x000002407fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000024080000000-0x0000027f7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000027f80000000-0x000002807fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000028080000000-0x000002bf7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000002bf80000000-0x000002c07fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000002c080000000-0x000002ff7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000002ff80000000-0x000003007fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000030080000000-0x0000033f7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000033f80000000-0x000003407fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000034080000000-0x0000037f7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000037f80000000-0x000003807fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x0000038080000000-0x000003bf7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000003bf80000000-0x000003c07fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000003c080000000-0x000003ff7fffffff] usable
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x000003ff80000000-0x000004007fffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x00001ffe00000000-0x00001ffeffffffff] reserved
[Tue Mar 21 09:40:00 2023] BIOS-e820: [mem 0x00001fffc0000000-0x00001fffdfffffff] reserved
[Tue Mar 21 09:40:00 2023] printk: bootconsole [earlyser0] enabled
[Tue Mar 21 09:40:00 2023] NX (Execute Disable) protection: active
[Tue Mar 21 09:40:00 2023] e820: update [mem 0x3d279018-0x3d2b3a57] usable ==> usable
[Tue Mar 21 09:40:00 2023] e820: update [mem 0x3d279018-0x3d2b3a57] usable ==> usable
[Tue Mar 21 09:40:00 2023] e820: update [mem 0x3d23e018-0x3d278a57] usable ==> usable
[Tue Mar 21 09:40:00 2023] e820: update [mem 0x3d23e018-0x3d278a57] usable ==> usable
[Tue Mar 21 09:40:00 2023] extended physical RAM map:
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000000000000-0x000000000009ffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000000100000-0x000000003d23e017] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000003d23e018-0x000000003d278a57] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000003d278a58-0x000000003d279017] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000003d279018-0x000000003d2b3a57] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000003d2b3a58-0x0000000068285fff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000068286000-0x0000000068286fff] ACPI NVS
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000068287000-0x0000000068baefff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000068baf000-0x0000000068bb1fff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000068bb2000-0x0000000068c3ffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000068c40000-0x000000006a33ffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000006a340000-0x000000006d5fefff] ACPI NVS
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000006d5ff000-0x000000006fffefff] ACPI data
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000006ffff000-0x000000006fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000070000000-0x000000008fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x00000000f8000000-0x00000000fbffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x00000000fe010000-0x00000000fe010fff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000000100000000-0x0000003f7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000003f80000000-0x000000407fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000004080000000-0x0000007f7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000007f80000000-0x000000807fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000008080000000-0x000000bf7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000bf80000000-0x000000c07fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000c080000000-0x000000ff7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000000ff80000000-0x000001007fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000010080000000-0x0000013f7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000013f80000000-0x000001407fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000014080000000-0x0000017f7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000017f80000000-0x000001807fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000018080000000-0x000001bf7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000001bf80000000-0x000001c07fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000001c080000000-0x000001ff7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000001ff80000000-0x000002007fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000020080000000-0x0000023f7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000023f80000000-0x000002407fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000024080000000-0x0000027f7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000027f80000000-0x000002807fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000028080000000-0x000002bf7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000002bf80000000-0x000002c07fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000002c080000000-0x000002ff7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000002ff80000000-0x000003007fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000030080000000-0x0000033f7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000033f80000000-0x000003407fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000034080000000-0x0000037f7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000037f80000000-0x000003807fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x0000038080000000-0x000003bf7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000003bf80000000-0x000003c07fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000003c080000000-0x000003ff7fffffff] usable
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x000003ff80000000-0x000004007fffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x00001ffe00000000-0x00001ffeffffffff] reserved
[Tue Mar 21 09:40:00 2023] reserve setup_data: [mem 0x00001fffc0000000-0x00001fffdfffffff] reserved
[Tue Mar 21 09:40:00 2023] efi: EFI v2.7 by HPE
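Using two entries from the map above, a small stand-alone program (PUD_SIZE and PUD_MASK redefined locally for illustration; this is not kernel code) shows how covering the sub-gigabyte usable range 0x68bb2000-0x68c3ffff with a single 1 GB page also pulls in part of the reserved range that starts at 0x70000000:

#include <stdio.h>
#include <inttypes.h>

#define PUD_SHIFT 30
#define PUD_SIZE  (UINT64_C(1) << PUD_SHIFT)	/* 1 GiB */
#define PUD_MASK  (~(PUD_SIZE - 1))

int main(void)
{
	/* "usable" range taken from the BIOS-e820 map above */
	uint64_t req_start = UINT64_C(0x68bb2000);
	uint64_t req_end   = UINT64_C(0x68c3ffff);

	/* the 1 GiB page that a gbpage mapping of that range would use */
	uint64_t gb_start = req_start & PUD_MASK;
	uint64_t gb_end   = gb_start + PUD_SIZE - 1;

	/* "reserved" range from the same map */
	uint64_t rsv_start = UINT64_C(0x70000000);
	uint64_t rsv_end   = UINT64_C(0x8fffffff);

	printf("requested : 0x%" PRIx64 "-0x%" PRIx64 "\n", req_start, req_end);
	printf("1 GiB page: 0x%" PRIx64 "-0x%" PRIx64 "\n", gb_start, gb_end);
	printf("overlaps reserved 0x%" PRIx64 "-0x%" PRIx64 "? %s\n",
	       rsv_start, rsv_end,
	       (gb_start <= rsv_end && rsv_start <= gb_end) ? "yes" : "no");
	return 0;
}

The 1 GiB page comes out as 0x40000000-0x7fffffff, which overlaps the reserved region even though the requested range does not.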
Thanks,
--> Steve Wahl
On Mon, Apr 01, 2024 at 01:49:13PM -0500, Steve Wahl wrote:
On Mon, Apr 01, 2024 at 11:03:20AM -0700, Dave Hansen wrote:
On 4/1/24 08:15, Steve Wahl wrote:
From that point of view, it does make sense to special case UV systems for this. The restricted areas we're talking about are not in the map when the bootloader is started on the UV platform.
Just to be clear what I'm looking for here: special-casing UV systems is theoretically OK. What I don't like is tying that to whether GB pages are used or not.
It would be much nicer to have specific, precise information about what UV needs done. For instance, do we know where the special address range is? Is it fixed? If so, I'd much rather have code that says: "Whoa, don't map this range with *any* identity map page tables" versus something targeted specifically at gbpages.
The area is not fixed. There is some reserved memory in each NUMA node; the address and amount vary depending on the amount of memory in the system.
We've kept the memory ranges marked as reserved in the e820 memory tables, and our experience is that the kernel respects that in all other respects. Even the creation of the identity maps, on the surface, respects those areas being listed as reserved, but using GB pages "swings wide" and includes the reserved areas in the identity map.
So, I don't fully understand your hesitation in being more selective in the use of gbpages on UV, or what I might suggest in its place.
Dave, given the above, can you explain more precisely why the current approach in the patch is not acceptable to you, and what you'd like to see done differently?
Or is it maybe ok as it stands?
Thanks,
--> Steve Wahl
Steve Wahl, Hewlett Packard Enterprise
On Sun, Mar 24, 2024 at 11:31:39AM +0100, Ingo Molnar wrote:
- Steve Wahl steve.wahl@hpe.com wrote:
Some systems have ACPI tables that don't include everything that needs to be mapped for a successful kexec. These systems rely on identity maps that include the full gigabyte surrounding any smaller region requested for kexec success. Without this, they fail to kexec and end up doing a full firmware reboot.
So, reduce the use of GB pages only on systems where this is known to be necessary (specifically, UV systems).
Signed-off-by: Steve Wahl steve.wahl@hpe.com Fixes: d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.") Reported-by: Pavin Joseph me@pavinjoseph.com
Sigh, why was d794734c9bbf marked for a -stable backport? The commit never explains ...
If it's broken, it should be reverted - instead of trying to partially revert and then maybe break some other systems.
When there's boot breakage with new patches, we back out the bad patch and re-try in 99.9% of the cases.
Fine with me, especially as you've already done the revert. :-)
I will create a new patch that combines the two. If you have any specific actions you'd like to be sure I do for this, let me know.
Thanks,
--> Steve Wahl