On Sat, Mar 30, 2024 at 10:46:21PM -0500, Eric W. Biederman wrote:
Steve Wahl <steve.wahl@hpe.com> writes:
On Thu, Mar 28, 2024 at 12:05:02AM -0500, Eric W. Biederman wrote:
From my perspective the entire reason for wanting to be fine grained and precise in the kernel memory map is because the UV systems don't have enough MTRRs. So you have to depend upon the cache-ability attributes for specific addresses of memory coming from the page tables instead of from the MTRRs.
It would be more accurate to say we depend upon the addresses not being listed in the page tables at all. We'd be OK with mapped but not accessed, if it weren't for processor speculation. There's no "no access" setting within the existing MTRR definitions, though there may be a setting that would rein in processor speculation enough to make do.
The uncached setting and the write-combining settings that are used for I/O are required to disable speculation for any regions so marked. Any read or write to a memory mapped I/O region can result in the hardware processing it as a command. Which, as I understand it, is exactly the problem with UV systems.
Frankly not mapping an I/O region (in an identity mapped page table) instead of properly mapping it as it would need to be mapped for performing I/O seems like a bit of a bug.
If you had enough MTRRs, defining the page tables to be precisely what is necessary would simply be an exercise in reducing kernel performance, because it is more efficient, in both page table size and TLB usage, to use 1GB pages instead of whatever smaller pages you have to use for oddball regions.
For systems without enough MTRRs the small performance hit in paging performance is the necessary trade off.
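To put rough numbers on the page table size point, here is a minimal sketch that counts how many 4KB page-table pages an identity map of a given size needs at each leaf page size. It assumes x86-64 4-level paging (512 entries per table) and a range small enough for a single top-level table; the function and enum names are hypothetical, not taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_2M   (1ULL << 21)
#define SZ_1G   (1ULL << 30)
#define SZ_512G (1ULL << 39)

enum leaf_size { LEAF_1G, LEAF_2M, LEAF_4K };  /* hypothetical names */

static uint64_t div_round_up(uint64_t n, uint64_t d)
{
    return (n + d - 1) / d;
}

/*
 * Count the 4KB page-table pages needed to identity map [0, bytes).
 * Assumption: 4-level paging, so one top-level table covers 512 * 512GB.
 */
static uint64_t ident_map_table_pages(uint64_t bytes, enum leaf_size leaf)
{
    assert(bytes > 0 && bytes <= 512 * SZ_512G);

    uint64_t pages = 1;                        /* the single top-level table */
    pages += div_round_up(bytes, SZ_512G);     /* tables whose entries map 1GB */
    if (leaf == LEAF_1G)
        return pages;
    pages += div_round_up(bytes, SZ_1G);       /* tables whose entries map 2MB */
    if (leaf == LEAF_2M)
        return pages;
    pages += div_round_up(bytes, SZ_2M);       /* tables whose entries map 4KB */
    return pages;
}
```

Under these assumptions, identity mapping 1 TiB takes 3 table pages with 1GB leaves, 1027 with 2MB leaves, and 525315 with 4KB leaves. Real maps mix page sizes, so the cost of carving out a few oddball regions sits much closer to the 1GB figure than this worst case suggests.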
At least that is my perspective. Does that make sense?
I think I'm beginning to get your perspective. From your point of view, is kexec failing with "nogbpages" set a bug? My point of view is it likely is. I think your view would say it isn't?
I would say it is a bug.
Part of the bug is someone yet again taking something simple that kexec is doing and reworking it to use generic code, then changing the generic code to do something different from what kexec needs and then being surprised that kexec stops working.
The interface kexec wants to provide to whatever is being loaded is not having to think about page tables until that software is up far enough to enable its own page tables.
People being clever and enabling just enough pages in the page tables to work based upon the results of some buggy (they are always buggy; some are just less so than others) boot up firmware is where I get concerned.
Said another way the point is to build an identity mapped page table. Skipping some parts of the physical<->virtual identity because we seem to think no one will use it is likely a bug.
Hmm. I would think what's needed for kexec is to create, as nearly as possible, identical conditions to what the BIOS / bootloader provides when jumping to the kernel entry point. Whatever agreements are set on entry to the kernel, kexec needs to match.
And I think you want a completely identity mapped table to match those entry point requirements, that's why on other platforms, the condition is MMU turned off.
From that point of view, it does make sense to special case UV systems for this. The restricted areas we're talking about are not in the map when the bootloader is started on the UV platform.
I really don't see any point in putting holes in such a page table for any address below the highest address that is good for something. Given that on some systems the MTRRs are insufficient to do their job it definitely makes sense to not enable caching on areas that we don't think are memory.
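As an illustration of the policy described in the paragraph above (no holes below the highest useful address, caching left off for anything not known to be memory), here is a minimal user-space sketch that fills one table of 1GB entries. It is not the kexec code. The bit positions are the architectural x86 page-table flag bits, and PCD plus PWT selects uncached only with the power-on default PAT. The `ram_range` structure and both function names are hypothetical, and marking a whole 1GB slot uncached when it is only partly RAM is a simplification.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_PRESENT (1ULL << 0)
#define PAGE_RW      (1ULL << 1)
#define PAGE_PWT     (1ULL << 3)   /* write-through */
#define PAGE_PCD     (1ULL << 4)   /* cache disable */
#define PAGE_PSE     (1ULL << 7)   /* large page: this entry maps 1GB */
#define SZ_1G        (1ULL << 30)

struct ram_range { uint64_t start, end; };  /* [start, end), hypothetical */

/* True if one RAM range covers the whole 1GB slot (ranges are not merged). */
static bool slot_is_all_ram(uint64_t slot, const struct ram_range *ram, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ram[i].start <= slot && slot + SZ_1G <= ram[i].end)
            return true;
    return false;
}

/*
 * Identity map [0, max_addr) with 1GB entries and no holes.
 * Slots that are not entirely RAM are still mapped, but uncached.
 * Returns the number of entries written.
 */
static size_t fill_ident_pud(uint64_t *pud, size_t nslots, uint64_t max_addr,
                             const struct ram_range *ram, size_t n)
{
    size_t used = 0;

    for (uint64_t addr = 0; addr < max_addr && used < nslots; addr += SZ_1G) {
        uint64_t entry = addr | PAGE_PRESENT | PAGE_RW | PAGE_PSE;

        if (!slot_is_all_ram(addr, ram, n))
            entry |= PAGE_PCD | PAGE_PWT;
        pud[used++] = entry;
    }
    return used;
}
```

Note that this sketch only shows the "map everything, but uncached" position. It does not cover a platform where the addresses must be absent from the page tables altogether.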
Well, on the UV platform, these addresses are *not* good for something, at least from any processor's point of view, nor any IO device (they are not allowed to appear in any DMA or PCI bus master transaction, either). A hardware ASIC is using this portion of local RAM to hold some tables that are too large to put directly on the ASIC. Things turn ugly if anyone else tries to access these addresses.
In another message, Pavin thanked you for your work on kexec. I'd like to express my appreciation also. In my current job, I'm mostly focused on its use for kdump kernels. I've been dealing with kernel crash dumps since running Unix on i386 machines, and always had to deal with "OK, but what if the kernel state gets corrupt enough that the disk driver won't work, or network if you're trying to do a remote dump." The use of kexec to start a fresh instance of the kernel is an excellent way to solve that problem, in my opinion. And a couple of jobs ago we were able to use it to restart a SAN switch after software upgrade, without needing to stop forwarding traffic, which wouldn't have been possible without kexec.
Thanks,
--> Steve Wahl