On Fri, Jul 17, 2020 at 12:20:39AM -0700, ira.weiny@intel.com wrote:
From: Ira Weiny ira.weiny@intel.com
This RFC series has been reviewed by Dave Hansen.
Changes from RFC:
- Clean up commit messages based on Peter Zijlstra's and Dave Hansen's feedback
- Fix static branch anti-pattern
- New patch: (memremap: Convert devmap static branch to {inc,dec})
  This was the code I used as a model for my static branch, which I believe is wrong now.
- New patch: (x86/entry: Preserve PKRS MSR through exceptions)
  This attempts to preserve the per-logical-processor MSR, and the reference counting, during exceptions. I'd really like feedback on this because I _think_ it should work, but I'm afraid I'm missing something, as my testing has shown a lot of spotty crashes which don't make sense to me.
This patch set introduces a new page protection mechanism for supervisor pages, Protection Key Supervisor (PKS), and an initial user of it: persistent memory (PMEM).
PKS enables protections on 'domains' of supervisor pages to limit supervisor-mode access to those pages beyond the normal paging protections. It works in a similar fashion to user space pkeys. Like user page pkeys (PKU), supervisor pkeys are checked in addition to normal paging protections, and Access or Write permission can be disabled via an MSR update, without TLB flushes, when permissions change. A page mapping is assigned to a domain by setting a pkey in the page table entry.
Unlike user pkeys, no new instructions are added; rather, WRMSR/RDMSR are used to update the PKRS register.
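For illustration only, a minimal sketch of what a PKRS update could look like, assuming the PKRU-style two-bits-per-key layout (Access-Disable / Write-Disable) described above; the MSR index, masks, and helper names here are my assumptions, not necessarily what the series uses:

#include <linux/types.h>
#include <asm/msr.h>

/*
 * Illustrative only: assume PKRS mirrors the PKRU layout -- two bits
 * per key, Access-Disable (bit 0) and Write-Disable (bit 1).  The MSR
 * index and these masks are assumptions for this sketch.
 */
#define MSR_IA32_PKRS		0x6E1
#define PKR_AD_BIT		0x1
#define PKR_WD_BIT		0x2
#define PKR_BITS_PER_PKEY	2

/* Compute a new PKRS value with @pkey's bits replaced by @rights. */
static inline u32 update_pkey_val(u32 pkrs, int pkey, u32 rights)
{
	int shift = pkey * PKR_BITS_PER_PKEY;

	pkrs &= ~((u32)(PKR_AD_BIT | PKR_WD_BIT) << shift);
	pkrs |= rights << shift;
	return pkrs;
}

/*
 * Deny all access for @pkey on this CPU; only an MSR write is needed,
 * no TLB flush.  (Preemption/irq handling is omitted in this sketch.)
 */
static inline void pks_deny_access(u32 *cur_pkrs, int pkey)
{
	*cur_pkrs = update_pkey_val(*cur_pkrs, pkey, PKR_AD_BIT);
	wrmsrl(MSR_IA32_PKRS, *cur_pkrs);
}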
XSAVE is not supported for the PKRS MSR. To reduce software complexity, the implementation saves/restores the MSR across context switches but not during irqs. This is a compromise which results in a hardening against unwanted access without an absolute restriction.
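Since there is no XSAVE state for PKRS, the kernel has to carry the value itself; a rough sketch of the per-thread save/restore this describes might look as follows, with the field and helper names (saved_pkrs, write_pkrs(), switch_to_pkrs()) being placeholders rather than the series' actual API:

#include <linux/types.h>
#include <asm/msr.h>

/* Placeholder per-thread state; the real series hangs this off thread_struct. */
struct pks_thread_state {
	u32 saved_pkrs;			/* PKRS value for this thread */
};

static inline void write_pkrs(u32 pkrs)
{
	/* A per-CPU cache could avoid redundant WRMSRs; omitted here. */
	wrmsrl(MSR_IA32_PKRS, pkrs);
}

/* Called on context switch: install the incoming thread's PKRS value. */
static inline void switch_to_pkrs(struct pks_thread_state *next)
{
	write_pkrs(next->saved_pkrs);
}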
For consistent behavior with current paging protections, pkey 0 is reserved and configured to allow full access via the pkey mechanism, thus preserving the default paging protections on mappings with the default pkey value of 0.
Other keys (1-15) are allocated by an allocator, which prepares us for key contention from day one. Kernel users should be prepared for the allocator to fail, either because of key exhaustion or because PKS is not supported on the arch and/or CPU instance.
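A hedged example of how a kernel user might consume such an allocator; pks_key_alloc()/pks_key_free() and the pkey-in-pgprot detail are the shapes suggested by the description above, so treat the exact names and signatures as assumptions:

/* Illustrative consumer of the pkey allocator described above. */
static int my_pkey = -1;

static int my_driver_init(void)
{
	my_pkey = pks_key_alloc("my_driver");	/* assumed signature */
	if (my_pkey < 0) {
		/*
		 * Key exhaustion, or PKS not supported on this
		 * arch/CPU: fall back to running unprotected.
		 */
		my_pkey = 0;
		return 0;
	}
	/*
	 * Mappings would then be created with the pkey encoded in the
	 * page protections, e.g. something along the lines of
	 * pgprot_val(prot) |= _PAGE_PKEY(my_pkey);
	 */
	return 0;
}

static void my_driver_exit(void)
{
	if (my_pkey > 0)
		pks_key_free(my_pkey);	/* assumed signature */
}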
Protecting against stray writes is particularly important for PMEM because, unlike writes to anonymous memory, writes to PMEM persist across a reboot. Thus data corruption could result in permanent loss of data.
The following attributes of PKS make it well suited as a mechanism to protect PMEM from stray access within the kernel (a usage sketch follows the list):
- Fast switching of permissions
- Prevents access without page table manipulations
- Works on a per thread basis
- No TLB flushes required
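As a concrete illustration of the "fast switching" point for PMEM, a guarded write could look roughly like the following; pks_mk_readwrite()/pks_mk_noaccess() are placeholder names for whatever enable/disable helpers the series provides, and pmem_pkey stands for a key obtained from the allocator:

#include <linux/string.h>
#include <linux/types.h>

/* Key previously obtained from the allocator; placeholder for this sketch. */
static int pmem_pkey;

/*
 * Illustrative stray-write-protected PMEM update: PMEM pages are mapped
 * with pmem_pkey and normally access-disabled; access is opened only
 * around the copy and dropped again immediately afterwards.
 */
static void pmem_guarded_write(void *pmem_dst, const void *src, size_t len)
{
	pks_mk_readwrite(pmem_pkey);	/* MSR write only, no TLB flush */
	memcpy(pmem_dst, src, len);
	pks_mk_noaccess(pmem_pkey);	/* drop access again */
}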
Cool! This seems like it'd be very handy to make other types of kernel data "read-only at rest" (as was long ago proposed via X86_CR0_WP[1], which only provided two protection levels, not 15). For example, at least a few other kinds of areas stand out to me as being in need of PKS markings (i.e. only things that actually manipulate these areas should gain temporary PK access):

- Page tables themselves
- Identity mapping
- The "read-only at rest" stuff, though it'll need special plumbing to make it work with the slab allocator, etc. (more like the later "static allocation" work[2]).
[1] https://lore.kernel.org/lkml/1490811363-93944-1-git-send-email-keescook@chro...
[2] https://lore.kernel.org/lkml/cover.1550097697.git.igor.stoppa@huawei.com/