Cc: Steven Rostedt and Suresh Siddha
Hi Peter,
On Aug 23, 2019, at 2:36 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Aug 22, 2019 at 10:23:35PM -0700, Song Liu wrote:
> > As 4k pages check was removed from cpa [1], set_kernel_text_rw() leads to split_large_page() for all kernel text pages. This means a single kprobe will put all kernel text in 4k pages:
> >
> > root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
> > 0xffffffff81000000-0xffffffff82400000 20M ro PSE x pmd
> >
> > root@ ~# echo ONE_KPROBE >> /sys/kernel/debug/tracing/kprobe_events
> > root@ ~# echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
> >
> > root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
> > 0xffffffff81000000-0xffffffff82400000 20M ro x pte
> >
> > To fix this issue, introduce CPA_FLIP_TEXT_RW to bypass "Text RO" check in static_protections().
> > Two helper functions set_text_rw() and set_text_ro() are added to flip _PAGE_RW bit for kernel text.
> >
> > [1] commit 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")
>
> ARGH; so this is because ftrace flips the whole kernel range to RW and back for giggles? I'm thinking _that_ is a bug, it's a clear W^X violation.
Thanks for your comments. Yes, it is related to ftrace, as we have CONFIG_KPROBES_ON_FTRACE. However, after digging around, I am not sure what the expected behavior is.
The kernel text region has two mappings. On x86_64 with four-level page tables, they are:
1. kernel identity mapping, from 0xffff888000100000;
2. kernel text mapping, from 0xffffffff81000000.
Per the comment in arch/x86/mm/init_64.c:set_kernel_text_rw():
	/*
	 * Make the kernel identity mapping for text RW. Kernel text
	 * mapping will always be RO. Refer to the comment in
	 * static_protections() in pageattr.c
	 */
	set_memory_rw(start, (end - start) >> PAGE_SHIFT);
kprobe (with CONFIG_KPROBES_ON_FTRACE) should work on the kernel identity mapping.
However, my experiment shows that kprobe actually operates on the kernel text mapping (0xffffffff81000000-). It is the same w/ and w/o CONFIG_KPROBES_ON_FTRACE. Therefore, I am not sure whether the comment is outdated (it is 10 years old), or kprobe is doing something wrong.
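For reference, here is a rough sketch of how the two mappings alias the same physical page. It assumes x86_64 with four-level page tables and no KASLR; the constants are the usual bases of the kernel text mapping and the direct mapping in that configuration, and the helper names are made up for this mail, not kernel APIs. phys_base is whatever load offset the running kernel uses (0 is only an assumption).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: x86_64, 4-level page tables, KASLR disabled. */
#define START_KERNEL_MAP 0xffffffff80000000ULL /* base of the kernel text mapping */
#define PAGE_OFFSET_L4   0xffff888000000000ULL /* base of the direct (identity) mapping */

/* Hypothetical helper: physical address behind a kernel text mapping address. */
static inline uint64_t text_to_phys(uint64_t text_vaddr, uint64_t phys_base)
{
	assert(text_vaddr >= START_KERNEL_MAP);
	return text_vaddr - START_KERNEL_MAP + phys_base;
}

/* Hypothetical helper: direct-map alias of the same physical address. */
static inline uint64_t text_to_direct_map(uint64_t text_vaddr, uint64_t phys_base)
{
	return text_to_phys(text_vaddr, phys_base) + PAGE_OFFSET_L4;
}
```

The point is only that each text page is reachable through two virtual addresses, so "which mapping gets flipped to RW" is a real choice and not a detail.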
Here is more information about the issue we are looking at.
We found that with a 5.2 kernel (no CONFIG_PAGE_TABLE_ISOLATION, w/ CONFIG_KPROBES_ON_FTRACE), a single kprobe splits _all_ PMDs in the kernel text mapping into PTE-mapped pages. This increases the iTLB miss rate from about 300 per million instructions to about 700 per million instructions (for the application I test with).
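To put a number on the split, here is a small sketch that counts how many page-table entries cover the text range from the page_tables dump in the patch description, once with 2M pages and once with 4k pages. The macro and function names are local to this sketch and are not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

#define PMD_PAGE_SIZE (2ULL << 20) /* 2M large page, one PMD entry */
#define PTE_PAGE_SIZE (4ULL << 10) /* 4k page, one PTE entry */

/* Number of entries needed to map [start, end) with pages of page_size. */
static inline uint64_t entries_needed(uint64_t start, uint64_t end,
				      uint64_t page_size)
{
	assert(start < end);
	assert((end - start) % page_size == 0); /* range must be page aligned */
	return (end - start) / page_size;
}
```

For 0xffffffff81000000-0xffffffff82400000 (20M) this gives 10 entries when PMD-mapped and 5120 entries when PTE-mapped, which is consistent with the higher iTLB pressure we see after the split.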
By bisecting, we found that this behavior starts with commit 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely"). That's why I proposed this PATCH to fix/work around the issue. However, per Peter's comment and my study of the code, this doesn't seem to be the real problem, or at least not the only one here.
I also tested that the PMD split issue doesn't happen w/o CONFIG_KPROBES_ON_FTRACE.
In summary, I have the following questions:
1. Which mapping should kprobe work on? Kernel identity mapping or kernel text mapping?
2. FTRACE causes split of PMD mapped kernel text. How should we fix this?
Thanks,
Song