On Aug 26, 2019, at 8:08 AM, Song Liu <songliubraving@fb.com> wrote:
On Aug 26, 2019, at 2:23 AM, Peter Zijlstra <peterz@infradead.org> wrote:
So only the high mapping is ever executable; the identity map should not be. Both should be RO.
kprobe (with CONFIG_KPROBES_ON_FTRACE) should work on kernel identity mapping.
Please provide more information; kprobes shouldn't be touching either mapping. That is, afaict kprobes uses text_poke() which uses a temporary mapping (in 'userspace' even) to alias the high text mapping.
kprobe without CONFIG_KPROBES_ON_FTRACE uses text_poke(), but kprobe with CONFIG_KPROBES_ON_FTRACE takes another path. The split happens in set_kernel_text_rw() -> ... -> __change_page_attr() -> split_large_page(). The split was introduced by commit 585948f4f695: do_split in __change_page_attr() becomes true after that commit. This patch tries to fix/work around this part.
I'm also not sure how it would then result in any 4k text maps. Yes the alias is 4k, but it should not affect the actual high text map in any way.
I am confused by the alias logic. set_kernel_text_rw() makes the high map rw, and splits the PMD in the high map.
kprobes also allocates executable slots, but it does that in the module range (afaict), so that, again, should not affect the high text mapping.
We found that with a 5.2 kernel (no CONFIG_PAGE_TABLE_ISOLATION, w/ CONFIG_KPROBES_ON_FTRACE), a single kprobe will split _all_ PMDs in the kernel text mapping into pte-mapped pages. This increases the iTLB miss rate from about 300 per million instructions to about 700 per million instructions (for the application I test with).
Per bisect, we found this behavior happens after commit 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely"). That's why I proposed this PATCH to fix/work around this issue. However, per Peter's comment and my study of the code, this doesn't seem to be the real problem, or at least not the only one here.
I also tested that the PMD split issue doesn't happen w/o CONFIG_KPROBES_ON_FTRACE.
Right, because then ftrace doesn't flip the whole kernel map writable; which it _really_ should stop doing anyway.
But I'm still wondering what causes that first 4k split...
Please see above.
Another data point: we can repro the issue on Linus's master with just ftrace:
# start with PMD mapped
root@virt-test:~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
0xffffffff81000000-0xffffffff81c00000    12M    ro    PSE    x    pmd

# enable single ftrace
root@virt-test:~# echo consume_skb > /sys/kernel/debug/tracing/set_ftrace_filter
root@virt-test:~# echo function > /sys/kernel/debug/tracing/current_tracer

# now the text is PTE mapped
root@virt-test:~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
0xffffffff81000000-0xffffffff81c00000    12M    ro    x    pte
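For anyone who wants to script the before/after check instead of eyeballing the grep output, here is a minimal sketch in Python. It only parses text in the format shown above. The helper name mapping_levels() is hypothetical, and treating lines without attribute flags as unmapped holes is an assumption about the dump format rather than something verified here.

```python
def mapping_levels(dump_text, start, end):
    """Return the set of page-table levels (e.g. {'pmd'}) mapping [start, end).

    dump_text is the content of /sys/kernel/debug/page_tables/kernel,
    one "START-END SIZE flags... LEVEL" entry per line.
    """
    levels = set()
    for line in dump_text.splitlines():
        fields = line.split()
        # Skip section headers and (by assumption) unmapped holes,
        # which carry only range, size and level with no flags.
        if len(fields) < 4 or not fields[0].startswith("0x") or "-" not in fields[0]:
            continue
        lo_str, hi_str = fields[0].split("-", 1)
        lo, hi = int(lo_str, 16), int(hi_str, 16)
        if lo < end and hi > start:  # entry overlaps the requested range
            levels.add(fields[-1])   # last column is the level: pgd/pud/pmd/pte
    return levels


if __name__ == "__main__":
    # Sample lines copied from the transcript above.
    before = "0xffffffff81000000-0xffffffff81c00000 12M ro PSE x pmd"
    after = "0xffffffff81000000-0xffffffff81c00000 12M ro x pte"
    text_start, text_end = 0xffffffff81000000, 0xffffffff81c00000
    print(mapping_levels(before, text_start, text_end))
    print(mapping_levels(after, text_start, text_end))
```

For scale: the 12M range above takes 6 entries when mapped with 2M PMDs and 3072 entries when mapped with 4k PTEs.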
Song