Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr. This behavior changes after the 32-bit support: pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case, and increases addr by PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate because addr may not be PUD_SIZE/PMD_SIZE aligned.
Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE in these two cases.
The following explains how we debugged this issue:
We use huge page for hot text and thus reduces iTLB misses. As we benchmark 5.2 based kernel (vs. 4.16 based), we found ~2.5x more iTLB misses.
To figure out the issue, I use a debug patch that dumps page table for a pid. The following are information from the workload pid.
For the 4.16 based kernel:
host-4.16 # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid 0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd 0xffffffff81a00000-0xffffffff81c00000 2M ro PSE x pmd
For the 5.2 based kernel before this patch:
host-5.2-before # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid 0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd
The 8MB text in pmd is from user space. 4.16 kernel has 1 pmd for the irq entry table; while 4.16 kernel doesn't have it.
For the 5.2 based kernel after this patch:
host-5.2-after # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid 0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
So after this patch, the 5.2 based kernel has 7 PMDs instead of 1 PMD in 4.16 kernel. This further reduces iTLB miss rate
Cc: stable@vger.kernel.org # v4.19+ Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit") Reviewed-by: Rik van Riel riel@surriel.com Signed-off-by: Song Liu songliubraving@fb.com Cc: Joerg Roedel jroedel@suse.de Cc: Thomas Gleixner tglx@linutronix.de Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Andy Lutomirski luto@kernel.org Cc: Peter Zijlstra peterz@infradead.org --- arch/x86/mm/pti.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index b196524759ec..1337494e22ef 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -330,13 +330,13 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
pud = pud_offset(p4d, addr); if (pud_none(*pud)) { - addr += PUD_SIZE; + addr = round_up(addr + 1, PUD_SIZE); continue; }
pmd = pmd_offset(pud, addr); if (pmd_none(*pmd)) { - addr += PMD_SIZE; + addr = round_up(addr + 1, PMD_SIZE); continue; }
On Aug 20, 2019, at 1:23 PM, Song Liu songliubraving@fb.com wrote:
Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr. This behavior changes after the 32-bit support: pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case, and increases addr by PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate because addr may not be PUD_SIZE/PMD_SIZE aligned.
Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE in these two cases.
After poking around more, I found the following doesn't really make sense.
Sorry for the noise. Song
<nonsense>
The following explains how we debugged this issue:
We use huge page for hot text and thus reduces iTLB misses. As we benchmark 5.2 based kernel (vs. 4.16 based), we found ~2.5x more iTLB misses.
To figure out the issue, I use a debug patch that dumps page table for a pid. The following are information from the workload pid.
For the 4.16 based kernel:
host-4.16 # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid 0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd 0xffffffff81a00000-0xffffffff81c00000 2M ro PSE x pmd
For the 5.2 based kernel before this patch:
host-5.2-before # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid 0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd
The 8MB text in pmd is from user space. 4.16 kernel has 1 pmd for the irq entry table; while 4.16 kernel doesn't have it.
For the 5.2 based kernel after this patch:
host-5.2-after # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid 0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
So after this patch, the 5.2 based kernel has 7 PMDs instead of 1 PMD in 4.16 kernel. This further reduces iTLB miss rate
</nonsense>
Cc: stable@vger.kernel.org # v4.19+ Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit") Reviewed-by: Rik van Riel riel@surriel.com Signed-off-by: Song Liu songliubraving@fb.com Cc: Joerg Roedel jroedel@suse.de Cc: Thomas Gleixner tglx@linutronix.de Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Andy Lutomirski luto@kernel.org Cc: Peter Zijlstra peterz@infradead.org
arch/x86/mm/pti.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index b196524759ec..1337494e22ef 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -330,13 +330,13 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
pud = pud_offset(p4d, addr); if (pud_none(*pud)) {
addr += PUD_SIZE;
addr = round_up(addr + 1, PUD_SIZE); continue;
}
pmd = pmd_offset(pud, addr); if (pmd_none(*pmd)) {
addr += PMD_SIZE;
}addr = round_up(addr + 1, PMD_SIZE); continue;
-- 2.17.1
On Wed, 21 Aug 2019, Song Liu wrote:
On Aug 20, 2019, at 1:23 PM, Song Liu songliubraving@fb.com wrote:
Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr. This behavior changes after the 32-bit support: pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case, and increases addr by PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate because addr may not be PUD_SIZE/PMD_SIZE aligned.
Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE in these two cases.
After poking around more, I found the following doesn't really make sense.
I'm glad you figured that out yourself. Was about to write up something to that effect.
Still interesting questions remain:
1) How did you end up feeding an unaligned address into that which points to a 0 PUD?
2) Is this related to Facebook specific changes and unlikely to affect any regular kernel? I can't come up with a way to trigger that in mainline
3) As this is a user page table and the missing mapping is related to mappings required by PTI, how is the machine going in/out of user space in the first place? Or did I just trip over what you called nonsense?
Thanks,
tglx
On Wed, 21 Aug 2019, Thomas Gleixner wrote:
On Wed, 21 Aug 2019, Song Liu wrote:
On Aug 20, 2019, at 1:23 PM, Song Liu songliubraving@fb.com wrote:
Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr. This behavior changes after the 32-bit support: pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case, and increases addr by PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate because addr may not be PUD_SIZE/PMD_SIZE aligned.
Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE in these two cases.
After poking around more, I found the following doesn't really make sense.
I'm glad you figured that out yourself. Was about to write up something to that effect.
Still interesting questions remain:
How did you end up feeding an unaligned address into that which points to a 0 PUD?
Is this related to Facebook specific changes and unlikely to affect any regular kernel? I can't come up with a way to trigger that in mainline
As this is a user page table and the missing mapping is related to mappings required by PTI, how is the machine going in/out of user space in the first place? Or did I just trip over what you called nonsense?
And just because this ended in silence I looked at it myself after Peter told me that this was on a kernel with PTI disabled. Aside of that my built in distrust for debug war stories combined with fairy tale changelogs triggered my curiousity anyway.
So that cannot be a kernel with PTI disabled compile time because in that case the functions are not available, unless it's FB hackery which I do not care about.
So the only way this can be reached is when PTI is configured but disabled at boot time via pti=off or nopti.
For some silly reason and that goes back to before the 32bit support and Joern did not notice either when he introduced pti_finalize() this results in the following functions being called unconditionallY:
pti_clone_entry_text() pti_clone_kernel_text()
pti_clone_kernel_text() was called unconditionally before the 32bit support already and the only reason why it did not have any effect in that situation is that it invokes pti_kernel_image_global_ok() and that returns false when PTI is disabled on the kernel command line. Oh well. It obviously never checked whether X86_FEATURE_PTI was disabled or enabled in the first place.
Now 32bit moved that around into pti_finalize() and added the call to pti_clone_entry_text() which just runs unconditionally.
Now there is still the interesting question why this matters. The to be cloned page table entries are mapped and the start address even if unaligned never points to something unmapped. The unmapped case only covers holes and holes are obviously aligned at the upper levels even if the address of the hole is unaligned.
So colour me still confused what's wrong there but the proper fix is the trivial:
--- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -666,6 +666,8 @@ void __init pti_init(void) */ void pti_finalize(void) { + if (!boot_cpu_has(X86_FEATURE_PTI)) + return; /* * We need to clone everything (again) that maps parts of the * kernel image.
Hmm?
I'm going to look whether that makes any difference in the page tables tomorrow with brain awake, but I wanted to share this before the .us vanishes into the weekend :)
Thanks,
tglx
On Aug 23, 2019, at 5:59 PM, Thomas Gleixner tglx@linutronix.de wrote:
On Wed, 21 Aug 2019, Thomas Gleixner wrote:
On Wed, 21 Aug 2019, Song Liu wrote:
On Aug 20, 2019, at 1:23 PM, Song Liu songliubraving@fb.com wrote:
Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr. This behavior changes after the 32-bit support: pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case, and increases addr by PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate because addr may not be PUD_SIZE/PMD_SIZE aligned.
Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE in these two cases.
After poking around more, I found the following doesn't really make sense.
I'm glad you figured that out yourself. Was about to write up something to that effect.
Still interesting questions remain:
How did you end up feeding an unaligned address into that which points to a 0 PUD?
Is this related to Facebook specific changes and unlikely to affect any regular kernel? I can't come up with a way to trigger that in mainline
As this is a user page table and the missing mapping is related to mappings required by PTI, how is the machine going in/out of user space in the first place? Or did I just trip over what you called nonsense?
And just because this ended in silence I looked at it myself after Peter told me that this was on a kernel with PTI disabled. Aside of that my built in distrust for debug war stories combined with fairy tale changelogs triggered my curiousity anyway.
I am really sorry that I was silent. Somehow I didn't see this in my inbox (or it didn't show up until just now?).
For this patch, I really messed up this with something else. The issue we are seeing is that kprobe on CONFIG_KPROBES_ON_FTRACE splits PMD located at 0xffffffff81a00000. I sent another patch last night, but that might not be the right fix either.
I haven't started testing our PTI enabled kernel, so I am not sure whether there is really an issue with the PTI code.
Thanks, Song
On Tue, Aug 20, 2019 at 01:23:14PM -0700, Song Liu wrote:
Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr. This behavior changes after the 32-bit support: pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case, and increases addr by PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate because addr may not be PUD_SIZE/PMD_SIZE aligned.
Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE in these two cases.
So the patch is fine, ACK on that, but that still leaves us with the puzzle of why this didn't explode mightily and the story needs a little more work.
The following explains how we debugged this issue:
We use huge page for hot text and thus reduces iTLB misses. As we benchmark 5.2 based kernel (vs. 4.16 based), we found ~2.5x more iTLB misses.
To figure out the issue, I use a debug patch that dumps page table for a pid. The following are information from the workload pid.
For the 4.16 based kernel:
host-4.16 # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid 0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd 0xffffffff81a00000-0xffffffff81c00000 2M ro PSE x pmd
For the 5.2 based kernel before this patch:
host-5.2-before # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid 0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd
The 8MB text in pmd is from user space. 4.16 kernel has 1 pmd for the irq entry table; while 4.16 kernel doesn't have it.
For the 5.2 based kernel after this patch:
host-5.2-after # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid 0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
So after this patch, the 5.2 based kernel has 7 PMDs instead of 1 PMD in 4.16 kernel.
This basically gives rise to more questions than it provides answers. You seem to have 'forgotten' to provide the equivalent mappings on the two older kernels. The fact that they're not PMD is evident, but it would be very good to know what is mapped, and what -- if anything -- lives in the holes we've (accidentally) created.
Can you please provide more complete mappings? Basically provide the whole cpu_entry_area mapping.
This further reduces iTLB miss rate
What you're saying is that by using PMDs, we reduce 4K iTLB usage. But we increase 2M iTLB usage, but for your workload this works out favourably (a quick look at the PMU event tables for SKL didn't show me separate 4K/2M iTLB counters :/).
Cc: stable@vger.kernel.org # v4.19+ Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit") Reviewed-by: Rik van Riel riel@surriel.com Signed-off-by: Song Liu songliubraving@fb.com Cc: Joerg Roedel jroedel@suse.de Cc: Thomas Gleixner tglx@linutronix.de Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Andy Lutomirski luto@kernel.org Cc: Peter Zijlstra peterz@infradead.org
arch/x86/mm/pti.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index b196524759ec..1337494e22ef 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -330,13 +330,13 @@ pti_clone_pgtable(unsigned long start, unsigned long end, pud = pud_offset(p4d, addr); if (pud_none(*pud)) {
addr += PUD_SIZE;
}addr = round_up(addr + 1, PUD_SIZE); continue;
pmd = pmd_offset(pud, addr); if (pmd_none(*pmd)) {
addr += PMD_SIZE;
}addr = round_up(addr + 1, PMD_SIZE); continue;
2.17.1
On Wed, Aug 21, 2019 at 12:10:08PM +0200, Peter Zijlstra wrote:
On Tue, Aug 20, 2019 at 01:23:14PM -0700, Song Liu wrote:
host-5.2-after # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid 0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
So after this patch, the 5.2 based kernel has 7 PMDs instead of 1 PMD in 4.16 kernel.
This basically gives rise to more questions than it provides answers. You seem to have 'forgotten' to provide the equivalent mappings on the two older kernels. The fact that they're not PMD is evident, but it would be very good to know what is mapped, and what -- if anything -- lives in the holes we've (accidentally) created.
Can you please provide more complete mappings? Basically provide the whole cpu_entry_area mapping.
I tried on my local machine and:
cat /debug/page_tables/kernel | awk '/^---/ { p=0 } /CPU entry/ { p=1 } { if (p) print $0 }' > ~/cea-{before,after}.txt
resulted in _identical_ files ?!?!
Can you share your before and after dumps?
On Aug 21, 2019, at 3:30 AM, Peter Zijlstra peterz@infradead.org wrote:
On Wed, Aug 21, 2019 at 12:10:08PM +0200, Peter Zijlstra wrote:
On Tue, Aug 20, 2019 at 01:23:14PM -0700, Song Liu wrote:
host-5.2-after # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid 0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
So after this patch, the 5.2 based kernel has 7 PMDs instead of 1 PMD in 4.16 kernel.
This basically gives rise to more questions than it provides answers. You seem to have 'forgotten' to provide the equivalent mappings on the two older kernels. The fact that they're not PMD is evident, but it would be very good to know what is mapped, and what -- if anything -- lives in the holes we've (accidentally) created.
Can you please provide more complete mappings? Basically provide the whole cpu_entry_area mapping.
I tried on my local machine and:
cat /debug/page_tables/kernel | awk '/^---/ { p=0 } /CPU entry/ { p=1 } { if (p) print $0 }' > ~/cea-{before,after}.txt
resulted in _identical_ files ?!?!
Can you share your before and after dumps?
I was really dumb on this. The actual issue this that kprobe on CONFIG_KPROBES_ON_FTRACE splits kernel text PMDs (0xffffffff81000000-).
I will dig more into this.
Sorry for being silent, somehow I didn't see this email until just now.
Song
linux-stable-mirror@lists.linaro.org