pti_clone_pgtable() increases addr by PUD_SIZE for the pud_none(*pud) case. This is not accurate because addr may not be PUD_SIZE aligned.

In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because of this issue, including the PMD for the irq entry table. For a memcache-like workload, this introduces about 4.5x more iTLB-loads and about 2.5x more iTLB-load-misses on a Skylake CPU.

This patch fixes this issue by adding PMD_SIZE to addr for the pud_none() case.
Cc: stable@vger.kernel.org # v4.19+
Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/mm/pti.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..5a67c3015f59 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
-			addr += PUD_SIZE;
+			addr += PMD_SIZE;
 			continue;
 		}
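To make the failure mode concrete, here is a minimal user-space sketch, assuming the usual x86_64 4-level paging sizes (2 MiB PMD, 1 GiB PUD) and an illustrative kernel-image address; the macros are redefined locally since this is not kernel code:

	#include <stdio.h>

	/* Illustrative x86_64 paging constants: 2 MiB PMD, 1 GiB PUD. */
	#define PMD_SIZE (1UL << 21)
	#define PUD_SIZE (1UL << 30)

	int main(void)
	{
		/* An addr that is PMD-aligned but not PUD-aligned. */
		unsigned long addr = 0xffffffff81000000UL;
		unsigned long next_pud = (addr & ~(PUD_SIZE - 1)) + PUD_SIZE;

		/* The buggy advance on pud_none(): jump a full PUD_SIZE... */
		unsigned long next = addr + PUD_SIZE;

		/*
		 * ...which lands the same distance *past* the next PUD
		 * boundary, so every PMD between that boundary and next
		 * is never visited by the cloning walk.
		 */
		printf("PMDs skipped in the next PUD: %lu\n",
		       (next - next_pud) / PMD_SIZE);	/* prints 8 here */
		return 0;
	}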
On Tue, 20 Aug 2019, Song Liu wrote:
pti_clone_pgtable() increases addr by PUD_SIZE for the pud_none(*pud) case. This is not accurate because addr may not be PUD_SIZE aligned.
You fail to explain how this happened. The code before the 32bit support did always increase by PMD_SIZE. The 32bit support broke that.
In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because of this issue, including the PMD for the irq entry table. For a memcache-like workload, this introduces about 4.5x more iTLB-loads and about 2.5x more iTLB-load-misses on a Skylake CPU.
This information is largely irrelevant. What matters is the fact that this got broken and incorrectly forwards the address by PUD_SIZE which is wrong if address is not PUD_SIZE aligned.
This patch fixes this issue by adding PMD_SIZE to addr for the pud_none() case.
git grep 'This patch' Documentation/process/submitting-patches.rst
Cc: stable@vger.kernel.org # v4.19+
Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>

 arch/x86/mm/pti.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..5a67c3015f59 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
-			addr += PUD_SIZE;
+			addr += PMD_SIZE;
 			continue;
 		}
The right fix is to skip forward to the next PUD boundary instead of doing this in a loop with PMD_SIZE increments.
Thanks,
tglx
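Sketched outside the kernel, the suggested skip could look like the hypothetical helper below; round_up() is open-coded here to match the kernel's definition, and the +1 (so an already PUD-aligned addr still moves forward) is a subtlety that gets settled further down the thread:

	/* Open-coded kernel round_up(); illustrative x86_64 PUD_SIZE. */
	#define PUD_SIZE       (1UL << 30)
	#define round_up(x, y) ((((x) - 1) | ((y) - 1)) + 1)

	/* Skip to the next PUD boundary instead of stepping by a fixed stride. */
	static unsigned long skip_to_next_pud(unsigned long addr)
	{
		return round_up(addr + 1, PUD_SIZE);
	}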
On Aug 20, 2019, at 2:12 AM, Thomas Gleixner tglx@linutronix.de wrote:
On Tue, 20 Aug 2019, Song Liu wrote:
pti_clone_pgtable() increases addr by PUD_SIZE for the pud_none(*pud) case. This is not accurate because addr may not be PUD_SIZE aligned.
You fail to explain how this happened. The code before the 32bit support did always increase by PMD_SIZE. The 32bit support broke that.
Will fix.
In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because of this issue, including the PMD for the irq entry table. For a memcache-like workload, this introduces about 4.5x more iTLB-loads and about 2.5x more iTLB-load-misses on a Skylake CPU.
This information is largely irrelevant. What matters is the fact that this got broken and incorrectly forwards the address by PUD_SIZE which is wrong if address is not PUD_SIZE aligned.
We started looking into this because we could not explain the regression in iTLB miss rate. I guess the patch itself explains it pretty well, so the original issue doesn't matter that much?
I will remove this part.
This patch fixes this issue by adding PMD_SIZE to addr for the pud_none() case.
git grep 'This patch' Documentation/process/submitting-patches.rst
Will fix.
Cc: stable@vger.kernel.org # v4.19+
Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>

 arch/x86/mm/pti.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..5a67c3015f59 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
-			addr += PUD_SIZE;
+			addr += PMD_SIZE;
 			continue;
 		}
The right fix is to skip forward to the next PUD boundary instead of doing this in a loop with PMD_SIZE increments.
Agreed.
Thanks, Song
On Tue, Aug 20, 2019 at 12:51:28AM -0700, Song Liu wrote:
pti_clone_pgtable() increases addr by PUD_SIZE for the pud_none(*pud) case. This is not accurate because addr may not be PUD_SIZE aligned.

In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because of this issue, including the PMD for the irq entry table. For a memcache-like workload, this introduces about 4.5x more iTLB-loads and about 2.5x more iTLB-load-misses on a Skylake CPU.

This patch fixes this issue by adding PMD_SIZE to addr for the pud_none() case.
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..5a67c3015f59 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
-			addr += PUD_SIZE;
+			addr += PMD_SIZE;
 			continue;
 		}
I'm thinking you're right in that there's a bug here, but I'm also thinking your patch is both incomplete and broken.
What that code wants to do is skip to the end of the pud; a PMD_SIZE increase will not do that. And right below this, there's a second instance of this exact pattern.
Did I get the below right?
---
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..32b20b3cb227 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
+			addr &= PUD_MASK;
 			addr += PUD_SIZE;
 			continue;
 		}

 		pmd = pmd_offset(pud, addr);
 		if (pmd_none(*pmd)) {
+			addr &= PMD_MASK;
 			addr += PMD_SIZE;
 			continue;
 		}
On Tue, 20 Aug 2019, Peter Zijlstra wrote:
What that code wants to do is skip to the end of the pud; a PMD_SIZE increase will not do that. And right below this, there's a second instance of this exact pattern.
Did I get the below right?
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..32b20b3cb227 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
+			addr &= PUD_MASK;
 			addr += PUD_SIZE;

	round_up(addr, PUD_SIZE);

perhaps?

 			continue;
 		}

 		pmd = pmd_offset(pud, addr);
 		if (pmd_none(*pmd)) {
+			addr &= PMD_MASK;
 			addr += PMD_SIZE;
 			continue;
 		}
On Aug 20, 2019, at 4:16 AM, Thomas Gleixner tglx@linutronix.de wrote:
On Tue, 20 Aug 2019, Peter Zijlstra wrote:
What that code wants to do is skip to the end of the pud; a PMD_SIZE increase will not do that. And right below this, there's a second instance of this exact pattern.
Did I get the below right?
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..32b20b3cb227 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
+			addr &= PUD_MASK;
 			addr += PUD_SIZE;

	round_up(addr, PUD_SIZE);

I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".

Thanks,
Song

perhaps?

 			continue;
 		}

 		pmd = pmd_offset(pud, addr);
 		if (pmd_none(*pmd)) {
+			addr &= PMD_MASK;
 			addr += PMD_SIZE;
 			continue;
 		}
On Tue, 20 Aug 2019, Song Liu wrote:
On Aug 20, 2019, at 4:16 AM, Thomas Gleixner tglx@linutronix.de wrote:
On Tue, 20 Aug 2019, Peter Zijlstra wrote:
What that code wants to do is skip to the end of the pud; a PMD_SIZE increase will not do that. And right below this, there's a second instance of this exact pattern.
Did I get the below right?
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..32b20b3cb227 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
+			addr &= PUD_MASK;
 			addr += PUD_SIZE;

	round_up(addr, PUD_SIZE);

I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".
Right you are.
On Tue, 2019-08-20 at 09:21 -0400, Song Liu wrote:
On Aug 20, 2019, at 4:16 AM, Thomas Gleixner tglx@linutronix.de wrote:
On Tue, 20 Aug 2019, Peter Zijlstra wrote:
What that code wants to do is skip to the end of the pud; a PMD_SIZE increase will not do that. And right below this, there's a second instance of this exact pattern.
Did I get the below right?
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..32b20b3cb227 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
+			addr &= PUD_MASK;
 			addr += PUD_SIZE;

	round_up(addr, PUD_SIZE);

I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".
What does that do if start is less than PMD_SIZE away from the next PUD_SIZE boundary?
How about: round_up(addr + 1, PUD_SIZE) ?
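Worked through with illustrative numbers (1 GiB PUD, 2 MiB PMD, round_up() open-coded as in the kernel), the difference between the two candidate expressions:

	#include <assert.h>

	#define PMD_SIZE       (1UL << 21)
	#define PUD_SIZE       (1UL << 30)
	#define round_up(x, y) ((((x) - 1) | ((y) - 1)) + 1)

	int main(void)
	{
		unsigned long boundary = 0xffffffff80000000UL;	/* a PUD boundary */
		unsigned long addr = boundary - 0x1000;	/* one PTE page short */

		/*
		 * addr + PMD_SIZE has already crossed the boundary, so this
		 * rounds up to the boundary *after* the next one: a whole
		 * PUD's worth of page tables is skipped.
		 */
		assert(round_up(addr + PMD_SIZE, PUD_SIZE) == boundary + PUD_SIZE);

		/* addr + 1 has not crossed it, so this stops exactly there. */
		assert(round_up(addr + 1, PUD_SIZE) == boundary);
		return 0;
	}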
On Aug 20, 2019, at 6:55 AM, Rik van Riel riel@fb.com wrote:
On Tue, 2019-08-20 at 09:21 -0400, Song Liu wrote:
On Aug 20, 2019, at 4:16 AM, Thomas Gleixner tglx@linutronix.de wrote:
On Tue, 20 Aug 2019, Peter Zijlstra wrote:
What that code wants to do is skip to the end of the pud; a PMD_SIZE increase will not do that. And right below this, there's a second instance of this exact pattern.
Did I get the below right?
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..32b20b3cb227 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
+			addr &= PUD_MASK;
 			addr += PUD_SIZE;

	round_up(addr, PUD_SIZE);

I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".
What does that do if start is less than PMD_SIZE away from the next PUD_SIZE boundary?
Great point!
How about: round_up(addr + 1, PUD_SIZE) ?
Yes. How about this?
=========================== 8< ============================
From 9ae74cff4faf4710a11cb8da4c4a3f3404bd9fdd Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Mon, 19 Aug 2019 23:59:47 -0700
Subject: [PATCH] x86/mm/pti: in pti_clone_pgtable(), increase addr properly
Before the 32-bit support, pti_clone_pmds() always added PMD_SIZE to addr. This behavior changed with the 32-bit support: pti_clone_pgtable() increases addr by PUD_SIZE for the pud_none(*pud) case, and by PMD_SIZE for the pmd_none(*pmd) case. However, this is not accurate because addr may not be PUD_SIZE/PMD_SIZE aligned.

Fix this issue by properly rounding addr up to the next PUD/PMD boundary in these two cases.
Cc: stable@vger.kernel.org # v4.19+
Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/mm/pti.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..1337494e22ef 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,13 +330,13 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
-			addr += PUD_SIZE;
+			addr = round_up(addr + 1, PUD_SIZE);
 			continue;
 		}

 		pmd = pmd_offset(pud, addr);
 		if (pmd_none(*pmd)) {
-			addr += PMD_SIZE;
+			addr = round_up(addr + 1, PMD_SIZE);
 			continue;
 		}
-- 2.17.1
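As a quick sanity check of the new advance expression, a throwaway user-space sketch (illustrative sizes, round_up() open-coded as in the kernel) confirms that round_up(addr + 1, size) always makes forward progress, moves by at most one unit, and always lands on a boundary:

	#include <assert.h>
	#include <stdio.h>

	#define PMD_SIZE       (1UL << 21)
	#define PUD_SIZE       (1UL << 30)
	#define round_up(x, y) ((((x) - 1) | ((y) - 1)) + 1)

	int main(void)
	{
		unsigned long sizes[] = { PMD_SIZE, PUD_SIZE };
		/* Aligned, barely-unaligned, and nearly-aligned offsets. */
		unsigned long offs[] = { 0, 1, 0x1000, PMD_SIZE - 1, PMD_SIZE };

		for (int i = 0; i < 2; i++) {
			for (int j = 0; j < 5; j++) {
				unsigned long addr = 0xffffffff81000000UL + offs[j];
				unsigned long next = round_up(addr + 1, sizes[i]);

				assert(next > addr);			/* progress */
				assert(next - addr <= sizes[i]);	/* no skips */
				assert((next & (sizes[i] - 1)) == 0);	/* aligned  */
			}
		}
		printf("round_up(addr + 1, size): OK\n");
		return 0;
	}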
On Tue, 2019-08-20 at 10:00 -0400, Song Liu wrote:
From 9ae74cff4faf4710a11cb8da4c4a3f3404bd9fdd Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Mon, 19 Aug 2019 23:59:47 -0700
Subject: [PATCH] x86/mm/pti: in pti_clone_pgtable(), increase addr properly
Before the 32-bit support, pti_clone_pmds() always added PMD_SIZE to addr. This behavior changed with the 32-bit support: pti_clone_pgtable() increases addr by PUD_SIZE for the pud_none(*pud) case, and by PMD_SIZE for the pmd_none(*pmd) case. However, this is not accurate because addr may not be PUD_SIZE/PMD_SIZE aligned.

Fix this issue by properly rounding addr up to the next PUD/PMD boundary in these two cases.
Cc: stable@vger.kernel.org # v4.19+
Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
This looks like it should do the trick!
Reviewed-by: Rik van Riel riel@surriel.com
On Tue, 2019-08-20 at 13:16 +0200, Thomas Gleixner wrote:
On Tue, 20 Aug 2019, Peter Zijlstra wrote:
What that code wants to do is skip to the end of the pud; a PMD_SIZE increase will not do that. And right below this, there's a second instance of this exact pattern.
Did I get the below right?
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..32b20b3cb227 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
+			addr &= PUD_MASK;
 			addr += PUD_SIZE;

	round_up(addr, PUD_SIZE);

perhaps?
Won't that keep returning the same address forever if addr is already PUD aligned, i.e. (addr & ~PUD_MASK) == 0?
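A short check with the same open-coded round_up() and an illustrative PUD-aligned address confirms the concern, and shows why the +1 variant is safe:

	#include <assert.h>

	#define PUD_SIZE       (1UL << 30)
	#define round_up(x, y) ((((x) - 1) | ((y) - 1)) + 1)

	int main(void)
	{
		unsigned long aligned = 0xffffffff80000000UL;	/* on a PUD boundary */

		/* No-op on an aligned addr: the cloning loop would spin forever. */
		assert(round_up(aligned, PUD_SIZE) == aligned);

		/* The +1 variant always advances to the next boundary. */
		assert(round_up(aligned + 1, PUD_SIZE) == aligned + PUD_SIZE);
		return 0;
	}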
On Aug 20, 2019, at 3:00 AM, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Aug 20, 2019 at 12:51:28AM -0700, Song Liu wrote:
pti_clone_pgtable() increases addr by PUD_SIZE for the pud_none(*pud) case. This is not accurate because addr may not be PUD_SIZE aligned.

In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because of this issue, including the PMD for the irq entry table. For a memcache-like workload, this introduces about 4.5x more iTLB-loads and about 2.5x more iTLB-load-misses on a Skylake CPU.

This patch fixes this issue by adding PMD_SIZE to addr for the pud_none() case.
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..5a67c3015f59 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
-			addr += PUD_SIZE;
+			addr += PMD_SIZE;
 			continue;
 		}
I'm thinking you're right in that there's a bug here, but I'm also thinking your patch is both incomplete and broken.
What that code wants to do is skip to the end of the pud; a PMD_SIZE increase will not do that. And right below this, there's a second instance of this exact pattern.
Did I get the below right?
Yes, you are right.
Thanks, Song
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..32b20b3cb227 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
+			addr &= PUD_MASK;
 			addr += PUD_SIZE;
 			continue;
 		}

 		pmd = pmd_offset(pud, addr);
 		if (pmd_none(*pmd)) {
+			addr &= PMD_MASK;
 			addr += PMD_SIZE;
 			continue;
 		}
On 8/20/19 12:51 AM, Song Liu wrote:
In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because of this issue, including the PMD for the irq entry table. For a memcache-like workload, this introduces about 4.5x more iTLB-loads and about 2.5x more iTLB-load-misses on a Skylake CPU.

I was surprised that this manifests as a performance issue. Usually messing up PTI page table manipulation means you get to experience the joys of debugging triple faults. But, it makes sense if it's this line:
	/*
	 * Note that this will undo _some_ of the work that
	 * pti_set_kernel_image_nonglobal() did to clear the
	 * global bit.
	 */
	pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);
which is restoring the Global bit.
*But*, that shouldn't get hit on a Skylake CPU since those have PCIDs and shouldn't have a global kernel image. Could you confirm whether PCIDs are supported on this CPU?
 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
-			addr += PUD_SIZE;
+			addr += PMD_SIZE;
 			continue;
 		}
Did we also bugger up this code:
	pmd = pmd_offset(pud, addr);
	if (pmd_none(*pmd)) {
		addr += PMD_SIZE;
		continue;
	}
if we're on 32-bit and this:
#define PTI_LEVEL_KERNEL_IMAGE PTI_CLONE_PTE
and we get a hole walking to a non-PMD-aligned address?
On Aug 20, 2019, at 6:57 AM, Dave Hansen dave.hansen@intel.com wrote:
On 8/20/19 12:51 AM, Song Liu wrote:
In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because of this issue, including the PMD for the irq entry table. For a memcache-like workload, this introduces about 4.5x more iTLB-loads and about 2.5x more iTLB-load-misses on a Skylake CPU.

I was surprised that this manifests as a performance issue. Usually messing up PTI page table manipulation means you get to experience the joys of debugging triple faults. But, it makes sense if it's this line:
	/*
	 * Note that this will undo _some_ of the work that
	 * pti_set_kernel_image_nonglobal() did to clear the
	 * global bit.
	 */
	pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);
which is restoring the Global bit.
*But*, that shouldn't get hit on a Skylake CPU since those have PCIDs and shouldn't have a global kernel image. Could you confirm whether PCIDs are supported on this CPU?
Yes, pcid is listed in /proc/cpuinfo.
Thanks, Song
On 8/20/19 7:14 AM, Song Liu wrote:
*But*, that shouldn't get hit on a Skylake CPU since those have PCIDs and shouldn't have a global kernel image. Could you confirm whether PCIDs are supported on this CPU?
Yes, pcid is listed in /proc/cpuinfo.
So what's going on? Could you confirm exactly which pti_clone_pgtable() is causing you problems? Do you have a theory as to why this manifests as a performance problem rather than a functional one?
A diff of these:
/sys/kernel/debug/page_tables/current_user /sys/kernel/debug/page_tables/current_kernel
before and after your patch might be helpful.
On Aug 20, 2019, at 7:18 AM, Dave Hansen dave.hansen@intel.com wrote:
On 8/20/19 7:14 AM, Song Liu wrote:
*But*, that shouldn't get hit on a Skylake CPU since those have PCIDs and shouldn't have a global kernel image. Could you confirm whether PCIDs are supported on this CPU?
Yes, pcid is listed in /proc/cpuinfo.
So what's going on? Could you confirm exactly which pti_clone_pgtable() is causing you problems? Do you have a theory as to why this manifests as a performance problem rather than a functional one?
A diff of these:
/sys/kernel/debug/page_tables/current_user /sys/kernel/debug/page_tables/current_kernel
before and after your patch might be helpful.
I believe the difference is from the following entries (7 PMDs)
Before the patch:
current_kernel:
0xffffffff81000000-0xffffffff81e04000       14352K     ro           GLB x  pte
efi:
0xffffffff81000000-0xffffffff81e04000       14352K     ro           GLB x  pte
kernel:
0xffffffff81000000-0xffffffff81e04000       14352K     ro           GLB x  pte
After the patch:
current_kernel:
0xffffffff81000000-0xffffffff81e00000          14M     ro  PSE      GLB x  pmd
efi:
0xffffffff81000000-0xffffffff81e00000          14M     ro  PSE      GLB x  pmd
kernel:
0xffffffff81000000-0xffffffff81e00000          14M     ro  PSE      GLB x  pmd
current_kernel and kernel show the same data, though.
Thanks, Song
On Aug 20, 2019, at 9:05 AM, Song Liu songliubraving@fb.com wrote:
On Aug 20, 2019, at 7:18 AM, Dave Hansen dave.hansen@intel.com wrote:
On 8/20/19 7:14 AM, Song Liu wrote:
*But*, that shouldn't get hit on a Skylake CPU since those have PCIDs and shouldn't have a global kernel image. Could you confirm whether PCIDs are supported on this CPU?
Yes, pcid is listed in /proc/cpuinfo.
So what's going on? Could you confirm exactly which pti_clone_pgtable() is causing you problems? Do you have a theory as to why this manifests as a performance problem rather than a functional one?
A diff of these:
/sys/kernel/debug/page_tables/current_user /sys/kernel/debug/page_tables/current_kernel
before and after your patch might be helpful.
I believe the difference is from the following entries (7 PMDs)
Before the patch:
current_kernel:
0xffffffff81000000-0xffffffff81e04000       14352K     ro           GLB x  pte
efi:
0xffffffff81000000-0xffffffff81e04000       14352K     ro           GLB x  pte
kernel:
0xffffffff81000000-0xffffffff81e04000       14352K     ro           GLB x  pte
After the patch:
current_kernel:
0xffffffff81000000-0xffffffff81e00000          14M     ro  PSE      GLB x  pmd
efi:
0xffffffff81000000-0xffffffff81e00000          14M     ro  PSE      GLB x  pmd
kernel:
0xffffffff81000000-0xffffffff81e00000          14M     ro  PSE      GLB x  pmd
current_kernel and kernel show the same data, though.
A little more detail on how I got here.

We use huge pages for hot text and thus reduce iTLB misses. As we benchmarked 5.2-based kernels (vs. 4.16-based), we found ~2.5x more iTLB misses.

To figure out the issue, I used a debug patch that dumps the page table for a pid. The following is information from the workload pid.
For the 4.16-based kernel:
host-4.16 # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000     8M USR ro  PSE      x pmd
0xffffffff81a00000-0xffffffff81c00000     2M     ro  PSE      x pmd
For the 5.2-based kernel before this patch:
host-5.2-before # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000     8M USR ro  PSE      x pmd
The 8MB of text in PMDs is from user space. The 4.16 kernel has 1 PMD for the irq entry table, while the 5.2 kernel (before this patch) doesn't have it.
For the 5.2-based kernel after this patch:
host-5.2-after # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000     8M USR ro  PSE      x pmd
0xffffffff81000000-0xffffffff81e00000    14M     ro  PSE  GLB x pmd
So after this patch, the 5.2-based kernel has 7 PMDs instead of the 1 PMD in the 4.16 kernel. This further reduces the iTLB miss rate.
Thanks, Song