The patch titled
Subject: mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release
has been added to the -mm tree. Its filename is
mm-memcg-slab-fix-panic-in-__free_slab-caused-by-premature-memcg-pointer-release.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-slab-fix-panic-in-__free_…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-memcg-slab-fix-panic-in-__free_…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Roman Gushchin <guro(a)fb.com>
Subject: mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release
Karsten reported the following panic in __free_slab(), happening on an
s390x machine:
[  349.361168] Unable to handle kernel pointer dereference in virtual kernel address space
[  349.361210] Failing address: 0000000000000000 TEID: 0000000000000483
[  349.361223] Fault in home space mode while using kernel ASCE.
[  349.361240] AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
[  349.361340] Oops: 0004 ilc:3 [#1] PREEMPT SMP
[  349.361349] Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT \
 nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle \
 ip6table_raw ip6table_security iptable_at nf_nat
[  349.361436] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
[  349.361445] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
[  349.361450] Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
[  349.361464]  R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[  349.361470] Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
[  349.361475]  0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
[  349.361481]  0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
[  349.361486]  000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
[  349.361500] Krnl Code: 00000000003cada6: e310a1400004  lg   %r1,320(%r10)
[  349.361500]            00000000003cadac: c0e50046c286  brasl %r14,ca32b8
[  349.361500]           #00000000003cadb2: a7f4fe36      brc  15,3caa1e
[  349.361500]           >00000000003cadb6: e32060800024  stg  %r2,128(%r6)
[  349.361500]            00000000003cadbc: a7f4fd9e      brc  15,3ca8f8
[  349.361500]            00000000003cadc0: c0e50046790c  brasl %r14,c99fd8
[  349.361500]            00000000003cadc6: a7f4fe2c      brc  15,3caa1e
[  349.361500]            00000000003cadca: ecb1ffff00d9  aghik %r11,%r1,-1
[  349.361619] Call Trace:
[  349.361627] ([<00000000003cabcc>] __free_slab+0x49c/0x6b0)
[  349.361634]  [<00000000001f5886>] rcu_core+0x5a6/0x7e0
[  349.361643]  [<0000000000ca2dea>] __do_softirq+0xf2/0x5c0
[  349.361652]  [<0000000000152644>] irq_exit+0x104/0x130
[  349.361659]  [<000000000010d222>] do_IRQ+0x9a/0xf0
[  349.361667]  [<0000000000ca2344>] ext_int_handler+0x130/0x134
[  349.361674]  [<0000000000103648>] enabled_wait+0x58/0x128
[  349.361681] ([<0000000000103634>] enabled_wait+0x44/0x128)
[  349.361688]  [<0000000000103b00>] arch_cpu_idle+0x40/0x58
[  349.361695]  [<0000000000ca0544>] default_idle_call+0x3c/0x68
[  349.361704]  [<000000000018eaa4>] do_idle+0xec/0x1c0
[  349.361748]  [<000000000018ee0e>] cpu_startup_entry+0x36/0x40
[  349.361756]  [<000000000122df34>] arch_call_rest_init+0x5c/0x88
[  349.361761]  [<0000000000000000>] 0x0
[  349.361765] INFO: lockdep is turned off.
[  349.361769] Last Breaking-Event-Address:
[  349.361774]  [<00000000003ca8f4>] __free_slab+0x1c4/0x6b0
[  349.361781] Kernel panic - not syncing: Fatal exception in interrupt
The kernel panics on an attempt to dereference the NULL memcg pointer.
When shutdown_cache() is called from the kmem_cache_destroy() context, a
memcg kmem_cache might have empty slab pages on a partial list, which are
still charged to the memory cgroup.  These pages are released by
free_partial() at the beginning of shutdown_cache(): either directly or by
scheduling an RCU-delayed work (if the kmem_cache has the
SLAB_TYPESAFE_BY_RCU flag).  The latter case is where the reported panic
can happen: memcg_unlink_cache() is called immediately after shrinking the
partial lists, without waiting for the scheduled RCU work to complete.  It
sets the kmem_cache->memcg_params.memcg pointer to NULL, and the
subsequent attempt by __free_slab() to dereference it from the RCU work
context causes the panic.
To fix the issue, let's postpone the release of the memcg pointer to
destroy_memcg_params().  It is called from a separate work context by
slab_caches_to_rcu_destroy_workfn(), which contains a full RCU barrier.
This guarantees that all scheduled page-release RCU work completes
before the memcg pointer is zeroed.
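To make the ordering concrete, here is a small userspace model of the race
(a sketch only: the struct names and the printf stand-in are hypothetical,
and all real kernel synchronization is elided):

#include <stdio.h>

/* Toy stand-ins for the kernel structures involved. */
struct mem_cgroup { const char *name; };
struct kmem_cache_model { struct mem_cgroup *memcg; };

/* Models the RCU-delayed __free_slab() work for a SLAB_TYPESAFE_BY_RCU
 * cache: uncharging the page dereferences s->memcg. */
static void delayed_free_slab(struct kmem_cache_model *s)
{
	printf("uncharge to memcg %s\n", s->memcg->name); /* oops if NULL */
}

int main(void)
{
	struct mem_cgroup cg = { "dying-cgroup" };
	struct kmem_cache_model s = { &cg };

	/* Buggy order: memcg_unlink_cache() clears the pointer before the
	 * scheduled RCU work has run... */
	s.memcg = NULL;

	/* ...so the delayed work dereferences NULL, as in the panic above. */
	delayed_free_slab(&s);

	/* With the fix, the pointer is cleared only in
	 * destroy_memcg_params(), which runs after a full RCU barrier,
	 * i.e. after every delayed free has completed. */
	return 0;
}

Run as-is, this crashes at the printf, mirroring the NULL dereference seen
in the RCU work context.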
Big thanks to Karsten for the perfect report containing all the necessary
information, for his help with the analysis of the problem, and for
testing the fix.
Link: http://lkml.kernel.org/r/20191010160549.1584316-1-guro@fb.com
Fixes: fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
Signed-off-by: Roman Gushchin <guro(a)fb.com>
Reported-by: Karsten Graul <kgraul(a)linux.ibm.com>
Tested-by: Karsten Graul <kgraul(a)linux.ibm.com>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: Shakeel Butt <shakeelb(a)google.com>
Cc: Karsten Graul <kgraul(a)linux.ibm.com>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/slab_common.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
--- a/mm/slab_common.c~mm-memcg-slab-fix-panic-in-__free_slab-caused-by-premature-memcg-pointer-release
+++ a/mm/slab_common.c
@@ -178,10 +178,13 @@ static int init_memcg_params(struct kmem
static void destroy_memcg_params(struct kmem_cache *s)
{
- if (is_root_cache(s))
+ if (is_root_cache(s)) {
kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
- else
+ } else {
+ mem_cgroup_put(s->memcg_params.memcg);
+ WRITE_ONCE(s->memcg_params.memcg, NULL);
percpu_ref_exit(&s->memcg_params.refcnt);
+ }
}
static void free_memcg_params(struct rcu_head *rcu)
@@ -253,8 +256,6 @@ static void memcg_unlink_cache(struct km
} else {
list_del(&s->memcg_params.children_node);
list_del(&s->memcg_params.kmem_caches_node);
- mem_cgroup_put(s->memcg_params.memcg);
- WRITE_ONCE(s->memcg_params.memcg, NULL);
}
}
#else
_
Patches currently in -mm which might be from guro(a)fb.com are
mm-memcg-slab-fix-panic-in-__free_slab-caused-by-premature-memcg-pointer-release.patch
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: 2aa85f246c181b1fa89f27e8e20c5636426be624
Gitweb: https://git.kernel.org/tip/2aa85f246c181b1fa89f27e8e20c5636426be624
Author: Steve Wahl <steve.wahl(a)hpe.com>
AuthorDate: Tue, 24 Sep 2019 16:03:55 -05:00
Committer: Borislav Petkov <bp(a)suse.de>
CommitterDate: Fri, 11 Oct 2019 18:38:15 +02:00
x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area
Our hardware (UV aka Superdome Flex) has address ranges marked
reserved by the BIOS. Access to these ranges is caught as an error,
causing the BIOS to halt the system.
Initial page tables mapped a large range of physical addresses that
were not checked against the list of BIOS reserved addresses, and
sometimes included reserved addresses in part of the mapped range.
Including the reserved range in the map allowed processor speculative
accesses to the reserved range, triggering a BIOS halt.
Used early in booting, the page table level2_kernel_pgt covers 1 GiB of
address space divided into 2 MiB pages, and it was set up to linearly map
a full 1 GiB of physical addresses that included the physical address
range of the kernel image, as chosen by KASLR.  But this also included a
large range of unused addresses on either side of the kernel image.  And
unlike the kernel image's physical address range, this extra mapped space
was not checked against the BIOS tables of usable RAM addresses.  So there
were times when the addresses chosen by KASLR would result in
processor-accessible mappings of BIOS-reserved physical addresses.
The kernel code did not directly access any of this extra mapped
space, but having it mapped allowed the processor to issue speculative
accesses into reserved memory, causing system halts.
This was encountered somewhat rarely on a normal system boot, and much
more often when starting the crash kernel if "crashkernel=512M,high"
was specified on the command line (this heavily restricts the physical
address of the crash kernel, in our case usually within 1 GiB of
reserved space).
The solution is to invalidate the pages of this table outside the kernel
image's space before the page table is activated. It fixes this problem
on our hardware.
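For a concrete feel for the range being trimmed, here is a stand-alone
illustration of the index arithmetic (the kernel placement below is made
up for the example; only PMD_SHIFT and PTRS_PER_PMD reflect the real
2 MiB / 1 GiB layout):

#include <stdio.h>
#include <stdint.h>

#define PMD_SHIFT    21		/* each PMD entry maps 2 MiB */
#define PTRS_PER_PMD 512	/* 512 * 2 MiB = 1 GiB per table */

static unsigned int pmd_index(uint64_t addr)
{
	return (addr >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}

int main(void)
{
	/* Hypothetical KASLR-chosen physical placement of the image. */
	uint64_t text = 0x01000000;	/* _text at 16 MiB   -> entry 8  */
	uint64_t end  = 0x027fffff;	/* _end under 40 MiB -> entry 19 */

	printf("kernel image occupies pmd entries %u..%u\n",
	       pmd_index(text), pmd_index(end));

	/* The patch clears _PAGE_PRESENT in entries 0..7 and 20..511,
	 * leaving only the firmware-validated kernel range mapped. */
	return 0;
}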
[ bp: Touchups. ]
Signed-off-by: Steve Wahl <steve.wahl(a)hpe.com>
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Acked-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Brijesh Singh <brijesh.singh(a)amd.com>
Cc: dimitri.sivanich(a)hpe.com
Cc: Feng Tang <feng.tang(a)intel.com>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Jordan Borgner <mail(a)jordan-borgner.de>
Cc: Juergen Gross <jgross(a)suse.com>
Cc: mike.travis(a)hpe.com
Cc: russ.anderson(a)hpe.com
Cc: stable(a)vger.kernel.org
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: x86-ml <x86(a)kernel.org>
Cc: Zhenzhong Duan <zhenzhong.duan(a)oracle.com>
Link: https://lkml.kernel.org/r/9c011ee51b081534a7a15065b1681d200298b530.15693585…
---
arch/x86/kernel/head64.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 29ffa49..206a4b6 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -222,13 +222,31 @@ unsigned long __head __startup_64(unsigned long physaddr,
* we might write invalid pmds, when the kernel is relocated
* cleanup_highmap() fixes this up along with the mappings
* beyond _end.
+ *
+ * Only the region occupied by the kernel image has so far
+ * been checked against the table of usable memory regions
+ * provided by the firmware, so invalidate pages outside that
+ * region. A page table entry that maps to a reserved area of
+ * memory would allow processor speculation into that area,
+ * and on some hardware (particularly the UV platform) even
+ * speculative access to some reserved areas is caught as an
+ * error, causing the BIOS to halt the system.
*/
pmd = fixup_pointer(level2_kernel_pgt, physaddr);
- for (i = 0; i < PTRS_PER_PMD; i++) {
+
+ /* invalidate pages before the kernel image */
+ for (i = 0; i < pmd_index((unsigned long)_text); i++)
+ pmd[i] &= ~_PAGE_PRESENT;
+
+ /* fixup pages that are part of the kernel image */
+ for (; i <= pmd_index((unsigned long)_end); i++)
if (pmd[i] & _PAGE_PRESENT)
pmd[i] += load_delta;
- }
+
+ /* invalidate pages after the kernel image */
+ for (; i < PTRS_PER_PMD; i++)
+ pmd[i] &= ~_PAGE_PRESENT;
/*
* Fixup phys_base - remove the memory encryption mask to obtain
From: Sean Christopherson <sean.j.christopherson(a)intel.com>
James Harvey reported a livelock that was introduced by commit
d012a06ab1d23 ("Revert "KVM: x86/mmu: Zap only the relevant pages when
removing a memslot"").
The livelock occurs because kvm_mmu_zap_all() as it exists today will
voluntarily reschedule and drop KVM's mmu_lock, which allows other vCPUs
to add shadow pages. With enough vCPUs, kvm_mmu_zap_all() can get stuck
in an infinite loop as it can never zap all pages before observing lock
contention or the need to reschedule. The equivalent of kvm_mmu_zap_all()
that was in use at the time of the reverted commit (4e103134b8623, "KVM:
x86/mmu: Zap only the relevant pages when removing a memslot") employed
a fast invalidate mechanism and was not susceptible to the above livelock.
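To see why the loop cannot finish, consider this toy model (the numbers
are invented; the point is only that dropping mmu_lock lets the producers
outpace the zapper):

#include <stdio.h>

int main(void)
{
	long pages = 8;

	/* Each pass models one trip through the zap loop: zap a batch,
	 * then cond_resched_lock() drops mmu_lock, during which vCPUs
	 * fault in more shadow pages than were just zapped. */
	for (int pass = 1; pass <= 5; pass++) {
		long batch = pages < 4 ? pages : 4;
		pages -= batch;	/* zapped under mmu_lock */
		pages += 6;	/* added while the lock was dropped */
		printf("pass %d: %ld pages still on the list\n", pass, pages);
	}

	/* The list never empties, so the real function never returns. */
	return 0;
}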
There are three ways to fix the livelock:
- Reverting the revert (commit d012a06ab1d23) is not a viable option, as
the revert is needed to fix a regression that occurs when the guest has
one or more assigned devices.  It is unlikely we will root-cause the
device assignment regression soon enough to fix it in a timely manner.
- Remove the conditional reschedule from kvm_mmu_zap_all().  Although this
would be a smaller code change, it is less safe in the sense that the
resulting kvm_mmu_zap_all() hasn't been used in the wild for flushing
memslots since the fast invalidate mechanism was introduced by commit
6ca18b6950f8d ("KVM: x86: use the fast way to invalidate all pages"),
back in 2013.
- Reintroduce the fast invalidate mechanism and use it when zapping shadow
pages in response to a memslot being deleted or moved, which is what this
patch does (the core idea is sketched below).
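The core of the fast invalidate idea, condensed from the diff below (the
field and helper names are taken from the patch itself; everything else
is simplified):

/* Each shadow page records the generation it was created in. */
static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
}

/* Invalidating all shadow pages is then a single counter bump; pages
 * from older generations are skipped by lookups and reclaimed lazily
 * by kvm_zap_obsolete_pages(). */
static void kvm_mmu_zap_all_fast(struct kvm *kvm)
{
	spin_lock(&kvm->mmu_lock);
	kvm->arch.mmu_valid_gen++;
	kvm_zap_obsolete_pages(kvm);
	spin_unlock(&kvm->mmu_lock);
}

This avoids the livelock because pages created after the generation bump
are already valid in the new generation: the zap loop only has to retire
the finite set of pre-existing pages.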
For all intents and purposes, this is a revert of commit ea145aacf4ae8
("Revert "KVM: MMU: fast invalidate all pages"") and a partial revert of
commit 7390de1e99a70 ("Revert "KVM: x86: use the fast way to invalidate
all pages""), i.e. restores the behavior of commit 5304b8d37c2a5 ("KVM:
MMU: fast invalidate all pages") and commit 6ca18b6950f8d ("KVM: x86:
use the fast way to invalidate all pages") respectively.
Fixes: d012a06ab1d23 ("Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot"")
Reported-by: James Harvey <jamespharvey20(a)gmail.com>
Cc: Alex Williamson <alex.williamson(a)redhat.com>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson(a)intel.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
(cherry picked from commit 002c5f73c508f7df5681bda339831c27f3c1aef4)
---
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/mmu.c | 101 +++++++++++++++++++++++++++++++-
2 files changed, 101 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7038e79b9851..58c03158fe5e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -336,6 +336,7 @@ struct kvm_mmu_page {
int root_count; /* Currently serving as active root */
unsigned int unsync_children;
struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
+ unsigned long mmu_valid_gen;
DECLARE_BITMAP(unsync_child_bitmap, 512);
#ifdef CONFIG_X86_32
@@ -850,6 +851,7 @@ struct kvm_arch {
unsigned long n_requested_mmu_pages;
unsigned long n_max_mmu_pages;
unsigned int indirect_shadow_pages;
+ unsigned long mmu_valid_gen;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
/*
* Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d3ec7d3c0ed3..a8eeea5a2a18 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2060,6 +2060,12 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct
if (!direct)
sp->gfns = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache);
set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+
+ /*
+ * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
+ * depends on valid pages being added to the head of the list. See
+ * comments in kvm_zap_obsolete_pages().
+ */
list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
kvm_mod_used_mmu_pages(vcpu->kvm, +1);
return sp;
@@ -2209,7 +2215,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
#define for_each_valid_sp(_kvm, _sp, _gfn) \
hlist_for_each_entry(_sp, \
&(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)], hash_link) \
- if ((_sp)->role.invalid) { \
+ if (is_obsolete_sp((_kvm), (_sp)) || (_sp)->role.invalid) { \
} else
#define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn) \
@@ -2266,6 +2272,11 @@ static void kvm_mmu_audit(struct kvm_vcpu *vcpu, int point) { }
static void mmu_audit_disable(void) { }
#endif
+static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
+}
+
static bool kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct list_head *invalid_list)
{
@@ -2490,6 +2501,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
if (level > PT_PAGE_TABLE_LEVEL && need_sync)
flush |= kvm_sync_pages(vcpu, gfn, &invalid_list);
}
+ sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
clear_page(sp->spt);
trace_kvm_mmu_get_page(sp, true);
@@ -4221,6 +4233,13 @@ static bool fast_cr3_switch(struct kvm_vcpu *vcpu, gpa_t new_cr3,
return false;
if (cached_root_available(vcpu, new_cr3, new_role)) {
+ /*
+ * It is possible that the cached previous root page is
+ * obsolete because of a change in the MMU generation
+ * number. However, changing the generation number is
+ * accompanied by KVM_REQ_MMU_RELOAD, which will free
+ * the root set here and allocate a new one.
+ */
kvm_make_request(KVM_REQ_LOAD_CR3, vcpu);
if (!skip_tlb_flush) {
kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
@@ -5637,11 +5656,89 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
return alloc_mmu_pages(vcpu);
}
+
+static void kvm_zap_obsolete_pages(struct kvm *kvm)
+{
+ struct kvm_mmu_page *sp, *node;
+ LIST_HEAD(invalid_list);
+ int ign;
+
+restart:
+ list_for_each_entry_safe_reverse(sp, node,
+ &kvm->arch.active_mmu_pages, link) {
+ /*
+ * No obsolete valid page exists before a newly created page
+ * since active_mmu_pages is a FIFO list.
+ */
+ if (!is_obsolete_sp(kvm, sp))
+ break;
+
+ /*
+ * Do not repeatedly zap a root page to avoid unnecessary
+ * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
+ * progress:
+ * vcpu 0 vcpu 1
+ * call vcpu_enter_guest():
+ * 1): handle KVM_REQ_MMU_RELOAD
+ * and require mmu-lock to
+ * load mmu
+ * repeat:
+ * 1): zap root page and
+ * send KVM_REQ_MMU_RELOAD
+ *
+ * 2): if (cond_resched_lock(mmu-lock))
+ *
+ * 2): hold mmu-lock and load mmu
+ *
+ * 3): see KVM_REQ_MMU_RELOAD bit
+ * on vcpu->requests is set
+ * then return 1 to call
+ * vcpu_enter_guest() again.
+ * goto repeat;
+ *
+ * Since we are walking the list in reverse and invalid pages
+ * are moved to the head, skipping invalid pages avoids
+ * walking the list endlessly.
+ */
+ if (sp->role.invalid)
+ continue;
+
+ if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ cond_resched_lock(&kvm->mmu_lock);
+ goto restart;
+ }
+
+ if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign))
+ goto restart;
+ }
+
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+}
+
+/*
+ * Fast invalidate all shadow pages and use lock-break technique
+ * to zap obsolete pages.
+ *
+ * This is required when a memslot is being deleted or the VM is
+ * being destroyed; in these cases, we must ensure that the KVM
+ * MMU does not use any resource of the slot being deleted, or of
+ * any slot, after this function returns.
+ */
+static void kvm_mmu_zap_all_fast(struct kvm *kvm)
+{
+ spin_lock(&kvm->mmu_lock);
+ kvm->arch.mmu_valid_gen++;
+
+ kvm_zap_obsolete_pages(kvm);
+ spin_unlock(&kvm->mmu_lock);
+}
+
static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot,
struct kvm_page_track_notifier_node *node)
{
- kvm_mmu_zap_all(kvm);
+ kvm_mmu_zap_all_fast(kvm);
}
void kvm_mmu_init_vm(struct kvm *kvm)
--
2.21.0