We have usecase to use tmpfs as QEMU memory backend and we would like to
take the advantage of THP as well. But, our test shows the EPT is not
PMD mapped even though the underlying THP are PMD mapped on host.
The number showed by /sys/kernel/debug/kvm/largepage is much less than
the number of PMD mapped shmem pages as the below:
7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
Size: 4194304 kB
[snip]
AnonHugePages: 0 kB
ShmemPmdMapped: 579584 kB
[snip]
Locked: 0 kB
cat /sys/kernel/debug/kvm/largepages
12
And some benchmarks do worse than with anonymous THPs.
By digging into the code we figured out that commit 127393fbe597 ("mm:
thp: kvm: fix memory corruption in KVM with THP enabled") checks if
there is a single PTE mapping on the page for anonymous THP when
setting up EPT map. But, the _mapcount < 0 check doesn't fit to page
cache THP since every subpage of page cache THP would get _mapcount
inc'ed once it is PMD mapped, so PageTransCompoundMap() always returns
false for page cache THP. This would prevent KVM from setting up PMD
mapped EPT entry.
So we need handle page cache THP correctly. However, when page cache
THP's PMD gets split, kernel just remove the map instead of setting up
PTE map like what anonymous THP does. Before KVM calls get_user_pages()
the subpages may get PTE mapped even though it is still a THP since the
page cache THP may be mapped by other processes at the mean time.
Checking its _mapcount and whether the THP has PTE mapped or not.
Although this may report some false negative cases (PTE mapped by other
processes), it looks not trivial to make this accurate.
With this fix /sys/kernel/debug/kvm/largepage would show reasonable
pages are PMD mapped by EPT as the below:
7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted)
Size: 4194304 kB
[snip]
AnonHugePages: 0 kB
ShmemPmdMapped: 557056 kB
[snip]
Locked: 0 kB
cat /sys/kernel/debug/kvm/largepages
271
And the benchmarks are as same as anonymous THPs.
Signed-off-by: Yang Shi <yang.shi(a)linux.alibaba.com>
Reported-by: Gang Deng <gavin.dg(a)linux.alibaba.com>
Tested-by: Gang Deng <gavin.dg(a)linux.alibaba.com>
Suggested-by: Hugh Dickins <hughd(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: <stable(a)vger.kernel.org> 4.8+
---
v2: Adopted the suggestion from Hugh to use _mapcount and compound_mapcount.
But I just open coding compound_mapcount to avoid duplicating the
definition of compound_mapcount_ptr in two different files. Since
"compound_mapcount" looks self-explained so I'm supposed the open
coding would not affect the readability.
include/linux/page-flags.h | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb88..954a877 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -622,12 +622,30 @@ static inline int PageTransCompound(struct page *page)
*
* Unlike PageTransCompound, this is safe to be called only while
* split_huge_pmd() cannot run from under us, like if protected by the
- * MMU notifier, otherwise it may result in page->_mapcount < 0 false
+ * MMU notifier, otherwise it may result in page->_mapcount check false
* positives.
+ *
+ * We have to treat page cache THP differently since every subpage of it
+ * would get _mapcount inc'ed once it is PMD mapped. But, it may be PTE
+ * mapped in the current process so comparing subpage's _mapcount to
+ * compound_mapcount ot filter out PTE mapped case.
*/
static inline int PageTransCompoundMap(struct page *page)
{
- return PageTransCompound(page) && atomic_read(&page->_mapcount) < 0;
+ struct page *head;
+ int map_count;
+
+ if (!PageTransCompound(page))
+ return 0;
+
+ if (PageAnon(page))
+ return atomic_read(&page->_mapcount) < 0;
+
+ head = compound_head(page);
+ map_count = atomic_read(&page->_mapcount);
+ /* File THP is PMD mapped and not double mapped */
+ return map_count >= 0 &&
+ map_count == atomic_read(&head[1].compound_mapcount);
}
/*
--
1.8.3.1
Commit d698a388146c ("of: reserved-memory: ignore disabled memory-region
nodes") added an early return in of_reserved_mem_device_init_by_idx(), but
didn't call of_node_put() on a device_node whose ref-count was incremented
in the call to of_parse_phandle() preceding the early exit.
Fixes: d698a388146c ("of: reserved-memory: ignore disabled memory-region nodes")
Signed-off-by: Chris Goldsworthy <cgoldswo(a)codeaurora.org>
To: Rob Herring <robh+dt(a)kernel.org>
Cc: devicetree(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-arm-msm(a)vger.kernel.org
Cc: linux-arm-kernel(a)lists.infradead.org
---
drivers/of/of_reserved_mem.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c
index 7989703..6bd610e 100644
--- a/drivers/of/of_reserved_mem.c
+++ b/drivers/of/of_reserved_mem.c
@@ -324,8 +324,10 @@ int of_reserved_mem_device_init_by_idx(struct device *dev,
if (!target)
return -ENODEV;
- if (!of_device_is_available(target))
+ if (!of_device_is_available(target)) {
+ of_node_put(target);
return 0;
+ }
rmem = __find_rmem(target);
of_node_put(target);
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
We have usecase to use tmpfs as QEMU memory backend and we would like to
take the advantage of THP as well. But, our test shows the EPT is not
PMD mapped even though the underlying THP are PMD mapped on host.
The number showed by /sys/kernel/debug/kvm/largepage is much less than
the number of PMD mapped shmem pages as the below:
7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
Size: 4194304 kB
[snip]
AnonHugePages: 0 kB
ShmemPmdMapped: 579584 kB
[snip]
Locked: 0 kB
cat /sys/kernel/debug/kvm/largepages
12
And some benchmarks do worse than with anonymous THPs.
By digging into the code we figured out that commit 127393fbe597 ("mm:
thp: kvm: fix memory corruption in KVM with THP enabled") checks if
there is a single PTE mapping on the page for anonymous THP when
setting up EPT map. But, the _mapcount < 0 check doesn't fit to page
cache THP since every subpage of page cache THP would get _mapcount
inc'ed once it is PMD mapped, so PageTransCompoundMap() always returns
false for page cache THP. This would prevent KVM from setting up PMD
mapped EPT entry.
So we need handle page cache THP correctly. However, when page cache
THP's PMD gets split, kernel just remove the map instead of setting up
PTE map like what anonymous THP does. Before KVM calls get_user_pages()
the subpages may get PTE mapped even though it is still a THP since the
page cache THP may be mapped by other processes at the mean time.
Checking its _mapcount and whether the THP has PTE mapped or not.
Although this may report some false negative cases (PTE mapped by other
processes), it looks not trivial to make this accurate.
With this fix /sys/kernel/debug/kvm/largepage would show reasonable
pages are PMD mapped by EPT as the below:
7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted)
Size: 4194304 kB
[snip]
AnonHugePages: 0 kB
ShmemPmdMapped: 557056 kB
[snip]
Locked: 0 kB
cat /sys/kernel/debug/kvm/largepages
271
And the benchmarks are as same as anonymous THPs.
Signed-off-by: Yang Shi <yang.shi(a)linux.alibaba.com>
Reported-by: Gang Deng <gavin.dg(a)linux.alibaba.com>
Tested-by: Gang Deng <gavin.dg(a)linux.alibaba.com>
Suggested-by: Hugh Dickins <hughd(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org> 4.8+
---
v3: Took the suggestion from Willy to move the definition of
compound_mapcount_ptr to mm_types.h in order to use it instead of
open coding it.
v2: Adopted the suggestion from Hugh to use _mapcount and compound_mapcount.
But I just open coding compound_mapcount to avoid duplicating the
definition of compound_mapcount_ptr in two different files. Since
"compound_mapcount" looks self-explained so I'm supposed the open
coding would not affect the readability.
include/linux/mm.h | 5 -----
include/linux/mm_types.h | 5 +++++
include/linux/page-flags.h | 22 ++++++++++++++++++++--
3 files changed, 25 insertions(+), 7 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cc29227..a2adf95 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -695,11 +695,6 @@ static inline void *kvcalloc(size_t n, size_t size, gfp_t flags)
extern void kvfree(const void *addr);
-static inline atomic_t *compound_mapcount_ptr(struct page *page)
-{
- return &page[1].compound_mapcount;
-}
-
static inline int compound_mapcount(struct page *page)
{
VM_BUG_ON_PAGE(!PageCompound(page), page);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2222fa7..270aa8f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -221,6 +221,11 @@ struct page {
#endif
} _struct_page_alignment;
+static inline atomic_t *compound_mapcount_ptr(struct page *page)
+{
+ return &page[1].compound_mapcount;
+}
+
/*
* Used for sizing the vmemmap region on some architectures
*/
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb88..e0f2d53 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -622,12 +622,30 @@ static inline int PageTransCompound(struct page *page)
*
* Unlike PageTransCompound, this is safe to be called only while
* split_huge_pmd() cannot run from under us, like if protected by the
- * MMU notifier, otherwise it may result in page->_mapcount < 0 false
+ * MMU notifier, otherwise it may result in page->_mapcount check false
* positives.
+ *
+ * We have to treat page cache THP differently since every subpage of it
+ * would get _mapcount inc'ed once it is PMD mapped. But, it may be PTE
+ * mapped in the current process so comparing subpage's _mapcount to
+ * compound_mapcount ot filter out PTE mapped case.
*/
static inline int PageTransCompoundMap(struct page *page)
{
- return PageTransCompound(page) && atomic_read(&page->_mapcount) < 0;
+ struct page *head;
+ int map_count;
+
+ if (!PageTransCompound(page))
+ return 0;
+
+ if (PageAnon(page))
+ return atomic_read(&page->_mapcount) < 0;
+
+ head = compound_head(page);
+ map_count = atomic_read(&page->_mapcount);
+ /* File THP is PMD mapped and not double mapped */
+ return map_count >= 0 &&
+ map_count == atomic_read(compound_mapcount_ptr(head));
}
/*
--
1.8.3.1
While the static key is correctly initialized as being disabled, it will
remain forever enabled once turned on. This means that if we start with an
asymmetric system and hotplug out enough CPUs to end up with an SMP system,
the static key will remain set - which is obviously wrong. We should detect
this and turn off things like misfit migration and capacity aware wakeups.
As Quentin pointed out, having separate root domains makes this slightly
trickier. We could have exclusive cpusets that create an SMP island - IOW,
the domains within this root domain will not see any asymmetry. This means
we can't just disable the key on domain destruction, we need to count how
many asymmetric root domains we have.
Consider the following example using Juno r0 which is 2+4 big.LITTLE, where
two identical cpusets are created: they both span both big and LITTLE CPUs:
asym0 asym1
[ ][ ]
L L B L L B
cgcreate -g cpuset:asym0
cgset -r cpuset.cpus=0,1,3 asym0
cgset -r cpuset.mems=0 asym0
cgset -r cpuset.cpu_exclusive=1 asym0
cgcreate -g cpuset:asym1
cgset -r cpuset.cpus=2,4,5 asym1
cgset -r cpuset.mems=0 asym1
cgset -r cpuset.cpu_exclusive=1 asym1
cgset -r cpuset.sched_load_balance=0 .
(the CPU numbering may look odd because on the Juno LITTLEs are CPUs 0,3-5
and bigs are CPUs 1-2)
If we make one of those SMP (IOW remove asymmetry) by e.g. hotplugging its
big core, we would end up with an SMP cpuset and an asymmetric cpuset - the
static key must remain set, because we still have one asymmetric root domain.
With the above example, this could be done with:
echo 0 > /sys/devices/system/cpu/cpu2/online
Which would result in:
asym0 asym1
[ ][ ]
L L B L L
When both SMP and asymmetric cpusets are present, all CPUs will observe
sched_asym_cpucapacity being set (it is system-wide), but not all CPUs
observe asymmetry in their sched domain hierarchy:
per_cpu(sd_asym_cpucapacity, <any CPU in asym0>) == <some SD at DIE level>
per_cpu(sd_asym_cpucapacity, <any CPU in asym1>) == NULL
Change the simple key enablement to an increment, and decrement the key
counter when destroying domains that cover asymmetric CPUs.
Cc: <stable(a)vger.kernel.org>
Fixes: df054e8445a4 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")
Reviewed-by: Dietmar Eggemann <dietmar.eggemann(a)arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider(a)arm.com>
---
kernel/sched/topology.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 2e7af755e17a..6ec1e595b1d4 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2026,7 +2026,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
rcu_read_unlock();
if (has_asym)
- static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
+ static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
if (rq && sched_debug_enabled) {
pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
@@ -2121,9 +2121,12 @@ int sched_init_domains(const struct cpumask *cpu_map)
*/
static void detach_destroy_domains(const struct cpumask *cpu_map)
{
+ unsigned int cpu = cpumask_any(cpu_map);
int i;
+ if (rcu_access_pointer(per_cpu(sd_asym_cpucapacity, cpu)))
+ static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
+
rcu_read_lock();
for_each_cpu(i, cpu_map)
cpu_attach_domain(NULL, &def_root_domain, i);
--
2.22.0
The loop that reads all the IBS MSRs into *buf stopped one MSR short of
reading the IbsOpData register, which contains the RipInvalid status bit.
Fix the offset_max assignment so the MSR gets read, so the RIP invalid
evaluation is based on what the IBS h/w output, instead of what was
left in memory.
Signed-off-by: Kim Phillips <kim.phillips(a)amd.com>
Fixes: d47e8238cd76 ("perf/x86-ibs: Take instruction pointer from ibs sample")
Cc: stable(a)vger.kernel.org
Cc: Stephane Eranian <eranian(a)google.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Arnaldo Carvalho de Melo <acme(a)kernel.org>
Cc: Alexander Shishkin <alexander.shishkin(a)linux.intel.com>
Cc: Jiri Olsa <jolsa(a)redhat.com>
Cc: Namhyung Kim <namhyung(a)kernel.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: x86(a)kernel.org
Cc: linux-kernel(a)vger.kernel.org
---
arch/x86/events/amd/ibs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index 5b35b7ea5d72..98ba21a588a1 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -614,7 +614,7 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
if (event->attr.sample_type & PERF_SAMPLE_RAW)
offset_max = perf_ibs->offset_max;
else if (check_rip)
- offset_max = 2;
+ offset_max = 3;
else
offset_max = 1;
do {
--
2.23.0
The following commit has been merged into the timers/urgent branch of tip:
Commit-ID: 1638b8f096ca165965189b9626564c933c79fe63
Gitweb: https://git.kernel.org/tip/1638b8f096ca165965189b9626564c933c79fe63
Author: Thomas Gleixner <tglx(a)linutronix.de>
AuthorDate: Mon, 21 Oct 2019 12:07:15 +02:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Wed, 23 Oct 2019 14:48:23 +02:00
lib/vdso: Make clock_getres() POSIX compliant again
A recent commit removed the NULL pointer check from the clock_getres()
implementation causing a test case to fault.
POSIX requires an explicit NULL pointer check for clock_getres() aside of
the validity check of the clock_id argument for obscure reasons.
Add it back for both 32bit and 64bit.
Note, this is only a partial revert of the offending commit which does not
bring back the broken fallback invocation in the the 32bit compat
implementations of clock_getres() and clock_gettime().
Fixes: a9446a906f52 ("lib/vdso/32: Remove inconsistent NULL pointer checks")
Reported-by: Andreas Schwab <schwab(a)linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Christophe Leroy <christophe.leroy(a)c-s.fr>
Cc: stable(a)vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1910211202260.1904@nanos.tec.linu…
---
lib/vdso/gettimeofday.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index e630e7f..45f57fd 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -214,9 +214,10 @@ int __cvdso_clock_getres_common(clockid_t clock, struct __kernel_timespec *res)
return -1;
}
- res->tv_sec = 0;
- res->tv_nsec = ns;
-
+ if (likely(res)) {
+ res->tv_sec = 0;
+ res->tv_nsec = ns;
+ }
return 0;
}
@@ -245,7 +246,7 @@ __cvdso_clock_getres_time32(clockid_t clock, struct old_timespec32 *res)
ret = clock_getres_fallback(clock, &ts);
#endif
- if (likely(!ret)) {
+ if (likely(!ret && res)) {
res->tv_sec = ts.tv_sec;
res->tv_nsec = ts.tv_nsec;
}
In commit 8020919a9b99 ("mac80211: Properly handle SKB with radiotap
only"), buffers whose length is too short cause a WARN_ON(1) to be
executed. This change exposed a fault in rtlwifi drivers, which is fixed
by regarding packets with skb->len <= FCS_LEN as though they are in error
and dropping them. The test is now annotated as likely.
Cc: Stable <stable(a)vger.kernel.org> # v5.0+
Signed-off-by: Larry Finger <Larry.Finger(a)lwfinger.net>
---
V2 - content dropped
V3 - changed fix to drop packet rather than arbitrarily increasing the length.
---
Material for 5.4.
---
drivers/net/wireless/realtek/rtlwifi/pci.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/wireless/realtek/rtlwifi/pci.c b/drivers/net/wireless/realtek/rtlwifi/pci.c
index 6087ec7a90a6..f88d26535978 100644
--- a/drivers/net/wireless/realtek/rtlwifi/pci.c
+++ b/drivers/net/wireless/realtek/rtlwifi/pci.c
@@ -822,7 +822,7 @@ static void _rtl_pci_rx_interrupt(struct ieee80211_hw *hw)
hdr = rtl_get_hdr(skb);
fc = rtl_get_fc(skb);
- if (!stats.crc && !stats.hwerror) {
+ if (!stats.crc && !stats.hwerror && (skb->len > FCS_LEN)) {
memcpy(IEEE80211_SKB_RXCB(skb), &rx_status,
sizeof(rx_status));
@@ -859,6 +859,7 @@ static void _rtl_pci_rx_interrupt(struct ieee80211_hw *hw)
_rtl_pci_rx_to_mac80211(hw, skb, rx_status);
}
} else {
+ /* drop packets with errors or those too short */
dev_kfree_skb_any(skb);
}
new_trx_end:
--
2.23.0
The arguments to queue_trb are always byteswapped to LE for placement in
the ring, but this should not happen in the case of immediate data; the
bytes copied out of transfer_buffer are already in the correct order.
Add a complementary byteswap so the bytes end up in the ring correctly.
This was observed on BE ppc64 with a "Texas Instruments TUSB73x0
SuperSpeed USB 3.0 xHCI Host Controller [104c:8241]" as a ch341
usb-serial adapter ("1a86:7523 QinHeng Electronics HL-340 USB-Serial
adapter") always transmitting the same character (generally NUL) over
the serial link regardless of the key pressed.
Cc: stable(a)vger.kernel.org
Fixes: 33e39350ebd2 ("usb: xhci: add Immediate Data Transfer support")
Signed-off-by: Samuel Holland <samuel(a)sholland.org>
---
drivers/usb/host/xhci-ring.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 85ceb43e3405..e7aab31fd9a5 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -3330,6 +3330,7 @@ int xhci_queue_bulk_tx(struct xhci_hcd *xhci, gfp_t mem_flags,
if (xhci_urb_suitable_for_idt(urb)) {
memcpy(&send_addr, urb->transfer_buffer,
trb_buff_len);
+ le64_to_cpus(&send_addr);
field |= TRB_IDT;
}
}
@@ -3475,6 +3476,7 @@ int xhci_queue_ctrl_tx(struct xhci_hcd *xhci, gfp_t mem_flags,
if (xhci_urb_suitable_for_idt(urb)) {
memcpy(&addr, urb->transfer_buffer,
urb->transfer_buffer_length);
+ le64_to_cpus(&addr);
field |= TRB_IDT;
} else {
addr = (u64) urb->transfer_dma;
--
2.21.0