On s390 the CPU Measurement Facility for counters now supports
2 PMUs named cpum_cf (CPU Measurement Facility for counters) and
cpum_cf_diag (CPU Measurement Facility for diagnostic counters)
for one and the same CPU.
Running command
[root@s35lp76 perf]# ./perf stat -e tx_c_tend \
-- ~/mytests/cf-tx-events 1
Measuring transactions
TX_C_TABORT_NO_SPECIAL: 0 expected:0
TX_C_TABORT_SPECIAL: 0 expected:0
TX_C_TEND: 1 expected:1
TX_NC_TABORT: 11 expected:11
TX_NC_TEND: 1 expected:1
Performance counter stats for '/root/mytests/cf-tx-events 1':
2 tx_c_tend
0.002120091 seconds time elapsed
0.000121000 seconds user
0.002127000 seconds sys
[root@s35lp76 perf]#
displays output which is unexpected (and wrong):
2 tx_c_tend
The test program definitely triggers only one transaction, as shown
in line 'TX_C_TEND: 1 expected:1'.
This is caused by the following call sequence:
pmu_lookup() scans and installs a PMU.
+--> pmu_aliases() parses all aliases in directory
.../<pmu-name>/events/* which are file names.
+--> pmu_aliases_parse() Reads each file in the directory and
creates a new alias entry. This is done with
+--> perf_pmu__new_alias() and
+--> __perf_pmu__new_alias() which also check for
identical alias names.
After pmu_aliases() returns, a complete list of event names
for this pmu has been created. Now function
pmu_add_cpu_aliases() is called to add the events listed in the json
files to the alias list of the cpu.
+--> perf_pmu__find_map() Returns a pointer to the json events.
Now function pmu_add_cpu_aliases() scans through all events listed
in the JSON files for this CPU.
Each JSON event's pmu name is compared with the current PMU being
built up and if they match, the JSON event is added to the
current PMU's alias list.
To avoid duplicate entries the following comparison is done:
if (!is_arm_pmu_core(name)) {
pname = pe->pmu ? pe->pmu : "cpu";
if (strncmp(pname, name, strlen(pname)))
continue;
}
The culprit is the strncmp() function.
Using current s390 PMU naming, the first PMU is 'cpum_cf'
and a long list of events is added, among them 'tx_c_tend'.
When the second PMU named 'cpum_cf_diag' is added, only one event
named 'CF_DIAG' is added by the pmu_aliases() function.
Now function pmu_add_cpu_aliases() is invoked for PMU 'cpum_cf_diag'.
Since the CPUID string is the same for both PMUs, JSON file events
for PMU named 'cpum_cf' are added to the PMU 'cpum_cf_diag'.
This happens because the strncmp() actually compares:
strncmp("cpum_cf", "cpum_cf_diag", 7);
The first parameter is the pmu name taken from the event in
the json file. The second parameter is the pmu name of the PMU
currently being built.
They are different, but the compare length limits the test to the
common prefix, so strncmp() returns 0 (match) when it should indicate
a mismatch.
Now all events for PMU cpum_cf are added to the alias list for pmu
cpum_cf_diag.
Later on, in function parse_events_add_pmu(), the event 'tx_c_tend' is
searched for in all available PMUs and found twice, adding it twice
to the evsel_list global variable which is the root
of all events. This results in a counter value of 2 instead
of 1.
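To see the prefix-match pitfall in isolation, here is a small standalone
C program (illustration only, not part of the patch) contrasting the old
strncmp() check with the strcmp() used below:

#include <stdio.h>
#include <string.h>

/* Toy reproduction of the alias filter: the JSON event carries the
 * pmu name "cpum_cf", the PMU currently being built is "cpum_cf_diag".
 */
int main(void)
{
	const char *pname = "cpum_cf";       /* pmu name from the JSON event */
	const char *name  = "cpum_cf_diag";  /* PMU currently being built    */

	/* Old check: only the first strlen(pname) == 7 bytes are compared,
	 * so the two different names are reported as equal (returns 0).
	 */
	printf("strncmp: %d\n", strncmp(pname, name, strlen(pname)));

	/* Fixed check: the full strings are compared and the mismatch is
	 * detected (returns non-zero).
	 */
	printf("strcmp:  %d\n", strcmp(pname, name));
	return 0;
}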
Output with this patch:
[root@s35lp76 perf]# ./perf stat -e tx_c_tend \
-- ~/mytests/cf-tx-events 1
Measuring transactions
TX_C_TABORT_NO_SPECIAL: 0 expected:0
TX_C_TABORT_SPECIAL: 0 expected:0
TX_C_TEND: 1 expected:1
TX_NC_TABORT: 11 expected:11
TX_NC_TEND: 1 expected:1
Performance counter stats for '/root/mytests/cf-tx-events 1':
1 tx_c_tend
0.001815365 seconds time elapsed
0.000123000 seconds user
0.001756000 seconds sys
[root@s35lp76 perf]#
Fixes: 292c34c10249 ("perf pmu: Fix core PMU alias list for X86 platform")
Signed-off-by: Thomas Richter <tmricht(a)linux.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner(a)linux.ibm.com>
Cc: Kan Liang <kan.liang(a)linux.intel.com>
Cc: <stable(a)vger.kernel.org> # 4.18+
---
tools/perf/util/pmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 7799788f662f..7e49baad304d 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -773,7 +773,7 @@ static void pmu_add_cpu_aliases(struct list_head *head, struct perf_pmu *pmu)
if (!is_arm_pmu_core(name)) {
pname = pe->pmu ? pe->pmu : "cpu";
- if (strncmp(pname, name, strlen(pname)))
+ if (strcmp(pname, name))
continue;
}
--
2.14.3
From: Andrea Arcangeli <aarcange(a)redhat.com>
THP allocation might be really disruptive when allocated on a NUMA system
with the local node full or hard to reclaim. Stefan has posted an
allocation stall report on a 4.12-based SLES kernel which suggests the
same issue:
[245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
[245513.363983] kvm cpuset=/ mems_allowed=0-1
[245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G W 4.12.0+98-ph 0000001 SLE15 (unreleased)
[245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
[245513.365905] Call Trace:
[245513.366535] dump_stack+0x5c/0x84
[245513.367148] warn_alloc+0xe0/0x180
[245513.367769] __alloc_pages_slowpath+0x820/0xc90
[245513.368406] ? __slab_free+0xa9/0x2f0
[245513.369048] ? __slab_free+0xa9/0x2f0
[245513.369671] __alloc_pages_nodemask+0x1cc/0x210
[245513.370300] alloc_pages_vma+0x1e5/0x280
[245513.370921] do_huge_pmd_wp_page+0x83f/0xf00
[245513.371554] ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
[245513.372184] ? do_huge_pmd_anonymous_page+0x631/0x6d0
[245513.372812] __handle_mm_fault+0x93d/0x1060
[245513.373439] handle_mm_fault+0xc6/0x1b0
[245513.374042] __do_page_fault+0x230/0x430
[245513.374679] ? get_vtime_delta+0x13/0xb0
[245513.375411] do_page_fault+0x2a/0x70
[245513.376145] ? page_fault+0x65/0x80
[245513.376882] page_fault+0x7b/0x80
[...]
[245513.382056] Mem-Info:
[245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
active_file:60183 inactive_file:245285 isolated_file:0
unevictable:15657 dirty:286 writeback:1 unstable:0
slab_reclaimable:75543 slab_unreclaimable:2509111
mapped:81814 shmem:31764 pagetables:370616 bounce:0
free:32294031 free_pcp:6233 free_cma:0
[245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for an MADV_HUGEPAGE vma.
Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:
: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swapout get_user_pages pinned pages as result of the
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).
Fix this by removing __GFP_THISNODE for THP requests which are
requesting the direct reclaim. This effectively reverts 5265047ac301 on
the grounds that the zone/node reclaim was known to be disruptive due
to premature reclaim when there was memory free. While it made sense at
the time for HPC workloads without NUMA awareness on rare machines, it
was ultimately harmful in the majority of cases. The existing behaviour
is similar, though not as widespread, as it applies to a corner case, but
crucially, it cannot be tuned around like zone_reclaim_mode can. The
default behaviour should always be to cause the least harm for the
common case.
If there are specialised use cases out there that want zone_reclaim_mode
in specific cases, then it can be built on top. Longterm we should
consider a memory policy which allows for the node reclaim like behavior
for the specific memory ranges which would allow a
[1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com
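For illustration, the decision the patch implements can be sketched in a
few lines of userspace C; the flag bits below are made-up placeholders,
not the kernel's real gfp values:

#include <stdio.h>

/* Illustrative placeholder bits -- NOT the kernel's actual gfp values. */
#define GFP_DIRECT_RECLAIM	0x1u
#define GFP_THISNODE		0x2u

/* Mirror of the patched logic in alloc_pages_vma(): pin the allocation
 * to the local node only when it cannot trigger reclaim there.
 */
static unsigned int thp_gfp(unsigned int gfp)
{
	if (!(gfp & GFP_DIRECT_RECLAIM))
		gfp |= GFP_THISNODE;
	return gfp;
}

int main(void)
{
	/* defrag=madvise THP fault: direct reclaim allowed, no THISNODE */
	printf("reclaiming THP : %#x\n", thp_gfp(GFP_DIRECT_RECLAIM));
	/* light THP attempt: no reclaim, pinning to the local node is safe */
	printf("light THP      : %#x\n", thp_gfp(0));
	return 0;
}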
[mhocko(a)suse.com: rewrote the changelog based on the one from Andrea]
Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
Cc: Zi Yan <zi.yan(a)cs.rutgers.edu>
Cc: stable # 4.1+
Reported-by: Stefan Priebe <s.priebe(a)profihost.ag>
Debugged-by: Andrea Arcangeli <aarcange(a)redhat.com>
Reported-by: Alex Williamson <alex.williamson(a)redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange(a)redhat.com>
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
---
mm/mempolicy.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..149b6f4cf023 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2046,8 +2046,36 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
nmask = policy_nodemask(gfp, pol);
if (!nmask || node_isset(hpage_node, *nmask)) {
mpol_cond_put(pol);
- page = __alloc_pages_node(hpage_node,
- gfp | __GFP_THISNODE, order);
+ /*
+ * We cannot invoke reclaim if __GFP_THISNODE
+ * is set. Invoking reclaim with
+ * __GFP_THISNODE set, would cause THP
+ * allocations to trigger heavy swapping
+ * despite there may be tons of free memory
+ * (including potentially plenty of THP
+ * already available in the buddy) on all the
+ * other NUMA nodes.
+ *
+ * At most we could invoke compaction when
+ * __GFP_THISNODE is set (but we would need to
+ * refrain from invoking reclaim even if
+ * compaction returned COMPACT_SKIPPED because
+ * there wasn't enough memory to succeed
+ * compaction). For now just avoid
+ * __GFP_THISNODE instead of limiting the
+ * allocation path to a strict and single
+ * compaction invocation.
+ *
+ * Supposedly if direct reclaim was enabled by
+ * the caller, the app prefers THP regardless
+ * of the node it comes from so this would be
+ * more desirable behavior than only
+ * providing THP originated from the local
+ * node in such case.
+ */
+ if (!(gfp & __GFP_DIRECT_RECLAIM))
+ gfp |= __GFP_THISNODE;
+ page = __alloc_pages_node(hpage_node, gfp, order);
goto out;
}
}
--
2.18.0
Hi,
The 1st & 3rd patches fix bio size alignment issues.
The 2nd patch cleans up __blkdev_issue_discard() a bit.
Thanks,
Ming Lei (3):
block: make sure discard bio is aligned with logical block size
block: cleanup __blkdev_issue_discard()
block: make sure writesame bio is aligned with logical block size
block/blk-lib.c | 25 ++++++-------------------
1 file changed, 6 insertions(+), 19 deletions(-)
Cc: Rui Salvaterra <rsalvaterra(a)gmail.com>
Cc: stable(a)vger.kernel.org
Cc: Mike Snitzer <snitzer(a)redhat.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Xiao Ni <xni(a)redhat.com>
Cc: Mariusz Dabrowski <mariusz.dabrowski(a)intel.com>
--
2.9.5
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlbfs: dirty pages as they are added to pagecache
Some test systems were experiencing negative huge page reserve counts and
incorrect file block counts. This was traced to /proc/sys/vm/drop_caches
removing clean pages from hugetlbfs file pagecaches. When code outside
hugetlbfs explicitly removes the pages, the appropriate accounting is not
performed.
This can be recreated as follows:
fallocate -l 2M /dev/hugepages/foo
echo 1 > /proc/sys/vm/drop_caches
fallocate -l 2M /dev/hugepages/foo
grep -i huge /proc/meminfo
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
HugePages_Total: 2048
HugePages_Free: 2047
HugePages_Rsvd: 18446744073709551615
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 4194304 kB
ls -lsh /dev/hugepages/foo
4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo
To address this issue, dirty pages as they are added to pagecache. This
can easily be reproduced with fallocate as shown above. Read faulted
pages will eventually end up being marked dirty. But there is a window
where they are clean and could be impacted by code such as drop_caches.
So, just dirty them all as they are added to the pagecache.
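The shell reproduction above can also be driven from a small standalone
C program (a sketch, assuming hugetlbfs is mounted at /dev/hugepages,
2MB huge pages and root privileges):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long sz = 2 * 1024 * 1024;	/* one 2MB huge page */
	int fd = open("/dev/hugepages/foo", O_RDWR | O_CREAT, 0644);
	int dc;

	if (fd < 0 || fallocate(fd, 0, 0, sz) < 0) {
		perror("fallocate");
		return 1;
	}

	/* Drop clean pagecache pages; before the fix this also drops the
	 * (clean) hugetlbfs page without adjusting the reserve counters.
	 */
	dc = open("/proc/sys/vm/drop_caches", O_WRONLY);
	if (dc < 0 || write(dc, "1", 1) != 1) {
		perror("drop_caches");
		return 1;
	}
	close(dc);

	if (fallocate(fd, 0, 0, sz) < 0) {
		perror("fallocate");
		return 1;
	}
	close(fd);

	/* Now inspect HugePages_Rsvd in /proc/meminfo and the size of foo. */
	return 0;
}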
Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com
Fixes: 6bda666a03f0 ("hugepages: fold find_or_alloc_pages into huge_no_page()")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Reviewed-by: Khalid Aziz <khalid.aziz(a)oracle.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 6 ++++++
1 file changed, 6 insertions(+)
--- a/mm/hugetlb.c~hugetlbfs-dirty-pages-as-they-are-added-to-pagecache-v2
+++ a/mm/hugetlb.c
@@ -3690,6 +3690,12 @@ int huge_add_to_page_cache(struct page *
return err;
ClearPagePrivate(page);
+ /*
+ * set page dirty so that it will not be removed from cache/file
+ * by non-hugetlbfs specific code paths.
+ */
+ set_page_dirty(page);
+
spin_lock(&inode->i_lock);
inode->i_blocks += blocks_per_huge_page(h);
spin_unlock(&inode->i_lock);
_
From: Christoph Hellwig <hch(a)lst.de>
Subject: userfaultfd: disable irqs when taking the waitqueue lock
userfaultfd contains home-grown locking of the waitqueue lock, and does
not disable interrupts. This relies on the fact that no one else takes it
from interrupt context and violates an invariant of the normal waitqueue
locking scheme. With aio poll it is easy to trigger other locks that
disable interrupts (or are called from interrupt context).
Link: http://lkml.kernel.org/r/20181018154101.18750-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Andrea Arcangeli <aarcange(a)redhat.com>
Reviewed-by: Andrew Morton <akpm(a)linux-foundation.org>
Cc: <stable(a)vger.kernel.org> [4.19.x]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/userfaultfd.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
--- a/fs/userfaultfd.c~userfaultfd-disable-irqs-when-taking-the-waitqueue-lock
+++ a/fs/userfaultfd.c
@@ -1026,7 +1026,7 @@ static ssize_t userfaultfd_ctx_read(stru
struct userfaultfd_ctx *fork_nctx = NULL;
/* always take the fd_wqh lock before the fault_pending_wqh lock */
- spin_lock(&ctx->fd_wqh.lock);
+ spin_lock_irq(&ctx->fd_wqh.lock);
__add_wait_queue(&ctx->fd_wqh, &wait);
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
@@ -1112,13 +1112,13 @@ static ssize_t userfaultfd_ctx_read(stru
ret = -EAGAIN;
break;
}
- spin_unlock(&ctx->fd_wqh.lock);
+ spin_unlock_irq(&ctx->fd_wqh.lock);
schedule();
- spin_lock(&ctx->fd_wqh.lock);
+ spin_lock_irq(&ctx->fd_wqh.lock);
}
__remove_wait_queue(&ctx->fd_wqh, &wait);
__set_current_state(TASK_RUNNING);
- spin_unlock(&ctx->fd_wqh.lock);
+ spin_unlock_irq(&ctx->fd_wqh.lock);
if (!ret && msg->event == UFFD_EVENT_FORK) {
ret = resolve_userfault_fork(ctx, fork_nctx, msg);
_