The patch titled
Subject: mm, vmscan: do not special-case slab reclaim when watermarks are boosted
has been added to the -mm tree. Its filename is
mm-vmscan-do-not-special-case-slab-reclaim-when-watermarks-are-boosted.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-vmscan-do-not-special-case-slab…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-vmscan-do-not-special-case-slab…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Mel Gorman <mgorman(a)techsingularity.net>
Subject: mm, vmscan: do not special-case slab reclaim when watermarks are boosted
Dave Chinner reported a problem pointing a finger at commit 1c30844d2dfe
("mm: reclaim small amounts of memory when an external fragmentation event
occurs"). The report is extensive (see
https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/)
and it's worth recording the most relevant parts (colorful language and
typos included).
When running a simple, steady state 4kB file creation test to
simulate extracting tarballs larger than memory full of small
files into the filesystem, I noticed that once memory fills up
the cache balance goes to hell.
The workload is creating one dirty cached inode for every dirty
page, both of which should require a single IO each to clean and
reclaim, and creation of inodes is throttled by the rate at which
dirty writeback runs at (via balance dirty pages). Hence the ingest
rate of new cached inodes and page cache pages is identical and
steady. As a result, memory reclaim should quickly find a steady
balance between page cache and inode caches.
The moment memory fills, the page cache is reclaimed at a much
faster rate than the inode cache, and evidence suggests that
the inode cache shrinker is not being called when large batches
of pages are being reclaimed. In roughly the same time period
that it takes to fill memory with 50% pages and 50% slab caches,
memory reclaim reduces the page cache down to just dirty pages
and slab caches fill the entirety of memory.
The LRU is largely full of dirty pages, and we're getting spikes
of random writeback from memory reclaim so it's all going to shit.
Behaviour never recovers, the page cache remains pinned at just
dirty pages, and nothing I could tune would make any difference.
vfs_cache_pressure makes no difference - I would set it so high
it should trim the entire inode caches in a single pass, yet it
didn't do anything. It was clear from tracing and live telemetry
that the shrinkers were pretty much not running except when
there was absolutely no memory free at all, and then they did
the minimum necessary to free memory to make progress.
So I went looking at the code, trying to find places where pages
got reclaimed and the shrinkers weren't called. There's only one
- kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm:
reclaim small amounts of memory when an external fragmentation
event occurs").
The watermark boosting introduced by the commit is triggered in response
to an allocation "fragmentation event". The boosting was not intended to
target THP specifically and triggers even if THP is disabled. However,
with Dave's perfectly reasonable workload, fragmentation events can be
very common given the ratio of slab to page cache allocations, so boosting
remains active for long periods of time.
As high-order allocations might use compaction and compaction cannot move
slab pages, the decision was made in the commit to special-case kswapd when
watermarks are boosted -- kswapd avoids reclaiming slab as reclaiming slab
does not directly help compaction.
As Dave notes, this decision means that slab can be artificially protected
for long periods of time and messes up the balance between slab and page
caches.
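To make the effect concrete, here is a toy userspace model (an illustration
only; the struct and function names are made up, this is not kernel code) of
what the may_shrinkslab special case removed below does to the balance:
while boosted reclaim is active, the LRU shrinks on every pass but slab
never does, so slab's share of memory keeps growing.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of boosted reclaim: may_shrinkslab=false mimics the
 * pre-patch kswapd behaviour while watermarks are boosted. */
struct toy_scan_control {
    bool may_shrinkslab;
};

static void toy_shrink_node(struct toy_scan_control *sc,
                            long *pagecache, long *slab)
{
    *pagecache -= 100;              /* LRU pages are always scanned */
    if (sc->may_shrinkslab)
        *slab -= 100;               /* slab only shrinks when allowed */
}

int main(void)
{
    long pagecache = 1000, slab = 1000;
    struct toy_scan_control boosted = { .may_shrinkslab = false };

    for (int i = 0; i < 5; i++)
        toy_shrink_node(&boosted, &pagecache, &slab);

    /* Prints pagecache=500 slab=1000: slab stays protected for as
     * long as the boost remains active. */
    printf("pagecache=%ld slab=%ld\n", pagecache, slab);
    return 0;
}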
Removing the special casing can still indirectly help fragmentation by
avoiding fragmentation-causing events due to slab allocation as pages from
a slab pageblock will have some slab objects freed. Furthermore, with the
special casing, reclaim behaviour is unpredictable as kswapd sometimes
examines slab and sometimes does not in a manner that is tricky to tune or
analyse.
This patch removes the special casing. The downside is that this is not a
universal performance win. Some benchmarks that depend on the residency
of data when rereading metadata may see a regression when slab reclaim is
restored to its original behaviour. Similarly, some benchmarks that only
read-once or write-once may perform better when page reclaim is too
aggressive. The primary upside is that the slab shrinker is less surprising
(arguably more sane, but that's a matter of opinion), behaves consistently
regardless of the fragmentation state of the system and properly obeys VM
sysctls.
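For reference, vm.vfs_cache_pressure (the knob Dave tried above) lives in
the standard procfs sysctl tree. A minimal userspace sketch for reading and
raising it follows; the value 200 is only an illustrative example, not a
recommendation from this patch.

#include <stdio.h>

int main(void)
{
    const char *path = "/proc/sys/vm/vfs_cache_pressure";
    int cur = -1;
    FILE *f = fopen(path, "r");

    if (f) {
        if (fscanf(f, "%d", &cur) == 1)
            printf("current vfs_cache_pressure: %d\n", cur);
        fclose(f);
    }

    f = fopen(path, "w");       /* needs root */
    if (f) {
        fprintf(f, "200\n");    /* >100 biases reclaim towards dentry/inode caches */
        fclose(f);
    }
    return 0;
}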
A fsmark benchmark configuration was constructed similar to what Dave
reported and is codified by the mmtest configuration
config-io-fsmark-small-file-stream. It was evaluated on a 1-socket
machine to avoid dealing with NUMA-related issues and the timing of
reclaim. The storage was an SSD Samsung Evo and a fresh trimmed XFS
filesystem was used for the test data.
This is not an exact replication of Dave's setup. The configuration
scales its parameters depending on the memory size of the SUT to behave
similarly across machines. The parameters mean the first sample reported
by fs_mark uses 50% of RAM, which will barely be throttled and looks
like a big outlier. Dave used fake NUMA to have multiple kswapd instances,
which I didn't replicate. Finally, the number of iterations differs from
Dave's test as the target disk was not large enough. While not identical,
it should be representative.
fsmark
5.3.0-rc3 5.3.0-rc3
vanilla shrinker-v1r1
Min 1-files/sec 4444.80 ( 0.00%) 4765.60 ( 7.22%)
1st-qrtle 1-files/sec 5005.10 ( 0.00%) 5091.70 ( 1.73%)
2nd-qrtle 1-files/sec 4917.80 ( 0.00%) 4855.60 ( -1.26%)
3rd-qrtle 1-files/sec 4667.40 ( 0.00%) 4831.20 ( 3.51%)
Max-1 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%)
Max-5 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%)
Max-10 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%)
Max-90 1-files/sec 4649.60 ( 0.00%) 4780.70 ( 2.82%)
Max-95 1-files/sec 4491.00 ( 0.00%) 4768.20 ( 6.17%)
Max-99 1-files/sec 4491.00 ( 0.00%) 4768.20 ( 6.17%)
Max 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%)
Hmean 1-files/sec 5004.75 ( 0.00%) 5075.96 ( 1.42%)
Stddev 1-files/sec 1778.70 ( 0.00%) 1369.66 ( 23.00%)
CoeffVar 1-files/sec 33.70 ( 0.00%) 26.05 ( 22.71%)
BHmean-99 1-files/sec 5053.72 ( 0.00%) 5101.52 ( 0.95%)
BHmean-95 1-files/sec 5053.72 ( 0.00%) 5101.52 ( 0.95%)
BHmean-90 1-files/sec 5107.05 ( 0.00%) 5131.41 ( 0.48%)
BHmean-75 1-files/sec 5208.45 ( 0.00%) 5206.68 ( -0.03%)
BHmean-50 1-files/sec 5405.53 ( 0.00%) 5381.62 ( -0.44%)
BHmean-25 1-files/sec 6179.75 ( 0.00%) 6095.14 ( -1.37%)
5.3.0-rc3 5.3.0-rc3
vanilla shrinker-v1r1
Duration User 501.82 497.29
Duration System 4401.44 4424.08
Duration Elapsed 8124.76 8358.05
This shows a slight skew for the max result, which represents a large
outlier; the 1st, 2nd and 3rd quartiles are similar, indicating that the
bulk of the results show little difference. Note that an earlier version
of the fsmark configuration showed a regression, but that included more
samples taken while memory was still filling.
Note that the elapsed time is higher. Part of this is that the
configuration included time to delete all the test files when the test
completes -- the test automation handles the possibility of testing fsmark
with multiple thread counts. Without the patch, many of these objects
would be memory resident which is part of what the patch is addressing.
There are other important observations that justify the patch.
1. With the vanilla kernel, the number of dirty pages in the system
is very low for much of the test. With this patch, dirty pages
are generally kept at 10%, which matches vm.dirty_background_ratio
and is the normal expected historical behaviour.
2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
0.95 for much of the test i.e. Slab is being left alone and dominating
memory consumption. With the patch applied, the ratio varies between
0.35 and 0.45 with the bulk of the measured ratios roughly half way
between those values. This is a different balance to what Dave reported
but it was at least consistent.
3. Slabs are scanned throughout the entire test with the patch applied.
The vanilla kernel has periods with no scan activity and then relatively
massive spikes.
4. Without the patch, kswapd scan rates are very variable. With the patch,
the scan rates remain quite steady.
5. Overall vmstats are closer to normal expectations
5.3.0-rc3 5.3.0-rc3
vanilla shrinker-v1r1
Ops Direct pages scanned 99388.00 328410.00
Ops Kswapd pages scanned 45382917.00 33451026.00
Ops Kswapd pages reclaimed 30869570.00 25239655.00
Ops Direct pages reclaimed 74131.00 5830.00
Ops Kswapd efficiency % 68.02 75.45
Ops Kswapd velocity 5585.75 4002.25
Ops Page reclaim immediate 1179721.00 430927.00
Ops Slabs scanned 62367361.00 73581394.00
Ops Direct inode steals 2103.00 1002.00
Ops Kswapd inode steals 570180.00 5183206.00
o The vanilla kernel hits direct reclaim more frequently,
not very much in absolute terms, but the fact that the patch
reduces it is interesting
o "Page reclaim immediate" in the vanilla kernel indicates
dirty pages are being encountered at the tail of the LRU.
This is generally bad and means in this case that the LRU
is not long enough for dirty pages to be cleaned by the
background flush in time. This is much reduced by the
patch.
o With the patch, kswapd is reclaiming 10 times more slab
pages than with the vanilla kernel. This is indicative
of the watermark boosting over-protecting slab.
A more complete set of tests, which were part of the basis for introducing
boosting, was also run and, while there are some differences, they
are well within tolerances.
Bottom line: special-casing kswapd to avoid reclaiming slab is
unpredictable and can lead to abnormal results for normal workloads. This
patch restores the expected behaviour that slab and page cache are balanced
consistently for a workload with a steady allocation ratio of
slab/pagecache pages. It also means that workloads which favour the
preservation of slab over pagecache can tune that via
vm.vfs_cache_pressure, whereas the vanilla kernel effectively ignores the
parameter when boosting is active.
Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Mel Gorman <mgorman(a)techsingularity.net>
Reviewed-by: Dave Chinner <dchinner(a)redhat.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmscan.c | 13 ++-----------
1 file changed, 2 insertions(+), 11 deletions(-)
--- a/mm/vmscan.c~mm-vmscan-do-not-special-case-slab-reclaim-when-watermarks-are-boosted
+++ a/mm/vmscan.c
@@ -88,9 +88,6 @@ struct scan_control {
/* Can pages be swapped as part of reclaim? */
unsigned int may_swap:1;
- /* e.g. boosted watermark reclaim leaves slabs alone */
- unsigned int may_shrinkslab:1;
-
/*
* Cgroups are not reclaimed below their configured memory.low,
* unless we threaten to OOM. If any cgroups are skipped due to
@@ -2714,10 +2711,8 @@ static bool shrink_node(pg_data_t *pgdat
shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
node_lru_pages += lru_pages;
- if (sc->may_shrinkslab) {
- shrink_slab(sc->gfp_mask, pgdat->node_id,
- memcg, sc->priority);
- }
+ shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
+ sc->priority);
/* Record the group's reclaim efficiency */
vmpressure(sc->gfp_mask, memcg, false,
@@ -3194,7 +3189,6 @@ unsigned long try_to_free_pages(struct z
.may_writepage = !laptop_mode,
.may_unmap = 1,
.may_swap = 1,
- .may_shrinkslab = 1,
};
/*
@@ -3238,7 +3232,6 @@ unsigned long mem_cgroup_shrink_node(str
.may_unmap = 1,
.reclaim_idx = MAX_NR_ZONES - 1,
.may_swap = !noswap,
- .may_shrinkslab = 1,
};
unsigned long lru_pages;
@@ -3286,7 +3279,6 @@ unsigned long try_to_free_mem_cgroup_pag
.may_writepage = !laptop_mode,
.may_unmap = 1,
.may_swap = may_swap,
- .may_shrinkslab = 1,
};
set_task_reclaim_state(current, &sc.reclaim_state);
@@ -3598,7 +3590,6 @@ restart:
*/
sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
sc.may_swap = !nr_boost_reclaim;
- sc.may_shrinkslab = !nr_boost_reclaim;
/*
* Do some background aging of the anon list, to give
_
Patches currently in -mm which might be from mgorman(a)techsingularity.net are
mm-vmscan-do-not-special-case-slab-reclaim-when-watermarks-are-boosted.patch
From: Thomas Richter <tmricht(a)linux.ibm.com>
During execution of the command 'perf top' the error message:
Not enough memory for annotating '__irf_end' symbol!
is emitted from this call sequence:
__cmd_top
perf_top__mmap_read
perf_top__mmap_read_idx
perf_event__process_sample
hist_entry_iter__add
hist_iter__top_callback
perf_top__record_precise_ip
hist_entry__inc_addr_samples
symbol__inc_addr_samples
symbol__get_annotation
symbol__alloc_hist
In this function the size of symbol __irf_end is calculated. The size of
a symbol is the difference between its start and end address.
When the symbol was read the first time, its start and end were set to:
symbol__new: __irf_end 0xe954d0-0xe954d0
which is correct and matches /proc/kallsyms:
root@s8360046:~/linux-4.15.0/tools/perf# fgrep _irf_end /proc/kallsyms
0000000000e954d0 t __irf_end
root@s8360046:~/linux-4.15.0/tools/perf#
In function symbol__alloc_hist() the end of symbol __irf_end is
symbol__alloc_hist sym:__irf_end start:0xe954d0 end:0x3ff80045a8
which is identical to the start address of the first module entry in
/proc/kallsyms.
This results in a histogram allocation for symbol __irf_end of
70334140059072 bytes, and a malloc() for this requested size fails.
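For scale, a minimal arithmetic sketch (plain userspace C, not the perf
code itself) of the symbol size that falls out of the bogus end address;
the histogram perf sizes from that span is correspondingly enormous:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t start = 0xe954d0;      /* __irf_end from /proc/kallsyms */
    uint64_t end = 0x3ff80045a8;    /* first module symbol, wrongly used as end */
    uint64_t size = end - start;

    /* Prints 0x3ff716f0d8, i.e. a ~256 GiB "symbol" instead of a
     * zero-sized one. */
    printf("symbol size: %#" PRIx64 " (%" PRIu64 " bytes)\n", size, size);
    return 0;
}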
The root cause of this is function
__dso__load_kallsyms()
+-> symbols__fixup_end()
Function symbols__fixup_end() enlarges the last symbol in the kallsyms
map:
# fgrep __irf_end /proc/kallsyms
0000000000e954d0 t __irf_end
#
to the start address of the first module:
# cat /proc/kallsyms | sort | egrep ' [tT] '
....
0000000000e952d0 T __security_initcall_end
0000000000e954d0 T __initramfs_size
0000000000e954d0 t __irf_end
000003ff800045a8 T fc_get_event_number [scsi_transport_fc]
000003ff800045d0 t store_fc_vport_disable [scsi_transport_fc]
000003ff800046a8 T scsi_is_fc_rport [scsi_transport_fc]
000003ff800046d0 t fc_target_setup [scsi_transport_fc]
On s390 the kernel is located around memory address 0x200, 0x10000 or
0x100000, depending on the Linux version. Modules, however, start somewhere
around 0x3ff xxxx xxxx.
This is different from x86 and produces a large gap for which the histogram
allocation fails.
Fix this by detecting the kernel's last symbol and doing no adjustment for
it. Introduce a weak function to handle the s390 specifics.
Reported-by: Klaus Theurich <klaus.theurich(a)de.ibm.com>
Signed-off-by: Thomas Richter <tmricht(a)linux.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: Hendrik Brueckner <brueckner(a)linux.ibm.com>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Cc: stable(a)vger.kernel.org
Link: http://lkml.kernel.org/r/20190724122703.3996-2-tmricht@linux.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme(a)redhat.com>
---
tools/perf/arch/s390/util/machine.c | 17 +++++++++++++++++
tools/perf/util/symbol.c | 7 ++++++-
tools/perf/util/symbol.h | 1 +
3 files changed, 24 insertions(+), 1 deletion(-)
diff --git a/tools/perf/arch/s390/util/machine.c b/tools/perf/arch/s390/util/machine.c
index de26b1441a48..c8c86a0c9b79 100644
--- a/tools/perf/arch/s390/util/machine.c
+++ b/tools/perf/arch/s390/util/machine.c
@@ -6,6 +6,7 @@
#include "machine.h"
#include "api/fs/fs.h"
#include "debug.h"
+#include "symbol.h"
int arch__fix_module_text_start(u64 *start, u64 *size, const char *name)
{
@@ -33,3 +34,19 @@ int arch__fix_module_text_start(u64 *start, u64 *size, const char *name)
return 0;
}
+
+/* On s390 kernel text segment start is located at very low memory addresses,
+ * for example 0x10000. Modules are located at very high memory addresses,
+ * for example 0x3ff xxxx xxxx. The gap between end of kernel text segment
+ * and beginning of first module's text segment is very big.
+ * Therefore do not fill this gap and do not assign it to the kernel dso map.
+ */
+void arch__symbols__fixup_end(struct symbol *p, struct symbol *c)
+{
+ if (strchr(p->name, '[') == NULL && strchr(c->name, '['))
+ /* Last kernel symbol mapped to end of page */
+ p->end = roundup(p->end, page_size);
+ else
+ p->end = c->start;
+ pr_debug4("%s sym:%s end:%#lx\n", __func__, p->name, p->end);
+}
diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 173f3378aaa0..4efde7879474 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -92,6 +92,11 @@ static int prefix_underscores_count(const char *str)
return tail - str;
}
+void __weak arch__symbols__fixup_end(struct symbol *p, struct symbol *c)
+{
+ p->end = c->start;
+}
+
const char * __weak arch__normalize_symbol_name(const char *name)
{
return name;
@@ -218,7 +223,7 @@ void symbols__fixup_end(struct rb_root_cached *symbols)
curr = rb_entry(nd, struct symbol, rb_node);
if (prev->end == prev->start && prev->end != curr->start)
- prev->end = curr->start;
+ arch__symbols__fixup_end(prev, curr);
}
/* Last entry */
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 12755b42ea93..183f630cb5f1 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -288,6 +288,7 @@ const char *arch__normalize_symbol_name(const char *name);
#define SYMBOL_A 0
#define SYMBOL_B 1
+void arch__symbols__fixup_end(struct symbol *p, struct symbol *c);
int arch__compare_symbol_names(const char *namea, const char *nameb);
int arch__compare_symbol_names_n(const char *namea, const char *nameb,
unsigned int n);
--
2.21.0
From: Thomas Richter <tmricht(a)linux.ibm.com>
On s390 the modules loaded in memory have the text segment located after
the GOT and Relocation table. This can be seen with this output:
[root@m35lp76 perf]# fgrep qeth /proc/modules
qeth 151552 1 qeth_l2, Live 0x000003ff800b2000
...
[root@m35lp76 perf]# cat /sys/module/qeth/sections/.text
0x000003ff800b3990
[root@m35lp76 perf]#
There is an offset of 0x1990 bytes. The size of the qeth module is
151552 bytes (0x25000 in hex).
The location of the GOT/relocation table at the beginning of a module is
unique to s390.
commit 203d8a4aa6ed ("perf s390: Fix 'start' address of module's map")
adjusts the start address of a module in the map structures, but does
not adjust the module's size. This leads to overlapping module
maps, as this example shows:
[root@m35lp76 perf] # ./perf report -D
0 0 0xfb0 [0xa0]: PERF_RECORD_MMAP -1/0: [0x3ff800b3990(0x25000)
@ 0]: x /lib/modules/.../qeth.ko.xz
0 0 0x1050 [0xb0]: PERF_RECORD_MMAP -1/0: [0x3ff800d85a0(0x8000)
@ 0]: x /lib/modules/.../ip6_tables.ko.xz
The module qeth.ko has its start address adjusted to 0x3ff800b3990, but
its size is unchanged, so the module ends at 0x3ff800d8990. This end
address overlaps with the next module's start address of 0x3ff800d85a0.
When the size of the leading GOT/relocation table stored at the
beginning of the text segment (0x1990 bytes) is subtracted from the qeth
module's end address, there are no overlaps anymore:
0x3ff800d8990 - 0x1990 = 0x3ff800d7000
which is the same as
0x3ff800b2000 + 0x25000 = 0x3ff800d7000.
To fix this issue, also adjust the module's size in function
arch__fix_module_text_start(). Add another function parameter named size
and reduce the size of the module when the text segment start address is
changed.
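A worked sketch of that adjustment using the qeth numbers above follows;
the variable names are illustrative, not the perf implementation.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t load_start = 0x3ff800b2000;  /* from /proc/modules */
    uint64_t text_start = 0x3ff800b3990;  /* from /sys/module/qeth/sections/.text */
    uint64_t size = 0x25000;              /* original module size */

    /* Drop the leading GOT/relocation table from the size. */
    size -= text_start - load_start;

    /* Prints 0x23670, matching the PERF_RECORD_MMAP output below. */
    printf("adjusted size: %#" PRIx64 "\n", size);
    return 0;
}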
Output after:
0 0 0xfb0 [0xa0]: PERF_RECORD_MMAP -1/0: [0x3ff800b3990(0x23670)
@ 0]: x /lib/modules/.../qeth.ko.xz
0 0 0x1050 [0xb0]: PERF_RECORD_MMAP -1/0: [0x3ff800d85a0(0x7a60)
@ 0]: x /lib/modules/.../ip6_tables.ko.xz
Reported-by: Stefan Liebler <stli(a)linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht(a)linux.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: Hendrik Brueckner <brueckner(a)linux.ibm.com>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Cc: stable(a)vger.kernel.org
Fixes: 203d8a4aa6ed ("perf s390: Fix 'start' address of module's map")
Link: http://lkml.kernel.org/r/20190724122703.3996-1-tmricht@linux.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme(a)redhat.com>
---
tools/perf/arch/s390/util/machine.c | 14 +++++++++++++-
tools/perf/util/machine.c | 3 ++-
tools/perf/util/machine.h | 2 +-
3 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/tools/perf/arch/s390/util/machine.c b/tools/perf/arch/s390/util/machine.c
index a19690a17291..de26b1441a48 100644
--- a/tools/perf/arch/s390/util/machine.c
+++ b/tools/perf/arch/s390/util/machine.c
@@ -7,7 +7,7 @@
#include "api/fs/fs.h"
#include "debug.h"
-int arch__fix_module_text_start(u64 *start, const char *name)
+int arch__fix_module_text_start(u64 *start, u64 *size, const char *name)
{
u64 m_start = *start;
char path[PATH_MAX];
@@ -17,6 +17,18 @@ int arch__fix_module_text_start(u64 *start, const char *name)
if (sysfs__read_ull(path, (unsigned long long *)start) < 0) {
pr_debug2("Using module %s start:%#lx\n", path, m_start);
*start = m_start;
+ } else {
+ /* Successful read of the modules segment text start address.
+ * Calculate difference between module start address
+ * in memory and module text segment start address.
+ * For example module load address is 0x3ff8011b000
+ * (from /proc/modules) and module text segment start
+ * address is 0x3ff8011b870 (from file above).
+ *
+ * Adjust the module size and subtract the GOT table
+ * size located at the beginning of the module.
+ */
+ *size -= (*start - m_start);
}
return 0;
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index cf826eca3aaf..83b2fbbeeb90 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1378,6 +1378,7 @@ static int machine__set_modules_path(struct machine *machine)
return map_groups__set_modules_path_dir(&machine->kmaps, modules_path, 0);
}
int __weak arch__fix_module_text_start(u64 *start __maybe_unused,
+ u64 *size __maybe_unused,
const char *name __maybe_unused)
{
return 0;
@@ -1389,7 +1390,7 @@ static int machine__create_module(void *arg, const char *name, u64 start,
struct machine *machine = arg;
struct map *map;
- if (arch__fix_module_text_start(&start, name) < 0)
+ if (arch__fix_module_text_start(&start, &size, name) < 0)
return -1;
map = machine__findnew_module_map(machine, start, name);
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index f70ab98a7bde..7aa38da26427 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -222,7 +222,7 @@ struct symbol *machine__find_kernel_symbol_by_name(struct machine *machine,
struct map *machine__findnew_module_map(struct machine *machine, u64 start,
const char *filename);
-int arch__fix_module_text_start(u64 *start, const char *name);
+int arch__fix_module_text_start(u64 *start, u64 *size, const char *name);
int machine__load_kallsyms(struct machine *machine, const char *filename);
--
2.21.0
I -thought- I had fixed this entirely, but it looks like I didn't
test this thoroughly enough, as we apparently still make one big mistake
in nv50_msto_atomic_check() - we don't handle the following scenario:
* CRTC #1 has n VCPI allocated to it, is attached to connector DP-4
which is attached to encoder #1. enabled=y active=n
* CRTC #1 is changed from DP-4 to DP-5, causing:
* DP-4 crtc=#1→NULL (VCPI n→0)
* DP-5 crtc=NULL→#1
* CRTC #1 steals encoder #1 back from DP-4 and gives it to DP-5
* CRTC #1 maintains the same mode as before, just with a different
connector
* mode_changed=n connectors_changed=y
(we _SHOULD_ do VCPI 0→n here, but don't)
Once the above scenario is repeated once, we'll attempt to free VCPI
from a connector that we never allocated it on, because the connectors
changed but the mode stayed the same. Sigh.
Since nv50_msto_atomic_check() has broken a few times now, let's rethink
things a bit to be more careful: limit both VCPI/PBN allocations to
mode_changed || connectors_changed, since neither VCPI nor PBN should
ever need to change outside of routing and mode changes.
Signed-off-by: Lyude Paul <lyude(a)redhat.com>
Reported-by: Bohdan Milar <bmilar(a)redhat.com>
Tested-by: Bohdan Milar <bmilar(a)redhat.com>
Fixes: 232c9eec417a ("drm/nouveau: Use atomic VCPI helpers for MST")
References: 412e85b60531 ("drm/nouveau: Only release VCPI slots on mode changes")
Cc: Lyude Paul <lyude(a)redhat.com>
Cc: Ben Skeggs <bskeggs(a)redhat.com>
Cc: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Cc: David Airlie <airlied(a)redhat.com>
Cc: Jerry Zuo <Jerry.Zuo(a)amd.com>
Cc: Harry Wentland <harry.wentland(a)amd.com>
Cc: Juston Li <juston.li(a)intel.com>
Cc: Laurent Pinchart <laurent.pinchart(a)ideasonboard.com>
Cc: Karol Herbst <karolherbst(a)gmail.com>
Cc: Ilia Mirkin <imirkin(a)alum.mit.edu>
Cc: <stable(a)vger.kernel.org> # v5.1+
---
drivers/gpu/drm/nouveau/dispnv50/disp.c | 22 +++++++++++++---------
1 file changed, 13 insertions(+), 9 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/dispnv50/disp.c b/drivers/gpu/drm/nouveau/dispnv50/disp.c
index 126703816794..5d23ab8e4917 100644
--- a/drivers/gpu/drm/nouveau/dispnv50/disp.c
+++ b/drivers/gpu/drm/nouveau/dispnv50/disp.c
@@ -771,16 +771,20 @@ nv50_msto_atomic_check(struct drm_encoder *encoder,
struct nv50_head_atom *asyh = nv50_head_atom(crtc_state);
int slots;
- /* When restoring duplicated states, we need to make sure that the
- * bw remains the same and avoid recalculating it, as the connector's
- * bpc may have changed after the state was duplicated
- */
- if (!state->duplicated)
- asyh->dp.pbn =
- drm_dp_calc_pbn_mode(crtc_state->adjusted_mode.clock,
- connector->display_info.bpc * 3);
+ if (crtc_state->mode_changed || crtc_state->connectors_changed) {
+ /*
+ * When restoring duplicated states, we need to make sure that
+ * the bw remains the same and avoid recalculating it, as the
+ * connector's bpc may have changed after the state was
+ * duplicated
+ */
+ if (!state->duplicated) {
+ const int bpp = connector->display_info.bpc * 3;
+ const int clock = crtc_state->adjusted_mode.clock;
+
+ asyh->dp.pbn = drm_dp_calc_pbn_mode(bpp, clock);
+ }
- if (crtc_state->mode_changed) {
slots = drm_dp_atomic_find_vcpi_slots(state, &mstm->mgr,
mstc->port,
asyh->dp.pbn);
--
2.21.0
This bit was flipped on for "syncing dependencies between camera and
graphics". BSpec has no recollection why, and it is causing
unrecoverable GPU hangs with Vulkan compute workloads.
From BSpec, setting bit 5 to 0 enables relaxed padding requirements for
buffers, 1D and 2D non-array, non-MSAA, non-mip-mapped linear surfaces;
and it *must* be set to 0h on skl+ to ensure the "Out of Bounds" case is
suppressed.
Reported-by: Jason Ekstrand <jason(a)jlekstrand.net>
Suggested-by: Jason Ekstrand <jason(a)jlekstrand.net>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110998
Fixes: 8424171e135c ("drm/i915/gen9: h/w w/a: syncing dependencies between camera and graphics")
Signed-off-by: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Jason Ekstrand <jason(a)jlekstrand.net>
Cc: Mika Kuoppala <mika.kuoppala(a)linux.intel.com>
Cc: <stable(a)vger.kernel.org> # v4.1+
---
drivers/gpu/drm/i915/gt/intel_workarounds.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 704ace01e7f5..b95c1d59a347 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -297,11 +297,6 @@ static void gen9_ctx_workarounds_init(struct intel_engine_cs *engine,
FLOW_CONTROL_ENABLE |
PARTIAL_INSTRUCTION_SHOOTDOWN_DISABLE);
- /* Syncing dependencies between camera and graphics:skl,bxt,kbl */
- if (!IS_COFFEELAKE(i915))
- WA_SET_BIT_MASKED(HALF_SLICE_CHICKEN3,
- GEN9_DISABLE_OCL_OOB_SUPPRESS_LOGIC);
-
/* WaEnableYV12BugFixInHalfSliceChicken7:skl,bxt,kbl,glk,cfl */
/* WaEnableSamplerGPGPUPreemptionSupport:skl,bxt,kbl,cfl */
WA_SET_BIT_MASKED(GEN9_HALF_SLICE_CHICKEN7,
--
2.23.0.rc1