Hi folks,
I noticed a regression introduced sometime after 4.19.4 in USB power
management. I have a 2015 MacBook Pro. When I try to do a suspend or a
suspend+hibernate, I get the following error messages trying to
suspend usb2 and the suspend fails. This works fine in 4.19.4:
Dec 22 13:50:36 eric-macbookpro kernel: Freezing remaining freezable
tasks ... (elapsed 0.001 seconds) done.
Dec 22 13:50:36 eric-macbookpro kernel: Suspending console(s) (use
no_console_suspend to debug)
Dec 22 13:50:36 eric-macbookpro kernel: dpm_run_callback():
usb_dev_freeze+0x0/0x10 returns -16
Dec 22 13:50:36 eric-macbookpro kernel: PM: Device usb2 failed to
freeze async: error -16
Dec 22 13:50:38 eric-macbookpro systemd[1]:
systemd-hybrid-sleep.service: Main process exited, code=exited,
status=1/FAILURE
Dec 22 13:50:38 eric-macbookpro systemd[1]:
systemd-hybrid-sleep.service: Failed with result 'exit-code'.
Dec 22 13:50:38 eric-macbookpro systemd[1]: Failed to start Hybrid
Suspend+Hibernate.
Dec 22 13:50:38 eric-macbookpro systemd[1]: Dependency failed for
Hybrid Suspend+Hibernate.
Dec 22 13:50:38 eric-macbookpro systemd[1]: hybrid-sleep.target: Job
hybrid-sleep.target/start failed with result 'dependency'.
Dec 22 13:50:38 eric-macbookpro systemd-logind[1573]: Operation
'sleep' finished.
Dec 22 13:50:38 eric-macbookpro systemd[1]: Stopped target Sleep.
The behavior exists in 4.19.8 and 4.19.11, the kernel versions I have
upgraded to with Arch Linux, so the regression was introduced sometime
between 4.19.4 and 4.19.8. Hibernate still works but when I resume
from hibernate, there is a ksoftirqd and kworker thread/process
together taking up 100% of one core. If I turn off auto power control
for usb1 and usb2, the threads stop spinning, e.g.:
echo 'on' > /sys/bus/usb/devices/usb1/power/control
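For reference, here is a minimal C sketch equivalent to the echo workaround
above, in case it helps to script it from a resume hook (the usb1/usb2 sysfs
paths are simply the ones from the echo example; adjust as needed):

#include <stdio.h>

int main(void)
{
	const char *nodes[] = {
		"/sys/bus/usb/devices/usb1/power/control",
		"/sys/bus/usb/devices/usb2/power/control",
	};
	int i;

	for (i = 0; i < 2; i++) {
		FILE *f = fopen(nodes[i], "w");

		if (!f) {
			perror(nodes[i]);
			continue;
		}
		fputs("on", f);	/* "on" disables runtime autosuspend */
		fclose(f);
	}
	return 0;
}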
Any suggestions as to where this regression was introduced and what
can be done to fix it?
Thanks,
Eric
From: Stanley Chu <stanley.chu(a)mediatek.com>
Commit 356fd2663cff ("scsi: Set request queue runtime PM status
back to active on resume") fixed up the inconsistent RPM status between
the request queue and the device. However, changing the request queue's
RPM status should be done only on a successful resume; otherwise the
status may still be inconsistent, as below:
Request queue: RPM_ACTIVE
Device: RPM_SUSPENDED
This ends up in a soft lockup, because requests can be submitted to the
underlying device while that device and its required resources are not
resumed.
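For context, the gate that makes this mismatch harmful is the queue-side
runtime PM check in blk_pm_peek_request(); a simplified sketch of that helper
as it looked around v4.19 (shown here only as an illustration, not part of
this patch):

/*
 * Non-PM requests are held back unless the queue's own RPM status is
 * RPM_ACTIVE.  A queue forced to RPM_ACTIVE therefore keeps feeding
 * requests to a device that is still RPM_SUSPENDED and never processes
 * them.
 */
static struct request *blk_pm_peek_request(struct request_queue *q,
					   struct request *rq)
{
	if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
	    (q->rpm_status != RPM_ACTIVE && !(rq->rq_flags & RQF_PM))))
		return NULL;
	return rq;
}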
Fixes: 356fd2663cff ("scsi: Set request queue runtime PM status
back to active on resume")
Cc: stable(a)vger.kernel.org
Signed-off-by: Stanley Chu <stanley.chu(a)mediatek.com>
---
drivers/scsi/scsi_pm.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c
index a2b4179bfdf7..7639df91b110 100644
--- a/drivers/scsi/scsi_pm.c
+++ b/drivers/scsi/scsi_pm.c
@@ -80,8 +80,22 @@ static int scsi_dev_type_resume(struct device *dev,
if (err == 0) {
pm_runtime_disable(dev);
- pm_runtime_set_active(dev);
+ err = pm_runtime_set_active(dev);
pm_runtime_enable(dev);
+
+ /*
+ * Forcibly set runtime PM status of request queue to "active"
+ * to make sure we can again get requests from the queue
+ * (see also blk_pm_peek_request()).
+ *
+ * The resume hook will correct runtime PM status of the disk.
+ */
+ if (!err && scsi_is_sdev_device(dev)) {
+ struct scsi_device *sdev = to_scsi_device(dev);
+
+ if (sdev->request_queue->dev)
+ blk_set_runtime_active(sdev->request_queue);
+ }
}
return err;
@@ -140,16 +154,6 @@ static int scsi_bus_resume_common(struct device *dev,
else
fn = NULL;
- /*
- * Forcibly set runtime PM status of request queue to "active" to
- * make sure we can again get requests from the queue (see also
- * blk_pm_peek_request()).
- *
- * The resume hook will correct runtime PM status of the disk.
- */
- if (scsi_is_sdev_device(dev) && pm_runtime_suspended(dev))
- blk_set_runtime_active(to_scsi_device(dev)->request_queue);
-
if (fn) {
async_schedule_domain(fn, dev, &scsi_sd_pm_domain);
--
2.18.0
Commit 2b2ea09e74a5 ("staging:r8188eu: Use lib80211 to decrypt WEP-frames")
causes scheduling-while-atomic bugs, followed by a hard freeze, whenever
the driver tries to connect to a WEP-encrypted network. Experimentation
showed that the freezes were eliminated when module lib80211 was
preloaded, which can be forced by calling lib80211_get_crypto_ops()
directly rather than indirectly through try_then_request_module().
With this change, no BUG messages are logged.
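For background, try_then_request_module() is roughly the following macro
(from include/linux/kmod.h); the synchronous __request_module() call may
sleep, which is what triggers the scheduling-while-atomic bugs when the WEP
paths are entered in atomic context:

/* Try the expression; if it yields NULL, synchronously load the module
 * and evaluate it again.  The synchronous module load may sleep. */
#define try_then_request_module(x, mod...) \
	((x) ?: (__request_module(true, mod), (x)))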
Fixes: 2b2ea09e74a5 ("staging:r8188eu: Use lib80211 to decrypt WEP-frames")
Cc: Stable <stable(a)vger.kernel.org> # v4.17+
Cc: Michael Straube <straube.linux(a)gmail.com>
Cc: Ivan Safonov <insafonov(a)gmail.com>
Signed-off-by: Larry Finger <Larry.Finger(a)lwfinger.net>
---
drivers/staging/rtl8188eu/core/rtw_security.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/staging/rtl8188eu/core/rtw_security.c b/drivers/staging/rtl8188eu/core/rtw_security.c
index 052656a22821..bab96c870042 100644
--- a/drivers/staging/rtl8188eu/core/rtw_security.c
+++ b/drivers/staging/rtl8188eu/core/rtw_security.c
@@ -154,7 +154,7 @@ void rtw_wep_encrypt(struct adapter *padapter, u8 *pxmitframe)
pframe = ((struct xmit_frame *)pxmitframe)->buf_addr + hw_hdr_offset;
- crypto_ops = try_then_request_module(lib80211_get_crypto_ops("WEP"), "lib80211_crypt_wep");
+ crypto_ops = lib80211_get_crypto_ops("WEP");
if (!crypto_ops)
return;
@@ -210,7 +210,7 @@ int rtw_wep_decrypt(struct adapter *padapter, u8 *precvframe)
void *crypto_private = NULL;
int status = _SUCCESS;
const int keyindex = prxattrib->key_index;
- struct lib80211_crypto_ops *crypto_ops = try_then_request_module(lib80211_get_crypto_ops("WEP"), "lib80211_crypt_wep");
+ struct lib80211_crypto_ops *crypto_ops = lib80211_get_crypto_ops("WEP");
char iv[4], icv[4];
if (!crypto_ops) {
--
2.16.4
An iptables rule like the following on a multicore system will result in
accepting more connections than the limit set in the rule.
iptables -A INPUT -p tcp -m tcp --syn --dport 7777 -m connlimit \
--connlimit-above 2000 --connlimit-mask 0 -j DROP
In the check_hlist() function, connections that are found in the saved
connections but not in the netfilter conntrack table are deleted, on the
assumption that those connections no longer exist. But on multicore systems
there is a small time window in which a connection has been added to the
xt_connlimit-maintained rb-tree but has not yet made it into the netfilter
conntrack table. This causes concurrent connections to be counted
incorrectly and to go over the limit set in the iptables rule.
The fix has been partially backported from the above-mentioned upstream
commit. It introduces a timestamp and the owning CPU.
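Condensed from the hunk below, the added check works like this: the age is
computed with unsigned 32-bit arithmetic so it stays correct across a jiffies
wrap, and only entries created on another CPU within the last two jiffies are
spared from deletion:

	u32 age = (u32)jiffies - conn->jiffies32;

	if (conn->cpu != raw_smp_processor_id() && age <= 2) {
		length++;	/* recently added on another CPU, keep it */
	} else {
		hlist_del(&conn->node);
		kmem_cache_free(connlimit_conn_cachep, conn);
	}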
Signed-off-by: Alakesh Haloi <alakeshh(a)amazon.com>
Cc: Pablo Neira Ayuso <pablo(a)netfilter.org>
Cc: Jozsef Kadlecsik <kadlec(a)blackhole.kfki.hu>
Cc: Florian Westphal <fw(a)strlen.de>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: stable(a)vger.kernel.org # v4.15 and before
Cc: netdev(a)vger.kernel.org
Cc: Dmitry Andrianov <dmitry.andrianov(a)alertme.com>
Cc: Justin Pettit <jpettit(a)vmware.com>
Cc: Yi-Hung Wei <yihung.wei(a)gmail.com>
---
net/netfilter/xt_connlimit.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index ffa8eec..e7b092b 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -47,6 +47,8 @@ struct xt_connlimit_conn {
struct hlist_node node;
struct nf_conntrack_tuple tuple;
union nf_inet_addr addr;
+ int cpu;
+ u32 jiffies32;
};
struct xt_connlimit_rb {
@@ -126,6 +128,8 @@ static bool add_hlist(struct hlist_head *head,
return false;
conn->tuple = *tuple;
conn->addr = *addr;
+ conn->cpu = raw_smp_processor_id();
+ conn->jiffies32 = (u32)jiffies;
hlist_add_head(&conn->node, head);
return true;
}
@@ -148,8 +152,26 @@ static unsigned int check_hlist(struct net *net,
hlist_for_each_entry_safe(conn, n, head, node) {
found = nf_conntrack_find_get(net, zone, &conn->tuple);
if (found == NULL) {
- hlist_del(&conn->node);
- kmem_cache_free(connlimit_conn_cachep, conn);
+ /* If the connection is not found, it may be because
+ * it has not made it into the conntrack table yet. We
+ * check if it is a recently created connection
+ * on a different core and do not delete it in that
+ * case.
+ */
+
+ unsigned long a, b;
+ int cpu = raw_smp_processor_id();
+ __u32 age;
+
+ b = conn->jiffies32;
+ a = (u32)jiffies;
+ age = a - b;
+ if (conn->cpu != cpu && age <= 2) {
+ length++;
+ } else {
+ hlist_del(&conn->node);
+ kmem_cache_free(connlimit_conn_cachep, conn);
+ }
continue;
}
@@ -271,6 +293,8 @@ static void tree_nodes_free(struct rb_root *root,
conn->tuple = *tuple;
conn->addr = *addr;
+ conn->cpu = raw_smp_processor_id();
+ conn->jiffies32 = (u32)jiffies;
rbconn->addr = *addr;
INIT_HLIST_HEAD(&rbconn->hhead);
--
1.8.3.1
An iptables rule like the following on a multicore system will result in
accepting more connections than the limit set in the rule.
iptables -A INPUT -p tcp -m tcp --syn --dport 7777 -m connlimit \
--connlimit-above 2000 --connlimit-mask 0 -j DROP
In the check_hlist() function, connections that are found in the saved
connections but not in the netfilter conntrack table are deleted, on the
assumption that those connections no longer exist. But on multicore systems
there is a small time window in which a connection has been added to the
xt_connlimit-maintained rb-tree but has not yet made it into the netfilter
conntrack table. This causes concurrent connections to be counted
incorrectly and to go over the limit set in the iptables rule.
Connection 1 on Core 1                  Connection 2 on Core 2

list_length = N
conntrack_table_len = N
spin_lock_bh()
In check_hlist() function
a. loop over saved connections
   1. call nf_conntrack_find_get()
   2. If not found in 1,
      i. call hlist_del()
b. return total count to caller
c. connection gets added to list
   of saved connections.
spin_unlock_bh()
list_length = N + 1
                                        spin_lock_bh() on core 2
                                        In check_hlist() function
                                        a. loop over saved connections
                                           1. call nf_conntrack_find_get()
                                           2. If not found in 1,
                                              i. call hlist_del()
                                                 [Connection 1 was in the list
                                                  but not in nf_conntrack yet]
                                              ii. connection 1 gets deleted
                                        list_length = N
                                        conntrack_table_len = N
                                        b. return total count to caller
                                        c. connection 2 gets added to list
                                           of saved connections
                                        spin_unlock_bh()
d. connection 1 gets added to
   nf_conntrack
list_length = N + 1
conntrack_table_len = N + 1
                                        e. connection 2 gets added to
                                           nf_conntrack
                                        list_length = N + 1
                                        conntrack_table_len = N + 2
So we end up with N + 1 connections in the list but N + 2 in nf_conntrack,
eventually allowing more connections than the rule permits.
This fix adds an additional field to track such pending connections,
prevents them from being deleted by another execution thread on a
different core, and returns the correct count.
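Condensed, the change below makes check_hlist() treat a conntrack miss as
follows (a sketch of the hunks that follow, not additional code):

	if (found == NULL) {
		if (conn->pending_add) {
			length++;	/* not confirmed in conntrack yet */
		} else {
			hlist_del(&conn->node);
			kmem_cache_free(connlimit_conn_cachep, conn);
		}
		continue;
	}
	/* seen in conntrack now, no longer pending */
	conn->pending_add = false;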
Signed-off-by: Alakesh Haloi <alakeshh(a)amazon.com>
Cc: Pablo Neira Ayuso <pablo(a)netfilter.org>
Cc: Jozsef Kadlecsik <kadlec(a)blackhole.kfki.hu>
Cc: Florian Westphal <fw(a)strlen.de>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: stable(a)vger.kernel.org # v4.15 and before
---
net/netfilter/xt_connlimit.c | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)
diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index ffa8eec980e9..bd7563c209a4 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -47,6 +47,7 @@ struct xt_connlimit_conn {
struct hlist_node node;
struct nf_conntrack_tuple tuple;
union nf_inet_addr addr;
+ bool pending_add;
};
struct xt_connlimit_rb {
@@ -126,6 +127,7 @@ static bool add_hlist(struct hlist_head *head,
return false;
conn->tuple = *tuple;
conn->addr = *addr;
+ conn->pending_add = true;
hlist_add_head(&conn->node, head);
return true;
}
@@ -144,15 +146,31 @@ static unsigned int check_hlist(struct net *net,
*addit = true;
- /* check the saved connections */
+ /* check the saved connections
+ */
hlist_for_each_entry_safe(conn, n, head, node) {
found = nf_conntrack_find_get(net, zone, &conn->tuple);
if (found == NULL) {
- hlist_del(&conn->node);
- kmem_cache_free(connlimit_conn_cachep, conn);
+ /* It could be an already deleted connection or
+ * a new connection that is not yet in conntrack.
+ * If the former, delete it from the list; otherwise
+ * increase the count and move on.
+ */
+ if (conn->pending_add) {
+ length++;
+ } else {
+ hlist_del(&conn->node);
+ kmem_cache_free(connlimit_conn_cachep, conn);
+ }
continue;
}
+ /* If it is a connection that was pending insertion to
+ * connection tracking table before, then it's time to clear
+ * the flag.
+ */
+ conn->pending_add = false;
+
found_ct = nf_ct_tuplehash_to_ctrack(found);
if (nf_ct_tuple_equal(&conn->tuple, tuple)) {
--
2.14.4
Hi x86 maintainers,
This is an important fix that I believe needs to be merged for 4.21.
Without it, applications calling fork() can potentially double-allocate
a protection key, causing lots of strange problems.
Thomas's Reviewed-by is on the actual fix, but not the selftest.
I would also be happy to send this as a pull request if you would
prefer.
Cc: x86(a)kernel.org
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: stable(a)vger.kernel.org
The patch titled
Subject: slab: alien caches must not be initialized if the allocation of the alien cache failed
has been added to the -mm tree. Its filename is
slab-alien-caches-must-not-be-initialized-if-the-allocation-of-the-alien-cache-failed.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/slab-alien-caches-must-not-be-init…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/slab-alien-caches-must-not-be-init…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Christoph Lameter <cl(a)linux.com>
Subject: slab: alien caches must not be initialized if the allocation of the alien cache failed
Callers of __alloc_alien() check for NULL. We must do the same check in
__alloc_alien_cache() to avoid NULL pointer dereferences on allocation
failures.
Link: http://lkml.kernel.org/r/010001680f42f192-82b4e12e-1565-4ee0-ae1f-1e9897490…
Signed-off-by: Christoph Lameter <cl(a)linux.com>
Reported-by: syzbot+d6ed4ec679652b4fd4e4(a)syzkaller.appspotmail.com
Reviewed-by: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Pekka Enberg <penberg(a)kernel.org>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/slab.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
--- a/mm/slab.c~slab-alien-caches-must-not-be-initialized-if-the-allocation-of-the-alien-cache-failed
+++ a/mm/slab.c
@@ -666,8 +666,10 @@ static struct alien_cache *__alloc_alien
struct alien_cache *alc = NULL;
alc = kmalloc_node(memsize, gfp, node);
- init_arraycache(&alc->ac, entries, batch);
- spin_lock_init(&alc->lock);
+ if (alc) {
+ init_arraycache(&alc->ac, entries, batch);
+ spin_lock_init(&alc->lock);
+ }
return alc;
}
_
Patches currently in -mm which might be from cl(a)linux.com are
slab-alien-caches-must-not-be-initialized-if-the-allocation-of-the-alien-cache-failed.patch
The patch titled
Subject: fork, memcg: fix cached_stacks case
has been added to the -mm tree. Its filename is
fork-memcg-fix-cached_stacks-case.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/fork-memcg-fix-cached_stacks-case.…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/fork-memcg-fix-cached_stacks-case.…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Shakeel Butt <shakeelb(a)google.com>
Subject: fork, memcg: fix cached_stacks case
5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on memcg charge
fail") fixes a crash caused due to failed memcg charge of the kernel
stack. However the fix misses the cached_stacks case which this patch
fixes. So, the same crash can happen if the memcg charge of a cached
stack is failed.
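For context, a condensed sketch of the cached-stack branch in
alloc_thread_stack_node() with the one-liner below applied; the surrounding
loop is shown as it appears in kernel/fork.c, and only the tsk->stack
assignment is new:

	for (i = 0; i < NR_CACHED_STACKS; i++) {
		struct vm_struct *s = this_cpu_xchg(cached_stacks[i], NULL);

		if (!s)
			continue;

		/* Clear stale pointers from reused stack. */
		memset(s->addr, 0, THREAD_SIZE);

		tsk->stack_vm_area = s;
		tsk->stack = s->addr;	/* so free_thread_stack() on memcg
					 * charge failure sees a valid stack */
		return s->addr;
	}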
Link: http://lkml.kernel.org/r/20190102180145.57406-1-shakeelb@google.com
Fixes: 5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on memcg charge fail")
Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
Cc: Rik van Riel <riel(a)surriel.com>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/fork.c | 1 +
1 file changed, 1 insertion(+)
--- a/kernel/fork.c~fork-memcg-fix-cached_stacks-case
+++ a/kernel/fork.c
@@ -221,6 +221,7 @@ static unsigned long *alloc_thread_stack
memset(s->addr, 0, THREAD_SIZE);
tsk->stack_vm_area = s;
+ tsk->stack = s->addr;
return s->addr;
}
_
Patches currently in -mm which might be from shakeelb(a)google.com are
fork-memcg-fix-cached_stacks-case.patch
Recently, Alakesh Haloi reported the following issue [1] with stable/4.14:
"""
An iptable rule like the following on a multicore systems will result in
accepting more connections than set in the rule.
iptables -A INPUT -p tcp -m tcp --syn --dport 7777 -m connlimit \
--connlimit-above 2000 --connlimit-mask 0 -j DROP
"""
And proposed a fix that is not in Linus's tree. The discussion went on to
confirm whether the issue was still reproducible with mainline/nf.git tip,
and to either identify the upstream fix or re-submit the non-upstream fix.
Alakesh eventually was able to test with upstream, and reported that the issue
was still reproducible [2].
On that, our findings diverge, at least in my test environment:
First, I verified that the suggested mainline fix for the issue [3] indeed
fixes it, by testing with it applied and reverted on v4.18, a clean revert.
(The issue is reproducible with the commit reverted).
Then, with a consistent reproducer, I moved to nf.git, with HEAD on commit
a007232 ("netfilter: nf_conncount: fix argument order to find_next_bit"),
and the issue was not reproducible (even with 20+ threads on the client side,
the number Alakesh reported to achieve 2150+ connections [4], and I tried
spreading the network interface IRQ affinity over more and more CPUs too.)
Either way, the suggested mainline fix does actually fix the issue in 4.14
for at least one environment. So, it might well be the case that Alakesh's
test environment has differences/subtleties that lead to more connections
being accepted, and more commits are needed for that particular environment type.
But for now, with one bare-metal environment (24-core server, 4-core client)
verified, I thought of submitting the patches for review/comments/testing,
then looking for additional fixes for that environment separately.
The fix is PATCH 4/4, and PATCHes 1-3/4 are helpers for a cleaner backport.
All backports are simple, and essentially consist of refreshed context lines
and the use of older struct/file names.
Reviews from netfilter maintainers are very appreciated, as I've no previous
experience in this area, and although the backports look simple and build/run
correctly, there's usually stuff that only more experienced people may notice.
Thanks,
Mauricio
Links:
=====
[1] https://www.spinics.net/lists/stable/msg270040.html
[2] https://www.spinics.net/lists/stable/msg273669.html
[3] https://www.spinics.net/lists/stable/msg271300.html
[4] https://www.spinics.net/lists/stable/msg273669.html
Test-case:
=========
- v4.14.91 (original): client achieves 2000+ connections (6000 target)
with 3 threads.
server # iptables -F
server # iptables -A INPUT -p tcp -m tcp --syn --dport 7777 -m connlimit --connlimit-above 2000 --connlimit-mask 0 -j DROP
server # iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
DROP tcp -- anywhere anywhere tcp dpt:7777 flags:FIN,SYN,RST,ACK/SYN #conn src/0 > 2000
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
server # ulimit -SHn 65000
server # ruby server.rb
<... listening ...>
client # ulimit -SHn 65000
client # ruby client.rb 10.230.56.100 7777 6000 3
Connecting to ["10.230.56.100"]:7777 6000 times with 3
1
2
3
<...>
2000
<...>
6000
Target reached. Thread finishing
6001
Target reached. Thread finishing
6002
Target reached. Thread finishing
Threads done. 6002 connections
press enter to exit
- v4.14.91 + patches: client only achieved 2000 connections.
server # (same procedure)
client # (same procedure)
Connecting to ["10.230.56.100"]:7777 6000 times with 3
1
2
3
<...>
2000
<... blocked for a while...>
failed to create connection: Connection timed out - connect(2) for "10.230.56.100" port 7777
failed to create connection: Connection timed out - connect(2) for "10.230.56.100" port 7777
failed to create connection: Connection timed out - connect(2) for "10.230.56.100" port 7777
Threads done. 2000 connections
press enter to exit
Florian Westphal (2):
netfilter: xt_connlimit: don't store address in the conn nodes
netfilter: nf_conncount: fix garbage collection confirm race
Pablo Neira Ayuso (1):
netfilter: nf_conncount: expose connection list interface
Yi-Hung Wei (1):
netfilter: nf_conncount: Fix garbage collection with zones
include/net/netfilter/nf_conntrack_count.h | 15 +++++
net/netfilter/xt_connlimit.c | 99 +++++++++++++++++++++++-------
2 files changed, 91 insertions(+), 23 deletions(-)
create mode 100644 include/net/netfilter/nf_conntrack_count.h
--
2.7.4
The patch titled
Subject: lib/gen_crc64table.c: don't depend on linux headers being installed
has been removed from the -mm tree. Its filename was
lib-dont-depend-on-linux-headers-being-installed.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: NeilBrown <neil(a)brown.name>
Subject: lib/gen_crc64table.c: don't depend on linux headers being installed
gen_crc64table requires Linux include files to be installed in
/usr/include/linux. This is a new requirement, so hosts that could
previously build the kernel now cannot.
gen_crc64table imposes this requirement by including <linux/swab.h>, but
nothing from that header is actually used.
So remove the #include, so that the Linux headers no longer need to be
installed.
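To illustrate why nothing beyond standard C is needed: the table generation
is plain 64-bit integer arithmetic, roughly as below (a sketch of the
generator's core loop; the main() is added here only to make it standalone):

#include <inttypes.h>
#include <stdio.h>

#define CRC64_ECMA182_POLY 0x42F0E1EBA9EA3693ULL

static uint64_t crc64_table[256];

static void generate_crc64_table(void)
{
	uint64_t i, j, c, crc;

	for (i = 0; i < 256; i++) {
		crc = 0;
		c = i << 56;
		for (j = 0; j < 8; j++) {
			if ((crc ^ c) & 0x8000000000000000ULL)
				crc = (crc << 1) ^ CRC64_ECMA182_POLY;
			else
				crc <<= 1;
			c <<= 1;
		}
		crc64_table[i] = crc;
	}
}

int main(void)
{
	generate_crc64_table();
	printf("crc64_table[1] = 0x%016" PRIx64 "\n", crc64_table[1]);
	return 0;
}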
Link: http://lkml.kernel.org/r/87y3899gzi.fsf@notabene.neil.brown.name
Fixes: feba04fd2cf8 ("lib: add crc64 calculation routines")
Signed-off-by: NeilBrown <neil(a)brown.name>
Acked-by: Coly Li <colyli(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/lib/gen_crc64table.c~lib-dont-depend-on-linux-headers-being-installed
+++ a/lib/gen_crc64table.c
@@ -16,8 +16,6 @@
#include <inttypes.h>
#include <stdio.h>
-#include <linux/swab.h>
-
#define CRC64_ECMA182_POLY 0x42F0E1EBA9EA3693ULL
static uint64_t crc64_table[256] = {0};
_
Patches currently in -mm which might be from neil(a)brown.name are
The patch titled
Subject: memcg, oom: notify on oom killer invocation from the charge path
has been removed from the -mm tree. Its filename was
memcg-oom-notify-on-oom-killer-invocation-from-the-charge-path.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Michal Hocko <mhocko(a)suse.com>
Subject: memcg, oom: notify on oom killer invocation from the charge path
Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via
eventfd anymore. The reason is that 29ef680ae7c2 ("memcg, oom: move
out_of_memory back to the charge path") has moved the oom handling back to
the charge path. While doing so the notification was left behind in
mem_cgroup_oom_synchronize.
Fix the issue by replicating the oom hierarchy locking and the
notification.
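For reference, the notification that has to be replicated is the eventfd
signalling done for every memcg in the hierarchy; roughly (a sketch of the
existing helpers in mm/memcontrol.c, not part of this patch):

static void mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg)
{
	struct mem_cgroup_eventfd_list *ev;

	spin_lock(&memcg_oom_lock);
	list_for_each_entry(ev, &memcg->oom_notify, list)
		eventfd_signal(ev->eventfd, 1);
	spin_unlock(&memcg_oom_lock);
}

static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
{
	struct mem_cgroup *iter;

	for_each_mem_cgroup_tree(iter, memcg)
		mem_cgroup_oom_notify_cb(iter);
}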
Link: http://lkml.kernel.org/r/20181224091107.18354-1-mhocko@kernel.org
Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Reported-by: Burt Holzman <burt(a)fnal.gov>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [4.19+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/memcontrol.c~memcg-oom-notify-on-oom-killer-invocation-from-the-charge-path
+++ a/mm/memcontrol.c
@@ -1673,6 +1673,9 @@ enum oom_status {
static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
+ enum oom_status ret;
+ bool locked;
+
if (order > PAGE_ALLOC_COSTLY_ORDER)
return OOM_SKIPPED;
@@ -1707,10 +1710,23 @@ static enum oom_status mem_cgroup_oom(st
return OOM_ASYNC;
}
+ mem_cgroup_mark_under_oom(memcg);
+
+ locked = mem_cgroup_oom_trylock(memcg);
+
+ if (locked)
+ mem_cgroup_oom_notify(memcg);
+
+ mem_cgroup_unmark_under_oom(memcg);
if (mem_cgroup_out_of_memory(memcg, mask, order))
- return OOM_SUCCESS;
+ ret = OOM_SUCCESS;
+ else
+ ret = OOM_FAILED;
+
+ if (locked)
+ mem_cgroup_oom_unlock(memcg);
- return OOM_FAILED;
+ return ret;
}
/**
_
Patches currently in -mm which might be from mhocko(a)suse.com are
mm-memcg-fix-reclaim-deadlock-with-writeback.patch
The patch titled
Subject: mm, swap: fix swapoff with KSM pages
has been removed from the -mm tree. Its filename was
mm-swap-fix-swapoff-with-ksm-pages.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Huang Ying <ying.huang(a)intel.com>
Subject: mm, swap: fix swapoff with KSM pages
KSM pages may be mapped in multiple VMAs that cannot all be reached from
one anon_vma. So during swapin, a new copy of the page needs to be
generated if a different anon_vma is needed; please refer to the comments
of ksm_might_need_to_copy() for details.
During swapoff, unuse_vma() uses the anon_vma (if available) to locate the
VMA and the virtual address mapped to the page, so not all mappings of a
swapped-out KSM page can be found. So in try_to_unuse(), even if the swap
count of a swap entry isn't zero, the page needs to be deleted from the
swap cache, so that in the next round a new page can be allocated and
swapped in for the other mappings of the swapped-out KSM page.
But this conflicts with THP swap support, where the THP can be deleted
from the swap cache only after the swap count of every swap entry in the
huge swap cluster backing the THP has reached 0. So try_to_unuse() was
changed in commit e07098294adf ("mm, THP, swap: support to reclaim swap
space for THP swapped out") to check that before deleting a page from the
swap cache, but this broke KSM swapoff.
Fortunately, KSM is for normal pages only, so the original behavior for
KSM pages can be restored easily by checking PageTransCompound(). That is
how this patch works.
The bug was introduced by e07098294adf ("mm, THP, swap: support to reclaim
swap space for THP swapped out"), which was merged in v4.14-rc1, so I
think we should backport the fix to 4.14 and later. But Hugh thinks it may
be rare for KSM pages to be in the swap device at swapoff time, which is
why nobody has reported the bug so far.
Link: http://lkml.kernel.org/r/20181226051522.28442-1-ying.huang@intel.com
Fixes: e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out")
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Reported-by: Hugh Dickins <hughd(a)google.com>
Tested-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Hugh Dickins <hughd(a)google.com>
Cc: Rik van Riel <riel(a)redhat.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Shaohua Li <shli(a)kernel.org>
Cc: Daniel Jordan <daniel.m.jordan(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/swapfile.c~mm-swap-fix-swapoff-with-ksm-pages
+++ a/mm/swapfile.c
@@ -2197,7 +2197,8 @@ int try_to_unuse(unsigned int type, bool
*/
if (PageSwapCache(page) &&
likely(page_private(page) == entry.val) &&
- !page_swapped(page))
+ (!PageTransCompound(page) ||
+ !swap_page_trans_huge_swapped(si, entry)))
delete_from_swap_cache(compound_head(page));
/*
_
Patches currently in -mm which might be from ying.huang(a)intel.com are
mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch
mm-swap-fix-race-between-swapoff-and-some-swap-operations-v6.patch
mm-fix-race-between-swapoff-and-mincore.patch
The patch titled
Subject: hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
has been removed from the -mm tree. Its filename was
hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straightforward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require a memory allocation which could fail, so that needs to be taken
into account as well.
Instead of writing the required complicated code for this rare occurrence,
just eliminate the race. i_mmap_rwsem is now held in read mode for the
duration of page fault processing. Hold i_mmap_rwsem longer in the
truncation and hole punch code to cover the call to remove_inode_hugepages.
With this modification, code in remove_inode_hugepages checking for races
becomes 'dead' as the races can no longer happen. Remove the dead code and
expand comments to explain the reasoning. Similarly, checks for races with
truncation in the page fault path can be simplified and removed.
[mike.kravetz(a)oracle.com: incorporate suggestions from Kirill]
Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/fs/hugetlbfs/inode.c~hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race
+++ a/fs/hugetlbfs/inode.c
@@ -383,17 +383,16 @@ hugetlb_vmdelete_list(struct rb_root_cac
* truncation is indicated by end of range being LLONG_MAX
* In this case, we first scan the range and release found pages.
* After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
- * maps and global counts. Page faults can not race with truncation
- * in this routine. hugetlb_no_page() prevents page faults in the
- * truncated range. It checks i_size before allocation, and again after
- * with the page table lock for the page held. The same lock must be
- * acquired to unmap a page.
+ * maps and global counts.
* hole punch is indicated if end is not LLONG_MAX
* In the hole punch case we scan the range and release found pages.
* Only when releasing a page is the associated region/reserv map
* deleted. The region/reserv map for ranges without associated
- * pages are not modified. Page faults can race with hole punch.
- * This is indicated if we find a mapped page.
+ * pages are not modified.
+ *
+ * Callers of this routine must hold the i_mmap_rwsem in write mode to prevent
+ * races with page faults.
+ *
* Note: If the passed end of range value is beyond the end of file, but
* not LLONG_MAX this routine still performs a hole punch operation.
*/
@@ -423,32 +422,14 @@ static void remove_inode_hugepages(struc
for (i = 0; i < pagevec_count(&pvec); ++i) {
struct page *page = pvec.pages[i];
- u32 hash;
index = page->index;
- hash = hugetlb_fault_mutex_hash(h, current->mm,
- &pseudo_vma,
- mapping, index, 0);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
-
/*
- * If page is mapped, it was faulted in after being
- * unmapped in caller. Unmap (again) now after taking
- * the fault mutex. The mutex will prevent faults
- * until we finish removing the page.
- *
- * This race can only happen in the hole punch case.
- * Getting here in a truncate operation is a bug.
+ * A mapped page is impossible as callers should unmap
+ * all references before calling. And, i_mmap_rwsem
+ * prevents the creation of additional mappings.
*/
- if (unlikely(page_mapped(page))) {
- BUG_ON(truncate_op);
-
- i_mmap_lock_write(mapping);
- hugetlb_vmdelete_list(&mapping->i_mmap,
- index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h));
- i_mmap_unlock_write(mapping);
- }
+ VM_BUG_ON(page_mapped(page));
lock_page(page);
/*
@@ -470,7 +451,6 @@ static void remove_inode_hugepages(struc
}
unlock_page(page);
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
huge_pagevec_release(&pvec);
cond_resched();
@@ -482,9 +462,20 @@ static void remove_inode_hugepages(struc
static void hugetlbfs_evict_inode(struct inode *inode)
{
+ struct address_space *mapping = inode->i_mapping;
struct resv_map *resv_map;
+ /*
+ * The vfs layer guarantees that there are no other users of this
+ * inode. Therefore, it would be safe to call remove_inode_hugepages
+ * without holding i_mmap_rwsem. We acquire and hold here to be
+ * consistent with other callers. Since there will be no contention
+ * on the semaphore, overhead is negligible.
+ */
+ i_mmap_lock_write(mapping);
remove_inode_hugepages(inode, 0, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
+
resv_map = (struct resv_map *)inode->i_mapping->private_data;
/* root inode doesn't have the resv_map, so we should check it */
if (resv_map)
@@ -505,8 +496,8 @@ static int hugetlb_vmtruncate(struct ino
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
return 0;
}
@@ -540,8 +531,8 @@ static long hugetlbfs_punch_hole(struct
hugetlb_vmdelete_list(&mapping->i_mmap,
hole_start >> PAGE_SHIFT,
hole_end >> PAGE_SHIFT);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
+ i_mmap_unlock_write(mapping);
inode_unlock(inode);
}
@@ -624,7 +615,11 @@ static long hugetlbfs_fallocate(struct f
/* addr is the offset within the file (zero based) */
addr = index * hpage_size;
- /* mutex taken here, fault path and hole punch */
+ /*
+ * fault mutex taken here, protects against fault path
+ * and hole punch. inode_lock previously taken protects
+ * against truncation.
+ */
hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
index, addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
--- a/mm/hugetlb.c~hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race
+++ a/mm/hugetlb.c
@@ -3755,16 +3755,16 @@ static vm_fault_t hugetlb_no_page(struct
}
/*
- * Use page lock to guard against racing truncation
- * before we get page_table_lock.
+ * We can not race with truncation due to holding i_mmap_rwsem.
+ * Check once here for faults beyond end of file.
*/
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
+ if (idx >= size)
+ goto out;
+
retry:
page = find_lock_page(mapping, idx);
if (!page) {
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto out;
-
/*
* Check for page in userfault range
*/
@@ -3854,9 +3854,6 @@ retry:
}
ptl = huge_pte_lock(h, mm, ptep);
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto backout;
ret = 0;
if (!huge_pte_none(huge_ptep_get(ptep)))
@@ -3959,8 +3956,10 @@ vm_fault_t hugetlb_fault(struct mm_struc
/*
* Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
- * until finished with ptep. This prevents huge_pmd_unshare from
- * being called elsewhere and making the ptep no longer valid.
+ * until finished with ptep. This serves two purposes:
+ * 1) It prevents huge_pmd_unshare from being called elsewhere
+ * and making the ptep no longer valid.
+ * 2) It synchronizes us with file truncation.
*
* ptep could have already be assigned via huge_pte_offset. That
* is OK, as huge_pte_alloc will return the same value unless
_
Patches currently in -mm which might be from mike.kravetz(a)oracle.com are
The patch titled
Subject: hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
has been removed from the -mm tree. Its filename was
hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
While looking at BUGs associated with invalid huge page map counts, it was
discovered that a huge pte pointer could become 'invalid' and point to
another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with the
ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
called.
[mike.kravetz(a)oracle.com: add explicit check for mapping != null]
Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com>
Cc: Colin Ian King <colin.king(a)canonical.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/hugetlb.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/hugetlb.c
@@ -3238,6 +3238,7 @@ int copy_hugetlb_page_range(struct mm_st
struct page *ptepage;
unsigned long addr;
int cow;
+ struct address_space *mapping = vma->vm_file->f_mapping;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
struct mmu_notifier_range range;
@@ -3249,13 +3250,23 @@ int copy_hugetlb_page_range(struct mm_st
mmu_notifier_range_init(&range, src, vma->vm_start,
vma->vm_end);
mmu_notifier_invalidate_range_start(&range);
+ } else {
+ /*
+ * For shared mappings i_mmap_rwsem must be held to call
+ * huge_pte_alloc, otherwise the returned ptep could go
+ * away if part of a shared pmd and another thread calls
+ * huge_pmd_unshare.
+ */
+ i_mmap_lock_read(mapping);
}
for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
+
src_pte = huge_pte_offset(src, addr, sz);
if (!src_pte)
continue;
+
dst_pte = huge_pte_alloc(dst, addr, sz);
if (!dst_pte) {
ret = -ENOMEM;
@@ -3326,6 +3337,8 @@ int copy_hugetlb_page_range(struct mm_st
if (cow)
mmu_notifier_invalidate_range_end(&range);
+ else
+ i_mmap_unlock_read(mapping);
return ret;
}
@@ -3771,14 +3784,18 @@ retry:
};
/*
- * hugetlb_fault_mutex must be dropped before
- * handling userfault. Reacquire after handling
- * fault to make calling code simpler.
+ * hugetlb_fault_mutex and i_mmap_rwsem must be
+ * dropped before handling userfault. Reacquire
+ * after handling fault to make calling code simpler.
*/
hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
idx, haddr);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+
ret = handle_userfault(&vmf, VM_UFFD_MISSING);
+
+ i_mmap_lock_read(mapping);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
goto out;
}
@@ -3926,6 +3943,11 @@ vm_fault_t hugetlb_fault(struct mm_struc
ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (ptep) {
+ /*
+ * Since we hold no locks, ptep could be stale. That is
+ * OK as we are only making decisions based on content and
+ * not actually modifying content here.
+ */
entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
migration_entry_wait_huge(vma, mm, ptep);
@@ -3933,20 +3955,31 @@ vm_fault_t hugetlb_fault(struct mm_struc
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
- } else {
- ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
- if (!ptep)
- return VM_FAULT_OOM;
}
+ /*
+ * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
+ * until finished with ptep. This prevents huge_pmd_unshare from
+ * being called elsewhere and making the ptep no longer valid.
+ *
+ * ptep could have already be assigned via huge_pte_offset. That
+ * is OK, as huge_pte_alloc will return the same value unless
+ * something changed.
+ */
mapping = vma->vm_file->f_mapping;
- idx = vma_hugecache_offset(h, vma, haddr);
+ i_mmap_lock_read(mapping);
+ ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
+ if (!ptep) {
+ i_mmap_unlock_read(mapping);
+ return VM_FAULT_OOM;
+ }
/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
*/
+ idx = vma_hugecache_offset(h, vma, haddr);
hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, haddr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -4034,6 +4067,7 @@ out_ptl:
}
out_mutex:
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
/*
* Generally it's safe to hold refcount during waiting page lock. But
* here we just wait to defer the next page fault to avoid busy loop and
@@ -4638,10 +4672,12 @@ void adjust_range_if_pmd_sharing_possibl
* Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
* and returns the corresponding pte. While this is not necessary for the
* !shared pmd case because we can allocate the pmd later as well, it makes the
- * code much cleaner. pmd allocation is essential for the shared case because
- * pud has to be populated inside the same i_mmap_rwsem section - otherwise
- * racing tasks could either miss the sharing (see huge_pte_offset) or select a
- * bad pmd for sharing.
+ * code much cleaner.
+ *
+ * This routine must be called with i_mmap_rwsem held in at least read mode.
+ * For hugetlbfs, this prevents removal of any page table entries associated
+ * with the address space. This is important as we are setting up sharing
+ * based on existing page table entries (mappings).
*/
pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
{
@@ -4658,7 +4694,6 @@ pte_t *huge_pmd_share(struct mm_struct *
if (!vma_shareable(vma, addr))
return (pte_t *)pmd_alloc(mm, pud, addr);
- i_mmap_lock_write(mapping);
vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -4688,7 +4723,6 @@ pte_t *huge_pmd_share(struct mm_struct *
spin_unlock(ptl);
out:
pte = (pte_t *)pmd_alloc(mm, pud, addr);
- i_mmap_unlock_write(mapping);
return pte;
}
@@ -4699,7 +4733,7 @@ out:
* indicated by page_count > 1, unmap is achieved by clearing pud and
* decrementing the ref count. If count == 1, the pte page is not shared.
*
- * called with page table lock held.
+ * Called with page table lock held and i_mmap_rwsem held in write mode.
*
* returns: 1 successfully unmapped a shared pte page
* 0 the underlying pte page is not shared, or it is the last user
--- a/mm/memory-failure.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/memory-failure.c
@@ -966,7 +966,7 @@ static bool hwpoison_user_mappings(struc
enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
struct address_space *mapping;
LIST_HEAD(tokill);
- bool unmap_success;
+ bool unmap_success = true;
int kill = 1, forcekill;
struct page *hpage = *hpagep;
bool mlocked = PageMlocked(hpage);
@@ -1028,7 +1028,19 @@ static bool hwpoison_user_mappings(struc
if (kill)
collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
- unmap_success = try_to_unmap(hpage, ttu);
+ if (!PageHuge(hpage)) {
+ unmap_success = try_to_unmap(hpage, ttu);
+ } else if (mapping) {
+ /*
+ * For hugetlb pages, try_to_unmap could potentially call
+ * huge_pmd_unshare. Because of this, take semaphore in
+ * write mode here and set TTU_RMAP_LOCKED to indicate we
+ * have taken the lock at this higher level.
+ */
+ i_mmap_lock_write(mapping);
+ unmap_success = try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
+ i_mmap_unlock_write(mapping);
+ }
if (!unmap_success)
pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
pfn, page_mapcount(hpage));
--- a/mm/migrate.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/migrate.c
@@ -1324,8 +1324,19 @@ static int unmap_and_move_huge_page(new_
goto put_anon;
if (page_mapped(hpage)) {
+ struct address_space *mapping = page_mapping(hpage);
+
+ /*
+ * try_to_unmap could potentially call huge_pmd_unshare.
+ * Because of this, take semaphore in write mode here and
+ * set TTU_RMAP_LOCKED to let lower levels know we have
+ * taken the lock.
+ */
+ i_mmap_lock_write(mapping);
try_to_unmap(hpage,
- TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+ TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
+ TTU_RMAP_LOCKED);
+ i_mmap_unlock_write(mapping);
page_was_mapped = 1;
}
--- a/mm/rmap.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/rmap.c
@@ -25,6 +25,7 @@
* page->flags PG_locked (lock_page)
* hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
* mapping->i_mmap_rwsem
+ * hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
* anon_vma->rwsem
* mm->page_table_lock or pte_lock
* zone_lru_lock (in mark_page_accessed, isolate_lru_page)
@@ -1378,6 +1379,9 @@ static bool try_to_unmap_one(struct page
/*
* If sharing is possible, start and end will be adjusted
* accordingly.
+ *
+ * If called for a huge page, caller must hold i_mmap_rwsem
+ * in write mode as it is possible to call huge_pmd_unshare.
*/
adjust_range_if_pmd_sharing_possible(vma, &range.start,
&range.end);
--- a/mm/userfaultfd.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/userfaultfd.c
@@ -267,10 +267,14 @@ retry:
VM_BUG_ON(dst_addr & ~huge_page_mask(h));
/*
- * Serialize via hugetlb_fault_mutex
+ * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
+ * i_mmap_rwsem ensures the dst_pte remains valid even
+ * in the case of shared pmds. fault mutex prevents
+ * races with other faulting threads.
*/
- idx = linear_page_index(dst_vma, dst_addr);
mapping = dst_vma->vm_file->f_mapping;
+ i_mmap_lock_read(mapping);
+ idx = linear_page_index(dst_vma, dst_addr);
hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
idx, dst_addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -279,6 +283,7 @@ retry:
dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h));
if (!dst_pte) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
goto out_unlock;
}
@@ -286,6 +291,7 @@ retry:
dst_pteval = huge_ptep_get(dst_pte);
if (!huge_pte_none(dst_pteval)) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
goto out_unlock;
}
@@ -293,6 +299,7 @@ retry:
dst_addr, src_addr, &page);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
vm_alloc_shared = vm_shared;
cond_resched();
_
Patches currently in -mm which might be from mike.kravetz(a)oracle.com are
The patch titled
Subject: hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined
has been removed from the -mm tree. Its filename was
hwpoison-memory_hotplug-allow-hwpoisoned-pages-to-be-offlined.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Michal Hocko <mhocko(a)suse.com>
Subject: hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined
We have received a bug report that an injected MCE about faulty memory
prevents memory offline from succeeding on a 4.4-based kernel. The
underlying reason was that the HWPoison page has an elevated reference
count and the migration keeps failing. There are two problems with that.
First of all, it is dubious to migrate the poisoned page because we know
that accessing that memory may fail. Secondly, it doesn't make any sense
to migrate potentially broken content and preserve the memory corruption
in a new location.
Oscar has found out that 4.4 and the current upstream kernels behave
slightly differently with his simple testcase:
===
int main(void)
{
int ret;
int i;
int fd;
char *array = malloc(4096);
char *array_locked = malloc(4096);
fd = open("/tmp/data", O_RDONLY);
read(fd, array, 4095);
for (i = 0; i < 4096; i++)
array_locked[i] = 'd';
ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
if (ret)
perror("mlock");
sleep (20);
ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
if (ret)
perror("madvise");
for (i = 0; i < 4096; i++)
array_locked[i] = 'd';
return 0;
}
===
plus offlining this memory.
In 4.4 kernels he saw the hwpoisoned page returned back to the LRU
list:
kernel: [<ffffffff81019ac9>] dump_trace+0x59/0x340
kernel: [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
kernel: [<ffffffff8101ac71>] show_stack+0x21/0x40
kernel: [<ffffffff8132bb90>] dump_stack+0x5c/0x7c
kernel: [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0
kernel: [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160
kernel: [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100
kernel: [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0
kernel: [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70
kernel: [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs]
kernel: [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200
kernel: [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0
kernel: [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660
kernel: [<ffffffff8120e50d>] __vfs_read+0xcd/0x140
kernel: [<ffffffff8120e9ea>] vfs_read+0x7a/0x120
kernel: [<ffffffff8121404b>] kernel_read+0x3b/0x50
kernel: [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0
kernel: [<ffffffff81215f08>] do_execve+0x28/0x30
kernel: [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130
kernel: [<ffffffff8161c045>] ret_from_fork+0x55/0x80
And the latter confuses the hotremove path because an LRU page is
attempted to be migrated and that fails due to an elevated reference
count. It is quite possible that the reuse of the HWPoisoned page is some
kind of fixed race condition but I am not really sure about that.
With the upstream kernel the failure is slightly different. The page
doesn't seem to have the LRU bit set, but isolate_movable_page() simply
fails, do_migrate_range() puts all the isolated pages back on the LRU, and
therefore no progress is made and scan_movable_pages() finds the same set
of pages over and over again.
Fix both cases by explicitly checking for HWPoisoned pages before we even
try to get a reference on the page, and try to unmap it if it is still mapped. As
explained by Naoya:
: Hwpoison code never unmapped those for no big reason because
: Ksm pages never dominate memory, so we simply didn't have strong
: motivation to save the pages.
Also put a WARN_ON(PageLRU) in case there is a race and we can hit LRU
HWPoison pages, which shouldn't happen, but I couldn't convince myself about
that. Naoya has noted the following:
: Theoretically no such gurantee, because try_to_unmap() doesn't have a
: guarantee of success and then memory_failure() returns immediately
: when hwpoison_user_mappings fails.
: Or the following code (comes after the hwpoison_user_mappings block) also
: implies
: that the target page can still have PageLRU flag.
:
: /*
: * Torn down by someone else?
: */
: if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
: action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
: res = -EBUSY;
: goto out;
: }
:
: So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
: current version of your patch.
Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.com>
Debugged-by: Oscar Salvador <osalvador(a)suse.com>
Tested-by: Oscar Salvador <osalvador(a)suse.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/memory_hotplug.c~hwpoison-memory_hotplug-allow-hwpoisoned-pages-to-be-offlined
+++ a/mm/memory_hotplug.c
@@ -34,6 +34,7 @@
#include <linux/hugetlb.h>
#include <linux/memblock.h>
#include <linux/compaction.h>
+#include <linux/rmap.h>
#include <asm/tlbflush.h>
@@ -1368,6 +1369,21 @@ do_migrate_range(unsigned long start_pfn
pfn = page_to_pfn(compound_head(page))
+ hpage_nr_pages(page) - 1;
+ /*
+ * HWPoison pages have elevated reference counts so the migration would
+ * fail on them. It also doesn't make any sense to migrate them in the
+ * first place. Still try to unmap such a page in case it is still mapped
+ * (e.g. current hwpoison implementation doesn't unmap KSM pages but keep
+ * the unmap as the catch all safety net).
+ */
+ if (PageHWPoison(page)) {
+ if (WARN_ON(PageLRU(page)))
+ isolate_lru_page(page);
+ if (page_mapped(page))
+ try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
+ continue;
+ }
+
if (!get_page_unless_zero(page))
continue;
/*
_
Patches currently in -mm which might be from mhocko(a)suse.com are
mm-memcg-fix-reclaim-deadlock-with-writeback.patch
The patch titled
Subject: mm, hmm: mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL
has been removed from the -mm tree. Its filename was
mm-hmm-mark-hmm_devmem_add-add_resource-export_symbol_gpl.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm, hmm: mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL
At Maintainer Summit, Greg brought up a topic I proposed around
EXPORT_SYMBOL_GPL usage. The motivation was considerations for when
EXPORT_SYMBOL_GPL is warranted and the criteria for taking the exceptional
step of reclassifying an existing export. Specifically, I wanted to make
the case that although the line is fuzzy and hard to specify in abstract
terms, it is nonetheless clear that devm_memremap_pages() and HMM
(Heterogeneous Memory Management) have crossed it. The
devm_memremap_pages() facility should have been EXPORT_SYMBOL_GPL from the
beginning, and HMM as a derivative of that functionality should have
naturally picked up that designation as well.
Contrary to typical rules, the HMM infrastructure was merged upstream with
zero in-tree consumers. There was a promise at the time that those users
would be merged "soon", but it has been over a year with no drivers
arriving. While the Nouveau driver is about to belatedly make good on
that promise, it is clear that HMM was targeted first and foremost at an
out-of-tree consumer.
HMM is derived from devm_memremap_pages(), a facility Christoph and I
spearheaded to support persistent memory. It combines a device lifetime
model with a dynamically created 'struct page' / memmap array for any
physical address range. It enables coordination and control of the many
code paths in the kernel built to interact with memory via 'struct page'
objects. With HMM the integration goes even deeper by allowing device
drivers to hook and manipulate page fault and page free events.
One interpretation of when EXPORT_SYMBOL is suitable is when it is
exporting stable and generic leaf functionality. The
devm_memremap_pages() facility continues to see expanding use cases,
peer-to-peer DMA being the most recent, with no clear end date when it
will stop attracting reworks and semantic changes. It is not suitable to
export devm_memremap_pages() as a stable third-party driver API because it
is still changing and manipulates core behavior. Moreover,
it is not in the best interest of the long term development of the core
memory management subsystem to permit any external driver to effectively
define its own system-wide memory management policies with no
encouragement to engage with upstream.
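For readers less familiar with the two export classes: the difference only
shows up at module link time, since a symbol exported with
EXPORT_SYMBOL_GPL() can be resolved only by modules declaring a
GPL-compatible MODULE_LICENSE(). A minimal illustration (not part of the
patch; the my_* names are hypothetical):

  #include <linux/export.h>
  #include <linux/module.h>

  int my_generic_helper(void)
  {
	return 0;
  }
  EXPORT_SYMBOL(my_generic_helper);	/* resolvable by any module */

  int my_core_mm_hook(void)
  {
	return 0;
  }
  EXPORT_SYMBOL_GPL(my_core_mm_hook);	/* GPL-compatible modules only */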
I am also concerned that HMM was designed in a way to minimize further
engagement with the core-MM. That is, with these hooks in place, device
drivers are free to implement their own policies without much
consideration for whether and how the core-MM could grow to meet that
need. Going forward, not only should HMM be EXPORT_SYMBOL_GPL, but the
core-MM should be allowed the opportunity and stimulus to change and
address these new use cases as first-class functionality.
Original changelog:
hmm_devmem_add() and hmm_devmem_add_resource() duplicated
devm_memremap_pages() and are now simple wrappers around the core
facility to inject a dev_pagemap instance into the global pgmap_radix and
hook page-idle events. The devm_memremap_pages() interface is base
infrastructure for HMM. HMM has more and deeper ties into the kernel
memory management implementation than base ZONE_DEVICE, which is itself an
EXPORT_SYMBOL_GPL facility.
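As a rough sketch of that wrapper shape (not taken from the patch; the
my_* names and the struct my_devmem helper are hypothetical, and only the
dev_pagemap fields that appear in the hunks in this mail are shown):

  /* Hypothetical driver-side wrapper around the core facility. */
  static void *my_devmem_add(struct device *dev, struct my_devmem *devmem)
  {
	struct dev_pagemap *pgmap = &devmem->pagemap;

	/* Physical range the new struct pages should describe. */
	pgmap->res = devmem->resource;
	/* ZONE_DEVICE flavour, e.g. MEMORY_DEVICE_PCI_P2PDMA in the
	 * pci_p2pdma hunk below. */
	pgmap->type = devmem->type;
	/* Teardown callback, analogous to pci_p2pdma_percpu_kill() below. */
	pgmap->kill = my_devmem_percpu_kill;
	/* percpu_ref plumbing and error handling elided. */

	/* All the heavy lifting (hotplug, memmap init) lives in the core. */
	return devm_memremap_pages(dev, pgmap);
  }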
Originally, the HMM page structure creation routines copied the
devm_memremap_pages() code and reused ZONE_DEVICE. A cleanup to unify the
implementations was discussed during the initial review:
http://lkml.iu.edu/hypermail/linux/kernel/1701.2/00812.html Recent work to
extend devm_memremap_pages() for the peer-to-peer-DMA facility enabled
this cleanup to move forward.
In addition to the integration with devm_memremap_pages() HMM depends on
other GPL-only symbols:
mmu_notifier_unregister_no_release
percpu_ref
region_intersects
__class_create
It goes further to consume / indirectly expose functionality that is not
exported to any other driver:
alloc_pages_vma
walk_page_range
HMM is derived from devm_memremap_pages(), and extends deep core-kernel
fundamentals. Similar to devm_memremap_pages(), mark its entry points
EXPORT_SYMBOL_GPL().
[logang(a)deltatee.com: PCI/P2PDMA: match interface changes to devm_memremap_pages()]
Link: http://lkml.kernel.org/r/20181130225911.2900-1-logang@deltatee.com
Link: http://lkml.kernel.org/r/154275560565.76910.15919297436557795278.stgit@dwil…
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Logan Gunthorpe <logang(a)deltatee.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Cc: Logan Gunthorpe <logang(a)deltatee.com>
Cc: "Jérôme Glisse" <jglisse(a)redhat.com>
Cc: Balbir Singh <bsingharora(a)gmail.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/drivers/pci/p2pdma.c~mm-hmm-mark-hmm_devmem_add-add_resource-export_symbol_gpl
+++ a/drivers/pci/p2pdma.c
@@ -82,10 +82,8 @@ static void pci_p2pdma_percpu_release(st
complete_all(&p2p->devmap_ref_done);
}
-static void pci_p2pdma_percpu_kill(void *data)
+static void pci_p2pdma_percpu_kill(struct percpu_ref *ref)
{
- struct percpu_ref *ref = data;
-
/*
* pci_p2pdma_add_resource() may be called multiple times
* by a driver and may register the percpu_kill devm action multiple
@@ -198,6 +196,7 @@ int pci_p2pdma_add_resource(struct pci_d
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
pci_resource_start(pdev, bar);
+ pgmap->kill = pci_p2pdma_percpu_kill;
addr = devm_memremap_pages(&pdev->dev, pgmap);
if (IS_ERR(addr)) {
@@ -211,11 +210,6 @@ int pci_p2pdma_add_resource(struct pci_d
if (error)
goto pgmap_free;
- error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
- &pdev->p2pdma->devmap_ref);
- if (error)
- goto pgmap_free;
-
pci_info(pdev, "added peer-to-peer DMA memory %pR\n",
&pgmap->res);
--- a/mm/hmm.c~mm-hmm-mark-hmm_devmem_add-add_resource-export_symbol_gpl
+++ a/mm/hmm.c
@@ -1110,7 +1110,7 @@ struct hmm_devmem *hmm_devmem_add(const
return result;
return devmem;
}
-EXPORT_SYMBOL(hmm_devmem_add);
+EXPORT_SYMBOL_GPL(hmm_devmem_add);
struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
struct device *device,
@@ -1164,7 +1164,7 @@ struct hmm_devmem *hmm_devmem_add_resour
return result;
return devmem;
}
-EXPORT_SYMBOL(hmm_devmem_add_resource);
+EXPORT_SYMBOL_GPL(hmm_devmem_add_resource);
/*
* A device driver that wants to handle multiple devices memory through a
_
Patches currently in -mm which might be from dan.j.williams(a)intel.com are
The patch titled
Subject: mm, devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support
has been removed from the -mm tree. Its filename was
mm-devm_memremap_pages-add-memory_device_private-support.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm, devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support
In preparation for consolidating all ZONE_DEVICE enabling via
devm_memremap_pages(), teach it how to handle the constraints of
MEMORY_DEVICE_PRIVATE ranges.
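Caller-side, this means a driver can request device-private struct pages
through the same entry point; a hedged sketch (not from the patch; dev,
res and pgmap are assumed to be set up elsewhere):

  /* Device memory the CPU cannot access: ask for MEMORY_DEVICE_PRIVATE. */
  pgmap->res = *res;
  pgmap->type = MEMORY_DEVICE_PRIVATE;
  addr = devm_memremap_pages(dev, pgmap);	/* takes the add_pages() path below */
  if (IS_ERR(addr))
	return PTR_ERR(addr);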
[jglisse(a)redhat.com: call move_pfn_range_to_zone for MEMORY_DEVICE_PRIVATE]
Link: http://lkml.kernel.org/r/154275559036.76910.12434636179931292607.stgit@dwil…
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reviewed-by: Jérôme Glisse <jglisse(a)redhat.com>
Acked-by: Christoph Hellwig <hch(a)lst.de>
Reported-by: Logan Gunthorpe <logang(a)deltatee.com>
Reviewed-by: Logan Gunthorpe <logang(a)deltatee.com>
Cc: Balbir Singh <bsingharora(a)gmail.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/kernel/memremap.c~mm-devm_memremap_pages-add-memory_device_private-support
+++ a/kernel/memremap.c
@@ -98,9 +98,15 @@ static void devm_memremap_pages_release(
- align_start;
mem_hotplug_begin();
- arch_remove_memory(align_start, align_size, pgmap->altmap_valid ?
- &pgmap->altmap : NULL);
- kasan_remove_zero_shadow(__va(align_start), align_size);
+ if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+ pfn = align_start >> PAGE_SHIFT;
+ __remove_pages(page_zone(pfn_to_page(pfn)), pfn,
+ align_size >> PAGE_SHIFT, NULL);
+ } else {
+ arch_remove_memory(align_start, align_size,
+ pgmap->altmap_valid ? &pgmap->altmap : NULL);
+ kasan_remove_zero_shadow(__va(align_start), align_size);
+ }
mem_hotplug_done();
untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
@@ -187,17 +193,40 @@ void *devm_memremap_pages(struct device
goto err_pfn_remap;
mem_hotplug_begin();
- error = kasan_add_zero_shadow(__va(align_start), align_size);
- if (error) {
- mem_hotplug_done();
- goto err_kasan;
+
+ /*
+ * For device private memory we call add_pages() as we only need to
+ * allocate and initialize struct page for the device memory. More-
+ * over the device memory is un-accessible thus we do not want to
+ * create a linear mapping for the memory like arch_add_memory()
+ * would do.
+ *
+ * For all other device memory types, which are accessible by
+ * the CPU, we do want the linear mapping and thus use
+ * arch_add_memory().
+ */
+ if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+ error = add_pages(nid, align_start >> PAGE_SHIFT,
+ align_size >> PAGE_SHIFT, NULL, false);
+ } else {
+ error = kasan_add_zero_shadow(__va(align_start), align_size);
+ if (error) {
+ mem_hotplug_done();
+ goto err_kasan;
+ }
+
+ error = arch_add_memory(nid, align_start, align_size, altmap,
+ false);
+ }
+
+ if (!error) {
+ struct zone *zone;
+
+ zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
+ move_pfn_range_to_zone(zone, align_start >> PAGE_SHIFT,
+ align_size >> PAGE_SHIFT, altmap);
}
- error = arch_add_memory(nid, align_start, align_size, altmap, false);
- if (!error)
- move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
- align_start >> PAGE_SHIFT,
- align_size >> PAGE_SHIFT, altmap);
mem_hotplug_done();
if (error)
goto err_add_memory;
_
Patches currently in -mm which might be from dan.j.williams(a)intel.com are