Hello,
We ran automated tests on a patchset that was proposed for merging into this
kernel tree. The patches were applied to:
Kernel repo: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Commit: 8c3f48e8c288 Linux 4.20.1
The results of these automated tests are provided below.
Overall result: FAILED (see details below)
Patch merge: OK
Compile: FAILED
We attempted to compile the kernel for multiple architectures, but the compile
failed on one or more architectures:
s390x: FAILED (build log attached: build_s390.log.gz)
powerpc64le: FAILED (build log attached: build_powerpc.log.gz)
aarch64: FAILED (build log attached: build_arm64.log.gz)
x86_64: FAILED (build log attached: build_x86_64.log.gz)
We hope that these logs can help you find the problem quickly. For full
details on our testing procedures, please scroll to the bottom of this message.
Please reply to this email if you have any questions about the tests that we
ran or if you have any suggestions on how to make future tests more effective.
      ,-.   ,-.
     ( C ) ( K )  Continuous
      `-',-.`-'   Kernel
       ( I )      Integration
        `-'
______________________________________________________________________________
Merge testing
-------------
We cloned this repository and checked out a ref:
Repo: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Ref: 8c3f48e8c288 Linux 4.20.1
We then merged the following patches with `git am`:
scsi-zfcp-fix-posting-too-many-status-read-buffers-leading-to-adapter-shutdown.patch
scsi-lpfc-do-not-set-queue-page_count-to-0-if-pc_sli4_params.wqpcnt-is-invalid.patch
fork-record-start_time-late.patch
zram-fix-double-free-backing-device.patch
hwpoison-memory_hotplug-allow-hwpoisoned-pages-to-be-offlined.patch
mm-devm_memremap_pages-mark-devm_memremap_pages-export_symbol_gpl.patch
mm-devm_memremap_pages-kill-mapping-system-ram-support.patch
mm-devm_memremap_pages-fix-shutdown-handling.patch
hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization.patch
hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race.patch
memcg-oom-notify-on-oom-killer-invocation-from-the-charge-path.patch
sunrpc-fix-cache_head-leak-due-to-queued-request.patch
sunrpc-use-svc_net-in-svcauth_gss_-functions.patch
sunrpc-use-after-free-in-svc_process_common.patch
mm-devm_memremap_pages-add-memory_device_private-support.patch
mm-hmm-use-devm-semantics-for-hmm_devmem_-add-remove.patch
mm-hmm-replace-hmm_devmem_pages_create-with-devm_memremap_pages.patch
mm-hmm-mark-hmm_devmem_-add-add_resource-export_symbol_gpl.patch
mm-swap-fix-swapoff-with-ksm-pages.patch
Compile testing
---------------
We compiled the kernel for 4 architectures:
s390x:
make options: make INSTALL_MOD_STRIP=1 -j64 targz-pkg -j64
configuration:
powerpc64le:
make options: make INSTALL_MOD_STRIP=1 -j64 targz-pkg -j64
configuration:
aarch64:
make options: make INSTALL_MOD_STRIP=1 -j64 targz-pkg -j64
configuration:
x86_64:
make options: make INSTALL_MOD_STRIP=1 -j64 targz-pkg -j64
configuration:
Hardware testing
----------------
We booted each kernel and ran the following tests:
s390:
powerpc:
arm64:
x86_64:
While reading through the sysvipc implementation, I noticed that the n32
semctl/shmctl/msgctl system calls behave differently based on whether
o32 support is enabled or not: without o32, the IPC_64 flag passed by
user space is rejected, but calls without that flag get IPC_64 behavior.
As far as I can tell, this was inadvertently changed by a cleanup patch
but never noticed by anyone; possibly nobody has tried using sysvipc
on n32 after linux-3.19.
Change it back to the old behavior now.
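For illustration, here is a minimal user-space sketch of the two behaviors
described above. It is not the kernel code and the helper names are invented
for this example; only the IPC_64 and GETALL constants are taken from the
uapi headers.
#include <stdio.h>
#define IPC_64  0x0100   /* "new" 64-bit structure layout, from <linux/ipc.h> */
#define GETALL  13       /* example semctl command, from <linux/sem.h> */
/*
 * Roughly what happens when ARCH_WANT_COMPAT_IPC_PARSE_VERSION is selected:
 * the IPC_64 flag is stripped from the command and reported separately, so
 * both "old" and "new" user-space layouts keep working.
 */
static int parse_version_selected(int *cmd)
{
	int version = *cmd & IPC_64;
	*cmd &= ~IPC_64;
	return version;
}
/*
 * Roughly what happens without the option: the version is assumed to be
 * IPC_64 and the flag is left in the command, so a caller passing
 * GETALL | IPC_64 ends up with an unrecognized command while a caller
 * passing plain GETALL silently gets IPC_64 semantics.
 */
static int parse_version_not_selected(int *cmd)
{
	return IPC_64;
}
int main(void)
{
	int cmd;
	cmd = GETALL | IPC_64;
	printf("selected:     version=%#x remaining cmd=%#x\n",
	       parse_version_selected(&cmd), cmd);
	cmd = GETALL | IPC_64;
	printf("not selected: version=%#x remaining cmd=%#x\n",
	       parse_version_not_selected(&cmd), cmd);
	return 0;
}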
Fixes: 78aaf956ba3a ("MIPS: Compat: Fix build error if CONFIG_MIPS32_COMPAT but no compat ABI.")
Cc: stable(a)vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd(a)arndb.de>
---
As stated above, this was only found by inspection, the patch is not
tested. Please review accordingly.
---
arch/mips/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 787290781b8c..0d14f51d0002 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -3155,6 +3155,7 @@ config MIPS32_O32
config MIPS32_N32
bool "Kernel support for n32 binaries"
depends on 64BIT
+ select ARCH_WANT_COMPAT_IPC_PARSE_VERSION
select COMPAT
select MIPS32_COMPAT
select SYSVIPC_COMPAT if SYSVIPC
--
2.20.0
The io_pgetevents system call was added in linux-4.18 but has
no entry for alpha:
warning: #warning syscall io_pgetevents not implemented [-Wcpp]
Assign the next available system call number here.
Cc: stable(a)vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd(a)arndb.de>
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 7b56a53be5e3..e09558edae73 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -451,3 +451,4 @@
520 common preadv2 sys_preadv2
521 common pwritev2 sys_pwritev2
522 common statx sys_statx
+523 common io_pgetevents sys_io_pgetevents
--
2.20.0
Good day,
We have started to compile libbpf as part of our Linux compilation
build plan; however, libbpf fails to cross-compile for arm64 on the
Linux 4.14 tree, while it succeeds on the Linux 4.19 tree.
We compile libbpf with the following command:
make -C <whatever>/linux-4.14.91/tools/lib/bpf ARCH=arm64
CROSS_COMPILE=aarch64-linux-gnu- install
And get the below output on 4.14 tree:
Auto-detecting system features:
... libelf: [ on ]
... bpf: [ on ]
CC /cfsetup_build/build/arm64/libbpf/libbpf.o
CC /cfsetup_build/build/arm64/libbpf/bpf.o
LD /cfsetup_build/build/arm64/libbpf/libbpf-in.o
ld: /cfsetup_build/build/arm64/libbpf/libbpf.o: Relocations in generic
ELF (EM: 183)
ld: /cfsetup_build/build/arm64/libbpf/libbpf.o: Relocations in generic
ELF (EM: 183)
ld: /cfsetup_build/build/arm64/libbpf/libbpf.o: Relocations in generic
ELF (EM: 183)
ld: /cfsetup_build/build/arm64/libbpf/libbpf.o: Relocations in generic
ELF (EM: 183)
ld: /cfsetup_build/build/arm64/libbpf/libbpf.o: Relocations in generic
ELF (EM: 183)
ld: /cfsetup_build/build/arm64/libbpf/libbpf.o: Relocations in generic
ELF (EM: 183)
/cfsetup_build/build/arm64/libbpf/libbpf.o: error adding symbols: File
in wrong format
/cfsetup_build/build/linux-4.14.91/tools/build/Makefile.build:144:
recipe for target '/cfsetup_build/build/arm64/libbpf/libbpf-in.o'
failed
make[2]: *** [/cfsetup_build/build/arm64/libbpf/libbpf-in.o] Error 1
Makefile:158: recipe for target
'/cfsetup_build/build/arm64/libbpf/libbpf-in.o' failed
make[1]: *** [/cfsetup_build/build/arm64/libbpf/libbpf-in.o] Error 2
Backporting the following commit fixed the build (the "EM: 183" in the
errors is EM_AARCH64, i.e. the host linker is being handed aarch64
objects because the cross-compile variables get clobbered):
7ed1c1901fe52e6c5828deb155920b44b0adabb1: tools: fix cross-compile var
clobbering
Can we have it officially applied to the Linux 4.14 tree, please?
Regards,
Ignat
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From d4b09acf924b84bae77cad090a9d108e70b43643 Mon Sep 17 00:00:00 2001
From: Vasily Averin <vvs(a)virtuozzo.com>
Date: Mon, 24 Dec 2018 14:44:52 +0300
Subject: [PATCH] sunrpc: use-after-free in svc_process_common()
If a node has NFSv41+ mounts inside several net namespaces,
it can lead to a use-after-free in svc_process_common():
svc_process_common()
/* Setup reply header */
rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp); <<< HERE
svc_process_common() can use an incorrect rqstp->rq_xprt;
its caller, bc_svc_process(), takes it from serv->sv_bc_xprt.
The problem is that serv is a global structure while sv_bc_xprt
is assigned per net namespace.
According to Trond, the whole "let's set up rqstp->rq_xprt
for the back channel" is nothing but a giant hack in order
to work around the fact that svc_process_common() uses it
to find the xpt_ops, and perform a couple of (meaningless
for the back channel) tests of xpt_flags.
All we really need in svc_process_common() is to be able to run
rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr()
Bruce J Fields points out that this xpo_prep_reply_hdr() call
is an awfully roundabout way just to do "svc_putnl(resv, 0);"
in the TCP case.
This patch no longer initializes rqstp->rq_xprt in bc_svc_process();
it now calls svc_process_common() with rqstp->rq_xprt = NULL.
To adjust the reply header, svc_process_common() just checks
rqstp->rq_prot and calls svc_tcp_prep_reply_hdr() in the TCP case.
To handle the rqstp->rq_xprt = NULL case in functions called from
svc_process_common(), the patch introduces a net namespace pointer,
svc_rqst->rq_bc_net, and adjusts the SVC_NET() definition.
Some other functions were also adapted to properly handle the described case.
Signed-off-by: Vasily Averin <vvs(a)virtuozzo.com>
Cc: stable(a)vger.kernel.org
Fixes: 23c20ecd4475 ("NFS: callback up - users counting cleanup")
Signed-off-by: J. Bruce Fields <bfields(a)redhat.com>
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 73e130a840ce..fdb6b317d974 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -295,9 +295,12 @@ struct svc_rqst {
struct svc_cacherep * rq_cacherep; /* cache info */
struct task_struct *rq_task; /* service thread */
spinlock_t rq_lock; /* per-request lock */
+ struct net *rq_bc_net; /* pointer to backchannel's
+ * net namespace
+ */
};
-#define SVC_NET(svc_rqst) (svc_rqst->rq_xprt->xpt_net)
+#define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
/*
* Rigorous type checking on sockaddr type conversions
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 28e384186c35..8617f4fd6b70 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -569,7 +569,8 @@ TRACE_EVENT(svc_process,
__field(u32, vers)
__field(u32, proc)
__string(service, name)
- __string(addr, rqst->rq_xprt->xpt_remotebuf)
+ __string(addr, rqst->rq_xprt ?
+ rqst->rq_xprt->xpt_remotebuf : "(null)")
),
TP_fast_assign(
@@ -577,7 +578,8 @@ TRACE_EVENT(svc_process,
__entry->vers = rqst->rq_vers;
__entry->proc = rqst->rq_proc;
__assign_str(service, name);
- __assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
+ __assign_str(addr, rqst->rq_xprt ?
+ rqst->rq_xprt->xpt_remotebuf : "(null)");
),
TP_printk("addr=%s xid=0x%08x service=%s vers=%u proc=%u",
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index d13e05f1a990..fb647bc01fc5 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1172,7 +1172,8 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
clear_bit(RQ_DROPME, &rqstp->rq_flags);
/* Setup reply header */
- rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp);
+ if (rqstp->rq_prot == IPPROTO_TCP)
+ svc_tcp_prep_reply_hdr(rqstp);
svc_putu32(resv, rqstp->rq_xid);
@@ -1244,7 +1245,7 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
* for lower versions. RPC_PROG_MISMATCH seems to be the closest
* fit.
*/
- if (versp->vs_need_cong_ctrl &&
+ if (versp->vs_need_cong_ctrl && rqstp->rq_xprt &&
!test_bit(XPT_CONG_CTRL, &rqstp->rq_xprt->xpt_flags))
goto err_bad_vers;
@@ -1336,7 +1337,7 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
return 0;
close:
- if (test_bit(XPT_TEMP, &rqstp->rq_xprt->xpt_flags))
+ if (rqstp->rq_xprt && test_bit(XPT_TEMP, &rqstp->rq_xprt->xpt_flags))
svc_close_xprt(rqstp->rq_xprt);
dprintk("svc: svc_process close\n");
return 0;
@@ -1459,10 +1460,10 @@ bc_svc_process(struct svc_serv *serv, struct rpc_rqst *req,
dprintk("svc: %s(%p)\n", __func__, req);
/* Build the svc_rqst used by the common processing routine */
- rqstp->rq_xprt = serv->sv_bc_xprt;
rqstp->rq_xid = req->rq_xid;
rqstp->rq_prot = req->rq_xprt->prot;
rqstp->rq_server = serv;
+ rqstp->rq_bc_net = req->rq_xprt->xprt_net;
rqstp->rq_addrlen = sizeof(req->rq_xprt->addr);
memcpy(&rqstp->rq_addr, &req->rq_xprt->addr, rqstp->rq_addrlen);
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 51d36230b6e3..bd42da287c26 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -468,10 +468,11 @@ static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
*/
void svc_reserve(struct svc_rqst *rqstp, int space)
{
+ struct svc_xprt *xprt = rqstp->rq_xprt;
+
space += rqstp->rq_res.head[0].iov_len;
- if (space < rqstp->rq_reserved) {
- struct svc_xprt *xprt = rqstp->rq_xprt;
+ if (xprt && space < rqstp->rq_reserved) {
atomic_sub((rqstp->rq_reserved - space), &xprt->xpt_reserved);
rqstp->rq_reserved = space;
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 986f3ed7d1a2..793149ba1bda 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1173,7 +1173,7 @@ static int svc_tcp_sendto(struct svc_rqst *rqstp)
/*
* Setup response header. TCP has a 4B record length field.
*/
-static void svc_tcp_prep_reply_hdr(struct svc_rqst *rqstp)
+void svc_tcp_prep_reply_hdr(struct svc_rqst *rqstp)
{
struct kvec *resv = &rqstp->rq_res.head[0];
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From d4b09acf924b84bae77cad090a9d108e70b43643 Mon Sep 17 00:00:00 2001
From: Vasily Averin <vvs(a)virtuozzo.com>
Date: Mon, 24 Dec 2018 14:44:52 +0300
Subject: [PATCH] sunrpc: use-after-free in svc_process_common()
If a node has NFSv41+ mounts inside several net namespaces,
it can lead to a use-after-free in svc_process_common():
svc_process_common()
/* Setup reply header */
rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp); <<< HERE
svc_process_common() can use an incorrect rqstp->rq_xprt;
its caller, bc_svc_process(), takes it from serv->sv_bc_xprt.
The problem is that serv is a global structure while sv_bc_xprt
is assigned per net namespace.
According to Trond, the whole "let's set up rqstp->rq_xprt
for the back channel" is nothing but a giant hack in order
to work around the fact that svc_process_common() uses it
to find the xpt_ops, and perform a couple of (meaningless
for the back channel) tests of xpt_flags.
All we really need in svc_process_common() is to be able to run
rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr()
Bruce J Fields points out that this xpo_prep_reply_hdr() call
is an awfully roundabout way just to do "svc_putnl(resv, 0);"
in the TCP case.
This patch no longer initializes rqstp->rq_xprt in bc_svc_process();
it now calls svc_process_common() with rqstp->rq_xprt = NULL.
To adjust the reply header, svc_process_common() just checks
rqstp->rq_prot and calls svc_tcp_prep_reply_hdr() in the TCP case.
To handle the rqstp->rq_xprt = NULL case in functions called from
svc_process_common(), the patch introduces a net namespace pointer,
svc_rqst->rq_bc_net, and adjusts the SVC_NET() definition.
Some other functions were also adapted to properly handle the described case.
Signed-off-by: Vasily Averin <vvs(a)virtuozzo.com>
Cc: stable(a)vger.kernel.org
Fixes: 23c20ecd4475 ("NFS: callback up - users counting cleanup")
Signed-off-by: J. Bruce Fields <bfields(a)redhat.com>
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 73e130a840ce..fdb6b317d974 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -295,9 +295,12 @@ struct svc_rqst {
struct svc_cacherep * rq_cacherep; /* cache info */
struct task_struct *rq_task; /* service thread */
spinlock_t rq_lock; /* per-request lock */
+ struct net *rq_bc_net; /* pointer to backchannel's
+ * net namespace
+ */
};
-#define SVC_NET(svc_rqst) (svc_rqst->rq_xprt->xpt_net)
+#define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
/*
* Rigorous type checking on sockaddr type conversions
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 28e384186c35..8617f4fd6b70 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -569,7 +569,8 @@ TRACE_EVENT(svc_process,
__field(u32, vers)
__field(u32, proc)
__string(service, name)
- __string(addr, rqst->rq_xprt->xpt_remotebuf)
+ __string(addr, rqst->rq_xprt ?
+ rqst->rq_xprt->xpt_remotebuf : "(null)")
),
TP_fast_assign(
@@ -577,7 +578,8 @@ TRACE_EVENT(svc_process,
__entry->vers = rqst->rq_vers;
__entry->proc = rqst->rq_proc;
__assign_str(service, name);
- __assign_str(addr, rqst->rq_xprt->xpt_remotebuf);
+ __assign_str(addr, rqst->rq_xprt ?
+ rqst->rq_xprt->xpt_remotebuf : "(null)");
),
TP_printk("addr=%s xid=0x%08x service=%s vers=%u proc=%u",
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index d13e05f1a990..fb647bc01fc5 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1172,7 +1172,8 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
clear_bit(RQ_DROPME, &rqstp->rq_flags);
/* Setup reply header */
- rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp);
+ if (rqstp->rq_prot == IPPROTO_TCP)
+ svc_tcp_prep_reply_hdr(rqstp);
svc_putu32(resv, rqstp->rq_xid);
@@ -1244,7 +1245,7 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
* for lower versions. RPC_PROG_MISMATCH seems to be the closest
* fit.
*/
- if (versp->vs_need_cong_ctrl &&
+ if (versp->vs_need_cong_ctrl && rqstp->rq_xprt &&
!test_bit(XPT_CONG_CTRL, &rqstp->rq_xprt->xpt_flags))
goto err_bad_vers;
@@ -1336,7 +1337,7 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
return 0;
close:
- if (test_bit(XPT_TEMP, &rqstp->rq_xprt->xpt_flags))
+ if (rqstp->rq_xprt && test_bit(XPT_TEMP, &rqstp->rq_xprt->xpt_flags))
svc_close_xprt(rqstp->rq_xprt);
dprintk("svc: svc_process close\n");
return 0;
@@ -1459,10 +1460,10 @@ bc_svc_process(struct svc_serv *serv, struct rpc_rqst *req,
dprintk("svc: %s(%p)\n", __func__, req);
/* Build the svc_rqst used by the common processing routine */
- rqstp->rq_xprt = serv->sv_bc_xprt;
rqstp->rq_xid = req->rq_xid;
rqstp->rq_prot = req->rq_xprt->prot;
rqstp->rq_server = serv;
+ rqstp->rq_bc_net = req->rq_xprt->xprt_net;
rqstp->rq_addrlen = sizeof(req->rq_xprt->addr);
memcpy(&rqstp->rq_addr, &req->rq_xprt->addr, rqstp->rq_addrlen);
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 51d36230b6e3..bd42da287c26 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -468,10 +468,11 @@ static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
*/
void svc_reserve(struct svc_rqst *rqstp, int space)
{
+ struct svc_xprt *xprt = rqstp->rq_xprt;
+
space += rqstp->rq_res.head[0].iov_len;
- if (space < rqstp->rq_reserved) {
- struct svc_xprt *xprt = rqstp->rq_xprt;
+ if (xprt && space < rqstp->rq_reserved) {
atomic_sub((rqstp->rq_reserved - space), &xprt->xpt_reserved);
rqstp->rq_reserved = space;
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 986f3ed7d1a2..793149ba1bda 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1173,7 +1173,7 @@ static int svc_tcp_sendto(struct svc_rqst *rqstp)
/*
* Setup response header. TCP has a 4B record length field.
*/
-static void svc_tcp_prep_reply_hdr(struct svc_rqst *rqstp)
+void svc_tcp_prep_reply_hdr(struct svc_rqst *rqstp)
{
struct kvec *resv = &rqstp->rq_res.head[0];
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From c86aa7bbfd5568ba8a82d3635d8f7b8a8e06fe54 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Date: Fri, 28 Dec 2018 00:39:42 -0800
Subject: [PATCH] hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straight forward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation which could fail so that needs to be taken into
account as well.
Instead of writing the required complicated code for this rare occurrence,
just eliminate the race. i_mmap_rwsem is now held in read mode for the
duration of page fault processing. Hold i_mmap_rwsem longer in truncation
and hole punch code to cover the call to remove_inode_hugepages.
With this modification, code in remove_inode_hugepages checking for races
becomes 'dead' as those races can no longer happen. Remove the dead code and
expand comments to explain reasoning. Similarly, checks for races with
truncation in the page fault path can be simplified and removed.
[mike.kravetz(a)oracle.com: incorporate suggestions from Kirill]
Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 32920a10100e..a2fcea5f8225 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -383,17 +383,16 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
* truncation is indicated by end of range being LLONG_MAX
* In this case, we first scan the range and release found pages.
* After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
- * maps and global counts. Page faults can not race with truncation
- * in this routine. hugetlb_no_page() prevents page faults in the
- * truncated range. It checks i_size before allocation, and again after
- * with the page table lock for the page held. The same lock must be
- * acquired to unmap a page.
+ * maps and global counts.
* hole punch is indicated if end is not LLONG_MAX
* In the hole punch case we scan the range and release found pages.
* Only when releasing a page is the associated region/reserv map
* deleted. The region/reserv map for ranges without associated
- * pages are not modified. Page faults can race with hole punch.
- * This is indicated if we find a mapped page.
+ * pages are not modified.
+ *
+ * Callers of this routine must hold the i_mmap_rwsem in write mode to prevent
+ * races with page faults.
+ *
* Note: If the passed end of range value is beyond the end of file, but
* not LLONG_MAX this routine still performs a hole punch operation.
*/
@@ -423,32 +422,14 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
for (i = 0; i < pagevec_count(&pvec); ++i) {
struct page *page = pvec.pages[i];
- u32 hash;
index = page->index;
- hash = hugetlb_fault_mutex_hash(h, current->mm,
- &pseudo_vma,
- mapping, index, 0);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
-
/*
- * If page is mapped, it was faulted in after being
- * unmapped in caller. Unmap (again) now after taking
- * the fault mutex. The mutex will prevent faults
- * until we finish removing the page.
- *
- * This race can only happen in the hole punch case.
- * Getting here in a truncate operation is a bug.
+ * A mapped page is impossible as callers should unmap
+ * all references before calling. And, i_mmap_rwsem
+ * prevents the creation of additional mappings.
*/
- if (unlikely(page_mapped(page))) {
- BUG_ON(truncate_op);
-
- i_mmap_lock_write(mapping);
- hugetlb_vmdelete_list(&mapping->i_mmap,
- index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h));
- i_mmap_unlock_write(mapping);
- }
+ VM_BUG_ON(page_mapped(page));
lock_page(page);
/*
@@ -470,7 +451,6 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
}
unlock_page(page);
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
huge_pagevec_release(&pvec);
cond_resched();
@@ -482,9 +462,20 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
static void hugetlbfs_evict_inode(struct inode *inode)
{
+ struct address_space *mapping = inode->i_mapping;
struct resv_map *resv_map;
+ /*
+ * The vfs layer guarantees that there are no other users of this
+ * inode. Therefore, it would be safe to call remove_inode_hugepages
+ * without holding i_mmap_rwsem. We acquire and hold here to be
+ * consistent with other callers. Since there will be no contention
+ * on the semaphore, overhead is negligible.
+ */
+ i_mmap_lock_write(mapping);
remove_inode_hugepages(inode, 0, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
+
resv_map = (struct resv_map *)inode->i_mapping->private_data;
/* root inode doesn't have the resv_map, so we should check it */
if (resv_map)
@@ -505,8 +496,8 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
return 0;
}
@@ -540,8 +531,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
hugetlb_vmdelete_list(&mapping->i_mmap,
hole_start >> PAGE_SHIFT,
hole_end >> PAGE_SHIFT);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
+ i_mmap_unlock_write(mapping);
inode_unlock(inode);
}
@@ -624,7 +615,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
/* addr is the offset within the file (zero based) */
addr = index * hpage_size;
- /* mutex taken here, fault path and hole punch */
+ /*
+ * fault mutex taken here, protects against fault path
+ * and hole punch. inode_lock previously taken protects
+ * against truncation.
+ */
hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
index, addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 87fd3ab809c6..e37efd5d8318 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3755,16 +3755,16 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
}
/*
- * Use page lock to guard against racing truncation
- * before we get page_table_lock.
+ * We can not race with truncation due to holding i_mmap_rwsem.
+ * Check once here for faults beyond end of file.
*/
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
+ if (idx >= size)
+ goto out;
+
retry:
page = find_lock_page(mapping, idx);
if (!page) {
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto out;
-
/*
* Check for page in userfault range
*/
@@ -3854,9 +3854,6 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
}
ptl = huge_pte_lock(h, mm, ptep);
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto backout;
ret = 0;
if (!huge_pte_none(huge_ptep_get(ptep)))
@@ -3959,8 +3956,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
/*
* Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
- * until finished with ptep. This prevents huge_pmd_unshare from
- * being called elsewhere and making the ptep no longer valid.
+ * until finished with ptep. This serves two purposes:
+ * 1) It prevents huge_pmd_unshare from being called elsewhere
+ * and making the ptep no longer valid.
+ * 2) It synchronizes us with file truncation.
*
* ptep could have already be assigned via huge_pte_offset. That
* is OK, as huge_pte_alloc will return the same value unless
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From b43a9990055958e70347c56f90ea2ae32c67334c Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Date: Fri, 28 Dec 2018 00:39:38 -0800
Subject: [PATCH] hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization
While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with the
ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
called.
[mike.kravetz(a)oracle.com: add explicit check for mapping != null]
Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com>
Cc: Colin Ian King <colin.king(a)canonical.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 12000ba5c868..87fd3ab809c6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3238,6 +3238,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
struct page *ptepage;
unsigned long addr;
int cow;
+ struct address_space *mapping = vma->vm_file->f_mapping;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
struct mmu_notifier_range range;
@@ -3249,13 +3250,23 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
mmu_notifier_range_init(&range, src, vma->vm_start,
vma->vm_end);
mmu_notifier_invalidate_range_start(&range);
+ } else {
+ /*
+ * For shared mappings i_mmap_rwsem must be held to call
+ * huge_pte_alloc, otherwise the returned ptep could go
+ * away if part of a shared pmd and another thread calls
+ * huge_pmd_unshare.
+ */
+ i_mmap_lock_read(mapping);
}
for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
+
src_pte = huge_pte_offset(src, addr, sz);
if (!src_pte)
continue;
+
dst_pte = huge_pte_alloc(dst, addr, sz);
if (!dst_pte) {
ret = -ENOMEM;
@@ -3326,6 +3337,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
if (cow)
mmu_notifier_invalidate_range_end(&range);
+ else
+ i_mmap_unlock_read(mapping);
return ret;
}
@@ -3771,14 +3784,18 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
};
/*
- * hugetlb_fault_mutex must be dropped before
- * handling userfault. Reacquire after handling
- * fault to make calling code simpler.
+ * hugetlb_fault_mutex and i_mmap_rwsem must be
+ * dropped before handling userfault. Reacquire
+ * after handling fault to make calling code simpler.
*/
hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
idx, haddr);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+
ret = handle_userfault(&vmf, VM_UFFD_MISSING);
+
+ i_mmap_lock_read(mapping);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
goto out;
}
@@ -3926,6 +3943,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (ptep) {
+ /*
+ * Since we hold no locks, ptep could be stale. That is
+ * OK as we are only making decisions based on content and
+ * not actually modifying content here.
+ */
entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
migration_entry_wait_huge(vma, mm, ptep);
@@ -3933,20 +3955,31 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
- } else {
- ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
- if (!ptep)
- return VM_FAULT_OOM;
}
+ /*
+ * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
+ * until finished with ptep. This prevents huge_pmd_unshare from
+ * being called elsewhere and making the ptep no longer valid.
+ *
+ * ptep could have already be assigned via huge_pte_offset. That
+ * is OK, as huge_pte_alloc will return the same value unless
+ * something changed.
+ */
mapping = vma->vm_file->f_mapping;
- idx = vma_hugecache_offset(h, vma, haddr);
+ i_mmap_lock_read(mapping);
+ ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
+ if (!ptep) {
+ i_mmap_unlock_read(mapping);
+ return VM_FAULT_OOM;
+ }
/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
*/
+ idx = vma_hugecache_offset(h, vma, haddr);
hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, haddr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -4034,6 +4067,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
out_mutex:
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
/*
* Generally it's safe to hold refcount during waiting page lock. But
* here we just wait to defer the next page fault to avoid busy loop and
@@ -4638,10 +4672,12 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
* Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
* and returns the corresponding pte. While this is not necessary for the
* !shared pmd case because we can allocate the pmd later as well, it makes the
- * code much cleaner. pmd allocation is essential for the shared case because
- * pud has to be populated inside the same i_mmap_rwsem section - otherwise
- * racing tasks could either miss the sharing (see huge_pte_offset) or select a
- * bad pmd for sharing.
+ * code much cleaner.
+ *
+ * This routine must be called with i_mmap_rwsem held in at least read mode.
+ * For hugetlbfs, this prevents removal of any page table entries associated
+ * with the address space. This is important as we are setting up sharing
+ * based on existing page table entries (mappings).
*/
pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
{
@@ -4658,7 +4694,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
if (!vma_shareable(vma, addr))
return (pte_t *)pmd_alloc(mm, pud, addr);
- i_mmap_lock_write(mapping);
vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -4688,7 +4723,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
spin_unlock(ptl);
out:
pte = (pte_t *)pmd_alloc(mm, pud, addr);
- i_mmap_unlock_write(mapping);
return pte;
}
@@ -4699,7 +4733,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
* indicated by page_count > 1, unmap is achieved by clearing pud and
* decrementing the ref count. If count == 1, the pte page is not shared.
*
- * called with page table lock held.
+ * Called with page table lock held and i_mmap_rwsem held in write mode.
*
* returns: 1 successfully unmapped a shared pte page
* 0 the underlying pte page is not shared, or it is the last user
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7c72f2a95785..6379fff1a5ff 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -966,7 +966,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
struct address_space *mapping;
LIST_HEAD(tokill);
- bool unmap_success;
+ bool unmap_success = true;
int kill = 1, forcekill;
struct page *hpage = *hpagep;
bool mlocked = PageMlocked(hpage);
@@ -1028,7 +1028,19 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
if (kill)
collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
- unmap_success = try_to_unmap(hpage, ttu);
+ if (!PageHuge(hpage)) {
+ unmap_success = try_to_unmap(hpage, ttu);
+ } else if (mapping) {
+ /*
+ * For hugetlb pages, try_to_unmap could potentially call
+ * huge_pmd_unshare. Because of this, take semaphore in
+ * write mode here and set TTU_RMAP_LOCKED to indicate we
+ * have taken the lock at this higer level.
+ */
+ i_mmap_lock_write(mapping);
+ unmap_success = try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
+ i_mmap_unlock_write(mapping);
+ }
if (!unmap_success)
pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
pfn, page_mapcount(hpage));
diff --git a/mm/migrate.c b/mm/migrate.c
index 4389696fba0e..5d1839a9148d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1324,8 +1324,19 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
goto put_anon;
if (page_mapped(hpage)) {
+ struct address_space *mapping = page_mapping(hpage);
+
+ /*
+ * try_to_unmap could potentially call huge_pmd_unshare.
+ * Because of this, take semaphore in write mode here and
+ * set TTU_RMAP_LOCKED to let lower levels know we have
+ * taken the lock.
+ */
+ i_mmap_lock_write(mapping);
try_to_unmap(hpage,
- TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+ TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
+ TTU_RMAP_LOCKED);
+ i_mmap_unlock_write(mapping);
page_was_mapped = 1;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 68a1a5b869a5..21a26cf51114 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -25,6 +25,7 @@
* page->flags PG_locked (lock_page)
* hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
* mapping->i_mmap_rwsem
+ * hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
* anon_vma->rwsem
* mm->page_table_lock or pte_lock
* zone_lru_lock (in mark_page_accessed, isolate_lru_page)
@@ -1378,6 +1379,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
/*
* If sharing is possible, start and end will be adjusted
* accordingly.
+ *
+ * If called for a huge page, caller must hold i_mmap_rwsem
+ * in write mode as it is possible to call huge_pmd_unshare.
*/
adjust_range_if_pmd_sharing_possible(vma, &range.start,
&range.end);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 458acda96f20..48368589f519 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -267,10 +267,14 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
VM_BUG_ON(dst_addr & ~huge_page_mask(h));
/*
- * Serialize via hugetlb_fault_mutex
+ * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
+ * i_mmap_rwsem ensures the dst_pte remains valid even
+ * in the case of shared pmds. fault mutex prevents
+ * races with other faulting threads.
*/
- idx = linear_page_index(dst_vma, dst_addr);
mapping = dst_vma->vm_file->f_mapping;
+ i_mmap_lock_read(mapping);
+ idx = linear_page_index(dst_vma, dst_addr);
hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
idx, dst_addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -279,6 +283,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h));
if (!dst_pte) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
goto out_unlock;
}
@@ -286,6 +291,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
dst_pteval = huge_ptep_get(dst_pte);
if (!huge_pte_none(dst_pteval)) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
goto out_unlock;
}
@@ -293,6 +299,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
dst_addr, src_addr, &page);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
vm_alloc_shared = vm_shared;
cond_resched();
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From b43a9990055958e70347c56f90ea2ae32c67334c Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Date: Fri, 28 Dec 2018 00:39:38 -0800
Subject: [PATCH] hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization
While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with the
ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
called.
[mike.kravetz(a)oracle.com: add explicit check for mapping != null]
Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com>
Cc: Colin Ian King <colin.king(a)canonical.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 12000ba5c868..87fd3ab809c6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3238,6 +3238,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
struct page *ptepage;
unsigned long addr;
int cow;
+ struct address_space *mapping = vma->vm_file->f_mapping;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
struct mmu_notifier_range range;
@@ -3249,13 +3250,23 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
mmu_notifier_range_init(&range, src, vma->vm_start,
vma->vm_end);
mmu_notifier_invalidate_range_start(&range);
+ } else {
+ /*
+ * For shared mappings i_mmap_rwsem must be held to call
+ * huge_pte_alloc, otherwise the returned ptep could go
+ * away if part of a shared pmd and another thread calls
+ * huge_pmd_unshare.
+ */
+ i_mmap_lock_read(mapping);
}
for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
+
src_pte = huge_pte_offset(src, addr, sz);
if (!src_pte)
continue;
+
dst_pte = huge_pte_alloc(dst, addr, sz);
if (!dst_pte) {
ret = -ENOMEM;
@@ -3326,6 +3337,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
if (cow)
mmu_notifier_invalidate_range_end(&range);
+ else
+ i_mmap_unlock_read(mapping);
return ret;
}
@@ -3771,14 +3784,18 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
};
/*
- * hugetlb_fault_mutex must be dropped before
- * handling userfault. Reacquire after handling
- * fault to make calling code simpler.
+ * hugetlb_fault_mutex and i_mmap_rwsem must be
+ * dropped before handling userfault. Reacquire
+ * after handling fault to make calling code simpler.
*/
hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
idx, haddr);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+
ret = handle_userfault(&vmf, VM_UFFD_MISSING);
+
+ i_mmap_lock_read(mapping);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
goto out;
}
@@ -3926,6 +3943,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (ptep) {
+ /*
+ * Since we hold no locks, ptep could be stale. That is
+ * OK as we are only making decisions based on content and
+ * not actually modifying content here.
+ */
entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
migration_entry_wait_huge(vma, mm, ptep);
@@ -3933,20 +3955,31 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
- } else {
- ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
- if (!ptep)
- return VM_FAULT_OOM;
}
+ /*
+ * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
+ * until finished with ptep. This prevents huge_pmd_unshare from
+ * being called elsewhere and making the ptep no longer valid.
+ *
+ * ptep could have already be assigned via huge_pte_offset. That
+ * is OK, as huge_pte_alloc will return the same value unless
+ * something changed.
+ */
mapping = vma->vm_file->f_mapping;
- idx = vma_hugecache_offset(h, vma, haddr);
+ i_mmap_lock_read(mapping);
+ ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
+ if (!ptep) {
+ i_mmap_unlock_read(mapping);
+ return VM_FAULT_OOM;
+ }
/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
*/
+ idx = vma_hugecache_offset(h, vma, haddr);
hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, haddr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -4034,6 +4067,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
out_mutex:
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
/*
* Generally it's safe to hold refcount during waiting page lock. But
* here we just wait to defer the next page fault to avoid busy loop and
@@ -4638,10 +4672,12 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
* Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
* and returns the corresponding pte. While this is not necessary for the
* !shared pmd case because we can allocate the pmd later as well, it makes the
- * code much cleaner. pmd allocation is essential for the shared case because
- * pud has to be populated inside the same i_mmap_rwsem section - otherwise
- * racing tasks could either miss the sharing (see huge_pte_offset) or select a
- * bad pmd for sharing.
+ * code much cleaner.
+ *
+ * This routine must be called with i_mmap_rwsem held in at least read mode.
+ * For hugetlbfs, this prevents removal of any page table entries associated
+ * with the address space. This is important as we are setting up sharing
+ * based on existing page table entries (mappings).
*/
pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
{
@@ -4658,7 +4694,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
if (!vma_shareable(vma, addr))
return (pte_t *)pmd_alloc(mm, pud, addr);
- i_mmap_lock_write(mapping);
vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -4688,7 +4723,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
spin_unlock(ptl);
out:
pte = (pte_t *)pmd_alloc(mm, pud, addr);
- i_mmap_unlock_write(mapping);
return pte;
}
@@ -4699,7 +4733,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
* indicated by page_count > 1, unmap is achieved by clearing pud and
* decrementing the ref count. If count == 1, the pte page is not shared.
*
- * called with page table lock held.
+ * Called with page table lock held and i_mmap_rwsem held in write mode.
*
* returns: 1 successfully unmapped a shared pte page
* 0 the underlying pte page is not shared, or it is the last user
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7c72f2a95785..6379fff1a5ff 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -966,7 +966,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
struct address_space *mapping;
LIST_HEAD(tokill);
- bool unmap_success;
+ bool unmap_success = true;
int kill = 1, forcekill;
struct page *hpage = *hpagep;
bool mlocked = PageMlocked(hpage);
@@ -1028,7 +1028,19 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
if (kill)
collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
- unmap_success = try_to_unmap(hpage, ttu);
+ if (!PageHuge(hpage)) {
+ unmap_success = try_to_unmap(hpage, ttu);
+ } else if (mapping) {
+ /*
+ * For hugetlb pages, try_to_unmap could potentially call
+ * huge_pmd_unshare. Because of this, take semaphore in
+ * write mode here and set TTU_RMAP_LOCKED to indicate we
+ * have taken the lock at this higer level.
+ */
+ i_mmap_lock_write(mapping);
+ unmap_success = try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
+ i_mmap_unlock_write(mapping);
+ }
if (!unmap_success)
pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
pfn, page_mapcount(hpage));
diff --git a/mm/migrate.c b/mm/migrate.c
index 4389696fba0e..5d1839a9148d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1324,8 +1324,19 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
goto put_anon;
if (page_mapped(hpage)) {
+ struct address_space *mapping = page_mapping(hpage);
+
+ /*
+ * try_to_unmap could potentially call huge_pmd_unshare.
+ * Because of this, take semaphore in write mode here and
+ * set TTU_RMAP_LOCKED to let lower levels know we have
+ * taken the lock.
+ */
+ i_mmap_lock_write(mapping);
try_to_unmap(hpage,
- TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+ TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
+ TTU_RMAP_LOCKED);
+ i_mmap_unlock_write(mapping);
page_was_mapped = 1;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 68a1a5b869a5..21a26cf51114 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -25,6 +25,7 @@
* page->flags PG_locked (lock_page)
* hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
* mapping->i_mmap_rwsem
+ * hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
* anon_vma->rwsem
* mm->page_table_lock or pte_lock
* zone_lru_lock (in mark_page_accessed, isolate_lru_page)
@@ -1378,6 +1379,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
/*
* If sharing is possible, start and end will be adjusted
* accordingly.
+ *
+ * If called for a huge page, caller must hold i_mmap_rwsem
+ * in write mode as it is possible to call huge_pmd_unshare.
*/
adjust_range_if_pmd_sharing_possible(vma, &range.start,
&range.end);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 458acda96f20..48368589f519 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -267,10 +267,14 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
VM_BUG_ON(dst_addr & ~huge_page_mask(h));
/*
- * Serialize via hugetlb_fault_mutex
+ * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
+ * i_mmap_rwsem ensures the dst_pte remains valid even
+ * in the case of shared pmds. fault mutex prevents
+ * races with other faulting threads.
*/
- idx = linear_page_index(dst_vma, dst_addr);
mapping = dst_vma->vm_file->f_mapping;
+ i_mmap_lock_read(mapping);
+ idx = linear_page_index(dst_vma, dst_addr);
hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
idx, dst_addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -279,6 +283,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h));
if (!dst_pte) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
goto out_unlock;
}
@@ -286,6 +291,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
dst_pteval = huge_ptep_get(dst_pte);
if (!huge_pte_none(dst_pteval)) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
goto out_unlock;
}
@@ -293,6 +299,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
dst_addr, src_addr, &page);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
vm_alloc_shared = vm_shared;
cond_resched();
Hi,
After upgrading the kernel from 4.14.40 to 4.14.88, I found that the 'HP FlexFabric 10Gb 2-port 554FLB Adapter' device no longer works. There are errors in the dmesg log.
The Server is 'HP FlexServer B390'.
Device info:
lspci -n | grep 04:00
04:00.2 0c04: 19a2:0714 (rev 01)
...
04:00.2 Fibre Channel: Emulex Corporation OneConnect 10Gb FCoE Initiator (be3) (rev 01)
Subsystem: Hewlett-Packard Company NC554FLB 10Gb 2-port FlexFabric Converged Network Adapter
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin C routed to IRQ 95
...
Kernel driver in use: lpfc
The error info:
[ 1046.980480] lpfc 0000:04:00.3: 1:1303 Link Up Event x1 received Data: x1 x0 x4 x0 x0 x0 0
[ 1046.980482] lpfc 0000:04:00.3: 1:(0):2753 PLOGI failure DID:020009 Status:x3/x103
[ 1050.435167] lpfc 0000:04:00.2: 0:(0):2753 PLOGI failure DID:010012 Status:x3/x103
[ 1065.713327] lpfc 0000:04:00.3: 1:(0):2753 PLOGI failure DID:040002 Status:x3/x103
[ 1072.331933] lpfc 0000:04:00.2: 0:(0):2753 PLOGI failure DID:030003 Status:x3/x103
[ 1137.628132] lpfc 0000:04:00.2: 0:(0):0748 abort handler timed out waiting for aborting I/O (xri:x64) to complete: ret 0x2003, ID 2, LUN 0
[ 1137.644257] lpfc 0000:04:00.2: 0:(0):0713 SCSI layer issued Device Reset (2, 0) return x2002
[ 1139.676124] lpfc 0000:04:00.3: 1:(0):0748 abort handler timed out waiting for aborting I/O (xri:x464) to complete: ret 0x2003, ID 4, LUN 0
[ 1139.692242] lpfc 0000:04:00.3: 1:(0):0713 SCSI layer issued Device Reset (4, 0) return x2002
[ 1197.664150] lpfc 0000:04:00.2: 0:(0):0724 I/O flush failure for context LUN : cnt x1
[ 1197.664344] lpfc 0000:04:00.2: 0:(0):0723 SCSI layer issued Target Reset (2, 0) return x2002
[ 1199.704116] lpfc 0000:04:00.3: 1:(0):0724 I/O flush failure for context LUN : cnt x1
[ 1199.704368] lpfc 0000:04:00.3: 1:(0):0723 SCSI layer issued Target Reset (4, 0) return x2002
At the beginning, I thought the lpfc driver itself was the cause of the error. But the error is still seen after updating the lpfc driver to the latest version.
To find the root cause and fix it, we went through the kernel versions from 4.14.41 to 4.14.88, building and boot-testing each one.
The commit that caused the error, found by bisecting, is ef86f3a72adb8a7931f67335560740a7ad696d1d; when I removed that commit, the issue went away.
Commit info:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=…
During the test, I also found another issue: the 'HP FlexServer B390' system failed to boot because of an hpsa driver timeout; basically, it looks like hpsa did not detect the hard drives and udevd stalled.
After upgrading from v4.14.54 to v4.14.55, the HP system did not boot, but the system is OK when using a v4.14.55 kernel with that commit removed.
Commit ef86f3a72adb8a7931f67335560740a7ad696d1d also affects the HP Smart Array P220i RAID device. Because the v4.14.88 kernel is OK, I think subsequent commits may have fixed the hpsa driver issue, but the lpfc driver issue is not fixed.
If there is any more info I can provide, just ask what would be useful. Any suggestions?
Thanks
Liang
The patch below does not apply to the 4.20-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 894169db12463cea08d0e2a9e35f42b291340e5a Mon Sep 17 00:00:00 2001
From: Shivasharan S <shivasharan.srikanteshwara(a)broadcom.com>
Date: Tue, 18 Dec 2018 05:59:54 -0800
Subject: [PATCH] scsi: megaraid_sas: Use 63-bit DMA addressing
Although MegaRAID controllers support 64-bit DMA addressing, by hardware
design a DMA address with all 64 bits set (0xFFFFFFFF-FFFFFFFF) results in
a firmware fault.
The driver will set a 63-bit DMA mask to ensure that this address is never
used.
Cc: stable(a)vger.kernel.org
Signed-off-by: Shivasharan S <shivasharan.srikanteshwara(a)broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index d0f4075fe36e..f7bdd783360a 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -6184,13 +6184,13 @@ static int megasas_io_attach(struct megasas_instance *instance)
* @instance: Adapter soft state
* Description:
*
- * For Ventura, driver/FW will operate in 64bit DMA addresses.
+ * For Ventura, driver/FW will operate in 63bit DMA addresses.
*
* For invader-
* By default, driver/FW will operate in 32bit DMA addresses
* for consistent DMA mapping but if 32 bit consistent
- * DMA mask fails, driver will try with 64 bit consistent
- * mask provided FW is true 64bit DMA capable
+ * DMA mask fails, driver will try with 63 bit consistent
+ * mask provided FW is true 63bit DMA capable
*
* For older controllers(Thunderbolt and MFI based adapters)-
* driver/FW will operate in 32 bit consistent DMA addresses.
@@ -6204,14 +6204,14 @@ megasas_set_dma_mask(struct megasas_instance *instance)
pdev = instance->pdev;
consistent_mask = (instance->adapter_type >= VENTURA_SERIES) ?
- DMA_BIT_MASK(64) : DMA_BIT_MASK(32);
+ DMA_BIT_MASK(63) : DMA_BIT_MASK(32);
if (IS_DMA64) {
- if (dma_set_mask(&pdev->dev, DMA_BIT_MASK(64)) &&
+ if (dma_set_mask(&pdev->dev, DMA_BIT_MASK(63)) &&
dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32)))
goto fail_set_dma_mask;
- if ((*pdev->dev.dma_mask == DMA_BIT_MASK(64)) &&
+ if ((*pdev->dev.dma_mask == DMA_BIT_MASK(63)) &&
(dma_set_coherent_mask(&pdev->dev, consistent_mask) &&
dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32)))) {
/*
@@ -6224,7 +6224,7 @@ megasas_set_dma_mask(struct megasas_instance *instance)
if (!(scratch_pad_1 & MR_CAN_HANDLE_64_BIT_DMA_OFFSET))
goto fail_set_dma_mask;
else if (dma_set_mask_and_coherent(&pdev->dev,
- DMA_BIT_MASK(64)))
+ DMA_BIT_MASK(63)))
goto fail_set_dma_mask;
}
} else if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32)))
@@ -6236,8 +6236,8 @@ megasas_set_dma_mask(struct megasas_instance *instance)
instance->consistent_mask_64bit = true;
dev_info(&pdev->dev, "%s bit DMA mask and %s bit consistent mask\n",
- ((*pdev->dev.dma_mask == DMA_BIT_MASK(64)) ? "64" : "32"),
- (instance->consistent_mask_64bit ? "64" : "32"));
+ ((*pdev->dev.dma_mask == DMA_BIT_MASK(63)) ? "63" : "32"),
+ (instance->consistent_mask_64bit ? "63" : "32"));
return 0;
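The net effect of the change can be summarized with the generic DMA API as
below (a simplified sketch, not the driver's exact code path):

    /*
     * Prefer a 63-bit mask so that an all-ones 64-bit DMA address can
     * never be generated, and fall back to a 32-bit mask if that fails.
     */
    if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(63)) &&
        dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32)))
        goto fail_set_dma_mask;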
The patch
regulator: max77620: Initialize values for DT properties
has been applied to the regulator tree at
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator.git
All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.
You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.
If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.
Please add any relevant lists and maintainers to the CCs when replying
to this mail.
Thanks,
Mark
From 0ab66b3c326ef8f77dae9f528118966365757c0c Mon Sep 17 00:00:00 2001
From: Mark Zhang <markz(a)nvidia.com>
Date: Thu, 10 Jan 2019 12:11:16 +0800
Subject: [PATCH] regulator: max77620: Initialize values for DT properties
If a regulator's DT node doesn't exist, its of_parse_cb callback
isn't called, so all values for the DT properties stay zero. This
leads to wrong register updates for the FPS and POK settings.
Signed-off-by: Jinyoung Park <jinyoungp(a)nvidia.com>
Signed-off-by: Mark Zhang <markz(a)nvidia.com>
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Cc: stable(a)vger.kernel.org
---
drivers/regulator/max77620-regulator.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/regulator/max77620-regulator.c b/drivers/regulator/max77620-regulator.c
index b94e3a721721..cd93cf53e23c 100644
--- a/drivers/regulator/max77620-regulator.c
+++ b/drivers/regulator/max77620-regulator.c
@@ -1,7 +1,7 @@
/*
* Maxim MAX77620 Regulator driver
*
- * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2016-2018, NVIDIA CORPORATION. All rights reserved.
*
* Author: Mallikarjun Kasoju <mkasoju(a)nvidia.com>
* Laxman Dewangan <ldewangan(a)nvidia.com>
@@ -803,6 +803,14 @@ static int max77620_regulator_probe(struct platform_device *pdev)
rdesc = &rinfo[id].desc;
pmic->rinfo[id] = &max77620_regs_info[id];
pmic->enable_power_mode[id] = MAX77620_POWER_MODE_NORMAL;
+ pmic->reg_pdata[id].active_fps_src = -1;
+ pmic->reg_pdata[id].active_fps_pd_slot = -1;
+ pmic->reg_pdata[id].active_fps_pu_slot = -1;
+ pmic->reg_pdata[id].suspend_fps_src = -1;
+ pmic->reg_pdata[id].suspend_fps_pd_slot = -1;
+ pmic->reg_pdata[id].suspend_fps_pu_slot = -1;
+ pmic->reg_pdata[id].power_ok = -1;
+ pmic->reg_pdata[id].ramp_rate_setting = -1;
ret = max77620_read_slew_rate(pmic, id);
if (ret < 0)
--
2.20.1
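The idea behind the added initialization can be sketched as follows (a
simplified sketch; apply_fps_setting() is a hypothetical helper used only
for illustration, not a function in the driver):

    /* -1 marks a property that was never set from DT. */
    pmic->reg_pdata[id].active_fps_src = -1;

    /* Later, only touch the register if DT actually configured the value. */
    if (pmic->reg_pdata[id].active_fps_src >= 0)
        apply_fps_setting(pmic, id, pmic->reg_pdata[id].active_fps_src);

Without the initialization, a missing DT node leaves these fields at zero,
which looks like a valid setting and gets written into the FPS and POK
registers.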
Currently, the AXP803 driver assumes that reg_drivevbus is an input, which
is wrong. The unfortunate consequence is that none of the USB ports on the
board work, not even the USB host port, because USB PHY driver probing
fails due to the missing regulator.
Fix that by adding the "x-powers,drive-vbus-en" property to the AXP803 node.
Fixes: 14ff5d8f9151 ("arm64: dts: allwinner: a64: Orange Pi Win: Enable USB OTG socket")
Cc: stable(a)vger.kernel.org
Signed-off-by: Jernej Skrabec <jernej.skrabec(a)siol.net>
---
arch/arm64/boot/dts/allwinner/sun50i-a64-orangepi-win.dts | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64-orangepi-win.dts b/arch/arm64/boot/dts/allwinner/sun50i-a64-orangepi-win.dts
index b0c64f75792c..8974b5a1d3b1 100644
--- a/arch/arm64/boot/dts/allwinner/sun50i-a64-orangepi-win.dts
+++ b/arch/arm64/boot/dts/allwinner/sun50i-a64-orangepi-win.dts
@@ -188,6 +188,7 @@
reg = <0x3a3>;
interrupt-parent = <&r_intc>;
interrupts = <0 IRQ_TYPE_LEVEL_LOW>;
+ x-powers,drive-vbus-en; /* set N_VBUSEN as output pin */
};
};
--
2.20.1
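For context, the property is a boolean that the PMIC driver can check at
probe time, roughly along the lines of the sketch below (simplified;
axp803_set_n_vbusen_output() is a hypothetical helper used only for
illustration):

    /*
     * If the property is present, N_VBUSEN is driven as an output and the
     * drivevbus regulator can be registered, so the USB PHY finds its
     * supply at probe time.
     */
    if (of_property_read_bool(np, "x-powers,drive-vbus-en"))
        axp803_set_n_vbusen_output(axp);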
Hello,
This is the v2 series of patches fixing the kretprobe incorrect stacking
order. In this version, I fixed a missing kprobes.h include and added a
new patch for the kretprobe trampoline recursion issue (and added Cc: stable).
(1) kretprobe incorrect stacking order problem
Following a recent talk with Andrea, I started a more precise investigation
into the kernel panic with kretprobes on notrace functions, which Francis
had reported last year ( https://lkml.org/lkml/2017/7/14/466 ).
See the investigation details in
https://lkml.kernel.org/r/154686789378.15479.2886543882215785247.stgit@devb…
When we put a kretprobe on ftrace_ops_assist_func() and another
kretprobe on a probed function, the following happens:
<caller>
-><probed-function>
->fentry
->ftrace_ops_assist_func()
->int3
->kprobe_int3_handler()
...->pre_handler_kretprobe()
push the return address (*fentry*) of ftrace_ops_assist_func() to
top of the kretprobe list and replace it with kretprobe_trampoline.
<-kprobe_int3_handler()
<-(int3)
->kprobe_ftrace_handler()
...->pre_handler_kretprobe()
push the return address (caller) of probed-function to top of the
kretprobe list and replace it with kretprobe_trampoline.
<-(kprobe_ftrace_handler())
<-(ftrace_ops_assist_func())
[kretprobe_trampoline]
->trampoline_handler()
pop the return address (caller) from top of the kretprobe list
<-(trampoline_handler())
<caller>
[run caller with incorrect stack information]
<-(<caller>)
!!KERNEL PANIC!!
Therefore, this kernel panic happens only when we put two k*ret*probes, one
on ftrace_ops_assist_func() and one on another function. If we put plain
kprobes, there is no issue, since they do not change the return address.
To fix (or at least avoid) this issue, we can introduce frame pointer
verification to skip wrongly ordered entries. I would also like to
blacklist those functions, because they are part of the ftrace-based
kprobe handling routine.
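A simplified sketch of that verification in the trampoline handler follows
(details may differ from the actual patches; ri->fp is the frame pointer
recorded when the kretprobe instance was pushed):

    hlist_for_each_entry_safe(ri, tmp, head, hlist) {
        if (ri->task != current)
            continue;
        if (ri->fp != frame_pointer)
            continue;   /* pushed by a nested kretprobe, skip it */
        /* ... restore the original return address from ri ... */
    }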
(2) kretprobe trampoline recursion problem
This was found by Andrea in the previous thread
https://lkml.kernel.org/r/20190107183444.GA5966@xps-13
----
echo "r:event_1 __fdget" >> kprobe_events
echo "r:event_2 _raw_spin_lock_irqsave" >> kprobe_events
echo 1 > events/kprobes/enable
[DEADLOCK]
----
Because the kretprobe trampoline_handler uses a spinlock to protect the
hash table, probing the spinlock function itself causes a deadlock.
Thank you Andrea and Steve for discovering this root cause!
This bug was introduced with the asm-coded trampoline code; previously,
another kprobe was used to hook the function return placeholder (which
only had a nop), and the trampoline handler was called from that kprobe.
To fix this bug, I introduced a dummy kprobe and set it in
current_kprobe, as we did in the old days.
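Roughly, that fix looks like the sketch below (simplified, the actual patch
may differ in detail): the trampoline handler installs a dummy kprobe in
current_kprobe, so a nested kprobe hit (e.g. on the spinlock) is treated as
recursion and rejected instead of re-entering pre_handler_kretprobe() and
deadlocking on the hash table lock.

    static struct kprobe kretprobe_dummy = {
        .addr = (void *)&kretprobe_trampoline,
    };

    /* in trampoline_handler(), before taking kretprobe_table_lock(): */
    preempt_disable();
    __this_cpu_write(current_kprobe, &kretprobe_dummy);
    get_kprobe_ctlblk()->kprobe_status = KPROBE_HIT_ACTIVE;

    /* ... pop the correct return address under the table lock ... */

    __this_cpu_write(current_kprobe, NULL);
    preempt_enable();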
Thank you,
---
Masami Hiramatsu (3):
x86/kprobes: Verify stack frame on kretprobe
kprobes: Mark ftrace mcount handler functions nokprobe
x86/kprobes: Fix to avoid kretprobe recursion
arch/x86/kernel/kprobes/core.c | 48 ++++++++++++++++++++++++++++++++++++++--
include/linux/kprobes.h | 1 +
kernel/trace/ftrace.c | 6 ++++-
3 files changed, 52 insertions(+), 3 deletions(-)
--
Masami Hiramatsu (Linaro) <mhiramat(a)kernel.org>