A long story behind all of that...
Some time ago I met kernel crash after CRIU restore procedure,
fortunately, it was CRIU restore, so, I had dump files and could
do restore many times and crash reproduced easily. After some
investigation I've constructed the minimal reproducer. It was
found that it's use-after-free and it happens only if
sysctl kernel.shm_rmid_forced = 1.
The key of the problem is that the exit_shm() function
not handles shp's object destroy when task->sysvshm.shm_clist
contains items from different IPC namespaces. In most cases
this list will contain only items from one IPC namespace.
Why this list may contain object from different namespaces?
Function exit_shm() designed to clean up this list always when
process leaves IPC namespace. But we made a mistake a long time ago
and not add exit_shm() call into setns() syscall procedures.
1st second idea was just to add this call to setns() syscall but
it's obviously changes semantics of setns() syscall and that's
userspace-visible change. So, I gave up this idea.
First real attempt to address the issue was just to omit forced destroy
if we meet shp object not from current task IPC namespace [1]. But
that was not the best idea because task->sysvshm.shm_clist was
protected by rwsem which belongs to current task IPC namespace.
It means that list corruption may occur.
Second approach is just extend exit_shm() to properly handle
shp's from different IPC namespaces [2]. This is really
non-trivial thing, I've put a lot of effort into that but
not believed that it's possible to make it fully safe, clean
and clear.
Thanks to the efforts of Manfred Spraul working and elegant
solution was designed. Thanks a lot, Manfred!
Eric also suggested the way to address the issue in
("[RFC][PATCH] shm: In shm_exit destroy all created and never attached segments")
Eric's idea was to maintain a list of shm_clists one per IPC namespace,
use lock-less lists. But there is some extra memory consumption-related concerns.
Alternative solution which was suggested by me was implemented in
("shm: reset shm_clist on setns but omit forced shm destroy")
Idea is pretty simple, we add exit_shm() syscall to setns() but DO NOT
destroy shm segments even if sysctl kernel.shm_rmid_forced = 1, we just
clean up the task->sysvshm.shm_clist list. This chages semantics of
setns() syscall a little bit but in comparision to "naive" solution
when we just add exit_shm() without any special exclusions this looks
like a safer option.
[1] https://lkml.org/lkml/2021/7/6/1108
[2] https://lkml.org/lkml/2021/7/14/736
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Greg KH <gregkh(a)linuxfoundation.org>
Cc: Andrei Vagin <avagin(a)gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov(a)virtuozzo.com>
Cc: Vasily Averin <vvs(a)virtuozzo.com>
Cc: Manfred Spraul <manfred(a)colorfullife.com>
Cc: Alexander Mikhalitsyn <alexander(a)mihalicyn.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn(a)virtuozzo.com>
Alexander Mikhalitsyn (2):
ipc: WARN if trying to remove ipc object which is absent
shm: extend forced shm destroy to support objects from several IPC
nses
include/linux/ipc_namespace.h | 15 +++
include/linux/sched/task.h | 2 +-
include/linux/shm.h | 2 +-
ipc/shm.c | 170 +++++++++++++++++++++++++---------
ipc/util.c | 6 +-
5 files changed, 145 insertions(+), 50 deletions(-)
--
2.31.1
Some devices tend to trigger SYS_ERR interrupt while the host handling
SYS_ERR state of the device during power up. This creates a race
condition and causes a failure in booting up the device.
The issue is seen on the Sierra Wireless EM9191 modem during SYS_ERR
handling in mhi_async_power_up(). Once the host detects that the device
is in SYS_ERR state, it issues MHI_RESET and waits for the device to
process the reset request. During this time, the device triggers SYS_ERR
interrupt to the host and host starts handling SYS_ERR execution.
So by the time the device has completed reset, host starts SYS_ERR
handling. This causes the race condition and the modem fails to boot.
Hence, register the IRQ handler only after handling the SYS_ERR check
to avoid getting spurious IRQs from the device.
Cc: stable(a)vger.kernel.org
Fixes: e18d4e9fa79b ("bus: mhi: core: Handle syserr during power_up")
Reported-by: Aleksander Morgado <aleksander(a)aleksander.es>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
---
drivers/bus/mhi/core/pm.c | 26 +++++++++++---------------
1 file changed, 11 insertions(+), 15 deletions(-)
diff --git a/drivers/bus/mhi/core/pm.c b/drivers/bus/mhi/core/pm.c
index fb99e3727155..ec5f11166820 100644
--- a/drivers/bus/mhi/core/pm.c
+++ b/drivers/bus/mhi/core/pm.c
@@ -1055,10 +1055,6 @@ int mhi_async_power_up(struct mhi_controller *mhi_cntrl)
mutex_lock(&mhi_cntrl->pm_mutex);
mhi_cntrl->pm_state = MHI_PM_DISABLE;
- ret = mhi_init_irq_setup(mhi_cntrl);
- if (ret)
- goto error_setup_irq;
-
/* Setup BHI INTVEC */
write_lock_irq(&mhi_cntrl->pm_lock);
mhi_write_reg(mhi_cntrl, mhi_cntrl->bhi, BHI_INTVEC, 0);
@@ -1072,7 +1068,7 @@ int mhi_async_power_up(struct mhi_controller *mhi_cntrl)
dev_err(dev, "%s is not a valid EE for power on\n",
TO_MHI_EXEC_STR(current_ee));
ret = -EIO;
- goto error_async_power_up;
+ goto error_setup_irq;
}
state = mhi_get_mhi_state(mhi_cntrl);
@@ -1082,19 +1078,18 @@ int mhi_async_power_up(struct mhi_controller *mhi_cntrl)
if (state == MHI_STATE_SYS_ERR) {
mhi_set_mhi_state(mhi_cntrl, MHI_STATE_RESET);
ret = wait_event_timeout(mhi_cntrl->state_event,
- MHI_PM_IN_FATAL_STATE(mhi_cntrl->pm_state) ||
- mhi_read_reg_field(mhi_cntrl,
- mhi_cntrl->regs,
- MHICTRL,
- MHICTRL_RESET_MASK,
- MHICTRL_RESET_SHIFT,
+ mhi_read_reg_field(mhi_cntrl,
+ mhi_cntrl->regs,
+ MHICTRL,
+ MHICTRL_RESET_MASK,
+ MHICTRL_RESET_SHIFT,
&val) ||
!val,
msecs_to_jiffies(mhi_cntrl->timeout_ms));
if (!ret) {
ret = -EIO;
dev_info(dev, "Failed to reset MHI due to syserr state\n");
- goto error_async_power_up;
+ goto error_setup_irq;
}
/*
@@ -1104,6 +1099,10 @@ int mhi_async_power_up(struct mhi_controller *mhi_cntrl)
mhi_write_reg(mhi_cntrl, mhi_cntrl->bhi, BHI_INTVEC, 0);
}
+ ret = mhi_init_irq_setup(mhi_cntrl);
+ if (ret)
+ goto error_setup_irq;
+
/* Transition to next state */
next_state = MHI_IN_PBL(current_ee) ?
DEV_ST_TRANSITION_PBL : DEV_ST_TRANSITION_READY;
@@ -1116,9 +1115,6 @@ int mhi_async_power_up(struct mhi_controller *mhi_cntrl)
return 0;
-error_async_power_up:
- mhi_deinit_free_irq(mhi_cntrl);
-
error_setup_irq:
mhi_cntrl->pm_state = MHI_PM_DISABLE;
mutex_unlock(&mhi_cntrl->pm_mutex);
--
2.25.1
From: Charan Teja Reddy <charante(a)codeaurora.org>
[ Upstream commit f492283b157053e9555787262f058ae33096f568 ]
It is expected from the clients to follow the below steps on an imported
dmabuf fd:
a) dmabuf = dma_buf_get(fd) // Get the dmabuf from fd
b) dma_buf_attach(dmabuf); // Clients attach to the dmabuf
o Here the kernel does some slab allocations, say for
dma_buf_attachment and may be some other slab allocation in the
dmabuf->ops->attach().
c) Client may need to do dma_buf_map_attachment().
d) Accordingly dma_buf_unmap_attachment() should be called.
e) dma_buf_detach () // Clients detach to the dmabuf.
o Here the slab allocations made in b) are freed.
f) dma_buf_put(dmabuf) // Can free the dmabuf if it is the last
reference.
Now say an erroneous client failed at step c) above thus it directly
called dma_buf_put(), step f) above. Considering that it may be the last
reference to the dmabuf, buffer will be freed with pending attachments
left to the dmabuf which can show up as the 'memory leak'. This should
at least be reported as the WARN().
Signed-off-by: Charan Teja Reddy <charante(a)codeaurora.org>
Reviewed-by: Christian König <christian.koenig(a)amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1627043468-16381-1-git-send-e…
Signed-off-by: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/dma-buf/dma-buf.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 758de0e9b2ddc..16bbc9bc9e6d1 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -79,6 +79,7 @@ static void dma_buf_release(struct dentry *dentry)
if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
dma_resv_fini(dmabuf->resv);
+ WARN_ON(!list_empty(&dmabuf->attachments));
module_put(dmabuf->owner);
kfree(dmabuf->name);
kfree(dmabuf);
--
2.33.0
From: Charan Teja Reddy <charante(a)codeaurora.org>
[ Upstream commit f492283b157053e9555787262f058ae33096f568 ]
It is expected from the clients to follow the below steps on an imported
dmabuf fd:
a) dmabuf = dma_buf_get(fd) // Get the dmabuf from fd
b) dma_buf_attach(dmabuf); // Clients attach to the dmabuf
o Here the kernel does some slab allocations, say for
dma_buf_attachment and may be some other slab allocation in the
dmabuf->ops->attach().
c) Client may need to do dma_buf_map_attachment().
d) Accordingly dma_buf_unmap_attachment() should be called.
e) dma_buf_detach () // Clients detach to the dmabuf.
o Here the slab allocations made in b) are freed.
f) dma_buf_put(dmabuf) // Can free the dmabuf if it is the last
reference.
Now say an erroneous client failed at step c) above thus it directly
called dma_buf_put(), step f) above. Considering that it may be the last
reference to the dmabuf, buffer will be freed with pending attachments
left to the dmabuf which can show up as the 'memory leak'. This should
at least be reported as the WARN().
Signed-off-by: Charan Teja Reddy <charante(a)codeaurora.org>
Reviewed-by: Christian König <christian.koenig(a)amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1627043468-16381-1-git-send-e…
Signed-off-by: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/dma-buf/dma-buf.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 511fe0d217a08..733c8b1c8467c 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -79,6 +79,7 @@ static void dma_buf_release(struct dentry *dentry)
if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
dma_resv_fini(dmabuf->resv);
+ WARN_ON(!list_empty(&dmabuf->attachments));
module_put(dmabuf->owner);
kfree(dmabuf->name);
kfree(dmabuf);
--
2.33.0