From: Rob Clark <robdclark(a)chromium.org>
If a signal callback releases the sw_sync fence, that will trigger a
deadlock as the timeline_fence_release recurses onto the fence->lock
(used both for signaling and the timeline tree).
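Sketching the recursive path described above (obj->lock doubles as
fence->lock for every fence on the timeline):

  sync_timeline_signal()
    spin_lock_irq(&obj->lock)
    dma_fence_signal_locked()
      cb->func()                           /* callback drops the last ref */
        dma_fence_put()
          timeline_fence_release()
            spin_lock_irqsave(fence->lock) /* same lock -> deadlock */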
To avoid that, temporarily hold an extra reference to the signalled
fences until after we drop the lock.
(This is an alternative implementation of https://patchwork.kernel.org/patch/11664717/
which avoids some potential UAF issues with the original patch.)
Reported-by: Bas Nieuwenhuizen <bas(a)basnieuwenhuizen.nl>
Fixes: d3c6dd1fb30d ("dma-buf/sw_sync: Synchronize signal vs syncpt free")
Signed-off-by: Rob Clark <robdclark(a)chromium.org>
---
drivers/dma-buf/sw_sync.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
index 63f0aeb66db6..ceb6a0408624 100644
--- a/drivers/dma-buf/sw_sync.c
+++ b/drivers/dma-buf/sw_sync.c
@@ -191,6 +191,7 @@ static const struct dma_fence_ops timeline_fence_ops = {
*/
static void sync_timeline_signal(struct sync_timeline *obj, unsigned int inc)
{
+ LIST_HEAD(signalled);
struct sync_pt *pt, *next;
trace_sync_timeline(obj);
@@ -203,9 +204,13 @@ static void sync_timeline_signal(struct sync_timeline *obj, unsigned int inc)
if (!timeline_fence_signaled(&pt->base))
break;
+ dma_fence_get(&pt->base);
+
list_del_init(&pt->link);
rb_erase(&pt->node, &obj->pt_tree);
+ list_add_tail(&pt->link, &signalled);
+
/*
* A signal callback may release the last reference to this
* fence, causing it to be freed. That operation has to be
@@ -218,6 +223,11 @@ static void sync_timeline_signal(struct sync_timeline *obj, unsigned int inc)
}
spin_unlock_irq(&obj->lock);
+
+ list_for_each_entry_safe(pt, next, &signalled, link) {
+ list_del(&pt->link);
+ dma_fence_put(&pt->base);
+ }
}
/**
--
2.41.0
Hi Pintu,
On Sat, Jul 29, 2023 at 08:05:15AM +0530, Pintu Kumar wrote:
> The current global cma region name, "reserved", is misleading, creates
> confusion, and is too generic.
>
> Also, the default cma allocation happens from the global cma region,
> so, if one has to figure out all the allocations happening from the
> global cma region, this makes it easier.
>
> Thus, change the name from "reserved" to "global-cma-region".
I agree that reserved is not a very useful name. Unfortunately the
name of the region leaks to userspace through cma_heap.
So I think we need prep patches to hardcode "reserved" in
add_default_cma_heap first, and then remove the cma_get_name call.
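A rough sketch of what that prep patch could look like, assuming the heap
name is currently taken from cma_get_name() in __add_cma_heap():

	/* Sketch only: pin the userspace-visible heap name so it no
	 * longer follows the CMA region name. */
	static int __add_cma_heap(struct cma *cma, void *data)
	{
		struct dma_heap_export_info exp_info;

		/* ... */
		exp_info.name = "reserved";	/* was: cma_get_name(cma) */
		/* ... */
	}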
From: Boris Brezillon <boris.brezillon(a)collabora.com>
[ Upstream commit e30cb0599799aac099209e3b045379613c80730e ]
The drm_sched_entity_kill_jobs_cb() logic omits the last fence popped
from the dependency array, the one that was waited upon before
drm_sched_entity_kill() was called (the drm_sched_entity::dependency
field), so we're basically waiting for all dependencies except one.
In theory, this wait shouldn't be needed because resources should have
their users registered to the dma_resv object, thus guaranteeing that
future jobs wanting to access these resources wait on all the previous
users (depending on the access type, of course). But we want to keep
these explicit waits in the kill entity path just in case.
Let's make sure we keep all dependencies in the array in
drm_sched_job_dependency(), so we can iterate over the array and wait
in drm_sched_entity_kill_jobs_cb().
We also make sure we wait on drm_sched_fence::finished if we were
originally asked to wait on drm_sched_fence::scheduled. In that case,
we assume the intent was to delegate the wait to the firmware/GPU or
rely on the pipelining done at the entity/scheduler level, but when
killing jobs, we really want to wait for completion, not just scheduling.
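For reference, both fences live in the same drm_sched_fence (simplified
from include/drm/gpu_scheduler.h):

	struct drm_sched_fence {
		/* Signaled when the scheduler pushes the job to the hw. */
		struct dma_fence scheduled;
		/* Signaled when the job has completed. */
		struct dma_fence finished;
		/* ... */
	};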
v2:
- Don't evict deps in drm_sched_job_dependency()
v3:
- Always wait for drm_sched_fence::finished fences in
drm_sched_entity_kill_jobs_cb() when we see a sched_fence
v4:
- Fix commit message
- Fix a use-after-free bug
v5:
- Flag deps on which we should only wait for the scheduled event
at insertion time
v6:
- Back to v4 implementation
- Add Christian's R-b
Cc: Frank Binns <frank.binns(a)imgtec.com>
Cc: Sarah Walker <sarah.walker(a)imgtec.com>
Cc: Donald Robson <donald.robson(a)imgtec.com>
Cc: Luben Tuikov <luben.tuikov(a)amd.com>
Cc: David Airlie <airlied(a)gmail.com>
Cc: Daniel Vetter <daniel(a)ffwll.ch>
Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
Cc: "Christian König" <christian.koenig(a)amd.com>
Signed-off-by: Boris Brezillon <boris.brezillon(a)collabora.com>
Suggested-by: "Christian König" <christian.koenig(a)amd.com>
Reviewed-by: "Christian König" <christian.koenig(a)amd.com>
Acked-by: Luben Tuikov <luben.tuikov(a)amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230619071921.3465992-1-bori…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/gpu/drm/scheduler/sched_entity.c | 41 +++++++++++++++++++-----
1 file changed, 33 insertions(+), 8 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index e0a8890a62e23..42021d1f7e016 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -155,16 +155,32 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
{
struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
finish_cb);
- int r;
+ unsigned long index;
dma_fence_put(f);
/* Wait for all dependencies to avoid data corruptions */
- while (!xa_empty(&job->dependencies)) {
- f = xa_erase(&job->dependencies, job->last_dependency++);
- r = dma_fence_add_callback(f, &job->finish_cb,
- drm_sched_entity_kill_jobs_cb);
- if (!r)
+ xa_for_each(&job->dependencies, index, f) {
+ struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
+
+ if (s_fence && f == &s_fence->scheduled) {
+ /* The dependencies array had a reference on the scheduled
+ * fence, and the finished fence refcount might have
+ * dropped to zero. Use dma_fence_get_rcu() so we get
+ * a NULL fence in that case.
+ */
+ f = dma_fence_get_rcu(&s_fence->finished);
+
+ /* Now that we have a reference on the finished fence,
+ * we can release the reference the dependencies array
+ * had on the scheduled fence.
+ */
+ dma_fence_put(&s_fence->scheduled);
+ }
+
+ xa_erase(&job->dependencies, index);
+ if (f && !dma_fence_add_callback(f, &job->finish_cb,
+ drm_sched_entity_kill_jobs_cb))
return;
dma_fence_put(f);
@@ -394,8 +410,17 @@ static struct dma_fence *
drm_sched_job_dependency(struct drm_sched_job *job,
struct drm_sched_entity *entity)
{
- if (!xa_empty(&job->dependencies))
- return xa_erase(&job->dependencies, job->last_dependency++);
+ struct dma_fence *f;
+
+ /* We keep the fence around, so we can iterate over all dependencies
+ * in drm_sched_entity_kill_jobs_cb() to ensure all deps are signaled
+ * before killing the job.
+ */
+ f = xa_load(&job->dependencies, job->last_dependency);
+ if (f) {
+ job->last_dependency++;
+ return dma_fence_get(f);
+ }
if (job->sched->ops->prepare_job)
return job->sched->ops->prepare_job(job, entity);
--
2.40.1
* TL;DR:
Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
from device memory efficiently, without bouncing the data to a host memory
buffer.
* Problem:
A large number of data transfers have device memory as the source and/or
destination. Accelerators have drastically increased the volume of such transfers.
Some examples include:
- ML accelerators transferring large amounts of training data from storage into
GPU/TPU memory. In some cases ML training setup time can be as long as 50% of
TPU compute time; improving data transfer throughput & efficiency can help
improve GPU/TPU utilization.
- Distributed training, where ML accelerators, such as GPUs on different hosts,
exchange data among them.
- Distributed raw block storage applications transfer large amounts of data with
remote SSDs; much of this data does not require host processing.
Today, the majority of Device-to-Device data transfers over the network are
implemented as the following low-level operations: Device-to-Host copy,
Host-to-Host network transfer, and Host-to-Device copy.
The implementation is suboptimal, especially for bulk data transfers, and can
put significant strains on system resources, such as host memory bandwidth,
PCIe bandwidth, etc. One important reason behind the current state is the
kernel’s lack of semantics to express device-to-network transfers.
* Proposal:
In this patch series we attempt to optimize this use case by implementing
socket APIs that enable the user to:
1. send device memory across the network directly, and
2. receive incoming network packets directly into device memory.
Packet _payloads_ go directly from the NIC to device memory for receive and from
device memory to NIC for transmit.
Packet _headers_ go to/from host memory and are processed by the TCP/IP stack
normally. The NIC _must_ support header split to achieve this.
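Schematically, on receive (transmit is the mirror image):

  wire -> NIC header split -> headers -> host memory   -> TCP/IP stack
                           -> payload -> device memory (dmabuf pages)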
Advantages:
- Alleviate host memory bandwidth pressure, compared to existing
network-transfer + device-copy semantics.
- Alleviate PCIe BW pressure, by limiting the data transfer to the lowest level
of the PCIe tree, compared to the traditional path, which sends data through
the root complex.
With this proposal we're able to reach ~96.6% line rate speeds with data sent
and received directly from/to device memory.
* Patch overview:
** Part 1: struct paged device memory
Currently the standard for device memory sharing is DMABUF, which doesn't
generate struct pages. On the other hand, the networking stack (skbs, drivers,
and the page pool) operates on pages. We have 2 options:
1. Generate struct pages for dmabuf device memory, or,
2. Modify the networking stack to understand a new memory type.
This proposal implements option #1. We implement a small framework to generate
struct pages for an sg_table returned from dma_buf_map_attachment(). The support
added here should be generic and easily extended to other use cases interested
in struct paged device memory. We use this framework to generate pages that can
be used in the networking stack.
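As a rough sketch of the intended kernel-side flow (dma_buf_create_pages()
below is a hypothetical stand-in for the helper this series adds; the rest
is the existing dma-buf API):

	struct dma_buf *dbuf = dma_buf_get(dmabuf_fd);
	struct dma_buf_attachment *attach = dma_buf_attach(dbuf, dev);
	struct sg_table *sgt = dma_buf_map_attachment(attach, DMA_FROM_DEVICE);
	/* Hypothetical helper: wrap the sg_table in struct pages that the
	 * net stack (skb frags, page pool) can reference. */
	struct page **pages = dma_buf_create_pages(sgt);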
** Part 2: recvmsg() & sendmsg() APIs
We define socket APIs for the user to send and receive these dmabuf pages.
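As a rough userspace sketch of the receive side (the devmem-specific cmsg
type is a placeholder for whatever this series defines; the recvmsg()/CMSG_*
plumbing is standard):

	#include <sys/socket.h>

	static void recv_devmem(int sock)
	{
		char ctrl[CMSG_SPACE(64)];
		struct msghdr msg = {
			.msg_control = ctrl,
			.msg_controllen = sizeof(ctrl),
		};
		struct cmsghdr *cm;

		/* Payloads land in device memory, so nothing comes back
		 * via msg_iov; cmsgs describe where in the dmabuf the
		 * payload was placed. */
		if (recvmsg(sock, &msg, 0) < 0)
			return;

		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
			/* cm->cmsg_type would be the (hypothetical) devmem
			 * cmsg; parse dmabuf offset/len from CMSG_DATA(cm). */
		}
	}

Pages handed out this way are later returned to the kernel via the
SO_DEVMEM_DONTNEED setsockopt added later in the series.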
** Part 3: support for unreadable skb frags
Dmabuf pages are not accessible by the host; we implement changes throughout
the networking stack to correctly handle skbs with unreadable frags.
** Part 4: page pool support
We piggy back on Jakub's page pool memory providers idea:
https://github.com/kuba-moo/linux/tree/pp-providers
It allows the page pool to define a memory provider that provides the
page allocation and freeing. It helps abstract most of the device memory TCP
changes from the driver.
This is not strictly necessary; the driver can choose to allocate dmabuf pages
and use them directly without going through the page pool (if acceptable to
their maintainers).
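Roughly, a provider plugs an alloc/free ops table into the pool; a sketch
assuming the branch above looks something like this (field names are
illustrative, see Jakub's tree for the real API):

	struct pp_memory_provider_ops {
		int (*init)(struct page_pool *pool);
		void (*destroy)(struct page_pool *pool);
		struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
		bool (*release_page)(struct page_pool *pool, struct page *page);
	};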
Not included with this RFC is the GVE devmem TCP support, just to
simplify the review. Code available here if desired:
https://github.com/mina/linux/tree/tcpdevmem
This RFC is built on top of v6.4-rc7 with Jakub's pp-providers changes
cherry-picked.
* NIC dependencies:
1. (strict) Devmem TCP requires the NIC to support header split, i.e. the
capability to split incoming packets into a header + payload and to put
each into a separate buffer. Devmem TCP works by using dmabuf pages
for the packet payload, and host memory for the packet headers.
2. (optional) Devmem TCP works better with flow steering support & RSS support,
i.e. the NIC's ability to steer flows into certain rx queues. This allows the
sysadmin to enable devmem TCP on a subset of the rx queues, and steer
devmem TCP traffic onto these queues and non-devmem TCP traffic elsewhere.
The NIC I have access to with these properties is the GVE with DQO support
running in Google Cloud, but any NIC that supports these features would suffice.
I may be able to help reviewers bring up devmem TCP on their NICs.
* Testing:
The series includes a udmabuf kselftest that shows a simple use case of
devmem TCP and validates the entire data path end to end without
a dependency on a specific dmabuf provider.
Not included in this series is our devmem TCP benchmark, which
transfers data to/from GPU dmabufs directly.
With this implementation & benchmark we're able to reach ~96.6% line rate
speeds with 4 GPU/NIC pairs running bidirectional traffic, with all the
packet payloads going straight to the GPU memory (no host buffer bounce).
** Test Setup
Kernel: v6.4-rc7, with this RFC and Jakub's memory provider API
cherry-picked locally.
Hardware: Google Cloud A3 VMs.
NIC: GVE with header split & RSS & flow steering support.
Benchmark: custom devmem TCP benchmark not yet open sourced.
Mina Almasry (10):
dma-buf: add support for paged attachment mappings
dma-buf: add support for NET_RX pages
dma-buf: add support for NET_TX pages
net: add support for skbs with unreadable frags
tcp: implement recvmsg() RX path for devmem TCP
net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages
tcp: implement sendmsg() TX path for devmem TCP
selftests: add ncdevmem, netcat for devmem TCP
memory-provider: updates core provider API for devmem TCP
memory-provider: add dmabuf devmem provider
drivers/dma-buf/dma-buf.c | 444 ++++++++++++++++
include/linux/dma-buf.h | 142 +++++
include/linux/netdevice.h | 1 +
include/linux/skbuff.h | 34 +-
include/linux/socket.h | 1 +
include/net/page_pool.h | 21 +
include/net/sock.h | 4 +
include/net/tcp.h | 6 +-
include/uapi/asm-generic/socket.h | 6 +
include/uapi/linux/dma-buf.h | 12 +
include/uapi/linux/uio.h | 10 +
net/core/datagram.c | 3 +
net/core/page_pool.c | 111 +++-
net/core/skbuff.c | 81 ++-
net/core/sock.c | 47 ++
net/ipv4/tcp.c | 262 +++++++++-
net/ipv4/tcp_input.c | 13 +-
net/ipv4/tcp_ipv4.c | 8 +
net/ipv4/tcp_output.c | 5 +-
net/packet/af_packet.c | 4 +-
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/ncdevmem.c | 693 +++++++++++++++++++++++++
23 files changed, 1868 insertions(+), 42 deletions(-)
create mode 100644 tools/testing/selftests/net/ncdevmem.c
--
2.41.0.390.g38632f3daf-goog
From: Luc Ma <luc(a)sietium.com>
The kernel-doc for DMA-BUF statistics mentions /sys/kernel/dma-buf/buffers
but the correct path is /sys/kernel/dmabuf/buffers instead.
Signed-off-by: Luc Ma <luc(a)sietium.com>
Reviewed-by: Javier Martinez Canillas <javierm(a)redhat.com>
---
drivers/dma-buf/dma-buf-sysfs-stats.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/dma-buf/dma-buf-sysfs-stats.c b/drivers/dma-buf/dma-buf-sysfs-stats.c
index 6cfbbf0720bd..b5b62e40ccc1 100644
--- a/drivers/dma-buf/dma-buf-sysfs-stats.c
+++ b/drivers/dma-buf/dma-buf-sysfs-stats.c
@@ -33,7 +33,7 @@
* into their address space. This necessitated the creation of the DMA-BUF sysfs
* statistics interface to provide per-buffer information on production systems.
*
- * The interface at ``/sys/kernel/dma-buf/buffers`` exposes information about
+ * The interface at ``/sys/kernel/dmabuf/buffers`` exposes information about
* every DMA-BUF when ``CONFIG_DMABUF_SYSFS_STATS`` is enabled.
*
* The following stats are exposed by the interface:
--
2.25.1
As &ndlp->lock is acquired by the timer lpfc_els_retry_delay() under softirq
context, process-context code acquiring &ndlp->lock should disable irqs or
bhs; otherwise a deadlock can happen if the timer preempts the execution
while the lock is held in process context on the same CPU.
The two lock acquisitions inside lpfc_cleanup_pending_mbox() do not
disable irqs or softirqs.
[Deadlock Scenario]
lpfc_cmpl_els_fdisc()
-> lpfc_cleanup_pending_mbox()
-> spin_lock(&ndlp->lock);
<irq>
-> lpfc_els_retry_delay()
-> lpfc_nlp_get()
-> spin_lock_irqsave(&ndlp->lock, flags); (deadlock here)
This flaw was found by an experimental static analysis tool I am
developing for irq-related deadlocks.
The patch fixes the potential deadlock by using spin_lock_irq() to
disable irqs.
Signed-off-by: Chengfeng Ye <dg573847474(a)gmail.com>
---
drivers/scsi/lpfc/lpfc_sli.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c
index 58d10f8f75a7..8555f6bb9742 100644
--- a/drivers/scsi/lpfc/lpfc_sli.c
+++ b/drivers/scsi/lpfc/lpfc_sli.c
@@ -21049,9 +21049,9 @@ lpfc_cleanup_pending_mbox(struct lpfc_vport *vport)
mb->mbox_flag |= LPFC_MBX_IMED_UNREG;
restart_loop = 1;
spin_unlock_irq(&phba->hbalock);
- spin_lock(&ndlp->lock);
+ spin_lock_irq(&ndlp->lock);
ndlp->nlp_flag &= ~NLP_IGNR_REG_CMPL;
- spin_unlock(&ndlp->lock);
+ spin_unlock_irq(&ndlp->lock);
spin_lock_irq(&phba->hbalock);
break;
}
@@ -21067,9 +21067,9 @@ lpfc_cleanup_pending_mbox(struct lpfc_vport *vport)
ndlp = (struct lpfc_nodelist *)mb->ctx_ndlp;
mb->ctx_ndlp = NULL;
if (ndlp) {
- spin_lock(&ndlp->lock);
+ spin_lock_irq(&ndlp->lock);
ndlp->nlp_flag &= ~NLP_IGNR_REG_CMPL;
- spin_unlock(&ndlp->lock);
+ spin_unlock_irq(&ndlp->lock);
lpfc_nlp_put(ndlp);
}
}
--
2.17.1