Hi,
On Wed, 25 May 2022 at 15:07, Simon Ser <contact(a)emersion.fr> wrote:
> On Wednesday, May 25th, 2022 at 15:51, Daniel Vetter <daniel(a)ffwll.ch> wrote:
> > > You can add that to the list of reasons why compositors need to stop
> > > using buffers with unsignaled fences. There's plenty of other reasons
> > > there already (the big one being that otherwise slow clients can slow
> > > down the compositor, even if the compositor uses a high priority context
> > > and the HW supports preemption).
> >
> >
> > Yeah that's tbh another reason why I think we shouldn't do umf as a
> > transparent thing - compositors need to get better anyway, so we might as
> > well take this as a chance to do this right.
>
> As a compositor dev, I agree -- we should definitely be smarter about
> this. Note, it would help a lot to have a good way to integrate the
> waits into a poll(2) event loop.
The same holds for Weston. We're currently working through a bunch of
internal infrastructure to be able to handle this. Mutter (aka GNOME)
is also really well-placed to be able to do this.
Having pollable waits would be really useful, but I don't think it's
essential. In my strawman I'm just waking up at the usual
just-before-repaint point and checking; if it doesn't make it for this
frame then we'll wait for the next frame. If someone submits buffers
which take 4 repaint periods to clear then we'll have 3 'unnecessary'
wakeups, but given the GPU is already slammed then it's not an
efficiency problem I don't think. (I don't know if all the other
compositor people share this view.)
Cheers,
Daniel
On Wed, May 25, 2022 at 03:28:41PM +0200, Michel Dänzer wrote:
> On 2022-05-25 15:05, Daniel Vetter wrote:
> > On Tue, May 17, 2022 at 12:28:17PM +0200, Christian König wrote:
> >> Am 09.05.22 um 16:10 schrieb Daniel Vetter:
> >>> On Mon, May 09, 2022 at 08:56:41AM +0200, Christian König wrote:
> >>>> Am 04.05.22 um 12:08 schrieb Daniel Vetter:
> >>>>>
> >>>>> If the goal is specifically atomic kms, then there's an entire can of
> >>>>> worms there that I really don't want to think about, but it exists: We
> >>>>> have dma_fence as out-fences from atomic commit, and that's already
> >>>>> massively broken since most drivers allocate some memory or at least take
> >>>>> locks which can allocate memory in their commit path. Like i2c. Putting a
> >>>>> userspace memory fence as in-fence in there makes that problem
> >>>>> substantially worse, since at least in theory you're just not allowed to
> >>>>> might_faul in atomic_commit_tail.
> >>>> Yes, that's unfortunately one of the goals as well and yes I completely
> >>>> agree on the can of worms. But I think I've solved that.
> >>>>
> >>>> What I do in the patch set is to enforce that the out fence is an user fence
> >>>> when the driver supports user in fences as well.
> >>>>
> >>>> Since user fences doesn't have the memory management dependency drivers can
> >>>> actually allocate memory or call I2C functions which takes locks which have
> >>>> memory allocation dependencies.
> >>>>
> >>>> Or do I miss some other reason why you can't fault or allocate memory in
> >>>> atomic_commit_tail? At least lockdep seems to be happy about that now.
> >>> The problem is a bit that this breaks the uapi already. At least if the
> >>> goal is to have this all be perfectly transparent for userspace - as you
> >>> as you have multi-gpu setups going on at least.
> >>
> >> Question here is why do you think there is an UAPI break? We currently wait
> >> in a work item already, so where exactly is the problem?
> >
> > It's a bit washy, but dma_fence and hence implicit sync is supposed to
> > finish in finite time. umf just doesn't.
> >
> > Ofc in reality you can still flood your compositor and they're not very
> > robust, but with umf it's trivial to just hang your compositor forever and
> > nothing happens.
>
> You can add that to the list of reasons why compositors need to stop
> using buffers with unsignaled fences. There's plenty of other reasons
> there already (the big one being that otherwise slow clients can slow
> down the compositor, even if the compositor uses a high priority context
> and the HW supports preemption).
Yeah that's tbh another reason why I think we shouldn't do umf as a
transparent thing - compositors need to get better anyway, so we might as
well take this as a chance to do this right.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Hello everyone,
it's a well known problem that the DMA-buf subsystem mixed synchronization and memory management requirements into the same dma_fence and dma_resv objects. Because of this dma_fence objects need to guarantee that they complete within a finite amount of time or otherwise the system can easily deadlock.
One of the few good things about this problem is that it is really good understood by now.
Daniel and others came up with some documentation: https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_…
And Jason did an excellent presentation about that problem on last years LPC: https://lpc.events/event/11/contributions/1115/
Based on that we had been able to reject new implementations of infinite/user DMA fences and mitigate the effect of the few existing ones.
The still remaining down side is that we don't have a way of using user fences as dependency in both the explicit (sync_file, drm_syncobj) as well as the implicit (dma_resv) synchronization objects, resulting in numerous problems and limitations for things like HMM, user queues etc....
This patch set here now tries to tackle this problem by untangling the synchronization from the memory management. What it does *not* try to do is to fix the existing kernel fences, because I think we now can all agree on that this isn't really possible.
To archive this goal what I do in this patch set is to add some parallel infrastructure to cleanly separate normal kernel dma_fence objects from indefinite/user fences:
1. It introduce a DMA_FENCE_FLAG_USER define (after renaming some existing driver defines). To note that a certain dma_fence is an user fence and *must* be ignore by memory management and never used as dependency for normal none user dma_fence objects.
2. The dma_fence_array and dma_fence_chain containers are modified so that they are marked as user fences whenever any of their contained fences are an user fence.
3. The dma_resv object gets a new DMA_RESV_USAGE_USER flag which must be used with indefinite/user fences and separates those into it's own synchronization domain.
4. The existing dma_buf_poll_add_cb() function is modified so that indefinite/user fences are included in the polling.
5. The sync_file synchronization object is modified so that we essentially have two fence streams instead of just one.
6. The drm_syncobj is modified in a similar way. User fences are just ignored unless the driver explicitly states support to wait for them.
7. The DRM subsystem gains a new DRIVER_USER_FENCE flag which drivers can use to indicate the need for user fences. If user fences are used the atomic mode setting starts to support user fences as IN/OUT fences.
8. Lockdep is used at various critical locations to ensure that nobody ever tries to mix user fences with non user fences.
The general approach is to just ignore user fences unless a driver stated explicitely support for them.
On top of all of this I've hacked amdgpu so that we add the resulting CS fence only as kernel dependency to the dma_resv object and an additional wrapped up with a dma_fence_array and a stub user fence.
The result is that the newly added atomic modeset functions now correctly wait for the user fence to complete before doing the flip. And dependent CS don't pipeline any more, but rather block on the CPU before submitting work.
After tons of debugging and testing everything now seems to not go up in flames immediately and even lockdep is happy with the annotations.
I'm perfectly aware that this is probably by far the most controversial patch set I've ever created and I really wish we wouldn't need it. But we certainly have the requirement for this and I don't see much other chance to get that working in an UAPI compatible way.
Thoughts/comments?
Regards,
Christian.
Processes can pin shared memory by keeping a handle to it through a
file descriptor; for instance dmabufs, memfd, and ashsmem (in Android).
In the case of a memory leak, to identify the process pinning the
memory, userspace needs to:
- Iterate the /proc/<pid>/fd/* for each process
- Do a readlink on each entry to identify the type of memory from
the file path.
- stat() each entry to get the size of the memory.
The file permissions on /proc/<pid>/fd/* only allows for the owner
or root to perform the operations above; and so is not suitable for
capturing the system-wide state in a production environment.
This issue was addressed for dmabufs by making /proc/*/fdinfo/*
accessible to a process with PTRACE_MODE_READ_FSCREDS credentials[1]
To allow the same kind of tracking for other types of shared memory,
add the following fields to /proc/<pid>/fdinfo/<fd>:
path - This allows identifying the type of memory based on common
prefixes: e.g. "/memfd...", "/dmabuf...", "/dev/ashmem..."
This was not an issued when dmabuf tracking was introduced
because the exp_name field of dmabuf fdinfo could be used
to distinguish dmabuf fds from other types.
size - To track the amount of memory that is being pinned.
dmabufs expose size as an additional field in fdinfo. Remove
this and make it a common field for all fds.
Access to /proc/<pid>/fdinfo is governed by PTRACE_MODE_READ_FSCREDS
-- the same as for /proc/<pid>/maps which also exposes the path and
size for mapped memory regions.
This allows for a system process with PTRACE_MODE_READ_FSCREDS to
account the pinned per-process memory via fdinfo.
[1] https://lore.kernel.org/lkml/20210308170651.919148-1-kaleshsingh@google.com/
Signed-off-by: Kalesh Singh <kaleshsingh(a)google.com>
---
Documentation/filesystems/proc.rst | 22 ++++++++++++++++++++--
drivers/dma-buf/dma-buf.c | 1 -
fs/proc/fd.c | 9 +++++++--
3 files changed, 27 insertions(+), 5 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 061744c436d9..ad66d78aca51 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1922,13 +1922,16 @@ if precise results are needed.
3.8 /proc/<pid>/fdinfo/<fd> - Information about opened file
---------------------------------------------------------------
This file provides information associated with an opened file. The regular
-files have at least four fields -- 'pos', 'flags', 'mnt_id' and 'ino'.
+files have at least six fields -- 'pos', 'flags', 'mnt_id', 'ino', 'size',
+and 'path'.
+
The 'pos' represents the current offset of the opened file in decimal
form [see lseek(2) for details], 'flags' denotes the octal O_xxx mask the
file has been created with [see open(2) for details] and 'mnt_id' represents
mount ID of the file system containing the opened file [see 3.5
/proc/<pid>/mountinfo for details]. 'ino' represents the inode number of
-the file.
+the file, 'size' represents the size of the file in bytes, and 'path'
+represents the file path.
A typical output is::
@@ -1936,6 +1939,8 @@ A typical output is::
flags: 0100002
mnt_id: 19
ino: 63107
+ size: 0
+ path: /dev/null
All locks associated with a file descriptor are shown in its fdinfo too::
@@ -1953,6 +1958,8 @@ Eventfd files
flags: 04002
mnt_id: 9
ino: 63107
+ size: 0
+ path: anon_inode:[eventfd]
eventfd-count: 5a
where 'eventfd-count' is hex value of a counter.
@@ -1966,6 +1973,8 @@ Signalfd files
flags: 04002
mnt_id: 9
ino: 63107
+ size: 0
+ path: anon_inode:[signalfd]
sigmask: 0000000000000200
where 'sigmask' is hex value of the signal mask associated
@@ -1980,6 +1989,8 @@ Epoll files
flags: 02
mnt_id: 9
ino: 63107
+ size: 0
+ path: anon_inode:[eventpoll]
tfd: 5 events: 1d data: ffffffffffffffff pos:0 ino:61af sdev:7
where 'tfd' is a target file descriptor number in decimal form,
@@ -1998,6 +2009,8 @@ For inotify files the format is the following::
flags: 02000000
mnt_id: 9
ino: 63107
+ size: 0
+ path: anon_inode:inotify
inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d
where 'wd' is a watch descriptor in decimal form, i.e. a target file
@@ -2021,6 +2034,8 @@ For fanotify files the format is::
flags: 02
mnt_id: 9
ino: 63107
+ size: 0
+ path: anon_inode:[fanotify]
fanotify flags:10 event-flags:0
fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003
fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4
@@ -2046,6 +2061,8 @@ Timerfd files
flags: 02
mnt_id: 9
ino: 63107
+ size: 0
+ path: anon_inode:[timerfd]
clockid: 0
ticks: 0
settime flags: 01
@@ -2070,6 +2087,7 @@ DMA Buffer files
mnt_id: 9
ino: 63107
size: 32768
+ path: /dmabuf:
count: 2
exp_name: system-heap
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index b1e25ae98302..d61183ff3c30 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -377,7 +377,6 @@ static void dma_buf_show_fdinfo(struct seq_file *m, struct file *file)
{
struct dma_buf *dmabuf = file->private_data;
- seq_printf(m, "size:\t%zu\n", dmabuf->size);
/* Don't count the temporary reference taken inside procfs seq_show */
seq_printf(m, "count:\t%ld\n", file_count(dmabuf->file) - 1);
seq_printf(m, "exp_name:\t%s\n", dmabuf->exp_name);
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 913bef0d2a36..a8a968bc58f0 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -54,10 +54,15 @@ static int seq_show(struct seq_file *m, void *v)
if (ret)
return ret;
- seq_printf(m, "pos:\t%lli\nflags:\t0%o\nmnt_id:\t%i\nino:\t%lu\n",
+ seq_printf(m, "pos:\t%lli\nflags:\t0%o\nmnt_id:\t%i\nino:\t%lu\nsize:\t%zu\n",
(long long)file->f_pos, f_flags,
real_mount(file->f_path.mnt)->mnt_id,
- file_inode(file)->i_ino);
+ file_inode(file)->i_ino,
+ file_inode(file)->i_size);
+
+ seq_puts(m, "path:\t");
+ seq_file_path(m, file, "\n");
+ seq_putc(m, '\n');
/* show_fd_locks() never deferences files so a stale value is safe */
show_fd_locks(m, file, files);
base-commit: b015dcd62b86d298829990f8261d5d154b8d7af5
--
2.36.1.124.g0e6072fb45-goog