On Mon, May 06, 2024 at 04:01:42PM +0200, Hans de Goede wrote:
> Hi Sima,
>
> On 5/6/24 3:38 PM, Daniel Vetter wrote:
> > On Mon, May 06, 2024 at 02:05:12PM +0200, Maxime Ripard wrote:
> >> Hi,
> >>
> >> On Mon, May 06, 2024 at 01:49:17PM GMT, Hans de Goede wrote:
> >>> Hi dma-buf maintainers, et.al.,
> >>>
> >>> Various people have been working on making complex/MIPI cameras work OOTB
> >>> with mainline Linux kernels and an opensource userspace stack.
> >>>
> >>> The generic solution adds a software ISP (for Debayering and 3A) to
> >>> libcamera. Libcamera's API guarantees that buffers handed to applications
> >>> using it are dma-bufs so that these can be passed to e.g. a video encoder.
> >>>
> >>> In order to meet this API guarantee the libcamera software ISP allocates
> >>> dma-bufs from userspace through one of the /dev/dma_heap/* heaps. For
> >>> the Fedora COPR repo for the PoC of this:
> >>> https://hansdegoede.dreamwidth.org/28153.html
> >>
> >> For the record, we're also considering using them for ARM KMS devices,
> >> so it would be better if the solution wasn't only considering v4l2
> >> devices.
> >>
> >>> I have added a simple udev rule to give physically present users access
> >>> to the dma_heap-s:
> >>>
> >>> KERNEL=="system", SUBSYSTEM=="dma_heap", TAG+="uaccess"
> >>>
> >>> (and on Rasperry Pi devices any users in the video group get access)
> >>>
> >>> This was just a quick fix for the PoC. Now that we are ready to move out
> >>> of the PoC phase and start actually integrating this into distributions
> >>> the question becomes if this is an acceptable solution; or if we need some
> >>> other way to deal with this ?
> >>>
> >>> Specifically the question is if this will have any negative security
> >>> implications? I can certainly see this being used to do some sort of
> >>> denial of service attack on the system (1). This is especially true for
> >>> the cma heap which generally speaking is a limited resource.
> >>
> >> There's plenty of other ways to exhaust CMA, like allocating too much
> >> KMS or v4l2 buffers. I'm not sure we should consider dma-heaps
> >> differently than those if it's part of our threat model.
> >
> > So generally for an arm soc where your display needs cma, your render node
> > doesn't. And user applications only have access to the later, while only
> > the compositor gets a kms fd through logind. At least in drm aside from
> > vc4 there's really no render driver that just gives you access to cma and
> > allows you to exhaust that, you need to be a compositor with drm master
> > access to the display.
> >
> > Which means we're mostly protected against bad applications, and that's
> > not a threat the "user physically sits in front of the machine accounts
> > for", and which giving cma access to everyone would open up. And with
> > flathub/snaps/... this is very much an issue.
>
> I agree that bad applications are an issue, but not for the flathub / snaps
> case. Flatpacks / snaps run sandboxed and don't have access to a full /dev
> so those should not be able to open /dev/dma_heap/* independent of
> the ACLs on /dev/dma_heap/*. The plan is for cameras using the
> libcamera software ISP to always be accessed through pipewire and
> the camera portal, so in this case pipewere is taking the place of
> the compositor in your kms vs render node example.
Yeah essentially if you clarify to "set the permissions such that pipewire
can do allocations", then I think that makes sense. And is at the same
level as e.g. drm kms giving compsitors (but _only_ compositors) special
access rights.
> So this reduces the problem to bad apps packaged by regular distributions
> and if any of those misbehave the distros should fix that.
>
> So I think that for the denial of service side allowing physical
> present users (but not sandboxed apps running as those users) to
> access /dev/dma_heap/* should be ok.
>
> My bigger worry is if dma_heap (u)dma-bufs can be abused in other
> ways then causing a denial of service.
>
> I guess that the answer there is that causing other security issues
> should not be possible ?
Well pinned memory exhaustion is a very useful tool to make all kinds of
other kernel issues exploitable. Like if you have that you can weaponize
all kinds of kmalloc error paths (and since it's untracked memory the oom
killer will not get you of these issuees).
I think for the pipewire based desktop it'd be best if you only allow
pipewire to get at an fd for allocating from dma-heaps, kinda like logind
furnishes the kms master fd ... Still has the issue that you can't nuke
these buffers, but that's for another day. But at least from a "limit
attack surface" design pov I think this would be better than just handing
out access to the current user outright. But that's also not the worst
option I guess, as long as snaps/flatpacks only go through the pipewire
service.
-Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Hi Nicolas,
On Wed, May 15, 2024 at 01:43:58PM -0400, nicolas.dufresne(a)collabora.corp-partner.google.com wrote:
> Le mardi 14 mai 2024 à 23:42 +0300, Laurent Pinchart a écrit :
> > > You'll hit the same limitation as we hit in GStreamer, which is that KMS driver
> > > only offer allocation for render buffers and most of them are missing allocators
> > > for YUV buffers, even though they can import in these formats. (kms allocators,
> > > except dumb, which has other issues, are format aware).
> >
> > My experience on Arm platforms is that the KMS drivers offer allocation
> > for scanout buffers, not render buffers, and mostly using the dumb
> > allocator API. If the KMS device can scan out YUV natively, YUV buffer
> > allocation should be supported. Am I missing something here ?
>
> There is two APIs, Dumb is the legacy allocation API, only used by display
Is it legacy only ? I understand the dumb buffers API to be officially
supported, to allocate scanout buffers suitable for software rendering.
> drivers indeed, and the API does not include a pixel format or a modifier. The
> allocation of YUV buffer has been made through a small hack,
>
> bpp = number of bits per component (of luma plane if multiple planes)
> width = width
> height = height * X
>
> Where X will vary, "3 / 2" is used for 420 subsampling, "2" for 422 and "3" for
> 444. It is far from idea, requires deep knowledge of each formats in the
> application
I'm not sure I see that as an issue, but our experiences and uses cases
may vary :-)
> and cannot allocate each planes seperatly.
For semi-planar or planar formats, unless I'm mistaken, you can either
allocate a single buffer and use it with appropriate offsets when
constructing your framebuffer (with DRM_IOCTL_MODE_ADDFB2), or allocate
one buffer per plane.
> The second is to use the driver specific allocation API. This is then abstracted
> by GBM. This allows allocating render buffers with notably modifiers and/or use
> cases. But no support for YUV formats or multi-planar formats.
GBM is the way to go for render buffers indeed. It has been designed
with only graphics buffer management use cases in mind, so it's
unfortunately not an option as a generic allocator, at least in its
current form.
--
Regards,
Laurent Pinchart
On Mon, May 13, 2024 at 11:10:00AM -0400, Nicolas Dufresne wrote:
> Le lundi 13 mai 2024 à 11:34 +0300, Laurent Pinchart a écrit :
> > On Mon, May 13, 2024 at 10:29:22AM +0200, Maxime Ripard wrote:
> > > On Wed, May 08, 2024 at 10:36:08AM +0200, Daniel Vetter wrote:
> > > > On Tue, May 07, 2024 at 04:07:39PM -0400, Nicolas Dufresne wrote:
> > > > > Hi,
> > > > >
> > > > > Le mardi 07 mai 2024 à 21:36 +0300, Laurent Pinchart a écrit :
> > > > > > Shorter term, we have a problem to solve, and the best option we have
> > > > > > found so far is to rely on dma-buf heaps as a backend for the frame
> > > > > > buffer allocatro helper in libcamera for the use case described above.
> > > > > > This won't work in 100% of the cases, clearly. It's a stop-gap measure
> > > > > > until we can do better.
> > > > >
> > > > > Considering the security concerned raised on this thread with dmabuf heap
> > > > > allocation not be restricted by quotas, you'd get what you want quickly with
> > > > > memfd + udmabuf instead (which is accounted already).
> > > > >
> > > > > It was raised that distro don't enable udmabuf, but as stated there by Hans, in
> > > > > any cases distro needs to take action to make the softISP works. This
> > > > > alternative is easy and does not interfere in anyway with your future plan or
> > > > > the libcamera API. You could even have both dmabuf heap (for Raspbian) and the
> > > > > safer memfd+udmabuf for the distro with security concerns.
> > > > >
> > > > > And for the long term plan, we can certainly get closer by fixing that issue
> > > > > with accounting. This issue also applied to v4l2 io-ops, so it would be nice to
> > > > > find common set of helpers to fix these exporters.
> > > >
> > > > Yeah if this is just for softisp, then memfd + udmabuf is also what I was
> > > > about to suggest. Not just as a stopgap, but as the real official thing.
> > > >
> > > > udmabuf does kinda allow you to pin memory, but we can easily fix that by
> > > > adding the right accounting and then either let mlock rlimits or cgroups
> > > > kernel memory limits enforce good behavior.
> > >
> > > I think the main drawback with memfd is that it'll be broken for devices
> > > without an IOMMU, and while you said that it's uncommon for GPUs, it's
> > > definitely not for codecs and display engines.
> >
> > If the application wants to share buffers between the camera and a
> > display engine or codec, it should arguably not use the libcamera
> > FrameBufferAllocator, but allocate the buffers from the display or the
> > encoder. memfd wouldn't be used in that case.
> >
> > We need to eat our own dogfood though. If we want to push the
> > responsibility for buffer allocation in the buffer sharing case to the
> > application, we need to modify the cam application to do so when using
> > the KMS backend.
>
> Agreed, and the new dmabuf feedback on wayland can also be used on top of this.
>
> You'll hit the same limitation as we hit in GStreamer, which is that KMS driver
> only offer allocation for render buffers and most of them are missing allocators
> for YUV buffers, even though they can import in these formats. (kms allocators,
> except dumb, which has other issues, are format aware).
My experience on Arm platforms is that the KMS drivers offer allocation
for scanout buffers, not render buffers, and mostly using the dumb
allocator API. If the KMS device can scan out YUV natively, YUV buffer
allocation should be supported. Am I missing something here ?
--
Regards,
Laurent Pinchart
The purpose of this patchset is for MediaTek secure video playback, and
also to enable other potential uses of this in the future. The 'restricted
dma-heap' will be used to allocate dma_buf objects that reference memory
in the secure world that is inaccessible/unmappable by the non-secure
(i.e. kernel/userspace) world. That memory will be used by the secure/
trusted world to store secure information (i.e. decrypted media content).
The dma_bufs allocated from the kernel will be passed to V4L2 for video
decoding (as input and output). They will also be used by the drm
system for rendering of the content.
This patchset adds two MediaTek restricted heaps and they will be used in
v4l2[1] and drm[2].
1) restricted_mtk_cm: secure chunk memory for MediaTek SVP (Secure Video
Path). The buffer is reserved for the secure world after bootup and it
is used for vcodec's ES/working buffer;
2) restricted_mtk_cma: secure CMA memory for MediaTek SVP. This buffer is
dynamically reserved for the secure world and will be got when we start
playing secure videos. Once the security video playing is complete, the
CMA will be released. This heap is used for the vcodec's frame buffer.
[1] https://lore.kernel.org/linux-mediatek/20231206081538.17056-1-yunfei.dong@m…
[2] https://lore.kernel.org/all/20231223182932.27683-1-jason-jh.lin@mediatek.co…
Change note:
v4: 1) Rename the heap name from "secure" to "restricted". suggested from
Simon/Pekka. There are still several "secure" string in MTK file
since we use ARM platform in which we call this "secure world"/
"secure command".
v3: https://lore.kernel.org/linux-mediatek/20231212024607.3681-1-yong.wu@mediat…
1) Separate the secure heap to a common file(secure_heap.c) and mtk
special file (secure_heap_mtk.c), and put all the tee related code
into our special file.
2) About dt-binding, Add "mediatek," prefix since this is Mediatek TEE
firmware definition.
3) Remove the normal CMA heap which is a draft for qcom.
Rebase on v6.7-rc1.
v2: https://lore.kernel.org/linux-mediatek/20231111111559.8218-1-yong.wu@mediat…
1) Move John's patches into the vcodec patchset since they use the new
dma heap interface directly.
https://lore.kernel.org/linux-mediatek/20231106120423.23364-1-yunfei.dong@m…
2) Reword the dt-binding description.
3) Rename the heap name from mtk_svp to secure_mtk_cm.
This means the current vcodec/DRM upstream code doesn't match this.
4) Add a normal CMA heap. currently it should be a draft version.
5) Regarding the UUID, I still use hard code, but put it in a private
data which allow the others could set their own UUID. What's more, UUID
is necessary for the session with TEE. If we don't have it, we can't
communicate with the TEE, including the get_uuid interface, which tries
to make uuid more generic, not working. If there is other way to make
UUID more general, please free to tell me.
v1: https://lore.kernel.org/linux-mediatek/20230911023038.30649-1-yong.wu@media…
Base on v6.6-rc1.
Yong Wu (7):
dt-bindings: reserved-memory: Add mediatek,dynamic-restricted-region
dma-buf: heaps: Initialize a restricted heap
dma-buf: heaps: restricted_heap: Add private heap ops
dma-buf: heaps: restricted_heap: Add dma_ops
dma-buf: heaps: restricted_heap: Add MediaTek restricted heap and
heap_init
dma-buf: heaps: restricted_heap_mtk: Add TEE memory service call
dma_buf: heaps: restricted_heap_mtk: Add a new CMA heap
.../mediatek,dynamic-restricted-region.yaml | 43 +++
drivers/dma-buf/heaps/Kconfig | 16 +
drivers/dma-buf/heaps/Makefile | 4 +-
drivers/dma-buf/heaps/restricted_heap.c | 237 +++++++++++++
drivers/dma-buf/heaps/restricted_heap.h | 43 +++
drivers/dma-buf/heaps/restricted_heap_mtk.c | 322 ++++++++++++++++++
6 files changed, 664 insertions(+), 1 deletion(-)
create mode 100644 Documentation/devicetree/bindings/reserved-memory/mediatek,dynamic-restricted-region.yaml
create mode 100644 drivers/dma-buf/heaps/restricted_heap.c
create mode 100644 drivers/dma-buf/heaps/restricted_heap.h
create mode 100644 drivers/dma-buf/heaps/restricted_heap_mtk.c
--
2.18.0
On Mon, May 13, 2024 at 11:06:24AM -0400, Nicolas Dufresne wrote:
> Le lundi 13 mai 2024 à 15:51 +0200, Maxime Ripard a écrit :
> > On Mon, May 13, 2024 at 09:42:00AM -0400, Nicolas Dufresne wrote:
> > > Le lundi 13 mai 2024 à 10:29 +0200, Maxime Ripard a écrit :
> > > > On Wed, May 08, 2024 at 10:36:08AM +0200, Daniel Vetter wrote:
> > > > > On Tue, May 07, 2024 at 04:07:39PM -0400, Nicolas Dufresne wrote:
> > > > > > Le mardi 07 mai 2024 à 21:36 +0300, Laurent Pinchart a écrit :
> > > > > > > Shorter term, we have a problem to solve, and the best option we have
> > > > > > > found so far is to rely on dma-buf heaps as a backend for the frame
> > > > > > > buffer allocatro helper in libcamera for the use case described above.
> > > > > > > This won't work in 100% of the cases, clearly. It's a stop-gap measure
> > > > > > > until we can do better.
> > > > > >
> > > > > > Considering the security concerned raised on this thread with dmabuf heap
> > > > > > allocation not be restricted by quotas, you'd get what you want quickly with
> > > > > > memfd + udmabuf instead (which is accounted already).
> > > > > >
> > > > > > It was raised that distro don't enable udmabuf, but as stated there by Hans, in
> > > > > > any cases distro needs to take action to make the softISP works. This
> > > > > > alternative is easy and does not interfere in anyway with your future plan or
> > > > > > the libcamera API. You could even have both dmabuf heap (for Raspbian) and the
> > > > > > safer memfd+udmabuf for the distro with security concerns.
> > > > > >
> > > > > > And for the long term plan, we can certainly get closer by fixing that issue
> > > > > > with accounting. This issue also applied to v4l2 io-ops, so it would be nice to
> > > > > > find common set of helpers to fix these exporters.
> > > > >
> > > > > Yeah if this is just for softisp, then memfd + udmabuf is also what I was
> > > > > about to suggest. Not just as a stopgap, but as the real official thing.
> > > > >
> > > > > udmabuf does kinda allow you to pin memory, but we can easily fix that by
> > > > > adding the right accounting and then either let mlock rlimits or cgroups
> > > > > kernel memory limits enforce good behavior.
> > > >
> > > > I think the main drawback with memfd is that it'll be broken for devices
> > > > without an IOMMU, and while you said that it's uncommon for GPUs, it's
> > > > definitely not for codecs and display engines.
> > >
> > > In the context of libcamera, the allocation and the alignment done to the video
> > > frame is done completely blindly. In that context, there is a lot more then just
> > > the allocation type that can go wrong and will lead to a memory copy. The upside
> > > of memfd, is that the read cache will help speeding up the copies if they are
> > > needed.
> >
> > dma-heaps provide cacheable buffers too...
>
> Yes, and why we have cache hints in V4L2 now. There is no clue that softISP code
> can read to make the right call. The required cache management in undefined
> until all the importer are known. I also don't think heaps currently care to
> adapt the dmabuf sync behaviour based on the different importers, or the
> addition of a new importer. On top of which, there is insufficient information
> on the device to really deduce what is needed.
>
> > > Another important point is that this is only used if the application haven't
> > > provided frames. If your embedded application is non-generic, and you have
> > > permissions to access the right heap, the application can solve your specific
> > > issue. But in the generic Linux space, Linux kernel API are just insufficient
> > > for the "just work" scenario.
> >
> > ... but they also provide semantics around the memory buffers that no
> > other allocation API do. There's at least the mediatek secure playback
> > series and another one that I've started to work on to allocate ECC
> > protected or unprotected buffers that are just the right use case for
> > the heaps, and the target frameworks aren't.
>
> Let's agree we are both off topic now. The libcamera softISP is currently purely
> software, and cannot write to any form of protected memory. As for ECC, I would
> hope this usage will be coded in the application and that this application has
> been authorized to access the appropriate heaps.
>
> And finally, none of this fixes the issue that the heap allocation are not being
> accounted properly and allow of an easy memory DoS. So uaccess should be granted
> with care, meaning that defaulting a "desktop" library to that, means it will
> most of the time not work at all.
I think that issue should be fixed, regardless of whether or not we end
up using dma heaps for libcamera. If we do use them, maybe there will be
a higher incentive for somebody involved in this conversation to tackle
that problem first :-) And maybe, as a result, the rest of the Linux
community will consider with a more open mind usage of dma heaps on
desktop systems.
--
Regards,
Laurent Pinchart
On Mon, May 13, 2024 at 10:29:22AM +0200, Maxime Ripard wrote:
> On Wed, May 08, 2024 at 10:36:08AM +0200, Daniel Vetter wrote:
> > On Tue, May 07, 2024 at 04:07:39PM -0400, Nicolas Dufresne wrote:
> > > Hi,
> > >
> > > Le mardi 07 mai 2024 à 21:36 +0300, Laurent Pinchart a écrit :
> > > > Shorter term, we have a problem to solve, and the best option we have
> > > > found so far is to rely on dma-buf heaps as a backend for the frame
> > > > buffer allocatro helper in libcamera for the use case described above.
> > > > This won't work in 100% of the cases, clearly. It's a stop-gap measure
> > > > until we can do better.
> > >
> > > Considering the security concerned raised on this thread with dmabuf heap
> > > allocation not be restricted by quotas, you'd get what you want quickly with
> > > memfd + udmabuf instead (which is accounted already).
> > >
> > > It was raised that distro don't enable udmabuf, but as stated there by Hans, in
> > > any cases distro needs to take action to make the softISP works. This
> > > alternative is easy and does not interfere in anyway with your future plan or
> > > the libcamera API. You could even have both dmabuf heap (for Raspbian) and the
> > > safer memfd+udmabuf for the distro with security concerns.
> > >
> > > And for the long term plan, we can certainly get closer by fixing that issue
> > > with accounting. This issue also applied to v4l2 io-ops, so it would be nice to
> > > find common set of helpers to fix these exporters.
> >
> > Yeah if this is just for softisp, then memfd + udmabuf is also what I was
> > about to suggest. Not just as a stopgap, but as the real official thing.
> >
> > udmabuf does kinda allow you to pin memory, but we can easily fix that by
> > adding the right accounting and then either let mlock rlimits or cgroups
> > kernel memory limits enforce good behavior.
>
> I think the main drawback with memfd is that it'll be broken for devices
> without an IOMMU, and while you said that it's uncommon for GPUs, it's
> definitely not for codecs and display engines.
If the application wants to share buffers between the camera and a
display engine or codec, it should arguably not use the libcamera
FrameBufferAllocator, but allocate the buffers from the display or the
encoder. memfd wouldn't be used in that case.
We need to eat our own dogfood though. If we want to push the
responsibility for buffer allocation in the buffer sharing case to the
application, we need to modify the cam application to do so when using
the KMS backend.
--
Regards,
Laurent Pinchart
From: Christian Brauner
> Sent: 10 May 2024 11:55
>
> > For the uapi issue you describe below my take would be that we should just
> > try, and hope that everyone's been dutifully using O_CLOEXEC. But maybe
> > I'm biased from the gpu world, where we've been hammering it in that
> > "O_CLOEXEC or bust" mantra since well over a decade. Really the only valid
>
> Oh, we're very much on the same page. All new file descriptor types that
> I've added over the years are O_CLOEXEC by default. IOW, you need to
> remove O_CLOEXEC explicitly (see pidfd as an example). And imho, any new
> fd type that's added should just be O_CLOEXEC by default.
For fd a shell redirect creates you may want so be able to say
'this fd will have O_CLOEXEC set after the next exec'.
Also (possibly) a flag that can't be cleared once set and that
gets kept by dup() etc.
But maybe that is excessive?
I've certainly used:
# ip netns exec ns command 3</sys/class/net
in order to be able to (easily) read status for interfaces in the
default namespace and a specific namespace.
The would be hard if the O_CLOEXEC flag had got set by default.
(Especially without a shell builtin to clear it.)
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Thu, 9 May 2024 at 04:39, Christian Brauner <brauner(a)kernel.org> wrote:
>
> Not worth it without someone explaining in detail why imho. First pass
> should be to try and replace kcmp() in scenarios where it's obviously
> not needed or overkill.
Ack.
> I've added a CLASS(fd_raw) in a preliminary patch since we'll need that
> anyway which means that your comparison patch becomes even simpler imho.
> I've also added a selftest patch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.misc
LGTM.
Maybe worth adding an explicit test for "open same file, but two
separate opens, F_DUPFD_QUERY returns 0? Just to clarify the "it's not
testing the file on the filesystem for equality, but the file pointer
itself".
Linus
On Tue, 7 May 2024 at 18:15, Bryan O'Donoghue
<bryan.odonoghue(a)linaro.org> wrote:
>
> On 07/05/2024 16:09, Dmitry Baryshkov wrote:
> > Ah, I see. Then why do you require the DMA-ble buffer at all? If you are
> > providing data to VPU or DRM, then you should be able to get the buffer
> > from the data-consuming device.
>
> Because we don't necessarily know what the consuming device is, if any.
>
> Could be VPU, could be Zoom/Hangouts via pipewire, could for argument
> sake be GPU or DSP.
>
> Also if we introduce a dependency on another device to allocate the
> output buffers - say always taking the output buffer from the GPU, then
> we've added another dependency which is more difficult to guarantee
> across different arches.
Yes. And it should be expected. It's a consumer who knows the
restrictions on the buffer. As I wrote, Zoom/Hangouts should not
require a DMA buffer at all. Applications should be able to allocate
the buffer out of the generic memory. GPUs might also have different
requirements. Consider GPUs with VRAM. It might be beneficial to
allocate a buffer out of VRAM rather than generic DMA mem.
--
With best wishes
Dmitry
On Tue, May 07, 2024 at 04:07:39PM -0400, Nicolas Dufresne wrote:
> Hi,
>
> Le mardi 07 mai 2024 à 21:36 +0300, Laurent Pinchart a écrit :
> > Shorter term, we have a problem to solve, and the best option we have
> > found so far is to rely on dma-buf heaps as a backend for the frame
> > buffer allocatro helper in libcamera for the use case described above.
> > This won't work in 100% of the cases, clearly. It's a stop-gap measure
> > until we can do better.
>
> Considering the security concerned raised on this thread with dmabuf heap
> allocation not be restricted by quotas, you'd get what you want quickly with
> memfd + udmabuf instead (which is accounted already).
>
> It was raised that distro don't enable udmabuf, but as stated there by Hans, in
> any cases distro needs to take action to make the softISP works. This
> alternative is easy and does not interfere in anyway with your future plan or
> the libcamera API. You could even have both dmabuf heap (for Raspbian) and the
> safer memfd+udmabuf for the distro with security concerns.
>
> And for the long term plan, we can certainly get closer by fixing that issue
> with accounting. This issue also applied to v4l2 io-ops, so it would be nice to
> find common set of helpers to fix these exporters.
Yeah if this is just for softisp, then memfd + udmabuf is also what I was
about to suggest. Not just as a stopgap, but as the real official thing.
udmabuf does kinda allow you to pin memory, but we can easily fix that by
adding the right accounting and then either let mlock rlimits or cgroups
kernel memory limits enforce good behavior.
-Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Sat, 4 May 2024 at 02:37, Christian Brauner <brauner(a)kernel.org> wrote:
>
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -244,13 +244,18 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
> if (!dmabuf || !dmabuf->resv)
> return EPOLLERR;
>
> + if (!get_file_active(&dmabuf->file))
> + return EPOLLERR;
[...]
I *really* don't think anything that touches dma-buf.c can possibly be right.
This is not a dma-buf.c bug.
This is purely an epoll bug.
Lookie here, the fundamental issue is that epoll can call '->poll()'
on a file descriptor that is being closed concurrently.
That means that *ANY* driver that relies on *any* data structure that
is managed by the lifetime of the 'struct file' will have problems.
Look, here's sock_poll():
static __poll_t sock_poll(struct file *file, poll_table *wait)
{
struct socket *sock = file->private_data;
and that first line looks about as innocent as it possibly can, right?
Now, imagine that this is called from 'epoll' concurrently with the
file being closed for the last time (but it just hasn't _quite_
reached eventpoll_release() yet).
Now, imagine that the kernel is built with preemption, and the epoll
thread gets preempted _just_ before it loads 'file->private_data'.
Furthermore, the machine is under heavy load, and it just stays off
its CPU a long time.
Now, during this TOTALLY INNOCENT sock_poll(), in another thread, the
file closing completes, eventpoll_release() finishes, and the
preemption of the poll() thing just takes so long that you go through
an RCU period too, so that the actual file has been released too.
So now that totally innoced file->private_data load in the poll() is
probably going to get random data.
Yes, the file is allocated as SLAB_TYPESAFE_BY_RCU, so it's probably
still a file. Not guaranteed, even the slab will get fully free'd in
some situations. And yes, the above case is impossible to hit in
practice. You have to hit quite the small race window with an
operation that practically never happens in the first place.
But my point is that the fact that the problem with file->f_count
lifetimes happens for that dmabuf driver is not the fault of the
dmabuf code. Not at all.
It is *ENTIRELY* a bug in epoll, and the dmabuf code is probably just
easier to hit because it has a poll() function that does things that
have longer lifetimes than most things, and interacts more directly
with that f_count.
So I really don't understand why Al thinks this is "dmabuf does bad
things with f_count". It damn well does not. dma-buf is the GOOD GUY
here. It's doing things *PROPERLY*. It's taking refcounts like it damn
well should.
The fact that it takes ref-counts on something that the epoll code has
messed up is *NOT* its fault.
Linus
On Wed, May 08, 2024 at 12:08:57PM +0200, Christian Brauner wrote:
> On Mon, May 06, 2024 at 04:29:44PM +0200, Christian König wrote:
> > Am 04.05.24 um 20:20 schrieb Linus Torvalds:
> > > On Sat, 4 May 2024 at 08:32, Linus Torvalds
> > > <torvalds(a)linux-foundation.org> wrote:
> > > > Lookie here, the fundamental issue is that epoll can call '->poll()'
> > > > on a file descriptor that is being closed concurrently.
> > > Thinking some more about this, and replying to myself...
> > >
> > > Actually, I wonder if we could *really* fix this by simply moving the
> > > eventpoll_release() to where it really belongs.
> > >
> > > If we did it in file_close_fd_locked(), it would actually make a
> > > *lot* more sense. Particularly since eventpoll actually uses this:
> > >
> > > struct epoll_filefd {
> > > struct file *file;
> > > int fd;
> > > } __packed;
> > >
> > > ie it doesn't just use the 'struct file *', it uses the 'fd' itself
> > > (for ep_find()).
> > >
> > > (Strictly speaking, it should also have a pointer to the 'struct
> > > files_struct' to make the 'int fd' be meaningful).
> >
> > While I completely agree on this I unfortunately have to ruin the idea.
> >
> > Before we had KCMP some people relied on the strange behavior of eventpoll
> > to compare struct files when the fd is the same.
> >
> > I just recently suggested that solution to somebody at AMD as a workaround
> > when KCMP is disabled because of security hardening and I'm pretty sure I've
> > seen it somewhere else as well.
> >
> > So when we change that it would break (undocumented?) UAPI behavior.
>
> I've worked on that a bit yesterday and I learned new things about epoll
> and ran into some limitations.
>
> Like, what happens if process P1 has a file descriptor registered in an
> epoll instance and now P1 forks and creates P2. So every file that P1
> maintains gets copied into a new file descriptor table for P2. And the
> same file descriptors refer to the same files for both P1 and P2.
So this is pretty similar to any other struct file that has resources
hanging off the struct file instead of the underlying inode. Like drm
chardev files, where all the buffers, gpu contexts and everything else
hangs off the file and there's no other way to get at them (except when
exporting to some explicitly meant-for-sharing file like dma-buf).
If you fork() that it's utter hilarity, which is why absolutely everyone
should set O_CLOEXEC on these. Or EPOLL_CLOEXEC for epoll_create.
For the uapi issue you describe below my take would be that we should just
try, and hope that everyone's been dutifully using O_CLOEXEC. But maybe
I'm biased from the gpu world, where we've been hammering it in that
"O_CLOEXEC or bust" mantra since well over a decade. Really the only valid
use-case is something like systemd handing open files to a service, where
it drops priviledges even well before the exec() call. But we can't switch
around the defaults for any of these special open files with anything more
than just a current seek position as state, since that breaks uapi.
-Sima
>
> So there's two interesting cases here:
>
> (1) P2 explicitly removes the file descriptor from the epoll instance
> via epoll_ctl(EPOLL_CTL_DEL). That removal affects both P1 and P2
> since the <fd, file> pair is only registered once and it isn't
> marked whether it belongs to P1 and P2 fdtable.
>
> So effectively fork()ing with epoll creates a weird shared state
> where removal of file descriptors that were registered before the
> fork() affects both child and parent.
>
> I found that surprising even though I've worked with epoll quite
> extensively in low-level userspace.
>
> (2) P2 doesn't close it's file descriptors. It just exits. Since removal
> of the file descriptor from the epoll instance isn't done during
> close() but during last fput() P1's epoll state remains unaffected
> by P2's sloppy exit because P1 still holds references to all files
> in its fdtable.
>
> (Sidenote, if one ends up adding every more duped-fds into epoll
> instance that one doesn't explicitly close and all of them refer to
> the same file wouldn't one just be allocating new epitems that
> are kept around for a really long time?)
>
> So if the removal of the fd would now be done during close() or during
> exit_files() when we call close_files() and since there's currently no
> way of differentiating whether P1 or P2 own that fd it would mean that
> (2) collapses into (1) and we'd always alter (1)'s epoll state. That
> would be a UAPI break.
>
> So say we record the fdtable to get ownership of that file descriptor so
> P2 doesn't close anything in (2) that really belongs to P1 to fix that
> problem.
>
> But afaict, that would break another possible use-case. Namely, where P1
> creates an epoll instance and registeres fds and then fork()s to create
> P2. Now P1 can exit and P2 takes over the epoll loop of P1. This
> wouldn't work anymore because P1 would deregister all fds it owns in
> that epoll instance during exit. I didn't see an immediate nice way of
> fixing that issue.
>
> But note that taking over an epoll loop from the parent doesn't work
> reliably for some file descriptors. Consider man signalfd(2):
>
> epoll(7) semantics
> If a process adds (via epoll_ctl(2)) a signalfd file descriptor to an epoll(7) instance,
> then epoll_wait(2) returns events only for signals sent to that process. In particular,
> if the process then uses fork(2) to create a child process, then the child will be able
> to read(2) signals that are sent to it using the signalfd file descriptor, but
> epoll_wait(2) will not indicate that the signalfd file descriptor is ready. In this
> scenario, a possible workaround is that after the fork(2), the child process can close
> the signalfd file descriptor that it inherited from the parent process and then create
> another signalfd file descriptor and add it to the epoll instance. Alternatively, the
> parent and the child could delay creating their (separate) signalfd file descriptors and
> adding them to the epoll instance until after the call to fork(2).
>
> So effectively P1 opens a signalfd and registers it in an epoll
> instance. Then it fork()s and creates P2. Now both P1 and P2 call
> epoll_wait(). Since signalfds are always relative to the caller and P1
> did call signalfd_poll() to register the callback only P1 can get
> events. So P2 can't take over signalfds in that epoll loop.
>
> Honestly, the inheritance semantics of epoll across fork() seem pretty
> wonky and it would've been better if an epoll fd inherited across
> would've returned ESTALE or EINVAL or something. And if that inheritance
> of epoll instances would really be a big use-case there'd be some
> explicit way to enable this.
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Am 18.04.24 um 03:33 schrieb zhiguojiang:
> 在 2024/4/15 19:57, Christian König 写道:
>> [Some people who received this message don't often get email from
>> christian.koenig(a)amd.com. Learn why this is important at
>> https://aka.ms/LearnAboutSenderIdentification ]
>>
>> Am 15.04.24 um 12:35 schrieb zhiguojiang:
>>> 在 2024/4/12 14:39, Christian König 写道:
>>>> [Some people who received this message don't often get email from
>>>> christian.koenig(a)amd.com. Learn why this is important at
>>>> https://aka.ms/LearnAboutSenderIdentification ]
>>>>
>>>> Am 12.04.24 um 08:19 schrieb zhiguojiang:
>>>>> [SNIP]
>>>>> -> Here task 2220 do epoll again where internally it will get/put
>>>>> then
>>>>> start to free twice and lead to final crash.
>>>>>
>>>>> Here is the basic flow:
>>>>>
>>>>> 1. Thread A install the dma_buf_fd via dma_buf_export, the fd
>>>>> refcount
>>>>> is 1
>>>>>
>>>>> 2. Thread A add the fd to epoll list via epoll_ctl(EPOLL_CTL_ADD)
>>>>>
>>>>> 3. After use the dma buf, Thread A close the fd, then here fd
>>>>> refcount
>>>>> is 0,
>>>>> and it will run __fput finally to release the file
>>>>
>>>> Stop, that isn't correct.
>>>>
>>>> The fs layer which calls dma_buf_poll() should make sure that the file
>>>> can't go away until the function returns.
>>>>
>>>> Then inside dma_buf_poll() we add another reference to the file while
>>>> installing the callback:
>>>>
>>>> /* Paired with fput in dma_buf_poll_cb */
>>>> get_file(dmabuf->file);
>>> Hi,
>>>
>>> The problem may just occurred here.
>>>
>>> Is it possible file reference count already decreased to 0 and fput
>>> already being in progressing just before calling
>>> get_file(dmabuf->file) in dma_buf_poll()?
>>
>> No, exactly that isn't possible.
>>
>> If a function gets a dma_buf pointer or even more general any reference
>> counted pointer which has already decreased to 0 then that is a major
>> bug in the caller of that function.
>>
>> BTW: It's completely illegal to work around such issues by using
>> file_count() or RCU functions. So when you suggest stuff like that it
>> will immediately face rejection.
>>
>> Regards,
>> Christian.
> Hi,
>
> Thanks for your kind comment.
>
> 'If a function gets a dma_buf pointer or even more general any reference
>
> counted pointer which has already decreased to 0 then that is a major
>
> bug in the caller of that function.'
>
> According to the issue log, we can see the dma buf file close and
> epoll operation running in parallel.
>
> Apparently at the moment caller calls epoll, although another task
> caller already called close on the same fd, but this fd was still in
> progress to close, so fd is still valid, thus no EBADF returned to
> caller.
No, exactly that can't happen either.
As far as I can see the EPOLL holds a reference to the files it
contains. So it is perfectly valid to add the file descriptor to EPOLL
and then close it.
In this case the file won't be closed, but be kept alive by it's
reference in the EPOLL file descriptor.
>
> Then the two task contexts operate on same dma buf fd(one is close,
> another is epoll) in kernel space.
>
> Please kindly help to comment whether above scenario is possible.
>
> If there is some bug in the caller logic, Can u help to point it out? :)
Well what could be is that the EPOLL implementation is broken somehow,
but I have really doubts on that.
As I said before the most likely explanation is that you have a broken
device driver which messes up the DMA-buf file reference count somehow.
Regards,
Christian.
>
> Regards,
> Zhiguo
>>
>>>
>>>>
>>>> This reference is only dropped after the callback is completed in
>>>> dma_buf_poll_cb():
>>>>
>>>> /* Paired with get_file in dma_buf_poll */
>>>> fput(dmabuf->file);
>>>>
>>>> So your explanation for the issue just seems to be incorrect.
>>>>
>>>>>
>>>>> 4. Here Thread A not do epoll_ctl(EPOLL_CTL_DEL) manunally, so it
>>>>> still resides in one epoll_list.
>>>>> Although __fput will call eventpoll_release to remove the file from
>>>>> binded epoll list,
>>>>> but it has small time window where Thread B jumps in.
>>>>
>>>> Well if that is really the case then that would then be a bug in
>>>> epoll_ctl().
>>>>
>>>>>
>>>>> 5. During the small window, Thread B do the poll action for
>>>>> dma_buf_fd, where it will fget/fput for the file,
>>>>> this means the fd refcount will be 0 -> 1 -> 0, and it will call
>>>>> __fput again.
>>>>> This will lead to __fput twice for the same file.
>>>>>
>>>>> 6. So the potenial fix is use get_file_rcu which to check if file
>>>>> refcount already zero which means under free.
>>>>> If so, we just return and no need to do the dma_buf_poll.
>>>>
>>>> Well to say it bluntly as far as I can see this suggestion is
>>>> completely
>>>> nonsense.
>>>>
>>>> When the reference to the file goes away while dma_buf_poll() is
>>>> executed then that's a massive bug in the caller of that function.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Here is the race condition:
>>>>>
>>>>> Thread A Thread B
>>>>> dma_buf_export
>>>>> fd_refcount is 1
>>>>> epoll_ctl(EPOLL_ADD)
>>>>> add dma_buf_fd to epoll list
>>>>> close(dma_buf_fd)
>>>>> fd_refcount is 0
>>>>> __fput
>>>>> dma_buf_poll
>>>>> fget
>>>>> fput
>>>>> fd_refcount is zero again
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>
>>
>
Am 08.05.24 um 10:23 schrieb Christian Brauner:
> On Tue, May 07, 2024 at 07:45:02PM +0200, Christian König wrote:
>> Am 07.05.24 um 18:46 schrieb Linus Torvalds:
>>> On Tue, 7 May 2024 at 04:03, Daniel Vetter <daniel(a)ffwll.ch> wrote:
>>>> It's really annoying that on some distros/builds we don't have that, and
>>>> for gpu driver stack reasons we _really_ need to know whether a fd is the
>>>> same as another, due to some messy uniqueness requirements on buffer
>>>> objects various drivers have.
>>> It's sad that such a simple thing would require two other horrid
>>> models (EPOLL or KCMP).
>>>
>>> There'[s a reason that KCMP is a config option - *some* of that is
>>> horrible code - but the "compare file descriptors for equality" is not
>>> that reason.
>>>
>>> Note that KCMP really is a broken mess. It's also a potential security
>>> hole, even for the simple things, because of how it ends up comparing
>>> kernel pointers (ie it doesn't just say "same file descriptor", it
>>> gives an ordering of them, so you can use KCMP to sort things in
>>> kernel space).
>>>
>>> And yes, it orders them after obfuscating the pointer, but it's still
>>> not something I would consider sane as a baseline interface. It was
>>> designed for checkpoint-restore, it's the wrong thing to use for some
>>> "are these file descriptors the same".
>>>
>>> The same argument goes for using EPOLL for that. Disgusting hack.
>>>
>>> Just what are the requirements for the GPU stack? Is one of the file
>>> descriptors "trusted", IOW, you know what kind it is?
>>>
>>> Because dammit, it's *so* easy to do. You could just add a core DRM
>>> ioctl for it. Literally just
>>>
>>> struct fd f1 = fdget(fd1);
>>> struct fd f2 = fdget(fd2);
>>> int same;
>>>
>>> same = f1.file && f1.file == f2.file;
>>> fdput(fd1);
>>> fdput(fd2);
>>> return same;
>>>
>>> where the only question is if you also woudl want to deal with O_PATH
>>> fd's, in which case the "fdget()" would be "fdget_raw()".
>>>
>>> Honestly, adding some DRM ioctl for this sounds hacky, but it sounds
>>> less hacky than relying on EPOLL or KCMP.
>>>
>>> I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl
>>> too, if this is possibly a more common thing. and not just DRM wants
>>> it.
>>>
>>> Would something like that work for you?
>> Well the generic approach yes, the DRM specific one maybe. IIRC we need to
>> be able to compare both DRM as well as DMA-buf file descriptors.
>>
>> The basic problem userspace tries to solve is that drivers might get the
>> same fd through two different code paths.
>>
>> For example application using OpenGL/Vulkan for rendering and VA-API for
>> video decoding/encoding at the same time.
>>
>> Both APIs get a fd which identifies the device to use. It can be the same,
>> but it doesn't have to.
>>
>> If it's the same device driver connection (or in kernel speak underlying
>> struct file) then you can optimize away importing and exporting of buffers
>> for example.
>>
>> Additional to that it makes cgroup accounting much easier because you don't
>> count things twice because they are shared etc...
> One thing to keep in mind is that a generic VFS level comparing function
> will only catch the obvious case where you have dup() equivalency as
> outlined above by Linus. That's what most people are interested in and
> that could easily replace most kcmp() use-cases for comparing fds.
>
> But, of course there's the case where you have two file descriptors
> referring to two different files that reference the same underlying
> object (usually stashed in file->private_data).
>
> For most cases that problem can ofc be solved by comparing the
> underlying inode. But that doesn't work for drivers using the generic
> anonymous inode infrastructure because it uses the same inode for
> everything or for cases where the same underlying object can even be
> represented by different inodes.
>
> So for such cases a driver specific ioctl() to compare two fds will
> be needed in addition to the generic helper.
At least for the DRM we already have some solution for that, see
drmGetPrimaryDeviceNameFromFd() for an example.
Basically the file->private_data is still something different, but we
use this to figure out if we have two file descriptors (with individual
struct files underneath) pointing to the same hw driver.
This is important if you need to know if just importing/exporting of
DMA-buf handles between the two file descriptors is enough or if you
deal with two different hw devices and need to do stuff like format
conversion etc...
Regards,
Christian.
On Mon, 6 May 2024 at 10:46, Stefan Metzmacher <metze(a)samba.org> wrote:
>
> I think it's a very important detail that epoll does not take
> real references. Otherwise an application level 'close()' on a socket
> would not trigger a tcp disconnect, when an fd is still registered with
> epoll.
Yes, exactly.
epoll() ends up actually wanting the lifetime of the ep item be the
lifetime of the file _descriptor_, not the lifetime of the file
itself.
We approximate that - badly - with epoll not taking a reference on the
file pointer, and then at final fput() it tears things down.
But that has two real issues, and one of them is that "oh, now epoll
has file pointers that are actually dead" that caused this thread.
The other issue is that "approximates" thing, where it means that
duplicating the file pointer (dup*() and fork() end unix socket file
sending etc) will not mean that the epoll ref is also out of sync with
the lifetime of the file descriptor.
That's why I suggested that "clean up epoll references at
file_close_fd() time instead:
https://lore.kernel.org/all/CAHk-=wj6XL9MGCd_nUzRj6SaKeN0TsyTTZDFpGdW34R+zM…
because it would actually really *fix* the lifetime issue of ep items.
In the process, it would make it possible to actually take a f_count
reference at EPOLL_CTL_ADD time, since now the lifetime of the EP
wouldn't be tied to the lifetime of the 'struct file *' pointer, it
would be properly tied to the lifetime of the actual file descriptor
that you are adding.
So it would be a huge conceptual cleanup too.
(Of course - at that point EPOLL_CTL_ADD still doesn't actually _need_
a reference to the file, since the file being open in itself is
already that reference - but the point here being that there would
*be* a reference that the epoll code would effectively have, and you'd
never be in the situation we were in where there would be stale "dead"
file pointers that just haven't gone through the cleanup yet).
But I'd rather not touch the epoll code more than I have to.
Which is why I applied the minimal patch for just "refcount over
vfs_poll()", and am just mentioning my suggestion in the hope that
some eager beaver would like to see how painful it would do to make
the bigger surgery...
Linus
From: Christian Brauner
> Sent: 06 May 2024 09:45
>
> > The fact is, it's not dma-buf that is violating any rules. It's epoll.
>
> I agree that epoll() not taking a reference on the file is at least
> unexpected and contradicts the usual code patterns for the sake of
> performance and that it very likely is the case that most callers of
> f_op->poll() don't know this.
>
> Note, I cleary wrote upthread that I'm ok to do it like you suggested
> but raised two concerns a) there's currently only one instance of
> prolonged @file lifetime in f_op->poll() afaict and b) that there's
> possibly going to be some performance impact on epoll().
>
> So it's at least worth discussing what's more important because epoll()
> is very widely used and it's not that we haven't favored performance
> before.
>
> But you've already said that you aren't concerned with performance on
> epoll() upthread. So afaict then there's really not a lot more to
> discuss other than take the patch and see whether we get any complaints.
Surely there isn't a problem with epoll holding a reference to the file
structure - it isn't really any different to a dup().
'All' that needs to happen is that the 'magic' that makes epoll() remove
files on the last fput happen when the close is done.
I'm sure there are horrid locking issues it that code (separate from
it calling ->poll() after ->release()) eg if you call close() concurrently
with EPOLL_CTL_ADD.
I'm not at all sure it would have mattered if epoll kept the file open.
But it can't do that because it is documented not to.
As well as poll/select holding a reference to all their fd for the duration
of the system call, a successful mmap() holds a reference until the pages
are all unmapped - usually by process exit.
We (dayjob) have code that uses epoll() to monitor large numbers of UDP
sockets. I was doing some tests (trying to) receive RTP (audio) data
concurrently on 10000 sockets with typically one packet every 20ms.
There are 10000 associated RCTP sockets that are usually idle.
A more normal limit would be 1000 RTP sockets.
All the data needs to go into a single (multithreaded) process.
Just getting all the packets queued on the sockets was non-trivial.
epoll is about the only way to actually read the data.
(That needed multiple epoll fd so each thread could process all
the events from one epoll fd then look for another unprocessed fd.)
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Mon, May 06, 2024 at 02:05:12PM +0200, Maxime Ripard wrote:
> Hi,
>
> On Mon, May 06, 2024 at 01:49:17PM GMT, Hans de Goede wrote:
> > Hi dma-buf maintainers, et.al.,
> >
> > Various people have been working on making complex/MIPI cameras work OOTB
> > with mainline Linux kernels and an opensource userspace stack.
> >
> > The generic solution adds a software ISP (for Debayering and 3A) to
> > libcamera. Libcamera's API guarantees that buffers handed to applications
> > using it are dma-bufs so that these can be passed to e.g. a video encoder.
> >
> > In order to meet this API guarantee the libcamera software ISP allocates
> > dma-bufs from userspace through one of the /dev/dma_heap/* heaps. For
> > the Fedora COPR repo for the PoC of this:
> > https://hansdegoede.dreamwidth.org/28153.html
>
> For the record, we're also considering using them for ARM KMS devices,
> so it would be better if the solution wasn't only considering v4l2
> devices.
>
> > I have added a simple udev rule to give physically present users access
> > to the dma_heap-s:
> >
> > KERNEL=="system", SUBSYSTEM=="dma_heap", TAG+="uaccess"
> >
> > (and on Rasperry Pi devices any users in the video group get access)
> >
> > This was just a quick fix for the PoC. Now that we are ready to move out
> > of the PoC phase and start actually integrating this into distributions
> > the question becomes if this is an acceptable solution; or if we need some
> > other way to deal with this ?
> >
> > Specifically the question is if this will have any negative security
> > implications? I can certainly see this being used to do some sort of
> > denial of service attack on the system (1). This is especially true for
> > the cma heap which generally speaking is a limited resource.
>
> There's plenty of other ways to exhaust CMA, like allocating too much
> KMS or v4l2 buffers. I'm not sure we should consider dma-heaps
> differently than those if it's part of our threat model.
So generally for an arm soc where your display needs cma, your render node
doesn't. And user applications only have access to the later, while only
the compositor gets a kms fd through logind. At least in drm aside from
vc4 there's really no render driver that just gives you access to cma and
allows you to exhaust that, you need to be a compositor with drm master
access to the display.
Which means we're mostly protected against bad applications, and that's
not a threat the "user physically sits in front of the machine accounts
for", and which giving cma access to everyone would open up. And with
flathub/snaps/... this is very much an issue.
So you need more, either:
- cgroups limits on dma-buf and dma-buf heaps. This has been bikeshedded
for years and is just not really moving.
- An allocator service which checks whether you're allowed to allocate
these special buffers. Android does that through binder.
Probably also some way to nuke applications that refuse to release buffers
when they're no longer the right application. cgroups is a lot more
convenient for that.
-Sima
> > But devices tagged for uaccess are only opened up to users who are
> > physcially present behind the machine and those can just hit
> > the powerbutton, so I don't believe that any *on purpose* DOS is part of
> > the thread model.
>
> How would that work for headless devices?
>
> Maxime
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Tue, May 07, 2024 at 04:15:05PM +0100, Bryan O'Donoghue wrote:
> On 07/05/2024 16:09, Dmitry Baryshkov wrote:
> > Ah, I see. Then why do you require the DMA-ble buffer at all? If you are
> > providing data to VPU or DRM, then you should be able to get the buffer
> > from the data-consuming device.
>
> Because we don't necessarily know what the consuming device is, if any.
Well ... that's an entirely different issue. And it's unsolved.
Currently the approach is to allocate where the constraints are usually
most severe (like display, if you need that, or the camera module for
input) and then just pray the stack works out without too much copying.
All userspace (whether the generic glue or the userspace driver depends a
bit upon the exact api) does need to have a copy fallback for these
sharing cases, ideally with the copying accelerated by hw.
If you try to solve this by just preemptive allocating everything as cma
buffers, then you'll make the situation substantially worse (because now
you're wasting tons of cma memory where you might not even need it).
And without really solving the problem, since for some gpus that memory
might be unusable (because you cannot scan that out on any discrete gpu,
and sometimes not even on an integrated one).
-Sima
> Could be VPU, could be Zoom/Hangouts via pipewire, could for argument sake
> be GPU or DSP.
>
> Also if we introduce a dependency on another device to allocate the output
> buffers - say always taking the output buffer from the GPU, then we've added
> another dependency which is more difficult to guarantee across different
> arches.
>
> ---
> bod
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Tue, May 07, 2024 at 04:34:24PM +0200, Hans de Goede wrote:
> Hi Dmitry,
>
> On 5/7/24 3:32 PM, Dmitry Baryshkov wrote:
> > On Mon, May 06, 2024 at 01:49:17PM +0200, Hans de Goede wrote:
> >> Hi dma-buf maintainers, et.al.,
> >>
> >> Various people have been working on making complex/MIPI cameras work OOTB
> >> with mainline Linux kernels and an opensource userspace stack.
> >>
> >> The generic solution adds a software ISP (for Debayering and 3A) to
> >> libcamera. Libcamera's API guarantees that buffers handed to applications
> >> using it are dma-bufs so that these can be passed to e.g. a video encoder.
> >>
> >> In order to meet this API guarantee the libcamera software ISP allocates
> >> dma-bufs from userspace through one of the /dev/dma_heap/* heaps. For
> >> the Fedora COPR repo for the PoC of this:
> >> https://hansdegoede.dreamwidth.org/28153.html
> >
> > Is there any reason for allocating DMA buffers for libcamera through
> > /dev/dma_heap/ rather than allocating them via corresponding media
> > device and then giving them away to DRM / VPU / etc?
> >
> > At least this should solve the permission usecase: if the app can open
> > camera device, it can allocate a buffer.
>
> This is with a software ISP, the input buffers with raw bayer data
> come from a v4l2 device, but the output buffers with the processed
> data are purely userspace managed in this case.
Ah, I see. Then why do you require the DMA-ble buffer at all? If you are
providing data to VPU or DRM, then you should be able to get the buffer
from the data-consuming device. If the data is further processed by
a userspace app, then it should not require DMA memory at all.
My main concern is that dma-heaps is both too generic and too limiting
for the generic library implementation.
--
With best wishes
Dmitry
On Mon, May 06, 2024 at 04:01:42PM +0200, Hans de Goede wrote:
> Hi Sima,
>
> On 5/6/24 3:38 PM, Daniel Vetter wrote:
> > On Mon, May 06, 2024 at 02:05:12PM +0200, Maxime Ripard wrote:
> >> Hi,
> >>
> >> On Mon, May 06, 2024 at 01:49:17PM GMT, Hans de Goede wrote:
> >>> Hi dma-buf maintainers, et.al.,
> >>>
> >>> Various people have been working on making complex/MIPI cameras work OOTB
> >>> with mainline Linux kernels and an opensource userspace stack.
> >>>
> >>> The generic solution adds a software ISP (for Debayering and 3A) to
> >>> libcamera. Libcamera's API guarantees that buffers handed to applications
> >>> using it are dma-bufs so that these can be passed to e.g. a video encoder.
> >>>
> >>> In order to meet this API guarantee the libcamera software ISP allocates
> >>> dma-bufs from userspace through one of the /dev/dma_heap/* heaps. For
> >>> the Fedora COPR repo for the PoC of this:
> >>> https://hansdegoede.dreamwidth.org/28153.html
> >>
> >> For the record, we're also considering using them for ARM KMS devices,
> >> so it would be better if the solution wasn't only considering v4l2
> >> devices.
> >>
> >>> I have added a simple udev rule to give physically present users access
> >>> to the dma_heap-s:
> >>>
> >>> KERNEL=="system", SUBSYSTEM=="dma_heap", TAG+="uaccess"
> >>>
> >>> (and on Rasperry Pi devices any users in the video group get access)
> >>>
> >>> This was just a quick fix for the PoC. Now that we are ready to move out
> >>> of the PoC phase and start actually integrating this into distributions
> >>> the question becomes if this is an acceptable solution; or if we need some
> >>> other way to deal with this ?
> >>>
> >>> Specifically the question is if this will have any negative security
> >>> implications? I can certainly see this being used to do some sort of
> >>> denial of service attack on the system (1). This is especially true for
> >>> the cma heap which generally speaking is a limited resource.
> >>
> >> There's plenty of other ways to exhaust CMA, like allocating too much
> >> KMS or v4l2 buffers. I'm not sure we should consider dma-heaps
> >> differently than those if it's part of our threat model.
> >
> > So generally for an arm soc where your display needs cma, your render node
> > doesn't. And user applications only have access to the later, while only
> > the compositor gets a kms fd through logind. At least in drm aside from
> > vc4 there's really no render driver that just gives you access to cma and
> > allows you to exhaust that, you need to be a compositor with drm master
> > access to the display.
> >
> > Which means we're mostly protected against bad applications, and that's
> > not a threat the "user physically sits in front of the machine accounts
> > for", and which giving cma access to everyone would open up. And with
> > flathub/snaps/... this is very much an issue.
>
> I agree that bad applications are an issue, but not for the flathub / snaps
> case. Flatpacks / snaps run sandboxed and don't have access to a full /dev
> so those should not be able to open /dev/dma_heap/* independent of
> the ACLs on /dev/dma_heap/*. The plan is for cameras using the
> libcamera software ISP to always be accessed through pipewire and
> the camera portal, so in this case pipewere is taking the place of
> the compositor in your kms vs render node example.
>
> So this reduces the problem to bad apps packaged by regular distributions
> and if any of those misbehave the distros should fix that.
>
> So I think that for the denial of service side allowing physical
> present users (but not sandboxed apps running as those users) to
> access /dev/dma_heap/* should be ok.
What about an app built by the user? The machines can still be
multi-seat or have multiple users.
> My bigger worry is if dma_heap (u)dma-bufs can be abused in other
> ways then causing a denial of service.
>
> I guess that the answer there is that causing other security issues
> should not be possible ?
>
> Regards,
>
> Hans
>
--
With best wishes
Dmitry
On Mon, May 06, 2024 at 01:49:17PM +0200, Hans de Goede wrote:
> Hi dma-buf maintainers, et.al.,
>
> Various people have been working on making complex/MIPI cameras work OOTB
> with mainline Linux kernels and an opensource userspace stack.
>
> The generic solution adds a software ISP (for Debayering and 3A) to
> libcamera. Libcamera's API guarantees that buffers handed to applications
> using it are dma-bufs so that these can be passed to e.g. a video encoder.
>
> In order to meet this API guarantee the libcamera software ISP allocates
> dma-bufs from userspace through one of the /dev/dma_heap/* heaps. For
> the Fedora COPR repo for the PoC of this:
> https://hansdegoede.dreamwidth.org/28153.html
Is there any reason for allocating DMA buffers for libcamera through
/dev/dma_heap/ rather than allocating them via corresponding media
device and then giving them away to DRM / VPU / etc?
At least this should solve the permission usecase: if the app can open
camera device, it can allocate a buffer.
> I have added a simple udev rule to give physically present users access
> to the dma_heap-s:
>
> KERNEL=="system", SUBSYSTEM=="dma_heap", TAG+="uaccess"
>
> (and on Rasperry Pi devices any users in the video group get access)
>
> This was just a quick fix for the PoC. Now that we are ready to move out
> of the PoC phase and start actually integrating this into distributions
> the question becomes if this is an acceptable solution; or if we need some
> other way to deal with this ?
>
> Specifically the question is if this will have any negative security
> implications? I can certainly see this being used to do some sort of
> denial of service attack on the system (1). This is especially true for
> the cma heap which generally speaking is a limited resource.
>
> But devices tagged for uaccess are only opened up to users who are
> physcially present behind the machine and those can just hit
> the powerbutton, so I don't believe that any *on purpose* DOS is part of
> the thread model. Any accidental DOS would be a userspace stack bug.
>
> Do you foresee any other negative security implications from allowing
> physically present non root users to create (u)dma-bufs ?
>
> Regards,
>
> Hans
>
>
> 1) There are some limits in drivers/dma-buf/udmabuf.c and distributions
> could narrow these.
>
>
--
With best wishes
Dmitry
On Mon, May 06, 2024 at 04:46:54PM +0200, Christian Brauner wrote:
> On Mon, May 06, 2024 at 02:47:23PM +0200, Daniel Vetter wrote:
> > On Sun, May 05, 2024 at 01:53:48PM -0700, Linus Torvalds wrote:
> > > On Sun, 5 May 2024 at 13:30, Al Viro <viro(a)zeniv.linux.org.uk> wrote:
> > > >
> > > > 0. special-cased ->f_count rule for ->poll() is a wart and it's
> > > > better to get rid of it.
> > > >
> > > > 1. fs/eventpoll.c is a steaming pile of shit and I'd be glad to see
> > > > git rm taken to it. Short of that, by all means, let's grab reference
> > > > in there around the call of vfs_poll() (see (0)).
> > >
> > > Agreed on 0/1.
> > >
> > > > 2. having ->poll() instances grab extra references to file passed
> > > > to them is not something that should be encouraged; there's a plenty
> > > > of potential problems, and "caller has it pinned, so we are fine with
> > > > grabbing extra refs" is nowhere near enough to eliminate those.
> > >
> > > So it's not clear why you hate it so much, since those extra
> > > references are totally normal in all the other VFS paths.
> > >
> > > I mean, they are perhaps not the *common* case, but we have a lot of
> > > random get_file() calls sprinkled around in various places when you
> > > end up passing a file descriptor off to some asynchronous operation
> > > thing.
> > >
> > > Yeah, I think most of them tend to be special operations (eg the tty
> > > TIOCCONS ioctl to redirect the console), but it's not like vfs_ioctl()
> > > is *that* different from vfs_poll. Different operation, not somehow
> > > "one is more special than the other".
> > >
> > > cachefiles and backing-file does it for regular IO, and drop it at IO
> > > completion - not that different from what dma-buf does. It's in
> > > ->read_iter() rather than ->poll(), but again: different operations,
> > > but not "one of them is somehow fundamentally different".
> > >
> > > > 3. dma-buf uses of get_file() are probably safe (epoll shite aside),
> > > > but they do look fishy. That has nothing to do with epoll.
> > >
> > > Now, what dma-buf basically seems to do is to avoid ref-counting its
> > > own fundamental data structure, and replaces that by refcounting the
> > > 'struct file' that *points* to it instead.
> > >
> > > And it is a bit odd, but it actually makes some amount of sense,
> > > because then what it passes around is that file pointer (and it allows
> > > passing it around from user space *as* that file).
> > >
> > > And honestly, if you look at why it then needs to add its refcount to
> > > it all, it actually makes sense. dma-bufs have this notion of
> > > "fences" that are basically completion points for the asynchronous
> > > DMA. Doing a "poll()" operation will add a note to the fence to get
> > > that wakeup when it's done.
> > >
> > > And yes, logically it takes a ref to the "struct dma_buf", but because
> > > of how the lifetime of the dma_buf is associated with the lifetime of
> > > the 'struct file', that then turns into taking a ref on the file.
> > >
> > > Unusual? Yes. But not illogical. Not obviously broken. Tying the
> > > lifetime of the dma_buf to the lifetime of a file that is passed along
> > > makes _sense_ for that use.
> > >
> > > I'm sure dma-bufs could add another level of refcounting on the
> > > 'struct dma_buf' itself, and not make it be 1:1 with the file, but
> > > it's not clear to me what the advantage would really be, or why it
> > > would be wrong to re-use a refcount that is already there.
> >
> > So there is generally another refcount, because dma_buf is just the
> > cross-driver interface to some kind of real underlying buffer object from
> > the various graphics related subsystems we have.
> >
> > And since it's a pure file based api thing that ceases to serve any
> > function once the fd/file is gone we tied all the dma_buf refcounting to
> > the refcount struct file already maintains. But the underlying buffer
> > object can easily outlive the dma_buf, and over the lifetime of an
> > underlying buffer object you might actually end up creating different
> > dma_buf api wrappers for it (but at least in drm we guarantee there's at
> > most one, hence why vmwgfx does the atomic_inc_unless_zero trick, which I
> > don't particularly like and isn't really needed).
> >
> > But we could add another refcount, it just means we have 3 of those then
> > when only really 2 are needed.
>
> Fwiw, the TTM thing described upthread and in the other thread really
> tries hard to work around the dma_buf == file lifetime choice by hooking
> into the dma-buf specific release function so it can access the dmabuf
> and then the file. All that seems like a pretty error prone thing to me.
> So a separate refcount for dma_buf wouldn't be the worst as that would
> allow that TTM thing to benefit and remove that nasty hacking into your
> generic dma_buf ops. But maybe I'm the only one who sees it that way and
> I'm certainly not familiar enough with dma-buf.
So the tricky part is the uniqueness requirement drm has for buffer
objects (and hence dma_buf wrappers), which together with the refcounting
makes dma_buf quite tricky:
- dma_buf needs to hold some reference onto the underlying object, or it
wont work
- but you're not allowed to just create a new dma_buf every time someone
exports an underlying object to a dma_buf, because that would break the
uniqueness requirement. Which means the underlying object must also hold
some kind of reference to its dma_buf, if it exists. So that on buffer
export it can just increment the refcount for that and return it,
instead of creating a new one.
Which would be a reference loop that never gets freed, so you need one of
two tricks:
- Either a weak reference, i.e. just a pointer plus
atomic_inc_unless_zero trickery like ttm does. Splitting that refcount
into more refcounts doesn't fundamentally solve the problem, it just
adds even more refcounts.
- Or you do what all other drm drivers do in drm_prime.c do and careful
clean up the dma_buf re-export cache when the userspace references (but
not all kernel internal ones) disappear, to unbreak that reference loop.
This needs to be done with extreme care and took a lot of screaming to
get right, because if you have a race you might end up breaking the
uniqueness requirement and have two dma_buf floating around.
So neither of these solutions really are simple, but I agree with you that
the atomic_inc_unless_zero trickery is less simple. It's definitely not
cool that it's done by digging around in struct file internals.
-Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Fri, May 03, 2024 at 06:54:22PM +0700, Bui Quang Minh wrote:
> [...]
> Root cause:
> AFAIK, eventpoll (epoll) does not increase the registered file's reference.
> To ensure the safety, when the registered file is deallocated in __fput,
> eventpoll_release is called to unregister the file from epoll. When calling
> poll on epoll, epoll will loop through registered files and call vfs_poll on
> these files. In the file's poll, file is guaranteed to be alive, however, as
> epoll does not increase the registered file's reference, the file may be
> dying
> and it's not safe the get the file for later use. And dma_buf_poll violates
> this. In the dma_buf_poll, it tries to get_file to use in the callback. This
> leads to a race where the dmabuf->file can be fput twice.
>
> Here is the race occurs in the above proof-of-concept
>
> close(dmabuf->file)
> __fput_sync (f_count == 1, last ref)
> f_count-- (f_count == 0 now)
> __fput
> epoll_wait
> vfs_poll(dmabuf->file)
> get_file(dmabuf->file)(f_count == 1)
> eventpoll_release
> dmabuf->file deallocation
> fput(dmabuf->file) (f_count == 1)
> f_count--
> dmabuf->file deallocation
>
> I am not familiar with the dma_buf so I don't know the proper fix for the
> issue. About the rule that don't get the file for later use in poll callback
> of
> file, I wonder if it is there when only select/poll exist or just after
> epoll
> appears.
>
> I hope the analysis helps us to fix the issue.
Thanks for doing this analysis! I suspect at least a start of a fix
would be this:
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 8fe5aa67b167..15e8f74ee0f2 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
if (events & EPOLLOUT) {
/* Paired with fput in dma_buf_poll_cb */
- get_file(dmabuf->file);
-
- if (!dma_buf_poll_add_cb(resv, true, dcb))
+ if (!atomic_long_inc_not_zero(&dmabuf->file) &&
+ !dma_buf_poll_add_cb(resv, true, dcb))
/* No callback queued, wake up any other waiters */
dma_buf_poll_cb(NULL, &dcb->cb);
else
@@ -290,9 +289,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
if (events & EPOLLIN) {
/* Paired with fput in dma_buf_poll_cb */
- get_file(dmabuf->file);
-
- if (!dma_buf_poll_add_cb(resv, false, dcb))
+ if (!atomic_long_inc_not_zero(&dmabuf->file) &&
+ !dma_buf_poll_add_cb(resv, false, dcb))
/* No callback queued, wake up any other waiters */
dma_buf_poll_cb(NULL, &dcb->cb);
else
But this ends up leaving "active" non-zero, and at close time it runs
into:
BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
But the bottom line is that get_file() is unsafe to use in some places,
one of which appears to be in the poll handler. There are maybe some
other fragile places too, like in drivers/gpu/drm/vmwgfx/ttm_object.c:
static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
{
return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L;
}
Which I also note involves a dmabuf...
Due to this issue I've proposed fixing get_file() to detect pathological states:
https://lore.kernel.org/lkml/20240502222252.work.690-kees@kernel.org/
But that has run into some push-back. I'm hoping that seeing this epoll
example will help illustrate what needs fixing a little better.
I think the best current proposal is to just WARN sooner instead of a
full refcount_t implementation:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8dfd53b52744..e09107d0a3d6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1040,7 +1040,8 @@ struct file_handle {
static inline struct file *get_file(struct file *f)
{
- atomic_long_inc(&f->f_count);
+ long prior = atomic_long_fetch_inc_relaxed(&f->f_count);
+ WARN_ONCE(!prior, "struct file::f_count incremented from zero; use-after-free condition present!\n");
return f;
}
What's the right way to deal with the dmabuf situation? (And I suspect
it applies to get_dma_buf_unless_doomed() as well...)
-Kees
--
Kees Cook