Linaro-mm-sig January 2025

linaro-mm-sig@lists.linaro.org

13 participants
41 discussions

Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device

by Jason Gunthorpe

On Mon, Jun 24, 2024 at 03:59:53AM +0800, Xu Yilun wrote:

> > But it also seems to me that VFIO should be able to support putting
> > the device into the RUN state
>
> Firstly I think VFIO should support putting device into *LOCKED* state.
> From LOCKED to RUN, there are many evidence fetching and attestation
> things that only guest cares. I don't think VFIO needs to opt-in.

VFIO is not just about running VMs. If someone wants to run DPDK on
VFIO they should be able to get the device into a RUN state and work
with secure memory without requiring a KVM. Yes there are many steps
to this, but we should imagine how it can work.

> > without involving KVM or cVMs.
>
> It may not be feasible for all vendors.

It must be. A CC guest with an in kernel driver can definitely get the
PCI device into RUN, so VFIO running in the guest should be able as
well.

> I believe AMD would have one firmware call that requires cVM handle
> *AND* move device into LOCKED state. It really depends on firmware
> implementation.

IMHO, you would not use the secure firmware if you are not using VMs.

> Yes, the secure EPT is in the secure world and managed by TDX firmware.
> Now a SW Mirror Secure EPT is introduced in KVM and managed by KVM
> directly, and KVM will finally use firmware calls to propagate Mirror
> Secure EPT changes to secure EPT.

If the secure world managed it then the secure world can have rules
that work with the IOMMU as well..

Jason

1 year, 5 months

[PATCH v2] Documentation: dma-buf: heaps: Add heap name definitions

by Maxime Ripard

Following a recent discussion at last Plumbers, John Stultz, Sumit
Semwal, TJ Mercier and I came to an agreement that we should document
what the dma-buf heap names are expected to be, and what buffer
attributes you'll get.

Let's create that doc to make sure those attributes and names are
guaranteed going forward.

Signed-off-by: Maxime Ripard <mripard(a)kernel.org>

---
Changes from v1:
  - Add the mention that the cma / reserved heap is optional.

To: Jonathan Corbet <corbet(a)lwn.net>
To: Sumit Semwal <sumit.semwal(a)linaro.org>
Cc: Benjamin Gaignard <benjamin.gaignard(a)collabora.com>
Cc: Brian Starkey <Brian.Starkey(a)arm.com>
Cc: John Stultz <jstultz(a)google.com>
Cc: "T.J. Mercier" <tjmercier(a)google.com>
Cc: "Christian König" <christian.koenig(a)amd.com>
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org
Cc: linux-media(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
---
 Documentation/userspace-api/dma-buf-heaps.rst | 76 +++++++++++++++++++
 Documentation/userspace-api/index.rst         |  1 +
 2 files changed, 77 insertions(+)
 create mode 100644 Documentation/userspace-api/dma-buf-heaps.rst

diff --git a/Documentation/userspace-api/dma-buf-heaps.rst b/Documentation/userspace-api/dma-buf-heaps.rst
new file mode 100644
index 000000000000..68be7ddea150
--- /dev/null
+++ b/Documentation/userspace-api/dma-buf-heaps.rst
@@ -0,0 +1,76 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Allocating dma-buf using heaps
+==============================
+
+Dma-buf Heaps are a way for userspace to allocate dma-buf objects. They are
+typically used to allocate buffers from a specific allocation pool, or to share
+buffers across frameworks.
+
+Heaps
+=====
+
+A heap represent a specific allocator. The Linux kernel currently supports the
+following heaps:
+
+ - The ``system`` heap allocates virtually contiguous, cacheable, buffers
+
+ - The ``reserved`` heap allocates physically contiguous, cacheable,
+   buffers. Only present if a CMA region is present. Such a region is
+   usually created either through the kernel commandline through the
+   `cma` parameter, a memory region Device-Tree node with the
+   `linux,cma-default` property set, or through the `CMA_SIZE_MBYTES` or
+   `CMA_SIZE_PERCENTAGE` Kconfig options. Depending on the platform, it
+   might be called differently:
+
+    - Acer Iconia Tab A500: ``linux,cma``
+    - Allwinner sun4i, sun5i and sun7i families: ``default-pool``
+    - Amlogic A1: ``linux,cma``
+    - Amlogic G12A/G12B/SM1: ``linux,cma``
+    - Amlogic GXBB/GXL: ``linux,cma``
+    - ASUS EeePad Transformer TF101: ``linux,cma``
+    - ASUS Google Nexus 7 (Project Bach / ME370TG) E1565: ``linux,cma``
+    - ASUS Google Nexus 7 (Project Nakasi / ME370T) E1565: ``linux,cma``
+    - ASUS Google Nexus 7 (Project Nakasi / ME370T) PM269: ``linux,cma``
+    - Asus Transformer Infinity TF700T: ``linux,cma``
+    - Asus Transformer Pad 3G TF300TG: ``linux,cma``
+    - Asus Transformer Pad TF300T: ``linux,cma``
+    - Asus Transformer Pad TF701T: ``linux,cma``
+    - Asus Transformer Prime TF201: ``linux,cma``
+    - ASUS Vivobook S 15: ``linux,cma``
+    - Cadence KC705: ``linux,cma``
+    - Digi International ConnectCore 6UL: ``linux,cma``
+    - Freescale i.MX8DXL EVK: ``linux,cma``
+    - Freescale TQMa8Xx: ``linux,cma``
+    - Hisilicon Hikey: ``linux,cma``
+    - Lenovo ThinkPad T14s Gen 6: ``linux,cma``
+    - Lenovo ThinkPad X13s: ``linux,cma``
+    - Lenovo Yoga Slim 7x: ``linux,cma``
+    - LG Optimus 4X HD P880: ``linux,cma``
+    - LG Optimus Vu P895: ``linux,cma``
+    - Loongson 2k0500, 2k1000 and 2k2000: ``linux,cma``
+    - Microsoft Romulus: ``linux,cma``
+    - NXP i.MX8ULP EVK: ``linux,cma``
+    - NXP i.MX93 9x9 QSB: ``linux,cma``
+    - NXP i.MX93 11X11 EVK: ``linux,cma``
+    - NXP i.MX93 14X14 EVK: ``linux,cma``
+    - NXP i.MX95 19X19 EVK: ``linux,cma``
+    - Ouya Game Console: ``linux,cma``
+    - Pegatron Chagall: ``linux,cma``
+    - PHYTEC phyCORE-AM62A SOM: ``linux,cma``
+    - PHYTEC phyCORE-i.MX93 SOM: ``linux,cma``
+    - Qualcomm SC8280XP CRD: ``linux,cma``
+    - Qualcomm X1E80100 CRD: ``linux,cma``
+    - Qualcomm X1E80100 QCP: ``linux,cma``
+    - RaspberryPi: ``linux,cma``
+    - Texas Instruments AM62x SK board family: ``linux,cma``
+    - Texas Instruments AM62A7 SK: ``linux,cma``
+    - Toradex Apalis iMX8: ``linux,cma``
+    - TQ-Systems i.MX8MM TQMa8MxML: ``linux,cma``
+    - TQ-Systems i.MX8MN TQMa8MxNL: ``linux,cma``
+    - TQ-Systems i.MX8MPlus TQMa8MPxL: ``linux,cma``
+    - TQ-Systems i.MX8MQ TQMa8MQ: ``linux,cma``
+    - TQ-Systems i.MX93 TQMa93xxLA/TQMa93xxCA SOM: ``linux,cma``
+    - TQ-Systems MBA6ULx Baseboard: ``linux,cma``
+
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 274cc7546efc..4901ce7c6cb7 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -41,10 +41,11 @@ Devices and I/O
 
 .. toctree::
    :maxdepth: 1
 
    accelerators/ocxl
+   dma-buf-heaps
    dma-buf-alloc-exchange
    gpio/index
    iommufd
    media/index
    dcdbas
-- 
2.47.1
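For readers who have not used heaps from userspace, the following is a minimal sketch of how a buffer might be allocated from one of the heaps named in the patch. It assumes heaps are exposed as character devices under /dev/dma_heap/<name> and that the DMA_HEAP_IOCTL_ALLOC ioctl and struct dma_heap_allocation_data from <linux/dma-heap.h> are available. The helper name heap_alloc() is made up for this illustration and is not part of the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

/*
 * Hypothetical helper: allocate len bytes from the heap device at
 * heap_path (for example "/dev/dma_heap/system").
 * Returns a dma-buf file descriptor on success, -1 on any failure.
 */
int heap_alloc(const char *heap_path, size_t len)
{
    struct dma_heap_allocation_data data;
    int heap_fd, ret;

    heap_fd = open(heap_path, O_RDONLY | O_CLOEXEC);
    if (heap_fd < 0)
        return -1;              /* heap missing or not permitted */

    memset(&data, 0, sizeof(data));
    data.len = len;                       /* requested size in bytes */
    data.fd_flags = O_RDWR | O_CLOEXEC;   /* flags for the returned fd */
    /* data.heap_flags stays 0: no heap flags are used in this sketch */

    ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
    close(heap_fd);
    if (ret < 0)
        return -1;

    return (int)data.fd;        /* caller owns the dma-buf fd */
}
```

A caller would typically try the heap name it expects, such as heap_alloc("/dev/dma_heap/system", 4096), and treat -1 as "this heap is not present on this platform", which matches the patch's note that the CMA-backed heap is optional and may be named differently.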

1 year, 5 months

Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device

by Jason Gunthorpe

On Fri, Jan 17, 2025 at 09:57:40AM +0800, Baolu Lu wrote:
> On 1/15/25 21:01, Jason Gunthorpe wrote:
> > On Wed, Jan 15, 2025 at 11:57:05PM +1100, Alexey Kardashevskiy wrote:
> > > On 15/1/25 00:35, Jason Gunthorpe wrote:
> > > > On Tue, Jun 18, 2024 at 07:28:43AM +0800, Xu Yilun wrote:
> > > > >
> > > > > > is needed so the secure world can prepare anything it needs prior to
> > > > > > starting the VM.
> > > > > OK. From Dan's patchset there are some touch point for vendor tsm
> > > > > drivers to do secure world preparation. e.g. pci_tsm_ops::probe().
> > > > >
> > > > > Maybe we could move to Dan's thread for discussion.
> > > > >
> > > > > https://lore.kernel.org/linux-coco/173343739517.1074769.13134786548545925484.stgit@dwillia2-xfh.jf.intel.com/
> > > > I think Dan's series is different, any uapi from that series should
> > > > not be used in the VMM case. We need proper vfio APIs for the VMM to
> > > > use. I would expect VFIO to be calling some of that infrastructure.
> > > Something like this experiment?
> > >
> > > https://github.com/aik/linux/commit/ce052512fb8784e19745d4cb222e23cabc57792e
> > Yeah, maybe, though I don't know which of vfio/iommufd/kvm should be
> > hosting those APIs, the above does seem to be a reasonable direction.
> >
> > When the various fds are closed I would expect the kernel to unbind
> > and restore the device back.
>
> I am curious about the value of tsm binding against an iommufd_vdevice
> instead of the physical iommufd_device.

Interesting question

> It is likely that the kvm pointer should be passed to iommufd during the
> creation of a viommu object.

Yes, I fully expect this

> If my recollection is correct, the arm
> smmu-v3 needs it to obtain the vmid to setup the userspace event queue:

Right now it will use a VMID unrelated to KVM. BTM support on ARM will
require syncing the VMID with KVM.

AMD and Intel may require the KVM for some reason as well.

For CC I'm expecting the KVM fd to be the handle for the cVM, so any
RPCs that want to call into the secure world need the KVM FD to get
the cVM's identifier. Ie a "bind to cVM" RPC will need the PCI
information and the cVM's handle.

From that perspective it does make sense that any cVM related APIs,
like "bind to cVM" would be against the VDEVICE where we have a link
to the VIOMMU which has the KVM. On the iommufd side the VIOMMU is
part of the object hierarchy, but does not necessarily have to force a
vIOMMU to appear in the cVM.

But it also seems to me that VFIO should be able to support putting
the device into the RUN state without involving KVM or cVMs.

> Intel TDX connect implementation also needs a reference to the kvm
> pointer to obtain the secure EPT information. This is crucial because
> the CPU's page table must be shared with the iommu.

I thought kvm folks were NAKing this sharing entirely? Or is the
secure EPT in the secure world and not directly managed by Linux?

AFAIK AMD is going to mirror the iommu page table like today.

ARM, I suspect, will not have an "EPT" under Linux control, so
whatever happens will be hidden in their secure world.

Jason

1 year, 5 months

Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI

by Christian König

On 08.01.25 at 20:22, Xu Yilun wrote:
> On Wed, Jan 08, 2025 at 07:44:54PM +0100, Simona Vetter wrote:
>> On Wed, Jan 08, 2025 at 12:22:27PM -0400, Jason Gunthorpe wrote:
>>> On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
>>>> On 08.01.25 at 15:58, Jason Gunthorpe wrote:
>>>>> I have imagined a staged approach where DMABUF gets a new API that
>>>>> works with the new DMA API to do importer mapping with "P2P source
>>>>> information" and a gradual conversion.
>>>> To make it clear as maintainer of that subsystem I would reject such a step
>>>> with all I have.
>>> This is unexpected, so you want to just leave dmabuf broken? Do you
>>> have any plan to fix it, to fix the misuse of the DMA API, and all
>>> the problems I listed below? This is a big deal, it is causing real
>>> problems today.
>>>
>>> If it is going to be like this I think we will stop trying to use dmabuf
>>> and do something simpler for vfio/kvm/iommufd :(
>> As the gal who helped edit the og dma-buf spec 13 years ago, I think adding
>> pfn isn't a terrible idea. By design, dma-buf is the "everything is
>> optional" interface. And in the beginning, even consistent locking was
>> optional, but we've managed to fix that by now :-/

Well you were also the person who mangled the struct page pointers in
the scatterlist because people were abusing this and getting a bloody
nose :)

>> Where I do agree with Christian is that stuffing pfn support into the
>> dma_buf_attachment interfaces feels a bit much wrong.
> So it could be a dmabuf interface like mmap/vmap()? I was also wondering
> about that. But finally I start to use dma_buf_attachment interface
> because of leveraging existing buffer pin and move_notify.

Exactly that's the point, sharing pfn doesn't work with the pin and
move_notify interfaces because of the MMU notifier approach Sima
mentioned.

>>>> We have already gone down that road and it didn't work at all and
>>>> was a really big pain to pull people back from it.
>>> Nobody has really seriously tried to improve the DMA API before, so I
>>> don't think this is true at all.
>> Aside, I really hope this finally happens!

Sorry my fault. I was not talking about the DMA API, but rather that
people tried to look behind the curtain of DMA-buf backing stores.

In other words all the fun we had with scatterlists and that people
try to modify the struct pages inside of them.

Improving the DMA API is something I really really hope for as well.

>>>>> 3) Importing devices need to know if they are working with PCI P2P
>>>>> addresses during mapping because they need to do things like turn on
>>>>> ATS on their DMA. As for multi-path we have the same hacks inside mlx5
>>>>> today that assume DMABUFs are always P2P because we cannot determine
>>>>> if things are P2P or not after being DMA mapped.
>>>> Why would you need ATS on PCI P2P and not for system memory accesses?
>>> ATS has a significant performance cost. It is mandatory for PCI P2P,
>>> but ideally should be avoided for CPU memory.
>> Huh, I didn't know that. And yeah kinda means we've butchered the pci p2p
>> stuff a bit I guess ...

Huh? Why should ATS be mandatory for PCI P2P?

We have tons of production systems using PCI P2P without ATS. And it's
the first time I hear that.

>>>>> 5) iommufd and kvm are both using CPU addresses without DMA. No
>>>>> exporter mapping is possible
>>>> We have customers using both KVM and XEN with DMA-buf, so I can clearly
>>>> confirm that this isn't true.
>>> Today they are mmaping the dma-buf into a VMA and then using KVM's
>>> follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
>>> dma-buf must have a CPU PFN.
>>>
>>> Here Xu implements basically the same path, except without the VMA
>>> indirection, and it suddenly not OK? Illogical.
>> So the big difference is that for follow_pfn() you need mmu_notifier since
>> the mmap might move around, whereas with pfn smashed into
>> dma_buf_attachment you need dma_resv_lock rules, and the move_notify
>> callback if you go dynamic.
>>
>> So I guess my first question is, which locking rules do you want here for
>> pfn importers?
> follow_pfn() is unwanted for private MMIO, so dma_resv_lock.

As Sima explained you either have follow_pfn() and mmu_notifier or you
have DMA addresses and dma_resv lock / dma_fence.

Just giving out PFNs without some lifetime associated with them is one
of the major problems we faced before and really not something you can
do.

>> If mmu notifiers is fine, then I think the current approach of follow_pfn
>> should be ok. But if you instead dma_resv_lock rules (or the cpu mmap
>> somehow is an issue itself), then I think the clean design is create a new
> cpu mmap() is an issue, this series is aimed to eliminate userspace
> mapping for private MMIO resources.

Why?

>> separate access mechanism just for that. It would be the 5th or so (kernel
>> vmap, userspace mmap, dma_buf_attach and driver private stuff like
>> virtio_dma_buf.c where you access your buffer with a uuid), so really not
>> a big deal.
> OK, will think more about that.

Please note that we have follow_pfn() + mmu_notifier working for
KVM/XEN with MMIO mappings and P2P. And that required exactly zero
DMA-buf changes :)

I don't fully understand your use case, but I think it's quite likely
that we already have that working.

Regards,
Christian.

>
> Thanks,
> Yilun
>
>> And for non-contrived exporters we might be able to implement the other
>> access methods in terms of the pfn method generically, so this wouldn't
>> even be a terrible maintenance burden going forward. And meanwhile all the
>> contrived exporters just keep working as-is.
>>
>> The other part is that cpu mmap is optional, and there's plenty of strange
>> exporters who don't implement. But you can dma map the attachment into
>> plenty devices. This tends to mostly be a thing on SoC devices with some
>> very funky memory. But I guess you don't care about these use-case, so
>> should be ok.
>>
>> I couldn't come up with a good name for these pfn users, maybe
>> dma_buf_pfn_attachment? This does _not_ have a struct device, but maybe
>> some of these new p2p source specifiers (or a list of those which are
>> allowed, no idea how this would need to fit into the new dma api).
>>
>> Cheers, Sima
>> --
>> Simona Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch

1 year, 5 months

Re: [RFC PATCH] driver: dma-buf: use vmf_insert_page for cma_heap_vm_fault

by Christian König

On 16.01.25 at 02:46, Zhaoyang Huang wrote:
> On Wed, Jan 15, 2025 at 7:49 PM Christian König
> <christian.koenig(a)amd.com> wrote:
>> On 15.01.25 at 07:18, zhaoyang.huang wrote:
>>> From: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
>>>
>>> When using dma-buf as memory pool for VMM. The vmf_insert_pfn will
>>> apply PTE_SPECIAL on pte which have vm_normal_page report bad_pte and
>>> return NULL. This commit would like to suggest to replace
>>> vmf_insert_pfn by vmf_insert_page.
>> Setting PTE_SPECIAL is completely intentional here to prevent
>> get_user_pages() from working on DMA-buf mappings.
> ok. May I ask the reason?

Drivers using this interface own the backing store for their specific
use cases. There are a couple of things get_user_pages(),
pin_user_pages(), direct I/O etc.. do which usually clash with those
use cases. So that is intentionally completely disabled.

We have the possibility to create a DMA-buf from memfd object and you
can then do direct I/O to the memfd and still use the DMA-buf with
GPUs or V4L for example.

>> So absolutely clear NAK to this patch here.
>>
>> What exactly are you trying to do?
> I would like to have pkvm have guest kernel be faulted of its second
> stage page fault(ARM64's memory virtualization method) on dma-buf
> which use pin_user_pages.

Yeah, exactly that's one of the use cases which we intentionally
prevent here.

The backing stores drivers use don't care about the pin count of the
memory and happily give it back to memory pools and/or swap it with
device local memory if necessary.

When this happens the ARM VM wouldn't be informed of the change and
potentially accesses the wrong address.

So sorry, but this approach won't work.

You could try with the memfd+DMA-buf approach I mentioned earlier, but
that won't give you all functionality on all DMA-buf supporting
devices.

For example GPUs usually can't scan out to a monitor from such buffers
because of hardware limitations.

Regards,
Christian.

>> Regards,
>> Christian.
>>
>>> [ 103.402787] kvm [5276]: gfn(ipa)=0x80000 hva=0x7d4a400000 write_fault=0
>>> [ 103.403822] BUG: Bad page map in process crosvm_vcpu0 pte:168000140000f43 pmd:8000000c1ca0003
>>> [ 103.405144] addr:0000007d4a400000 vm_flags:040400fb anon_vma:0000000000000000 mapping:ffffff8085163df0 index:0
>>> [ 103.406536] file:dmabuf fault:cma_heap_vm_fault [cma_heap] mmap:dma_buf_mmap_internal read_folio:0x0
>>> [ 103.407877] CPU: 3 PID: 5276 Comm: crosvm_vcpu0 Tainted: G W OE 6.6.46-android15-8-g8bab72b63c20-dirty-4k #1 1e474a12dac4553a3ebba3a911f3b744176a5d2d
>>> [ 103.409818] Hardware name: Unisoc UMS9632-base Board (DT)
>>> [ 103.410613] Call trace:
>>> [ 103.411038] dump_backtrace+0xf4/0x140
>>> [ 103.411641] show_stack+0x20/0x30
>>> [ 103.412184] dump_stack_lvl+0x60/0x84
>>> [ 103.412766] dump_stack+0x18/0x24
>>> [ 103.413304] print_bad_pte+0x1b8/0x1cc
>>> [ 103.413909] vm_normal_page+0xc8/0xd0
>>> [ 103.414491] follow_page_pte+0xb0/0x304
>>> [ 103.415096] follow_page_mask+0x108/0x240
>>> [ 103.415721] __get_user_pages+0x168/0x4ac
>>> [ 103.416342] __gup_longterm_locked+0x15c/0x864
>>> [ 103.417023] pin_user_pages+0x70/0xcc
>>> [ 103.417609] pkvm_mem_abort+0xf8/0x5c0
>>> [ 103.418207] kvm_handle_guest_abort+0x3e0/0x3e4
>>> [ 103.418906] handle_exit+0xac/0x33c
>>> [ 103.419472] kvm_arch_vcpu_ioctl_run+0x48c/0x8d8
>>> [ 103.420176] kvm_vcpu_ioctl+0x504/0x5bc
>>> [ 103.420785] __arm64_sys_ioctl+0xb0/0xec
>>> [ 103.421401] invoke_syscall+0x60/0x11c
>>> [ 103.422000] el0_svc_common+0xb4/0xe8
>>> [ 103.422590] do_el0_svc+0x24/0x30
>>> [ 103.423131] el0_svc+0x3c/0x70
>>> [ 103.423640] el0t_64_sync_handler+0x68/0xbc
>>> [ 103.424288] el0t_64_sync+0x1a8/0x1ac
>>>
>>> Signed-off-by: Xiwei Wang <xiwei.wang1(a)unisoc.com>
>>> Signed-off-by: Aijun Sun <aijun.sun(a)unisoc.com>
>>> Signed-off-by: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
>>> ---
>>>  drivers/dma-buf/heaps/cma_heap.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c
>>> index c384004b918e..b301fb63f16b 100644
>>> --- a/drivers/dma-buf/heaps/cma_heap.c
>>> +++ b/drivers/dma-buf/heaps/cma_heap.c
>>> @@ -168,7 +168,7 @@ static vm_fault_t cma_heap_vm_fault(struct vm_fault *vmf)
>>>      if (vmf->pgoff > buffer->pagecount)
>>>          return VM_FAULT_SIGBUS;
>>>
>>> -    return vmf_insert_pfn(vma, vmf->address, page_to_pfn(buffer->pages[vmf->pgoff]));
>>> +    return vmf_insert_page(vma, vmf->address, buffer->pages[vmf->pgoff]);
>>>  }
>>>
>>>  static const struct vm_operations_struct dma_heap_vm_ops = {

1 year, 5 months

Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI

by Christoph Hellwig

On Wed, Jan 15, 2025 at 09:55:29AM +0100, Simona Vetter wrote:
> I think for 90% of exporters pfn would fit, but there's some really funny
> ones where you cannot get a cpu pfn by design. So we need to keep the
> pfn-less interfaces around. But ideally for the pfn-capable exporters we'd
> have helpers/common code that just implements all the other interfaces.

There is no way to have dma address without a PFN in Linux right now.
How would you generate them? That implies you have an IOMMU that can
generate IOVAs for something that doesn't have a physical address at
all.

Or do you mean some that don't have pages associated with them, and
thus have pfn_valid fail on them? They still have a PFN, just not one
that is valid to use in most of the Linux MM.

1 year, 5 months

Re: [RFC PATCH 08/12] vfio/pci: Create host unaccessible dma-buf for private device

by Jason Gunthorpe

On Wed, Jan 15, 2025 at 11:57:05PM +1100, Alexey Kardashevskiy wrote:
> On 15/1/25 00:35, Jason Gunthorpe wrote:
> > On Tue, Jun 18, 2024 at 07:28:43AM +0800, Xu Yilun wrote:
> >
> > > > is needed so the secure world can prepare anything it needs prior to
> > > > starting the VM.
> > >
> > > OK. From Dan's patchset there are some touch point for vendor tsm
> > > drivers to do secure world preparation. e.g. pci_tsm_ops::probe().
> > >
> > > Maybe we could move to Dan's thread for discussion.
> > >
> > > https://lore.kernel.org/linux-coco/173343739517.1074769.1313478654854592548…
> >
> > I think Dan's series is different, any uapi from that series should
> > not be used in the VMM case. We need proper vfio APIs for the VMM to
> > use. I would expect VFIO to be calling some of that infrastructure.
>
> Something like this experiment?
>
> https://github.com/aik/linux/commit/ce052512fb8784e19745d4cb222e23cabc57792e

Yeah, maybe, though I don't know which of vfio/iommufd/kvm should be
hosting those APIs, the above does seem to be a reasonable direction.

When the various fds are closed I would expect the kernel to unbind
and restore the device back.

Jason

1 year, 5 months

Re: [RFC PATCH] driver: dma-buf: use vmf_insert_page for cma_heap_vm_fault

by Christian König

On 15.01.25 at 07:18, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
>
> When using dma-buf as memory pool for VMM. The vmf_insert_pfn will
> apply PTE_SPECIAL on pte which have vm_normal_page report bad_pte and
> return NULL. This commit would like to suggest to replace
> vmf_insert_pfn by vmf_insert_page.

Setting PTE_SPECIAL is completely intentional here to prevent
get_user_pages() from working on DMA-buf mappings.

So absolutely clear NAK to this patch here.

What exactly are you trying to do?

Regards,
Christian.

>
> [ 103.402787] kvm [5276]: gfn(ipa)=0x80000 hva=0x7d4a400000 write_fault=0
> [ 103.403822] BUG: Bad page map in process crosvm_vcpu0 pte:168000140000f43 pmd:8000000c1ca0003
> [ 103.405144] addr:0000007d4a400000 vm_flags:040400fb anon_vma:0000000000000000 mapping:ffffff8085163df0 index:0
> [ 103.406536] file:dmabuf fault:cma_heap_vm_fault [cma_heap] mmap:dma_buf_mmap_internal read_folio:0x0
> [ 103.407877] CPU: 3 PID: 5276 Comm: crosvm_vcpu0 Tainted: G W OE 6.6.46-android15-8-g8bab72b63c20-dirty-4k #1 1e474a12dac4553a3ebba3a911f3b744176a5d2d
> [ 103.409818] Hardware name: Unisoc UMS9632-base Board (DT)
> [ 103.410613] Call trace:
> [ 103.411038] dump_backtrace+0xf4/0x140
> [ 103.411641] show_stack+0x20/0x30
> [ 103.412184] dump_stack_lvl+0x60/0x84
> [ 103.412766] dump_stack+0x18/0x24
> [ 103.413304] print_bad_pte+0x1b8/0x1cc
> [ 103.413909] vm_normal_page+0xc8/0xd0
> [ 103.414491] follow_page_pte+0xb0/0x304
> [ 103.415096] follow_page_mask+0x108/0x240
> [ 103.415721] __get_user_pages+0x168/0x4ac
> [ 103.416342] __gup_longterm_locked+0x15c/0x864
> [ 103.417023] pin_user_pages+0x70/0xcc
> [ 103.417609] pkvm_mem_abort+0xf8/0x5c0
> [ 103.418207] kvm_handle_guest_abort+0x3e0/0x3e4
> [ 103.418906] handle_exit+0xac/0x33c
> [ 103.419472] kvm_arch_vcpu_ioctl_run+0x48c/0x8d8
> [ 103.420176] kvm_vcpu_ioctl+0x504/0x5bc
> [ 103.420785] __arm64_sys_ioctl+0xb0/0xec
> [ 103.421401] invoke_syscall+0x60/0x11c
> [ 103.422000] el0_svc_common+0xb4/0xe8
> [ 103.422590] do_el0_svc+0x24/0x30
> [ 103.423131] el0_svc+0x3c/0x70
> [ 103.423640] el0t_64_sync_handler+0x68/0xbc
> [ 103.424288] el0t_64_sync+0x1a8/0x1ac
>
> Signed-off-by: Xiwei Wang <xiwei.wang1(a)unisoc.com>
> Signed-off-by: Aijun Sun <aijun.sun(a)unisoc.com>
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
> ---
>  drivers/dma-buf/heaps/cma_heap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c
> index c384004b918e..b301fb63f16b 100644
> --- a/drivers/dma-buf/heaps/cma_heap.c
> +++ b/drivers/dma-buf/heaps/cma_heap.c
> @@ -168,7 +168,7 @@ static vm_fault_t cma_heap_vm_fault(struct vm_fault *vmf)
>      if (vmf->pgoff > buffer->pagecount)
>          return VM_FAULT_SIGBUS;
>
> -    return vmf_insert_pfn(vma, vmf->address, page_to_pfn(buffer->pages[vmf->pgoff]));
> +    return vmf_insert_page(vma, vmf->address, buffer->pages[vmf->pgoff]);
>  }
>
>  static const struct vm_operations_struct dma_heap_vm_ops = {

1 year, 5 months

Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI

by Christian König

On 15.01.25 at 09:55, Simona Vetter wrote:
>>> If we add something
>>> new, we need clear rules and not just "here's the kvm code that uses it".
>>> That's how we've done dma-buf at first, and it was a terrible mess of
>>> mismatched expectations.
>> Yes, that would be wrong. It should be self defined within dmabuf and
>> kvm should adapt to it, move semantics and all.
> Ack.
>
> I feel like we have a plan here.

I think I have to object a bit on that.

> Summary from my side:
>
> - Sort out pin vs revocable vs dynamic/moveable semantics, make sure
>   importers have no surprises.
>
> - Adopt whatever new dma-api datastructures pops out of the dma-api
>   reworks.
>
> - Add pfn based memory access as yet another optional access method, with
>   helpers so that exporters who support this get all the others for free.
>
> I don't see a strict ordering between these, imo should be driven by
> actual users of the dma-buf api.
>
> Already done:
>
> - dmem cgroup so that we can resource control device pinnings just landed
>   in drm-next for next merge window. So that part is imo sorted and we can
>   charge ahead with pinning into device memory without all the concerns
>   we've had years ago when discussing that for p2p dma-buf support.
>
>   But there might be some work so that we can p2p pin without requiring
>   dynamic attachments only, I haven't checked whether that needs
>   adjustment in dma-buf.c code or just in exporters.
>
> Anything missing?

Well as far as I can see this use case is not a good fit for the
DMA-buf interfaces in the first place. DMA-buf deals with devices and
buffer exchange.

What's necessary here instead is to give an importing VM full access
on some memory for their specific use case.

That full access includes CPU and DMA mappings, modifying caching
attributes, potentially setting encryption keys for specific ranges
etc.... etc...

In other words we have a lot of things the importer here should be
able to do which we don't want most of the DMA-buf importers to do.

The semantics for things like pin vs revocable vs dynamic/moveable
seems similar, but that's basically it.

As far as I know the TEE subsystem also represents their allocations
as file descriptors. If I'm not completely mistaken this use case most
likely fits better there.

> I feel like this is small enough that m-l archives is good enough. For
> some of the bigger projects we do in graphics we sometimes create entries
> in our kerneldoc with wip design consensus and things like that. But
> feels like overkill here.
>
>> My general desire is to move all of RDMA's MR process away from
>> scatterlist and work using only the new DMA API. This will save *huge*
>> amounts of memory in common workloads and be the basis for non-struct
>> page DMA support, including P2P.
> Yeah a more memory efficient structure than the scatterlist would be
> really nice. That would even benefit the very special dma-buf exporters
> where you cannot get a pfn and only the dma_addr_t, although most of those
> (all maybe even?) have contig buffers, so your scatterlist has only one
> entry. But it would definitely be nice from a design pov.

Completely agree on that part.

Scatterlists have some design flaws, especially mixing the input and
output parameters of the DMA API into the same structure.

In addition to that, DMA addresses are basically missing which bus
they belong to and details how the access should be made (e.g. snoop
vs no-snoop etc...).

> Aside: A way to more efficiently create compressed scatterlists would be
> neat too, because a lot of drivers hand-roll that and it's a bit brittle
> and kinda silly to duplicate. With compressed I mean just a single entry
> for a contig range, in practice thanks to huge pages/folios and allocators
> trying to hand out contig ranges if there's plenty of memory that saves a
> lot of memory too. But currently it's a bit a pain to construct these
> efficiently, mostly it's just a two-pass approach and then trying to free
> surplus memory or krealloc to fit. Also I don't have good ideas here, but
> dma-api folks might have some from looking at too many things that create
> scatterlists.

I mailed with Christoph about that a while back as well and we both
agreed that it would probably be a good idea to start defining a data
structure to better encapsulate DMA addresses.

It's just that nobody had time for that yet and/or I wasn't looped in
in the final discussion about it.

Regards,
Christian.

> -Sima
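To make the "compressed" idea from the aside above concrete, here is a small userspace sketch of the two-pass pattern: a first call counts how many contiguous ranges a PFN list collapses into, and a second call fills an array of that size. This is only an illustration in plain C. struct pfn_range and coalesce_pfns() are hypothetical names and are not the kernel's scatterlist API or any proposed DMA API structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compressed entry: one record per contiguous PFN run. */
struct pfn_range {
    uint64_t start_pfn;
    uint64_t nr_pages;
};

/* Store one range if an output array was supplied and has room. */
static void emit_range(struct pfn_range *out, size_t max_out, size_t idx,
                       uint64_t start, uint64_t len)
{
    if (out && idx < max_out) {
        out[idx].start_pfn = start;
        out[idx].nr_pages = len;
    }
}

/*
 * Merge runs of consecutive PFNs into ranges.
 * Pass out == NULL to only count (first pass), then call again with an
 * array of the returned size to fill it (second pass).
 * Returns the number of ranges the input collapses into.
 */
size_t coalesce_pfns(const uint64_t *pfns, size_t n,
                     struct pfn_range *out, size_t max_out)
{
    size_t count = 0;
    uint64_t start = 0, len = 0;

    for (size_t i = 0; i < n; i++) {
        if (len > 0 && pfns[i] == start + len) {
            len++;                      /* extends the current run */
            continue;
        }
        if (len > 0)
            emit_range(out, max_out, count++, start, len);
        start = pfns[i];                /* begin a new run */
        len = 1;
    }
    if (len > 0)
        emit_range(out, max_out, count++, start, len);

    return count;
}
```

With the input PFNs 10, 11, 12, 20, 21, 5 the counting pass reports three ranges, and the filling pass produces (10, 3), (20, 2) and (5, 1). The counting pass is what lets a caller size the allocation exactly instead of over-allocating and shrinking afterwards.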

1 year, 5 months

Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI

by Jason Gunthorpe

On Tue, Jan 14, 2025 at 03:44:04PM +0100, Simona Vetter wrote:

> E.g. if a compositor gets a dma-buf it assumes that by just binding that
> it will not risk gpu context destruction (unless you're out of memory and
> everything is on fire anyway, and it's ok to die). But if a nasty client
> app supplies a revocable dma-buf, then it can shoot down the higher
> privileged compositor gpu workload with precision. Which is not great, so
> maybe existing dynamic gpu importers should reject revocable dma-buf.
> That's at least what I had in mind as a potential issue.

I see, so it is not that they can't handle a non-present fault it is
just that the non-present effectively turns into a crash of the
context and you want to avoid the crash. It makes sense to me to
negotiate this as part of the API.

> > This is similar to the structure BIO has, and it composes nicely with
> > a future pin_user_pages() and memfd_pin_folios().
>
> Since you mention pin here, I think that's another aspect of the revocable
> vs dynamic question. Dynamic buffers are expected to sometimes just move
> around for no reason, and importers must be able to cope.

Yes, and we have importers that can tolerate dynamic and those that
can't. Though those that can't tolerate it can often implement revoke.

I view your list as a cascade:
 1) Fully pinned can never be changed so long as the attach is present
 2) Fully pinned, but can be revoked. Revoked is a fatal condition and
    the importer is allowed to experience an error
 3) Fully dynamic and always present. Support for move, and
    restartable fault, is required

Today in RDMA we ask the exporter if it is 1 or 3 and allow different
things. I've seen the GPU side start to offer 1 more often as it has
significant performance wins.

> For revocable exporters/importers I'd expect that movement is not
> happening, meaning it's pinned until the single terminal revocation. And
> maybe I read the kvm stuff wrong, but it reads more like the latter to me
> when crawling through the pfn code.

kvm should be fully faultable and it should be able to handle move. It
handles move today using the mmu notifiers after all.

kvm would need to interact with the dmabuf reservations on its page
fault path.

iommufd cannot be faultable and it would only support revoke. For VFIO
revoke would not be fully terminal as VFIO can unrevoke too
(sigh). If we make revoke special I'd like to eventually include
unrevoke for this reason.

> Once we have the lifetime rules nailed then there's the other issue of how
> to describe the memory, and my take for that is that once the dma-api has
> a clear answer we'll just blindly adopt that one and done.

This is what I hope, we are not there yet, first Leon's series needs
to get merged then we can start on making the DMA API P2P safe without
any struct page. From there it should be clear what direction things
go in.

DMABUF would return pfns annotated with whatever matches the DMA API,
and the importer would be able to inspect the PFNs to learn
information like their P2Pness, CPU mappability or whatever.

I'm pushing for the extra struct, and Christoph has been thinking
about searching a maple tree on the PFN. We need to see what works
best.

> And currently with either dynamic attachments and dma_addr_t or through
> fishing the pfn from the cpu pagetables there's some very clearly defined
> lifetime and locking rules (which kvm might get wrong, I've seen some
> discussions fly by where it wasn't doing a perfect job with reflecting pte
> changes, but that was about access attributes iirc).

Wouldn't surprise me, mmu notifiers are very complex all around. We've
had bugs already where the mm doesn't signal the notifiers at the
right points.

> If we add something
> new, we need clear rules and not just "here's the kvm code that uses it".
> That's how we've done dma-buf at first, and it was a terrible mess of
> mismatched expectations.

Yes, that would be wrong. It should be self defined within dmabuf and
kvm should adapt to it, move semantics and all.

My general desire is to move all of RDMA's MR process away from
scatterlist and work using only the new DMA API. This will save *huge*
amounts of memory in common workloads and be the basis for non-struct
page DMA support, including P2P.

Jason
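The three-level cascade discussed in this message (fully pinned, pinned but revocable, fully dynamic) could be negotiated at attach time roughly as sketched below. This is a toy model in plain C to show the shape of such a check, not an existing dma-buf interface. The enum names and can_attach() are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: what lifetime behaviour the exporter offers. */
enum export_mode {
    EXPORT_PINNED,      /* 1) never changes while attached */
    EXPORT_REVOCABLE,   /* 2) pinned, but may be revoked (fatal for importer) */
    EXPORT_DYNAMIC,     /* 3) may move; importer must handle move and refault */
};

/* Hypothetical: what the importer declares it can cope with. */
enum importer_caps {
    IMPORTER_CAN_REVOKE = 1 << 0,   /* tolerates a terminal revoke */
    IMPORTER_CAN_MOVE   = 1 << 1,   /* tolerates move plus restartable fault */
};

/* Decide whether an importer with the given caps may attach. */
bool can_attach(enum export_mode mode, unsigned int caps)
{
    switch (mode) {
    case EXPORT_PINNED:
        return true;    /* nothing is required from the importer */
    case EXPORT_REVOCABLE:
        return (caps & IMPORTER_CAN_REVOKE) != 0;
    case EXPORT_DYNAMIC:
        return (caps & IMPORTER_CAN_MOVE) != 0;
    }
    return false;
}
```

In this model an iommufd-like importer would advertise only IMPORTER_CAN_REVOKE and be refused a dynamic buffer, a kvm-like importer would advertise both, and a compositor that does not want its context shot down would advertise IMPORTER_CAN_MOVE alone so that revocable buffers are rejected up front. Unrevoke, which the message mentions for VFIO, is left out of the sketch.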

1 year, 5 months
