On 16.01.25 at 02:46, Zhaoyang Huang wrote:
> On Wed, Jan 15, 2025 at 7:49 PM Christian König
> <christian.koenig(a)amd.com> wrote:
>> On 15.01.25 at 07:18, zhaoyang.huang wrote:
>>> From: Zhaoyang Huang<zhaoyang.huang(a)unisoc.com>
>>>
>>> When using a dma-buf as memory pool for a VMM, vmf_insert_pfn() will
>>> set PTE_SPECIAL on the pte, which makes vm_normal_page() report a bad
>>> pte and return NULL. This commit suggests replacing vmf_insert_pfn()
>>> with vmf_insert_page().
>> Setting PTE_SPECIAL is completely intentional here to prevent
>> get_user_pages() from working on DMA-buf mappings.
> ok. May I ask the reason?
Drivers using this interface own the backing store for their specific
use cases. There are a couple of things get_user_pages(),
pin_user_pages(), direct I/O etc. do which usually clash with those use
cases. So that is intentionally completely disabled.
We have the possibility to create a DMA-buf from memfd object and you
can then do direct I/O to the memfd and still use the DMA-buf with GPUs
or V4L for example.
>> So absolutely clear NAK to this patch here.
>>
>> What exactly are you trying to do?
> I would like pKVM to handle the guest kernel's stage-2 page faults
> (ARM64's memory virtualization mechanism) on a dma-buf, which uses
> pin_user_pages().
Yeah, exactly that's one of the use case which we intentionally prevent
here.
The backing stores these drivers use don't care about the pin count of
the memory and will happily give it back to memory pools and/or swap it
with device-local memory if necessary.
When this happens the ARM VM wouldn't be informed of the change and
potentially accesses the wrong address.
So sorry, but this approach won't work.
You could try with the memfd+DMA-buf approach I mentioned earlier, but
that won't give you all functionality on all DMA-buf supporting devices.
For example GPUs usually can't scan out to a monitor from such buffers
because of hardware limitations.
Regards,
Christian.
>> Regards,
>> Christian.
>>
>>> [ 103.402787] kvm [5276]: gfn(ipa)=0x80000 hva=0x7d4a400000 write_fault=0
>>> [ 103.403822] BUG: Bad page map in process crosvm_vcpu0 pte:168000140000f43 pmd:8000000c1ca0003
>>> [ 103.405144] addr:0000007d4a400000 vm_flags:040400fb anon_vma:0000000000000000 mapping:ffffff8085163df0 index:0
>>> [ 103.406536]file:dmabuf fault:cma_heap_vm_fault [cma_heap] mmap:dma_buf_mmap_internal read_folio:0x0
>>> [ 103.407877] CPU: 3 PID: 5276 Comm: crosvm_vcpu0 Tainted: G W OE 6.6.46-android15-8-g8bab72b63c20-dirty-4k #1 1e474a12dac4553a3ebba3a911f3b744176a5d2d
>>> [ 103.409818] Hardware name: Unisoc UMS9632-base Board (DT)
>>> [ 103.410613] Call trace:
>>> [ 103.411038] dump_backtrace+0xf4/0x140
>>> [ 103.411641] show_stack+0x20/0x30
>>> [ 103.412184] dump_stack_lvl+0x60/0x84
>>> [ 103.412766] dump_stack+0x18/0x24
>>> [ 103.413304] print_bad_pte+0x1b8/0x1cc
>>> [ 103.413909] vm_normal_page+0xc8/0xd0
>>> [ 103.414491] follow_page_pte+0xb0/0x304
>>> [ 103.415096] follow_page_mask+0x108/0x240
>>> [ 103.415721] __get_user_pages+0x168/0x4ac
>>> [ 103.416342] __gup_longterm_locked+0x15c/0x864
>>> [ 103.417023] pin_user_pages+0x70/0xcc
>>> [ 103.417609] pkvm_mem_abort+0xf8/0x5c0
>>> [ 103.418207] kvm_handle_guest_abort+0x3e0/0x3e4
>>> [ 103.418906] handle_exit+0xac/0x33c
>>> [ 103.419472] kvm_arch_vcpu_ioctl_run+0x48c/0x8d8
>>> [ 103.420176] kvm_vcpu_ioctl+0x504/0x5bc
>>> [ 103.420785] __arm64_sys_ioctl+0xb0/0xec
>>> [ 103.421401] invoke_syscall+0x60/0x11c
>>> [ 103.422000] el0_svc_common+0xb4/0xe8
>>> [ 103.422590] do_el0_svc+0x24/0x30
>>> [ 103.423131] el0_svc+0x3c/0x70
>>> [ 103.423640] el0t_64_sync_handler+0x68/0xbc
>>> [ 103.424288] el0t_64_sync+0x1a8/0x1ac
>>>
>>> Signed-off-by: Xiwei Wang<xiwei.wang1(a)unisoc.com>
>>> Signed-off-by: Aijun Sun<aijun.sun(a)unisoc.com>
>>> Signed-off-by: Zhaoyang Huang<zhaoyang.huang(a)unisoc.com>
>>> ---
>>> drivers/dma-buf/heaps/cma_heap.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c
>>> index c384004b918e..b301fb63f16b 100644
>>> --- a/drivers/dma-buf/heaps/cma_heap.c
>>> +++ b/drivers/dma-buf/heaps/cma_heap.c
>>> @@ -168,7 +168,7 @@ static vm_fault_t cma_heap_vm_fault(struct vm_fault *vmf)
>>> if (vmf->pgoff > buffer->pagecount)
>>> return VM_FAULT_SIGBUS;
>>>
>>> - return vmf_insert_pfn(vma, vmf->address, page_to_pfn(buffer->pages[vmf->pgoff]));
>>> + return vmf_insert_page(vma, vmf->address, buffer->pages[vmf->pgoff]);
>>> }
>>>
>>> static const struct vm_operations_struct dma_heap_vm_ops = {
On Wed, Jan 15, 2025 at 09:55:29AM +0100, Simona Vetter wrote:
> I think for 90% of exporters pfn would fit, but there's some really funny
> ones where you cannot get a cpu pfn by design. So we need to keep the
> pfn-less interfaces around. But ideally for the pfn-capable exporters we'd
> have helpers/common code that just implements all the other interfaces.
There is no way to have dma address without a PFN in Linux right now.
How would you generate them? That implies you have an IOMMU that can
generate IOVAs for something that doesn't have a physical address at
all.
Or do you mean some that don't have pages associated with them, and
thus have pfn_valid fail on them? They still have a PFN, just not
one that is valid to use in most of the Linux MM.
On Wed, Jan 15, 2025 at 11:57:05PM +1100, Alexey Kardashevskiy wrote:
> On 15/1/25 00:35, Jason Gunthorpe wrote:
> > On Tue, Jun 18, 2024 at 07:28:43AM +0800, Xu Yilun wrote:
> >
> > > > is needed so the secure world can prepare anything it needs prior to
> > > > starting the VM.
> > >
> > > OK. From Dan's patchset there are some touch points for vendor tsm
> > > drivers to do secure world preparation, e.g. pci_tsm_ops::probe().
> > >
> > > Maybe we could move to Dan's thread for discussion.
> > >
> > > https://lore.kernel.org/linux-coco/173343739517.1074769.1313478654854592548…
> >
> > I think Dan's series is different, any uapi from that series should
> > not be used in the VMM case. We need proper vfio APIs for the VMM to
> > use. I would expect VFIO to be calling some of that infrastructure.
>
> Something like this experiment?
>
> https://github.com/aik/linux/commit/ce052512fb8784e19745d4cb222e23cabc57792e
Yeah, maybe, though I don't know which of vfio/iommufd/kvm should be
hosting those APIs, the above does seem to be a reasonable direction.
When the various fds are closed I would expect the kernel to unbind
and restore the device back.
Jason
On 15.01.25 at 09:55, Simona Vetter wrote:
>>> If we add something
>>> new, we need clear rules and not just "here's the kvm code that uses it".
>>> That's how we've done dma-buf at first, and it was a terrible mess of
>>> mismatched expectations.
>> Yes, that would be wrong. It should be self-defined within dmabuf and
>> kvm should adapt to it, move semantics and all.
> Ack.
>
> I feel like we have a plan here.
I think I have to object a bit to that.
> Summary from my side:
>
> - Sort out pin vs revocable vs dynamic/moveable semantics, make sure
> importers have no surprises.
>
> - Adopt whatever new dma-api datastructures pops out of the dma-api
> reworks.
>
> - Add pfn based memory access as yet another optional access method, with
> helpers so that exporters who support this get all the others for free.
>
> I don't see a strict ordering between these, imo should be driven by
> actual users of the dma-buf api.
>
> Already done:
>
> - dmem cgroup so that we can resource control device pinnings just landed
> in drm-next for next merge window. So that part is imo sorted and we can
> charge ahead with pinning into device memory without all the concerns
> we've had years ago when discussing that for p2p dma-buf support.
>
> But there might be some work so that we can p2p pin without requiring
> dynamic attachments only, I haven't checked whether that needs
> adjustment in dma-buf.c code or just in exporters.
>
> Anything missing?
Well, as far as I can see this use case is not a good fit for the DMA-buf
interfaces in the first place. DMA-buf deals with devices and buffer
exchange.
What's necessary here instead is to give an importing VM full access to
some memory for its specific use case.
That full access includes CPU and DMA mappings, modifying caching
attributes, potentially setting encryption keys for specific ranges,
etc.
In other words we have a lot of things the importer here should be able
to do which we don't want most of the DMA-buf importers to do.
The semantics for things like pin vs revocable vs dynamic/moveable seem
similar, but that's basically it.
As far as I know the TEE subsystem also represents its allocations as
file descriptors. If I'm not completely mistaken, this use case most
likely fits better there.
> I feel like this is small enough that m-l archives is good enough. For
> some of the bigger projects we do in graphics we sometimes create entries
> in our kerneldoc with wip design consensus and things like that. But
> feels like overkill here.
>
>> My general desire is to move all of RDMA's MR process away from
>> scatterlist and work using only the new DMA API. This will save *huge*
>> amounts of memory in common workloads and be the basis for non-struct
>> page DMA support, including P2P.
> Yeah a more memory efficient structure than the scatterlist would be
> really nice. That would even benefit the very special dma-buf exporters
> where you cannot get a pfn and only the dma_addr_t, although most of those
> (maybe even all?) have contig buffers, so your scatterlist has only one
> entry. But it would definitely be nice from a design pov.
Completely agree on that part.
Scatterlists have some design flaws, especially mixing the input and
output parameters of the DMA API into the same structure.
In addition, DMA addresses are missing information about which bus they
belong to and details of how the access should be made (e.g. snoop vs
no-snoop etc.).
> Aside: A way to more efficiently create compressed scatterlists would be
> neat too, because a lot of drivers hand-roll that and it's a bit brittle
> and kinda silly to duplicate. With compressed I mean just a single entry
> for a contig range, in practice thanks to huge pages/folios and allocators
> trying to hand out contig ranges if there's plenty of memory that saves a
> lot of memory too. But currently it's a bit a pain to construct these
> efficiently, mostly it's just a two-pass approach and then trying to free
> surplus memory or krealloc to fit. Also I don't have good ideas here, but
> dma-api folks might have some from looking at too many things that create
> scatterlists.
I mailed with Christoph about that a while back as well and we both
agreed that it would probably be a good idea to start defining a data
structure to better encapsulate DMA addresses.
It's just that nobody has had time for that yet and/or I wasn't looped
in on the final discussion about it.
Regards,
Christian.
> -Sima
On Tue, Jan 14, 2025 at 03:44:04PM +0100, Simona Vetter wrote:
> E.g. if a compositor gets a dma-buf it assumes that by just binding that
> it will not risk gpu context destruction (unless you're out of memory and
> everything is on fire anyway, and it's ok to die). But if a nasty client
> app supplies a revocable dma-buf, then it can shoot down the higher
> privileged compositor gpu workload with precision. Which is not great, so
> maybe existing dynamic gpu importers should reject revocable dma-buf.
> That's at least what I had in mind as a potential issue.
I see, so it is not that they can't handle a non-present fault; it is
just that the non-present fault effectively turns into a crash of the
context, and you want to avoid the crash. It makes sense to me to
negotiate this as part of the API.
> > This is similar to the structure BIO has, and it composes nicely with
> > a future pin_user_pages() and memfd_pin_folios().
>
> Since you mention pin here, I think that's another aspect of the revocable
> vs dynamic question. Dynamic buffers are expected to sometimes just move
> around for no reason, and importers must be able to cope.
Yes, and we have importers that can tolerate dynamic and those that
can't. Though those that can't tolerate it can often implement revoke.
I view your list as a cascade:
1) Fully pinned can never be changed so long as the attach is present
2) Fully pinned, but can be revoked. Revoked is a fatal condition and
the importer is allowed to experience an error
3) Fully dynamic and always present. Support for move, and
restartable fault, is required
Today in RDMA we ask the exporter if it is 1 or 3 and allow different
things. I've seen the GPU side start to offer 1 more often as it has
significant performance wins.
> For revocable exporters/importers I'd expect that movement is not
> happening, meaning it's pinned until the single terminal revocation. And
> maybe I read the kvm stuff wrong, but it reads more like the latter to me
> when crawling through the pfn code.
kvm should be fully faultable and it should be able to handle moves. It
handles moves today using the mmu notifiers, after all.
kvm would need to interact with the dmabuf reservations on its page
fault path.
iommufd cannot be faultable and it would only support revoke. For VFIO
revoke would not be fully terminal as VFIO can unrevoke too
(sigh). If we make revoke special I'd like to eventually include
unrevoke for this reason.
> Once we have the lifetime rules nailed then there's the other issue of how
> to describe the memory, and my take for that is that once the dma-api has
> a clear answer we'll just blindly adopt that one and done.
This is what I hope, we are not there yet, first Leon's series needs
to get merged then we can start on making the DMA API P2P safe without
any struct page. From there it should be clear what direction things
go in.
DMABUF would return pfns annotated with whatever matches the DMA API,
and the importer would be able to inspect the PFNs to learn
information like their P2Pness, CPU mappability or whatever.
I'm pushing for the extra struct, and Christoph has been thinking
about searching a maple tree on the PFN. We need to see what works best.
> And currently with either dynamic attachments and dma_addr_t or through
> fishing the pfn from the cpu pagetables there's some very clearly defined
> lifetime and locking rules (which kvm might get wrong, I've seen some
> discussions fly by where it wasn't doing a perfect job with reflecting pte
> changes, but that was about access attributes iirc).
Wouldn't surprise me, mmu notifiers are very complex all around. We've
had bugs already where the mm doesn't signal the notifiers at the
right points.
> If we add something
> new, we need clear rules and not just "here's the kvm code that uses it".
> That's how we've done dma-buf at first, and it was a terrible mess of
> mismatched expectations.
Yes, that would be wrong. It should be self-defined within dmabuf and
kvm should adapt to it, move semantics and all.
My general desire is to move all of RDMA's MR process away from
scatterlist and work using only the new DMA API. This will save *huge*
amounts of memory in common workloads and be the basis for non-struct
page DMA support, including P2P.
Jason
On Tue, Jun 18, 2024 at 07:28:43AM +0800, Xu Yilun wrote:
> > is needed so the secure world can prepare anything it needs prior to
> > starting the VM.
>
> OK. From Dan's patchset there are some touch points for vendor tsm
> drivers to do secure world preparation, e.g. pci_tsm_ops::probe().
>
> Maybe we could move to Dan's thread for discussion.
>
> https://lore.kernel.org/linux-coco/173343739517.1074769.1313478654854592548…
I think Dan's series is different, any uapi from that series should
not be used in the VMM case. We need proper vfio APIs for the VMM to
use. I would expect VFIO to be calling some of that infrastructure.
Really, I don't see a clear sense of how this will look yet. AMD
provided some patches along these lines, but I have not seen ARM and
Intel proposals yet, nor do I sense there is alignment.
> > Setting up secure vIOMMU emulation, for instance. I
>
> I think this could be done at VM late bind time.
The vIOMMU needs to be set up before the VM boots
> > secure. This should all be pre-arranged as possible before starting
>
> But our current implementation is not to prepare as much as possible,
> but only what is necessary, so most of the secure work for the vPCI
> function is done at late bind time.
That's fine too, but both options need to be valid.
Jason
On Sat, Jan 11, 2025 at 11:48:06AM +0800, Xu Yilun wrote:
> > > > can be sure what is the correct UAPI. In other words, make the
> > > > VFIO device into a CC device should also prevent mmaping it and so on.
> > >
> > > My idea is prevent mmaping first, then allow VFIO device into CC dev (TDI).
> >
> > I think you need to start the TDI process much earlier. Some arches
> > are going to need work to prepare the TDI before the VM is started.
>
> Could you elaborate more on that? AFAICS Intel & AMD are all good on
> "late bind", but not sure for other architectures.
I'm not sure about this; the topic has been confused a bit, and people
often seem to misunderstand what the full scenario actually is. :\
What I'm talking about here is that you will tell the secure world to
create a vPCI function that has the potential to become a secure "TDI
run" down the road. The VM will decide when it reaches the run state.
This is needed so the secure world can prepare anything it needs prior
to starting the VM. Setting up secure vIOMMU emulation, for instance. I
expect ARM will need this, and I'd be surprised if AMD actually doesn't
in the full scenario with secure viommu.
It should not be a surprise to the secure world after the VM has
started that it suddenly learns about a vPCI function that wants to be
secure. This should all be pre-arranged as far as possible before
starting the VM, even if a lot of steps happen after the VM starts
running (or maybe don't happen at all).
Jason
On Fri, Jan 10, 2025 at 08:34:55PM +0100, Simona Vetter wrote:
> So if I'm getting this right, what you need from a functional pov is a
> dma_buf_tdx_mmap? Because due to tdx restrictions, the normal dma_buf_mmap
> is not going to work I guess?
Don't want something TDX specific!
There is a general desire, and CC is one motivation, but there are
others like performance, to stop using VMAs and mmaps as a way to
exchange memory between two entities. Instead we want to use FDs.
We now have memfd and guestmemfd that are usable with
memfd_pin_folios() - this covers pinnable CPU memory.
And for a long time we had DMABUF which is for all the other wild
stuff, and it supports movable memory too.
So, the normal DMABUF semantics with reservation locking and move
notifiers seem workable to me here. They are broadly similar enough to
the mmu notifier locking that they can serve the same job of updating
page tables.
> Also another thing that's a bit tricky is that kvm kinda has a 3rd dma-buf
> memory model:
> - permanently pinned dma-buf, they never move
> - dynamic dma-buf, they move through ->move_notify and importers can remap
> - revocable dma-buf, which thus far only exist for pci mmio resources
I would like to see the importers be able to discover which one is
going to be used, because we have RDMA cases where we can support 1
and 3 but not 2.
revocable doesn't require page faulting as it is a terminal condition.
> Since we're leaning even more on that 3rd model I'm wondering whether we
> should make it something official. Because the existing dynamic importers
> do very much assume that re-acquiring the memory after move_notify will
> work. But for the revocable use-case the entire point is that it will
> never work.
> I feel like that's a concept we need to make explicit, so that dynamic
> importers can reject such memory if necessary.
It strikes me as strange that HW can do page faulting, so it can
support #2, but it can't handle a non-present fault?
> So yeah there's a bunch of tricky lifetime questions that need to be
> sorted out with proper design I think, and the current "let's just use pfn
> directly" proposal hides them all under the rug.
I don't think these two things are connected. The lifetime model that
KVM needs to work with the EPT, and that VFIO needs for its MMIO,
definitely should be reviewed and evaluated.
But it is completely orthogonal to allowing iommufd and kvm to access
the CPU PFN to use in their mapping flows, instead of the
dma_addr_t.
What I want to get to is a replacement for scatter list in DMABUF that
is an array of arrays, roughly like:
struct memory_chunks {
        struct memory_p2p_provider *provider;
        struct bio_vec addrs[];
};
int (*dmabuf_get_memory)(struct memory_chunks **chunks, size_t *num_chunks);
This can represent all forms of memory: P2P, private, CPU, etc and
would be efficient with the new DMA API.
This is similar to the structure BIO has, and it composes nicely with
a future pin_user_pages() and memfd_pin_folios().
Jason
On Fri, Jan 10, 2025 at 08:24:22PM +0100, Simona Vetter wrote:
> On Thu, Jan 09, 2025 at 01:56:02AM +0800, Xu Yilun wrote:
> > > > > 5) iommufd and kvm are both using CPU addresses without DMA. No
> > > > > exporter mapping is possible
> > > >
> > > > We have customers using both KVM and XEN with DMA-buf, so I can clearly
> > > > confirm that this isn't true.
> > >
> > > Today they are mmaping the dma-buf into a VMA and then using KVM's
> > > follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable
> > > dma-buf must have a CPU PFN.
> >
> > Yes, the final target for KVM is still the CPU PFN, just with the help
> > of CPU mapping table.
> >
> > I also found the xen gntdev-dmabuf is calculating pfn from mapped
> > sgt.
>
> See the comment, it's ok because it's a fake device with fake iommu and
> the xen code has special knowledge to peek behind the curtain.
/*
* Now convert sgt to array of gfns without accessing underlying pages.
* It is not allowed to access the underlying struct page of an sg table
* exported by DMA-buf, but since we deal with special Xen dma device here
* (not a normal physical one) look at the dma addresses in the sg table
* and then calculate gfns directly from them.
*/
for_each_sgtable_dma_page(sgt, &sg_iter, 0) {
        dma_addr_t addr = sg_page_iter_dma_address(&sg_iter);
        unsigned long pfn = bfn_to_pfn(XEN_PFN_DOWN(dma_to_phys(dev, addr)));
*barf*
Can we please all agree that is a horrible abuse of the DMA API and
lets not point it as some acceptable "solution"? KVM and iommufd do
not have fake struct devices with fake iommus.
Jason