Am 07.01.25 um 15:27 schrieb Xu Yilun:
Introduce a new API for dma-buf importer, also add a dma_buf_ops callback for dma-buf exporter. This API is for subsystem importers who map the dma-buf to some user defined address space, e.g. for IOMMUFD to map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to map the dma-buf to GPA via KVM MMU (e.g. EPT).
Currently dma-buf is only used to get DMA address for device's default domain by using kernel DMA APIs. But for these new use-cases, importers only need the pfn of the dma-buf resource to build their own mapping tables.
As far as I can see I have to fundamentally reject this whole approach.
It's intentional DMA-buf design that we don't expose struct pages nor PFNs to the importer. Essentially DMA-buf only transports DMA addresses.
In other words the mapping is done by the exporter and *not* the importer.
What we certainly can do is to annotate those DMA addresses to a better specify in which domain they are applicable, e.g. if they are PCIe bus addresses or some inter device bus addresses etc...
But moving the functionality to map the pages/PFNs to DMA addresses into the importer is an absolutely clear NO-GO.
Regards, Christian.
So the map_dma_buf() callback is not mandatory for exporters anymore. Also the importers could choose not to provide struct device *dev on dma_buf_attach() if they don't call dma_buf_map_attachment().
Like dma_buf_map_attachment(), the importer should firstly call dma_buf_attach/dynamic_attach() then call dma_buf_get_pfn_unlocked(). If the importer choose to do dynamic attach, it also should handle the dma-buf move notification.
Only the unlocked version of dma_buf_get_pfn is implemented for now, just because no locked version is used for now.
Signed-off-by: Xu Yilun yilun.xu@linux.intel.com
IIUC, Only get_pfn() is needed but no put_pfn(). The whole dma-buf is de/referenced at dma-buf attach/detach time.
Specifically, for static attachment, the exporter should always make memory resource available/pinned on first dma_buf_attach(), and release/unpin memory resource on last dma_buf_detach(). For dynamic attachment, the exporter could populate & invalidate the memory resource at any time, it's OK as long as the importers follow dma-buf move notification. So no pinning is needed for get_pfn() and no put_pfn() is needed.
drivers/dma-buf/dma-buf.c | 90 +++++++++++++++++++++++++++++++-------- include/linux/dma-buf.h | 13 ++++++ 2 files changed, 86 insertions(+), 17 deletions(-)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 7eeee3a38202..83d1448b6dcc 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -630,10 +630,10 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info) size_t alloc_size = sizeof(struct dma_buf); int ret;
- if (WARN_ON(!exp_info->priv || !exp_info->ops
|| !exp_info->ops->map_dma_buf
|| !exp_info->ops->unmap_dma_buf
|| !exp_info->ops->release))
- if (WARN_ON(!exp_info->priv || !exp_info->ops ||
(!!exp_info->ops->map_dma_buf != !!exp_info->ops->unmap_dma_buf) ||
(!exp_info->ops->map_dma_buf && !exp_info->ops->get_pfn) ||
return ERR_PTR(-EINVAL);!exp_info->ops->release))
if (WARN_ON(exp_info->ops->cache_sgt_mapping && @@ -909,7 +909,10 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, struct dma_buf_attachment *attach; int ret;
- if (WARN_ON(!dmabuf || !dev))
- if (WARN_ON(!dmabuf))
return ERR_PTR(-EINVAL);
- if (WARN_ON(dmabuf->ops->map_dma_buf && !dev)) return ERR_PTR(-EINVAL);
if (WARN_ON(importer_ops && !importer_ops->move_notify)) @@ -941,7 +944,7 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, */ if (dma_buf_attachment_is_dynamic(attach) != dma_buf_is_dynamic(dmabuf)) {
struct sg_table *sgt;
struct sg_table *sgt = NULL;
dma_resv_lock(attach->dmabuf->resv, NULL); if (dma_buf_is_dynamic(attach->dmabuf)) { @@ -950,13 +953,16 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, goto err_unlock; }
sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
if (!sgt)
sgt = ERR_PTR(-ENOMEM);
if (IS_ERR(sgt)) {
ret = PTR_ERR(sgt);
goto err_unpin;
if (dmabuf->ops->map_dma_buf) {
sgt = __map_dma_buf(attach, DMA_BIDIRECTIONAL);
if (!sgt)
sgt = ERR_PTR(-ENOMEM);
if (IS_ERR(sgt)) {
ret = PTR_ERR(sgt);
goto err_unpin;
}}
- dma_resv_unlock(attach->dmabuf->resv); attach->sgt = sgt; attach->dir = DMA_BIDIRECTIONAL;
@@ -1119,7 +1125,8 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach, might_sleep();
- if (WARN_ON(!attach || !attach->dmabuf))
- if (WARN_ON(!attach || !attach->dmabuf ||
return ERR_PTR(-EINVAL);!attach->dmabuf->ops->map_dma_buf))
dma_resv_assert_held(attach->dmabuf->resv); @@ -1195,7 +1202,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach, might_sleep();
- if (WARN_ON(!attach || !attach->dmabuf))
- if (WARN_ON(!attach || !attach->dmabuf ||
return ERR_PTR(-EINVAL);!attach->dmabuf->ops->map_dma_buf))
dma_resv_lock(attach->dmabuf->resv, NULL); @@ -1222,7 +1230,8 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach, { might_sleep();
- if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
- if (WARN_ON(!attach || !attach->dmabuf ||
return;!attach->dmabuf->ops->unmap_dma_buf || !sg_table))
dma_resv_assert_held(attach->dmabuf->resv); @@ -1254,7 +1263,8 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, { might_sleep();
- if (WARN_ON(!attach || !attach->dmabuf || !sg_table))
- if (WARN_ON(!attach || !attach->dmabuf ||
return;!attach->dmabuf->ops->unmap_dma_buf || !sg_table))
dma_resv_lock(attach->dmabuf->resv, NULL); @@ -1263,6 +1273,52 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, } EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF"); +/**
- dma_buf_get_pfn_unlocked -
- @attach: [in] attachment to get pfn from
- @pgoff: [in] page offset of the buffer against the start of dma_buf
- @pfn: [out] returns the pfn of the buffer
- @max_order [out] returns the max mapping order of the buffer
- */
+int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
pgoff_t pgoff, u64 *pfn, int *max_order)
+{
- struct dma_buf *dmabuf = attach->dmabuf;
- int ret;
- if (WARN_ON(!attach || !attach->dmabuf ||
!attach->dmabuf->ops->get_pfn))
return -EINVAL;
- /*
* Open:
*
* When dma_buf is dynamic but dma_buf move is disabled, the buffer
* should be pinned before use, See dma_buf_map_attachment() for
* reference.
*
* But for now no pin is intended inside dma_buf_get_pfn(), otherwise
* need another API to unpin the dma_buf. So just fail out this case.
*/
- if (dma_buf_is_dynamic(attach->dmabuf) &&
!IS_ENABLED(CONFIG_DMABUF_MOVE_NOTIFY))
return -ENOENT;
- dma_resv_lock(attach->dmabuf->resv, NULL);
- ret = dmabuf->ops->get_pfn(attach, pgoff, pfn, max_order);
- /*
* Open:
*
* Is dma_resv_wait_timeout() needed? I assume no. The DMA buffer
* content synchronization could be done when the buffer is to be
* mapped by importer.
*/
- dma_resv_unlock(attach->dmabuf->resv);
- return ret;
+} +EXPORT_SYMBOL_NS_GPL(dma_buf_get_pfn_unlocked, "DMA_BUF");
- /**
- dma_buf_move_notify - notify attachments that DMA-buf is moving
@@ -1662,7 +1718,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused) attach_count = 0; list_for_each_entry(attach_obj, &buf_obj->attachments, node) {
seq_printf(s, "\t%s\n", dev_name(attach_obj->dev));
} dma_resv_unlock(buf_obj->resv);seq_printf(s, "\t%s\n", attach_obj->dev ? dev_name(attach_obj->dev) : NULL); attach_count++;
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 36216d28d8bd..b16183edfb3a 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -194,6 +194,17 @@ struct dma_buf_ops { * if the call would block. */
- /**
* @get_pfn:
*
* This is called by dma_buf_get_pfn(). It is used to get the pfn
* of the buffer positioned by the page offset against the start of
* the dma_buf. It can only be called if @attach has been called
* successfully.
*/
- int (*get_pfn)(struct dma_buf_attachment *attach, pgoff_t pgoff,
u64 *pfn, int *max_order);
- /**
- @release:
@@ -629,6 +640,8 @@ dma_buf_map_attachment_unlocked(struct dma_buf_attachment *attach, void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach, struct sg_table *sg_table, enum dma_data_direction direction); +int dma_buf_get_pfn_unlocked(struct dma_buf_attachment *attach,
pgoff_t pgoff, u64 *pfn, int *max_order);
int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *, unsigned long);
On Wed, Jan 08, 2025 at 09:01:46AM +0100, Christian König wrote:
Am 07.01.25 um 15:27 schrieb Xu Yilun:
Introduce a new API for dma-buf importer, also add a dma_buf_ops callback for dma-buf exporter. This API is for subsystem importers who map the dma-buf to some user defined address space, e.g. for IOMMUFD to map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to map the dma-buf to GPA via KVM MMU (e.g. EPT).
Currently dma-buf is only used to get DMA address for device's default domain by using kernel DMA APIs. But for these new use-cases, importers only need the pfn of the dma-buf resource to build their own mapping tables.
As far as I can see I have to fundamentally reject this whole approach.
It's intentional DMA-buf design that we don't expose struct pages nor PFNs to the importer. Essentially DMA-buf only transports DMA addresses.
In other words the mapping is done by the exporter and *not* the importer.
What we certainly can do is to annotate those DMA addresses to a better specify in which domain they are applicable, e.g. if they are PCIe bus addresses or some inter device bus addresses etc...
But moving the functionality to map the pages/PFNs to DMA addresses into the importer is an absolutely clear NO-GO.
Oh?
Having the importer do the mapping is the correct way to operate the DMA API and the new API that Leon has built to fix the scatterlist abuse in dmabuf relies on importer mapping as part of it's construction.
Why on earth do you want the exporter to map? That is completely backwards and unworkable in many cases. The disfunctional P2P support in dmabuf is like that principally because of this.
That said, I don't think get_pfn() is an especially good interface, but we will need to come with something that passes the physical pfn out.
Jason
Am 08.01.25 um 14:23 schrieb Jason Gunthorpe:
On Wed, Jan 08, 2025 at 09:01:46AM +0100, Christian König wrote:
Am 07.01.25 um 15:27 schrieb Xu Yilun:
Introduce a new API for dma-buf importer, also add a dma_buf_ops callback for dma-buf exporter. This API is for subsystem importers who map the dma-buf to some user defined address space, e.g. for IOMMUFD to map the dma-buf to userspace IOVA via IOMMU page table, or for KVM to map the dma-buf to GPA via KVM MMU (e.g. EPT).
Currently dma-buf is only used to get DMA address for device's default domain by using kernel DMA APIs. But for these new use-cases, importers only need the pfn of the dma-buf resource to build their own mapping tables.
As far as I can see I have to fundamentally reject this whole approach.
It's intentional DMA-buf design that we don't expose struct pages nor PFNs to the importer. Essentially DMA-buf only transports DMA addresses.
In other words the mapping is done by the exporter and *not* the importer.
What we certainly can do is to annotate those DMA addresses to a better specify in which domain they are applicable, e.g. if they are PCIe bus addresses or some inter device bus addresses etc...
But moving the functionality to map the pages/PFNs to DMA addresses into the importer is an absolutely clear NO-GO.
Oh?
Having the importer do the mapping is the correct way to operate the DMA API and the new API that Leon has built to fix the scatterlist abuse in dmabuf relies on importer mapping as part of it's construction.
Exactly on that I strongly disagree on.
DMA-buf works by providing DMA addresses the importer can work with and *NOT* the underlying location of the buffer.
Why on earth do you want the exporter to map?
Because the exporter owns the exported buffer and only the exporter knows to how correctly access it.
That is completely backwards and unworkable in many cases. The disfunctional P2P support in dmabuf is like that principally because of this.
No, that is exactly what we need.
Using the scatterlist to transport the DMA addresses was clearly a mistake, but providing the DMA addresses by the exporter has proved many times to be the right approach.
Keep in mind that the exported buffer is not necessary memory, but can also be MMIO or stuff which is only accessible through address space windows where you can't create a PFN nor struct page for.
That said, I don't think get_pfn() is an especially good interface, but we will need to come with something that passes the physical pfn out.
No, physical pfn is absolutely not a good way of passing the location of data around because it is limited to what the CPU sees as address space.
We have use cases where DMA-buf transports the location of CPU invisible data which only the involved devices can see.
Regards, Christian.
Jason
On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote:
Having the importer do the mapping is the correct way to operate the DMA API and the new API that Leon has built to fix the scatterlist abuse in dmabuf relies on importer mapping as part of it's construction.
Exactly on that I strongly disagree on.
DMA-buf works by providing DMA addresses the importer can work with and *NOT* the underlying location of the buffer.
The expectation is that the DMA API will be used to DMA map (most) things, and the DMA API always works with a physaddr_t/pfn argument. Basically, everything that is not a private address space should be supported by improving the DMA API. We are on course for finally getting all the common cases like P2P and MMIO solved here. That alone will take care of alot.
For P2P cases we are going toward (PFN + P2P source information) as input to the DMA API. The additional "P2P source information" provides a good way for co-operating drivers to represent private address spaces as well. Both importer and exporter can have full understanding what is being mapped and do the correct things, safely.
So, no, we don't loose private address space support when moving to importer mapping, in fact it works better because the importer gets more information about what is going on.
I have imagined a staged approach were DMABUF gets a new API that works with the new DMA API to do importer mapping with "P2P source information" and a gradual conversion.
Exporter mapping falls down in too many cases already:
1) Private addresses spaces don't work fully well because many devices need some indication what address space is being used and scatter list can't really properly convey that. If the DMABUF has a mixture of CPU and private it becomes a PITA
2) Multi-path PCI can require the importer to make mapping decisions unique to the device and program device specific information for the multi-path. We are doing this in mlx5 today and have hacks because DMABUF is destroying the information the importer needs to choose the correct PCI path.
3) Importing devices need to know if they are working with PCI P2P addresses during mapping because they need to do things like turn on ATS on their DMA. As for multi-path we have the same hacks inside mlx5 today that assume DMABUFs are always P2P because we cannot determine if things are P2P or not after being DMA mapped.
4) TPH bits needs to be programmed into the importer device but are derived based on the NUMA topology of the DMA target. The importer has no idea what the DMA target actually was because the exporter mapping destroyed that information.
5) iommufd and kvm are both using CPU addresses without DMA. No exporter mapping is possible
Jason
Am 08.01.25 um 15:58 schrieb Jason Gunthorpe:
On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote:
Having the importer do the mapping is the correct way to operate the DMA API and the new API that Leon has built to fix the scatterlist abuse in dmabuf relies on importer mapping as part of it's construction.
Exactly on that I strongly disagree on.
DMA-buf works by providing DMA addresses the importer can work with and *NOT* the underlying location of the buffer.
The expectation is that the DMA API will be used to DMA map (most) things, and the DMA API always works with a physaddr_t/pfn argument. Basically, everything that is not a private address space should be supported by improving the DMA API. We are on course for finally getting all the common cases like P2P and MMIO solved here. That alone will take care of alot.
Well, from experience the DMA API has failed more often than it actually worked in the way required by drivers.
Especially that we tried to hide architectural complexity in there instead of properly expose limitations to drivers is not something I consider a good design approach.
So I see putting even more into that extremely critical.
For P2P cases we are going toward (PFN + P2P source information) as input to the DMA API. The additional "P2P source information" provides a good way for co-operating drivers to represent private address spaces as well. Both importer and exporter can have full understanding what is being mapped and do the correct things, safely.
I can say from experience that this is clearly not going to work for all use cases.
It would mean that we have to pull a massive amount of driver specific functionality into the DMA API.
Things like programming access windows for PCI BARs is completely driver specific and as far as I can see can't be part of the DMA API without things like callbacks.
With that in mind the DMA API would become a mid layer between different drivers and that is really not something you are suggesting, isn't it?
So, no, we don't loose private address space support when moving to importer mapping, in fact it works better because the importer gets more information about what is going on.
Well, sounds like I wasn't able to voice my concern. Let me try again:
We should not give importers information they don't need. Especially not information about the backing store of buffers.
So that importers get more information about what's going on is a bad thing.
I have imagined a staged approach were DMABUF gets a new API that works with the new DMA API to do importer mapping with "P2P source information" and a gradual conversion.
To make it clear as maintainer of that subsystem I would reject such a step with all I have.
We have already gone down that road and it didn't worked at all and was a really big pain to pull people back from it.
Exporter mapping falls down in too many cases already:
- Private addresses spaces don't work fully well because many devices
need some indication what address space is being used and scatter list can't really properly convey that. If the DMABUF has a mixture of CPU and private it becomes a PITA
Correct, yes. That's why I said that scatterlist was a bad choice for the interface.
But exposing the backing store to importers and then let them do whatever they want with it sounds like an even worse idea.
- Multi-path PCI can require the importer to make mapping decisions
unique to the device and program device specific information for the multi-path. We are doing this in mlx5 today and have hacks because DMABUF is destroying the information the importer needs to choose the correct PCI path.
That's why the exporter gets the struct device of the importer so that it can plan how those accesses are made. Where exactly is the problem with that?
When you have an use case which is not covered by the existing DMA-buf interfaces then please voice that to me and other maintainers instead of implementing some hack.
- Importing devices need to know if they are working with PCI P2P
addresses during mapping because they need to do things like turn on ATS on their DMA. As for multi-path we have the same hacks inside mlx5 today that assume DMABUFs are always P2P because we cannot determine if things are P2P or not after being DMA mapped.
Why would you need ATS on PCI P2P and not for system memory accesses?
- TPH bits needs to be programmed into the importer device but are
derived based on the NUMA topology of the DMA target. The importer has no idea what the DMA target actually was because the exporter mapping destroyed that information.
Yeah, but again that is completely intentional.
I assume you mean TLP processing hints when you say TPH and those should be part of the DMA addresses provided by the exporter.
That an importer tries to look behind the curtain and determines the NUMA placement and topology themselves is clearly a no-go from the design perspective.
- iommufd and kvm are both using CPU addresses without DMA. No
exporter mapping is possible
We have customers using both KVM and XEN with DMA-buf, so I can clearly confirm that this isn't true.
Regards, Christian.
Jason
On Wed, Jan 08, 2025 at 04:25:54PM +0100, Christian König wrote:
Am 08.01.25 um 15:58 schrieb Jason Gunthorpe:
On Wed, Jan 08, 2025 at 02:44:26PM +0100, Christian König wrote:
Having the importer do the mapping is the correct way to operate the DMA API and the new API that Leon has built to fix the scatterlist abuse in dmabuf relies on importer mapping as part of it's construction.
Exactly on that I strongly disagree on.
DMA-buf works by providing DMA addresses the importer can work with and *NOT* the underlying location of the buffer.
The expectation is that the DMA API will be used to DMA map (most) things, and the DMA API always works with a physaddr_t/pfn argument. Basically, everything that is not a private address space should be supported by improving the DMA API. We are on course for finally getting all the common cases like P2P and MMIO solved here. That alone will take care of alot.
Well, from experience the DMA API has failed more often than it actually worked in the way required by drivers.
The DMA API has been static and very hard to change in these ways for a long time. I think Leon's new API will break through this and we will be able finally address these issues.
For P2P cases we are going toward (PFN + P2P source information) as input to the DMA API. The additional "P2P source information" provides a good way for co-operating drivers to represent private address spaces as well. Both importer and exporter can have full understanding what is being mapped and do the correct things, safely.
I can say from experience that this is clearly not going to work for all use cases.
It would mean that we have to pull a massive amount of driver specific functionality into the DMA API.
That isn't what I mean. There are two distinct parts, the means to describe the source (PFN + P2P source information) that is compatible with the DMA API, and the DMA API itself that works with a few general P2P source information types.
Private source information would be detected by co-operating drivers and go down driver private paths. It would be rejected by other drivers. This broadly follows how the new API is working.
So here I mean you can use the same PFN + Source API between importer and exporter and the importer can simply detect the special source and do the private stuff. It is not shifting things under the DMA API, it is building along side it using compatible design approaches. You would match the source information, cast it to a driver structure, do whatever driver math is needed to compute the local DMA address and then write it to the device. Nothing is hard or "not going to work" here.
So, no, we don't loose private address space support when moving to importer mapping, in fact it works better because the importer gets more information about what is going on.
Well, sounds like I wasn't able to voice my concern. Let me try again:
We should not give importers information they don't need. Especially not information about the backing store of buffers.
So that importers get more information about what's going on is a bad thing.
I strongly disagree because we are suffering today in mlx5 because of this viewpoint. You cannot predict in advance what importers are going to need. I already listed many examples where it does not work today as is.
I have imagined a staged approach were DMABUF gets a new API that works with the new DMA API to do importer mapping with "P2P source information" and a gradual conversion.
To make it clear as maintainer of that subsystem I would reject such a step with all I have.
This is unexpected, so you want to just leave dmabuf broken? Do you have any plan to fix it, to fix the misuse of the DMA API, and all the problems I listed below? This is a big deal, it is causing real problems today.
If it going to be like this I think we will stop trying to use dmabuf and do something simpler for vfio/kvm/iommufd :(
We have already gone down that road and it didn't worked at all and was a really big pain to pull people back from it.
Nobody has really seriously tried to improve the DMA API before, so I don't think this is true at all.
Exporter mapping falls down in too many cases already:
- Private addresses spaces don't work fully well because many devices
need some indication what address space is being used and scatter list can't really properly convey that. If the DMABUF has a mixture of CPU and private it becomes a PITA
Correct, yes. That's why I said that scatterlist was a bad choice for the interface.
But exposing the backing store to importers and then let them do whatever they want with it sounds like an even worse idea.
You keep saying this without real justification. To me it is a nanny style of API design. But also I don't see how you can possibly fix the above without telling the importer alot more information.
- Multi-path PCI can require the importer to make mapping decisions
unique to the device and program device specific information for the multi-path. We are doing this in mlx5 today and have hacks because DMABUF is destroying the information the importer needs to choose the correct PCI path.
That's why the exporter gets the struct device of the importer so that it can plan how those accesses are made. Where exactly is the problem with that?
A single struct device does not convey the multipath options. We have multiple struct devices (and multiple PCI endpoints) doing DMA concurrently under one driver.
Multipath always needs additional meta information in the importer side to tell the device which path to select. A naked dma address is not sufficient.
Today we guess that DMABUF will be using P2P and hack to choose a P2P struct device to pass the exporter. We need to know what is in the dmabuf before we can choose which of the multiple struct devices the driver has to use for DMA mapping.
But even simple CPU centric cases we will eventually want to select the proper NUMA local PCI channel matching struct device for CPU only buffers.
When you have an use case which is not covered by the existing DMA-buf interfaces then please voice that to me and other maintainers instead of implementing some hack.
Do you have any suggestion for any of this then? We have a good plan to fix this stuff and more. Many experts in their fields have agreed on the different parts now. We haven't got to dmabuf because I had no idea there would be an objection like this.
- Importing devices need to know if they are working with PCI P2P
addresses during mapping because they need to do things like turn on ATS on their DMA. As for multi-path we have the same hacks inside mlx5 today that assume DMABUFs are always P2P because we cannot determine if things are P2P or not after being DMA mapped.
Why would you need ATS on PCI P2P and not for system memory accesses?
ATS has a significant performance cost. It is mandatory for PCI P2P, but ideally should be avoided for CPU memory.
- TPH bits needs to be programmed into the importer device but are
derived based on the NUMA topology of the DMA target. The importer has no idea what the DMA target actually was because the exporter mapping destroyed that information.
Yeah, but again that is completely intentional.
I assume you mean TLP processing hints when you say TPH and those should be part of the DMA addresses provided by the exporter.
Yes, but is not part of the DMA addresses.
That an importer tries to look behind the curtain and determines the NUMA placement and topology themselves is clearly a no-go from the design perspective.
I strongly disagree, this is important. Drivers need this information in a future TPH/UIO/multipath PCI world.
- iommufd and kvm are both using CPU addresses without DMA. No
exporter mapping is possible
We have customers using both KVM and XEN with DMA-buf, so I can clearly confirm that this isn't true.
Today they are mmaping the dma-buf into a VMA and then using KVM's follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable dma-buf must have a CPU PFN.
Here Xu implements basically the same path, except without the VMA indirection, and it suddenly not OK? Illogical.
Jason
Am 08.01.25 um 17:22 schrieb Jason Gunthorpe:
[SNIP]
For P2P cases we are going toward (PFN + P2P source information) as input to the DMA API. The additional "P2P source information" provides a good way for co-operating drivers to represent private address spaces as well. Both importer and exporter can have full understanding what is being mapped and do the correct things, safely.
I can say from experience that this is clearly not going to work for all use cases.
It would mean that we have to pull a massive amount of driver specific functionality into the DMA API.
That isn't what I mean. There are two distinct parts, the means to describe the source (PFN + P2P source information) that is compatible with the DMA API, and the DMA API itself that works with a few general P2P source information types.
Private source information would be detected by co-operating drivers and go down driver private paths. It would be rejected by other drivers. This broadly follows how the new API is working.
So here I mean you can use the same PFN + Source API between importer and exporter and the importer can simply detect the special source and do the private stuff. It is not shifting things under the DMA API, it is building along side it using compatible design approaches. You would match the source information, cast it to a driver structure, do whatever driver math is needed to compute the local DMA address and then write it to the device. Nothing is hard or "not going to work" here.
Well to be honest that sounds like an absolutely horrible design.
You are moving all responsibilities for inter driver handling into the drivers themselves without any supervision by the core OS.
Drivers are notoriously buggy and should absolutely not do things like that on their own.
Do you have pointers to this new API?
So, no, we don't loose private address space support when moving to importer mapping, in fact it works better because the importer gets more information about what is going on.
Well, sounds like I wasn't able to voice my concern. Let me try again:
We should not give importers information they don't need. Especially not information about the backing store of buffers.
So that importers get more information about what's going on is a bad thing.
I strongly disagree because we are suffering today in mlx5 because of this viewpoint. You cannot predict in advance what importers are going to need. I already listed many examples where it does not work today as is.
Well there is no need to predict in advance what importers are going to do. We just gather the requirements of the importers and see how we can fulfill them.
And yes, it is absolutely intentional that maintainers reject some of those requirements because the they find that the importer tries to do something totally nasty which shouldn't happen in the first place.
I have imagined a staged approach were DMABUF gets a new API that works with the new DMA API to do importer mapping with "P2P source information" and a gradual conversion.
To make it clear as maintainer of that subsystem I would reject such a step with all I have.
This is unexpected, so you want to just leave dmabuf broken? Do you have any plan to fix it, to fix the misuse of the DMA API, and all the problems I listed below? This is a big deal, it is causing real problems today.
Well as maintainer of DMA-buf I don't think it is broken. It's just missing requirements.
When you have new requirements we can certainly talk about them, but giving drivers a green card to do whatever nasty thing they want to do is not going to fly at all.
Exporter mapping falls down in too many cases already:
- Private addresses spaces don't work fully well because many devices
need some indication what address space is being used and scatter list can't really properly convey that. If the DMABUF has a mixture of CPU and private it becomes a PITA
Correct, yes. That's why I said that scatterlist was a bad choice for the interface.
But exposing the backing store to importers and then let them do whatever they want with it sounds like an even worse idea.
You keep saying this without real justification.
Well I have really strong justification for that. Please look at the code here and how it cam to be: https://elixir.bootlin.com/linux/v6.12.6/source/drivers/dma-buf/dma-buf.c#L7...
We intentionally mangle the struct page pointers in the scatterlist to prevent importers from trying to access it.
To me it is a nanny style of API design.
Yes, absolutely and that is really necessary.
Inter driver APIs are hard work because the framework not only needs to define rules but also makes sure to properly enforce them.
That's why we not only added tons of documentation how to implement things, but also annotation which gives big warnings whenever somebody tries to do something nasty.
But also I don't see how you can possibly fix the above without telling the importer alot more information.
Giving additional information to the DMA addresses is perfectly ok. It's just that those additional information should be the output of the DMA API or DMA-buf and not something drivers try to figure out on their own.
- Multi-path PCI can require the importer to make mapping decisions
unique to the device and program device specific information for the multi-path. We are doing this in mlx5 today and have hacks because DMABUF is destroying the information the importer needs to choose the correct PCI path.
That's why the exporter gets the struct device of the importer so that it can plan how those accesses are made. Where exactly is the problem with that?
A single struct device does not convey the multipath options. We have multiple struct devices (and multiple PCI endpoints) doing DMA concurrently under one driver.
Multipath always needs additional meta information in the importer side to tell the device which path to select. A naked dma address is not sufficient.
Today we guess that DMABUF will be using P2P and hack to choose a P2P struct device to pass the exporter. We need to know what is in the dmabuf before we can choose which of the multiple struct devices the driver has to use for DMA mapping.
But even simple CPU centric cases we will eventually want to select the proper NUMA local PCI channel matching struct device for CPU only buffers.
So basically you want something which tells you if device A or device B should be used to access something?
That was discussed before as well, but so far the resolution was that userspace should decide that stuff and not the kernel.
Would it be sufficient to have a query API where you can ask the DMA-buf exporter for information about the buffer? E.g. NUMA distance etc...
When you have an use case which is not covered by the existing DMA-buf interfaces then please voice that to me and other maintainers instead of implementing some hack.
Do you have any suggestion for any of this then?
Yeah, compared to other requirements that sounds trivial what you have at hand here.
- iommufd and kvm are both using CPU addresses without DMA. No
exporter mapping is possible
We have customers using both KVM and XEN with DMA-buf, so I can clearly confirm that this isn't true.
Today they are mmaping the dma-buf into a VMA and then using KVM's follow_pfn() flow to extract the CPU pfn from the PTE. Any mmapable dma-buf must have a CPU PFN.
Yeah and I'm the one who forced the KVM guys to do it like that.
Here Xu implements basically the same path, except without the VMA indirection, and it suddenly not OK? Illogical.
Well I'm the designer of this KVM solution and it's not illogical to reject that when you are missing quite a bunch of stuff.
Sima correctly pointed out that you need an mmu notifier if you want to work with PFNs, but that is just the tip of the iceberg here.
If you want to go down the CPU access path you also need to cover things like cacheing flags, lifetime, dynamic changing backing etc...
Regards, Christian.
Jason
On Thu, Jan 09, 2025 at 10:10:01AM +0100, Christian König wrote:
Am 08.01.25 um 17:22 schrieb Jason Gunthorpe:
[SNIP]
For P2P cases we are going toward (PFN + P2P source information) as input to the DMA API. The additional "P2P source information" provides a good way for co-operating drivers to represent private address spaces as well. Both importer and exporter can have full understanding what is being mapped and do the correct things, safely.
I can say from experience that this is clearly not going to work for all use cases.
It would mean that we have to pull a massive amount of driver specific functionality into the DMA API.
That isn't what I mean. There are two distinct parts, the means to describe the source (PFN + P2P source information) that is compatible with the DMA API, and the DMA API itself that works with a few general P2P source information types.
Private source information would be detected by co-operating drivers and go down driver private paths. It would be rejected by other drivers. This broadly follows how the new API is working.
So here I mean you can use the same PFN + Source API between importer and exporter and the importer can simply detect the special source and do the private stuff. It is not shifting things under the DMA API, it is building along side it using compatible design approaches. You would match the source information, cast it to a driver structure, do whatever driver math is needed to compute the local DMA address and then write it to the device. Nothing is hard or "not going to work" here.
Well to be honest that sounds like an absolutely horrible design.
You are moving all responsibilities for inter driver handling into the drivers themselves without any supervision by the core OS.
Drivers are notoriously buggy and should absolutely not do things like that on their own.
IMHO, you and Jason give different meaning to word "driver" in this discussion. It is upto to the subsystems to decide how to provide new API to the end drivers. Worth to read this LWN article first.
Dancing the DMA two-step - https://lwn.net/Articles/997563/
Do you have pointers to this new API?
Latest version is here - https://lore.kernel.org/all/cover.1734436840.git.leon@kernel.org/ Unfortunately, I forgot to copy/paste cover letter but it can be seen in previous version https://lore.kernel.org/all/cover.1733398913.git.leon@kernel.org/.
The most complex example is block layer implementation which hides DMA API from block drivers. https://lore.kernel.org/all/cover.1730037261.git.leon@kernel.org/
Thanks
linaro-mm-sig@lists.linaro.org