Hello.
The following upstream commits:
aa6f8dcbab47 swiotlb: rework "fix info leak with DMA_FROM_DEVICE"
ddbd89deb7d3 swiotlb: fix info leak with DMA_FROM_DEVICE
break ath9k-based Wi-Fi access point for me. The AP emits beacons, but no client can connect to it, either from the very beginning, or shortly after start. These are the only symptoms I've noticed (i.e., no BUG/WARNING messages in `dmesg` etc).
The hardware is:
```
$ dmesg | grep -i swiotlb
[    0.426785] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)

BIOS Information
    Vendor: American Megatrends Inc.
    Version: P1.50
    Release Date: 04/16/2018

Base Board Information
    Manufacturer: ASRock
    Product Name: J3710-ITX

02:00.0 Network controller: Qualcomm Atheros AR9462 Wireless Network Adapter (rev 01)
    Subsystem: Lite-On Communications Inc Device 6621
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 17
    Region 0: Memory at 81400000 (64-bit, non-prefetchable) [size=512K]
    Expansion ROM at 81480000 [disabled] [size=64K]
    Capabilities: [40] Power Management version 2
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [50] MSI: Enable- Count=1/4 Maskable+ 64bit+
        Address: 0000000000000000  Data: 0000
        Masking: 00000000  Pending: 00000000
    Capabilities: [70] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 10.000W
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
        LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR-
            10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
            EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
            FRS- TPHComp- ExtTPHComp-
            AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
            AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer- 2Retimers- DRS-
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
            EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
            Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [100 v1] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [140 v1] Virtual Channel
        Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb:    Fixed- WRR32- WRR64- WRR128-
        Ctrl:   ArbSelect=Fixed
        Status: InProgress-
        VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
            Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
            Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
            Status: NegoPending- InProgress-
    Capabilities: [160 v1] Device Serial Number 00-00-00-00-00-00-00-00
    Kernel driver in use: ath9k
    Kernel modules: ath9k
```
These commits appeared in v5.17 and v5.16.15, and both kernels are broken for me. I'm pretty confident these commits make the difference since I've built both v5.17 and v5.16.15 without them, and it fixed the issue.
The machine also has another Wi-Fi card that acts as an 802.11ax AP, and it is not affected:
```
01:00.0 Unclassified device [0002]: MEDIATEK Corp. MT7915E 802.11ax PCI Express Wireless Network Adapter (prog-if 80)
```
So, I do understand this might be an issue with regard to SG I/O handling in ath9k, hence relevant people in Cc.
Please advise on how to deal with this. Both Olha (in Cc) and I will be glad to test patches if needed. If any extra info is required, please also let me know.
Thanks.
Adding the regressions list so that this can be tracked properly; the original report is quoted below.
Oleksandr Natalenko oleksandr@natalenko.name writes:
Hello.
The following upstream commits:
aa6f8dcbab47 swiotlb: rework "fix info leak with DMA_FROM_DEVICE"
ddbd89deb7d3 swiotlb: fix info leak with DMA_FROM_DEVICE
break ath9k-based Wi-Fi access point for me. The AP emits beacons, but no client can connect to it, either from the very beginning, or shortly after start. These are the only symptoms I've noticed (i.e., no BUG/WARNING messages in `dmesg` etc).
These commits appeared in v5.17 and v5.16.15, and both kernels are broken for me. I'm pretty confident these commits make the difference since I've built both v5.17 and v5.16.15 without them, and it fixed the issue.
The machine also has another Wi-Fi card that acts as an 802.11ax AP, and it is not affected:
01:00.0 Unclassified device [0002]: MEDIATEK Corp. MT7915E 802.11ax PCI Express Wireless Network Adapter (prog-if 80)
So, I do understand this might be an issue with regard to SG I/O handling in ath9k, hence relevant people in Cc.
Please advise on how to deal with this. Both Olha (in Cc) and I will be glad to test patches if needed. If any extra info is required, please also let me know.
Thanks.
On Wed, Mar 23, 2022 at 12:19 AM Oleksandr Natalenko oleksandr@natalenko.name wrote:
The following upstream commits:
aa6f8dcbab47 swiotlb: rework "fix info leak with DMA_FROM_DEVICE"
ddbd89deb7d3 swiotlb: fix info leak with DMA_FROM_DEVICE
break ath9k-based Wi-Fi access point for me. The AP emits beacons, but no client can connect to it, either from the very beginning, or shortly after start. These are the only symptoms I've noticed (i.e., no BUG/WARNING messages in `dmesg` etc).
Funky, but clearly true:
These commits appeared in v5.17 and v5.16.15, and both kernels are broken for me. I'm pretty confident these commits make the difference since I've built both v5.17 and v5.16.15 without them, and it fixed the issue.
Can you double-check (or just explicitly confirm if you already did that test) that you need to revert *both* of those commits, and it's the later "rework" fix that triggers it?
So, I do understand this might be an issue with regard to SG I/O handling in ath9k, hence relevant people in Cc.
Yeah, almost certainly an ath9k bug, but a regression is a regression, so if people can't find the issue in ath9k, we'll have to revert those commits.
Honestly, I personally think they were a bit draconian to begin with, and didn't limit their effects sufficiently.
I'm assuming that the ath9k issue is that it gives DMA mapping a big enough area to handle any possible packet size, and just expects - quite reasonably - smaller packets to only fill the part they need.
Which that "info leak" patch obviously breaks entirely.
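As a rough sketch of the pattern being described (illustrative only, not actual ath9k code; RX_BUFSIZE and the surrounding names are made up):

```
/* Map one full-sized rx buffer up front; the hardware will later DMA a
 * packet of some smaller length 'len' into the start of it. */
struct sk_buff *skb = dev_alloc_skb(RX_BUFSIZE);
dma_addr_t addr = dma_map_single(dev, skb->data, RX_BUFSIZE, DMA_FROM_DEVICE);

/* ... device writes len <= RX_BUFSIZE bytes ... */

dma_unmap_single(dev, addr, RX_BUFSIZE, DMA_FROM_DEVICE);
skb_put(skb, len);    /* only the first 'len' bytes are meaningful */
```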
So my expectation is that this is something we'll just revert, but it would be really good to have the ath9k people double-check.
Linus
On 2022-03-23 17:27, Linus Torvalds wrote:
On Wed, Mar 23, 2022 at 12:19 AM Oleksandr Natalenko oleksandr@natalenko.name wrote:
The following upstream commits:
aa6f8dcbab47 swiotlb: rework "fix info leak with DMA_FROM_DEVICE"
ddbd89deb7d3 swiotlb: fix info leak with DMA_FROM_DEVICE
break ath9k-based Wi-Fi access point for me. The AP emits beacons, but no client can connect to it, either from the very beginning, or shortly after start. These are the only symptoms I've noticed (i.e., no BUG/WARNING messages in `dmesg` etc).
Funky, but clearly true:
These commits appeared in v5.17 and v5.16.15, and both kernels are broken for me. I'm pretty confident these commits make the difference since I've built both v5.17 and v5.16.15 without them, and it fixed the issue.
Can you double-check (or just explicitly confirm if you already did that test) that you need to revert *both* of those commits, and it's the later "rework" fix that triggers it?
So, I do understand this might be an issue with regard to SG I/O handling in ath9k, hence relevant people in Cc.
Yeah, almost certainly an ath9k bug, but a regression is a regression, so if people can't find the issue in ath9k, we'll have to revert those commits.
Honestly, I personally think they were a bit draconian to begin with, and didn't limit their effects sufficiently.
I'm assuming that the ath9k issue is that it gives DMA mapping a big enough area to handle any possible packet size, and just expects - quite reasonably - smaller packets to only fill the part they need.
Which that "info leak" patch obviously breaks entirely.
Except that's the exact case which the new patch is addressing - by copying the whole original area into the SWIOTLB bounce buffer to begin with, if we bounce the whole lot back after the device has only updated part of it, the non-updated parts now get overwritten with the same original contents, rather than whatever random crap happened to be left in the SWIOTLB buffer by its previous user. I'm extremely puzzled how any driver could somehow be dependent on non-device-written data getting replaced with random crap, given that it wouldn't happen with a real IOMMU, or if SWIOTLB just didn't need to bounce, and the data would hardly be deterministic either.
I think I can see how aa6f8dcbab47 might increase the severity of a driver bug where it calls dma_sync_*_for_device() on part of a DMA_FROM_DEVICE mapping that the device *has* written to, without having called a corresponding dma_sync_*_for_cpu() first - previously that would have had no effect, but now SWIOTLB will effectively behave more like an eagerly-prefetching non-coherent cache and write back old data over new - but if ddbd89deb7d3 alone makes a difference then something really weird must be going on.
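To make the suspected misuse pattern concrete, here is a minimal sketch (not taken from any specific driver):

```
addr = dma_map_single(dev, buf, size, DMA_FROM_DEVICE);

/* ... device may already be writing into the buffer via DMA ... */

/* Sync back to the device *without* a prior dma_sync_single_for_cpu().
 * Before aa6f8dcbab47 this was effectively a no-op under SWIOTLB; now it
 * bounces the stale CPU-side copy over whatever the device has written
 * so far. */
dma_sync_single_for_device(dev, addr, size, DMA_FROM_DEVICE);
```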
Has anyone run a sanity check with CONFIG_DMA_API_DEBUG enabled to see if that flags anything up?
Robin.
On Wed, Mar 23, 2022 at 12:06 PM Robin Murphy robin.murphy@arm.com wrote:
On 2022-03-23 17:27, Linus Torvalds wrote:
I'm assuming that the ath9k issue is that it gives DMA mapping a big enough area to handle any possible packet size, and just expects - quite reasonably - smaller packets to only fill the part they need.
Which that "info leak" patch obviously breaks entirely.
Except that's the exact case which the new patch is addressing
Not "addressing". Breaking.
Which is why it will almost certainly get reverted.
Not doing DMA to the whole area seems to be quite the sane thing to do for things like network packets, and overwriting the part that didn't get DMA'd with zeroes seems to be exactly the wrong thing here.
So the SG_IO - and other random untrusted block command sources - data leak will almost certainly have to be addressed differently. Possibly by simply allocating the area with GFP_ZERO to begin with.
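As a rough illustration of that alternative (purely a sketch, not a patch from this thread; the allocation site and names are hypothetical), the buffer handed to the untrusted command could simply be allocated pre-zeroed:

```
/* Zeroing up front means nothing stale can leak even if the device
 * writes fewer than 'len' bytes. */
buf = kzalloc(len, GFP_KERNEL);    /* i.e. kmalloc(len, GFP_KERNEL | __GFP_ZERO) */
if (!buf)
        return -ENOMEM;
```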
Linus
On 2022-03-23 19:16, Linus Torvalds wrote:
On Wed, Mar 23, 2022 at 12:06 PM Robin Murphy robin.murphy@arm.com wrote:
On 2022-03-23 17:27, Linus Torvalds wrote:
I'm assuming that the ath9k issue is that it gives DMA mapping a big enough area to handle any possible packet size, and just expects - quite reasonably - smaller packets to only fill the part they need.
Which that "info leak" patch obviously breaks entirely.
Except that's the exact case which the new patch is addressing
Not "addressing". Breaking.
Which is why it will almost certainly get reverted.
Not doing DMA to the whole area seems to be quite the sane thing to do for things like network packets, and overwriting the part that didn't get DMA'd with zeroes seems to be exactly the wrong thing here.
So the SG_IO - and other random untrusted block command sources - data leak will almost certainly have to be addressed differently. Possibly by simply allocating the area with GFP_ZERO to begin with.
Er, the point of the block layer case is that whole area *is* zeroed to begin with, and a latent memory corruption problem in SWIOTLB itself replaces those zeros with random other kernel data unexpectedly. Let me try illustrating some sequences for clarity...
Expected behaviour/without SWIOTLB:

                              Memory
---------------------------------------------------
start                         12345678
dma_map(DMA_FROM_DEVICE)      no-op
device writes partial data    12ABC678     <- ABC
dma_unmap(DMA_FROM_DEVICE)    12ABC678

SWIOTLB previously:

                              Memory       Bounce buffer
---------------------------------------------------
start                         12345678     xxxxxxxx
dma_map(DMA_FROM_DEVICE)      no-op
device writes partial data    12345678     xxABCxxx  <- ABC
dma_unmap(DMA_FROM_DEVICE)    xxABCxxx  <- xxABCxxx

SWIOTLB Now:

                              Memory       Bounce buffer
---------------------------------------------------
start                         12345678     xxxxxxxx
dma_map(DMA_FROM_DEVICE)      12345678  -> 12345678
device writes partial data    12345678     12ABC678  <- ABC
dma_unmap(DMA_FROM_DEVICE)    12ABC678  <- 12ABC678
Now, sure we can prevent any actual information leakage by initialising the bounce buffer slot with zeros, but then we're just corrupting the not-written-to parts of the mapping with zeros instead of anyone else's old data. That's still fundamentally not OK. The only thing SWIOTLB can do to be correct is treat DMA_FROM_DEVICE as a read-modify-write of the entire mapping, because it has no way to know how much of it is actually going to be modified.
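In code terms, the crucial part amounts to something like the following in the SWIOTLB map path (a paraphrase of the idea in ddbd89deb7d3, not the verbatim kernel source):

```
/* In swiotlb_tbl_map_single(), once the bounce slot is allocated: seed it
 * from the original buffer, so that bouncing the whole slot back later can
 * neither expose a previous user's data nor clobber the parts of the
 * mapping the device never wrote. */
swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
```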
I'll admit I still never quite grasped the reason for also adding the override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I think by that point we were increasingly tired and confused and starting to second-guess ourselves (well, I was, at least). I don't think it's wrong per se, but as I said I do think it can bite anyone who's been doing dma_sync_*() wrong but getting away with it until now. If ddbd89deb7d3 alone turns out to work OK then I'd be inclined to try a partial revert of just that one hunk.
Thanks, Robin.
On Wed, Mar 23, 2022 at 08:54:08PM +0000, Robin Murphy wrote:
I'll admit I still never quite grasped the reason for also adding the override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I think by that point we were increasingly tired and confused and starting to second-guess ourselves (well, I was, at least). I don't think it's wrong per se, but as I said I do think it can bite anyone who's been doing dma_sync_*() wrong but getting away with it until now. If ddbd89deb7d3 alone turns out to work OK then I'd be inclined to try a partial revert of just that one hunk.
Agreed. Let's try that first.
Oleksandr, can you try the patch below:
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 6db1c475ec827..6c350555e5a1c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -701,13 +701,10 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
 void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
 		size_t size, enum dma_data_direction dir)
 {
-	/*
-	 * Unconditional bounce is necessary to avoid corruption on
-	 * sync_*_for_cpu or dma_ummap_* when the device didn't overwrite
-	 * the whole lengt of the bounce buffer.
-	 */
-	swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
-	BUG_ON(!valid_dma_direction(dir));
+	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+		swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
+	else
+		BUG_ON(dir != DMA_FROM_DEVICE);
 }
 
 void swiotlb_sync_single_for_cpu(struct device *dev, phys_addr_t tlb_addr,
Hello.
On Thursday, 24 March 2022 at 6:57:32 CET, Christoph Hellwig wrote:
On Wed, Mar 23, 2022 at 08:54:08PM +0000, Robin Murphy wrote:
I'll admit I still never quite grasped the reason for also adding the override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I think by that point we were increasingly tired and confused and starting to second-guess ourselves (well, I was, at least). I don't think it's wrong per se, but as I said I do think it can bite anyone who's been doing dma_sync_*() wrong but getting away with it until now. If ddbd89deb7d3 alone turns out to work OK then I'd be inclined to try a partial revert of just that one hunk.
Agreed. Let's try that first.
Oleksandr, can you try the patch below:
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 6db1c475ec827..6c350555e5a1c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -701,13 +701,10 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
 void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
 		size_t size, enum dma_data_direction dir)
 {
-	/*
-	 * Unconditional bounce is necessary to avoid corruption on
-	 * sync_*_for_cpu or dma_ummap_* when the device didn't overwrite
-	 * the whole lengt of the bounce buffer.
-	 */
-	swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
-	BUG_ON(!valid_dma_direction(dir));
+	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+		swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
+	else
+		BUG_ON(dir != DMA_FROM_DEVICE);
 }
 
 void swiotlb_sync_single_for_cpu(struct device *dev, phys_addr_t tlb_addr,
With this patch the AP works for me.
Thanks.
On 2022-03-24 10:25, Oleksandr Natalenko wrote:
Hello.
On Thursday, 24 March 2022 at 6:57:32 CET, Christoph Hellwig wrote:
On Wed, Mar 23, 2022 at 08:54:08PM +0000, Robin Murphy wrote:
I'll admit I still never quite grasped the reason for also adding the override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I think by that point we were increasingly tired and confused and starting to second-guess ourselves (well, I was, at least). I don't think it's wrong per se, but as I said I do think it can bite anyone who's been doing dma_sync_*() wrong but getting away with it until now. If ddbd89deb7d3 alone turns out to work OK then I'd be inclined to try a partial revert of just that one hunk.
Agreed. Let's try that first.
Oleksandr, can you try the patch below:
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 6db1c475ec827..6c350555e5a1c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -701,13 +701,10 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
 void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
 		size_t size, enum dma_data_direction dir)
 {
-	/*
-	 * Unconditional bounce is necessary to avoid corruption on
-	 * sync_*_for_cpu or dma_ummap_* when the device didn't overwrite
-	 * the whole lengt of the bounce buffer.
-	 */
-	swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
-	BUG_ON(!valid_dma_direction(dir));
+	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+		swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
+	else
+		BUG_ON(dir != DMA_FROM_DEVICE);
 }
 
 void swiotlb_sync_single_for_cpu(struct device *dev, phys_addr_t tlb_addr,
With this patch the AP works for me.
Cool, thanks for confirming. So I think ath9k probably is doing something dodgy with dma_sync_*(), but if Linus prefers to make the above change rather than wait for that to get figured out, I believe that should be fine.
The crucial part of the "rework" patch is that we'll unconditionally initialise the SWIOTLB bounce slot as it's allocated in swiotlb_tbl_map_single(), regardless of DMA_ATTR_SKIP_CPU_SYNC. As long as that happens, we're safe in terms of leaking data from previous mappings, and any possibility for incorrect sync usage to lose newly-written DMA data is at least no worse than it always has been. The most confusion was around how the proposed DMA_ATTR_OVERWRITE attribute would need to interact with DMA_ATTR_SKIP_CPU_SYNC to remain safe but still have any useful advantage, so unless and until anyone wants to revisit that, this should remain comparatively simple to reason about.
Cheers, Robin.
Robin Murphy robin.murphy@arm.com writes:
On 2022-03-24 10:25, Oleksandr Natalenko wrote:
Hello.
On Thursday, 24 March 2022 at 6:57:32 CET, Christoph Hellwig wrote:
On Wed, Mar 23, 2022 at 08:54:08PM +0000, Robin Murphy wrote:
I'll admit I still never quite grasped the reason for also adding the override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I think by that point we were increasingly tired and confused and starting to second-guess ourselves (well, I was, at least). I don't think it's wrong per se, but as I said I do think it can bite anyone who's been doing dma_sync_*() wrong but getting away with it until now. If ddbd89deb7d3 alone turns out to work OK then I'd be inclined to try a partial revert of just that one hunk.
Agreed. Let's try that first.
Oleksandr, can you try the patch below:
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 6db1c475ec827..6c350555e5a1c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -701,13 +701,10 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
 void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
 		size_t size, enum dma_data_direction dir)
 {
-	/*
-	 * Unconditional bounce is necessary to avoid corruption on
-	 * sync_*_for_cpu or dma_ummap_* when the device didn't overwrite
-	 * the whole lengt of the bounce buffer.
-	 */
-	swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
-	BUG_ON(!valid_dma_direction(dir));
+	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+		swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
+	else
+		BUG_ON(dir != DMA_FROM_DEVICE);
 }
 
 void swiotlb_sync_single_for_cpu(struct device *dev, phys_addr_t tlb_addr,
With this patch the AP works for me.
Cool, thanks for confirming. So I think ath9k probably is doing something dodgy with dma_sync_*(), but if Linus prefers to make the above change rather than wait for that to get figured out, I believe that should be fine.
I'm looking into this; but in the interest of a speedy resolution of the regression I would be in favour of merging that partial revert and reinstating it if/when we identify (and fix) any bugs in ath9k :)
-Toke
On Thu, 2022-03-24 at 15:27 +0100, Toke Høiland-Jørgensen wrote:
I'm looking into this; but in the interest of a speedy resolution of the regression I would be in favour of merging that partial revert and reinstating it if/when we identify (and fix) any bugs in ath9k :)
This looks fishy:
ath9k/recv.c
/* We will now give hardware our shiny new allocated skb */
new_buf_addr = dma_map_single(sc->dev, requeue_skb->data,
                              common->rx_bufsize, dma_type);
if (unlikely(dma_mapping_error(sc->dev, new_buf_addr))) {
        dev_kfree_skb_any(requeue_skb);
        goto requeue_drop_frag;
}

/* Unmap the frame */
dma_unmap_single(sc->dev, bf->bf_buf_addr,
                 common->rx_bufsize, dma_type);

bf->bf_mpdu = requeue_skb;
bf->bf_buf_addr = new_buf_addr;
On Thu, Mar 24, 2022 at 05:29:12PM +0100, Maxime Bizon wrote:
I'm looking into this; but in the interest of a speedy resolution of the regression I would be in favour of merging that partial revert and reinstating it if/when we identify (and fix) any bugs in ath9k :)
This looks fishy:
ath9k/recv.c
/* We will now give hardware our shiny new allocated skb */
new_buf_addr = dma_map_single(sc->dev, requeue_skb->data,
                              common->rx_bufsize, dma_type);
if (unlikely(dma_mapping_error(sc->dev, new_buf_addr))) {
        dev_kfree_skb_any(requeue_skb);
        goto requeue_drop_frag;
}

/* Unmap the frame */
dma_unmap_single(sc->dev, bf->bf_buf_addr,
                 common->rx_bufsize, dma_type);

bf->bf_mpdu = requeue_skb;
bf->bf_buf_addr = new_buf_addr;
Creating a new mapping for the same buffer before unmapping the previous one does look rather bogus. But it does not fit the pattern where reverting the sync_single changes makes the driver work again.
On 2022-03-24 16:31, Christoph Hellwig wrote:
On Thu, Mar 24, 2022 at 05:29:12PM +0100, Maxime Bizon wrote:
I'm looking into this; but in the interest of a speedy resolution of the regression I would be in favour of merging that partial revert and reinstating it if/when we identify (and fix) any bugs in ath9k :)
This looks fishy:
ath9k/recv.c
/* We will now give hardware our shiny new allocated skb */
new_buf_addr = dma_map_single(sc->dev, requeue_skb->data,
                              common->rx_bufsize, dma_type);
if (unlikely(dma_mapping_error(sc->dev, new_buf_addr))) {
        dev_kfree_skb_any(requeue_skb);
        goto requeue_drop_frag;
}

/* Unmap the frame */
dma_unmap_single(sc->dev, bf->bf_buf_addr,
                 common->rx_bufsize, dma_type);

bf->bf_mpdu = requeue_skb;
bf->bf_buf_addr = new_buf_addr;
Creating a new mapping for the same buffer before unmapping the previous one does look rather bogus. But it does not fit the pattern where reverting the sync_single changes makes the driver work again.
OK, you made me look :)
Now that it's obvious what to look for, I can only conclude that during the stanza in ath_edma_get_buffers(), the device is still writing to the buffer while ownership has been transferred to the CPU, and whatever got written while ath9k_hw_process_rxdesc_edma() was running then gets wiped out by the subsequent sync_for_device, which currently resets the SWIOTLB slot to the state that sync_for_cpu copied out. By the letter of the DMA API that's not allowed, but on the other hand I'm not sure if we even have a good idiom for "I can't tell if the device has finished with this buffer or not unless I look at it" :/
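For reference, the flow being described (simplified from ath_edma_get_buffers(); the same code is visible in the patch further down the thread) is roughly:

```
dma_sync_single_for_cpu(sc->dev, bf->bf_buf_addr,
                        common->rx_bufsize, DMA_FROM_DEVICE);

ret = ath9k_hw_process_rxdesc_edma(ah, rs, skb->data);
if (ret == -EINPROGRESS) {
        /* let device gain the buffer again */
        dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
                                   common->rx_bufsize, DMA_FROM_DEVICE);
        /* with the reworked SWIOTLB, this copies the stale CPU-side
         * snapshot back over the live bounce buffer the device may
         * still be writing to */
        return false;
}
```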
Robin.
Robin Murphy robin.murphy@arm.com writes:
On 2022-03-24 16:31, Christoph Hellwig wrote:
On Thu, Mar 24, 2022 at 05:29:12PM +0100, Maxime Bizon wrote:
I'm looking into this; but in the interest of a speedy resolution of the regression I would be in favour of merging that partial revert and reinstating it if/when we identify (and fix) any bugs in ath9k :)
This looks fishy:
ath9k/recv.c
/* We will now give hardware our shiny new allocated skb */
new_buf_addr = dma_map_single(sc->dev, requeue_skb->data,
                              common->rx_bufsize, dma_type);
if (unlikely(dma_mapping_error(sc->dev, new_buf_addr))) {
        dev_kfree_skb_any(requeue_skb);
        goto requeue_drop_frag;
}

/* Unmap the frame */
dma_unmap_single(sc->dev, bf->bf_buf_addr,
                 common->rx_bufsize, dma_type);

bf->bf_mpdu = requeue_skb;
bf->bf_buf_addr = new_buf_addr;
Creating a new mapping for the same buffer before unmapping the previous one does looks rather bogus. But it does not fit the pattern where revering the sync_single changes make the driver work again.
OK, you made me look :)
Now that it's obvious what to look for, I can only conclude that during the stanza in ath_edma_get_buffers(), the device is still writing to the buffer while ownership has been transferred to the CPU, and whatever got written while ath9k_hw_process_rxdesc_edma() was running then gets wiped out by the subsequent sync_for_device, which currently resets the SWIOTLB slot to the state that sync_for_cpu copied out. By the letter of the DMA API that's not allowed, but on the other hand I'm not sure if we even have a good idiom for "I can't tell if the device has finished with this buffer or not unless I look at it" :/
Right, but is that sync_for_device call really needed? AFAICT, that ath9k_hw_process_rxdesc_edma() invocation doesn't actually modify any of the data when it returns EINPROGRESS, so could we just skip it? Like the patch below? Or am I misunderstanding the semantics here?
-Toke
diff --git a/drivers/net/wireless/ath/ath9k/recv.c b/drivers/net/wireless/ath/ath9k/recv.c
index 0c0624a3b40d..19244d4c0ada 100644
--- a/drivers/net/wireless/ath/ath9k/recv.c
+++ b/drivers/net/wireless/ath/ath9k/recv.c
@@ -647,12 +647,8 @@ static bool ath_edma_get_buffers(struct ath_softc *sc,
 				common->rx_bufsize, DMA_FROM_DEVICE);
 
 	ret = ath9k_hw_process_rxdesc_edma(ah, rs, skb->data);
-	if (ret == -EINPROGRESS) {
-		/*let device gain the buffer again*/
-		dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
-				common->rx_bufsize, DMA_FROM_DEVICE);
+	if (ret == -EINPROGRESS)
 		return false;
-	}
 
 	__skb_unlink(skb, &rx_edma->rx_fifo);
 	if (ret == -EINVAL) {
On Thu, Mar 24, 2022 at 10:07 AM Toke Høiland-Jørgensen toke@toke.dk wrote:
Right, but is that sync_for_device call really needed?
Well, imagine that you have a non-cache-coherent DMA (not bounce buffers - just bad hardware)...
So the driver first does that dma_sync_single_for_cpu() so that the CPU sees the current state (for the non-cache-coherent case it would just invalidate caches).
The driver then examines the command buffer state, sees that it's still in progress, and does that return -EINPROGRESS.
It's actually very natural in that situation to flush the caches from the CPU side again. And so dma_sync_single_for_device() is a fairly reasonable thing to do in that situation.
But it doesn't seem *required*, no. The CPU caches only have a copy of the data in them, no writeback needed (and writeback wouldn't work since DMA from the device may be in progress).
So I don't think the dma_sync_single_for_device() is *wrong* per se, because the CPU didn't actually do any modifications.
But yes, I think it's unnecessary - because any later CPU accesses would need that dma_sync_single_for_cpu() anyway, which should invalidate any stale caches.
And it clearly doesn't work in a bounce-buffer situation, but honestly I don't think a "CPU modified buffers concurrently with DMA" can *ever* work in that situation, so I think it's wrong for a bounce buffer model to ever do anything in the dma_sync_single_for_device() situation.
Does removing that dma_sync_single_for_device() actually fix the problem for the ath driver?
There's a fair number of those dma_sync_single_for_device() things all over. Could we find mis-uses and warn about them some way? It seems to be a very natural thing to do in this context, but bounce buffering does make them very fragile.
Linus
Linus Torvalds torvalds@linux-foundation.org writes:
On Thu, Mar 24, 2022 at 10:07 AM Toke Høiland-Jørgensen toke@toke.dk wrote:
Right, but is that sync_for_device call really needed?
Well, imagine that you have a non-cache-coherent DMA (not bounce buffers - just bad hardware)...
So the driver first does that dma_sync_single_for_cpu() so that the CPU sees the current state (for the non-cache-coherent case it would just invalidate caches).
The driver then examines the command buffer state, sees that it's still in progress, and does that return -EINPROGRESS.
It's actually very natural in that situation to flush the caches from the CPU side again. And so dma_sync_single_for_device() is a fairly reasonable thing to do in that situation.
But it doesn't seem *required*, no. The CPU caches only have a copy of the data in them, no writeback needed (and writeback wouldn't work since DMA from the device may be in progress).
So I don't think the dma_sync_single_for_device() is *wrong* per se, because the CPU didn't actually do any modifications.
But yes, I think it's unnecessary - because any later CPU accesses would need that dma_sync_single_for_cpu() anyway, which should invalidate any stale caches.
OK, the above was basically how I understood it. Thank you for confirming!
And it clearly doesn't work in a bounce-buffer situation, but honestly I don't think a "CPU modified buffers concurrently with DMA" can *ever* work in that situation, so I think it's wrong for a bounce buffer model to ever do anything in the dma_sync_single_for_device() situation.
Right.
Does removing that dma_sync_single_for_device() actually fix the problem for the ath driver?
I am hoping Oleksandr can help answer that since my own ath9k hardware is currently on strike :(
-Toke
On Thu, 2022-03-24 at 12:26 -0700, Linus Torvalds wrote:
It's actually very natural in that situation to flush the caches from the CPU side again. And so dma_sync_single_for_device() is a fairly reasonable thing to do in that situation.
In the non-cache-coherent scenario, and assuming dma_map() did an initial cache invalidation, you can write this:
rx_buffer_complete_1(buf)
{
        invalidate_cache(buf, size)
        if (!is_ready(buf))
                return;

        <proceed with receive>
}
or
rx_buffer_complete_2(buf)
{
        if (!is_ready(buf)) {
                invalidate_cache(buf, size)
                return;
        }

        <proceed with receive>
}
The latter is preferred for performance because dma_map() did the initial invalidate.
Of course you could write:
rx_buffer_complete_3(buf)
{
        invalidate_cache(buf, size)
        if (!is_ready(buf)) {
                invalidate_cache(buf, size)
                return;
        }

        <proceed with receive>
}
but it's a waste of CPU cycles
So I'd be very cautious assuming sync_for_cpu() and sync_for_device() are both doing invalidation in existing implementation of arch DMA ops, implementers may have taken some liberty around DMA-API to avoid unnecessary cache operation (not to blame them).
For example looking at arch/arm/mm/dma-mapping.c, for DMA_FROM_DEVICE
sync_single_for_device() => __dma_page_cpu_to_dev() => dma_cache_maint_page(op=dmac_map_area) => cpu_cache.dma_map_area()
sync_single_for_cpu() => __dma_page_dev_to_cpu() => __dma_page_cpu_to_dev(op=dmac_unmap_area) => cpu_cache.dma_unmap_area()
dma_map_area() always does cache invalidate.
But for a couple of CPU variant, dma_unmap_area() is a noop, so sync_for_cpu() does nothing.
Toke's patch will break ath9k on those platforms (mostly silent breakage, rx corruption leading to bad performance)
There's a fair number of those dma_sync_single_for_device() things all over. Could we find mis-uses and warn about them some way? It seems to be a very natural thing to do in this context, but bounce buffering does make them very fragile.
In network drivers, there are at least two patterns:
1) The issue at hand: hardware mixing rx_status and data inside the same area. Usually very old hardware; a very quick grep in network drivers only revealed slicoss.c. It probably would have gone unnoticed if ath9k hardware wasn't so common.
2) The very common "copy break" pattern. If a received packet is smaller than a certain threshold, the driver rx path is changed to do:
sync_for_cpu()
alloc_small_skb()
memcpy(small_skb, rx_buffer_data)
sync_for_device()
The original skb is left in the hardware; this reduces wasted memory.
This pattern is completely valid wrt DMA-API, the buffer is always either owned by CPU or device.
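A slightly fuller sketch of that copy-break pattern (the threshold name and exact helpers are illustrative, not taken from any particular driver):

```
dma_sync_single_for_cpu(dev, rx_addr, pkt_len, DMA_FROM_DEVICE);

if (pkt_len < COPYBREAK_LEN) {
        struct sk_buff *small = netdev_alloc_skb(netdev, pkt_len);

        if (small) {
                skb_put_data(small, rx_buf, pkt_len);
                /* hand 'small' up the stack; the original rx buffer
                 * stays mapped and goes back to the hardware */
        }
        dma_sync_single_for_device(dev, rx_addr, pkt_len, DMA_FROM_DEVICE);
}
```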
On 2022-03-25 10:25, Maxime Bizon wrote:
On Thu, 2022-03-24 at 12:26 -0700, Linus Torvalds wrote:
It's actually very natural in that situation to flush the caches from the CPU side again. And so dma_sync_single_for_device() is a fairly reasonable thing to do in that situation.
In the non-cache-coherent scenario, and assuming dma_map() did an initial cache invalidation, you can write this:
rx_buffer_complete_1(buf)
{
        invalidate_cache(buf, size)
        if (!is_ready(buf))
                return;

        <proceed with receive>
}
or
rx_buffer_complete_2(buf)
{
        if (!is_ready(buf)) {
                invalidate_cache(buf, size)
                return;
        }

        <proceed with receive>
}
The latter is preferred for performance because dma_map() did the initial invalidate.
Of course you could write:
rx_buffer_complete_3(buf)
{
        invalidate_cache(buf, size)
        if (!is_ready(buf)) {
                invalidate_cache(buf, size)
                return;
        }

        <proceed with receive>
}
but it's a waste of CPU cycles
So I'd be very cautious assuming sync_for_cpu() and sync_for_device() are both doing invalidation in existing implementation of arch DMA ops, implementers may have taken some liberty around DMA-API to avoid unnecessary cache operation (not to blame them).
Right, if you have speculatively-prefetching caches, you have to invalidate DMA_FROM_DEVICE in unmap/sync_for_cpu, since a cache may have pulled in a snapshot of partly-written data at any point beforehand. But if you don't, then you can simply invalidate up-front in map/sync_for_device to tie in with the other directions, and trust that it stays that way for the duration.
What muddies the waters a bit is that the opposite combination sync_for_cpu(DMA_TO_DEVICE) really *should* always be a no-op, and I for one have already made the case for eliding that in code elsewhere, but it doesn't necessarily hold for the inverse here, hence why I'm not sure there even is a robust common solution for peeking at a live DMA_FROM_DEVICE buffer.
Robin.
For example looking at arch/arm/mm/dma-mapping.c, for DMA_FROM_DEVICE
sync_single_for_device() => __dma_page_cpu_to_dev() => dma_cache_maint_page(op=dmac_map_area) => cpu_cache.dma_map_area()
sync_single_for_cpu() => __dma_page_dev_to_cpu() => __dma_page_cpu_to_dev(op=dmac_unmap_area) => cpu_cache.dma_unmap_area()
dma_map_area() always does cache invalidate.
But for a couple of CPU variant, dma_unmap_area() is a noop, so sync_for_cpu() does nothing.
Toke's patch will break ath9k on those platforms (mostly silent breakage, rx corruption leading to bad performance)
There's a fair number of those dma_sync_single_for_device() things all over. Could we find mis-uses and warn about them some way? It seems to be a very natural thing to do in this context, but bounce buffering does make them very fragile.
At least in network drivers, there are at least two patterns:
- The issue at hand, hardware mixing rx_status and data inside the
same area. Usually very old hardware, very quick grep in network drivers only revealed slicoss.c. Probably would have gone unnoticed if ath9k hardware wasn't so common.
- The very common "copy break" pattern. If a received packet is
smaller than a certain threshold, the driver rx path is changed to do:
sync_for_cpu()
alloc_small_skb()
memcpy(small_skb, rx_buffer_data)
sync_for_device()
Original skb is left in the hardware, this reduces memory wasted.
This pattern is completely valid wrt DMA-API, the buffer is always either owned by CPU or device.
On Fri, 25 Mar 2022 11:27:41 +0000 Robin Murphy robin.murphy@arm.com wrote:
What muddies the waters a bit is that the opposite combination sync_for_cpu(DMA_TO_DEVICE) really *should* always be a no-op, and I for one have already made the case for eliding that in code elsewhere, but it doesn't necessarily hold for the inverse here, hence why I'm not sure there even is a robust common solution for peeking at a live DMA_FROM_DEVICE buffer.
In https://lkml.org/lkml/2022/3/24/739 I also argued that a robust common solution for peeking at a live DMA_FROM_DEVICE buffer is probably not possible, at least not with the current programming model as described by Documentation/core-api/dma-api.rst.
Namely AFAIU the programming model is based on exclusive ownership: the buffer is either owned by the device, which means CPU(s) are not allowed to *access* it, or it is owned by the CPU(s), and the device is not allowed to *access* it. Do we agree on this?
Considering what Linus said here https://lkml.org/lkml/2022/3/24/775 I understand that: if the idea that dma_sync_*_for_{cpu,device} always transfers ownership to the cpu and device respectively is abandoned, and we re-define ownership in the sense that only the owner may write but the non-owner is allowed to read, then it may be possible to make the scenario under discussion work.
The scenario in pseudo code:
/* when invoked device might be doing DMA into buf */
rx_buf_complete(buf)
{
        prepare_peek(buf, DMA_FROM_DEVICE);
        if (!is_ready(buf)) {
                /* let device gain the buffer again */
                peek_done_not_ready(buf, DMA_FROM_DEVICE);
                return false;
        }
        peek_done_ready(buf, DMA_FROM_DEVICE);
        process_buff(buf, DMA_FROM_DEVICE);
}
IMHO it is pretty obvious that prepare_peek() has to update the cpu copy of the data *without* transferring ownership to the CPU. Since the owner is still the device, it is legit for the device to keep modifying the buffer via DMA. In the case of swiotlb, we would copy the content of the bounce buffer to the orig buffer, possibly after invalidating caches, and for non-swiotlb we would do a cache invalidate. So prepare_peek() could actually be something like dma_sync_single_for_cpu(buf, DMA_FROM_DEVICE, DMA_ATTR_NO_OWNERSHIP_TRANSFER), which would mostly end up being functionally the same as without the flag, since my guess is that the ownership is only tracked in our heads.
For peek_done_not_ready() there is conceptually nothing to do, because the device retained ownership. Thus we would either have to mandate that peek_done_not_ready() is a nop or non-existent (that is what Toke's patch does in this specific case), or we would have to mandate that dma_sync_*_for_*() has no side effects under certain conditions. The former looks simpler to me, especially with swiotlb. But we are also fine if the cache isn't dirty because the CPU didn't write (as pointed out by Linus) and we can detect that and avoid flushing a clean cache, or if we track ownership and avoid flushing caches when there is no ownership transfer. But to avoid these bad flushes, at least for swiotlb, we would either have to track cache ownership or, even worse, track dirtiness (for which we would have to extend the API and make the drivers tell us that the cache, i.e. the original buffer, got dirtied).
Since the device has ownership when peek_done_not_ready() is invoked, we might need to transfer ownership to the CPU in peek_done_ready(). This could again be a dma_sync_for_cpu() with a flag, which when supplied tells the dma API that no sync (cache invalidate) is needed because the driver guarantees that the whole mapping was sufficiently sync-ed by prepare_peek(). Please note that the whole scheme is based on the driver determining whether the whole DMA is done by examining the buffer; it decides based on whatever it sees.
Some of the ongoing discussion seems to ignore this whole ownership business. My feeling is: the notion of ownership is useful. If both sides end up modifying (and eventually flushing) we are in trouble IMHO, and ownership avoids that. But if the conclusion ends up being that ownership does not matter, then we should make sure it is purged from the documentation, because otherwise it will confuse the hell out of people who read documentation and care about programming models. People like me.
Regards, Halil
Halil Pasic pasic@linux.ibm.com writes:
On Fri, 25 Mar 2022 11:27:41 +0000 Robin Murphy robin.murphy@arm.com wrote:
What muddies the waters a bit is that the opposite combination sync_for_cpu(DMA_TO_DEVICE) really *should* always be a no-op, and I for one have already made the case for eliding that in code elsewhere, but it doesn't necessarily hold for the inverse here, hence why I'm not sure there even is a robust common solution for peeking at a live DMA_FROM_DEVICE buffer.
In https://lkml.org/lkml/2022/3/24/739 I also argued, that a robust common solution for a peeking at a live DMA_FROM_DEVICE buffer is probably not possible, at least not with the current programming model as described by Documentation/core-api/dma-api.rst.
Namely AFAIU the programming model is based on exclusive ownership: the buffer is either owned by the device, which means CPU(s) are not allowed to *access* it, or it is owned by the CPU(s), and the device is not allowed to *access* it. Do we agree on this?
Considering what Linus said here https://lkml.org/lkml/2022/3/24/775 I understand that: if the idea that dma_sync_*_for_{cpu,device} always transfers ownership to the cpu and device respectively is abandoned, and we re-define ownership in a sense that only the owner may write, but non-owner is allowed to read, then it may be possible to make the scenario under discussion work.
The scenario in pseudo code:
/* when invoked device might be doing DMA into buf */
rx_buf_complete(buf)
{
        prepare_peek(buf, DMA_FROM_DEVICE);
        if (!is_ready(buf)) {
                /* let device gain the buffer again */
                peek_done_not_ready(buf, DMA_FROM_DEVICE);
                return false;
        }
        peek_done_ready(buf, DMA_FROM_DEVICE);
        process_buff(buf, DMA_FROM_DEVICE);
}
IMHO it is pretty obvious, that prepare_peek() has to update the cpu copy of the data *without* transferring ownership to the CPU. Since the owner is still the device, it is legit for the device to keep modifying the buffer via DMA. In case of the swiotlb, we would copy the content of the bounce buffer to the orig buffer possibly after invalidating caches, and for non-swiotlb we would do invalidate caches. So prepare_peek() could be actually something like, dma_sync_single_for_cpu(buf, DMA_FROM_DEVICE, DMA_ATTR_NO_OWNERSHIP_TRANSFER) which would most end up being functionally the same, as without the flag, since my guess is that the ownership is only tracked in our heads.
Well, we also need to ensure that the CPU caches are properly invalidated either in prepare_peek() or peek_done_not_ready(), so that the data is not cached between subsequent peeks. This could translate to either turning prepare_peek() into dma_sync_single_for_cpu(buf, DMA_FROM_DEVICE, DMA_ATTR_NO_OWNERSHIP_TRANSFER_BUT_INVALIDATE_CACHES), or it could turn peek_done_not_ready() into something that just invalidates the cache.
I was also toying with the idea of having a copy-based peek helper like:
u32 data = dma_peek_word(buf, offset)
which leaves the ownership as-is, but copies out a single word from the buffer at the given offset (from the bounce buffer or real buffer as appropriate) without messing with the ownership notion. The trouble with this idea is that ath9k reads two different words that are 44 bytes from each other, so it would have to do two such calls, which would be racy :(
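For comparison, a range-based variant of that idea (a purely hypothetical signature; no such helper exists today) might look like:

```
/* Hypothetical: copy 'len' bytes starting at 'offset' out of a live
 * DMA_FROM_DEVICE mapping (from the bounce buffer or the real buffer,
 * as appropriate) without transferring ownership to the CPU. */
int dma_peek_range(struct device *dev, dma_addr_t addr,
                   size_t offset, void *dst, size_t len);
```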
-Toke
On Sat, Mar 26, 2022 at 9:06 AM Toke Høiland-Jørgensen toke@toke.dk wrote:
I was also toying with the idea of having a copy-based peek helper like:
u32 data = dma_peek_word(buf, offset)
I really don't think you can or want to have a word-based one.
That said, I like the *name* of that thing.
I think a lot of confusion comes from the very subtle naming of fundamentally having a lot of odd conditions with
- two different "directions of the sync" - ie who it is that cares:
dma_sync_single_for_{cpu,device}
- three different "direction of the data" - ie who it is that writes the data:
DMA_FROM_DEVICE / DMA_TO_DEVICE / DMA_BIDIRECTIONAL
so you have six possible combinations, three of which seem insane and not useful, and of the three that are actually possible, some are very unusual (it's exactly that "device is the one writing, but we want to sync the dma area for the device").
I do not think it helps that not only do we have this combinatorial naming, we also use _different_ names. We say "for device" and "for cpu", but then when we specify who does the writing, we don't say "cpu vs device", we just specify the direction instead (FROM_DEVICE means the device did the writing, TO_DEVICE means that the CPU did the writing).
Anyway, I spent a lot of time looking at this, and I am now personally convinced that commit aa6f8dcbab47 (swiotlb: rework "fix info leak with DMA_FROM_DEVICE") was just completely buggy, and was buggy exactly because it was fundamentally confused even about which direction the bounce was happening.
I have reverted it in my tree, and I tried to write a comprehensive summary about why it was wrong.
What I *didn't* do in that commit was to argue against the naming, and try to enumerate all the different valid cases.
Because I think naming matters, and I think the current dma_sync() interfaces are horribly confusing exactly due to those naming combinatorials.
But I think "peek" is a good name, not because I think reading one word is a valid thing (you want to often peek more than that), but because it seems much more intuitive than "dma_sync_for_cpu(DMA_FROM_DEVICE)".
Similarly, I would think that "flush" is a much better word for "dma_sync_for_device(DMA_FROM_CPU)".
I don't know what a good word for "dma_sync_for_device(DMA_FROM_DEVICE)" is, but maybe "forget" would come closest - we want the CPU to "forget" what it peeked.
Anyway, I have reverted that commit, and I think it was wrong both in spirit and in implementation, and I'll ask Greg to remove it from stable.
And I think the security argument was entirely bogus, because the whole security argument was based on an incorrect understanding of the direction of the data.
But hey, I may currently be convinced that revert is the right thing to do, BUT I've been wrong before, and I'll happily change my mind if somebody makes a really cogent argument
Linus
From: Linus Torvalds
Sent: 26 March 2022 18:39
On Sat, Mar 26, 2022 at 9:06 AM Toke Høiland-Jørgensen toke@toke.dk wrote:
I was also toying with the idea of having a copy-based peek helper like:
u32 data = dma_peek_word(buf, offset)
I really don't think you can or want to have a word-based one.
That said, I like the *name* of that thing.
I think a lot of confusion comes from the very subtle naming of fundamentally having a lot of odd conditions with
two different "directions of the sync" - ie who it is that cares:
dma_sync_single_for_{cpu,device}
three different "direction of the data" - ie who it is that writes the data:
DMA_FROM_DEVICE / DMA_TO_DEVICE / DMA_BIDIRECTIONAL
so you have six possible combinations, three of which seem insane and not useful, and of the three that are actually possible, some are very unusual (it's exactly that "device is the one writing, but we want to sync the dma area for the device").
Another 2c :-)
Is the idea of 'buffer ownership' even a good one? Perhaps the whole thing would be better described in terms of what happens when bounce buffers are used. So there are notionally two buffers. One accessed by the cpu, the other by the device.
There are then just three things that happen:
1) Dirty data may get moved to the other buffer at any time. So the driver must not dirty buffers (cache lines) that the device might write to.
2) The cpu has to request data be copied to the device buffer after updating the cpu buffer. This makes the cpu buffer 'not dirty' so copies (1) can no longer happen.
3) The cpu has to request data be copied from the device buffer before looking at the data.
All copies affect a dma-cache-line sized block of data (which might be device dependent). An optimised version of (2) that doesn't actually do the copy can be implemented for use prior to read requests.
For cache-coherent memory only (1) happens and (2) and (3) are no operations. For non-coherent memory (2) is write-back-and-invalidate and (3) might just be an invalidate. For bounce buffers all are actual copies - and additional operations might be needed for device access to the bounce buffer itself.
For security reasons bounce buffers may need initialising. But this would be done when they are allocated.
David
On Sat, Mar 26, 2022 at 3:38 PM David Laight David.Laight@aculab.com wrote:
Is the idea of 'buffer ownership' even a good one?
I do think it might be best to not think in those terms, but literally just in data movement terms.
Because the "buffer ownership" mental model is clearly confused, when data transfer might be ongoing, but the CPU might need to just look at "what's going on right now" without actually taking any ownership of the buffer.
Linus
Maxime Bizon mbizon@freebox.fr writes:
On Thu, 2022-03-24 at 12:26 -0700, Linus Torvalds wrote:
It's actually very natural in that situation to flush the caches from the CPU side again. And so dma_sync_single_for_device() is a fairly reasonable thing to do in that situation.
In the non-cache-coherent scenario, and assuming dma_map() did an initial cache invalidation, you can write this:
rx_buffer_complete_1(buf)
{
        invalidate_cache(buf, size)
        if (!is_ready(buf))
                return;

        <proceed with receive>
}
or
rx_buffer_complete_2(buf)
{
        if (!is_ready(buf)) {
                invalidate_cache(buf, size)
                return;
        }

        <proceed with receive>
}
The latter is preferred for performance because dma_map() did the initial invalidate.
Of course you could write:
rx_buffer_complete_3(buf)
{
        invalidate_cache(buf, size)
        if (!is_ready(buf)) {
                invalidate_cache(buf, size)
                return;
        }

        <proceed with receive>
}
but it's a waste of CPU cycles
So I'd be very cautious assuming sync_for_cpu() and sync_for_device() are both doing invalidation in existing implementation of arch DMA ops, implementers may have taken some liberty around DMA-API to avoid unnecessary cache operation (not to blame them).
I sense an implicit "and the driver can't (or shouldn't) influence this" here, right?
For example looking at arch/arm/mm/dma-mapping.c, for DMA_FROM_DEVICE
sync_single_for_device() => __dma_page_cpu_to_dev() => dma_cache_maint_page(op=dmac_map_area) => cpu_cache.dma_map_area()
sync_single_for_cpu() => __dma_page_dev_to_cpu() => __dma_page_cpu_to_dev(op=dmac_unmap_area) => cpu_cache.dma_unmap_area()
dma_map_area() always does cache invalidate.
But for a couple of CPU variant, dma_unmap_area() is a noop, so sync_for_cpu() does nothing.
Toke's patch will break ath9k on those platforms (mostly silent breakage, rx corruption leading to bad performance)
Okay, so that would be bad obviously. So if I'm reading you correctly (cf my question above), we can't fix this properly from the driver side, and we should go with the partial SWIOTLB revert instead?
-Toke
On 2022-03-25 16:25, Toke Høiland-Jørgensen wrote:
Maxime Bizon mbizon@freebox.fr writes:
On Thu, 2022-03-24 at 12:26 -0700, Linus Torvalds wrote:
It's actually very natural in that situation to flush the caches from the CPU side again. And so dma_sync_single_for_device() is a fairly reasonable thing to do in that situation.
In the non-cache-coherent scenario, and assuming dma_map() did an initial cache invalidation, you can write this:
rx_buffer_complete_1(buf)
{
	invalidate_cache(buf, size)
	if (!is_ready(buf))
		return;

	<proceed with receive>
}
or
rx_buffer_complete_2(buf)
{
	if (!is_ready(buf)) {
		invalidate_cache(buf, size)
		return;
	}

	<proceed with receive>
}
The latter is preferred for performance because dma_map() did the initial invalidate.
Of course you could write:
rx_buffer_complete_3(buf)
{
	invalidate_cache(buf, size)
	if (!is_ready(buf)) {
		invalidate_cache(buf, size)
		return;
	}

	<proceed with receive>
}
but it's a waste of CPU cycles
So I'd be very cautious assuming sync_for_cpu() and sync_for_device() are both doing invalidation in existing implementation of arch DMA ops, implementers may have taken some liberty around DMA-API to avoid unnecessary cache operation (not to blame them).
I sense an implicit "and the driver can't (or shouldn't) influence this" here, right?
Right, drivers don't get a choice of how a given DMA API implementation works.
For example looking at arch/arm/mm/dma-mapping.c, for DMA_FROM_DEVICE
sync_single_for_device() => __dma_page_cpu_to_dev() => dma_cache_maint_page(op=dmac_map_area) => cpu_cache.dma_map_area()
sync_single_for_cpu() => __dma_page_dev_to_cpu() => __dma_page_cpu_to_dev(op=dmac_unmap_area) => cpu_cache.dma_unmap_area()
dma_map_area() always does cache invalidate.
But for a couple of CPU variants, dma_unmap_area() is a no-op, so sync_for_cpu() does nothing.
Toke's patch will break ath9k on those platforms (mostly silent breakage, rx corruption leading to bad performance)
Okay, so that would be bad obviously. So if I'm reading you correctly (cf my question above), we can't fix this properly from the driver side, and we should go with the partial SWIOTLB revert instead?
Do you have any other way of telling if DMA is idle, or temporarily pausing it before the sync_for_cpu, such that you could honour the notion of ownership transfer properly? As mentioned elsewhere I suspect the only "real" fix if you really do need to allow concurrent access is to use the coherent DMA API for buffers rather than streaming mappings, but that's obviously some far more significant surgery.
Robin.
Robin Murphy robin.murphy@arm.com writes:
On 2022-03-25 16:25, Toke Høiland-Jørgensen wrote:
Maxime Bizon mbizon@freebox.fr writes:
On Thu, 2022-03-24 at 12:26 -0700, Linus Torvalds wrote:
It's actually very natural in that situation to flush the caches from the CPU side again. And so dma_sync_single_for_device() is a fairly reasonable thing to do in that situation.
In the non-cache-coherent scenario, and assuming dma_map() did an initial cache invalidation, you can write this:
rx_buffer_complete_1(buf)
{
	invalidate_cache(buf, size)
	if (!is_ready(buf))
		return;

	<proceed with receive>
}
or
rx_buffer_complete_2(buf)
{
	if (!is_ready(buf)) {
		invalidate_cache(buf, size)
		return;
	}

	<proceed with receive>
}
The latter is preferred for performance because dma_map() did the initial invalidate.
Of course you could write:
rx_buffer_complete_3(buf)
{
	invalidate_cache(buf, size)
	if (!is_ready(buf)) {
		invalidate_cache(buf, size)
		return;
	}

	<proceed with receive>
}
but it's a waste of CPU cycles
So I'd be very cautious assuming sync_for_cpu() and sync_for_device() are both doing invalidation in existing implementation of arch DMA ops, implementers may have taken some liberty around DMA-API to avoid unnecessary cache operation (not to blame them).
I sense an implicit "and the driver can't (or shouldn't) influence this" here, right?
Right, drivers don't get a choice of how a given DMA API implementation works.
For example looking at arch/arm/mm/dma-mapping.c, for DMA_FROM_DEVICE
sync_single_for_device() => __dma_page_cpu_to_dev() => dma_cache_maint_page(op=dmac_map_area) => cpu_cache.dma_map_area()
sync_single_for_cpu() => __dma_page_dev_to_cpu() => __dma_page_cpu_to_dev(op=dmac_unmap_area) => cpu_cache.dma_unmap_area()
dma_map_area() always does cache invalidate.
But for a couple of CPU variants, dma_unmap_area() is a no-op, so sync_for_cpu() does nothing.
Toke's patch will break ath9k on those platforms (mostly silent breakage, rx corruption leading to bad performance)
Okay, so that would be bad obviously. So if I'm reading you correctly (cf my question above), we can't fix this properly from the driver side, and we should go with the partial SWIOTLB revert instead?
Do you have any other way of telling if DMA is idle, or temporarily pausing it before the sync_for_cpu, such that you could honour the notion of ownership transfer properly?
I'll go check with someone who has a better grasp of how the hardware works, but I don't think so...
As mentioned elsewhere I suspect the only "real" fix if you really do need to allow concurrent access is to use the coherent DMA API for buffers rather than streaming mappings, but that's obviously some far more significant surgery.
That would imply copying the packet data out of that (persistent) coherent mapping each time we do a recv operation, though, right? That would be quite a performance hit...
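A rough sketch, just for comparison, of what that coherent-mapping alternative might look like; the rx_slot layout is invented, and the point is only the per-frame copy out of a long-lived dma_alloc_coherent() area:

```
#include <linux/dma-mapping.h>
#include <linux/skbuff.h>

/* Hypothetical rx slot living inside a long-lived coherent allocation. */
struct rx_slot {
	void		*cpu_addr;	/* obtained from dma_alloc_coherent() */
	dma_addr_t	dma_addr;
};

static struct sk_buff *rx_slot_receive(struct rx_slot *slot, size_t frame_len)
{
	struct sk_buff *skb = dev_alloc_skb(frame_len);

	if (!skb)
		return NULL;

	/*
	 * No sync calls are needed for a coherent mapping, but every received
	 * frame is copied out of it - which is the performance hit in question.
	 */
	skb_put_data(skb, slot->cpu_addr, frame_len);
	return skb;
}
```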
If all we need is a way to make dma_sync_single_for_cpu() guarantee a cache invalidation, why can't we just add a separate version that does that (dma_sync_single_for_cpu_peek() or something)? Using that with the patch I posted earlier should be enough to resolve the issue, AFAICT?
-Toke
On Fri, Mar 25, 2022 at 3:25 AM Maxime Bizon mbizon@freebox.fr wrote:
In the non-cache-coherent scenario, and assuming dma_map() did an initial cache invalidation, you can write this:
.. but the problem is that the dma mapping code is supposed to just work, and the driver isn't supposed to know or care whether dma is coherent or not, or using bounce buffers or not.
And currently it doesn't work.
Because what that ath9k driver does is "natural", but it's wrong for the bounce buffer case.
And I think the problem is squarely on the dma-mapping side for two reasons:
(a) this used to work, now it doesn't, and it's unclear how many other drivers are affected
(b) the dma-mapping naming and calling conventions are horrible and actively misleading
That (a) is a big deal. The reason the ath9k issue was found quickly is very likely *NOT* because ath9k is the only thing affected. No, it's because ath9k is relatively common.
Just grep for dma_sync_single_for_device() and ask yourself: how many of those other drivers have you ever even HEARD of, much less be able to test?
And that's just one "dma_sync" function. Admittedly it's likely one of the more common ones, but still..
Now, (b) is why I think drivers might get this so wrong - or, in this case, possibly the dma-mapping code itself.
The naming - and even the documentation(!!!) - implies that what ath9k does IS THE RIGHT THING TO DO.
The documentation clearly states:
"Before giving the memory to the device, dma_sync_single_for_device() needs to be called, and before reading memory written by the device, dma_sync_single_for_cpu(), just like for streaming DMA mappings that are reused"
and ath9k obviously did exactly that, even with a comment to the effect.
And I think ath9k is actually right here, but the documentation is so odd and weak that it's the dma-mapping code that was buggy.
So the dma mapping layer literally broke the documented behavior, and then Christoph goes and says (in another email in this discussion):
"Unless I'm misunderstanding this thread we found the bug in ath9k and have a fix for that now?"
which I think is a gross mis-characterization of the whole issue, and ignores *BOTH* of (a) and (b).
So what's the move forward here?
I personally think we need to
- revert commit aa6f8dcbab47 for the simple reason that it is known to break one driver. But it is unknown how many other drivers are affected.
Even if you think aa6f8dcbab47 was the right thing to do (and I don't - see later), the fact is that it's new behavior that the dma bounce buffer code hasn't done in the past, and clearly confuses things.
- think very carefully about the ath9k case.
We have a patch that fixes it for the bounce buffer case, but you seem to imply that it might actually break non-coherent cases:
"So I'd be very cautious assuming sync_for_cpu() and sync_for_device() are both doing invalidation in existing implementation of arch DMA ops, implementers may have taken some liberty around DMA-API to avoid unnecessary cache operation (not to blame them)"
so who knows what other dma situations it might break?
Because if some non-coherent mapping infrastructure assumes that *only* sync_for_device() will actually flush-and-invalidate caches (because the platform thinks that once they are flushed, getting them back to the CPU doesn't need any special ops), then you're right: Toke's ath9k patch will just result in cache coherency issues on those platforms instead.
- think even *more* about what the ath9k situation means for the dma mapping naming and documentation.
I basically think the DMA syncing has at least three cases (and a fourth combination that makes no sense):
(1) The CPU has actively written to memory, and wants to give that data to the device.
This is "dma_sync_single_for_device(DMA_TO_DEVICE)".
A cache-coherent thing needs to do nothing.
A non-coherent thing needs to do a cache "writeback" (and probably will flush)
A bounce buffer implementation needs to copy *to* the bounce buffer
(2) The CPU now wants to see any state written by the device since the last sync
This is "dma_sync_single_for_cpu(DMA_FROM_DEVICE)".
A bounce-buffer implementation needs to copy *from* the bounce buffer.
A cache-coherent implementation needs to do nothing.
A non-coherent implementation maybe needs to do nothing (ie it assumes that previous ops have flushed the cache, and just accessing the data will bring the right thing back into it). Or it could just flush the cache.
(3) The CPU has seen the state, but wants to leave it to the device
This is "dma_sync_single_for_device(DMA_FROM_DEVICE)".
A bounce buffer implementation needs to NOT DO ANYTHING (this is the current ath9k bug - copying to the bounce buffer is wrong)
A cache coherent implementation needs to do nothing
A non-coherent implementation needs to flush the cache again, but not necessarily do a writeback-flush if there is some cheaper form (assuming it does nothing in the "CPU now wants to see any state" case because it depends on the data not having been in the caches)
(4) There is a fourth case: dma_sync_single_for_cpu(DMA_TO_DEVICE) which maybe should generate a warning because it seems to make no sense? I can't think of a case where this would be an issue - the data is specifically for the device, but it's synced "for the CPU"?
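A sketch of those cases from the bounce-buffer point of view (this is not the swiotlb code, only an illustration of the cases above; memcpy stands in for the real copy machinery):

```
#include <linux/bug.h>
#include <linux/dma-direction.h>
#include <linux/string.h>
#include <linux/types.h>

/*
 * Sketch only - not the swiotlb code: what a bounce-buffer backend would do
 * for each case, given the driver's buffer and its bounce copy.
 */
static void bounce_sync_sketch(void *orig, void *bounce, size_t size,
			       enum dma_data_direction dir, bool for_device)
{
	if (for_device && dir == DMA_TO_DEVICE) {
		/* case (1): CPU wrote data, the device will read it */
		memcpy(bounce, orig, size);
	} else if (!for_device && dir == DMA_FROM_DEVICE) {
		/* case (2): the device wrote data, the CPU wants to see it */
		memcpy(orig, bounce, size);
	} else if (for_device && dir == DMA_FROM_DEVICE) {
		/* case (3): CPU only peeked, the device keeps writing - no copy */
	} else {
		/* case (4) and other combinations: no obvious meaning */
		WARN_ON_ONCE(1);
	}
}
```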
Do people agree? Or am I missing something?
But I don't think the documentation lays out these cases, and I don't think the naming is great.
I also don't think that we can *change* the naming. That's water under the bridge. It is what it is. So I think people need to really agree on the semantics (did I get them entirely wrong above?) and try to think about ways to maybe give warnings for things that make no sense.
Based on my suggested understanding of what the DMA layer should do, the ath9k code is actually doing exactly the right thing. It is doing
dma_sync_single_for_device(DMA_FROM_DEVICE);
and based on my four cases above, the bounce buffer code must do nothing, because "for_device()" together with "FROM_DEVICE" clearly says that all the data is coming *from* the device, and copying any bounce buffers is wrong.
In other words, I think commit aa6f8dcbab47 ("swiotlb: rework 'fix info leak with DMA_FROM_DEVICE'") is fundamentally wrong. It doesn't just break ath9k, it fundamentally breaks that "case 3" above. It's doing a DMA_TO_DEVICE copy, even though it was a DMA_FROM_DEVICE sync.
So I really think that "revert aa6f8dcbab47" is not only inevitable because of practical worries about what it breaks, but because that commit was just entirely and utterly WRONG.
But having written this long wall of text, I'm now slightly worried that I'm just confused, and am trying to just convince myself.
So please: can people think about this a bit more, and try to shoot down the above argument and show that I'm just being silly?
And if I'm right, can we please document this and try really hard to come up with some sanity checks (perhaps based on some "dma buffer state" debug code?)
Linus
On 2022-03-25 18:30, Linus Torvalds wrote:
On Fri, Mar 25, 2022 at 3:25 AM Maxime Bizon mbizon@freebox.fr wrote:
In the non-cache-coherent scenario, and assuming dma_map() did an initial cache invalidation, you can write this:
.. but the problem is that the dma mapping code is supposed to just work, and the driver isn't supposed to know or care whether dma is coherent or not, or using bounce buffers or not.
And currently it doesn't work.
Because what that ath9k driver does is "natural", but it's wrong for the bounce buffer case.
And I think the problem is squarely on the dma-mapping side for two reasons:
(a) this used to work, now it doesn't, and it's unclear how many other drivers are affected
(b) the dma-mapping naming and calling conventions are horrible and actively misleading
That (a) is a big deal. The reason the ath9k issue was found quickly is very likely *NOT* because ath9k is the only thing affected. No, it's because ath9k is relatively common.
Just grep for dma_sync_single_for_device() and ask yourself: how many of those other drivers have you ever even HEARD of, much less be able to test?
And that's just one "dma_sync" function. Admittedly it's likely one of the more common ones, but still..
Now, (b) is why I think drivers might get this so wrong - or, in this case, possibly the dma-mapping code itself.
The naming - and even the documentation(!!!) - implies that what ath9k does IS THE RIGHT THING TO DO.
The documentation clearly states:
"Before giving the memory to the device, dma_sync_single_for_device() needs to be called, and before reading memory written by the device, dma_sync_single_for_cpu(), just like for streaming DMA mappings that are reused"
Except that's documentation for the non-coherent allocation API, rather than the streaming API in question here. I'll refrain from commenting on having at least 3 DMA APIs, with the same set of sync functions serving two of them, and just stand back a safe distance...
Anyway, the appropriate part of that document is probably:
"You must do this:
- Before reading values that have been written by DMA from the device (use the DMA_FROM_DEVICE direction)"
I'm not saying it constitutes *good* documentation, but I would note how it says "have been written", and not "are currently being written". Similarly from the HOWTO:
"If you need to use the same streaming DMA region multiple times and touch the data in between the DMA transfers, the buffer needs to be synced properly..."
Note "between the DMA transfers", and not "during the DMA transfers". The fundamental assumption of the streaming API is that only one thing is ever accessing the mapping at any given time, which is what the whole notion of ownership is about.
Thanks, Robin.
On Fri, Mar 25, 2022 at 12:14 PM Robin Murphy robin.murphy@arm.com wrote:
Note "between the DMA transfers", and not "during the DMA transfers". The fundamental assumption of the streaming API is that only one thing is ever accessing the mapping at any given time, which is what the whole notion of ownership is about.
Well, but that ignores reality.
Any documentation that ignores the "CPU will want to see the intermediate state" is by definition garbage, because that is clearly a simple fact.
We don't write documentation for fantasy.
Linus
On Friday, 25 March 2022 at 19:30:21 CET Linus Torvalds wrote:
The reason the ath9k issue was found quickly is very likely *NOT* because ath9k is the only thing affected. No, it's because ath9k is relatively common.
Indeed. But having a wife who complains about a non-working Wi-Fi printer definitely helps in finding the issue too.
On Fri, Mar 25, 2022 at 12:26 PM Oleksandr Natalenko oleksandr@natalenko.name wrote:
On Friday, 25 March 2022 at 19:30:21 CET Linus Torvalds wrote:
The reason the ath9k issue was found quickly is very likely *NOT* because ath9k is the only thing affected. No, it's because ath9k is relatively common.
Indeed. But having a wife who complains about a non-working Wi-Fi printer definitely helps in finding the issue too.
Well, maybe we should credit her in the eventual resolution (whatever it ends up being).
Although probably not using that exact wording.
Linus
On Friday, 25 March 2022 at 20:27:43 CET Linus Torvalds wrote:
On Fri, Mar 25, 2022 at 12:26 PM Oleksandr Natalenko oleksandr@natalenko.name wrote:
On Friday, 25 March 2022 at 19:30:21 CET Linus Torvalds wrote:
The reason the ath9k issue was found quickly is very likely *NOT* because ath9k is the only thing affected. No, it's because ath9k is relatively common.
Indeed. But having a wife who complains about a non-working Wi-Fi printer definitely helps in finding the issue too.
Well, maybe we should credit her in the eventual resolution (whatever it ends up being).
Although probably not using that exact wording.
While Olha has already been Cc'ed here, I can definitely encourage her in person to provide Reported-by/Tested-by if needed :).
So I've been watching this from the sidelines mostly, and discussing a bit with Toke, but:
On Fri, 2022-03-25 at 11:30 -0700, Linus Torvalds wrote:
(2) The CPU now wants to see any state written by the device since the last sync
This is "dma_sync_single_for_cpu(DMA_FROM_DEVICE)". A bounce-buffer implementation needs to copy *from* the bounce buffer. A cache-coherent implementation needs to do nothing. A non-coherent implementation maybe needs to do nothing (ie it
assumes that previous ops have flushed the cache, and just accessing the data will bring the right thing back into it). Or it could just flush the cache.
Doesn't that just need to *invalidate* the cache, rather than *flush* it? The cache is somewhat similar to the bounce buffer, and here you're copying _from_ the bounce buffer (which is where the device is accessing), so shouldn't it be the same for the cache, i.e. you invalidate it so you read again from the real memory?
(3) The CPU has seen the state, but wants to leave it to the device
This is "dma_sync_single_for_device(DMA_FROM_DEVICE)".
A bounce buffer implementation needs to NOT DO ANYTHING (this is the current ath9k bug - copying to the bounce buffer is wrong)
A cache coherent implementation needs to do nothing
A non-coherent implementation needs to flush the cache again, but not necessarily do a writeback-flush if there is some cheaper form (assuming it does nothing in the "CPU now wants to see any state" case because it depends on the data not having been in the caches)
And similarly here, it would seem that the implementation can't _flush_ the cache as the device might be writing concurrently (which it does in fact do in the ath9k case), but it must invalidate the cache?
I'm not sure about the (2) case, but here it seems fairly clear cut that if you have a cache and don't expect the CPU to write to the buffer (as evidenced by DMA_FROM_DEVICE), you wouldn't want to write out the cache to DRAM?
I'll also note independently that ath9k actually maps the buffers as DMA_BIDIRECTIONAL, but the flush operations happen with DMA_FROM_DEVICE, at least after the setup is done. I must admit that I was scratching my head about this, I had sort of expected one should be passing the same DMA direction to all different APIs for the same buffer, but clearly, as we can see in your list of cases here, that's _not_ true.
Then, however, we need to define what happens if you pass DMA_BIDIRECTIONAL to the sync_for_cpu() and sync_for_device() functions, which adds two more cases? Or maybe we eventually just think that's not valid at all, since you have to specify how you're (currently?) using the buffer, which can't be DMA_BIDIRECTIONAL?
(4) There is a fourth case: dma_sync_single_for_cpu(DMA_TO_DEVICE) which maybe should generate a warning because it seems to make no sense? I can't think of a case where this would be an issue - the data is specifically for the device, but it's synced "for the CPU"?
I'd tend to agree with that, that's fairly much useless, since if only the CPU wrote to it, then you wouldn't care about any caching or bounce buffers, so no need to sync back.
In other words, I think commit aa6f8dcbab47 ("swiotlb: rework 'fix info leak with DMA_FROM_DEVICE'") is fundamentally wrong. It doesn't just break ath9k, it fundamentally breaks that "case 3" above. It's doing a DMA_TO_DEVICE copy, even though it was a DMA_FROM_DEVICE sync.
So I really think that "revert aa6f8dcbab47" is not only inevitable because of practical worries about what it breaks, but because that commit was just entirely and utterly WRONG.
Honestly, I was scratching my head about this too - sadly it just says "what was agreed", without a pointer to how that was derived, but it seemed that the original issue was:
"we're leaking old bounce buffer data to the device"
or was it not? In which case doing any copies during map should've been sufficient, since then later no more data leaks could occur?
johannes
On Fri, Mar 25, 2022 at 1:38 PM Johannes Berg johannes@sipsolutions.net wrote:
(2) The CPU now wants to see any state written by the device since the last sync
This is "dma_sync_single_for_cpu(DMA_FROM_DEVICE)". A bounce-buffer implementation needs to copy *from* the bounce buffer. A cache-coherent implementation needs to do nothing. A non-coherent implementation maybe needs to do nothing (ie it
assumes that previous ops have flushed the cache, and just accessing the data will bring the right thing back into it). Or it could just flush the cache.
Doesn't that just need to *invalidate* the cache, rather than *flush* it?
Yes. I should have been more careful.
That said, I think "invalidate without writeback" is a really dangerous operation (it can generate some *really* hard to debug memory state), so on the whole I think you should always strive to just do "flush-and-invalidate".
If the core has support for "invalidate clean cache lines only", then that's possibly a good alternative.
A non-coherent implementation needs to flush the cache again, but not necessarily do a writeback-flush if there is some cheaper form (assuming it does nothing in the "CPU now wants to see any state" case because it depends on the data not having been in the caches)
And similarly here, it would seem that the implementation can't _flush_ the cache as the device might be writing concurrently (which it does in fact do in the ath9k case), but it must invalidate the cache?
Right, again, when I said "flush" I really should have said "invalidate".
I'm not sure about the (2) case, but here it seems fairly clear cut that if you have a cache, don't expect the CPU to write to the buffer (as evidenced by DMA_FROM_DEVICE), you wouldn't want to write out the cache to DRAM?
See above: I'd *really* want to avoid a pure "invalidate cacheline" model. The amount of debug issues that can cause is not worth it.
So please flush-and-invalidate, or invalidate-non-dirty, but not just "invalidate".
Then, however, we need to define what happens if you pass DMA_BIDIRECTIONAL to the sync_for_cpu() and sync_for_device() functions, which adds two more cases? Or maybe we eventually just think that's not valid at all, since you have to specify how you're (currently?) using the buffer, which can't be DMA_BIDIRECTIONAL?
Ugh. Do we actually have cases that do it? That sounds really odd for a "sync" operation. It sounds very reasonable for _allocating_ DMA, but for syncing I'm left scratching my head what the semantics would be.
But yes, if we do and people come up with semantics for it, those semantics should be clearly documented.
And if we don't - or people can't come up with semantics for it - we should actively warn about it and not have some code that does odd things that we don't know what they mean.
But it sounds like you agree with my analysis, just not with some of my bad/incorrect word choices.
Linus
On Fri, 2022-03-25 at 13:47 -0700, Linus Torvalds wrote:
On Fri, Mar 25, 2022 at 1:38 PM Johannes Berg johannes@sipsolutions.net wrote:
(2) The CPU now wants to see any state written by the device since the last sync
This is "dma_sync_single_for_cpu(DMA_FROM_DEVICE)". A bounce-buffer implementation needs to copy *from* the bounce buffer. A cache-coherent implementation needs to do nothing. A non-coherent implementation maybe needs to do nothing (ie it
assumes that previous ops have flushed the cache, and just accessing the data will bring the right thing back into it). Or it could just flush the cache.
Doesn't that just need to *invalidate* the cache, rather than *flush* it?
Yes. I should have been more careful.
Well I see now that you said 'cache "writeback"' in (1), and 'flush' in (2), so perhaps you were thinking of the same, and I'm just calling it "flush" and "invalidate" respectively?
That said, I think "invalidate without writeback" is a really dangerous operation (it can generate some *really* hard to debug memory state), so on the whole I think you should always strive to just do "flush-and-invalidate".
Hmm. Yeah, can't really disagree with that.
However, this seems to be the wrong spot to flush (writeback?) the cache, as we're trying to get data from the device, not potentially overwrite the data that the device wrote because we have a dirty cacheline. Hmm. Then again, how could we possibly have a dirty cacheline?
Which starts to clarify in my mind why we have a sort of (implied) ownership model: if the CPU dirties a cacheline while the device has ownership then the cache writeback might overwrite the DMA data. So it's easier to think of it as "CPU has ownership" and "device has ownership", but evidently that simple model breaks down in real-world cases such as ath9k where the CPU wants to look, but not write, and the device continues doing DMA at the same time.
Now in that case the cache wouldn't be considered dirty either since the CPU was just reading, but who knows? Hence the suggestion of just invalidate, not flush.
If the core has support for "invalidate clean cache lines only", then that's possibly a good alternative.
Well if you actually did dirty the cacheline, then you have a bug one way or the other, and it's going to be really hard to debug - either you lose the CPU write, or you lose the device write, there's no way you're not losing one of them?
ath9k doesn't write, of course, so hopefully the core wouldn't write back what it must think of as clean cachelines, even if the device modified the memory underneath already.
So really while I agree with your semantics, I was previously privately suggesting to Toke we should probably have something like
dma_single_cpu_peek()
// read buffer and check if it was done
dma_single_cpu_peek_finish()
which really is equivalent to the current
dma_sync_single_for_cpu(DMA_FROM_DEVICE)
// ...
dma_sync_single_for_device(DMA_FROM_DEVICE)
that ath9k does, but makes it clearer that you really can't write to the buffer... but, water under the bridge, I guess.
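For illustration only, such hypothetical peek helpers (the names are the suggestion above, not an existing API) could be thin wrappers over the current sync calls:

```
#include <linux/dma-mapping.h>

/*
 * Hypothetical helpers - not an existing API - as thin wrappers over the
 * current sync calls, so the "look but don't write" pattern is explicit
 * at the call site.
 */
static inline void dma_single_cpu_peek(struct device *dev, dma_addr_t addr,
				       size_t size)
{
	dma_sync_single_for_cpu(dev, addr, size, DMA_FROM_DEVICE);
}

static inline void dma_single_cpu_peek_finish(struct device *dev,
					      dma_addr_t addr, size_t size)
{
	dma_sync_single_for_device(dev, addr, size, DMA_FROM_DEVICE);
}
```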
Thinking about the ownership model again - it seems that we need to at least modify that ownership model in the sense that we have *write* ownership that we pass around, not just "ownership". But if we do that, then we need to clarify which operations pass write ownership and which don't.
So the operations

(1) dma_sync_single_for_device(DMA_TO_DEVICE)
(2) dma_sync_single_for_cpu(DMA_FROM_DEVICE)
(3) dma_sync_single_for_device(DMA_FROM_DEVICE)
really only (1) passes write ownership to the device, but then you can't get it back?
But that cannot be true, because ath9k maps the buffer as DMA_BIDIRECTIONAL, and then eventually might want to recycle it.
Actually though, perhaps passing write ownership back to the CPU isn't an operation that the DMA API needs to worry about - if the CPU has read ownership and the driver knows separately that the device is no longer accessing it, then basically the CPU already got write ownership, and passes that back using (1)?
Then, however, we need to define what happens if you pass DMA_BIDIRECTIONAL to the sync_for_cpu() and sync_for_device() functions, which adds two more cases? Or maybe we eventually just think that's not valid at all, since you have to specify how you're (currently?) using the buffer, which can't be DMA_BIDIRECTIONAL?
Ugh. Do we actually have cases that do it?
Yes, a few.
That sounds really odd for a "sync" operation. It sounds very reasonable for _allocating_ DMA, but for syncing I'm left scratching my head what the semantics would be.
I agree.
But yes, if we do and people come up with semantics for it, those semantics should be clearly documented.
I'm not sure? I'm wondering if this isn't just because - like me initially - people misunderstood the direction argument, or didn't understand it well enough, and then just passed the same value as for the map()/unmap()?
You have to pass the size to all of them too, after all ... but I'm speculating.
And if we don't - or people can't come up with semantics for it - we should actively warn about it and not have some code that does odd things that we don't know what they mean.
Makes sense.
But it sounds like you agree with my analysis, just not with some of my bad/incorrect word choices.
Yes.
johannes
I've been thinking of the case where a descriptor ring has to be in non-coherent memory (eg because that is all there is).
The receive ring processing isn't actually that difficult.
The driver has to fill a cache line full of new buffer descriptors in memory but without assigning the first buffer to the hardware. Then it has to do a cache line write of just that line. Then it can assign ownership of the first buffer and finally do a second cache line write. (The first explicit write can be skipped if the cache writes are known to be atomic.) It then must not dirty that cache line.
To check for new frames it must invalidate the cache line that contains the 'next to be filled' descriptor and then read that cache line. This will contain info about one or more receive frames. But the hardware is still doing updates.
But both these operations can be happening at the same time on different parts of the buffer.
So you need to know a 'cache line size' for the mapping and be able to do writebacks and invalidates for parts of the buffer, not just all of it.
The transmit side is harder. It either requires waiting for all pending transmits to finish or splitting a single transmit into enough fragments that its descriptors end on a cache line boundary. But again, if the interface is busy, you want the cpu to be able to update one cache line of transmit descriptors while the device is writing transmit completion status to the previous cache line.
I don't think that is materially different for non-coherent memory or bounce buffers. But partial flush/invalidate is needed.
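A rough sketch of the receive-ring refill step described above; struct rx_desc, init_desc() and set_hw_owned() are invented for illustration, and the descriptors are assumed to pack exactly into one cache line:

```
#include <linux/cache.h>
#include <linux/dma-mapping.h>
#include <linux/types.h>

/* Hypothetical descriptor layout, invented for illustration. */
struct rx_desc {
	__le32 buf_addr;
	__le32 flags;		/* imagined to contain a "hardware owns this" bit */
};

#define DESC_PER_LINE	(L1_CACHE_BYTES / sizeof(struct rx_desc))

void init_desc(struct rx_desc *d);	/* hypothetical: prepare one descriptor */
void set_hw_owned(struct rx_desc *d);	/* hypothetical: flip the ownership bit */

static void rx_refill_line(struct device *dev, struct rx_desc *line,
			   dma_addr_t line_dma)
{
	unsigned int i;

	/* fill a cache line full of new descriptors, without yet handing the
	 * first one to the hardware */
	for (i = 0; i < DESC_PER_LINE; i++)
		init_desc(&line[i]);

	/* first explicit write of just that line */
	dma_sync_single_for_device(dev, line_dma, L1_CACHE_BYTES, DMA_TO_DEVICE);

	/* assign ownership of the first buffer and do the second line write;
	 * after this the CPU must not dirty the line again */
	set_hw_owned(&line[0]);
	dma_sync_single_for_device(dev, line_dma, L1_CACHE_BYTES, DMA_TO_DEVICE);
}
```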
David
On Fri, Mar 25, 2022 at 2:13 PM Johannes Berg johannes@sipsolutions.net wrote:
Well I see now that you said 'cache "writeback"' in (1), and 'flush' in (2), so perhaps you were thinking of the same, and I'm just calling it "flush" and "invalidate" respectively?
Yeah, so I mentally tend to think of the operations as just "writeback" (which doesn't invalidate) and "flush" (which is a writeback-invalidate).
Which explains my word-usage, but isn't great, because it's not well-defined. And I'm not even consistent about it.
However, this seems to be the wrong spot to flush (writeback?) the cache, as we're trying to get data from the device, not potentially overwrite the data that the device wrote because we have a dirty cacheline. Hmm. Then again, how could we possibly have a dirty cacheline?
So in my model, (1) is the only case where there's actively dirty data that needs to be written back. That's the "CPU wrote the data to memory, and wants to transfer it to the device" case.
In (2) and (3), the only question is whether possibly clean cachelines contain - or will contain - stale data.
And then exactly when you actually invalidate is up to you.
For example, in
(2) The CPU now wants to see any state written by the device
you have multiple options. You could invalidate any stale cache lines.
Or you could say "We wrote them back and invalidated them in (1), so we don't need to invalidate them now".
And in
(3) The CPU looked at the data while it was in flight and is now done with it.
you can (again) decide to do nothing at all, BUT ONLY IF (2) chose that "invalidate" option. Because if you made your (2) depend on that "they were already invalidated", then (3) has to invalidate the CPU caches so that a subsequent (2) will work right.
So these are all somewhat interconnected.
You can do just "writeback" in (1), but then you _have_ to do "invalidate" in (2), and in that case you don't have to do anything at all in (3).
Or maybe your CPU only has that "writeback-and-invalidate" operation, so you decide that (2) should be a no-op, and (3) - which is presumably less common than (2) - also does that writeback-invalidate thing.
Or we can also say that (3) is not allowed at all - so the ath9k case is actively wrong and we should warn about that case - but that again constrains what you can do in (2) and now that previous optimization is not valid.
And it's worth noting that if your CPU may do cache fills as a result of speculative accesses (or just sufficiently out of order), then the whole argument that "I invalidated those lines earlier, so I don't need to invalidate them now" is just garbage.
Fun, isn't it?
Which starts to clarify in my mind why we have a sort of (implied) ownership model: if the CPU dirties a cacheline while the device has ownership then the cache writeback might overwrite the DMA data.
Right, I think that "if the CPU dirties the cacheline while the device has ownership, then the data is going to be undefined".
And btw, it does actually potentially happen for real - imagine a user mmap'ing the IO buffer while IO is in flight. The kernel can't control for that (well, we can make things read-only, and some uses do), but then it's often a question of "you have to dirty that area and do the IO again, because the last attempt sent out undefined data".
And note how this "undefined data" situation can happen even with coherent IO, so this part isn't even about cache coherency - it's literally about just about memory accesses being in some undefined order.
So it *can* be valid to send inconsistent data, but that should be considered the really odd case.
So it's easier to think of it as "CPU has ownership" and "device has ownership", but evidently that simple model breaks down in real-world cases such as ath9k where the CPU wants to look, but not write, and the device continues doing DMA at the same time.
Yeah, and see above about how the CPU could even write (but honestly, that isn't valid in the general case, it really requires that kind of active "we can fix it up later" thing).
Well if you actually did dirty the cacheline, then you have a bug one way or the other, and it's going to be really hard to debug - either you lose the CPU write, or you lose the device write, there's no way you're not losing one of them?
Again, see above. Losing the CPU write is really bad, because then you can't even recover by re-doing the operation.
So yes, when looking at only the "single operation" case, it looks like "lose one or the other". But in the bigger picture, one is more important than the other.
So the operations

(1) dma_sync_single_for_device(DMA_TO_DEVICE)
(2) dma_sync_single_for_cpu(DMA_FROM_DEVICE)
(3) dma_sync_single_for_device(DMA_FROM_DEVICE)
really only (1) passes write ownership to the device, but then you can't get it back?
Well, you get it back by just checking that the IO is done. Once the IO is done, the CPU owns the area again.
And the "IO is done" is often some entirely independent status in some entirely different place.
But it *could* be something that requires a CPU read from that DMA area. But it's a CPU _read_, so you don't need write ownership for that.
That's why there is only one DMA_TO_DEVICE, and there are two DMA_FROM_DEVICE cases.
The DMA_TO_DEVICE cannot have a "let me write in the middle" situation.
But the DMA_FROM_DEVICE has that "let me read in the middle, and decide if it's done or not", so you can have a looping read, and that's where (3) comes in.
You can't have a looping write for one operation (but you can obviously have several independent write operations - that's what the whole "are you done" is all about)
But that cannot be true, because ath9k maps the buffer as DMA_BIDIRECTIONAL, and then eventually might want to recycle it.
See above. Both cases have "the device is done with this", but they are fundamentally different situations.
That sounds really odd for a "sync" operation. It sounds very reasonable for _allocating_ DMA, but for syncing I'm left scratching my head what the semantics would be.
I agree.
But yes, if we do and people come up with semantics for it, those semantics should be clearly documented.
I'm not sure? I'm wondering if this isn't just because - like me initially - people misunderstood the direction argument, or didn't understand it well enough, and then just passed the same value as for the map()/unmap()?
Yeah, the solution may well be "grep for it, and pick the right direction once the docs are clear".
Linus
From: Linus Torvalds
Sent: 25 March 2022 21:57
On Fri, Mar 25, 2022 at 2:13 PM Johannes Berg johannes@sipsolutions.net wrote:
Well I see now that you said 'cache "writeback"' in (1), and 'flush' in (2), so perhaps you were thinking of the same, and I'm just calling it "flush" and "invalidate" respectively?
Yeah, so I mentally tend to think of the operations as just "writeback" (which doesn't invalidate) and "flush" (which is a writeback-invalidate).
It almost certainly doesn't matter whether the "writeback" invalidates or not. You have to assume that all sorts of operations might cause the cpu to read in a cacheline. This includes, but is not limited to, speculative execution and cache line prefetch.
But you definitely need an "invalidate" to force a cache line be read after the hardware has accessed it. Now such lines must not be dirty; because the cpu can write back a dirty cache line at any time - which would break things. So this can also be "write back if dirty" and "invalidate".
Bounce buffers and cache probably work much the same way. But for bounce buffers I guess you want to ensure the initially allocated buffer doesn't contain old data (belonging to someone else). So you might decide to zero them on allocation or always copy from the driver buffer on the first request.
Then you get the really annoying cpus that don't have a "write back dirty line and invalidate" opcode. And the only way is to read enough other memory areas to displace all the existing cache line data. You probably might as well give up and use PIO :-)
David
On Fri, 25 Mar 2022 22:13:08 +0100 Johannes Berg johannes@sipsolutions.net wrote:
Then, however, we need to define what happens if you pass DMA_BIDIRECTIONAL to the sync_for_cpu() and sync_for_device() functions, which adds two more cases? Or maybe we eventually just think that's not valid at all, since you have to specify how you're (currently?) using the buffer, which can't be DMA_BIDIRECTIONAL?
Ugh. Do we actually have cases that do it?
Yes, a few.
That sounds really odd for a "sync" operation. It sounds very reasonable for _allocating_ DMA, but for syncing I'm left scratching my head what the semantics would be.
I agree.
But yes, if we do and people come up with semantics for it, those semantics should be clearly documented.
I'm not sure? I'm wondering if this isn't just because - like me initially - people misunderstood the direction argument, or didn't understand it well enough, and then just passed the same value as for the map()/unmap()?
I don't think you misunderstood the direction argument and its usage. I haven't finished thinking about Linus's proposal, but I'm pretty confident about my understanding of the current semantics of the direction. According to the documentation, you do have to pass in the very same direction that was specified when the mapping was created. A quote from the documentation:
""" void dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction direction)
void dma_sync_single_for_device(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction direction)
void dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction direction)
void dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction direction)
Synchronise a single contiguous or scatter/gather mapping for the CPU and device. With the sync_sg API, all the parameters must be the same as those passed into the single mapping API. With the sync_single API, you can use dma_handle and size parameters that aren't identical to those passed into the single mapping API to do a partial sync. """ (Documentation/core-api/dma-api.rst)
The key here is "sync_sg API, all the parameters must be the same as those passed into the single mapping API", but I have to admit, I don't understand the *single* in here. The intended meaning of the last sentence is that one can do a partial sync by choosing dma_handle_sync, size_sync such that dma_handle_mapping <= dma_handle_sync < dma_handle_mapping + size_mapping and dma_handle_sync + size_sync <= dma_handle_mapping + size_mapping. But the direction has to remain the same.
BTW, the current documented definition of the direction is about the data transfer direction between memory and the device, and how the CPU is interacting with the memory is not in scope. A quote from the documentation:
""" ======================= ============================================= DMA_NONE no direction (used for debugging) DMA_TO_DEVICE data is going from the memory to the device DMA_FROM_DEVICE data is coming from the device to the memory DMA_BIDIRECTIONAL direction isn't known ======================= ============================================= """ (Documentation/core-api/dma-api.rst)
My feeling is that re-defining the dma direction is not a good idea. But I don't think my opinion has much weight here.
@Christoph, Robin: What do you think?
Regards, Halil
On Sun, 2022-03-27 at 05:15 +0200, Halil Pasic wrote:
The key here is "sync_sg API, all the parameters must be the same as those passed into the single mapping API", but I have to admit, I don't understand the *single* in here.
Hah. So I wasn't imagining things after all.
However, as the rest of the thread shows, this still means it's all broken ... :)
The intended meaning of the last sentence is that one can do a partial sync by choosing dma_handle_sync, size_sync such that dma_handle_mapping <= dma_handle_sync < dma_handle_mapping + size_mapping and dma_handle_sync + size_sync <= dma_handle_mapping + size_mapping. But the direction has to remain the same.
Right.
BTW, the current documented definition of the direction is about the data transfer direction between memory and the device, and how the CPU is interacting with the memory is not in scope. A quote form the documentation.
""" ======================= ============================================= DMA_NONE no direction (used for debugging) DMA_TO_DEVICE data is going from the memory to the device DMA_FROM_DEVICE data is coming from the device to the memory DMA_BIDIRECTIONAL direction isn't known ======================= ============================================= """ (Documentation/core-api/dma-api.rst)
My feeling is, that re-defining the dma direction is not a good idea. But I don't think my opinion has much weight here.
However, this basically means that the direction argument to the flush APIs is completely useless, and we do have to define something new/else...
johannes
On Mon, 2022-03-28 at 11:48 +0200, Johannes Berg wrote:
However, this basically means that the direction argument to the flush APIs is completely useless, and we do have to define something new/else...
No I worded that badly - the direction isn't useless, but thinking of it in terms of a buffer property rather than data movement is inaccurate. So then if we need something else to indicate how data was expected to be moved, the direction argument becomes useless, since it's not a buffer property but rather a temporal thing on a specific place that expected certain data movement.
johannes
On Mon, 2022-03-28 at 11:50 +0200, Johannes Berg wrote:
No I worded that badly - the direction isn't useless, but thinking of it in terms of a buffer property rather than data movement is inaccurate. So then if we need something else to indicate how data was expected to be moved, the direction argument becomes useless, since it's not a buffer property but rather a temporal thing on a specific place that expected certain data movement.
Yeah, umm. I should've read the whole thread of the weekend first, sorry for the noise.
johannes
On Thu, 24 Mar 2022 12:26:53 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
So I don't think the dma_sync_single_for_device() is *wrong* per se, because the CPU didn't actually do any modifications.
But yes, I think it's unnecessary - because any later CPU accesses would need that dma_sync_single_for_cpu() anyway, which should invalidate any stale caches.
And it clearly doesn't work in a bounce-buffer situation, but honestly I don't think a "CPU modified buffers concurrently with DMA" can *ever* work in that situation, so I think it's wrong for a bounce buffer model to ever do anything in the dma_sync_single_for_device() situation.
I agree that "CPU modified buffers *concurrently* with DMA" can never work, and I believe the ownership model was conceived to prevent this situation. But a CPU can modify the buffer *after* DMA has written to it, while the mapping is still alive. For example one could do one partial read from the device, *after* the DMA is done, sync_for_cpu(DMA_FROM_DEVICE), examine, then zero out the entire buffer, sync_for_device(DMA_FROM_DEVICE), make the device do partial DMA, do dma_unmap and expect no residue from the first DMA. That is, expect that the zeroing out was effective.
The point I'm trying to make is: if concurrent access were even permitted (it isn't, because of ownership), swiotlb wouldn't know if we are dealing with the *concurrent* case, which is completely bogus, or with the *sequential* case, which at least in theory could work. And if we don't do anything on the sync_for_device(DMA_FROM_DEVICE) we render that zeroing out of the entire buffer from my example ineffective, because it would not reach the bounce buffer, and on dma_unmap we would overwrite the original buffer with the content of the bounce buffer.
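Spelling that sequence out as code (examine() and the buffer/handle are imaginary placeholders), the expectation is:

```
#include <linux/dma-mapping.h>
#include <linux/string.h>

void examine(const void *buf);	/* hypothetical: look at the partially filled data */

static void reuse_mapping_example(struct device *dev, dma_addr_t handle,
				  void *buf, size_t len)
{
	/* the first DMA has finished */
	dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);
	examine(buf);				/* partial read */

	memset(buf, 0, len);			/* CPU zeroes the entire buffer */
	dma_sync_single_for_device(dev, handle, len, DMA_FROM_DEVICE);

	/* ... the device performs a second, partial DMA ... */

	dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
	/* expectation: the zeroing was effective, no residue from the first DMA */
}
```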
Regards, Halil
On Sat, Mar 26, 2022 at 8:49 PM Halil Pasic pasic@linux.ibm.com wrote:
I agree that "CPU modified buffers *concurrently* with DMA" can never work, and I believe the ownership model was conceived to prevent this situation.
But that just means that the "ownership" model is garbage, and cannot handle this REAL LIFE situation.
Here's the deal: if somebody makes a model that is counter-factual, you have exactly two choices:
- fix the damn model
- live in a la-la-land that isn't reality
Which choice do you pick?
And I'll be brutally honest: if you pick the la-la-land one, I don't think we can really discuss this any more.
But a CPU can modify the buffer *after* DMA has written to it, while the mapping is still alive.
Yes.
But you are making ALL your arguments based on that "ownership" model that we now know is garbage.
If you make your arguments based on garbage, the end result _might_ work just by happenstance, but the arguments sure aren't very convincing, are they?
So let's make it really clear that the arguments must not be based on some "ownership" model that you just admitted cannot handle the very real case of real and common hardware.
Ok?
For example one could do one partial read from the device, *after* the DMA is done, sync_for_cpu(DMA_FROM_DEVICE), examine, then zero out the entire buffer, sync_for_device(DMA_FROM_DEVICE)
So the result you want to get to I can believe in, but your sequence of getting there is untenable, since it depends on breaking other cases that are more important than your utterly broken hardware that you don't even know how much data it filled.
And I fundamentally also happen to think that since the CPU just wrote that buffer, and *that* write is what you want to sync with the device, then that DMA_FROM_DEVICE was just pure fantasy to begin with.
So that code sequence you quote is wrong. You are literally trying to re-introduce the semantics that we already know broke the ath9k driver.
Instead, let me give you a couple of alternative scenarios.
Alternative 1:
- ok, so the CPU wrote zeroes to the area, and we want to tell the DMA mapping code that it needs to make that CPU write visible to the device.
- Ahh, you mean "sync_for_device(DMA_TO_DEVICE)"?
- Yes.
The "for_device()" tells us that afterwards, the device can access the memory (we synced it for the device).
And the DMA_TO_DEVICE tells us that we're transferring the zeroes we wrote on the CPU to the device bounce buffer.
Now the device got those zeroes, and it can overwrite them (partially), and everything is fine
- And then we need to use "sync_for_cpu(DMA_FROM_DEVICE)" when we want to see the result once it's all done?
- Yes.
- Splendid. Except I don't like how you mix DMA_TO_DEVICE and DMA_FROM_DEVICE and you made the dma_alloc() not use DMA_BIDIRECTIONAL
- Yeah, that's ugly, but it matches reality, and it would "just work" today.
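As a call sequence, Alternative 1 would look roughly like this (dev/handle/buf/len being placeholders from an imaginary driver):

```
#include <linux/dma-mapping.h>
#include <linux/string.h>

static void alternative_1_sketch(struct device *dev, dma_addr_t handle,
				 void *buf, size_t len)
{
	memset(buf, 0, len);		/* the CPU writes the zeroes */
	dma_sync_single_for_device(dev, handle, len, DMA_TO_DEVICE);

	/* ... the device (partially) overwrites the buffer ... */

	dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);
	/* the CPU can now look at whatever the device wrote */
}
```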
Alternative 2:
- Ok, so the CPU doesn't even want to write to the area AT ALL, but we know we have a broken device that might not fill all of the bounce buffer, and we don't want to see old stale bounce buffer contents that could come from some other operation entirely and leak sensitive data that wasn't for us.
- Ahh, so you really want to just clear the bounce buffer before IO?
- Yes. So let's introduce a "clear_bounce_buffer()" operation before starting DMA. Let's call it "sync_for_device(DMA_NONE)" and teach the non-bounce-buffer dma mapping entities to just ignore it, because they don't have a bounce buffer that can contain stale data.
- Sounds good.
Alternative 3:
- Ok, you have me stumped. I can't come up with something else sane.
Anyway, notice what's common about these alternatives? They are based not on some "we have a broken model", but on trying to solve the actual real-life problem case.
I'm more than happy to hear other alternatives.
But the alternative I am _not_ willing to entertain is "Yeah, we have a model of ownership, and that can't handle ath9k because that one wants to do CPU reads while DMA is possibly active, so ath9k is broken".
Can we please agree on that?
Linus
On Sat, Mar 26, 2022 at 10:06 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Sat, Mar 26, 2022 at 8:49 PM Halil Pasic pasic@linux.ibm.com wrote:
I agree that "CPU modified buffers *concurrently* with DMA" can never work, and I believe the ownership model was conceived to prevent this situation.
But that just means that the "ownership" model is garbage, and cannot handle this REAL LIFE situation.
Just to clarify: I obviously agree that the "both sides modify concurrently" obviously cannot work with bounce buffers.
People still do want to do that, but they'll limit themselves to actual cache-coherent DMA when they do so (or do nasty uncached accesses but at least no bounce buffering).
But the "bounce ownership back and forth" model comes up empty when the CPU wants to read while the DMA is still going on. And that not only can work, but *has* worked.
You could have a new "get me a non-ownership copy" operation of course, but that hits the problem of "which existing drivers need it?"
We have no idea, outside of ath9k.
This is why I believe we have to keep the existing semantics in a way that keep ath9k - and any number of unknown other drivers - happy.
And then for the cases where you want to introduce the zeroing because you don't know how much data the DMA returned - those are the ones you'll have to mark some way.
Linus
From: Linus Torvalds
Sent: 27 March 2022 06:21
On Sat, Mar 26, 2022 at 10:06 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Sat, Mar 26, 2022 at 8:49 PM Halil Pasic pasic@linux.ibm.com wrote:
I agree that "CPU modified buffers *concurrently* with DMA" can never work, and I believe the ownership model was conceived to prevent this situation.
But that just means that the "ownership" model is garbage, and cannot handle this REAL LIFE situation.
Just to clarify: I obviously agree that the "both sides modify concurrently" obviously cannot work with bounce buffers.
Aren't bounce buffers just a more extreme case of non-coherent memory accesses? They just need explicit memory copies rather than just cache writeback and invalidate operations.
So 'both sides modify concurrently' just has the same issue as it does with non-coherent memory in that the locations need to be in separate (dma) cache lines. Indeed, if the bounce buffers are actually coherent then arbitrary concurrent updates are possible.
One issue is that the driver needs to indicate which parts of any buffer are dirty, whereas any 'cache writeback' request will only write dirty data.
Get everything right and you can even support hardware where the 'bounce buffers' are actually on the card and the copies are MMIO (or better, especially on PCIe, synchronous host initiated dma transfers).
David
On Sun, Mar 27, 2022 at 8:24 AM David Laight David.Laight@aculab.com wrote:
Aren't bounce buffers just a more extreme case of non-coherent memory accesses?
No.
In fact, this whole change came about exactly because bounce buffers are different.
The difference is that bounce buffers have that (wait for it) bounce buffer, which can have stale contents.
They just need explicit memory copies rather than just cache writeback and invalidate operations.
That's the thing - the memory copies introduce entirely new issues.
I really think that instead of making up abstract rules ("ownership" or "bounce buffers are just extreme cases of non-coherency") we should make the rules very much practical and down to earth, and write out exactly what they *do*.
The whole "sync DMA" is odd and abstract enough as a concept on its own, we shouldn't then make the rules for it odd and abstract. We should make them very very practical.
So I will propose that we really make it very much about practical concerns, and we document things as
(a) the "sync" operation has by definition a "whose _future_ access do we sync for" notion.
So "dma_sync_single_for_cpu()" says that the future accesses to this dma area is for the CPU.
Note how it does *NOT* say that the "CPU owns the area". That's bullsh*t, and we now know it's BS.
(b) but the sync operation also has a "who wrote the data we're syncing"
Note that this is *not* "who accessed or owned it last", because that's nonsensical: if we're syncing for the CPU, then the only reason to do so is because we expect that the last access was by the device, so specifying that separately would just be redundant and stupid.
But specifying who *wrote* to the area is meaningful and matters. It matters for the non-coherent cache case (because of writeback issues), but it also matters for the bounce buffer case (because it determines which way we should copy).
Note how this makes sense: a "sync" operation is clearly about taking some past state, and making it palatable for a future use. The past state is pretty much defined by who wrote the data, and then we can use that and the "the next thing to access it" to determine what we need to do about the sync.
It is *NOT* about "ownership".
So let's go through the cases, and I'm going to ignore the "CPU caches are coherent with device DMA" case because that's always going to be a no-op wrt data movement (but it will still generally need a memory barrier, which I will mention here and then ignore going forward).
Syncing for *CPU* accesses (ie dma_sync_single_for_cpu()) has four choices I can see:
- nobody wrote the data at all (aka DMA_NONE).
This is nonsensical and should warn. If nobody wrote to it, why would the CPU ever validly access it?
Maybe you should have written "memset(buffer, 0, size)" instead?
- the CPU wrote the data in the first place (aka DMA_TO_DEVICE)
This is a no-op (possibly a memory barrier), because even stale CPU caches are fine, and even if it was in a bounce buffer, the original CPU-side data is fine.
- the device wrote the data (aka DMA_FROM_DEVICE)
This is just the regular case of a device having written something, and the CPU wants to see it.
It obviously needs real work, but it's simple and straightforward.
For non-coherent caches, it needs a cache invalidate. For a bounce buffer, it needs a copy from the bounce buffer to the "real" buffer.
- it's not clear who wrote the data (aka DMA_BIDIRECTIONAL)
This is not really doable for a bounce buffer - we just don't know which buffer contents are valid.
I think it's very very questionable for non-coherent caches too, but "writeback and invalidate" probably can't hurt.
So probably warn about it, and do whatever we used to do historically.
Syncing for device accesses (ie dma_sync_single_for_device()) also has the same four choices I can see, but obviously does different things:
- nobody wrote the data at all (aka DMA_NONE)
This sounds as nonsensical as the CPU case, but maybe isn't.
We may not have any previous explicit writes, but we *do* have that "insecure and possibly stale buffer contents" bounce buffer thing on the device side.
So with a bounce buffer, it's actually somewhat sane to say "initialize the bounce buffer to a known state".
Because bounce buffers *are* special. Unlike even the "noncoherent caches" issue, they have that entirely *other* hidden state in the form of the bounce buffer itself.
Discuss.
- the CPU wrote the data in the first place (aka DMA_TO_DEVICE).
This is the regular and common case of "we have data on the CPU side that is written to the device".
Again, needs work, but is simple and straightforward.
For non-coherent caches, we need a writeback on the CPU. For a bounce buffer, we need to copy from the regular buffer to the bounce buffer.
- the device wrote the data in the first place (aka DMA_FROM_DEVICE)
This is the case that we hit on ath9k. It's *not* obvious, but when we write this out this way, I really think the semantics are pretty clear.
For non-coherent caches, we may need an "invalidate". For a bounce buffer, it's a no-op (because the bounce buffer already contains the data)
- it's not clear who wrote the data (aka DMA_BIDIRECTIONAL)
This is again not really doable for a bounce buffer. We don't know which buffer contains the right data, we should warn about it and do whatever we used to do historically.
Again, it's very questionable for non-coherent caches too, but "writeback and invalidate" probably at least can't hurt.
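Spelled out as code, the bounce-buffer half of that table could look roughly like this. This is a sketch only: copy_bounce_to_orig() and copy_orig_to_bounce() are made-up names standing in for whatever copy helper a real implementation would use.

```c
static void bounce_sync_sketch(phys_addr_t orig, phys_addr_t tlb, size_t size,
			       enum dma_data_direction dir, bool for_cpu)
{
	if (for_cpu) {				/* dma_sync_single_for_cpu() */
		switch (dir) {
		case DMA_NONE:			/* nobody wrote it: warn */
			WARN_ON_ONCE(1);
			break;
		case DMA_TO_DEVICE:		/* CPU wrote it: the original buffer is fine */
			break;
		case DMA_FROM_DEVICE:		/* device wrote it: copy bounce -> original */
			copy_bounce_to_orig(orig, tlb, size);
			break;
		case DMA_BIDIRECTIONAL:		/* unclear: warn, do the historical thing */
			WARN_ON_ONCE(1);
			copy_bounce_to_orig(orig, tlb, size);
			break;
		}
	} else {				/* dma_sync_single_for_device() */
		switch (dir) {
		case DMA_NONE:			/* maybe: scrub the bounce buffer */
			memset(phys_to_virt(tlb), 0, size);
			break;
		case DMA_TO_DEVICE:		/* CPU wrote it: copy original -> bounce */
			copy_orig_to_bounce(orig, tlb, size);
			break;
		case DMA_FROM_DEVICE:		/* device wrote it: the bounce buffer is already valid */
			break;
		case DMA_BIDIRECTIONAL:		/* unclear: warn, do the historical thing */
			WARN_ON_ONCE(1);
			copy_orig_to_bounce(orig, tlb, size);
			break;
		}
	}
}
```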
So hey, that's my thinking. The whole "ownership" model is, I think, obviously untenable.
But just going through and listing the different cases and making them explicit I think explains exactly what the different situations are, and that then makes it fairly clear what the different combinations should do.
No?
Linus
On Sun, Mar 27, 2022 at 12:23 PM Linus Torvalds torvalds@linux-foundation.org wrote:
So I will propose that we really make it very much about practical concerns, and we document things as
(a) the "sync" operation has by definition a "whose _future_ access do we sync for" notion.
So "dma_sync_single_for_cpu()" says that the future accesses to
this dma area is for the CPU.
Note how it does *NOT* say that the "CPU owns the are". That's
bullsh*t, and we now know it's BS.
(b) but the sync operation also has a "who wrote the data we're syncing" notion.
Note that this is *not* "who accessed or owned it last", because
that's nonsensical: if we're syncing for the CPU, then the only reason to do so is because we expect that the last access was by the device, so specifying that separately would just be redundant and stupid.
But specifying who *wrote* to the area is meaningful and matters.
We could also simply require that the bounce buffer code *remember* who wrote to it last.
So when the ath9k driver does
- setup:
bf->bf_buf_addr = dma_map_single(sc->dev, skb->data, common->rx_bufsize, DMA_FROM_DEVICE);
we clear the bounce buffer and remember that the state of the bounce buffer is "device wrote to it" (because DMA_FROM_DEVICE).
Then, we have an interrupt or other event, and ath9k does
- rc event:
dma_sync_single_for_cpu(sc->dev, bf->bf_buf_addr, common->rx_bufsize, DMA_FROM_DEVICE);
	ret = ath9k_hw_process_rxdesc_edma(ah, rs, skb->data);
	if (ret == -EINPROGRESS) {
		/* let device gain the buffer again */
		dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
				common->rx_bufsize, DMA_FROM_DEVICE);
		return false;
	}
and the first dma_sync_single_for_cpu() now sees "Ok, I want the CPU buffer, and I remember that the device wrote to it, so I will copy from the bounce buffer". It's still DMA_FROM_DEVICE, so that "the device wrote to it" doesn't change.
When the CPU then decides "ok, that wasn't it", and does that dma_sync_single_for_device(), the bounce buffer code goes "Ok, the last operation was that the device wrote to the buffer, so the bounce buffer is still valid and I should do nothing".
Voila, no ath9k breakage, and it all still makes perfect sense.
And that sounds like an even more logical model than the one where we tell the bounce buffer code what the previous operation was, but it involves basically the DMA mapping code remembering what the last direction was. That makes perfect sense to me, but it's certainly not what the DMA mapping code has traditionally done, which makes me nervous that it would just expose a _lot_ of other drivers that do odd things.
The "good news" (if there is such a thing) is that usually the direction doesn't actually change. So if you use DMA_FROM_DEVICE initially, you'll continue to use that. So there is probably basically never any difference between "what was the previous operation" and "what is the next operation".
So maybe practically speaking, we don't care.
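For what it's worth, the bookkeeping for the "remember who wrote it last" variant would be tiny; a sketch, with all names made up (this is not how swiotlb actually tracks its slots):

```c
struct bounce_slot {
	phys_addr_t	orig;		/* the driver's real buffer */
	phys_addr_t	tlb;		/* the bounce buffer */
	size_t		size;
	bool		device_wrote;	/* set at map time for DMA_FROM_DEVICE */
};

static void bounce_sync_for_cpu(struct bounce_slot *slot)
{
	if (slot->device_wrote)			/* the device's data is in the bounce buffer */
		copy_bounce_to_orig(slot);	/* made-up helper */
	/* a CPU read doesn't change who wrote the data last */
}

static void bounce_sync_for_device(struct bounce_slot *slot,
				   enum dma_data_direction dir)
{
	if (dir == DMA_TO_DEVICE) {		/* the CPU produced new data */
		copy_orig_to_bounce(slot);	/* made-up helper */
		slot->device_wrote = false;
	}
	/*
	 * DMA_FROM_DEVICE with device_wrote still set: the bounce buffer is
	 * still valid, so do nothing, which is exactly what the ath9k
	 * sequence above needs.
	 */
}
```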
Anyway, I do think we have choices here on how to describe things.
I do think that the "DMA code doesn't have to remember" model has the advantage that remembering is always an added complexity, and operations that behave differently depending on previous history are always a bit harder to think about because of that. Which is why I think that model I outlined in the previous email is probably the most straightforward one.
Linus
On Sat, 26 Mar 2022 22:21:03 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
On Sat, Mar 26, 2022 at 10:06 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Sat, Mar 26, 2022 at 8:49 PM Halil Pasic pasic@linux.ibm.com wrote:
I agree that the CPU modifying buffers *concurrently* with DMA can never work, and I believe the ownership model was conceived to prevent this situation.
But that just means that the "ownership" model is garbage, and cannot handle this REAL LIFE situation.
Just to clarify: I obviously agree that "both sides modify concurrently" cannot work with bounce buffers.
People still do want to do that, but they'll limit themselves to actual cache-coherent DMA when they do so (or do nasty uncached accesses but at least no bounce buffering).
Thanks for the explanation!
But the "bounce ownership back and forth" model comes up empty when the CPU wants to read while the DMA is still going on. And that not only can work, but *has* worked.
You could have a new "get me a non-ownership copy" operation of course,
Yes, https://www.spinics.net/lists/linux-wireless/msg222442.html was mostly about exploring that idea.
but that hits the problem of "which existing drivers need it?"
We have no idea, outside of ath9k.
This is why I believe we have to keep the existing semantics in a way that keeps ath9k - and any number of unknown other drivers - happy.
I agree.
And then for the cases where you want to introduce the zeroing because you don't know how much data the DMA returned - those are the ones you'll have to mark some way.
I have no intention of pursuing this. When fixing the information leak, I happened to realize that a somewhat similar situation can emerge when mappings are reused. It seemed like an easy fix, so I asked the swiotlb maintainers, and they agreed. It ain't my field of expertise, and the drivers I'm interested in don't need this functionality.
Regards, Halil
Linus
On Sun, Mar 27, 2022 at 4:52 PM Halil Pasic pasic@linux.ibm.com wrote:
I have no intention of pursuing this. When fixing the information leak, I happened to realize that a somewhat similar situation can emerge when mappings are reused. It seemed like an easy fix, so I asked the swiotlb maintainers, and they agreed. It ain't my field of expertise, and the drivers I'm interested in don't need this functionality.
Ok.
That said, I think you are putting yourself down when you said in an earlier email that you aren't very knowledgeable in this area.
I think the fact that you *did* think of this other similar situation is actually very interesting, and it's something people probably _haven't_ been thinking about.
So I think your first commit fixes the straightforward and common case where you do that "map / partial dma / unmap" case.
And that straightforward case is probably all that the disk IO case ever really triggers, which is presumably why those "drivers I'm interested in don't need this functionality" don't need anything else?
And yes, your second commit didn't work, but hey, whatever. The whole "multiple operations on the same double buffering allocation" situation is something I don't think people have necessarily thought about enough.
And by that I don't mean you. I mean very much the whole history of our dma mapping code.
I then get opinionated and probably too forceful, but please don't take it as being about you - it's about just my frustration with that code - and if it comes off too negative then please accept my apologies.
Linus
On Sun, 27 Mar 2022 17:30:01 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
On Sun, Mar 27, 2022 at 4:52 PM Halil Pasic pasic@linux.ibm.com wrote:
I have no intention of pursuing this. When fixing the information leak, I happened to realize that a somewhat similar situation can emerge when mappings are reused. It seemed like an easy fix, so I asked the swiotlb maintainers, and they agreed. It ain't my field of expertise, and the drivers I'm interested in don't need this functionality.
Ok.
That said, I think you are putting yourself down when you said in an earlier email that you aren't very knowledgeable in this area.
I think the fact that you *did* think of this other similar situation is actually very interesting, and it's something people probably _haven't_ been thinking about.
Thank you!
So I think your first commit fixes the straightforward and common case where you do that "map / partial dma / unmap" case.
And that straightforward case is probably all that the disk IO case ever really triggers, which is presumably why those "drivers I'm interested in don't need this functionality" don't need anything else?
I agree.
And yes, your second commit didn't work, but hey, whatever. The whole "multiple operations on the same double buffering allocation" situation is something I don't think people have necessarily thought about enough.
And by that I don't mean you. I mean very much the whole history of our dma mapping code.
I agree. We are in the process of catching up! :) My idea was to aid the process as a relatively naive pair of eyes: somebody who hasn't read any data sheets describing non-cache-coherent DMA, and has never programmed DMA. It is a fairly common problem that for the very knowledgeable certain things seem obvious, self-explanatory or trivial, but for the less knowledgeable they are not. And knowledge can create bias.
I then get opinionated and probably too forceful, but please don't take it as being about you - it's about just my frustration with that code - and if it comes off too negative then please accept my apologies.
I have to admit, I did feel a little uncomfortable, and I did look for an exit strategy. I do believe that people in your position do have to occasionally get forceful, and even abrasive, to maintain efficiency. I try to not ignore the social aspect of things, but I do get carried away occasionally.
Especially your last paragraph is very encouraging and welcome. Thank you!
Regards, Halil
[..]
On Sat, 26 Mar 2022 22:06:15 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
On Sat, Mar 26, 2022 at 8:49 PM Halil Pasic pasic@linux.ibm.com wrote:
I agree that the CPU modifying buffers *concurrently* with DMA can never work, and I believe the ownership model was conceived to prevent this situation.
But that just means that the "ownership" model is garbage, and cannot handle this REAL LIFE situation.
Here's the deal: if somebody makes a model that is counter-factual, you have exactly two choices:
- fix the damn model
- live in a la-la-land that isn't reality
Which choice do you pick?
I pix "fix the dam model". This email of mine was supposed to discuss how the model can be fixed: https://www.spinics.net/lists/linux-wireless/msg222442.html
And I'll be brutally honest: if you pick the la-la-land one, I don't think we can really discuss this any more.
I completely agree. Never intended to pick the la-la-land one.
But a CPU can modify the buffer *after* DMA has written to it, while the mapping is still alive.
Yes.
But you are making ALL your arguments based on that "ownership" model that we now know is garbage.
Sorry, I'm not very knowledgeable when it comes to all the different hardware out there.
If you make your arguments based on garbage, the end result _might_ work just by happenstance, but the arguments sure aren't very convincing, are they?
No it is not. I have to admit, I did see some value in talking about the model that is described by the current documentation, for two reasons:
1) Not everybody has great knowledge of all the hardware out there, and there might be people other than me who based their work on that broken model, and thus wrote code that is correct within the broken model but not correct within the fixed model. That does not seem to be the case here, but I was not able to tell.
2) To fix the model, we have to understand both reality and the model. Or we have to replace it with an entirely new one.
So let's make it really clear that the arguments must not be based on some "ownership" model that you just admitted cannot handle the very real case of real and common hardware.
Ok?
Sure. It was never my intention to base my argument on false assumptions.
For example one could do one partial read from the device, *after* the DMA is done, sync_for_cpu(DMA_FROM_DEVICE), examine, then zero out the entire buffer, sync_for_device(DMA_FROM_DEVICE)
So the result you want to get to I can believe in, but your sequence of getting there is untenable, since it depends on breaking other cases that are more important than your utterly broken hardware that you don't even know how much data it filled.
I agree, and that is the very reason I said I'm not against the partial revert (see https://www.spinics.net/lists/linux-wireless/msg222326.html)
Hey, I don't even know if there is a single usage of what I described. In fact I asked the community whether this is even something legit. What I know is that the current (broken) model does allow the scenario.
And I fundamentally also happen to think that since the CPU just wrote that buffer, and *that* write is what you want to sync with the device, then that DMA_FROM_DEVICE was just pure fantasy to begin with.
Not sync with the device, but with the memory. And it is supposed to happen after the DMA has completed, and not while the DMA is ongoing.
But I'm clearly not knowledgeable enough to have this discussion. I'm afraid if I continue, I will just drag the community down.
So that code sequence you quote is wrong. You are literally trying to re-introduce the semantics that we already know broke the ath9k driver.
Instead, let me give you a couple of alternative scenarios.
Alternative 1:
- ok, so the CPU wrote zeroes to the area, and we want to tell the DMA mapping code that it needs to make that CPU write visible to the device.
Not make it visible to the device, but actually make it reach RAM instead of remaining in the cache. My most basic mental model is:
+----------------+
|                |       +---------+            +--------+
|   +--------+   |       |         |    DMA     |        |
|CPU| CACHE  |   | <---> |   RAM   | <--------> | DEVICE |
|   +--------+   |       |         |            |        |
|                |       +---------+            +--------+
+----------------+
Ahh, you mean "sync_for_device(DMA_TO_DEVICE)"?
Yes.
The "for_device()" tells us that afterwards, the device can access
the memory (we synced it for the device).
And the DMA_TO_DEVICE tells us that we're transferring the zeroes we wrote on the CPU to the device bounce buffer.
Now the device got those zeroes, and it can overwrite them (partially), and everything is fine
- And then we need to use "sync_for_cpu(DMA_FROM_DEVICE)" when we want to see the result once it's all done?
Yes.
Splendid. Except I don't like how you mix DMA_TO_DEVICE and DMA_FROM_DEVICE and you made the dma_alloc() not use DMA_BIDIRECTIONAL.
- Yeah, that's ugly, but it matches reality, and it would "just work" today.
It is certainly an option. The tricky part is that one is supposed to use DMA_TO_DEVICE even if the device does not read RAM but only writes it, so the direction of the actual direct memory access (DMA) data flow is from the device to the memory (RAM).
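As a driver-side sequence, Alternative 1 would look something like this (a sketch; buf, len, dev and dma_addr are whatever the driver already has, and what direction the mapping itself should use is exactly the ugly part being discussed):

```c
memset(buf, 0, len);				/* CPU pre-fills the area */
dma_sync_single_for_device(dev, dma_addr, len,
			   DMA_TO_DEVICE);	/* push the zeroes out */

/* ... the device DMAs a partial result into the area ... */

dma_sync_single_for_cpu(dev, dma_addr, len,
			DMA_FROM_DEVICE);	/* pull the result back in */
/* the CPU can now look at buf */
```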
Alternative 2:
- Ok, so the CPU doesn't even want to write to the area AT ALL, but we know we have a broken device that might not fill all of the bounce buffer, and we don't want to see old stale bounce buffer contents that could come from some other operation entirely and leak sensitive data that wasn't for us.
Ahh, so you really want to just clear the bounce buffer before IO?
Yes. So let's introduce a "clear_bounce_buffer()" operation before starting DMA. Let's call it "sync_for_device(DMA_NONE)" and teach the non-bounce-buffer dma mapping entities to just ignore it, because they don't have a bounce buffer that can contain stale data.
- Sounds good.
It is an option.
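On the bounce-buffer side, Alternative 2 might look roughly like this. Neither the DMA_NONE convention nor this function exists today, and how the bounce buffer memory is actually addressed is glossed over:

```c
static void bounce_sync_for_device_sketch(phys_addr_t tlb_addr, size_t size,
					  enum dma_data_direction dir)
{
	if (dir == DMA_NONE) {
		/* nothing stale can leak to the device-visible side now */
		memset(phys_to_virt(tlb_addr), 0, size);
		return;
	}
	/* ... the existing DMA_TO_DEVICE / DMA_FROM_DEVICE handling ... */
}
```

Mapping code that doesn't bounce would simply treat DMA_NONE as a no-op.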
Alternative 3:
- Ok, you have me stumped. I can't come up with something else sane.
I have tried to in https://www.spinics.net/lists/linux-wireless/msg222442.html
Anyway, notice what's common about these alternatives? They are based not on some "we have a broken model", but on trying to solve the actual real-life problem case.
Yes.
I'm more than happy to hear other alternatives.
But the alternative I am _not_ willing to entertain is "Yeah, we have a model of ownership, and that can't handle ath9k because that one wants to do CPU reads while DMA is possibly active, so ath9k is broken".
Can we please agree on that?
Yes.
AFAIU the main reason we should postulate that the ath9k code is correct is that it would be prohibitively expensive not to do so: we can comparatively easily change ath9k, but we can't be sure about other uses of sync_for_device unless we audit all of them. That does make perfect sense to me.
IMHO that also means that the whole "ownership" concept is beyond saving, and that we should rewrite the corresponding parts of the documentation. Thus my effort at saving it was misguided.
Given how this unfolded, I intend to let the more knowledgeable people hash out the new model, and avoid dragging the community down by being too vocal about my opinion.
For the record, I believe that the partial revert proposed here https://www.spinics.net/lists/linux-wireless/msg222300.html would have been a wiser choice than a complete revert of commit aa6f8dcbab47 ("swiotlb: rework "fix info leak with DMA_FROM_DEVICE""). The reason why is in the commit message, and was also pointed out by Robin.
Regards, Halil
On Sun, Mar 27, 2022 at 4:37 PM Halil Pasic pasic@linux.ibm.com wrote:
For the record, I believe that the partial revert proposed here https://www.spinics.net/lists/linux-wireless/msg222300.html would have been a wiser choice than a complete revert of commit aa6f8dcbab47 ("swiotlb: rework "fix info leak with DMA_FROM_DEVICE"").
Yeah, the revert is basically my standard "this doesn't work, discussion is still ongoing" thing.
I agree that the revert then brought back that DMA_ATTR_SKIP_CPU_SYNC complexity.
So that part of commit aa6f8dcbab47 was probably all good.
I somehow missed that Oleksandr had a tested-by for that smaller change too.
Linus
Hello.
On čtvrtek 24. března 2022 18:07:29 CET Toke Høiland-Jørgensen wrote:
Right, but is that sync_for_device call really needed? AFAICT, that ath9k_hw_process_rxdesc_edma() invocation doesn't actually modify any of the data when it returns EINPROGRESS, so could we just skip it? Like the patch below? Or am I misunderstanding the semantics here?
-Toke
diff --git a/drivers/net/wireless/ath/ath9k/recv.c b/drivers/net/wireless/ath/ath9k/recv.c
index 0c0624a3b40d..19244d4c0ada 100644
--- a/drivers/net/wireless/ath/ath9k/recv.c
+++ b/drivers/net/wireless/ath/ath9k/recv.c
@@ -647,12 +647,8 @@ static bool ath_edma_get_buffers(struct ath_softc *sc,
 				common->rx_bufsize, DMA_FROM_DEVICE);
 
 	ret = ath9k_hw_process_rxdesc_edma(ah, rs, skb->data);
-	if (ret == -EINPROGRESS) {
-		/*let device gain the buffer again*/
-		dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
-				common->rx_bufsize, DMA_FROM_DEVICE);
+	if (ret == -EINPROGRESS)
 		return false;
-	}
 
 	__skb_unlink(skb, &rx_edma->rx_fifo);
 	if (ret == -EINVAL) {
With this patch and both ddbd89deb7d3+aa6f8dcbab47 in place the AP works for me.
Thanks.
On 25.03.22 08:12, Oleksandr Natalenko wrote:
On čtvrtek 24. března 2022 18:07:29 CET Toke Høiland-Jørgensen wrote:
Right, but is that sync_for_device call really needed? AFAICT, that ath9k_hw_process_rxdesc_edma() invocation doesn't actually modify any of the data when it returns EINPROGRESS, so could we just skip it? Like the patch below? Or am I misunderstanding the semantics here?
diff --git a/drivers/net/wireless/ath/ath9k/recv.c b/drivers/net/wireless/ath/ath9k/recv.c
index 0c0624a3b40d..19244d4c0ada 100644
--- a/drivers/net/wireless/ath/ath9k/recv.c
+++ b/drivers/net/wireless/ath/ath9k/recv.c
@@ -647,12 +647,8 @@ static bool ath_edma_get_buffers(struct ath_softc *sc,
 				common->rx_bufsize, DMA_FROM_DEVICE);
 
 	ret = ath9k_hw_process_rxdesc_edma(ah, rs, skb->data);
-	if (ret == -EINPROGRESS) {
-		/*let device gain the buffer again*/
-		dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
-				common->rx_bufsize, DMA_FROM_DEVICE);
+	if (ret == -EINPROGRESS)
 		return false;
-	}
 
 	__skb_unlink(skb, &rx_edma->rx_fifo);
 	if (ret == -EINVAL) {
With this patch and both ddbd89deb7d3+aa6f8dcbab47 in place the AP works for me.
TWIMC: If anyone needs more testers or something, I noticed two bug reports in bko about this problem:
https://bugzilla.kernel.org/show_bug.cgi?id=215703 https://bugzilla.kernel.org/show_bug.cgi?id=215698
I'll point both to this discussion and the patch.
Ciao, Thorsten
On Thu, 24 Mar 2022 16:52:31 +0000 Robin Murphy robin.murphy@arm.com wrote:
Creating a new mapping for the same buffer before unmapping the previous one does look rather bogus. But it does not fit the pattern where reverting the sync_single changes makes the driver work again.
OK, you made me look :)
Now that it's obvious what to look for, I can only conclude that during the stanza in ath_edma_get_buffers(), the device is still writing to the buffer while ownership has been transferred to the CPU, and whatever got written while ath9k_hw_process_rxdesc_edma() was running then gets wiped out by the subsequent sync_for_device, which currently resets the SWIOTLB slot to the state that sync_for_cpu copied out. By the letter of the DMA API that's not allowed, but on the other hand I'm not sure if we even have a good idiom for "I can't tell if the device has finished with this buffer or not unless I look at it" :/
I agree with your analysis. Especially with the latter part (where you state that we don't have a good idiom for that use case).
I believe, a stronger statement is also true: it is fundamentally impossible to accommodate use cases where the device and the cpu need concurrent access to a dma buffer, if the dma buffer isn't in dma coherent memory.
If the dma buffer is in dma coherent memory, and we don't need swiotlb, then we don't need the dma_sync functionality.
Specifically for swiotlb, if the swiotlb buffer is in dma coherent memory, the driver could peek the swiotlb buffer, but I have no idea if we can provide a remotely sane API for that. The point is the driver would have to peek not via a pointer to the original buffer, but via a suitable pointer to the bounce buffer, which would probably be considered valid as long as the mapping is valid. Honestly IMHO quite ugly, but I see no other way.
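Roughly, such a (purely hypothetical) peek interface might look like:

```c
/*
 * Hypothetical: hand the driver a CPU pointer to the bounce buffer itself
 * (or to the original buffer when nothing is bounced), valid only while
 * the mapping is alive, and only for looking at the data.  Nothing like
 * this exists in the DMA API today.
 */
const void *dma_peek(struct device *dev, dma_addr_t handle, size_t size);
```

ath9k could then inspect the descriptor through the peeked pointer instead of through skb->data while the mapping is still live.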
Regards, Halil
On Thu, Mar 24, 2022 at 07:31:58PM +0100, Halil Pasic wrote:
I agree with your analysis. Especially with the latter part (where you state that we don't have a good idiom for that use case).
I believe, a stronger statement is also true: it is fundamentally impossible to accommodate use cases where the device and the cpu need concurrent access to a dma buffer, if the dma buffer isn't in dma coherent memory.
Yes, and that is also clearly stated in the DMA API document. We only have two platforms that do not support DMA coherent memory, one being the oldest PARISC platforms, and the other coldfire.
The first has drivers carefully written to actually support that, the second only has a single driver using DMA that does manual global cache flushes (while pretending it supports coherent memory).
If the dma buffer is in dma coherent memory, and we don't need swiotlb, then we don't need the dma_sync functionality.
Yes.
On Wed, 23 Mar 2022 20:54:08 +0000 Robin Murphy robin.murphy@arm.com wrote:
On 2022-03-23 19:16, Linus Torvalds wrote:
On Wed, Mar 23, 2022 at 12:06 PM Robin Murphy robin.murphy@arm.com wrote:
On 2022-03-23 17:27, Linus Torvalds wrote:
I'm assuming that the ath9k issue is that it gives DMA mapping a big enough area to handle any possible packet size, and just expects - quite reasonably - smaller packets to only fill the part they need.
Which that "info leak" patch obviously breaks entirely.
Except that's the exact case which the new patch is addressing
Not "addressing". Breaking.
Which is why it will almost certainly get reverted.
Not doing DMA to the whole area seems to be quite the sane thing to do for things like network packets, and overwriting the part that didn't get DMA'd with zeroes seems to be exactly the wrong thing here.
So the SG_IO - and other random untrusted block command sources - data leak will almost certainly have to be addressed differently. Possibly by simply allocating the area with GFP_ZERO to begin with.
Er, the point of the block layer case is that the whole area *is* zeroed to begin with, and a latent memory corruption problem in SWIOTLB itself replaces those zeros with random other kernel data unexpectedly. Let me try illustrating some sequences for clarity...
Expected behaviour/without SWIOTLB:
                              Memory
start                         12345678
dma_map(DMA_FROM_DEVICE)      no-op
device writes partial data    12ABC678                 <- ABC
dma_unmap(DMA_FROM_DEVICE)    12ABC678

SWIOTLB previously:
                              Memory        Bounce buffer
start                         12345678      xxxxxxxx
dma_map(DMA_FROM_DEVICE)      no-op
device writes partial data    12345678      xxABCxxx   <- ABC
dma_unmap(DMA_FROM_DEVICE)    xxABCxxx  <-  xxABCxxx

SWIOTLB Now:
                              Memory        Bounce buffer
start                         12345678      xxxxxxxx
dma_map(DMA_FROM_DEVICE)      12345678  ->  12345678
device writes partial data    12345678      12ABC678   <- ABC
dma_unmap(DMA_FROM_DEVICE)    12ABC678  <-  12ABC678
Now, sure we can prevent any actual information leakage by initialising the bounce buffer slot with zeros, but then we're just corrupting the not-written-to parts of the mapping with zeros instead of anyone else's old data. That's still fundamentally not OK. The only thing SWIOTLB can do to be correct is treat DMA_FROM_DEVICE as a read-modify-write of the entire mapping, because it has no way to know how much of it is actually going to be modified.
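The map-time half of the fix is essentially that read-modify-write treatment; as a sketch (not the literal commit): pre-fill the bounce slot from the original buffer even for DMA_FROM_DEVICE, so whatever the device does not overwrite still unmaps back as the caller's own, already initialised data.

```c
static void bounce_slot_init_sketch(phys_addr_t orig_addr, phys_addr_t tlb_addr,
				    size_t size)
{
	/* unconditional copy original -> bounce, whatever the direction */
	memcpy(phys_to_virt(tlb_addr), phys_to_virt(orig_addr), size);
}
```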
Very nice explanation! Thanks!
I'll admit I still never quite grasped the reason for also adding the override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I think by that point we were increasingly tired and confused and starting to second-guess ourselves (well, I was, at least).
I raised the question whether we need to do the same for swiotlb_sync_single_for_device(), based on my understanding of the DMA API documentation. I had the following scenario in mind:
SWIOTLB without the sync_single:
                                  Memory        Bounce buffer      Owner
--------------------------------------------------------------------------
start                             12345678      xxxxxxxx           C
dma_map(DMA_FROM_DEVICE)          12345678  ->  12345678           C->D
device writes partial data        12345678      12ABC678  <- ABC   D
sync_for_cpu(DMA_FROM_DEVICE)     12ABC678  <-  12ABC678           D->C
cpu modifies buffer               66666666      12ABC678           C
sync_for_device(DMA_FROM_DEVICE)  66666666      12ABC678           C->D
device writes partial data        66666666      1EFGC678  <- EFG   D
dma_unmap(DMA_FROM_DEVICE)        1EFGC678  <-  1EFGC678           D->C
Legend: in Owner column C stands for cpu and D for device.
Without swiotlb, I believe we should have arrived at 6EFG6666. To get the same result, IMHO, we need to do a sync in sync_for_device(). And aa6f8dcbab47 is an imperfect solution to that (because of size).
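The step that table calls for in sync_for_device() would be, in isolation, just a refresh of the bounce slot from memory, so that the CPU's "66666666" is what the device's next partial write lands on top of. A sketch (and note that this same copy is exactly what wipes out in-flight device data in the ath9k reuse pattern):

```c
static void bounce_refresh_from_memory(phys_addr_t orig_addr,
				       phys_addr_t tlb_addr, size_t size)
{
	memcpy(phys_to_virt(tlb_addr), phys_to_virt(orig_addr), size);
}
```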
I don't think it's wrong per se, but as I said I do think it can bite anyone who's been doing dma_sync_*() wrong but getting away with it until now.
I fully agree.
If ddbd89deb7d3 alone turns out to work OK then I'd be inclined to try a partial revert of just that one hunk.
I'm not against being pragmatic and doing the partial revert. But as explained above, I do believe for correctness of swiotlb we ultimately do need that change. So if the revert is the short term solution, what should be our mid-term road-map?
Regards, Halil
Thanks, Robin.
On Thu, 24 Mar 2022 19:02:16 +0100 Halil Pasic pasic@linux.ibm.com wrote:
I'll admit I still never quite grasped the reason for also adding the override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I think by that point we were increasingly tired and confused and starting to second-guess ourselves (well, I was, at least).
I raised the question whether we need to do the same for swiotlb_sync_single_for_device(), based on my understanding of the DMA API documentation. I had the following scenario in mind:
SWIOTLB without the sync_single:
                                  Memory        Bounce buffer      Owner
--------------------------------------------------------------------------
start                             12345678      xxxxxxxx           C
dma_map(DMA_FROM_DEVICE)          12345678  ->  12345678           C->D
device writes partial data        12345678      12ABC678  <- ABC   D
sync_for_cpu(DMA_FROM_DEVICE)     12ABC678  <-  12ABC678           D->C
cpu modifies buffer               66666666      12ABC678           C
sync_for_device(DMA_FROM_DEVICE)  66666666      12ABC678           C->D
device writes partial data        66666666      1EFGC678  <- EFG   D
dma_unmap(DMA_FROM_DEVICE)        1EFGC678  <-  1EFGC678           D->C
Legend: in Owner column C stands for cpu and D for device.
Without swiotlb, I believe we should have arrived at 6EFG6666. To get the same result, IMHO, we need to do a sync in sync_for_device(). And aa6f8dcbab47 is an imperfect solution to that (because of size).
@Robin, Christoph: Do we consider this a valid scenario?
Regards, Halil
On 2022-03-25 15:25, Halil Pasic wrote:
On Thu, 24 Mar 2022 19:02:16 +0100 Halil Pasic pasic@linux.ibm.com wrote:
I'll admit I still never quite grasped the reason for also adding the override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I think by that point we were increasingly tired and confused and starting to second-guess ourselves (well, I was, at least).
I raised the question whether we need to do the same for swiotlb_sync_single_for_device(), based on my understanding of the DMA API documentation. I had the following scenario in mind:
SWIOTLB without the sync_single:
                                  Memory        Bounce buffer      Owner
--------------------------------------------------------------------------
start                             12345678      xxxxxxxx           C
dma_map(DMA_FROM_DEVICE)          12345678  ->  12345678           C->D
device writes partial data        12345678      12ABC678  <- ABC   D
sync_for_cpu(DMA_FROM_DEVICE)     12ABC678  <-  12ABC678           D->C
cpu modifies buffer               66666666      12ABC678           C
sync_for_device(DMA_FROM_DEVICE)  66666666      12ABC678           C->D
device writes partial data        66666666      1EFGC678  <- EFG   D
dma_unmap(DMA_FROM_DEVICE)        1EFGC678  <-  1EFGC678           D->C
Legend: in Owner column C stands for cpu and D for device.
Without swiotlb, I believe we should have arrived at 6EFG6666. To get the same result, IMHO, we need to do a sync in sync_for_device(). And aa6f8dcbab47 is an imperfect solution to that (because of size).
@Robin, Christoph: Do we consider this a valid scenario?
Aha, I see it now (turns out diagrams really do help!) - so essentially the original situation but with buffer recycling thrown into the mix as well... I think it's technically valid, but do we know if anything's actually doing that in a way which ends up affected? For sure it would be nice to know that we had all bases covered without having to audit whether we need to, but if it's fundamentally incompatible with what other code expects, that we know *is* being widely used, and however questionable it may be we don't have an easy fix for, then we're in a bit of a tough spot :(
Thanks, Robin.
On Thu, Mar 24, 2022 at 07:02:16PM +0100, Halil Pasic wrote:
If ddbd89deb7d3 alone turns out to work OK then I'd be inclined to try a partial revert of just that one hunk.
I'm not against being pragmatic and doing the partial revert. But as explained above, I do believe for correctness of swiotlb we ultimately do need that change. So if the revert is the short term solution, what should be our mid-term road-map?
Unless I'm misunderstanding this thread we found the bug in ath9k and have a fix for that now?
Christoph Hellwig hch@lst.de writes:
On Thu, Mar 24, 2022 at 07:02:16PM +0100, Halil Pasic wrote:
If ddbd89deb7d3 alone turns out to work OK then I'd be inclined to try a partial revert of just that one hunk.
I'm not against being pragmatic and doing the partial revert. But as explained above, I do believe for correctness of swiotlb we ultimately do need that change. So if the revert is the short term solution, what should be our mid-term road-map?
Unless I'm misunderstanding this thread we found the bug in ath9k and have a fix for that now?
According to Maxim's comment on the other subthread, that ath9k patch wouldn't work on all platforms (and constitutes a bit of a violation of the DMA API ownership abstraction). So not quite, I think?
-Toke
On 2022-03-25 18:15, Toke Høiland-Jørgensen wrote:
Christoph Hellwig hch@lst.de writes:
On Thu, Mar 24, 2022 at 07:02:16PM +0100, Halil Pasic wrote:
If ddbd89deb7d3 alone turns out to work OK then I'd be inclined to try a partial revert of just that one hunk.
I'm not against being pragmatic and doing the partial revert. But as explained above, I do believe for correctness of swiotlb we ultimately do need that change. So if the revert is the short term solution, what should be our mid-term road-map?
Unless I'm misunderstanding this thread we found the bug in ath9k and have a fix for that now?
According to Maxim's comment on the other subthread, that ath9k patch wouldn't work on all platforms (and constitutes a bit of a violation of the DMA API ownership abstraction). So not quite, I think?
Indeed, it would potentially stand to pose the same problem as the SWIOTLB change, but on the scale of individual cache lines touched by ath9k_hw_process_rxdesc_edma() rather than the whole buffer. However, that might represent a less severe impact on a smaller number of users (maybe the MIPS systems? I'm not sure...) so perhaps it's an acceptable tourniquet? Note that the current code is already a violation of the DMA API (because the device keeps writing even when it doesn't have ownership), so there's not a very strong argument in that regard.
Thanks, Robin.
On Fri, Mar 25, 2022 at 11:42 AM Robin Murphy robin.murphy@arm.com wrote:
Note that the current code is already a violation of the DMA API (because the device keeps writing even when it doesn't have ownership), so there's not a very strong argument in that regard.
See my other email. I actually think that the ath9k code is 100% correct, and it's the dma-mapping code that is in violation of the rules.
And a big part of the problem - I think - is that the rules are so badly documented and not explicitly listed.
I think my list of three different sync cases (not just two! It's not just about whether to sync for the CPU or the device, it's also about what direction the data itself is taking) is correct.
But maybe I'm wrong.
I really want people to think about this, because right now my gut feel is that commit aa6f8dcbab47 was just absolutely incorrect.
Linus
On Fri, Mar 25, 2022 at 11:46:09AM -0700, Linus Torvalds wrote:
I think my list of three different sync cases (not just two! It's not just about whether to sync for the CPU or the device, it's also about what direction the data itself is taking) is correct.
But maybe I'm wrong.
At the high level you are correct. It is all about which direction the data is taking. That is the direction argument that all the map/unmap/sync call take. The sync calls then just toggle the ownership. You seem to hate that ownership concept, but I don't see how things could work without that ownership concept as we're going to be in trouble without having that. And yes, a peek operation could work in some cases, but it would have to be at the cache line granularity.
arch/arc/mm/dma.c has a really good comment on how these transfers relate to actual cache operations, btw:
 *
 *          |   map          ==  for_device    |   unmap     ==  for_cpu
 *          |---------------------------------------------------------------
 * TO_DEV   |   writeback        writeback     |   none          none
 * FROM_DEV |   invalidate       invalidate    |   invalidate*   invalidate*
 * BIDIR    |   writeback+inv    writeback+inv |   invalidate    invalidate
 *
 *     [*] needed for CPU speculative prefetches
From: Christoph Hellwig
Sent: 28 March 2022 07:37
On Fri, Mar 25, 2022 at 11:46:09AM -0700, Linus Torvalds wrote:
I think my list of three different sync cases (not just two! It's not just about whether to sync for the CPU or the device, it's also about what direction the data itself is taking) is correct.
But maybe I'm wrong.
At the high level you are correct. It is all about which direction the data is taking. That is the direction argument that all the map/unmap/sync call take. The sync calls then just toggle the ownership. You seem to hate that ownership concept, but I don't see how things could work without that ownership concept as we're going to be in trouble without having that. And yes, a peek operation could work in some cases, but it would have to be at the cache line granularity.
I don't think it is really 'ownership' but more about who has write access. Only one side can have write access (to a cache line [1]) at any one time.
Read access is different. You need a 'synchronise' action to pick up newly written data. This might be a data copy, cache flush or cache invalidate. It only needs to affect the area that needs to be read - not the full buffer. Partial cache flush/invalidate will almost certainly speed up receipt of short network packets that are copied into a new skb - leaving the old one mapped for another receive.
[1] The cache line size might be a property of the device and dma subsystem, not just the cpu. I have used hardware when the effective size was 1kB.
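The DMA API does already let a driver sync just the interesting part of a mapped buffer; a sketch, with rx_dma_addr and hdr_len standing in for whatever the driver tracks:

```c
/* sync only the first hdr_len bytes the CPU actually needs to look at */
dma_sync_single_range_for_cpu(dev, rx_dma_addr, 0, hdr_len, DMA_FROM_DEVICE);
```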
David
On Mon, 28 Mar 2022 08:37:23 +0200 Christoph Hellwig hch@lst.de wrote:
On Fri, Mar 25, 2022 at 11:46:09AM -0700, Linus Torvalds wrote:
I think my list of three different sync cases (not just two! It's not just about whether to sync for the CPU or the device, it's also about what direction the data itself is taking) is correct.
But maybe I'm wrong.
At the high level you are correct. It is all about which direction the data is taking. That is the direction argument that all the map/unmap/sync call take. The sync calls then just toggle the ownership. You seem to hate that ownership concept, but I don't see how things could work without that ownership concept as we're going to be in trouble without having that. And yes, a peek operation could work in some cases, but it would have to be at the cache line granularity.
arch/arc/mm/dma.c has a really good comment on how these transfers relate to actual cache operations, btw:
          |   map          ==  for_device    |   unmap     ==  for_cpu
          |---------------------------------------------------------------
TO_DEV    |   writeback        writeback     |   none          none
FROM_DEV  |   invalidate       invalidate    |   invalidate*   invalidate*
Very interesting! Does that mean that
memset(buf, 0, size);
dma_map(buf, size, FROM_DEVICE)
device does a partial write
dma_unmap(buf, size, FROM_DEVICE)
may boil down to being the same as without the memset(), because the effect of the memset() may get thrown away by the cache invalidate?
That is, in this scenario we could actually leak the original content of the buffer if we have a non-dma-coherent cache?
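Annotated with the cache operations from that table (and using the same shorthand as above), the worry looks like this, assuming the map really is a pure invalidate and nothing writes the zeroes back first:

```c
memset(buf, 0, size);			/* the zeroes may sit only in the CPU cache    */
dma_map(buf, size, FROM_DEVICE);	/* invalidate: the dirty zero-filled lines     */
					/* may simply be discarded, never hitting RAM  */
/* device does a partial write straight to RAM */
dma_unmap(buf, size, FROM_DEVICE);	/* invalidate again: the CPU now reads RAM,    */
					/* i.e. the old contents wherever the device   */
					/* did not write                               */
```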
Regards, Halil
BIDIR     |   writeback+inv    writeback+inv |   invalidate    invalidate
[*] needed for CPU speculative prefetches
Hello.
On středa 23. března 2022 18:27:21 CET Linus Torvalds wrote:
On Wed, Mar 23, 2022 at 12:19 AM Oleksandr Natalenko oleksandr@natalenko.name wrote:
These commits appeared in v5.17 and v5.16.15, and both kernels are broken for me. I'm pretty confident these commits make the difference since I've built both v5.17 and v5.16.15 without them, and it fixed the issue.
Can you double-check (or just explicitly confirm if you already did that test) that you need to revert *both* of those commits, and it's the later "rework" fix that triggers it?
I can confirm that if I revert aa6f8dcbab47 only, but leave ddbd89deb7d3 in place, AP works. So, it seems that the latest "rework" triggers the issue for me.
Thanks.