On the 88W8897 card it's very important the TX ring write pointer is updated correctly to its new value before setting the TX ready interrupt, otherwise the firmware appears to crash (probably because it's trying to DMA-read from the wrong place). The issue is present in the latest firmware version 15.68.19.p21 of the pcie+usb card.
Since PCI uses "posted writes" when writing to a register, it's not guaranteed that a write will happen immediately. That means the pointer might be outdated when setting the TX ready interrupt, leading to firmware crashes especially when ASPM L1 and L1 substates are enabled (because of the higher link latency, the write will probably take longer).
So fix those firmware crashes by always using a non-posted write for this specific register write. We do that by simply reading back the register after writing it, just as a few other PCI drivers do.
This fixes a bug where during rx/tx traffic and with ASPM L1 substates enabled (the enabled substates are platform dependent), the firmware crashes and eventually a command timeout appears in the logs.
Cc: stable@vger.kernel.org
Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>
---
 drivers/net/wireless/marvell/mwifiex/pcie.c | 26 ++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c b/drivers/net/wireless/marvell/mwifiex/pcie.c
index c6ccce426b49..0eff717ac5fa 100644
--- a/drivers/net/wireless/marvell/mwifiex/pcie.c
+++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
@@ -240,6 +240,20 @@ static int mwifiex_write_reg(struct mwifiex_adapter *adapter, int reg, u32 data)
 	return 0;
 }
 
+/*
+ * This function does a non-posted write into a PCIE card register, ensuring
+ * it's completion before returning.
+ */
+static int mwifiex_write_reg_np(struct mwifiex_adapter *adapter, int reg, u32 data)
+{
+	struct pcie_service_card *card = adapter->card;
+
+	iowrite32(data, card->pci_mmap1 + reg);
+	ioread32(card->pci_mmap1 + reg);
+
+	return 0;
+}
+
 /* This function reads data from PCIE card register.
  */
 static int mwifiex_read_reg(struct mwifiex_adapter *adapter, int reg, u32 *data)
@@ -1482,9 +1496,15 @@ mwifiex_pcie_send_data(struct mwifiex_adapter *adapter, struct sk_buff *skb,
 					reg->tx_rollover_ind);
 
 		rx_val = card->rxbd_rdptr & reg->rx_wrap_mask;
-		/* Write the TX ring write pointer in to reg->tx_wrptr */
-		if (mwifiex_write_reg(adapter, reg->tx_wrptr,
-				      card->txbd_wrptr | rx_val)) {
+		/* Write the TX ring write pointer in to reg->tx_wrptr.
+		 * The firmware (latest version 15.68.19.p21) of the 88W8897
+		 * pcie+usb card seems to crash when getting the TX ready
+		 * interrupt but the TX ring write pointer points to an outdated
+		 * address, so it's important we do a non-posted write here to
+		 * force the completion of the write.
+		 */
+		if (mwifiex_write_reg_np(adapter, reg->tx_wrptr,
+					 card->txbd_wrptr | rx_val)) {
 			mwifiex_dbg(adapter, ERROR,
 				    "SEND DATA: failed to write reg->tx_wrptr\n");
 			ret = -1;
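For readers outside the kernel tree, the write-then-read-back pattern the patch relies on can be modelled in userspace C. This is only a sketch under stated assumptions, not the driver code: `fake_bar`, `fake_iowrite32()`, `fake_ioread32()` and `write_reg_flush()` are made-up names standing in for `card->pci_mmap1`, `iowrite32()`, `ioread32()` and the new helper, and a plain array cannot reproduce any bus timing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the BAR mapping the driver calls card->pci_mmap1. */
#define FAKE_BAR_WORDS 64
static volatile uint32_t fake_bar[FAKE_BAR_WORDS];

/* Userspace stand-ins for the kernel's iowrite32() and ioread32(). */
static void fake_iowrite32(uint32_t val, volatile uint32_t *addr)
{
	*addr = val;
}

static uint32_t fake_ioread32(volatile uint32_t *addr)
{
	return *addr;
}

/*
 * Write a register, then read the same register back.  On a real PCIe
 * link the read may not pass the earlier posted write, so by the time
 * the read returns, the write has reached the device.  In this model
 * the read simply returns the stored value.
 */
static uint32_t write_reg_flush(unsigned int reg, uint32_t data)
{
	/* reg is a byte offset, as in the driver; keep it word aligned. */
	assert(reg % 4 == 0 && reg / 4 < FAKE_BAR_WORDS);

	fake_iowrite32(data, &fake_bar[reg / 4]);
	return fake_ioread32(&fake_bar[reg / 4]);
}
```

Calling `write_reg_flush(0x10, 0x1234)` stores the value at byte offset 0x10 and returns what was read back. Whether that read-back is what actually avoids the firmware crash is exactly what the rest of the thread questions.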
On Tue, Sep 14, 2021 at 01:48:12PM +0200, Jonas Dreßler wrote:
On the 88W8897 card it's very important the TX ring write pointer is updated correctly to its new value before setting the TX ready interrupt, otherwise the firmware appears to crash (probably because it's trying to DMA-read from the wrong place). The issue is present in the latest firmware version 15.68.19.p21 of the pcie+usb card.
Please, be consistent in the commit message(s) and the code (esp. if the term comes from a specification).
Here, PCIe (same in the code, at least that I have noticed, but should be done everywhere).
Since PCI uses "posted writes" when writing to a register, it's not guaranteed that a write will happen immediately. That means the pointer might be outdated when setting the TX ready interrupt, leading to firmware crashes especially when ASPM L1 and L1 substates are enabled (because of the higher link latency, the write will probably take longer).
So fix those firmware crashes by always using a non-posted write for this specific register write. We do that by simply reading back the register after writing it, just as a few other PCI drivers do.
This fixes a bug where during rx/tx traffic and with ASPM L1 substates
Ditto. TX/RX.
enabled (the enabled substates are platform dependent), the firmware crashes and eventually a command timeout appears in the logs.
Should it have a Fixes tag?
Cc: stable@vger.kernel.org
Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>
...
-		/* Write the TX ring write pointer in to reg->tx_wrptr */
-		if (mwifiex_write_reg(adapter, reg->tx_wrptr,
-				      card->txbd_wrptr | rx_val)) {
+		/* Write the TX ring write pointer in to reg->tx_wrptr.
+		 * The firmware (latest version 15.68.19.p21) of the 88W8897
+		 * pcie+usb card seems to crash when getting the TX ready
+		 * interrupt but the TX ring write pointer points to an outdated
+		 * address, so it's important we do a non-posted write here to
+		 * force the completion of the write.
+		 */
+		if (mwifiex_write_reg_np(adapter, reg->tx_wrptr,
+					 card->txbd_wrptr | rx_val)) {
 			mwifiex_dbg(adapter, ERROR,
 				    "SEND DATA: failed to write reg->tx_wrptr\n");
 			ret = -1;
I'm not sure how this is not dead code.
On top of that, I would rather call the old function and explicitly put the dummy read after it.
	/* Write the TX ring write pointer in to reg->tx_wrptr */
	if (mwifiex_write_reg(adapter, reg->tx_wrptr,
			      card->txbd_wrptr | rx_val)) {
		...eliminate dead code in the following patch(es)...
	}
+	/* The firmware (latest version 15.68.19.p21) of the 88W8897
+	 * pcie+usb card seems to crash when getting the TX ready
+	 * interrupt but the TX ring write pointer points to an outdated
+	 * address, so it's important we do a non-posted write here to
+	 * force the completion of the write.
+	 */
	mwifiex_read_reg(...);
Now, since I found the dummy read function to be present, perhaps you need to dive more into the code and understand why it exists.
On 9/22/21 1:17 PM, Andy Shevchenko wrote:
On Tue, Sep 14, 2021 at 01:48:12PM +0200, Jonas Dreßler wrote:
On the 88W8897 card it's very important the TX ring write pointer is updated correctly to its new value before setting the TX ready interrupt, otherwise the firmware appears to crash (probably because it's trying to DMA-read from the wrong place). The issue is present in the latest firmware version 15.68.19.p21 of the pcie+usb card.
Please, be consistent in the commit message(s) and the code (esp. if the term comes from a specification).
Here, PCIe (same in the code, at least that I have noticed, but should be done everywhere).
Since PCI uses "posted writes" when writing to a register, it's not guaranteed that a write will happen immediately. That means the pointer might be outdated when setting the TX ready interrupt, leading to firmware crashes especially when ASPM L1 and L1 substates are enabled (because of the higher link latency, the write will probably take longer).
So fix those firmware crashes by always using a non-posted write for this specific register write. We do that by simply reading back the register after writing it, just as a few other PCI drivers do.
This fixes a bug where during rx/tx traffic and with ASPM L1 substates
Ditto. TX/RX.
enabled (the enabled substates are platform dependent), the firmware crashes and eventually a command timeout appears in the logs.
Should it have a Fixes tag?
Don't think so, there's the infamous (https://bugzilla.kernel.org/show_bug.cgi?id=109681) Bugzilla bug it fixes though, I'll mention that in v3.
Cc: stable@vger.kernel.org
Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>
...
-		/* Write the TX ring write pointer in to reg->tx_wrptr */
-		if (mwifiex_write_reg(adapter, reg->tx_wrptr,
-				      card->txbd_wrptr | rx_val)) {
+		/* Write the TX ring write pointer in to reg->tx_wrptr.
+		 * The firmware (latest version 15.68.19.p21) of the 88W8897
+		 * pcie+usb card seems to crash when getting the TX ready
+		 * interrupt but the TX ring write pointer points to an outdated
+		 * address, so it's important we do a non-posted write here to
+		 * force the completion of the write.
+		 */
+		if (mwifiex_write_reg_np(adapter, reg->tx_wrptr,
+					 card->txbd_wrptr | rx_val)) {
 			mwifiex_dbg(adapter, ERROR,
 				    "SEND DATA: failed to write reg->tx_wrptr\n");
 			ret = -1;
I'm not sure how this is not dead code.
On top of that, I would rather call the old function and explicitly put the dummy read after it.
	/* Write the TX ring write pointer in to reg->tx_wrptr */
	if (mwifiex_write_reg(adapter, reg->tx_wrptr,
			      card->txbd_wrptr | rx_val)) {
		...eliminate dead code in the following patch(es)...
	}
	/* The firmware (latest version 15.68.19.p21) of the 88W8897
	 * pcie+usb card seems to crash when getting the TX ready
	 * interrupt but the TX ring write pointer points to an outdated
	 * address, so it's important we do a non-posted write here to
	 * force the completion of the write.
	 */
	mwifiex_read_reg(...);
Now, since I found the dummy read function to be present, perhaps you need to dive more into the code and understand why it exists.
Interesting, I haven't noticed that mwifiex_write_reg() always returns 0. So are you suggesting to remove that return value and get rid of all the "if (mwifiex_write_reg()) {}" checks in a separate commit?
As for why the dummy read/write functions exist, I have no idea. Looking at git history it seems they were always there (only change is that mwifiex_read_reg() started to handle read errors with commit af05148392f50490c662dccee6c502d9fcba33e2). My bet would be that they were created to be consistent with sdio.c which is the oldest supported bus type in mwifiex.
On Wed, Sep 22, 2021 at 02:08:39PM +0200, Jonas Dreßler wrote:
On 9/22/21 1:17 PM, Andy Shevchenko wrote:
On Tue, Sep 14, 2021 at 01:48:12PM +0200, Jonas Dreßler wrote:
...
Should it have a Fixes tag?
Don't think so, there's the infamous (https://bugzilla.kernel.org/show_bug.cgi?id=109681) Bugzilla bug it fixes though, I'll mention that in v3.
Good idea, use BugLink tag for that!
...
Interesting, I haven't noticed that mwifiex_write_reg() always returns 0. So are you suggesting to remove that return value and get rid of all the "if (mwifiex_write_reg()) {}" checks in a separate commit?
Something like this, yes.
As for why the dummy read/write functions exist, I have no idea. Looking at git history it seems they were always there (only change is that mwifiex_read_reg() started to handle read errors with commit af05148392f50490c662dccee6c502d9fcba33e2). My bet would be that they were created to be consistent with sdio.c which is the oldest supported bus type in mwifiex.
It has a check against all ones. Also, your other patch mentioned wake-up. Perhaps the purpose is to wake the device up, and to return early if it was/is in a power-off mode (D3hot).
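The "check against all ones" mentioned above can be sketched as follows. Reads from a PCIe device that is powered down or otherwise unreachable commonly come back as 0xffffffff, so a register read helper can treat that value as a failure. This is a userspace model with made-up names (`fake_bar`, `fake_device_gone`, `read_reg_checked()`); it returns -1 on failure for simplicity and does not reproduce the return convention of the driver's `mwifiex_read_reg()`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the card's register window. */
#define FAKE_BAR_WORDS 64
static volatile uint32_t fake_bar[FAKE_BAR_WORDS];

/* Hypothetical switch: when set, the fake device behaves as unreachable. */
static int fake_device_gone;

/* Stand-in for ioread32(): an unreachable device reads as all ones. */
static uint32_t fake_ioread32(volatile uint32_t *addr)
{
	return fake_device_gone ? 0xffffffffu : *addr;
}

/*
 * Read a register and reject an all-ones result.
 * Returns 0 on success, -1 if the value read was 0xffffffff.
 */
static int read_reg_checked(unsigned int reg, uint32_t *data)
{
	assert(reg % 4 == 0 && reg / 4 < FAKE_BAR_WORDS);

	*data = fake_ioread32(&fake_bar[reg / 4]);
	if (*data == 0xffffffffu)
		return -1;

	return 0;
}
```

One limitation of this kind of check is that a register which legitimately holds 0xffffffff is indistinguishable from a dead device, so it only suits registers that never take that value.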
From: Jonas Dreßler
Sent: 14 September 2021 12:48
On the 88W8897 card it's very important the TX ring write pointer is updated correctly to its new value before setting the TX ready interrupt, otherwise the firmware appears to crash (probably because it's trying to DMA-read from the wrong place). The issue is present in the latest firmware version 15.68.19.p21 of the pcie+usb card.
Since PCI uses "posted writes" when writing to a register, it's not guaranteed that a write will happen immediately. That means the pointer might be outdated when setting the TX ready interrupt, leading to firmware crashes especially when ASPM L1 and L1 substates are enabled (because of the higher link latency, the write will probably take longer).
So fix those firmware crashes by always using a non-posted write for this specific register write. We do that by simply reading back the register after writing it, just as a few other PCI drivers do.
This fixes a bug where during rx/tx traffic and with ASPM L1 substates enabled (the enabled substates are platform dependent), the firmware crashes and eventually a command timeout appears in the logs.
I think you need to change your terminology. PCIe does have some non-posted write transactions - but I can't remember when they are used.
What you need to say is that you are flushing the PCIe posted writes in order to avoid a timing 'issue' setting the TX ring write pointer.
Quite where the bug is, and why the read-back actually fixes it is another matter.
A typical ethernet transmit needs three things written in the correct order (as seen by the hardware):
1) The transmit frame data.
2) The descriptor ring entry referring to the frame.
3) The 'prod' of the MAC engine to process the frame.
You seem to also have:
2.5) Write the TX ring write pointer to the MAC engine.
The updates of (1) and (2) are normally handled by DMA coherent memory or cache flushes done by using the DMA APIs.
If the writes for (2.5) and (3) are both writing to the PCIe card (which seems likely) then the PCIe spec will guarantee that they happen in the correct order.
This means that the PCIe readback of the (2.5) write doesn't have any effect on the order of the bus cycles seen by the card. So flushing the PCIe write isn't what fixes your problem.
The readback between (2.5) and (3) does have two effects:
a) it adds a short delay between the two writes.
b) it (probably) forces the first write to be flushed through any posted-write buffers on the card itself.
It may well be that the card has separate posted write buffers for different parts of the hardware. In that case the write (3) might get actioned before the write (2.5). OTOH you'd expect that to only cause packet transmit to be delayed.
If the write (2.5) ends up being non-atomic (ie a 64bit write converted to multiple 8 bit writes internally) then you'll hit problems if the mac engine looks at the register while it is being changed just after transmitting the previous packet. (ie when the tx starts before write (3) because the tx logic is active.)
The other horrid possibility is that you have a truly broken PCIe slave that corrupts its posted-write buffer when a second write arrives. If that is actually true then you may need to also add locks to ensure that multiple threads cannot do writes at the same time. Or do all (and I mean all) accesses from a single thread/context.
The latter problem reminds me of a PCI card that got terribly confused if it saw a read request from a 2nd cpu while generating 'cycle rerun' responses to an earlier read request.
Most code that flushes posted writes only needs to do so for writes that drop level-sensitive interrupt requests. Failure to flush those can lead to unexpected interrupts. That problem goes back to VMEbus sunos (amongst others).
David
- Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
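The ordering David describes, (1) frame data, (2) descriptor, (2.5) write pointer, (3) 'prod', can be sketched in userspace C. Everything here is a made-up model: `frames`, `ring`, `reg_tx_wrptr`, `reg_doorbell` and `fake_send()` are hypothetical names, the ring has no "full" check, and the C11 release fence only stands in for the barrier that the DMA API or a write barrier would provide in a real driver.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 4
#define FRAME_MAX 64

struct fake_desc {
	uint32_t frame_idx;	/* which frame buffer this entry refers to */
	uint32_t len;		/* frame length in bytes */
};

/* Memory the device would fetch by DMA (hypothetical layout). */
static uint8_t frames[RING_SIZE][FRAME_MAX];
static struct fake_desc ring[RING_SIZE];

/* Hypothetical device registers. */
static volatile uint32_t reg_tx_wrptr;
static volatile uint32_t reg_doorbell;

/* Driver-side copy of the write pointer. */
static uint32_t wrptr;

/* Queue one frame.  Returns 0 on success, -1 if the frame is too long. */
static int fake_send(const void *data, uint32_t len)
{
	uint32_t slot = wrptr % RING_SIZE;

	assert(data != NULL);
	if (len > FRAME_MAX)
		return -1;

	/* (1) The transmit frame data. */
	memcpy(frames[slot], data, len);

	/* (2) The descriptor ring entry referring to the frame. */
	ring[slot].frame_idx = slot;
	ring[slot].len = len;

	/* Order the memory updates in (1) and (2) before the register writes. */
	atomic_thread_fence(memory_order_release);

	/* (2.5) Tell the device where the ring now ends. */
	wrptr++;
	reg_tx_wrptr = wrptr;

	/* (3) The 'prod' that asks the device to process the frame. */
	reg_doorbell = 1;

	return 0;
}
```

In this model the two register writes target the same device, which is the case where, as noted above, PCIe ordering already delivers (2.5) before (3) without any read-back in between.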
On Wednesday 22 September 2021 14:03:25 David Laight wrote:
From: Jonas Dreßler
Sent: 14 September 2021 12:48
On the 88W8897 card it's very important the TX ring write pointer is updated correctly to its new value before setting the TX ready interrupt, otherwise the firmware appears to crash (probably because it's trying to DMA-read from the wrong place). The issue is present in the latest firmware version 15.68.19.p21 of the pcie+usb card.
Since PCI uses "posted writes" when writing to a register, it's not guaranteed that a write will happen immediately. That means the pointer might be outdated when setting the TX ready interrupt, leading to firmware crashes especially when ASPM L1 and L1 substates are enabled (because of the higher link latency, the write will probably take longer).
So fix those firmware crashes by always using a non-posted write for this specific register write. We do that by simply reading back the register after writing it, just as a few other PCI drivers do.
This fixes a bug where during rx/tx traffic and with ASPM L1 substates enabled (the enabled substates are platform dependent), the firmware crashes and eventually a command timeout appears in the logs.
I think you need to change your terminology. PCIe does have some non-posted write transactions - but I can't remember when they are used.
In PCIe, all memory write requests are posted.
Non-posted writes in PCIe are used only for I/O and config requests. But this is not the case for the proposed patch, as it accesses only the card's memory space.
Technically, this patch does not use a non-posted memory write (as PCIe does not support / provide one); it just adds something like a barrier, and I'm not sure if it is really correct (you already wrote more details about it, so I will let it be).
I'm not sure what the correct terminology is; I do not know what this kind of write-followed-by-read "trick" is properly called.
From: Pali Rohár
Sent: 22 September 2021 15:27
On Wednesday 22 September 2021 14:03:25 David Laight wrote:
From: Jonas Dreßler
Sent: 14 September 2021 12:48
On the 88W8897 card it's very important the TX ring write pointer is updated correctly to its new value before setting the TX ready interrupt, otherwise the firmware appears to crash (probably because it's trying to DMA-read from the wrong place). The issue is present in the latest firmware version 15.68.19.p21 of the pcie+usb card.
Since PCI uses "posted writes" when writing to a register, it's not guaranteed that a write will happen immediately. That means the pointer might be outdated when setting the TX ready interrupt, leading to firmware crashes especially when ASPM L1 and L1 substates are enabled (because of the higher link latency, the write will probably take longer).
So fix those firmware crashes by always using a non-posted write for this specific register write. We do that by simply reading back the register after writing it, just as a few other PCI drivers do.
This fixes a bug where during rx/tx traffic and with ASPM L1 substates enabled (the enabled substates are platform dependent), the firmware crashes and eventually a command timeout appears in the logs.
I think you need to change your terminology. PCIe does have some non-posted write transactions - but I can't remember when they are used.
In PCIe, all memory write requests are posted.
Non-posted writes in PCIe are used only for I/O and config requests. But this is not the case for the proposed patch, as it accesses only the card's memory space.
Technically, this patch does not use a non-posted memory write (as PCIe does not support / provide one); it just adds something like a barrier, and I'm not sure if it is really correct (you already wrote more details about it, so I will let it be).
I'm not sure what the correct terminology is; I do not know what this kind of write-followed-by-read "trick" is properly called.
I think it is probably best to say: "flush the posted write when setting the TX ring write pointer".
The write can get posted in any/all of the following places:
1) The cpu store buffer.
2) The PCIe host bridge.
3) Any other PCIe bridges.
4) The PCIe slave logic in the target. There could be separate buffers for each BAR.
5) The actual target logic for that address block. The target (probably) will look a bit like an old fashioned cpu motherboard with the PCIe slave logic as the main bus master.
The readback forces all the posted write buffers to be flushed.
In this case I suspect it is either flushing (5) or the extra delay of the read TLP processing that 'fixes' the problem.
Note that depending on the exact code and host cpu the second write may not need to wait for the response to the read TLP. So the write, readback, write TLP may be back to back on the actual PCIe link.
Although I don't have access to an actual PCIe monitor we do have the ability to trace 'data' TLP into fpga memory on one of our systems. This is near real-time but they are slightly munged. Watching the TLP can be illuminating!
David
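The effect of a readback on buffered writes can be illustrated with a toy model. This is a sketch under strong assumptions: a single FIFO stands in for all five buffering stages listed above, and `fake_link`, `link_write()`, `link_read()` and `link_flush()` are made-up names. It shows only the ordering rule (a read does not pass earlier posted writes), not timing.

```c
#include <assert.h>
#include <stdint.h>

#define NREGS  8
#define QDEPTH 8

struct posted_write {
	unsigned int reg;
	uint32_t val;
};

struct fake_link {
	uint32_t regs[NREGS];		/* what the device has actually seen */
	struct posted_write q[QDEPTH];	/* writes still in flight */
	unsigned int pending;		/* number of queued writes */
};

/* Deliver all queued writes to the device, oldest first. */
static void link_flush(struct fake_link *l)
{
	for (unsigned int i = 0; i < l->pending; i++)
		l->regs[l->q[i].reg] = l->q[i].val;
	l->pending = 0;
}

/* Posted write: returns at once, the value is only queued. */
static void link_write(struct fake_link *l, unsigned int reg, uint32_t val)
{
	assert(reg < NREGS);

	if (l->pending == QDEPTH)	/* queue full: make room first */
		link_flush(l);

	l->q[l->pending].reg = reg;
	l->q[l->pending].val = val;
	l->pending++;
}

/* Read: may not pass earlier posted writes, so they are drained first. */
static uint32_t link_read(struct fake_link *l, unsigned int reg)
{
	assert(reg < NREGS);

	link_flush(l);
	return l->regs[reg];
}
```

After `link_write(&l, 1, 7)` the device-side `regs[1]` is still unchanged; a subsequent `link_read(&l, 1)` drains the queue and returns 7. The model does not capture the other candidate explanation raised above, namely the extra delay added by processing the read.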
On 9/22/21 5:54 PM, David Laight wrote:
From: Pali Rohár
Sent: 22 September 2021 15:27
On Wednesday 22 September 2021 14:03:25 David Laight wrote:
From: Jonas Dreßler
Sent: 14 September 2021 12:48
On the 88W8897 card it's very important the TX ring write pointer is updated correctly to its new value before setting the TX ready interrupt, otherwise the firmware appears to crash (probably because it's trying to DMA-read from the wrong place). The issue is present in the latest firmware version 15.68.19.p21 of the pcie+usb card.
Since PCI uses "posted writes" when writing to a register, it's not guaranteed that a write will happen immediately. That means the pointer might be outdated when setting the TX ready interrupt, leading to firmware crashes especially when ASPM L1 and L1 substates are enabled (because of the higher link latency, the write will probably take longer).
So fix those firmware crashes by always using a non-posted write for this specific register write. We do that by simply reading back the register after writing it, just as a few other PCI drivers do.
This fixes a bug where during rx/tx traffic and with ASPM L1 substates enabled (the enabled substates are platform dependent), the firmware crashes and eventually a command timeout appears in the logs.
I think you need to change your terminology. PCIe does have some non-posted write transactions - but I can't remember when they are used.
In PCIe, all memory write requests are posted.
Non-posted writes in PCIe are used only for I/O and config requests. But this is not the case for the proposed patch, as it accesses only the card's memory space.
Technically, this patch does not use a non-posted memory write (as PCIe does not support / provide one); it just adds something like a barrier, and I'm not sure if it is really correct (you already wrote more details about it, so I will let it be).
I'm not sure what the correct terminology is; I do not know what this kind of write-followed-by-read "trick" is properly called.
I think it is probably best to say: "flush the posted write when setting the TX ring write pointer".
The write can get posted in any/all of the following places:
1) The cpu store buffer.
2) The PCIe host bridge.
3) Any other PCIe bridges.
4) The PCIe slave logic in the target. There could be separate buffers for each BAR.
5) The actual target logic for that address block. The target (probably) will look a bit like an old fashioned cpu motherboard with the PCIe slave logic as the main bus master.
The readback forces all the posted write buffers to be flushed.
In this case I suspect it is either flushing (5) or the extra delay of the read TLP processing that 'fixes' the problem.
Note that depending on the exact code and host cpu the second write may not need to wait for the response to the read TLP. So the write, readback, write TLP may be back to back on the actual PCIe link.
Although I don't have access to an actual PCIe monitor we do have the ability to trace 'data' TLP into fpga memory on one of our systems. This is near real-time but they are slightly munged. Watching the TLP can be illuminating!
David
Thanks for the detailed explanations, it looks like indeed the read-back is not the real fix here, a simple udelay(50) before sending the "TX ready" interrupt also does the trick.
 	} else {
+		udelay(50);
+
 		/* Send the TX ready interrupt */
 		if (mwifiex_write_reg(adapter, PCIE_CPU_INT_EVENT,
 				      CPU_INTR_DNLD_RDY)) {
I've tested that for a week now and haven't seen any firmware crashes. Interestingly enough it looks like the delay can also be added after setting the "TX ready" interrupt, just not before updating the TX ring write pointer.
I have no idea if 50 usecs is a good duration to wait here, from trying different values I found that 10 to 20 usecs is not enough, but who knows, maybe that's platform dependent?
On 9/30/21 16:27, Jonas Dreßler wrote:
On 9/22/21 5:54 PM, David Laight wrote:
From: Pali Rohár
Sent: 22 September 2021 15:27
On Wednesday 22 September 2021 14:03:25 David Laight wrote:
From: Jonas Dreßler
Sent: 14 September 2021 12:48
On the 88W8897 card it's very important the TX ring write pointer is updated correctly to its new value before setting the TX ready interrupt, otherwise the firmware appears to crash (probably because it's trying to DMA-read from the wrong place). The issue is present in the latest firmware version 15.68.19.p21 of the pcie+usb card.
Since PCI uses "posted writes" when writing to a register, it's not guaranteed that a write will happen immediately. That means the pointer might be outdated when setting the TX ready interrupt, leading to firmware crashes especially when ASPM L1 and L1 substates are enabled (because of the higher link latency, the write will probably take longer).
So fix those firmware crashes by always using a non-posted write for this specific register write. We do that by simply reading back the register after writing it, just as a few other PCI drivers do.
This fixes a bug where during rx/tx traffic and with ASPM L1 substates enabled (the enabled substates are platform dependent), the firmware crashes and eventually a command timeout appears in the logs.
I think you need to change your terminology. PCIe does have some non-posted write transactions - but I can't remember when they are used.
In PCIe, all memory write requests are posted.
Non-posted writes in PCIe are used only for I/O and config requests. But this is not the case for the proposed patch, as it accesses only the card's memory space.
Technically, this patch does not use a non-posted memory write (as PCIe does not support / provide one); it just adds something like a barrier, and I'm not sure if it is really correct (you already wrote more details about it, so I will let it be).
I'm not sure what the correct terminology is; I do not know what this kind of write-followed-by-read "trick" is properly called.
I think it is probably best to say: "flush the posted write when setting the TX ring write pointer".
The write can get posted in any/all of the following places:
1) The cpu store buffer.
2) The PCIe host bridge.
3) Any other PCIe bridges.
4) The PCIe slave logic in the target. There could be separate buffers for each BAR.
5) The actual target logic for that address block. The target (probably) will look a bit like an old fashioned cpu motherboard with the PCIe slave logic as the main bus master.
The readback forces all the posted write buffers to be flushed.
In this case I suspect it is either flushing (5) or the extra delay of the read TLP processing that 'fixes' the problem.
Note that depending on the exact code and host cpu the second write may not need to wait for the response to the read TLP. So the write, readback, write TLP may be back to back on the actual PCIe link.
Although I don't have access to an actual PCIe monitor we do have the ability to trace 'data' TLP into fpga memory on one of our systems. This is near real-time but they are slightly munged. Watching the TLP can be illuminating!
David
Thanks for the detailed explanations, it looks like indeed the read-back is not the real fix here, a simple udelay(50) before sending the "TX ready" interrupt also does the trick.
 	} else {
+		udelay(50);
+
 		/* Send the TX ready interrupt */
 		if (mwifiex_write_reg(adapter, PCIE_CPU_INT_EVENT,
 				      CPU_INTR_DNLD_RDY)) {
I've tested that for a week now and haven't seen any firmware crashes. Interestingly enough it looks like the delay can also be added after setting the "TX ready" interrupt, just not before updating the TX ring write pointer.
I have no idea if 50 usecs is a good duration to wait here, from trying different values I found that 10 to 20 usecs is not enough, but who knows, maybe that's platform dependent?
So I spent the last few days going slightly crazy while trying to dig deeper into this.
My theory was that the udelay() delays some subsequent register write or other communication with the card that would trigger the crash if executed too early after writing the TX ring write pointer. So I tried moving the udelay() around, carefully checking when the crash is gone and when it isn't.
In the end my theory turned out completely wrong, what I found was this: Pinning down the last place where the udelay() is effective gets us here (https://elixir.bootlin.com/linux/latest/source/drivers/net/wireless/marvell/...), right before we bail out of the main process and idle.
I tried adding the udelay() as the first thing we do on the next run of the while-loop after that break, but with that the crash came back.
So what does this mean, we fix the crash by sleeping before idling? Sounds a bit counterintuitive to me...
The only thing I can take away from this is that maybe the udelay() keeps the CPU from entering some powersaving state and with that the PCI bus from entering ASPM states (considering that the crash can also be fixed by disabling ASPM L1.2).