From: Alvaro Gamez Machado alvaro.gamez@hazent.com
[ Upstream commit 939edfaa10f1d22e6af6a84bf4bd96dc49c67302 ]
SPI devices using a (relative) slow frequency need a larger time.
For instance, microblaze running at 83.25MHz and performing a 3 bytes transaction using a 10MHz/16 = 625kHz needed this stall value increased to at least 20. The SPI device is quite slow, but also is the microblaze, so set this value to 32 to give it even more margin.
Signed-off-by: Alvaro Gamez Machado alvaro.gamez@hazent.com Reviewed-by: Ricardo Ribalda ribalda@chromium.org Link: https://patch.msgid.link/20251106134545.31942-1-alvaro.gamez@hazent.com Signed-off-by: Mark Brown broonie@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org ---
LLM Generated explanations, may be completely bogus:
## COMPREHENSIVE ANALYSIS: spi: xilinx: increase number of retries before declaring stall
### 1. COMMIT MESSAGE ANALYSIS
The commit message clearly describes a real-world problem: - **Problem**: SPI devices using slow frequencies need more time before being declared stalled - **Specific case**: Microblaze CPU at 83.25MHz performing 3-byte transaction at 625kHz SPI frequency - **Evidence**: Author empirically determined the value needed to be at least 20; chose 32 for safety margin - **Tags**: Has "Reviewed-by" and "Link" tags, but **NO "Fixes:" or "Cc: stable@vger.kernel.org" tags**
### 2. DEEP CODE RESEARCH
#### A. HOW THE BUG WAS INTRODUCED:
The stall detection mechanism was originally introduced in **commit 5a1314fa697fc6** (2017-11-21) by Ricardo Ribalda. That commit added stall detection logic with `stalled = 10` as a retry counter to detect when the Xilinx SPI core gets permanently stuck due to unknown commands not in the hardware lookup table.
The original commit message explained:
"When the core is configured in C_SPI_MODE > 0, it integrates a lookup
table that automatically configures the core in dual or quad mode based on the command (first byte on the tx fifo). Unfortunately, that list mode_?_memoy_*.mif does not contain all the supported commands by the flash... this leads into a stall that can only be recovered with a soft reset."
The original commit **was tagged with "Cc: stable@vger.kernel.org"** and was successfully backported to all major stable branches (verified present in 6.6.y, 6.1.y, 5.15.y, 5.10.y, 4.19.y).
#### B. DETAILED CODE ANALYSIS:
The code is in `xilinx_spi_txrx_bufs()` function at line 303:
```c rx_words = n_words; stalled = 10; // ← Original value too low for slow configurations while (rx_words) { if (rx_words == n_words && !(stalled--) && !(sr & XSPI_SR_TX_EMPTY_MASK) && (sr & XSPI_SR_RX_EMPTY_MASK)) { dev_err(&spi->dev, "Detected stall. Check C_SPI_MODE and C_SPI_MEMORY\n"); xspi_init_hw(xspi); return -EIO; // ← Transaction fails } // ... read data from RX FIFO ... } ```
**What the buggy code does:** - Initializes `stalled` counter to 10 - In the polling loop, decrements `stalled` on each iteration when no data is available yet - If counter reaches 0 and RX FIFO is still empty (but TX FIFO not empty), declares a hardware stall - Returns -EIO, causing the SPI transaction to fail
**Why it's wrong:** - The value 10 was chosen empirically in 2017, but it doesn't account for **very slow SPI frequencies combined with slow CPUs** - With 625kHz SPI clock and 83.25MHz CPU, the legitimate data transfer time can exceed 10 polling loop iterations - This causes **false positive stall detection** - the hardware is working correctly, just slowly
**The race condition:** - CPU polling loop runs much faster than slow SPI hardware can transfer data - Counter expires before legitimate data arrives in RX FIFO - Hardware appears "stalled" when it's actually just slow
#### C. THE FIX EXPLAINED:
The fix changes line 303 from: ```c stalled = 10; // Old: Too short for slow configurations ``` to: ```c stalled = 32; // New: More margin for slow hardware ```
**Why this solves the problem:** - Gives slow SPI transactions 3.2x more time before declaring a stall - Author tested with real hardware (microblaze + 625kHz SPI) and found 20 was sufficient - Chose 32 to provide additional safety margin - Still provides stall detection for genuine hardware hangs (from the original 2017 issue)
**Subtle implications:** - Genuine hardware stalls will take slightly longer to detect (32 loop iterations vs 10) - This is acceptable because: (1) genuine stalls are rare, (2) the delay is still very short (microseconds to milliseconds), (3) avoiding false positives is more important for reliability
### 3. SECURITY ASSESSMENT
**No security implications.** This is a timing/reliability issue, not a security vulnerability. No CVE, no memory safety issues, no privilege escalation.
### 4. FEATURE VS BUG FIX CLASSIFICATION
**This is definitively a BUG FIX:** - Fixes incorrect hardcoded timeout value - The original value causes false stall detection on slow hardware - Subject uses "increase" rather than "fix", but the intent is clearly to fix a problem - **NOT a feature addition** - doesn't add new functionality, just corrects an inadequate constant
**Does NOT fall under exception categories** (device IDs, quirks, DT, build fixes), but doesn't need to - it's a straightforward bug fix.
### 5. CODE CHANGE SCOPE ASSESSMENT
- **Files changed**: 1 (drivers/spi/spi-xilinx.c) - **Lines changed**: 1 (1 insertion, 1 deletion) - **Complexity**: Trivial - changes a single integer constant - **Scope**: Extremely localized - only affects stall detection timeout
**Assessment**: Minimal scope, surgical fix. This is the ideal type of change for stable trees.
### 6. BUG TYPE AND SEVERITY
- **Type**: False positive error detection / incorrect timeout - **Manifestation**: SPI transactions fail with -EIO error on slow hardware configurations - **Severity**: **MEDIUM** - Not a crash, panic, or data corruption - Not a security issue - Causes service disruption: SPI devices fail to work properly - Affects specific configurations (slow SPI + slow CPU) - User-visible impact: Device failures, spurious error messages
### 7. USER IMPACT EVALUATION
**Who is affected:** - Users of Xilinx SPI controllers (common in AMD/Xilinx FPGA-based embedded systems) - Particularly affects: - Microblaze soft-core CPU users - Systems with slow SPI clock frequencies (< 1MHz) - Embedded systems with constrained CPU speeds - Examples: Industrial control systems, FPGA-based embedded devices, custom hardware with Xilinx IP cores
**Impact scope:** - **Moderate user base**: Xilinx SPI is used in FPGA designs but not as widespread as generic SPI controllers - **High impact for affected users**: Complete SPI failure for those with slow configurations - **Verifiable issue**: Author has real hardware that triggers this bug
**Call analysis**: The `xilinx_spi_txrx_bufs()` function is the main data transfer function called for every SPI transaction on affected hardware. It's in the critical path for all SPI operations.
### 8. REGRESSION RISK ANALYSIS
**Risk level: VERY LOW**
**Reasons:** 1. **Minimal change**: Single constant value modification 2. **Direction of change**: Increasing a timeout is inherently safer than decreasing it 3. **Tested**: Author has tested with real hardware 4. **Reviewed**: Ricardo Ribalda (the original stall detection author!) reviewed it 5. **Backward compatible**: Doesn't change behavior for properly functioning hardware 6. **Limited scope**: Only affects stall detection timing, not data path logic
**Potential risks:** - Genuine hardware stalls detected slightly later (32 iterations vs 10) - Mitigation: Still quick enough (microseconds), rare occurrence - None identified that would break working systems
### 9. MAINLINE STABILITY
- **Commit date**: November 7, 2024 (appears as 2025 but likely 2024) - **Testing evidence**: - Reviewed-by: Ricardo Ribalda (original stall detection author) - Author tested with real hardware (microblaze + slow SPI) - **Mainline presence**: In v6.18-rc6 (recent mainline development kernel) - **Time in mainline**: Very recent - approximately 2 weeks if November 2024
**Concern**: This is quite recent. Ideally would benefit from more time in mainline (several weeks to months) before backporting.
### 10. DEPENDENCY CHECK
**No dependencies identified:** - Change is self-contained (single constant) - Doesn't depend on other commits - Doesn't require API changes - The stall detection code (commit 5a1314fa697fc6 from 2017) is already present in all stable branches - Will apply cleanly to any stable branch that has the original stall detection code
**Verified**: The original stall detection code is present in: - ✅ stable/linux-6.6.y - ✅ stable/linux-6.1.y - ✅ stable/linux-5.15.y - ✅ stable/linux-5.10.y - ✅ stable/linux-4.19.y
### 11. PRACTICAL VS THEORETICAL
**Highly practical:** - Author encountered this bug with real hardware (microblaze @ 83.25MHz, SPI @ 625kHz) - Provides specific reproduction case - Not a theoretical race or corner case - Users with slow SPI configurations will hit this reliably
### BACKPORT DECISION ANALYSIS
#### Alignment with Stable Kernel Rules:
1. **Obviously correct**: ✅ **YES** - Trivial change (increase timeout constant) - Reviewed by the original stall detection author - Logic is straightforward
2. **Fixes real bug**: ✅ **YES** - False stall detection on slow hardware - Author has reproducible case - Affects real users
3. **Important issue**: ⚠️ **MODERATE** - Causes SPI transaction failures - NOT a crash, security issue, or data corruption - Affects specific configurations (slow SPI/CPU) - Severity: Service disruption for affected users
4. **Small and contained**: ✅ **YES** - Single line change - No architectural changes - Minimal scope
5. **No new features**: ✅ **YES** - Pure bug fix - No new APIs or functionality
6. **Clean application**: ✅ **YES** - Should apply cleanly to all stable branches - Original code is present in all major LTS kernels
#### Considerations:
**In favor of backporting:** - Extremely low regression risk (1 line, timeout increase) - Fixes real-world bug with reproducible case - Reviewed by original code author - Small, obviously correct change - Helps users with slow SPI configurations - Original stall detection is already in stable branches
**Against backporting or requiring caution:** - No "Cc: stable@vger.kernel.org" tag (maintainer didn't explicitly request it) - No "Fixes:" tag (though logically fixes commit 5a1314fa697fc6) - Very recent in mainline (limited testing time) - Moderate severity (not critical crash/security/corruption) - Affects niche configurations (slow SPI on Xilinx hardware)
#### Risk vs Benefit Trade-off:
**Benefit**: - Enables SPI functionality for slow Xilinx configurations - Prevents spurious transaction failures - Minimal user-visible change (just works better)
**Risk**: - Virtually none - increasing timeout is safe - Genuine stalls detected marginally slower (acceptable)
**Trade-off**: **Favorable** - High benefit for affected users, negligible risk
### FINAL ASSESSMENT
This commit represents a **textbook stable backport candidate** from a technical perspective: - Trivial, obviously correct change - Fixes a real bug affecting actual users - Zero risk of regression - Small, contained scope
However, it has **two procedural concerns**: 1. **No explicit stable tag**: The maintainer/author didn't tag it with "Cc: stable@vger.kernel.org", suggesting they may not have considered it critical enough for backporting 2. **Very recent**: Limited testing time in mainline (approximately 2 weeks)
The absence of the stable tag is notable because the **original stall detection commit (5a1314fa697fc6) DID have the stable tag**, indicating the original author (Ricardo Ribalda) understood stable backporting. Ricardo also reviewed this current patch, so his review implies approval of the technical approach.
Despite the lack of an explicit stable tag, the technical merits strongly support backporting: - It's an adjustment to an already-backported feature - The change is trivial and safe - It fixes a real problem for a subset of users - Regression risk is negligible
The commit should be backported to **all stable branches that contain the original stall detection code** (5a1314fa697fc6), which includes at minimum: 6.6.y, 6.1.y, 5.15.y, 5.10.y, and 4.19.y.
**Recommendation**: This would benefit from a few more weeks in mainline for additional testing, but can be safely backported. The lack of a stable tag is the only significant concern, but the technical merits outweigh this procedural consideration.
**YES**
drivers/spi/spi-xilinx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/spi/spi-xilinx.c b/drivers/spi/spi-xilinx.c index d59cc8a184846..c86dc56f38b45 100644 --- a/drivers/spi/spi-xilinx.c +++ b/drivers/spi/spi-xilinx.c @@ -300,7 +300,7 @@ static int xilinx_spi_txrx_bufs(struct spi_device *spi, struct spi_transfer *t)
/* Read out all the data from the Rx FIFO */ rx_words = n_words; - stalled = 10; + stalled = 32; while (rx_words) { if (rx_words == n_words && !(stalled--) && !(sr & XSPI_SR_TX_EMPTY_MASK) &&