The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ddbd89deb7d32b1fbb879f48d68fda1a8ac58e8e Mon Sep 17 00:00:00 2001
From: Halil Pasic <pasic(a)linux.ibm.com>
Date: Fri, 11 Feb 2022 02:12:52 +0100
Subject: [PATCH] swiotlb: fix info leak with DMA_FROM_DEVICE
The problem I'm addressing was discovered by the LTP test covering
cve-2018-1000204.
A short description of what happens follows:
1) The test case issues a command code 00 (TEST UNIT READY) via the SG_IO
interface with: dxfer_len == 524288, dxfer_dir == SG_DXFER_FROM_DEV
and a corresponding dxferp. The peculiar thing about this is that TUR
is not reading from the device.
2) In sg_start_req() the invocation of blk_rq_map_user() effectively
bounces the user-space buffer, as if the device were to transfer into
it. Since commit a45b599ad808 ("scsi: sg: allocate with __GFP_ZERO in
sg_build_indirect()") we make sure this first bounce buffer is
allocated with GFP_ZERO.
3) For the rest of the story we keep ignoring that we have a TUR, so the
device won't touch the buffer we prepare, as if we had a
DMA_FROM_DEVICE type of situation. My setup uses a virtio-scsi device
and the buffer allocated by SG is mapped by the function
virtqueue_add_split() which uses DMA_FROM_DEVICE for the "in" sgs (here
scatter-gather and not scsi generics). This mapping involves bouncing
via the swiotlb (we need swiotlb to do virtio in protected guests like
s390 Secure Execution, or AMD SEV).
4) When the SCSI TUR is done, we first copy back the content of the second
(that is swiotlb) bounce buffer (which most likely contains some
previous IO data), to the first bounce buffer, which contains all
zeros. Then we copy back the content of the first bounce buffer to
the user-space buffer.
5) The test case detects that the buffer, which it zero-initialized,
is not all zeros and fails (see the illustrative sketch after this list).
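Not part of the patch: a minimal userspace sketch of the scenario above, for
illustration only. The device node /dev/sg0 is an assumption, and the real
LTP test is more thorough.

/* Hypothetical reproducer sketch: issue TEST UNIT READY through SG_IO with
 * a large SG_DXFER_FROM_DEV buffer and verify that the zero-initialized
 * buffer is still all zeros afterwards. /dev/sg0 is a placeholder. */
#include <fcntl.h>
#include <scsi/sg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

int main(void)
{
	unsigned char cdb[6] = { 0x00 };	/* TEST UNIT READY */
	unsigned char sense[32];
	size_t len = 524288;
	unsigned char *buf = calloc(1, len);	/* zeroed, like __GFP_ZERO */
	struct sg_io_hdr hdr;
	int fd = open("/dev/sg0", O_RDWR);

	if (fd < 0 || !buf)
		return 1;

	memset(&hdr, 0, sizeof(hdr));
	hdr.interface_id = 'S';
	hdr.cmdp = cdb;
	hdr.cmd_len = sizeof(cdb);
	hdr.dxfer_direction = SG_DXFER_FROM_DEV;
	hdr.dxferp = buf;
	hdr.dxfer_len = len;
	hdr.sbp = sense;
	hdr.mx_sb_len = sizeof(sense);
	hdr.timeout = 5000;			/* milliseconds */

	if (ioctl(fd, SG_IO, &hdr) < 0)
		return 1;

	for (size_t i = 0; i < len; i++) {
		if (buf[i]) {			/* TUR moves no data: any non-zero
						 * byte here is leaked memory */
			printf("info leak at offset %zu\n", i);
			return 1;
		}
	}
	puts("buffer still all zeros");
	return 0;
}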
One can argue that this is a swiotlb problem, because without swiotlb
we leak all zeros, and the swiotlb should be transparent in the sense that
it does not affect the outcome (if all other participants are well
behaved).
Copying the content of the original buffer into the swiotlb buffer is
the only way I can think of to make swiotlb transparent in such
scenarios. So let's do just that if in doubt, but allow the driver
to tell us that the whole mapped buffer is going to be overwritten,
in which case we can preserve the old behavior and avoid the performance
impact of the extra bounce.
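Not part of the patch: a hedged sketch of how a mapping call might opt in to
the new attribute when the device is known to overwrite the whole buffer.
The dev/buf/len names are placeholders.

#include <linux/dma-mapping.h>

/* Caller promises the device overwrites the entire buffer, so swiotlb may
 * skip bouncing the original contents for this DMA_FROM_DEVICE mapping. */
static dma_addr_t map_rx_buffer(struct device *dev, void *buf, size_t len)
{
	dma_addr_t addr;

	addr = dma_map_single_attrs(dev, buf, len, DMA_FROM_DEVICE,
				    DMA_ATTR_OVERWRITE);
	if (dma_mapping_error(dev, addr))
		return DMA_MAPPING_ERROR;
	return addr;
}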
Signed-off-by: Halil Pasic <pasic(a)linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst
index 1887d92e8e92..17706dc91ec9 100644
--- a/Documentation/core-api/dma-attributes.rst
+++ b/Documentation/core-api/dma-attributes.rst
@@ -130,3 +130,11 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged
subsystem that the buffer is fully accessible at the elevated privilege
level (and ideally inaccessible or at least read-only at the
lesser-privileged levels).
+
+DMA_ATTR_OVERWRITE
+------------------
+
+This is a hint to the DMA-mapping subsystem that the device is expected to
+overwrite the entire mapped size, thus the caller does not require any of the
+previous buffer contents to be preserved. This allows bounce-buffering
+implementations to optimise DMA_FROM_DEVICE transfers.
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..6150d11a607e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -61,6 +61,14 @@
*/
#define DMA_ATTR_PRIVILEGED (1UL << 9)
+/*
+ * This is a hint to the DMA-mapping subsystem that the device is expected
+ * to overwrite the entire mapped size, thus the caller does not require any
+ * of the previous buffer contents to be preserved. This allows
+ * bounce-buffering implementations to optimise DMA_FROM_DEVICE transfers.
+ */
+#define DMA_ATTR_OVERWRITE (1UL << 10)
+
/*
* A dma_addr_t can hold any valid DMA or bus address for the platform. It can
* be given to a device to use as a DMA source or target. It is specific to a
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index f1e7ea160b43..bfc56cb21705 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -628,7 +628,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
mem->slots[index + i].orig_addr = slot_addr(orig_addr, i);
tlb_addr = slot_addr(mem->start, index) + offset;
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
- (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
+ (!(attrs & DMA_ATTR_OVERWRITE) || dir == DMA_TO_DEVICE ||
+ dir == DMA_BIDIRECTIONAL))
swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
return tlb_addr;
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ddbd89deb7d32b1fbb879f48d68fda1a8ac58e8e Mon Sep 17 00:00:00 2001
From: Halil Pasic <pasic(a)linux.ibm.com>
Date: Fri, 11 Feb 2022 02:12:52 +0100
Subject: [PATCH] swiotlb: fix info leak with DMA_FROM_DEVICE
The problem I'm addressing was discovered by the LTP test covering
cve-2018-1000204.
A short description of what happens follows:
1) The test case issues a command code 00 (TEST UNIT READY) via the SG_IO
interface with: dxfer_len == 524288, dxfer_dir == SG_DXFER_FROM_DEV
and a corresponding dxferp. The peculiar thing about this is that TUR
is not reading from the device.
2) In sg_start_req() the invocation of blk_rq_map_user() effectively
bounces the user-space buffer, as if the device were to transfer into
it. Since commit a45b599ad808 ("scsi: sg: allocate with __GFP_ZERO in
sg_build_indirect()") we make sure this first bounce buffer is
allocated with GFP_ZERO.
3) For the rest of the story we keep ignoring that we have a TUR, so the
device won't touch the buffer we prepare, as if we had a
DMA_FROM_DEVICE type of situation. My setup uses a virtio-scsi device
and the buffer allocated by SG is mapped by the function
virtqueue_add_split() which uses DMA_FROM_DEVICE for the "in" sgs (here
scatter-gather and not scsi generics). This mapping involves bouncing
via the swiotlb (we need swiotlb to do virtio in protected guests like
s390 Secure Execution, or AMD SEV).
4) When the SCSI TUR is done, we first copy back the content of the second
(that is swiotlb) bounce buffer (which most likely contains some
previous IO data), to the first bounce buffer, which contains all
zeros. Then we copy back the content of the first bounce buffer to
the user-space buffer.
5) The test case detects that the buffer, which it zero-initialized,
is not all zeros and fails.
One can argue that this is a swiotlb problem, because without swiotlb
we leak all zeros, and the swiotlb should be transparent in the sense that
it does not affect the outcome (if all other participants are well
behaved).
Copying the content of the original buffer into the swiotlb buffer is
the only way I can think of to make swiotlb transparent in such
scenarios. So let's do just that if in doubt, but allow the driver
to tell us that the whole mapped buffer is going to be overwritten,
in which case we can preserve the old behavior and avoid the performance
impact of the extra bounce.
Signed-off-by: Halil Pasic <pasic(a)linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst
index 1887d92e8e92..17706dc91ec9 100644
--- a/Documentation/core-api/dma-attributes.rst
+++ b/Documentation/core-api/dma-attributes.rst
@@ -130,3 +130,11 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged
subsystem that the buffer is fully accessible at the elevated privilege
level (and ideally inaccessible or at least read-only at the
lesser-privileged levels).
+
+DMA_ATTR_OVERWRITE
+------------------
+
+This is a hint to the DMA-mapping subsystem that the device is expected to
+overwrite the entire mapped size, thus the caller does not require any of the
+previous buffer contents to be preserved. This allows bounce-buffering
+implementations to optimise DMA_FROM_DEVICE transfers.
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..6150d11a607e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -61,6 +61,14 @@
*/
#define DMA_ATTR_PRIVILEGED (1UL << 9)
+/*
+ * This is a hint to the DMA-mapping subsystem that the device is expected
+ * to overwrite the entire mapped size, thus the caller does not require any
+ * of the previous buffer contents to be preserved. This allows
+ * bounce-buffering implementations to optimise DMA_FROM_DEVICE transfers.
+ */
+#define DMA_ATTR_OVERWRITE (1UL << 10)
+
/*
* A dma_addr_t can hold any valid DMA or bus address for the platform. It can
* be given to a device to use as a DMA source or target. It is specific to a
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index f1e7ea160b43..bfc56cb21705 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -628,7 +628,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
mem->slots[index + i].orig_addr = slot_addr(orig_addr, i);
tlb_addr = slot_addr(mem->start, index) + offset;
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
- (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
+ (!(attrs & DMA_ATTR_OVERWRITE) || dir == DMA_TO_DEVICE ||
+ dir == DMA_BIDIRECTIONAL))
swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
return tlb_addr;
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ddbd89deb7d32b1fbb879f48d68fda1a8ac58e8e Mon Sep 17 00:00:00 2001
From: Halil Pasic <pasic(a)linux.ibm.com>
Date: Fri, 11 Feb 2022 02:12:52 +0100
Subject: [PATCH] swiotlb: fix info leak with DMA_FROM_DEVICE
The problem I'm addressing was discovered by the LTP test covering
cve-2018-1000204.
A short description of what happens follows:
1) The test case issues a command code 00 (TEST UNIT READY) via the SG_IO
interface with: dxfer_len == 524288, dxfer_dir == SG_DXFER_FROM_DEV
and a corresponding dxferp. The peculiar thing about this is that TUR
is not reading from the device.
2) In sg_start_req() the invocation of blk_rq_map_user() effectively
bounces the user-space buffer, as if the device were to transfer into
it. Since commit a45b599ad808 ("scsi: sg: allocate with __GFP_ZERO in
sg_build_indirect()") we make sure this first bounce buffer is
allocated with GFP_ZERO.
3) For the rest of the story we keep ignoring that we have a TUR, so the
device won't touch the buffer we prepare, as if we had a
DMA_FROM_DEVICE type of situation. My setup uses a virtio-scsi device
and the buffer allocated by SG is mapped by the function
virtqueue_add_split() which uses DMA_FROM_DEVICE for the "in" sgs (here
scatter-gather and not scsi generics). This mapping involves bouncing
via the swiotlb (we need swiotlb to do virtio in protected guests like
s390 Secure Execution, or AMD SEV).
4) When the SCSI TUR is done, we first copy back the content of the second
(that is swiotlb) bounce buffer (which most likely contains some
previous IO data), to the first bounce buffer, which contains all
zeros. Then we copy back the content of the first bounce buffer to
the user-space buffer.
5) The test case detects that the buffer, which it zero-initialized,
is not all zeros and fails.
One can argue that this is a swiotlb problem, because without swiotlb
we leak all zeros, and the swiotlb should be transparent in the sense that
it does not affect the outcome (if all other participants are well
behaved).
Copying the content of the original buffer into the swiotlb buffer is
the only way I can think of to make swiotlb transparent in such
scenarios. So let's do just that if in doubt, but allow the driver
to tell us that the whole mapped buffer is going to be overwritten,
in which case we can preserve the old behavior and avoid the performance
impact of the extra bounce.
Signed-off-by: Halil Pasic <pasic(a)linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst
index 1887d92e8e92..17706dc91ec9 100644
--- a/Documentation/core-api/dma-attributes.rst
+++ b/Documentation/core-api/dma-attributes.rst
@@ -130,3 +130,11 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged
subsystem that the buffer is fully accessible at the elevated privilege
level (and ideally inaccessible or at least read-only at the
lesser-privileged levels).
+
+DMA_ATTR_OVERWRITE
+------------------
+
+This is a hint to the DMA-mapping subsystem that the device is expected to
+overwrite the entire mapped size, thus the caller does not require any of the
+previous buffer contents to be preserved. This allows bounce-buffering
+implementations to optimise DMA_FROM_DEVICE transfers.
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..6150d11a607e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -61,6 +61,14 @@
*/
#define DMA_ATTR_PRIVILEGED (1UL << 9)
+/*
+ * This is a hint to the DMA-mapping subsystem that the device is expected
+ * to overwrite the entire mapped size, thus the caller does not require any
+ * of the previous buffer contents to be preserved. This allows
+ * bounce-buffering implementations to optimise DMA_FROM_DEVICE transfers.
+ */
+#define DMA_ATTR_OVERWRITE (1UL << 10)
+
/*
* A dma_addr_t can hold any valid DMA or bus address for the platform. It can
* be given to a device to use as a DMA source or target. It is specific to a
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index f1e7ea160b43..bfc56cb21705 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -628,7 +628,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
mem->slots[index + i].orig_addr = slot_addr(orig_addr, i);
tlb_addr = slot_addr(mem->start, index) + offset;
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
- (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
+ (!(attrs & DMA_ATTR_OVERWRITE) || dir == DMA_TO_DEVICE ||
+ dir == DMA_BIDIRECTIONAL))
swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
return tlb_addr;
}
The logic in commit 2a5f1b67ec57 "KVM: arm64: Don't access PMCR_EL0 when no
PMU is available" relies on an empty reset handler being benign. This was
not the case in earlier kernel versions, so the stable backport of this
patch is causing problems.
KVM's behaviour in this area changed over time. In particular, prior to commit
03fdfb269009 ("KVM: arm64: Don't write junk to sysregs on reset"), an empty
reset handler would trigger a warning, as the guest registers had been
poisoned.
Prior to commit 20589c8cc47d ("arm/arm64: KVM: Don't panic on failure to
properly reset system registers"), this warning was a panic().
Instead of reverting the backport, make it write 0 to the sys_reg[] array.
This keeps the reset logic happy, and the dodgy value can't be seen by
the guest as it can't request the emulation.
The original bug was accessing the PMCR_EL0 register on CPUs that don't
implement that feature. There is no known silicon that does this, but
v4.9's ACPI support is unable to find the PMU, so it triggers this code:
| Kernel panic - not syncing: Didn't reset vcpu_sys_reg(24)
| CPU: 1 PID: 3055 Comm: lkvm Not tainted 4.9.302-00032-g64e078a56789 #13476
| Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Jul 30 2018
| Call trace:
| [<ffff00000808b4b0>] dump_backtrace+0x0/0x1a0
| [<ffff00000808b664>] show_stack+0x14/0x20
| [<ffff0000088f0e18>] dump_stack+0x98/0xb8
| [<ffff0000088eef08>] panic+0x118/0x274
| [<ffff0000080b50e0>] access_actlr+0x0/0x20
| [<ffff0000080b2620>] kvm_reset_vcpu+0x5c/0xac
| [<ffff0000080ac688>] kvm_arch_vcpu_ioctl+0x3e4/0x490
| [<ffff0000080a382c>] kvm_vcpu_ioctl+0x5b8/0x720
| [<ffff000008201e44>] do_vfs_ioctl+0x2f4/0x884
| [<ffff00000820244c>] SyS_ioctl+0x78/0x9c
| [<ffff000008083a9c>] __sys_trace_return+0x0/0x4
Cc: <stable(a)vger.kernel.org> # < v5.3 with 2a5f1b67ec57 backported
Signed-off-by: James Morse <james.morse(a)arm.com>
---
arch/arm64/kvm/sys_regs.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 10d80456f38f..8d548fdbb6b2 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -451,8 +451,10 @@ static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
u64 pmcr, val;
/* No PMU available, PMCR_EL0 may UNDEF... */
- if (!kvm_arm_support_pmu_v3())
+ if (!kvm_arm_support_pmu_v3()) {
+ vcpu_sys_reg(vcpu, PMCR_EL0) = 0;
return;
+ }
pmcr = read_sysreg(pmcr_el0);
/*
--
2.30.2
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0bf476fc3624e3a72af4ba7340d430a91c18cd67 Mon Sep 17 00:00:00 2001
From: Robert Hancock <robert.hancock(a)calian.com>
Date: Thu, 3 Mar 2022 12:10:27 -0600
Subject: [PATCH] net: macb: Fix lost RX packet wakeup race in NAPI receive
There is an oddity in the way the RSR register flags propagate to the
ISR register (and the actual interrupt output) on this hardware: it
appears that RSR register bits only result in ISR being asserted if the
interrupt was actually enabled at the time, so enabling interrupts with
RSR bits already set doesn't trigger an interrupt to be raised. There
was already a partial fix for this race in the macb_poll function where
it checked for RSR bits being set and re-triggered NAPI receive.
However, there was still a race window between checking RSR and
actually enabling interrupts, where a lost wakeup could happen. It's
necessary to check again after enabling interrupts to see if RSR was set
just prior to the interrupt being enabled, and re-trigger receive in that
case.
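For illustration only, not part of the patch: the resulting ordering in
macb_poll(), condensed. The MACB_CAPS_ISR_CLEAR_ON_WRITE handling is elided;
see the diff below for the full version.

	if (work_done < budget) {
		napi_complete_done(napi, work_done);

		/* packets received during the IRQ-off window? */
		status = macb_readl(bp, RSR);
		if (status) {
			napi_reschedule(napi);
		} else {
			queue_writel(queue, IER, bp->rx_intr_mask);

			/* RSR bits set just before the IER write do not raise
			 * an interrupt, so a second look is needed here */
			status = macb_readl(bp, RSR);
			if (unlikely(status)) {
				queue_writel(queue, IDR, bp->rx_intr_mask);
				napi_schedule(napi);
			}
		}
	}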
This issue was noticed in a point-to-point UDP request-response protocol
which periodically saw timeouts or abnormally high response times due to
received packets not being processed in a timely fashion. In many
applications, more packets arriving, including TCP retransmissions, would
cause the original packet to be processed, thus masking the issue.
Fixes: 02f7a34f34e3 ("net: macb: Re-enable RX interrupt only when RX is done")
Cc: stable(a)vger.kernel.org
Co-developed-by: Scott McNutt <scott.mcnutt(a)siriusxm.com>
Signed-off-by: Scott McNutt <scott.mcnutt(a)siriusxm.com>
Signed-off-by: Robert Hancock <robert.hancock(a)calian.com>
Tested-by: Claudiu Beznea <claudiu.beznea(a)microchip.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 98498a76ae16..d13f06cf0308 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -1573,7 +1573,14 @@ static int macb_poll(struct napi_struct *napi, int budget)
if (work_done < budget) {
napi_complete_done(napi, work_done);
- /* Packets received while interrupts were disabled */
+ /* RSR bits only seem to propagate to raise interrupts when
+ * interrupts are enabled at the time, so if bits are already
+ * set due to packets received while interrupts were disabled,
+ * they will not cause another interrupt to be generated when
+ * interrupts are re-enabled.
+ * Check for this case here. This has been seen to happen
+ * around 30% of the time under heavy network load.
+ */
status = macb_readl(bp, RSR);
if (status) {
if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
@@ -1581,6 +1588,22 @@ static int macb_poll(struct napi_struct *napi, int budget)
napi_reschedule(napi);
} else {
queue_writel(queue, IER, bp->rx_intr_mask);
+
+ /* In rare cases, packets could have been received in
+ * the window between the check above and re-enabling
+ * interrupts. Therefore, a double-check is required
+ * to avoid losing a wakeup. This can potentially race
+ * with the interrupt handler doing the same actions
+ * if an interrupt is raised just after enabling them,
+ * but this should be harmless.
+ */
+ status = macb_readl(bp, RSR);
+ if (unlikely(status)) {
+ queue_writel(queue, IDR, bp->rx_intr_mask);
+ if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
+ queue_writel(queue, ISR, MACB_BIT(RCOMP));
+ napi_schedule(napi);
+ }
}
}
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0bf476fc3624e3a72af4ba7340d430a91c18cd67 Mon Sep 17 00:00:00 2001
From: Robert Hancock <robert.hancock(a)calian.com>
Date: Thu, 3 Mar 2022 12:10:27 -0600
Subject: [PATCH] net: macb: Fix lost RX packet wakeup race in NAPI receive
There is an oddity in the way the RSR register flags propagate to the
ISR register (and the actual interrupt output) on this hardware: it
appears that RSR register bits only result in ISR being asserted if the
interrupt was actually enabled at the time, so enabling interrupts with
RSR bits already set doesn't trigger an interrupt to be raised. There
was already a partial fix for this race in the macb_poll function where
it checked for RSR bits being set and re-triggered NAPI receive.
However, there was still a race window between checking RSR and
actually enabling interrupts, where a lost wakeup could happen. It's
necessary to check again after enabling interrupts to see if RSR was set
just prior to the interrupt being enabled, and re-trigger receive in that
case.
This issue was noticed in a point-to-point UDP request-response protocol
which periodically saw timeouts or abnormally high response times due to
received packets not being processed in a timely fashion. In many
applications, more packets arriving, including TCP retransmissions, would
cause the original packet to be processed, thus masking the issue.
Fixes: 02f7a34f34e3 ("net: macb: Re-enable RX interrupt only when RX is done")
Cc: stable(a)vger.kernel.org
Co-developed-by: Scott McNutt <scott.mcnutt(a)siriusxm.com>
Signed-off-by: Scott McNutt <scott.mcnutt(a)siriusxm.com>
Signed-off-by: Robert Hancock <robert.hancock(a)calian.com>
Tested-by: Claudiu Beznea <claudiu.beznea(a)microchip.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 98498a76ae16..d13f06cf0308 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -1573,7 +1573,14 @@ static int macb_poll(struct napi_struct *napi, int budget)
if (work_done < budget) {
napi_complete_done(napi, work_done);
- /* Packets received while interrupts were disabled */
+ /* RSR bits only seem to propagate to raise interrupts when
+ * interrupts are enabled at the time, so if bits are already
+ * set due to packets received while interrupts were disabled,
+ * they will not cause another interrupt to be generated when
+ * interrupts are re-enabled.
+ * Check for this case here. This has been seen to happen
+ * around 30% of the time under heavy network load.
+ */
status = macb_readl(bp, RSR);
if (status) {
if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
@@ -1581,6 +1588,22 @@ static int macb_poll(struct napi_struct *napi, int budget)
napi_reschedule(napi);
} else {
queue_writel(queue, IER, bp->rx_intr_mask);
+
+ /* In rare cases, packets could have been received in
+ * the window between the check above and re-enabling
+ * interrupts. Therefore, a double-check is required
+ * to avoid losing a wakeup. This can potentially race
+ * with the interrupt handler doing the same actions
+ * if an interrupt is raised just after enabling them,
+ * but this should be harmless.
+ */
+ status = macb_readl(bp, RSR);
+ if (unlikely(status)) {
+ queue_writel(queue, IDR, bp->rx_intr_mask);
+ if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
+ queue_writel(queue, ISR, MACB_BIT(RCOMP));
+ napi_schedule(napi);
+ }
}
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0c4bcfdecb1ac0967619ee7ff44871d93c08c909 Mon Sep 17 00:00:00 2001
From: Miklos Szeredi <mszeredi(a)redhat.com>
Date: Mon, 7 Mar 2022 16:30:44 +0100
Subject: [PATCH] fuse: fix pipe buffer lifetime for direct_io
In FOPEN_DIRECT_IO mode, fuse_file_write_iter() calls
fuse_direct_write_iter(), which normally calls fuse_direct_io(), which then
imports the write buffer with fuse_get_user_pages(), which uses
iov_iter_get_pages() to grab references to userspace pages instead of
actually copying memory.
On the filesystem device side, these pages can then either be read to
userspace (via fuse_dev_read()), or splice()d over into a pipe using
fuse_dev_splice_read() as pipe buffers with &nosteal_pipe_buf_ops.
This is wrong because after fuse_dev_do_read() unlocks the FUSE request,
the userspace filesystem can mark the request as completed, causing write()
to return. At that point, the userspace filesystem should no longer have
access to the pipe buffer.
Fix by copying pages coming from the user address space to new pipe
buffers.
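Not part of the patch: a hedged userspace sketch of the splice path the text
refers to, i.e. a FUSE daemon pulling a request out of /dev/fuse into a pipe
via fuse_dev_splice_read(). The fuse_fd descriptor and the length are
assumptions.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Move one request from /dev/fuse into a pipe without copying; before this
 * fix the resulting pipe buffers could keep referencing the writer's user
 * pages after the request had already completed. */
static int splice_one_request(int fuse_fd)
{
	int pipefd[2];
	ssize_t n;

	if (pipe(pipefd) < 0)
		return -1;

	n = splice(fuse_fd, NULL, pipefd[1], NULL, 1 << 20, SPLICE_F_MOVE);
	if (n < 0)
		perror("splice");
	else
		fprintf(stderr, "spliced %zd bytes of request data\n", n);

	close(pipefd[0]);
	close(pipefd[1]);
	return n < 0 ? -1 : 0;
}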
Reported-by: Jann Horn <jannh(a)google.com>
Fixes: c3021629a0d8 ("fuse: support splice() reading from fuse device")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi(a)redhat.com>
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cd54a529460d..592730fd6e42 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -941,7 +941,17 @@ static int fuse_copy_page(struct fuse_copy_state *cs, struct page **pagep,
while (count) {
if (cs->write && cs->pipebufs && page) {
- return fuse_ref_page(cs, page, offset, count);
+ /*
+ * Can't control lifetime of pipe buffers, so always
+ * copy user pages.
+ */
+ if (cs->req->args->user_pages) {
+ err = fuse_copy_fill(cs);
+ if (err)
+ return err;
+ } else {
+ return fuse_ref_page(cs, page, offset, count);
+ }
} else if (!cs->len) {
if (cs->move_pages && page &&
offset == 0 && count == PAGE_SIZE) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 829094451774..0fc150c1c50b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1413,6 +1413,7 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
(PAGE_SIZE - ret) & (PAGE_SIZE - 1);
}
+ ap->args.user_pages = true;
if (write)
ap->args.in_pages = true;
else
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e8e59fbdefeb..eac4984cc753 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -256,6 +256,7 @@ struct fuse_args {
bool nocreds:1;
bool in_pages:1;
bool out_pages:1;
+ bool user_pages:1;
bool out_argvar:1;
bool page_zeroing:1;
bool page_replace:1;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0c4bcfdecb1ac0967619ee7ff44871d93c08c909 Mon Sep 17 00:00:00 2001
From: Miklos Szeredi <mszeredi(a)redhat.com>
Date: Mon, 7 Mar 2022 16:30:44 +0100
Subject: [PATCH] fuse: fix pipe buffer lifetime for direct_io
In FOPEN_DIRECT_IO mode, fuse_file_write_iter() calls
fuse_direct_write_iter(), which normally calls fuse_direct_io(), which then
imports the write buffer with fuse_get_user_pages(), which uses
iov_iter_get_pages() to grab references to userspace pages instead of
actually copying memory.
On the filesystem device side, these pages can then either be read to
userspace (via fuse_dev_read()), or splice()d over into a pipe using
fuse_dev_splice_read() as pipe buffers with &nosteal_pipe_buf_ops.
This is wrong because after fuse_dev_do_read() unlocks the FUSE request,
the userspace filesystem can mark the request as completed, causing write()
to return. At that point, the userspace filesystem should no longer have
access to the pipe buffer.
Fix by copying pages coming from the user address space to new pipe
buffers.
Reported-by: Jann Horn <jannh(a)google.com>
Fixes: c3021629a0d8 ("fuse: support splice() reading from fuse device")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi(a)redhat.com>
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cd54a529460d..592730fd6e42 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -941,7 +941,17 @@ static int fuse_copy_page(struct fuse_copy_state *cs, struct page **pagep,
while (count) {
if (cs->write && cs->pipebufs && page) {
- return fuse_ref_page(cs, page, offset, count);
+ /*
+ * Can't control lifetime of pipe buffers, so always
+ * copy user pages.
+ */
+ if (cs->req->args->user_pages) {
+ err = fuse_copy_fill(cs);
+ if (err)
+ return err;
+ } else {
+ return fuse_ref_page(cs, page, offset, count);
+ }
} else if (!cs->len) {
if (cs->move_pages && page &&
offset == 0 && count == PAGE_SIZE) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 829094451774..0fc150c1c50b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1413,6 +1413,7 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
(PAGE_SIZE - ret) & (PAGE_SIZE - 1);
}
+ ap->args.user_pages = true;
if (write)
ap->args.in_pages = true;
else
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e8e59fbdefeb..eac4984cc753 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -256,6 +256,7 @@ struct fuse_args {
bool nocreds:1;
bool in_pages:1;
bool out_pages:1;
+ bool user_pages:1;
bool out_argvar:1;
bool page_zeroing:1;
bool page_replace:1;