From: Pali Rohár <pali(a)kernel.org>
Commit e2a8910af01653c1c268984855629d71fb81f404 upstream.
ReparseDataLength is the sum of the InodeType size and the DataBuffer
size. So to get the DataBuffer size, the InodeType size needs to be
subtracted from ReparseDataLength.
Function cifs_strndup_from_utf16() is currently accessing buf->DataBuffer
at a position past the end of the buffer because it does not subtract the
InodeType size from the length. Fix this problem by correctly subtracting
the InodeType size from the variable len.
The InodeType member is present only when the reparse buffer is large
enough. Check ReparseDataLength before accessing InodeType to prevent
another invalid memory access.
The major and minor rdev values are likewise present only when the reparse
buffer is large enough. Check the reparse buffer size before calling
reparse_mkdev().
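For reference, a sketch of the buffer layout involved (field names as in
the kernel's struct reparse_posix_data; layout paraphrased from MS-FSCC
2.1.2.6, so treat it as illustrative rather than authoritative):

    struct reparse_posix_data {
            __le32  ReparseTag;
            __le16  ReparseDataLength;  /* sizeof(InodeType) + DataBuffer size */
            __u16   Reserved;
            __le64  InodeType;          /* NFS_SPECFILE_LNK, _CHR, _BLK, ... */
            char    DataBuffer[];       /* ReparseDataLength - sizeof(InodeType) bytes */
    } __attribute__((packed));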
Fixes: d5ecebc4900d ("smb3: Allow query of symlinks stored as reparse points")
Reviewed-by: Paulo Alcantara (Red Hat) <pc(a)manguebit.com>
Signed-off-by: Pali Rohár <pali(a)kernel.org>
Signed-off-by: Steve French <stfrench(a)microsoft.com>
[use variable name symlink_buf, the other buf->InodeType accesses are
not used in current version so skip]
Signed-off-by: Mahmoud Adam <mngyadam(a)amazon.com>
---
This fixes CVE-2024-49996 and applies cleanly on 5.4->6.1; 6.6 and
later already have the fix.
fs/smb/client/smb2ops.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/fs/smb/client/smb2ops.c b/fs/smb/client/smb2ops.c
index d1e5ff9a3cd39..fcfbc096924a8 100644
--- a/fs/smb/client/smb2ops.c
+++ b/fs/smb/client/smb2ops.c
@@ -2897,6 +2897,12 @@ parse_reparse_posix(struct reparse_posix_data *symlink_buf,
/* See MS-FSCC 2.1.2.6 for the 'NFS' style reparse tags */
len = le16_to_cpu(symlink_buf->ReparseDataLength);
+ if (len < sizeof(symlink_buf->InodeType)) {
+ cifs_dbg(VFS, "srv returned malformed nfs buffer\n");
+ return -EIO;
+ }
+
+ len -= sizeof(symlink_buf->InodeType);
if (le64_to_cpu(symlink_buf->InodeType) != NFS_SPECFILE_LNK) {
cifs_dbg(VFS, "%lld not a supported symlink type\n",
--
2.40.1
CC stable.
This needs picking up for 6.12.
Head commit 573f45a9f9a47 was applied by Linus with a modified commit message.
David
> -----Original Message-----
> From: David Laight
> Sent: 24 November 2024 15:39
> To: 'Linus Torvalds' <torvalds(a)linux-foundation.org>; 'Andrew Cooper' <andrew.cooper3(a)citrix.com>;
> 'bp(a)alien8.de' <bp(a)alien8.de>; 'Josh Poimboeuf' <jpoimboe(a)kernel.org>
> Cc: 'x86(a)kernel.org' <x86(a)kernel.org>; 'linux-kernel(a)vger.kernel.org' <linux-kernel(a)vger.kernel.org>;
> 'Arnd Bergmann' <arnd(a)kernel.org>; 'Mikel Rychliski' <mikel(a)mikelr.com>; 'Thomas Gleixner'
> <tglx(a)linutronix.de>; 'Ingo Molnar' <mingo(a)redhat.com>; 'Borislav Petkov' <bp(a)alien8.de>; 'Dave
> Hansen' <dave.hansen(a)linux.intel.com>; 'H. Peter Anvin' <hpa(a)zytor.com>
> Subject: [PATCH v2] x86: Allow user accesses to the base of the guard page
>
> __access_ok() calls valid_user_address() with the address after
> the last byte of the user buffer.
> It is valid for a buffer to end with the last valid user address,
> so valid_user_address() must allow accesses to the base of the
> guard page.
>
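> As a rough sketch of the relationship (paraphrased and condensed from
> arch/x86/include/asm/uaccess_64.h; not the literal source):
>
>	#define valid_user_address(x) \
>		((__force unsigned long)(x) <= runtime_const_ptr(USER_PTR_MAX))
>
>	static inline bool __access_ok(const void __user *ptr, unsigned long size)
>	{
>		unsigned long sum = size + (__force unsigned long)ptr;
>
>		/* 'sum' is one past the last byte of the buffer, so it may
>		 * legitimately equal the base of the guard page. */
>		return valid_user_address(sum) && sum >= (__force unsigned long)ptr;
>	}
>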
> Fixes: 86e6b1547b3d0 ("x86: fix user address masking non-canonical speculation issue")
> Signed-off-by: David Laight <david.laight(a)aculab.com>
> ---
>
> v2: Rewritten commit message.
>
> arch/x86/kernel/cpu/common.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 06a516f6795b..ca327cfa42ae 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -2389,12 +2389,12 @@ void __init arch_cpu_finalize_init(void)
> alternative_instructions();
>
> if (IS_ENABLED(CONFIG_X86_64)) {
> - unsigned long USER_PTR_MAX = TASK_SIZE_MAX-1;
> + unsigned long USER_PTR_MAX = TASK_SIZE_MAX;
>
> /*
> * Enable this when LAM is gated on LASS support
> if (cpu_feature_enabled(X86_FEATURE_LAM))
> - USER_PTR_MAX = (1ul << 63) - PAGE_SIZE - 1;
> + USER_PTR_MAX = (1ul << 63) - PAGE_SIZE;
> */
> runtime_const_init(ptr, USER_PTR_MAX);
>
> --
> 2.17.1
The quilt patch titled
Subject: mm: vmscan: ensure kswapd is woken up if the wait queue is active
has been removed from the -mm tree. Its filename was
mm-vmscan-ensure-kswapd-is-woken-up-if-the-wait-queue-is-active.patch
This patch was dropped because an updated version will be issued
------------------------------------------------------
From: Seiji Nishikawa <snishika(a)redhat.com>
Subject: mm: vmscan: ensure kswapd is woken up if the wait queue is active
Date: Wed, 27 Nov 2024 00:06:12 +0900
Even after commit 501b26510ae3 ("vmstat: allow_direct_reclaim should use
zone_page_state_snapshot"), a task may remain indefinitely stuck in
throttle_direct_reclaim() while holding mm->rwsem.
__alloc_pages_nodemask
try_to_free_pages
throttle_direct_reclaim
This can cause numerous other tasks to wait on the same rwsem, leading
to severe system hangups:
[1088963.358712] INFO: task python3:1670971 blocked for more than 120 seconds.
[1088963.365653] Tainted: G OE -------- - - 4.18.0-553.el8_10.aarch64 #1
[1088963.373887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1088963.381862] task:python3 state:D stack:0 pid:1670971 ppid:1667117 flags:0x00800080
[1088963.381869] Call trace:
[1088963.381872] __switch_to+0xd0/0x120
[1088963.381877] __schedule+0x340/0xac8
[1088963.381881] schedule+0x68/0x118
[1088963.381886] rwsem_down_read_slowpath+0x2d4/0x4b8
The issue arises when allow_direct_reclaim(pgdat) returns false,
preventing progress even when the pgdat->pfmemalloc_wait wait queue is
empty: although nothing is waiting on the queue,
allow_direct_reclaim(pgdat) may keep returning false, so the task
continues looping.
In some cases, reclaimable pages exist (zone_reclaimable_pages() returns
> 0), but calculations of pfmemalloc_reserve and free_pages result in
wmark_ok being false.
Meanwhile, despite the pgdat->kswapd_wait queue being non-empty, kswapd
is not woken up, further exacerbating the problem:
crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_highest_zoneidx
$775 = __MAX_NR_ZONES
The issue likely occurs under specific conditions: high memory pressure
with frequent direct reclaim, contention on mmap_sem from concurrent
memory allocations, and zone states that leave wmark_ok false even though
reclaimable pages exist.
Modern workloads (e.g., Python multiprocessing) and changes in kernel
reclaim logic may have surfaced such edge cases more prominently than
before.
The workload involves concurrent Python processes under high memory
pressure, leading to contention on mmap_sem. While not unusual, this
workload may trigger a rare combination of conditions that expose the
issue.
This patch modifies allow_direct_reclaim() to wake kswapd if the
pgdat->kswapd_wait queue is active, regardless of whether wmark_ok is true
or false. This change ensures kswapd does not miss wake-ups under high
memory pressure, reducing the risk of task stalls in the throttled reclaim
path.
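For context, a condensed sketch of the throttling path in question
(paraphrased from throttle_direct_reclaim() in mm/vmscan.c; not the
literal source):

    /* The task sleeps here until allow_direct_reclaim() returns true.
     * If kswapd is never woken, nothing changes the zone state and
     * the wait can effectively last forever. */
    wait_event_killable(pgdat->pfmemalloc_wait,
                        allow_direct_reclaim(pgdat));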
Link: https://lkml.kernel.org/r/20241126150612.114561-1-snishika@redhat.com
Signed-off-by: Seiji Nishikawa <snishika(a)redhat.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmscan.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/mm/vmscan.c~mm-vmscan-ensure-kswapd-is-woken-up-if-the-wait-queue-is-active
+++ a/mm/vmscan.c
@@ -6389,8 +6389,8 @@ static bool allow_direct_reclaim(pg_data
wmark_ok = free_pages > pfmemalloc_reserve / 2;
- /* kswapd must be awake if processes are being throttled */
- if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
+ /* Always wake up kswapd if the wait queue is not empty */
+ if (waitqueue_active(&pgdat->kswapd_wait)) {
if (READ_ONCE(pgdat->kswapd_highest_zoneidx) > ZONE_NORMAL)
WRITE_ONCE(pgdat->kswapd_highest_zoneidx, ZONE_NORMAL);
_
Patches currently in -mm which might be from snishika(a)redhat.com are
mm-vmscan-account-for-free-pages-to-prevent-infinite-loop-in-throttle_direct_reclaim.patch
The patch titled
Subject: mm/hugetlb: change ENOSPC to ENOMEM in alloc_hugetlb_folio
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-hugetlb-change-enospc-to-enomem-in-alloc_hugetlb_folio.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Dafna Hirschfeld <dafna.hirschfeld(a)intel.com>
Subject: mm/hugetlb: change ENOSPC to ENOMEM in alloc_hugetlb_folio
Date: Sun, 1 Dec 2024 03:03:41 +0200
The error ENOSPC is translated in vmf_error() to VM_FAULT_SIGBUS, which is
further translated into EFAULT in e.g. pin/get_user_pages(). But when
running out of pages/hugepages we expect to see ENOMEM and not EFAULT.
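For reference, the translation step in question (a condensed paraphrase of
vmf_error() in include/linux/mm.h):

    static inline vm_fault_t vmf_error(int err)
    {
            if (err == -ENOMEM)
                    return VM_FAULT_OOM;    /* surfaces to userspace as ENOMEM */
            return VM_FAULT_SIGBUS;         /* surfaces as EFAULT in gup paths */
    }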
Link: https://lkml.kernel.org/r/20241201010341.1382431-1-dafna.hirschfeld@intel.c…
Fixes: 8f34af6f93ae ("mm, hugetlb: move the error handle logic out of normal code path")
Signed-off-by: Dafna Hirschfeld <dafna.hirschfeld(a)intel.com>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/hugetlb.c~mm-hugetlb-change-enospc-to-enomem-in-alloc_hugetlb_folio
+++ a/mm/hugetlb.c
@@ -3113,7 +3113,7 @@ out_end_reservation:
if (!memcg_charge_ret)
mem_cgroup_cancel_charge(memcg, nr_pages);
mem_cgroup_put(memcg);
- return ERR_PTR(-ENOSPC);
+ return ERR_PTR(-ENOMEM);
}
int alloc_bootmem_huge_page(struct hstate *h, int nid)
_
Patches currently in -mm which might be from dafna.hirschfeld(a)intel.com are
mm-hugetlb-change-enospc-to-enomem-in-alloc_hugetlb_folio.patch
The patch titled
Subject: mm: vmscan: account for free pages to prevent infinite loop in throttle_direct_reclaim()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-vmscan-account-for-free-pages-to-prevent-infinite-loop-in-throttle_direct_reclaim.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Seiji Nishikawa <snishika(a)redhat.com>
Subject: mm: vmscan: account for free pages to prevent infinite loop in throttle_direct_reclaim()
Date: Sun, 1 Dec 2024 01:12:34 +0900
The task sometimes continues looping in throttle_direct_reclaim() because
allow_direct_reclaim(pgdat) keeps returning false.
#0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac
#1 [ffff80002cb6f900] __schedule at ffff800008abbd1c
#2 [ffff80002cb6f990] schedule at ffff800008abc50c
#3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550
#4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68
#5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660
#6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98
#7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8
#8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974
#9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4
At this point, the pgdat contains the following two zones:
NODE: 4 ZONE: 0 ADDR: ffff00817fffe540 NAME: "DMA32"
SIZE: 20480 MIN/LOW/HIGH: 11/28/45
VM_STAT:
NR_FREE_PAGES: 359
NR_ZONE_INACTIVE_ANON: 18813
NR_ZONE_ACTIVE_ANON: 0
NR_ZONE_INACTIVE_FILE: 50
NR_ZONE_ACTIVE_FILE: 0
NR_ZONE_UNEVICTABLE: 0
NR_ZONE_WRITE_PENDING: 0
NR_MLOCK: 0
NR_BOUNCE: 0
NR_ZSPAGES: 0
NR_FREE_CMA_PAGES: 0
NODE: 4 ZONE: 1 ADDR: ffff00817fffec00 NAME: "Normal"
SIZE: 8454144 PRESENT: 98304 MIN/LOW/HIGH: 68/166/264
VM_STAT:
NR_FREE_PAGES: 146
NR_ZONE_INACTIVE_ANON: 94668
NR_ZONE_ACTIVE_ANON: 3
NR_ZONE_INACTIVE_FILE: 735
NR_ZONE_ACTIVE_FILE: 78
NR_ZONE_UNEVICTABLE: 0
NR_ZONE_WRITE_PENDING: 0
NR_MLOCK: 0
NR_BOUNCE: 0
NR_ZSPAGES: 0
NR_FREE_CMA_PAGES: 0
In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of
inactive/active file-backed pages calculated in zone_reclaimable_pages()
based on the result of zone_page_state_snapshot() is zero.
Additionally, since this system lacks swap, the calculation of inactive/
active anonymous pages is skipped.
crash> p nr_swap_pages
nr_swap_pages = $1937 = {
counter = 0
}
As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on to
the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 having
free pages significantly exceeding the high watermark.
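A condensed sketch of the loop in allow_direct_reclaim() that causes the
skip (paraphrased from mm/vmscan.c):

    for (i = 0; i <= ZONE_NORMAL; i++) {
            zone = &pgdat->node_zones[i];
            if (!managed_zone(zone))
                    continue;
            /* A zone reporting zero reclaimable pages is skipped, so
             * its free pages never count toward wmark_ok below. */
            if (!zone_reclaimable_pages(zone))
                    continue;
            pfmemalloc_reserve += min_wmark_pages(zone);
            free_pages += zone_page_state_snapshot(zone, NR_FREE_PAGES);
    }
    wmark_ok = free_pages > pfmemalloc_reserve / 2;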
The problem is that pgdat->kswapd_failures has not been incremented.
crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures
$1935 = 0x0
This is because the node is deemed balanced. The node balancing logic in
balance_pgdat() evaluates all zones collectively. If one or more zones
(e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the
entire node is deemed balanced. This causes balance_pgdat() to exit early
before incrementing the kswapd_failures, as it considers the overall
memory state acceptable, even though some zones (like ZONE_NORMAL) remain
under significant pressure.
The patch ensures that zone_reclaimable_pages() includes free pages
(NR_FREE_PAGES) in its calculation when no other reclaimable pages are
available (e.g., file-backed or anonymous pages). This change prevents
zones like ZONE_DMA32, which have sufficient free pages, from being
mistakenly deemed unreclaimable. By doing so, the patch ensures proper
node balancing, avoids masking pressure on other zones like ZONE_NORMAL,
and prevents infinite loops in throttle_direct_reclaim() caused by
allow_direct_reclaim(pgdat) repeatedly returning false.
The kernel hangs due to a task stuck in throttle_direct_reclaim(), caused
by a node being incorrectly deemed balanced despite pressure in certain
zones, such as ZONE_NORMAL. This issue arises from
zone_reclaimable_pages() returning 0 for zones without reclaimable file-
backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient
free pages to be skipped.
The lack of swap or reclaimable pages results in ZONE_DMA32 being ignored
during reclaim, masking pressure in other zones. Consequently,
pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback
mechanisms in allow_direct_reclaim() from being triggered, leading to an
infinite loop in throttle_direct_reclaim().
This patch modifies zone_reclaimable_pages() to account for free pages
(NR_FREE_PAGES) when no other reclaimable pages exist. This ensures zones
with sufficient free pages are not skipped, enabling proper balancing and
reclaim behavior.
Link: https://lkml.kernel.org/r/20241130164346.436469-1-snishika@redhat.com
Link: https://lkml.kernel.org/r/20241130161236.433747-2-snishika@redhat.com
Signed-off-by: Seiji Nishikawa <snishika(a)redhat.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmscan.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
--- a/mm/vmscan.c~mm-vmscan-account-for-free-pages-to-prevent-infinite-loop-in-throttle_direct_reclaim
+++ a/mm/vmscan.c
@@ -374,7 +374,14 @@ unsigned long zone_reclaimable_pages(str
if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL))
nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
-
+ /*
+ * If there are no reclaimable file-backed or anonymous pages,
+ * ensure zones with sufficient free pages are not skipped.
+ * This prevents zones like DMA32 from being ignored in reclaim
+ * scenarios where they can still help alleviate memory pressure.
+ */
+ if (nr == 0)
+ nr = zone_page_state_snapshot(zone, NR_FREE_PAGES);
return nr;
}
_
Patches currently in -mm which might be from snishika(a)redhat.com are
mm-vmscan-ensure-kswapd-is-woken-up-if-the-wait-queue-is-active.patch
mm-vmscan-account-for-free-pages-to-prevent-infinite-loop-in-throttle_direct_reclaim.patch
[ Upstream commit 122aba8c80618eca904490b1733af27fb8f07528 ]
Recent kernels cause a lot of TCP retransmissions
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 2.24 GBytes 19.2 Gbits/sec 2767 442 KBytes
[ 5] 1.00-2.00 sec 2.23 GBytes 19.1 Gbits/sec 2312 350 KBytes
^^^^
Replacing the qdisc with pfifo makes retransmissions go away.
It appears that a flow may have a delayed packet with a very near
Tx time. Later, we may get busy processing Rx and the target Tx time
will pass, but we won't service Tx since the CPU is busy with Rx.
If Rx sees an ACK and we try to push more data for the delayed flow
we may fastpath the skb, not realizing that there are already "ready
to send" packets for this flow sitting in the qdisc.
Don't trust the fastpath if we are "behind" according to the projected
Tx time for the next flow waiting in the Qdisc. Because we consider
anything within the offload window to be okay for the fastpath, we must
consider the entire offload window as "now".
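For reference, the upstream form of the new check includes the offload
horizon; this stable backport drops the q->offload_horizon term because
these branches do not have it (sketch paraphrased from the upstream
commit):

    /* Ordering invariants fall apart if some delayed flows
     * are ready but we haven't serviced them, yet.
     */
    if (q->time_next_delayed_flow <= now + q->offload_horizon)
            return false;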
Qdisc config:
qdisc fq 8001: dev eth0 parent 1234:1 limit 10000p flow_limit 100p \
buckets 32768 orphan_mask 1023 bands 3 \
priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 \
weights 589824 196608 65536 quantum 3028b initial_quantum 15140b \
low_rate_threshold 550Kbit \
refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
For iperf this change seems to do fine; the reordering is gone.
The fastpath still gets used most of the time:
gc 0 highprio 0 fastpath 142614 throttled 418309 latency 19.1us
xx_behind 2731
where "xx_behind" counts how many times we hit the new "return false".
CC: stable(a)vger.kernel.org
Fixes: 076433bd78d7 ("net_sched: sch_fq: add fast path for mostly idle qdisc")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Reviewed-by: Eric Dumazet <edumazet(a)google.com>
Link: https://patch.msgid.link/20241124022148.3126719-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
[stable: drop the offload horizon, it's not supported / 0]
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
Per the Fixes tag this is 6.7+, so the two non-longterm branches.
---
net/sched/sch_fq.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 19a49af5a9e5..afefe124d903 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -331,6 +331,12 @@ static bool fq_fastpath_check(const struct Qdisc *sch, struct sk_buff *skb,
*/
if (q->internal.qlen >= 8)
return false;
+
+ /* Ordering invariants fall apart if some delayed flows
+ * are ready but we haven't serviced them, yet.
+ */
+ if (q->time_next_delayed_flow <= now)
+ return false;
}
sk = skb->sk;
--
2.47.0