April 2025 - Linux-stable-mirror

+ mm-page_alloc-speed-up-fallbacks-in-rmqueue_bulk.patch added to mm-hotfixes-unstable branch

by Andrew Morton

The patch titled Subject: mm: page_alloc: speed up fallbacks in rmqueue_bulk() has been added to the -mm mm-hotfixes-unstable branch. Its filename is mm-page_alloc-speed-up-fallbacks-in-rmqueue_bulk.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche… This patch will later appear in the mm-hotfixes-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Johannes Weiner <hannes(a)cmpxchg.org> Subject: mm: page_alloc: speed up fallbacks in rmqueue_bulk() Date: Mon, 7 Apr 2025 14:01:53 -0400 The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy") as the root cause of a 56.4% regression in vm-scalability::lru-file-mmap-read. Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion"), as the root cause for a regression in worst-case zone->lock+irqoff hold times. Both of these patches modify the page allocator's fallback path to be less greedy in an effort to stave off fragmentation. The flip side of this is that fallbacks are also less productive each time around, which means the fallback search can run much more frequently. Carlos' traces point to rmqueue_bulk() specifically, which tries to refill the percpu cache by allocating a large batch of pages in a loop. It highlights how once the native freelists are exhausted, the fallback code first scans orders top-down for whole blocks to claim, then falls back to a bottom-up search for the smallest buddy to steal. For the next batch page, it goes through the same thing again. This can be made more efficient. Since rmqueue_bulk() holds the zone->lock over the entire batch, the freelists are not subject to outside changes; when the search for a block to claim has already failed, there is no point in trying again for the next page. Modify __rmqueue() to remember the last successful fallback mode, and restart directly from there on the next rmqueue_bulk() iteration. Oliver confirms that this improves beyond the regression that the test robot reported against c2f6ea38fc1b: commit: f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap") c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy") acc4d5ff0b ("Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net") 2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()") <--- your patch f3b92176f4f7100f c2f6ea38fc1b640aa7a2e155cc1 acc4d5ff0b61eb1715c498b6536 2c847f27c37da65a93d23c237c5 ---------------- --------------------------- --------------------------- --------------------------- %stddev %change %stddev %change %stddev %change %stddev \ | \ | \ | \ 25525364 �� 3% -56.4% 11135467 -57.8% 10779336 +31.6% 33581409 vm-scalability.throughput Carlos confirms that worst-case times are almost fully recovered compared to before the earlier culprit patch: 2dd482ba627d (before freelist hygiene): 1ms c0cd6f557b90 (after freelist hygiene): 90ms next-20250319 (steal smallest buddy): 280ms this patch : 8ms Link: https://lkml.kernel.org/r/20250407180154.63348-1-hannes@cmpxchg.org Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion") Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy") Signed-off-by: Johannes Weiner <hannes(a)cmpxchg.org> Reported-by: kernel test robot <oliver.sang(a)intel.com> Reported-by: Carlos Song <carlos.song(a)nxp.com> Tested-by: kernel test robot <oliver.sang(a)intel.com> Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com Cc: <stable(a)vger.kernel.org> [6.10+] Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/page_alloc.c | 100 ++++++++++++++++++++++++++++++++++------------ 1 file changed, 74 insertions(+), 26 deletions(-) --- a/mm/page_alloc.c~mm-page_alloc-speed-up-fallbacks-in-rmqueue_bulk +++ a/mm/page_alloc.c @@ -2189,11 +2189,11 @@ try_to_claim_block(struct zone *zone, st * The use of signed ints for order and current_order is a deliberate * deviation from the rest of this file, to make the for loop * condition simpler. - * - * Return the stolen page, or NULL if none can be found. */ + +/* Try to claim a whole foreign block, take a page, expand the remainder */ static __always_inline struct page * -__rmqueue_fallback(struct zone *zone, int order, int start_migratetype, +__rmqueue_claim(struct zone *zone, int order, int start_migratetype, unsigned int alloc_flags) { struct free_area *area; @@ -2231,14 +2231,26 @@ __rmqueue_fallback(struct zone *zone, in page = try_to_claim_block(zone, page, current_order, order, start_migratetype, fallback_mt, alloc_flags); - if (page) - goto got_one; + if (page) { + trace_mm_page_alloc_extfrag(page, order, current_order, + start_migratetype, fallback_mt); + return page; + } } - if (alloc_flags & ALLOC_NOFRAGMENT) - return NULL; + return NULL; +} + +/* Try to steal a single page from a foreign block */ +static __always_inline struct page * +__rmqueue_steal(struct zone *zone, int order, int start_migratetype) +{ + struct free_area *area; + int current_order; + struct page *page; + int fallback_mt; + bool claim_block; - /* No luck claiming pageblock. Find the smallest fallback page */ for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) { area = &(zone->free_area[current_order]); fallback_mt = find_suitable_fallback(area, current_order, @@ -2248,25 +2260,28 @@ __rmqueue_fallback(struct zone *zone, in page = get_page_from_free_area(area, fallback_mt); page_del_and_expand(zone, page, order, current_order, fallback_mt); - goto got_one; + trace_mm_page_alloc_extfrag(page, order, current_order, + start_migratetype, fallback_mt); + return page; } return NULL; - -got_one: - trace_mm_page_alloc_extfrag(page, order, current_order, - start_migratetype, fallback_mt); - - return page; } +enum rmqueue_mode { + RMQUEUE_NORMAL, + RMQUEUE_CMA, + RMQUEUE_CLAIM, + RMQUEUE_STEAL, +}; + /* * Do the hard work of removing an element from the buddy allocator. * Call me with the zone->lock already held. */ static __always_inline struct page * __rmqueue(struct zone *zone, unsigned int order, int migratetype, - unsigned int alloc_flags) + unsigned int alloc_flags, enum rmqueue_mode *mode) { struct page *page; @@ -2285,16 +2300,47 @@ __rmqueue(struct zone *zone, unsigned in } } - page = __rmqueue_smallest(zone, order, migratetype); - if (unlikely(!page)) { - if (alloc_flags & ALLOC_CMA) + /* + * Try the different freelists, native then foreign. + * + * The fallback logic is expensive and rmqueue_bulk() calls in + * a loop with the zone->lock held, meaning the freelists are + * not subject to any outside changes. Remember in *mode where + * we found pay dirt, to save us the search on the next call. + */ + switch (*mode) { + case RMQUEUE_NORMAL: + page = __rmqueue_smallest(zone, order, migratetype); + if (page) + return page; + fallthrough; + case RMQUEUE_CMA: + if (alloc_flags & ALLOC_CMA) { page = __rmqueue_cma_fallback(zone, order); - - if (!page) - page = __rmqueue_fallback(zone, order, migratetype, - alloc_flags); + if (page) { + *mode = RMQUEUE_CMA; + return page; + } + } + fallthrough; + case RMQUEUE_CLAIM: + page = __rmqueue_claim(zone, order, migratetype, alloc_flags); + if (page) { + /* Replenished native freelist, back to normal mode */ + *mode = RMQUEUE_NORMAL; + return page; + } + fallthrough; + case RMQUEUE_STEAL: + if (!(alloc_flags & ALLOC_NOFRAGMENT)) { + page = __rmqueue_steal(zone, order, migratetype); + if (page) { + *mode = RMQUEUE_STEAL; + return page; + } + } } - return page; + return NULL; } /* @@ -2306,6 +2352,7 @@ static int rmqueue_bulk(struct zone *zon unsigned long count, struct list_head *list, int migratetype, unsigned int alloc_flags) { + enum rmqueue_mode rmqm = RMQUEUE_NORMAL; unsigned long flags; int i; @@ -2317,7 +2364,7 @@ static int rmqueue_bulk(struct zone *zon } for (i = 0; i < count; ++i) { struct page *page = __rmqueue(zone, order, migratetype, - alloc_flags); + alloc_flags, &rmqm); if (unlikely(page == NULL)) break; @@ -2930,6 +2977,7 @@ struct page *rmqueue_buddy(struct zone * { struct page *page; unsigned long flags; + enum rmqueue_mode rmqm = RMQUEUE_NORMAL; do { page = NULL; @@ -2942,7 +2990,7 @@ struct page *rmqueue_buddy(struct zone * if (alloc_flags & ALLOC_HIGHATOMIC) page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); if (!page) { - page = __rmqueue(zone, order, migratetype, alloc_flags); + page = __rmqueue(zone, order, migratetype, alloc_flags, &rmqm); /* * If the allocation fails, allow OOM handling and _ Patches currently in -mm which might be from hannes(a)cmpxchg.org are mm-page_alloc-speed-up-fallbacks-in-rmqueue_bulk.patch

3 months, 1 week

1
0
0 0

[PATCH AUTOSEL 6.1 01/13] usb: host: max3421-hcd: Add missing spi_device_id table

by Sasha Levin

From: Alexander Stein <alexander.stein(a)mailbox.org> [ Upstream commit 41d5e3806cf589f658f92c75195095df0b66f66a ] "maxim,max3421" DT compatible is missing its SPI device ID entry, not allowing module autoloading and leading to the following message: "SPI driver max3421-hcd has no spi_device_id for maxim,max3421" Fix this by adding the spi_device_id table. Signed-off-by: Alexander Stein <alexander.stein(a)mailbox.org> Link: https://lore.kernel.org/r/20250128195114.56321-1-alexander.stein@mailbox.org Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> Signed-off-by: Sasha Levin <sashal(a)kernel.org> --- drivers/usb/host/max3421-hcd.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/drivers/usb/host/max3421-hcd.c b/drivers/usb/host/max3421-hcd.c index ab12d76b01fbe..8aaafba058aa9 100644 --- a/drivers/usb/host/max3421-hcd.c +++ b/drivers/usb/host/max3421-hcd.c @@ -1955,6 +1955,12 @@ max3421_remove(struct spi_device *spi) usb_put_hcd(hcd); } +static const struct spi_device_id max3421_spi_ids[] = { + { "max3421" }, + { }, +}; +MODULE_DEVICE_TABLE(spi, max3421_spi_ids); + static const struct of_device_id max3421_of_match_table[] = { { .compatible = "maxim,max3421", }, {}, @@ -1964,6 +1970,7 @@ MODULE_DEVICE_TABLE(of, max3421_of_match_table); static struct spi_driver max3421_driver = { .probe = max3421_probe, .remove = max3421_remove, + .id_table = max3421_spi_ids, .driver = { .name = "max3421-hcd", .of_match_table = of_match_ptr(max3421_of_match_table), -- 2.39.5

3 months, 1 week

1
12
0 0

[PATCH AUTOSEL 6.6 01/15] usb: host: max3421-hcd: Add missing spi_device_id table

by Sasha Levin

From: Alexander Stein <alexander.stein(a)mailbox.org> [ Upstream commit 41d5e3806cf589f658f92c75195095df0b66f66a ] "maxim,max3421" DT compatible is missing its SPI device ID entry, not allowing module autoloading and leading to the following message: "SPI driver max3421-hcd has no spi_device_id for maxim,max3421" Fix this by adding the spi_device_id table. Signed-off-by: Alexander Stein <alexander.stein(a)mailbox.org> Link: https://lore.kernel.org/r/20250128195114.56321-1-alexander.stein@mailbox.org Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> Signed-off-by: Sasha Levin <sashal(a)kernel.org> --- drivers/usb/host/max3421-hcd.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/drivers/usb/host/max3421-hcd.c b/drivers/usb/host/max3421-hcd.c index a219260ad3e6c..cc1f579f02de1 100644 --- a/drivers/usb/host/max3421-hcd.c +++ b/drivers/usb/host/max3421-hcd.c @@ -1946,6 +1946,12 @@ max3421_remove(struct spi_device *spi) usb_put_hcd(hcd); } +static const struct spi_device_id max3421_spi_ids[] = { + { "max3421" }, + { }, +}; +MODULE_DEVICE_TABLE(spi, max3421_spi_ids); + static const struct of_device_id max3421_of_match_table[] = { { .compatible = "maxim,max3421", }, {}, @@ -1955,6 +1961,7 @@ MODULE_DEVICE_TABLE(of, max3421_of_match_table); static struct spi_driver max3421_driver = { .probe = max3421_probe, .remove = max3421_remove, + .id_table = max3421_spi_ids, .driver = { .name = "max3421-hcd", .of_match_table = max3421_of_match_table, -- 2.39.5

3 months, 1 week

1
14
0 0

[PATCH AUTOSEL 6.12 01/22] usb: host: max3421-hcd: Add missing spi_device_id table

by Sasha Levin

From: Alexander Stein <alexander.stein(a)mailbox.org> [ Upstream commit 41d5e3806cf589f658f92c75195095df0b66f66a ] "maxim,max3421" DT compatible is missing its SPI device ID entry, not allowing module autoloading and leading to the following message: "SPI driver max3421-hcd has no spi_device_id for maxim,max3421" Fix this by adding the spi_device_id table. Signed-off-by: Alexander Stein <alexander.stein(a)mailbox.org> Link: https://lore.kernel.org/r/20250128195114.56321-1-alexander.stein@mailbox.org Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> Signed-off-by: Sasha Levin <sashal(a)kernel.org> --- drivers/usb/host/max3421-hcd.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/drivers/usb/host/max3421-hcd.c b/drivers/usb/host/max3421-hcd.c index 0881fdd1823e0..dcf31a592f5d1 100644 --- a/drivers/usb/host/max3421-hcd.c +++ b/drivers/usb/host/max3421-hcd.c @@ -1946,6 +1946,12 @@ max3421_remove(struct spi_device *spi) usb_put_hcd(hcd); } +static const struct spi_device_id max3421_spi_ids[] = { + { "max3421" }, + { }, +}; +MODULE_DEVICE_TABLE(spi, max3421_spi_ids); + static const struct of_device_id max3421_of_match_table[] = { { .compatible = "maxim,max3421", }, {}, @@ -1955,6 +1961,7 @@ MODULE_DEVICE_TABLE(of, max3421_of_match_table); static struct spi_driver max3421_driver = { .probe = max3421_probe, .remove = max3421_remove, + .id_table = max3421_spi_ids, .driver = { .name = "max3421-hcd", .of_match_table = max3421_of_match_table, -- 2.39.5

3 months, 1 week

1
21
0 0

[PATCH v3] x86/e820: Fix handling of subpage regions when calculating nosave ranges

by Myrrh Periwinkle

The current implementation of e820__register_nosave_regions suffers from multiple serious issues: - The end of last region is tracked by PFN, causing it to find holes that aren't there if two consecutive subpage regions are present - The nosave PFN ranges derived from holes are rounded out (instead of rounded in) which makes it inconsistent with how explicitly reserved regions are handled Fix this by: - Treating reserved regions as if they were holes, to ensure consistent handling (rounding out nosave PFN ranges is more correct as the kernel does not use partial pages) - Tracking the end of the last RAM region by address instead of pages to detect holes more precisely Cc: stable(a)vger.kernel.org Fixes: e5540f875404 ("x86/boot/e820: Consolidate 'struct e820_entry *entry' local variable names") Reported-by: Roberto Ricci <io(a)r-ricci.it> Closes: https://lore.kernel.org/all/Z4WFjBVHpndct7br@desktop0a/ Signed-off-by: Myrrh Periwinkle <myrrhperiwinkle(a)qtmlabs.xyz> --- The issue of the kernel failing to resume from hibernation after kexec_load() is used is likely due to kexec-tools passing in a different e820 memory map from the one provided by system firmware, causing the e820 consistency check to fail. That issue is not addressed in this patch and will need to be fixed in kexec-tools instead. Changes in v3: - Round out nosave PFN ranges since the kernel does not use partial pages - Link to v2: https://lore.kernel.org/r/20250405-fix-e820-nosave-v2-1-d40dbe457c95@qtmlab… Changes in v2: - Updated author details - Rewrote commit message - Link to v1: https://lore.kernel.org/r/20250405-fix-e820-nosave-v1-1-162633199548@qtmlab… --- arch/x86/kernel/e820.c | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index 57120f0749cc3c23844eeb36820705687e08bbf7..9d8dd8deb2a702bd34b961ca4f5eba8a8d9643d0 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -753,22 +753,21 @@ void __init e820__memory_setup_extended(u64 phys_addr, u32 data_len) void __init e820__register_nosave_regions(unsigned long limit_pfn) { int i; - unsigned long pfn = 0; + u64 last_addr = 0; for (i = 0; i < e820_table->nr_entries; i++) { struct e820_entry *entry = &e820_table->entries[i]; - if (pfn < PFN_UP(entry->addr)) - register_nosave_region(pfn, PFN_UP(entry->addr)); - - pfn = PFN_DOWN(entry->addr + entry->size); - if (entry->type != E820_TYPE_RAM) - register_nosave_region(PFN_UP(entry->addr), pfn); + continue; - if (pfn >= limit_pfn) - break; + if (last_addr < entry->addr) + register_nosave_region(PFN_DOWN(last_addr), PFN_UP(entry->addr)); + + last_addr = entry->addr + entry->size; } + + register_nosave_region(PFN_DOWN(last_addr), limit_pfn); } #ifdef CONFIG_ACPI --- base-commit: a8662bcd2ff152bfbc751cab20f33053d74d0963 change-id: 20250405-fix-e820-nosave-f5cb985a9a61 Best regards, -- Myrrh Periwinkle <myrrhperiwinkle(a)qtmlabs.xyz>

3 months, 1 week

3
6
0 0

[PATCH 5.10] cifs: fix integer overflow in match_server()

by Roman Smirnov

From: Roman Smirnov <r.smirnov(a)omp.ru> [ Upstream commit 2510859475d7f46ed7940db0853f3342bf1b65ee ] The echo_interval is not limited in any way during mounting, which makes it possible to write a large number to it. This can cause an overflow when multiplying ctx->echo_interval by HZ in match_server(). Add constraints for echo_interval to smb3_fs_context_parse_param(). Found by Linux Verification Center (linuxtesting.org) with Svace. Fixes: adfeb3e00e8e1 ("cifs: Make echo interval tunable") Cc: stable(a)vger.kernel.org Signed-off-by: Roman Smirnov <r.smirnov(a)omp.ru> Signed-off-by: Steve French <stfrench(a)microsoft.com> Signed-off-by: Roman Smirnov <r.smirnov(a)omp.ru> --- fs/cifs/connect.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c index a3c0e6a4e484..322b8be73bb8 100644 --- a/fs/cifs/connect.c +++ b/fs/cifs/connect.c @@ -1915,6 +1915,12 @@ cifs_parse_mount_options(const char *mountdata, const char *devname, __func__); goto cifs_parse_mount_err; } + if (option < SMB_ECHO_INTERVAL_MIN || + option > SMB_ECHO_INTERVAL_MAX) { + cifs_dbg(VFS, "%s: Echo interval is out of bounds\n", + __func__); + goto cifs_parse_mount_err; + } vol->echo_interval = option; break; case Opt_snapshot: -- 2.43.0

3 months, 1 week

1
0
0 0

[PATCH 5.15] cifs: fix integer overflow in match_server()

by Roman Smirnov

From: Roman Smirnov <r.smirnov(a)omp.ru> [ Upstream commit 2510859475d7f46ed7940db0853f3342bf1b65ee ] The echo_interval is not limited in any way during mounting, which makes it possible to write a large number to it. This can cause an overflow when multiplying ctx->echo_interval by HZ in match_server(). Add constraints for echo_interval to smb3_fs_context_parse_param(). Found by Linux Verification Center (linuxtesting.org) with Svace. Fixes: adfeb3e00e8e1 ("cifs: Make echo interval tunable") Cc: stable(a)vger.kernel.org Signed-off-by: Roman Smirnov <r.smirnov(a)omp.ru> Signed-off-by: Steve French <stfrench(a)microsoft.com> Signed-off-by: Roman Smirnov <r.smirnov(a)omp.ru> --- fs/cifs/fs_context.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/cifs/fs_context.c b/fs/cifs/fs_context.c index fb3651513f83..3479f0072cf6 100644 --- a/fs/cifs/fs_context.c +++ b/fs/cifs/fs_context.c @@ -1088,6 +1088,11 @@ static int smb3_fs_context_parse_param(struct fs_context *fc, } break; case Opt_echo_interval: + if (result.uint_32 < SMB_ECHO_INTERVAL_MIN || + result.uint_32 > SMB_ECHO_INTERVAL_MAX) { + cifs_errorf(fc, "echo interval is out of bounds\n"); + goto cifs_parse_mount_err; + } ctx->echo_interval = result.uint_32; break; case Opt_snapshot: -- 2.43.0

3 months, 1 week

1
0
0 0

[PATCH v2] platform/x86: thinkpad-acpi: Add error check for tpacpi_check_quirks()

by Wentao Liang

In tpacpi_battery_init(), the return value of tpacpi_check_quirks() needs to be checked. The battery should not be hooked if there is no matched battery information in quirk table. Add an error check and return -ENODEV immediately if the device fail the check. Fixes: 1a32ebb26ba9 ("platform/x86: thinkpad_acpi: Support battery quirk") Cc: stable(a)vger.kernel.org Signed-off-by: Wentao Liang <vulab(a)iscas.ac.cn> --- v2: Fix double assignment error. drivers/platform/x86/thinkpad_acpi.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/platform/x86/thinkpad_acpi.c b/drivers/platform/x86/thinkpad_acpi.c index 2cfb2ac3f465..93eaca3bd9d1 100644 --- a/drivers/platform/x86/thinkpad_acpi.c +++ b/drivers/platform/x86/thinkpad_acpi.c @@ -9973,7 +9973,9 @@ static int __init tpacpi_battery_init(struct ibm_init_struct *ibm) tp_features.battery_force_primary = tpacpi_check_quirks( battery_quirk_table, - ARRAY_SIZE(battery_quirk_table)); + ARRAY_SIZE(battery_quirk_table)) + if (!tp_features.battery_force_primary) + return -ENODEV; battery_hook_register(&battery_hook); return 0; -- 2.42.0.windows.2

3 months, 1 week

2
1
0 0

[PATCH v2 net] net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.

by Kuniyuki Iwashima

When I ran the repro [0] and waited a few seconds, I observed two LOCKDEP splats: a warning immediately followed by a null-ptr-deref. [1] Reproduction Steps: 1) Mount CIFS 2) Add an iptables rule to drop incoming FIN packets for CIFS 3) Unmount CIFS 4) Unload the CIFS module 5) Remove the iptables rule At step 3), the CIFS module calls sock_release() for the underlying TCP socket, and it returns quickly. However, the socket remains in FIN_WAIT_1 because incoming FIN packets are dropped. At this point, the module's refcnt is 0 while the socket is still alive, so the following rmmod command succeeds. # ss -tan State Recv-Q Send-Q Local Address:Port Peer Address:Port FIN-WAIT-1 0 477 10.0.2.15:51062 10.0.0.137:445 # lsmod | grep cifs cifs 1159168 0 This highlights a discrepancy between the lifetime of the CIFS module and the underlying TCP socket. Even after CIFS calls sock_release() and it returns, the TCP socket does not die immediately in order to close the connection gracefully. While this is generally fine, it causes an issue with LOCKDEP because CIFS assigns a different lock class to the TCP socket's sk->sk_lock using sock_lock_init_class_and_name(). Once an incoming packet is processed for the socket or a timer fires, sk->sk_lock is acquired. Then, LOCKDEP checks the lock context in check_wait_context(), where hlock_class() is called to retrieve the lock class. However, since the module has already been unloaded, hlock_class() logs a warning and returns NULL, triggering the null-ptr-deref. If LOCKDEP is enabled, we must ensure that a module calling sock_lock_init_class_and_name() (CIFS, NFS, etc) cannot be unloaded while such a socket is still alive to prevent this issue. Let's hold the module reference in sock_lock_init_class_and_name() and release it when the socket is freed in __sk_destruct(). Note that sock_lock_init() clears sk->sk_owner for svc_create_socket() that calls sock_lock_init_class_and_name() for a listening socket, which clones a socket by sk_clone_lock() without GFP_ZERO. [0]: CIFS_SERVER="10.0.0.137" CIFS_PATH="//${CIFS_SERVER}/Users/Administrator/Desktop/CIFS_TEST" DEV="enp0s3" CRED="/root/WindowsCredential.txt" MNT=$(mktemp -d /tmp/XXXXXX) mount -t cifs ${CIFS_PATH} ${MNT} -o vers=3.0,credentials=${CRED},cache=none,echo_interval=1 iptables -A INPUT -s ${CIFS_SERVER} -j DROP for i in $(seq 10); do umount ${MNT} rmmod cifs sleep 1 done rm -r ${MNT} iptables -D INPUT -s ${CIFS_SERVER} -j DROP [1]: DEBUG_LOCKS_WARN_ON(1) WARNING: CPU: 10 PID: 0 at kernel/locking/lockdep.c:234 hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223) Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs] CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.14.0 #36 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223) ... Call Trace: <IRQ> __lock_acquire (kernel/locking/lockdep.c:4853 kernel/locking/lockdep.c:5178) lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816) _raw_spin_lock_nested (kernel/locking/spinlock.c:379) tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350) ... BUG: kernel NULL pointer dereference, address: 00000000000000c4 PF: supervisor read access in kernel mode PF: error_code(0x0000) - not-present page PGD 0 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Tainted: G W 6.14.0 #36 Tainted: [W]=WARN Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:__lock_acquire (kernel/locking/lockdep.c:4852 kernel/locking/lockdep.c:5178) Code: 15 41 09 c7 41 8b 44 24 20 25 ff 1f 00 00 41 09 c7 8b 84 24 a0 00 00 00 45 89 7c 24 20 41 89 44 24 24 e8 e1 bc ff ff 4c 89 e7 <44> 0f b6 b8 c4 00 00 00 e8 d1 bc ff ff 0f b6 80 c5 00 00 00 88 44 RSP: 0018:ffa0000000468a10 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ff1100010091cc38 RCX: 0000000000000027 RDX: ff1100081f09ca48 RSI: 0000000000000001 RDI: ff1100010091cc88 RBP: ff1100010091c200 R08: ff1100083fe6e228 R09: 00000000ffffbfff R10: ff1100081eca0000 R11: ff1100083fe10dc0 R12: ff1100010091cc88 R13: 0000000000000001 R14: 0000000000000000 R15: 00000000000424b1 FS: 0000000000000000(0000) GS:ff1100081f080000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000c4 CR3: 0000000002c4a003 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816) _raw_spin_lock_nested (kernel/locking/spinlock.c:379) tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350) ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205 (discriminator 1)) ip_local_deliver_finish (./include/linux/rcupdate.h:878 net/ipv4/ip_input.c:234) ip_sublist_rcv_finish (net/ipv4/ip_input.c:576) ip_list_rcv_finish (net/ipv4/ip_input.c:628) ip_list_rcv (net/ipv4/ip_input.c:670) __netif_receive_skb_list_core (net/core/dev.c:5939 net/core/dev.c:5986) netif_receive_skb_list_internal (net/core/dev.c:6040 net/core/dev.c:6129) napi_complete_done (./include/linux/list.h:37 ./include/net/gro.h:519 ./include/net/gro.h:514 net/core/dev.c:6496) e1000_clean (drivers/net/ethernet/intel/e1000/e1000_main.c:3815) __napi_poll.constprop.0 (net/core/dev.c:7191) net_rx_action (net/core/dev.c:7262 net/core/dev.c:7382) handle_softirqs (kernel/softirq.c:561) __irq_exit_rcu (kernel/softirq.c:596 kernel/softirq.c:435 kernel/softirq.c:662) irq_exit_rcu (kernel/softirq.c:680) common_interrupt (arch/x86/kernel/irq.c:280 (discriminator 14)) </IRQ> <TASK> asm_common_interrupt (./arch/x86/include/asm/idtentry.h:693) RIP: 0010:default_idle (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:92 arch/x86/kernel/process.c:744) Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d c3 2b 15 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 RSP: 0018:ffa00000000ffee8 EFLAGS: 00000202 RAX: 000000000000640b RBX: ff1100010091c200 RCX: 0000000000061aa4 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff812f30c5 RBP: 000000000000000a R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ? do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325) default_idle_call (./include/linux/cpuidle.h:143 kernel/sched/idle.c:118) do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325) cpu_startup_entry (kernel/sched/idle.c:422 (discriminator 1)) start_secondary (arch/x86/kernel/smpboot.c:315) common_startup_64 (arch/x86/kernel/head_64.S:421) </TASK> Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs] CR2: 00000000000000c4 Fixes: ed07536ed673 ("[PATCH] lockdep: annotate nfs/nfsd in-kernel sockets") Signed-off-by: Kuniyuki Iwashima <kuniyu(a)amazon.com> Cc: stable(a)vger.kernel.org --- v2: * Clear sk_owner in sock_lock_init() * Define helper under the same #if guard * Remove redundant null check before module_put() v1: https://lore.kernel.org/netdev/20250403020837.51664-1-kuniyu@amazon.com/ --- include/net/sock.h | 38 ++++++++++++++++++++++++++++++++++++-- net/core/sock.c | 4 ++++ 2 files changed, 40 insertions(+), 2 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index 8daf1b3b12c6..4216d7d86150 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -547,6 +547,10 @@ struct sock { struct rcu_head sk_rcu; netns_tracker ns_tracker; struct xarray sk_user_frags; + +#if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES) + struct module *sk_owner; +#endif }; struct sock_bh_locked { @@ -1583,6 +1587,35 @@ static inline void sk_mem_uncharge(struct sock *sk, int size) sk_mem_reclaim(sk); } +#if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES) +static inline void sk_owner_set(struct sock *sk, struct module *owner) +{ + __module_get(owner); + sk->sk_owner = owner; +} + +static inline void sk_owner_clear(struct sock *sk) +{ + sk->sk_owner = NULL; +} + +static inline void sk_owner_put(struct sock *sk) +{ + module_put(sk->sk_owner); +} +#else +static inline void sk_owner_set(struct sock *sk, struct module *owner) +{ +} + +static inline void sk_owner_clear(struct sock *sk) +{ +} + +static inline void sk_owner_put(struct sock *sk) +{ +} +#endif /* * Macro so as to not evaluate some arguments when * lockdep is not enabled. @@ -1592,13 +1625,14 @@ static inline void sk_mem_uncharge(struct sock *sk, int size) */ #define sock_lock_init_class_and_name(sk, sname, skey, name, key) \ do { \ + sk_owner_set(sk, THIS_MODULE); \ sk->sk_lock.owned = 0; \ init_waitqueue_head(&sk->sk_lock.wq); \ spin_lock_init(&(sk)->sk_lock.slock); \ debug_check_no_locks_freed((void *)&(sk)->sk_lock, \ - sizeof((sk)->sk_lock)); \ + sizeof((sk)->sk_lock)); \ lockdep_set_class_and_name(&(sk)->sk_lock.slock, \ - (skey), (sname)); \ + (skey), (sname)); \ lockdep_init_map(&(sk)->sk_lock.dep_map, (name), (key), 0); \ } while (0) diff --git a/net/core/sock.c b/net/core/sock.c index 323892066def..d426c5f8e20f 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -2130,6 +2130,8 @@ int sk_getsockopt(struct sock *sk, int level, int optname, */ static inline void sock_lock_init(struct sock *sk) { + sk_owner_clear(sk); + if (sk->sk_kern_sock) sock_lock_init_class_and_name( sk, @@ -2324,6 +2326,8 @@ static void __sk_destruct(struct rcu_head *head) __netns_tracker_free(net, &sk->ns_tracker, false); net_passive_dec(net); } + + sk_owner_put(sk); sk_prot_free(sk->sk_prot_creator, sk); } -- 2.48.1

3 months, 1 week

3
4
0 0

[PATCH v2] platform/x86: amd: pmf: Fix STT limits

by Mario Limonciello

From: Mario Limonciello <mario.limonciello(a)amd.com> On some platforms it has been observed that STT limits are not being applied properly causing poor performance as power limits are set too low. STT limits that are sent to the platform are supposed to be in Q8.8 format. Convert them before sending. Reported-by: Yijun Shen <Yijun.Shen(a)dell.com> Fixes: 7c45534afa443 ("platform/x86/amd/pmf: Add support for PMF Policy Binary") Cc: stable(a)vger.kernel.org Tested-By: Yijun Shen <Yijun_Shen(a)Dell.com> Signed-off-by: Mario Limonciello <mario.limonciello(a)amd.com> --- v2: * Handle cases for auto-mode, cnqf, and sps as well --- drivers/platform/x86/amd/pmf/auto-mode.c | 4 ++-- drivers/platform/x86/amd/pmf/cnqf.c | 4 ++-- drivers/platform/x86/amd/pmf/sps.c | 8 ++++---- drivers/platform/x86/amd/pmf/tee-if.c | 4 ++-- 4 files changed, 10 insertions(+), 10 deletions(-) diff --git a/drivers/platform/x86/amd/pmf/auto-mode.c b/drivers/platform/x86/amd/pmf/auto-mode.c index 02ff68be10d01..df37f8a84a007 100644 --- a/drivers/platform/x86/amd/pmf/auto-mode.c +++ b/drivers/platform/x86/amd/pmf/auto-mode.c @@ -120,9 +120,9 @@ static void amd_pmf_set_automode(struct amd_pmf_dev *dev, int idx, amd_pmf_send_cmd(dev, SET_SPPT_APU_ONLY, false, pwr_ctrl->sppt_apu_only, NULL); amd_pmf_send_cmd(dev, SET_STT_MIN_LIMIT, false, pwr_ctrl->stt_min, NULL); amd_pmf_send_cmd(dev, SET_STT_LIMIT_APU, false, - pwr_ctrl->stt_skin_temp[STT_TEMP_APU], NULL); + pwr_ctrl->stt_skin_temp[STT_TEMP_APU] << 8, NULL); amd_pmf_send_cmd(dev, SET_STT_LIMIT_HS2, false, - pwr_ctrl->stt_skin_temp[STT_TEMP_HS2], NULL); + pwr_ctrl->stt_skin_temp[STT_TEMP_HS2] << 8, NULL); if (is_apmf_func_supported(dev, APMF_FUNC_SET_FAN_IDX)) apmf_update_fan_idx(dev, config_store.mode_set[idx].fan_control.manual, diff --git a/drivers/platform/x86/amd/pmf/cnqf.c b/drivers/platform/x86/amd/pmf/cnqf.c index bc8899e15c914..6a5ecc05961d9 100644 --- a/drivers/platform/x86/amd/pmf/cnqf.c +++ b/drivers/platform/x86/amd/pmf/cnqf.c @@ -81,9 +81,9 @@ static int amd_pmf_set_cnqf(struct amd_pmf_dev *dev, int src, int idx, amd_pmf_send_cmd(dev, SET_SPPT, false, pc->sppt, NULL); amd_pmf_send_cmd(dev, SET_SPPT_APU_ONLY, false, pc->sppt_apu_only, NULL); amd_pmf_send_cmd(dev, SET_STT_MIN_LIMIT, false, pc->stt_min, NULL); - amd_pmf_send_cmd(dev, SET_STT_LIMIT_APU, false, pc->stt_skin_temp[STT_TEMP_APU], + amd_pmf_send_cmd(dev, SET_STT_LIMIT_APU, false, pc->stt_skin_temp[STT_TEMP_APU] << 8, NULL); - amd_pmf_send_cmd(dev, SET_STT_LIMIT_HS2, false, pc->stt_skin_temp[STT_TEMP_HS2], + amd_pmf_send_cmd(dev, SET_STT_LIMIT_HS2, false, pc->stt_skin_temp[STT_TEMP_HS2] << 8, NULL); if (is_apmf_func_supported(dev, APMF_FUNC_SET_FAN_IDX)) diff --git a/drivers/platform/x86/amd/pmf/sps.c b/drivers/platform/x86/amd/pmf/sps.c index d3083383f11fb..ec10db1bfa5ec 100644 --- a/drivers/platform/x86/amd/pmf/sps.c +++ b/drivers/platform/x86/amd/pmf/sps.c @@ -198,9 +198,9 @@ static void amd_pmf_update_slider_v2(struct amd_pmf_dev *dev, int idx) amd_pmf_send_cmd(dev, SET_STT_MIN_LIMIT, false, apts_config_store.val[idx].stt_min_limit, NULL); amd_pmf_send_cmd(dev, SET_STT_LIMIT_APU, false, - apts_config_store.val[idx].stt_skin_temp_limit_apu, NULL); + apts_config_store.val[idx].stt_skin_temp_limit_apu << 8, NULL); amd_pmf_send_cmd(dev, SET_STT_LIMIT_HS2, false, - apts_config_store.val[idx].stt_skin_temp_limit_hs2, NULL); + apts_config_store.val[idx].stt_skin_temp_limit_hs2 << 8, NULL); } void amd_pmf_update_slider(struct amd_pmf_dev *dev, bool op, int idx, @@ -217,9 +217,9 @@ void amd_pmf_update_slider(struct amd_pmf_dev *dev, bool op, int idx, amd_pmf_send_cmd(dev, SET_STT_MIN_LIMIT, false, config_store.prop[src][idx].stt_min, NULL); amd_pmf_send_cmd(dev, SET_STT_LIMIT_APU, false, - config_store.prop[src][idx].stt_skin_temp[STT_TEMP_APU], NULL); + config_store.prop[src][idx].stt_skin_temp[STT_TEMP_APU] << 8, NULL); amd_pmf_send_cmd(dev, SET_STT_LIMIT_HS2, false, - config_store.prop[src][idx].stt_skin_temp[STT_TEMP_HS2], NULL); + config_store.prop[src][idx].stt_skin_temp[STT_TEMP_HS2] << 8, NULL); } else if (op == SLIDER_OP_GET) { amd_pmf_send_cmd(dev, GET_SPL, true, ARG_NONE, &table->prop[src][idx].spl); amd_pmf_send_cmd(dev, GET_FPPT, true, ARG_NONE, &table->prop[src][idx].fppt); diff --git a/drivers/platform/x86/amd/pmf/tee-if.c b/drivers/platform/x86/amd/pmf/tee-if.c index d6a871f0d8ff2..7096923107929 100644 --- a/drivers/platform/x86/amd/pmf/tee-if.c +++ b/drivers/platform/x86/amd/pmf/tee-if.c @@ -142,7 +142,7 @@ static void amd_pmf_apply_policies(struct amd_pmf_dev *dev, struct ta_pmf_enact_ case PMF_POLICY_STT_SKINTEMP_APU: if (dev->prev_data->stt_skintemp_apu != val) { - amd_pmf_send_cmd(dev, SET_STT_LIMIT_APU, false, val, NULL); + amd_pmf_send_cmd(dev, SET_STT_LIMIT_APU, false, val << 8, NULL); dev_dbg(dev->dev, "update STT_SKINTEMP_APU: %u\n", val); dev->prev_data->stt_skintemp_apu = val; } @@ -150,7 +150,7 @@ static void amd_pmf_apply_policies(struct amd_pmf_dev *dev, struct ta_pmf_enact_ case PMF_POLICY_STT_SKINTEMP_HS2: if (dev->prev_data->stt_skintemp_hs2 != val) { - amd_pmf_send_cmd(dev, SET_STT_LIMIT_HS2, false, val, NULL); + amd_pmf_send_cmd(dev, SET_STT_LIMIT_HS2, false, val << 8, NULL); dev_dbg(dev->dev, "update STT_SKINTEMP_HS2: %u\n", val); dev->prev_data->stt_skintemp_hs2 = val; } -- 2.43.0

3 months, 1 week

2
3
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-stable-mirror April 2025