The quilt patch titled
Subject: ocfs2: handle a symlink read error correctly
has been removed from the -mm tree. Its filename was
ocfs2-handle-a-symlink-read-error-correctly.patch
This patch was dropped because it was merged into the mm-nonmm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: "Matthew Wilcox (Oracle)" <willy(a)infradead.org>
Subject: ocfs2: handle a symlink read error correctly
Date: Thu, 5 Dec 2024 17:16:29 +0000
Patch series "Convert ocfs2 to use folios".
Mark did a conversion of ocfs2 to use folios and sent it to me as a
giant patch for review ;-)
So I've redone it as individual patches, and credited Mark for the patches
where his code is substantially the same. It's not a bad way to do it;
his patch had some bugs and my patches had some bugs. Hopefully all our
bugs were different from each other. And hopefully Mark likes all the
changes I made to his code!
This patch (of 23):
If we can't read the buffer, be sure to unlock the page before returning.
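To see the pattern in isolation, here is a standalone userspace sketch (hypothetical names; a pthread mutex stands in for the page lock). Every exit path after taking the lock funnels through a single label, which is exactly what the goto below buys the kernel function:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

static int read_buffer(int fail) { return fail ? -5 : 0; }	/* -5 models -EIO */

static int fast_symlink_read(int fail)
{
	int status;

	pthread_mutex_lock(&page_lock);		/* the page arrives locked */

	status = read_buffer(fail);
	if (status < 0)
		goto out;			/* was: return status, leaking the lock */

	/* ... copy the symlink target into the page ... */
out:
	pthread_mutex_unlock(&page_lock);	/* one unlock covers every path */
	return status;
}

int main(void)
{
	printf("success path: %d\n", fast_symlink_read(0));
	printf("error path:   %d\n", fast_symlink_read(1));
	return 0;
}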
Link: https://lkml.kernel.org/r/20241205171653.3179945-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20241205171653.3179945-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Reviewed-by: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Cc: Mark Fasheh <mark(a)fasheh.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Changwei Ge <gechangwei(a)live.cn>
Cc: Jun Piao <piaojun(a)huawei.com>
Cc: Mark Tinguely <mark.tinguely(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/symlink.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/symlink.c~ocfs2-handle-a-symlink-read-error-correctly
+++ a/fs/ocfs2/symlink.c
@@ -65,7 +65,7 @@ static int ocfs2_fast_symlink_read_folio
if (status < 0) {
mlog_errno(status);
- return status;
+ goto out;
}
fe = (struct ocfs2_dinode *) bh->b_data;
@@ -76,9 +76,10 @@ static int ocfs2_fast_symlink_read_folio
memcpy(kaddr, link, len + 1);
kunmap_atomic(kaddr);
SetPageUptodate(page);
+out:
unlock_page(page);
brelse(bh);
- return 0;
+ return status;
}
const struct address_space_operations ocfs2_fast_symlink_aops = {
_
Patches currently in -mm which might be from willy(a)infradead.org are
mm-page_alloc-cache-page_zone-result-in-free_unref_page.patch
mm-make-alloc_pages_mpol-static.patch
mm-page_alloc-export-free_frozen_pages-instead-of-free_unref_page.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-post_alloc_hook.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-prep_new_page.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-get_page_from_freelist.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-__alloc_pages_cpuset_fallback.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-__alloc_pages_may_oom.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-__alloc_pages_direct_compact.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-__alloc_pages_direct_reclaim.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-__alloc_pages_slowpath.patch
mm-page_alloc-move-set_page_refcounted-to-end-of-__alloc_pages.patch
mm-page_alloc-add-__alloc_frozen_pages.patch
mm-mempolicy-add-alloc_frozen_pages.patch
slab-allocate-frozen-pages.patch
mm-remove-pagetranstail.patch
The quilt patch titled
Subject: fs/proc: fix softlockup in __read_vmcore (part 2)
has been removed from the -mm tree. Its filename was
fs-proc-fix-softlockup-in-__read_vmcore-part-2.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Rik van Riel <riel(a)surriel.com>
Subject: fs/proc: fix softlockup in __read_vmcore (part 2)
Date: Fri, 10 Jan 2025 10:28:21 -0500
Since commit 5cbcb62dddf5 ("fs/proc: fix softlockup in __read_vmcore") the
number of softlockups in __read_vmcore at kdump time has gone down, but
they still happen sometimes.
In a memory constrained environment like the kdump image, a softlockup is
not just a harmless message, but it can interfere with things like RCU
freeing memory, causing the crashdump to get stuck.
The second loop in __read_vmcore has a lot more opportunities for natural
sleep points, like scheduling out while waiting for a data write to
happen, but apparently that is not always enough.
Add a cond_resched() to the second loop in __read_vmcore to (hopefully)
get rid of the softlockups.
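A userspace sketch of the idea, with sched_yield() standing in for cond_resched() (illustrative only, not the kernel code): a long chunked copy loop that voluntarily yields each iteration so other tasks are never starved.

#include <sched.h>
#include <stdio.h>
#include <string.h>

#define CHUNK 4096

static void copy_in_chunks(char *dst, const char *src, size_t len)
{
	while (len) {
		size_t n = len < CHUNK ? len : CHUNK;

		memcpy(dst, src, n);
		dst += n;
		src += n;
		len -= n;

		sched_yield();	/* yield each iteration, as cond_resched() does */
	}
}

int main(void)
{
	static char src[1 << 20], dst[1 << 20];

	copy_in_chunks(dst, src, sizeof(src));
	puts("done");
	return 0;
}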
Link: https://lkml.kernel.org/r/20250110102821.2a37581b@fangorn
Fixes: 5cbcb62dddf5 ("fs/proc: fix softlockup in __read_vmcore")
Signed-off-by: Rik van Riel <riel(a)surriel.com>
Reported-by: Breno Leitao <leitao(a)debian.org>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Dave Young <dyoung(a)redhat.com>
Cc: Vivek Goyal <vgoyal(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/proc/vmcore.c | 2 ++
1 file changed, 2 insertions(+)
--- a/fs/proc/vmcore.c~fs-proc-fix-softlockup-in-__read_vmcore-part-2
+++ a/fs/proc/vmcore.c
@@ -404,6 +404,8 @@ static ssize_t __read_vmcore(struct iov_
if (!iov_iter_count(iter))
return acc;
}
+
+ cond_resched();
}
return acc;
_
Patches currently in -mm which might be from riel(a)surriel.com are
mm-remove-unnecessary-calls-to-lru_add_drain.patch
The quilt patch titled
Subject: mm: vmscan : pgdemote vmstat is not getting updated when MGLRU is enabled.
has been removed from the -mm tree. Its filename was
mm-vmscan-pgdemote-vmstat-is-not-getting-updated-when-mglru-is-enabled.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Donet Tom <donettom(a)linux.ibm.com>
Subject: mm: vmscan : pgdemote vmstat is not getting updated when MGLRU is enabled.
Date: Thu, 9 Jan 2025 00:05:39 -0600
When MGLRU is enabled, the pgdemote_kswapd, pgdemote_direct, and
pgdemote_khugepaged stats in vmstat are not being updated.
Commit f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA
balancing operations") moved the pgdemote vmstat update from
demote_folio_list() to shrink_inactive_list(), which is in the normal LRU
path. As a result, the pgdemote stats are updated correctly for the
normal LRU but not for MGLRU.
To address this, we have added the pgdemote stat update in the
evict_folios() function, which is in the MGLRU path. With this patch, the
pgdemote stats will now be updated correctly when MGLRU is enabled.
Without this patch vmstat output when MGLRU is enabled
======================================================
pgdemote_kswapd 0
pgdemote_direct 0
pgdemote_khugepaged 0
With this patch vmstat output when MGLRU is enabled
===================================================
pgdemote_kswapd 43234
pgdemote_direct 4691
pgdemote_khugepaged 0
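The bug class is easy to state in miniature: a counter bumped on only one of two parallel reclaim paths reads zero forever under the other. A toy sketch (hypothetical names, not the kernel code):

#include <stdio.h>

static unsigned long pgdemote;	/* the vmstat counter, in miniature */

static void classic_lru_reclaim(unsigned long nr_demoted)
{
	pgdemote += nr_demoted;	/* updated here since f77f0c751478 */
}

static void mglru_reclaim(unsigned long nr_demoted)
{
	pgdemote += nr_demoted;	/* the fix: mirror the update here too */
}

int main(void)
{
	mglru_reclaim(42);
	printf("pgdemote = %lu\n", pgdemote);	/* 42, no longer stuck at 0 */
	return 0;
}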
Link: https://lkml.kernel.org/r/20250109060540.451261-1-donettom@linux.ibm.com
Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations")
Signed-off-by: Donet Tom <donettom(a)linux.ibm.com>
Acked-by: Yu Zhao <yuzhao(a)google.com>
Tested-by: Li Zhijian <lizhijian(a)fujitsu.com>
Reviewed-by: Li Zhijian <lizhijian(a)fujitsu.com>
Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar(a)kernel.org>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Kaiyang Zhao <kaiyang2(a)cs.cmu.edu>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Ritesh Harjani (IBM) <ritesh.list(a)gmail.com>
Cc: Roman Gushchin <roman.gushchin(a)linux.dev>
Cc: Shakeel Butt <shakeel.butt(a)linux.dev>
Cc: Wei Xu <weixugc(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmscan.c | 3 +++
1 file changed, 3 insertions(+)
--- a/mm/vmscan.c~mm-vmscan-pgdemote-vmstat-is-not-getting-updated-when-mglru-is-enabled
+++ a/mm/vmscan.c
@@ -4642,6 +4642,9 @@ retry:
reset_batch_size(walk);
}
+ __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
+ stat.nr_demoted);
+
item = PGSTEAL_KSWAPD + reclaimer_offset();
if (!cgroup_reclaim(sc))
__count_vm_events(item, reclaimed);
_
Patches currently in -mm which might be from donettom(a)linux.ibm.com are
mm-migrate-removed-unused-argument-vma-from-migrate_misplaced_folio.patch
selftests-mm-added-new-test-cases-to-the-migration-test.patch
The quilt patch titled
Subject: zram: fix potential UAF of zram table
has been removed from the -mm tree. Its filename was
zram-fix-potential-uaf-of-zram-table.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Kairui Song <kasong(a)tencent.com>
Subject: zram: fix potential UAF of zram table
Date: Tue, 7 Jan 2025 14:54:46 +0800
If zram_meta_alloc() fails early, it frees the allocated zram->table
without setting it to NULL, which can cause zram_meta_free() to access
the table if the user resets a failed and uninitialized device.
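A standalone C sketch of the hazard and the one-line fix (hypothetical names; free(NULL) being a no-op is what makes the NULL assignment sufficient):

#include <stdlib.h>

struct meta {
	void *table;
	void *pool;
};

static int meta_alloc(struct meta *m)
{
	m->table = malloc(4096);
	if (!m->table)
		return -1;

	m->pool = NULL;			/* pretend pool creation failed */
	if (!m->pool) {
		free(m->table);
		m->table = NULL;	/* the fix: don't leave it dangling */
		return -1;
	}
	return 0;
}

static void meta_free(struct meta *m)
{
	free(m->table);			/* without the fix: double free / UAF */
	m->table = NULL;
}

int main(void)
{
	struct meta m = { 0 };

	if (meta_alloc(&m) < 0)
		meta_free(&m);		/* "reset the failed device" */
	return 0;
}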
Link: https://lkml.kernel.org/r/20250107065446.86928-1-ryncsn@gmail.com
Fixes: 74363ec674cb ("zram: fix uninitialized ZRAM not releasing backing device")
Signed-off-by: Kairui Song <kasong(a)tencent.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky(a)chromium.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/block/zram/zram_drv.c | 1 +
1 file changed, 1 insertion(+)
--- a/drivers/block/zram/zram_drv.c~zram-fix-potential-uaf-of-zram-table
+++ a/drivers/block/zram/zram_drv.c
@@ -1468,6 +1468,7 @@ static bool zram_meta_alloc(struct zram
zram->mem_pool = zs_create_pool(zram->disk->disk_name);
if (!zram->mem_pool) {
vfree(zram->table);
+ zram->table = NULL;
return false;
}
_
Patches currently in -mm which might be from kasong(a)tencent.com are
mm-memcontrol-avoid-duplicated-memcg-enable-check.patch
mm-swap_cgroup-remove-swap_cgroup_cmpxchg.patch
mm-swap_cgroup-remove-global-swap-cgroup-lock.patch
mm-swap_cgroup-decouple-swap-cgroup-recording-and-clearing.patch
mm-swap-minor-clean-up-for-swap-entry-allocation.patch
mm-swap-fold-swap_info_get_cont-in-the-only-caller.patch
mm-swap-remove-old-allocation-path-for-hdd.patch
mm-swap-use-cluster-lock-for-hdd.patch
mm-swap-clean-up-device-availability-check.patch
mm-swap-clean-up-plist-removal-and-adding.patch
mm-swap-hold-a-reference-during-scan-and-cleanup-flag-usage.patch
mm-swap-use-an-enum-to-define-all-cluster-flags-and-wrap-flags-changes.patch
mm-swap-reduce-contention-on-device-lock.patch
mm-swap-simplify-percpu-cluster-updating.patch
mm-swap-introduce-a-helper-for-retrieving-cluster-from-offset.patch
mm-swap-use-a-global-swap-cluster-for-non-rotation-devices.patch
mm-swap_slots-remove-slot-cache-for-freeing-path.patch
The quilt patch titled
Subject: selftests/mm: set allocated memory to non-zero content in cow test
has been removed from the -mm tree. Its filename was
selftests-mm-set-allocated-memory-to-non-zero-content-in-cow-test.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Ryan Roberts <ryan.roberts(a)arm.com>
Subject: selftests/mm: set allocated memory to non-zero content in cow test
Date: Tue, 7 Jan 2025 14:25:53 +0000
After commit b1f202060afe ("mm: remap unused subpages to shared zeropage
when splitting isolated thp"), cow test cases involving swapping out THPs
via madvise(MADV_PAGEOUT) started to be skipped due to the subsequent
check via pagemap determining that the memory was not actually swapped
out. Logs similar to this were emitted:
...
# [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP (16 kB)
ok 2 # SKIP MADV_PAGEOUT did not work, is swap enabled?
# [RUN] Basic COW after fork() ... with single PTE of swapped-out THP (16 kB)
ok 3 # SKIP MADV_PAGEOUT did not work, is swap enabled?
# [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP (32 kB)
ok 4 # SKIP MADV_PAGEOUT did not work, is swap enabled?
...
The commit in question introduces the behaviour of scanning THPs and, if
their content is predominantly zero, splitting them and replacing the
wholly-zero pages with the shared zeropage. These cow test cases were
getting caught up in this.
So let's avoid that by filling the contents of all allocated memory with
a non-zero value. With this in place, the tests are passing again.
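A simplified standalone version of the adjustment (assumes Linux with swap enabled and MADV_PAGEOUT available; this is a sketch, not the selftest itself):

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t size = 2 * 1024 * 1024;		/* typical PMD-sized THP */
	char *mem = aligned_alloc(size, size);

	if (!mem)
		return 1;

	/* was memset(mem, 0, ...): an all-zero THP may be split and
	 * remapped to the zeropage instead of being swapped out */
	memset(mem, 1, size);

	/* with non-zero content this really swaps the region out */
	madvise(mem, size, MADV_PAGEOUT);

	free(mem);
	return 0;
}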
Link: https://lkml.kernel.org/r/20250107142555.1870101-1-ryan.roberts@arm.com
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Signed-off-by: Ryan Roberts <ryan.roberts(a)arm.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Usama Arif <usamaarif642(a)gmail.com>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
tools/testing/selftests/mm/cow.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
--- a/tools/testing/selftests/mm/cow.c~selftests-mm-set-allocated-memory-to-non-zero-content-in-cow-test
+++ a/tools/testing/selftests/mm/cow.c
@@ -758,7 +758,7 @@ static void do_run_with_base_page(test_f
}
/* Populate a base page. */
- memset(mem, 0, pagesize);
+ memset(mem, 1, pagesize);
if (swapout) {
madvise(mem, pagesize, MADV_PAGEOUT);
@@ -824,12 +824,12 @@ static void do_run_with_thp(test_fn fn,
* Try to populate a THP. Touch the first sub-page and test if
* we get the last sub-page populated automatically.
*/
- mem[0] = 0;
+ mem[0] = 1;
if (!pagemap_is_populated(pagemap_fd, mem + thpsize - pagesize)) {
ksft_test_result_skip("Did not get a THP populated\n");
goto munmap;
}
- memset(mem, 0, thpsize);
+ memset(mem, 1, thpsize);
size = thpsize;
switch (thp_run) {
@@ -1012,7 +1012,7 @@ static void run_with_hugetlb(test_fn fn,
}
/* Populate an huge page. */
- memset(mem, 0, hugetlbsize);
+ memset(mem, 1, hugetlbsize);
/*
* We need a total of two hugetlb pages to handle COW/unsharing
_
Patches currently in -mm which might be from ryan.roberts(a)arm.com are
selftests-mm-add-fork-cow-guard-page-test-fix.patch
selftests-mm-introduce-uffd-wp-mremap-regression-test.patch
The quilt patch titled
Subject: mm: clear uffd-wp PTE/PMD state on mremap()
has been removed from the -mm tree. Its filename was
mm-clear-uffd-wp-pte-pmd-state-on-mremap.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Ryan Roberts <ryan.roberts(a)arm.com>
Subject: mm: clear uffd-wp PTE/PMD state on mremap()
Date: Tue, 7 Jan 2025 14:47:52 +0000
When mremap()ing a memory region previously registered with userfaultfd as
write-protected but without UFFD_FEATURE_EVENT_REMAP, an inconsistency in
flag clearing leads to a mismatch between the vma flags (which have
uffd-wp cleared) and the pte/pmd flags (which do not have uffd-wp
cleared). This mismatch causes a subsequent mprotect(PROT_WRITE) to
trigger a warning in page_table_check_pte_flags() due to setting the pte
to writable while uffd-wp is still set.
Fix this by always explicitly clearing the uffd-wp pte/pmd flags on any
such mremap() so that the values are consistent with the existing clearing
of VM_UFFD_WP. Be careful to clear the logical flag regardless of its
physical form; a PTE bit, a swap PTE bit, or a PTE marker. Cover PTE,
huge PMD and hugetlb paths.
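Schematically, the "clear the logical flag in whatever physical form it takes" rule looks like this (toy types, not the kernel's):

#include <stdio.h>

enum pte_form { PTE_PRESENT, PTE_SWAP, PTE_MARKER };

struct pte {
	enum pte_form form;
	unsigned int flags;
};

#define UFFD_WP 0x1u

static struct pte clear_uffd_wp(struct pte pte)
{
	switch (pte.form) {
	case PTE_PRESENT:	/* hardware PTE: clear the arch bit */
	case PTE_SWAP:		/* swap entry: clear the software bit */
		pte.flags &= ~UFFD_WP;
		break;
	case PTE_MARKER:	/* a marker PTE carries nothing else: drop it */
		pte.flags = 0;
		break;
	}
	return pte;
}

int main(void)
{
	struct pte p = { PTE_SWAP, UFFD_WP };

	p = clear_uffd_wp(p);
	printf("flags after move: %#x\n", p.flags);	/* 0 */
	return 0;
}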
Link: https://lkml.kernel.org/r/20250107144755.1871363-2-ryan.roberts@arm.com
Co-developed-by: Mikołaj Lenczewski <miko.lenczewski(a)arm.com>
Signed-off-by: Mikołaj Lenczewski <miko.lenczewski(a)arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts(a)arm.com>
Closes: https://lore.kernel.org/linux-mm/810b44a8-d2ae-4107-b665-5a42eae2d948@arm.c…
Fixes: 63b2d4174c4a ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl")
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/userfaultfd_k.h | 12 ++++++++++++
mm/huge_memory.c | 12 ++++++++++++
mm/hugetlb.c | 14 +++++++++++++-
mm/mremap.c | 32 +++++++++++++++++++++++++++++++-
4 files changed, 68 insertions(+), 2 deletions(-)
--- a/include/linux/userfaultfd_k.h~mm-clear-uffd-wp-pte-pmd-state-on-mremap
+++ a/include/linux/userfaultfd_k.h
@@ -247,6 +247,13 @@ static inline bool vma_can_userfault(str
vma_is_shmem(vma);
}
+static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
+{
+ struct userfaultfd_ctx *uffd_ctx = vma->vm_userfaultfd_ctx.ctx;
+
+ return uffd_ctx && (uffd_ctx->features & UFFD_FEATURE_EVENT_REMAP) == 0;
+}
+
extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
extern void dup_userfaultfd_complete(struct list_head *);
void dup_userfaultfd_fail(struct list_head *);
@@ -401,6 +408,11 @@ static inline bool userfaultfd_wp_async(
{
return false;
}
+
+static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
+{
+ return false;
+}
#endif /* CONFIG_USERFAULTFD */
--- a/mm/huge_memory.c~mm-clear-uffd-wp-pte-pmd-state-on-mremap
+++ a/mm/huge_memory.c
@@ -2206,6 +2206,16 @@ static pmd_t move_soft_dirty_pmd(pmd_t p
return pmd;
}
+static pmd_t clear_uffd_wp_pmd(pmd_t pmd)
+{
+ if (pmd_present(pmd))
+ pmd = pmd_clear_uffd_wp(pmd);
+ else if (is_swap_pmd(pmd))
+ pmd = pmd_swp_clear_uffd_wp(pmd);
+
+ return pmd;
+}
+
bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
{
@@ -2244,6 +2254,8 @@ bool move_huge_pmd(struct vm_area_struct
pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
}
pmd = move_soft_dirty_pmd(pmd);
+ if (vma_has_uffd_without_event_remap(vma))
+ pmd = clear_uffd_wp_pmd(pmd);
set_pmd_at(mm, new_addr, new_pmd, pmd);
if (force_flush)
flush_pmd_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
--- a/mm/hugetlb.c~mm-clear-uffd-wp-pte-pmd-state-on-mremap
+++ a/mm/hugetlb.c
@@ -5402,6 +5402,7 @@ static void move_huge_pte(struct vm_area
unsigned long new_addr, pte_t *src_pte, pte_t *dst_pte,
unsigned long sz)
{
+ bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
struct hstate *h = hstate_vma(vma);
struct mm_struct *mm = vma->vm_mm;
spinlock_t *src_ptl, *dst_ptl;
@@ -5418,7 +5419,18 @@ static void move_huge_pte(struct vm_area
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
pte = huge_ptep_get_and_clear(mm, old_addr, src_pte);
- set_huge_pte_at(mm, new_addr, dst_pte, pte, sz);
+
+ if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
+ huge_pte_clear(mm, new_addr, dst_pte, sz);
+ else {
+ if (need_clear_uffd_wp) {
+ if (pte_present(pte))
+ pte = huge_pte_clear_uffd_wp(pte);
+ else if (is_swap_pte(pte))
+ pte = pte_swp_clear_uffd_wp(pte);
+ }
+ set_huge_pte_at(mm, new_addr, dst_pte, pte, sz);
+ }
if (src_ptl != dst_ptl)
spin_unlock(src_ptl);
--- a/mm/mremap.c~mm-clear-uffd-wp-pte-pmd-state-on-mremap
+++ a/mm/mremap.c
@@ -138,6 +138,7 @@ static int move_ptes(struct vm_area_stru
struct vm_area_struct *new_vma, pmd_t *new_pmd,
unsigned long new_addr, bool need_rmap_locks)
{
+ bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
struct mm_struct *mm = vma->vm_mm;
pte_t *old_pte, *new_pte, pte;
pmd_t dummy_pmdval;
@@ -216,7 +217,18 @@ static int move_ptes(struct vm_area_stru
force_flush = true;
pte = move_pte(pte, old_addr, new_addr);
pte = move_soft_dirty_pte(pte);
- set_pte_at(mm, new_addr, new_pte, pte);
+
+ if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
+ pte_clear(mm, new_addr, new_pte);
+ else {
+ if (need_clear_uffd_wp) {
+ if (pte_present(pte))
+ pte = pte_clear_uffd_wp(pte);
+ else if (is_swap_pte(pte))
+ pte = pte_swp_clear_uffd_wp(pte);
+ }
+ set_pte_at(mm, new_addr, new_pte, pte);
+ }
}
arch_leave_lazy_mmu_mode();
@@ -278,6 +290,15 @@ static bool move_normal_pmd(struct vm_ar
if (WARN_ON_ONCE(!pmd_none(*new_pmd)))
return false;
+ /* If this pmd belongs to a uffd vma with remap events disabled, we need
+ * to ensure that the uffd-wp state is cleared from all pgtables. This
+ * means recursing into lower page tables in move_page_tables(), and we
+ * can reuse the existing code if we simply treat the entry as "not
+ * moved".
+ */
+ if (vma_has_uffd_without_event_remap(vma))
+ return false;
+
/*
* We don't have to worry about the ordering of src and dst
* ptlocks because exclusive mmap_lock prevents deadlock.
@@ -333,6 +354,15 @@ static bool move_normal_pud(struct vm_ar
if (WARN_ON_ONCE(!pud_none(*new_pud)))
return false;
+ /* If this pud belongs to a uffd vma with remap events disabled, we need
+ * to ensure that the uffd-wp state is cleared from all pgtables. This
+ * means recursing into lower page tables in move_page_tables(), and we
+ * can reuse the existing code if we simply treat the entry as "not
+ * moved".
+ */
+ if (vma_has_uffd_without_event_remap(vma))
+ return false;
+
/*
* We don't have to worry about the ordering of src and dst
* ptlocks because exclusive mmap_lock prevents deadlock.
_
Patches currently in -mm which might be from ryan.roberts(a)arm.com are
selftests-mm-add-fork-cow-guard-page-test-fix.patch
selftests-mm-introduce-uffd-wp-mremap-regression-test.patch
The quilt patch titled
Subject: mm: zswap: properly synchronize freeing resources during CPU hotunplug
has been removed from the -mm tree. Its filename was
mm-zswap-properly-synchronize-freeing-resources-during-cpu-hotunplug.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Yosry Ahmed <yosryahmed(a)google.com>
Subject: mm: zswap: properly synchronize freeing resources during CPU hotunplug
Date: Wed, 8 Jan 2025 22:24:41 +0000
In zswap_compress() and zswap_decompress(), the per-CPU acomp_ctx of the
current CPU at the beginning of the operation is retrieved and used
throughout. However, since neither preemption nor migration are disabled,
it is possible that the operation continues on a different CPU.
If the original CPU is hotunplugged while the acomp_ctx is still in use,
we run into a UAF bug as some of the resources attached to the acomp_ctx
are freed during hotunplug in zswap_cpu_comp_dead() (i.e.
acomp_ctx.buffer, acomp_ctx.req, or acomp_ctx.acomp).
The problem was introduced in commit 1ec3b5fe6eec ("mm/zswap: move to use
crypto_acomp API for hardware acceleration") when the switch to the
crypto_acomp API was made. Prior to that, the per-CPU crypto_comp was
retrieved using get_cpu_ptr() which disables preemption and makes sure the
CPU cannot go away from under us. Preemption cannot be disabled with the
crypto_acomp API as a sleepable context is needed.
Use the acomp_ctx.mutex to synchronize CPU hotplug callbacks allocating
and freeing resources with compression/decompression paths. Make sure
that acomp_ctx.req is NULL when the resources are freed. In the
compression/decompression paths, check if acomp_ctx.req is NULL after
acquiring the mutex (meaning the CPU was offlined) and retry on the new
CPU.
The initialization of acomp_ctx.mutex is moved from the CPU hotplug
callback to the pool initialization where it belongs (where the mutex is
allocated). In addition to adding clarity, this makes sure that CPU
hotplug cannot reinitialize a mutex that is already locked by
compression/decompression.
Previously a fix was attempted by holding cpus_read_lock() [1]. This
would have introduced a potential deadlock, as code already holding the
lock can fall into reclaim and enter zswap. A fix was also attempted
using SRCU for synchronization, but Johannes pointed out that
synchronize_srcu() cannot be used in CPU hotplug notifiers [2].
Johannes pointed out that synchronize_srcu() cannot be used in CPU hotplug
notifiers [2].
Alternative fixes that were considered/attempted and could have worked:
- Refcounting the per-CPU acomp_ctx. This involves complexity in
handling the race between the refcount dropping to zero in
zswap_[de]compress() and the refcount being re-initialized when the
CPU is onlined.
- Disabling migration before getting the per-CPU acomp_ctx [3], but
that's discouraged and is a much bigger hammer than needed, and could
result in subtle performance issues.
[1] https://lkml.kernel.org/20241219212437.2714151-1-yosryahmed@google.com/
[2] https://lkml.kernel.org/20250107074724.1756696-2-yosryahmed@google.com/
[3] https://lkml.kernel.org/20250107222236.2715883-2-yosryahmed@google.com/
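The resulting locking discipline, reduced to a pthreads sketch (hypothetical names; the mutex plays the role of acomp_ctx.mutex, and req going NULL models the hotunplug teardown):

#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

struct ctx {
	pthread_mutex_t mutex;
	void *req;			/* set to NULL when the CPU goes away */
};

/* pick_current() models raw_cpu_ptr(): the answer may be stale by the
 * time we take the mutex, which is exactly the race being closed. */
static struct ctx *ctx_get_lock(struct ctx *(*pick_current)(void))
{
	struct ctx *c;

	for (;;) {
		c = pick_current();
		pthread_mutex_lock(&c->mutex);
		if (c->req)
			return c;	/* alive: return with mutex held */
		pthread_mutex_unlock(&c->mutex);	/* torn down: retry */
	}
}

static struct ctx global = { PTHREAD_MUTEX_INITIALIZER, &global };

static struct ctx *pick_global(void) { return &global; }

int main(void)
{
	struct ctx *c = ctx_get_lock(pick_global);

	/* ... compress/decompress under the mutex ... */
	pthread_mutex_unlock(&c->mutex);
	puts("ok");
	return 0;
}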
[yosryahmed(a)google.com: remove comment]
Link: https://lkml.kernel.org/r/CAJD7tkaxS1wjn+swugt8QCvQ-rVF5RZnjxwPGX17k8x9zSMa…
Link: https://lkml.kernel.org/r/20250108222441.3622031-1-yosryahmed@google.com
Fixes: 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration")
Signed-off-by: Yosry Ahmed <yosryahmed(a)google.com>
Reported-by: Johannes Weiner <hannes(a)cmpxchg.org>
Closes: https://lore.kernel.org/lkml/20241113213007.GB1564047@cmpxchg.org/
Reported-by: Sam Sun <samsun1006219(a)gmail.com>
Closes: https://lore.kernel.org/lkml/CAEkJfYMtSdM5HceNsXUDf5haghD5+o2e7Qv4OcuruL4tP…
Cc: Barry Song <baohua(a)kernel.org>
Cc: Chengming Zhou <chengming.zhou(a)linux.dev>
Cc: Kanchana P Sridhar <kanchana.p.sridhar(a)intel.com>
Cc: Nhat Pham <nphamcs(a)gmail.com>
Cc: Vitaly Wool <vitalywool(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/zswap.c | 58 ++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 44 insertions(+), 14 deletions(-)
--- a/mm/zswap.c~mm-zswap-properly-synchronize-freeing-resources-during-cpu-hotunplug
+++ a/mm/zswap.c
@@ -251,7 +251,7 @@ static struct zswap_pool *zswap_pool_cre
struct zswap_pool *pool;
char name[38]; /* 'zswap' + 32 char (max) num + \0 */
gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
- int ret;
+ int ret, cpu;
if (!zswap_has_pool) {
/* if either are unset, pool initialization failed, and we
@@ -285,6 +285,9 @@ static struct zswap_pool *zswap_pool_cre
goto error;
}
+ for_each_possible_cpu(cpu)
+ mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
+
ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
&pool->node);
if (ret)
@@ -821,11 +824,12 @@ static int zswap_cpu_comp_prepare(unsign
struct acomp_req *req;
int ret;
- mutex_init(&acomp_ctx->mutex);
-
+ mutex_lock(&acomp_ctx->mutex);
acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
- if (!acomp_ctx->buffer)
- return -ENOMEM;
+ if (!acomp_ctx->buffer) {
+ ret = -ENOMEM;
+ goto buffer_fail;
+ }
acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
if (IS_ERR(acomp)) {
@@ -855,12 +859,15 @@ static int zswap_cpu_comp_prepare(unsign
acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
crypto_req_done, &acomp_ctx->wait);
+ mutex_unlock(&acomp_ctx->mutex);
return 0;
req_fail:
crypto_free_acomp(acomp_ctx->acomp);
acomp_fail:
kfree(acomp_ctx->buffer);
+buffer_fail:
+ mutex_unlock(&acomp_ctx->mutex);
return ret;
}
@@ -869,17 +876,45 @@ static int zswap_cpu_comp_dead(unsigned
struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+ mutex_lock(&acomp_ctx->mutex);
if (!IS_ERR_OR_NULL(acomp_ctx)) {
if (!IS_ERR_OR_NULL(acomp_ctx->req))
acomp_request_free(acomp_ctx->req);
+ acomp_ctx->req = NULL;
if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
crypto_free_acomp(acomp_ctx->acomp);
kfree(acomp_ctx->buffer);
}
+ mutex_unlock(&acomp_ctx->mutex);
return 0;
}
+static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
+{
+ struct crypto_acomp_ctx *acomp_ctx;
+
+ for (;;) {
+ acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
+ mutex_lock(&acomp_ctx->mutex);
+ if (likely(acomp_ctx->req))
+ return acomp_ctx;
+ /*
+ * It is possible that we were migrated to a different CPU after
+ * getting the per-CPU ctx but before the mutex was acquired. If
+ * the old CPU got offlined, zswap_cpu_comp_dead() could have
+ * already freed ctx->req (among other things) and set it to
+ * NULL. Just try again on the new CPU that we ended up on.
+ */
+ mutex_unlock(&acomp_ctx->mutex);
+ }
+}
+
+static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
+{
+ mutex_unlock(&acomp_ctx->mutex);
+}
+
static bool zswap_compress(struct page *page, struct zswap_entry *entry,
struct zswap_pool *pool)
{
@@ -893,10 +928,7 @@ static bool zswap_compress(struct page *
gfp_t gfp;
u8 *dst;
- acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
-
- mutex_lock(&acomp_ctx->mutex);
-
+ acomp_ctx = acomp_ctx_get_cpu_lock(pool);
dst = acomp_ctx->buffer;
sg_init_table(&input, 1);
sg_set_page(&input, page, PAGE_SIZE, 0);
@@ -949,7 +981,7 @@ unlock:
else if (alloc_ret)
zswap_reject_alloc_fail++;
- mutex_unlock(&acomp_ctx->mutex);
+ acomp_ctx_put_unlock(acomp_ctx);
return comp_ret == 0 && alloc_ret == 0;
}
@@ -960,9 +992,7 @@ static void zswap_decompress(struct zswa
struct crypto_acomp_ctx *acomp_ctx;
u8 *src;
- acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
- mutex_lock(&acomp_ctx->mutex);
-
+ acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
/*
* If zpool_map_handle is atomic, we cannot reliably utilize its mapped buffer
@@ -986,10 +1016,10 @@ static void zswap_decompress(struct zswa
acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
- mutex_unlock(&acomp_ctx->mutex);
if (src != acomp_ctx->buffer)
zpool_unmap_handle(zpool, entry->handle);
+ acomp_ctx_put_unlock(acomp_ctx);
}
/*********************************
_
Patches currently in -mm which might be from yosryahmed(a)google.com are
The quilt patch titled
Subject: hugetlb: fix NULL pointer dereference in trace_hugetlbfs_alloc_inode
has been removed from the -mm tree. Its filename was
hugetlb-fix-null-pointer-dereference-in-trace_hugetlbfs_alloc_inode.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Muchun Song <songmuchun(a)bytedance.com>
Subject: hugetlb: fix NULL pointer dereference in trace_hugetlbfs_alloc_inode
Date: Mon, 6 Jan 2025 11:31:17 +0800
hugetlb_file_setup() will pass a NULL @dir to hugetlbfs_get_inode(), so we
will access a NULL pointer for @dir. Fix it and set __entry->dir to 0 if
@dir is NULL. Because ->i_ino cannot be 0 (see get_next_ino()), there is
no confusion if the user sees a 0 inode number.
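In miniature (toy struct, not the tracepoint machinery): guard the dereference and use 0 as the "no parent" sentinel, which cannot collide with a real inode number:

#include <stdio.h>

struct inode {
	unsigned long i_ino;
};

static unsigned long parent_ino(const struct inode *dir)
{
	return dir ? dir->i_ino : 0;	/* was dir->i_ino: NULL deref */
}

int main(void)
{
	struct inode root = { .i_ino = 2 };

	printf("with dir:    %lu\n", parent_ino(&root));
	printf("without dir: %lu\n", parent_ino(NULL));	/* 0, unambiguous */
	return 0;
}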
Link: https://lkml.kernel.org/r/20250106033118.4640-1-songmuchun@bytedance.com
Fixes: 318580ad7f28 ("hugetlbfs: support tracepoint")
Signed-off-by: Muchun Song <songmuchun(a)bytedance.com>
Reported-by: Cheung Wall <zzqq0103.hey(a)gmail.com>
Closes: https://lore.kernel.org/linux-mm/02858D60-43C1-4863-A84F-3C76A8AF1F15@linux…
Reviewed-by: Hongbo Li <lihongbo22(a)huawei.com>
Cc: cheung wall <zzqq0103.hey(a)gmail.com>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/trace/events/hugetlbfs.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/include/trace/events/hugetlbfs.h~hugetlb-fix-null-pointer-dereference-in-trace_hugetlbfs_alloc_inode
+++ a/include/trace/events/hugetlbfs.h
@@ -23,7 +23,7 @@ TRACE_EVENT(hugetlbfs_alloc_inode,
TP_fast_assign(
__entry->dev = inode->i_sb->s_dev;
__entry->ino = inode->i_ino;
- __entry->dir = dir->i_ino;
+ __entry->dir = dir ? dir->i_ino : 0;
__entry->mode = mode;
),
_
Patches currently in -mm which might be from songmuchun(a)bytedance.com are
The quilt patch titled
Subject: filemap: avoid truncating 64-bit offset to 32 bits
has been removed from the -mm tree. Its filename was
filemap-avoid-truncating-64-bit-offset-to-32-bits.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Marco Nelissen <marco.nelissen(a)gmail.com>
Subject: filemap: avoid truncating 64-bit offset to 32 bits
Date: Thu, 2 Jan 2025 11:04:11 -0800
On 32-bit kernels, folio_seek_hole_data() was inadvertently truncating a
64-bit value to 32 bits, leading to a possible infinite loop when writing
to an xfs filesystem.
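The truncation is easy to reproduce standalone; here the stdint types model loff_t and u64, with a uint32_t bsz playing the part of the 32-bit kernel's value:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	int64_t start = 0x100001234;	/* offset beyond 4 GiB, like loff_t */
	uint32_t bsz = 4096;

	/* ~(bsz - 1) is computed in 32 bits, so the mask zeroes the top half */
	int64_t bad  = (start + bsz) & ~(bsz - 1);
	/* the fix: widen before negating, preserving the high bits */
	int64_t good = (start + bsz) & ~((uint64_t)bsz - 1);

	printf("bad:  %#llx\n", (unsigned long long)bad);	/* 0x2000 */
	printf("good: %#llx\n", (unsigned long long)good);	/* 0x100002000 */
	return 0;
}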
Link: https://lkml.kernel.org/r/20250102190540.1356838-1-marco.nelissen@gmail.com
Fixes: 54fa39ac2e00 ("iomap: use mapping_seek_hole_data")
Signed-off-by: Marco Nelissen <marco.nelissen(a)gmail.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/filemap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/filemap.c~filemap-avoid-truncating-64-bit-offset-to-32-bits
+++ a/mm/filemap.c
@@ -2996,7 +2996,7 @@ static inline loff_t folio_seek_hole_dat
if (ops->is_partially_uptodate(folio, offset, bsz) ==
seek_data)
break;
- start = (start + bsz) & ~(bsz - 1);
+ start = (start + bsz) & ~((u64)bsz - 1);
offset += bsz;
} while (offset < folio_size(folio));
unlock:
_
Patches currently in -mm which might be from marco.nelissen(a)gmail.com are