The patch titled
Subject: Re: dma/pool: do not complain if DMA pool is not allocated
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
dma-pool-do-not-complain-if-dma-pool-is-not-allocated.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Michal Hocko <mhocko(a)suse.com>
Subject: Re: dma/pool: do not complain if DMA pool is not allocated
Date: Tue, 9 Aug 2022 17:37:59 +0200
We have a system complaining about order-10 allocation for the DMA pool.
[ 14.017417][ T1] swapper/0: page allocation failure: order:10, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0-7
[ 14.017429][ T1] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 5.14.21-150400.22-default #1 SLE15-SP4 0b6a6578ade2de5c4a0b916095dff44f76ef1704
[ 14.017434][ T1] Hardware name: XXXX
[ 14.017437][ T1] Call Trace:
[ 14.017444][ T1] <TASK>
[ 14.017449][ T1] dump_stack_lvl+0x45/0x57
[ 14.017469][ T1] warn_alloc+0xfe/0x160
[ 14.017490][ T1] __alloc_pages_slowpath.constprop.112+0xc27/0xc60
[ 14.017497][ T1] ? rdinit_setup+0x2b/0x2b
[ 14.017509][ T1] ? rdinit_setup+0x2b/0x2b
[ 14.017512][ T1] __alloc_pages+0x2d5/0x320
[ 14.017517][ T1] alloc_page_interleave+0xf/0x70
[ 14.017531][ T1] atomic_pool_expand+0x4a/0x200
[ 14.017541][ T1] ? rdinit_setup+0x2b/0x2b
[ 14.017544][ T1] __dma_atomic_pool_init+0x44/0x90
[ 14.017556][ T1] dma_atomic_pool_init+0xad/0x13f
[ 14.017560][ T1] ? __dma_atomic_pool_init+0x90/0x90
[ 14.017562][ T1] do_one_initcall+0x41/0x200
[ 14.017581][ T1] kernel_init_freeable+0x236/0x298
[ 14.017589][ T1] ? rest_init+0xd0/0xd0
[ 14.017596][ T1] kernel_init+0x16/0x120
[ 14.017599][ T1] ret_from_fork+0x22/0x30
[ 14.017604][ T1] </TASK>
[...]
[ 14.018026][ T1] Node 0 DMA free:160kB boost:0kB min:0kB low:0kB high:0kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 14.018035][ T1] lowmem_reserve[]: 0 0 0 0 0
[ 14.018339][ T1] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 0*64kB 1*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 160kB
The usable memory in the DMA zone is obviously too small for the pool
pre-allocation. The allocation failure raises concern among admins because
it is reported like an error state.
In fact the failed pre-allocation itself doesn't indicate any actual
problem. It is not even clear whether anybody is ever going to use this
pool; if somebody does, a warning will be triggered at that point anyway.
Silence the warning to prevent confusion and bug reports.
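For scale: order-10 means 2^10 contiguous pages, i.e. 4 MiB with 4 KiB
pages, while the zone report above shows only 160 kB free out of 15360 kB
managed in the DMA zone, so the failure is entirely expected. Adding
__GFP_NOWARN silences the splat because warn_alloc() bails out early for
such requests; roughly (a paraphrased sketch, not the verbatim
mm/page_alloc.c code):

/*
 * Paraphrased sketch of the warn_alloc() early return that __GFP_NOWARN
 * relies on; not the verbatim mm/page_alloc.c implementation.
 */
void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
{
	static DEFINE_RATELIMIT_STATE(nowarn_rs, DEFAULT_RATELIMIT_INTERVAL,
				      DEFAULT_RATELIMIT_BURST);

	/* callers passing __GFP_NOWARN opt out of the failure report */
	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nowarn_rs))
		return;

	/* ... otherwise dump the gfp mask, a stack trace and meminfo ... */
}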
Link: https://lkml.kernel.org/r/YvJ/V2bor9Q3P6ov@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: John Donnelly <john.p.donnelly(a)oracle.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/dma/pool.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/kernel/dma/pool.c~dma-pool-do-not-complain-if-dma-pool-is-not-allocated
+++ a/kernel/dma/pool.c
@@ -205,7 +205,7 @@ static int __init dma_atomic_pool_init(v
ret = -ENOMEM;
if (has_managed_dma()) {
atomic_pool_dma = __dma_atomic_pool_init(atomic_pool_size,
- GFP_KERNEL | GFP_DMA);
+ GFP_KERNEL | GFP_DMA | __GFP_NOWARN);
if (!atomic_pool_dma)
ret = -ENOMEM;
}
_
Patches currently in -mm which might be from mhocko(a)suse.com are
dma-pool-do-not-complain-if-dma-pool-is-not-allocated.patch
When the port does not support USB PD, prevent transitions to PD-only
states when a power-supply property is written. Without this check, TCPM
transitions to SNK_NEGOTIATE_CAPABILITIES even though the port is not
pd_capable, as the log below shows.
[ 84.308251] state change SNK_READY -> SNK_NEGOTIATE_CAPABILITIES [rev3 NONE_AMS]
[ 84.308335] Setting usb_comm capable false
[ 84.323367] set_auto_vbus_discharge_threshold mode:3 pps_active:n vbus:5000 ret:0
[ 84.323376] state change SNK_NEGOTIATE_CAPABILITIES -> SNK_WAIT_CAPABILITIES [rev3 NONE_AMS]
Fixes: e9e6e164ed8f6 ("usb: typec: tcpm: Support non-PD mode")
Signed-off-by: Badhri Jagan Sridharan <badhri(a)google.com>
---
Changes since v1:
- Add Fixes tag.
---
drivers/usb/typec/tcpm/tcpm.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/usb/typec/tcpm/tcpm.c b/drivers/usb/typec/tcpm/tcpm.c
index ea5a917c51b1..904c7b4ce2f0 100644
--- a/drivers/usb/typec/tcpm/tcpm.c
+++ b/drivers/usb/typec/tcpm/tcpm.c
@@ -6320,6 +6320,13 @@ static int tcpm_psy_set_prop(struct power_supply *psy,
struct tcpm_port *port = power_supply_get_drvdata(psy);
int ret;
+ /*
+ * All the properties below are related to USB PD. The check needs to be
+ * property specific when a non-pd related property is added.
+ */
+ if (!port->pd_supported)
+ return -EOPNOTSUPP;
+
switch (psp) {
case POWER_SUPPLY_PROP_ONLINE:
ret = tcpm_psy_set_online(port, val);
--
2.37.1.595.g718a3a8f04-goog
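As the comment added in tcpm_psy_set_prop() notes, the blanket check will
have to become property-specific once a non-PD property exists. Purely as
an illustration (hypothetical, not part of the patch; the property
grouping below is a placeholder), the switch could then gate only the
PD-driven properties:

	switch (psp) {
	case POWER_SUPPLY_PROP_ONLINE:
	case POWER_SUPPLY_PROP_VOLTAGE_NOW:
	case POWER_SUPPLY_PROP_CURRENT_NOW:
		/* PD-driven properties: reject on ports without PD support */
		if (!port->pd_supported)
			return -EOPNOTSUPP;
		break;
	default:
		/* a future non-PD property would not be rejected here */
		break;
	}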
The work queueing path relies on the atomic test-and-set that sets the
PENDING bit behaving as a full barrier, even when it fails because the bit
was already set.
Otherwise, the PENDING state may be observed before memory writes
pertaining to the work complete, as they are allowed to be reordered.
That can lead to work being processed before all prior writes are
observable, and no new work being queued to ensure they are observed at
some point.
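Concretely, every queue_work() caller depends on a pattern like the
following (a simplified sketch with made-up identifiers, not code from the
tree; 'my_work' is assumed to have been set up with INIT_WORK()):

static int payload;			/* data published before queueing */
static struct work_struct my_work;

static void my_work_fn(struct work_struct *work)
{
	/*
	 * Must observe payload == 1.  That only holds if setting PENDING
	 * in queue_work() orders the producer's store before it, even
	 * when the bit was already set and queue_work() returned false.
	 */
	pr_info("payload = %d\n", READ_ONCE(payload));
}

static void producer(void)
{
	WRITE_ONCE(payload, 1);			/* publish the data first  */
	queue_work(system_wq, &my_work);	/* ... then mark it pending */
}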
This has been broken since the dawn of time, and it was incompletely
fixed by 346c09f80459, which added the necessary barriers in the work
execution path but failed to account for the missing barrier in the
test_and_set_bit() failure case. Fix it by switching to
atomic_long_fetch_or(), which does have unconditional barrier semantics
regardless of whether the bit was already set or not (this is actually
just test_and_set_bit() minus the early exit path).
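A failed test_and_set_bit() is not guaranteed to be fully ordered on
every architecture: the generic bitops take an early-exit shortcut,
roughly of this shape (a sketch, not the verbatim asm-generic code):

/*
 * Rough sketch of the early exit in the generic arch_test_and_set_bit():
 * when the bit is already set, it returns after a plain load with no
 * barrier at all.  Not verbatim kernel code.
 */
static inline int sketch_test_and_set_bit(unsigned int nr,
					  volatile unsigned long *p)
{
	unsigned long mask = BIT_MASK(nr);

	p += BIT_WORD(nr);
	if (READ_ONCE(*p) & mask)	/* early exit: no ordering */
		return 1;

	return !!(atomic_long_fetch_or(mask, (atomic_long_t *)p) & mask);
}

atomic_long_fetch_or(), by contrast, always performs the fully ordered
RMW, which is exactly what the queueing path needs.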
Discovered [1] on Apple M1 platforms, which are ridiculously
out-of-order and managed to trigger this in the TTY core, of all places.
Easily reproducible by running this m1n1 client script on one M1 machine
connected to another one running the m1n1 bootloader in proxy mode:
=============
from m1n1.setup import *
i = 0
while True:
a = iface.readmem(u.base, 1170)
print(i)
i += 1
=============
The script will hang when the TTY layer fails to push a buffer of data
into the ldisc in a timely manner in tty_flip_buffer_push(), which
writes a buffer pointer and then queue_work()s the ldisc push.
(Note: reproducibility depends on .config options)
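Schematically, the TTY path follows the producer pattern sketched above;
a paraphrase of tty_flip_buffer_push() (helper and field names are
approximate, not the verbatim drivers/tty code):

void tty_flip_buffer_push(struct tty_port *port)
{
	struct tty_bufhead *buf = &port->buf;

	tty_flip_buffer_commit(buf->tail);	   /* publish buffered data */
	queue_work(system_unbound_wq, &buf->work); /* kick flush_to_ldisc() */
}

If the failed PENDING test-and-set does not order the commit store, an
already-running worker can miss the newly committed data while no new
worker gets queued to pick it up, which matches the observed hang.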
Additionally, properly document that queue_work() provides these
guarantees even when the work is already queued (it wouldn't make sense
for it not to; the comment in set_work_pool_and_clear_pending() already
implies it does, and the TTY core and probably quite a few other places
rely on it).
[1] https://lore.kernel.org/lkml/6c089268-4f2c-9fdf-7bcb-107b611fbc21@marcan.st…
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Fixes: 346c09f80459 ("workqueue: fix ghost PENDING flag while doing MQ IO")
Cc: stable(a)vger.kernel.org
Signed-off-by: Hector Martin <marcan(a)marcan.st>
---
include/linux/workqueue.h | 15 ++++++++++-----
kernel/workqueue.c | 39 +++++++++++++++++++++++++++++++--------
2 files changed, 41 insertions(+), 13 deletions(-)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index a0143dd24430..d9ea73813a3c 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -484,18 +484,23 @@ extern void wq_worker_comm(char *buf, size_t size, struct task_struct *task);
* We queue the work to the CPU on which it was submitted, but if the CPU dies
* it can be processed by another CPU.
*
- * Memory-ordering properties: If it returns %true, guarantees that all stores
- * preceding the call to queue_work() in the program order will be visible from
- * the CPU which will execute @work by the time such work executes, e.g.,
+ * Memory-ordering properties: Guarantees that all stores preceding the call to
+ * queue_work() in the program order will be visible from the CPU which will
+ * execute @work by the time such work executes, e.g.,
*
* { x is initially 0 }
*
* CPU0 CPU1
*
* WRITE_ONCE(x, 1); [ @work is being executed ]
- * r0 = queue_work(wq, work); r1 = READ_ONCE(x);
+ * queue_work(wq, work); r0 = READ_ONCE(x);
*
- * Forbids: r0 == true && r1 == 0
+ * Forbids: r0 == 0 for the currently pending execution of @work after
+ * queue_work() completes.
+ *
+ * If @work was already pending (ret == false), that execution is guaranteed
+ * to observe x == 1. If @work was newly queued (ret == true), the newly
+ * queued execution is guaranteed to observe x == 1.
*/
static inline bool queue_work(struct workqueue_struct *wq,
struct work_struct *work)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index aeea9731ef80..01bc03eed649 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -655,7 +655,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
{
/*
* The following wmb is paired with the implied mb in
- * test_and_set_bit(PENDING) and ensures all updates to @work made
+ * atomic_long_fetch_or(PENDING) and ensures all updates to @work made
* here are visible to and precede any updates by the next PENDING
* owner.
*/
@@ -673,7 +673,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
*
* 1 STORE event_indicated
* 2 queue_work_on() {
- * 3 test_and_set_bit(PENDING)
+ * 3 atomic_long_fetch_or(PENDING)
* 4 } set_..._and_clear_pending() {
* 5 set_work_data() # clear bit
* 6 smp_mb()
@@ -688,6 +688,15 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
* finish the queued @work. Meanwhile CPU#1 does not see
* event_indicated is set, because speculative LOAD was executed
* before actual STORE.
+ *
+ * Line 3 requires barrier semantics, even on failure. If it were
+ * implemented with test_and_set_bit() (which does not have
+ * barrier semantics on failure), that would allow the STORE to
+ * be reordered after it, and it could be observed by CPU#1 after
+ * it has executed all the way through to line 8 (and cleared the
+ * PENDING bit in the process). At this point, CPU#0 would not have
+ * queued new work (having observed PENDING set), and CPU#1 would not
+ * have observed the event_indicated store in the last work execution.
*/
smp_mb();
}
@@ -1276,8 +1285,9 @@ static int try_to_grab_pending(struct work_struct *work, bool is_dwork,
return 1;
}
- /* try to claim PENDING the normal way */
- if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
+ /* try to claim PENDING the normal way, see queue_work_on() */
+ if (!(atomic_long_fetch_or(WORK_STRUCT_PENDING, &work->data)
+ & WORK_STRUCT_PENDING))
return 0;
rcu_read_lock();
@@ -1541,7 +1551,14 @@ bool queue_work_on(int cpu, struct workqueue_struct *wq,
local_irq_save(flags);
- if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+ /*
+ * We need unconditional barrier semantics, even on failure,
+ * to avoid racing set_work_pool_and_clear_pending(). Hence,
+ * this has to be atomic_long_fetch_or(), not test_and_set_bit()
+ * which elides the barrier on failure.
+ */
+ if (!(atomic_long_fetch_or(WORK_STRUCT_PENDING, &work->data)
+ & WORK_STRUCT_PENDING)) {
__queue_work(cpu, wq, work);
ret = true;
}
@@ -1623,7 +1640,9 @@ bool queue_work_node(int node, struct workqueue_struct *wq,
local_irq_save(flags);
- if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+ /* see queue_work_on() */
+ if (!(atomic_long_fetch_or(WORK_STRUCT_PENDING, &work->data)
+ & WORK_STRUCT_PENDING)) {
int cpu = workqueue_select_cpu_near(node);
__queue_work(cpu, wq, work);
@@ -1697,7 +1716,9 @@ bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
/* read the comment in __queue_work() */
local_irq_save(flags);
- if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+ /* see queue_work_on() */
+ if (!(atomic_long_fetch_or(WORK_STRUCT_PENDING, &work->data)
+ & WORK_STRUCT_PENDING)) {
__queue_delayed_work(cpu, wq, dwork, delay);
ret = true;
}
@@ -1769,7 +1790,9 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
{
struct work_struct *work = &rwork->work;
- if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+ /* see queue_work_on() */
+ if (!(atomic_long_fetch_or(WORK_STRUCT_PENDING, &work->data)
+ & WORK_STRUCT_PENDING)) {
rwork->wq = wq;
call_rcu(&rwork->rcu, rcu_work_rcufn);
return true;
--
2.35.1
The patch titled
Subject: mm/migrate_device.c: copy pte dirty bit to page
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-migrate_devicec-copy-pte-dirty-bit-to-page.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Alistair Popple <apopple(a)nvidia.com>
Subject: mm/migrate_device.c: copy pte dirty bit to page
Date: Tue, 16 Aug 2022 17:39:24 +1000
migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
installs migration entries directly if it can lock the migrating page.
When removing a dirty pte, the dirty bit is supposed to be carried over
to the underlying page to prevent it from being lost.
Currently migrate_vma_*() can only be used for private anonymous mappings.
That means loss of the dirty bit usually doesn't result in data loss
because these pages are typically not file-backed. However, pages may be
backed by swap storage, which can result in data loss if an attempt is
made to migrate a dirty page that doesn't yet have the PageDirty flag set.
In that case the migration will fail due to unexpected references, but the
dirty pte bit will still be lost. If the page is subsequently reclaimed,
data won't be written back to swap storage because the page is considered
uptodate, resulting in data loss if the page is later accessed.
Prevent this by copying the dirty bit to the page when removing the pte to
match what try_to_migrate_one() does.
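For reference, the try_to_migrate_one() pattern being matched is roughly
the following (a paraphrase, not the verbatim mm/rmap.c code): clear the
pte and transfer its dirty bit to the folio before installing the
migration entry.

	/* paraphrased from the try_to_migrate_one() shape, not verbatim */
	flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
	pteval = ptep_clear_flush(vma, address, pvmw.pte);

	/* move the pte's dirty bit to the folio now that the pte is gone */
	if (pte_dirty(pteval))
		folio_mark_dirty(folio);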
Link: https://lkml.kernel.org/r/6e77914685ede036c419fa65b6adc27f25a6c3e9.16606350…
Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
Signed-off-by: Alistair Popple <apopple(a)nvidia.com>
Acked-by: Peter Xu <peterx(a)redhat.com>
Reported-by: Huang Ying <ying.huang(a)intel.com>
Reviewed-by: Huang Ying <ying.huang(a)intel.com>
Cc: Alex Sierra <alex.sierra(a)amd.com>
Cc: Ben Skeggs <bskeggs(a)redhat.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Felix Kuehling <felix.kuehling(a)amd.com>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Karol Herbst <kherbst(a)redhat.com>
Cc: Logan Gunthorpe <logang(a)deltatee.com>
Cc: Lyude Paul <lyude(a)redhat.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Paul Mackerras <paulus(a)ozlabs.org>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/migrate_device.c | 21 ++++++++-------------
1 file changed, 8 insertions(+), 13 deletions(-)
--- a/mm/migrate_device.c~mm-migrate_devicec-copy-pte-dirty-bit-to-page
+++ a/mm/migrate_device.c
@@ -7,6 +7,7 @@
#include <linux/export.h>
#include <linux/memremap.h>
#include <linux/migrate.h>
+#include <linux/mm.h>
#include <linux/mm_inline.h>
#include <linux/mmu_notifier.h>
#include <linux/oom.h>
@@ -61,7 +62,7 @@ static int migrate_vma_collect_pmd(pmd_t
struct migrate_vma *migrate = walk->private;
struct vm_area_struct *vma = walk->vma;
struct mm_struct *mm = vma->vm_mm;
- unsigned long addr = start, unmapped = 0;
+ unsigned long addr = start;
spinlock_t *ptl;
pte_t *ptep;
@@ -193,11 +194,10 @@ again:
bool anon_exclusive;
pte_t swp_pte;
+ flush_cache_page(vma, addr, pte_pfn(*ptep));
+ pte = ptep_clear_flush(vma, addr, ptep);
anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
if (anon_exclusive) {
- flush_cache_page(vma, addr, pte_pfn(*ptep));
- ptep_clear_flush(vma, addr, ptep);
-
if (page_try_share_anon_rmap(page)) {
set_pte_at(mm, addr, ptep, pte);
unlock_page(page);
@@ -205,12 +205,14 @@ again:
mpfn = 0;
goto next;
}
- } else {
- ptep_get_and_clear(mm, addr, ptep);
}
migrate->cpages++;
+ /* Set the dirty flag on the folio now the pte is gone. */
+ if (pte_dirty(pte))
+ folio_mark_dirty(page_folio(page));
+
/* Setup special migration page table entry */
if (mpfn & MIGRATE_PFN_WRITE)
entry = make_writable_migration_entry(
@@ -242,9 +244,6 @@ again:
*/
page_remove_rmap(page, vma, false);
put_page(page);
-
- if (pte_present(pte))
- unmapped++;
} else {
put_page(page);
mpfn = 0;
@@ -257,10 +256,6 @@ next:
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(ptep - 1, ptl);
- /* Only flush the TLB if we actually modified any entries */
- if (unmapped)
- flush_tlb_range(walk->vma, start, end);
-
return 0;
}
_
Patches currently in -mm which might be from apopple(a)nvidia.com are
mm-migrate_devicec-copy-pte-dirty-bit-to-page.patch
mm-gupc-simplify-and-fix-check_and_migrate_movable_pages-return-codes.patch