Greg,
Here are the backports of the maple tree fixes from 6.3-rc6 for the 6.1 and
6.2 trees.
The backported series differ from the upstream patches in that they include
a few extra maple tree patches, and the mm patch is modified to work without
the vma iterator change.
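For reference, the final patch in the series ("mm: enable maple tree RCU mode
by default.") boils down to initializing the mm's maple tree with RCU mode
enabled.  A minimal sketch of the upstream form of that hunk (the backported
context may differ slightly):

	/* include/linux/mm_types.h: let VMA lookups walk the tree under RCU */
	#define MM_MT_FLAGS	(MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN | \
				 MT_FLAGS_USE_RCU)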
Liam R. Howlett (14):
maple_tree: remove GFP_ZERO from kmem_cache_alloc() and
kmem_cache_alloc_bulk()
maple_tree: fix potential rcu issue
maple_tree: reduce user error potential
maple_tree: fix handle of invalidated state in mas_wr_store_setup()
maple_tree: fix mas_prev() and mas_find() state handling
maple_tree: fix mas_skip_node() end slot detection
maple_tree: be more cautious about dead nodes
maple_tree: refine ma_state init from mas_start()
maple_tree: detect dead nodes in mas_start()
maple_tree: fix freeing of nodes in rcu mode
maple_tree: remove extra smp_wmb() from mas_dead_leaves()
maple_tree: add smp_rmb() to dead node detection
maple_tree: add RCU lock checking to rcu callback functions
mm: enable maple tree RCU mode by default.
include/linux/mm_types.h | 3 +-
kernel/fork.c | 3 +
lib/maple_tree.c | 389 ++++++++++++++++++++-----------
mm/mmap.c | 3 +-
tools/testing/radix-tree/maple.c | 18 +-
5 files changed, 263 insertions(+), 153 deletions(-)
--
2.39.2
The patch below does not apply to the 6.2-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.2.y
git checkout FETCH_HEAD
git cherry-pick -x 7c7b962938ddda6a9cd095de557ee5250706ea88
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable@vger.kernel.org>' --in-reply-to '2023041153-unlikable-steam-cf2b@gregkh' --subject-prefix 'PATCH 6.2.y' HEAD^..
Possible dependencies:
7c7b962938dd ("mm: take a page reference when removing device exclusive entries")
7d4a8be0c4b2 ("mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7c7b962938ddda6a9cd095de557ee5250706ea88 Mon Sep 17 00:00:00 2001
From: Alistair Popple <apopple@nvidia.com>
Date: Thu, 30 Mar 2023 12:25:19 +1100
Subject: [PATCH] mm: take a page reference when removing device exclusive
entries
Device exclusive page table entries are used to prevent CPU access to a
page whilst it is being accessed from a device. Typically this is used to
implement atomic operations when the underlying bus does not support
atomic access. When a CPU thread encounters a device exclusive entry it
locks the page and restores the original entry after calling mmu notifiers
to signal drivers that exclusive access is no longer available.
The device exclusive entry holds a reference to the page making it safe to
access the struct page whilst the entry is present. However the fault
handling code does not hold the PTL when taking the page lock. This means
if there are multiple threads faulting concurrently on the device
exclusive entry one will remove the entry whilst others will wait on the
page lock without holding a reference.
This can lead to threads locking or waiting on a folio with a zero
refcount. Whilst mmap_lock prevents the pages getting freed via munmap()
they may still be freed by a migration. This leads to warnings such as
PAGE_FLAGS_CHECK_AT_FREE due to the page being locked when the refcount
drops to zero.
Fix this by trying to take a reference on the folio before locking it.
The code already checks the PTE under the PTL and aborts if the entry is
no longer there. It is also possible the folio has been unmapped, freed
and re-allocated allowing a reference to be taken on an unrelated folio.
This case is also detected by the PTE check and the folio is unlocked
without further changes.
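To make the race concrete, here is a sketch of the interleaving described
above (illustrative only, not code from the patch):

	/*
	 * Thread A                          Thread B
	 * faults on device-exclusive PTE    faults on the same PTE
	 * locks the folio (no extra ref)    waits on the folio lock (no ref)
	 * restores PTE; entry's ref dropped
	 *                                   folio freed by migration (refcount 0)
	 *                                   locks/waits on a free folio ->
	 *                                   PAGE_FLAGS_CHECK_AT_FREE warning
	 */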
Link: https://lkml.kernel.org/r/20230330012519.804116-1-apopple@nvidia.com
Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff --git a/mm/memory.c b/mm/memory.c
index f456f3b5049c..01a23ad48a04 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3563,8 +3563,21 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
struct mmu_notifier_range range;
- if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags))
+ /*
+ * We need a reference to lock the folio because we don't hold
+ * the PTL so a racing thread can remove the device-exclusive
+ * entry and unmap it. If the folio is free the entry must
+ * have been removed already. If it happens to have already
+ * been re-allocated after being freed all we do is lock and
+ * unlock it.
+ */
+ if (!folio_try_get(folio))
+ return 0;
+
+ if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags)) {
+ folio_put(folio);
return VM_FAULT_RETRY;
+ }
mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
vma->vm_mm, vmf->address & PAGE_MASK,
(vmf->address & PAGE_MASK) + PAGE_SIZE, NULL);
@@ -3577,6 +3590,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
pte_unmap_unlock(vmf->pte, vmf->ptl);
folio_unlock(folio);
+ folio_put(folio);
mmu_notifier_invalidate_range_end(&range);
return 0;
This is a note to let you know that I've just added the patch titled
iio: addac: stx104: Fix race condition when converting
to my char-misc git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
in the char-misc-testing branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week).
The patch will be merged to the char-misc-next branch sometime soon,
once it passes testing and the merge window is open.
If you have any questions about this process, please let me know.
From 4f9b80aefb9e2f542a49d9ec087cf5919730e1dd Mon Sep 17 00:00:00 2001
From: William Breathitt Gray <william.gray@linaro.org>
Date: Thu, 6 Apr 2023 10:40:11 -0400
Subject: iio: addac: stx104: Fix race condition when converting
analog-to-digital
The ADC conversion procedure requires several device I/O operations
performed in a particular sequence. If stx104_read_raw() is called
concurrently, the ADC conversion procedure could be clobbered. Prevent
such a race condition by utilizing a mutex.
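A condensed sketch of the serialized read path this establishes (illustrative
only; the conversion steps elided between the hunks below are unchanged):

	mutex_lock(&priv->lock);
	/* select ADC channel */
	iowrite8(chan->channel | (chan->channel << 4), &reg->achan);
	/* ... start the conversion, then poll until it completes ... */
	while (ioread8(&reg->cir_asr) & BIT(7));
	*val = ioread16(&reg->ssr_ad);
	mutex_unlock(&priv->lock);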
Fixes: 4075a283ae83 ("iio: stx104: Add IIO support for the ADC channels")
Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
Link: https://lore.kernel.org/r/2ae5e40eed5006ca735e4c12181a9ff5ced65547.16807905…
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
drivers/iio/addac/stx104.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/iio/addac/stx104.c b/drivers/iio/addac/stx104.c
index 4239aafe42fc..8730b79e921c 100644
--- a/drivers/iio/addac/stx104.c
+++ b/drivers/iio/addac/stx104.c
@@ -117,6 +117,8 @@ static int stx104_read_raw(struct iio_dev *indio_dev,
return IIO_VAL_INT;
}
+ mutex_lock(&priv->lock);
+
/* select ADC channel */
iowrite8(chan->channel | (chan->channel << 4), &reg->achan);
@@ -127,6 +129,8 @@ static int stx104_read_raw(struct iio_dev *indio_dev,
while (ioread8(&reg->cir_asr) & BIT(7));
*val = ioread16(&reg->ssr_ad);
+
+ mutex_unlock(&priv->lock);
return IIO_VAL_INT;
case IIO_CHAN_INFO_OFFSET:
/* get ADC bipolar/unipolar configuration */
--
2.40.0
This is a note to let you know that I've just added the patch titled
iio: addac: stx104: Fix race condition for stx104_write_raw()
to my char-misc git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
in the char-misc-testing branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week).
The patch will be merged to the char-misc-next branch sometime soon,
once it passes testing and the merge window is open.
If you have any questions about this process, please let me know.
From 9740827468cea80c42db29e7171a50e99acf7328 Mon Sep 17 00:00:00 2001
From: William Breathitt Gray <william.gray@linaro.org>
Date: Thu, 6 Apr 2023 10:40:10 -0400
Subject: iio: addac: stx104: Fix race condition for stx104_write_raw()
The priv->chan_out_states array and actual DAC value can become
mismatched if stx104_write_raw() is called concurrently. Prevent such a
race condition by utilizing a mutex.
Fixes: 97a445dad37a ("iio: Add IIO support for the DAC on the Apex Embedded Systems STX104")
Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
Link: https://lore.kernel.org/r/c95c9a77fcef36b2a052282146950f23bbc1ebdc.16807905…
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
drivers/iio/addac/stx104.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/drivers/iio/addac/stx104.c b/drivers/iio/addac/stx104.c
index e45b70aa5bb7..4239aafe42fc 100644
--- a/drivers/iio/addac/stx104.c
+++ b/drivers/iio/addac/stx104.c
@@ -15,6 +15,7 @@
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
+#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/types.h>
@@ -69,10 +70,12 @@ struct stx104_reg {
/**
* struct stx104_iio - IIO device private data structure
+ * @lock: synchronization lock to prevent I/O race conditions
* @chan_out_states: channels' output states
* @reg: I/O address offset for the device registers
*/
struct stx104_iio {
+ struct mutex lock;
unsigned int chan_out_states[STX104_NUM_OUT_CHAN];
struct stx104_reg __iomem *reg;
};
@@ -178,9 +181,12 @@ static int stx104_write_raw(struct iio_dev *indio_dev,
if ((unsigned int)val > 65535)
return -EINVAL;
+ mutex_lock(&priv->lock);
+
priv->chan_out_states[chan->channel] = val;
iowrite16(val, &priv->reg->dac[chan->channel]);
+ mutex_unlock(&priv->lock);
return 0;
}
return -EINVAL;
@@ -351,6 +357,8 @@ static int stx104_probe(struct device *dev, unsigned int id)
indio_dev->name = dev_name(dev);
+ mutex_init(&priv->lock);
+
/* configure device for software trigger operation */
iowrite8(0, &priv->reg->acr);
--
2.40.0
From: Paolo Bonzini <pbonzini@redhat.com>
commit 112e66017bff7f2837030f34c2bc19501e9212d5 upstream.
The effective values of the guest CR0 and CR4 registers may differ from
those included in the VMCS12. In particular, disabling EPT forces
CR4.PAE=1 and disabling unrestricted guest mode forces CR0.PG=CR0.PE=1.
Therefore, checks on these bits cannot be delegated to the processor
and must be performed by KVM.
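In effect, the fix validates the architecturally required combinations before
the processor sees them; a sketch of the added checks with the reasoning
spelled out (the diff below is authoritative):

	/* CR0.PG=1 with CR0.PE=0 is never a valid guest state */
	if ((vmcs12->guest_cr0 & (X86_CR0_PG | X86_CR0_PE)) == X86_CR0_PG)
		return 1;

	/* IA-32e mode guests additionally require CR4.PAE=1 and CR0.PG=1 */
	if ((ia32e && !(vmcs12->guest_cr4 & X86_CR4_PAE)) ||
	    (ia32e && !(vmcs12->guest_cr0 & X86_CR0_PG)))
		return 1;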
Reported-by: Reima ISHII <ishiir@g.ecc.u-tokyo.ac.jp>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[OP: drop CC() macro calls, as tracing is not implemented in 4.19]
[OP: adjust "return -EINVAL" -> "return 1" to match current return logic]
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
---
arch/x86/kvm/vmx/vmx.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ec821a5d131a..265e70b0eb79 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -12752,7 +12752,7 @@ static int nested_vmx_check_vmcs_link_ptr(struct kvm_vcpu *vcpu,
static int check_vmentry_postreqs(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
u32 *exit_qual)
{
- bool ia32e;
+ bool ia32e = !!(vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE);
*exit_qual = ENTRY_FAIL_DEFAULT;
@@ -12765,6 +12765,13 @@ static int check_vmentry_postreqs(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
return 1;
}
+ if ((vmcs12->guest_cr0 & (X86_CR0_PG | X86_CR0_PE)) == X86_CR0_PG)
+ return 1;
+
+ if ((ia32e && !(vmcs12->guest_cr4 & X86_CR4_PAE)) ||
+ (ia32e && !(vmcs12->guest_cr0 & X86_CR0_PG)))
+ return 1;
+
/*
* If the load IA32_EFER VM-entry control is 1, the following checks
* are performed on the field for the IA32_EFER MSR:
@@ -12776,7 +12783,6 @@ static int check_vmentry_postreqs(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
*/
if (to_vmx(vcpu)->nested.nested_run_pending &&
(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)) {
- ia32e = (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) != 0;
if (!kvm_valid_efer(vcpu, vmcs12->guest_ia32_efer) ||
ia32e != !!(vmcs12->guest_ia32_efer & EFER_LMA) ||
((vmcs12->guest_cr0 & X86_CR0_PG) &&
--
2.39.1
Per the RZ/G2L User's Hardware Manual (Rev. 1.20, Sep 2022), section 23.3.7
"Serial Data Transmission (Asynchronous Mode)", TE (transmit enable) must be
set after setting TIE (transmit interrupt enable), or the two bits must be
set to 1 simultaneously by a single instruction.  So set both bits with a
single instruction.
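A condensed sketch of the resulting start-of-transmission sequence (see the
diff below for the in-tree form, which applies this only for PORT_SCI):

	ctrl = serial_port_in(port, SCSCR);
	/* set TIE and TE with one SCSCR write so TE never precedes TIE */
	serial_port_out(port, SCSCR, ctrl | SCSCR_TIE | SCSCR_TE);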
Fixes: 93bcefd4c6ba ("serial: sh-sci: Fix setting SCSCR_TIE while transferring data")
Cc: stable@vger.kernel.org
Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com>
---
v3:
* New patch moved here from Renesas SCI fixes patch series
* Updated commit description
* Moved handling of clearing the TE bit to a separate patch #5.
---
drivers/tty/serial/sh-sci.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/drivers/tty/serial/sh-sci.c b/drivers/tty/serial/sh-sci.c
index 15743c2f3d3d..32f5c1f7d697 100644
--- a/drivers/tty/serial/sh-sci.c
+++ b/drivers/tty/serial/sh-sci.c
@@ -601,6 +601,15 @@ static void sci_start_tx(struct uart_port *port)
port->type == PORT_SCIFA || port->type == PORT_SCIFB) {
/* Set TIE (Transmit Interrupt Enable) bit in SCSCR */
ctrl = serial_port_in(port, SCSCR);
+
+ /*
+ * For SCI, TE (transmit enable) must be set after setting TIE
+ * (transmit interrupt enable) or in the same instruction to start
+ * the transmit process.
+ */
+ if (port->type == PORT_SCI)
+ ctrl |= SCSCR_TE;
+
serial_port_out(port, SCSCR, ctrl | SCSCR_TIE);
}
}
@@ -2600,8 +2609,14 @@ static void sci_set_termios(struct uart_port *port, struct ktermios *termios,
sci_set_mctrl(port, port->mctrl);
}
- scr_val |= SCSCR_RE | SCSCR_TE |
- (s->cfg->scscr & ~(SCSCR_CKE1 | SCSCR_CKE0));
+ /*
+ * For SCI, TE (transmit enable) must be set after setting TIE
+ * (transmit interrupt enable) or in the same instruction to
+ * start the transmitting process. So skip setting TE here for SCI.
+ */
+ if (port->type != PORT_SCI)
+ scr_val |= SCSCR_TE;
+ scr_val |= SCSCR_RE | (s->cfg->scscr & ~(SCSCR_CKE1 | SCSCR_CKE0));
serial_port_out(port, SCSCR, scr_val | s->hscif_tot);
if ((srr + 1 == 5) &&
(port->type == PORT_SCIFA || port->type == PORT_SCIFB)) {
--
2.25.1
Hi Dave,
Please pull this branch with changes for xfs.
As usual, I did a test-merge with the main upstream branch as of a few
minutes ago, and didn't see any conflicts. Please let me know if you
encounter any problems.
--D
The following changes since commit 7ba83850ca2691865713b307ed001bde5fddb084:
xfs: deprecate the ascii-ci feature (2023-04-11 19:05:19 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git tags/fix-bugs-6.4_2023-04-11
for you to fetch changes up to 4fe9cd8a34a1f934e5f936d4245e19300b52d440:
xfs: fix BUG_ON in xfs_getbmap() (2023-04-11 19:48:08 -0700)
----------------------------------------------------------------
xfs: rollup of various bug fixes
This is a rollup of miscellaneous bug fixes.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
----------------------------------------------------------------
Darrick J. Wong (2):
xfs: _{attr,data}_map_shared should take ILOCK_EXCL until iread_extents is completely done
xfs: verify buffer contents when we skip log replay
Dave Chinner (2):
xfs: remove WARN when dquot cache insertion fails
xfs: don't consider future format versions valid
Ye Bin (1):
xfs: fix BUG_ON in xfs_getbmap()
fs/xfs/libxfs/xfs_bmap.c | 6 ++++++
fs/xfs/libxfs/xfs_inode_fork.c | 16 +++++++++++++++-
fs/xfs/libxfs/xfs_inode_fork.h | 6 ++++--
fs/xfs/libxfs/xfs_sb.c | 11 ++++++-----
fs/xfs/xfs_bmap_util.c | 14 ++++++--------
fs/xfs/xfs_buf_item_recover.c | 10 ++++++++++
fs/xfs/xfs_dquot.c | 1 -
7 files changed, 47 insertions(+), 17 deletions(-)
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 7c7b962938ddda6a9cd095de557ee5250706ea88
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable@vger.kernel.org>' --in-reply-to '2023041157-hyperlink-prognosis-e6db@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
7c7b962938dd ("mm: take a page reference when removing device exclusive entries")
7d4a8be0c4b2 ("mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export")
369258ce41c6 ("hugetlb: remove duplicate mmu notifications")
f268f6cf875f ("mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths")
2ba99c5e0881 ("mm/khugepaged: fix GUP-fast interaction by sending IPI")
8d3c106e19e8 ("mm/khugepaged: take the right locks for page table retraction")
21b85b09527c ("madvise: use zap_page_range_single for madvise dontneed")
131a79b474e9 ("hugetlb: fix vma lock handling during split vma and range unmapping")
34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
58ac9a8993a1 ("mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds")
780a4b6fb865 ("mm/khugepaged: check compound_order() in collapse_pte_mapped_thp()")
40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
378397ccb8e5 ("hugetlb: create hugetlb_unmap_file_folio to unmap single file folio")
8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
12710fd69634 ("hugetlb: rename vma_shareable() and refactor code")
c86272287bc6 ("hugetlb: create remove_inode_single_folio to remove single file folio")
7e1813d48dd3 ("hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache")
3a47c54f09c4 ("hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization")
188a39725ad7 ("hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race")
19672a9e4a75 ("mm: convert lock_page_or_retry() to folio_lock_or_retry()")
thanks,
greg k-h
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 7c7b962938ddda6a9cd095de557ee5250706ea88
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable@vger.kernel.org>' --in-reply-to '2023041155-spiny-taking-d67f@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
7c7b962938dd ("mm: take a page reference when removing device exclusive entries")
7d4a8be0c4b2 ("mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export")
369258ce41c6 ("hugetlb: remove duplicate mmu notifications")
thanks,
greg k-h