switch_to_super_page() assumes the memory range it's working on is aligned
to the target large page level. Unfortunately, __domain_mapping() doesn't
take this into account when using it, and will pass unaligned ranges
ultimately freeing a PTE range larger than expected.
Take for example a mapping with the following iov_pfn range [0x3fe400,
0x4c0600], which should be backed by the following mappings:
iov_pfn [0x3fe400, 0x3fffff] covered by 2MiB pages
iov_pfn [0x400000, 0x4bffff] covered by 1GiB pages
iov_pfn [0x4c0000, 0x4c05ff] covered by 2MiB pages
Under this circumstance, __domain_mapping() will pass [0x400000, 0x4c05ff]
to switch_to_super_page() at a 1 GiB granularity, which will in turn
free PTEs all the way to iov_pfn 0x4fffff.
Mitigate this by rounding down the iov_pfn range passed to
switch_to_super_page() in __domain_mapping()
to the target large page level.
Additionally add range alignment checks to switch_to_super_page.
Fixes: 9906b9352a35 ("iommu/vt-d: Avoid duplicate removing in __domain_mapping()")
Signed-off-by: Eugene Koira <eugkoira(a)amazon.com>
Cc: stable(a)vger.kernel.org
---
drivers/iommu/intel/iommu.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 9c3ab9d9f69a..dff2d895b8ab 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1575,6 +1575,10 @@ static void switch_to_super_page(struct dmar_domain *domain,
unsigned long lvl_pages = lvl_to_nr_pages(level);
struct dma_pte *pte = NULL;
+ if (WARN_ON(!IS_ALIGNED(start_pfn, lvl_pages) ||
+ !IS_ALIGNED(end_pfn + 1, lvl_pages)))
+ return;
+
while (start_pfn <= end_pfn) {
if (!pte)
pte = pfn_to_dma_pte(domain, start_pfn, &level,
@@ -1650,7 +1654,8 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
unsigned long pages_to_remove;
pteval |= DMA_PTE_LARGE_PAGE;
- pages_to_remove = min_t(unsigned long, nr_pages,
+ pages_to_remove = min_t(unsigned long,
+ round_down(nr_pages, lvl_pages),
nr_pte_to_next_page(pte) * lvl_pages);
end_pfn = iov_pfn + pages_to_remove - 1;
switch_to_super_page(domain, iov_pfn, end_pfn, largepage_lvl);
--
2.47.3
Amazon Development Center (Netherlands) B.V., Johanna Westerdijkplein 1, NL-2521 EN The Hague, Registration No. Chamber of Commerce 56869649, VAT: NL 852339859B01
In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
shares and walks the CPU's page tables. The Linux x86 architecture maps
the kernel address space into the upper portion of every process’s page
table. Consequently, in an SVA context, the IOMMU hardware can walk and
cache kernel space mappings. However, the Linux kernel currently lacks
a notification mechanism for kernel space mapping changes. This means
the IOMMU driver is not aware of such changes, leading to a break in
IOMMU cache coherence.
Modern IOMMUs often cache page table entries of the intermediate-level
page table as long as the entry is valid, no matter the permissions, to
optimize walk performance. Currently the iommu driver is notified only
for changes of user VA mappings, so the IOMMU's internal caches may
retain stale entries for kernel VA. When kernel page table mappings are
changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
entries, Use-After-Free (UAF) vulnerability condition arises.
If these freed page table pages are reallocated for a different purpose,
potentially by an attacker, the IOMMU could misinterpret the new data as
valid page table entries. This allows the IOMMU to walk into attacker-
controlled memory, leading to arbitrary physical memory DMA access or
privilege escalation.
To mitigate this, introduce a new iommu interface to flush IOMMU caches.
This interface should be invoked from architecture-specific code that
manages combined user and kernel page tables, whenever a kernel page table
update is done and the CPU TLB needs to be flushed.
Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices")
Cc: stable(a)vger.kernel.org
Suggested-by: Jann Horn <jannh(a)google.com>
Co-developed-by: Jason Gunthorpe <jgg(a)nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu(a)linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg(a)nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde(a)amd.com>
Reviewed-by: Kevin Tian <kevin.tian(a)intel.com>
Tested-by: Yi Lai <yi1.lai(a)intel.com>
---
arch/x86/mm/tlb.c | 4 +++
drivers/iommu/iommu-sva.c | 60 ++++++++++++++++++++++++++++++++++++++-
include/linux/iommu.h | 4 +++
3 files changed, 67 insertions(+), 1 deletion(-)
Change log:
v3:
- iommu_sva_mms is an unbound list; iterating it in an atomic context
could introduce significant latency issues. Schedule it in a kernel
thread and replace the spinlock with a mutex.
- Replace the static key with a normal bool; it can be brought back if
data shows the benefit.
- Invalidate KVA range in the flush_tlb_all() paths.
- All previous reviewed-bys are preserved. Please let me know if there
are any objections.
v2:
- https://lore.kernel.org/linux-iommu/20250709062800.651521-1-baolu.lu@linux.…
- Remove EXPORT_SYMBOL_GPL(iommu_sva_invalidate_kva_range);
- Replace the mutex with a spinlock to make the interface usable in the
critical regions.
v1: https://lore.kernel.org/linux-iommu/20250704133056.4023816-1-baolu.lu@linux…
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 39f80111e6f1..3b85e7d3ba44 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
#include <linux/task_work.h>
#include <linux/mmu_notifier.h>
#include <linux/mmu_context.h>
+#include <linux/iommu.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -1478,6 +1479,8 @@ void flush_tlb_all(void)
else
/* Fall back to the IPI-based invalidation. */
on_each_cpu(do_flush_tlb_all, NULL, 1);
+
+ iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
}
/* Flush an arbitrarily large range of memory with INVLPGB. */
@@ -1540,6 +1543,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
kernel_tlb_flush_range(info);
put_flush_tlb_info();
+ iommu_sva_invalidate_kva_range(start, end);
}
/*
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 1a51cfd82808..d0da2b3fd64b 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -10,6 +10,8 @@
#include "iommu-priv.h"
static DEFINE_MUTEX(iommu_sva_lock);
+static bool iommu_sva_present;
+static LIST_HEAD(iommu_sva_mms);
static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
struct mm_struct *mm);
@@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de
return ERR_PTR(-ENOSPC);
}
iommu_mm->pasid = pasid;
+ iommu_mm->mm = mm;
INIT_LIST_HEAD(&iommu_mm->sva_domains);
/*
* Make sure the write to mm->iommu_mm is not reordered in front of
@@ -132,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
if (ret)
goto out_free_domain;
domain->users = 1;
- list_add(&domain->next, &mm->iommu_mm->sva_domains);
+ if (list_empty(&iommu_mm->sva_domains)) {
+ if (list_empty(&iommu_sva_mms))
+ WRITE_ONCE(iommu_sva_present, true);
+ list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
+ }
+ list_add(&domain->next, &iommu_mm->sva_domains);
out:
refcount_set(&handle->users, 1);
mutex_unlock(&iommu_sva_lock);
@@ -175,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
list_del(&domain->next);
iommu_domain_free(domain);
}
+
+ if (list_empty(&iommu_mm->sva_domains)) {
+ list_del(&iommu_mm->mm_list_elm);
+ if (list_empty(&iommu_sva_mms))
+ WRITE_ONCE(iommu_sva_present, false);
+ }
+
mutex_unlock(&iommu_sva_lock);
kfree(handle);
}
@@ -312,3 +327,46 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
return domain;
}
+
+struct kva_invalidation_work_data {
+ struct work_struct work;
+ unsigned long start;
+ unsigned long end;
+};
+
+static void invalidate_kva_func(struct work_struct *work)
+{
+ struct kva_invalidation_work_data *data =
+ container_of(work, struct kva_invalidation_work_data, work);
+ struct iommu_mm_data *iommu_mm;
+
+ guard(mutex)(&iommu_sva_lock);
+ list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
+ mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm,
+ data->start, data->end);
+
+ kfree(data);
+}
+
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
+{
+ struct kva_invalidation_work_data *data;
+
+ if (likely(!READ_ONCE(iommu_sva_present)))
+ return;
+
+ /* will be freed in the task function */
+ data = kzalloc(sizeof(*data), GFP_ATOMIC);
+ if (!data)
+ return;
+
+ data->start = start;
+ data->end = end;
+ INIT_WORK(&data->work, invalidate_kva_func);
+
+ /*
+ * Since iommu_sva_mms is an unbound list, iterating it in an atomic
+ * context could introduce significant latency issues.
+ */
+ schedule_work(&data->work);
+}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c30d12e16473..66e4abb2df0d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1134,7 +1134,9 @@ struct iommu_sva {
struct iommu_mm_data {
u32 pasid;
+ struct mm_struct *mm;
struct list_head sva_domains;
+ struct list_head mm_list_elm;
};
int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode);
@@ -1615,6 +1617,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
struct mm_struct *mm);
void iommu_sva_unbind_device(struct iommu_sva *handle);
u32 iommu_sva_get_pasid(struct iommu_sva *handle);
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end);
#else
static inline struct iommu_sva *
iommu_sva_bind_device(struct device *dev, struct mm_struct *mm)
@@ -1639,6 +1642,7 @@ static inline u32 mm_get_enqcmd_pasid(struct mm_struct *mm)
}
static inline void mm_pasid_drop(struct mm_struct *mm) {}
+static inline void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) {}
#endif /* CONFIG_IOMMU_SVA */
#ifdef CONFIG_IOMMU_IOPF
--
2.43.0
Commit 8ee53820edfd ("thp: mmu_notifier_test_young") introduced
mmu_notifier_test_young(), but we should pass the address need to test.
In xxx_scan_pmd(), the actual iteration address is "_address" not
"address". We seem to misuse the variable on the very beginning.
Change it to the right one.
Fixes: 8ee53820edfd ("thp: mmu_notifier_test_young")
Signed-off-by: Wei Yang <richard.weiyang(a)gmail.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: Nico Pache <npache(a)redhat.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Dev Jain <dev.jain(a)arm.com>
Cc: Barry Song <baohua(a)kernel.org>
CC: <stable(a)vger.kernel.org>
---
The original commit 8ee53820edfd is at 2011.
Then the code is moved to khugepaged.c in commit b46e756f5e470 ("thp:
extract khugepaged from mm/huge_memory.c") in 2022.
---
mm/khugepaged.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 24e18a7f8a93..b000942250d1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1418,7 +1418,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
if (cc->is_khugepaged &&
(pte_young(pteval) || folio_test_young(folio) ||
folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
- address)))
+ _address)))
referenced++;
}
if (!writable) {
--
2.34.1
Since the commit 96336ec70264 ("PCI: Perform reset_resource() and build
fail list in sync") the failed list is always built and returned to let
the caller decide what to do with the failures. The caller may want to
retry resource fitting and assignment and before that can happen, the
resources should be restored to their original state (a reset
effectively clears the struct resource), which requires returning them
on the failed list so that the original state remains stored in the
associated struct pci_dev_resource.
Resource resizing is different from the ordinary resource fitting and
assignment in that it only considers part of the resources. This means
failures for other resource types are not relevant at all and should be
ignored. As resize doesn't unassign such unrelated resources, those
resource ending up into the failed list implies assignment of that
resource must have failed before resize too. The check in
pci_reassign_bridge_resources() to decide if the whole assignment is
successful, however, is based on list emptiness which will cause false
negatives when the failed list has resources with an unrelated type.
If the failed list is not empty, call pci_required_resource_failed()
and extend it to be able to filter on specific resource types too (if
provided).
Calling pci_required_resource_failed() at this point is slightly
problematic because the resource itself is reset when the failed list
is constructed in __assign_resources_sorted(). As a result,
pci_resource_is_optional() does not have access to the original
resource flags. This could be worked around by restoring and
re-reseting the resource around the call to pci_resource_is_optional(),
however, it shouldn't cause issue as resource resizing is meant for
64-bit prefetchable resources according to Christian König (see the
Link which unfortunately doesn't point directly to Christian's reply
because lore didn't store that email at all).
Fixes: 96336ec70264 ("PCI: Perform reset_resource() and build fail list in sync")
Link: https://lore.kernel.org/all/c5d1b5d8-8669-5572-75a7-0b480f581ac1@linux.inte…
Reported-by: D Scott Phillips <scott(a)os.amperecomputing.com>
Closes: https://lore.kernel.org/all/86plf0lgit.fsf@scott-ph-mail.amperecomputing.co…
Tested-by: D Scott Phillips <scott(a)os.amperecomputing.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
Reviewed-by: D Scott Phillips <scott(a)os.amperecomputing.com>
Cc: Christian König <christian.koenig(a)amd.com>
Cc: <stable(a)vger.kernel.org>
---
drivers/pci/setup-bus.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index df5aec46c29d..def29506700e 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -28,6 +28,10 @@
#include <linux/acpi.h>
#include "pci.h"
+#define PCI_RES_TYPE_MASK \
+ (IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH |\
+ IORESOURCE_MEM_64)
+
unsigned int pci_flags;
EXPORT_SYMBOL_GPL(pci_flags);
@@ -384,13 +388,19 @@ static bool pci_need_to_release(unsigned long mask, struct resource *res)
}
/* Return: @true if assignment of a required resource failed. */
-static bool pci_required_resource_failed(struct list_head *fail_head)
+static bool pci_required_resource_failed(struct list_head *fail_head,
+ unsigned long type)
{
struct pci_dev_resource *fail_res;
+ type &= PCI_RES_TYPE_MASK;
+
list_for_each_entry(fail_res, fail_head, list) {
int idx = pci_resource_num(fail_res->dev, fail_res->res);
+ if (type && (fail_res->flags & PCI_RES_TYPE_MASK) != type)
+ continue;
+
if (!pci_resource_is_optional(fail_res->dev, idx))
return true;
}
@@ -504,7 +514,7 @@ static void __assign_resources_sorted(struct list_head *head,
}
/* Without realloc_head and only optional fails, nothing more to do. */
- if (!pci_required_resource_failed(&local_fail_head) &&
+ if (!pci_required_resource_failed(&local_fail_head, 0) &&
list_empty(realloc_head)) {
list_for_each_entry(save_res, &save_head, list) {
struct resource *res = save_res->res;
@@ -1708,10 +1718,6 @@ static void __pci_bridge_assign_resources(const struct pci_dev *bridge,
}
}
-#define PCI_RES_TYPE_MASK \
- (IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH |\
- IORESOURCE_MEM_64)
-
static void pci_bridge_release_resources(struct pci_bus *bus,
unsigned long type)
{
@@ -2450,8 +2456,12 @@ int pci_reassign_bridge_resources(struct pci_dev *bridge, unsigned long type)
free_list(&added);
if (!list_empty(&failed)) {
- ret = -ENOSPC;
- goto cleanup;
+ if (pci_required_resource_failed(&failed, type)) {
+ ret = -ENOSPC;
+ goto cleanup;
+ }
+ /* Only resources with unrelated types failed (again) */
+ free_list(&failed);
}
list_for_each_entry(dev_res, &saved, list) {
--
2.39.5
From: Alexander Sverdlin <alexander.sverdlin(a)siemens.com>
KCSAN reports:
BUG: KCSAN: data-race in do_raw_write_lock / do_raw_write_lock
write (marked) to 0xffff800009cf504c of 4 bytes by task 1102 on cpu 1:
do_raw_write_lock+0x120/0x204
_raw_write_lock_irq
do_exit
call_usermodehelper_exec_async
ret_from_fork
read to 0xffff800009cf504c of 4 bytes by task 1103 on cpu 0:
do_raw_write_lock+0x88/0x204
_raw_write_lock_irq
do_exit
call_usermodehelper_exec_async
ret_from_fork
value changed: 0xffffffff -> 0x00000001
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 1103 Comm: kworker/u4:1 6.1.111
Commit 1a365e822372 ("locking/spinlock/debug: Fix various data races") has
adressed most of these races, but seems to be not consistent/not complete.
From do_raw_write_lock() only debug_write_lock_after() part has been
converted to WRITE_ONCE(), but not debug_write_lock_before() part.
Do it now.
Cc: stable(a)vger.kernel.org
Fixes: 1a365e822372 ("locking/spinlock/debug: Fix various data races")
Reported-by: Adrian Freihofer <adrian.freihofer(a)siemens.com>
Acked-by: Waiman Long <longman(a)redhat.com>
Signed-off-by: Alexander Sverdlin <alexander.sverdlin(a)siemens.com>
---
There are still some inconsistencies remaining IMO:
- lock->magic is sometimes accessed with READ_ONCE() even though it's only
being plain-written;
- debug_spin_unlock() and debug_write_unlock() both do WRITE_ONCE() on
lock->owner and lock->owner_cpu, but examine them with plain read accesses.
kernel/locking/spinlock_debug.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/locking/spinlock_debug.c b/kernel/locking/spinlock_debug.c
index 87b03d2e41dbb..2338b3adfb55f 100644
--- a/kernel/locking/spinlock_debug.c
+++ b/kernel/locking/spinlock_debug.c
@@ -184,8 +184,8 @@ void do_raw_read_unlock(rwlock_t *lock)
static inline void debug_write_lock_before(rwlock_t *lock)
{
RWLOCK_BUG_ON(lock->magic != RWLOCK_MAGIC, lock, "bad magic");
- RWLOCK_BUG_ON(lock->owner == current, lock, "recursion");
- RWLOCK_BUG_ON(lock->owner_cpu == raw_smp_processor_id(),
+ RWLOCK_BUG_ON(READ_ONCE(lock->owner) == current, lock, "recursion");
+ RWLOCK_BUG_ON(READ_ONCE(lock->owner_cpu) == raw_smp_processor_id(),
lock, "cpu recursion");
}
--
2.47.1
[BUG]
With 64K page size (aarch64 with 64K page size config) and 4K btrfs
block size, the following workload can easily lead to a corrupted read:
mkfs.btrfs -f -s 4k $dev > /dev/null
mount -o compress=zstd $dev $mnt
xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/base > /dev/null
echo "correct result:"
od -Ax -x $mnt/base
xfs_io -f -c "reflink $mnt/base 32k 0 32k" \
-c "reflink $mnt/base 0 32k 32k" \
-c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null
echo "incorrect result:"
od -Ax -x $mnt/new
umount $mnt
This shows the following result:
correct result:
000000 ffff ffff ffff ffff ffff ffff ffff ffff
*
010000
incorrect result:
000000 ffff ffff ffff ffff ffff ffff ffff ffff
*
008000 0000 0000 0000 0000 0000 0000 0000 0000
*
00f000 ffff ffff ffff ffff ffff ffff ffff ffff
*
010000
Notice the zero in the range [0x8000, 0xf000), which is incorrect.
[CAUSE]
With extra trace printk, it shows the following events during od:
(some unrelated info removed like CPU and context)
od-3457 btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000
The "r/i" is indicating the root and inode number. In our case the file
"new" is using ino 258 from fs tree (root 5).
Here notice the @prev_em_start pointer is NULL. This means the
btrfs_do_readpage() is called from btrfs_read_folio(), not from
btrfs_readahead().
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768
These above 32K blocks will be read from the the first half of the
compressed data extent.
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768
Note here there is no btrfs_submit_compressed_read() call. Which is
incorrect now.
As both extent maps at 0 and 32K are pointing to the same compressed
data, their offset are different thus can not be merged into the same
read.
So this means the compressed data read merge check is doing something
wrong.
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate
od-3457 btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440
The function btrfs_submit_compressed_read() is only called at the end of
folio read. The compressed bio will only have a extent map of range [0,
32K), but the original bio passed in are for the whole 64K folio.
This will cause the decompression part to only fill the first 32K,
leaving the rest untouched (aka, filled with zero).
This incorrect compressed read merge leads to the above data corruption.
There are similar problems happened in the past, commit 808f80b46790
("Btrfs: update fix for read corruption of compressed and shared
extents") is doing pretty much the same fix for readahead.
But that's back to 2015, where btrfs still only supports bs (block size)
== ps (page size) cases.
This means btrfs_do_readpage() only needs to handle a folio which
contains exactly one block.
Only btrfs_readahead() can lead to a read covering multiple blocks.
Thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer.
With the v5.15 btrfs introduced bs < ps support. This breaks the above
assumption that a folio can only contain one block.
Now btrfs_read_folio() can also read multiple blocks in one go.
But btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the
existing bio force submission check will never be triggered.
In theory, this can also happen for btrfs with large folios, but since
large folio is still experimental, we don't need to bother it, thus only
bs < ps support is affected for now.
[FIX]
Instead of passing @prev_em_start to do the proper compressed extent
check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that
the existing bio force submission logic will always be triggered.
CC: stable(a)vger.kernel.org #5.15+
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
---
Changelog:
v3:
- Use a single btrfs_bio_ctrl::last_em_start
Since the existing prev_em_start is using U64_MAX as the initial value
to indicate no em hit yet, we can use the same logic, which saves
another 8 bytes from btrfs_bio_ctrl.
- Update the reproducer
Previously I failed to reproduce using a minimal workload.
As regular read from od/md5sum always trigger readahead thus not
hitting the @prev_em_start == NULL path.
Will send out a fstest case for it using the minimal reproducer.
But it will still need a 16K/64K page sized system to reproduce.
Fix the problem by doing an block aligned write into the folio, so
that the folio will be partially dirty and not go through the
readahead path.
- Update the analyze
This includes the trace events of the minimal reproducer, and
mentioning of previous similar fixes and why they do not work for
subpage cases.
- Update the CC tag
Since it's only affecting bs < ps cases (for non-experimental builds),
only need to fix kernels with subpage btrfs supports.
v2:
- Only save extent_map::start/len to save memory for btrfs_bio_ctrl
It's using on-stack memory which is very limited inside the kernel.
- Remove the commit message mentioning of clearing last saved em
Since we're using em::start/len, there is no need to clear them.
Either we hit the same em::start/len, meaning hitting the same extent
map, or we hit a different em, which will have a different start/len.
---
fs/btrfs/extent_io.c | 40 ++++++++++++++++++++++++++++++----------
1 file changed, 30 insertions(+), 10 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0c12fd64a1f3..b219dbaaedc8 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -131,6 +131,24 @@ struct btrfs_bio_ctrl {
*/
unsigned long submit_bitmap;
struct readahead_control *ractl;
+
+ /*
+ * The start file offset of the last hit extent map.
+ *
+ * This is for proper compressed read merge.
+ * U64_MAX means no extent map hit yet.
+ *
+ * The current btrfs_bio_is_contig() only uses disk_bytenr as
+ * the condition to check if the read can be merged with previous
+ * bio, which is not correct. E.g. two file extents pointing to the
+ * same extent but with different offset.
+ *
+ * So here we need to do extra check to only merge reads that are
+ * covered by the same extent map.
+ * Just extent_map::start will be enough, as they are unique
+ * inside the same inode.
+ */
+ u64 last_em_start;
};
/*
@@ -965,7 +983,7 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
* return 0 on success, otherwise return error
*/
static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
- struct btrfs_bio_ctrl *bio_ctrl, u64 *prev_em_start)
+ struct btrfs_bio_ctrl *bio_ctrl)
{
struct inode *inode = folio->mapping->host;
struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
@@ -1076,12 +1094,11 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
* non-optimal behavior (submitting 2 bios for the same extent).
*/
if (compress_type != BTRFS_COMPRESS_NONE &&
- prev_em_start && *prev_em_start != (u64)-1 &&
- *prev_em_start != em->start)
+ bio_ctrl->last_em_start != U64_MAX &&
+ bio_ctrl->last_em_start != em->start)
force_bio_submit = true;
- if (prev_em_start)
- *prev_em_start = em->start;
+ bio_ctrl->last_em_start = em->start;
em_gen = em->generation;
btrfs_free_extent_map(em);
@@ -1296,12 +1313,15 @@ int btrfs_read_folio(struct file *file, struct folio *folio)
const u64 start = folio_pos(folio);
const u64 end = start + folio_size(folio) - 1;
struct extent_state *cached_state = NULL;
- struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
+ struct btrfs_bio_ctrl bio_ctrl = {
+ .opf = REQ_OP_READ,
+ .last_em_start = U64_MAX,
+ };
struct extent_map *em_cached = NULL;
int ret;
lock_extents_for_read(inode, start, end, &cached_state);
- ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
+ ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
btrfs_free_extent_map(em_cached);
@@ -2641,7 +2661,8 @@ void btrfs_readahead(struct readahead_control *rac)
{
struct btrfs_bio_ctrl bio_ctrl = {
.opf = REQ_OP_READ | REQ_RAHEAD,
- .ractl = rac
+ .ractl = rac,
+ .last_em_start = U64_MAX,
};
struct folio *folio;
struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
@@ -2649,12 +2670,11 @@ void btrfs_readahead(struct readahead_control *rac)
const u64 end = start + readahead_length(rac) - 1;
struct extent_state *cached_state = NULL;
struct extent_map *em_cached = NULL;
- u64 prev_em_start = (u64)-1;
lock_extents_for_read(inode, start, end, &cached_state);
while ((folio = readahead_folio(rac)) != NULL)
- btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
+ btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
--
2.50.1
ENODATA (aka ENOATTR) has a very specific meaning in the xfs xattr code;
namely, that the requested attribute name could not be found.
However, a medium error from disk may also return ENODATA. At best,
this medium error may escape to userspace as "attribute not found"
when in fact it's an IO (disk) error.
At worst, we may oops in xfs_attr_leaf_get() when we do:
error = xfs_attr_leaf_hasname(args, &bp);
if (error == -ENOATTR) {
xfs_trans_brelse(args->trans, bp);
return error;
}
because an ENODATA/ENOATTR error from disk leaves us with a null bp,
and the xfs_trans_brelse will then null-deref it.
As discussed on the list, we really need to modify the lower level
IO functions to trap all disk errors and ensure that we don't let
unique errors like this leak up into higher xfs functions - many
like this should be remapped to EIO.
However, this patch directly addresses a reported bug in the xattr
code, and should be safe to backport to stable kernels. A larger-scope
patch to handle more unique errors at lower levels can follow later.
(Note, prior to 07120f1abdff we did not oops, but we did return the
wrong error code to userspace.)
Signed-off-by: Eric Sandeen <sandeen(a)redhat.com>
Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
Cc: <stable(a)vger.kernel.org> # v5.9+
---
V2: Remove the extraneous trap point as pointed out by djwong, oops.
(I get it that sprinkling this around in 2 places might have an ick
factor but I think it's necessary to make a surgical strike on this bug
before we address the general problem.)
Thanks,
-Eric
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index 4c44ce1c8a64..bff3dc226f81 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -435,6 +435,13 @@ xfs_attr_rmtval_get(
0, &bp, &xfs_attr3_rmt_buf_ops);
if (xfs_metadata_is_sick(error))
xfs_dirattr_mark_sick(args->dp, XFS_ATTR_FORK);
+ /*
+ * ENODATA from disk implies a disk medium failure;
+ * ENODATA for xattrs means attribute not found, so
+ * disambiguate that here.
+ */
+ if (error == -ENODATA)
+ error = -EIO;
if (error)
return error;
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 17d9e6154f19..723a0643b838 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2833,6 +2833,12 @@ xfs_da_read_buf(
&bp, ops);
if (xfs_metadata_is_sick(error))
xfs_dirattr_mark_sick(dp, whichfork);
+ /*
+ * ENODATA from disk implies a disk medium failure; ENODATA for
+ * xattrs means attribute not found, so disambiguate that here.
+ */
+ if (error == -ENODATA && whichfork == XFS_ATTR_FORK)
+ error = -EIO;
if (error)
goto out_free;