On Fri, Nov 13, 2020 at 08:25:29AM -0800, Minchan Kim wrote:
On Thu, Nov 12, 2020 at 02:49:19PM -0800, Andrew Morton wrote:
On Thu, 12 Nov 2020 12:01:01 -0800 Minchan Kim <minchan@kernel.org> wrote:
On Sat, Nov 07, 2020 at 12:39:39AM -0800, Minchan Kim wrote:
Hi Andrew,
On Fri, Nov 06, 2020 at 05:59:33PM -0800, Andrew Morton wrote:
On Thu, 5 Nov 2020 09:02:49 -0800 Minchan Kim <minchan@kernel.org> wrote:
This reverts commit e47110e90584a22e9980510b00d0dfad3a83354e.
While I was doing zram testing, I found that decompression sometimes failed because the compression buffer was corrupted. On investigation, I found that the reverted commit calls cond_resched() unconditionally, so it can cause a problem in atomic context if the task gets rescheduled.
Revert the original commit for now.
How should we proceed with this problem?
(top-posting repaired - please don't).
Well, we don't want to reintroduce the softlockup reports which e47110e90584a2 fixed, and we certainly don't want to do that on behalf of code which is using the unmap_kernel_range() interface incorrectly.
So I suggest either
a) make zs_unmap_object() stop doing the unmapping from atomic context or
It's not easy since the path already holds several spinlocks as well as per-CPU context. I could pursue that direction, but it would take several steps to change the entire locking scheme in zsmalloc, which will take time (we can't leave zsmalloc broken until then) and would be hard to land in the stable trees.
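For reference, the atomic window spans the whole zs_map_object()/zs_unmap_object() pair. An abbreviated sketch (from memory, so details may differ from the exact tree):

    /* mm/zsmalloc.c, abbreviated sketch -- not the exact source */
    void *zs_map_object(struct zs_pool *pool, unsigned long handle,
                        enum zs_mapmode mm)
    {
            ...
            pin_tag(handle);                   /* bit spinlock on the handle */
            ...
            area = &get_cpu_var(zs_map_area);  /* disables preemption */
            ...
    }

    void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
    {
            ...
            area = this_cpu_ptr(&zs_map_area);
            ...
            __zs_unmap_object(area, pages, off, class->size);
                                               /* -> unmap_kernel_range() */
            put_cpu_var(zs_map_area);          /* re-enables preemption */
            ...
            unpin_tag(handle);
    }

So __zs_unmap_object() runs with preemption disabled and the handle bit-spinlock held; unmap_kernel_range() must not sleep there.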
b) figure out whether the vmalloc unmap code is *truly* unsafe from atomic context - perhaps it is only unsafe from interrupt context, in which case we can rework the vmalloc.c checks to reflect this, or
I don't get the point. I assume your suggestion is "let's make the vunmap code atomic-context safe", but how does that solve this problem? The point of e47110e90584a2 was that a softlockup could be triggered if vunmap deals with a large mapping, so it needs an *explicit reschedule* point for CONFIG_PREEMPT_VOLUNTARY. However, CONFIG_PREEMPT_VOLUNTARY doesn't consider the preempt count, so even if we made vunmap atomic-safe so it could be called under a spin_lock:
    spin_lock(&A);
    vunmap
      vunmap_pmd_range
        cond_resched  <- bang

The options below would have the same problem, too. Let me know if I've misunderstood something.
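To spell out why the preempt count doesn't help here: with CONFIG_PREEMPT_VOLUNTARY, CONFIG_PREEMPT_COUNT is normally off, so the spin_lock() above never raises preempt_count and cond_resched() has nothing to check against. Roughly (abbreviated from kernel/sched/core.c, !CONFIG_PREEMPTION build):

    int __sched _cond_resched(void)
    {
            /*
             * should_resched(0) only looks at preempt_count() == 0 and
             * need_resched; without CONFIG_PREEMPT_COUNT the spin_lock
             * above left preempt_count() at 0, so we can schedule away
             * right here while the lock is held.
             */
            if (should_resched(0)) {
                    preempt_schedule_common();
                    return 1;
            }
            return 0;
    }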
c) make the vfree code callable from all contexts. Which is by far the preferred solution, but may be tough.
Or maybe not so tough - if the only problem in the vmalloc code is the use of mutex_trylock() from irqs then it may be as simple as switching to old-style semaphores and using down_trylock(), which I think is irq-safe.
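The trylock in question is presumably the vmap_purge_lock one in try_purge_vmap_area_lazy(); with a plain semaphore that would look something like this (untested sketch, just to illustrate the idea):

    /* mm/vmalloc.c sketch: binary semaphore in place of vmap_purge_lock */
    static DEFINE_SEMAPHORE(vmap_purge_sem);

    static void try_purge_vmap_area_lazy(void)
    {
            /*
             * down_trylock() never sleeps and, unlike mutex_trylock(),
             * may be called from interrupt context.
             */
            if (!down_trylock(&vmap_purge_sem)) {
                    __purge_vmap_area_lazy(ULONG_MAX, 0);
                    up(&vmap_purge_sem);
            }
    }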
However old-style semaphores are deprecated. A hackyish approach might be to use an rwsem always in down_write mode and use down_write_trylock(), which I think is also callable from interrupt context.
But I have a feeling that there are other reasons why vfree shouldn't be called from atomic context, apart from the mutex_trylock-in-irq issue.
How about this?
From 0733bc468d0072147c2ecf998d7cc3af2234bc87 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Mon, 16 Nov 2020 09:38:40 -0800
Subject: [PATCH] mm: unmap_kernel_range_atomic
unmap_kernel_range had been an atomic operation and zsmalloc has used it from atomic context in zs_unmap_object. However, commit e47110e90584 ("mm/vunmap: add cond_resched() in vunmap_pmd_range") changed it into a non-atomic operation by adding cond_resched. That causes zram decompression failures because the compressed buffer gets corrupted while in atomic context.
This patch introduces unmap_kernel_range_atomic, which works only for ranges no larger than PMD_SIZE, so that the cond_resched call is avoided.
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/vmalloc.h |  2 ++
 mm/vmalloc.c            | 23 +++++++++++++++++++++--
 mm/zsmalloc.c           |  2 +-
 3 files changed, 24 insertions(+), 3 deletions(-)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 938eaf9517e2..36b1ecc2d014 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -180,6 +180,7 @@ int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
 		struct page **pages);
 extern void unmap_kernel_range_noflush(unsigned long addr, unsigned long size);
 extern void unmap_kernel_range(unsigned long addr, unsigned long size);
+extern void unmap_kernel_range_atomic(unsigned long addr, unsigned long size);
 static inline void set_vm_flush_reset_perms(void *addr)
 {
 	struct vm_struct *vm = find_vm_area(addr);
@@ -200,6 +201,7 @@ unmap_kernel_range_noflush(unsigned long addr, unsigned long size)
 {
 }
 #define unmap_kernel_range unmap_kernel_range_noflush
+#define unmap_kernel_range_atomic unmap_kernel_range_noflush
 static inline void set_vm_flush_reset_perms(void *addr)
 {
 }
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d7075ad340aa..714e5425dc45 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -88,6 +88,7 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	pmd_t *pmd;
 	unsigned long next;
 	int cleared;
+	bool check_resched = (end - addr) > PMD_SIZE;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -102,8 +103,8 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		vunmap_pte_range(pmd, addr, next, mask);
-
-		cond_resched();
+		if (check_resched)
+			cond_resched();
 	} while (pmd++, addr = next, addr != end);
 }
 
@@ -2024,6 +2025,24 @@ void unmap_kernel_range(unsigned long addr, unsigned long size)
 	flush_tlb_kernel_range(addr, end);
 }
 
+/**
+ * unmap_kernel_range_atomic - unmap kernel VM area and flush cache and TLB
+ * @addr: start of the VM area to unmap
+ * @size: size of the VM area to unmap
+ *
+ * Similar to unmap_kernel_range_noflush() but it's atomic. @size should be
+ * less than PMD_SIZE.
+ */
+void unmap_kernel_range_atomic(unsigned long addr, unsigned long size)
+{
+	unsigned long end = addr + size;
+
+	flush_cache_vunmap(addr, end);
+	WARN_ON(size > PMD_SIZE);
+	unmap_kernel_range_noflush(addr, size);
+	flush_tlb_kernel_range(addr, end);
+}
+
 static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
 	struct vmap_area *va, unsigned long flags, const void *caller)
 {
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 662ee420706f..9decc7634852 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1154,7 +1154,7 @@ static inline void __zs_unmap_object(struct mapping_area *area,
 {
 	unsigned long addr = (unsigned long)area->vm_addr;
 
-	unmap_kernel_range(addr, PAGE_SIZE * 2);
+	unmap_kernel_range_atomic(addr, PAGE_SIZE * 2);
 }
 
 #else /* CONFIG_ZSMALLOC_PGTABLE_MAPPING */