February 2018 - Linux-stable-mirror

[Linux-stable-mirror] [PATCH 4.4] kaiser: fix intel_bts perf crashes

by Hugh Dickins

Vince reported perf_fuzzer quickly locks up on 4.15-rc7 with PTI; Robert reported Bad RIP with KPTI and Intel BTS also on 4.15-rc7: honggfuzz -f /tmp/somedirectorywithatleastonefile \ --linux_perf_bts_edge -s -- /bin/true (honggfuzz from https://github.com/google/honggfuzz) crashed with BUG: unable to handle kernel paging request at ffff9d3215100000 (then narrowed it down to perf record --per-thread -e intel_bts//u -- /bin/ls). The intel_bts driver does not use the 'normal' BTS buffer which is exposed through kaiser_add_mapping(), but instead uses the memory allocated for the perf AUX buffer. This obviously comes apart when using PTI, because then the kernel mapping, which includes that AUX buffer memory, disappears while switched to user page tables. Easily fixed in old-Kaiser backports, by applying kaiser_add_mapping() to those pages; perhaps not so easy for upstream, where 4.15-rc8 commit 99a9dc98ba52 ("x86,perf: Disable intel_bts when PTI") disables for now. Slightly reorganized surrounding code in bts_buffer_setup_aux(), so it can better match bts_buffer_free_aux(): free_aux with an #ifdef to avoid the loop when PTI is off, but setup_aux needs to loop anyway (and kaiser_add_mapping() is cheap when PTI config is off or "pti=off"). Reported-by: Vince Weaver <vincent.weaver(a)maine.edu> Reported-by: Robert Święcki <robert(a)swiecki.net> Analyzed-by: Peter Zijlstra <peterz(a)infradead.org> Analyzed-by: Stephane Eranian <eranian(a)google.com> Cc: Thomas Gleixner <tglx(a)linutronix.de> Cc: Ingo Molnar <mingo(a)kernel.org> Cc: Andy Lutomirski <luto(a)amacapital.net> Cc: Alexander Shishkin <alexander.shishkin(a)linux.intel.com> Cc: Linus Torvalds <torvalds(a)linux-foundation.org> Cc: Vince Weaver <vince(a)deater.net> Cc: stable(a)vger.kernel.org Cc: Jiri Kosina <jkosina(a)suze.cz> Signed-off-by: Hugh Dickins <hughd(a)google.com> --- arch/x86/kernel/cpu/perf_event_intel_bts.c | 44 ++++++++++++++++++++++-------- 1 file changed, 33 insertions(+), 11 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_intel_bts.c b/arch/x86/kernel/cpu/perf_event_intel_bts.c index 2cad71d1b14c..5af11c46d0b9 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_bts.c +++ b/arch/x86/kernel/cpu/perf_event_intel_bts.c @@ -22,6 +22,7 @@ #include <linux/debugfs.h> #include <linux/device.h> #include <linux/coredump.h> +#include <linux/kaiser.h> #include <asm-generic/sizes.h> #include <asm/perf_event.h> @@ -67,6 +68,23 @@ static size_t buf_size(struct page *page) return 1 << (PAGE_SHIFT + page_private(page)); } +static void bts_buffer_free_aux(void *data) +{ +#ifdef CONFIG_PAGE_TABLE_ISOLATION + struct bts_buffer *buf = data; + int nbuf; + + for (nbuf = 0; nbuf < buf->nr_bufs; nbuf++) { + struct page *page = buf->buf[nbuf].page; + void *kaddr = page_address(page); + size_t page_size = buf_size(page); + + kaiser_remove_mapping((unsigned long)kaddr, page_size); + } +#endif + kfree(data); +} + static void * bts_buffer_setup_aux(int cpu, void **pages, int nr_pages, bool overwrite) { @@ -103,29 +121,33 @@ bts_buffer_setup_aux(int cpu, void **pages, int nr_pages, bool overwrite) buf->real_size = size - size % BTS_RECORD_SIZE; for (pg = 0, nbuf = 0, offset = 0, pad = 0; nbuf < buf->nr_bufs; nbuf++) { - unsigned int __nr_pages; + void *kaddr = pages[pg]; + size_t page_size; + + page = virt_to_page(kaddr); + page_size = buf_size(page); + + if (kaiser_add_mapping((unsigned long)kaddr, + page_size, __PAGE_KERNEL) < 0) { + buf->nr_bufs = nbuf; + bts_buffer_free_aux(buf); + return NULL; + } - page = virt_to_page(pages[pg]); - __nr_pages = PagePrivate(page) ? 1 << page_private(page) : 1; buf->buf[nbuf].page = page; buf->buf[nbuf].offset = offset; buf->buf[nbuf].displacement = (pad ? BTS_RECORD_SIZE - pad : 0); - buf->buf[nbuf].size = buf_size(page) - buf->buf[nbuf].displacement; + buf->buf[nbuf].size = page_size - buf->buf[nbuf].displacement; pad = buf->buf[nbuf].size % BTS_RECORD_SIZE; buf->buf[nbuf].size -= pad; - pg += __nr_pages; - offset += __nr_pages << PAGE_SHIFT; + pg += page_size >> PAGE_SHIFT; + offset += page_size; } return buf; } -static void bts_buffer_free_aux(void *data) -{ - kfree(data); -} - static unsigned long bts_buffer_offset(struct bts_buffer *buf, unsigned int idx) { return buf->buf[idx].offset + buf->buf[idx].displacement; -- 2.16.0.rc1.238.g530d649a79-goog

7 years, 11 months

7
8
0 0

[Linux-stable-mirror] [PATCH for-next] IB/core: Avoid a potential OOPs for an unused optional parameter

by Dennis Dalessandro

From: Michael J. Ruhl <michael.j.ruhl(a)intel.com> The ev_file is an optional parameter for CQ creation. If the parameter is not passed, the ev_file pointer will be NULL. Using that pointer to set the cq_context will result in an OOPs. Verify that ev_file is not NULL before using. Cc: <stable(a)vger.kernel.org> # 4.14.x Fixes: 9ee79fce3642 ("IB/core: Add completion queue (cq) object actions") Reviewed-by: Dennis Dalessandro <dennis.dalessandro(a)intel.com> Reviewed-by: Ira Weiny <ira.weiny(a)intel.com> Signed-off-by: Michael J. Ruhl <michael.j.ruhl(a)intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro(a)intel.com> --- drivers/infiniband/core/uverbs_std_types.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/core/uverbs_std_types.c b/drivers/infiniband/core/uverbs_std_types.c index b571176..cab0ac3 100644 --- a/drivers/infiniband/core/uverbs_std_types.c +++ b/drivers/infiniband/core/uverbs_std_types.c @@ -316,7 +316,7 @@ static int uverbs_create_cq_handler(struct ib_device *ib_dev, cq->uobject = &obj->uobject; cq->comp_handler = ib_uverbs_comp_handler; cq->event_handler = ib_uverbs_cq_event_handler; - cq->cq_context = &ev_file->ev_queue; + cq->cq_context = ev_file ? &ev_file->ev_queue : NULL; obj->uobject.object = cq; obj->uobject.user_handle = user_handle; atomic_set(&cq->usecnt, 0);

7 years, 11 months

2
1
0 0

[Linux-stable-mirror] + mm-kasan-dont-vfree-nonexistent-vm_area.patch added to -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: mm/kasan: don't vfree() nonexistent vm_area has been added to the -mm tree. Its filename is mm-kasan-dont-vfree-nonexistent-vm_area.patch This patch should soon appear at http://ozlabs.org/~akpm/mmots/broken-out/mm-kasan-dont-vfree-nonexistent-vm… and later at http://ozlabs.org/~akpm/mmotm/broken-out/mm-kasan-dont-vfree-nonexistent-vm… Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/SubmitChecklist when testing your code *** The -mm tree is included into linux-next and is updated there every 3-4 working days ------------------------------------------------------ From: Andrey Ryabinin <aryabinin(a)virtuozzo.com> Subject: mm/kasan: don't vfree() nonexistent vm_area KASAN uses different routines to map shadow for hot added memory and memory obtained in boot process. Attempt to offline memory onlined by normal boot process leads to this: Trying to vfree() nonexistent vm area (000000005d3b34b9) WARNING: CPU: 2 PID: 13215 at mm/vmalloc.c:1525 __vunmap+0x147/0x190 Call Trace: kasan_mem_notifier+0xad/0xb9 notifier_call_chain+0x166/0x260 __blocking_notifier_call_chain+0xdb/0x140 __offline_pages+0x96a/0xb10 memory_subsys_offline+0x76/0xc0 device_offline+0xb8/0x120 store_mem_state+0xfa/0x120 kernfs_fop_write+0x1d5/0x320 __vfs_write+0xd4/0x530 vfs_write+0x105/0x340 SyS_write+0xb0/0x140 Obviously we can't call vfree() to free memory that wasn't allocated via vmalloc(). Use find_vm_area() to see if we can call vfree(). Unfortunately it's a bit tricky to properly unmap and free shadow allocated during boot, so we'll have to keep it. If memory will come online again that shadow will be reused. Link: http://lkml.kernel.org/r/20180201163349.8700-1-aryabinin@virtuozzo.com Fixes: fa69b5989bb0 ("mm/kasan: add support for memory hotplug") Signed-off-by: Andrey Ryabinin <aryabinin(a)virtuozzo.com> Reported-by: Paul Menzel <pmenzel+linux-kasan-dev(a)molgen.mpg.de> Cc: Alexander Potapenko <glider(a)google.com> Cc: Dmitry Vyukov <dvyukov(a)google.com> Cc: Matthew Wilcox <willy(a)infradead.org> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/kasan/kasan.c | 57 +++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 55 insertions(+), 2 deletions(-) diff -puN mm/kasan/kasan.c~mm-kasan-dont-vfree-nonexistent-vm_area mm/kasan/kasan.c --- a/mm/kasan/kasan.c~mm-kasan-dont-vfree-nonexistent-vm_area +++ a/mm/kasan/kasan.c @@ -791,6 +791,41 @@ DEFINE_ASAN_SET_SHADOW(f5); DEFINE_ASAN_SET_SHADOW(f8); #ifdef CONFIG_MEMORY_HOTPLUG +static bool shadow_mapped(unsigned long addr) +{ + pgd_t *pgd = pgd_offset_k(addr); + p4d_t *p4d; + pud_t *pud; + pmd_t *pmd; + pte_t *pte; + + if (pgd_none(*pgd)) + return false; + p4d = p4d_offset(pgd, addr); + if (p4d_none(*p4d)) + return false; + pud = pud_offset(p4d, addr); + if (pud_none(*pud)) + return false; + + /* + * We can't use pud_large() or pud_huge(), the first one + * is arch-specific, the last one depend on HUGETLB_PAGE. + * So let's abuse pud_bad(), if bud is bad it's has to + * because it's huge. + */ + if (pud_bad(*pud)) + return true; + pmd = pmd_offset(pud, addr); + if (pmd_none(*pmd)) + return false; + + if (pmd_bad(*pmd)) + return true; + pte = pte_offset_kernel(pmd, addr); + return !pte_none(*pte); +} + static int __meminit kasan_mem_notifier(struct notifier_block *nb, unsigned long action, void *data) { @@ -812,6 +847,14 @@ static int __meminit kasan_mem_notifier( case MEM_GOING_ONLINE: { void *ret; + /* + * If shadow is mapped already than it must have been mapped + * during the boot. This could happen if we onlining previously + * offlined memory. + */ + if (shadow_mapped(shadow_start)) + return NOTIFY_OK; + ret = __vmalloc_node_range(shadow_size, PAGE_SIZE, shadow_start, shadow_end, GFP_KERNEL, PAGE_KERNEL, VM_NO_GUARD, @@ -823,8 +866,18 @@ static int __meminit kasan_mem_notifier( kmemleak_ignore(ret); return NOTIFY_OK; } - case MEM_OFFLINE: - vfree((void *)shadow_start); + case MEM_OFFLINE: { + struct vm_struct *vm; + + /* + * Only hot-added memory have vm_area. Freeing shadow + * mapped during boot would be tricky, so we'll just + * have to keep it. + */ + vm = find_vm_area((void *)shadow_start); + if (vm) + vfree((void *)shadow_start); + } } return NOTIFY_OK; _ Patches currently in -mm which might be from aryabinin(a)virtuozzo.com are kasan-makefile-support-llvm-style-asan-parameters.patch mm-kasan-dont-vfree-nonexistent-vm_area.patch lib-ubsan-add-type-mismatch-handler-for-new-gcc-clang.patch lib-ubsan-remove-returns-nonnull-attribute-checks.patch lib-ubsan-remove-returns-nonnull-attribute-checks-fix.patch

7 years, 11 months

1
0
0 0

[Linux-stable-mirror] [merged] mm-memory_hotplug-fix-memmap-initialization.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: mm, memory_hotplug: fix memmap initialization has been removed from the -mm tree. Its filename was mm-memory_hotplug-fix-memmap-initialization.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Michal Hocko <mhocko(a)suse.com> Subject: mm, memory_hotplug: fix memmap initialization Bharata has noticed that onlining a newly added memory doesn't increase the total memory, pointing to f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap") as a culprit. This commit has changed the way how the memory for memmaps is initialized and moves it from the allocation time to the initialization time. This works properly for the early memmap init path. It doesn't work for the memory hotplug though because we need to mark page as reserved when the sparsemem section is created and later initialize it completely during onlining. memmap_init_zone is called in the early stage of onlining. With the current code it calls __init_single_page and as such it clears up the whole stage and therefore online_pages_range skips those pages. Fix this by skipping mm_zero_struct_page in __init_single_page for memory hotplug path. This is quite uggly but unifying both early init and memory hotplug init paths is a large project. Make sure we plug the regression at least. Link: http://lkml.kernel.org/r/20180130101141.GW21609@dhcp22.suse.cz Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap") Signed-off-by: Michal Hocko <mhocko(a)suse.com> Reported-by: Bharata B Rao <bharata(a)linux.vnet.ibm.com> Tested-by: Bharata B Rao <bharata(a)linux.vnet.ibm.com> Reviewed-by: Pavel Tatashin <pasha.tatashin(a)oracle.com> Cc: Steven Sistare <steven.sistare(a)oracle.com> Cc: Daniel Jordan <daniel.m.jordan(a)oracle.com> Cc: Bob Picco <bob.picco(a)oracle.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/page_alloc.c | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff -puN mm/page_alloc.c~mm-memory_hotplug-fix-memmap-initialization mm/page_alloc.c --- a/mm/page_alloc.c~mm-memory_hotplug-fix-memmap-initialization +++ a/mm/page_alloc.c @@ -1177,9 +1177,10 @@ static void free_one_page(struct zone *z } static void __meminit __init_single_page(struct page *page, unsigned long pfn, - unsigned long zone, int nid) + unsigned long zone, int nid, bool zero) { - mm_zero_struct_page(page); + if (zero) + mm_zero_struct_page(page); set_page_links(page, zone, nid, pfn); init_page_count(page); page_mapcount_reset(page); @@ -1194,9 +1195,9 @@ static void __meminit __init_single_page } static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone, - int nid) + int nid, bool zero) { - return __init_single_page(pfn_to_page(pfn), pfn, zone, nid); + return __init_single_page(pfn_to_page(pfn), pfn, zone, nid, zero); } #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT @@ -1217,7 +1218,7 @@ static void __meminit init_reserved_page if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone)) break; } - __init_single_pfn(pfn, zid, nid); + __init_single_pfn(pfn, zid, nid, true); } #else static inline void init_reserved_page(unsigned long pfn) @@ -1534,7 +1535,7 @@ static unsigned long __init deferred_in } else { page++; } - __init_single_page(page, pfn, zid, nid); + __init_single_page(page, pfn, zid, nid, true); nr_pages++; } return (nr_pages); @@ -5399,15 +5400,20 @@ not_early: * can be created for invalid pages (for alignment) * check here not to call set_pageblock_migratetype() against * pfn out of zone. + * + * Please note that MEMMAP_HOTPLUG path doesn't clear memmap + * because this is done early in sparse_add_one_section */ if (!(pfn & (pageblock_nr_pages - 1))) { struct page *page = pfn_to_page(pfn); - __init_single_page(page, pfn, zone, nid); + __init_single_page(page, pfn, zone, nid, + context != MEMMAP_HOTPLUG); set_pageblock_migratetype(page, MIGRATE_MOVABLE); cond_resched(); } else { - __init_single_pfn(pfn, zone, nid); + __init_single_pfn(pfn, zone, nid, + context != MEMMAP_HOTPLUG); } } } _ Patches currently in -mm which might be from mhocko(a)suse.com are mm-oom-docs-describe-the-cgroup-aware-oom-killer-fix-2.patch mm-introduce-map_fixed_safe.patch fs-elf-drop-map_fixed-usage-from-elf_map.patch fs-elf-drop-map_fixed-usage-from-elf_map-fix-fix.patch mm-numa-rework-do_pages_move.patch mm-migrate-remove-reason-argument-from-new_page_t.patch mm-migrate-remove-reason-argument-from-new_page_t-fix-3.patch mm-unclutter-thp-migration.patch net-netfilter-x_tablesc-remove-size-check.patch

7 years, 11 months

1
0
0 0

[Linux-stable-mirror] [PATCH] KVM: x86: fix backward migration with async_PF

by Radim Krčmář

Guests on new hypersiors might set KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT bit when enabling async_PF, but this bit is reserved on old hypervisors, which results in a failure upon migration. Guests at least expect that KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT might not be present when booting, so we allow userspace to handle migration compatibility by adding a KVM CPUID flag that determines the presence of KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT. Fixes: 52a5c155cf79 ("KVM: async_pf: Let guest support delivery of async_pf from guest mode") Cc: <stable(a)vger.kernel.org> Signed-off-by: Radim Krčmář <rkrcmar(a)redhat.com> --- arch/x86/include/uapi/asm/kvm_para.h | 1 + arch/x86/kernel/kvm.c | 8 ++++---- arch/x86/kvm/cpuid.c | 3 ++- arch/x86/kvm/cpuid.h | 11 +++++++++++ arch/x86/kvm/x86.c | 6 ++++-- 5 files changed, 22 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 7a2ade4aa235..6cfa9c8cb7d6 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -26,6 +26,7 @@ #define KVM_FEATURE_PV_EOI 6 #define KVM_FEATURE_PV_UNHALT 7 #define KVM_FEATURE_PV_TLB_FLUSH 9 +#define KVM_FEATURE_ASYNC_PF_VMEXIT 10 /* The last 8 bits are used to indicate how to interpret the flags field * in pvclock structure. If no bits are set, all flags are ignored. diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 4e37d1a851a6..971babe964d2 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -341,10 +341,10 @@ static void kvm_guest_cpu_init(void) #endif pa |= KVM_ASYNC_PF_ENABLED; - /* Async page fault support for L1 hypervisor is optional */ - if (wrmsr_safe(MSR_KVM_ASYNC_PF_EN, - (pa | KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT) & 0xffffffff, pa >> 32) < 0) - wrmsrl(MSR_KVM_ASYNC_PF_EN, pa); + if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_VMEXIT)) + pa |= KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT; + + wrmsrl(MSR_KVM_ASYNC_PF_EN, pa); __this_cpu_write(apf_reason.enabled, 1); printk(KERN_INFO"KVM setup async PF for cpu %d\n", smp_processor_id()); diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 20e491b94f44..7fc04a176c57 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -604,7 +604,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, (1 << KVM_FEATURE_PV_EOI) | (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) | (1 << KVM_FEATURE_PV_UNHALT) | - (1 << KVM_FEATURE_PV_TLB_FLUSH); + (1 << KVM_FEATURE_PV_TLB_FLUSH) | + (1 << KVM_FEATURE_ASYNC_PF_VMEXIT); if (sched_info_on()) entry->eax |= (1 << KVM_FEATURE_STEAL_TIME); diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h index c2cea6651279..f20731dfe28e 100644 --- a/arch/x86/kvm/cpuid.h +++ b/arch/x86/kvm/cpuid.h @@ -105,6 +105,17 @@ static __always_inline bool guest_cpuid_has(struct kvm_vcpu *vcpu, unsigned x86_ return *reg & bit(x86_feature); } +static inline bool guest_kvm_cpuid_has(struct kvm_vcpu *vcpu, unsigned kvm_feature) +{ + struct kvm_cpuid_entry2 *entry; + + entry = kvm_find_cpuid_entry(vcpu, KVM_CPUID_FEATURES, 0); + if (!entry) + return false; + + return entry->eax & bit(kvm_feature); +} + static __always_inline void guest_cpuid_clear(struct kvm_vcpu *vcpu, unsigned x86_feature) { int *reg; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4c3103f449a3..c16740a06f0c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2139,8 +2139,10 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data) { gpa_t gpa = data & ~0x3f; - /* Bits 3:5 are reserved, Should be zero */ - if (data & 0x38) + /* Bits 3:5 are reserved, Should be zero. */ + if (data & 0x38 || + (data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT && + !guest_kvm_cpuid_has(vcpu, KVM_FEATURE_ASYNC_PF_VMEXIT))) return 1; vcpu->arch.apf.msr_val = data; -- 2.15.1

7 years, 11 months

2
3
0 0

[Linux-stable-mirror] [merged] ocfs2-try-a-blocking-lock-before-return-aop_truncated_page.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE has been removed from the -mm tree. Its filename was ocfs2-try-a-blocking-lock-before-return-aop_truncated_page.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Gang He <ghe(a)suse.com> Subject: ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE If we can't get inode lock immediately in the function ocfs2_inode_lock_with_page() when reading a page, we should not return directly here, since this will lead to a softlockup problem when the kernel is configured with CONFIG_PREEMPT is not set. The method is to get a blocking lock and immediately unlock before returning, this can avoid CPU resource waste due to lots of retries, and benefits fairness in getting lock among multiple nodes, increase efficiency in case modifying the same file frequently from multiple nodes. The softlockup crash (when set /proc/sys/kernel/softlockup_panic to 1) looks like: Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <IRQ> dump_stack+0x5c/0x82 panic+0xd5/0x21e watchdog_timer_fn+0x208/0x210 ? watchdog_park_threads+0x70/0x70 __hrtimer_run_queues+0xcc/0x200 hrtimer_interrupt+0xa6/0x1f0 smp_apic_timer_interrupt+0x34/0x50 apic_timer_interrupt+0x96/0xa0 </IRQ> RIP: 0010:unlock_page+0x17/0x30 RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10 RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004 RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300 RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00 R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518 R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300 ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2] ocfs2_readpage+0x41/0x2d0 [ocfs2] ? pagecache_get_page+0x30/0x200 filemap_fault+0x12b/0x5c0 ? recalc_sigpending+0x17/0x50 ? __set_task_blocked+0x28/0x70 ? __set_current_blocked+0x3d/0x60 ocfs2_fault+0x29/0xb0 [ocfs2] __do_fault+0x1a/0xa0 __handle_mm_fault+0xbe8/0x1090 handle_mm_fault+0xaa/0x1f0 __do_page_fault+0x235/0x4b0 trace_do_page_fault+0x3c/0x110 async_page_fault+0x28/0x30 RIP: 0033:0x7fa75ded638e RSP: 002b:00007ffd6657db18 EFLAGS: 00010287 RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700 RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700 RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000 R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770 R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000 About performance improvement, we can see the testing time is reduced, and CPU utilization decreases, the detailed data is as follows. I ran multi_mmap test case in ocfs2-test package in a three nodes cluster. Before applying this patch: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2754 ocfs2te+ 20 0 170248 6980 4856 D 80.73 0.341 0:18.71 multi_mmap 1505 root rt 0 222236 123060 97224 S 2.658 6.015 0:01.44 corosync 5 root 20 0 0 0 0 S 1.329 0.000 0:00.19 kworker/u8:0 95 root 20 0 0 0 0 S 1.329 0.000 0:00.25 kworker/u8:1 2728 root 20 0 0 0 0 S 0.997 0.000 0:00.24 jbd2/sda1-33 2721 root 20 0 0 0 0 S 0.664 0.000 0:00.07 ocfs2dc-3C8CFD4 2750 ocfs2te+ 20 0 142976 4652 3532 S 0.664 0.227 0:00.28 mpirun ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared Tests with "-b 4096 -C 32768" Thu Dec 28 14:44:52 CST 2017 multi_mmap..................................................Passed. Runtime 783 seconds. After apply this patch: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2508 ocfs2te+ 20 0 170248 6804 4680 R 54.00 0.333 0:55.37 multi_mmap 155 root 20 0 0 0 0 S 2.667 0.000 0:01.20 kworker/u8:3 95 root 20 0 0 0 0 S 2.000 0.000 0:01.58 kworker/u8:1 2504 ocfs2te+ 20 0 142976 4604 3480 R 1.667 0.225 0:01.65 mpirun 5 root 20 0 0 0 0 S 1.000 0.000 0:01.36 kworker/u8:0 2482 root 20 0 0 0 0 S 1.000 0.000 0:00.86 jbd2/sda1-33 299 root 0 -20 0 0 0 S 0.333 0.000 0:00.13 kworker/2:1H 335 root 0 -20 0 0 0 S 0.333 0.000 0:00.17 kworker/1:1H 535 root 20 0 12140 7268 1456 S 0.333 0.355 0:00.34 haveged 1282 root rt 0 222284 123108 97224 S 0.333 6.017 0:01.33 corosync ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared Tests with "-b 4096 -C 32768" Thu Dec 28 15:04:12 CST 2017 multi_mmap..................................................Passed. Runtime 487 seconds. Link: http://lkml.kernel.org/r/1514447305-30814-1-git-send-email-ghe@suse.com Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock") Signed-off-by: Gang He <ghe(a)suse.com> Reviewed-by: Eric Ren <zren(a)suse.com> Acked-by: alex chen <alex.chen(a)huawei.com> Acked-by: piaojun <piaojun(a)huawei.com> Cc: Mark Fasheh <mfasheh(a)versity.com> Cc: Joel Becker <jlbec(a)evilplan.org> Cc: Junxiao Bi <junxiao.bi(a)oracle.com> Cc: Joseph Qi <jiangqi903(a)gmail.com> Cc: Changwei Ge <ge.changwei(a)h3c.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- fs/ocfs2/dlmglue.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff -puN fs/ocfs2/dlmglue.c~ocfs2-try-a-blocking-lock-before-return-aop_truncated_page fs/ocfs2/dlmglue.c --- a/fs/ocfs2/dlmglue.c~ocfs2-try-a-blocking-lock-before-return-aop_truncated_page +++ a/fs/ocfs2/dlmglue.c @@ -2486,6 +2486,15 @@ int ocfs2_inode_lock_with_page(struct in ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK); if (ret == -EAGAIN) { unlock_page(page); + /* + * If we can't get inode lock immediately, we should not return + * directly here, since this will lead to a softlockup problem. + * The method is to get a blocking lock and immediately unlock + * before returning, this can avoid CPU resource waste due to + * lots of retries, and benefits fairness in getting lock. + */ + if (ocfs2_inode_lock(inode, ret_bh, ex) == 0) + ocfs2_inode_unlock(inode, ex); ret = AOP_TRUNCATED_PAGE; } _ Patches currently in -mm which might be from ghe(a)suse.com are ocfs2-get-rid-of-ocfs2_is_o2cb_active-function.patch ocfs2-move-some-definitions-to-header-file.patch ocfs2-fix-some-small-problems.patch ocfs2-add-kobject-for-online-file-check.patch ocfs2-add-duplicative-ino-number-check.patch

7 years, 11 months

1
0
0 0

[Linux-stable-mirror] [PATCH v2 1/2] drm/i915/bxt, glk: Increase PCODE timeouts during CDCLK freq changing

by Imre Deak

Currently we see sporadic timeouts during CDCLK changing both on BXT and GLK as reported by the Bugzilla: ticket. It's easy to reproduce this by changing the frequency in a tight loop after blanking the display. The upper bound for the completion time is 800us based on my tests, so increase it from the current 500us to 2ms; with that I couldn't trigger the problem either on BXT or GLK. Note that timeouts happened during both the change notification and the voltage level setting PCODE request. (For the latter one BSpec doesn't require us to wait for completion before further HW programming.) This issue is similar to 2c7d0602c815 ("drm/i915/gen9: Fix PCODE polling during CDCLK change notification") but there the PCODE request does complete (as shown by the mbox busy flag), only the reply we get from PCODE indicates a failure. So there we keep resending the request until a success reply, here we just have to increase the timeout for the one PCODE request we send. v2: - s/snb_pcode_request/sandybridge_pcode_write_timeout/ (Ville) Cc: Chris Wilson <chris(a)chris-wilson.co.uk> Cc: Ville Syrjälä <ville.syrjala(a)linux.intel.com> Cc: <stable(a)vger.kernel.org> # v4.4+ Acked-by: Chris Wilson <chris(a)chris-wilson.co.uk> (v1) Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103326 Signed-off-by: Imre Deak <imre.deak(a)intel.com> --- drivers/gpu/drm/i915/i915_drv.h | 6 +++++- drivers/gpu/drm/i915/intel_cdclk.c | 22 +++++++++++++++++----- drivers/gpu/drm/i915/intel_pm.c | 6 +++--- 3 files changed, 25 insertions(+), 9 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 454d8f937fae..7e83d0d3e632 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -3723,7 +3723,11 @@ extern void intel_display_print_error_state(struct drm_i915_error_state_buf *e, struct intel_display_error_state *error); int sandybridge_pcode_read(struct drm_i915_private *dev_priv, u32 mbox, u32 *val); -int sandybridge_pcode_write(struct drm_i915_private *dev_priv, u32 mbox, u32 val); +int sandybridge_pcode_write_timeout(struct drm_i915_private *dev_priv, u32 mbox, + u32 val, int timeout_us); +#define sandybridge_pcode_write(dev_priv, mbox, val) \ + sandybridge_pcode_write_timeout(dev_priv, mbox, val, 500) + int skl_pcode_request(struct drm_i915_private *dev_priv, u32 mbox, u32 request, u32 reply_mask, u32 reply, int timeout_base_ms); diff --git a/drivers/gpu/drm/i915/intel_cdclk.c b/drivers/gpu/drm/i915/intel_cdclk.c index c4392ea34a3d..a423b674fcec 100644 --- a/drivers/gpu/drm/i915/intel_cdclk.c +++ b/drivers/gpu/drm/i915/intel_cdclk.c @@ -1370,10 +1370,15 @@ static void bxt_set_cdclk(struct drm_i915_private *dev_priv, break; } - /* Inform power controller of upcoming frequency change */ + /* + * Inform power controller of upcoming frequency change. BSpec + * requires us to wait up to 150usec, but that leads to timeouts; + * the 2ms used here is based on experiment. + */ mutex_lock(&dev_priv->pcu_lock); - ret = sandybridge_pcode_write(dev_priv, HSW_PCODE_DE_WRITE_FREQ_REQ, - 0x80000000); + ret = sandybridge_pcode_write_timeout(dev_priv, + HSW_PCODE_DE_WRITE_FREQ_REQ, + 0x80000000, 2000); mutex_unlock(&dev_priv->pcu_lock); if (ret) { @@ -1404,8 +1409,15 @@ static void bxt_set_cdclk(struct drm_i915_private *dev_priv, I915_WRITE(CDCLK_CTL, val); mutex_lock(&dev_priv->pcu_lock); - ret = sandybridge_pcode_write(dev_priv, HSW_PCODE_DE_WRITE_FREQ_REQ, - cdclk_state->voltage_level); + /* + * The timeout isn't specified, the 2ms used here is based on + * experiment. + * FIXME: Waiting for the request completion could be delayed until + * the next PCODE request based on BSpec. + */ + ret = sandybridge_pcode_write_timeout(dev_priv, + HSW_PCODE_DE_WRITE_FREQ_REQ, + cdclk_state->voltage_level, 2000); mutex_unlock(&dev_priv->pcu_lock); if (ret) { diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c index 0b92ea1dbd40..a6098932be5a 100644 --- a/drivers/gpu/drm/i915/intel_pm.c +++ b/drivers/gpu/drm/i915/intel_pm.c @@ -9169,8 +9169,8 @@ int sandybridge_pcode_read(struct drm_i915_private *dev_priv, u32 mbox, u32 *val return 0; } -int sandybridge_pcode_write(struct drm_i915_private *dev_priv, - u32 mbox, u32 val) +int sandybridge_pcode_write_timeout(struct drm_i915_private *dev_priv, + u32 mbox, u32 val, int timeout_us) { int status; @@ -9193,7 +9193,7 @@ int sandybridge_pcode_write(struct drm_i915_private *dev_priv, if (__intel_wait_for_register_fw(dev_priv, GEN6_PCODE_MAILBOX, GEN6_PCODE_READY, 0, - 500, 0, NULL)) { + timeout_us, 0, NULL)) { DRM_ERROR("timeout waiting for pcode write of 0x%08x to mbox %x to finish for %ps\n", val, mbox, __builtin_return_address(0)); return -ETIMEDOUT; -- 2.13.2

7 years, 11 months

2
1
0 0

[Linux-stable-mirror] net: cdc_ncm: initialize drvflags before usage

by Porto Rio

Hi all, we detected a problem in stable Kernel 4.4.114 in drivers/net/usb/cdc_ncm.c. In line 833, ctx->drvflags is checked in the if clause: if (ctx->drvflags & CDC_NCM_FLAG_RESET_NTB16) { but it is initialized *later* in line 877: /* Device-specific flags */ ctx->drvflags = drvflags; This initialization has to be done before the if clause. Note, that the if clause was backported from mainline at Nov. 15th 2017 (GetNtbFormat endian fix). In mainline, the initialization is at the right place before the if clause. Please find here a suggested patch: --- linux/drivers/net/usb/cdc_ncm.c.orig 2018-02-01 13:55:20.034393993 +0100 +++ linux/drivers/net/usb/cdc_ncm.c 2018-02-01 13:56:12.842393881 +0100 @@ -825,6 +825,9 @@ int cdc_ncm_bind_common(struct usbnet *d goto error2; } + /* Device-specific flags */ + ctx->drvflags = drvflags; + /* * Some Huawei devices have been observed to come out of reset in NDP32 mode. * Let's check if this is the case, and set the device to NDP16 mode again if @@ -873,9 +876,6 @@ int cdc_ncm_bind_common(struct usbnet *d /* finish setting up the device specific data */ cdc_ncm_setup(dev); - /* Device-specific flags */ - ctx->drvflags = drvflags; - /* Allocate the delayed NDP if needed. */ if (ctx->drvflags & CDC_NCM_FLAG_NDP_TO_END) { ctx->delayed_ndp16 = kzalloc(ctx->max_ndp_size, GFP_KERNEL);

7 years, 11 months

2
1
0 0

[Linux-stable-mirror] [PATCH] drm/atomic: Fix memleak on ERESTARTSYS during non-blocking commits

by Maarten Lankhorst

From: "Leo (Sunpeng) Li" <sunpeng.li(a)amd.com> During a non-blocking commit, it is possible to return before the commit_tail work is queued (-ERESTARTSYS, for example). Since a reference on the crtc commit object is obtained for the pending vblank event when preparing the commit, the above situation will leave us with an extra reference. Therefore, if the commit_tail worker has not consumed the event at the end of a commit, release it's reference. Changes since v1: - Also check for state->event->base.completion being set, to handle the case where stall_checks() fails in setup_crtc_commit(). Changes since v2: - Add a flag to drm_crtc_commit, to prevent dereferencing a freed event. i915 may unreference the state in a worker. Fixes: 24835e442f28 ("drm: reference count event->completion") Cc: <stable(a)vger.kernel.org> # v4.11+ Signed-off-by: Leo (Sunpeng) Li <sunpeng.li(a)amd.com> Acked-by: Harry Wentland <harry.wentland(a)amd.com> #v1 Signed-off-by: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com> --- drivers/gpu/drm/drm_atomic_helper.c | 15 +++++++++++++++ include/drm/drm_atomic.h | 9 +++++++++ 2 files changed, 24 insertions(+) diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c index ab4032167094..ae3cbfe9e01c 100644 --- a/drivers/gpu/drm/drm_atomic_helper.c +++ b/drivers/gpu/drm/drm_atomic_helper.c @@ -1878,6 +1878,8 @@ int drm_atomic_helper_setup_commit(struct drm_atomic_state *state, new_crtc_state->event->base.completion = &commit->flip_done; new_crtc_state->event->base.completion_release = release_crtc_commit; drm_crtc_commit_get(commit); + + commit->abort_completion = true; } for_each_oldnew_connector_in_state(state, conn, old_conn_state, new_conn_state, i) { @@ -3421,8 +3423,21 @@ EXPORT_SYMBOL(drm_atomic_helper_crtc_duplicate_state); void __drm_atomic_helper_crtc_destroy_state(struct drm_crtc_state *state) { if (state->commit) { + /* + * In the event that a non-blocking commit returns + * -ERESTARTSYS before the commit_tail work is queued, we will + * have an extra reference to the commit object. Release it, if + * the event has not been consumed by the worker. + * + * state->event may be freed, so we can't directly look at + * state->event->base.completion. + */ + if (state->event && state->commit->abort_completion) + drm_crtc_commit_put(state->commit); + kfree(state->commit->event); state->commit->event = NULL; + drm_crtc_commit_put(state->commit); } diff --git a/include/drm/drm_atomic.h b/include/drm/drm_atomic.h index 1c27526c499e..cf13842a6dbd 100644 --- a/include/drm/drm_atomic.h +++ b/include/drm/drm_atomic.h @@ -134,6 +134,15 @@ struct drm_crtc_commit { * &drm_pending_vblank_event pointer to clean up private events. */ struct drm_pending_vblank_event *event; + + /** + * @abort_completion: + * + * A flag that's set after drm_atomic_helper_setup_commit takes a second + * reference for the completion of $drm_crtc_state.event. It's used by + * the free code to remove the second reference if commit fails. + */ + bool abort_completion; }; struct __drm_planes_state { -- 2.15.1

7 years, 11 months

4
8
0 0

[Linux-stable-mirror] Patch "xen-netfront: remove warning when unloading module" has been added to the 3.18-stable tree

by gregkh＠linuxfoundation.org

This is a note to let you know that I've just added the patch titled xen-netfront: remove warning when unloading module to the 3.18-stable tree which can be found at: http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum… The filename of the patch is: xen-netfront-remove-warning-when-unloading-module.patch and it can be found in the queue-3.18 subdirectory. If you, or anyone else, feels it should not be added to the stable tree, please let <stable(a)vger.kernel.org> know about it. >From foo@baz Thu Feb 1 14:48:12 CET 2018 From: Eduardo Otubo <otubo(a)redhat.com> Date: Thu, 23 Nov 2017 15:18:35 +0100 Subject: xen-netfront: remove warning when unloading module From: Eduardo Otubo <otubo(a)redhat.com> [ Upstream commit 5b5971df3bc2775107ddad164018a8a8db633b81 ] v2: * Replace busy wait with wait_event()/wake_up_all() * Cannot garantee that at the time xennet_remove is called, the xen_netback state will not be XenbusStateClosed, so added a condition for that * There's a small chance for the xen_netback state is XenbusStateUnknown by the time the xen_netfront switches to Closed, so added a condition for that. When unloading module xen_netfront from guest, dmesg would output warning messages like below: [ 105.236836] xen:grant_table: WARNING: g.e. 0x903 still in use! [ 105.236839] deferring g.e. 0x903 (pfn 0x35805) This problem relies on netfront and netback being out of sync. By the time netfront revokes the g.e.'s netback didn't have enough time to free all of them, hence displaying the warnings on dmesg. The trick here is to make netfront to wait until netback frees all the g.e.'s and only then continue to cleanup for the module removal, and this is done by manipulating both device states. Signed-off-by: Eduardo Otubo <otubo(a)redhat.com> Acked-by: Juergen Gross <jgross(a)suse.com> Signed-off-by: David S. Miller <davem(a)davemloft.net> Signed-off-by: Sasha Levin <alexander.levin(a)microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> --- drivers/net/xen-netfront.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) --- a/drivers/net/xen-netfront.c +++ b/drivers/net/xen-netfront.c @@ -85,6 +85,8 @@ struct netfront_cb { /* IRQ name is queue name with "-tx" or "-rx" appended */ #define IRQ_NAME_SIZE (QUEUE_NAME_SIZE + 3) +static DECLARE_WAIT_QUEUE_HEAD(module_unload_q); + struct netfront_stats { u64 rx_packets; u64 tx_packets; @@ -2080,10 +2082,12 @@ static void netback_changed(struct xenbu break; case XenbusStateClosed: + wake_up_all(&module_unload_q); if (dev->state == XenbusStateClosed) break; /* Missed the backend's CLOSING state -- fallthrough */ case XenbusStateClosing: + wake_up_all(&module_unload_q); xenbus_frontend_closed(dev); break; } @@ -2305,6 +2309,20 @@ static int xennet_remove(struct xenbus_d dev_dbg(&dev->dev, "%s\n", dev->nodename); + if (xenbus_read_driver_state(dev->otherend) != XenbusStateClosed) { + xenbus_switch_state(dev, XenbusStateClosing); + wait_event(module_unload_q, + xenbus_read_driver_state(dev->otherend) == + XenbusStateClosing); + + xenbus_switch_state(dev, XenbusStateClosed); + wait_event(module_unload_q, + xenbus_read_driver_state(dev->otherend) == + XenbusStateClosed || + xenbus_read_driver_state(dev->otherend) == + XenbusStateUnknown); + } + xennet_disconnect_backend(info); xennet_sysfs_delif(info->netdev); Patches currently in stable-queue which might be from otubo(a)redhat.com are queue-3.18/xen-netfront-remove-warning-when-unloading-module.patch

7 years, 11 months

1
0
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-stable-mirror February 2018