This fixes a Spectre-v1/L1TF vulnerability in picdev_write().
It replaces index computations based on the (attacker-controlled) port
number with constants through a minor refactoring.
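For context, a minimal sketch of the pattern being removed (illustrative
only; the diff below is authoritative):

/*
 * Before: the array index is computed from the guest-controlled port
 * number. The CPU may speculate past the switch's case match with an
 * out-of-range addr, so pics[addr >> 7] becomes a speculative
 * out-of-bounds load (a Spectre-v1/L1TF gadget).
 */
pic_ioport_write(&s->pics[addr >> 7], addr, data);

/*
 * After: each case uses a compile-time constant index, leaving no
 * attacker-controlled arithmetic to speculate on. Behavior is
 * unchanged, since addr >> 7 is 0 for ports 0x20/0x21 and 1 for
 * 0xa0/0xa1.
 */
switch (addr) {
case 0x20:
case 0x21:
	pic_ioport_write(&s->pics[0], addr, data);
	break;
case 0xa0:
case 0xa1:
	pic_ioport_write(&s->pics[1], addr, data);
	break;
}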
Fixes: 85f455f7ddbe ("KVM: Add support for in-kernel PIC emulation")
Signed-off-by: Nick Finco <nifi(a)google.com>
Signed-off-by: Marios Pomonis <pomonis(a)google.com>
Reviewed-by: Andrew Honig <ahonig(a)google.com>
Cc: stable(a)vger.kernel.org
---
arch/x86/kvm/i8259.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index 8b38bb4868a6..629a09ca9860 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -460,10 +460,14 @@ static int picdev_write(struct kvm_pic *s,
switch (addr) {
case 0x20:
case 0x21:
+ pic_lock(s);
+ pic_ioport_write(&s->pics[0], addr, data);
+ pic_unlock(s);
+ break;
case 0xa0:
case 0xa1:
pic_lock(s);
- pic_ioport_write(&s->pics[addr >> 7], addr, data);
+ pic_ioport_write(&s->pics[1], addr, data);
pic_unlock(s);
break;
case 0x4d0:
--
2.24.0.525.g8f36a354ae-goog
The rseq.h UAPI documents that the rseq_cs field must be cleared
before reclaiming memory that contains the targeted struct rseq_cs.
We should extend this comment to also dictate that the rseq_cs field
must be cleared before reclaiming the memory containing the code
referred to by the rseq_cs start_ip and post_commit_offset fields.
While we can expect that use of dlclose(3) will typically unmap both
the struct rseq_cs and its associated code at once, nothing would
theoretically prevent a JIT from reclaiming the code without
reclaiming the struct rseq_cs. After such a code reclaim, the stale
rseq_cs would erroneously allow the kernel to consider new code that
is not a rseq critical section to be one.
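As a concrete illustration, a minimal user-space sketch of the rule for
a JIT (hypothetical names: rseq_area is the thread's registered struct
rseq, jit_buf/jit_len a buffer of generated code; the declaration of
the rseq_cs field has varied across uapi header versions, so the plain
64-bit store below is an assumption):

#include <stddef.h>
#include <sys/mman.h>
#include <linux/rseq.h>

extern volatile struct rseq rseq_area;	/* registered via sys_rseq() */

static void reclaim_jit_code(void *jit_buf, size_t jit_len)
{
	/*
	 * Clear rseq_cs (with single-copy atomicity, as the uapi comment
	 * requires) before reclaiming either the struct rseq_cs or the
	 * code it targets, so the kernel never matches a stale
	 * [start_ip, start_ip + post_commit_offset] range against
	 * whatever is mapped at those addresses next.
	 */
	rseq_area.rseq_cs = 0;
	munmap(jit_buf, jit_len);
}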
Suggested-by: Florian Weimer <fw(a)deneb.enyo.de>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Florian Weimer <fw(a)deneb.enyo.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: "Paul E. McKenney" <paulmck(a)linux.ibm.com>
Cc: Boqun Feng <boqun.feng(a)gmail.com>
Cc: "H . Peter Anvin" <hpa(a)zytor.com>
Cc: Paul Turner <pjt(a)google.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Neel Natu <neelnatu(a)google.com>
Cc: linux-api(a)vger.kernel.org
---
include/uapi/linux/rseq.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 9a402fdb60e9..d94afdfc4b7c 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -100,7 +100,9 @@ struct rseq {
* instruction sequence block, as well as when the kernel detects that
* it is preempting or delivering a signal outside of the range
* targeted by the rseq_cs. Also needs to be set to NULL by user-space
- * before reclaiming memory that contains the targeted struct rseq_cs.
+ * before reclaiming memory that contains the targeted struct rseq_cs
+ * or reclaiming memory that contains the code referred to by the
+ * start_ip and post_commit_offset fields of struct rseq_cs.
*
* Read and set by the kernel. Set by user-space with single-copy
* atomicity semantics. This field should only be updated by the
--
2.17.1
[Why]
When the connection status in an MST topology changes, the MST device
that detects the event sends out a CONNECTION_STATUS_NOTIFY message.
e.g. src-mst-mst-sst => src-mst (unplug) mst-sst
Currently, in the unplug case above, ports which have been allocated
payloads and are no longer in the topology still occupy time slots and
remain recorded in proposed_vcpi[] of the topology manager.
If we don't clean up the proposed_vcpi[], then when the code flow later
tries to update the payload table by calling
drm_dp_update_payload_part1(), port validation fails because there are
ports with proposed time slots that are no longer in the MST topology.
As a result, we also stop updating the DPCD payload table of the
downstream port.
[How]
While handling the CONNECTION_STATUS_NOTIFY message, detect whether the
event indicates that a device was unplugged from an output port. If so,
iterate over all proposed_vcpi[] entries to check whether the port of
each entry is still in the topology. If a port is invalid, set its
num_slots to 0.
Thereafter, when we try to update the payload table by calling
drm_dp_update_payload_part1(), we can successfully update the DPCD
payload table of the downstream port and clear the stale proposed_vcpi[]
entries to NULL.
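A condensed, comment-annotated sketch of the cleanup loop (the diff
below is authoritative; the comments spell out the reasoning):

/* proposed_vcpis[] stores pointers to the vcpi embedded in each
 * drm_dp_mst_port, so container_of() recovers the owning port.
 */
for (i = 0; i < mgr->max_payloads; i++) {
	struct drm_dp_vcpi *vcpi = mgr->proposed_vcpis[i];
	struct drm_dp_mst_port *port_validated;

	if (!vcpi)
		continue;

	port_validated =
		container_of(vcpi, struct drm_dp_mst_port, vcpi);

	/* Taking a validated reference fails iff the port is no longer
	 * reachable in the topology; zero its proposed allocation so
	 * that drm_dp_update_payload_part1() can drop the stale payload
	 * and clear the proposed_vcpi[] slot, as described above.
	 */
	port_validated =
		drm_dp_mst_topology_get_port_validated(mgr, port_validated);
	if (!port_validated) {
		mutex_lock(&mgr->payload_lock);
		vcpi->num_slots = 0;
		mutex_unlock(&mgr->payload_lock);
	} else {
		drm_dp_mst_topology_put_port(port_validated);
	}
}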
Signed-off-by: Wayne Lin <Wayne.Lin(a)amd.com>
Cc: stable(a)vger.kernel.org
---
drivers/gpu/drm/drm_dp_mst_topology.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/drm_dp_mst_topology.c b/drivers/gpu/drm/drm_dp_mst_topology.c
index 5306c47dc820..2e236b6275c4 100644
--- a/drivers/gpu/drm/drm_dp_mst_topology.c
+++ b/drivers/gpu/drm/drm_dp_mst_topology.c
@@ -2318,7 +2318,7 @@ drm_dp_mst_handle_conn_stat(struct drm_dp_mst_branch *mstb,
{
struct drm_dp_mst_topology_mgr *mgr = mstb->mgr;
struct drm_dp_mst_port *port;
- int old_ddps, ret;
+ int old_ddps, old_input, ret, i;
u8 new_pdt;
bool dowork = false, create_connector = false;
@@ -2349,6 +2349,7 @@ drm_dp_mst_handle_conn_stat(struct drm_dp_mst_branch *mstb,
}
old_ddps = port->ddps;
+ old_input = port->input;
port->input = conn_stat->input_port;
port->mcs = conn_stat->message_capability_status;
port->ldps = conn_stat->legacy_device_plug_status;
@@ -2373,6 +2374,27 @@ drm_dp_mst_handle_conn_stat(struct drm_dp_mst_branch *mstb,
dowork = false;
}
+ if (!old_input && old_ddps != port->ddps && !port->ddps) {
+ for (i = 0; i < mgr->max_payloads; i++) {
+ struct drm_dp_vcpi *vcpi = mgr->proposed_vcpis[i];
+ struct drm_dp_mst_port *port_validated;
+
+ if (vcpi) {
+ port_validated =
+ container_of(vcpi, struct drm_dp_mst_port, vcpi);
+ port_validated =
+ drm_dp_mst_topology_get_port_validated(mgr, port_validated);
+ if (!port_validated) {
+ mutex_lock(&mgr->payload_lock);
+ vcpi->num_slots = 0;
+ mutex_unlock(&mgr->payload_lock);
+ } else {
+ drm_dp_mst_topology_put_port(port_validated);
+ }
+ }
+ }
+ }
+
if (port->connector)
drm_modeset_unlock(&mgr->base.lock);
else if (create_connector)
--
2.17.1
commit 4929a4e6faa0f13289a67cae98139e727f0d4a97 upstream.
The quota/period ratio is used to ensure a child task group won't get
more bandwidth than the parent task group, and is calculated as:
normalized_cfs_quota() = [(quota_us << 20) / period_us]
If the quota/period ratio is changed by the period scaling done in
sched_cfs_period_timer() due to precision loss, it will cause
inconsistency between parent and child task groups.
See below example:
A userspace container manager (kubelet) does three operations:
1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
2) Create a few children cgroups.
3) Set quota to 1,000us and period to 10,000us on a child cgroup.
These operations are expected to succeed. However, if the scaling of
147/128 happens before step 3, quota and period of the parent cgroup
will be changed:
new_quota: 1148437ns, 1148us
new_period: 11484375ns, 11484us
And when step 3 comes in, the ratio of the child cgroup will be
104857, which is larger than the parent cgroup ratio (104821), so the
operation will fail.
Scaling them by a factor of 2 will fix the problem.
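A quick worked check of these numbers (a stand-alone user-space sketch
mirroring the normalized_cfs_quota() formula above, not the kernel
code):

#include <inttypes.h>
#include <stdio.h>
#include <stdint.h>

/* normalized_cfs_quota() = (quota_us << 20) / period_us */
static uint64_t normalized_cfs_quota(uint64_t quota_us, uint64_t period_us)
{
	return (quota_us << 20) / period_us;
}

int main(void)
{
	/* Parent before scaling, and the child set up in step 3. */
	printf("%" PRIu64 "\n", normalized_cfs_quota(1000, 10000));	/* 104857 */
	/* Parent after the 147/128 scaling: 1148us / 11484us. */
	printf("%" PRIu64 "\n", normalized_cfs_quota(1148, 11484));	/* 104821 */
	/* Scaling by 2 instead: 2000us / 20000us keeps the ratio exact. */
	printf("%" PRIu64 "\n", normalized_cfs_quota(2000, 20000));	/* 104857 */
	return 0;
}

The child's ratio (104857) exceeds the parent's post-scaling ratio
(104821), so step 3 is rejected; with the factor-of-2 scaling the
parent's ratio is unchanged and step 3 succeeds.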
Tested-by: Phil Auld <pauld(a)redhat.com>
Signed-off-by: Xuewei Zhang <xueweiz(a)google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Acked-by: Phil Auld <pauld(a)redhat.com>
Cc: Anton Blanchard <anton(a)ozlabs.org>
Cc: Ben Segall <bsegall(a)google.com>
Cc: Dietmar Eggemann <dietmar.eggemann(a)arm.com>
Cc: Juri Lelli <juri.lelli(a)redhat.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Vincent Guittot <vincent.guittot(a)linaro.org>
Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
---
kernel/sched/fair.c | 36 ++++++++++++++++++++++--------------
1 file changed, 22 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea2d33aa1f55..773135f534ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3753,20 +3753,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
if (++count > 3) {
u64 new, old = ktime_to_ns(cfs_b->period);
- new = (old * 147) / 128; /* ~115% */
- new = min(new, max_cfs_quota_period);
-
- cfs_b->period = ns_to_ktime(new);
-
- /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
- cfs_b->quota *= new;
- cfs_b->quota = div64_u64(cfs_b->quota, old);
-
- pr_warn_ratelimited(
- "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
- smp_processor_id(),
- div_u64(new, NSEC_PER_USEC),
- div_u64(cfs_b->quota, NSEC_PER_USEC));
+ /*
+ * Grow period by a factor of 2 to avoid losing precision.
+ * Precision loss in the quota/period ratio can cause __cfs_schedulable
+ * to fail.
+ */
+ new = old * 2;
+ if (new < max_cfs_quota_period) {
+ cfs_b->period = ns_to_ktime(new);
+ cfs_b->quota *= 2;
+
+ pr_warn_ratelimited(
+ "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
+ smp_processor_id(),
+ div_u64(new, NSEC_PER_USEC),
+ div_u64(cfs_b->quota, NSEC_PER_USEC));
+ } else {
+ pr_warn_ratelimited(
+ "cfs_period_timer[cpu%d]: period too short, but cannot scale up without losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n",
+ smp_processor_id(),
+ div_u64(old, NSEC_PER_USEC),
+ div_u64(cfs_b->quota, NSEC_PER_USEC));
+ }
/* reset count so we don't come right back in here */
count = 0;
--
2.24.0.393.g34dc348eaf-goog
From: Kaike Wan <kaike.wan(a)intel.com>
When a TID RDMA ACK to RESYNC request is received, the flow PSNs for
pending TID RDMA WRITE segments will be adjusted with the next flow
generation number, based on the resync_psn value extracted from the
flow PSN of the TID RDMA ACK packet. The resync_psn value indicates
the last flow PSN for which a TID RDMA WRITE DATA packet has been
received by the responder and the requester should resend TID RDMA
WRITE DATA packets, starting from the next flow PSN. However, if
resync_psn points to the last flow PSN for a segment and the next
segment flow PSN starts with a new generation number, use of the
old resync_psn to adjust the flow PSN for the next segment will
lead to miscalculation, resulting in WARN_ON and sge rewinding
errors:
[2419460.492485] WARNING: CPU: 4 PID: 146961 at /nfs/site/home/phcvs2/gitrepo/ifs-all/components/Drivers/tmp/rpmbuild/BUILD/ifs-kernel-updates-3.10.0_957.el7.x86_64/hfi1/tid_rdma.c:4764 hfi1_rc_rcv_tid_rdma_ack+0x8f6/0xa90 [hfi1]
[2419460.514565] Modules linked in: ib_ipoib(OE) hfi1(OE) rdmavt(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfsv3 nfs_acl nfs lockd grace fscache iTCO_wdt iTCO_vendor_support skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel ib_isert iscsi_target_mod target_core_mod aesni_intel lrw gf128mul glue_helper ablk_helper cryptd rpcrdma sunrpc opa_vnic ast ttm ib_iser libiscsi drm_kms_helper scsi_transport_iscsi ipmi_ssif syscopyarea sysfillrect sysimgblt fb_sys_fops drm joydev ipmi_si pcspkr sg drm_panel_orientation_quirks ipmi_devintf lpc_ich i2c_i801 ipmi_msghandler wmi rdma_ucm ib_ucm ib_uverbs acpi_cpufreq acpi_power_meter ib_umad rdma_cm ib_cm iw_cm ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul i2c_algo_bit crct10dif_common
[2419460.594432] crc32c_intel e1000e ib_core ahci libahci ptp libata pps_core nfit libnvdimm [last unloaded: rdmavt]
[2419460.605645] CPU: 4 PID: 146961 Comm: kworker/4:0H Kdump: loaded Tainted: G W OE ------------ 3.10.0-957.el7.x86_64 #1
[2419460.619424] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.0X.02.0117.040420182310 04/04/2018
[2419460.631062] Workqueue: hfi0_0 _hfi1_do_tid_send [hfi1]
[2419460.637423] Call Trace:
[2419460.641044] <IRQ> [<ffffffff9e361dc1>] dump_stack+0x19/0x1b
[2419460.647980] [<ffffffff9dc97648>] __warn+0xd8/0x100
[2419460.654023] [<ffffffff9dc9778d>] warn_slowpath_null+0x1d/0x20
[2419460.661025] [<ffffffffc05d28c6>] hfi1_rc_rcv_tid_rdma_ack+0x8f6/0xa90 [hfi1]
[2419460.669333] [<ffffffffc05c21cc>] hfi1_kdeth_eager_rcv+0x1dc/0x210 [hfi1]
[2419460.677295] [<ffffffffc05c23ef>] ? hfi1_kdeth_expected_rcv+0x1ef/0x210 [hfi1]
[2419460.685693] [<ffffffffc0574f15>] kdeth_process_eager+0x35/0x90 [hfi1]
[2419460.693394] [<ffffffffc0575b5a>] handle_receive_interrupt_nodma_rtail+0x17a/0x2b0 [hfi1]
[2419460.702745] [<ffffffffc056a623>] receive_context_interrupt+0x23/0x40 [hfi1]
[2419460.710963] [<ffffffff9dd4a294>] __handle_irq_event_percpu+0x44/0x1c0
[2419460.718659] [<ffffffff9dd4a442>] handle_irq_event_percpu+0x32/0x80
[2419460.726086] [<ffffffff9dd4a4cc>] handle_irq_event+0x3c/0x60
[2419460.732903] [<ffffffff9dd4d27f>] handle_edge_irq+0x7f/0x150
[2419460.739710] [<ffffffff9dc2e554>] handle_irq+0xe4/0x1a0
[2419460.746091] [<ffffffff9e3795dd>] do_IRQ+0x4d/0xf0
[2419460.752040] [<ffffffff9e36b362>] common_interrupt+0x162/0x162
[2419460.759029] <EOI> [<ffffffff9dfa0f79>] ? swiotlb_map_page+0x49/0x150
[2419460.766758] [<ffffffffc05c2ed1>] hfi1_verbs_send_dma+0x291/0xb70 [hfi1]
[2419460.774637] [<ffffffffc05c2c40>] ? hfi1_wait_kmem+0xf0/0xf0 [hfi1]
[2419460.782080] [<ffffffffc05c3f26>] hfi1_verbs_send+0x126/0x2b0 [hfi1]
[2419460.789606] [<ffffffffc05ce683>] _hfi1_do_tid_send+0x1d3/0x320 [hfi1]
[2419460.797298] [<ffffffff9dcb9d4f>] process_one_work+0x17f/0x440
[2419460.804292] [<ffffffff9dcbade6>] worker_thread+0x126/0x3c0
[2419460.811025] [<ffffffff9dcbacc0>] ? manage_workers.isra.25+0x2a0/0x2a0
[2419460.818710] [<ffffffff9dcc1c31>] kthread+0xd1/0xe0
[2419460.824751] [<ffffffff9dcc1b60>] ? insert_kthread_work+0x40/0x40
[2419460.832013] [<ffffffff9e374c1d>] ret_from_fork_nospec_begin+0x7/0x21
[2419460.839611] [<ffffffff9dcc1b60>] ? insert_kthread_work+0x40/0x40
This patch fixes the issue by adjusting the resync_psn first if the flow
generation has been advanced for a pending segment.
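To make the arithmetic concrete, a self-contained sketch of the
miscalculation and the adjustment, assuming (hypothetically) an 11-bit
sequence field below the generation bits of a flow PSN; SEQ_SHIFT
stands in for HFI1_KDETH_BTH_SEQ_SHIFT and all values are illustrative:

#include <stdio.h>
#include <stdint.h>

#define SEQ_SHIFT 11	/* stands in for HFI1_KDETH_BTH_SEQ_SHIFT */

static uint32_t gen(uint32_t psn) { return psn >> SEQ_SHIFT; }

int main(void)
{
	/* resync_psn from the TID RDMA ACK: a PSN still in generation 4. */
	uint32_t resync_psn = (4u << SEQ_SHIFT) | 0x2ff;
	/* First flow PSN of the pending segment: already generation 5. */
	uint32_t fpsn = (5u << SEQ_SHIFT) | 0x010;

	/*
	 * Mirrors resync_npkts += delta_psn(resync_psn + 1, fpsn): with
	 * the stale generation the delta goes negative (-1296 here),
	 * rewinding the SGEs and tripping the WARN_ON.
	 */
	printf("bogus delta: %d\n", (int32_t)(resync_psn + 1 - fpsn));

	/*
	 * The fix: when the segment's generation differs from the one
	 * carried in resync_psn, clamp resync_psn to fpsn - 1 so the
	 * delta is 0 and nothing is counted as already received for
	 * this segment.
	 */
	if (gen(fpsn) != gen(resync_psn))
		resync_psn = fpsn - 1;
	printf("adjusted delta: %d\n", (int32_t)(resync_psn + 1 - fpsn));
	return 0;
}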
Fixes: 9e93e967f7b4 ("IB/hfi1: Add a function to receive TID RDMA ACK packet")
Cc: <stable(a)vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn(a)intel.com>
Signed-off-by: Kaike Wan <kaike.wan(a)intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro(a)intel.com>
---
drivers/infiniband/hw/hfi1/tid_rdma.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/drivers/infiniband/hw/hfi1/tid_rdma.c b/drivers/infiniband/hw/hfi1/tid_rdma.c
index e53f542..8a2e0d9 100644
--- a/drivers/infiniband/hw/hfi1/tid_rdma.c
+++ b/drivers/infiniband/hw/hfi1/tid_rdma.c
@@ -4633,6 +4633,15 @@ void hfi1_rc_rcv_tid_rdma_ack(struct hfi1_packet *packet)
*/
fpsn = full_flow_psn(flow, flow->flow_state.spsn);
req->r_ack_psn = psn;
+ /*
+ * If resync_psn points to the last flow PSN for a
+ * segment and the new segment (likely from a new
+ * request) starts with a new generation number, we
+ * need to adjust resync_psn accordingly.
+ */
+ if (flow->flow_state.generation !=
+ (resync_psn >> HFI1_KDETH_BTH_SEQ_SHIFT))
+ resync_psn = mask_psn(fpsn - 1);
flow->resync_npkts +=
delta_psn(mask_psn(resync_psn + 1), fpsn);
/*