[PATCH AUTOSEL 4.14 01/22] null_blk: Fix the null_add_dev() error path

List overview All Threads
Download

newer

older

[PATCH AUTOSEL 4.9 1/9]...

[PATCH AUTOSEL 4.19 01/32]...

Sasha Levin

10 Apr 2020 10 Apr '20

3:50 a.m.

From: Bart Van Assche bvanassche@acm.org

[ Upstream commit 2004bfdef945fe55196db6b9cdf321fbc75bb0de ]

If null_add_dev() fails, clear dev->nullb.

This patch fixes the following KASAN complaint:

BUG: KASAN: use-after-free in nullb_device_submit_queues_store+0xcf/0x160 [null_blk] Read of size 8 at addr ffff88803280fc30 by task check/8409

Call Trace: dump_stack+0xa5/0xe6 print_address_description.constprop.0+0x26/0x260 __kasan_report.cold+0x7b/0x99 kasan_report+0x16/0x20 __asan_load8+0x58/0x90 nullb_device_submit_queues_store+0xcf/0x160 [null_blk] configfs_write_file+0x1c4/0x250 [configfs] __vfs_write+0x4c/0x90 vfs_write+0x145/0x2c0 ksys_write+0xd7/0x180 __x64_sys_write+0x47/0x50 do_syscall_64+0x6f/0x2f0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7ff370926317 Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 RSP: 002b:00007fff2dd2da48 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007ff370926317 RDX: 0000000000000002 RSI: 0000559437ef23f0 RDI: 0000000000000001 RBP: 0000559437ef23f0 R08: 000000000000000a R09: 0000000000000001 R10: 0000559436703471 R11: 0000000000000246 R12: 0000000000000002 R13: 00007ff370a006a0 R14: 00007ff370a014a0 R15: 00007ff370a008a0

Allocated by task 8409: save_stack+0x23/0x90 __kasan_kmalloc.constprop.0+0xcf/0xe0 kasan_kmalloc+0xd/0x10 kmem_cache_alloc_node_trace+0x129/0x4c0 null_add_dev+0x24a/0xe90 [null_blk] nullb_device_power_store+0x1b6/0x270 [null_blk] configfs_write_file+0x1c4/0x250 [configfs] __vfs_write+0x4c/0x90 vfs_write+0x145/0x2c0 ksys_write+0xd7/0x180 __x64_sys_write+0x47/0x50 do_syscall_64+0x6f/0x2f0 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 8409: save_stack+0x23/0x90 __kasan_slab_free+0x112/0x160 kasan_slab_free+0x12/0x20 kfree+0xdf/0x250 null_add_dev+0xaf3/0xe90 [null_blk] nullb_device_power_store+0x1b6/0x270 [null_blk] configfs_write_file+0x1c4/0x250 [configfs] __vfs_write+0x4c/0x90 vfs_write+0x145/0x2c0 ksys_write+0xd7/0x180 __x64_sys_write+0x47/0x50 do_syscall_64+0x6f/0x2f0 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: 2984c8684f96 ("nullb: factor disk parameters") Signed-off-by: Bart Van Assche bvanassche@acm.org Reviewed-by: Chaitanya Kulkarni chaitanya.kulkarni@wdc.com Cc: Johannes Thumshirn jth@kernel.org Cc: Hannes Reinecke hare@suse.com Cc: Ming Lei ming.lei@redhat.com Cc: Christoph Hellwig hch@infradead.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/block/null_blk.c | 1 + 1 file changed, 1 insertion(+)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c index f01d4a8a783ac..e9776ca0996b0 100644 --- a/drivers/block/null_blk.c +++ b/drivers/block/null_blk.c @@ -1919,6 +1919,7 @@ static int null_add_dev(struct nullb_device *dev) cleanup_queues(nullb); out_free_nullb: kfree(nullb); + dev->nullb = NULL; out: return rv; }

-- 2.20.1

Show replies by date

Sasha Levin

10 Apr 10 Apr

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 02/22] null_blk: Handle null_add_dev() failures properly

From: Bart Van Assche bvanassche@acm.org

[ Upstream commit 9b03b713082a31a5b90e0a893c72aa620e255c26 ]

If null_add_dev() fails then null_del_dev() is called with a NULL argument. Make null_del_dev() handle this scenario correctly. This patch fixes the following KASAN complaint:

null-ptr-deref in null_del_dev+0x28/0x280 [null_blk] Read of size 8 at addr 0000000000000000 by task find/1062

Call Trace: dump_stack+0xa5/0xe6 __kasan_report.cold+0x65/0x99 kasan_report+0x16/0x20 __asan_load8+0x58/0x90 null_del_dev+0x28/0x280 [null_blk] nullb_group_drop_item+0x7e/0xa0 [null_blk] client_drop_item+0x53/0x80 [configfs] configfs_rmdir+0x395/0x4e0 [configfs] vfs_rmdir+0xb6/0x220 do_rmdir+0x238/0x2c0 __x64_sys_unlinkat+0x75/0x90 do_syscall_64+0x6f/0x2f0 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Bart Van Assche bvanassche@acm.org Reviewed-by: Chaitanya Kulkarni chaitanya.kulkarni@wdc.com Cc: Johannes Thumshirn jth@kernel.org Cc: Hannes Reinecke hare@suse.com Cc: Ming Lei ming.lei@redhat.com Cc: Christoph Hellwig hch@infradead.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/block/null_blk.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c index e9776ca0996b0..b4078901dbcb9 100644 --- a/drivers/block/null_blk.c +++ b/drivers/block/null_blk.c @@ -1593,7 +1593,12 @@ static void null_nvm_unregister(struct nullb *nullb) {}

static void null_del_dev(struct nullb *nullb) { - struct nullb_device *dev = nullb->dev; + struct nullb_device *dev; + + if (!nullb) + return; + + dev = nullb->dev;

ida_simple_remove(&nullb_indexes, nullb->index);

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 03/22] null_blk: fix spurious IO errors after failed past-wp access

From: Alexey Dobriyan adobriyan@gmail.com

[ Upstream commit ff77042296d0a54535ddf74412c5ae92cb4ec76a ]

Steps to reproduce:

BLKRESETZONE zone 0

// force EIO pwrite(fd, buf, 4096, 4096);

[issue more IO including zone ioctls]

It will start failing randomly including IO to unrelated zones because of ->error "reuse". Trigger can be partition detection as well if test is not run immediately which is even more entertaining.

The fix is of course to clear ->error where necessary.

Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Alexey Dobriyan (SK hynix) adobriyan@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/block/null_blk.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c index b4078901dbcb9..b12e373aa956a 100644 --- a/drivers/block/null_blk.c +++ b/drivers/block/null_blk.c @@ -622,6 +622,7 @@ static struct nullb_cmd *__alloc_cmd(struct nullb_queue *nq) if (tag != -1U) { cmd = &nq->cmds[tag]; cmd->tag = tag; + cmd->error = BLK_STS_OK; cmd->nq = nq; if (nq->dev->irqmode == NULL_IRQ_TIMER) { hrtimer_init(&cmd->timer, CLOCK_MONOTONIC, @@ -1399,6 +1400,7 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx, cmd->timer.function = null_cmd_timer_expired; } cmd->rq = bd->rq; + cmd->error = BLK_STS_OK; cmd->nq = nq;

blk_mq_start_request(bd->rq);

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 04/22] x86: Don't let pgprot_modify() change the page encryption bit

From: Thomas Hellstrom thellstrom@vmware.com

[ Upstream commit 6db73f17c5f155dbcfd5e48e621c706270b84df0 ]

When SEV or SME is enabled and active, vm_get_page_prot() typically returns with the encryption bit set. This means that users of pgprot_modify(, vm_get_page_prot()) (mprotect_fixup(), do_mmap()) end up with a value of vma->vm_pg_prot that is not consistent with the intended protection of the PTEs.

This is also important for fault handlers that rely on the VMA vm_page_prot to set the page protection. Fix this by not allowing pgprot_modify() to change the encryption bit, similar to how it's done for PAT bits.

Signed-off-by: Thomas Hellstrom thellstrom@vmware.com Signed-off-by: Borislav Petkov bp@suse.de Reviewed-by: Dave Hansen dave.hansen@linux.intel.com Acked-by: Tom Lendacky thomas.lendacky@amd.com Link: https://lkml.kernel.org/r/20200304114527.3636-2-thomas_os@shipmail.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/x86/include/asm/pgtable.h | 7 +++++-- arch/x86/include/asm/pgtable_types.h | 2 +- 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 6a4b1a54ff479..98a337e3835d6 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -588,12 +588,15 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) return __pmd(val); }

-/* mprotect needs to preserve PAT bits when updating vm_page_prot */ +/* + * mprotect needs to preserve PAT and encryption bits when updating + * vm_page_prot + */ #define pgprot_modify pgprot_modify static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) { pgprotval_t preservebits = pgprot_val(oldprot) & _PAGE_CHG_MASK; - pgprotval_t addbits = pgprot_val(newprot); + pgprotval_t addbits = pgprot_val(newprot) & ~_PAGE_CHG_MASK; return __pgprot(preservebits | addbits); }

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 85f8279c885ac..e6c870c240657 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -124,7 +124,7 @@ */ #define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \ _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \ - _PAGE_SOFT_DIRTY | _PAGE_DEVMAP) + _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC) #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 05/22] block: keep bdi->io_pages in sync with max_sectors_kb for stacked devices

From: Konstantin Khlebnikov khlebnikov@yandex-team.ru

[ Upstream commit e74d93e96d721c4297f2a900ad0191890d2fc2b0 ]

Field bdi->io_pages added in commit 9491ae4aade6 ("mm: don't cap request size based on read-ahead setting") removes unneeded split of read requests.

Stacked drivers do not call blk_queue_max_hw_sectors(). Instead they set limits of their devices by blk_set_stacking_limits() + disk_stack_limits(). Field bio->io_pages stays zero until user set max_sectors_kb via sysfs.

This patch updates io_pages after merging limits in disk_stack_limits().

Commit c6d6e9b0f6b4 ("dm: do not allow readahead to limit IO size") fixed the same problem for device-mapper devices, this one fixes MD RAIDs.

Fixes: 9491ae4aade6 ("mm: don't cap request size based on read-ahead setting") Reviewed-by: Paul Menzel pmenzel@molgen.mpg.de Reviewed-by: Bob Liu bob.liu@oracle.com Signed-off-by: Konstantin Khlebnikov khlebnikov@yandex-team.ru Signed-off-by: Song Liu songliubraving@fb.com Signed-off-by: Sasha Levin sashal@kernel.org --- block/blk-settings.c | 3 +++ 1 file changed, 3 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c index 6c2faaa38cc1e..e0a744921ed3d 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -717,6 +717,9 @@ void disk_stack_limits(struct gendisk *disk, struct block_device *bdev, printk(KERN_NOTICE "%s: Warning: Device %s is misaligned\n", top, bottom); } + + t->backing_dev_info->io_pages = + t->limits.max_sectors >> (PAGE_SHIFT - 9); } EXPORT_SYMBOL(disk_stack_limits);

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 06/22] irqchip/versatile-fpga: Handle chained IRQs properly

From: Sungbo Eo mans0n@gorani.run

[ Upstream commit 486562da598c59e9f835b551d7cf19507de2d681 ]

Enclose the chained handler with chained_irq_{enter,exit}(), so that the muxed interrupts get properly acked.

This patch also fixes a reboot bug on OX820 SoC, where the jiffies timer interrupt is never acked. The kernel waits a clock tick forever in calibrate_delay_converge(), which leads to a boot hang.

Fixes: c41b16f8c9d9 ("ARM: integrator/versatile: consolidate FPGA IRQ handling code") Signed-off-by: Sungbo Eo mans0n@gorani.run Signed-off-by: Marc Zyngier maz@kernel.org Link: https://lore.kernel.org/r/20200319023448.1479701-1-mans0n@gorani.run Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/irqchip/irq-versatile-fpga.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-versatile-fpga.c b/drivers/irqchip/irq-versatile-fpga.c index 928858dada756..70e2cfff8175f 100644 --- a/drivers/irqchip/irq-versatile-fpga.c +++ b/drivers/irqchip/irq-versatile-fpga.c @@ -6,6 +6,7 @@ #include <linux/irq.h> #include <linux/io.h> #include <linux/irqchip.h> +#include <linux/irqchip/chained_irq.h> #include <linux/irqchip/versatile-fpga.h> #include <linux/irqdomain.h> #include <linux/module.h> @@ -68,12 +69,16 @@ static void fpga_irq_unmask(struct irq_data *d)

static void fpga_irq_handle(struct irq_desc *desc) { + struct irq_chip *chip = irq_desc_get_chip(desc); struct fpga_irq_data *f = irq_desc_get_handler_data(desc); - u32 status = readl(f->base + IRQ_STATUS); + u32 status; + + chained_irq_enter(chip, desc);

+ status = readl(f->base + IRQ_STATUS); if (status == 0) { do_bad_IRQ(desc); - return; + goto out; }

do { @@ -82,6 +87,9 @@ static void fpga_irq_handle(struct irq_desc *desc) status &= ~(1 << irq); generic_handle_irq(irq_find_mapping(f->domain, irq)); } while (status); + +out: + chained_irq_exit(chip, desc); }

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 07/22] sched: Avoid scale real weight down to zero

From: Michael Wang yun.wang@linux.alibaba.com

[ Upstream commit 26cf52229efc87e2effa9d788f9b33c40fb3358a ]

During our testing, we found a case that shares no longer working correctly, the cgroup topology is like:

/sys/fs/cgroup/cpu/A (shares=102400) /sys/fs/cgroup/cpu/A/B (shares=2) /sys/fs/cgroup/cpu/A/B/C (shares=1024)

/sys/fs/cgroup/cpu/D (shares=1024) /sys/fs/cgroup/cpu/D/E (shares=1024) /sys/fs/cgroup/cpu/D/E/F (shares=1024)

The same benchmark is running in group C & F, no other tasks are running, the benchmark is capable to consumed all the CPUs.

We suppose the group C will win more CPU resources since it could enjoy all the shares of group A, but it's F who wins much more.

The reason is because we have group B with shares as 2, since A->cfs_rq.load.weight == B->se.load.weight == B->shares/nr_cpus, so A->cfs_rq.load.weight become very small.

And in calc_group_shares() we calculate shares as:

load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg); shares = (tg_shares * load) / tg_weight;

Since the 'cfs_rq->load.weight' is too small, the load become 0 after scale down, although 'tg_shares' is 102400, shares of the se which stand for group A on root cfs_rq become 2.

While the se of D on root cfs_rq is far more bigger than 2, so it wins the battle.

Thus when scale_load_down() scale real weight down to 0, it's no longer telling the real story, the caller will have the wrong information and the calculation will be buggy.

This patch add check in scale_load_down(), so the real weight will be >= MIN_SHARES after scale, after applied the group C wins as expected.

Suggested-by: Peter Zijlstra peterz@infradead.org Signed-off-by: Michael Wang yun.wang@linux.alibaba.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Reviewed-by: Vincent Guittot vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/38e8e212-59a1-64b2-b247-b6d0b52d8dc1@linux.alibaba... Signed-off-by: Sasha Levin sashal@kernel.org --- kernel/sched/sched.h | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 268f560ec9986..391d73a12ad72 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -89,7 +89,13 @@ static inline void cpu_load_update_active(struct rq *this_rq) { } #ifdef CONFIG_64BIT # define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT) # define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT) -# define scale_load_down(w) ((w) >> SCHED_FIXEDPOINT_SHIFT) +# define scale_load_down(w) \ +({ \ + unsigned long __w = (w); \ + if (__w) \ + __w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT); \ + __w; \ +}) #else # define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT) # define scale_load(w) (w)

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 08/22] selftests/x86/ptrace_syscall_32: Fix no-vDSO segfault

From: Andy Lutomirski luto@kernel.org

[ Upstream commit 630b99ab60aa972052a4202a1ff96c7e45eb0054 ]

If AT_SYSINFO is not present, don't try to call a NULL pointer.

Reported-by: kbuild test robot lkp@intel.com Signed-off-by: Andy Lutomirski luto@kernel.org Signed-off-by: Borislav Petkov bp@suse.de Link: https://lkml.kernel.org/r/faaf688265a7e1a5b944d6f8bc0f6368158306d3.158405240... Signed-off-by: Sasha Levin sashal@kernel.org --- tools/testing/selftests/x86/ptrace_syscall.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/x86/ptrace_syscall.c b/tools/testing/selftests/x86/ptrace_syscall.c index 6f22238f32173..12aaa063196e7 100644 --- a/tools/testing/selftests/x86/ptrace_syscall.c +++ b/tools/testing/selftests/x86/ptrace_syscall.c @@ -414,8 +414,12 @@ int main()

#if defined(__i386__) && (!defined(__GLIBC__) || __GLIBC__ > 2 || __GLIBC_MINOR__ >= 16) vsyscall32 = (void *)getauxval(AT_SYSINFO); - printf("[RUN]\tCheck AT_SYSINFO return regs\n"); - test_sys32_regs(do_full_vsyscall32); + if (vsyscall32) { + printf("[RUN]\tCheck AT_SYSINFO return regs\n"); + test_sys32_regs(do_full_vsyscall32); + } else { + printf("[SKIP]\tAT_SYSINFO is not available\n"); + } #endif

test_ptrace_syscall_restart();

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 09/22] PCI/switchtec: Fix init_completion race condition with poll_wait()

From: Logan Gunthorpe logang@deltatee.com

[ Upstream commit efbdc769601f4d50018bf7ca50fc9f7c67392ece ]

The call to init_completion() in mrpc_queue_cmd() can theoretically race with the call to poll_wait() in switchtec_dev_poll().

poll() write() switchtec_dev_poll() switchtec_dev_write() poll_wait(&s->comp.wait); mrpc_queue_cmd() init_completion(&s->comp) init_waitqueue_head(&s->comp.wait)

To my knowledge, no one has hit this bug.

Fix this by using reinit_completion() instead of init_completion() in mrpc_queue_cmd().

Fixes: 080b47def5e5 ("MicroSemi Switchtec management interface driver")

Reported-by: Sebastian Andrzej Siewior bigeasy@linutronix.de Signed-off-by: Logan Gunthorpe logang@deltatee.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Acked-by: Bjorn Helgaas bhelgaas@google.com Link: https://lkml.kernel.org/r/20200313183608.2646-1-logang@deltatee.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/pci/switch/switchtec.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c index bf229b442e723..6ef0d4b756f09 100644 --- a/drivers/pci/switch/switchtec.c +++ b/drivers/pci/switch/switchtec.c @@ -412,7 +412,7 @@ static int mrpc_queue_cmd(struct switchtec_user *stuser) kref_get(&stuser->kref); stuser->read_len = sizeof(stuser->data); stuser_set_state(stuser, MRPC_QUEUED); - init_completion(&stuser->comp); + reinit_completion(&stuser->comp); list_add_tail(&stuser->list, &stdev->mrpc_queue);

mrpc_cmd_submit(stdev);

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 10/22] libata: Remove extra scsi_host_put() in ata_scsi_add_hosts()

From: John Garry john.garry@huawei.com

[ Upstream commit 1d72f7aec3595249dbb83291ccac041a2d676c57 ]

If the call to scsi_add_host_with_dma() in ata_scsi_add_hosts() fails, then we may get use-after-free KASAN warns:

================================================================== BUG: KASAN: use-after-free in kobject_put+0x24/0x180 Read of size 1 at addr ffff0026b8c80364 by task swapper/0/1 CPU: 1 PID: 1 Comm: swapper/0 Tainted: G W 5.6.0-rc3-00004-g5a71b206ea82-dirty #1765 Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, BIOS 2280-V2 CS V3.B160.01 02/24/2020 Call trace: dump_backtrace+0x0/0x298 show_stack+0x14/0x20 dump_stack+0x118/0x190 print_address_description.isra.9+0x6c/0x3b8 __kasan_report+0x134/0x23c kasan_report+0xc/0x18 __asan_load1+0x5c/0x68 kobject_put+0x24/0x180 put_device+0x10/0x20 scsi_host_put+0x10/0x18 ata_devres_release+0x74/0xb0 release_nodes+0x2d0/0x470 devres_release_all+0x50/0x78 really_probe+0x2d4/0x560 driver_probe_device+0x7c/0x148 device_driver_attach+0x94/0xa0 __driver_attach+0xa8/0x110 bus_for_each_dev+0xe8/0x158 driver_attach+0x30/0x40 bus_add_driver+0x220/0x2e0 driver_register+0xbc/0x1d0 __pci_register_driver+0xbc/0xd0 ahci_pci_driver_init+0x20/0x28 do_one_initcall+0xf0/0x608 kernel_init_freeable+0x31c/0x384 kernel_init+0x10/0x118 ret_from_fork+0x10/0x18

Allocated by task 5: save_stack+0x28/0xc8 __kasan_kmalloc.isra.8+0xbc/0xd8 kasan_kmalloc+0xc/0x18 __kmalloc+0x1a8/0x280 scsi_host_alloc+0x44/0x678 ata_scsi_add_hosts+0x74/0x268 ata_host_register+0x228/0x488 ahci_host_activate+0x1c4/0x2a8 ahci_init_one+0xd18/0x1298 local_pci_probe+0x74/0xf0 work_for_cpu_fn+0x2c/0x48 process_one_work+0x488/0xc08 worker_thread+0x330/0x5d0 kthread+0x1c8/0x1d0 ret_from_fork+0x10/0x18

Freed by task 5: save_stack+0x28/0xc8 __kasan_slab_free+0x118/0x180 kasan_slab_free+0x10/0x18 slab_free_freelist_hook+0xa4/0x1a0 kfree+0xd4/0x3a0 scsi_host_dev_release+0x100/0x148 device_release+0x7c/0xe0 kobject_put+0xb0/0x180 put_device+0x10/0x20 scsi_host_put+0x10/0x18 ata_scsi_add_hosts+0x210/0x268 ata_host_register+0x228/0x488 ahci_host_activate+0x1c4/0x2a8 ahci_init_one+0xd18/0x1298 local_pci_probe+0x74/0xf0 work_for_cpu_fn+0x2c/0x48 process_one_work+0x488/0xc08 worker_thread+0x330/0x5d0 kthread+0x1c8/0x1d0 ret_from_fork+0x10/0x18

There is also refcount issue, as well: WARNING: CPU: 1 PID: 1 at lib/refcount.c:28 refcount_warn_saturate+0xf8/0x170

The issue is that we make an erroneous extra call to scsi_host_put() for that host:

So in ahci_init_one()->ata_host_alloc_pinfo()->ata_host_alloc(), we setup a device release method - ata_devres_release() - which intends to release the SCSI hosts:

static void ata_devres_release(struct device *gendev, void *res) { ... for (i = 0; i < host->n_ports; i++) { struct ata_port *ap = host->ports[i];

if (!ap) continue;

if (ap->scsi_host) scsi_host_put(ap->scsi_host);

} ... }

However in the ata_scsi_add_hosts() error path, we also call scsi_host_put() for the SCSI hosts.

Fix by removing the the scsi_host_put() calls in ata_scsi_add_hosts() and leave this to ata_devres_release().

Fixes: f31871951b38 ("libata: separate out ata_host_alloc() and ata_host_register()") Signed-off-by: John Garry john.garry@huawei.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/ata/libata-scsi.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index eb0c4ee205258..2f81d65342709 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -4571,22 +4571,19 @@ int ata_scsi_add_hosts(struct ata_host *host, struct scsi_host_template *sht) */ shost->max_host_blocked = 1;

- rc = scsi_add_host_with_dma(ap->scsi_host, - &ap->tdev, ap->host->dev); + rc = scsi_add_host_with_dma(shost, &ap->tdev, ap->host->dev); if (rc) - goto err_add; + goto err_alloc; }

return 0;

- err_add: - scsi_host_put(host->ports[i]->scsi_host); err_alloc: while (--i >= 0) { struct Scsi_Host *shost = host->ports[i]->scsi_host;

+ /* scsi_host_put() is in ata_devres_release() */ scsi_remove_host(shost); - scsi_host_put(shost); } return rc; }

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 11/22] gfs2: Don't demote a glock until its revokes are written

From: Bob Peterson rpeterso@redhat.com

[ Upstream commit df5db5f9ee112e76b5202fbc331f990a0fc316d6 ]

Before this patch, run_queue would demote glocks based on whether there are any more holders. But if the glock has pending revokes that haven't been written to the media, giving up the glock might end in file system corruption if the revokes never get written due to io errors, node crashes and fences, etc. In that case, another node will replay the metadata blocks associated with the glock, but because the revoke was never written, it could replay that block even though the glock had since been granted to another node who might have made changes.

This patch changes the logic in run_queue so that it never demotes a glock until its count of pending revokes reaches zero.

Signed-off-by: Bob Peterson rpeterso@redhat.com Reviewed-by: Andreas Gruenbacher agruenba@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/gfs2/glock.c | 3 +++ 1 file changed, 3 insertions(+)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c index aea1ed0aebd0f..1e2ff4b32c79a 100644 --- a/fs/gfs2/glock.c +++ b/fs/gfs2/glock.c @@ -636,6 +636,9 @@ __acquires(&gl->gl_lockref.lock) goto out_unlock; if (nonblock) goto out_sched; + smp_mb(); + if (atomic_read(&gl->gl_revokes) != 0) + goto out_sched; set_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags); GLOCK_BUG_ON(gl, gl->gl_demote_state == LM_ST_EXCLUSIVE); gl->gl_target = gl->gl_demote_state;

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 12/22] x86/boot: Use unsigned comparison for addresses

From: Arvind Sankar nivedita@alum.mit.edu

[ Upstream commit 81a34892c2c7c809f9c4e22c5ac936ae673fb9a2 ]

The load address is compared with LOAD_PHYSICAL_ADDR using a signed comparison currently (using jge instruction).

When loading a 64-bit kernel using the new efi32_pe_entry() point added by:

97aa276579b2 ("efi/x86: Add true mixed mode entry point into .compat section")

using Qemu with -m 3072, the firmware actually loads us above 2Gb, resulting in a very early crash.

Use the JAE instruction to perform a unsigned comparison instead, as physical addresses should be considered unsigned.

Signed-off-by: Arvind Sankar nivedita@alum.mit.edu Signed-off-by: Ard Biesheuvel ardb@kernel.org Signed-off-by: Ingo Molnar mingo@kernel.org Link: https://lore.kernel.org/r/20200301230436.2246909-6-nivedita@alum.mit.edu Link: https://lore.kernel.org/r/20200308080859.21568-14-ardb@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- arch/x86/boot/compressed/head_32.S | 2 +- arch/x86/boot/compressed/head_64.S | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S index 37380c0d59996..01d628ea34024 100644 --- a/arch/x86/boot/compressed/head_32.S +++ b/arch/x86/boot/compressed/head_32.S @@ -106,7 +106,7 @@ ENTRY(startup_32) notl %eax andl %eax, %ebx cmpl $LOAD_PHYSICAL_ADDR, %ebx - jge 1f + jae 1f #endif movl $LOAD_PHYSICAL_ADDR, %ebx 1: diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index 39fdede523f21..a25127916e679 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -105,7 +105,7 @@ ENTRY(startup_32) notl %eax andl %eax, %ebx cmpl $LOAD_PHYSICAL_ADDR, %ebx - jge 1f + jae 1f #endif movl $LOAD_PHYSICAL_ADDR, %ebx 1: @@ -280,7 +280,7 @@ ENTRY(startup_64) notq %rax andq %rax, %rbp cmpq $LOAD_PHYSICAL_ADDR, %rbp - jge 1f + jae 1f #endif movq $LOAD_PHYSICAL_ADDR, %rbp 1:

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 13/22] efi/x86: Ignore the memory attributes table on i386

From: Ard Biesheuvel ardb@kernel.org

[ Upstream commit dd09fad9d2caad2325a39b766ce9e79cfc690184 ]

Commit:

3a6b6c6fb23667fa ("efi: Make EFI_MEMORY_ATTRIBUTES_TABLE initialization common across all architectures")

moved the call to efi_memattr_init() from ARM specific to the generic EFI init code, in order to be able to apply the restricted permissions described in that table on x86 as well.

We never enabled this feature fully on i386, and so mapping and reserving this table is pointless. However, due to the early call to memblock_reserve(), the memory bookkeeping gets confused to the point where it produces the splat below when we try to map the memory later on:

------------[ cut here ]------------ ioremap on RAM at 0x3f251000 - 0x3fa1afff WARNING: CPU: 0 PID: 0 at arch/x86/mm/ioremap.c:166 __ioremap_caller ... Modules linked in: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.20.0 #48 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 EIP: __ioremap_caller.constprop.0+0x249/0x260 Code: 90 0f b7 05 4e 38 40 de 09 45 e0 e9 09 ff ff ff 90 8d 45 ec c6 05 ... EAX: 00000029 EBX: 00000000 ECX: de59c228 EDX: 00000001 ESI: 3f250fff EDI: 00000000 EBP: de3edf20 ESP: de3edee0 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00200296 CR0: 80050033 CR2: ffd17000 CR3: 1e58c000 CR4: 00040690 Call Trace: ioremap_cache+0xd/0x10 ? old_map_region+0x72/0x9d old_map_region+0x72/0x9d efi_map_region+0x8/0xa efi_enter_virtual_mode+0x260/0x43b start_kernel+0x329/0x3aa i386_start_kernel+0xa7/0xab startup_32_smp+0x164/0x168 ---[ end trace e15ccf6b9f356833 ]---

Let's work around this by disregarding the memory attributes table altogether on i386, which does not result in a loss of functionality or protection, given that we never consumed the contents.

Fixes: 3a6b6c6fb23667fa ("efi: Make EFI_MEMORY_ATTRIBUTES_TABLE ... ") Tested-by: Arvind Sankar nivedita@alum.mit.edu Signed-off-by: Ard Biesheuvel ardb@kernel.org Signed-off-by: Ingo Molnar mingo@kernel.org Link: https://lore.kernel.org/r/20200304165917.5893-1-ardb@kernel.org Link: https://lore.kernel.org/r/20200308080859.21568-21-ardb@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/firmware/efi/efi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index f50072b51aefb..b39b7e6d4e4dc 100644 --- a/drivers/firmware/efi/efi.c +++ b/drivers/firmware/efi/efi.c @@ -550,7 +550,7 @@ int __init efi_config_parse_tables(void *config_tables, int count, int sz, } }

- if (efi_enabled(EFI_MEMMAP)) + if (!IS_ENABLED(CONFIG_X86_32) && efi_enabled(EFI_MEMMAP)) efi_memattr_init();

/* Parse the EFI Properties table if it exists */

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 14/22] genirq/irqdomain: Check pointer in irq_domain_alloc_irqs_hierarchy()

From: Alexander Sverdlin alexander.sverdlin@nokia.com

[ Upstream commit 87f2d1c662fa1761359fdf558246f97e484d177a ]

irq_domain_alloc_irqs_hierarchy() has 3 call sites in the compilation unit but only one of them checks for the pointer which is being dereferenced inside the called function. Move the check into the function. This allows for catching the error instead of the following crash:

Unable to handle kernel NULL pointer dereference at virtual address 00000000 PC is at 0x0 LR is at gpiochip_hierarchy_irq_domain_alloc+0x11f/0x140 ... [<c06c23ff>] (gpiochip_hierarchy_irq_domain_alloc) [<c0462a89>] (__irq_domain_alloc_irqs) [<c0462dad>] (irq_create_fwspec_mapping) [<c06c2251>] (gpiochip_to_irq) [<c06c1c9b>] (gpiod_to_irq) [<bf973073>] (gpio_irqs_init [gpio_irqs]) [<bf974048>] (gpio_irqs_exit+0xecc/0xe84 [gpio_irqs]) Code: bad PC value

Signed-off-by: Alexander Sverdlin alexander.sverdlin@nokia.com Signed-off-by: Thomas Gleixner tglx@linutronix.de Link: https://lkml.kernel.org/r/20200306174720.82604-1-alexander.sverdlin@nokia.co... Signed-off-by: Sasha Levin sashal@kernel.org --- kernel/irq/irqdomain.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index b269ae16b10cd..0d54f8256b9f4 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -1372,6 +1372,11 @@ int irq_domain_alloc_irqs_hierarchy(struct irq_domain *domain, unsigned int irq_base, unsigned int nr_irqs, void *arg) { + if (!domain->ops->alloc) { + pr_debug("domain->ops->alloc() is NULL\n"); + return -ENOSYS; + } + return domain->ops->alloc(domain, irq_base, nr_irqs, arg); }

@@ -1409,11 +1414,6 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base, return -EINVAL; }

- if (!domain->ops->alloc) { - pr_debug("domain->ops->alloc() is NULL\n"); - return -ENOSYS; - } - if (realloc && irq_base >= 0) { virq = irq_base; } else {

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 15/22] block: Fix use-after-free issue accessing struct io_cq

From: Sahitya Tummala stummala@codeaurora.org

[ Upstream commit 30a2da7b7e225ef6c87a660419ea04d3cef3f6a7 ]

There is a potential race between ioc_release_fn() and ioc_clear_queue() as shown below, due to which below kernel crash is observed. It also can result into use-after-free issue.

context#1: context#2: ioc_release_fn() __ioc_clear_queue() gets the same icq ->spin_lock(&ioc->lock); ->spin_lock(&ioc->lock); ->ioc_destroy_icq(icq); ->list_del_init(&icq->q_node); ->call_rcu(&icq->__rcu_head, icq_free_icq_rcu); ->spin_unlock(&ioc->lock); ->ioc_destroy_icq(icq); ->hlist_del_init(&icq->ioc_node); This results into below crash as this memory is now used by icq->__rcu_head in context#1. There is a chance that icq could be free'd as well.

22150.386550: <6> Unable to handle kernel write to read-only memory at virtual address ffffffaa8d31ca50 ... Call trace: 22150.607350: <2> ioc_destroy_icq+0x44/0x110 22150.611202: <2> ioc_clear_queue+0xac/0x148 22150.615056: <2> blk_cleanup_queue+0x11c/0x1a0 22150.619174: <2> __scsi_remove_device+0xdc/0x128 22150.623465: <2> scsi_forget_host+0x2c/0x78 22150.627315: <2> scsi_remove_host+0x7c/0x2a0 22150.631257: <2> usb_stor_disconnect+0x74/0xc8 22150.635371: <2> usb_unbind_interface+0xc8/0x278 22150.639665: <2> device_release_driver_internal+0x198/0x250 22150.644897: <2> device_release_driver+0x24/0x30 22150.649176: <2> bus_remove_device+0xec/0x140 22150.653204: <2> device_del+0x270/0x460 22150.656712: <2> usb_disable_device+0x120/0x390 22150.660918: <2> usb_disconnect+0xf4/0x2e0 22150.664684: <2> hub_event+0xd70/0x17e8 22150.668197: <2> process_one_work+0x210/0x480 22150.672222: <2> worker_thread+0x32c/0x4c8

Fix this by adding a new ICQ_DESTROYED flag in ioc_destroy_icq() to indicate this icq is once marked as destroyed. Also, ensure __ioc_clear_queue() is accessing icq within rcu_read_lock/unlock so that icq doesn't get free'd up while it is still using it.

Signed-off-by: Sahitya Tummala stummala@codeaurora.org Co-developed-by: Pradeep P V K ppvk@codeaurora.org Signed-off-by: Pradeep P V K ppvk@codeaurora.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org --- block/blk-ioc.c | 7 +++++++ include/linux/iocontext.h | 1 + 2 files changed, 8 insertions(+)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c index f23311e4b201f..e56a480b6f929 100644 --- a/block/blk-ioc.c +++ b/block/blk-ioc.c @@ -87,6 +87,7 @@ static void ioc_destroy_icq(struct io_cq *icq) * making it impossible to determine icq_cache. Record it in @icq. */ icq->__rcu_icq_cache = et->icq_cache; + icq->flags |= ICQ_DESTROYED; call_rcu(&icq->__rcu_head, icq_free_icq_rcu); }

@@ -230,15 +231,21 @@ static void __ioc_clear_queue(struct list_head *icq_list) { unsigned long flags;

+ rcu_read_lock(); while (!list_empty(icq_list)) { struct io_cq *icq = list_entry(icq_list->next, struct io_cq, q_node); struct io_context *ioc = icq->ioc;

spin_lock_irqsave(&ioc->lock, flags); + if (icq->flags & ICQ_DESTROYED) { + spin_unlock_irqrestore(&ioc->lock, flags); + continue; + } ioc_destroy_icq(icq); spin_unlock_irqrestore(&ioc->lock, flags); } + rcu_read_unlock(); }

/** diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h index dba15ca8e60bc..1dcd9198beb7f 100644 --- a/include/linux/iocontext.h +++ b/include/linux/iocontext.h @@ -8,6 +8,7 @@

enum { ICQ_EXITED = 1 << 2, + ICQ_DESTROYED = 1 << 3, };

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 16/22] usb: dwc3: core: add support for disabling SS instances in park mode

From: Neil Armstrong narmstrong@baylibre.com

[ Upstream commit 7ba6b09fda5e0cb741ee56f3264665e0edc64822 ]

In certain circumstances, the XHCI SuperSpeed instance in park mode can fail to recover, thus on Amlogic G12A/G12B/SM1 SoCs when there is high load on the single XHCI SuperSpeed instance, the controller can crash like: xhci-hcd xhci-hcd.0.auto: xHCI host not responding to stop endpoint command. xhci-hcd xhci-hcd.0.auto: Host halt failed, -110 xhci-hcd xhci-hcd.0.auto: xHCI host controller not responding, assume dead xhci-hcd xhci-hcd.0.auto: xHCI host not responding to stop endpoint command. hub 2-1.1:1.0: hub_ext_port_status failed (err = -22) xhci-hcd xhci-hcd.0.auto: HC died; cleaning up usb 2-1.1-port1: cannot reset (err = -22)

Setting the PARKMODE_DISABLE_SS bit in the DWC3_USB3_GUCTL1 mitigates the issue. The bit is described as : "When this bit is set to '1' all SS bus instances in park mode are disabled"

Synopsys explains: The GUCTL1.PARKMODE_DISABLE_SS is only available in dwc_usb3 controller running in host mode. This should not be set for other IPs. This can be disabled by default based on IP, but I recommend to have a property to enable this feature for devices that need this.

CC: Dongjin Kim tobetter@gmail.com Cc: Jianxin Pan jianxin.pan@amlogic.com Cc: Thinh Nguyen thinhn@synopsys.com Cc: Jun Li lijun.kernel@gmail.com Reported-by: Tim elatllat@gmail.com Signed-off-by: Neil Armstrong narmstrong@baylibre.com Signed-off-by: Felipe Balbi balbi@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/usb/dwc3/core.c | 5 +++++ drivers/usb/dwc3/core.h | 4 ++++ 2 files changed, 9 insertions(+)

diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c index 021899c580288..010201dbd029a 100644 --- a/drivers/usb/dwc3/core.c +++ b/drivers/usb/dwc3/core.c @@ -867,6 +867,9 @@ static int dwc3_core_init(struct dwc3 *dwc) if (dwc->dis_tx_ipgap_linecheck_quirk) reg |= DWC3_GUCTL1_TX_IPGAP_LINECHECK_DIS;

+ if (dwc->parkmode_disable_ss_quirk) + reg |= DWC3_GUCTL1_PARKMODE_DISABLE_SS; + dwc3_writel(dwc->regs, DWC3_GUCTL1, reg); }

@@ -1107,6 +1110,8 @@ static void dwc3_get_properties(struct dwc3 *dwc) "snps,dis-del-phy-power-chg-quirk"); dwc->dis_tx_ipgap_linecheck_quirk = device_property_read_bool(dev, "snps,dis-tx-ipgap-linecheck-quirk"); + dwc->parkmode_disable_ss_quirk = device_property_read_bool(dev, + "snps,parkmode-disable-ss-quirk");

dwc->tx_de_emphasis_quirk = device_property_read_bool(dev, "snps,tx_de_emphasis_quirk"); diff --git a/drivers/usb/dwc3/core.h b/drivers/usb/dwc3/core.h index 40bf0e0768d96..8747f9f02229e 100644 --- a/drivers/usb/dwc3/core.h +++ b/drivers/usb/dwc3/core.h @@ -206,6 +206,7 @@ #define DWC3_GCTL_DSBLCLKGTNG BIT(0)

/* Global User Control 1 Register */ +#define DWC3_GUCTL1_PARKMODE_DISABLE_SS BIT(17) #define DWC3_GUCTL1_TX_IPGAP_LINECHECK_DIS BIT(28) #define DWC3_GUCTL1_DEV_L1_EXIT_BY_HW BIT(24)

@@ -863,6 +864,8 @@ struct dwc3_scratchpad_array { * change quirk. * @dis_tx_ipgap_linecheck_quirk: set if we disable u2mac linestate * check during HS transmit. + * @parkmode_disable_ss_quirk: set if we need to disable all SuperSpeed + * instances in park mode. * @tx_de_emphasis_quirk: set if we enable Tx de-emphasis quirk * @tx_de_emphasis: Tx de-emphasis value * 0 - -6dB de-emphasis @@ -1022,6 +1025,7 @@ struct dwc3 { unsigned dis_u2_freeclk_exists_quirk:1; unsigned dis_del_phy_power_chg_quirk:1; unsigned dis_tx_ipgap_linecheck_quirk:1; + unsigned parkmode_disable_ss_quirk:1;

unsigned tx_de_emphasis_quirk:1; unsigned tx_de_emphasis:2;

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 17/22] irqchip/gic-v4: Provide irq_retrigger to avoid circular locking dependency

From: Marc Zyngier maz@kernel.org

[ Upstream commit 7809f7011c3bce650e502a98afeb05961470d865 ]

On a very heavily loaded D05 with GICv4, I managed to trigger the following lockdep splat:

[ 6022.598864] ====================================================== [ 6022.605031] WARNING: possible circular locking dependency detected [ 6022.611200] 5.6.0-rc4-00026-geee7c7b0f498 #680 Tainted: G E [ 6022.618061] ------------------------------------------------------ [ 6022.624227] qemu-system-aar/7569 is trying to acquire lock: [ 6022.629789] ffff042f97606808 (&p->pi_lock){-.-.}, at: try_to_wake_up+0x54/0x7a0 [ 6022.637102] [ 6022.637102] but task is already holding lock: [ 6022.642921] ffff002fae424cf0 (&irq_desc_lock_class){-.-.}, at: __irq_get_desc_lock+0x5c/0x98 [ 6022.651350] [ 6022.651350] which lock already depends on the new lock. [ 6022.651350] [ 6022.659512] [ 6022.659512] the existing dependency chain (in reverse order) is: [ 6022.666980] [ 6022.666980] -> #2 (&irq_desc_lock_class){-.-.}: [ 6022.672983] _raw_spin_lock_irqsave+0x50/0x78 [ 6022.677848] __irq_get_desc_lock+0x5c/0x98 [ 6022.682453] irq_set_vcpu_affinity+0x40/0xc0 [ 6022.687236] its_make_vpe_non_resident+0x6c/0xb8 [ 6022.692364] vgic_v4_put+0x54/0x70 [ 6022.696273] vgic_v3_put+0x20/0xd8 [ 6022.700183] kvm_vgic_put+0x30/0x48 [ 6022.704182] kvm_arch_vcpu_put+0x34/0x50 [ 6022.708614] kvm_sched_out+0x34/0x50 [ 6022.712700] __schedule+0x4bc/0x7f8 [ 6022.716697] schedule+0x50/0xd8 [ 6022.720347] kvm_arch_vcpu_ioctl_run+0x5f0/0x978 [ 6022.725473] kvm_vcpu_ioctl+0x3d4/0x8f8 [ 6022.729820] ksys_ioctl+0x90/0xd0 [ 6022.733642] __arm64_sys_ioctl+0x24/0x30 [ 6022.738074] el0_svc_common.constprop.3+0xa8/0x1e8 [ 6022.743373] do_el0_svc+0x28/0x88 [ 6022.747198] el0_svc+0x14/0x40 [ 6022.750761] el0_sync_handler+0x124/0x2b8 [ 6022.755278] el0_sync+0x140/0x180 [ 6022.759100] [ 6022.759100] -> #1 (&rq->lock){-.-.}: [ 6022.764143] _raw_spin_lock+0x38/0x50 [ 6022.768314] task_fork_fair+0x40/0x128 [ 6022.772572] sched_fork+0xe0/0x210 [ 6022.776484] copy_process+0x8c4/0x18d8 [ 6022.780742] _do_fork+0x88/0x6d8 [ 6022.784478] kernel_thread+0x64/0x88 [ 6022.788563] rest_init+0x30/0x270 [ 6022.792390] arch_call_rest_init+0x14/0x1c [ 6022.796995] start_kernel+0x498/0x4c4 [ 6022.801164] [ 6022.801164] -> #0 (&p->pi_lock){-.-.}: [ 6022.806382] __lock_acquire+0xdd8/0x15c8 [ 6022.810813] lock_acquire+0xd0/0x218 [ 6022.814896] _raw_spin_lock_irqsave+0x50/0x78 [ 6022.819761] try_to_wake_up+0x54/0x7a0 [ 6022.824018] wake_up_process+0x1c/0x28 [ 6022.828276] wakeup_softirqd+0x38/0x40 [ 6022.832533] __tasklet_schedule_common+0xc4/0xf0 [ 6022.837658] __tasklet_schedule+0x24/0x30 [ 6022.842176] check_irq_resend+0xc8/0x158 [ 6022.846609] irq_startup+0x74/0x128 [ 6022.850606] __enable_irq+0x6c/0x78 [ 6022.854602] enable_irq+0x54/0xa0 [ 6022.858431] its_make_vpe_non_resident+0xa4/0xb8 [ 6022.863557] vgic_v4_put+0x54/0x70 [ 6022.867469] kvm_arch_vcpu_blocking+0x28/0x38 [ 6022.872336] kvm_vcpu_block+0x48/0x490 [ 6022.876594] kvm_handle_wfx+0x18c/0x310 [ 6022.880938] handle_exit+0x138/0x198 [ 6022.885022] kvm_arch_vcpu_ioctl_run+0x4d4/0x978 [ 6022.890148] kvm_vcpu_ioctl+0x3d4/0x8f8 [ 6022.894494] ksys_ioctl+0x90/0xd0 [ 6022.898317] __arm64_sys_ioctl+0x24/0x30 [ 6022.902748] el0_svc_common.constprop.3+0xa8/0x1e8 [ 6022.908046] do_el0_svc+0x28/0x88 [ 6022.911871] el0_svc+0x14/0x40 [ 6022.915434] el0_sync_handler+0x124/0x2b8 [ 6022.919951] el0_sync+0x140/0x180 [ 6022.923773] [ 6022.923773] other info that might help us debug this: [ 6022.923773] [ 6022.931762] Chain exists of: [ 6022.931762] &p->pi_lock --> &rq->lock --> &irq_desc_lock_class [ 6022.931762] [ 6022.942101] Possible unsafe locking scenario: [ 6022.942101] [ 6022.948007] CPU0 CPU1 [ 6022.952523] ---- ---- [ 6022.957039] lock(&irq_desc_lock_class); [ 6022.961036] lock(&rq->lock); [ 6022.966595] lock(&irq_desc_lock_class); [ 6022.973109] lock(&p->pi_lock); [ 6022.976324] [ 6022.976324] *** DEADLOCK ***

This is happening because we have a pending doorbell that requires retrigger. As SW retriggering is done in a tasklet, we trigger the circular dependency above.

The easy cop-out is to provide a retrigger callback that doesn't require acquiring any extra lock.

Signed-off-by: Marc Zyngier maz@kernel.org Link: https://lore.kernel.org/r/20200310184921.23552-5-maz@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/irqchip/irq-gic-v3-its.c | 6 ++++++ 1 file changed, 6 insertions(+)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c index 799df1e598db3..84b23d902d5b8 100644 --- a/drivers/irqchip/irq-gic-v3-its.c +++ b/drivers/irqchip/irq-gic-v3-its.c @@ -2591,12 +2591,18 @@ static int its_vpe_set_irqchip_state(struct irq_data *d, return 0; }

+static int its_vpe_retrigger(struct irq_data *d) +{ + return !its_vpe_set_irqchip_state(d, IRQCHIP_STATE_PENDING, true); +} + static struct irq_chip its_vpe_irq_chip = { .name = "GICv4-vpe", .irq_mask = its_vpe_mask_irq, .irq_unmask = its_vpe_unmask_irq, .irq_eoi = irq_chip_eoi_parent, .irq_set_affinity = its_vpe_set_affinity, + .irq_retrigger = its_vpe_retrigger, .irq_set_irqchip_state = its_vpe_set_irqchip_state, .irq_set_vcpu_affinity = its_vpe_set_vcpu_affinity, };

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 18/22] locking/lockdep: Avoid recursion in lockdep_count_{for, back}ward_deps()

From: Boqun Feng boqun.feng@gmail.com

[ Upstream commit 25016bd7f4caf5fc983bbab7403d08e64cba3004 ]

Qian Cai reported a bug when PROVE_RCU_LIST=y, and read on /proc/lockdep triggered a warning:

[ ] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled) ... [ ] Call Trace: [ ] lock_is_held_type+0x5d/0x150 [ ] ? rcu_lockdep_current_cpu_online+0x64/0x80 [ ] rcu_read_lock_any_held+0xac/0x100 [ ] ? rcu_read_lock_held+0xc0/0xc0 [ ] ? __slab_free+0x421/0x540 [ ] ? kasan_kmalloc+0x9/0x10 [ ] ? __kmalloc_node+0x1d7/0x320 [ ] ? kvmalloc_node+0x6f/0x80 [ ] __bfs+0x28a/0x3c0 [ ] ? class_equal+0x30/0x30 [ ] lockdep_count_forward_deps+0x11a/0x1a0

The warning got triggered because lockdep_count_forward_deps() call __bfs() without current->lockdep_recursion being set, as a result a lockdep internal function (__bfs()) is checked by lockdep, which is unexpected, and the inconsistency between the irq-off state and the state traced by lockdep caused the warning.

Apart from this warning, lockdep internal functions like __bfs() should always be protected by current->lockdep_recursion to avoid potential deadlocks and data inconsistency, therefore add the current->lockdep_recursion on-and-off section to protect __bfs() in both lockdep_count_forward_deps() and lockdep_count_backward_deps()

Reported-by: Qian Cai cai@lca.pw Signed-off-by: Boqun Feng boqun.feng@gmail.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/20200312151258.128036-1-boqun.feng@gmail.com Signed-off-by: Sasha Levin sashal@kernel.org --- kernel/locking/lockdep.c | 4 ++++ 1 file changed, 4 insertions(+)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 90a3469a7a888..03e3ab61a2edd 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -1297,9 +1297,11 @@ unsigned long lockdep_count_forward_deps(struct lock_class *class) this.class = class;

raw_local_irq_save(flags); + current->lockdep_recursion = 1; arch_spin_lock(&lockdep_lock); ret = __lockdep_count_forward_deps(&this); arch_spin_unlock(&lockdep_lock); + current->lockdep_recursion = 0; raw_local_irq_restore(flags);

return ret; @@ -1324,9 +1326,11 @@ unsigned long lockdep_count_backward_deps(struct lock_class *class) this.class = class;

raw_local_irq_save(flags); + current->lockdep_recursion = 1; arch_spin_lock(&lockdep_lock); ret = __lockdep_count_backward_deps(&this); arch_spin_unlock(&lockdep_lock); + current->lockdep_recursion = 0; raw_local_irq_restore(flags);

return ret;

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 19/22] block, bfq: fix use-after-free in bfq_idle_slice_timer_body

From: Zhiqiang Liu liuzhiqiang26@huawei.com

[ Upstream commit 2f95fa5c955d0a9987ffdc3a095e2f4e62c5f2a9 ]

In bfq_idle_slice_timer func, bfqq = bfqd->in_service_queue is not in bfqd-lock critical section. The bfqq, which is not equal to NULL in bfq_idle_slice_timer, may be freed after passing to bfq_idle_slice_timer_body. So we will access the freed memory.

In addition, considering the bfqq may be in race, we should firstly check whether bfqq is in service before doing something on it in bfq_idle_slice_timer_body func. If the bfqq in race is not in service, it means the bfqq has been expired through __bfq_bfqq_expire func, and wait_request flags has been cleared in __bfq_bfqd_reset_in_service func. So we do not need to re-clear the wait_request of bfqq which is not in service.

KASAN log is given as follows: [13058.354613] ================================================================== [13058.354640] BUG: KASAN: use-after-free in bfq_idle_slice_timer+0xac/0x290 [13058.354644] Read of size 8 at addr ffffa02cf3e63f78 by task fork13/19767 [13058.354646] [13058.354655] CPU: 96 PID: 19767 Comm: fork13 [13058.354661] Call trace: [13058.354667] dump_backtrace+0x0/0x310 [13058.354672] show_stack+0x28/0x38 [13058.354681] dump_stack+0xd8/0x108 [13058.354687] print_address_description+0x68/0x2d0 [13058.354690] kasan_report+0x124/0x2e0 [13058.354697] __asan_load8+0x88/0xb0 [13058.354702] bfq_idle_slice_timer+0xac/0x290 [13058.354707] __hrtimer_run_queues+0x298/0x8b8 [13058.354710] hrtimer_interrupt+0x1b8/0x678 [13058.354716] arch_timer_handler_phys+0x4c/0x78 [13058.354722] handle_percpu_devid_irq+0xf0/0x558 [13058.354731] generic_handle_irq+0x50/0x70 [13058.354735] __handle_domain_irq+0x94/0x110 [13058.354739] gic_handle_irq+0x8c/0x1b0 [13058.354742] el1_irq+0xb8/0x140 [13058.354748] do_wp_page+0x260/0xe28 [13058.354752] __handle_mm_fault+0x8ec/0x9b0 [13058.354756] handle_mm_fault+0x280/0x460 [13058.354762] do_page_fault+0x3ec/0x890 [13058.354765] do_mem_abort+0xc0/0x1b0 [13058.354768] el0_da+0x24/0x28 [13058.354770] [13058.354773] Allocated by task 19731: [13058.354780] kasan_kmalloc+0xe0/0x190 [13058.354784] kasan_slab_alloc+0x14/0x20 [13058.354788] kmem_cache_alloc_node+0x130/0x440 [13058.354793] bfq_get_queue+0x138/0x858 [13058.354797] bfq_get_bfqq_handle_split+0xd4/0x328 [13058.354801] bfq_init_rq+0x1f4/0x1180 [13058.354806] bfq_insert_requests+0x264/0x1c98 [13058.354811] blk_mq_sched_insert_requests+0x1c4/0x488 [13058.354818] blk_mq_flush_plug_list+0x2d4/0x6e0 [13058.354826] blk_flush_plug_list+0x230/0x548 [13058.354830] blk_finish_plug+0x60/0x80 [13058.354838] read_pages+0xec/0x2c0 [13058.354842] __do_page_cache_readahead+0x374/0x438 [13058.354846] ondemand_readahead+0x24c/0x6b0 [13058.354851] page_cache_sync_readahead+0x17c/0x2f8 [13058.354858] generic_file_buffered_read+0x588/0xc58 [13058.354862] generic_file_read_iter+0x1b4/0x278 [13058.354965] ext4_file_read_iter+0xa8/0x1d8 [ext4] [13058.354972] __vfs_read+0x238/0x320 [13058.354976] vfs_read+0xbc/0x1c0 [13058.354980] ksys_read+0xdc/0x1b8 [13058.354984] __arm64_sys_read+0x50/0x60 [13058.354990] el0_svc_common+0xb4/0x1d8 [13058.354994] el0_svc_handler+0x50/0xa8 [13058.354998] el0_svc+0x8/0xc [13058.354999] [13058.355001] Freed by task 19731: [13058.355007] __kasan_slab_free+0x120/0x228 [13058.355010] kasan_slab_free+0x10/0x18 [13058.355014] kmem_cache_free+0x288/0x3f0 [13058.355018] bfq_put_queue+0x134/0x208 [13058.355022] bfq_exit_icq_bfqq+0x164/0x348 [13058.355026] bfq_exit_icq+0x28/0x40 [13058.355030] ioc_exit_icq+0xa0/0x150 [13058.355035] put_io_context_active+0x250/0x438 [13058.355038] exit_io_context+0xd0/0x138 [13058.355045] do_exit+0x734/0xc58 [13058.355050] do_group_exit+0x78/0x220 [13058.355054] __wake_up_parent+0x0/0x50 [13058.355058] el0_svc_common+0xb4/0x1d8 [13058.355062] el0_svc_handler+0x50/0xa8 [13058.355066] el0_svc+0x8/0xc [13058.355067] [13058.355071] The buggy address belongs to the object at ffffa02cf3e63e70#012 which belongs to the cache bfq_queue of size 464 [13058.355075] The buggy address is located 264 bytes inside of#012 464-byte region [ffffa02cf3e63e70, ffffa02cf3e64040) [13058.355077] The buggy address belongs to the page: [13058.355083] page:ffff7e80b3cf9800 count:1 mapcount:0 mapping:ffff802db5c90780 index:0xffffa02cf3e606f0 compound_mapcount: 0 [13058.366175] flags: 0x2ffffe0000008100(slab|head) [13058.370781] raw: 2ffffe0000008100 ffff7e80b53b1408 ffffa02d730c1c90 ffff802db5c90780 [13058.370787] raw: ffffa02cf3e606f0 0000000000370023 00000001ffffffff 0000000000000000 [13058.370789] page dumped because: kasan: bad access detected [13058.370791] [13058.370792] Memory state around the buggy address: [13058.370797] ffffa02cf3e63e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb [13058.370801] ffffa02cf3e63e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [13058.370805] >ffffa02cf3e63f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [13058.370808] ^ [13058.370811] ffffa02cf3e63f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [13058.370815] ffffa02cf3e64000: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc [13058.370817] ================================================================== [13058.370820] Disabling lock debugging due to kernel taint

Here, we directly pass the bfqd to bfq_idle_slice_timer_body func. -- V2->V3: rewrite the comment as suggested by Paolo Valente V1->V2: add one comment, and add Fixes and Reported-by tag.

Fixes: aee69d78d ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Acked-by: Paolo Valente paolo.valente@linaro.org Reported-by: Wang Wang wangwang2@huawei.com Signed-off-by: Zhiqiang Liu liuzhiqiang26@huawei.com Signed-off-by: Feilong Lin linfeilong@huawei.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Sasha Levin sashal@kernel.org --- block/bfq-iosched.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 93863c6173e66..959bee9fa9118 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -4541,20 +4541,28 @@ static void bfq_prepare_request(struct request *rq, struct bio *bio) spin_unlock_irq(&bfqd->lock); }

-static void bfq_idle_slice_timer_body(struct bfq_queue *bfqq) +static void +bfq_idle_slice_timer_body(struct bfq_data *bfqd, struct bfq_queue *bfqq) { - struct bfq_data *bfqd = bfqq->bfqd; enum bfqq_expiration reason; unsigned long flags;

spin_lock_irqsave(&bfqd->lock, flags); - bfq_clear_bfqq_wait_request(bfqq);

+ /* + * Considering that bfqq may be in race, we should firstly check + * whether bfqq is in service before doing something on it. If + * the bfqq in race is not in service, it has already been expired + * through __bfq_bfqq_expire func and its wait_request flags has + * been cleared in __bfq_bfqd_reset_in_service func. + */ if (bfqq != bfqd->in_service_queue) { spin_unlock_irqrestore(&bfqd->lock, flags); return; }

+ bfq_clear_bfqq_wait_request(bfqq); + if (bfq_bfqq_budget_timeout(bfqq)) /* * Also here the queue can be safely expired @@ -4599,7 +4607,7 @@ static enum hrtimer_restart bfq_idle_slice_timer(struct hrtimer *timer) * early. */ if (bfqq) - bfq_idle_slice_timer_body(bfqq); + bfq_idle_slice_timer_body(bfqd, bfqq);

return HRTIMER_NORESTART; }

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 20/22] btrfs: hold a ref on the root in btrfs_recover_relocation

From: Josef Bacik josef@toxicpanda.com

[ Upstream commit 932fd26df8125a5b14438563c4d3e33f59ba80f7 ]

We look up the fs root in various places in here when recovering from a crashed relcoation. Make sure we hold a ref on the root whenever we look them up.

Signed-off-by: Josef Bacik josef@toxicpanda.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/btrfs/relocation.c | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index d4c00edd16d2b..ef83fb0ffc784 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -4541,6 +4541,12 @@ int btrfs_recover_relocation(struct btrfs_root *root) err = ret; goto out; } + } else { + if (!btrfs_grab_fs_root(fs_root)) { + err = -ENOENT; + goto out; + } + btrfs_put_fs_root(fs_root); } }

@@ -4590,10 +4596,15 @@ int btrfs_recover_relocation(struct btrfs_root *root) list_add_tail(&reloc_root->root_list, &reloc_roots); goto out_free; } + if (!btrfs_grab_fs_root(fs_root)) { + err = -ENOENT; + goto out_free; + }

err = __add_reloc_root(reloc_root); BUG_ON(err < 0); /* -ENOMEM or logic error */ fs_root->reloc_root = reloc_root; + btrfs_put_fs_root(fs_root); }

err = btrfs_commit_transaction(trans); @@ -4621,10 +4632,14 @@ int btrfs_recover_relocation(struct btrfs_root *root) if (err == 0) { /* cleanup orphan inode in data relocation tree */ fs_root = read_fs_root(fs_info, BTRFS_DATA_RELOC_TREE_OBJECTID); - if (IS_ERR(fs_root)) + if (IS_ERR(fs_root)) { err = PTR_ERR(fs_root); - else - err = btrfs_orphan_cleanup(fs_root); + } else { + if (btrfs_grab_fs_root(fs_root)) { + err = btrfs_orphan_cleanup(fs_root); + btrfs_put_fs_root(fs_root); + } + } } return err; }

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 21/22] btrfs: remove a BUG_ON() from merge_reloc_roots()

From: Josef Bacik josef@toxicpanda.com

[ Upstream commit 7b7b74315b24dc064bc1c683659061c3d48f8668 ]

This was pretty subtle, we default to reloc roots having 0 root refs, so if we crash in the middle of the relocation they can just be deleted. If we successfully complete the relocation operations we'll set our root refs to 1 in prepare_to_merge() and then go on to merge_reloc_roots().

At prepare_to_merge() time if any of the reloc roots have a 0 reference still, we will remove that reloc root from our reloc root rb tree, and then clean it up later.

However this only happens if we successfully start a transaction. If we've aborted previously we will skip this step completely, and only have reloc roots with a reference count of 0, but were never properly removed from the reloc control's rb tree.

This isn't a problem per-se, our references are held by the list the reloc roots are on, and by the original root the reloc root belongs to. If we end up in this situation all the reloc roots will be added to the dirty_reloc_list, and then properly dropped at that point. The reloc control will be free'd and the rb tree is no longer used.

There were two options when fixing this, one was to remove the BUG_ON(), the other was to make prepare_to_merge() handle the case where we couldn't start a trans handle.

IMO this is the cleaner solution. I started with handling the error in prepare_to_merge(), but it turned out super ugly. And in the end this BUG_ON() simply doesn't matter, the cleanup was happening properly, we were just panicing because this BUG_ON() only matters in the success case. So I've opted to just remove it and add a comment where it was.

Reviewed-by: Qu Wenruo wqu@suse.com Signed-off-by: Josef Bacik josef@toxicpanda.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/btrfs/relocation.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index ef83fb0ffc784..3bb89d09d128b 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -2480,7 +2480,21 @@ void merge_reloc_roots(struct reloc_control *rc) free_reloc_roots(&reloc_roots); }

- BUG_ON(!RB_EMPTY_ROOT(&rc->reloc_root_tree.rb_root)); + /* + * We used to have + * + * BUG_ON(!RB_EMPTY_ROOT(&rc->reloc_root_tree.rb_root)); + * + * here, but it's wrong. If we fail to start the transaction in + * prepare_to_merge() we will have only 0 ref reloc roots, none of which + * have actually been removed from the reloc_root_tree rb tree. This is + * fine because we're bailing here, and we hold a reference on the root + * for the list that holds it, so these roots will be cleaned up when we + * do the reloc_dirty_list afterwards. Meanwhile the root->reloc_root + * will be cleaned up on unmount. + * + * The remaining nodes will be cleaned up by free_reloc_control. + */ }

static void free_block_list(struct rb_root *blocks)

-- 2.20.1

Sasha Levin

3:50 a.m.

New subject: [PATCH AUTOSEL 4.14 22/22] btrfs: track reloc roots based on their commit root bytenr

From: Josef Bacik josef@toxicpanda.com

[ Upstream commit ea287ab157c2816bf12aad4cece41372f9d146b4 ]

We always search the commit root of the extent tree for looking up back references, however we track the reloc roots based on their current bytenr.

This is wrong, if we commit the transaction between relocating tree blocks we could end up in this code in build_backref_tree

if (key.objectid == key.offset) { /* * Only root blocks of reloc trees use backref * pointing to itself. */ root = find_reloc_root(rc, cur->bytenr); ASSERT(root); cur->root = root; break; }

find_reloc_root() is looking based on the bytenr we had in the commit root, but if we've COWed this reloc root we will not find that bytenr, and we will trip over the ASSERT(root).

Fix this by using the commit_root->start bytenr for indexing the commit root. Then we change the __update_reloc_root() caller to be used when we switch the commit root for the reloc root during commit.

This fixes the panic I was seeing when we started throttling relocation for delayed refs.

Signed-off-by: Josef Bacik josef@toxicpanda.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/btrfs/relocation.c | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index 3bb89d09d128b..ba54bc43b1215 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -1306,7 +1306,7 @@ static int __must_check __add_reloc_root(struct btrfs_root *root) if (!node) return -ENOMEM;

- node->bytenr = root->node->start; + node->bytenr = root->commit_root->start; node->data = root;

spin_lock(&rc->reloc_root_tree.lock); @@ -1337,10 +1337,11 @@ static void __del_reloc_root(struct btrfs_root *root) if (rc && root->node) { spin_lock(&rc->reloc_root_tree.lock); rb_node = tree_search(&rc->reloc_root_tree.rb_root, - root->node->start); + root->commit_root->start); if (rb_node) { node = rb_entry(rb_node, struct mapping_node, rb_node); rb_erase(&node->rb_node, &rc->reloc_root_tree.rb_root); + RB_CLEAR_NODE(&node->rb_node); } spin_unlock(&rc->reloc_root_tree.lock); if (!node) @@ -1358,7 +1359,7 @@ static void __del_reloc_root(struct btrfs_root *root) * helper to update the 'address of tree root -> reloc tree' * mapping */ -static int __update_reloc_root(struct btrfs_root *root, u64 new_bytenr) +static int __update_reloc_root(struct btrfs_root *root) { struct btrfs_fs_info *fs_info = root->fs_info; struct rb_node *rb_node; @@ -1367,7 +1368,7 @@ static int __update_reloc_root(struct btrfs_root *root, u64 new_bytenr)

spin_lock(&rc->reloc_root_tree.lock); rb_node = tree_search(&rc->reloc_root_tree.rb_root, - root->node->start); + root->commit_root->start); if (rb_node) { node = rb_entry(rb_node, struct mapping_node, rb_node); rb_erase(&node->rb_node, &rc->reloc_root_tree.rb_root); @@ -1379,7 +1380,7 @@ static int __update_reloc_root(struct btrfs_root *root, u64 new_bytenr) BUG_ON((struct btrfs_root *)node->data != root);

spin_lock(&rc->reloc_root_tree.lock); - node->bytenr = new_bytenr; + node->bytenr = root->node->start; rb_node = tree_insert(&rc->reloc_root_tree.rb_root, node->bytenr, &node->rb_node); spin_unlock(&rc->reloc_root_tree.lock); @@ -1524,6 +1525,7 @@ int btrfs_update_reloc_root(struct btrfs_trans_handle *trans, }

if (reloc_root->commit_root != reloc_root->node) { + __update_reloc_root(reloc_root); btrfs_set_root_node(root_item, reloc_root->node); free_extent_buffer(reloc_root->commit_root); reloc_root->commit_root = btrfs_root_node(reloc_root); @@ -4727,11 +4729,6 @@ int btrfs_reloc_cow_block(struct btrfs_trans_handle *trans, BUG_ON(rc->stage == UPDATE_DATA_PTRS && root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID);

- if (root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) { - if (buf == root->node) - __update_reloc_root(root, cow->start); - } - level = btrfs_header_level(buf); if (btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item))

-- 2.20.1

2097

days inactive

2097

days old

linux-stable-mirror@lists.linaro.org

21 comments

participants

tags (0)

participants (1)

Sasha Levin