December 2018 - Linux-stable-mirror

[PATCH 1/1] xhci: Don't prevent USB2 bus suspend in state check intended for USB3 only

by Mathias Nyman

The code to prevent a bus suspend if a USB3 port was still in link training
also reacted to USB2 port polling state.
This caused bus suspend to busyloop in some cases.
USB2 polling state is different from USB3, and should not prevent bus suspend.

Limit the USB3 link training state check to USB3 root hub ports only.
The original commit went to stable, so this needs to be applied there as well.

Fixes: 2f31a67f01a8 ("usb: xhci: Prevent bus suspend if a port connect change or polling state is detected")
Cc: stable(a)vger.kernel.org
Signed-off-by: Mathias Nyman <mathias.nyman(a)linux.intel.com>
---
 drivers/usb/host/xhci-hub.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c
index 94aca1b..01b5818 100644
--- a/drivers/usb/host/xhci-hub.c
+++ b/drivers/usb/host/xhci-hub.c
@@ -1507,7 +1507,8 @@ int xhci_bus_suspend(struct usb_hcd *hcd)
 		portsc_buf[port_index] = 0;
 
 		/* Bail out if a USB3 port has a new device in link training */
-		if ((t1 & PORT_PLS_MASK) == XDEV_POLLING) {
+		if ((hcd->speed >= HCD_USB3) &&
+		    (t1 & PORT_PLS_MASK) == XDEV_POLLING) {
 			bus_state->bus_suspended = 0;
 			spin_unlock_irqrestore(&xhci->lock, flags);
 			xhci_dbg(xhci, "Bus suspend bailout, port in polling\n");
-- 
2.7.4
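
The effect of the changed condition can be sketched in userspace C. This is only an illustration, not driver code: port_blocks_bus_suspend() is a hypothetical helper, and the constant values are assumed to mirror the kernel headers so that the sketch compiles on its own.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: these values mirror the kernel's xhci.h and hcd.h. They are
 * repeated here only to keep the sketch self-contained.
 */
#define PORT_PLS_MASK	(0xfu << 5)	/* port link state field of PORTSC */
#define XDEV_POLLING	(0x7u << 5)	/* link state "polling" */
#define HCD_USB2	0x0020
#define HCD_USB3	0x0040

/*
 * Hypothetical helper: returns true if this port should make bus suspend
 * bail out. With the patch, the polling state only counts on a USB3 (or
 * faster) root hub; a USB2 port in polling state no longer blocks suspend.
 */
static bool port_blocks_bus_suspend(int hcd_speed, uint32_t portsc)
{
	return hcd_speed >= HCD_USB3 &&
	       (portsc & PORT_PLS_MASK) == XDEV_POLLING;
}
```

Before the patch the speed test was absent, so the USB2 case in this model would also have returned true.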


v4.19.9 build: 0 failures 3 warnings (v4.19.9)

by Build bot for Mark Brown

Tree/Branch: v4.19.9
Git describe: v4.19.9
Commit: be53d23e68 Linux 4.19.9

Build Time: 124 min 18 sec

Passed:   11 / 11   (100.00 %)
Failed:    0 / 11   (  0.00 %)

Errors: 0
Warnings: 3
Section Mismatches: 0

-------------------------------------------------------------------------------
defconfigs with issues (other than build errors):
      1 warnings    0 mismatches  : arm64-allmodconfig
      1 warnings    0 mismatches  : arm-allmodconfig
      1 warnings    0 mismatches  : arm-multi_v5_defconfig

-------------------------------------------------------------------------------

Warnings Summary: 3
	  1 ../drivers/staging/erofs/unzip_vle.c:186:29: warning: array subscript is above array bounds [-Warray-bounds]
	  1 ../drivers/isdn/hardware/eicon/message.c:5985:1: warning: the frame size of 2064 bytes is larger than 2048 bytes [-Wframe-larger-than=]
	  1 ../drivers/i2c/busses/i2c-aspeed.c:567:1: warning: label 'out' defined but not used [-Wunused-label]

===============================================================================
Detailed per-defconfig build reports below:

-------------------------------------------------------------------------------
arm64-allmodconfig : PASS, 0 errors, 1 warnings, 0 section mismatches

Warnings:
	../drivers/isdn/hardware/eicon/message.c:5985:1: warning: the frame size of 2064 bytes is larger than 2048 bytes [-Wframe-larger-than=]

-------------------------------------------------------------------------------
arm-allmodconfig : PASS, 0 errors, 1 warnings, 0 section mismatches

Warnings:
	../drivers/staging/erofs/unzip_vle.c:186:29: warning: array subscript is above array bounds [-Warray-bounds]

-------------------------------------------------------------------------------
arm-multi_v5_defconfig : PASS, 0 errors, 1 warnings, 0 section mismatches

Warnings:
	../drivers/i2c/busses/i2c-aspeed.c:567:1: warning: label 'out' defined but not used [-Wunused-label]
-------------------------------------------------------------------------------

Passed with no errors, warnings or mismatches:

arm64-allnoconfig
arm-multi_v7_defconfig
x86_64-defconfig
arm-allnoconfig
x86_64-allnoconfig
arm-multi_v4t_defconfig
x86_64-allmodconfig
arm64-defconfig


[PATCH v3] mm, memcg: fix reclaim deadlock with writeback

by Michal Hocko

From: Michal Hocko <mhocko(a)suse.com>

Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback

task1:
[<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
[<ffffffff811c5777>] shrink_page_list+0x907/0x960
[<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
[<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
[<ffffffff811c70a8>] shrink_node+0xd8/0x300
[<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
[<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
[<ffffffff8122df2d>] try_charge+0x14d/0x720
[<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
[<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
[<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
[<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
[<ffffffff81074247>] pte_alloc_one+0x17/0x40
[<ffffffff811e34de>] __pte_alloc+0x1e/0x110
[<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
[<ffffffff811e5d93>] do_fault+0x103/0x970
[<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
[<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
[<ffffffff8106ecb0>] do_page_fault+0x30/0x80
[<ffffffff8171bce8>] page_fault+0x28/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

task2:
[<ffffffff811aadc6>] __lock_page+0x86/0xa0
[<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
[<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
[<ffffffff811bbede>] do_writepages+0x1e/0x30
[<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
[<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
[<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
[<ffffffff81273568>] wb_writeback+0x268/0x300
[<ffffffff81273d24>] wb_workfn+0xb4/0x390
[<ffffffff810a2f19>] process_one_work+0x189/0x420
[<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
[<ffffffff810a9786>] kthread+0xe6/0x100
[<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
[<ffffffffffffffff>] 0xffffffffffffffff

He adds
: task1 is waiting for the PageWriteback bit of the page that task2 has
: collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
: bit the page which tasks1 has locked.

More precisely task1 is handling a page fault and it has a page locked
while it charges a new page table to a memcg. That in turn hits a memory
limit reclaim and the memcg reclaim for legacy controller is waiting on
the writeback but that is never going to finish because the writeback
itself is waiting for the page locked in the #PF path. So this is
essentially ABBA deadlock:
					lock_page(A)
					SetPageWriteback(A)
					unlock_page(A)
lock_page(B)
					lock_page(B)
pte_alloc_one
  shrink_page_list
    wait_on_page_writeback(A)
					SetPageWriteback(B)
					unlock_page(B)

					# flush A, B to clear the writeback

This accumulating of more pages to flush is used by several filesystems
to generate more optimal IO patterns.

Waiting for the writeback in legacy memcg controller is a workaround for
premature OOM killer invocations because there is no dirty IO throttling
available for the controller. There is no easy way around that
unfortunately. Therefore fix this specific issue by pre-allocating the
page table outside of the page lock. We have that handy infrastructure
for that already so simply reuse the fault-around pattern which already
does this.

There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
from under a fs page locked but they should be really rare. I am not
aware of a better solution unfortunately.

Reported-and-Debugged-by: Liu Bo <bo.liu(a)linux.alibaba.com>
Cc: stable
Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
---
 mm/memory.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 4ad2d293ddc2..bb78e90a9b70 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback.
+	 */
+	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
+		if (!vmf->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
+
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))
-- 
2.19.2
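
The idea of the fix, allocating before the lock is taken so that no allocation (and therefore no reclaim) can happen while the page is locked, can be modelled in a few lines of userspace C. This is a simplified sketch under stated assumptions: struct fault_ctx, alloc_table() and handle_fault() are hypothetical names, malloc() stands in for the page table allocation, and a boolean stands in for the page lock.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of the fault state; not the kernel's struct vm_fault. */
struct fault_ctx {
	void *prealloc_pte;	/* table allocated ahead of time, or NULL */
	bool page_locked;	/* models holding the page lock */
	bool alloc_under_lock;	/* set if an allocation ever ran under the lock */
};

/* Stand-in for the page table allocation; records lock-order violations. */
static void *alloc_table(struct fault_ctx *ctx)
{
	if (ctx->page_locked)
		ctx->alloc_under_lock = true;
	return malloc(64);
}

/* Returns 0 on success, -1 if the preallocation fails. */
static int handle_fault(struct fault_ctx *ctx)
{
	void *pte;

	/* Allocate first, while no page lock is held. */
	if (!ctx->prealloc_pte) {
		ctx->prealloc_pte = alloc_table(ctx);
		if (!ctx->prealloc_pte)
			return -1;
	}

	/* The fault handler hands back a locked page. */
	ctx->page_locked = true;

	/* Consume the preallocated table; no allocation happens here. */
	pte = ctx->prealloc_pte;
	ctx->prealloc_pte = NULL;

	ctx->page_locked = false;
	free(pte);
	return 0;
}
```

In this model the alloc_under_lock flag stays false because the only call to alloc_table() precedes the point where the lock is taken, which is the ordering the patch establishes in __do_fault().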


[PATCH 00/17] Backport rt/deadline crash and the arduous story of FUTEX_UNLOCK_PI to 4.4

by Henrik Austad

From: Henrik Austad <haustad(a)cisco.com>

Short story:

The following patches are needed on a 4.4 kernel to avoid an Oops in the
scheduler when a sched_rr and a sched_deadline task contend on the same
futex (with PI).

Longer story:

On one of our arm64 systems, we occasionally crash with an Oops in the
scheduler with the following backtrace.

[<ffffffc0000ee398>] enqueue_task_dl+0x1f0/0x420
[<ffffffc0000d0f14>] activate_task+0x7c/0x90
[<ffffffc0000edbdc>] push_dl_task+0x164/0x1c8
[<ffffffc0000edc60>] push_dl_tasks+0x20/0x30
[<ffffffc0000cc00c>] __balance_callback+0x44/0x68
[<ffffffc000d2c018>] __schedule+0x6f0/0x728
[<ffffffc000d2c278>] schedule+0x78/0x98
[<ffffffc000d2e76c>] __rt_mutex_slowlock+0x9c/0x108
[<ffffffc000d2e9d0>] rt_mutex_slowlock+0xd8/0x198
[<ffffffc0000f7f28>] rt_mutex_timed_futex_lock+0x30/0x40
[<ffffffc00012c1a8>] futex_lock_pi+0x200/0x3b0
[<ffffffc00012cf84>] do_futex+0x1c4/0x550
[<ffffffc00012d92c>] compat_SyS_futex+0x10c/0x138
[<ffffffc00008504c>] __sys_trace_return+0x0/0x4

This seems to be the same bug Xunlei Pang triggered and fixed in
e96a7705e7d3 "sched/rtmutex/deadline: Fix a PI crash for deadline
tasks". As noted by Peter Zijlstra in the previous attempt, this fix
requires a few other patches, most notably the FUTEX_UNLOCK_PI series [1]

Testing this on a dual-core VM I have not been able to reproduce the same
crash, but pi_stress (part of the rt-tests suite) reveals that vanilla
4.4.162 behaves rather badly with a mix of deadline and sched_(rr|fifo)
tasks:

time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=100000,deadline=200000,period=200000
Starting PI Stress Test
Number of thread groups: 1
Duration of test run: infinite
Number of inversions per group: unlimited
     Admin thread SCHED_RR priority 4
1 groups of 3 threads will be created
      High thread SCHED_DEADLINE runtime 100000 deadline 200000 period 200000
       Med thread SCHED_RR priority 2
       Low thread SCHED_RR priority 1
Current Inversions: 141627
WATCHDOG triggered: group 0 is deadlocked!
reporter stopping due to watchdog event
Stopping test
Terminated

real    0m26.291s
user    0m0.148s
sys     0m18.819s

With this series applied, the test ran for ~4.5 hours and again for 129
minutes (when I remembered to time it) before crashing:

time pi_stress --rr --mlockall --sched id=high,policy=deadline,runtime=100000,deadline=200000,period=200000
Starting PI Stress Test
Number of thread groups: 1
Duration of test run: infinite
Number of inversions per group: unlimited
     Admin thread SCHED_RR priority 4
1 groups of 3 threads will be created
      High thread SCHED_DEADLINE runtime 100000 deadline 200000 period 200000
       Med thread SCHED_RR priority 2
       Low thread SCHED_RR priority 1
Current Inversions: 51985223
WATCHDOG triggered: group 0 is deadlocked!
reporter stopping due to watchdog event
Stopping test
Terminated

real    129m38.807s
user    0m59.084s
sys     109m53.666s

So clearly not perfect, but a *lot* better. The same series on our
vendor-4.4 kernel moves pi_stress up from ~30 seconds before deadlock up
to the same level as the VM (the test is still going as of this writing).

I suspect other users of 4.4 would benefit from having these patches
backported, so tag them for stable. I assume 4.9 and 4.14 could benefit
as well, but I have not had time to look into those.

1) https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1359667.html

Peter Zijlstra (13):
  futex: Cleanup variable names for futex_top_waiter()
  futex: Use smp_store_release() in mark_wake_futex()
  futex: Remove rt_mutex_deadlock_account_*()
  futex,rt_mutex: Provide futex specific rt_mutex API
  futex: Change locking rules
  futex: Cleanup refcounting
  futex: Rework inconsistent rt_mutex/futex_q state
  futex: Pull rt_mutex_futex_unlock() out from under hb->lock
  futex,rt_mutex: Introduce rt_mutex_init_waiter()
  futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
  futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
  futex: Futex_unlock_pi() determinism
  futex: Drop hb->lock before enqueueing on the rtmutex

Thomas Gleixner (2):
  rtmutex: Make wait_lock irq safe
  futex: Rename free_pi_state() to put_pi_state()

Xunlei Pang (2):
  rtmutex: Deboost before waking up the top waiter
  sched/rtmutex/deadline: Fix a PI crash for deadline tasks

 include/linux/init_task.h       |   1 +
 include/linux/sched.h           |   2 +
 include/linux/sched/rt.h        |   1 +
 kernel/fork.c                   |   1 +
 kernel/futex.c                  | 532 ++++++++++++++++++++++++++--------------
 kernel/locking/rtmutex-debug.c  |   9 -
 kernel/locking/rtmutex-debug.h  |   3 -
 kernel/locking/rtmutex.c        | 406 ++++++++++++++++++------------
 kernel/locking/rtmutex.h        |   2 -
 kernel/locking/rtmutex_common.h |  24 +-
 kernel/sched/core.c             |   2 +
 11 files changed, 620 insertions(+), 363 deletions(-)

-- 
2.7.4


[PATCH 1/1] bcache: set max writeback rate when I/O request is idle

by Kai Krakow

From: Coly Li <colyli(a)suse.de>

Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
allows the writeback rate to be faster if there is no I/O request on a
bcache device. It works well if there is only one bcache device attached
to the cache set. If there are many bcache devices attached to a cache
set, it may introduce a performance regression because multiple faster
writeback threads of the idle bcache devices will compete for the btree
level locks with the bcache devices that have I/O requests coming.

This patch fixes the above issue by only permitting fast writeback when
all bcache devices attached to the cache set are idle. If one of the
bcache devices has a new I/O request coming, it minimizes all writeback
throughput immediately and lets the PI controller
__update_writeback_rate() decide the upcoming writeback rate for each
bcache device.

Also, when all bcache devices are idle, limiting the writeback rate to a
small number is a waste of throughput, especially when the backing
devices are slower non-rotational devices (e.g. SATA SSD). This patch
sets a max writeback rate for each backing device if the whole cache set
is idle. A faster writeback rate in idle time means new I/Os may have
more available space for dirty data, and people may observe better write
performance then.

Please note bcache may change its cache mode at run time, and this patch
still works if the cache mode is switched from writeback mode and there
is still dirty data on cache.
Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") Cc: stable(a)vger.kernel.org #4.16+ Signed-off-by: Coly Li <colyli(a)suse.de> Tested-by: Kai Krakow <kai(a)kaishome.de> Tested-by: Stefan Priebe <s.priebe(a)profihost.ag> Cc: Michael Lyle <mlyle(a)lyle.org> Signed-off-by: Jens Axboe <axboe(a)kernel.dk> (cherry picked from commit ea8c5356d39048bc94bae068228f51ddbecc6b89) Signed-off-by: Kai Krakow <kai(a)kaishome.de> --- drivers/md/bcache/bcache.h | 10 ++--- drivers/md/bcache/request.c | 54 ++++++++++++++++++++++++- drivers/md/bcache/super.c | 4 ++ drivers/md/bcache/sysfs.c | 14 +++++-- drivers/md/bcache/util.c | 2 +- drivers/md/bcache/util.h | 2 +- drivers/md/bcache/writeback.c | 91 +++++++++++++++++++++++++++++-------------- 7 files changed, 133 insertions(+), 44 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index d6bf294f3907..6ba41887664a 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -328,13 +328,6 @@ struct cached_dev { */ atomic_t has_dirty; - /* - * Set to zero by things that touch the backing volume-- except - * writeback. Incremented by writeback. Used to determine when to - * accelerate idle writeback. 
- */ - atomic_t backing_idle; - struct bch_ratelimit writeback_rate; struct delayed_work writeback_rate_update; @@ -514,6 +507,8 @@ struct cache_set { struct cache_accounting accounting; unsigned long flags; + atomic_t idle_counter; + atomic_t at_max_writeback_rate; struct cache_sb sb; @@ -523,6 +518,7 @@ struct cache_set { struct bcache_device **devices; unsigned devices_max_used; + atomic_t attached_dev_nr; struct list_head cached_devs; uint64_t cached_dev_sectors; struct closure caching; diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index ae67f5fa8047..6e08eb89abee 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -1102,6 +1102,44 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio) generic_make_request(bio); } +static void quit_max_writeback_rate(struct cache_set *c, + struct cached_dev *this_dc) +{ + int i; + struct bcache_device *d; + struct cached_dev *dc; + + /* + * mutex bch_register_lock may compete with other parallel requesters, + * or attach/detach operations on other backing device. Waiting to + * the mutex lock may increase I/O request latency for seconds or more. + * To avoid such situation, if mutext_trylock() failed, only writeback + * rate of current cached device is set to 1, and __update_write_back() + * will decide writeback rate of other cached devices (remember now + * c->idle_counter is 0 already). + */ + if (mutex_trylock(&bch_register_lock)) { + for (i = 0; i < c->devices_max_used; i++) { + if (!c->devices[i]) + continue; + + if (UUID_FLASH_ONLY(&c->uuids[i])) + continue; + + d = c->devices[i]; + dc = container_of(d, struct cached_dev, disk); + /* + * set writeback rate to default minimum value, + * then let update_writeback_rate() to decide the + * upcoming rate. 
+ */ + atomic_long_set(&dc->writeback_rate.rate, 1); + } + mutex_unlock(&bch_register_lock); + } else + atomic_long_set(&this_dc->writeback_rate.rate, 1); +} + /* Cached devices - read & write stuff */ static blk_qc_t cached_dev_make_request(struct request_queue *q, @@ -1119,7 +1157,21 @@ static blk_qc_t cached_dev_make_request(struct request_queue *q, return BLK_QC_T_NONE; } - atomic_set(&dc->backing_idle, 0); + if (likely(d->c)) { + if (atomic_read(&d->c->idle_counter)) + atomic_set(&d->c->idle_counter, 0); + /* + * If at_max_writeback_rate of cache set is true and new I/O + * comes, quit max writeback rate of all cached devices + * attached to this cache set, and set at_max_writeback_rate + * to false. + */ + if (unlikely(atomic_read(&d->c->at_max_writeback_rate) == 1)) { + atomic_set(&d->c->at_max_writeback_rate, 0); + quit_max_writeback_rate(d->c, dc); + } + } + generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0); bio_set_dev(bio, dc->bdev); diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index fa4058e43202..dc7b6131ddbb 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -696,6 +696,8 @@ static void bcache_device_detach(struct bcache_device *d) { lockdep_assert_held(&bch_register_lock); + atomic_dec(&d->c->attached_dev_nr); + if (test_bit(BCACHE_DEV_DETACHING, &d->flags)) { struct uuid_entry *u = d->c->uuids + d->id; @@ -1138,6 +1140,7 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c, bch_cached_dev_run(dc); bcache_device_link(&dc->disk, c, "bdev"); + atomic_inc(&c->attached_dev_nr); /* Allow the writeback thread to proceed */ up_write(&dc->writeback_lock); @@ -1687,6 +1690,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb) c->block_bits = ilog2(sb->block_size); c->nr_uuids = bucket_bytes(c) / sizeof(struct uuid_entry); c->devices_max_used = 0; + atomic_set(&c->attached_dev_nr, 0); c->btree_pages = bucket_pages(c); if (c->btree_pages > BTREE_MAX_PAGES) c->btree_pages = 
max_t(int, c->btree_pages / 4, diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index 225b15aa0340..a56067e80b10 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -170,7 +170,8 @@ SHOW(__bch_cached_dev) var_printf(writeback_running, "%i"); var_print(writeback_delay); var_print(writeback_percent); - sysfs_hprint(writeback_rate, dc->writeback_rate.rate << 9); + sysfs_hprint(writeback_rate, + atomic_long_read(&dc->writeback_rate.rate) << 9); sysfs_hprint(io_errors, atomic_read(&dc->io_errors)); sysfs_printf(io_error_limit, "%i", dc->error_limit); sysfs_printf(io_disable, "%i", dc->io_disable); @@ -188,7 +189,8 @@ SHOW(__bch_cached_dev) char change[20]; s64 next_io; - bch_hprint(rate, dc->writeback_rate.rate << 9); + bch_hprint(rate, + atomic_long_read(&dc->writeback_rate.rate) << 9); bch_hprint(dirty, bcache_dev_sectors_dirty(&dc->disk) << 9); bch_hprint(target, dc->writeback_rate_target << 9); bch_hprint(proportional,dc->writeback_rate_proportional << 9); @@ -255,8 +257,12 @@ STORE(__cached_dev) sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent, 0, 40); - sysfs_strtoul_clamp(writeback_rate, - dc->writeback_rate.rate, 1, INT_MAX); + if (attr == &sysfs_writeback_rate) { + int v; + + sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX); + atomic_long_set(&dc->writeback_rate.rate, v); + } sysfs_strtoul_clamp(writeback_rate_update_seconds, dc->writeback_rate_update_seconds, diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c index fc479b026d6d..b15256bcf0e7 100644 --- a/drivers/md/bcache/util.c +++ b/drivers/md/bcache/util.c @@ -200,7 +200,7 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done) { uint64_t now = local_clock(); - d->next += div_u64(done * NSEC_PER_SEC, d->rate); + d->next += div_u64(done * NSEC_PER_SEC, atomic_long_read(&d->rate)); /* Bound the time. 
Don't let us fall further than 2 seconds behind * (this prevents unnecessary backlog that would make it impossible diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h index cced87f8eb27..f7b0133c9d2f 100644 --- a/drivers/md/bcache/util.h +++ b/drivers/md/bcache/util.h @@ -442,7 +442,7 @@ struct bch_ratelimit { * Rate at which we want to do work, in units per second * The units here correspond to the units passed to bch_next_delay() */ - uint32_t rate; + atomic_long_t rate; }; static inline void bch_ratelimit_reset(struct bch_ratelimit *d) diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index ad45ebe1a74b..9f5e33324d1d 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -104,11 +104,56 @@ static void __update_writeback_rate(struct cached_dev *dc) dc->writeback_rate_proportional = proportional_scaled; dc->writeback_rate_integral_scaled = integral_scaled; - dc->writeback_rate_change = new_rate - dc->writeback_rate.rate; - dc->writeback_rate.rate = new_rate; + dc->writeback_rate_change = new_rate - + atomic_long_read(&dc->writeback_rate.rate); + atomic_long_set(&dc->writeback_rate.rate, new_rate); dc->writeback_rate_target = target; } +static bool set_at_max_writeback_rate(struct cache_set *c, + struct cached_dev *dc) +{ + /* + * Idle_counter is increased everytime when update_writeback_rate() is + * called. If all backing devices attached to the same cache set have + * identical dc->writeback_rate_update_seconds values, it is about 6 + * rounds of update_writeback_rate() on each backing device before + * c->at_max_writeback_rate is set to 1, and then max wrteback rate set + * to each dc->writeback_rate.rate. + * In order to avoid extra locking cost for counting exact dirty cached + * devices number, c->attached_dev_nr is used to calculate the idle + * throushold. 
It might be bigger if not all cached device are in write- + * back mode, but it still works well with limited extra rounds of + * update_writeback_rate(). + */ + if (atomic_inc_return(&c->idle_counter) < + atomic_read(&c->attached_dev_nr) * 6) + return false; + + if (atomic_read(&c->at_max_writeback_rate) != 1) + atomic_set(&c->at_max_writeback_rate, 1); + + atomic_long_set(&dc->writeback_rate.rate, INT_MAX); + + /* keep writeback_rate_target as existing value */ + dc->writeback_rate_proportional = 0; + dc->writeback_rate_integral_scaled = 0; + dc->writeback_rate_change = 0; + + /* + * Check c->idle_counter and c->at_max_writeback_rate agagain in case + * new I/O arrives during before set_at_max_writeback_rate() returns. + * Then the writeback rate is set to 1, and its new value should be + * decided via __update_writeback_rate(). + */ + if ((atomic_read(&c->idle_counter) < + atomic_read(&c->attached_dev_nr) * 6) || + !atomic_read(&c->at_max_writeback_rate)) + return false; + + return true; +} + static void update_writeback_rate(struct work_struct *work) { struct cached_dev *dc = container_of(to_delayed_work(work), @@ -136,13 +181,20 @@ static void update_writeback_rate(struct work_struct *work) return; } - down_read(&dc->writeback_lock); + if (atomic_read(&dc->has_dirty) && dc->writeback_percent) { + /* + * If the whole cache set is idle, set_at_max_writeback_rate() + * will set writeback rate to a max number. Then it is + * unncessary to update writeback rate for an idle cache set + * in maximum writeback rate number(s). 
+ */ + if (!set_at_max_writeback_rate(c, dc)) { + down_read(&dc->writeback_lock); + __update_writeback_rate(dc); + up_read(&dc->writeback_lock); + } + } - if (atomic_read(&dc->has_dirty) && - dc->writeback_percent) - __update_writeback_rate(dc); - - up_read(&dc->writeback_lock); /* * CACHE_SET_IO_DISABLE might be set via sysfs interface, @@ -422,27 +474,6 @@ static void read_dirty(struct cached_dev *dc) delay = writeback_delay(dc, size); - /* If the control system would wait for at least half a - * second, and there's been no reqs hitting the backing disk - * for awhile: use an alternate mode where we have at most - * one contiguous set of writebacks in flight at a time. If - * someone wants to do IO it will be quick, as it will only - * have to contend with one operation in flight, and we'll - * be round-tripping data to the backing disk as quickly as - * it can accept it. - */ - if (delay >= HZ / 2) { - /* 3 means at least 1.5 seconds, up to 7.5 if we - * have slowed way down. - */ - if (atomic_inc_return(&dc->backing_idle) >= 3) { - /* Wait for current I/Os to finish */ - closure_sync(&cl); - /* And immediately launch a new set. */ - delay = 0; - } - } - while (!kthread_should_stop() && !test_bit(CACHE_SET_IO_DISABLE, &dc->disk.c->flags) && delay) { @@ -715,7 +746,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc) dc->writeback_running = true; dc->writeback_percent = 10; dc->writeback_delay = 30; - dc->writeback_rate.rate = 1024; + atomic_long_set(&dc->writeback_rate.rate, 1024); dc->writeback_rate_minimum = 8; dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT; -- 2.16.4
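
The idle detection in this patch boils down to a counter that is bumped on every rate update and reset on every new I/O. A userspace model with C11 atomics, offered only as a sketch: the struct and function names here (cache_set_model, dev_model, try_set_max_rate, note_new_io) are hypothetical, the model covers a single backing device, and it leaves out the second check that set_at_max_writeback_rate() performs before returning as well as the bch_register_lock handling.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* The patch multiplies the number of attached devices by 6. */
#define IDLE_ROUNDS_PER_DEV 6

/* Hypothetical stand-ins for the fields the patch adds to struct cache_set. */
struct cache_set_model {
	atomic_int idle_counter;
	atomic_int at_max_writeback_rate;
	atomic_int attached_dev_nr;
};

/* Hypothetical stand-in for the rate field of one backing device. */
struct dev_model {
	atomic_long rate;
};

/*
 * Models one periodic rate update. Returns true once the cache set has
 * been idle for enough rounds and the rate has been raised to the maximum.
 */
static bool try_set_max_rate(struct cache_set_model *c, struct dev_model *dc)
{
	int threshold = atomic_load(&c->attached_dev_nr) * IDLE_ROUNDS_PER_DEV;

	/* atomic_fetch_add() returns the old value, so add 1 to compare. */
	if (atomic_fetch_add(&c->idle_counter, 1) + 1 < threshold)
		return false;

	atomic_store(&c->at_max_writeback_rate, 1);
	atomic_store(&dc->rate, INT_MAX);
	return true;
}

/* Models a new I/O request: reset the idle state and drop the rate to 1. */
static void note_new_io(struct cache_set_model *c, struct dev_model *dc)
{
	atomic_store(&c->idle_counter, 0);
	if (atomic_load(&c->at_max_writeback_rate) == 1) {
		atomic_store(&c->at_max_writeback_rate, 0);
		atomic_store(&dc->rate, 1);
	}
}
```

With two attached devices the threshold is 12, so the twelfth consecutive idle update switches to the maximum rate, and a single new request drops it back to 1 until the regular controller computes a new value.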


FAILED: patch "[PATCH] vhost/vsock: fix use-after-free in network stack callers" failed to apply to 4.4-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 4.4-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to <stable(a)vger.kernel.org>. thanks, greg k-h ------------------ original commit in Linus's tree ------------------ >From 834e772c8db0c6a275d75315d90aba4ebbb1e249 Mon Sep 17 00:00:00 2001 From: Stefan Hajnoczi <stefanha(a)redhat.com> Date: Mon, 5 Nov 2018 10:35:47 +0000 Subject: [PATCH] vhost/vsock: fix use-after-free in network stack callers If the network stack calls .send_pkt()/.cancel_pkt() during .release(), a struct vhost_vsock use-after-free is possible. This occurs because .release() does not wait for other CPUs to stop using struct vhost_vsock. Switch to an RCU-enabled hashtable (indexed by guest CID) so that .release() can wait for other CPUs by calling synchronize_rcu(). This also eliminates vhost_vsock_lock acquisition in the data path so it could have a positive effect on performance. This is CVE-2018-14625 "kernel: use-after-free Read in vhost_transport_send_pkt". Cc: stable(a)vger.kernel.org Reported-and-tested-by: syzbot+bd391451452fb0b93039(a)syzkaller.appspotmail.com Reported-by: syzbot+e3e074963495f92a89ed(a)syzkaller.appspotmail.com Reported-by: syzbot+d5a0a170c5069658b141(a)syzkaller.appspotmail.com Signed-off-by: Stefan Hajnoczi <stefanha(a)redhat.com> Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com> Acked-by: Jason Wang <jasowang(a)redhat.com> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c index 731e2ea2aeca..98ed5be132c6 100644 --- a/drivers/vhost/vsock.c +++ b/drivers/vhost/vsock.c @@ -15,6 +15,7 @@ #include <net/sock.h> #include <linux/virtio_vsock.h> #include <linux/vhost.h> +#include <linux/hashtable.h> #include <net/af_vsock.h> #include "vhost.h" @@ -27,14 +28,14 @@ enum { /* Used to track all the vhost_vsock instances on the system. 
  */
 static DEFINE_SPINLOCK(vhost_vsock_lock);
-static LIST_HEAD(vhost_vsock_list);
+static DEFINE_READ_MOSTLY_HASHTABLE(vhost_vsock_hash, 8);
 
 struct vhost_vsock {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[2];
 
-	/* Link to global vhost_vsock_list, protected by vhost_vsock_lock */
-	struct list_head list;
+	/* Link to global vhost_vsock_hash, writes use vhost_vsock_lock */
+	struct hlist_node hash;
 
 	struct vhost_work send_pkt_work;
 	spinlock_t send_pkt_list_lock;
@@ -50,11 +51,14 @@ static u32 vhost_transport_get_local_cid(void)
 	return VHOST_VSOCK_DEFAULT_HOST_CID;
 }
 
-static struct vhost_vsock *__vhost_vsock_get(u32 guest_cid)
+/* Callers that dereference the return value must hold vhost_vsock_lock or the
+ * RCU read lock.
+ */
+static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
 {
 	struct vhost_vsock *vsock;
 
-	list_for_each_entry(vsock, &vhost_vsock_list, list) {
+	hash_for_each_possible_rcu(vhost_vsock_hash, vsock, hash, guest_cid) {
 		u32 other_cid = vsock->guest_cid;
 
 		/* Skip instances that have no CID yet */
@@ -69,17 +73,6 @@ static struct vhost_vsock *__vhost_vsock_get(u32 guest_cid)
 	return NULL;
 }
 
-static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
-{
-	struct vhost_vsock *vsock;
-
-	spin_lock_bh(&vhost_vsock_lock);
-	vsock = __vhost_vsock_get(guest_cid);
-	spin_unlock_bh(&vhost_vsock_lock);
-
-	return vsock;
-}
-
 static void
 vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
 			    struct vhost_virtqueue *vq)
@@ -210,9 +203,12 @@ vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
 	struct vhost_vsock *vsock;
 	int len = pkt->len;
 
+	rcu_read_lock();
+
 	/* Find the vhost_vsock according to guest context id */
 	vsock = vhost_vsock_get(le64_to_cpu(pkt->hdr.dst_cid));
 	if (!vsock) {
+		rcu_read_unlock();
 		virtio_transport_free_pkt(pkt);
 		return -ENODEV;
 	}
@@ -225,6 +221,8 @@ vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
 	spin_unlock_bh(&vsock->send_pkt_list_lock);
 
 	vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
+
+	rcu_read_unlock();
 	return len;
 }
 
@@ -234,12 +232,15 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)
 	struct vhost_vsock *vsock;
 	struct virtio_vsock_pkt *pkt, *n;
 	int cnt = 0;
+	int ret = -ENODEV;
 	LIST_HEAD(freeme);
 
+	rcu_read_lock();
+
 	/* Find the vhost_vsock according to guest context id */
 	vsock = vhost_vsock_get(vsk->remote_addr.svm_cid);
 	if (!vsock)
-		return -ENODEV;
+		goto out;
 
 	spin_lock_bh(&vsock->send_pkt_list_lock);
 	list_for_each_entry_safe(pkt, n, &vsock->send_pkt_list, list) {
@@ -265,7 +266,10 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)
 			vhost_poll_queue(&tx_vq->poll);
 	}
 
-	return 0;
+	ret = 0;
+out:
+	rcu_read_unlock();
+	return ret;
 }
 
 static struct virtio_vsock_pkt *
@@ -533,10 +537,6 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
 	spin_lock_init(&vsock->send_pkt_list_lock);
 	INIT_LIST_HEAD(&vsock->send_pkt_list);
 	vhost_work_init(&vsock->send_pkt_work, vhost_transport_send_pkt_work);
-
-	spin_lock_bh(&vhost_vsock_lock);
-	list_add_tail(&vsock->list, &vhost_vsock_list);
-	spin_unlock_bh(&vhost_vsock_lock);
 	return 0;
 
 out:
@@ -585,9 +585,13 @@ static int vhost_vsock_dev_release(struct inode *inode, struct file *file)
 	struct vhost_vsock *vsock = file->private_data;
 
 	spin_lock_bh(&vhost_vsock_lock);
-	list_del(&vsock->list);
+	if (vsock->guest_cid)
+		hash_del_rcu(&vsock->hash);
 	spin_unlock_bh(&vhost_vsock_lock);
 
+	/* Wait for other CPUs to finish using vsock */
+	synchronize_rcu();
+
 	/* Iterating over all connections for all CIDs to find orphans is
 	 * inefficient. Room for improvement here. */
 	vsock_for_each_connected_socket(vhost_vsock_reset_orphans);
@@ -628,12 +632,17 @@ static int vhost_vsock_set_cid(struct vhost_vsock *vsock, u64 guest_cid)
 
 	/* Refuse if CID is already in use */
 	spin_lock_bh(&vhost_vsock_lock);
-	other = __vhost_vsock_get(guest_cid);
+	other = vhost_vsock_get(guest_cid);
 	if (other && other != vsock) {
 		spin_unlock_bh(&vhost_vsock_lock);
 		return -EADDRINUSE;
 	}
+
+	if (vsock->guest_cid)
+		hash_del_rcu(&vsock->hash);
+
 	vsock->guest_cid = guest_cid;
+	hash_add_rcu(vhost_vsock_hash, &vsock->hash, guest_cid);
 	spin_unlock_bh(&vhost_vsock_lock);
 
 	return 0;


FAILED: patch "[PATCH] dax: Check page->mapping isn't NULL" failed to apply to 4.19-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.

thanks,

greg k-h

------------------ original commit in Linus's tree ------------------

From c93db7bb6ef3251e0ea48ade311d3e9942748e1c Mon Sep 17 00:00:00 2001
From: Matthew Wilcox <willy(a)infradead.org>
Date: Tue, 27 Nov 2018 13:16:33 -0800
Subject: [PATCH] dax: Check page->mapping isn't NULL

If we race with inode destroy, it's possible for page->mapping to be
NULL before we even enter this routine, as well as after having slept
waiting for the dax entry to become unlocked.

Fixes: c2a7d2a11552 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
Cc: <stable(a)vger.kernel.org>
Reported-by: Jan Kara <jack(a)suse.cz>
Signed-off-by: Matthew Wilcox <willy(a)infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn(a)suse.de>
Reviewed-by: Jan Kara <jack(a)suse.cz>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>

diff --git a/fs/dax.c b/fs/dax.c
index 9bcce89ea18e..e69fc231833b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -365,7 +365,7 @@ bool dax_lock_mapping_entry(struct page *page)
 		struct address_space *mapping = READ_ONCE(page->mapping);
 
 		locked = false;
-		if (!dax_mapping(mapping))
+		if (!mapping || !dax_mapping(mapping))
 			break;
 
 		/*
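
The one-line change above relies on short-circuit evaluation so that the helper is never handed a NULL pointer. As a minimal userspace sketch of that pattern (the struct layouts, the `is_dax` field and `page_has_dax_mapping()` are hypothetical stand-ins for illustration, not the kernel's definitions; the `dax_mapping()` here is only assumed to dereference its argument the way the kernel helper does):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct address_space {
	bool is_dax;
};

struct page {
	struct address_space *mapping;
};

/* Assumed to behave like the kernel helper: dereferences unconditionally. */
bool dax_mapping(const struct address_space *mapping)
{
	return mapping->is_dax;
}

/* Hypothetical wrapper: true only if a mapping exists and it is DAX. */
bool page_has_dax_mapping(const struct page *page)
{
	const struct address_space *mapping = page->mapping;

	/* The NULL test runs first, so dax_mapping() never sees NULL. */
	if (!mapping || !dax_mapping(mapping))
		return false;
	return true;
}

/* Small self-check covering the three cases the patch distinguishes. */
void page_has_dax_mapping_self_check(void)
{
	struct address_space dax = { .is_dax = true };
	struct address_space plain = { .is_dax = false };
	struct page with_dax = { .mapping = &dax };
	struct page with_plain = { .mapping = &plain };
	struct page torn_down = { .mapping = NULL };

	assert(page_has_dax_mapping(&with_dax));
	assert(!page_has_dax_mapping(&with_plain));
	assert(!page_has_dax_mapping(&torn_down));
}
```

Without the `!mapping` test, the `torn_down` case would dereference NULL inside the helper, which is the situation the commit message describes for a race with inode destroy.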


(no subject)

by Sebastian Andrzej Siewior

Hi,

this is a backport of commit 7aa54be297655 ("locking/qspinlock, x86:
Provide liveness guarantee") for the v4.9 stable tree.
For the v4.4 tree the ARCH_USE_QUEUED_SPINLOCKS option got disabled on
x86. For v4.9 it has been decided to do a minimal backport of the final
fix (including all its dependencies). With this backport I can't
reproduce the issue in the latest v4.9-RT tree.
I was able to boot (and use) an arm64 box with these patches so it is
not broken in an obvious way.

Sebastian


+ mm-thp-fix-flags-for-pmd-migration-when-split.patch added to -mm tree

by akpm＠linux-foundation.org

The patch titled
     Subject: mm: thp: fix flags for pmd migration when split
has been added to the -mm tree. Its filename is
     mm-thp-fix-flags-for-pmd-migration-when-split.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-fix-flags-for-pmd-migration…
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-thp-fix-flags-for-pmd-migration…

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Peter Xu <peterx(a)redhat.com>
Subject: mm: thp: fix flags for pmd migration when split

When splitting a huge migrating PMD, we'll transfer all the existing PMD
bits and apply them again onto the small PTEs. However we are fetching
the bits unconditionally via pmd_soft_dirty(), pmd_write() or pmd_young()
while actually they don't make sense at all when it's a migration entry.
Fix them up. While at it, drop the ifdef as well since it is not needed.

Note that, if my understanding of the problem is correct, then without
this patch there is a chance of losing some of the dirty bits in the
migrating pmd pages (on x86_64 we're fetching bit 11 which is part of
swap offset instead of bit 2), and that could potentially corrupt the
memory of a userspace program which depends on the dirty bit.

Link: http://lkml.kernel.org/r/20181213051510.20306-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx(a)redhat.com>
Reviewed-by: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Reviewed-by: William Kucharski <william.kucharski(a)oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Souptick Joarder <jrdr.linux(a)gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Cc: Zi Yan <zi.yan(a)cs.rutgers.edu>
Cc: <stable(a)vger.kernel.org>	[4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

--- a/mm/huge_memory.c~mm-thp-fix-flags-for-pmd-migration-when-split
+++ a/mm/huge_memory.c
@@ -2144,23 +2144,25 @@ static void __split_huge_pmd_locked(stru
 	 */
 	old_pmd = pmdp_invalidate(vma, haddr, pmd);
 
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 	pmd_migration = is_pmd_migration_entry(old_pmd);
-	if (pmd_migration) {
+	if (unlikely(pmd_migration)) {
 		swp_entry_t entry;
 
 		entry = pmd_to_swp_entry(old_pmd);
 		page = pfn_to_page(swp_offset(entry));
-	} else
-#endif
+		write = is_write_migration_entry(entry);
+		young = false;
+		soft_dirty = pmd_swp_soft_dirty(old_pmd);
+	} else {
 		page = pmd_page(old_pmd);
+		if (pmd_dirty(old_pmd))
+			SetPageDirty(page);
+		write = pmd_write(old_pmd);
+		young = pmd_young(old_pmd);
+		soft_dirty = pmd_soft_dirty(old_pmd);
+	}
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	page_ref_add(page, HPAGE_PMD_NR - 1);
-	if (pmd_dirty(old_pmd))
-		SetPageDirty(page);
-	write = pmd_write(old_pmd);
-	young = pmd_young(old_pmd);
-	soft_dirty = pmd_soft_dirty(old_pmd);
 
 	/*
 	 * Withdraw the table only after we mark the pmd entry invalid.
_

Patches currently in -mm which might be from peterx(a)redhat.com are

mm-thp-fix-flags-for-pmd-migration-when-split.patch
userfaultfd-clear-flag-if-remap-event-not-enabled.patch
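
To illustrate why the flag has to be read differently for a migration entry, here is a minimal userspace sketch. It is an assumption-laden model, not kernel code: `struct demo_pmd`, the `DEMO_*` masks and both functions are hypothetical, and the bit positions simply reuse the numbers quoted in the message above (bit 11 for a present entry, bit 2 for a swap-style entry) rather than values taken from any kernel header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical masks; positions follow the message text, not a kernel header. */
#define DEMO_PRESENT_SOFT_DIRTY (UINT64_C(1) << 11)
#define DEMO_SWP_SOFT_DIRTY     (UINT64_C(1) << 2)

/* Hypothetical model of a PMD value plus whether it encodes a migration entry. */
struct demo_pmd {
	uint64_t val;
	bool is_migration;
};

/* Old behaviour: always test the present-entry bit, whatever the entry type. */
bool demo_soft_dirty_unconditional(struct demo_pmd pmd)
{
	return (pmd.val & DEMO_PRESENT_SOFT_DIRTY) != 0;
}

/* Patched behaviour: pick the bit that is meaningful for the entry type. */
bool demo_soft_dirty_by_type(struct demo_pmd pmd)
{
	if (pmd.is_migration)
		return (pmd.val & DEMO_SWP_SOFT_DIRTY) != 0;
	return (pmd.val & DEMO_PRESENT_SOFT_DIRTY) != 0;
}

/* Self-check: the two readings disagree exactly on migration entries. */
void demo_soft_dirty_self_check(void)
{
	/* Migration entry marked soft-dirty: the unconditional read loses it. */
	struct demo_pmd dirty_migration = { DEMO_SWP_SOFT_DIRTY, true };
	/* Migration entry whose offset happens to have bit 11 set. */
	struct demo_pmd offset_bit_set = { UINT64_C(1) << 11, true };
	/* Ordinary present entry: both readings agree. */
	struct demo_pmd present_dirty = { DEMO_PRESENT_SOFT_DIRTY, false };

	assert(!demo_soft_dirty_unconditional(dirty_migration));
	assert(demo_soft_dirty_by_type(dirty_migration));
	assert(demo_soft_dirty_unconditional(offset_bit_set));
	assert(!demo_soft_dirty_by_type(offset_bit_set));
	assert(demo_soft_dirty_unconditional(present_dirty) ==
	       demo_soft_dirty_by_type(present_dirty));
}
```

In this model the unconditional read both drops a real soft-dirty mark and reports a spurious one when an offset bit lands on position 11, which matches the failure mode the commit message describes.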


stable-rc/linux-4.4.y boot: 91 boots: 1 failed, 88 passed with 1 offline, 1 untried/unknown (v4.4.167-40-g840a97100a76)

by kernelci.org bot

stable-rc/linux-4.4.y boot: 91 boots: 1 failed, 88 passed with 1 offline, 1 untried/unknown (v4.4.167-40-g840a97100a76)

Full Boot Summary: https://kernelci.org/boot/all/job/stable-rc/branch/linux-4.4.y/kernel/v4.4.…
Full Build Summary: https://kernelci.org/build/stable-rc/branch/linux-4.4.y/kernel/v4.4.167-40-…

Tree: stable-rc
Branch: linux-4.4.y
Git Describe: v4.4.167-40-g840a97100a76
Git Commit: 840a97100a767329253bab6ebfdc805968ad42c1
Git URL: http://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
Tested: 41 unique boards, 20 SoC families, 12 builds out of 187

Boot Failure Detected:

arm64:

    defconfig
        qcom-qdf2400: 1 failed lab

Offline Platforms:

arm:

    multi_v7_defconfig:
        stih410-b2120: 1 offline lab

---
For more info write to <info(a)kernelci.org>

