On Fri, Dec 14, 2018 at 11:42:01PM +0100, Thomas Schöbel-Theuer wrote:
> Ah, I overlooked that commit e56c92565dfe2 is already providing a different
> solution to the same problem in newer kernels _only_, as a _side_ effect
> (not clear to me from the description, but clear from reading the code).
Damn, I missed the fact that this is not the upstream kernel:
CPU: 0 PID: 1 UID: 0 Comm: swapper/0 Not tainted 4.4.0-ui18344.004-uiabi1-infong-amd64 #1
> So another alternative would be backporting e56c92565dfe2 to the 4.4 LTS
> series. Also fine for me.
That looks like the right fix.
A note for the next time: do not send a fix for a stable kernel which is
not upstream:
From Documentation/process/stable-kernel-rules.rst:
" - It or an equivalent fix must already exist in Linus' tree (upstream)."
The stable kernels track upstream so if a stable kernel has a problem,
the first thing one needs to do is to check whether this has been fixed
upstream and if so, to backport it. This is the case most of the time.
In the rare cases where a separate fix is needed, it should be
handled by asking Greg what to do. :-)
Adding stable@ folks to CC to set me straight if I'm missing something.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On 12/15/18 2:10 AM, Michael Lyle wrote:
> Coly--
>
> Apologies for the late reply on this. I just noticed it based on Greg's
> comment about stable.
>
> When I wrote the previous "accelerate writeback" patchset, my first
> attempt was very much like this. I believe it was asked (by you?)
> whether it would impact the latency of front-end I/O because of deep
> backing device queues when a new request comes in.
>
> Won't this cause lots of requests to be pending to the backing device, so
> if there is intermittent front-end I/O they'll have to wait for the device?
> That's why I previously had it set to only complete one writeback at a
> time, to bound the impact on latency -- based on that review feedback.
Hi Mike,
This patch is a much more conservative effort. It sets a high writeback
rate only when all bcache devices attached to the cache set have been
idle for quite a few seconds. In this situation, the cache set is really
quiet and spare.
Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
only looks at a single bcache device. If other bcache devices on the
cache set have I/O while a single bcache device is idle, that idle
device gets a faster writeback rate, and the I/O that reads dirty data
from the cache for writeback has a negative impact on the I/O requests
of the other bcache devices.
Therefore I give up the per-device faster writeback, to protect the
latency of front-end I/O in general.
Thanks.
Coly Li
> On Mon, Jul 23, 2018 at 9:03 PM Coly Li <colyli(a)suse.de> wrote:
>
> Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle")
> allows the writeback rate to be faster if there is no I/O request on a
> bcache device. It works well if there is only one bcache device attached
> to the cache set. If there are many bcache devices attached to a cache
> set, it may introduce a performance regression because multiple faster
> writeback threads of the idle bcache devices will compete for the btree
> level locks with the bcache devices that have I/O requests coming.
>
> This patch fixes the above issue by only permitting fast writeback when
> all bcache devices attached to the cache set are idle. If one of the
> bcache devices has new I/O requests coming, all writeback throughput is
> minimized immediately and the PI controller __update_writeback_rate()
> decides the upcoming writeback rate for each bcache device.
>
> Also, when all bcache devices are idle, limiting the writeback rate to a
> small number wastes throughput, especially when the backing devices are
> slower non-rotational devices (e.g. SATA SSDs). This patch sets a max
> writeback rate for each backing device if the whole cache set is idle. A
> faster writeback rate in idle time means new I/Os may have more available
> space for dirty data, and people may then observe better write performance.
>
> Please note bcache may change its cache mode at run time, and this patch
> still works if the cache mode is switched away from writeback mode while
> there is still dirty data on the cache.
>
> Fixes: b1092c9af9ed ("bcache: allow quick writeback when backing idle")
> Cc: stable(a)vger.kernel.org #4.16+
> Signed-off-by: Coly Li <colyli(a)suse.de>
> Tested-by: Kai Krakow <kai(a)kaishome.de>
> Cc: Michael Lyle <mlyle(a)lyle.org>
> Cc: Stefan Priebe <s.priebe(a)profihost.ag>
> ---
> Changelog:
> v2, Fix a deadlock reported by Stefan Priebe.
> v1, Initial version.
>
> drivers/md/bcache/bcache.h | 11 ++--
> drivers/md/bcache/request.c | 51 ++++++++++++++-
> drivers/md/bcache/super.c | 1 +
> drivers/md/bcache/sysfs.c | 14 +++--
> drivers/md/bcache/util.c | 2 +-
> drivers/md/bcache/util.h | 2 +-
> drivers/md/bcache/writeback.c | 115 ++++++++++++++++++++++++++--------
> 7 files changed, 155 insertions(+), 41 deletions(-)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index d6bf294f3907..469ab1a955e0 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -328,13 +328,6 @@ struct cached_dev {
> */
> atomic_t has_dirty;
>
> - /*
> - * Set to zero by things that touch the backing volume-- except
> - * writeback. Incremented by writeback. Used to determine
> when to
> - * accelerate idle writeback.
> - */
> - atomic_t backing_idle;
> -
> struct bch_ratelimit writeback_rate;
> struct delayed_work writeback_rate_update;
>
> @@ -514,6 +507,8 @@ struct cache_set {
> struct cache_accounting accounting;
>
> unsigned long flags;
> + atomic_t idle_counter;
> + atomic_t at_max_writeback_rate;
>
> struct cache_sb sb;
>
> @@ -523,6 +518,8 @@ struct cache_set {
>
> struct bcache_device **devices;
> unsigned devices_max_used;
> + /* See set_at_max_writeback_rate() for how it is used */
> + unsigned previous_dirty_dc_nr;
> struct list_head cached_devs;
> uint64_t cached_dev_sectors;
> struct closure caching;
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index ae67f5fa8047..1af3d96abfa5 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -1104,6 +1104,43 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio)
>
> /* Cached devices - read & write stuff */
>
> +static void quit_max_writeback_rate(struct cache_set *c,
> + struct cached_dev *this_dc)
> +{
> + int i;
> + struct bcache_device *d;
> + struct cached_dev *dc;
> +
> + /*
> + * If bch_register_lock is acquired by other attach/detach operations,
> + * waiting here will increase I/O request latency for seconds or more.
> + * To avoid such situation, only writeback rate of current cached device
> + * is set to 1, and __update_writeback_rate() will decide writeback rate
> + * of other cached devices (remember c->idle_counter is 0 now).
> + */
> + if (mutex_trylock(&bch_register_lock)) {
> + for (i = 0; i < c->devices_max_used; i++) {
> + if (!c->devices[i])
> + continue;
> +
> + if (UUID_FLASH_ONLY(&c->uuids[i]))
> + continue;
> +
> + d = c->devices[i];
> + dc = container_of(d, struct cached_dev, disk);
> + /*
> + * set writeback rate to default minimum value,
> + * then let update_writeback_rate() decide the
> + * upcoming rate.
> + */
> + atomic64_set(&dc->writeback_rate.rate, 1);
> + }
> +
> + mutex_unlock(&bch_register_lock);
> + } else
> + atomic64_set(&this_dc->writeback_rate.rate, 1);
> +}
> +
> static blk_qc_t cached_dev_make_request(struct request_queue *q,
> struct bio *bio)
> {
> @@ -1119,7 +1156,19 @@ static blk_qc_t cached_dev_make_request(struct request_queue *q,
> return BLK_QC_T_NONE;
> }
>
> - atomic_set(&dc->backing_idle, 0);
> + if (d->c) {
> + atomic_set(&d->c->idle_counter, 0);
> + /*
> + * If at_max_writeback_rate of cache set is true and new I/O
> + * comes, quit max writeback rate of all cached devices
> + * attached to this cache set, and set at_max_writeback_rate
> + * to false.
> + */
> + if (unlikely(atomic_read(&d->c->at_max_writeback_rate) == 1)) {
> + atomic_set(&d->c->at_max_writeback_rate, 0);
> + quit_max_writeback_rate(d->c, dc);
> + }
> + }
> generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0);
>
> bio_set_dev(bio, dc->bdev);
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index fa4058e43202..fa532d9f9353 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1687,6 +1687,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
> c->block_bits = ilog2(sb->block_size);
> c->nr_uuids = bucket_bytes(c) / sizeof(struct uuid_entry);
> c->devices_max_used = 0;
> + c->previous_dirty_dc_nr = 0;
> c->btree_pages = bucket_pages(c);
> if (c->btree_pages > BTREE_MAX_PAGES)
> c->btree_pages = max_t(int, c->btree_pages / 4,
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index 225b15aa0340..d719021bff81 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -170,7 +170,8 @@ SHOW(__bch_cached_dev)
> var_printf(writeback_running, "%i");
> var_print(writeback_delay);
> var_print(writeback_percent);
> - sysfs_hprint(writeback_rate, dc->writeback_rate.rate << 9);
> + sysfs_hprint(writeback_rate,
> + atomic64_read(&dc->writeback_rate.rate) << 9);
> sysfs_hprint(io_errors, atomic_read(&dc->io_errors));
> sysfs_printf(io_error_limit, "%i", dc->error_limit);
> sysfs_printf(io_disable, "%i", dc->io_disable);
> @@ -188,7 +189,8 @@ SHOW(__bch_cached_dev)
> char change[20];
> s64 next_io;
>
> - bch_hprint(rate, dc->writeback_rate.rate << 9);
> + bch_hprint(rate,
> + atomic64_read(&dc->writeback_rate.rate) << 9);
> bch_hprint(dirty, bcache_dev_sectors_dirty(&dc->disk) << 9);
> bch_hprint(target, dc->writeback_rate_target << 9);
> bch_hprint(proportional,dc->writeback_rate_proportional << 9);
> @@ -255,8 +257,12 @@ STORE(__cached_dev)
>
> sysfs_strtoul_clamp(writeback_percent,
> dc->writeback_percent, 0, 40);
>
> - sysfs_strtoul_clamp(writeback_rate,
> - dc->writeback_rate.rate, 1, INT_MAX);
> + if (attr == &sysfs_writeback_rate) {
> + int v;
> +
> + sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX);
> + atomic64_set(&dc->writeback_rate.rate, v);
> + }
>
> sysfs_strtoul_clamp(writeback_rate_update_seconds,
> dc->writeback_rate_update_seconds,
> diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
> index fc479b026d6d..84f90c3d996d 100644
> --- a/drivers/md/bcache/util.c
> +++ b/drivers/md/bcache/util.c
> @@ -200,7 +200,7 @@ uint64_t bch_next_delay(struct bch_ratelimit *d,
> uint64_t done)
> {
> uint64_t now = local_clock();
>
> - d->next += div_u64(done * NSEC_PER_SEC, d->rate);
> + d->next += div_u64(done * NSEC_PER_SEC, atomic64_read(&d->rate));
>
> /* Bound the time. Don't let us fall further than 2 seconds behind
> * (this prevents unnecessary backlog that would make it impossible
> diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
> index cced87f8eb27..7e17f32ab563 100644
> --- a/drivers/md/bcache/util.h
> +++ b/drivers/md/bcache/util.h
> @@ -442,7 +442,7 @@ struct bch_ratelimit {
> * Rate at which we want to do work, in units per second
> * The units here correspond to the units passed to bch_next_delay()
> */
> - uint32_t rate;
> + atomic64_t rate;
> };
>
> static inline void bch_ratelimit_reset(struct bch_ratelimit *d)
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index ad45ebe1a74b..11ffadc3cf8f 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -49,6 +49,80 @@ static uint64_t __calc_target_rate(struct cached_dev *dc)
> return (cache_dirty_target * bdev_share) >> WRITEBACK_SHARE_SHIFT;
> }
>
> +static bool set_at_max_writeback_rate(struct cache_set *c,
> + struct cached_dev *dc)
> +{
> + int i, dirty_dc_nr = 0;
> + struct bcache_device *d;
> +
> + /*
> + * bch_register_lock is acquired in cached_dev_detach_finish() before
> + * calling cancel_writeback_rate_update_dwork() to stop the delayed
> + * kworker writeback_rate_update (which is the context we are in now).
> + * Therefore calling mutex_lock() here may introduce a deadlock when
> + * shutting down the bcache device.
> + * c->previous_dirty_dc_nr is used to record the previously calculated
> + * dirty_dc_nr when mutex_trylock() last succeeded. Then if
> + * mutex_trylock() fails here, use c->previous_dirty_dc_nr as the dirty
> + * cached device number. Of course it might be inaccurate, but a few
> + * more or fewer loops before setting c->at_max_writeback_rate is much
> + * better than a deadlock here.
> + */
> + if (mutex_trylock(&bch_register_lock)) {
> + for (i = 0; i < c->devices_max_used; i++) {
> + if (!c->devices[i])
> + continue;
> + if (UUID_FLASH_ONLY(&c->uuids[i]))
> + continue;
> + d = c->devices[i];
> + dc = container_of(d, struct cached_dev, disk);
> + if (atomic_read(&dc->has_dirty))
> + dirty_dc_nr++;
> + }
> + c->previous_dirty_dc_nr = dirty_dc_nr;
> +
> + mutex_unlock(&bch_register_lock);
> + } else
> + dirty_dc_nr = c->previous_dirty_dc_nr;
> +
> + /*
> + * Idle_counter is increased every time update_writeback_rate()
> + * is rescheduled. If all backing devices attached to the same
> + * cache set have the same dc->writeback_rate_update_seconds
> + * value, then after about 10 rounds of update_writeback_rate()
> + * on each backing device, the code will fall through and set 1 to
> + * c->at_max_writeback_rate, and a max writeback rate to each
> + * dc->writeback_rate.rate. This is not very accurate but works well
> + * to make sure the whole cache set has no new I/O coming before
> + * writeback rate is set to a max number.
> + */
> + if (atomic_inc_return(&c->idle_counter) < dirty_dc_nr * 10)
> + return false;
> +
> + if (atomic_read(&c->at_max_writeback_rate) != 1)
> + atomic_set(&c->at_max_writeback_rate, 1);
> +
> + atomic64_set(&dc->writeback_rate.rate, INT_MAX);
> +
> + /* keep writeback_rate_target as existing value */
> + dc->writeback_rate_proportional = 0;
> + dc->writeback_rate_integral_scaled = 0;
> + dc->writeback_rate_change = 0;
> +
> + /*
> + * Check c->idle_counter and c->at_max_writeback_rate again in case
> + * new I/O arrives before set_at_max_writeback_rate() returns.
> + * Then the writeback rate is set to 1, and its new value should be
> + * decided via __update_writeback_rate().
> + */
> + if (atomic_read(&c->idle_counter) < dirty_dc_nr * 10 ||
> + !atomic_read(&c->at_max_writeback_rate))
> + return false;
> +
> + return true;
> +}
> +
> static void __update_writeback_rate(struct cached_dev *dc)
> {
> /*
> @@ -104,8 +178,9 @@ static void __update_writeback_rate(struct cached_dev *dc)
>
> dc->writeback_rate_proportional = proportional_scaled;
> dc->writeback_rate_integral_scaled = integral_scaled;
> - dc->writeback_rate_change = new_rate - dc->writeback_rate.rate;
> - dc->writeback_rate.rate = new_rate;
> + dc->writeback_rate_change = new_rate -
> + atomic64_read(&dc->writeback_rate.rate);
> + atomic64_set(&dc->writeback_rate.rate, new_rate);
> dc->writeback_rate_target = target;
> }
>
> @@ -138,9 +213,16 @@ static void update_writeback_rate(struct work_struct *work)
>
> down_read(&dc->writeback_lock);
>
> - if (atomic_read(&dc->has_dirty) &&
> - dc->writeback_percent)
> - __update_writeback_rate(dc);
> + if (atomic_read(&dc->has_dirty) && dc->writeback_percent) {
> + /*
> + * If the whole cache set is idle, set_at_max_writeback_rate()
> + * will set writeback rate to a max number. Then it is
> + * unnecessary to update writeback rate for an idle cache set
> + * in maximum writeback rate number(s).
> + */
> + if (!set_at_max_writeback_rate(c, dc))
> + __update_writeback_rate(dc);
> + }
>
> up_read(&dc->writeback_lock);
>
> @@ -422,27 +504,6 @@ static void read_dirty(struct cached_dev *dc)
>
> delay = writeback_delay(dc, size);
>
> - /* If the control system would wait for at least half a
> - * second, and there's been no reqs hitting the backing disk
> - * for awhile: use an alternate mode where we have at most
> - * one contiguous set of writebacks in flight at a time. If
> - * someone wants to do IO it will be quick, as it will only
> - * have to contend with one operation in flight, and we'll
> - * be round-tripping data to the backing disk as quickly as
> - * it can accept it.
> - */
> - if (delay >= HZ / 2) {
> - /* 3 means at least 1.5 seconds, up to 7.5 if we
> - * have slowed way down.
> - */
> - if (atomic_inc_return(&dc->backing_idle) >= 3) {
> - /* Wait for current I/Os to finish */
> - closure_sync(&cl);
> - /* And immediately launch a new set. */
> - delay = 0;
> - }
> - }
> -
> while (!kthread_should_stop() &&
> !test_bit(CACHE_SET_IO_DISABLE, &dc->disk.c->flags) &&
> delay) {
> @@ -715,7 +776,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
> dc->writeback_running = true;
> dc->writeback_percent = 10;
> dc->writeback_delay = 30;
> - dc->writeback_rate.rate = 1024;
> + atomic64_set(&dc->writeback_rate.rate, 1024);
> dc->writeback_rate_minimum = 8;
>
> dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
> --
> 2.17.1
>
Commit 379d98ddf413 ("x86: vdso: Use $LD instead of $CC to link")
accidentally broke unwinding from userspace, because ld would strip the
.eh_frame sections when linking.
Originally, the compiler would implicitly add --eh-frame-hdr when
invoking the linker, but when this Makefile was converted from invoking
ld via the compiler, to invoking it directly (like vmlinux does),
the flag was missed. (The EH_FRAME section is important for the VDSO
shared libraries, but not for vmlinux.)
Fix the problem by explicitly specifying --eh-frame-hdr, which restores
parity with the old method.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201741
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1659295
Reported-by: Laura Abbott <labbott(a)redhat.com>
Fixes: 379d98ddf413 ("x86: vdso: Use $LD instead of $CC to link")
Cc: stable(a)vger.kernel.org
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: X86 ML <x86(a)kernel.org>
Cc: Florian Weimer <fweimer(a)redhat.com>
Cc: Carlos O'Donell <carlos(a)redhat.com>
Cc: "H. J. Lu" <hjl.tools(a)gmail.com>
Cc: Joel Fernandes <joel(a)joelfernandes.org>
Cc: kernel-team(a)android.com
Signed-off-by: Alistair Strachan <astrachan(a)google.com>
---
arch/x86/entry/vdso/Makefile | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 141d415a8c80..c3d7ccd25381 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -171,7 +171,8 @@ quiet_cmd_vdso = VDSO $@
sh $(srctree)/$(src)/checkundef.sh '$(NM)' '$@'
VDSO_LDFLAGS = -shared $(call ld-option, --hash-style=both) \
- $(call ld-option, --build-id) -Bsymbolic
+ $(call ld-option, --build-id) $(call ld-option, --eh-frame-hdr) \
+ -Bsymbolic
GCOV_PROFILE := n
#
--
2.20.0.405.gbc1bbc6f85-goog
From: Thierry Reding <treding(a)nvidia.com>
Subject: scripts/spdxcheck.py: always open files in binary mode
The spdxcheck script currently falls over when confronted with a binary
file (such as Documentation/logo.gif). To avoid that, always open files
in binary mode and decode line-by-line, ignoring encoding errors.
One tricky case is when piping data into the script and reading it from
standard input. By default, standard input will be opened in text mode,
so we need to reopen it in binary mode.
The breakage only happens with python3 and results in a
UnicodeDecodeError (according to Uwe).
Link: http://lkml.kernel.org/r/20181212131210.28024-1-thierry.reding@gmail.com
Fixes: 6f4d29df66ac ("scripts/spdxcheck.py: make python3 compliant")
Signed-off-by: Thierry Reding <treding(a)nvidia.com>
Reviewed-by: Jeremy Cline <jcline(a)redhat.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Jonathan Corbet <corbet(a)lwn.net>
Cc: Joe Perches <joe(a)perches.com>
Cc: Uwe Kleine-König <u.kleine-koenig(a)pengutronix.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
scripts/spdxcheck.py | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
--- a/scripts/spdxcheck.py~scripts-spdxcheckpy-always-open-files-in-binary-mode
+++ a/scripts/spdxcheck.py
@@ -168,6 +168,7 @@ class id_parser(object):
self.curline = 0
try:
for line in fd:
+ line = line.decode(locale.getpreferredencoding(False), errors='ignore')
self.curline += 1
if self.curline > maxlines:
break
@@ -249,12 +250,13 @@ if __name__ == '__main__':
try:
if len(args.path) and args.path[0] == '-':
- parser.parse_lines(sys.stdin, args.maxlines, '-')
+ stdin = os.fdopen(sys.stdin.fileno(), 'rb')
+ parser.parse_lines(stdin, args.maxlines, '-')
else:
if args.path:
for p in args.path:
if os.path.isfile(p):
- parser.parse_lines(open(p), args.maxlines, p)
+ parser.parse_lines(open(p, 'rb'), args.maxlines, p)
elif os.path.isdir(p):
scan_git_subtree(repo.head.reference.commit.tree, p)
else:
_
From: Andrea Arcangeli <aarcange(a)redhat.com>
Subject: userfaultfd: check VM_MAYWRITE was set after verifying the uffd is registered
Calling UFFDIO_UNREGISTER on virtual ranges not yet registered in uffd
could trigger a harmless false positive WARN_ON. Check that the vma is
already registered before checking VM_MAYWRITE to shut off the false
positive warning.
Link: http://lkml.kernel.org/r/20181206212028.18726-2-aarcange@redhat.com
Cc: <stable(a)vger.kernel.org>
Fixes: 29ec90660d68 ("userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas")
Signed-off-by: Andrea Arcangeli <aarcange(a)redhat.com>
Reported-by: syzbot+06c7092e7d71218a2c16(a)syzkaller.appspotmail.com
Acked-by: Mike Rapoport <rppt(a)linux.ibm.com>
Acked-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Peter Xu <peterx(a)redhat.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/userfaultfd.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/fs/userfaultfd.c~userfaultfd-check-vm_maywrite-was-set-after-verifying-the-uffd-is-registered
+++ a/fs/userfaultfd.c
@@ -1566,7 +1566,6 @@ static int userfaultfd_unregister(struct
cond_resched();
BUG_ON(!vma_can_userfault(vma));
- WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
/*
* Nothing to do: this vma is already registered into this
@@ -1575,6 +1574,8 @@ static int userfaultfd_unregister(struct
if (!vma->vm_userfaultfd_ctx.ctx)
goto skip;
+ WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
+
if (vma->vm_start > start)
start = vma->vm_start;
vma_end = min(end, vma->vm_end);
_
From: Piotr Jaroszynski <pjaroszynski(a)nvidia.com>
Subject: fs/iomap.c: get/put the page in iomap_page_create/release()
migrate_page_move_mapping() expects pages with private data set to have a
page_count elevated by 1. This is what used to happen for xfs through the
buffer_heads code before the switch to iomap in commit 82cb14175e7d ("xfs:
add support for sub-pagesize writeback without buffer_heads"). Not having
the count elevated causes move_pages() to fail on memory mapped files
coming from xfs.
Make iomap compatible with the migrate_page_move_mapping() assumption by
elevating the page count as part of iomap_page_create() and lowering it in
iomap_page_release().
It causes the move_pages() syscall to misbehave on memory mapped files
from xfs. It does not move any pages, which I suppose is "just" a
perf issue, but it also ends up returning a positive number which is
out of spec for the syscall. Talking to Michal Hocko, it sounds like
returning positive numbers might be a necessary update to move_pages()
anyway though
(https://lkml.kernel.org/r/20181116114955.GJ14706@dhcp22.suse.cz).
I only hit this in tests that verify that move_pages() actually moved
the pages. The test also got confused by the positive return from
move_pages() (it got treated as a success as positive numbers were not
expected and not handled) making it a bit harder to track down what's
going on.
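
For reference, the expectation can be sketched as below. This is a
simplified illustration with a hypothetical helper name, not the actual
mm code:

	/* Sketch: migration expects one extra page reference when a page
	 * carries private data -- the reference that iomap_page_create()
	 * now takes via get_page() in the patch below.
	 */
	static int expected_page_refs_sketch(int mapping_refs, int has_private)
	{
		return mapping_refs + (has_private ? 1 : 0);
	}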
Link: http://lkml.kernel.org/r/20181115184140.1388751-1-pjaroszynski@nvidia.com
Fixes: 82cb14175e7d ("xfs: add support for sub-pagesize writeback without buffer_heads")
Signed-off-by: Piotr Jaroszynski <pjaroszynski(a)nvidia.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Cc: William Kucharski <william.kucharski(a)oracle.com>
Cc: Darrick J. Wong <darrick.wong(a)oracle.com>
Cc: Brian Foster <bfoster(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/iomap.c | 7 +++++++
1 file changed, 7 insertions(+)
--- a/fs/iomap.c~iomap-get-put-the-page-in-iomap_page_create-release
+++ a/fs/iomap.c
@@ -116,6 +116,12 @@ iomap_page_create(struct inode *inode, s
atomic_set(&iop->read_count, 0);
atomic_set(&iop->write_count, 0);
bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
+
+ /*
+ * migrate_page_move_mapping() assumes that pages with private data have
+ * their count elevated by 1.
+ */
+ get_page(page);
set_page_private(page, (unsigned long)iop);
SetPagePrivate(page);
return iop;
@@ -132,6 +138,7 @@ iomap_page_release(struct page *page)
WARN_ON_ONCE(atomic_read(&iop->write_count));
ClearPagePrivate(page);
set_page_private(page, 0);
+ put_page(page);
kfree(iop);
}
_
The nr_dentry_unused per-cpu counter tracks dentries in both the
LRU lists and the shrink lists where the DCACHE_LRU_LIST bit is set.
The shrink_dcache_sb() function moves dentries from the LRU list to a
shrink list and subtracts the dentry count from nr_dentry_unused. This
is incorrect as the nr_dentry_unused count will also be decremented in
shrink_dentry_list() via d_shrink_del(). To fix this double decrement,
the decrement in the shrink_dcache_sb() function is removed.
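
As a rough sketch (user-space stand-ins with hypothetical names, not
the kernel code or its per-cpu counter API), the double decrement
amounted to:

	static long nr_dentry_unused_sketch;	/* stands in for the per-cpu counter */

	static void d_shrink_del_sketch(void)
	{
		/* decrement #1: taken for each dentry on the shrink list */
		nr_dentry_unused_sketch--;
	}

	static void shrink_dcache_sb_sketch(long freed)
	{
		/* decrement #2: the extra subtraction removed below */
		nr_dentry_unused_sketch -= freed;
	}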
Fixes: 4e717f5c1083 ("list_lru: remove special case function list_lru_dispose_all.")
Cc: stable(a)vger.kernel.org
Signed-off-by: Waiman Long <longman(a)redhat.com>
Reviewed-by: Dave Chinner <dchinner(a)redhat.com>
---
fs/dcache.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 2593153..44e5652 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1188,15 +1188,11 @@ static enum lru_status dentry_lru_isolate_shrink(struct list_head *item,
*/
void shrink_dcache_sb(struct super_block *sb)
{
- long freed;
-
do {
LIST_HEAD(dispose);
- freed = list_lru_walk(&sb->s_dentry_lru,
+ list_lru_walk(&sb->s_dentry_lru,
dentry_lru_isolate_shrink, &dispose, 1024);
-
- this_cpu_sub(nr_dentry_unused, freed);
shrink_dentry_list(&dispose);
} while (list_lru_count(&sb->s_dentry_lru) > 0);
}
--
1.8.3.1
From: Shuah Khan <shuah(a)kernel.org>
Commit b2d35fa5fc80 ("selftests: add headers_install to lib.mk") added
khdr target to run headers_install target from the main Makefile. The
logic uses KSFT_KHDR_INSTALL and top_srcdir as controls to initialize
variables and include files to run headers_install from the top level
Makefile. There are a few problems with this logic.
1. Exposes top_srcdir to all tests
2. Common logic impacts all tests
3. Uses KSFT_KHDR_INSTALL, top_srcdir, and khdr in an ad hoc way. Tests
add "khdr" dependency in their Makefiles to TEST_PROGS_EXTENDED in
some cases, and STATIC_LIBS in other cases. This makes this framework
confusing to use.
The common logic runs for all tests even when KSFT_KHDR_INSTALL
isn't defined by the test. top_srcdir is initialized to a default value
when a test doesn't initialize it. It works for tests without a sub-dir
structure, but tests with a sub-dir structure fail to build.
e.g: make -C sparc64/drivers/ or make -C drivers/dma-buf
../../lib.mk:20: ../../../../scripts/subarch.include: No such file or directory
make: *** No rule to make target '../../../../scripts/subarch.include'. Stop.
There is no reason to require all tests to define top_srcdir and there is
no need to require tests to add khdr dependency using adhoc changes to
TEST_* and other variables.
Fix it with a consistent use of KSFT_KHDR_INSTALL and top_srcdir from tests
that have the dependency on headers_install.
Change the common logic to define the khdr target, and an "all" target
with a dependency on khdr, when KSFT_KHDR_INSTALL is defined.
Only tests that depend on headers_install have to define the
KSFT_KHDR_INSTALL and top_srcdir variables, and there is no need to
specify the khdr dependency in the test Makefiles.
Fixes: b2d35fa5fc80 ("selftests: add headers_install to lib.mk")
Cc: stable(a)vger.kernel.org
Signed-off-by: Shuah Khan <shuah(a)kernel.org>
---
tools/testing/selftests/android/Makefile | 2 +-
tools/testing/selftests/futex/functional/Makefile | 1 +
tools/testing/selftests/gpio/Makefile | 6 +++---
tools/testing/selftests/kvm/Makefile | 2 +-
tools/testing/selftests/lib.mk | 8 ++++----
tools/testing/selftests/networking/timestamping/Makefile | 1 +
tools/testing/selftests/tc-testing/bpf/Makefile | 1 +
tools/testing/selftests/vm/Makefile | 1 +
8 files changed, 13 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/android/Makefile b/tools/testing/selftests/android/Makefile
index d9a725478375..72c25a3cb658 100644
--- a/tools/testing/selftests/android/Makefile
+++ b/tools/testing/selftests/android/Makefile
@@ -6,7 +6,7 @@ TEST_PROGS := run.sh
include ../lib.mk
-all: khdr
+all:
@for DIR in $(SUBDIRS); do \
BUILD_TARGET=$(OUTPUT)/$$DIR; \
mkdir $$BUILD_TARGET -p; \
diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
index ad1eeb14fda7..30996306cabc 100644
--- a/tools/testing/selftests/futex/functional/Makefile
+++ b/tools/testing/selftests/futex/functional/Makefile
@@ -19,6 +19,7 @@ TEST_GEN_FILES := \
TEST_PROGS := run.sh
top_srcdir = ../../../../..
+KSFT_KHDR_INSTALL := 1
include ../../lib.mk
$(TEST_GEN_FILES): $(HEADERS)
diff --git a/tools/testing/selftests/gpio/Makefile b/tools/testing/selftests/gpio/Makefile
index 46648427d537..07f572a1bd3f 100644
--- a/tools/testing/selftests/gpio/Makefile
+++ b/tools/testing/selftests/gpio/Makefile
@@ -10,8 +10,6 @@ TEST_PROGS_EXTENDED := gpio-mockup-chardev
GPIODIR := $(realpath ../../../gpio)
GPIOOBJ := gpio-utils.o
-include ../lib.mk
-
all: $(TEST_PROGS_EXTENDED)
override define CLEAN
@@ -19,7 +17,9 @@ override define CLEAN
$(MAKE) -C $(GPIODIR) OUTPUT=$(GPIODIR)/ clean
endef
-$(TEST_PROGS_EXTENDED):| khdr
+KSFT_KHDR_INSTALL := 1
+include ../lib.mk
+
$(TEST_PROGS_EXTENDED): $(GPIODIR)/$(GPIOOBJ)
$(GPIODIR)/$(GPIOOBJ):
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 01a219229238..52bfe5e76907 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -1,6 +1,7 @@
all:
top_srcdir = ../../../..
+KSFT_KHDR_INSTALL := 1
UNAME_M := $(shell uname -m)
LIBKVM = lib/assert.c lib/elf.c lib/io.c lib/kvm_util.c lib/ucall.c lib/sparsebit.c
@@ -44,7 +45,6 @@ $(OUTPUT)/libkvm.a: $(LIBKVM_OBJ)
all: $(STATIC_LIBS)
$(TEST_GEN_PROGS): $(STATIC_LIBS)
-$(STATIC_LIBS):| khdr
cscope: include_paths = $(LINUX_TOOL_INCLUDE) $(LINUX_HDR_PATH) include lib ..
cscope:
diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index 0a8e75886224..8b0f16409ed7 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -16,18 +16,18 @@ TEST_GEN_PROGS := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS))
TEST_GEN_PROGS_EXTENDED := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS_EXTENDED))
TEST_GEN_FILES := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_FILES))
+ifdef KSFT_KHDR_INSTALL
top_srcdir ?= ../../../..
include $(top_srcdir)/scripts/subarch.include
ARCH ?= $(SUBARCH)
-all: $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES)
-
.PHONY: khdr
khdr:
make ARCH=$(ARCH) -C $(top_srcdir) headers_install
-ifdef KSFT_KHDR_INSTALL
-$(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES):| khdr
+all: khdr $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES)
+else
+all: $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES)
endif
.ONESHELL:
diff --git a/tools/testing/selftests/networking/timestamping/Makefile b/tools/testing/selftests/networking/timestamping/Makefile
index 14cfcf006936..c46c0eefab9e 100644
--- a/tools/testing/selftests/networking/timestamping/Makefile
+++ b/tools/testing/selftests/networking/timestamping/Makefile
@@ -6,6 +6,7 @@ TEST_PROGS := hwtstamp_config rxtimestamp timestamping txtimestamp
all: $(TEST_PROGS)
top_srcdir = ../../../../..
+KSFT_KHDR_INSTALL := 1
include ../../lib.mk
clean:
diff --git a/tools/testing/selftests/tc-testing/bpf/Makefile b/tools/testing/selftests/tc-testing/bpf/Makefile
index dc92eb271d9a..be5a5e542804 100644
--- a/tools/testing/selftests/tc-testing/bpf/Makefile
+++ b/tools/testing/selftests/tc-testing/bpf/Makefile
@@ -4,6 +4,7 @@ APIDIR := ../../../../include/uapi
TEST_GEN_FILES = action.o
top_srcdir = ../../../../..
+KSFT_KHDR_INSTALL := 1
include ../../lib.mk
CLANG ?= clang
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 6e67e726e5a5..e13eb6cc8901 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -25,6 +25,7 @@ TEST_GEN_FILES += virtual_address_range
TEST_PROGS := run_vmtests
+KSFT_KHDR_INSTALL := 1
include ../lib.mk
$(OUTPUT)/userfaultfd: LDLIBS += -lpthread
--
2.17.1
On Fri, Dec 14, 2018 at 06:29:54AM -0800, Eric Dumazet wrote:
> On Fri, Dec 14, 2018 at 6:26 AM Greg Kroah-Hartman
> <gregkh(a)linuxfoundation.org> wrote:
>
> > On Fri, Dec 14, 2018 at 02:03:55PM +0000, Sudip Mukherjee wrote:
> > > Hi Greg,
> > >
> > > On Fri, Dec 14, 2018 at 12:08 PM Greg Kroah-Hartman
> > > <gregkh(a)linuxfoundation.org> wrote:
> > > >
> > > > 4.14-stable review patch. If anyone has any objections, please let me know.
> > > >
> > > > ------------------
> > > >
> > > > From: Eric Dumazet <edumazet(a)google.com>
> > > >
> > > > [ Upstream commit 41727549de3e7281feb174d568c6e46823db8684 ]
> > >
> > > There is one upstream commit which fixes this one.
> > > f9bfe4e6a9d0 ("tcp: lack of available data can also cause TSO defer")
> >
> > I can take this if Eric and/or David ack it :)
> >
> >
> I ack it, I guess David had this queued already.
>
> No big deal, that is only for tcp_info instrumentation accuracy.
If this is already queued for the next round of stable patches, and it's
not a major issue, I'll just wait and take it when it shows up in a week
or so.
thanks,
greg k-h
ASIDs have always been stored as unsigned longs, i.e. 32 bits on MIPS32
kernels. This is problematic because it is feasible for the ASID version
to overflow & wrap around to zero.
We currently attempt to handle this overflow by simply setting the ASID
version to 1, using asid_first_version(), but we make no attempt to
account for the fact that there may be mm_structs with stale ASIDs that
have versions which we now reuse due to the overflow & wrap around.
Encountering this requires that:
1) A struct mm_struct X is active on CPU A using ASID (V,n).
2) That mm is not used on CPU A for the length of time that it takes
for CPU A's asid_cache to overflow & wrap around to the same
version V that the mm had in step 1. During this time tasks using
the mm could either be sleeping or only scheduled on other CPUs.
3) Some other mm Y becomes active on CPU A and is allocated the same
ASID (V,n).
4) mm X now becomes active on CPU A again, and now incorrectly has the
same ASID as mm Y.
Where struct mm_struct ASIDs are represented above in the format
(version, EntryHi.ASID), and on a typical MIPS32 system version will be
24 bits wide & EntryHi.ASID will be 8 bits wide.
The length of time required in step 2 is highly dependent upon the CPU &
workload, but for a hypothetical 2GHz CPU running a workload which
generates a new ASID every 10000 cycles this period is around 248 days.
Due to this long period of time & the fact that tasks need to be
scheduled in just the right (or wrong, depending upon your inclination)
way, this is obviously a difficult bug to encounter but it's entirely
possible as evidenced by reports.
In order to fix this, simply extend ASIDs to 64 bits even on MIPS32
builds. This will extend the period of time required for the
hypothetical system above to encounter the problem from 248 days to
around 3 trillion years, which feels safely outside of the realms of
possibility.
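
To make the wrap-around concrete, below is a minimal user-space sketch.
It assumes an 8-bit hardware ASID in the low bits with the version in
the bits above; it is an illustration only, not the kernel's
implementation:

	#include <stdint.h>
	#include <stdio.h>

	#define ASID_MASK 0xff	/* assumed 8-bit EntryHi.ASID field */

	int main(void)
	{
		uint32_t cache32 = UINT32_MAX;	/* 24-bit version about to wrap */
		uint64_t cache64 = UINT32_MAX;	/* same point on a 64-bit counter */

		cache32++;	/* version wraps to 0: stale mms may now collide */
		cache64++;	/* version simply becomes 2^24: no collision */

		printf("32-bit: version=%u asid=%u\n",
		       cache32 >> 8, cache32 & ASID_MASK);
		printf("64-bit: version=%llu asid=%u\n",
		       (unsigned long long)(cache64 >> 8),
		       (unsigned)(cache64 & ASID_MASK));
		return 0;
	}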
The cost of this is slightly more generated code in some commonly
executed paths, but this is pretty minimal:
                      | Code Size Gain | Percentage
 ---------------------|----------------|-----------
 decstation_defconfig |           +270 |     +0.00%
 32r2el_defconfig     |           +652 |     +0.01%
 32r6el_defconfig     |          +1000 |     +0.01%
I have been unable to measure any change in performance of the LMbench
lat_ctx or lat_proc tests resulting from the 64b ASIDs on either
32r2el_defconfig+interAptiv or 32r6el_defconfig+I6500 systems.
Signed-off-by: Paul Burton <paul.burton(a)mips.com>
Suggested-by: James Hogan <jhogan(a)kernel.org>
References: https://lore.kernel.org/linux-mips/80B78A8B8FEE6145A87579E8435D78C30205D5F3…
References: https://lore.kernel.org/linux-mips/1488684260-18867-1-git-send-email-jiwei.…
Cc: Jiwei Sun <jiwei.sun(a)windriver.com>
Cc: Yu Huabing <yhb(a)ruijie.com.cn>
Cc: stable(a)vger.kernel.org # 2.6.12+
---
arch/mips/include/asm/cpu-info.h | 2 +-
arch/mips/include/asm/mmu.h | 2 +-
arch/mips/include/asm/mmu_context.h | 8 ++++----
arch/mips/mm/c-r3k.c | 2 +-
4 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/arch/mips/include/asm/cpu-info.h b/arch/mips/include/asm/cpu-info.h
index a41059d47d31..ed7ffe4e63a3 100644
--- a/arch/mips/include/asm/cpu-info.h
+++ b/arch/mips/include/asm/cpu-info.h
@@ -50,7 +50,7 @@ struct guest_info {
#define MIPS_CACHE_PINDEX 0x00000020 /* Physically indexed cache */
struct cpuinfo_mips {
- unsigned long asid_cache;
+ u64 asid_cache;
#ifdef CONFIG_MIPS_ASID_BITS_VARIABLE
unsigned long asid_mask;
#endif
diff --git a/arch/mips/include/asm/mmu.h b/arch/mips/include/asm/mmu.h
index 0740be7d5d4a..24d6b42345fb 100644
--- a/arch/mips/include/asm/mmu.h
+++ b/arch/mips/include/asm/mmu.h
@@ -7,7 +7,7 @@
#include <linux/wait.h>
typedef struct {
- unsigned long asid[NR_CPUS];
+ u64 asid[NR_CPUS];
void *vdso;
atomic_t fp_mode_switching;
diff --git a/arch/mips/include/asm/mmu_context.h b/arch/mips/include/asm/mmu_context.h
index 94414561de0e..fd869d538a3c 100644
--- a/arch/mips/include/asm/mmu_context.h
+++ b/arch/mips/include/asm/mmu_context.h
@@ -76,14 +76,14 @@ extern unsigned long pgd_current[];
* All unused by hardware upper bits will be considered
* as a software asid extension.
*/
-static unsigned long asid_version_mask(unsigned int cpu)
+static u64 asid_version_mask(unsigned int cpu)
{
unsigned long asid_mask = cpu_asid_mask(&cpu_data[cpu]);
- return ~(asid_mask | (asid_mask - 1));
+ return ~(u64)(asid_mask | (asid_mask - 1));
}
-static unsigned long asid_first_version(unsigned int cpu)
+static u64 asid_first_version(unsigned int cpu)
{
return ~asid_version_mask(cpu) + 1;
}
@@ -102,7 +102,7 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
static inline void
get_new_mmu_context(struct mm_struct *mm, unsigned long cpu)
{
- unsigned long asid = asid_cache(cpu);
+ u64 asid = asid_cache(cpu);
if (!((asid += cpu_asid_inc()) & cpu_asid_mask(&cpu_data[cpu]))) {
if (cpu_has_vtag_icache)
diff --git a/arch/mips/mm/c-r3k.c b/arch/mips/mm/c-r3k.c
index 3466fcdae0ca..01848cdf2074 100644
--- a/arch/mips/mm/c-r3k.c
+++ b/arch/mips/mm/c-r3k.c
@@ -245,7 +245,7 @@ static void r3k_flush_cache_page(struct vm_area_struct *vma,
pmd_t *pmdp;
pte_t *ptep;
- pr_debug("cpage[%08lx,%08lx]\n",
+ pr_debug("cpage[%08llx,%08lx]\n",
cpu_context(smp_processor_id(), mm), addr);
/* No ASID => no such page in the cache. */
--
2.19.1