Hi Andrew,
On Wed, Jun 6, 2018 at 6:29 PM, Andrew Jeddeloh
<andrew.jeddeloh(a)redhat.com> wrote:
> Hi all,
>
> The patch "xen-netfront: Fix race between device setup and open" seems
> to have introduced a regression that prevents setting MTUs larger than
> 1500. We experienced this downstream with Container Linux and
> confirmed with Fedora 28 as well.
>
> It's commit f599c64fdf7d9c108e8717fb04bc41c680120da4 in the linux-stable tree.
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/com…
>
> Downstream bugs:
> https://github.com/coreos/bugs/issues/2443
> https://bugzilla.redhat.com/show_bug.cgi?id=1584216
>
> We've confirmed that reverting that commit fixes the bug. It can be
> reliably reproduced on AWS with t2.micro instances (and presumably
> other systems using the same driver). Setting the MTU with either
> systemd-networkd or manual ip link commands causes the link to
> respond with "Invalid argument" when trying to set an MTU > 1500.
>
> I'm not sure why that commit introduced the regression.
>
> Please let me know if there's any more information that would be helpful.
>
> - Andrew
I'm adding some relevant people to the CC list to bring more attention
to this regression.
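For reference, the failure Andrew describes can also be triggered from a
small userspace program; a minimal sketch (untested here, assuming the
xen-netfront interface is named eth0 and an MTU of 9000, and run with
CAP_NET_ADMIN):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

int main(void)
{
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
	ifr.ifr_mtu = 9000;
	/* On affected kernels this fails with EINVAL ("Invalid argument"). */
	if (ioctl(fd, SIOCSIFMTU, &ifr) < 0)
		perror("SIOCSIFMTU");
	close(fd);
	return 0;
}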
The get_maintainer.pl script is very useful for getting hints on who
should be copied, e.g.:
$ ./scripts/get_maintainer.pl -f drivers/net/xen-netfront.c
Best regards,
Javier
On 6/21/18 1:01 AM, bsegall(a)google.com wrote:
> Xunlei Pang <xlpang(a)linux.alibaba.com> writes:
>
>> I noticed the group constantly got throttled even though it consumed
>> little CPU, which caused jitter in the response time of some of our
>> business containers that enable the CPU quota.
>>
>> It's very simple to reproduce:
>> mkdir /sys/fs/cgroup/cpu/test
>> cd /sys/fs/cgroup/cpu/test
>> echo 100000 > cpu.cfs_quota_us
>> echo $$ > tasks
>> then repeat:
>> cat cpu.stat |grep nr_throttled // nr_throttled will increase
>>
>> After some analysis, we found that cfs_rq::runtime_remaining will
>> be cleared by expire_cfs_rq_runtime() due to two equal but stale
>> "cfs_{b|q}->runtime_expires" values after the period timer is re-armed.
>>
>> The current condition used to detect clock drift in
>> expire_cfs_rq_runtime() is wrong: the two runtime_expires are actually
>> equal when clock drift happens, so this condition can never hit. The
>> original design was done correctly by commit a9cf55b28610 ("sched:
>> Expire invalid runtime"), but was later changed to the current one
>> because of a locking issue.
>>
>> This patch takes another approach: it adds a new field to both struct
>> cfs_rq and struct cfs_bandwidth to record the expiration update
>> sequence, and uses the two values to detect clock drift (which has
>> occurred when they are equal).
>>
>> Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
>> Cc: Ben Segall <bsegall(a)google.com>
>
> Reviewed-by: Ben Segall <bsegall(a)google.com>
Thanks Ben :-)
Hi Peter, could you please have a look at them?
Cc stable(a)vger.kernel.org
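For anyone skimming the diff below: the core of the change is to stop
comparing the two runtime_expires timestamps and instead compare
per-period sequence numbers. A stripped-down, self-contained sketch of
the idea (hypothetical names and a placeholder tick value, not the
kernel code):

#include <stdint.h>

#define TICK_NSEC 1000000ULL	/* placeholder tick length for this sketch */

struct bandwidth_pool {		/* stands in for struct cfs_bandwidth */
	uint64_t runtime_expires;
	int expires_seq;
};

struct local_rq {		/* stands in for struct cfs_rq */
	uint64_t runtime_expires;
	int expires_seq;
	int64_t runtime_remaining;
};

/* Period timer refill: advance the global deadline and bump the sequence. */
void refill_runtime(struct bandwidth_pool *b, uint64_t now, uint64_t period)
{
	b->runtime_expires = now + period;
	b->expires_seq++;
}

/* When a local rq pulls runtime, it snapshots both values. */
void take_runtime(struct bandwidth_pool *b, struct local_rq *q)
{
	q->expires_seq = b->expires_seq;
	q->runtime_expires = b->runtime_expires;
}

/*
 * Local expiry check: equal sequences mean no new period has started,
 * so the apparent expiry can only be bounded clock drift between CPUs;
 * unequal sequences mean the runtime really expired.
 */
void expire_local_runtime(struct bandwidth_pool *b, struct local_rq *q,
			  uint64_t now)
{
	if ((int64_t)(now - q->runtime_expires) < 0)
		return;					/* deadline not reached yet */

	if (q->expires_seq == b->expires_seq)
		q->runtime_expires += TICK_NSEC;	/* only drift: extend */
	else
		q->runtime_remaining = 0;		/* new period: really expired */
}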
>
>> Signed-off-by: Xunlei Pang <xlpang(a)linux.alibaba.com>
>> ---
>> kernel/sched/fair.c | 14 ++++++++------
>> kernel/sched/sched.h | 6 ++++--
>> 2 files changed, 12 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e497c05aab7f..e6bb68d52962 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4590,6 +4590,7 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>> now = sched_clock_cpu(smp_processor_id());
>> cfs_b->runtime = cfs_b->quota;
>> cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
>> + cfs_b->expires_seq++;
>> }
>>
>> static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>> @@ -4612,6 +4613,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>> struct task_group *tg = cfs_rq->tg;
>> struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
>> u64 amount = 0, min_amount, expires;
>> + int expires_seq;
>>
>> /* note: this is a positive sum as runtime_remaining <= 0 */
>> min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
>> @@ -4628,6 +4630,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>> cfs_b->idle = 0;
>> }
>> }
>> + expires_seq = cfs_b->expires_seq;
>> expires = cfs_b->runtime_expires;
>> raw_spin_unlock(&cfs_b->lock);
>>
>> @@ -4637,8 +4640,10 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>> * spread between our sched_clock and the one on which runtime was
>> * issued.
>> */
>> - if ((s64)(expires - cfs_rq->runtime_expires) > 0)
>> + if (cfs_rq->expires_seq != expires_seq) {
>> + cfs_rq->expires_seq = expires_seq;
>> cfs_rq->runtime_expires = expires;
>> + }
>>
>> return cfs_rq->runtime_remaining > 0;
>> }
>> @@ -4664,12 +4669,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>> * has not truly expired.
>> *
>> * Fortunately we can check determine whether this the case by checking
>> - * whether the global deadline has advanced. It is valid to compare
>> - * cfs_b->runtime_expires without any locks since we only care about
>> - * exact equality, so a partial write will still work.
>> + * whether the global deadline(cfs_b->expires_seq) has advanced.
>> */
>> -
>> - if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
>> + if (cfs_rq->expires_seq == cfs_b->expires_seq) {
>> /* extend local deadline, drift is bounded above by 2 ticks */
>> cfs_rq->runtime_expires += TICK_NSEC;
>> } else {
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 6601baf2361c..e977e04f8daf 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -334,9 +334,10 @@ struct cfs_bandwidth {
>> u64 runtime;
>> s64 hierarchical_quota;
>> u64 runtime_expires;
>> + int expires_seq;
>>
>> - int idle;
>> - int period_active;
>> + short idle;
>> + short period_active;
>> struct hrtimer period_timer;
>> struct hrtimer slack_timer;
>> struct list_head throttled_cfs_rq;
>> @@ -551,6 +552,7 @@ struct cfs_rq {
>>
>> #ifdef CONFIG_CFS_BANDWIDTH
>> int runtime_enabled;
>> + int expires_seq;
>> u64 runtime_expires;
>> s64 runtime_remaining;
The patch titled
Subject: slub: track number of slabs irrespective of CONFIG_SLUB_DEBUG
has been added to the -mm tree. Its filename is
slub-track-number-of-slabs-irrespective-of-config_slub_debug.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/slub-track-number-of-slabs-irrespe…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/slub-track-number-of-slabs-irrespe…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Shakeel Butt <shakeelb(a)google.com>
Subject: slub: track number of slabs irrespective of CONFIG_SLUB_DEBUG
For !CONFIG_SLUB_DEBUG, SLUB does not maintain the number of slabs
allocated per node for a kmem_cache. Thus, slabs_node() in
__kmem_cache_empty(), __kmem_cache_shrink() and __kmem_cache_destroy()
will always return 0 for such a config. This is wrong and can cause
issues for all users of these functions.
In fact, in [1] Jason reported a system crash while using SLUB without
CONFIG_SLUB_DEBUG. The cause was the use of slabs_node() by
__kmem_cache_empty().
The right solution is to make slabs_node() work even for
!CONFIG_SLUB_DEBUG. Commit 0f389ec63077 ("slub: No need for per node
slab counters if !SLUB_DEBUG") put the per-node slab counter under
CONFIG_SLUB_DEBUG because it was only read through the sysfs API, and
that API is disabled for !CONFIG_SLUB_DEBUG. However, other users of the
per-node slab counter assumed that it would work even in the absence of
CONFIG_SLUB_DEBUG. So, make the counter work for !CONFIG_SLUB_DEBUG.
Please note that f9e13c0a5a33 ("slab, slub: skip unnecessary
kasan_cache_shutdown()") exposed this issue, but it was present even
before that commit.
[1] http://lkml.kernel.org/r/CAHmME9rtoPwxUSnktxzKso14iuVCWT7BE_-_8PAC=pGw1iJnQ…
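As a reading aid for the diff below, the resulting counter handling
looks roughly like this: a simplified, kernel-style sketch with
hypothetical *_sketch names that reuses the kernel's atomic_long_*
helpers, mirroring the hunks below rather than being standalone code.
nr_slabs is maintained unconditionally, while total_objects stays under
CONFIG_SLUB_DEBUG.

struct kmem_cache_node_sketch {
	unsigned long nr_partial;
	struct list_head partial;
	atomic_long_t nr_slabs;			/* now always present */
#ifdef CONFIG_SLUB_DEBUG
	atomic_long_t total_objects;		/* still debug-only */
	struct list_head full;
#endif
};

static inline unsigned long slabs_node_sketch(struct kmem_cache_node_sketch *n)
{
	return atomic_long_read(&n->nr_slabs);	/* no longer hard-wired to 0 */
}

static inline void inc_slabs_node_sketch(struct kmem_cache_node_sketch *n,
					 int objects)
{
	/* n may be NULL during early bootstrap, as in the real code. */
	if (n) {
		atomic_long_inc(&n->nr_slabs);
#ifdef CONFIG_SLUB_DEBUG
		atomic_long_add(objects, &n->total_objects);
#endif
	}
}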
Link: http://lkml.kernel.org/r/20180620224147.23777-1-shakeelb@google.com
Fixes: f9e13c0a5a33 ("slab, slub: skip unnecessary kasan_cache_shutdown()")
Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
Suggested-by: David Rientjes <rientjes(a)google.com>
Reported-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
Cc: Christoph Lameter <cl(a)linux.com>
Cc: Pekka Enberg <penberg(a)kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: Andrey Ryabinin <aryabinin(a)virtuozzo.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/slab.h | 2 -
mm/slub.c | 80 ++++++++++++++++++++++++----------------------------
2 files changed, 38 insertions(+), 44 deletions(-)
diff -puN mm/slab.h~slub-track-number-of-slabs-irrespective-of-config_slub_debug mm/slab.h
--- a/mm/slab.h~slub-track-number-of-slabs-irrespective-of-config_slub_debug
+++ a/mm/slab.h
@@ -473,8 +473,8 @@ struct kmem_cache_node {
#ifdef CONFIG_SLUB
unsigned long nr_partial;
struct list_head partial;
-#ifdef CONFIG_SLUB_DEBUG
atomic_long_t nr_slabs;
+#ifdef CONFIG_SLUB_DEBUG
atomic_long_t total_objects;
struct list_head full;
#endif
diff -puN mm/slub.c~slub-track-number-of-slabs-irrespective-of-config_slub_debug mm/slub.c
--- a/mm/slub.c~slub-track-number-of-slabs-irrespective-of-config_slub_debug
+++ a/mm/slub.c
@@ -1030,42 +1030,6 @@ static void remove_full(struct kmem_cach
list_del(&page->lru);
}
-/* Tracking of the number of slabs for debugging purposes */
-static inline unsigned long slabs_node(struct kmem_cache *s, int node)
-{
- struct kmem_cache_node *n = get_node(s, node);
-
- return atomic_long_read(&n->nr_slabs);
-}
-
-static inline unsigned long node_nr_slabs(struct kmem_cache_node *n)
-{
- return atomic_long_read(&n->nr_slabs);
-}
-
-static inline void inc_slabs_node(struct kmem_cache *s, int node, int objects)
-{
- struct kmem_cache_node *n = get_node(s, node);
-
- /*
- * May be called early in order to allocate a slab for the
- * kmem_cache_node structure. Solve the chicken-egg
- * dilemma by deferring the increment of the count during
- * bootstrap (see early_kmem_cache_node_alloc).
- */
- if (likely(n)) {
- atomic_long_inc(&n->nr_slabs);
- atomic_long_add(objects, &n->total_objects);
- }
-}
-static inline void dec_slabs_node(struct kmem_cache *s, int node, int objects)
-{
- struct kmem_cache_node *n = get_node(s, node);
-
- atomic_long_dec(&n->nr_slabs);
- atomic_long_sub(objects, &n->total_objects);
-}
-
/* Object debug checks for alloc/free paths */
static void setup_object_debug(struct kmem_cache *s, struct page *page,
void *object)
@@ -1321,16 +1285,46 @@ slab_flags_t kmem_cache_flags(unsigned i
#define disable_higher_order_debug 0
+#endif /* CONFIG_SLUB_DEBUG */
+
static inline unsigned long slabs_node(struct kmem_cache *s, int node)
- { return 0; }
+{
+ struct kmem_cache_node *n = get_node(s, node);
+
+ return atomic_long_read(&n->nr_slabs);
+}
+
static inline unsigned long node_nr_slabs(struct kmem_cache_node *n)
- { return 0; }
-static inline void inc_slabs_node(struct kmem_cache *s, int node,
- int objects) {}
-static inline void dec_slabs_node(struct kmem_cache *s, int node,
- int objects) {}
+{
+ return atomic_long_read(&n->nr_slabs);
+}
-#endif /* CONFIG_SLUB_DEBUG */
+static inline void inc_slabs_node(struct kmem_cache *s, int node, int objects)
+{
+ struct kmem_cache_node *n = get_node(s, node);
+
+ /*
+ * May be called early in order to allocate a slab for the
+ * kmem_cache_node structure. Solve the chicken-egg
+ * dilemma by deferring the increment of the count during
+ * bootstrap (see early_kmem_cache_node_alloc).
+ */
+ if (likely(n)) {
+ atomic_long_inc(&n->nr_slabs);
+#ifdef CONFIG_SLUB_DEBUG
+ atomic_long_add(objects, &n->total_objects);
+#endif
+ }
+}
+static inline void dec_slabs_node(struct kmem_cache *s, int node, int objects)
+{
+ struct kmem_cache_node *n = get_node(s, node);
+
+ atomic_long_dec(&n->nr_slabs);
+#ifdef CONFIG_SLUB_DEBUG
+ atomic_long_sub(objects, &n->total_objects);
+#endif
+}
/*
* Hooks for other subsystems that check memory allocations. In a typical
_
Patches currently in -mm which might be from shakeelb(a)google.com are
slub-track-number-of-slabs-irrespective-of-config_slub_debug.patch
slub-fix-__kmem_cache_empty-for-config_slub_debug.patch
The patch titled
Subject: slub: fix __kmem_cache_empty for !CONFIG_SLUB_DEBUG
has been removed from the -mm tree. Its filename was
slub-fix-__kmem_cache_empty-for-config_slub_debug.patch
This patch was dropped because an updated version will be merged
------------------------------------------------------
From: Shakeel Butt <shakeelb(a)google.com>
Subject: slub: fix __kmem_cache_empty for !CONFIG_SLUB_DEBUG
f9e13c0a5a33 ("slab, slub: skip unnecessary kasan_cache_shutdown()")
causes crashes when using slub, as described at
http://lkml.kernel.org/r/CAHmME9rtoPwxUSnktxzKso14iuVCWT7BE_-_8PAC=pGw1iJnQ…
For !CONFIG_SLUB_DEBUG, SLUB does not maintain the number of slabs
allocated per node for a kmem_cache. Thus, slabs_node() in
__kmem_cache_empty() will always return 0. So, in that situation, the
per-cpu slabs must also be checked to determine whether a kmem_cache is
empty.
Please note that __kmem_cache_shutdown() and __kmem_cache_shrink() are not
affected by !CONFIG_SLUB_DEBUG, as they call flush_all() to clear the
per-cpu slabs.
Link: http://lkml.kernel.org/r/20180619213352.71740-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/CAHmME9rtoPwxUSnktxzKso14iuVCWT7BE_-_8PAC=pGw1iJnQ…
Fixes: f9e13c0a5a33 ("slab, slub: skip unnecessary kasan_cache_shutdown()")
Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
Reported-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
Tested-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
Cc: Christoph Lameter <cl(a)linux.com>
Cc: Pekka Enberg <penberg(a)kernel.org>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: Andrey Ryabinin <aryabinin(a)virtuozzo.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/slub.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff -puN mm/slub.c~slub-fix-__kmem_cache_empty-for-config_slub_debug mm/slub.c
--- a/mm/slub.c~slub-fix-__kmem_cache_empty-for-config_slub_debug
+++ a/mm/slub.c
@@ -3673,9 +3673,23 @@ static void free_partial(struct kmem_cac
bool __kmem_cache_empty(struct kmem_cache *s)
{
- int node;
+ int cpu, node;
struct kmem_cache_node *n;
+ /*
+ * slabs_node will always be 0 for !CONFIG_SLUB_DEBUG. So, manually
+ * check slabs for all cpus.
+ */
+ if (!IS_ENABLED(CONFIG_SLUB_DEBUG)) {
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c;
+
+ c = per_cpu_ptr(s->cpu_slab, cpu);
+ if (c->page || slub_percpu_partial(c))
+ return false;
+ }
+ }
+
for_each_kmem_cache_node(s, node, n)
if (n->nr_partial || slabs_node(s, node))
return false;
_
Patches currently in -mm which might be from shakeelb(a)google.com are
slub-track-number-of-slabs-irrespective-of-config_slub_debug.patch
For !CONFIG_SLUB_DEBUG, SLUB does not maintain the number of slabs
allocated per node for a kmem_cache. Thus, slabs_node() in
__kmem_cache_empty() will always return 0. So, in that situation, the
per-cpu slabs must also be checked to determine whether a kmem_cache is
empty.
Please note that __kmem_cache_shutdown() and __kmem_cache_shrink() are
not affected by !CONFIG_SLUB_DEBUG, as they call flush_all() to clear
the per-cpu slabs.
Fixes: f9e13c0a5a33 ("slab, slub: skip unnecessary kasan_cache_shutdown()")
Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
Reported-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
Cc: Christoph Lameter <cl(a)linux.com>
Cc: Pekka Enberg <penberg(a)kernel.org>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: <stable(a)vger.kernel.org>
---
mm/slub.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/mm/slub.c b/mm/slub.c
index a3b8467c14af..731c02b371ae 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3673,9 +3673,23 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
bool __kmem_cache_empty(struct kmem_cache *s)
{
- int node;
+ int cpu, node;
struct kmem_cache_node *n;
+ /*
+ * slabs_node will always be 0 for !CONFIG_SLUB_DEBUG. So, manually
+ * check slabs for all cpus.
+ */
+ if (!IS_ENABLED(CONFIG_SLUB_DEBUG)) {
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c;
+
+ c = per_cpu_ptr(s->cpu_slab, cpu);
+ if (c->page || slub_percpu_partial(c))
+ return false;
+ }
+ }
+
for_each_kmem_cache_node(s, node, n)
if (n->nr_partial || slabs_node(s, node))
return false;
--
2.18.0.rc1.244.gcf134e6275-goog