From: Yicong Yang <yangyicong(a)hisilicon.com>
[ Upstream commit 0dd37d6dd33a9c23351e6115ae8cdac7863bc7de ]
We've run into a case where the balancer tries to balance a migration
disabled task and triggers the warning in set_task_cpu() shown below:
------------[ cut here ]------------
WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240
Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip>
CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.0-rc4+ #1
Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021
pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : set_task_cpu+0x188/0x240
lr : load_balance+0x5d0/0xc60
sp : ffff80000803bc70
x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040
x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001
x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78
x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000
x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000
x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530
x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e
x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a
x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001
Call trace:
set_task_cpu+0x188/0x240
load_balance+0x5d0/0xc60
rebalance_domains+0x26c/0x380
_nohz_idle_balance.isra.0+0x1e0/0x370
run_rebalance_domains+0x6c/0x80
__do_softirq+0x128/0x3d8
____do_softirq+0x18/0x24
call_on_irq_stack+0x2c/0x38
do_softirq_own_stack+0x24/0x3c
__irq_exit_rcu+0xcc/0xf4
irq_exit_rcu+0x18/0x24
el1_interrupt+0x4c/0xe4
el1h_64_irq_handler+0x18/0x2c
el1h_64_irq+0x74/0x78
arch_cpu_idle+0x18/0x4c
default_idle_call+0x58/0x194
do_idle+0x244/0x2b0
cpu_startup_entry+0x30/0x3c
secondary_start_kernel+0x14c/0x190
__secondary_switched+0xb0/0xb4
---[ end trace 0000000000000000 ]---
Further investigation shows that the warning is superfluous: the migration
disabled task is simply being migrated to the CPU it is already running on.
This happens because, during load balancing, if the dst_cpu is not allowed
by the task we re-select a new_dst_cpu as a candidate. If no task can be
balanced to dst_cpu, we try to balance the task to the new_dst_cpu instead.
When the migration disabled task is not on a CPU it is only allowed to run
on its current CPU, so load balancing selects that CPU as new_dst_cpu and
later triggers the warning above.
The new_dst_cpu is chosen from env->dst_grpmask. Currently that mask
contains the CPUs in sched_group_span(), and with overlapping groups it is
possible to run into this case. This patch sets env->dst_grpmask to
group_balance_mask(), which excludes any CPUs from the busiest group and
resolves the issue. For balancing in a domain with no overlapping groups
the behaviour is unchanged.
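For reference, the difference between the two masks comes down to which
cpumask each helper returns. A rough sketch of the two helpers, paraphrased
from the scheduler topology code (treat the exact definitions as
approximate):

/* Approximate definitions, for illustration only. */
static inline struct cpumask *sched_group_span(struct sched_group *sg)
{
	return to_cpumask(sg->cpumask);		/* every CPU covered by the group */
}

static inline struct cpumask *group_balance_mask(struct sched_group *sg)
{
	return to_cpumask(sg->sgc->cpumask);	/* CPUs allowed to balance on behalf of the group */
}

With non-overlapping groups the two masks are identical, which is why the
behaviour only changes for overlapping (e.g. NUMA) domains.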
Suggested-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Signed-off-by: Yicong Yang <yangyicong(a)hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Link: https://lore.kernel.org/r/20230530082507.10444-1-yangyicong@huawei.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 259996d2dcf7a..9d1e7b0bf486d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8142,7 +8142,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.sd = sd,
.dst_cpu = this_cpu,
.dst_rq = this_rq,
- .dst_grpmask = sched_group_span(sd->groups),
+ .dst_grpmask = group_balance_mask(sd->groups),
.idle = idle,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
--
2.39.2
From: Yicong Yang <yangyicong(a)hisilicon.com>
[ Upstream commit 0dd37d6dd33a9c23351e6115ae8cdac7863bc7de ]
We've run into a case where the balancer tries to balance a migration
disabled task and triggers the warning in set_task_cpu() shown below:
------------[ cut here ]------------
WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240
Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip>
CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.0-rc4+ #1
Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021
pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : set_task_cpu+0x188/0x240
lr : load_balance+0x5d0/0xc60
sp : ffff80000803bc70
x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040
x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001
x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78
x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000
x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000
x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530
x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e
x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a
x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001
Call trace:
set_task_cpu+0x188/0x240
load_balance+0x5d0/0xc60
rebalance_domains+0x26c/0x380
_nohz_idle_balance.isra.0+0x1e0/0x370
run_rebalance_domains+0x6c/0x80
__do_softirq+0x128/0x3d8
____do_softirq+0x18/0x24
call_on_irq_stack+0x2c/0x38
do_softirq_own_stack+0x24/0x3c
__irq_exit_rcu+0xcc/0xf4
irq_exit_rcu+0x18/0x24
el1_interrupt+0x4c/0xe4
el1h_64_irq_handler+0x18/0x2c
el1h_64_irq+0x74/0x78
arch_cpu_idle+0x18/0x4c
default_idle_call+0x58/0x194
do_idle+0x244/0x2b0
cpu_startup_entry+0x30/0x3c
secondary_start_kernel+0x14c/0x190
__secondary_switched+0xb0/0xb4
---[ end trace 0000000000000000 ]---
Further investigation shows that the warning is superfluous: the migration
disabled task is simply being migrated to the CPU it is already running on.
This happens because, during load balancing, if the dst_cpu is not allowed
by the task we re-select a new_dst_cpu as a candidate. If no task can be
balanced to dst_cpu, we try to balance the task to the new_dst_cpu instead.
When the migration disabled task is not on a CPU it is only allowed to run
on its current CPU, so load balancing selects that CPU as new_dst_cpu and
later triggers the warning above.
The new_dst_cpu is chosen from env->dst_grpmask. Currently that mask
contains the CPUs in sched_group_span(), and with overlapping groups it is
possible to run into this case. This patch sets env->dst_grpmask to
group_balance_mask(), which excludes any CPUs from the busiest group and
resolves the issue. For balancing in a domain with no overlapping groups
the behaviour is unchanged.
Suggested-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Signed-off-by: Yicong Yang <yangyicong(a)hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Link: https://lore.kernel.org/r/20230530082507.10444-1-yangyicong@huawei.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eb67f42fb96ba..09f82c84474b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8721,7 +8721,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.sd = sd,
.dst_cpu = this_cpu,
.dst_rq = this_rq,
- .dst_grpmask = sched_group_span(sd->groups),
+ .dst_grpmask = group_balance_mask(sd->groups),
.idle = idle,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
--
2.39.2
From: Yicong Yang <yangyicong(a)hisilicon.com>
[ Upstream commit 0dd37d6dd33a9c23351e6115ae8cdac7863bc7de ]
We've run into a case where the balancer tries to balance a migration
disabled task and triggers the warning in set_task_cpu() shown below:
------------[ cut here ]------------
WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240
Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip>
CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.0-rc4+ #1
Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021
pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : set_task_cpu+0x188/0x240
lr : load_balance+0x5d0/0xc60
sp : ffff80000803bc70
x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040
x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001
x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78
x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000
x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000
x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530
x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e
x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a
x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001
Call trace:
set_task_cpu+0x188/0x240
load_balance+0x5d0/0xc60
rebalance_domains+0x26c/0x380
_nohz_idle_balance.isra.0+0x1e0/0x370
run_rebalance_domains+0x6c/0x80
__do_softirq+0x128/0x3d8
____do_softirq+0x18/0x24
call_on_irq_stack+0x2c/0x38
do_softirq_own_stack+0x24/0x3c
__irq_exit_rcu+0xcc/0xf4
irq_exit_rcu+0x18/0x24
el1_interrupt+0x4c/0xe4
el1h_64_irq_handler+0x18/0x2c
el1h_64_irq+0x74/0x78
arch_cpu_idle+0x18/0x4c
default_idle_call+0x58/0x194
do_idle+0x244/0x2b0
cpu_startup_entry+0x30/0x3c
secondary_start_kernel+0x14c/0x190
__secondary_switched+0xb0/0xb4
---[ end trace 0000000000000000 ]---
Further investigation shows that the warning is superfluous: the migration
disabled task is simply being migrated to the CPU it is already running on.
This happens because, during load balancing, if the dst_cpu is not allowed
by the task we re-select a new_dst_cpu as a candidate. If no task can be
balanced to dst_cpu, we try to balance the task to the new_dst_cpu instead.
When the migration disabled task is not on a CPU it is only allowed to run
on its current CPU, so load balancing selects that CPU as new_dst_cpu and
later triggers the warning above.
The new_dst_cpu is chosen from env->dst_grpmask. Currently that mask
contains the CPUs in sched_group_span(), and with overlapping groups it is
possible to run into this case. This patch sets env->dst_grpmask to
group_balance_mask(), which excludes any CPUs from the busiest group and
resolves the issue. For balancing in a domain with no overlapping groups
the behaviour is unchanged.
Suggested-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Signed-off-by: Yicong Yang <yangyicong(a)hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Link: https://lore.kernel.org/r/20230530082507.10444-1-yangyicong@huawei.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9fcba0d2ab19b..2680216234ff2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8938,7 +8938,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.sd = sd,
.dst_cpu = this_cpu,
.dst_rq = this_rq,
- .dst_grpmask = sched_group_span(sd->groups),
+ .dst_grpmask = group_balance_mask(sd->groups),
.idle = idle,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
--
2.39.2
From: Yicong Yang <yangyicong(a)hisilicon.com>
[ Upstream commit 0dd37d6dd33a9c23351e6115ae8cdac7863bc7de ]
We've run into a case where the balancer tries to balance a migration
disabled task and triggers the warning in set_task_cpu() shown below:
------------[ cut here ]------------
WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240
Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip>
CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.0-rc4+ #1
Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021
pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : set_task_cpu+0x188/0x240
lr : load_balance+0x5d0/0xc60
sp : ffff80000803bc70
x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040
x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001
x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78
x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000
x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000
x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530
x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e
x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a
x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001
Call trace:
set_task_cpu+0x188/0x240
load_balance+0x5d0/0xc60
rebalance_domains+0x26c/0x380
_nohz_idle_balance.isra.0+0x1e0/0x370
run_rebalance_domains+0x6c/0x80
__do_softirq+0x128/0x3d8
____do_softirq+0x18/0x24
call_on_irq_stack+0x2c/0x38
do_softirq_own_stack+0x24/0x3c
__irq_exit_rcu+0xcc/0xf4
irq_exit_rcu+0x18/0x24
el1_interrupt+0x4c/0xe4
el1h_64_irq_handler+0x18/0x2c
el1h_64_irq+0x74/0x78
arch_cpu_idle+0x18/0x4c
default_idle_call+0x58/0x194
do_idle+0x244/0x2b0
cpu_startup_entry+0x30/0x3c
secondary_start_kernel+0x14c/0x190
__secondary_switched+0xb0/0xb4
---[ end trace 0000000000000000 ]---
Further investigation shows that the warning is superfluous: the migration
disabled task is simply being migrated to the CPU it is already running on.
This happens because, during load balancing, if the dst_cpu is not allowed
by the task we re-select a new_dst_cpu as a candidate. If no task can be
balanced to dst_cpu, we try to balance the task to the new_dst_cpu instead.
When the migration disabled task is not on a CPU it is only allowed to run
on its current CPU, so load balancing selects that CPU as new_dst_cpu and
later triggers the warning above.
The new_dst_cpu is chosen from env->dst_grpmask. Currently that mask
contains the CPUs in sched_group_span(), and with overlapping groups it is
possible to run into this case. This patch sets env->dst_grpmask to
group_balance_mask(), which excludes any CPUs from the busiest group and
resolves the issue. For balancing in a domain with no overlapping groups
the behaviour is unchanged.
Suggested-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Signed-off-by: Yicong Yang <yangyicong(a)hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Link: https://lore.kernel.org/r/20230530082507.10444-1-yangyicong@huawei.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 45c1d03aff735..d53f57ac76094 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9883,7 +9883,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.sd = sd,
.dst_cpu = this_cpu,
.dst_rq = this_rq,
- .dst_grpmask = sched_group_span(sd->groups),
+ .dst_grpmask = group_balance_mask(sd->groups),
.idle = idle,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
--
2.39.2
From: Yicong Yang <yangyicong(a)hisilicon.com>
[ Upstream commit 0dd37d6dd33a9c23351e6115ae8cdac7863bc7de ]
We've run into a case where the balancer tries to balance a migration
disabled task and triggers the warning in set_task_cpu() shown below:
------------[ cut here ]------------
WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240
Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip>
CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.0-rc4+ #1
Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021
pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : set_task_cpu+0x188/0x240
lr : load_balance+0x5d0/0xc60
sp : ffff80000803bc70
x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040
x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001
x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78
x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000
x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000
x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530
x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e
x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a
x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001
Call trace:
set_task_cpu+0x188/0x240
load_balance+0x5d0/0xc60
rebalance_domains+0x26c/0x380
_nohz_idle_balance.isra.0+0x1e0/0x370
run_rebalance_domains+0x6c/0x80
__do_softirq+0x128/0x3d8
____do_softirq+0x18/0x24
call_on_irq_stack+0x2c/0x38
do_softirq_own_stack+0x24/0x3c
__irq_exit_rcu+0xcc/0xf4
irq_exit_rcu+0x18/0x24
el1_interrupt+0x4c/0xe4
el1h_64_irq_handler+0x18/0x2c
el1h_64_irq+0x74/0x78
arch_cpu_idle+0x18/0x4c
default_idle_call+0x58/0x194
do_idle+0x244/0x2b0
cpu_startup_entry+0x30/0x3c
secondary_start_kernel+0x14c/0x190
__secondary_switched+0xb0/0xb4
---[ end trace 0000000000000000 ]---
Further investigation shows that the warning is superfluous: the migration
disabled task is simply being migrated to the CPU it is already running on.
This happens because, during load balancing, if the dst_cpu is not allowed
by the task we re-select a new_dst_cpu as a candidate. If no task can be
balanced to dst_cpu, we try to balance the task to the new_dst_cpu instead.
When the migration disabled task is not on a CPU it is only allowed to run
on its current CPU, so load balancing selects that CPU as new_dst_cpu and
later triggers the warning above.
The new_dst_cpu is chosen from env->dst_grpmask. Currently that mask
contains the CPUs in sched_group_span(), and with overlapping groups it is
possible to run into this case. This patch sets env->dst_grpmask to
group_balance_mask(), which excludes any CPUs from the busiest group and
resolves the issue. For balancing in a domain with no overlapping groups
the behaviour is unchanged.
Suggested-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Signed-off-by: Yicong Yang <yangyicong(a)hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Link: https://lore.kernel.org/r/20230530082507.10444-1-yangyicong@huawei.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 646a6ae4b2509..ab6cbd676a9dd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10193,7 +10193,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.sd = sd,
.dst_cpu = this_cpu,
.dst_rq = this_rq,
- .dst_grpmask = sched_group_span(sd->groups),
+ .dst_grpmask = group_balance_mask(sd->groups),
.idle = idle,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
--
2.39.2
From: Yicong Yang <yangyicong(a)hisilicon.com>
[ Upstream commit 0dd37d6dd33a9c23351e6115ae8cdac7863bc7de ]
We've run into a case where the balancer tries to balance a migration
disabled task and triggers the warning in set_task_cpu() shown below:
------------[ cut here ]------------
WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240
Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip>
CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.0-rc4+ #1
Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021
pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : set_task_cpu+0x188/0x240
lr : load_balance+0x5d0/0xc60
sp : ffff80000803bc70
x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040
x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001
x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78
x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000
x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000
x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530
x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e
x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a
x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001
Call trace:
set_task_cpu+0x188/0x240
load_balance+0x5d0/0xc60
rebalance_domains+0x26c/0x380
_nohz_idle_balance.isra.0+0x1e0/0x370
run_rebalance_domains+0x6c/0x80
__do_softirq+0x128/0x3d8
____do_softirq+0x18/0x24
call_on_irq_stack+0x2c/0x38
do_softirq_own_stack+0x24/0x3c
__irq_exit_rcu+0xcc/0xf4
irq_exit_rcu+0x18/0x24
el1_interrupt+0x4c/0xe4
el1h_64_irq_handler+0x18/0x2c
el1h_64_irq+0x74/0x78
arch_cpu_idle+0x18/0x4c
default_idle_call+0x58/0x194
do_idle+0x244/0x2b0
cpu_startup_entry+0x30/0x3c
secondary_start_kernel+0x14c/0x190
__secondary_switched+0xb0/0xb4
---[ end trace 0000000000000000 ]---
Further investigation shows that the warning is superfluous: the migration
disabled task is simply being migrated to the CPU it is already running on.
This happens because, during load balancing, if the dst_cpu is not allowed
by the task we re-select a new_dst_cpu as a candidate. If no task can be
balanced to dst_cpu, we try to balance the task to the new_dst_cpu instead.
When the migration disabled task is not on a CPU it is only allowed to run
on its current CPU, so load balancing selects that CPU as new_dst_cpu and
later triggers the warning above.
The new_dst_cpu is chosen from env->dst_grpmask. Currently that mask
contains the CPUs in sched_group_span(), and with overlapping groups it is
possible to run into this case. This patch sets env->dst_grpmask to
group_balance_mask(), which excludes any CPUs from the busiest group and
resolves the issue. For balancing in a domain with no overlapping groups
the behaviour is unchanged.
Suggested-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Signed-off-by: Yicong Yang <yangyicong(a)hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Link: https://lore.kernel.org/r/20230530082507.10444-1-yangyicong@huawei.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fa33c441ae867..57d39de0962d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10556,7 +10556,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.sd = sd,
.dst_cpu = this_cpu,
.dst_rq = this_rq,
- .dst_grpmask = sched_group_span(sd->groups),
+ .dst_grpmask = group_balance_mask(sd->groups),
.idle = idle,
.loop_break = SCHED_NR_MIGRATE_BREAK,
.cpus = cpus,
--
2.39.2
From: Yicong Yang <yangyicong(a)hisilicon.com>
[ Upstream commit 0dd37d6dd33a9c23351e6115ae8cdac7863bc7de ]
We've run into a case where the balancer tries to balance a migration
disabled task and triggers the warning in set_task_cpu() shown below:
------------[ cut here ]------------
WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240
Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip>
CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.0-rc4+ #1
Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021
pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : set_task_cpu+0x188/0x240
lr : load_balance+0x5d0/0xc60
sp : ffff80000803bc70
x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040
x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001
x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78
x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000
x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000
x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530
x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e
x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a
x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001
Call trace:
set_task_cpu+0x188/0x240
load_balance+0x5d0/0xc60
rebalance_domains+0x26c/0x380
_nohz_idle_balance.isra.0+0x1e0/0x370
run_rebalance_domains+0x6c/0x80
__do_softirq+0x128/0x3d8
____do_softirq+0x18/0x24
call_on_irq_stack+0x2c/0x38
do_softirq_own_stack+0x24/0x3c
__irq_exit_rcu+0xcc/0xf4
irq_exit_rcu+0x18/0x24
el1_interrupt+0x4c/0xe4
el1h_64_irq_handler+0x18/0x2c
el1h_64_irq+0x74/0x78
arch_cpu_idle+0x18/0x4c
default_idle_call+0x58/0x194
do_idle+0x244/0x2b0
cpu_startup_entry+0x30/0x3c
secondary_start_kernel+0x14c/0x190
__secondary_switched+0xb0/0xb4
---[ end trace 0000000000000000 ]---
Further investigation shows that the warning is superfluous: the migration
disabled task is simply being migrated to the CPU it is already running on.
This happens because, during load balancing, if the dst_cpu is not allowed
by the task we re-select a new_dst_cpu as a candidate. If no task can be
balanced to dst_cpu, we try to balance the task to the new_dst_cpu instead.
When the migration disabled task is not on a CPU it is only allowed to run
on its current CPU, so load balancing selects that CPU as new_dst_cpu and
later triggers the warning above.
The new_dst_cpu is chosen from env->dst_grpmask. Currently that mask
contains the CPUs in sched_group_span(), and with overlapping groups it is
possible to run into this case. This patch sets env->dst_grpmask to
group_balance_mask(), which excludes any CPUs from the busiest group and
resolves the issue. For balancing in a domain with no overlapping groups
the behaviour is unchanged.
Suggested-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Signed-off-by: Yicong Yang <yangyicong(a)hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Link: https://lore.kernel.org/r/20230530082507.10444-1-yangyicong@huawei.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ed89be0aa6503..0e263417d7f93 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10683,7 +10683,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.sd = sd,
.dst_cpu = this_cpu,
.dst_rq = this_rq,
- .dst_grpmask = sched_group_span(sd->groups),
+ .dst_grpmask = group_balance_mask(sd->groups),
.idle = idle,
.loop_break = SCHED_NR_MIGRATE_BREAK,
.cpus = cpus,
--
2.39.2
From: Yicong Yang <yangyicong(a)hisilicon.com>
[ Upstream commit 0dd37d6dd33a9c23351e6115ae8cdac7863bc7de ]
We've run into a case where the balancer tries to balance a migration
disabled task and triggers the warning in set_task_cpu() shown below:
------------[ cut here ]------------
WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240
Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip>
CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.0-rc4+ #1
Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021
pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : set_task_cpu+0x188/0x240
lr : load_balance+0x5d0/0xc60
sp : ffff80000803bc70
x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040
x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001
x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78
x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000
x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000
x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530
x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e
x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a
x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001
Call trace:
set_task_cpu+0x188/0x240
load_balance+0x5d0/0xc60
rebalance_domains+0x26c/0x380
_nohz_idle_balance.isra.0+0x1e0/0x370
run_rebalance_domains+0x6c/0x80
__do_softirq+0x128/0x3d8
____do_softirq+0x18/0x24
call_on_irq_stack+0x2c/0x38
do_softirq_own_stack+0x24/0x3c
__irq_exit_rcu+0xcc/0xf4
irq_exit_rcu+0x18/0x24
el1_interrupt+0x4c/0xe4
el1h_64_irq_handler+0x18/0x2c
el1h_64_irq+0x74/0x78
arch_cpu_idle+0x18/0x4c
default_idle_call+0x58/0x194
do_idle+0x244/0x2b0
cpu_startup_entry+0x30/0x3c
secondary_start_kernel+0x14c/0x190
__secondary_switched+0xb0/0xb4
---[ end trace 0000000000000000 ]---
Further investigation shows that the warning is superfluous: the migration
disabled task is simply being migrated to the CPU it is already running on.
This happens because, during load balancing, if the dst_cpu is not allowed
by the task we re-select a new_dst_cpu as a candidate. If no task can be
balanced to dst_cpu, we try to balance the task to the new_dst_cpu instead.
When the migration disabled task is not on a CPU it is only allowed to run
on its current CPU, so load balancing selects that CPU as new_dst_cpu and
later triggers the warning above.
The new_dst_cpu is chosen from env->dst_grpmask. Currently that mask
contains the CPUs in sched_group_span(), and with overlapping groups it is
possible to run into this case. This patch sets env->dst_grpmask to
group_balance_mask(), which excludes any CPUs from the busiest group and
resolves the issue. For balancing in a domain with no overlapping groups
the behaviour is unchanged.
Suggested-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Signed-off-by: Yicong Yang <yangyicong(a)hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Link: https://lore.kernel.org/r/20230530082507.10444-1-yangyicong@huawei.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f558844..0128dc9344ccf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10744,7 +10744,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.sd = sd,
.dst_cpu = this_cpu,
.dst_rq = this_rq,
- .dst_grpmask = sched_group_span(sd->groups),
+ .dst_grpmask = group_balance_mask(sd->groups),
.idle = idle,
.loop_break = SCHED_NR_MIGRATE_BREAK,
.cpus = cpus,
--
2.39.2
From: "Paul E. McKenney" <paulmck(a)kernel.org>
[ Upstream commit a24c1aab652ebacf9ea62470a166514174c96fe1 ]
The rcu_data structure's ->rcu_cpu_has_work field can be modified by
any CPU attempting to wake up the rcuc kthread. Therefore, this commit
marks accesses to this field from the rcu_cpu_kthread() function.
This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.
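As a rough illustration of the pattern (a sketch with made-up names, not the
rcu_data code itself), marking both sides of a flag that other CPUs may
write concurrently documents the intentional data race and keeps KCSAN
quiet:

/* Illustrative only: a flag set by any CPU and polled by a kthread. */
static int cpu_has_work;

static void wake_worker(void)
{
	WRITE_ONCE(cpu_has_work, 1);	/* a plain "cpu_has_work = 1" would be a data race */
}

static void worker_loop(void)
{
	if (READ_ONCE(cpu_has_work))	/* marked read pairs with the marked write */
		process_work();		/* process_work() is a placeholder */
}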
Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/rcu/tree.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 615283404d9dc..98d64f107fbb7 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2457,12 +2457,12 @@ static void rcu_cpu_kthread(unsigned int cpu)
*statusp = RCU_KTHREAD_RUNNING;
local_irq_disable();
work = *workp;
- *workp = 0;
+ WRITE_ONCE(*workp, 0);
local_irq_enable();
if (work)
rcu_core();
local_bh_enable();
- if (*workp == 0) {
+ if (!READ_ONCE(*workp)) {
trace_rcu_utilization(TPS("End CPU kthread@rcu_wait"));
*statusp = RCU_KTHREAD_WAITING;
return;
--
2.39.2
From: "Paul E. McKenney" <paulmck(a)kernel.org>
[ Upstream commit a24c1aab652ebacf9ea62470a166514174c96fe1 ]
The rcu_data structure's ->rcu_cpu_has_work field can be modified by
any CPU attempting to wake up the rcuc kthread. Therefore, this commit
marks accesses to this field from the rcu_cpu_kthread() function.
This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.
Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/rcu/tree.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index eec8e2f7537eb..b2c1ab260ed56 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2810,12 +2810,12 @@ static void rcu_cpu_kthread(unsigned int cpu)
*statusp = RCU_KTHREAD_RUNNING;
local_irq_disable();
work = *workp;
- *workp = 0;
+ WRITE_ONCE(*workp, 0);
local_irq_enable();
if (work)
rcu_core();
local_bh_enable();
- if (*workp == 0) {
+ if (!READ_ONCE(*workp)) {
trace_rcu_utilization(TPS("End CPU kthread@rcu_wait"));
*statusp = RCU_KTHREAD_WAITING;
return;
--
2.39.2
From: "Paul E. McKenney" <paulmck(a)kernel.org>
[ Upstream commit a24c1aab652ebacf9ea62470a166514174c96fe1 ]
The rcu_data structure's ->rcu_cpu_has_work field can be modified by
any CPU attempting to wake up the rcuc kthread. Therefore, this commit
marks accesses to this field from the rcu_cpu_kthread() function.
This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.
Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/rcu/tree.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index df016f6d0662c..48f3e90c5de53 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2826,12 +2826,12 @@ static void rcu_cpu_kthread(unsigned int cpu)
*statusp = RCU_KTHREAD_RUNNING;
local_irq_disable();
work = *workp;
- *workp = 0;
+ WRITE_ONCE(*workp, 0);
local_irq_enable();
if (work)
rcu_core();
local_bh_enable();
- if (*workp == 0) {
+ if (!READ_ONCE(*workp)) {
trace_rcu_utilization(TPS("End CPU kthread@rcu_wait"));
*statusp = RCU_KTHREAD_WAITING;
return;
--
2.39.2
The blamed commit introduces usage of fixed_phy_register() but
not a corresponding dependency on FIXED_PHY.
This can result in a build failure.
s390-linux-ld: drivers/net/ethernet/microchip/lan743x_main.o: in function `lan743x_phy_open':
drivers/net/ethernet/microchip/lan743x_main.c:1514: undefined reference to `fixed_phy_register'
Fixes: 624864fbff92 ("net: lan743x: add fixed phy support for LAN7431 device")
Cc: stable(a)vger.kernel.org
Reported-by: Randy Dunlap <rdunlap(a)infradead.org>
Closes: https://lore.kernel.org/netdev/725bf1c5-b252-7d19-7582-a6809716c7d6@infrade…
Reviewed-by: Randy Dunlap <rdunlap(a)infradead.org>
Tested-by: Randy Dunlap <rdunlap(a)infradead.org> # build-tested
Signed-off-by: Simon Horman <horms(a)kernel.org>
---
drivers/net/ethernet/microchip/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/microchip/Kconfig b/drivers/net/ethernet/microchip/Kconfig
index 24c994baad13..329e374b9539 100644
--- a/drivers/net/ethernet/microchip/Kconfig
+++ b/drivers/net/ethernet/microchip/Kconfig
@@ -46,7 +46,7 @@ config LAN743X
tristate "LAN743x support"
depends on PCI
depends on PTP_1588_CLOCK_OPTIONAL
- select PHYLIB
+ select FIXED_PHY
select CRC16
select CRC32
help
Currently it only indicates a change in window size; I expect the si_code
value is also 0 for this signal. The extension will be for mouse input,
with the difference indicated by si_code being 1. To avoid issues with X11
vs. Wayland vs. other environments, a custom structure should be pointed to
by the si_addr parameter. I think the custom structure should look
something like this:
struct ttymouse {
	uint button_mask;
	int x, y, wheel;
};
The patch titled
Subject: prctl: move PR_GET_AUXV out of PR_MCE_KILL
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
prctl-move-pr_get_auxv-out-of-pr_mce_kill.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Miguel Ojeda <ojeda(a)kernel.org>
Subject: prctl: move PR_GET_AUXV out of PR_MCE_KILL
Date: Sun, 9 Jul 2023 01:33:44 +0200
Somehow PR_GET_AUXV got added into PR_MCE_KILL's switch when the patch was
applied [1].
Thus move it out of the switch, to the place the patch added it.
In the recently released v6.4 kernel some user could, in principle, be
already using this feature by mapping the right page and passing the
PR_GET_AUXV constant as a pointer:
prctl(PR_MCE_KILL, PR_GET_AUXV, ...)
So this does change the behavior for users. We could keep the bug since
the other subcases in PR_MCE_KILL (PR_MCE_KILL_CLEAR and PR_MCE_KILL_SET)
do not overlap.
However, v6.4 may be recent enough (2 weeks old) that moving the lines
(rather than just adding a new case) does not break anybody? Moreover,
the documentation in man-pages was just committed today [2].
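For completeness, a minimal userspace sketch of the intended top-level
usage; the PR_GET_AUXV fallback define and the buffer size below are
assumptions for illustration, not taken from this patch:

#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

#ifndef PR_GET_AUXV
#define PR_GET_AUXV 0x41555856	/* "AUXV"; assumed fallback for pre-6.4 headers */
#endif

int main(void)
{
	unsigned long auxv[64];
	int ret;

	memset(auxv, 0, sizeof(auxv));
	/* Ask the kernel to copy this process's saved auxv into auxv[]. */
	ret = prctl(PR_GET_AUXV, (unsigned long)auxv, sizeof(auxv), 0, 0);
	if (ret < 0) {
		perror("prctl(PR_GET_AUXV)");
		return 1;
	}
	printf("prctl(PR_GET_AUXV) returned %d\n", ret);
	return 0;
}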
Link: https://lkml.kernel.org/r/20230708233344.361854-1-ojeda@kernel.org
Fixes: ddc65971bb67 ("prctl: add PR_GET_AUXV to copy auxv to userspace")
Link: https://lore.kernel.org/all/d81864a7f7f43bca6afa2a09fc2e850e4050ab42.168061… [1]
Link: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=8cf0… [2]
Signed-off-by: Miguel Ojeda <ojeda(a)kernel.org>
Cc: Josh Triplett <josh(a)joshtriplett.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/sys.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
--- a/kernel/sys.c~prctl-move-pr_get_auxv-out-of-pr_mce_kill
+++ a/kernel/sys.c
@@ -2535,11 +2535,6 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
else
return -EINVAL;
break;
- case PR_GET_AUXV:
- if (arg4 || arg5)
- return -EINVAL;
- error = prctl_get_auxv((void __user *)arg2, arg3);
- break;
default:
return -EINVAL;
}
@@ -2694,6 +2689,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
case PR_SET_VMA:
error = prctl_set_vma(arg2, arg3, arg4, arg5);
break;
+ case PR_GET_AUXV:
+ if (arg4 || arg5)
+ return -EINVAL;
+ error = prctl_get_auxv((void __user *)arg2, arg3);
+ break;
#ifdef CONFIG_KSM
case PR_SET_MEMORY_MERGE:
if (arg3 || arg4 || arg5)
_
Patches currently in -mm which might be from ojeda(a)kernel.org are
prctl-move-pr_get_auxv-out-of-pr_mce_kill.patch
The quilt patch titled
Subject: kasan, slub: fix HW_TAGS zeroing with slub_debug
has been removed from the -mm tree. Its filename was
kasan-slub-fix-hw_tags-zeroing-with-slub_debug.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Andrey Konovalov <andreyknvl(a)google.com>
Subject: kasan, slub: fix HW_TAGS zeroing with slub_debug
Date: Wed, 5 Jul 2023 14:44:02 +0200
Commit 946fa0dbf2d8 ("mm/slub: extend redzone check to extra allocated
kmalloc space than requested") added precise kmalloc redzone poisoning to
the slub_debug functionality.
However, this commit didn't account for HW_TAGS KASAN fully initializing
the object via its built-in memory initialization feature. Even though
HW_TAGS KASAN memory initialization contains special memory initialization
handling for when slub_debug is enabled, it does not account for in-object
slub_debug redzones. As a result, HW_TAGS KASAN can overwrite these
redzones and cause false-positive slub_debug reports.
To fix the issue, avoid HW_TAGS KASAN memory initialization when
slub_debug is enabled altogether. Implement this by moving the
__slub_debug_enabled check to slab_post_alloc_hook. Common slab code
seems like a more appropriate place for a slub_debug check anyway.
Link: https://lkml.kernel.org/r/678ac92ab790dba9198f9ca14f405651b97c8502.16885610…
Fixes: 946fa0dbf2d8 ("mm/slub: extend redzone check to extra allocated kmalloc space than requested")
Signed-off-by: Andrey Konovalov <andreyknvl(a)google.com>
Reported-by: Will Deacon <will(a)kernel.org>
Acked-by: Marco Elver <elver(a)google.com>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Christoph Lameter <cl(a)linux.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Feng Tang <feng.tang(a)intel.com>
Cc: Hyeonggon Yoo <42.hyeyoo(a)gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: kasan-dev(a)googlegroups.com
Cc: Pekka Enberg <penberg(a)kernel.org>
Cc: Peter Collingbourne <pcc(a)google.com>
Cc: Roman Gushchin <roman.gushchin(a)linux.dev>
Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kasan/kasan.h | 12 ------------
mm/slab.h | 16 ++++++++++++++--
2 files changed, 14 insertions(+), 14 deletions(-)
--- a/mm/kasan/kasan.h~kasan-slub-fix-hw_tags-zeroing-with-slub_debug
+++ a/mm/kasan/kasan.h
@@ -466,18 +466,6 @@ static inline void kasan_unpoison(const
if (WARN_ON((unsigned long)addr & KASAN_GRANULE_MASK))
return;
- /*
- * Explicitly initialize the memory with the precise object size to
- * avoid overwriting the slab redzone. This disables initialization in
- * the arch code and may thus lead to performance penalty. This penalty
- * does not affect production builds, as slab redzones are not enabled
- * there.
- */
- if (__slub_debug_enabled() &&
- init && ((unsigned long)size & KASAN_GRANULE_MASK)) {
- init = false;
- memzero_explicit((void *)addr, size);
- }
size = round_up(size, KASAN_GRANULE_SIZE);
hw_set_mem_tag_range((void *)addr, size, tag, init);
--- a/mm/slab.h~kasan-slub-fix-hw_tags-zeroing-with-slub_debug
+++ a/mm/slab.h
@@ -723,6 +723,7 @@ static inline void slab_post_alloc_hook(
unsigned int orig_size)
{
unsigned int zero_size = s->object_size;
+ bool kasan_init = init;
size_t i;
flags &= gfp_allowed_mask;
@@ -740,6 +741,17 @@ static inline void slab_post_alloc_hook(
zero_size = orig_size;
/*
+ * When slub_debug is enabled, avoid memory initialization integrated
+ * into KASAN and instead zero out the memory via the memset below with
+ * the proper size. Otherwise, KASAN might overwrite SLUB redzones and
+ * cause false-positive reports. This does not lead to a performance
+ * penalty on production builds, as slub_debug is not intended to be
+ * enabled there.
+ */
+ if (__slub_debug_enabled())
+ kasan_init = false;
+
+ /*
* As memory initialization might be integrated into KASAN,
* kasan_slab_alloc and initialization memset must be
* kept together to avoid discrepancies in behavior.
@@ -747,8 +759,8 @@ static inline void slab_post_alloc_hook(
* As p[i] might get tagged, memset and kmemleak hook come after KASAN.
*/
for (i = 0; i < size; i++) {
- p[i] = kasan_slab_alloc(s, p[i], flags, init);
- if (p[i] && init && !kasan_has_integrated_init())
+ p[i] = kasan_slab_alloc(s, p[i], flags, kasan_init);
+ if (p[i] && init && (!kasan_init || !kasan_has_integrated_init()))
memset(p[i], 0, zero_size);
kmemleak_alloc_recursive(p[i], s->object_size, 1,
s->flags, flags);
_
Patches currently in -mm which might be from andreyknvl(a)google.com are
The quilt patch titled
Subject: kasan: fix type cast in memory_is_poisoned_n
has been removed from the -mm tree. Its filename was
kasan-fix-type-cast-in-memory_is_poisoned_n.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Andrey Konovalov <andreyknvl(a)google.com>
Subject: kasan: fix type cast in memory_is_poisoned_n
Date: Tue, 4 Jul 2023 02:52:05 +0200
Commit bb6e04a173f0 ("kasan: use internal prototypes matching gcc-13
builtins") introduced a bug into the memory_is_poisoned_n implementation:
it effectively removed the cast to a signed integer type after applying
KASAN_GRANULE_MASK.
As a result, KASAN started failing to properly check memset, memcpy, and
other similar functions.
Fix the bug by adding the cast back (through an additional signed integer
variable to make the code more readable).
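A tiny worked example of why the signed cast matters, assuming a granule
size of 8 (so KASAN_GRANULE_MASK is 7) and a typical poison marker such as
0xFC in the shadow byte; this is a standalone demonstration, not kernel
code:

#include <stdio.h>

typedef signed char s8;

int main(void)
{
	s8 shadow = (s8)0xFC;		/* poison marker: negative as s8 */
	unsigned long masked = 7;	/* stands in for last_byte & KASAN_GRANULE_MASK */

	/* Buggy form: the left operand is unsigned, so shadow is converted to
	 * a huge unsigned value and the poisoned access is not flagged. */
	printf("unsigned compare: %d\n", masked >= shadow);	/* prints 0 */

	/* Fixed form: keep the comparison signed via an s8 variable. */
	s8 last_accessible_byte = masked;
	printf("signed compare:   %d\n", last_accessible_byte >= shadow);	/* prints 1 */

	return 0;
}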
Link: https://lkml.kernel.org/r/8c9e0251c2b8b81016255709d4ec42942dcaf018.16884318…
Fixes: bb6e04a173f0 ("kasan: use internal prototypes matching gcc-13 builtins")
Signed-off-by: Andrey Konovalov <andreyknvl(a)google.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kasan/generic.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/kasan/generic.c~kasan-fix-type-cast-in-memory_is_poisoned_n
+++ a/mm/kasan/generic.c
@@ -130,9 +130,10 @@ static __always_inline bool memory_is_po
if (unlikely(ret)) {
const void *last_byte = addr + size - 1;
s8 *last_shadow = (s8 *)kasan_mem_to_shadow(last_byte);
+ s8 last_accessible_byte = (unsigned long)last_byte & KASAN_GRANULE_MASK;
if (unlikely(ret != (unsigned long)last_shadow ||
- (((long)last_byte & KASAN_GRANULE_MASK) >= *last_shadow)))
+ last_accessible_byte >= *last_shadow))
return true;
}
return false;
_
Patches currently in -mm which might be from andreyknvl(a)google.com are
The quilt patch titled
Subject: bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page
has been removed from the -mm tree. Its filename was
bootmem-remove-the-vmemmap-pages-from-kmemleak-in-free_bootmem_page.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Liu Shixin <liushixin2(a)huawei.com>
Subject: bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page
Date: Tue, 4 Jul 2023 18:19:42 +0800
commit dd0ff4d12dd2 ("bootmem: remove the vmemmap pages from kmemleak in
put_page_bootmem") fixed an existing kmemleak overlap problem. But the
problem still exists when HAVE_BOOTMEM_INFO_NODE is disabled, because in
that case free_bootmem_page() calls free_reserved_page() directly.
Fix the problem by adding kmemleak_free_part() to free_bootmem_page() when
HAVE_BOOTMEM_INFO_NODE is disabled.
Link: https://lkml.kernel.org/r/20230704101942.2819426-1-liushixin2@huawei.com
Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
Signed-off-by: Liu Shixin <liushixin2(a)huawei.com>
Acked-by: Muchun Song <songmuchun(a)bytedance.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/bootmem_info.h | 2 ++
1 file changed, 2 insertions(+)
--- a/include/linux/bootmem_info.h~bootmem-remove-the-vmemmap-pages-from-kmemleak-in-free_bootmem_page
+++ a/include/linux/bootmem_info.h
@@ -3,6 +3,7 @@
#define __LINUX_BOOTMEM_INFO_H
#include <linux/mm.h>
+#include <linux/kmemleak.h>
/*
* Types for free bootmem stored in page->lru.next. These have to be in
@@ -59,6 +60,7 @@ static inline void get_page_bootmem(unsi
static inline void free_bootmem_page(struct page *page)
{
+ kmemleak_free_part(page_to_virt(page), PAGE_SIZE);
free_reserved_page(page);
}
#endif
_
Patches currently in -mm which might be from liushixin2(a)huawei.com are
The quilt patch titled
Subject: mm: call arch_swap_restore() from do_swap_page()
has been removed from the -mm tree. Its filename was
mm-call-arch_swap_restore-from-do_swap_page.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Peter Collingbourne <pcc(a)google.com>
Subject: mm: call arch_swap_restore() from do_swap_page()
Date: Mon, 22 May 2023 17:43:08 -0700
Commit c145e0b47c77 ("mm: streamline COW logic in do_swap_page()") moved
the call to swap_free() before the call to set_pte_at(), which meant that
the MTE tags could end up being freed before set_pte_at() had a chance to
restore them. Fix it by adding a call to the arch_swap_restore() hook
before the call to swap_free().
Link: https://lkml.kernel.org/r/20230523004312.1807357-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I6470efa669e8bd2f841049b8c61020c51…
Fixes: c145e0b47c77 ("mm: streamline COW logic in do_swap_page()")
Signed-off-by: Peter Collingbourne <pcc(a)google.com>
Reported-by: Qun-wei Lin <Qun-wei.Lin(a)mediatek.com>
Closes: https://lore.kernel.org/all/5050805753ac469e8d727c797c2218a9d780d434.camel@…
Acked-by: David Hildenbrand <david(a)redhat.com>
Acked-by: "Huang, Ying" <ying.huang(a)intel.com>
Reviewed-by: Steven Price <steven.price(a)arm.com>
Acked-by: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: <stable(a)vger.kernel.org> [6.1+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory.c | 7 +++++++
1 file changed, 7 insertions(+)
--- a/mm/memory.c~mm-call-arch_swap_restore-from-do_swap_page
+++ a/mm/memory.c
@@ -3954,6 +3954,13 @@ vm_fault_t do_swap_page(struct vm_fault
}
/*
+ * Some architectures may have to restore extra metadata to the page
+ * when reading from swap. This metadata may be indexed by swap entry
+ * so this must be called before swap_free().
+ */
+ arch_swap_restore(entry, folio);
+
+ /*
* Remove the swap entry and conditionally try to free up the swapcache.
* We're already holding a reference on the page but haven't mapped it
* yet.
_
Patches currently in -mm which might be from pcc(a)google.com are
mm-call-arch_swap_restore-from-unuse_pte.patch
arm64-mte-simplify-swap-tag-restoration-logic.patch
The quilt patch titled
Subject: mm: disable CONFIG_PER_VMA_LOCK until its fixed
has been removed from the -mm tree. Its filename was
mm-disable-config_per_vma_lock-until-its-fixed.patch
This patch was dropped because it is obsolete
------------------------------------------------------
From: Suren Baghdasaryan <surenb(a)google.com>
Subject: mm: disable CONFIG_PER_VMA_LOCK until its fixed
Date: Wed, 5 Jul 2023 18:14:00 -0700
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86. Disable per-VMA locks config to
prevent this issue until the fix is confirmed. This is expected to be a
temporary measure.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
[2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@google.com
Link: https://lkml.kernel.org/r/20230706011400.2949242-3-surenb@google.com
Reported-by: Jiri Slaby <jirislaby(a)kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Jacob Young <jacobly.alt(a)gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Signed-off-by: Suren Baghdasaryan <surenb(a)google.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Holger Hoffstätte <holger(a)applied-asynchrony.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/Kconfig | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/Kconfig~mm-disable-config_per_vma_lock-until-its-fixed
+++ a/mm/Kconfig
@@ -1224,8 +1224,9 @@ config ARCH_SUPPORTS_PER_VMA_LOCK
def_bool n
config PER_VMA_LOCK
- def_bool y
+ bool "Enable per-vma locking during page fault handling."
depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
+ depends on BROKEN
help
Allow per-vma locking during page fault handling.
_
Patches currently in -mm which might be from surenb(a)google.com are
mm-lock-a-vma-before-stack-expansion.patch
mm-lock-newly-mapped-vma-which-can-be-modified-after-it-becomes-visible.patch
swap-remove-remnants-of-polling-from-read_swap_cache_async.patch
mm-add-missing-vm_fault_result_trace-name-for-vm_fault_completed.patch
mm-drop-per-vma-lock-when-returning-vm_fault_retry-or-vm_fault_completed.patch
mm-change-folio_lock_or_retry-to-use-vm_fault-directly.patch
mm-handle-swap-page-faults-under-per-vma-lock.patch
mm-handle-userfaults-under-vma-lock.patch
The quilt patch titled
Subject: fork: lock VMAs of the parent process when forking
has been removed from the -mm tree. Its filename was
fork-lock-vmas-of-the-parent-process-when-forking.patch
This patch was dropped because it is obsolete
------------------------------------------------------
From: Suren Baghdasaryan <surenb(a)google.com>
Subject: fork: lock VMAs of the parent process when forking
Date: Wed, 5 Jul 2023 18:13:59 -0700
Patch series "Avoid memory corruption caused by per-VMA locks", v4.
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86. Based on the reproducer
provided in [1] we suspect this is caused by the lack of VMA locking while
forking a child process.
Patch 1/2 in the series implements proper VMA locking during fork. I
tested the fix locally using the reproducer and was unable to reproduce
the memory corruption problem.
This fix can potentially regress some fork-heavy workloads. Kernel build
time did not show noticeable regression on a 56-core machine while a
stress test mapping 10000 VMAs and forking 5000 times in a tight loop
shows ~7% regression. If such fork time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
optimizations are possible if this regression proves to be problematic.
Patch 2/2 disables per-VMA locks until the fix is tested and verified.
This patch (of 2):
When forking a child process, parent write-protects an anonymous page and
COW-shares it with the child being forked using copy_present_pte().
Parent's TLB is flushed right before we drop the parent's mmap_lock in
dup_mmap(). If we get a write-fault before that TLB flush in the parent,
and we end up replacing that anonymous page in the parent process in
do_wp_page() (because, COW-shared with the child), this might lead to some
stale writable TLB entries targeting the wrong (old) page. Similar issue
happened in the past with userfaultfd (see flush_tlb_page() call inside
do_wp_page()).
Lock VMAs of the parent process when forking a child, which prevents
concurrent page faults during fork operation and avoids this issue. This
fix can potentially regress some fork-heavy workloads. Kernel build time
did not show noticeable regression on a 56-core machine while a stress
test mapping 10000 VMAs and forking 5000 times in a tight loop shows ~7%
regression. If such fork time regression is unacceptable, disabling
CONFIG_PER_VMA_LOCK should restore its performance. Further optimizations
are possible if this regression proves to be problematic.
Link: https://lkml.kernel.org/r/20230706011400.2949242-1-surenb@google.com
Link: https://lkml.kernel.org/r/20230706011400.2949242-2-surenb@google.com
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Signed-off-by: Suren Baghdasaryan <surenb(a)google.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Reported-by: Jiri Slaby <jirislaby(a)kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger(a)applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-as…
Reported-by: Jacob Young <jacobly.alt(a)gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Tested-by: Holger Hoffstätte <holger(a)applied-asynchrony.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/fork.c | 6 ++++++
1 file changed, 6 insertions(+)
--- a/kernel/fork.c~fork-lock-vmas-of-the-parent-process-when-forking
+++ a/kernel/fork.c
@@ -658,6 +658,12 @@ static __latent_entropy int dup_mmap(str
retval = -EINTR;
goto fail_uprobe_end;
}
+#ifdef CONFIG_PER_VMA_LOCK
+ /* Disallow any page faults before calling flush_cache_dup_mm */
+ for_each_vma(old_vmi, mpnt)
+ vma_start_write(mpnt);
+ vma_iter_set(&old_vmi, 0);
+#endif
flush_cache_dup_mm(oldmm);
uprobe_dup_mmap(oldmm, mm);
/*
_
Patches currently in -mm which might be from surenb(a)google.com are
mm-disable-config_per_vma_lock-until-its-fixed.patch
mm-lock-a-vma-before-stack-expansion.patch
mm-lock-newly-mapped-vma-which-can-be-modified-after-it-becomes-visible.patch
swap-remove-remnants-of-polling-from-read_swap_cache_async.patch
mm-add-missing-vm_fault_result_trace-name-for-vm_fault_completed.patch
mm-drop-per-vma-lock-when-returning-vm_fault_retry-or-vm_fault_completed.patch
mm-change-folio_lock_or_retry-to-use-vm_fault-directly.patch
mm-handle-swap-page-faults-under-per-vma-lock.patch
mm-handle-userfaults-under-vma-lock.patch
Somehow PR_GET_AUXV got added into PR_MCE_KILL's switch when
the patch was applied [1].
Thus, move it out of that switch to the place the patch originally added it.
In the recently released v6.4 kernel some users could, in
principle, already be using this feature by mapping the right
page and passing the PR_GET_AUXV constant as a pointer:
prctl(PR_MCE_KILL, PR_GET_AUXV, ...)
So this does change the behavior for those users. We could keep the bug,
since the other subcases in PR_MCE_KILL (PR_MCE_KILL_CLEAR and
PR_MCE_KILL_SET) do not overlap.
However, v6.4 may be recent enough (2 weeks old) that moving
the lines (rather than just adding a new case) does not break
anybody? Moreover, the documentation in man-pages was just
committed today [2].
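For reference, a small example of the intended post-fix usage from userspace follows: PR_GET_AUXV is passed directly as the prctl option, with arg4/arg5 zero exactly as the moved switch case requires. The fallback #define mirrors the uapi constant introduced by the original patch and is an assumption only needed if the installed headers predate v6.4.

```c
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_GET_AUXV
#define PR_GET_AUXV 0x41555856	/* "AUXV"; assumed fallback for pre-6.4 headers */
#endif

int main(void)
{
	unsigned long auxv[64] = { 0 };
	int ret;

	/* arg4 and arg5 must be 0, matching the check in the switch case. */
	ret = prctl(PR_GET_AUXV, (unsigned long)auxv, sizeof(auxv), 0UL, 0UL);
	if (ret < 0) {
		perror("prctl(PR_GET_AUXV)");
		return 1;
	}

	/* auxv[] now holds AT_* key/value pairs copied from the kernel. */
	printf("auxv[0]=%lu auxv[1]=0x%lx\n", auxv[0], auxv[1]);
	return 0;
}
```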
Fixes: ddc65971bb67 ("prctl: add PR_GET_AUXV to copy auxv to userspace")
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/d81864a7f7f43bca6afa2a09fc2e850e4050ab42.168061… [1]
Link: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=8cf0… [2]
Signed-off-by: Miguel Ojeda <ojeda(a)kernel.org>
---
kernel/sys.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/sys.c b/kernel/sys.c
index 339fee3eff6a..a36a27ebac33 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2529,11 +2529,6 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
else
return -EINVAL;
break;
- case PR_GET_AUXV:
- if (arg4 || arg5)
- return -EINVAL;
- error = prctl_get_auxv((void __user *)arg2, arg3);
- break;
default:
return -EINVAL;
}
@@ -2688,6 +2683,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_SET_VMA:
error = prctl_set_vma(arg2, arg3, arg4, arg5);
break;
+ case PR_GET_AUXV:
+ if (arg4 || arg5)
+ return -EINVAL;
+ error = prctl_get_auxv((void __user *)arg2, arg3);
+ break;
#ifdef CONFIG_KSM
case PR_SET_MEMORY_MERGE:
if (arg3 || arg4 || arg5)
base-commit: 6995e2de6891c724bfeb2db33d7b87775f913ad1
--
2.41.0
Lockdep is certainly right to complain about
(&vma->vm_lock->lock){++++}-{3:3}, at: vma_start_write+0x2d/0x3f
but task is already holding lock:
(&mapping->i_mmap_rwsem){+.+.}-{3:3}, at: mmap_region+0x4dc/0x6db
Invert those to the usual ordering.
Fixes: 33313a747e81 ("mm: lock newly mapped VMA which can be modified after it becomes visible")
Cc: stable(a)vger.kernel.org
Signed-off-by: Hugh Dickins <hughd(a)google.com>
---
mm/mmap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 84c71431a527..3eda23c9ebe7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2809,11 +2809,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (vma_iter_prealloc(&vmi))
goto close_and_free_vma;
+ /* Lock the VMA since it is modified after insertion into VMA tree */
+ vma_start_write(vma);
if (vma->vm_file)
i_mmap_lock_write(vma->vm_file->f_mapping);
- /* Lock the VMA since it is modified after insertion into VMA tree */
- vma_start_write(vma);
vma_iter_store(&vmi, vma);
mm->map_count++;
if (vma->vm_file) {
--
2.35.3
Dear developer, I am a security researcher at Wuhan University. I recently discovered a vulnerability in a driver of the USB core module in the Linux kernel. This vulnerability leads to an infinite loop in the probe process of a USB device, which consumes a lot of system resources. The vulnerability was found in kernel version 5.6.19 and was tested to exist in the newer 6.3.7 kernel version as well. I hope that after your review you will be able to apply for a CVE number to disclose this vulnerability. If you need more detailed vulnerability information, please contact me. Thank you for your help.
With recent changes necessitating mmap_lock to be held for write while
expanding a stack, per-VMA locks should follow the same rules and be
write-locked to prevent page faults into the VMA being expanded. Add
the necessary locking.
Signed-off-by: Suren Baghdasaryan <surenb(a)google.com>
---
mm/mmap.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c
index 204ddcd52625..c66e4622a557 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1977,6 +1977,8 @@ static int expand_upwards(struct vm_area_struct *vma, unsigned long address)
return -ENOMEM;
}
+ /* Lock the VMA before expanding to prevent concurrent page faults */
+ vma_start_write(vma);
/*
* vma->vm_start/vm_end cannot change under us because the caller
* is required to hold the mmap_lock in read mode. We need the
@@ -2064,6 +2066,8 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
return -ENOMEM;
}
+ /* Lock the VMA before expanding to prevent concurrent page faults */
+ vma_start_write(vma);
/*
* vma->vm_start/vm_end cannot change under us because the caller
* is required to hold the mmap_lock in read mode. We need the
--
2.41.0.255.g8b1d071c50-goog
The patch titled
Subject: mm: hugetlb_vmemmap: fix a race between vmemmap pmd split
has been added to the -mm mm-unstable branch. Its filename is
mm-hugetlb_vmemmap-fix-a-race-between-vmemmap-pmd-split.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Muchun Song <songmuchun(a)bytedance.com>
Subject: mm: hugetlb_vmemmap: fix a race between vmemmap pmd split
Date: Fri, 7 Jul 2023 11:38:59 +0800
The local variable @page in __split_vmemmap_huge_pmd(), which is obtained
by reading the pmd without holding page_table_lock, may possibly be the
page table page instead of a huge pmd page.
The effect shows up in set_pte_at(), since we may pass an invalid page
struct: if set_pte_at() accesses the page struct (e.g. when
CONFIG_PAGE_TABLE_CHECK is enabled), it may crash the kernel.
So fix it, and inline __split_vmemmap_huge_pmd() since it only has one
user.
Link: https://lkml.kernel.org/r/20230707033859.16148-1-songmuchun@bytedance.com
Fixes: d8d55f5616cf ("mm: sparsemem: use page table lock to protect kernel pmd operations")
Signed-off-by: Muchun Song <songmuchun(a)bytedance.com>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb_vmemmap.c | 34 ++++++++++++++--------------------
1 file changed, 14 insertions(+), 20 deletions(-)
--- a/mm/hugetlb_vmemmap.c~mm-hugetlb_vmemmap-fix-a-race-between-vmemmap-pmd-split
+++ a/mm/hugetlb_vmemmap.c
@@ -36,14 +36,22 @@ struct vmemmap_remap_walk {
struct list_head *vmemmap_pages;
};
-static int __split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
+static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
{
pmd_t __pmd;
int i;
unsigned long addr = start;
- struct page *page = pmd_page(*pmd);
- pte_t *pgtable = pte_alloc_one_kernel(&init_mm);
+ struct page *head;
+ pte_t *pgtable;
+
+ spin_lock(&init_mm.page_table_lock);
+ head = pmd_leaf(*pmd) ? pmd_page(*pmd) : NULL;
+ spin_unlock(&init_mm.page_table_lock);
+ if (!head)
+ return 0;
+
+ pgtable = pte_alloc_one_kernel(&init_mm);
if (!pgtable)
return -ENOMEM;
@@ -53,7 +61,7 @@ static int __split_vmemmap_huge_pmd(pmd_
pte_t entry, *pte;
pgprot_t pgprot = PAGE_KERNEL;
- entry = mk_pte(page + i, pgprot);
+ entry = mk_pte(head + i, pgprot);
pte = pte_offset_kernel(&__pmd, addr);
set_pte_at(&init_mm, addr, pte, entry);
}
@@ -65,8 +73,8 @@ static int __split_vmemmap_huge_pmd(pmd_
* be treated as indepdenent small pages (as they can be freed
* individually).
*/
- if (!PageReserved(page))
- split_page(page, get_order(PMD_SIZE));
+ if (!PageReserved(head))
+ split_page(head, get_order(PMD_SIZE));
/* Make pte visible before pmd. See comment in pmd_install(). */
smp_wmb();
@@ -80,20 +88,6 @@ static int __split_vmemmap_huge_pmd(pmd_
return 0;
}
-static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
-{
- int leaf;
-
- spin_lock(&init_mm.page_table_lock);
- leaf = pmd_leaf(*pmd);
- spin_unlock(&init_mm.page_table_lock);
-
- if (!leaf)
- return 0;
-
- return __split_vmemmap_huge_pmd(pmd, start);
-}
-
static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end,
struct vmemmap_remap_walk *walk)
_
Patches currently in -mm which might be from songmuchun(a)bytedance.com are
mm-hugetlb_vmemmap-fix-a-race-between-vmemmap-pmd-split.patch
Hi,
Can you please help in back porting the below patch to linux-5.15.y
stable tree:
commit d8c47cc7bf602ef73384a00869a70148146c1191("mm: page_io: fix psi
memory pressure error on cold swapins") .
In the absence of this patch we are seeing some user space tools, like
the Android low memory killer based on PSI events, become a bit
aggressive, since PSI is accounted even for cold swapins on a device
where swap is mounted on a zram with a slower backing device.
Thanks,
Charan
From: Rafał Miłecki <rafal(a)milecki.pl>
commit f99e6d7c4ed3be2531bd576425a5bd07fb133bd7 upstream.
While bringing hardware up we should perform a full reset including the
switch bit (BGMAC_BCMA_IOCTL_SW_RESET aka SICF_SWRST). It's what
specification says and what reference driver does.
This seems to be critical for the BCM5358. Without this, the hardware
doesn't get initialized properly and doesn't seem to transmit or receive
any packets.
Originally bgmac was calling bgmac_chip_reset() before setting
"has_robosw" property which resulted in expected behaviour. That has
changed as a side effect of adding platform device support which
regressed BCM5358 support.
Fixes: f6a95a24957a ("net: ethernet: bgmac: Add platform device support")
Cc: Jon Mason <jdmason(a)kudzu.us>
Signed-off-by: Rafał Miłecki <rafal(a)milecki.pl>
Reviewed-by: Leon Romanovsky <leonro(a)nvidia.com>
Reviewed-by: Florian Fainelli <f.fainelli(a)gmail.com>
Link: https://lore.kernel.org/r/20230227091156.19509-1-zajec5@gmail.com
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
The upstream commit wasn't backported to 5.4 (and older) because it
couldn't be cherry-picked cleanly. There was a small fuzz caused by the
missing commit 8c7da63978f1 ("bgmac: configure MTU and add support for
frames beyond 8192 byte size").
I've manually cherry-picked the BCM5358 fix to linux-5.4.x.
---
drivers/net/ethernet/broadcom/bgmac.c | 8 ++++++--
drivers/net/ethernet/broadcom/bgmac.h | 2 ++
2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bgmac.c b/drivers/net/ethernet/broadcom/bgmac.c
index 193722334d93..89a63fdbe0e3 100644
--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c
@@ -890,13 +890,13 @@ static void bgmac_chip_reset_idm_config(struct bgmac *bgmac)
if (iost & BGMAC_BCMA_IOST_ATTACHED) {
flags = BGMAC_BCMA_IOCTL_SW_CLKEN;
- if (!bgmac->has_robosw)
+ if (bgmac->in_init || !bgmac->has_robosw)
flags |= BGMAC_BCMA_IOCTL_SW_RESET;
}
bgmac_clk_enable(bgmac, flags);
}
- if (iost & BGMAC_BCMA_IOST_ATTACHED && !bgmac->has_robosw)
+ if (iost & BGMAC_BCMA_IOST_ATTACHED && (bgmac->in_init || !bgmac->has_robosw))
bgmac_idm_write(bgmac, BCMA_IOCTL,
bgmac_idm_read(bgmac, BCMA_IOCTL) &
~BGMAC_BCMA_IOCTL_SW_RESET);
@@ -1489,6 +1489,8 @@ int bgmac_enet_probe(struct bgmac *bgmac)
struct net_device *net_dev = bgmac->net_dev;
int err;
+ bgmac->in_init = true;
+
bgmac_chip_intrs_off(bgmac);
net_dev->irq = bgmac->irq;
@@ -1538,6 +1540,8 @@ int bgmac_enet_probe(struct bgmac *bgmac)
net_dev->hw_features = net_dev->features;
net_dev->vlan_features = net_dev->features;
+ bgmac->in_init = false;
+
err = register_netdev(bgmac->net_dev);
if (err) {
dev_err(bgmac->dev, "Cannot register net device\n");
diff --git a/drivers/net/ethernet/broadcom/bgmac.h b/drivers/net/ethernet/broadcom/bgmac.h
index 40d02fec2747..76930b8353d6 100644
--- a/drivers/net/ethernet/broadcom/bgmac.h
+++ b/drivers/net/ethernet/broadcom/bgmac.h
@@ -511,6 +511,8 @@ struct bgmac {
int irq;
u32 int_mask;
+ bool in_init;
+
/* Current MAC state */
int mac_speed;
int mac_duplex;
--
2.35.3
From: Quan Zhou <quan.zhou(a)mediatek.com>
In some cases, as below, we may encounter unpredictable chip state in
driver probe():
* The system reboot flow does not work properly, e.g. a kernel oops while
rebooting, after which the driver does not go back to its default state.
* Similar to the flow above: if the device was enabled in BIOS or UEFI,
the system may switch to Linux without the driver being fully shut down.
To avoid the problem, force the device back to its default state in probe():
* mt7921e_mcu_fw_pmctrl(): return control privilege to the chip side.
* mt7921_wfsys_reset(): clean up the chip config before resource init.
Error log
[59007.600714] mt7921e 0000:02:00.0: ASIC revision: 79220010
[59010.889773] mt7921e 0000:02:00.0: Message 00000010 (seq 1) timeout
[59010.889786] mt7921e 0000:02:00.0: Failed to get patch semaphore
[59014.217839] mt7921e 0000:02:00.0: Message 00000010 (seq 2) timeout
[59014.217852] mt7921e 0000:02:00.0: Failed to get patch semaphore
[59017.545880] mt7921e 0000:02:00.0: Message 00000010 (seq 3) timeout
[59017.545893] mt7921e 0000:02:00.0: Failed to get patch semaphore
[59020.874086] mt7921e 0000:02:00.0: Message 00000010 (seq 4) timeout
[59020.874099] mt7921e 0000:02:00.0: Failed to get patch semaphore
[59024.202019] mt7921e 0000:02:00.0: Message 00000010 (seq 5) timeout
[59024.202033] mt7921e 0000:02:00.0: Failed to get patch semaphore
[59027.530082] mt7921e 0000:02:00.0: Message 00000010 (seq 6) timeout
[59027.530096] mt7921e 0000:02:00.0: Failed to get patch semaphore
[59030.857888] mt7921e 0000:02:00.0: Message 00000010 (seq 7) timeout
[59030.857904] mt7921e 0000:02:00.0: Failed to get patch semaphore
[59034.185946] mt7921e 0000:02:00.0: Message 00000010 (seq 8) timeout
[59034.185961] mt7921e 0000:02:00.0: Failed to get patch semaphore
[59037.514249] mt7921e 0000:02:00.0: Message 00000010 (seq 9) timeout
[59037.514262] mt7921e 0000:02:00.0: Failed to get patch semaphore
[59040.842362] mt7921e 0000:02:00.0: Message 00000010 (seq 10) timeout
[59040.842375] mt7921e 0000:02:00.0: Failed to get patch semaphore
[59040.923845] mt7921e 0000:02:00.0: hardware init failed
Cc: stable(a)vger.kernel.org
Fixes: 5c14a5f944b9 ("mt76: mt7921: introduce mt7921e support")
Tested-by: Kai-Heng Feng <kai.heng.feng(a)canonical.com>
Tested-by: Juan Martinez <juan.martinez(a)amd.com>
Co-developed-by: Leon Yen <leon.yen(a)mediatek.com>
Signed-off-by: Leon Yen <leon.yen(a)mediatek.com>
Signed-off-by: Quan Zhou <quan.zhou(a)mediatek.com>
Signed-off-by: Deren Wu <deren.wu(a)mediatek.com>
---
v2: The v1 patch has been accepted in the wireless patchwork. However,
this patch is very important for existing systems, so we add the
Cc: stable tag and hope this patch can be pulled into the stable branches earlier.
---
drivers/net/wireless/mediatek/mt76/mt7921/dma.c | 4 ----
drivers/net/wireless/mediatek/mt76/mt7921/mcu.c | 8 --------
drivers/net/wireless/mediatek/mt76/mt7921/pci.c | 8 ++++++++
3 files changed, 8 insertions(+), 12 deletions(-)
diff --git a/drivers/net/wireless/mediatek/mt76/mt7921/dma.c b/drivers/net/wireless/mediatek/mt76/mt7921/dma.c
index f0a80c2b476a..4153cd6c2a01 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7921/dma.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7921/dma.c
@@ -231,10 +231,6 @@ int mt7921_dma_init(struct mt7921_dev *dev)
if (ret)
return ret;
- ret = mt7921_wfsys_reset(dev);
- if (ret)
- return ret;
-
/* init tx queue */
ret = mt76_connac_init_tx_queues(dev->phy.mt76, MT7921_TXQ_BAND0,
MT7921_TX_RING_SIZE,
diff --git a/drivers/net/wireless/mediatek/mt76/mt7921/mcu.c b/drivers/net/wireless/mediatek/mt76/mt7921/mcu.c
index c69ce6df4956..f55caa00ac69 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7921/mcu.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7921/mcu.c
@@ -476,12 +476,6 @@ static int mt7921_load_firmware(struct mt7921_dev *dev)
{
int ret;
- ret = mt76_get_field(dev, MT_CONN_ON_MISC, MT_TOP_MISC2_FW_N9_RDY);
- if (ret && mt76_is_mmio(&dev->mt76)) {
- dev_dbg(dev->mt76.dev, "Firmware is already download\n");
- goto fw_loaded;
- }
-
ret = mt76_connac2_load_patch(&dev->mt76, mt7921_patch_name(dev));
if (ret)
return ret;
@@ -504,8 +498,6 @@ static int mt7921_load_firmware(struct mt7921_dev *dev)
return -EIO;
}
-fw_loaded:
-
#ifdef CONFIG_PM
dev->mt76.hw->wiphy->wowlan = &mt76_connac_wowlan_support;
#endif /* CONFIG_PM */
diff --git a/drivers/net/wireless/mediatek/mt76/mt7921/pci.c b/drivers/net/wireless/mediatek/mt76/mt7921/pci.c
index 1c727870bbdb..6c512bc75685 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7921/pci.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7921/pci.c
@@ -325,6 +325,10 @@ static int mt7921_pci_probe(struct pci_dev *pdev,
bus_ops->rmw = mt7921_rmw;
dev->mt76.bus = bus_ops;
+ ret = mt7921e_mcu_fw_pmctrl(dev);
+ if (ret)
+ goto err_free_dev;
+
ret = __mt7921e_mcu_drv_pmctrl(dev);
if (ret)
goto err_free_dev;
@@ -333,6 +337,10 @@ static int mt7921_pci_probe(struct pci_dev *pdev,
(mt7921_l1_rr(dev, MT_HW_REV) & 0xff);
dev_info(mdev->dev, "ASIC revision: %04x\n", mdev->rev);
+ ret = mt7921_wfsys_reset(dev);
+ if (ret)
+ goto err_free_dev;
+
mt76_wr(dev, MT_WFDMA0_HOST_INT_ENA, 0);
mt76_wr(dev, MT_PCIE_MAC_INT_ENABLE, 0xff);
--
2.18.0
From: Long Li <longli(a)microsoft.com>
It's inefficient to ring the doorbell page every time a WQE is posted to
the receive queue. Excessive MMIO writes result in the CPU spending more
time waiting on LOCK instructions (atomic operations), resulting in
poor scaling performance.
Move the code that rings the doorbell page to after we have posted all
WQEs to the receive queue, during the callback from napi_poll().
With this change, tests showed an improvement from 120G/s to 160G/s on a
200G physical link, with 16 or 32 hardware queues.
Tests showed no regression in network latency benchmarks on a single
connection.
While we are making changes in this code path, change the doorbell-ringing
code to set WQE_COUNT to 0 for the Receive Queue. The hardware
specification requires that it be set to 0. Although the hardware
currently doesn't enforce the check, future releases may.
Cc: stable(a)vger.kernel.org
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Reviewed-by: Haiyang Zhang <haiyangz(a)microsoft.com>
Reviewed-by: Dexuan Cui <decui(a)microsoft.com>
Signed-off-by: Long Li <longli(a)microsoft.com>
---
Change log:
v2:
Check for comp_read > 0 as it might be negative on completion error.
Set rq.wqe_cnt to 0 according to BNIC spec.
v3:
Add details in the commit on the reason of performance increase and test numbers.
Add details in the commit on why rq.wqe_cnt should be set to 0 according to hardware spec.
Add "Reviewed-by" from Haiyang and Dexuan.
drivers/net/ethernet/microsoft/mana/gdma_main.c | 5 ++++-
drivers/net/ethernet/microsoft/mana/mana_en.c | 10 ++++++++--
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 8f3f78b68592..3765d3389a9a 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -300,8 +300,11 @@ static void mana_gd_ring_doorbell(struct gdma_context *gc, u32 db_index,
void mana_gd_wq_ring_doorbell(struct gdma_context *gc, struct gdma_queue *queue)
{
+ /* Hardware Spec specifies that software client should set 0 for
+ * wqe_cnt for Receive Queues. This value is not used in Send Queues.
+ */
mana_gd_ring_doorbell(gc, queue->gdma_dev->doorbell, queue->type,
- queue->id, queue->head * GDMA_WQE_BU_SIZE, 1);
+ queue->id, queue->head * GDMA_WQE_BU_SIZE, 0);
}
void mana_gd_ring_cq(struct gdma_queue *cq, u8 arm_bit)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index cd4d5ceb9f2d..1d8abe63fcb8 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1383,8 +1383,8 @@ static void mana_post_pkt_rxq(struct mana_rxq *rxq)
recv_buf_oob = &rxq->rx_oobs[curr_index];
- err = mana_gd_post_and_ring(rxq->gdma_rq, &recv_buf_oob->wqe_req,
- &recv_buf_oob->wqe_inf);
+ err = mana_gd_post_work_request(rxq->gdma_rq, &recv_buf_oob->wqe_req,
+ &recv_buf_oob->wqe_inf);
if (WARN_ON_ONCE(err))
return;
@@ -1654,6 +1654,12 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
mana_process_rx_cqe(rxq, cq, &comp[i]);
}
+ if (comp_read > 0) {
+ struct gdma_context *gc = rxq->gdma_rq->gdma_dev->gdma_context;
+
+ mana_gd_wq_ring_doorbell(gc, rxq->gdma_rq);
+ }
+
if (rxq->xdp_flush)
xdp_do_flush();
}
--
2.34.1
A crash was reported in amd-sfh related to hid core initialization
before SFH initialization has run.
```
amdtp_hid_request+0x36/0x50 [amd_sfh
2e3095779aada9fdb1764f08ca578ccb14e41fe4]
sensor_hub_get_feature+0xad/0x170 [hid_sensor_hub
d6157999c9d260a1bfa6f27d4a0dc2c3e2c5654e]
hid_sensor_parse_common_attributes+0x217/0x310 [hid_sensor_iio_common
07a7935272aa9c7a28193b574580b3e953a64ec4]
hid_gyro_3d_probe+0x7f/0x2e0 [hid_sensor_gyro_3d
9f2eb51294a1f0c0315b365f335617cbaef01eab]
platform_probe+0x44/0xa0
really_probe+0x19e/0x3e0
```
Ensure that sensors have been set up before calling into
amd_sfh_get_report() or amd_sfh_set_report().
Cc: stable(a)vger.kernel.org
Cc: Linux regression tracking (Thorsten Leemhuis) <regressions(a)leemhuis.info>
Fixes: 7bcfdab3f0c6 ("HID: amd_sfh: if no sensors are enabled, clean up")
Reported-by: Haochen Tong <linux(a)hexchain.org>
Link: https://lore.kernel.org/all/3250319.ancTxkQ2z5@zen/T/
Signed-off-by: Mario Limonciello <mario.limonciello(a)amd.com>
---
drivers/hid/amd-sfh-hid/amd_sfh_client.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/hid/amd-sfh-hid/amd_sfh_client.c b/drivers/hid/amd-sfh-hid/amd_sfh_client.c
index d9b7b01900b5..88f3d913eaa1 100644
--- a/drivers/hid/amd-sfh-hid/amd_sfh_client.c
+++ b/drivers/hid/amd-sfh-hid/amd_sfh_client.c
@@ -25,6 +25,9 @@ void amd_sfh_set_report(struct hid_device *hid, int report_id,
struct amdtp_cl_data *cli_data = hid_data->cli_data;
int i;
+ if (!cli_data->is_any_sensor_enabled)
+ return;
+
for (i = 0; i < cli_data->num_hid_devices; i++) {
if (cli_data->hid_sensor_hubs[i] == hid) {
cli_data->cur_hid_dev = i;
@@ -41,6 +44,9 @@ int amd_sfh_get_report(struct hid_device *hid, int report_id, int report_type)
struct request_list *req_list = &cli_data->req_list;
int i;
+ if (!cli_data->is_any_sensor_enabled)
+ return -ENODEV;
+
for (i = 0; i < cli_data->num_hid_devices; i++) {
if (cli_data->hid_sensor_hubs[i] == hid) {
struct request_list *new = kzalloc(sizeof(*new), GFP_KERNEL);
--
2.34.1
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86. Based on the reproducer
provided in [1] we suspect this is caused by the lack of VMA locking
while forking a child process.
Patch 1/2 in the series implements proper VMA locking during fork.
I tested the fix locally using the reproducer and was unable to reproduce
the memory corruption problem.
This fix can potentially regress some fork-heavy workloads. Kernel build
time did not show noticeable regression on a 56-core machine while a
stress test mapping 10000 VMAs and forking 5000 times in a tight loop
shows ~7% regression. If such fork time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
optimizations are possible if this regression proves to be problematic.
Patch 2/2 disables per-VMA locks until the fix is tested and verified.
Both patches apply cleanly over Linus' ToT and stable 6.4.y branch.
Changes from v3 posted at [3]:
- Replace vma_iter_init with vma_iter_set, per Liam R. Howlett
- Update the regression number caused by additional VMA tree walk
[1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
[2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@google.com
[3] https://lore.kernel.org/all/20230705171213.2843068-1-surenb@google.com
Suren Baghdasaryan (2):
fork: lock VMAs of the parent process when forking
mm: disable CONFIG_PER_VMA_LOCK until its fixed
kernel/fork.c | 6 ++++++
mm/Kconfig | 3 ++-
2 files changed, 8 insertions(+), 1 deletion(-)
--
2.41.0.255.g8b1d071c50-goog
This is the start of the stable review cycle for the 6.4.2 release.
There are 15 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu, 06 Jul 2023 08:46:01 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.4.2-rc2.…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 6.4.2-rc2
Linus Torvalds <torvalds(a)linux-foundation.org>
gup: avoid stack expansion warning for known-good case
SeongJae Park <sj(a)kernel.org>
arch/arm64/mm/fault: Fix undeclared variable error in do_page_fault()
Bas Nieuwenhuizen <bas(a)basnieuwenhuizen.nl>
drm/amdgpu: Validate VM ioctl flags.
Demi Marie Obenour <demi(a)invisiblethingslab.com>
dm ioctl: Avoid double-fetch of version
Ahmed S. Darwish <darwi(a)linutronix.de>
docs: Set minimal gtags / GNU GLOBAL version to 6.6.5
Ahmed S. Darwish <darwi(a)linutronix.de>
scripts/tags.sh: Resolve gtags empty index generation
Mike Kravetz <mike.kravetz(a)oracle.com>
hugetlb: revert use of page_cache_next_miss()
Finn Thain <fthain(a)linux-m68k.org>
nubus: Partially revert proc_create_single_data() conversion
Dan Williams <dan.j.williams(a)intel.com>
Revert "cxl/port: Enable the HDM decoder capability for switch ports"
Jeff Layton <jlayton(a)kernel.org>
nfs: don't report STATX_BTIME in ->getattr
Linus Torvalds <torvalds(a)linux-foundation.org>
execve: always mark stack as growing down during early stack setup
Mario Limonciello <mario.limonciello(a)amd.com>
PCI/ACPI: Call _REG when transitioning D-states
Bjorn Helgaas <bhelgaas(a)google.com>
PCI/ACPI: Validate acpi_pci_set_power_state() parameter
Thomas Weißschuh <linux(a)weissschuh.net>
tools/nolibc: x86_64: disable stack protector for _start
Max Filippov <jcmvbkbc(a)gmail.com>
xtensa: fix lock_mm_and_find_vma in case VMA not found
-------------
Diffstat:
Documentation/process/changes.rst | 7 +++++
Makefile | 4 +--
arch/arm64/mm/fault.c | 2 --
drivers/cxl/core/pci.c | 27 +++--------------
drivers/cxl/cxl.h | 1 -
drivers/cxl/port.c | 14 ++++-----
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 +++
drivers/md/dm-ioctl.c | 33 +++++++++++++--------
drivers/nubus/proc.c | 22 ++++++++++----
drivers/pci/pci-acpi.c | 53 +++++++++++++++++++++++++---------
fs/hugetlbfs/inode.c | 8 ++---
fs/nfs/inode.c | 2 +-
include/linux/mm.h | 4 ++-
mm/hugetlb.c | 12 ++++----
mm/memory.c | 4 +++
mm/nommu.c | 7 ++++-
scripts/tags.sh | 9 +++++-
tools/include/nolibc/arch-x86_64.h | 2 +-
tools/testing/cxl/Kbuild | 1 -
tools/testing/cxl/test/mock.c | 15 ----------
20 files changed, 132 insertions(+), 99 deletions(-)
This is the start of the stable review cycle for the 6.3.12 release.
There are 14 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu, 06 Jul 2023 08:46:01 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.3.12-rc2…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.3.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 6.3.12-rc2
Linus Torvalds <torvalds(a)linux-foundation.org>
gup: avoid stack expansion warning for known-good case
Rodrigo Siqueira <Rodrigo.Siqueira(a)amd.com>
drm/amd/display: Ensure vmin and vmax adjust for DCE
Bas Nieuwenhuizen <bas(a)basnieuwenhuizen.nl>
drm/amdgpu: Validate VM ioctl flags.
Demi Marie Obenour <demi(a)invisiblethingslab.com>
dm ioctl: Avoid double-fetch of version
Ahmed S. Darwish <darwi(a)linutronix.de>
docs: Set minimal gtags / GNU GLOBAL version to 6.6.5
Ahmed S. Darwish <darwi(a)linutronix.de>
scripts/tags.sh: Resolve gtags empty index generation
Finn Thain <fthain(a)linux-m68k.org>
nubus: Partially revert proc_create_single_data() conversion
Dan Williams <dan.j.williams(a)intel.com>
Revert "cxl/port: Enable the HDM decoder capability for switch ports"
Jeff Layton <jlayton(a)kernel.org>
nfs: don't report STATX_BTIME in ->getattr
Linus Torvalds <torvalds(a)linux-foundation.org>
execve: always mark stack as growing down during early stack setup
Mario Limonciello <mario.limonciello(a)amd.com>
PCI/ACPI: Call _REG when transitioning D-states
Bjorn Helgaas <bhelgaas(a)google.com>
PCI/ACPI: Validate acpi_pci_set_power_state() parameter
Aric Cyr <aric.cyr(a)amd.com>
drm/amd/display: Do not update DRR while BW optimizations pending
Max Filippov <jcmvbkbc(a)gmail.com>
xtensa: fix lock_mm_and_find_vma in case VMA not found
-------------
Diffstat:
Documentation/process/changes.rst | 7 +++++
Makefile | 4 +--
drivers/cxl/core/pci.c | 27 +++-------------
drivers/cxl/cxl.h | 1 -
drivers/cxl/port.c | 14 +++------
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 +++
drivers/gpu/drm/amd/display/dc/core/dc.c | 49 +++++++++++++++++------------
drivers/md/dm-ioctl.c | 33 ++++++++++++--------
drivers/nubus/proc.c | 22 ++++++++++---
drivers/pci/pci-acpi.c | 53 ++++++++++++++++++++++++--------
fs/nfs/inode.c | 2 +-
include/linux/mm.h | 4 ++-
mm/memory.c | 4 +++
mm/nommu.c | 7 ++++-
scripts/tags.sh | 9 +++++-
tools/testing/cxl/Kbuild | 1 -
tools/testing/cxl/test/mock.c | 15 ---------
17 files changed, 152 insertions(+), 104 deletions(-)
When unloading the MANA driver, mana_dealloc_queues() waits for the MANA
hardware to complete any inflight packets and set the pending send count
to zero. But if the hardware has failed, mana_dealloc_queues()
could wait forever.
Fix this by adding a timeout to the wait. Set the timeout to 120 seconds,
which is a somewhat arbitrary value that is more than long enough for
functional hardware to complete any sends.
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
---
V4 -> V5:
* Added Fixes tag
* Changed the usleep_range from a static to an incremental value.
* Initialized timeout at the beginning.
---
Signed-off-by: Souradeep Chakrabarti <schakrabarti(a)linux.microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 30 ++++++++++++++++---
1 file changed, 26 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index a499e460594b..56b7074db1a2 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2345,9 +2345,13 @@ int mana_attach(struct net_device *ndev)
static int mana_dealloc_queues(struct net_device *ndev)
{
struct mana_port_context *apc = netdev_priv(ndev);
+ unsigned long timeout = jiffies + 120 * HZ;
struct gdma_dev *gd = apc->ac->gdma_dev;
struct mana_txq *txq;
+ struct sk_buff *skb;
+ struct mana_cq *cq;
int i, err;
+ u32 tsleep;
if (apc->port_is_up)
return -EINVAL;
@@ -2363,15 +2367,33 @@ static int mana_dealloc_queues(struct net_device *ndev)
* to false, but it doesn't matter since mana_start_xmit() drops any
* new packets due to apc->port_is_up being false.
*
- * Drain all the in-flight TX packets
+ * Drain all the in-flight TX packets.
+ * A timeout of 120 seconds for all the queues is used.
+ * This will break the while loop when h/w is not responding.
+ * This value of 120 has been decided here considering max
+ * number of queues.
*/
+
for (i = 0; i < apc->num_queues; i++) {
txq = &apc->tx_qp[i].txq;
-
- while (atomic_read(&txq->pending_sends) > 0)
- usleep_range(1000, 2000);
+ tsleep = 1000;
+ while (atomic_read(&txq->pending_sends) > 0 &&
+ time_before(jiffies, timeout)) {
+ usleep_range(tsleep, tsleep << 1);
+ tsleep <<= 1;
+ }
}
+ for (i = 0; i < apc->num_queues; i++) {
+ txq = &apc->tx_qp[i].txq;
+ cq = &apc->tx_qp[i].tx_cq;
+ while (atomic_read(&txq->pending_sends)) {
+ skb = skb_dequeue(&txq->pending_skbs);
+ mana_unmap_skb(skb, apc);
+ dev_consume_skb_any(skb);
+ atomic_sub(1, &txq->pending_sends);
+ }
+ }
/* We're 100% sure the queues can no longer be woken up, because
* we're sure now mana_poll_tx_cq() can't be running.
*/
--
2.34.1
Making 'blk' a sector_t (i.e. 64 bit if LBD support is active)
defeats the 'blk>0' termination test in the partition block loop
when a value of (signed int) -1 is used to mark the end of the
partition block list.
This bug was introduced in patch 3 of my prior Amiga partition
support fixes series, and spotted by Christian Zigotzky when
testing the latest block updates.
Explicitly cast 'blk' to signed int to allow use of -1 to
terminate the partition block linked list.
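A standalone userspace illustration of the comparison being fixed follows (int32_t standing in for the kernel's s32); this is not the kernel code, just a demonstration of why the marker widened into a 64-bit value no longer terminates the loop while the explicit signed 32-bit cast does.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* The on-disk end-of-list marker (-1) read as a 32-bit value and
	 * widened into a 64-bit sector_t-like variable. */
	uint64_t blk = (uint32_t)-1;

	/* Unsigned 64-bit: 0xffffffff > 0, so the loop would keep running. */
	printf("blk > 0          -> %d\n", blk > 0);

	/* Signed 32-bit view (the kernel's s32 cast): -1 > 0 is false, so
	 * the loop terminates as intended. */
	printf("(int32_t)blk > 0 -> %d\n", (int32_t)blk > 0);

	return 0;
}
```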
Reported-by: Christian Zigotzky <chzigotzky(a)xenosoft.de>
Fixes: b6f3f28f60 ("block: add overflow checks for Amiga partition support")
Message-ID: 024ce4fa-cc6d-50a2-9aae-3701d0ebf668(a)xenosoft.de
Cc: <stable(a)vger.kernel.org> # 5.2
Link: https://lore.kernel.org/r/024ce4fa-cc6d-50a2-9aae-3701d0ebf668@xenosoft.de
Signed-off-by: Michael Schmitz <schmitzmic(a)gmail.com>
Reviewed-by: Martin Steigerwald <martin(a)lichtvoll.de>
Tested-by: Christian Zigotzky <chzigotzky(a)xenosoft.de>
--
Changes since v1:
- corrected Fixes: tag
- added Tested-by:
- reworded commit message to describe filesystem partition
size mismatch problem
Changes since v2:
Adrian Glaubitz:
- fix typo in commit message
Changes since v3:
Greg KH:
- fix stable tag
Geert Uytterhoeven:
- revert changes to commit message since v1
---
block/partitions/amiga.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/partitions/amiga.c b/block/partitions/amiga.c
index ed222b9c901b..506921095412 100644
--- a/block/partitions/amiga.c
+++ b/block/partitions/amiga.c
@@ -90,7 +90,7 @@ int amiga_partition(struct parsed_partitions *state)
}
blk = be32_to_cpu(rdb->rdb_PartitionList);
put_dev_sector(sect);
- for (part = 1; blk>0 && part<=16; part++, put_dev_sector(sect)) {
+ for (part = 1; (s32) blk>0 && part<=16; part++, put_dev_sector(sect)) {
/* Read in terms partition table understands */
if (check_mul_overflow(blk, (sector_t) blksize, &blk)) {
pr_err("Dev %s: overflow calculating partition block %llu! Skipping partitions %u and beyond\n",
--
2.17.1
The PARF_SLV_ADDR_SPACE_SIZE_2_3_3 macro is used for the IPQ8074
2_3_3 post_init ops. The PCIe slave address space size was initially set
to 0x358, but was wrongly changed to 0x168 as part of
"PCI: qcom: Remove PCIE20_ prefix from register definitions".
Fix it by using the right macro, PARF_SLV_ADDR_SPACE_SIZE,
and removing the now-unused PARF_SLV_ADDR_SPACE_SIZE_2_3_3.
Without this, PCIe bring-up on IPQ8074 is currently broken.
Fixes: 39171b33f652 ("PCI: qcom: Remove PCIE20_ prefix from register definitions")
Signed-off-by: Sricharan Ramabadhran <quic_srichara(a)quicinc.com>
---
[V2] Fixed the 'fixes tag' correctly, subject, right macro usage
drivers/pci/controller/dwc/pcie-qcom.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/pci/controller/dwc/pcie-qcom.c b/drivers/pci/controller/dwc/pcie-qcom.c
index 4ab30892f6ef..1689d072fe86 100644
--- a/drivers/pci/controller/dwc/pcie-qcom.c
+++ b/drivers/pci/controller/dwc/pcie-qcom.c
@@ -43,7 +43,6 @@
#define PARF_PHY_REFCLK 0x4c
#define PARF_CONFIG_BITS 0x50
#define PARF_DBI_BASE_ADDR 0x168
-#define PARF_SLV_ADDR_SPACE_SIZE_2_3_3 0x16c /* Register offset specific to IP ver 2.3.3 */
#define PARF_MHI_CLOCK_RESET_CTRL 0x174
#define PARF_AXI_MSTR_WR_ADDR_HALT 0x178
#define PARF_AXI_MSTR_WR_ADDR_HALT_V2 0x1a8
@@ -811,7 +810,7 @@ static int qcom_pcie_post_init_2_3_3(struct qcom_pcie *pcie)
u32 val;
writel(SLV_ADDR_SPACE_SZ,
- pcie->parf + PARF_SLV_ADDR_SPACE_SIZE_2_3_3);
+ pcie->parf + PARF_SLV_ADDR_SPACE_SIZE);
val = readl(pcie->parf + PARF_PHY_CTRL);
val &= ~PHY_TEST_PWR_DOWN;
--
2.34.1
The patch titled
Subject: mm: disable CONFIG_PER_VMA_LOCK until its fixed
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-disable-config_per_vma_lock-until-its-fixed.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Suren Baghdasaryan <surenb(a)google.com>
Subject: mm: disable CONFIG_PER_VMA_LOCK until its fixed
Date: Wed, 5 Jul 2023 18:14:00 -0700
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86. Disable per-VMA locks config to
prevent this issue until the fix is confirmed. This is expected to be a
temporary measure.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
[2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@google.com
Link: https://lkml.kernel.org/r/20230706011400.2949242-3-surenb@google.com
Reported-by: Jiri Slaby <jirislaby(a)kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Jacob Young <jacobly.alt(a)gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Signed-off-by: Suren Baghdasaryan <surenb(a)google.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Holger Hoffstätte <holger(a)applied-asynchrony.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/Kconfig | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/Kconfig~mm-disable-config_per_vma_lock-until-its-fixed
+++ a/mm/Kconfig
@@ -1224,8 +1224,9 @@ config ARCH_SUPPORTS_PER_VMA_LOCK
def_bool n
config PER_VMA_LOCK
- def_bool y
+ bool "Enable per-vma locking during page fault handling."
depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
+ depends on BROKEN
help
Allow per-vma locking during page fault handling.
_
Patches currently in -mm which might be from surenb(a)google.com are
fork-lock-vmas-of-the-parent-process-when-forking.patch
mm-disable-config_per_vma_lock-until-its-fixed.patch
swap-remove-remnants-of-polling-from-read_swap_cache_async.patch
mm-add-missing-vm_fault_result_trace-name-for-vm_fault_completed.patch
mm-drop-per-vma-lock-when-returning-vm_fault_retry-or-vm_fault_completed.patch
mm-change-folio_lock_or_retry-to-use-vm_fault-directly.patch
mm-handle-swap-page-faults-under-per-vma-lock.patch
mm-handle-userfaults-under-vma-lock.patch
The patch titled
Subject: fork: lock VMAs of the parent process when forking
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
fork-lock-vmas-of-the-parent-process-when-forking.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Suren Baghdasaryan <surenb(a)google.com>
Subject: fork: lock VMAs of the parent process when forking
Date: Wed, 5 Jul 2023 18:13:59 -0700
Patch series "Avoid memory corruption caused by per-VMA locks", v4.
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86. Based on the reproducer
provided in [1] we suspect this is caused by the lack of VMA locking while
forking a child process.
Patch 1/2 in the series implements proper VMA locking during fork. I
tested the fix locally using the reproducer and was unable to reproduce
the memory corruption problem.
This fix can potentially regress some fork-heavy workloads. Kernel build
time did not show noticeable regression on a 56-core machine while a
stress test mapping 10000 VMAs and forking 5000 times in a tight loop
shows ~7% regression. If such fork time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
optimizations are possible if this regression proves to be problematic.
Patch 2/2 disables per-VMA locks until the fix is tested and verified.
This patch (of 2):
When forking a child process, parent write-protects an anonymous page and
COW-shares it with the child being forked using copy_present_pte().
Parent's TLB is flushed right before we drop the parent's mmap_lock in
dup_mmap(). If we get a write-fault before that TLB flush in the parent,
and we end up replacing that anonymous page in the parent process in
do_wp_page() (because, COW-shared with the child), this might lead to some
stale writable TLB entries targeting the wrong (old) page. Similar issue
happened in the past with userfaultfd (see flush_tlb_page() call inside
do_wp_page()).
Lock VMAs of the parent process when forking a child, which prevents
concurrent page faults during fork operation and avoids this issue. This
fix can potentially regress some fork-heavy workloads. Kernel build time
did not show noticeable regression on a 56-core machine while a stress
test mapping 10000 VMAs and forking 5000 times in a tight loop shows ~7%
regression. If such fork time regression is unacceptable, disabling
CONFIG_PER_VMA_LOCK should restore its performance. Further optimizations
are possible if this regression proves to be problematic.
Link: https://lkml.kernel.org/r/20230706011400.2949242-1-surenb@google.com
Link: https://lkml.kernel.org/r/20230706011400.2949242-2-surenb@google.com
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Signed-off-by: Suren Baghdasaryan <surenb(a)google.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Reported-by: Jiri Slaby <jirislaby(a)kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger(a)applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-as…
Reported-by: Jacob Young <jacobly.alt(a)gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/fork.c | 6 ++++++
1 file changed, 6 insertions(+)
--- a/kernel/fork.c~fork-lock-vmas-of-the-parent-process-when-forking
+++ a/kernel/fork.c
@@ -658,6 +658,12 @@ static __latent_entropy int dup_mmap(str
retval = -EINTR;
goto fail_uprobe_end;
}
+#ifdef CONFIG_PER_VMA_LOCK
+ /* Disallow any page faults before calling flush_cache_dup_mm */
+ for_each_vma(old_vmi, mpnt)
+ vma_start_write(mpnt);
+ vma_iter_set(&old_vmi, 0);
+#endif
flush_cache_dup_mm(oldmm);
uprobe_dup_mmap(oldmm, mm);
/*
_
Patches currently in -mm which might be from surenb(a)google.com are
fork-lock-vmas-of-the-parent-process-when-forking.patch
mm-disable-config_per_vma_lock-until-its-fixed.patch
swap-remove-remnants-of-polling-from-read_swap_cache_async.patch
mm-add-missing-vm_fault_result_trace-name-for-vm_fault_completed.patch
mm-drop-per-vma-lock-when-returning-vm_fault_retry-or-vm_fault_completed.patch
mm-change-folio_lock_or_retry-to-use-vm_fault-directly.patch
mm-handle-swap-page-faults-under-per-vma-lock.patch
mm-handle-userfaults-under-vma-lock.patch
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86. Based on the reproducer
provided in [1] we suspect this is caused by the lack of VMA locking
while forking a child process.
Patch 1/2 in the series implements proper VMA locking during fork.
I tested the fix locally using the reproducer and was unable to reproduce
the memory corruption problem.
This fix can potentially regress some fork-heavy workloads. Kernel build
time did not show noticeable regression on a 56-core machine while a
stress test mapping 10000 VMAs and forking 5000 times in a tight loop
shows ~5% regression. If such fork time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
optimizations are possible if this regression proves to be problematic.
Patch 2/2 disables per-VMA locks until the fix is tested and verified.
Both patches apply cleanly over Linus' ToT and stable 6.4.y branch.
Changes from v2 posted at [3]:
- Move VMA locking before flush_cache_dup_mm, per David Hildenbrand
[1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
[2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@google.com
[3] https://lore.kernel.org/all/20230705063711.2670599-1-surenb@google.com/
Suren Baghdasaryan (2):
fork: lock VMAs of the parent process when forking
mm: disable CONFIG_PER_VMA_LOCK until its fixed
kernel/fork.c | 6 ++++++
mm/Kconfig | 3 ++-
2 files changed, 8 insertions(+), 1 deletion(-)
--
2.41.0.255.g8b1d071c50-goog
On 7/5/23 21:37, Ranjan kumar wrote:
> Hi Damien,
> Regarding delay:
> As Sathya already mentioned as this is our hardware specific behavior and
> we are confident that the increased retry count
> is sufficient from our hardware perspective for any new systems too. So, we
> want to go with this change .
Fine. Adding a comment above the macro definitions to explain something like
that would be nice.
> Apart from that, I will change the name as suggested .
>
> Thanks & Regards,
> Ranjan
Please avoid top posting.
> On Thu, 29 Jun 2023 at 05:24, Damien Le Moal <dlemoal(a)kernel.org> wrote:
>
>> On 6/28/23 16:05, Ranjan Kumar wrote:
>>> Doorbell and Host diagnostic registers could return 0 even
>>> after 3 retries and that leads to occasional resets of the
>>> controllers, hence increased the retry count to thirty.
>>
>> The magic value "3" for retry count was already that, magic. Why would
>> things
>> work better with 30 ? What is the reasoning ? Isn't a udelay needed to
>> avoid
>> that many retries ?
>>
>>>
>>> Fixes: b899202901a8 ("mpt3sas: Add separate function for aero doorbell
>> reads ")
>>> Cc: stable(a)vger.kernel.org
>>> Signed-off-by: Ranjan Kumar <ranjan.kumar(a)broadcom.com>
>>
>> [..]
>>
>>> diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.h
>> b/drivers/scsi/mpt3sas/mpt3sas_base.h
>>> index 05364aa15ecd..3b8ec4fd2d21 100644
>>> --- a/drivers/scsi/mpt3sas/mpt3sas_base.h
>>> +++ b/drivers/scsi/mpt3sas/mpt3sas_base.h
>>> @@ -160,6 +160,8 @@
>>>
>>> #define IOC_OPERATIONAL_WAIT_COUNT 10
>>>
>>> +#define READL_RETRY_COUNT_OF_THIRTY 30
>>> +#define READL_RETRY_COUNT_OF_THREE 3
>>
>> Less than ideal naming I think. If the values need to be changed again, a
>> lot of
>> code will need to change. What about soemthing like:
>>
>> #define READL_RETRY_COUNT 30
>> #define READL_RETRY_SHORT_COUNT 3
>>
>>> /*
>>> * NVMe defines
>>> */
>>> @@ -994,7 +996,7 @@ typedef void (*NVME_BUILD_PRP)(struct
>> MPT3SAS_ADAPTER *ioc, u16 smid,
>>> typedef void (*PUT_SMID_IO_FP_HIP) (struct MPT3SAS_ADAPTER *ioc, u16
>> smid,
>>> u16 funcdep);
>>> typedef void (*PUT_SMID_DEFAULT) (struct MPT3SAS_ADAPTER *ioc, u16
>> smid);
>>> -typedef u32 (*BASE_READ_REG) (const volatile void __iomem *addr);
>>> +typedef u32 (*BASE_READ_REG) (const volatile void __iomem *addr, u8
>> retry_count);
>>> /*
>>> * To get high iops reply queue's msix index when high iops mode is
>> enabled
>>> * else get the msix index of general reply queues.
>>
>> --
>> Damien Le Moal
>> Western Digital Research
>>
>>
>
--
Damien Le Moal
Western Digital Research
Hi,
The following commit landed in 6.4; it helps avoid some occasional NULL
pointer dereferences when setting up MST devices, particularly across
suspend/resume cycles.
54d217406afe ("drm: use mgr->dev in drm_dbg_kms in
drm_dp_add_payload_part2")
Can you please take this to 6.1.y and 6.3.y as well? The NULL pointer
dereference has been reported on both.
Thanks,
From: Yinjun Zhang <yinjun.zhang(a)corigine.com>
When moving devices from one namespace to another, mc addresses are
cleaned in software but not removed from the application firmware. Thus
the mc addresses remain and will cause a resource leak.
Now use `__dev_mc_unsync` to clean mc addresses when closing the port.
Fixes: e20aa071cd95 ("nfp: fix schedule in atomic context when sync mc address")
Cc: stable(a)vger.kernel.org
Signed-off-by: Yinjun Zhang <yinjun.zhang(a)corigine.com>
Acked-by: Simon Horman <simon.horman(a)corigine.com>
Signed-off-by: Louis Peens <louis.peens(a)corigine.com>
---
Changes since v2:
* Use function prototype to avoid moving code chunk.
Changes since v1:
* Use __dev_mc_unsyc to clean mc addresses instead of tracking mc addresses by
driver itself.
* Clean mc addresses when closing port instead of driver exits,
so that the issue of moving devices between namespaces can be fixed.
* Modify commit message accordingly.
drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 49f2f081ebb5..6b1fb5708434 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -53,6 +53,8 @@
#include "crypto/crypto.h"
#include "crypto/fw.h"
+static int nfp_net_mc_unsync(struct net_device *netdev, const unsigned char *addr);
+
/**
* nfp_net_get_fw_version() - Read and parse the FW version
* @fw_ver: Output fw_version structure to read to
@@ -1084,6 +1086,9 @@ static int nfp_net_netdev_close(struct net_device *netdev)
/* Step 2: Tell NFP
*/
+ if (nn->cap_w1 & NFP_NET_CFG_CTRL_MCAST_FILTER)
+ __dev_mc_unsync(netdev, nfp_net_mc_unsync);
+
nfp_net_clear_config_and_disable(nn);
nfp_port_configure(netdev, false);
--
2.34.1
I'm announcing the release of the 6.4.2 kernel.
All users of the 6.4 kernel series must upgrade.
The updated 6.4.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-6.4.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Documentation/process/changes.rst | 7 ++++
Makefile | 2 -
arch/arm64/mm/fault.c | 2 -
drivers/cxl/core/pci.c | 27 ++--------------
drivers/cxl/cxl.h | 1
drivers/cxl/port.c | 14 +++-----
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++
drivers/md/dm-ioctl.c | 33 +++++++++++++-------
drivers/nubus/proc.c | 22 ++++++++++---
drivers/pci/pci-acpi.c | 53 ++++++++++++++++++++++++---------
fs/hugetlbfs/inode.c | 8 +---
fs/nfs/inode.c | 2 -
include/linux/mm.h | 4 +-
mm/hugetlb.c | 12 +++----
mm/nommu.c | 7 +++-
scripts/tags.sh | 9 ++++-
tools/include/nolibc/arch-x86_64.h | 2 -
tools/testing/cxl/Kbuild | 1
tools/testing/cxl/test/mock.c | 15 ---------
19 files changed, 127 insertions(+), 98 deletions(-)
Ahmed S. Darwish (2):
scripts/tags.sh: Resolve gtags empty index generation
docs: Set minimal gtags / GNU GLOBAL version to 6.6.5
Bas Nieuwenhuizen (1):
drm/amdgpu: Validate VM ioctl flags.
Bjorn Helgaas (1):
PCI/ACPI: Validate acpi_pci_set_power_state() parameter
Dan Williams (1):
Revert "cxl/port: Enable the HDM decoder capability for switch ports"
Demi Marie Obenour (1):
dm ioctl: Avoid double-fetch of version
Finn Thain (1):
nubus: Partially revert proc_create_single_data() conversion
Greg Kroah-Hartman (1):
Linux 6.4.2
Jeff Layton (1):
nfs: don't report STATX_BTIME in ->getattr
Linus Torvalds (1):
execve: always mark stack as growing down during early stack setup
Mario Limonciello (1):
PCI/ACPI: Call _REG when transitioning D-states
Max Filippov (1):
xtensa: fix lock_mm_and_find_vma in case VMA not found
Mike Kravetz (1):
hugetlb: revert use of page_cache_next_miss()
SeongJae Park (1):
arch/arm64/mm/fault: Fix undeclared variable error in do_page_fault()
Thomas Weißschuh (1):
tools/nolibc: x86_64: disable stack protector for _start
I'm announcing the release of the 6.3.12 kernel.
All users of the 6.3 kernel series must upgrade.
The updated 6.3.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-6.3.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Documentation/process/changes.rst | 7 ++++
Makefile | 2 -
drivers/cxl/core/pci.c | 27 ++-------------
drivers/cxl/cxl.h | 1
drivers/cxl/port.c | 14 ++------
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++
drivers/gpu/drm/amd/display/dc/core/dc.c | 49 +++++++++++++++++-----------
drivers/md/dm-ioctl.c | 33 ++++++++++++-------
drivers/nubus/proc.c | 22 +++++++++---
drivers/pci/pci-acpi.c | 53 +++++++++++++++++++++++--------
fs/nfs/inode.c | 2 -
include/linux/mm.h | 4 +-
mm/nommu.c | 7 +++-
scripts/tags.sh | 9 ++++-
tools/testing/cxl/Kbuild | 1
tools/testing/cxl/test/mock.c | 15 --------
16 files changed, 147 insertions(+), 103 deletions(-)
Ahmed S. Darwish (2):
scripts/tags.sh: Resolve gtags empty index generation
docs: Set minimal gtags / GNU GLOBAL version to 6.6.5
Aric Cyr (1):
drm/amd/display: Do not update DRR while BW optimizations pending
Bas Nieuwenhuizen (1):
drm/amdgpu: Validate VM ioctl flags.
Bjorn Helgaas (1):
PCI/ACPI: Validate acpi_pci_set_power_state() parameter
Dan Williams (1):
Revert "cxl/port: Enable the HDM decoder capability for switch ports"
Demi Marie Obenour (1):
dm ioctl: Avoid double-fetch of version
Finn Thain (1):
nubus: Partially revert proc_create_single_data() conversion
Greg Kroah-Hartman (1):
Linux 6.3.12
Jeff Layton (1):
nfs: don't report STATX_BTIME in ->getattr
Linus Torvalds (1):
execve: always mark stack as growing down during early stack setup
Mario Limonciello (1):
PCI/ACPI: Call _REG when transitioning D-states
Max Filippov (1):
xtensa: fix lock_mm_and_find_vma in case VMA not found
Rodrigo Siqueira (1):
drm/amd/display: Ensure vmin and vmax adjust for DCE
I'm announcing the release of the 6.1.38 kernel.
All users of the 6.1 kernel series must upgrade.
The updated 6.1.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-6.1.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Documentation/process/changes.rst | 7 ++++
Makefile | 2 -
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++
drivers/gpu/drm/amd/display/dc/core/dc.c | 50 ++++++++++++++++-------------
drivers/nubus/proc.c | 22 +++++++++---
drivers/pci/pci-acpi.c | 53 +++++++++++++++++++++++--------
include/linux/mm.h | 4 +-
mm/nommu.c | 7 +++-
scripts/tags.sh | 9 ++++-
tools/perf/util/symbol.c | 17 ++++++++-
10 files changed, 130 insertions(+), 45 deletions(-)
Ahmed S. Darwish (2):
scripts/tags.sh: Resolve gtags empty index generation
docs: Set minimal gtags / GNU GLOBAL version to 6.6.5
Alvin Lee (1):
drm/amd/display: Remove optimization for VRR updates
Aric Cyr (1):
drm/amd/display: Do not update DRR while BW optimizations pending
Bas Nieuwenhuizen (1):
drm/amdgpu: Validate VM ioctl flags.
Bjorn Helgaas (1):
PCI/ACPI: Validate acpi_pci_set_power_state() parameter
Finn Thain (1):
nubus: Partially revert proc_create_single_data() conversion
Greg Kroah-Hartman (1):
Linux 6.1.38
Krister Johansen (1):
perf symbols: Symbol lookup with kcore can fail if multiple segments match stext
Linus Torvalds (1):
execve: always mark stack as growing down during early stack setup
Mario Limonciello (1):
PCI/ACPI: Call _REG when transitioning D-states
Max Filippov (1):
xtensa: fix lock_mm_and_find_vma in case VMA not found
Rodrigo Siqueira (1):
drm/amd/display: Ensure vmin and vmax adjust for DCE
I'm announcing the release of the 5.15.120 kernel.
All users of the 5.15 kernel series must upgrade.
The updated 5.15.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.15.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2 -
arch/parisc/include/asm/assembly.h | 4 --
arch/x86/kernel/cpu/microcode/amd.c | 2 -
arch/x86/kernel/smpboot.c | 24 ++++++++-------
drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 1
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++
drivers/hid/hid-logitech-hidpp.c | 2 -
drivers/hid/wacom_wac.c | 6 +--
drivers/hid/wacom_wac.h | 2 -
drivers/nubus/proc.c | 22 ++++++++++---
drivers/thermal/mtk_thermal.c | 14 +-------
include/linux/highmem.h | 24 +++++++++++++++
include/linux/mm.h | 5 ++-
kernel/bpf/verifier.c | 7 +++-
mm/memory.c | 33 ++++++++++++++------
net/can/isotp.c | 5 +--
net/mptcp/protocol.c | 46 +++++++++++++----------------
net/mptcp/subflow.c | 17 ++++++----
scripts/tags.sh | 9 +++++
tools/perf/util/symbol.c | 17 +++++++++-
20 files changed, 158 insertions(+), 88 deletions(-)
Ahmed S. Darwish (1):
scripts/tags.sh: Resolve gtags empty index generation
Bas Nieuwenhuizen (1):
drm/amdgpu: Validate VM ioctl flags.
Ben Hutchings (1):
parisc: Delete redundant register definitions in <asm/assembly.h>
Borislav Petkov (AMD) (1):
x86/microcode/AMD: Load late on both threads too
Finn Thain (1):
nubus: Partially revert proc_create_single_data() conversion
Greg Kroah-Hartman (1):
Linux 5.15.120
Jane Chu (1):
mm, hwpoison: when copy-on-write hits poison, take page offline
Jason Gerecke (1):
HID: wacom: Use ktime_t rather than int when dealing with timestamps
Krister Johansen (2):
bpf: ensure main program has an extable
perf symbols: Symbol lookup with kcore can fail if multiple segments match stext
Mike Hommey (1):
HID: logitech-hidpp: add HIDPP_QUIRK_DELAYED_INIT for the T651.
Oliver Hartkopp (1):
can: isotp: isotp_sendmsg(): fix return error fix on TX path
Paolo Abeni (2):
mptcp: fix possible divide by zero in recvmsg()
mptcp: consolidate fallback and non fallback state machine
Philip Yang (1):
drm/amdgpu: Set vmbo destroy after pt bo is created
Ricardo Cañuelo (1):
Revert "thermal/drivers/mediatek: Use devm_of_iomap to avoid resource leak in mtk_thermal_probe"
Thomas Gleixner (1):
x86/smp: Use dedicated cache-line for mwait_play_dead()
Tony Luck (1):
mm, hwpoison: try to recover from copy-on write faults
This is the start of the stable review cycle for the 6.3.12 release.
There are 13 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Wed, 05 Jul 2023 18:45:08 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.3.12-rc1…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.3.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 6.3.12-rc1
Bas Nieuwenhuizen <bas(a)basnieuwenhuizen.nl>
drm/amdgpu: Validate VM ioctl flags.
Demi Marie Obenour <demi(a)invisiblethingslab.com>
dm ioctl: Avoid double-fetch of version
Ahmed S. Darwish <darwi(a)linutronix.de>
docs: Set minimal gtags / GNU GLOBAL version to 6.6.5
Ahmed S. Darwish <darwi(a)linutronix.de>
scripts/tags.sh: Resolve gtags empty index generation
Mike Kravetz <mike.kravetz(a)oracle.com>
hugetlb: revert use of page_cache_next_miss()
Finn Thain <fthain(a)linux-m68k.org>
nubus: Partially revert proc_create_single_data() conversion
Dan Williams <dan.j.williams(a)intel.com>
Revert "cxl/port: Enable the HDM decoder capability for switch ports"
Jeff Layton <jlayton(a)kernel.org>
nfs: don't report STATX_BTIME in ->getattr
Linus Torvalds <torvalds(a)linux-foundation.org>
execve: always mark stack as growing down during early stack setup
Mario Limonciello <mario.limonciello(a)amd.com>
PCI/ACPI: Call _REG when transitioning D-states
Bjorn Helgaas <bhelgaas(a)google.com>
PCI/ACPI: Validate acpi_pci_set_power_state() parameter
Aric Cyr <aric.cyr(a)amd.com>
drm/amd/display: Do not update DRR while BW optimizations pending
Max Filippov <jcmvbkbc(a)gmail.com>
xtensa: fix lock_mm_and_find_vma in case VMA not found
-------------
Diffstat:
Documentation/process/changes.rst | 7 +++++
Makefile | 4 +--
drivers/cxl/core/pci.c | 27 +++-------------
drivers/cxl/cxl.h | 1 -
drivers/cxl/port.c | 14 +++------
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 +++
drivers/gpu/drm/amd/display/dc/core/dc.c | 48 +++++++++++++++++------------
drivers/md/dm-ioctl.c | 33 ++++++++++++--------
drivers/nubus/proc.c | 22 ++++++++++---
drivers/pci/pci-acpi.c | 53 ++++++++++++++++++++++++--------
fs/hugetlbfs/inode.c | 8 ++---
fs/nfs/inode.c | 2 +-
include/linux/mm.h | 4 ++-
mm/hugetlb.c | 12 ++++----
mm/nommu.c | 7 ++++-
scripts/tags.sh | 9 +++++-
tools/testing/cxl/Kbuild | 1 -
tools/testing/cxl/test/mock.c | 15 ---------
18 files changed, 156 insertions(+), 115 deletions(-)
The patch titled
Subject: fork: lock VMAs of the parent process when forking
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
fork-lock-vmas-of-the-parent-process-when-forking-v3.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Suren Baghdasaryan <surenb(a)google.com>
Subject: fork: lock VMAs of the parent process when forking
Date: Wed, 5 Jul 2023 10:12:11 -0700
Patch series "Avoid memory corruption caused by per-VMA locks", v3.
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86. Based on the reproducer
provided in [1] we suspect this is caused by the lack of VMA locking while
forking a child process.
Patch 1/2 in the series implements proper VMA locking during fork. I
tested the fix locally using the reproducer and was unable to reproduce
the memory corruption problem. This fix can potentially regress some
fork-heavy workloads. Kernel build time did not show noticeable
regression on a 56-core machine while a stress test mapping 10000 VMAs and
forking 5000 times in a tight loop shows ~5% regression. If such fork
time regression is unacceptable, disabling CONFIG_PER_VMA_LOCK should
restore its performance. Further optimizations are possible if this
regression proves to be problematic.
This patch (of 2):
When forking a child process, parent write-protects an anonymous page and
COW-shares it with the child being forked using copy_present_pte().
Parent's TLB is flushed right before we drop the parent's mmap_lock in
dup_mmap(). If we get a write-fault before that TLB flush in the parent,
and we end up replacing that anonymous page in the parent process in
do_wp_page() (because, COW-shared with the child), this might lead to some
stale writable TLB entries targeting the wrong (old) page. Similar issue
happened in the past with userfaultfd (see flush_tlb_page() call inside
do_wp_page()).
Lock VMAs of the parent process when forking a child, which prevents
concurrent page faults during fork operation and avoids this issue. This
fix can potentially regress some fork-heavy workloads. Kernel build time
did not show noticeable regression on a 56-core machine while a stress
test mapping 10000 VMAs and forking 5000 times in a tight loop shows ~5%
regression. If such fork time regression is unacceptable, disabling
CONFIG_PER_VMA_LOCK should restore its performance. Further optimizations
are possible if this regression proves to be problematic.
Link: https://lkml.kernel.org/r/20230705171213.2843068-2-surenb@google.com
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Signed-off-by: Suren Baghdasaryan <surenb(a)google.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Reported-by: Jiri Slaby <jirislaby(a)kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger(a)applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
Reported-by: Jacob Young <jacobly.alt(a)gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Chris Li <chriscli(a)google.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Eric Dumazet <edumazet(a)google.com>
Cc: Greg Thelen <gthelen(a)google.com>
Cc: Hans de Goede <hdegoede(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Jiri Slaby <jirislaby(a)kernel.org>
Cc: Joel Fernandes <joelaf(a)google.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: Laurent Dufour <ldufour(a)linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: Lorenzo Stoakes <lstoakes(a)gmail.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Michel Lespinasse <michel(a)lespinasse.org>
Cc: Mike Rapoport (IBM) <rppt(a)kernel.org>
Cc: Minchan Kim <minchan(a)google.com>
Cc: "Paul E. McKenney" <paulmck(a)kernel.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <peterz(a)infradead.org>
Cc: Punit Agrawal <punit.agrawal(a)bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Song Liu <songliubraving(a)fb.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Will Deacon <will(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/fork.c | 6 ++++++
1 file changed, 6 insertions(+)
--- a/kernel/fork.c~fork-lock-vmas-of-the-parent-process-when-forking-v3
+++ a/kernel/fork.c
@@ -658,6 +658,12 @@ static __latent_entropy int dup_mmap(str
retval = -EINTR;
goto fail_uprobe_end;
}
+#ifdef CONFIG_PER_VMA_LOCK
+ /* Disallow any page faults before calling flush_cache_dup_mm */
+ for_each_vma(old_vmi, mpnt)
+ vma_start_write(mpnt);
+ vma_iter_init(&old_vmi, oldmm, 0);
+#endif
flush_cache_dup_mm(oldmm);
uprobe_dup_mmap(oldmm, mm);
/*
_
Patches currently in -mm which might be from surenb(a)google.com are
fork-lock-vmas-of-the-parent-process-when-forking-v3.patch
mm-disable-config_per_vma_lock-until-its-fixed.patch
swap-remove-remnants-of-polling-from-read_swap_cache_async.patch
mm-add-missing-vm_fault_result_trace-name-for-vm_fault_completed.patch
mm-drop-per-vma-lock-when-returning-vm_fault_retry-or-vm_fault_completed.patch
mm-change-folio_lock_or_retry-to-use-vm_fault-directly.patch
mm-handle-swap-page-faults-under-per-vma-lock.patch
mm-handle-userfaults-under-vma-lock.patch
The quilt patch titled
Subject: fork: lock VMAs of the parent process when forking
has been removed from the -mm tree. Its filename was
fork-lock-vmas-of-the-parent-process-when-forking.patch
This patch was dropped because an updated version will be merged
------------------------------------------------------
From: Suren Baghdasaryan <surenb(a)google.com>
Subject: fork: lock VMAs of the parent process when forking
Date: Tue, 4 Jul 2023 23:37:10 -0700
Patch series "Avoid memory corruption caused by per-VMA locks", v2.
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86. Based on the reproducer
provided in [1] we suspect this is caused by the lack of VMA locking while
forking a child process.
Patch 1/2 in the series implements proper VMA locking during fork. I
tested the fix locally using the reproducer and was unable to reproduce
the memory corruption problem.
This fix can potentially regress some fork-heavy workloads. Kernel build
time did not show noticeable regression on a 56-core machine while a
stress test mapping 10000 VMAs and forking 5000 times in a tight loop
shows ~5% regression. If such fork time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
optimizations are possible if this regression proves to be problematic.
Patch 2/2 disabled per-VMA locks until the fix is tested and verified.
This patch (of 2):
When forking a child process, parent write-protects an anonymous page and
COW-shares it with the child being forked using copy_present_pte().
Parent's TLB is flushed right before we drop the parent's mmap_lock in
dup_mmap(). If we get a write-fault before that TLB flush in the parent,
and we end up replacing that anonymous page in the parent process in
do_wp_page() (because, COW-shared with the child), this might lead to some
stale writable TLB entries targeting the wrong (old) page. Similar issue
happened in the past with userfaultfd (see flush_tlb_page() call inside
do_wp_page()).
Lock VMAs of the parent process when forking a child, which prevents
concurrent page faults during fork operation and avoids this issue. This
fix can potentially regress some fork-heavy workloads. Kernel build time
did not show noticeable regression on a 56-core machine while a stress
test mapping 10000 VMAs and forking 5000 times in a tight loop shows ~5%
regression. If such fork time regression is unacceptable, disabling
CONFIG_PER_VMA_LOCK should restore its performance. Further optimizations
are possible if this regression proves to be problematic.
Link: https://lkml.kernel.org/r/20230705063711.2670599-1-surenb@google.com
Link: https://lkml.kernel.org/r/20230705063711.2670599-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb(a)google.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Reported-by: Jiri Slaby <jirislaby(a)kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger(a)applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-as…
Reported-by: Jacob Young <jacobly.alt(a)gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Bagas Sanjaya <bagasdotme(a)gmail.com>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Laurent Dufour <ldufour(a)linux.ibm.com>
Cc: <regressions(a)lists.linux.dev>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Chris Li <chriscli(a)google.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Eric Dumazet <edumazet(a)google.com>
Cc: Greg Thelen <gthelen(a)google.com>
Cc: Hans de Goede <hdegoede(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Joel Fernandes <joelaf(a)google.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: Lorenzo Stoakes <lstoakes(a)gmail.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Michel Lespinasse <michel(a)lespinasse.org>
Cc: Mike Rapoport (IBM) <rppt(a)kernel.org>
Cc: Minchan Kim <minchan(a)google.com>
Cc: "Paul E. McKenney" <paulmck(a)kernel.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <peterz(a)infradead.org>
Cc: Punit Agrawal <punit.agrawal(a)bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Song Liu <songliubraving(a)fb.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Will Deacon <will(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/fork.c | 1 +
1 file changed, 1 insertion(+)
--- a/kernel/fork.c~fork-lock-vmas-of-the-parent-process-when-forking
+++ a/kernel/fork.c
@@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(str
for_each_vma(old_vmi, mpnt) {
struct file *file;
+ vma_start_write(mpnt);
if (mpnt->vm_flags & VM_DONTCOPY) {
vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
continue;
_
Patches currently in -mm which might be from surenb(a)google.com are
fork-lock-vmas-of-the-parent-process-when-forking-v3.patch
mm-disable-config_per_vma_lock-until-its-fixed.patch
swap-remove-remnants-of-polling-from-read_swap_cache_async.patch
mm-add-missing-vm_fault_result_trace-name-for-vm_fault_completed.patch
mm-drop-per-vma-lock-when-returning-vm_fault_retry-or-vm_fault_completed.patch
mm-change-folio_lock_or_retry-to-use-vm_fault-directly.patch
mm-handle-swap-page-faults-under-per-vma-lock.patch
mm-handle-userfaults-under-vma-lock.patch
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86. Based on the reproducer
provided in [1] we suspect this is caused by the lack of VMA locking
while forking a child process.
Patch 1/2 in the series implements proper VMA locking during fork.
I tested the fix locally using the reproducer and was unable to reproduce
the memory corruption problem.
This fix can potentially regress some fork-heavy workloads. Kernel build
time did not show noticeable regression on a 56-core machine while a
stress test mapping 10000 VMAs and forking 5000 times in a tight loop
shows ~5% regression. If such fork time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
optimizations are possible if this regression proves to be problematic.
Patch 2/2 disabled per-VMA locks until the fix is tested and verified.
Both patches apply cleanly over Linus' ToT and stable 6.4.y branch.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
[2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@google.com
Suren Baghdasaryan (2):
fork: lock VMAs of the parent process when forking
mm: disable CONFIG_PER_VMA_LOCK until its fixed
kernel/fork.c | 1 +
mm/Kconfig | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)
--
2.41.0.255.g8b1d071c50-goog
The patch titled
Subject: kasan, slub: fix HW_TAGS zeroing with slub_debug
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
kasan-slub-fix-hw_tags-zeroing-with-slub_debug.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Andrey Konovalov <andreyknvl(a)google.com>
Subject: kasan, slub: fix HW_TAGS zeroing with slub_debug
Date: Wed, 5 Jul 2023 14:44:02 +0200
Commit 946fa0dbf2d8 ("mm/slub: extend redzone check to extra allocated
kmalloc space than requested") added precise kmalloc redzone poisoning to
the slub_debug functionality.
However, this commit didn't account for HW_TAGS KASAN fully initializing
the object via its built-in memory initialization feature. Even though
HW_TAGS KASAN memory initialization contains special memory initialization
handling for when slub_debug is enabled, it does not account for in-object
slub_debug redzones. As a result, HW_TAGS KASAN can overwrite these
redzones and cause false-positive slub_debug reports.
To fix the issue, avoid HW_TAGS KASAN memory initialization when
slub_debug is enabled altogether. Implement this by moving the
__slub_debug_enabled check to slab_post_alloc_hook. Common slab code
seems like a more appropriate place for a slub_debug check anyway.
Link: https://lkml.kernel.org/r/678ac92ab790dba9198f9ca14f405651b97c8502.16885610…
Fixes: 946fa0dbf2d8 ("mm/slub: extend redzone check to extra allocated kmalloc space than requested")
Signed-off-by: Andrey Konovalov <andreyknvl(a)google.com>
Reported-by: Will Deacon <will(a)kernel.org>
Acked-by: Marco Elver <elver(a)google.com>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Christoph Lameter <cl(a)linux.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Feng Tang <feng.tang(a)intel.com>
Cc: Hyeonggon Yoo <42.hyeyoo(a)gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: kasan-dev(a)googlegroups.com
Cc: Pekka Enberg <penberg(a)kernel.org>
Cc: Peter Collingbourne <pcc(a)google.com>
Cc: Roman Gushchin <roman.gushchin(a)linux.dev>
Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kasan/kasan.h | 12 ------------
mm/slab.h | 16 ++++++++++++++--
2 files changed, 14 insertions(+), 14 deletions(-)
--- a/mm/kasan/kasan.h~kasan-slub-fix-hw_tags-zeroing-with-slub_debug
+++ a/mm/kasan/kasan.h
@@ -466,18 +466,6 @@ static inline void kasan_unpoison(const
if (WARN_ON((unsigned long)addr & KASAN_GRANULE_MASK))
return;
- /*
- * Explicitly initialize the memory with the precise object size to
- * avoid overwriting the slab redzone. This disables initialization in
- * the arch code and may thus lead to performance penalty. This penalty
- * does not affect production builds, as slab redzones are not enabled
- * there.
- */
- if (__slub_debug_enabled() &&
- init && ((unsigned long)size & KASAN_GRANULE_MASK)) {
- init = false;
- memzero_explicit((void *)addr, size);
- }
size = round_up(size, KASAN_GRANULE_SIZE);
hw_set_mem_tag_range((void *)addr, size, tag, init);
--- a/mm/slab.h~kasan-slub-fix-hw_tags-zeroing-with-slub_debug
+++ a/mm/slab.h
@@ -723,6 +723,7 @@ static inline void slab_post_alloc_hook(
unsigned int orig_size)
{
unsigned int zero_size = s->object_size;
+ bool kasan_init = init;
size_t i;
flags &= gfp_allowed_mask;
@@ -740,6 +741,17 @@ static inline void slab_post_alloc_hook(
zero_size = orig_size;
/*
+ * When slub_debug is enabled, avoid memory initialization integrated
+ * into KASAN and instead zero out the memory via the memset below with
+ * the proper size. Otherwise, KASAN might overwrite SLUB redzones and
+ * cause false-positive reports. This does not lead to a performance
+ * penalty on production builds, as slub_debug is not intended to be
+ * enabled there.
+ */
+ if (__slub_debug_enabled())
+ kasan_init = false;
+
+ /*
* As memory initialization might be integrated into KASAN,
* kasan_slab_alloc and initialization memset must be
* kept together to avoid discrepancies in behavior.
@@ -747,8 +759,8 @@ static inline void slab_post_alloc_hook(
* As p[i] might get tagged, memset and kmemleak hook come after KASAN.
*/
for (i = 0; i < size; i++) {
- p[i] = kasan_slab_alloc(s, p[i], flags, init);
- if (p[i] && init && !kasan_has_integrated_init())
+ p[i] = kasan_slab_alloc(s, p[i], flags, kasan_init);
+ if (p[i] && init && (!kasan_init || !kasan_has_integrated_init()))
memset(p[i], 0, zero_size);
kmemleak_alloc_recursive(p[i], s->object_size, 1,
s->flags, flags);
_
Patches currently in -mm which might be from andreyknvl(a)google.com are
kasan-fix-type-cast-in-memory_is_poisoned_n.patch
kasan-slub-fix-hw_tags-zeroing-with-slub_debug.patch
The patch titled
Subject: mm: disable CONFIG_PER_VMA_LOCK until its fixed
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-disable-config_per_vma_lock-until-its-fixed.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Suren Baghdasaryan <surenb(a)google.com>
Subject: mm: disable CONFIG_PER_VMA_LOCK until its fixed
Date: Tue, 4 Jul 2023 23:37:11 -0700
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86. Disable per-VMA locks config to
prevent this issue while the problem is being investigated. This is
expected to be a temporary measure.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
[2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@google.com
Link: https://lkml.kernel.org/r/20230705063711.2670599-3-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb(a)google.com>
Reported-by: Jiri Slaby <jirislaby(a)kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Jacob Young <jacobly.alt(a)gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Bagas Sanjaya <bagasdotme(a)gmail.com>
Cc: Chris Li <chriscli(a)google.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Eric Dumazet <edumazet(a)google.com>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Greg Thelen <gthelen(a)google.com>
Cc: Hans de Goede <hdegoede(a)redhat.com>
Cc: Holger Hoffstätte <holger(a)applied-asynchrony.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Joel Fernandes <joelaf(a)google.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: Laurent Dufour <ldufour(a)linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: Lorenzo Stoakes <lstoakes(a)gmail.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Michel Lespinasse <michel(a)lespinasse.org>
Cc: Mike Rapoport (IBM) <rppt(a)kernel.org>
Cc: Minchan Kim <minchan(a)google.com>
Cc: "Paul E. McKenney" <paulmck(a)kernel.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <peterz(a)infradead.org>
Cc: Punit Agrawal <punit.agrawal(a)bytedance.com>
Cc: <regressions(a)lists.linux.dev>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Song Liu <songliubraving(a)fb.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Will Deacon <will(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/Kconfig | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/Kconfig~mm-disable-config_per_vma_lock-until-its-fixed
+++ a/mm/Kconfig
@@ -1224,8 +1224,9 @@ config ARCH_SUPPORTS_PER_VMA_LOCK
def_bool n
config PER_VMA_LOCK
- def_bool y
+ bool "Enable per-vma locking during page fault handling."
depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
+ depends on BROKEN
help
Allow per-vma locking during page fault handling.
_
Patches currently in -mm which might be from surenb(a)google.com are
fork-lock-vmas-of-the-parent-process-when-forking.patch
mm-disable-config_per_vma_lock-until-its-fixed.patch
swap-remove-remnants-of-polling-from-read_swap_cache_async.patch
mm-add-missing-vm_fault_result_trace-name-for-vm_fault_completed.patch
mm-drop-per-vma-lock-when-returning-vm_fault_retry-or-vm_fault_completed.patch
mm-change-folio_lock_or_retry-to-use-vm_fault-directly.patch
mm-handle-swap-page-faults-under-per-vma-lock.patch
mm-handle-userfaults-under-vma-lock.patch
mm-disable-config_per_vma_lock-by-default-until-its-fixed.patch
The patch titled
Subject: fork: lock VMAs of the parent process when forking
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
fork-lock-vmas-of-the-parent-process-when-forking.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Suren Baghdasaryan <surenb(a)google.com>
Subject: fork: lock VMAs of the parent process when forking
Date: Tue, 4 Jul 2023 23:37:10 -0700
Patch series "Avoid memory corruption caused by per-VMA locks", v2.
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86. Based on the reproducer
provided in [1] we suspect this is caused by the lack of VMA locking while
forking a child process.
Patch 1/2 in the series implements proper VMA locking during fork. I
tested the fix locally using the reproducer and was unable to reproduce
the memory corruption problem.
This fix can potentially regress some fork-heavy workloads. Kernel build
time did not show noticeable regression on a 56-core machine while a
stress test mapping 10000 VMAs and forking 5000 times in a tight loop
shows ~5% regression. If such fork time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
optimizations are possible if this regression proves to be problematic.
Patch 2/2 disabled per-VMA locks until the fix is tested and verified.
This patch (of 2):
When forking a child process, parent write-protects an anonymous page and
COW-shares it with the child being forked using copy_present_pte().
Parent's TLB is flushed right before we drop the parent's mmap_lock in
dup_mmap(). If we get a write-fault before that TLB flush in the parent,
and we end up replacing that anonymous page in the parent process in
do_wp_page() (because, COW-shared with the child), this might lead to some
stale writable TLB entries targeting the wrong (old) page. Similar issue
happened in the past with userfaultfd (see flush_tlb_page() call inside
do_wp_page()).
Lock VMAs of the parent process when forking a child, which prevents
concurrent page faults during fork operation and avoids this issue. This
fix can potentially regress some fork-heavy workloads. Kernel build time
did not show noticeable regression on a 56-core machine while a stress
test mapping 10000 VMAs and forking 5000 times in a tight loop shows ~5%
regression. If such fork time regression is unacceptable, disabling
CONFIG_PER_VMA_LOCK should restore its performance. Further optimizations
are possible if this regression proves to be problematic.
Link: https://lkml.kernel.org/r/20230705063711.2670599-1-surenb@google.com
Link: https://lkml.kernel.org/r/20230705063711.2670599-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb(a)google.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Reported-by: Jiri Slaby <jirislaby(a)kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffst��tte <holger(a)applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-as…
Reported-by: Jacob Young <jacobly.alt(a)gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Bagas Sanjaya <bagasdotme(a)gmail.com>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Laurent Dufour <ldufour(a)linux.ibm.com>
Cc: <regressions(a)lists.linux.dev>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Chris Li <chriscli(a)google.com>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Eric Dumazet <edumazet(a)google.com>
Cc: Greg Thelen <gthelen(a)google.com>
Cc: Hans de Goede <hdegoede(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Joel Fernandes <joelaf(a)google.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: Lorenzo Stoakes <lstoakes(a)gmail.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Michel Lespinasse <michel(a)lespinasse.org>
Cc: Mike Rapoport (IBM) <rppt(a)kernel.org>
Cc: Minchan Kim <minchan(a)google.com>
Cc: "Paul E. McKenney" <paulmck(a)kernel.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <peterz(a)infradead.org>
Cc: Punit Agrawal <punit.agrawal(a)bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Song Liu <songliubraving(a)fb.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Will Deacon <will(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/fork.c | 1 +
1 file changed, 1 insertion(+)
--- a/kernel/fork.c~fork-lock-vmas-of-the-parent-process-when-forking
+++ a/kernel/fork.c
@@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(str
for_each_vma(old_vmi, mpnt) {
struct file *file;
+ vma_start_write(mpnt);
if (mpnt->vm_flags & VM_DONTCOPY) {
vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
continue;
_
Patches currently in -mm which might be from surenb(a)google.com are
fork-lock-vmas-of-the-parent-process-when-forking.patch
mm-disable-config_per_vma_lock-until-its-fixed.patch
swap-remove-remnants-of-polling-from-read_swap_cache_async.patch
mm-add-missing-vm_fault_result_trace-name-for-vm_fault_completed.patch
mm-drop-per-vma-lock-when-returning-vm_fault_retry-or-vm_fault_completed.patch
mm-change-folio_lock_or_retry-to-use-vm_fault-directly.patch
mm-handle-swap-page-faults-under-per-vma-lock.patch
mm-handle-userfaults-under-vma-lock.patch
mm-disable-config_per_vma_lock-by-default-until-its-fixed.patch
The soundwire subsystem uses two completion structures that allow
drivers to wait for a soundwire device to become enumerated on the bus
and initialised by its driver, respectively.
The code implementing the signalling is currently broken as it does not
signal all current and future waiters and also uses the wrong
reinitialisation function, which can potentially lead to memory
corruption if there are still waiters on the queue.
Not signalling future waiters specifically breaks sound card probe
deferrals, as codec drivers cannot tell that the soundwire device is
already attached when being reprobed. Some codec runtime PM
implementations suffer from similar problems, as waiting for enumeration
during resume can also time out despite the device already having been
enumerated.
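For illustration only (not part of this patch), a minimal sketch of the
completion semantics the change relies on; my_wait_for_attach() and its
2000 ms timeout are made-up example names, while the sdw_slave field and
the completion API calls are the real ones used above:
#include <linux/completion.h>
#include <linux/errno.h>
#include <linux/jiffies.h>
#include <linux/soundwire/sdw.h>
/*
 * complete() signals only a single waiter, while complete_all() signals
 * all current and future waiters until the completion is reset with
 * reinit_completion(). reinit_completion() only clears the done counter;
 * init_completion() also reinitialises the wait queue head, which must
 * not be done while waiters may still be queued on it.
 */
static int my_wait_for_attach(struct sdw_slave *slave)
{
	unsigned long left;
	/* Returns immediately if complete_all() has already been called. */
	left = wait_for_completion_timeout(&slave->enumeration_complete,
					   msecs_to_jiffies(2000));
	return left ? 0 : -ETIMEDOUT;
}
With plain complete(), a waiter that only queues after the signal (for
example a deferred reprobe) could still time out here, which is the
failure mode described above.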
Fixes: fb9469e54fa7 ("soundwire: bus: fix race condition with enumeration_complete signaling")
Fixes: a90def068127 ("soundwire: bus: fix race condition with initialization_complete signaling")
Cc: stable(a)vger.kernel.org # 5.7
Cc: Pierre-Louis Bossart <pierre-louis.bossart(a)linux.intel.com>
Cc: Rander Wang <rander.wang(a)linux.intel.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
drivers/soundwire/bus.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/soundwire/bus.c b/drivers/soundwire/bus.c
index 1ea6a64f8c4a..66e5dba919fa 100644
--- a/drivers/soundwire/bus.c
+++ b/drivers/soundwire/bus.c
@@ -908,8 +908,8 @@ static void sdw_modify_slave_status(struct sdw_slave *slave,
"initializing enumeration and init completion for Slave %d\n",
slave->dev_num);
- init_completion(&slave->enumeration_complete);
- init_completion(&slave->initialization_complete);
+ reinit_completion(&slave->enumeration_complete);
+ reinit_completion(&slave->initialization_complete);
} else if ((status == SDW_SLAVE_ATTACHED) &&
(slave->status == SDW_SLAVE_UNATTACHED)) {
@@ -917,7 +917,7 @@ static void sdw_modify_slave_status(struct sdw_slave *slave,
"signaling enumeration completion for Slave %d\n",
slave->dev_num);
- complete(&slave->enumeration_complete);
+ complete_all(&slave->enumeration_complete);
}
slave->status = status;
mutex_unlock(&bus->bus_lock);
@@ -1941,7 +1941,7 @@ int sdw_handle_slave_status(struct sdw_bus *bus,
"signaling initialization completion for Slave %d\n",
slave->dev_num);
- complete(&slave->initialization_complete);
+ complete_all(&slave->initialization_complete);
/*
* If the manager became pm_runtime active, the peripherals will be
--
2.39.3
Hi Greg, Sasha,
The following list shows the backported patches; I am using the original
commit IDs for reference:
1) 0854db2aaef3 ("netfilter: nf_tables: use net_generic infra for transaction data")
2) 81ea01066741 ("netfilter: nf_tables: add rescheduling points during loop detection walks")
3) 1240eb93f061 ("netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE")
4) 4bedf9eee016 ("netfilter: nf_tables: fix chain binding transaction logic")
5) 26b5a5712eb8 ("netfilter: nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain")
6) 938154b93be8 ("netfilter: nf_tables: reject unbound anonymous set before commit phase")
7) 62e1e94b246e ("netfilter: nf_tables: reject unbound chain set before commit phase")
8) f8bb7889af58 ("netfilter: nftables: rename set element data activation/deactivation functions")
9) 628bd3e49cba ("netfilter: nf_tables: drop map element references from preparation phase")
10) 3e70489721b6 ("netfilter: nf_tables: unbind non-anonymous set if rule construction fails")
Note that Patch #1 is a backported dependency patch required by these fixes.
Please, apply,
Thanks.
Florian Westphal (2):
netfilter: nf_tables: use net_generic infra for transaction data
netfilter: nf_tables: add rescheduling points during loop detection walks
Pablo Neira Ayuso (8):
netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE
netfilter: nf_tables: fix chain binding transaction logic
netfilter: nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain
netfilter: nf_tables: reject unbound anonymous set before commit phase
netfilter: nf_tables: reject unbound chain set before commit phase
netfilter: nftables: rename set element data activation/deactivation functions
netfilter: nf_tables: drop map element references from preparation phase
netfilter: nf_tables: unbind non-anonymous set if rule construction fails
include/net/netfilter/nf_tables.h | 41 +-
include/net/netns/nftables.h | 7 -
net/netfilter/nf_tables_api.c | 696 +++++++++++++++++++++---------
net/netfilter/nf_tables_offload.c | 30 +-
net/netfilter/nft_chain_filter.c | 11 +-
net/netfilter/nft_dynset.c | 6 +-
net/netfilter/nft_immediate.c | 90 +++-
net/netfilter/nft_set_bitmap.c | 5 +-
net/netfilter/nft_set_hash.c | 23 +-
net/netfilter/nft_set_pipapo.c | 14 +-
net/netfilter/nft_set_rbtree.c | 5 +-
11 files changed, 682 insertions(+), 246 deletions(-)
--
2.30.2
Here is a first batch of fixes for v6.5 and older.
The fixes are not linked to each other.
Patch 1 ensures subflows are unhashed before cleaning the backlog to
avoid races. This fixes another recent fix from v6.4.
Patch 2 no longer relies on the implicit state check in mptcp_listen(),
avoiding races when receiving an MP_FASTCLOSE. A regression from v5.17.
The rest fixes issues in the selftests.
Patch 3 makes sure errors when setting up the environment are no longer
ignored. For v5.17+.
Patch 4 uses 'iptables-legacy' if available to be able to run on older
kernels. A fix for v5.13 and newer.
Patch 5 catches errors when issues are detected with packet marks. Also
for v5.13+.
Patch 6 uses the correct variable instead of an undefined one. Even
though there was no visible impact, this can help catch regressions
later. An issue visible in v5.19+.
Patch 7 makes sure errors in some sub-tests are reported so that the
selftest is marked as failed as expected. Also for v5.19+.
Patch 8 adds a kernel config that is required to execute MPTCP
selftests. It is valid for v5.9+.
Patch 9 fixes issues when validating the userspace path-manager with
32-bit arch, an issue affecting v5.19+.
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
---
Matthieu Baerts (7):
selftests: mptcp: connect: fail if nft supposed to work
selftests: mptcp: sockopt: use 'iptables-legacy' if available
selftests: mptcp: sockopt: return error if wrong mark
selftests: mptcp: userspace_pm: use correct server port
selftests: mptcp: userspace_pm: report errors with 'remove' tests
selftests: mptcp: depend on SYN_COOKIES
selftests: mptcp: pm_nl_ctl: fix 32-bit support
Paolo Abeni (2):
mptcp: ensure subflow is unhashed before cleaning the backlog
mptcp: do not rely on implicit state check in mptcp_listen()
net/mptcp/protocol.c | 7 +++++-
tools/testing/selftests/net/mptcp/config | 1 +
tools/testing/selftests/net/mptcp/mptcp_connect.sh | 3 +++
tools/testing/selftests/net/mptcp/mptcp_sockopt.sh | 29 ++++++++++++----------
tools/testing/selftests/net/mptcp/pm_nl_ctl.c | 10 ++++----
tools/testing/selftests/net/mptcp/userspace_pm.sh | 4 ++-
6 files changed, 34 insertions(+), 20 deletions(-)
---
base-commit: 14bb236b29922c4f57d8c05bfdbcb82677f917c9
change-id: 20230704-upstream-net-20230704-misc-fixes-6-5-rc1-c52608649559
Best regards,
--
Matthieu Baerts <matthieu.baerts(a)tessares.net>
Making 'blk' a sector_t (i.e. 64 bit if LBD support is active) breaks
the 'blk>0' termination test in the partition block loop when a value
of (signed int) -1 is used to mark the end of the partition block list.
This bug was introduced in patch 3 of my prior Amiga partition
support fixes series, and spotted by Christian Zigotzky when
testing the latest block updates.
Explicitly cast 'blk' to signed int to allow use of -1 to
terminate the partition block linked list.
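For illustration only (not from the patch), a minimal standalone C
sketch of the termination test; the values are made up and the cast
mirrors the (s32) cast added below:
#include <stdint.h>
#include <stdio.h>
int main(void)
{
	/* RDB end-of-list marker as read from disk: be32 -1, i.e. 0xffffffff. */
	uint64_t blk = 0xffffffffu;	/* 'blk' as a 64-bit sector_t */
	printf("%d\n", blk > 0);		/* 1: the loop would keep going */
	printf("%d\n", (int32_t)blk > 0);	/* 0: loop terminates as before */
	return 0;
}
The cast relies on the usual two's-complement conversion of 0xffffffff
back to -1, which is the same assumption the old 32-bit code made.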
Testing by Christian also exposed another aspect of the old
bug fixed in commits fc3d092c6b ("block: fix signed int
overflow in Amiga partition support") and b6f3f28f60
("block: add overflow checks for Amiga partition support"):
Partitions that did overflow the disk size (due to 32 bit int
overflow) were not skipped but truncated to the end of the
disk. Users who missed the warning message during boot would
go on to create a filesystem with a size exceeding the
actual partition size. Now that the 32 bit overflow has been
corrected, such filesystems may refuse to mount with a
'filesystem exceeds partition size' error. Users should
either correct the partition size, or resize the filesystem
before attempting to boot a kernel with the RDB fixes in
place.
Reported-by: Christian Zigotzky <chzigotzky(a)xenosoft.de>
Fixes: b6f3f28f60 ("block: add overflow checks for Amiga partition support")
Message-ID: 024ce4fa-cc6d-50a2-9aae-3701d0ebf668(a)xenosoft.de
Cc: <stable(a)vger.kernel.org> # 6.4
Link: https://lore.kernel.org/r/024ce4fa-cc6d-50a2-9aae-3701d0ebf668@xenosoft.de
Signed-off-by: Michael Schmitz <schmitzmic(a)gmail.com>
Tested-by: Christian Zigotzky <chzigotzky(a)xenosoft.de>
--
Changes since v2:
Adrian Glaubitz:
- fix typo in commit message
Changes since v1:
- corrected Fixes: tag
- added Tested-by:
- reworded commit message to describe filesystem partition
size mismatch problem
---
block/partitions/amiga.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/partitions/amiga.c b/block/partitions/amiga.c
index ed222b9c901b..506921095412 100644
--- a/block/partitions/amiga.c
+++ b/block/partitions/amiga.c
@@ -90,7 +90,7 @@ int amiga_partition(struct parsed_partitions *state)
}
blk = be32_to_cpu(rdb->rdb_PartitionList);
put_dev_sector(sect);
- for (part = 1; blk>0 && part<=16; part++, put_dev_sector(sect)) {
+ for (part = 1; (s32) blk>0 && part<=16; part++, put_dev_sector(sect)) {
/* Read in terms partition table understands */
if (check_mul_overflow(blk, (sector_t) blksize, &blk)) {
pr_err("Dev %s: overflow calculating partition block %llu! Skipping partitions %u and beyond\n",
--
2.17.1
We get the following crash caused by a null pointer access:
BUG: kernel NULL pointer dereference, address: 0000000000000000
...
RIP: 0010:resume_execution+0x35/0x190
...
Call Trace:
<#DB>
kprobe_debug_handler+0x41/0xd0
exc_debug+0xe5/0x1b0
asm_exc_debug+0x19/0x30
RIP: 0010:copy_from_kernel_nofault.part.0+0x55/0xc0
...
</#DB>
process_fetch_insn+0xfb/0x720
kprobe_trace_func+0x199/0x2c0
? kernel_clone+0x5/0x2f0
kprobe_dispatcher+0x3d/0x60
aggr_pre_handler+0x40/0x80
? kernel_clone+0x1/0x2f0
kprobe_ftrace_handler+0x82/0xf0
? __se_sys_clone+0x65/0x90
ftrace_ops_assist_func+0x86/0x110
? rcu_nocb_try_bypass+0x1f3/0x370
0xffffffffc07e60c8
? kernel_clone+0x1/0x2f0
kernel_clone+0x5/0x2f0
The analysis reveals that kprobe and hardware breakpoints conflict in
the use of debug exceptions.
If we set a hardware breakpoint on a memory address and also have a
kprobe event that fetches the memory at that address, then when the
kprobe triggers it reads the memory and trips the hardware breakpoint.
Since kprobes handle debug exceptions before hardware breakpoints do,
the kprobe code incorrectly assumes that the exception is a kprobe
trigger.
Note that after mainline commit 6256e668b7af ("x86/kprobes: Use
int3 instead of debug trap for single-step"), kprobes no longer use the
debug trap, which avoids this conflict with hardware breakpoints. That
commit was made to remove the IRET that returns to kernel, though, not
to fix the problem we have here. Also, there are a bunch of merge
conflicts when trying to apply it to older kernels, so fixing the issue
directly in the older kernels is probably a better option.
If the debug exception is triggered by kprobe, then regs->ip should be
located in the kprobe instruction slot. Add this check to
kprobe_debug_handler() to properly determine if a debug exception should
be handled by kprobe.
The stable kernels affected are 5.10, 5.4, 4.19, and 4.14. I made the
fix in 5.10, and we should probably apply this fix to other stable
kernels.
Signed-off-by: Li Huafei <lihuafei1(a)huawei.com>
---
arch/x86/kernel/kprobes/core.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 5de757099186..fd8d7d128807 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -900,7 +900,14 @@ int kprobe_debug_handler(struct pt_regs *regs)
struct kprobe *cur = kprobe_running();
struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
- if (!cur)
+ if (!cur || !cur->ainsn.insn)
+ return 0;
+
+ /* regs->ip should be the address of next instruction to
+ * cur->ainsn.insn.
+ */
+ if (regs->ip < (unsigned long)cur->ainsn.insn ||
+ regs->ip - (unsigned long)cur->ainsn.insn > MAX_INSN_SIZE)
return 0;
resume_execution(cur, regs, kcb);
--
2.17.1
When forking a child process, parent write-protects an anonymous page
and COW-shares it with the child being forked using copy_present_pte().
Parent's TLB is flushed right before we drop the parent's mmap_lock in
dup_mmap(). If we get a write-fault before that TLB flush in the parent,
and we end up replacing that anonymous page in the parent process in
do_wp_page() (because, COW-shared with the child), this might lead to
some stale writable TLB entries targeting the wrong (old) page.
Similar issue happened in the past with userfaultfd (see flush_tlb_page()
call inside do_wp_page()).
Lock VMAs of the parent process when forking a child, which prevents
concurrent page faults during fork operation and avoids this issue.
This fix can potentially regress some fork-heavy workloads. Kernel build
time did not show noticeable regression on a 56-core machine while a
stress test mapping 10000 VMAs and forking 5000 times in a tight loop
shows ~5% regression. If such fork time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
optimizations are possible if this regression proves to be problematic.
Suggested-by: David Hildenbrand <david(a)redhat.com>
Reported-by: Jiri Slaby <jirislaby(a)kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger(a)applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-as…
Reported-by: Jacob Young <jacobly.alt(a)gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Signed-off-by: Suren Baghdasaryan <surenb(a)google.com>
---
kernel/fork.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/fork.c b/kernel/fork.c
index b85814e614a5..d2e12b6d2b18 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
for_each_vma(old_vmi, mpnt) {
struct file *file;
+ vma_start_write(mpnt);
if (mpnt->vm_flags & VM_DONTCOPY) {
vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
continue;
--
2.41.0.255.g8b1d071c50-goog
The patch titled
Subject: kasan: fix type cast in memory_is_poisoned_n
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
kasan-fix-type-cast-in-memory_is_poisoned_n.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Andrey Konovalov <andreyknvl(a)google.com>
Subject: kasan: fix type cast in memory_is_poisoned_n
Date: Tue, 4 Jul 2023 02:52:05 +0200
Commit bb6e04a173f0 ("kasan: use internal prototypes matching gcc-13
builtins") introduced a bug into the memory_is_poisoned_n implementation:
it effectively removed the cast to a signed integer type after applying
KASAN_GRANULE_MASK.
As a result, KASAN started failing to properly check memset, memcpy, and
other similar functions.
Fix the bug by adding the cast back (through an additional signed integer
variable to make the code more readable).
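For illustration only (not from the patch), a standalone C sketch of the
integer conversions involved, assuming KASAN_GRANULE_MASK expands to an
unsigned long value, which is what makes the unsigned comparison
possible; the concrete numbers are made up:
#include <stdio.h>
int main(void)
{
	unsigned long mask = 8UL - 1;	/* stand-in for KASAN_GRANULE_MASK */
	unsigned long last_byte = 0x1003;
	signed char shadow = -2;	/* a poisoned shadow byte */
	/*
	 * Without a signed cast after the mask, the left-hand side is
	 * unsigned long, so 'shadow' is converted to a huge unsigned value
	 * and the poisoning check never fires.
	 */
	printf("%d\n", ((long)last_byte & mask) >= shadow);	/* prints 0 */
	/* The fix keeps the masked value in a signed variable first. */
	signed char last_accessible_byte = last_byte & mask;
	printf("%d\n", last_accessible_byte >= shadow);		/* prints 1 */
	return 0;
}
This is the same effect as the s8 last_accessible_byte variable
introduced by the patch.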
Link: https://lkml.kernel.org/r/8c9e0251c2b8b81016255709d4ec42942dcaf018.16884318…
Fixes: bb6e04a173f0 ("kasan: use internal prototypes matching gcc-13 builtins")
Signed-off-by: Andrey Konovalov <andreyknvl(a)google.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kasan/generic.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/kasan/generic.c~kasan-fix-type-cast-in-memory_is_poisoned_n
+++ a/mm/kasan/generic.c
@@ -130,9 +130,10 @@ static __always_inline bool memory_is_po
if (unlikely(ret)) {
const void *last_byte = addr + size - 1;
s8 *last_shadow = (s8 *)kasan_mem_to_shadow(last_byte);
+ s8 last_accessible_byte = (unsigned long)last_byte & KASAN_GRANULE_MASK;
if (unlikely(ret != (unsigned long)last_shadow ||
- (((long)last_byte & KASAN_GRANULE_MASK) >= *last_shadow)))
+ last_accessible_byte >= *last_shadow))
return true;
}
return false;
_
Patches currently in -mm which might be from andreyknvl(a)google.com are
kasan-fix-type-cast-in-memory_is_poisoned_n.patch
The patch titled
Subject: memcg: drop kmem.limit_in_bytes
has been added to the -mm mm-unstable branch. Its filename is
memcg-drop-kmemlimit_in_bytes.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Michal Hocko <mhocko(a)suse.com>
Subject: memcg: drop kmem.limit_in_bytes
Date: Tue, 4 Jul 2023 13:52:40 +0200
kmem.limit_in_bytes (v1 way to limit kernel memory usage) has been
deprecated since 58056f77502f ("memcg, kmem: further deprecate
kmem.limit_in_bytes") merged in 5.16. We haven't heard about any serious
users since then, but it seems that the mere presence of the file is
causing more harm than good. We (SUSE) have had several bug reports from
customers where Docker-based containers started to fail because a write to
kmem.limit_in_bytes failed.
This was unexpected because the runc code only expects ENOENT (kmem disabled)
or EBUSY (tasks already running within the cgroup), so the unrecognized error
code made the whole container startup fail. This has since been addressed by
https://github.com/opencontainers/runc/commit/52390d68040637dfc77f9fda6bbe7…
so current Docker runtimes no longer suffer from the problem. There are,
however, still older versions of Docker in use, and they are likely hard to
get rid of completely.
Address this by wiping out the file completely, effectively getting back to
the pre-4.5 era and the CONFIG_MEMCG_KMEM=n configuration.
I would recommend backporting to stable trees which have picked up
58056f77502f ("memcg, kmem: further deprecate kmem.limit_in_bytes").
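To illustrate the failure modes described above, here is a minimal userspace
sketch (the cgroup v1 mount point and group name are hypothetical) of the
check an old container runtime effectively performs: once the file is gone,
open() fails with ENOENT, which such runtimes already tolerate, instead of
the write returning an unexpected error code.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical cgroup v1 path; adjust to the local mount layout. */
	const char *path =
		"/sys/fs/cgroup/memory/test/memory.kmem.limit_in_bytes";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		if (errno == ENOENT)
			printf("kmem limit file absent - tolerated\n");
		else
			printf("unexpected open error: %s\n", strerror(errno));
		return 0;
	}

	/* On kernels that still expose the file, this write is what failed. */
	if (write(fd, "1073741824", 10) < 0)
		printf("write failed: %s\n", strerror(errno));

	close(fd);
	return 0;
}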
Link: https://lkml.kernel.org/r/20230704115240.14672-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Roman Gushchin <roman.gushchin(a)linux.dev>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
Documentation/admin-guide/cgroup-v1/memory.rst | 2 --
mm/memcontrol.c | 13 -------------
2 files changed, 15 deletions(-)
--- a/Documentation/admin-guide/cgroup-v1/memory.rst~memcg-drop-kmemlimit_in_bytes
+++ a/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -92,8 +92,6 @@ Brief summary of control files.
memory.oom_control set/show oom controls.
memory.numa_stat show the number of memory usage per numa
node
- memory.kmem.limit_in_bytes This knob is deprecated and writing to
- it will return -ENOTSUPP.
memory.kmem.usage_in_bytes show current kernel memory allocation
memory.kmem.failcnt show the number of kernel memory usage
hits limits
--- a/mm/memcontrol.c~memcg-drop-kmemlimit_in_bytes
+++ a/mm/memcontrol.c
@@ -3708,9 +3708,6 @@ static u64 mem_cgroup_read_u64(struct cg
case _MEMSWAP:
counter = &memcg->memsw;
break;
- case _KMEM:
- counter = &memcg->kmem;
- break;
case _TCP:
counter = &memcg->tcpmem;
break;
@@ -3871,10 +3868,6 @@ static ssize_t mem_cgroup_write(struct k
case _MEMSWAP:
ret = mem_cgroup_resize_max(memcg, nr_pages, true);
break;
- case _KMEM:
- /* kmem.limit_in_bytes is deprecated. */
- ret = -EOPNOTSUPP;
- break;
case _TCP:
ret = memcg_update_tcp_max(memcg, nr_pages);
break;
@@ -5086,12 +5079,6 @@ static struct cftype mem_cgroup_legacy_f
},
#endif
{
- .name = "kmem.limit_in_bytes",
- .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
- .write = mem_cgroup_write,
- .read_u64 = mem_cgroup_read_u64,
- },
- {
.name = "kmem.usage_in_bytes",
.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
.read_u64 = mem_cgroup_read_u64,
_
Patches currently in -mm which might be from mhocko(a)suse.com are
memcg-drop-kmemlimit_in_bytes.patch
The patch titled
Subject: bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
bootmem-remove-the-vmemmap-pages-from-kmemleak-in-free_bootmem_page.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Liu Shixin <liushixin2(a)huawei.com>
Subject: bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page
Date: Tue, 4 Jul 2023 18:19:42 +0800
commit dd0ff4d12dd2 ("bootmem: remove the vmemmap pages from kmemleak in
put_page_bootmem") fix an overlaps existing problem of kmemleak. But the
problem still existed when HAVE_BOOTMEM_INFO_NODE is disabled, because in
this case, free_bootmem_page() will call free_reserved_page() directly.
Fix the problem by adding kmemleak_free_part() in free_bootmem_page() when
HAVE_BOOTMEM_INFO_NODE is disabled.
Link: https://lkml.kernel.org/r/20230704101942.2819426-1-liushixin2@huawei.com
Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
Signed-off-by: Liu Shixin <liushixin2(a)huawei.com>
Acked-by: Muchun Song <songmuchun(a)bytedance.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/bootmem_info.h | 2 ++
1 file changed, 2 insertions(+)
--- a/include/linux/bootmem_info.h~bootmem-remove-the-vmemmap-pages-from-kmemleak-in-free_bootmem_page
+++ a/include/linux/bootmem_info.h
@@ -3,6 +3,7 @@
#define __LINUX_BOOTMEM_INFO_H
#include <linux/mm.h>
+#include <linux/kmemleak.h>
/*
* Types for free bootmem stored in page->lru.next. These have to be in
@@ -59,6 +60,7 @@ static inline void get_page_bootmem(unsi
static inline void free_bootmem_page(struct page *page)
{
+ kmemleak_free_part(page_to_virt(page), PAGE_SIZE);
free_reserved_page(page);
}
#endif
_
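For reference, this is how the helper reads with the hunk above applied when
HAVE_BOOTMEM_INFO_NODE is disabled (reconstructed from the diff, not copied
verbatim from the tree): the vmemmap page is removed from kmemleak's tracking
before being handed back to the page allocator, mirroring what
put_page_bootmem() already does in the HAVE_BOOTMEM_INFO_NODE=y case.

#include <linux/mm.h>
#include <linux/kmemleak.h>

static inline void free_bootmem_page(struct page *page)
{
	/* Tell kmemleak to stop tracking this page before freeing it. */
	kmemleak_free_part(page_to_virt(page), PAGE_SIZE);
	free_reserved_page(page);
}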
Patches currently in -mm which might be from liushixin2(a)huawei.com are
bootmem-remove-the-vmemmap-pages-from-kmemleak-in-free_bootmem_page.patch