The submit queue polling threads are userland threads that just never exit to the userland. When creating the thread with IORING_SETUP_SQ_AFF, the affinity of the poller thread is set to the cpu specified in sq_thread_cpu. However, this CPU can be outside of the cpuset defined by the cgroup cpuset controller. This violates the rules defined by the cpuset controller and is a potential issue for realtime applications.
In b7ed6d8ffd6 we fixed the default affinity of the poller thread, in case no explicit pinning is required by inheriting the one of the creating task. In case of explicit pinning, the check is more complicated, as also a cpu outside of the parent cpumask is allowed. We implemented this by using cpuset_cpus_allowed (that has support for cgroup cpusets) and testing if the requested cpu is in the set.
Fixes: 37d1e2e3642e ("io_uring: move SQPOLL thread io-wq forked worker") Cc: stable@vger.kernel.org # 6.1+ Signed-off-by: Felix Moessbauer felix.moessbauer@siemens.com --- Hi,
that's hopefully the last fix of cpu pinnings of the sq poller threads. However, there is more to come on the io-wq side. E.g the syscalls for IORING_REGISTER_IOWQ_AFF that can be used to change the affinites are not yet protected. I'm currently just lacking good reproducers for that. I also have to admit that I don't feel too comfortable making changes to the wq part, given that I don't have good tests.
While fixing this, I'm wondering if it makes sense to add tests for the combination of pinning and cpuset. If yes, where should these tests be added?
Best regards, Felix Moessbauer Siemens AG
io_uring/sqpoll.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c index 713be7c29388..b8ec8fec99b8 100644 --- a/io_uring/sqpoll.c +++ b/io_uring/sqpoll.c @@ -10,6 +10,7 @@ #include <linux/slab.h> #include <linux/audit.h> #include <linux/security.h> +#include <linux/cpuset.h> #include <linux/io_uring.h>
#include <uapi/linux/io_uring.h> @@ -459,10 +460,12 @@ __cold int io_sq_offload_create(struct io_ring_ctx *ctx, return 0;
if (p->flags & IORING_SETUP_SQ_AFF) { + struct cpumask allowed_mask; int cpu = p->sq_thread_cpu;
ret = -EINVAL; - if (cpu >= nr_cpu_ids || !cpu_online(cpu)) + cpuset_cpus_allowed(current, &allowed_mask); + if (!cpumask_test_cpu(cpu, &allowed_mask)) goto err_sqpoll; sqd->sq_cpu = cpu; } else {
On 9/9/24 9:00 AM, Felix Moessbauer wrote:
The submit queue polling threads are userland threads that just never exit to the userland. When creating the thread with IORING_SETUP_SQ_AFF, the affinity of the poller thread is set to the cpu specified in sq_thread_cpu. However, this CPU can be outside of the cpuset defined by the cgroup cpuset controller. This violates the rules defined by the cpuset controller and is a potential issue for realtime applications.
In b7ed6d8ffd6 we fixed the default affinity of the poller thread, in case no explicit pinning is required by inheriting the one of the creating task. In case of explicit pinning, the check is more complicated, as also a cpu outside of the parent cpumask is allowed. We implemented this by using cpuset_cpus_allowed (that has support for cgroup cpusets) and testing if the requested cpu is in the set.
This also looks good to me.
that's hopefully the last fix of cpu pinnings of the sq poller threads. However, there is more to come on the io-wq side. E.g the syscalls for IORING_REGISTER_IOWQ_AFF that can be used to change the affinites are not yet protected. I'm currently just lacking good reproducers for that. I also have to admit that I don't feel too comfortable making changes to the wq part, given that I don't have good tests.
Yep io-wq will have the same ignorance of cpu limits, so would need the same love for when someone asks for specific cpus.
While fixing this, I'm wondering if it makes sense to add tests for the combination of pinning and cpuset. If yes, where should these tests be added?
Yeah certainly add tests, liburing would be a good spot for that. That's where the feature/regression/bug tests always go.
On Mon, 09 Sep 2024 17:00:36 +0200, Felix Moessbauer wrote:
The submit queue polling threads are userland threads that just never exit to the userland. When creating the thread with IORING_SETUP_SQ_AFF, the affinity of the poller thread is set to the cpu specified in sq_thread_cpu. However, this CPU can be outside of the cpuset defined by the cgroup cpuset controller. This violates the rules defined by the cpuset controller and is a potential issue for realtime applications.
[...]
Applied, thanks!
[1/1] io_uring/sqpoll: do not allow pinning outside of cpuset commit: f011c9cf04c06f16b24f583d313d3c012e589e50
Best regards,
Hi Felix Moessbauer,
Greetings!
I used Syzkaller and found that there is KASAN: use-after-free Read in io_sq_offload_create in Linux-next tree - next-20240916.
After bisection and the first bad commit is: " f011c9cf04c0 io_uring/sqpoll: do not allow pinning outside of cpuset "
All detailed into can be found at: https://github.com/laifryiee/syzkaller_logs/tree/main/240917_135250_io_sq_of... Syzkaller repro code: https://github.com/laifryiee/syzkaller_logs/blob/main/240917_135250_io_sq_of... Syzkaller repro syscall steps: https://github.com/laifryiee/syzkaller_logs/blob/main/240917_135250_io_sq_of... Syzkaller report: https://github.com/laifryiee/syzkaller_logs/blob/main/240917_135250_io_sq_of... Kconfig(make olddefconfig): https://github.com/laifryiee/syzkaller_logs/blob/main/240917_135250_io_sq_of... Bisect info: https://github.com/laifryiee/syzkaller_logs/blob/main/240917_135250_io_sq_of... bzImage: https://github.com/laifryiee/syzkaller_logs/raw/main/240917_135250_io_sq_off... Issue dmesg: https://github.com/laifryiee/syzkaller_logs/blob/main/240917_135250_io_sq_of...
" [ 23.564898] ================================================================== [ 23.565444] BUG: KASAN: use-after-free in io_sq_offload_create+0xcaa/0x11d0 [ 23.565971] Read of size 8 at addr ffff888036377898 by task repro/729 [ 23.566459] [ 23.566593] CPU: 0 UID: 0 PID: 729 Comm: repro Not tainted 6.11.0-next-20240916-7083504315d6 #1 [ 23.567271] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 [ 23.568066] Call Trace: [ 23.568252] <TASK> [ 23.568417] dump_stack_lvl+0xea/0x150 [ 23.568718] print_report+0xce/0x610 [ 23.569001] ? io_sq_offload_create+0xcaa/0x11d0 [ 23.569340] ? kasan_addr_to_slab+0x11/0xb0 [ 23.569651] ? io_sq_offload_create+0xcaa/0x11d0 [ 23.569992] kasan_report+0xcc/0x110 [ 23.570277] ? io_sq_offload_create+0xcaa/0x11d0 [ 23.570621] kasan_check_range+0x3e/0x1c0 [ 23.570917] __kasan_check_read+0x15/0x20 [ 23.571212] io_sq_offload_create+0xcaa/0x11d0 [ 23.571540] ? __pfx_io_sq_offload_create+0x10/0x10 [ 23.571893] ? __pfx___lock_acquire+0x10/0x10 [ 23.572228] ? __this_cpu_preempt_check+0x21/0x30 [ 23.572580] ? lock_acquire.part.0+0x152/0x390 [ 23.572910] ? __this_cpu_preempt_check+0x21/0x30 [ 23.573254] ? lock_release+0x441/0x870 [ 23.573541] ? __pfx_lock_release+0x10/0x10 [ 23.573846] ? trace_lock_acquire+0x139/0x1b0 [ 23.574180] ? debug_smp_processor_id+0x20/0x30 [ 23.574524] ? rcu_is_watching+0x19/0xc0 [ 23.574826] ? __alloc_pages_noprof+0x517/0x710 [ 23.575171] ? __pfx___alloc_pages_noprof+0x10/0x10 [ 23.575526] ? mod_objcg_state+0x42c/0x9c0 [ 23.575838] ? lockdep_hardirqs_on+0x89/0x110 [ 23.576159] ? __sanitizer_cov_trace_switch+0x58/0xa0 [ 23.576534] ? policy_nodemask+0xf9/0x450 [ 23.576835] ? __sanitizer_cov_trace_const_cmp2+0x1c/0x30 [ 23.577220] ? alloc_pages_mpol_noprof+0x35d/0x580 [ 23.577575] ? __pfx_alloc_pages_mpol_noprof+0x10/0x10 [ 23.577950] ? __kmalloc_node_noprof+0x3a3/0x4e0 [ 23.578302] ? __kvmalloc_node_noprof+0x7f/0x240 [ 23.578645] ? alloc_pages_noprof+0xa9/0x180 [ 23.578963] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30 [ 23.579347] ? io_pages_map+0x244/0x5c0 [ 23.579631] io_uring_setup+0x18df/0x3950 [ 23.579936] ? __pfx_io_uring_setup+0x10/0x10 [ 23.580263] ? __audit_syscall_entry+0x39c/0x500 [ 23.580602] __x64_sys_io_uring_setup+0xa4/0x160 [ 23.580939] x64_sys_call+0x17f5/0x20d0 [ 23.581224] do_syscall_64+0x6d/0x140 [ 23.581498] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 23.581872] RIP: 0033:0x7efd9fa3ee5d [ 23.582140] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48 [ 23.583404] RSP: 002b:00007ffdd4400858 EFLAGS: 00000202 ORIG_RAX: 00000000000001a9 [ 23.583938] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007efd9fa3ee5d [ 23.584430] RDX: 00007efd9fb3f247 RSI: 0000000020000080 RDI: 0000000000005230 [ 23.584927] RBP: 00007ffdd4400860 R08: 00007ffdd44002d0 R09: 00007ffdd4400890 [ 23.585419] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffdd44009b8 [ 23.585914] R13: 0000000000401730 R14: 0000000000403e08 R15: 00007efd9fcb5000 [ 23.586424] </TASK> [ 23.586589] [ 23.586709] The buggy address belongs to the physical page: [ 23.587094] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x36377 [ 23.587644] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff) [ 23.588103] raw: 000fffffc0000000 ffffea0000d8ddc8 ffffea0000d8ddc8 0000000000000000 [ 23.588636] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 [ 23.589167] page dumped because: kasan: bad access detected [ 23.589551] [ 23.589670] Memory state around the buggy address: [ 23.590007] ffff888036377780: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 23.590514] ffff888036377800: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 23.591011] >ffff888036377880: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 23.591508] ^ [ 23.591794] ffff888036377900: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 23.592292] ffff888036377980: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 23.592789] ================================================================== [ 23.593344] Disabling lock debugging due to kernel taint "
I hope you find it useful.
Regards, Yi Lai
---
If you don't need the following environment to reproduce the problem or if you already have one reproduced environment, please ignore the following information.
How to reproduce: git clone https://gitlab.com/xupengfe/repro_vm_env.git cd repro_vm_env tar -xvf repro_vm_env.tar.gz cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0 // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel // You could change the bzImage_xxx as you want // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd " for different qemu version You could use below command to log in, there is no password for root. ssh -p 10023 root@localhost
After login vm(virtual machine) successfully, you could transfer reproduced binary to the vm by below way, and reproduce the problem in vm: gcc -pthread -o repro repro.c scp -P 10023 repro root@localhost:/root/
Get the bzImage for target kernel: Please use target kconfig and copy it to kernel_src/.config make olddefconfig make -jx bzImage //x should equal or less than cpu num your pc has
Fill the bzImage file into above start3.sh to load the target kernel in vm.
Tips: If you already have qemu-system-x86_64, please ignore below info. If you want to install qemu v7.1.0 version: git clone https://github.com/qemu/qemu.git cd qemu git checkout -f v7.1.0 mkdir build cd build yum install -y ninja-build.x86_64 yum -y install libslirp-devel.x86_64 ../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp make make install
On Mon, Sep 09, 2024 at 05:00:36PM +0200, Felix Moessbauer wrote:
The submit queue polling threads are userland threads that just never exit to the userland. When creating the thread with IORING_SETUP_SQ_AFF, the affinity of the poller thread is set to the cpu specified in sq_thread_cpu. However, this CPU can be outside of the cpuset defined by the cgroup cpuset controller. This violates the rules defined by the cpuset controller and is a potential issue for realtime applications.
In b7ed6d8ffd6 we fixed the default affinity of the poller thread, in case no explicit pinning is required by inheriting the one of the creating task. In case of explicit pinning, the check is more complicated, as also a cpu outside of the parent cpumask is allowed. We implemented this by using cpuset_cpus_allowed (that has support for cgroup cpusets) and testing if the requested cpu is in the set.
Fixes: 37d1e2e3642e ("io_uring: move SQPOLL thread io-wq forked worker") Cc: stable@vger.kernel.org # 6.1+ Signed-off-by: Felix Moessbauer felix.moessbauer@siemens.com
Hi,
that's hopefully the last fix of cpu pinnings of the sq poller threads. However, there is more to come on the io-wq side. E.g the syscalls for IORING_REGISTER_IOWQ_AFF that can be used to change the affinites are not yet protected. I'm currently just lacking good reproducers for that. I also have to admit that I don't feel too comfortable making changes to the wq part, given that I don't have good tests.
While fixing this, I'm wondering if it makes sense to add tests for the combination of pinning and cpuset. If yes, where should these tests be added?
Best regards, Felix Moessbauer Siemens AG
io_uring/sqpoll.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c index 713be7c29388..b8ec8fec99b8 100644 --- a/io_uring/sqpoll.c +++ b/io_uring/sqpoll.c @@ -10,6 +10,7 @@ #include <linux/slab.h> #include <linux/audit.h> #include <linux/security.h> +#include <linux/cpuset.h> #include <linux/io_uring.h> #include <uapi/linux/io_uring.h> @@ -459,10 +460,12 @@ __cold int io_sq_offload_create(struct io_ring_ctx *ctx, return 0; if (p->flags & IORING_SETUP_SQ_AFF) {
struct cpumask allowed_mask; int cpu = p->sq_thread_cpu;
ret = -EINVAL;
if (cpu >= nr_cpu_ids || !cpu_online(cpu))
cpuset_cpus_allowed(current, &allowed_mask);
} else {if (!cpumask_test_cpu(cpu, &allowed_mask)) goto err_sqpoll; sqd->sq_cpu = cpu;
-- 2.39.2
On 9/17/24 9:56 PM, Lai, Yi wrote:
Hi Felix Moessbauer,
Greetings!
I used Syzkaller and found that there is KASAN: use-after-free Read in io_sq_offload_create in Linux-next tree - next-20240916.
After bisection and the first bad commit is: " f011c9cf04c0 io_uring/sqpoll: do not allow pinning outside of cpuset "
This is known and fixed:
https://git.kernel.dk/cgit/linux/commit/?h=for-6.12/io_uring&id=a09c1724...
Thanks for reply. I will include lkp into search database before sending out fuzzing findings.
Regards, Yi Lai
On Wed, Sep 18, 2024 at 12:16:41AM -0600, Jens Axboe wrote:
On 9/17/24 9:56 PM, Lai, Yi wrote:
Hi Felix Moessbauer,
Greetings!
I used Syzkaller and found that there is KASAN: use-after-free Read in io_sq_offload_create in Linux-next tree - next-20240916.
After bisection and the first bad commit is: " f011c9cf04c0 io_uring/sqpoll: do not allow pinning outside of cpuset "
This is known and fixed:
https://git.kernel.dk/cgit/linux/commit/?h=for-6.12/io_uring&id=a09c1724...
-- Jens Axboe
linux-stable-mirror@lists.linaro.org