|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6 |
|
| #
92835ceb |
| 08-May-2025 |
Gabriel Krisman Bertazi <[email protected]> |
io_uring/sqpoll: Increase task_work submission batch size
Our QA team reported a 10%-23% throughput reduction on an io_uring sqpoll testcase doing IO to a null_blk, which I traced back to a reduction of the device submission queue depth utilization. It turns out that, after commit af5d68f8892f ("io_uring/sqpoll: manage task_work privately"), we capped the number of task_work entries that can be completed from a single spin of sqpoll to only 8 entries, before sqpoll goes around to (potentially) sleep. While this cap doesn't drive the submission side directly, it impacts the completion behavior, which affects the number of IOs queued by fio per sqpoll cycle on the submission side, and io_uring ends up seeing fewer IOs per sqpoll cycle. As a result, block layer plugging is less effective, we see more time spent inside the block layer in profiling charts, and submission latency measured by fio increases.
There are other places that have increased overhead once sqpoll sleeps more often, such as the sqpoll utilization calculation. But, in this microbenchmark, those were not representative enough in perf charts, and their removal didn't yield measurable changes in throughput. The major overhead comes from the fact we plug less, and less often, when submitting to the block layer.
My benchmark is:
fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \
    --invalidate=1 --time_based --ramp_time=10 --group_reporting=1 \
    --filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \
    --rw=randread --numjobs=1 --sqthread_poll
On one machine, tested on top of Linux 6.15-rc1, we have the following baseline:
  READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec
With this patch:
  READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec
which is a 15% improvement in measured bandwidth. The average submission latency is noticeably lowered too. As measured by fio:
Baseline:
  lat (usec): min=20, max=241, avg=99.81, stdev=3.38
Patched:
  lat (usec): min=26, max=226, avg=86.48, stdev=4.82
If we look at blktrace, we can also see the plugging behavior is improved. In the baseline, we end up limited to plugging 8 requests in the block layer regardless of the device queue depth size, while after patching we can drive more io, and we manage to utilize the full device queue.
In the baseline, after a stabilization phase, an ordinary submission looks like:
  254,0    1    49942    0.016028795  5977  U   N [iou-sqp-5976] 7
After patching, I see consistently more requests per unplug:
  254,0    1     4996    0.001432872  3145  U   N [iou-sqp-3144] 32
Ideally, the cap size would at least be deep enough to fill the device queue, but we can't predict that behavior, or assume all IO goes to a single device, and thus can't guess the ideal batch size. We also don't want to let the tw run unbounded, though I'm not sure it would really be a problem. Instead, let's just give it a more sensible value that allows for more efficient batching. I've tested with different cap values, and initially proposed to increase the cap to 1024. Jens argued it is too big of a bump, and I observed that, with 32, I'm no longer able to observe this bottleneck on any of my machines.
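As a rough sketch of the shape of the change (the constant name below is illustrative, not necessarily the identifier used in io_uring/sqpoll.c), this amounts to bumping the per-iteration task_work cap from 8 to 32:

    -#define SQPOLL_TW_CAP_ENTRIES		8
    +#define SQPOLL_TW_CAP_ENTRIES		32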
Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
|
Revision tags: v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7 |
|
| #
4b7cfa8b |
| 10-Jan-2025 |
Pavel Begunkov <[email protected]> |
io_uring/sqpoll: zero sqd->thread on tctx errors
Syzkeller reports:
BUG: KASAN: slab-use-after-free in thread_group_cputime+0x409/0x700 kernel/sched/cputime.c:341
Read of size 8 at addr ffff88803578c510 by task syz.2.3223/27552

Call Trace:
 <TASK>
 ...
 kasan_report+0x143/0x180 mm/kasan/report.c:602
 thread_group_cputime+0x409/0x700 kernel/sched/cputime.c:341
 thread_group_cputime_adjusted+0xa6/0x340 kernel/sched/cputime.c:639
 getrusage+0x1000/0x1340 kernel/sys.c:1863
 io_uring_show_fdinfo+0xdfe/0x1770 io_uring/fdinfo.c:197
 seq_show+0x608/0x770 fs/proc/fd.c:68
 ...
That's due to sqd->thread not being cleared properly in cases where SQPOLL task tctx setup fails, which can essentially only happen with fault injection to insert allocation errors.
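A hedged sketch of the kind of fix described (simplified; locking and the exact error path are elided, and this is not the verbatim diff): when io_uring_alloc_task_context() fails, forget the thread pointer before tearing things down, so later readers such as the fdinfo code don't follow it into a freed task_struct:

    ret = io_uring_alloc_task_context(tsk, ctx);
    wake_up_new_task(tsk);
    if (ret) {
            /* thread exits early; don't leave a dangling sqd->thread */
            sqd->thread = NULL;
            goto err;
    }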
Cc: [email protected] Fixes: 1251d2025c3e1 ("io_uring/sqpoll: early exit thread if task_context wasn't allocated") Reported-by: [email protected] Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/efc7ec7010784463b2e7466d7b5c02c2cb381635.1736519461.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
|
|
Revision tags: v6.13-rc6, v6.13-rc5 |
|
| #
e33ac68e |
| 26-Dec-2024 |
Pavel Begunkov <[email protected]> |
io_uring/sqpoll: fix sqpoll error handling races
BUG: KASAN: slab-use-after-free in __lock_acquire+0x370b/0x4a10 kernel/locking/lockdep.c:5089

Call Trace:
 <TASK>
 ...
 _raw_spin_lock_irqsave+0x3d/0x60 kernel/locking/spinlock.c:162
 class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline]
 try_to_wake_up+0xb5/0x23c0 kernel/sched/core.c:4205
 io_sq_thread_park+0xac/0xe0 io_uring/sqpoll.c:55
 io_sq_thread_finish+0x6b/0x310 io_uring/sqpoll.c:96
 io_sq_offload_create+0x162/0x11d0 io_uring/sqpoll.c:497
 io_uring_create io_uring/io_uring.c:3724 [inline]
 io_uring_setup+0x1728/0x3230 io_uring/io_uring.c:3806
 ...
Kun Hu reports that the SQPOLL creation error path has a UAF, which happens if io_uring_alloc_task_context() fails and then io_sq_thread() manages to run and complete before the rest of the error handling code, which means io_sq_thread_finish() is looking at an already killed task.
Note that this is mostly theoretical, requiring fault injection on the allocation side to trigger in practice.
Cc: [email protected] Reported-by: Kun Hu <[email protected]> Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/0f2f1aa5729332612bd01fe0f2f385fd1f06ce7c.1735231717.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
|
|
Revision tags: v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1 |
|
| #
3a3f61ce |
| 30-Nov-2024 |
Kees Cook <[email protected]> |
exec: Make sure task->comm is always NUL-terminated
Using strscpy() meant that the final character in task->comm may be non-NUL for a moment before the "string too long" truncation happens.
Instead of adding a new use of the ambiguous strncpy(), we'd want to use memtostr_pad() which enforces being able to check at compile time that sizes are sensible, but this requires being able to see string buffer lengths. Instead of trying to inline __set_task_comm() (which needs to call trace and perf functions), just open-code it. But to make sure we're always safe, add compile-time checking like we already do for get_task_comm().
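As a hedged userspace illustration of the property being restored (an analogue, not the kernel diff): copy at most TASK_COMM_LEN - 1 bytes and NUL-pad the rest, so the final byte of the buffer is never written with a non-NUL value, not even transiently:

    #include <string.h>

    #define TASK_COMM_LEN 16

    /* Illustrative only: a racing reader never observes comm without a
     * terminating NUL, because byte 15 is only ever written as zero. */
    static void set_comm_sketch(char comm[TASK_COMM_LEN], const char *src)
    {
            size_t len = strnlen(src, TASK_COMM_LEN - 1);

            memcpy(comm, src, len);
            memset(comm + len, 0, TASK_COMM_LEN - len);
    }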
Suggested-by: Linus Torvalds <[email protected]> Suggested-by: "Eric W. Biederman" <[email protected]> Signed-off-by: Kees Cook <[email protected]>
|
| #
b690668b |
| 25-Nov-2024 |
Christian Brauner <[email protected]> |
io_uring: avoid pointless cred reference count bump
req->creds and ctx->sq_creds already hold reference counts that are stable during the operations.
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
| #
51c0bcf0 |
| 25-Nov-2024 |
Christian Brauner <[email protected]> |
tree-wide: s/revert_creds_light()/revert_creds()/g
Rename all calls to revert_creds_light() back to revert_creds().
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
| #
6771e004 |
| 25-Nov-2024 |
Christian Brauner <[email protected]> |
tree-wide: s/override_creds_light()/override_creds()/g
Rename all calls to override_creds_light() back to override_creds().
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
| #
f905e009 |
| 25-Nov-2024 |
Christian Brauner <[email protected]> |
tree-wide: s/revert_creds()/put_cred(revert_creds_light())/g
Convert all calls to revert_creds() over to explicitly dropping reference counts in preparation for converting revert_creds() to revert_creds_light() semantics.
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
| #
0a670e15 |
| 25-Nov-2024 |
Christian Brauner <[email protected]> |
tree-wide: s/override_creds()/override_creds_light(get_new_cred())/g
Convert all callers from override_creds() to override_creds_light(get_new_cred()) in preparation for making override_creds() not take a separate reference at all.
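A hedged sketch of the calling pattern after this conversion, using the io_uring SQPOLL credential switch as the example (simplified; not the verbatim tree-wide diff): the reference count operations become explicit at the call site, so the override/revert helpers themselves no longer bump the count.

    const struct cred *creds = NULL;

    if (ctx->sq_creds != current_cred())
            creds = override_creds_light(get_new_cred(ctx->sq_creds));
    /* ... issue requests under ctx->sq_creds ... */
    if (creds)
            put_cred(revert_creds_light(creds));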
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1 |
|
| #
6348be02 |
| 20-Jul-2024 |
Al Viro <[email protected]> |
fdget(), trivial conversions
fdget() is the first thing done in scope, all matching fdput() are immediately followed by leaving the scope.
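A hedged sketch of what such a trivial call site looks like before and after conversion to the scope-based helper (do_something() is a stand-in; the fd_file()/fd_empty() accessors are covered by the fd_file() commit further down this log):

    /* before */
    struct fd f = fdget(fd);
    if (!fd_file(f))
            return -EBADF;
    ret = do_something(fd_file(f));
    fdput(f);
    return ret;

    /* after: fdput() runs automatically when f leaves the scope */
    CLASS(fd, f)(fd);
    if (fd_empty(f))
            return -EBADF;
    return do_something(fd_file(f));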
Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
|
| #
b898b8c9 |
| 28-Oct-2024 |
Jens Axboe <[email protected]> |
io_uring/sqpoll: wait on sqd->wait for thread parking
io_sqd_handle_event() just does a mutex unlock/lock dance when it's supposed to park, somewhat relying on full ordering with the thread trying to park it, which does a similar unlock/lock dance on sqd->lock. However, with adaptive spinning on mutexes, this can waste an awful lot of time. Normally this isn't very noticeable, as parking and unparking the thread isn't a common (or fast path) occurrence. However, ring resizing tests exactly that, as each resize requires the SQPOLL thread to safely park and unpark.
Have io_sq_thread_park() explicitly wait on sqd->park_pending being zero before attempting to grab the sqd->lock again.
In a resize test, this brings the runtime of SQPOLL down from about 60 seconds to a few seconds, just like the !SQPOLL tests. And saves a ton of spinning time on the mutex, on both sides.
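A hedged sketch of the waiting scheme (simplified; not the verbatim diff): instead of immediately re-taking sqd->lock and adaptively spinning against the side holding it, wait until pending park requests have drained first:

    wait_event(sqd->wait, !atomic_read(&sqd->park_pending));
    mutex_lock(&sqd->lock);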
Signed-off-by: Jens Axboe <[email protected]>
|
| #
53d69bdd |
| 16-Sep-2024 |
Olivier Langlois <[email protected]> |
io_uring/sqpoll: do the napi busy poll outside the submission block
There are several small reasons justifying this change.

1. The busy poll must be performed even on rings that have no iopoll and no new sqe. It is quite possible that a ring configured for inbound traffic with multishot goes several hours without receiving new request submissions.
2. NAPI busy poll does not perform any credential validation.
3. If the thread is awakened by task work, processing the task work takes priority over the NAPI busy loop. This is why a second loop has been created after the io_sq_tw() call instead of doing the busy loop in __io_sq_thread() outside its credential acquisition block.
Signed-off-by: Olivier Langlois <[email protected]> Link: https://lore.kernel.org/r/de7679adf1249446bd47426db01d82b9603b7224.1726161831.git.olivier@trillion01.com Signed-off-by: Jens Axboe <[email protected]>
|
| #
7f44bead |
| 16-Sep-2024 |
Felix Moessbauer <[email protected]> |
io_uring/sqpoll: do not put cpumask on stack
Putting the cpumask on the stack has been deprecated for a long time (since 2d3854a37e8), as these can be big. Given that, allocate allowed_mask dynamically instead of on the stack.
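A hedged sketch of the conversion using the standard cpumask_var_t helpers (simplified from the sqpoll setup path; the fill-and-test logic is unchanged and elided here):

    cpumask_var_t allowed_mask;

    if (!alloc_cpumask_var(&allowed_mask, GFP_KERNEL))
            return -ENOMEM;

    /* fill allowed_mask and test the requested CPU as before */

    free_cpumask_var(allowed_mask);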
Fixes: f011c9cf04c0 ("io_uring/sqpoll: do not allow pinning outside of cpuset") Signed-off-by: Felix Moessbauer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
| #
a09c1724 |
| 16-Sep-2024 |
Jens Axboe <[email protected]> |
io_uring/sqpoll: retain test for whether the CPU is valid
A recent commit ensured that SQPOLL cannot be set up with a CPU that isn't in the current task's cpuset, but it also dropped testing whether the CPU is valid in the first place. Without that, if a task passes in a CPU value that is too high, the following KASAN splat can get triggered:
BUG: KASAN: stack-out-of-bounds in io_sq_offload_create+0x858/0xaa4
Read of size 8 at addr ffff800089bc7b90 by task wq-aff.t/1391

CPU: 4 UID: 1000 PID: 1391 Comm: wq-aff.t Not tainted 6.11.0-rc7-00227-g371c468f4db6 #7080
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace.part.0+0xcc/0xe0
 show_stack+0x14/0x1c
 dump_stack_lvl+0x58/0x74
 print_report+0x16c/0x4c8
 kasan_report+0x9c/0xe4
 __asan_report_load8_noabort+0x1c/0x24
 io_sq_offload_create+0x858/0xaa4
 io_uring_setup+0x1394/0x17c4
 __arm64_sys_io_uring_setup+0x6c/0x180
 invoke_syscall+0x6c/0x260
 el0_svc_common.constprop.0+0x158/0x224
 do_el0_svc+0x3c/0x5c
 el0_svc+0x34/0x70
 el0t_64_sync_handler+0x118/0x124
 el0t_64_sync+0x168/0x16c

The buggy address belongs to stack of task wq-aff.t/1391
 and is located at offset 48 in frame:
 io_sq_offload_create+0x0/0xaa4

This frame has 1 object:
 [32, 40) 'allowed_mask'

The buggy address belongs to the virtual mapping at
 [ffff800089bc0000, ffff800089bc9000) created by:
 kernel_clone+0x124/0x7e0

The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff0000d740af80 pfn:0x11740a
memcg:ffff0000c2706f02
flags: 0xbffe00000000000(node=0|zone=2|lastcpupid=0x1fff)
raw: 0bffe00000000000 0000000000000000 dead000000000122 0000000000000000
raw: ffff0000d740af80 0000000000000000 00000001ffffffff ffff0000c2706f02
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff800089bc7a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff800089bc7b00: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
>ffff800089bc7b80: 00 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
                      ^
 ffff800089bc7c00: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
 ffff800089bc7c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f3
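A hedged sketch of the retained validity test the message refers to (simplified): the raw CPU number from userspace has to be range-checked before it is ever used to index into a cpumask:

    int cpu = p->sq_thread_cpu;

    if (cpu >= nr_cpu_ids)
            return -EINVAL;
    if (!cpumask_test_cpu(cpu, allowed_mask))
            return -EINVAL;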
Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-lkp/[email protected] Fixes: f011c9cf04c0 ("io_uring/sqpoll: do not allow pinning outside of cpuset") Tested-by: Felix Moessbauer <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
|
| #
f011c9cf |
| 09-Sep-2024 |
Felix Moessbauer <[email protected]> |
io_uring/sqpoll: do not allow pinning outside of cpuset
The submit queue polling threads are userland threads that just never exit to the userland. When creating the thread with IORING_SETUP_SQ_AFF, the affinity of the poller thread is set to the cpu specified in sq_thread_cpu. However, this CPU can be outside of the cpuset defined by the cgroup cpuset controller. This violates the rules defined by the cpuset controller and is a potential issue for realtime applications.
In b7ed6d8ffd6 we fixed the default affinity of the poller thread in case no explicit pinning is required, by inheriting that of the creating task. In case of explicit pinning, the check is more complicated, as a cpu outside of the parent cpumask is also allowed. We implement this by using cpuset_cpus_allowed() (which has support for cgroup cpusets) and testing if the requested cpu is in the set.
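A hedged sketch of the check described above (simplified; as originally written the mask sits on the stack, which a follow-up earlier in this log then moves to a dynamic allocation):

    struct cpumask allowed_mask;
    int cpu = p->sq_thread_cpu;

    cpuset_cpus_allowed(current, &allowed_mask);
    if (!cpumask_test_cpu(cpu, &allowed_mask))
            return -EINVAL;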
Fixes: 37d1e2e3642e ("io_uring: move SQPOLL thread io-wq forked worker") Cc: [email protected] # 6.1+ Signed-off-by: Felix Moessbauer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
| #
7255cd89 |
| 30-Jul-2024 |
Olivier Langlois <[email protected]> |
io_uring: micro optimization of __io_sq_thread() condition
Reverse the order of the element evaluation in an if statement.

For many users that are not using iopoll, the iopoll_list check will always evaluate to false, but only after having made a memory access, whereas to_submit is very likely already loaded in a register.
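A hedged before/after sketch of the reordering (condition shapes per the message; not the verbatim diff):

    /* before: always touches ctx->iopoll_list memory first */
    if (!wq_list_empty(&ctx->iopoll_list) || to_submit)
            ...

    /* after: the likely register-resident to_submit short-circuits
     * the list check for the common non-iopoll case */
    if (to_submit || !wq_list_empty(&ctx->iopoll_list))
            ...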
Signed-off-by: Olivier Langlois <[email protected]> Reviewed-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/052ca60b5c49e7439e4b8bd33bfab4a09d36d3d6.1722374371.git.olivier@trillion01.com Signed-off-by: Jens Axboe <[email protected]>
|
| #
e4956dc7 |
| 13-Aug-2024 |
Jens Axboe <[email protected]> |
io_uring/sqpoll: annotate debug task == current with data_race()
There's a debug check in io_sq_thread_park() checking if it's the SQPOLL thread itself calling park. KCSAN warns about this, as we should not be reading sqd->thread outside of sqd->lock.
Just silence this with data_race(). The pointer isn't used for anything but this debug check.
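A hedged sketch of the annotation (the check is debug-only, so a racy read is acceptable once it is marked as such):

    void io_sq_thread_park(struct io_sq_data *sqd)
    {
            WARN_ON_ONCE(data_race(sqd->thread) == current);
            ...
    }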
Reported-by: [email protected] Signed-off-by: Jens Axboe <[email protected]>
|
|
Revision tags: v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2 |
|
| #
1da91ea8 |
| 31-May-2024 |
Al Viro <[email protected]> |
introduce fd_file(), convert all accessors to it.
For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()).
NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...).
[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict]
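A hedged sketch of the mechanical part of the conversion (call site simplified; the member access is illustrative):

    struct fd f = fdget(p->wq_fd);

    /* before */
    if (!f.file)
            return -ENXIO;
    ctx = f.file->private_data;

    /* after */
    if (!fd_file(f))
            return -ENXIO;
    ctx = fd_file(f)->private_data;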
Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
|
|
Revision tags: v6.10-rc1 |
|
| #
d13ddd9c |
| 21-May-2024 |
Jens Axboe <[email protected]> |
io_uring/sqpoll: ensure that normal task_work is also run timely
With the move to private task_work, SQPOLL neglected to also run the normal task_work, if any is pending. This will eventually get run, but we should run it with the private task_work to ensure that things like a final fput() are processed in a timely fashion.
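A hedged sketch of the kind of addition described (placement simplified; the helpers are the generic task_work API):

    /* run the privately managed io_uring task_work ... */
    ...
    /* ... but also flush regular task_work, e.g. a final fput() */
    if (task_work_pending(current))
            task_work_run();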
Cc: [email protected] Link: https://lore.kernel.org/all/[email protected]/ Reported-by: Andrew Udvare <[email protected]> Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Tested-by: Christian Heusel <[email protected]> Tested-by: Andrew Udvare <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
|
|
Revision tags: v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1 |
|
| #
c4ce0ab2 |
| 21-Mar-2024 |
Jens Axboe <[email protected]> |
io_uring/sqpoll: work around a potential audit memory leak
kmemleak complains that there's a memory leak related to connect handling:
unreferenced object 0xffff0001093bdf00 (size 128):
  comm "iou-sqp-455", pid 457, jiffies 4294894164
  hex dump (first 32 bytes):
    02 00 fa ea 7f 00 00 01 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace (crc 2e481b1a):
    [<00000000c0a26af4>] kmemleak_alloc+0x30/0x38
    [<000000009c30bb45>] kmalloc_trace+0x228/0x358
    [<000000009da9d39f>] __audit_sockaddr+0xd0/0x138
    [<0000000089a93e34>] move_addr_to_kernel+0x1a0/0x1f8
    [<000000000b4e80e6>] io_connect_prep+0x1ec/0x2d4
    [<00000000abfbcd99>] io_submit_sqes+0x588/0x1e48
    [<00000000e7c25e07>] io_sq_thread+0x8a4/0x10e4
    [<00000000d999b491>] ret_from_fork+0x10/0x20
which can happen if:

1) The command type does something on the prep side that triggers an audit call.
2) The thread hasn't done any operations before this that triggered an audit call inside ->issue(), where we have audit_uring_entry() and audit_uring_exit().
Work around this by issuing a blanket NOP operation before the SQPOLL does anything.
Signed-off-by: Jens Axboe <[email protected]>
|
| #
1251d202 |
| 19-Mar-2024 |
Jens Axboe <[email protected]> |
io_uring/sqpoll: early exit thread if task_context wasn't allocated
Ideally we'd want to simply kill the task rather than wake it, but for now let's just add a startup check that causes the thread to exit. This can only happen if io_uring_alloc_task_context() fails, which generally requires fault injection.
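A hedged sketch of the startup check (simplified; the err_out label stands in for the thread's normal exit path): the freshly created SQPOLL thread bails out if its io_uring task context was never allocated:

    static int io_sq_thread(void *data)
    {
            struct io_sq_data *sqd = data;
            ...
            /* offload context creation failed, just exit */
            if (!current->io_uring)
                    goto err_out;
            ...
    }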
Reported-by: Ubisectech Sirius <[email protected]> Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <[email protected]>
|
|
Revision tags: v6.8, v6.8-rc7 |
|
| #
3fcb9d17 |
| 28-Feb-2024 |
Xiaobing Li <[email protected]> |
io_uring/sqpoll: statistics of the true utilization of sq threads
Count the running time and actual IO processing time of the sqpoll thread, and output the statistical data to fdinfo.
Variable description: "work_time" in the code represents the sum of the jiffies of the sq thread actually processing IO, that is, how many milliseconds it actually takes to process IO. "total_time" represents the total time that the sq thread has elapsed from the beginning of the loop to the current time point, that is, how many milliseconds it has spent in total.
The test tool is fio, and its parameters are as follows:

[global]
ioengine=io_uring
direct=1
group_reporting
bs=128k
norandommap=1
randrepeat=0
refill_buffers
ramp_time=30s
time_based
runtime=1m
clocksource=clock_gettime
overwrite=1
log_avg_msec=1000
numjobs=1

[disk0]
filename=/dev/nvme0n1
rw=read
iodepth=16
hipri
sqthread_poll=1

The test results are as follows:

Every 2.0s: cat /proc/9230/fdinfo/6 | grep -E Sq
SqMask:         0x3
SqHead:         3197153
SqTail:         3197153
CachedSqHead:   3197153
SqThread:       9231
SqThreadCpu:    11
SqTotalTime:    18099614
SqWorkTime:     16748316

The test results corresponding to different iodepths are as follows:

|-----------|-------|-------|-------|-------|-------|
| iodepth   |   1   |   4   |   8   |  16   |  64   |
|-----------|-------|-------|-------|-------|-------|
|utilization| 2.9%  | 8.8%  | 10.9% | 92.9% | 84.4% |
|-----------|-------|-------|-------|-------|-------|
| idle      | 97.1% | 91.2% | 89.1% | 7.1%  | 15.6% |
|-----------|-------|-------|-------|-------|-------|
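From the sample fdinfo output above, the utilization figure follows directly as SqWorkTime / SqTotalTime = 16748316 / 18099614 ≈ 92.5%, in line with the ~93% shown for iodepth=16 in the table.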
Signed-off-by: Xiaobing Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
|
Revision tags: v6.8-rc6, v6.8-rc5 |
|
| #
c8d8fc3b |
| 14-Feb-2024 |
Jens Axboe <[email protected]> |
io_uring/sqpoll: use the correct check for pending task_work
A previous commit moved to using just the private task_work list for SQPOLL, but it neglected to update the check for whether we have pending task_work. Normally this is fine as we'll attempt to run it unconditionally, but if we race with going to sleep AND task_work being added, then we certainly need the right check here.
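A hedged sketch of the distinction (helper and field names illustrative, not the verbatim fix): before deciding to sleep, the sqpoll thread must check the private list it actually drains, not just the generic per-task task_work state:

    /* illustrative only */
    static bool io_sq_tw_pending(struct llist_node *retry_list)
    {
            struct io_uring_task *tctx = current->io_uring;

            return retry_list || !llist_empty(&tctx->task_list);
    }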
Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <[email protected]>
|
|
Revision tags: v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6 |
|
| #
ff183d42 |
| 08-Jun-2023 |
Stefan Roesch <[email protected]> |
io-uring: add sqpoll support for napi busy poll
This adds the sqpoll support to the io-uring napi.
Signed-off-by: Stefan Roesch <[email protected]> Suggested-by: Olivier Langlois <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
| #
af5d68f8 |
| 02-Feb-2024 |
Jens Axboe <[email protected]> |
io_uring/sqpoll: manage task_work privately
Decouple from task_work running, and cap the number of entries we process at the time. If we exceed that number, push remaining entries to a retry list that we'll process first next time.
We cap the number of entries to process at 8, which is fairly random. We just want to get enough per-ctx batching here, while not processing endlessly.
Since we manually run PF_IO_WORKER related task_work anyway as the task never exits to userspace, with this we no longer need to add an actual task_work item to the per-process list.
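A hedged sketch of the batching scheme described (function names illustrative; not the verbatim diff): drain up to the cap from the private list, and keep whatever is left on a retry list that is processed first next time around:

    /* illustrative only */
    static unsigned int io_sq_tw(struct llist_node **retry_list, int max_entries)
    {
            struct io_uring_task *tctx = current->io_uring;
            unsigned int count = 0;

            if (*retry_list) {
                    *retry_list = io_handle_tw_list(*retry_list, &count, max_entries);
                    if (count >= max_entries)
                            return count;
                    max_entries -= count;
            }
            *retry_list = tctx_task_work_run(tctx, max_entries, &count);
            return count;
    }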
Signed-off-by: Jens Axboe <[email protected]>
|