History log of /linux-6.15/io_uring/sqpoll.c (Results 1 – 25 of 34)
Revision tags: v6.15, v6.15-rc7, v6.15-rc6
# 92835ceb 08-May-2025 Gabriel Krisman Bertazi <[email protected]>

io_uring/sqpoll: Increase task_work submission batch size

Our QA team reported a 10%-23% throughput reduction on an io_uring
sqpoll testcase doing IO to a null_blk device, which I traced back to a
reduction of the device submission queue depth utilization. It turns out
that, after commit af5d68f8892f ("io_uring/sqpoll: manage task_work
privately"), we capped the number of task_work entries that can be
completed from a single spin of sqpoll to only 8 entries, before the
sqpoll goes around to (potentially) sleep. While this cap doesn't drive
the submission side directly, it impacts the completion behavior, which
affects the number of IOs queued by fio per sqpoll cycle on the
submission side, so io_uring ends up seeing fewer IOs per sqpoll cycle.
As a result, block layer plugging is less effective, and we see more
time spent inside the block layer in profiling charts, along with
increased submission latency as measured by fio.

There are other places that have increased overhead once sqpoll sleeps
more often, such as the sqpoll utilization calculation. But, in this
microbenchmark, those were not representative enough in perf charts, and
their removal didn't yield measurable changes in throughput. The major
overhead comes from the fact we plug less, and less often, when submitting
to the block layer.

My benchmark is:

fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \
--invalidate=1 --time_based --ramp_time=10 --group_reporting=1 \
--filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \
--rw=randread --numjobs=1 --sqthread_poll

On one machine, tested on top of Linux 6.15-rc1, we have the following
baseline:
READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec

With this patch:
READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec

which is a 15% improvement in measured bandwidth. The average
submission latency is noticeably lower too, as measured by fio:

Baseline:
lat (usec): min=20, max=241, avg=99.81, stdev=3.38
Patched:
lat (usec): min=26, max=226, avg=86.48, stdev=4.82

If we look at blktrace, we can also see the plugging behavior is
improved. In the baseline, we end up limited to plugging 8 requests in
the block layer regardless of the device queue depth size, while after
patching we can drive more io, and we manage to utilize the full device
queue.

In the baseline, after a stabilization phase, an ordinary submission
looks like:
254,0 1 49942 0.016028795 5977 U N [iou-sqp-5976] 7

After patching, I see consistently more requests per unplug:
254,0 1 4996 0.001432872 3145 U N [iou-sqp-3144] 32

Ideally, the cap size would at least be deep enough to fill the
device queue, but we can't predict that behavior, or assume all IO goes
to a single device, and thus can't guess the ideal batch size. We also
don't want to let the tw run unbounded, though I'm not sure it would
really be a problem. Instead, let's just give it a more sensible value
that will allow for more efficient batching. I've tested with different
cap values, and initially proposed increasing the cap to 1024. Jens
argued that was too big of a bump, and I observed that, with 32, I can
no longer observe this bottleneck on any of my machines.
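
A minimal sketch of the resulting one-liner change, with the cap macro
name assumed rather than copied from the patch:

  /* task_work entries processed per sqpoll iteration */
  -#define IORING_TW_CAP_ENTRIES_VALUE	8
  +#define IORING_TW_CAP_ENTRIES_VALUE	32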

Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

Revision tags: v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7
# 4b7cfa8b 10-Jan-2025 Pavel Begunkov <[email protected]>

io_uring/sqpoll: zero sqd->thread on tctx errors

Syzkeller reports:

BUG: KASAN: slab-use-after-free in thread_group_cputime+0x409/0x700 kernel/sched/cputime.c:341
Read of size 8 at addr ffff88803578c510 by task syz.2.3223/27552
Call Trace:
<TASK>
...
kasan_report+0x143/0x180 mm/kasan/report.c:602
thread_group_cputime+0x409/0x700 kernel/sched/cputime.c:341
thread_group_cputime_adjusted+0xa6/0x340 kernel/sched/cputime.c:639
getrusage+0x1000/0x1340 kernel/sys.c:1863
io_uring_show_fdinfo+0xdfe/0x1770 io_uring/fdinfo.c:197
seq_show+0x608/0x770 fs/proc/fd.c:68
...

That's due to sqd->thread not being cleared properly in cases where
SQPOLL task tctx setup fails, which can essentially only happen with
fault injection to insert allocation errors.
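
A minimal sketch of the shape of such a fix in io_sq_offload_create(),
assuming the tctx allocation error path described above:

  ret = io_uring_alloc_task_context(tsk, ctx);
  wake_up_new_task(tsk);
  if (ret) {
          /* don't leave a stale pointer for fdinfo et al. to chase */
          mutex_lock(&sqd->lock);
          sqd->thread = NULL;
          mutex_unlock(&sqd->lock);
          goto err;
  }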

Cc: [email protected]
Fixes: 1251d2025c3e1 ("io_uring/sqpoll: early exit thread if task_context wasn't allocated")
Reported-by: [email protected]
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/efc7ec7010784463b2e7466d7b5c02c2cb381635.1736519461.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

Revision tags: v6.13-rc6, v6.13-rc5
# e33ac68e 26-Dec-2024 Pavel Begunkov <[email protected]>

io_uring/sqpoll: fix sqpoll error handling races

BUG: KASAN: slab-use-after-free in __lock_acquire+0x370b/0x4a10 kernel/locking/lockdep.c:5089
Call Trace:
<TASK>
...
_raw_spin_lock_irqsave+0x3d/0x60 kernel/locking/spinlock.c:162
class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline]
try_to_wake_up+0xb5/0x23c0 kernel/sched/core.c:4205
io_sq_thread_park+0xac/0xe0 io_uring/sqpoll.c:55
io_sq_thread_finish+0x6b/0x310 io_uring/sqpoll.c:96
io_sq_offload_create+0x162/0x11d0 io_uring/sqpoll.c:497
io_uring_create io_uring/io_uring.c:3724 [inline]
io_uring_setup+0x1728/0x3230 io_uring/io_uring.c:3806
...

Kun Hu reports that the SQPOLL creation error path has a UAF, which
happens if io_uring_alloc_task_context() fails and then io_sq_thread()
manages to run and complete before the rest of the error handling code,
which means io_sq_thread_finish() is looking at an already killed task.

Note that this is mostly theoretical, requiring fault injection on
the allocation side to trigger in practice.

Cc: [email protected]
Reported-by: Kun Hu <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/0f2f1aa5729332612bd01fe0f2f385fd1f06ce7c.1735231717.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

Revision tags: v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1
# 3a3f61ce 30-Nov-2024 Kees Cook <[email protected]>

exec: Make sure task->comm is always NUL-terminated

Using strscpy() meant that the final character in task->comm may be
non-NUL for a moment before the "string too long" truncation happens.

Instead of adding a new use of the ambiguous strncpy(), we'd want to
use memtostr_pad() which enforces being able to check at compile time
that sizes are sensible, but this requires being able to see string
buffer lengths. Instead of trying to inline __set_task_comm() (which
needs to call trace and perf functions), just open-code it. But to
make sure we're always safe, add compile-time checking like we already
do for get_task_comm().
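
A minimal sketch of the described pattern (names illustrative, not
taken from the patch; the real open-coded version also takes the task
lock and emits the trace/perf events):

  char buf[TASK_COMM_LEN];

  /* compile-time proof that the buffer size is sensible */
  BUILD_BUG_ON(sizeof(buf) < TASK_COMM_LEN);
  /* strscpy_pad() NUL-pads the tail, so comm is always terminated */
  strscpy_pad(tsk->comm, buf, TASK_COMM_LEN);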

Suggested-by: Linus Torvalds <[email protected]>
Suggested-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: Kees Cook <[email protected]>

# b690668b 25-Nov-2024 Christian Brauner <[email protected]>

io_uring: avoid pointless cred reference count bump

req->creds and ctx->sq_creds already hold reference counts that are
stable during the operations.
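
A minimal sketch of what this allows for the SQPOLL credential switch
(illustrative; the _light variants skip the reference count operations):

  /* ctx->sq_creds already holds a stable reference for the ring */
  const struct cred *creds = override_creds_light(ctx->sq_creds);

  /* ... submit SQEs under the ring's credentials ... */
  revert_creds_light(creds);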

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jeff Layton <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

# 51c0bcf0 25-Nov-2024 Christian Brauner <[email protected]>

tree-wide: s/revert_creds_light()/revert_creds()/g

Rename all calls to revert_creds_light() back to revert_creds().

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jeff Layton <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

# 6771e004 25-Nov-2024 Christian Brauner <[email protected]>

tree-wide: s/override_creds_light()/override_creds()/g

Rename all calls to override_creds_light() back to override_creds().

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jeff Layton <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

# f905e009 25-Nov-2024 Christian Brauner <[email protected]>

tree-wide: s/revert_creds()/put_cred(revert_creds_light())/g

Convert all calls to revert_creds() over to explicitly dropping
reference counts in preparation for converting revert_creds() to
revert_creds_light() semantics.

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jeff Layton <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

# 0a670e15 25-Nov-2024 Christian Brauner <[email protected]>

tree-wide: s/override_creds()/override_creds_light(get_new_cred())/g

Convert all callers from override_creds() to
override_creds_light(get_new_cred()) in preparation for making
override_creds() not take a separate reference at all.

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jeff Layton <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

Revision tags: v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1
# 6348be02 20-Jul-2024 Al Viro <[email protected]>

fdget(), trivial conversions

fdget() is the first thing done in scope, all matching fdput() are
immediately followed by leaving the scope.
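
A minimal sketch of one such trivial conversion (context illustrative):

  /* before: explicit pairing */
  struct fd f = fdget(fd);
  /* ... use f ... */
  fdput(f);

  /* after: fdput() runs automatically when f leaves the scope */
  CLASS(fd, f)(fd);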

Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>

# b898b8c9 28-Oct-2024 Jens Axboe <[email protected]>

io_uring/sqpoll: wait on sqd->wait for thread parking

io_sqd_handle_event() just does a mutex unlock/lock dance when it's
supposed to park, somewhat relying on full ordering with the thread
trying to park it which does a similar unlock/lock dance on sqd->lock.
However, with adaptive spinning on mutexes, this can waste an awful
lot of time. Normally this isn't very noticeable, as parking and
unparking the thread isn't a common (or fast path) occurrence. However,
the ring resizing tests exercise exactly that, as each resize requires
the SQPOLL thread to safely park and unpark.

Have io_sq_thread_park() explicitly wait on sqd->park_pending being
zero before attempting to grab the sqd->lock again.

In a resize test, this brings the runtime of SQPOLL down from about
60 seconds to a few seconds, just like the !SQPOLL tests. And saves
a ton of spinning time on the mutex, on both sides.
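
A minimal sketch of the described wait (field names from the message):

  /* don't re-grab sqd->lock until all pending parks have completed */
  wait_event(sqd->wait, !atomic_read(&sqd->park_pending));
  mutex_lock(&sqd->lock);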

Signed-off-by: Jens Axboe <[email protected]>

# 53d69bdd 16-Sep-2024 Olivier Langlois <[email protected]>

io_uring/sqpoll: do the napi busy poll outside the submission block

There are several small reasons justifying this change.

1. Busy poll must be performed even on rings that have no iopoll and no
   new SQEs. It is quite possible for a ring configured for inbound
   traffic with multishot requests to go several hours without receiving
   new submissions.
2. NAPI busy poll does not perform any credential validation.
3. If the thread is awakened by task work, processing the task work
   takes priority over the NAPI busy loop. This is why a second loop has
   been created after the io_sq_tw() call, instead of doing the busy
   loop in __io_sq_thread() outside its credential acquisition block.
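
A minimal sketch of the resulting thread loop shape (io_sq_tw() is
named above; the other helper and field names are assumed):

  if (io_sq_tw(&retry_list, IORING_TW_CAP_ENTRIES_VALUE))
          sqt_spin = true;

  /* second loop: NAPI busy poll runs even with no new SQEs, no iopoll */
  list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
          if (io_napi(ctx))
                  io_napi_sqpoll_busy_poll(ctx);
  }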

Signed-off-by: Olivier Langlois <[email protected]>
Link: https://lore.kernel.org/r/de7679adf1249446bd47426db01d82b9603b7224.1726161831.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <[email protected]>

# 7f44bead 16-Sep-2024 Felix Moessbauer <[email protected]>

io_uring/sqpoll: do not put cpumask on stack

Putting a cpumask on the stack has been deprecated for a long time
(since commit 2d3854a37e8), as these masks can be big. Given that,
change the on-stack allocation of allowed_mask to a dynamic allocation.
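
A minimal sketch of the stack-to-heap conversion (error handling shape
assumed):

  cpumask_var_t allowed_mask;	/* was: cpumask_t allowed_mask; */

  if (!alloc_cpumask_var(&allowed_mask, GFP_KERNEL))
          return -ENOMEM;
  /* ... perform the cpuset check with allowed_mask ... */
  free_cpumask_var(allowed_mask);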

Fixes: f011c9cf04c0 ("io_uring/sqpoll: do not allow pinning outside of cpuset")
Signed-off-by: Felix Moessbauer <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

# a09c1724 16-Sep-2024 Jens Axboe <[email protected]>

io_uring/sqpoll: retain test for whether the CPU is valid

A recent commit ensured that SQPOLL cannot be set up with a CPU that
isn't in the current task's cpuset, but it also dropped testing whether
the CPU is valid in the first place. Without that, if a task passes in
a CPU value that is too high, the following KASAN splat can get
triggered:

BUG: KASAN: stack-out-of-bounds in io_sq_offload_create+0x858/0xaa4
Read of size 8 at addr ffff800089bc7b90 by task wq-aff.t/1391

CPU: 4 UID: 1000 PID: 1391 Comm: wq-aff.t Not tainted 6.11.0-rc7-00227-g371c468f4db6 #7080
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace.part.0+0xcc/0xe0
show_stack+0x14/0x1c
dump_stack_lvl+0x58/0x74
print_report+0x16c/0x4c8
kasan_report+0x9c/0xe4
__asan_report_load8_noabort+0x1c/0x24
io_sq_offload_create+0x858/0xaa4
io_uring_setup+0x1394/0x17c4
__arm64_sys_io_uring_setup+0x6c/0x180
invoke_syscall+0x6c/0x260
el0_svc_common.constprop.0+0x158/0x224
do_el0_svc+0x3c/0x5c
el0_svc+0x34/0x70
el0t_64_sync_handler+0x118/0x124
el0t_64_sync+0x168/0x16c

The buggy address belongs to stack of task wq-aff.t/1391
and is located at offset 48 in frame:
io_sq_offload_create+0x0/0xaa4

This frame has 1 object:
[32, 40) 'allowed_mask'

The buggy address belongs to the virtual mapping at
[ffff800089bc0000, ffff800089bc9000) created by:
kernel_clone+0x124/0x7e0

The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff0000d740af80 pfn:0x11740a
memcg:ffff0000c2706f02
flags: 0xbffe00000000000(node=0|zone=2|lastcpupid=0x1fff)
raw: 0bffe00000000000 0000000000000000 dead000000000122 0000000000000000
raw: ffff0000d740af80 0000000000000000 00000001ffffffff ffff0000c2706f02
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff800089bc7a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff800089bc7b00: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
>ffff800089bc7b80: 00 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
^
ffff800089bc7c00: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
ffff800089bc7c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f3
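
A minimal sketch of the restored validity test, run before any mask is
consulted (placement and label names assumed):

  int cpu = p->sq_thread_cpu;

  ret = -EINVAL;
  if (cpu >= nr_cpu_ids || !cpu_online(cpu))
          goto err_sqpoll;
  /* only now is it safe to test the cpuset-provided mask */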

Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Fixes: f011c9cf04c0 ("io_uring/sqpoll: do not allow pinning outside of cpuset")
Tested-by: Felix Moessbauer <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

# f011c9cf 09-Sep-2024 Felix Moessbauer <[email protected]>

io_uring/sqpoll: do not allow pinning outside of cpuset

The submit queue polling threads are userland threads that simply never
return to userland. When creating the thread with IORING_SETUP_SQ_AFF,
the affinity of the poller thread is set to the CPU specified in
sq_thread_cpu. However, this CPU can be outside of the cpuset defined
by the cgroup cpuset controller. This violates the rules defined by the
cpuset controller and is a potential issue for realtime applications.

In commit b7ed6d8ffd6 we fixed the default affinity of the poller
thread for the case where no explicit pinning is requested, by
inheriting the affinity of the creating task. In the case of explicit
pinning, the check is more complicated, as a CPU outside of the parent
cpumask is also allowed. We implement this by using
cpuset_cpus_allowed() (which has support for cgroup cpusets) and
testing whether the requested CPU is in the set.
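
A minimal sketch of the described test (the allowed_mask declaration
and cleanup are elided):

  cpuset_cpus_allowed(current, allowed_mask);
  if (!cpumask_test_cpu(cpu, allowed_mask))
          return -EINVAL;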

Fixes: 37d1e2e3642e ("io_uring: move SQPOLL thread io-wq forked worker")
Cc: [email protected] # 6.1+
Signed-off-by: Felix Moessbauer <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

# 7255cd89 30-Jul-2024 Olivier Langlois <[email protected]>

io_uring: micro optimization of __io_sq_thread() condition

Reverse the order in which the elements of an if statement are
evaluated.

For the many users that are not using iopoll, evaluating iopoll_list
always costs a memory access just to come out false, whereas to_submit
is very likely already loaded in a register.
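
A minimal sketch of the reordering (condition reconstructed from the
description):

  /* before: always dereferences ctx to test iopoll_list */
  if (!wq_list_empty(&ctx->iopoll_list) || to_submit)

  /* after: to_submit is likely in a register and short-circuits */
  if (to_submit || !wq_list_empty(&ctx->iopoll_list))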

Signed-off-by: Olivier Langlois <[email protected]>
Reviewed-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/052ca60b5c49e7439e4b8bd33bfab4a09d36d3d6.1722374371.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <[email protected]>

# e4956dc7 13-Aug-2024 Jens Axboe <[email protected]>

io_uring/sqpoll: annotate debug task == current with data_race()

There's a debug check in io_sq_thread_park() checking if it's the SQPOLL
thread itself calling park. KCSAN warns about this, as we should not be
reading sqd->thread outside of sqd->lock.

Just silence this with data_race(). The pointer isn't used for anything
but this debug check.
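
A minimal sketch of the annotated check (shape assumed):

  /* debug only: the SQPOLL thread must never try to park itself */
  WARN_ON_ONCE(data_race(sqd->thread) == current);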

Reported-by: [email protected]
Signed-off-by: Jens Axboe <[email protected]>

Revision tags: v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2
# 1da91ea8 31-May-2024 Al Viro <[email protected]>

introduce fd_file(), convert all accessors to it.

For any changes of struct fd representation we need to
turn existing accesses to fields into calls of wrappers.
Accesses to struct fd::flags are very few (3 in linux/file.h,
1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
explicit initializers).
Those can be dealt with in the commit converting to
new layout; accesses to struct fd::file are too many for that.
This commit converts (almost) all of f.file to
fd_file(f). It's not entirely mechanical ('file' is used as
a member name more than just in struct fd) and it does not
even attempt to distinguish the uses in pointer context from
those in boolean context; the latter will eventually be turned
into a separate helper (fd_empty()).

NOTE: mass conversion to fd_empty(), tempting as it
might be, is a bad idea; better to do that piecewise in the commits
that convert from fdget...() to CLASS(...).

[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]
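
A minimal sketch of the conversion as it would look in this file's
fdget() of the wq_fd (surrounding code assumed):

  struct fd f = fdget(p->wq_fd);

  if (!fd_file(f))	/* was: if (!f.file) */
          return -ENXIO;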

Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>

Revision tags: v6.10-rc1
# d13ddd9c 21-May-2024 Jens Axboe <[email protected]>

io_uring/sqpoll: ensure that normal task_work is also run timely

With the move to private task_work, SQPOLL neglected to also run the
normal task_work, if any is pending. This will eventually get run, but
we should run it alongside the private task_work to ensure that things
like a final fput() are processed in a timely fashion.
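
A minimal sketch of the addition to the private task_work runner
(placement assumed):

  /* also flush regular task_work, so e.g. a final fput() runs timely */
  if (task_work_pending(current))
          task_work_run();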

Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]/
Reported-by: Andrew Udvare <[email protected]>
Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately")
Tested-by: Christian Heusel <[email protected]>
Tested-by: Andrew Udvare <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

Revision tags: v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1
# c4ce0ab2 21-Mar-2024 Jens Axboe <[email protected]>

io_uring/sqpoll: work around a potential audit memory leak

kmemleak complains that there's a memory leak related to connect
handling:

unreferenced object 0xffff0001093bdf00 (size 128):
comm "iou-sqp-455", pid 457, jiffies 4294894164
hex dump (first 32 bytes):
02 00 fa ea 7f 00 00 01 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc 2e481b1a):
[<00000000c0a26af4>] kmemleak_alloc+0x30/0x38
[<000000009c30bb45>] kmalloc_trace+0x228/0x358
[<000000009da9d39f>] __audit_sockaddr+0xd0/0x138
[<0000000089a93e34>] move_addr_to_kernel+0x1a0/0x1f8
[<000000000b4e80e6>] io_connect_prep+0x1ec/0x2d4
[<00000000abfbcd99>] io_submit_sqes+0x588/0x1e48
[<00000000e7c25e07>] io_sq_thread+0x8a4/0x10e4
[<00000000d999b491>] ret_from_fork+0x10/0x20

which can happen if:

1) The command type does something on the prep side that triggers an
audit call.
2) The thread hasn't done any operations before this that triggered
an audit call inside ->issue(), where we have audit_uring_entry()
and audit_uring_exit().

Work around this by issuing a blanket NOP operation before the SQPOLL
does anything.
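
The message itself names the audit hooks involved; a minimal sketch of
the workaround at SQPOLL thread startup:

  /* run a NOP through the audit hooks once, so the audit state is
     set up front rather than left dangling by a prep-side allocation */
  audit_uring_entry(IORING_OP_NOP);
  audit_uring_exit(true, 0);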

Signed-off-by: Jens Axboe <[email protected]>

# 1251d202 19-Mar-2024 Jens Axboe <[email protected]>

io_uring/sqpoll: early exit thread if task_context wasn't allocated

Ideally we'd want to simply kill the task rather than wake it, but for
now let's just add a startup check that causes the thread to exit.
This can only happen if io_uring_alloc_task_context() fails, which
generally requires fault injection.
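
A minimal sketch of the startup check in io_sq_thread() (shape
assumed):

  /* tctx allocation failed during setup; exit instead of running */
  if (!current->io_uring)
          goto err_out;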

Reported-by: Ubisectech Sirius <[email protected]>
Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Jens Axboe <[email protected]>

Revision tags: v6.8, v6.8-rc7
# 3fcb9d17 28-Feb-2024 Xiaobing Li <[email protected]>

io_uring/sqpoll: statistics of the true utilization of sq threads

Count the running time and actual IO processing time of the sqpoll
thread, and output the statistical data to fdinfo.

Variable description:
"work_time" in the code represents the sum of the jiffies during which
the sq thread was actually processing IO, i.e. how many milliseconds it
actually spent processing IO. "total_time" represents the total time
that has elapsed from the beginning of the sq thread's loop to the
current time point, i.e. how many milliseconds it has spent in total.
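
For reference, the reported utilization is just the ratio of these two
fields: with the sample fdinfo output below, SqWorkTime / SqTotalTime =
16748316 / 18099614 ≈ 92.5%, consistent with the iodepth=16 row of the
results table.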

The test tool is fio, and its parameters are as follows:
[global]
ioengine=io_uring
direct=1
group_reporting
bs=128k
norandommap=1
randrepeat=0
refill_buffers
ramp_time=30s
time_based
runtime=1m
clocksource=clock_gettime
overwrite=1
log_avg_msec=1000
numjobs=1

[disk0]
filename=/dev/nvme0n1
rw=read
iodepth=16
hipri
sqthread_poll=1

The test results are as follows:
Every 2.0s: cat /proc/9230/fdinfo/6 | grep -E Sq
SqMask: 0x3
SqHead: 3197153
SqTail: 3197153
CachedSqHead: 3197153
SqThread: 9231
SqThreadCpu: 11
SqTotalTime: 18099614
SqWorkTime: 16748316

The test results corresponding to different iodepths are as follows:
|-----------|-------|-------|-------|------|-------|
| iodepth | 1 | 4 | 8 | 16 | 64 |
|-----------|-------|-------|-------|------|-------|
|utilization| 2.9% | 8.8% | 10.9% | 92.9%| 84.4% |
|-----------|-------|-------|-------|------|-------|
| idle | 97.1% | 91.2% | 89.1% | 7.1% | 15.6% |
|-----------|-------|-------|-------|------|-------|

Signed-off-by: Xiaobing Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

Revision tags: v6.8-rc6, v6.8-rc5
# c8d8fc3b 14-Feb-2024 Jens Axboe <[email protected]>

io_uring/sqpoll: use the correct check for pending task_work

A previous commit moved to using just the private task_work list for
SQPOLL, but it neglected to update the check for whether we have
pending task_work. Normally this is fine as we'll attempt to run it
unconditionally, but if we race with going to sleep AND task_work
being added, then we certainly need the right check here.
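
A minimal sketch of a corrected pending check covering both the retry
list and the private list (helper name assumed):

  static bool io_sq_tw_pending(struct llist_node *retry_list)
  {
          struct io_uring_task *tctx = current->io_uring;

          return retry_list || !llist_empty(&tctx->task_list);
  }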

Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Jens Axboe <[email protected]>

Revision tags: v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6
# ff183d42 08-Jun-2023 Stefan Roesch <[email protected]>

io-uring: add sqpoll support for napi busy poll

This adds the sqpoll support to the io-uring napi.

Signed-off-by: Stefan Roesch <[email protected]>
Suggested-by: Olivier Langlois <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

# af5d68f8 02-Feb-2024 Jens Axboe <[email protected]>

io_uring/sqpoll: manage task_work privately

Decouple from task_work running, and cap the number of entries we process
at a time. If we exceed that number, push remaining entries to a retry
list that we'll process first next time.

We cap the number of entries to process at 8, which is fairly random.
We just want to get enough per-ctx batching here, while not processing
endlessly.

Since we manually run PF_IO_WORKER related task_work anyway as the task
never exits to userspace, with this we no longer need to add an actual
task_work item to the per-process list.
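
A minimal sketch of the capped run with a retry list (names assumed;
the per-entry runner is hypothetical):

  /* process deferred entries first, then pull fresh ones */
  struct llist_node *node = *retry_list;
  unsigned int count = 0;

  if (!node)
          node = llist_del_all(&tctx->task_list);
  while (node && count < max_entries) {
          struct llist_node *next = node->next;

          io_run_tw_entry(node);	/* hypothetical per-entry runner */
          node = next;
          count++;
  }
  /* anything left over is retried first on the next iteration */
  *retry_list = node;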

Signed-off-by: Jens Axboe <[email protected]>
