| 92835ceb | 08-May-2025 |
Gabriel Krisman Bertazi <[email protected]> |
io_uring/sqpoll: Increase task_work submission batch size
Our QA team reported a 10%-23% throughput reduction on an io_uring sqpoll testcase doing IO to a null_blk, which I traced back to a reduction of the device submission queue depth utilization. It turns out that, after commit af5d68f8892f ("io_uring/sqpoll: manage task_work privately"), we capped the number of task_work entries that can be completed from a single spin of sqpoll to only 8 entries, before the sqpoll goes around to (potentially) sleep.

While this cap doesn't drive the submission side directly, it impacts the completion behavior, which affects the number of IOs queued by fio per sqpoll cycle on the submission side, and io_uring ends up seeing fewer IOs per sqpoll cycle. As a result, block layer plugging is less effective, we see more time spent inside the block layer in profiling charts, and submission latency measured by fio increases.
There are other places that have increased overhead once sqpoll sleeps more often, such as the sqpoll utilization calculation. But, in this microbenchmark, those were not representative enough in perf charts, and their removal didn't yield measurable changes in throughput. The major overhead comes from the fact we plug less, and less often, when submitting to the block layer.
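To make the mechanism concrete, here is a rough sketch of the shape of the cap, with assumed names (the real constant and helper live in io_uring/sqpoll.c and may be spelled differently): the sqpoll thread runs at most N queued task_work entries per spin before it considers sleeping, and this change raises that N from 8 to 32.

/* Illustrative sketch, not the patch itself. */
#include <linux/llist.h>

#define SQPOLL_TW_CAP_ENTRIES   32      /* assumed name; previously capped at 8 */

/* Run at most @max_entries queued task_work items in one sqpoll spin. */
static unsigned int sqpoll_run_task_work(struct llist_node *node,
                                         unsigned int max_entries)
{
        unsigned int count = 0;

        while (node && count < max_entries) {
                struct llist_node *next = node->next;

                /* the actual task_work callback invocation is elided here */
                node = next;
                count++;
        }
        return count;
}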
My benchmark is:
fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \
    --invalidate=1 --time_based --ramp_time=10 --group_reporting=1 \
    --filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \
    --rw=randread --numjobs=1 --sqthread_poll
On one machine, tested on top of Linux 6.15-rc1, we have the following baseline:

  READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec
With this patch:

  READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec
which is a 15% improvement in measured bandwidth. The average submission latency is noticeably lower too, as measured by fio:
Baseline:
  lat (usec): min=20, max=241, avg=99.81, stdev=3.38
Patched:
  lat (usec): min=26, max=226, avg=86.48, stdev=4.82
If we look at blktrace, we can also see the plugging behavior is improved. In the baseline, we end up limited to plugging 8 requests in the block layer regardless of the device queue depth, while after patching we can drive more IO and manage to utilize the full device queue.
In the baseline, after a stabilization phase, an ordinary submission looks like:

  254,0    1    49942     0.016028795  5977  U   N [iou-sqp-5976] 7
After patching, I see consistently more requests per unplug:

  254,0    1     4996     0.001432872  3145  U   N [iou-sqp-3144] 32
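For reference, the plugging being measured here is the standard block layer pattern: requests submitted between blk_start_plug() and blk_finish_plug() are held back and dispatched as a batch. A minimal sketch of a submitter using it (not code from this patch):

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Sketch: submit @nr bios as one plugged batch so the block layer can
 * merge and dispatch them together instead of one at a time. */
static void submit_plugged_batch(struct bio **bios, int nr)
{
        struct blk_plug plug;
        int i;

        blk_start_plug(&plug);
        for (i = 0; i < nr; i++)
                submit_bio(bios[i]);
        blk_finish_plug(&plug);
}

With only 8 task_work entries handled per sqpoll cycle, each unplug carries at most a handful of requests; with the larger cap, far more requests ride each unplug, as the blktrace samples above show.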
Ideally, the cap would be at least deep enough to fill the device queue, but we can't predict that behavior, or assume all IO goes to a single device, and thus can't guess the ideal batch size. We also don't want to let the task_work run unbounded, though I'm not sure it would really be a problem. Instead, let's just give it a more sensible value that allows for more efficient batching. I tested with different cap values, and initially proposed increasing the cap to 1024. Jens argued that was too big of a bump, and I observed that, with 32, I'm no longer able to reproduce this bottleneck on any of my machines.
Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
| 687b2bae | 07-May-2025 |
Jens Axboe <[email protected]> |
io_uring: ensure deferred completions are flushed for multishot
Multishot normally uses io_req_post_cqe() to post completions, but when stopping it, it may finish up with a deferred completion. This is fine, except if another multishot event triggers before the deferred completions get flushed. If this occurs, then CQEs may get reordered in the CQ ring, as new multishot completions get posted before the deferred ones are flushed. This can cause confusion on the application side, if strict ordering is required for the use case.
When posting a multishot completion via io_req_post_cqe(), flush any pending deferred completions first.
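A sketch of the shape of the fix, with hedged names (the deferred-completion list and flush helper below are assumptions about io_uring internals, not necessarily the exact identifiers touched by the patch):

/* Sketch: make sure deferred completions reach the CQ ring before a new
 * multishot CQE is posted, so CQEs are observed in order. */
static bool post_multishot_cqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
                               s32 res, u32 cflags)
{
        /* assumed: completions deferred by an earlier multishot stop */
        if (!wq_list_empty(&ctx->submit_state.compl_reqs))
                __io_submit_flush_completions(ctx);

        return io_req_post_cqe(req, res, cflags);
}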
Cc: [email protected] # 6.1+
Reported-by: Norman Maurer <[email protected]>
Reported-by: Christian Mazakas <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
| b53e5232 | 04-May-2025 |
Jens Axboe <[email protected]> |
io_uring: always arm linked timeouts prior to issue
There are a few spots where linked timeouts are armed, and not all of them adhere to the pre-arm, attempt issue, post-arm pattern. This can be problematic if the linked request returns that it will trigger a callback later, and does so before the linked timeout is fully armed.
Consolidate all the linked timeout handling into __io_issue_sqe(), rather than have it spread throughout the various issue entry points.
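The intended ordering can be sketched as follows; the helper names are illustrative placeholders for the linked-timeout prep/arm functions, not the exact ones consolidated in __io_issue_sqe():

/* Sketch of the pre-arm / attempt-issue / post-arm ordering: the timeout
 * is prepared before issue, so a request that goes asynchronous (and
 * completes quickly) cannot race ahead of its own linked timeout. */
static int issue_with_linked_timeout(struct io_kiocb *req,
                                     unsigned int issue_flags)
{
        struct io_kiocb *link = prep_linked_timeout(req);  /* pre-arm (hypothetical helper) */
        int ret = issue_sqe(req, issue_flags);             /* attempt issue (hypothetical helper) */

        if (link)
                queue_linked_timeout(link);                /* post-arm (hypothetical helper) */
        return ret;
}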
Cc: [email protected]
Link: https://github.com/axboe/liburing/issues/1390
Reported-by: Chase Hiltz <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
| f024d3a8 | 30-Apr-2025 |
Jens Axboe <[email protected]> |
io_uring/fdinfo: annotate racy sq/cq head/tail reads
syzbot complains about the cached sq head read, and it's totally right. But we don't need to care: it's just fdinfo being read, and reading the CQ or SQ tail/head entries is known to be racy in that they are just a view into that very instant and may of course be outdated by the time they are reported.
Annotate both the SQ head and CQ tail read with data_race() to avoid this syzbot complaint.
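The annotation itself is a one-line change per read; a minimal sketch of the pattern, assuming the cached head/tail fields on the ring ctx (field names here are best-effort, not copied from the patch):

#include <linux/compiler.h>     /* data_race() */

/* Sketch: fdinfo only wants a point-in-time snapshot, so the racy reads
 * are intentional; data_race() documents that and keeps KCSAN quiet
 * without adding any ordering or atomicity. */
static void fdinfo_ring_snapshot(struct io_ring_ctx *ctx,
                                 unsigned int *sq_head, unsigned int *cq_tail)
{
        *sq_head = data_race(ctx->cached_sq_head);      /* assumed field name */
        *cq_tail = data_race(ctx->cached_cq_tail);      /* assumed field name */
}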
Link: https://lore.kernel.org/io-uring/[email protected]/
Reported-by: [email protected]
Signed-off-by: Jens Axboe <[email protected]>
| f12ecf5e | 18-Apr-2025 |
Pavel Begunkov <[email protected]> |
io_uring/zcrx: fix late dma unmap for a dead dev
There is a problem with page pools not dma-unmapping immediately when the device is going down, and delaying it until the page pool is destroyed, which is not allowed (see links). That just got fixed for normal page pools, and we need to address memory providers as well.
Unmap pages in the memory provider uninstall callback, and protect it with a new lock. There is also a gap between when a dma mapping is created and the mp is installed, so if the device is killed in between, io_uring would be holding on to dma mappings to a dead device with no one to call ->uninstall. Move it to page pool init and rely on ->is_mapped to make sure it's only done once.
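The locking/ownership pattern described above can be sketched like this, with illustrative names (the real structures live in io_uring/zcrx.c):

#include <linux/spinlock.h>

/* Sketch: single-shot DMA unmap that may be reached from either the
 * memory provider uninstall callback (device going away) or the page
 * pool teardown path, whichever happens first. */
struct zcrx_dma_state {                 /* illustrative, not the real struct */
        spinlock_t      lock;           /* protects @is_mapped */
        bool            is_mapped;      /* set once mapping succeeds at pp init */
};

static void zcrx_dma_unmap_once(struct zcrx_dma_state *st)
{
        spin_lock(&st->lock);
        if (st->is_mapped) {
                /* the dma_unmap_*() calls for the area's pages go here */
                st->is_mapped = false;
        }
        spin_unlock(&st->lock);
}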
Link: https://lore.kernel.org/lkml/[email protected]/T/
Link: https://lore.kernel.org/all/[email protected]/
Fixes: 34a3e60821ab9 ("io_uring/zcrx: implement zerocopy receive pp memory provider")
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/ef9b7db249b14f6e0b570a1bb77ff177389f881c.1744965853.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
| 80c7378f | 17-Apr-2025 |
Nitesh Shetty <[email protected]> |
io_uring/rsrc: send exact nr_segs for fixed buffer
Sending the exact nr_segs avoids the bio split check and processing in the block layer, which takes around 5% [1] of overall CPU utilization.
In our setup, we see an overall IOPS improvement from 7.15M to 7.65M [2] and 5% less CPU utilization.
[1]
  3.52%  io_uring  [kernel.kallsyms]  [k] bio_split_rw_at
  1.42%  io_uring  [kernel.kallsyms]  [k] bio_split_rw
  0.62%  io_uring  [kernel.kallsyms]  [k] bio_submit_split
[2] sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
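The core idea can be sketched as follows: size the bvec iterator to exactly the segments covered by [offset, offset + len) instead of the whole registered buffer. The helper and the one-folio-per-segment assumption below are illustrative, not the upstream io_import_fixed():

#include <linux/bvec.h>
#include <linux/uio.h>

/* Sketch: trim the iterator to the exact number of segments the IO
 * touches, so the block layer sees no spurious segments to split on.
 * Assumes every registered segment spans one folio of 1 << folio_shift. */
static void import_fixed_exact(struct iov_iter *iter, const struct bio_vec *bvec,
                               unsigned int nr_bvecs, size_t offset, size_t len,
                               unsigned int folio_shift)
{
        size_t seg_size = 1UL << folio_shift;
        unsigned int first_seg = offset >> folio_shift;
        size_t seg_off = offset & (seg_size - 1);
        unsigned int nr_segs = (seg_off + len + seg_size - 1) >> folio_shift;

        iov_iter_bvec(iter, ITER_DEST, bvec + first_seg,
                      min(nr_segs, nr_bvecs - first_seg), len);
        iter->iov_offset = seg_off;
}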
Signed-off-by: Nitesh Shetty <[email protected]>
[Pavel: fixed for kbuf, rebased and reworked on top of cleanups]
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/7a1a49a8d053bd617c244291d63dbfbc07afde36.1744882081.git.asml.silence@gmail.com
[axboe: fold in fix factoring in buf reg offset]
Signed-off-by: Jens Axboe <[email protected]>