Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3
25744f84 | 15-Apr-2025 | Pavel Begunkov <[email protected]>
io_uring/zcrx: return ifq id to the user
IORING_OP_RECV_ZC requests take a zcrx object id via sqe::zcrx_ifq_idx, which binds it to the corresponding if / queue. However, we don't return that id back to the user. It's fine as currently there can be only one zcrx and the user assumes that its id should be 0, but as we'll need multiple zcrx objects in the future let's explicitly pass it back on registration.
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/8714667d370651962f7d1a169032e5f02682a73e.1744722517.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
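A minimal sketch of consuming the returned id. Only sqe::zcrx_ifq_idx is named by this message; the registration struct name io_uring_zcrx_ifq_reg and its zcrx_id output field are assumptions based on this series:

/* sketch: register a zcrx ifq, then read the id the kernel filled in */
struct io_uring_zcrx_ifq_reg reg = { /* ifq parameters ... */ };

if (syscall(__NR_io_uring_register, ring_fd,
            IORING_REGISTER_ZCRX_IFQ, &reg, 1) < 0)
        return -1;

/* use this in sqe->zcrx_ifq_idx for IORING_OP_RECV_ZC requests */
__u32 ifq_id = reg.zcrx_id;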
Revision tags: v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7
07754bfd | 14-Mar-2025 | Jens Axboe <[email protected]>
io_uring: enable toggle of iowait usage when waiting on CQEs
By default, io_uring marks a waiting task as being in iowait, if it's sleeping waiting on events and there are pending requests. This isn't necessarily always useful, and may be confusing on non-storage setups where iowait isn't expected. It can also cause extra power usage, by preventing the CPU from entering lower sleep states.
This adds a new enter flag, IORING_ENTER_NO_IOWAIT. If set, then io_uring will not account the sleeping task as being in iowait. If the kernel supports this feature, then it will be marked by having the IORING_FEAT_NO_IOWAIT feature flag set.
As the kernel currently does not support separating the iowait accounting and CPU frequency boosting, IORING_ENTER_NO_IOWAIT controls both of these at the same time. If those do end up being split in the future, it'd be possible to control them separately. However, it seems more likely that the kernel will decouple iowait and CPU frequency boosting anyway.
Signed-off-by: Jens Axboe <[email protected]>
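A sketch of the raw enter call under this flag (liburing wraps this; the probe-then-wait wiring here is illustrative):

/* at setup time, check that the kernel supports the flag */
unsigned no_iowait = 0;
if (p.features & IORING_FEAT_NO_IOWAIT)
        no_iowait = IORING_ENTER_NO_IOWAIT;

/* wait for one CQE without being accounted as iowait */
syscall(__NR_io_uring_enter, ring_fd, 0, 1,
        IORING_ENTER_GETEVENTS | no_iowait, NULL, 0);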
Revision tags: v6.14-rc6
bdabba04 | 07-Mar-2025 | Pavel Begunkov <[email protected]>
io_uring/rw: implement vectored registered rw
Implement vectored reads and writes for registered buffers with the new opcodes IORING_OP_READV_FIXED and IORING_OP_WRITEV_FIXED.
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/d7c89eb481e870f598edc91cc66ff4d1e4ae3788.1741362889.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
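A sketch of a vectored read against a registered buffer, using liburing's generic io_uring_prep_rw() helper; the message doesn't show the sqe wiring, so treat the field usage as an assumption:

struct iovec iov[2] = {
        { .iov_base = buf,        .iov_len = 4096 },
        { .iov_base = buf + 8192, .iov_len = 4096 },
};
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

/* the iovecs must point into the registered buffer at buf_index */
io_uring_prep_rw(IORING_OP_READV_FIXED, sqe, fd, iov, 2, 0);
sqe->buf_index = 0;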
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1
19f7e942 | 31-Jan-2025 | Jens Axboe <[email protected]>
io_uring/epoll: add support for IORING_OP_EPOLL_WAIT
For existing epoll event loops that can't fully convert to io_uring, the usual approach is to add the io_uring fd to the epoll instance and use epoll_wait() to wait on both "legacy" and io_uring events. While this works, it isn't optimal, as:
1) epoll_wait() is pretty limited in what it can do. It does not support partial reaping of events, or waiting on a batch of events.
2) When an io_uring ring is added to an epoll instance, it activates the io_uring "I'm being polled" logic which slows things down.
Rather than use this approach, with EPOLL_WAIT support added to io_uring, event loops can use the normal io_uring wait logic for everything, as long as an epoll wait request has been armed with io_uring.
Note that IORING_OP_EPOLL_WAIT does NOT take a timeout value, as this is an async request. Waiting on io_uring events in general has various timeout parameters, and those are the ones that should be used when waiting on any kind of request. If events are immediately available for reaping, then this opcode will return them immediately. If none are available, then it will post an async completion when they become available.
cqe->res will contain either an error code (< 0) for a malformed request, invalid epoll instance, etc., or a positive result indicating how many events were reaped.
IORING_OP_EPOLL_WAIT requests may be canceled using the normal io_uring cancelation infrastructure.
Signed-off-by: Jens Axboe <[email protected]>
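A sketch of arming an epoll wait through the ring; the io_uring_prep_epoll_wait() helper is assumed from liburing, not shown in this message:

struct epoll_event events[64];
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

io_uring_prep_epoll_wait(sqe, epfd, events, 64, 0);
io_uring_submit(&ring);
/* cqe->res: number of events written into 'events', or < 0 on error */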
1fc61eee | 18-Feb-2025 | Jens Axboe <[email protected]>
io_uring: fix spelling error in uapi io_uring.h
This is obviously not that important, but when changes are synced back from the kernel to liburing, the codespell CI ends up erroring because of this misspelling. Let's just correct it and avoid this biting us again on an import.
Signed-off-by: Jens Axboe <[email protected]>
11ed914b | 15-Feb-2025 | David Wei <[email protected]>
io_uring/zcrx: add io_recvzc request
Add io_uring opcode OP_RECV_ZC for doing zero copy reads out of a socket. The connection must land on the specific rx queue set up for zero copy, and the socket must be handled by the io_uring instance that the rx queue was registered for zero copy with. That's because neither net_iovs / buffers from our queue can be read by outside applications, nor is zero copy possible if traffic for the zero copy connection goes to another queue. This coordination is outside of the scope of this patch series. Also, any traffic directed to the zero copy enabled queue is immediately visible to the application, which is why CAP_NET_ADMIN is required at the registration step.
Of course, no data is actually read out of the socket; it has already been copied by the netdev into userspace memory via DMA. OP_RECV_ZC reads skbs out of the socket and checks that its frags are indeed net_iovs that belong to io_uring. A cqe is queued for each one of these frags.
Recall that each cqe is a big cqe, with the top half being an io_uring_zcrx_cqe. The cqe res field contains the len or error. The lower IORING_ZCRX_AREA_SHIFT bits of the struct io_uring_zcrx_cqe::off field contain the offset relative to the start of the zero copy area. The upper part of the off field is trivially zero, and will be used to carry the area id.
For now, there is no limit as to how much work each OP_RECV_ZC request does. It will attempt to drain a socket of all available data. This request always operates in multishot mode.
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: David Wei <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
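A sketch of issuing the multishot receive and decoding a completion; zcrx_ifq_idx, io_uring_zcrx_cqe, its off field, and IORING_ZCRX_AREA_SHIFT are from this series, while the prep wiring is an assumption:

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sockfd, NULL, 0, 0);
sqe->zcrx_ifq_idx = 0;          /* the single ifq for now */

/* on completion, decode the top half of the big CQE */
struct io_uring_zcrx_cqe *rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
__u64 off = rcqe->off & ((1ULL << IORING_ZCRX_AREA_SHIFT) - 1);
/* data sits at area_base + off, length is cqe->res */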
cf96310c | 15-Feb-2025 | David Wei <[email protected]>
io_uring/zcrx: add io_zcrx_area
Add io_zcrx_area that represents a region of userspace memory that is used for zero copy. During ifq registration, userspace passes in the uaddr and len of userspace memory, which is then pinned by the kernel. Each net_iov is mapped to one of these pages.
The freelist is a spinlock protected list that keeps track of all the net_iovs/pages that aren't used.
For now, there is only one area per ifq and area registration happens implicitly as part of ifq registration. There is no API for adding/removing areas yet. The struct for area registration is there for future extensibility once we support multiple areas and TCP devmem.
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
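A sketch of describing the area; the struct name io_uring_zcrx_area_reg and its addr/len fields are assumptions, since the message names the object but not its layout:

struct io_uring_zcrx_area_reg area = {
        .addr = (__u64)(uintptr_t)uaddr,   /* userspace memory to pin */
        .len  = area_len,
};
/* for now, passed implicitly as part of ifq registration (next entry) */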
6f377873 | 15-Feb-2025 | David Wei <[email protected]>
io_uring/zcrx: add interface queue and refill queue
Add a new object called an interface queue (ifq) that represents a net rx queue that has been configured for zero copy. Each ifq is registered using a new registration opcode IORING_REGISTER_ZCRX_IFQ.
The refill queue is allocated by the kernel and mapped by userspace using a new offset IORING_OFF_RQ_RING, in a similar fashion to the main SQ/CQ. It is used by userspace to return buffers that it is done with, which will then be re-used by the netdev again.
The main CQ ring is used to notify userspace of received data by using the upper 16 bytes of a big CQE as a new struct io_uring_zcrx_cqe. Each entry contains the offset + len to the data.
For now, each io_uring instance only has a single ifq.
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: David Wei <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
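A sketch of registering an ifq and mapping the refill ring. Only IORING_REGISTER_ZCRX_IFQ and IORING_OFF_RQ_RING are named by this message; the struct fields are assumptions based on this series:

struct io_uring_zcrx_ifq_reg reg = {
        .if_idx     = ifindex,    /* netdev interface index */
        .if_rxq     = rxq_id,     /* rx queue configured for zcrx */
        .rq_entries = 4096,       /* refill ring size */
};

syscall(__NR_io_uring_register, ring_fd, IORING_REGISTER_ZCRX_IFQ, &reg, 1);

/* refill ring is kernel allocated; map it like the SQ/CQ rings */
void *rq = mmap(NULL, rq_size, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_RQ_RING);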
Revision tags: v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2
94d57442 | 05-Dec-2024 | Anuj Gupta <[email protected]>
io_uring: expose read/write attribute capability
After commit 9a213d3b80c0, we can pass additional attributes along with read/write. However, userspace doesn't know that. Add a new feature flag, IORING_FEAT_RW_ATTR, to notify userspace that the kernel has this ability.
Signed-off-by: Anuj Gupta <[email protected]>
Reviewed-by: Li Zetao <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Tested-by: Martin K. Petersen <[email protected]>
Reviewed-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
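A minimal sketch of the feature probe at setup time:

struct io_uring_params p = { };
int ring_fd = syscall(__NR_io_uring_setup, 64, &p);

if (p.features & IORING_FEAT_RW_ATTR)
        /* kernel accepts sqe->attr_ptr / sqe->attr_type_mask */
        use_rw_attrs = true;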
Revision tags: v6.13-rc1
59a7d12a | 28-Nov-2024 | Anuj Gupta <[email protected]>
io_uring: introduce attributes for read/write and PI support
Add the ability to pass additional attributes along with read/write. Applications can prepare attribute-specific information and pass its address using the SQE field:

__u64 attr_ptr;

along with setting a mask indicating which attributes are being passed:

__u64 attr_type_mask;

Overall, 64 attributes are allowed and currently one attribute, IORING_RW_ATTR_FLAG_PI, is supported.
With the PI attribute, userspace can pass the following information:
- flags: integrity check flags IO_INTEGRITY_CHK_{GUARD/APPTAG/REFTAG}
- len: length of the PI/metadata buffer
- addr: address of the metadata buffer
- seed: seed value for reftag remapping
- app_tag: application-defined 16b value
Process this information to prepare uio_meta_descriptor and pass it down using kiocb->private.
PI attribute is supported only for direct IO.
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
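A sketch of attaching PI metadata to a read. The sqe fields and flag names are from this message; the io_uring_attr_pi struct layout is an assumption derived from the field list above:

struct io_uring_attr_pi pi = {
        .flags   = IO_INTEGRITY_CHK_GUARD,
        .len     = md_len,                    /* PI/metadata buffer length */
        .addr    = (__u64)(uintptr_t)md_buf,  /* metadata buffer */
        .seed    = 0,
        .app_tag = 0,
};
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

io_uring_prep_read(sqe, dio_fd, buf, len, 0);   /* direct IO only */
sqe->attr_ptr       = (__u64)(uintptr_t)&pi;
sqe->attr_type_mask = IORING_RW_ATTR_FLAG_PI;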
c750629c | 18-Nov-2024 | Pavel Begunkov <[email protected]>
io_uring: remove io_uring_cqwait_reg_arg
The separate wait argument registration API was removed; also delete the leftover uapi definitions.
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/143b6a53591badac23632d3e6fa3e5db4b342ee2.1731942445.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
Revision tags: v6.12
d617b314 | 15-Nov-2024 | Pavel Begunkov <[email protected]>
io_uring: restore back registered wait arguments
Now that we've got a more generic region registration API, move IORING_ENTER_EXT_ARG_REG onto it and re-enable it.
First, the user has to register a region with the IORING_MEM_REGION_REG_WAIT_ARG flag set. It can only be done for a ring in a disabled state, aka IORING_SETUP_R_DISABLED, to avoid races with already running waiters. With that we should have stable constant values for ctx->cq_wait_{size,arg} in io_get_ext_arg_reg() and hence no READ_ONCE required.
The other API difference is that we're now passing byte offsets instead of indexes. The user _must_ align all offsets / pointers to the native word size; failing to do so may, but does not necessarily, lead to a failure, usually returned as -EFAULT. liburing will hide these details from users.
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/81822c1b4ffbe8ad391b4f9ad1564def0d26d990.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
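A sketch of the registration step. Only the IORING_MEM_REGION_REG_WAIT_ARG flag and the disabled-ring requirement are named here; the struct names and the IORING_REGISTER_MEM_REGION opcode are assumptions from the surrounding series:

struct io_uring_region_desc rd = {
        .user_addr = (__u64)(uintptr_t)wait_mem,   /* word aligned */
        .size      = 4096,
};
struct io_uring_mem_region_reg mr = {
        .region_uptr = (__u64)(uintptr_t)&rd,
        .flags       = IORING_MEM_REGION_REG_WAIT_ARG,
};

/* the ring must still be IORING_SETUP_R_DISABLED at this point */
syscall(__NR_io_uring_register, ring_fd, IORING_REGISTER_MEM_REGION, &mr, 1);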
93238e66 | 15-Nov-2024 | Pavel Begunkov <[email protected]>
io_uring: add memory region registration
Regions will serve multiple purposes. First, with them we can decouple ring/etc. object creation from registration / mapping of the memory they will be placed in. We already have hacks that allow putting both SQ and CQ into the same huge page; in the future we should be able to:

region = create_region(io_ring);
create_pbuf_ring(io_uring, region, offset=0);
create_pbuf_ring(io_uring, region, offset=N);

The second use case is efficiently passing parameters. The following patch re-enables IORING_ENTER_EXT_ARG_REG on top of regions, which optimises wait arguments. It'll also be useful for request arguments replacing iovecs, msghdr, etc. pointers. Eventually it would also be handy for BPF, if that comes to fruition.
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/0798cf3a14fad19cfc96fc9feca5f3e11481691d.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
dfbbfbf1 | 15-Nov-2024 | Pavel Begunkov <[email protected]>
io_uring: introduce concept of memory regions
We've got a good number of mappings we share with userspace, including the main rings, provided buffer rings, upcoming rings for zerocopy rx, and more. All of them duplicate user argument parsing and some internal details as well (page pinning, huge page optimisations, mmap'ing, etc.)
Introduce a notion of regions. For userspace, for now, it's just a new structure called struct io_uring_region_desc which is supposed to parameterise all such mapping / queue creations. A region either represents a user provided chunk of memory, in which case the user_addr field should point to it, or a request for the kernel to allocate the memory, in which case the user would need to mmap it after using the offset returned in the mmap_offset field. With a uniform userspace API we can avoid additional boilerplate code and apply future optimisations to all of them at once.
Internally, there is a new structure struct io_mapped_region holding all relevant runtime information and some helpers to work with it. This patch limits it to user provided regions.
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/0e6fe25818dfbaebd1bd90b870a6cac503fe1a24.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
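A sketch of the two backing modes described above; user_addr and mmap_offset are named by this message, while the flag name is an assumption:

struct io_uring_region_desc rd = { .size = 2 * 4096 };

/* (a) user provided memory: point user_addr at it */
rd.user_addr = (__u64)(uintptr_t)user_buf;
rd.flags     = IORING_MEM_REGION_TYPE_USER;

/* (b) kernel allocated (not in this patch): leave user_addr unset and,
 * after registration, mmap the ring fd at the returned rd.mmap_offset */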
83e04152 | 15-Nov-2024 | Pavel Begunkov <[email protected]>
io_uring: temporarily disable registered waits
Disable wait argument registration as it'll be replaced with a more generic feature. We'll still need IORING_ENTER_EXT_ARG_REG parsing in a few commits so leave it be.
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/70b1d1d218c41ba77a76d1789c8641dab0b0563e.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
Revision tags: v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3
6bf90bd8 | 13-Oct-2024 | Olivier Langlois <[email protected]>
io_uring/napi: add static napi tracking strategy
Add a static napi tracking strategy. It allows the user to manually manage the napi ids list for busy polling and eliminates the overhead of dynamically updating the list from the fast path.
Signed-off-by: Olivier Langlois <[email protected]>
Link: https://lore.kernel.org/r/96943de14968c35a5c599352259ad98f3c0770ba.1728828877.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <[email protected]>
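A sketch of opting into static tracking and then managing an id by hand. struct io_uring_napi and io_uring_register_napi() exist in liburing, but the opcode/op_param encoding and the constant names here are assumptions based on this patch's description:

struct io_uring_napi napi = {
        .prefer_busy_poll = 1,
        .opcode           = IO_URING_NAPI_REGISTER_OP,
        .op_param         = IO_URING_NAPI_TRACKING_STATIC,
};
io_uring_register_napi(&ring, &napi);

/* then add ids manually, e.g. after connecting a socket */
napi.opcode   = IO_URING_NAPI_STATIC_ADD_ID;
napi.op_param = napi_id;
io_uring_register_napi(&ring, &napi);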
01ee194d | 01-Nov-2024 | hexue <[email protected]>
io_uring: add support for hybrid IOPOLL
A new hybrid poll is implemented on the io_uring layer. Once an IO is issued, it will not be polled immediately; instead, the task blocks first and is re-run before IO completion, then polls to reap the IO. While this method can be suboptimal when running on a single thread, it offers performance between IRQ-driven completion and regular polling, with lower CPU utilization than polling.
To use hybrid polling, the ring must be set up with both the IORING_SETUP_IOPOLL and IORING_SETUP_HYBRID_IOPOLL flags set. Hybrid polling has the same restrictions as IOPOLL, in that commands must explicitly support it.
Signed-off-by: hexue <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
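Setup is just the two flags together; a minimal sketch:

struct io_uring_params p = {
        .flags = IORING_SETUP_IOPOLL | IORING_SETUP_HYBRID_IOPOLL,
};
int ring_fd = syscall(__NR_io_uring_setup, 64, &p);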
c1329532 | 29-Oct-2024 | Jens Axboe <[email protected]>
io_uring/rsrc: allow cloning with node replacements
Currently cloning a buffer table will fail if the destination already has a table. But it should be possible to use it to replace existing elements. Add an IORING_REGISTER_DST_REPLACE cloning flag which, if set, will allow the destination to already have a buffer table. If that is the case, then entries designated by offset + nr buffers will be replaced, if they already exist.
Note that it's allowed to use IORING_REGISTER_DST_REPLACE without an existing table, in which case it'll work just as if the flag weren't set and the table were empty - it'll just assign the newly created table for that case.
Signed-off-by: Jens Axboe <[email protected]>
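A sketch of a partial replace, combining this flag with the offsets from the following entry. Only the flag name is given here; the struct io_uring_clone_buffers layout is an assumption:

struct io_uring_clone_buffers buf = {
        .src_fd  = src_ring_fd,
        .flags   = IORING_REGISTER_DST_REPLACE,
        .src_off = 0,
        .dst_off = 16,    /* replace 8 entries starting at index 16 */
        .nr      = 8,
};
syscall(__NR_io_uring_register, dst_ring_fd,
        IORING_REGISTER_CLONE_BUFFERS, &buf, 1);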
b16e920a | 29-Oct-2024 | Jens Axboe <[email protected]>
io_uring/rsrc: allow cloning at an offset
Right now buffer cloning is an all-or-nothing kind of thing - either the whole table is cloned from a source to a destination ring, or nothing at all.
However, it's not always desired to clone the whole thing. Allow for the application to specify a source and destination offset, and a number of buffers to clone. If the destination offset is non-zero, then allocate sparse nodes upfront.
Signed-off-by: Jens Axboe <[email protected]>
a85f3105 | 27-Oct-2024 | Jens Axboe <[email protected]>
io_uring/nop: add support for testing registered files and buffers
Useful for testing performance/efficiency impact of registered files and buffers, vs (particularly) non-registered files.
Signed-off-by: Jens Axboe <[email protected]>
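A sketch of such a test NOP; the nop_flags field and flag names are assumptions based on this description:

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

io_uring_prep_nop(sqe);
sqe->nop_flags = IORING_NOP_FILE | IORING_NOP_FIXED_FILE |
                 IORING_NOP_FIXED_BUFFER;
sqe->fd        = 0;    /* index into the registered file table */
sqe->buf_index = 0;    /* index into the registered buffer table */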
aa00f67a | 22-Oct-2024 | Jens Axboe <[email protected]>
io_uring: add support for fixed wait regions
Generally applications have one or a few kinds of waits, yet they pass in a struct io_uring_getevents_arg every time. This needs to get copied and, in turn, the timeout value needs to get copied.
Rather than do this for every invocation, allow the application to register a fixed set of wait regions that can simply be indexed when asking the kernel to wait on events.
At ring setup time, the application can register a number of these wait regions and initialize region/index 0 upfront:
struct io_uring_reg_wait *reg;
reg = io_uring_setup_reg_wait(ring, nr_regions, &ret);
/* set timeout and mark as set, sigmask/sigmask_sz as needed */
reg->ts.tv_sec = 0;
reg->ts.tv_nsec = 100000;
reg->flags = IORING_REG_WAIT_TS;
where nr_regions >= 1 && nr_regions <= PAGE_SIZE / sizeof(*reg). The above initializes index 0, but 63 other regions can be initialized, if needed. Now, instead of doing:
struct __kernel_timespec timeout = { .tv_nsec = 100000, };
io_uring_submit_and_wait_timeout(ring, &cqe, nr, &timeout, NULL);

to wait for events for each submit_and_wait (or just wait) operation, it can just reference the above region at offset 0 and do:
io_uring_submit_and_wait_reg(ring, &cqe, nr, 0);
to achieve the same goal of waiting 100usec without needing to copy both struct io_uring_getevents_arg (24b) and struct __kernel_timespec (16b) for each invocation. Struct io_uring_reg_wait looks as follows:

struct io_uring_reg_wait {
	struct __kernel_timespec ts;
	__u32 min_wait_usec;
	__u32 flags;
	__u64 sigmask;
	__u32 sigmask_sz;
	__u32 pad[3];
	__u64 pad2[2];
};

embedding the timeout itself in the region, rather than passing it as a pointer as well. Note that the signal mask is still passed as a pointer, both for compatibility reasons, but also because there doesn't seem to be a lot of high frequency wait scenarios that involve setting and resetting the signal mask for each wait.
The application is free to modify any region before a wait call, or it can keep multiple regions with different settings to avoid needing to modify the same one for wait calls. Up to a page size of regions is mapped by default, allowing PAGE_SIZE / 64 available regions for use.
The registered region must fit within a page. On a 4kb page size system, that allows for 64 wait regions if a full page is used, as the size of struct io_uring_reg_wait is 64b. The region registered must be sized in multiples of struct io_uring_reg_wait. It's valid to register fewer than 64 entries.
In network performance testing with zero-copy, this reduced the time spent waiting on the TX side from 3.12% to 0.3% and the RX side from 4.4% to 0.3%.
Wait regions are fixed for the lifetime of the ring - once registered, they are persistent until the ring is torn down. The regions support minimum wait timeout as well as the regular waits.
Signed-off-by: Jens Axboe <[email protected]>
79cfe9e5 | 21-Oct-2024 | Jens Axboe <[email protected]>
io_uring/register: add IORING_REGISTER_RESIZE_RINGS
Once a ring has been created, the size of the CQ and SQ rings are fixed. Usually this isn't a problem on the SQ ring side, as it merely controls the available number of requests that can be submitted in a single system call, and there's rarely a need to change that.
For the CQ ring, it's a different story. For most efficient use of io_uring, it's important that the CQ ring never overflows. This means that applications must size it for the worst case scenario, which can be wasteful.
Add IORING_REGISTER_RESIZE_RINGS, which allows an application to resize the existing rings. It takes a struct io_uring_params argument, the same one which is used to setup the ring initially, and resizes rings according to the sizes given.
Certain properties are always inherited from the original ring setup, like SQE128/CQE32 and other setup options. The implementation only allows flags associated with how the CQ ring is sized and clamped.
Existing unconsumed SQE and CQE entries are copied as part of the process. If either the SQ or CQ resized destination ring cannot hold the entries already present in the source rings, then the operation is failed with -EOVERFLOW. Any register op holds ->uring_lock, which prevents new submissions, and the internal mapping holds the completion lock as well across moving CQ ring state.
To prevent races between mmap and ring resizing, add a mutex that's solely used to serialize ring resize and mmap. mmap_sem can't be used here, as a fork'ed process may be doing mmaps on the ring as well. The ctx->resize_lock is held across mmap operations, and the resize will grab it before swapping out the already mapped new data.
Signed-off-by: Jens Axboe <[email protected]>
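A sketch of growing the CQ ring on a live ring, assuming liburing's io_uring_resize_rings() wrapper for this opcode:

struct io_uring_params p = {
        .sq_entries = 64,
        .cq_entries = 4096,     /* new, larger CQ ring */
};
int ret = io_uring_resize_rings(&ring, &p);
if (ret == -EOVERFLOW)
        /* existing unconsumed entries don't fit the new sizes */;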
Revision tags: v6.12-rc2, v6.12-rc1
a3771321 | 24-Sep-2024 | Jens Axboe <[email protected]>
io_uring/msg_ring: add support for sending a sync message
Normally MSG_RING requires both a source and a destination ring. But some users don't always have a ring available to send a message from, yet they still need to notify a target ring.
Add support for using io_uring_register(2) without having a source ring, using a file descriptor of -1 for that. Internally those are called blind registration opcodes. Implement IORING_REGISTER_SEND_MSG_RING as a blind opcode, which simply takes an sqe that the application can put on the stack and use the normal liburing helpers to initialize it. Then the app can call:
io_uring_register(-1, IORING_REGISTER_SEND_MSG_RING, &sqe, 1);
and get the same behavior in terms of the target, where a CQE is posted with the details given in the sqe.
For now this takes a single sqe pointer argument, and hence arg must be set to that, and nr_args must be 1. Could easily be extended to take an array of sqes, but for now let's keep it simple.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
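A sketch of the full sequence with a stack sqe, using liburing's io_uring_prep_msg_ring() to fill it:

struct io_uring_sqe sqe;

memset(&sqe, 0, sizeof(sqe));
io_uring_prep_msg_ring(&sqe, target_ring_fd, 0, 0xcafe, 0);

/* an fd of -1 selects the "blind" registration path */
syscall(__NR_io_uring_register, -1, IORING_REGISTER_SEND_MSG_RING, &sqe, 1);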
Revision tags: v6.11
636119af | 14-Sep-2024 | Jens Axboe <[email protected]>
io_uring: rename "copy buffers" to "clone buffers"
A recent commit added support for copying registered buffers from one ring to another. But that term is a bit confusing, as no copying of buffer data is done here. What is being done is simply cloning the buffer registrations from one ring to another.
Rename it while we still can, so that it's more descriptive. No functional changes in this patch.
Fixes: 7cc2a6eadcd7 ("io_uring: add IORING_REGISTER_COPY_BUFFERS method")
Signed-off-by: Jens Axboe <[email protected]>
7cc2a6ea | 11-Sep-2024 | Jens Axboe <[email protected]>
io_uring: add IORING_REGISTER_COPY_BUFFERS method
Buffers can get registered with io_uring, which allows skipping the repeated pin_pages and unpin/unref of pages for each O_DIRECT operation. This reduces the overhead of O_DIRECT IO.
However, registering buffers can take some time. Normally this isn't an issue as it's done at initialization time (and hence less critical), but for cases where rings can be created and destroyed as part of an IO thread pool, registering the same buffers for multiple rings becomes a more time sensitive proposition. As an example, let's say an application has an IO memory pool of 500G. Initial registration takes:
Got 500 huge pages (each 1024MB) Registered 500 pages in 409 msec
or about 0.4 seconds. If we go higher to 900 1GB huge pages being registered:
Registered 900 pages in 738 msec
which is, as expected, a fully linear scaling.
Rather than have each ring pin/map/register the same buffer pool, provide an io_uring_register(2) opcode to simply duplicate the buffers that are registered with another ring. Adding the same 900GB of registered buffers to the target ring can then be accomplished in:
Copied 900 pages in 17 usec
While timing differs a bit, this provides around a 25,000-40,000x speedup for this use case.
Signed-off-by: Jens Axboe <[email protected]>
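With current liburing this is a one-liner (the opcode was later renamed to "clone"; see the entry above):

/* duplicate src_ring's registered buffers into dst_ring */
int ret = io_uring_clone_buffers(&dst_ring, &src_ring);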