|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6 |
|
| #
00a7d398 |
| 07-Mar-2025 |
Linus Torvalds <[email protected]> |
fs/pipe: add simpler helpers for common cases
The fix to atomically read the pipe head and tail state when not holding the pipe mutex has caused a number of headaches due to the size change of the i
fs/pipe: add simpler helpers for common cases
The fix to atomically read the pipe head and tail state when not holding the pipe mutex has caused a number of headaches due to the size change of the involved types.
It turns out that we don't have _that_ many places that access these fields directly and were affected, but we have more than we strictly should have, because our low-level helper functions have been designed to have intimate knowledge of how the pipes work.
And as a result, that random noise of direct 'pipe->head' and 'pipe->tail' accesses makes it harder to pinpoint any actual potential problem spots remaining.
For example, we didn't have a "is the pipe full" helper function, but instead had a "given these pipe buffer indexes and this pipe size, is the pipe full". That's because some low-level pipe code does actually want that much more complicated interface.
But most other places literally just want a "is the pipe full" helper, and not having it meant that those places ended up being unnecessarily much too aware of this all.
It would have been much better if only the very core pipe code that cared had been the one aware of this all.
So let's fix it - better late than never. This just introduces the trivial wrappers for "is this pipe full or empty" and to get how many pipe buffers are used, so that instead of writing
if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
the places that literally just want to know if a pipe is full can just say
if (pipe_is_full(pipe))
instead. The existing trivial cases were converted with a 'sed' script.
This cuts down on the places that access pipe->head and pipe->tail directly outside of the pipe code (and core splice code) quite a lot.
The splice code in particular still revels in doing the direct low-level accesses, and the fuse fuse_dev_splice_write() code also seems a bit unnecessarily eager to go very low-level, but it's at least a bit better than it used to be.
Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
| #
74d42bdb |
| 06-Mar-2025 |
Linus Torvalds <[email protected]> |
fs/pipe: express 'pipe_empty()' in terms of 'pipe_occupancy()'
That's what 'pipe_full()' does, so it's more consistent. But more importantly it gets the type limits right when the pipe head and tail
fs/pipe: express 'pipe_empty()' in terms of 'pipe_occupancy()'
That's what 'pipe_full()' does, so it's more consistent. But more importantly it gets the type limits right when the pipe head and tail are no longer necessarily 'unsigned int'.
Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
| #
0d2d0f3d |
| 05-Mar-2025 |
Linus Torvalds <[email protected]> |
fs/pipe: remove buggy and unused 'helper' function
While looking for incorrect users of the pipe head/tail fields (see commit c27c66afc449: "fs/pipe: Fix pipe_occupancy() with 16-bit indexes"), I fo
fs/pipe: remove buggy and unused 'helper' function
While looking for incorrect users of the pipe head/tail fields (see commit c27c66afc449: "fs/pipe: Fix pipe_occupancy() with 16-bit indexes"), I found a bug in pipe_discard_from() that looked entirely broken.
However, the fix is trivial: this buggy function isn't actually called by anything, so let's just remove it ASAP.
Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
| #
cfced12f |
| 05-Mar-2025 |
K Prateek Nayak <[email protected]> |
include/linux/pipe_fs_i: Add htmldoc annotation for "head_tail" member
Add htmldoc annotation for the newly introduced "head_tail" member describing it to be a union of the pipe_inode_info's @head a
include/linux/pipe_fs_i: Add htmldoc annotation for "head_tail" member
Add htmldoc annotation for the newly introduced "head_tail" member describing it to be a union of the pipe_inode_info's @head and @tail members.
Reported-by: Stephen Rothwell <[email protected]> Closes: https://lore.kernel.org/lkml/[email protected]/ Fixes: 3d252160b818 ("fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex") Signed-off-by: K Prateek Nayak <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
| #
c27c66af |
| 05-Mar-2025 |
Linus Torvalds <[email protected]> |
fs/pipe: Fix pipe_occupancy() with 16-bit indexes
The pipe_occupancy() logic implicitly relied on the natural unsigned modulo arithmetic in C, but that doesn't work for the new 'pipe_index_t' case,
fs/pipe: Fix pipe_occupancy() with 16-bit indexes
The pipe_occupancy() logic implicitly relied on the natural unsigned modulo arithmetic in C, but that doesn't work for the new 'pipe_index_t' case, since any arithmetic will be done in 'int' (and here we had also made it 'unsigned int' due to the function call boundary).
So make the modulo arithmetic explicit by casting the result to the proper type.
Cc: Oleg Nesterov <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Manfred Spraul <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Swapnil Sapkal <[email protected]> Cc: Alexey Gladkov <[email protected]> Cc: K Prateek Nayak <[email protected]> Link: https://lore.kernel.org/all/CAHk-=wjyHsGLx=rxg6PKYBNkPYAejgo7=CbyL3=HGLZLsAaJFQ@mail.gmail.com/ Fixes: 3d252160b818 ("fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex") Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
| #
3d252160 |
| 04-Mar-2025 |
Linus Torvalds <[email protected]> |
fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex
pipe_readable(), pipe_writable(), and pipe_poll() can read "pipe->head" and "pipe->tail" outside of "pipe->mutex" critical section. Whe
fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex
pipe_readable(), pipe_writable(), and pipe_poll() can read "pipe->head" and "pipe->tail" outside of "pipe->mutex" critical section. When the head and the tail are read individually in that order, there is a window for interruption between the two reads in which both the head and the tail can be updated by concurrent readers and writers.
One of the problematic scenarios observed with hackbench running multiple groups on a large server on a particular pipe inode is as follows:
pipe->head = 36 pipe->tail = 36
hackbench-118762 [057] ..... 1029.550548: pipe_write: *wakes up: pipe not full* hackbench-118762 [057] ..... 1029.550548: pipe_write: head: 36 -> 37 [tail: 36] hackbench-118762 [057] ..... 1029.550548: pipe_write: *wake up next reader 118740* hackbench-118762 [057] ..... 1029.550548: pipe_write: *wake up next writer 118768*
hackbench-118768 [206] ..... 1029.55055X: pipe_write: *writer wakes up* hackbench-118768 [206] ..... 1029.55055X: pipe_write: head = READ_ONCE(pipe->head) [37] ... CPU 206 interrupted (exact wakeup was not traced but 118768 did read head at 37 in traces)
hackbench-118740 [057] ..... 1029.550558: pipe_read: *reader wakes up: pipe is not empty* hackbench-118740 [057] ..... 1029.550558: pipe_read: tail: 36 -> 37 [head = 37] hackbench-118740 [057] ..... 1029.550559: pipe_read: *pipe is empty; wakeup writer 118768* hackbench-118740 [057] ..... 1029.550559: pipe_read: *sleeps*
hackbench-118766 [185] ..... 1029.550592: pipe_write: *New writer comes in* hackbench-118766 [185] ..... 1029.550592: pipe_write: head: 37 -> 38 [tail: 37] hackbench-118766 [185] ..... 1029.550592: pipe_write: *wakes up reader 118766*
hackbench-118740 [185] ..... 1029.550598: pipe_read: *reader wakes up; pipe not empty* hackbench-118740 [185] ..... 1029.550599: pipe_read: tail: 37 -> 38 [head: 38] hackbench-118740 [185] ..... 1029.550599: pipe_read: *pipe is empty* hackbench-118740 [185] ..... 1029.550599: pipe_read: *reader sleeps; wakeup writer 118768*
... CPU 206 switches back to writer hackbench-118768 [206] ..... 1029.550601: pipe_write: tail = READ_ONCE(pipe->tail) [38] hackbench-118768 [206] ..... 1029.550601: pipe_write: pipe_full()? (u32)(37 - 38) >= 16? Yes hackbench-118768 [206] ..... 1029.550601: pipe_write: *writer goes back to sleep*
[ Tasks 118740 and 118768 can then indefinitely wait on each other. ]
The unsigned arithmetic in pipe_occupancy() wraps around when "pipe->tail > pipe->head" leading to pipe_full() returning true despite the pipe being empty.
The case of genuine wraparound of "pipe->head" is handled since pipe buffer has data allowing readers to make progress until the pipe->tail wraps too after which the reader will wakeup a sleeping writer, however, mistaking the pipe to be full when it is in fact empty can lead to readers and writers waiting on each other indefinitely.
This issue became more problematic and surfaced as a hang in hackbench after the optimization in commit aaec5a95d596 ("pipe_read: don't wake up the writer if the pipe is still full") significantly reduced the number of spurious wakeups of writers that had previously helped mask the issue.
To avoid missing any updates between the reads of "pipe->head" and "pipe->write", unionize the two with a single unsigned long "pipe->head_tail" member that can be loaded atomically.
Using "pipe->head_tail" to read the head and the tail ensures the lockless checks do not miss any updates to the head or the tail and since those two are only updated under "pipe->mutex", it ensures that the head is always ahead of, or equal to the tail resulting in correct calculations.
[ prateek: commit log, testing on x86 platforms. ]
Reported-and-debugged-by: Swapnil Sapkal <[email protected]> Closes: https://lore.kernel.org/lkml/[email protected]/ Reported-by: Alexey Gladkov <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length") Tested-by: Swapnil Sapkal <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Tested-by: Alexey Gladkov <[email protected]> Signed-off-by: K Prateek Nayak <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
| #
46af8e24 |
| 03-Mar-2025 |
Mateusz Guzik <[email protected]> |
pipe: cache 2 pages instead of 1
User data is kept in a circular buffer backed by pages allocated as needed. Only having space for one spare is still prone to having to resort to allocation / freein
pipe: cache 2 pages instead of 1
User data is kept in a circular buffer backed by pages allocated as needed. Only having space for one spare is still prone to having to resort to allocation / freeing.
In my testing this decreases page allocs by 60% during a kernel build.
Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3 |
|
| #
b4bd6b4b |
| 21-Sep-2023 |
Max Kellermann <[email protected]> |
fs/pipe: move check to pipe_has_watch_queue()
This declutters the code by reducing the number of #ifdefs and makes the watch_queue checks simpler. This has no runtime effect; the machine code is id
fs/pipe: move check to pipe_has_watch_queue()
This declutters the code by reducing the number of #ifdefs and makes the watch_queue checks simpler. This has no runtime effect; the machine code is identical.
Signed-off-by: Max Kellermann <[email protected]> Message-Id: <[email protected]> Reviewed-by: David Howells <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
61105aab |
| 21-Sep-2023 |
Max Kellermann <[email protected]> |
pipe: reduce padding in struct pipe_inode_info
This has no effect on 64 bit because there are 10 32-bit integers surrounding the two bools, but on 32 bit architectures, this reduces the struct size
pipe: reduce padding in struct pipe_inode_info
This has no effect on 64 bit because there are 10 32-bit integers surrounding the two bools, but on 32 bit architectures, this reduces the struct size by 4 bytes by merging the two bools into one word.
Signed-off-by: Max Kellermann <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7 |
|
| #
515c5046 |
| 01-Feb-2023 |
Luca Vizzarro <[email protected]> |
pipe: Pass argument of pipe_fcntl as int
The interface for fcntl expects the argument passed for the command F_SETPIPE_SZ to be of type int. The current code wrongly treats it as a long. In order to
pipe: Pass argument of pipe_fcntl as int
The interface for fcntl expects the argument passed for the command F_SETPIPE_SZ to be of type int. The current code wrongly treats it as a long. In order to avoid access to undefined bits, we should explicitly cast the argument to int.
Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Kevin Brodsky <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: Szabolcs Nagy <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: David Laight <[email protected]> Cc: Mark Rutland <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Luca Vizzarro <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
247c8d2f |
| 16-May-2023 |
Arnd Bergmann <[email protected]> |
fs: pipe: reveal missing function protoypes
A couple of functions from fs/pipe.c are used both internally and for the watch queue code, but the declaration is only visible when the latter is enabled
fs: pipe: reveal missing function protoypes
A couple of functions from fs/pipe.c are used both internally and for the watch queue code, but the declaration is only visible when the latter is enabled:
fs/pipe.c:1254:5: error: no previous prototype for 'pipe_resize_ring' fs/pipe.c:758:15: error: no previous prototype for 'account_pipe_buffers' fs/pipe.c:764:6: error: no previous prototype for 'too_many_pipe_buffers_soft' fs/pipe.c:771:6: error: no previous prototype for 'too_many_pipe_buffers_hard' fs/pipe.c:777:6: error: no previous prototype for 'pipe_is_unprivileged_user'
Make the visible unconditionally to avoid these warnings.
Fixes: c73be61cede5 ("pipe: Add general notification queue support") Signed-off-by: Arnd Bergmann <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
show more ...
|
| #
07073eb0 |
| 14-Feb-2023 |
David Howells <[email protected]> |
splice: Add a func to do a splice from a buffered file without ITER_PIPE
Provide a function to do splice read from a buffered file, pulling the folios out of the pagecache directly by calling filema
splice: Add a func to do a splice from a buffered file without ITER_PIPE
Provide a function to do splice read from a buffered file, pulling the folios out of the pagecache directly by calling filemap_get_pages() to do any required reading and then pasting the returned folios into the pipe.
A helper function is provided to do the actual folio pasting and will handle multipage folios by splicing as many of the relevant subpages as will fit into the pipe.
The code is loosely based on filemap_read() and might belong in mm/filemap.c with that as it needs to use filemap_get_pages().
Signed-off-by: David Howells <[email protected]> Reviewed-by: Jens Axboe <[email protected]> cc: Christoph Hellwig <[email protected]> cc: Al Viro <[email protected]> cc: David Hildenbrand <[email protected]> cc: John Hubbard <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected] Signed-off-by: Steve French <[email protected]>
show more ...
|
|
Revision tags: v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3 |
|
| #
12d426ab |
| 15-Jun-2022 |
Al Viro <[email protected]> |
ITER_PIPE: fold data_start() and pipe_space_for_user() together
All their callers are next to each other; all of them want the total amount of pages and, possibly, the offset in the partial final bu
ITER_PIPE: fold data_start() and pipe_space_for_user() together
All their callers are next to each other; all of them want the total amount of pages and, possibly, the offset in the partial final buffer.
Combine into a new helper (pipe_npages()), fix the bogosity in pipe_space_for_user(), while we are at it.
Signed-off-by: Al Viro <[email protected]>
show more ...
|
| #
c3497fd0 |
| 12-Jun-2022 |
Al Viro <[email protected]> |
fix short copy handling in copy_mc_pipe_to_iter()
Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can result in a short copy. In that case we need to trim the unused buffers, as wel
fix short copy handling in copy_mc_pipe_to_iter()
Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can result in a short copy. In that case we need to trim the unused buffers, as well as the length of partially filled one - it's not enough to set ->head, ->iov_offset and ->count to reflect how much had we copied. Not hard to fix, fortunately...
I'd put a helper (pipe_discard_from(pipe, head)) into pipe_fs_i.h, rather than iov_iter.c - it has nothing to do with iov_iter and having it will allow us to avoid an ugly kludge in fs/splice.c. We could put it into lib/iov_iter.c for now and move it later, but I don't see the point going that way...
Cc: [email protected] # 4.19+ Fixes: ca146f6f091e "lib/iov_iter: Fix pipe handling in _copy_to_iter_mcsafe()" Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Christian Brauner (Microsoft) <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
|
Revision tags: v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5 |
|
| #
f485922d |
| 29-Apr-2022 |
Kuniyuki Iwashima <[email protected]> |
pipe: make poll_usage boolean and annotate its access
Patch series "Fix data-races around epoll reported by KCSAN."
This series suppresses a false positive KCSAN's message and fixes a real data-rac
pipe: make poll_usage boolean and annotate its access
Patch series "Fix data-races around epoll reported by KCSAN."
This series suppresses a false positive KCSAN's message and fixes a real data-race.
This patch (of 2):
pipe_poll() runs locklessly and assigns 1 to poll_usage. Once poll_usage is set to 1, it never changes in other places. However, concurrent writes of a value trigger KCSAN, so let's make KCSAN happy.
BUG: KCSAN: data-race in pipe_poll / pipe_poll
write to 0xffff8880042f6678 of 4 bytes by task 174 on cpu 3: pipe_poll (fs/pipe.c:656) ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853) do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234) __x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
write to 0xffff8880042f6678 of 4 bytes by task 177 on cpu 1: pipe_poll (fs/pipe.c:656) ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853) do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234) __x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 177 Comm: epoll_race Not tainted 5.17.0-58927-gf443e374ae13 #6 Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads") Signed-off-by: Kuniyuki Iwashima <[email protected]> Cc: Alexander Duyck <[email protected]> Cc: Al Viro <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Kuniyuki Iwashima <[email protected]> Cc: "Soheil Hassas Yeganeh" <[email protected]> Cc: "Sridhar Samudrala" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
show more ...
|
|
Revision tags: v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1 |
|
| #
1998f193 |
| 22-Jan-2022 |
Luis Chamberlain <[email protected]> |
fs: move pipe sysctls to is own file
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty dishes, this makes it very difficult to maintain.
To help with this maintenance let's start
fs: move pipe sysctls to is own file
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places where they actually belong. The proc sysctl maintainers do not want to know what sysctl knobs you wish to add for your own piece of code, we just care about the core logic.
So move the pipe sysctls to its own file.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Luis Chamberlain <[email protected]> Cc: Al Viro <[email protected]> Cc: Andy Shevchenko <[email protected]> Cc: Antti Palosaari <[email protected]> Cc: Eric Biederman <[email protected]> Cc: Iurii Zaikin <[email protected]> Cc: "J. Bruce Fields" <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lukas Middendorf <[email protected]> Cc: Stephen Kitt <[email protected]> Cc: Xiaoming Ni <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1, v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5 |
|
| #
3b844826 |
| 05-Aug-2021 |
Linus Torvalds <[email protected]> |
pipe: avoid unnecessary EPOLLET wakeups under normal loads
I had forgotten just how sensitive hackbench is to extra pipe wakeups, and commit 3a34b13a88ca ("pipe: make pipe writes always wake up read
pipe: avoid unnecessary EPOLLET wakeups under normal loads
I had forgotten just how sensitive hackbench is to extra pipe wakeups, and commit 3a34b13a88ca ("pipe: make pipe writes always wake up readers") ended up causing a quite noticeable regression on larger machines.
Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's not clear that this matters in real life all that much, but as Mel points out, it's used often enough when comparing kernels and so the performance regression shows up like a sore thumb.
It's easy enough to fix at least for the common cases where pipes are used purely for data transfer, and you never have any exciting poll usage at all. So set a special 'poll_usage' flag when there is polling activity, and make the ugly "EPOLLET has crazy legacy expectations" semantics explicit to only that case.
I would love to limit it to just the broken EPOLLET case, but the pipe code can't see the difference between epoll and regular select/poll, so any non-read/write waiting will trigger the extra wakeup behavior. That is sufficient for at least the hackbench case.
Apart from making the odd extra wakeup cases more explicitly about EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe write, not at the first write chunk. That is actually much saner semantics (as much as you can call any of the legacy edge-triggered expectations for EPOLLET "sane") since it means that you know the wakeup will happen once the write is done, rather than possibly in the middle of one.
[ For stable people: I'm putting a "Fixes" tag on this, but I leave it up to you to decide whether you actually want to backport it or not. It likely has no impact outside of synthetic benchmarks - Linus ]
Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/ Fixes: 3a34b13a88ca ("pipe: make pipe writes always wake up readers") Reported-by: kernel test robot <[email protected]> Tested-by: Sandeep Patil <[email protected]> Tested-by: Mel Gorman <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5, v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6, v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2, v5.10-rc1, v5.9, v5.9-rc8 |
|
| #
472e5b05 |
| 02-Oct-2020 |
Linus Torvalds <[email protected]> |
pipe: remove pipe_wait() and fix wakeup race with splice
The pipe splice code still used the old model of waiting for pipe IO by using a non-specific "pipe_wait()" that waited for any pipe event to
pipe: remove pipe_wait() and fix wakeup race with splice
The pipe splice code still used the old model of waiting for pipe IO by using a non-specific "pipe_wait()" that waited for any pipe event to happen, which depended on all pipe IO being entirely serialized by the pipe lock. So by checking the state you were waiting for, and then adding yourself to the wait queue before dropping the lock, you were guaranteed to see all the wakeups.
Strictly speaking, the actual wakeups were not done under the lock, but the pipe_wait() model still worked, because since the waiter held the lock when checking whether it should sleep, it would always see the current state, and the wakeup was always done after updating the state.
However, commit 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing") split the single wait-queue into two, and in the process also made the "wait for event" code wait for _two_ wait queues, and that then showed a race with the wakers that were not serialized by the pipe lock.
It's only splice that used that "pipe_wait()" model, so the problem wasn't obvious, but Josef Bacik reports:
"I hit a hang with fstest btrfs/187, which does a btrfs send into /dev/null. This works by creating a pipe, the write side is given to the kernel to write into, and the read side is handed to a thread that splices into a file, in this case /dev/null.
The box that was hung had the write side stuck here [pipe_write] and the read side stuck here [splice_from_pipe_next -> pipe_wait].
[ more details about pipe_wait() scenario ]
The problem is we're doing the prepare_to_wait, which sets our state each time, however we can be woken up either with reads or writes. In the case above we race with the WRITER waking us up, and re-set our state to INTERRUPTIBLE, and thus never break out of schedule"
Josef had a patch that avoided the issue in pipe_wait() by just making it set the state only once, but the deeper problem is that pipe_wait() depends on a level of synchonization by the pipe mutex that it really shouldn't. And the whole "wait for any pipe state change" model really isn't very good to begin with.
So rather than trying to work around things in pipe_wait(), remove that legacy model of "wait for arbitrary pipe event" entirely, and actually create functions that wait for the pipe actually being readable or writable, and can do so without depending on the pipe lock serializing everything.
Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing") Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/ Reported-by: Josef Bacik <[email protected]> Reviewed-and-tested-by: Josef Bacik <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
show more ...
|
|
Revision tags: v5.9-rc7, v5.9-rc6, v5.9-rc5, v5.9-rc4, v5.9-rc3, v5.9-rc2, v5.9-rc1, v5.8, v5.8-rc7, v5.8-rc6, v5.8-rc5, v5.8-rc4, v5.8-rc3, v5.8-rc2, v5.8-rc1, v5.7, v5.7-rc7 |
|
| #
c928f642 |
| 20-May-2020 |
Christoph Hellwig <[email protected]> |
fs: rename pipe_buf ->steal to ->try_steal
And replace the arcane return value convention with a simple bool where true means success and false means failure.
[AV: braino fix folded in]
Signed-off
fs: rename pipe_buf ->steal to ->try_steal
And replace the arcane return value convention with a simple bool where true means success and false means failure.
[AV: braino fix folded in]
Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
| #
b8d9e7f2 |
| 20-May-2020 |
Christoph Hellwig <[email protected]> |
fs: make the pipe_buf_operations ->confirm operation optional
Just return 0 for success if it is not present.
Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Al Viro <[email protected].
fs: make the pipe_buf_operations ->confirm operation optional
Just return 0 for success if it is not present.
Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
| #
76887c25 |
| 20-May-2020 |
Christoph Hellwig <[email protected]> |
fs: make the pipe_buf_operations ->steal operation optional
Just return 1 for failure if it is not present.
Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Al Viro <[email protected]
fs: make the pipe_buf_operations ->steal operation optional
Just return 1 for failure if it is not present.
Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
| #
f6dd9755 |
| 20-May-2020 |
Christoph Hellwig <[email protected]> |
pipe: merge anon_pipe_buf*_ops
All the op vectors are exactly the same, they are just used to encode packet or nomerge behavior. There already is a flag for the packet behavior, so just add a new o
pipe: merge anon_pipe_buf*_ops
All the op vectors are exactly the same, they are just used to encode packet or nomerge behavior. There already is a flag for the packet behavior, so just add a new one to allow for merging. Inverting it vs the previous nomerge special casing actually allows for much nicer code.
Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Al Viro <[email protected]>
show more ...
|
|
Revision tags: v5.7-rc6, v5.7-rc5, v5.7-rc4, v5.7-rc3, v5.7-rc2, v5.7-rc1, v5.6, v5.6-rc7, v5.6-rc6, v5.6-rc5, v5.6-rc4, v5.6-rc3, v5.6-rc2, v5.6-rc1, v5.5, v5.5-rc7 |
|
| #
e7d553d6 |
| 14-Jan-2020 |
David Howells <[email protected]> |
pipe: Add notification lossage handling
Add handling for loss of notifications by having read() insert a loss-notification message after it has read the pipe buffer that was last in the ring when th
pipe: Add notification lossage handling
Add handling for loss of notifications by having read() insert a loss-notification message after it has read the pipe buffer that was last in the ring when the loss occurred.
Lossage can come about either by running out of notification descriptors or by running out of space in the pipe ring.
Signed-off-by: David Howells <[email protected]>
show more ...
|
| #
8cfba763 |
| 14-Jan-2020 |
David Howells <[email protected]> |
pipe: Allow buffers to be marked read-whole-or-error for notifications
Allow a buffer to be marked such that read() must return the entire buffer in one go or return ENOBUFS. Multiple buffers can b
pipe: Allow buffers to be marked read-whole-or-error for notifications
Allow a buffer to be marked such that read() must return the entire buffer in one go or return ENOBUFS. Multiple buffers can be amalgamated into a single read, but a short read will occur if the next "whole" buffer won't fit.
This is useful for watch queue notifications to make sure we don't split a notification across multiple reads, especially given that we need to fabricate an overrun record under some circumstances - and that isn't in the buffers.
Signed-off-by: David Howells <[email protected]>
show more ...
|
| #
c73be61c |
| 14-Jan-2020 |
David Howells <[email protected]> |
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a standard pipe. Notifications are 'spliced' into the pipe and then read out. spl
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a standard pipe. Notifications are 'spliced' into the pipe and then read out. splice(), vmsplice() and sendfile() are forbidden on pipes used for notifications as post_one_notification() cannot take pipe->mutex. This means that notifications could be posted in between individual pipe buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the number of messages it wishes to be able to queue at once (this can only be set once):
pipe2(fds, O_NOTIFICATION_PIPE); ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data from the pipe. read() will return multiple notifications if the buffer is big enough, but it will not split a notification across buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the caller can split them up.
Each message has a header that describes it:
struct watch_notification { __u32 type:24; __u32 subtype:8; __u32 info; };
The type indicates the source (eg. mount tree changes, superblock events, keyring changes, block layer events) and the subtype indicates the event type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field indicates a number of things, including the entry length, an ID assigned to a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be attached in additional slots. The maximum message size is 127 bytes. Messages may not be padded or aligned, so there is no guarantee, for example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <[email protected]>
show more ...
|