History log of /linux-6.15/include/linux/pipe_fs_i.h (Results 1 – 25 of 96)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6
# 00a7d398 07-Mar-2025 Linus Torvalds <[email protected]>

fs/pipe: add simpler helpers for common cases

The fix to atomically read the pipe head and tail state when not holding
the pipe mutex has caused a number of headaches due to the size change
of the i

fs/pipe: add simpler helpers for common cases

The fix to atomically read the pipe head and tail state when not holding
the pipe mutex has caused a number of headaches due to the size change
of the involved types.

It turns out that we don't have _that_ many places that access these
fields directly and were affected, but we have more than we strictly
should have, because our low-level helper functions have been designed
to have intimate knowledge of how the pipes work.

And as a result, that random noise of direct 'pipe->head' and
'pipe->tail' accesses makes it harder to pinpoint any actual potential
problem spots remaining.

For example, we didn't have a "is the pipe full" helper function, but
instead had a "given these pipe buffer indexes and this pipe size, is
the pipe full". That's because some low-level pipe code does actually
want that much more complicated interface.

But most other places literally just want a "is the pipe full" helper,
and not having it meant that those places ended up being unnecessarily
much too aware of this all.

It would have been much better if only the very core pipe code that
cared had been the one aware of this all.

So let's fix it - better late than never. This just introduces the
trivial wrappers for "is this pipe full or empty" and to get how many
pipe buffers are used, so that instead of writing

if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))

the places that literally just want to know if a pipe is full can just
say

if (pipe_is_full(pipe))

instead. The existing trivial cases were converted with a 'sed' script.

This cuts down on the places that access pipe->head and pipe->tail
directly outside of the pipe code (and core splice code) quite a lot.

The splice code in particular still revels in doing the direct low-level
accesses, and the fuse fuse_dev_splice_write() code also seems a bit
unnecessarily eager to go very low-level, but it's at least a bit better
than it used to be.

Signed-off-by: Linus Torvalds <[email protected]>

show more ...


# 74d42bdb 06-Mar-2025 Linus Torvalds <[email protected]>

fs/pipe: express 'pipe_empty()' in terms of 'pipe_occupancy()'

That's what 'pipe_full()' does, so it's more consistent. But more
importantly it gets the type limits right when the pipe head and tail

fs/pipe: express 'pipe_empty()' in terms of 'pipe_occupancy()'

That's what 'pipe_full()' does, so it's more consistent. But more
importantly it gets the type limits right when the pipe head and tail
are no longer necessarily 'unsigned int'.

Signed-off-by: Linus Torvalds <[email protected]>

show more ...


# 0d2d0f3d 05-Mar-2025 Linus Torvalds <[email protected]>

fs/pipe: remove buggy and unused 'helper' function

While looking for incorrect users of the pipe head/tail fields (see
commit c27c66afc449: "fs/pipe: Fix pipe_occupancy() with 16-bit
indexes"), I fo

fs/pipe: remove buggy and unused 'helper' function

While looking for incorrect users of the pipe head/tail fields (see
commit c27c66afc449: "fs/pipe: Fix pipe_occupancy() with 16-bit
indexes"), I found a bug in pipe_discard_from() that looked entirely
broken.

However, the fix is trivial: this buggy function isn't actually called
by anything, so let's just remove it ASAP.

Signed-off-by: Linus Torvalds <[email protected]>

show more ...


# cfced12f 05-Mar-2025 K Prateek Nayak <[email protected]>

include/linux/pipe_fs_i: Add htmldoc annotation for "head_tail" member

Add htmldoc annotation for the newly introduced "head_tail" member
describing it to be a union of the pipe_inode_info's @head a

include/linux/pipe_fs_i: Add htmldoc annotation for "head_tail" member

Add htmldoc annotation for the newly introduced "head_tail" member
describing it to be a union of the pipe_inode_info's @head and @tail
members.

Reported-by: Stephen Rothwell <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]/
Fixes: 3d252160b818 ("fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex")
Signed-off-by: K Prateek Nayak <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


# c27c66af 05-Mar-2025 Linus Torvalds <[email protected]>

fs/pipe: Fix pipe_occupancy() with 16-bit indexes

The pipe_occupancy() logic implicitly relied on the natural unsigned
modulo arithmetic in C, but that doesn't work for the new 'pipe_index_t'
case,

fs/pipe: Fix pipe_occupancy() with 16-bit indexes

The pipe_occupancy() logic implicitly relied on the natural unsigned
modulo arithmetic in C, but that doesn't work for the new 'pipe_index_t'
case, since any arithmetic will be done in 'int' (and here we had also
made it 'unsigned int' due to the function call boundary).

So make the modulo arithmetic explicit by casting the result to the
proper type.

Cc: Oleg Nesterov <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Manfred Spraul <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Swapnil Sapkal <[email protected]>
Cc: Alexey Gladkov <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Link: https://lore.kernel.org/all/CAHk-=wjyHsGLx=rxg6PKYBNkPYAejgo7=CbyL3=HGLZLsAaJFQ@mail.gmail.com/
Fixes: 3d252160b818 ("fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex")
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


# 3d252160 04-Mar-2025 Linus Torvalds <[email protected]>

fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex

pipe_readable(), pipe_writable(), and pipe_poll() can read "pipe->head"
and "pipe->tail" outside of "pipe->mutex" critical section. Whe

fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex

pipe_readable(), pipe_writable(), and pipe_poll() can read "pipe->head"
and "pipe->tail" outside of "pipe->mutex" critical section. When the
head and the tail are read individually in that order, there is a window
for interruption between the two reads in which both the head and the
tail can be updated by concurrent readers and writers.

One of the problematic scenarios observed with hackbench running
multiple groups on a large server on a particular pipe inode is as
follows:

pipe->head = 36
pipe->tail = 36

hackbench-118762 [057] ..... 1029.550548: pipe_write: *wakes up: pipe not full*
hackbench-118762 [057] ..... 1029.550548: pipe_write: head: 36 -> 37 [tail: 36]
hackbench-118762 [057] ..... 1029.550548: pipe_write: *wake up next reader 118740*
hackbench-118762 [057] ..... 1029.550548: pipe_write: *wake up next writer 118768*

hackbench-118768 [206] ..... 1029.55055X: pipe_write: *writer wakes up*
hackbench-118768 [206] ..... 1029.55055X: pipe_write: head = READ_ONCE(pipe->head) [37]
... CPU 206 interrupted (exact wakeup was not traced but 118768 did read head at 37 in traces)

hackbench-118740 [057] ..... 1029.550558: pipe_read: *reader wakes up: pipe is not empty*
hackbench-118740 [057] ..... 1029.550558: pipe_read: tail: 36 -> 37 [head = 37]
hackbench-118740 [057] ..... 1029.550559: pipe_read: *pipe is empty; wakeup writer 118768*
hackbench-118740 [057] ..... 1029.550559: pipe_read: *sleeps*

hackbench-118766 [185] ..... 1029.550592: pipe_write: *New writer comes in*
hackbench-118766 [185] ..... 1029.550592: pipe_write: head: 37 -> 38 [tail: 37]
hackbench-118766 [185] ..... 1029.550592: pipe_write: *wakes up reader 118766*

hackbench-118740 [185] ..... 1029.550598: pipe_read: *reader wakes up; pipe not empty*
hackbench-118740 [185] ..... 1029.550599: pipe_read: tail: 37 -> 38 [head: 38]
hackbench-118740 [185] ..... 1029.550599: pipe_read: *pipe is empty*
hackbench-118740 [185] ..... 1029.550599: pipe_read: *reader sleeps; wakeup writer 118768*

... CPU 206 switches back to writer
hackbench-118768 [206] ..... 1029.550601: pipe_write: tail = READ_ONCE(pipe->tail) [38]
hackbench-118768 [206] ..... 1029.550601: pipe_write: pipe_full()? (u32)(37 - 38) >= 16? Yes
hackbench-118768 [206] ..... 1029.550601: pipe_write: *writer goes back to sleep*

[ Tasks 118740 and 118768 can then indefinitely wait on each other. ]

The unsigned arithmetic in pipe_occupancy() wraps around when
"pipe->tail > pipe->head" leading to pipe_full() returning true despite
the pipe being empty.

The case of genuine wraparound of "pipe->head" is handled since pipe
buffer has data allowing readers to make progress until the pipe->tail
wraps too after which the reader will wakeup a sleeping writer, however,
mistaking the pipe to be full when it is in fact empty can lead to
readers and writers waiting on each other indefinitely.

This issue became more problematic and surfaced as a hang in hackbench
after the optimization in commit aaec5a95d596 ("pipe_read: don't wake up
the writer if the pipe is still full") significantly reduced the number
of spurious wakeups of writers that had previously helped mask the
issue.

To avoid missing any updates between the reads of "pipe->head" and
"pipe->write", unionize the two with a single unsigned long
"pipe->head_tail" member that can be loaded atomically.

Using "pipe->head_tail" to read the head and the tail ensures the
lockless checks do not miss any updates to the head or the tail and
since those two are only updated under "pipe->mutex", it ensures that
the head is always ahead of, or equal to the tail resulting in correct
calculations.

[ prateek: commit log, testing on x86 platforms. ]

Reported-and-debugged-by: Swapnil Sapkal <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]/
Reported-by: Alexey Gladkov <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
Tested-by: Swapnil Sapkal <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Tested-by: Alexey Gladkov <[email protected]>
Signed-off-by: K Prateek Nayak <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


# 46af8e24 03-Mar-2025 Mateusz Guzik <[email protected]>

pipe: cache 2 pages instead of 1

User data is kept in a circular buffer backed by pages allocated as
needed. Only having space for one spare is still prone to having to
resort to allocation / freein

pipe: cache 2 pages instead of 1

User data is kept in a circular buffer backed by pages allocated as
needed. Only having space for one spare is still prone to having to
resort to allocation / freeing.

In my testing this decreases page allocs by 60% during a kernel build.

Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3
# b4bd6b4b 21-Sep-2023 Max Kellermann <[email protected]>

fs/pipe: move check to pipe_has_watch_queue()

This declutters the code by reducing the number of #ifdefs and makes
the watch_queue checks simpler. This has no runtime effect; the
machine code is id

fs/pipe: move check to pipe_has_watch_queue()

This declutters the code by reducing the number of #ifdefs and makes
the watch_queue checks simpler. This has no runtime effect; the
machine code is identical.

Signed-off-by: Max Kellermann <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: David Howells <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


# 61105aab 21-Sep-2023 Max Kellermann <[email protected]>

pipe: reduce padding in struct pipe_inode_info

This has no effect on 64 bit because there are 10 32-bit integers
surrounding the two bools, but on 32 bit architectures, this reduces
the struct size

pipe: reduce padding in struct pipe_inode_info

This has no effect on 64 bit because there are 10 32-bit integers
surrounding the two bools, but on 32 bit architectures, this reduces
the struct size by 4 bytes by merging the two bools into one word.

Signed-off-by: Max Kellermann <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


Revision tags: v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7
# 515c5046 01-Feb-2023 Luca Vizzarro <[email protected]>

pipe: Pass argument of pipe_fcntl as int

The interface for fcntl expects the argument passed for the command
F_SETPIPE_SZ to be of type int. The current code wrongly treats it as
a long. In order to

pipe: Pass argument of pipe_fcntl as int

The interface for fcntl expects the argument passed for the command
F_SETPIPE_SZ to be of type int. The current code wrongly treats it as
a long. In order to avoid access to undefined bits, we should explicitly
cast the argument to int.

Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Chuck Lever <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Szabolcs Nagy <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: David Laight <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Luca Vizzarro <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


# 247c8d2f 16-May-2023 Arnd Bergmann <[email protected]>

fs: pipe: reveal missing function protoypes

A couple of functions from fs/pipe.c are used both internally
and for the watch queue code, but the declaration is only
visible when the latter is enabled

fs: pipe: reveal missing function protoypes

A couple of functions from fs/pipe.c are used both internally
and for the watch queue code, but the declaration is only
visible when the latter is enabled:

fs/pipe.c:1254:5: error: no previous prototype for 'pipe_resize_ring'
fs/pipe.c:758:15: error: no previous prototype for 'account_pipe_buffers'
fs/pipe.c:764:6: error: no previous prototype for 'too_many_pipe_buffers_soft'
fs/pipe.c:771:6: error: no previous prototype for 'too_many_pipe_buffers_hard'
fs/pipe.c:777:6: error: no previous prototype for 'pipe_is_unprivileged_user'

Make the visible unconditionally to avoid these warnings.

Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Signed-off-by: Arnd Bergmann <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

show more ...


# 07073eb0 14-Feb-2023 David Howells <[email protected]>

splice: Add a func to do a splice from a buffered file without ITER_PIPE

Provide a function to do splice read from a buffered file, pulling the
folios out of the pagecache directly by calling filema

splice: Add a func to do a splice from a buffered file without ITER_PIPE

Provide a function to do splice read from a buffered file, pulling the
folios out of the pagecache directly by calling filemap_get_pages() to do
any required reading and then pasting the returned folios into the pipe.

A helper function is provided to do the actual folio pasting and will
handle multipage folios by splicing as many of the relevant subpages as
will fit into the pipe.

The code is loosely based on filemap_read() and might belong in
mm/filemap.c with that as it needs to use filemap_get_pages().

Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: Al Viro <[email protected]>
cc: David Hildenbrand <[email protected]>
cc: John Hubbard <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
Signed-off-by: Steve French <[email protected]>

show more ...


Revision tags: v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3
# 12d426ab 15-Jun-2022 Al Viro <[email protected]>

ITER_PIPE: fold data_start() and pipe_space_for_user() together

All their callers are next to each other; all of them
want the total amount of pages and, possibly, the
offset in the partial final bu

ITER_PIPE: fold data_start() and pipe_space_for_user() together

All their callers are next to each other; all of them
want the total amount of pages and, possibly, the
offset in the partial final buffer.

Combine into a new helper (pipe_npages()), fix the
bogosity in pipe_space_for_user(), while we are at it.

Signed-off-by: Al Viro <[email protected]>

show more ...


# c3497fd0 12-Jun-2022 Al Viro <[email protected]>

fix short copy handling in copy_mc_pipe_to_iter()

Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can
result in a short copy. In that case we need to trim the unused
buffers, as wel

fix short copy handling in copy_mc_pipe_to_iter()

Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can
result in a short copy. In that case we need to trim the unused
buffers, as well as the length of partially filled one - it's not
enough to set ->head, ->iov_offset and ->count to reflect how
much had we copied. Not hard to fix, fortunately...

I'd put a helper (pipe_discard_from(pipe, head)) into pipe_fs_i.h,
rather than iov_iter.c - it has nothing to do with iov_iter and
having it will allow us to avoid an ugly kludge in fs/splice.c.
We could put it into lib/iov_iter.c for now and move it later,
but I don't see the point going that way...

Cc: [email protected] # 4.19+
Fixes: ca146f6f091e "lib/iov_iter: Fix pipe handling in _copy_to_iter_mcsafe()"
Reviewed-by: Jeff Layton <[email protected]>
Reviewed-by: Christian Brauner (Microsoft) <[email protected]>
Signed-off-by: Al Viro <[email protected]>

show more ...


Revision tags: v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5
# f485922d 29-Apr-2022 Kuniyuki Iwashima <[email protected]>

pipe: make poll_usage boolean and annotate its access

Patch series "Fix data-races around epoll reported by KCSAN."

This series suppresses a false positive KCSAN's message and fixes a real
data-rac

pipe: make poll_usage boolean and annotate its access

Patch series "Fix data-races around epoll reported by KCSAN."

This series suppresses a false positive KCSAN's message and fixes a real
data-race.


This patch (of 2):

pipe_poll() runs locklessly and assigns 1 to poll_usage. Once poll_usage
is set to 1, it never changes in other places. However, concurrent writes
of a value trigger KCSAN, so let's make KCSAN happy.

BUG: KCSAN: data-race in pipe_poll / pipe_poll

write to 0xffff8880042f6678 of 4 bytes by task 174 on cpu 3:
pipe_poll (fs/pipe.c:656)
ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853)
do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234)
__x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)

write to 0xffff8880042f6678 of 4 bytes by task 177 on cpu 1:
pipe_poll (fs/pipe.c:656)
ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853)
do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234)
__x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 177 Comm: epoll_race Not tainted 5.17.0-58927-gf443e374ae13 #6
Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Cc: Alexander Duyck <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Kuniyuki Iwashima <[email protected]>
Cc: "Soheil Hassas Yeganeh" <[email protected]>
Cc: "Sridhar Samudrala" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

show more ...


Revision tags: v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1
# 1998f193 22-Jan-2022 Luis Chamberlain <[email protected]>

fs: move pipe sysctls to is own file

kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start

fs: move pipe sysctls to is own file

kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

So move the pipe sysctls to its own file.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Luis Chamberlain <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Antti Palosaari <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: "J. Bruce Fields" <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Lukas Middendorf <[email protected]>
Cc: Stephen Kitt <[email protected]>
Cc: Xiaoming Ni <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


Revision tags: v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1, v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5
# 3b844826 05-Aug-2021 Linus Torvalds <[email protected]>

pipe: avoid unnecessary EPOLLET wakeups under normal loads

I had forgotten just how sensitive hackbench is to extra pipe wakeups,
and commit 3a34b13a88ca ("pipe: make pipe writes always wake up
read

pipe: avoid unnecessary EPOLLET wakeups under normal loads

I had forgotten just how sensitive hackbench is to extra pipe wakeups,
and commit 3a34b13a88ca ("pipe: make pipe writes always wake up
readers") ended up causing a quite noticeable regression on larger
machines.

Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's
not clear that this matters in real life all that much, but as Mel
points out, it's used often enough when comparing kernels and so the
performance regression shows up like a sore thumb.

It's easy enough to fix at least for the common cases where pipes are
used purely for data transfer, and you never have any exciting poll
usage at all. So set a special 'poll_usage' flag when there is polling
activity, and make the ugly "EPOLLET has crazy legacy expectations"
semantics explicit to only that case.

I would love to limit it to just the broken EPOLLET case, but the pipe
code can't see the difference between epoll and regular select/poll, so
any non-read/write waiting will trigger the extra wakeup behavior. That
is sufficient for at least the hackbench case.

Apart from making the odd extra wakeup cases more explicitly about
EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe
write, not at the first write chunk. That is actually much saner
semantics (as much as you can call any of the legacy edge-triggered
expectations for EPOLLET "sane") since it means that you know the wakeup
will happen once the write is done, rather than possibly in the middle
of one.

[ For stable people: I'm putting a "Fixes" tag on this, but I leave it
up to you to decide whether you actually want to backport it or not.
It likely has no impact outside of synthetic benchmarks - Linus ]

Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/
Fixes: 3a34b13a88ca ("pipe: make pipe writes always wake up readers")
Reported-by: kernel test robot <[email protected]>
Tested-by: Sandeep Patil <[email protected]>
Tested-by: Mel Gorman <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


Revision tags: v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5, v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6, v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2, v5.10-rc1, v5.9, v5.9-rc8
# 472e5b05 02-Oct-2020 Linus Torvalds <[email protected]>

pipe: remove pipe_wait() and fix wakeup race with splice

The pipe splice code still used the old model of waiting for pipe IO by
using a non-specific "pipe_wait()" that waited for any pipe event to

pipe: remove pipe_wait() and fix wakeup race with splice

The pipe splice code still used the old model of waiting for pipe IO by
using a non-specific "pipe_wait()" that waited for any pipe event to
happen, which depended on all pipe IO being entirely serialized by the
pipe lock. So by checking the state you were waiting for, and then
adding yourself to the wait queue before dropping the lock, you were
guaranteed to see all the wakeups.

Strictly speaking, the actual wakeups were not done under the lock, but
the pipe_wait() model still worked, because since the waiter held the
lock when checking whether it should sleep, it would always see the
current state, and the wakeup was always done after updating the state.

However, commit 0ddad21d3e99 ("pipe: use exclusive waits when reading or
writing") split the single wait-queue into two, and in the process also
made the "wait for event" code wait for _two_ wait queues, and that then
showed a race with the wakers that were not serialized by the pipe lock.

It's only splice that used that "pipe_wait()" model, so the problem
wasn't obvious, but Josef Bacik reports:

"I hit a hang with fstest btrfs/187, which does a btrfs send into
/dev/null. This works by creating a pipe, the write side is given to
the kernel to write into, and the read side is handed to a thread that
splices into a file, in this case /dev/null.

The box that was hung had the write side stuck here [pipe_write] and
the read side stuck here [splice_from_pipe_next -> pipe_wait].

[ more details about pipe_wait() scenario ]

The problem is we're doing the prepare_to_wait, which sets our state
each time, however we can be woken up either with reads or writes. In
the case above we race with the WRITER waking us up, and re-set our
state to INTERRUPTIBLE, and thus never break out of schedule"

Josef had a patch that avoided the issue in pipe_wait() by just making
it set the state only once, but the deeper problem is that pipe_wait()
depends on a level of synchonization by the pipe mutex that it really
shouldn't. And the whole "wait for any pipe state change" model really
isn't very good to begin with.

So rather than trying to work around things in pipe_wait(), remove that
legacy model of "wait for arbitrary pipe event" entirely, and actually
create functions that wait for the pipe actually being readable or
writable, and can do so without depending on the pipe lock serializing
everything.

Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/
Reported-by: Josef Bacik <[email protected]>
Reviewed-and-tested-by: Josef Bacik <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

show more ...


Revision tags: v5.9-rc7, v5.9-rc6, v5.9-rc5, v5.9-rc4, v5.9-rc3, v5.9-rc2, v5.9-rc1, v5.8, v5.8-rc7, v5.8-rc6, v5.8-rc5, v5.8-rc4, v5.8-rc3, v5.8-rc2, v5.8-rc1, v5.7, v5.7-rc7
# c928f642 20-May-2020 Christoph Hellwig <[email protected]>

fs: rename pipe_buf ->steal to ->try_steal

And replace the arcane return value convention with a simple bool
where true means success and false means failure.

[AV: braino fix folded in]

Signed-off

fs: rename pipe_buf ->steal to ->try_steal

And replace the arcane return value convention with a simple bool
where true means success and false means failure.

[AV: braino fix folded in]

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Al Viro <[email protected]>

show more ...


# b8d9e7f2 20-May-2020 Christoph Hellwig <[email protected]>

fs: make the pipe_buf_operations ->confirm operation optional

Just return 0 for success if it is not present.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Al Viro <[email protected].

fs: make the pipe_buf_operations ->confirm operation optional

Just return 0 for success if it is not present.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Al Viro <[email protected]>

show more ...


# 76887c25 20-May-2020 Christoph Hellwig <[email protected]>

fs: make the pipe_buf_operations ->steal operation optional

Just return 1 for failure if it is not present.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Al Viro <[email protected]

fs: make the pipe_buf_operations ->steal operation optional

Just return 1 for failure if it is not present.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Al Viro <[email protected]>

show more ...


# f6dd9755 20-May-2020 Christoph Hellwig <[email protected]>

pipe: merge anon_pipe_buf*_ops

All the op vectors are exactly the same, they are just used to encode
packet or nomerge behavior. There already is a flag for the packet
behavior, so just add a new o

pipe: merge anon_pipe_buf*_ops

All the op vectors are exactly the same, they are just used to encode
packet or nomerge behavior. There already is a flag for the packet
behavior, so just add a new one to allow for merging. Inverting it vs
the previous nomerge special casing actually allows for much nicer code.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Al Viro <[email protected]>

show more ...


Revision tags: v5.7-rc6, v5.7-rc5, v5.7-rc4, v5.7-rc3, v5.7-rc2, v5.7-rc1, v5.6, v5.6-rc7, v5.6-rc6, v5.6-rc5, v5.6-rc4, v5.6-rc3, v5.6-rc2, v5.6-rc1, v5.5, v5.5-rc7
# e7d553d6 14-Jan-2020 David Howells <[email protected]>

pipe: Add notification lossage handling

Add handling for loss of notifications by having read() insert a
loss-notification message after it has read the pipe buffer that was last
in the ring when th

pipe: Add notification lossage handling

Add handling for loss of notifications by having read() insert a
loss-notification message after it has read the pipe buffer that was last
in the ring when the loss occurred.

Lossage can come about either by running out of notification descriptors or
by running out of space in the pipe ring.

Signed-off-by: David Howells <[email protected]>

show more ...


# 8cfba763 14-Jan-2020 David Howells <[email protected]>

pipe: Allow buffers to be marked read-whole-or-error for notifications

Allow a buffer to be marked such that read() must return the entire buffer
in one go or return ENOBUFS. Multiple buffers can b

pipe: Allow buffers to be marked read-whole-or-error for notifications

Allow a buffer to be marked such that read() must return the entire buffer
in one go or return ENOBUFS. Multiple buffers can be amalgamated into a
single read, but a short read will occur if the next "whole" buffer won't
fit.

This is useful for watch queue notifications to make sure we don't split a
notification across multiple reads, especially given that we need to
fabricate an overrun record under some circumstances - and that isn't in
the buffers.

Signed-off-by: David Howells <[email protected]>

show more ...


# c73be61c 14-Jan-2020 David Howells <[email protected]>

pipe: Add general notification queue support

Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. spl

pipe: Add general notification queue support

Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.

The way the notification queue is used is:

(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):

pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);

(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.

Notification messages include a length in the header so that the
caller can split them up.

Each message has a header that describes it:

struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};

The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.

Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.

Signed-off-by: David Howells <[email protected]>

show more ...


1234