History log of /linux-6.15/mm/filemap.c (Results 1 – 25 of 941)
Revision | Date | Author | Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1
# 8ab1b160 03-Apr-2025 Vishal Moola (Oracle) <[email protected]>

mm: fix filemap_get_folios_contig returning batches of identical folios

filemap_get_folios_contig() is supposed to return distinct folios found
within [start, end]. Large folios in the Xarray become multi-index
entries. xas_next() can iterate through the sub-indexes before finding a
sibling entry and breaking out of the loop.

This can result in a returned folio_batch containing an indeterminate
number of duplicate folios, which forces the callers to skeptically handle
the returned batch. This is inefficient and incurs a large maintenance
overhead.

We can fix this by calling xas_advance() after we have successfully added
a folio to the batch to ensure the XArray is positioned such that it will
correctly find the next folio - similar to filemap_get_read_batch().
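As a rough illustration, a hedged sketch of the resulting loop shape (not
the verbatim patch; reference counting and sibling checks elided):

	for (folio = xas_load(&xas); folio && xas.xa_index <= end;
	     folio = xas_next(&xas)) {
		if (xas_retry(&xas, folio))
			continue;
		if (!folio_batch_add(fbatch, folio))
			break;
		/* position the cursor on the folio's last slot so the
		 * next xas_next() skips the remaining sub-indexes */
		xas_advance(&xas, folio_next_index(folio) - 1);
	}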

Link: https://lkml.kernel.org/r/Z-8s1-kiIDkzgRbc@fedora
Fixes: 35b471467f88 ("filemap: add filemap_get_folios_contig()")
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Reported-by: Qu Wenruo <[email protected]>
Closes: https://lkml.kernel.org/r/[email protected]
Tested-by: Qu Wenruo <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Vivek Kasireddy <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v6.14, v6.14-rc7
# 200a89c1 14-Mar-2025 Zi Yan <[email protected]>

mm/filemap: use xas_try_split() in __filemap_add_folio()

Patch series "Minimize xa_node allocation during xarray split", v3.

When splitting a multi-index entry in XArray from order-n to order-m,
the existing xas_split_alloc()+xas_split() approach requires 2^(n %
XA_CHUNK_SHIFT) xa_node allocations. But its callers,
__filemap_add_folio() and shmem_split_large_entry(), use at most 1
xa_node. To minimize xa_node allocation and remove the limitation of no
split from order-12 (or above) to order-0 (or anything between 0 and
5)[1], xas_try_split() was added[2], which allocates (n / XA_CHUNK_SHIFT -
m / XA_CHUNK_SHIFT) xa_nodes. It is used for non-uniform folio split, but
can also be used by __filemap_add_folio() and shmem_split_large_entry().

xas_split_alloc() and xas_split() split an order-9 to order-0:

         ---------------------------------
         |   |   |   |   |   |   |   |   |
         | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
         |   |   |   |   |   |   |   |   |
         ---------------------------------
           |   |                   |   |
      ------   ---             ---   ------
      |          |     ...     |          |
      V          V             V          V
 ----------- -----------   ----------- -----------
 | xa_node | | xa_node |   | xa_node | | xa_node |
 ----------- -----------   ----------- -----------

xas_try_split() splits an order-9 to order-0:

         ---------------------------------
         |   |   |   |   |   |   |   |   |
         | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
         |   |   |   |   |   |   |   |   |
         ---------------------------------
                         |
                         |
                         V
                    -----------
                    | xa_node |
                    -----------

xas_try_split() is designed to be called iteratively with n = m + 1.
xas_try_split_min_order() is added to minimize the number of calls to
xas_try_split() by telling the caller the next minimal order to split to
instead of n - 1. Splitting order-n to order-m when m = l * XA_CHUNK_SHIFT
does not require xa_node allocation and requires 1 xa_node when n = l *
XA_CHUNK_SHIFT and m = n - 1, so it is OK to use xas_try_split() with n >
m + 1 when no new xa_node is needed.

xfstests quick group test passed on xfs and tmpfs.

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/


This patch (of 2):

During __filemap_add_folio(), a shadow entry covering n slots may be
present when a folio covering m slots, with m < n, is to be added.
Instead of splitting all n slots, only the m slots covered by the folio
need to be split and the remaining n-m shadow entries can be retained
with orders ranging from m to n-1. This method only requires

(n/XA_CHUNK_SHIFT) - (m/XA_CHUNK_SHIFT)

new xa_nodes instead of

2^(n % XA_CHUNK_SHIFT) * ((n/XA_CHUNK_SHIFT) - (m/XA_CHUNK_SHIFT))

new xa_nodes, compared to the original xas_split_alloc() + xas_split()
one. For example, to insert an order-0 folio when an order-9 shadow entry
is present (assuming XA_CHUNK_SHIFT is 6), 1 xa_node is needed instead of
8.

xas_try_split_min_order() is introduced to reduce the number of calls to
xas_try_split() during split.
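A hedged sketch of the assumed loop shape in __filemap_add_folio()
(simplified; error handling, locking, and the shmem case elided):

	/* split the shadow entry down only as far as the folio needs */
	unsigned int order = xas_get_order(&xas);

	while (order > folio_order(folio)) {
		xas_set_order(&xas, index, order);
		xas_try_split(&xas, old_entry, order);
		if (xas_error(&xas))
			break;
		/* jump to the next minimal target order instead of n - 1 */
		order = xas_try_split_min_order(order);
	}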

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3
# b23ceebd 13-Feb-2025 Guanjun <[email protected]>

filemap: remove redundant folio_test_large check in filemap_free_folio

The folio_test_large() check in filemap_free_folio() is unnecessary
because folio_nr_pages(), which is called internally, already performs
this check. Removing the redundant condition simplifies the code and
avoids double validation.

This change improves code readability and reduces unnecessary operations
in the folio freeing path.
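A hedged before/after sketch of the simplification (variable names
illustrative):

	/* before */
	long refs = 1;
	if (folio_test_large(folio))
		refs = folio_nr_pages(folio);

	/* after: folio_nr_pages() already returns 1 for small folios */
	long refs = folio_nr_pages(folio);

	folio_put_refs(folio, refs);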

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Guanjun <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# 182db972 24-Feb-2025 Raphael S. Carvalho <[email protected]>

mm: fix error handling in __filemap_get_folio() with FGP_NOWAIT

original report:
https://lore.kernel.org/all/CAKhLTr1UL3ePTpYjXOx2AJfNk8Ku2EdcEfu+CH1sf3Asr=B-Dw@mail.gmail.com/T/

When doing buffered writes with FGP_NOWAIT, under memory pressure, the
system returned ENOMEM despite there being plenty of available memory, to
be reclaimed from page cache. The user space used io_uring interface,
which in turn submits I/O with FGP_NOWAIT (the fast path).

retsnoop pointed to iomap_get_folio:

00:34:16.180612 -> 00:34:16.180651 TID/PID 253786/253721
(reactor-1/combined_tests):

entry_SYSCALL_64_after_hwframe+0x76
do_syscall_64+0x82
__do_sys_io_uring_enter+0x265
io_submit_sqes+0x209
io_issue_sqe+0x5b
io_write+0xdd
xfs_file_buffered_write+0x84
iomap_file_buffered_write+0x1a6
32us [-ENOMEM] iomap_write_begin+0x408
iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x…
pos=0 len=4096 foliop=0xffffb32c296b7b80
! 4us [-ENOMEM] iomap_get_folio
iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x…
pos=0 len=4096

This is likely a regression caused by 66dabbb65d67 ("mm: return an ERR_PTR
from __filemap_get_folio"), which moved error handling from
iomap_get_folio() to __filemap_get_folio(), but broke FGP_NOWAIT
handling, so ENOMEM escapes to user space. Had it correctly returned
-EAGAIN with NOWAIT, either io_uring or user space itself would be able
to retry the request.

It's not enough to patch io_uring since the iomap interface is the one
responsible for it, and pwritev2(RWF_NOWAIT) and AIO interfaces must
return the proper error too.
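A hedged sketch of the assumed shape of the fix in __filemap_get_folio()'s
allocation path:

	folio = filemap_alloc_folio(gfp, order);
	if (!folio) {
		/* NOWAIT callers must see a retryable error, not ENOMEM */
		if (fgp_flags & FGP_NOWAIT)
			return ERR_PTR(-EAGAIN);
		return ERR_PTR(-ENOMEM);
	}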

The patch was tested with scylladb test suite (its original reproducer),
and the tests all pass now when memory is pressured.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 66dabbb65d67 ("mm: return an ERR_PTR from __filemap_get_folio")
Signed-off-by: Raphael S. Carvalho <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# 665575cf 28-Feb-2025 Dave Hansen <[email protected]>

filemap: move prefaulting out of hot write path

There is a generic anti-pattern that shows up in the VFS and several
filesystems where the hot write paths touch userspace twice when they
could get away with doing it once.

Dave Chinner suggested that they should all be fixed up[1]. I agree[2].
But, the series to do that fixup spans a bunch of filesystems and a lot of
people. This patch fixes common code that absolutely everyone uses. It
has measurable performance benefits[3].

I think this patch can go in and not be held up by the others.

I will post them separately to their separate maintainers for
consideration. But, honestly, I'm not going to lose any sleep if
the maintainers don't pick those up.

1. https://lore.kernel.org/all/[email protected]/
2. https://lore.kernel.org/all/[email protected]/
3. https://lore.kernel.org/all/[email protected]/


This patch:

There is a bit of a sordid history here. I originally wrote
998ef75ddb57 ("fs: do not prefault sys_write() user buffer pages")
to fix a performance issue that showed up on early SMAP hardware.
But that was reverted with 00a3d660cbac because it exposed an
underlying filesystem bug.

This is a reimplementation of the original commit along with some
simplification and comment improvements.

The basic problem is that the generic write path has two userspace
accesses: one to prefault the write source buffer and then another to
perform the actual write. On x86, this means an extra STAC/CLAC pair.
These are relatively expensive instructions because they function as
barriers.

Keep the prefaulting behavior but move it into the slow path that gets
run when the write did not make any progress. This avoids livelocks
that can happen when the write's source and destination target the
same folio. Contrary to the existing comments, the fault-in does not
prevent deadlocks. That's accomplished by using an "atomic" usercopy
that disables page faults.

The end result is that the generic write fast path now touches
userspace once instead of twice.

0day has shown some improvements on a couple of microbenchmarks:

https://lore.kernel.org/all/[email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dave Hansen <[email protected]>
Link: https://lore.kernel.org/all/yxyuijjfd6yknryji2q64j3keq2ygw6ca6fs5jwyolklzvo45s@4u63qqqyosy2/
Cc: Ted Ts'o <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Dave Chinner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# 252256e4 12-Mar-2025 Amir Goldstein <[email protected]>

Revert "fanotify: disable readahead if we have pre-content watches"

This reverts commit fac84846a28c0950d4433118b3dffd44306df62d.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: J

Revert "fanotify: disable readahead if we have pre-content watches"

This reverts commit fac84846a28c0950d4433118b3dffd44306df62d.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Link: https://patch.msgid.link/[email protected]



# 955fbe0e 12-Mar-2025 Amir Goldstein <[email protected]>

Revert "fsnotify: generate pre-content permission event on page fault"

This reverts commit 8392bc2ff8c8bf7c4c5e6dfa71ccd893a3c046f6.

In the use case of buffered write whose input buffer is mmapped

Revert "fsnotify: generate pre-content permission event on page fault"

This reverts commit 8392bc2ff8c8bf7c4c5e6dfa71ccd893a3c046f6.

In the use case of a buffered write whose input buffer is an mmapped file
on a filesystem with a pre-content mark, the prefaulting of the buffer can
happen under the filesystem freeze protection (obtained in vfs_write()),
which breaks the assumptions of the pre-content hook and introduces a
potential deadlock of the HSM handler in userspace with filesystem
freezing.

Now that we have pre-content hooks at file mmap() time, disable the
pre-content event hooks on page fault to avoid the potential deadlock.

Reported-by: [email protected]
Closes: https://lore.kernel.org/linux-fsdevel/7ehxrhbvehlrjwvrduoxsao5k3x4aw275patsb3krkwuq573yv@o2hskrfawbnc/
Fixes: 8392bc2ff8c8 ("fsnotify: generate pre-content permission event on page fault")
Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Link: https://patch.msgid.link/[email protected]



# 00a7d398 07-Mar-2025 Linus Torvalds <[email protected]>

fs/pipe: add simpler helpers for common cases

The fix to atomically read the pipe head and tail state when not holding
the pipe mutex has caused a number of headaches due to the size change
of the involved types.

It turns out that we don't have _that_ many places that access these
fields directly and were affected, but we have more than we strictly
should have, because our low-level helper functions have been designed
to have intimate knowledge of how the pipes work.

And as a result, that random noise of direct 'pipe->head' and
'pipe->tail' accesses makes it harder to pinpoint any actual potential
problem spots remaining.

For example, we didn't have a "is the pipe full" helper function, but
instead had a "given these pipe buffer indexes and this pipe size, is
the pipe full". That's because some low-level pipe code does actually
want that much more complicated interface.

But most other places literally just want a "is the pipe full" helper,
and not having it meant that those places ended up being unnecessarily
much too aware of this all.

It would have been much better if only the very core pipe code that
cared had been the one aware of this all.

So let's fix it - better late than never. This just introduces the
trivial wrappers for "is this pipe full or empty" and to get how many
pipe buffers are used, so that instead of writing

if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))

the places that literally just want to know if a pipe is full can just
say

if (pipe_is_full(pipe))

instead. The existing trivial cases were converted with a 'sed' script.
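The wrappers themselves are presumably trivial, along these lines (hedged
sketch built on the existing pipe_full()/pipe_empty() helpers):

	static inline bool pipe_is_full(const struct pipe_inode_info *pipe)
	{
		return pipe_full(pipe->head, pipe->tail, pipe->max_usage);
	}

	static inline bool pipe_is_empty(const struct pipe_inode_info *pipe)
	{
		return pipe_empty(pipe->head, pipe->tail);
	}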

This cuts down on the places that access pipe->head and pipe->tail
directly outside of the pipe code (and core splice code) quite a lot.

The splice code in particular still revels in doing the direct low-level
accesses, and the fuse fuse_dev_splice_write() code also seems a bit
unnecessarily eager to go very low-level, but it's at least a bit better
than it used to be.

Signed-off-by: Linus Torvalds <[email protected]>



# d96e2802 18-Feb-2025 Matthew Wilcox (Oracle) <[email protected]>

mm: Remove wait_on_page_locked()

This compatibility wrapper has no callers left, so remove it.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>



# 8510edf1 18-Feb-2025 Jingbo Xu <[email protected]>

mm/filemap: fix miscalculated file range for filemap_fdatawrite_range_kick()

iocb->ki_pos has been updated with the number of written bytes since
generic_perform_write().

Besides, __filemap_fdatawrite_range() accepts the inclusive end of the
data range.
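A hedged sketch of the corrected call site (assumed shape, in
generic_write_sync()):

	/*
	 * ki_pos already points past the written bytes; the range
	 * helper takes an inclusive end.
	 */
	filemap_fdatawrite_range_kick(mapping, iocb->ki_pos - count,
				      iocb->ki_pos - 1);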

Fixes: 1d4457576570 ("mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue")
Signed-off-by: Jingbo Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>



Revision tags: v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7
# 73aa354a 11-Jan-2025 Kaixiong Yu <[email protected]>

mm: filemap: move sysctl to mm/filemap.c

This moves the filemap-related sysctl to mm/filemap.c and removes the
redundant external variable declaration.

Signed-off-by: Kaixiong Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Joel Granados <[email protected]>



Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4
# d94d23fd 20-Dec-2024 Jens Axboe <[email protected]>

mm: add FGP_DONTCACHE folio creation flag

Callers can pass this in for uncached folio creation, in which case if a
folio is newly created it gets marked as uncached. If a folio exists for
this index and lookup succeeds, then it will not get marked as uncached.
If an !uncached lookup finds a cached folio, clear the flag. In that
case, there are competing uncached and cached users of the folio, and it
should not get pruned.
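A hedged usage sketch (flag combination illustrative):

	/* create-or-lookup; a newly created folio is marked uncached */
	folio = __filemap_get_folio(mapping, index,
				    FGP_CREAT | FGP_WRITE | FGP_DONTCACHE,
				    mapping_gfp_mask(mapping));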

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
Cc: Brian Foster <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# dddc559f 20-Dec-2024 Jens Axboe <[email protected]>

mm/filemap: add filemap_fdatawrite_range_kick() helper

Works like filemap_fdatawrite_range(), except it's a non-integrity data
writeback and hence only starts writeback on the specified range. Will
help facilitate generically starting uncached writeback from
generic_write_sync(), as header dependencies preclude doing this inline
from fs.h.
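The helper is presumably a thin wrapper along these lines (hedged sketch):

	int filemap_fdatawrite_range_kick(struct address_space *mapping,
					  loff_t start, loff_t end)
	{
		/* WB_SYNC_NONE: start writeback, no integrity guarantee */
		return __filemap_fdatawrite_range(mapping, start, end,
						  WB_SYNC_NONE);
	}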

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
Cc: Brian Foster <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# fb7d3bc4 20-Dec-2024 Jens Axboe <[email protected]>

mm/filemap: drop streaming/uncached pages when writeback completes

If the folio is marked as streaming, drop pages when writeback completes.
Intended to be used with RWF_DONTCACHE, to avoid needing sync writes for
uncached IO.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
Cc: Brian Foster <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# 8026e49b 20-Dec-2024 Jens Axboe <[email protected]>

mm/filemap: add read support for RWF_DONTCACHE

Add RWF_DONTCACHE as a read operation flag, which means that any data read
will be removed from the page cache upon completion. Uses the page cache
to synchronize, and simply prunes folios that were instantiated when the
operation completes. While it would be possible to use private pages for
this, using the page cache as synchronization is handy for a variety of
reasons:

1) No special truncate magic is needed
2) Async buffered reads need some place to serialize, using the page
cache is a lot easier than writing extra code for this
3) The pruning cost is pretty reasonable

and the code to support this is much simpler as a result.

You can think of uncached buffered IO as being the much more attractive
cousin of O_DIRECT - it has none of the restrictions of O_DIRECT. Yes, it
will copy the data, but unlike regular buffered IO, it doesn't run into
the unpredictability of the page cache in terms of reclaim. As an
example, on a test box with 32 drives, reading them with buffered IO looks
as follows:

Reading bs 65536, uncached 0
1s: 145945MB/sec
2s: 158067MB/sec
3s: 157007MB/sec
4s: 148622MB/sec
5s: 118824MB/sec
6s: 70494MB/sec
7s: 41754MB/sec
8s: 90811MB/sec
9s: 92204MB/sec
10s: 95178MB/sec
11s: 95488MB/sec
12s: 95552MB/sec
13s: 96275MB/sec

where it's quite easy to see where the page cache filled up, and
performance went from good to erratic, and finally settles at a much
lower rate. Looking at top while this is ongoing, we see:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7535 root 20 0 267004 0 0 S 3199 0.0 8:40.65 uncached
3326 root 20 0 0 0 0 R 100.0 0.0 0:16.40 kswapd4
3327 root 20 0 0 0 0 R 100.0 0.0 0:17.22 kswapd5
3328 root 20 0 0 0 0 R 100.0 0.0 0:13.29 kswapd6
3332 root 20 0 0 0 0 R 100.0 0.0 0:11.11 kswapd10
3339 root 20 0 0 0 0 R 100.0 0.0 0:16.25 kswapd17
3348 root 20 0 0 0 0 R 100.0 0.0 0:16.40 kswapd26
3343 root 20 0 0 0 0 R 100.0 0.0 0:16.30 kswapd21
3344 root 20 0 0 0 0 R 100.0 0.0 0:11.92 kswapd22
3349 root 20 0 0 0 0 R 100.0 0.0 0:16.28 kswapd27
3352 root 20 0 0 0 0 R 99.7 0.0 0:11.89 kswapd30
3353 root 20 0 0 0 0 R 96.7 0.0 0:16.04 kswapd31
3329 root 20 0 0 0 0 R 96.4 0.0 0:11.41 kswapd7
3345 root 20 0 0 0 0 R 96.4 0.0 0:13.40 kswapd23
3330 root 20 0 0 0 0 S 91.1 0.0 0:08.28 kswapd8
3350 root 20 0 0 0 0 S 86.8 0.0 0:11.13 kswapd28
3325 root 20 0 0 0 0 S 76.3 0.0 0:07.43 kswapd3
3341 root 20 0 0 0 0 S 74.7 0.0 0:08.85 kswapd19
3334 root 20 0 0 0 0 S 71.7 0.0 0:10.04 kswapd12
3351 root 20 0 0 0 0 R 60.5 0.0 0:09.59 kswapd29
3323 root 20 0 0 0 0 R 57.6 0.0 0:11.50 kswapd1
[...]

which is just showing a partial list of the 32 kswapd threads that are
running mostly full tilt, burning ~28 full CPU cores.

If the same test case is run with RWF_DONTCACHE set for the buffered read,
the output looks as follows:

Reading bs 65536, uncached 1
1s: 153144MB/sec
2s: 156760MB/sec
3s: 158110MB/sec
4s: 158009MB/sec
5s: 158043MB/sec
6s: 157638MB/sec
7s: 157999MB/sec
8s: 158024MB/sec
9s: 157764MB/sec
10s: 157477MB/sec
11s: 157417MB/sec
12s: 157455MB/sec
13s: 157233MB/sec
14s: 156692MB/sec

which is just chugging along at ~155GB/sec of read performance. Looking
at top, we see:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7961 root 20 0 267004 0 0 S 3180 0.0 5:37.95 uncached
8024 axboe 20 0 14292 4096 0 R 1.0 0.0 0:00.13 top

where just the test app is using CPU, no reclaim is taking place outside
of the main thread. Not only is performance 65% better, it's also using
half the CPU to do it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
Cc: Brian Foster <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# f598cdaa 20-Dec-2024 Jens Axboe <[email protected]>

mm/filemap: use page_cache_sync_ra() to kick off read-ahead

Rather than use the page_cache_sync_readahead() helper, define our own
ractl and use page_cache_sync_ra() directly. In preparation for needing
to modify ractl inside filemap_get_pages().
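A hedged sketch of the pattern (index/last_index illustrative):

	DEFINE_READAHEAD(ractl, file, &file->f_ra, mapping, index);

	/* kick off synchronous readahead for the requested pages */
	page_cache_sync_ra(&ractl, last_index - index);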

No functional changes in this patch.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Brian Foster <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# 9ad63445 20-Dec-2024 Jens Axboe <[email protected]>

mm/filemap: change filemap_create_folio() to take a struct kiocb

Patch series "Uncached buffered IO", v8.

5 years ago I posted patches adding support for RWF_UNCACHED, as a way to
do buffered IO that isn't page cache persistent. The approach back then
was to have private pages for IO, and then get rid of them once IO was
done. But that then runs into all the issues that O_DIRECT has, in terms
of synchronizing with the page cache.

So here's a new approach to the same concept, but using the page cache as
synchronization. Due to excessive bike shedding on the naming, this is
now named RWF_DONTCACHE, and is less special in that it's just page cache
IO, except it prunes the ranges once IO is completed.

Why do this, you may ask? The tldr is that device speeds are only getting
faster, while reclaim is not. Doing normal buffered IO can be very
unpredictable, and suck up a lot of resources on the reclaim side. This
leads people to use O_DIRECT as a work-around, which has its own set of
restrictions in terms of size, offset, and length of IO. It's also
inherently synchronous, and now you need async IO as well. While the
latter isn't necessarily a big problem as we have good options available
there, it also should not be a requirement when all you want to do is read
or write some data without caching.

Even on desktop type systems, a normal NVMe device can fill the entire
page cache in seconds. On the big system I used for testing, there's a
lot more RAM, but also a lot more devices. As can be seen in some of the
results in the following patches, you can still fill RAM in seconds even
when there's 1TB of it. Hence this problem isn't solely a "big
hyperscaler system" issue, it's common across the board.

Common for both reads and writes with RWF_DONTCACHE is that they use the
page cache for IO. Reads work just like a normal buffered read would,
with the only exception being that the touched ranges will get pruned
after data has been copied. For writes, the ranges will get writeback
kicked off before the syscall returns, and then writeback completion will
prune the range. Hence writes aren't synchronous, and it's easy to
pipeline writes using RWF_DONTCACHE. Folios that aren't instantiated by
RWF_DONTCACHE IO are left untouched. This means you that uncached IO will
take advantage of the page cache for uptodate data, but not leave anything
it instantiated/created in cache.

File systems need to support this. This patchset adds support for the
generic read path, which covers file systems like ext4. Patches exist to
add support for iomap/XFS and btrfs as well, which sit on top of this
series. If RWF_DONTCACHE IO is attempted on a file system that doesn't
support it, -EOPNOTSUPP is returned. Hence the user can rely on it either
working as designed, or flagging an error if that's not the case. The
intent here is to give the application a sensible fallback path - eg, it
may fall back to O_DIRECT if appropriate, or just live with the fact that
uncached IO isn't available and do normal buffered IO.

Adding "support" to other file systems should be trivial, most of the time
just a one-liner adding FOP_DONTCACHE to the fop_flags in the
file_operations struct, if the file system is using either iomap or the
generic filemap helpers for reading and writing.

Performance results are in patch 8 for reads, and you can find the write
side results in the XFS patch adding support for DONTCACHE writes for XFS:

https://git.kernel.dk/cgit/linux/commit/?h=buffered-uncached-fs.10&id=257e92de795fdff7d7e256501e024fac6da6a7f4

with the tldr being that I see about a 65% improvement in performance for
both, with fully predictable IO times. CPU reduction is substantial as
well, with no kswapd activity at all for reclaim when using uncached IO.

Using it from applications is trivial - just set RWF_DONTCACHE for the
read or write, using pwritev2(2) or preadv2(2). For io_uring, same thing,
just set RWF_DONTCACHE in sqe->rw_flags for a buffered read/write
operation. And that's it.
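For example, a minimal userspace sketch of an uncached buffered read
(hedged: assumes uapi headers that define RWF_DONTCACHE and a filesystem
with DONTCACHE support, otherwise the call fails with EOPNOTSUPP):

	#define _GNU_SOURCE
	#include <sys/uio.h>	/* preadv2() */
	#include <linux/fs.h>	/* RWF_DONTCACHE on recent kernels */

	static ssize_t read_dontcache(int fd, void *buf, size_t len,
				      off_t off)
	{
		struct iovec iov = { .iov_base = buf, .iov_len = len };

		/* ordinary buffered read; touched ranges get pruned
		 * from the page cache once IO completes */
		return preadv2(fd, &iov, 1, off, RWF_DONTCACHE);
	}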

Patches 1..7 are just prep patches, and should have no functional changes
at all. Patch 8 adds support for the filemap path for RWF_DONTCACHE
reads, and patches 9..12 are just prep patches for supporting the write
side of uncached writes. In the below mentioned branch, there are then
patches to adopt uncached reads and writes for xfs, btrfs, and ext4. The
latter currently relies on a bit of a hack for passing whether this is an
uncached write or not through ->write_begin(), which can hopefully go away
once ext4 adopts iomap for buffered writes. I say this is a hack as it's
not the prettiest way to do it, however it is fully solid and will work
just fine.

Passes full xfstests and fsx overnight runs, no issues observed. That
includes the vm running the testing also using RWF_DONTCACHE on the host.
I'll post fsstress and fsx patches for RWF_DONTCACHE separately. As far
as I'm concerned, no further work needs doing here.

And git tree for the patches is here:

https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.10

with the file system patches on top adding support for xfs/btrfs/ext4
here:

https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached-fs.10


This patch (of 12):

Rather than pass in both the file and position directly from the kiocb,
just take a struct kiocb instead. With the kiocb being passed in, skip
passing in the address_space separately as well. While doing so, move the
ki_flags checking into filemap_create_folio() as well. In preparation for
actually needing the kiocb in the function.

No functional changes in this patch.
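A hedged before/after sketch of the signature change:

	/* before */
	static int filemap_create_folio(struct file *file,
			struct address_space *mapping, pgoff_t index,
			struct folio_batch *fbatch);

	/* after: file, position and address_space come from the kiocb */
	static int filemap_create_folio(struct kiocb *iocb,
			struct folio_batch *fbatch);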

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Brian Foster <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# 5f537664 21-Jan-2025 Linus Torvalds <[email protected]>

cachestat: fix page cache statistics permission checking

When the 'cachestat()' system call was added in commit cf264e1329fb
("cachestat: implement cachestat syscall"), it was meant to be a much
more convenient (and performant) version of mincore() that didn't need
mapping things into the user virtual address space in order to work.

But it ended up missing the "check for writability or ownership" fix for
mincore(), done in commit 134fca9063ad ("mm/mincore.c: make mincore()
more conservative").

This just adds equivalent logic to 'cachestat()', modified for the file
context (rather than vma).
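A hedged sketch of the added check (helper name and exact checks assumed,
mirroring mincore()'s rule): allow page cache inspection only if the
caller opened the file for writing, owns the inode, or could open it for
writing.

	static bool can_do_cachestat(struct file *f)
	{
		if (f->f_mode & FMODE_WRITE)
			return true;
		if (inode_owner_or_capable(file_mnt_idmap(f),
					   file_inode(f)))
			return true;
		return file_permission(f, MAY_WRITE) == 0;
	}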

Reported-by: Sudheendra Raghav Neela <[email protected]>
Fixes: cf264e1329fb ("cachestat: implement cachestat syscall")
Tested-by: Johannes Weiner <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Nhat Pham <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>



Revision tags: v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12
# 1168b2be 16-Nov-2024 Dr. David Alan Gilbert <[email protected]>

filemap: remove unused folio_add_wait_queue

folio_add_wait_queue() has been unused since 2021's commit 850cba069c26
("cachefiles: Delete the cachefiles driver pending rewrite")

Remove it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# 1c47c578 10-Jan-2025 Matthew Wilcox (Oracle) <[email protected]>

mm: fix assertion in folio_end_read()

We only need to assert that the uptodate flag is clear if we're going to
set it. This hasn't been a problem before now because we have only used
folio_end_read() when completing with an error, but it's convenient to use
it in squashfs if we discover the folio is already uptodate.
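A hedged sketch of the adjusted assertion placement in folio_end_read()
(atomic flag update elided):

	void folio_end_read(struct folio *folio, bool success)
	{
		unsigned long mask = 1 << PG_locked;

		VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
		if (success) {
			/* assert uptodate is clear only if we set it */
			VM_BUG_ON_FOLIO(folio_test_uptodate(folio), folio);
			mask |= 1 << PG_uptodate;
		}
		/* ... clear PG_locked (and maybe set PG_uptodate)
		 * in one atomic operation, then wake waiters ... */
	}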

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Phillip Lougher <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# f505e6c9 02-Jan-2025 Marco Nelissen <[email protected]>

filemap: avoid truncating 64-bit offset to 32 bits

On 32-bit kernels, folio_seek_hole_data() was inadvertently truncating a
64-bit value to 32 bits, leading to a possible infinite loop when writing
to an xfs filesystem.
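The underlying pitfall, as a hedged illustration (values made up; not the
verbatim patch):

	u64 start = 0x100001000ULL;	/* offset beyond 4 GiB */
	u32 bsz = 4096;

	/* buggy: ~(bsz - 1) is a 32-bit mask, clearing the upper bits */
	start = (start + bsz) & ~(bsz - 1);

	/* fixed: widen the mask to 64 bits first */
	start = (start + bsz) & ~((u64)bsz - 1);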

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 54fa39ac2e00 ("iomap: use mapping_seek_hole_data")
Signed-off-by: Marco Nelissen <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# 62e72d2c 22-Dec-2024 Kairui Song <[email protected]>

mm, madvise: fix potential workingset node list_lru leaks

Since commit 5abc1e37afa0 ("mm: list_lru: allocate list_lru_one only when
needed"), all list_lru users need to allocate the items using the new
infrastructure that provides list_lru info for slab allocation, ensuring
that the corresponding memcg list_lru is allocated before use.

For workingset shadow nodes (which are xa_node), users are converted to
use the new infrastructure by commit 9bbdc0f32409 ("xarray: use
kmem_cache_alloc_lru to allocate xa_node"). The xas->xa_lru will be set
correctly for filemap users. However, there is a missing case: xa_node
allocations caused by madvise(..., MADV_COLLAPSE).

madvise(..., MADV_COLLAPSE) will also read in the absent parts of the file
map, and there will be xa_nodes allocated for the caller's memcg (assuming
it's not rootcg). However, these allocations won't trigger memcg list_lru
allocation because the proper xas info was not set.

If nothing else has allocated other xa_nodes for that memcg to trigger
list_lru creation, and memory pressure starts to evict file pages,
workingset_update_node will try to add these xa_nodes to their
corresponding memcg list_lru, which does not exist (NULL), so they will
be added to rootcg's list_lru instead.

This shouldn't be a significant issue in practice, but it is unexpected
behavior, and these xa_nodes will not be reclaimed effectively, which may
lead to incorrect counting of the list_lru->nr_items counter.

This problem wasn't exposed until recent commit 28e98022b31ef
("mm/list_lru: simplify reparenting and initial allocation") added a
sanity check: only dying memcg could have a NULL list_lru when
list_lru_{add,del} is called. This problem triggered this WARNING.

So make madvise(..., MADV_COLLAPSE) also call xas_set_lru() to pass the
list_lru which we may want to insert xa_node into later. And move
mapping_set_update to mm/internal.h, and turn into a macro to avoid
including extra headers in mm/internal.h.
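mapping_set_update() as a macro presumably looks like this (hedged
sketch): tag the xa_state with the shadow-node list_lru so xa_node
allocations register against the right memcg.

	#define mapping_set_update(xas, mapping) do {			\
		if (!dax_mapping(mapping) && !shmem_mapping(mapping)) {	\
			xas_set_update(xas, workingset_update_node);	\
			xas_set_lru(xas, &shadow_nodes);		\
		}							\
	} while (0)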

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9bbdc0f32409 ("xarray: use kmem_cache_alloc_lru to allocate xa_node")
Reported-by: [email protected]
Closes: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Kairui Song <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Qi Zheng <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



# 8392bc2f 15-Nov-2024 Josef Bacik <[email protected]>

fsnotify: generate pre-content permission event on page fault

FS_PRE_ACCESS will be generated on page fault depending on the faulting
method. This pre-content event is meant to be used by hierarchical storage
managers that want to fill in the file content on first read access.

Export a simple helper that file systems that have their own ->fault()
will use, and a more complicated helper to do fancy things in
filemap_fault().

Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Link: https://patch.msgid.link/aa56c50ce81b1fd18d7f5d71dd2dfced5eba9687.1731684329.git.josef@toxicpanda.com



# fac84846 15-Nov-2024 Josef Bacik <[email protected]>

fanotify: disable readahead if we have pre-content watches

With page faults we can trigger readahead on the file, and then
subsequent faults can find these pages and insert them into the file
without emitting an fanotify event. To avoid this case, disable
readahead if we have pre-content watches on the file. This way we are
guaranteed to get an event for every range we attempt to access on a
pre-content watched file.

Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Link: https://patch.msgid.link/70a54e859f555e54bc7a47b32fe5aca92b085615.1731684329.git.josef@toxicpanda.com



# 3203b3ab 29-Nov-2024 David Hildenbrand <[email protected]>

mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio()

The folio can get freed + buddy-merged + reallocated in the meantime,
resulting in us calling folio_test_locked() possibly on a tail page.

This makes const_folio_flags() trigger VM_BUG_ON_PGFLAGS() when stumbling
over the tail page.

Could this result in other issues? Doesn't look like it. False positives
and false negatives don't really matter, because this folio would get
skipped either way when detecting that they have been reallocated in the
meantime.

Fix it by performing the folio_test_locked() check after grabbing a
reference. If this ever becomes a real problem, we could add a special
helper that racily checks if the bit is set even on tail pages ... but
let's hope that's not required so we can just handle it cleaner: work on
the folio after we hold a reference.
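A hedged sketch of the reordered checks in next_uptodate_folio():

	if (!folio_try_get(folio))
		continue;
	/* safe to inspect flags now that we hold a reference */
	if (folio_test_locked(folio))
		goto skip;
	/* has the folio moved or been split in the meantime? */
	if (unlikely(folio != xas_reload(xas)))
		goto skip;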

Do we really need the folio_test_locked() check if we are going to trylock
briefly after? Well, we can at least avoid a xas_reload().

It's a bit unclear which exact change introduced that issue. Likely, ever
since we made PG_locked obey the PF_NO_TAIL policy it could have been
triggered in some way.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 48c935ad88f5 ("page-flags: define PG_locked behavior on compound pages")
Signed-off-by: David Hildenbrand <[email protected]>
Reported-by: [email protected]
Closes: https://lore.kernel.org/lkml/[email protected]/
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: Hillf Danton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


