|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1 |
|
| #
8ab1b160 |
| 03-Apr-2025 |
Vishal Moola (Oracle) <[email protected]> |
mm: fix filemap_get_folios_contig returning batches of identical folios
filemap_get_folios_contig() is supposed to return distinct folios found within [start, end]. Large folios in the Xarray become multi-index entries. xas_next() can iterate through the sub-indexes before finding a sibling entry and breaking out of the loop.
This can result in a returned folio_batch containing an indeterminate number of duplicate folios, which forces the callers to skeptically handle the returned batch. This is inefficient and incurs a large maintenance overhead.
We can fix this by calling xas_advance() after we have successfully added a folio to the batch, ensuring the Xarray is positioned such that it will correctly find the next folio - similar to filemap_get_read_batch().
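For illustration, the fixed iteration shape looks roughly like this (a sketch, not the verbatim diff; locking and reference counting are omitted):

    /* Sketch: after accepting a folio, advance the xa_state past all of
     * its slots so the next xas_next() lands on a different entry. */
    for (folio = xas_load(&xas); folio; folio = xas_next(&xas)) {
            if (xas_retry(&xas, folio))
                    continue;
            if (!folio_batch_add(fbatch, folio))
                    break;          /* batch is full */
            /* skip the remaining sub-index slots of a large folio */
            xas_advance(&xas, folio_next_index(folio) - 1);
    }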
Link: https://lkml.kernel.org/r/Z-8s1-kiIDkzgRbc@fedora Fixes: 35b471467f88 ("filemap: add filemap_get_folios_contig()") Signed-off-by: Vishal Moola (Oracle) <[email protected]> Reported-by: Qu Wenruo <[email protected]> Closes: https://lkml.kernel.org/r/[email protected] Tested-by: Qu Wenruo <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Vivek Kasireddy <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.14, v6.14-rc7 |
|
| #
200a89c1 |
| 14-Mar-2025 |
Zi Yan <[email protected]> |
mm/filemap: use xas_try_split() in __filemap_add_folio()
Patch series "Minimize xa_node allocation during xarray split", v3.
When splitting a multi-index entry in XArray from order-n to order-m, existing xas_split_alloc()+xas_split() approach requires 2^(n % XA_CHUNK_SHIFT) xa_node allocations. But its callers, __filemap_add_folio() and shmem_split_large_entry(), use at most 1 xa_node. To minimize xa_node allocation and remove the limitation of no split from order-12 (or above) to order-0 (or anything between 0 and 5)[1], xas_try_split() was added[2], which allocates (n / XA_CHUNK_SHIFT - m / XA_CHUNK_SHIFT) xa_node. It is used for non-uniform folio split, but can be used by __filemap_add_folio() and shmem_split_large_entry().
xas_split_alloc() and xas_split() split an order-9 to order-0:
         ---------------------------------
         |   |   |   |   |   |   |   |   |
         | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
         |   |   |   |   |   |   |   |   |
         ---------------------------------
           |   |                   |   |
     -------   ---               ---   -------
     |           |      ...      |           |
     V           V               V           V
----------- -----------     ----------- -----------
| xa_node | | xa_node | ... | xa_node | | xa_node |
----------- -----------     ----------- -----------

xas_try_split() splits an order-9 to order-0:

         ---------------------------------
         |   |   |   |   |   |   |   |   |
         | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
         |   |   |   |   |   |   |   |   |
         ---------------------------------
                         |
                         V
                    -----------
                    | xa_node |
                    -----------
xas_try_split() is designed to be called iteratively with n = m + 1. xas_try_split_min_order() is added to minimize the number of calls to xas_try_split() by telling the caller the next minimal order to split to instead of n - 1. Splitting order-n to order-m when m = l * XA_CHUNK_SHIFT does not require xa_node allocation, and requires 1 xa_node when n = l * XA_CHUNK_SHIFT and m = n - 1, so it is OK to use xas_try_split() with n > m + 1 when no new xa_node is needed.
xfstests quick group test passed on xfs and tmpfs.
[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
This patch (of 2):
During __filemap_add_folio(), a shadow entry covering n slots may be present when a folio covering m slots (m < n) is to be added. Instead of splitting all n slots, only the m slots covered by the folio need to be split, and the remaining n-m shadow entries can be retained with orders ranging from m to n-1. This method only requires
(n/XA_CHUNK_SHIFT) - (m/XA_CHUNK_SHIFT)
new xa_nodes instead of
(n % XA_CHUNK_SHIFT) * ((n/XA_CHUNK_SHIFT) - (m/XA_CHUNK_SHIFT))
new xa_nodes, compared to the original xas_split_alloc() + xas_split() one. For example, to insert an order-0 folio when an order-9 shadow entry is present (assuming XA_CHUNK_SHIFT is 6), 1 xa_node is needed instead of 8.
xas_try_split_min_order() is introduced to reduce the number of calls to xas_try_split() during split.
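A rough sketch of the iterative usage, under the assumption that xas_try_split() splits the current entry of order `order` down to the order set on the xa_state (control flow condensed, error handling elided):

    /* Illustrative, not verbatim kernel code. */
    unsigned int order = xas_get_order(&xas);       /* shadow entry order */

    while (order > new_order) {
            /* largest target order that still needs at most 1 xa_node */
            unsigned int split = max(xas_try_split_min_order(order), new_order);

            xas_set_order(&xas, index, split);
            xas_try_split(&xas, shadow_entry, order);
            if (xas_error(&xas))
                    break;                          /* e.g. -ENOMEM */
            order = split;
    }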
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zi Yan <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kairui Song <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3 |
|
| #
b23ceebd |
| 13-Feb-2025 |
Guanjun <[email protected]> |
filemap: remove redundant folio_test_large check in filemap_free_folio
The folio_test_large() check in filemap_free_folio() is unnecessary because folio_nr_pages(), which is called internally, already performs this check. Removing the redundant condition simplifies the code and avoids double validation.
This change improves code readability and reduces unnecessary operations in the folio freeing path.
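A before/after sketch of the change (hedged reconstruction, not the exact diff):

    /* Before: refs defaults to 1 and is only widened for large folios. */
    long nr = 1;
    if (folio_test_large(folio))
            nr = folio_nr_pages(folio);
    folio_put_refs(folio, nr);

    /* After: folio_nr_pages() already returns 1 for small folios. */
    folio_put_refs(folio, folio_nr_pages(folio));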
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Guanjun <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
182db972 |
| 24-Feb-2025 |
Raphael S. Carvalho <[email protected]> |
mm: fix error handling in __filemap_get_folio() with FGP_NOWAIT
original report: https://lore.kernel.org/all/CAKhLTr1UL3ePTpYjXOx2AJfNk8Ku2EdcEfu+CH1sf3Asr=B-Dw@mail.gmail.com/T/
When doing buffered writes with FGP_NOWAIT under memory pressure, the system returned ENOMEM despite there being plenty of available memory that could be reclaimed from the page cache. The user space used the io_uring interface, which in turn submits I/O with FGP_NOWAIT (the fast path).
retsnoop pointed to iomap_get_folio:
00:34:16.180612 -> 00:34:16.180651 TID/PID 253786/253721 (reactor-1/combined_tests):

    entry_SYSCALL_64_after_hwframe+0x76
    do_syscall_64+0x82
    __do_sys_io_uring_enter+0x265
    io_submit_sqes+0x209
    io_issue_sqe+0x5b
    io_write+0xdd
    xfs_file_buffered_write+0x84
    iomap_file_buffered_write+0x1a6
       32us [-ENOMEM] iomap_write_begin+0x408
                      iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x…
                      pos=0 len=4096 foliop=0xffffb32c296b7b80
    !   4us [-ENOMEM] iomap_get_folio
                      iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x…
                      pos=0 len=4096
This is likely a regression caused by 66dabbb65d67 ("mm: return an ERR_PTR from __filemap_get_folio"), which moved error handling from iomap_get_folio() to __filemap_get_folio() but broke FGP_NOWAIT handling, so ENOMEM escaped to user space. Had it correctly returned -EAGAIN with NOWAIT, either io_uring or user space itself would have been able to retry the request.
It's not enough to patch io_uring since the iomap interface is the one responsible for it, and pwritev2(RWF_NOWAIT) and AIO interfaces must return the proper error too.
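A sketch of the corrected error mapping in __filemap_get_folio()'s allocation-failure path (illustrative, not the exact upstream diff):

    folio = filemap_alloc_folio(gfp, order);
    if (!folio) {
            /* NOWAIT callers must see -EAGAIN so io_uring, RWF_NOWAIT
             * and AIO can retry instead of failing with -ENOMEM. */
            if (fgp_flags & FGP_NOWAIT)
                    return ERR_PTR(-EAGAIN);
            return ERR_PTR(-ENOMEM);
    }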
The patch was tested with scylladb test suite (its original reproducer), and the tests all pass now when memory is pressured.
Link: https://lkml.kernel.org/r/[email protected] Fixes: 66dabbb65d67 ("mm: return an ERR_PTR from __filemap_get_folio") Signed-off-by: Raphael S. Carvalho <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
665575cf |
| 28-Feb-2025 |
Dave Hansen <[email protected]> |
filemap: move prefaulting out of hot write path
There is a generic anti-pattern that shows up in the VFS and several filesystems where the hot write paths touch userspace twice when they could get away with doing it once.
Dave Chinner suggested that they should all be fixed up[1]. I agree[2]. But, the series to do that fixup spans a bunch of filesystems and a lot of people. This patch fixes common code that absolutely everyone uses. It has measurable performance benefits[3].
I think this patch can go in and not be held up by the others.
I will post them separately to their separate maintainers for consideration. But, honestly, I'm not going to lose any sleep if the maintainers don't pick those up.
1. https://lore.kernel.org/all/[email protected]/
2. https://lore.kernel.org/all/[email protected]/
3. https://lore.kernel.org/all/[email protected]/
This patch:
There is a bit of a sordid history here. I originally wrote 998ef75ddb57 ("fs: do not prefault sys_write() user buffer pages") to fix a performance issue that showed up on early SMAP hardware. But that was reverted with 00a3d660cbac because it exposed an underlying filesystem bug.
This is a reimplementation of the original commit along with some simplification and comment improvements.
The basic problem is that the generic write path has two userspace accesses: one to prefault the write source buffer and then another to perform the actual write. On x86, this means an extra STAC/CLAC pair. These are relatively expensive instructions because they function as barriers.
Keep the prefaulting behavior but move it into the slow path that gets run when the write did not make any progress. This avoids livelocks that can happen when the write's source and destination target the same folio. Contrary to the existing comments, the fault-in does not prevent deadlocks. That's accomplished by using an "atomic" usercopy that disables page faults.
The end result is that the generic write fast path now touches userspace once instead of twice.
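Condensed sketch of the restructured loop in generic_perform_write() (write_begin/write_end and balancing details omitted):

    do {
            /* fast path: one fault-disabled user access, no prefault */
            copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);

            if (unlikely(copied == 0)) {
                    /* slow path: no progress, so prefault the source now;
                     * this breaks the source == destination livelock */
                    if (fault_in_iov_iter_readable(i, bytes) == bytes) {
                            status = -EFAULT;
                            break;
                    }
            }
            pos += copied;
    } while (iov_iter_count(i));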
0day has shown some improvements on a couple of microbenchmarks:
https://lore.kernel.org/all/[email protected]/
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Dave Hansen <[email protected]> Link: https://lore.kernel.org/all/yxyuijjfd6yknryji2q64j3keq2ygw6ca6fs5jwyolklzvo45s@4u63qqqyosy2/ Cc: Ted Ts'o <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Dave Chinner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
252256e4 |
| 12-Mar-2025 |
Amir Goldstein <[email protected]> |
Revert "fanotify: disable readahead if we have pre-content watches"
This reverts commit fac84846a28c0950d4433118b3dffd44306df62d.
Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: J
Revert "fanotify: disable readahead if we have pre-content watches"
This reverts commit fac84846a28c0950d4433118b3dffd44306df62d.
Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Jan Kara <[email protected]> Link: https://patch.msgid.link/[email protected]
|
| #
955fbe0e |
| 12-Mar-2025 |
Amir Goldstein <[email protected]> |
Revert "fsnotify: generate pre-content permission event on page fault"
This reverts commit 8392bc2ff8c8bf7c4c5e6dfa71ccd893a3c046f6.
In the use case of buffered write whose input buffer is mmapped
Revert "fsnotify: generate pre-content permission event on page fault"
This reverts commit 8392bc2ff8c8bf7c4c5e6dfa71ccd893a3c046f6.
In the use case of a buffered write whose input buffer is a mmapped file on a filesystem with a pre-content mark, the prefaulting of the buffer can happen under the filesystem freeze protection (obtained in vfs_write()), which breaks the assumptions of the pre-content hook and introduces a potential deadlock of the HSM handler in userspace with filesystem freezing.
Now that we have pre-content hooks at file mmap() time, disable the pre-content event hooks on page fault to avoid the potential deadlock.
Reported-by: [email protected] Closes: https://lore.kernel.org/linux-fsdevel/7ehxrhbvehlrjwvrduoxsao5k3x4aw275patsb3krkwuq573yv@o2hskrfawbnc/ Fixes: 8392bc2ff8c8 ("fsnotify: generate pre-content permission event on page fault") Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Jan Kara <[email protected]> Link: https://patch.msgid.link/[email protected]
|
| #
00a7d398 |
| 07-Mar-2025 |
Linus Torvalds <[email protected]> |
fs/pipe: add simpler helpers for common cases
The fix to atomically read the pipe head and tail state when not holding the pipe mutex has caused a number of headaches due to the size change of the involved types.
It turns out that we don't have _that_ many places that access these fields directly and were affected, but we have more than we strictly should have, because our low-level helper functions have been designed to have intimate knowledge of how the pipes work.
And as a result, that random noise of direct 'pipe->head' and 'pipe->tail' accesses makes it harder to pinpoint any actual potential problem spots remaining.
For example, we didn't have a "is the pipe full" helper function, but instead had a "given these pipe buffer indexes and this pipe size, is the pipe full". That's because some low-level pipe code does actually want that much more complicated interface.
But most other places literally just want a "is the pipe full" helper, and not having it meant that those places ended up being unnecessarily much too aware of this all.
It would have been much better if only the very core pipe code that cared had been the one aware of this all.
So let's fix it - better late than never. This just introduces the trivial wrappers for "is this pipe full or empty" and to get how many pipe buffers are used, so that instead of writing
if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
the places that literally just want to know if a pipe is full can just say
if (pipe_is_full(pipe))
instead. The existing trivial cases were converted with a 'sed' script.
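The wrappers themselves are one-liners; a sketch of their shape (pipe_is_full() is quoted above; the empty variant is assumed to follow the same pattern):

    static inline bool pipe_is_full(const struct pipe_inode_info *pipe)
    {
            return pipe_full(pipe->head, pipe->tail, pipe->max_usage);
    }

    static inline bool pipe_is_empty(const struct pipe_inode_info *pipe)
    {
            return pipe_empty(pipe->head, pipe->tail);
    }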
This cuts down on the places that access pipe->head and pipe->tail directly outside of the pipe code (and core splice code) quite a lot.
The splice code in particular still revels in doing the direct low-level accesses, and the fuse fuse_dev_splice_write() code also seems a bit unnecessarily eager to go very low-level, but it's at least a bit better than it used to be.
Signed-off-by: Linus Torvalds <[email protected]>
|
| #
d96e2802 |
| 18-Feb-2025 |
Matthew Wilcox (Oracle) <[email protected]> |
mm: Remove wait_on_page_locked()
This compatibility wrapper has no callers left, so remove it.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
|
| #
8510edf1 |
| 18-Feb-2025 |
Jingbo Xu <[email protected]> |
mm/filemap: fix miscalculated file range for filemap_fdatawrite_range_kick()
iocb->ki_pos has been updated with the number of written bytes since generic_perform_write().
Besides, __filemap_fdatawrite_range() accepts the inclusive end of the data range.
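A sketch of the corrected call (illustrative): since ki_pos has already been advanced past the write and the end parameter is inclusive, the just-written range is

    /* `count` is the number of bytes just written */
    filemap_fdatawrite_range_kick(mapping,
                                  iocb->ki_pos - count, /* start */
                                  iocb->ki_pos - 1);    /* inclusive end */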
Fixes: 1d4457576570 ("mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue") Signed-off-by: Jingbo Xu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7 |
|
| #
73aa354a |
| 11-Jan-2025 |
Kaixiong Yu <[email protected]> |
mm: filemap: move sysctl to mm/filemap.c
This moves the filemap related sysctl to mm/filemap.c, and removes the redundant external variable declaration.
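A sketch of the in-file registration pattern this enables (hedged: the table name and the choice of the page_lock_unfairness knob as the example are assumptions):

    static struct ctl_table filemap_sysctl_table[] = {
            {
                    .procname       = "page_lock_unfairness",
                    .data           = &sysctl_page_lock_unfairness,
                    .maxlen         = sizeof(sysctl_page_lock_unfairness),
                    .mode           = 0644,
                    .proc_handler   = proc_dointvec_minmax,
                    .extra1         = SYSCTL_ZERO,
            }
    };

    static int __init filemap_sysctl_init(void)
    {
            register_sysctl_init("vm", filemap_sysctl_table);
            return 0;
    }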
Signed-off-by: Kaixiong Yu <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Joel Granados <[email protected]>
|
|
Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4 |
|
| #
d94d23fd |
| 20-Dec-2024 |
Jens Axboe <[email protected]> |
mm: add FGP_DONTCACHE folio creation flag
Callers can pass this in for uncached folio creation, in which case if a folio is newly created it gets marked as uncached. If a folio exists for this index and lookup succeeds, then it will not get marked as uncached. If an !uncached lookup finds a cached folio, clear the flag. For that case, there are competing uncached and cached users of the folio, and it should not get pruned.
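A sketch of those semantics (hedged: mainline tracks the uncached state with a "dropbehind" folio flag, assumed here):

    if (!folio) {
            folio = filemap_alloc_folio(gfp, order);
            /* newly created folio: mark it uncached */
            if (folio && (fgp_flags & FGP_DONTCACHE))
                    __folio_set_dropbehind(folio);
    } else if (!(fgp_flags & FGP_DONTCACHE) && folio_test_dropbehind(folio)) {
            /* a cached user found it too: don't prune it anymore */
            folio_clear_dropbehind(folio);
    }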
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]> Cc: Brian Foster <[email protected]> Cc: Chris Mason <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
dddc559f |
| 20-Dec-2024 |
Jens Axboe <[email protected]> |
mm/filemap: add filemap_fdatawrite_range_kick() helper
Works like filemap_fdatawrite_range(), except it's a non-integrity data writeback and hence only starts writeback on the specified range. Will help facilitate generically starting uncached writeback from generic_write_sync(), as header dependencies preclude doing this inline from fs.h.
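A sketch of the helper's shape - a thin non-integrity (WB_SYNC_NONE) wrapper:

    int filemap_fdatawrite_range_kick(struct address_space *mapping,
                                      loff_t start, loff_t end)
    {
            return __filemap_fdatawrite_range(mapping, start, end,
                                              WB_SYNC_NONE);
    }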
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]> Cc: Brian Foster <[email protected]> Cc: Chris Mason <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
fb7d3bc4 |
| 20-Dec-2024 |
Jens Axboe <[email protected]> |
mm/filemap: drop streaming/uncached pages when writeback completes
If the folio is marked as streaming, drop pages when writeback completes. Intended to be used with RWF_DONTCACHE, to avoid needing sync writes for uncached IO.
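A condensed sketch of the completion-side hook (hedged: names assume the mainline "dropbehind" folio flag):

    void folio_end_writeback(struct folio *folio)
    {
            /* ... existing writeback completion work ... */
            if (folio_test_dropbehind(folio) && !folio_test_dirty(folio))
                    folio_end_dropbehind_write(folio);  /* prune the pages */
    }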
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]> Cc: Brian Foster <[email protected]> Cc: Chris Mason <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
8026e49b |
| 20-Dec-2024 |
Jens Axboe <[email protected]> |
mm/filemap: add read support for RWF_DONTCACHE
Add RWF_DONTCACHE as a read operation flag, which means that any data read will be removed from the page cache upon completion. Uses the page cache to synchronize, and simply prunes folios that were instantiated when the operation completes. While it would be possible to use private pages for this, using the page cache as synchronization is handy for a variety of reasons:
1) No special truncate magic is needed
2) Async buffered reads need some place to serialize, using the page cache is a lot easier than writing extra code for this
3) The pruning cost is pretty reasonable
and the code to support this is much simpler as a result.
You can think of uncached buffered IO as being the much more attractive cousin of O_DIRECT - it has none of the restrictions of O_DIRECT. Yes, it will copy the data, but unlike regular buffered IO, it doesn't run into the unpredictability of the page cache in terms of reclaim. As an example, on a test box with 32 drives, reading them with buffered IO looks as follows:
Reading bs 65536, uncached 0
  1s: 145945MB/sec
  2s: 158067MB/sec
  3s: 157007MB/sec
  4s: 148622MB/sec
  5s: 118824MB/sec
  6s: 70494MB/sec
  7s: 41754MB/sec
  8s: 90811MB/sec
  9s: 92204MB/sec
 10s: 95178MB/sec
 11s: 95488MB/sec
 12s: 95552MB/sec
 13s: 96275MB/sec
where it's quite easy to see where the page cache filled up, and performance went from good to erratic, and finally settles at a much lower rate. Looking at top while this is ongoing, we see:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 7535 root      20   0  267004      0      0 S  3199   0.0   8:40.65 uncached
 3326 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd4
 3327 root      20   0       0      0      0 R 100.0   0.0   0:17.22 kswapd5
 3328 root      20   0       0      0      0 R 100.0   0.0   0:13.29 kswapd6
 3332 root      20   0       0      0      0 R 100.0   0.0   0:11.11 kswapd10
 3339 root      20   0       0      0      0 R 100.0   0.0   0:16.25 kswapd17
 3348 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd26
 3343 root      20   0       0      0      0 R 100.0   0.0   0:16.30 kswapd21
 3344 root      20   0       0      0      0 R 100.0   0.0   0:11.92 kswapd22
 3349 root      20   0       0      0      0 R 100.0   0.0   0:16.28 kswapd27
 3352 root      20   0       0      0      0 R  99.7   0.0   0:11.89 kswapd30
 3353 root      20   0       0      0      0 R  96.7   0.0   0:16.04 kswapd31
 3329 root      20   0       0      0      0 R  96.4   0.0   0:11.41 kswapd7
 3345 root      20   0       0      0      0 R  96.4   0.0   0:13.40 kswapd23
 3330 root      20   0       0      0      0 S  91.1   0.0   0:08.28 kswapd8
 3350 root      20   0       0      0      0 S  86.8   0.0   0:11.13 kswapd28
 3325 root      20   0       0      0      0 S  76.3   0.0   0:07.43 kswapd3
 3341 root      20   0       0      0      0 S  74.7   0.0   0:08.85 kswapd19
 3334 root      20   0       0      0      0 S  71.7   0.0   0:10.04 kswapd12
 3351 root      20   0       0      0      0 R  60.5   0.0   0:09.59 kswapd29
 3323 root      20   0       0      0      0 R  57.6   0.0   0:11.50 kswapd1
 [...]
which is just showing a partial list of the 32 kswapd threads that are running mostly full tilt, burning ~28 full CPU cores.
If the same test case is run with RWF_DONTCACHE set for the buffered read, the output looks as follows:
Reading bs 65536, uncached 0
  1s: 153144MB/sec
  2s: 156760MB/sec
  3s: 158110MB/sec
  4s: 158009MB/sec
  5s: 158043MB/sec
  6s: 157638MB/sec
  7s: 157999MB/sec
  8s: 158024MB/sec
  9s: 157764MB/sec
 10s: 157477MB/sec
 11s: 157417MB/sec
 12s: 157455MB/sec
 13s: 157233MB/sec
 14s: 156692MB/sec
which is just chugging along at ~155GB/sec of read performance. Looking at top, we see:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 7961 root      20   0  267004      0      0 S  3180   0.0   5:37.95 uncached
 8024 axboe     20   0   14292   4096      0 R   1.0   0.0   0:00.13 top
where just the test app is using CPU, no reclaim is taking place outside of the main thread. Not only is performance 65% better, it's also using half the CPU to do it.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]> Cc: Brian Foster <[email protected]> Cc: Chris Mason <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
f598cdaa |
| 20-Dec-2024 |
Jens Axboe <[email protected]> |
mm/filemap: use page_cache_sync_ra() to kick off read-ahead
Rather than use the page_cache_sync_readahead() helper, define our own ractl and use page_cache_sync_ra() directly. In preparation for needing to modify ractl inside filemap_get_pages().
No functional changes in this patch.
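A sketch of the change described above:

    /* before (sketch): page_cache_sync_readahead(mapping, ra, filp, index, req);
     * after: build the ractl locally so later patches can modify it. */
    DEFINE_READAHEAD(ractl, filp, &filp->f_ra, mapping, index);

    page_cache_sync_ra(&ractl, last_index - index);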
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Brian Foster <[email protected]> Cc: Chris Mason <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
9ad63445 |
| 20-Dec-2024 |
Jens Axboe <[email protected]> |
mm/filemap: change filemap_create_folio() to take a struct kiocb
Patch series "Uncached buffered IO", v8.
5 years ago I posted patches adding support for RWF_UNCACHED, as a way to do buffered IO that isn't page cache persistent. The approach back then was to have private pages for IO, and then get rid of them once IO was done. But that then runs into all the issues that O_DIRECT has, in terms of synchronizing with the page cache.
So here's a new approach to the same concept, but using the page cache as synchronization. Due to excessive bike shedding on the naming, this is now named RWF_DONTCACHE, and is less special in that it's just page cache IO, except it prunes the ranges once IO is completed.
Why do this, you may ask? The tldr is that device speeds are only getting faster, while reclaim is not. Doing normal buffered IO can be very unpredictable, and suck up a lot of resources on the reclaim side. This leads people to use O_DIRECT as a work-around, which has its own set of restrictions in terms of size, offset, and length of IO. It's also inherently synchronous, and now you need async IO as well. While the latter isn't necessarily a big problem as we have good options available there, it also should not be a requirement when all you want to do is read or write some data without caching.
Even on desktop type systems, a normal NVMe device can fill the entire page cache in seconds. On the big system I used for testing, there's a lot more RAM, but also a lot more devices. As can be seen in some of the results in the following patches, you can still fill RAM in seconds even when there's 1TB of it. Hence this problem isn't solely a "big hyperscaler system" issue, it's common across the board.
Common for both reads and writes with RWF_DONTCACHE is that they use the page cache for IO. Reads work just like a normal buffered read would, with the only exception being that the touched ranges will get pruned after data has been copied. For writes, the ranges will get writeback kicked off before the syscall returns, and then writeback completion will prune the range. Hence writes aren't synchronous, and it's easy to pipeline writes using RWF_DONTCACHE. Folios that aren't instantiated by RWF_DONTCACHE IO are left untouched. This means that uncached IO will take advantage of the page cache for uptodate data, but will not leave anything it instantiated/created in cache.
File systems need to support this. This patchset adds support for the generic read path, which covers file systems like ext4. Patches exist to add support for iomap/XFS and btrfs as well, which sit on top of this series. If RWF_DONTCACHE IO is attempted on a file system that doesn't support it, -EOPNOTSUPP is returned. Hence the user can rely on it either working as designed, or flagging an error if that's not the case. The intent here is to give the application a sensible fallback path - eg, it may fall back to O_DIRECT if appropriate, or just live with the fact that uncached IO isn't available and do normal buffered IO.
Adding "support" to other file systems should be trivial, most of the time just a one-liner adding FOP_DONTCACHE to the fop_flags in the file_operations struct, if the file system is using either iomap or the generic filemap helpers for reading and writing.
Performance results are in patch 8 for reads, and you can find the write side results in the XFS patch adding support for DONTCACHE writes for XFS:
https://git.kernel.dk/cgit/linux/commit/?h=buffered-uncached-fs.10&id=257e92de795fdff7d7e256501e024fac6da6a7f4
with the tldr being that I see about a 65% improvement in performance for both, with fully predictable IO times. CPU reduction is substantial as well, with no kswapd activity at all for reclaim when using uncached IO.
Using it from applications is trivial - just set RWF_DONTCACHE for the read or write, using pwritev2(2) or preadv2(2). For io_uring, same thing, just set RWF_DONTCACHE in sqe->rw_flags for a buffered read/write operation. And that's it.
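To make that concrete, a minimal standalone example (hedged: the numeric fallback for RWF_DONTCACHE is an assumption; on a kernel with the feature it comes from linux/fs.h):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef RWF_DONTCACHE
    #define RWF_DONTCACHE 0x00000080  /* assumed value; check linux/fs.h */
    #endif

    int main(int argc, char **argv)
    {
            static char buf[65536];
            struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
            int fd;
            ssize_t ret;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s <file>\n", argv[0]);
                    return 1;
            }
            fd = open(argv[1], O_RDONLY);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            /* buffered read whose pages get pruned on completion */
            ret = preadv2(fd, &iov, 1, 0, RWF_DONTCACHE);
            if (ret < 0)
                    perror("preadv2");  /* EOPNOTSUPP without FOP_DONTCACHE */
            else
                    printf("read %zd bytes, pages pruned on completion\n", ret);
            close(fd);
            return 0;
    }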
Patches 1..7 are just prep patches, and should have no functional changes at all. Patch 8 adds support for the filemap path for RWF_DONTCACHE reads, and patches 9..12 are just prep patches for supporting the write side of uncached writes. In the below-mentioned branch, there are then patches to adopt uncached reads and writes for xfs, btrfs, and ext4. The latter currently relies on a bit of a hack for passing whether this is an uncached write or not through ->write_begin(), which can hopefully go away once ext4 adopts iomap for buffered writes. I say this is a hack as it's not the prettiest way to do it; however, it is fully solid and will work just fine.
Passes full xfstests and fsx overnight runs, no issues observed. That includes the vm running the testing also using RWF_DONTCACHE on the host. I'll post fsstress and fsx patches for RWF_DONTCACHE separately. As far as I'm concerned, no further work needs doing here.
And git tree for the patches is here:
https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.10
with the file system patches on top adding support for xfs/btrfs/ext4 here:
https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached-fs.10
This patch (of 12):
Rather than pass in both the file and position directly from the kiocb, just take a struct kiocb instead. With the kiocb being passed in, skip passing in the address_space separately as well. While doing so, move the ki_flags checking into filemap_create_folio() as well. In preparation for actually needing the kiocb in the function.
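A sketch of the new shape (illustrative, not the verbatim diff):

    static int filemap_create_folio(struct kiocb *iocb, struct folio_batch *fbatch)
    {
            struct address_space *mapping = iocb->ki_filp->f_mapping;
            pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;

            /* the ki_flags check moves in here too */
            if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
                    return -EAGAIN;
            /* ... allocate the folio and add it to the cache as before ... */
            return 0;
    }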
No functional changes in this patch.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Brian Foster <[email protected]> Cc: Chris Mason <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
5f537664 |
| 21-Jan-2025 |
Linus Torvalds <[email protected]> |
cachestat: fix page cache statistics permission checking
When the 'cachestat()' system call was added in commit cf264e1329fb ("cachestat: implement cachestat syscall"), it was meant to be a much more convenient (and performant) version of mincore() that didn't need mapping things into the user virtual address space in order to work.
But it ended up missing the "check for writability or ownership" fix for mincore(), done in commit 134fca9063ad ("mm/mincore.c: make mincore() more conservative").
This just adds equivalent logic to 'cachestat()', modified for the file context (rather than vma).
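A sketch of the added check, adapted from mincore()'s can_do_mincore() (the helper name mirrors the upstream fix but is reconstructed from memory):

    static bool can_do_cachestat(struct file *f)
    {
            if (f->f_mode & FMODE_WRITE)
                    return true;
            if (inode_owner_or_capable(file_mnt_idmap(f), file_inode(f)))
                    return true;
            return file_permission(f, MAY_WRITE) == 0;
    }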
Reported-by: Sudheendra Raghav Neela <[email protected]> Fixes: cf264e1329fb ("cachestat: implement cachestat syscall") Tested-by: Johannes Weiner <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Nhat Pham <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
|
|
Revision tags: v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12 |
|
| #
1168b2be |
| 16-Nov-2024 |
Dr. David Alan Gilbert <[email protected]> |
filemap: remove unused folio_add_wait_queue
folio_add_wait_queue() has been unused since 2021's commit 850cba069c26 ("cachefiles: Delete the cachefiles driver pending rewrite")
Remove it.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Dr. David Alan Gilbert <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Vishal Moola (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
1c47c578 |
| 10-Jan-2025 |
Matthew Wilcox (Oracle) <[email protected]> |
mm: fix assertion in folio_end_read()
We only need to assert that the uptodate flag is clear if we're going to set it. This hasn't been a problem before now because we have only used folio_end_read() when completing with an error, but it's convenient to use it in squashfs if we discover the folio is already uptodate.
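A sketch of the relaxed assertion (illustrative):

    unsigned long mask = 1 << PG_locked;

    VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
    if (likely(success)) {
            /* only a successful read sets uptodate, so only assert here */
            VM_BUG_ON_FOLIO(folio_test_uptodate(folio), folio);
            mask |= 1 << PG_uptodate;
    }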
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Phillip Lougher <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
f505e6c9 |
| 02-Jan-2025 |
Marco Nelissen <[email protected]> |
filemap: avoid truncating 64-bit offset to 32 bits
On 32-bit kernels, folio_seek_hole_data() was inadvertently truncating a 64-bit value to 32 bits, leading to a possible infinite loop when writing to an xfs filesystem.
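A tiny userspace demonstration of the failure class (standalone demo, not kernel code): assigning a 64-bit offset to a 32-bit variable silently discards the high bits, so a seek loop can be pinned below 4 GiB forever.

    #include <stdio.h>

    int main(void)
    {
            long long offset = 0x100001000LL;  /* file position beyond 4 GiB */
            unsigned int truncated = offset;   /* 32-bit assignment drops high bits */

            printf("64-bit offset: 0x%llx\n", offset);
            printf("truncated:     0x%x\n", truncated);  /* 0x1000: progress lost */
            return 0;
    }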
Link: https://lkml.kernel.org/r/[email protected] Fixes: 54fa39ac2e00 ("iomap: use mapping_seek_hole_data") Signed-off-by: Marco Nelissen <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
62e72d2c |
| 22-Dec-2024 |
Kairui Song <[email protected]> |
mm, madvise: fix potential workingset node list_lru leaks
Since commit 5abc1e37afa0 ("mm: list_lru: allocate list_lru_one only when needed"), all list_lru users need to allocate the items using the new infrastructure that provides list_lru info for slab allocation, ensuring that the corresponding memcg list_lru is allocated before use.
For workingset shadow nodes (which are xa_node), users are converted to use the new infrastructure by commit 9bbdc0f32409 ("xarray: use kmem_cache_alloc_lru to allocate xa_node"). The xas->xa_lru will be set correctly for filemap users. However, there is a missing case: xa_node allocations caused by madvise(..., MADV_COLLAPSE).
madvise(..., MADV_COLLAPSE) will also read in the absent parts of the file map, and there will be xa_nodes allocated for the caller's memcg (assuming it's not rootcg). However, these allocations won't trigger memcg list_lru allocation because the proper xas info was not set.
If nothing else has allocated other xa_nodes for that memcg to trigger list_lru creation, and memory pressure starts to evict file pages, workingset_update_node will try to add these xa_nodes to their corresponding memcg list_lru, and it does not exist (NULL). So they will be added to rootcg's list_lru instead.
This shouldn't be a significant issue in practice, but it is indeed unexpected behavior, and these xa_nodes will not be reclaimed effectively, which may lead to incorrect counting of the list_lru->nr_items counter.
This problem wasn't exposed until recent commit 28e98022b31ef ("mm/list_lru: simplify reparenting and initial allocation") added a sanity check: only a dying memcg could have a NULL list_lru when list_lru_{add,del} is called, and this problem triggered that WARNING.
So make madvise(..., MADV_COLLAPSE) also call xas_set_lru() to pass the list_lru which we may want to insert xa_nodes into later. Also move mapping_set_update() to mm/internal.h and turn it into a macro, to avoid including extra headers in mm/internal.h.
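The macro as moved to mm/internal.h looks roughly like this (reconstructed from the description above):

    #define mapping_set_update(xas, mapping) do {                   \
            if (!dax_mapping(mapping) && !shmem_mapping(mapping)) { \
                    xas_set_update(xas, workingset_update_node);    \
                    xas_set_lru(xas, &shadow_nodes);                \
            }                                                       \
    } while (0)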
Link: https://lkml.kernel.org/r/[email protected] Fixes: 9bbdc0f32409 ("xarray: use kmem_cache_alloc_lru to allocate xa_node") Reported-by: [email protected] Closes: https://lore.kernel.org/lkml/[email protected]/ Signed-off-by: Kairui Song <[email protected]> Cc: Chengming Zhou <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
8392bc2f |
| 15-Nov-2024 |
Josef Bacik <[email protected]> |
fsnotify: generate pre-content permission event on page fault
FS_PRE_ACCESS will be generated on page fault depending on the faulting method. This pre-content event is meant to be used by hierarchical storage managers that want to fill in the file content on first read access.
Export a simple helper that file systems that have their own ->fault() will use, and add a more complicated helper to do fancy things in filemap_fault.
Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jan Kara <[email protected]> Link: https://patch.msgid.link/aa56c50ce81b1fd18d7f5d71dd2dfced5eba9687.1731684329.git.josef@toxicpanda.com
|
| #
fac84846 |
| 15-Nov-2024 |
Josef Bacik <[email protected]> |
fanotify: disable readahead if we have pre-content watches
With page faults we can trigger readahead on the file, and then subsequent faults can find these pages and insert them into the file without emitting an fanotify event. To avoid this case, disable readahead if we have pre-content watches on the file. This way we are guaranteed to get an event for every range we attempt to access on a pre-content watched file.
Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jan Kara <[email protected]> Link: https://patch.msgid.link/70a54e859f555e54bc7a47b32fe5aca92b085615.1731684329.git.josef@toxicpanda.com
|
| #
3203b3ab |
| 29-Nov-2024 |
David Hildenbrand <[email protected]> |
mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio()
The folio can get freed + buddy-merged + reallocated in the meantime, resulting in us calling folio_test_locked() possibly on a tail page.
This makes const_folio_flags() trigger VM_BUG_ON_PGFLAGS() when stumbling over the tail page.
Could this result in other issues? Doesn't look like it. False positives and false negatives don't really matter, because this folio would get skipped either way when detecting that they have been reallocated in the meantime.
Fix it by performing the folio_test_locked() check after grabbing a reference. If this ever becomes a real problem, we could add a special helper that racily checks if the bit is set even on tail pages ... but let's hope that's not required, so we can just handle it more cleanly: work on the folio after we hold a reference.
Do we really need the folio_test_locked() check if we are going to trylock briefly after? Well, we can at least avoid a xas_reload().
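A sketch of the reordering (illustrative):

    /* take the reference first, then test flags, so flags are never
     * tested on a page that was freed and reallocated under us */
    if (!folio_try_get(folio))
            continue;
    if (folio_test_locked(folio) || unlikely(folio != xas_reload(xas))) {
            folio_put(folio);
            continue;
    }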
It's a bit unclear which exact change introduced that issue. Likely, ever since we made PG_locked obey the PF_NO_TAIL policy it could have been triggered in some way.
Link: https://lkml.kernel.org/r/[email protected] Fixes: 48c935ad88f5 ("page-flags: define PG_locked behavior on compound pages") Signed-off-by: David Hildenbrand <[email protected]> Reported-by: [email protected] Closes: https://lore.kernel.org/lkml/[email protected]/ Acked-by: Kirill A. Shutemov <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Hillf Danton <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|