|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3 |
|
| #
d2da21a6 |
| 10-Feb-2025 |
Qu Wenruo <[email protected]> |
btrfs: introduce a read path dedicated extent lock helper
Currently we're using btrfs_lock_and_flush_ordered_range() for both btrfs_read_folio() and btrfs_readahead(), but it has one critical problem for future subpage optimizations:
- It will call btrfs_start_ordered_extent() to write back the involved folios
But remember that we call btrfs_lock_and_flush_ordered_range() in the read paths, meaning the folio is already locked by the read path.
If we really trigger writeback for those already locked folios, this will lead to a deadlock, because writeback cannot get the folio lock.
Such a deadlock is currently prevented by the fact that btrfs always keeps a dirty folio uptodate as well, either by dirtying all blocks of the folio or by reading the whole folio before dirtying it.
To prepare for the upcoming patch, which allows btrfs to skip the full folio read if the buffered write is block aligned, we have to solve the possible deadlock first.
Instead of blindly calling btrfs_start_ordered_extent(), introduce a new helper, which is smarter in the following ways:
- Only wait and flush the ordered extent if
  * The folio doesn't even have private bit set
  * Part of the blocks of the ordered extent are not uptodate

  This can happen by:
  * The folio writeback finished, then got invalidated.
    There are a lot of reasons that a folio can get invalidated, from memory pressure to direct IO (which invalidates all folios of the range).
    But OE not yet finished.

  We have to wait for the ordered extent, as the OE may contain to-be-inserted data checksum.
  Without waiting, our read can fail due to the missing checksum.

  But either way, the OE should not need any extra flush inside the locked folio range.

- Skip the ordered extent completely if
  * All the blocks are dirty
    This happens when OE creation is caused by a folio writeback whose file offset is before our folio.

    E.g. 16K page size and 4K block size

    0       8K      16K     24K     32K
    |///////////////||//////|       |

    The writeback of folio 0 created an OE for range [0, 24K), but since folio 16K is not fully uptodate, a read is triggered for folio 16K.

    The writeback will never happen (we're holding the folio lock for read), nor will the OE finish.

    Thus we must skip the range.

  * All the blocks are uptodate
    This happens when the writeback finished, but OE not yet finished.

    Since the blocks are already uptodate, we can skip the OE range.
The new helper lock_extents_for_read() will do a loop for the target range by:
1) Lock the full range
2) If there is no ordered extent in the remaining range, exit
3) If there is an ordered extent that we can skip
   Skip to the end of the OE, and continue checking.
   We do not trigger writeback nor wait for the OE.
4) If there is an ordered extent that we cannot skip
   Unlock the whole extent range and start the ordered extent.
And also update btrfs_start_ordered_extent() to add two more parameters: @nowriteback_start and @nowriteback_len, to prevent triggering flush for a certain range.
This will allow us to handle the following case properly in the future:
16K page size, 4K btrfs block size:

  0       4K      8K      12K     16K     20K     24K     28K     32K
  |//////////////////////////////||///////////////|       |       |
  |<-------------------- OE 2 ------------------->|       |< OE 1>|
The folio has been written back before, thus we have an OE at [28K, 32K). Although OE 1 finished its IO, the OE is not yet removed from the IO tree. The folio got invalidated after writeback completed and before the ordered extent finished.
And the [16K, 24K) range is dirty and uptodate, caused by a block aligned buffered write (and future enhancements allowing btrfs to skip the full folio read for such a case). But writeback for folio 0 has begun, thus it generated OE 2, covering range [0, 24K).
Since the full folio 16K is not uptodate, if we want to read the folio, the existing btrfs_lock_and_flush_ordered_range() will deadlock, by:
  btrfs_read_folio()
  | Folio 16K is already locked
  |- btrfs_lock_and_flush_ordered_range()
     |- btrfs_start_ordered_extent() for range [16K, 24K)
        |- filemap_fdatawrite_range() for range [16K, 24K)
           |- extent_write_cache_pages()
              folio_lock() on folio 16K, deadlock.
But now we will have the following sequence:
  btrfs_read_folio()
  | Folio 16K is already locked
  |- lock_extents_for_read()
     |- can_skip_ordered_extent() for range [16K, 24K)
     |  Returned true, the range [16K, 24K) will be skipped.
     |- can_skip_ordered_extent() for range [28K, 32K)
     |  Returned false.
     |- btrfs_start_ordered_extent() for range [28K, 32K) with
        [16K, 32K) as no writeback range
        No writeback for folio 16K will be triggered.
And there will be no more possible deadlock on the same folio.
Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>
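The skip rules described in this commit message can be sketched in user space as follows. This is only an illustrative model, not the kernel implementation: struct folio_model, block_mask() and can_skip_range() are hypothetical names, and the per-block dirty/uptodate state is modeled as plain bitmaps for a 16K folio with 4K blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLOCKS_PER_FOLIO 4u /* assumption: 16K folio, 4K blocks */

/* Hypothetical stand-in for a folio and its per-block state. */
struct folio_model {
	bool has_private;  /* models the folio private bit */
	uint8_t dirty;     /* bit n set: block n is dirty */
	uint8_t uptodate;  /* bit n set: block n is uptodate */
};

/* Build a bitmask covering blocks [first, first + nr) of the folio. */
static uint8_t block_mask(unsigned int first, unsigned int nr)
{
	assert(nr > 0 && first + nr <= BLOCKS_PER_FOLIO);
	return (uint8_t)(((1u << nr) - 1u) << first);
}

/*
 * Return true if the part of an ordered extent covering the given
 * blocks can be skipped by the read path, following the rules listed
 * in the commit message.
 */
static bool can_skip_range(const struct folio_model *f,
			   unsigned int first, unsigned int nr)
{
	uint8_t mask = block_mask(first, nr);

	/* No private bit: we must wait for the ordered extent. */
	if (!f->has_private)
		return false;
	/* All blocks dirty: writeback cannot run while we hold the lock. */
	if ((f->dirty & mask) == mask)
		return true;
	/* All blocks uptodate: nothing needs to be read from disk. */
	if ((f->uptodate & mask) == mask)
		return true;
	/* Some blocks are not uptodate: wait for the ordered extent. */
	return false;
}
```

With folio 16K from the example above (blocks 0 and 1 dirty and uptodate, block 3 neither), the model skips the [16K, 24K) part of OE 2 and does not skip OE 1 at [28K, 32K).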
|
|
Revision tags: v6.14-rc2, v6.14-rc1, v6.13 |
|
| #
0d85f5c2 |
| 13-Jan-2025 |
Filipe Manana <[email protected]> |
btrfs: fix assertion failure when splitting ordered extent after transaction abort
If while we are doing a direct IO write a transaction abort happens, we mark all existing ordered extents with the BTRFS_ORDERED_IOERR flag (done at btrfs_destroy_ordered_extents()), and then after that if we enter btrfs_split_ordered_extent() and the ordered extent has bytes left (meaning we have a bio that doesn't cover the whole ordered extent, see details at btrfs_extract_ordered_extent()), we will fail on the following assertion at btrfs_split_ordered_extent():
ASSERT(!(flags & ~BTRFS_ORDERED_TYPE_FLAGS));
because the BTRFS_ORDERED_IOERR flag is set and the definition of BTRFS_ORDERED_TYPE_FLAGS is just the union of all flags that identify the type of write (regular, nocow, prealloc, compressed, direct IO, encoded).
Fix this by returning an error from btrfs_extract_ordered_extent() if we find the BTRFS_ORDERED_IOERR flag in the ordered extent. The error will be the error that resulted in the transaction abort or -EIO if no transaction abort happened.
This was recently reported by syzbot with the following trace:
FAULT_INJECTION: forcing a failure. name failslab, interval 1, probability 0, space 0, times 1 CPU: 0 UID: 0 PID: 5321 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 fail_dump lib/fault-inject.c:53 [inline] should_fail_ex+0x3b0/0x4e0 lib/fault-inject.c:154 should_failslab+0xac/0x100 mm/failslab.c:46 slab_pre_alloc_hook mm/slub.c:4072 [inline] slab_alloc_node mm/slub.c:4148 [inline] __do_kmalloc_node mm/slub.c:4297 [inline] __kmalloc_noprof+0xdd/0x4c0 mm/slub.c:4310 kmalloc_noprof include/linux/slab.h:905 [inline] kzalloc_noprof include/linux/slab.h:1037 [inline] btrfs_chunk_alloc_add_chunk_item+0x244/0x1100 fs/btrfs/volumes.c:5742 reserve_chunk_space+0x1ca/0x2c0 fs/btrfs/block-group.c:4292 check_system_chunk fs/btrfs/block-group.c:4319 [inline] do_chunk_alloc fs/btrfs/block-group.c:3891 [inline] btrfs_chunk_alloc+0x77b/0xf80 fs/btrfs/block-group.c:4187 find_free_extent_update_loop fs/btrfs/extent-tree.c:4166 [inline] find_free_extent+0x42d1/0x5810 fs/btrfs/extent-tree.c:4579 btrfs_reserve_extent+0x422/0x810 fs/btrfs/extent-tree.c:4672 btrfs_new_extent_direct fs/btrfs/direct-io.c:186 [inline] btrfs_get_blocks_direct_write+0x706/0xfa0 fs/btrfs/direct-io.c:321 btrfs_dio_iomap_begin+0xbb7/0x1180 fs/btrfs/direct-io.c:525 iomap_iter+0x697/0xf60 fs/iomap/iter.c:90 __iomap_dio_rw+0xeb9/0x25b0 fs/iomap/direct-io.c:702 btrfs_dio_write fs/btrfs/direct-io.c:775 [inline] btrfs_direct_write+0x610/0xa30 fs/btrfs/direct-io.c:880 btrfs_do_write_iter+0x2a0/0x760 fs/btrfs/file.c:1397 do_iter_readv_writev+0x600/0x880 vfs_writev+0x376/0xba0 fs/read_write.c:1050 do_pwritev fs/read_write.c:1146 [inline] __do_sys_pwritev2 fs/read_write.c:1204 [inline] __se_sys_pwritev2+0x196/0x2b0 fs/read_write.c:1195 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 
arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f1281f85d29 RSP: 002b:00007f12819fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000148 RAX: ffffffffffffffda RBX: 00007f1282176080 RCX: 00007f1281f85d29 RDX: 0000000000000001 RSI: 0000000020000240 RDI: 0000000000000005 RBP: 00007f12819fe090 R08: 0000000000000000 R09: 0000000000000003 R10: 0000000000007000 R11: 0000000000000246 R12: 0000000000000002 R13: 0000000000000000 R14: 00007f1282176080 R15: 00007ffcb9e23328 </TASK> BTRFS error (device loop0 state A): Transaction aborted (error -12) BTRFS: error (device loop0 state A) in btrfs_chunk_alloc_add_chunk_item:5745: errno=-12 Out of memory BTRFS info (device loop0 state EA): forced readonly assertion failed: !(flags & ~BTRFS_ORDERED_TYPE_FLAGS), in fs/btrfs/ordered-data.c:1234 ------------[ cut here ]------------ kernel BUG at fs/btrfs/ordered-data.c:1234! Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 0 UID: 0 PID: 5321 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:btrfs_split_ordered_extent+0xd8d/0xe20 fs/btrfs/ordered-data.c:1234 RSP: 0018:ffffc9000d1df2b8 EFLAGS: 00010246 RAX: 0000000000000057 RBX: 000000000006a000 RCX: 9ce21886c4195300 RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000 RBP: 0000000000000091 R08: ffffffff817f0a3c R09: 1ffff92001a3bdf4 R10: dffffc0000000000 R11: fffff52001a3bdf5 R12: 1ffff1100a45f401 R13: ffff8880522fa018 R14: dffffc0000000000 R15: 000000000006a000 FS: 00007f12819fe6c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000557750bd7da8 CR3: 00000000400ea000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> btrfs_extract_ordered_extent fs/btrfs/direct-io.c:702 [inline] 
btrfs_dio_submit_io+0x4be/0x6d0 fs/btrfs/direct-io.c:737 iomap_dio_submit_bio fs/iomap/direct-io.c:85 [inline] iomap_dio_bio_iter+0x1022/0x1740 fs/iomap/direct-io.c:447 __iomap_dio_rw+0x13b7/0x25b0 fs/iomap/direct-io.c:703 btrfs_dio_write fs/btrfs/direct-io.c:775 [inline] btrfs_direct_write+0x610/0xa30 fs/btrfs/direct-io.c:880 btrfs_do_write_iter+0x2a0/0x760 fs/btrfs/file.c:1397 do_iter_readv_writev+0x600/0x880 vfs_writev+0x376/0xba0 fs/read_write.c:1050 do_pwritev fs/read_write.c:1146 [inline] __do_sys_pwritev2 fs/read_write.c:1204 [inline] __se_sys_pwritev2+0x196/0x2b0 fs/read_write.c:1195 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f1281f85d29 RSP: 002b:00007f12819fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000148 RAX: ffffffffffffffda RBX: 00007f1282176080 RCX: 00007f1281f85d29 RDX: 0000000000000001 RSI: 0000000020000240 RDI: 0000000000000005 RBP: 00007f12819fe090 R08: 0000000000000000 R09: 0000000000000003 R10: 0000000000007000 R11: 0000000000000246 R12: 0000000000000002 R13: 0000000000000000 R14: 00007f1282176080 R15: 00007ffcb9e23328 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:btrfs_split_ordered_extent+0xd8d/0xe20 fs/btrfs/ordered-data.c:1234 RSP: 0018:ffffc9000d1df2b8 EFLAGS: 00010246 RAX: 0000000000000057 RBX: 000000000006a000 RCX: 9ce21886c4195300 RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000 RBP: 0000000000000091 R08: ffffffff817f0a3c R09: 1ffff92001a3bdf4 R10: dffffc0000000000 R11: fffff52001a3bdf5 R12: 1ffff1100a45f401 R13: ffff8880522fa018 R14: dffffc0000000000 R15: 000000000006a000 FS: 00007f12819fe6c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000557750bd7da8 CR3: 00000000400ea000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
In this case the transaction abort was due to (an injected) memory allocation failure when attempting to allocate a new chunk.
Reported-by: [email protected] Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Fixes: 52b1fdca23ac ("btrfs: handle completed ordered extents in btrfs_split_ordered_extent") Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
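The failing assertion and the fix can be illustrated with a small user-space sketch. The flag bit positions, the ORDERED_* names, split_assert_holds() and extract_check() are all assumptions made for illustration; they do not reproduce the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical bit assignments; the kernel's values may differ. */
#define ORDERED_REGULAR    (1UL << 0)
#define ORDERED_NOCOW      (1UL << 1)
#define ORDERED_PREALLOC   (1UL << 2)
#define ORDERED_COMPRESSED (1UL << 3)
#define ORDERED_DIRECT     (1UL << 4)
#define ORDERED_ENCODED    (1UL << 5)
#define ORDERED_IOERR      (1UL << 6) /* state flag, not a type flag */

/* Union of the flags that identify the type of write. */
#define ORDERED_TYPE_FLAGS (ORDERED_REGULAR | ORDERED_NOCOW | \
			    ORDERED_PREALLOC | ORDERED_COMPRESSED | \
			    ORDERED_DIRECT | ORDERED_ENCODED)

/* Models the condition checked by the assertion in the split path. */
static bool split_assert_holds(unsigned long flags)
{
	return !(flags & ~ORDERED_TYPE_FLAGS);
}

/*
 * Models the fix: bail out before splitting if the ordered extent is
 * marked with the IO error flag. abort_error is the (negative) error
 * that aborted the transaction, or 0 if there was no abort.
 */
static int extract_check(unsigned long flags, int abort_error)
{
	assert(abort_error <= 0);
	if (flags & ORDERED_IOERR)
		return abort_error ? abort_error : -EIO;
	return 0;
}
```

In this model a direct IO ordered extent marked with the error flag fails the assertion condition, while extract_check() returns the abort error (or -EIO) before the split would be attempted.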
|
|
Revision tags: v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7 |
|
| #
06cf321a |
| 07-Nov-2024 |
Ira Weiny <[email protected]> |
range: Add range_overlaps()
Code to support CXL Dynamic Capacity devices will have extent ranges which need to be compared for intersection not a subset as is being checked in range_contains().
range_overlaps() is defined in btrfs with a different meaning from what is required in the standard range code. Dan Williams pointed this out in [1]. Adjust the btrfs call according to his suggestion there.
Then add a generic range_overlaps().
Cc: Dan Williams <[email protected]> Cc: Chris Mason <[email protected]> Cc: Josef Bacik <[email protected]> Cc: David Sterba <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/all/[email protected]/ [1] Acked-by: David Sterba <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Fan Ni <[email protected]> Reviewed-by: Dave Jiang <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Signed-off-by: Ira Weiny <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Dave Jiang <[email protected]>
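The difference between a subset check and an intersection check can be sketched as below. The struct layout with an inclusive end and the model_* function names are assumptions for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: both bounds are inclusive. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* True if r2 lies entirely within r1 (subset check). */
static bool model_range_contains(const struct range *r1,
				 const struct range *r2)
{
	assert(r1->start <= r1->end && r2->start <= r2->end);
	return r1->start <= r2->start && r1->end >= r2->end;
}

/* True if r1 and r2 share at least one value (intersection check). */
static bool model_range_overlaps(const struct range *r1,
				 const struct range *r2)
{
	assert(r1->start <= r1->end && r2->start <= r2->end);
	return r1->start <= r2->end && r1->end >= r2->start;
}
```

For example, [0, 99] and [50, 149] overlap but neither contains the other, while [0, 99] both contains and overlaps [10, 20].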
|
|
Revision tags: v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2 |
|
| #
a6752a6e |
| 02-Oct-2024 |
Matthew Wilcox (Oracle) <[email protected]> |
btrfs: Switch from using the private_2 flag to owner_2
We are close to removing the private_2 flag, so switch btrfs to using owner_2 for its ordered flag. This is mostly used by buffer head filesystems, so btrfs can use it because it doesn't use buffer heads.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6 |
|
| #
1b6e068a |
| 29-Aug-2024 |
Filipe Manana <[email protected]> |
btrfs: add and use helper to verify the calling task has locked the inode
We have a few places that check if we have the inode locked by doing:
ASSERT(inode_is_locked(vfs_inode));
This actually proved to be useful several times as if assertions are enabled (and by default they are in many distros) it immediately triggers a crash which is impossible for users to miss.
However that doesn't check if the lock is held by the calling task, so the check passes if some other task locked the inode.
Using one of the lockdep functions to check the lock is held, like lockdep_assert_held() for example, does check that the calling task holds the lock, and if that's not the case it produces a warning and stack trace in dmesg. However, despite the misleading "assert" in the name of the lockdep helpers, it does not trigger a crash/BUG_ON(), just a warning and splat in dmesg, which can easily go unnoticed by users who may have lockdep enabled.
So add a helper that does the ASSERT() and calls lockdep_assert_held() immediately after, and use it everywhere we check that the inode is locked. This way, if the lock is held by some other task, we get the warning in dmesg, which is caught by fstests, is very helpful during development, and may also be occasionally noticed by users with lockdep enabled.
Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
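The distinction between "the lock is held by someone" and "the lock is held by the calling task" can be modeled in user space as below. struct lock_model and the model_* functions are hypothetical; real kernel code would rely on inode_is_locked() and lockdep, and the boolean return value here merely stands in for the lockdep warning.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a lock that records its owner. */
struct lock_model {
	bool locked;
	int owner_task; /* valid only while locked */
};

/* Models the weaker check: somebody holds the lock. */
static bool model_is_locked(const struct lock_model *lock)
{
	return lock->locked;
}

/* Models the stronger check: the given task holds the lock. */
static bool model_held_by(const struct lock_model *lock, int task)
{
	return lock->locked && lock->owner_task == task;
}

/*
 * Models the combined helper: crash hard if nobody holds the lock,
 * then report (instead of warn) whether the caller is the owner.
 */
static bool model_assert_inode_locked(const struct lock_model *lock,
				      int task)
{
	assert(model_is_locked(lock));
	return model_held_by(lock, task);
}
```

With the lock taken by task 1, the weaker check passes for any caller, while the combined helper returns false for task 2, which corresponds to the case where only the lockdep warning would fire.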
|
|
Revision tags: v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1 |
|
| #
a7922801 |
| 24-Jul-2024 |
Josef Bacik <[email protected]> |
btrfs: convert btrfs_mark_ordered_io_finished() to take a folio
We only need a folio now, so make it take a folio as an argument and update all of the callers.
Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
|
| #
aef665d6 |
| 24-Jul-2024 |
Josef Bacik <[email protected]> |
btrfs: convert btrfs_finish_ordered_extent() to take a folio
The callers and callees of this now all use folios, so update it to take a folio as well.
Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
|
| #
0a577636 |
| 24-Jul-2024 |
Josef Bacik <[email protected]> |
btrfs: convert can_finish_ordered_extent() to use a folio
Pass in a folio instead, and use a folio instead of a page.
Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
|
|
Revision tags: v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3 |
|
| #
a1f4e3d7 |
| 31-Jan-2022 |
David Sterba <[email protected]> |
btrfs: switch btrfs_ordered_extent::inode to struct btrfs_inode
The structure is internal, so we should use struct btrfs_inode for that, which allows us to remove some uses of BTRFS_I.
Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: David Sterba <[email protected]>
|
| #
8b62f14d |
| 03-Jun-2024 |
Filipe Manana <[email protected]> |
btrfs: update panic message when splitting ordered extent
During ordered extent splitting if we find a duplicated ordered extent when attempting to insert the new ordered extent we panic but with a message that has the "zoned:" prefix. This is because the splitting used to be exclusive for zoned filesystems, but as of commit b73a6fd1b1ef ("btrfs: split partial dio bios before submit") it can also be done for non zoned filesystems during direct IO writes. So remove the "zoned:" prefix from the message and mention the split to make it more specific and different from the panic message at insert_ordered_extent().
Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
|
| #
b7ac1acb |
| 03-Jun-2024 |
Filipe Manana <[email protected]> |
btrfs: mark ordered extent insertion failure checks as unlikely
We never expect an ordered extent insertion to fail due to already having another ordered extent in the tree for the same file offset, since we always wait for existing ordered extents in a range to complete before writing into the range again. So mark the failure checks for the results of tree_insert() as unlikely, to make it clear it's never expected (save exceptional causes like bugs or memory corruptions) and to serve as a hint for the compiler to possibly generate better code.
Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
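Marking a failure check as unlikely can be sketched in user space as follows. The unlikely() macro below is defined locally on top of the GCC/Clang builtin __builtin_expect(); struct offset_set and offset_set_insert() are hypothetical and use a flat array rather than the red black tree the kernel uses.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Branch hint; falls back to a plain condition on other compilers. */
#if defined(__GNUC__) || defined(__clang__)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define unlikely(x) (x)
#endif

#define MAX_ENTRIES 16

/* Hypothetical container keyed by file offset. */
struct offset_set {
	uint64_t offsets[MAX_ENTRIES];
	size_t count;
};

/*
 * Insert a file offset. A duplicate is never expected, so the check
 * is marked unlikely: it documents the expectation and hints the
 * compiler to lay out the common path first.
 */
static int offset_set_insert(struct offset_set *set, uint64_t file_offset)
{
	size_t i;

	assert(set != NULL);
	for (i = 0; i < set->count; i++) {
		if (unlikely(set->offsets[i] == file_offset))
			return -EEXIST;
	}
	if (set->count == MAX_ENTRIES)
		return -ENOSPC;
	set->offsets[set->count++] = file_offset;
	return 0;
}
```

The hint does not change the result of the check; it only affects how the compiler may arrange the generated code.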
|
| #
cb3cd624 |
| 03-Jun-2024 |
Filipe Manana <[email protected]> |
btrfs: avoid removal and re-insertion of split ordered extent
At btrfs_split_ordered_extent(), we are removing and re-inserting the ordered extent that we are trimming, but we don't need to since the trimming doesn't change its position in the red black tree because we don't have overlapping ordered extents (that would imply double allocation of extents) and we know the split length is smaller than the ordered extent's num_bytes field (we checked that early in the function).
So drop the remove and re-insert code for the split ordered extent.
Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
|
| #
c18ca3c9 |
| 03-Jun-2024 |
Filipe Manana <[email protected]> |
btrfs: add comment about locking to btrfs_split_ordered_extent()
There are subtle details about why the root's ordered_extent_lock is held, so add a comment mentioning them.
Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
|
| #
ac1f580c |
| 03-Jun-2024 |
Filipe Manana <[email protected]> |
btrfs: reduce critical section at btrfs_wait_ordered_extents()
At btrfs_wait_ordered_extents(), there's no point in updating the counters after locking the root's ordered extent lock, as the counters are local. So change this to update the counters before taking the lock.
Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
|
| #
03103ecf |
| 03-Jun-2024 |
Filipe Manana <[email protected]> |
btrfs: reduce critical section at btrfs_wait_ordered_roots()
At btrfs_wait_ordered_roots(), there's no point in decrementing the counter after locking fs_info->ordered_root_lock as the counter is local. So change this to decrement the counter before taking the lock.
Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
|
| #
9fec848b |
| 03-May-2024 |
Qu Wenruo <[email protected]> |
btrfs: cleanup duplicated parameters related to create_io_em()
Most parameters of create_io_em() can be replaced by the members with the same name inside btrfs_file_extent.
Do a direct parameters cleanup here.
Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
| #
e9ea31fb |
| 03-May-2024 |
Qu Wenruo <[email protected]> |
btrfs: cleanup duplicated parameters related to btrfs_alloc_ordered_extent
All parameters after @filepos of btrfs_alloc_ordered_extent() can be replaced with btrfs_file_extent structure.
This patch does the cleanup, meanwhile some points to note:
- Move btrfs_file_extent structure to ordered-data.h
  The structure is needed by both btrfs_alloc_ordered_extent() and can_nocow_extent(), but since btrfs_inode.h includes ordered-data.h, we need to move the structure to ordered-data.h.
- Move the special handling of NOCOW/PREALLOC into btrfs_alloc_ordered_extent()
  This is to allow btrfs_split_ordered_extent() to properly split them for DIO. For now just move the handling into btrfs_alloc_ordered_extent() to simplify the callers.
Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
|
| #
42317ab4 |
| 14-May-2024 |
David Sterba <[email protected]> |
btrfs: simplify range parameters of btrfs_wait_ordered_roots()
The range is specified only in two ways, we can simplify the case for the whole filesystem range as a NULL block group parameter.
Signed-off-by: David Sterba <[email protected]>
|
| #
e641e323 |
| 18-May-2024 |
Filipe Manana <[email protected]> |
btrfs: pass a btrfs_inode to btrfs_wait_ordered_range()
Instead of passing a (VFS) inode pointer argument, pass a btrfs_inode instead, as this is generally what we do for internal APIs, making it more consistent with most of the code base. This will later help remove a lot of BTRFS_I() calls in btrfs_sync_file().
Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
|
| #
cef2daba |
| 18-May-2024 |
Filipe Manana <[email protected]> |
btrfs: pass a btrfs_inode to btrfs_fdatawrite_range()
Instead of passing a (VFS) inode pointer argument, pass a btrfs_inode instead, as this is generally what we do for internal APIs, making it more consistent with most of the code base. This will later help remove a lot of BTRFS_I() calls in btrfs_sync_file().
Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
|
| #
c41881ae |
| 15-May-2024 |
Filipe Manana <[email protected]> |
btrfs: make btrfs_finish_ordered_extent() return void
Currently btrfs_finish_ordered_extent() returns a boolean indicating if the ordered extent was added to the work queue for completion, but none of its callers cares about it, so make it return void.
Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
|
| #
3441b070 |
| 14-May-2024 |
Filipe Manana <[email protected]> |
btrfs: fix function name in comment for btrfs_remove_ordered_extent()
Due to a refactoring introduced by commit 53d9981ca20e ("btrfs: split btrfs_alloc_ordered_extent to allocation and insertion helpers"), the function btrfs_alloc_ordered_extent() was renamed to alloc_ordered_extent(), so the comment at btrfs_remove_ordered_extent() is no longer very accurate. Update the comment to refer to the new name "alloc_ordered_extent()".
Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
|
| #
f13e01b8 |
| 17-May-2024 |
Filipe Manana <[email protected]> |
btrfs: ensure fast fsync waits for ordered extents after a write failure
If a write path in COW mode fails, either before submitting a bio for the new extents or an actual IO error happens, we can end up allowing a fast fsync to log file extent items that point to unwritten extents.
This is because dropping the extent maps happens when completing ordered extents, at btrfs_finish_one_ordered(), and the completion of an ordered extent is executed in a work queue.
This can result in a fast fsync starting to log file extent items based on existing extent maps before the ordered extents complete. The log then has file extent items that point to unwritten extents, resulting in a corrupt file if a crash happens afterwards and the log tree is replayed the next time the fs is mounted.
This can happen for both direct IO writes and buffered writes.
For example consider a direct IO write, in COW mode, that fails at btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an error:
1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter set to false, meaning an error happened;
2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR flag;
3) btrfs_finish_ordered_extent() queues the completion of the ordered extent - so that btrfs_finish_one_ordered() will be executed later in a work queue. That function will drop extent maps in the range when it's executed, since the extent maps point to unwritten locations (signaled by the BTRFS_ORDERED_IOERR flag);
4) After calling btrfs_finish_ordered_extent() we keep going down the write path and unlock the inode;
5) After that a fast fsync starts and locks the inode;
6) Before the work queue executes btrfs_finish_one_ordered(), the fsync task sees the extent maps that point to the unwritten locations and logs file extent items based on them - it does not know they are unwritten, and the fast fsync path does not wait for ordered extents to complete, which is an intentional behaviour in order to reduce latency.
For the buffered write case, here's one example:
1) A fast fsync begins, and it starts by flushing delalloc and waiting for the writeback to complete by calling filemap_fdatawait_range();
2) Flushing the delalloc created a new extent map X;
3) During the writeback some IO error happened, and at the end io callback (end_bbio_data_write()) we call btrfs_finish_ordered_extent(), which sets the BTRFS_ORDERED_IOERR flag in the ordered extent and queues its completion;
4) After queuing the ordered extent completion, the end io callback clears the writeback flag from all pages (or folios), and from that moment the fast fsync can proceed;
5) The fast fsync proceeds, sees extent map X, and logs a file extent item based on it, resulting in a log that points to an unwritten data extent - because the ordered extent completion hasn't run yet, it happens only after the logging.
To fix this make btrfs_finish_ordered_extent() set the inode flag BTRFS_INODE_NEEDS_FULL_SYNC in case an error happened for a COW write, so that a fast fsync will wait for ordered extent completion.
Note that this issue of using extent maps that point to unwritten locations cannot happen for reads, because in read paths we start by locking the extent range and wait for any ordered extents in the range to complete before looking for extent maps.
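The fix described above can be modelled in user space as follows. This is a minimal sketch, not the kernel code: the model_* types and functions, the flag values, and the `nocow` field used to distinguish COW from NOCOW writes are all assumptions made for illustration. Only the behaviour is taken from the commit message (on error, set the IOERR flag on the ordered extent, and for a COW write also flag the inode so the next fsync waits).

```c
#include <stdbool.h>

/* Hypothetical flag values; the kernel uses bit numbers and set_bit(). */
#define MODEL_ORDERED_IOERR         (1u << 0)
#define MODEL_INODE_NEEDS_FULL_SYNC (1u << 0)

struct model_inode {
	unsigned int runtime_flags;
};

struct model_ordered_extent {
	struct model_inode *inode;
	unsigned int flags;
	bool nocow;			/* assumed marker for NOCOW writes */
	bool completion_queued;
};

/* Models the error handling described for btrfs_finish_ordered_extent(). */
static void model_finish_ordered_extent(struct model_ordered_extent *oe,
					bool uptodate)
{
	if (!uptodate) {
		oe->flags |= MODEL_ORDERED_IOERR;
		/* The fix: a failed COW write makes the next fsync wait. */
		if (!oe->nocow)
			oe->inode->runtime_flags |= MODEL_INODE_NEEDS_FULL_SYNC;
	}
	/* Completion still runs later, from a work queue. */
	oe->completion_queued = true;
}

/* Models the fsync-side check: must ordered extents be waited for? */
static bool model_fsync_must_wait(const struct model_inode *inode)
{
	return (inode->runtime_flags & MODEL_INODE_NEEDS_FULL_SYNC) != 0;
}
```

Because the flag lives on the inode rather than on the ordered extent, the fsync task can see it as soon as it locks the inode, even though the queued completion has not run yet.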
Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
|
| #
aa5ccf29 |
| 03-Apr-2024 |
Josef Bacik <[email protected]> |
btrfs: handle errors in btrfs_reloc_clone_csums properly
In the cow path we will clone the reloc csums for relocated data extents, and if there's an error we already have an ordered extent and rely on the ordered extent finishing to clean everything up.
There's a problem, however: we don't mark the ordered extent with an error, we pretend like everything was just fine. If we were at the end of our range we won't actually bubble up this error anywhere, and we could end up inserting an extent that doesn't have csums where it should have them.
Fix this by adding a helper to mark the ordered extent with an error, and then use this when we fail to look up the csums in btrfs_reloc_clone_csums. Use this helper in the other place where we use the same pattern while we're here.
This will prevent us from erroneously inserting the extent that doesn't have the required checksums.
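The pattern can be sketched as a small user-space model. All model_* names are hypothetical and the flag value is an assumption; the sketch only shows the idea from the message above: a failing csum step marks the ordered extent, and completion refuses to insert an extent that carries the error mark.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag value; stands in for an IOERR bit on the ordered extent. */
#define MODEL_ORDERED_IOERR (1u << 0)

struct model_ordered_extent {
	unsigned int flags;
};

/* Models a helper that marks the ordered extent as failed. */
static void model_mark_ordered_extent_error(struct model_ordered_extent *oe)
{
	oe->flags |= MODEL_ORDERED_IOERR;
}

/*
 * Models the caller: csum_ret stands in for the result of the csum
 * lookup. On failure the ordered extent is marked before returning.
 */
static int model_clone_csums(struct model_ordered_extent *oe, int csum_ret)
{
	if (csum_ret < 0)
		model_mark_ordered_extent_error(oe);
	return csum_ret;
}

/* Models completion: only insert the extent if no error was recorded. */
static bool model_should_insert_extent(const struct model_ordered_extent *oe)
{
	return (oe->flags & MODEL_ORDERED_IOERR) == 0;
}
```

With the mark in place, the outcome no longer depends on whether the caller propagates the return value, which is the case the message describes for the end of the range.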
Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
|
| #
e094f480 |
| 15-Apr-2024 |
Josef Bacik <[email protected]> |
btrfs: change root->root_key.objectid to btrfs_root_id()
A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which makes it easier to read in the code.
The changes were made with the following Coccinelle semantic patch:
  // <smpl>
  @@
  expression E,E1;
  @@
  (
   E->root_key.objectid = E1
  |
  - E->root_key.objectid
  + btrfs_root_id(E)
  )
  // </smpl>
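The effect of the semantic patch can be shown with a reduced model. The struct layouts below are simplified stand-ins, and the body of model_root_id() is an assumption inferred from the patch (it replaces reads of root_key.objectid), not a copy of the kernel helper.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel structures. */
struct model_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

struct model_root {
	struct model_key root_key;
};

/* Assumed shape of the accessor: return the root's object id. */
static inline uint64_t model_root_id(const struct model_root *root)
{
	return root->root_key.objectid;
}

/* Before the conversion: open-coded read of the field. */
static int is_root_before(const struct model_root *root, uint64_t id)
{
	return root->root_key.objectid == id;
}

/* After the conversion: the same read goes through the helper. */
static int is_root_after(const struct model_root *root, uint64_t id)
{
	return model_root_id(root) == id;
}
```

The first branch of the semantic patch matches assignments and leaves them unchanged, so only reads are converted; writes such as `root->root_key.objectid = id` stay open-coded.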
Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> [ minor style fixups ] Signed-off-by: David Sterba <[email protected]>
|