|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1 |
|
| #
50fecb8c |
| 05-Apr-2025 |
Filipe Manana <[email protected]> |
btrfs: fix invalid inode pointer after failure to create reloc inode
If we have a failure at create_reloc_inode(), under the 'out' label we assign an error pointer to the 'inode' variable and then r
btrfs: fix invalid inode pointer after failure to create reloc inode
If we have a failure at create_reloc_inode(), under the 'out' label we assign an error pointer to the 'inode' variable and then return a weird pointer because we return the expression "&inode->vfs_inode":
static noinline_for_stack struct inode *create_reloc_inode( const struct btrfs_block_group *group) { (...) out: (...) if (ret) { if (inode) iput(&inode->vfs_inode); inode = ERR_PTR(ret); } return &inode->vfs_inode; }
This can make us return a pointer that is not an error pointer and make the caller proceed as if an error didn't happen and later result in an invalid memory access when dereferencing the inode pointer. Syzbot reported reported such a case with the following stack trace:
R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 431bde82d7b634db R15: 00007ffc55de5790 </TASK> BTRFS info (device loop0): relocating block group 6881280 flags data|metadata Oops: general protection fault, probably for non-canonical address 0xdffffc0000000045: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000228-0x000000000000022f] CPU: 0 UID: 0 PID: 5332 Comm: syz-executor215 Not tainted 6.14.0-syzkaller-13423-ga8662bcd2ff1 #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:relocate_file_extent_cluster+0xe7/0x1750 fs/btrfs/relocation.c:2971 Code: 00 74 08 (...) RSP: 0018:ffffc9000d3375e0 EFLAGS: 00010203 RAX: 0000000000000045 RBX: 000000000000022c RCX: ffff888000562440 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880452db000 RBP: ffffc9000d337870 R08: ffffffff84089251 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000 R13: ffffffff9368a020 R14: 0000000000000394 R15: ffff8880452db000 FS: 000055558bc7b380(0000) GS:ffff88808c596000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055a7a192e740 CR3: 0000000036e2e000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> relocate_block_group+0xa1e/0xd50 fs/btrfs/relocation.c:3657 btrfs_relocate_block_group+0x777/0xd80 fs/btrfs/relocation.c:4011 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3511 __btrfs_balance+0x1a93/0x25e0 fs/btrfs/volumes.c:4292 btrfs_balance+0xbde/0x10c0 fs/btrfs/volumes.c:4669 btrfs_ioctl_balance+0x3f5/0x660 fs/btrfs/ioctl.c:3586 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf1/0x160 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fb4ef537dd9 Code: 28 00 00 (...) RSP: 002b:00007ffc55de5728 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007ffc55de5750 RCX: 00007fb4ef537dd9 RDX: 0000200000000440 RSI: 00000000c4009420 RDI: 0000000000000003 RBP: 0000000000000002 R08: 00007ffc55de54c6 R09: 00007ffc55de5770 R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 431bde82d7b634db R15: 00007ffc55de5790 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:relocate_file_extent_cluster+0xe7/0x1750 fs/btrfs/relocation.c:2971 Code: 00 74 08 (...) RSP: 0018:ffffc9000d3375e0 EFLAGS: 00010203 RAX: 0000000000000045 RBX: 000000000000022c RCX: ffff888000562440 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880452db000 RBP: ffffc9000d337870 R08: ffffffff84089251 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000 R13: ffffffff9368a020 R14: 0000000000000394 R15: ffff8880452db000 FS: 000055558bc7b380(0000) GS:ffff88808c596000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055a7a192e740 CR3: 0000000036e2e000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 ---------------- Code disassembly (best guess): 0: 00 74 08 48 add %dh,0x48(%rax,%rcx,1) 4: 89 df mov %ebx,%edi 6: e8 f8 36 24 fe call 0xfe243703 b: 48 89 9c 24 30 01 00 mov %rbx,0x130(%rsp) 12: 00 13: 4c 89 74 24 28 mov %r14,0x28(%rsp) 18: 4d 8b 76 10 mov 0x10(%r14),%r14 1c: 49 8d 9e 98 fe ff ff lea -0x168(%r14),%rbx 23: 48 89 d8 mov %rbx,%rax 26: 48 c1 e8 03 shr $0x3,%rax * 2a: 42 80 3c 20 00 cmpb $0x0,(%rax,%r12,1) <-- trapping instruction 2f: 74 08 je 0x39 31: 48 89 df mov %rbx,%rdi 34: e8 ca 36 24 fe call 0xfe243703 39: 4c 8b 3b mov (%rbx),%r15 3c: 48 rex.W 3d: 8b .byte 0x8b 3e: 44 rex.R 3f: 24 .byte 0x24
So fix this by returning the error immediately.
Reported-by: [email protected] Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Fixes: b204e5c7d4dc ("btrfs: make btrfs_iget() return a btrfs inode instead") Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
|
Revision tags: v6.14, v6.14-rc7, v6.14-rc6 |
|
| #
20faaab2 |
| 07-Mar-2025 |
Filipe Manana <[email protected]> |
btrfs: remove unnecessary fs_info argument from delete_block_group_cache()
The fs_info can be taken from the given block group, so there is no need to pass it as an argument.
Signed-off-by: Filipe
btrfs: remove unnecessary fs_info argument from delete_block_group_cache()
The fs_info can be taken from the given block group, so there is no need to pass it as an argument.
Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
| #
f75a0437 |
| 07-Mar-2025 |
Filipe Manana <[email protected]> |
btrfs: remove unnecessary fs_info argument from create_reloc_inode()
The fs_info can be taken from the given block group, so there is no need to pass it as an argument.
Signed-off-by: Filipe Manana
btrfs: remove unnecessary fs_info argument from create_reloc_inode()
The fs_info can be taken from the given block group, so there is no need to pass it as an argument.
Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
| #
b204e5c7 |
| 07-Mar-2025 |
Filipe Manana <[email protected]> |
btrfs: make btrfs_iget() return a btrfs inode instead
It's an internal function and most of the time the callers are doing a lot of BTRFS_I() calls on the returned VFS inode to get the btrfs inode,
btrfs: make btrfs_iget() return a btrfs inode instead
It's an internal function and most of the time the callers are doing a lot of BTRFS_I() calls on the returned VFS inode to get the btrfs inode, so change the return type to struct btrfs_inode instead.
Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7 |
|
| #
3a1c46db |
| 09-Jan-2025 |
David Sterba <[email protected]> |
btrfs: open code set_page_extent_mapped()
The function set_page_extent_mapped() is now a simple wrapper so use the folio helper.
Reviewed-by: Johannes Thumshirn <[email protected]> Reviewe
btrfs: open code set_page_extent_mapped()
The function set_page_extent_mapped() is now a simple wrapper so use the folio helper.
Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4 |
|
| #
5a8293a1 |
| 18-Dec-2024 |
Filipe Manana <[email protected]> |
btrfs: relocation: remove unnecessary calls to btrfs_mark_buffer_dirty()
We have several places explicitly calling btrfs_mark_buffer_dirty() but that is not necessarily since the target leaf came fr
btrfs: relocation: remove unnecessary calls to btrfs_mark_buffer_dirty()
We have several places explicitly calling btrfs_mark_buffer_dirty() but that is not necessarily since the target leaf came from a path that was obtained for a btree search function that modifies the btree, something like btrfs_insert_empty_item() or anything else that ends up calling btrfs_search_slot() with a value of 1 for its 'cow' argument.
These just make the code more verbose, confusing and add a little extra overhead and well as increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2 |
|
| #
f974bc3c |
| 03-Oct-2024 |
Josef Bacik <[email protected]> |
btrfs: remove detached list from struct btrfs_backref_cache
We don't ever look at this list, remove it.
Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Re
btrfs: remove detached list from struct btrfs_backref_cache
We don't ever look at this list, remove it.
Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
| #
b61e0eb0 |
| 03-Oct-2024 |
Josef Bacik <[email protected]> |
btrfs: remove the ->lowest and ->leaves members from struct btrfs_backref_node
Before we were keeping all of our nodes on various lists in order to make sure everything got cleaned up correctly. We
btrfs: remove the ->lowest and ->leaves members from struct btrfs_backref_node
Before we were keeping all of our nodes on various lists in order to make sure everything got cleaned up correctly. We used node->lowest to indicate that node->lower was linked into the cache->leaves list. Now that we do cleanup based on the rb-tree both the list and the flag are useless, so delete them both.
Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
| #
4eb8064d |
| 03-Oct-2024 |
Josef Bacik <[email protected]> |
btrfs: do not handle non-shareable roots in backref cache
Now that we handle relocation for non-shareable roots without using the backref cache, remove the ->cowonly field from the backref nodes and
btrfs: do not handle non-shareable roots in backref cache
Now that we handle relocation for non-shareable roots without using the backref cache, remove the ->cowonly field from the backref nodes and update the handling to throw an error.
Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
| #
46bb6765 |
| 03-Oct-2024 |
Josef Bacik <[email protected]> |
btrfs: don't build backref tree for COW-only blocks
We already determine the owner for any blocks we find when we're relocating, and for COW-only blocks (and the data reloc tree) we COW down to the
btrfs: don't build backref tree for COW-only blocks
We already determine the owner for any blocks we find when we're relocating, and for COW-only blocks (and the data reloc tree) we COW down to the block and call it good enough. However we still build a whole backref tree for them, even though we're not going to use it, and then just don't put these blocks in the cache.
Rework the code to check if the block belongs to a COW-only root or the data reloc root, and then just cow down to the block, skipping the backref cache generation.
Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
| #
0097422c |
| 03-Oct-2024 |
Josef Bacik <[email protected]> |
btrfs: remove clone_backref_node() from relocation
Since we no longer maintain backref cache across transactions, and this is only called when we're creating the reloc root for a newly created snaps
btrfs: remove clone_backref_node() from relocation
Since we no longer maintain backref cache across transactions, and this is only called when we're creating the reloc root for a newly created snapshot in the transaction critical section, we will end up doing a bunch of work that will just get thrown away when we start the transaction in the relocation loop. Delete this code as it no longer does anything for us.
Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
| #
551d04a3 |
| 03-Oct-2024 |
Josef Bacik <[email protected]> |
btrfs: simplify loop in select_reloc_root()
We have this setup as a loop, but in reality we will never walk back up the backref tree, if we do then it's a bug. Get rid of the loop and handle the ca
btrfs: simplify loop in select_reloc_root()
We have this setup as a loop, but in reality we will never walk back up the backref tree, if we do then it's a bug. Get rid of the loop and handle the case where we have node->new_bytenr set at all. Previous check was only if node->new_bytenr != root->node->start, but if it did then we would hit the WARN_ON() and walk back up the tree.
Instead we want to just return error if ->new_bytenr is set, and then do the normal updating of the node for the reloc root and carry on.
Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
| #
b1d4d5d1 |
| 03-Oct-2024 |
Josef Bacik <[email protected]> |
btrfs: remove the changed list for backref cache
Now that we're not updating the backref cache when we switch transids we can remove the changed list.
We're going to keep the new_bytenr field becau
btrfs: remove the changed list for backref cache
Now that we're not updating the backref cache when we switch transids we can remove the changed list.
We're going to keep the new_bytenr field because it serves as a good sanity check for the backref cache and relocation, and can prevent us from making extent tree corruption worse.
Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
| #
6a4730b3 |
| 03-Oct-2024 |
Josef Bacik <[email protected]> |
btrfs: convert BUG_ON in btrfs_reloc_cow_block() to proper error handling
This BUG_ON is meant to catch backref cache problems, but these can arise from either bugs in the backref cache or corruptio
btrfs: convert BUG_ON in btrfs_reloc_cow_block() to proper error handling
This BUG_ON is meant to catch backref cache problems, but these can arise from either bugs in the backref cache or corruption in the extent tree. Fix it to be a proper error.
Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
| #
3e74859e |
| 13-Dec-2024 |
Boris Burkov <[email protected]> |
btrfs: check folio mapping after unlock in relocate_one_folio()
When we call btrfs_read_folio() to bring a folio uptodate, we unlock the folio. The result of that is that a different thread can modi
btrfs: check folio mapping after unlock in relocate_one_folio()
When we call btrfs_read_folio() to bring a folio uptodate, we unlock the folio. The result of that is that a different thread can modify the mapping (like remove it with invalidate) before we call folio_lock(). This results in an invalid page and we need to try again.
In particular, if we are relocating concurrently with aborting a transaction, this can result in a crash like the following:
BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 76 PID: 1411631 Comm: kworker/u322:5 Workqueue: events_unbound btrfs_reclaim_bgs_work RIP: 0010:set_page_extent_mapped+0x20/0xb0 RSP: 0018:ffffc900516a7be8 EFLAGS: 00010246 RAX: ffffea009e851d08 RBX: ffffea009e0b1880 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffc900516a7b90 RDI: ffffea009e0b1880 RBP: 0000000003573000 R08: 0000000000000001 R09: ffff88c07fd2f3f0 R10: 0000000000000000 R11: 0000194754b575be R12: 0000000003572000 R13: 0000000003572fff R14: 0000000000100cca R15: 0000000005582fff FS: 0000000000000000(0000) GS:ffff88c07fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000407d00f002 CR4: 00000000007706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __die+0x78/0xc0 ? page_fault_oops+0x2a8/0x3a0 ? __switch_to+0x133/0x530 ? wq_worker_running+0xa/0x40 ? exc_page_fault+0x63/0x130 ? asm_exc_page_fault+0x22/0x30 ? set_page_extent_mapped+0x20/0xb0 relocate_file_extent_cluster+0x1a7/0x940 relocate_data_extent+0xaf/0x120 relocate_block_group+0x20f/0x480 btrfs_relocate_block_group+0x152/0x320 btrfs_relocate_chunk+0x3d/0x120 btrfs_reclaim_bgs_work+0x2ae/0x4e0 process_scheduled_works+0x184/0x370 worker_thread+0xc6/0x3e0 ? blk_add_timer+0xb0/0xb0 kthread+0xae/0xe0 ? flush_tlb_kernel_range+0x90/0x90 ret_from_fork+0x2f/0x40 ? flush_tlb_kernel_range+0x90/0x90 ret_from_fork_asm+0x11/0x20 </TASK>
This occurs because cleanup_one_transaction() calls destroy_delalloc_inodes() which calls invalidate_inode_pages2() which takes the folio_lock before setting mapping to NULL. We fail to check this, and subsequently call set_extent_mapping(), which assumes that mapping != NULL (in fact it asserts that in debug mode)
Note that the "fixes" patch here is not the one that introduced the race (the very first iteration of this code from 2009) but a more recent change that made this particular crash happen in practice.
Fixes: e7f1326cc24e ("btrfs: set page extent mapped after read_folio in relocate_one_page") CC: [email protected] # 6.1+ Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Boris Burkov <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
| #
d7f4b4ef |
| 09-Oct-2024 |
David Sterba <[email protected]> |
btrfs: drop unused transaction parameter from btrfs_qgroup_add_swapped_blocks()
The caller replace_path() runs under transaction but we don't need it in btrfs_qgroup_add_swapped_blocks().
Reviewed-
btrfs: drop unused transaction parameter from btrfs_qgroup_add_swapped_blocks()
The caller replace_path() runs under transaction but we don't need it in btrfs_qgroup_add_swapped_blocks().
Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc1 |
|
| #
c3b47f49 |
| 27-Sep-2024 |
Qu Wenruo <[email protected]> |
btrfs: fix a NULL pointer dereference when failed to start a new trasacntion
[BUG] Syzbot reported a NULL pointer dereference with the following crash:
FAULT_INJECTION: forcing a failure. star
btrfs: fix a NULL pointer dereference when failed to start a new trasacntion
[BUG] Syzbot reported a NULL pointer dereference with the following crash:
FAULT_INJECTION: forcing a failure. start_transaction+0x830/0x1670 fs/btrfs/transaction.c:676 prepare_to_relocate+0x31f/0x4c0 fs/btrfs/relocation.c:3642 relocate_block_group+0x169/0xd20 fs/btrfs/relocation.c:3678 ... BTRFS info (device loop0): balance: ended with status: -12 Oops: general protection fault, probably for non-canonical address 0xdffffc00000000cc: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000660-0x0000000000000667] RIP: 0010:btrfs_update_reloc_root+0x362/0xa80 fs/btrfs/relocation.c:926 Call Trace: <TASK> commit_fs_roots+0x2ee/0x720 fs/btrfs/transaction.c:1496 btrfs_commit_transaction+0xfaf/0x3740 fs/btrfs/transaction.c:2430 del_balance_item fs/btrfs/volumes.c:3678 [inline] reset_balance_state+0x25e/0x3c0 fs/btrfs/volumes.c:3742 btrfs_balance+0xead/0x10c0 fs/btrfs/volumes.c:4574 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:907 [inline] __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f
[CAUSE] The allocation failure happens at the start_transaction() inside prepare_to_relocate(), and during the error handling we call unset_reloc_control(), which makes fs_info->balance_ctl to be NULL.
Then we continue the error path cleanup in btrfs_balance() by calling reset_balance_state() which will call del_balance_item() to fully delete the balance item in the root tree.
However during the small window between set_reloc_contrl() and unset_reloc_control(), we can have a subvolume tree update and created a reloc_root for that subvolume.
Then we go into the final btrfs_commit_transaction() of del_balance_item(), and into btrfs_update_reloc_root() inside commit_fs_roots().
That function checks if fs_info->reloc_ctl is in the merge_reloc_tree stage, but since fs_info->reloc_ctl is NULL, it results a NULL pointer dereference.
[FIX] Just add extra check on fs_info->reloc_ctl inside btrfs_update_reloc_root(), before checking fs_info->reloc_ctl->merge_reloc_tree.
That DEAD_RELOC_TREE handling is to prevent further modification to the reloc tree during merge stage, but since there is no reloc_ctl at all, we do not need to bother that.
Reported-by: [email protected] Link: https://lore.kernel.org/linux-btrfs/[email protected]/ CC: [email protected] # 4.19+ Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
| #
db7e68b5 |
| 24-Sep-2024 |
Josef Bacik <[email protected]> |
btrfs: drop the backref cache during relocation if we commit
Since the inception of relocation we have maintained the backref cache across transaction commits, updating the backref cache with the ne
btrfs: drop the backref cache during relocation if we commit
Since the inception of relocation we have maintained the backref cache across transaction commits, updating the backref cache with the new bytenr whenever we COWed blocks that were in the cache, and then updating their bytenr once we detected a transaction id change.
This works as long as we're only ever modifying blocks, not changing the structure of the tree.
However relocation does in fact change the structure of the tree. For example, if we are relocating a data extent, we will look up all the leaves that point to this data extent. We will then call do_relocation() on each of these leaves, which will COW down to the leaf and then update the file extent location.
But, a key feature of do_relocation() is the pending list. This is all the pending nodes that we modified when we updated the file extent item. We will then process all of these blocks via finish_pending_nodes, which calls do_relocation() on all of the nodes that led up to that leaf.
The purpose of this is to make sure we don't break sharing unless we absolutely have to. Consider the case that we have 3 snapshots that all point to this leaf through the same nodes, the initial COW would have created a whole new path. If we did this for all 3 snapshots we would end up with 3x the number of nodes we had originally. To avoid this we will cycle through each of the snapshots that point to each of these nodes and update their pointers to point at the new nodes.
Once we update the pointer to the new node we will drop the node we removed the link for and all of its children via btrfs_drop_subtree(). This is essentially just btrfs_drop_snapshot(), but for an arbitrary point in the snapshot.
The problem with this is that we will never reflect this in the backref cache. If we do this btrfs_drop_snapshot() for a node that is in the backref tree, we will leave the node in the backref tree. This becomes a problem when we change the transid, as now the backref cache has entire subtrees that no longer exist, but exist as if they still are pointed to by the same roots.
In the best case scenario you end up with "adding refs to an existing tree ref" errors from insert_inline_extent_backref(), where we attempt to link in nodes on roots that are no longer valid.
Worst case you will double free some random block and re-use it when there's still references to the block.
This is extremely subtle, and the consequences are quite bad. There isn't a way to make sure our backref cache is consistent between transid's.
In order to fix this we need to simply evict the entire backref cache anytime we cross transid's. This reduces performance in that we have to rebuild this backref cache every time we change transid's, but fixes the bug.
This has existed since relocation was added, and is a pretty critical bug. There's a lot more cleanup that can be done now that this functionality is going away, but this patch is as small as possible in order to fix the problem and make it easy for us to backport it to all the kernels it needs to be backported to.
Followup series will dismantle more of this code and simplify relocation drastically to remove this functionality.
We have a reproducer that reproduced the corruption within a few minutes of running. With this patch it survives several iterations/hours of running the reproducer.
Fixes: 3fd0a5585eb9 ("Btrfs: Metadata ENOSPC handling for balance") CC: [email protected] Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
|
Revision tags: v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2 |
|
| #
04915240 |
| 31-Jul-2024 |
Johannes Thumshirn <[email protected]> |
btrfs: don't readahead the relocation inode on RST
On relocation we're doing readahead on the relocation inode, but if the filesystem is backed by a RAID stripe tree we can get ENOENT (e.g. due to p
btrfs: don't readahead the relocation inode on RST
On relocation we're doing readahead on the relocation inode, but if the filesystem is backed by a RAID stripe tree we can get ENOENT (e.g. due to preallocated extents not being mapped in the RST) from the lookup.
But readahead doesn't handle the error and submits invalid reads to the device, causing an assertion in the scatter-gather list code:
BTRFS info (device nvme1n1): balance: start -d -m -s BTRFS info (device nvme1n1): relocating block group 6480920576 flags data|raid0 BTRFS error (device nvme1n1): cannot find raid-stripe for logical [6481928192, 6481969152] devid 2, profile raid0 ------------[ cut here ]------------ kernel BUG at include/linux/scatterlist.h:115! Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 1012 Comm: btrfs Not tainted 6.10.0-rc7+ #567 RIP: 0010:__blk_rq_map_sg+0x339/0x4a0 RSP: 0018:ffffc90001a43820 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffea00045d4802 RDX: 0000000117520000 RSI: 0000000000000000 RDI: ffff8881027d1000 RBP: 0000000000003000 R08: ffffea00045d4902 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000001000 R12: ffff8881003d10b8 R13: ffffc90001a438f0 R14: 0000000000000000 R15: 0000000000003000 FS: 00007fcc048a6900(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000002cd11000 CR3: 00000001109ea001 CR4: 0000000000370eb0 Call Trace: <TASK> ? __die_body.cold+0x14/0x25 ? die+0x2e/0x50 ? do_trap+0xca/0x110 ? do_error_trap+0x65/0x80 ? __blk_rq_map_sg+0x339/0x4a0 ? exc_invalid_op+0x50/0x70 ? __blk_rq_map_sg+0x339/0x4a0 ? asm_exc_invalid_op+0x1a/0x20 ? __blk_rq_map_sg+0x339/0x4a0 nvme_prep_rq.part.0+0x9d/0x770 nvme_queue_rq+0x7d/0x1e0 __blk_mq_issue_directly+0x2a/0x90 ? blk_mq_get_budget_and_tag+0x61/0x90 blk_mq_try_issue_list_directly+0x56/0xf0 blk_mq_flush_plug_list.part.0+0x52b/0x5d0 __blk_flush_plug+0xc6/0x110 blk_finish_plug+0x28/0x40 read_pages+0x160/0x1c0 page_cache_ra_unbounded+0x109/0x180 relocate_file_extent_cluster+0x611/0x6a0 ? btrfs_search_slot+0xba4/0xd20 ? balance_dirty_pages_ratelimited_flags+0x26/0xb00 relocate_data_extent.constprop.0+0x134/0x160 relocate_block_group+0x3f2/0x500 btrfs_relocate_block_group+0x250/0x430 btrfs_relocate_chunk+0x3f/0x130 btrfs_balance+0x71b/0xef0 ? kmalloc_trace_noprof+0x13b/0x280 btrfs_ioctl+0x2c2e/0x3030 ? kvfree_call_rcu+0x1e6/0x340 ? list_lru_add_obj+0x66/0x80 ? mntput_no_expire+0x3a/0x220 __x64_sys_ioctl+0x96/0xc0 do_syscall_64+0x54/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fcc04514f9b Code: Unable to access opcode bytes at 0x7fcc04514f71. RSP: 002b:00007ffeba923370 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fcc04514f9b RDX: 00007ffeba923460 RSI: 00000000c4009420 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000000013 R09: 0000000000000001 R10: 00007fcc043fbba8 R11: 0000000000000246 R12: 00007ffeba924fc5 R13: 00007ffeba923460 R14: 0000000000000002 R15: 00000000004d4bb0 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:__blk_rq_map_sg+0x339/0x4a0 RSP: 0018:ffffc90001a43820 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffea00045d4802 RDX: 0000000117520000 RSI: 0000000000000000 RDI: ffff8881027d1000 RBP: 0000000000003000 R08: ffffea00045d4902 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000001000 R12: ffff8881003d10b8 R13: ffffc90001a438f0 R14: 0000000000000000 R15: 0000000000003000 FS: 00007fcc048a6900(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fcc04514f71 CR3: 00000001109ea001 CR4: 0000000000370eb0 Kernel panic - not syncing: Fatal exception Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception ]---
So in case of a relocation on a RAID stripe-tree based file system, skip the readahead.
Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc1, v6.10, v6.10-rc7 |
|
| #
ca84529a |
| 01-Jul-2024 |
Filipe Manana <[email protected]> |
btrfs: fix data race when accessing the last_trans field of a root
KCSAN complains about a data race when accessing the last_trans field of a root:
[ 199.553628] BUG: KCSAN: data-race in btrfs_r
btrfs: fix data race when accessing the last_trans field of a root
KCSAN complains about a data race when accessing the last_trans field of a root:
[ 199.553628] BUG: KCSAN: data-race in btrfs_record_root_in_trans [btrfs] / record_root_in_trans [btrfs]
[ 199.555186] read to 0x000000008801e308 of 8 bytes by task 2812 on cpu 1: [ 199.555210] btrfs_record_root_in_trans+0x9a/0x128 [btrfs] [ 199.555999] start_transaction+0x154/0xcd8 [btrfs] [ 199.556780] btrfs_join_transaction+0x44/0x60 [btrfs] [ 199.557559] btrfs_dirty_inode+0x9c/0x140 [btrfs] [ 199.558339] btrfs_update_time+0x8c/0xb0 [btrfs] [ 199.559123] touch_atime+0x16c/0x1e0 [ 199.559151] pipe_read+0x6a8/0x7d0 [ 199.559179] vfs_read+0x466/0x498 [ 199.559204] ksys_read+0x108/0x150 [ 199.559230] __s390x_sys_read+0x68/0x88 [ 199.559257] do_syscall+0x1c6/0x210 [ 199.559286] __do_syscall+0xc8/0xf0 [ 199.559318] system_call+0x70/0x98
[ 199.559431] write to 0x000000008801e308 of 8 bytes by task 2808 on cpu 0: [ 199.559464] record_root_in_trans+0x196/0x228 [btrfs] [ 199.560236] btrfs_record_root_in_trans+0xfe/0x128 [btrfs] [ 199.561097] start_transaction+0x154/0xcd8 [btrfs] [ 199.561927] btrfs_join_transaction+0x44/0x60 [btrfs] [ 199.562700] btrfs_dirty_inode+0x9c/0x140 [btrfs] [ 199.563493] btrfs_update_time+0x8c/0xb0 [btrfs] [ 199.564277] file_update_time+0xb8/0xf0 [ 199.564301] pipe_write+0x8ac/0xab8 [ 199.564326] vfs_write+0x33c/0x588 [ 199.564349] ksys_write+0x108/0x150 [ 199.564372] __s390x_sys_write+0x68/0x88 [ 199.564397] do_syscall+0x1c6/0x210 [ 199.564424] __do_syscall+0xc8/0xf0 [ 199.564452] system_call+0x70/0x98
This is because we update and read last_trans concurrently without any type of synchronization. This should be generally harmless and in the worst case it can make us do extra locking (btrfs_record_root_in_trans()) trigger some warnings at ctree.c or do extra work during relocation - this would probably only happen in case of load or store tearing.
So fix this by always reading and updating the field using READ_ONCE() and WRITE_ONCE(), this silences KCSAN and prevents load and store tearing.
Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3 |
|
| #
a1f4e3d7 |
| 31-Jan-2022 |
David Sterba <[email protected]> |
btrfs: switch btrfs_ordered_extent::inode to struct btrfs_inode
The structure is internal so we should use struct btrfs_inode for that, allowing to remove some use of BTRFS_I.
Reviewed-by: Boris Bu
btrfs: switch btrfs_ordered_extent::inode to struct btrfs_inode
The structure is internal so we should use struct btrfs_inode for that, allowing to remove some use of BTRFS_I.
Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
| #
d13240dd |
| 13-Jun-2024 |
Filipe Manana <[email protected]> |
btrfs: remove super block argument from btrfs_iget()
It's pointless to pass a super block argument to btrfs_iget() because we always pass a root and from it we can get the super block through:
r
btrfs: remove super block argument from btrfs_iget()
It's pointless to pass a super block argument to btrfs_iget() because we always pass a root and from it we can get the super block through:
root->fs_info->sb
So remove the super block argument.
Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
| #
6d81df75 |
| 05-Jun-2024 |
Johannes Thumshirn <[email protected]> |
btrfs: pass reloc_control to setup_relocation_extent_mapping()
All parameters passed into setup_relocation_extent_mapping() can be derived from 'struct reloc_control', so only pass in a 'struct relo
btrfs: pass reloc_control to setup_relocation_extent_mapping()
All parameters passed into setup_relocation_extent_mapping() can be derived from 'struct reloc_control', so only pass in a 'struct reloc_control'.
Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
| #
60f3dabd |
| 05-Jun-2024 |
Johannes Thumshirn <[email protected]> |
btrfs: pass a struct reloc_control to prealloc_file_extent_cluster()
Pass a 'struct reloc_control' to prealloc_file_extent_cluster() instead of passing its members 'data_inode' and 'cluster' on thei
btrfs: pass a struct reloc_control to prealloc_file_extent_cluster()
Pass a 'struct reloc_control' to prealloc_file_extent_cluster() instead of passing its members 'data_inode' and 'cluster' on their own.
Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|
| #
17a21d79 |
| 05-Jun-2024 |
Johannes Thumshirn <[email protected]> |
btrfs: don't pass fs_info to describe_relocation()
In describe_relocation() the fs_info is only needed for printing information via btrfs_info() and can easily be accessed via the passed in 'struct
btrfs: don't pass fs_info to describe_relocation()
In describe_relocation() the fs_info is only needed for printing information via btrfs_info() and can easily be accessed via the passed in 'struct btrfs_block_group'.
So we can safely remove the fs_info parameter.
Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
show more ...
|