History log of /linux-6.15/fs/btrfs/relocation.c (Results 1 – 25 of 539)
Revision Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1
# 50fecb8c 05-Apr-2025 Filipe Manana <[email protected]>

btrfs: fix invalid inode pointer after failure to create reloc inode

If we have a failure at create_reloc_inode(), under the 'out' label we
assign an error pointer to the 'inode' variable and then return a weird
pointer because we return the expression "&inode->vfs_inode":

   static noinline_for_stack struct inode *create_reloc_inode(
                                  const struct btrfs_block_group *group)
   {
       (...)
   out:
       (...)
       if (ret) {
           if (inode)
               iput(&inode->vfs_inode);
           inode = ERR_PTR(ret);
       }
       return &inode->vfs_inode;
   }

This can make us return a pointer that is not an error pointer and make
the caller proceed as if an error didn't happen and later result in an
invalid memory access when dereferencing the inode pointer.
Syzbot reported such a case with the following stack trace:

R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 431bde82d7b634db R15: 00007ffc55de5790
</TASK>
BTRFS info (device loop0): relocating block group 6881280 flags data|metadata
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000045: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000228-0x000000000000022f]
CPU: 0 UID: 0 PID: 5332 Comm: syz-executor215 Not tainted 6.14.0-syzkaller-13423-ga8662bcd2ff1 #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:relocate_file_extent_cluster+0xe7/0x1750 fs/btrfs/relocation.c:2971
Code: 00 74 08 (...)
RSP: 0018:ffffc9000d3375e0 EFLAGS: 00010203
RAX: 0000000000000045 RBX: 000000000000022c RCX: ffff888000562440
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880452db000
RBP: ffffc9000d337870 R08: ffffffff84089251 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
R13: ffffffff9368a020 R14: 0000000000000394 R15: ffff8880452db000
FS: 000055558bc7b380(0000) GS:ffff88808c596000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055a7a192e740 CR3: 0000000036e2e000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
relocate_block_group+0xa1e/0xd50 fs/btrfs/relocation.c:3657
btrfs_relocate_block_group+0x777/0xd80 fs/btrfs/relocation.c:4011
btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3511
__btrfs_balance+0x1a93/0x25e0 fs/btrfs/volumes.c:4292
btrfs_balance+0xbde/0x10c0 fs/btrfs/volumes.c:4669
btrfs_ioctl_balance+0x3f5/0x660 fs/btrfs/ioctl.c:3586
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf1/0x160 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb4ef537dd9
Code: 28 00 00 (...)
RSP: 002b:00007ffc55de5728 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffc55de5750 RCX: 00007fb4ef537dd9
RDX: 0000200000000440 RSI: 00000000c4009420 RDI: 0000000000000003
RBP: 0000000000000002 R08: 00007ffc55de54c6 R09: 00007ffc55de5770
R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 431bde82d7b634db R15: 00007ffc55de5790
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:relocate_file_extent_cluster+0xe7/0x1750 fs/btrfs/relocation.c:2971
Code: 00 74 08 (...)
RSP: 0018:ffffc9000d3375e0 EFLAGS: 00010203
RAX: 0000000000000045 RBX: 000000000000022c RCX: ffff888000562440
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880452db000
RBP: ffffc9000d337870 R08: ffffffff84089251 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
R13: ffffffff9368a020 R14: 0000000000000394 R15: ffff8880452db000
FS: 000055558bc7b380(0000) GS:ffff88808c596000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055a7a192e740 CR3: 0000000036e2e000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
----------------
Code disassembly (best guess):
0: 00 74 08 48 add %dh,0x48(%rax,%rcx,1)
4: 89 df mov %ebx,%edi
6: e8 f8 36 24 fe call 0xfe243703
b: 48 89 9c 24 30 01 00 mov %rbx,0x130(%rsp)
12: 00
13: 4c 89 74 24 28 mov %r14,0x28(%rsp)
18: 4d 8b 76 10 mov 0x10(%r14),%r14
1c: 49 8d 9e 98 fe ff ff lea -0x168(%r14),%rbx
23: 48 89 d8 mov %rbx,%rax
26: 48 c1 e8 03 shr $0x3,%rax
* 2a: 42 80 3c 20 00 cmpb $0x0,(%rax,%r12,1) <-- trapping instruction
2f: 74 08 je 0x39
31: 48 89 df mov %rbx,%rdi
34: e8 ca 36 24 fe call 0xfe243703
39: 4c 8b 3b mov (%rbx),%r15
3c: 48 rex.W
3d: 8b .byte 0x8b
3e: 44 rex.R
3f: 24 .byte 0x24

So fix this by returning the error immediately.

Reported-by: [email protected]
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Fixes: b204e5c7d4dc ("btrfs: make btrfs_iget() return a btrfs inode instead")
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
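
The pointer arithmetic behind this bug can be reproduced in userspace. The sketch below is only an illustration: ERR_PTR() and IS_ERR() are simplified stand-ins for the kernel helpers, and struct btrfs_inode_model with its 0x100-byte padding is a made-up layout, not the real struct btrfs_inode. It shows that adding the member offset to an error pointer produces a small address that no longer looks like an error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace stand-ins for the kernel's ERR_PTR() and IS_ERR(). */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical layout: the VFS inode is embedded at a non-zero offset. */
struct vfs_inode_model {
	unsigned long i_ino;
};

struct btrfs_inode_model {
	char other_fields[0x100];	/* made-up size, for illustration only */
	struct vfs_inode_model vfs_inode;
};

/* The address "&inode->vfs_inode" yields when inode is ERR_PTR(error). */
static inline uintptr_t buggy_return_value(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (uintptr_t)ERR_PTR(error) +
	       offsetof(struct btrfs_inode_model, vfs_inode);
}
```

With this layout, buggy_return_value(-ENOMEM) wraps around to a small non-NULL address, so an IS_ERR() check in the caller does not fire. That is consistent with the low address range in the KASAN report above.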


Revision tags: v6.14, v6.14-rc7, v6.14-rc6
# 20faaab2 07-Mar-2025 Filipe Manana <[email protected]>

btrfs: remove unnecessary fs_info argument from delete_block_group_cache()

The fs_info can be taken from the given block group, so there is no need
to pass it as an argument.

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>


# f75a0437 07-Mar-2025 Filipe Manana <[email protected]>

btrfs: remove unnecessary fs_info argument from create_reloc_inode()

The fs_info can be taken from the given block group, so there is no need
to pass it as an argument.

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>


# b204e5c7 07-Mar-2025 Filipe Manana <[email protected]>

btrfs: make btrfs_iget() return a btrfs inode instead

It's an internal function and most of the time the callers are doing a lot
of BTRFS_I() calls on the returned VFS inode to get the btrfs inode, so
change the return type to struct btrfs_inode instead.

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
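
As a rough illustration of the motivation, the sketch below contrasts a lookup that returns the embedded VFS inode (so callers convert back with BTRFS_I()) with one that returns the btrfs inode directly. The struct layouts and the lookup_vfs_style()/lookup_btrfs_style() helpers are hypothetical userspace models; only the container_of() pattern behind BTRFS_I() mirrors the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal models of the two inode types; the fields are placeholders. */
struct inode {
	unsigned long i_ino;
};

struct btrfs_inode {
	unsigned long long generation;
	struct inode vfs_inode;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Convert an embedded VFS inode back to its containing btrfs inode. */
static inline struct btrfs_inode *BTRFS_I(struct inode *inode)
{
	return container_of(inode, struct btrfs_inode, vfs_inode);
}

/* Old style (hypothetical): hand out the VFS inode. */
static inline struct inode *lookup_vfs_style(struct btrfs_inode *stored)
{
	return &stored->vfs_inode;
}

/* New style (hypothetical): hand out the btrfs inode itself. */
static inline struct btrfs_inode *lookup_btrfs_style(struct btrfs_inode *stored)
{
	assert(stored != NULL);
	return stored;
}
```

Callers of the new style reach the VFS inode with &inode->vfs_inode only where they need it, instead of wrapping every use in BTRFS_I().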


Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7
# 3a1c46db 09-Jan-2025 David Sterba <[email protected]>

btrfs: open code set_page_extent_mapped()

The function set_page_extent_mapped() is now a simple wrapper so use the
folio helper.

Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: David Sterba <[email protected]>


Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4
# 5a8293a1 18-Dec-2024 Filipe Manana <[email protected]>

btrfs: relocation: remove unnecessary calls to btrfs_mark_buffer_dirty()

We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessary since the target leaf came from a path that was
obtained from a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.

These just make the code more verbose and confusing, add a little extra
overhead, and increase the module's text size, so remove them.

Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>


Revision tags: v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2
# f974bc3c 03-Oct-2024 Josef Bacik <[email protected]>

btrfs: remove detached list from struct btrfs_backref_cache

We don't ever look at this list, remove it.

Reviewed-by: Boris Burkov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>


# b61e0eb0 03-Oct-2024 Josef Bacik <[email protected]>

btrfs: remove the ->lowest and ->leaves members from struct btrfs_backref_node

Before we were keeping all of our nodes on various lists in order to
make sure everything got cleaned up correctly. We used node->lowest to
indicate that node->lower was linked into the cache->leaves list. Now
that we do cleanup based on the rb-tree both the list and the flag are
useless, so delete them both.

Reviewed-by: Boris Burkov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>


# 4eb8064d 03-Oct-2024 Josef Bacik <[email protected]>

btrfs: do not handle non-shareable roots in backref cache

Now that we handle relocation for non-shareable roots without using the
backref cache, remove the ->cowonly field from the backref nodes and
update the handling to throw an error.

Reviewed-by: Boris Burkov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>


# 46bb6765 03-Oct-2024 Josef Bacik <[email protected]>

btrfs: don't build backref tree for COW-only blocks

We already determine the owner for any blocks we find when we're
relocating, and for COW-only blocks (and the data reloc tree) we COW
down to the block and call it good enough. However we still build a
whole backref tree for them, even though we're not going to use it, and
then just don't put these blocks in the cache.

Rework the code to check if the block belongs to a COW-only root or the
data reloc root, and then just cow down to the block, skipping the
backref cache generation.

Reviewed-by: Boris Burkov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>


# 0097422c 03-Oct-2024 Josef Bacik <[email protected]>

btrfs: remove clone_backref_node() from relocation

Since we no longer maintain backref cache across transactions, and this
is only called when we're creating the reloc root for a newly created
snapshot in the transaction critical section, we will end up doing a
bunch of work that will just get thrown away when we start the
transaction in the relocation loop. Delete this code as it no longer
does anything for us.

Reviewed-by: Boris Burkov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>


# 551d04a3 03-Oct-2024 Josef Bacik <[email protected]>

btrfs: simplify loop in select_reloc_root()

We have this set up as a loop, but in reality we will never walk back up
the backref tree; if we do then it's a bug. Get rid of the loop and
handle the case where we have node->new_bytenr set at all. The previous
check was only whether node->new_bytenr != root->node->start, but if it did
then we would hit the WARN_ON() and walk back up the tree.

Instead we want to just return error if ->new_bytenr is set, and then do
the normal updating of the node for the reloc root and carry on.

Reviewed-by: Boris Burkov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>


# b1d4d5d1 03-Oct-2024 Josef Bacik <[email protected]>

btrfs: remove the changed list for backref cache

Now that we're not updating the backref cache when we switch transids we
can remove the changed list.

We're going to keep the new_bytenr field because it serves as a good
sanity check for the backref cache and relocation, and can prevent us
from making extent tree corruption worse.

Reviewed-by: Boris Burkov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>


# 6a4730b3 03-Oct-2024 Josef Bacik <[email protected]>

btrfs: convert BUG_ON in btrfs_reloc_cow_block() to proper error handling

This BUG_ON is meant to catch backref cache problems, but these can
arise from either bugs in the backref cache or corruption in the extent
tree. Fix it to be a proper error.

Reviewed-by: Boris Burkov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
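
The general pattern of replacing an assertion with error handling might look like the userspace sketch below. The consistency condition, the struct, the function name check_cow_block() and the -EIO return value are all assumptions chosen for illustration; they are not taken from the actual patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of a cached backref node. */
struct backref_node_model {
	uint64_t bytenr;	/* location the cache recorded */
	uint64_t new_bytenr;	/* location after a COW, if any */
};

/*
 * Instead of crashing when the block does not match the cached node,
 * report the inconsistency and let the caller abort the operation.
 * Returns 0 on success or a negative errno value.
 */
static inline int check_cow_block(const struct backref_node_model *node,
				  uint64_t buf_start)
{
	assert(node != NULL);	/* programming error, not on-disk state */

	if (node->bytenr != buf_start && node->new_bytenr != buf_start) {
		fprintf(stderr,
			"block %llu does not match cached node (%llu / %llu)\n",
			(unsigned long long)buf_start,
			(unsigned long long)node->bytenr,
			(unsigned long long)node->new_bytenr);
		return -EIO;
	}
	return 0;
}
```

The point of the design is that a mismatch caused by on-disk corruption becomes a failed operation rather than a kernel crash.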


# 3e74859e 13-Dec-2024 Boris Burkov <[email protected]>

btrfs: check folio mapping after unlock in relocate_one_folio()

When we call btrfs_read_folio() to bring a folio uptodate, we unlock the
folio. The result of that is that a different thread can modify the
mapping (like remove it with invalidate) before we call folio_lock().
This results in an invalid page and we need to try again.

In particular, if we are relocating concurrently with aborting a
transaction, this can result in a crash like the following:

BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 0 P4D 0
Oops: 0000 [#1] SMP
CPU: 76 PID: 1411631 Comm: kworker/u322:5
Workqueue: events_unbound btrfs_reclaim_bgs_work
RIP: 0010:set_page_extent_mapped+0x20/0xb0
RSP: 0018:ffffc900516a7be8 EFLAGS: 00010246
RAX: ffffea009e851d08 RBX: ffffea009e0b1880 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffc900516a7b90 RDI: ffffea009e0b1880
RBP: 0000000003573000 R08: 0000000000000001 R09: ffff88c07fd2f3f0
R10: 0000000000000000 R11: 0000194754b575be R12: 0000000003572000
R13: 0000000003572fff R14: 0000000000100cca R15: 0000000005582fff
FS: 0000000000000000(0000) GS:ffff88c07fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000407d00f002 CR4: 00000000007706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
? __die+0x78/0xc0
? page_fault_oops+0x2a8/0x3a0
? __switch_to+0x133/0x530
? wq_worker_running+0xa/0x40
? exc_page_fault+0x63/0x130
? asm_exc_page_fault+0x22/0x30
? set_page_extent_mapped+0x20/0xb0
relocate_file_extent_cluster+0x1a7/0x940
relocate_data_extent+0xaf/0x120
relocate_block_group+0x20f/0x480
btrfs_relocate_block_group+0x152/0x320
btrfs_relocate_chunk+0x3d/0x120
btrfs_reclaim_bgs_work+0x2ae/0x4e0
process_scheduled_works+0x184/0x370
worker_thread+0xc6/0x3e0
? blk_add_timer+0xb0/0xb0
kthread+0xae/0xe0
? flush_tlb_kernel_range+0x90/0x90
ret_from_fork+0x2f/0x40
? flush_tlb_kernel_range+0x90/0x90
ret_from_fork_asm+0x11/0x20
</TASK>

This occurs because cleanup_one_transaction() calls
destroy_delalloc_inodes() which calls invalidate_inode_pages2() which
takes the folio_lock before setting mapping to NULL. We fail to check
this, and subsequently call set_page_extent_mapped(), which assumes that
mapping != NULL (in fact it asserts that in debug mode).

Note that the "fixes" patch here is not the one that introduced the
race (the very first iteration of this code from 2009) but a more recent
change that made this particular crash happen in practice.

Fixes: e7f1326cc24e ("btrfs: set page extent mapped after read_folio in relocate_one_page")
CC: [email protected] # 6.1+
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Boris Burkov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
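
The "recheck after relocking" idea can be modelled in userspace as follows. This is a sketch under stated assumptions: struct folio_model, the race_fn hook and read_and_validate() are hypothetical, and the hook merely simulates another thread running while the folio is unlocked during the read.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct mapping_model {
	int id;
};

struct folio_model {
	struct mapping_model *mapping;
	int uptodate;
};

/* Hypothetical hook: what a concurrent thread does in the unlocked window. */
typedef void (*race_fn)(struct folio_model *folio);

/* Simulates invalidation clearing the folio's mapping. */
static inline void invalidate(struct folio_model *folio)
{
	folio->mapping = NULL;
}

/*
 * Bring the folio uptodate, then verify it still belongs to the expected
 * mapping. Returns 0 if the folio is usable, or -EAGAIN if the caller
 * should look the folio up again.
 */
static inline int read_and_validate(struct folio_model *folio,
				    struct mapping_model *expected,
				    race_fn race)
{
	assert(folio != NULL && expected != NULL);

	/* The read drops the folio lock; anything can happen here. */
	if (race)
		race(folio);
	folio->uptodate = 1;

	/* Lock retaken: the mapping must be checked before further use. */
	if (folio->mapping != expected)
		return -EAGAIN;
	return 0;
}
```

A caller would loop on -EAGAIN, releasing the stale folio and fetching a fresh one before retrying.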


# d7f4b4ef 09-Oct-2024 David Sterba <[email protected]>

btrfs: drop unused transaction parameter from btrfs_qgroup_add_swapped_blocks()

The caller replace_path() runs under transaction but we don't need it in
btrfs_qgroup_add_swapped_blocks().

Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: David Sterba <[email protected]>


Revision tags: v6.12-rc1
# c3b47f49 27-Sep-2024 Qu Wenruo <[email protected]>

btrfs: fix a NULL pointer dereference when failed to start a new transaction

[BUG]
Syzbot reported a NULL pointer dereference with the following crash:

FAULT_INJECTION: forcing a failure.
start_transaction+0x830/0x1670 fs/btrfs/transaction.c:676
prepare_to_relocate+0x31f/0x4c0 fs/btrfs/relocation.c:3642
relocate_block_group+0x169/0xd20 fs/btrfs/relocation.c:3678
...
BTRFS info (device loop0): balance: ended with status: -12
Oops: general protection fault, probably for non-canonical address 0xdffffc00000000cc: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000660-0x0000000000000667]
RIP: 0010:btrfs_update_reloc_root+0x362/0xa80 fs/btrfs/relocation.c:926
Call Trace:
<TASK>
commit_fs_roots+0x2ee/0x720 fs/btrfs/transaction.c:1496
btrfs_commit_transaction+0xfaf/0x3740 fs/btrfs/transaction.c:2430
del_balance_item fs/btrfs/volumes.c:3678 [inline]
reset_balance_state+0x25e/0x3c0 fs/btrfs/volumes.c:3742
btrfs_balance+0xead/0x10c0 fs/btrfs/volumes.c:4574
btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:907 [inline]
__se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f

[CAUSE]
The allocation failure happens at the start_transaction() inside
prepare_to_relocate(), and during the error handling we call
unset_reloc_control(), which sets fs_info->reloc_ctl to NULL.

Then we continue the error path cleanup in btrfs_balance() by calling
reset_balance_state() which will call del_balance_item() to fully delete
the balance item in the root tree.

However during the small window between set_reloc_control() and
unset_reloc_control(), a subvolume tree can be updated, creating a
reloc_root for that subvolume.

Then we go into the final btrfs_commit_transaction() of
del_balance_item(), and into btrfs_update_reloc_root() inside
commit_fs_roots().

That function checks if fs_info->reloc_ctl is in the merge_reloc_tree
stage, but since fs_info->reloc_ctl is NULL, it results in a NULL pointer
dereference.

[FIX]
Just add an extra check on fs_info->reloc_ctl inside
btrfs_update_reloc_root(), before checking
fs_info->reloc_ctl->merge_reloc_tree.

That DEAD_RELOC_TREE handling is to prevent further modification to the
reloc tree during the merge stage, but since there is no reloc_ctl at all,
we do not need to bother with that.

Reported-by: [email protected]
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
CC: [email protected] # 4.19+
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
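
The shape of the extra check described in [FIX] can be sketched in userspace as below. The two structs are reduced models containing only the fields named in the commit message, and in_merge_stage() is a hypothetical helper name, not a function from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced models holding only the fields mentioned in the message. */
struct reloc_control_model {
	bool merge_reloc_tree;
};

struct fs_info_model {
	struct reloc_control_model *reloc_ctl;
};

/*
 * Only look at merge_reloc_tree when a reloc control exists. The &&
 * operator short-circuits, so the dereference is skipped when
 * reloc_ctl is NULL.
 */
static inline bool in_merge_stage(const struct fs_info_model *fs_info)
{
	assert(fs_info != NULL);
	return fs_info->reloc_ctl && fs_info->reloc_ctl->merge_reloc_tree;
}
```

With reloc_ctl unset the helper simply reports "not merging", which matches the reasoning that there is nothing to protect when no relocation is in progress.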


# db7e68b5 24-Sep-2024 Josef Bacik <[email protected]>

btrfs: drop the backref cache during relocation if we commit

Since the inception of relocation we have maintained the backref cache
across transaction commits, updating the backref cache with the new
bytenr whenever we COWed blocks that were in the cache, and then
updating their bytenr once we detected a transaction id change.

This works as long as we're only ever modifying blocks, not changing the
structure of the tree.

However relocation does in fact change the structure of the tree. For
example, if we are relocating a data extent, we will look up all the
leaves that point to this data extent. We will then call
do_relocation() on each of these leaves, which will COW down to the leaf
and then update the file extent location.

But, a key feature of do_relocation() is the pending list. This is all
the pending nodes that we modified when we updated the file extent item.
We will then process all of these blocks via finish_pending_nodes, which
calls do_relocation() on all of the nodes that led up to that leaf.

The purpose of this is to make sure we don't break sharing unless we
absolutely have to. Consider the case that we have 3 snapshots that all
point to this leaf through the same nodes, the initial COW would have
created a whole new path. If we did this for all 3 snapshots we would
end up with 3x the number of nodes we had originally. To avoid this we
will cycle through each of the snapshots that point to each of these
nodes and update their pointers to point at the new nodes.

Once we update the pointer to the new node we will drop the node we
removed the link for and all of its children via btrfs_drop_subtree().
This is essentially just btrfs_drop_snapshot(), but for an arbitrary
point in the snapshot.

The problem with this is that we will never reflect this in the backref
cache. If we do this btrfs_drop_snapshot() for a node that is in the
backref tree, we will leave the node in the backref tree. This becomes
a problem when we change the transid, as now the backref cache has
entire subtrees that no longer exist, but exist as if they still are
pointed to by the same roots.

In the best case scenario you end up with "adding refs to an existing
tree ref" errors from insert_inline_extent_backref(), where we attempt
to link in nodes on roots that are no longer valid.

Worst case you will double free some random block and re-use it when
there's still references to the block.

This is extremely subtle, and the consequences are quite bad. There
isn't a way to make sure our backref cache is consistent between
transids.

In order to fix this we need to simply evict the entire backref cache
anytime we cross transids. This reduces performance in that we have to
rebuild this backref cache every time we change transids, but fixes the
bug.

This has existed since relocation was added, and is a pretty critical
bug. There's a lot more cleanup that can be done now that this
functionality is going away, but this patch is as small as possible in
order to fix the problem and make it easy for us to backport it to all
the kernels it needs to be backported to.

Followup series will dismantle more of this code and simplify relocation
drastically to remove this functionality.

We have a reproducer that reproduced the corruption within a few minutes
of running. With this patch it survives several iterations/hours of
running the reproducer.

Fixes: 3fd0a5585eb9 ("Btrfs: Metadata ENOSPC handling for balance")
CC: [email protected]
Reviewed-by: Boris Burkov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
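
The "evict everything when the transaction id changes" policy can be modelled with a small userspace cache. The sketch below is an assumption-laden illustration: the linked-list cache, struct names and cache_sync_transid() are hypothetical and much simpler than the kernel's backref cache.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct cache_node {
	uint64_t bytenr;
	struct cache_node *next;
};

struct backref_cache_model {
	uint64_t last_trans;	/* transid the cached nodes were built in */
	struct cache_node *nodes;
	size_t nr_nodes;
};

/* Add one node; returns 0 on success, -1 if allocation fails. */
static inline int cache_add(struct backref_cache_model *cache, uint64_t bytenr)
{
	struct cache_node *node = malloc(sizeof(*node));

	if (!node)
		return -1;
	node->bytenr = bytenr;
	node->next = cache->nodes;
	cache->nodes = node;
	cache->nr_nodes++;
	return 0;
}

/* Free every cached node. */
static inline void cache_release(struct backref_cache_model *cache)
{
	while (cache->nodes) {
		struct cache_node *node = cache->nodes;

		cache->nodes = node->next;
		free(node);
	}
	cache->nr_nodes = 0;
}

/*
 * Drop the whole cache if the transaction changed since it was built.
 * Returns 1 if the cache was evicted, 0 if it is still valid.
 */
static inline int cache_sync_transid(struct backref_cache_model *cache,
				     uint64_t cur_trans)
{
	assert(cache != NULL);
	if (cache->last_trans == cur_trans)
		return 0;
	cache_release(cache);
	cache->last_trans = cur_trans;
	return 1;
}
```

The trade-off is the one the message states: the cache is rebuilt after every transid change, in exchange for never holding entries that describe subtrees which no longer exist.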


Revision tags: v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2
# 04915240 31-Jul-2024 Johannes Thumshirn <[email protected]>

btrfs: don't readahead the relocation inode on RST

On relocation we're doing readahead on the relocation inode, but if the
filesystem is backed by a RAID stripe tree we can get ENOENT (e.g. due to
preallocated extents not being mapped in the RST) from the lookup.

But readahead doesn't handle the error and submits invalid reads to the
device, causing an assertion in the scatter-gather list code:

BTRFS info (device nvme1n1): balance: start -d -m -s
BTRFS info (device nvme1n1): relocating block group 6480920576 flags data|raid0
BTRFS error (device nvme1n1): cannot find raid-stripe for logical [6481928192, 6481969152] devid 2, profile raid0
------------[ cut here ]------------
kernel BUG at include/linux/scatterlist.h:115!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 1012 Comm: btrfs Not tainted 6.10.0-rc7+ #567
RIP: 0010:__blk_rq_map_sg+0x339/0x4a0
RSP: 0018:ffffc90001a43820 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffea00045d4802
RDX: 0000000117520000 RSI: 0000000000000000 RDI: ffff8881027d1000
RBP: 0000000000003000 R08: ffffea00045d4902 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000001000 R12: ffff8881003d10b8
R13: ffffc90001a438f0 R14: 0000000000000000 R15: 0000000000003000
FS: 00007fcc048a6900(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000002cd11000 CR3: 00000001109ea001 CR4: 0000000000370eb0
Call Trace:
<TASK>
? __die_body.cold+0x14/0x25
? die+0x2e/0x50
? do_trap+0xca/0x110
? do_error_trap+0x65/0x80
? __blk_rq_map_sg+0x339/0x4a0
? exc_invalid_op+0x50/0x70
? __blk_rq_map_sg+0x339/0x4a0
? asm_exc_invalid_op+0x1a/0x20
? __blk_rq_map_sg+0x339/0x4a0
nvme_prep_rq.part.0+0x9d/0x770
nvme_queue_rq+0x7d/0x1e0
__blk_mq_issue_directly+0x2a/0x90
? blk_mq_get_budget_and_tag+0x61/0x90
blk_mq_try_issue_list_directly+0x56/0xf0
blk_mq_flush_plug_list.part.0+0x52b/0x5d0
__blk_flush_plug+0xc6/0x110
blk_finish_plug+0x28/0x40
read_pages+0x160/0x1c0
page_cache_ra_unbounded+0x109/0x180
relocate_file_extent_cluster+0x611/0x6a0
? btrfs_search_slot+0xba4/0xd20
? balance_dirty_pages_ratelimited_flags+0x26/0xb00
relocate_data_extent.constprop.0+0x134/0x160
relocate_block_group+0x3f2/0x500
btrfs_relocate_block_group+0x250/0x430
btrfs_relocate_chunk+0x3f/0x130
btrfs_balance+0x71b/0xef0
? kmalloc_trace_noprof+0x13b/0x280
btrfs_ioctl+0x2c2e/0x3030
? kvfree_call_rcu+0x1e6/0x340
? list_lru_add_obj+0x66/0x80
? mntput_no_expire+0x3a/0x220
__x64_sys_ioctl+0x96/0xc0
do_syscall_64+0x54/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fcc04514f9b
Code: Unable to access opcode bytes at 0x7fcc04514f71.
RSP: 002b:00007ffeba923370 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fcc04514f9b
RDX: 00007ffeba923460 RSI: 00000000c4009420 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000013 R09: 0000000000000001
R10: 00007fcc043fbba8 R11: 0000000000000246 R12: 00007ffeba924fc5
R13: 00007ffeba923460 R14: 0000000000000002 R15: 00000000004d4bb0
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:__blk_rq_map_sg+0x339/0x4a0
RSP: 0018:ffffc90001a43820 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffea00045d4802
RDX: 0000000117520000 RSI: 0000000000000000 RDI: ffff8881027d1000
RBP: 0000000000003000 R08: ffffea00045d4902 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000001000 R12: ffff8881003d10b8
R13: ffffc90001a438f0 R14: 0000000000000000 R15: 0000000000003000
FS: 00007fcc048a6900(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcc04514f71 CR3: 00000001109ea001 CR4: 0000000000370eb0
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception ]---

So in case of a relocation on a RAID stripe-tree based file system, skip
the readahead.

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
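
In outline, the change amounts to guarding the readahead call on whether the filesystem uses a RAID stripe tree. The userspace sketch below uses a hypothetical struct fs_model with a call counter in place of real readahead; it illustrates the control flow only and is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a feature flag plus a counter standing in for I/O. */
struct fs_model {
	bool has_raid_stripe_tree;
	unsigned int readahead_calls;
};

/*
 * Issue readahead for the relocation inode unless the filesystem is
 * backed by a RAID stripe tree, where the lookup may fail with ENOENT.
 */
static inline void maybe_readahead(struct fs_model *fs)
{
	assert(fs != NULL);
	if (fs->has_raid_stripe_tree)
		return;
	fs->readahead_calls++;	/* stand-in for the real readahead call */
}
```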


Revision tags: v6.11-rc1, v6.10, v6.10-rc7
# ca84529a 01-Jul-2024 Filipe Manana <[email protected]>

btrfs: fix data race when accessing the last_trans field of a root

KCSAN complains about a data race when accessing the last_trans field of a
root:

[ 199.553628] BUG: KCSAN: data-race in btrfs_record_root_in_trans [btrfs] / record_root_in_trans [btrfs]

[ 199.555186] read to 0x000000008801e308 of 8 bytes by task 2812 on cpu 1:
[ 199.555210] btrfs_record_root_in_trans+0x9a/0x128 [btrfs]
[ 199.555999] start_transaction+0x154/0xcd8 [btrfs]
[ 199.556780] btrfs_join_transaction+0x44/0x60 [btrfs]
[ 199.557559] btrfs_dirty_inode+0x9c/0x140 [btrfs]
[ 199.558339] btrfs_update_time+0x8c/0xb0 [btrfs]
[ 199.559123] touch_atime+0x16c/0x1e0
[ 199.559151] pipe_read+0x6a8/0x7d0
[ 199.559179] vfs_read+0x466/0x498
[ 199.559204] ksys_read+0x108/0x150
[ 199.559230] __s390x_sys_read+0x68/0x88
[ 199.559257] do_syscall+0x1c6/0x210
[ 199.559286] __do_syscall+0xc8/0xf0
[ 199.559318] system_call+0x70/0x98

[ 199.559431] write to 0x000000008801e308 of 8 bytes by task 2808 on cpu 0:
[ 199.559464] record_root_in_trans+0x196/0x228 [btrfs]
[ 199.560236] btrfs_record_root_in_trans+0xfe/0x128 [btrfs]
[ 199.561097] start_transaction+0x154/0xcd8 [btrfs]
[ 199.561927] btrfs_join_transaction+0x44/0x60 [btrfs]
[ 199.562700] btrfs_dirty_inode+0x9c/0x140 [btrfs]
[ 199.563493] btrfs_update_time+0x8c/0xb0 [btrfs]
[ 199.564277] file_update_time+0xb8/0xf0
[ 199.564301] pipe_write+0x8ac/0xab8
[ 199.564326] vfs_write+0x33c/0x588
[ 199.564349] ksys_write+0x108/0x150
[ 199.564372] __s390x_sys_write+0x68/0x88
[ 199.564397] do_syscall+0x1c6/0x210
[ 199.564424] __do_syscall+0xc8/0xf0
[ 199.564452] system_call+0x70/0x98

This is because we update and read last_trans concurrently without any
type of synchronization. This should be generally harmless and in the
worst case it can make us do extra locking (btrfs_record_root_in_trans()),
trigger some warnings at ctree.c or do extra work during relocation - this
would probably only happen in case of load or store tearing.

So fix this by always reading and updating the field using READ_ONCE()
and WRITE_ONCE(). This silences KCSAN and prevents load and store tearing.
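
As an illustration only, the access pattern described above can be sketched in
userspace C. The macros below are simplified approximations of the kernel's
READ_ONCE()/WRITE_ONCE() (a single volatile access), and demo_root and
demo_record_root_in_trans() are hypothetical stand-ins, not the code changed
by this commit:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified userspace approximations of the kernel macros (assumption):
 * the volatile access asks the compiler to emit one load or store of the
 * whole object instead of splitting, merging or repeating it.
 */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical stand-in for the relevant field of struct btrfs_root. */
struct demo_root {
	uint64_t last_trans;
};

/*
 * Record that the root was used in transaction 'transid'.
 * Returns 1 if last_trans was updated, 0 if it already matched.
 */
static int demo_record_root_in_trans(struct demo_root *root, uint64_t transid)
{
	assert(root != NULL);

	/* Lockless read: another task may be updating the field. */
	if (READ_ONCE(root->last_trans) == transid)
		return 0;

	/* Lockless update, paired with the READ_ONCE() above. */
	WRITE_ONCE(root->last_trans, transid);
	return 1;
}
```

Note that these accessors only constrain how the compiler emits the access;
they provide no ordering or mutual exclusion, which matches the commit's
statement that the race itself is expected to be harmless.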

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>


Revision tags: v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3
# a1f4e3d7 31-Jan-2022 David Sterba <[email protected]>

btrfs: switch btrfs_ordered_extent::inode to struct btrfs_inode

The structure is internal so we should use struct btrfs_inode for that,
which allows removing some uses of BTRFS_I.

Reviewed-by: Boris Burkov <[email protected]>
Signed-off-by: David Sterba <[email protected]>


# d13240dd 13-Jun-2024 Filipe Manana <[email protected]>

btrfs: remove super block argument from btrfs_iget()

It's pointless to pass a super block argument to btrfs_iget() because we
always pass a root and from it we can get the super block through:

root->fs_info->sb

So remove the super block argument.
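
A minimal sketch of why the argument is redundant, using simplified stand-in
types (the iget_demo_* names and the dev_id field are hypothetical; only the
root->fs_info->sb chain is taken from the message above):

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct iget_demo_sb      { int dev_id; };
struct iget_demo_fs_info { struct iget_demo_sb *sb; };
struct iget_demo_root    { struct iget_demo_fs_info *fs_info; };

/* Old shape: the caller passes both the super block and the root. */
static int iget_demo_old(struct iget_demo_sb *sb, unsigned long ino,
			 struct iget_demo_root *root)
{
	(void)ino;
	(void)root;
	return sb->dev_id;
}

/* New shape: the super block is derived from the root. */
static int iget_demo_new(unsigned long ino, struct iget_demo_root *root)
{
	struct iget_demo_sb *sb = root->fs_info->sb;

	(void)ino;
	return sb->dev_id;
}
```

Both variants reach the same super block as long as the caller passed the one
belonging to the root, so dropping the parameter also removes the possibility
of passing a mismatched pair.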

Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>


# 6d81df75 05-Jun-2024 Johannes Thumshirn <[email protected]>

btrfs: pass reloc_control to setup_relocation_extent_mapping()

All parameters passed into setup_relocation_extent_mapping() can be
derived from 'struct reloc_control', so only pass in a 'struct
reloc_control'.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>


# 60f3dabd 05-Jun-2024 Johannes Thumshirn <[email protected]>

btrfs: pass a struct reloc_control to prealloc_file_extent_cluster()

Pass a 'struct reloc_control' to prealloc_file_extent_cluster()
instead of passing its members 'data_inode' and 'cluster' on their own.
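
The shape of this refactoring can be sketched as follows. The rc_demo_* types,
their fields other than 'data_inode' and 'cluster', and the length computation
are assumptions made for illustration, not the kernel's definitions:

```c
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct rc_demo_inode   { uint64_t ino; };
struct rc_demo_cluster { uint64_t start; uint64_t end; /* inclusive */ };

struct rc_demo_control {
	struct rc_demo_inode *data_inode;
	struct rc_demo_cluster cluster;
};

/* Old shape: the members are passed individually. */
static uint64_t rc_demo_prealloc_len_old(const struct rc_demo_inode *inode,
					 const struct rc_demo_cluster *cluster)
{
	(void)inode;
	return cluster->end + 1 - cluster->start;
}

/* New shape: one control structure, members looked up inside. */
static uint64_t rc_demo_prealloc_len_new(const struct rc_demo_control *rc)
{
	const struct rc_demo_cluster *cluster = &rc->cluster;

	(void)rc->data_inode;
	return cluster->end + 1 - cluster->start;
}
```

The behavior is unchanged; the call site simply shrinks from two arguments
that always come from the same structure to the structure itself.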

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>


# 17a21d79 05-Jun-2024 Johannes Thumshirn <[email protected]>

btrfs: don't pass fs_info to describe_relocation()

In describe_relocation() the fs_info is only needed for printing
information via btrfs_info() and can easily be accessed via the passed
in 'struct btrfs_block_group'.

So we can safely remove the fs_info parameter.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
