History log of /linux-6.15/block/blk-core.c (Results 1 – 25 of 952)
Revision Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1
# 8fa7292f 05-Apr-2025 Thomas Gleixner <[email protected]>

treewide: Switch/rename to timer_delete[_sync]()

timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.

Conversion was done with coccinelle plus manual fixups where necessary.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
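
As an illustration, the rename is purely mechanical at each call site
(a hypothetical example, not a hunk from this file):

	/* before: historical wrapper inlines */
	del_timer(&q->timeout);
	del_timer_sync(&q->timeout);

	/* after: the canonical names, identical semantics */
	timer_delete(&q->timeout);
	timer_delete_sync(&q->timeout);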



Revision tags: v6.14
# ffa1e7ad 18-Mar-2025 Thomas Hellström <[email protected]>

block: Make request_queue lockdep splats show up earlier

In recent kernels, there are lockdep splats around the
struct request_queue::io_lockdep_map, similar to [1], but they
typically don't show up until reclaim with writeback happens.

Having multiple kernel versions released with a known risk of kernel
deadlock during reclaim writeback should IMHO be addressed and
backported to -stable with the highest priority.

In order to have these lockdep splats show up earlier,
preferably during system initialization, prime the
struct request_queue::io_lockdep_map as GFP_KERNEL reclaim-tainted.
This will instead lead to lockdep splats looking similar to [2], but
without the need for reclaim + writeback happening.
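
A minimal sketch of what such priming could look like in
blk_alloc_queue(), assuming lockdep's rwsem map helpers (the exact
hunk may differ):

	/* Tell lockdep the io lock may be taken from GFP_KERNEL reclaim
	 * context, so bad ordering is flagged at initialization rather
	 * than during reclaim + writeback. */
	fs_reclaim_acquire(GFP_KERNEL);
	rwsem_acquire_read(&q->io_lockdep_map, 0, 0, _RET_IP_);
	rwsem_release(&q->io_lockdep_map, _RET_IP_);
	fs_reclaim_release(GFP_KERNEL);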

[1]:
[ 189.762244] ======================================================
[ 189.762432] WARNING: possible circular locking dependency detected
[ 189.762441] 6.14.0-rc6-xe+ #6 Tainted: G U
[ 189.762450] ------------------------------------------------------
[ 189.762459] kswapd0/119 is trying to acquire lock:
[ 189.762467] ffff888110ceb710 (&q->q_usage_counter(io)#26){++++}-{0:0}, at: __submit_bio+0x76/0x230
[ 189.762485]
but task is already holding lock:
[ 189.762494] ffffffff834c97c0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xbe/0xb00
[ 189.762507]
which lock already depends on the new lock.

[ 189.762519]
the existing dependency chain (in reverse order) is:
[ 189.762529]
-> #2 (fs_reclaim){+.+.}-{0:0}:
[ 189.762540] fs_reclaim_acquire+0xc5/0x100
[ 189.762548] kmem_cache_alloc_lru_noprof+0x4a/0x480
[ 189.762558] alloc_inode+0xaa/0xe0
[ 189.762566] iget_locked+0x157/0x330
[ 189.762573] kernfs_get_inode+0x1b/0x110
[ 189.762582] kernfs_get_tree+0x1b0/0x2e0
[ 189.762590] sysfs_get_tree+0x1f/0x60
[ 189.762597] vfs_get_tree+0x2a/0xf0
[ 189.762605] path_mount+0x4cd/0xc00
[ 189.762613] __x64_sys_mount+0x119/0x150
[ 189.762621] x64_sys_call+0x14f2/0x2310
[ 189.762630] do_syscall_64+0x91/0x180
[ 189.762637] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 189.762647]
-> #1 (&root->kernfs_rwsem){++++}-{3:3}:
[ 189.762659] down_write+0x3e/0xf0
[ 189.762667] kernfs_remove+0x32/0x60
[ 189.762676] sysfs_remove_dir+0x4f/0x60
[ 189.762685] __kobject_del+0x33/0xa0
[ 189.762709] kobject_del+0x13/0x30
[ 189.762716] elv_unregister_queue+0x52/0x80
[ 189.762725] elevator_switch+0x68/0x360
[ 189.762733] elv_iosched_store+0x14b/0x1b0
[ 189.762756] queue_attr_store+0x181/0x1e0
[ 189.762765] sysfs_kf_write+0x49/0x80
[ 189.762773] kernfs_fop_write_iter+0x17d/0x250
[ 189.762781] vfs_write+0x281/0x540
[ 189.762790] ksys_write+0x72/0xf0
[ 189.762798] __x64_sys_write+0x19/0x30
[ 189.762807] x64_sys_call+0x2a3/0x2310
[ 189.762815] do_syscall_64+0x91/0x180
[ 189.762823] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 189.762833]
-> #0 (&q->q_usage_counter(io)#26){++++}-{0:0}:
[ 189.762845] __lock_acquire+0x1525/0x2760
[ 189.762854] lock_acquire+0xca/0x310
[ 189.762861] blk_mq_submit_bio+0x8a2/0xba0
[ 189.762870] __submit_bio+0x76/0x230
[ 189.762878] submit_bio_noacct_nocheck+0x323/0x430
[ 189.762888] submit_bio_noacct+0x2cc/0x620
[ 189.762896] submit_bio+0x38/0x110
[ 189.762904] __swap_writepage+0xf5/0x380
[ 189.762912] swap_writepage+0x3c7/0x600
[ 189.762920] shmem_writepage+0x3da/0x4f0
[ 189.762929] pageout+0x13f/0x310
[ 189.762937] shrink_folio_list+0x61c/0xf60
[ 189.763261] evict_folios+0x378/0xcd0
[ 189.763584] try_to_shrink_lruvec+0x1b0/0x360
[ 189.763946] shrink_one+0x10e/0x200
[ 189.764266] shrink_node+0xc02/0x1490
[ 189.764586] balance_pgdat+0x563/0xb00
[ 189.764934] kswapd+0x1e8/0x430
[ 189.765249] kthread+0x10b/0x260
[ 189.765559] ret_from_fork+0x44/0x70
[ 189.765889] ret_from_fork_asm+0x1a/0x30
[ 189.766198]
other info that might help us debug this:

[ 189.767089] Chain exists of:
&q->q_usage_counter(io)#26 --> &root->kernfs_rwsem --> fs_reclaim

[ 189.767971] Possible unsafe locking scenario:

[ 189.768555] CPU0 CPU1
[ 189.768849] ---- ----
[ 189.769136] lock(fs_reclaim);
[ 189.769421] lock(&root->kernfs_rwsem);
[ 189.769714] lock(fs_reclaim);
[ 189.770016] rlock(&q->q_usage_counter(io)#26);
[ 189.770305]
*** DEADLOCK ***

[ 189.771167] 1 lock held by kswapd0/119:
[ 189.771453] #0: ffffffff834c97c0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xbe/0xb00
[ 189.771770]
stack backtrace:
[ 189.772351] CPU: 4 UID: 0 PID: 119 Comm: kswapd0 Tainted: G U 6.14.0-rc6-xe+ #6
[ 189.772353] Tainted: [U]=USER
[ 189.772354] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023
[ 189.772354] Call Trace:
[ 189.772355] <TASK>
[ 189.772356] dump_stack_lvl+0x6e/0xa0
[ 189.772359] dump_stack+0x10/0x18
[ 189.772360] print_circular_bug.cold+0x17a/0x1b7
[ 189.772363] check_noncircular+0x13a/0x150
[ 189.772365] ? __pfx_stack_trace_consume_entry+0x10/0x10
[ 189.772368] __lock_acquire+0x1525/0x2760
[ 189.772368] ? ret_from_fork_asm+0x1a/0x30
[ 189.772371] lock_acquire+0xca/0x310
[ 189.772372] ? __submit_bio+0x76/0x230
[ 189.772375] ? lock_release+0xd5/0x2c0
[ 189.772376] blk_mq_submit_bio+0x8a2/0xba0
[ 189.772378] ? __submit_bio+0x76/0x230
[ 189.772380] __submit_bio+0x76/0x230
[ 189.772382] ? trace_hardirqs_on+0x1e/0xe0
[ 189.772384] submit_bio_noacct_nocheck+0x323/0x430
[ 189.772386] ? submit_bio_noacct_nocheck+0x323/0x430
[ 189.772387] ? __might_sleep+0x58/0xa0
[ 189.772390] submit_bio_noacct+0x2cc/0x620
[ 189.772391] ? count_memcg_events+0x68/0x90
[ 189.772393] submit_bio+0x38/0x110
[ 189.772395] __swap_writepage+0xf5/0x380
[ 189.772396] swap_writepage+0x3c7/0x600
[ 189.772397] shmem_writepage+0x3da/0x4f0
[ 189.772401] pageout+0x13f/0x310
[ 189.772406] shrink_folio_list+0x61c/0xf60
[ 189.772409] ? isolate_folios+0xe80/0x16b0
[ 189.772410] ? mark_held_locks+0x46/0x90
[ 189.772412] evict_folios+0x378/0xcd0
[ 189.772414] ? evict_folios+0x34a/0xcd0
[ 189.772415] ? lock_is_held_type+0xa3/0x130
[ 189.772417] try_to_shrink_lruvec+0x1b0/0x360
[ 189.772420] shrink_one+0x10e/0x200
[ 189.772421] shrink_node+0xc02/0x1490
[ 189.772423] ? shrink_node+0xa08/0x1490
[ 189.772424] ? shrink_node+0xbd8/0x1490
[ 189.772425] ? mem_cgroup_iter+0x366/0x480
[ 189.772427] balance_pgdat+0x563/0xb00
[ 189.772428] ? balance_pgdat+0x563/0xb00
[ 189.772430] ? trace_hardirqs_on+0x1e/0xe0
[ 189.772431] ? finish_task_switch.isra.0+0xcb/0x330
[ 189.772433] ? __switch_to_asm+0x33/0x70
[ 189.772437] kswapd+0x1e8/0x430
[ 189.772438] ? __pfx_autoremove_wake_function+0x10/0x10
[ 189.772440] ? __pfx_kswapd+0x10/0x10
[ 189.772441] kthread+0x10b/0x260
[ 189.772443] ? __pfx_kthread+0x10/0x10
[ 189.772444] ret_from_fork+0x44/0x70
[ 189.772446] ? __pfx_kthread+0x10/0x10
[ 189.772447] ret_from_fork_asm+0x1a/0x30
[ 189.772450] </TASK>

[2]:
[ 8.760253] ======================================================
[ 8.760254] WARNING: possible circular locking dependency detected
[ 8.760255] 6.14.0-rc6-xe+ #7 Tainted: G U
[ 8.760256] ------------------------------------------------------
[ 8.760257] (udev-worker)/674 is trying to acquire lock:
[ 8.760259] ffff888100e39148 (&root->kernfs_rwsem){++++}-{3:3}, at: kernfs_remove+0x32/0x60
[ 8.760265]
but task is already holding lock:
[ 8.760266] ffff888110dc7680 (&q->q_usage_counter(io)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30
[ 8.760272]
which lock already depends on the new lock.

[ 8.760272]
the existing dependency chain (in reverse order) is:
[ 8.760273]
-> #2 (&q->q_usage_counter(io)#27){++++}-{0:0}:
[ 8.760276] blk_alloc_queue+0x30a/0x350
[ 8.760279] blk_mq_alloc_queue+0x6b/0xe0
[ 8.760281] scsi_alloc_sdev+0x276/0x3c0
[ 8.760284] scsi_probe_and_add_lun+0x22a/0x440
[ 8.760286] __scsi_scan_target+0x109/0x230
[ 8.760288] scsi_scan_channel+0x65/0xc0
[ 8.760290] scsi_scan_host_selected+0xff/0x140
[ 8.760292] do_scsi_scan_host+0xa7/0xc0
[ 8.760293] do_scan_async+0x1c/0x160
[ 8.760295] async_run_entry_fn+0x32/0x150
[ 8.760299] process_one_work+0x224/0x5f0
[ 8.760302] worker_thread+0x1d4/0x3e0
[ 8.760304] kthread+0x10b/0x260
[ 8.760306] ret_from_fork+0x44/0x70
[ 8.760309] ret_from_fork_asm+0x1a/0x30
[ 8.760312]
-> #1 (fs_reclaim){+.+.}-{0:0}:
[ 8.760315] fs_reclaim_acquire+0xc5/0x100
[ 8.760317] kmem_cache_alloc_lru_noprof+0x4a/0x480
[ 8.760319] alloc_inode+0xaa/0xe0
[ 8.760322] iget_locked+0x157/0x330
[ 8.760323] kernfs_get_inode+0x1b/0x110
[ 8.760325] kernfs_get_tree+0x1b0/0x2e0
[ 8.760327] sysfs_get_tree+0x1f/0x60
[ 8.760329] vfs_get_tree+0x2a/0xf0
[ 8.760332] path_mount+0x4cd/0xc00
[ 8.760334] __x64_sys_mount+0x119/0x150
[ 8.760336] x64_sys_call+0x14f2/0x2310
[ 8.760338] do_syscall_64+0x91/0x180
[ 8.760340] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 8.760342]
-> #0 (&root->kernfs_rwsem){++++}-{3:3}:
[ 8.760345] __lock_acquire+0x1525/0x2760
[ 8.760347] lock_acquire+0xca/0x310
[ 8.760348] down_write+0x3e/0xf0
[ 8.760350] kernfs_remove+0x32/0x60
[ 8.760351] sysfs_remove_dir+0x4f/0x60
[ 8.760353] __kobject_del+0x33/0xa0
[ 8.760355] kobject_del+0x13/0x30
[ 8.760356] elv_unregister_queue+0x52/0x80
[ 8.760358] elevator_switch+0x68/0x360
[ 8.760360] elv_iosched_store+0x14b/0x1b0
[ 8.760362] queue_attr_store+0x181/0x1e0
[ 8.760364] sysfs_kf_write+0x49/0x80
[ 8.760366] kernfs_fop_write_iter+0x17d/0x250
[ 8.760367] vfs_write+0x281/0x540
[ 8.760370] ksys_write+0x72/0xf0
[ 8.760372] __x64_sys_write+0x19/0x30
[ 8.760374] x64_sys_call+0x2a3/0x2310
[ 8.760376] do_syscall_64+0x91/0x180
[ 8.760377] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 8.760380]
other info that might help us debug this:

[ 8.760380] Chain exists of:
&root->kernfs_rwsem --> fs_reclaim --> &q->q_usage_counter(io)#27

[ 8.760384] Possible unsafe locking scenario:

[ 8.760384] CPU0 CPU1
[ 8.760385] ---- ----
[ 8.760385] lock(&q->q_usage_counter(io)#27);
[ 8.760387] lock(fs_reclaim);
[ 8.760388] lock(&q->q_usage_counter(io)#27);
[ 8.760390] lock(&root->kernfs_rwsem);
[ 8.760391]
*** DEADLOCK ***

[ 8.760391] 6 locks held by (udev-worker)/674:
[ 8.760392] #0: ffff8881209ac420 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x72/0xf0
[ 8.760398] #1: ffff88810c80f488 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x136/0x250
[ 8.760402] #2: ffff888125d1d330 (kn->active#101){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x13f/0x250
[ 8.760406] #3: ffff888110dc7bb0 (&q->sysfs_lock){+.+.}-{3:3}, at: queue_attr_store+0x148/0x1e0
[ 8.760411] #4: ffff888110dc7680 (&q->q_usage_counter(io)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30
[ 8.760416] #5: ffff888110dc76b8 (&q->q_usage_counter(queue)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30
[ 8.760421]
stack backtrace:
[ 8.760422] CPU: 7 UID: 0 PID: 674 Comm: (udev-worker) Tainted: G U 6.14.0-rc6-xe+ #7
[ 8.760424] Tainted: [U]=USER
[ 8.760425] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023
[ 8.760426] Call Trace:
[ 8.760427] <TASK>
[ 8.760428] dump_stack_lvl+0x6e/0xa0
[ 8.760431] dump_stack+0x10/0x18
[ 8.760433] print_circular_bug.cold+0x17a/0x1b7
[ 8.760437] check_noncircular+0x13a/0x150
[ 8.760441] ? save_trace+0x54/0x360
[ 8.760445] __lock_acquire+0x1525/0x2760
[ 8.760446] ? irqentry_exit+0x3a/0xb0
[ 8.760448] ? sysvec_apic_timer_interrupt+0x57/0xc0
[ 8.760452] lock_acquire+0xca/0x310
[ 8.760453] ? kernfs_remove+0x32/0x60
[ 8.760457] down_write+0x3e/0xf0
[ 8.760459] ? kernfs_remove+0x32/0x60
[ 8.760460] kernfs_remove+0x32/0x60
[ 8.760462] sysfs_remove_dir+0x4f/0x60
[ 8.760464] __kobject_del+0x33/0xa0
[ 8.760466] kobject_del+0x13/0x30
[ 8.760467] elv_unregister_queue+0x52/0x80
[ 8.760470] elevator_switch+0x68/0x360
[ 8.760472] elv_iosched_store+0x14b/0x1b0
[ 8.760475] queue_attr_store+0x181/0x1e0
[ 8.760479] ? lock_acquire+0xca/0x310
[ 8.760480] ? kernfs_fop_write_iter+0x13f/0x250
[ 8.760482] ? lock_is_held_type+0xa3/0x130
[ 8.760485] sysfs_kf_write+0x49/0x80
[ 8.760487] kernfs_fop_write_iter+0x17d/0x250
[ 8.760489] vfs_write+0x281/0x540
[ 8.760494] ksys_write+0x72/0xf0
[ 8.760497] __x64_sys_write+0x19/0x30
[ 8.760499] x64_sys_call+0x2a3/0x2310
[ 8.760502] do_syscall_64+0x91/0x180
[ 8.760504] ? trace_hardirqs_off+0x5d/0xe0
[ 8.760506] ? handle_softirqs+0x479/0x4d0
[ 8.760508] ? hrtimer_interrupt+0x13f/0x280
[ 8.760511] ? irqentry_exit_to_user_mode+0x8b/0x260
[ 8.760513] ? clear_bhb_loop+0x15/0x70
[ 8.760515] ? clear_bhb_loop+0x15/0x70
[ 8.760516] ? clear_bhb_loop+0x15/0x70
[ 8.760518] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 8.760520] RIP: 0033:0x7aa3bf2f5504
[ 8.760522] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 8b 10 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
[ 8.760523] RSP: 002b:00007ffc1e3697d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 8.760526] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007aa3bf2f5504
[ 8.760527] RDX: 0000000000000003 RSI: 00007ffc1e369ae0 RDI: 000000000000001c
[ 8.760528] RBP: 00007ffc1e369800 R08: 00007aa3bf3f51c8 R09: 00007ffc1e3698b0
[ 8.760528] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003
[ 8.760529] R13: 00007ffc1e369ae0 R14: 0000613ccf21f2f0 R15: 00007aa3bf3f4e80
[ 8.760533] </TASK>

v2:
- Update a code comment to increase readability (Ming Lei).

Cc: Jens Axboe <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Ming Lei <[email protected]>
Signed-off-by: Thomas Hellström <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



Revision tags: v6.14-rc7, v6.14-rc6
# 1bf70d08 04-Mar-2025 Nilay Shroff <[email protected]>

block: introduce a dedicated lock for protecting queue elevator updates

A queue's elevator can be updated either when modifying nr_hw_queues
or through the sysfs scheduler attribute. Currently, elevator switching/
updating is protected using q->sysfs_lock, but this has led to lockdep
splats[1] due to inconsistent lock ordering between q->sysfs_lock and
the freeze-lock in multiple block layer call sites.

As the scope of q->sysfs_lock is not well-defined, its (mis)use has
resulted in numerous lockdep warnings. To address this, introduce a new
q->elevator_lock, dedicated specifically to protecting elevator
switches/updates, and use it instead of q->sysfs_lock for that
purpose.

While at it, make elv_iosched_load_module() a static function, as it is
only called from elv_iosched_store(). Also, remove redundant parameters
from elv_iosched_load_module() function signature.

[1] https://lore.kernel.org/all/[email protected]/
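
A sketch of the intended locking pattern (call sites simplified;
elevator_change() is an illustrative helper name):

	/* struct request_queue gains a dedicated mutex */
	struct mutex elevator_lock;	/* protects elevator switch/update */

	/* e.g. in the sysfs scheduler store path */
	mutex_lock(&q->elevator_lock);
	ret = elevator_change(q, name);
	mutex_unlock(&q->elevator_lock);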

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Signed-off-by: Nilay Shroff <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1
# fe662860 28-Jan-2025 Nilay Shroff <[email protected]>

block: get rid of request queue ->sysfs_dir_lock

The request queue uses ->sysfs_dir_lock for protecting the addition/
deletion of kobject entries under sysfs while we register/unregister
blk-mq. However, kobject addition/deletion is already protected with
kernfs/sysfs internal synchronization primitives, so the use of
q->sysfs_dir_lock seems redundant.

Moreover, q->sysfs_dir_lock is also used at a few other call sites
along with q->sysfs_lock for protecting the addition/deletion of
kobjects. One such example is when we register with sysfs a set of
independent access ranges for a disk. Here as well we could get rid of
q->sysfs_dir_lock and only use q->sysfs_lock.

The only variable which q->sysfs_dir_lock appears to protect is
q->mq_sysfs_init_done, which is set/unset while registering/
unregistering blk-mq with sysfs. But the use of q->mq_sysfs_init_done
can easily be replaced with the queue registered bit
QUEUE_FLAG_REGISTERED.

So with this patch we remove q->sysfs_dir_lock from each call site
and replace q->mq_sysfs_init_done with QUEUE_FLAG_REGISTERED.
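
A sketch of the replacement, using the existing blk_queue_registered()
helper for the QUEUE_FLAG_REGISTERED bit:

	/* before */
	if (!q->mq_sysfs_init_done)
		return;

	/* after */
	if (!blk_queue_registered(q))
		return;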

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Nilay Shroff <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



Revision tags: v6.13, v6.13-rc7
# d432c817 10-Jan-2025 Christoph Hellwig <[email protected]>

block: don't update BLK_FEAT_POLL in __blk_mq_update_nr_hw_queues

When __blk_mq_update_nr_hw_queues changes the number of tag sets, it
might have to disable poll queues. Currently it does so by adj

block: don't update BLK_FEAT_POLL in __blk_mq_update_nr_hw_queues

When __blk_mq_update_nr_hw_queues changes the number of tag sets, it
might have to disable poll queues. Currently it does so by adjusting
the BLK_FEAT_POLL, which is a bit against the intent of features that
describe hardware / driver capabilities, but more importantly causes
nasty lock order problems with the broadly held freeze when updating the
number of hardware queues and the limits lock. Fix this by leaving
BLK_FEAT_POLL alone, and instead check for the number of poll queues in
the bio submission and poll handlers. While this adds extra work to the
fast path, the variables are in cache lines used by these operations
anyway, so it should be cheap enough.
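
A hedged sketch of such a check (helper name assumed, not necessarily
the committed one):

	static bool blk_queue_can_poll(struct request_queue *q)
	{
		if (!(q->limits.features & BLK_FEAT_POLL))
			return false;
		/* rq-based: require actual poll queues in the tag set */
		if (queue_is_mq(q))
			return q->tag_set->map[HCTX_TYPE_POLL].nr_queues > 0;
		/* bio-based drivers advertising BLK_FEAT_POLL */
		return true;
	}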

Fixes: 8023e144f9d6 ("block: move the poll flag to queue_limits")
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Nilay Shroff <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



# 958148a6 10-Jan-2025 Christoph Hellwig <[email protected]>

block: check BLK_FEAT_POLL under q_usage_count

Otherwise feature reconfiguration can race with I/O submission.

Also drop the bio_clear_polled in the error path, as the flag does not
matter for instant error completions; it is a leftover from when we
allowed polled I/O to proceed unpolled in this case.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Nilay Shroff <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12
# a3396b99 13-Nov-2024 Christoph Hellwig <[email protected]>

block: add a rq_list type

Replace the semi-open coded request list helpers with a proper rq_list
type that mirrors the bio_list and has head and tail pointers. Besides
better type safety, this actually allows inserting at the tail of the
list, which will be useful soon.
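
The type itself is tiny; a sketch mirroring bio_list as described:

	struct rq_list {
		struct request *head;
		struct request *tail;
	};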

Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



Revision tags: v6.12-rc7
# 559218d4 08-Nov-2024 Christoph Hellwig <[email protected]>

block: pre-calculate max_zone_append_sectors

max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using the queue_limits_max_zone_append_sectors helper. This
not only adds (tiny) extra overhead to the I/O path, but can also
easily be forgotten in file system code.

Fix this by adding a new max_hw_zone_append_sectors value to
queue_limits which is set by the driver, and calculating
max_zone_append_sectors from that and the other inputs in
blk_validate_zoned_limits, similar to how max_sectors is calculated.
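
Conceptually, the derivation then looks like this sketch (the
committed code may differ in detail):

	/* in blk_validate_zoned_limits(): cap the driver-provided value
	 * by the other queue limits */
	lim->max_zone_append_sectors =
		min(lim->max_hw_zone_append_sectors,
		    min(lim->chunk_sectors, lim->max_hw_sectors));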

Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Damien Le Moal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



Revision tags: v6.12-rc6
# 6a786998 31-Oct-2024 Ming Lei <[email protected]>

block: always verify unfreeze lock on the owner task

commit f1be1788a32e ("block: model freeze & enter queue as lock for
supporting lockdep") tries to apply lockdep for verifying freeze &
unfreeze. However, the verification is only done for the outermost
freeze and unfreeze. This is actually not correct because
q->mq_freeze_depth may still drop to zero on a task other than the
freeze owner task.

Fix this issue by always verifying the last unfreeze lock on the owner
task context, and making sure both the outermost freeze & unfreeze are
verified in the current task.

Fixes: f1be1788a32e ("block: model freeze & enter queue as lock for supporting lockdep")
Signed-off-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



# ab9bc81c 07-Nov-2024 Jens Axboe <[email protected]>

Revert "block: pre-calculate max_zone_append_sectors"

This causes issue on, at least, nvme-mpath where my boot fails with:

WARNING: CPU: 354 PID: 2729 at block/blk-settings.c:75 blk_validate_limits

Revert "block: pre-calculate max_zone_append_sectors"

This causes issue on, at least, nvme-mpath where my boot fails with:

WARNING: CPU: 354 PID: 2729 at block/blk-settings.c:75 blk_validate_limits+0x356/0x380
Modules linked in: tg3(+) nvme usbcore scsi_mod ptp i2c_piix4 libphy nvme_core crc32c_intel scsi_common usb_common pps_core i2c_smbus
CPU: 354 UID: 0 PID: 2729 Comm: kworker/u2061:1 Not tainted 6.12.0-rc6+ #181
Hardware name: Dell Inc. PowerEdge R7625/06444F, BIOS 1.8.3 04/02/2024
Workqueue: async async_run_entry_fn
RIP: 0010:blk_validate_limits+0x356/0x380
Code: f6 47 01 04 75 28 83 bf 94 00 00 00 00 75 39 83 bf 98 00 00 00 00 75 34 83 7f 68 00 75 32 31 c0 83 7f 5c 00 0f 84 9b fd ff ff <0f> 0b eb 13 0f 0b eb 0f 48 c7 c0 74 12 58 92 48 89 c7 e8 13 76 46
RSP: 0018:ffffa8a1dfb93b30 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff9232829c8388 RCX: 0000000000000088
RDX: 0000000000000080 RSI: 0000000000000200 RDI: ffffa8a1dfb93c38
RBP: 000000000000000c R08: 00000000ffffffff R09: 000000000000ffff
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9232829b9000
R13: ffff9232829b9010 R14: ffffa8a1dfb93c38 R15: ffffa8a1dfb93c38
FS: 0000000000000000(0000) GS:ffff923867c80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055c1b92480a8 CR3: 0000002484ff0002 CR4: 0000000000370ef0
Call Trace:
<TASK>
? __warn+0xca/0x1a0
? blk_validate_limits+0x356/0x380
? report_bug+0x11a/0x1a0
? handle_bug+0x5e/0x90
? exc_invalid_op+0x16/0x40
? asm_exc_invalid_op+0x16/0x20
? blk_validate_limits+0x356/0x380
blk_alloc_queue+0x7a/0x250
__blk_alloc_disk+0x39/0x80
nvme_mpath_alloc_disk+0x13d/0x1b0 [nvme_core]
nvme_scan_ns+0xcc7/0x1010 [nvme_core]
async_run_entry_fn+0x27/0x120
process_scheduled_works+0x1a0/0x360
worker_thread+0x2bc/0x350
? pr_cont_work+0x1b0/0x1b0
kthread+0x111/0x120
? kthread_unuse_mm+0x90/0x90
ret_from_fork+0x30/0x40
? kthread_unuse_mm+0x90/0x90
ret_from_fork_asm+0x11/0x20
</TASK>
---[ end trace 0000000000000000 ]---

presumably due to max_zone_append_sectors not being cleared to zero,
resulting in blk_validate_zoned_limits() complaining and failing.

This reverts commit 2a8f6153e1c2db06a537a5c9d61102eb591776f1.

Signed-off-by: Jens Axboe <[email protected]>



# 2a8f6153 04-Nov-2024 Christoph Hellwig <[email protected]>

block: pre-calculate max_zone_append_sectors

max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using the queue_limits_max_zone_append_sectors helper. This
not only adds (tiny) extra overhead to the I/O path, but can also
easily be forgotten in file system code.

Fix this by adding a new max_hw_zone_append_sectors value to
queue_limits which is set by the driver, and calculating
max_zone_append_sectors from that and the other inputs in
blk_validate_zoned_limits, similar to how max_sectors is calculated.

Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



Revision tags: v6.12-rc5
# f1be1788 25-Oct-2024 Ming Lei <[email protected]>

block: model freeze & enter queue as lock for supporting lockdep

Recently we got several deadlock report[1][2][3] caused by
blk_mq_freeze_queue and blk_enter_queue().

Turns out the two are just like acquiring a read/write lock, so model
them as a read/write lock for supporting lockdep:

1) model q->q_usage_counter as two locks (io and queue lock)

- queue lock covers sync with blk_enter_queue()

- io lock covers sync with bio_enter_queue()

2) make the lockdep class/key as per-queue:

- different subsystems have very different lock use patterns; a shared
lock class easily causes false positives

- freeze_queue degrades to no lock in case the disk state becomes DEAD,
because bio_enter_queue() won't be blocked any more

- freeze_queue degrades to no lock in case the request queue becomes
dying, because blk_enter_queue() won't be blocked any more

3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock
- it is an exclusive lock, so the dependency with blk_enter_queue() is
covered

- it is a trylock because blk_mq_freeze_queue() calls are allowed to
run concurrently

4) model blk_enter_queue() & bio_enter_queue() as acquire_read()
- nested blk_enter_queue() calls are allowed

- the dependency with blk_mq_freeze_queue() is covered

- blk_queue_exit() is often called from other contexts (such as irq),
and it can't be annotated as lock_release(), so simply do it in
blk_enter_queue(); this way still covers as many cases as possible

With lockdep support, such reports can surface as early as possible,
without waiting until the real deadlock is triggered.

For example, lockdep report can be triggered in the report[3] with this
patch applied.
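
A rough sketch of the annotations this modelling implies (lockdep map
field names assumed from the series):

	/* freeze/unfreeze: exclusive trylock on the io "lock" */
	rwsem_acquire(&q->io_lockdep_map, 0, /* trylock */ 1, _RET_IP_);
	rwsem_release(&q->io_lockdep_map, _RET_IP_);

	/* blk_enter_queue()/bio_enter_queue(): read acquire, released
	 * immediately because blk_queue_exit() may run from irq context */
	rwsem_acquire_read(&q->q_lockdep_map, 0, 0, _RET_IP_);
	rwsem_release(&q->q_lockdep_map, _RET_IP_);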

[1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
https://bugzilla.kernel.org/show_bug.cgi?id=219166

[2] del_gendisk() vs blk_queue_enter() race condition
https://lore.kernel.org/linux-block/[email protected]/

[3] queue_freeze & queue_enter deadlock in scsi
https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



Revision tags: v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3
# ea6787c6 05-Aug-2024 John Garry <[email protected]>

scsi: block: Don't check REQ_ATOMIC for reads

We check in submit_bio_noacct() if flag REQ_ATOMIC is set for both read and
write operations, and then validate the atomic operation if set. Flag
REQ_ATOMIC can only be set for writes, so don't bother checking for reads.
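
A sketch of the narrowed check in submit_bio_noacct(), reusing the
validation helper that the atomic write series added (placement
assumed):

	switch (bio_op(bio)) {
	case REQ_OP_WRITE:
		/* REQ_ATOMIC is only meaningful for writes */
		if (bio->bi_opf & REQ_ATOMIC) {
			status = blk_validate_atomic_write_op_size(q, bio);
			if (status != BLK_STS_OK)
				goto end_io;
		}
		break;
	default:
		break;
	}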

Fixes: 9da3d1e912f3 ("block: Add core atomic write support")
Signed-off-by: John Garry <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Kanchan Joshi <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>



Revision tags: v6.11-rc2, v6.11-rc1
# 73e59d3e 18-Jul-2024 hexue <[email protected]>

block: avoid polling configuration errors

This patch adds a poll queue check, aiming to help users use polled IO
accurately.

If users do polled IO but the device doesn't have poll queues, they
will get suboptimal performance data and waste CPU resources. Add a
poll queue check catching this. If users don't have the device properly
configured, or if it simply doesn't support polled IO, the IO will be
errored with -EOPNOTSUPP. This is similar to what we used to do for
sync polled IO, which is no longer supported.
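
A hedged sketch of the rejection path (placement and helper assumed;
see the blk_queue_can_poll() sketch earlier in this log):

	/* fail polled IO early instead of silently running it unpolled */
	if ((bio->bi_opf & REQ_POLLED) && !blk_queue_can_poll(q)) {
		bio->bi_status = BLK_STS_NOTSUPP;	/* -EOPNOTSUPP */
		bio_endio(bio);
		return;
	}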

Signed-off-by: hexue <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



Revision tags: v6.10, v6.10-rc7
# f2a7bea2 04-Jul-2024 Damien Le Moal <[email protected]>

block: Remove REQ_OP_ZONE_RESET_ALL emulation

Now that device mapper can handle resetting all zones of a mapped zoned
device using REQ_OP_ZONE_RESET_ALL, all zoned block device drivers
support this operation. With this, the request queue feature
BLK_FEAT_ZONE_RESETALL is not necessary and the emulation code in
blk-zone.c can be removed.

Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



Revision tags: v6.10-rc6
# 63db4a1f 27-Jun-2024 John Garry <[email protected]>

block: Delete blk_queue_flag_test_and_set()

Since commit 70200574cc22 ("block: remove QUEUE_FLAG_DISCARD"),
blk_queue_flag_test_and_set() has not been used, so delete it.

Signed-off-by: John Garry <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



Revision tags: v6.10-rc5
# 9da3d1e9 20-Jun-2024 John Garry <[email protected]>

block: Add core atomic write support

Add atomic write support, as follows:
- add helper functions to get request_queue atomic write limits
- report request_queue atomic write support limits to sysfs and update Doc
- support to safely merge atomic writes
- deal with splitting atomic writes
- misc helper functions
- add a per-request atomic write flag

New request_queue limits are added, as follows:
- atomic_write_hw_max is set by the block driver and is the maximum length
of an atomic write which the device may support. It is not
necessarily a power-of-2.
- atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
and atomic_write_max_sectors would be the limit on a merged atomic write
request size. This value is not capped at max_sectors, as the value in
max_sectors can be controlled from userspace, and it would only cause
trouble if userspace could limit atomic_write_unit_max_bytes and the
other atomic write limits.
- atomic_write_hw_unit_{min,max} are set by the block driver and are the
min/max length of an atomic write unit which the device may support. They
both must be a power-of-2. Typically atomic_write_hw_unit_max will hold
the same value as atomic_write_hw_max.
- atomic_write_unit_{min,max} are derived from
atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
Both min and max values must be a power-of-2.
- atomic_write_hw_boundary is set by the block driver. If non-zero, it
indicates an LBA space boundary; an atomic write that straddles it is
no longer executed atomically by the disk. The value must be a
power-of-2. Note that it would be acceptable to enforce a rule that
atomic_write_hw_boundary_sectors is a multiple of
atomic_write_hw_unit_max, but the resultant code would be more
complicated.

All atomic write limits are set to 0 by default to indicate no atomic
write support. Even though it is assumed by Linux that a logical block
can always be atomically written, we ignore this as it is not of
particular interest. Stacked devices are just not supported either for
now.

An atomic write must always be submitted to the block driver as part of a
single request. As such, only a single BIO must be submitted to the block
layer for an atomic write. When a single atomic write BIO is submitted, it
cannot be split. As such, atomic_write_unit_{max, min}_bytes are limited
by the maximum guaranteed BIO size which will not be required to be split.
This max size is calculated by request_queue max segments and the number
of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on each
segment containing PAGE_SIZE of data, apart from the first+last, which each
can fit logical block size of data. The first+last will be LBS
length/aligned as we rely on direct IO alignment rules also.

New sysfs files are added to report the following atomic write limits:
- atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in
bytes
- atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in
bytes
- atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in
bytes
- atomic_write_max_bytes - same as atomic_write_max_sectors in bytes

Atomic writes may only be merged with other atomic writes and only under
the following conditions:
- total resultant request length <= atomic_write_max_bytes
- the merged write does not straddle a boundary

Helper function bdev_can_atomic_write() is added to indicate whether
atomic writes may be issued to a bdev. If a bdev is a partition, the
partition start must be aligned with both atomic_write_unit_min_sectors
and atomic_write_hw_boundary_sectors.
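
A sketch of that partition rule (queue-level accessors assumed for
illustration; the committed helper may differ):

	static inline bool bdev_can_atomic_write(struct block_device *bdev)
	{
		struct request_queue *q = bdev_get_queue(bdev);
		unsigned int min_secs = queue_atomic_write_unit_min_sectors(q);
		unsigned int boundary = queue_atomic_write_hw_boundary_sectors(q);

		if (!min_secs)
			return false;	/* no atomic write support */
		if (bdev_is_partition(bdev) &&
		    (!IS_ALIGNED(bdev->bd_start_sect, min_secs) ||
		     (boundary && !IS_ALIGNED(bdev->bd_start_sect, boundary))))
			return false;
		return true;
	}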

FSes will rely on the block layer to validate that an atomic write BIO
submitted will be of valid size, so add blk_validate_atomic_write_op_size()
for this purpose. Userspace expects an atomic write which is of invalid
size to be rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use
BLK_STS_INVAL for when a BIO needs to be split, as this should mean an
invalid size BIO.

Flag REQ_ATOMIC is used for indicating an atomic write.

Co-developed-by: Himanshu Madhani <[email protected]>
Signed-off-by: Himanshu Madhani <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: John Garry <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



# 8023e144 17-Jun-2024 Christoph Hellwig <[email protected]>

block: move the poll flag to queue_limits

Move the poll flag into the queue_limits feature field so that it can
be set atomically with the queue frozen.

Stacking drivers are simplified in that they now can simply set the
flag, and blk_stack_limits will clear it when the feature is not
supported by any of the underlying devices.
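
A sketch of an atomic update through the queue_limits API (helper
names as in the limits-update machinery; exact call sites may differ):

	struct queue_limits lim = queue_limits_start_update(q);

	lim.features |= BLK_FEAT_POLL;		/* or &= ~BLK_FEAT_POLL */
	queue_limits_commit_update(q, &lim);	/* validates and publishes */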

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



# 1122c0c1 17-Jun-2024 Christoph Hellwig <[email protected]>

block: move cache control settings out of queue->flags

Move the cache control settings into the queue_limits so that the flags
can be set atomically with the device queue frozen.

Add new features and flags fields: features for the driver-set flags,
and flags for internal (usually sysfs-controlled) flags in the block
layer. Note that we'll eventually remove enough fields from
queue_limits to bring it back to the previous size.

The disable flag is inverted compared to the previous meaning, which
means it now survives a rescan, similar to the max_sectors and
max_discard_sectors user limits.

The FLUSH and FUA flags are now inherited by blk_stack_limits, which
simplifies the code in dm a lot, but also causes a slight behavior
change in that dm-switch and dm-unstripe now advertise a write cache
despite setting num_flush_bios to 0. The I/O path will handle this
gracefully, but as far as I can tell the lack of num_flush_bios
and thus flush support is a pre-existing data integrity bug in those
targets that really needs fixing, after which a non-zero num_flush_bios
should be required in dm for targets that map to underlying devices.
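
A sketch of the resulting split in queue_limits (flag names from this
series; layout illustrative):

	struct queue_limits {
		unsigned int	features;	/* driver capabilities, e.g.
						 * BLK_FEAT_WRITE_CACHE,
						 * BLK_FEAT_FUA */
		unsigned int	flags;		/* block-layer internal state,
						 * e.g. the inverted write
						 * cache disable flag */
		/* existing limits fields follow */
	};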

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Ulf Hansson <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



Revision tags: v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1
# 9a42891c 21-May-2024 Yu Kuai <[email protected]>

block: fix lost bio for plug enabled bio based device

With the following two conditions, bio will be lost:

1) blk plug is not enabled, for example, __blkdev_direct_IO_simple() and
__blkdev_direct_IO_async();
2) bio plug is enabled, for example write IO for raid1/raid10 while
bitmap is enabled;

The root cause is that blk_finish_plug() will add the bio to
current->bio_list, while such a bio will not be handled:

__submit_bio_noacct
current->bio_list = bio_list_on_stack;
blk_start_plug

do {
dm_submit_bio
md_handle_request
raid10_write_request
-> generate new bio for underlying disks
raid1_add_bio_to_plug -> bio is added to plug
} while ((bio = bio_list_pop(&bio_list_on_stack[0])))
-> previous bio are all handled

blk_finish_plug
raid10_unplug
raid1_submit_write
submit_bio_noacct
if (current->bio_list)
bio_list_add(&current->bio_list[0], bio)
-> add new bio

current->bio_list = NULL
-> new bio is lost

Fix the problem by moving the plug into the while loop, so that
current->bio_list will still be handled after blk_finish_plug().

By the way, enabling the plug for raid1/raid10 in this case will also
prevent delaying IO handling into the daemon thread, which should also
improve IO performance.
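
A sketch of the fix as described, with the plug moved inside the loop
in __submit_bio_noacct():

	do {
		struct blk_plug plug;

		blk_start_plug(&plug);
		__submit_bio(bio);	/* may add bios to current->bio_list */
		blk_finish_plug(&plug);	/* flushed bios land back on the
					 * list and are picked up below */
	} while ((bio = bio_list_pop(&bio_list_on_stack[0])));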

Fixes: 060406c61c7c ("block: add plug while submitting IO")
Reported-by: Changhui Zhong <[email protected]>
Closes: https://lore.kernel.org/all/CAGVVp+Xsmzy2G9YuEatfMT6qv1M--YdOCQ0g7z7OVmcTbBxQAg@mail.gmail.com/
Signed-off-by: Yu Kuai <[email protected]>
Tested-by: Changhui Zhong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



Revision tags: v6.9
# 99dc4223 09-May-2024 Yu Kuai <[email protected]>

block: support to account io_ticks precisely

Currently, io_ticks is accounted based on sampling; specifically,
update_io_ticks() will always account io_ticks by 1 jiffy from
bdev_start_io_acct()/blk_account_io_start(), and the result can be
inaccurate, for example (HZ is 250):

Test script:
fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms

Test result: util is about 90%, while the disk is really idle.

This behaviour was introduced by commit 5b18b5a73760 ("block: delete
part_round_stats and switch to less precise counting"); however, a key
point was missed: that patch also improved performance a lot:

Before the commit:
part_round_stats:
if (part->stamp != now)
stats |= 1;

part_in_flight()
-> there can be lots of task here in 1 jiffies.
part_round_stats_single()
__part_stat_add()
part->stamp = now;

After the commit:
update_io_ticks:
stamp = part->bd_stamp;
if (time_after(now, stamp))
if (try_cmpxchg())
__part_stat_add()
-> only one task can reach here in 1 jiffies.

Hence, in order to account io_ticks precisely, we only need to know
whether there is IO inflight, at most once per jiffy. Note that for
rq-based devices, iterating tags should not be used here because
'tags->lock' is grabbed in blk_mq_find_and_get_req(); hence
part_stat_lock_inc/dec() and part_in_flight() are used to track
inflight IO.
The additional overhead is quite little:

- per cpu add/dec for each IO for rq-based device;
- per cpu sum for each jiffies;

And it's verified with null-blk that there is no performance
degradation under heavy IO pressure.
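
A sketch of the resulting accounting, close to the description above
(details may differ from the committed code):

	static void update_io_ticks(struct block_device *part, unsigned long now,
				    bool end)
	{
		unsigned long stamp = READ_ONCE(part->bd_stamp);

		/* at most one task per jiffy wins the cmpxchg; only account
		 * the elapsed ticks if IO was actually inflight */
		if (time_after(now, stamp) &&
		    try_cmpxchg(&part->bd_stamp, &stamp, now) &&
		    (end || part_in_flight(part)))
			__part_stat_add(part, io_ticks, now - stamp);
	}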

Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
Signed-off-by: Yu Kuai <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



# 060406c6 09-May-2024 Yu Kuai <[email protected]>

block: add plug while submitting IO

So that if the caller didn't use a plug, for example
__blkdev_direct_IO_simple() and __blkdev_direct_IO_async(), the block
layer can still benefit from caching nsec time in the plug.

Signed-off-by: Yu Kuai <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>



Revision tags: v6.9-rc7, v6.9-rc6
# 811ba89a 28-Apr-2024 Al Viro <[email protected]>

bdev: move ->bd_make_it_fail to ->__bd_flags

Signed-off-by: Al Viro <[email protected]>


Revision tags: v6.9-rc5, v6.9-rc4
# 49a43dae 12-Apr-2024 Al Viro <[email protected]>

bdev: move ->bd_ro_warned to ->__bd_flags

Signed-off-by: Al Viro <[email protected]>


# ac2b6f9d 12-Apr-2024 Al Viro <[email protected]>

bdev: move ->bd_has_submit_bio to ->__bd_flags

In bdev_alloc() we have all flags initialized to false, so the
assignment to ->bd_has_submit_bio in there is a no-op unless
we have partno != 0 and the flag already set on the entire device.

In device_add_disk() we have just allocated the block_device
in question and it had been a full-device one, so the flag
is guaranteed to be still clear when we get to assignment.

Signed-off-by: Al Viro <[email protected]>
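
The series replaces individual bool fields with bits in a shared
->__bd_flags word; a sketch of the accessor pattern (helper and flag
names per the series):

	/* was: bdev->bd_has_submit_bio = true; */
	bdev_set_flag(bdev, BD_HAS_SUBMIT_BIO);

	/* was: if (bdev->bd_has_submit_bio) */
	if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO))
		submit_bio_noacct_nocheck(bio);	/* illustrative call site */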


