|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6 |
|
| #
1bf70d08 |
| 04-Mar-2025 |
Nilay Shroff <[email protected]> |
block: introduce a dedicated lock for protecting queue elevator updates
A queue's elevator can be updated either when modifying nr_hw_queues or through the sysfs scheduler attribute. Currently, elevator switching/updating is protected using q->sysfs_lock, but this has led to lockdep splats[1] due to inconsistent lock ordering between q->sysfs_lock and the freeze lock at multiple block layer call sites.
As the scope of q->sysfs_lock is not well defined, its (mis)use has resulted in numerous lockdep warnings. To address this, introduce a new q->elevator_lock, dedicated specifically to protecting elevator switches/updates, and use it instead of q->sysfs_lock for that purpose.
While at it, make elv_iosched_load_module() a static function, as it is only called from elv_iosched_store(). Also, remove redundant parameters from elv_iosched_load_module() function signature.
[1] https://lore.kernel.org/all/[email protected]/
Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Ming Lei <[email protected]> Signed-off-by: Nilay Shroff <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
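A minimal sketch of the locking pattern this change describes; the function name and elevator_change() are illustrative stand-ins for the actual switch path, only q->elevator_lock is from the patch:

/*
 * Elevator switches are now serialized by the dedicated
 * q->elevator_lock instead of the overloaded q->sysfs_lock.
 */
static ssize_t elv_iosched_store_sketch(struct request_queue *q,
					const char *name, size_t count)
{
	int ret;

	mutex_lock(&q->elevator_lock);		/* dedicated elevator lock */
	ret = elevator_change(q, name);		/* hypothetical switch helper */
	mutex_unlock(&q->elevator_lock);

	return ret ? ret : count;
}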
|
|
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13 |
|
| #
3d9a9e9a |
| 15-Jan-2025 |
Ming Lei <[email protected]> |
block: limit disk max sectors to (LLONG_MAX >> 9)
Kernel `loff_t` is defined as `long long int`, so we can't support disks whose size is > LLONG_MAX bytes.
There are many virtual block drivers, and hardware may report a bad capacity too, so limit max sectors to (LLONG_MAX >> 9) to avoid potential trouble.
Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
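The arithmetic behind the limit: (LLONG_MAX >> 9) sectors of 512 bytes each is exactly LLONG_MAX bytes, the largest byte offset a loff_t can hold. A one-line sketch of the clamp (the macro and helper names are illustrative, not the upstream code):

/* Largest capacity, in 512-byte sectors, representable as a loff_t. */
#define BLK_MAX_SECTORS_SKETCH	(LLONG_MAX >> 9)

static sector_t blk_clamp_capacity(sector_t sectors)
{
	return min_t(sector_t, sectors, BLK_MAX_SECTORS_SKETCH);
}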
|
|
Revision tags: v6.13-rc7 |
|
| #
67838115 |
| 06-Jan-2025 |
Christoph Hellwig <[email protected]> |
block: better split mq vs non-mq code in add_disk_fwnode
Add a big conditional for blk-mq vs non-mq at the beginning of add_disk_fwnode so that elevator_init_mq is only called for blk-mq disks, and add checks that the right methods are set or not set based on the queue type.
Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
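Roughly, the split has this shape (an illustrative fragment, not the exact upstream diff):

static int add_disk_fwnode_split_sketch(struct gendisk *disk)
{
	if (queue_is_mq(disk->queue)) {
		/* blk-mq disks get a default elevator ... */
		elevator_init_mq(disk->queue);
		/* ... and must not provide a ->submit_bio method */
		if (WARN_ON_ONCE(disk->fops->submit_bio))
			return -EINVAL;
	} else {
		/* bio-based disks must provide ->submit_bio */
		if (WARN_ON_ONCE(!disk->fops->submit_bio))
			return -EINVAL;
	}
	/* ... rest of add_disk_fwnode() ... */
	return 0;
}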
|
|
Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3 |
|
| #
457ef47c |
| 09-Dec-2024 |
Yang Erkun <[email protected]> |
block: retry call probe after request_module in blk_request_module
Set kernel config:
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_LOOP_MIN_COUNT=0
Then do:
mknod loop0 b 7 0
exec 4<> loop0
Before commit e418de3abcda ("block: switch gendisk lookup to a simple xarray"), lookup_gendisk() would first use base_probe to load the loop module, and the retry would then call loop_probe to prepare the loop disk, so the open of this disk succeeded. After that commit, we lost the retry logic, and the open fails with ENXIO. Block device autoloading is deprecated and will be removed soon, but we should keep the open succeeding until it is really removed. So, add a retry to fix it.
Fixes: e418de3abcda ("block: switch gendisk lookup to a simple xarray") Suggested-by: Christoph Hellwig <[email protected]> Signed-off-by: Yang Erkun <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
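The retry described above amounts to attempting the probe lookup a second time after request_module() returns. A hedged sketch, where blk_probe_dev() is a hypothetical stand-in for the xarray lookup-and-call step:

static void blk_request_module_sketch(dev_t devt)
{
	if (blk_probe_dev(devt))	/* hypothetical: probe found and run */
		return;

	if (request_module("block-major-%d-%d", MAJOR(devt), MINOR(devt)) > 0)
		request_module("block-major-%d", MAJOR(devt));

	/* the module load may have registered a probe: retry it */
	blk_probe_dev(devt);
}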
|
|
Revision tags: v6.13-rc2, v6.13-rc1 |
|
| #
f6661b1d |
| 27-Nov-2024 |
Ming Lei <[email protected]> |
block: track queue dying state automatically for modeling queue freeze lockdep
Now we only verify the outermost freeze & unfreeze in the current context in case that !q->mq_freeze_depth, so it is reliable to save the queue dying state when we want to lock the freeze queue, since the state is a per-task variable now.
Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
| #
6f491a8d |
| 27-Nov-2024 |
Ming Lei <[email protected]> |
block: track disk DEAD state automatically for modeling queue freeze lockdep
Now we only verify the outermost freeze & unfreeze in the current context in case that !q->mq_freeze_depth, so it is reliable to save the disk DEAD state when we want to lock the freeze queue, since the state is a per-task variable now.
Doing it this way kills lots of false positives when freeze_queue is called before the disk is added[1].
[1] https://lore.kernel.org/linux-block/[email protected]/
Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
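The idea behind these two patches, as a hedged sketch (the lockdep map field and helper shape are approximate, not the upstream code): capture the DEAD/dying decision once, in per-task state saved when the freeze "lock" is taken, so the matching unfreeze can pair with it reliably:

static bool blk_freeze_acquire_lock_sketch(struct request_queue *q)
{
	/* Once the disk is DEAD, bio_enter_queue() no longer blocks,
	 * so the freeze degrades to no lock: skip lockdep modeling. */
	if (!q->disk || test_bit(GD_DEAD, &q->disk->state))
		return false;
	rwsem_acquire(&q->q_lockdep_map, 0, 1, _RET_IP_);	/* name approximate */
	return true;	/* saved per task; reused at unfreeze time */
}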
|
|
Revision tags: v6.12, v6.12-rc7 |
|
| #
3802f73b |
| 04-Nov-2024 |
Yu Kuai <[email protected]> |
block: fix uaf for flush rq while iterating tags
blk_mq_clear_flush_rq_mapping() is not called during the scsi probe, as guarded by the blk_queue_init_done() check. However, QUEUE_FLAG_INIT_DONE is cleared in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in atomic mode after del_gendisk"), hence for disks like scsi the subsequent blk_mq_destroy_queue() will not clear the flush rq from tags->rqs[] either, causing the following uaf found by our syzkaller for v6.6:
==================================================================
BUG: KASAN: slab-use-after-free in blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261
Read of size 4 at addr ffff88811c969c20 by task kworker/1:2H/224909

CPU: 1 PID: 224909 Comm: kworker/1:2H Not tainted 6.6.0-ga836a5060850 #32
Workqueue: kblockd blk_mq_timeout_work
Call Trace:
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106
 print_address_description.constprop.0+0x66/0x300 mm/kasan/report.c:364
 print_report+0x3e/0x70 mm/kasan/report.c:475
 kasan_report+0xb8/0xf0 mm/kasan/report.c:588
 blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261
 bt_iter block/blk-mq-tag.c:288 [inline]
 __sbitmap_for_each_set include/linux/sbitmap.h:295 [inline]
 sbitmap_for_each_set include/linux/sbitmap.h:316 [inline]
 bt_for_each+0x455/0x790 block/blk-mq-tag.c:325
 blk_mq_queue_tag_busy_iter+0x320/0x740 block/blk-mq-tag.c:534
 blk_mq_timeout_work+0x1a3/0x7b0 block/blk-mq.c:1673
 process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631
 process_scheduled_works kernel/workqueue.c:2704 [inline]
 worker_thread+0x804/0xe40 kernel/workqueue.c:2785
 kthread+0x346/0x450 kernel/kthread.c:388
 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293

Allocated by task 942:
 kasan_save_stack+0x22/0x50 mm/kasan/common.c:45
 kasan_set_track+0x25/0x30 mm/kasan/common.c:52
 ____kasan_kmalloc mm/kasan/common.c:374 [inline]
 __kasan_kmalloc mm/kasan/common.c:383 [inline]
 __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:380
 kasan_kmalloc include/linux/kasan.h:198 [inline]
 __do_kmalloc_node mm/slab_common.c:1007 [inline]
 __kmalloc_node+0x69/0x170 mm/slab_common.c:1014
 kmalloc_node include/linux/slab.h:620 [inline]
 kzalloc_node include/linux/slab.h:732 [inline]
 blk_alloc_flush_queue+0x144/0x2f0 block/blk-flush.c:499
 blk_mq_alloc_hctx+0x601/0x940 block/blk-mq.c:3788
 blk_mq_alloc_and_init_hctx+0x27f/0x330 block/blk-mq.c:4261
 blk_mq_realloc_hw_ctxs+0x488/0x5e0 block/blk-mq.c:4294
 blk_mq_init_allocated_queue+0x188/0x860 block/blk-mq.c:4350
 blk_mq_init_queue_data block/blk-mq.c:4166 [inline]
 blk_mq_init_queue+0x8d/0x100 block/blk-mq.c:4176
 scsi_alloc_sdev+0x843/0xd50 drivers/scsi/scsi_scan.c:335
 scsi_probe_and_add_lun+0x77c/0xde0 drivers/scsi/scsi_scan.c:1189
 __scsi_scan_target+0x1fc/0x5a0 drivers/scsi/scsi_scan.c:1727
 scsi_scan_channel drivers/scsi/scsi_scan.c:1815 [inline]
 scsi_scan_channel+0x14b/0x1e0 drivers/scsi/scsi_scan.c:1791
 scsi_scan_host_selected+0x2fe/0x400 drivers/scsi/scsi_scan.c:1844
 scsi_scan+0x3a0/0x3f0 drivers/scsi/scsi_sysfs.c:151
 store_scan+0x2a/0x60 drivers/scsi/scsi_sysfs.c:191
 dev_attr_store+0x5c/0x90 drivers/base/core.c:2388
 sysfs_kf_write+0x11c/0x170 fs/sysfs/file.c:136
 kernfs_fop_write_iter+0x3fc/0x610 fs/kernfs/file.c:338
 call_write_iter include/linux/fs.h:2083 [inline]
 new_sync_write+0x1b4/0x2d0 fs/read_write.c:493
 vfs_write+0x76c/0xb00 fs/read_write.c:586
 ksys_write+0x127/0x250 fs/read_write.c:639
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81
 entry_SYSCALL_64_after_hwframe+0x78/0xe2

Freed by task 244687:
 kasan_save_stack+0x22/0x50 mm/kasan/common.c:45
 kasan_set_track+0x25/0x30 mm/kasan/common.c:52
 kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522
 ____kasan_slab_free mm/kasan/common.c:236 [inline]
 __kasan_slab_free+0x12a/0x1b0 mm/kasan/common.c:244
 kasan_slab_free include/linux/kasan.h:164 [inline]
 slab_free_hook mm/slub.c:1815 [inline]
 slab_free_freelist_hook mm/slub.c:1841 [inline]
 slab_free mm/slub.c:3807 [inline]
 __kmem_cache_free+0xe4/0x520 mm/slub.c:3820
 blk_free_flush_queue+0x40/0x60 block/blk-flush.c:520
 blk_mq_hw_sysfs_release+0x4a/0x170 block/blk-mq-sysfs.c:37
 kobject_cleanup+0x136/0x410 lib/kobject.c:689
 kobject_release lib/kobject.c:720 [inline]
 kref_put include/linux/kref.h:65 [inline]
 kobject_put+0x119/0x140 lib/kobject.c:737
 blk_mq_release+0x24f/0x3f0 block/blk-mq.c:4144
 blk_free_queue block/blk-core.c:298 [inline]
 blk_put_queue+0xe2/0x180 block/blk-core.c:314
 blkg_free_workfn+0x376/0x6e0 block/blk-cgroup.c:144
 process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631
 process_scheduled_works kernel/workqueue.c:2704 [inline]
 worker_thread+0x804/0xe40 kernel/workqueue.c:2785
 kthread+0x346/0x450 kernel/kthread.c:388
 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293
Other than blk_mq_clear_flush_rq_mapping(), the flag is only used in blk_register_queue() from the initialization path, hence it's safe not to clear the flag in del_gendisk. And since QUEUE_FLAG_REGISTERED already makes sure that a queue is only registered once, there is no need to test the flag there either.
Fixes: 6cfeadbff3f8 ("blk-mq: don't clear flush_rq from tags->rqs[]") Depends-on: commit aec89dc5d421 ("block: keep q_usage_counter in atomic mode after del_gendisk") Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
| #
bda9c7d9 |
| 08-Nov-2024 |
David Wang <[email protected]> |
block/genhd: use seq_put_decimal_ull for diskstats decimal values
seq_printf is costly. For each block device, 19 decimal values are emitted in /proc/diskstats via seq_printf. On a system with 16 logical block devices, profiling open/read/close sequences shows seq_printf took ~75% of the samples in diskstats_show:
diskstats_show (92.626%, 2269372/2450040)
  seq_printf (76.026%, 1725313/2269372)
    vsnprintf (99.163%, 1710866/1725313)
      format_decode (26.597%, 455040/1710866)
      number (19.554%, 334542/1710866)
      memcpy_orig (4.183%, 71570/1710866)
    ...
    srso_return_thunk (0.009%, 148/1725313)
  part_stat_read_all (8.030%, 182236/2269372)
One million rounds of open/read/close on /proc/diskstats take:

real 0m37.687s
user 0m0.264s
sys  0m32.911s

On average, each sequence took ~0.032ms.

With this patch, most decimal values are emitted via seq_put_decimal_ull, and performance improves significantly:

real 0m20.792s
user 0m0.316s
sys  0m20.463s

On average, each sequence took ~0.020ms, a ~37.5% improvement.
Signed-off-by: David Wang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
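An illustration of the conversion, simplified to two of the 19 values: seq_put_decimal_ull() appends a delimiter and a decimal directly, skipping the format-string parsing that makes seq_printf() expensive. The function name here is illustrative:

static void diskstats_row_sketch(struct seq_file *m,
				 unsigned long long ios,
				 unsigned long long sectors)
{
	seq_put_decimal_ull(m, " ", ios);	/* no vsnprintf() format parsing */
	seq_put_decimal_ull(m, " ", sectors);
	seq_putc(m, '\n');
}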
|
| #
8e71afb9 |
| 07-Nov-2024 |
zhangguopeng <[email protected]> |
block: Replace sprintf() with sysfs_emit()
Per Documentation/filesystems/sysfs.rst, show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space.
No functional change intended.
Signed-off-by: zhangguopeng <[email protected]> Suggested-by: Christoph Hellwig <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
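The conversion pattern, shown on a simplified show() method (sysfs_emit() knows the buffer is a full page and guards against overruns, which sprintf() cannot; the function name is illustrative):

static ssize_t disk_range_show_sketch(struct device *dev,
				      struct device_attribute *attr, char *buf)
{
	struct gendisk *disk = dev_to_disk(dev);

	return sysfs_emit(buf, "%d\n", disk->minors);	/* was sprintf() */
}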
|
|
Revision tags: v6.12-rc6, v6.12-rc5 |
|
| #
f1be1788 |
| 25-Oct-2024 |
Ming Lei <[email protected]> |
block: model freeze & enter queue as lock for supporting lockdep
Recently we got several deadlock reports[1][2][3] caused by blk_mq_freeze_queue() and blk_enter_queue().
Turns out the two are just like acquiring a read/write lock, so model them as a read/write lock to support lockdep:
1) model q->q_usage_counter as two locks (io and queue lock)
- the queue lock covers sync with blk_enter_queue()
- the io lock covers sync with bio_enter_queue()
2) make the lockdep class/key per-queue:
- different subsystems have very different lock use patterns, so a shared lock class easily causes false positives
- freeze_queue degrades to no lock once the disk state becomes DEAD, because bio_enter_queue() won't be blocked any more
- freeze_queue degrades to no lock once the request queue becomes dying, because blk_enter_queue() won't be blocked any more
3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock
- it is an exclusive lock, so the dependency with blk_enter_queue() is covered
- it is a trylock, because blk_mq_freeze_queue() is allowed to run concurrently
4) model blk_enter_queue() & bio_enter_queue() as acquire_read()
- nested blk_enter_queue() calls are allowed
- the dependency with blk_mq_freeze_queue() is covered
- blk_queue_exit() is often called from other contexts (such as irq) and can't be annotated as lock_release(), so simply do the release in blk_enter_queue(); this still covers as many cases as possible
With lockdep support, such reports can surface as soon as the dependency is formed, without waiting until the real deadlock is triggered.
For example, the lockdep report for case [3] can be triggered with this patch applied.
[1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler' https://bugzilla.kernel.org/show_bug.cgi?id=219166
[2] del_gendisk() vs blk_queue_enter() race condition https://lore.kernel.org/linux-block/[email protected]/
[3] queue_freeze & queue_enter deadlock in scsi https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u
Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
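A hedged sketch of the modeling (the lockdep map name and exact placement are approximate): freeze annotates as an exclusive trylock, enter as a read lock that is released immediately because blk_queue_exit() may run from irq context:

static void blk_freeze_queue_start_sketch(struct request_queue *q)
{
	/* exclusive + trylock: concurrent freezers are allowed */
	rwsem_acquire(&q->q_usage_counter_map, 0, 1, _RET_IP_);
}

static void blk_enter_queue_sketch(struct request_queue *q)
{
	/* reader: nesting is allowed, pairs with the freeze writer */
	rwsem_acquire_read(&q->q_usage_counter_map, 0, 0, _RET_IP_);
	rwsem_release(&q->q_usage_counter_map, _RET_IP_);
}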
|
|
Revision tags: v6.12-rc4, v6.12-rc3, v6.12-rc2 |
|
| #
9dfd9ea9 |
| 02-Oct-2024 |
Christian Marangi <[email protected]> |
block: introduce add_disk_fwnode()
Introduce add_disk_fwnode() as a replacement for device_add_disk() that permits passing and attaching a fwnode to the disk device.
This variant can be useful for eMMC, which might have the partition table for the disk defined in DT. A parser can later make use of the attached fwnode to parse the related table and init the hardcoded partitions for the disk.
device_add_disk() is converted to a simple wrapper of add_disk_fwnode() with the fwnode entry set as NULL.
Signed-off-by: Christian Marangi <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
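Per the description, device_add_disk() reduces to a thin wrapper passing a NULL fwnode; roughly (signature abridged):

int device_add_disk(struct device *parent, struct gendisk *disk,
		    const struct attribute_group **groups)
{
	return add_disk_fwnode(parent, disk, groups, NULL);
}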
|
|
Revision tags: v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1 |
|
| #
7e04da2d |
| 24-Jul-2024 |
Yang Yang <[email protected]> |
block: fix deadlock between sd_remove & sd_release
Our test reported the following hung task:
[ 2538.459400] INFO: task "kworker/0:0":7 blocked for more than 188 seconds.
[ 2538.459427] Call trace:
[ 2538.459430]  __switch_to+0x174/0x338
[ 2538.459436]  __schedule+0x628/0x9c4
[ 2538.459442]  schedule+0x7c/0xe8
[ 2538.459447]  schedule_preempt_disabled+0x24/0x40
[ 2538.459453]  __mutex_lock+0x3ec/0xf04
[ 2538.459456]  __mutex_lock_slowpath+0x14/0x24
[ 2538.459459]  mutex_lock+0x30/0xd8
[ 2538.459462]  del_gendisk+0xdc/0x350
[ 2538.459466]  sd_remove+0x30/0x60
[ 2538.459470]  device_release_driver_internal+0x1c4/0x2c4
[ 2538.459474]  device_release_driver+0x18/0x28
[ 2538.459478]  bus_remove_device+0x15c/0x174
[ 2538.459483]  device_del+0x1d0/0x358
[ 2538.459488]  __scsi_remove_device+0xa8/0x198
[ 2538.459493]  scsi_forget_host+0x50/0x70
[ 2538.459497]  scsi_remove_host+0x80/0x180
[ 2538.459502]  usb_stor_disconnect+0x68/0xf4
[ 2538.459506]  usb_unbind_interface+0xd4/0x280
[ 2538.459510]  device_release_driver_internal+0x1c4/0x2c4
[ 2538.459514]  device_release_driver+0x18/0x28
[ 2538.459518]  bus_remove_device+0x15c/0x174
[ 2538.459523]  device_del+0x1d0/0x358
[ 2538.459528]  usb_disable_device+0x84/0x194
[ 2538.459532]  usb_disconnect+0xec/0x300
[ 2538.459537]  hub_event+0xb80/0x1870
[ 2538.459541]  process_scheduled_works+0x248/0x4dc
[ 2538.459545]  worker_thread+0x244/0x334
[ 2538.459549]  kthread+0x114/0x1bc

[ 2538.461001] INFO: task "fsck.":15415 blocked for more than 188 seconds.
[ 2538.461014] Call trace:
[ 2538.461016]  __switch_to+0x174/0x338
[ 2538.461021]  __schedule+0x628/0x9c4
[ 2538.461025]  schedule+0x7c/0xe8
[ 2538.461030]  blk_queue_enter+0xc4/0x160
[ 2538.461034]  blk_mq_alloc_request+0x120/0x1d4
[ 2538.461037]  scsi_execute_cmd+0x7c/0x23c
[ 2538.461040]  ioctl_internal_command+0x5c/0x164
[ 2538.461046]  scsi_set_medium_removal+0x5c/0xb0
[ 2538.461051]  sd_release+0x50/0x94
[ 2538.461054]  blkdev_put+0x190/0x28c
[ 2538.461058]  blkdev_release+0x28/0x40
[ 2538.461063]  __fput+0xf8/0x2a8
[ 2538.461066]  __fput_sync+0x28/0x5c
[ 2538.461070]  __arm64_sys_close+0x84/0xe8
[ 2538.461073]  invoke_syscall+0x58/0x114
[ 2538.461078]  el0_svc_common+0xac/0xe0
[ 2538.461082]  do_el0_svc+0x1c/0x28
[ 2538.461087]  el0_svc+0x38/0x68
[ 2538.461090]  el0t_64_sync_handler+0x68/0xbc
[ 2538.461093]  el0t_64_sync+0x1a8/0x1ac

T1:					T2:
sd_remove
 del_gendisk
  __blk_mark_disk_dead
   blk_freeze_queue_start
    ++q->mq_freeze_depth
					bdev_release
					 mutex_lock(&disk->open_mutex)
					  sd_release
					   scsi_execute_cmd
					    blk_queue_enter
					     wait_event(!q->mq_freeze_depth)
  mutex_lock(&disk->open_mutex)
SCSI does not set GD_OWNS_QUEUE, so QUEUE_FLAG_DYING is not set in this scenario. This is a classic ABBA deadlock. To fix it, make sure we don't try to acquire disk->open_mutex after freezing the queue.
Cc: [email protected] Fixes: eec1be4c30df ("block: delete partitions later in del_gendisk") Signed-off-by: Yang Yang <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Fixes: and Cc: stable tags are missing. Otherwise this patch looks fine Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
|
Revision tags: v6.10, v6.10-rc7, v6.10-rc6 |
|
| #
73781b3b |
| 26-Jun-2024 |
Christoph Hellwig <[email protected]> |
block: remove disk_update_readahead
Mark blk_apply_bdi_limits non-static and open code disk_update_readahead in the only caller.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
|
Revision tags: v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9 |
|
| #
7be83569 |
| 09-May-2024 |
Yu Kuai <[email protected]> |
block: fix that util can be greater than 100%
"util" is the percentage of time the disk has IO inflight, and in theory it should never be greater than 100%. However, there is a gap for rq-based disks:
io_ticks is updated when an rq is allocated; however, until such an rq is dispatched to the driver, it will not be accounted as inflight by blk_mq_start_request(), hence diskstats_show()/part_stat_show() will not update io_ticks. For example:
1) at t0, issue a new IO, rq is allocated, and blk_account_io_start() updates io_ticks;
2) something is wrong with the driver, and the rq can't be dispatched;
3) at t0 + 10s, the driver recovers, and the rq is dispatched and done, so io_ticks is updated;
Then if a user is running "iostat 1" to monitor "util", between t0 and t0+9s util will be zero, and between t0+9s and t0+10s util will be 1000%.
Fix this problem by updating io_ticks from diskstats_show() and part_stat_show() if there are rqs allocated.
Signed-off-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
| #
99dc4223 |
| 09-May-2024 |
Yu Kuai <[email protected]> |
block: support to account io_ticks precisely
Currently, io_ticks is accounted based on sampling. Specifically, update_io_ticks() will always account io_ticks by 1 jiffy from bdev_start_io_acct()/blk_account_io_start(), and the result can be inaccurate. For example (HZ is 250):
Test script: fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms
Test result: util is about 90%, while the disk is really idle.
This behaviour was introduced by commit 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting"); however, a key point was missed: that patch also improved performance a lot.

Before the commit:

part_round_stats:
  if (part->stamp != now)
    stats |= 1;
  part_in_flight()
    -> there can be lots of tasks here in 1 jiffy.
  part_round_stats_single()
    __part_stat_add()
  part->stamp = now;

After the commit:

update_io_ticks:
  stamp = part->bd_stamp;
  if (time_after(now, stamp))
    if (try_cmpxchg())
      __part_stat_add()
    -> only one task can reach here in 1 jiffy.

Hence in order to account io_ticks precisely, we only need to know whether there is IO inflight, at most once per jiffy. Note that for rq-based devices, iterating tags should not be used here because 'tags->lock' is grabbed in blk_mq_find_and_get_req(); hence part_stat_lock_inc/dec() and part_in_flight() are used to track inflight. The additional overhead is quite small:
- per-cpu add/dec for each IO for rq-based devices;
- per-cpu sum for each jiffy;
And it's verified by null-blk that there is no performance degradation under heavy IO pressure.
Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting") Signed-off-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
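A sketch of the tracking described above (function name illustrative, per-cpu accessors as in part_stat.h): each IO start/end adjusts a per-cpu counter, and the once-per-jiffy check only needs a cheap per-cpu sum instead of iterating tags:

static unsigned int part_in_flight_sketch(struct block_device *part)
{
	unsigned int inflight = 0;
	int cpu;

	/* summed at most once per jiffy, so the loop cost is negligible */
	for_each_possible_cpu(cpu)
		inflight += part_stat_local_read_cpu(part, in_flight[READ], cpu) +
			    part_stat_local_read_cpu(part, in_flight[WRITE], cpu);
	return inflight;
}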
|
|
Revision tags: v6.9-rc7 |
|
| #
a4217c67 |
| 02-May-2024 |
Christoph Hellwig <[email protected]> |
block: add a partscan sysfs attribute for disks
Userspace had been unknowingly relying on a non-stable interface of kernel internals to determine if partition scanning is enabled for a given disk. Provide a stable interface for this purpose instead.
Cc: [email protected] # 6.3+ Depends-on: 140ce28dd3be ("block: add a disk_has_partscan helper") Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/linux-block/ZhQJf8mzq_wipkBH@gardel-login/ Link: https://lore.kernel.org/r/[email protected] [axboe: add links and commit message from Keith] Signed-off-by: Jens Axboe <[email protected]>
|
| #
140ce28d |
| 02-May-2024 |
Christoph Hellwig <[email protected]> |
block: add a disk_has_partscan helper
Add a helper to check if partition scanning is enabled instead of open coding the check in a few places. This now always checks for the hidden flag even if all but one of the callers are never reachable for hidden gendisks.
Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
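A plausible shape for the helper, per the description (hidden disks and GENHD_FL_NO_PART never scan; the exact flag set checked is an assumption here):

static inline bool disk_has_partscan(struct gendisk *disk)
{
	return !(disk->flags & (GENHD_FL_NO_PART | GENHD_FL_HIDDEN)) &&
		!test_bit(GD_SUPPRESS_PART_SCAN, &disk->state);
}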
|
|
Revision tags: v6.9-rc6, v6.9-rc5, v6.9-rc4 |
|
| #
224941e8 |
| 11-Apr-2024 |
Al Viro <[email protected]> |
use ->bd_mapping instead of ->bd_inode->i_mapping
Just the low-hanging fruit...
Signed-off-by: Al Viro <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
|
| #
2638c208 |
| 28-Apr-2024 |
Al Viro <[email protected]> |
missing helpers: bdev_unhash(), bdev_drop()
bdev_unhash(): make the block device invisible to lookups by device number.
bdev_drop(): drop the reference to the associated inode.
Both are internal, for use by genhd and partition-related code, similar to bdev_add(). The logic in there (especially the lifetime-related parts of it) ought to be cleaned up, but that's a separate story; here we just encapsulate getting to the associated inode.
Signed-off-by: Al Viro <[email protected]>
|
| #
811ba89a |
| 28-Apr-2024 |
Al Viro <[email protected]> |
bdev: move ->bd_make_it_fail to ->__bd_flags
Signed-off-by: Al Viro <[email protected]>
|
| #
ac2b6f9d |
| 12-Apr-2024 |
Al Viro <[email protected]> |
bdev: move ->bd_has_submit_bio to ->__bd_flags
In bdev_alloc() we have all flags initialized to false, so the assignment to ->bd_has_submit_bio in there is a no-op unless we have partno != 0 and the flag already set on the entire device.
In device_add_disk() we have just allocated the block_device in question, and it had been a full-device one, so the flag is guaranteed to still be clear when we get to the assignment.
Signed-off-by: Al Viro <[email protected]>
|
| #
dd291d77 |
| 08-Apr-2024 |
Damien Le Moal <[email protected]> |
block: Introduce zone write plugging
Zone write plugging implements a per-zone "plug" for write operations to control the submission and execution order of write operations to sequential write required zones of a zoned block device. Per-zone plugging guarantees that at any time there is at most only one write request per zone being executed. This mechanism is intended to replace zone write locking which implements a similar per-zone write throttling at the scheduler level, but is implemented only by mq-deadline.
Unlike zone write locking which operates on requests, zone write plugging operates on BIOs. A zone write plug is simply a BIO list that is atomically manipulated using a spinlock and a kblockd submission work. A write BIO to a zone is "plugged" to delay its execution if a write BIO for the same zone was already issued, that is, if a write request for the same zone is being executed. The next plugged BIO is unplugged and issued once the write request completes.
This mechanism allows to:
- Untangle zone write ordering from block IO schedulers. This allows removing the restriction on using mq-deadline for writing to zoned block devices. Any block IO scheduler, including "none", can be used.
- Operate on BIOs instead of requests: plugged BIOs waiting for execution do not hold scheduling tags and thus do not prevent other BIOs from executing (reads or writes to other zones). Depending on the workload, this can significantly improve device use (higher queue depth operation) and performance.
- Use zone write plugging for both blk-mq (request based) zoned devices and BIO-based zoned devices (e.g. device mapper). It is mandatory for the former but optional for the latter. BIO-based drivers can use zone write plugging to implement write ordering guarantees, or the drivers can implement their own if needed.
- Keep the code less invasive in the block layer: it is mostly limited to blk-zoned.c, with some small changes in blk-mq.c, blk-merge.c and bio.c.
Zone write plugging is implemented using struct blk_zone_wplug. This structure includes a spinlock, a BIO list and a work structure to handle the submission of plugged BIOs. Zone write plugs structures are managed using a per-disk hash table.
Plugging of zone write BIOs is done using the function blk_zone_write_plug_bio(), which returns false if a BIO's execution does not need to be delayed and true otherwise. This function is called from blk_mq_submit_bio() after a BIO is split, to avoid large BIOs spanning multiple zones which would cause mishandling of zone write plugs. This change enables zone write plugging by default for any mq request-based block device. BIO-based device drivers can also use zone write plugging by explicitly calling blk_zone_write_plug_bio() in their ->submit_bio method. For such devices, the driver must ensure that a BIO passed to blk_zone_write_plug_bio() is already split and not straddling zone boundaries.
Only write and write zeroes BIOs are plugged. Zone write plugging does not introduce any significant overhead for other operations. A BIO that is being handled through zone write plugging is flagged using the new BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag. The completion of BIOs and requests so flagged triggers, respectively, calls to the functions blk_zone_write_bio_endio() and blk_zone_write_complete_request(). The latter function is used to trigger submission of the next plugged BIO using the zone plug work. blk_zone_write_bio_endio() does the same for BIO-based devices. This ensures that at any time, at most one request (blk-mq devices) or one BIO (BIO-based devices) is being executed for any zone. The handling of zone write plugs using a per-zone plug spinlock maximizes parallelism and device usage by allowing multiple zones to be written simultaneously without lock contention.
Zone write plugging ignores flush BIOs without data. However, any flush BIO that has data is always plugged so that the write part of the flush sequence is serialized with other regular writes.
Given that any BIO handled through zone write plugging will be the only BIO in flight for the target zone when it is executed, the unplugging and submission of a BIO will have no chance of successfully merging with plugged requests or requests in the scheduler. To overcome this potential performance degradation, blk_mq_submit_bio() calls the function blk_zone_write_plug_attempt_merge() to try to merge other plugged BIOs with the one just unplugged and submitted. Successful merging is signaled using blk_zone_write_plug_bio_merged(), called from bio_attempt_back_merge(). Furthermore, to avoid recalculating the number of segments of plugged BIOs to attempt merging, the number of segments of a plugged BIO is saved using the new struct bio field __bi_nr_segments. To avoid growing the size of struct bio, this field is added as a union with the bio_cookie field. This is safe to do as polling is always disabled for plugged BIOs.
When BIOs are plugged in a zone write plug, the device request queue usage counter is always incremented. This reference is kept and reused for blk-mq devices when the plugged BIO is unplugged and submitted again using submit_bio_noacct_nocheck(). For this case, the unplugged BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and blk_mq_submit_bio() proceeds directly to allocating a new request for the BIO, re-using the usage reference count taken when the BIO was plugged. This extra reference count is dropped in blk_zone_write_plug_attempt_merge() for any plugged BIO that is successfully merged. Given that BIO-based devices will not take this path, the extra reference is dropped after a plugged BIO is unplugged and submitted.
Zone write plugs are dynamically allocated and managed using a hash table (an array of struct hlist_head) with RCU protection. A zone write plug is allocated when a write BIO is received for the zone and not freed until the zone is fully written, reset or finished. To detect when a zone write plug can be freed, the write state of each zone is tracked using a write pointer offset which corresponds to the offset of a zone write pointer relative to the zone start. Write operations always increment this write pointer offset. Zone reset operations set it to 0 and zone finish operations set it to the zone size.
If a write error happens, the wp_offset value of a zone write plug may become incorrect and out of sync with the device managed write pointer. This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR. The function blk_zone_wplug_handle_error() is called from the new disk zone write plug work when this flag is set. This function executes a report zone to update the zone write pointer offset to the current value as indicated by the device. The disk zone write plug work is scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes with an error or when bio_zone_wplug_prepare_bio() detects an unaligned write. Once scheduled, the disk zone write plugs work keeps running until all zone errors are handled.
To match the new data structures used for zoned disks, the function disk_free_zone_bitmaps() is renamed to the more generic disk_free_zone_resources(). The function disk_init_zone_resources() is also introduced to initialize zone write plugs resources when a gendisk is allocated.
In order to guarantee that the user can simultaneously write up to a number of zones equal to a device max active zone limit or max open zone limit, zone write plugs are allocated using a mempool sized to the maximum of these 2 device limits. For a device that does not have active and open zone limits, 128 is used as the default mempool size.
If a change to the device active and open zone limits is detected, the disk mempool is resized when blk_revalidate_disk_zones() is executed.
This commit contains contributions from Christoph Hellwig <[email protected]>.
Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Tested-by: Hans Holmberg <[email protected]> Tested-by: Dennis Maisenbacher <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
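For orientation, a rough sketch of the per-zone plug structure described above (fields abridged and names approximate, not the exact upstream layout):

struct blk_zone_wplug_sketch {
	struct hlist_node	node;		/* per-disk hash table entry */
	spinlock_t		lock;		/* protects flags and bio_list */
	unsigned int		flags;		/* e.g. BLK_ZONE_WPLUG_ERROR */
	unsigned int		zone_no;	/* zone number on the disk */
	unsigned int		wp_offset;	/* write pointer offset, in
						 * sectors from the zone start */
	struct bio_list		bio_list;	/* plugged writes, issued one
						 * at a time */
	struct work_struct	bio_work;	/* kblockd submission work */
};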
|
|
Revision tags: v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8 |
|
| #
f8c7511d |
| 05-Mar-2024 |
Ricardo B. Marliere <[email protected]> |
block: make block_class constant
Since commit 43a7206b0963 ("driver core: class: make class_register() take a const *"), the driver core allows struct class to be in read-only memory, so declare the block_class structure at build time, placing it in read-only memory, instead of having it dynamically allocated at boot time.
Cc: Greg Kroah-Hartman <[email protected]> Suggested-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Ricardo B. Marliere <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
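The resulting pattern is a build-time, const-qualified class definition; roughly (the uevent callback shown is the one genhd.c already uses):

const struct class block_class = {
	.name		= "block",
	.dev_uevent	= block_uevent,
};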
|
|
Revision tags: v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2 |
|
| #
190f676a |
| 23-Jan-2024 |
Christian Brauner <[email protected]> |
block/genhd: port disk_scan_partitions() to file
This may run from a kernel thread via device_add_disk(). So this could also use __fput_sync() if we were worried about EBUSY. But when it is called from a kernel thread it's always BLK_OPEN_READ so EBUSY can't really happen even if we do BLK_OPEN_RESTRICT_WRITES or BLK_OPEN_EXCL.
Otherwise it's called from an ioctl on the block device which is only called from userspace and can rely on task work.
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
|
| #
74fa8f9c |
| 15-Feb-2024 |
Christoph Hellwig <[email protected]> |
block: pass a queue_limits argument to blk_alloc_disk
Pass a queue_limits to blk_alloc_disk and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later.
Also change blk_alloc_disk to return an ERR_PTR instead of just NULL which can't distinguish errors.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Himanshu Madhani <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
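The caller-side pattern after this change, as an illustrative driver fragment (the limit value is an example): limits are passed up front, and failure is an ERR_PTR rather than NULL:

struct queue_limits lim = {
	.logical_block_size	= 4096,		/* example value */
};
struct gendisk *disk;

disk = blk_alloc_disk(&lim, NUMA_NO_NODE);
if (IS_ERR(disk))
	return PTR_ERR(disk);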
|