Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4
# c6320214 | 23-Apr-2025 | Christoph Hellwig <[email protected]>
block: move blkdev_{get,put} _no_open prototypes out of blkdev.h
These are only to be used by block internal code. Remove the comment as we grew more users due to reworking block device node opening.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
# e03463d2 | 23-Apr-2025 | Darrick J. Wong <[email protected]>
block: hoist block size validation code to a separate function
Hoist the block size validation code to bdev_validate_blocksize so that we can call it from filesystems that don't care about the bdev pagecache manipulations of set_blocksize.
Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/174543795720.4139148.840349813093799165.stgit@frogsfrogsfrogs Signed-off-by: Jens Axboe <[email protected]>
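A minimal sketch of what the hoisted helper plausibly looks like; the exact checks are assumptions based on the commit description, not a quote of the patch:

    int bdev_validate_blocksize(struct block_device *bdev, int block_size)
    {
            /* must be a power-of-two multiple of 512 in the supported range */
            if (blk_validate_block_size(block_size))
                    return -EINVAL;

            /* cannot be smaller than the device logical block size */
            if (block_size < bdev_logical_block_size(bdev))
                    return -EINVAL;

            return 0;
    }

Filesystems can then call this directly without triggering the bdev pagecache manipulations done by set_blocksize.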
Revision tags: v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14
# c1a79b1a | 19-Mar-2025 | Naohiro Aota <[email protected]>
block: introduce zone capacity helper
{bdev,disk}_zone_capacity() takes block_device or gendisk and sector position and returns the zone capacity of the corresponding zone.
With that, move disk_nr_zones() and blk_zone_plug_bio() to consolidate them in the same #ifdef block.
Signed-off-by: Naohiro Aota <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: David Sterba <[email protected]>
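A sketch of the helper pair, assuming the per-disk zone_capacity/last_zone_capacity tracking that the zone write plugging code already maintains:

    unsigned int disk_zone_capacity(struct gendisk *disk, sector_t sector)
    {
            sector_t zone_sectors = disk->queue->limits.chunk_sectors;

            /* the last zone can be a smaller "runt" with its own capacity */
            if (sector + zone_sectors >= get_capacity(disk))
                    return disk->last_zone_capacity;
            return disk->zone_capacity;
    }

    unsigned int bdev_zone_capacity(struct block_device *bdev, sector_t pos)
    {
            return disk_zone_capacity(bdev->bd_disk, pos);
    }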
# 777d0961 | 17-Apr-2025 | Christoph Hellwig <[email protected]>
fs: move the bdev_statx call to vfs_getattr_nosec
Currently bdev_statx is only called from the very high-level vfs_statx_path function, and is thus bypassed for in-kernel calls to vfs_getattr or vfs_getattr_nosec.
This breaks querying the block size of the underlying device in the loop driver and is also a pitfall for any other new kernel caller.
Move the call into the lowest-level helper to ensure all callers get the right results.
Fixes: 2d985f8c6b91 ("vfs: support STATX_DIOALIGN on block devices") Fixes: f4774e92aab8 ("loop: take the file system minimum dio alignment into account") Reported-by: "Darrick J. Wong" <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/[email protected] Signed-off-by: Christian Brauner <[email protected]>
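A sketch of the shape of the change; do_getattr_dispatch() is a hypothetical stand-in for the existing attribute-filling logic, and the exact signatures are assumptions:

    int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
                          u32 request_mask, unsigned int query_flags)
    {
            /* existing generic_fillattr()/->getattr dispatch, elided */
            int ret = do_getattr_dispatch(path, stat, request_mask,
                                          query_flags);

            /*
             * Previously only done in vfs_statx_path(): fill in block
             * device attributes (e.g. STATX_DIOALIGN) here so in-kernel
             * callers such as the loop driver also get correct results.
             */
            bdev_statx(path, stat, request_mask);
            return ret;
    }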
Revision tags: v6.14-rc7
# a3996d11 | 13-Mar-2025 | Nilay Shroff <[email protected]>
block: protect debugfs attrs using elevator_lock instead of sysfs_lock
Currently, the block debugfs attributes (tags, tags_bitmap, sched_tags, and sched_tags_bitmap) are protected using q->sysfs_lock. However, these attributes are updated in multiple scenarios:
- During the driver probe method
- During an elevator switch/update
- During an nr_hw_queues update
- When writing to the sysfs attribute nr_requests
All these update paths (except the driver probe method, which doesn't require any protection) are already protected using q->elevator_lock. To ensure consistency and proper synchronization, replace q->sysfs_lock with q->elevator_lock for protecting these debugfs attributes.
Signed-off-by: Nilay Shroff <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] [axboe: some commit message rewording/fixes] Signed-off-by: Jens Axboe <[email protected]>
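A sketch of the resulting pattern in one of the debugfs show handlers, shaped after the existing hctx_tags_show(); treat it as illustrative:

    static int hctx_tags_show(void *data, struct seq_file *m)
    {
            struct blk_mq_hw_ctx *hctx = data;
            struct request_queue *q = hctx->queue;
            int res;

            /* was: mutex_lock_interruptible(&q->sysfs_lock) */
            res = mutex_lock_interruptible(&q->elevator_lock);
            if (res)
                    return res;
            if (hctx->tags)
                    blk_mq_debugfs_tags_show(m, hctx->tags);
            mutex_unlock(&q->elevator_lock);

            return 0;
    }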
Revision tags: v6.14-rc6
# 5abba4ce | 06-Mar-2025 | Nilay Shroff <[email protected]>
block: protect hctx attributes/params using q->elevator_lock
Currently, hctx attributes (nr_tags, nr_reserved_tags, and cpu_list) are protected using `q->sysfs_lock`. However, these attributes can be updated in multiple scenarios:
- During the driver's probe method.
- When updating nr_hw_queues.
- When writing to the sysfs attribute nr_requests, which can modify nr_tags.
The nr_requests attribute is already protected using q->elevator_lock, but none of the update paths actually use q->sysfs_lock to protect hctx attributes. So to ensure proper synchronization, replace q->sysfs_lock with q->elevator_lock when reading hctx attributes through sysfs.
Additionally, blk_mq_update_nr_hw_queues allocates and updates hctx. The allocation of hctx is protected using q->elevator_lock; however, updating hctx params happens without any protection, so safeguard the hctx param update path by also using q->elevator_lock.
Signed-off-by: Nilay Shroff <[email protected]> Link: https://lore.kernel.org/r/[email protected] [axboe: wrap comment at 80 chars] Signed-off-by: Jens Axboe <[email protected]>
# 5e40f445 | 04-Mar-2025 | Nilay Shroff <[email protected]>
block: protect read_ahead_kb using q->limits_lock
The bdi->ra_pages could be updated under q->limits_lock because it's usually calculated from the queue limits by queue_limits_commit_update. So protect reading/writing the sysfs attribute read_ahead_kb using q->limits_lock instead of q->sysfs_lock.
Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Ming Lei <[email protected]> Signed-off-by: Nilay Shroff <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
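A sketch of the show side under the new lock; the store side symmetrically takes q->limits_lock around updating bdi->ra_pages:

    static ssize_t queue_ra_show(struct gendisk *disk, char *page)
    {
            ssize_t ret;

            /* was: q->sysfs_lock */
            mutex_lock(&disk->queue->limits_lock);
            ret = sysfs_emit(page, "%lu\n",
                             disk->bdi->ra_pages << (PAGE_SHIFT - 10));
            mutex_unlock(&disk->queue->limits_lock);

            return ret;
    }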
# 245618f8 | 04-Mar-2025 | Nilay Shroff <[email protected]>
block: protect wbt_lat_usec using q->elevator_lock
The wbt latency and state could be updated while initializing or exiting the elevator. They could also be updated while configuring IO latency QoS parameters using cgroup. The elevator code path is now protected with q->elevator_lock. So we should protect access to the sysfs attribute wbt_lat_usec using q->elevator_lock instead of q->sysfs_lock. While we're at it, also protect ioc_qos_write(), which configures wbt parameters via cgroup, using q->elevator_lock.
Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Ming Lei <[email protected]> Signed-off-by: Nilay Shroff <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
# 3efe7571 | 04-Mar-2025 | Nilay Shroff <[email protected]>
block: protect nr_requests update using q->elevator_lock
The sysfs attribute nr_requests could be simultaneously updated from the elevator switch/update or nr_hw_queues update code path. The update to nr_requests for each of those code paths runs holding q->elevator_lock. So we should protect access to the sysfs attribute nr_requests using q->elevator_lock instead of q->sysfs_lock.
Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Ming Lei <[email protected]> Signed-off-by: Nilay Shroff <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
# 1bf70d08 | 04-Mar-2025 | Nilay Shroff <[email protected]>
block: introduce a dedicated lock for protecting queue elevator updates
A queue's elevator can be updated either when modifying nr_hw_queues or through the sysfs scheduler attribute. Currently, elevator switching/updating is protected using q->sysfs_lock, but this has led to lockdep splats[1] due to inconsistent lock ordering between q->sysfs_lock and the freeze-lock in multiple block layer call sites.
As the scope of q->sysfs_lock is not well-defined, its (mis)use has resulted in numerous lockdep warnings. To address this, introduce a new q->elevator_lock, dedicated specifically to protecting elevator switches/updates, and use it instead of q->sysfs_lock there.
While at it, make elv_iosched_load_module() a static function, as it is only called from elv_iosched_store(). Also, remove redundant parameters from elv_iosched_load_module() function signature.
[1] https://lore.kernel.org/all/[email protected]/
Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Ming Lei <[email protected]> Signed-off-by: Nilay Shroff <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
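Conceptually the change boils down to a dedicated mutex in struct request_queue plus taking it on the switch paths; a sketch, with the store-side call name being illustrative:

    struct request_queue {
            /* ... */
            /*
             * Protects elevator switches/updates (q->elevator) instead
             * of the overloaded, poorly scoped q->sysfs_lock.
             */
            struct mutex elevator_lock;
            /* ... */
    };

    /* in elv_iosched_store() and the nr_hw_queues update path */
    mutex_lock(&q->elevator_lock);
    ret = elevator_switch_or_update(q, name);  /* illustrative name */
    mutex_unlock(&q->elevator_lock);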
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3
# a6aa36e9 | 14-Feb-2025 | Damien Le Moal <[email protected]>
block: Remove zone write plugs when handling native zone append writes
For devices that natively support zone append operations, REQ_OP_ZONE_APPEND BIOs are not processed through zone write plugging and are immediately issued to the zoned device. This means that there is no write pointer offset tracking done for these operations and that a zone write plug is not necessary.
However, when receiving a zone append BIO, we may already have a zone write plug for the target zone if that zone was previously partially written using regular write operations. In such a case, since the write pointer offset of the zone write plug is not incremented by the number of sectors appended to the zone, two issues arise:
1) We risk leaving the plug in the disk hash table if the zone is fully written using zone append or regular write operations, because the write pointer offset will never reach the "zone full" state.
2) Regular write operations that are issued after zone append operations will always be failed by blk_zone_wplug_prepare_bio() as the write pointer alignment check will fail, even if the user correctly accounted for the zone append operations and issued the regular writes with a correct sector.
Avoid these issues by immediately removing the zone write plug of zones that are the target of zone append operations when blk_zone_plug_bio() is called. The new function blk_zone_wplug_handle_native_zone_append() implements this for devices that natively support zone append. The removal of the zone write plug using disk_remove_zone_wplug() requires aborting all plugged regular writes using disk_zone_wplug_abort() as otherwise the plugged write BIOs would never be executed (with the plug removed, the completion path will never again see the zone write plug as disk_get_zone_wplug() will return NULL). Rate-limited warnings are added to blk_zone_wplug_handle_native_zone_append() and to disk_zone_wplug_abort() to signal this.
Since blk_zone_wplug_handle_native_zone_append() is called in the hot path for operations that will not be plugged, disk_get_zone_wplug() is optimized under the assumption that a user issuing zone append operations is not at the same time issuing regular writes and that there are no hashed zone write plugs. The struct gendisk atomic counter nr_zone_wplugs is added to check this, with this counter incremented in disk_insert_zone_wplug() and decremented in disk_remove_zone_wplug().
To be consistent with this fix, we do not need to fill the zone write plug hash table with zone write plugs for zones that are partially written for a device that supports native zone append operations. So modify blk_revalidate_seq_zone() to return early to avoid allocating and inserting a zone write plug for partially written sequential zones if the device natively supports zone append.
Reported-by: Jorgen Hansen <[email protected]> Fixes: 9b1ce7f0c6f8 ("block: Implement zone append emulation") Cc: [email protected] Signed-off-by: Damien Le Moal <[email protected]> Tested-by: Jorgen Hansen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
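A sketch of the new hot-path handling described above, with locking details simplified:

    static void blk_zone_wplug_handle_native_zone_append(struct bio *bio)
    {
            struct gendisk *disk = bio->bi_bdev->bd_disk;
            struct blk_zone_wplug *zwplug;
            unsigned long flags;

            /*
             * Fast path: no hashed zone write plugs means no partially
             * written zone is being tracked, so nothing to remove.
             */
            if (!atomic_read(&disk->nr_zone_wplugs))
                    return;

            zwplug = disk_get_zone_wplug(disk, bio->bi_iter.bi_sector);
            if (!zwplug)
                    return;

            spin_lock_irqsave(&zwplug->lock, flags);
            /* abort plugged regular writes and drop the plug */
            disk_zone_wplug_abort(zwplug);
            disk_remove_zone_wplug(disk, zwplug);
            spin_unlock_irqrestore(&zwplug->lock, flags);

            disk_put_zone_wplug(zwplug);
    }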
# 889c5706 | 25-Feb-2025 | Ming Lei <[email protected]>
block: make segment size limit workable for > 4K PAGE_SIZE
Using PAGE_SIZE as the minimum expected DMA segment size breaks devices which have a max DMA segment size of < 64k when used on 64k PAGE_SIZE systems, such as eMMC and the Exynos UFS controller [0] [1]: they fail to probe as follows:
WARNING: CPU: 2 PID: 397 at block/blk-settings.c:339 blk_validate_limits+0x364/0x3c0
Ensure we use min(max_seg_size, seg_boundary_mask + 1) as the new min segment size when max segment size is < PAGE_SIZE for 16k and 64k base page size systems.
If anyone needs to backport this patch, it depends on the following commits:
commit 6aeb4f836480 ("block: remove bio_add_pc_page")
commit 02ee5d69e3ba ("block: remove blk_rq_bio_prep")
commit b7175e24d6ac ("block: add a dma mapping iterator")
Link: https://lore.kernel.org/linux-block/[email protected]/ # [0] Link: https://lore.kernel.org/linux-block/[email protected]/ # [1] Cc: Yi Zhang <[email protected]> Cc: John Garry <[email protected]> Cc: Keith Busch <[email protected]> Tested-by: Paul Bunyan <[email protected]> Reviewed-by: Daniel Gomez <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
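A sketch of the idea inside blk_validate_limits(); the local naming is an assumption:

    unsigned int min_seg_size = PAGE_SIZE;

    /*
     * Devices with a max segment size below PAGE_SIZE (e.g. eMMC or
     * the Exynos UFS controller on 64k page kernels) are now allowed:
     * bound the expected minimum by the segment boundary instead of
     * rejecting them outright.
     */
    if (lim->max_segment_size < PAGE_SIZE)
            min_seg_size = min_t(unsigned int, lim->max_segment_size,
                                 lim->seg_boundary_mask + 1);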
# 47dd6753 | 21-Feb-2025 | Luis Chamberlain <[email protected]>
block/bdev: lift block size restrictions to 64k
We now can support block sizes larger than PAGE_SIZE, so in theory we should be able to lift the restriction up to the max supported page cache order. However, we bound ourselves to what we can currently validate and test: through blktests and fstests we can validate up to 64k today.
Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: "Matthew Wilcox (Oracle)" <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
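The effect is roughly a relaxed bound in the block size check; a sketch, where the 64k constant and its placement are assumptions:

    /* was: bsize > PAGE_SIZE */
    if (bsize < 512 || bsize > SZ_64K || !is_power_of_2(bsize))
            return -EINVAL;

Block sizes above PAGE_SIZE additionally rely on large folio support in the block device pagecache, which is why the cap follows what blktests/fstests can validate rather than the max page cache order.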
Revision tags: v6.14-rc2, v6.14-rc1
# fe662860 | 28-Jan-2025 | Nilay Shroff <[email protected]>
block: get rid of request queue ->sysfs_dir_lock
The request queue uses ->sysfs_dir_lock for protecting the addition/deletion of kobject entries under sysfs while we register/unregister blk-mq. However, kobject addition/deletion is already protected with kernfs/sysfs internal synchronization primitives, so the use of q->sysfs_dir_lock seems redundant.
Moreover, q->sysfs_dir_lock is also used at a few other callsites along with q->sysfs_lock for protecting the addition/deletion of kobjects. One such example is when we register with sysfs a set of independent access ranges for a disk. Here as well we could get rid of q->sysfs_dir_lock and only use q->sysfs_lock.
The only variable which q->sysfs_dir_lock appears to protect is q->mq_sysfs_init_done, which is set/unset while registering/unregistering blk-mq with sysfs. But use of q->mq_sysfs_init_done could easily be replaced with the queue registered bit QUEUE_FLAG_REGISTERED.
So with this patch we remove q->sysfs_dir_lock from each callsite and replace q->mq_sysfs_init_done using QUEUE_FLAG_REGISTERED.
Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Nilay Shroff <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
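The replacement is mechanical; a sketch:

    /* was: if (!q->mq_sysfs_init_done) return; */
    if (!blk_queue_registered(q))
            return;

blk_queue_registered() tests QUEUE_FLAG_REGISTERED, which is set/cleared at exactly the register/unregister points that mq_sysfs_init_done used to track.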
Revision tags: v6.13
# 6a7e17b2 | 16-Jan-2025 | John Garry <[email protected]>
block: Add common atomic writes enable flag
Currently only stacked devices need to explicitly enable atomic writes by setting the BLK_FEAT_ATOMIC_WRITES_STACKED flag.
This does not work well for device mapper stacked devices, as many sets of limits are stacked there and which device is the 'bottom' and which is the 'top' can be swapped. This means that BLK_FEAT_ATOMIC_WRITES_STACKED would need to be set for many queue limits, which is messy.
Generalize atomic write enabling by requiring that all devices explicitly set a flag - that includes NVMe, SCSI sd, and md raid.
Signed-off-by: John Garry <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
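A sketch of a driver opting in via the generalized feature flag:

    struct queue_limits lim = queue_limits_start_update(disk->queue);
    int err;

    /* atomic writes are now opt-in for every driver, stacked or not */
    lim.features |= BLK_FEAT_ATOMIC_WRITES;

    err = queue_limits_commit_update(disk->queue, &lim);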
Revision tags: v6.13-rc7
# 6564862d | 09-Jan-2025 | John Garry <[email protected]>
block: Ensure start sector is aligned for stacking atomic writes
For stacking atomic writes, ensure that the start sector is aligned with the device atomic write unit min and any boundary. Otherwise, we may permit misaligned atomic writes.
Rework bdev_can_atomic_write() into a common helper to reuse the alignment check. There, also use atomic_write_hw_unit_min, which is more appropriate than atomic_write_unit_min.
Fixes: d7f36dc446e89 ("block: Support atomic writes limits for stacked devices") Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: John Garry <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
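A sketch of the reworked helper shared by bdev_can_atomic_write() and the limits stacking code; the exact form is an assumption:

    static bool blk_atomic_write_start_sect_aligned(sector_t sector,
                                            struct queue_limits *limits)
    {
            /* both values are powers of two; boundary may be zero */
            unsigned int alignment = max(limits->atomic_write_hw_unit_min,
                                         limits->atomic_write_hw_boundary);

            return IS_ALIGNED(sector, alignment >> SECTOR_SHIFT);
    }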
# aa427d7b | 10-Jan-2025 | Christoph Hellwig <[email protected]>
block: add a queue_limits_commit_update_frozen helper
Add a helper that freezes the queue, updates the queue limits, and unfreezes the queue, and convert all open-coded versions of that pattern to the new helper.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: John Garry <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Nilay Shroff <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
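The helper is presumably little more than the freeze/commit/unfreeze pattern it replaces; a sketch:

    int queue_limits_commit_update_frozen(struct request_queue *q,
                                          struct queue_limits *lim)
    {
            int ret;

            blk_mq_freeze_queue(q);
            ret = queue_limits_commit_update(q, lim);
            blk_mq_unfreeze_queue(q);

            return ret;
    }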
# 9c96821b | 10-Jan-2025 | Christoph Hellwig <[email protected]>
block: fix docs for freezing of queue limits updates
queue_limits_commit_update is the function that needs to operate on a frozen queue, not queue_limits_start_update. Update the kerneldoc comments to reflect that.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Nilay Shroff <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: John Garry <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Revision tags: v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1
# f6661b1d | 27-Nov-2024 | Ming Lei <[email protected]>
block: track queue dying state automatically for modeling queue freeze lockdep
Now we only verify the outermost freeze & unfreeze in the current context when !q->mq_freeze_depth, so it is reliable to save the queue dying state when we want to lock the freeze queue, since the state is now a per-task variable.
Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
# 6f491a8d | 27-Nov-2024 | Ming Lei <[email protected]>
block: track disk DEAD state automatically for modeling queue freeze lockdep
Now we only verify the outermost freeze & unfreeze in the current context when !q->mq_freeze_depth, so it is reliable to save the disk DEAD state when we want to lock the freeze queue, since the state is now a per-task variable.
Doing it this way kills lots of false positives when the freeze queue is called before adding the disk [1].
[1] https://lore.kernel.org/linux-block/[email protected]/
Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
# fe0418eb | 09-Dec-2024 | Damien Le Moal <[email protected]>
block: Prevent potential deadlocks in zone write plug error recovery
Zone write plugging for handling writes to zones of a zoned block device always executes a zone report whenever a write BIO to a zone fails. The intent of this is to ensure that the tracking of a zone write pointer is always correct, so that the alignment of write BIOs to the zone write pointer can be checked on submission and so that zone append operations can always be correctly emulated using regular write BIOs.
However, this error recovery scheme introduces a potential deadlock if a device queue freeze is initiated while BIOs are still plugged in a zone write plug and one of these write operations fails. In such a case, the disk zone write plug error recovery work is scheduled and executes a report zones. This in turn can result in a request allocation in the underlying driver to issue the report zones command to the device. But with the device queue freeze already started, this allocation will block, preventing the report zones execution and the continuation of the processing of the plugged BIOs. As plugged BIOs hold a queue usage reference, the queue freeze itself will never complete, resulting in a deadlock.
Avoid this problem by completely removing from the zone write plugging code the use of report zones operations after a failed write operation, instead relying on the device user to either execute a report zones, reset the zone, finish the zone, or give up writing to the device (which is a fairly common pattern for file systems which degrade to read-only after write failures). This is not an unreasonable requirement as all well-behaved applications, FSes and device mapper already use report zones to recover from write errors whenever possible by comparing the current position of a zone write pointer with what their assumption about the position is.
The changes to remove the automatic error recovery are as follows:
- Completely remove the error recovery work and its associated resources (zone write plug list head, disk error list, and disk zone_wplugs_work work struct). This also removes the functions disk_zone_wplug_set_error() and disk_zone_wplug_clear_error().
- Change the BLK_ZONE_WPLUG_ERROR zone write plug flag into BLK_ZONE_WPLUG_NEED_WP_UPDATE. This new flag is set for a zone write plug whenever a write operation targeting the zone of the zone write plug fails. This flag indicates that the zone write pointer offset is not reliable and that it must be updated when the next report zone, reset zone, finish zone or disk revalidation is executed.
- Modify blk_zone_write_plug_bio_endio() to set the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag for the target zone of a failed write BIO.
- Modify the function disk_zone_wplug_set_wp_offset() to clear this new flag, thus implementing recovery of a correct write pointer offset with the reset (all) zone and finish zone operations.
- Modify blkdev_report_zones() to always use the disk_report_zones_cb() callback so that disk_zone_wplug_sync_wp_offset() can be called for any zone marked with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag. This implements recovery of a correct write pointer offset for zone write plugs marked with BLK_ZONE_WPLUG_NEED_WP_UPDATE and within the range of the report zones operation executed by the user.
- Modify blk_revalidate_seq_zone() to call disk_zone_wplug_sync_wp_offset() for all sequential write required zones when a zoned block device is revalidated, thus always resolving any inconsistency between the write pointer offset of zone write plugs and the actual write pointer position of sequential zones.
Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Cc: [email protected] Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
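A sketch of the report zones callback wiring described above; the args structure is illustrative:

    struct disk_report_zones_cb_args {
            struct gendisk *disk;
            report_zones_cb user_cb;
            void *user_data;
    };

    static int disk_report_zones_cb(struct blk_zone *zone, unsigned int idx,
                                    void *data)
    {
            struct disk_report_zones_cb_args *args = data;

            /*
             * Resync the zone write plug wp offset if the zone was
             * flagged BLK_ZONE_WPLUG_NEED_WP_UPDATE by a failed write.
             */
            disk_zone_wplug_sync_wp_offset(args->disk, zone);

            if (!args->user_cb)
                    return 0;
            return args->user_cb(zone, idx, args->user_data);
    }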
# b76b840f | 09-Dec-2024 | Damien Le Moal <[email protected]>
dm: Fix dm-zoned-reclaim zone write pointer alignment
The zone reclaim processing of the dm-zoned device mapper uses blkdev_issue_zeroout() to align the write pointer of a zone being used for reclaiming another zone, to write the valid data blocks from the zone being reclaimed at the same position relative to the zone start in the reclaim target zone.
The first call to blkdev_issue_zeroout() will try to use hardware offload using a REQ_OP_WRITE_ZEROES operation if the device reports a non-zero max_write_zeroes_sectors queue limit. If this operation fails because of the lack of hardware support, blkdev_issue_zeroout() falls back to using a regular write operation with the zero-page as buffer. Currently, such REQ_OP_WRITE_ZEROES failure is automatically handled by the block layer zone write plugging code which will execute a report zones operation to ensure that the write pointer of the target zone of the failed operation has not changed and to "rewind" the zone write pointer offset of the target zone as it was advanced when the write zero operation was submitted. So the REQ_OP_WRITE_ZEROES failure does not cause any issue and blkdev_issue_zeroout() works as expected.
However, since the automatic recovery of zone write pointers by the zone write plugging code can potentially cause deadlocks with queue freeze operations, a different recovery must be implemented in preparation for the removal of zone write plugging report zones based recovery.
Do this by introducing the new function blk_zone_issue_zeroout(). This function first calls blkdev_issue_zeroout() with the flag BLKDEV_ZERO_NOFALLBACK to intercept failures of the first execution, which attempts to use the device hardware offload with the REQ_OP_WRITE_ZEROES operation. If this attempt fails, a report zones operation is issued to restore the zone write pointer offset of the target zone to the correct position and blkdev_issue_zeroout() is called again without the BLKDEV_ZERO_NOFALLBACK flag. The report zones operation performing this recovery is implemented using the helper function disk_zone_sync_wp_offset(), which calls the gendisk report_zones file operation with the callback disk_report_zones_cb(). This callback updates the write pointer offset of the target zone using the new function disk_zone_wplug_sync_wp_offset().
dmz_reclaim_align_wp() is modified to change its call to blkdev_issue_zeroout() to a call to blk_zone_issue_zeroout() without any other change needed, as the two functions are functionally equivalent.
Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Cc: [email protected] Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Mike Snitzer <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
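A sketch of blk_zone_issue_zeroout() as described: try the hardware offload only, resync the write pointer on failure, then fall back:

    int blk_zone_issue_zeroout(struct block_device *bdev, sector_t sector,
                               sector_t nr_sects, gfp_t gfp_mask)
    {
            int ret;

            /* try REQ_OP_WRITE_ZEROES only, no zero-page write fallback */
            ret = blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask,
                                       BLKDEV_ZERO_NOFALLBACK);
            if (ret != -EOPNOTSUPP)
                    return ret;

            /*
             * The failed offload advanced the zone write plug wp offset;
             * resync it with a report zones pass before retrying with
             * the regular zero-page write fallback.
             */
            ret = disk_zone_sync_wp_offset(bdev->bd_disk, sector);
            if (ret != 1)
                    return ret < 0 ? ret : -EIO;

            return blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, 0);
    }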
# 766a71ef | 19-Nov-2024 | Christoph Hellwig <[email protected]>
block: return bool from get_disk_ro and bdev_read_only
get_disk_ro and bdev_read_only return boolean conditions, don't masquerade them as int.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: John Garry <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
# e888810b | 19-Nov-2024 | Christoph Hellwig <[email protected]>
block: remove a duplicate definition for bdev_read_only
bdev_read_only is already defined as an inline function in blkdev.h.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: John Garry <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
# da77d9b2 | 19-Nov-2024 | Christoph Hellwig <[email protected]>
block: return bool from blk_rq_aligned
blk_rq_aligned returns a boolean condition, don't masquerade it as int.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: John Garry <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>