Revision tags: v6.15, v6.15-rc7, v6.15-rc6

# e8007fad | 08-May-2025 | Steve Siwinski <[email protected]>
scsi: sd_zbc: block: Respect bio vector limits for REPORT ZONES buffer
The REPORT ZONES buffer size is currently limited by the HBA's maximum segment count to ensure the buffer can be mapped. However, the block layer further limits the number of iovec entries to 1024 when allocating a bio.
To avoid allocation of buffers too large to be mapped, further restrict the maximum buffer size to BIO_MAX_INLINE_VECS.
Replace the UIO_MAXIOV symbolic name with the more contextually appropriate BIO_MAX_INLINE_VECS.

Fixes: b091ac616846 ("sd_zbc: Fix report zones buffer allocation")
Cc: [email protected]
Signed-off-by: Steve Siwinski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Damien Le Moal <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
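
A minimal sketch of the clamp this commit describes. BIO_MAX_INLINE_VECS is the real symbol introduced here, but the helper, its name, and its arguments are illustrative, not the actual sd_zbc code:

```c
#include <linux/bio.h>
#include <linux/minmax.h>

/*
 * Illustrative only: cap a REPORT ZONES buffer so it never needs more
 * bio vectors than a bio can hold inline. BIO_MAX_INLINE_VECS (the
 * block layer's 1024-entry iovec ceiling from the commit message) is
 * the new bound; the HBA segment limit is the older constraint.
 */
static size_t report_zones_buf_size(unsigned int hba_max_segments,
                                    size_t wanted)
{
    size_t max_bytes = min_t(size_t, hba_max_segments,
                             BIO_MAX_INLINE_VECS) << PAGE_SHIFT;

    return min(wanted, max_bytes);
}
```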

Revision tags: v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7

# 26064d3e | 12-Mar-2025 | Ming Lei <[email protected]>
block: fix adding folio to bio
A >4GB folio is possible on some architectures: aarch64, for example, supports 16GB hugepages. The 'offset' of such a folio then can't be held in an 'unsigned int', causing a warning in bio_add_folio_nofail() and IO failure.
Fix it by adjusting 'page' and trimming 'offset' so that `->bi_offset` won't overflow and the folio can be added to the bio successfully.

Fixes: ed9832bc08db ("block: introduce folio awareness and add a bigger size from folio")
Cc: Kundan Kumar <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Gavin Shan <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
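
A hedged sketch of the adjustment the fix describes: step 'page' forward within the folio and keep only the in-page remainder, so the value stored in the 32-bit bi_offset cannot overflow. The helper name is illustrative, not the in-tree function:

```c
#include <linux/bio.h>
#include <linux/mm.h>

/*
 * Illustrative sketch: for a >4GB folio, 'offset' may exceed what an
 * unsigned int can hold. Advance 'page' by whole pages and keep only
 * the remainder, as the fix describes.
 */
static void add_folio_trimmed(struct bio *bio, struct folio *folio,
                              unsigned int len, size_t offset)
{
    unsigned long nr = offset >> PAGE_SHIFT;
    struct page *page = folio_page(folio, nr);

    offset -= nr << PAGE_SHIFT;     /* now offset < PAGE_SIZE */
    __bio_add_page(bio, page, len, offset);
}
```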

Revision tags: v6.14-rc6, v6.14-rc5

# 105ca2a2 | 25-Feb-2025 | Christoph Hellwig <[email protected]>
block: split struct bio_integrity_payload
Many of the fields in struct bio_integrity_payload are only needed for the default integrity buffer in the block layer, and the variable sized array at the end of the structure makes it very hard to embed into caller allocated structures.
Reduce struct bio_integrity_payload to the minimal structure needed in common code and create two separate containing structures for the automatically generated payload and the caller allocated payload. The latter is a simple wrapper for struct bio_integrity_payload and the bvecs, while the former contains the additional fields moved out of struct bio_integrity_payload.
Always use a dedicated mempool for automatically generated integrity metadata instead of depending on the bio_set, which is submitter controlled and thus often doesn't have the mempool initialized. Stop using mempools for the submitter buffers, as they aren't in the NOIO I/O submission path where forward progress must be guaranteed.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Tested-by: Anuj Gupta <[email protected]>
Reviewed-by: Anuj Gupta <[email protected]>
Reviewed-by: Kanchan Joshi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
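
A layout sketch of the split described above. The two container struct names and their exact field sets are assumptions for illustration; only the idea (minimal common payload, one wrapper per use case) comes from the commit text:

```c
#include <linux/bio-integrity.h>
#include <linux/workqueue.h>

/* Caller allocated payload: just the common part plus the bvecs. */
struct bip_caller_sketch {
    struct bio_integrity_payload bip;
    struct bio_vec bvecs[];          /* caller's vectors follow */
};

/* Block-layer generated payload: carries the fields moved out of
 * struct bio_integrity_payload for the automatic case. */
struct bip_auto_sketch {
    struct bio_integrity_payload bip;
    struct work_struct work;         /* deferred verification */
    void *buf;                       /* generated integrity buffer */
    struct bio_vec bvec;             /* maps 'buf' */
};
```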

# b654f7a5 | 28-Feb-2025 | Ming Lei <[email protected]>
block: fix 'kmem_cache of name 'bio-108' already exists'
A device mapper bioset often has a big bio_slab size, which can exceed 1000; 8 bytes then can't hold the slab name any more, causing the kmem_cache allocation warning 'kmem_cache of name 'bio-108' already exists'.
Fix the warning by extending bio_slab->name to 12 bytes, and fix the output of /proc/slabinfo.

Reported-by: Guangwu Zhang <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
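
The fix is easiest to see in the structure itself. This sketch follows the bio_slab definition in block/bio.c as I understand it; treat the field set as illustrative:

```c
#include <linux/slab.h>

/*
 * "bio-" plus a slab size that can exceed 1000 (e.g. "bio-1088") no
 * longer fits in 8 bytes with the NUL terminator, so the name was
 * silently truncated ("bio-108") and collided with an existing cache.
 */
struct bio_slab_sketch {
    struct kmem_cache *slab;
    unsigned int slab_ref;
    unsigned int slab_size;
    char name[12];      /* was 8; filled via snprintf("bio-%d", size) */
};
```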

Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13

# 554b2286 | 16-Jan-2025 | John Garry <[email protected]>
block: Don't trim an atomic write
This is disallowed.
This check will now be relevant since the device mapper personalities will start to support atomic writes, and they use this function.

Signed-off-by: John Garry <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
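
A hedged sketch of the kind of guard this adds: warn and refuse when a trim would touch an atomic write. REQ_ATOMIC is the real flag; the helper shape and its exact condition inside bio_trim() are assumptions:

```c
#include <linux/bio.h>

/* Illustrative guard: trimming an atomic write is disallowed. */
static bool bio_trim_allowed(struct bio *bio, sector_t size)
{
    if (WARN_ON_ONCE(size && (bio->bi_opf & REQ_ATOMIC)))
        return false;   /* caller must not trim this bio */
    return true;
}
```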

Revision tags: v6.13-rc7, v6.13-rc6

# 6aeb4f83 | 03-Jan-2025 | Christoph Hellwig <[email protected]>
block: remove bio_add_pc_page
Lift bio_split_rw_at into blk_rq_append_bio so that it validates the hardware limits. With this, all passthrough callers can simply use bio_add_page to build the bio and delay checking for exceeded limits to this point instead of doing it for each page.
While this looks like adding a new expensive loop over all bio_vecs, blk_rq_append_bio is already doing that just to count the number of segments.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
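
A sketch of the resulting passthrough pattern: build the bio with plain bio_add_page() and let blk_rq_append_bio() apply the hardware limits once. The mapping helper here is illustrative; blk_rq_append_bio() and bio_add_page() are the real entry points:

```c
#include <linux/bio.h>
#include <linux/blk-mq.h>

/* Illustrative passthrough mapping under the new scheme. */
static int pt_map_pages(struct request *rq, struct page **pages,
                        unsigned int nr, unsigned int len)
{
    struct bio *bio = bio_alloc(NULL, nr, rq->cmd_flags, GFP_KERNEL);
    unsigned int i;

    if (!bio)
        return -ENOMEM;
    for (i = 0; i < nr; i++)
        bio_add_page(bio, pages[i], len, 0);  /* no per-page limit check */

    /* validates segment/size limits for the whole bio in one pass */
    return blk_rq_append_bio(rq, bio);
}
```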

Revision tags: v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2

# 2f4873f9 | 02-Dec-2024 | John Garry <[email protected]>
block: Make bio_iov_bvec_set() accept pointer to const iov_iter
Make bio_iov_bvec_set() accept a pointer to const iov_iter, which means that we can drop the undesirable casting to struct iov_iter pointer in blk_rq_map_user_bvec().

Signed-off-by: John Garry <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
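
The change is visible in the prototype alone. A sketch, assuming the block-internal declaration takes this shape after the patch:

```c
#include <linux/bio.h>
#include <linux/uio.h>

/* New shape of the prototype (declared in the block layer's internal
 * headers; shown here for illustration): */
int bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter);

/* An illustrative caller: blk_rq_map_user_bvec() can now pass its
 * const iterator straight through, with no cast. */
static int map_user_bvec(struct bio *bio, const struct iov_iter *iter)
{
    return bio_iov_bvec_set(bio, iter);
}
```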

Revision tags: v6.13-rc1, v6.12

# 27b26f09 | 11-Nov-2024 | John Garry <[email protected]>
block: Error an attempt to split an atomic write in bio_split()
This is disallowed.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: John Garry <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

# e546fe1d | 11-Nov-2024 | John Garry <[email protected]>
block: Rework bio_split() return value
Instead of returning an inconclusive value of NULL for an error in calling bio_split(), always return an ERR_PTR().
Also remove the BUG_ON() calls and use WARN_ON_ONCE() instead. Indeed, since almost all callers don't check the return code from bio_split(), we'll crash anyway (for those failures).
Fix up the only user which checks bio_split() return code today (directly or indirectly), blk_crypto_fallback_split_bio_if_needed(). The md/bcache code does check the return code in cached_dev_cache_miss() -> bio_next_split() -> bio_split(), but only to see if there was a split, so there would be no change in behaviour here (when returning a ERR_PTR()).

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: John Garry <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
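
An illustrative caller updated for the new contract; the wrapper and its error policy are assumptions, bio_split() and IS_ERR() are real:

```c
#include <linux/bio.h>
#include <linux/err.h>

/* bio_split() now returns an ERR_PTR() on failure, never NULL. */
static struct bio *split_front(struct bio *bio, int sectors,
                               struct bio_set *bs)
{
    struct bio *split = bio_split(bio, sectors, GFP_NOIO, bs);

    if (IS_ERR(split))
        return NULL;        /* caller-specific error handling */
    return split;
}
```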

Revision tags: v6.12-rc7, v6.12-rc6

# f187b9bf | 30-Oct-2024 | Christoph Hellwig <[email protected]>
block: remove bio_add_zone_append_page
This is only used by the nvmet zns passthrough code, which can trivially just use bio_add_pc_page and do the sanity check for the max zone append limit itself.
All future zoned file systems should follow the btrfs lead and let the upper layers fill up bios unlimited by hardware constraints and split them to the limits in the I/O submission handler.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

# cafd00d0 | 30-Oct-2024 | Christoph Hellwig <[email protected]>
block: remove zone append special casing from the direct I/O path
This code is unused, and all future zoned file systems should follow the btrfs lead of splitting the bios themselves to the zoned limits in the I/O submission handler, because if they didn't they would be hit by commit ed9832bc08db ("block: introduce folio awareness and add a bigger size from folio") breaking this code when the zone append limit (that is usually the max_hw_sectors limit) is smaller than the largest possible folio size.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

Revision tags: v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11

# eb1d46fc | 11-Sep-2024 | Kundan Kumar <[email protected]>
block: unpin user pages belonging to a folio at once
Use newly added mm function unpin_user_folio() to put refs by npages count.

Signed-off-by: Kundan Kumar <[email protected]>
Tested-by: Luis Chamberlain <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
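
A sketch of the before/after shape of the release path; unpin_user_folio() is the real mm helper, while the wrapper is illustrative:

```c
#include <linux/bio.h>
#include <linux/mm.h>

/* Drop all pins for a folio's pages in one call instead of calling
 * unpin_user_page() once per page. */
static void release_folio_pins(struct folio *folio, unsigned long npages)
{
    unpin_user_folio(folio, npages);
}
```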

# ed9832bc | 11-Sep-2024 | Kundan Kumar <[email protected]>
block: introduce folio awareness and add a bigger size from folio
Add a bigger size from folio to bio and skip merge processing for pages.
Fetch the offset of page within a folio. Depending on the size of folio and folio_offset, fetch a larger length. This length may consist of multiple contiguous pages if folio is multiorder.
Using the length calculate number of pages which will be added to bio and increment the loop counter to skip those pages.
This technique helps to avoid overhead of merging pages which belong to same large order folio.
Also folio-ize the functions bio_iov_add_page() and bio_iov_add_zone_append_page().

Signed-off-by: Kundan Kumar <[email protected]>
Tested-by: Luis Chamberlain <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
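
A sketch of the length calculation described above, assuming an illustrative helper name; folio_page_idx() and folio_size() are real mm helpers:

```c
#include <linux/bio.h>
#include <linux/mm.h>

/*
 * Given a page inside a (possibly multi-order) folio, take as many
 * contiguous bytes from the folio as possible in one step, so the
 * per-page merge work can be skipped for the pages covered.
 */
static size_t folio_run_len(struct folio *folio, struct page *page,
                            size_t offset, size_t left)
{
    /* byte offset of 'page' within its folio, plus the page offset */
    size_t folio_offset = (folio_page_idx(folio, page) << PAGE_SHIFT)
                            + offset;

    /* everything up to the end of the folio, bounded by 'left' */
    return min(folio_size(folio) - folio_offset, left);
}
```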

# 7de98954 | 11-Sep-2024 | Kundan Kumar <[email protected]>
block: Added folio-ized version of bio_add_hw_page()
Added new bio_add_hw_folio() function as a wrapper around bio_add_hw_page(). This is a prep patch.

Signed-off-by: Kundan Kumar <[email protected]>
Tested-by: Luis Chamberlain <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
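
A sketch of the prep wrapper; bio_add_hw_page() is the real (block-internal) function it forwards to, and the UINT_MAX guard is an assumption about the range check such a wrapper needs:

```c
#include <linux/bio.h>
#include <linux/mm.h>

/* Folio-ized wrapper: forwards to the page-based helper. */
static int bio_add_hw_folio_sketch(struct request_queue *q, struct bio *bio,
                                   struct folio *folio, size_t len,
                                   size_t offset, unsigned int max_sectors,
                                   bool *same_page)
{
    if (len > UINT_MAX || offset > UINT_MAX)
        return 0;       /* page API takes unsigned int */
    return bio_add_hw_page(q, bio, folio_page(folio, 0), len, offset,
                           max_sectors, same_page);
}
```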

Revision tags: v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7

# 25f76c3d | 06-Jul-2024 | Christoph Hellwig <[email protected]>
block: add a bvec_phys helper
Get callers out of poking into bvec internals a bit more. Not a huge win right now, but with the proposed new DMA mapping API we might end up with a lot more of this otherwise.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
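
The helper is small enough to show whole; this matches the shape described by the commit (physical address of the data a bvec points at):

```c
#include <linux/bvec.h>
#include <linux/mm.h>

/* Physical address of the start of a bvec's data. */
static inline phys_addr_t bvec_phys(const struct bio_vec *bvec)
{
    return page_to_phys(bvec->bv_page) + bvec->bv_offset;
}
```

Callers that previously open-coded page_to_phys() plus offset arithmetic can use this instead of poking at bvec internals.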

# bf4c89fc | 02-Jul-2024 | Christoph Hellwig <[email protected]>
block: don't call bio_uninit from bio_endio
Commit b222dd2fdd53 ("block: call bio_uninit in bio_endio") added a call to bio_uninit in bio_endio to work around callers that use bio_init but fail to call bio_uninit after they are done, to release the resources. While this is an abuse of the bio_init API, we still have quite a few of those left. But this early uninit causes a problem for integrity data, as at least some users need the bio_integrity_payload. Right now the only one is the NVMe passthrough, which achieves this by adding a special case to skip the freeing if the BIP_INTEGRITY_USER flag is set.
Sort this out by only putting bi_blkg in bio_endio as that is the cause of the actual leaks - the few users of the crypto context and integrity data all properly call bio_uninit, usually through bio_put for dynamically allocated bios.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
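
A sketch of the narrowed teardown, assuming the blkcg reference drop keeps its usual form (blkg_put() is a block-layer-internal helper; the function name here is illustrative):

```c
#include <linux/bio.h>

/* bio_endio() now only drops the blkcg reference; crypto and
 * integrity state stay until bio_uninit(), typically via bio_put(). */
static void bio_endio_put_blkg(struct bio *bio)
{
#ifdef CONFIG_BLK_CGROUP
    if (bio->bi_blkg) {
        blkg_put(bio->bi_blkg);     /* from block/blk-cgroup.h */
        bio->bi_blkg = NULL;
    }
#endif
}
```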

# da042a36 | 02-Jul-2024 | Christoph Hellwig <[email protected]>
block: split integrity support out of bio.h
Split struct bio_integrity_payload and the related prototypes out of bio.h into a separate bio-integrity.h header so that it is only pulled in by the few places that need it.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Anuj Gupta <[email protected]>
Reviewed-by: Kanchan Joshi <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

Revision tags: v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9

# bf20ab53 | 09-May-2024 | Yu Kuai <[email protected]>
blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW
On the one hand, it has been marked EXPERIMENTAL since 2017, and there appear to have been no users, no testers, and no developers since then; it's just not active at all.
On the other hand, even if the config is disabled, there are still many fields in throtl_grp and throtl_data and many functions that are only used for throtl low.
Finally, blk-throtl is currently initialized during disk initialization and destroyed during disk removal, and it exposes many functions to be called directly from the block layer.
Remove throtl low to make the code much cleaner and follow-up work much easier.

Signed-off-by: Yu Kuai <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

Revision tags: v6.9-rc7, v6.9-rc6

# 8fde439b | 25-Apr-2024 | Matthew Wilcox (Oracle) <[email protected]>
bio: Export bio_add_folio_nofail to modules
Several modules use __bio_add_page() today and may need to be converted to bio_add_folio_nofail().

Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
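
An illustrative module-side conversion of the kind the commit anticipates; the wrapper is an assumption, bio_add_folio_nofail() is the real exported API:

```c
#include <linux/bio.h>

/* Replace a page-based nofail add with the folio-based one. */
static void add_block(struct bio *bio, struct folio *folio,
                      size_t len, size_t off)
{
    /* was: __bio_add_page(bio, folio_page(folio, 0), len, off); */
    bio_add_folio_nofail(bio, folio, len, off);
}
```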

# 0f8e9ecc | 06-May-2024 | Keith Busch <[email protected]>
block: add a bio_await_chain helper
Add a helper to wait for an entire chain of bios to complete.
[hch: split from a larger patch, moved and changed the name now that it is non-static]

Signed-off-by: Keith Busch <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
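
A hedged sketch of one way such a wait can be wired up: park an on-stack completion in the chain's tail bio before submission and wait on it. The in-tree helper may wire the completion differently; names here are illustrative:

```c
#include <linux/bio.h>
#include <linux/completion.h>

static void chain_end_io(struct bio *bio)
{
    complete(bio->bi_private);  /* whole chain has now completed */
    bio_put(bio);
}

/* Submit the tail of a chain and wait for everything chained to it. */
static void await_chain_tail(struct bio *tail)
{
    DECLARE_COMPLETION_ONSTACK(done);

    tail->bi_private = &done;
    tail->bi_end_io = chain_end_io;     /* set before submission */
    submit_bio(tail);
    wait_for_completion_io(&done);
}
```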

# 81c2168c | 06-May-2024 | Christoph Hellwig <[email protected]>
block: add a bio_chain_and_submit helper
This is basically blk_next_bio just with the bio allocation moved to the caller to allow for more flexible bio handling in the caller.

Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
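
The described shape is simple enough to sketch in full: chain the previous bio to the new one and submit it, returning the new bio for the caller to keep building on (the blk_next_bio pattern with allocation moved to the caller):

```c
#include <linux/bio.h>

/* Sketch of the helper's logic as described in the commit. */
static struct bio *chain_and_submit(struct bio *prev, struct bio *new)
{
    if (prev) {
        bio_chain(prev, new);   /* 'new' becomes the parent of 'prev' */
        submit_bio(prev);
    }
    return new;
}
```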

Revision tags: v6.9-rc5, v6.9-rc4

# dd291d77 | 08-Apr-2024 | Damien Le Moal <[email protected]>
block: Introduce zone write plugging
Zone write plugging implements a per-zone "plug" for write operations to control the submission and execution order of write operations to sequential write required zones of a zoned block device. Per-zone plugging guarantees that at any time there is at most only one write request per zone being executed. This mechanism is intended to replace zone write locking which implements a similar per-zone write throttling at the scheduler level, but is implemented only by mq-deadline.
Unlike zone write locking which operates on requests, zone write plugging operates on BIOs. A zone write plug is simply a BIO list that is atomically manipulated using a spinlock and a kblockd submission work. A write BIO to a zone is "plugged" to delay its execution if a write BIO for the same zone was already issued, that is, if a write request for the same zone is being executed. The next plugged BIO is unplugged and issued once the write request completes.
This mechanism allows us to:
- Untangle zone write ordering from block IO schedulers. This allows removing the restriction on using mq-deadline for writing to zoned block devices. Any block IO scheduler, including "none", can be used.
- Operate on BIOs instead of requests. Plugged BIOs waiting for execution thus do not hold scheduling tags and so do not prevent other BIOs from executing (reads or writes to other zones). Depending on the workload, this can significantly improve device utilization (higher queue depth operation) and performance.
- Support both blk-mq (request based) zoned devices and BIO-based zoned devices (e.g. device mapper). Zone write plugging is mandatory for the former but optional for the latter. BIO-based drivers can use it to implement write ordering guarantees, or implement their own if needed.
- Keep the code less invasive in the block layer: it is mostly limited to blk-zoned.c, with some small changes in blk-mq.c, blk-merge.c and bio.c.
Zone write plugging is implemented using struct blk_zone_wplug. This structure includes a spinlock, a BIO list and a work structure to handle the submission of plugged BIOs. Zone write plugs structures are managed using a per-disk hash table.
Plugging of zone write BIOs is done using the function blk_zone_write_plug_bio(), which returns false if a BIO execution does not need to be delayed and true otherwise. This function is called from blk_mq_submit_bio() after a BIO is split, to avoid large BIOs spanning multiple zones which would cause mishandling of zone write plugs. This change enables zone write plugging by default for any mq request-based block device. BIO-based device drivers can also use zone write plugging by explicitly calling blk_zone_write_plug_bio() in their ->submit_bio method. For such devices, the driver must ensure that a BIO passed to blk_zone_write_plug_bio() is already split and not straddling zone boundaries.
Only write and write zeroes BIOs are plugged. Zone write plugging does not introduce any significant overhead for other operations. A BIO that is being handled through zone write plugging is flagged using the new BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag. The completion of BIOs and requests carrying these flags triggers calls to the functions blk_zone_write_bio_endio() and blk_zone_write_complete_request() respectively. The latter function is used to trigger submission of the next plugged BIO using the zone plug work; blk_zone_write_bio_endio() does the same for BIO-based devices. This ensures that at any time, at most one request (blk-mq devices) or one BIO (BIO-based devices) is being executed for any zone. The handling of zone write plugs using a per-zone plug spinlock maximizes parallelism and device usage by allowing multiple zones to be written simultaneously without lock contention.
Zone write plugging ignores flush BIOs without data. However, any flush BIO that has data is always plugged so that the write part of the flush sequence is serialized with other regular writes.
Given that any BIO handled through zone write plugging will be the only BIO in flight for the target zone when it is executed, the unplugging and submission of a BIO will have no chance of successfully merging with plugged requests or requests in the scheduler. To overcome this potential performance degradation, blk_mq_submit_bio() calls the function blk_zone_write_plug_attempt_merge() to try to merge other plugged BIOs with the one just unplugged and submitted. Successful merging is signaled using blk_zone_write_plug_bio_merged(), called from bio_attempt_back_merge(). Furthermore, to avoid recalculating the number of segments of plugged BIOs to attempt merging, the number of segments of a plugged BIO is saved using the new struct bio field __bi_nr_segments. To avoid growing the size of struct bio, this field is added as a union with the bio_cookie field. This is safe to do as polling is always disabled for plugged BIOs.
When BIOs are plugged in a zone write plug, the device request queue usage counter is always incremented. This reference is kept and reused for blk-mq devices when the plugged BIO is unplugged and submitted again using submit_bio_noacct_nocheck(). For this case, the unplugged BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and blk_mq_submit_bio() proceeds directly to allocating a new request for the BIO, re-using the usage reference count taken when the BIO was plugged. This extra reference count is dropped in blk_zone_write_plug_attempt_merge() for any plugged BIO that is successfully merged. Given that BIO-based devices will not take this path, the extra reference is dropped after a plugged BIO is unplugged and submitted.
Zone write plugs are dynamically allocated and managed using a hash table (an array of struct hlist_head) with RCU protection. A zone write plug is allocated when a write BIO is received for the zone and not freed until the zone is fully written, reset or finished. To detect when a zone write plug can be freed, the write state of each zone is tracked using a write pointer offset which corresponds to the offset of a zone write pointer relative to the zone start. Write operations always increment this write pointer offset. Zone reset operations set it to 0 and zone finish operations set it to the zone size.
If a write error happens, the wp_offset value of a zone write plug may become incorrect and out of sync with the device managed write pointer. This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR. The function blk_zone_wplug_handle_error() is called from the new disk zone write plug work when this flag is set. This function executes a report zone to update the zone write pointer offset to the current value as indicated by the device. The disk zone write plug work is scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes with an error or when bio_zone_wplug_prepare_bio() detects an unaligned write. Once scheduled, the disk zone write plugs work keeps running until all zone errors are handled.
To match the new data structures used for zoned disks, the function disk_free_zone_bitmaps() is renamed to the more generic disk_free_zone_resources(). The function disk_init_zone_resources() is also introduced to initialize zone write plugs resources when a gendisk is allocated.
In order to guarantee that the user can simultaneously write up to a number of zones equal to a device max active zone limit or max open zone limit, zone write plugs are allocated using a mempool sized to the maximum of these 2 device limits. For a device that does not have active and open zone limits, 128 is used as the default mempool size.
If a change to the device active and open zone limits is detected, the disk mempool is resized when blk_revalidate_disk_zones() is executed.
This commit contains contributions from Christoph Hellwig <[email protected]>.

Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Tested-by: Hans Holmberg <[email protected]>
Tested-by: Dennis Maisenbacher <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
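
A structural sketch of the core objects and the plug/pass decision described above. The real struct blk_zone_wplug in block/blk-zoned.c carries more state (hash node, flags, refcount, wp_offset error handling); this trimmed version and the helper are illustrative:

```c
#include <linux/bio.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

/* Trimmed sketch of a per-zone write plug. */
struct zone_wplug_sketch {
    spinlock_t lock;               /* protects bio_list */
    struct bio_list bio_list;      /* plugged writes for this zone */
    struct work_struct bio_work;   /* kblockd unplug/submission work */
    unsigned int wp_offset;        /* write pointer offset in the zone */
};

/*
 * Decision sketch: if a write for this zone is already in flight,
 * queue the BIO on the plug (true = delayed); otherwise let it go
 * (false = submit now). Completion later kicks bio_work to unplug.
 */
static bool zone_wplug_add(struct zone_wplug_sketch *zwplug,
                           struct bio *bio, bool write_in_flight)
{
    unsigned long flags;
    bool plugged = false;

    spin_lock_irqsave(&zwplug->lock, flags);
    if (write_in_flight) {
        bio_list_add(&zwplug->bio_list, bio);
        plugged = true;
    }
    spin_unlock_irqrestore(&zwplug->lock, flags);
    return plugged;
}
```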

Revision tags: v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7

# 38b43539 | 29-Feb-2024 | Tony Battersby <[email protected]>
block: Fix page refcounts for unaligned buffers in __bio_release_pages()
Fix an incorrect number of pages being released for buffers that do not start at the beginning of a page.

Fixes: 1b151e2435fc ("block: Remove special-casing of compound pages")
Cc: [email protected]
Signed-off-by: Tony Battersby <[email protected]>
Tested-by: Greg Edwards <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
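
The arithmetic behind the bug class is worth spelling out; the helper name is illustrative:

```c
#include <linux/math.h>
#include <linux/mm.h>

/*
 * A buffer of 'len' bytes starting at 'offset' within its first page
 * can span one more page than 'len' alone suggests. Counting pages
 * from 'len' only (DIV_ROUND_UP(len, PAGE_SIZE)) under-releases
 * whenever the buffer starts mid-page and crosses a page boundary.
 */
static unsigned long pages_spanned(size_t offset, size_t len)
{
    return DIV_ROUND_UP(offset + len, PAGE_SIZE);
}
```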

Revision tags: v6.8-rc6

# 0eb4db47 | 23-Feb-2024 | Keith Busch <[email protected]>
block: io wait hang check helper
This is the same in two places, and another will be added soon. Create a helper for it.

Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
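
A hedged sketch of the consolidated wait: block on I/O completion, but wake up short of the hung-task timeout so a long wait doesn't trip the detector. The halving and the helper name are assumptions; the completion APIs and sysctl_hung_task_timeout_secs are real:

```c
#include <linux/completion.h>
#include <linux/jiffies.h>
#include <linux/sched/sysctl.h>

static void wait_io_sketch(struct completion *done)
{
    unsigned long timeout =
        READ_ONCE(sysctl_hung_task_timeout_secs) * HZ / 2;

    if (!timeout) {
        wait_for_completion_io(done);   /* detector disabled */
        return;
    }
    /* re-arm before the hung-task check would fire */
    while (!wait_for_completion_io_timeout(done, timeout))
        ;
}
```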

Revision tags: v6.8-rc5, v6.8-rc4

# e516c3fc | 07-Feb-2024 | Pavel Begunkov <[email protected]>
block: optimise in irq bio put caching
When enlisting a bio into ->free_list_irq we protect the list by disabling irqs. It's likely they're already disabled and the performance of local_irq_{save,restore}() is decent, but it's not zero cost.
Let's only use the irq cache when we're serving a hard irq, which allows removing local_irq_{save,restore}(), and fall back to bio_free() in all remaining cases.
Profiles indicate that the bio_put() cost is reduced by ~3.5 times (1.76% -> 0.49%), and total throughput of a CPU bound benchmark improve by around 1% (t/io_uring with high QD and several drives).

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/36d207540b7046c653cc16e5ff08fe7234b19f81.1707314970.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
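
A sketch of the new policy, under stated assumptions: bio_put_irq_cache() is a hypothetical stand-in for the percpu free-list enqueue, and bio_free() is block/bio.c-internal; in_hardirq() is the real context test:

```c
#include <linux/bio.h>
#include <linux/preempt.h>

static void bio_put_percpu_sketch(struct bio *bio)
{
    if (in_hardirq()) {
        /* irqs are already off here, so no local_irq_save() needed
         * to touch the per-cpu ->free_list_irq */
        bio_put_irq_cache(bio);     /* hypothetical helper */
    } else {
        bio_free(bio);              /* plain slab free path */
    }
}
```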