|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2 |
|
| #
05356938 |
| 28-May-2024 |
Coly Li <[email protected]> |
bcache: call force_wake_up_gc() if necessary in check_should_bypass()
If extremely heavy write I/O continuously hits a relatively small cache device (512GB in my testing), it is possible for the counter c->gc_stats.in_use to keep increasing and exceed CUTOFF_CACHE_ADD.
Once 'c->gc_stats.in_use > CUTOFF_CACHE_ADD' happens, all subsequent write requests bypass the cache device because check_should_bypass() returns 'true'. Because all writes bypass the cache device, the counter c->sectors_to_gc never has a chance to go negative, and the garbage collection thread is not woken up even after writeback completes and the whole cache becomes clean. The aftermath is that all write I/Os go directly to the backing device even though the cache device is clean.
To avoid the above situation, this patch takes a quite conservative approach: if 'c->gc_stats.in_use > CUTOFF_CACHE_ADD' happens, the garbage collection thread is only woken up when the whole cache device is clean.
Before the fix, the writes-always-bypass situation appeared after 10+ hours of write I/O pressure on a 512GB Intel Optane device acting as the cache device. After the fix, the situation did not reappear in 36+ hours of testing.
Signed-off-by: Coly Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
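A minimal sketch of the approach, not the verbatim patch (force_wake_up_gc(), c->gc_stats.in_use and CUTOFF_CACHE_ADD are the names from the commit; the clean-state check is an assumed helper):

	/* in check_should_bypass(), before taking the bypass path */
	if (unlikely(c->gc_stats.in_use > CUTOFF_CACHE_ADD)) {
		/*
		 * All writes bypass the cache from here on, so
		 * c->sectors_to_gc can never go negative; wake the GC
		 * thread explicitly, but only once the cache is clean.
		 */
		if (cache_set_is_clean(c))	/* hypothetical helper */
			force_wake_up_gc(c);
		goto skip;
	}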
|
|
Revision tags: v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2 |
|
| #
d4e3b928 |
| 18-Nov-2023 |
Kent Overstreet <[email protected]> |
closures: CLOSURE_CALLBACK() to fix type punning
Control flow integrity is now checking that type signatures match on indirect function calls. That breaks closures, which embed a work_struct in a closure in such a way that a closure_fn may also be used as a workqueue fn by the underlying closure code.
So we have to change closure fns to take a work_struct as their argument - but that results in a loss of clarity, as closure fns have different semantics from normal workqueue functions (they run owning a ref on the closure, which must be released with continue_at() or closure_return()).
Thus, this patch introduces the CLOSURE_CALLBACK() and closure_type() macros, as suggested by Kees, to smooth things over a bit.
Suggested-by: Kees Cook <[email protected]> Cc: Coly Li <[email protected]> Signed-off-by: Kent Overstreet <[email protected]>
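A sketch of the shape of these macros, assuming the closure embeds its work_struct in a member named 'work' (not the verbatim kernel definitions):

	#define CLOSURE_CALLBACK(name)	void name(struct work_struct *ws)
	#define closure_type(name, type, member)			\
		type *name = container_of(ws, type, member.work)

	/* usage: the callback recovers its object from the work_struct */
	static CLOSURE_CALLBACK(write_done)	/* illustrative name */
	{
		closure_type(w, struct my_writer, cl);
		/* runs owning a ref on the closure; must end with
		 * continue_at() or closure_return() */
	}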
|
|
Revision tags: v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6 |
|
| #
05bdb996 |
| 08-Jun-2023 |
Christoph Hellwig <[email protected]> |
block: replace fmode_t with a block-specific type for block open flags
The only overlap between the block open flags mapped into fmode_t and other uses of fmode_t is FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl, and stop abusing fmode_t.
Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Jack Wang <[email protected]> [rnbd] Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
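A sketch of the new type's shape (only the read and write flags are shown; treat the exact flag list beyond these as an assumption):

	typedef unsigned int __bitwise blk_mode_t;

	/* open for reading */
	#define BLK_OPEN_READ	((__force blk_mode_t)(1 << 0))
	/* open for writing */
	#define BLK_OPEN_WRITE	((__force blk_mode_t)(1 << 1))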
|
|
Revision tags: v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1 |
|
| #
c34b7ac6 |
| 06-Dec-2022 |
Christoph Hellwig <[email protected]> |
block: remove bio_set_op_attrs
This macro is obsolete, so replace the last few uses with open coded bi_opf assignments.
Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Coly Li <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
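The conversion is mechanical; an illustrative before/after (the flags are invented for the example):

	/* before (obsolete helper): */
	bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC);
	/* after (open coded bi_opf assignment): */
	bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;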
|
|
Revision tags: v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1 |
|
| #
8032bf12 |
| 10-Oct-2022 |
Jason A. Donenfeld <[email protected]> |
treewide: use get_random_u32_below() instead of deprecated function
This is a simple mechanical transformation done by:
@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
  (E)
Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Acked-by: Darrick J. Wong <[email protected]> # for xfs Reviewed-by: SeongJae Park <[email protected]> # for damon Reviewed-by: Jason Gunthorpe <[email protected]> # for infiniband Reviewed-by: Russell King (Oracle) <[email protected]> # for arm Acked-by: Ulf Hansson <[email protected]> # for mmc Signed-off-by: Jason A. Donenfeld <[email protected]>
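An illustrative caller-side change (variable names invented):

	/* before (deprecated): */
	u32 r = prandom_u32_max(bound);
	/* after: */
	u32 r = get_random_u32_below(bound);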
|
| #
81895a65 |
| 05-Oct-2022 |
Jason A. Donenfeld <[email protected]> |
treewide: use prandom_u32_max() when possible, part 1
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done mechanically with this coccinelle script:
@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)

@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@

- RAND = get_random_u32();
  ... when != RAND
- RAND %= (E);
+ RAND = prandom_u32_max(E);

// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@

((T)get_random_u32()@p & (LITERAL))

// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@

value = None
if literal.startswith('0x'):
    value = int(literal, 16)
elif literal[0] in '123456789':
    value = int(literal, 10)
if value is None:
    print("I don't know how to handle %s" % (literal))
    cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
    print("Skipping 0x%x for cleanup elsewhere" % (value))
    cocci.include_match(False)
elif value & (value + 1) != 0:
    print("Skipping 0x%x because it's not a power of two minus one" % (value))
    cocci.include_match(False)
elif literal.startswith('0x'):
    coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
    coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))

// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@

- (FUNC()@p & (LITERAL))
+ prandom_u32_max(RESULT)

@collapse_ret@
type T;
identifier VAR;
expression E;
@@

{
- T VAR;
- VAR = (E);
- return VAR;
+ return E;
}

@drop_var@
type T;
identifier VAR;
@@

{
- T VAR;
  ... when != VAR
}
Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Yury Norov <[email protected]> Reviewed-by: KP Singh <[email protected]> Reviewed-by: Jan Kara <[email protected]> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <[email protected]> # for drbd Acked-by: Jakub Kicinski <[email protected]> Acked-by: Heiko Carstens <[email protected]> # for s390 Acked-by: Ulf Hansson <[email protected]> # for mmc Acked-by: Darrick J. Wong <[email protected]> # for xfs Signed-off-by: Jason A. Donenfeld <[email protected]>
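An illustrative example of the kind of change the script performs (names invented):

	/* before: costs a division and carries modulo bias */
	idx = get_random_u32() % nr_slots;
	/* after: multiply-shift bound, no division */
	idx = prandom_u32_max(nr_slots);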
|
|
Revision tags: v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1 |
|
| #
40f567bb |
| 27-May-2022 |
Jia-Ju Bai <[email protected]> |
md: bcache: check the return value of kzalloc() in detached_dev_do_request()
The function kzalloc() in detached_dev_do_request() can fail, so its return value should be checked.
Fixes: bc082a55d25c ("bcache: fix inaccurate io state for detached bcache devices") Reported-by: TOTE Robot <[email protected]> Signed-off-by: Jia-Ju Bai <[email protected]> Signed-off-by: Coly Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
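A sketch of the resulting pattern (the exact error handling is an assumption, not the verbatim patch):

	ddip = kzalloc(sizeof(*ddip), GFP_NOIO);
	if (!ddip) {
		/* allocation failed: fail the bio instead of crashing */
		bio->bi_status = BLK_STS_RESOURCE;
		bio->bi_end_io(bio);
		return;
	}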
|
|
Revision tags: v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4 |
|
| #
9dca4168 |
| 19-Apr-2022 |
Coly Li <[email protected]> |
bcache: fix wrong bdev parameter when calling bio_alloc_clone() in do_bio_hook()
Commit abfc426d1b2f ("block: pass a block_device to bio_clone_fast") calls the modified bio_alloc_clone() in bcache code as: bio_init_clone(bio->bi_bdev, bio, orig_bio, GFP_NOIO);
But the first parameter is wrong: bio->bi_bdev should be orig_bio->bi_bdev. The wrong bi_bdev panics the kernel when submitting the cache bio.
This patch fixes the wrong bdev parameter usage and avoids the panic.
Fixes: abfc426d1b2f ("block: pass a block_device to bio_clone_fast") Signed-off-by: Coly Li <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
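The before/after of the call, as described above:

	/* buggy: the clone's own bi_bdev is not initialized yet */
	bio_init_clone(bio->bi_bdev, bio, orig_bio, GFP_NOIO);
	/* fixed: clone from the original bio's device */
	bio_init_clone(orig_bio->bi_bdev, bio, orig_bio, GFP_NOIO);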
|
|
Revision tags: v5.18-rc3 |
|
| #
70200574 |
| 15-Apr-2022 |
Christoph Hellwig <[email protected]> |
block: remove QUEUE_FLAG_DISCARD
Just use a non-zero max_discard_sectors as an indicator for discard support, similar to what is done for write zeroes.
The only place that needs special attention is the RAID5 driver, which must clear discard support by default for security reasons, even if the default stacking rules would allow for it.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Acked-by: Christoph Böhmwalder <[email protected]> [drbd] Acked-by: Jan Höppner <[email protected]> [s390] Acked-by: Coly Li <[email protected]> [bcache] Acked-by: David Sterba <[email protected]> [btrfs] Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
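A sketch of the resulting idiom in a caller (bdev_max_discard_sectors() is the limit-based helper this model relies on):

	/* a non-zero max_discard_sectors now means "discard supported" */
	if (!bdev_max_discard_sectors(bdev))
		return -EOPNOTSUPP;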
|
|
Revision tags: v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7 |
|
| #
07fee7ab |
| 03-Mar-2022 |
Christoph Hellwig <[email protected]> |
bcache: use bvec_kmap_local in bio_csum
Using local kmaps slightly reduces the chance of stray writes, and the bvec interface cleans up the code a little bit.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
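A sketch of the pattern (the checksum function shown is illustrative):

	struct bio_vec bv;
	struct bvec_iter iter;
	u64 csum = 0;

	bio_for_each_segment(bv, bio, iter) {
		void *p = bvec_kmap_local(&bv);	/* local, short-lived map */

		csum = crc64_be(csum, p, bv.bv_len);
		kunmap_local(p);
	}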
|
|
Revision tags: v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3 |
|
| #
abfc426d |
| 02-Feb-2022 |
Christoph Hellwig <[email protected]> |
block: pass a block_device to bio_clone_fast
Pass a block_device to bio_clone_fast and __bio_clone_fast and give the functions more suitable names.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
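A sketch of a converted caller (bio_clone_fast became bio_alloc_clone; the target device is now passed explicitly):

	/* before: */
	clone = bio_clone_fast(bio, GFP_NOIO, &bs);
	/* after: */
	clone = bio_alloc_clone(bdev, bio, GFP_NOIO, &bs);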
|
| #
a0e8de79 |
| 02-Feb-2022 |
Christoph Hellwig <[email protected]> |
block: initialize the target bio in __bio_clone_fast
All callers of __bio_clone_fast initialize the bio first. Move that initialization into __bio_clone_fast instead.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
| #
56b4b5ab |
| 02-Feb-2022 |
Christoph Hellwig <[email protected]> |
block: clone crypto and integrity data in __bio_clone_fast
__bio_clone_fast should also clone integrity and crypto data, as a clone without those is incomplete. Right now the only caller that can actually support crypto and integrity data (dm) does it manually for the one callchain that supports these, but we had better do it properly in the core.
Note that all callers except the above-mentioned one also don't need to handle failure at all, given that the integrity and crypto clones are based on mempool allocations that won't fail for sleeping allocations.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
|
Revision tags: v5.17-rc2 |
|
| #
a7c50c94 |
| 24-Jan-2022 |
Christoph Hellwig <[email protected]> |
block: pass a block_device and opf to bio_reset
Pass the block_device that we plan to use this bio for and the operation to bio_reset to optimize the assignment. A NULL block_device can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
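An illustrative before/after of a caller:

	/* before: */
	bio_reset(bio);
	bio_set_dev(bio, bdev);
	bio->bi_opf = REQ_OP_WRITE;
	/* after: */
	bio_reset(bio, bdev, REQ_OP_WRITE);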
|
| #
49add496 |
| 24-Jan-2022 |
Christoph Hellwig <[email protected]> |
block: pass a block_device and opf to bio_init
Pass the block_device that we plan to use this bio for and the operation to bio_init to optimize the assignment. A NULL block_device can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
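An illustrative before/after of a caller:

	/* before: */
	bio_init(bio, vecs, nr_vecs);
	bio_set_dev(bio, bdev);
	bio->bi_opf = REQ_OP_READ;
	/* after: */
	bio_init(bio, bdev, vecs, nr_vecs, REQ_OP_READ);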
|
| #
609be106 |
| 24-Jan-2022 |
Christoph Hellwig <[email protected]> |
block: pass a block_device and opf to bio_alloc_bioset
Pass the block_device and operation that we plan to use this bio for to bio_alloc_bioset to optimize the assignment. NULL/0 can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code.
Also move the gfp_mask argument after the nr_vecs argument for a much more logical calling convention matching what most of the kernel does.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
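An illustrative before/after of a caller, showing both the new leading arguments and the relocated gfp_mask:

	/* before: */
	bio = bio_alloc_bioset(GFP_NOIO, nr_vecs, &bs);
	bio_set_dev(bio, bdev);
	bio->bi_opf = REQ_OP_READ;
	/* after: */
	bio = bio_alloc_bioset(bdev, nr_vecs, REQ_OP_READ, GFP_NOIO, &bs);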
|
|
Revision tags: v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7 |
|
| #
39fa7a95 |
| 20-Oct-2021 |
Christoph Hellwig <[email protected]> |
bcache: remove bch_crc64_update
bch_crc64_update is an entirely pointless wrapper around crc64_be.
Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Coly Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
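The conversion is a direct substitution:

	/* before: */
	crc = bch_crc64_update(crc, data, len);
	/* after: */
	crc = crc64_be(crc, data, len);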
|
| #
0f5cd781 |
| 20-Oct-2021 |
Christoph Hellwig <[email protected]> |
bcache: remove the backing_dev_name field from struct cached_dev
Just use the %pg format specifier to print the name directly.
Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Coly Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
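An illustrative use (message text invented); %pg takes a struct block_device pointer directly:

	pr_err("I/O error on backing device %pg\n", dc->bdev);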
|
|
Revision tags: v5.15-rc6 |
|
| #
3e08773c |
| 12-Oct-2021 |
Christoph Hellwig <[email protected]> |
block: switch polling to be bio based
Replace the blk_poll interface that requires the caller to keep a queue and cookie from the submissions with polling based on the bio.
Polling for the bio itself leads to a few advantages:
- the cookie construction can be made entirely private in blk-mq.c
- the caller does not need to remember the request_queue and cookie separately and thus sidesteps their lifetime issues
- keeping the device and the cookie inside the bio makes it trivial to support polling of BIOs remapped by stacking drivers
- a lot of code to propagate the cookie back up the submission path can be removed entirely.
Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Mark Wunderlich <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
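A sketch of the caller-side result; the io_comp_batch argument shown here reflects the later three-argument form of bio_poll(), so treat the exact signature as an assumption:

	/* poll for completion of this specific bio */
	ret = bio_poll(bio, NULL, 0);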
|
|
Revision tags: v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1, v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6 |
|
| #
41fe8d08 |
| 07-Jun-2021 |
Coly Li <[email protected]> |
bcache: avoid oversized read request in cache missing code path
In the cache missing code path of a cached device, if a proper location from the internal B+ tree is matched for a cache miss range, the function cached_dev_cache_miss() will be called in cache_lookup_fn() in the following code block,
[code block 1]
526 unsigned int sectors = KEY_INODE(k) == s->iop.inode
527     ? min_t(uint64_t, INT_MAX,
528             KEY_START(k) - bio->bi_iter.bi_sector)
529     : INT_MAX;
530 int ret = s->d->cache_miss(b, s, bio, sectors);
Here s->d->cache_miss() is the callback function pointer, initialized to cached_dev_cache_miss(); the last parameter 'sectors' is an important hint used to calculate the size of the read request to the backing device for the missing cache data.
The current calculation in the above code block may generate an oversized value of 'sectors', which consequently may trigger 2 different potential kernel panics by BUG() or BUG_ON(), as listed below,
1) BUG_ON() inside bch_btree_insert_key(),
[code block 2]
886 BUG_ON(b->ops->is_extents && !KEY_SIZE(k));
2) BUG() inside biovec_slab(),
[code block 3]
51 default:
52         BUG();
53         return NULL;
All the above panics originate from cached_dev_cache_miss() being called with the oversized parameter 'sectors'.
Inside cached_dev_cache_miss(), the parameter 'sectors' is used to calculate the size of the data read from the backing device for the cache miss. This size is stored in s->insert_bio_sectors by the following lines of code,
[code block 4]
909 s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada);
Then the actual key to insert into the internal B+ tree is generated and stored in s->iop.replace_key by the following lines of code,
[code block 5]
911 s->iop.replace_key = KEY(s->iop.inode,
912         bio->bi_iter.bi_sector + s->insert_bio_sectors,
913         s->insert_bio_sectors);
The oversized parameter 'sectors' may trigger panic 1) via the BUG_ON() from the above code block.
And the bio sent to the backing device for the missing data is allocated with a hint from s->insert_bio_sectors by the following lines of code,
[code block 6]
926 cache_bio = bio_alloc_bioset(GFP_NOWAIT,
927         DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS),
928         &dc->disk.bio_split);
The oversized parameter 'sectors' may trigger panic 2) via the BUG() from the above code block.
Now let me explain how the panics happen with the oversized 'sectors'. In code block 5, replace_key is generated by the macro KEY(). From the definition of macro KEY(),
[code block 7]
71 #define KEY(inode, offset, size)                                  \
72 ((struct bkey) {                                                  \
73         .high = (1ULL << 63) | ((__u64) (size) << 20) | (inode),  \
74         .low = (offset)                                           \
75 })
Here 'size' is a 16-bit wide field embedded in the 64-bit member 'high' of struct bkey. But in code block 1, "KEY_START(k) - bio->bi_iter.bi_sector" may very probably be larger than (1<<16) - 1, which makes the bkey size calculation in code block 5 overflow. In one bug report the value of the parameter 'sectors' was 131072 (= 1 << 17); the oversized 'sectors' results in an oversized s->insert_bio_sectors in code block 4, which in turn makes the size field of s->iop.replace_key 0 in code block 5. The 0-sized s->iop.replace_key is then inserted into the internal B+ tree as the cache missing check key (a special key to detect and avoid a race between a normal write request and a cache missing read request), as in,
[code block 8]
915 ret = bch_btree_insert_check_key(b, &s->op, &s->iop.replace_key);
Then the 0-sized s->iop.replace_key as the 3rd parameter triggers the bkey size check BUG_ON() in code block 2, and causes kernel panic 1).
The other kernel panic comes from code block 6, caused by an oversized bvecs number derived from the value s->insert_bio_sectors computed in code block 4, min(sectors, bio_sectors(bio) + reada). There are two possibilities for an oversized result,
- bio_sectors(bio) is valid, but bio_sectors(bio) + reada is oversized.
- sectors < bio_sectors(bio) + reada, but sectors is oversized.
From a bug report, the result of "DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS)" from code block 6 can be 344, 282, 946, 342 and many other values which are larger than BIO_MAX_VECS (a.k.a 256). When bio_alloc_bioset() is called with such a larger-than-256 value as the 2nd parameter, this value will eventually be sent to biovec_slab() as the parameter 'nr_vecs' in the following code path,
bio_alloc_bioset() ==> bvec_alloc() ==> biovec_slab()
Because the parameter 'nr_vecs' is a larger-than-256 value, the panic by the BUG() in code block 3 is triggered inside biovec_slab().
From the above analysis, we know that the 4th parameter 'sectors' passed into cached_dev_cache_miss() may cause an overflow in code blocks 5 and 6, and finally cause a kernel panic in code blocks 2 and 3. And if the result of bio_sectors(bio) + reada exceeds the valid bvecs number, it may also trigger a kernel panic in code block 3 via code block 6.
Now that the almost-useless readahead size for cache missing requests to the backing device has been removed, this patch can fix the oversized issue with a simpler method:
- add a local variable size_limit, set it to the minimum of the max bkey size and the max bio bvecs number.
- set s->insert_bio_sectors to the minimum of size_limit, sectors, and the sectors size of the bio.
- replace sectors with s->insert_bio_sectors in the bio_next_split() call.
With the above method and size_limit, s->insert_bio_sectors will never result in an oversized replace_key size or bio bvecs number. And the split bio 'miss' from bio_next_split() will always match the size of 'cache_bio', which is the current maximum bio size we can send to the backing device for fetching the cache missing data.
The problematic code can be partially traced back to Linux v3.13-rc1, therefore all maintained stable kernels should try to apply this fix.
Reported-by: Alexander Ullrich <[email protected]> Reported-by: Diego Ercolani <[email protected]> Reported-by: Jan Szubiak <[email protected]> Reported-by: Marco Rebhan <[email protected]> Reported-by: Matthias Ferdinand <[email protected]> Reported-by: Victor Westerhuis <[email protected]> Reported-by: Vojtech Pavlik <[email protected]> Reported-and-tested-by: Rolf Fokkens <[email protected]> Reported-and-tested-by: Thorsten Knabe <[email protected]> Signed-off-by: Coly Li <[email protected]> Cc: [email protected] Cc: Christoph Hellwig <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Nix <[email protected]> Cc: Takashi Iwai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
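A sketch of the clamping described above (names follow the commit text; the exact expression is an assumption):

	unsigned int size_limit;

	/* respect both the 16-bit bkey size and BIO_MAX_VECS */
	size_limit = min_t(unsigned int, BIO_MAX_VECS * PAGE_SECTORS,
			   (1 << KEY_SIZE_BITS) - 1);
	s->insert_bio_sectors = min3(size_limit, sectors, bio_sectors(bio));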
|
| #
1616a4c2 |
| 07-Jun-2021 |
Coly Li <[email protected]> |
bcache: remove bcache device self-defined readahead
For a read cache miss, bcache defines a readahead size for the read I/O request to the backing device for the missing data. This readahead size is initialized to 0, and almost no one changes it, to avoid unnecessary read amplification on the backing device and write amplification on the cache device. Considering that upper layer file system code already has readahead logic and works fine with the readahead_cache_policy sysfs interface, we don't have to keep bcache's self-defined readahead anymore.
This patch removes bcache's self-defined readahead for cache missing requests to the backing device, and removes the readahead sysfs interfaces as well.
This is the preparation for next patch to fix potential kernel panic due to oversized request in a simpler method.
Reported-by: Alexander Ullrich <[email protected]> Reported-by: Diego Ercolani <[email protected]> Reported-by: Jan Szubiak <[email protected]> Reported-by: Marco Rebhan <[email protected]> Reported-by: Matthias Ferdinand <[email protected]> Reported-by: Victor Westerhuis <[email protected]> Reported-by: Vojtech Pavlik <[email protected]> Reported-and-tested-by: Rolf Fokkens <[email protected]> Reported-and-tested-by: Thorsten Knabe <[email protected]> Signed-off-by: Coly Li <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: [email protected] Cc: Kent Overstreet <[email protected]> Cc: Nix <[email protected]> Cc: Takashi Iwai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
|
|
Revision tags: v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5 |
|
| #
99dfc43e |
| 24-Jan-2021 |
Christoph Hellwig <[email protected]> |
block: use ->bi_bdev for bio based I/O accounting
Rework the I/O accounting for bio based drivers to use ->bi_bdev. This means all drivers can now simply use bio_start_io_acct to start accounting, and it will take partitions into account automatically. To end I/O account either bio_end_io_acct can be used if the driver never remaps I/O to a different device, or bio_end_io_acct_remapped if the driver did remap the I/O.
Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Tejun Heo <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
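A sketch of the resulting driver pattern:

	unsigned long start_time = bio_start_io_acct(bio);
	/* ... submit and complete the I/O ... */
	bio_end_io_acct(bio, start_time);
	/* or, if the driver remapped the bio to another device: */
	bio_end_io_acct_remapped(bio, start_time, orig_bdev);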
|
| #
309dca30 |
| 24-Jan-2021 |
Christoph Hellwig <[email protected]> |
block: store a block_device pointer in struct bio
Replace the gendisk pointer in struct bio with a pointer to the newly improved struct block device. From that the gendisk can be trivially accessed with an extra indirection, but it also allows to directly look up all information related to partition remapping.
Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Tejun Heo <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
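With the new field, the gendisk is one indirection away:

	struct gendisk *disk = bio->bi_bdev->bd_disk;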
|
|
Revision tags: v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6 |
|
| #
8446fe92 |
| 24-Nov-2020 |
Christoph Hellwig <[email protected]> |
block: switch partition lookup to use struct block_device
Use struct block_device to lookup partitions on a disk. This removes all usage of struct hd_struct from the I/O path.
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Acked-by: Coly Li <[email protected]> [bcache] Acked-by: Chao Yu <[email protected]> [f2fs] Signed-off-by: Jens Axboe <[email protected]>
|
|
Revision tags: v5.10-rc5, v5.10-rc4, v5.10-rc3 |
|
| #
a7cb3d2f |
| 03-Nov-2020 |
Christoph Hellwig <[email protected]> |
block: remove __blkdev_driver_ioctl
Just open code it in the few callers.
Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
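A sketch of the open-coded form in a caller (assuming the driver provides a ->ioctl method):

	ret = bdev->bd_disk->fops->ioctl(bdev, mode, cmd, arg);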
|