|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1 |
|
| #
65f666c6 |
| 19-Sep-2024 |
Ming Lei <[email protected]> |
lib/sbitmap: define swap_lock as raw_spinlock_t
When called from sbitmap_queue_get(), sbitmap_deferred_clear() may be run with preempt disabled. In RT kernel, spin_lock() can sleep, then warning of
lib/sbitmap: define swap_lock as raw_spinlock_t
When called from sbitmap_queue_get(), sbitmap_deferred_clear() may be run with preempt disabled. In RT kernel, spin_lock() can sleep, then warning of "BUG: sleeping function called from invalid context" can be triggered.
Fix it by replacing it with raw_spin_lock.
Cc: Yang Yang <[email protected]> Fixes: 72d04bdcf3f7 ("sbitmap: fix io hung due to race on sbitmap_word::cleared") Signed-off-by: Ming Lei <[email protected]> Reviewed-by: Yang Yang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1 |
|
| #
72d04bdc |
| 16-Jul-2024 |
Yang Yang <[email protected]> |
sbitmap: fix io hung due to race on sbitmap_word::cleared
Configuration for sbq: depth=64, wake_batch=6, shift=6, map_nr=1
1. There are 64 requests in progress: map->word = 0xFFFFFFFFFFFFFFFF 2
sbitmap: fix io hung due to race on sbitmap_word::cleared
Configuration for sbq: depth=64, wake_batch=6, shift=6, map_nr=1
1. There are 64 requests in progress: map->word = 0xFFFFFFFFFFFFFFFF 2. After all the 64 requests complete, and no more requests come: map->word = 0xFFFFFFFFFFFFFFFF, map->cleared = 0xFFFFFFFFFFFFFFFF 3. Now two tasks try to allocate requests: T1: T2: __blk_mq_get_tag . __sbitmap_queue_get . sbitmap_get . sbitmap_find_bit . sbitmap_find_bit_in_word . __sbitmap_get_word -> nr=-1 __blk_mq_get_tag sbitmap_deferred_clear __sbitmap_queue_get /* map->cleared=0xFFFFFFFFFFFFFFFF */ sbitmap_find_bit if (!READ_ONCE(map->cleared)) sbitmap_find_bit_in_word return false; __sbitmap_get_word -> nr=-1 mask = xchg(&map->cleared, 0) sbitmap_deferred_clear atomic_long_andnot() /* map->cleared=0 */ if (!(map->cleared)) return false; /* * map->cleared is cleared by T1 * T2 fail to acquire the tag */
4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken up due to the wake_batch being set at 6. If no more requests come, T1 will wait here indefinitely.
This patch achieves two purposes: 1. Check on ->cleared and update on both ->cleared and ->word need to be done atomically, and using spinlock could be the simplest solution. 2. Add extra check in sbitmap_deferred_clear(), to identify whether ->word has free bits.
Fixes: ea86ea2cdced ("sbitmap: ammortize cost of clearing bits") Signed-off-by: Yang Yang <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6 |
|
| #
6ad0d7e0 |
| 26-Apr-2024 |
linke li <[email protected]> |
sbitmap: use READ_ONCE to access map->word
In __sbitmap_queue_get_batch(), map->word is read several times, and update atomically using atomic_long_try_cmpxchg(). But the first two read of map->word
sbitmap: use READ_ONCE to access map->word
In __sbitmap_queue_get_batch(), map->word is read several times, and update atomically using atomic_long_try_cmpxchg(). But the first two read of map->word is not protected.
This patch moves the statement val = READ_ONCE(map->word) forward, eliminating unprotected accesses to map->word within the function. It is aimed at reducing the number of benign races reported by KCSAN in order to focus future debugging effort on harmful races.
Signed-off-by: linke li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1 |
|
| #
5c7fa5c8 |
| 15-Jan-2024 |
Kemeng Shi <[email protected]> |
sbitmap: remove stale comment in sbq_calc_wake_batch
After commit 106397376c036 ("sbitmap: fix batching wakeup"), we may wake up more than one queue for each batch. Just remove stale comment that we
sbitmap: remove stale comment in sbq_calc_wake_batch
After commit 106397376c036 ("sbitmap: fix batching wakeup"), we may wake up more than one queue for each batch. Just remove stale comment that we wake up only one queue for each batch.
Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3 |
|
| #
10639737 |
| 21-Jul-2023 |
David Jeffery <[email protected]> |
sbitmap: fix batching wakeup
Current code supposes that it is enough to provide forward progress by just waking up one wait queue after one completion batch is done.
Unfortunately this way isn't en
sbitmap: fix batching wakeup
Current code supposes that it is enough to provide forward progress by just waking up one wait queue after one completion batch is done.
Unfortunately this way isn't enough, cause waiter can be added to wait queue just after it is woken up.
Follows one example(64 depth, wake_batch is 8)
1) all 64 tags are active
2) in each wait queue, there is only one single waiter
3) each time one completion batch(8 completions) wakes up just one waiter in each wait queue, then immediately one new sleeper is added to this wait queue
4) after 64 completions, 8 waiters are wakeup, and there are still 8 waiters in each wait queue
5) after another 8 active tags are completed, only one waiter can be wakeup, and the other 7 can't be waken up anymore.
Turns out it isn't easy to fix this problem, so simply wakeup enough waiters for single batch.
Cc: Kemeng Shi <[email protected]> Cc: Chengming Zhou <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: David Jeffery <[email protected]> Signed-off-by: Ming Lei <[email protected]> Reviewed-by: Gabriel Krisman Bertazi <[email protected]> Reviewed-by: Keith Busch <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5 |
|
| #
b5fcf787 |
| 16-Jan-2023 |
Kemeng Shi <[email protected]> |
sbitmap: correct wake_batch recalculation to avoid potential IO hung
Commit 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened") mentioned that in case of shared tags, there could be ju
sbitmap: correct wake_batch recalculation to avoid potential IO hung
Commit 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened") mentioned that in case of shared tags, there could be just one real active hctx(queue) because of lazy detection of tag idle. Then driver tag allocation may wait forever on this real active hctx(queue) if wake_batch is > hctx_max_depth where hctx_max_depth is available tags depth for the actve hctx(queue). However, the condition wake_batch > hctx_max_depth is not strong enough to avoid IO hung as the sbitmap_queue_wake_up will only wake up one wait queue for each wake_batch even though there is only one waiter in the woken wait queue. After this, there is only one tag to free and wake_batch may not be reached anymore. Commit 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened") methioned that driver tag allocation may wait forever. Actually, the inactive hctx(queue) will be truely idle after at most 30 seconds and will call blk_mq_tag_wakeup_all to wake one waiter per wait queue to break the hung. But IO hung for 30 seconds is also not acceptable. Set batch size to small enough that depth of the shared hctx(queue) is enough to wake up all of the queues like sbq_calc_wake_batch do to fix this potential IO hung.
Although hctx_max_depth will be clamped to at least 4 while wake_batch recalculation does not do the clamp, the wake_batch will be always recalculated to 1 when hctx_max_depth <= 4.
Fixes: 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened") Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
678418c6 |
| 16-Jan-2023 |
Kemeng Shi <[email protected]> |
sbitmap: add sbitmap_find_bit to remove repeat code in __sbitmap_get/__sbitmap_get_shallow
There are three differences between __sbitmap_get and __sbitmap_get_shallow when searching free bit: 1. __s
sbitmap: add sbitmap_find_bit to remove repeat code in __sbitmap_get/__sbitmap_get_shallow
There are three differences between __sbitmap_get and __sbitmap_get_shallow when searching free bit: 1. __sbitmap_get_shallow limit number of bit to search per word. __sbitmap_get has no such limit. 2. __sbitmap_get_shallow always searches with wrap set. __sbitmap_get set wrap according to round_robin. 3. __sbitmap_get_shallow always searches from first bit in first word. __sbitmap_get searches from first bit when round_robin is not set otherwise searches from SB_NR_TO_BIT(sb, alloc_hint).
Add helper function sbitmap_find_bit function to do common search while accept "limit depth per word", "wrap flag" and "first bit to search" from caller to support the need of both __sbitmap_get and __sbitmap_get_shallow.
Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
08470a98 |
| 16-Jan-2023 |
Kemeng Shi <[email protected]> |
sbitmap: rewrite sbitmap_find_bit_in_index to reduce repeat code
Rewrite sbitmap_find_bit_in_index as following: 1. Rename sbitmap_find_bit_in_index to sbitmap_find_bit_in_word 2. Accept "struct sbi
sbitmap: rewrite sbitmap_find_bit_in_index to reduce repeat code
Rewrite sbitmap_find_bit_in_index as following: 1. Rename sbitmap_find_bit_in_index to sbitmap_find_bit_in_word 2. Accept "struct sbitmap_word *" directly instead of accepting "struct sbitmap *" and "int index" to get "struct sbitmap_word *". 3. Accept depth/shallow_depth and wrap for __sbitmap_get_word from caller to support need of both __sbitmap_get_shallow and __sbitmap_get.
With helper function sbitmap_find_bit_in_word, we can remove repeat code in __sbitmap_get_shallow to find bit considring deferred clear.
Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
903e86f3 |
| 16-Jan-2023 |
Kemeng Shi <[email protected]> |
sbitmap: remove redundant check in __sbitmap_queue_get_batch
Commit fbb564a557809 ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()") mentioned that "Checking free bits when setting the
sbitmap: remove redundant check in __sbitmap_queue_get_batch
Commit fbb564a557809 ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()") mentioned that "Checking free bits when setting the target bits. Otherwise, it may reuse the busying bits." This commit add check to make sure all masked bits in word before cmpxchg is zero. Then the existing check after cmpxchg to check any zero bit is existing in masked bits in word is redundant.
Actually, old value of word before cmpxchg is stored in val and we will filter out busy bits in val by "(get_mask & ~val)" after cmpxchg. So we will not reuse busy bits methioned in commit fbb564a557809 ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()"). Revert new-added check to remove redundant check.
Fixes: fbb564a55780 ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()") Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
f1591a8b |
| 16-Jan-2023 |
Kemeng Shi <[email protected]> |
sbitmap: remove unnecessary calculation of alloc_hint in __sbitmap_get_shallow
Updates to alloc_hint in the loop in __sbitmap_get_shallow() are mostly pointless and equivalent to setting alloc_hint
sbitmap: remove unnecessary calculation of alloc_hint in __sbitmap_get_shallow
Updates to alloc_hint in the loop in __sbitmap_get_shallow() are mostly pointless and equivalent to setting alloc_hint to zero (because SB_NR_TO_BIT() considers only low sb->shift bits from alloc_hint). So simplify the logic.
Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1 |
|
| #
8032bf12 |
| 10-Oct-2022 |
Jason A. Donenfeld <[email protected]> |
treewide: use get_random_u32_below() instead of deprecated function
This is a simple mechanical transformation done by:
@@ expression E; @@ - prandom_u32_max + get_random_u32_below (E)
Reviewed-
treewide: use get_random_u32_below() instead of deprecated function
This is a simple mechanical transformation done by:
@@ expression E; @@ - prandom_u32_max + get_random_u32_below (E)
Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Acked-by: Darrick J. Wong <[email protected]> # for xfs Reviewed-by: SeongJae Park <[email protected]> # for damon Reviewed-by: Jason Gunthorpe <[email protected]> # for infiniband Reviewed-by: Russell King (Oracle) <[email protected]> # for arm Acked-by: Ulf Hansson <[email protected]> # for mmc Signed-off-by: Jason A. Donenfeld <[email protected]>
show more ...
|
| #
26edb30d |
| 15-Nov-2022 |
Gabriel Krisman Bertazi <[email protected]> |
sbitmap: Try each queue to wake up at least one waiter
Jan reported the new algorithm as merged might be problematic if the queue being awaken becomes empty between the waitqueue_active inside sbq_w
sbitmap: Try each queue to wake up at least one waiter
Jan reported the new algorithm as merged might be problematic if the queue being awaken becomes empty between the waitqueue_active inside sbq_wake_ptr check and the wake up. If that happens, wake_up_nr will not wake up any waiter and we loose too many wake ups. In order to guarantee progress, we need to wake up at least one waiter here, if there are any. This now requires trying to wake up from every queue.
Instead of walking through all the queues with sbq_wake_ptr, this call moves the wake up inside that function. In a previous version of the patch, I found that updating wake_index several times when walking through queues had a measurable overhead. This ensures we only update it once, at the end.
Fixes: 4f8126bb2308 ("sbitmap: Use single per-bitmap counting to wake up queued tags") Reported-by: Jan Kara <[email protected]> Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
976570b4 |
| 15-Nov-2022 |
Gabriel Krisman Bertazi <[email protected]> |
sbitmap: Advance the queue index before waking up a queue
When a queue is awaken, the wake_index written by sbq_wake_ptr currently keeps pointing to the same queue. On the next wake up, it will thu
sbitmap: Advance the queue index before waking up a queue
When a queue is awaken, the wake_index written by sbq_wake_ptr currently keeps pointing to the same queue. On the next wake up, it will thus retry the same queue, which is unfair to other queues, and can lead to starvation. This patch, moves the index update to happen before the queue is returned, such that it will now try a different queue first on the next wake up, improving fairness.
Fixes: 4f8126bb2308 ("sbitmap: Use single per-bitmap counting to wake up queued tags") Reported-by: Jan Kara <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
4f8126bb |
| 05-Nov-2022 |
Gabriel Krisman Bertazi <[email protected]> |
sbitmap: Use single per-bitmap counting to wake up queued tags
sbitmap suffers from code complexity, as demonstrated by recent fixes, and eventual lost wake ups on nested I/O completion. The later
sbitmap: Use single per-bitmap counting to wake up queued tags
sbitmap suffers from code complexity, as demonstrated by recent fixes, and eventual lost wake ups on nested I/O completion. The later happens, from what I understand, due to the non-atomic nature of the updates to wait_cnt, which needs to be subtracted and eventually reset when equal to zero. This two step process can eventually miss an update when a nested completion happens to interrupt the CPU in between the wait_cnt updates. This is very hard to fix, as shown by the recent changes to this code.
The code complexity arises mostly from the corner cases to avoid missed wakes in this scenario. In addition, the handling of wake_batch recalculation plus the synchronization with sbq_queue_wake_up is non-trivial.
This patchset implements the idea originally proposed by Jan [1], which removes the need for the two-step updates of wait_cnt. This is done by tracking the number of completions and wakeups in always increasing, per-bitmap counters. Instead of having to reset the wait_cnt when it reaches zero, we simply keep counting, and attempt to wake up N threads in a single wait queue whenever there is enough space for a batch. Waking up less than batch_wake shouldn't be a problem, because we haven't changed the conditions for wake up, and the existing batch calculation guarantees at least enough remaining completions to wake up a batch for each queue at any time.
Performance-wise, one should expect very similar performance to the original algorithm for the case where there is no queueing. In both the old algorithm and this implementation, the first thing is to check ws_active, which bails out if there is no queueing to be managed. In the new code, we took care to avoid accounting completions and wakeups when there is no queueing, to not pay the cost of atomic operations unnecessarily, since it doesn't skew the numbers.
For more interesting cases, where there is queueing, we need to take into account the cross-communication of the atomic operations. I've been benchmarking by running parallel fio jobs against a single hctx nullb in different hardware queue depth scenarios, and verifying both IOPS and queueing.
Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel jobs. fio was issuing fixed-size randwrites with qd=64 against nullb, varying only the hardware queue length per test.
queue size 2 4 8 16 32 64 6.1-rc2 1681.1K (1.6K) 2633.0K (12.7K) 6940.8K (16.3K) 8172.3K (617.5K) 8391.7K (367.1K) 8606.1K (351.2K) patched 1721.8K (15.1K) 3016.7K (3.8K) 7543.0K (89.4K) 8132.5K (303.4K) 8324.2K (230.6K) 8401.8K (284.7K)
The following is a similar experiment, ran against a nullb with a single bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40 parallel fio jobs operating on the same device
queue size 2 4 8 16 32 64 6.1-rc2 1081.0K (2.3K) 957.2K (1.5K) 1699.1K (5.7K) 6178.2K (124.6K) 12227.9K (37.7K) 13286.6K (92.9K) patched 1081.8K (2.8K) 1316.5K (5.4K) 2364.4K (1.8K) 6151.4K (20.0K) 11893.6K (17.5K) 12385.6K (18.4K)
It has also survived blktests and a 12h-stress run against nullb. I also ran the code against nvme and a scsi SSD, and I didn't observe performance regression in those. If there are other tests you think I should run, please let me know and I will follow up with results.
[1] https://lore.kernel.org/all/[email protected]/
Cc: Hugh Dickins <[email protected]> Cc: Keith Busch <[email protected]> Cc: Liu Song <[email protected]> Suggested-by: Jan Kara <[email protected]> Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
8b3ccbc1 |
| 05-Oct-2022 |
Jason A. Donenfeld <[email protected]> |
treewide: use prandom_u32_max() when possible, part 2
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes t
treewide: use prandom_u32_max() when possible, part 2
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done by hand, covering things that coccinelle could not do on its own.
Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Yury Norov <[email protected]> Reviewed-by: Jan Kara <[email protected]> # for ext2, ext4, and sbitmap Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]>
show more ...
|
| #
81895a65 |
| 05-Oct-2022 |
Jason A. Donenfeld <[email protected]> |
treewide: use prandom_u32_max() when possible, part 1
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes t
treewide: use prandom_u32_max() when possible, part 1
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done mechanically with this coccinelle script:
@basic@ expression E; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u64; @@ ( - ((T)get_random_u32() % (E)) + prandom_u32_max(E) | - ((T)get_random_u32() & ((E) - 1)) + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2) | - ((u64)(E) * get_random_u32() >> 32) + prandom_u32_max(E) | - ((T)get_random_u32() & ~PAGE_MASK) + prandom_u32_max(PAGE_SIZE) )
@multi_line@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; identifier RAND; expression E; @@
- RAND = get_random_u32(); ... when != RAND - RAND %= (E); + RAND = prandom_u32_max(E);
// Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@
((T)get_random_u32()@p & (LITERAL))
// Add one to the literal. @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@
value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1: print("Skipping 0x%x for cleanup elsewhere" % (value)) cocci.include_match(False) elif value & (value + 1) != 0: print("Skipping 0x%x because it's not a power of two minus one" % (value)) cocci.include_match(False) elif literal.startswith('0x'): coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1)) else: coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
// Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; expression add_one.RESULT; identifier FUNC; @@
- (FUNC()@p & (LITERAL)) + prandom_u32_max(RESULT)
@collapse_ret@ type T; identifier VAR; expression E; @@
{ - T VAR; - VAR = (E); - return VAR; + return E; }
@drop_var@ type T; identifier VAR; @@
{ - T VAR; ... when != VAR }
Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Yury Norov <[email protected]> Reviewed-by: KP Singh <[email protected]> Reviewed-by: Jan Kara <[email protected]> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <[email protected]> # for drbd Acked-by: Jakub Kicinski <[email protected]> Acked-by: Heiko Carstens <[email protected]> # for s390 Acked-by: Ulf Hansson <[email protected]> # for mmc Acked-by: Darrick J. Wong <[email protected]> # for xfs Signed-off-by: Jason A. Donenfeld <[email protected]>
show more ...
|
|
Revision tags: v6.0 |
|
| #
30514bd2 |
| 29-Sep-2022 |
Hugh Dickins <[email protected]> |
sbitmap: fix lockup while swapping
Commit 4acb83417cad ("sbitmap: fix batched wait_cnt accounting") is a big improvement: without it, I had to revert to before commit 040b83fcecfb ("sbitmap: fix pos
sbitmap: fix lockup while swapping
Commit 4acb83417cad ("sbitmap: fix batched wait_cnt accounting") is a big improvement: without it, I had to revert to before commit 040b83fcecfb ("sbitmap: fix possible io hung due to lost wakeup") to avoid the high system time and freezes which that had introduced.
Now okay on the NVME laptop, but 4acb83417cad is a disaster for heavy swapping (kernel builds in low memory) on another: soon locking up in sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling around with waitqueue_active() but wait_cnt 0 . Here is a backtrace, showing the common pattern of outer sbitmap_queue_wake_up() interrupted before setting wait_cnt 0 back to wake_batch (in some cases other CPUs are idle, in other cases they're spinning for a lock in dd_bio_merge()):
sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag < __blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request < scsi_end_request < scsi_io_completion < scsi_finish_command < scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq < __irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt < _raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up < sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag < __blk_mq_free_request < blk_mq_free_request < dd_bio_merge < blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio < __submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct < submit_bio < __swap_writepage < swap_writepage < pageout < shrink_folio_list < evict_folios < lru_gen_shrink_lruvec < shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages < __alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio < do_anonymous_page < __handle_mm_fault < handle_mm_fault < do_user_addr_fault < exc_page_fault < asm_exc_page_fault
See how the process-context sbitmap_queue_wake_up() has been interrupted, after bringing wait_cnt down to 0 (and in this example, after doing its wakeups), before advancing wake_index and refilling wake_cnt: an interrupt-context sbitmap_queue_wake_up() of the same sbq gets stuck.
I have almost no grasp of all the possible sbitmap races, and their consequences: but __sbq_wake_up() can do nothing useful while wait_cnt 0, so it is better if sbq_wake_ptr() skips on to the next ws in that case: which fixes the lockup and shows no adverse consequence for me.
The check for wait_cnt being 0 is obviously racy, and ultimately can lead to lost wakeups: for example, when there is only a single waitqueue with waiters. However, lost wakeups are unlikely to matter in these cases, and a proper fix requires redesign (and benchmarking) of the batched wakeup code: so let's plug the hole with this bandaid for now.
Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Keith Busch <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v6.0-rc7, v6.0-rc6, v6.0-rc5 |
|
| #
4acb8341 |
| 09-Sep-2022 |
Keith Busch <[email protected]> |
sbitmap: fix batched wait_cnt accounting
Batched completions can clear multiple bits, but we're only decrementing the wait_cnt by one each time. This can cause waiters to never be woken, stalling IO
sbitmap: fix batched wait_cnt accounting
Batched completions can clear multiple bits, but we're only decrementing the wait_cnt by one each time. This can cause waiters to never be woken, stalling IO. Use the batched count instead.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679 Signed-off-by: Keith Busch <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
c35227d4 |
| 08-Sep-2022 |
Uros Bizjak <[email protected]> |
sbitmap: Use atomic_long_try_cmpxchg in __sbitmap_queue_get_batch
Use atomic_long_try_cmpxchg instead of atomic_long_cmpxchg (*ptr, old, new) == old in __sbitmap_queue_get_batch. x86 CMPXCHG instruc
sbitmap: Use atomic_long_try_cmpxchg in __sbitmap_queue_get_batch
Use atomic_long_try_cmpxchg instead of atomic_long_cmpxchg (*ptr, old, new) == old in __sbitmap_queue_get_batch. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg).
Also, atomic_long_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg fails, enabling further code simplifications, e.g. an extra memory read can be avoided in the loop.
No functional change intended.
Cc: Jens Axboe <[email protected]> Signed-off-by: Uros Bizjak <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
48c03331 |
| 08-Sep-2022 |
Jan Kara <[email protected]> |
sbitmap: Avoid leaving waitqueue in invalid state in __sbq_wake_up()
When __sbq_wake_up() decrements wait_cnt to 0 but races with someone else waking the waiter on the waitqueue (so the waitqueue be
sbitmap: Avoid leaving waitqueue in invalid state in __sbq_wake_up()
When __sbq_wake_up() decrements wait_cnt to 0 but races with someone else waking the waiter on the waitqueue (so the waitqueue becomes empty), it exits without reseting wait_cnt to wake_batch number. Once wait_cnt is 0, nobody will ever reset the wait_cnt or wake the new waiters resulting in possible deadlocks or busyloops. Fix the problem by making sure we reset wait_cnt even if we didn't wake up anybody in the end.
Fixes: 040b83fcecfb ("sbitmap: fix possible io hung due to lost wakeup") Reported-by: Keith Busch <[email protected]> Signed-off-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v6.0-rc4 |
|
| #
bce1b56c |
| 04-Sep-2022 |
Jens Axboe <[email protected]> |
Revert "sbitmap: fix batched wait_cnt accounting"
This reverts commit 16ede66973c84f890c03584f79158dd5b2d725f5.
This is causing issues with CPU stalls on my test box, revert it for now until we und
Revert "sbitmap: fix batched wait_cnt accounting"
This reverts commit 16ede66973c84f890c03584f79158dd5b2d725f5.
This is causing issues with CPU stalls on my test box, revert it for now until we understand what is going on. It looks like infinite looping off sbitmap_queue_wake_up(), but hard to tell with a lot of CPUs hitting this issue and the console scrolling infinitely.
Link: https://lore.kernel.org/linux-block/[email protected]/ Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v6.0-rc3 |
|
| #
16ede669 |
| 25-Aug-2022 |
Keith Busch <[email protected]> |
sbitmap: fix batched wait_cnt accounting
Batched completions can clear multiple bits, but we're only decrementing the wait_cnt by one each time. This can cause waiters to never be woken, stalling IO
sbitmap: fix batched wait_cnt accounting
Batched completions can clear multiple bits, but we're only decrementing the wait_cnt by one each time. This can cause waiters to never be woken, stalling IO. Use the batched count instead.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679 Signed-off-by: Keith Busch <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
| #
ddbfc34f |
| 26-Aug-2022 |
Liu Song <[email protected]> |
sbitmap: remove unnecessary code in __sbitmap_queue_get_batch
If "nr + nr_tags <= map_depth", then the value of nr_tags will not be greater than map_depth, so no additional comparison is required.
sbitmap: remove unnecessary code in __sbitmap_queue_get_batch
If "nr + nr_tags <= map_depth", then the value of nr_tags will not be greater than map_depth, so no additional comparison is required.
Signed-off-by: Liu Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v6.0-rc2, v6.0-rc1 |
|
| #
040b83fc |
| 03-Aug-2022 |
Yu Kuai <[email protected]> |
sbitmap: fix possible io hung due to lost wakeup
There are two problems can lead to lost wakeup:
1) invalid wakeup on the wrong waitqueue:
For example, 2 * wake_batch tags are put, while only wake
sbitmap: fix possible io hung due to lost wakeup
There are two problems can lead to lost wakeup:
1) invalid wakeup on the wrong waitqueue:
For example, 2 * wake_batch tags are put, while only wake_batch threads are woken:
__sbq_wake_up atomic_cmpxchg -> reset wait_cnt __sbq_wake_up -> decrease wait_cnt ... __sbq_wake_up -> wait_cnt is decreased to 0 again atomic_cmpxchg sbq_index_atomic_inc -> increase wake_index wake_up_nr -> wake up and waitqueue might be empty sbq_index_atomic_inc -> increase again, one waitqueue is skipped wake_up_nr -> invalid wake up because old wakequeue might be empty
To fix the problem, increasing 'wake_index' before resetting 'wait_cnt'.
2) 'wait_cnt' can be decreased while waitqueue is empty
As pointed out by Jan Kara, following race is possible:
CPU1 CPU2 __sbq_wake_up __sbq_wake_up sbq_wake_ptr() sbq_wake_ptr() -> the same wait_cnt = atomic_dec_return() /* decreased to 0 */ sbq_index_atomic_inc() /* move to next waitqueue */ atomic_set() /* reset wait_cnt */ wake_up_nr() /* wake up on the old waitqueue */ wait_cnt = atomic_dec_return() /* * decrease wait_cnt in the old * waitqueue, while it can be * empty. */
Fix the problem by waking up before updating 'wake_index' and 'wait_cnt'.
With this patch, noted that 'wait_cnt' is still decreased in the old empty waitqueue, however, the wakeup is redirected to a active waitqueue, and the extra decrement on the old empty waitqueue is not handled.
Fixes: 88459642cba4 ("blk-mq: abstract tag allocation out into sbitmap library") Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
show more ...
|
|
Revision tags: v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1 |
|
| #
fbb564a5 |
| 05-Jun-2022 |
wuchi <[email protected]> |
lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()
1. Getting next index before continue branch. 2. Checking free bits when setting the target bits. Otherwise, it may reuse the busying bit
lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()
1. Getting next index before continue branch. 2. Checking free bits when setting the target bits. Otherwise, it may reuse the busying bits.
Signed-off-by: wuchi <[email protected]> Reviewed-by: Martin Wilck <[email protected]> Link: https://lore.kernel.org/r/[email protected] Fixes: 9672b0d43782 ("sbitmap: add __sbitmap_queue_get_batch()") Signed-off-by: Jens Axboe <[email protected]>
show more ...
|