|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3 |
|
| #
58143465 |
| 08-Oct-2024 |
Chen Ridong <[email protected]> |
workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048
WQ_MAX_ACTIVE is currently set to 512, which was established approximately 15 years ago. However, with the significant increase in machine sizes and capabilities, the previous limit of 256 concurrent tasks is no longer sufficient. Therefore, increase WQ_MAX_ACTIVE to 2048; WQ_DFL_ACTIVE is now 1024.
Signed-off-by: Chen Ridong <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
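For orientation, a minimal sketch (the queue name is illustrative) of where these constants apply: a max_active of 0 selects the default (WQ_DFL_ACTIVE), and explicit values are clamped to WQ_MAX_ACTIVE.

    #include <linux/workqueue.h>

    static struct workqueue_struct *my_wq;

    static int my_init(void)
    {
            /* max_active == 0 selects the default (WQ_DFL_ACTIVE);
             * explicit values are clamped to WQ_MAX_ACTIVE.
             * "my_wq" is a hypothetical name. */
            my_wq = alloc_workqueue("my_wq", WQ_UNBOUND, 0);
            return my_wq ? 0 : -ENOMEM;
    }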
|
|
Revision tags: v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5 |
|
| #
d156263e |
| 21-Aug-2024 |
Tejun Heo <[email protected]> |
workqueue: Fix another htmldocs build warning
Fix htmldocs build warning introduced by 9b59a85a84dc ("workqueue: Don't call va_start / va_end twice").
Signed-off-by: Tejun Heo <[email protected]> Reported-by: Stephen Rothwell <[email protected]> Cc: Matthew Brost <[email protected]>
|
| #
9b59a85a |
| 20-Aug-2024 |
Matthew Brost <[email protected]> |
workqueue: Don't call va_start / va_end twice
Calling va_start / va_end multiple times is undefined and causes problems with certain compiler/platform combinations.
Change alloc_ordered_workqueue_lockdep_map to a macro and update __alloc_workqueue to take a va_list argument.
Cc: Sergey Senozhatsky <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Lai Jiangshan <[email protected]> Signed-off-by: Matthew Brost <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
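For background (a plain-C sketch, not part of the patch): a va_list may only be traversed once per va_start; a second traversal needs va_copy, which is why the allocation path is refactored to take a single va_list.

    #include <stdarg.h>
    #include <stdio.h>

    /* Hypothetical helper: traverses the caller's va_list twice, legally. */
    static void vlog_twice(const char *fmt, va_list args)
    {
            va_list copy;

            va_copy(copy, args);    /* copy before the first traversal */
            vprintf(fmt, args);     /* first traversal consumes args */
            vprintf(fmt, copy);     /* second traversal uses the copy */
            va_end(copy);
    }

    static void log_twice(const char *fmt, ...)
    {
            va_list args;

            va_start(args, fmt);    /* exactly one va_start ... */
            vlog_twice(fmt, args);
            va_end(args);           /* ... paired with exactly one va_end */
    }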
|
| #
8dffaec3 |
| 19-Aug-2024 |
Tejun Heo <[email protected]> |
workqueue: Fix htmldocs build warning
Fix htmldocs build warning introduced by ec0a7d44b358 ("workqueue: Add interface for user-defined workqueue lockdep map").
Signed-off-by: Tejun Heo <[email protected]> Reported-by: Stephen Rothwell <[email protected]> Cc: Matthew Brost <[email protected]>
|
|
Revision tags: v6.11-rc4, v6.11-rc3 |
|
| #
ec0a7d44 |
| 09-Aug-2024 |
Matthew Brost <[email protected]> |
workqueue: Add interface for user-defined workqueue lockdep map
Add an interface for a user-defined workqueue lockdep map, which is helpful when multiple workqueues are created for the same purpose. This also helps avoid leaking lockdep maps on each workqueue creation.
v2: - Add alloc_workqueue_lockdep_map (Tejun)
v3: - Drop __WQ_USER_OWNED_LOCKDEP (Tejun)
    - static inline alloc_ordered_workqueue_lockdep_map (Tejun)
Cc: Tejun Heo <[email protected]> Cc: Lai Jiangshan <[email protected]> Signed-off-by: Matthew Brost <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
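A hedged usage sketch; the queue names are made up and the exact argument order of alloc_ordered_workqueue_lockdep_map() is an assumption based on this description.

    #include <linux/lockdep.h>
    #include <linux/workqueue.h>

    static struct lock_class_key io_wq_key;
    static struct lockdep_map io_wq_lockdep_map =
            STATIC_LOCKDEP_MAP_INIT("io_wq", &io_wq_key);

    /* Two workqueues created for the same purpose share one
     * caller-provided lockdep map instead of leaking one map each. */
    static struct workqueue_struct *io_wq_a, *io_wq_b;

    static int example_setup(void)
    {
            io_wq_a = alloc_ordered_workqueue_lockdep_map("io_wq_a", 0,
                                                          &io_wq_lockdep_map);
            io_wq_b = alloc_ordered_workqueue_lockdep_map("io_wq_b", 0,
                                                          &io_wq_lockdep_map);
            if (!io_wq_a || !io_wq_b)
                    return -ENOMEM;
            return 0;
    }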
|
|
Revision tags: v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2 |
|
| #
e1b6705b |
| 28-May-2024 |
Yury Norov <[email protected]> |
cpumask: make core headers including cpumask_types.h where possible
Now that cpumask types are split out to a separate smaller header, many frequently included core headers may switch to using it.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yury Norov <[email protected]> Cc: Amit Daniel Kachhap <[email protected]> Cc: Anna-Maria Behnsen <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Kees Cook <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ulf Hansson <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Viresh Kumar <[email protected]> Cc: Yury Norov <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
231035f1 |
| 06-Jun-2024 |
Wenchao Hao <[email protected]> |
workqueue: Increase worker desc's length to 32
Commit 31c89007285d ("workqueue.c: Increase workqueue name length") increased WQ_NAME_LEN from 24 to 32, but forgot to increase WORKER_DESC_LEN, which can cause truncation when a kworker's desc is set from a workqueue_struct's name, in process_one_work() for example.
Fixes: 31c89007285d ("workqueue.c: Increase workqueue name length")
Signed-off-by: Wenchao Hao <[email protected]> CC: Audra Mitchell <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
|
|
Revision tags: v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6 |
|
| #
51da7f68 |
| 22-Apr-2024 |
Tejun Heo <[email protected]> |
workqueue: Use "@..." in function comment to describe variable length argument
Previously, it was using "remaining args" without leading "@" which isn't valid. Let's follow snprintf()'s example and
workqueue: Use "@..." in function comment to describe variable length argument
Previously, it was using "remaining args" without leading "@" which isn't valid. Let's follow snprintf()'s example and use "@...".
Signed-off-by: Tejun Heo <[email protected]> Reported-by: Stephen Rothwell <[email protected]>
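As an illustration of the convention (a generic kernel-doc sketch, not the exact comment touched by this patch):

    /**
     * alloc_workqueue - allocate a workqueue
     * @fmt: printf format for the name of the workqueue
     * @flags: WQ_* flags
     * @max_active: max in-flight work items, 0 for default
     * @...: args for @fmt
     *
     * "@..." documents the variable arguments, mirroring snprintf()'s
     * kernel-doc, instead of an unadorned "remaining args" line.
     */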
|
|
Revision tags: v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2 |
|
| #
474a549f |
| 25-Mar-2024 |
Allen Pais <[email protected]> |
workqueue: Introduce enable_and_queue_work() convenience function
The enable_and_queue_work() function is introduced to streamline the process of enabling and queuing a work item on a specific workqueue. This function combines the functionalities of enable_work() and queue_work() in a single call, providing a concise and convenient API for enabling and queuing work items.
The function accepts a target workqueue and a work item as parameters. It first attempts to enable the work item using enable_work(). A successful enable operation means that the work item was previously disabled and is now marked as eligible for execution. If the enable operation is successful, the work item is then queued on the specified workqueue using queue_work(). The function returns true if the work item was successfully enabled and queued, and false otherwise.
Note: This function may lead to unnecessary spurious wake-ups in cases where the work item is expected to be dormant but enable/disable are called frequently. Spurious wake-ups refer to the condition where worker threads are woken up without actual work to be done. Callers should be aware of this behavior and may need to employ additional synchronization mechanisms to avoid these overheads if such wake-ups are not desired.
This addition aims to enhance code readability and maintainability by providing a unified interface for the common use case of enabling and queuing work items on a workqueue.
tj: Made the function comment more compact.
Signed-off-by: Allen Pais <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
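A minimal usage sketch; the driver structure and names are hypothetical.

    #include <linux/workqueue.h>

    struct my_dev {
            struct workqueue_struct *wq;
            struct work_struct rx_work;
    };

    static void my_reconfigure(struct my_dev *dev)
    {
            /* Block further queueing and wait out a running instance. */
            disable_work_sync(&dev->rx_work);

            /* ... apply the configuration change ... */

            /* Drop the disable count and, if the work item became
             * eligible again, queue it on dev->wq in the same call. */
            enable_and_queue_work(dev->wq, &dev->rx_work);
    }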
|
|
Revision tags: v6.9-rc1, v6.8 |
|
| #
ae1296a7 |
| 08-Mar-2024 |
Lai Jiangshan <[email protected]> |
workqueue: Move attrs->cpumask out of worker_pool's properties when attrs->affn_strict
Allow more pools to be shared when attrs->affn_strict is set.
Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
|
| #
456a78ee |
| 25-Mar-2024 |
Tejun Heo <[email protected]> |
workqueue: Remember whether a work item was on a BH workqueue
Add an off-queue flag, WORK_OFFQ_BH, that indicates whether the last workqueue the work item was on was a BH one. This will be used to test whether a work item is a BH one in the cancel_sync path, to implement atomic cancel_sync'ing for BH work items.
Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: Lai Jiangshan <[email protected]>
|
| #
f09b10b6 |
| 25-Mar-2024 |
Tejun Heo <[email protected]> |
workqueue: Remove WORK_OFFQ_CANCELING
cancel[_delayed]_work_sync() guarantees that it can shut down self-requeueing work items. To achieve that, it grabs and then holds WORK_STRUCT_PENDING bit set while flushing the currently executing instance. As the PENDING bit is set, all queueing attempts including the self-requeueing ones fail and once the currently executing instance is flushed, the work item should be idle as long as someone else isn't actively queueing it.
This means that the cancel_work_sync path may hold the PENDING bit set while flushing the target work item. This isn't a problem for the queueing path - it can just fail which is the desired effect. It doesn't affect flush. It doesn't matter to cancel_work either as it can just report that the work item has successfully canceled. However, if there's another cancel_work_sync attempt on the work item, it can't simply fail or report success and that would breach the guarantee that it should provide. cancel_work_sync has to wait for and grab that PENDING bit and go through the motions.
WORK_OFFQ_CANCELING and wq_cancel_waitq are what implement this cancel_work_sync to cancel_work_sync wait mechanism. When a work item is being canceled, WORK_OFFQ_CANCELING is also set on it and other cancel_work_sync attempts wait on the bit to be cleared using the wait queue.
While this works, it's an isolated wart which doesn't jibe with the rest of the flush and cancel mechanisms and forces enable_work() and disable_work() to require a sleepable context, which hampers their usability.
Now that a work item can be disabled, we can use that to block queueing while cancel_work_sync is in progress. Instead of holding the PENDING bit, it can temporarily disable the work item, flush and then re-enable it, as that achieves the same end result of blocking queueing while canceling and thus enables canceling of self-requeueing work items.
- WORK_OFFQ_CANCELING and the surrounding mechanisms are removed.
- work_grab_pending() is now simpler, no longer has to wait for a blocking operation and thus can be called from any context.
- With work_grab_pending() simplified, no need to use try_to_grab_pending() directly. All users are converted to use work_grab_pending().
- __cancel_work_sync() is updated to __cancel_work() with WORK_CANCEL_DISABLE to cancel and plug racing queueing attempts. It then flushes and re-enables the work item if necessary.
- These changes allow disable_work() and enable_work() to be called from any context.
v2: Lai pointed out that mod_delayed_work_on() needs to check the disable count before queueing the delayed work item. Added clear_pending_if_disabled() call.
Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: Lai Jiangshan <[email protected]>
|
| #
86898fa6 |
| 25-Mar-2024 |
Tejun Heo <[email protected]> |
workqueue: Implement disable/enable for (delayed) work items
While (delayed) work items could be flushed and canceled, there was no way to prevent them from being queued in the future. While this didn't lead to functional deficiencies, it sometimes required a bit more effort from the workqueue users to e.g. sequence shutdown steps with more care.
Workqueue is currently in the process of replacing tasklet which does support disabling and enabling. The feature is used relatively widely to, for example, temporarily suppress the main path while a control plane operation (reset or config change) is in progress.
To enable easy conversion of tasklet users and as it seems like an inherently useful feature, this patch implements disabling and enabling of work items.
- A work item carries a 16-bit disable count in work->data while not queued. The access to the count is synchronized by the PENDING bit like all other parts of work->data.
- If the count is non-zero, the work item cannot be queued. Any attempt to queue the work item fails and returns %false.
- disable_work[_sync](), enable_work(), disable_delayed_work[_sync]() and enable_delayed_work() are added.
v3: enable_work() was using local_irq_enable() instead of local_irq_restore() to undo IRQ-disable by work_grab_pending(). This is awkward now and will become incorrect as enable_work() will later be used from IRQ context too. (Lai)
v2: Lai noticed that queue_work_node() wasn't checking the disable count. Fixed. queue_rcu_work() is updated to trigger warning if the inner work item is disabled.
Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: Lai Jiangshan <[email protected]>
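A hedged sketch of the new interface around a pause/resume path; the polling structure and names are hypothetical.

    #include <linux/workqueue.h>

    static void poll_fn(struct work_struct *w);
    static DECLARE_DELAYED_WORK(poll_dwork, poll_fn);

    static void poll_fn(struct work_struct *w)
    {
            /* ... poll the hardware, then re-arm ... */
            queue_delayed_work(system_wq, &poll_dwork, HZ);
    }

    static void pause_polling(void)
    {
            /* Bumps the disable count and cancels a pending instance;
             * the _sync variant also waits for a running callback.
             * While disabled, the self-requeue in poll_fn() simply fails. */
            disable_delayed_work_sync(&poll_dwork);
    }

    static void resume_polling(void)
    {
            /* Drops the disable count; once it reaches zero the work
             * item can be queued again. */
            if (enable_delayed_work(&poll_dwork))
                    queue_delayed_work(system_wq, &poll_dwork, HZ);
    }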
|
| #
1211f3b2 |
| 25-Mar-2024 |
Tejun Heo <[email protected]> |
workqueue: Preserve OFFQ bits in cancel[_sync] paths
The cancel[_sync] paths acquire and release WORK_STRUCT_PENDING, and manipulate WORK_OFFQ_CANCELING. However, they assume that all the OFFQ bit values except for the pool ID are statically known and don't preserve them, which is not wrong in the current code as the pool ID and CANCELING are the only information carried. However, the planned disable/enable support will add more fields and need them to be preserved.
This patch updates work data handling so that only the bits which need updating are updated.
- struct work_offq_data is added along with work_offqd_unpack() and work_offqd_pack_flags() to help manipulate multiple fields contained in work->data. Note that the helpers look a bit silly right now as there isn't that much to pack. The next patch will add more.
- mark_work_canceling() which is used only by __cancel_work_sync() is replaced by open-coded usage of work_offq_data and set_work_pool_and_keep_pending() in __cancel_work_sync().
- __cancel_work[_sync]() uses offq_data helpers to preserve other OFFQ bits when clearing WORK_STRUCT_PENDING and WORK_OFFQ_CANCELING at the end.
- This removes all users of get_work_pool_id() which is dropped. Note that get_work_pool_id() could handle both WORK_STRUCT_PWQ and !WORK_STRUCT_PWQ cases; however, it was only being called after try_to_grab_pending() succeeded, in which case WORK_STRUCT_PWQ is never set and thus it's safe to use work_offqd_unpack() instead.
No behavior changes intended.
Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: Lai Jiangshan <[email protected]>
|
|
Revision tags: v6.8-rc7 |
|
| #
1acd92d9 |
| 27-Feb-2024 |
Tejun Heo <[email protected]> |
workqueue: Drain BH work items on hot-unplugged CPUs
Boqun pointed out that workqueues aren't handling BH work items on offlined CPUs. Unlike tasklet which transfers out the pending tasks from CPUHP_SOFTIRQ_DEAD, BH workqueue would just leave them pending which is problematic. Note that this behavior is specific to BH workqueues as the non-BH per-CPU workers just become unbound when the CPU goes offline.
This patch fixes the issue by draining the pending BH work items of an offlined CPU from CPUHP_SOFTIRQ_DEAD. Because work items carry more context, it's not as easy to transfer the pending work items from one pool to another. Instead, the pending BH work items of the offlined pools are executed on an online CPU.
Note that this assumes that no further BH work items will be queued on the offlined CPUs. This assumption is shared with tasklet and should be fine for conversions. However, this issue also exists for per-CPU workqueues which will just keep executing work items queued after CPU offline on unbound workers and workqueue should reject per-CPU and BH work items queued on offline CPUs. This will be addressed separately later.
Signed-off-by: Tejun Heo <[email protected]> Reported-and-reviewed-by: Boqun Feng <[email protected]> Link: http://lkml.kernel.org/r/Zdvw0HdSXcU3JZ4g@boqun-archlinux
|
| #
60b2ebf4 |
| 27-Feb-2024 |
Allen Pais <[email protected]> |
workqueue: Introduce from_work() helper for cleaner callback declarations
To streamline the transition from tasklets to workqueues, a new helper function, from_work(), is introduced. This helper, inspired by existing from_() patterns, utilizes container_of() and eliminates the redundancy of declaring variable types, leading to more concise and readable code.
The modified code snippet demonstrates the enhanced clarity achieved with from_work():
    void callback(struct work_struct *w)
    {
    -       struct some_data_structure *local = container_of(w, struct some_data_structure, work);
    +       struct some_data_structure *local = from_work(local, w, work);
This change aims to facilitate a smoother transition and uphold code quality standards.
Based on: git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git disable_work-v3
Signed-off-by: Allen Pais <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
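A self-contained sketch of the pattern; the struct and function names are hypothetical.

    #include <linux/printk.h>
    #include <linux/workqueue.h>

    struct frame_ctx {
            struct work_struct work;
            int frame_id;
    };

    static void frame_work_fn(struct work_struct *w)
    {
            /* from_work(var, callback_work, field) wraps container_of()
             * and reuses var's type, so the struct name isn't repeated. */
            struct frame_ctx *ctx = from_work(ctx, w, work);

            pr_info("processing frame %d\n", ctx->frame_id);
    }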
|
|
Revision tags: v6.8-rc6 |
|
| #
e9a8e01f |
| 21-Feb-2024 |
Tejun Heo <[email protected]> |
workqueue: Clean up enum work_bits and related constants
The bits of work->data are used for a few different purposes. How the bits are used is determined by enum work_bits. The planned disable/enable support will add another use, so let's clean it up a bit in preparation.
- Let WORK_STRUCT_*_BIT's values be determined by enum definition order.
- Delimit the different bit sections the same way using SHIFT and BITS values.
- Rename __WORK_OFFQ_CANCELING to WORK_OFFQ_CANCELING_BIT for consistency.
- Introduce WORK_STRUCT_PWQ_SHIFT and replace WORK_STRUCT_FLAG_MASK and WORK_STRUCT_WQ_DATA_MASK with WORK_STRUCT_PWQ_MASK for clarity.
- Improve documentation.
No functional changes.
Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: Lai Jiangshan <[email protected]>
|
|
Revision tags: v6.8-rc5, v6.8-rc4 |
|
| #
8f172181 |
| 09-Feb-2024 |
Tejun Heo <[email protected]> |
workqueue: Implement workqueue_set_min_active()
Since 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues"), unbound workqueues have a separate min_active, which sets the number of interdependent work items that can be handled. This value is currently initialized to WQ_DFL_MIN_ACTIVE, which is 8. This isn't high enough for some users, so add an interface to adjust the setting.
Signed-off-by: Tejun Heo <[email protected]>
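A hedged usage sketch; the queue name and the value 16 are illustrative, and the prototype is assumed from this description.

    #include <linux/workqueue.h>

    static struct workqueue_struct *xfer_wq;

    static int xfer_init(void)
    {
            xfer_wq = alloc_workqueue("xfer_wq", WQ_UNBOUND, 0);
            if (!xfer_wq)
                    return -ENOMEM;

            /* Guarantee forward progress for chains of up to 16
             * interdependent work items per node instead of the
             * default WQ_DFL_MIN_ACTIVE (8). */
            workqueue_set_min_active(xfer_wq, 16);
            return 0;
    }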
|
| #
3bc1e711 |
| 06-Feb-2024 |
Tejun Heo <[email protected]> |
workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered
5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") automoatically promoted UNBOUND workqueues w/ @max_active==1 to ordered workqueues because UNBOUND workqueues w/ @max_active==1 used to be the way to create ordered workqueues and the new NUMA support broke it. These problems can be subtle and the fact that they can only trigger on NUMA machines made them even more difficult to debug.
However, overloading the UNBOUND allocation interface this way creates other issues. It's difficult to tell whether a given workqueue actually needs to be ordered and users that legitimately want a min concurrency level wq unexpectedly gets an ordered one instead. With planned UNBOUND workqueue udpates to improve execution locality and more prevalence of chiplet designs which can benefit from such improvements, this isn't a state we wanna be in forever.
There aren't that many UNBOUND w/ @max_active==1 users in the tree and the preceding patches audited all and converted them to alloc_ordered_workqueue() as appropriate. This patch removes the implicit promotion of UNBOUND w/ @max_active==1 workqueues to ordered ones.
v2: v1 patch incorrectly dropped !list_empty(&wq->pwqs) condition in apply_workqueue_attrs_locked() which spuriously triggers WARNING and fails workqueue creation. Fix it.
Signed-off-by: Tejun Heo <[email protected]> Reported-by: kernel test robot <[email protected]> Link: https://lore.kernel.org/oe-lkp/[email protected]
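A sketch of the now-explicit choice (queue names are illustrative): callers that need ordering ask for it, while @max_active==1 only limits concurrency.

    #include <linux/workqueue.h>

    /* Strict FIFO, one work item at a time: request it explicitly. */
    static struct workqueue_struct *ordered_wq;

    /* Unbound with a genuine concurrency limit of 1; no ordering
     * guarantee is implied anymore. */
    static struct workqueue_struct *limited_wq;

    static int example_init(void)
    {
            ordered_wq = alloc_ordered_workqueue("ordered_wq", 0);
            limited_wq = alloc_workqueue("limited_wq", WQ_UNBOUND, 1);
            if (!ordered_wq || !limited_wq)
                    return -ENOMEM;
            return 0;
    }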
|
| #
4cb1ef64 |
| 04-Feb-2024 |
Tejun Heo <[email protected]> |
workqueue: Implement BH workqueues to eventually replace tasklets
The only generic interface to execute asynchronously in the BH context is tasklet; however, it's marked deprecated and has some design flaws such as the execution code accessing the tasklet item after the execution is complete which can lead to subtle use-after-free in certain usage scenarios and less-developed flush and cancel mechanisms.
This patch implements BH workqueues which share the same semantics and features of regular workqueues but execute their work items in the softirq context. As there is always only one BH execution context per CPU, none of the concurrency management mechanisms applies and a BH workqueue can be thought of as a convenience wrapper around softirq.
Except for the inability to sleep while executing and lack of max_active adjustments, BH workqueues and work items should behave the same as regular workqueues and work items.
Currently, the execution is hooked to tasklet[_hi]. However, the goal is to convert all tasklet users over to BH workqueues. Once the conversion is complete, tasklet can be removed and BH workqueues can directly take over the tasklet softirqs.
system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in tasklet, all existing tasklet users should be able to use the system BH workqueues without creating their own workqueues.
v3: - Add missing interrupt.h include.
v2: - Instead of using tasklets, hook directly into its softirq action functions - tasklet[_hi]_action(). This is slightly cheaper and closer to the eventual code structure we want to arrive at. Suggested by Lai.
- Lai also pointed out several places which need NULL worker->task handling or can use clarification. Updated.
Signed-off-by: Tejun Heo <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com Tested-by: Allen Pais <[email protected]> Reviewed-by: Lai Jiangshan <[email protected]>
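A hedged sketch of queueing onto the system BH workqueue; the work function is hypothetical and must not sleep since it runs in softirq context.

    #include <linux/workqueue.h>

    static void rx_complete_fn(struct work_struct *w)
    {
            /* Runs in softirq (BH) context: no sleeping, keep it short. */
    }

    static DECLARE_WORK(rx_complete_work, rx_complete_fn);

    static void rx_irq_tail(void)
    {
            /* Roughly where tasklet_schedule() would have been used. */
            queue_work(system_bh_wq, &rx_complete_work);
    }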
|
|
Revision tags: v6.8-rc3 |
|
| #
5797b1c1 |
| 29-Jan-2024 |
Tejun Heo <[email protected]> |
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a worker_pool. One of the roles that a pwq plays is enforcement of the max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU for per-cpu workqueues and per each NUMA node for unbound workqueues, which was a natural result of per-cpu workqueues being served by per-cpu pools and unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable. For per-cpu workqueues, it was fine. For unbound, it wasn't great in that NUMA machines would get max_active that's multiplied by the number of nodes but didn't cause huge problems because NUMA machines are relatively rare and the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across a whole node didn't really work well for unbound workqueues. Thus, a series of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") implemented more flexible affinity mechanism for unbound workqueues which enables using e.g. last-level-cache aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") made unbound workqueues use per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this came with the side effect of blowing up the effective max_active for unbound workqueues. Before, the effective max_active for unbound workqueues was multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") claims that this should generally be okay. It is okay for users which self-regulates concurrency level which are the vast majority; however, there are enough use cases which actually depend on max_active to prevent the level of concurrency from going bonkers including several IO handling workqueues that can issue a work item for each in-flight IO. With targeted benchmarks, the misbehavior can easily be exposed as reported in http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want to set max_active too low but as soon as we increase max_active a bit, we can end up with unreasonable number of in-flight work items when many CPUs issue IOs at the same time. ie. The acceptable lowest max_active is higher than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that the users can regulate the total level of concurrency regardless of node and cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement decouples from pool boundaries, chaining execution after a work item finishes requires inter-pool operations, which would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from pool boundaries. This patch implements system-wide nr_active mechanism with the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured max_active is split across nodes according to the proportion of each workqueue's online effective CPUs per node. e.g. A node with twice more online effective CPUs will get twice higher portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items which is as long as max_active. We can't do this anymore as max_active is distributed across the nodes. Instead, a new parameter min_active is introduced which determines the minimum level of concurrency within a node regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8. This can lead to higher effective max_weight than configured and also deadlocks if a workqueue was depending on being able to handle chains of interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA node is usually higher than 8 and work item chain longer than 8 is pretty unlikely. However, if these assumptions turn out to be wrong, we'll need to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks per-node nr_active. When its pwq wants to run a work item, it has to obtain the matching node's nr_active. If over the node's max_active, the pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish, the completion path round-robins the pending pwqs activating the first inactive work item of each, which involves some pool lock dancing and kicking other pools. It's not the simplest code but doesn't look too bad.
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly based on system-wide CPU online states. Lai pointed out that this can lead to skewed distributions for workqueues with restricted cpumasks. Update the max_active distribution to use per-workqueue effective online CPU counts instead of system-wide and cache the calculation results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <[email protected]> Reported-by: Naohiro Aota <[email protected]> Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3 Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") Reviewed-by: Lai Jiangshan <[email protected]>
|
|
Revision tags: v6.8-rc2 |
|
| #
e563d0a7 |
| 26-Jan-2024 |
Tejun Heo <[email protected]> |
workqueue: Break up enum definitions and give names to the types
workqueue is collecting different sorts of enums into a single unnamed enum type which can increase confusion around enum width. Also, unnamed enums can't be accessed from BPF. Let's break up enum definitions according to their purposes and give them type names.
Signed-off-by: Tejun Heo <[email protected]>
|
|
Revision tags: v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6 |
|
| #
b2fa8443 |
| 11-Dec-2023 |
Kent Overstreet <[email protected]> |
workqueue: Split out workqueue_types.h
More sched.h dependency culling - this lets us kill a rhashtable-types.h dependency on workqueue.h.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Revision tags: v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6 |
|
| #
fe28f631 |
| 25-Oct-2023 |
Waiman Long <[email protected]> |
workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask
When the "isolcpus" boot command line option is used to add a set of isolated CPUs, those CPUs will be excluded automatically from wq_unbound_cpumask to avoid running work functions from unbound workqueues.
Recently cpuset has been extended to allow the creation of partitions of isolated CPUs dynamically. To make it closer to the "isolcpus" in functionality, the CPUs in those isolated cpuset partitions should be excluded from wq_unbound_cpumask as well. This can be done currently by explicitly writing to the workqueue's cpumask sysfs file after creating the isolated partitions. However, this process can be error prone.
Ideally, the cpuset code should be allowed to request the workqueue code to exclude those isolated CPUs from wq_unbound_cpumask so that this operation can be done automatically and the isolated CPUs will be returned back to wq_unbound_cpumask after the destructions of the isolated cpuset partitions.
This patch adds a new workqueue_unbound_exclude_cpumask() function to enable that. This new function will exclude the specified isolated CPUs from wq_unbound_cpumask. To be able to restore those isolated CPUs back after the destruction of isolated cpuset partitions, a new wq_requested_unbound_cpumask is added to store the user provided unbound cpumask either from the boot command line options or from writing to the cpumask sysfs file. This new cpumask provides the basis for CPU exclusion.
To enable users to understand how the wq_unbound_cpumask is being modified internally, this patch also exposes the newly introduced wq_requested_unbound_cpumask as well as a wq_isolated_cpumask to store the cpumask to be excluded from wq_unbound_cpumask as read-only sysfs files.
Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
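A hedged sketch of the intended cpuset-side caller; the wrapper is hypothetical and the exact prototype is an assumption based on this description.

    #include <linux/cpumask.h>
    #include <linux/workqueue.h>

    /* Called when an isolated cpuset partition changes. Passing the
     * currently isolated CPUs excludes them from wq_unbound_cpumask;
     * passing an empty mask restores the user-requested unbound
     * cpumask recorded in wq_requested_unbound_cpumask. */
    static int update_wq_isolation(cpumask_var_t isolated)
    {
            return workqueue_unbound_exclude_cpumask(isolated);
    }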
|
|
Revision tags: v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3 |
|
| #
265f3ed0 |
| 24-Sep-2023 |
Frederic Weisbecker <[email protected]> |
workqueue: Provide one lock class key per work_on_cpu() callsite
All callers of work_on_cpu() share the same lock class key for all the functions queued. As a result the workqueue related locking scenario for a function A may be spuriously accounted as an inversion against the locking scenario of function B such as in the following model:
    long A(void *arg)
    {
            mutex_lock(&mutex);
            mutex_unlock(&mutex);
    }

    long B(void *arg)
    {
    }

    void launchA(void)
    {
            work_on_cpu(0, A, NULL);
    }

    void launchB(void)
    {
            mutex_lock(&mutex);
            work_on_cpu(1, B, NULL);
            mutex_unlock(&mutex);
    }
launchA and launchB running concurrently have no chance to deadlock. However the above can be reported by lockdep as a possible locking inversion because the works containing A() and B() are treated as belonging to the same locking class.
The following shows an existing example of such a spurious lockdep splat:
====================================================== WARNING: possible circular locking dependency detected 6.6.0-rc1-00065-g934ebd6e5359 #35409 Not tainted ------------------------------------------------------ kworker/0:1/9 is trying to acquire lock: ffffffff9bc72f30 (cpu_hotplug_lock){++++}-{0:0}, at: _cpu_down+0x57/0x2b0
but task is already holding lock: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 ((work_completion)(&wfc.work)){+.+.}-{0:0}: __flush_work+0x83/0x4e0 work_on_cpu+0x97/0xc0 rcu_nocb_cpu_offload+0x62/0xb0 rcu_nocb_toggle+0xd0/0x1d0 kthread+0xe6/0x120 ret_from_fork+0x2f/0x40 ret_from_fork_asm+0x1b/0x30
-> #1 (rcu_state.barrier_mutex){+.+.}-{3:3}: __mutex_lock+0x81/0xc80 rcu_nocb_cpu_deoffload+0x38/0xb0 rcu_nocb_toggle+0x144/0x1d0 kthread+0xe6/0x120 ret_from_fork+0x2f/0x40 ret_from_fork_asm+0x1b/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}: __lock_acquire+0x1538/0x2500 lock_acquire+0xbf/0x2a0 percpu_down_write+0x31/0x200 _cpu_down+0x57/0x2b0 __cpu_down_maps_locked+0x10/0x20 work_for_cpu_fn+0x15/0x20 process_scheduled_works+0x2a7/0x500 worker_thread+0x173/0x330 kthread+0xe6/0x120 ret_from_fork+0x2f/0x40 ret_from_fork_asm+0x1b/0x30
other info that might help us debug this:
Chain exists of: cpu_hotplug_lock --> rcu_state.barrier_mutex --> (work_completion)(&wfc.work)
Possible unsafe locking scenario:
CPU0 CPU1 ---- ---- lock((work_completion)(&wfc.work)); lock(rcu_state.barrier_mutex); lock((work_completion)(&wfc.work)); lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by kworker/0:1/9: #0: ffff900481068b38 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0x212/0x500 #1: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
stack backtrace: CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc1-00065-g934ebd6e5359 #35409 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 Workqueue: events work_for_cpu_fn Call Trace: rcu-torture: rcu_torture_read_exit: Start of episode <TASK> dump_stack_lvl+0x4a/0x80 check_noncircular+0x132/0x150 __lock_acquire+0x1538/0x2500 lock_acquire+0xbf/0x2a0 ? _cpu_down+0x57/0x2b0 percpu_down_write+0x31/0x200 ? _cpu_down+0x57/0x2b0 _cpu_down+0x57/0x2b0 __cpu_down_maps_locked+0x10/0x20 work_for_cpu_fn+0x15/0x20 process_scheduled_works+0x2a7/0x500 worker_thread+0x173/0x330 ? __pfx_worker_thread+0x10/0x10 kthread+0xe6/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2f/0x40 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK
Fix this by providing one lock class key per work_on_cpu() caller.
Reported-and-tested-by: Paul E. McKenney <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
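For reference, a minimal sketch of the work_on_cpu() call pattern this fix applies to; the function names are hypothetical.

    #include <linux/workqueue.h>

    static long probe_on_node_cpu(void *arg)
    {
            /* Runs synchronously in a kworker pinned to the chosen CPU;
             * with this fix, each call site gets its own lockdep key
             * instead of sharing one across all work_on_cpu() users. */
            return 0;
    }

    static long my_probe(int cpu)
    {
            return work_on_cpu(cpu, probe_on_node_cpu, NULL);
    }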
|