=========
Workqueue
=========

:Date: September, 2010
:Author: Tejun Heo <[email protected]>
:Author: Florian Mickler <[email protected]>


Introduction
============

There are many cases where an asynchronous process execution context
is needed and the workqueue (wq) API is the most commonly used
mechanism for such cases.

When such an asynchronous execution context is needed, a work item
describing which function to execute is put on a queue.  An
independent thread serves as the asynchronous execution context.  The
queue is called workqueue and the thread is called worker.

While there are work items on the workqueue the worker executes the
functions associated with the work items one after the other.  When
there is no work item left on the workqueue the worker becomes idle.
When a new work item gets queued, the worker begins executing again.


Why Concurrency Managed Workqueue?
==================================

In the original wq implementation, a multi threaded (MT) wq had one
worker thread per CPU and a single threaded (ST) wq had one worker
thread system-wide.  A single MT wq needed to keep around the same
number of workers as the number of CPUs.  The kernel grew a lot of MT
wq users over the years and with the number of CPU cores continuously
rising, some systems saturated the default 32k PID space just booting
up.

Although MT wq wasted a lot of resources, the level of concurrency
provided was unsatisfactory.  The limitation was common to both ST and
MT wq albeit less severe on MT.  Each wq maintained its own separate
worker pool.  An MT wq could provide only one execution context per CPU
while an ST wq one for the whole system.  Work items had to compete for
those very limited execution contexts leading to various problems
including proneness to deadlocks around the single execution context.

The tension between the provided level of concurrency and resource
usage also forced its users to make unnecessary tradeoffs like libata
choosing to use ST wq for polling PIOs and accepting an unnecessary
limitation that no two polling PIOs can progress at the same time.  As
MT wq don't provide much better concurrency, users that require a
higher level of concurrency, like async or fscache, had to implement
their own thread pool.

Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with
focus on the following goals.

* Maintain compatibility with the original workqueue API.

* Use per-CPU unified worker pools shared by all wq to provide a
  flexible level of concurrency on demand without wasting a lot of
  resources.

* Automatically regulate worker pool and level of concurrency so that
  the API users don't need to worry about such details.


The Design
==========

In order to ease the asynchronous execution of functions a new
abstraction, the work item, is introduced.

A work item is a simple struct that holds a pointer to the function
that is to be executed asynchronously.  Whenever a driver or subsystem
wants a function to be executed asynchronously it has to set up a work
item pointing to that function and queue that work item on a
workqueue.

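As a brief sketch of this pattern (the function and work item names here
are purely illustrative), a driver can define a work item and hand it to
a workqueue::

  #include <linux/workqueue.h>

  /* The function that will be executed asynchronously by a worker. */
  static void my_work_fn(struct work_struct *work)
  {
          pr_info("my_work executed\n");
  }

  /* A work item pointing to my_work_fn. */
  static DECLARE_WORK(my_work, my_work_fn);

  /* Queue it; schedule_work() uses the system workqueue. */
  schedule_work(&my_work);
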
A work item can be executed in either a thread or the BH (softirq) context.

For threaded workqueues, special purpose threads, called [k]workers, execute
the functions off of the queue, one after the other. If no work is queued,
the worker threads become idle. These worker threads are managed in
worker-pools.

The cmwq design differentiates between the user-facing workqueues that
subsystems and drivers queue work items on and the backend mechanism
which manages worker-pools and processes the queued work items.

There are two worker-pools, one for normal work items and the other
for high priority ones, for each possible CPU and some extra
worker-pools to serve work items queued on unbound workqueues - the
number of these backing pools is dynamic.

BH workqueues use the same framework. However, as there can only be one
concurrent execution context, there's no need to worry about concurrency.
Each per-CPU BH worker pool contains only one pseudo worker which represents
the BH execution context. A BH workqueue can be considered a convenience
interface to softirq.

Subsystems and drivers can create and queue work items through special
workqueue API functions as they see fit. They can influence some
aspects of the way the work items are executed by setting flags on the
workqueue they are putting the work item on. These flags include
things like CPU locality, concurrency limits, priority and more.  To
get a detailed overview refer to the API description of
``alloc_workqueue()`` below.

When a work item is queued to a workqueue, the target worker-pool is
determined according to the queue parameters and workqueue attributes
and appended on the shared worklist of the worker-pool.  For example,
unless specifically overridden, a work item of a bound workqueue will
be queued on the worklist of either the normal or highpri worker-pool
that is associated with the CPU the issuer is running on.

For any thread pool implementation, managing the concurrency level
(how many execution contexts are active) is an important issue.  cmwq
tries to keep the concurrency at a minimal but sufficient level.
Minimal to save resources and sufficient in that the system is used at
its full capacity.

Each worker-pool bound to an actual CPU implements concurrency
management by hooking into the scheduler.  The worker-pool is notified
whenever an active worker wakes up or sleeps and keeps track of the
number of currently runnable workers.  Generally, work items are
not expected to hog a CPU and consume many cycles.  That means
maintaining just enough concurrency to prevent work processing from
stalling should be optimal.  As long as there are one or more runnable
workers on the CPU, the worker-pool doesn't start execution of a new
work, but, when the last running worker goes to sleep, it immediately
schedules a new worker so that the CPU doesn't sit idle while there
are pending work items.  This allows using a minimal number of workers
without losing execution bandwidth.

Keeping idle workers around doesn't cost anything other than the memory
space for kthreads, so cmwq holds onto idle ones for a while before
killing them.

For unbound workqueues, the number of backing pools is dynamic.
Unbound workqueues can be assigned custom attributes using
``apply_workqueue_attrs()`` and workqueue will automatically create
backing worker pools matching the attributes.  The responsibility of
regulating the concurrency level is on the users.  There is also a flag
to mark a bound wq to ignore the concurrency management.  Please refer
to the API section for details.

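As a minimal sketch (assuming a recent kernel where
``alloc_workqueue_attrs()`` takes no arguments; the nice value chosen is
arbitrary), custom attributes can be applied like this::

  struct workqueue_attrs *attrs;
  int ret;

  attrs = alloc_workqueue_attrs();
  if (!attrs)
          return -ENOMEM;

  attrs->nice = -10;      /* run the backing workers at an elevated priority */
  ret = apply_workqueue_attrs(unbound_wq, attrs);
  free_workqueue_attrs(attrs);
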
Forward progress guarantee relies on workers being creatable when
more execution contexts are necessary, which in turn is guaranteed
through the use of rescue workers.  All work items which might be used
on code paths that handle memory reclaim are required to be queued on
wq's that have a rescue-worker reserved for execution under memory
pressure.  Else it is possible that the worker-pool deadlocks waiting
for execution contexts to free up.


Application Programming Interface (API)
=======================================

``alloc_workqueue()`` allocates a wq.  The original
``create_*workqueue()`` functions are deprecated and scheduled for
removal.  ``alloc_workqueue()`` takes three arguments - ``@name``,
``@flags`` and ``@max_active``.  ``@name`` is the name of the wq and
is also used as the name of the rescuer thread if there is one.

A wq no longer manages execution resources but serves as a domain for
forward progress guarantee, flush and work item attributes. ``@flags``
and ``@max_active`` control how work items are assigned execution
resources, scheduled and executed.

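For example (the workqueue name and flag combination here are purely
illustrative), allocation and teardown look like::

  struct workqueue_struct *my_wq;

  /* An unbound, freezable wq using the default @max_active (0). */
  my_wq = alloc_workqueue("my_wq", WQ_UNBOUND | WQ_FREEZABLE, 0);
  if (!my_wq)
          return -ENOMEM;

  /* ... INIT_WORK() and queue_work() on my_wq ... */

  destroy_workqueue(my_wq);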

``flags``
---------

``WQ_BH``
  BH workqueues can be considered a convenience interface to softirq. BH
  workqueues are always per-CPU and all BH work items are executed in the
  queueing CPU's softirq context in the queueing order.

  All BH workqueues must have 0 ``max_active`` and ``WQ_HIGHPRI`` is the
  only allowed additional flag.

  BH work items cannot sleep. All other features such as delayed queueing,
  flushing and canceling are supported.

``WQ_UNBOUND``
  Work items queued to an unbound wq are served by the special
  worker-pools which host workers which are not bound to any
  specific CPU.  This makes the wq behave as a simple execution
  context provider without concurrency management.  The unbound
  worker-pools try to start execution of work items as soon as
  possible.  Unbound wq sacrifices locality but is useful for
  the following cases.

  * Wide fluctuation in the concurrency level requirement is
    expected and using a bound wq may end up creating a large number
    of mostly unused workers across different CPUs as the issuer
    hops through different CPUs.

  * Long running CPU intensive workloads which can be better
    managed by the system scheduler.

``WQ_FREEZABLE``
  A freezable wq participates in the freeze phase of the system
  suspend operations.  Work items on the wq are drained and no
  new work item starts execution until thawed.

``WQ_MEM_RECLAIM``
  All wq which might be used in the memory reclaim paths **MUST**
  have this flag set.  The wq is guaranteed to have at least one
  execution context regardless of memory pressure.

``WQ_HIGHPRI``
  Work items of a highpri wq are queued to the highpri
  worker-pool of the target cpu.  Highpri worker-pools are
  served by worker threads with elevated nice level.

  Note that normal and highpri worker-pools don't interact with
  each other.  Each maintains its separate pool of workers and
  implements concurrency management among its workers.

``WQ_CPU_INTENSIVE``
  Work items of a CPU intensive wq do not contribute to the
  concurrency level.  In other words, runnable CPU intensive
  work items will not prevent other work items in the same
  worker-pool from starting execution.  This is useful for bound
  work items which are expected to hog CPU cycles so that their
  execution is regulated by the system scheduler.

  Although CPU intensive work items don't contribute to the
  concurrency level, start of their executions is still
  regulated by the concurrency management and runnable
  non-CPU-intensive work items can delay execution of CPU
  intensive work items.

  This flag is meaningless for unbound wq.


``max_active``
--------------

``@max_active`` determines the maximum number of execution contexts per
CPU which can be assigned to the work items of a wq. For example, with
``@max_active`` of 16, at most 16 work items of the wq can be executing
at the same time per CPU. This is always a per-CPU attribute, even for
unbound workqueues.

The maximum limit for ``@max_active`` is 2048 and the default value used
when 0 is specified is 1024. These values are chosen sufficiently high
such that they are not the limiting factor while providing protection in
runaway cases.

The number of active work items of a wq is usually regulated by the
users of the wq, more specifically, by how many work items the users
may queue at the same time.  Unless there is a specific need for
throttling the number of active work items, specifying '0' is
recommended.

Some users depend on strict execution ordering where only one work item
is in flight at any given time and the work items are processed in
queueing order. While a combination of ``@max_active`` of 1 and
``WQ_UNBOUND`` used to achieve this behavior, this is no longer the
case. Use ``alloc_ordered_workqueue()`` instead.

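For example (the name and flag are illustrative), an ordered wq that
executes at most one work item at a time, in queueing order::

  struct workqueue_struct *ordered_wq;

  ordered_wq = alloc_ordered_workqueue("my_ordered_wq", WQ_MEM_RECLAIM);
  if (!ordered_wq)
          return -ENOMEM;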

Example Execution Scenarios
===========================

The following example execution scenarios try to illustrate how cmwq
behaves under different configurations.

 Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU.
 w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms
 again before finishing.  w1 and w2 burn CPU for 5ms then sleep for
 10ms.

Ignoring all other tasks, works and processing overhead, and assuming
simple FIFO scheduling, the following is one highly simplified version
of possible sequences of events with the original wq. ::

 TIME IN MSECS	EVENT
 0		w0 starts and burns CPU
 5		w0 sleeps
 15		w0 wakes up and burns CPU
 20		w0 finishes
 20		w1 starts and burns CPU
 25		w1 sleeps
 35		w1 wakes up and finishes
 35		w2 starts and burns CPU
 40		w2 sleeps
 50		w2 wakes up and finishes

And with cmwq with ``@max_active`` >= 3, ::

 TIME IN MSECS	EVENT
 0		w0 starts and burns CPU
 5		w0 sleeps
 5		w1 starts and burns CPU
 10		w1 sleeps
 10		w2 starts and burns CPU
 15		w2 sleeps
 15		w0 wakes up and burns CPU
 20		w0 finishes
 20		w1 wakes up and finishes
 25		w2 wakes up and finishes

If ``@max_active`` == 2, ::

 TIME IN MSECS	EVENT
 0		w0 starts and burns CPU
 5		w0 sleeps
 5		w1 starts and burns CPU
 10		w1 sleeps
 15		w0 wakes up and burns CPU
 20		w0 finishes
 20		w1 wakes up and finishes
 20		w2 starts and burns CPU
 25		w2 sleeps
 35		w2 wakes up and finishes

Now, let's assume w1 and w2 are queued to a different wq q1 which has
``WQ_CPU_INTENSIVE`` set, ::

 TIME IN MSECS	EVENT
 0		w0 starts and burns CPU
 5		w0 sleeps
 5		w1 and w2 start and burn CPU
 10		w1 sleeps
 15		w2 sleeps
 15		w0 wakes up and burns CPU
 20		w0 finishes
 20		w1 wakes up and finishes
 25		w2 wakes up and finishes


Guidelines
==========

* Do not forget to use ``WQ_MEM_RECLAIM`` if a wq may process work
  items which are used during memory reclaim.  Each wq with
  ``WQ_MEM_RECLAIM`` set has an execution context reserved for it.  If
  there is a dependency among multiple work items used during memory
  reclaim, they should be queued to separate wq's, each with
  ``WQ_MEM_RECLAIM``.

* Unless strict ordering is required, there is no need to use ST wq.

* Unless there is a specific need, using 0 for ``@max_active`` is
  recommended.  In most use cases, the concurrency level usually stays
  well under the default limit.

* A wq serves as a domain for forward progress guarantee
  (``WQ_MEM_RECLAIM``), flush and work item attributes.  Work items
  which are not involved in memory reclaim and don't need to be
  flushed as a part of a group of work items, and don't require any
  special attribute, can use one of the system wq.  There is no
  difference in execution characteristics between using a dedicated wq
  and a system wq.

  Note: If something may generate more than ``@max_active`` outstanding
  work items (do stress test your producers), it may saturate a system
  wq and potentially lead to deadlock. It should utilize its own
  dedicated workqueue rather than the system wq.

* Unless work items are expected to consume a huge amount of CPU
  cycles, using a bound wq is usually beneficial due to the increased
  level of locality in wq operations and work item execution.

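As a sketch of the ``WQ_MEM_RECLAIM`` guideline above (the workqueue
names are hypothetical), two interdependent work items used during
reclaim each get their own rescuer-backed wq::

  struct workqueue_struct *wq_a, *wq_b;

  /*
   * Work queued on wq_a may wait on work queued on wq_b during
   * memory reclaim, so each gets its own WQ_MEM_RECLAIM wq and
   * therefore its own reserved rescuer.
   */
  wq_a = alloc_workqueue("reclaim_a", WQ_MEM_RECLAIM, 0);
  wq_b = alloc_workqueue("reclaim_b", WQ_MEM_RECLAIM, 0);
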

Affinity Scopes
===============

An unbound workqueue groups CPUs according to its affinity scope to improve
cache locality. For example, if a workqueue is using the default affinity
scope of "cache", it will group CPUs according to last level cache
boundaries. A work item queued on the workqueue will be assigned to a worker
on one of the CPUs which share the last level cache with the issuing CPU.
Once started, the worker may or may not be allowed to move outside the scope
depending on the ``affinity_strict`` setting of the scope.

Workqueue currently supports the following affinity scopes.

``default``
  Use the scope in module parameter ``workqueue.default_affinity_scope``
  which is always set to one of the scopes below.

``cpu``
  CPUs are not grouped. A work item issued on one CPU is processed by a
  worker on the same CPU. This makes unbound workqueues behave as per-cpu
  workqueues without concurrency management.

``smt``
  CPUs are grouped according to SMT boundaries. This usually means that the
  logical threads of each physical CPU core are grouped together.

``cache``
  CPUs are grouped according to cache boundaries. Which specific cache
  boundary is used is determined by the arch code. L3 is used in a lot of
  cases. This is the default affinity scope.

``numa``
  CPUs are grouped according to NUMA boundaries.

``system``
  All CPUs are put in the same group. Workqueue makes no effort to process a
  work item on a CPU close to the issuing CPU.

The default affinity scope can be changed with the module parameter
``workqueue.default_affinity_scope`` and a specific workqueue's affinity
scope can be changed using ``apply_workqueue_attrs()``.

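A minimal sketch of the latter (the field and constant names are as found
in recent kernels; verify them against your kernel version)::

  struct workqueue_attrs *attrs;

  attrs = alloc_workqueue_attrs();
  if (attrs) {
          attrs->affn_scope = WQ_AFFN_NUMA;  /* group CPUs by NUMA node */
          apply_workqueue_attrs(unbound_wq, attrs);
          free_workqueue_attrs(attrs);
  }
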
41263c5484eSTejun HeoIf ``WQ_SYSFS`` is set, the workqueue will have the following affinity scope
413bd9e7326SWangJinchaorelated interface files under its ``/sys/devices/virtual/workqueue/WQ_NAME/``
41463c5484eSTejun Heodirectory.
41563c5484eSTejun Heo
41663c5484eSTejun Heo``affinity_scope``
41763c5484eSTejun Heo  Read to see the current affinity scope. Write to change.
41863c5484eSTejun Heo
419523a301eSTejun Heo  When default is the current scope, reading this file will also show the
420523a301eSTejun Heo  current effective scope in parentheses, for example, ``default (cache)``.
421523a301eSTejun Heo
4228639ecebSTejun Heo``affinity_strict``
4238639ecebSTejun Heo  0 by default indicating that affinity scopes are not strict. When a work
4248639ecebSTejun Heo  item starts execution, workqueue makes a best-effort attempt to ensure
4258639ecebSTejun Heo  that the worker is inside its affinity scope, which is called
4268639ecebSTejun Heo  repatriation. Once started, the scheduler is free to move the worker
4278639ecebSTejun Heo  anywhere in the system as it sees fit. This enables benefiting from scope
4288639ecebSTejun Heo  locality while still being able to utilize other CPUs if necessary and
4298639ecebSTejun Heo  available.
4308639ecebSTejun Heo
4318639ecebSTejun Heo  If set to 1, all workers of the scope are guaranteed always to be in the
4328639ecebSTejun Heo  scope. This may be useful when crossing affinity scopes has other
4338639ecebSTejun Heo  implications, for example, in terms of power consumption or workload
4348639ecebSTejun Heo  isolation. Strict NUMA scope can also be used to match the workqueue
4358639ecebSTejun Heo  behavior of older kernels.
4368639ecebSTejun Heo
43763c5484eSTejun Heo
4387dbf15c5STejun HeoAffinity Scopes and Performance
4397dbf15c5STejun Heo===============================
4407dbf15c5STejun Heo
4417dbf15c5STejun HeoIt'd be ideal if an unbound workqueue's behavior is optimal for vast
4427dbf15c5STejun Heomajority of use cases without further tuning. Unfortunately, in the current
4437dbf15c5STejun Heokernel, there exists a pronounced trade-off between locality and utilization
4447dbf15c5STejun Heonecessitating explicit configurations when workqueues are heavily used.
4457dbf15c5STejun Heo
4467dbf15c5STejun HeoHigher locality leads to higher efficiency where more work is performed for
4477dbf15c5STejun Heothe same number of consumed CPU cycles. However, higher locality may also
4487dbf15c5STejun Heocause lower overall system utilization if the work items are not spread
4497dbf15c5STejun Heoenough across the affinity scopes by the issuers. The following performance
4507dbf15c5STejun Heotesting with dm-crypt clearly illustrates this trade-off.
4517dbf15c5STejun Heo
4527dbf15c5STejun HeoThe tests are run on a CPU with 12-cores/24-threads split across four L3
4537dbf15c5STejun Heocaches (AMD Ryzen 9 3900x). CPU clock boost is turned off for consistency.
4547dbf15c5STejun Heo``/dev/dm-0`` is a dm-crypt device created on NVME SSD (Samsung 990 PRO) and
4557dbf15c5STejun Heoopened with ``cryptsetup`` with default settings.
4567dbf15c5STejun Heo
4577dbf15c5STejun Heo
4587dbf15c5STejun HeoScenario 1: Enough issuers and work spread across the machine
4597dbf15c5STejun Heo-------------------------------------------------------------
4607dbf15c5STejun Heo
4617dbf15c5STejun HeoThe command used: ::
4627dbf15c5STejun Heo
4637dbf15c5STejun Heo  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k --ioengine=libaio \
4647dbf15c5STejun Heo    --iodepth=64 --runtime=60 --numjobs=24 --time_based --group_reporting \
4657dbf15c5STejun Heo    --name=iops-test-job --verify=sha512
4667dbf15c5STejun Heo
4677dbf15c5STejun HeoThere are 24 issuers, each issuing 64 IOs concurrently. ``--verify=sha512``
4687dbf15c5STejun Heomakes ``fio`` generate and read back the content each time which makes
46922160b08Sattreyee-mukexecution locality matter between the issuer and ``kcryptd``. The following
4707dbf15c5STejun Heoare the read bandwidths and CPU utilizations depending on different affinity
4717dbf15c5STejun Heoscope settings on ``kcryptd`` measured over five runs. Bandwidths are in
4727dbf15c5STejun HeoMiBps, and CPU util in percents.

.. list-table::
   :widths: 16 20 20
   :header-rows: 1

   * - Affinity
     - Bandwidth (MiBps)
     - CPU util (%)

   * - system
     - 1159.40 ±1.34
     - 99.31 ±0.02

   * - cache
     - 1166.40 ±0.89
     - 99.34 ±0.01

   * - cache (strict)
     - 1166.00 ±0.71
     - 99.35 ±0.01

With enough issuers spread across the system, there is no downside to
"cache", strict or otherwise. All three configurations saturate the whole
machine but the cache-affine ones outperform by 0.6% thanks to improved
locality.


Scenario 2: Fewer issuers, enough work for saturation
-----------------------------------------------------

The command used: ::

  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
    --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 \
    --time_based --group_reporting --name=iops-test-job --verify=sha512
The only difference from the previous scenario is ``--numjobs=8``. There are
only a third as many issuers, but there is still enough total work to
saturate the system.

.. list-table::
   :widths: 16 20 20
   :header-rows: 1

   * - Affinity
     - Bandwidth (MiBps)
     - CPU util (%)

   * - system
     - 1155.40 ±0.89
     - 97.41 ±0.05

   * - cache
     - 1154.40 ±1.14
     - 96.15 ±0.09

   * - cache (strict)
     - 1112.00 ±4.64
     - 93.26 ±0.35
This is still enough work to keep the system busy. Both "system" and
"cache" nearly saturate the machine, though not fully. "cache" uses less
CPU, and its better efficiency puts it at the same bandwidth as "system".

Eight issuers moving around over four L3 cache scopes still allow "cache
(strict)" to mostly saturate the machine, but the loss of work-conservation
is now starting to hurt, with a 3.7% bandwidth loss.


Scenario 3: Even fewer issuers, not enough work to saturate
-----------------------------------------------------------

The command used: ::

  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
    --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 \
    --time_based --group_reporting --name=iops-test-job --verify=sha512

Again, the only difference is ``--numjobs=4``. With the number of issuers
reduced to four, there now isn't enough work to saturate the whole system
and the bandwidth becomes dependent on completion latencies.

.. list-table::
   :widths: 16 20 20
   :header-rows: 1

   * - Affinity
     - Bandwidth (MiBps)
     - CPU util (%)

   * - system
     - 993.60 ±1.82
     - 75.49 ±0.06

   * - cache
     - 973.40 ±1.52
     - 74.90 ±0.07

   * - cache (strict)
     - 828.20 ±4.49
     - 66.84 ±0.29
Now, the tradeoff between locality and utilization is clearer. "cache" shows
a 2% bandwidth loss compared to "system", and "cache (strict)" a whopping 20%.


Conclusion and Recommendations
------------------------------

In the above experiments, the efficiency advantage of the "cache" affinity
scope over "system" is, while consistent and noticeable, small. However, the
impact is dependent on the distances between the scopes and may be more
pronounced in processors with more complex topologies.

While "cache" loses work-conservation in certain scenarios, it still does a
lot better than "cache (strict)", and maximizing workqueue utilization is
unlikely to be the common case anyway. As such, "cache" is the default
affinity scope for unbound pools.

* As there is no one option which is great for most cases, workqueue usages
  that may consume a significant amount of CPU are recommended to configure
  the workqueues using ``apply_workqueue_attrs()`` and/or enable
  ``WQ_SYSFS``.

* An unbound workqueue with strict "cpu" affinity scope behaves the same as
  a ``WQ_CPU_INTENSIVE`` per-cpu workqueue. There is no real advantage to
  the latter and an unbound workqueue provides a lot more flexibility.

* Affinity scopes were introduced in Linux v6.5. To emulate the previous
  behavior, use the strict "numa" affinity scope.

* The loss of work-conservation in non-strict affinity scopes likely
  originates from the scheduler. There is no theoretical reason why the
  kernel wouldn't be able to do the right thing and maintain
  work-conservation in most cases. As such, it is possible that future
  scheduler improvements may make most of these tunables unnecessary.
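
For workqueues registered with ``WQ_SYSFS``, the scope can also be adjusted
at runtime through the ``affinity_scope`` and ``affinity_strict`` sysfs
attributes of unbound workqueues. The workqueue name ``writeback`` below is
only an example; the files exist only for ``WQ_SYSFS`` workqueues and the
exact output may differ by kernel version: ::

  # Show the current affinity scope of the workqueue.
  $ cat /sys/devices/virtual/workqueue/writeback/affinity_scope

  # Emulate the pre-v6.5 behavior: strict "numa" scope.
  $ echo numa > /sys/devices/virtual/workqueue/writeback/affinity_scope
  $ echo 1 > /sys/devices/virtual/workqueue/writeback/affinity_strict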


Examining Configuration
=======================

Use ``tools/workqueue/wq_dump.py`` to examine unbound CPU affinity
configuration, worker pools and how workqueues map to the pools: ::

  $ tools/workqueue/wq_dump.py
  Affinity Scopes
  ===============
  wq_unbound_cpumask=0000000f

  CPU
    nr_pods  4
    pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
    pod_node [0]=0 [1]=0 [2]=1 [3]=1
    cpu_pod  [0]=0 [1]=1 [2]=2 [3]=3

  SMT
    nr_pods  4
    pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
    pod_node [0]=0 [1]=0 [2]=1 [3]=1
    cpu_pod  [0]=0 [1]=1 [2]=2 [3]=3

  CACHE (default)
    nr_pods  2
    pod_cpus [0]=00000003 [1]=0000000c
    pod_node [0]=0 [1]=1
    cpu_pod  [0]=0 [1]=0 [2]=1 [3]=1

  NUMA
    nr_pods  2
    pod_cpus [0]=00000003 [1]=0000000c
    pod_node [0]=0 [1]=1
    cpu_pod  [0]=0 [1]=0 [2]=1 [3]=1

  SYSTEM
    nr_pods  1
    pod_cpus [0]=0000000f
    pod_node [0]=-1
    cpu_pod  [0]=0 [1]=0 [2]=0 [3]=0

  Worker Pools
  ============
  pool[00] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  0
  pool[01] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  0
  pool[02] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  1
  pool[03] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  1
  pool[04] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  2
  pool[05] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  2
  pool[06] ref= 1 nice=  0 idle/workers=  3/  3 cpu=  3
  pool[07] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  3
  pool[08] ref=42 nice=  0 idle/workers=  6/  6 cpus=0000000f
  pool[09] ref=28 nice=  0 idle/workers=  3/  3 cpus=00000003
  pool[10] ref=28 nice=  0 idle/workers= 17/ 17 cpus=0000000c
  pool[11] ref= 1 nice=-20 idle/workers=  1/  1 cpus=0000000f
  pool[12] ref= 2 nice=-20 idle/workers=  1/  1 cpus=00000003
  pool[13] ref= 2 nice=-20 idle/workers=  1/  1 cpus=0000000c

  Workqueue CPU -> pool
  =====================
  [    workqueue \ CPU              0  1  2  3 dfl]
  events                   percpu   0  2  4  6
  events_highpri           percpu   1  3  5  7
  events_long              percpu   0  2  4  6
  events_unbound           unbound  9  9 10 10  8
  events_freezable         percpu   0  2  4  6
  events_power_efficient   percpu   0  2  4  6
  events_freezable_pwr_ef  percpu   0  2  4  6
  rcu_gp                   percpu   0  2  4  6
  rcu_par_gp               percpu   0  2  4  6
  slub_flushwq             percpu   0  2  4  6
  netns                    ordered  8  8  8  8  8
  ...

See the command's help message for more info.


Monitoring
==========

Use ``tools/workqueue/wq_monitor.py`` to monitor workqueue operations: ::

  $ tools/workqueue/wq_monitor.py events
                              total  infl  CPUtime  CPUhog CMW/RPR  mayday rescued
  events                      18545     0      6.1       0       5       -       -
  events_highpri                  8     0      0.0       0       0       -       -
  events_long                     3     0      0.0       0       0       -       -
  events_unbound              38306     0      0.1       -       7       -       -
  events_freezable                0     0      0.0       0       0       -       -
  events_power_efficient      29598     0      0.2       0       0       -       -
  events_freezable_pwr_ef        10     0      0.0       0       0       -       -
  sock_diag_events                0     0      0.0       0       0       -       -

                              total  infl  CPUtime  CPUhog CMW/RPR  mayday rescued
  events                      18548     0      6.1       0       5       -       -
  events_highpri                  8     0      0.0       0       0       -       -
  events_long                     3     0      0.0       0       0       -       -
  events_unbound              38322     0      0.1       -       7       -       -
  events_freezable                0     0      0.0       0       0       -       -
  events_power_efficient      29603     0      0.2       0       0       -       -
  events_freezable_pwr_ef        10     0      0.0       0       0       -       -
  sock_diag_events                0     0      0.0       0       0       -       -

  ...

See the command's help message for more info.


Debugging
=========

Because the work functions are executed by generic worker threads
there are a few tricks needed to shed some light on misbehaving
workqueue users.

Worker threads show up in the process list as: ::

  root      5671  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/0:1]
  root      5672  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/1:2]
  root      5673  0.0  0.0      0     0 ?        S    12:12   0:00 [kworker/0:0]
  root      5674  0.0  0.0      0     0 ?        S    12:13   0:00 [kworker/1:0]

If kworkers are going crazy (using too much CPU), there are two types
of possible problems:

	1. Something being scheduled in rapid succession
	2. A single work item that consumes lots of CPU cycles

The first one can be tracked using tracing: ::

	$ echo workqueue:workqueue_queue_work > /sys/kernel/tracing/set_event
	$ cat /sys/kernel/tracing/trace_pipe > out.txt
	(wait a few secs)
	^C

If something is busy looping on work queueing, it will dominate the
output and the offender can be identified by its work item function.
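
The trace can get long. One quick way to find the dominant queuer is to
tally events per work function. The pipeline below is just a sketch; it
assumes the default ``workqueue_queue_work`` tracepoint output, which
includes a ``function=`` field:

```shell
# Count queue_work events per work function in the captured trace and
# print the most frequent ones first; the busiest queuer floats to the top.
grep -o 'function=[A-Za-z0-9_.]*' out.txt | sort | uniq -c | sort -rn | head
```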

For the second type of problem it should be possible to just check
the stack trace of the offending worker thread. ::

	$ cat /proc/THE_OFFENDING_KWORKER/stack

The work item's function should be trivially visible in the stack
trace.

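When several kworkers look suspicious, a small loop can snapshot all of
their stacks at once. This is only an illustrative helper, and reading
``/proc/<pid>/stack`` usually requires root: ::

	# Illustrative only: dump the stack of every kworker (run as root).
	$ for pid in $(pgrep kworker); do
	      echo "== kworker $pid =="
	      cat /proc/$pid/stack
	  done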

Non-reentrance Conditions
=========================

Workqueue guarantees that a work item cannot be re-entrant if the following
conditions hold after a work item gets queued:

        1. The work function hasn't been changed.
        2. No one queues the work item to another workqueue.
        3. The work item hasn't been reinitiated.

In other words, if the above conditions hold, the work item is guaranteed to be
executed by at most one worker system-wide at any given time.

Note that requeuing the work item (to the same queue) from within its own work
function doesn't break these conditions, so it's safe to do. Otherwise, caution
is required when breaking the conditions inside a work function.
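
For example, a work item that requeues itself from its own work function
stays within these conditions. The following is an illustrative
kernel-style sketch, not code from any particular user: ::

	static void poll_fn(struct work_struct *work)
	{
		/* ... do one unit of work ... */

		/*
		 * Requeueing to the same workqueue from the work function
		 * itself keeps the non-reentrance guarantee intact.
		 */
		queue_work(system_wq, work);
	}

	static DECLARE_WORK(poll_work, poll_fn);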


Kernel Inline Documentation Reference
=====================================

.. kernel-doc:: include/linux/workqueue.h

.. kernel-doc:: kernel/workqueue.c