=========
Workqueue
=========

:Date: September, 2010
:Author: Tejun Heo <[email protected]>
:Author: Florian Mickler <[email protected]>


Introduction
============

There are many cases where an asynchronous process execution context
is needed and the workqueue (wq) API is the most commonly used
mechanism for such cases.

When such an asynchronous execution context is needed, a work item
describing which function to execute is put on a queue. An
independent thread serves as the asynchronous execution context. The
queue is called workqueue and the thread is called worker.

While there are work items on the workqueue the worker executes the
functions associated with the work items one after the other. When
there is no work item left on the workqueue the worker becomes idle.
When a new work item gets queued, the worker begins executing again.


Why Concurrency Managed Workqueue?
==================================

In the original wq implementation, a multi threaded (MT) wq had one
worker thread per CPU and a single threaded (ST) wq had one worker
thread system-wide. A single MT wq needed to keep around the same
number of workers as the number of CPUs. The kernel grew a lot of MT
wq users over the years and with the number of CPU cores continuously
rising, some systems saturated the default 32k PID space just booting
up.

Although MT wq wasted a lot of resources, the level of concurrency
provided was unsatisfactory. The limitation was common to both ST and
MT wq albeit less severe on MT. Each wq maintained its own separate
worker pool. An MT wq could provide only one execution context per CPU
while an ST wq provided one for the whole system. Work items had to
compete for those very limited execution contexts leading to various
problems including proneness to deadlocks around the single execution
context.

The tension between the provided level of concurrency and resource
usage also forced its users to make unnecessary tradeoffs like libata
choosing to use ST wq for polling PIOs and accepting an unnecessary
limitation that no two polling PIOs can progress at the same time. As
MT wq didn't provide much better concurrency, users which required a
higher level of concurrency, like async or fscache, had to implement
their own thread pool.

Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with
focus on the following goals.

* Maintain compatibility with the original workqueue API.

* Use per-CPU unified worker pools shared by all wq to provide a
  flexible level of concurrency on demand without wasting a lot of
  resources.

* Automatically regulate worker pool and level of concurrency so that
  the API users don't need to worry about such details.


The Design
==========

In order to ease the asynchronous execution of functions a new
abstraction, the work item, is introduced.

A work item is a simple struct that holds a pointer to the function
that is to be executed asynchronously. Whenever a driver or subsystem
wants a function to be executed asynchronously it has to set up a work
item pointing to that function and queue that work item on a
workqueue.

A work item can be executed in either a thread or the BH (softirq) context.

For threaded workqueues, special purpose threads, called [k]workers, execute
the functions off of the queue, one after the other. If no work is queued,
the worker threads become idle. These worker threads are managed in
worker-pools.

The cmwq design differentiates between the user-facing workqueues that
subsystems and drivers queue work items on and the backend mechanism
which manages worker-pools and processes the queued work items.

There are two worker-pools, one for normal work items and the other
for high priority ones, for each possible CPU and some extra
worker-pools to serve work items queued on unbound workqueues - the
number of these backing pools is dynamic.

BH workqueues use the same framework. However, as there can only be one
concurrent execution context, there's no need to worry about concurrency.
Each per-CPU BH worker pool contains only one pseudo worker which represents
the BH execution context. A BH workqueue can be considered a convenience
interface to softirq.

Subsystems and drivers can create and queue work items through special
workqueue API functions as they see fit. They can influence some
aspects of the way the work items are executed by setting flags on the
workqueue they are putting the work item on. These flags include
things like CPU locality, concurrency limits, priority and more. To
get a detailed overview refer to the API description of
``alloc_workqueue()`` below.
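
As a minimal sketch, setting up and queueing a work item might look like
the following; ``my_work_handler`` and ``my_driver_kick`` are illustrative
names, while ``DECLARE_WORK()`` and ``schedule_work()`` are the standard
helpers from ``include/linux/workqueue.h``. ::

  #include <linux/workqueue.h>

  static void my_work_handler(struct work_struct *work)
  {
          /* Executed asynchronously in a worker thread. */
  }

  static DECLARE_WORK(my_work, my_work_handler);

  static void my_driver_kick(void)
  {
          /*
           * schedule_work() queues on one of the system-wide
           * workqueues; queue_work() targets a specific wq.
           */
          schedule_work(&my_work);
  }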

When a work item is queued to a workqueue, the target worker-pool is
determined according to the queue parameters and workqueue attributes
and appended on the shared worklist of the worker-pool. For example,
unless specifically overridden (e.g. with ``queue_work_on()``; see the
sketch at the end of this section), a work item of a bound workqueue
will be queued on the worklist of either the normal or the highpri
worker-pool that is associated with the CPU the issuer is running on.

For any thread pool implementation, managing the concurrency level
(how many execution contexts are active) is an important issue. cmwq
tries to keep the concurrency at a minimal but sufficient level.
Minimal to save resources and sufficient in that the system is used at
its full capacity.

Each worker-pool bound to an actual CPU implements concurrency
management by hooking into the scheduler. The worker-pool is notified
whenever an active worker wakes up or sleeps and keeps track of the
number of the currently runnable workers. Generally, work items are
not expected to hog a CPU and consume many cycles. That means
maintaining just enough concurrency to prevent work processing from
stalling should be optimal. As long as there are one or more runnable
workers on the CPU, the worker-pool doesn't start execution of a new
work, but, when the last running worker goes to sleep, it immediately
schedules a new worker so that the CPU doesn't sit idle while there
are pending work items. This allows using a minimal number of workers
without losing execution bandwidth.

Keeping idle workers around doesn't cost anything other than the
memory space for the kthreads, so cmwq holds onto idle ones for a
while before killing them.

For unbound workqueues, the number of backing pools is dynamic.
Unbound workqueues can be assigned custom attributes using
``apply_workqueue_attrs()`` and workqueue will automatically create
backing worker pools matching the attributes. The responsibility of
regulating the concurrency level is on the users. There is also a flag
to mark a bound wq to ignore the concurrency management. Please refer
to the API section for details.

Forward progress guarantee relies on workers being able to be created
when more execution contexts are necessary, which in turn is
guaranteed through the use of rescue workers. All work items which
might be used on code paths that handle memory reclaim are required to
be queued on wq's that have a rescue-worker reserved for execution
under memory pressure. Else it is possible that the worker-pool
deadlocks waiting for execution contexts to free up.
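
The queueing step above can be sketched as follows; ``my_wq`` stands for a
workqueue allocated with ``alloc_workqueue()`` (described in the API section
below) and the helper names are illustrative. ::

  static struct workqueue_struct *my_wq;

  static void my_submit(struct work_struct *work)
  {
          /* Default: the worker-pool is picked from the issuing CPU. */
          queue_work(my_wq, work);
  }

  static void my_submit_on_cpu1(struct work_struct *work)
  {
          /* Override: explicitly target CPU 1's worker-pools. */
          queue_work_on(1, my_wq, work);
  }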


Application Programming Interface (API)
=======================================

``alloc_workqueue()`` allocates a wq. The original
``create_*workqueue()`` functions are deprecated and scheduled for
removal. ``alloc_workqueue()`` takes three arguments - ``@name``,
``@flags`` and ``@max_active``. ``@name`` is the name of the wq and
is also used as the name of the rescuer thread if there is one.

A wq no longer manages execution resources but serves as a domain for
forward progress guarantee, flush and work item attributes. ``@flags``
and ``@max_active`` control how work items are assigned execution
resources, scheduled and executed.


``flags``
---------

``WQ_BH``
  BH workqueues can be considered a convenience interface to softirq. BH
  workqueues are always per-CPU and all BH work items are executed in the
  queueing CPU's softirq context in the queueing order.

  All BH workqueues must have 0 ``max_active`` and ``WQ_HIGHPRI`` is the
  only allowed additional flag.

  BH work items cannot sleep. All other features such as delayed queueing,
  flushing and canceling are supported.

``WQ_UNBOUND``
  Work items queued to an unbound wq are served by the special
  worker-pools which host workers which are not bound to any
  specific CPU. This makes the wq behave as a simple execution
  context provider without concurrency management. The unbound
  worker-pools try to start execution of work items as soon as
  possible. Unbound wq sacrifices locality but is useful for
  the following cases.

  * Wide fluctuation in the concurrency level requirement is
    expected and using a bound wq may end up creating a large number
    of mostly unused workers across different CPUs as the issuer
    hops through different CPUs.

  * Long running CPU intensive workloads which can be better
    managed by the system scheduler.

``WQ_FREEZABLE``
  A freezable wq participates in the freeze phase of the system
  suspend operations. Work items on the wq are drained and no
  new work item starts execution until thawed.

``WQ_MEM_RECLAIM``
  All wq which might be used in the memory reclaim paths **MUST**
  have this flag set. The wq is guaranteed to have at least one
  execution context regardless of memory pressure.

``WQ_HIGHPRI``
  Work items of a highpri wq are queued to the highpri
  worker-pool of the target cpu. Highpri worker-pools are
  served by worker threads with elevated nice level.

  Note that normal and highpri worker-pools don't interact with
  each other. Each maintains its separate pool of workers and
  implements concurrency management among its workers.

``WQ_CPU_INTENSIVE``
  Work items of a CPU intensive wq do not contribute to the
  concurrency level. In other words, runnable CPU intensive
  work items will not prevent other work items in the same
  worker-pool from starting execution. This is useful for bound
  work items which are expected to hog CPU cycles so that their
  execution is regulated by the system scheduler.

  Although CPU intensive work items don't contribute to the
  concurrency level, the start of their execution is still
  regulated by the concurrency management and runnable
  non-CPU-intensive work items can delay execution of CPU
  intensive work items.

  This flag is meaningless for unbound wq.


``max_active``
--------------

``@max_active`` determines the maximum number of execution contexts per
CPU which can be assigned to the work items of a wq. For example, with
``@max_active`` of 16, at most 16 work items of the wq can be executing
at the same time per CPU. This is always a per-CPU attribute, even for
unbound workqueues.

The maximum limit for ``@max_active`` is 2048 and the default value used
when 0 is specified is 1024. These values are chosen sufficiently high
such that they are not the limiting factor while providing protection in
runaway cases.

The number of active work items of a wq is usually regulated by the
users of the wq, more specifically, by how many work items the users
may queue at the same time. Unless there is a specific need for
throttling the number of active work items, specifying '0' is
recommended.

Some users depend on strict execution ordering where only one work item
is in flight at any given time and the work items are processed in
queueing order. While the combination of ``@max_active`` of 1 and
``WQ_UNBOUND`` used to achieve this behavior, this is no longer the
case. Use ``alloc_ordered_workqueue()`` instead.
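
Putting ``@flags`` and ``@max_active`` together, the following is a hedged
sketch of allocating the two kinds of workqueues discussed above; the name
strings and function names are made up for the example, while
``alloc_workqueue()``, ``alloc_ordered_workqueue()`` and
``destroy_workqueue()`` are the real entry points. ::

  #include <linux/workqueue.h>

  static struct workqueue_struct *my_io_wq;
  static struct workqueue_struct *my_ordered_wq;

  static int my_wqs_init(void)
  {
          /* Unbound, reclaim-safe, default max_active (0 == 1024). */
          my_io_wq = alloc_workqueue("my_io_wq",
                                     WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
          if (!my_io_wq)
                  return -ENOMEM;

          /* At most one work item in flight, in queueing order. */
          my_ordered_wq = alloc_ordered_workqueue("my_ordered_wq", 0);
          if (!my_ordered_wq) {
                  destroy_workqueue(my_io_wq);
                  return -ENOMEM;
          }

          return 0;
  }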


Example Execution Scenarios
===========================

The following example execution scenarios try to illustrate how cmwq
behaves under different configurations.

 Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU.
 w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms
 again before finishing. w1 and w2 burn CPU for 5ms then sleep for
 10ms.

Ignoring all other tasks, work items and processing overhead, and
assuming simple FIFO scheduling, the following is one highly simplified
version of possible sequences of events with the original wq. ::

  TIME IN MSECS   EVENT
  0               w0 starts and burns CPU
  5               w0 sleeps
  15              w0 wakes up and burns CPU
  20              w0 finishes
  20              w1 starts and burns CPU
  25              w1 sleeps
  35              w1 wakes up and finishes
  35              w2 starts and burns CPU
  40              w2 sleeps
  50              w2 wakes up and finishes

And with cmwq with ``@max_active`` >= 3, ::

  TIME IN MSECS   EVENT
  0               w0 starts and burns CPU
  5               w0 sleeps
  5               w1 starts and burns CPU
  10              w1 sleeps
  10              w2 starts and burns CPU
  15              w2 sleeps
  15              w0 wakes up and burns CPU
  20              w0 finishes
  20              w1 wakes up and finishes
  25              w2 wakes up and finishes

If ``@max_active`` == 2, ::

  TIME IN MSECS   EVENT
  0               w0 starts and burns CPU
  5               w0 sleeps
  5               w1 starts and burns CPU
  10              w1 sleeps
  15              w0 wakes up and burns CPU
  20              w0 finishes
  20              w1 wakes up and finishes
  20              w2 starts and burns CPU
  25              w2 sleeps
  35              w2 wakes up and finishes

Now, let's assume w1 and w2 are queued to a different wq q1 which has
``WQ_CPU_INTENSIVE`` set, ::

  TIME IN MSECS   EVENT
  0               w0 starts and burns CPU
  5               w0 sleeps
  5               w1 and w2 start and burn CPU
  10              w1 sleeps
  15              w2 sleeps
  15              w0 wakes up and burns CPU
  20              w0 finishes
  20              w1 wakes up and finishes
  25              w2 wakes up and finishes


Guidelines
==========

* Do not forget to use ``WQ_MEM_RECLAIM`` if a wq may process work
  items which are used during memory reclaim. Each wq with
  ``WQ_MEM_RECLAIM`` set has an execution context reserved for it. If
  there is a dependency among multiple work items used during memory
  reclaim, they should be queued to separate wqs, each with
  ``WQ_MEM_RECLAIM``.

* Unless strict ordering is required, there is no need to use ST wq.

* Unless there is a specific need, using 0 for @max_active is
  recommended. In most use cases, the concurrency level usually stays
  well under the default limit.

* A wq serves as a domain for forward progress guarantee
  (``WQ_MEM_RECLAIM``), flush and work item attributes. Work items
  which are not involved in memory reclaim and don't need to be
  flushed as a part of a group of work items, and don't require any
  special attribute, can use one of the system wq. There is no
  difference in execution characteristics between using a dedicated wq
  and a system wq.

  Note: If something may generate more than @max_active outstanding
  work items (do stress test your producers), it may saturate a system
  wq and potentially lead to deadlock. It should utilize its own
  dedicated workqueue rather than the system wq.

* Unless work items are expected to consume a huge amount of CPU
  cycles, using a bound wq is usually beneficial due to the increased
  level of locality in wq operations and work item execution.


Affinity Scopes
===============

An unbound workqueue groups CPUs according to its affinity scope to improve
cache locality. For example, if a workqueue is using the default affinity
scope of "cache", it will group CPUs according to last level cache
boundaries. A work item queued on the workqueue will be assigned to a worker
on one of the CPUs which share the last level cache with the issuing CPU.
Once started, the worker may or may not be allowed to move outside the scope
depending on the ``affinity_strict`` setting of the scope.

Workqueue currently supports the following affinity scopes.

``default``
  Use the scope in module parameter ``workqueue.default_affinity_scope``
  which is always set to one of the scopes below.

``cpu``
  CPUs are not grouped. A work item issued on one CPU is processed by a
  worker on the same CPU. This makes unbound workqueues behave as per-cpu
  workqueues without concurrency management.

``smt``
  CPUs are grouped according to SMT boundaries. This usually means that the
  logical threads of each physical CPU core are grouped together.

``cache``
  CPUs are grouped according to cache boundaries. Which specific cache
  boundary is used is determined by the arch code. L3 is used in a lot of
  cases. This is the default affinity scope.

``numa``
  CPUs are grouped according to NUMA boundaries.

``system``
  All CPUs are put in the same group. Workqueue makes no effort to process a
  work item on a CPU close to the issuing CPU.

The default affinity scope can be changed with the module parameter
``workqueue.default_affinity_scope`` and a specific workqueue's affinity
scope can be changed using ``apply_workqueue_attrs()``.

If ``WQ_SYSFS`` is set, the workqueue will have the following affinity scope
related interface files under its ``/sys/devices/virtual/workqueue/WQ_NAME/``
directory.

``affinity_scope``
  Read to see the current affinity scope. Write to change.

  When default is the current scope, reading this file will also show the
  current effective scope in parentheses, for example, ``default (cache)``.

``affinity_strict``
  0 by default indicating that affinity scopes are not strict. When a work
  item starts execution, workqueue makes a best-effort attempt to ensure
  that the worker is inside its affinity scope, which is called
  repatriation. Once started, the scheduler is free to move the worker
  anywhere in the system as it sees fit. This enables benefiting from scope
  locality while still being able to utilize other CPUs if necessary and
  available.

  If set to 1, all workers of the scope are guaranteed always to be in the
  scope. This may be useful when crossing affinity scopes has other
  implications, for example, in terms of power consumption or workload
  isolation. Strict NUMA scope can also be used to match the workqueue
  behavior of older kernels.
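
For example, assuming a workqueue created with ``WQ_SYSFS`` set and named
``my_wq`` (an illustrative name), the files above can be used as follows. ::

  $ cat /sys/devices/virtual/workqueue/my_wq/affinity_scope
  default (cache)
  $ echo numa > /sys/devices/virtual/workqueue/my_wq/affinity_scope
  $ echo 1 > /sys/devices/virtual/workqueue/my_wq/affinity_strict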


Affinity Scopes and Performance
===============================

It'd be ideal if an unbound workqueue's behavior were optimal for the vast
majority of use cases without further tuning. Unfortunately, in the current
kernel, there exists a pronounced trade-off between locality and utilization
necessitating explicit configurations when workqueues are heavily used.

Higher locality leads to higher efficiency where more work is performed for
the same number of consumed CPU cycles. However, higher locality may also
cause lower overall system utilization if the work items are not spread
enough across the affinity scopes by the issuers. The following performance
testing with dm-crypt clearly illustrates this trade-off.

The tests are run on a CPU with 12-cores/24-threads split across four L3
caches (AMD Ryzen 9 3900x). CPU clock boost is turned off for consistency.
``/dev/dm-0`` is a dm-crypt device created on NVME SSD (Samsung 990 PRO) and
opened with ``cryptsetup`` with default settings.


Scenario 1: Enough issuers and work spread across the machine
-------------------------------------------------------------

The command used: ::

  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k --ioengine=libaio \
    --iodepth=64 --runtime=60 --numjobs=24 --time_based --group_reporting \
    --name=iops-test-job --verify=sha512

There are 24 issuers, each issuing 64 IOs concurrently. ``--verify=sha512``
makes ``fio`` generate and read back the content each time which makes
execution locality matter between the issuer and ``kcryptd``. The following
are the read bandwidths and CPU utilizations depending on different affinity
scope settings on ``kcryptd`` measured over five runs. Bandwidths are in
MiBps, and CPU util in percents.

.. list-table::
   :widths: 16 20 20
   :header-rows: 1

   * - Affinity
     - Bandwidth (MiBps)
     - CPU util (%)

   * - system
     - 1159.40 ±1.34
     - 99.31 ±0.02

   * - cache
     - 1166.40 ±0.89
     - 99.34 ±0.01

   * - cache (strict)
     - 1166.00 ±0.71
     - 99.35 ±0.01

With enough issuers spread across the system, there is no downside to
"cache", strict or otherwise. All three configurations saturate the whole
machine but the cache-affine ones outperform by 0.6% thanks to improved
locality.


Scenario 2: Fewer issuers, enough work for saturation
-----------------------------------------------------

The command used: ::

  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
    --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 \
    --time_based --group_reporting --name=iops-test-job --verify=sha512

The only difference from the previous scenario is ``--numjobs=8``. There are
a third as many issuers but there is still enough total work to saturate the
system.

.. list-table::
   :widths: 16 20 20
   :header-rows: 1

   * - Affinity
     - Bandwidth (MiBps)
     - CPU util (%)

   * - system
     - 1155.40 ±0.89
     - 97.41 ±0.05

   * - cache
     - 1154.40 ±1.14
     - 96.15 ±0.09

   * - cache (strict)
     - 1112.00 ±4.64
     - 93.26 ±0.35

This is more than enough work to saturate the system. Both "system" and
"cache" are nearly saturating the machine but not fully. "cache" is using
less CPU but the better efficiency puts it at the same bandwidth as
"system".

Eight issuers moving around over four L3 cache scopes still allow "cache
(strict)" to mostly saturate the machine but the loss of work conservation
is now starting to hurt with a 3.7% bandwidth loss.


Scenario 3: Even fewer issuers, not enough work to saturate
-----------------------------------------------------------

The command used: ::

  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
    --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 \
    --time_based --group_reporting --name=iops-test-job --verify=sha512

Again, the only difference is ``--numjobs=4``. With the number of issuers
reduced to four, there now isn't enough work to saturate the whole system
and the bandwidth becomes dependent on completion latencies.

.. list-table::
   :widths: 16 20 20
   :header-rows: 1

   * - Affinity
     - Bandwidth (MiBps)
     - CPU util (%)

   * - system
     - 993.60 ±1.82
     - 75.49 ±0.06

   * - cache
     - 973.40 ±1.52
     - 74.90 ±0.07

   * - cache (strict)
     - 828.20 ±4.49
     - 66.84 ±0.29

Now, the tradeoff between locality and utilization is clearer. "cache" shows
a 2% bandwidth loss compared to "system" and "cache (strict)" a whopping
20%.


Conclusion and Recommendations
------------------------------

In the above experiments, the efficiency advantage of the "cache" affinity
scope over "system" is, while consistent and noticeable, small. However, the
impact is dependent on the distances between the scopes and may be more
pronounced in processors with more complex topologies.

While the loss of work-conservation in certain scenarios hurts, it is a lot
better than "cache (strict)" and maximizing workqueue utilization is
unlikely to be the common case anyway. As such, "cache" is the default
affinity scope for unbound pools.

* As there is no one option which is great for most cases, workqueue usages
  that may consume a significant amount of CPU are recommended to configure
  the workqueues using ``apply_workqueue_attrs()`` and/or enable
  ``WQ_SYSFS``.

* An unbound workqueue with strict "cpu" affinity scope behaves the same as
  a ``WQ_CPU_INTENSIVE`` per-cpu workqueue. There is no real advantage to
  the latter and an unbound workqueue provides a lot more flexibility.

* Affinity scopes are introduced in Linux v6.5. To emulate the previous
  behavior, use strict "numa" affinity scope.

* The loss of work-conservation in non-strict affinity scopes is likely
  originating from the scheduler. There is no theoretical reason why the
  kernel wouldn't be able to do the right thing and maintain
  work-conservation in most cases. As such, it is possible that future
  scheduler improvements may make most of these tunables unnecessary.


Examining Configuration
=======================

Use tools/workqueue/wq_dump.py to examine unbound CPU affinity
configuration, worker pools and how workqueues map to the pools: ::

  $ tools/workqueue/wq_dump.py
  Affinity Scopes
  ===============
  wq_unbound_cpumask=0000000f

  CPU
    nr_pods  4
    pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
    pod_node [0]=0 [1]=0 [2]=1 [3]=1
    cpu_pod  [0]=0 [1]=1 [2]=2 [3]=3

  SMT
    nr_pods  4
    pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
    pod_node [0]=0 [1]=0 [2]=1 [3]=1
    cpu_pod  [0]=0 [1]=1 [2]=2 [3]=3

  CACHE (default)
    nr_pods  2
    pod_cpus [0]=00000003 [1]=0000000c
    pod_node [0]=0 [1]=1
    cpu_pod  [0]=0 [1]=0 [2]=1 [3]=1

  NUMA
    nr_pods  2
    pod_cpus [0]=00000003 [1]=0000000c
    pod_node [0]=0 [1]=1
    cpu_pod  [0]=0 [1]=0 [2]=1 [3]=1

  SYSTEM
    nr_pods  1
    pod_cpus [0]=0000000f
    pod_node [0]=-1
    cpu_pod  [0]=0 [1]=0 [2]=0 [3]=0

  Worker Pools
  ============
  pool[00] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  0
  pool[01] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  0
  pool[02] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  1
  pool[03] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  1
  pool[04] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  2
  pool[05] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  2
  pool[06] ref= 1 nice=  0 idle/workers=  3/  3 cpu=  3
  pool[07] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  3
  pool[08] ref=42 nice=  0 idle/workers=  6/  6 cpus=0000000f
  pool[09] ref=28 nice=  0 idle/workers=  3/  3 cpus=00000003
  pool[10] ref=28 nice=  0 idle/workers= 17/ 17 cpus=0000000c
  pool[11] ref= 1 nice=-20 idle/workers=  1/  1 cpus=0000000f
  pool[12] ref= 2 nice=-20 idle/workers=  1/  1 cpus=00000003
  pool[13] ref= 2 nice=-20 idle/workers=  1/  1 cpus=0000000c

  Workqueue CPU -> pool
  =====================
  [    workqueue \ CPU              0  1  2  3 dfl]
  events                   percpu   0  2  4  6
  events_highpri           percpu   1  3  5  7
  events_long              percpu   0  2  4  6
  events_unbound           unbound  9  9 10 10  8
  events_freezable         percpu   0  2  4  6
  events_power_efficient   percpu   0  2  4  6
  events_freezable_pwr_ef  percpu   0  2  4  6
  rcu_gp                   percpu   0  2  4  6
  rcu_par_gp               percpu   0  2  4  6
  slub_flushwq             percpu   0  2  4  6
  netns                    ordered  8  8  8  8  8
  ...

See the command's help message for more info.


Monitoring
==========

Use tools/workqueue/wq_monitor.py to monitor workqueue operations: ::

  $ tools/workqueue/wq_monitor.py events
                              total  infl  CPUtime  CPUhog  CMW/RPR  mayday  rescued
  events                      18545     0      6.1       0        5       -        -
  events_highpri                  8     0      0.0       0        0       -        -
  events_long                     3     0      0.0       0        0       -        -
  events_unbound              38306     0      0.1       -        7       -        -
  events_freezable                0     0      0.0       0        0       -        -
  events_power_efficient      29598     0      0.2       0        0       -        -
  events_freezable_pwr_ef        10     0      0.0       0        0       -        -
  sock_diag_events                0     0      0.0       0        0       -        -

                              total  infl  CPUtime  CPUhog  CMW/RPR  mayday  rescued
  events                      18548     0      6.1       0        5       -        -
  events_highpri                  8     0      0.0       0        0       -        -
  events_long                     3     0      0.0       0        0       -        -
  events_unbound              38322     0      0.1       -        7       -        -
  events_freezable                0     0      0.0       0        0       -        -
  events_power_efficient      29603     0      0.2       0        0       -        -
  events_freezable_pwr_ef        10     0      0.0       0        0       -        -
  sock_diag_events                0     0      0.0       0        0       -        -

  ...

See the command's help message for more info.


Debugging
=========

Because the work functions are executed by generic worker threads
there are a few tricks needed to shed some light on misbehaving
workqueue users.

Worker threads show up in the process list as: ::

  root      5671  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/0:1]
  root      5672  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/1:2]
  root      5673  0.0  0.0      0     0 ?        S    12:12   0:00 [kworker/0:0]
  root      5674  0.0  0.0      0     0 ?        S    12:13   0:00 [kworker/1:0]

If kworkers are going crazy (using too much CPU), there are two types
of possible problems:

  1. Something being scheduled in rapid succession
  2. A single work item that consumes lots of CPU cycles

The first one can be tracked using tracing: ::

  $ echo workqueue:workqueue_queue_work > /sys/kernel/tracing/set_event
  $ cat /sys/kernel/tracing/trace_pipe > out.txt
  (wait a few secs)
  ^C

If something is busy looping on work queueing, it would be dominating
the output and the offender can be determined with the work item
function.

For the second type of problem it should be possible to just check
the stack trace of the offending worker thread. ::

  $ cat /proc/THE_OFFENDING_KWORKER/stack

The work item's function should be trivially visible in the stack
trace.


Non-reentrance Conditions
=========================

Workqueue guarantees that a work item cannot be re-entrant if the
following conditions hold after a work item gets queued:

  1. The work function hasn't been changed.
  2. No one queues the work item to another workqueue.
  3. The work item hasn't been reinitiated.

In other words, if the above conditions hold, the work item is guaranteed to
be executed by at most one worker system-wide at any given time.

Note that requeuing the work item (to the same queue) from within its own
work function doesn't break these conditions, so it's safe to do (see the
sketch below). Otherwise, caution is required when breaking the conditions
inside a work function.
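
A small sketch of the safe self-requeueing pattern mentioned above follows;
``my_poll_wq`` and the handler name are illustrative. ::

  static struct workqueue_struct *my_poll_wq;     /* assumed allocated */

  static void my_poll_fn(struct work_struct *work)
  {
          /* ... do one round of polling ... */

          /*
           * Requeueing to the same workqueue from within the work
           * function doesn't break the conditions above, so at most
           * one instance keeps executing system-wide.
           */
          queue_work(my_poll_wq, work);
  }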


Kernel Inline Documentation Reference
=====================================

.. kernel-doc:: include/linux/workqueue.h

.. kernel-doc:: kernel/workqueue.c