=========
Workqueue
=========

:Date: September, 2010
:Author: Tejun Heo <[email protected]>
:Author: Florian Mickler <[email protected]>


Introduction
============

There are many cases where an asynchronous process execution context
is needed and the workqueue (wq) API is the most commonly used
mechanism for such cases.

When such an asynchronous execution context is needed, a work item
describing which function to execute is put on a queue.  An
independent thread serves as the asynchronous execution context.  The
queue is called workqueue and the thread is called worker.

While there are work items on the workqueue the worker executes the
functions associated with the work items one after the other.  When
there is no work item left on the workqueue the worker becomes idle.
When a new work item gets queued, the worker begins executing again.


Why Concurrency Managed Workqueue?
==================================

In the original wq implementation, a multi threaded (MT) wq had one
worker thread per CPU and a single threaded (ST) wq had one worker
thread system-wide.  A single MT wq needed to keep around the same
number of workers as the number of CPUs.  The kernel grew a lot of MT
wq users over the years and with the number of CPU cores continuously
rising, some systems saturated the default 32k PID space just booting
up.

Although MT wq wasted a lot of resources, the level of concurrency
provided was unsatisfactory.  The limitation was common to both ST and
MT wq albeit less severe on MT.  Each wq maintained its own separate
worker pool.  An MT wq could provide only one execution context per CPU
while an ST wq one for the whole system.  Work items had to compete for
those very limited execution contexts leading to various problems
including proneness to deadlocks around the single execution context.

The tension between the provided level of concurrency and resource
usage also forced its users to make unnecessary tradeoffs like libata
choosing to use ST wq for polling PIOs and accepting an unnecessary
limitation that no two polling PIOs can progress at the same time.  As
MT wq don't provide much better concurrency, users that require a
higher level of concurrency, like async or fscache, had to implement
their own thread pool.

Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with
focus on the following goals.

* Maintain compatibility with the original workqueue API.

* Use per-CPU unified worker pools shared by all wq to provide a
  flexible level of concurrency on demand without wasting a lot of
  resources.

* Automatically regulate worker pool and level of concurrency so that
  the API users don't need to worry about such details.


The Design
==========

In order to ease the asynchronous execution of functions a new
abstraction, the work item, is introduced.

A work item is a simple struct that holds a pointer to the function
that is to be executed asynchronously.  Whenever a driver or subsystem
wants a function to be executed asynchronously it has to set up a work
item pointing to that function and queue that work item on a
workqueue.

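As a brief sketch of this pattern (the function and work item names here
are purely illustrative), a driver can define a work item and hand it to
a workqueue::

  #include <linux/workqueue.h>

  /* The function that will be executed asynchronously by a worker. */
  static void my_work_fn(struct work_struct *work)
  {
          pr_info("my_work executed\n");
  }

  /* A work item pointing to my_work_fn. */
  static DECLARE_WORK(my_work, my_work_fn);

  /* Queue it; schedule_work() uses the system workqueue. */
  schedule_work(&my_work);
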
A work item can be executed in either a thread or the BH (softirq) context.

For threaded workqueues, special purpose threads, called [k]workers, execute
the functions off of the queue, one after the other. If no work is queued,
the worker threads become idle. These worker threads are managed in
worker-pools.

The cmwq design differentiates between the user-facing workqueues that
subsystems and drivers queue work items on and the backend mechanism
which manages worker-pools and processes the queued work items.

There are two worker-pools, one for normal work items and the other
for high priority ones, for each possible CPU and some extra
worker-pools to serve work items queued on unbound workqueues - the
number of these backing pools is dynamic.

BH workqueues use the same framework. However, as there can only be one
concurrent execution context, there's no need to worry about concurrency.
Each per-CPU BH worker pool contains only one pseudo worker which represents
the BH execution context. A BH workqueue can be considered a convenience
interface to softirq.

Subsystems and drivers can create and queue work items through special
workqueue API functions as they see fit. They can influence some
aspects of the way the work items are executed by setting flags on the
workqueue they are putting the work item on. These flags include
things like CPU locality, concurrency limits, priority and more.  To
get a detailed overview refer to the API description of
``alloc_workqueue()`` below.

When a work item is queued to a workqueue, the target worker-pool is
determined according to the queue parameters and workqueue attributes
and appended on the shared worklist of the worker-pool.  For example,
unless specifically overridden, a work item of a bound workqueue will
be queued on the worklist of either the normal or highpri worker-pool
that is associated with the CPU the issuer is running on.

For any thread pool implementation, managing the concurrency level
(how many execution contexts are active) is an important issue.  cmwq
tries to keep the concurrency at a minimal but sufficient level.
Minimal to save resources and sufficient in that the system is used at
its full capacity.

Each worker-pool bound to an actual CPU implements concurrency
management by hooking into the scheduler.  The worker-pool is notified
whenever an active worker wakes up or sleeps and keeps track of the
number of currently runnable workers.  Generally, work items are
not expected to hog a CPU and consume many cycles.  That means
maintaining just enough concurrency to prevent work processing from
stalling should be optimal.  As long as there are one or more runnable
workers on the CPU, the worker-pool doesn't start execution of a new
work, but, when the last running worker goes to sleep, it immediately
schedules a new worker so that the CPU doesn't sit idle while there
are pending work items.  This allows using a minimal number of workers
without losing execution bandwidth.

Keeping idle workers around doesn't cost anything other than the memory
space for kthreads, so cmwq holds onto idle ones for a while before
killing them.

For unbound workqueues, the number of backing pools is dynamic.
Unbound workqueues can be assigned custom attributes using
``apply_workqueue_attrs()`` and workqueue will automatically create
backing worker pools matching the attributes.  The responsibility of
regulating the concurrency level is on the users.  There is also a flag
to mark a bound wq to ignore the concurrency management.  Please refer
to the API section for details.

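As a minimal sketch (assuming a recent kernel where
``alloc_workqueue_attrs()`` takes no arguments; the nice value chosen is
arbitrary), custom attributes can be applied like this::

  struct workqueue_attrs *attrs;
  int ret;

  attrs = alloc_workqueue_attrs();
  if (!attrs)
          return -ENOMEM;

  attrs->nice = -10;      /* run the backing workers at an elevated priority */
  ret = apply_workqueue_attrs(unbound_wq, attrs);
  free_workqueue_attrs(attrs);
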
Forward progress guarantee relies on workers being creatable when
more execution contexts are necessary, which in turn is guaranteed
through the use of rescue workers.  All work items which might be used
on code paths that handle memory reclaim are required to be queued on
wq's that have a rescue-worker reserved for execution under memory
pressure.  Else it is possible that the worker-pool deadlocks waiting
for execution contexts to free up.


Application Programming Interface (API)
=======================================

``alloc_workqueue()`` allocates a wq.  The original
``create_*workqueue()`` functions are deprecated and scheduled for
removal.  ``alloc_workqueue()`` takes three arguments - ``@name``,
``@flags`` and ``@max_active``.  ``@name`` is the name of the wq and
is also used as the name of the rescuer thread if there is one.

A wq no longer manages execution resources but serves as a domain for
forward progress guarantee, flush and work item attributes. ``@flags``
and ``@max_active`` control how work items are assigned execution
resources, scheduled and executed.

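For example (the workqueue name and flag combination here are purely
illustrative), allocation and teardown look like::

  struct workqueue_struct *my_wq;

  /* An unbound, freezable wq using the default @max_active (0). */
  my_wq = alloc_workqueue("my_wq", WQ_UNBOUND | WQ_FREEZABLE, 0);
  if (!my_wq)
          return -ENOMEM;

  /* ... INIT_WORK() and queue_work() on my_wq ... */

  destroy_workqueue(my_wq);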

``flags``
---------

``WQ_BH``
  BH workqueues can be considered a convenience interface to softirq. BH
  workqueues are always per-CPU and all BH work items are executed in the
  queueing CPU's softirq context in the queueing order.

  All BH workqueues must have 0 ``max_active`` and ``WQ_HIGHPRI`` is the
  only allowed additional flag.

  BH work items cannot sleep. All other features such as delayed queueing,
  flushing and canceling are supported.

``WQ_UNBOUND``
  Work items queued to an unbound wq are served by the special
  worker-pools which host workers which are not bound to any
  specific CPU.  This makes the wq behave as a simple execution
  context provider without concurrency management.  The unbound
  worker-pools try to start execution of work items as soon as
  possible.  Unbound wq sacrifices locality but is useful for
  the following cases.

  * Wide fluctuation in the concurrency level requirement is
    expected and using a bound wq may end up creating a large number
    of mostly unused workers across different CPUs as the issuer
    hops through different CPUs.

  * Long running CPU intensive workloads which can be better
    managed by the system scheduler.

``WQ_FREEZABLE``
  A freezable wq participates in the freeze phase of the system
  suspend operations.  Work items on the wq are drained and no
  new work item starts execution until thawed.

``WQ_MEM_RECLAIM``
  All wq which might be used in the memory reclaim paths **MUST**
  have this flag set.  The wq is guaranteed to have at least one
  execution context regardless of memory pressure.

``WQ_HIGHPRI``
  Work items of a highpri wq are queued to the highpri
  worker-pool of the target cpu.  Highpri worker-pools are
  served by worker threads with elevated nice level.

  Note that normal and highpri worker-pools don't interact with
  each other.  Each maintains its separate pool of workers and
  implements concurrency management among its workers.

``WQ_CPU_INTENSIVE``
  Work items of a CPU intensive wq do not contribute to the
  concurrency level.  In other words, runnable CPU intensive
  work items will not prevent other work items in the same
  worker-pool from starting execution.  This is useful for bound
  work items which are expected to hog CPU cycles so that their
  execution is regulated by the system scheduler.

  Although CPU intensive work items don't contribute to the
  concurrency level, start of their executions is still
  regulated by the concurrency management and runnable
  non-CPU-intensive work items can delay execution of CPU
  intensive work items.

  This flag is meaningless for unbound wq.


``max_active``
--------------

``@max_active`` determines the maximum number of execution contexts per
CPU which can be assigned to the work items of a wq. For example, with
``@max_active`` of 16, at most 16 work items of the wq can be executing
at the same time per CPU. This is always a per-CPU attribute, even for
unbound workqueues.

The maximum limit for ``@max_active`` is 2048 and the default value used
when 0 is specified is 1024. These values are chosen sufficiently high
such that they are not the limiting factor while providing protection in
runaway cases.

The number of active work items of a wq is usually regulated by the
users of the wq, more specifically, by how many work items the users
may queue at the same time.  Unless there is a specific need for
throttling the number of active work items, specifying '0' is
recommended.

Some users depend on strict execution ordering where only one work item
is in flight at any given time and the work items are processed in
queueing order. While a combination of ``@max_active`` of 1 and
``WQ_UNBOUND`` used to achieve this behavior, this is no longer the
case. Use ``alloc_ordered_workqueue()`` instead.

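For example (the name and flag are illustrative), an ordered wq that
executes at most one work item at a time, in queueing order::

  struct workqueue_struct *ordered_wq;

  ordered_wq = alloc_ordered_workqueue("my_ordered_wq", WQ_MEM_RECLAIM);
  if (!ordered_wq)
          return -ENOMEM;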

Example Execution Scenarios
===========================

The following example execution scenarios try to illustrate how cmwq
behaves under different configurations.

 Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU.
 w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms
 again before finishing.  w1 and w2 burn CPU for 5ms then sleep for
 10ms.

Ignoring all other tasks, works and processing overhead, and assuming
simple FIFO scheduling, the following is one highly simplified version
of possible sequences of events with the original wq. ::

 TIME IN MSECS	EVENT
 0		w0 starts and burns CPU
 5		w0 sleeps
 15		w0 wakes up and burns CPU
 20		w0 finishes
 20		w1 starts and burns CPU
 25		w1 sleeps
 35		w1 wakes up and finishes
 35		w2 starts and burns CPU
 40		w2 sleeps
 50		w2 wakes up and finishes

And with cmwq with ``@max_active`` >= 3, ::

 TIME IN MSECS	EVENT
 0		w0 starts and burns CPU
 5		w0 sleeps
 5		w1 starts and burns CPU
 10		w1 sleeps
 10		w2 starts and burns CPU
 15		w2 sleeps
 15		w0 wakes up and burns CPU
 20		w0 finishes
 20		w1 wakes up and finishes
 25		w2 wakes up and finishes

If ``@max_active`` == 2, ::

 TIME IN MSECS	EVENT
 0		w0 starts and burns CPU
 5		w0 sleeps
 5		w1 starts and burns CPU
 10		w1 sleeps
 15		w0 wakes up and burns CPU
 20		w0 finishes
 20		w1 wakes up and finishes
 20		w2 starts and burns CPU
 25		w2 sleeps
 35		w2 wakes up and finishes

Now, let's assume w1 and w2 are queued to a different wq q1 which has
``WQ_CPU_INTENSIVE`` set, ::

 TIME IN MSECS	EVENT
 0		w0 starts and burns CPU
 5		w0 sleeps
 5		w1 and w2 start and burn CPU
 10		w1 sleeps
 15		w2 sleeps
 15		w0 wakes up and burns CPU
 20		w0 finishes
 20		w1 wakes up and finishes
 25		w2 wakes up and finishes


Guidelines
==========

* Do not forget to use ``WQ_MEM_RECLAIM`` if a wq may process work
  items which are used during memory reclaim.  Each wq with
  ``WQ_MEM_RECLAIM`` set has an execution context reserved for it.  If
  there is a dependency among multiple work items used during memory
  reclaim, they should be queued to separate wq's, each with
  ``WQ_MEM_RECLAIM``.

* Unless strict ordering is required, there is no need to use ST wq.

* Unless there is a specific need, using 0 for ``@max_active`` is
  recommended.  In most use cases, the concurrency level usually stays
  well under the default limit.

* A wq serves as a domain for forward progress guarantee
  (``WQ_MEM_RECLAIM``), flush and work item attributes.  Work items
  which are not involved in memory reclaim and don't need to be
  flushed as a part of a group of work items, and don't require any
  special attribute, can use one of the system wq.  There is no
  difference in execution characteristics between using a dedicated wq
  and a system wq.

  Note: If something may generate more than ``@max_active`` outstanding
  work items (do stress test your producers), it may saturate a system
  wq and potentially lead to deadlock. It should utilize its own
  dedicated workqueue rather than the system wq.

* Unless work items are expected to consume a huge amount of CPU
  cycles, using a bound wq is usually beneficial due to the increased
  level of locality in wq operations and work item execution.

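As a sketch of the ``WQ_MEM_RECLAIM`` guideline above (the workqueue
names are hypothetical), two interdependent work items used during
reclaim each get their own rescuer-backed wq::

  struct workqueue_struct *wq_a, *wq_b;

  /*
   * Work queued on wq_a may wait on work queued on wq_b during
   * memory reclaim, so each gets its own WQ_MEM_RECLAIM wq and
   * therefore its own reserved rescuer.
   */
  wq_a = alloc_workqueue("reclaim_a", WQ_MEM_RECLAIM, 0);
  wq_b = alloc_workqueue("reclaim_b", WQ_MEM_RECLAIM, 0);
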

Affinity Scopes
===============

An unbound workqueue groups CPUs according to its affinity scope to improve
cache locality. For example, if a workqueue is using the default affinity
scope of "cache", it will group CPUs according to last level cache
boundaries. A work item queued on the workqueue will be assigned to a worker
on one of the CPUs which share the last level cache with the issuing CPU.
Once started, the worker may or may not be allowed to move outside the scope
depending on the ``affinity_strict`` setting of the scope.

Workqueue currently supports the following affinity scopes.

``default``
  Use the scope in module parameter ``workqueue.default_affinity_scope``
  which is always set to one of the scopes below.

``cpu``
  CPUs are not grouped. A work item issued on one CPU is processed by a
  worker on the same CPU. This makes unbound workqueues behave as per-cpu
  workqueues without concurrency management.

``smt``
  CPUs are grouped according to SMT boundaries. This usually means that the
  logical threads of each physical CPU core are grouped together.

``cache``
  CPUs are grouped according to cache boundaries. Which specific cache
  boundary is used is determined by the arch code. L3 is used in a lot of
  cases. This is the default affinity scope.

``numa``
  CPUs are grouped according to NUMA boundaries.

``system``
  All CPUs are put in the same group. Workqueue makes no effort to process a
  work item on a CPU close to the issuing CPU.

The default affinity scope can be changed with the module parameter
``workqueue.default_affinity_scope`` and a specific workqueue's affinity
scope can be changed using ``apply_workqueue_attrs()``.

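A minimal sketch of the latter (the field and constant names are as found
in recent kernels; verify them against your kernel version)::

  struct workqueue_attrs *attrs;

  attrs = alloc_workqueue_attrs();
  if (attrs) {
          attrs->affn_scope = WQ_AFFN_NUMA;  /* group CPUs by NUMA node */
          apply_workqueue_attrs(unbound_wq, attrs);
          free_workqueue_attrs(attrs);
  }
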
41263c5484eSTejun HeoIf ``WQ_SYSFS`` is set, the workqueue will have the following affinity scope
413bd9e7326SWangJinchaorelated interface files under its ``/sys/devices/virtual/workqueue/WQ_NAME/``
41463c5484eSTejun Heodirectory.
41563c5484eSTejun Heo
41663c5484eSTejun Heo``affinity_scope``
41763c5484eSTejun Heo  Read to see the current affinity scope. Write to change.
41863c5484eSTejun Heo
419523a301eSTejun Heo  When default is the current scope, reading this file will also show the
420523a301eSTejun Heo  current effective scope in parentheses, for example, ``default (cache)``.
421523a301eSTejun Heo
4228639ecebSTejun Heo``affinity_strict``
4238639ecebSTejun Heo  0 by default indicating that affinity scopes are not strict. When a work
4248639ecebSTejun Heo  item starts execution, workqueue makes a best-effort attempt to ensure
4258639ecebSTejun Heo  that the worker is inside its affinity scope, which is called
4268639ecebSTejun Heo  repatriation. Once started, the scheduler is free to move the worker
4278639ecebSTejun Heo  anywhere in the system as it sees fit. This enables benefiting from scope
4288639ecebSTejun Heo  locality while still being able to utilize other CPUs if necessary and
4298639ecebSTejun Heo  available.
4308639ecebSTejun Heo
4318639ecebSTejun Heo  If set to 1, all workers of the scope are guaranteed always to be in the
4328639ecebSTejun Heo  scope. This may be useful when crossing affinity scopes has other
4338639ecebSTejun Heo  implications, for example, in terms of power consumption or workload
4348639ecebSTejun Heo  isolation. Strict NUMA scope can also be used to match the workqueue
4358639ecebSTejun Heo  behavior of older kernels.
4368639ecebSTejun Heo
43763c5484eSTejun Heo
4387dbf15c5STejun HeoAffinity Scopes and Performance
4397dbf15c5STejun Heo===============================
4407dbf15c5STejun Heo
4417dbf15c5STejun HeoIt'd be ideal if an unbound workqueue's behavior is optimal for vast
4427dbf15c5STejun Heomajority of use cases without further tuning. Unfortunately, in the current
4437dbf15c5STejun Heokernel, there exists a pronounced trade-off between locality and utilization
4447dbf15c5STejun Heonecessitating explicit configurations when workqueues are heavily used.
4457dbf15c5STejun Heo
4467dbf15c5STejun HeoHigher locality leads to higher efficiency where more work is performed for
4477dbf15c5STejun Heothe same number of consumed CPU cycles. However, higher locality may also
4487dbf15c5STejun Heocause lower overall system utilization if the work items are not spread
4497dbf15c5STejun Heoenough across the affinity scopes by the issuers. The following performance
4507dbf15c5STejun Heotesting with dm-crypt clearly illustrates this trade-off.
4517dbf15c5STejun Heo
4527dbf15c5STejun HeoThe tests are run on a CPU with 12-cores/24-threads split across four L3
4537dbf15c5STejun Heocaches (AMD Ryzen 9 3900x). CPU clock boost is turned off for consistency.
4547dbf15c5STejun Heo``/dev/dm-0`` is a dm-crypt device created on NVME SSD (Samsung 990 PRO) and
4557dbf15c5STejun Heoopened with ``cryptsetup`` with default settings.
4567dbf15c5STejun Heo
4577dbf15c5STejun Heo
4587dbf15c5STejun HeoScenario 1: Enough issuers and work spread across the machine
4597dbf15c5STejun Heo-------------------------------------------------------------
4607dbf15c5STejun Heo
4617dbf15c5STejun HeoThe command used: ::
4627dbf15c5STejun Heo
4637dbf15c5STejun Heo  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k --ioengine=libaio \
4647dbf15c5STejun Heo    --iodepth=64 --runtime=60 --numjobs=24 --time_based --group_reporting \
4657dbf15c5STejun Heo    --name=iops-test-job --verify=sha512
4667dbf15c5STejun Heo
4677dbf15c5STejun HeoThere are 24 issuers, each issuing 64 IOs concurrently. ``--verify=sha512``
4687dbf15c5STejun Heomakes ``fio`` generate and read back the content each time which makes
46922160b08Sattreyee-mukexecution locality matter between the issuer and ``kcryptd``. The following
4707dbf15c5STejun Heoare the read bandwidths and CPU utilizations depending on different affinity
4717dbf15c5STejun Heoscope settings on ``kcryptd`` measured over five runs. Bandwidths are in
4727dbf15c5STejun HeoMiBps, and CPU util in percents.

.. list-table::
   :widths: 16 20 20
   :header-rows: 1

   * - Affinity
     - Bandwidth (MiBps)
     - CPU util (%)

   * - system
     - 1159.40 ±1.34
     - 99.31 ±0.02

   * - cache
     - 1166.40 ±0.89
     - 99.34 ±0.01

   * - cache (strict)
     - 1166.00 ±0.71
     - 99.35 ±0.01

With enough issuers spread across the system, there is no downside to
"cache", strict or otherwise. All three configurations saturate the whole
machine but the cache-affine ones outperform by 0.6% thanks to improved
locality.


Scenario 2: Fewer issuers, enough work for saturation
-----------------------------------------------------

The command used: ::

  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
    --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 \
    --time_based --group_reporting --name=iops-test-job --verify=sha512
The only difference from the previous scenario is ``--numjobs=8``. There are
only a third as many issuers, but there is still enough total work to
saturate the system.

.. list-table::
   :widths: 16 20 20
   :header-rows: 1

   * - Affinity
     - Bandwidth (MiBps)
     - CPU util (%)

   * - system
     - 1155.40 ±0.89
     - 97.41 ±0.05

   * - cache
     - 1154.40 ±1.14
     - 96.15 ±0.09

   * - cache (strict)
     - 1112.00 ±4.64
     - 93.26 ±0.35
This is still enough work to keep the system busy. Both "system" and
"cache" nearly saturate the machine, though not fully. "cache" uses less
CPU, and its better efficiency puts it at the same bandwidth as "system".

Eight issuers moving around over four L3 cache scopes still allow "cache
(strict)" to mostly saturate the machine, but the loss of work-conservation
is now starting to hurt, with a 3.7% bandwidth loss.


Scenario 3: Even fewer issuers, not enough work to saturate
-----------------------------------------------------------

The command used: ::

  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
    --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 \
    --time_based --group_reporting --name=iops-test-job --verify=sha512

Again, the only difference is ``--numjobs=4``. With the number of issuers
reduced to four, there now isn't enough work to saturate the whole system
and the bandwidth becomes dependent on completion latencies.

.. list-table::
   :widths: 16 20 20
   :header-rows: 1

   * - Affinity
     - Bandwidth (MiBps)
     - CPU util (%)

   * - system
     - 993.60 ±1.82
     - 75.49 ±0.06

   * - cache
     - 973.40 ±1.52
     - 74.90 ±0.07

   * - cache (strict)
     - 828.20 ±4.49
     - 66.84 ±0.29
Now, the tradeoff between locality and utilization is clearer. "cache" shows
a 2% bandwidth loss compared to "system", and "cache (strict)" a whopping 20%.


Conclusion and Recommendations
------------------------------

In the above experiments, the efficiency advantage of the "cache" affinity
scope over "system" is, while consistent and noticeable, small. However, the
impact is dependent on the distances between the scopes and may be more
pronounced in processors with more complex topologies.

While "cache" loses work-conservation in certain scenarios, it still does a
lot better than "cache (strict)", and maximizing workqueue utilization is
unlikely to be the common case anyway. As such, "cache" is the default
affinity scope for unbound pools.

* As there is no one option which is great for most cases, workqueue usages
  that may consume a significant amount of CPU are recommended to configure
  the workqueues using ``apply_workqueue_attrs()`` and/or enable
  ``WQ_SYSFS``.

* An unbound workqueue with strict "cpu" affinity scope behaves the same as
  a ``WQ_CPU_INTENSIVE`` per-cpu workqueue. There is no real advantage to
  the latter and an unbound workqueue provides a lot more flexibility.

* Affinity scopes were introduced in Linux v6.5. To emulate the previous
  behavior, use the strict "numa" affinity scope.

* The loss of work-conservation in non-strict affinity scopes likely
  originates from the scheduler. There is no theoretical reason why the
  kernel wouldn't be able to do the right thing and maintain
  work-conservation in most cases. As such, it is possible that future
  scheduler improvements may make most of these tunables unnecessary.
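
For workqueues registered with ``WQ_SYSFS``, the scope can also be adjusted
at runtime through the ``affinity_scope`` and ``affinity_strict`` sysfs
attributes of unbound workqueues. The workqueue name ``writeback`` below is
only an example; the files exist only for ``WQ_SYSFS`` workqueues and the
exact output may differ by kernel version: ::

  # Show the current affinity scope of the workqueue.
  $ cat /sys/devices/virtual/workqueue/writeback/affinity_scope

  # Emulate the pre-v6.5 behavior: strict "numa" scope.
  $ echo numa > /sys/devices/virtual/workqueue/writeback/affinity_scope
  $ echo 1 > /sys/devices/virtual/workqueue/writeback/affinity_strict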


Examining Configuration
=======================

Use ``tools/workqueue/wq_dump.py`` to examine unbound CPU affinity
configuration, worker pools and how workqueues map to the pools: ::

  $ tools/workqueue/wq_dump.py
  Affinity Scopes
  ===============
  wq_unbound_cpumask=0000000f

  CPU
    nr_pods  4
    pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
    pod_node [0]=0 [1]=0 [2]=1 [3]=1
    cpu_pod  [0]=0 [1]=1 [2]=2 [3]=3

  SMT
    nr_pods  4
    pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
    pod_node [0]=0 [1]=0 [2]=1 [3]=1
    cpu_pod  [0]=0 [1]=1 [2]=2 [3]=3

  CACHE (default)
    nr_pods  2
    pod_cpus [0]=00000003 [1]=0000000c
    pod_node [0]=0 [1]=1
    cpu_pod  [0]=0 [1]=0 [2]=1 [3]=1

  NUMA
    nr_pods  2
    pod_cpus [0]=00000003 [1]=0000000c
    pod_node [0]=0 [1]=1
    cpu_pod  [0]=0 [1]=0 [2]=1 [3]=1

  SYSTEM
    nr_pods  1
    pod_cpus [0]=0000000f
    pod_node [0]=-1
    cpu_pod  [0]=0 [1]=0 [2]=0 [3]=0

  Worker Pools
  ============
  pool[00] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  0
  pool[01] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  0
  pool[02] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  1
  pool[03] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  1
  pool[04] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  2
  pool[05] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  2
  pool[06] ref= 1 nice=  0 idle/workers=  3/  3 cpu=  3
  pool[07] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  3
  pool[08] ref=42 nice=  0 idle/workers=  6/  6 cpus=0000000f
  pool[09] ref=28 nice=  0 idle/workers=  3/  3 cpus=00000003
  pool[10] ref=28 nice=  0 idle/workers= 17/ 17 cpus=0000000c
  pool[11] ref= 1 nice=-20 idle/workers=  1/  1 cpus=0000000f
  pool[12] ref= 2 nice=-20 idle/workers=  1/  1 cpus=00000003
  pool[13] ref= 2 nice=-20 idle/workers=  1/  1 cpus=0000000c

  Workqueue CPU -> pool
  =====================
  [    workqueue \ CPU              0  1  2  3 dfl]
  events                   percpu   0  2  4  6
  events_highpri           percpu   1  3  5  7
  events_long              percpu   0  2  4  6
  events_unbound           unbound  9  9 10 10  8
  events_freezable         percpu   0  2  4  6
  events_power_efficient   percpu   0  2  4  6
  events_freezable_pwr_ef  percpu   0  2  4  6
  rcu_gp                   percpu   0  2  4  6
  rcu_par_gp               percpu   0  2  4  6
  slub_flushwq             percpu   0  2  4  6
  netns                    ordered  8  8  8  8  8
  ...

See the command's help message for more info.


Monitoring
==========

Use ``tools/workqueue/wq_monitor.py`` to monitor workqueue operations: ::

  $ tools/workqueue/wq_monitor.py events
                              total  infl  CPUtime  CPUhog CMW/RPR  mayday rescued
  events                      18545     0      6.1       0       5       -       -
  events_highpri                  8     0      0.0       0       0       -       -
  events_long                     3     0      0.0       0       0       -       -
  events_unbound              38306     0      0.1       -       7       -       -
  events_freezable                0     0      0.0       0       0       -       -
  events_power_efficient      29598     0      0.2       0       0       -       -
  events_freezable_pwr_ef        10     0      0.0       0       0       -       -
  sock_diag_events                0     0      0.0       0       0       -       -

                              total  infl  CPUtime  CPUhog CMW/RPR  mayday rescued
  events                      18548     0      6.1       0       5       -       -
  events_highpri                  8     0      0.0       0       0       -       -
  events_long                     3     0      0.0       0       0       -       -
  events_unbound              38322     0      0.1       -       7       -       -
  events_freezable                0     0      0.0       0       0       -       -
  events_power_efficient      29603     0      0.2       0       0       -       -
  events_freezable_pwr_ef        10     0      0.0       0       0       -       -
  sock_diag_events                0     0      0.0       0       0       -       -

  ...

See the command's help message for more info.


Debugging
=========

Because the work functions are executed by generic worker threads
there are a few tricks needed to shed some light on misbehaving
workqueue users.

Worker threads show up in the process list as: ::

  root      5671  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/0:1]
  root      5672  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/1:2]
  root      5673  0.0  0.0      0     0 ?        S    12:12   0:00 [kworker/0:0]
  root      5674  0.0  0.0      0     0 ?        S    12:13   0:00 [kworker/1:0]

If kworkers are going crazy (using too much CPU), there are two types
of possible problems:

	1. Something being scheduled in rapid succession
	2. A single work item that consumes lots of CPU cycles

The first one can be tracked using tracing: ::

	$ echo workqueue:workqueue_queue_work > /sys/kernel/tracing/set_event
	$ cat /sys/kernel/tracing/trace_pipe > out.txt
	(wait a few secs)
	^C

If something is busy looping on work queueing, it will dominate the
output and the offender can be identified by its work item function.
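
The trace can get long. One quick way to find the dominant queuer is to
tally events per work function. The pipeline below is just a sketch; it
assumes the default ``workqueue_queue_work`` tracepoint output, which
includes a ``function=`` field:

```shell
# Count queue_work events per work function in the captured trace and
# print the most frequent ones first; the busiest queuer floats to the top.
grep -o 'function=[A-Za-z0-9_.]*' out.txt | sort | uniq -c | sort -rn | head
```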

For the second type of problem it should be possible to just check
the stack trace of the offending worker thread. ::

	$ cat /proc/THE_OFFENDING_KWORKER/stack

The work item's function should be trivially visible in the stack
trace.

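When several kworkers look suspicious, a small loop can snapshot all of
their stacks at once. This is only an illustrative helper, and reading
``/proc/<pid>/stack`` usually requires root: ::

	# Illustrative only: dump the stack of every kworker (run as root).
	$ for pid in $(pgrep kworker); do
	      echo "== kworker $pid =="
	      cat /proc/$pid/stack
	  done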

Non-reentrance Conditions
=========================

Workqueue guarantees that a work item cannot be re-entrant if the following
conditions hold after a work item gets queued:

        1. The work function hasn't been changed.
        2. No one queues the work item to another workqueue.
        3. The work item hasn't been reinitiated.

In other words, if the above conditions hold, the work item is guaranteed to be
executed by at most one worker system-wide at any given time.

Note that requeuing the work item (to the same queue) from within its own work
function doesn't break these conditions, so it's safe to do. Otherwise, caution
is required when breaking the conditions inside a work function.
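
For example, a work item that requeues itself from its own work function
stays within these conditions. The following is an illustrative
kernel-style sketch, not code from any particular user: ::

	static void poll_fn(struct work_struct *work)
	{
		/* ... do one unit of work ... */

		/*
		 * Requeueing to the same workqueue from the work function
		 * itself keeps the non-reentrance guarantee intact.
		 */
		queue_work(system_wq, work);
	}

	static DECLARE_WORK(poll_work, poll_fn);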


Kernel Inline Documentation Reference
=====================================

.. kernel-doc:: include/linux/workqueue.h

.. kernel-doc:: kernel/workqueue.c