1..  BSD LICENSE
2    Copyright(c) 2015 Intel Corporation. All rights reserved.
3    All rights reserved.
4
5    Redistribution and use in source and binary forms, with or without
6    modification, are permitted provided that the following conditions
7    are met:
8
9    * Re-distributions of source code must retain the above copyright
10    notice, this list of conditions and the following disclaimer.
11    * Redistributions in binary form must reproduce the above copyright
12    notice, this list of conditions and the following disclaimer in
13    the documentation and/or other materials provided with the
14    distribution.
15    * Neither the name of Intel Corporation nor the names of its
16    contributors may be used to endorse or promote products derived
17    from this software without specific prior written permission.
18
19    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
20    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
21    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
22    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
23    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
24    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
25    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
26    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
27    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
28    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30
31
32Performance Thread Sample Application
33=====================================
34
35The performance thread sample application is a derivative of the standard L3
36forwarding application that demonstrates different threading models.
37
38Overview
39--------
For a general description of the L3 forwarding application's capabilities,
please refer to the documentation of the standard application in
:doc:`l3_forward`.
43
44The performance thread sample application differs from the standard L3
45forwarding example in that it divides the TX and RX processing between
46different threads, and makes it possible to assign individual threads to
47different cores.
48
49Three threading models are considered:
50
51#. When there is one EAL thread per physical core.
52#. When there are multiple EAL threads per physical core.
53#. When there are multiple lightweight threads per EAL thread.
54
55Since DPDK release 2.0 it is possible to launch applications using the
56``--lcores`` EAL parameter, specifying cpu-sets for a physical core. With the
performance thread sample application it is now also possible to assign
58individual RX and TX functions to different cores.
59
60As an alternative to dividing the L3 forwarding work between different EAL
61threads the performance thread sample introduces the possibility to run the
62application threads as lightweight threads (L-threads) within one or
63more EAL threads.
64
65In order to facilitate this threading model the example includes a primitive
66cooperative scheduler (L-thread) subsystem. More details of the L-thread
67subsystem can be found in :ref:`lthread_subsystem`.
68
**Note:** Whilst theoretically possible, it is not anticipated that multiple
L-thread schedulers would be run on the same physical core. This mode of
operation should not be expected to yield useful performance and is considered
invalid.
73
74Compiling the Application
75-------------------------
The application is located in the ``performance-thread`` folder within the
sample applications folder.
78
#.  Go to the example applications folder:
80
81    .. code-block:: console
82
83       export RTE_SDK=/path/to/rte_sdk
84       cd ${RTE_SDK}/examples/performance-thread/l3fwd-thread
85
86#.  Set the target (a default target is used if not specified). For example:
87
88    .. code-block:: console
89
90       export RTE_TARGET=x86_64-native-linuxapp-gcc
91
92    See the *DPDK Linux Getting Started Guide* for possible RTE_TARGET values.
93
#.  Build the application:

    .. code-block:: console

       make
97
98
99Running the Application
100-----------------------
101
102The application has a number of command line options::
103
    ./build/l3fwd-thread [EAL options] --
        -p PORTMASK [-P]
        --rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)]
        --tx (lcore,thread)[,(lcore,thread)]
        [--enable-jumbo [--max-pkt-len PKTLEN]] [--no-numa]
        [--hash-entry-num] [--ipv6] [--no-lthreads] [--stat-lcore lcore]
110
111Where:
112
113* ``-p PORTMASK``: Hexadecimal bitmask of ports to configure.
114
115* ``-P``: optional, sets all ports to promiscuous mode so that packets are
116  accepted regardless of the packet's Ethernet MAC destination address.
117  Without this option, only packets with the Ethernet MAC destination address
118  set to the Ethernet address of the port are accepted.
119
120* ``--rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)]``: the list of
121  NIC RX ports and queues handled by the RX lcores and threads. The parameters
122  are explained below.
123
* ``--tx (lcore,thread)[,(lcore,thread)]``: the list of TX threads identifying
  the lcore the thread runs on, and the ID of the RX thread with which it is
  associated. The parameters are explained below.
127
128* ``--enable-jumbo``: optional, enables jumbo frames.
129
130* ``--max-pkt-len``: optional, maximum packet length in decimal (64-9600).
131
* ``--no-numa``: optional, disables NUMA awareness.

* ``--hash-entry-num``: optional, specifies the number of hash entries, in
  hexadecimal, to be set up.

* ``--ipv6``: optional, set this when forwarding IPv6 packets.

* ``--no-lthreads``: optional, disables the L-thread model and uses the EAL
  threading model. See below.
141
142* ``--stat-lcore``: optional, run CPU load stats collector on the specified
143  lcore.
144
145The parameters of the ``--rx`` and ``--tx`` options are:
146
147* ``--rx`` parameters
148
149   .. _table_l3fwd_rx_parameters:
150
151   +--------+------------------------------------------------------+
152   | port   | RX port                                              |
153   +--------+------------------------------------------------------+
154   | queue  | RX queue that will be read on the specified RX port  |
155   +--------+------------------------------------------------------+
156   | lcore  | Core to use for the thread                           |
157   +--------+------------------------------------------------------+
158   | thread | Thread id (continuously from 0 to N)                 |
159   +--------+------------------------------------------------------+
160
161
162* ``--tx`` parameters
163
164   .. _table_l3fwd_tx_parameters:
165
166   +--------+------------------------------------------------------+
167   | lcore  | Core to use for L3 route match and transmit          |
168   +--------+------------------------------------------------------+
169   | thread | Id of RX thread to be associated with this TX thread |
170   +--------+------------------------------------------------------+
171
172The ``l3fwd-thread`` application allows you to start packet processing in two
173threading models: L-Threads (default) and EAL Threads (when the
174``--no-lthreads`` parameter is used). For consistency all parameters are used
175in the same way for both models.
176
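The option syntax described above can be illustrated with a short Python
model. This is only a sketch and not the application's own parser, which is
written in C. The helper names ``ports_from_mask`` and ``parse_tuples`` are
hypothetical, and the sketch accepts tuples with or without a separating comma.

```python
import re

def ports_from_mask(portmask_hex):
    """Return the port ids whose bits are set in a hexadecimal port mask."""
    mask = int(portmask_hex, 16)
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]

def parse_tuples(option_value, arity):
    """Parse '(a,b,...)(c,d,...)' into a list of integer tuples of fixed size."""
    tuples = []
    for group in re.findall(r"\(([^()]*)\)", option_value):
        fields = tuple(int(field) for field in group.split(","))
        if len(fields) != arity:
            raise ValueError(f"expected {arity} fields, got {group!r}")
        tuples.append(fields)
    return tuples

# Values taken from the command lines used in this guide.
rx = parse_tuples("(0,0,0,0)(1,0,1,1)", arity=4)   # (port, queue, lcore, thread)
tx = parse_tuples("(2,0)(3,1)", arity=2)           # (lcore, thread)

for port, queue, lcore, thread in rx:
    print(f"RX thread {thread}: port {port}, queue {queue}, lcore {lcore}")
for lcore, thread in tx:
    print(f"TX thread on lcore {lcore} is associated with RX thread {thread}")
print("ports enabled by -p 3:", ports_from_mask("3"))
```

Reading the output against the ``--rx`` and ``--tx`` tables shows how each TX
entry refers back to an RX thread id rather than to a port.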
177
178Running with L-threads
179~~~~~~~~~~~~~~~~~~~~~~
180
181When the L-thread model is used (default option), lcore and thread parameters
182in ``--rx/--tx`` are used to affinitize threads to the selected scheduler.
183
For example, the following places every L-thread on a different lcore::
185
186   l3fwd-thread -c ff -n 2 -- -P -p 3 \
187                --rx="(0,0,0,0)(1,0,1,1)" \
188                --tx="(2,0)(3,1)"
189
The following places the RX L-threads on lcore 0 and the TX L-threads on
lcores 1 and 2::
192
193   l3fwd-thread -c ff -n 2 -- -P -p 3 \
194                --rx="(0,0,0,0)(1,0,0,1)" \
195                --tx="(1,0)(2,1)"
196
197
198Running with EAL threads
199~~~~~~~~~~~~~~~~~~~~~~~~
200
201When the ``--no-lthreads`` parameter is used, the L-threading model is turned
202off and EAL threads are used for all processing. EAL threads are enumerated in
203the same way as L-threads, but the ``--lcores`` EAL parameter is used to
204affinitize threads to the selected cpu-set (scheduler). Thus it is possible to
205place every RX and TX thread on different lcores.
206
For example, the following places every EAL thread on a different lcore::
208
209   l3fwd-thread -c ff -n 2 -- -P -p 3 \
210                --rx="(0,0,0,0)(1,0,1,1)" \
211                --tx="(2,0)(3,1)" \
212                --no-lthreads
213
214
215To affinitize two or more EAL threads to one cpu-set, the EAL ``--lcores``
216parameter is used.
217
The following places the RX EAL threads (lcores 0 and 1) on cpu-set 0 and the
TX EAL threads (lcores 2 and 3) on cpu-set 1::
220
221   l3fwd-thread -c ff -n 2 --lcores="(0,1)@0,(2,3)@1" -- -P -p 3 \
222                --rx="(0,0,0,0)(1,0,1,1)" \
223                --tx="(2,0)(3,1)" \
224                --no-lthreads
225
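To make the lcore to cpu-set mapping explicit, the following Python sketch
expands the ``--lcores`` values that appear in this guide. It handles only the
``(lcores)@cpu`` form used here, not the full EAL syntax, and the function
names ``expand`` and ``parse_lcores`` are hypothetical.

```python
import re

def expand(spec):
    """Expand '0,1' or '0-3' (or a mix such as '0-1,4') into a list of ints."""
    ids = []
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            ids.extend(range(int(low), int(high) + 1))
        else:
            ids.append(int(part))
    return ids

def parse_lcores(value):
    """Map each lcore id to its CPU for groups of the form '(lcores)@cpu'."""
    mapping = {}
    for lcores, cpu in re.findall(r"\(([^()]*)\)@(\d+)", value):
        for lcore in expand(lcores):
            mapping[lcore] = int(cpu)
    return mapping

print(parse_lcores("(0,1)@0,(2,3)@1"))   # lcores 0 and 1 on CPU 0, 2 and 3 on CPU 1
print(parse_lcores("(0-3)@0"))           # all four lcores share CPU 0
```

Combined with the ``lcore`` fields of ``--rx`` and ``--tx``, this mapping
determines which physical CPU each EAL thread runs on.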
226
227Examples
228~~~~~~~~
229
For selected scenarios, the L-thread command line and the corresponding EAL
thread command line can be realized as follows:

a) Start every thread on a different scheduler (1:1)::
234
235      l3fwd-thread -c ff -n 2 -- -P -p 3 \
236                   --rx="(0,0,0,0)(1,0,1,1)" \
237                   --tx="(2,0)(3,1)"
238
239   EAL thread equivalent::
240
241      l3fwd-thread -c ff -n 2 -- -P -p 3 \
242                   --rx="(0,0,0,0)(1,0,1,1)" \
243                   --tx="(2,0)(3,1)" \
244                   --no-lthreads
245
246b) Start all threads on one core (N:1).
247
248   Start 4 L-threads on lcore 0::
249
250      l3fwd-thread -c ff -n 2 -- -P -p 3 \
251                   --rx="(0,0,0,0)(1,0,0,1)" \
252                   --tx="(0,0)(0,1)"
253
254   Start 4 EAL threads on cpu-set 0::
255
256      l3fwd-thread -c ff -n 2 --lcores="(0-3)@0" -- -P -p 3 \
257                   --rx="(0,0,0,0)(1,0,0,1)" \
258                   --tx="(2,0)(3,1)" \
259                   --no-lthreads
260
261c) Start threads on different cores (N:M).
262
263   Start 2 L-threads for RX on lcore 0, and 2 L-threads for TX on lcore 1::
264
265      l3fwd-thread -c ff -n 2 -- -P -p 3 \
266                   --rx="(0,0,0,0)(1,0,0,1)" \
267                   --tx="(1,0)(1,1)"
268
269   Start 2 EAL threads for RX on cpu-set 0, and 2 EAL threads for TX on
270   cpu-set 1::
271
272      l3fwd-thread -c ff -n 2 --lcores="(0-1)@0,(2-3)@1" -- -P -p 3 \
273                   --rx="(0,0,0,0)(1,0,1,1)" \
274                   --tx="(2,0)(3,1)" \
275                   --no-lthreads
276
277Explanation
278-----------
279
280To a great extent the sample application differs little from the standard L3
281forwarding application, and readers are advised to familiarize themselves with
282the material covered in the :doc:`l3_forward` documentation before proceeding.
283
284The following explanation is focused on the way threading is handled in the
285performance thread example.
286
287
288Mode of operation with EAL threads
289~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
290
291The performance thread sample application has split the RX and TX functionality
292into two different threads, and the RX and TX threads are
293interconnected via software rings. With respect to these rings the RX threads
294are producers and the TX threads are consumers.
295
296On initialization the TX and RX threads are started according to the command
297line parameters.
298
299The RX threads poll the network interface queues and post received packets to a
300TX thread via a corresponding software ring.
301
302The TX threads poll software rings, perform the L3 forwarding hash/LPM match,
303and assemble packet bursts before performing burst transmit on the network
304interface.
305
306As with the standard L3 forward application, burst draining of residual packets
307is performed periodically with the period calculated from elapsed time using
308the timestamps counter.
309
310The diagram below illustrates a case with two RX threads and three TX threads.
311
312.. _figure_performance_thread_1:
313
314.. figure:: img/performance_thread_1.*
315
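The producer/consumer relationship described in this section can be modelled
in a few lines of Python. This is a simplified sketch under stated
assumptions, not the application's code: the ``Ring`` class stands in for a
software ring, packets are reduced to a destination label, the route table is
a plain dictionary, and ``BURST_SIZE`` and ``RING_SIZE`` are arbitrary values
chosen for the example.

```python
from collections import deque

BURST_SIZE = 4   # hypothetical burst size, chosen for the sketch
RING_SIZE = 8    # hypothetical ring capacity, chosen for the sketch

class Ring:
    """Bounded FIFO standing in for a software ring."""
    def __init__(self, size):
        self.size = size
        self.items = deque()

    def enqueue(self, item):
        if len(self.items) >= self.size:
            return False              # ring full, nothing is enqueued
        self.items.append(item)
        return True

    def dequeue_burst(self, max_items):
        burst = []
        while self.items and len(burst) < max_items:
            burst.append(self.items.popleft())
        return burst

def rx_poll(nic_queue, ring):
    """RX side (producer): move received packets from a NIC queue to the ring."""
    while nic_queue and ring.enqueue(nic_queue[0]):
        nic_queue.pop(0)

def tx_poll(ring, routes, tx_buffer, wire):
    """TX side (consumer): dequeue, look up the route, transmit full bursts."""
    for dst in ring.dequeue_burst(BURST_SIZE):
        tx_buffer.append((routes.get(dst, 0), dst))
    if len(tx_buffer) >= BURST_SIZE:
        wire.extend(tx_buffer)
        tx_buffer.clear()

def drain(tx_buffer, wire):
    """Periodic flush of residual packets that do not fill a burst."""
    wire.extend(tx_buffer)
    tx_buffer.clear()

ring = Ring(RING_SIZE)
nic_queue = ["a", "b", "a", "b", "a"]    # destinations of received packets
routes = {"a": 0, "b": 1}                # destination -> output port
tx_buffer, wire = [], []

rx_poll(nic_queue, ring)                 # RX thread posts five packets
tx_poll(ring, routes, tx_buffer, wire)   # a full burst of four is transmitted
tx_poll(ring, routes, tx_buffer, wire)   # one packet remains buffered
drain(tx_buffer, wire)                   # the periodic drain flushes it
print(wire)
```

The last packet only reaches the wire because of the drain step, which is why
the periodic flush is needed alongside burst transmission.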
316
317Mode of operation with L-threads
318~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
319
320Like the EAL thread configuration the application has split the RX and TX
321functionality into different threads, and the pairs of RX and TX threads are
322interconnected via software rings.
323
On initialization an L-thread scheduler is started on every EAL thread. On all
but the master EAL thread only a dummy L-thread is initially started.
The L-thread started on the master EAL thread then spawns other L-threads on
different L-thread schedulers according to the command line parameters.
328
329The RX threads poll the network interface queues and post received packets
330to a TX thread via the corresponding software ring.
331
332The ring interface is augmented by means of an L-thread condition variable that
333enables the TX thread to be suspended when the TX ring is empty. The RX thread
334signals the condition whenever it posts to the TX ring, causing the TX thread
335to be resumed.
336
337Additionally the TX L-thread spawns a worker L-thread to take care of
338polling the software rings, whilst it handles burst draining of the transmit
339buffer.
340
341The worker threads poll the software rings, perform L3 route lookup and
342assemble packet bursts. If the TX ring is empty the worker thread suspends
343itself by waiting on the condition variable associated with the ring.
344
345Burst draining of residual packets, less than the burst size, is performed by
346the TX thread which sleeps (using an L-thread sleep function) and resumes
347periodically to flush the TX buffer.
348
This design means that L-threads that have no work can yield the CPU to other
L-threads and avoid having to constantly poll the software rings.
351
352The diagram below illustrates a case with two RX threads and three TX functions
353(each comprising a thread that processes forwarding and a thread that
354periodically drains the output buffer of residual packets).
355
356.. _figure_performance_thread_2:
357
358.. figure:: img/performance_thread_2.*
359
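The suspend and resume behaviour of the TX worker can be approximated with
Python generators standing in for L-threads. This is an illustrative model
only: the ``Cond`` class, the ``("wait", cond)`` and ``("signal", cond)``
requests and the ``run`` loop are inventions of this sketch and do not
reproduce the L-thread API.

```python
from collections import deque

class Cond:
    """Condition variable for cooperative threads: a queue of waiters."""
    def __init__(self):
        self.waiters = deque()

def run(threads):
    """Run generator-based threads from a FIFO ready queue until none remain."""
    ready = deque(threads)
    while ready:
        thread = ready.popleft()
        try:
            request = next(thread)
        except StopIteration:
            continue                      # thread finished
        if request is None:               # plain yield: back of the ready queue
            ready.append(thread)
        elif request[0] == "wait":        # suspend until the condition is signalled
            request[1].waiters.append(thread)
        elif request[0] == "signal":      # make one waiter ready, then re-queue
            cond = request[1]
            if cond.waiters:
                ready.append(cond.waiters.popleft())
            ready.append(thread)

def rx_thread(packets, ring, cond):
    for packet in packets:
        ring.append(packet)               # post to the TX ring ...
        yield ("signal", cond)            # ... and signal the condition

def tx_worker(ring, cond, sent, expected):
    while len(sent) < expected:
        if not ring:
            yield ("wait", cond)          # ring empty: suspend instead of polling
            continue
        sent.append(ring.popleft())
        yield                             # rescheduling point

ring, cond, sent = deque(), Cond(), []
run([tx_worker(ring, cond, sent, expected=3), rx_thread([1, 2, 3], ring, cond)])
print(sent)
```

While the worker is parked on the condition it is absent from the ready
queue, so it consumes no scheduler time until the RX side posts a packet.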
360
361CPU load statistics
362~~~~~~~~~~~~~~~~~~~
363
364It is possible to display statistics showing estimated CPU load on each core.
365The statistics indicate the percentage of CPU time spent: processing
366received packets (forwarding), polling queues/rings (waiting for work),
367and doing any other processing (context switch and other overhead).
368
When enabled, statistics are gathered by having the application threads set and
370clear flags when they enter and exit pertinent code sections. The flags are
371then sampled in real time by a statistics collector thread running on another
372core. This thread displays the data in real time on the console.
373
374This feature is enabled by designating a statistics collector core, using the
375``--stat-lcore`` parameter.
376
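The sampling approach can be sketched as follows. The model is deliberately
simplified and makes several assumptions: the flags are collapsed into a
single per-core state, time is a list of discrete ticks, and the names
``sample_load``, ``BUSY``, ``POLLING`` and ``OTHER`` are hypothetical.

```python
from collections import Counter

BUSY, POLLING, OTHER = "processing", "polling", "other"

def sample_load(timeline, period):
    """Sample the per-core state every `period` ticks and return percentages."""
    samples = Counter(timeline[tick] for tick in range(0, len(timeline), period))
    total = sum(samples.values())
    return {state: 100.0 * samples[state] / total
            for state in (BUSY, POLLING, OTHER)}

# Hypothetical timeline: the state a worker core exposes at each tick.
timeline = [BUSY] * 60 + [POLLING] * 30 + [OTHER] * 10
print(sample_load(timeline, period=5))
```

Because the collector only observes the state at sampling instants, the
figures are estimates whose accuracy depends on the sampling period.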
377
378.. _lthread_subsystem:
379
380The L-thread subsystem
381----------------------
382
The L-thread subsystem resides in the
``examples/performance-thread/common`` directory and is built and linked
automatically when building the ``l3fwd-thread`` example.
386
387The subsystem provides a simple cooperative scheduler to enable arbitrary
388functions to run as cooperative threads within a single EAL thread.
The subsystem provides a pthread-like API that is intended to assist in
reuse of legacy code written for POSIX pthreads.
391
392The following sections provide some detail on the features, constraints,
393performance and porting considerations when using L-threads.
394
395
396.. _comparison_between_lthreads_and_pthreads:
397
398Comparison between L-threads and POSIX pthreads
399~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
400
401The fundamental difference between the L-thread and pthread models is the
402way in which threads are scheduled. The simplest way to think about this is to
403consider the case of a processor with a single CPU. To run multiple threads
404on a single CPU, the scheduler must frequently switch between the threads,
405in order that each thread is able to make timely progress.
406This is the basis of any multitasking operating system.
407
408This section explores the differences between the pthread model and the
409L-thread model as implemented in the provided L-thread subsystem. If needed a
410theoretical discussion of preemptive vs cooperative multi-threading can be
411found in any good text on operating system design.
412
413
414Scheduling and context switching
415^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
416
417The POSIX pthread library provides an application programming interface to
418create and synchronize threads. Scheduling policy is determined by the host OS,
419and may be configurable. The OS may use sophisticated rules to determine which
420thread should be run next, threads may suspend themselves or make other threads
421ready, and the scheduler may employ a time slice giving each thread a maximum
422time quantum after which it will be preempted in favor of another thread that
423is ready to run. To complicate matters further threads may be assigned
424different scheduling priorities.
425
426By contrast the L-thread subsystem is considerably simpler. Logically the
427L-thread scheduler performs the same multiplexing function for L-threads
428within a single pthread as the OS scheduler does for pthreads within an
application process. The L-thread scheduler is simply the main loop of a
pthread, and insofar as the host OS is concerned it is a regular pthread
just like any other. The host OS is oblivious to the existence of, and
not at all involved in, the scheduling of L-threads.
433
434The other and most significant difference between the two models is that
L-threads are scheduled cooperatively. L-threads cannot preempt each
436other, nor can the L-thread scheduler preempt a running L-thread (i.e.
437there is no time slicing). The consequence is that programs implemented with
438L-threads must possess frequent rescheduling points, meaning that they must
439explicitly and of their own volition return to the scheduler at frequent
440intervals, in order to allow other L-threads an opportunity to proceed.
441
442In both models switching between threads requires that the current CPU
443context is saved and a new context (belonging to the next thread ready to run)
444is restored. With pthreads this context switching is handled transparently
445and the set of CPU registers that must be preserved between context switches
446is as per an interrupt handler.
447
An L-thread context switch is achieved by the thread itself making a function
call to the L-thread scheduler. Thus it is only necessary to preserve the
callee-saved registers. The caller is responsible for saving any other
registers it is using before a function call and restoring them on return,
and this is handled by the compiler. For ``X86_64`` on both Linux and BSD the
System V calling convention is used. This defines registers RBX, RSP, RBP, and
R12-R15 as callee-saved registers (for a more detailed discussion a good
reference is
`X86 Calling Conventions <https://en.wikipedia.org/wiki/X86_calling_conventions>`_).
456
Taking advantage of this, and due to the absence of preemption, an L-thread
context switch is achieved with fewer than 20 load/store instructions.
459
The scheduling policy for L-threads is fixed. There is no prioritization of
L-threads; all L-threads are equal and scheduling is based on a FIFO
ready queue.
463
464An L-thread is a struct containing the CPU context of the thread
465(saved on context switch) and other useful items. The ready queue contains
466pointers to threads that are ready to run. The L-thread scheduler is a simple
467loop that polls the ready queue, reads from it the next thread ready to run,
468which it resumes by saving the current context (the current position in the
469scheduler loop) and restoring the context of the next thread from its thread
470struct. Thus an L-thread is always resumed at the last place it yielded.
471
A well-behaved L-thread will call the context switch regularly (at least once
in its main loop) thus returning to the scheduler's own main loop. Yielding
inserts the current thread at the back of the ready queue, and the process of
servicing the ready queue is repeated. Thus the system runs by flipping back
and forth between the L-threads and the scheduler loop.
477
478In the case of pthreads, the preemptive scheduling, time slicing, and support
479for thread prioritization means that progress is normally possible for any
480thread that is ready to run. This comes at the price of a relatively heavier
481context switch and scheduling overhead.
482
483With L-threads the progress of any particular thread is determined by the
484frequency of rescheduling opportunities in the other L-threads. This means that
485an errant L-thread monopolizing the CPU might cause scheduling of other threads
486to be stalled. Due to the lower cost of context switching, however, voluntary
487rescheduling to ensure progress of other threads, if managed sensibly, is not
488a prohibitive overhead, and overall performance can exceed that of an
489application using pthreads.
490
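The FIFO ready queue and the requirement for explicit rescheduling points can
be modelled with Python generators. This is a conceptual sketch only: a
generator's ``yield`` stands in for the L-thread context switch, and the
``worker`` and ``scheduler`` names are hypothetical rather than part of the
L-thread subsystem.

```python
from collections import deque

def worker(name, steps, log):
    """An L-thread-like function: do one unit of work, then yield."""
    for step in range(steps):
        log.append(f"{name}:{step}")
        yield                      # rescheduling point: return to the scheduler

def scheduler(threads):
    """Service a FIFO ready queue until every thread has finished."""
    ready = deque(threads)
    while ready:
        thread = ready.popleft()   # next thread that is ready to run
        try:
            next(thread)           # resume it at the place it last yielded
        except StopIteration:
            continue               # finished threads are not re-queued
        ready.append(thread)       # a yielding thread goes to the back

log = []
scheduler([worker("A", 2, log), worker("B", 3, log)])
print(log)
```

If ``worker`` omitted its ``yield``, the first thread would run all of its
steps before any other thread started, which is the stall that an errant
L-thread causes in the cooperative model.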
491
492Mutual exclusion
493^^^^^^^^^^^^^^^^
494
With pthreads, preemption means that threads that share data must observe
some form of mutual exclusion protocol.
497
498The fact that L-threads cannot preempt each other means that in many cases
499mutual exclusion devices can be completely avoided.
500
501Locking to protect shared data can be a significant bottleneck in
502multi-threaded applications so a carefully designed cooperatively scheduled
503program can enjoy significant performance advantages.
504
So far we have considered only the simplistic case of a single core CPU.
When multiple CPUs are considered, things are somewhat more complex.
507
508First of all it is inevitable that there must be multiple L-thread schedulers,
509one running on each EAL thread. So long as these schedulers remain isolated
510from each other the above assertions about the potential advantages of
511cooperative scheduling hold true.
512
513A configuration with isolated cooperative schedulers is less flexible than the
514pthread model where threads can be affinitized to run on any CPU. With isolated
515schedulers scaling of applications to utilize fewer or more CPUs according to
516system demand is very difficult to achieve.
517
518The L-thread subsystem makes it possible for L-threads to migrate between
519schedulers running on different CPUs. Needless to say if the migration means
520that threads that share data end up running on different CPUs then this will
521introduce the need for some kind of mutual exclusion system.
522
Of course ``rte_ring`` software rings can always be used to interconnect
threads running on different cores. However, to protect other kinds of shared
data structures, lock-free constructs or else explicit locking will be
required. This is a consideration for the application design.

In support of this extended functionality, the L-thread subsystem implements
thread-safe mutexes and condition variables.
530
531The cost of affinitizing and of condition variable signaling is significantly
532lower than the equivalent pthread operations, and so applications using these
533features will see a performance benefit.
534
535
536Thread local storage
537^^^^^^^^^^^^^^^^^^^^
538
539As with applications written for pthreads an application written for L-threads
540can take advantage of thread local storage, in this case local to an L-thread.
541An application may save and retrieve a single pointer to application data in
542the L-thread struct.
543
For legacy and backward compatibility reasons two alternative methods are also
offered. The first is modelled directly on the pthread get/set specific APIs.
The second is modelled on the ``RTE_PER_LCORE`` macros, whereby
``PER_LTHREAD`` macros are introduced. In both cases the storage is local to
the L-thread.
549
550
551.. _constraints_and_performance_implications:
552
553Constraints and performance implications when using L-threads
554~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
555
556
557.. _API_compatibility:
558
559API compatibility
560^^^^^^^^^^^^^^^^^
561
The L-thread subsystem provides a set of functions that are logically equivalent
to the corresponding functions offered by the POSIX pthread library. However, not
all pthread functions have a corresponding L-thread equivalent, and not all
features available to pthreads are implemented for L-threads.
566
567The pthread library offers considerable flexibility via programmable attributes
568that can be associated with threads, mutexes, and condition variables.
569
By contrast the L-thread subsystem has fixed functionality: the scheduler policy
cannot be varied, and L-threads cannot be prioritized. There are no variable
attributes associated with any L-thread objects. L-threads, mutexes and
condition variables all have fixed functionality. (Note: reserved parameters
are included in the APIs to facilitate possible future support for attributes.)
575
576The table below lists the pthread and equivalent L-thread APIs with notes on
differences and/or constraints. Where there is no L-thread entry in the table,
the L-thread subsystem provides no equivalent function.
579
580.. _table_lthread_pthread:
581
582.. table:: Pthread and equivalent L-thread APIs.
583
584   +----------------------------+------------------------+-------------------+
585   | **Pthread function**       | **L-thread function**  | **Notes**         |
586   +============================+========================+===================+
587   | pthread_barrier_destroy    |                        |                   |
588   +----------------------------+------------------------+-------------------+
589   | pthread_barrier_init       |                        |                   |
590   +----------------------------+------------------------+-------------------+
591   | pthread_barrier_wait       |                        |                   |
592   +----------------------------+------------------------+-------------------+
593   | pthread_cond_broadcast     | lthread_cond_broadcast | See note 1        |
594   +----------------------------+------------------------+-------------------+
595   | pthread_cond_destroy       | lthread_cond_destroy   |                   |
596   +----------------------------+------------------------+-------------------+
597   | pthread_cond_init          | lthread_cond_init      |                   |
598   +----------------------------+------------------------+-------------------+
599   | pthread_cond_signal        | lthread_cond_signal    | See note 1        |
600   +----------------------------+------------------------+-------------------+
601   | pthread_cond_timedwait     |                        |                   |
602   +----------------------------+------------------------+-------------------+
603   | pthread_cond_wait          | lthread_cond_wait      | See note 5        |
604   +----------------------------+------------------------+-------------------+
605   | pthread_create             | lthread_create         | See notes 2, 3    |
606   +----------------------------+------------------------+-------------------+
607   | pthread_detach             | lthread_detach         | See note 4        |
608   +----------------------------+------------------------+-------------------+
609   | pthread_equal              |                        |                   |
610   +----------------------------+------------------------+-------------------+
611   | pthread_exit               | lthread_exit           |                   |
612   +----------------------------+------------------------+-------------------+
613   | pthread_getspecific        | lthread_getspecific    |                   |
614   +----------------------------+------------------------+-------------------+
615   | pthread_getcpuclockid      |                        |                   |
616   +----------------------------+------------------------+-------------------+
617   | pthread_join               | lthread_join           |                   |
618   +----------------------------+------------------------+-------------------+
619   | pthread_key_create         | lthread_key_create     |                   |
620   +----------------------------+------------------------+-------------------+
621   | pthread_key_delete         | lthread_key_delete     |                   |
622   +----------------------------+------------------------+-------------------+
623   | pthread_mutex_destroy      | lthread_mutex_destroy  |                   |
624   +----------------------------+------------------------+-------------------+
625   | pthread_mutex_init         | lthread_mutex_init     |                   |
626   +----------------------------+------------------------+-------------------+
627   | pthread_mutex_lock         | lthread_mutex_lock     | See note 6        |
628   +----------------------------+------------------------+-------------------+
629   | pthread_mutex_trylock      | lthread_mutex_trylock  | See note 6        |
630   +----------------------------+------------------------+-------------------+
631   | pthread_mutex_timedlock    |                        |                   |
632   +----------------------------+------------------------+-------------------+
633   | pthread_mutex_unlock       | lthread_mutex_unlock   |                   |
634   +----------------------------+------------------------+-------------------+
635   | pthread_once               |                        |                   |
636   +----------------------------+------------------------+-------------------+
637   | pthread_rwlock_destroy     |                        |                   |
638   +----------------------------+------------------------+-------------------+
639   | pthread_rwlock_init        |                        |                   |
640   +----------------------------+------------------------+-------------------+
641   | pthread_rwlock_rdlock      |                        |                   |
642   +----------------------------+------------------------+-------------------+
643   | pthread_rwlock_timedrdlock |                        |                   |
644   +----------------------------+------------------------+-------------------+
645   | pthread_rwlock_timedwrlock |                        |                   |
646   +----------------------------+------------------------+-------------------+
647   | pthread_rwlock_tryrdlock   |                        |                   |
648   +----------------------------+------------------------+-------------------+
649   | pthread_rwlock_trywrlock   |                        |                   |
650   +----------------------------+------------------------+-------------------+
651   | pthread_rwlock_unlock      |                        |                   |
652   +----------------------------+------------------------+-------------------+
653   | pthread_rwlock_wrlock      |                        |                   |
654   +----------------------------+------------------------+-------------------+
655   | pthread_self               | lthread_current        |                   |
656   +----------------------------+------------------------+-------------------+
657   | pthread_setspecific        | lthread_setspecific    |                   |
658   +----------------------------+------------------------+-------------------+
659   | pthread_spin_init          |                        | See note 10       |
660   +----------------------------+------------------------+-------------------+
661   | pthread_spin_destroy       |                        | See note 10       |
662   +----------------------------+------------------------+-------------------+
663   | pthread_spin_lock          |                        | See note 10       |
664   +----------------------------+------------------------+-------------------+
665   | pthread_spin_trylock       |                        | See note 10       |
666   +----------------------------+------------------------+-------------------+
667   | pthread_spin_unlock        |                        | See note 10       |
668   +----------------------------+------------------------+-------------------+
669   | pthread_cancel             | lthread_cancel         |                   |
670   +----------------------------+------------------------+-------------------+
671   | pthread_setcancelstate     |                        |                   |
672   +----------------------------+------------------------+-------------------+
673   | pthread_setcanceltype      |                        |                   |
674   +----------------------------+------------------------+-------------------+
675   | pthread_testcancel         |                        |                   |
676   +----------------------------+------------------------+-------------------+
677   | pthread_getschedparam      |                        |                   |
678   +----------------------------+------------------------+-------------------+
679   | pthread_setschedparam      |                        |                   |
680   +----------------------------+------------------------+-------------------+
681   | pthread_yield              | lthread_yield          | See note 7        |
682   +----------------------------+------------------------+-------------------+
683   | pthread_setaffinity_np     | lthread_set_affinity   | See notes 2, 3, 8 |
684   +----------------------------+------------------------+-------------------+
685   |                            | lthread_sleep          | See note 9        |
686   +----------------------------+------------------------+-------------------+
687   |                            | lthread_sleep_clks     | See note 9        |
688   +----------------------------+------------------------+-------------------+
689
690
691**Note 1**:
692
693Neither lthread signal nor broadcast may be called concurrently by L-threads
694running on different schedulers, although multiple L-threads running in the
695same scheduler may freely perform signal or broadcast operations. L-threads
696running on the same or different schedulers may always safely wait on a
697condition variable.
698
699
700**Note 2**:
701
702Pthread attributes may be used to affinitize a pthread with a cpu-set. The
703L-thread subsystem does not support a cpu-set. An L-thread may be affinitized
704only with a single CPU at any time.
705
706
707**Note 3**:
708
If an L-thread is intended to run on a different NUMA node from the node that
creates it, then it is advantageous to specify the destination core as a
parameter of ``lthread_create()``. See
:ref:`memory_allocation_and_NUMA_awareness` for details.
713
714
715**Note 4**:
716
717An L-thread can only detach itself, and cannot detach other L-threads.
718
719
720**Note 5**:
721
A wait operation on a pthread condition variable is always associated with and
protected by a mutex, which must be owned by the thread at the time it invokes
``pthread_cond_wait()``. By contrast, L-thread condition variables are thread
safe (for waiters) and do not use an associated mutex. Multiple L-threads
(including L-threads running on other schedulers) can safely wait on an
L-thread condition variable. As a consequence, an L-thread condition variable
is typically an order of magnitude faster than its pthread counterpart.
729
730
731**Note 6**:
732
Recursive locking is not supported with L-threads; attempts to take a lock
recursively will be detected and rejected.
735
736
737**Note 7**:
738
``lthread_yield()`` will save the current context, insert the current thread
at the back of the ready queue, and resume the next ready thread. Yielding
increases ready queue backlog; see :ref:`ready_queue_backlog` for more details
about the implications of this.


N.B. The context switch time, as measured from immediately before the call to
``lthread_yield()`` to the point at which the next ready thread is resumed,
can be an order of magnitude faster than the same measurement for
``pthread_yield()``.
749
750
751**Note 8**:
752
753``lthread_set_affinity()`` is similar to a yield apart from the fact that the
754yielding thread is inserted into a peer ready queue of another scheduler.
755The peer ready queue is actually a separate thread safe queue, which means that
756threads appearing in the peer ready queue can jump any backlog in the local
757ready queue on the destination scheduler.
758
759The context switch time as measured from the time just before the call to
760``lthread_set_affinity()`` to just after the same thread is resumed on the new
761scheduler can be orders of magnitude faster than the same measurement for
762``pthread_setaffinity_np()``.
763
764
765**Note 9**:
766
767Although there is no ``pthread_sleep()`` function, ``lthread_sleep()`` and
768``lthread_sleep_clks()`` can be used wherever ``sleep()``, ``usleep()`` or
769``nanosleep()`` might ordinarily be used. The L-thread sleep functions suspend
770the current thread, start an ``rte_timer`` and resume the thread when the
771timer matures. The ``rte_timer_manage()`` entry point is called on every pass
772of the scheduler loop. This means that the worst case jitter on timer expiry
773is determined by the longest period between context switches of any running
774L-threads.
775
In a synthetic test with many threads sleeping and resuming, the measured
jitter is typically orders of magnitude lower than the same measurement made
for ``nanosleep()``.
779
780
781**Note 10**:
782
Spin locks are not provided because they are problematic in a cooperative
environment. See :ref:`porting_locks_and_spinlocks` for a more detailed
discussion on how to avoid spin locks.
786
787
788.. _Thread_local_storage_performance:
789
790Thread local storage
791^^^^^^^^^^^^^^^^^^^^
792
793Of the three L-thread local storage options the simplest and most efficient is
794storing a single application data pointer in the L-thread struct.
795
The ``PER_LTHREAD`` macros involve a run time computation to obtain the address
of the variable being saved/retrieved and also require that the accesses are
dereferenced via a pointer. This means that code that has used
``RTE_PER_LCORE`` macros and is being ported to L-threads might need some
slight adjustment (see :ref:`porting_thread_local_storage` for hints about
porting code that makes use of thread local storage).
802
803The get/set specific APIs are consistent with their pthread counterparts both
804in use and in performance.
805
806
807.. _memory_allocation_and_NUMA_awareness:
808
809Memory allocation and NUMA awareness
810^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
811
812All memory allocation is from DPDK huge pages, and is NUMA aware. Each
813scheduler maintains its own caches of objects: lthreads, their stacks, TLS,
814mutexes and condition variables. These caches are implemented as unbounded lock
815free MPSC queues. When objects are created they are always allocated from the
816caches on the local core (current EAL thread).
817
818If an L-thread has been affinitized to a different scheduler, then it can
819always safely free resources to the caches from which they originated (because
820the caches are MPSC queues).
821
822If the L-thread has been affinitized to a different NUMA node then the memory
823resources associated with it may incur longer access latency.
824
The commonly used pattern of setting affinity on entry to a thread, after it
has started, means that memory allocation for both the stack and TLS will have
been made from caches on the NUMA node on which the thread's creator is
running. This has the side effect that access latency will be sub-optimal
after affinitizing.

This side effect can be mitigated to some extent (although not completely) by
specifying the destination CPU as a parameter of ``lthread_create()``. This
causes the L-thread's stack and TLS to be allocated when it is first scheduled
on the destination scheduler; if the destination is on another NUMA node, this
results in a more optimal memory allocation.
836
Note that the lthread struct itself remains allocated from memory on the
creating node. This is unavoidable because an L-thread is known everywhere by
the address of this struct.
840
841
842.. _object_cache_sizing:
843
844Object cache sizing
845^^^^^^^^^^^^^^^^^^^
846
The per lcore object caches pre-allocate objects in bulk whenever a request to
allocate an object finds a cache empty. By default 100 objects are
pre-allocated; this is defined by ``LTHREAD_PREALLOC`` in the public API
header file ``lthread_api.h``. This means that the caches constantly grow to
meet system demand.
852
853In the present implementation there is no mechanism to reduce the cache sizes
854if system demand reduces. Thus the caches will remain at their maximum extent
855indefinitely.
856
A consequence of the bulk pre-allocation of objects is that every 100 (default
value) additional new object create operations result in a call to
``rte_malloc()``. For creation of objects such as L-threads, which trigger the
allocation of even more objects (i.e. their stacks and TLS), this can
cause outliers in scheduling performance.
862
863If this is a problem the simplest mitigation strategy is to dimension the
864system, by setting the bulk object pre-allocation size to some large number
865that you do not expect to be exceeded. This means the caches will be populated
866once only, the very first time a thread is created.
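
The dimensioning argument can be illustrated with a little arithmetic. The
sketch below is not part of the L-thread API: ``bulk_refills()`` is a
hypothetical helper that assumes a cache starts empty, that no objects are
returned to it, and that every refill adds exactly the pre-allocation size.
Under those assumptions it counts how many bulk refills are needed to create a
given number of objects.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>

    /* Default bulk size quoted in the text (LTHREAD_PREALLOC). */
    #define DEFAULT_PREALLOC 100

    /*
     * Hypothetical helper, for illustration only: number of bulk refills
     * needed to satisfy n_objects allocations from an initially empty cache
     * when each refill adds prealloc objects and nothing is freed.
     */
    static size_t
    bulk_refills(size_t n_objects, size_t prealloc)
    {
        assert(prealloc > 0);
        return (n_objects + prealloc - 1) / prealloc; /* ceiling division */
    }

With the default size, creating 250 objects gives ``bulk_refills(250,
DEFAULT_PREALLOC) == 3``, whereas a pre-allocation size of 1000 reduces this
to a single refill at first use.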
867
868
869.. _Ready_queue_backlog:
870
871Ready queue backlog
872^^^^^^^^^^^^^^^^^^^
873
One of the more subtle performance considerations is managing the ready queue
backlog. The fewer threads that are waiting in the ready queue, the faster
any particular thread will get serviced.
877
878In a naive L-thread application with N L-threads simply looping and yielding,
879this backlog will always be equal to the number of L-threads, thus the cost of
880a yield to a particular L-thread will be N times the context switch time.
881
882This side effect can be mitigated by arranging for threads to be suspended and
883wait to be resumed, rather than polling for work by constantly yielding.
884Blocking on a mutex or condition variable or even more obviously having a
885thread sleep if it has a low frequency workload are all mechanisms by which a
886thread can be excluded from the ready queue until it really does need to be
887run. This can have a significant positive impact on performance.
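
The effect of backlog can be seen in a toy model. The sketch below is not the
L-thread scheduler; it is a minimal FIFO simulation, with hypothetical names,
of ``n`` threads that all loop and yield. It counts the context switches that
occur between thread 0 yielding and thread 0 being resumed.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>

    #define MAX_THREADS 64

    /*
     * Toy model: thread 0 is running and threads 1..n-1 wait in a FIFO
     * ready queue. Thread 0 yields; every thread that is resumed also
     * yields straight away. Returns the number of context switches made
     * before thread 0 runs again.
     */
    static size_t
    switches_until_resumed(size_t n)
    {
        size_t queue[MAX_THREADS];
        size_t head = 0, count = 0, switches = 0, id;

        assert(n >= 1 && n <= MAX_THREADS);

        /* Threads 1..n-1 are already waiting. */
        for (id = 1; id < n; id++)
            queue[(head + count++) % MAX_THREADS] = id;

        /* Thread 0 yields and joins the back of the queue. */
        queue[(head + count++) % MAX_THREADS] = 0;

        for (;;) {
            size_t next = queue[head];

            head = (head + 1) % MAX_THREADS;
            count--;
            switches++;            /* one context switch per resume */
            if (next == 0)
                return switches;
            /* 'next' runs briefly, then yields to the back of the queue. */
            queue[(head + count++) % MAX_THREADS] = next;
        }
    }

In this model the result is always ``n``, which matches the N times context
switch cost described above. Threads that are blocked or sleeping are not in
the queue at all, so they do not contribute to ``n``.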
888
889
890.. _Initialization_and_shutdown_dependencies:
891
892Initialization, shutdown and dependencies
893^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
894
The L-thread subsystem depends on DPDK for huge page allocation and depends on
the ``rte_timer`` subsystem. The DPDK EAL initialization and
``rte_timer_subsystem_init()`` **MUST** be completed before the L-thread
subsystem can be used.
899
900Thereafter initialization of the L-thread subsystem is largely transparent to
901the application. Constructor functions ensure that global variables are properly
902initialized. Other than global variables each scheduler is initialized
903independently the first time that an L-thread is created by a particular EAL
904thread.
905
906If the schedulers are to be run as isolated and independent schedulers, with
907no intention that L-threads running on different schedulers will migrate between
908schedulers or synchronize with L-threads running on other schedulers, then
909initialization consists simply of creating an L-thread, and then running the
910L-thread scheduler.
911
912If there will be interaction between L-threads running on different schedulers,
913then it is important that the starting of schedulers on different EAL threads
914is synchronized.
915
To achieve this an additional initialization step is necessary: set the number
of schedulers by calling the API function
``lthread_num_schedulers_set(n)``, where ``n`` is the number of EAL threads
that will run L-thread schedulers. Setting the number of schedulers to a
number greater than 0 will cause all schedulers to wait until the others have
started before beginning to schedule L-threads.
922
The L-thread scheduler is started by calling the function ``lthread_run()``,
which should be called from the EAL thread and thus becomes the main loop of
the EAL thread.

The function ``lthread_run()`` will not return until all threads running on
the scheduler have exited, and the scheduler has been explicitly stopped by
calling ``lthread_scheduler_shutdown(lcore)`` or
``lthread_scheduler_shutdown_all()``.
931
All these functions do is tell the scheduler that it can exit when there are no
longer any running L-threads; neither function forces any running L-thread to
terminate. Any desired application shutdown behavior must be designed and
built into the application to ensure that L-threads complete in a timely
manner.
937
938**Important Note:** It is assumed when the scheduler exits that the application
939is terminating for good, the scheduler does not free resources before exiting
940and running the scheduler a subsequent time will result in undefined behavior.
941
942
943.. _porting_legacy_code_to_run_on_lthreads:
944
945Porting legacy code to run on L-threads
946~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
947
948Legacy code originally written for a pthread environment may be ported to
949L-threads if the considerations about differences in scheduling policy, and
950constraints discussed in the previous sections can be accommodated.
951
952This section looks in more detail at some of the issues that may have to be
953resolved when porting code.
954
955
956.. _pthread_API_compatibility:
957
958pthread API compatibility
959^^^^^^^^^^^^^^^^^^^^^^^^^
960
The first step is to establish exactly which pthread APIs the legacy
application uses, and to understand the requirements of those APIs. If there
are corresponding L-thread APIs, and the application uses only the default
pthread functionality, then, notwithstanding the other issues discussed
here, it should be feasible to run the application with L-threads. If the
legacy code modifies the default behavior using attributes, then it may be
necessary to make some adjustments to eliminate those requirements.
968
969
970.. _blocking_system_calls:
971
972Blocking system API calls
973^^^^^^^^^^^^^^^^^^^^^^^^^
974
It is important to understand what other system services the application may be
using, bearing in mind that in a cooperatively scheduled environment a thread
cannot block without stalling the scheduler and with it all other cooperative
threads. Any kind of blocking system call, for example file or socket IO, is a
potential problem. A good tool to analyze the application for this purpose is
the ``strace`` utility.

There are many strategies to resolve these kinds of issues, each with its
merits. Possible solutions include:
984
985* Adopting a polled mode of the system API concerned (if available).
986
987* Arranging for another core to perform the function and synchronizing with
988  that core via constructs that will not block the L-thread.
989
990* Affinitizing the thread to another scheduler devoted (as a matter of policy)
991  to handling threads wishing to make blocking calls, and then back again when
992  finished.
993
994
995.. _porting_locks_and_spinlocks:
996
997Locks and spinlocks
998^^^^^^^^^^^^^^^^^^^
999
1000Locks and spinlocks are another source of blocking behavior that for the same
1001reasons as system calls will need to be addressed.
1002
If the application design ensures that the contending L-threads will always
run on the same scheduler then it is probably safe to remove locks and spin
locks completely.

The only exception to the above rule is if for some reason the
code performs any kind of context switch whilst holding the lock
(e.g. yield, sleep, or block on a different lock, or on a condition variable).
This will need to be determined before deciding to eliminate a lock.
1011
1012If a lock cannot be eliminated then an L-thread mutex can be substituted for
1013either kind of lock.
1014
1015An L-thread blocking on an L-thread mutex will be suspended and will cause
1016another ready L-thread to be resumed, thus not blocking the scheduler. When
1017default behavior is required, it can be used as a direct replacement for a
1018pthread mutex lock.
1019
1020Spin locks are typically used when lock contention is likely to be rare and
1021where the period during which the lock may be held is relatively short.
1022When the contending L-threads are running on the same scheduler then an
1023L-thread blocking on a spin lock will enter an infinite loop stopping the
1024scheduler completely (see :ref:`porting_infinite_loops` below).
1025
1026If the application design ensures that contending L-threads will always run
1027on different schedulers then it might be reasonable to leave a short spin lock
1028that rarely experiences contention in place.
1029
If after all considerations it appears that a spin lock cannot be
eliminated completely, replaced with an L-thread mutex, or left in place as
is, then an alternative is to loop on a flag, with a call to
``lthread_yield()`` inside the loop (n.b. if the contending L-threads might
ever run on different schedulers the flag will need to be manipulated
atomically).
1036
1037Spinning and yielding is the least preferred solution since it introduces
1038ready queue backlog (see also :ref:`ready_queue_backlog`).
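
The flag-and-yield alternative can be sketched as follows. This is a minimal
illustration rather than code from the L-thread library: the ``yield_lock``
type and its functions are hypothetical, and the yield operation is passed in
as a function pointer so that the sketch compiles without DPDK. In an L-thread
that pointer would be ``lthread_yield``. C11 atomics are used so that the flag
remains safe if the contending threads run on different schedulers.

.. code-block:: c

    #include <assert.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Stand-in for lthread_yield(), so the sketch builds without DPDK. */
    typedef void (*yield_fn)(void);

    /* Hypothetical flag-based lock: false = free, true = held. */
    typedef struct {
        atomic_bool held;
    } yield_lock;

    static void
    yield_lock_init(yield_lock *l)
    {
        atomic_init(&l->held, false);
    }

    /* Single attempt; returns true if the lock was taken. */
    static bool
    yield_lock_try(yield_lock *l)
    {
        bool expected = false;

        return atomic_compare_exchange_strong_explicit(&l->held, &expected,
                true, memory_order_acquire, memory_order_relaxed);
    }

    /* Loop on the flag, giving up the CPU between attempts. */
    static void
    yield_lock_acquire(yield_lock *l, yield_fn yield)
    {
        assert(l != NULL && yield != NULL);
        while (!yield_lock_try(l))
            yield(); /* lets the current holder run and release the lock */
    }

    static void
    yield_lock_release(yield_lock *l)
    {
        atomic_store_explicit(&l->held, false, memory_order_release);
    }

Every failed attempt puts the waiter back on the ready queue, which is why
this pattern adds backlog and should be a last resort.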
1039
1040
1041.. _porting_sleeps_and_delays:
1042
1043Sleeps and delays
1044^^^^^^^^^^^^^^^^^
1045
Yet another kind of blocking behavior (albeit momentary) is the use of delay
functions like ``sleep()``, ``usleep()``, ``nanosleep()`` etc. All will have
the consequence of stalling the L-thread scheduler, and unless the delay is
very short (e.g. a very short nanosleep) calls to these functions will need to
be eliminated.
1051
1052The simplest mitigation strategy is to use the L-thread sleep API functions,
1053of which two variants exist, ``lthread_sleep()`` and ``lthread_sleep_clks()``.
1054These functions start an rte_timer against the L-thread, suspend the L-thread
1055and cause another ready L-thread to be resumed. The suspended L-thread is
1056resumed when the rte_timer matures.
1057
1058
1059.. _porting_infinite_loops:
1060
1061Infinite loops
1062^^^^^^^^^^^^^^
1063
Some applications have threads with loops that contain no inherent
rescheduling opportunity, and rely solely on the OS time slicing to share
the CPU. In a cooperative environment this will stop everything dead. These
kinds of loops are not hard to identify: in a debug session you will find the
debugger is always stopping in the same loop.
1069
1070The simplest solution to this kind of problem is to insert an explicit
1071``lthread_yield()`` or ``lthread_sleep()`` into the loop. Another solution
1072might be to include the function performed by the loop into the execution path
1073of some other loop that does in fact yield, if this is possible.
1074
1075
1076.. _porting_thread_local_storage:
1077
1078Thread local storage
1079^^^^^^^^^^^^^^^^^^^^
1080
1081If the application uses thread local storage, the use case should be
1082studied carefully.
1083
1084In a legacy pthread application either or both the ``__thread`` prefix, or the
1085pthread set/get specific APIs may have been used to define storage local to a
1086pthread.
1087
1088In some applications it may be a reasonable assumption that the data could
1089or in fact most likely should be placed in L-thread local storage.
1090
If the application (like many DPDK applications) has assumed a certain
relationship between a pthread and the CPU to which it is affinitized, there
is a risk that thread local storage may have been used to save some data items
that are correctly logically associated with the CPU, and other items which
relate to application context for the thread. Only a good understanding of the
application will reveal such cases.

If the application requires that an L-thread be able to move between
schedulers, then care should be taken to separate these kinds of data into per
lcore and per L-thread storage. In this way a migrating thread will bring with
it the local data it needs, and pick up the new logical core specific values
from pthread local storage at its new home.
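
The separation described above might look like the following sketch. All of
the names are hypothetical and the code does not use the L-thread API: C11
``_Thread_local`` stands in for per lcore (pthread local) storage, and a
context struct passed by pointer stands in for data that would be held in
L-thread local storage.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Belongs to the CPU / EAL thread: stays put when an L-thread moves. */
    struct core_data {
        uint64_t packets_seen;
    };

    /* Belongs to the application thread: travels with it. */
    struct thread_ctx {
        uint32_t session_id;
        uint64_t work_done;
    };

    /* One instance per pthread, and therefore one per scheduler. */
    static _Thread_local struct core_data this_core;

    static void
    do_work(struct thread_ctx *ctx)
    {
        assert(ctx != NULL);
        ctx->work_done++;          /* follows the thread across schedulers */
        this_core.packets_seen++;  /* always the core currently running us */
    }

After a migration the same ``ctx`` pointer is still valid, while
``this_core`` automatically refers to the instance owned by the new
scheduler's pthread.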
1103
1104
1105.. _pthread_shim:
1106
1107Pthread shim
1108~~~~~~~~~~~~
1109
A convenient way to get something working with legacy code can be to use a
shim that adapts pthread API calls to the corresponding L-thread ones.
This approach will not mitigate any of the porting considerations mentioned
in the previous sections, but it will reduce the amount of code churn that
would otherwise have been involved. It is a reasonable approach to evaluate
L-threads, before investing effort in porting to the native L-thread APIs.
1116
1117
1118Overview
1119^^^^^^^^
1120The L-thread subsystem includes an example pthread shim. This is a partial
1121implementation but does contain the API stubs needed to get basic applications
1122running. There is a simple "hello world" application that demonstrates the
1123use of the pthread shim.
1124
1125A subtlety of working with a shim is that the application will still need
1126to make use of the genuine pthread library functions, at the very least in
1127order to create the EAL threads in which the L-thread schedulers will run.
1128This is the case with DPDK initialization, and exit.
1129
To deal with the initialization and shutdown scenarios, the shim is capable of
switching on or off its adaptor functionality. An application can control this
behavior by calling the function ``pt_override_set()``. The default state
is disabled.

The pthread shim uses the dynamic linker loader and saves the loaded addresses
of the genuine pthread API functions in an internal table. When the shim
functionality is enabled it performs the adaptor function; when disabled it
invokes the genuine pthread function.
1139
The function ``pthread_exit()`` has additional special handling. The standard
system header file ``pthread.h`` declares ``pthread_exit()`` with
``__attribute__((noreturn))``. This is an optimization that is possible because
the pthread is terminating: it enables the compiler to omit the normal
handling of stack and protection of registers, since the function is not
expected to return and in fact the thread is being destroyed. These
optimizations are applied in both the callee and the caller of the
``pthread_exit()`` function.
1148
1149In our cooperative scheduling environment this behavior is inadmissible. The
1150pthread is the L-thread scheduler thread, and, although an L-thread is
1151terminating, there must be a return to the scheduler in order that the system
1152can continue to run. Further, returning from a function with attribute
1153``noreturn`` is invalid and may result in undefined behavior.
1154
The solution is to redefine the ``pthread_exit`` function with a macro,
causing it to be mapped to a stub function in the shim that does not have the
``noreturn`` attribute. This macro is defined in the file
``pthread_shim.h``. The stub function is otherwise no different from any of
the other stub functions in the shim, and will switch between the real
``pthread_exit()`` function or the ``lthread_exit()`` function as
required. The only difference is that the mapping to the stub is made by macro
substitution.
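
The macro remapping technique can be sketched in isolation as follows. This is
not the contents of ``pthread_shim.h``: the stub name, the override flag and
the path-recording enum are all hypothetical, and the stub only records which
branch it would take instead of calling the real exit functions, so that the
sketch is self-contained.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical shim state; the real shim is toggled by pt_override_set(). */
    static bool override_enabled;

    /* Records which path the stub took (illustration only). */
    enum exit_path { PATH_NONE, PATH_PTHREAD, PATH_LTHREAD };
    static enum exit_path last_path = PATH_NONE;

    /*
     * Hypothetical stub, declared WITHOUT __attribute__((noreturn)), so the
     * compiler keeps normal call and return handling in the caller.
     */
    static void
    shim_pthread_exit(void *retval)
    {
        (void)retval;
        if (override_enabled)
            last_path = PATH_LTHREAD; /* a real shim would exit the L-thread */
        else
            last_path = PATH_PTHREAD; /* a real shim would exit the pthread */
    }

    /* Calls to pthread_exit() in legacy code are redirected at compile time. */
    #define pthread_exit(retval) shim_pthread_exit(retval)

    /* Unmodified legacy code now reaches the stub. */
    static void
    legacy_thread_body(void)
    {
        pthread_exit(NULL);
    }

Because the redirection happens in the preprocessor, it only affects code that
is compiled with the macro in scope.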
1163
A consequence of this is that the file ``pthread_shim.h`` must be included in
legacy code wishing to make use of the shim. It also means that dynamic
linkage of a pre-compiled binary that did not include ``pthread_shim.h`` is
not supported.
1168
1169Given the requirements for porting legacy code outlined in
1170:ref:`porting_legacy_code_to_run_on_lthreads` most applications will require at
1171least some minimal adjustment and recompilation to run on L-threads so
1172pre-compiled binaries are unlikely to be met in practice.
1173
1174In summary the shim approach adds some overhead but can be a useful tool to help
1175establish the feasibility of a code reuse project. It is also a fairly
1176straightforward task to extend the shim if necessary.
1177
**Note:** Bearing in mind the preceding discussions about the impact of making
blocking calls, switching the shim in and out on the fly to invoke any
pthread API that might block is something that should typically be avoided.
1181
1182
1183Building and running the pthread shim
1184^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1185
The shim example application is located in the ``pthread_shim`` subdirectory
of the performance-thread sample application folder.

To build and run the pthread shim example:

#. Go to the example applications folder:
1192
1193   .. code-block:: console
1194
1195       export RTE_SDK=/path/to/rte_sdk
1196       cd ${RTE_SDK}/examples/performance-thread/pthread_shim
1197
1198
1199#. Set the target (a default target is used if not specified). For example:
1200
1201   .. code-block:: console
1202
1203       export RTE_TARGET=x86_64-native-linuxapp-gcc
1204
1205   See the DPDK Getting Started Guide for possible RTE_TARGET values.
1206
1207#. Build the application:
1208
1209   .. code-block:: console
1210
1211       make
1212
#. Run the pthread_shim example:
1214
1215   .. code-block:: console
1216
1217       lthread-pthread-shim -c core_mask -n number_of_channels
1218
1219.. _lthread_diagnostics:
1220
1221L-thread Diagnostics
1222~~~~~~~~~~~~~~~~~~~~
1223
1224When debugging you must take account of the fact that the L-threads are run in
1225a single pthread. The current scheduler is defined by
1226``RTE_PER_LCORE(this_sched)``, and the current lthread is stored at
1227``RTE_PER_LCORE(this_sched)->current_lthread``. Thus on a breakpoint in a GDB
1228session the current lthread can be obtained by displaying the pthread local
1229variable ``per_lcore_this_sched->current_lthread``.
1230
Another useful diagnostic feature is the possibility to trace significant
events in the life of an L-thread. This feature is enabled by changing the
value of ``LTHREAD_DIAG`` from 0 to 1 in the file ``lthread_diag_api.h``.
1234
Tracing of events can be individually masked, and the mask may be programmed
at run time. An unmasked event results in a callback that provides information
about the event. The default callback simply prints trace information. The
default mask is 0 (all events off); the mask can be modified by calling the
function ``lthread_diagnostic_set_mask()``.
1240
It is possible to register a user callback function to implement more
sophisticated diagnostic functions.
Object creation events (lthread, mutex, and condition variable) accept, and
store in the created object, a user supplied reference value returned by the
callback function.

The lthread reference value is passed back in all subsequent event callbacks,
and APIs are provided to retrieve the reference value from
mutexes and condition variables. This enables a user to monitor, count, or
filter for specific events on specific objects, for example to monitor for a
specific thread signaling a specific condition variable, or to monitor
all timer events. The possibilities and combinations are endless.
1253
1254The callback function can be set by calling the function
1255``lthread_diagnostic_enable()`` supplying a callback function pointer and an
1256event mask.
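
The interaction between the event mask and a user callback can be sketched
generically. The code below does not use the L-thread diagnostic API: the
event ids, the callback type and the ``demo_`` functions are all hypothetical,
and it assumes that a set bit in the mask enables the corresponding event
(consistent with a default mask of 0 meaning all events off).

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical event ids; the real ones are defined by the L-thread code. */
    enum demo_event { DEMO_THREAD_CREATE, DEMO_COND_SIGNAL, DEMO_TIMER };

    typedef void (*demo_cb)(enum demo_event ev, uint64_t ref);

    static demo_cb demo_callback;  /* NULL until enabled */
    static uint64_t demo_mask;     /* 0 = all events off */

    /* Register a callback together with an event mask. */
    static void
    demo_enable(demo_cb cb, uint64_t mask)
    {
        demo_callback = cb;
        demo_mask = mask;
    }

    /* Invoke the callback only if the event's bit is set in the mask. */
    static void
    demo_trace(enum demo_event ev, uint64_t ref)
    {
        assert((unsigned int)ev < 64);
        if (demo_callback != NULL && (demo_mask & (UINT64_C(1) << ev)) != 0)
            demo_callback(ev, ref);
    }

    /* Example user callback: count signals on one specific object. */
    static uint64_t watched_ref = 7;
    static unsigned int signal_count;

    static void
    count_signals(enum demo_event ev, uint64_t ref)
    {
        if (ev == DEMO_COND_SIGNAL && ref == watched_ref)
            signal_count++;
    }

Calling ``demo_enable(count_signals, UINT64_C(1) << DEMO_COND_SIGNAL)`` would
then count only signal events whose reference value matches ``watched_ref``,
and ignore timer events entirely.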
1257
1258Setting ``LTHREAD_DIAG`` also enables counting of statistics about cache and
1259queue usage, and these statistics can be displayed by calling the function
1260``lthread_diag_stats_display()``. This function also performs a consistency
1261check on the caches and queues. The function should only be called from the
1262master EAL thread after all slave threads have stopped and returned to the C
1263main program, otherwise the consistency check will fail.
1264