..  BSD LICENSE
    Copyright(c) 2015 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in
      the documentation and/or other materials provided with the
      distribution.
    * Neither the name of Intel Corporation nor the names of its
      contributors may be used to endorse or promote products derived
      from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


Performance Thread Sample Application
=====================================

The performance thread sample application is a derivative of the standard L3 forwarding application that demonstrates different threading models.

Overview
--------

For a general description of the L3 forwarding application's capabilities please refer to the documentation of the standard application in :doc:`l3_forward`.

The performance thread sample application differs from the standard L3 forwarding example in that it divides the TX and RX processing between different threads, and makes it possible to assign individual threads to different cores.

Three threading models are considered:

#. When there is one EAL thread per physical core.
#. When there are multiple EAL threads per physical core.
#. When there are multiple lightweight threads per EAL thread.

Since DPDK release 2.0 it is possible to launch applications using the ``--lcores`` EAL parameter, specifying cpu-sets for a physical core. With the performance thread sample application it is now also possible to assign individual RX and TX functions to different cores.

As an alternative to dividing the L3 forwarding work between different EAL threads, the performance thread sample introduces the possibility to run the application threads as lightweight threads (L-threads) within one or more EAL threads.

In order to facilitate this threading model the example includes a primitive cooperative scheduler (L-thread) subsystem. More details of the L-thread subsystem can be found in :ref:`lthread_subsystem`.

**Note:** Whilst it is theoretically possible to run multiple L-thread schedulers on the same physical core, this mode of operation should not be expected to yield useful performance and is considered invalid.
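For reference, the ``--lcores`` cpu-set syntax mentioned above is what allows several EAL threads to share one physical core. The following is a minimal illustration only; the lcore-to-core mapping is hypothetical and the application arguments are elided:

.. code-block:: console

    # lcores 0 and 1 share physical core 2, lcore 2 runs alone on physical core 3
    ./build/l3fwd-thread --lcores="(0,1)@2,2@3" -n 2 -- ...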
Compiling the Application
-------------------------

The application is located in the ``performance-thread`` sub-directory of the sample applications folder.

#. Go to the example applications folder:

   .. code-block:: console

       export RTE_SDK=/path/to/rte_sdk
       cd ${RTE_SDK}/examples/performance-thread/l3fwd-thread

#. Set the target (a default target is used if not specified). For example:

   .. code-block:: console

       export RTE_TARGET=x86_64-native-linuxapp-gcc

   See the *DPDK Linux Getting Started Guide* for possible RTE_TARGET values.

#. Build the application:

   .. code-block:: console

       make


Running the Application
-----------------------

The application has a number of command line options::

    ./build/l3fwd-thread [EAL options] --
        -p PORTMASK [-P]
        --rx(port,queue,lcore,thread)[,(port,queue,lcore,thread)]
        --tx(lcore,thread)[,(lcore,thread)]
        [--enable-jumbo] [--max-pkt-len PKTLEN] [--no-numa]
        [--hash-entry-num] [--ipv6] [--no-lthreads] [--stat-lcore lcore]

Where:

* ``-p PORTMASK``: Hexadecimal bitmask of ports to configure.

* ``-P``: optional, sets all ports to promiscuous mode so that packets are accepted regardless of the packet's Ethernet MAC destination address. Without this option, only packets with the Ethernet MAC destination address set to the Ethernet address of the port are accepted.

* ``--rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)]``: the list of NIC RX ports and queues handled by the RX lcores and threads. The parameters are explained below.

* ``--tx (lcore,thread)[,(lcore,thread)]``: the list of TX threads, identifying the lcore each thread runs on and the id of the RX thread with which it is associated. The parameters are explained below.

* ``--enable-jumbo``: optional, enables jumbo frames.

* ``--max-pkt-len``: optional, maximum packet length in decimal (64-9600).

* ``--no-numa``: optional, disables NUMA awareness.

* ``--hash-entry-num``: optional, specifies the number of hash entries (in hexadecimal) to be set up.

* ``--ipv6``: optional, set this if running IPv6 packets.

* ``--no-lthreads``: optional, disables the L-thread model and uses the EAL threading model. See below.

* ``--stat-lcore``: optional, runs the CPU load statistics collector on the specified lcore.

The parameters of the ``--rx`` and ``--tx`` options are:

* ``--rx`` parameters

  .. _table_l3fwd_rx_parameters:

  +--------+------------------------------------------------------+
  | port   | RX port                                              |
  +--------+------------------------------------------------------+
  | queue  | RX queue that will be read on the specified RX port  |
  +--------+------------------------------------------------------+
  | lcore  | Core to use for the thread                           |
  +--------+------------------------------------------------------+
  | thread | Thread id (numbered consecutively from 0 to N)       |
  +--------+------------------------------------------------------+

* ``--tx`` parameters

  .. _table_l3fwd_tx_parameters:

  +--------+------------------------------------------------------+
  | lcore  | Core to use for L3 route match and transmit          |
  +--------+------------------------------------------------------+
  | thread | Id of RX thread to be associated with this TX thread |
  +--------+------------------------------------------------------+

The ``l3fwd-thread`` application allows you to start packet processing in two threading models: L-Threads (default) and EAL Threads (when the ``--no-lthreads`` parameter is used). For consistency all parameters are used in the same way for both models.


Running with L-threads
~~~~~~~~~~~~~~~~~~~~~~

When the L-thread model is used (the default), the lcore and thread parameters in ``--rx/--tx`` are used to affinitize threads to the selected scheduler.

For example, the following places every l-thread on a different lcore::

    l3fwd-thread -c ff -n 2 -- -P -p 3 \
        --rx="(0,0,0,0)(1,0,1,1)" \
        --tx="(2,0)(3,1)"

The following places the RX l-threads on lcore 0 and the TX l-threads on lcores 1 and 2::

    l3fwd-thread -c ff -n 2 -- -P -p 3 \
        --rx="(0,0,0,0)(1,0,0,1)" \
        --tx="(1,0)(2,1)"


Running with EAL threads
~~~~~~~~~~~~~~~~~~~~~~~~

When the ``--no-lthreads`` parameter is used, the L-threading model is turned off and EAL threads are used for all processing. EAL threads are enumerated in the same way as L-threads, but the ``--lcores`` EAL parameter is used to affinitize threads to the selected cpu-set (scheduler). Thus it is possible to place every RX and TX thread on a different lcore.

For example, the following places every EAL thread on a different lcore::

    l3fwd-thread -c ff -n 2 -- -P -p 3 \
        --rx="(0,0,0,0)(1,0,1,1)" \
        --tx="(2,0)(3,1)" \
        --no-lthreads

To affinitize two or more EAL threads to one cpu-set, the EAL ``--lcores`` parameter is used.

The following places the RX EAL threads (lcores 0 and 1) on physical core 0 and the TX EAL threads (lcores 2 and 3) on physical core 1::

    l3fwd-thread -c ff -n 2 --lcores="(0,1)@0,(2,3)@1" -- -P -p 3 \
        --rx="(0,0,0,0)(1,0,1,1)" \
        --tx="(2,0)(3,1)" \
        --no-lthreads


Examples
~~~~~~~~

For selected scenarios, the L-thread command line configuration of the application and its corresponding EAL thread equivalent can be realized as follows:

a) Start every thread on a different scheduler (1:1)::

       l3fwd-thread -c ff -n 2 -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,1,1)" \
           --tx="(2,0)(3,1)"

   EAL thread equivalent::

       l3fwd-thread -c ff -n 2 -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,1,1)" \
           --tx="(2,0)(3,1)" \
           --no-lthreads

b) Start all threads on one core (N:1).

   Start 4 L-threads on lcore 0::

       l3fwd-thread -c ff -n 2 -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,0,1)" \
           --tx="(0,0)(0,1)"

   Start 4 EAL threads on cpu-set 0::

       l3fwd-thread -c ff -n 2 --lcores="(0-3)@0" -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,0,1)" \
           --tx="(2,0)(3,1)" \
           --no-lthreads

c) Start threads on different cores (N:M).

   Start 2 L-threads for RX on lcore 0, and 2 L-threads for TX on lcore 1::

       l3fwd-thread -c ff -n 2 -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,0,1)" \
           --tx="(1,0)(1,1)"

   Start 2 EAL threads for RX on cpu-set 0, and 2 EAL threads for TX on cpu-set 1::

       l3fwd-thread -c ff -n 2 --lcores="(0-1)@0,(2-3)@1" -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,1,1)" \
           --tx="(2,0)(3,1)" \
           --no-lthreads

Explanation
-----------

To a great extent the sample application differs little from the standard L3 forwarding application, and readers are advised to familiarize themselves with the material covered in the :doc:`l3_forward` documentation before proceeding.

The following explanation is focused on the way threading is handled in the performance thread example.


Mode of operation with EAL threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The performance thread sample application splits the RX and TX functionality into two different threads, and the RX and TX threads are interconnected via software rings. With respect to these rings the RX threads are producers and the TX threads are consumers.

On initialization the TX and RX threads are started according to the command line parameters.

The RX threads poll the network interface queues and post received packets to a TX thread via a corresponding software ring.

The TX threads poll software rings, perform the L3 forwarding hash/LPM match, and assemble packet bursts before performing burst transmit on the network interface.

As with the standard L3 forwarding application, burst draining of residual packets is performed periodically, with the period calculated from elapsed time using the timestamp counter.

The diagram below illustrates a case with two RX threads and three TX threads.

.. _figure_performance_thread_1:

.. figure:: img/performance_thread_1.*


Mode of operation with L-threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Like the EAL thread configuration, the application splits the RX and TX functionality into different threads, and the pairs of RX and TX threads are interconnected via software rings.

On initialization an L-thread scheduler is started on every EAL thread. On all but the master EAL thread only a dummy L-thread is initially started. The L-thread started on the master EAL thread then spawns other L-threads on different L-thread schedulers according to the command line parameters.

The RX threads poll the network interface queues and post received packets to a TX thread via the corresponding software ring.

The ring interface is augmented by means of an L-thread condition variable that enables the TX thread to be suspended when the TX ring is empty. The RX thread signals the condition whenever it posts to the TX ring, causing the TX thread to be resumed.

Additionally the TX L-thread spawns a worker L-thread to take care of polling the software rings, whilst it handles burst draining of the transmit buffer.

The worker threads poll the software rings, perform L3 route lookup and assemble packet bursts. If the TX ring is empty the worker thread suspends itself by waiting on the condition variable associated with the ring.
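The following minimal sketch illustrates this suspend/resume pattern. It is not taken from the example's source; the ``pipe`` struct and function names are hypothetical, the ring enqueue/dequeue calls are left as comments, and it simply assumes the ``lthread_cond_wait()``/``lthread_cond_signal()`` API described later in this document:

.. code-block:: c

    #include <rte_ring.h>
    #include <lthread_api.h>

    /* Hypothetical per-pipe state: a software ring filled by an RX thread and
     * drained by a TX worker L-thread, plus the condition variable used to
     * park the worker while the ring is empty. */
    struct pipe {
        struct rte_ring *ring;
        struct lthread_cond *ready;
    };

    /* RX side: called after a burst of packets has been enqueued on p->ring. */
    static void rx_notify(struct pipe *p)
    {
        lthread_cond_signal(p->ready);  /* resume the TX worker if it is waiting */
    }

    /* TX worker L-thread: suspends on the condition variable when idle so that
     * the scheduler can run other L-threads instead of busy polling the ring. */
    static void tx_worker(void *arg)
    {
        struct pipe *p = arg;

        for (;;) {
            if (rte_ring_empty(p->ring)) {
                lthread_cond_wait(p->ready, 0);
                continue;
            }
            /* ... dequeue a burst, perform the L3 lookup, add to the TX buffer ... */
        }
    }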
Burst draining of residual packets, less than the burst size, is performed by the TX thread, which sleeps (using an L-thread sleep function) and resumes periodically to flush the TX buffer.

This design means that L-threads that have no work can yield the CPU to other L-threads, avoiding the need to constantly poll the software rings.

The diagram below illustrates a case with two RX threads and three TX functions (each comprising a thread that processes forwarding and a thread that periodically drains the output buffer of residual packets).

.. _figure_performance_thread_2:

.. figure:: img/performance_thread_2.*


CPU load statistics
~~~~~~~~~~~~~~~~~~~

It is possible to display statistics showing estimated CPU load on each core. The statistics indicate the percentage of CPU time spent: processing received packets (forwarding), polling queues/rings (waiting for work), and doing any other processing (context switch and other overhead).

When enabled, statistics are gathered by having the application threads set and clear flags when they enter and exit pertinent code sections. The flags are then sampled in real time by a statistics collector thread running on another core. This thread displays the data in real time on the console.

This feature is enabled by designating a statistics collector core, using the ``--stat-lcore`` parameter.


.. _lthread_subsystem:

The L-thread subsystem
----------------------

The L-thread subsystem resides in the ``examples/performance-thread/common`` directory and is built and linked automatically when building the ``l3fwd-thread`` example.

The subsystem provides a simple cooperative scheduler to enable arbitrary functions to run as cooperative threads within a single EAL thread. The subsystem provides a pthread-like API that is intended to assist in the reuse of legacy code written for POSIX pthreads.

The following sections provide some detail on the features, constraints, performance and porting considerations when using L-threads.


.. _comparison_between_lthreads_and_pthreads:

Comparison between L-threads and POSIX pthreads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The fundamental difference between the L-thread and pthread models is the way in which threads are scheduled. The simplest way to think about this is to consider the case of a processor with a single CPU. To run multiple threads on a single CPU, the scheduler must frequently switch between the threads, in order that each thread is able to make timely progress. This is the basis of any multitasking operating system.

This section explores the differences between the pthread model and the L-thread model as implemented in the provided L-thread subsystem. If needed, a theoretical discussion of preemptive vs cooperative multi-threading can be found in any good text on operating system design.


Scheduling and context switching
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The POSIX pthread library provides an application programming interface to create and synchronize threads. Scheduling policy is determined by the host OS, and may be configurable. The OS may use sophisticated rules to determine which thread should be run next, threads may suspend themselves or make other threads ready, and the scheduler may employ a time slice giving each thread a maximum time quantum after which it will be preempted in favor of another thread that is ready to run. To complicate matters further, threads may be assigned different scheduling priorities.

By contrast the L-thread subsystem is considerably simpler. Logically the L-thread scheduler performs the same multiplexing function for L-threads within a single pthread as the OS scheduler does for pthreads within an application process. The L-thread scheduler is simply the main loop of a pthread, and in so far as the host OS is concerned it is a regular pthread just like any other. The host OS is oblivious to the existence of L-threads and is not at all involved in their scheduling.

The other, and most significant, difference between the two models is that L-threads are scheduled cooperatively. L-threads cannot preempt each other, nor can the L-thread scheduler preempt a running L-thread (i.e. there is no time slicing). The consequence is that programs implemented with L-threads must possess frequent rescheduling points, meaning that they must explicitly and of their own volition return to the scheduler at frequent intervals, in order to allow other L-threads an opportunity to proceed.

In both models switching between threads requires that the current CPU context is saved and a new context (belonging to the next thread ready to run) is restored. With pthreads this context switching is handled transparently, and the set of CPU registers that must be preserved between context switches is as per an interrupt handler.

An L-thread context switch is achieved by the thread itself making a function call to the L-thread scheduler. Thus it is only necessary to preserve the callee-save registers. The caller is responsible for saving and restoring any other registers it is using before a function call and restoring them on return, and this is handled by the compiler. For ``X86_64`` on both Linux and BSD the System V calling convention is used; this defines registers RSP, RBP, and R12-R15 as callee-save registers (for a more detailed discussion a good reference is `X86 Calling Conventions <https://en.wikipedia.org/wiki/X86_calling_conventions>`_).

Taking advantage of this, and due to the absence of preemption, an L-thread context switch is achieved with less than 20 load/store instructions.

The scheduling policy for L-threads is fixed: there is no prioritization of L-threads, all L-threads are equal, and scheduling is based on a FIFO ready queue.

An L-thread is a struct containing the CPU context of the thread (saved on context switch) and other useful items. The ready queue contains pointers to threads that are ready to run. The L-thread scheduler is a simple loop that polls the ready queue and reads from it the next thread ready to run, which it resumes by saving the current context (the current position in the scheduler loop) and restoring the context of the next thread from its thread struct. Thus an L-thread is always resumed at the last place it yielded.

A well behaved L-thread will call the context switch regularly (at least once in its main loop), thus returning to the scheduler's own main loop. Yielding inserts the current thread at the back of the ready queue, and the process of servicing the ready queue is repeated; thus the system runs by flipping back and forth between L-threads and the scheduler loop.

In the case of pthreads, the preemptive scheduling, time slicing, and support for thread prioritization mean that progress is normally possible for any thread that is ready to run. This comes at the price of a relatively heavier context switch and scheduling overhead.

With L-threads the progress of any particular thread is determined by the frequency of rescheduling opportunities in the other L-threads. This means that an errant L-thread monopolizing the CPU might cause scheduling of other threads to be stalled. Due to the lower cost of context switching, however, voluntary rescheduling to ensure progress of other threads, if managed sensibly, is not a prohibitive overhead, and overall performance can exceed that of an application using pthreads.


Mutual exclusion
^^^^^^^^^^^^^^^^

With pthreads, preemption means that threads that share data must observe some form of mutual exclusion protocol.

The fact that L-threads cannot preempt each other means that in many cases mutual exclusion devices can be completely avoided.

Locking to protect shared data can be a significant bottleneck in multi-threaded applications, so a carefully designed cooperatively scheduled program can enjoy significant performance advantages.

So far we have considered only the simplistic case of a single core CPU; when multiple CPUs are considered things are somewhat more complex.

First of all it is inevitable that there must be multiple L-thread schedulers, one running on each EAL thread. So long as these schedulers remain isolated from each other the above assertions about the potential advantages of cooperative scheduling hold true.

A configuration with isolated cooperative schedulers is less flexible than the pthread model where threads can be affinitized to run on any CPU. With isolated schedulers, scaling of applications to utilize fewer or more CPUs according to system demand is very difficult to achieve.

The L-thread subsystem makes it possible for L-threads to migrate between schedulers running on different CPUs. Needless to say, if the migration means that threads that share data end up running on different CPUs then this will introduce the need for some kind of mutual exclusion system.

Of course ``rte_ring`` software rings can always be used to interconnect threads running on different cores; however, to protect other kinds of shared data structures, lock-free constructs or explicit locking will be required. This is a consideration for the application design.

In support of this extended functionality, the L-thread subsystem implements thread safe mutexes and condition variables.

The cost of affinitizing and of condition variable signaling is significantly lower than the equivalent pthread operations, and so applications using these features will see a performance benefit.


Thread local storage
^^^^^^^^^^^^^^^^^^^^

As with applications written for pthreads, an application written for L-threads can take advantage of thread local storage, in this case local to an L-thread. An application may save and retrieve a single pointer to application data in the L-thread struct.
For legacy and backward compatibility reasons two alternative methods are also offered. The first is modelled directly on the pthread get/set specific APIs, and the second is modelled on the ``RTE_PER_LCORE`` macros, whereby ``PER_LTHREAD`` macros are introduced. In both cases the storage is local to the L-thread.


.. _constraints_and_performance_implications:

Constraints and performance implications when using L-threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


.. _API_compatibility:

API compatibility
^^^^^^^^^^^^^^^^^

The L-thread subsystem provides a set of functions that are logically equivalent to the corresponding functions offered by the POSIX pthread library; however, not all pthread functions have a corresponding L-thread equivalent, and not all features available to pthreads are implemented for L-threads.

The pthread library offers considerable flexibility via programmable attributes that can be associated with threads, mutexes, and condition variables.

By contrast the L-thread subsystem has fixed functionality: the scheduler policy cannot be varied, and L-threads cannot be prioritized. There are no variable attributes associated with any L-thread objects. L-threads, mutexes and condition variables all have fixed functionality. (Note: reserved parameters are included in the APIs to facilitate possible future support for attributes).

The table below lists the pthread and equivalent L-thread APIs with notes on differences and/or constraints. Where there is no L-thread entry in the table, the L-thread subsystem provides no equivalent function.

.. _table_lthread_pthread:

.. table:: Pthread and equivalent L-thread APIs.

   +----------------------------+------------------------+-------------------+
   | **Pthread function**       | **L-thread function**  | **Notes**         |
   +============================+========================+===================+
   | pthread_barrier_destroy    |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_barrier_init       |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_barrier_wait       |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_broadcast     | lthread_cond_broadcast | See note 1        |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_destroy       | lthread_cond_destroy   |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_init          | lthread_cond_init      |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_signal        | lthread_cond_signal    | See note 1        |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_timedwait     |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_wait          | lthread_cond_wait      | See note 5        |
   +----------------------------+------------------------+-------------------+
   | pthread_create             | lthread_create         | See notes 2, 3    |
   +----------------------------+------------------------+-------------------+
   | pthread_detach             | lthread_detach         | See note 4        |
   +----------------------------+------------------------+-------------------+
   | pthread_equal              |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_exit               | lthread_exit           |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_getspecific        | lthread_getspecific    |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_getcpuclockid      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_join               | lthread_join           |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_key_create         | lthread_key_create     |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_key_delete         | lthread_key_delete     |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_destroy      | lthread_mutex_destroy  |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_init         | lthread_mutex_init     |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_lock         | lthread_mutex_lock     | See note 6        |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_trylock      | lthread_mutex_trylock  | See note 6        |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_timedlock    |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_unlock       | lthread_mutex_unlock   |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_once               |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_destroy     |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_init        |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_rdlock      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_timedrdlock |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_timedwrlock |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_tryrdlock   |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_trywrlock   |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_unlock      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_wrlock      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_self               | lthread_current        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setspecific        | lthread_setspecific    |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_init          |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_destroy       |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_lock          |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_trylock       |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_unlock        |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_cancel             | lthread_cancel         |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setcancelstate     |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setcanceltype      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_testcancel         |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_getschedparam      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setschedparam      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_yield              | lthread_yield          | See note 7        |
   +----------------------------+------------------------+-------------------+
   | pthread_setaffinity_np     | lthread_set_affinity   | See notes 2, 3, 8 |
   +----------------------------+------------------------+-------------------+
   |                            | lthread_sleep          | See note 9        |
   +----------------------------+------------------------+-------------------+
   |                            | lthread_sleep_clks     | See note 9        |
   +----------------------------+------------------------+-------------------+


**Note 1**:

Neither lthread signal nor broadcast may be called concurrently by L-threads running on different schedulers, although multiple L-threads running in the same scheduler may freely perform signal or broadcast operations. L-threads running on the same or different schedulers may always safely wait on a condition variable.


**Note 2**:

Pthread attributes may be used to affinitize a pthread with a cpu-set. The L-thread subsystem does not support a cpu-set. An L-thread may be affinitized only with a single CPU at any time.


**Note 3**:

If an L-thread is intended to run on a different NUMA node than the node that creates the thread then, when calling ``lthread_create()``, it is advantageous to specify the destination core as a parameter of ``lthread_create()``. See :ref:`memory_allocation_and_NUMA_awareness` for details.


**Note 4**:

An L-thread can only detach itself, and cannot detach other L-threads.


**Note 5**:

A wait operation on a pthread condition variable is always associated with and protected by a mutex which must be owned by the thread at the time it invokes ``pthread_cond_wait()``. By contrast L-thread condition variables are thread safe (for waiters) and do not use an associated mutex. Multiple L-threads (including L-threads running on other schedulers) can safely wait on an L-thread condition variable. As a consequence the performance of an L-thread condition variable is typically an order of magnitude faster than its pthread counterpart.


**Note 6**:

Recursive locking is not supported with L-threads; attempts to take a lock recursively will be detected and rejected.


**Note 7**:

``lthread_yield()`` will save the current context, insert the current thread at the back of the ready queue, and resume the next ready thread. Yielding increases ready queue backlog, see :ref:`ready_queue_backlog` for more details about the implications of this.

N.B. The context switch time, as measured from immediately before the call to ``lthread_yield()`` to the point at which the next ready thread is resumed, can be an order of magnitude faster than the same measurement for ``pthread_yield()``.


**Note 8**:

``lthread_set_affinity()`` is similar to a yield apart from the fact that the yielding thread is inserted into a peer ready queue of another scheduler. The peer ready queue is actually a separate thread safe queue, which means that threads appearing in the peer ready queue can jump any backlog in the local ready queue on the destination scheduler.

The context switch time as measured from the time just before the call to ``lthread_set_affinity()`` to just after the same thread is resumed on the new scheduler can be orders of magnitude faster than the same measurement for ``pthread_setaffinity_np()``.


**Note 9**:

Although there is no ``pthread_sleep()`` function, ``lthread_sleep()`` and ``lthread_sleep_clks()`` can be used wherever ``sleep()``, ``usleep()`` or ``nanosleep()`` might ordinarily be used. The L-thread sleep functions suspend the current thread, start an ``rte_timer`` and resume the thread when the timer matures. The ``rte_timer_manage()`` entry point is called on every pass of the scheduler loop. This means that the worst case jitter on timer expiry is determined by the longest period between context switches of any running L-threads.

In a synthetic test with many threads sleeping and resuming, the measured jitter is typically orders of magnitude lower than the same measurement made for ``nanosleep()``.


**Note 10**:

Spin locks are not provided because they are problematic in a cooperative environment; see :ref:`porting_locks_and_spinlocks` for a more detailed discussion on how to avoid spin locks.


.. _Thread_local_storage_performance:

Thread local storage
^^^^^^^^^^^^^^^^^^^^

Of the three L-thread local storage options the simplest and most efficient is storing a single application data pointer in the L-thread struct.

The ``PER_LTHREAD`` macros involve a run time computation to obtain the address of the variable being saved/retrieved and also require that the accesses are de-referenced via a pointer. This means that code using the ``RTE_PER_LCORE`` macros that is being ported to L-threads might need some slight adjustment (see :ref:`porting_thread_local_storage` for hints about porting code that makes use of thread local storage).

The get/set specific APIs are consistent with their pthread counterparts both in use and in performance.


.. _memory_allocation_and_NUMA_awareness:

Memory allocation and NUMA awareness
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All memory allocation is from DPDK huge pages, and is NUMA aware. Each scheduler maintains its own caches of objects: lthreads, their stacks, TLS, mutexes and condition variables. These caches are implemented as unbounded lock-free MPSC queues. When objects are created they are always allocated from the caches on the local core (current EAL thread).

If an L-thread has been affinitized to a different scheduler, then it can always safely free resources to the caches from which they originated (because the caches are MPSC queues).

If the L-thread has been affinitized to a different NUMA node then the memory resources associated with it may incur longer access latency.

The commonly used pattern of setting affinity on entry to a thread after it has started means that memory allocation for both the stack and TLS will have been made from caches on the NUMA node on which the thread's creator is running. This has the side effect that access latency will be sub-optimal after affinitizing.

This side effect can be mitigated to some extent (although not completely) by specifying the destination CPU as a parameter of ``lthread_create()``. This causes the L-thread's stack and TLS to be allocated when it is first scheduled on the destination scheduler; if the destination is on another NUMA node it results in a more optimal memory allocation.

Note that the lthread struct itself remains allocated from memory on the creating node; this is unavoidable because an L-thread is known everywhere by the address of this struct.


.. _object_cache_sizing:

Object cache sizing
^^^^^^^^^^^^^^^^^^^

The per lcore object caches pre-allocate objects in bulk whenever a request to allocate an object finds a cache empty. By default 100 objects are pre-allocated; this is defined by ``LTHREAD_PREALLOC`` in the public API header file ``lthread_api.h``. This means that the caches constantly grow to meet system demand.

In the present implementation there is no mechanism to reduce the cache sizes if system demand reduces. Thus the caches will remain at their maximum extent indefinitely.

A consequence of the bulk pre-allocation of objects is that every 100 (default value) additional new object create operations results in a call to ``rte_malloc()``. For creation of objects such as L-threads, which trigger the allocation of even more objects (i.e. their stacks and TLS), this can cause outliers in scheduling performance.

If this is a problem the simplest mitigation strategy is to dimension the system by setting the bulk object pre-allocation size to some large number that you do not expect to be exceeded. This means the caches will be populated once only, the very first time a thread is created.


.. _Ready_queue_backlog:

Ready queue backlog
^^^^^^^^^^^^^^^^^^^

One of the more subtle performance considerations is managing the ready queue backlog. The fewer threads that are waiting in the ready queue, the faster any particular thread will be serviced.

In a naive L-thread application with N L-threads simply looping and yielding, this backlog will always be equal to the number of L-threads; thus the cost of a yield to a particular L-thread will be N times the context switch time.

This side effect can be mitigated by arranging for threads to be suspended and wait to be resumed, rather than polling for work by constantly yielding. Blocking on a mutex or condition variable, or even more obviously having a thread sleep if it has a low frequency workload, are all mechanisms by which a thread can be excluded from the ready queue until it really does need to be run. This can have a significant positive impact on performance.


.. _Initialization_and_shutdown_dependencies:

Initialization, shutdown and dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The L-thread subsystem depends on DPDK for huge page allocation and depends on the ``rte_timer`` subsystem. The DPDK EAL initialization and ``rte_timer_subsystem_init()`` **MUST** be completed before the L-thread subsystem can be used.

Thereafter initialization of the L-thread subsystem is largely transparent to the application. Constructor functions ensure that global variables are properly initialized. Other than global variables, each scheduler is initialized independently the first time that an L-thread is created by a particular EAL thread.

If the schedulers are to be run as isolated and independent schedulers, with no intention that L-threads running on different schedulers will migrate between schedulers or synchronize with L-threads running on other schedulers, then initialization consists simply of creating an L-thread and then running the L-thread scheduler.

If there will be interaction between L-threads running on different schedulers, then it is important that the starting of schedulers on different EAL threads is synchronized.

To achieve this an additional initialization step is necessary: set the number of schedulers by calling the API function ``lthread_num_schedulers_set(n)``, where ``n`` is the number of EAL threads that will run L-thread schedulers. Setting the number of schedulers to a number greater than 0 will cause all schedulers to wait until the others have started before beginning to schedule L-threads.

The L-thread scheduler is started by calling the function ``lthread_run()``. It should be called from the EAL thread and thus becomes the main loop of that EAL thread.

The function ``lthread_run()`` will not return until all threads running on the scheduler have exited and the scheduler has been explicitly stopped by calling ``lthread_scheduler_shutdown(lcore)`` or ``lthread_scheduler_shutdown_all()``.
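As an illustrative sketch only (the ``initial_lthread()`` function and the use of every available lcore are hypothetical, and error handling is omitted), a typical start-up sequence combining these steps might look as follows:

.. code-block:: c

    #include <rte_eal.h>
    #include <rte_common.h>
    #include <rte_launch.h>
    #include <rte_lcore.h>
    #include <rte_timer.h>
    #include <lthread_api.h>

    /* hypothetical first L-thread: would create the application's other L-threads */
    static void initial_lthread(__rte_unused void *arg)
    {
        /* ... spawn RX/TX L-threads with lthread_create() ... */
    }

    /* runs on every EAL thread and becomes its main loop */
    static int sched_main(__rte_unused void *arg)
    {
        if (rte_lcore_id() == rte_get_master_lcore()) {
            struct lthread *lt;

            lthread_create(&lt, -1, initial_lthread, NULL); /* -1 = current lcore */
        }
        lthread_run(); /* returns only after the scheduler is shut down */
        return 0;
    }

    int main(int argc, char **argv)
    {
        if (rte_eal_init(argc, argv) < 0)
            return -1;

        rte_timer_subsystem_init();

        /* make all schedulers wait for each other before scheduling L-threads */
        lthread_num_schedulers_set(rte_lcore_count());

        rte_eal_mp_remote_launch(sched_main, NULL, CALL_MASTER);
        rte_eal_mp_wait_lcore();
        return 0;
    }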
All these functions do is tell the scheduler that it can exit when there are no longer any running L-threads; neither function forces any running L-thread to terminate. Any desired application shutdown behavior must be designed and built into the application to ensure that L-threads complete in a timely manner.

**Important Note:** It is assumed when the scheduler exits that the application is terminating for good; the scheduler does not free resources before exiting, and running the scheduler a subsequent time will result in undefined behavior.


.. _porting_legacy_code_to_run_on_lthreads:

Porting legacy code to run on L-threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Legacy code originally written for a pthread environment may be ported to L-threads if the considerations about differences in scheduling policy and the constraints discussed in the previous sections can be accommodated.

This section looks in more detail at some of the issues that may have to be resolved when porting code.


.. _pthread_API_compatibility:

pthread API compatibility
^^^^^^^^^^^^^^^^^^^^^^^^^

The first step is to establish exactly which pthread APIs the legacy application uses, and to understand the requirements of those APIs. If there are corresponding L-thread APIs, and the application uses only the default pthread functionality, then, notwithstanding the other issues discussed here, it should be feasible to run the application with L-threads. If the legacy code modifies the default behavior using attributes then it may be necessary to make some adjustments to eliminate those requirements.


.. _blocking_system_calls:

Blocking system API calls
^^^^^^^^^^^^^^^^^^^^^^^^^

It is important to understand what other system services the application may be using, bearing in mind that in a cooperatively scheduled environment a thread cannot block without stalling the scheduler and, with it, all other cooperative threads. Any kind of blocking system call, for example file or socket IO, is a potential problem; a good tool to analyze the application for this purpose is the ``strace`` utility.

There are many strategies to resolve these kinds of issues, each with its merits. Possible solutions include:

* Adopting a polled mode of the system API concerned (if available).

* Arranging for another core to perform the function and synchronizing with that core via constructs that will not block the L-thread.

* Affinitizing the thread to another scheduler devoted (as a matter of policy) to handling threads wishing to make blocking calls, and then back again when finished.


.. _porting_locks_and_spinlocks:

Locks and spinlocks
^^^^^^^^^^^^^^^^^^^

Locks and spinlocks are another source of blocking behavior that, for the same reasons as system calls, will need to be addressed.

If the application design ensures that the contending L-threads will always run on the same scheduler then it is probably safe to remove locks and spin locks completely.

The only exception to the above rule is if for some reason the code performs any kind of context switch whilst holding the lock (e.g. yield, sleep, or block on a different lock, or on a condition variable). This will need to be determined before deciding to eliminate a lock.

If a lock cannot be eliminated then an L-thread mutex can be substituted for either kind of lock.

An L-thread blocking on an L-thread mutex will be suspended and will cause another ready L-thread to be resumed, thus not blocking the scheduler. When default behavior is required, it can be used as a direct replacement for a pthread mutex lock.

Spin locks are typically used when lock contention is likely to be rare and where the period during which the lock may be held is relatively short. When the contending L-threads are running on the same scheduler, an L-thread blocking on a spin lock will enter an infinite loop, stopping the scheduler completely (see :ref:`porting_infinite_loops` below).

If the application design ensures that contending L-threads will always run on different schedulers then it might be reasonable to leave a short spin lock that rarely experiences contention in place.

If after all considerations it appears that a spin lock can neither be eliminated completely, replaced with an L-thread mutex, nor left in place as is, then an alternative is to loop on a flag, with a call to ``lthread_yield()`` inside the loop (n.b. if the contending L-threads might ever run on different schedulers the flag will need to be manipulated atomically).

Spinning and yielding is the least preferred solution since it introduces ready queue backlog (see also :ref:`ready_queue_backlog`).


.. _porting_sleeps_and_delays:

Sleeps and delays
^^^^^^^^^^^^^^^^^

Yet another kind of blocking behavior (albeit momentary) are delay functions like ``sleep()``, ``usleep()``, ``nanosleep()`` etc. All will have the consequence of stalling the L-thread scheduler and, unless the delay is very short (e.g. a very short nanosleep), calls to these functions will need to be eliminated.

The simplest mitigation strategy is to use the L-thread sleep API functions, of which two variants exist, ``lthread_sleep()`` and ``lthread_sleep_clks()``. These functions start an rte_timer against the L-thread, suspend the L-thread and cause another ready L-thread to be resumed. The suspended L-thread is resumed when the rte_timer matures.


.. _porting_infinite_loops:

Infinite loops
^^^^^^^^^^^^^^

Some applications have threads with loops that contain no inherent rescheduling opportunity, and rely solely on the OS time slicing to share the CPU. In a cooperative environment this will stop everything dead. These kinds of loops are not hard to identify; in a debug session you will find that the debugger is always stopping in the same loop.

The simplest solution to this kind of problem is to insert an explicit ``lthread_yield()`` or ``lthread_sleep()`` into the loop. Another solution might be to include the function performed by the loop into the execution path of some other loop that does in fact yield, if this is possible.


.. _porting_thread_local_storage:

Thread local storage
^^^^^^^^^^^^^^^^^^^^

If the application uses thread local storage, the use case should be studied carefully.

In a legacy pthread application, either or both of the ``__thread`` prefix and the pthread set/get specific APIs may have been used to define storage local to a pthread.
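For instance, code that used the pthread key APIs can usually be mapped one-for-one onto the L-thread key APIs listed in the table above. The sketch below is illustrative only; the ``stats`` struct and ``worker()`` function are hypothetical:

.. code-block:: c

    #include <stdlib.h>
    #include <rte_common.h>
    #include <lthread_api.h>

    /* hypothetical per-thread statistics kept in thread local storage */
    struct stats {
        unsigned long rx;
        unsigned long tx;
    };

    /* created once at start-up with lthread_key_create(&stats_key, free) */
    static unsigned int stats_key;

    static void worker(__rte_unused void *arg)
    {
        /* pthread_setspecific(key, s)  ->  lthread_setspecific(key, s) */
        lthread_setspecific(stats_key, calloc(1, sizeof(struct stats)));

        for (;;) {
            /* pthread_getspecific(key)  ->  lthread_getspecific(key) */
            struct stats *s = lthread_getspecific(stats_key);

            s->rx++;
            /* ... forwarding work ... */
            lthread_yield();
        }
    }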
In some applications it may be a reasonable assumption that the data could, or in fact most likely should, be placed in L-thread local storage.

If the application (like many DPDK applications) has assumed a certain relationship between a pthread and the CPU to which it is affinitized, there is a risk that thread local storage may have been used to save some data items that are correctly logically associated with the CPU, and other items which relate to application context for the thread. Only a good understanding of the application will reveal such cases.

If the application requires that an L-thread be able to move between schedulers then care should be taken to separate these kinds of data into per-lcore and per-L-thread storage. In this way a migrating thread will bring with it the local data it needs, and pick up the new logical core specific values from pthread local storage at its new home.


.. _pthread_shim:

Pthread shim
~~~~~~~~~~~~

A convenient way to get something working with legacy code can be to use a shim that adapts pthread API calls to the corresponding L-thread ones. This approach will not mitigate any of the porting considerations mentioned in the previous sections, but it will reduce the amount of code churn that would otherwise be involved. It is a reasonable approach to evaluate L-threads before investing effort in porting to the native L-thread APIs.


Overview
^^^^^^^^

The L-thread subsystem includes an example pthread shim. This is a partial implementation but does contain the API stubs needed to get basic applications running. There is a simple "hello world" application that demonstrates the use of the pthread shim.

A subtlety of working with a shim is that the application will still need to make use of the genuine pthread library functions, at the very least in order to create the EAL threads in which the L-thread schedulers will run. This is the case with DPDK initialization and exit.

To deal with the initialization and shutdown scenarios, the shim is capable of switching its adaptor functionality on or off; an application can control this behavior by calling the function ``pt_override_set()``. The default state is disabled.

The pthread shim uses the dynamic linker loader and saves the loaded addresses of the genuine pthread API functions in an internal table; when the shim functionality is enabled it performs the adaptor function, and when disabled it invokes the genuine pthread function.

The function ``pthread_exit()`` has additional special handling. The standard system header file ``pthread.h`` declares ``pthread_exit()`` with ``__attribute__((noreturn))``. This is an optimization that is possible because the pthread is terminating, and it enables the compiler to omit the normal handling of the stack and protection of registers since the function is not expected to return; in fact the thread is being destroyed. These optimizations are applied in both the callee and the caller of the ``pthread_exit()`` function.

In our cooperative scheduling environment this behavior is inadmissible. The pthread is the L-thread scheduler thread, and, although an L-thread is terminating, there must be a return to the scheduler in order that the system can continue to run. Further, returning from a function with attribute ``noreturn`` is invalid and may result in undefined behavior.

The solution is to redefine the ``pthread_exit`` function with a macro, causing it to be mapped to a stub function in the shim that does not have the ``noreturn`` attribute. This macro is defined in the file ``pthread_shim.h``. The stub function is otherwise no different than any of the other stub functions in the shim, and will switch between the real ``pthread_exit()`` function or the ``lthread_exit()`` function as required. The only difference is that the mapping to the stub is done by macro substitution.

A consequence of this is that the file ``pthread_shim.h`` must be included in legacy code wishing to make use of the shim. It also means that dynamic linkage of a pre-compiled binary that did not include ``pthread_shim.h`` is not supported.

Given the requirements for porting legacy code outlined in :ref:`porting_legacy_code_to_run_on_lthreads`, most applications will require at least some minimal adjustment and recompilation to run on L-threads, so pre-compiled binaries are unlikely to be encountered in practice.

In summary the shim approach adds some overhead but can be a useful tool to help establish the feasibility of a code reuse project. It is also a fairly straightforward task to extend the shim if necessary.

**Note:** Bearing in mind the preceding discussion about the impact of making blocking calls, switching the shim in and out on the fly in order to invoke a pthread API that might block is something that should typically be avoided.


Building and running the pthread shim
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The shim example application is located in the ``pthread_shim`` sub-directory of the ``performance-thread`` sample application folder.

To build and run the pthread shim example:

#. Go to the example applications folder:

   .. code-block:: console

       export RTE_SDK=/path/to/rte_sdk
       cd ${RTE_SDK}/examples/performance-thread/pthread_shim

#. Set the target (a default target is used if not specified). For example:

   .. code-block:: console

       export RTE_TARGET=x86_64-native-linuxapp-gcc

   See the *DPDK Getting Started Guide* for possible RTE_TARGET values.

#. Build the application:

   .. code-block:: console

       make

#. Run the pthread_shim example:

   .. code-block:: console

       lthread-pthread-shim -c core_mask -n number_of_channels

.. _lthread_diagnostics:

L-thread Diagnostics
~~~~~~~~~~~~~~~~~~~~

When debugging you must take account of the fact that the L-threads are run in a single pthread. The current scheduler is defined by ``RTE_PER_LCORE(this_sched)``, and the current lthread is stored at ``RTE_PER_LCORE(this_sched)->current_lthread``. Thus on a breakpoint in a GDB session the current lthread can be obtained by displaying the pthread local variable ``per_lcore_this_sched->current_lthread``.

Another useful diagnostic feature is the possibility to trace significant events in the life of an L-thread; this feature is enabled by changing the value of ``LTHREAD_DIAG`` from 0 to 1 in the file ``lthread_diag_api.h``.

Tracing of events can be individually masked, and the mask may be programmed at run time. An unmasked event results in a callback that provides information about the event. The default callback simply prints trace information. The default mask is 0 (all events off); the mask can be modified by calling the function ``lthread_diagnostic_set_mask()``.

It is possible to register a user callback function to implement more sophisticated diagnostic functions. Object creation events (lthread, mutex, and condition variable) accept, and store in the created object, a user supplied reference value returned by the callback function.

The lthread reference value is passed back in all subsequent event callbacks, and APIs are provided to retrieve the reference value from mutexes and condition variables. This enables a user to monitor, count, or filter for specific events on specific objects, for example to monitor for a specific thread signaling a specific condition variable, or to monitor all timer events; the possibilities and combinations are endless.

The callback function can be set by calling the function ``lthread_diagnostic_enable()``, supplying a callback function pointer and an event mask.

Setting ``LTHREAD_DIAG`` also enables counting of statistics about cache and queue usage, and these statistics can be displayed by calling the function ``lthread_diag_stats_display()``. This function also performs a consistency check on the caches and queues. The function should only be called from the master EAL thread after all slave threads have stopped and returned to the C main program, otherwise the consistency check will fail.