..  BSD LICENSE
    Copyright(c) 2015 Intel Corporation. All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in
      the documentation and/or other materials provided with the
      distribution.
    * Neither the name of Intel Corporation nor the names of its
      contributors may be used to endorse or promote products derived
      from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


Performance Thread Sample Application
=====================================

The performance thread sample application is a derivative of the standard L3
forwarding application that demonstrates different threading models.

Overview
--------
For a general description of the L3 forwarding application's capabilities
please refer to the documentation of the standard application in
:doc:`l3_forward`.

The performance thread sample application differs from the standard L3
forwarding example in that it divides the TX and RX processing between
different threads, and makes it possible to assign individual threads to
different cores.

Three threading models are considered:

#. When there is one EAL thread per physical core.
#. When there are multiple EAL threads per physical core.
#. When there are multiple lightweight threads per EAL thread.

Since DPDK release 2.0 it is possible to launch applications using the
``--lcores`` EAL parameter, specifying cpu-sets for a physical core. With the
performance thread sample application it is now also possible to assign
individual RX and TX functions to different cores.

As an alternative to dividing the L3 forwarding work between different EAL
threads, the performance thread sample introduces the possibility to run the
application threads as lightweight threads (L-threads) within one or
more EAL threads.

In order to facilitate this threading model the example includes a primitive
cooperative scheduler (L-thread) subsystem. More details of the L-thread
subsystem can be found in :ref:`lthread_subsystem`.

**Note:** Whilst theoretically possible, it is not anticipated that multiple
L-thread schedulers would be run on the same physical core. This mode of
operation should not be expected to yield useful performance and is
considered invalid.

Compiling the Application
-------------------------

To compile the sample application see :doc:`compiling`.

The application is located in the `performance-thread/l3fwd-thread` sub-directory.

Running the Application
-----------------------

The application has a number of command line options::

    ./build/l3fwd-thread [EAL options] --
        -p PORTMASK [-P]
        --rx(port,queue,lcore,thread)[,(port,queue,lcore,thread)]
        --tx(lcore,thread)[,(lcore,thread)]
        [--enable-jumbo [--max-pkt-len PKTLEN]] [--no-numa]
        [--hash-entry-num] [--ipv6] [--no-lthreads] [--stat-lcore lcore]
        [--parse-ptype]

Where:

* ``-p PORTMASK``: Hexadecimal bitmask of ports to configure.

* ``-P``: optional, sets all ports to promiscuous mode so that packets are
  accepted regardless of the packet's Ethernet MAC destination address.
  Without this option, only packets with the Ethernet MAC destination address
  set to the Ethernet address of the port are accepted.

* ``--rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)]``: the list of
  NIC RX ports and queues handled by the RX lcores and threads. The parameters
  are explained below.

* ``--tx (lcore,thread)[,(lcore,thread)]``: the list of TX threads identifying
  the lcore the thread runs on, and the id of the RX thread with which it is
  associated. The parameters are explained below.

* ``--enable-jumbo``: optional, enables jumbo frames.

* ``--max-pkt-len``: optional, maximum packet length in decimal (64-9600).

* ``--no-numa``: optional, disables NUMA awareness.

* ``--hash-entry-num``: optional, specifies the hash entry number in hex to be
  set up.

* ``--ipv6``: optional, set if running IPv6 packets.

* ``--no-lthreads``: optional, disables the L-thread model and uses the EAL
  threading model. See below.

* ``--stat-lcore``: optional, run the CPU load stats collector on the
  specified lcore.

* ``--parse-ptype``: optional, set to use software to analyze the packet type.
  Without this option, hardware will check the packet type.

The parameters of the ``--rx`` and ``--tx`` options are:

* ``--rx`` parameters

  .. _table_l3fwd_rx_parameters:

  +--------+------------------------------------------------------+
  | port   | RX port                                              |
  +--------+------------------------------------------------------+
  | queue  | RX queue that will be read on the specified RX port  |
  +--------+------------------------------------------------------+
  | lcore  | Core to use for the thread                           |
  +--------+------------------------------------------------------+
  | thread | Thread id (numbered consecutively from 0 to N)       |
  +--------+------------------------------------------------------+

* ``--tx`` parameters

  .. _table_l3fwd_tx_parameters:

  +--------+------------------------------------------------------+
  | lcore  | Core to use for L3 route match and transmit          |
  +--------+------------------------------------------------------+
  | thread | Id of RX thread to be associated with this TX thread |
  +--------+------------------------------------------------------+

The ``l3fwd-thread`` application allows you to start packet processing in two
threading models: L-Threads (default) and EAL Threads (when the
``--no-lthreads`` parameter is used). For consistency all parameters are used
in the same way for both models.

Running with L-threads
~~~~~~~~~~~~~~~~~~~~~~

When the L-thread model is used (default option), lcore and thread parameters
in ``--rx/--tx`` are used to affinitize threads to the selected scheduler.

For example, the following places every L-thread on a different lcore::

    l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
        --rx="(0,0,0,0)(1,0,1,1)" \
        --tx="(2,0)(3,1)"

The following places RX L-threads on lcore 0 and TX L-threads on lcores 1 and
2 and so on::

    l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
        --rx="(0,0,0,0)(1,0,0,1)" \
        --tx="(1,0)(2,1)"


Running with EAL threads
~~~~~~~~~~~~~~~~~~~~~~~~

When the ``--no-lthreads`` parameter is used, the L-threading model is turned
off and EAL threads are used for all processing. EAL threads are enumerated in
the same way as L-threads, but the ``--lcores`` EAL parameter is used to
affinitize threads to the selected cpu-set (scheduler). Thus it is possible to
place every RX and TX thread on different lcores.

For example, the following places every EAL thread on a different lcore::

    l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
        --rx="(0,0,0,0)(1,0,1,1)" \
        --tx="(2,0)(3,1)" \
        --no-lthreads

To affinitize two or more EAL threads to one cpu-set, the EAL ``--lcores``
parameter is used.

The following places RX EAL threads on lcore 0 and TX EAL threads on lcores 1
and 2 and so on::

    l3fwd-thread -l 0-7 -n 2 --lcores="(0,1)@0,(2,3)@1" -- -P -p 3 \
        --rx="(0,0,0,0)(1,0,1,1)" \
        --tx="(2,0)(3,1)" \
        --no-lthreads


Examples
~~~~~~~~

For selected scenarios, the L-thread command line configuration of the
application and the corresponding EAL thread command line can be realized as
follows:

a) Start every thread on a different scheduler (1:1)::

       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,1,1)" \
           --tx="(2,0)(3,1)"

   EAL thread equivalent::

       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,1,1)" \
           --tx="(2,0)(3,1)" \
           --no-lthreads

b) Start all threads on one core (N:1).

   Start 4 L-threads on lcore 0::

       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,0,1)" \
           --tx="(0,0)(0,1)"

   Start 4 EAL threads on cpu-set 0::

       l3fwd-thread -l 0-7 -n 2 --lcores="(0-3)@0" -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,0,1)" \
           --tx="(2,0)(3,1)" \
           --no-lthreads

c) Start threads on different cores (N:M).

   Start 2 L-threads for RX on lcore 0, and 2 L-threads for TX on lcore 1::

       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,0,1)" \
           --tx="(1,0)(1,1)"

   Start 2 EAL threads for RX on cpu-set 0, and 2 EAL threads for TX on
   cpu-set 1::

       l3fwd-thread -l 0-7 -n 2 --lcores="(0-1)@0,(2-3)@1" -- -P -p 3 \
           --rx="(0,0,0,0)(1,0,1,1)" \
           --tx="(2,0)(3,1)" \
           --no-lthreads

Explanation
-----------

To a great extent the sample application differs little from the standard L3
forwarding application, and readers are advised to familiarize themselves with
the material covered in the :doc:`l3_forward` documentation before proceeding.

The following explanation is focused on the way threading is handled in the
performance thread example.

Mode of operation with EAL threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The performance thread sample application has split the RX and TX
functionality into two different threads, and the RX and TX threads are
interconnected via software rings. With respect to these rings the RX threads
are producers and the TX threads are consumers.

On initialization the TX and RX threads are started according to the command
line parameters.

The RX threads poll the network interface queues and post received packets to
a TX thread via a corresponding software ring.

The TX threads poll software rings, perform the L3 forwarding hash/LPM match,
and assemble packet bursts before performing burst transmit on the network
interface.

As with the standard L3 forward application, burst draining of residual
packets is performed periodically, with the period calculated from elapsed
time using the timestamp counter.

The diagram below illustrates a case with two RX threads and three TX threads.

.. _figure_performance_thread_1:

.. figure:: img/performance_thread_1.*


Mode of operation with L-threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Like the EAL thread configuration, the application has split the RX and TX
functionality into different threads, and the pairs of RX and TX threads are
interconnected via software rings.

On initialization an L-thread scheduler is started on every EAL thread. On all
but the master EAL thread only a dummy L-thread is initially started.
The L-thread started on the master EAL thread then spawns other L-threads on
different L-thread schedulers according to the command line parameters.

The RX threads poll the network interface queues and post received packets
to a TX thread via the corresponding software ring.

The ring interface is augmented by means of an L-thread condition variable
that enables the TX thread to be suspended when the TX ring is empty. The RX
thread signals the condition whenever it posts to the TX ring, causing the TX
thread to be resumed.

Additionally the TX L-thread spawns a worker L-thread to take care of
polling the software rings, whilst it handles burst draining of the transmit
buffer.

The worker threads poll the software rings, perform L3 route lookup and
assemble packet bursts. If the TX ring is empty the worker thread suspends
itself by waiting on the condition variable associated with the ring.

Burst draining of residual packets, less than the burst size, is performed by
the TX thread which sleeps (using an L-thread sleep function) and resumes
periodically to flush the TX buffer.

This design means that L-threads that have no work can yield the CPU to other
L-threads and avoid having to constantly poll the software rings.

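
The handover just described can be pictured with a small sketch. This is
illustrative only and is not the sample's actual code: the
``lthread_cond_wait()``/``lthread_cond_signal()`` prototypes (in particular
the reserved second argument to the wait call) are assumed from
``lthread_api.h``, and the ring calls use the current ``rte_ring`` burst API.

.. code-block:: c

   #include <rte_ring.h>
   #include <rte_mbuf.h>
   #include <lthread_api.h>

   #define BURST_SIZE 32

   struct tx_conn {
       struct rte_ring *ring;        /* RX producer -> TX consumer */
       struct lthread_cond *ready;   /* signaled when packets are posted */
   };

   /* Worker L-thread on the TX side: drain the ring, sleep when it is empty. */
   static void tx_worker(void *arg)
   {
       struct tx_conn *c = arg;
       struct rte_mbuf *pkts[BURST_SIZE];
       unsigned int n;

       for (;;) {
           n = rte_ring_dequeue_burst(c->ring, (void **)pkts, BURST_SIZE, NULL);
           if (n == 0) {
               /* No work: suspend until the RX thread signals the condition. */
               lthread_cond_wait(c->ready, 0);
               continue;
           }
           /* L3 route lookup and TX buffering of the n packets would go here. */
       }
   }

   /* RX side: post a burst to the TX ring and wake the worker. */
   static void rx_post(struct tx_conn *c, struct rte_mbuf **pkts, unsigned int n)
   {
       rte_ring_sp_enqueue_burst(c->ring, (void **)pkts, n, NULL);
       lthread_cond_signal(c->ready);
   }
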
The diagram below illustrates a case with two RX threads and three TX
functions (each comprising a thread that processes forwarding and a thread
that periodically drains the output buffer of residual packets).

.. _figure_performance_thread_2:

.. figure:: img/performance_thread_2.*


CPU load statistics
~~~~~~~~~~~~~~~~~~~

It is possible to display statistics showing estimated CPU load on each core.
The statistics indicate the percentage of CPU time spent: processing
received packets (forwarding), polling queues/rings (waiting for work),
and doing any other processing (context switch and other overhead).

When enabled, statistics are gathered by having the application threads set
and clear flags when they enter and exit pertinent code sections. The flags
are then sampled in real time by a statistics collector thread running on
another core. This thread displays the data in real time on the console.

This feature is enabled by designating a statistics collector core, using the
``--stat-lcore`` parameter.


.. _lthread_subsystem:

The L-thread subsystem
----------------------

The L-thread subsystem resides in the examples/performance-thread/common
directory and is built and linked automatically when building the
``l3fwd-thread`` example.

The subsystem provides a simple cooperative scheduler to enable arbitrary
functions to run as cooperative threads within a single EAL thread.
The subsystem provides a pthread-like API that is intended to assist in
reuse of legacy code written for POSIX pthreads.

The following sections provide some detail on the features, constraints,
performance and porting considerations when using L-threads.


.. _comparison_between_lthreads_and_pthreads:

Comparison between L-threads and POSIX pthreads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The fundamental difference between the L-thread and pthread models is the
way in which threads are scheduled. The simplest way to think about this is to
consider the case of a processor with a single CPU. To run multiple threads
on a single CPU, the scheduler must frequently switch between the threads,
in order that each thread is able to make timely progress.
This is the basis of any multitasking operating system.

This section explores the differences between the pthread model and the
L-thread model as implemented in the provided L-thread subsystem. If needed, a
theoretical discussion of preemptive vs cooperative multi-threading can be
found in any good text on operating system design.


Scheduling and context switching
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The POSIX pthread library provides an application programming interface to
create and synchronize threads. Scheduling policy is determined by the host
OS, and may be configurable. The OS may use sophisticated rules to determine
which thread should be run next, threads may suspend themselves or make other
threads ready, and the scheduler may employ a time slice giving each thread a
maximum time quantum after which it will be preempted in favor of another
thread that is ready to run. To complicate matters further, threads may be
assigned different scheduling priorities.

By contrast the L-thread subsystem is considerably simpler. Logically the
L-thread scheduler performs the same multiplexing function for L-threads
within a single pthread as the OS scheduler does for pthreads within an
application process. The L-thread scheduler is simply the main loop of a
pthread, and in so far as the host OS is concerned it is a regular pthread
just like any other. The host OS is oblivious of the existence of, and not at
all involved in, the scheduling of L-threads.

The other and most significant difference between the two models is that
L-threads are scheduled cooperatively. L-threads cannot preempt each
other, nor can the L-thread scheduler preempt a running L-thread (i.e.
there is no time slicing). The consequence is that programs implemented with
L-threads must possess frequent rescheduling points, meaning that they must
explicitly and of their own volition return to the scheduler at frequent
intervals, in order to allow other L-threads an opportunity to proceed.

In both models switching between threads requires that the current CPU
context is saved and a new context (belonging to the next thread ready to run)
is restored. With pthreads this context switching is handled transparently
and the set of CPU registers that must be preserved between context switches
is as per an interrupt handler.

An L-thread context switch is achieved by the thread itself making a function
call to the L-thread scheduler. Thus it is only necessary to preserve the
callee registers. The caller is responsible for saving and restoring any other
registers it is using before the function call, and restoring them on return,
and this is handled by the compiler. For ``X86_64`` on both Linux and BSD the
System V calling convention is used; this defines registers RSP, RBP, and
R12-R15 as callee-save registers (for a more detailed discussion a good
reference is `X86 Calling Conventions
<https://en.wikipedia.org/wiki/X86_calling_conventions>`_).

Taking advantage of this, and due to the absence of preemption, an L-thread
context switch is achieved with less than 20 load/store instructions.

The scheduling policy for L-threads is fixed: there is no prioritization of
L-threads, all L-threads are equal and scheduling is based on a FIFO
ready queue.

An L-thread is a struct containing the CPU context of the thread
(saved on context switch) and other useful items. The ready queue contains
pointers to threads that are ready to run. The L-thread scheduler is a simple
loop that polls the ready queue, reads from it the next thread ready to run,
which it resumes by saving the current context (the current position in the
scheduler loop) and restoring the context of the next thread from its thread
struct. Thus an L-thread is always resumed at the last place it yielded.

A well behaved L-thread will call the context switch regularly (at least once
in its main loop), thus returning to the scheduler's own main loop. Yielding
inserts the current thread at the back of the ready queue, and the process of
servicing the ready queue is repeated; thus the system runs by flipping back
and forth between the L-threads and the scheduler loop.

In the case of pthreads, the preemptive scheduling, time slicing, and support
for thread prioritization means that progress is normally possible for any
thread that is ready to run. This comes at the price of a relatively heavier
context switch and scheduling overhead.

With L-threads the progress of any particular thread is determined by the
frequency of rescheduling opportunities in the other L-threads. This means
that an errant L-thread monopolizing the CPU might cause scheduling of other
threads to be stalled. Due to the lower cost of context switching, however,
voluntary rescheduling to ensure progress of other threads, if managed
sensibly, is not a prohibitive overhead, and overall performance can exceed
that of an application using pthreads.

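
To make the scheduler/yield round trip concrete, here is a purely conceptual
sketch of such a loop written with POSIX ``ucontext``. It is not the DPDK
lthread implementation (which uses a far lighter hand-written context switch),
and every name in it is invented for illustration.

.. code-block:: c

   #include <ucontext.h>

   #define MAX_READY 64

   struct toy_thread {
       ucontext_t ctx;                          /* saved CPU context of the thread */
   };

   static ucontext_t sched_ctx;                 /* the scheduler's own context */
   static struct toy_thread *ready[MAX_READY];  /* FIFO ready queue */
   static int head, tail, count;
   static struct toy_thread *current;

   static void ready_push(struct toy_thread *t)
   {
       ready[tail] = t;
       tail = (tail + 1) % MAX_READY;
       count++;
   }

   /* The scheduler is just a loop in an ordinary (p)thread: pop the next
    * ready thread and resume it where it last yielded. */
   void toy_scheduler_run(void)
   {
       while (count > 0) {
           current = ready[head];
           head = (head + 1) % MAX_READY;
           count--;
           swapcontext(&sched_ctx, &current->ctx);
       }
   }

   /* A cooperative yield: requeue the running thread and switch back to the
    * scheduler loop. The thread resumes from this point when next scheduled. */
   void toy_yield(void)
   {
       ready_push(current);
       swapcontext(&current->ctx, &sched_ctx);
   }
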
Mutual exclusion
^^^^^^^^^^^^^^^^

With pthreads, preemption means that threads that share data must observe
some form of mutual exclusion protocol.

The fact that L-threads cannot preempt each other means that in many cases
mutual exclusion devices can be completely avoided.

Locking to protect shared data can be a significant bottleneck in
multi-threaded applications, so a carefully designed cooperatively scheduled
program can enjoy significant performance advantages.

So far we have considered only the simplistic case of a single core CPU;
when multiple CPUs are considered things are somewhat more complex.

First of all, it is inevitable that there must be multiple L-thread
schedulers, one running on each EAL thread. So long as these schedulers remain
isolated from each other the above assertions about the potential advantages
of cooperative scheduling hold true.

A configuration with isolated cooperative schedulers is less flexible than the
pthread model where threads can be affinitized to run on any CPU. With
isolated schedulers, scaling of applications to utilize fewer or more CPUs
according to system demand is very difficult to achieve.

The L-thread subsystem makes it possible for L-threads to migrate between
schedulers running on different CPUs. Needless to say, if the migration means
that threads that share data end up running on different CPUs then this will
introduce the need for some kind of mutual exclusion system.

Of course ``rte_ring`` software rings can always be used to interconnect
threads running on different cores, however to protect other kinds of shared
data structures, lock free constructs or else explicit locking will be
required. This is a consideration for the application design.

In support of this extended functionality, the L-thread subsystem implements
thread safe mutexes and condition variables.

The cost of affinitizing and of condition variable signaling is significantly
lower than the equivalent pthread operations, and so applications using these
features will see a performance benefit.


Thread local storage
^^^^^^^^^^^^^^^^^^^^

As with applications written for pthreads, an application written for
L-threads can take advantage of thread local storage, in this case local to an
L-thread. An application may save and retrieve a single pointer to application
data in the L-thread struct.

For legacy and backward compatibility reasons two alternative methods are also
offered: the first is modelled directly on the pthread get/set specific APIs,
the second approach is modelled on the ``RTE_PER_LCORE`` macros, whereby
``PER_LTHREAD`` macros are introduced. In both cases the storage is local to
the L-thread.


.. _constraints_and_performance_implications:

Constraints and performance implications when using L-threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _API_compatibility:

API compatibility
^^^^^^^^^^^^^^^^^

The L-thread subsystem provides a set of functions that are logically
equivalent to the corresponding functions offered by the POSIX pthread
library, however not all pthread functions have a corresponding L-thread
equivalent, and not all features available to pthreads are implemented for
L-threads.

The pthread library offers considerable flexibility via programmable
attributes that can be associated with threads, mutexes, and condition
variables.

By contrast the L-thread subsystem has fixed functionality: the scheduler
policy cannot be varied, and L-threads cannot be prioritized. There are no
variable attributes associated with any L-thread objects. L-threads, mutexes
and condition variables all have fixed functionality. (Note: reserved
parameters are included in the APIs to facilitate possible future support for
attributes.)

The table below lists the pthread and equivalent L-thread APIs with notes on
differences and/or constraints. Where there is no L-thread entry in the table,
the L-thread subsystem provides no equivalent function.

.. _table_lthread_pthread:

.. table:: Pthread and equivalent L-thread APIs.

   +----------------------------+------------------------+-------------------+
   | **Pthread function**       | **L-thread function**  | **Notes**         |
   +============================+========================+===================+
   | pthread_barrier_destroy    |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_barrier_init       |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_barrier_wait       |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_broadcast     | lthread_cond_broadcast | See note 1        |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_destroy       | lthread_cond_destroy   |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_init          | lthread_cond_init      |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_signal        | lthread_cond_signal    | See note 1        |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_timedwait     |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_wait          | lthread_cond_wait      | See note 5        |
   +----------------------------+------------------------+-------------------+
   | pthread_create             | lthread_create         | See notes 2, 3    |
   +----------------------------+------------------------+-------------------+
   | pthread_detach             | lthread_detach         | See note 4        |
   +----------------------------+------------------------+-------------------+
   | pthread_equal              |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_exit               | lthread_exit           |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_getspecific        | lthread_getspecific    |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_getcpuclockid      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_join               | lthread_join           |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_key_create         | lthread_key_create     |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_key_delete         | lthread_key_delete     |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_destroy      | lthread_mutex_destroy  |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_init         | lthread_mutex_init     |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_lock         | lthread_mutex_lock     | See note 6        |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_trylock      | lthread_mutex_trylock  | See note 6        |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_timedlock    |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_unlock       | lthread_mutex_unlock   |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_once               |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_destroy     |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_init        |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_rdlock      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_timedrdlock |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_timedwrlock |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_tryrdlock   |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_trywrlock   |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_unlock      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_wrlock      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_self               | lthread_current        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setspecific        | lthread_setspecific    |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_init          |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_destroy       |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_lock          |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_trylock       |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_unlock        |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_cancel             | lthread_cancel         |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setcancelstate     |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setcanceltype      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_testcancel         |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_getschedparam      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setschedparam      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_yield              | lthread_yield          | See note 7        |
   +----------------------------+------------------------+-------------------+
   | pthread_setaffinity_np     | lthread_set_affinity   | See notes 2, 3, 8 |
   +----------------------------+------------------------+-------------------+
   |                            | lthread_sleep          | See note 9        |
   +----------------------------+------------------------+-------------------+
   |                            | lthread_sleep_clks     | See note 9        |
   +----------------------------+------------------------+-------------------+

**Note 1**:

Neither lthread signal nor broadcast may be called concurrently by L-threads
running on different schedulers, although multiple L-threads running in the
same scheduler may freely perform signal or broadcast operations. L-threads
running on the same or different schedulers may always safely wait on a
condition variable.


**Note 2**:

Pthread attributes may be used to affinitize a pthread with a cpu-set. The
L-thread subsystem does not support a cpu-set. An L-thread may be affinitized
only with a single CPU at any time.


**Note 3**:

If an L-thread is intended to run on a different NUMA node than the node that
creates the thread then, when calling ``lthread_create()``, it is advantageous
to specify the destination core as a parameter of ``lthread_create()``. See
:ref:`memory_allocation_and_NUMA_awareness` for details.


**Note 4**:

An L-thread can only detach itself, and cannot detach other L-threads.


**Note 5**:

A wait operation on a pthread condition variable is always associated with and
protected by a mutex which must be owned by the thread at the time it invokes
``pthread_cond_wait()``. By contrast L-thread condition variables are thread
safe (for waiters) and do not use an associated mutex. Multiple L-threads
(including L-threads running on other schedulers) can safely wait on an
L-thread condition variable. As a consequence the performance of an L-thread
condition variable is typically an order of magnitude faster than its pthread
counterpart.


**Note 6**:

Recursive locking is not supported with L-threads; attempts to take a lock
recursively will be detected and rejected.


**Note 7**:

``lthread_yield()`` will save the current context, insert the current thread
at the back of the ready queue, and resume the next ready thread. Yielding
increases ready queue backlog, see :ref:`ready_queue_backlog` for more details
about the implications of this.

N.B. The context switch time as measured from immediately before the call to
``lthread_yield()`` to the point at which the next ready thread is resumed,
can be an order of magnitude faster than the same measurement for
``pthread_yield()``.


**Note 8**:

``lthread_set_affinity()`` is similar to a yield apart from the fact that the
yielding thread is inserted into a peer ready queue of another scheduler.
The peer ready queue is actually a separate thread safe queue, which means
that threads appearing in the peer ready queue can jump any backlog in the
local ready queue on the destination scheduler.

The context switch time as measured from the time just before the call to
``lthread_set_affinity()`` to just after the same thread is resumed on the new
scheduler can be orders of magnitude faster than the same measurement for
``pthread_setaffinity_np()``.

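
A minimal sketch of the migration described in Note 8 is shown below; the
single-lcore-argument prototype of ``lthread_set_affinity()`` is an
assumption, and ``dest_lcore`` is a hypothetical value supplied by the
application.

.. code-block:: c

   #include <lthread_api.h>

   static void worker(void *arg)
   {
       unsigned dest_lcore = *(unsigned *)arg;

       /* ... run for a while on the scheduler that created us ... */

       /* Yield into the peer ready queue of the scheduler on dest_lcore;
        * when this call returns the L-thread is running on that scheduler. */
       lthread_set_affinity(dest_lcore);

       /* ... continue on the destination scheduler ... */
   }
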
**Note 9**:

Although there is no ``pthread_sleep()`` function, ``lthread_sleep()`` and
``lthread_sleep_clks()`` can be used wherever ``sleep()``, ``usleep()`` or
``nanosleep()`` might ordinarily be used. The L-thread sleep functions suspend
the current thread, start an ``rte_timer`` and resume the thread when the
timer matures. The ``rte_timer_manage()`` entry point is called on every pass
of the scheduler loop. This means that the worst case jitter on timer expiry
is determined by the longest period between context switches of any running
L-threads.

In a synthetic test with many threads sleeping and resuming, the measured
jitter is typically orders of magnitude lower than the same measurement made
for ``nanosleep()``.


**Note 10**:

Spin locks are not provided because they are problematic in a cooperative
environment, see :ref:`porting_locks_and_spinlocks` for a more detailed
discussion on how to avoid spin locks.


.. _Thread_local_storage_performance:

Thread local storage
^^^^^^^^^^^^^^^^^^^^

Of the three L-thread local storage options the simplest and most efficient is
storing a single application data pointer in the L-thread struct.

The ``PER_LTHREAD`` macros involve a run time computation to obtain the
address of the variable being saved/retrieved and also require that the
accesses are dereferenced via a pointer. This means that code that has used
``RTE_PER_LCORE`` macros being ported to L-threads might need some slight
adjustment (see :ref:`porting_thread_local_storage` for hints about porting
code that makes use of thread local storage).

The get/set specific APIs are consistent with their pthread counterparts both
in use and in performance.


.. _memory_allocation_and_NUMA_awareness:

Memory allocation and NUMA awareness
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All memory allocation is from DPDK huge pages, and is NUMA aware. Each
scheduler maintains its own caches of objects: lthreads, their stacks, TLS,
mutexes and condition variables. These caches are implemented as unbounded
lock free MPSC queues. When objects are created they are always allocated from
the caches on the local core (current EAL thread).

If an L-thread has been affinitized to a different scheduler, then it can
always safely free resources to the caches from which they originated (because
the caches are MPSC queues).

If the L-thread has been affinitized to a different NUMA node then the memory
resources associated with it may incur longer access latency.

The commonly used pattern of setting affinity on entry to a thread after it
has started means that memory allocation for both the stack and TLS will have
been made from caches on the NUMA node on which the thread's creator is
running. This has the side effect that access latency will be sub-optimal
after affinitizing.

This side effect can be mitigated to some extent (although not completely) by
specifying the destination CPU as a parameter of ``lthread_create()``. This
causes the L-thread's stack and TLS to be allocated when it is first scheduled
on the destination scheduler; if the destination is on another NUMA node this
results in a more optimal memory allocation.

Note that the lthread struct itself remains allocated from memory on the
creating node. This is unavoidable because an L-thread is known everywhere by
the address of this struct.

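
The creation pattern recommended above can be sketched as follows; the
``lthread_create()`` prototype (output handle, destination lcore, entry
function, argument) is assumed from ``lthread_api.h``, and ``worker()`` and
``dest_lcore`` are hypothetical.

.. code-block:: c

   #include <lthread_api.h>

   static void worker(void *arg)
   {
       (void)arg;
       /* application work ... */
   }

   static int spawn_on(unsigned dest_lcore, void *arg)
   {
       struct lthread *lt;

       /* Created here, but first scheduled (and its stack and TLS allocated)
        * on the scheduler running on dest_lcore. */
       return lthread_create(&lt, dest_lcore, worker, arg);
   }
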
.. _object_cache_sizing:

Object cache sizing
^^^^^^^^^^^^^^^^^^^

The per lcore object caches pre-allocate objects in bulk whenever a request to
allocate an object finds a cache empty. By default 100 objects are
pre-allocated; this is defined by ``LTHREAD_PREALLOC`` in the public API
header file lthread_api.h. This means that the caches constantly grow to meet
system demand.

In the present implementation there is no mechanism to reduce the cache sizes
if system demand reduces. Thus the caches will remain at their maximum extent
indefinitely.

A consequence of the bulk pre-allocation of objects is that every 100 (default
value) additional new object create operations results in a call to
``rte_malloc()``. For creation of objects such as L-threads, which trigger the
allocation of even more objects (i.e. their stacks and TLS), this can cause
outliers in scheduling performance.

If this is a problem the simplest mitigation strategy is to dimension the
system by setting the bulk object pre-allocation size to some large number
that you do not expect to be exceeded. This means the caches will be populated
once only, the very first time a thread is created.


.. _Ready_queue_backlog:

Ready queue backlog
^^^^^^^^^^^^^^^^^^^

One of the more subtle performance considerations is managing the ready queue
backlog. The fewer threads that are waiting in the ready queue, the faster any
particular thread will get serviced.

In a naive L-thread application with N L-threads simply looping and yielding,
this backlog will always be equal to the number of L-threads, thus the cost of
a yield to a particular L-thread will be N times the context switch time.

This side effect can be mitigated by arranging for threads to be suspended and
wait to be resumed, rather than polling for work by constantly yielding.
Blocking on a mutex or condition variable, or even more obviously having a
thread sleep if it has a low frequency workload, are all mechanisms by which a
thread can be excluded from the ready queue until it really does need to be
run. This can have a significant positive impact on performance.


.. _Initialization_and_shutdown_dependencies:

Initialization, shutdown and dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The L-thread subsystem depends on DPDK for huge page allocation and depends on
the ``rte_timer`` subsystem. The DPDK EAL initialization and
``rte_timer_subsystem_init()`` **MUST** be completed before the L-thread
subsystem can be used.

Thereafter initialization of the L-thread subsystem is largely transparent to
the application. Constructor functions ensure that global variables are
properly initialized. Other than global variables, each scheduler is
initialized independently the first time that an L-thread is created by a
particular EAL thread.

If the schedulers are to be run as isolated and independent schedulers, with
no intention that L-threads running on different schedulers will migrate
between schedulers or synchronize with L-threads running on other schedulers,
then initialization consists simply of creating an L-thread, and then running
the L-thread scheduler.

If there will be interaction between L-threads running on different
schedulers, then it is important that the starting of schedulers on different
EAL threads is synchronized.

To achieve this an additional initialization step is necessary: set the number
of schedulers by calling the API function ``lthread_num_schedulers_set(n)``,
where ``n`` is the number of EAL threads that will run L-thread schedulers.
Setting the number of schedulers to a number greater than 0 will cause all
schedulers to wait until the others have started before beginning to schedule
L-threads.

The L-thread scheduler is started by calling the function ``lthread_run()``.
It should be called from the EAL thread and thus becomes the main loop of the
EAL thread.

The function ``lthread_run()`` will not return until all threads running on
the scheduler have exited, and the scheduler has been explicitly stopped by
calling ``lthread_scheduler_shutdown(lcore)`` or
``lthread_scheduler_shutdown_all()``.

All these functions do is tell the scheduler that it can exit when there are
no longer any running L-threads; neither function forces any running L-thread
to terminate. Any desired application shutdown behavior must be designed and
built into the application to ensure that L-threads complete in a timely
manner.

**Important Note:** It is assumed when the scheduler exits that the
application is terminating for good. The scheduler does not free resources
before exiting, and running the scheduler a subsequent time will result in
undefined behavior.

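
Putting the above together, a typical synchronized start-up might look roughly
like the following sketch. It assumes the ``lthread_api.h`` prototypes as
named above (including ``-1`` being accepted as "the current lcore" in
``lthread_create()``); ``initial_lthread()`` is a hypothetical application
function and error handling is omitted.

.. code-block:: c

   #include <rte_eal.h>
   #include <rte_launch.h>
   #include <rte_lcore.h>
   #include <rte_timer.h>
   #include <lthread_api.h>

   /* First L-thread on the master; it would spawn the application's
    * L-threads on the other schedulers (hypothetical). */
   static void initial_lthread(void *arg)
   {
       (void)arg;
       /* lthread_create() calls targeting other lcores would go here ... */
       lthread_exit(NULL);
   }

   /* Each worker EAL thread becomes an L-thread scheduler. */
   static int sched_main(void *arg)
   {
       (void)arg;
       lthread_run();
       return 0;
   }

   int main(int argc, char **argv)
   {
       unsigned lcore;
       struct lthread *lt;

       rte_eal_init(argc, argv);            /* EAL must come up first */
       rte_timer_subsystem_init();          /* required by the lthread sleep support */

       /* Schedulers wait for each other before scheduling L-threads. */
       lthread_num_schedulers_set(rte_lcore_count());

       RTE_LCORE_FOREACH_SLAVE(lcore)
           rte_eal_remote_launch(sched_main, NULL, lcore);

       /* Create the first L-thread on the master (-1 = current lcore), then
        * run its scheduler; lthread_run() only returns after a scheduler
        * shutdown call. */
       lthread_create(&lt, -1, initial_lthread, NULL);
       lthread_run();

       return 0;
   }
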
.. _porting_legacy_code_to_run_on_lthreads:

Porting legacy code to run on L-threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Legacy code originally written for a pthread environment may be ported to
L-threads if the considerations about differences in scheduling policy, and
the constraints discussed in the previous sections, can be accommodated.

This section looks in more detail at some of the issues that may have to be
resolved when porting code.


.. _pthread_API_compatibility:

pthread API compatibility
^^^^^^^^^^^^^^^^^^^^^^^^^

The first step is to establish exactly which pthread APIs the legacy
application uses, and to understand the requirements of those APIs. If there
are corresponding L-thread APIs, and the application uses only the default
pthread functionality then, notwithstanding the other issues discussed here,
it should be feasible to run the application with L-threads. If the legacy
code modifies the default behavior using attributes then it may be necessary
to make some adjustments to eliminate those requirements.

.. _blocking_system_calls:

Blocking system API calls
^^^^^^^^^^^^^^^^^^^^^^^^^

It is important to understand what other system services the application may
be using, bearing in mind that in a cooperatively scheduled environment a
thread cannot block without stalling the scheduler and with it all other
cooperative threads. Any kind of blocking system call, for example file or
socket IO, is a potential problem; a good tool to analyze the application for
this purpose is the ``strace`` utility.

There are many strategies to resolve these kinds of issues, each with its
merits. Possible solutions include:

* Adopting a polled mode of the system API concerned (if available).

* Arranging for another core to perform the function and synchronizing with
  that core via constructs that will not block the L-thread.

* Affinitizing the thread to another scheduler devoted (as a matter of policy)
  to handling threads wishing to make blocking calls, and then back again when
  finished.


.. _porting_locks_and_spinlocks:

Locks and spinlocks
^^^^^^^^^^^^^^^^^^^

Locks and spinlocks are another source of blocking behavior that for the same
reasons as system calls will need to be addressed.

If the application design ensures that the contending L-threads will always
run on the same scheduler then it is probably safe to remove locks and spin
locks completely.

The only exception to the above rule is if for some reason the
code performs any kind of context switch whilst holding the lock
(e.g. yield, sleep, or block on a different lock, or on a condition variable).
This will need to be determined before deciding to eliminate a lock.

If a lock cannot be eliminated then an L-thread mutex can be substituted for
either kind of lock.

An L-thread blocking on an L-thread mutex will be suspended and will cause
another ready L-thread to be resumed, thus not blocking the scheduler. When
default behavior is required, it can be used as a direct replacement for a
pthread mutex lock.

Spin locks are typically used when lock contention is likely to be rare and
where the period during which the lock may be held is relatively short.
When the contending L-threads are running on the same scheduler then an
L-thread blocking on a spin lock will enter an infinite loop stopping the
scheduler completely (see :ref:`porting_infinite_loops` below).

If the application design ensures that contending L-threads will always run
on different schedulers then it might be reasonable to leave a short spin lock
that rarely experiences contention in place.

If after all considerations it appears that a spin lock can neither be
eliminated completely, nor replaced with an L-thread mutex, nor left in place
as is, then an alternative is to loop on a flag, with a call to
``lthread_yield()`` inside the loop, as sketched below (n.b. if the contending
L-threads might ever run on different schedulers the flag will need to be
manipulated atomically).

Spinning and yielding is the least preferred solution since it introduces
ready queue backlog (see also :ref:`ready_queue_backlog`).

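
The flag-plus-yield fallback might look like the following sketch, using C11
atomics so that it remains safe if the contending L-threads ever run on
different schedulers; ``lock_flag()`` and ``unlock_flag()`` are hypothetical
helpers, not part of the lthread API.

.. code-block:: c

   #include <stdatomic.h>
   #include <lthread_api.h>

   static atomic_flag busy = ATOMIC_FLAG_INIT;

   static void lock_flag(void)
   {
       /* Instead of spinning, give other L-threads a chance to run (and to
        * release the flag) on every failed attempt. */
       while (atomic_flag_test_and_set_explicit(&busy, memory_order_acquire))
           lthread_yield();
   }

   static void unlock_flag(void)
   {
       atomic_flag_clear_explicit(&busy, memory_order_release);
   }
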
.. _porting_sleeps_and_delays:

Sleeps and delays
^^^^^^^^^^^^^^^^^

Yet another kind of blocking behavior (albeit momentary) are delay functions
like ``sleep()``, ``usleep()``, ``nanosleep()`` etc. All will have the
consequence of stalling the L-thread scheduler and, unless the delay is very
short (e.g. a very short nanosleep), calls to these functions will need to be
eliminated.

The simplest mitigation strategy is to use the L-thread sleep API functions,
of which two variants exist, ``lthread_sleep()`` and ``lthread_sleep_clks()``.
These functions start an ``rte_timer`` against the L-thread, suspend the
L-thread and cause another ready L-thread to be resumed. The suspended
L-thread is resumed when the ``rte_timer`` matures.


.. _porting_infinite_loops:

Infinite loops
^^^^^^^^^^^^^^

Some applications have threads with loops that contain no inherent
rescheduling opportunity, and rely solely on the OS time slicing to share
the CPU. In a cooperative environment this will stop everything dead. These
kinds of loops are not hard to identify; in a debug session you will find the
debugger is always stopping in the same loop.

The simplest solution to this kind of problem is to insert an explicit
``lthread_yield()`` or ``lthread_sleep()`` into the loop. Another solution
might be to include the function performed by the loop into the execution path
of some other loop that does in fact yield, if this is possible.


.. _porting_thread_local_storage:

Thread local storage
^^^^^^^^^^^^^^^^^^^^

If the application uses thread local storage, the use case should be
studied carefully.

In a legacy pthread application either or both of the ``__thread`` prefix and
the pthread set/get specific APIs may have been used to define storage local
to a pthread.

In some applications it may be a reasonable assumption that the data could,
or in fact most likely should, be placed in L-thread local storage.

If the application (like many DPDK applications) has assumed a certain
relationship between a pthread and the CPU to which it is affinitized, there
is a risk that thread local storage may have been used to save some data items
that are correctly logically associated with the CPU, and other items which
relate to application context for the thread. Only a good understanding of the
application will reveal such cases.

If the application requires that an L-thread be able to move between
schedulers then care should be taken to separate these kinds of data, into per
lcore, and per L-thread storage. In this way a migrating thread will bring
with it the local data it needs, and pick up the new logical core specific
values from pthread local storage at its new home.


.. _pthread_shim:

Pthread shim
~~~~~~~~~~~~

A convenient way to get something working with legacy code can be to use a
shim that adapts pthread API calls to the corresponding L-thread ones.
This approach will not mitigate any of the porting considerations mentioned
in the previous sections, but it will reduce the amount of code churn that
would otherwise be involved. It is a reasonable approach to evaluate
L-threads before investing effort in porting to the native L-thread APIs.


Overview
^^^^^^^^
The L-thread subsystem includes an example pthread shim. This is a partial
implementation but does contain the API stubs needed to get basic applications
running. There is a simple "hello world" application that demonstrates the
use of the pthread shim.

A subtlety of working with a shim is that the application will still need
to make use of the genuine pthread library functions, at the very least in
order to create the EAL threads in which the L-thread schedulers will run.
This is the case with DPDK initialization and exit.

To deal with the initialization and shutdown scenarios, the shim is capable of
switching on or off its adaptor functionality; an application can control this
behavior by calling the function ``pt_override_set()``. The default state
is disabled.

The pthread shim uses the dynamic linker loader and saves the loaded addresses
of the genuine pthread API functions in an internal table. When the shim
functionality is enabled it performs the adaptor function, and when disabled
it invokes the genuine pthread function.

The function ``pthread_exit()`` has additional special handling. The standard
system header file pthread.h declares ``pthread_exit()`` with
``__attribute__((noreturn))``. This is an optimization that is possible
because the pthread is terminating, and it enables the compiler to omit the
normal handling of stack and protection of registers since the function is not
expected to return, and in fact the thread is being destroyed. These
optimizations are applied in both the callee and the caller of the
``pthread_exit()`` function.

In our cooperative scheduling environment this behavior is inadmissible. The
pthread is the L-thread scheduler thread, and, although an L-thread is
terminating, there must be a return to the scheduler in order that the system
can continue to run. Further, returning from a function with attribute
``noreturn`` is invalid and may result in undefined behavior.

The solution is to redefine the ``pthread_exit`` function with a macro,
causing it to be mapped to a stub function in the shim that does not have the
``noreturn`` attribute. This macro is defined in the file
``pthread_shim.h``. The stub function is otherwise no different than any of
the other stub functions in the shim, and will switch between the real
``pthread_exit()`` function or the ``lthread_exit()`` function as
required. The only difference is that the mapping to the stub is done by macro
substitution.

A consequence of this is that the file ``pthread_shim.h`` must be included in
legacy code wishing to make use of the shim. It also means that dynamic
linkage of a pre-compiled binary that did not include pthread_shim.h is not
supported.

Given the requirements for porting legacy code outlined in
:ref:`porting_legacy_code_to_run_on_lthreads` most applications will require
at least some minimal adjustment and recompilation to run on L-threads, so
pre-compiled binaries are unlikely to be encountered in practice.

In summary the shim approach adds some overhead but can be a useful tool to
help establish the feasibility of a code reuse project. It is also a fairly
straightforward task to extend the shim if necessary.

**Note:** Bearing in mind the preceding discussion about the impact of making
blocking calls, switching the shim in and out on the fly in order to invoke a
pthread API that might block is something that should typically be avoided.

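
A rough sketch of how an application might drive the shim is shown below. The
``pt_override_set()`` prototype (a simple on/off flag) and its declaration in
``pthread_shim.h`` are assumptions; ``legacy_entry()`` and ``hello()`` are
hypothetical.

.. code-block:: c

   #include <pthread.h>
   #include <pthread_shim.h>   /* remaps pthread_exit(); pt_override_set() assumed here */

   static void *hello(void *arg)
   {
       (void)arg;
       /* legacy pthread code, now running as an L-thread */
       pthread_exit(NULL);      /* mapped to the shim's stub by the header */
       return NULL;
   }

   /* Called from an L-thread running on a scheduler started with lthread_run(). */
   static void legacy_entry(void *arg)
   {
       pthread_t t;
       (void)arg;

       pt_override_set(1);                        /* route pthread calls to L-threads */
       pthread_create(&t, NULL, hello, NULL);     /* creates an L-thread via the shim */
       pthread_join(t, NULL);
       pt_override_set(0);                        /* back to the genuine pthread library */
   }
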
Building and running the pthread shim
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The shim example application is located in the pthread_shim sub-directory of
the performance-thread folder.

To build and run the pthread shim example:

#. Go to the example applications folder

   .. code-block:: console

       export RTE_SDK=/path/to/rte_sdk
       cd ${RTE_SDK}/examples/performance-thread/pthread_shim

#. Set the target (a default target is used if not specified). For example:

   .. code-block:: console

       export RTE_TARGET=x86_64-native-linuxapp-gcc

   See the DPDK Getting Started Guide for possible RTE_TARGET values.

#. Build the application:

   .. code-block:: console

       make

#. To run the pthread_shim example

   .. code-block:: console

       lthread-pthread-shim -c core_mask -n number_of_channels

.. _lthread_diagnostics:

L-thread Diagnostics
~~~~~~~~~~~~~~~~~~~~

When debugging you must take account of the fact that the L-threads are run in
a single pthread. The current scheduler is defined by
``RTE_PER_LCORE(this_sched)``, and the current lthread is stored at
``RTE_PER_LCORE(this_sched)->current_lthread``. Thus on a breakpoint in a GDB
session the current lthread can be obtained by displaying the pthread local
variable ``per_lcore_this_sched->current_lthread``.

Another useful diagnostic feature is the possibility to trace significant
events in the life of an L-thread; this feature is enabled by changing the
value of ``LTHREAD_DIAG`` from 0 to 1 in the file ``lthread_diag_api.h``.

Tracing of events can be individually masked, and the mask may be programmed
at run time. An unmasked event results in a callback that provides information
about the event. The default callback simply prints trace information. The
default mask is 0 (all events off); the mask can be modified by calling the
function ``lthread_diagnostic_set_mask()``.

It is possible to register a user callback function to implement more
sophisticated diagnostic functions.
Object creation events (lthread, mutex, and condition variable) accept, and
store in the created object, a user supplied reference value returned by the
callback function.

The lthread reference value is passed back in all subsequent event callbacks,
and APIs are provided to retrieve the reference value from mutexes and
condition variables. This enables a user to monitor, count, or filter for
specific events, on specific objects, for example to monitor for a specific
thread signaling a specific condition variable, or to monitor on all timer
events; the possibilities and combinations are endless.

The callback function can be set by calling the function
``lthread_diagnostic_enable()`` supplying a callback function pointer and an
event mask.

Setting ``LTHREAD_DIAG`` also enables counting of statistics about cache and
queue usage, and these statistics can be displayed by calling the function
``lthread_diag_stats_display()``. This function also performs a consistency
check on the caches and queues. The function should only be called from the
master EAL thread after all slave threads have stopped and returned to the C
main program, otherwise the consistency check will fail.