.. SPDX-License-Identifier: BSD-3-Clause
   Copyright(c) 2015 Intel Corporation.

Performance Thread Sample Application
=====================================

The performance thread sample application is a derivative of the standard L3
forwarding application that demonstrates different threading models.

Overview
--------
For a general description of the L3 forwarding application's capabilities
please refer to the documentation of the standard application in
:doc:`l3_forward`.

The performance thread sample application differs from the standard L3
forwarding example in that it divides the TX and RX processing between
different threads, and makes it possible to assign individual threads to
different cores.

Three threading models are considered:

#. When there is one EAL thread per physical core.
#. When there are multiple EAL threads per physical core.
#. When there are multiple lightweight threads per EAL thread.

Since DPDK release 2.0 it is possible to launch applications using the
``--lcores`` EAL parameter, specifying cpu-sets for a physical core. With the
performance thread sample application it is now also possible to assign
individual RX and TX functions to different cores.

As an alternative to dividing the L3 forwarding work between different EAL
threads, the performance thread sample introduces the possibility to run the
application threads as lightweight threads (L-threads) within one or
more EAL threads.

In order to facilitate this threading model the example includes a primitive
cooperative scheduler (L-thread) subsystem. More details of the L-thread
subsystem can be found in :ref:`lthread_subsystem`.

**Note:** While it is theoretically possible to run multiple L-thread
schedulers on the same physical core, this mode of operation is not
anticipated: it should not be expected to yield useful performance and is
considered invalid.

Compiling the Application
-------------------------

To compile the sample application see :doc:`compiling`.

The application is located in the ``performance-thread/l3fwd-thread``
sub-directory.

Running the Application
-----------------------

The application has a number of command line options::

    ./<build_dir>/examples/dpdk-l3fwd-thread [EAL options] --
        -p PORTMASK [-P]
        --rx(port,queue,lcore,thread)[,(port,queue,lcore,thread)]
        --tx(lcore,thread)[,(lcore,thread)]
        [--enable-jumbo] [--max-pkt-len PKTLEN] [--no-numa]
        [--hash-entry-num] [--ipv6] [--no-lthreads] [--stat-lcore lcore]
        [--parse-ptype]

Where:

* ``-p PORTMASK``: Hexadecimal bitmask of ports to configure.

* ``-P``: optional, sets all ports to promiscuous mode so that packets are
  accepted regardless of the packet's Ethernet MAC destination address.
  Without this option, only packets with the Ethernet MAC destination address
  set to the Ethernet address of the port are accepted.

* ``--rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)]``: the list of
  NIC RX ports and queues handled by the RX lcores and threads. The parameters
  are explained below.

* ``--tx (lcore,thread)[,(lcore,thread)]``: the list of TX threads identifying
  the lcore the thread runs on, and the id of the RX thread with which it is
  associated. The parameters are explained below.

* ``--enable-jumbo``: optional, enables jumbo frames.

* ``--max-pkt-len``: optional, maximum packet length in decimal (64-9600).

* ``--no-numa``: optional, disables NUMA awareness.

* ``--hash-entry-num``: optional, specifies the number of hash entries (in
  hex) to be set up.

* ``--ipv6``: optional, set this if processing IPv6 packets.

* ``--no-lthreads``: optional, disables the L-thread model and uses the EAL
  threading model. See below.

* ``--stat-lcore``: optional, runs the CPU load stats collector on the
  specified lcore.
99 100* ``--parse-ptype:`` optional, set to use software to analyze packet type. 101 Without this option, hardware will check the packet type. 102 103The parameters of the ``--rx`` and ``--tx`` options are: 104 105* ``--rx`` parameters 106 107 .. _table_l3fwd_rx_parameters: 108 109 +--------+------------------------------------------------------+ 110 | port | RX port | 111 +--------+------------------------------------------------------+ 112 | queue | RX queue that will be read on the specified RX port | 113 +--------+------------------------------------------------------+ 114 | lcore | Core to use for the thread | 115 +--------+------------------------------------------------------+ 116 | thread | Thread id (continuously from 0 to N) | 117 +--------+------------------------------------------------------+ 118 119 120* ``--tx`` parameters 121 122 .. _table_l3fwd_tx_parameters: 123 124 +--------+------------------------------------------------------+ 125 | lcore | Core to use for L3 route match and transmit | 126 +--------+------------------------------------------------------+ 127 | thread | Id of RX thread to be associated with this TX thread | 128 +--------+------------------------------------------------------+ 129 130The ``l3fwd-thread`` application allows you to start packet processing in two 131threading models: L-Threads (default) and EAL Threads (when the 132``--no-lthreads`` parameter is used). For consistency all parameters are used 133in the same way for both models. 134 135 136Running with L-threads 137~~~~~~~~~~~~~~~~~~~~~~ 138 139When the L-thread model is used (default option), lcore and thread parameters 140in ``--rx/--tx`` are used to affinitize threads to the selected scheduler. 

For example, the following places every l-thread on a different lcore::

    dpdk-l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                --rx="(0,0,0,0)(1,0,1,1)" \
                --tx="(2,0)(3,1)"

The following places the RX l-threads on lcore 0 and the TX l-threads on
lcores 1 and 2::

    dpdk-l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                --rx="(0,0,0,0)(1,0,0,1)" \
                --tx="(1,0)(2,1)"


Running with EAL threads
~~~~~~~~~~~~~~~~~~~~~~~~

When the ``--no-lthreads`` parameter is used, the L-threading model is turned
off and EAL threads are used for all processing. EAL threads are enumerated in
the same way as L-threads, but the ``--lcores`` EAL parameter is used to
affinitize threads to the selected cpu-set (scheduler). Thus it is possible to
place every RX and TX thread on a different lcore.

For example, the following places every EAL thread on a different lcore::

    dpdk-l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                --rx="(0,0,0,0)(1,0,1,1)" \
                --tx="(2,0)(3,1)" \
                --no-lthreads


To affinitize two or more EAL threads to one cpu-set, the EAL ``--lcores``
parameter is used.

The following places the RX EAL threads on cpu-set 0 and the TX EAL threads
on cpu-set 1::

    dpdk-l3fwd-thread -l 0-7 -n 2 --lcores="(0,1)@0,(2,3)@1" -- -P -p 3 \
                --rx="(0,0,0,0)(1,0,1,1)" \
                --tx="(2,0)(3,1)" \
                --no-lthreads


Examples
~~~~~~~~

For selected scenarios, the L-thread command line configuration of the
application and its corresponding EAL thread command line can be realized as
follows:

a) Start every thread on a different scheduler (1:1)::

       dpdk-l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,1,1)" \
                   --tx="(2,0)(3,1)"

   EAL thread equivalent::

       dpdk-l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,1,1)" \
                   --tx="(2,0)(3,1)" \
                   --no-lthreads

b) Start all threads on one core (N:1).

   Start 4 L-threads on lcore 0::

       dpdk-l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,0,1)" \
                   --tx="(0,0)(0,1)"

   Start 4 EAL threads on cpu-set 0::

       dpdk-l3fwd-thread -l 0-7 -n 2 --lcores="(0-3)@0" -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,0,1)" \
                   --tx="(2,0)(3,1)" \
                   --no-lthreads

c) Start threads on different cores (N:M).

   Start 2 L-threads for RX on lcore 0, and 2 L-threads for TX on lcore 1::

       dpdk-l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,0,1)" \
                   --tx="(1,0)(1,1)"

   Start 2 EAL threads for RX on cpu-set 0, and 2 EAL threads for TX on
   cpu-set 1::

       dpdk-l3fwd-thread -l 0-7 -n 2 --lcores="(0-1)@0,(2-3)@1" -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,1,1)" \
                   --tx="(2,0)(3,1)" \
                   --no-lthreads

Explanation
-----------

The sample application differs little from the standard L3 forwarding
application, so readers are advised to familiarize themselves with the
material covered in the :doc:`l3_forward` documentation before proceeding.

The following explanation is focused on the way threading is handled in the
performance thread example.


Mode of operation with EAL threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The performance thread sample application splits the RX and TX functionality
into two different threads, and the RX and TX threads are
interconnected via software rings. With respect to these rings the RX threads
are producers and the TX threads are consumers.

On initialization the TX and RX threads are started according to the command
line parameters.

The RX threads poll the network interface queues and post received packets to
a TX thread via a corresponding software ring.

The TX threads poll software rings, perform the L3 forwarding hash/LPM match,
and assemble packet bursts before performing burst transmit on the network
interface.
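
The RX-produces/TX-consumes pattern described above can be sketched with a
plain circular buffer standing in for the DPDK ``rte_ring`` (all names here
are invented for illustration; the real application uses the ``rte_ring``
API and real mbufs):

.. code-block:: c

    /* Hypothetical sketch of the RX -> ring -> TX pattern; a plain
     * circular buffer stands in for rte_ring, ints stand in for mbufs. */
    #include <stdio.h>

    #define RING_SIZE 8            /* power of two so index masking works */
    #define BURST     4

    struct ring { unsigned head, tail; int pkts[RING_SIZE]; };

    static int ring_enqueue(struct ring *r, int pkt)
    {
        if (r->head - r->tail == RING_SIZE)
            return -1;                     /* ring full: RX must drop or retry */
        r->pkts[r->head++ & (RING_SIZE - 1)] = pkt;
        return 0;
    }

    static int ring_dequeue_burst(struct ring *r, int *burst, int n)
    {
        int i;
        for (i = 0; i < n && r->tail != r->head; i++)
            burst[i] = r->pkts[r->tail++ & (RING_SIZE - 1)];
        return i;                          /* number of packets dequeued */
    }

    int main(void)
    {
        struct ring r = { 0, 0 };
        int burst[BURST];

        /* "RX thread": poll the NIC (simulated) and post packets to the ring. */
        for (int pkt = 100; pkt < 106; pkt++)
            ring_enqueue(&r, pkt);

        /* "TX thread": poll the ring, assemble bursts, then burst transmit. */
        int n;
        while ((n = ring_dequeue_burst(&r, burst, BURST)) > 0) {
            printf("tx burst of %d:", n);
            for (int i = 0; i < n; i++)
                printf(" %d", burst[i]);
            printf("\n");
        }
        return 0;
    }

In the real application the RX and TX sides of this loop run on different
lcores, so the ring must be a thread-safe ``rte_ring`` rather than the
single-threaded buffer shown here.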

As with the standard L3 forward application, burst draining of residual
packets is performed periodically, with the period calculated from elapsed
time using the timestamp counter.

The diagram below illustrates a case with two RX threads and three TX threads.

.. _figure_performance_thread_1:

.. figure:: img/performance_thread_1.*


Mode of operation with L-threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Like the EAL thread configuration, the application splits the RX and TX
functionality into different threads, and the pairs of RX and TX threads are
interconnected via software rings.

On initialization an L-thread scheduler is started on every EAL thread. On all
but the main EAL thread only a dummy L-thread is initially started.
The L-thread started on the main EAL thread then spawns other L-threads on
different L-thread schedulers according to the command line parameters.

The RX threads poll the network interface queues and post received packets
to a TX thread via the corresponding software ring.

The ring interface is augmented by means of an L-thread condition variable
that enables the TX thread to be suspended when the TX ring is empty. The RX
thread signals the condition whenever it posts to the TX ring, causing the TX
thread to be resumed.

Additionally the TX L-thread spawns a worker L-thread to take care of
polling the software rings, whilst it handles burst draining of the transmit
buffer.

The worker threads poll the software rings, perform L3 route lookup and
assemble packet bursts. If the TX ring is empty the worker thread suspends
itself by waiting on the condition variable associated with the ring.

Burst draining of residual packets, less than the burst size, is performed by
the TX thread, which sleeps (using an L-thread sleep function) and resumes
periodically to flush the TX buffer.
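
The suspend-on-empty / signal-on-post pattern described above can be sketched
with POSIX primitives for illustration (all names are invented; note a
difference discussed later in this document: an L-thread condition variable
needs no associated mutex, unlike the pthread version shown here):

.. code-block:: c

    /* Hypothetical sketch only, using POSIX primitives rather than the
     * lthread API: a "TX worker" suspends while the ring is empty and an
     * "RX thread" signals the condition each time it posts a packet. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  ring_ready = PTHREAD_COND_INITIALIZER;
    static int ring_count;     /* packets waiting in the (simulated) ring */
    static int consumed;

    /* "TX worker": suspend while the ring is empty, resume when signaled. */
    static void *tx_worker(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        while (consumed < 3) {
            while (ring_count == 0)
                pthread_cond_wait(&ring_ready, &lock);   /* suspend */
            ring_count--;
            consumed++;
        }
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t tx;
        pthread_create(&tx, NULL, tx_worker, NULL);

        /* "RX thread": post to the ring and signal the condition each time. */
        for (int i = 0; i < 3; i++) {
            pthread_mutex_lock(&lock);
            ring_count++;
            pthread_cond_signal(&ring_ready);            /* resume worker */
            pthread_mutex_unlock(&lock);
        }

        pthread_join(tx, NULL);
        printf("consumed %d packets\n", consumed);
        return 0;
    }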

This design means that L-threads that have no work can yield the CPU to other
L-threads and avoid having to constantly poll the software rings.

The diagram below illustrates a case with two RX threads and three TX
functions (each comprising a thread that processes forwarding and a thread
that periodically drains the output buffer of residual packets).

.. _figure_performance_thread_2:

.. figure:: img/performance_thread_2.*


CPU load statistics
~~~~~~~~~~~~~~~~~~~

It is possible to display statistics showing estimated CPU load on each core.
The statistics indicate the percentage of CPU time spent: processing
received packets (forwarding), polling queues/rings (waiting for work),
and doing any other processing (context switch and other overhead).

When enabled, statistics are gathered by having the application threads set
and clear flags when they enter and exit pertinent code sections. The flags
are then sampled in real time by a statistics collector thread running on
another core. This thread displays the data in real time on the console.

This feature is enabled by designating a statistics collector core, using the
``--stat-lcore`` parameter.


.. _lthread_subsystem:

The L-thread subsystem
----------------------

The L-thread subsystem resides in the ``examples/performance-thread/common``
directory and is built and linked automatically when building the
``l3fwd-thread`` example.

The subsystem provides a simple cooperative scheduler to enable arbitrary
functions to run as cooperative threads within a single EAL thread.
The subsystem provides a pthread-like API that is intended to assist in the
reuse of legacy code written for POSIX pthreads.

The following sections provide some detail on the features, constraints,
performance and porting considerations when using L-threads.


.. _comparison_between_lthreads_and_pthreads:

Comparison between L-threads and POSIX pthreads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The fundamental difference between the L-thread and pthread models is the
way in which threads are scheduled. The simplest way to think about this is
to consider the case of a processor with a single CPU. To run multiple
threads on a single CPU, the scheduler must frequently switch between the
threads, in order that each thread is able to make timely progress.
This is the basis of any multitasking operating system.

This section explores the differences between the pthread model and the
L-thread model as implemented in the provided L-thread subsystem. If needed,
a theoretical discussion of preemptive vs cooperative multi-threading can be
found in any good text on operating system design.


Scheduling and context switching
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The POSIX pthread library provides an application programming interface to
create and synchronize threads. Scheduling policy is determined by the host
OS, and may be configurable. The OS may use sophisticated rules to determine
which thread should be run next, threads may suspend themselves or make other
threads ready, and the scheduler may employ a time slice giving each thread a
maximum time quantum after which it will be preempted in favor of another
thread that is ready to run. To complicate matters further, threads may be
assigned different scheduling priorities.

By contrast the L-thread subsystem is considerably simpler. Logically the
L-thread scheduler performs the same multiplexing function for L-threads
within a single pthread as the OS scheduler does for pthreads within an
application process. The L-thread scheduler is simply the main loop of a
pthread, and in so far as the host OS is concerned it is a regular pthread
just like any other.
The host OS is oblivious to the existence of L-threads
and plays no part in their scheduling.

The other, and most significant, difference between the two models is that
L-threads are scheduled cooperatively. L-threads cannot preempt each
other, nor can the L-thread scheduler preempt a running L-thread (i.e.
there is no time slicing). The consequence is that programs implemented with
L-threads must possess frequent rescheduling points, meaning that they must
explicitly and of their own volition return to the scheduler at frequent
intervals, in order to allow other L-threads an opportunity to proceed.

In both models switching between threads requires that the current CPU
context is saved and a new context (belonging to the next thread ready to
run) is restored. With pthreads this context switching is handled
transparently and the set of CPU registers that must be preserved between
context switches is as per an interrupt handler.

An L-thread context switch is achieved by the thread itself making a function
call to the L-thread scheduler. Thus it is only necessary to preserve the
callee-save registers. The caller is responsible for saving and restoring any
other registers it is using before a function call, and restoring them on
return, and this is handled by the compiler. For ``X86_64`` on both Linux and
BSD the System V calling convention is used; this defines registers RSP, RBX,
RBP, and R12-R15 as callee-save registers (for a more detailed discussion a
good reference is `X86 Calling Conventions
<https://en.wikipedia.org/wiki/X86_calling_conventions>`_).

Taking advantage of this, and due to the absence of preemption, an L-thread
context switch is achieved with less than 20 load/store instructions.

The scheduling policy for L-threads is fixed: there is no prioritization of
L-threads, all L-threads are equal, and scheduling is based on a FIFO
ready queue.
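
The model just described (a FIFO ready queue, voluntary yields, and a context
saved and restored on each switch) can be sketched with POSIX ``ucontext``.
This is illustration only, not the DPDK implementation: the real L-thread
subsystem uses its own much lighter context switch, and every name below is
invented.

.. code-block:: c

    /* Hypothetical sketch: a minimal cooperative scheduler with a FIFO
     * ready queue, using POSIX ucontext for the context switch. */
    #include <stdio.h>
    #include <ucontext.h>

    #define MAX_THREADS 4
    #define STACK_SIZE  (64 * 1024)

    static ucontext_t sched_ctx;            /* the scheduler's own context */
    static ucontext_t threads[MAX_THREADS]; /* per-thread saved contexts */
    static char stacks[MAX_THREADS][STACK_SIZE];
    static int ready[MAX_THREADS];          /* FIFO ready queue of thread ids */
    static int head, tail, nready;
    static int current;

    static void enqueue(int id)
    {
        ready[tail] = id;
        tail = (tail + 1) % MAX_THREADS;
        nready++;
    }

    static int dequeue(void)
    {
        int id = ready[head];
        head = (head + 1) % MAX_THREADS;
        nready--;
        return id;
    }

    /* Voluntary rescheduling point: go to the back of the ready queue,
     * save our context and return to the scheduler loop. */
    static void yield(void)
    {
        enqueue(current);
        swapcontext(&threads[current], &sched_ctx);
    }

    static void worker(void)
    {
        for (int i = 0; i < 2; i++) {
            printf("thread %d, iteration %d\n", current, i);
            yield();                        /* cooperative reschedule */
        }
        /* falling off the end returns to uc_link (the scheduler) */
    }

    int main(void)
    {
        for (int id = 0; id < 2; id++) {
            getcontext(&threads[id]);
            threads[id].uc_stack.ss_sp = stacks[id];
            threads[id].uc_stack.ss_size = STACK_SIZE;
            threads[id].uc_link = &sched_ctx; /* return here on thread exit */
            makecontext(&threads[id], worker, 0);
            enqueue(id);
        }
        /* Scheduler main loop: resume the next ready thread, FIFO order. */
        while (nready > 0) {
            current = dequeue();
            swapcontext(&sched_ctx, &threads[current]);
        }
        puts("all threads finished");
        return 0;
    }

The two workers interleave strictly (0, 1, 0, 1) because each yield re-queues
the current thread behind the other, which mirrors the FIFO policy described
above.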

An L-thread is a struct containing the CPU context of the thread
(saved on context switch) and other useful items. The ready queue contains
pointers to threads that are ready to run. The L-thread scheduler is a simple
loop that polls the ready queue and reads from it the next thread ready to
run, which it resumes by saving the current context (the current position in
the scheduler loop) and restoring the context of the next thread from its
thread struct. Thus an L-thread is always resumed at the last place it
yielded.

A well behaved L-thread will call the context switch regularly (at least once
in its main loop), thus returning to the scheduler's own main loop. Yielding
inserts the current thread at the back of the ready queue, and the process of
servicing the ready queue is repeated; thus the system runs by flipping back
and forth between the L-threads and the scheduler loop.

In the case of pthreads, the preemptive scheduling, time slicing, and support
for thread prioritization mean that progress is normally possible for any
thread that is ready to run. This comes at the price of a relatively heavier
context switch and scheduling overhead.

With L-threads the progress of any particular thread is determined by the
frequency of rescheduling opportunities in the other L-threads. This means
that an errant L-thread monopolizing the CPU might cause scheduling of other
threads to be stalled. Due to the lower cost of context switching, however,
voluntary rescheduling to ensure progress of other threads, if managed
sensibly, is not a prohibitive overhead, and overall performance can exceed
that of an application using pthreads.


Mutual exclusion
^^^^^^^^^^^^^^^^

With pthreads, preemption means that threads that share data must observe
some form of mutual exclusion protocol.

The fact that L-threads cannot preempt each other means that in many cases
mutual exclusion devices can be completely avoided.

Locking to protect shared data can be a significant bottleneck in
multi-threaded applications, so a carefully designed cooperatively scheduled
program can enjoy significant performance advantages.

So far we have considered only the simplistic case of a single core CPU;
when multiple CPUs are considered things are somewhat more complex.

First of all it is inevitable that there must be multiple L-thread
schedulers, one running on each EAL thread. So long as these schedulers
remain isolated from each other the above assertions about the potential
advantages of cooperative scheduling hold true.

A configuration with isolated cooperative schedulers is less flexible than
the pthread model, where threads can be affinitized to run on any CPU. With
isolated schedulers, scaling applications to utilize fewer or more CPUs
according to system demand is very difficult to achieve.

The L-thread subsystem makes it possible for L-threads to migrate between
schedulers running on different CPUs. Needless to say, if migration means
that threads that share data end up running on different CPUs, then this will
introduce the need for some kind of mutual exclusion system.

Of course ``rte_ring`` software rings can always be used to interconnect
threads running on different cores; however, to protect other kinds of shared
data structures, lock-free constructs or else explicit locking will be
required. This is a consideration for the application design.

In support of this extended functionality, the L-thread subsystem implements
thread safe mutexes and condition variables.

The cost of affinitizing and of condition variable signaling is significantly
lower than the equivalent pthread operations, and so applications using these
features will see a performance benefit.


Thread local storage
^^^^^^^^^^^^^^^^^^^^

As with applications written for pthreads, an application written for
L-threads can take advantage of thread local storage, in this case local to
an L-thread. An application may save and retrieve a single pointer to
application data in the L-thread struct.

For legacy and backward compatibility reasons two alternative methods are
also offered. The first is modeled directly on the pthread get/set specific
APIs; the second approach is modeled on the ``RTE_PER_LCORE`` macros, whereby
``PER_LTHREAD`` macros are introduced. In both cases the storage is local to
the L-thread.


.. _constraints_and_performance_implications:

Constraints and performance implications when using L-threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


.. _API_compatibility:

API compatibility
^^^^^^^^^^^^^^^^^

The L-thread subsystem provides a set of functions that are logically
equivalent to the corresponding functions offered by the POSIX pthread
library; however, not all pthread functions have a corresponding L-thread
equivalent, and not all features available to pthreads are implemented for
L-threads.

The pthread library offers considerable flexibility via programmable
attributes that can be associated with threads, mutexes, and condition
variables.

By contrast the L-thread subsystem has fixed functionality: the scheduler
policy cannot be varied, and L-threads cannot be prioritized. There are no
variable attributes associated with any L-thread objects. L-threads, mutexes
and condition variables all have fixed functionality.
(Note: reserved parameters are included in the APIs to facilitate possible
future support for attributes.)

The table below lists the pthread and equivalent L-thread APIs with notes on
differences and/or constraints. Where there is no L-thread entry in the
table, the L-thread subsystem provides no equivalent function.

.. _table_lthread_pthread:

.. table:: Pthread and equivalent L-thread APIs.

   +----------------------------+------------------------+-------------------+
   | **Pthread function**       | **L-thread function**  | **Notes**         |
   +============================+========================+===================+
   | pthread_barrier_destroy    |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_barrier_init       |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_barrier_wait       |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_broadcast     | lthread_cond_broadcast | See note 1        |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_destroy       | lthread_cond_destroy   |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_init          | lthread_cond_init      |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_signal        | lthread_cond_signal    | See note 1        |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_timedwait     |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_cond_wait          | lthread_cond_wait      | See note 5        |
   +----------------------------+------------------------+-------------------+
   | pthread_create             | lthread_create         | See notes 2, 3    |
   +----------------------------+------------------------+-------------------+
   | pthread_detach             | lthread_detach         | See note 4        |
   +----------------------------+------------------------+-------------------+
   | pthread_equal              |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_exit               | lthread_exit           |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_getspecific        | lthread_getspecific    |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_getcpuclockid      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_join               | lthread_join           |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_key_create         | lthread_key_create     |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_key_delete         | lthread_key_delete     |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_destroy      | lthread_mutex_destroy  |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_init         | lthread_mutex_init     |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_lock         | lthread_mutex_lock     | See note 6        |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_trylock      | lthread_mutex_trylock  | See note 6        |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_timedlock    |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_mutex_unlock       | lthread_mutex_unlock   |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_once               |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_destroy     |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_init        |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_rdlock      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_timedrdlock |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_timedwrlock |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_tryrdlock   |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_trywrlock   |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_unlock      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_rwlock_wrlock      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_self               | lthread_current        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setspecific        | lthread_setspecific    |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_init          |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_destroy       |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_lock          |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_trylock       |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_spin_unlock        |                        | See note 10       |
   +----------------------------+------------------------+-------------------+
   | pthread_cancel             | lthread_cancel         |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setcancelstate     |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setcanceltype      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_testcancel         |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_getschedparam      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_setschedparam      |                        |                   |
   +----------------------------+------------------------+-------------------+
   | pthread_yield              | lthread_yield          | See note 7        |
   +----------------------------+------------------------+-------------------+
   | pthread_setaffinity_np     | lthread_set_affinity   | See notes 2, 3, 8 |
   +----------------------------+------------------------+-------------------+
   |                            | lthread_sleep          | See note 9        |
   +----------------------------+------------------------+-------------------+
   |                            | lthread_sleep_clks     | See note 9        |
   +----------------------------+------------------------+-------------------+


**Note 1**:

Neither lthread signal nor broadcast may be called concurrently by L-threads
running on different schedulers, although multiple L-threads running in the
same scheduler may freely perform signal or broadcast operations. L-threads
running on the same or different schedulers may always safely wait on a
condition variable.


**Note 2**:

Pthread attributes may be used to affinitize a pthread with a cpu-set. The
L-thread subsystem does not support a cpu-set. An L-thread may be affinitized
only with a single CPU at any time.


**Note 3**:

If an L-thread is intended to run on a different NUMA node than the node that
creates the thread then, when calling ``lthread_create()``, it is
advantageous to specify the destination core as a parameter of
``lthread_create()``. See :ref:`memory_allocation_and_NUMA_awareness` for
details.


**Note 4**:

An L-thread can only detach itself, and cannot detach other L-threads.


**Note 5**:

A wait operation on a pthread condition variable is always associated with
and protected by a mutex, which must be owned by the thread at the time it
invokes ``pthread_cond_wait()``. By contrast L-thread condition variables are
thread safe (for waiters) and do not use an associated mutex. Multiple
L-threads (including L-threads running on other schedulers) can safely wait
on an L-thread condition variable. As a consequence the performance of an
L-thread condition variable is typically an order of magnitude faster than
its pthread counterpart.


**Note 6**:

Recursive locking is not supported with L-threads; attempts to take a lock
recursively will be detected and rejected.


**Note 7**:

``lthread_yield()`` will save the current context, insert the current thread
at the back of the ready queue, and resume the next ready thread. Yielding
increases ready queue backlog; see :ref:`ready_queue_backlog` for more
details about the implications of this.

N.B. The context switch time, as measured from immediately before the call to
``lthread_yield()`` to the point at which the next ready thread is resumed,
can be an order of magnitude faster than the same measurement for
``pthread_yield()``.


**Note 8**:

``lthread_set_affinity()`` is similar to a yield apart from the fact that the
yielding thread is inserted into a peer ready queue of another scheduler.
The peer ready queue is actually a separate thread safe queue, which means
that threads appearing in the peer ready queue can jump any backlog in the
local ready queue on the destination scheduler.

The context switch time, as measured from the time just before the call to
``lthread_set_affinity()`` to just after the same thread is resumed on the
new scheduler, can be orders of magnitude faster than the same measurement
for ``pthread_setaffinity_np()``.
**Note 9**:

Although there is no ``pthread_sleep()`` function, ``lthread_sleep()`` and
``lthread_sleep_clks()`` can be used wherever ``sleep()``, ``usleep()`` or
``nanosleep()`` might ordinarily be used. The L-thread sleep functions suspend
the current thread, start an ``rte_timer`` and resume the thread when the
timer matures. The ``rte_timer_manage()`` entry point is called on every pass
of the scheduler loop. This means that the worst case jitter on timer expiry
is determined by the longest period between context switches of any running
L-threads.

In a synthetic test with many threads sleeping and resuming, the measured
jitter is typically orders of magnitude lower than the same measurement made
for ``nanosleep()``.


**Note 10**:

Spin locks are not provided because they are problematic in a cooperative
environment, see :ref:`porting_locks_and_spinlocks` for a more detailed
discussion on how to avoid spin locks.


.. _Thread_local_storage_performance:

Thread local storage
^^^^^^^^^^^^^^^^^^^^

Of the three L-thread local storage options the simplest and most efficient is
storing a single application data pointer in the L-thread struct.

The ``PER_LTHREAD`` macros involve a run time computation to obtain the
address of the variable being saved/retrieved, and also require that the
accesses are de-referenced via a pointer. This means that code which has used
the ``RTE_PER_LCORE`` macros and is being ported to L-threads might need some
slight adjustment (see :ref:`porting_thread_local_storage` for hints about
porting code that makes use of thread local storage).

The get/set specific APIs are consistent with their pthread counterparts both
in use and in performance.


.. _memory_allocation_and_NUMA_awareness:

Memory allocation and NUMA awareness
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All memory allocation is from DPDK huge pages, and is NUMA aware. Each
scheduler maintains its own caches of objects: lthreads, their stacks, TLS,
mutexes and condition variables. These caches are implemented as unbounded
lock free MPSC queues. When objects are created they are always allocated
from the caches on the local core (current EAL thread).

If an L-thread has been affinitized to a different scheduler, then it can
always safely free resources to the caches from which they originated
(because the caches are MPSC queues).

If the L-thread has been affinitized to a different NUMA node then the memory
resources associated with it may incur longer access latency.

The commonly used pattern of setting affinity on entry to a thread after it
has started means that memory allocation for both the stack and TLS will have
been made from caches on the NUMA node on which the thread's creator is
running. This has the side effect that access latency will be sub-optimal
after affinitizing.

This side effect can be mitigated to some extent (although not completely) by
specifying the destination CPU as a parameter of ``lthread_create()``. This
causes the L-thread's stack and TLS to be allocated when it is first scheduled
on the destination scheduler; if the destination is on another NUMA node it
results in a more optimal memory allocation.

Note that the lthread struct itself remains allocated from memory on the
creating node. This is unavoidable because an L-thread is known everywhere by
the address of this struct.


.. _object_cache_sizing:

Object cache sizing
^^^^^^^^^^^^^^^^^^^

The per lcore object caches pre-allocate objects in bulk whenever a request to
allocate an object finds a cache empty.
By default 100 objects are pre-allocated; this is defined by
``LTHREAD_PREALLOC`` in the public API header file ``lthread_api.h``. This
means that the caches grow continually to meet system demand.

In the present implementation there is no mechanism to reduce the cache sizes
if system demand reduces. Thus the caches will remain at their maximum extent
indefinitely.

A consequence of the bulk pre-allocation of objects is that every 100 (the
default value) additional new object create operations result in a call to
``rte_malloc()``. For the creation of objects such as L-threads, which trigger
the allocation of even more objects (i.e. their stacks and TLS), this can
cause outliers in scheduling performance.

If this is a problem the simplest mitigation strategy is to dimension the
system by setting the bulk object pre-allocation size to some large number
that you do not expect to be exceeded. This means the caches will be populated
once only, the very first time a thread is created.


.. _Ready_queue_backlog:

Ready queue backlog
^^^^^^^^^^^^^^^^^^^

One of the more subtle performance considerations is managing the ready queue
backlog. The fewer threads that are waiting in the ready queue, the faster any
particular thread will get serviced.

In a naive L-thread application with N L-threads simply looping and yielding,
this backlog will always be equal to the number of L-threads; thus the cost of
a yield to a particular L-thread will be N times the context switch time.

This side effect can be mitigated by arranging for threads to be suspended and
wait to be resumed, rather than polling for work by constantly yielding.
Blocking on a mutex or condition variable, or even more obviously having a
thread sleep if it has a low frequency workload, are all mechanisms by which a
thread can be excluded from the ready queue until it really does need to be
run.
This can have a significant positive impact on performance.


.. _Initialization_and_shutdown_dependencies:

Initialization, shutdown and dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The L-thread subsystem depends on DPDK for huge page allocation and depends on
the ``rte_timer`` subsystem. The DPDK EAL initialization and
``rte_timer_subsystem_init()`` **MUST** be completed before the L-thread
subsystem can be used.

Thereafter initialization of the L-thread subsystem is largely transparent to
the application. Constructor functions ensure that global variables are
properly initialized. Other than global variables, each scheduler is
initialized independently the first time that an L-thread is created by a
particular EAL thread.

If the schedulers are to be run as isolated and independent schedulers, with
no intention that L-threads running on different schedulers will migrate
between schedulers or synchronize with L-threads running on other schedulers,
then initialization consists simply of creating an L-thread, and then running
the L-thread scheduler.

If there will be interaction between L-threads running on different
schedulers, then it is important that the starting of schedulers on different
EAL threads is synchronized.

To achieve this an additional initialization step is necessary: simply set the
number of schedulers by calling the API function
``lthread_num_schedulers_set(n)``, where ``n`` is the number of EAL threads
that will run L-thread schedulers. Setting the number of schedulers to a
number greater than 0 will cause all schedulers to wait until the others have
started before beginning to schedule L-threads.

The L-thread scheduler is started by calling the function ``lthread_run()``,
which should be called from the EAL thread and thus becomes the main loop of
the EAL thread.
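
Putting the steps above together, a condensed sketch of a synchronized
two-scheduler startup might look as follows. This is illustrative only:
error handling is omitted, ``app_main_lthread()`` and ``sched_main()`` are
hypothetical names, and the code assumes at least two available EAL lcores.

.. code-block:: c

    #include <rte_eal.h>
    #include <rte_launch.h>
    #include <rte_timer.h>
    #include <lthread_api.h>

    static void app_main_lthread(void *arg)  /* hypothetical first L-thread */
    {
        /* ... application work, possibly creating further L-threads ... */
        lthread_exit(NULL);
    }

    static int sched_main(void *arg)         /* runs on each participating lcore */
    {
        struct lthread *lt;

        lthread_create(&lt, -1, app_main_lthread, NULL);
        lthread_run();                       /* becomes this EAL thread's main loop */
        return 0;
    }

    int main(int argc, char **argv)
    {
        rte_eal_init(argc, argv);            /* MUST precede any L-thread usage */
        rte_timer_subsystem_init();          /* the L-thread subsystem needs rte_timer */

        lthread_num_schedulers_set(2);       /* both schedulers start in step */
        rte_eal_remote_launch(sched_main, NULL, 1);
        sched_main(NULL);                    /* run the other scheduler here */

        rte_eal_mp_wait_lcore();
        return 0;
    }
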
The function ``lthread_run()`` will not return until all threads running on
the scheduler have exited, and the scheduler has been explicitly stopped by
calling ``lthread_scheduler_shutdown(lcore)`` or
``lthread_scheduler_shutdown_all()``.

All these functions do is tell the scheduler that it can exit when there are
no longer any running L-threads; neither function forces any running L-thread
to terminate. Any desired application shutdown behavior must be designed and
built into the application to ensure that L-threads complete in a timely
manner.

**Important Note:** It is assumed when the scheduler exits that the
application is terminating for good. The scheduler does not free resources
before exiting, and running the scheduler a subsequent time will result in
undefined behavior.


.. _porting_legacy_code_to_run_on_lthreads:

Porting legacy code to run on L-threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Legacy code originally written for a pthread environment may be ported to
L-threads if the considerations about differences in scheduling policy, and
the constraints discussed in the previous sections, can be accommodated.

This section looks in more detail at some of the issues that may have to be
resolved when porting code.


.. _pthread_API_compatibility:

pthread API compatibility
^^^^^^^^^^^^^^^^^^^^^^^^^

The first step is to establish exactly which pthread APIs the legacy
application uses, and to understand the requirements of those APIs. If there
are corresponding L-thread APIs, and where the default pthread functionality
is used by the application, then, notwithstanding the other issues discussed
here, it should be feasible to run the application with L-threads. If the
legacy code modifies the default behavior using attributes then it may be
necessary to make some adjustments to eliminate those requirements.


.. _blocking_system_calls:

Blocking system API calls
^^^^^^^^^^^^^^^^^^^^^^^^^

It is important to understand what other system services the application may
be using, bearing in mind that in a cooperatively scheduled environment a
thread cannot block without stalling the scheduler, and with it all other
cooperative threads. Any kind of blocking system call, for example file or
socket IO, is a potential problem; a good tool to analyze the application for
this purpose is the ``strace`` utility.

There are many strategies to resolve these kinds of issues, each with its
merits. Possible solutions include:

* Adopting a polled mode of the system API concerned (if available).

* Arranging for another core to perform the function and synchronizing with
  that core via constructs that will not block the L-thread.

* Affinitizing the thread to another scheduler devoted (as a matter of policy)
  to handling threads wishing to make blocking calls, and then back again when
  finished.


.. _porting_locks_and_spinlocks:

Locks and spinlocks
^^^^^^^^^^^^^^^^^^^

Locks and spinlocks are another source of blocking behavior that, for the same
reasons as system calls, will need to be addressed.

If the application design ensures that the contending L-threads will always
run on the same scheduler, then it is probably safe to remove locks and spin
locks completely.

The only exception to the above rule is if for some reason the code performs
any kind of context switch whilst holding the lock (e.g. yield, sleep, or
block on a different lock, or on a condition variable). This will need to be
determined before deciding to eliminate a lock.

If a lock cannot be eliminated then an L-thread mutex can be substituted for
either kind of lock.
An L-thread blocking on an L-thread mutex will be suspended, and will cause
another ready L-thread to be resumed, thus not blocking the scheduler. When
default behavior is required, it can be used as a direct replacement for a
pthread mutex lock.

Spin locks are typically used when lock contention is likely to be rare and
where the period during which the lock may be held is relatively short. When
the contending L-threads are running on the same scheduler, an L-thread
blocking on a spin lock will enter an infinite loop, stopping the scheduler
completely (see :ref:`porting_infinite_loops` below).

If the application design ensures that contending L-threads will always run
on different schedulers, then it might be reasonable to leave a short spin
lock that rarely experiences contention in place.

If after all considerations it appears that a spin lock can neither be
eliminated completely, replaced with an L-thread mutex, nor left in place as
is, then an alternative is to loop on a flag, with a call to
``lthread_yield()`` inside the loop (n.b. if the contending L-threads might
ever run on different schedulers the flag will need to be manipulated
atomically).

Spinning and yielding is the least preferred solution since it introduces
ready queue backlog (see also :ref:`ready_queue_backlog`).


.. _porting_sleeps_and_delays:

Sleeps and delays
^^^^^^^^^^^^^^^^^

Yet another kind of blocking behavior (albeit momentary) are delay functions
like ``sleep()``, ``usleep()``, ``nanosleep()`` etc. All will have the
consequence of stalling the L-thread scheduler and, unless the delay is very
short (e.g. a very short nanosleep), calls to these functions will need to be
eliminated.

The simplest mitigation strategy is to use the L-thread sleep API functions,
of which two variants exist, ``lthread_sleep()`` and ``lthread_sleep_clks()``.
These functions start an ``rte_timer`` against the L-thread, suspend the
L-thread and cause another ready L-thread to be resumed. The suspended
L-thread is resumed when the ``rte_timer`` matures.


.. _porting_infinite_loops:

Infinite loops
^^^^^^^^^^^^^^

Some applications have threads with loops that contain no inherent
rescheduling opportunity, and rely solely on the OS time slicing to share the
CPU. In a cooperative environment this will stop everything dead. These kinds
of loops are not hard to identify; in a debug session you will find the
debugger is always stopping in the same loop.

The simplest solution to this kind of problem is to insert an explicit
``lthread_yield()`` or ``lthread_sleep()`` into the loop. Another solution
might be to include the function performed by the loop into the execution
path of some other loop that does in fact yield, if this is possible.


.. _porting_thread_local_storage:

Thread local storage
^^^^^^^^^^^^^^^^^^^^

If the application uses thread local storage, the use case should be studied
carefully.

In a legacy pthread application either or both of the ``__thread`` prefix and
the pthread set/get specific APIs may have been used to define storage local
to a pthread.

In some applications it may be a reasonable assumption that the data could,
or in fact most likely should, be placed in L-thread local storage.

If the application (like many DPDK applications) has assumed a certain
relationship between a pthread and the CPU to which it is affinitized, there
is a risk that thread local storage may have been used to save some data items
that are correctly logically associated with the CPU, and other items which
relate to application context for the thread. Only a good understanding of the
application will reveal such cases.
If the application requires that an L-thread be able to move between
schedulers, then care should be taken to separate these kinds of data into
per lcore and per L-thread storage. In this way a migrating thread will bring
with it the local data it needs, and pick up the new logical core specific
values from pthread local storage at its new home.


.. _pthread_shim:

Pthread shim
~~~~~~~~~~~~

A convenient way to get something working with legacy code can be to use a
shim that adapts pthread API calls to the corresponding L-thread ones.
This approach will not mitigate any of the porting considerations mentioned
in the previous sections, but it will reduce the amount of code churn that
would otherwise be involved. It is a reasonable approach to evaluate
L-threads before investing effort in porting to the native L-thread APIs.


Overview
^^^^^^^^
The L-thread subsystem includes an example pthread shim. This is a partial
implementation, but it does contain the API stubs needed to get basic
applications running. There is a simple "hello world" application that
demonstrates the use of the pthread shim.

A subtlety of working with a shim is that the application will still need to
make use of the genuine pthread library functions, at the very least in order
to create the EAL threads in which the L-thread schedulers will run. This is
the case with DPDK initialization and exit.

To deal with the initialization and shutdown scenarios, the shim is capable of
switching its adaptor functionality on or off; an application can control this
behavior by calling the function ``pt_override_set()``. The default state is
disabled.
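
The switching idea can be illustrated in miniature with a standalone
interposer (Linux/glibc, not the shim itself): a local definition shadows the
library's function, the genuine function is looked up through the dynamic
linker, and a flag selects the path taken. ``shim_override_set()`` is a
made-up stand-in for ``pt_override_set()``, and the diverted path here merely
prints instead of calling an L-thread API.

.. code-block:: c

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <pthread.h>
    #include <stdio.h>

    static __thread int override_on;      /* cf. the shim's pt_override_set() */

    static void shim_override_set(int state) { override_on = state; }

    /* Our definition shadows the library's; the genuine function remains
     * reachable through the dynamic linker via RTLD_NEXT. */
    int pthread_mutex_lock(pthread_mutex_t *m)
    {
        static int (*real_lock)(pthread_mutex_t *);

        if (real_lock == NULL)
            real_lock = (int (*)(pthread_mutex_t *))
                        dlsym(RTLD_NEXT, "pthread_mutex_lock");

        if (override_on) {
            puts("adaptor path taken");   /* a real shim would divert to an
                                           * L-thread call here */
            return 0;
        }
        return real_lock(m);              /* genuine pthread behaviour */
    }

    int main(void)
    {
        pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

        pthread_mutex_lock(&m);           /* override off: real lock taken */
        pthread_mutex_unlock(&m);

        shim_override_set(1);
        return pthread_mutex_lock(&m);    /* override on: diverted, prints */
    }
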
The pthread shim uses the dynamic linker loader and saves the loaded addresses
of the genuine pthread API functions in an internal table. When the shim
functionality is enabled it performs the adaptor function; when disabled it
invokes the genuine pthread function.

The function ``pthread_exit()`` has additional special handling. The standard
system header file pthread.h declares ``pthread_exit()`` with a ``noreturn``
attribute. This is an optimization that is possible because the pthread is
terminating, and it enables the compiler to omit the normal handling of the
stack and protection of registers, since the function is not expected to
return; in fact the thread is being destroyed. These optimizations are applied
in both the callee and the caller of the ``pthread_exit()`` function.

In our cooperative scheduling environment this behavior is inadmissible. The
pthread is the L-thread scheduler thread and, although an L-thread is
terminating, there must be a return to the scheduler in order that the system
can continue to run. Further, returning from a function with the ``noreturn``
attribute is invalid and may result in undefined behavior.

The solution is to redefine the ``pthread_exit`` function with a macro,
causing it to be mapped to a stub function in the shim that does not have the
``noreturn`` attribute. This macro is defined in the file
``pthread_shim.h``. The stub function is otherwise no different than any of
the other stub functions in the shim, and will switch between the real
``pthread_exit()`` function and the ``lthread_exit()`` function as required.
The only difference is the mapping to the stub by macro substitution.

A consequence of this is that the file ``pthread_shim.h`` must be included in
legacy code wishing to make use of the shim.
It also means that dynamic linkage of a pre-compiled binary that did not
include pthread_shim.h is not supported.

Given the requirements for porting legacy code outlined in
:ref:`porting_legacy_code_to_run_on_lthreads`, most applications will require
at least some minimal adjustment and recompilation to run on L-threads, so
pre-compiled binaries are unlikely to be encountered in practice.

In summary the shim approach adds some overhead, but can be a useful tool to
help establish the feasibility of a code reuse project. It is also a fairly
straightforward task to extend the shim if necessary.

**Note:** Bearing in mind the preceding discussions about the impact of making
blocking calls, switching the shim in and out on the fly to invoke any pthread
API that might block is something that should typically be avoided.


Building and running the pthread shim
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The shim example application is located in the ``performance-thread``
sub-directory.

To build and run the pthread shim example:

#. Build the application:

   To compile the sample application see :doc:`compiling`.

#. Run the pthread_shim example:

   .. code-block:: console

       dpdk-pthread-shim -c core_mask -n number_of_channels

.. _lthread_diagnostics:

L-thread Diagnostics
~~~~~~~~~~~~~~~~~~~~

When debugging you must take account of the fact that the L-threads are run
in a single pthread. The current scheduler is defined by
``RTE_PER_LCORE(this_sched)``, and the current lthread is stored at
``RTE_PER_LCORE(this_sched)->current_lthread``. Thus on a breakpoint in a GDB
session the current lthread can be obtained by displaying the pthread local
variable ``per_lcore_this_sched->current_lthread``.
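
For example, at a breakpoint in a scheduler's pthread the session might look
like this (the printed address is of course arbitrary):

.. code-block:: console

    (gdb) print per_lcore_this_sched->current_lthread
    $1 = (struct lthread *) 0x7ffff5e7a080
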
Another useful diagnostic feature is the possibility to trace significant
events in the life of an L-thread. This feature is enabled by changing the
value of ``LTHREAD_DIAG`` from 0 to 1 in the file ``lthread_diag_api.h``.

Tracing of events can be individually masked, and the mask may be programmed
at run time. An unmasked event results in a callback that provides information
about the event. The default callback simply prints trace information. The
default mask is 0 (all events off); the mask can be modified by calling the
function ``lthread_diagnostic_set_mask()``.

It is possible to register a user callback function to implement more
sophisticated diagnostic functions.
Object creation events (lthread, mutex, and condition variable) accept, and
store in the created object, a user supplied reference value returned by the
callback function.

The lthread reference value is passed back in all subsequent event callbacks,
and APIs are provided to retrieve the reference value from mutexes and
condition variables. This enables a user to monitor, count, or filter for
specific events, on specific objects, for example to monitor for a specific
thread signaling a specific condition variable, or to monitor all timer
events; the possibilities and combinations are endless.

The callback function can be set by calling the function
``lthread_diagnostic_enable()``, supplying a callback function pointer and an
event mask.

Setting ``LTHREAD_DIAG`` also enables counting of statistics about cache and
queue usage, and these statistics can be displayed by calling the function
``lthread_diag_stats_display()``. This function also performs a consistency
check on the caches and queues.
The function should only be called from the main EAL thread after all worker
threads have stopped and returned to the C main program, otherwise the
consistency check will fail.