..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2015 Intel Corporation.

Performance Thread Sample Application
=====================================

The performance thread sample application is a derivative of the standard L3
forwarding application that demonstrates different threading models.

Overview
--------
For a general description of the L3 forwarding application's capabilities,
please refer to the documentation of the standard application in
:doc:`l3_forward`.

The performance thread sample application differs from the standard L3
forwarding example in that it divides the TX and RX processing between
different threads, and makes it possible to assign individual threads to
different cores.

Three threading models are considered:

#. When there is one EAL thread per physical core.
#. When there are multiple EAL threads per physical core.
#. When there are multiple lightweight threads per EAL thread.

Since DPDK release 2.0 it is possible to launch applications using the
``--lcores`` EAL parameter, specifying cpu-sets for a physical core. With the
performance thread sample application it is now also possible to assign
individual RX and TX functions to different cores.

As an alternative to dividing the L3 forwarding work between different EAL
threads, the performance thread sample introduces the possibility to run the
application threads as lightweight threads (L-threads) within one or
more EAL threads.

In order to facilitate this threading model the example includes a primitive
cooperative scheduler (L-thread) subsystem. More details of the L-thread
subsystem can be found in :ref:`lthread_subsystem`.

**Note:** Whilst theoretically possible, it is not anticipated that multiple
L-thread schedulers would be run on the same physical core. This mode of
operation should not be expected to yield useful performance and is considered
invalid.

Compiling the Application
-------------------------

To compile the sample application see :doc:`compiling`.

The application is located in the ``performance-thread/l3fwd-thread`` sub-directory.

Running the Application
-----------------------

The application has a number of command line options::

    ./<build_dir>/examples/dpdk-l3fwd-thread [EAL options] --
        -p PORTMASK [-P]
        --rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)]
        --tx (lcore,thread)[,(lcore,thread)]
        [--enable-jumbo [--max-pkt-len PKTLEN]] [--no-numa]
        [--hash-entry-num] [--ipv6] [--no-lthreads] [--stat-lcore lcore]
        [--parse-ptype]

Where:

* ``-p PORTMASK``: Hexadecimal bitmask of ports to configure.

* ``-P``: optional, sets all ports to promiscuous mode so that packets are
  accepted regardless of the packet's Ethernet MAC destination address.
  Without this option, only packets with the Ethernet MAC destination address
  set to the Ethernet address of the port are accepted.

* ``--rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)]``: the list of
  NIC RX ports and queues handled by the RX lcores and threads. The parameters
  are explained below.

* ``--tx (lcore,thread)[,(lcore,thread)]``: the list of TX threads identifying
  the lcore the thread runs on, and the ID of the RX thread with which it is
  associated. The parameters are explained below.

* ``--enable-jumbo``: optional, enables jumbo frames.

* ``--max-pkt-len``: optional, maximum packet length in decimal (64-9600).

* ``--no-numa``: optional, disables NUMA awareness.

* ``--hash-entry-num``: optional, specifies the number of hash entries, in
  hexadecimal, to be set up.

* ``--ipv6``: optional, set it when forwarding IPv6 packets.

* ``--no-lthreads``: optional, disables the L-thread model and uses the EAL
  threading model. See below.

* ``--stat-lcore``: optional, runs the CPU load statistics collector on the
  specified lcore.

* ``--parse-ptype``: optional, set to use software to analyze packet type.
  Without this option, hardware will check the packet type.

The parameters of the ``--rx`` and ``--tx`` options are:

* ``--rx`` parameters

   .. _table_l3fwd_rx_parameters:

   +--------+------------------------------------------------------+
   | port   | RX port                                              |
   +--------+------------------------------------------------------+
   | queue  | RX queue that will be read on the specified RX port  |
   +--------+------------------------------------------------------+
   | lcore  | Core to use for the thread                           |
   +--------+------------------------------------------------------+
   | thread | Thread id (continuously from 0 to N)                 |
   +--------+------------------------------------------------------+


* ``--tx`` parameters

   .. _table_l3fwd_tx_parameters:

   +--------+------------------------------------------------------+
   | lcore  | Core to use for L3 route match and transmit          |
   +--------+------------------------------------------------------+
   | thread | Id of RX thread to be associated with this TX thread |
   +--------+------------------------------------------------------+

The ``l3fwd-thread`` application allows you to start packet processing in two
threading models: L-Threads (default) and EAL Threads (when the
``--no-lthreads`` parameter is used). For consistency all parameters are used
in the same way for both models.

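To make the tuple layout of ``--rx`` and ``--tx`` concrete, the following is a
minimal Python sketch that splits such an argument into integer tuples and
groups RX thread IDs by lcore. It is an illustration only, not the parser used
by the application; the helper names ``parse_tuples`` and
``rx_threads_by_lcore`` are hypothetical.

```python
import re


def parse_tuples(arg, fields):
    """Parse "(a,b,...)(c,d,...)" into a list of integer tuples.

    `fields` is the expected tuple width: 4 for --rx, 2 for --tx.
    Tuples may or may not be separated by commas.
    """
    tuples = []
    for body in re.findall(r"\(([^()]*)\)", arg):
        values = tuple(int(v) for v in body.split(","))
        if len(values) != fields:
            raise ValueError(f"expected {fields} fields, got {body!r}")
        tuples.append(values)
    return tuples


def rx_threads_by_lcore(rx_arg):
    """Group RX thread IDs by the lcore named in each --rx tuple."""
    placement = {}
    for _port, _queue, lcore, thread in parse_tuples(rx_arg, 4):
        placement.setdefault(lcore, []).append(thread)
    return placement
```

For instance, ``rx_threads_by_lcore("(0,0,0,0)(1,0,0,1)")`` groups RX threads 0
and 1 under lcore 0, which corresponds to a configuration where both RX threads
share one scheduler.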

Running with L-threads
~~~~~~~~~~~~~~~~~~~~~~

When the L-thread model is used (default option), the lcore and thread
parameters in ``--rx/--tx`` are used to affinitize threads to the selected
scheduler.

For example, the following places every L-thread on a different lcore::

   dpdk-l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                --rx="(0,0,0,0)(1,0,1,1)" \
                --tx="(2,0)(3,1)"

The following places RX L-threads on lcore 0 and TX L-threads on lcore 1 and 2
and so on::

   dpdk-l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                --rx="(0,0,0,0)(1,0,0,1)" \
                --tx="(1,0)(2,1)"


Running with EAL threads
~~~~~~~~~~~~~~~~~~~~~~~~

When the ``--no-lthreads`` parameter is used, the L-threading model is turned
off and EAL threads are used for all processing. EAL threads are enumerated in
the same way as L-threads, but the ``--lcores`` EAL parameter is used to
affinitize threads to the selected cpu-set (scheduler). Thus it is possible to
place every RX and TX thread on a different lcore.

For example, the following places every EAL thread on a different lcore::

   dpdk-l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                --rx="(0,0,0,0)(1,0,1,1)" \
                --tx="(2,0)(3,1)" \
                --no-lthreads


To affinitize two or more EAL threads to one cpu-set, the EAL ``--lcores``
parameter is used.

The following places the RX EAL threads (lcores 0 and 1) on cpu-set 0 and the
TX EAL threads (lcores 2 and 3) on cpu-set 1::

   dpdk-l3fwd-thread -l 0-7 -n 2 --lcores="(0,1)@0,(2,3)@1" -- -P -p 3 \
                --rx="(0,0,0,0)(1,0,1,1)" \
                --tx="(2,0)(3,1)" \
                --no-lthreads

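The effect of an ``--lcores`` value such as ``"(0,1)@0,(2,3)@1"`` can be
pictured as a mapping from lcore ID to CPU. The Python sketch below expands
only the ``(lcore set)@cpu`` form used in this guide; the full EAL syntax
accepts more forms that are deliberately not handled here, and the helper names
``expand_ids`` and ``lcore_to_cpu`` are hypothetical.

```python
import re


def expand_ids(spec):
    """Expand "0,1" or "0-3" (or a mix) into a sorted list of integers."""
    ids = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            ids.update(range(int(first), int(last) + 1))
        else:
            ids.add(int(part))
    return sorted(ids)


def lcore_to_cpu(lcores_arg):
    """Map each lcore ID to the CPU named after '@' in its group.

    Assumption: every group has the simplified form "(lcores)@cpu".
    """
    mapping = {}
    for lcore_spec, cpu in re.findall(r"\(([^()]*)\)@(\d+)", lcores_arg):
        for lcore in expand_ids(lcore_spec):
            mapping[lcore] = int(cpu)
    return mapping
```

With the value from the command above, lcores 0 and 1 map to CPU 0 and lcores
2 and 3 map to CPU 1, so the two RX threads share one CPU and the two TX
threads share another.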

Examples
~~~~~~~~

For selected scenarios, the L-thread command line and its EAL thread
equivalent are as follows:

a) Start every thread on a different scheduler (1:1)::

      dpdk-l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,1,1)" \
                   --tx="(2,0)(3,1)"

   EAL thread equivalent::

      dpdk-l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,1,1)" \
                   --tx="(2,0)(3,1)" \
                   --no-lthreads

b) Start all threads on one core (N:1).

   Start 4 L-threads on lcore 0::

      dpdk-l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,0,1)" \
                   --tx="(0,0)(0,1)"

   Start 4 EAL threads on cpu-set 0::

      dpdk-l3fwd-thread -l 0-7 -n 2 --lcores="(0-3)@0" -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,0,1)" \
                   --tx="(2,0)(3,1)" \
                   --no-lthreads

c) Start threads on different cores (N:M).

   Start 2 L-threads for RX on lcore 0, and 2 L-threads for TX on lcore 1::

      dpdk-l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,0,1)" \
                   --tx="(1,0)(1,1)"

   Start 2 EAL threads for RX on cpu-set 0, and 2 EAL threads for TX on
   cpu-set 1::

      dpdk-l3fwd-thread -l 0-7 -n 2 --lcores="(0-1)@0,(2-3)@1" -- -P -p 3 \
                   --rx="(0,0,0,0)(1,0,1,1)" \
                   --tx="(2,0)(3,1)" \
                   --no-lthreads

Explanation
-----------

To a great extent the sample application differs little from the standard L3
forwarding application, and readers are advised to familiarize themselves with
the material covered in the :doc:`l3_forward` documentation before proceeding.

The following explanation is focused on the way threading is handled in the
performance thread example.


Mode of operation with EAL threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The performance thread sample application has split the RX and TX functionality
into two different threads, and the RX and TX threads are
interconnected via software rings. With respect to these rings the RX threads
are producers and the TX threads are consumers.

On initialization the TX and RX threads are started according to the command
line parameters.

The RX threads poll the network interface queues and post received packets to a
TX thread via a corresponding software ring.

The TX threads poll software rings, perform the L3 forwarding hash/LPM match,
and assemble packet bursts before performing burst transmit on the network
interface.

As with the standard L3 forward application, burst draining of residual packets
is performed periodically, with the period calculated from elapsed time using
the timestamp counter.

The diagram below illustrates a case with two RX threads and three TX threads.

.. _figure_performance_thread_1:

.. figure:: img/performance_thread_1.*

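The producer/consumer relationship between RX and TX threads in the EAL thread
mode can be modelled in a few lines of Python. This is a sketch under stated
assumptions, not the application's code: a ``collections.deque`` stands in for
the software ring, ``BURST_SIZE`` is an arbitrary illustrative value, and the
function names are hypothetical.

```python
from collections import deque

BURST_SIZE = 4  # illustrative value, not taken from the application


def rx_poll(nic_queue, ring, max_burst=BURST_SIZE):
    """RX side (producer): move up to max_burst packets onto the ring."""
    moved = 0
    while nic_queue and moved < max_burst:
        ring.append(nic_queue.popleft())
        moved += 1
    return moved


def tx_poll(ring, tx_buffer, sent, burst=BURST_SIZE):
    """TX side (consumer): empty the ring, transmitting every full burst."""
    while ring:
        tx_buffer.append(ring.popleft())
        if len(tx_buffer) == burst:
            sent.append(list(tx_buffer))
            tx_buffer.clear()


def drain(tx_buffer, sent):
    """Periodic drain: flush residual packets smaller than a full burst."""
    if tx_buffer:
        sent.append(list(tx_buffer))
        tx_buffer.clear()
```

With six packets and a burst size of four, ``tx_poll`` transmits one full burst
and leaves two packets buffered until ``drain`` flushes them, which is the role
the periodic burst draining plays in the description above.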

Mode of operation with L-threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Like the EAL thread configuration, the application has split the RX and TX
functionality into different threads, and the pairs of RX and TX threads are
interconnected via software rings.

On initialization an L-thread scheduler is started on every EAL thread. On all
but the main EAL thread only a dummy L-thread is initially started.
The L-thread started on the main EAL thread then spawns other L-threads on
different L-thread schedulers according to the command line parameters.

The RX threads poll the network interface queues and post received packets
to a TX thread via the corresponding software ring.

The ring interface is augmented by means of an L-thread condition variable that
enables the TX thread to be suspended when the TX ring is empty. The RX thread
signals the condition whenever it posts to the TX ring, causing the TX thread
to be resumed.

Additionally the TX L-thread spawns a worker L-thread to take care of
polling the software rings, whilst it handles burst draining of the transmit
buffer.

The worker threads poll the software rings, perform L3 route lookup and
assemble packet bursts. If the TX ring is empty the worker thread suspends
itself by waiting on the condition variable associated with the ring.

Burst draining of residual packets (fewer than the burst size) is performed by
the TX thread, which sleeps (using an L-thread sleep function) and resumes
periodically to flush the TX buffer.

This design means that L-threads that have no work can yield the CPU to other
L-threads and avoid having to constantly poll the software rings.

The diagram below illustrates a case with two RX threads and three TX functions
(each comprising a thread that processes forwarding and a thread that
periodically drains the output buffer of residual packets).

.. _figure_performance_thread_2:

.. figure:: img/performance_thread_2.*

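The suspend-on-empty, resume-on-post behaviour of the ring and its condition
variable can be sketched with Python generators acting as cooperative threads.
This is a conceptual model only and does not use the L-thread API; the classes
``Cond`` and ``Scheduler`` and the function ``demo`` are hypothetical names
introduced for the illustration.

```python
from collections import deque


class Cond:
    """Toy condition variable: a queue of suspended threads."""

    def __init__(self):
        self.waiters = deque()


class Scheduler:
    """Toy cooperative scheduler with a FIFO ready queue."""

    def __init__(self):
        self.ready = deque()

    def spawn(self, thread):
        self.ready.append(thread)

    def signal(self, cond):
        # Make one waiting thread ready again, if there is one.
        if cond.waiters:
            self.ready.append(cond.waiters.popleft())

    def run(self):
        while self.ready:
            thread = self.ready.popleft()
            try:
                request = next(thread)
            except StopIteration:
                continue  # thread finished
            if isinstance(request, Cond):
                request.waiters.append(thread)  # suspended until signalled
            else:
                self.ready.append(thread)  # plain yield: back of the queue


def tx_worker(ring, cond, out, expected):
    """Consume packets; wait on the condition when the ring is empty."""
    while len(out) < expected:
        if not ring:
            yield cond
            continue
        out.append(ring.popleft())


def rx_thread(sched, ring, cond, packets):
    """Post each packet to the ring and signal the condition."""
    for packet in packets:
        ring.append(packet)
        sched.signal(cond)
        yield None


def demo(packets):
    sched, ring, cond, out = Scheduler(), deque(), Cond(), []
    sched.spawn(tx_worker(ring, cond, out, len(packets)))
    sched.spawn(rx_thread(sched, ring, cond, packets))
    sched.run()
    return out
```

In this model the worker never spins on an empty ring: it leaves the ready
queue entirely and only re-enters it when the RX thread signals, which is the
property the design relies on to free the CPU for other L-threads.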

CPU load statistics
~~~~~~~~~~~~~~~~~~~

It is possible to display statistics showing estimated CPU load on each core.
The statistics indicate the percentage of CPU time spent processing
received packets (forwarding), polling queues/rings (waiting for work),
and doing any other processing (context switch and other overhead).

When enabled, statistics are gathered by having the application threads set and
clear flags when they enter and exit pertinent code sections. The flags are
then sampled in real time by a statistics collector thread running on another
core. This thread displays the data in real time on the console.

This feature is enabled by designating a statistics collector core, using the
``--stat-lcore`` parameter.

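The sampling approach amounts to counting how often each flag state is observed
and converting the counts to percentages. The following Python sketch shows
that arithmetic only; the state labels and the function name
``load_percentages`` are hypothetical and are not taken from the application.

```python
from collections import Counter


def load_percentages(samples):
    """Return the share of samples, in percent, observed in each state."""
    total = len(samples)
    if total == 0:
        return {}
    counts = Counter(samples)
    return {state: 100.0 * n / total for state, n in counts.items()}
```

For example, four samples of which two were taken while forwarding, one while
polling and one in other code give 50%, 25% and 25% respectively. The estimate
becomes more accurate as the number of samples grows.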

.. _lthread_subsystem:

The L-thread subsystem
----------------------

The L-thread subsystem resides in the ``examples/performance-thread/common``
directory and is built and linked automatically when building the
``l3fwd-thread`` example.

The subsystem provides a simple cooperative scheduler to enable arbitrary
functions to run as cooperative threads within a single EAL thread.
The subsystem provides a pthread-like API that is intended to assist in
reuse of legacy code written for POSIX pthreads.

The following sections provide some detail on the features, constraints,
performance and porting considerations when using L-threads.


.. _comparison_between_lthreads_and_pthreads:

Comparison between L-threads and POSIX pthreads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The fundamental difference between the L-thread and pthread models is the
way in which threads are scheduled. The simplest way to think about this is to
consider the case of a processor with a single CPU. To run multiple threads
on a single CPU, the scheduler must frequently switch between the threads,
so that each thread is able to make timely progress.
This is the basis of any multitasking operating system.

This section explores the differences between the pthread model and the
L-thread model as implemented in the provided L-thread subsystem. If needed, a
theoretical discussion of preemptive vs cooperative multi-threading can be
found in any good text on operating system design.


Scheduling and context switching
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The POSIX pthread library provides an application programming interface to
create and synchronize threads. Scheduling policy is determined by the host OS,
and may be configurable. The OS may use sophisticated rules to determine which
thread should be run next, threads may suspend themselves or make other threads
ready, and the scheduler may employ a time slice giving each thread a maximum
time quantum after which it will be preempted in favor of another thread that
is ready to run. To complicate matters further, threads may be assigned
different scheduling priorities.

By contrast the L-thread subsystem is considerably simpler. Logically the
L-thread scheduler performs the same multiplexing function for L-threads
within a single pthread as the OS scheduler does for pthreads within an
application process. The L-thread scheduler is simply the main loop of a
pthread, and in so far as the host OS is concerned it is a regular pthread
just like any other. The host OS is oblivious to the existence of L-threads
and is not at all involved in scheduling them.

The other and most significant difference between the two models is that
L-threads are scheduled cooperatively. L-threads cannot preempt each
other, nor can the L-thread scheduler preempt a running L-thread (i.e.
there is no time slicing). The consequence is that programs implemented with
L-threads must possess frequent rescheduling points, meaning that they must
explicitly and of their own volition return to the scheduler at frequent
intervals, in order to allow other L-threads an opportunity to proceed.

In both models switching between threads requires that the current CPU
context is saved and a new context (belonging to the next thread ready to run)
is restored. With pthreads this context switching is handled transparently
and the set of CPU registers that must be preserved between context switches
is as per an interrupt handler.

An L-thread context switch is achieved by the thread itself making a function
call to the L-thread scheduler. Thus it is only necessary to preserve the
callee-saved registers. The caller is responsible for saving any other
registers it is using before a function call and restoring them on return,
and this is handled by the compiler. For ``X86_64`` on both Linux and BSD the
System V calling convention is used. This defines registers RBX, RSP, RBP, and
R12-R15 as callee-saved registers (for a more detailed discussion a good reference
is `X86 Calling Conventions <https://en.wikipedia.org/wiki/X86_calling_conventions>`_).

Taking advantage of this, and due to the absence of preemption, an L-thread
context switch is achieved with less than 20 load/store instructions.

The scheduling policy for L-threads is fixed. There is no prioritization of
L-threads: all L-threads are equal and scheduling is based on a FIFO
ready queue.

An L-thread is a struct containing the CPU context of the thread
(saved on context switch) and other useful items. The ready queue contains
pointers to threads that are ready to run. The L-thread scheduler is a simple
loop that polls the ready queue and reads from it the next thread ready to run,
which it resumes by saving the current context (the current position in the
scheduler loop) and restoring the context of the next thread from its thread
struct. Thus an L-thread is always resumed at the last place it yielded.

A well behaved L-thread will call the context switch regularly (at least once
in its main loop), thus returning to the scheduler's own main loop. Yielding
inserts the current thread at the back of the ready queue, and the process of
servicing the ready queue is repeated. Thus the system runs by flipping back
and forth between the L-threads and the scheduler loop.

In the case of pthreads, the preemptive scheduling, time slicing, and support
for thread prioritization mean that progress is normally possible for any
thread that is ready to run. This comes at the price of a relatively heavier
context switch and scheduling overhead.

With L-threads the progress of any particular thread is determined by the
frequency of rescheduling opportunities in the other L-threads. This means that
an errant L-thread monopolizing the CPU might cause scheduling of other threads
to be stalled. Due to the lower cost of context switching, however, voluntary
rescheduling to ensure progress of other threads, if managed sensibly, is not
a prohibitive overhead, and overall performance can exceed that of an
application using pthreads.

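The FIFO ready queue and voluntary yielding described in this section can be
modelled with Python generators, where each ``yield`` plays the role of a
rescheduling point. This is a conceptual sketch, not the L-thread
implementation (which switches CPU contexts in C); ``run_scheduler`` and
``worker`` are hypothetical names.

```python
from collections import deque


def run_scheduler(threads):
    """Service a FIFO ready queue until every thread has finished."""
    ready = deque(threads)
    while ready:
        thread = ready.popleft()      # next thread ready to run
        try:
            next(thread)              # resume where it last yielded
        except StopIteration:
            continue                  # finished: do not requeue
        ready.append(thread)          # yielded: back of the ready queue


def worker(name, steps, trace):
    """A well behaved thread: one unit of work, then yield."""
    for step in range(steps):
        trace.append((name, step))
        yield
```

Running ``worker("a", 2, trace)`` and ``worker("b", 3, trace)`` under this
scheduler interleaves their steps in strict FIFO order. If one worker omitted
its ``yield`` and looped indefinitely, the other would never run, which
illustrates why frequent rescheduling points are required.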

Mutual exclusion
^^^^^^^^^^^^^^^^

With pthreads, preemption means that threads that share data must observe
some form of mutual exclusion protocol.

The fact that L-threads cannot preempt each other means that in many cases
mutual exclusion devices can be completely avoided.

Locking to protect shared data can be a significant bottleneck in
multi-threaded applications, so a carefully designed cooperatively scheduled
program can enjoy significant performance advantages.

So far we have considered only the simplistic case of a single core CPU.
When multiple CPUs are considered things are somewhat more complex.

First of all it is inevitable that there must be multiple L-thread schedulers,
one running on each EAL thread. So long as these schedulers remain isolated
from each other the above assertions about the potential advantages of
cooperative scheduling hold true.

A configuration with isolated cooperative schedulers is less flexible than the
pthread model where threads can be affinitized to run on any CPU. With isolated
schedulers, scaling of applications to utilize fewer or more CPUs according to
system demand is very difficult to achieve.

The L-thread subsystem makes it possible for L-threads to migrate between
schedulers running on different CPUs. Needless to say, if the migration means
that threads that share data end up running on different CPUs then this will
introduce the need for some kind of mutual exclusion system.

Of course ``rte_ring`` software rings can always be used to interconnect
threads running on different cores. However, to protect other kinds of shared
data structures, lock-free constructs or else explicit locking will be
required. This is a consideration for the application design.

In support of this extended functionality, the L-thread subsystem implements
thread-safe mutexes and condition variables.

The cost of affinitizing and of condition variable signaling is significantly
lower than the equivalent pthread operations, and so applications using these
features will see a performance benefit.


Thread local storage
^^^^^^^^^^^^^^^^^^^^

As with applications written for pthreads, an application written for L-threads
can take advantage of thread local storage, in this case local to an L-thread.
An application may save and retrieve a single pointer to application data in
the L-thread struct.

For legacy and backward compatibility reasons two alternative methods are also
offered. The first is modeled directly on the pthread get/set specific APIs.
The second is modeled on the ``RTE_PER_LCORE`` macros, whereby
``PER_LTHREAD`` macros are introduced. In both cases the storage is local to
the L-thread.

508a9643ea8Slogwang
509a9643ea8Slogwang.. _constraints_and_performance_implications:
510a9643ea8Slogwang
511a9643ea8SlogwangConstraints and performance implications when using L-threads
512a9643ea8Slogwang~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
513a9643ea8Slogwang
514a9643ea8Slogwang
515a9643ea8Slogwang.. _API_compatibility:
516a9643ea8Slogwang
517a9643ea8SlogwangAPI compatibility
518a9643ea8Slogwang^^^^^^^^^^^^^^^^^
519a9643ea8Slogwang
520a9643ea8SlogwangThe L-thread subsystem provides a set of functions that are logically equivalent
521a9643ea8Slogwangto the corresponding functions offered by the POSIX pthread library, however not
522a9643ea8Slogwangall pthread functions have a corresponding L-thread equivalent, and not all
523a9643ea8Slogwangfeatures available to pthreads are implemented for L-threads.
524a9643ea8Slogwang
525a9643ea8SlogwangThe pthread library offers considerable flexibility via programmable attributes
526a9643ea8Slogwangthat can be associated with threads, mutexes, and condition variables.
527a9643ea8Slogwang
By contrast the L-thread subsystem has fixed functionality: the scheduler policy
cannot be varied, and L-threads cannot be prioritized. There are no variable
attributes associated with any L-thread objects. L-threads, mutexes and
condition variables all have fixed functionality. (Note: reserved parameters
are included in the APIs to facilitate possible future support for attributes).
533a9643ea8Slogwang
The table below lists the pthread and equivalent L-thread APIs with notes on
differences and/or constraints. Where there is no L-thread entry in the table,
the L-thread subsystem provides no equivalent function.
537a9643ea8Slogwang
538a9643ea8Slogwang.. _table_lthread_pthread:
539a9643ea8Slogwang
540a9643ea8Slogwang.. table:: Pthread and equivalent L-thread APIs.
541a9643ea8Slogwang
542a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
543a9643ea8Slogwang   | **Pthread function**       | **L-thread function**  | **Notes**         |
544a9643ea8Slogwang   +============================+========================+===================+
545a9643ea8Slogwang   | pthread_barrier_destroy    |                        |                   |
546a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
547a9643ea8Slogwang   | pthread_barrier_init       |                        |                   |
548a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
549a9643ea8Slogwang   | pthread_barrier_wait       |                        |                   |
550a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
551a9643ea8Slogwang   | pthread_cond_broadcast     | lthread_cond_broadcast | See note 1        |
552a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
553a9643ea8Slogwang   | pthread_cond_destroy       | lthread_cond_destroy   |                   |
554a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
555a9643ea8Slogwang   | pthread_cond_init          | lthread_cond_init      |                   |
556a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
557a9643ea8Slogwang   | pthread_cond_signal        | lthread_cond_signal    | See note 1        |
558a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
559a9643ea8Slogwang   | pthread_cond_timedwait     |                        |                   |
560a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
561a9643ea8Slogwang   | pthread_cond_wait          | lthread_cond_wait      | See note 5        |
562a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
563a9643ea8Slogwang   | pthread_create             | lthread_create         | See notes 2, 3    |
564a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
565a9643ea8Slogwang   | pthread_detach             | lthread_detach         | See note 4        |
566a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
567a9643ea8Slogwang   | pthread_equal              |                        |                   |
568a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
569a9643ea8Slogwang   | pthread_exit               | lthread_exit           |                   |
570a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
571a9643ea8Slogwang   | pthread_getspecific        | lthread_getspecific    |                   |
572a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
573a9643ea8Slogwang   | pthread_getcpuclockid      |                        |                   |
574a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
575a9643ea8Slogwang   | pthread_join               | lthread_join           |                   |
576a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
577a9643ea8Slogwang   | pthread_key_create         | lthread_key_create     |                   |
578a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
579a9643ea8Slogwang   | pthread_key_delete         | lthread_key_delete     |                   |
580a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
581a9643ea8Slogwang   | pthread_mutex_destroy      | lthread_mutex_destroy  |                   |
582a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
583a9643ea8Slogwang   | pthread_mutex_init         | lthread_mutex_init     |                   |
584a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
585a9643ea8Slogwang   | pthread_mutex_lock         | lthread_mutex_lock     | See note 6        |
586a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
587a9643ea8Slogwang   | pthread_mutex_trylock      | lthread_mutex_trylock  | See note 6        |
588a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
589a9643ea8Slogwang   | pthread_mutex_timedlock    |                        |                   |
590a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
591a9643ea8Slogwang   | pthread_mutex_unlock       | lthread_mutex_unlock   |                   |
592a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
593a9643ea8Slogwang   | pthread_once               |                        |                   |
594a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
595a9643ea8Slogwang   | pthread_rwlock_destroy     |                        |                   |
596a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
597a9643ea8Slogwang   | pthread_rwlock_init        |                        |                   |
598a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
599a9643ea8Slogwang   | pthread_rwlock_rdlock      |                        |                   |
600a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
601a9643ea8Slogwang   | pthread_rwlock_timedrdlock |                        |                   |
602a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
603a9643ea8Slogwang   | pthread_rwlock_timedwrlock |                        |                   |
604a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
605a9643ea8Slogwang   | pthread_rwlock_tryrdlock   |                        |                   |
606a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
607a9643ea8Slogwang   | pthread_rwlock_trywrlock   |                        |                   |
608a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
609a9643ea8Slogwang   | pthread_rwlock_unlock      |                        |                   |
610a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
611a9643ea8Slogwang   | pthread_rwlock_wrlock      |                        |                   |
612a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
613a9643ea8Slogwang   | pthread_self               | lthread_current        |                   |
614a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
615a9643ea8Slogwang   | pthread_setspecific        | lthread_setspecific    |                   |
616a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
617a9643ea8Slogwang   | pthread_spin_init          |                        | See note 10       |
618a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
619a9643ea8Slogwang   | pthread_spin_destroy       |                        | See note 10       |
620a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
621a9643ea8Slogwang   | pthread_spin_lock          |                        | See note 10       |
622a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
623a9643ea8Slogwang   | pthread_spin_trylock       |                        | See note 10       |
624a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
625a9643ea8Slogwang   | pthread_spin_unlock        |                        | See note 10       |
626a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
627a9643ea8Slogwang   | pthread_cancel             | lthread_cancel         |                   |
628a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
629a9643ea8Slogwang   | pthread_setcancelstate     |                        |                   |
630a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
631a9643ea8Slogwang   | pthread_setcanceltype      |                        |                   |
632a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
633a9643ea8Slogwang   | pthread_testcancel         |                        |                   |
634a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
635a9643ea8Slogwang   | pthread_getschedparam      |                        |                   |
636a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
637a9643ea8Slogwang   | pthread_setschedparam      |                        |                   |
638a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
639a9643ea8Slogwang   | pthread_yield              | lthread_yield          | See note 7        |
640a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
641a9643ea8Slogwang   | pthread_setaffinity_np     | lthread_set_affinity   | See notes 2, 3, 8 |
642a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
643a9643ea8Slogwang   |                            | lthread_sleep          | See note 9        |
644a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
645a9643ea8Slogwang   |                            | lthread_sleep_clks     | See note 9        |
646a9643ea8Slogwang   +----------------------------+------------------------+-------------------+
647a9643ea8Slogwang
648a9643ea8Slogwang
649a9643ea8Slogwang**Note 1**:
650a9643ea8Slogwang
651a9643ea8SlogwangNeither lthread signal nor broadcast may be called concurrently by L-threads
652a9643ea8Slogwangrunning on different schedulers, although multiple L-threads running in the
653a9643ea8Slogwangsame scheduler may freely perform signal or broadcast operations. L-threads
654a9643ea8Slogwangrunning on the same or different schedulers may always safely wait on a
655a9643ea8Slogwangcondition variable.
656a9643ea8Slogwang
657a9643ea8Slogwang
658a9643ea8Slogwang**Note 2**:
659a9643ea8Slogwang
660a9643ea8SlogwangPthread attributes may be used to affinitize a pthread with a cpu-set. The
661a9643ea8SlogwangL-thread subsystem does not support a cpu-set. An L-thread may be affinitized
662a9643ea8Slogwangonly with a single CPU at any time.
663a9643ea8Slogwang
664a9643ea8Slogwang
665a9643ea8Slogwang**Note 3**:
666a9643ea8Slogwang
If an L-thread is intended to run on a different NUMA node from the one on
which it is created, it is advantageous to specify the destination core as a
parameter of ``lthread_create()``. See
:ref:`memory_allocation_and_NUMA_awareness` for details.
671a9643ea8Slogwang
672a9643ea8Slogwang
673a9643ea8Slogwang**Note 4**:
674a9643ea8Slogwang
675a9643ea8SlogwangAn L-thread can only detach itself, and cannot detach other L-threads.
676a9643ea8Slogwang
677a9643ea8Slogwang
678a9643ea8Slogwang**Note 5**:
679a9643ea8Slogwang
A wait operation on a pthread condition variable is always associated with and
protected by a mutex which must be owned by the thread at the time it invokes
``pthread_cond_wait()``. By contrast L-thread condition variables are thread
safe (for waiters) and do not use an associated mutex. Multiple L-threads
(including L-threads running on other schedulers) can safely wait on an
L-thread condition variable. As a consequence an L-thread condition variable
is typically an order of magnitude faster than its pthread counterpart.
687a9643ea8Slogwang
688a9643ea8Slogwang
689a9643ea8Slogwang**Note 6**:
690a9643ea8Slogwang
Recursive locking is not supported with L-threads. Attempts to take a lock
recursively will be detected and rejected.
693a9643ea8Slogwang
694a9643ea8Slogwang
695a9643ea8Slogwang**Note 7**:
696a9643ea8Slogwang
``lthread_yield()`` will save the current context, insert the current thread
at the back of the ready queue, and resume the next ready thread. Yielding
increases ready queue backlog, see :ref:`ready_queue_backlog` for more details
about the implications of this.
701a9643ea8Slogwang
702a9643ea8Slogwang
N.B. The context switch time as measured from immediately before the call to
``lthread_yield()`` to the point at which the next ready thread is resumed
can be an order of magnitude faster than the same measurement for
``pthread_yield()``.
707a9643ea8Slogwang
708a9643ea8Slogwang
709a9643ea8Slogwang**Note 8**:
710a9643ea8Slogwang
711a9643ea8Slogwang``lthread_set_affinity()`` is similar to a yield apart from the fact that the
712a9643ea8Slogwangyielding thread is inserted into a peer ready queue of another scheduler.
713a9643ea8SlogwangThe peer ready queue is actually a separate thread safe queue, which means that
714a9643ea8Slogwangthreads appearing in the peer ready queue can jump any backlog in the local
715a9643ea8Slogwangready queue on the destination scheduler.
716a9643ea8Slogwang
717a9643ea8SlogwangThe context switch time as measured from the time just before the call to
718a9643ea8Slogwang``lthread_set_affinity()`` to just after the same thread is resumed on the new
719a9643ea8Slogwangscheduler can be orders of magnitude faster than the same measurement for
720a9643ea8Slogwang``pthread_setaffinity_np()``.
721a9643ea8Slogwang
722a9643ea8Slogwang
723a9643ea8Slogwang**Note 9**:
724a9643ea8Slogwang
725a9643ea8SlogwangAlthough there is no ``pthread_sleep()`` function, ``lthread_sleep()`` and
726a9643ea8Slogwang``lthread_sleep_clks()`` can be used wherever ``sleep()``, ``usleep()`` or
727a9643ea8Slogwang``nanosleep()`` might ordinarily be used. The L-thread sleep functions suspend
728a9643ea8Slogwangthe current thread, start an ``rte_timer`` and resume the thread when the
729a9643ea8Slogwangtimer matures. The ``rte_timer_manage()`` entry point is called on every pass
730a9643ea8Slogwangof the scheduler loop. This means that the worst case jitter on timer expiry
731a9643ea8Slogwangis determined by the longest period between context switches of any running
732a9643ea8SlogwangL-threads.
733a9643ea8Slogwang
In a synthetic test with many threads sleeping and resuming, the measured
jitter is typically orders of magnitude lower than the same measurement made
for ``nanosleep()``.
737a9643ea8Slogwang
738a9643ea8Slogwang
739a9643ea8Slogwang**Note 10**:
740a9643ea8Slogwang
Spin locks are not provided because they are problematic in a cooperative
environment. See :ref:`porting_locks_and_spinlocks` for a more detailed
discussion on how to avoid spin locks.
744a9643ea8Slogwang
745a9643ea8Slogwang
746a9643ea8Slogwang.. _Thread_local_storage_performance:
747a9643ea8Slogwang
748a9643ea8SlogwangThread local storage
749a9643ea8Slogwang^^^^^^^^^^^^^^^^^^^^
750a9643ea8Slogwang
751a9643ea8SlogwangOf the three L-thread local storage options the simplest and most efficient is
752a9643ea8Slogwangstoring a single application data pointer in the L-thread struct.
753a9643ea8Slogwang
The ``PER_LTHREAD`` macros involve a run time computation to obtain the address
of the variable being saved/retrieved and also require that the accesses are
dereferenced via a pointer. This means that code that has used
``RTE_PER_LCORE`` macros being ported to L-threads might need some slight
adjustment (see :ref:`porting_thread_local_storage` for hints about porting
code that makes use of thread local storage).
760a9643ea8Slogwang
761a9643ea8SlogwangThe get/set specific APIs are consistent with their pthread counterparts both
762a9643ea8Slogwangin use and in performance.
763a9643ea8Slogwang
764a9643ea8Slogwang
765a9643ea8Slogwang.. _memory_allocation_and_NUMA_awareness:
766a9643ea8Slogwang
767a9643ea8SlogwangMemory allocation and NUMA awareness
768a9643ea8Slogwang^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
769a9643ea8Slogwang
770a9643ea8SlogwangAll memory allocation is from DPDK huge pages, and is NUMA aware. Each
771a9643ea8Slogwangscheduler maintains its own caches of objects: lthreads, their stacks, TLS,
772a9643ea8Slogwangmutexes and condition variables. These caches are implemented as unbounded lock
773a9643ea8Slogwangfree MPSC queues. When objects are created they are always allocated from the
774a9643ea8Slogwangcaches on the local core (current EAL thread).
775a9643ea8Slogwang
776a9643ea8SlogwangIf an L-thread has been affinitized to a different scheduler, then it can
777a9643ea8Slogwangalways safely free resources to the caches from which they originated (because
778a9643ea8Slogwangthe caches are MPSC queues).
779a9643ea8Slogwang
780a9643ea8SlogwangIf the L-thread has been affinitized to a different NUMA node then the memory
781a9643ea8Slogwangresources associated with it may incur longer access latency.
782a9643ea8Slogwang
The commonly used pattern of setting affinity on entry to a thread after it has
started means that memory allocation for both the stack and TLS will have been
made from caches on the NUMA node on which the thread's creator is running.
This has the side effect that access latency will be sub-optimal after
affinitizing.
788a9643ea8Slogwang
This side effect can be mitigated to some extent (although not completely) by
specifying the destination CPU as a parameter of ``lthread_create()``. This
causes the L-thread's stack and TLS to be allocated when it is first scheduled
on the destination scheduler. If the destination is on another NUMA node, this
results in a more optimal memory allocation.
794a9643ea8Slogwang
Note that the lthread struct itself remains allocated from memory on the
creating node. This is unavoidable because an L-thread is known everywhere by
the address of this struct.
798a9643ea8Slogwang
799a9643ea8Slogwang
800a9643ea8Slogwang.. _object_cache_sizing:
801a9643ea8Slogwang
802a9643ea8SlogwangObject cache sizing
803a9643ea8Slogwang^^^^^^^^^^^^^^^^^^^
804a9643ea8Slogwang
The per lcore object caches pre-allocate objects in bulk whenever a request to
allocate an object finds a cache empty. By default 100 objects are
pre-allocated; this is defined by ``LTHREAD_PREALLOC`` in the public API
header file ``lthread_api.h``. This means that the caches constantly grow to
meet system demand.
810a9643ea8Slogwang
811a9643ea8SlogwangIn the present implementation there is no mechanism to reduce the cache sizes
812a9643ea8Slogwangif system demand reduces. Thus the caches will remain at their maximum extent
813a9643ea8Slogwangindefinitely.
814a9643ea8Slogwang
A consequence of the bulk pre-allocation of objects is that every 100 (default
value) additional object create operations result in a call to
``rte_malloc()``. Creating objects such as L-threads, which triggers the
allocation of even more objects (i.e. their stacks and TLS), can therefore
cause outliers in scheduling performance.
820a9643ea8Slogwang
821a9643ea8SlogwangIf this is a problem the simplest mitigation strategy is to dimension the
822a9643ea8Slogwangsystem, by setting the bulk object pre-allocation size to some large number
823a9643ea8Slogwangthat you do not expect to be exceeded. This means the caches will be populated
824a9643ea8Slogwangonce only, the very first time a thread is created.
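
The effect of the pre-allocation size can be illustrated with a small
standalone model. This is only a sketch and not the L-thread implementation:
``struct cache_model`` and its helpers are hypothetical names, and each bulk
refill stands in for one call to ``rte_malloc()``.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of one object cache; not the lthread implementation. */
struct cache_model {
    size_t prealloc;    /* objects added by each bulk refill */
    size_t available;   /* objects currently held in the cache */
    size_t bulk_allocs; /* number of bulk refills performed */
};

void cache_model_init(struct cache_model *c, size_t prealloc)
{
    assert(prealloc > 0);
    c->prealloc = prealloc;
    c->available = 0;
    c->bulk_allocs = 0;
}

/* Take one object, refilling in bulk when the cache is found empty. */
void cache_model_get(struct cache_model *c)
{
    if (c->available == 0) {
        c->available = c->prealloc; /* stands in for one rte_malloc() */
        c->bulk_allocs++;
    }
    c->available--;
}

/* Count the bulk refills triggered by a burst of object creations. */
size_t count_bulk_allocs(size_t prealloc, size_t creates)
{
    struct cache_model c;
    size_t i;

    cache_model_init(&c, prealloc);
    for (i = 0; i < creates; i++)
        cache_model_get(&c);
    return c.bulk_allocs;
}
```

In this model a burst of 250 creations triggers three refills with a
pre-allocation size of 100, but only one refill with a size of 1000, which is
the "populate once" behavior the mitigation aims for.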
825a9643ea8Slogwang
826a9643ea8Slogwang
827a9643ea8Slogwang.. _Ready_queue_backlog:
828a9643ea8Slogwang
829a9643ea8SlogwangReady queue backlog
830a9643ea8Slogwang^^^^^^^^^^^^^^^^^^^
831a9643ea8Slogwang
One of the more subtle performance considerations is managing the ready queue
backlog. The fewer threads that are waiting in the ready queue, the faster
any particular thread will get serviced.
835a9643ea8Slogwang
836a9643ea8SlogwangIn a naive L-thread application with N L-threads simply looping and yielding,
837a9643ea8Slogwangthis backlog will always be equal to the number of L-threads, thus the cost of
838a9643ea8Slogwanga yield to a particular L-thread will be N times the context switch time.
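
The relationship above can be written as a one-line model. The function name
``yield_wait_ns`` and the per-switch cost passed to it are assumptions made
for illustration; real context switch times must be measured on the target
system.

```c
#include <stdint.h>

/*
 * With n_ready L-threads looping and yielding, a thread that yields runs
 * again only after every ready thread has been resumed once.
 * switch_ns is an assumed per-switch cost, not a measured value.
 */
uint64_t yield_wait_ns(uint64_t n_ready, uint64_t switch_ns)
{
    return n_ready * switch_ns;
}
```

For example, 1000 yielding L-threads at an assumed 50 ns per switch give
50 microseconds between successive runs of any one thread. If all but 10 of
them were suspended instead of yielding, the same model gives 500 ns.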
839a9643ea8Slogwang
840a9643ea8SlogwangThis side effect can be mitigated by arranging for threads to be suspended and
841a9643ea8Slogwangwait to be resumed, rather than polling for work by constantly yielding.
842a9643ea8SlogwangBlocking on a mutex or condition variable or even more obviously having a
843a9643ea8Slogwangthread sleep if it has a low frequency workload are all mechanisms by which a
844a9643ea8Slogwangthread can be excluded from the ready queue until it really does need to be
845a9643ea8Slogwangrun. This can have a significant positive impact on performance.
846a9643ea8Slogwang
847a9643ea8Slogwang
848a9643ea8Slogwang.. _Initialization_and_shutdown_dependencies:
849a9643ea8Slogwang
850a9643ea8SlogwangInitialization, shutdown and dependencies
851a9643ea8Slogwang^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
852a9643ea8Slogwang
The L-thread subsystem depends on DPDK for huge page allocation and depends on
the ``rte_timer`` subsystem. The DPDK EAL initialization and
``rte_timer_subsystem_init()`` **MUST** be completed before the L-thread
subsystem can be used.
857a9643ea8Slogwang
858a9643ea8SlogwangThereafter initialization of the L-thread subsystem is largely transparent to
859a9643ea8Slogwangthe application. Constructor functions ensure that global variables are properly
860a9643ea8Slogwanginitialized. Other than global variables each scheduler is initialized
861a9643ea8Slogwangindependently the first time that an L-thread is created by a particular EAL
862a9643ea8Slogwangthread.
863a9643ea8Slogwang
864a9643ea8SlogwangIf the schedulers are to be run as isolated and independent schedulers, with
865a9643ea8Slogwangno intention that L-threads running on different schedulers will migrate between
866a9643ea8Slogwangschedulers or synchronize with L-threads running on other schedulers, then
867a9643ea8Slogwanginitialization consists simply of creating an L-thread, and then running the
868a9643ea8SlogwangL-thread scheduler.
869a9643ea8Slogwang
870a9643ea8SlogwangIf there will be interaction between L-threads running on different schedulers,
871a9643ea8Slogwangthen it is important that the starting of schedulers on different EAL threads
872a9643ea8Slogwangis synchronized.
873a9643ea8Slogwang
To achieve this an additional initialization step is necessary: set the number
of schedulers by calling the API function
``lthread_num_schedulers_set(n)``, where ``n`` is the number of EAL threads
that will run L-thread schedulers. Setting the number of schedulers to a
number greater than 0 will cause all schedulers to wait until the others have
started before beginning to schedule L-threads.
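
The synchronized start described above behaves like a simple gate. The sketch
below is a hypothetical standalone model of that behavior using C11 atomics.
It is not the L-thread implementation, and ``struct start_gate`` and its
helpers are names invented for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a start gate shared by all schedulers. */
struct start_gate {
    atomic_int expected; /* number of schedulers to wait for; 0 disables */
    atomic_int started;  /* schedulers that have entered their run loop */
};

/* Analogous in spirit to setting the number of schedulers to n. */
void start_gate_init(struct start_gate *g, int n)
{
    atomic_init(&g->expected, n);
    atomic_init(&g->started, 0);
}

/* Called once by each scheduler as it starts. */
void start_gate_arrive(struct start_gate *g)
{
    atomic_fetch_add(&g->started, 1);
}

/* True once the calling scheduler may begin scheduling L-threads. */
bool start_gate_open(struct start_gate *g)
{
    int expected = atomic_load(&g->expected);

    return expected == 0 || atomic_load(&g->started) >= expected;
}
```

In this model each scheduler would call ``start_gate_arrive()`` on entry and
then wait until ``start_gate_open()`` returns true. With the count left at 0
the gate is always open, matching the isolated scheduler case.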
880a9643ea8Slogwang
The L-thread scheduler is started by calling the function ``lthread_run()``,
which should be called from the EAL thread and thus becomes the main loop of
the EAL thread.
884a9643ea8Slogwang
The function ``lthread_run()`` will not return until all threads running on
the scheduler have exited, and the scheduler has been explicitly stopped by
calling ``lthread_scheduler_shutdown(lcore)`` or
``lthread_scheduler_shutdown_all()``.
889a9643ea8Slogwang
All these functions do is tell the scheduler that it can exit when there are no
longer any running L-threads. Neither function forces any running L-thread to
terminate. Any desired application shutdown behavior must be designed and
built into the application to ensure that L-threads complete in a timely
manner.
895a9643ea8Slogwang
**Important Note:** It is assumed when the scheduler exits that the application
is terminating for good. The scheduler does not free resources before exiting,
and running the scheduler a subsequent time will result in undefined behavior.
899a9643ea8Slogwang
900a9643ea8Slogwang
901a9643ea8Slogwang.. _porting_legacy_code_to_run_on_lthreads:
902a9643ea8Slogwang
903a9643ea8SlogwangPorting legacy code to run on L-threads
904a9643ea8Slogwang~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
905a9643ea8Slogwang
906a9643ea8SlogwangLegacy code originally written for a pthread environment may be ported to
907a9643ea8SlogwangL-threads if the considerations about differences in scheduling policy, and
908a9643ea8Slogwangconstraints discussed in the previous sections can be accommodated.
909a9643ea8Slogwang
910a9643ea8SlogwangThis section looks in more detail at some of the issues that may have to be
911a9643ea8Slogwangresolved when porting code.
912a9643ea8Slogwang
913a9643ea8Slogwang
914a9643ea8Slogwang.. _pthread_API_compatibility:
915a9643ea8Slogwang
916a9643ea8Slogwangpthread API compatibility
917a9643ea8Slogwang^^^^^^^^^^^^^^^^^^^^^^^^^
918a9643ea8Slogwang
The first step is to establish exactly which pthread APIs the legacy
application uses, and to understand the requirements of those APIs. If there
are corresponding L-thread APIs, and the application uses only the default
pthread functionality, then, notwithstanding the other issues discussed
here, it should be feasible to run the application with L-threads. If the
legacy code modifies the default behavior using attributes then it may be
necessary to make some adjustments to eliminate those requirements.
926a9643ea8Slogwang
927a9643ea8Slogwang
928a9643ea8Slogwang.. _blocking_system_calls:
929a9643ea8Slogwang
930a9643ea8SlogwangBlocking system API calls
931a9643ea8Slogwang^^^^^^^^^^^^^^^^^^^^^^^^^
932a9643ea8Slogwang
It is important to understand what other system services the application may be
using, bearing in mind that in a cooperatively scheduled environment a thread
cannot block without stalling the scheduler and with it all other cooperative
threads. Any kind of blocking system call, for example file or socket IO, is a
potential problem. A good tool to analyze the application for this purpose is
the ``strace`` utility.
939a9643ea8Slogwang
There are many strategies to resolve these kinds of issues, each with its
merits. Possible solutions include:
942a9643ea8Slogwang
943a9643ea8Slogwang* Adopting a polled mode of the system API concerned (if available).
944a9643ea8Slogwang
945a9643ea8Slogwang* Arranging for another core to perform the function and synchronizing with
946a9643ea8Slogwang  that core via constructs that will not block the L-thread.
947a9643ea8Slogwang
948a9643ea8Slogwang* Affinitizing the thread to another scheduler devoted (as a matter of policy)
949a9643ea8Slogwang  to handling threads wishing to make blocking calls, and then back again when
950a9643ea8Slogwang  finished.
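
As an illustration of the first option, the sketch below puts a file
descriptor into non-blocking mode and makes a single read attempt that never
stalls the caller. It uses only standard POSIX calls. The helper names
``set_nonblocking`` and ``poll_read`` are hypothetical, and in an L-thread
the caller would be expected to yield or sleep before retrying when no data
is ready.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Switch a descriptor to non-blocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * One polling attempt on a non-blocking descriptor.
 * Returns the number of bytes read, 0 on end of file, -EAGAIN when no data
 * is ready yet (the caller should yield and retry), or another negative
 * errno value on failure.
 */
ssize_t poll_read(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);

    if (n >= 0)
        return n;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return -EAGAIN;
    return -errno;
}
```

Because ``poll_read()`` returns immediately, the scheduler keeps running other
L-threads while the data is awaited. The cost is extra ready queue backlog
if many threads poll in this way.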
951a9643ea8Slogwang
952a9643ea8Slogwang
953a9643ea8Slogwang.. _porting_locks_and_spinlocks:
954a9643ea8Slogwang
955a9643ea8SlogwangLocks and spinlocks
956a9643ea8Slogwang^^^^^^^^^^^^^^^^^^^
957a9643ea8Slogwang
958a9643ea8SlogwangLocks and spinlocks are another source of blocking behavior that for the same
959a9643ea8Slogwangreasons as system calls will need to be addressed.
960a9643ea8Slogwang
If the application design ensures that the contending L-threads will always
run on the same scheduler then it is probably safe to remove locks and spin
locks completely.
964a9643ea8Slogwang
The only exception to the above rule is if for some reason the
code performs any kind of context switch whilst holding the lock
(e.g. yield, sleep, or block on a different lock, or on a condition variable).
This will need to be determined before deciding to eliminate a lock.
969a9643ea8Slogwang
970a9643ea8SlogwangIf a lock cannot be eliminated then an L-thread mutex can be substituted for
971a9643ea8Slogwangeither kind of lock.
972a9643ea8Slogwang
An L-thread blocking on an L-thread mutex will be suspended and will cause
another ready L-thread to be resumed, thus not blocking the scheduler. When
default behavior is required, an L-thread mutex can be used as a direct
replacement for a pthread mutex lock.
977a9643ea8Slogwang
978a9643ea8SlogwangSpin locks are typically used when lock contention is likely to be rare and
979a9643ea8Slogwangwhere the period during which the lock may be held is relatively short.
980a9643ea8SlogwangWhen the contending L-threads are running on the same scheduler then an
981a9643ea8SlogwangL-thread blocking on a spin lock will enter an infinite loop stopping the
982a9643ea8Slogwangscheduler completely (see :ref:`porting_infinite_loops` below).
983a9643ea8Slogwang
984a9643ea8SlogwangIf the application design ensures that contending L-threads will always run
985a9643ea8Slogwangon different schedulers then it might be reasonable to leave a short spin lock
986a9643ea8Slogwangthat rarely experiences contention in place.
987a9643ea8Slogwang
988a9643ea8SlogwangIf after all considerations it appears that a spin lock can neither be
989a9643ea8Slogwangeliminated completely, replaced with an L-thread mutex, or left in place as
990a9643ea8Slogwangis, then an alternative is to loop on a flag, with a call to
991a9643ea8Slogwang``lthread_yield()`` inside the loop (n.b. if the contending L-threads might
992a9643ea8Slogwangever run on different schedulers the flag will need to be manipulated
993a9643ea8Slogwangatomically).
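The flag-and-yield loop described above can be sketched as follows. This is a
minimal, standalone illustration and not code from the L-thread library: the
``yield_lock`` type, the ``yield_fn`` callback and the ``demo_*`` helpers are
hypothetical names. In a real port the callback would simply call
``lthread_yield()``. The flag is a C11 ``atomic_flag`` so that the sketch stays
valid if the contending L-threads run on different schedulers.

```c
#include <stdatomic.h>

/* Hypothetical lock type: a single flag that is set while the lock is held. */
struct yield_lock {
    atomic_flag busy;
};

/* Stand-in for lthread_yield(); ctx exists only so the demo can count calls. */
typedef void (*yield_fn)(void *ctx);

static void yield_lock_acquire(struct yield_lock *lock, yield_fn yield, void *ctx)
{
    /* test-and-set returns the previous value: true means the lock is held */
    while (atomic_flag_test_and_set_explicit(&lock->busy, memory_order_acquire))
        yield(ctx); /* let other ready threads run instead of spinning */
}

static void yield_lock_release(struct yield_lock *lock)
{
    atomic_flag_clear_explicit(&lock->busy, memory_order_release);
}

/* Demo scaffolding: the "yield" counts itself and releases the lock, which
 * models the holder running and unlocking while the waiter is descheduled. */
struct demo_ctx {
    struct yield_lock *lock;
    int yields;
};

static void demo_yield(void *ctx)
{
    struct demo_ctx *d = ctx;

    d->yields++;
    yield_lock_release(d->lock);
}

static int demo_contended_acquire(void)
{
    struct yield_lock lock = { ATOMIC_FLAG_INIT };
    struct demo_ctx d = { &lock, 0 };

    yield_lock_acquire(&lock, demo_yield, &d); /* uncontended: no yield */
    yield_lock_acquire(&lock, demo_yield, &d); /* contended: yields once */
    yield_lock_release(&lock);
    return d.yields;
}
```

Every failed attempt costs one trip through the scheduler, which is why the
waiting L-thread keeps reappearing on the ready queue while the lock is held.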

Spinning and yielding is the least preferred solution since it introduces
ready queue backlog (see also :ref:`ready_queue_backlog`).


.. _porting_sleeps_and_delays:

Sleeps and delays
^^^^^^^^^^^^^^^^^

Yet another kind of blocking behavior (albeit momentary) is the use of delay
functions like ``sleep()``, ``usleep()``, ``nanosleep()`` etc. All will have
the consequence of stalling the L-thread scheduler, and unless the delay is
very short (e.g. a very short nanosleep) calls to these functions will need to
be eliminated.

The simplest mitigation strategy is to use the L-thread sleep API functions,
of which two variants exist, ``lthread_sleep()`` and ``lthread_sleep_clks()``.
These functions start an rte_timer against the L-thread, suspend the L-thread
and cause another ready L-thread to be resumed. The suspended L-thread is
resumed when the rte_timer matures.
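When replacing a legacy ``usleep()`` call, the delay usually has to be
converted to the unit expected by the chosen sleep variant. The helpers below
are a hypothetical sketch. They assume that ``lthread_sleep()`` takes
nanoseconds and that ``lthread_sleep_clks()`` takes timer cycles, which should
be checked against the L-thread API header. ``timer_hz`` is assumed to be the
timer frequency in Hz, for example as reported by ``rte_get_timer_hz()``.

```c
#include <stdint.h>

#define NS_PER_SEC UINT64_C(1000000000)

/* Convert a usleep()-style argument (microseconds) to nanoseconds. */
static uint64_t usecs_to_nsecs(uint64_t usecs)
{
    return usecs * UINT64_C(1000);
}

/* Convert nanoseconds to timer cycles for a timer running at timer_hz.
 * Whole seconds and the remainder are converted separately so that the
 * intermediate product stays within 64 bits for realistic frequencies. */
static uint64_t nsecs_to_clks(uint64_t nsecs, uint64_t timer_hz)
{
    uint64_t secs = nsecs / NS_PER_SEC;
    uint64_t rem = nsecs % NS_PER_SEC;

    return secs * timer_hz + (rem * timer_hz) / NS_PER_SEC;
}
```

Under these assumptions a legacy ``usleep(100)`` would become
``lthread_sleep(usecs_to_nsecs(100))``.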


.. _porting_infinite_loops:

Infinite loops
^^^^^^^^^^^^^^

Some applications have threads with loops that contain no inherent
rescheduling opportunity, and rely solely on the OS time slicing to share
the CPU. In a cooperative environment this will stop everything dead. These
kinds of loops are not hard to identify: in a debug session you will find the
debugger is always stopping in the same loop.

The simplest solution to this kind of problem is to insert an explicit
``lthread_yield()`` or ``lthread_sleep()`` into the loop. Another solution
might be to include the function performed by the loop into the execution path
of some other loop that does in fact yield, if this is possible.
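A minimal sketch of the first solution is shown below. It is illustrative
only: ``poll_with_yield()`` and ``counting_yield()`` are hypothetical names, and
in a ported application the callback would be ``lthread_yield()``. Yielding
every ``yield_every`` iterations rather than on every pass is one possible way
to limit the number of context switches.

```c
#include <stdint.h>

/* Stand-in for lthread_yield(). */
typedef void (*yield_fn)(void);

static unsigned int yield_calls;

/* Demo yield: only counts how often the loop gave up the CPU. */
static void counting_yield(void)
{
    yield_calls++;
}

/* Run a bounded work loop, yielding every yield_every iterations.
 * yield_every must be non-zero. Returns the accumulated result. */
static uint64_t poll_with_yield(uint64_t iterations, uint64_t yield_every,
                                yield_fn yield)
{
    uint64_t work = 0;

    for (uint64_t i = 1; i <= iterations; i++) {
        work += i; /* placeholder for the real work of the loop */
        if (i % yield_every == 0)
            yield(); /* explicit rescheduling opportunity */
    }
    return work;
}
```

For example, ``poll_with_yield(10, 4, counting_yield)`` performs ten iterations
and yields after the fourth and the eighth.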


.. _porting_thread_local_storage:

Thread local storage
^^^^^^^^^^^^^^^^^^^^

If the application uses thread local storage, the use case should be
studied carefully.

In a legacy pthread application the ``__thread`` specifier, the pthread
set/get specific APIs, or both may have been used to define storage local to a
pthread.

In some applications it may be a reasonable assumption that the data could,
or in fact most likely should, be placed in L-thread local storage.

If the application (like many DPDK applications) has assumed a certain
relationship between a pthread and the CPU to which it is affinitized, there
is a risk that thread local storage may have been used to save some data items
that are logically associated with the CPU, and other items which relate to
application context for the thread. Only a good understanding of the
application will reveal such cases.

If the application requires that an L-thread be able to move between
schedulers then care should be taken to separate these kinds of data into per
lcore and per L-thread storage. In this way a migrating thread will bring with
it the local data it needs, and pick up the new logical core specific values
from pthread local storage at its new home.
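The separation can be pictured with the standalone sketch below. All names are
hypothetical, and a plain array indexed by scheduler stands in for per lcore
storage. In a real port the per scheduler data would live in pthread or per
lcore local storage, and the task context in L-thread local storage.

```c
#define MAX_SCHEDS 4

/* Data that belongs to the scheduler (lcore), whatever task runs there. */
struct per_sched_data {
    unsigned long rx_count;
};

/* Data that belongs to the task and must travel with it. */
struct task_ctx {
    int flow_id;
    unsigned long packets_seen;
};

struct task {
    int sched_id;        /* scheduler the task currently runs on */
    struct task_ctx ctx; /* carried along on migration */
};

static struct per_sched_data sched_data[MAX_SCHEDS];

/* Move a task to another scheduler; its own context is untouched. */
static void task_migrate(struct task *t, int new_sched)
{
    t->sched_id = new_sched;
}

static void task_handle_packet(struct task *t)
{
    t->ctx.packets_seen++;              /* follows the task */
    sched_data[t->sched_id].rx_count++; /* stays with the current core */
}
```

After a migration the task keeps its own counters, while the per scheduler
counters it updates are those of its new home.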


.. _pthread_shim:

Pthread shim
~~~~~~~~~~~~

A convenient way to get something working with legacy code can be to use a
shim that adapts pthread API calls to the corresponding L-thread ones.
This approach will not mitigate any of the porting considerations mentioned
in the previous sections, but it will reduce the amount of code churn that
would otherwise have been involved. It is a reasonable approach to evaluate
L-threads, before investing effort in porting to the native L-thread APIs.


Overview
^^^^^^^^
The L-thread subsystem includes an example pthread shim. This is a partial
implementation but does contain the API stubs needed to get basic applications
running. There is a simple "hello world" application that demonstrates the
use of the pthread shim.

A subtlety of working with a shim is that the application will still need
to make use of the genuine pthread library functions, at the very least in
order to create the EAL threads in which the L-thread schedulers will run.
This is the case with DPDK initialization, and exit.

To deal with the initialization and shutdown scenarios, the shim is capable of
switching on or off its adaptor functionality. An application can control this
behavior by calling the function ``pt_override_set()``. The default state
is disabled.

The pthread shim uses the dynamic linker loader and saves the loaded addresses
of the genuine pthread API functions in an internal table. When the shim
functionality is enabled it performs the adaptor function; when disabled it
invokes the genuine pthread function.
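The dispatch idea can be illustrated with the simplified sketch below. It is
not the shim's code: every name is hypothetical, and the "genuine" function is
an ordinary C function standing in for an address that the real shim would
obtain from the dynamic linker.

```c
enum { CALL_GENUINE = 1, CALL_ADAPTOR = 2 };

typedef int (*yield_impl)(void);

/* Stand-in for the genuine library function found at load time. */
static int genuine_yield(void)
{
    return CALL_GENUINE;
}

/* Stand-in for the adaptor that would call the L-thread equivalent. */
static int adaptor_yield(void)
{
    return CALL_ADAPTOR;
}

/* Internal table of genuine entry points (one entry in this sketch). */
static struct {
    yield_impl genuine;
} real_funcs = { genuine_yield };

static int override_enabled; /* zero initialized: disabled by default */

/* Hypothetical counterpart of the override switch described above. */
static void shim_override_set(int state)
{
    override_enabled = state;
}

/* The stub the application links against. */
static int shim_yield(void)
{
    if (override_enabled)
        return adaptor_yield();
    return real_funcs.genuine();
}
```

An application following this pattern would leave the override disabled during
initialization, enable it once the schedulers are running, and disable it again
before shutdown.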

The function ``pthread_exit()`` has additional special handling. The standard
system header file pthread.h declares ``pthread_exit()`` with the ``noreturn``
attribute (the equivalent of ``__rte_noreturn`` in DPDK code). This is an
optimization that is possible because the pthread is terminating, and it
enables the compiler to omit the normal handling of stack and protection of
registers since the function is not expected to return, and in fact the thread
is being destroyed. These optimizations are applied in both the callee and the
caller of the ``pthread_exit()`` function.

In our cooperative scheduling environment this behavior is inadmissible. The
pthread is the L-thread scheduler thread, and, although an L-thread is
terminating, there must be a return to the scheduler in order that the system
can continue to run. Further, returning from a function with attribute
``noreturn`` is invalid and may result in undefined behavior.

The solution is to redefine the ``pthread_exit`` function with a macro,
causing it to be mapped to a stub function in the shim that does not have the
``noreturn`` attribute. This macro is defined in the file
``pthread_shim.h``. The stub function is otherwise no different from any of
the other stub functions in the shim, and will switch between the real
``pthread_exit()`` function or the ``lthread_exit()`` function as
required. The only difference is that the mapping to the stub is made by macro
substitution.
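The macro substitution can be sketched as follows. This shows the general
technique only. The stub name and the exact shape of the macro are assumptions
and may differ from what ``pthread_shim.h`` actually contains.

```c
#include <pthread.h> /* declares the genuine, noreturn pthread_exit() */
#include <stddef.h>

static int exit_stub_calls;

/* Hypothetical stub: it carries no noreturn attribute, so returning from it
 * is legal. A real stub would choose between the genuine pthread_exit() and
 * lthread_exit() here. */
static void shim_exit_stub(void *retval)
{
    (void)retval;
    exit_stub_calls++;
}

/* From here on, calls written as pthread_exit(v) reach the stub instead.
 * The explicit return keeps the calling thread function well formed. */
#define pthread_exit(v) do { shim_exit_stub((v)); return NULL; } while (0)

/* Legacy style thread body, compiled unchanged against the macro. */
static void *legacy_thread_body(void *arg)
{
    pthread_exit(arg);
}
```

Because the substitution happens in the preprocessor, only code compiled with
the macro in scope is redirected.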

A consequence of this is that the file ``pthread_shim.h`` must be included in
legacy code wishing to make use of the shim. It also means that dynamic
linkage of a pre-compiled binary that did not include ``pthread_shim.h`` is not
supported.

Given the requirements for porting legacy code outlined in
:ref:`porting_legacy_code_to_run_on_lthreads` most applications will require at
least some minimal adjustment and recompilation to run on L-threads so
pre-compiled binaries are unlikely to be met in practice.

In summary the shim approach adds some overhead but can be a useful tool to help
establish the feasibility of a code reuse project. It is also a fairly
straightforward task to extend the shim if necessary.

**Note:** Bearing in mind the preceding discussions about the impact of making
blocking calls, switching the shim in and out on the fly to invoke any
pthread API that might block is something that should typically be avoided.


Building and running the pthread shim
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The shim example application is located in the ``performance-thread`` folder
of the sample applications.

To build and run the pthread shim example:

#. Build the application:

   To compile the sample application see :doc:`compiling`.

#. Run the pthread_shim example:

   .. code-block:: console

       dpdk-pthread-shim -c core_mask -n number_of_channels

.. _lthread_diagnostics:

L-thread Diagnostics
~~~~~~~~~~~~~~~~~~~~

When debugging you must take account of the fact that the L-threads are run in
a single pthread. The current scheduler is defined by
``RTE_PER_LCORE(this_sched)``, and the current lthread is stored at
``RTE_PER_LCORE(this_sched)->current_lthread``. Thus on a breakpoint in a GDB
session the current lthread can be obtained by displaying the pthread local
variable ``per_lcore_this_sched->current_lthread``.

Another useful diagnostic feature is the possibility to trace significant
events in the life of an L-thread. This feature is enabled by changing the
value of ``LTHREAD_DIAG`` from 0 to 1 in the file ``lthread_diag_api.h``.

Tracing of events can be individually masked, and the mask may be programmed
at run time. An unmasked event results in a callback that provides information
about the event. The default callback simply prints trace information. The
default mask is 0 (all events off). The mask can be modified by calling the
function ``lthread_diagnostic_set_mask()``.

It is possible to register a user callback function to implement more
sophisticated diagnostic functions.
Object creation events (lthread, mutex, and condition variable) accept, and
store in the created object, a user supplied reference value returned by the
callback function.

The lthread reference value is passed back in all subsequent event callbacks,
and APIs are provided to retrieve the reference value from mutexes and
condition variables. This enables a user to monitor, count, or filter for
specific events on specific objects, for example to monitor for a specific
thread signaling a specific condition variable, or to monitor all timer
events. The possibilities and combinations are endless.

The callback function can be set by calling the function
``lthread_diagnostic_enable()`` supplying a callback function pointer and an
event mask.
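The interplay of event mask, callback and stored reference value can be
modelled with the standalone sketch below. It does not use the L-thread
diagnostic API. The event names, the callback signature and the ``diag_*``
helpers are all hypothetical and only mirror the behavior described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event identifiers, one mask bit per event. */
enum diag_event { EV_CREATE = 0, EV_RESUME = 1, EV_TIMER = 2 };
#define EV_BIT(ev) (UINT64_C(1) << (ev))

/* The callback receives the event and the object's stored reference. */
typedef uint64_t (*diag_cb)(enum diag_event ev, uint64_t ref);

static diag_cb user_cb;
static uint64_t event_mask; /* 0: all events off */

static void diag_enable(diag_cb cb, uint64_t mask)
{
    user_cb = cb;
    event_mask = mask;
}

struct traced_obj {
    uint64_t diag_ref; /* user reference saved at creation time */
};

static void diag_emit(struct traced_obj *obj, enum diag_event ev)
{
    if (user_cb == NULL || (event_mask & EV_BIT(ev)) == 0)
        return; /* masked events cost only this test */

    uint64_t ret = user_cb(ev, obj->diag_ref);

    if (ev == EV_CREATE)
        obj->diag_ref = ret; /* keep the reference for later events */
}

static unsigned int timer_events_for_ref_42;

/* Demo callback: tags new objects with 42, then counts their timer events. */
static uint64_t demo_cb(enum diag_event ev, uint64_t ref)
{
    if (ev == EV_CREATE)
        return 42;
    if (ev == EV_TIMER && ref == 42)
        timer_events_for_ref_42++;
    return 0;
}
```

With the create and timer bits set in the mask, a resume event on the same
object is filtered out before the callback is reached.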

Setting ``LTHREAD_DIAG`` also enables counting of statistics about cache and
queue usage, and these statistics can be displayed by calling the function
``lthread_diag_stats_display()``. This function also performs a consistency
check on the caches and queues. The function should only be called from the
main EAL thread after all worker threads have stopped and returned to the C
main program, otherwise the consistency check will fail.