..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2014 Intel Corporation.

Writing Efficient Code
======================

This chapter provides some tips for developing efficient code using the DPDK.
For additional and more general information,
please refer to the *Intel® 64 and IA-32 Architectures Optimization Reference Manual*,
which is a valuable reference for writing efficient code.

Memory
------

This section describes some key memory considerations when developing applications in the DPDK environment.

Memory Copy: Do not Use libc in the Data Plane
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many libc functions are available in the DPDK via the Linux* application environment.
This can ease the porting of applications and the development of the configuration plane.
However, many of these functions are not designed for performance.
Functions such as memcpy() or strcpy() should not be used in the data plane.
To copy small structures, prefer a simpler technique that the compiler can optimize, such as direct structure assignment.
Refer to the *VTune™ Performance Analyzer Essentials* publication from Intel Press for recommendations.

For specific functions that are called often,
it is also a good idea to provide a self-made optimized function, which should be declared as static inline.

The DPDK API provides an optimized rte_memcpy() function.
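
As a minimal illustration (plain C, independent of DPDK; the structure and field names are made up for the example), a small structure can be copied by direct assignment, which the compiler can typically expand into a few register moves with no libc call:

.. code-block:: c

    #include <stdint.h>
    #include <stdio.h>

    /* A small fixed-size structure, e.g. per-packet metadata. */
    struct meta {
        uint32_t port;
        uint32_t queue;
        uint64_t timestamp;
    };

    int main(void)
    {
        struct meta src = { .port = 1, .queue = 3, .timestamp = 42 };
        struct meta dst;

        /* Plain structure assignment: the compiler inlines this copy;
         * no call to memcpy() is required for a structure this small. */
        dst = src;

        printf("%u %u %lu\n", dst.port, dst.queue,
               (unsigned long)dst.timestamp);
        return 0;
    }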

Memory Allocation
~~~~~~~~~~~~~~~~~

Other libc functions, such as malloc(), provide a flexible way to allocate and free memory.
In some cases, using dynamic allocation is necessary,
but it is not advisable to use malloc-like functions in the data plane because
managing a fragmented heap can be costly and the allocator may not be optimized for parallel allocation.

If you really need dynamic allocation in the data plane, it is better to use a memory pool of fixed-size objects.
This API is provided by librte_mempool.
This data structure provides several services that increase performance, such as memory alignment of objects,
lockless access to objects, NUMA awareness, bulk get/put and per-lcore caches.
The rte_malloc() function uses a similar concept to mempools.
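
As a sketch of mempool usage (it assumes the EAL has already been initialized with rte_eal_init(); the pool name and sizes are illustrative):

.. code-block:: c

    #include <rte_mempool.h>
    #include <rte_lcore.h>

    /* Create a pool of fixed-size objects with a per-lcore cache,
     * on the caller's NUMA socket. A size of 2^n - 1 objects is
     * commonly recommended. */
    struct rte_mempool *mp = rte_mempool_create("msg_pool",
            8191,   /* number of objects */
            256,    /* object size in bytes */
            256,    /* per-lcore cache size */
            0, NULL, NULL, NULL, NULL,
            rte_socket_id(), 0);
    void *obj;

    if (mp != NULL && rte_mempool_get(mp, &obj) == 0) {
        /* ... use obj in the data plane ... */
        rte_mempool_put(mp, obj);
    }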

Concurrent Access to the Same Memory Area
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Read-Write (RW) access operations by several lcores to the same memory area can generate a lot of data cache misses,
which are very costly.
It is often possible to use per-lcore variables instead, for example, in the case of statistics.
There are at least two solutions for this:

*   Use RTE_PER_LCORE variables. Note that in this case, data on lcore X is not available to lcore Y.

*   Use a table of structures (one per lcore). In this case, each structure must be cache-aligned.

Read-mostly variables can be shared among lcores without performance losses if there are no RW variables in the same cache line.
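
The table-of-structures approach can be sketched in plain C as follows (in a real DPDK application, MAX_LCORES would be RTE_MAX_LCORE and the slot index would come from rte_lcore_id(); both are stubbed here):

.. code-block:: c

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_LCORES 4    /* stand-in for RTE_MAX_LCORE */
    #define CACHE_LINE 64

    /* One statistics slot per lcore, padded to a full cache line so
     * that two cores never write to the same line (no false sharing). */
    struct lcore_stats {
        uint64_t rx_pkts;
        uint64_t tx_pkts;
    } __attribute__((aligned(CACHE_LINE)));

    static struct lcore_stats stats[MAX_LCORES];

    int main(void)
    {
        /* Each lcore updates only its own slot; shown here serially. */
        stats[0].rx_pkts += 10;
        stats[1].rx_pkts += 20;

        uint64_t total = 0;
        for (int i = 0; i < MAX_LCORES; i++)
            total += stats[i].rx_pkts;
        printf("total rx: %lu\n", (unsigned long)total);
        return 0;
    }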

NUMA
~~~~

On a NUMA system, it is preferable to access local memory since remote memory access is slower.
In the DPDK, the memzone, ring, rte_malloc and mempool APIs provide a way to create a pool on a specific socket.

Sometimes, it can be a good idea to duplicate data to optimize speed.
For read-mostly variables that are often accessed,
it should not be a problem to keep them on one socket only, since the data will be present in cache.
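
A socket-local allocation can be sketched with rte_malloc_socket() (``struct flow_table`` is a hypothetical application structure; the allocation is placed on the socket of the calling lcore):

.. code-block:: c

    #include <rte_malloc.h>
    #include <rte_lcore.h>

    struct flow_table { uint64_t key[1024]; };  /* illustrative */

    /* Allocate from the memory of the NUMA socket the caller runs on. */
    struct flow_table *tbl = rte_malloc_socket("flow_table",
            sizeof(*tbl), RTE_CACHE_LINE_SIZE, rte_socket_id());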

Distribution Across Memory Channels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Modern memory controllers have several memory channels that can load or store data in parallel.
Depending on the memory controller and its configuration,
the number of channels and the way the memory is distributed across the channels varies.
Each channel has a bandwidth limit,
meaning that if all memory access operations are done on the first channel only, there is a potential bottleneck.

By default, the :ref:`Mempool Library <Mempool_Library>` spreads the addresses of objects among memory channels.

Locking memory pages
~~~~~~~~~~~~~~~~~~~~

The underlying operating system is allowed to load/unload memory pages at its own discretion.
These page loads can impact performance, as the process is put on hold while the kernel fetches the pages.

To avoid this, pre-load the pages and lock them into memory with the ``mlockall()`` call:

.. code-block:: c

    if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
        RTE_LOG(NOTICE, USER1, "mlockall() failed with error \"%s\"\n",
                strerror(errno));
    }

Communication Between lcores
----------------------------

To provide message-based communication between lcores,
it is advised to use the DPDK ring API, which provides a lockless ring implementation.

The ring supports bulk and burst access,
meaning that it is possible to read several elements from the ring with only one costly atomic operation
(see :doc:`ring_lib`).
Performance is greatly improved when using bulk access operations.

The code algorithm that dequeues messages may be something similar to the following:

.. code-block:: c

    #define MAX_BULK 32

    while (1) {
        /* Process as many elements as can be dequeued. */
        count = rte_ring_dequeue_burst(ring, obj_table, MAX_BULK, NULL);
        if (unlikely(count == 0))
            continue;

        my_process_bulk(obj_table, count);
    }

PMD Driver
----------

The DPDK Poll Mode Driver (PMD) is also able to work in bulk/burst mode,
allowing the factorization of some code for each call in the send or receive function.

Avoid partial writes.
When PCI devices write to system memory through DMA,
it costs less if the write operation is on a full cache line as opposed to part of it.
In the PMD code, actions have been taken to avoid partial writes as much as possible.
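
A typical bulk receive/transmit loop body can be sketched as follows (``port_id`` and queue 0 are assumed to be already configured; the burst size is illustrative):

.. code-block:: c

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    struct rte_mbuf *bufs[BURST_SIZE];
    uint16_t nb_rx, nb_tx;

    /* Receive up to BURST_SIZE packets in one call, amortizing the
     * per-call overhead across the whole burst. */
    nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);

    /* Transmit what was received; free anything the TX queue refused. */
    nb_tx = rte_eth_tx_burst(port_id, 0, bufs, nb_rx);
    for (uint16_t i = nb_tx; i < nb_rx; i++)
        rte_pktmbuf_free(bufs[i]);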

Lower Packet Latency
~~~~~~~~~~~~~~~~~~~~

Traditionally, there is a trade-off between throughput and latency.
An application can be tuned to achieve a high throughput,
but the end-to-end latency of an average packet will typically increase as a result.
Similarly, the application can be tuned to have, on average,
a low end-to-end latency, at the cost of lower throughput.

In order to achieve higher throughput,
the DPDK attempts to amortize the cost of processing each packet individually by processing packets in bursts.

Using the testpmd application as an example,
the burst size can be set on the command line to a value of 16 (also the default value).
This allows the application to request 16 packets at a time from the PMD.
The testpmd application then immediately attempts to transmit all the packets that were received,
in this case, all 16 packets.

The packets are not transmitted until the tail pointer is updated on the corresponding TX queue of the network port.
This behavior is desirable when tuning for high throughput because
the cost of tail pointer updates to both the RX and TX queues can be spread across 16 packets,
effectively hiding the relatively slow MMIO cost of writing to the PCIe* device.
However, this is not very desirable when tuning for low latency, because
the first packet that was received must also wait for another 15 packets to be received.
It cannot be transmitted until the other 15 packets have also been processed, because
the NIC will not know to transmit the packets until the TX tail pointer has been updated,
which is not done until all 16 packets have been processed for transmission.

To consistently achieve low latency, even under heavy system load,
the application developer should avoid processing packets in bunches.
The testpmd application can be configured from the command line to use a burst value of 1.
This allows a single packet to be processed at a time, providing lower latency,
but with the added cost of lower throughput.

Locks and Atomic Operations
---------------------------

This section describes some key considerations when using locks and atomic
operations in the DPDK environment.

Locks
~~~~~

On x86, atomic operations imply a lock prefix before the instruction,
causing the processor's LOCK# signal to be asserted during execution of the following instruction.
This has a big impact on performance in a multicore environment.

Performance can be improved by avoiding lock mechanisms in the data plane.
Locks can often be replaced by other solutions such as per-lcore variables.
Also, some locking techniques are more efficient than others.
For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.

Atomic Operations: Use C11 Atomic Builtins
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DPDK generic rte_atomic operations are implemented by __sync builtins. These
__sync builtins result in full barriers on aarch64, which are unnecessary
in many use cases. They can be replaced by __atomic builtins that conform to
the C11 memory model and provide finer memory order control.

Replacing the rte_atomic operations with __atomic builtins might therefore
improve performance on aarch64 machines.

Some typical optimization cases are listed below:

Atomicity
^^^^^^^^^

Some use cases require atomicity alone; the ordering of the memory operations
does not matter. For example, packet statistics counters need to be
incremented atomically but do not need any particular memory ordering.
So, RELAXED memory ordering is sufficient.
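
A relaxed atomic counter can be sketched in plain C with the __atomic builtins (the thread count and iteration count are arbitrary; only the loss-free increment matters):

.. code-block:: c

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t pkt_count;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++)
            /* Atomic increment with no ordering constraint: all that is
             * required is that no increment is lost. */
            __atomic_fetch_add(&pkt_count, 1, __ATOMIC_RELAXED);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("%lu\n", (unsigned long)pkt_count);
        return 0;
    }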

One-way Barrier
^^^^^^^^^^^^^^^

Some use cases allow for memory reordering in one direction while requiring
memory ordering in the other.

For example, the memory operations before the spinlock lock are allowed to
move into the critical section, but the memory operations in the critical section
are not allowed to move above the lock. In this case, the full memory barrier
in the compare-and-swap operation can be replaced with ACQUIRE memory order.
On the other hand, the memory operations after the spinlock unlock are allowed
to move into the critical section, but the memory operations in the critical
section are not allowed to move below the unlock. So the full barrier in the
store operation can use RELEASE memory order.
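
The lock/unlock pair above can be sketched in plain C with the __atomic builtins (single-threaded here, just to show the memory orders; a real spinlock would also back off while spinning):

.. code-block:: c

    #include <stdio.h>

    static int lock_var;    /* 0 = free, 1 = held */

    static void spin_lock(int *l)
    {
        int expected = 0;
        /* ACQUIRE on success: operations in the critical section may not
         * move above the lock; earlier operations may still sink into it. */
        while (!__atomic_compare_exchange_n(l, &expected, 1, 0,
                                            __ATOMIC_ACQUIRE,
                                            __ATOMIC_RELAXED))
            expected = 0;
    }

    static void spin_unlock(int *l)
    {
        /* RELEASE: operations in the critical section may not move
         * below the unlock. */
        __atomic_store_n(l, 0, __ATOMIC_RELEASE);
    }

    int main(void)
    {
        static int counter;

        spin_lock(&lock_var);
        counter++;              /* critical section */
        spin_unlock(&lock_var);

        printf("%d %d\n", counter, lock_var);
        return 0;
    }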

Reader-Writer Concurrency
^^^^^^^^^^^^^^^^^^^^^^^^^

Lock-free reader-writer concurrency is one of the common use cases in DPDK.

The payload, that is, the data that the writer wants to communicate to the reader,
can be written with RELAXED memory order. However, the guard variable should
be written with RELEASE memory order. This ensures that the store to the guard
variable is observable only after the store to the payload is observable.

Correspondingly, on the reader side, the guard variable should be read
with ACQUIRE memory order, while the payload
can be read with RELAXED memory order. This ensures that, if the store to the
guard variable is observable, the store to the payload is also observable.
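
The publish pattern above can be sketched in plain C (writer and reader run serially here to keep the example deterministic; in practice they would run on different lcores):

.. code-block:: c

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t payload;    /* data the writer publishes */
    static int ready;           /* guard variable */

    static void writer(void)
    {
        __atomic_store_n(&payload, 42, __ATOMIC_RELAXED);
        /* RELEASE: the payload store becomes observable before the
         * guard store. */
        __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);
    }

    static int reader(uint64_t *out)
    {
        /* ACQUIRE: if the guard is seen set, the payload store is
         * guaranteed to be visible as well. */
        if (__atomic_load_n(&ready, __ATOMIC_ACQUIRE)) {
            *out = __atomic_load_n(&payload, __ATOMIC_RELAXED);
            return 1;
        }
        return 0;
    }

    int main(void)
    {
        uint64_t v = 0;

        writer();
        if (reader(&v))
            printf("%lu\n", (unsigned long)v);
        return 0;
    }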

Coding Considerations
---------------------

Inline Functions
~~~~~~~~~~~~~~~~

Small functions can be declared as static inline in the header file.
This avoids the cost of a call instruction (and the associated context saving).
However, this technique is not always efficient; it depends on many factors including the compiler.
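
A minimal sketch of the technique, modeled on DPDK's own rte_align32pow2() helper (shown in one file for brevity; in a real project the function would live in a header so every translation unit can inline it):

.. code-block:: c

    #include <stdint.h>
    #include <stdio.h>

    /* Round up to the next power of two: cheap enough that a call
     * instruction would dominate its cost if it were not inlined. */
    static inline uint32_t align32pow2(uint32_t x)
    {
        x--;
        x |= x >> 1;
        x |= x >> 2;
        x |= x >> 4;
        x |= x >> 8;
        x |= x >> 16;
        return x + 1;
    }

    int main(void)
    {
        printf("%u %u\n", align32pow2(5), align32pow2(64));
        return 0;
    }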

Branch Prediction
~~~~~~~~~~~~~~~~~

The Intel® C/C++ Compiler (icc)/gcc built-in helper functions likely() and unlikely()
allow the developer to indicate whether a code branch is likely to be taken or not.
For instance:

.. code-block:: c

    if (likely(x > 1))
        do_stuff();

Setting the Target CPU Type
---------------------------

The DPDK supports CPU microarchitecture-specific optimizations by means of the RTE_MACHINE option.
The degree of optimization depends on the compiler's ability to optimize for a specific microarchitecture,
therefore it is preferable to use the latest compiler versions whenever possible.

If the compiler version does not support the specific feature set (for example, the Intel® AVX instruction set),
the build process gracefully degrades to whatever latest feature set is supported by the compiler.

Since the build and runtime targets may not be the same,
the resulting binary also contains a platform check that runs before the
main() function and checks if the current machine is suitable for running the binary.

Along with compiler optimizations,
a set of preprocessor defines is automatically added to the build process (regardless of the compiler version).
These defines correspond to the instruction sets that the target CPU should be able to support.