..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2014 Intel Corporation.

Writing Efficient Code
======================

This chapter provides some tips for developing efficient code using the DPDK.
For additional and more general information,
please refer to the *Intel® 64 and IA-32 Architectures Optimization Reference Manual*,
which is a valuable reference for writing efficient code.

Memory
------

This section describes some key memory considerations when developing applications in the DPDK environment.

Memory Copy: Do not Use libc in the Data Plane
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many libc functions are available in the DPDK, via the Linux* application environment.
This can ease the porting of applications and the development of the configuration plane.
However, many of these functions are not designed for performance.
Functions such as memcpy() or strcpy() should not be used in the data plane.
To copy small structures, the preference is for a simpler technique that the compiler can optimize.
Refer to the *VTune™ Performance Analyzer Essentials* publication from Intel Press for recommendations.

For specific functions that are called often,
it is also a good idea to provide a self-made optimized function, which should be declared as static inline.

The DPDK API provides an optimized rte_memcpy() function.

Memory Allocation
~~~~~~~~~~~~~~~~~

Other functions of libc, such as malloc(), provide a flexible way to allocate and free memory.
In some cases, using dynamic allocation is necessary,
but it is really not advised to use malloc-like functions in the data plane because
managing a fragmented heap can be costly and the allocator may not be optimized for parallel allocation.

If you really need dynamic allocation in the data plane, it is better to use a memory pool of fixed-size objects.
This API is provided by librte_mempool.
This data structure provides several services that increase performance, such as memory alignment of objects,
lockless access to objects, NUMA awareness, bulk get/put and per-lcore cache.
The rte_malloc() function uses a similar concept to mempools.

Concurrent Access to the Same Memory Area
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Read-Write (RW) access operations by several lcores to the same memory area can generate a lot of data cache misses,
which are very costly.
It is often possible to use per-lcore variables, for example, in the case of statistics.
There are at least two solutions for this:

* Use RTE_PER_LCORE variables (see the sketch after this list).
  Note that in this case, data on lcore X is not available to lcore Y.

* Use a table of structures (one per lcore). In this case, each structure must be cache-aligned.
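
As a brief illustration of the first approach, here is a minimal sketch of a
per-lcore statistics counter built on the RTE_PER_LCORE macros.
The counter name and the helper function are hypothetical;
only the macros themselves are DPDK API.

.. code-block:: c

    #include <stdint.h>

    #include <rte_per_lcore.h>

    /* One private copy of the counter exists per lcore (thread),
     * so no locking or atomic operations are needed to update it.
     */
    static RTE_DEFINE_PER_LCORE(uint64_t, rx_pkt_count); /* hypothetical name */

    static inline void
    count_rx_packets(unsigned int n)
    {
        /* RTE_PER_LCORE() accesses the calling lcore's own copy. */
        RTE_PER_LCORE(rx_pkt_count) += n;
    }

As noted above, such a counter cannot be read from another lcore;
if the statistics must be aggregated elsewhere,
the cache-aligned table of structures of the second approach is a better fit.
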
Read-mostly variables can be shared among lcores without performance losses if there are no RW variables in the same cache line.

NUMA
~~~~

On a NUMA system, it is preferable to access local memory since remote memory access is slower.
In the DPDK, the memzone, ring, rte_malloc and mempool APIs provide a way to create a pool on a specific socket.

Sometimes, it can be a good idea to duplicate data to optimize speed.
For read-mostly variables that are often accessed,
it should not be a problem to keep them in one socket only, since the data will be present in cache.

Distribution Across Memory Channels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Modern memory controllers have several memory channels that can load or store data in parallel.
Depending on the memory controller and its configuration,
the number of channels and the way the memory is distributed across the channels varies.
Each channel has a bandwidth limit,
meaning that if all memory access operations are done on the first channel only, there is a potential bottleneck.

By default, the :ref:`Mempool Library <Mempool_Library>` spreads the addresses of objects among memory channels.

Locking Memory Pages
~~~~~~~~~~~~~~~~~~~~

The underlying operating system is allowed to load/unload memory pages at its own discretion.
These page loads can impact performance, as the process is put on hold while the kernel fetches them.

To avoid this, pages can be pre-loaded and locked into memory with the ``mlockall()`` call:

.. code-block:: c

    if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
        RTE_LOG(NOTICE, USER1, "mlockall() failed with error \"%s\"\n",
                strerror(errno));
    }

Communication Between lcores
----------------------------

To provide message-based communication between lcores,
it is advised to use the DPDK ring API, which provides a lockless ring implementation.

The ring supports bulk and burst access,
meaning that it is possible to read several elements from the ring with only one costly atomic operation
(see :doc:`ring_lib`).
Performance is greatly improved when using bulk access operations.

The code that dequeues messages may look similar to the following:

.. code-block:: c

    #define MAX_BULK 32

    while (1) {
        /* Process as many elements as can be dequeued. */
        count = rte_ring_dequeue_burst(ring, obj_table, MAX_BULK, NULL);
        if (unlikely(count == 0))
            continue;

        my_process_bulk(obj_table, count);
    }
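
The enqueuing side can use burst operations in the same way,
paying the cost of a single atomic operation for a whole group of objects.
The following sketch makes the same assumptions as the example above;
my_handle_unsent() is hypothetical, like my_process_bulk().

.. code-block:: c

    unsigned int sent;

    /* Enqueue a burst of prepared objects in one operation. The return
     * value may be less than n if the ring is almost full.
     */
    sent = rte_ring_enqueue_burst(ring, obj_table, n, NULL);
    if (unlikely(sent < n))
        my_handle_unsent(obj_table + sent, n - sent);
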
PMD Driver
----------

The DPDK Poll Mode Driver (PMD) is also able to work in bulk/burst mode,
allowing the factorization of some code for each call in the send or receive function.

Avoid partial writes.
When PCI devices write to system memory through DMA,
it costs less if the write operation is on a full cache line as opposed to part of it.
In the PMD code, actions have been taken to avoid partial writes as much as possible.

Lower Packet Latency
~~~~~~~~~~~~~~~~~~~~

Traditionally, there is a trade-off between throughput and latency.
An application can be tuned to achieve a high throughput,
but the end-to-end latency of an average packet will typically increase as a result.
Similarly, the application can be tuned to have, on average,
a low end-to-end latency, at the cost of lower throughput.

In order to achieve higher throughput,
the DPDK attempts to amortize the cost of processing each packet individually by processing packets in bursts.

Using the testpmd application as an example,
the burst size can be set on the command line to a value of 16 (also the default value).
This allows the application to request 16 packets at a time from the PMD.
The testpmd application then immediately attempts to transmit all the packets that were received,
in this case, all 16 packets.

The packets are not transmitted until the tail pointer is updated on the corresponding TX queue of the network port.
This behavior is desirable when tuning for high throughput because
the cost of tail pointer updates to both the RX and TX queues can be spread across 16 packets,
effectively hiding the relatively slow MMIO cost of writing to the PCIe* device.
However, this is not very desirable when tuning for low latency because
the first packet that was received must also wait for another 15 packets to be received.
It cannot be transmitted until the other 15 packets have also been processed because
the NIC will not know to transmit the packets until the TX tail pointer has been updated,
which is not done until all 16 packets have been processed for transmission.

To consistently achieve low latency, even under heavy system load,
the application developer should avoid processing packets in bunches.
The testpmd application can be configured from the command line to use a burst value of 1.
This allows a single packet to be processed at a time, providing lower latency,
but with the added cost of lower throughput.

Locks and Atomic Operations
---------------------------

This section describes some key considerations when using locks and atomic
operations in the DPDK environment.

Locks
~~~~~

On x86, atomic operations imply a lock prefix before the instruction,
causing the processor's LOCK# signal to be asserted during execution of the following instruction.
This has a big impact on performance in a multicore environment.

Performance can often be improved by avoiding lock mechanisms in the data plane.
They can often be replaced by other solutions such as per-lcore variables.
Also, some locking techniques are more efficient than others.
For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.

Atomic Operations: Use C11 Atomic Builtins
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DPDK generic rte_atomic operations are implemented by __sync builtins. These
__sync builtins result in full barriers on aarch64, which are unnecessary
in many use cases. They can be replaced by __atomic builtins that conform to
the C11 memory model and provide finer memory order control.

Replacing the rte_atomic operations with __atomic builtins might therefore
improve performance on aarch64 machines.

Some typical optimization cases are listed below.

Atomicity
^^^^^^^^^

Some use cases require atomicity alone; the ordering of the memory operations
does not matter. For example, packet statistics counters need to be
incremented atomically but do not need any particular memory ordering,
so RELAXED memory ordering is sufficient.
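
As a minimal sketch, such a counter can be incremented with a RELAXED atomic
add; the counter name is hypothetical:

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical counter shared by several lcores. */
    static uint64_t pkt_drop_count;

    static inline void
    count_dropped(unsigned int n)
    {
        /* Atomic increment with no ordering constraints: surrounding
         * memory operations may be freely reordered around it.
         */
        __atomic_fetch_add(&pkt_drop_count, n, __ATOMIC_RELAXED);
    }
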
One-way Barrier
^^^^^^^^^^^^^^^

Some use cases allow for memory reordering in one direction while requiring
memory ordering in the other direction.

For example, the memory operations before the spinlock lock are allowed to
move into the critical section, but the memory operations in the critical section
are not allowed to move above the lock. In this case, the full memory barrier
in the compare-and-swap operation can be replaced with ACQUIRE memory order.
On the other hand, the memory operations after the spinlock unlock are allowed
to move into the critical section, but the memory operations in the critical
section are not allowed to move below the unlock. So the full barrier in the
store operation can use RELEASE memory order.
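
To illustrate the technique, the sketch below builds a toy spinlock directly
on the __atomic builtins. This is a sketch of the idea only, not the DPDK
rte_spinlock implementation; the type and function names are hypothetical.

.. code-block:: c

    #include <stdbool.h>

    typedef struct { int locked; } my_spinlock_t;

    static inline void
    my_spinlock_lock(my_spinlock_t *sl)
    {
        int exp = 0;

        /* ACQUIRE on success: operations in the critical section cannot
         * move above the lock, while earlier operations may still sink
         * into the critical section.
         */
        while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, false,
                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
            /* On failure, exp was overwritten with the observed value. */
            exp = 0;
        }
    }

    static inline void
    my_spinlock_unlock(my_spinlock_t *sl)
    {
        /* RELEASE: operations in the critical section cannot move below
         * the unlock, while later operations may hoist into it.
         */
        __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
    }
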
Reader-Writer Concurrency
^^^^^^^^^^^^^^^^^^^^^^^^^

Lock-free reader-writer concurrency is one of the common use cases in DPDK.

The payload, that is, the data that the writer wants to communicate to the reader,
can be written with RELAXED memory order. However, the guard variable should
be written with RELEASE memory order. This ensures that the store to the guard
variable is observable only after the store to the payload is observable.

Correspondingly, on the reader side, the guard variable should be read
with ACQUIRE memory order. The payload, that is, the data the writer communicated,
can be read with RELAXED memory order. This ensures that, if the store to the
guard variable is observable, the store to the payload is also observable.

Coding Considerations
---------------------

Inline Functions
~~~~~~~~~~~~~~~~

Small functions can be declared as static inline in the header file.
This avoids the cost of a call instruction (and the associated context saving).
However, this technique is not always efficient; it depends on many factors including the compiler.

Branch Prediction
~~~~~~~~~~~~~~~~~

The Intel® C/C++ Compiler (icc)/gcc built-in helper functions likely() and unlikely()
allow the developer to indicate if a code branch is likely to be taken or not.
For instance:

.. code-block:: c

    if (likely(x > 1))
        do_stuff();

Setting the Target CPU Type
---------------------------

The DPDK supports CPU microarchitecture-specific optimizations by means of the RTE_MACHINE option.
The degree of optimization depends on the compiler's ability to optimize for a specific microarchitecture,
therefore it is preferable to use the latest compiler versions whenever possible.

If the compiler version does not support the specific feature set (for example, the Intel® AVX instruction set),
the build process gracefully degrades to whatever latest feature set is supported by the compiler.

Since the build and runtime targets may not be the same,
the resulting binary also contains a platform check that runs before the
main() function and checks if the current machine is suitable for running the binary.

Along with compiler optimizations,
a set of preprocessor defines is automatically added to the build process (regardless of the compiler version).
These defines correspond to the instruction sets that the target CPU should be able to support.
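
An application can perform a similar check of its own at runtime using the
rte_cpuflags API, for example to select an optimized code path only when the
running CPU supports it. A brief sketch for an x86 build follows; the two
burst-handler functions are hypothetical.

.. code-block:: c

    #include <rte_cpuflags.h>

    /* Hypothetical scalar and AVX2-optimized burst handlers. */
    static void process_burst_scalar(void **pkts, unsigned int n);
    static void process_burst_avx2(void **pkts, unsigned int n);

    static void (*process_burst)(void **pkts, unsigned int n);

    static void
    select_burst_handler(void)
    {
        /* Fall back to the scalar path if AVX2 is unavailable on the
         * running machine, even if the binary was built with AVX2 support.
         */
        if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
            process_burst = process_burst_avx2;
        else
            process_burst = process_burst_scalar;
    }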