..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2014 Intel Corporation.

Writing Efficient Code
======================

This chapter provides some tips for developing efficient code using the DPDK.
For additional and more general information,
please refer to the *Intel® 64 and IA-32 Architectures Optimization Reference Manual*,
which is a valuable reference for writing efficient code.

Memory
------

This section describes some key memory considerations when developing applications in the DPDK environment.

Memory Copy: Do not Use libc in the Data Plane
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many libc functions are available in the DPDK, via the Linux* application environment.
This can ease the porting of applications and the development of the configuration plane.
However, many of these functions are not designed for performance.
Functions such as memcpy() or strcpy() should not be used in the data plane.
To copy small structures, the preference is for a simpler technique that the compiler can optimize.
Refer to the *VTune™ Performance Analyzer Essentials* publication from Intel Press for recommendations.

For specific functions that are called often,
it is also a good idea to provide a self-made optimized function, which should be declared as static inline.

The DPDK API provides an optimized rte_memcpy() function.

Memory Allocation
~~~~~~~~~~~~~~~~~

Other functions of libc, such as malloc(), provide a flexible way to allocate and free memory.
In some cases, using dynamic allocation is necessary,
but it is really not advised to use malloc-like functions in the data plane because
managing a fragmented heap can be costly and the allocator may not be optimized for parallel allocation.

If you really need dynamic allocation in the data plane, it is better to use a memory pool of fixed-size objects.
This API is provided by librte_mempool.
This data structure provides several services that increase performance, such as memory alignment of objects,
lockless access to objects, NUMA awareness, bulk get/put and per-lcore cache.
The rte_malloc() function uses a similar concept to mempools.

Concurrent Access to the Same Memory Area
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Read-Write (RW) access operations by several lcores to the same memory area can generate a lot of data cache misses,
which are very costly.
It is often possible to use per-lcore variables, for example, in the case of statistics.
There are at least two solutions for this:

* Use RTE_PER_LCORE variables (see the sketch after this list).
  Note that in this case, data on lcore X is not available to lcore Y.

* Use a table of structures (one per lcore). In this case, each structure must be cache-aligned.
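
As a brief illustration of the first approach, here is a minimal sketch of a
per-lcore statistics counter built on the RTE_PER_LCORE macros.
The counter name and the helper function are hypothetical;
only the macros themselves are DPDK API.

.. code-block:: c

    #include <stdint.h>

    #include <rte_per_lcore.h>

    /* One private copy of the counter exists per lcore (thread),
     * so no locking or atomic operations are needed to update it.
     */
    static RTE_DEFINE_PER_LCORE(uint64_t, rx_pkt_count); /* hypothetical name */

    static inline void
    count_rx_packets(unsigned int n)
    {
        /* RTE_PER_LCORE() accesses the calling lcore's own copy. */
        RTE_PER_LCORE(rx_pkt_count) += n;
    }

As noted above, such a counter cannot be read from another lcore;
if the statistics must be aggregated elsewhere,
the cache-aligned table of structures of the second approach is a better fit.
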
Read-mostly variables can be shared among lcores without performance losses if there are no RW variables in the same cache line.

NUMA
~~~~

On a NUMA system, it is preferable to access local memory since remote memory access is slower.
In the DPDK, the memzone, ring, rte_malloc and mempool APIs provide a way to create a pool on a specific socket.

Sometimes, it can be a good idea to duplicate data to optimize speed.
For read-mostly variables that are often accessed,
it should not be a problem to keep them in one socket only, since the data will be present in cache.

Distribution Across Memory Channels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Modern memory controllers have several memory channels that can load or store data in parallel.
Depending on the memory controller and its configuration,
the number of channels and the way the memory is distributed across the channels varies.
Each channel has a bandwidth limit,
meaning that if all memory access operations are done on the first channel only, there is a potential bottleneck.

By default, the :ref:`Mempool Library <Mempool_Library>` spreads the addresses of objects among memory channels.

Locking Memory Pages
~~~~~~~~~~~~~~~~~~~~

The underlying operating system is allowed to load/unload memory pages at its own discretion.
These page loads can impact performance, as the process is put on hold while the kernel fetches them.

To avoid this, pages can be pre-loaded and locked into memory with the ``mlockall()`` call:

.. code-block:: c

    if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
        RTE_LOG(NOTICE, USER1, "mlockall() failed with error \"%s\"\n",
                strerror(errno));
    }

Communication Between lcores
----------------------------

To provide message-based communication between lcores,
it is advised to use the DPDK ring API, which provides a lockless ring implementation.

The ring supports bulk and burst access,
meaning that it is possible to read several elements from the ring with only one costly atomic operation
(see :doc:`ring_lib`).
Performance is greatly improved when using bulk access operations.

The code that dequeues messages may look similar to the following:

.. code-block:: c

    #define MAX_BULK 32

    while (1) {
        /* Process as many elements as can be dequeued. */
        count = rte_ring_dequeue_burst(ring, obj_table, MAX_BULK, NULL);
        if (unlikely(count == 0))
            continue;

        my_process_bulk(obj_table, count);
    }
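
The enqueuing side can use burst operations in the same way,
paying the cost of a single atomic operation for a whole group of objects.
The following sketch makes the same assumptions as the example above;
my_handle_unsent() is hypothetical, like my_process_bulk().

.. code-block:: c

    unsigned int sent;

    /* Enqueue a burst of prepared objects in one operation. The return
     * value may be less than n if the ring is almost full.
     */
    sent = rte_ring_enqueue_burst(ring, obj_table, n, NULL);
    if (unlikely(sent < n))
        my_handle_unsent(obj_table + sent, n - sent);
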
PMD Driver
----------

The DPDK Poll Mode Driver (PMD) is also able to work in bulk/burst mode,
allowing the factorization of some code for each call in the send or receive function.

Avoid partial writes.
When PCI devices write to system memory through DMA,
it costs less if the write operation is on a full cache line as opposed to part of it.
In the PMD code, actions have been taken to avoid partial writes as much as possible.

Lower Packet Latency
~~~~~~~~~~~~~~~~~~~~

Traditionally, there is a trade-off between throughput and latency.
An application can be tuned to achieve a high throughput,
but the end-to-end latency of an average packet will typically increase as a result.
Similarly, the application can be tuned to have, on average,
a low end-to-end latency, at the cost of lower throughput.

In order to achieve higher throughput,
the DPDK attempts to amortize the cost of processing each packet individually by processing packets in bursts.

Using the testpmd application as an example,
the burst size can be set on the command line to a value of 16 (also the default value).
This allows the application to request 16 packets at a time from the PMD.
The testpmd application then immediately attempts to transmit all the packets that were received,
in this case, all 16 packets.

The packets are not transmitted until the tail pointer is updated on the corresponding TX queue of the network port.
This behavior is desirable when tuning for high throughput because
the cost of tail pointer updates to both the RX and TX queues can be spread across 16 packets,
effectively hiding the relatively slow MMIO cost of writing to the PCIe* device.
However, this is not very desirable when tuning for low latency because
the first packet that was received must also wait for another 15 packets to be received.
It cannot be transmitted until the other 15 packets have also been processed because
the NIC will not know to transmit the packets until the TX tail pointer has been updated,
which is not done until all 16 packets have been processed for transmission.

To consistently achieve low latency, even under heavy system load,
the application developer should avoid processing packets in bunches.
The testpmd application can be configured from the command line to use a burst value of 1.
This allows a single packet to be processed at a time, providing lower latency,
but with the added cost of lower throughput.

Locks and Atomic Operations
---------------------------

This section describes some key considerations when using locks and atomic
operations in the DPDK environment.

Locks
~~~~~

On x86, atomic operations imply a lock prefix before the instruction,
causing the processor's LOCK# signal to be asserted during execution of the following instruction.
This has a big impact on performance in a multicore environment.

Performance can often be improved by avoiding lock mechanisms in the data plane.
They can often be replaced by other solutions such as per-lcore variables.
Also, some locking techniques are more efficient than others.
For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.

Atomic Operations: Use C11 Atomic Builtins
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DPDK generic rte_atomic operations are implemented by __sync builtins. These
__sync builtins result in full barriers on aarch64, which are unnecessary
in many use cases. They can be replaced by __atomic builtins that conform to
the C11 memory model and provide finer memory order control.

Replacing the rte_atomic operations with __atomic builtins might therefore
improve performance on aarch64 machines.

Some typical optimization cases are listed below.

Atomicity
^^^^^^^^^

Some use cases require atomicity alone; the ordering of the memory operations
does not matter. For example, packet statistics counters need to be
incremented atomically but do not need any particular memory ordering,
so RELAXED memory ordering is sufficient.
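
As a minimal sketch, such a counter can be incremented with a RELAXED atomic
add; the counter name is hypothetical:

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical counter shared by several lcores. */
    static uint64_t pkt_drop_count;

    static inline void
    count_dropped(unsigned int n)
    {
        /* Atomic increment with no ordering constraints: surrounding
         * memory operations may be freely reordered around it.
         */
        __atomic_fetch_add(&pkt_drop_count, n, __ATOMIC_RELAXED);
    }
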
One-way Barrier
^^^^^^^^^^^^^^^

Some use cases allow for memory reordering in one direction while requiring
memory ordering in the other direction.

For example, the memory operations before the spinlock lock are allowed to
move into the critical section, but the memory operations in the critical section
are not allowed to move above the lock. In this case, the full memory barrier
in the compare-and-swap operation can be replaced with ACQUIRE memory order.
On the other hand, the memory operations after the spinlock unlock are allowed
to move into the critical section, but the memory operations in the critical
section are not allowed to move below the unlock. So the full barrier in the
store operation can use RELEASE memory order.
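
To illustrate the technique, the sketch below builds a toy spinlock directly
on the __atomic builtins. This is a sketch of the idea only, not the DPDK
rte_spinlock implementation; the type and function names are hypothetical.

.. code-block:: c

    #include <stdbool.h>

    typedef struct { int locked; } my_spinlock_t;

    static inline void
    my_spinlock_lock(my_spinlock_t *sl)
    {
        int exp = 0;

        /* ACQUIRE on success: operations in the critical section cannot
         * move above the lock, while earlier operations may still sink
         * into the critical section.
         */
        while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, false,
                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
            /* On failure, exp was overwritten with the observed value. */
            exp = 0;
        }
    }

    static inline void
    my_spinlock_unlock(my_spinlock_t *sl)
    {
        /* RELEASE: operations in the critical section cannot move below
         * the unlock, while later operations may hoist into it.
         */
        __atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
    }
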
Reader-Writer Concurrency
^^^^^^^^^^^^^^^^^^^^^^^^^

Lock-free reader-writer concurrency is one of the common use cases in DPDK.

The payload, that is, the data that the writer wants to communicate to the reader,
can be written with RELAXED memory order. However, the guard variable should
be written with RELEASE memory order. This ensures that the store to the guard
variable is observable only after the store to the payload is observable.

Correspondingly, on the reader side, the guard variable should be read
with ACQUIRE memory order. The payload, that is, the data the writer communicated,
can be read with RELAXED memory order. This ensures that, if the store to the
guard variable is observable, the store to the payload is also observable.

Coding Considerations
---------------------

Inline Functions
~~~~~~~~~~~~~~~~

Small functions can be declared as static inline in the header file.
This avoids the cost of a call instruction (and the associated context saving).
However, this technique is not always efficient; it depends on many factors including the compiler.

Branch Prediction
~~~~~~~~~~~~~~~~~

The Intel® C/C++ Compiler (icc)/gcc built-in helper functions likely() and unlikely()
allow the developer to indicate if a code branch is likely to be taken or not.
For instance:

.. code-block:: c

    if (likely(x > 1))
        do_stuff();

Setting the Target CPU Type
---------------------------

The DPDK supports CPU microarchitecture-specific optimizations by means of the RTE_MACHINE option.
The degree of optimization depends on the compiler's ability to optimize for a specific microarchitecture,
therefore it is preferable to use the latest compiler versions whenever possible.

If the compiler version does not support the specific feature set (for example, the Intel® AVX instruction set),
the build process gracefully degrades to whatever latest feature set is supported by the compiler.

Since the build and runtime targets may not be the same,
the resulting binary also contains a platform check that runs before the
main() function and checks if the current machine is suitable for running the binary.

Along with compiler optimizations,
a set of preprocessor defines is automatically added to the build process (regardless of the compiler version).
These defines correspond to the instruction sets that the target CPU should be able to support.
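
An application can perform a similar check of its own at runtime using the
rte_cpuflags API, for example to select an optimized code path only when the
running CPU supports it. A brief sketch for an x86 build follows; the two
burst-handler functions are hypothetical.

.. code-block:: c

    #include <rte_cpuflags.h>

    /* Hypothetical scalar and AVX2-optimized burst handlers. */
    static void process_burst_scalar(void **pkts, unsigned int n);
    static void process_burst_avx2(void **pkts, unsigned int n);

    static void (*process_burst)(void **pkts, unsigned int n);

    static void
    select_burst_handler(void)
    {
        /* Fall back to the scalar path if AVX2 is unavailable on the
         * running machine, even if the binary was built with AVX2 support.
         */
        if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
            process_burst = process_burst_avx2;
        else
            process_burst = process_burst_scalar;
    }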