.. SPDX-License-Identifier: BSD-3-Clause
   Copyright(c) 2017 Intel Corporation.

Traffic Management API
======================


Overview
--------

This is the generic API for the Quality of Service (QoS) Traffic Management of
Ethernet devices, which includes the following main features: hierarchical
scheduling, traffic shaping, congestion management, packet marking. This API
is agnostic of the underlying HW, SW or mixed HW-SW implementation.

Main features:

* Part of DPDK rte_ethdev API
* Capability query API per port, per hierarchy level and per hierarchy node
* Scheduling algorithms: Strict Priority (SP), Weighted Fair Queuing (WFQ)
* Traffic shaping: single/dual rate, private (per node) and
  shared (by multiple nodes) shapers
* Congestion management for hierarchy leaf nodes: algorithms of tail drop, head
  drop, WRED, private (per node) and shared (by multiple nodes) WRED contexts
* Packet marking: IEEE 802.1Q (VLAN DEI), IETF RFC 3168 (IPv4/IPv6 ECN for TCP
  and SCTP), IETF RFC 2597 (IPv4/IPv6 DSCP)


Capability API
--------------

The aim of these APIs is to advertise the capability information (i.e. critical
parameter values) that the TM implementation (HW/SW) is able to support for the
application. The APIs support information disclosure at the TM level, at
any hierarchical level of the TM and at any node level of the specific
hierarchical level. Such information helps users determine quickly whether a
specific implementation meets the needs of the user application.

At the TM level, users can get a high-level idea of what the implementation
supports from parameters such as the maximum number of nodes, the maximum
number of hierarchical levels, the maximum number of shapers, the maximum
number of private shapers, the type of scheduling algorithm (Strict Priority,
Weighted Fair Queuing, etc.), and so on.
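
As a rough illustration of how an application might use TM-level capability
information, the sketch below compares a set of application requirements
against advertised limits. The ``tm_caps_model`` and ``tm_requirements``
structures and the ``tm_caps_sufficient()`` helper are hypothetical stand-ins
written for this example. They model only the parameters named above and are
not the DPDK capability structure; in a real application the limits would come
from the port capability query.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a few TM-level limits (not the DPDK structure). */
struct tm_caps_model {
    uint32_t n_nodes_max;           /* maximum number of nodes */
    uint32_t n_levels_max;          /* maximum number of hierarchy levels */
    uint32_t n_shapers_max;         /* maximum number of shapers */
    uint32_t n_private_shapers_max; /* maximum number of private shapers */
};

/* Hypothetical description of the hierarchy the application wants to build. */
struct tm_requirements {
    uint32_t n_nodes;
    uint32_t n_levels;
    uint32_t n_shapers;
    uint32_t n_private_shapers;
};

/* Return true only if every requirement fits within the advertised limit. */
static bool tm_caps_sufficient(const struct tm_caps_model *caps,
                               const struct tm_requirements *req)
{
    assert(caps != NULL && req != NULL);
    return req->n_nodes <= caps->n_nodes_max &&
           req->n_levels <= caps->n_levels_max &&
           req->n_shapers <= caps->n_shapers_max &&
           req->n_private_shapers <= caps->n_private_shapers_max;
}
```

Running such a check before adding any node lets the application reject an
unsuitable port early, rather than discovering the mismatch at hierarchy
commit time.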

Likewise, users can query the capability of the TM at the hierarchical level to
gain more granular knowledge about a specific level. Parameters such as the
maximum number of nodes at the level, the maximum number of leaf/non-leaf
nodes at the level, and the type of shaper (dual rate, single rate) supported
at the level for non-leaf nodes are exposed by the hierarchical level
capability query.

Finally, the node level capability API offers knowledge about the capability
supported by the node at any specific level. Information such as whether a
private shaper or a dual rate shaper is supported, and the maximum and minimum
shaper rate, is exposed by the node level capability API.


Scheduling Algorithms
---------------------

The fundamental scheduling algorithms that are supported are Strict Priority
(SP) and Weighted Fair Queuing (WFQ). The SP and WFQ algorithms are supported
at the level of each node of the scheduling hierarchy, regardless of the node
level/position in the tree. The SP algorithm is used to schedule between
sibling nodes with different priority, while WFQ is used to schedule between
groups of siblings that have the same priority.

Algorithms such as Weighted Round Robin (WRR), byte-level WRR, Deficit WRR
(DWRR), etc., are considered approximations of the ideal WFQ and are therefore
assimilated to WFQ, although an associated implementation-dependent accuracy,
performance and resource usage trade-off might exist.


Traffic Shaping
---------------

The TM API provides support for single rate and dual rate shapers (rate
limiters) for the hierarchy nodes, subject to the specific implementation
support being available.

Each hierarchy node has zero or one private shaper (only one node using it)
and/or zero, one or several shared shapers (multiple nodes use the same shaper
instance).
A private shaper is used to perform traffic shaping for a single
node, while a shared shaper is used to perform traffic shaping for a group of
nodes.

The configuration of private and shared shapers is done through the definition
of shaper profiles. Any shaper profile (single rate or dual rate shaper) can be
used by one or several shaper instances (either private or shared).

Single rate shapers use a single token bucket. Therefore, a single rate shaper
is configured by setting the rate of the committed bucket to zero, which
effectively disables this bucket. The peak bucket is used to limit the rate
and the burst size for the single rate shaper. Dual rate shapers use both the
committed and the peak token buckets. The rate of the peak bucket has to be
greater than zero, as well as greater than or equal to the rate of the
committed bucket.


Congestion Management
---------------------

Congestion management is used to control the admission of packets into a packet
queue or group of packet queues on congestion. The congestion management
algorithms that are supported are: Tail Drop, Head Drop and Weighted Random
Early Detection (WRED). They are made available for every leaf node in the
hierarchy, subject to the specific implementation supporting them.
When a new packet is to be written into a queue that is already full, the Tail
Drop algorithm drops the new packet while leaving the queue unmodified, as
opposed to the Head Drop algorithm, which drops the packet at the head of the
queue (the oldest packet waiting in the queue) and admits the new packet at the
tail of the queue.

The Random Early Detection (RED) algorithm works by proactively dropping more
and more input packets as the queue occupancy builds up. When the queue is full
or almost full, RED effectively works as Tail Drop.
The Weighted RED (WRED)
algorithm uses a separate set of RED thresholds for each packet color.

Each hierarchy leaf node with WRED enabled as its congestion management mode
has zero or one private WRED context (only one leaf node using it) and/or zero,
one or several shared WRED contexts (multiple leaf nodes use the same WRED
context). A private WRED context is used to perform congestion management for
a single leaf node, while a shared WRED context is used to perform congestion
management for a group of leaf nodes.

The configuration of WRED private and shared contexts is done through the
definition of WRED profiles. Any WRED profile can be used by one or several
WRED contexts (either private or shared).


Packet Marking
--------------

The TM API supports several types of packet marking: VLAN DEI packet marking
(IEEE 802.1Q), IPv4/IPv6 ECN marking of TCP and SCTP packets (IETF RFC 3168)
and IPv4/IPv6 DSCP packet marking (IETF RFC 2597).

All VLAN frames of a given color get their DEI bit set if marking is enabled
for this color. When marking is not enabled for a given color, the DEI bit is
left as is (either set or not).

All IPv4/IPv6 packets of a given color with ECN set to 2'b01 or 2'b10 carrying
TCP or SCTP have their ECN set to 2'b11 if the marking feature is enabled for
the current color, otherwise the ECN field is left as is.

All IPv4/IPv6 packets have their color marked into DSCP bits 3 and 4 as
follows: green mapped to Low Drop Precedence (2'b01), yellow to Medium (2'b10)
and red to High (2'b11). Marking needs to be explicitly enabled for each color;
when not enabled for a given color, the DSCP field of all packets with that
color is left as is.
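
The three marking rules above can be sketched in plain C. This is an
illustrative model only, not the DPDK implementation: the ``pkt_color`` enum,
the function names and the per-color ``enabled`` arrays are hypothetical. The
sketch operates on the 16-bit VLAN TCI (DEI is bit 12) and on the IPv4 TOS /
IPv6 Traffic Class byte, whose upper six bits carry the DSCP and whose lower
two bits carry the ECN field. It assumes that "DSCP bits 3 and 4" are counted
from the most significant DSCP bit, which makes them the drop precedence bits
of an RFC 2597 AF codepoint (mask ``0x18`` in the TOS byte).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical color type for this sketch (not a DPDK definition). */
enum pkt_color { COLOR_GREEN = 0, COLOR_YELLOW = 1, COLOR_RED = 2, COLOR_COUNT = 3 };

/* VLAN DEI: set TCI bit 12 when marking is enabled for the packet color. */
static uint16_t mark_vlan_dei(uint16_t tci, enum pkt_color c,
                              const bool enabled[COLOR_COUNT])
{
    assert(c < COLOR_COUNT);
    return enabled[c] ? (uint16_t)(tci | 0x1000u) : tci;
}

/* ECN: 2'b01 or 2'b10 becomes 2'b11 for TCP/SCTP packets of an enabled color.
 * ECN values 2'b00 and 2'b11 are left untouched. */
static uint8_t mark_ip_ecn(uint8_t tos, enum pkt_color c, bool is_tcp_or_sctp,
                           const bool enabled[COLOR_COUNT])
{
    uint8_t ecn = tos & 0x03u;

    assert(c < COLOR_COUNT);
    if (enabled[c] && is_tcp_or_sctp && (ecn == 0x01u || ecn == 0x02u))
        return (uint8_t)(tos | 0x03u);
    return tos;
}

/* DSCP: overwrite the two drop precedence bits with the color encoding:
 * green -> 2'b01, yellow -> 2'b10, red -> 2'b11. */
static uint8_t mark_ip_dscp(uint8_t tos, enum pkt_color c,
                            const bool enabled[COLOR_COUNT])
{
    assert(c < COLOR_COUNT);
    if (!enabled[c])
        return tos;
    uint8_t dp = (uint8_t)(c + 1); /* 1, 2 or 3 */
    return (uint8_t)((tos & ~0x18u) | ((unsigned)dp << 3));
}
```

For example, with marking enabled for red, a red packet carrying AF11 (TOS byte
``0x28``) leaves ``mark_ip_dscp()`` as AF13 (TOS byte ``0x38``): the class bits
are preserved and only the drop precedence changes.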


Steps to Setup the Hierarchy
----------------------------

The TM hierarchical tree consists of leaf nodes and non-leaf nodes. Each leaf
node sits on top of a scheduling queue of the current Ethernet port. Therefore,
the leaf nodes have predefined IDs in the range of 0 to (N-1), where N is the
number of scheduling queues of the current Ethernet port. The non-leaf nodes
have their IDs generated by the application outside of the above range, which
is reserved for leaf nodes.

Each non-leaf node has multiple inputs (its children nodes) and a single output
(which is input to its parent node). It arbitrates its inputs using Strict
Priority (SP) and Weighted Fair Queuing (WFQ) algorithms to schedule input
packets to its output while observing its shaping (rate limiting) constraints.

The children nodes with different priorities are scheduled using the SP
algorithm based on their priority, with 0 as the highest priority. Children
with the same priority are scheduled using the WFQ algorithm according to their
weights. The WFQ weight of a given child node is relative to the sum of the
weights of all its sibling nodes that have the same priority, with 1 as the
lowest weight. For each SP priority, the WFQ weight mode can be set as either
byte-based or packet-based.


Initial Hierarchy Specification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The hierarchy is specified by incrementally adding nodes to build up the
scheduling tree. The first node that is added to the hierarchy becomes the root
node and all the nodes that are subsequently added have to be added as
descendants of the root node. The parent of the root node has to be specified
as RTE_TM_NODE_ID_NULL and there can only be one node with this parent ID
(i.e. the root node).
The unique ID that is assigned to each node when the node
is created is further used to update the node configuration or to connect
children nodes to it.

During this phase, some limited checks on the hierarchy specification can be
conducted, usually limited in scope to the current node, its parent node and
its sibling nodes. At this time, since the hierarchy is not fully defined,
there is typically no real action performed by the underlying implementation.


Hierarchy Commit
~~~~~~~~~~~~~~~~

The hierarchy commit API is called during the port initialization phase (before
the Ethernet port is started) to freeze the start-up hierarchy. This function
typically performs the following steps:

* It validates the start-up hierarchy that was previously defined for the
  current port through successive node add API invocations.
* Assuming successful validation, it performs all the necessary implementation
  specific operations to install the specified hierarchy on the current port,
  with immediate effect once the port is started.

This function fails when the currently configured hierarchy is not supported by
the Ethernet port, in which case the user can abort or try out another
hierarchy configuration (e.g. a hierarchy with fewer leaf nodes), which can be
built from scratch or by modifying the existing hierarchy configuration. Note
that this function can still fail due to other causes (e.g. not enough memory
available in the system, etc.), even though the specified hierarchy is
supported in principle by the current port.


Run-Time Hierarchy Updates
~~~~~~~~~~~~~~~~~~~~~~~~~~

The TM API provides support for on-the-fly changes to the scheduling hierarchy,
thus operations such as node add/delete, node suspend/resume, parent node
update, etc., can be invoked after the Ethernet port has been started, subject
to the specific implementation supporting them.
The set of dynamic updates
supported by the implementation is advertised through the port capability set.
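
One way an application could act on the advertised set of dynamic updates is
to treat it as a bit mask and test it before attempting a run-time change. The
sketch below assumes such a mask representation; the ``tm_update_flag`` names
and the ``tm_updates_supported()`` helper are hypothetical and defined only for
this example, so they should not be read as the DPDK definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags, one bit per run-time operation named in the text. */
enum tm_update_flag {
    TM_UPD_NODE_ADD_DELETE     = 1u << 0,
    TM_UPD_NODE_SUSPEND_RESUME = 1u << 1,
    TM_UPD_NODE_PARENT_UPDATE  = 1u << 2,
};

/* Return true when every operation in 'wanted' appears in 'advertised'. */
static bool tm_updates_supported(uint64_t advertised, uint64_t wanted)
{
    assert(wanted != 0);
    return (advertised & wanted) == wanted;
}
```

An application that needs both node add/delete and parent node update after
the port is started would pass both flags OR-ed together as ``wanted`` and fall
back to a static hierarchy when the check returns false.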