.. SPDX-License-Identifier: GPL-2.0

=============
Devlink DPIPE
=============

Background
==========

While performing the hardware offloading process, many of the hardware
specifics cannot be presented. These details are useful for debugging, and
``devlink-dpipe`` provides a standardized way to provide visibility into the
offloading process.

For example, the routing longest prefix match (LPM) algorithm used by the
Linux kernel may differ from the hardware implementation. The pipeline debug
API (DPIPE) is aimed at providing the user visibility into the ASIC's
pipeline in a generic way.

The hardware offload process is expected to be done in a way that the user
should not be able to distinguish between the hardware and software
implementations. In this process, hardware specifics are neglected. In
reality those details can be highly significant and should be exposed in some
standard way.

This problem is made even more complex when one wishes to offload the
control path of the whole networking stack to a switch ASIC. Due to
differences in the hardware and software models some processes cannot be
represented correctly.

One example is the kernel's LPM algorithm, which in many cases differs
greatly from the hardware implementation. The configuration API is the same,
but one cannot rely on the Forwarding Information Base (FIB) to look like the
Level Path Compression trie (LPC-trie) in hardware.

In many situations, trying to analyze system failures solely based on the
kernel's dump may not be enough. By combining this data with complementary
information about the underlying hardware, this debugging can be made
easier; additionally, the information can be useful when debugging
performance issues.

Overview
========

The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
modeled as a graph of match/action tables. Each table represents a specific
hardware block. This model is not new, first being used by the P4 language.

Traditionally it has been used as an alternative model for hardware
configuration, but the ``devlink-dpipe`` interface uses it for visibility
purposes as a standard complementary tool. The system's view from
``devlink-dpipe`` should change according to the changes done by the
standard configuration tools.

For example, it is quite common to implement Access Control Lists (ACL)
using Ternary Content Addressable Memory (TCAM). The TCAM memory can be
divided into TCAM regions. Complex TC filters can have multiple rules with
different priorities and different lookup keys. On the other hand, hardware
TCAM regions have a predefined lookup key. Offloading the TC filter rules
using the TCAM engine can result in multiple TCAM regions being interconnected
in a chain (which may affect the data path latency). In response to a new TC
filter, new tables should be created describing those regions.

Model
=====

The ``DPIPE`` model introduces several objects:

 * headers
 * tables
 * entries

A ``header`` describes packet formats and provides names for fields within
the packet. A ``table`` describes hardware blocks. An ``entry`` describes
the actual content of a specific table.

The hardware pipeline is not port specific, but rather describes the whole
ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.

Drivers can register and unregister tables at run time, in order to support
dynamic behavior. This dynamic behavior is mandatory for describing hardware
blocks like TCAM regions which can be allocated and freed dynamically.

``devlink-dpipe`` generally is not intended for configuration. The exception
is hardware counting for a specific table.

The following commands are used to obtain the ``dpipe`` objects from
userspace:

 * ``table_get``: Receive a table's description.
 * ``headers_get``: Receive a device's supported headers.
 * ``entries_get``: Receive a table's current entries.
 * ``counters_set``: Enable or disable counters on a table.

Table
-----

The driver should implement the following operations for each table:

 * ``matches_dump``: Dump the supported matches.
 * ``actions_dump``: Dump the supported actions.
 * ``entries_dump``: Dump the actual content of the table.
 * ``counters_set_update``: Synchronize hardware with counters enabled or
   disabled.

Header/Field
------------

In a similar way to P4, headers and fields are used to describe a table's
behavior. There is a slight difference between the standard protocol headers
and specific ASIC metadata. The protocol headers should be declared in the
``devlink`` core API. On the other hand, ASIC metadata is driver specific
and should be defined in the driver. Additionally, each driver-specific
devlink documentation file should document the driver-specific ``dpipe``
headers it implements. The headers and fields are identified by enumeration.

In order to provide further visibility some ASIC metadata fields could be
mapped to kernel objects. For example, internal router interface indexes can
be directly mapped to the net device ifindex. FIB table indexes used by
different Virtual Routing and Forwarding (VRF) tables can be mapped to
internal routing table indexes.

Match
-----

Matches are kept primitive and close to hardware operation. Match types like
LPM are not supported because LPM is exactly the kind of process we wish
to describe in full detail. Examples of matches:

 * ``field_exact``: Exact match on a specific field.
 * ``field_exact_mask``: Exact match on a specific field after masking.
 * ``field_range``: Match on a specific range.

The IDs of the header and the field should be specified in order to
identify the specific field. Furthermore, the header index should be
specified in order to distinguish multiple headers of the same type in a
packet (tunneling).

Action
------

Similar to matches, the actions are kept primitive and close to hardware
operation. For example:

 * ``field_modify``: Modify the field value.
 * ``field_inc``: Increment the field value.
 * ``push_header``: Add a header.
 * ``pop_header``: Remove a header.

Entry
-----

Entries of a specific table can be dumped on demand. Each entry is
identified with an index and its properties are described by a list of
match/action values and a specific counter. By dumping the table's content the
interactions between tables can be resolved.

Abstraction Example
===================

The following is an example of the abstraction model of the L3 part of
the Mellanox Spectrum ASIC. The blocks are described in the order they appear
in the pipeline. The table sizes in the following examples are not real
hardware sizes and are provided for demonstration purposes.

LPM
---

The LPM algorithm can be implemented as a list of hash tables. Each hash
table contains routes with the same prefix length. The root of the list is
/32, and in case of a miss the hardware will continue to the next hash
table. The depth of the search will affect the data path latency.

In case of a hit the entry contains information about the next stage of the
pipeline which resolves the MAC address. The next stage can be either the
local host table for directly connected routes, or the adjacency table for
next-hops. The ``meta.lpm_prefix`` field is used to connect two LPM tables.

.. code::

    table lpm_prefix_16 {
      size: 4096,
      counters_enabled: true,
      match: { meta.vr_id: exact,
               ipv4.dst_addr: exact_mask,
               ipv6.dst_addr: exact_mask,
               meta.lpm_prefix: exact },
      action: { meta.adj_index: set,
                meta.adj_group_size: set,
                meta.rif_port: set,
                meta.lpm_prefix: set },
    }

Local Host
----------

In the case of local routes the LPM lookup already resolves the egress
router interface (RIF), yet the exact MAC address is not known. The local
host table is a hash table combining the output interface id with
destination IP address as a key. The result is the MAC address.

.. code::

    table local_host {
      size: 4096,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               ipv4.dst_addr: exact },
      action: { ethernet.daddr: set }
    }

Adjacency
---------

In case of remote routes this table does the ECMP. The LPM lookup results in
an ECMP group size and an index that serves as a global offset into this table.
Concurrently a hash of the packet is generated. Based on the ECMP group size
and the packet's hash a local offset is generated. Multiple LPM entries can
point to the same adjacency group.

.. code::

    table adjacency {
      size: 4096,
      counters_enabled: true,
      match: { meta.adj_index: exact,
               meta.adj_group_size: exact,
               meta.packet_hash_index: exact },
      action: { ethernet.daddr: set,
                meta.erif: set }
    }

ERIF
----

In case the egress RIF and destination MAC have been resolved by previous
tables this table does multiple operations like TTL decrease and MTU check.
Then the decision of forward/drop is taken and the port L3 statistics are
updated based on the packet's type (broadcast, unicast, multicast).

.. code::

    table erif {
      size: 800,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               meta.is_l3_unicast: exact,
               meta.is_l3_broadcast: exact,
               meta.is_l3_multicast: exact },
      action: { meta.l3_drop: set,
                meta.l3_forward: set }
    }
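
Read together, the LPM, local host and adjacency tables described above form
a lookup chain that ends with a destination MAC address and an egress RIF.
The following Python sketch only illustrates that chain and is not the
Spectrum implementation. The table contents, the function names, the use of
CRC32 as the packet hash and the modulo used to derive the local ECMP offset
are all assumptions made for the example; only IPv4 is handled.

.. code:: python

    import ipaddress
    import zlib

    # Hypothetical table contents, keyed by prefix length (illustration only).
    LPM_TABLES = {
        32: {},
        24: {"192.0.2.0": {"rif_port": 1}},                          # local route
        16: {"198.51.0.0": {"adj_index": 4, "adj_group_size": 2}},   # ECMP route
    }
    # Local host table: (rif_port, dst_addr) -> destination MAC.
    LOCAL_HOST = {(1, "192.0.2.7"): "00:00:5e:00:53:07"}
    # Adjacency table: global index -> (destination MAC, egress RIF).
    ADJACENCY = {
        4: ("00:00:5e:00:53:a0", 2),
        5: ("00:00:5e:00:53:a1", 3),
    }


    def mask(addr, prefix_len):
        """Return addr with all but the leading prefix_len bits cleared."""
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        return str(net.network_address)


    def lpm_lookup(dst):
        """Walk the hash tables from the longest prefix to the shortest."""
        for depth, prefix_len in enumerate(sorted(LPM_TABLES, reverse=True), 1):
            entry = LPM_TABLES[prefix_len].get(mask(dst, prefix_len))
            if entry is not None:
                return entry, depth
        return None, len(LPM_TABLES)


    def resolve(dst, flow_key):
        """Return (dst_mac, egress_rif, tables_searched), or None on a miss."""
        entry, depth = lpm_lookup(dst)
        if entry is None:
            return None
        if "rif_port" in entry:
            # Directly connected route: the local host table gives the MAC.
            rif = entry["rif_port"]
            return LOCAL_HOST.get((rif, dst)), rif, depth
        # Remote route: the global offset comes from the LPM entry, the local
        # offset from the packet hash (assumed here: CRC32 modulo group size).
        local_offset = zlib.crc32(flow_key) % entry["adj_group_size"]
        mac, erif = ADJACENCY[entry["adj_index"] + local_offset]
        return mac, erif, depth

Calling ``resolve("192.0.2.7", b"flow-a")`` misses in the /32 table, hits in
the /24 table and is completed by the local host table, so the reported search
depth is 2. A destination such as ``198.51.100.9`` is only found in the third
table, which illustrates how the depth of the search grows with shorter
prefixes.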