.. SPDX-License-Identifier: GPL-2.0

=============
Devlink DPIPE
=============

Background
==========

While performing the hardware offloading process, many of the hardware
specifics cannot be presented. These details are useful for debugging, and
``devlink-dpipe`` provides a standardized way to provide visibility into the
offloading process.

For example, the routing longest prefix match (LPM) algorithm used by the
Linux kernel may differ from the hardware implementation. The pipeline debug
API (DPIPE) is aimed at providing the user visibility into the ASIC's
pipeline in a generic way.

The hardware offload process is expected to be done in a way that the user
should not be able to distinguish between the hardware and software
implementations. In this process, hardware specifics are neglected. In
reality those details can carry a lot of meaning and should be exposed in
some standard way.

This problem is made even more complex when one wishes to offload the
control path of the whole networking stack to a switch ASIC. Due to
differences in the hardware and software models some processes cannot be
represented correctly.

One example is the kernel's LPM algorithm, which in many cases differs
greatly from the hardware implementation. The configuration API is the same,
but one cannot rely on the Forwarding Information Base (FIB) to look like
the Level Path Compression trie (LPC-trie) in hardware.

In many situations trying to analyze system failures solely based on the
kernel's dump may not be enough. By combining this data with complementary
information about the underlying hardware, this debugging can be made
easier; additionally, the information can be useful when debugging
performance issues.

Overview
========

The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
modeled as a graph of match/action tables. Each table represents a specific
hardware block. This model is not new, first being used by the P4 language.

Traditionally it has been used as an alternative model for hardware
configuration, but the ``devlink-dpipe`` interface uses it for visibility
purposes as a standard complementary tool. The system's view from
``devlink-dpipe`` should change according to the changes done by the
standard configuration tools.

For example, it's quite common to implement Access Control Lists (ACL)
using Ternary Content Addressable Memory (TCAM). The TCAM memory can be
divided into TCAM regions. Complex TC filters can have multiple rules with
different priorities and different lookup keys. On the other hand, hardware
TCAM regions have a predefined lookup key. Offloading the TC filter rules
using the TCAM engine can result in multiple TCAM regions being
interconnected in a chain (which may affect the data path latency). In
response to a new TC filter, new tables should be created describing those
regions.

Model
=====

The ``DPIPE`` model introduces several objects:

  * headers
  * tables
  * entries

A ``header`` describes packet formats and provides names for fields within
the packet. A ``table`` describes hardware blocks. An ``entry`` describes
the actual content of a specific table.
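
As a rough illustration of how these three objects relate, the following
Python sketch models them as plain data classes. The class and attribute
names are hypothetical and chosen for readability; they are not the kernel's
``devlink`` structures, and the example values are made up.

.. code:: python

    from dataclasses import dataclass, field
    from typing import Dict, List


    @dataclass
    class Header:
        """A packet format: a name plus the names of its fields."""
        name: str
        fields: List[str]


    @dataclass
    class Entry:
        """One row of a table: match values, action values and a counter."""
        index: int
        match_values: Dict[str, int]
        action_values: Dict[str, int]
        counter: int = 0


    @dataclass
    class Table:
        """A hardware block, described by the fields it matches and sets."""
        name: str
        size: int
        matches: List[str]
        actions: List[str]
        entries: List[Entry] = field(default_factory=list)


    # Hypothetical content, loosely following the local host table example.
    ipv4 = Header("ipv4", ["dst_addr"])
    host = Table("local_host", 4096,
                 matches=["meta.rif_port", "ipv4.dst_addr"],
                 actions=["ethernet.daddr"])
    host.entries.append(
        Entry(0,
              {"meta.rif_port": 1, "ipv4.dst_addr": 0x0A000001},
              {"ethernet.daddr": 0x0002C9000001}))

A table only names the fields it can match on and set; the concrete values
live in its entries.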

The hardware pipeline is not port specific, but rather describes the whole
ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.

Drivers can register and unregister tables at run time, in order to support
dynamic behavior. This dynamic behavior is mandatory for describing hardware
blocks like TCAM regions which can be allocated and freed dynamically.

``devlink-dpipe`` generally is not intended for configuration. The exception
is hardware counting for a specific table.

The following commands are used to obtain the ``dpipe`` objects from
userspace:

  * ``table_get``: Receive a table's description.
  * ``headers_get``: Receive a device's supported headers.
  * ``entries_get``: Receive a table's current entries.
  * ``counters_set``: Enable or disable counters on a table.

Table
-----

The driver should implement the following operations for each table:

  * ``matches_dump``: Dump the supported matches.
  * ``actions_dump``: Dump the supported actions.
  * ``entries_dump``: Dump the actual content of the table.
  * ``counters_set_update``: Synchronize hardware with counters enabled or
    disabled.

Header/Field
------------

As in P4, headers and fields are used to describe a table's behavior. There
is a slight difference between the standard protocol headers and specific
ASIC metadata. The protocol headers should be declared in the ``devlink``
core API. On the other hand, ASIC metadata is driver specific and should be
defined in the driver. Additionally, each driver-specific devlink
documentation file should document the driver-specific ``dpipe`` headers it
implements. The headers and fields are identified by enumeration.

In order to provide further visibility, some ASIC metadata fields could be
mapped to kernel objects. For example, internal router interface indexes can
be directly mapped to the net device ifindex. FIB table indexes used by
different Virtual Routing and Forwarding (VRF) tables can be mapped to
internal routing table indexes.

Match
-----

Matches are kept primitive and close to hardware operation. Match types like
LPM are not supported because this is exactly the process we wish to
describe in full detail. Examples of matches:

  * ``field_exact``: Exact match on a specific field.
  * ``field_exact_mask``: Exact match on a specific field after masking.
  * ``field_range``: Match on a specific range.

The IDs of the header and the field should be specified in order to
identify the specific field. Furthermore, the header index should be
specified in order to distinguish multiple headers of the same type in a
packet (tunneling).
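
The three match primitives listed above can be sketched as simple predicates
on integer field values. This is only an illustration: the function
signatures are hypothetical, and treating the range bounds as inclusive is an
assumption rather than something the interface specifies.

.. code:: python

    def field_exact(value: int, key: int) -> bool:
        """Match when the field equals the key."""
        return value == key


    def field_exact_mask(value: int, key: int, mask: int) -> bool:
        """Match when field and key agree on the bits selected by mask."""
        return (value & mask) == (key & mask)


    def field_range(value: int, low: int, high: int) -> bool:
        """Match when the field lies in [low, high] (inclusive, by assumption)."""
        return low <= value <= high


    # 10.1.2.3 matched against the /16 prefix 10.1.0.0 via a mask.
    dst_addr = 0x0A010203
    hit_16 = field_exact_mask(dst_addr, 0x0A010000, 0xFFFF0000)

An LPM lookup is then expressed as a sequence of ``field_exact_mask`` tables,
one per prefix length, instead of as a single opaque match type.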

Action
------

Similar to matches, the actions are kept primitive and close to hardware
operation. For example:

  * ``field_modify``: Modify the field value.
  * ``field_inc``: Increment the field value.
  * ``push_header``: Add a header.
  * ``pop_header``: Remove a header.
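
To make the action primitives concrete, the sketch below applies them to a
toy packet represented as a list of headers. The packet representation, the
function signatures, and the choice to push and pop at the outermost position
are all assumptions made for illustration.

.. code:: python

    from typing import Dict, List, Tuple

    Header = Tuple[str, Dict[str, int]]  # (header name, field values)
    Packet = List[Header]                # outermost header first


    def field_modify(pkt: Packet, hdr: int, name: str, value: int) -> None:
        """Overwrite a field in the header at position hdr."""
        pkt[hdr][1][name] = value


    def field_inc(pkt: Packet, hdr: int, name: str, delta: int = 1) -> None:
        """Add delta to a field (a negative delta decrements it)."""
        pkt[hdr][1][name] += delta


    def push_header(pkt: Packet, name: str, fields: Dict[str, int]) -> None:
        """Add a new outermost header."""
        pkt.insert(0, (name, dict(fields)))


    def pop_header(pkt: Packet) -> Header:
        """Remove and return the outermost header."""
        return pkt.pop(0)


    pkt: Packet = [("ipv4", {"ttl": 64})]
    push_header(pkt, "ethernet", {"daddr": 0})
    field_modify(pkt, 0, "daddr", 0x0002C9000001)
    field_inc(pkt, 1, "ttl", -1)

After these calls the packet carries an Ethernet header with the new
destination address in front of an IPv4 header whose TTL is 63.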

Entry
-----

Entries of a specific table can be dumped on demand. Each entry is
identified with an index and its properties are described by a list of
match/action values and a specific counter. By dumping the tables' content
the interactions between tables can be resolved.

Abstraction Example
===================

The following is an example of the abstraction model of the L3 part of the
Mellanox Spectrum ASIC. The blocks are described in the order they appear in
the pipeline. The table sizes in the following examples are not real
hardware sizes and are provided for demonstration purposes.

LPM
---

The LPM algorithm can be implemented as a list of hash tables. Each hash
table contains routes with the same prefix length. The root of the list is
/32, and in case of a miss the hardware will continue to the next hash
table. The depth of the search will affect the data path latency.

In case of a hit the entry contains information about the next stage of the
pipeline which resolves the MAC address. The next stage can be either the
local host table for directly connected routes, or the adjacency table for
next-hops. The ``meta.lpm_prefix`` field is used to connect two LPM tables.

.. code::

    table lpm_prefix_16 {
      size: 4096,
      counters_enabled: true,
      match: { meta.vr_id: exact,
               ipv4.dst_addr: exact_mask,
               ipv6.dst_addr: exact_mask,
               meta.lpm_prefix: exact },
      action: { meta.adj_index: set,
                meta.adj_group_size: set,
                meta.rif_port: set,
                meta.lpm_prefix: set },
    }
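
The chained-hash-table lookup described in this section can be sketched in
Python as follows. This is a simplified model under stated assumptions: it
handles IPv4 only, ignores ``meta.vr_id``, and uses hypothetical names
(``lpm_lookup``, ``prefix_mask``) and made-up routes. It returns the search
depth alongside the result to show why longer chains cost more latency.

.. code:: python

    from typing import Dict, Optional, Tuple

    # One hash table per prefix length, keyed by the masked destination.
    LpmTables = Dict[int, Dict[int, str]]


    def prefix_mask(prefix_len: int) -> int:
        """Return the 32-bit netmask for an IPv4 prefix length."""
        return (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF


    def lpm_lookup(tables: LpmTables, dst: int) -> Tuple[Optional[str], int]:
        """Walk the tables from the longest prefix to the shortest.

        Returns the next pipeline stage of the first hit (or None) and the
        number of hash tables that were probed.
        """
        depth = 0
        for prefix_len in sorted(tables, reverse=True):
            depth += 1
            stage = tables[prefix_len].get(dst & prefix_mask(prefix_len))
            if stage is not None:
                return stage, depth
        return None, depth


    tables: LpmTables = {
        32: {0x0A000001: "local_host"},  # 10.0.0.1/32
        16: {0x0A010000: "adjacency"},   # 10.1.0.0/16
        0: {0x00000000: "adjacency"},    # default route
    }

With these routes, a lookup of 10.0.0.1 hits in the first table, 10.1.2.3
needs two probes, and any other address falls through to the default route
after three.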

Local Host
----------

In the case of local routes the LPM lookup already resolves the egress
router interface (RIF), yet the exact MAC address is not known. The local
host table is a hash table combining the output interface id with
destination IP address as a key. The result is the MAC address.

.. code::

    table local_host {
      size: 4096,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               ipv4.dst_addr: exact },
      action: { ethernet.daddr: set }
    }

Adjacency
---------

In the case of remote routes this table performs ECMP. The LPM lookup yields
the ECMP group size and an index that serves as a global offset into this
table. Concurrently, a hash of the packet is generated. Based on the ECMP
group size and the packet's hash, a local offset is generated. Multiple LPM
entries can point to the same adjacency group.

.. code::

    table adjacency {
      size: 4096,
      counters_enabled: true,
      match: { meta.adj_index: exact,
               meta.adj_group_size: exact,
               meta.packet_hash_index: exact },
      action: { ethernet.daddr: set,
                meta.erif: set }
    }
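
The offset computation described in this section might look like the
following sketch. The text does not say how the local offset is derived from
the hash and the group size; reducing the hash modulo the group size is an
assumption made here, and ``adjacency_slot`` is a hypothetical name.

.. code:: python

    def adjacency_slot(adj_index: int, adj_group_size: int,
                       packet_hash: int) -> int:
        """Pick one next-hop slot inside an ECMP group.

        adj_index is the global offset of the group in the adjacency table.
        The local offset is assumed to be the packet hash reduced modulo the
        group size.
        """
        local_offset = packet_hash % adj_group_size
        return adj_index + local_offset


    # A group of 4 next-hops starting at slot 100 of the adjacency table.
    slot = adjacency_slot(adj_index=100, adj_group_size=4, packet_hash=10)

Packets with the same hash always land on the same slot, while different
flows are spread across the slots of the group.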

ERIF
----

Once the egress RIF and destination MAC have been resolved by the previous
tables, this table performs multiple operations such as TTL decrement and
MTU check. Then the forward/drop decision is made and the port L3 statistics
are updated based on the packet's type (broadcast, unicast, multicast).

.. code::

    table erif {
      size: 800,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               meta.is_l3_unicast: exact,
               meta.is_l3_broadcast: exact,
               meta.is_l3_multicast: exact },
      action: { meta.l3_drop: set,
                meta.l3_forward: set }
    }
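
A minimal sketch of the checks described in this section is shown below. The
exact drop conditions are assumptions (a packet is dropped when its TTL
reaches zero after the decrement or when it exceeds the MTU), and the
function name and the layout of the statistics keys are hypothetical.

.. code:: python

    from collections import Counter
    from typing import Tuple


    def erif_process(rif: int, pkt_type: str, ttl: int, length: int,
                     mtu: int, stats: Counter) -> Tuple[bool, int]:
        """Decrement the TTL, check the MTU, decide and count the result.

        pkt_type is one of "unicast", "broadcast" or "multicast".
        Returns (forward, new_ttl).
        """
        new_ttl = ttl - 1
        forward = new_ttl > 0 and length <= mtu
        decision = "forward" if forward else "drop"
        # Per-RIF L3 statistics, split by packet type and decision.
        stats[(rif, pkt_type, decision)] += 1
        return forward, new_ttl


    stats: Counter = Counter()
    ok = erif_process(1, "unicast", 64, 1500, 1500, stats)
    expired = erif_process(1, "unicast", 1, 100, 1500, stats)
    too_big = erif_process(1, "multicast", 64, 9000, 1500, stats)

Here only the first packet is forwarded; the second is dropped because its
TTL expires and the third because it exceeds the MTU.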