..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2020 Intel Corporation.

Driver for the Intel® Dynamic Load Balancer (DLB)
=================================================

The DPDK DLB poll mode driver supports the Intel® Dynamic Load Balancer,
hardware versions 2.0 and 2.5.

Prerequisites
-------------

Follow the DPDK :ref:`Getting Started Guide for Linux <linux_gsg>` to set up
the basic DPDK environment.

Configuration
-------------

The DLB PF PMD is a user-space PMD that uses VFIO to gain direct
device access. To use this operation mode, the PCIe PF device must be bound
to a DPDK-compatible VFIO driver, such as vfio-pci.

Eventdev API Notes
------------------

The DLB PMD provides the functions of a DPDK event device; specifically, it
supports atomic, ordered, and parallel scheduling of events from queues to
ports. However, the DLB hardware is not a perfect match to the eventdev API.
Some DLB features are abstracted by the PMD, such as directed ports.

In general, the DLB PMD is designed for ease of use and does not require a
detailed understanding of the hardware, but these details are important when
writing high-performance code. This section describes the places where the
eventdev API and DLB misalign.

Scheduling Domain Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DLB supports 32 scheduling domains.
When one is configured, it allocates load-balanced and
directed queues, ports, credits, and other hardware resources. Some
resource allocations are user-controlled -- the number of queues, for example
-- and others, like credit pools (one directed and one load-balanced pool per
scheduling domain), are not.

The DLB is a closed system eventdev, and as such the ``nb_events_limit`` device
setup argument and the per-port ``new_event_threshold`` argument apply as
defined in the eventdev header file. The limit is applied to all enqueues,
regardless of whether it will consume a directed or load-balanced credit.

Load-Balanced Queues
~~~~~~~~~~~~~~~~~~~~

A load-balanced queue can support atomic and ordered scheduling, or atomic and
unordered scheduling, but not atomic and unordered and ordered scheduling. A
queue's scheduling types are controlled by the event queue configuration.

If the user sets the ``RTE_EVENT_QUEUE_CFG_ALL_TYPES`` flag, the
``nb_atomic_order_sequences`` field determines the supported scheduling types.
With a non-zero ``nb_atomic_order_sequences``, the queue is configured for
atomic and ordered scheduling. In this case, ``RTE_SCHED_TYPE_PARALLEL``
scheduling is supported by scheduling those events as ordered events. Note that
when such an event is dequeued, its sched_type will be
``RTE_SCHED_TYPE_ORDERED``. If ``nb_atomic_order_sequences`` is zero, the queue
is configured for atomic and unordered scheduling, and
``RTE_SCHED_TYPE_ORDERED`` is unsupported.

If the ``RTE_EVENT_QUEUE_CFG_ALL_TYPES`` flag is not set, schedule_type
dictates the queue's scheduling type.

The ``nb_atomic_order_sequences`` queue configuration field sets the ordered
queue's reorder buffer size. DLB has 2 groups of ordered queues, where each
group is configured to contain either 1 queue with 1024 reorder entries, or 2
queues with 512 reorder entries, and so on down to 32 queues with 32 entries.

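The sketch below shows one way to request atomic plus ordered scheduling on a
load-balanced queue using the fields described above. It assumes the device has
already been configured with ``rte_event_dev_configure()``; the device ID,
queue ID, and the value 1024 are illustrative placeholders.

.. code-block:: c

   #include <rte_eventdev.h>

   static int
   setup_all_types_queue(uint8_t dev_id, uint8_t queue_id)
   {
           /* With RTE_EVENT_QUEUE_CFG_ALL_TYPES, a non-zero
            * nb_atomic_order_sequences selects atomic + ordered scheduling
            * (parallel events are scheduled as ordered); zero would select
            * atomic + unordered scheduling instead. The value also sets the
            * ordered queue's reorder buffer size.
            */
           struct rte_event_queue_conf conf = {
                   .event_queue_cfg = RTE_EVENT_QUEUE_CFG_ALL_TYPES,
                   .nb_atomic_order_sequences = 1024,
                   .nb_atomic_flows = 1024, /* ignored by the DLB PMD */
                   .priority = RTE_EVENT_DEV_PRIORITY_NORMAL,
           };

           return rte_event_queue_setup(dev_id, queue_id, &conf);
   }
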
When a load-balanced queue is created, the PMD will configure a new sequence
number group on-demand if ``num_sequence_numbers`` does not match a
pre-existing group with available reorder buffer entries. If all sequence
number groups are in use, no new group will be created and queue configuration
will fail. (Note that when the PMD is used with a virtual DLB device, it cannot
change the sequence number configuration.)

The queue's ``nb_atomic_flows`` parameter is ignored by the DLB PMD, because
the DLB does not limit the number of flows a queue can track. In the DLB, all
load-balanced queues can use the full 16-bit flow ID range.

Load-balanced and Directed Ports
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DLB ports come in two flavors: load-balanced and directed. The eventdev API
does not have the same concept, but it has a similar one: ports and queues that
are singly-linked (i.e. linked to a single queue or port, respectively).

The ``rte_event_dev_info_get()`` function reports the number of available
event ports and queues (among other things). For the DLB PMD,
``max_event_ports`` and ``max_event_queues`` report the number of available
load-balanced ports and queues, and ``max_single_link_event_port_queue_pairs``
reports the number of available directed ports and queues.

When a scheduling domain is created in ``rte_event_dev_configure()``, the user
specifies ``nb_event_ports`` and ``nb_single_link_event_port_queues``, which
control the total number of ports (load-balanced and directed) and the number
of directed ports. Hence, the number of requested load-balanced ports is
``nb_event_ports - nb_single_link_event_port_queues``. The ``nb_event_queues``
field specifies the total number of queues (load-balanced and directed). The
number of directed queues comes from ``nb_single_link_event_port_queues``,
since directed ports and queues come in pairs.

When a port is set up, the ``RTE_EVENT_PORT_CFG_SINGLE_LINK`` flag determines
whether it should be configured as a directed (the flag is set) or a
load-balanced (the flag is unset) port. Similarly, the
``RTE_EVENT_QUEUE_CFG_SINGLE_LINK`` queue configuration flag controls
whether it is a directed or load-balanced queue.

Load-balanced ports can only be linked to load-balanced queues, and directed
ports can only be linked to directed queues. Furthermore, directed ports can
only be linked to a single directed queue (and vice versa), and that link
cannot change after the eventdev is started.

The eventdev API does not have a directed scheduling type. To support directed
traffic, the DLB PMD detects when an event is being sent to a directed queue
and overrides its scheduling type. Note that the originally selected scheduling
type (atomic, ordered, or parallel) is not preserved, and an event's sched_type
will be set to ``RTE_SCHED_TYPE_ATOMIC`` when it is dequeued from a directed
port.

Finally, even though all three event types are supported on the same QID by
converting unordered events to ordered, such use should be avoided where
possible, since mixing types on the same queue consumes valuable reorder
resources and imposes ordering on events that do not require it.

Flow ID
~~~~~~~

The flow ID field is preserved in the event when it is scheduled in the
DLB.

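As a minimal sketch of the single-link configuration described above, the
following sets up one directed port/queue pair and links them. The device is
assumed to be configured (but not yet started) with at least one single-link
port/queue pair available; all IDs and numeric values are placeholders.

.. code-block:: c

   #include <rte_eventdev.h>

   static int
   setup_directed_pair(uint8_t dev_id, uint8_t port_id, uint8_t queue_id)
   {
           struct rte_event_queue_conf qconf = {
                   .event_queue_cfg = RTE_EVENT_QUEUE_CFG_SINGLE_LINK,
                   .priority = RTE_EVENT_DEV_PRIORITY_NORMAL,
           };
           struct rte_event_port_conf pconf = {
                   .new_event_threshold = 2048,
                   .dequeue_depth = 16,
                   .enqueue_depth = 16,
                   .event_port_cfg = RTE_EVENT_PORT_CFG_SINGLE_LINK,
           };
           uint8_t q = queue_id;

           if (rte_event_queue_setup(dev_id, queue_id, &qconf) < 0)
                   return -1;
           if (rte_event_port_setup(dev_id, port_id, &pconf) < 0)
                   return -1;

           /* A directed port links to exactly one directed queue, and the
            * link cannot change after the eventdev is started. Events later
            * dequeued from this port carry sched_type ==
            * RTE_SCHED_TYPE_ATOMIC, and the flow ID set at enqueue time is
            * preserved.
            */
           if (rte_event_port_link(dev_id, port_id, &q, NULL, 1) != 1)
                   return -1;

           return 0;
   }
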
Hardware Credits
~~~~~~~~~~~~~~~~

DLB uses a hardware credit scheme to prevent software from overflowing hardware
event storage, with each unit of storage represented by a credit. A port spends
a credit to enqueue an event, and hardware refills the ports with credits as
the events are scheduled to ports. Refills come from credit pools.

For DLB v2.5, there is a single credit pool used for both load-balanced and
directed traffic.

For DLB v2.0, each port is a member of both a load-balanced credit pool and a
directed credit pool. The load-balanced credits are used to enqueue to
load-balanced queues, and directed credits are used for directed queues.
These pools' sizes are controlled by the ``nb_events_limit`` field in struct
rte_event_dev_config. The load-balanced pool is sized to contain
``nb_events_limit`` credits, and the directed pool is sized to contain
``nb_events_limit/2`` credits. The directed pool size can be overridden with
the num_dir_credits devargs argument, like so:

    .. code-block:: console

       --allow ea:00.0,num_dir_credits=<value>

This can be used if the default allocation is too low or too high for the
specific application needs. The PMD also supports a devarg that limits the
max_num_events reported by rte_event_dev_info_get():

    .. code-block:: console

       --allow ea:00.0,max_num_events=<value>

By default, max_num_events is reported as the total available load-balanced
credits. If multiple DLB-based applications are being used, it may be desirable
to control how many load-balanced credits each application uses, particularly
when applications are written to configure ``nb_events_limit`` equal to the
reported max_num_events.

Each port is a member of both credit pools. A port's credit allocation is
defined by its low watermark, high watermark, and refill quanta. These three
parameters are calculated by the DLB PMD like so:

- The load-balanced high watermark is set to the port's enqueue_depth.
  The directed high watermark is set to the minimum of the enqueue_depth and
  the directed pool size divided by the total number of ports.
- The refill quanta is set to half the high watermark.
- The low watermark is set to the minimum of 16 and the refill quanta.

When the eventdev is started, each port is pre-allocated a high watermark's
worth of credits. For example, if an eventdev contains four ports with enqueue
depths of 32 and a load-balanced credit pool size of 4096, each port will start
with 32 load-balanced credits, and there will be 3968 credits available to
replenish the ports. Thus, a single port is not capable of enqueueing up to the
nb_events_limit (without any events being dequeued), since the other ports are
retaining their initial credit allocation; in short, all ports must enqueue in
order to reach the limit.

If a port attempts to enqueue and has no credits available, the enqueue
operation will fail and the application must retry the enqueue. Credits are
replenished asynchronously by the DLB hardware.

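For reference, the credit pools described above are sized from the
``nb_events_limit`` field passed at device configuration time. The sketch below
is illustrative only; the device ID and all numeric values are placeholders,
and the remaining fields are simply taken from the device's advertised limits.

.. code-block:: c

   #include <rte_eventdev.h>

   static int
   configure_dlb_device(uint8_t dev_id)
   {
           struct rte_event_dev_info info;
           struct rte_event_dev_config cfg = {0};

           rte_event_dev_info_get(dev_id, &info);

           /* On DLB v2.0 this yields a 4096-credit load-balanced pool and a
            * 2048-credit directed pool (nb_events_limit / 2), unless the
            * num_dir_credits devarg overrides the latter.
            */
           cfg.nb_events_limit = 4096;
           cfg.nb_event_queues = 4;
           cfg.nb_event_ports = 4;
           cfg.nb_single_link_event_port_queues = 1;
           cfg.nb_event_queue_flows = info.max_event_queue_flows;
           cfg.nb_event_port_dequeue_depth = info.max_event_port_dequeue_depth;
           cfg.nb_event_port_enqueue_depth = info.max_event_port_enqueue_depth;

           return rte_event_dev_configure(dev_id, &cfg);
   }
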
Software Credits
~~~~~~~~~~~~~~~~

The DLB is a "closed system" event dev, and the DLB PMD layers a software
credit scheme on top of the hardware credit scheme in order to comply with
the per-port backpressure described in the eventdev API.

The DLB's hardware scheme is local to a queue/pipeline stage: a port spends a
credit when it enqueues to a queue, and credits are later replenished after the
events are dequeued and released.

In the software credit scheme, a credit is consumed when a new (``op =
RTE_EVENT_OP_NEW``) event is injected into the system, and the credit is
replenished when the event is released from the system (either explicitly with
``RTE_EVENT_OP_RELEASE`` or implicitly in ``dequeue_burst()``).

In this model, an event is "in the system" from its first enqueue into the
eventdev until it is last dequeued. If the event goes through multiple event
queues, it is still considered "in the system" while a worker thread is
processing it.

A port will fail to enqueue if the number of events in the system exceeds its
``new_event_threshold`` (specified at port setup time). A port will also fail
to enqueue if it lacks enough hardware credits to enqueue; load-balanced
credits are used to enqueue to a load-balanced queue, and directed credits are
used to enqueue to a directed queue.

The out-of-credit situations are typically transient, and an eventdev
application using the DLB ought to retry its enqueues if they fail.
If an enqueue fails, the DLB PMD sets rte_errno as follows:

- ``-ENOSPC``: Credit exhaustion (either hardware or software)
- ``-EINVAL``: Invalid argument, such as port ID, queue ID, or sched_type.

Depending on the pipeline the application has constructed, it is possible to
enter a credit deadlock scenario wherein a worker thread lacks the credit
to enqueue an event, and it must dequeue an event before it can recover the
credit. If the worker thread retries its enqueue indefinitely, it will not
make forward progress. Such a deadlock is possible if the application has event
"loops", in which an event is dequeued from queue A and later enqueued back to
queue A.

Due to this, workers should stop retrying after a time, release the events they
are attempting to enqueue, and dequeue more events. It is important that the
worker release the events and not simply set them aside to retry the enqueue
again later, because the port has a limited history list size (by default, the
same as the port's dequeue_depth).

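A sketch of the bounded retry-then-release pattern recommended above is shown
below. The retry bound and burst handling are illustrative placeholders, and a
production worker would typically interleave this with its normal dequeue loop.

.. code-block:: c

   #include <rte_eventdev.h>

   #define ENQ_RETRIES 100 /* placeholder bound, tune per application */

   static void
   enqueue_or_release(uint8_t dev_id, uint8_t port_id,
                      struct rte_event *evs, uint16_t nb)
   {
           uint16_t sent = 0;
           uint16_t i;
           unsigned int tries = 0;

           /* Retry a bounded number of times; credit exhaustion is usually
            * transient.
            */
           while (sent < nb && tries++ < ENQ_RETRIES)
                   sent += rte_event_enqueue_burst(dev_id, port_id,
                                                   &evs[sent], nb - sent);

           /* Out of retries: release the remaining events rather than
            * setting them aside, since the port has a limited history list
            * size and holding them risks a credit deadlock.
            */
           for (i = sent; i < nb; i++)
                   evs[i].op = RTE_EVENT_OP_RELEASE;
           if (sent < nb)
                   rte_event_enqueue_burst(dev_id, port_id, &evs[sent],
                                           nb - sent);
   }
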
Priority
~~~~~~~~

The DLB supports event priority and per-port queue service priority, as
described in the eventdev header file. The DLB does not support 'global' event
queue priority established at queue creation time.

DLB supports 4 event and queue service priority levels. For both priority
types, the PMD uses the upper three bits of the priority field to determine the
DLB priority, discarding the 5 least significant bits. The least significant of
the 3 remaining priority bits is then effectively ignored when binning into the
4 priority levels. The discarded 5 least significant event priority bits are
not preserved when an event is enqueued.

Note that event priority only works within the same event type.
When atomic and ordered or unordered events are enqueued to the same QID,
priority across the types is always equal, and both types are served in a
round-robin manner.

Reconfiguration
~~~~~~~~~~~~~~~

The eventdev API allows one to reconfigure a device, its ports, and its queues
by first stopping the device, calling the configuration function(s), then
restarting the device. The DLB does not support configuring an individual queue
or port without first reconfiguring the entire device, however, so there are
certain reconfiguration sequences that are valid in the eventdev API but not
supported by the PMD.

Specifically, the PMD supports the following configuration sequence:

1. Configure and start the device
2. Stop the device
3. (Optional) Reconfigure the device
4. (Optional) If step 3 is run:

   a. Setup queue(s). The reconfigured queue(s) lose their previous port links.
   b. The reconfigured port(s) lose their previous queue links.

5. (Optional, only if steps 4a and 4b are run) Link port(s) to queue(s)
6. Restart the device. If the device is reconfigured in step 3 but one or more
   of its ports or queues are not, the PMD will apply their previous
   configuration (including port->queue links) at this time.

The PMD does not support the following configuration sequence:

1. Configure and start the device
2. Stop the device
3. Setup queue or setup port
4. Start the device

This sequence is not supported because the event device must be reconfigured
before its ports or queues can be.

Atomic Inflights Allocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the last stage prior to scheduling an atomic event to a CQ, DLB holds the
inflight event in a temporary buffer that is divided among load-balanced
queues. If a queue's atomic buffer storage fills up, this can result in
head-of-line blocking. For example:

- An LDB queue is allocated N atomic buffer entries.
- All N entries are filled with events from flow X, which is pinned to CQ 0.

Until CQ 0 releases one or more events, no other atomic flows for that LDB
queue can be scheduled. The likelihood of this case depends on the eventdev
configuration, traffic behavior, event processing latency, potential for a
worker to be interrupted or otherwise delayed, etc.

By default, the PMD allocates 64 buffer entries for each load-balanced queue,
which provides an even division across all 32 queues but potentially wastes
buffer space (e.g. if not all queues are used, or aren't used for atomic
scheduling).

QID Depth Threshold
~~~~~~~~~~~~~~~~~~~

DLB supports setting and tracking queue depth thresholds. Hardware uses
the thresholds to track how full a queue is compared to its threshold.
Four buckets are used:

- Less than or equal to 50% of queue depth threshold
- Greater than 50%, but less than or equal to 75% of depth threshold
- Greater than 75%, but less than or equal to 100% of depth threshold
- Greater than 100% of depth threshold

Per-queue threshold metrics are tracked in the DLB xstats, and are also
returned in the impl_opaque field of each received event.

The per-QID threshold can be specified as part of the device args, and
can be applied to all queues, a range of queues, or a single queue, as
shown below.

    .. code-block:: console

       --allow ea:00.0,qid_depth_thresh=all:<threshold_value>
       --allow ea:00.0,qid_depth_thresh=qidA-qidB:<threshold_value>
       --allow ea:00.0,qid_depth_thresh=qid:<threshold_value>

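The per-queue depth metrics can be read through the standard eventdev xstats
API; for instance, the hypothetical helper below dumps all xstats reported for
one queue (the array sizes and queue ID are placeholders).

.. code-block:: c

   #include <inttypes.h>
   #include <stdio.h>

   #include <rte_common.h>
   #include <rte_eventdev.h>

   static void
   dump_queue_xstats(uint8_t dev_id, uint8_t queue_id)
   {
           struct rte_event_dev_xstats_name names[512];
           uint64_t ids[512];
           uint64_t values[512];
           int n, i;

           n = rte_event_dev_xstats_names_get(dev_id,
                           RTE_EVENT_DEV_XSTATS_QUEUE, queue_id,
                           names, ids, RTE_DIM(names));
           if (n <= 0 || n > (int)RTE_DIM(names))
                   return;

           n = rte_event_dev_xstats_get(dev_id, RTE_EVENT_DEV_XSTATS_QUEUE,
                                        queue_id, ids, values, n);

           for (i = 0; i < n; i++)
                   printf("%s: %" PRIu64 "\n", names[i].name, values[i]);
   }
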
Class of service
~~~~~~~~~~~~~~~~

DLB supports provisioning the DLB bandwidth into 4 classes of service.

- Class 4 corresponds to 40% of the DLB hardware bandwidth
- Class 3 corresponds to 30% of the DLB hardware bandwidth
- Class 2 corresponds to 20% of the DLB hardware bandwidth
- Class 1 corresponds to 10% of the DLB hardware bandwidth
- Class 0 corresponds to don't care

The classes are applied globally to the set of ports contained in this
scheduling domain, which is more appropriate for the bifurcated
PMD than for the PF PMD, since the PF PMD supports just 1 scheduling
domain.

Class of service can be specified in the devargs, as follows:

    .. code-block:: console

       --allow ea:00.0,cos=<0..4>

Use X86 Vector Instructions
~~~~~~~~~~~~~~~~~~~~~~~~~~~

DLB supports using x86 vector instructions to optimize the data path.

The default mode of operation is to use scalar instructions, but
the use of vector instructions can be enabled in the devargs, as
follows:

    .. code-block:: console

       --allow ea:00.0,vector_opts_enabled=<y/Y>

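The devargs described in this guide can be combined in a single ``--allow``
option by separating them with commas; the values shown below are
illustrative placeholders.

    .. code-block:: console

       --allow ea:00.0,max_num_events=8192,qid_depth_thresh=all:32,cos=1,vector_opts_enabled=y
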